Acta Scientiarum Naturalium Universitatis Pekinensis, 2024, Vol. 60, Issue (1): 13-22. DOI: 10.13209/j.0479-8023.2023.070


A Low-Resource Named Entity Recognition Method for Cultural Heritage Field Incorporating Knowledge Fusion

LI Chao, HOU Xia, QIAO Xiuming   

  1. Computer School, Beijing Information Science & Technology University, Beijing 100192
  • Received:2023-05-12 Revised:2023-08-23 Online:2024-01-20 Published:2024-01-20
  • Contact: HOU Xia, E-mail: houxia(at)bistu.edu.cn

融合知识的文博领域低资源命名实体识别方法研究

李超, 侯霞, 乔秀明   

  1. 北京信息科技大学计算机学院, 北京 100192
  • 通讯作者: 侯霞, E-mail: houxia(at)bistu.edu.cn
  • Supported by:
    Beijing Natural Science Foundation (4224090)

Abstract:

In the cultural heritage field, cultural relics data exhibits pronounced entity nesting and ambiguous entity boundaries, and annotated data is extremely scarce. Together, these problems lead to low named entity recognition (NER) performance in this domain. To address them, we construct FewRlicsData, a dataset for NER on cultural relics, and propose RelicsNER, a knowledge-enhanced NER method for low-resource settings. The method integrates the semantic knowledge contained in category descriptions into the cultural relics text, decodes with a span-based approach to handle nested entities, and applies boundary smoothing to alleviate the overconfidence of the span recognition model. Compared with the baseline model, the proposed method achieves a higher F1 score on FewRlicsData and performs well on NER tasks in the cultural heritage field. Experimental results on the public dataset OntoNotes 4.0 indicate that the method generalizes well. In addition, in small-scale data experiments on the OntoNotes 4.0 and MSRA datasets, the method outperforms the baseline model, which indicates that it is suitable for low-resource scenarios.

Key words: cultural heritage field, named entity recognition, knowledge fusion, attention mechanism
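
The abstract names two techniques, span-based decoding and boundary smoothing, without giving their details. The following is a minimal sketch of the general idea only, not the RelicsNER implementation. The function names, the labels, the `epsilon` and `distance` parameters, and the rule of splitting the smoothed mass equally among valid neighbouring spans are all assumptions made for illustration.

```python
from collections import defaultdict


def enumerate_spans(n_tokens, max_len):
    """List every candidate (start, end) span, end inclusive, of at most max_len tokens.

    Scoring each span independently is what lets a span-based decoder
    return nested entities, since overlapping spans are separate candidates.
    """
    return [
        (start, end)
        for start in range(n_tokens)
        for end in range(start, min(start + max_len, n_tokens))
    ]


def smooth_boundaries(gold, n_tokens, epsilon=0.1, distance=1):
    """Build soft training targets from hard gold spans (illustrative variant).

    gold: dict mapping (start, end) -> label.
    A gold span keeps 1 - epsilon of its probability mass; the remaining
    epsilon is shared equally (an assumption of this sketch) among valid
    spans whose boundaries lie within Manhattan distance `distance`.
    Returns a dict mapping (start, end) -> {label: probability}.
    """
    targets = defaultdict(lambda: defaultdict(float))
    for (start, end), label in gold.items():
        neighbours = [
            (start + ds, end + de)
            for ds in range(-distance, distance + 1)
            for de in range(-distance, distance + 1)
            if 0 < abs(ds) + abs(de) <= distance
            and 0 <= start + ds <= end + de < n_tokens
        ]
        if not neighbours:
            # Nothing to smooth over: keep the hard label.
            targets[(start, end)][label] += 1.0
            continue
        targets[(start, end)][label] += 1.0 - epsilon
        share = epsilon / len(neighbours)
        for span in neighbours:
            targets[span][label] += share
    return {span: dict(dist) for span, dist in targets.items()}


if __name__ == "__main__":
    # Hypothetical six-token sentence with one gold entity at tokens 2..3.
    soft = smooth_boundaries({(2, 3): "RELIC"}, n_tokens=6, epsilon=0.2)
    for span in sorted(soft):
        print(span, soft[span])
```

Training against such soft targets, rather than one-hot span labels, penalizes the model less for predictions that are off by one token, which is the intuition behind using smoothing to reduce overconfidence when entity boundaries are not unique.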

摘要:

文物数据的实体嵌套问题明显, 实体边界不唯一, 且文博领域已标注数据极度缺乏, 导致该领域命名实体识别性能较低。针对这些问题, 构建一个可用于文物命名实体识别的数据集FewRlicsData, 提出一种融合知识的文博领域低资源命名实体识别方法RelicsNER。该方法将类别描述信息的语义知识融入文物文本中, 使用基于跨度的方式进行解码, 用于改善实体嵌套问题, 并采用边界平滑的方式缓解跨度识别模型的过度自信问题。与基线模型相比, 该方法在FewRlicsData数据集上的F1值有所提升, 在文博领域命名实体识别任务中取得较好的性能。在公开数据集OntoNotes 4.0上的实验结果证明该方法具有较好的泛化性, 同时在数据集OntoNotes 4.0和MSRA上进行小规模数据实验, 性能均高于基线模型, 说明所提方法适用于低资源场景。 

关键词: 文博领域, 命名实体识别, 知识融合, 注意力机制