Acta Scientiarum Naturalium Universitatis Pekinensis, 2022, Vol. 58, Issue (1): 99-105. DOI: 10.13209/j.0479-8023.2021.099


Consistency Check for Chinese Word Segmentation via Contextual Similarity

LIU Wei, HUANG Kaiyu, YU Hao, HUANG Degen   

  1. School of Computer Science and Technology, Dalian University of Technology, Dalian 116023
  • Received: 2021-06-08; Revised: 2021-08-14; Online: 2022-01-20; Published: 2022-01-20
  • Contact: HUANG Degen, E-mail: huangdg(at)dlut.edu.cn

Chinese title: 基于语境相似度的中文分词一致性检验研究

Chinese author names: 刘伟, 黄锴宇, 余浩, 黄德根
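  • Funding: Supported by the National Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2020AAA0108004) and the National Natural Science Foundation of China (U1936109, 61672127)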

Abstract:

The authors propose a consistency check method for Chinese word segmentation based on contextual similarity. First, classification rules based on word formation, part of speech and dependency syntax are designed from morphological and syntactic features. Then, pretrained word embeddings are used to encode the semantic information of the context in which each inconsistent string occurs, and the inconsistent strings are classified by the semantic similarity between their contexts. Experimental results show that the proposed method effectively improves the accuracy of consistency checking for Chinese word segmentation. Furthermore, three mainstream Chinese word segmentation models are retrained and tested on the corpus after its inconsistencies have been corrected. The results show that the proposed method effectively improves the quality of the segmentation corpus: the F1 scores of the three models improve by 1.18%, 1.25% and 1.04%, respectively.
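The context-similarity step can be illustrated with a minimal sketch. The abstract does not say how contexts are encoded or compared, so everything here is an assumption made for illustration: the toy two-dimensional vectors in `TOY_EMBEDDINGS`, the averaging of context vectors, the cosine measure, the `0.8` threshold and the function names are all hypothetical and are not the authors' implementation. The idea shown is that when the same string is segmented differently in two sentences, similar contexts suggest a genuine inconsistency, whereas dissimilar contexts suggest that both segmentations may be legitimate.

```python
import math

# Hypothetical toy vectors standing in for pretrained word embeddings.
TOY_EMBEDDINGS = {
    "研究": [1.0, 0.0],
    "科学": [0.8, 0.2],
    "吃": [0.0, 1.0],
    "饭": [0.1, 0.9],
}
DIM = 2


def context_vector(tokens, embeddings=TOY_EMBEDDINGS, dim=DIM):
    """Average the vectors of the context tokens found in the vocabulary.

    Averaging is an assumption; the abstract does not specify the encoder.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim  # no known token: fall back to a zero vector
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def is_true_inconsistency(context_a, context_b, threshold=0.8):
    """Flag a pair of differently segmented occurrences as inconsistent
    when their contexts are similar (the threshold is an assumed value)."""
    similarity = cosine(context_vector(context_a), context_vector(context_b))
    return similarity >= threshold


if __name__ == "__main__":
    print(is_true_inconsistency(["研究", "科学"], ["科学", "研究"]))
    print(is_true_inconsistency(["研究", "科学"], ["吃", "饭"]))
```

In practice the threshold would have to be tuned on annotated data, and the rule-based step described in the abstract would run before this similarity step.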

Key words: Chinese word segmentation, consistency check, corpus construction, contextual similarity

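Abstract (translated from the Chinese):

A consistency check method for Chinese word segmentation based on contextual similarity is proposed. Classification rules based on word formation, part of speech and dependency syntax are first designed from lexical and syntactic features. Pretrained word embeddings are then used to encode the semantic information of the context in which each inconsistent string occurs, and the inconsistent strings are classified by the semantic similarity between contexts. Consistency checking is carried out on a manually constructed segmentation corpus of 360,000 characters, and the results show that the method effectively improves the accuracy of consistency checking for Chinese word segmentation. In addition, three mainstream Chinese word segmentation models are retrained and tested on the corpus after its inconsistencies have been corrected. The results show that the method effectively improves the quality of the segmentation corpus, with the F1 scores of the three models increasing by 1.18%, 1.25% and 1.04%, respectively.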
