Acta Scientiarum Naturalium Universitatis Pekinensis ›› 2025, Vol. 61 ›› Issue (1): 45-52. DOI: 10.13209/j.0479-8023.2025.001

面向新闻文本的汉藏新词抽取及分析

庞仙1,2,3, 陈波3,4, 赵小兵3,4,†   

  1. 教育部语言文字应用研究所, 北京 100010
  2. 首都师范大学文学院, 北京 100089
  3. 中央民族大学国家语言资源监测与研究民族语言中心, 北京 100081
  4. 中央民族大学信息工程学院, 北京 100081
  • 收稿日期:2024-01-17 修回日期:2024-06-30 出版日期:2025-01-20 发布日期:2025-01-20
  • 通讯作者: 赵小兵, E-mail: nmzxb_cn(at)163.com
  • Supported by:
    Major Program of the National Social Science Fund of China (22&ZD035)

Extraction and Analysis of Chinese-Tibetan New Words from News Texts

PANG Xian1,2,3, CHEN Bo3,4, ZHAO Xiaobing3,4,†   

  1. Institute of Applied Linguistics, Ministry of Education of China, Beijing 100010
  2. School of Literature, Capital Normal University, Beijing 100089
  3. National Language Resources Monitoring and Research Center for Ethnic Languages, Minzu University of China, Beijing 100081
  4. Information Engineering Institute, Minzu University of China, Beijing 100081
  • Received: 2024-01-17 Revised: 2024-06-30 Online: 2025-01-20 Published: 2025-01-20
  • Contact: ZHAO Xiaobing, E-mail: nmzxb_cn(at)163.com

摘要:

提出一种有效的面向新闻文本的无监督新词抽取方法。该方法通过结合无监督的TopWORDS算法和分词工具PKUSEG, 辅助启发式词语抽取方法, 实现从汉文和藏文新闻文本中抽取年度新词, 共抽取到2022年度汉文新词606个, 藏文新词664个。该方法能够减少人工筛选工作量, 并显著地提高新词抽取的效率。与《中国语言生活状况报告2023》发布的2022年度汉文新词相比, 该方法抽取的新词在数量和语种方面优势明显。此外, 对汉文和藏文新词进行对齐, 并从新词的发展和使用状况角度开展案例分析。

关键词: 新闻文本, 汉文, 藏文, 新词抽取

Abstract:

This paper proposes an effective unsupervised method for extracting new words from news texts. The method combines the unsupervised TopWORDS algorithm with the word segmentation tool PKUSEG, supplemented by a heuristic word extraction method, to extract annual new words from Chinese and Tibetan news texts. For 2022, it extracts 606 new words in Chinese and 664 new words in Tibetan. The method reduces the workload of manual screening and significantly improves the efficiency of new word extraction. Compared with the 2022 Chinese new words published in “Language Situation in China: 2023”, the new words extracted by this method show clear advantages in both number and language coverage. In addition, the Chinese and Tibetan new words are aligned, and case studies are conducted from the perspective of how the new words have developed and are used.

Key words: news text, Chinese, Tibetan, new word extraction