Acta Scientiarum Naturalium Universitatis Pekinensis, 2023, Vol. 59, Issue (1): 11-20. DOI: 10.13209/j.0479-8023.2022.070

English Books Automatic Classification According to CLC

JIANG Yanting1,2   

  1. Sichuan Hydrological and Water Resources Survey Center, Chengdu 610036; 2. CPC Party School of Jintang County, Chengdu 610400
  • Received: 2022-05-13 Revised: 2022-08-03 Online: 2023-01-20 Published: 2023-01-20
  • Contact: JIANG Yanting, E-mail: jiangyanting(at)mail.bnu.edu.cn

An Exploration of English Book Classification According to the Chinese Library Classification


Abstract:

To address the shortage of English books annotated with Chinese Library Classification (CLC) labels and the class imbalance in the available data, this paper combines text augmentation strategies from the library and information science field with strategies from the general domain: 1) classification mapping from the Library of Congress Classification (LCC) to CLC; 2) semantic enhancement based on a Chinese-English parallel thesaurus; 3) punctuation insertion or 4) conjunction insertion into the original texts. Experiments show that combining the four strategies improves model performance on the test set: accuracy and macro-F1 increase by 3.61 and 3.35 percentage points, respectively. The combined method outperforms the other text augmentation strategies. Through visualization of BERT word embeddings and computation of word information entropy, the paper infers that punctuation and conjunction insertion work because these tokens occur with a wide variety of adjacent words and serve a connective function in grammar.

Key words: pre-trained language models, Chinese Library Classification, classification mapping, Chinese thesaurus, text augmentation
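
As an illustration of strategies 3) and 4) in the abstract, the following is a minimal sketch of inserting punctuation marks or conjunctions at random positions in an English text. The abstract does not give the token lists, the insertion ratio, or the sampling scheme, so `PUNCTUATION`, `CONJUNCTIONS`, the `ratio` parameter, and the function name `insert_tokens` are all assumptions made for this example rather than the paper's settings.

```python
import random

# Hypothetical candidate lists; the abstract does not specify which tokens were used.
PUNCTUATION = [".", ",", ";", ":", "?", "!"]
CONJUNCTIONS = ["and", "or", "but", "so", "while"]


def insert_tokens(text, candidates, ratio=0.3, rng=None):
    """Return a copy of `text` with tokens from `candidates` inserted at random positions.

    The original words keep their relative order; only extra tokens are added.
    `ratio` (an assumed parameter) sets how many tokens are inserted relative
    to the number of whitespace-separated words, with a minimum of one.
    """
    rng = rng or random.Random()
    words = text.split()
    if not words:
        return text
    n_insert = max(1, int(len(words) * ratio))
    for _ in range(n_insert):
        pos = rng.randint(0, len(words))  # may insert before the first or after the last word
        words.insert(pos, rng.choice(candidates))
    return " ".join(words)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs give the same augmentation
    title = "introduction to hydrology and water resources"
    print(insert_tokens(title, PUNCTUATION, rng=rng))
    print(insert_tokens(title, CONJUNCTIONS, rng=rng))
```

Because the inserted tokens leave the original words and their order untouched, each augmented copy can reasonably keep the CLC label of its source text, which is what makes this kind of augmentation usable for oversampling minority classes.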

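Abstract (translated from the Chinese):

To address the small amount of English book data carrying CLC numbers and the class imbalance in that data, text augmentation strategies from the library and information science field (a category mapping method from the Library of Congress Classification to the Chinese Library Classification, and a semantic enhancement method based on the Chinese-English parallel Chinese Thesaurus) are combined with general-domain text augmentation strategies (inserting punctuation or conjunctions into the original English text), with the aim of improving the model's generalization ability. Experiments show that the combined strategy effectively improves model performance on the test set: accuracy and macro-F1 rise by 3.61 and 3.35 percentage points, respectively, which is better than any other single text augmentation method. Finally, through BERT word embedding visualization and word information entropy computation, the analysis finds that rich neighboring words and a grammatical linking function are the reasons why inserting punctuation or conjunctions is effective.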

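
The abstract attributes the effect of punctuation and conjunction insertion partly to the variety of words that appear next to these tokens, measured through word information entropy. The exact definition used in the paper is not stated here, so the sketch below assumes plain Shannon entropy over the distribution of a token's immediate neighbors in a whitespace-tokenized corpus. The function name `neighbor_entropy` and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def neighbor_entropy(sentences, target, side="right"):
    """Shannon entropy (in bits) of the words adjacent to `target`.

    `side` selects the right-hand or left-hand neighbor. A higher value means
    the target token is followed (or preceded) by a more varied set of words.
    Returns 0.0 when the target never occurs with a neighbor on that side.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            j = i + 1 if side == "right" else i - 1
            if 0 <= j < len(tokens):
                counts[tokens[j]] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    corpus = [
        "water and soil",
        "rivers and lakes",
        "theory of water",
        "history of water",
    ]
    print(neighbor_entropy(corpus, "and"))  # two distinct right neighbors
    print(neighbor_entropy(corpus, "of"))   # a single repeated right neighbor
```

In this toy corpus the conjunction "and" has more varied right-hand neighbors than "of", so its neighbor entropy is higher. On a real corpus, the same comparison could be run between conjunctions or punctuation marks and ordinary content words.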