Acta Scientiarum Naturalium Universitatis Pekinensis, 2022, Vol. 58, Issue (1): 77-82. DOI: 10.13209/j.0479-8023.2021.104


A Category Hybrid Embedding Based Approach for Power Text Hierarchical Classification

CHEN Xiaona, GAO Pengfei, LIANG Yue, MA Yinglong

  1. School of Control and Computer Engineering, North China Electric Power University, Beijing 102206
  • Received: 2021-05-31 Revised: 2021-08-14 Online: 2022-01-20 Published: 2022-01-20
  • Contact: MA Yinglong, E-mail: yinglongma(at)ncepu.edu.cn
  • Funding: Supported by the National Key Research and Development Program of China (2018YFC0831404)

Abstract:

Current power text classification methods ignore the latent semantic associations between category labels, which degrades classification performance. To address this problem, a hierarchical multi-label power text classification method is proposed. First, a multi-label power text training set is built from collected unstructured power documents using automatic information extraction and annotation techniques, and the hierarchical relationships between category labels are constructed with the help of domain knowledge. Second, a text classification model, HONLSTM-BERT, is proposed based on hybrid embeddings of category structure and label semantics; it exploits the hierarchical relationships between category labels to classify power texts in a top-down manner. Finally, experiments compare the method with several popular text classification models, and the results show that HONLSTM-BERT achieves higher classification accuracy and effectively improves the performance of automatic power text classification.

Key words: power information technology, power text classification, hierarchical text classification, category embedding
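
The top-down classification strategy mentioned in the abstract can be illustrated with a minimal sketch. This is not the HONLSTM-BERT model: the label hierarchy `HIERARCHY`, the keyword table `KEYWORDS`, the scorer `keyword_score`, the function `classify_top_down`, and the `threshold` value are all hypothetical stand-ins chosen for illustration. A toy keyword scorer takes the place of a trained neural classifier, and only the traversal logic (descend into a child label only when its parent was accepted) reflects the general top-down idea.

```python
from collections import deque
from typing import Callable, Dict, List

# Hypothetical two-level label hierarchy (parent -> children); not from the paper.
HIERARCHY: Dict[str, List[str]] = {
    "ROOT": ["generation", "transmission"],
    "generation": ["thermal", "wind"],
    "transmission": ["substation", "line"],
}

# Hypothetical keyword lists standing in for a trained per-label scorer.
KEYWORDS: Dict[str, List[str]] = {
    "generation": ["turbine", "boiler", "generator"],
    "transmission": ["transformer", "voltage", "grid"],
    "thermal": ["boiler", "coal"],
    "wind": ["turbine", "blade"],
    "substation": ["transformer", "switchgear"],
    "line": ["tower", "conductor"],
}


def keyword_score(text: str, label: str) -> float:
    """Toy scorer: fraction of the label's keywords that occur in the text."""
    words = set(text.lower().split())
    keys = KEYWORDS[label]
    return sum(k in words for k in keys) / len(keys)


def classify_top_down(
    text: str,
    hierarchy: Dict[str, List[str]],
    score: Callable[[str, str], float],
    threshold: float = 0.3,
    root: str = "ROOT",
) -> List[str]:
    """Return accepted labels, visiting the hierarchy level by level.

    A child label is scored only if its parent was accepted, so a rejected
    parent prunes its whole subtree.
    """
    accepted: List[str] = []
    frontier = deque([root])
    while frontier:
        node = frontier.popleft()
        for child in hierarchy.get(node, []):
            if score(text, child) >= threshold:
                accepted.append(child)
                frontier.append(child)
    return accepted


if __name__ == "__main__":
    sample = "the wind turbine blade cracked near the generator"
    print(classify_top_down(sample, HIERARCHY, keyword_score))
```

In this sketch a text containing only the word "coal" receives no labels, because the parent label "generation" is rejected before "thermal" is ever scored. That pruning behaviour is the main design trade-off of top-down schemes: fewer labels are evaluated per document, but an error at an upper level propagates to the levels below it.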