Acta Scientiarum Naturalium Universitatis Pekinensis, 2025, Vol. 61, Issue (6): 1047-1056. DOI: 10.13209/j.0479-8023.2025.066

Research on Authorship Verification Methods Integrating Text Embedding and Machine Learning

WANG Xinmin1,2,†, ZHU Wenqing2, HAN Zhuoxi3, LIU Hao2   

  1. National Engineering Laboratory for Big Data Analysis and Applications, Peking University, Beijing 100871
  2. Changsha Institute for Computing and Digital Economy, Peking University, Changsha 410205
  3. Center for Global Connectivity Studies at Institute of Ocean Research, Peking University, Beijing 100871
  • Received: 2024-10-22 Revised: 2025-07-07 Online: 2025-11-20 Published: 2025-11-20
  • Contact: WANG Xinmin, E-mail: wangxinmin(at)pku.edu.cn

融合文本嵌入和机器学习的作者身份验证方法研究

王新民1,2,†, 朱文卿2, 韩卓希3, 刘豪2   

  1. 北京大学大数据分析与应用技术国家工程实验室, 北京 100871
  2. 北京大学长沙计算与数字经济研究院, 长沙 410205
  3. 北京大学海洋研究院全球互联互通研究中心, 北京 100871
  • 通讯作者: 王新民, E-mail: wangxinmin(at)pku.edu.cn

Abstract:

Combining deep learning with machine learning, this study constructs an annotated dataset for authorship verification, using BERT together with a dynamic similarity threshold strategy to improve label quality. It then proposes an authorship identification model that integrates BERT text embeddings with XGBoost-BO: BERT supplies the feature extraction, XGBoost performs efficient classification, and Bayesian optimization searches the XGBoost hyperparameters, so that authorship can be judged accurately. The study also examines how effectively the dynamic similarity threshold strategy improves the accuracy of author similarity judgments, and how much Bayesian optimization contributes to overall model performance by tuning the XGBoost hyperparameters automatically. Experimental results show that the proposed method outperforms all baseline methods across the evaluation metrics, providing an effective solution for authorship verification.
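
The abstract does not state how the dynamic similarity threshold is computed, so the following is only an illustrative sketch of the general idea: label text pairs by comparing the cosine similarity of their embeddings against a threshold derived from the observed score distribution rather than a fixed constant. The rule `mean + k * std`, the parameter `k`, and the function names are all assumptions made for this example and are not taken from the paper; the toy vectors stand in for BERT embeddings.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def dynamic_threshold(scores, k=0.5):
    """Hypothetical rule (an assumption, not the paper's): mean + k * std."""
    return mean(scores) + k * pstdev(scores)


def label_pairs(pairs, k=0.5):
    """Label each (emb_a, emb_b) pair 1 (same author) or 0 (different).

    The threshold adapts to the batch of similarity scores instead of
    being fixed in advance.
    """
    scores = [cosine_similarity(a, b) for a, b in pairs]
    threshold = dynamic_threshold(scores, k)
    labels = [1 if s >= threshold else 0 for s in scores]
    return labels, threshold


# Toy 2-dimensional vectors standing in for BERT embeddings.
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
    ([1.0, 1.0], [1.0, 0.0]),  # 45 degrees apart
]
labels, threshold = label_pairs(toy_pairs)
```

In a pipeline like the one the abstract describes, labels of this kind would feed dataset construction, and the embeddings would then serve as classifier input features. How the paper actually combines these steps is not specified here.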

Key words: authorship verification, dynamic similarity threshold, BERT, XGBoost, Bayesian optimization

摘要:

基于深度学习与机器学习相结合的方法, 构建面向作者身份验证的标注数据集, 并采用BERT和动态相似度阈值策略来提升标签质量。然后, 提出一种融合BERT文本嵌入和XGBoost-BO的作者身份识别模型, 该模型通过结合BERT强大的特征提取能力、XGBoost高效的分类性能以及贝叶斯优化的超参数搜索策略, 实现对作者身份的准确判断。同时, 探讨动态相似度阈值策略在提升作者相似度判定准确性方面的有效性, 以及贝叶斯优化在自动调整XGBoost超参数、提升模型综合性能方面的显著作用。实验结果表明, 该方法在各项指标上均优于其他对比算法, 可为作者身份验证提供新的思路和方法。

关键词: 作者身份验证, 动态相似度阈值, BERT, XGBoost, 贝叶斯优化