Acta Scientiarum Naturalium Universitatis Pekinensis

Research on Identifying Weibo Bot Users Based on Random Forest Classification

刘勘1,袁蕴英1,刘萍2   

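  1. School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan 430074; 2. School of Information Management, Wuhan University, Wuhan 430072
  • Received: 2014-07-27 Online: 2015-03-20 Published: 2015-03-20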

A Weibo Bot-users Identification Model Based on Random Forest

LIU Kan1, YUAN Yunying1, LIU Ping2   

  1. School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan 430074; 2. School of Information Management, Wuhan University, Wuhan 430072
  • Received: 2014-07-27 Online: 2015-03-20 Published: 2015-03-20

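Abstract: Bot users on the Internet spread rumors and false information in large volumes, misleading public opinion and seriously degrading the online environment. This study takes bot users on Weibo as its research object. Considering their high degree of automation, strong ability to disguise themselves, and targeted release of information, it analyzes bot-user feature indicators along four dimensions: behavior pattern, post content, user relationships, and publishing platform. Eight indicators, including information entropy and content repetition rate, are used to construct a feature vector for each Weibo user, and a bot-user identification model is designed with the random forest algorithm. Finally, the model is validated on a real Sina Weibo data set; the results show that it identifies bot users with an accuracy of 96.7% and can effectively distinguish bot users from ordinary users on Weibo.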

Keywords: bot users, Weibo, random forest

Abstract: Bot-users spread rumors and fake information widely, misleading public opinion and seriously affecting the normal network environment. Taking Weibo bot-users as the main focus, and considering their high level of automation, strong ability to disguise themselves, and targeted release of information, this paper analyzes characteristic indices along four dimensions: behavior pattern, post content, user relationships, and publishing platform. Eight indices (information entropy, content repetition rate, reputation, mutual, mention ratio, comment ratio, message, and numofplatform) are used to construct a feature vector, and an identification model based on the random forest algorithm is designed to recognize bot-users. Finally, a real Sina Weibo data set is used to verify the model, which reaches an accuracy of 96.7%. The result shows that the model effectively distinguishes bot-users from ordinary users.

Key words: bot-users, Weibo, random forest
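
The abstract names information entropy and content repetition rate among the eight indices but does not define them here. The following Python sketch shows one plausible reading, assuming that entropy is taken over binned gaps between a user's consecutive posts and that repetition rate is the share of posts duplicating an earlier post. The function names, the `bin_seconds` parameter, and both definitions are illustrative assumptions, not the authors' formulas.

```python
import math
from collections import Counter


def interval_entropy(timestamps, bin_seconds=60):
    """Shannon entropy (bits) of binned gaps between consecutive posts.

    Assumption: timestamps are numbers in seconds; gaps are grouped into
    bins of `bin_seconds`. Regular, scheduled posting gives low entropy.
    """
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        return 0.0
    counts = Counter(int(gap // bin_seconds) for gap in gaps)
    total = len(gaps)
    # Sum of p * log2(1/p) over the observed gap bins.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def content_repetition_rate(posts):
    """Fraction of posts whose text exactly repeats an earlier post."""
    if not posts:
        return 0.0
    return 1.0 - len(set(posts)) / len(posts)
```

Under these assumptions, an account that posts at fixed intervals and reuses the same text would score low on entropy and high on repetition rate. The two values could then serve as two entries of the eight-element feature vector passed to a random forest classifier.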

CLC number: