Acta Scientiarum Naturalium Universitatis Pekinensis ›› 2016, Vol. 52 ›› Issue (2): 199-209.DOI: 10.13209/j.0479-8023.2015.125

Previous Articles     Next Articles

NB-MAFIA: An N-List Based Maximal Frequent Itemset Algorithm

SHEN Gehui1, LIU Peidong1, DENG Zhihong2,3   

  1. 1. Department of Computer Science and Technology, School of Electronics Engineering and Computer Science, Peking University, Beijing 100871
    2. Department of Machine Intelligence, School of Electronics Engineering and Computer Science, Peking University, Beijing 100871
    3. Key Laboratory Machine Perception (MOE), Peking University, Beijing 100871
  • Received:2014-12-30 Online:2016-03-20 Published:2016-02-20
  • Contact: DENG Zhihong, E-mail: zhdeng(at)cis.pku.edu.cn

NB-MAFIA: 基于N-List的最长频繁项集挖掘算法

沈戈晖1, 刘沛东1, 邓志鸿2,3   

  1. 1. 北京大学信息科学技术学院计算机科学技术系, 北京 100871
    2. 北京大学信息科学技术学院智能科学系, 北京 100871
    3. 北京大学机器感知与智能教育部重点实验室, 北京 100871
  • 通讯作者: 邓志鸿, E-mail: zhdeng(at)cis.pku.edu.cn
  • 基金资助:
    国家自然科学基金(61170091)和863 计划(2015AA015403)资助

Abstract:

Abstract The authors propose an efficient algorithm, NB-MAFIA, for mining maximal frequent itemset using NList, which uses node list of prefix tree to represent itemsets. By using N-List, itemsets’ support can be efficiently computed because of the high compactness of N-List and the efficiency of the method to intersect two N-Lists. Meanwhile, the authors employ some search space pruning strategies and superset checking strategy to improve NB-MAFIA. To evaluate NB-MAFIA, the authors compare proposed algorithm with two state-of-the-art algorithms on a variety of real and synthesis datasets. Experimental results show that NB-MAFIA is efficient and outperform the baseline algorithms in most case. Especially, NB-MAFIA is more efficient on dense datasets.

Key words: data mining, frequent itemset mining, maximal itemset, N-List, algorithm

摘要:

本文在深度优先搜索的框架上, 引入基于项集前缀树节点链表的项集表示方法N-List, 提出一个高效的最长频繁项集挖掘算法NB-MAFIA。N-List的高压缩率和高效的求交集方法可以实现项集支持度的快速计算, 同时采用对搜索空间的剪枝策略和超集检测策略来提高算法效率。在多个真实和仿真数据集上, 通过实验评估了NB-MAFIA和两个经典算法。实验结果表明NB-MAFIA在多数情况下优于其他算法, 尤其在真实和稠密数据集上优势更为明显。

关键词: 数据挖掘, 频繁项集挖掘, 最长项集, N-List, 算法

CLC Number: