Research of Chinese Clause Identification Based on Comma (基于逗号的汉语子句识别研究)

Acta Scientiarum Naturalium Universitatis Pekinensis (北京大学学报（自然科学版）)

LI Yancui (李艳翠)^1,2, FENG Wenhe (冯文贺)^3, ZHOU Guodong (周国栋)^1, ZHU Kunhua (朱坤华)^2

1. Department of Computer Science and Technology, Soochow University, Suzhou 215006; 2. School of Information Engineering, Henan Institute of Science and Technology, Xinxiang 453003; 3. School of Humanities, Henan Institute of Science and Technology, Xinxiang 453003

Received: 2012-05-30 Online: 2013-01-20 Published: 2013-01-20

Abstract

Abstract: Drawing on the tasks and practice of Chinese discourse analysis and on traditional studies, the authors propose the clause as the basic discourse unit of Chinese and define it in terms of structure, function, and form. They analyse the relationship between commas and clauses and study comma-based clause identification on an annotated corpus. First, each comma in the first 100 documents of CTB6.0 was manually labeled as a clause boundary or not, giving 2171 commas in 1348 sentences. Syntactic, lexical, and length features extracted from the annotated data yield a clause identification accuracy of 90%. Next, the nine features that contribute most are selected by information gain; using only these features also gives high clause identification accuracy. Finally, with morphological (word-level) information alone, accuracy reaches 84.5%. The experiments indicate that the proposed definition of the clause is reasonable and that comma-based clause identification is feasible both theoretically and experimentally.

Key words: comma, Chinese clause, clause identification
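The information-gain selection step mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the feature names `next_pos` and `left_len`, the helper `top_k_features`, and the four toy comma instances are hypothetical and serve only to show how discrete features might be ranked against a "comma is a clause boundary" label.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - sum_x P(X = x) * H(Y | X = x) for one discrete feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_k_features(rows, labels, k):
    """Rank feature names by information gain and return the k highest."""
    names = list(rows[0].keys())
    scored = {
        name: information_gain([row[name] for row in rows], labels)
        for name in names
    }
    return sorted(scored, key=scored.get, reverse=True)[:k]


# Toy, invented comma instances (hypothetical features, not the paper's data).
rows = [
    {"next_pos": "VV", "left_len": "long"},
    {"next_pos": "VV", "left_len": "short"},
    {"next_pos": "NN", "left_len": "long"},
    {"next_pos": "NN", "left_len": "short"},
]
# True means the comma is treated as a clause boundary.
labels = [True, True, False, False]

ranked = top_k_features(rows, labels, k=1)
```

In this toy data `next_pos` separates the two classes perfectly while `left_len` carries no information, so `next_pos` is ranked first. With real annotations, the same ranking with `k=9` would correspond to keeping the nine highest-scoring features.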

CLC number: TP391

LI Yancui, FENG Wenhe, ZHOU Guodong, ZHU Kunhua. Research of Chinese Clause Identification Based on Comma[J]. Acta Scientiarum Naturalium Universitatis Pekinensis.


Link to this article: https://xbna.pku.edu.cn/CN/

https://xbna.pku.edu.cn/CN/Y2013/V49/I1/7
