Acta Scientiarum Naturalium Universitatis Pekinensis

Automatic Recognition Research on Chinese Adverb DOU’s Usages

ZHANG Jingjie,ZAN Hongying

2013, 49(1): 165-169.

The authors recognize the usages of the Chinese adverb DOU with two methods, one rule-based and one statistics-based, and analyze the advantages and disadvantages of each. The two methods are then combined. The accuracies of the three methods are 82%, 89.62% and 98.54%, respectively. The experiments show that the combination of the rule-based and statistics-based methods is more effective for automatically recognizing the usages of the adverb DOU.

Natural Annotation Research in Large-Scale Corpora with a Focus on Chinese Word Segmentation

RAO Gaoqi,XIU Chi,XUN Endong

2013, 49(1): 140-146.

The distribution and meaning of natural annotations in large datasets are discussed. The proposed research on word extraction shows the positive potential of both implicit and explicit natural annotation for word segmentation. Experiments on word extraction indicate that the implicit natural annotation derived from language laws and patterns is more powerful in splitting character strings in raw corpora.

A Dynamic Density-Based Clustering Algorithm Appropriate to Large-Scale Text Processing

LI Xia,JIANG Shengyi,ZHANG Qiansheng,ZHU Jing

2013, 49(1): 133-139.

Because traditional density-based clustering algorithms suffer from high time complexity and complicated parameter setting, a new density definition is proposed that needs only one parameter and can find clusters with different densities. The authors also extend the algorithm to a two-stage dynamic density-based clustering algorithm (TSDDBCA) that can process large-scale text corpus data. Experiments on a synthetic dataset, a large-scale dataset from UCI, an English text corpus and a Chinese text corpus show that TSDDBCA offers easy parameter setting and high clustering efficiency, and can be applied to clustering large-scale text data.

A Parallel Training Research of Chinese Part-of-Speech Tagging CRF Model Based on MapReduce

LIU Tao,LEI Lin,CHEN Luo,XIONG Wei

2013, 49(1): 147-152.

The conditional random field (CRF) model has the major drawback of low training efficiency on large-scale data. A parallel CRF training method based on MapReduce is proposed to solve this problem. The method designs parallel algorithms for feature selection and parameter estimation of the CRF model, yielding a parallel iterative scaling algorithm. Experiments show that the method improves efficiency and reduces time cost significantly while guaranteeing the correctness of the training result.
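
The map/reduce pattern behind this kind of parallel training can be illustrated with a small in-process sketch. It is not the authors' algorithm: it only shows how per-shard (word, tag) feature counts, one ingredient of iterative scaling, could be computed independently and then summed. The function names, the toy shards and the tag labels are all hypothetical.

```python
from collections import Counter
from functools import reduce

def map_shard(shard):
    """Map step: count (word, tag) features in one shard of tagged sentences."""
    counts = Counter()
    for sentence in shard:
        for word, tag in sentence:
            counts[(word, tag)] += 1
    return counts

def reduce_counts(left, right):
    """Reduce step: merge two partial count tables by summing them."""
    return left + right

# Two hypothetical shards, each a list of sentences of (word, tag) pairs.
shards = [
    [[("wo", "r"), ("ai", "v")]],
    [[("wo", "r"), ("pao", "v")]],
]

# Each shard is mapped independently, then the partial counts are reduced.
total = reduce(reduce_counts, map(map_shard, shards), Counter())
print(total[("wo", "r")])  # prints 2
```

In a real MapReduce job the map calls would run on separate workers; here they run sequentially only to keep the sketch self-contained.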

Learning to Rank Concept Annotation for Text

TU Xinhui,HE Tingting,LI Fang,WANG Jianwen

2013, 49(1): 153-158.

This paper proposes an automatic text annotation method, the concept ranking model (CRM), based on learning to rank. First, the authors manually build a training set of concept annotations; then the Ranking SVM algorithm is used to learn the concept ranking model; finally, the model is used to generate concept annotations for arbitrary texts. Experiments show that the proposed method improves significantly on traditional annotation methods across various indicators, and that its concept annotations are closer to human annotation.

Chinese Word Segmentation for Patent Documents

YUE Jinyuan,XU Jin’an,ZHANG Yujie

2013, 49(1): 159-164.

Based on the characteristics of patent documents, the authors present a statistical approach to Chinese word segmentation based on domain dictionaries. The NC-value algorithm and a conditional random field (CRF) model are adopted for domain term extraction to address unknown-word recognition. The experimental results show that the proposed method improves the efficiency of word segmentation and the identification of unknown words. In an open test, the precision is 95.56%, the recall is 96.18%, and the F-measure is 95.87%.
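
The reported figures can be cross-checked with a few lines of code. The sketch below assumes the F-measure is the balanced F1 score, the harmonic mean of precision and recall. The abstract does not state which variant was used, so this is an assumption, and the function name is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall, in percent, as reported in the abstract.
f1 = f_measure(95.56, 96.18)
print(round(f1, 2))  # prints 95.87
```

Under that assumption the three reported values are mutually consistent.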

Comparative News Summarization Using Co-ranking Graph Model

HUANG Xiaojiang,WAN Xiaojun,XIAO Jianguo

2013, 49(1): 31-38.

The authors propose an approach to comparative news summarization using a co-ranking graph model. The model makes use of the similarity between sentences within each topic and the comparativeness between sentences of different topics, and calculates the saliences of the sentences of both topics simultaneously using an iterative reinforcement approach. Experimental results show the effectiveness of the proposed approach.
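
One way to picture iterative reinforcement over two topics is the sketch below. It is a generic co-ranking loop, not the update rule from the paper: the mixing weight `alpha`, the normalization, and the toy similarity and comparativeness matrices are all assumptions made for illustration.

```python
def normalize(scores):
    """Scale scores so that they sum to 1 (unchanged if they sum to 0)."""
    total = sum(scores)
    return [s / total for s in scores] if total else scores

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def co_rank(sim_a, sim_b, comp_ab, alpha=0.5, iters=50):
    """Jointly score the sentences of topics A and B.

    sim_a, sim_b: within-topic sentence similarity matrices.
    comp_ab[i][j]: comparativeness between sentence i of A and sentence j of B.
    """
    comp_ba = [list(col) for col in zip(*comp_ab)]  # transpose
    u = normalize([1.0] * len(sim_a))
    v = normalize([1.0] * len(sim_b))
    for _ in range(iters):
        # A sentence gains salience from similar sentences in its own topic
        # and from salient, comparable sentences in the other topic.
        new_u = [alpha * s + (1 - alpha) * c
                 for s, c in zip(matvec(sim_a, u), matvec(comp_ab, v))]
        new_v = [alpha * s + (1 - alpha) * c
                 for s, c in zip(matvec(sim_b, v), matvec(comp_ba, u))]
        u, v = normalize(new_u), normalize(new_v)
    return u, v

# Hypothetical toy data: three sentences in topic A, two in topic B.
sim_a = [[0.0, 0.5, 0.1], [0.5, 0.0, 0.1], [0.1, 0.1, 0.0]]
sim_b = [[0.0, 0.4], [0.4, 0.0]]
comp_ab = [[0.9, 0.0], [0.1, 0.0], [0.0, 0.1]]
u, v = co_rank(sim_a, sim_b, comp_ab)
```

In this toy example the first sentence of each topic ends up with the highest salience, because that pair carries the strongest comparativeness link.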

Unsupervised Opinion Word Disambiguation Based on Topic Distribution Similarity

GUO Yingmei,SHI Xiaodong,CHEN Yidong,GAO Yan

2013, 49(1): 95-101.

The authors present an automatic method for choosing the correct sense of a polysemous word using topic information, distance and mutual information of words. The only resources used are an online dictionary and a web search engine. The sense of an ambiguous opinion word can be broadly described by the words in its context. Experiments show that the new approach achieves high accuracy and, in particular, maintains superior performance for opinion words with more alternative senses.

Unsupervised Topic and Sentiment Unification Model for Sentiment Analysis

SUN Yan,ZHOU Xueguang,FU Wei

2013, 49(1): 102-108.

Supervised and semi-supervised sentiment classification methods need labeled corpora for classifier training. To solve this problem, an unsupervised topic and sentiment unification model (UTSU) is proposed based on the LDA model. The UTSU model imposes the constraint that all words in a sentence are generated from one sentiment and each word is generated from one topic. This constraint conforms to the sentiment expression of language and does not limit the topic relation of words. The UTSU model is completely unsupervised and needs neither labeled corpora nor sentiment seed words. Sentiment classification experiments show that the UTSU model comes close to supervised classification methods and outperforms other topic and sentiment unification models: it improves the F1 value of sentiment classification by 2% over the ASUM model and by 16% over the JST model.

Using Event Dependency Cue Inference to Recognize Event Relation

MA Bin,HONG Yu,YANG Xuerong,YAO Jianmin,ZHU Qiaoming

2013, 49(1): 109-116.

Treating events as the basic semantic unit and drawing on their discourse structure and semantic features, the authors analyze the semantic dependency relations between events and the rules of event inference. On this basis they propose an event relation recognition method based on event dependency cues to detect latent semantic relations between events, that is, whether or not events hold a logical relation. Compared with the traditional method based on semantic similarity, the proposed method achieves a 5% improvement.

Topic Partition for Automatic Summarization

TONG Yijian,TANG Huifeng

2013, 49(1): 39-44.

Current topic partition algorithms are summarized and classified. The authors improve one of the most effective topic partition algorithms, the TextSegFault (TSF) algorithm, and use TSF or its variant to partition topics according to the type of the text in order to meet the needs of automatic summarization. Results show that the proposed method helps avoid the loss of minor topics and the topic redundancy caused by traditional approaches to automatic summarization, leading to a balanced summary structure.

Research on Fast Incremental Training Algorithm for Word Alignment

LUO Wei

2013, 49(1): 88-94.

This study focuses on incremental training for word alignment, which is the bottleneck in constructing a translation model. Based on two unsupervised word alignment models, the author proposes an incremental training algorithm built on initialization and the online EM algorithm. Experiments show that the proposed method is efficient and does not hurt the quality of word alignment or translation.

Construction of Chinese Sentence-Category Dependency Treebank

WANG Huilan

2013, 49(1): 25-30.

Aiming at machine translation applications, this paper studies the construction of a Chinese Sentence-Category Dependency Treebank (CSCDT) based on the theory of the hierarchical network of concepts. The conceptual category tagset and the sentence-category relation tagset for the treebank are presented, together with an example tree from CSCDT.

Automatic Identification of Chinese Coordination Relations

ZHENG Lüexing,Lü Xueqiang,LIU Kun,LIN Jin

2013, 49(1): 20-24.

The authors present an approach to recognizing Chinese coordination relations based on conditional random fields (CRFs). Tokens are tagged with different roles according to their functions in the generation of Chinese coordination relations, and the coordination relations are then recognized by CRFs. Compared with maximum spanning tree dependency parsing, the experiment shows that the recall and precision of coordination relations increase by 9.1% and 13.8%, respectively.

Automatic Identification of Chinese Coordination Discourse Relation

WU Yunfang,SHI Jing,WAN Fuqiang,Lü Xueqiang

2013, 49(1): 1-6.

Several methods are proposed to automatically identify the coordination relation, which is the most widely distributed of the discourse relations. The authors exploit semantic similarity and structural similarity to compute sentence similarity, using lexical similarity, maximum common substring calculation, maximum length matching around the head word, and special-word strengthening. Three of these methods are integrated, and the experiment achieves promising results.

Research of Chinese Clause Identification Based on Comma

LI Yancui,FENG Wenhe,ZHOU Guodong,ZHU Kunhua

2013, 49(1): 7-14.

In light of the task and practice of Chinese discourse analysis, and building on traditional studies, the authors propose the clause as the basic discourse unit and define it in terms of structure, function, form, etc. The authors analyze the relationship between the comma and the clause, and study comma-based clause identification on an annotated corpus. The corpus, extracted from CTB6.0, labels whether each comma can be regarded as a clause boundary, and contains a total of 2171 commas in 1348 sentences. The authors extract syntactic, lexical and length features for the experiments, and clause identification accuracy reaches 90%. The nine features that contribute most are selected by information gain, and they also yield high clause identification accuracy. Finally, using only the morphological feature, the accuracy reaches 84.5%. Experiments show that the definition of the clause is reasonable and that comma-based clause identification is feasible.
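
Feature selection by information gain can be sketched as follows. This uses the standard definition (label entropy minus the conditional entropy given the feature). The feature names and the toy comma labels are invented for illustration and are not taken from the paper or from CTB6.0.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on one feature."""
    total = len(labels)
    by_value = {}
    for value, label in zip(feature_values, labels):
        by_value.setdefault(value, []).append(label)
    conditional = sum(len(subset) / total * entropy(subset)
                      for subset in by_value.values())
    return entropy(labels) - conditional

# Hypothetical data: 1 = the comma is a clause boundary, 0 = it is not.
labels = [1, 1, 0, 0]
features = {
    "next_is_verb": ["y", "y", "n", "n"],    # separates the classes perfectly
    "long_left_span": ["y", "n", "y", "n"],  # carries no information here
}

# Rank features from most to least informative.
ranked = sorted(features,
                key=lambda name: information_gain(features[name], labels),
                reverse=True)
print(ranked)  # prints ['next_is_verb', 'long_left_span']
```

Keeping the top-ranked features, nine in the abstract's experiment, is then a matter of slicing the ranked list.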

Concept Template of Combining Attributes with Attributive Values

CHENG Xianyi,SHI Quan,SHEN Xuehua,TIAN Yuhe

2013, 49(1): 15-19.

The authors extract the triad 〈concept, attribute, property value〉 based on an ontology. A concept is represented as a vector that combines attributes and attributive values based on vocabulary clustering. A comparison between the concept template based on attributes and the one based on attributive values is performed. The study shows that the concept template combining attributes with attributive values is superior to either single template.

Automatic Table Boundary Detection and Performance Evaluation in Fixed-Layout Documents

FANG Jing,GAO Liangcai,QIU Ruiheng,TANG Zhi

2013, 49(1): 45-53.

The authors propose a novel table boundary detection method based on visual separators and geometric content layout information, which is effective for both Chinese and English documents. Additionally, because an automatic evaluation system for table boundary detection is lacking, the authors provide a publicly available large-scale dataset with ground truth, composed of equal numbers of Chinese and English pages, and propose performance measurements oriented toward mobile reading. Evaluation and comparison with two other open-source table boundary detection projects demonstrate the effectiveness of the proposed method and the practicality of the evaluation suite.

Optical Font Recognition of Chinese Based on the Stroke Tip Similarity

WANG Xiao,Lü Xiaoqing,TANG Zhi

2013, 49(1): 54-60.

The authors present a novel method for optical font recognition (OFR) of single Chinese characters on a large font set. The method exploits specific parts of the strokes of each character, called stroke tips, which are regarded as a good feature for font recognition. Experiments show that the recognition rate on common font sets is superior to that of other methods. Furthermore, experiments on an extended font set confirm the scalability of the proposed method, which means that it is suitable for the OFR of single Chinese characters on a large font set.

Dynamic Description Library for Jiaguwen Characters and the Research of the Characters Processing

LI Qingsheng,WU Qinxia,YANG Yuxing

2013, 49(1): 61-67.

The glyphs of Jiaguwen characters are highly variable, and many characters have several variant forms. With this in mind, a human-computer interactive dynamic description method for Jiaguwen characters is presented. It describes Jiaguwen characters by stroke-segment vectors and stroke elements. The method addresses two problems: the internal code space is too small to encode all Jiaguwen characters, and the glyphs of Jiaguwen characters need to be identified. It also offers a solution to the problem of editing digitized ancient Chinese characters.

Error Detection for Statistical Machine Translation Based on Feature Comparison and Maximum Entropy Model Classifier

DU Jinhua,WANG Sha

2013, 49(1): 81-87.

The authors first introduce three typical word posterior probabilities (WPP) for error detection and classification, namely fixed-position WPP, sliding-window WPP and alignment-based WPP, and analyze their impact on detection performance. Each WPP feature is then combined with three linguistic features (word, POS and LG parsing knowledge) in a maximum entropy classifier to predict translation errors. Experimental results on Chinese-to-English NIST datasets show that the different WPP features have a significant influence on the classification error rate (CER), and that combining WPP with linguistic features significantly reduces the CER and improves the prediction capability of the classifier.

Tibetan Syllable Rule Model and Applications

ZHU Jie,LI Tianrui,GE Sangduoji,REN Qingnuobu,QIAO Shaojie

2013, 49(1): 68-74.

The authors first introduce the unique structure of the Tibetan syllable and several restrictions on Tibetan letter combination caused by the pronunciation features of Tibetan letters, and then focus on the study of Tibetan characters. A simplified model of modern Tibetan syllables and its corresponding rule base are established using Tibetan grammar rules. Its applications are extensively analyzed, and algorithms for the automatic spelling of Tibetan syllables are proposed. Experiments and case studies validate the Tibetan rule base.

Tibetan Number Identification and Translation

SUN Meng,HUA Quecairang,LIU Kai,Lü Yajuan,LIU Qun

2013, 49(1): 75-80.

The authors propose a definition of the basic component of Tibetan numbers by analyzing their inner structure and boundary information. A best-path decision is applied to determine basic components; numbers are then recognized and translated by a finite automaton model; finally, a template matching algorithm is used to process complicated numbers. The F-score of identification and translation is 98.73%, and the BLEU score of Tibetan-Chinese translation improves by 2.64%.

Social Network Compression Based on the Importance of the Community Nodes

LI Hongbo,ZHANG Jianpei,YANG Jing,BAI Jinbo,CHU Yan,ZHANG Lejun

2013, 49(1): 117-125.

In response to the inadequacies of current graph compression methods, such as high time complexity, reliance on experience to set parameters, too many parameters to adjust, compression loss, and neglect of the community structure of the network, a social network compression method based on the importance of community nodes is proposed. The method includes a community discovery algorithm based on a greedy strategy (GS) and a social network compression algorithm (SNC). By adopting topological potential theory, the GS algorithm can both discover communities and mine the important nodes within them. The SNC algorithm takes communities as targets, achieves lossless compression while maintaining the connections between communities, and keeps important nodes in communities or the basic community structure if necessary. The feasibility and effectiveness of the method are verified in experiments.

A Clustering Chunking Method Based on Manifold Geodesic Distance

LEI Lin,XIONG Wei,JING Ning,XIAO Jianfu

2013, 49(1): 126-132.

Chinese chunking is treated as a procedure of intra-sentence word clustering and chunk type labeling. A grammar function space is first constructed and then embedded in a lower-dimensional space by applying ISOMAP, in order to observe the distribution of Chinese words in the embedding space. In the hierarchical clustering algorithm, which partitions words into different clusters, the manifold geodesic distance is used instead of the Euclidean distance to measure the similarity between words. The algorithm improves the performance of Chinese chunking at an acceptable algorithmic complexity.
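
The difference between geodesic and Euclidean distance can be seen in a small sketch of the usual ISOMAP-style construction: build a neighborhood graph, then take shortest-path lengths through it. The epsilon-neighborhood rule, the Floyd-Warshall step and the toy 2-D points are assumptions for illustration. The paper's grammar function space and clustering procedure are not reproduced here.

```python
import math

def geodesic_distances(points, eps):
    """Approximate manifold geodesic distances between all pairs of points.

    Points closer than eps are joined by an edge weighted with their
    Euclidean distance; geodesic distance is the shortest path in that graph.
    """
    n = len(points)
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            if d <= eps:
                dist[i][j] = dist[j][i] = d
    # Floyd-Warshall all-pairs shortest paths.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Hypothetical points lying along an L-shaped curve.
points = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]
geo = geodesic_distances(points, eps=1.1)

print(geo[0][4])                               # prints 4.0 (along the curve)
print(round(math.dist(points[0], points[4]), 2))  # prints 2.83 (straight line)
```

The two endpoints look close in Euclidean terms but are far apart along the curve, which is the property a geodesic-based clustering relies on.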
