
Table of Contents

    20 January 2013, Volume 49 Issue 1
    Automatic Recognition Research on Chinese Adverb DOU’s Usages
    ZHANG Jingjie,ZAN Hongying
    2013, 49(1):  165-169. 
    The authors recognize the usages of the Chinese adverb DOU with two methods, rule-based and statistics-based, and analyze the advantages and disadvantages of each. The two methods are then combined. The accuracies of the three methods are 82%, 89.62%, and 98.54%, respectively. The experiments show that the combination of the rule-based and statistics-based methods is more effective at automatically recognizing the usages of DOU.
    Natural Annotation Research in Large-Scale Corpora with a Focus on Chinese Word Segmentation
    RAO Gaoqi,XIU Chi,XUN Endong
    2013, 49(1):  140-146. 
    The distribution and meaning of natural annotations in large datasets are discussed. The proposed research on word extraction shows the positive potential of both implicit and explicit natural annotation in word segmentation. Experiments on word extraction indicate that the implicit natural annotation derived from language laws and patterns is more powerful in splitting character strings in raw corpora.
    A Dynamic Density-Based Clustering Algorithm Appropriate to Large-Scale Text Processing
    LI Xia,JIANG Shengyi,ZHANG Qiansheng,ZHU Jing
    2013, 49(1):  133-139. 
    Because traditional density-based clustering algorithms have high time complexity and complicated parameter settings, a new density definition is proposed that needs only one parameter and can find clusters with different densities. The authors also extend the algorithm to a two-stage dynamic density-based clustering algorithm (TSDDBCA), which can process large-scale text corpus data. Experiments on a synthetic dataset, a large-scale dataset from UCI, an English text corpus, and a Chinese text corpus show that the TSDDBCA algorithm offers easy parameter setting and high clustering efficiency, and can be applied to clustering large-scale text data.
    A Parallel Training Research of Chinese Part-of-Speech Tagging CRF Model Based on MapReduce
    LIU Tao,LEI Lin,CHEN Luo,XIONG Wei
    2013, 49(1):  147-152. 
    The conditional random field (CRF) model has a major drawback: low training efficiency on large-scale data. A parallel method for training CRF models based on MapReduce is proposed to solve this problem. The method designs parallel algorithms for feature selection and parameter estimation of the CRF model to achieve a parallel iterative scaling algorithm. Experiments show that the method improves efficiency and reduces time cost significantly while guaranteeing the correctness of the training result.
    Learning to Rank Concept Annotation for Text
    TU Xinhui,HE Tingting,LI Fang,WANG Jianwen
    2013, 49(1):  153-158. 
    This paper proposes an automatic text annotation method (CRM, concept ranking model) based on a learning-to-rank model. First, the authors manually build a training set of concept annotations; then the Ranking SVM algorithm is used to generate a concept ranking model; finally, the concept ranking model is used to generate concept annotations for arbitrary texts. Experiments show that the proposed method improves significantly on traditional annotation methods across various indicators, and that its concept annotation results are closer to human annotation.
    Chinese Word Segmentation for Patent Documents
    YUE Jinyuan,XU Jin’an,ZHANG Yujie
    2013, 49(1):  159-164. 
    Based on the characteristics of patent documents, the authors present a statistical approach to Chinese word segmentation based on domain dictionaries. The NC-value algorithm and a conditional random fields (CRF) model are adopted for domain term extraction to address unknown word recognition. The experimental results show that the proposed method can improve the efficiency of word segmentation and the identification of unknown words. In an open test, the precision is 95.56%, the recall is 96.18%, and the F-measure is 95.87%.
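
The three figures reported above are mutually consistent if "F-measure" is read as the balanced F-measure (the harmonic mean of precision and recall). That reading is an assumption, since the abstract does not state the weighting. A minimal check, with `f_measure` as a hypothetical helper name:

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall.

    Assumption: the abstract's "F-measure" is the balanced (beta = 1) variant.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) as reported in the abstract.
f1 = f_measure(95.56, 96.18)
print(f"{f1:.2f}")  # prints 95.87
```
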
    Comparative News Summarization Using Co-ranking Graph Model
    HUANG Xiaojiang,WAN Xiaojun,XIAO Jianguo
    2013, 49(1):  31-38. 
    The authors propose an approach to comparative news summarization using a co-ranking graph model. The model makes use of the similarity between sentences within each topic and the comparativeness between sentences of different topics, and then calculates the salience of the sentences of both topics simultaneously using an iterative reinforcement approach. Experimental results show the effectiveness of the proposed approach.
    Unsupervised Opinion Word Disambiguation Based on Topic Distribution Similarity
    GUO Yingmei,SHI Xiaodong,CHEN Yidong,GAO Yan
    2013, 49(1):  95-101. 
    The authors present an automatic method for choosing the correct sense of a polysemous word by using topic information, distance, and mutual information of words. The only resources used in the method are an online dictionary and a web search engine. The sense of an ambiguous opinion word can be broadly described by the words in its context. Experiments show that the new approach achieves high accuracy and performs especially well on opinion words with more alternative senses.
    Unsupervised Topic and Sentiment Unification Model for Sentiment Analysis
    SUN Yan,ZHOU Xueguang,FU Wei
    2013, 49(1):  102-108. 
    Supervised and semi-supervised sentiment classification methods need labeled corpora for classifier training. To solve this problem, an unsupervised topic and sentiment unification model (UTSU model) is proposed based on the LDA model. The UTSU model imposes the constraint that all words in a sentence are generated from one sentiment and each word is generated from one topic. This constraint conforms to the way sentiment is expressed in language and does not limit the topic relations of words. The UTSU model is completely unsupervised and needs neither labeled corpora nor sentiment seed words. Sentiment classification experiments show that the UTSU model comes close to supervised classification methods and outperforms other topic and sentiment unification models. The UTSU model improves the F1 value of sentiment classification by 2% over the ASUM model and by 16% over the JST model.
    Using Event Dependency Cue Inference to Recognize Event Relation
    MA Bin,HONG Yu,YANG Xuerong,YAO Jianmin,ZHU Qiaoming
    2013, 49(1):  109-116. 
    Treating events as the basic semantic unit and drawing on their discourse structure and semantic features, the authors analyze the semantic dependency relations between events and the rules of event inference, and propose an event relation recognition method based on event dependency cues. The method detects latent semantic relations between events, that is, whether events hold a logical relation or not. Compared with the traditional method based on semantic similarity, the proposed method achieves a 5% improvement.
    Topic Partition for Automatic Summarization
    TONG Yijian,TANG Huifeng
    2013, 49(1):  39-44. 
    Current topic partition algorithms are summarized and classified. The authors improve one of the most effective topic partition algorithms, the TextSegFault (TSF) algorithm, and use TSF or its variant to partition topics according to the type of the text, in order to meet the needs of automatic summarization. Results show that the proposed method helps avoid the loss of minor topics and the topic redundancy caused by traditional approaches to automatic summarization, thus leading to a balanced summary structure.
    Research on Fast Incremental Training Algorithm for Word Alignment
    LUO Wei
    2013, 49(1):  88-94. 
    This study focuses on an incremental training algorithm for word alignment, which is the bottleneck in the construction of a translation model. Based on two unsupervised word alignment models, the author proposes an incremental training algorithm that is based on initialization and the online EM algorithm. Experiments show that the proposed method is efficient and does not hurt the quality of word alignment and translation.
    Construction of Chinese Sentence-Category Dependency Treebank
    WANG Huilan
    2013, 49(1):  25-30. 
    Aiming at machine translation applications, this paper conducts research on the construction of the Chinese Sentence-Category Dependency Treebank (CSCDT) based on the theory of the hierarchical network of concepts. The conceptual category tagset and the sentence-category relation tagset for the treebank are presented, along with an example tree from CSCDT.
    Automatic Identification of Chinese Coordination Relations
    ZHENG Lüexing,Lü Xueqiang,LIU Kun,LIN Jin
    2013, 49(1):  20-24. 
    The authors present an approach to Chinese coordination relation recognition based on conditional random fields (CRFs). Tokens are tagged with different roles according to their functions in the generation of Chinese coordination relations, and coordination relations are then recognized by CRFs. Compared with maximum spanning tree dependency parsing, the experiment shows that the recall and precision of coordination relations increase by 9.1% and 13.8%, respectively.
    Automatic Identification of Chinese Coordination Discourse Relation
    WU Yunfang,SHI Jing,WAN Fuqiang,Lü Xueqiang
    2013, 49(1):  1-6. 
    Several methods are proposed to automatically identify the coordination relation, which is the most widely distributed of the discourse relations. The authors exploit semantic similarity and structural similarity to compute sentence similarity, using lexical similarity, maximum common substring calculation, maximum length matching around the head word, and strengthening of special words. Three of these methods are integrated, and the experiment achieves promising results.
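
One of the similarity cues named above, the maximum common substring, can be illustrated with the standard dynamic-programming routine for the longest common substring. This is a minimal sketch, not the authors' implementation: the function names, the example sentences, and the normalization by the longer sentence's length are all assumptions made for illustration.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)  # common-suffix lengths for the previous row
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common suffix ending at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def substring_similarity(a: str, b: str) -> float:
    """Hypothetical score: common substring length over the longer length."""
    if not a or not b:
        return 0.0
    return len(longest_common_substring(a, b)) / max(len(a), len(b))


# Hypothetical sentence pair, not taken from the paper's data.
s1, s2 = "他喜欢打篮球", "她喜欢踢足球"
common = longest_common_substring(s1, s2)
score = substring_similarity(s1, s2)
print(common, round(score, 3))
```

A character-level comparison is used here because it needs no word segmentation; a word-level variant would pass lists of tokens instead of strings.
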
    Research of Chinese Clause Identification Based on Comma
    LI Yancui,FENG Wenhe,ZHOU Guodong,ZHU Kunhua
    2013, 49(1):  7-14. 
    Based on the task and practice of Chinese discourse analysis and on traditional studies, the authors propose the clause as the basic discourse unit and define it in terms of structure, function, form, etc. The authors analyze the relationship between the comma and the clause, and study clause identification using commas on an annotated corpus. The corpus, extracted from CTB6.0, labels whether each comma can be regarded as a clause boundary and contains a total of 2171 commas in 1348 sentences. The authors extract syntactic, lexical, and length features for the experiments, and clause identification accuracy reaches 90%. The nine features that contribute most are chosen by information gain, and they alone yield high clause identification accuracy. Finally, using only morphological features, the accuracy reaches 84.5%. Experiments show that the definition of the clause is reasonable and that identifying clauses based on the comma is feasible.
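
Feature selection by information gain, as mentioned above, ranks each feature by how much it reduces the entropy of the class label. The sketch below uses the textbook definition on a toy dataset; the feature names and values are hypothetical and are not drawn from CTB6.0 or from the paper's feature set.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data: 1 = the comma is a clause boundary, 0 = it is not.
is_boundary = [1, 1, 0, 0]
features = {
    "followed_by_verb": [1, 1, 0, 0],   # hypothetical, perfectly informative
    "left_span_is_long": [1, 0, 1, 0],  # hypothetical, uninformative
}

gains = {name: information_gain(vals, is_boundary)
         for name, vals in features.items()}
ranked = sorted(gains, key=gains.get, reverse=True)
print(ranked)
```

Keeping the top entries of `ranked` mirrors the idea of retaining only the highest-contribution features.
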
    Concept Template of Combining Attributes with Attributive Values
    CHENG Xianyi,SHI Quan,SHEN Xuehua,TIAN Yuhe
    2013, 49(1):  15-19. 
    The authors extract triads 〈concept, attribute, property value〉 based on ontology. A concept is represented as a vector that combines attributes and attributive values based on vocabulary clustering. A comparison between concept templates based on attributes and those based on attributive values is performed. The study shows that the concept template combining attributes with attributive values is superior to either single template.
    Automatic Table Boundary Detection and Performance Evaluation in Fixed-Layout Documents
    FANG Jing,GAO Liangcai,QIU Ruiheng,TANG Zhi
    2013, 49(1):  45-53. 
    The authors propose a novel table boundary detection method that uses visual separators and geometric content layout information and is effective for both Chinese and English documents. Additionally, because automatic evaluation systems for table boundary detection are lacking, the authors provide a publicly available large-scale dataset with ground truth, composed of equal numbers of Chinese and English pages, and propose performance measurements oriented toward mobile reading. Evaluation and comparison with two other open source table boundary detection projects demonstrate the effectiveness of the proposed method and the practicality of the evaluation suite.
    Optical Font Recognition of Chinese Based on the Stroke Tip Similarity
    WANG Xiao,Lü Xiaoqing,TANG Zhi
    2013, 49(1):  54-60. 
    The authors present a novel method for optical font recognition (OFR) of single Chinese characters on a large font set. This method explores specific parts of the strokes of each character, called Stroke Tips, which are regarded as a good feature for font recognition. Experiments show that the recognition rate on common font sets is superior to that of other methods. Furthermore, experiments on an extended font set confirm the scalability of the proposed method, which means that it is suitable for the OFR of single Chinese characters on a large font set.
    Dynamic Description Library for Jiaguwen Characters and the Research of the Characters Processing
    LI Qingsheng,WU Qinxia,YANG Yuxing
    2013, 49(1):  61-67. 
    The glyphs of Jiaguwen characters are highly variable, and many characters have several variant forms. With this in mind, a human-computer interactive dynamic description method for Jiaguwen characters is presented. It describes Jiaguwen characters by stroke-segment vectors and stroke elements. This method addresses two problems: the internal code space is too small to encode all Jiaguwen characters, and the glyphs of Jiaguwen characters are difficult to identify. The method also offers a good solution to the problem of editing digitized ancient Chinese characters.
    Error Detection for Statistical Machine Translation Based on Feature Comparison and Maximum Entropy Model Classifier
    DU Jinhua,WANG Sha
    2013, 49(1):  81-87. 
    The authors first introduce three typical word posterior probabilities (WPP) for error detection and classification, namely fixed position WPP, sliding window WPP, and alignment-based WPP, and analyze their impact on detection performance. Each WPP feature is then combined with three linguistic features (word, POS, and LG parsing knowledge) in a maximum entropy classifier to predict translation errors. Experimental results on Chinese-to-English NIST datasets show that the different WPP features have a significant influence on the classification error rate (CER), and that combining WPP with linguistic features significantly reduces the CER and improves the prediction capability of the classifier.
    Tibetan Syllable Rule Model and Applications
    ZHU Jie,LI Tianrui,GE Sangduoji,REN Qingnuobu,QIAO Shaojie
    2013, 49(1):  68-74. 
    The authors first introduce the unique configuration of Tibetan syllables and several restrictions on Tibetan letter combinations caused by the pronunciation features of Tibetan letters, and then focus on the study of Tibetan characters. A simplified model for modern Tibetan syllables and its corresponding rule base are established using Tibetan grammar rules. Its applications are extensively analyzed. Algorithms for automatic spelling of Tibetan syllables are proposed as well. Experiments and case studies validate the Tibetan rule base.
    Tibetan Number Identification and Translation
    SUN Meng,HUA Quecairang,LIU Kai,Lü Yajuan,LIU Qun
    2013, 49(1):  75-80. 
    The authors propose a definition of the basic component of Tibetan numbers by analyzing their inner structure and boundary information. A best-path decision is applied to identify basic components; numbers are then recognized and translated by a finite automaton model; finally, a template matching algorithm is used to process complicated numbers. The F-score of identification and translation is 98.73%, and the BLEU score of Tibetan-Chinese translation improves by 2.64%.
    Social Network Compression Based on the Importance of the Community Nodes
    LI Hongbo,ZHANG Jianpei,YANG Jing,BAI Jinbo,CHU Yan,ZHANG Lejun
    2013, 49(1):  117-125. 
    In response to the inadequacies of current graph compression methods, such as high time complexity, reliance on experience to set parameters, too many parameters to adjust, compression loss, and neglect of the community structure of the network, a social network compression method based on the importance of community nodes is proposed. The method includes a community discovery algorithm (GS) based on a greedy strategy and a social network compression algorithm (SNC). By adopting topological potential theory, the GS algorithm is capable not only of discovering communities but also of mining important nodes within them. The SNC algorithm takes communities as targets, achieves lossless compression while maintaining the connections between communities, and keeps important nodes in communities or the basic community structure if necessary. The feasibility and effectiveness of the method are verified in experiments.
    A Clustering Chunking Method Based on Manifold Geodesic Distance
    LEI Lin,XIONG Wei,JING Ning,XIAO Jianfu
    2013, 49(1):  126-132. 
    Chinese chunking is treated as a procedure of clustering words within a sentence and labeling chunk types. A grammar function space is first constructed and then embedded in a lower-dimensional space by applying ISOMAP, in order to observe the distribution of Chinese words in the embedding space. In the hierarchical clustering algorithm, which partitions words into different clusters, the manifold geodesic distance is used instead of the Euclidean distance to measure the similarity between words. The algorithm improves Chinese chunking performance while keeping algorithm complexity at an appropriate level.
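
The difference between Euclidean and manifold geodesic distance can be seen on a small example. ISOMAP-style methods commonly approximate geodesic distance by shortest paths over a nearest-neighbour graph; the sketch below does this with Floyd-Warshall on a handful of 2-D points. The points, the choice of `k`, and the function name are assumptions for illustration and do not represent the paper's grammar function space.

```python
import math


def geodesic_distances(points, k):
    """Approximate geodesic distances via shortest paths on a k-NN graph."""
    n = len(points)
    dist = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i in range(n):
        # Connect each point to its k nearest neighbours (edges are symmetric).
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            d = math.dist(points[i], points[j])
            dist[i][j] = min(dist[i][j], d)
            dist[j][i] = min(dist[j][i], d)
    # Floyd-Warshall: relax every pair through every intermediate point.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][m] + dist[m][j] < dist[i][j]:
                    dist[i][j] = dist[i][m] + dist[m][j]
    return dist


# Hypothetical points laid out along a "C"-shaped curve.
points = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3), (1, 3), (0, 3)]
geo = geodesic_distances(points, k=2)

straight = math.dist(points[0], points[7])  # Euclidean: cuts across the gap
along_curve = geo[0][7]                     # geodesic: follows the curve
print(straight, along_curve)  # prints 3.0 7.0
```

The two end points look close under Euclidean distance but are far apart along the curve, which is the property a geodesic-based clustering relies on.
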