Acta Scientiarum Naturalium Universitatis Pekinensis

Research on Mathematical Formula Identification in Digital Chinese Documents

LIN Xiaoyan,GAO Liangcai,TANG Zhi

2014, 50(1): 17-24.

Unlike traditional formula identification methods designed for scanned images and Latin documents, the proposed method takes the characteristics of digital Chinese documents into account and identifies both isolated and embedded formulae using a combination of machine learning techniques and heuristic rules. Text line detection strategies and word segmentation rules are designed for Chinese documents, effective features and machine learning algorithms for formula identification in Chinese documents are selected, and post-processing techniques, including text line or word merging, are proposed to overcome over-segmentation. Experimental results show that the proposed method achieves satisfactory results in identifying formulae in digital Chinese documents. Furthermore, a public Chinese document dataset is constructed to facilitate fair comparison between different formula identification methods.

Terminology Definition Discrimination Based on the Internet

WU Ruihong,Lü Xueqiang

2014, 50(1): 33-40.

The authors first propose a definition discrimination model and a method to address the problem that one term may have multiple definitions. Baidu Encyclopedia and Baidu Search results are used to construct a reference definition of the term; the reference definition and the definition patterns summarized from the corpus are then used to select the best definition from the candidate definitions. A subset of the terms and their definitions in the CNKI Concept Knowledge Library is chosen as the corpus for discrimination in the experiment. Results show an accuracy of 96.1%.

Attribute and Attribute Value Extracted from Chinese Online Encyclopedia

JIA Zhen,YANG Yufei,HE Dake,LIU Shengjiu,YIN Hongfeng

2014, 50(1): 41-47.

An unsupervised approach is proposed to extract attributes and attribute values from Chinese online encyclopedia entry articles. Attribute values are viewed as named entities, and class attributes are extracted based on frequent pattern mining and association analysis. A bootstrapping method is used to find attribute trigger words for each attribute. Attribute value extraction patterns are generated automatically from sentences that contain attribute trigger words and the named entity tags of attribute values. A hierarchical clustering algorithm is applied to obtain reliable patterns. The experimental dataset is collected from HudongBaike. The experimental results show that the method is feasible and effective.

Ontology-Based News Personalized Recommendation

RAO Junyang,JIA Aixia,FENG Yansong,ZHAO Dongyan

2014, 50(1): 1-8.

The authors concentrate on exploiting background knowledge to address semantic analysis in content-based filtering. An Ontology Based Similarity Model (OBSM) is proposed to calculate news-user similarity through collaboratively built ontological structures. To deal with the noisy nature of these coarse-grained structures, an ontology-based clustering model, called X-OBSM, is introduced into the framework; it clusters the concepts of a user profile on a coarse-grained ontology. Experimental results show that both OBSM and X-OBSM outperform the baselines by a large margin. Specifically, X-OBSM performs better than OBSM in both quality and efficiency.

A Microblog Short Text Oriented Multi-class Feature Extraction Method of Fine-Grained Sentiment Analysis

HE Feiyan,HE Yanxiang,LIU Nan,LIU Jianbo,PENG Min

2014, 50(1): 48-54.

A new method for multi-class feature extraction is presented that combines the TF-IDF method with a variance statistical formula. This extraction method, oriented toward microblog short texts, is used to determine the fine-grained sentiment type, and a fine-grained sentiment analysis process is then built on it. The method was used to participate in the NLP&CC2013 evaluation, and its effectiveness is supported by the good ranking of the submitted data.
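
The abstract does not give the exact formula, so the following is only a minimal sketch of one way TF-IDF and a variance statistic could be combined. It assumes each sentiment class is treated as one pseudo-document and that a term is scored by the variance of its TF-IDF weight across classes; the function name `class_tfidf_variance`, the smoothed IDF, and the toy data are all hypothetical.

```python
import math
from collections import Counter

def class_tfidf_variance(docs_by_class):
    """Score each term by the variance of its TF-IDF weight across classes.

    docs_by_class maps a class label to a list of tokenized posts.
    Assumption of this sketch: each class is one pseudo-document.
    """
    # Term frequency per class, normalised by the class token count.
    tf = {}
    for label, docs in docs_by_class.items():
        counts = Counter(tok for doc in docs for tok in doc)
        total = sum(counts.values())
        tf[label] = {t: c / total for t, c in counts.items()}

    n_classes = len(docs_by_class)
    vocab = {t for weights in tf.values() for t in weights}
    scores = {}
    for term in vocab:
        # Smoothed IDF over classes: terms present in every class get the lowest weight.
        df = sum(1 for weights in tf.values() if term in weights)
        idf = math.log((1 + n_classes) / (1 + df)) + 1.0
        tfidf = [tf[label].get(term, 0.0) * idf for label in tf]
        mean = sum(tfidf) / n_classes
        # Population variance across classes: high when the term is class-specific.
        scores[term] = sum((w - mean) ** 2 for w in tfidf) / n_classes
    return scores

# Hypothetical toy data: two sentiment classes with two short posts each.
toy = {
    "happy": [["great", "day"], ["great", "news"]],
    "angry": [["awful", "day"], ["awful", "news"]],
}
scores = class_tfidf_variance(toy)
top_terms = sorted(scores, key=scores.get, reverse=True)[:2]
```

Under these assumptions, terms spread evenly over all classes score zero, while class-specific terms score highest and would be kept as features.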

Cross-Language Sentiment Analysis Based on Parser

CHEN Qiang,HE Yanxiang,LIU Xule,SUN Songtao,PENG Min,LI Fei

2014, 50(1): 55-60.

Using a syntactic analysis model, each statement is divided into several combinations of words. Different weights are assigned according to the subject-predicate components of compound words and the differences in emotional color of emotional words. The authors statistically analyze the distribution of emotional statements, train a classifier on the characteristic parameters, and apply the trained classifier to the emotional classification of the test corpus. Experimental results show that the accuracy and recall of emotion classification with this method are better than those of existing discrimination methods. The method can also be used for comparative statement discrimination and negative polarity judgment.

Semi-supervised Sentiment Classification with Social Network

XUE Yunxia,LI Shoushan,WANG Zhongqing

2014, 50(1): 61-66.

Based on the social connections among comments in a social network, the authors propose a new approach to semi-supervised sentiment classification. A bipartite graph structure combining document-word and social connections is constructed and used with a label propagation algorithm. Evaluation shows that the proposed approach performs better than an approach that considers only the textual information of the comments.
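
To illustrate the general idea of label propagation on such a graph, here is a minimal sketch. It is not the authors' algorithm: it assumes an unweighted graph in which comments are linked to the words they contain and to the users who wrote them, keeps the labeled comments fixed, and repeatedly averages neighbour scores. The node names and the function `propagate_labels` are hypothetical.

```python
def propagate_labels(edges, seeds, iterations=20):
    """Propagate sentiment scores over an undirected graph.

    edges: list of (node_a, node_b) pairs, e.g. comment-word or comment-user links.
    seeds: dict node -> score in [-1, 1] for labeled comments (kept fixed).
    Returns a dict with a score for every node.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    scores = {node: seeds.get(node, 0.0) for node in neighbours}
    for _ in range(iterations):
        updated = {}
        for node, adjacent in neighbours.items():
            if node in seeds:
                updated[node] = seeds[node]  # clamp labeled nodes
            else:
                # Unlabeled nodes take the mean score of their neighbours.
                updated[node] = sum(scores[n] for n in adjacent) / len(adjacent)
        scores = updated
    return scores

# Hypothetical toy graph: c1 and c2 are labeled comments, c3 and c4 are unlabeled.
edges = [
    ("c1", "word:love"), ("c3", "word:love"),
    ("c2", "word:hate"), ("c4", "word:hate"),
    ("c1", "user:amy"), ("c3", "user:amy"),
]
seeds = {"c1": 1.0, "c2": -1.0}
scores = propagate_labels(edges, seeds)
```

In this toy graph, `c3` shares a word and an author with the positive comment and ends up with a positive score, while `c4` shares a word with the negative comment and ends up negative.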

Research on Entity Linking of Chinese Microblog

ZHU Min,JIA Zhen,ZUO Ling,WU Anjun,CHEN Fangzheng,BAI Yu

2014, 50(1): 73-78.

The authors focus on the task of entity linking for Chinese microblogs in NLP&CC2013, using Sina microblog data provided by CCF as training and test data and the Yebol Chinese segmentation system for word segmentation. An entity linking method is proposed that searches a thesaurus for entities to link to the knowledge base, using an improved pinyin edit distance and a suffix vocabulary matching method. The authors also propose a disambiguation method that combines entity clustering disambiguation with similar entity disambiguation based on Baidu Encyclopedia. In the CCF Chinese microblog entity linking task, the system ranks third among ten systems, with an accuracy of 0.8838. The result indicates that the proposed entity linking and entity disambiguation methods are effective and can cope with noise in text.
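
The "improved" pinyin edit distance is not specified in the abstract. As a baseline illustration only, the sketch below applies the standard Levenshtein distance to sequences of pinyin syllables and picks the nearest knowledge-base entry. The tiny knowledge base, its pinyin transcriptions, and the helper `closest_entity` are hypothetical.

```python
def edit_distance(source, target):
    """Standard Levenshtein distance between two sequences (strings or syllable lists)."""
    previous = list(range(len(target) + 1))
    for i, s_item in enumerate(source, start=1):
        current = [i]
        for j, t_item in enumerate(target, start=1):
            cost = 0 if s_item == t_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def closest_entity(mention_pinyin, knowledge_base):
    """Return the knowledge-base name whose pinyin is nearest to the mention."""
    return min(knowledge_base,
               key=lambda name: edit_distance(mention_pinyin, knowledge_base[name]))

# Hypothetical knowledge base mapping entity names to pinyin syllables.
knowledge_base = {
    "北京": ["bei", "jing"],
    "南京": ["nan", "jing"],
}
# A noisy mention whose second syllable is misspelled.
best = closest_entity(["bei", "jin"], knowledge_base)
```

Comparing at the syllable level rather than the character level is one design choice among several; a real system would also need a pinyin converter, which is outside this sketch.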

Multi-strategy Approach for Fine-Grained Sentiment Analysis of Chinese Microblog

OUYANG Chunping,YANG Xiaohua,LEI Longyan,XU Qiang,YU Ying,LIU Zhiming

2014, 50(1): 67-72.

Fine-grained sentiment analysis of Chinese microblogs is investigated and a multi-strategy fusion method is proposed. First, the authors apply a naive Bayes classifier to identify whether a microblog post expresses sentiment. Second, based on an emotion ontology, a method for forming 21 sentiment feature vectors for microblog posts is presented. Finally, the fine-grained sentiment of microblog posts is classified using SVM and KNN respectively. Experimental results show that multi-strategy fusion is better than a single method, and that the “NB+SVM” strategy is better than the “NB+KNN” strategy.

Recognition and Classification of Emotions in the Chinese Microblog Based on Emotional Factor

ZHANG Jing,ZHU Bo,LIANG Linlin,HOU Min,TENG Yonglin

2014, 50(1): 79-84.

Based on basic emotional words and phrases, an emotional dictionary was built. A set of emotional rules was established according to the special expressions of emotions and the functions of punctuation and emoticons in emotional analysis. The authors recognized and classified the emotions in microblogs using an algorithm based on the emotional rules and dictionary, and achieved good results in the Chinese microblog emotional analysis task at the 2nd Conference on NLP&CC, hosted by CCF in 2013. Experimental results show that the algorithm works effectively.

Automatic Understanding of Natural Language Questions for Querying Chinese Knowledge Bases

XU Kun,FENG Yansong,ZHAO Dongyan,CHEN Liwei,ZOU Lei

2014, 50(1): 85-92.

A framework for transforming natural language questions into machine-understandable structured queries is presented. The authors propose using a query semantic graph to represent the semantics of Chinese questions, and adopt predicate and entity disambiguation to match the query graph to the schema of a knowledge base. The authors collect a benchmark of 42 frequently asked questions randomly sampled from 3 categories of Baidu Knows: person, location and organization. Experimental results show that the proposed framework can effectively convert natural language questions into SPARQL queries, and lays a good foundation for the next generation of intelligent question answering systems.

Function of Semantic Features in Opinion Target Extraction and Its Polarity Identification

ZHOU Hongzhao,HOU Mingwu,YAN Pengli,ZHANG Yeqing,HOU Min,TENG Yonglin

2014, 50(1): 93-99.

The authors put forward seven types of semantic features related to opinion target extraction: evaluation-triggering words, evaluation-eliminating words, words insulated from the opinion target, forward-oriented verbs, backward-oriented verbs, verbs of psychological movement and attributive-directing verbs. Five types of semantic features related to polarity identification are also proposed: positive nouns, negative nouns, meaning-shifting nouns, measuring adjectives and semantic construction. The authors explain the twelve features in terms of their necessity and usage. The results show that applying the semantic features improves the precision of the system.

Pronoun Resolution Based on Deep Learning

XI Xuefeng,ZHOU Guodong

2014, 50(1): 100-110.

Coreference resolution is a fundamental task in natural language processing. A coreference resolution system based on a deep learning model, the deep belief net (DBN), is proposed to detect and classify the coreference relationships between anaphor and antecedent. The DBN is a classifier that combines several unsupervised learning networks, named RBMs (restricted Boltzmann machines), with a supervised learning network named BP (back-propagation). The RBM layers retain as much information as possible when feature vectors are transferred to the next layer, and the BP layer is trained to classify the features generated by the last RBM layer. Experiments are conducted on the ACE 2004 English NWIRE corpus and the ACE 2005 Chinese NWIRE corpus. The results show that increasing the number of RBM training layers and adding an abstract layer for the feature set improve the performance of the coreference resolution system.

Research of Chinese Implicit Discourse Relation Recognition

SUN Jing,LI Yancui,ZHOU Guodong,FENG Wenhe

2014, 50(1): 111-117.

The authors use a self-built Chinese Discourse Treebank, in which 80% of the relations are implicit, to recognize implicit relations. In this corpus, discourse relations are divided into three layers; the first layer has four types: causality, coordination, transition and explanation. Based on this corpus, a maximum entropy classifier is employed to identify the four types of relations using context, lexical and dependency parse features. Experimental results show that the overall accuracy is 62.15% and that coordination is identified best, with an F1 of 75.26%.

An Unsupervised Method for Chinese Speech Text Localization in Comic Images

LIU Dong,LI Luyuan,WANG Yongtao,TANG Zhi

2014, 50(1): 25-32.

To satisfy the growing need to read Chinese comic images on mobile devices, the authors propose an unsupervised Chinese speech text localization method, which differs from existing learning-based methods. The method consists of three major stages: 1) the white regions that surround the text characters (the speech balloons) are detected using the connectivity of the white region within the balloons, and the characters within each speech balloon are localized; 2) the detected characters are clustered into character strings (a row or column of characters aligned horizontally or vertically) based on character shape and the consistency of typesetting, and their font features are extracted; 3) based on the extracted font features, the remaining character strings are detected via a Bayesian classifier. The proposed method is tested on a dataset consisting of 900 comic images and achieves satisfactory results.

C-TERN: A Temporal Information Processing Algorithm of Chinese Military News Story Based on Cascade Finite State Automata

WANG Wei,ZHAO Dongyan,SU Tingting

2014, 50(1): 9-16.

The authors propose a new method, C-TERN, to recognize and normalize temporal expressions in military news stories based on cascade finite state automata. First, C-TERN recognizes the temporal expressions in a military story: it separates the temporal information drawn from general language and from military language into layers and recognizes temporal expressions layer by layer. Then, during temporal expression normalization, C-TERN infers and normalizes simple/specific times, durations, and absolute and relative temporal expressions in four steps. The method pays special attention to the correctness of rule extraction, the resolution of conflicts between rules, and the soundness of the matching method. The experimental results on multi-information show that the proposed method can effectively recognize and normalize absolute and relative temporal expressions as well as simple/specific times and durations. It can better meet the temporal information processing needs of military applications.
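
As a rough illustration of layered recognition followed by normalization, the sketch below uses two passes built on Python regular expressions (which are finite-state matchers) and normalizes matches to ISO dates. It is not C-TERN: the patterns, the small table of relative words, and the function `normalize_times` are hypothetical, and the military-language layer is omitted.

```python
import re
from datetime import date, timedelta

# Layer 1 (hypothetical rule): absolute dates such as 2013年5月3日.
ABSOLUTE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")
# Layer 2 (hypothetical rule table): relative day words and their offsets.
RELATIVE = {"昨天": -1, "今天": 0, "明天": 1}

def normalize_times(text, reference):
    """Return (expression, ISO date) pairs found in text.

    reference is the date against which relative expressions are resolved,
    for example the publication date of the news story.
    """
    results = []
    # First pass: recognize and normalize absolute dates.
    for match in ABSOLUTE.finditer(text):
        year, month, day = map(int, match.groups())
        results.append((match.group(0), date(year, month, day).isoformat()))
    # Second pass: resolve relative expressions against the reference date.
    for word, offset in RELATIVE.items():
        if word in text:
            resolved = reference + timedelta(days=offset)
            results.append((word, resolved.isoformat()))
    return results

found = normalize_times("2013年5月3日演习开始,明天结束", date(2013, 5, 3))
```

Running the passes in a fixed order is what makes the design a cascade: each layer handles one family of expressions, so rules in different layers are less likely to conflict.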

Recognition and Classification of Relation Words in the Compound Sentences Based on Tsinghua Chinese Treebank

LI Yancui,SUN Jing,ZHOU Guodong,FENG Wenhe

2014, 50(1): 118-124.

Following the Tsinghua Chinese Treebank annotation scheme, the authors extract relation words and mark their categories. Syntactic, lexical and positional features of automatic syntax trees, with and without functional markers, are then extracted to recognize and classify relation words. Experimental results show that the relation word recognition accuracy is 95.7% and the relation word classification F1 is 77.2%.

Automatically Parsing Chinese Discourse Based on Maximum Entropy

TU Mei,ZHOU Yu,ZONG Chengqing

2014, 50(1): 125-132.

The authors focus on how to segment semantic units in Chinese discourse and how to label the relations among semantic units automatically. During parsing, several sequence labeling methods are compared for discourse segmentation, and a maximum entropy-based training and decoding algorithm is proposed. Experiments are conducted on the Tsinghua Chinese Treebank, which is annotated with logical and semantic relations at the complex-sentence level. Experimental results show that the F-score of discourse segmentation reaches 89.1%. When parsing discourses with no more than 6 relations, the labeling F-score reaches 63%.

Using Inference Cues to Recognize Event Relation

MA Bin,HONG Yu,YANG Xuerong,YAO Jianmin,ZHU Qiaoming

2014, 50(1): 133-141.

The authors propose an event relation recognition method based on event inference cues, obtained by analyzing the semantic dependency relations and event argument distribution features between events and the rules of event inference. Experimental results show that the proposed method achieves a 9.57% improvement over the traditional method based on event terms and entities.

Research on Recognition Units of Large Vocabulary Speech Recognition System of Uyghur

Nurmemet Yolwas,Wushour Silamu,Reyiman Tursun

2014, 50(1): 149-152.

Uyghur is an agglutinative language, and words are not the optimal recognition units for Uyghur LVCSR systems. To address the recognition unit selection problem in Uyghur LVCSR systems, more suitable recognition units for Uyghur, such as sub-words, are designed, and recognition units combining words and sub-words are proposed. Language model and speech recognition performance is evaluated on the different recognition units. Experimental results show that the proposed recognition units outperform word units in terms of unit size and language model perplexity, and give a relative word error rate reduction of 22% over the word-based system.

An Approach for Tibetan Text Automatic Proofreading and Its System Design

ZHU Jie,LI Tianrui,LIU Shengjiu

2014, 50(1): 142-148.

Taking the checking of Tibetan syllable spelling, Sanskrit transliteration in Tibetan, connective relations and words as its objects, a framework for automatic Tibetan text proofreading and an algorithm for Tibetan connective relations are proposed. Based on the framework and algorithm, an automatic Tibetan text proofreading system is designed and implemented. The reliability and effectiveness of the algorithm and system are confirmed by corresponding experiments.

Study of Feature Weighted-Based Generation Method for Dian Strokes of Chinese Character

LI Qingsheng,XIONG Jing,WU Qinxia,YANG Yuxing

2014, 50(1): 153-160.

To address the difficulties of Chinese character design and development, a Chinese character description method based on feature point abstraction and a generation method based on the character glyph skeleton are proposed. The authors study techniques and methods concerning feature points and their weights, and the characteristic expression and weight vector of the feature points in Chinese character generation. A generation algorithm for Chinese character point (dian) strokes is introduced and verification experiments are carried out. The results demonstrate the reliability and practicality of the algorithm, which can improve the efficiency of Chinese character design and provide a solution for generating other Chinese character strokes.

Study on the Technique of Automatic Generation of Oracle Characters Based on Semantic Component

WU Qinxia,LI Qingsheng,GAO Feng

2014, 50(1): 161-166.

In view of the features of Oracle characters, such as changing shapes and variant font forms, an automatic generation method for an Oracle Bone script font library based on semantic components is proposed by analyzing the structure of Oracle Bone characters. The method is based on a dynamic description library. First, an algorithm is used to obtain the feature information of the component Oracle characters; second, semantic components are formed by reorganizing strokes; finally, the component library is generated from the feature descriptions of the components. The semantic component library is designed with affine transforms, from which each Oracle character can be generated. Experiments show that the method can effectively solve the problem of Oracle character input without a word stock, and can also help with Oracle character encoding, component statistics, and interpretation of unrecognized Oracle characters.

Document-Level Automatic Machine Translation Evaluation Based on Weighted Lexical Cohesion

GONG Zhengxian,LI Liangyou

2014, 50(1): 173-179.

Based on the lexical cohesion (LC) method, a weighted LC (WLC) method is proposed, which assigns weights to words using the PageRank algorithm run on the word graph of a document. Furthermore, a new method named PWLC is proposed, which biases the PageRank algorithm toward words with specific POS tags. The experimental results show that WLC and PWLC have higher Spearman correlation than LC in document-level evaluation. Combined with other metrics, such as BLEU and TER, both proposed metrics show better evaluation performance at the document level.
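
The abstract does not say how the word graph is built, so the sketch below makes an explicit assumption: two words are linked when they occur in the same sentence. It then runs plain power-iteration PageRank to obtain one weight per word. The helper names and the toy document are hypothetical, and the POS-biased variant (PWLC) is not shown.

```python
def cooccurrence_graph(sentences):
    """Build an undirected word graph; words sharing a sentence are linked (assumption)."""
    graph = {}
    for sentence in sentences:
        for word in sentence:
            graph.setdefault(word, set())
        for a in sentence:
            for b in sentence:
                if a != b:
                    graph[a].add(b)
    return graph

def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected graph given as word -> neighbour set."""
    n = len(graph)
    rank = {word: 1.0 / n for word in graph}
    for _ in range(iterations):
        new_rank = {}
        for word in graph:
            # Each neighbour shares its rank equally among its own neighbours.
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[word])
            new_rank[word] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

# Hypothetical toy document in which "economy" co-occurs with every other word.
document = [["economy", "growth"], ["economy", "policy"], ["economy", "market"]]
weights = pagerank(cooccurrence_graph(document))
heaviest = max(weights, key=weights.get)
```

Words that connect many other words receive larger weights, which is the property a weighted cohesion score would rely on. Note that a word with no neighbours keeps only the baseline weight in this sketch, so the weights sum to one only when every word has at least one link.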

Statistical Machine Translation Model Pruning Based on Translation Log

LIU Kai,Lü Yajuan,JIANG Wenbin,LIU Qun

2014, 50(1): 167-172.

The authors propose a novel translation rule pruning method based on translation logs, which prunes translation rules according to their hit counts in the log. Experimental results show that the proposed method requires only 1%-3% of the translation rules, with no significant difference compared with the full model.
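
A minimal sketch of hit-count pruning follows, under stated assumptions: the translation log is modelled as a list of derivations, each being the list of rules used to translate one sentence, and a rule is kept if it was used at least `min_hits` times. The data layout, the threshold, and the function `prune_rules` are hypothetical and not taken from the paper.

```python
from collections import Counter

def prune_rules(rule_table, translation_log, min_hits=1):
    """Keep only rules that were used at least min_hits times in the log.

    rule_table: dict mapping (source, target) rule -> model score.
    translation_log: list of derivations; each derivation lists the rules it used.
    """
    hits = Counter(rule for derivation in translation_log for rule in derivation)
    # Counter returns 0 for rules that never appear in the log.
    return {rule: score for rule, score in rule_table.items()
            if hits[rule] >= min_hits}

# Hypothetical rule table and log.
rule_table = {
    ("猫", "cat"): 0.9,
    ("猫", "kitty"): 0.1,
    ("狗", "dog"): 0.8,
}
translation_log = [
    [("猫", "cat"), ("狗", "dog")],
    [("猫", "cat")],
]
pruned = prune_rules(rule_table, translation_log)
```

Here the rule that never fires in the log is dropped while the two rules that were actually used are kept with their original scores.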

Japanese Time Expression Recognition and Translation

ZHAO Ziyu,XU Jin’an,ZHANG Yujie,LIU Jiangming

2014, 50(1): 180-186.

Based on a defined knowledge base, the authors present a Japanese time expression recognition method that combines a rule set strengthened by the knowledge base with a statistical model. To increase recognition accuracy, the knowledge base is progressively expanded and reconstructed according to the granular classification of time in the Timex2 standard and the characteristics of Japanese time expressions, so that the rule set is optimized and updated. In addition, a CRF model is incorporated to enhance the generalization ability of Japanese time expression recognition. The authors study the time translation accuracy of a phrase-based translation model and demonstrate the necessity of combining rules with statistical machine translation (SMT). Experimental results show that the F1 value of Japanese time expression recognition reaches 0.8987 on the open test, and that both the precision and recall of the method based on rules and a Japanese-to-Chinese parallel dictionary of time expressions are slightly higher than those of the method based on a statistical translation model.

Summarization Based on Hidden Topic Markov Model with Multi-features

LIU Jiangming,XU Jin’an,ZHANG Yujie

2014, 50(1): 187-193.

Based on the hidden topic Markov model (HTMM), the authors remove a limiting assumption of LDA (latent Dirichlet allocation) in order to exploit structural information when generating summaries, and use multiple features based on document content to improve summary quality. Furthermore, a method is proposed for extending single-document summarization to multi-document summarization without breaking document structure, in order to improve the automatic summarization system. Experimental results on the standard dataset DUC2007 show the advantage of HTMM and multiple features: compared with LDA, ROUGE values improve when HTMM is used with multi-features.

Research of Automatic News Summary Report with Topic

LU Lu,LI Juanzi,HOU Lei,ZHANG Lanshan

2014, 50(1): 194-200.

A news summary report can effectively address the problem of reading a large volume of news. News topic clustering is a good way to find potential hot spots and views. News trends and relations between topics and entities can be mined from news topics, and keywords and topic sentences can be extracted from them. All this semantic information mined from news topics presents the news from multiple angles and, with the help of given wording rules, provides good material for the news summary report. A news summary report with thorough analysis and rich charts is then created automatically.