Acta Scientiarum Naturalium Universitatis Pekinensis

Entity Recognition Research in Online Medical Texts

SU Ya, LIU Jie, HUANG Yalou

2016, 52(1): 1-9. DOI: 10.13209/j.0479-8023.2016.020

The authors design recognition features for online medical text that take the characteristics of the medical domain into account, and carry out entity recognition experiments on a self-built data set covering five common diseases: gastritis, lung cancer, asthma, hypertension and diabetes. A Conditional Random Field (CRF) model is used for training and testing. The target entities fall into five categories: diseases, symptoms, drugs, treatment methods and checks. The experiments verify the effectiveness of the proposed features, giving an overall accuracy of 81.26% and a recall of 60.18%. A further analysis of the recognition features is then given.

Research on the Sense Guessing of Chinese Unknown Words Based on “Semantic Knowledge-base of Modern Chinese”

SHANG Fenfen, GU Yanhui, DAI Rubing, LI Bin, ZHOU Junsheng, QU Weiguang

2016, 52(1): 10-16. DOI: 10.13209/j.0479-8023.2016.009

To address sense guessing for Chinese unknown words, semantic dictionaries at different levels are derived from the “Semantic Knowledge-base of Modern Chinese”. Sense-guessing models are constructed from these dictionaries, and the models are integrated to predict the senses of unknown words, which gives better performance. Using these models, semantic prediction and annotation of the unknown words in the People’s Daily published in 2000 are evaluated. Finally, corpus resources annotated with the senses of unknown words are obtained.

An Entity Linking Approach Based on Topic-Sensitive Random Walk with Restart

LI Maolin

2016, 52(1): 17-24. DOI: 10.13209/j.0479-8023.2016.003

Entity linking is the process of linking name mentions in text with their referent entities in a knowledge base. This paper tackles the task with an approach based on topic-sensitive random walk with restart. First, the context of each mention is used to expand the mention and to search the Wikipedia knowledge base for candidate entities. Second, a graph is constructed from the intermediate results of the previous step. Finally, the topic-sensitive random walk with restart model ranks the candidate entities, and the top-ranked candidate is chosen as the linked entity. Experimental results on the KBP2014 data set show that the proposed approach achieves an F score of 0.623, higher than that of every other system compared in the paper, which indicates that the approach can improve the performance of entity linking systems.
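
As an illustration of the ranking step, the following is a minimal sketch of a random walk with restart in which the restart distribution plays the role of the topic-sensitive bias. The toy graph, its edge weights, the node names and the restart probability `alpha` are all hypothetical; the abstract does not specify how the paper builds its graph or its topic vector.

```python
def random_walk_with_restart(graph, restart, alpha=0.15, tol=1e-10, max_iter=1000):
    """Stationary scores of a walk that jumps back to `restart` with probability alpha.

    graph:   dict node -> dict neighbour -> non-negative edge weight
             (every neighbour must itself be a key of `graph`)
    restart: dict node -> non-negative preference (normalised internally)
    """
    nodes = list(graph)
    total = sum(restart.values())
    r = {n: restart.get(n, 0.0) / total for n in nodes}
    p = dict(r)  # start the walk from the restart distribution
    for _ in range(max_iter):
        nxt = {n: alpha * r[n] for n in nodes}
        for u in nodes:
            out = sum(graph[u].values())
            if out == 0:
                # Dangling node: hand its probability mass back to the restart set.
                for n in nodes:
                    nxt[n] += (1 - alpha) * p[u] * r[n]
                continue
            for v, w in graph[u].items():
                nxt[v] += (1 - alpha) * p[u] * w / out
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical toy graph: one mention, two candidate entities, one context entity.
toy_graph = {
    "mention": {"cand_a": 1.0, "cand_b": 1.0},
    "cand_a": {"mention": 1.0, "context": 2.0},
    "cand_b": {"mention": 1.0},
    "context": {"cand_a": 2.0},
}
toy_scores = random_walk_with_restart(toy_graph, {"mention": 1.0, "context": 1.0})
best_candidate = max(["cand_a", "cand_b"], key=toy_scores.get)
```

Here `cand_a` is ranked first because it is connected to the context node that also receives restart mass, which is the intuition behind biasing the walk toward the topic of the document.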

Research on the Construction of Bilingual Movie Knowledge Graph

WANG Weiwei, WANG Zhigang, PAN Liangming, LIU Yang, ZHANG Jiangtao

2016, 52(1): 25-34. DOI: 10.13209/j.0479-8023.2016.022

This paper proposes a method for constructing a Bilingual Movie Knowledge Graph (BMKG). The authors first build a Bilingual Movie Ontology (BMO) semi-automatically and align each data source with it to ensure semantic consistency across heterogeneous data sources. For entity linking, the method exploits the characteristics of the domain and computes entity similarity with both Word2Vec and TF-IDF models, which greatly improves entity linking. For entity matching, a similarity-flooding-based algorithm is proposed that uses the intrinsic links between the movie data sources to address the problem of computing similarity between cross-lingual entities. Experimental results show that entity matching precision exceeds 90% when the threshold is above 0.75. In addition, a movie knowledge graph sharing platform is built to provide open data access and a query interface.

New Word Detection Based on an Improved PMI Algorithm for Enhancing Segmentation System

DU Liping, LI Xiaoge, YU Gen, LIU Chunli, LIU Rui

2016, 52(1): 35-40. DOI: 10.13209/j.0479-8023.2016.024

This paper presents an unsupervised method for identifying new Internet words in a large-scale web corpus. The method combines an improved Point-wise Mutual Information (PMI) algorithm, PMIk, with some basic rules, and can recognize new words of length 2 to n (where n is any number as needed). In experiments on a 257 MB Baidu Tieba corpus, the precision of the proposed system reaches 97.39% when the parameter of the PMIk algorithm is set to 10, an increase of 28.79% over the PMI method. The results show that the proposed system is effective and efficient at detecting new words in a large-scale web corpus. When the discovered new words are compiled into a user dictionary and loaded into ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System), experiments on a 10 KB Baidu Tieba corpus show that precision, recall and F-measure improve by 7.93%, 3.73% and 5.91% respectively over ICTCLAS alone. This shows that new word discovery can significantly improve segmentation performance on web corpora.
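
To make the role of the parameter k concrete, here is a minimal sketch that assumes PMIk follows the commonly used PMI^k form, log(p(x,y)^k / (p(x) p(y))); the abstract does not give the paper's exact formula, and the function name and toy corpus below are hypothetical.

```python
import math
from collections import Counter


def pmi_k(corpus, k=1):
    """Score adjacent token pairs with PMI^k; k=1 is ordinary PMI.

    corpus: iterable of token lists (hypothetical pre-segmented input).
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in corpus:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        # log(p_xy**k / (p_x * p_y)), written with logs for numerical stability
        scores[(x, y)] = k * math.log(p_xy) - math.log(p_x) - math.log(p_y)
    return scores


toy_corpus = [["a", "b", "a", "b"], ["c", "d"], ["a", "b"]]
plain_pmi = pmi_k(toy_corpus, k=1)
cubed_pmi = pmi_k(toy_corpus, k=3)
```

On this toy corpus, plain PMI ranks the pair ("c", "d"), seen once, above ("a", "b"), seen three times, while k=3 reverses the order. Raising the joint probability to a power is one way to counter the tendency of PMI to favor rare pairs.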

A Star-Graph-Based Detection Method for Reflection Symmetry of Chinese Characters

LIAO Yuan, Lü Xiaoqing, SUN Jianling, TANG Zhi, WANG Yongtao

2016, 52(1): 41-48. DOI: 10.13209/j.0479-8023.2016.015

This study proposes a method for detecting bilateral symmetry in Chinese characters that combines different types of character features, such as scale invariant feature transform (SIFT) and contour information. A directed graph is constructed from the basic symmetric elements of a character to describe the enhancement relationships among the elements. The detection of the most significant axes of symmetry in a character is then transformed into the problem of finding star subgraphs with locally maximal weight. Experimental results show that the proposed method outperforms existing methods on a database of Chinese characters.

A Benchmark for Stroke Extraction of Chinese Characters

CHEN Xudong, LIAN Zhouhui, TANG Yingmin, XIAO Jianguo

2016, 52(1): 49-57. DOI: 10.13209/j.0479-8023.2016.025

This paper presents a benchmark that includes a manually constructed database and evaluation tools. Specifically, the database contains a number of images of Chinese characters in four commonly used font styles, together with the corresponding stroke images manually segmented from the character images. The performance of a given stroke extraction method can be evaluated by computing the dissimilarity between its automatic segmentation results and the ground truth using two specially designed metrics. The authors also propose a new method based on Delaunay triangulation to extract strokes from Chinese characters effectively. Experimental results from comparing three algorithms demonstrate that the benchmark works well for evaluating stroke extraction approaches and that the proposed method performs well at stroke extraction for Chinese characters.

Rule-Based Detection and Analysis of Annotation Errors in Dependency Treebank

SHI Linlin, QIU Likun, KANG Shiyong

2016, 52(1): 58-64. DOI: 10.13209/j.0479-8023.2016.005

The authors transform dependency trees into phrase structure trees and detect annotation errors automatically using manually written rules. The method is applied to the Peking University Multi-view Chinese Treebank (PMT). Although PMT had been manually checked twice before being processed by this method, 1529 errors are detected among the 50275 sentences, with a precision of 100%. The errors mainly fall into three types: word segmentation errors, mismatches between POS and syntactic role, and syntactic role errors. The method can further improve treebank quality and can be applied to other dependency treebanks.

A New Ranking Method for Chinese Discourse Tree Building

WU Yunfang, WAN Fuqiang, XU Yifeng, Lü Xueqiang

2016, 52(1): 65-74. DOI: 10.13209/j.0479-8023.2016.014

This paper proposes a novel method for sentence-level Chinese discourse tree building. The authors construct a Chinese discourse-annotated corpus in the framework of Rhetorical Structure Theory and propose a ranking-like SVM (SVM-R) model to build the tree structure automatically. The model captures the relative association strength among three consecutive text spans rather than only two adjacent spans. Experimental results show that the proposed SVM-R method significantly outperforms state-of-the-art methods in discourse parsing accuracy. It is also demonstrated that the features useful for discourse tree building are consistent with the characteristics of the Chinese language.

Integrating of Grapheme-Based and Phoneme-Based Transliteration Unit Alignment Method

LIU Bojia, XU Jin’an, CHEN Yufeng, ZHANG Yujie

2016, 52(1): 75-80. DOI: 10.13209/j.0479-8023.2016.001

To reduce the errors caused by using only the phoneme-based method or only the grapheme-based method, this paper applies statistics and rules to propose a new transliteration unit alignment method that integrates the two main transliteration methods. Four experiments are designed for comparison with the traditional methods. Experimental results show that the proposed method outperforms the other methods in machine transliteration.

Chinese Calligraphy Alignment Based on 3D Point Set Registration

LIU Yingbin, SUN Yannan, XUN Endong

2016, 52(1): 81-88. DOI: 10.13209/j.0479-8023.2016.016

This paper presents an innovative three-step method for aligning two glyph contours. First, the 2D Bézier curve control points of the glyph contours of each character are expanded into 3D space. Second, a Gaussian Mixture Model (GMM) is constructed from this 3D point set. Finally, the authors establish the alignment by minimizing the Euclidean (L2) distance between the two GMMs and then apply the corresponding transformation. Expansion into 3D space makes it possible to use inherent constraints of Chinese calligraphy beyond 2D coordinates. The advantage of the Gaussian Mixture Model is that it preserves both the overall shape and the local writing features during alignment. Experimental results verify the feasibility and effectiveness of the proposed method, which performs well for both single strokes and whole characters.

A Chinese Event Trigger Inference Approach Based on Markov Logic Networks

ZHU Shaohua, LI Peifeng, ZHU Qiaoming

2016, 52(1): 89-96. DOI: 10.13209/j.0479-8023.2016.012

Previous Chinese argument extraction approaches mainly focus on feature engineering and trigger expansion, and cannot exploit the inner relations between trigger mentions in the same document. To address this issue, the authors propose a novel trigger inference mechanism based on Markov logic networks. The head morpheme, the probabilities from the training set that a trigger mention expresses a true or a pseudo event, and the relationships between trigger mentions are used to infer trigger mentions in the test set that lack effective context information or have low confidence. Experimental results on the ACE 2005 Chinese corpus show that the proposed approach outperforms the baseline, with F1 improvements of 3.65% and 2.51% in trigger identification and event type classification respectively.

Global Inference for Co-reference Resolution between Chinese Events

TENG Jiayue, LI Peifeng, ZHU Qiaoming

2016, 52(1): 97-103. DOI: 10.13209/j.0479-8023.2016.010

Most current pairwise resolution models for event co-reference focus on classification or clustering approaches, which ignore the relations between events in a document. A global optimization model for event co-reference resolution is proposed to resolve the inconsistent event chains produced by classifier-based approaches. The model treats co-reference resolution as an integer linear programming problem and introduces various kinds of constraints, such as symmetry, transitivity, triggers, argument roles and event distances, to further improve performance. Experimental results show that the proposed model outperforms the local classifier by 4.20% in F1-measure.
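
The effect of a transitivity constraint can be sketched on a tiny example. The code below enumerates all 0/1 link assignments by brute force in place of an integer linear programming solver, and the pairwise scores are hypothetical classifier outputs (positive favors a link); neither is taken from the paper.

```python
from itertools import combinations, product


def is_transitive(links, n_events):
    """A triple is inconsistent when exactly two of its three pairs are linked."""
    for i, j, k in combinations(range(n_events), 3):
        if links[(i, j)] + links[(j, k)] + links[(i, k)] == 2:
            return False
    return True


def best_consistent_links(n_events, score):
    """Maximise the total score of chosen links subject to transitivity.

    Symmetry is built in by using one variable per unordered pair (i, j), i < j.
    Exhaustive search is only feasible for a handful of events.
    """
    pairs = list(combinations(range(n_events), 2))
    best, best_value = None, float("-inf")
    for bits in product((0, 1), repeat=len(pairs)):
        links = dict(zip(pairs, bits))
        if not is_transitive(links, n_events):
            continue
        value = sum(score[p] * links[p] for p in pairs)
        if value > best_value:
            best, best_value = links, value
    return best, best_value


# Hypothetical local scores: the classifier links (0,1) and (1,2) but rejects (0,2).
local_scores = {(0, 1): 2.0, (1, 2): 1.5, (0, 2): -0.5}
global_links, global_value = best_consistent_links(3, local_scores)
```

Taken independently, the local decisions form an inconsistent chain. Under the constraint, the best assignment links all three events, because accepting the weakly rejected pair costs less than dropping either strong link.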

Constructing a News Story Chain from Word Coverage Perspective

FU Jiabing, DONG Shoubin

2016, 52(1): 104-112. DOI: 10.13209/j.0479-8023.2016.018

Current studies focus merely on the topical similarity and document importance of a story chain, while largely ignoring its logical coherence and explainability. Considering also the algorithmic complexity brought about by the exponential growth of news data, a story chain is constructed from the word coverage perspective, using story comments to locate the turning point of each event. Topical similarity, sparsity differences and the RPCA approach are used to model the logic of the documents. Random walk and graph traversal are adopted to quantify and construct an explainable and logically coherent story chain. A double-blind experiment shows that the proposed method outperforms other algorithms.

Research on Example-Based Phrase Pairs in Statistical Machine Translation

LI Qiang, LI Mu, ZHANG Dongdong, ZHU Jingbo

2016, 52(1): 113-119. DOI: 10.13209/j.0479-8023.2016.007

Because of data sparsity and the limited size of bilingual data, many high-quality phrase pairs cannot be generated. The example-based phrase pairs proposed by the authors are generated by decomposing, substituting and generating from typical phrase pairs, which are themselves produced by the typical phrase extraction method in phrase-based statistical machine translation. On Chinese-to-English newswire and oral translation tasks, experimental results demonstrate significant improvements from the proposed methods, with a gain of about 1% in BLEU score on some test sets.

Similar Spatial Textual Objects Retrieval Strategy

GU Yanhui, WANG Daosheng, WANG Yonggen, LONG Yunfei, JIANG Suoliang, ZHOU Junsheng, QU Weiguang

2016, 52(1): 120-126. DOI: 10.13209/j.0479-8023.2016.008

To address the efficiency and effectiveness problems of traditional similar spatial textual object retrieval, a semantic-aware strategy is proposed that retrieves the top-k similar spatial textual objects effectively and efficiently. The retrieval strategy is built on a common framework of spatial object retrieval and can meet the efficiency and effectiveness requirements of users. Extensive experimental evaluation demonstrates that the proposed method outperforms the state-of-the-art approach.

A Selectional Preference Based Translation Model for SMT

TANG Haiqing, XIONG Deyi

2016, 52(1): 127-133. DOI: 10.13209/j.0479-8023.2016.013

Phrase-based statistical machine translation (SMT) uses only limited semantic knowledge, which results in low translation quality for verbs and their long-distance objects. A selectional preference based translation model is proposed, which introduces the semantic constraints that a verb imposes on its object in order to select the proper argument head word for a distant predicate. The authors learn conditional-probability-based selectional preferences for verbs from the training corpus, integrate them into a phrase-based translation system, and evaluate the system on a Chinese-to-English translation task with large-scale training data. Experimental results show that integrating selectional preferences into SMT can effectively capture long-distance semantic dependencies and improve translation quality.
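
A conditional-probability selectional preference can be sketched as a relative frequency, P(object | verb) = count(verb, object) / count(verb). This is one plausible reading of the abstract, not the paper's stated estimator; the function name and the verb-object pairs are hypothetical, and how such pairs are extracted from the corpus is left out.

```python
from collections import Counter, defaultdict


def selectional_preferences(verb_object_pairs):
    """Estimate P(object | verb) by relative frequency over (verb, object) pairs."""
    pairs = list(verb_object_pairs)
    verb_totals = Counter(verb for verb, _ in pairs)
    pair_counts = Counter(pairs)
    prefs = defaultdict(dict)
    for (verb, obj), count in pair_counts.items():
        prefs[verb][obj] = count / verb_totals[verb]
    return prefs


# Hypothetical verb-object head word pairs, e.g. taken from dependency parses.
toy_pairs = [("eat", "apple"), ("eat", "apple"), ("eat", "rice"), ("drink", "water")]
toy_prefs = selectional_preferences(toy_pairs)
```

In a translation system, a score of this kind could serve as an additional feature when choosing the head word of a distant object, which is the kind of integration the abstract describes.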

Multiple-Choice Question Answering Based on Textual Entailment

WANG Baoxin, ZHENG Dequan, WANG Xiaoxue, ZHAO Shanshan, ZHAO Tiejun

2016, 52(1): 134-140. DOI: 10.13209/j.0479-8023.2016.017

This paper proposes a method for computing textual entailment strength, targeting the phenomenon of a long text entailing a short text and taking multiple-choice questions, which have clear candidate answers, as the research object. In the absence of large-scale question-and-answer data, two methods based on the Chinese Wikipedia corpus are used to answer geography multiple-choice questions from the college entrance examination: one based on sentence similarity and the other based on the proposed textual entailment. The accuracy of the proposed method is 36.93%, which is 2.44% higher than the method based on word-embedding sentence similarity and 7.66% higher than the method based on Vector Space Model sentence similarity, confirming the effectiveness of the textual entailment based method.

A Dynamical Glyph Generation Method of Xiangxi Folk Hmong Characters and Its Implementation Approach

MO Liping, ZHOU Kaiqing

2016, 52(1): 141-147. DOI: 10.13209/j.0479-8023.2016.019

To solve the glyph generation and glyph description problems effectively, a dynamic glyph generation method for Xiangxi folk Hmong characters is proposed. In this method, the glyph generation process is described as a combination arithmetic expression: Hmong character components act as the operands, and the positional relationship between the components determines the operator. Glyphs with different structures can be generated dynamically by combining two or three components. Furthermore, if the combination arithmetic expression is converted into an ideographic description sequence (IDS), the method can be implemented with the help of the IDS interpretation mechanism of the operating system. Test results show that the Xiangxi Hmong character glyphs generated by the mapping script based on the proposed method meet practical requirements.
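
The conversion from a combination expression to an IDS can be sketched with the Unicode Ideographic Description Characters U+2FF0 to U+2FF3, which take two or three operands. The relation names, the helper `to_ids` and the placeholder components "A", "B" and "C" are hypothetical; the paper's actual operator set and component inventory are not given in the abstract.

```python
# Unicode Ideographic Description Characters and the number of operands each takes.
IDC = {
    "left_right": ("\u2ff0", 2),
    "above_below": ("\u2ff1", 2),
    "left_middle_right": ("\u2ff2", 3),
    "above_middle_below": ("\u2ff3", 3),
}


def to_ids(relation, *components):
    """Build a prefix-notation IDS: operator first, then its operands.

    Each component is either a single component character or a nested IDS string.
    """
    operator, arity = IDC[relation]
    if len(components) != arity:
        raise ValueError(f"{relation} takes {arity} components, got {len(components)}")
    return operator + "".join(components)


# "A" above a left-right pair of "B" and "C" (placeholders for real components).
nested_ids = to_ids("above_below", "A", to_ids("left_right", "B", "C"))
```

Because the operator precedes its operands, a nested expression needs no brackets, which is what lets a combination expression map directly onto a flat character sequence.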

Chinese-Slavic Mongolian Named Entity Translation Based on Word Alignment

YANG Ping, HOU Hongxu, JIANG Yupeng, SHEN Zhipeng, DU Jian

2016, 52(1): 148-154. DOI: 10.13209/j.0479-8023.2016.006

Chinese-to-Slavic Mongolian named entity translation is of great significance for cross-lingual Chinese and Slavic Mongolian information processing. However, applying machine translation directly does not achieve satisfactory results. To solve this problem, a novel approach is proposed to extract Chinese-Slavic Mongolian named entity pairs automatically. Only the Chinese named entities need to be identified; all candidate named entity pairs are then extracted with a sliding window method based on the HMM word alignment result. Finally, the candidate named entity translation units are filtered with a Maximum Entropy model that integrates five features, and the most probable Slavic Mongolian NE aligned with each Chinese NE is chosen. Experimental results show that this approach outperforms the HMM model and obtains high-quality Chinese-Slavic Mongolian named entity pairs with relatively high precision, even when the word alignment result is only partially correct.

Uyghur Text Automatic Segmentation Method Based on Inter-Word Association Degree Measuring

Turdi Tohti, Winira Musajan, Askar Hamdulla

2016, 52(1): 155-164. DOI: 10.13209/j.0479-8023.2016.023

This paper puts forward a new idea and related algorithms for Uyghur segmentation. Word-based bigrams and contextual information are derived automatically from a large-scale raw corpus. According to Uyghur word association rules, a linear combination of mutual information, difference of t-test and dual adjacent entropy is taken as a new measure of the association strength between two adjacent Uyghur words. Weakly associated inter-word positions are taken as segmentation points, yielding word strings that are complete in both semantics and structure, rather than just the words separated by spaces. Experimental results on a large-scale corpus show that the proposed algorithm achieves 88.21% segmentation accuracy.
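
The segmentation rule described above can be sketched as follows, assuming the three statistics have already been computed for each pair of adjacent words. The weights, the threshold, the function names and the placeholder words are hypothetical; the abstract gives neither the combination weights nor the cut-off.

```python
def association(mutual_info, t_test_diff, adjacent_entropy, weights=(1.0, 1.0, 1.0)):
    """Linear combination of the three association statistics (weights are assumed)."""
    w_mi, w_tt, w_ae = weights
    return w_mi * mutual_info + w_tt * t_test_diff + w_ae * adjacent_entropy


def segment(words, scores, threshold):
    """Cut the word sequence wherever the association between neighbours is weak.

    scores[i] is the association strength between words[i] and words[i + 1].
    """
    if not words:
        return []
    if len(scores) != len(words) - 1:
        raise ValueError("need exactly one score per adjacent word pair")
    units = [[words[0]]]
    for word, score in zip(words[1:], scores):
        if score < threshold:
            units.append([word])      # weak link: start a new unit
        else:
            units[-1].append(word)    # strong link: extend the current unit
    return [" ".join(unit) for unit in units]


toy_units = segment(["w1", "w2", "w3", "w4"], [2.0, 0.1, 1.5], threshold=1.0)
```

The output units can span several space-separated words, which matches the goal of recovering word strings rather than the individual words delimited by spaces.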

Personalized Model for Rating Prediction Based on Review Analysis

MA Chunping, CHEN Wenliang

2016, 52(1): 165-170. DOI: 10.13209/j.0479-8023.2016.011

Existing recommender systems do not take full advantage of personalization. To address this problem, a novel approach is proposed that mines the opinions and preferences of users to build a personalized model for each user or item. Experimental results on a real data set show that the proposed approach can improve the accuracy of rating prediction.

Exploiting Lexical Sentiment Membership-Based Features to Polarity Classification

SONG Jiaying, HUANG Xu, FU Guohong

2016, 52(1): 171-177. DOI: 10.13209/j.0479-8023.2016.004

A lexical sentiment membership based feature representation is presented for Chinese polarity classification under the framework of fuzzy set theory. TF-IDF weighted words are used to construct the positive and negative polarity memberships of each feature word, and the log-ratio of the memberships is computed. A support vector machine based polarity classifier is built with the membership log-ratios as its features. The classifier is evaluated on different data sets, including a corpus of reviews on automobile products, the NLPCC2014 data for sentiment classification evaluation and the IMDB film comments. Experimental results show that the proposed sentiment membership feature representation outperforms state-of-the-art feature representations such as Boolean features, frequency-based features and word-embedding-based features.
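
One way to picture the log-ratio feature is sketched below. It assumes that the membership of a word in each polarity class is its share of accumulated TF-IDF weight in that class, with a small smoothing constant; this definition, the function name and the numbers are assumptions for illustration, not the paper's formulation.

```python
import math


def membership_log_ratio(pos_weight, neg_weight, smoothing=1e-6):
    """Log-ratio of positive to negative membership for one feature word.

    pos_weight / neg_weight: accumulated TF-IDF weight of the word in positive /
    negative training documents (hypothetical inputs).
    """
    total = pos_weight + neg_weight + 2 * smoothing
    mu_pos = (pos_weight + smoothing) / total
    mu_neg = (neg_weight + smoothing) / total
    return math.log(mu_pos / mu_neg)


leaning_positive = membership_log_ratio(3.0, 1.0)
neutral = membership_log_ratio(1.0, 1.0)
leaning_negative = membership_log_ratio(1.0, 3.0)
```

Under this assumption the feature is positive for words weighted toward positive documents, negative for the opposite case and zero for evenly split words, which gives a classifier a signed, scale-compressed input per feature word.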

Research on the Visualization Method of Social Crowd Emotion Based on Microblog Text Data Analysis

LIU Cuijuan, LIU Zhen, CHAI Yanjie, FANG Hao, LIU Liangping

2016, 52(1): 178-186. DOI: 10.13209/j.0479-8023.2016.021

Existing sentiment analysis focuses on emotional tendency and lacks a detailed description of the different kinds of emotions, so it cannot intuitively reflect the emotional changes of social groups. An emotion analysis method based on the combination of dependency parsing and manual tagging is proposed. Facial expression animation is used to present the results of emotion analysis, and the emotions of microblog crowds in different areas toward different social events are visualized. Experimental results show that the model can closely and effectively simulate crowd emotion and provides a new way to analyze network public opinion based on big data.
