Acta Scientiarum Naturalium Universitatis Pekinensis

Discovering Abnormal Data in RDF Knowledge Base

HE Binbin,ZOU Lei,ZHAO Dongyan

2015, 51(2): 195-202.

To improve the data quality of RDF knowledge bases, a solution is proposed for discovering abnormal data and repairing erroneous data in RDF graphs. First, the authors define the graph-based conditional functional dependency (GCFD), which represents attribute-value and semantic-structure dependencies of RDF data in a uniform manner. Then, an efficient framework and several novel pruning rules are proposed to discover GCFDs, and a workflow for automatically repairing erroneous data is given. Extensive experiments on several real-life RDF repositories confirm the superiority of the proposed solution.
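
The idea of flagging abnormal data through a dependency can be illustrated with a much simpler case than GCFD. The sketch below checks an ordinary conditional functional dependency over flat triples; it is not the authors' graph-based definition or discovery framework. The `find_violations` helper, the toy triples, the single-valued-predicate simplification and the choice to treat the minority value as the abnormal one are all assumptions made for this example.

```python
from collections import defaultdict


def find_violations(triples, condition, lhs, rhs):
    """Return subjects that break 'condition => (lhs value determines rhs value)'."""
    entities = defaultdict(dict)
    for s, p, o in triples:
        entities[s][p] = o  # assumption: every predicate is single-valued

    cond_pred, cond_value = condition
    # groups[lhs_value][rhs_value] -> subjects carrying that combination
    groups = defaultdict(lambda: defaultdict(list))
    for subject, attrs in entities.items():
        if attrs.get(cond_pred) == cond_value and lhs in attrs and rhs in attrs:
            groups[attrs[lhs]][attrs[rhs]].append(subject)

    violations = []
    for rhs_values in groups.values():
        if len(rhs_values) > 1:
            # assumption: the most common rhs value is taken as correct
            majority = max(rhs_values, key=lambda v: len(rhs_values[v]))
            for value, subjects in rhs_values.items():
                if value != majority:
                    violations.extend(subjects)
    return sorted(violations)


triples = [
    ("Beijing", "type", "City"), ("Beijing", "country", "China"), ("Beijing", "code", "CN"),
    ("Shanghai", "type", "City"), ("Shanghai", "country", "China"), ("Shanghai", "code", "CN"),
    ("Shenzhen", "type", "City"), ("Shenzhen", "country", "China"), ("Shenzhen", "code", "CH"),
    ("Paris", "type", "City"), ("Paris", "country", "France"), ("Paris", "code", "FR"),
]
abnormal = find_violations(triples, ("type", "City"), "country", "code")
```

Here `abnormal` contains only `Shenzhen`, whose code disagrees with the other cities of the same country.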

Research on Reversible Transformation from Re-flowable Document into Fixed-layout Document

LI Ning,LIU Yin,LIANG Qi,FENG Xue

2015, 51(2): 203-212.

To address the limitations of traditional methods for integrating re-flowable and fixed-layout documents, a reversible transformation method is proposed that records, in the target document, the additional information needed to restore the original document. The authors discuss the principle, key techniques, experimental results, advantages and future work of the reversible transformation. With UOF (Uniform Office-document Format) as the source re-flowable document format and CEBX (Common E-document of Blending XML) as the target fixed-layout document format, a reversible transformation from UOF into CEBX was implemented. Experiments show that the method yields fairly good results.

A Study on Classification of Forms with Similar Layout

WANG Simeng,GAO Liangcai,WANG Yuehan,LI Pingli,TANG Zhi

2015, 51(2): 213-219.

The authors propose a simple but effective distance-based method to identify forms with similar layouts by measuring the user filled-in data, preprinted data and dithering data. The method uses three kinds of weight components to mitigate, respectively, the impact of the randomness of user filled-in data, the consistency of similar layouts and position dithering. Experimental results show that the method achieves more than 90% classification accuracy on a series of data sets, significantly better than the state-of-the-art method.

Query Classification by Using URL-Key

LI Xuewei,Lü Xueqiang,DONG Zhian,LIU Kehui

2015, 51(2): 220-226.

For query classification, a variance-based method is proposed to identify domain URL-keys from domain URLs collected manually from aggregator sites and from the usage frequency of each URL-key in each category. The URL-keys are then filtered using machine translation, pinyin and search-result feedback. Finally, coupled with relevance feedback, the authors classify queries by selecting URL-keys as features and building URL-key vectors for an SVM multi-class classifier. Experimental results show that the proposed method uses fewer resources and achieves an F-value 7% higher than the contrast method.

A Query Weighted-Based Method for User Modeling

HU Juan,BAI Yu,CAI Dongfeng

2015, 51(2): 227-233.

A query weighted-based method is proposed for user modeling by simulating the interaction between the user and the search engine. First, the query log is divided into sessions according to the session division principle. Then, for each session, user behavior information such as query frequency, duration and the ranks of the clicked URLs is used to calculate the weight of each query. Finally, a voting method is used to generate the user model. Experimental results on the AOL query log dataset show the effectiveness of the method.
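
As a rough illustration of the first two steps, the sketch below splits one user's query log into sessions and scores each query. The abstract does not state the session division principle or the weighting formula, so the 30-minute inactivity gap, the `split_sessions` and `query_weight` helpers and the way frequency and click rank are combined are all assumptions made for this example; duration is left out for brevity.

```python
from collections import Counter

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity threshold, not from the paper


def split_sessions(records, gap=SESSION_GAP_SECONDS):
    """Split (timestamp, query, click_rank) records into sessions.

    A new session starts whenever the gap to the previous record exceeds `gap`.
    """
    sessions = []
    for record in sorted(records, key=lambda r: r[0]):
        if sessions and record[0] - sessions[-1][-1][0] <= gap:
            sessions[-1].append(record)
        else:
            sessions.append([record])
    return sessions


def query_weight(session):
    """Hypothetical weighting: frequent queries with highly ranked clicks score higher."""
    counts = Counter(query for _, query, _ in session)
    weights = {}
    for _, query, click_rank in session:
        share = counts[query] / len(session)
        # click_rank is None when no result was clicked
        click_bonus = 1.0 / click_rank if click_rank else 0.0
        weights[query] = max(weights.get(query, 0.0), share + click_bonus)
    return weights


log = [
    (0, "tang poetry", 1),
    (60, "tang poetry themes", 3),
    (4000, "weibo bot detection", None),
]
sessions = split_sessions(log)
first_session_weights = query_weight(sessions[0])
```

With this toy log the third query falls into its own session because it arrives more than 30 minutes after the second.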

Chinese Explanatory Opinionated Sentence Recognition Based on Auto-Encoding Features

HE Yu,PAN Da,FU Guohong

2015, 51(2): 234-240.

A classification method based on auto-encoding features is presented for Chinese explanatory opinionated sentence recognition. First, an explanatory opinion corpus is built from online product reviews in the cellphone and car domains. Then, word embeddings are learned from product reviews using the auto-encoding technique. Finally, the learned word embeddings are used as features for explanatory opinionated sentence classification under the framework of support vector machines. Experimental results show that word embeddings are more effective for this task than traditional feature representations such as Chi-square, TF-IDF and information gain.

Stroke Retrieval of Handwritten Chinese Character Images for Handwriting Teaching

XUN Endong,Lü Xiaochen,AN Weihua,SUN Yannan

2015, 51(2): 241-248.

For intelligent teaching of Chinese character handwriting, the authors present a stroke retrieval method for handwritten Chinese character images, which consists of three steps. First, the method extracts the skeletons from the handwritten image. Second, from the perspective of knowledge engineering, it eliminates skeleton distortions by using the stable grapheme topology. Third, it divides the skeletons into strokes and outputs the matching between them and the strokes in the template character, by building a similarity model and solving it with the A* algorithm. The result can be used for automatic quality assessment of handwritten Chinese character images.

On the Simplification of Chinese Characters in Taiwan: A Perspective of Corpus Linguistics

WANG Boli,SHI Xiaodong,CHEN Yidong,REN Wenyao,YAN Siyao

2015, 51(2): 249-254.

Corpus linguistics methods were applied to show that there is a simplification phenomenon of Chinese characters in Taiwan. First, a Taiwan Chinese corpus was built from a large number of texts from media, government websites and blogs. Second, corpus statistics showed that civilians in Taiwan prefer popular Chinese characters with fewer strokes, which implies a simplification phenomenon. Last, the authors analyzed several factors that influence the simplification of Chinese characters in Taiwan, including simplified Chinese from the mainland, Chinese character encoding and Chinese IMEs.

Study of Intelligent Multimedia Display System for Classic Chinese Poetry

YAN Siyao,ZHENG Xuling,SHI Xiaodong,ZHENG Fakui

2015, 51(2): 255-261.

Focusing on classical Chinese poetry, the authors combine NLP, computational poetics and computer animation to generate animations for classical poems automatically. First, the poetry style, subject matter and time are determined automatically using SVM-based and collaborative-learning classifiers. After animations are generated automatically with Flash ActionScript 3.0 scripts, co-occurrence relationships are used to supply animation elements, and a poetry scene classification method is given. The results show that the proposed methodology provides an initial solution to the automatic generation of classical poetry animation, and offers a theoretical basis and experimental foundation for subsequent research.

Automatic Classification of Tang Poetry Themes

HU Renfen,ZHU Yuchen

2015, 51(2): 262-268.

The authors propose a text classification model for Tang poetry. First, seven categories are defined for poetry themes: love and marriage, frontier war, friendship and farewell, journey and homesickness, landscape and countryside, history and nostalgia, and others. 500 Tang poems are selected as research samples and represented as vectors with the Vector Space Model (VSM). To reduce the vector dimensions, features are selected with the Chi-square test. Two classifiers are built, based on the Naive Bayes and Support Vector Machine algorithms. The models perform well in the classification experiment. In addition, the authors use the text classification models to verify the positive effect of poetry titles, authors and types on theme classification, which could offer a scientific reference for related research on Tang poetry.
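
The Chi-square feature selection step can be sketched with the standard 2x2 contingency form of the statistic, in which a term is scored against one category and the highest-scoring terms are kept. The toy documents, the token sets and the `chi_square` helper below are invented for illustration; the abstract does not describe the tokenization or the cutoff the authors used.

```python
def chi_square(docs, labels, term, category):
    """Chi-square statistic of `term` versus `category` from a 2x2 contingency table."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1          # term present, document in category
        elif has_term:
            b += 1          # term present, document in another category
        elif in_category:
            c += 1          # term absent, document in category
        else:
            d += 1          # term absent, document in another category
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


docs = [
    {"moon", "home"},
    {"sword", "frontier"},
    {"moon", "river"},
    {"frontier", "horse"},
]
labels = ["journey and homesickness", "frontier war",
          "landscape and countryside", "frontier war"]

strong = chi_square(docs, labels, "frontier", "frontier war")
weak = chi_square(docs, labels, "home", "frontier war")
```

In this toy set "frontier" occurs in exactly the frontier war poems, so it scores higher than "home" and would be kept first when the vocabulary is cut down.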

Mandarin Speech Emotion Recognition Based on MFCCG-PCA

CHEN Weiliang,SUN Xiao

2015, 51(2): 269-274.

To address the problem that speech emotion feature vectors are high-dimensional and therefore difficult to train on, a new speech emotion recognition model, MFCCG-PCA, is put forward by combining the MFCC model and the PCA model. Multiple sets of experiments show that the MFCCG-PCA model improves performance over the general MFCC model in speech emotion recognition.

Recognition of Comparative Sentences Based on Syntactic and Semantic Rules-System

BAI Linnan,HU Renfen,LIU Zhiying

2015, 51(2): 275-281.

The authors propose a novel rule-based method to identify comparative sentences, in which the rules encode syntactic and semantic features of comparative sentences. Comparative marks and comparative result words are significant elements for identifying comparative sentences. On this basis, the authors summarize the categories and identification rules of comparative sentences. Four models are designed to recognize the respective categories. Experiments show that the proposed method achieves satisfactory results in comparative parsing and recognition, which lays a good foundation for comparative relation extraction.

Short Texts Feature Extraction and Clustering Based on Auto-Encoder

LIU Kan,YUAN Yunying

2015, 51(2): 282-288.

Based on the characteristics of short texts, the authors propose a feature extraction and clustering algorithm named deep denoise sparse auto-encoder. The algorithm takes advantage of deep learning to transform high-dimensional, sparse vectors into new, low-dimensional, essential ones. The L1 norm is introduced to avoid overfitting, and noise is added to improve robustness. Experimental results show that using the extracted text features significantly improves the effectiveness of clustering. The algorithm is a valid way to handle the high dimensionality and sparsity of short text vectors.

A Weibo Bot-users Identification Model Based on Random Forest

LIU Kan,YUAN Yunying,LIU Ping

2015, 51(2): 289-300.

Bot-users spread rumors and fake information widely, misleading public opinion and seriously disrupting the normal network environment. Focusing on Weibo bot-users and considering their high level of automation, strong disguise ability and targeted posting, the authors propose a four-dimensional characteristic index comprising information entropy, content repetition rate, reputation, mutual, mention ratio, comment ratio, message and numofplatform to construct a feature vector, and design an identification model based on the random forest algorithm to recognize bot-users. Finally, a Sina Weibo data set is used to verify the efficiency and effectiveness of the model, which reaches an accuracy of 96.7%. The result shows that the model is good at distinguishing bot-users from ordinary users.
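
Two of the listed features, information entropy and content repetition rate, can be illustrated as follows. The abstract does not say what the entropy is computed over or how repetition is counted, so the use of posting intervals, the exact-duplicate definition of repetition and the helper names `shannon_entropy` and `repetition_rate` are assumptions made for this sketch.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def repetition_rate(posts):
    """Fraction of posts that duplicate another post's text (assumed definition)."""
    if not posts:
        return 0.0
    return 1.0 - len(set(posts)) / len(posts)


# Seconds between consecutive posts for two hypothetical accounts.
scheduled_intervals = [60, 60, 60, 60]
irregular_intervals = [5, 300, 42, 1800]

scheduled_entropy = shannon_entropy(scheduled_intervals)
irregular_entropy = shannon_entropy(irregular_intervals)
duplicate_share = repetition_rate(["buy now", "buy now", "hello"])
```

An account that posts on a fixed schedule has zero interval entropy, while irregular posting gives a higher value. Values like these could form part of the feature vector passed to a random forest classifier.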

Multi-strategies Extraction of Chinese Synonyms

SONG Wenjie,GU Yanhui,ZHOU Junsheng,SUN Yujie,YAN Jie,QU Weiguang

2015, 51(2): 301-306.

Cilin and the Chinese Concept Dictionary are used as dictionary resources in many NLP applications. The authors study several strategies for extracting Chinese synonyms based on the keywords of infoboxes in Baidubaike and the HTML tags of web pages in Zdic. In addition, DIPRE (Dual Iterative Pattern Relation Expansion) is applied to discover highly credible patterns and synonymous instances in encyclopedia corpora. Extensive experimental evaluation demonstrates that the proposed strategies outperform the NLP&CC 2012 evaluation results. A sophisticated synonym dictionary is built with manual proofreading for the noun part of the Grammatical Knowledge-base of Contemporary Chinese, which helps to improve the semantic system of that knowledge base.

Automatic Recognition and Classification on Chinese Discourse Connective

LI Yancui,SUN Jing,ZHOU Guodong

2015, 51(2): 307-314.

Based on the annotation of discourse connectives in the Chinese Discourse Treebank, especially the annotation of each connective and its relation classification, the authors extract syntactic, lexical and position features from automatic and standard syntax trees, and use a supervised method to recognize and classify connectives. Experimental results show that the F1-measure of connective recognition is 69.2% and the accuracy of connective classification is 89.1%.

Recognizing the Ellipsis of Opinion Target in Chinese Text

ZHU Zhu,WANG Rong,LI Shoushan,ZHOU Guodong

2015, 51(2): 315-320.

A novel method is proposed to recognize the ellipsis of opinion targets in Chinese text. The approach treats opinion target ellipsis recognition as a binary classification problem and solves it with a machine learning algorithm. Three kinds of features, namely position-independent sentence features, position-dependent sentence features and contextual features, are then applied to the recognition task separately. Experimental results in three domains demonstrate that the machine learning-based method is effective for recognizing opinion target ellipsis.

Construction of a Chinese Entity Linking Corpus

SHU Jiagen,HUI Haotian,QIAN Longhua,ZHU Qiaoming

2015, 51(2): 321-327.

In view of the lack of a Chinese entity linking benchmark corpus, automatic construction combined with manual annotation was applied to build a Chinese entity linking corpus, together with its related Chinese knowledge base, derived from the ACE2005 Chinese corpus and Chinese Wikipedia. Unlike traditional English entity linking corpora, this corpus is based on entities rather than individual entity mentions. The corpus provides a benchmark platform for the Chinese entity linking research community.

Conversion of Multiple Resources for POS Tagging

GAO Enting,CHAO Jiayuan,LI Zhenghua

2015, 51(2): 328-334.

The authors propose an annotation conversion method using multiple resources for POS tagging, which converts source-side annotations into target-side annotations and then combines the data to obtain a larger training set. Two innovative strategies are proposed. The first uses reliability information of guide features. The second uses ambiguous labelings to improve the quality of the converted data. Results demonstrate that the first strategy helps annotation conversion, while the second contributes little.

Translation Similarity Model Based on Bilingual Compositional Semantics

WANG Chaochao,XIONG Deyi,ZHANG Min

2015, 51(2): 335-341.

The authors propose a translation similarity model based on bilingual compositional semantics, which integrates a bilingual semantic similarity feature into the decoding process to improve translation quality. In the proposed model, monolingual compositional vectors for phrases are obtained on the source and target sides respectively using a distributional approach. These monolingual vectors are then projected onto the same semantic space and thereby transformed into bilingual compositional vectors. Based on this semantic space, the translation similarity between source phrases and their corresponding target phrases is calculated. The similarities are integrated into the decoder as a new feature. Experiments on the Chinese-to-English NIST06 and NIST08 test sets show that the proposed model significantly outperforms the baseline, by 0.56 and 0.42 BLEU points respectively.
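
The projection and similarity steps can be pictured with a small sketch. The abstract names neither the projection nor the similarity measure, so the linear projection matrices, the use of cosine similarity, the 2-dimensional shared space and all vectors below are assumptions chosen only to make the idea concrete.

```python
import math


def project(matrix, vector):
    """Apply a linear projection (given as a list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up projections from each monolingual space into a shared 2-d space.
SRC_TO_SHARED = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
TGT_TO_SHARED = [[0.0, 1.0], [1.0, 0.0]]

source_phrase = [2.0, 1.0, 1.0]   # monolingual source-side phrase vector
target_phrase = [2.0, 2.0]        # monolingual target-side phrase vector

similarity = cosine(project(SRC_TO_SHARED, source_phrase),
                    project(TGT_TO_SHARED, target_phrase))
```

A score like `similarity` would then be attached to each phrase pair as one additional decoder feature.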

Improved Statistical Machine Translation with Source Language Paraphrase

SU Chen,ZHANG Yujie,GUO Zhen,XU Jin’an

2015, 51(2): 342-348.

The performance of statistical machine translation (SMT) suffers when the parallel corpus is insufficient. To solve this problem, the authors propose a paraphrase-based SMT framework with three components: 1) acquiring paraphrase knowledge through a third language; 2) representing multiple paraphrases of the input sentence in a lattice and modifying the decoder to process it; 3) integrating paraphrase knowledge as features into the log-linear model. In this way, more expressions in the source language can be covered, and more expressions in the target language can be generated as candidate translations. To verify the proposed method, experiments are conducted on three training data sets of different sizes to evaluate how much paraphrasing improves the performance of the SMT system. Experimental results show that translation performance improves significantly (BLEU+1.4%) when the parallel corpus is small (10 K), and that a good improvement (BLEU+0.32%) is also achieved when the parallel corpus is large enough (1 M).

Text Feature Analysis on SAO Structure Extraction from Chinese Patent Literatures

RAO Qi,WANG Peiyan,ZHANG Guiping

2015, 51(2): 349-356.

To address SAO-based relation extraction from Chinese patent literature, a series of experiments was carried out using Support Vector Machines. The study focuses on the validity of basic lexical information, syntactic information such as the shortest path enclosed tree, and distance features used in related work. The results show that simple lexical features contribute to good performance, while syntactic features do not bring a remarkable improvement. Moreover, the feasibility of a new word representation, word embeddings, is validated on SAO-based relation extraction.

Expanding Training Dataset with Class Hierarchy in Hierarchical Text Categorization

LI Baoli

2015, 51(2): 357-366.

Because the number of classes in a hierarchical text categorization problem is usually large, obtaining a training dataset of reasonable size and sample distribution is costly. Several strategies are proposed and compared for generating new training samples from the class hierarchy. These strategies try to make full use of the class hierarchy (including class names, their descriptions if any, and the relationships between them), and derive new pseudo training samples based on the connotations and extensions of classes. Experiments on the dataset of the first large-scale Chinese News Categorization task at NLPCC 2014 show that the localized expanding strategy based on class extensions performs better. The proposed official system achieved MacroF1 scores of 0.8413 and 0.7139 at level 1 and level 2 respectively, ranking second among the 10 participating systems.
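
For readers unfamiliar with the reported metric, the sketch below computes macro-averaged F1 as the unweighted mean of per-class F1 scores. This is one common definition and is an assumption here: the NLPCC 2014 evaluation may have used a different variant, for example F1 computed from macro-averaged precision and recall. The class names and predictions are made up.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over all classes seen in gold or predicted."""
    classes = sorted(set(gold) | set(predicted))
    scores = []
    for cls in classes:
        pairs = list(zip(gold, predicted))
        tp = sum(g == cls and p == cls for g, p in pairs)
        fp = sum(g != cls and p == cls for g, p in pairs)
        fn = sum(g == cls and p != cls for g, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = ["sports", "sports", "finance", "tech"]
predicted = ["sports", "finance", "finance", "tech"]
score = macro_f1(gold, predicted)
```

Because every class counts equally, rare classes in a large hierarchy influence the score as much as frequent ones.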

A Supervised Dynamic Topic Model

JIANG Zhuoren,CHEN Yan,GAO Liangcai,TANG Zhi,LIU Xiaozhong

2015, 51(2): 367-376.

An innovative Supervised Dynamic Topic Model (S-DTM) is developed to overcome the limitations of traditional topic models. S-DTM models time-varying language dynamics and is combined with supervised learning by adding a label restriction to topic variational inference. It establishes a topic-label mapping and improves the interpretability of topics. A set of experiments is conducted on a Chinese journal paper corpus that spans twenty-five years and focuses mainly on natural language processing. Experimental results show that, compared with a static supervised topic model and an unsupervised dynamic topic model, S-DTM achieves better semantic interpretation, reflects the topic structure of a document more accurately, and captures the dynamic evolution of the term distribution of topics more precisely.
