Acta Scientiarum Naturalium Universitatis Pekinensis ›› 2017, Vol. 53 ›› Issue (2): 344-352. DOI: 10.13209/j.0479-8023.2016.090
Xingguang WANG, Ruijie ZHANG, Yi ZHANG†
Received: 2015-10-12
Revised: 2016-01-06
Online: 2017-03-20
Published: 2017-03-20
Contact: Yi ZHANG
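Abstract:
Current toponym disambiguation methods generally lack a theoretical foundation and a unified formal method. To address this, the paper takes the first law of geography as its theoretical basis and uses geo-relevance to formalize the proximity between geographic entities. On this basis, it proposes a toponym disambiguation model based on evidence theory (Dempster-Shafer theory) that represents and combines the evidence provided by toponyms co-occurring in the context. The model simulates the cognitive process by which humans read and understand spatio-temporal semantics in text, and it provides a unified, easily extensible formal framework for toponym disambiguation. Finally, the paper gives an implementation algorithm for the method and an experimental evaluation. The results show that the algorithm reaches an F1 score of 89.60%.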
Xingguang WANG, Ruijie ZHANG, Yi ZHANG. Toponym Resolution Based on Geo-relevance and D-S Theory[J]. Acta Scientiarum Naturalium Universitatis Pekinensis, 2017, 53(2): 344-352.
Candidate referent | Nanjing City | Jiangsu Province | Qinhuai River | Yangtze River |
---|---|---|---|---|
Gulou District, Nanjing | 0 | 0 | 0 | 0 |
Gulou District, Xuzhou | 6 | 0 | 6 | 6 |
Gulou District, Kaifeng | 12 | 6 | 12 | 11 |
Gulou District, Fuzhou | 12 | 12 | 13 | 9 |
Table 1 Qualitative geographic distance between the candidate referents of an ambiguous toponym and the evidence sources (county level)
Candidate referent | Nanjing City | Jiangsu Province | Qinhuai River | Yangtze River |
---|---|---|---|---|
Gulou District, Nanjing | 1 | 1 | 1 | 1 |
Gulou District, Xuzhou | 0.0025 | 1 | 0.0025 | 0.0025 |
Gulou District, Kaifeng | 6.144×10⁻⁶ | 0.0025 | 6.144×10⁻⁶ | 1.670×10⁻⁵ |
Gulou District, Fuzhou | 6.144×10⁻⁶ | 6.144×10⁻⁶ | 2.260×10⁻⁶ | 0.0001 |
Table 2 Geo-relevance between the candidate referents of an ambiguous toponym and the evidence sources
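The geo-relevance values in Table 2 are numerically consistent with an exponential decay of the qualitative distances in Table 1, that is, relevance = exp(-distance). This page does not state the formula, so the sketch below treats that decay as an assumption inferred from the tabulated numbers; the function name `geo_relevance` and the dictionary labels are illustrative only.

```python
import math

# Qualitative distances copied from Table 1.
# Rows: candidate referents of the ambiguous toponym "Gulou District".
# Columns: evidence toponyms, in the order listed in EVIDENCE.
EVIDENCE = ["Nanjing City", "Jiangsu Province", "Qinhuai River", "Yangtze River"]
DISTANCE = {
    "Gulou, Nanjing": [0, 0, 0, 0],
    "Gulou, Xuzhou": [6, 0, 6, 6],
    "Gulou, Kaifeng": [12, 6, 12, 11],
    "Gulou, Fuzhou": [12, 12, 13, 9],
}


def geo_relevance(distance):
    """Assumed decay: relevance = exp(-distance).

    This form is inferred from Tables 1 and 2; it is not quoted from the paper.
    """
    return math.exp(-distance)


# Recompute every cell of Table 2 from the matching cell of Table 1.
RELEVANCE = {
    name: [geo_relevance(d) for d in row] for name, row in DISTANCE.items()
}

for name, row in RELEVANCE.items():
    cells = ", ".join(f"{ev}: {r:.4g}" for ev, r in zip(EVIDENCE, row))
    print(f"{name} -> {cells}")
```

Under this assumption a distance of 9 gives roughly 1.2×10⁻⁴, which Table 2 prints as 0.0001 after rounding to four decimal places.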
Candidate referent | Nanjing City | Jiangsu Province | Qinhuai River | Yangtze River | Combined evidence |
---|---|---|---|---|---|
Gulou District, Nanjing | 0.2500 | 0.2500 | 0.2500 | 0.2500 | 0.6296 |
Gulou District, Xuzhou | 0.0006 | 0.2500 | 0.0006 | 0.0006 | 0.1241 |
Gulou District, Kaifeng | 1.536×10⁻⁶ | 0.0006 | 1.536×10⁻⁶ | 4.175×10⁻⁶ | 0.0003 |
Gulou District, Fuzhou | 1.536×10⁻⁶ | 1.536×10⁻⁶ | 5.651×10⁻⁷ | 3.085×10⁻⁵ | 1.158×10⁻⁵ |
Unassigned belief | 0.7494 | 0.4994 | 0.7494 | 0.7493 | 0.2460 |
Table 3 Mass values of the individual evidence sources and of the combined evidence
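The combined column of Table 3 can be reproduced, to the printed precision, by applying Dempster's rule of combination to the four per-evidence mass functions. The sketch below makes two assumptions that are inferred from the tabulated values rather than stated on this page: each mass equals 0.25 times the geo-relevance exp(-distance), and the remaining belief is left unassigned (placed on the whole frame of discernment). The names `mass_function`, `combine`, and `SCALE` are illustrative.

```python
import math
from functools import reduce

CANDIDATES = ["Nanjing", "Xuzhou", "Kaifeng", "Fuzhou"]  # Gulou District in each city
THETA = "unassigned"  # belief left on the whole frame of discernment

# Qualitative distances from Table 1, keyed by evidence toponym,
# listed in the order of CANDIDATES.
DISTANCE = {
    "Nanjing City": [0, 6, 12, 12],
    "Jiangsu Province": [0, 0, 6, 12],
    "Qinhuai River": [0, 6, 12, 13],
    "Yangtze River": [0, 6, 11, 9],
}

SCALE = 0.25  # assumption: inferred from Table 3, where relevance 1 maps to mass 0.25


def mass_function(distances):
    """Build one evidence source's mass function (assumed form, see lead-in)."""
    m = {c: SCALE * math.exp(-d) for c, d in zip(CANDIDATES, distances)}
    m[THETA] = 1.0 - sum(m.values())  # whatever is not committed stays unassigned
    return m


def combine(m1, m2):
    """Dempster's rule when the focal elements are singletons plus the frame."""
    raw = {
        c: m1[c] * m2[c] + m1[c] * m2[THETA] + m1[THETA] * m2[c]
        for c in CANDIDATES
    }
    raw[THETA] = m1[THETA] * m2[THETA]
    total = sum(raw.values())  # equals 1 minus the conflicting mass
    return {k: v / total for k, v in raw.items()}


combined = reduce(combine, (mass_function(d) for d in DISTANCE.values()))

for key, value in combined.items():
    print(f"{key}: {value:.4g}")
```

Because every candidate is a singleton hypothesis, two different candidates never intersect, so their product terms count as conflict and are removed by the normalization step.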
Algorithm | P/% | R/% | F1/% |
---|---|---|---|
Baseline | 70.90 | 70.90 | 70.90 |
RE-5 | 91.49 | 73.32 | 81.40 |
RE-19 | 90.42 | 74.89 | 81.92 |
RE_B-5 | 89.60 | 89.60 | 89.60 |
Table 4 Performance comparison of different toponym resolution methods
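The F1 column in Table 4 follows from the precision (P) and recall (R) columns through the standard harmonic mean, F1 = 2PR / (P + R). A minimal sketch for checking the table is shown below; the helper name `f1_score` and the `RESULTS` dictionary are illustrative, and the inputs are the rounded percentages printed in the table, so a recomputed value may differ from the printed one in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs in percent, copied from Table 4.
RESULTS = {
    "Baseline": (70.90, 70.90),
    "RE-5": (91.49, 73.32),
    "RE-19": (90.42, 74.89),
    "RE_B-5": (89.60, 89.60),
}

for name, (p, r) in RESULTS.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}%")
```

When precision and recall are equal, as for Baseline and RE_B-5, F1 equals that shared value.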