Pairing Wikipedia Articles Across Languages

Marcus Klang, Pierre Nugues


Abstract
Wikipedia has become a reference knowledge source for scores of NLP applications. One of its invaluable features lies in its multilingual nature, where articles on the same entity or concept can have from one to more than 200 different language versions. The interlinking of language versions in Wikipedia has undergone a major renewal with the advent of Wikidata, a unified scheme to identify entities and their properties using unique numbers. However, as the interlinking is still carried out manually by thousands of editors across the globe, errors may creep into the assignment of entities. In this paper, we describe an optimization technique to automatically match language versions of articles, and hence entities, that is based only on bags of words and anchors. We created a dataset of all the articles on persons we extracted from Wikipedia in six languages: English, French, German, Russian, Spanish, and Swedish. We report a correct match rate of at least 94.3% on each language pair.
Anthology ID:
W16-4410
Volume:
Proceedings of the Open Knowledge Base and Question Answering Workshop (OKBQA 2016)
Month:
December
Year:
2016
Address:
Osaka, Japan
Editors:
Key-Sun Choi, Christina Unger, Piek Vossen, Jin-Dong Kim, Noriko Kando, Axel-Cyrille Ngonga Ngomo
Venue:
WS
Publisher:
The COLING 2016 Organizing Committee
Pages:
72–76
URL:
https://aclanthology.org/W16-4410
Cite (ACL):
Marcus Klang and Pierre Nugues. 2016. Pairing Wikipedia Articles Across Languages. In Proceedings of the Open Knowledge Base and Question Answering Workshop (OKBQA 2016), pages 72–76, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal):
Pairing Wikipedia Articles Across Languages (Klang & Nugues, 2016)
PDF:
https://aclanthology.org/W16-4410.pdf