Cross-lingual NIL Entity Clustering for Low-resource Languages

Kevin Blissett, Heng Ji


Abstract
Clustering unlinkable entity mentions across documents in multiple languages (cross-lingual NIL clustering) is an important part of Entity Discovery and Linking (EDL). This task has been largely neglected by the EDL community because it is challenging to outperform baselines based on simple edit distance or other heuristics. We propose a novel approach that encodes the orthographic similarity of mentions using a Recurrent Neural Network (RNN) architecture. To achieve this, our model adapts a training procedure from the one-shot facial recognition literature. We also perform several exploratory probing tasks on our name encodings to determine what specific types of information our model is likely to encode. Experiments show that our approach provides up to a 6.6% absolute CEAFm F-score improvement over state-of-the-art methods and successfully captures phonological relations across languages.
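
The following minimal PyTorch sketch illustrates the general idea the abstract describes: a character-level RNN encoder that maps mention strings to fixed-size vectors, trained with a pair-based objective in the style of one-shot learning so that mentions of the same entity receive nearby encodings. All module names, dimensions, and the contrastive loss itself are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class MentionEncoder(nn.Module):
        """Encode a mention as the final hidden state of a character-level GRU."""

        def __init__(self, vocab_size: int, char_dim: int = 32, hidden_dim: int = 64):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, char_dim, padding_idx=0)
            self.gru = nn.GRU(char_dim, hidden_dim, batch_first=True)

        def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
            # char_ids: (batch, max_len) integer-encoded characters of a mention
            _, h_n = self.gru(self.embed(char_ids))
            return h_n.squeeze(0)  # (batch, hidden_dim)

    def contrastive_loss(z1, z2, same_entity, margin: float = 1.0):
        """Pull same-entity encodings together, push different-entity ones apart."""
        dist = torch.norm(z1 - z2, dim=1)
        pos = same_entity * dist.pow(2)
        neg = (1 - same_entity) * torch.clamp(margin - dist, min=0).pow(2)
        return (pos + neg).mean()

    if __name__ == "__main__":
        enc = MentionEncoder(vocab_size=100)
        a = torch.randint(1, 100, (4, 12))          # batch of 4 mentions, 12 chars each
        b = torch.randint(1, 100, (4, 12))
        labels = torch.tensor([1.0, 0.0, 1.0, 0.0])  # 1 = same entity, 0 = different
        loss = contrastive_loss(enc(a), enc(b), labels)
        loss.backward()
        print(loss.item())

Clustering would then proceed by grouping mentions whose encodings fall within some distance threshold; the specific clustering step and training data pairing are further assumptions not covered by this sketch.
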
Anthology ID:
W19-2804
Volume:
Proceedings of the Second Workshop on Computational Models of Reference, Anaphora and Coreference
Month:
June
Year:
2019
Address:
Minneapolis, USA
Editors:
Maciej Ogrodniczuk, Sameer Pradhan, Yulia Grishina, Vincent Ng
Venue:
CRAC
Publisher:
Association for Computational Linguistics
Pages:
20–25
URL:
https://aclanthology.org/W19-2804
DOI:
10.18653/v1/W19-2804
Cite (ACL):
Kevin Blissett and Heng Ji. 2019. Cross-lingual NIL Entity Clustering for Low-resource Languages. In Proceedings of the Second Workshop on Computational Models of Reference, Anaphora and Coreference, pages 20–25, Minneapolis, USA. Association for Computational Linguistics.
Cite (Informal):
Cross-lingual NIL Entity Clustering for Low-resource Languages (Blissett & Ji, CRAC 2019)
PDF:
https://aclanthology.org/W19-2804.pdf