Learning Text Representations for 500K Classification Tasks on Named Entity Disambiguation

Ander Barrena, Aitor Soroa, Eneko Agirre


Abstract
Named Entity Disambiguation algorithms typically learn a single model for all target entities. In this paper we present a word expert model and train a separate deep learning model for each target entity string, yielding 500K classification tasks. This gives us the opportunity to benchmark popular text representation alternatives on this massive dataset. To cope with scarce training data, we propose a simple data-augmentation technique and transfer learning. We show that bag-of-word-embeddings outperform LSTMs for tasks with scarce training data, while the situation is reversed when larger amounts are available. Transferring an LSTM learned on all datasets is the most effective context representation option for the word experts in all frequency bands. The experiments show that our system, trained on out-of-domain Wikipedia data, surpasses comparable NED systems trained on in-domain data.
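
The abstract's comparison of context representations can be made concrete with a small sketch. The following PyTorch code is not the authors' implementation; it is a minimal illustration, under assumed names and dimensions, of the two encoders being compared (a bag-of-word-embeddings average versus an LSTM) and of a per-string "word expert" classifier over candidate entities.

```python
import torch
import torch.nn as nn

class BoEEncoder(nn.Module):
    """Bag-of-word-embeddings: averages context word vectors (order-insensitive)."""
    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)

    def forward(self, tokens):                        # tokens: (batch, seq_len) ids, 0 = pad
        vecs = self.emb(tokens)                       # (batch, seq_len, emb_dim)
        mask = (tokens != 0).unsqueeze(-1).float()    # ignore padding positions
        return (vecs * mask).sum(1) / mask.sum(1).clamp(min=1.0)

class LSTMEncoder(nn.Module):
    """LSTM context encoder: the final hidden state is the representation."""
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, tokens):
        _, (h_n, _) = self.lstm(self.emb(tokens))     # h_n: (1, batch, hidden_dim)
        return h_n.squeeze(0)

class WordExpert(nn.Module):
    """One classifier per ambiguous entity string, over its candidate entities."""
    def __init__(self, encoder, repr_dim, n_candidates):
        super().__init__()
        self.encoder = encoder                        # per-expert, or shared/transferred
        self.out = nn.Linear(repr_dim, n_candidates)

    def forward(self, tokens):
        return self.out(self.encoder(tokens))         # logits over candidate entities

# Hypothetical usage: an expert for one entity string with 5 candidate entities.
expert = WordExpert(LSTMEncoder(vocab_size=50_000, emb_dim=300, hidden_dim=100),
                    repr_dim=100, n_candidates=5)
logits = expert(torch.randint(1, 50_000, (8, 40)))    # 8 contexts of 40 tokens each
```

In a sketch like this, the transfer-learning variant described in the abstract would correspond to training one LSTMEncoder across all 500K tasks and then reusing that single instance (frozen or fine-tuned) inside each word expert, rather than training an encoder per expert from scratch.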
Anthology ID:
K18-1017
Volume:
Proceedings of the 22nd Conference on Computational Natural Language Learning
Month:
October
Year:
2018
Address:
Brussels, Belgium
Editors:
Anna Korhonen, Ivan Titov
Venue:
CoNLL
SIG:
SIGNLL
Publisher:
Association for Computational Linguistics
Pages:
171–180
URL:
https://aclanthology.org/K18-1017
DOI:
10.18653/v1/K18-1017
Cite (ACL):
Ander Barrena, Aitor Soroa, and Eneko Agirre. 2018. Learning Text Representations for 500K Classification Tasks on Named Entity Disambiguation. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 171–180, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Learning Text Representations for 500K Classification Tasks on Named Entity Disambiguation (Barrena et al., CoNLL 2018)
PDF:
https://aclanthology.org/K18-1017.pdf
Code:
anderbarrena/500kNED