Word Embedding Evaluation in Downstream Tasks and Semantic Analogies

Joaquim Santos, Bernardo Consoli, Renata Vieira


Abstract
Language models have long been a prolific area of study in Natural Language Processing (NLP). Among the newer and most widely used kinds are Word Embeddings (WE): vector-space representations of a vocabulary, learned by an unsupervised neural network from the contexts in which words appear. WE models are widely used as features in downstream tasks across many areas of NLP. This paper presents the evaluation of newly released WE models for the Portuguese language, trained on a corpus of 4.9 billion tokens. The first evaluation is an intrinsic task in which the WE models must correctly capture semantic and syntactic analogy relations. The second is an extrinsic evaluation in which the models are used in two downstream tasks: Named Entity Recognition and Semantic Similarity between Sentences. Our results show that a diverse and comprehensive corpus can often outperform a larger, less textually diverse corpus, and that batch training may cause quality loss in WE models.
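The intrinsic analogy evaluation described above ("a is to b as c is to ?") is commonly scored with the 3CosAdd method. The sketch below illustrates that method with hand-crafted toy vectors; the words and embeddings are invented for illustration and are not taken from the paper's models or test set:

```python
# Toy sketch of the 3CosAdd word-analogy method used in intrinsic
# word-embedding evaluation. Vectors are hypothetical 2-d embeddings:
# one axis loosely encodes gender, the other royalty.
import numpy as np

vectors = {
    "man":   np.array([ 1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king":  np.array([ 1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c):
    """Return the word d maximizing cos(d, b - a + c), excluding a, b, c."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = {w: cosine(vec, target)
                  for w, vec in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=candidates.get)

print(analogy("man", "king", "woman"))  # prints "queen"
```

In practice, real evaluations run thousands of such queries against a fixed analogy test set and report the fraction answered correctly; libraries such as gensim expose this via `KeyedVectors.most_similar(positive=[b, c], negative=[a])`.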
Anthology ID:
2020.lrec-1.594
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Note:
Pages:
4828–4834
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.594
Cite (ACL):
Joaquim Santos, Bernardo Consoli, and Renata Vieira. 2020. Word Embedding Evaluation in Downstream Tasks and Semantic Analogies. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4828–4834, Marseille, France. European Language Resources Association.
Cite (Informal):
Word Embedding Evaluation in Downstream Tasks and Semantic Analogies (Santos et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.594.pdf