Exploring Bilingual Word Embeddings for Hiligaynon, a Low-Resource Language

Leah Michel, Viktor Hangya, Alexander Fraser


Abstract
This paper investigates the use of bilingual word embeddings for mining Hiligaynon translations of English words. Hiligaynon is an extremely low-resource language of Malayo-Polynesian origin with over 9 million speakers in the Philippines, and there is very little research on it (we found just one paper). We use a publicly available Hiligaynon corpus of only 300K words and pair it with a comparable English corpus. As no bilingual resources are available, we manually develop an English-Hiligaynon lexicon and use it to train bilingual word embeddings. However, we fail to mine accurate translations because of the small amount of data. To find out whether the same holds for a related language pair, we simulate the same low-resource setup on English to German and arrive at similar results. We then vary the size of the comparable English and German corpora to determine the minimum corpus size necessary to achieve competitive results. Further, we investigate the role of the seed lexicon. We show that with the same corpus size but with a smaller seed lexicon, performance can surpass results of previous studies. We release the lexicon of 1,200 English-Hiligaynon word pairs we created to encourage further investigation.
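The abstract does not say which bilingual embedding method is used, so the following is only a minimal sketch of one common seed-lexicon approach: learn an orthogonal map (orthogonal Procrustes) from seed word pairs, then mine translations by cosine nearest neighbour in the target space. The function names, the toy vectors, and the placeholder tokens such as `tgt_0` are hypothetical and are not taken from the paper or its data.

```python
import numpy as np

def learn_orthogonal_map(src_vecs, tgt_vecs):
    """Orthogonal Procrustes: the orthogonal W minimising ||src_vecs @ W - tgt_vecs||_F.

    Row i of src_vecs and row i of tgt_vecs form one seed lexicon pair.
    """
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    return u @ vt

def mine_translations(query_vecs, tgt_matrix, tgt_words, w, k=1):
    """Map source vectors with w and return the k nearest target words by cosine."""
    mapped = query_vecs @ w
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    tgt = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
    sims = mapped @ tgt.T                      # cosine similarity matrix
    top = np.argsort(-sims, axis=1)[:, :k]     # indices of the best k targets
    return [[tgt_words[j] for j in row] for row in top]

# Toy data (assumption): the source space is an exact rotation of the target
# space, so a perfect orthogonal map exists. Real monolingual embeddings
# trained on a 300K-word corpus would be far noisier than this.
rng = np.random.default_rng(0)
dim, n_words, n_seed = 5, 20, 10
tgt_words = [f"tgt_{i}" for i in range(n_words)]   # placeholder tokens
tgt_emb = rng.normal(size=(n_words, dim))
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
src_emb = tgt_emb @ rotation.T                     # src_emb @ rotation == tgt_emb

# Train on the first n_seed pairs, then mine translations for the rest.
w_map = learn_orthogonal_map(src_emb[:n_seed], tgt_emb[:n_seed])
mined = mine_translations(src_emb[n_seed:], tgt_emb, tgt_words, w_map, k=1)
```

In this idealised toy setting every held-out word is recovered. The abstract's finding is that with a very small corpus the mined translations are not accurate, which would correspond here to source and target spaces that are no longer close to a rotation of each other.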
Anthology ID:
2020.lrec-1.313
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
2573–2580
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.313
Cite (ACL):
Leah Michel, Viktor Hangya, and Alexander Fraser. 2020. Exploring Bilingual Word Embeddings for Hiligaynon, a Low-Resource Language. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2573–2580, Marseille, France. European Language Resources Association.
Cite (Informal):
Exploring Bilingual Word Embeddings for Hiligaynon, a Low-Resource Language (Michel et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.313.pdf