Multilingual Corpus Creation for Multilingual Semantic Similarity Task

Mahtab Ahmed, Chahna Dixit, Robert E. Mercer, Atif Khan, Muhammad Rifayat Samee, Felipe Urra


Abstract
In natural language processing, performance on a semantic similarity task relies heavily on the availability of a large corpus. Various monolingual corpora are available (mainly English), but multilingual resources are very limited. In this work, we describe a semi-automated framework for creating a multilingual corpus that can be used for the multilingual semantic similarity task. The similar sentence pairs are obtained by crawling bilingual websites, whereas the dissimilar sentence pairs are selected by applying topic modeling and an OpenAI GPT model to the similar sentence pairs. We focus on websites in the government, insurance, and banking domains to collect English-French and English-Spanish sentence pairs; however, this corpus creation approach can be applied to any other industry vertical provided that a bilingual website exists. We also show experimental results for multilingual semantic similarity to verify the quality of the corpus and demonstrate its usage.
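
As a rough illustration of the pairing idea described in the abstract, the sketch below labels crawled translation pairs as similar and draws a dissimilar partner for each English sentence from a different topic. It is a minimal sketch and not the authors' implementation: the function name build_pairs, the sample sentences, and the integer topic ids are all hypothetical, the topic ids are assumed to have been produced beforehand by some topic model, and the GPT-based selection step mentioned in the abstract is omitted entirely.

```python
import random
from typing import List, Tuple

# (english, french, topic_id). The topic_id is an assumption: it stands in for
# the output of a topic model that has already been run on the sentences.
AlignedPair = Tuple[str, str, int]


def build_pairs(aligned: List[AlignedPair], seed: int = 0) -> List[Tuple[str, str, int]]:
    """Return (english, french, label) triples: 1 = similar, 0 = dissimilar."""
    rng = random.Random(seed)  # seeded so the sampling is repeatable
    labeled = []
    for en, fr, topic in aligned:
        # A crawled translation pair is treated as a similar pair.
        labeled.append((en, fr, 1))
        # Candidate dissimilar partners: French sentences from another topic.
        candidates = [o_fr for _, o_fr, o_topic in aligned if o_topic != topic]
        if candidates:
            labeled.append((en, rng.choice(candidates), 0))
    return labeled


# Hypothetical sample data, for illustration only.
corpus = [
    ("Renew your passport online.", "Renouvelez votre passeport en ligne.", 0),
    ("Apply for a travel document.", "Demandez un titre de voyage.", 0),
    ("Compare home insurance quotes.", "Comparez les soumissions d'assurance habitation.", 1),
]
pairs = build_pairs(corpus)
for en, fr, label in pairs:
    print(label, "|", en, "|", fr)
```

Sampling only across topics is one simple way to keep near-paraphrases out of the dissimilar set; a real pipeline would likely need a further filtering step, which this sketch does not attempt.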
Anthology ID:
2020.lrec-1.516
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4190–4196
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.516
Cite (ACL):
Mahtab Ahmed, Chahna Dixit, Robert E. Mercer, Atif Khan, Muhammad Rifayat Samee, and Felipe Urra. 2020. Multilingual Corpus Creation for Multilingual Semantic Similarity Task. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4190–4196, Marseille, France. European Language Resources Association.
Cite (Informal):
Multilingual Corpus Creation for Multilingual Semantic Similarity Task (Ahmed et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.516.pdf