Automatic Spanish Translation of SQuAD Dataset for Multi-lingual Question Answering

Casimiro Pio Carrino, Marta R. Costa-jussà, José A. R. Fonollosa


Abstract
Multilingual question answering has recently become a crucial research topic and is receiving increased interest in the NLP community. However, the lack of large-scale datasets makes it challenging to train multilingual QA systems whose performance is comparable to that of English ones. In this work, we develop the Translate Align Retrieve (TAR) method to automatically translate the Stanford Question Answering Dataset (SQuAD) v1.1 to Spanish. We then use this dataset to train Spanish QA systems by fine-tuning a Multilingual-BERT model. Finally, we evaluate our QA models on the recently proposed MLQA and XQuAD benchmarks for cross-lingual extractive QA. Experimental results show that our models outperform the previous Multilingual-BERT baselines, achieving new state-of-the-art values of 68.1 F1 on the Spanish MLQA corpus and 77.6 F1 on the Spanish XQuAD corpus. To the best of our knowledge, the resulting synthetically generated SQuAD-es v1.1 corpus, which retains almost 100% of the data contained in the original English version, is the first large-scale QA training resource for Spanish.
Anthology ID:
2020.lrec-1.677
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
5515–5523
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.677
Cite (ACL):
Casimiro Pio Carrino, Marta R. Costa-jussà, and José A. R. Fonollosa. 2020. Automatic Spanish Translation of SQuAD Dataset for Multi-lingual Question Answering. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 5515–5523, Marseille, France. European Language Resources Association.
Cite (Informal):
Automatic Spanish Translation of SQuAD Dataset for Multi-lingual Question Answering (Carrino et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.677.pdf