Benchmarking Multidomain English-Indonesian Machine Translation

Tri Wahyu Guntara, Alham Fikri Aji, Radityo Eko Prasojo


Abstract
In the context of Machine Translation (MT) from and to English, Bahasa Indonesia has been considered a low-resource language, so applying Neural Machine Translation (NMT), which typically requires large training datasets, has proven problematic. In this paper, we show otherwise by collecting large, publicly available datasets from the Web, which we split into several domains: news, religion, general, and conversation, to train and benchmark several variants of transformer-based NMT models across the domains. Using BLEU, we show that our models perform well across domains, outperform the baseline Statistical Machine Translation (SMT) models, and perform comparably with Google Translate. Our datasets (with the standard split for training, validation, and testing), code, and models are available at https://github.com/gunnxx/indonesian-mt-data.
Anthology ID:
2020.bucc-1.6
Volume:
Proceedings of the 13th Workshop on Building and Using Comparable Corpora
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Reinhard Rapp, Pierre Zweigenbaum, Serge Sharoff
Venue:
BUCC
Publisher:
European Language Resources Association
Pages:
35–43
Language:
English
URL:
https://aclanthology.org/2020.bucc-1.6
Cite (ACL):
Tri Wahyu Guntara, Alham Fikri Aji, and Radityo Eko Prasojo. 2020. Benchmarking Multidomain English-Indonesian Machine Translation. In Proceedings of the 13th Workshop on Building and Using Comparable Corpora, pages 35–43, Marseille, France. European Language Resources Association.
Cite (Informal):
Benchmarking Multidomain English-Indonesian Machine Translation (Guntara et al., BUCC 2020)
PDF:
https://aclanthology.org/2020.bucc-1.6.pdf
Code:
gunnxx/indonesian-mt-data