Low-Resource Machine Translation Using Cross-Lingual Language Model Pretraining

This paper describes UTokyo’s submission to the AmericasNLP 2021 Shared Task on machine translation systems for indigenous languages of the Americas. We present a low-resource machine translation system that improves translation accuracy using cross-lingual language model pretraining. Our system uses the mBART implementation in fairseq to pretrain on a large set of monolingual data from a diverse set of high-resource languages before finetuning on 10 low-resource indigenous American languages: Aymara, Bribri, Asháninka, Guaraní, Wixarika, Náhuatl, Hñähñu, Quechua, Shipibo-Konibo, and Rarámuri. On average, our system achieved BLEU scores that were 1.64 higher and chrF scores that were 0.0749 higher than the baseline.


Introduction
Neural machine translation (NMT) systems have produced translations of commendable accuracy under large-data training conditions but are data-hungry (Zoph et al., 2016) and perform poorly in low-resource languages, where parallel data is lacking (Koehn and Knowles, 2017).
Many of the indigenous languages of the Americas lack adequate amounts of parallel data, so existing NMT systems have difficulty producing accurate translations for these languages. Additionally, many of these indigenous languages exhibit linguistic properties that are uncommon in high-resource languages, such as English or Chinese, that are used to train NMT systems.
One striking feature of many indigenous American languages is their polysynthesis (Brinton, 1885; Payne, 2014). Polysynthetic languages display high levels of inflection and are morphologically complex. However, NMT systems are weak at translating "low-frequency words belonging to highly-inflected categories (e.g. verbs)" (Koehn and Knowles, 2017). Quechua, a low-resource, polysynthetic American language, has on average twice as many morphemes per word as English (Ortega et al., 2020b), which makes machine translation difficult. Mager et al. (2018b) show that information is often lost when translating polysynthetic languages into Spanish due to a misalignment of morphemes. Thus, existing NMT systems are not well suited to indigenous American languages, which are low-resource, polysynthetic languages.
Despite the scarcity of parallel data for these indigenous languages, some are spoken widely and have a pressing need for improved machine translation. For example, Quechua is spoken by more than 10 million people in South America, but some Quechua speakers are not able to access health care due to a lack of Spanish ability (Freire, 2011).
Other languages lack a large population of speakers and may appear to have relatively low demand for translation, but many of these languages are also crucial in many domains such as health care, the maintenance of cultural history, and international security (Klavans, 2018). Improved translation techniques for low-resource, polysynthetic languages are thus of great value.
In light of this, we participated in the Americas-NLP 2021 Shared Task to help further the development of new approaches to low-resource machine translation of polysynthetic languages, which are not commonly studied in natural language processing. The task consisted of producing translations from Spanish to 10 different indigenous American languages.
In this paper, we describe our system designed for the AmericasNLP 2021 Shared Task, which achieved BLEU scores that were 1.64 higher and chrF scores that were 0.0749 higher than the baseline on average. Our system improves translation accuracy by using monolingual data to improve understanding of natural language before finetuning for each of the 10 indigenous languages.

Data
Our model employs two types of data: monolingual data for pretraining and parallel data for finetuning.

Monolingual Data
For pretraining, we selected a variety of widely-spoken languages across the Americas, Asia, Europe, Africa, and Oceania, allowing our model to learn from a wide range of language families and linguistic features. These monolingual data were acquired from CC100 1 . We use these monolingual data as part of our pretraining, as this has been shown to improve results with smaller parallel datasets (Conneau and Lample, 2019; Liu et al., 2020; Song et al., 2019).

Parallel Data
The parallel data between Spanish and the indigenous American languages were provided by AmericasNLP 2021 (Mager et al., 2021). We have summarized some important details of the training data and development/test sets (Ebrahimi et al., 2021) below. More details about these data can be found in the AmericasNLP 2021 official repository 2 .
Aymara The Aymara-Spanish data came from translations by Global Voices and Facebook AI. The training data came primarily from Global Voices 3 (Prokopidis et al., 2016;Tiedemann, 2012), but because translations were done by volunteers, the texts have potentially different writing styles. The development and test sets came from translations from Spanish texts into Aymara La Paz jilata, a Central Aymara variant.
Bribri The Bribri-Spanish data (Feldman and Coto-Solano, 2020) came from six different sources (a dictionary, a grammar, two language learning textbooks, one storybook, and transcribed sentences from a spoken corpus) and three major dialects (Amubri, Coroma, and Salitre). Two different orthographies are widely used for Bribri, so an intermediate representation was used to facilitate training.
Asháninka The Asháninka-Spanish data 4 were extracted and pre-processed by Richard Castro (Cushimariano Romano and Sebastián Q., 2008; Ortega et al., 2020a; Mihas, 2011). Though the texts came from different pan-Ashaninka dialects, they were normalized using AshMorph (Ortega et al., 2020a). The development and test sets came from translations of Spanish texts done by Feliciano Torres Ríos.
Guaraní The Guaraní-Spanish data (Chiruzzo et al., 2020) consisted of training data from web sources (blogs and news articles) written in a mix of dialects and development and test sets written in pure Guaraní. Translations were provided by Perla Alvarez Britez.
Wixarika The Wixarika-Spanish data came from Mager et al. (2018a). The training, development, and test sets all used the same dialect (Wixarika of Zoquipan) and orthography, though word boundaries were not consistent between the development/test and training sets. Translations were provided by Silvino González de la Crúz.
Náhuatl The Náhuatl-Spanish data came from Gutierrez-Vasques et al. (2016). Náhuatl has a wide dialectal variation and no standard orthography, but most of the training data were close to a Classical Náhuatl orthographic "standard." The development and test sets came from translations made from Spanish into modern Náhuatl. An orthographic normalization was applied to these translations to make them closer to the Classical Náhuatl orthography found in the training data. This normalization was done by employing a rule-based approach based on predictable orthographic changes between modern varieties and Classical Náhuatl. Translations were provided by Giovany Martinez Sebastián, José Antonio, and Pedro Kapoltitan.
Hñähñu The Hñähñu-Spanish training data came from translations into Spanish from Hñähñu text from a set of different sources 5 . Most of these texts are in the Valle del Mezquital dialect. The development and test sets are in the Ñûhmû de Ixtenco, Tlaxcala variant. Translations were done by José Mateo Lino Cajero Velázquez.
Quechua The training set for the Quechua-Spanish data (Agić and Vulić, 2019) came from Jehovah's Witnesses texts (available in OPUS), sentences extracted from the official dictionary of the Ministry of Education (MINEDU) in Peru for Quechua Ayacucho, and dictionary entries and samples collected and reviewed by Diego Huarcaya. Training sets were provided in both the Quechua Cuzco and Quechua Ayacucho variants, but our system only employed the Quechua Ayacucho data during training. The development and test sets came from translations of Spanish text into Quechua Ayacucho, a standard version of Southern Quechua. Translations were provided by Facebook AI.

Shipibo-Konibo The training set of the Shipibo-Konibo-Spanish data (Galarreta et al., 2017) was obtained from translations of flashcards and of sentences from books for bilingual education, done by a bilingual teacher. Additionally, parallel sentences from a dictionary were used as part of the training data. The development and test sets came from translations from Spanish into Shipibo-Konibo done by Liz Chávez.
Rarámuri The training set of the Rarámuri-Spanish data came from a dictionary (Brambila, 1976). The development and test sets came from translations from Spanish into the highlands Rarámuri by María del Cármen Sotelo Holguín. The training set and development/test sets use different orthographies.

Preprocessing
We tokenized all of our data together using SentencePiece (Kudo and Richardson, 2018) in preparation for our multilingual model. We used a vocabulary size of 8000 and a character coverage of 0.9995, as the wide variety of languages covered carries a rich character set.
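The character-coverage setting can be illustrated with a short sketch (the function name and toy corpus below are ours, not part of the system): characters whose cumulative frequency falls outside the coverage threshold would be mapped to an unknown symbol by the tokenizer, so a high threshold like 0.9995 keeps the rare characters of a multilingual corpus.

```python
from collections import Counter

def coverage_charset(corpus_lines, coverage=0.9995):
    """Return the smallest set of characters whose combined frequency
    accounts for at least `coverage` of all characters in the corpus.
    Characters outside this set would be treated as unknown."""
    counts = Counter(ch for line in corpus_lines for ch in line)
    total = sum(counts.values())
    kept, covered = set(), 0
    for ch, n in counts.most_common():
        if covered / total >= coverage:
            break
        kept.add(ch)
        covered += n
    return kept

# Toy corpus mixing scripts, as in a multilingual setting.
corpus = ["kamisaraki", "allinllachu", "ñuqa", "hola mundo"]
charset = coverage_charset(corpus, coverage=0.9995)
print(len(charset))
```

With a lower coverage value, infrequent characters such as "ñ" would fall out of the vocabulary, which is undesirable when many languages share one model.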

Pretraining
We pretrained our model on the 20 languages described in Section 2.1 with the mBART (Liu et al., 2020) implementation in FAIRSEQ (Ott et al., 2019). We pretrained on 32 NVIDIA V100 GPUs for three hours.

Balancing data across languages
Due to the large variability in text data size between different languages, we used the exponential sampling technique used in Conneau and Lample (2019); Liu et al. (2020), where the text is resampled according to smoothing parameter α as follows:

q_i = p_i^α / ∑_{j=1}^{N} p_j^α,   where p_i = n_i / ∑_{k=1}^{N} n_k   (1)

In Equation 1, q_i refers to the resample probability for language i under the resulting multinomial distribution, n_i is the amount of text for language i, and N is the number of languages. As we want our model to work well with the low-resource languages, we chose a smoothing parameter of α = 0.25 (compared with α = 0.7 used in mBART (Liu et al., 2020)) to alleviate model bias towards the higher proportion of data from high-resource languages.
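The effect of the smoothing parameter can be sketched in a few lines of Python (the language codes and corpus sizes below are hypothetical): a smaller α pushes the sampling distribution towards uniform, so low-resource languages are seen more often during pretraining.

```python
def resample_probs(sentence_counts, alpha=0.25):
    """Exponentially smoothed resampling probabilities:
    p_i = n_i / sum_k n_k, then q_i proportional to p_i ** alpha."""
    total = sum(sentence_counts.values())
    p = {lang: n / total for lang, n in sentence_counts.items()}
    z = sum(pi ** alpha for pi in p.values())
    return {lang: (pi ** alpha) / z for lang, pi in p.items()}

# Hypothetical corpus sizes: one high-resource, one low-resource language.
counts = {"es": 1_000_000, "gn": 10_000}
q_mbart = resample_probs(counts, alpha=0.7)   # mBART's setting
q_ours = resample_probs(counts, alpha=0.25)   # our setting
print(q_mbart["gn"], q_ours["gn"])
```

In this example the low-resource language is sampled several times more often under α = 0.25 than under α = 0.7, which is exactly the bias-correction effect we want.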

Hyperparameters
We used a six-layer Transformer with a hidden dimension of 512 and a feed-forward size of 2048. We set the maximum sequence length to 512, with a batch size of 1024. We optimized the model using Adam (Kingma and Ba, 2015) with hyperparameters β = (0.9, 0.98) and ε = 10^-6. We used a learning rate of 6 × 10^-4 over 10,000 iterations. For regularization, we used a dropout rate of 0.5 and a weight decay of 0.01. We also experimented with lower dropout rates but found that the higher dropout rate gave us a model that produces better translations.
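As a rough illustration, the settings above map onto fairseq's command-line flags as follows. This is a sketch rather than our exact invocation: the data path, task name, and architecture choice are assumptions, and flag names should be checked against the fairseq documentation for your version.

```python
# Illustrative fairseq-train argument list matching the hyperparameters above.
# "data-bin/pretrain" and the task/arch choices are placeholders.
pretrain_args = [
    "fairseq-train", "data-bin/pretrain",
    "--task", "multilingual_denoising",       # mBART-style pretraining
    "--arch", "transformer",                  # six encoder/decoder layers
    "--encoder-embed-dim", "512", "--decoder-embed-dim", "512",
    "--encoder-ffn-embed-dim", "2048", "--decoder-ffn-embed-dim", "2048",
    "--max-source-positions", "512", "--max-target-positions", "512",
    "--optimizer", "adam", "--adam-betas", "(0.9, 0.98)", "--adam-eps", "1e-6",
    "--lr", "6e-4", "--max-update", "10000",
    "--dropout", "0.5", "--weight-decay", "0.01",
]
print(" ".join(pretrain_args))
```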

Finetuning
Using our pretrained model, we performed finetuning on each of the 10 indigenous American languages with the same hyperparameters used during pretraining. For each language, we conducted our finetuning using four NVIDIA V100 GPUs for three hours.

Evaluation
Using the SacreBLEU library (https://github.com/mjpost/sacrebleu) (Post, 2018), we evaluated our system outputs with detokenized BLEU (Papineni et al., 2002; Post, 2018). Due to the polysynthetic nature of the languages involved in this task, we also used chrF (Popović, 2015) to measure performance at the character level and better assess how well morphemes or parts of morphemes were translated, rather than whole words. For these reasons, we focused on optimizing the chrF score.
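The character-level behaviour of chrF can be seen in a simplified sentence-level sketch (sacreBLEU's implementation differs in details such as corpus-level aggregation; the β value and example words below are our assumptions): a hypothesis that gets most of a word's morphemes right receives partial credit, whereas word-level exact match would score zero.

```python
from collections import Counter

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF: average F-score over character
    n-grams (n = 1..max_n), with recall weighted by beta.
    Whitespace is removed so only character content is compared."""
    hyp = hypothesis.replace(" ", "")
    ref = reference.replace(" ", "")
    f_scores = []
    for n in range(1, max_n + 1):
        h = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        r = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not h or not r:
            continue  # sentence shorter than n characters
        overlap = sum((h & r).values())  # clipped n-gram matches
        prec = overlap / sum(h.values())
        rec = overlap / sum(r.values())
        if prec + rec == 0:
            f_scores.append(0.0)
            continue
        f_scores.append((1 + beta ** 2) * prec * rec / (beta ** 2 * prec + rec))
    return sum(f_scores) / len(f_scores) if f_scores else 0.0

# A near-miss that differs only in the final suffix still scores well.
print(chrf("wasinchikman", "wasinchikta"))
```

This is why chrF is better suited than word-level BLEU to highly inflected, polysynthetic languages: partial morpheme overlap is rewarded.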

Results
We describe our results in Table 1. Our test results (Test1 and Test2) show considerable improvements over the baseline provided by AmericasNLP 2021. We also included our own results on the development set (Dev) for comparison. The trends we saw in the Dev results parallel our test results; languages for which our system achieved high scores in Dev (e.g. Wixarika and Guaraní) also demonstrated high scores in Test1 and Test2. Likewise, languages for which our system performed relatively poorly in Dev (e.g. Rarámuri, whose poor performance may be attributed to the difference in orthographies between the training set and development/test sets) also performed poorly in Test1 and Test2. This matches the trend seen in the baseline scores.
The baseline results and Test2 results were both produced using the same test set and by systems that did not use the development set for training. Thus, the baseline results and Test2 results can be directly compared. On average, our system used to produce the Test2 results achieved BLEU scores that were 1.54 higher and chrF scores that were 0.0725 higher than the baseline. On the same test set, our Test1 system produced higher BLEU and chrF scores for nearly every language. This is expected, as the system used to produce Test1 was trained on slightly more data: in addition to the training set, it used the development set of the indigenous American languages provided by AmericasNLP 2021. If we factor our Test1 results into our Test2 results, we achieved BLEU scores that were 1.64 higher and chrF scores that were 0.0749 higher than the baseline on average. Overall, we attribute this improvement in scores primarily to the cross-lingual language model pretraining (Conneau and Lample, 2019) we performed, which allowed our model to learn about natural language from the monolingual data before finetuning on each of the 10 indigenous languages.

Conclusions and Future Work
We described our system to improve low-resource machine translation for the AmericasNLP 2021 Shared Task. We constructed a system using the mBART implementation of FAIRSEQ to translate from Spanish to 10 different low-resource indigenous languages from the Americas. We demonstrated strong improvements over the baseline by pretraining on a large amount of monolingual data before finetuning our model for each of the low-resource languages.
We are interested in using dictionary augmentation techniques and creating pseudo-monolingual data to use during the pretraining process, as we have seen improved results with these two techniques when translating several low-resource African languages. We can also incorporate these two techniques in an iterative pretraining procedure (Tran et al., 2020) to produce more pseudo-monolingual data and further train our pretrained model for potentially better results.
Future research should also explore using probabilistic finite-state morphological segmenters, which may improve translations by exploiting regular agglutinative patterns without the need for much linguistic knowledge (Mager et al., 2018a) and thus may work well with the low-resource, polysynthetic languages dealt with in this paper.