Peru is Multilingual, Its Machine Translation Should Be Too?

Peru is a multilingual country with a long history of contact between the indigenous languages and Spanish. Taking advantage of this context for machine translation is possible with multilingual approaches for learning both unsupervised subword segmentation and neural machine translation models. The study proposes the first multilingual translation models for four languages spoken in Peru: Aymara, Ashaninka, Quechua and Shipibo-Konibo, providing both many-to-Spanish and Spanish-to-many models and outperforming pairwise baselines in most of them. The task exploited a large English-Spanish dataset for pre-training, monolingual texts with tagged back-translation, and parallel corpora aligned with English. Finally, by fine-tuning the best models, we also assessed the out-of-domain capabilities in two evaluation datasets for Quechua and a new one for Shipibo-Konibo.


Introduction
Neural Machine Translation (NMT) has opened several research directions to exploit as much and as diverse data as possible. Massive multilingual NMT models, for instance, take advantage of several language-pair datasets in a single system (Johnson et al., 2017). This offers several advantages, such as a simple training process and enhanced performance for the language-pairs with little data (although sometimes to the detriment of the high-resource language-pairs). However, massive models of dozens of languages are not necessarily the best outcome, as it has been demonstrated that smaller clusters still offer the same benefits (Tan et al., 2019; Oncevay et al., 2020).
Peru offers a rich diversity context for machine translation research with 47 native languages (Simons and Fennig, 2019). All of them are highly distinct from Castilian Spanish, the primary official language in the country and the one spoken by the majority of the population. However, from a computational perspective, none of these languages has enough resources, such as monolingual or parallel texts, and most of them are considered endangered (Zariquiey et al., 2019).
1 Available in: https://github.com/aoncevay/mt-peru
In this context, the main question then arises: shouldn't machine translation be multilingual for languages spoken in a multilingual country like Peru? By taking advantage of few resources, and other strategies such as multilingual unsupervised subword segmentation models (Kudo, 2018), pretraining with high resource language-pairs (Kocmi and Bojar, 2018), back-translation (Sennrich et al., 2016a), and fine-tuning (Neubig and Hu, 2018), we deployed the first many-to-one and one-to-many multilingual NMT models (paired with Spanish) for four indigenous languages: Aymara, Ashaninka, Quechua and Shipibo-Konibo.

Languages and datasets
To enhance replicability, we only used the datasets provided in the AmericasNLP Shared Task.
• Southern Quechua: with more than 6 million speakers and several variants, it is the most widespread indigenous language in Peru. AmericasNLP provides evaluation sets in standard Southern Quechua, which is based mostly on the Quechua Ayacucho (quy) variant. There is parallel data from dictionaries and Jehovah's Witnesses (Agić and Vulić, 2019), as well as a parallel corpus aligned with English. We also include the close variant of Quechua Cusco (quz) to support the multilingual learning.
• Aymara (aym): with 1.7 million speakers (mostly in Bolivia). The parallel and monolingual data is extracted from a news website (Global Voices) and distributed by OPUS (Tiedemann, 2012). There is data aligned with English too.
The four languages are highly agglutinative or polysynthetic, meaning that they usually express a large amount of information in just one word with several joint morphemes. This is a real challenge for MT and subword segmentation methods, given the high probability of encountering a "rare word" for the system. We also note that each language belongs to a different language family, but that is not a problem for multilingual models, as the family-based clusters are usually not the most effective ones.
Pre-processing. The datasets were noisy and not cleaned. We removed lines according to several heuristics: Arabic numbers or punctuation do not match in the parallel sentences, there are more symbols or numbers than words in a sentence, or the number of words on one side is five times larger or smaller than on the other, among others.
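The cleaning heuristics listed above can be illustrated with a minimal sketch. It is not the authors' script: the function names are hypothetical, the punctuation check is omitted, and the exact thresholds are assumptions read off the description (matching numbers, no more symbols or numbers than words, and a length ratio of at most five).

```python
import re

MAX_LENGTH_RATIO = 5.0  # assumed threshold, taken from the "five times" rule


def digits_of(text):
    """Return the sorted list of digit sequences found in the text."""
    return sorted(re.findall(r"\d+", text))


def count_words_and_symbols(text):
    """Count tokens containing a letter versus tokens made only of digits or symbols."""
    tokens = text.split()
    words = sum(1 for tok in tokens if any(ch.isalpha() for ch in tok))
    return words, len(tokens) - words


def keep_pair(src, tgt):
    """Return True if the sentence pair passes the three heuristics."""
    # 1. Numbers must match on both sides.
    if digits_of(src) != digits_of(tgt):
        return False
    # 2. Neither side may have more symbols or numbers than words.
    for side in (src, tgt):
        words, others = count_words_and_symbols(side)
        if words == 0 or others > words:
            return False
    # 3. Neither side may be more than five times longer than the other.
    src_len, tgt_len = len(src.split()), len(tgt.split())
    return max(src_len, tgt_len) / min(src_len, tgt_len) <= MAX_LENGTH_RATIO
```

A corpus would then be filtered with something like `[(s, t) for s, t in pairs if keep_pair(s, t)]`.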

Evaluation
The training data have been extracted from different domains and sources, which are not necessarily the same as the evaluation sets provided for the Shared Task. Therefore, the official development set (995 sentences per language) is split into three parts: 25%-25%-50%. The first two parts are our custom dev and devtest sets. We add the 50% section to the training set with a sampling distribution of 20%, to reduce the domain gap in the training data. Likewise, we extract a sample of the training data and double the size of the development set. The mixed data in the validation set is relevant, as it allows us to evaluate how the model fits all the domains. We used the same multi-text sentences for evaluation and avoided any overlap of the Spanish side with the training set, which is also important because we evaluate multilingual models. Evaluation for all the models used BLEU.
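As an illustration of the 25%-25%-50% split of the official development set, the following sketch partitions a list of sentences into dev, devtest and an extra training portion. The function name, the shuffling and the fixed seed are assumptions; the text does not say how the sentences were assigned to each part.

```python
import random


def split_dev(sentences, seed=0):
    """Shuffle and split into 25% dev, 25% devtest and 50% extra training data."""
    items = list(sentences)
    random.Random(seed).shuffle(items)  # assumed: random assignment
    quarter = len(items) // 4
    dev = items[:quarter]
    devtest = items[quarter:2 * quarter]
    train_extra = items[2 * quarter:]  # the remainder goes to training
    return dev, devtest, train_extra
```

With 995 sentences per language, this yields 248 dev, 248 devtest and 499 extra training sentences.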

Multilingual subword segmentation
Ortega et al. (2020b) used morphological information, such as affixes, to guide the Byte-Pair-Encoding (BPE) segmentation algorithm (Sennrich et al., 2016b) for Quechua. However, their improvement is not significant, and according to Bostrom and Durrett (2020), BPE tends to oversplit roots of infrequent words. They showed that a unigram language model (Kudo, 2018) seems like a better alternative to split affixes and preserve roots (in English and Japanese).
To take advantage of the potential lexical sharing of the languages (e.g. loanwords) and address the polysynthetic nature of the indigenous languages, we trained a single multilingual segmentation model by sampling all languages with a uniform distribution. We used the unigram model implementation in SentencePiece (Kudo and Richardson, 2018) with a vocabulary size of 32,000.
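One way to read "sampling all languages with a uniform distribution" is to draw the same number of sentences from each language before training the segmentation model. The sketch below follows that reading; the function name, the `per_language` parameter and the use of sampling with replacement for small corpora are assumptions, not details given in the text.

```python
import random


def sample_uniform(corpora, per_language, seed=0):
    """Draw the same number of sentences from every language.

    corpora: dict mapping a language code to a list of sentences.
    Corpora smaller than per_language are sampled with replacement,
    which upsamples them.
    """
    rng = random.Random(seed)
    sampled = []
    for lang in sorted(corpora):
        sentences = corpora[lang]
        if len(sentences) >= per_language:
            sampled.extend(rng.sample(sentences, per_language))
        else:
            sampled.extend(rng.choices(sentences, k=per_language))
    rng.shuffle(sampled)
    return sampled
```

The balanced list could then be written to a text file and used as input for a SentencePiece unigram model with a vocabulary size of 32,000.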

Procedure
For the experiments, we used a Transformer-base model (Vaswani et al., 2017) with the default configuration in Marian NMT (Junczys-Dowmunt et al., 2018). The steps are as follows:
Pre-training. We pre-trained two MT models with the Spanish-English language-pair in both directions. We did not include an agglutinative language like Finnish (Ortega et al., 2020b) for two reasons: highly related languages are not required for effective transfer learning (e.g. English-German to English-Tamil (Bawden et al., 2020)), and we wanted to translate the English side of en-aym, en-quy and en-quz to augment their corresponding Spanish-paired datasets. The en→es and es→en models achieved 34.4 and 32.3 BLEU points, respectively, on the newsdev2013 set.
Multilingual fine-tuning. Using the pre-trained en→es model, we fine-tuned the first multilingual many-to-Spanish model. Following established practices, we used uniform sampling for all the datasets (quz-es included) to avoid under-fitting the low-resource language-pairs. Results are in Table 2, row (a). We replicated this for the es→many direction (row (e)), using the es→en model.
Back-translation. With model (a), we back-translated (BT) the monolingual data of the indigenous languages and trained models (b) and (f) on the original plus BT data. However, the models with BT data underperformed or did not converge. Potential reasons are the noisy translation outputs of model (a) and the larger amount of BT than human-translated sentences for all languages, even though we sampled BT and human translations uniformly.

Tagged back-translation (BT[t])
To alleviate the issue, we added a special tag to the BT data (Caswell et al., 2019). With BT[t], we send a signal to the model that it is processing synthetic data, so that the synthetic data may not hurt the learning over the real data. Table 2 (rows (c,g)) shows the results.
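The tagging step can be sketched as follows. This is an illustration under stated assumptions: the tag string `<BT>` and the function name are hypothetical, and the tag is prepended to the source side, as in the usual tagged back-translation setup where the source sentence is the synthetic one. The text does not specify either detail.

```python
BT_TAG = "<BT>"  # hypothetical tag string; the text does not specify it


def build_training_pairs(human_pairs, bt_pairs, tag=BT_TAG):
    """Merge human and back-translated pairs, tagging only the synthetic ones.

    Each pair is (source, target). The tag is prepended to the source side
    of back-translated pairs so the model can tell them apart.
    """
    tagged = [(f"{tag} {src}", tgt) for src, tgt in bt_pairs]
    return list(human_pairs) + tagged
```

In practice the tag would also need to be kept as a single token by the subword segmentation model, for example by declaring it as a user-defined symbol.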
Pairwise baselines. We obtained pairwise systems by fine-tuning the same pre-trained models (without any back-translated data). For a straightforward comparison, they used the same multilingual SentencePiece model.

Analysis and discussion
One of the most unexpected outcomes is the deteriorated performance of the multilingual models using BT data, as we usually expect that added back-translated texts would benefit performance. Using tags (BT[t]) to differentiate which data is synthetic is only a simple step to address this issue; more informed strategies for denoising or performing online data selection (Wang et al., 2018) could also be evaluated.
Besides, in the translation into Spanish, the multilingual model without BT data outperforms the rest of the models in all languages but Quechua, where the pairwise system achieved the best translation accuracy. Quechua is the "highest"-resource language-pair in the experiment, and its performance deteriorates in the multilingual setting. A similar scenario is shown in the other translation direction, from Spanish, where the best multilingual setting (+BT[t]) cannot overcome the es→quy model in the devtest set.
Nevertheless, the gains for Aymara, Ashaninka and Shipibo-Konibo are outstanding. Moreover, we note that the models are not totally overfitted to any of the evaluation sets. Exceptions are es→aym and es→quy, with a significant performance drop from dev to devtest, meaning that they started to overfit to the training data. However, for Spanish→Ashaninka, we observe that the model achieved a better performance in the devtest set. This is due to oversampling of the same-domain dev partition for training (§4.1) and the small original training set.
Concerning the results on the official test set, the performance is lower than the results with the custom evaluation sets. The main potential reason is that the official test is four times bigger than the custom devtest, and therefore offers more diversity and challenge for the evaluation. Another point to highlight is that the best result in the Spanish-Quechua language-pair is obtained by a multilingual model (the scores of models (e) and (g) are not significantly different) instead of the pairwise baseline.
Decoding an indigenous language is still a challenging task, and the relatively low BLEU scores do not suggest translations with proper adequacy or fluency. However, BLEU works at the word level, and character-level metrics should also be considered to better assess the highly agglutinative nature of the languages. For reference, we also report the chrF scores in Table 3 for the best multilingual setting and the pairwise baseline. As for the Spanish decoding, fluency is preserved from the English→Spanish pre-trained model, but more adequacy is needed.
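To illustrate why a character-level metric is more forgiving with agglutinative words, the sketch below computes a simplified character n-gram F-score in the spirit of chrF. It is not the official implementation used for the reported scores: the function names, the whitespace handling, the maximum n-gram order of 6 and the value of beta are assumptions.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring whitespace."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf_sketch(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified character n-gram F-beta score for one sentence pair."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

A hypothesis that misses only one suffix of a long word, such as `chrf_sketch("wasikuna", "wasikunapi")`, gets partial credit here, whereas a word-level exact match would count the whole word as wrong.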

Out-of-domain evaluation
It is relevant to assess out-of-domain capabilities, but more important to evaluate whether the models can still be fine-tuned without overfitting. We use a small evaluation set for Quechua (Kallpa, with 100 sentences), which contains sentences extracted from a magazine (Ortega et al., 2020b). Likewise, we introduce a new evaluation set for Shipibo-Konibo (Kirika, 200 sentences), which contains short traditional stories.
We took our best model for each language-pair, fine-tuned it (+FT) with half of the out-of-domain dataset, and evaluated it on the other half. To avoid overfitting, we monitored the cross-entropy loss and validated after very few updates. Results are shown in Table 3, where we observe that it is possible to fine-tune the multilingual or pairwise models to the new domains without losing too much performance on the original test.
The Quechua translations rapidly improved with the fine-tuning step, and there is a small gain in the original test for es→quy, although the scores are relatively low in general. Nevertheless, our model could outperform others (by extrapolation, we can assume that the scores for the rule-based Apertium system (Cavero and Madariaga, 2007) and Ortega et al. (2020b)'s NMT system are similar in half of the dataset).
For Shipibo-Konibo, we also observe some small gains in both directions without hurting the previous performance, but the scores are far from being robust. Kirika is challenging given its old style: the translations are extracted from an old book written by missionaries, and even though the spelling has been modernised, there are differences in the use of some auxiliary verbs, for instance (extra words that affect the evaluation metric).

Conclusion and future work
Peru is multilingual, ergo, its machine translation should be too! We conclude that multilingual machine translation models can enhance the performance in truly low-resource languages like Aymara, Ashaninka and Shipibo-Konibo, in translation from and into Spanish. For Quechua, even though the pairwise system performed better in this study, there is a simple step to give the multilingual setting another opportunity: to include a higher-resource language-pair that may support the multilingual learning process. This pair could be related in some aspect, such as morphology (another agglutinative language) or discourse (domain). Other approaches focused on more advanced sampling or adding specific layers to restore the performance of the higher-resource languages might be considered as well. Besides, tagged back-translation allowed us to take some advantage of the monolingual data; however, one of the most critical next steps is to obtain a more robust many-to-Spanish model to generate back-translated data of higher quality. Furthermore, to address the multi-domain nature of these datasets,

Table 5: Statistics and cleaning for all parallel corpora. We observe that the Shipibo-Konibo and Ashaninka corpora are the least noisy ones. S = number of sentences, T = number of tokens. There are sentence alignment issues in the Quechua datasets, which require a more specialised tool to address.