The REPU CS’ Spanish–Quechua Submission to the AmericasNLP 2021 Shared Task on Open Machine Translation

We present the submission of REPUcs to the AmericasNLP machine translation shared task for the low-resource language pair Spanish–Quechua. Our neural machine translation system ranked first in Track 2 (development set not used for training) and third in Track 1 (training includes development data). Our contribution focuses on (i) the collection of new parallel data from different web sources (poems, lyrics, lexicons, handbooks), and (ii) the use of large Spanish–English data for pre-training, followed by fine-tuning on Spanish–Quechua. This paper describes the new parallel corpora and our approach in detail.


Introduction
REPUcs participated in the AmericasNLP 2021 machine translation shared task (Mager et al., 2021) for the Spanish-Quechua language pair. Quechua is one of the most spoken languages in South America (Simons and Fenning, 2019), with several variants, and for this competition the target language is Southern Quechua. A disadvantage of working with indigenous languages is that there are few documents per language from which to extract parallel or even monolingual corpora. Additionally, most of these languages are traditionally oral, which is the case of Quechua. To compensate for the lack of data, we first collect new parallel corpora to augment the data available for the shared task. In addition, we propose to use transfer learning (Zoph et al., 2016) with large Spanish-English data in a neural machine translation (NMT) model. To boost the performance of our transfer learning approach, we follow the work of Kocmi and Bojar (2018), which demonstrated that sharing the source language and a vocabulary of subword units can improve the performance of low-resource languages.
Footnote 1: "Research Experience for Peruvian Undergraduates - Computer Science" is a program that connects Peruvian students with researchers worldwide. The author was part of the 2021 cohort: https://www.repuprogram.org/repu-cs.

Spanish→Quechua
Quechua is the most widespread language family in South America, with more than 6 million speakers and several variants. For the AmericasNLP Shared Task, the development and test sets were prepared using the Standard Southern Quechua writing system, which is based on the Quechua Ayacucho (quy) variant (for simplicity, we refer to it as Quechua in the rest of the paper). It is an official language in Peru and, according to Zariquiey et al. (2019), it is labelled as endangered. Quechua is essentially a spoken language, so written materials are scarce. Moreover, it is a polysynthetic language, meaning that it usually expresses a large amount of information using several morphemes in a single word. Hence, subword segmentation methods are needed to mitigate the "rare word" problem in an NMT system.
To the best of our knowledge, Ortega et al. (2020b) is one of the few studies that employed a sequence-to-sequence NMT model for Southern Quechua, and they focused on transfer learning with Finnish, an agglutinative language similar to Quechua. Likewise, Huarcaya Taquiri (2020) used the Jehovah's Witnesses dataset (Agić and Vulić, 2019), together with additional lexicon data, to train an NMT model that reached up to 39 BLEU points on Quechua. However, the results in both cases were high because the development and test sets were split from the same distribution (domain) as the training set. On the other hand, Ortega and Pillaipakkamnatt (2018) improved alignments for Quechua by using Finnish (an agglutinative language) as the pivot language. The corpus source is the parallel treebank of Rios et al. (2012), so we deduce that they worked with Quechua Cuzco (quz) (Ortega et al., 2020a). In the AmericasNLP shared task, new out-of-domain evaluation sets were released, and there were two tracks, depending on whether the validation set was used for training the final submission. We addressed both tracks by collecting more data and pre-training the NMT model with large Spanish-English data.

Data and pre-processing
In this competition we use the AmericasNLP Shared Task datasets and new corpora extracted from documents and websites in Quechua.

AmericasNLP datasets
For training, the available parallel data comes from dictionaries and the Jehovah's Witnesses dataset (JW300; Agić and Vulić, 2019). AmericasNLP also released parallel corpora aligned with English (en) and the closely related variant Quechua Cusco (quz) to enhance multilingual learning. For validation, there is a development set of 994 sentences in Spanish and Quechua (quy) (Ebrahimi et al., 2021). Detailed information on the available datasets and their corresponding languages is as follows:
• JW300 (quy, quz, en): texts from the religious domain available in OPUS (Tiedemann, 2012). [...]
• [...] (2020). This dataset is made from 9k sentences, phrases and word translations.
Furthermore, to examine the domain resemblance, it is important to analyse the similarity between the training and development sets. Table 1 shows the percentage of the development set tokens that overlap with the tokens in the training datasets in Spanish (es) and Quechua (quy) after deleting all types of symbols.
We observe from Table 1 that the domains of the training and development sets differ, as the overlap in Quechua does not exceed 50% for any dataset. There are two approaches to address this problem: to add part of the development set to the training data, or to obtain additional data from the same or a more similar domain. In this paper, we focus on the second approach.

Dataset      % Dev overlapping
             es      quy
JW300        85%     45%
MINEDU       15%     5%
Dict_misc    40%     18%
Table 1: Word overlapping ratio between the development and the available training sets in AmericasNLP
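The overlap ratio reported in Table 1 can be illustrated with a short sketch. This is not the script used for the paper: it assumes that the ratio is computed over distinct word types, that "deleting all types of symbols" means keeping only alphabetic runs, and that text is lowercased. The function names and the toy sentences are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and keep only alphabetic runs (symbols and digits are dropped)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def dev_overlap(dev_sentences, train_sentences):
    """Percentage of development-set word types that also occur in the training set."""
    dev_vocab = {tok for sent in dev_sentences for tok in tokenize(sent)}
    train_vocab = {tok for sent in train_sentences for tok in tokenize(sent)}
    if not dev_vocab:
        return 0.0
    return 100.0 * len(dev_vocab & train_vocab) / len(dev_vocab)


# Toy data for illustration only (not taken from the shared-task corpora).
dev = ["Texas llaqtapi arma.", "Manam arma kanchu!"]
train = ["arma manam", "wasi llaqtapi"]
print(f"{dev_overlap(dev, train):.1f}% of dev word types appear in train")
```

Counting types rather than running tokens makes the ratio sensitive to corpus size, which is one reason a very large corpus or a dictionary can reach a high overlap easily.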

New parallel corpora
Sources of Quechua documents. Even though Quechua is an official language in Peru, official government websites are not translated into Quechua or any other indigenous language, so it is not possible to perform web scraping (Bustamante et al., 2020). However, the Peruvian Government has published handbooks and lexicons for Quechua Ayacucho and Quechua Cusco, plus other educational resources to support language learning in indigenous communities. In addition, there are official documents, such as the Political Constitution of Peru and the Regulation of the Amazon Parliament, that are translated into the Quechua Cusco variant. We found three unofficial sources from which to extract parallel corpora for Quechua Ayacucho (quy). The first one is a website, made by Maximiliano Duran (Duran, 2010), that encourages the learning of Quechua Ayacucho. The site contains poems, stories, riddles, songs, phrases and a vocabulary for Quechua. The second one is a website with lyrics of poems and songs that have translations available for both variants of Quechua (Lyrics translate, 2008). The third source is a handbook for the Quechua Ayacucho variant written by Iter and Cárdenas (2019).
The Political Constitution of Peru and the Regulation of the Amazon Parliament were extracted but not used due to time constraints. Another source, which we did not extract, is a dictionary for Quechua Ayacucho from a website called InkaTour. This source was not used because we already had a dictionary.
Methodology for corpus creation. The available vocabulary in Duran (2010) was extracted manually and transformed into parallel corpora using the first pair of parentheses as separators. We will call this dataset "Lexicon".
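As an illustration of splitting on the first pair of parentheses, the sketch below turns one vocabulary line into a source/target pair. The entry layout is an assumption: the paper does not show the raw format, so the example lines (headword, a parenthesised annotation, then the gloss) and the function name `split_entry` are hypothetical.

```python
def split_entry(line):
    """Split a lexicon line at its first '(' ... ')' pair.

    Text before the opening parenthesis is taken as one side of the pair,
    text after the matching closing parenthesis as the other side.
    Returns None when the line does not fit this (assumed) layout.
    """
    open_idx = line.find("(")
    if open_idx == -1:
        return None
    close_idx = line.find(")", open_idx + 1)
    if close_idx == -1:
        return None
    left = line[:open_idx].strip()
    right = line[close_idx + 1:].strip()
    if not left or not right:
        return None
    return left, right


# Hypothetical entries, for illustration only.
entries = ["wasi (s.) casa", "allin (adj.) bueno (de calidad)", "sin separador"]
pairs = [pair for pair in map(split_entry, entries) if pair is not None]
print(pairs)
```

Only the first pair of parentheses is used as a separator, so any later parentheses stay inside the gloss; lines that do not match are skipped and would need manual review.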
All the additional sentences in Duran (2010) and a few poems from Lyrics translate (2008) were manually aligned to obtain the Web Miscellaneous (WebMisc) corpus. Likewise, translations from the Quechua educational handbook (Iter and Cárdenas, 2019) were manually aligned to obtain a parallel corpus (Handbook).
In the case of the official documents for Quechua Cusco, there was a specific format where the Spanish text was followed by the Quechua translation. After manually arranging the line breaks to separate each translation pair, we automatically constructed a parallel corpus for both documents. Paragraphs with more than 2 sentences that had the same number of sentences as their translation were split into individual sentences, and the unmatched paragraphs were deleted.

Corpora description
Table 2: Corpora description: S = number of sentences in corpus; N = number of tokens; V = vocabulary size; V1 = number of tokens occurring once (hapax); V/N = vocabulary growth rate; V1/N = hapax growth rate; mean = word frequency mean
We notice that the vocabulary and hapax growth rates are similar for Quechua (quy) in WebMisc and Handbook, even though the latter has more than twice the number of sentences. In addition, as expected, the word frequency mean and the vocabulary size are lower for Quechua, which reflects its agglutinative nature. However, this does not happen in the Lexicon dataset, which is understandable as it is a dictionary with one or two words per translation.
Moreover, there is a high presence of tokens occurring only once in both languages. This may indicate that our datasets contain spelling errors or foreign words (Nagata et al., 2018). However, in this case it is more likely related to the broad vocabulary, as the datasets are made of sentences from different domains (poems, songs, teaching, among others).
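The quantities listed in the caption of Table 2 can be computed with a few lines of code. The following is a minimal sketch, assuming whitespace tokenisation and taking the word frequency mean as N/V; the paper does not state how tokens were obtained, and the function name and toy corpus are hypothetical.

```python
from collections import Counter


def corpus_stats(sentences):
    """Compute S, N, V, V1, V/N, V1/N and the mean word frequency of a corpus."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    n_tokens = sum(counts.values())                       # N: running tokens
    vocab = len(counts)                                   # V: distinct word types
    hapax = sum(1 for c in counts.values() if c == 1)     # V1: types seen once
    return {
        "S": len(sentences),
        "N": n_tokens,
        "V": vocab,
        "V1": hapax,
        "V/N": vocab / n_tokens if n_tokens else 0.0,
        "V1/N": hapax / n_tokens if n_tokens else 0.0,
        "mean": n_tokens / vocab if vocab else 0.0,       # average occurrences per type
    }


# Toy corpus for illustration only.
stats = corpus_stats(["a b a", "c a"])
print(stats)
```

A high V1/N value means that many types occur only once, which is the pattern discussed above for both languages.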
Furthermore, it is important to examine the similarities between the new datasets and the development set. We computed the percentage of the development set words that overlap with the words of the new datasets in Spanish (es) and Quechua (quy) after eliminating all symbols. Although at first glance the analysis may suggest that there is no significant similarity with the development set, we have to take into account that, in Table 1, JW300 has 121k sentences and Dict_misc is a dictionary, so it is easy for them to cover some of the development set words at least once. In contrast, the WebMisc and Handbook datasets each have fewer than 3k sentences, and even so their overlap in Spanish is relatively high. This result is consistent with the contents of the datasets, as they contain common phrases and open-domain sentences, which are the type of sentences found in the development set.

English-Spanish dataset
For pre-training, we used the Europarl dataset for Spanish-English (1.9M sentences) (Koehn, 2005) and its development corpora for evaluation.

Approach used

Evaluation
From the Europarl dataset, we extracted 3,000 sentences for validation. For testing we used the development set from the WMT2006 campaign (Koehn and Monz, 2006).
In the case of Quechua, as the official development set contains only 994 sentences, we did not split off a separate test set. Hence, validation results are reported as test results.
The main metric in this competition is chrF (Popović, 2017), which evaluates character n-grams and is a useful metric for agglutinative languages such as Quechua. We also report BLEU scores (Papineni et al., 2002). For both metrics we used the implementations in sacreBLEU (Post, 2018).
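To make the idea of a character n-gram metric concrete, the sketch below computes a simplified chrF-style score. It is not the sacreBLEU implementation used for the reported results: it assumes n-gram orders 1 to 6, beta = 2, whitespace removed before counting, and precision and recall averaged over orders before combining them, so its values will not match sacreBLEU exactly. The function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring whitespace."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: F-beta over averaged character n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short for this n-gram order
        matches = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(matches / sum(hyp.values()))
        recalls.append(matches / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


reference = "Texaspiqa sutillapas arma controlayqa manachusmi hinachu apakun"
output = "Texas llaqtapi armakuna controlayqa manam runakunapa runachu"
print(f"simplified chrF = {simple_chrf(output, reference):.3f}")
```

Because matching happens on characters, shared stems and suffixes such as "arma" or "mana" earn partial credit even when the full words differ, which word-level BLEU does not give.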

Subword segmentation
Subword segmentation is a crucial process for the translation of polysynthetic languages such as Quechua. We used the Byte-Pair-Encoding (BPE; Sennrich et al., 2016) implementation in SentencePiece (Kudo and Richardson, 2018) with a vocabulary size of 32,000. To generate a richer vocabulary, we trained a single segmentation model on all three languages (Spanish, English and Quechua), upsampling the Quechua data to reach a uniform distribution across languages.
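One simple way to reach a uniform distribution across languages is to repeat the lines of the smaller corpora until every language has as many lines as the largest one. The sketch below shows this under that assumption; the paper does not describe the exact upsampling procedure, and the function name and toy data are hypothetical.

```python
import itertools


def upsample_to_largest(corpora):
    """Repeat the lines of each corpus cyclically until all corpora match the largest one."""
    target = max(len(lines) for lines in corpora.values())
    balanced = {}
    for lang, lines in corpora.items():
        if not lines:
            balanced[lang] = []  # nothing to repeat
            continue
        balanced[lang] = list(itertools.islice(itertools.cycle(lines), target))
    return balanced


# Toy data for illustration only.
corpora = {
    "es": ["es 1", "es 2", "es 3", "es 4"],
    "en": ["en 1", "en 2", "en 3", "en 4"],
    "quy": ["quy 1", "quy 2"],
}
balanced = upsample_to_largest(corpora)
print({lang: len(lines) for lang, lines in balanced.items()})
```

The balanced lines would then be concatenated into the text used to train the joint segmentation model, so that Quechua subwords are not crowded out of the 32,000-entry vocabulary by the much larger Spanish and English data.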

Procedure
For all experiments, we used a Transformer-based model (Vaswani et al., 2017) with default parameters from the Fairseq toolkit (Ott et al., 2019). The criterion for early stopping was the cross-entropy loss for 15 steps.
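The early-stopping rule can be sketched as follows. This is one possible reading of the sentence above, not the toolkit's code: it assumes that "15 steps" acts as a patience value, that is, training stops once the validation cross-entropy loss has not improved for 15 consecutive checks. The function name is hypothetical.

```python
def should_stop(val_losses, patience=15):
    """Return True when the best validation loss is at least `patience` checks old."""
    if not val_losses:
        return False
    # Index of the first occurrence of the lowest loss seen so far.
    best_idx = min(range(len(val_losses)), key=val_losses.__getitem__)
    checks_since_best = len(val_losses) - 1 - best_idx
    return checks_since_best >= patience


# Toy loss curve: improves for three checks, then stays above the best value.
history = [3.0, 2.5, 2.4] + [2.6] * 15
print(should_stop(history))
```

With this reading, a longer patience gives the fine-tuned model more room to recover from noisy validation losses, which can matter on a development set of fewer than 1,000 sentences.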
We first pre-trained a Spanish-English model on the Europarl dataset in order to obtain a good encoding capability on the Spanish side. Using this pre-trained model, we implemented two versions of fine-tuning: the first with the JW300 dataset only, which was the largest Spanish-Quechua corpus, and the second with all the available Quechua datasets (including the ones that we collected).

Results and discussion
The results from the transfer learning models and the baseline are shown in Table 4. We observe that the best result on BLEU and chrF was obtained using the provided datasets together with the extracted datasets. This shows that the new corpora were helpful to improve translation performance.
From Table 4, we observe that transfer learning yields an improvement over the baseline (+0.56 in BLEU and +0.007 in chrF). Moreover, transfer learning with all the available datasets obtained the best BLEU and chrF scores. In particular, it had a 0.012 increase in chrF, which is important because chrF is the metric that best evaluates translation in this case. Overall, the BLEU scores are low. However, a manual analysis of the sentences shows that the model is learning to translate a considerable number of affixes.

Input (ES)
El control de armas probablemente no es popular en Texas.

Input (EN)
Weapon control is probably not popular in Texas.

Reference (QUY)
Texaspiqa sutillapas arma controlayqa manachusmi hinachu apakun

Output
Texas llaqtapi armakuna controlayqa manam runakunapa runachu

For instance, the subwords "arma" and "mana", among others, have been correctly translated but are not grouped into the same words as in the reference. In addition, only the word "controlayqa" is translated correctly as a whole word, which would explain the low BLEU results. Decoding an agglutinative language is a very difficult task, and the low BLEU scores do not indicate translations with proper adequacy and/or fluency (as the example also shows). Nevertheless, BLEU works at the word level, so character-level metrics should also be considered when evaluating agglutinative languages. This is the case of chrF (Popović, 2017), where there is an increase of around 3% when using the AmericasNLP datasets together with the newly extracted corpora.
Translations from the transfer learning model trained with all available Quechua datasets were submitted for Track 2 (development set not used for training). For the Track 1 submission (development set used for training), we retrained the best transfer learning model for 40 epochs after adding the validation set to the training data. The official results of the competition are shown in Table 6.

Conclusion
In this paper, we focused on extracting new datasets for Spanish-Quechua, which helped to improve the performance of our model. Moreover, we found that transfer learning was beneficial even without the additional data. By including the new corpora in the fine-tuning step, we obtained first place in Track 2 and third place in Track 1 of the AmericasNLP Shared Task. Due to time constraints, the Quechua Cusco data was not used, but it could be beneficial in future work.
In general, we found that translating Quechua is a challenging task for two reasons. Firstly, there is a lack of data for all the variants of Quechua, and the available documents are hard to extract. In this research, all the new datasets were extracted and aligned mostly manually. Secondly, the agglutinative nature of Quechua calls for more research on effective subword segmentation methods.