The Helsinki submission to the AmericasNLP shared task

The University of Helsinki participated in the AmericasNLP shared task for all ten language pairs. Our multilingual NMT models reached first rank on all language pairs in track 1, and on nine out of ten language pairs in track 2. We focused our efforts on three aspects: (1) the collection of additional data from various sources such as Bibles and political constitutions, (2) the cleaning and filtering of training data with the OpusFilter toolkit, and (3) different multilingual training techniques enabled by the latest version of the OpenNMT-py toolkit to make the most efficient use of the scarce data. This paper describes our efforts in detail.


Introduction
The University of Helsinki participated in the AmericasNLP 2021 Shared Task on Open Machine Translation for all ten language pairs. The shared task is aimed at developing machine translation (MT) systems for indigenous languages of the Americas, all of them paired with Spanish (Mager et al., 2021). These language pairs pose considerable challenges: none of them benefits from large quantities of parallel data, and monolingual data is also limited. For our participation, we focused our efforts mainly on three aspects: (1) gathering additional parallel and monolingual data for each language, taking advantage in particular of the OPUS corpus collection (Tiedemann, 2012), the JHU Bible corpus (McCarthy et al., 2020) and translations of political constitutions of various Latin American countries, (2) cleaning and filtering the corpora to maximize their quality with the OpusFilter toolbox (Aulamo et al., 2020), and (3) contrasting different training techniques that could take advantage of the scarce data available.
We pre-trained NMT systems to produce back-translations for the monolingual portions of the data. We also trained multilingual systems that use language labels on the source sentence to specify the target language (Johnson et al., 2017). This has been shown to leverage the data available across different language pairs and to boost performance in low-resource scenarios.
We submitted five runs for each language pair, three in track 1 (development set included in training) and two in track 2 (development set not included in training). The best-performing model is a multilingual Transformer pre-trained on Spanish-English data and fine-tuned to the ten indigenous languages. The (partial or complete) inclusion of the development set during training consistently led to substantial improvements.
The collected data sets and data processing code are available from our fork of the organizers' Git repository.


Data preparation

A main part of our effort was directed at finding relevant corpora that could help with the translation tasks, as well as at making the best of the data provided by the organizers. In order to maintain and process the data sets for all ten languages efficiently, we utilized the OpusFilter toolbox (Aulamo et al., 2020). It provides both ready-made and extensible methods for combining, cleaning, and filtering parallel and monolingual corpora. OpusFilter uses a configuration file that lists all the steps for processing the data; in order to make quick changes and extensions programmatically, we generated the configuration file with a Python script. Figure 1 shows a part of the applied OpusFilter workflow for a single language pair, Spanish-Raramuri, restricted to the primary training data. The provided training set and the (concatenated) additional parallel data are first independently normalized and cleaned (preprocess), then concatenated, preprocessed with common normalizations, deduplicated, and finally filtered for noisy segments.
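As an illustration, the following Python sketch generates a minimal OpusFilter configuration of this general shape. The file names, the language code tar, and the concrete filter settings are hypothetical placeholders rather than our exact pipeline; see Appendix A for the filters actually used.

import yaml

src, tgt = "es", "tar"  # hypothetical language codes

config = {
    "common": {"output_directory": f"data/{src}-{tgt}"},
    "steps": [
        # Concatenate the provided training set with the extra parallel data
        {"type": "concatenate",
         "parameters": {"inputs": [f"train.{src}", f"extra.{src}"],
                        "output": f"combined.{src}"}},
        {"type": "concatenate",
         "parameters": {"inputs": [f"train.{tgt}", f"extra.{tgt}"],
                        "output": f"combined.{tgt}"}},
        # Apply common normalizations to both sides
        {"type": "preprocess",
         "parameters": {"inputs": [f"combined.{src}", f"combined.{tgt}"],
                        "outputs": [f"norm.{src}", f"norm.{tgt}"],
                        "preprocessors": [{"WhitespaceNormalizer": {}}]}},
        # Remove exact duplicate sentence pairs
        {"type": "remove_duplicates",
         "parameters": {"inputs": [f"norm.{src}", f"norm.{tgt}"],
                        "outputs": [f"dedup.{src}", f"dedup.{tgt}"]}},
        # Filter out noisy segments (placeholder filter settings)
        {"type": "filter",
         "parameters": {"inputs": [f"dedup.{src}", f"dedup.{tgt}"],
                        "outputs": [f"filtered.{src}", f"filtered.{tgt}"],
                        "filters": [{"LengthFilter": {"unit": "char",
                                                      "max_length": 1000}}]}},
    ],
}

with open("opusfilter_config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)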

Data collection
We collected parallel and monolingual data from several sources. An overview of the resources, including references and URLs, is given in Tables 3 and 4 in the appendix.
Organizer-provided resources The shared task organizers provided parallel datasets for training for all ten languages. These datasets are referred to as train in this paper. For some of the languages (Ashaninka, Wixarika and Shipibo-Konibo), the organizers pointed participants to repositories containing additional parallel or monolingual data. We refer to these resources as extra and mono respectively. Furthermore, the organizers provided development and test sets for all ten language pairs of the shared task (Ebrahimi et al., 2021).
OPUS The OPUS corpus collection (Tiedemann, 2012) provides only a few datasets for the relevant languages. Besides the resources for Aymara and Quechua provided by the organizers as official training data, we found an additional parallel dataset for Spanish-Quechua, and monolingual data for Aymara, Guarani, Hñähñu, Nahuatl and Quechua. These resources are also listed under extra and mono.
Constitutions We found translations of the Mexican constitution into Hñähñu, Nahuatl, Raramuri and Wixarika, of the Bolivian constitution into Aymara and Quechua, and of the Peruvian constitution into Quechua. We extracted the data from the HTML or PDF sources and aligned them with the Spanish version on paragraph and sentence levels. The latter was done using a standard length-based approach with lexical re-alignment, as in hunalign (Varga et al., 2005), using paragraph breaks as hard boundaries. These texts are part of the extra resources.
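To illustrate the idea, the following is a toy length-based aligner for a single paragraph pair: a much simplified stand-in for hunalign, without the lexical re-alignment step, and with arbitrary illustrative bead penalties.

# Toy length-based sentence aligner (simplified stand-in for hunalign;
# no lexical re-alignment). Aligns sentences within one paragraph pair.

def align(src_sents, tgt_sents):
    """Return (src_chunk, tgt_chunk) beads using 0-1/1-0/1-1/1-2/2-1
    alignments and a character-length-difference cost."""
    INF = float("inf")
    n, m = len(src_sents), len(tgt_sents)
    slen = [len(s) for s in src_sents]
    tlen = [len(t) for t in tgt_sents]
    # cost[i][j]: best cost of aligning the first i source sentences
    # with the first j target sentences
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    SKIP = 100.0  # illustrative penalty for unaligned sentences
    beads = [(1, 1, 0.0), (1, 0, SKIP), (0, 1, SKIP), (2, 1, 10.0), (1, 2, 10.0)]
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for di, dj, pen in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = abs(sum(slen[i:ni]) - sum(tlen[j:nj])) + pen
                if cost[i][j] + c < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + c
                    back[ni][nj] = (i, j)
    # Trace back the best path
    path, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        path.append((" ".join(src_sents[pi:i]), " ".join(tgt_sents[pj:j])))
        i, j = pi, pj
    return list(reversed(path))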
Bibles The JHU Bible corpus (McCarthy et al., 2020) covers all languages of the shared task with at least one Bible translation. We found that some translations were near-duplicates that only differed in tokenization, and removed them. For those languages for which several dialectal varieties were available, we attempted to select subsets based on the target varieties of the shared task, as specified by the organizers (see Tables 3 and 4 for details). All Spanish Bible translations in the JHU Bible corpus are limited to the New Testament. In order to maximize the amount of parallel data, we substituted them with full-coverage Spanish Bible translations from Mayer and Cysouw (2014). Since we have multiple versions of the Bible in Spanish as well as in some of the target languages, we applied the product method in OpusFilter to randomly take at most 5 different versions of the same sentence (skipping empty and duplicate lines).
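Conceptually, this sampling over multiple Bible versions can be sketched as follows; this is an illustrative reimplementation of the idea, not OpusFilter's actual product code, and it assumes verse-aligned input files.

# Simplified reimplementation of the cross-product sampling over multiple
# Bible versions (not OpusFilter's actual code). Inputs are verse-aligned:
# es_versions[k][i] is verse i in Spanish version k, and analogously
# for tgt_versions.
import itertools
import random

def product_pairs(es_versions, tgt_versions, max_pairs=5, seed=1):
    rng = random.Random(seed)
    pairs = []
    n_verses = len(es_versions[0])
    for i in range(n_verses):
        # All combinations of Spanish and target versions for this verse,
        # skipping empty lines; the set removes duplicate combinations
        candidates = {(es[i], tgt[i])
                      for es, tgt in itertools.product(es_versions, tgt_versions)
                      if es[i] and tgt[i]}
        candidates = sorted(candidates)
        rng.shuffle(candidates)
        pairs.extend(candidates[:max_pairs])  # at most 5 versions per verse
    return pairs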

Data normalization and cleaning
We noticed that some of the corpora in the same language used different orthographic conventions and had other issues that would hinder NMT model training. We applied various data normalization and cleaning steps to improve the quality of the data, with the goal of making the training data more similar to the development data (which we expected to be similar to the test data). For Bribri, Raramuri and Wixarika, we found normalization scripts or guidelines on the organizers' GitHub page or sources referenced therein (cf. the norm entries in Tables 3 and 4). We reimplemented them as custom OpusFilter preprocessors.

Table 1 (excerpt): Number of segments in the provided training data (train), the additional parallel data (extra), their concatenation before and after deduplication and filtering, and the Bible, monolingual, back-translation and development data.

language   code   train  extra  combined  dedup  filtered  bibles  monoling  backtr  dev
Ashaninka  cni     3883      0      3883   3860      3858   38846     13195   17278  883
Aymara     aym     6531   8970     15501   8889      8352  154520     16750   17886  996
Bribri     bzd     7508      0      7508   7303      7303   38502         0       0  996
Guarani    gn     26032      0     26032  14495     14483   39457     40516   62703  995
Hñähñu     oto     4889   2235      7124   7056      7049   39726       537     366  599
Nahuatl    nah    16145   2250     18395  17667     17431   39772      9222    8450
The Bribri, Hñähñu, Nahuatl, and Raramuri training sets were originally tokenized. Following our decision to use untokenized input for unsupervised word segmentation, we detokenized the respective corpora with the Moses detokenizer supported by OpusFilter, using the English patterns.
Finally, for all datasets, we applied OpusFilter's WhitespaceNormalizer preprocessor, which replaces all sequences of whitespace characters with a single space.
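Outside OpusFilter, the same two steps can be approximated with the sacremoses package and a regular expression, as in this sketch:

# Detokenization (English patterns) and whitespace normalization,
# roughly equivalent to the OpusFilter preprocessors we used.
import re
from sacremoses import MosesDetokenizer

detok = MosesDetokenizer(lang="en")

def preprocess(line: str) -> str:
    line = detok.detokenize(line.split())     # undo Moses-style tokenization
    return re.sub(r"\s+", " ", line).strip()  # collapse whitespace runs

print(preprocess('He said , " hello world " .'))
# -> roughly: He said, "hello world".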

Data filtering
The organizer-provided and extra training data sets were concatenated before the filtering phase. Then all exact duplicates were removed from the data using OpusFilter's duplicate removal step. After duplicate removal, we applied some predefined filters from OpusFilter. Not all filters were applied to all languages; instead, we selected the appropriate filters based on manual observation of the data and the proportion of sentences removed by the filter. Appendix A describes the filters in detail.
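For illustration, the exact duplicate removal amounts to the following sketch (a simplified version of what OpusFilter's remove_duplicates step does):

# Exact duplicate removal for a parallel corpus: a sentence pair is kept
# only the first time the (source, target) combination is seen.
def remove_duplicate_pairs(src_lines, tgt_lines):
    seen = set()
    src_out, tgt_out = [], []
    for s, t in zip(src_lines, tgt_lines):
        if (s, t) not in seen:
            seen.add((s, t))
            src_out.append(s)
            tgt_out.append(t)
    return src_out, tgt_out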

Back-translations
We translated all monolingual data to Spanish, using early versions of both Model A and Model B (see Section 3), in order to create additional synthetic parallel training data. A considerable amount of the back-translations produced by Model A ended up in a language other than Spanish, whereas some translations by Model B remained empty. We kept both outputs, but filtered them aggressively (see Appendix A), concatenated them, and removed exact duplicates.

Data sizes
For most language pairs, the Bibles made up the largest portion of the data. Thus we decided to keep the Bibles separate from the other smaller, but likely more useful, training sources. Table 1 shows the sizes of the training datasets before and after filtering as well as the additional datasets. It can be seen that there is a difference of almost two orders of magnitude between the smallest (cni) and largest (quy) combined training data sets. The addition of the Bibles and back-translations evens out the differences to some extent.

Spanish-English data
Model B (see below) takes advantage of abundant parallel data for Spanish-English. These resources come exclusively from OPUS (Tiedemann, 2012) and include the following sources: OpenSubtitles, Europarl, JW300, GlobalVoices, News-Commentary, TED2020, Tatoeba, bible-uedin. All corpora are again filtered and deduplicated, yielding 17.5M sentence pairs from OpenSubtitles and 4.4M sentence pairs from the other sources taken together. During training, both parts are assigned the same weight to avoid overfitting on subtitle data. The Spanish-English WMT-News corpus, also from OPUS, is used for validation.

Models
We experimented with two major model setups, which we refer to as A and B below. Both are multilingual NMT models based on the Transformer architecture (Vaswani et al., 2017) and are implemented with OpenNMT-py 2.0 (Klein et al., 2017). All models were trained on a single GPU. The training data is segmented using SentencePiece (Kudo and Richardson, 2018) subword models with 32k units, trained jointly on all languages. Following our earlier experience (Scherrer et al., 2020), subword regularization (Kudo, 2018) is applied during training. Further details of the configurations are listed in Appendix B.
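A minimal sketch of the corresponding SentencePiece usage is shown below; the file names are hypothetical, and in our actual setup the sampling is performed by OpenNMT-py's subword transforms rather than being called directly like this.

import sentencepiece as spm

# Train one joint 32k-unit model on the concatenation of all languages
# (hypothetical file names)
spm.SentencePieceTrainer.train(
    input="all_languages.txt",
    model_prefix="joint32k",
    vocab_size=32000,
)

sp = spm.SentencePieceProcessor(model_file="joint32k.model")
# Subword regularization: sample from the 20-best segmentations (alpha=0.1)
pieces = sp.encode("las lenguas indígenas de América", out_type=str,
                   enable_sampling=True, nbest_size=20, alpha=0.1)
print(pieces)  # the segmentation varies from call to call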

Model A
Model A is a multilingual translation model with 11 source languages (10 indigenous languages + Spanish) and the same 11 target languages. It is trained on all available parallel data in both directions as well as all available monolingual data. The target language is specified with a language label on the source sentence (Johnson et al., 2017).
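In practice, the labeling amounts to prefixing each source sentence with a token naming the target language; the exact token format below is illustrative, not necessarily the one used in our systems.

# Target-language labeling on the source side (Johnson et al., 2017).
# The token format <to_xxx> is illustrative.
def add_language_label(src_sentence: str, tgt_lang: str) -> str:
    return f"<to_{tgt_lang}> {src_sentence}"

print(add_language_label("El perro duerme.", "quy"))
# -> <to_quy> El perro duerme.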
The model was first trained for 200 000 steps, weighting the Bibles data to occur only 0.3 times as often as all the other corpora. We picked the last checkpoint, since it attained the best accuracy and perplexity on the combined development set. This model constitutes submission A-0dev.
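In OpenNMT-py 2.x, such corpus weighting can be expressed with per-corpus weight entries in the data configuration. The sketch below uses hypothetical paths; since OpenNMT-py samples corpora proportionally to integer weights, the weights 3 vs. 10 approximate the 0.3 factor.

import yaml

# Sketch of per-corpus weighting in an OpenNMT-py 2.x data section
# (hypothetical paths; weights 3 vs. 10 approximate the 0.3 factor)
data_cfg = {
    "data": {
        "train_cni": {"path_src": "filtered.es", "path_tgt": "filtered.cni",
                      "weight": 10},
        "bibles_cni": {"path_src": "bibles.es", "path_tgt": "bibles.cni",
                       "weight": 3},
    }
}
print(yaml.safe_dump(data_cfg, sort_keys=False))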
Then, independently for each of the languages, we fine-tuned this model for another 2 500 steps on language-specific data, including 50% of the development set of the corresponding language. These models, one per language, constitute submission A-50dev.

Model B
Model B is a multilingual translation model with one source language (Spanish) and 11 target languages (10 indigenous languages + English). It is trained on all available parallel data with Spanish on the source side, using target language labels. The training takes place in two phases. In the first phase, the model is trained on 90% Spanish-English data and 1% of data from each of the ten American languages. With this first phase, we aim to take advantage of the large amounts of data to obtain a good Spanish encoder. In the second phase, the proportion of Spanish-English data is reduced to 50%. We train the first phase for 100k steps and pick the best intermediate savepoint according to the English-only validation set, which occurred after 72k steps. We then initialize two phase-2 models with this savepoint. For model B-0dev, we change the proportions of the training data and include the back-translations. For model B-50dev, we additionally include a randomly sampled 50% of each language's development set. We train both models for 200 000 steps and pick the best intermediate savepoint according to an eleven-language validation set, consisting of WMT-News and the remaining halves of the ten development sets.
Since the inclusion of development data showed massive improvements, we decided to continue training from the best savepoint of B-50dev (156k), adding also the remaining half of the development set to the training data. This model, referred to as B-100dev, was trained for an additional 14k steps until validation perplexity reached a local minimum.

Results
We submitted three systems to track 1 (development set allowed for training), namely A-50dev, B-50dev and B-100dev, and two systems to track 2 (development set not allowed for training), namely A-0dev and B-0dev. The results are in Table 2.
In track 1, our model B-100dev reached first rank and B-50dev reached second rank for all ten languages. Model A-50dev was ranked third to sixth, depending on the language. This shows that model B consistently outperformed model A, presumably thanks to its Spanish-English pre-training. Including the full development set in training (B-100dev) further improves the performance, although this implies that savepoint selection becomes guesswork.
For track 2, the tendency is similar. Model B-0dev was ranked first for nine out of ten languages, taking second rank for Spanish-Quechua. A-0dev was ranked second to fourth on all except Quechua. (After submission, we noticed that the Quechua back-translations were generated with the wrong model, which may explain the poor performance of our systems on this language.)

Ablation study
We investigate the impact of our data selection strategies via an ablation study where we repeat the second training phase of model B with several variants of the B-0dev setup. In Figure 2 we show intermediate evaluations on the concatenation of the 10 development sets every 2000 training steps.
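The chrF2 scores can be computed with sacreBLEU, whose chrF implementation uses beta = 2 by default; a minimal sketch with dummy data:

# Sketch of the chrF2 evaluation on the concatenated development sets,
# using sacreBLEU (chrF with beta=2 is sacreBLEU's default).
from sacrebleu.metrics import CHRF

hypotheses = ["translated sentence one", "translated sentence two"]
references = [["reference sentence one", "reference sentence two"]]

chrf = CHRF()  # defaults: char_order=6, word_order=0, beta=2
print(chrf.corpus_score(hypotheses, references))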
The green curve, which corresponds to the B-0dev model, obtains the highest maximum scores. The impact of the back-translations is considerable (blue vs. green curve), despite their presumed low quality. The addition of the Bibles did not improve the chrF2 scores (blue vs. orange curve). We presume that this is due to the mismatch in linguistic varieties, spelling and genre. It would be instructive to break down this effect by language.
The application of the OpusFilter pipeline to the train and extra data (yellow vs. orange curve) shows a positive effect at the beginning of the training, but this effect fades out later.
Finally, and rather unsurprisingly, our corpus weighting strategy (50% English, 50% indigenous languages, blue curve) outperforms the weighting strategy employed during the first training phase (90% English, 10% indigenous languages, grey curve). It could be interesting to experiment with even lower proportions of English data, taking into account the risk of catastrophic forgetting.

Conclusions
In this paper, we describe our submissions to the AmericasNLP shared task, where we submitted translations for all ten language pairs in both tracks. Our strongest system is the result of gathering additional relevant data, carefully filtering the data for each language pair, and pre-training a Transformer-based multilingual NMT system on large Spanish-English parallel data. Our best submissions ranked first in both tracks for all language pairs, with the exception of Spanish-Quechua in track 2.

A OpusFilter settings
The following filters were used for the training data, except for the back-translated data and the Bibles:
• LengthFilter: remove sentences longer than 1000 characters. Applied to Aymara, Nahuatl, Quechua, Raramuri.
• CharacterScoreFilter: remove sentences for which less than 90% of the characters are from the Latin alphabet. Applied to Aymara, Quechua, Raramuri.
The Bribri and Shipibo-Konibo corpora seemed clean enough that we did not apply any filters to them. After generating the Bible data, we noticed that some of the lines contained only a single 'BLANK' string. These segments were removed afterwards.
From the provided monolingual datasets, we filtered out sentences with more than 500 words.
The back-translated data was filtered with the following filters:
• LengthRatioFilter with threshold 2 and word units,
• CharacterScoreFilter with Latin script and a threshold of 0.9 on the Spanish side and 0.7 on the other side,
• LanguageIDFilter with a threshold of 0.8 for the Spanish side only.
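Expressed as an OpusFilter configuration fragment, this filtering step looks roughly as follows. The paths are hypothetical, and the exact way of restricting the language-identification check to the Spanish side (here via a negative threshold) is an assumption about the parameter format.

# Sketch of the back-translation filtering step (hypothetical paths)
bt_filter_step = {
    "type": "filter",
    "parameters": {
        "inputs": ["backtrans.es", "backtrans.cni"],
        "outputs": ["bt_filtered.es", "bt_filtered.cni"],
        "filters": [
            {"LengthRatioFilter": {"threshold": 2, "unit": "word"}},
            {"CharacterScoreFilter": {"scripts": ["Latin", "Latin"],
                                      "thresholds": [0.9, 0.7]}},
            # Negative threshold assumed to disable the check for the
            # non-Spanish side
            {"LanguageIDFilter": {"languages": ["es", "cni"],
                                  "thresholds": [0.8, -1]}},
        ],
    },
}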

B Hyperparameters
Model A uses a 6-layer Transformer with 8 attention heads, 512 dimensions in the embeddings and 1024 dimensions in the feed-forward layers. The batch size is 4096 tokens, with an accumulation count of 8. The Adam optimizer is used with beta1=0.9 and beta2=0.998. The Noam decay method is used with a learning rate of 3.0 and 40 000 warm-up steps. Subword sampling is applied during training (20 samples, α = 0.1).
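For reference, these settings roughly correspond to the following OpenNMT-py options; this is a hypothetical excerpt, not our complete configuration file.

# Approximate OpenNMT-py options for Model A (hypothetical excerpt)
model_a_opts = {
    "enc_layers": 6, "dec_layers": 6, "heads": 8,
    "word_vec_size": 512, "rnn_size": 512, "transformer_ff": 1024,
    "batch_size": 4096, "batch_type": "tokens", "accum_count": 8,
    "optim": "adam", "adam_beta1": 0.9, "adam_beta2": 0.998,
    "decay_method": "noam", "learning_rate": 3.0, "warmup_steps": 40000,
}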
Model B uses an 8-layer Transformer with 16 attention heads, 1024 dimensions in the embeddings and 4096 dimensions in the feed-forward layers. The batch size is 9200 tokens in phase 1 and 4600 tokens in phase 2, with an accumulation count of 4. The Adam optimizer is used with beta1=0.9 and beta2=0.997. The Noam decay method is used with a learning rate of 2.0 and 16 000 warm-up steps. Subword sampling is applied during training (20 samples, α = 0.1). As a post-processing step, we removed the <unk> tokens from the outputs of model B.