IndT5: A Text-to-Text Transformer for 10 Indigenous Languages

Transformer language models have become fundamental components of NLP pipelines. Although several Transformer models have been introduced to serve many languages, there is a shortage of models pre-trained for low-resource and Indigenous languages in particular. In this work, we introduce IndT5, the first Transformer language model for Indigenous languages. To train IndT5, we build IndCorpus, a new corpus for 10 Indigenous languages and Spanish. We also present the application of IndT5 to machine translation by investigating different approaches to translate between Spanish and the Indigenous languages as part of our contribution to the AmericasNLP 2021 Shared Task on Open Machine Translation. IndT5 and IndCorpus are publicly available for research.


Introduction
Indigenous languages are starting to attract attention in the field of natural language processing (NLP), with the number of related publications growing in recent years (Mager et al., 2018). In spite of this interest, a multitude of challenges remains for handling Indigenous languages. The complexity of the morphological systems of some of these languages and the lack of standard orthography for writing them are among these challenges (Mager et al., 2018; Littell et al., 2018). The most fundamental issue facing NLP efforts, however, remains the lack of digital textual data that can be exploited for system development.
In this work, we describe a scenario usually faced when trying to develop NLP systems for Indigenous languages, and we focus on machine translation (MT). We adopt a neural machine translation (NMT) approach (Koehn, 2017) as our method. We show that, in spite of its recent success in many contexts, NMT still struggles in very low-resource settings involving Indigenous languages. This is due to the core difficulty of lacking parallel textual data, and even monolingual data.

1 https://github.com/UBC-NLP/IndT5

Figure 1: A map of the ten Indigenous languages covered by IndT5, our text-to-text Transformer model, and our IndCorpus dataset. The languages are mainly spoken in five Latin American countries.
Although our main goal in this work is to develop translation models from Spanish to several Indigenous languages of the Americas, we adopt a transfer learning approach and offer resources that can be exploited for other downstream tasks. Namely, we build a dataset for ten Indigenous languages and Spanish, which we refer to as IndCorpus. Figure 1 and Table 1 provide an overview of the ten Indigenous languages in our new dataset (Eberhard et al., 2021). We also exploit IndCorpus for pre-training a Transformer language model following the unified approach introduced by Raffel et al. (2019). Our resulting model, IndT5, treats every text-based NLP problem as a "text-to-text" problem, i.e., taking text as input and producing new text as output. We apply IndT5 to the MT task as a way to transfer knowledge acquired by the model to this particular context. Our experiments show the utility of our new language model and the dataset it exploits for the downstream Indigenous MT task, but also that substantial room for improvement remains.

The rest of the paper is organized as follows: In Section 2, we introduce recent MT work in low-resource and Indigenous language settings. In Section 3, we describe how we develop our new language model for ten Indigenous languages. In Section 4, we describe our NMT models. We conclude in Section 5.

Low-Resource MT
A number of methods and techniques have been proposed to mitigate the effects of having rather small datasets for machine translation. These include data augmentation, transfer learning, hyperparameter tuning, incorporating linguistic knowledge, and knowledge distillation.
Since the main bottleneck of low-resource MT is the lack of abundant parallel textual data, data augmentation is a straightforward way to enhance model performance. Back translation is one such method for augmenting parallel data (Sennrich et al., 2016a). A target-to-source translation model is trained on the original data and then fed monolingual data in the target language, generating synthetic parallel data. If the target language is rich in textual data, a large amount of synthetic parallel data can be added to the training data and may benefit the final translation model.

Transfer learning is another method that can boost the performance of MT on low-resource languages (Zoph et al., 2016; Nguyen and Chiang, 2017; Kocmi and Bojar, 2018). The rationale behind one approach to transfer learning is that knowledge obtained while translating high-resource languages may be transferable to the translation of low-resource languages. In Zoph et al. (2016), a parent model is first trained on a high-resource language pair (French-English), then a child model is trained on a low-resource language pair (Uzbek-English). The Uzbek-English model achieves a BLEU score of 10.7 without the parent model and 15.0 with it. It is also shown that the more similar the two source languages are, the larger the possible performance gain. For example, a Spanish-English MT model achieves a BLEU score of 16.4 without a parent model and 31.0 with a French-English parent model, a much larger gain than when the French-English parent model is transferred to the more distant Uzbek-English child model.

Sennrich and Zhang (2019) argue that instead of using hyperparameters that work in high-resource settings, there should be a set of hyperparameters specific to the low-resource scenario. For example, keeping the vocabulary size small, training a model with relatively small capacity, and using a smaller batch size may be beneficial to model performance. When building a vocabulary with BPE, reducing the number of merge operations yields a smaller vocabulary and avoids the inclusion of low-frequency (sub)words, which could otherwise negatively influence representation learning.
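To make the back-translation procedure described above concrete, the following is a minimal sketch in Python. The target_to_source_model object and its translate method are hypothetical stand-ins for any trained target-to-source MT system; they are not part of a specific library or of the systems described in this paper.

# Minimal sketch of back-translation for data augmentation.
# `target_to_source_model.translate` is a hypothetical interface.
def back_translate(target_monolingual_sentences, target_to_source_model):
    """Create synthetic (source, target) pairs from monolingual target text."""
    synthetic_pairs = []
    for target_sentence in target_monolingual_sentences:
        # Translate the target-language sentence back into the source language.
        synthetic_source = target_to_source_model.translate(target_sentence)
        # Pair the synthetic source with the original (clean) target sentence.
        synthetic_pairs.append((synthetic_source, target_sentence))
    return synthetic_pairs

# The synthetic pairs are then mixed with the original parallel data when
# training the final source-to-target model.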
Linguistic knowledge has also been leveraged for data augmentation, for example by using a rule-based syntax parser and a dictionary to generate parallel data. By reordering target-language sentences into source-language syntactic structure and then mapping target-language words to source-language words with a dictionary, the amount of parallel data is enlarged and translation performance is improved. Baziotis et al. (2020) leverage a language model to enhance the performance of the translation model. Similar to the idea of knowledge distillation (Hinton et al., 2015), a teacher model and a student model are trained, where the language model plays the role of the teacher and the translation model that of the student. With this design, the teacher model needs only monolingual data and does not have to rely on large amounts of parallel data.

MT of Indigenous Languages
Unlike high-resource languages such as English and French, Indigenous languages are often low-resource. Due to this, researchers working on Indigenous languages commonly adopt methods that fare well in low-resource scenarios. This includes using the Transformer architecture and its variants in both low-resource (Adebara et al., 2020, 2021; Przystupa and Abdul-Mageed, 2019) and Indigenous language (Feldman and Coto-Solano, 2020; Orife, 2020; Le and Sadat, 2020) settings.
Although Indigenous languages face difficulties similar to those of most low-resource languages, some challenges are specific to Indigenous languages. As Mager et al. (2018) point out, some Indigenous languages have complex morphological systems and some have various non-standardized orthographic conventions. For example, Micher (2018) shows that in a corpus of one million tokens of Inuktitut, an Indigenous language of North America with a complex morphological system, there are about 225K different types, compared to about 30K types for English. Micher (2018) also shows that there can be a lack of standardized spelling for some words; for example, the Inuktitut word Haammalat has another seven different forms.
To cope with the issue of complex morphology, Ortega et al. (2020) build a translation model for Quechua, an Indigenous language of South America, with an integrated morphological segmentation method. To treat orthographic variation, Feldman and Coto-Solano (2020) standardize text with a rule-based system that converts diacritics and letters to a contemporary orthographic convention.
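As an illustration of what such rule-based standardization can look like, here is a minimal sketch in Python. The character mappings are purely hypothetical placeholders, not the actual rules used by Feldman and Coto-Solano (2020).

# Minimal sketch of rule-based orthographic normalization.
# The mappings below are hypothetical placeholders for illustration only.
NORMALIZATION_RULES = {
    "ë": "e",    # hypothetical: collapse a legacy letter variant
    "á": "a'",   # hypothetical: rewrite an older diacritic convention
}

def normalize(text: str) -> str:
    """Map legacy characters to a single contemporary orthographic convention."""
    for old, new in NORMALIZATION_RULES.items():
        text = text.replace(old, new)
    return text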

IndT5
We train an Indigenous language model adopting the unified and flexible text-to-text transfer Transformer (T5) approach (Raffel et al., 2019). T5 treats every text-based language task as a "text-to-text" problem, taking text as input and producing new text as output. T5 is essentially an encoder-decoder Transformer (Vaswani et al., 2017), with the encoder and decoder similar in configuration and size to BERT Base (Devlin et al., 2019) but with some architectural modifications. These include applying layer normalization before each sub-block (pre-norm) and adding a residual connection from the sub-block's input to its output. We call our resulting model IndT5. We now describe our dataset, vocabulary, and pre-training method for developing IndT5.
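For orientation, the following sketch instantiates a T5-Base-sized encoder-decoder with a 100K-piece vocabulary using the Hugging Face transformers library. The hyperparameter values mirror the public t5-base configuration and should be read as an assumption about, not a specification of, IndT5's exact setup.

from transformers import T5Config, T5ForConditionalGeneration

# Sketch: a T5-Base-sized encoder-decoder with a 100K subword vocabulary.
# Values follow the public t5-base configuration; treat them as illustrative.
config = T5Config(
    vocab_size=100_000,  # matches the IndT5 SentencePiece vocabulary size
    d_model=768,         # hidden size of a Base-sized model
    d_ff=3072,           # feed-forward inner dimension
    num_layers=12,       # encoder layers (decoder defaults to the same depth)
    num_heads=12,        # attention heads per layer
)
model = T5ForConditionalGeneration(config)
print(f"{model.num_parameters():,} parameters")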

Training Data
We build IndCorpus, a collection of ten Indigenous languages and Spanish comprising 1.17 GB of text (∼5.37M sentences), to pre-train IndT5. IndCorpus is collected from both Wikipedia and the Bible. Table 2 provides the size and number of sentences for each language in our dataset.

IndT5 Vocabulary
The T5 model (Raffel et al., 2019) is based on a vocabulary built with the SentencePiece library using English, French, German, and Romanian web pages from the "Colossal Clean Crawled Corpus" (or C4 for short). We use a similar procedure to create our Indigenous language vocabulary. Namely, we use SentencePiece (Kudo, 2018) to encode text as WordPiece (Sennrich et al., 2016b) tokens, with a vocabulary size of 100K WordPieces extracted from IndCorpus.
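As a rough sketch of this step, the snippet below trains a 100K-piece SentencePiece model. The input file name, model prefix, and character coverage are assumptions for illustration, not the exact settings used for IndT5.

import sentencepiece as spm

# Train a subword vocabulary on IndCorpus (assumed to be concatenated into
# one plain-text file with one sentence per line; the file name is illustrative).
spm.SentencePieceTrainer.train(
    input="indcorpus.txt",
    model_prefix="indt5_vocab",   # writes indt5_vocab.model / indt5_vocab.vocab
    vocab_size=100_000,           # 100K pieces, as used for IndT5
    character_coverage=1.0,       # keep every character observed in the corpus
)

# Load the trained model and segment a sentence into subword pieces.
sp = spm.SentencePieceProcessor(model_file="indt5_vocab.model")
print(sp.encode("Ejemplo de oración en español.", out_type=str))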

Unsupervised Pre-Training
We leverage our unlabeled Indigenous corpus, IndCorpus, to pre-train IndT5. For that, we use a denoising objective (Raffel et al., 2019) that does not require labels. The main idea is to feed the model corrupted (masked) versions of the original sentence and train it to reconstruct the original sentence. Inspired by BERT's masked language model objective (Devlin et al., 2019), the denoising objective (Raffel et al., 2019) works by randomly sampling and dropping out 15% of tokens in the input sequence; each consecutive span of dropped-out tokens is then replaced by a single sentinel token. We pre-train our model for 100K steps on IndCorpus using the T5 Base architecture. We refer to this model as IndT5 100k . Afterwards, we further pre-train on only the ten Indigenous languages part of our dataset (i.e., without the Spanish data) for 40K steps. We refer to this version of the model as IndT5 140k . For both pre-training steps, we use a learning rate of 0.01.

Machine Translation

Data

As part of the AmericasNLP 2021 Shared Task on Open Machine Translation, the training (Train) and development (Dev) datasets for ten target Indigenous languages, along with the source language Spanish, were released. All the datasets are manually translated. Table 3 shows the number of sentences for the different language pairs in the shared task data. Table 4 provides example sentences extracted from the Dev dataset with their corresponding translations.

Approach
For all language pairs except quy and gn, we fine-tune each of the two versions of our language model, i.e., both IndT5 100k and IndT5 140k , under two conditions: (A) we train on Train, using 100% of the Dev data for validation, for 150 epochs; (B) we fine-tune the best epoch from setting A for 50 epochs, adding 80% of the Dev data to Train and using the remaining 20% of Dev for validation.
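The two data conditions can be summarized with the following minimal sketch, assuming train_pairs and dev_pairs are lists of (Spanish, target) sentence pairs; the function names and the random seed are illustrative and not part of our actual training code.

import random

def setting_a(train_pairs, dev_pairs):
    # Setting A: train on Train, validate on all of Dev.
    return train_pairs, dev_pairs

def setting_b(train_pairs, dev_pairs, seed=0):
    # Setting B: move 80% of Dev into the training set, keep 20% for validation.
    shuffled = list(dev_pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return train_pairs + shuffled[:cut], shuffled[cut:]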

Evaluation
We report the results of both the IndT5 100k and IndT5 140k models using two metrics: BLEU (Papineni et al., 2002) and ChrF++ (Popović, 2017). Tables 5 and 6 show the results of both models on the Test sets for each of the language pairs using settings A and B described in Section 4.2, respectively.
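For reference, both metrics can be computed with the sacrebleu library as sketched below; the example hypothesis and reference strings are purely illustrative placeholders.

from sacrebleu.metrics import BLEU, CHRF

# Illustrative system outputs and references (detokenized text).
hypotheses = ["this is a system translation"]
references = [["this is the reference translation"]]   # one reference stream

bleu = BLEU()
chrf = CHRF(word_order=2)   # word_order=2 corresponds to ChrF++ (plain ChrF uses 0)

print(bleu.corpus_score(hypotheses, references))
print(chrf.corpus_score(hypotheses, references))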

Discussion
The results presented in Table 5 and Table 6 show that all our models, under both settings A and B, outperform the respective baselines across all languages, with the exception of aym and shp. As expected, fine-tuning the IndT5 100k and IndT5 140k models using the training data and 80% of the Dev data (i.e., setting B) improves the results, with mean gains of +0.003% and +0.04% in ChrF++ on the Test data, respectively. Interestingly, further pre-training IndT5 on only the ten Indigenous languages (i.e., the target languages) produces better results, with average improvements of +0.003% and +0.004% in settings A and B, respectively. Overall, the impact of limited data is clear.

Conclusion
In this work, we introduced a new Transformer language model (IndT5) and a dataset (IndCorpus) for ten Indigenous languages and Spanish. We applied IndT5 to the MT task on eight language pairs as part of our submission to the AmericasNLP 2021 Shared Task. While IndT5 helps improve translation, the task remains hard due to the absence of parallel as well as monolingual data. In the future, we plan to integrate statistical MT methods to augment our data, as well as investigate the best hyperparameters for our neural models.