Neural Machine Translation Models with Back-Translation for the Extremely Low-Resource Indigenous Language Bribri

This paper presents a neural machine translation model and dataset for the Chibchan language Bribri, with an average performance of BLEU 16.9±1.7. The model was trained on an extremely small dataset (5923 Bribri-Spanish pairs), providing evidence for the applicability of NMT in extremely low-resource environments. We discuss the challenges entailed in managing training input from languages without standard orthographies, provide evidence of successful learning of Bribri grammar, and examine the translations of structures that are infrequent in major Indo-European languages, such as positional verbs, ergative markers, numerical classifiers and complex demonstrative systems. In addition, we experiment with augmenting the dataset through iterative back-translation (Sennrich et al., 2016a; Hoang et al., 2018), using Spanish sentences to create synthetic Bribri sentences. This improves the score by an average of 1.0 BLEU, but only when the new Spanish sentences belong to the same domain as the other Spanish examples. This work contributes to the small but growing body of research on Chibchan NLP.


Introduction
State-of-the-art systems in neural machine translation require large amounts of rich and varied parallel data between the source and target language (Lui et al., 2019). While this is less of a problem for high-resource language pairs like English-German or English-Spanish, there are far fewer resources when one of the languages is an Indigenous language. While some researchers are creating systems to optimize the training of NMT frameworks with fewer resources, most work primarily from a "warm start" where models are pre-trained on a small, but still significant, amount of parallel data (He et al., 2016). For example, even in papers geared towards developing techniques for low-resource languages, training datasets can exceed millions of sentences (Edunov et al., 2018). Our motivation in adapting low-resource techniques to extremely low-resource situations is that they can help improve understanding of their underlying mechanisms and lower the barrier of entry for machine translation to be used for language revitalization purposes.
However, in the existing body of research, many papers use simulated low-resource scenarios where high-resource parallel datasets are artificially truncated to test the efficacy of their techniques. While this does provide a level of standardization across different techniques, it is still artificial. This paper seeks to explore NMT in an authentic scenario, using an actual extremely low-resource language.
We leverage a variety of sources to produce a novel translation dataset in the indigenous Bribri language for demonstrating methods of neural machine translation (NMT) in extremely low-resource situations. To demonstrate usage of the dataset, we apply the technique of iterative back-translation with validation (Hoang et al., 2018). Finally, we present a translation analysis to show the unique challenges that this Bribri dataset presents.

Challenges for NMT in Extremely Low-Resource Languages
In extremely low-resource situations, the drawbacks of NMT systems begin to show. NMT is a data-hungry approach to machine translation, and much less efficient with respect to training data than other approaches (Koehn and Knowles, 2017), but with enough data it has shown excellent results (Hassan et al., 2018). Still, these approaches assume that written data is continuously produced in the languages. However, there are 7000 languages in the world (Eberhard and Fennig, 2020); many of them are spoken in Indigenous communities, have small populations of speakers, and are scarcely used in social media or in other forms amenable to automatic scraping (Lillehaugen, 2019; Keegan et al., 2015; Ní Bhroin, 2015). In many of these communities, languages like English and Spanish have displaced the Indigenous languages in domains such as technology and chatting, and so the available data is curtailed. In addition, many Indigenous communities face chronic digital inequalities, which makes it difficult to run crowd-sourcing campaigns for those languages. Finally, in many cases the data that is most valuable to speakers of the language is that collected from elders and knowledge keepers, but those elders might be the people who have the least access to technological means of communication.
NMT also struggles with out-of-domain translation (Chu and Wang, 2018), which is only exacerbated by a smaller dataset. Our primary solution to this is employing a popular dataset augmentation technique called iterative back-translation (Hoang et al., 2018). Additionally, NMT models are fairly opaque (Koehn and Knowles, 2017) in that their learning process is not deterministic and is difficult to interpret. Here, we try to provide a closer, linguistic analysis of the translations produced by the models.

Challenges for Bribri NLP and NMT
Bribri (Glottocode brib1243) is an Indigenous language spoken by the Bribri people in Costa Rica and Panama. It belongs to the Chibchan family and has approximately 7000 speakers (INEC 2010). The language is vulnerable (Moseley, 2010; Sánchez Avendaño, 2013), which means that there are still children who speak it at home, but there are domains where Spanish is used instead.
There is very little material published in the Bribri language, so training sets will always remain small. There are two main groups of people who write in Bribri (as with most Indigenous languages in the Americas): university-affiliated researchers, and community members such as school teachers and others involved in creating didactic materials. There is little to no usage of the language online, and practically all of the material exists only as printed books. The main sources of bitext are textbooks for Spanish speakers to learn the language (Constenla et al., 2004; Jara Murillo and García Segura, 2013), Spanish-Bribri dictionaries (Margery, 2005), grammar books (Jara Murillo, 2018a), collections of oral literature (Jara Murillo, 2018b; García Segura, 2016; Constenla, 2006; Constenla, 1996), and schoolbooks for Bribri children (Sánchez Avendaño, 2020). There is also one digital corpus that contains traditional stories. These sentences belong to general domains (e.g. Íma be' kie? 'How are you?'), but they also include specialized passages from traditional narrations.
There are numerous sources of internal variation in the data; these are summarized in Table 1. First, Bribri has been studied over the last 50 years, but researchers have published materials using different writing systems. For example, a nasal vowel can be indicated by a line underneath the vowel (Constenla et al., 2004), by a tilde above the vowel (Jara Murillo and García Segura, 2013), or by a Polish hook (Margery, 2005). Moreover, this variation in orthography leads to many permutations of diacritic encodings. The word ù 'cooking pot' has a grave accent for high tone and a nasal vowel. If the vowel is expressed with the line below, the exact Unicode combining character varies amongst materials, and the tonal mark can be expressed as a single Unicode character with the 'u', or as separate characters.
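Part of this encoding problem can be addressed with Unicode canonical normalization, which maps precomposed characters and their combining-character spellings to a single canonical form. A minimal sketch (illustrative only, not the paper's actual standardization code):

```python
import unicodedata

def canonicalize(text: str) -> str:
    # NFC recomposes sequences like 'u' + COMBINING GRAVE ACCENT into the
    # single precomposed code point U+00F9, so both spellings of the same
    # diacritic compare equal after normalization.
    return unicodedata.normalize("NFC", text)

precomposed = "\u00f9"   # 'ù' as one code point
combining = "u\u0300"    # 'u' followed by a combining grave accent
assert precomposed != combining  # visually identical, byte-wise distinct
assert canonicalize(precomposed) == canonicalize(combining)
```

Note that normalization alone does not resolve every case: marks such as a combining line below have no precomposed forms with the vowels, and choosing between a tilde, a line below, or a Polish hook for nasality remains a rule-based orthographic decision, as described above.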
Perhaps the most challenging source of variation is that of orthographic variation amongst different writers. Bribri is not yet standardized, and this leads to different orthographic conventions when the language is actually written. Table 1 shows two examples of this. In example (a), the copula dör appears as rör, the word taîë 'very much' appears as two words táìnë, and the nasalization is expressed with an n, not with a line. In example (b), the absolutive argument i 'it' appears in the same word as the verb kíe 'to be called', and the verb kíe itself has a tone marking that is not present in other standards. We want to stress that, even if this variation makes NLP work more difficult, this is a challenge that NLP has to deal with in order to do justice to these languages. When people write in Bribri and in any Indigenous language, they are expanding its domains of usage, and this contributes to their revitalization and normalization. Regardless of how much variation/"noise" there might be in the written text, the main priority is for the speakers of these languages to keep using them in their daily lives, regardless of how they write. There will be plenty of time later to debate standardization; right now all that matters is that the community can perpetuate the use of their language. While we have discussed some specifics for the Bribri case, all of these issues are present in many Indigenous languages around the world (Galla, 2016).
There is relatively little research on the NLP of Indigenous languages of the Americas (Mager et al., 2018), but Bribri has received some attention. Published research includes the design of virtual keyboards (Flores Solórzano, 2010) and finite-state machines for morphological analysis (Flores Solórzano, 2019), and there is an online Spanish-Bribri dictionary (Krohn, 2020). Finally, there is work on machine learning: untrained forced alignment (Coto-Solano and Flores Solórzano, 2016; Coto-Solano and Flores Solórzano, 2017) was used to study the phonetics of the language.

NMT Training from Bribri Bitext
Using the resources cited above, we created a dataset with 5923 Bribri-Spanish sentence pairs. We performed diacritic and writing system standardization with a rule-based, deterministic system, with a result that smooths out some author-specific variation and that can be converted to either type of contemporary orthography (the Constenla et al. (2004) and the Jara Murillo (2018a) conventions). For example, the string ùx represents a nasal high-tone 'u', as the 'x' is not used as a letter in the language. (For simplicity, we will use the convention of indicating nasalization with a line underneath when showing Bribri text in this paper.) We normalized forms to the Amubri orthography for phonological variations (e.g. nalà∼ñolò 'road') when there is a systematic way to convert them back to Coroma orthography, but preserved lexemes that are exclusive to each variant (Coroma: alàr∼alar 'children'). Finally, we attempted to standardize common function words (e.g. copula rör to dör). After standardizing all sentence pairs, we encoded each sentence with separate byte pair encoding models for each language (Sennrich et al., 2016b), where each encoding had a vocabulary size of about 2200 tokens.
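The byte pair encoding step uses the algorithm of Sennrich et al. (2016b), whose core merge loop is compact enough to sketch here (shown on a toy English vocabulary for illustration, not on the Bribri data):

```python
import collections
import re

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words as space-separated symbols, with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(10):  # the number of merges sets the subword vocabulary size
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)

# Frequent words have coalesced into single subword units.
assert "newest</w>" in vocab and "low</w>" in vocab
```

For the actual data, one merge table per language would be learned until the vocabulary reaches the target size (here, about 2200 tokens each); libraries such as subword-nmt implement this same procedure at scale.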
For each trial the dataset was split into 80% training, 10% validation, and 10% testing. Additionally, the data was randomly shuffled ten times, resplit and retrained for further validation. From these sets we trained Transformer encoder/decoder models, which in previous research have shown good performance for translation tasks. We used near-identical hyperparameters to the base model described in Vaswani et al. (2017), with the exception of the number of training steps: each model was trained for 4000 steps with a batch size of 4096 tokens on a single GPU. The models were only trained for 4000 steps to prevent overfitting on the small dataset. Each translation model was trained using the PyTorch port of the OpenNMT package (Klein et al., 2017) on Google Colaboratory instances with a range of different GPUs. Training took approximately 50 minutes per model. Each model was then evaluated against the testing data with the Bilingual Evaluation Understudy (BLEU) score (Papineni et al., 2002), using the multi-bleu script from Moses (Koehn et al., 2007).
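The repeated random validation can be sketched as follows (a hypothetical helper with placeholder pair contents, not the authors' released script):

```python
import random

def split_dataset(pairs, seed):
    """Shuffle and split into 80% train / 10% validation / 10% test."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_val = int(0.1 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

pairs = [(f"bribri-{i}", f"spanish-{i}") for i in range(5923)]  # placeholders
trials = [split_dataset(pairs, seed) for seed in range(10)]     # ten reshuffles
train, val, test = trials[0]
assert len(train) + len(val) + len(test) == 5923
```

Each of the ten reshuffles would then feed one training run, and the reported scores are the average and maximum over those trials.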
The code for sentence standardization, training, and a sample of BRI-SPA sentence pairs can be found here: http://github.com/rolandocoto/bribri-coling2020.

BLEU scores
We built three types of models: one based on all the data (5923 total pairs; henceforth the 6K model), one based on half of the data (2961 total pairs; henceforth the 3K model), and one based on only a quarter (1480 total pairs; henceforth the 1.5K model). Table 2 summarizes the dataset splits, the repeated random validation average (10 trials) and the maximum BLEU values for all the validation tests.

The Bribri→Spanish models had an average performance of BLEU 16.9. This was better than the Spanish→Bribri models, which had an average of BLEU 14.2. There was considerable variation in the training results, which is to be expected with such a small amount of data, as the specific shuffle of each trial will have a large effect on the results. For example, while the best performance for a SPA→BRI pair was BLEU 18.9, the average BLEU for the 6K models was 14.2 ± 2.7, and the worst performing model reached BLEU 10.3. An obvious question is: how did some of the models perform so well? The dataset might be helping us: the data itself is mainly from textbooks and grammar books, which are rich in examples that the models might be taking advantage of. This might be a positive development for the implementation of NMT for Indigenous languages, given that most of the available data is in this format.

Translation analysis: Bribri→Spanish
What is the model learning? Table 3 shows examples from the top performing Bribri→Spanish model. Many translations are correct, and when the model makes mistakes, it is often because of mismatches in the grammar of the two languages. Example 3 has the pronoun ie' 'he/she/singular they', which is genderless in Bribri. In Spanish the pronoun's gender should match the name Ana, but because gender is underspecified in Bribri, the Spanish translation comes out with the wrong pronoun. There are also difficulties in matching the verbal morphologies of the two languages. Bribri has a middle voice, where the verb is performed without a specific actor. (Japanese also has such structures: Rajio ga naotta 'The radio got fixed'.) The model could not translate the Bribri middle voice correctly, and tried to match it with the Spanish active voice, giving it the random actor 'I' in example 4.


Translation analysis: Spanish→Bribri
Bribri is structurally very different from most Indo-European languages, and so we are interested in how the models are learning Bribri targets. Here we will focus on four grammatical phenomena: (i) positional verbs, (ii) ergativity, (iii) numerical classifiers and (iv) demonstratives. Let's begin with the positional verbs. The sentence Wépa tso ù a 'The men are in the house' has the simple verb tso 'to be in'. Bribri can also use more specific verbs to provide details about the position of the subject. For example, in the sentence Wépa ië'ten ù a 'The men are standing in the house', the verb ië'ten 'to be standing in a place [plural]' tells you the position of the men relative to the house. Languages like German also have positionals (e.g. Der Teddy liegt auf dem Boden, lit: 'The teddy bear lies on the floor'). Bribri positional verbs include tkër 'to be sitting on', tër 'to be lying on', a'r 'to be hanging' and dur 'to be standing [singular]', amongst others. Table 4 shows the translations from the best performing Spanish→Bribri model, and compares them with the reference Bribri sentences. The system was successful in learning some positionals, but sometimes it overgeneralizes (replacing a'r with tër in example 3), and sometimes it fails to use the positional at all, as in example 4.
Bribri is a morphologically ergative language (Pacchiarotti, 2016). This means that the subjects of transitive sentences are marked with an affix. There are numerous other ergative languages in the world, such as Basque, Samoan and Warlpiri. In the case of Bribri, the morpheme tö/dör/'r marks the ergative word. Table 5 shows that, while the model has learned to place the ergative in some subjects of transitive sentences, it hasn't learned the general rule yet. Example 3 has the subject be' 'you', which is the doer of the action and should therefore be marked as ergative, but the model did not mark it as such.
Bribri has a type of word called numerical classifiers. These are number words, but each has a specific semantic class. For example, in Bribri, children are classified as persons, and so the word three would be mañál. On the other hand, the word chicken is classified as a small thing, and so the word three would become mañàt. Another example is houses, which are classified as buildings; when counting buildings, the word three is mañátkue. There are numerous other numerical classes, such as flat objects, bunches and cylindrical objects. This phenomenon is not exclusive to Bribri, and is also present in languages like Mandarin Chinese and Japanese (e.g. inu ippiki 'one small animal of dog', terebi ichidai 'one machine of TV', gakusei hitori 'one person of student'). The model is relatively successful in learning the numerical classifiers of Bribri, as can be seen in the first two examples of table 6. Even when the model makes a mistake with the number, as in examples 4 and 6, it gets the classifier category right. The most common mistake seems to be confusing numbers with abstract quantities. In example 3, the reference sentence had the word "one [different] human", but the target sentence has "a human". In example 5, the word "five people" is replaced by "many".
Finally, Bribri has a very complex system of demonstratives. Where English has the words this and that, Bribri demonstratives distinguish two spatial axes: closeness to the speaker (near, far) and relative position to the speaker (above, same level, underneath). This leads to structures like dù aí 'that bird up there, nearby', dù aì 'that bird up there, far away', dù awì 'that bird at the same level as me, far away' and dù dià 'that bird down there, far away'.
The model seems to have difficulty coping with this complex system. Table 7 repeats previous examples for clarity and adds two more examples of issues with demonstratives. In examples 1 and 2, the system simplified the demonstrative to more general forms (èt 'one flat thing' instead of aí 'up there near'; e' 'that one' instead of awì 'that one same level far'). In example 3 the demonstrative is simply not present in the target translation. Example 4 was very close to having the correct demonstrative, but the model got the tone wrong, changing the demonstrative from "near" to "far".
One important detail to mention is that these errors can only partially be attributed to insufficient input from the Spanish source sentence. In sentence 3, the Spanish word aquella 'that one over there' might indeed be insufficient to determine the correct Bribri demonstrative. However, sentences 1, 2 and 5 have enough information in the Spanish source to perform an approximation of the Bribri demonstrative.
The issues discussed in this section serve as an illustration of some of the limits of the model. However, from the errors we can see that the model is using its very scarce data to learn some representations of Bribri morphological phenomena which have multiple mappings to the Spanish source sentences.

Iterative Back-Translation
Back-translation is a technique for leveraging large monolingual corpora and weak translation models to augment parallel datasets (Sennrich et al., 2016a). A translation model is first built using real parallel data, translating from the target language into the source language. Then, monolingual sentences in the target language are translated using that model to create synthetic bitext between the two languages. Finally, this synthetic bitext is concatenated with the real bitext and a final model is trained. This simple technique has been shown to increase scores in some models by almost 2 BLEU (Edunov et al., 2018). Iterative back-translation builds upon this technique by using the first synthetic bitext and the real bitext to train a second target-to-source translation model, which creates new synthetic bitext that is, in theory, more accurate (Hoang et al., 2018).
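Under the simplifying assumption of stub train/translate functions (a real pipeline would invoke an NMT toolkit such as OpenNMT at each step), the iterative loop can be sketched as:

```python
def train_model(bitext, name):
    """Stand-in for NMT training; returns a dummy translate function.
    The interface is hypothetical, for illustration only."""
    def translate(sentence):
        return f"<{name}:{sentence}>"
    return translate

def iterative_back_translation(real_bitext, mono_target, rounds=2):
    """real_bitext: (source, target) pairs; mono_target: target-language text."""
    bitext = list(real_bitext)
    for _ in range(rounds):
        # 1. Train a target->source model on all bitext gathered so far.
        t2s = train_model([(tgt, src) for src, tgt in bitext], "tgt2src")
        # 2. Back-translate real monolingual target sentences into
        #    synthetic source sentences.
        synthetic = [(t2s(tgt), tgt) for tgt in mono_target]
        # 3. Concatenate real and synthetic bitext for the next round.
        bitext = list(real_bitext) + synthetic
    # Final source->target model trained on the augmented dataset.
    return train_model(bitext, "src2tgt"), bitext

model, augmented = iterative_back_translation([("hola", "hello")] * 4, ["hi"] * 3)
assert len(augmented) == 4 + 3  # real pairs plus one synthetic pair per sentence
```

In this paper's setup, the "source" is Bribri and the "target" is Spanish: a Spanish→Bribri model produces synthetic Bribri from real Spanish sentences, and the final Bribri→Spanish model is trained on the augmented bitext.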

Bribri Back-translation Results
In order to test the effect of back-translation, we used the structure in figure 1. First, we used a portion of our bitext corpus to train a Bribri→Spanish model. This provided a baseline for performance, and will henceforth be called the Base model. Second, we took the same bitext subset to train a Spanish→Bribri model. We used this to generate synthetic Bribri sentences out of real Spanish ones. We then combined these synthBribri+realSpanish pairs with the real bitext subset, and used this to train a Bribri→Spanish model that used both real and synthetic data. We will call this the Synth1 model. Third, we used the concatenation of the real bitext and the synthBribri+realSpanish set to train a second Spanish→Bribri model. We then generated a second set of synthetic Bribri sentences, using the real Spanish as input again. We combined these new synthBribri+realSpanish pairs with the original bitext to train a second Bribri→Spanish model, which we will call Synth2. A separate testing set (from the original Bribri-Spanish bitext) remains the same throughout the cycle so that results from different models are comparable. We test this structure using three configurations:

1. 1.5K real bitext (1480 total pairs), plus 1.5K (1480) Spanish sentences from the actual Bribri-Spanish corpus. Here, we ignore the real Bribri sentences we have and generate new synthetic Bribri from the in-domain Spanish data we collected in the corpus.

2. 3K real bitext (2961 total pairs), plus 3K (2961) Spanish sentences from the actual Bribri-Spanish corpus. Again, we ignore the real Bribri sentences and generate new synthetic Bribri from the in-domain Spanish data from the corpus.

3. 6K real bitext (5923 total pairs), plus 6K (5923) Spanish sentences from an out-of-domain source. These sentences come from part of the News-Commentary dataset (Tiedemann, 2012), which contains headlines of daily news, and is therefore very different from the text in the corpus.
We tested each of these five times, using random reshuffling of the data. Table 4.1 shows the average and maximum gains in BLEU scores when comparing the Base and Synth1 models, and the Base and Synth2 models. Figure 2 shows the variation in BLEU gains for the different models. The Synth1 model trained on 3K real pairs (2368 train, 296 validation, 296 test) and 2961 synthetic pairs had the best gains relative to Base, with an average gain of BLEU 1.0 and a maximum gain of BLEU 2.1 (from 10.51 to 12.60). When compared to the 6K models in Table 2 (avg: BLEU 16.9; max: BLEU 19.8), we can see that adding 3K synthetic sentences leads to a gain of 22%∼33% of that from adding 3K real sentences. While this result is lower than the 67%∼83% reported by Edunov et al. (2018), it shows that synthetic data can produce gains even in very small datasets. The Synth2 models gained much less when compared to the Base, suggesting that there might be limits to the effectiveness of back-translation.
Both of the 1.5K models suffered losses in BLEU. The 1.5K models might simply have too little data to generate appropriate synthetic Bribri. This result is in line with Przystupa and Abdul-Mageed (2019), who found that there are limits to how much models can learn from back-translation. The 6K models also show BLEU losses, but for a different reason: the out-of-domain data used to generate synthetic Bribri might be too different from the real bitext, which is, after all, mostly composed of stories and classroom examples, not news. This poses a challenge for the effectiveness of back-translation: it might be the case that, in order to benefit from back-translation, Spanish input needs to come from other language-learning books, or from traditional stories that are published in Spanish.

Table 9 shows changes in Bribri→Spanish translations between the Base and Synth1 models, built with 2961 real bitext pairs and 2961 pairs of synthetic Bribri and real in-domain Spanish (the 3K model). Example 1 shows that the system is improving its understanding of basic sentence structure. Synth1 wrongly interprets the verb ali' 'to cook' as its homophone ali' 'cassava', but even though it does not get the exact verb right, it generates a translation that has the correct argument structure. Example 2 shows an improved understanding of subject structure and ergativity: the word bö' 'you [ERG]' is correctly understood by Synth1, whereas the Base model thinks that the subject is 'we'. Finally, example 3 shows a sentence where the model went from producing a translation without positional information to one that successfully understood the positional tulur 'to be sitting [plural]'. These examples provide evidence that back-translation is helping the model further its generalizations of Bribri grammar.

Conclusions and Future Work
In this paper we described the challenges involved in building a 5923-pair bitext dataset for Bribri-Spanish, which combines textbook sentences with traditional stories, and standardizes some of the variation found in the data. We built a Transformer-based NMT model with a maximum BLEU of 19.8, and examined some of its errors in learning parts of Bribri grammar that are not found in Indo-European languages.
Finally, we trained a group of models augmented with iterative back-translation. This technique produced 22%∼33% of the BLEU gain obtained by augmenting with the same amount of real data. More generally, the paper provides evidence that NMT techniques can achieve acceptable results for extremely low-resource languages, which could help in the process of language documentation. In future work, this dataset could be expanded to test other NMT techniques such as transfer learning (Zoph et al., 2016) or unsupervised techniques (Wu et al., 2019; Lample et al., 2017). Given the current amount of bilingual Bribri text, we estimate that the dataset could potentially double in size, but unfortunately there is not yet enough Bribri text to augment it past that point. Another expansion could use Bribri monolingual data, of which there is little (probably fewer than 5K sentences), but which could be used for dual learning (He et al., 2016).