Ayuuk-Spanish Neural Machine Translator

This paper presents the first neural machine translation system for the Ayuuk language. In our experiments we translate from Ayuuk to Spanish and from Spanish to Ayuuk. Ayuuk is a language spoken in the state of Oaxaca, Mexico, by the Ayuukjä’äy people (commonly known in Spanish as Mixes). We use different sources to create a low-resource parallel corpus of more than 6,000 phrases. For some of these resources we rely on automatic alignment. The proposed system is based on the Transformer neural architecture and uses sub-word level tokenization as input. We report the current performance given the resources we have collected for the San Juan Güichicovi variant; the results are promising, reaching up to 5 BLEU. We based our development on the Masakhane project for African languages.


Introduction
In recent years, efforts to preserve the native languages of the Americas and to promote the creation of NLP tools for them have increased, particularly addressing the challenges that this endeavour poses (Mager et al., 2018). Machine Translation (MT) has become one of the main goals to pursue since, in the long term, it might offer benefits to the communities that speak these languages. For instance, it might provide access to knowledge in their native language and facilitate access to services such as legal, medical and financial assistance. In this work, we explore this avenue for the San Juan Güichicovi variant of the Ayuuk language, mainly because one of the authors is a native speaker of this variant. To our knowledge, no such system has been built for Ayuuk, although other variants (Coatlán Mixe, ISO 639-3 mco, the Ayuuk of the Coatlán region) are available in the JW300 Corpus (Agić and Vulić, 2019).
In this work we rely on multiple previous works. At the core of our proposal we follow the steps of the Masakhane project, which focuses on African languages (Nekoto et al., 2020). We also rely on the following libraries:
• For the automatic alignment of our resources we use the YASA aligner (Lamraoui and Langlais).
• For tokenization we use the subword-nmt library (Sennrich et al., 2016).
• For training our models we use JoeyNMT (Kreutzer et al., 2019).
With these tools we developed our code base, which can be consulted online together with the part of the corpus that is freely available.
this municipality it can be estimated that there are approximately 18,298 speakers of the variant. It is important to note that only an estimated 3,205 are monolingual.
The San Juan Güichicovi variant of Ayuuk does not have a normalized orthography. There are efforts to agree on orthographic conventions; however, there are strong positions regarding the number of consonants. One of these positions, known as the "bodegeros" position, proposes 20 consonants (see 1a) (Willett et al., 2018), while the "petakeros" position proposes a reduction to 13 (see 1b) (Reyes Gómez, 2005). In terms of vowels, this variant has six (see 2), in contrast with other variants of Ayuuk, which can have up to nine vowels.
(1) a. b ch d ds g j k l m n ñ p r s t ts w x y '
    b. p t k x ts m n wy j l r s '
(2) a e ë i o u
The following are examples of San Juan Güichicovi Ayuuk. They were taken from short stories collected and written by Albino Pedro Juan, a native speaker and preserver of the language.
The bunny became happy. El conejo se puso feliz.

Spanish
In the case of Spanish, our system produces translations in Mexican Spanish, which belongs to the American Spanish variant; we identify the language by the ISO 639-1 code es.

The parallel corpus
For the creation of the parallel corpus we collected samples from different sources for which a translation between Ayuuk and Spanish was available (see Table 1).
Since our sources are linguistically diverse, it was necessary to normalize the orthography. For this we follow the proposal of Sagi-Vela González (2019), who has followed the unification of the Ayuuk language while avoiding taking sides in the controversy about the number of consonants. Mainly, we made two replacements: ñ/ny and ch/tsy.
Some of the works were already aligned; others were not. For those not aligned we created automatic alignments using the YASA tool (Lamraoui and Langlais). We discarded all empty and double alignments. Normalization and automatic alignments were manually verified by one of the authors. The corpus keeps the differences between the two normalization variants: petakeros and bodegeros.
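The normalization and filtering steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in this work: the direction of the replacements (ñ to ny, ch to tsy) is assumed, the function names are hypothetical, and an alignment is represented here as a pair of sentence lists, which is not YASA's actual output format. "Double" alignments are assumed to be those that group more than one sentence on either side.

```python
# Assumed direction of the two replacements named in the text:
# "ñ" is rewritten as "ny" and "ch" as "tsy".
REPLACEMENTS = [("ñ", "ny"), ("ch", "tsy")]


def normalize_ayuuk(text: str) -> str:
    """Apply the two orthographic replacements, keeping initial capitals."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
        text = text.replace(old.capitalize(), new.capitalize())
    return text


def keep_one_to_one(alignments):
    """Keep only 1-1 alignments whose two sides are non-empty.

    `alignments` is a list of (source_sentences, target_sentences) pairs,
    where each side is a list of strings (a hypothetical representation).
    """
    pairs = []
    for src_sents, tgt_sents in alignments:
        if len(src_sents) != 1 or len(tgt_sents) != 1:
            continue  # empty (1-0, 0-1) or double (2-1, 1-2) alignment
        src, tgt = src_sents[0].strip(), tgt_sents[0].strip()
        if src and tgt:
            pairs.append((src, tgt))
    return pairs
```

A filter of this kind only removes the obviously unusable alignments; it does not replace the manual verification described above.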
Finally, we split the sentences into training, development and test sets. For our experiments we created two versions of the split, one strict and one random. In the strict version we use all the phrases from the National archive of indigenous languages (Lyon, 1980) as the test set. Since these sentences are linguistically motivated and aim to show linguistic aspects of the language, they tend to be harder to translate. This split resulted in 5,847/700/912 (train/dev/test). In the random split we randomly sample sentences from our sources; the final split resulted in 5,941/700/912 (train/dev/test). Notice that the number of phrases differs between splits; this is because, after separating the test phrases, we remove repeated or similar phrases from the train/dev sets. Our intuition was to have a more uniform training/validation set for the random split while the test set follows the distribution of the original sources. We mimic this procedure for the strict split.
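A minimal sketch of the random split procedure follows, assuming the corpus is a list of (source, target) string pairs. The function names and the notion of "similar" used here (identical after lowercasing and removing ASCII punctuation and extra whitespace) are assumptions for illustration; the text does not specify the criterion. The sketch removes duplicates only within the train/dev pool. For the strict version, the test set would be a fixed source rather than a random sample.

```python
import random
import string


def similarity_key(sentence: str) -> str:
    """Hypothetical notion of 'similar': same text after lowercasing and
    dropping ASCII punctuation and extra whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(sentence.lower().translate(table).split())


def split_corpus(pairs, test_size, dev_size, seed=0):
    """Sample the test set first, then deduplicate the rest into train/dev."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    test, rest = shuffled[:test_size], shuffled[test_size:]

    # Remove repeated or similar pairs from the train/dev pool only,
    # so the test set keeps the distribution of the original sources.
    seen, unique = set(), []
    for src, tgt in rest:
        key = (similarity_key(src), similarity_key(tgt))
        if key not in seen:
            seen.add(key)
            unique.append((src, tgt))

    dev, train = unique[:dev_size], unique[dev_size:]
    return train, dev, test
```

Because deduplication happens after the test phrases are set aside, the number of training phrases depends on which phrases end up in the test set, which is consistent with the different training sizes reported for the two splits.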

Neural Architecture
Our translation model is based on the Transformer architecture (Vaswani et al., 2017). We use an encoder-decoder setting. For our experiments we
These models were trained on a server with two Tesla V100 GPUs. Obtaining a model usually takes us around 2 hours for 100 epochs. We were also able to reproduce the experiments on the Colaboratory platform.

Experiments and results
As described in the previous section, we have two versions of our splits, strict and random. Per split we performed five experiments: two for the configuration with fewer layers (A) and three for the configuration with more layers (B). We also varied: a) the maximum length of the phrase (50 or 70); b) the vocabulary size of the BPE sub-word algorithm (2000 or 4000).
Figure 1 shows the perplexity and the BLEU score on the development set during training for the direction Spanish (es) to Ayuuk (mir). The first part of Table 2, columns two to five, presents the results on the development and test sets. Figure 2 shows the learning curve for the translation direction Ayuuk (mir) to Spanish (es). The second part of Table 2, columns six to nine, presents the results on the development and test sets for this translation direction.
As we can see, these sets of experiments show that translation is possible. We obtain some gains with the model with more layers (B); this is not trivial since we have a small amount of training data. On the other hand, the strict split, as expected, proves very difficult to translate: the BLEU scores are minimal. However, with the random split the BLEU scores are more promising. We also observe that in the current setting it is easier to translate from Spanish to Ayuuk than in the other direction.
Finally, we performed a longer experiment with 250 epochs using the B configuration, following the intuition that we had not reached the best performance with 100. Figure 3 shows the learning curve on the development set, and the bottom part of Table 2 shows our final results using the random split.
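To make the role of the BPE vocabulary setting concrete, the following is a minimal sketch of how BPE merge operations are learned, in the spirit of Sennrich et al. (2016). It is not the subword-nmt implementation used in this work. The function name, the end-of-word marker and the tie-breaking rule are assumptions for illustration, and the 2000 or 4000 setting is assumed to play the role of `num_merges`.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge operations from a list of words."""
    # Each word is a tuple of symbols ending with an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by the pair itself so the
        # result is deterministic (an arbitrary choice for this sketch).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges
```

With a corpus of only a few thousand phrases, a small number of merges keeps most sub-word units frequent enough to be learned, which is one plausible reason to explore small settings such as those tested here.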

Conclusions and Further work
Previous experiences with MT based on deep learning architectures, particularly in seq2seq settings, for native languages of the Americas have not been promising (Mager and Meza, 2018), in particular because there is little to no training data. However, our work shows that a standard model based on the Transformer architecture can produce some results under an extremely low-resource setting. These results are still low by the usual standards of the MT field; however, they are promising for the future.
In order to improve the performance of the system, future work will focus on:
1. Collecting more data, paying attention to other variants of the Ayuuk language.
2. Although the strict setting strongly penalizes the evaluation, we will continue using linguistically motivated phrases as a good bar against which to evaluate our progress.
3. At this moment we rely on sub-word units of the phrases; however, our approach could benefit from a deeper morphological analysis (Kann et al., 2018).
4. Our normalization will continue respecting the petakeros and bodegeros positions, and for other variants we will also incorporate positions regarding the number of vowels.