Highland Puebla Nahuatl Speech Translation Corpus for Endangered Language Documentation

Documentation of endangered languages (ELs) has become increasingly urgent as thousands of languages are on the verge of disappearing by the end of the 21st century. One challenging aspect of documentation is to develop machine learning tools to automate the processing of EL audio via automatic speech recognition (ASR), machine translation (MT), or speech translation (ST). This paper presents an open-access speech translation corpus of Highland Puebla Nahuatl (glottocode high1278), an EL spoken in central Mexico. It then addresses machine learning contributions to endangered language documentation and argues for the importance of speech translation as a key element in the documentation process. In our experiments, we observed that state-of-the-art end-to-end ST models could outperform a cascaded ST (ASR > MT) pipeline when translating endangered language documentation materials.


Introduction
Due to the need for global communication, computational technologies such as automatic speech recognition (ASR), machine translation (MT: text-to-text), and speech translation (ST: speech-to-text) have focused their efforts on languages spoken by major population groups (Henrich et al., 2010). Many other languages that are spoken today will probably disappear by the end of the 21st century (Grenoble et al., 2011). For this reason, until very recently they have not been targeted by machine learning technologies. This is changing, however, as increasing attention has been paid to language loss and the need for preservation and, in best-case scenarios, revitalization of these languages.
This paper presents an open-access speech translation corpus from Highland Puebla Nahuatl to Spanish and discusses our initial efforts at ST over the corresponding corpus. The remainder of this paper is organized as follows: in Section 2, we discuss the benefits of speech translation for EL documentation and propose it as the first step in the documentation process. In Section 3, we compare the strategies (i.e., cascaded and end-to-end models) that can be used to automate ST for ELs. In Section 4 we introduce the Highland Puebla Nahuatl-to-Spanish corpus. Initial experimental efforts in building ST models are elaborated in Section 5. The conclusion is presented in Section 6.

Benefits of speech-to-text translation as a first step in language documentation
The present article suggests that speech translation (ST) could be a viable and valuable tool for EL documentation efforts for three reasons (Anastasopoulos, 2019). First, the transcription of native language recordings may become particularly problematic and time-consuming (the "transcription bottleneck") when the remaining speakers are elderly and the younger generation has at best a passive knowledge of the language, a common situation for ELs. Second, in many cases ST may be more accurate than MT for target language translation. Finally, many EL documentation projects suffer from a lack of human resources with the skills and time to transcribe and analyze recordings (for similar points about a "translation before transcription" workflow, see Bird, 2020, section 2.2.2). By beginning with ST, semi- and passive speakers can better contribute to EL documentation of their native languages with a level of effort far lower than that needed for transcription and analysis. Bilingual native speakers or researchers with incomplete knowledge of the source language structure can quickly produce highly informative free translations even if the original text is never, or only much later, segmented and glossed. A free translation in audio and subsequent capture by typing or by using ASR systems for the major target L2 language (systems that are more accurate for major as opposed to minor and endangered languages) may take 4-5 hours of effort per hour of audio, whereas transcription (without analysis) may take 30-100 hours for the same hour of recording. Starting with free translation, then, increases the pool of potential native speaker participants and quickly adds value to an audio corpus that may languish if the first step is always fixed as transcription and segmentation (morphological parsing and glossing).
In general, EL documentation proceeds in a fairly set sequence: (1) record; (2) transcribe in time-coded format; (3a) analyze by parsing (morphological segmentation) and glossing; and (3b) freely translate into a dominant, often colonial, language. It may be that some projects prioritize free translation (3b) over morphological segmentation and glossing. Given that each procedure adds a certain, often significant, amount of time to the processing pipeline, there is an increasing scarcity of resources as one proceeds from (1) to (3a/b). If the standard sequence is followed, there are invariably more recordings than transcriptions, more transcriptions than analyses, and (if the sequence is 3a > 3b) more analyses than free translations or (if the sequence is 3b > 3a) more free translations than analyses (see Bird, 2020, Table 3, p. 720).
The argument presented here is that the easiest data to obtain are the recordings, followed by free translations into a major language. It may be beneficial to reorder the workflow so that an ST corpus, i.e., free translation of the recording, is prioritized. Only later would transcription and analysis (morphological segmentation and glossing) be inserted into the pipeline. To facilitate computational support for speech-to-text production, we would recommend a targeted number of recordings (e.g., 50 hours), followed by division into utterances with time stamps and free translation of the utterances into a major language. This corpus (or perhaps one even larger) would be used to train an end-to-end neural network for speech-to-text production. The trained ST system would then be used to process additional recordings, thus generating a very extensive freely translated corpus. Our hope would be that instead of basing ASR on the acoustic signal alone, using two coupled inputs, the speech signal and the free translation, might well lower ASR error rates below those obtained from the speech signal alone. The extent of improved accuracy is at this point simply a hypothesis; it would have to be empirically researched, something we hope to do in the near future (see Anastasopoulos, 2019, chap. 4). In this scenario for EL documentation, transcription and analysis still proceed, but only after an extensive ST training/validation/test corpus has been developed. The resultant ST system would then be used to freely translate additional recordings as they are made.
Speech translation (ST) is very challenging, particularly for resource-scarce endangered languages. The degree of challenge might well be reduced if corpus creation focused from the beginning on translation without intermediate steps (transcription and analysis, which would take documentation in the direction of MT). Moreover, translation itself is a challenging art complicated by the lexical and morphosyntactic intricacies of languages and, more often than not, the discrepancies in vision and structure between source and target language (cf. Sapir, 1921, chap. 5). Extremely large corpora might smooth out the edges, but if free translations are created only after transcription, then the "transcription bottleneck" will also limit the availability of free translations. Limited EL free translation resources, in turn, create the danger that idiosyncratic or literal translations might dominate the training set. This is another reason to position free translation directly from a recording before transcription and analysis.
Free translation and textual meaning: Even when a transcription has been produced and then morphologically segmented and glossed, free translations are beneficial, whether generated from the transcription or directly from the speech signal. For example, although multiple-sense glossing (i.e., choosing from multiple senses or functions in glossing a morpheme) clarifies ambiguous meanings, it is time-consuming for a human and challenging to automate. The semantic ambiguity of single morphemes will be mitigated if not resolved, however, if accompanied by free translations. Note the following interlinearization, in which, in isolation, the meaning of the gloss line is confusing. The free translation clarifies the meaning and offers a secondary sense to the verb root koto:ni.
Note also that multi-word lemmas and idiomatic expressions are in many cases opaque in word-by-word (or, even more challenging, morpheme-by-morpheme) glossing. Again, a gloss and parallel free translation preserve literal meaning while clarifying the actual meaning to target language speakers.
Strategies for automating speech-to-text translation: Cascaded model vs. end-to-end model
One intuitive solution to automating free translation is the cascaded model, which relies on a pipeline from automatic speech recognition (ASR) to machine translation (MT). It is difficult to implement for most ELs, however, since they lack the material and data necessary to robustly train both ASR and MT systems (Do et al., 2014; Matsuura et al., 2020; Shi et al., 2021).
End-to-end ST has received much attention from the NLP research community because of its simpler implementation and computational efficiency (Bérard et al., 2016; Weiss et al., 2017; Inaguma et al., 2019). In addition, it can avoid propagating errors from ASR components by directly processing the speech signal. However, as with ASR and MT, ST often suffers from limited training data and resultant difficulties in training a robust system, which makes the task challenging. There are few available examples of ST applied to endangered languages. Indeed, most speech translation efforts are between major languages (Di Gangi et al., 2019a; Cattoni et al., 2021; Kocabiyikoglu et al., 2018; Salesky et al., 2021). In these corpora, both source and target languages usually have a standardized writing system and ample training data, a situation generally absent for ELs. A well-known low-resource ST corpus is the Mboshi-French corpus (Godard et al., 2018). However, it is based on the reading of written texts, which does not present the difficulties encountered in conversational speech, the scenario most common in EL documentation projects.

Characteristics of Highland Puebla Nahuatl (glottocode high1278)
In this paper, we release a Highland Puebla Nahuatl (HPN; glottocode high1278) speech translation corpus for EL documentation. The corpus is governed by a Creative Commons BY-NC-SA 3.0 license and can be downloaded from http://www.openslr.org/92. We have analyzed the corpus and explored different ST models and corresponding open-source training recipes in ESPnet (Watanabe et al., 2018). Nahuatl languages are polysynthetic, agglutinative, head-marking languages with relatively productive derivational morphology, reduplication, and noun incorporation. A rich set of affixes creates the basis for a high number of potential words from any given lemma. As illustrated in Table 1, a transitive verb may contain half a dozen affixes; up to eight in a single word is not uncommon. Suffixes (not represented in Table 1) include tense/aspect/mood markings as well as "associated motion" (ti-cho:ka-ti-nemi-ya-h 1plS-cry-ligature-walk-imperf-pl 'we used to go around crying') and directionals (ti-mits-ih-ita-to-h 1plS-2sgO-rdpl-see-extraverse.dir-pl 'we went to visit you').
Noun incorporation is not reflected in Table 1 as verbs with incorporated nouns may be treated as lexicalized stems with a compound internal structure. The function of the nominal stem can be highly varied (Tuggy, 1986) as it may lower valency (object incorporation) or leave valency unaffected, as with subject incorporation (not common), as well as both possessor raising (ni-kone:-miki-k 1sgS-child-die-perfective.sg 'My child died on me') and modification (ni-kone:-tsahtsi-0 1sgS-child-shout-pres.sg 'I shout like a child'). Though noun incorporation is not fully productive (Mithun, 1984), it does increase the number of lemmas. It complicates the patterns and meaning of reduplication, which may be at the left edge of the compound (transitive ma:teki > ma:ma:teki 'to cut repeatedly on the arm') or stem internal (e.g., ma:tehteki 'to harvest by hand'). It also complicates automatic translation, particularly in the case of out-of-vocabulary compounds in which there is no precedent for any of the possible interpretations of the incorporated noun stem.
The main challenge to developing machine translation algorithms for HPN is its morphological complexity: large numbers of words with a low token-to-type ratio, and significant occurrences of both noun incorporation and reduplication accompanied by considerable variation in the semantic implications of incorporated noun stems and reduplicants. Table 2 lists type/token ratios in sample texts for three languages, including HPN. While the most frequent 100 word types cover roughly the same portion of text in all three languages, the remaining word types occur with much lower frequency in HPN than in Yoloxóchitl Mixtec (glottocode yolo1241, another EL spoken in Mexico) or English. As a corollary, this means that the remaining 41.1% of tokens (195,680) in the HPN corpus represents 41,718 types, a type-to-token ratio of 1:4.7. The equivalent ratio for English is 1:30.5. Finally, HPN word order is relatively flexible, which may pose an additional challenge to free translation, as neither case marking nor word order unambiguously indicates grammatical function. It is not clear to what degree MT or ST can handle this relative variability in word order, even with relatively abundant resources.
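The type/token statistics above can be computed with a few lines of code. The sketch below uses a toy token list (the Nahuatl words are illustrative, not drawn from the corpus statistics reported here) and reports the token count, type count, tokens per type, and the share of tokens covered by the most frequent word types:

```python
from collections import Counter

def type_token_stats(tokens, top_n=100):
    """Compute type/token counts and the share of tokens covered
    by the top_n most frequent word types."""
    counts = Counter(tokens)
    total = sum(counts.values())
    top_coverage = sum(c for _, c in counts.most_common(top_n)) / total
    return {
        "tokens": total,
        "types": len(counts),
        "tokens_per_type": total / len(counts),
        "top_n_coverage": top_coverage,
    }

# Toy example (hypothetical data, not the HPN corpus):
sample = "in se: pili kicho:ktia in se: pili".split()
stats = type_token_stats(sample, top_n=2)
```

Run over the full HPN text corpus with `top_n=100`, this procedure yields the coverage and ratio figures reported in Table 2.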

Corpus Transcription
Recording: The HPN corpus was developed with speakers from the municipality of Cuetzalan del Progreso, in the northeastern sierra of the state of Puebla. Most speakers were from San Miguel Tzinacapan and neighboring communities. Recordings use a 48 kHz sampling rate at 16 bits. To facilitate transcription of overlapping speech, each speaker was miked separately into one of two channels with a head-worn Shure SM-10a dynamic mic. A total of 954 recordings were made in a variety of genres. The principal topic, with 591 separate conversations, was plant nomenclature, classification, and use.
Transcription: The workflow commenced with recording sessions in relatively isolated environments. The original transcription was done in Transcriber (Barras et al., 2001) by one of four native speaker members of the research team: Amelia Domínguez Alcántara, Hermelindo Salazar Osollo, Ceferino Salgado Castañeda, and Eleuterio Gorostiza Salazar. Amith then reviewed each transcription, checking any doubts with a native speaker, before importing the finalized Transcriber file into ELAN (Wittenburg et al., 2006). On import, each speaker was assigned a separate tier, and an additional dependent tier for the free translation was then created for each speaker.
Spanish influence: Endangered languages are often spoken in a (neo-)colonial context in which the impact of a dominant language (often but not always non-Indigenous) is felt in many spheres (McConvell and Meakins, 2005). HPN, particularly from the municipality of Cuetzalan, is striking for manifesting two perhaps contrary tendencies: (1) a puristic ideology that has motivated the creation of many neologisms along with (2) morphosyntactic shift under the subtle and covert influence of Spanish. It is probably the case that neither neologisms nor morphosyntactic change poses much of a problem for machine translation; Spanish loans and code-switching into Spanish would undoubtedly be even less problematic. Indeed, it may well be that Spanish impact in many domains of HPN poses minimal problems for machine translation, particularly if the translation is text-to-text. One potential area of difficulty would be in speech translation, in which the Spanish translation is produced directly from a Nahuatl recording. In the conventions for HPN transcription, a Spanish loan with distinct meanings in Spanish vs. Nahuatl contexts is distinguished orthographically. It might be difficult to disambiguate the two if the translation is direct from audio. Thus note the following: āmo nikmati como tikchīwas ('I don't know how you will do it') vs. āmo nikmati komo tikchīwas ('I don't know if you will do it'). Spanish como ('how') may retain its Spanish meaning in a Nahuatl narrative (in which case it is written as if Spanish), or it may be used as a conditional ('if'), in which case it is conventionally written in Nahuatl orthography (komo). Even though the decision to orthographically distinguish [komo] / <como> meaning 'how' from [komo] / <komo> meaning 'if' is a particular feature of HPN transcription conventions, the ambiguity in meaning (i.e., translation) would persist even if the orthographies of the two senses were to be different.
In sum, then, it may be that the Spanish impact on Nahuatl is less problematic for MT than for ASR. The most problematic situation for ST is when a Spanish word is used in a Nahuatl-speaking community with either its original Spanish meaning or an innovative Nahuatl meaning. In this case, working via MT from a written transcription may have an advantage if each meaning (original Spanish vs. innovated) is represented differently by orthographic convention (as with como). But in other cases of Spanish language impact, it is not clear that the cascaded ST (ASR > MT) pipeline enjoys advantages over the direct end-to-end ST system.

Standardized Splits
The HPN corpus includes corpora for two tasks: ASR and ST (with MT). The statistics and partition information are shown in Table 3. The ASR corpus contains high-quality speech with phone-level transcription. The ST corpus is the subset of the ASR corpus that includes time-aligned free translations of the HPN transcriptions.
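The relation between the two corpora can be sketched as a simple filter: every utterance with a transcription belongs to the ASR corpus, and the subset that also carries a time-aligned free translation forms the ST corpus. The field names below are illustrative assumptions, not the released corpus schema:

```python
def split_corpora(utterances):
    """ASR corpus: all utterances with a transcription.
    ST corpus: the subset that also has a free translation."""
    asr = [u for u in utterances if u.get("transcription")]
    st = [u for u in asr if u.get("translation")]
    return asr, st

# Toy example with hypothetical utterance records:
utts = [
    {"utt_id": "u1", "transcription": "t1", "translation": "tr1"},
    {"utt_id": "u2", "transcription": "t2"},
    {"utt_id": "u3"},
]
asr_set, st_set = split_corpora(utts)
```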

Experiments
In this section, we present our initial effort on building an automatic ST model for EL documentation. Following the discussion in Section 3, we compare the cascaded model with end-to-end models.
To construct the cascaded model, we first conduct experiments on ASR and MT, respectively. We then combine the two modules into a cascaded pipeline and compare it with end-to-end ST models.

Automatic Speech Recognition (ASR)
In many open-data tasks, end-to-end ASR compares favorably to traditional hidden Markov model-based ASR systems. The same trend is also shown in ASR for another endangered language, Yoloxóchitl Mixtec, as presented in Shi et al. (2021), Table 2. Following a methodology similar to that used for ASR of Yoloxóchitl Mixtec, we have constructed a baseline system based on end-to-end ASR, specifically the transformer-based encoder-decoder architecture with hybrid CTC/attention loss (Watanabe et al., 2017; Karita et al., 2019). We employ the same network configurations as the ESPnet MuST-C recipe. The target of the system is 150 BPE units trained with a unigram language model. For decoding, we integrate a recurrent neural network language model with the ASR model. SpecAugment is adopted for data augmentation (Park et al., 2019). The results in character error rate (CER) and word error rate (WER) are shown in Table 4. The experiments show that ASR improves only slightly as a result of increasing the data size from 45 to 156 hours.
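For readers unfamiliar with the metrics, CER and WER are both normalized Levenshtein (edit) distances between the reference and the hypothesis, computed over characters and words respectively. A minimal self-contained sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences,
    using a single rolling row of the DP table."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev = dp[0]
        dp[0] = i
        for j, h in enumerate(hyp, 1):
            cur = min(dp[j] + 1,        # deletion
                      dp[j - 1] + 1,    # insertion
                      prev + (r != h))  # substitution or match
            prev, dp[j] = dp[j], cur
    return dp[-1]

def wer(ref, hyp):
    """Word error rate: edit distance over word sequences,
    normalized by the reference length."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    """Character error rate: edit distance over characters."""
    return edit_distance(list(ref), list(hyp)) / len(ref)
```

For a polysynthetic language like HPN, with long words packing many morphemes, CER is often the more informative of the two, since a single wrong affix counts as a whole word error under WER.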

Machine Translation (MT)
The MT experiments are conducted over the ST corpus with ground-truth HPN transcriptions by native-speaker transcribers. We also adopt ESPnet to train the MT model with an encoder-decoder architecture (Inaguma et al., 2020).

Speech Translation (ST)
While the traditional cascaded approach to automating free translation (using two models, ASR and MT) shows strong results on many datasets, recent work has also shown competitive results using end-to-end systems that directly output translations from speech using a single model (Jan et al., 2019; Sperber and Paulik, 2020; Ansari et al., 2020). For low-resource settings, in particular, the data efficiencies of different methodologies become key performance factors (Bansal et al., 2018; Sperber et al., 2019). In this paper, we compare the performance of both cascaded and end-to-end ST systems on our dataset. Both our cascaded and end-to-end systems are based on the encoder-decoder architecture (Bérard et al., 2016; Weiss et al., 2017) and the transformer-based model (Di Gangi et al., 2019b; Inaguma et al., 2019).

(a) Cascaded ST Model (ASR > MT Pipeline):
The cascaded model consists of an ASR module and an MT module, each optimized separately during training. Each module is pre-trained with the same method as presented in Sections 5.1 and 5.2. During inference, the 1-best hypothesis from the ASR module is obtained via beam search with a beam size of 10, and this decoded transcription is passed to the subsequent MT module that finally outputs translated text. Results are shown in Table 5.
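Structurally, the cascade is just function composition: speech goes through the ASR module, and its 1-best transcription goes through the MT module. The interfaces below are illustrative assumptions (the actual systems are ESPnet models decoded with beam search, beam size 10):

```python
class CascadedST:
    """Schematic ASR > MT cascade; the two modules are trained
    and optimized separately, and composed only at inference."""

    def __init__(self, asr_decode, mt_translate):
        self.asr_decode = asr_decode      # speech -> 1-best transcription
        self.mt_translate = mt_translate  # transcription -> translation

    def translate(self, speech):
        transcription = self.asr_decode(speech)
        return self.mt_translate(transcription)

# Stub modules standing in for the real models:
asr_stub = lambda speech: "transcription of " + speech
mt_stub = lambda text: text.replace("transcription", "translation")
cascade = CascadedST(asr_stub, mt_stub)
result = cascade.translate("utt1")
```

The composition makes the error-propagation problem concrete: any mistake in `transcription` is passed unchanged to the MT stage, which has no access to the original speech signal.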
(b) End-to-end ST Model: In our experiments, we adopt the transformer-based encoder-decoder architecture with SpecAugment. In addition, by default we train the current system with a combination of an ASR CTC-based loss from the encoder and an ST translation loss from the decoder; this is referred to as E2E-ST with ASR-MTL. We also evaluate the Searchable Intermediates (SI) based ST model (E2E-ST with ASR-SI) introduced in Dalmia et al. (2021), where the ASR intermediates are found using the same decoding parameters as the ASR models of the cascaded model. The detailed hyper-parameters follow the configuration of the ESPnet MuST-C recipes. ST results are shown in Table 5. While the performance of the cascaded ST system is close to that of the MT system, E2E-ST with ASR-MTL performs significantly worse. Since E2E-ST with ASR-MTL jointly optimizes a speech encoder with an ASR decoder that is not included in the final inference network, this wasted subnetwork likely causes the data inefficiency evident on our low-resource dataset (Sperber et al., 2019). In contrast, E2E-ST with SI actually outperforms both the MT and cascaded ST systems, suggesting that it is less degraded by the low-resource constraint (Anastasopoulos and Chiang, 2018; Dalmia et al., 2021). Furthermore, this result shows that Nahuatl is more easily translated with a methodology that can consider both speech and transcript sequences as inputs.
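The ASR-MTL objective described above is an interpolation of two terms: the translation loss from the decoder and the auxiliary CTC-based ASR loss from the encoder. The sketch below shows only the combination step; the weight value is a placeholder assumption, not the value used in our recipes:

```python
def mtl_loss(st_loss, asr_ctc_loss, asr_weight=0.3):
    """Joint objective for E2E-ST with ASR multi-task learning:
    interpolate the decoder's translation loss with the encoder's
    auxiliary CTC ASR loss. asr_weight=0.3 is an illustrative
    placeholder; the actual recipe value may differ."""
    return (1 - asr_weight) * st_loss + asr_weight * asr_ctc_loss

combined = mtl_loss(1.0, 2.0, asr_weight=0.3)
```

Note that the auxiliary ASR branch contributes only to this training-time objective; at inference only the ST decoder is used, which is the "wasted subnetwork" referred to above.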
(c) Pre-training for end-to-end ST: To investigate the pre-training effect for HPN, we adopt the models trained from Sections 5.1 and 5.2. The ASR model in Section 5.1 was used for initialization of the ST encoder, while the MT model in Section 5.2 was used for initialization of the ST decoder.
As shown in Table 6, the best performance is reached with initialization from both the ASR encoder and the MT decoder. Pre-training both encoder and decoder helps ST modeling, with the pre-trained ASR encoder contributing the larger share of the improvement.
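The transfer itself amounts to selectively copying parameters: encoder weights come from the pre-trained ASR model and decoder weights from the pre-trained MT model, while any remaining ST parameters keep their fresh initialization. The parameter-name prefixes below are assumptions for illustration, not ESPnet's actual parameter naming:

```python
def init_from_pretrained(st_params, asr_params, mt_params):
    """Initialize ST parameters from pre-trained models:
    encoder.* from the ASR model, decoder.* from the MT model.
    Parameters not matched by either source stay as-is."""
    init = dict(st_params)
    for name, value in asr_params.items():
        if name.startswith("encoder.") and name in init:
            init[name] = value
    for name, value in mt_params.items():
        if name.startswith("decoder.") and name in init:
            init[name] = value
    return init

# Toy parameter dicts (values stand in for weight tensors):
st_params = {"encoder.w": 0.0, "decoder.w": 0.0, "ctc.w": 0.0}
asr_params = {"encoder.w": 1.0, "decoder.w": 9.0}
mt_params = {"encoder.w": 9.0, "decoder.w": 2.0}
init = init_from_pretrained(st_params, asr_params, mt_params)
```

The prefix filter is what keeps the transfer selective: the ASR model's own decoder and the MT model's own encoder are deliberately ignored.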
Some examples from the best model in Table 6 are shown in Appendix B. The analysis indicates that the current ST system can translate some essential information into Spanish. However, it cannot yet fully replace human effort on the task, and the translations still need significant correction from a human annotator.

Conclusions
In this paper, we release the Highland Puebla Nahuatl corpus for ASR, MT, and ST tasks. The corpus, related baseline models, and training recipes are open source under the CC BY-NC-ND 3.0 license. We expect the corpus to facilitate all three tasks for EL documentation. We also discuss and present three specific reasons for prioritizing ST as an initial step in the endangered language documentation sequence once recording has taken place. Finally, we explore different end-to-end technologies for ST of Highland Puebla Nahuatl and compare these to results obtained through the cascaded ST pipeline.
As discussed in Section 2, we suggest that prioritizing free translation as a first, not final, step in documentation should be considered because: (1) it can rapidly make a corpus valuable to potential users even if transcription, morphological segmentation, and morpheme glossing are incomplete; (2) it enables semi-, passive, and heritage speakers to participate in documentation of their languages; (3) it provides an alternative process for ASR in which the ASR target is not a transcription but a translation into a Western language; and (4) it creates a scenario in which the acoustic signal and free translation may be coupled as inputs into an end-to-end ASR system. Our future work will therefore focus on how human effort could be reduced via ST models and on how to incorporate ST to improve ASR performance.

A Spanish language impact on Highland Puebla Nahuat
HPN, particularly from the municipality of Cuetzalan, is striking for manifesting two seemingly contrary tendencies. The first is a puristic ideology that values the native language as an expression of Indigenous identity. The second is a very strong influence of Spanish syntax that has led to a significant number of calques that are not only direct translations of Spanish but that yield expressions violating basic grammatical constraints of Nahuatl. Puristic ideology motivates many neologisms, many of which are nouns, that provide an alternative to Spanish loans. Spanish impact on morphosyntax is also prevalent. For example, with very few exceptions, the valency of Nahuatl verbs is fixed as either intransitive, transitive, or ditransitive. Thus, to accept an object, an intransitive must undergo valency increase through an overt morphological process. But Spanish influence has created situations in which intransitive Nahuat verbs mark two arguments (subject and object) on the erstwhile intransitive stem.
Under Spanish influence, the intransitive verbs kīsa 'to emerge' (Spanish 'salir') and tikwi 'to light up' (Spanish 'prenderse') manifest otherwise ungrammatical forms: (a) āmo nēchkīsa (Ø-nēch-kīsa-Ø; 3sgS-1sgO-to.emerge-pres.sg) is a calque from Spanish 'no me sale' ('it doesn't turn out right for me'); (b) motikwi (Ø-mo-tikwi-Ø; 'it lights up') uses an unnecessary and ungrammatical reflexive marker influenced by the reflexive Spanish form 'se prende'.