End-to-End Automatic Speech Recognition: Its Impact on the Workflow in Documenting Yoloxóchitl Mixtec

This paper describes three open-access Yoloxóchitl Mixtec corpora and presents the results and implications of end-to-end automatic speech recognition for endangered language documentation. Two issues are addressed. First, the advantage for ASR accuracy of targeting informational units (byte pair encoding, BPE) in addition to, or in place of, linguistic units (word, morpheme, morae) and of then using ROVER for system combination. BPE units consistently outperform linguistic units, although the best results are obtained by system combination of different BPE targets. Second, a case is made that for endangered language documentation, ASR contributions should be evaluated according to extrinsic criteria (e.g., positive impact on downstream tasks) and not simply intrinsic metrics (e.g., CER and WER). The extrinsic metric chosen is the level of reduction in the human effort needed to produce high-quality transcriptions for permanent archiving.

1 Introduction: Endangered language documentation history and context

Endangered language (EL) documentation emerged as a field of linguistic activity in the 1990s, as reflected in several seminal moments.
In 1991 the Linguistic Society of America held a symposium entitled "Endangered Languages and their Preservation"; in 1992 Hale et al. (1992) published a seminal article on endangered languages in Language, the LSA's flagship journal. In 1998, Himmelmann (1998) argued for the development of documentary linguistics as an endeavor separate from and complementary to descriptive linguistics. By the early years of the present millennium, infrastructure efforts were being developed: metadata standards and best practices for archiving (Bird and Simons, 2003); tools for lexicography and corpus development such as Shoebox, Transcriber (Barras et al., 1998), and ELAN (Wittenburg et al., 2006); and financial support for endangered language documentation (the Volkswagen Foundation, the NSF Documenting Endangered Languages Program, and the SOAS Endangered Languages Documentation Programme). Recent retrospectives on the impact of Hale et al. (1992) and Himmelmann (1998) have been published by Seifart et al. (2018) and McDonnell et al. (2018). Within the last decade, the National Science Foundation supported a series of three workshops, under the acronym AARDVARC (Automatically Annotated Repository of Digital Audio and Video Resources Community), to bring together field linguists working on endangered languages and computational linguists working on automatic annotation, particularly automatic speech recognition (ASR), to address the impact of what has been called the "transcription bottleneck" (Whalen and Damir, 2012). Interest in applying machine learning to endangered language documentation is also manifested in four biennial workshops on this topic, the first in 2014 (Good et al., 2021). Finally, articles directly referencing ASR of endangered languages have become increasingly common over the last five years (Adams et al., 2018, 2020; Ćavar et al., 2016; Foley et al., 2018, 2019; Gupta and Boulianne, 2020; Jimerson and Prud'hommeaux, 2018; Jimerson et al., 2018; Michaud et al., 2018; Mitra et al., 2016; Shi et al., 2021).
This article continues work on Yoloxóchitl Mixtec ASR (Mitra et al., 2016; Shi et al., 2021). The most recent efforts (2020 and 2021) have adopted the ESPnet toolkit for end-to-end automatic speech recognition (E2E ASR). This approach has proven to be very efficient in terms of the time needed to develop the ASR recipe (Shi et al., 2021) and, as demonstrated here, in yielding ASR hypotheses of an accuracy capable of significantly reducing the human effort needed to finalize accurately transcribed audio for permanent archiving. Section 2 discusses the Yoloxóchitl Mixtec corpora, and Section 3 explores the general goals of EL documentation. Section 4 reviews the E2E ASR experiments and corresponding results using ESPnet. The conclusion is offered in Section 5.

The language
Much work on computer-assisted EL documentation is closely related to work on low-resource languages, for the obvious reason that most ELs have limited resources, be they time-coded transcriptions, interlinearized texts, or corpora in parallel translation. The resources for Yoloxóchitl Mixtec, the language targeted in the present study, are, however, relatively abundant by EL standards (119.32 hours over three corpora), the result of over a decade of linguistic and anthropological research by Amith and Castillo García (2020). Yoloxóchitl Mixtec (henceforth YM), an endangered Mixtecan language spoken in the municipality of San Luis Acatlán, Guerrero, Mexico, is one of some 50 languages in the Mixtec language family, which is within a larger unit, Otomanguean, that Suárez (1983) considers a hyper-family or stock. Mixtec languages (spoken in Oaxaca, Guerrero, and Puebla) are highly varied, the result of approximately 2,000 years of diversification. YM is spoken in four communities: Yoloxóchitl, Cuanacaxtitlan, Arroyo Cumiapa, and Buena Vista. Mutual intelligibility among the four communities is high despite differences in phonology, morphology, and syntax.
All villages share a simple common segmental inventory but show apparently significant, though still undocumented, variation in tonal phonology; only Cuanacaxtitlan manifests tone sandhi. YMC (referring only to the Mixtec of the community of Yoloxóchitl [16.81602, -98.68597]) manifests 28 distinct tonal patterns on the 1,451 bimoraic lexical stems identified to date. The tonal patterns carry a significant functional load in both the lexicon and inflection (Palancar et al., 2016). For example, 24 distinct tonal patterns on the bimoraic segmental sequence [nama] yield 30 words (including five homophones). The three principal aspectual forms (irrealis, incompletive, and completive) are almost invariably marked by a tonal variation on the first mora of the verbal stem (1 or 3 for the irrealis, 4 for the incompletive, and 13 for the completive; in addition, 14 on the initial mora almost always indicates negation of the irrealis). In a not-insignificant number of cases, suppletive stems exist, generally manifesting variation in a stem-initial consonant and often in the stem-initial vowel.
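To make the tone-substitution pattern concrete, here is a minimal Python sketch of aspect marking on the first mora of a bimoraic verbal stem. It is illustrative only: the stem ka3ta3 is hypothetical, the irrealis can also take tone 1 on some stems, and the suppletive stems mentioned above fall outside such a simple rule.

```python
# Illustrative sketch of YMC aspect marking by tone substitution on the
# first mora of a verbal stem (hypothetical stem; real stems also show
# suppletion and stem-initial segmental changes not modeled here).
ASPECT_TONES = {
    "irrealis": "3",           # tone 1 for some stems
    "incompletive": "4",
    "completive": "13",
    "negated_irrealis": "14",  # 14 almost always marks negation of the irrealis
}

def inflect(stem_morae, aspect):
    """stem_morae: list of (segments, tone) pairs, e.g. [("ka", "3"), ("ta", "3")]."""
    morae = list(stem_morae)
    segments, _ = morae[0]
    morae[0] = (segments, ASPECT_TONES[aspect])  # tone changes on the first mora only
    return "".join(seg + tone for seg, tone in morae)

stem = [("ka", "3"), ("ta", "3")]  # hypothetical bimoraic stem ka3ta3
for aspect in ASPECT_TONES:
    print(f"{aspect}: {inflect(stem, aspect)}")
# irrealis: ka3ta3, incompletive: ka4ta3, completive: ka13ta3, negated_irrealis: ka14ta3
```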
The ample tonal inventory of YMC presents obstacles both to native speaker literacy and to an ASR system learning to convert an acoustic signal to text. It also complicates the construction of a pronunciation lexicon for HMM-based systems, a lexicon that is not required in E2E ASR. The phonological and morphological differences between YMC and the Mixtec of the three other YM communities create challenges for transcription and, by extension, for applying YMC ASR to speech recordings from these other villages. To accomplish this, it will be necessary first to learn the phonology and morphology of these variants and then use this knowledge as input to a transfer learning scenario. Intralanguage variation among distinct communities (see Hildebrandt et al., 2017b, and other articles in Hildebrandt et al., 2017a) is an additional factor that can negatively impact computer-assisted EL documentation efforts in both intra- and intercommunity contexts.

YMC-Exp:
The first corpus (YMC-Exp) accounts for 98.99 of the 119.32 total hours. It is large by the standards of the corpora available to comparable EL ASR efforts (Adams et al., 2018; Ćavar et al., 2016; Jimerson et al., 2018; Jimerson and Prud'hommeaux, 2018). This ample size has yielded lower character (CER) and word (WER) error rates than would usually occur with truly low-resource EL documentation projects.
Amith and Castillo García recorded the corpus at a 48 kHz sampling rate and 16 bits (usually with a Marantz PMD 671 recorder, Shure SM-10a dynamic headset mics, and separate channels for each speaker). The entire corpus was transcribed by Castillo García, a native speaker linguist (Castillo García, 2007).

YMC-FB:
A second YMC corpus (YMC-FB; for 'field botany') was developed during ethnobotanical fieldwork. Kenia Velasco Gutiérrez (a Spanish-speaking botanist) and Esteban Guadalupe Sierra (a native speaker from Yoloxóchitl) led 105 days of fieldwork that yielded 888 distinct plant collections. A total of 584 recordings were made in all four YM communities; only 452 were made in Yoloxóchitl, and of these, 435, totaling 15.17 hours with only three speakers, were used as a second test case for E2E ASR. Recordings were made outdoors at the plant collection site with a Zoom H4n handheld digital recorder. The Zoom H4n internal mic was used; recordings were 48 kHz, 16-bit, on a single channel with one speaker talking after another (no overlap). Each recording has a short introduction by Velasco describing, in Spanish, the plant being collected. This Spanish section has not been factored into the duration of the YMC-FB corpus, nor has it been evaluated for character and word error rates at this time (pending future implementation of a multilingual model). The processing of the 435 recordings falls into two groups.
• 257 recordings (8.36 hours) were first transcribed by a novice trainee (Esteban Guadalupe) as part of transcription training. They were corrected in a separate ELAN tier by Castillo García and then the acoustic signals were processed by E2E ASR trained on the YMC-Exp corpus. The ASR CER and WER were obtained by comparing the ASR hypotheses to Castillo's transcriptions; Guadalupe's skill level (also measured in CER and WER) was obtained by comparing his transcription to that of Castillo. The results are discussed in Table 9 of Shi et al. (2021).
• 178 recordings (6.81 hours) were processed by E2E ASR, then corrected by Castillo. This set was not used to teach or evaluate novice trainee transcription skills but only to determine CER and WER for E2E ASR with the YMC-FB corpus.
No training or validation sets were created from this YMC-FB corpus, which for the present paper was used solely to test E2E ASR efficiency using the recipe developed from the YMC-Exp corpus. CER and WER scores for YMC-FB were produced only after Castillo García used the ELAN interface to correct the ASR hypotheses for this corpus (see Appendix A for an example of ASR output).

YMC-VN:
The final corpus is a set of 24 narratives made to provide background information and off-camera voice for a documentary video. The recordings involved some speakers not represented in the YMC-Exp corpus. All recordings (5.16 hours) were made at 44.1 kHz, 16-bit, with a boom-held microphone and a Tascam portable digital recorder in a hotel room. This environment may have introduced reverberation or other effects that negatively affected ASR CER and WER.
Accessibility: All three corpora (119.32 hours) are available at the OpenSLR data portal (Amith and Castillo García, 2020).

3 Goals and challenges of corpora-based endangered language documentation

3.1 Overview
The oft-cited Boasian trilogy of grammar, dictionaries, and texts is a common foundation for EL documentation. Good (2018, p. 14) parallels this classic conception with a "Himmelmannian" trilogy of recordings, metadata, and annotations (see Himmelmann, 2018). For the purpose of the definition proposed here, EL documentation is considered to be based on the Boasian trilogy of (1) corpus, (2) lexicon (in the sense of dictionary), and (3) grammar. In turn, each element in the trilogy is molded by a series of expectations and best practices. An audio corpus, for example, would best be presented interlinearized with (a) lines corresponding to the transcription (often in a practical orthography or IPA transcription), (b) morphological segmentation (often called a 'parse'), (c) parallel glossing of each morpheme, (d) a free translation into a target, often colonial, language, and (e) metadata about recording conditions and participants. This is effectively the Himmelmannian trilogy referenced by Good. A dictionary should contain certain minimum fields (e.g., part of speech, etymology, illustrative sentences). Grammatical descriptions (books and articles) are more openly defined (e.g., a reference vs. a pedagogical grammar) and may treat only parts of the language (e.g., verb morphology).
In a best-case scenario, these three elements of the Boasian trilogy are interdependent. Corpus-based lexicography clearly requires ample interlinear glossed text (IGT) from natural speech that can be used to (a) develop concordances mapped to lemmas (not word forms); (b) enrich a dictionary by finding lemmas in the corpus that are absent from an extant set of dictionary headwords; and (c) discover patterns in the corpus suggestive of multiword lemmas (e.g., ku3-na3a4 followed by i3ni2, lit. 'darken heart' but meaning 'to faint'). A grammar will inform decisions about the morphological segmentation used in the IGT as well as part-of-speech tags and other glosses. And a grammar itself would benefit greatly from a large set of annotated natural speech recordings, not simply to provide examples of particular structures but to facilitate statistical analysis of speech patterns (e.g., for YMC, the relative frequency of completive verbs marked solely by tone vs. those marked by the prefix ni1-). This integration of elements into one "hypertextual" documentation effort is proposed by Musgrave and Thieberger (2021), who note the importance of spontaneous text (i.e., corpora, which they separate into two elements, media and text) and comment that "all examples [in the dictionary and grammar] should come from the spontaneous text and should be viewed in context" (p. 6).
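As a concrete illustration of tasks (a) and (b) above, the following sketch builds a lemma-keyed concordance from IGT-style (word form, lemma) pairs and lists lemmas missing from a set of dictionary headwords. The data layout and the toy data are hypothetical, not the project's actual tooling.

```python
# Sketch: lemma-keyed KWIC concordance and headword gap-finding from
# IGT-style data. Toy data and field layout are hypothetical.
from collections import defaultdict

def concordance(igt_lines, window=3):
    """igt_lines: list of utterances, each a list of (word_form, lemma) pairs."""
    index = defaultdict(list)
    for line in igt_lines:
        forms = [form for form, _ in line]
        for i, (form, lemma) in enumerate(line):
            left = " ".join(forms[max(0, i - window):i])
            right = " ".join(forms[i + 1:i + 1 + window])
            index[lemma].append((left, form, right))  # keyword in context
    return index

def missing_headwords(index, headwords):
    """Corpus lemmas absent from the extant dictionary headword list."""
    return sorted(set(index) - set(headwords))

igt = [[("be'3e2", "be'3e3"), ("i3in3", "i3in3")]]  # hypothetical utterance
print(missing_headwords(concordance(igt), {"i3in3"}))  # -> ["be'3e3"]
```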
Documentation of YMC has proceeded on the assumption that the hypertextual integration suggested by Musgrave and Thieberger is central to effective endangered language documentation based on natural speech and that textual transcription of multimedia recordings of natural speech is, therefore, the foundation for a dictionary and grammar based on actual language use. End-to-end ASR is used to rapidly increase corpus size while offering the opportunity to target certain genres (such as expert conversations on the nomenclature, classification, and use of local flora and fauna; ritual discourse; material cultural production; techniques for fishing and hunting) that are of ethnographic interest but are often insufficiently covered in EL documentation projects that struggle to produce large and varied corpora. With the human effort-reducing advances in ASR for YMC presented in this paper, such extensive targeted recording of endangered cultural knowledge can now easily be included in the documentation effort.
The present paper focuses on end-to-end automatic speech recognition using the ESPnet toolkit (Guo et al., 2020; Shi et al., 2021; Watanabe et al., 2017, 2018, 2020). The basic goal is simple: to develop computational tools that reduce the amount of human effort required to produce accurate transcriptions in time-coded interlinearized format that will serve a wide range of potential stakeholders, from native and heritage speakers to specialized academics in institutions of higher learning, in present and future generations. The evaluation metric, therefore, is not intrinsic (e.g., reduced CER and WER) but rather extrinsic: the impact of ASR on the downstream task of creating a large and varied corpus of Yoloxóchitl Mixtec.

3.2 Challenges to ASR of endangered languages
ASR for endangered languages is made difficult not simply by the limited resources available for training a robust system but also by a series of factors briefly discussed in this section.
Recording conditions: Noisy environments, including overlapping speech, reverberation in indoor recordings, natural sounds in outdoor recordings, less than optimal microphone placement (e.g., a boom mic in video recordings), and failure to separately mike speakers for multichannel recordings all negatively impact the accuracy of ASR output. Also to the point, field recordings are seldom made with an eye to seeding a corpus in ways that would specifically benefit ASR results (e.g., recording a large number of speakers for shorter durations rather than fewer speakers for longer times). To date, then, processing a corpus with ASR techniques of any nature (HMM, end-to-end) has been more of an afterthought than something planned at the beginning of a project. Development of a corpus from the beginning with an eye to its subsequent ASR potential would be immensely helpful to these computational efforts; it could, perhaps should, be increasingly considered in initial project design. Indeed, just as funding agencies such as NSF require that projects address data management issues, it might be worth asking that proposals address how to make documentation materials more amenable to ASR and NLP processing as machine learning technologies become more robust.
Colonialization of language: Endangered languages do not die, to paraphrase Dorian (1978), with their "boots on." Rather, in the colonialized situation in which most ELs are immersed, there are multiple phonological, morphological, and syntactic influences from a dominant language. The incidence of a colonial language in native language recordings runs the gamut from multilanguage situations (e.g., each speaker using a distinct language, as often occurs in elicitation sessions: 'How would you translate ___ into Mixtec?') to code-switching and borrowing or relexification in the speech of single individuals. In some languages (e.g., Nahuatl), a single word may easily combine stems from both native and colonial languages. Preliminary, though not quantified, CER analysis for YMC ASR suggests that Spanish-origin words provoke a significantly higher error rate than the YMC lexicon uninfluenced by Spanish. It is also not clear that a multilingual phone recognition system is the solution to character errors (such as the ASR hypothesis 'cereso' for Spanish 'cerezo') that may derive from an orthographic system, such as that of Spanish, that is not designed, as many EL orthographies are, for consistency. Phonological shifts in borrowed terms also preclude the simple application of lexical tools to correct misspellings (e.g., 'agustu' for the Spanish month 'agosto').
Orthographic conventions: The practical deep orthography developed by Amith and Castillo García marks off the boundaries of affixes (with a hyphen) and clitics (with an = sign). Tones are indicated by superscript numbers, from 1 (low) to 4 (high), with five common rising and falling tones. Stem-final elided tones are enclosed in parentheses (e.g., underlying form be'3e(3)=2, house=1sgPoss, 'my house'; surface form be'3e2). Tone-based inflectional morphology is not separated in any YMC transcriptions. The transcription strategy for YMC was unusual in that the practical orthography was a deep, underlying system that represented segmental morpheme boundaries and showed elided tones in parentheses. The original plan of Amith and Castillo García was to use the transcribed audio as primary data for a corpus-based dictionary. A deep orthography facilitates discovery (without recourse to a morphological analyzer) of lemmas that may be altered in surface pronunciation by the effect of person-marking enclitics and certain common verbal prefixes (see Shi et al., 2021, §2.3).
Only after documentation (recording and time-coded transcription) was well advanced did work begin on a finite state transducer (FST) for the YMC corpus. This was made possible by collaboration with another NSF-DEL sponsored project. The code was written by Jason Lilley in consultation with Amith and Castillo García. As the Foma FST was being built, its output was repeatedly checked against expectations based on the morphological grammar until no discrepancies were noted. The FST, however, only generates surface forms consistent with Castillo García's grammar. If speakers vary, for example, in the extent of vowel harmonization or regressive nasalization, the FST yields only one surface form, that suggested by Castillo García to be the most common. For example, underlying be'3e(3)=an4 (house=3sgFem; 'her house') surfaces as be'3ã4 even though for some speakers nasalization spreads to the stem-initial vowel. Note, then, that the surface forms in the YMC-Exp corpus are based on FST generation from an underlying transcription as input, not on direct transcription of the acoustic signal. Because different speakers may extend vowel harmonization or nasalization leftward to different degrees, this could increase the CER and WER for ASR of surface forms, given that the reference for evaluation is not directly derived from the acoustic signal while the ASR hypothesis is so derived.
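The sketch below is a toy stand-in for a single rule of the underlying-to-surface mapping, not the project's actual Foma grammar: the parenthesized elided stem-final tone is deleted and the tonal enclitic docks in its place, so be'3e(3)=2 surfaces as be'3e2. Nasalization spreading (the =an4 case above), vowel harmonization, and the rest of the grammar are not modeled.

```python
# Toy underlying-to-surface rule (NOT the project's Foma FST): delete the
# parenthesized elided stem-final tone, then remove the clitic boundary so
# the enclitic tone docks on the stem-final vowel.
import re

def surface(underlying):
    s = re.sub(r"\(\d+\)", "", underlying)  # drop the elided tone, e.g. "(3)"
    return s.replace("=", "")               # remove the clitic boundary

print(surface("be'3e(3)=2"))  # -> be'3e2 ('my house')
```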
In an evaluation across the YMC-Exp development and test sets (totaling 6.53 hours) of the relative accuracy of ASR when using underlying versus surface orthography, it was found that training on underlying orthography produced slightly greater accuracy than training on surface forms: underlying = 7.7/16.0 (CER/WER) compared to surface = 7.8/16.5 (Shi et al., 2021, Table 4). The decision to use underlying representations in ASR training has, however, several more important advantages. First, for native speakers, the process of learning a deep practical orthography means that one learns segmental morphology as one learns to write. For the purposes of YMC language documentation, the ability of a neural network to directly learn segmental morphology as part of ASR training has resulted in YMC ASR output across all three corpora with affixes and clitics separated and stem-final elided tones marked in parentheses. Semi- or unsupervised morphological learning as a separate NLP task is unnecessary given that ASR training and testing were successfully carried out on a corpus with basic morphological segmentation. As the example in Appendix A demonstrates, ASR output includes basic segmentation at the morphological level.

3.3 Intrinsic metrics: CER, WER, and consistency in transcriptions used as reference
Although both CER and WER reference "error rate" with regard to characters and words, respectively, the question of the accuracy of the reference itself is rarely explored (but cf. Saon et al., 2017). For YMC, only one speaker, Castillo García, is capable of accurate transcription, which for YMC is the sole gold standard for ASR training, validation, and testing. There is thus a consistency to the transcription used as reference. In comparison, the situation for Highland Puebla Nahuat (another language that the present team is exploring) is distinct. Three native speaker experts have worked with Amith on transcription for over six years, but the reference for ASR development is a set of native-speaker transcriptions carefully proofed by Amith, a process that both corrected simple errors and applied a single standard implemented by one researcher. When all three native speaker experts were asked to transcribe the same 90 minutes of recordings and the results were compared, there was a not insignificant level of variation (approximately 9%).
The aforementioned scenario suggests the impact on intrinsic ASR metrics of variation in transcriptions across multiple annotators, or even of inconsistencies by one skilled annotator in the context of an incipient writing system. This affects not only ASR output but also the evaluation of ASR accuracy via character and word error rates. It may be advisable to speak not of character and word error rates but of character and word discrepancy rates, a change in terminology that perhaps better communicates the idea that the differences between REF and HYP are often as much a matter of opinion as of fact. The nature and value of utilizing intrinsic metrics (e.g., CER and WER) for evaluating ASR effectiveness for endangered language documentation merits rethinking.
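Whatever one calls the resulting rate, the underlying computation is the same. The sketch below shows CER and WER as Levenshtein (edit) distance normalized by reference length; production scorers bundled with ASR toolkits add alignment bookkeeping, but this is the core.

```python
# Minimal CER/WER: edit distance between reference and hypothesis,
# normalized by reference length (in characters or in words).
def edit_distance(ref, hyp):
    d = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                   d[j - 1] + 1,      # insertion
                                   prev + (r != h))   # substitution (0 if match)
    return d[len(hyp)]

def cer(ref, hyp):
    return 100.0 * edit_distance(list(ref), list(hyp)) / len(ref)

def wer(ref, hyp):
    ref_w, hyp_w = ref.split(), hyp.split()
    return 100.0 * edit_distance(ref_w, hyp_w) / len(ref_w)

print(wer("i4 tu1 tu'4un4", "i4 tu1 tu4un4"))  # one word differs -> 33.33...
```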
An additional factor that has emerged in the YMC corpora, which contain very rapid speech, is what may be called "hypercorrection." This is not uncommon and may occur with lenited forms (e.g., writing ndi1ku4chi4 when close examination of the acoustic signal reveals that the speaker used the fully acceptable lenited form ndiu14chi4) or when certain function words are reduced, at times effectively disappearing from the acoustic signal though not from the mind of a fluent-speaker transcriber. In both cases, ASR "errors" might represent a more accurate rendering of the acoustic signal than the transcription of even the most highly capable native speakers.
The above discussion also brings into question what it means for an ASR system to achieve human parity. Parity could perhaps best be considered not on the basis of CER and WER alone but on whether ASR output achieves an error rate in these two measurements lower than what another skilled human transcriber might achieve.

3.4 Extrinsic metrics: Reduction of human effort as a goal for automatic speech recognition
Given the nature of EL documentation, which requires high levels of accuracy if the corpus is to be easily used for future linguistic research, it is essential that ASR-generated hypotheses be reviewed by an expert human annotator before permanent archiving. Certainly, audio can be archived with metadata alone or with unchecked ASR transcriptions (see Michaud et al., 2018, §4.3 and 4.4), but the workflow envisioned for YMC is to use ASR to reduce human effort while the archived corpus of audio and text maintains results equivalent to those that would be obtained by careful, and labor-intensive, expert transcription. CER and WER were measured for YMC using training sets of 10, 20, 50, and 92 hours. The CER/WER were as follows: 19.5/39.2 (10 hrs.), 12.7/26.2 (20 hrs.), 10.2/24.9 (50 hrs.), and 7.7/16.1 (92 hrs.) (Shi et al., 2021, Table 5). Measurement of human effort reduction suggests that with a corpus of 30-50 hours, even for a relatively challenging language such as YMC, E2E ASR can achieve a level of accuracy that allows a reduction of human effort by more than 75 percent (e.g., from approximately 40 to 10 hours). Starting from the acoustic signal, Castillo García, a native speaker linguist, requires approximately 40 hours to transcribe 1 hour of YMC audio. Starting from initial ASR hypotheses incorporated into ELAN, this is reduced by approximately 75 percent, to about 10 hours of effort to produce one finalized hour of time-coded transcription with marked segmentation of affixes and enclitics.
These totals are derived from measurements with the FB and VN corpora, the two corpora for which ASR provided the initial transcription and Castillo García then corrected the output, keeping track of the time he spent. For the first corpus, Castillo García required 58.20 hours to correct 6.65 hours of audio (173 of the 178 files that had not been first transcribed by a speaker trainee). This yields 8.76 hours of effort per hour of recording. The 5.16 hours (in 24 files) of the VN corpus required 53.07 hours to correct, a ratio of 10.28 hours of effort to finalize 1 hour of speech. Over the entire set of 197 files (11.81 hours), human effort was 111.27 hours, or 9.42 hours to correct 1 hour of audio. Given that the ASR system was trained on an underlying orthography, the final result of fewer than 10 hours of human effort per hour of audio is a transcribed and partially parsed corpus. Table 3 presents an analysis of two lines of a recording that was first processed by E2E ASR and then corrected by Castillo García. A fuller presentation and analysis are offered in the Appendix. This focus on extrinsic metrics reflects the realization that the ultimate goal of computational systems is not to achieve the lowest CER and WER but to help documentation initiatives more efficiently produce results that will benefit future stakeholders.

In practice, E2E ASR systems are less affected by linguistic constraints and are generally easier to train. The benefits of such systems are reflected in the recent trend of using end-to-end ASR for EL documentation (Adams et al., 2020; Thai et al., 2020; Matsuura et al., 2020; Hjortnaes et al., 2020; Shi et al., 2021).
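The arithmetic behind the effort ratios above is easy to verify; the snippet below recomputes them from the stated figures (agreeing with the reported values up to rounding, e.g., 8.75 vs. the reported 8.76 for FB).

```python
# Recomputing the effort ratios reported above from the stated figures.
corpora = {"FB": (58.20, 6.65), "VN": (53.07, 5.16)}  # (hours of effort, hours of audio)

total_effort = sum(e for e, _ in corpora.values())    # 111.27
total_audio = sum(a for _, a in corpora.values())     # 11.81
for name, (effort, audio) in corpora.items():
    print(f"{name}: {effort / audio:.2f} hours of effort per hour of audio")
print(f"Overall: {total_effort / total_audio:.2f} hours per hour")  # 9.42

baseline = 40.0  # approximate hours to transcribe 1 hour of YMC from scratch
print(f"Effort reduction: {100 * (1 - total_effort / total_audio / baseline):.0f}%")  # ~76%
```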

Experimental results
Experimental results are presented in two subsections. The first addresses the performance of end-to-end ASR across three corpora, each with slightly different recording systems and content. As is clear from the preceding discussion and illustrated in Table 2, in addition to training on the word unit, the YMC E2E ASR system was trained on six additional linguistic and informational sub-word units. ROVER was then used to produce composite systems in which the outputs of all seven systems were combined in three distinct manners. In all cases, ROVER combination improved on the result of any individual system, including the averages for either of the two types of units, linguistic and informational. (Those interested in the recordings and associated ELAN files may visit Amith and Castillo García, 2020.)
ASR and ROVER across three YMC corpora: As evident in Table 2, across all corpora, informational units (BPE) are more effective than linguistic units (word, morpheme, morae) in regard to ASR accuracy. The average CER/WER for linguistic units (rows A-C) was 10.4/19.5 (Exp[test]), 13.6/23.3 (FB), and 10.7/21.7 (VN). The corresponding figures for the BPE units (rows D-G) were 7.7/16.0 (Exp[test]), 9.7/19.5 (FB), and 6.8/16.8 (VN). In terms of percentage differences between the two types of units, the numbers are not insignificant. In regard to CER, performance improved from linguistic to informational units by 26.0, 28.7, and 36.4 percent across the Exp(test), FB, and VN corpora. In regard to WER, performance improved by 17.9, 16.3, and 22.6 percent across the same three corpora.
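For readers unfamiliar with BPE, the toy sketch below shows how such informational units are built: the most frequent adjacent symbol pair is merged repeatedly, so frequent character sequences become single units regardless of morpheme boundaries. This shows only the unit-building principle, not the actual training pipeline, which operated at corpus scale with standard sub-word tooling; the two words are drawn from the hypercorrection example above.

```python
# Toy BPE: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def merge_pair(word, pair):
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1]); i += 2
        else:
            out.append(word[i]); i += 1
    return out

def bpe_merges(words, num_merges):
    corpus = [list(w) for w in words]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(p for w in corpus for p in zip(w, w[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = [merge_pair(w, best) for w in corpus]
    return merges, corpus

merges, segmented = bpe_merges(["ndi1ku4chi4", "ndiu14chi4"], 4)
print(merges)     # the learned merge operations
print(segmented)  # the words segmented into BPE units
```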
The experiments also addressed two remaining questions: (1) does unweighted ROVER combination improve the accuracy of ASR results? and (2) does adding linguistic-unit systems to the ROVER "voting pool" improve results over a combination of only BPE units? In regard to the first question, ROVER always improves results over any individual system (compare row H to rows A, B, and C, and row I to rows D, E, F, and G). The second question is addressed by comparing row I (ROVER applied only to the four BPE results) to row J (which adds the ASR results for the three linguistic units into the combination). In only one of the six cases (CER of Exp[test]) does including word, morpheme, and morae units lower the error rate relative to a simple combination of the four BPE results (in this case from 7.6 [row I] to 7.4 [row J]). In one case there is no change (CER for the VN corpus), and in four cases including linguistic units slightly worsens the score relative to the combination of BPE units alone (row I with bold numbers). The implication is that ASR using linguistic units yields significantly lower accuracy than ASR using informational (BPE) units, and combining the former with the latter in an unweighted ROVER system in most cases does not improve results. Whether a weighted combinatory system would do better is a question that remains to be explored.
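ROVER proper first aligns the hypotheses into a word transition network and then votes slot by slot. The sketch below skips the alignment step entirely (hypotheses are assumed pre-aligned, with "" marking a gap) and shows only the unweighted voting that the combinations above rely on; it is a simplification, not the tool itself, and the toy hypotheses echo the Appendix A example.

```python
# Unweighted, slot-wise majority voting over pre-aligned ASR hypotheses
# (real ROVER also handles alignment and optional confidence weighting).
from collections import Counter

def rover_vote(aligned_hyps):
    combined = []
    for slot in zip(*aligned_hyps):
        word, _ = Counter(slot).most_common(1)[0]
        if word:                       # a winning "" (gap) emits nothing
            combined.append(word)
    return " ".join(combined)

hyps = [                               # aligned outputs of three systems
    ["ya1", "mi4", "i4", "tu1"],
    ["ya1", "mi4", "i4", "tu1"],
    ["ya1", "mi1", "i4", ""],
]
print(rover_vote(hyps))  # -> "ya1 mi4 i4 tu1"
```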

5 Conclusion
A fundamental element of endangered language documentation is the creation of an extensive corpus of audio recordings accompanied by time-coded annotations in interlinear format. In the best of cases, such annotations include an accurate transcription aligned with morphological segmentation, glossing, and free translations. The degree to which such corpus creation is facilitated is the extrinsic metric by which ASR contributions to EL documentation should be judged. The project discussed here suggests a path to creating such corpora with end-to-end ASR technology: first build up the resources (30-50 hours) necessary to train an ASR system with perhaps a 6-10 percent CER. Once this threshold is reached, it is unlikely that further improvement will significantly reduce the human effort needed to check the ASR output for accuracy. Indeed, even if there were no "errors" in the ASR output, confirming this through careful review of the transcription against the recording would probably still take 3-4 hours. The effort reduction of 75 percent documented here for YMC is, therefore, approaching what may be considered the minimum amount of time needed to proofread a transcription of natural speech in an endangered language.
This project has also demonstrated the advantage of using a practical orthography that separates affixes and clitics. In a relatively isolating language such as YM, such a system is not difficult for native speakers to write nor for ASR systems to learn. It has the advantage of creating a workflow in which parsed text is the direct output of E2E ASR. The error rate evaluations across the corpora also demonstrate the advantage of using sub-word units such as BPE and of subsequent processing by ROVER for system combination (see above and Table 2). The error rates could perhaps be lowered further as the corpus increases in size, as more care is placed on recording environments, and as normalization eliminates reported errors for minor discrepancies such as the transcription of back-channel cues. But such lower error rates will probably not significantly reduce the time needed for final revision.
A final question concerns additional steps once CER is reduced to 6-8 percent and additional improvements to ASR would not significantly affect the human effort needed to produce a high-quality time-coded transcription and segmentation. Four topics are suggested: (1) addressing issues of noise, overlapping speech, and other challenging recording situations; (2) focusing on transfer learning to related languages; (3) exploring the impact of "colonialization" by a dominant language; and (4) focusing additional ASR-supported corpus development on producing material for the documentation of endangered cultural knowledge, a facet of documentation that is often absent from endangered language documentation projects.

A Analysis of ASR errors in one recording from the FB corpus

Unique identifier: 2017-12-01-b
Speakers: Constantino Teodoro Bautista and Esteban Guadalupe Sierra
Spanish: The first 13 seconds (3 segments) of the recording were of a Spanish speaker describing the plant being collected (Passiflora biflora Lam.) and have not been included below.
Note: A total of 16 out of 33 segments/utterances are without ASR error. These are marked with an asterisk.
Original recording and ELAN file: Download at http://www.balsas-nahuatl.org/NLP

4*. 00:00:13.442 -> 00:00:17.105
ASR: constantino teodoro bautista
Exp: Constantino Teodoro Bautista.
Notes: ASR does not output caps or punctuation.

5*. 00:00:17.105 -> 00:00:19.477
ASR: ya1 mi4 i4 tu1 tu'4un4 ku3rra42
Exp: Ya1 mi4 i4 tu1 tu'4un4 ku3rra42
Notes: No errors in the ASR hypothesis.