ChrEn: Cherokee-English Machine Translation for Endangered Language Revitalization

Cherokee is a highly endangered Native American language spoken by the Cherokee people. The Cherokee culture is deeply embedded in its language. However, there are only about 2,000 fluent first-language Cherokee speakers remaining in the world, and the number is declining every year. To help save this endangered language, we introduce ChrEn, a Cherokee-English parallel dataset, to facilitate machine translation research between Cherokee and English. Compared to popular machine translation language pairs, ChrEn is extremely low-resource, containing only 14k sentence pairs in total. We split our parallel data in ways that facilitate both in-domain and out-of-domain evaluation. We also collect 5k Cherokee monolingual sentences to enable semi-supervised learning. Besides these datasets, we propose several Cherokee-English and English-Cherokee machine translation systems. We compare SMT (phrase-based) versus NMT (RNN-based and Transformer-based) systems; supervised versus semi-supervised (via language model, back-translation, and BERT/Multilingual-BERT) methods; as well as transfer learning versus multilingual joint training with 4 other languages. Our best results are 15.8/12.7 BLEU for in-domain and 6.5/5.0 BLEU for out-of-domain Chr-En/En-Chr translations, respectively, and we hope that our dataset and systems will encourage future work by the community for Cherokee language revitalization. Our data, code, and demo will be publicly available at https://github.com/ZhangShiyue/ChrEn


Introduction
The Cherokee people are one of the indigenous peoples of the United States. Before the 1600s, they lived in what is now the southeastern United States (Peake Raymond, 2008). Today, there are three federally recognized nations of Cherokee people: the Eastern Band of Cherokee Indians (EBCI), the United Keetoowah Band of Cherokee Indians (UKB), and the Cherokee Nation (CN). The Cherokee language, the language spoken by the Cherokee people, contributed to the survival of the Cherokee people and was historically the basic medium of transmission of arts, literature, traditions, and values (Nation, 2001; Peake Raymond, 2008). However, according to the Tri-Council Res. No. 02-2019, there are only 2,000 fluent first-language Cherokee speakers left, and each Cherokee tribe is losing fluent speakers at a faster rate than new speakers are developed. UNESCO has identified the dialect of Cherokee in Oklahoma as "definitely endangered" and the one in North Carolina as "severely endangered". Language loss is the loss of culture. CN started a 10-year language revitalization plan (Nation, 2001) in 2008, and the Tri-Council of Cherokee tribes declared a state of emergency in 2019 to save this dying language.

Table 1: An example from ChrEn. Src: ᎥᏝ ᎡᎶᎯ ᎠᏁᎯ ᏱᎩ, ᎾᏍᎩᏯ ᎠᏴ ᎡᎶᎯ ᎨᎢ ᏂᎨᏒᎾ ᏥᎩ. Ref: They are not of the world, even as I am not of the world. SMT: It was not the things upon the earth, even as I am not of the world. NMT: I am not the world, even as I am not of the world.
To revitalize Cherokee, language immersion programs are provided in elementary schools, and second language programs are offered in universities. However, students have difficulty finding exposure to this language beyond school hours (Albee, 2017). This motivates us to build English (En) to Cherokee (Chr) machine translation systems so that we can automatically translate, or aid human translators in translating, English materials into Cherokee. Chr-to-En translation is also highly meaningful in helping spread Cherokee history and culture. Therefore, in this paper, we contribute to Cherokee revitalization by constructing a clean Cherokee-English parallel dataset, ChrEn, which results in 14,151 pairs of sentences with around 313K English tokens and 206K Cherokee tokens. We also collect 5,210 Cherokee monolingual sentences with 93K Cherokee tokens. Both datasets are derived from bilingual or monolingual materials that were translated or written by first-language Cherokee speakers; we then manually aligned and cleaned the raw data. Our datasets contain texts of two Cherokee dialects (Oklahoma and North Carolina) and diverse text types (e.g., sacred text, news). To facilitate the development of machine translation systems, we split our parallel data into five subsets: Train/Dev/Test/Out-dev/Out-test, in which Dev/Test and Out-dev/Out-test are for in-domain and out-of-domain evaluation, respectively. See an example from ChrEn in Table 1 and the detailed dataset description in Section 3.
The translation between Cherokee and English is not easy because the two languages are genealogically disparate. As shown in Figure 1, Cherokee is the sole member of the southern branch of the Iroquoian language family and is unintelligible to other Iroquoian languages, while English is from the West Germanic branch of the Indo-European language family. Cherokee uses a unique 85-character syllabary invented by Sequoyah in the early 1820s, which is highly different from English's alphabetic writing system. Cherokee is a polysynthetic language, meaning that words are composed of many morphemes that each have independent meanings. A single Cherokee word can express the meaning of several English words, e.g., ᏫᏓᏥᏁᎩᏏ (widatsinegisi), or I am going off at a distance to get a liquid object. Since the semantics are often conveyed by the rich morphology, the word order of Cherokee sentences is variable. There is no "basic word order" in Cherokee, and most word orders are possible (Montgomery-Anderson, 2008), while English generally follows the Subject-Verb-Object (SVO) word order. Plus, verbs comprise 75% of Cherokee words, compared to only 25% for English (Feeling, 1975, 1994). Hence, to develop translation systems for this low-resource and distant language pair, we investigate various machine translation paradigms and propose phrase-based (Koehn et al., 2003) Statistical Machine Translation (SMT) and RNN-based (Luong et al., 2015) or Transformer-based (Vaswani et al., 2017) Neural Machine Translation (NMT) systems for both Chr-En and En-Chr translations, as important starting points for future work.
We apply three semi-supervised methods: using additional monolingual data to train the language model for SMT (Koehn and Knowles, 2017); incorporating BERT (or Multilingual-BERT) (Devlin et al., 2019) representations for NMT (Zhu et al., 2020), where we introduce four different ways to use BERT; and the back-translation method for both SMT and NMT (Bertoldi and Federico, 2009; Lambert et al., 2011; Sennrich et al., 2016b). Moreover, we explore the use of existing X-En parallel datasets of 4 other languages (X = Czech/German/Russian/Chinese) to improve Chr-En/En-Chr performance via transfer learning (Kocmi and Bojar, 2018) or multilingual joint training (Johnson et al., 2017).
Empirically, NMT is better than SMT for in-domain evaluation, while SMT is significantly better under the out-of-domain condition. RNN-NMT consistently performs better than Transformer-NMT. Semi-supervised learning improves supervised baselines in some cases (e.g., back-translation improves out-of-domain Chr-En NMT by 0.9 BLEU). Even though Cherokee is not related to any of the 4 languages (Czech/German/Russian/Chinese) in terms of their language family trees, surprisingly, we find that both transfer learning and multilingual joint training can improve Chr-En/En-Chr performance in most cases. Especially, transferring from Chinese-English achieves the best in-domain Chr-En performance, and joint learning with English-German obtains the best in-domain En-Chr performance. The best results are 15.8/12.7 BLEU for in-domain Chr-En/En-Chr translations; and 6.5/5.0 BLEU for out-of-domain Chr-En/En-Chr translations. Finally, we conduct a 50-example human (expert) evaluation; however, the human judgment does not correlate with BLEU for the En-Chr translation, indicating that BLEU is possibly not a suitable metric for Cherokee evaluation. Overall, we hope our work will attract attention from the NLP community in helping to save and revitalize this endangered language. An initial version of our data and its implications was introduced in Frey (2020).

Related Work

Note that we are not the first to propose a Cherokee-English parallel dataset: Chr-En parallel data is available on OPUS (Tiedemann, 2012). The main difference is that our parallel data contains 99% of their data and has 6K more examples from diverse domains.
Low-Resource Machine Translation. Even though machine translation has been studied for several decades, the majority of the initial research effort was on high-resource translation pairs, e.g., French-English, that have large-scale parallel datasets available. However, most language pairs in the world lack large-scale parallel data. In the last five years, there has been increasing research interest in these low-resource translation settings. DARPA's LORELEI language packs contain the monolingual and parallel texts of three dozen languages that are considered low-resource (Strassel and Tracey, 2016). The Bible has also been compiled into a massively parallel corpus covering 100 languages (Christodouloupoulos and Steedman, 2015). Because not many low-resource parallel datasets were publicly available, some low-resource machine translation research was done by sub-sampling high-resource language pairs (Johnson et al., 2017; Lample et al., 2018), but this may downplay the fact that low-resource translation pairs are usually distant languages. Our ChrEn dataset can not only be another open resource for low-resource MT research but can also challenge MT methods with an extremely morphologically rich language and a distant language pair. Two methods have been largely explored by existing works to improve low-resource MT. One is semi-supervised learning that uses monolingual data (Gulcehre et al., 2015; Sennrich et al., 2016b). The other is cross-lingual transfer learning or multilingual joint learning (Kocmi and Bojar, 2018; Johnson et al., 2017). We explore both of them to improve Chr-En/En-Chr translations.

Table 2: The key statistics of our parallel and monolingual data. Note that "% Unseen unique English tokens" is in terms of the Train split; for example, 13.3% of unique English tokens in Dev are unseen in Train.

Data Description
It is not easy to collect substantial data for endangered Cherokee. We obtain our data from bilingual or monolingual books and newspaper articles that are translated or written by first-language Cherokee speakers. In the following, we will introduce the data sources and the cleaning procedure and give detailed descriptions of our data statistics.

Parallel Data
Fifty-six percent of our parallel data is derived from the Cherokee New Testament. Other texts are novels, children's books, newspaper articles, etc. These texts vary widely in dates of publication, the oldest being dated to 1860. Additionally, our data encompasses both existing dialects of Cherokee: the Overhill dialect, mostly spoken in Oklahoma (OK), and the Middle dialect, mostly used in North Carolina (NC). These two dialects are mainly phonologically different and have only a few lexical differences (Uchihara, 2016). In this work, we do not explicitly distinguish them during translation. The left pie chart of Figure 2 shows the parallel data distributions over text types and dialects; the complete information is in Appendix A.1. We applied optical character recognition (OCR) to extract the texts, and a co-author, a proficient second-language speaker of Cherokee, manually aligned the sentences and fixed the errors introduced by OCR. This process is time-consuming and took several months. The resulting dataset consists of 14,151 sentence pairs. After tokenization, there are around 313K English tokens and 206K Cherokee tokens in total, with 14K unique English tokens and 38K unique Cherokee tokens. Notably, the Cherokee vocabulary is much larger than the English one because of its morphological complexity. This poses a significant challenge to machine translation systems because many Cherokee tokens are infrequent. To facilitate machine translation system development, we split this data into training, development, and testing sets. As our data stems from limited sources, we find that if we randomly split the data, some phrases/sub-sentences are repeated in the training and evaluation sets, so trained models overfit to these frequent patterns. Considering that low-resource translation is usually accompanied by out-of-domain generalization in real-world applications, we provide two groups of development/testing sets.
We separate all the sentence pairs from newspaper articles, 512 pairs in total, and randomly split them in half as out-of-domain development and testing sets, denoted by Out-dev and Out-test. The remaining sentence pairs are randomly split into in-domain Train, Dev, and Test. About 13.3% of unique English tokens and 37.7% of unique Cherokee tokens in Dev have not appeared in Train, while the percentages are 42.1% and 67.5% for Out-dev, which shows the difficulty of the out-of-domain generalization. Table 2 contains more detailed statistics; notably, the average sentence length of Cherokee is much shorter than English, which demonstrates that the semantics are morphologically conveyed in Cherokee.
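The unseen-token statistics above come from a simple vocabulary comparison between splits; a minimal sketch, using made-up toy sentences in place of the actual Train/Dev data:

```python
# Sketch of the "% unseen unique tokens" statistic reported above,
# computed over whitespace-tokenized sentences. The toy Train/Dev
# sentences below are hypothetical stand-ins for the real splits.
def unseen_token_rate(train_sents, eval_sents):
    train_vocab = {tok for sent in train_sents for tok in sent.split()}
    eval_vocab = {tok for sent in eval_sents for tok in sent.split()}
    return len(eval_vocab - train_vocab) / len(eval_vocab)

train = ["the cat sat", "the dog ran"]
dev = ["the cat ran fast"]
print(unseen_token_rate(train, dev))  # "fast" is unseen -> 0.25
```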
Note that Cherokee-English parallel data is also available on OPUS (Tiedemann, 2012), which has 7.9K unique sentence pairs, 99% of which is Cherokee New Testament text that is also included in our parallel data, i.e., our data is bigger and has 6K more sentence pairs that are not sacred texts (novels, news, etc.). A detailed comparison is given in Appendix A.2.

Monolingual Data
In addition to the parallel data, we also collect a small amount of Cherokee monolingual data, 5,210 sentences in total. This data is also mostly derived from Cherokee monolingual books. As depicted by the right pie chart in Figure 2, the majority of the monolingual data is also sacred text, namely the Cherokee Old Testament, and it likewise contains texts in both dialects. Complete information is in Table 15 of Appendix A.1. Similarly, we applied OCR to extract these texts. However, we only manually corrected the major errors introduced by OCR; thus, our monolingual data is noisy and contains some lexical errors. As shown in Table 2, there are around 93K Cherokee tokens in total, with 20K unique Cherokee tokens. This monolingual data has a very small overlap with the parallel data; about 72% of its unique Cherokee tokens are unseen in the whole parallel dataset. Note that most of our monolingual data has English translations, i.e., it could be converted to parallel data, but this requires more effort from Cherokee speakers and will be part of our future work. For now, we show how to effectively use this monolingual data for semi-supervised gains.

Models
In this section, we introduce our Cherokee-English and English-Cherokee translation systems. Adopting best practices from low-resource machine translation works, we propose both Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) systems, and for NMT, we test both RNN-based and Transformer-based models. We apply three semi-supervised methods: training the language model with additional monolingual data for SMT (Koehn and Knowles, 2017), incorporating BERT or Multilingual-BERT representations into NMT (Zhu et al., 2020), and back-translation for both SMT and NMT (Bertoldi and Federico, 2009; Sennrich et al., 2016b). Further, we explore transfer learning (Kocmi and Bojar, 2018) from and multilingual joint training (Johnson et al., 2017) with 4 other languages (Czech/German/Russian/Chinese) for NMT.

SMT
Supervised SMT. SMT was the mainstream of machine translation research before neural models emerged. Even though NMT has achieved state-of-the-art performance on many translation tasks, SMT is still very competitive under low-resource and out-of-domain conditions (Koehn and Knowles, 2017). Phrase-based SMT is the dominant paradigm of SMT (Koehn et al., 2003). It first learns a phrase table from the parallel data that maps source phrases to target phrases. Then, a reordering model learns to reorder the translated phrases. During decoding, a scoring model scores candidate translations by combining the weights from the translation, reordering, and language models, and it is tuned by maximizing translation performance on the development set. A simple illustration of SMT is shown in Figure 3. Note that, as Cherokee and English have different word orders (English follows SVO; Cherokee has variable word orders), one Cherokee phrase could be translated into two English words that are far apart in the sentence. This increases the difficulty for SMT, which relies on phrase correspondence and is not good at distant word reordering (Zhang et al., 2017). We implement our SMT systems with Moses (Koehn et al., 2007).
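The scoring model's weighted combination can be sketched as a log-linear interpolation; all feature values and weights below are invented for illustration (Moses tunes the real weights on the development set):

```python
# Toy sketch of log-linear candidate scoring in phrase-based SMT:
# each candidate gets log-scores from the translation, reordering,
# and language models, combined with tuned weights. All numbers
# below are made up for illustration.
def score(features, weights):
    return sum(weights[name] * value for name, value in features.items())

weights = {"translation": 1.0, "reordering": 0.5, "lm": 0.8}
candidates = {
    "even as I am not of the world": {"translation": -2.0, "reordering": -0.5, "lm": -1.0},
    "of the world I am not even as": {"translation": -2.0, "reordering": -3.0, "lm": -4.0},
}
best = max(candidates, key=lambda c: score(candidates[c], weights))
print(best)  # the fluent, well-ordered candidate wins
```

With equal translation scores, the candidate with the better reordering and language-model scores is preferred, which is exactly where distant word reordering hurts.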

Semi-Supervised SMT. Previous works have shown that SMT can be improved by two semi-supervised methods: (1) a big language model (Koehn and Knowles, 2017), i.e., a language model trained with large target-side monolingual data; (2) synthesizing bilingual data by back-translating monolingual data (Bertoldi and Federico, 2009; Lambert et al., 2011). Using our Cherokee monolingual data and publicly available English monolingual data, we test both methods. For the first method, we use both parallel and monolingual data to train the language model; for the second, we back-translate target-language monolingual data into the source language and then combine the result with the training set to retrain a source-target SMT model.
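The back-translation step can be sketched as below, with a stub standing in for the trained target-to-source model:

```python
# Sketch of back-translation data augmentation: a target->source model
# translates monolingual target sentences, and the resulting synthetic
# pairs are appended to the real training pairs before retraining the
# source->target system. reverse_model here is a stub, not a real MT model.
def augment_with_back_translation(real_pairs, mono_target, reverse_model):
    synthetic = [(reverse_model(tgt), tgt) for tgt in mono_target]
    return real_pairs + synthetic

reverse_model = lambda tgt: f"<synthetic source for: {tgt}>"
pairs = augment_with_back_translation([("src1", "tgt1")], ["tgt2"], reverse_model)
print(len(pairs))  # 2: one real pair plus one synthetic pair
```

The key property is that the target side of every synthetic pair is genuine text, so the retrained model learns to produce fluent target-language output.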

NMT
Supervised NMT. NMT has dominated recent machine translation research. Especially when a large amount of parallel data is available, NMT surpasses SMT by a large margin; moreover, NMT is good at generating fluent translations because of its auto-regressive generation nature. Koehn and Knowles (2017) pointed out the poor performance of NMT under low-resource and out-of-domain conditions; however, recent work from Sennrich and Zhang (2019) showed that low-resource NMT can be better than SMT when using proper training techniques and hyper-parameters. NMT models usually follow an encoder-decoder architecture: the encoder encodes the source sentence into hidden representations, then the decoder generates the target sentence word by word by "reading" these representations, as shown in Figure 3. We investigate two paradigms of NMT implementations: the RNN-based model (Bahdanau et al., 2015) and the Transformer-based model (Vaswani et al., 2017). For the Transformer, we apply layer normalization before the attention and FFN blocks instead of after, which is more robust (Baevski and Auli, 2019).

Figure 4: The four different ways we propose to incorporate BERT representations into NMT models.
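The pre-norm versus post-norm ordering can be sketched with stub sub-layers; only the order of operations matters here, so plain scalars and lambdas stand in for tensors and trained modules:

```python
# Sketch contrasting pre-norm (layer norm applied *before* each
# sub-layer, as we use) with the original post-norm Transformer
# ordering. norm/attn/ffn are stubs; real blocks operate on tensors.
def pre_norm_block(x, norm, attn, ffn):
    x = x + attn(norm(x))   # normalize, then self-attention, then residual
    x = x + ffn(norm(x))    # normalize, then feed-forward, then residual
    return x

def post_norm_block(x, norm, attn, ffn):
    x = norm(x + attn(x))   # residual sum first, then normalize
    x = norm(x + ffn(x))
    return x

norm = lambda v: 0.5 * v   # stand-in for LayerNorm
attn = lambda v: v + 1.0   # stand-in for self-attention
ffn = lambda v: v + 2.0    # stand-in for the feed-forward sub-layer
print(pre_norm_block(3.0, norm, attn, ffn))   # 10.25
print(post_norm_block(3.0, norm, attn, ffn))  # 4.5
```

In the pre-norm variant, the residual path from input to output is never normalized away, which is the property associated with more stable training.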
Semi-Supervised NMT. NMT models can often be improved when more training data is available; therefore, many works have studied semi-supervised approaches that utilize monolingual data to improve translation performance. Similar to SMT, we mainly investigate two semi-supervised methods. The first is to leverage pre-trained language models. Early works proposed shallow or deep fusion methods to rerank NMT outputs or add the language model's hidden states to the NMT decoder (Jean et al., 2015; Gulcehre et al., 2015). Recently, the large-scale pre-trained language model BERT (Devlin et al., 2019) has achieved impressive success on many NLP tasks. Zhu et al. (2020) showed that incorporating contextualized BERT representations can significantly improve translation performance. Following but differing from this work, we explore four different ways to incorporate BERT representations into NMT models, for English-Cherokee translation only. As depicted in Figure 4, we apply BERT representations by: ① initializing the NMT model's word embedding matrix with BERT's pre-trained word embedding matrix; ② concatenating the NMT encoder's input with BERT's output H_B; ③ concatenating the NMT encoder's output with BERT's output H_B; ④ using another attention mechanism to feed BERT's output H_B into the decoder. Note that ③ and ④ are not applied simultaneously, and the combinations of these four methods are treated as hyper-parameters; details are in Appendix B.4. In general, we hope BERT representations can help the encoder understand English sentences better and thus improve translation performance. We also test Multilingual-BERT (Devlin et al., 2019) to see if a multilingual pre-trained model can generalize better to a newly encountered language. The second semi-supervised method we try is again back-translation. Sennrich et al. (2016b) showed that applying this method to NMT obtains a larger improvement than applying it to SMT, and that it works better than the shallow or deep fusion methods.
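As an illustration of the concatenation variants in Figure 4, the sketch below joins per-token BERT features with NMT features; plain nested lists stand in for tensors, and the learned projection back to the model dimension is omitted:

```python
# Sketch of feature concatenation for incorporating BERT into NMT:
# for each source token, the NMT feature vector and the BERT feature
# vector are concatenated along the feature dimension. Lists stand in
# for tensors; a real model would project back to the hidden size.
def concat_per_token(nmt_feats, bert_feats):
    assert len(nmt_feats) == len(bert_feats)  # same sentence length
    return [n + b for n, b in zip(nmt_feats, bert_feats)]

i_e = [[0.1, 0.2], [0.3, 0.4]]   # 2 tokens, NMT feature dim 2
h_b = [[0.5, 0.6], [0.7, 0.8]]   # 2 tokens, BERT feature dim 2
print(concat_per_token(i_e, h_b))  # 2 tokens, dim 4 each
```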
Transferring & Multilingual NMT. Another important line of research improves low-resource translation performance by incorporating knowledge from other language pairs. As mentioned in Section 1, Cherokee is the sole member of the southern branch of the Iroquoian language family, so Cherokee is not "genealogically" related to any high-resource language in terms of their language family trees. However, it is still interesting to see whether translation knowledge between other languages and English can help with the translation between Cherokee and English. Hence, in this paper, we explore two ways of leveraging other language pairs: transfer learning and multilingual joint training. Kocmi and Bojar (2018) proposed a simple and effective continual training strategy for the transfer learning of translation models: first train a "parent" model using one language pair until convergence, then continue training on another language pair, so as to transfer the translation knowledge of the first language pair to the second. Johnson et al. (2017) introduced the "many-to-one" and "one-to-many" methods for multilingual joint training of X-En and En-X systems. They achieve this by simply combining training data, except that for the "one-to-many" method, every English sentence needs to start with a special token specifying the language to be translated into. We test both the transferring and multilingual methods for Chr-En/En-Chr translations with 4 other X-En/En-X language pairs (X = Czech/German/Russian/Chinese).
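The "one-to-many" setup can be sketched as follows; the `<2xx>` target-language token spellings are illustrative assumptions, not the exact tokens used:

```python
# Sketch of "one-to-many" multilingual joint training: every English
# source sentence is prefixed with a token naming the desired target
# language, then the corpora are simply concatenated. The <2xx> token
# spellings here are illustrative assumptions.
def tag_for_target(pairs, target_lang):
    return [(f"<2{target_lang}> {src}", tgt) for src, tgt in pairs]

en_chr = tag_for_target([("They are not of the world", "ᎥᏝ ᎡᎶᎯ ᎠᏁᎯ ᏱᎩ")], "chr")
en_de = tag_for_target([("hello", "hallo")], "de")
joint_train = en_chr + en_de  # one En-X model trained on the union
print(joint_train[0][0])  # <2chr> They are not of the world
```

At inference time, the same token steers the single shared model toward the desired output language.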

Experimental Details
We randomly sample 5K-100K sentences (about 0.5-10 times the size of the parallel training set) from News Crawl 2017 as our English monolingual data. We randomly sample 12K-58K examples (about 1-5 times the size of the parallel training set) for each of the 4 language pairs (Czech/German/Russian/Chinese-English) from News Commentary v13 of WMT2018 and Bible-uedin (Christodouloupoulos and Steedman, 2015) on OPUS. We apply the tokenizer and truecaser from Moses (Koehn et al., 2007). We also apply BPE tokenization (Sennrich et al., 2016c), but instead of using it by default, we treat it as a hyper-parameter. For systems with BERT, we apply the WordPiece tokenizer (Devlin et al., 2019). We compute detokenized and case-sensitive BLEU scores (Papineni et al., 2002) using SacreBLEU (Post, 2018). We implement our SMT systems via Moses (Koehn et al., 2007). SMT denotes the base system; SMT+bigLM represents the SMT system that uses additional monolingual data to train its language model; SMT with back-translation is denoted by SMT+BT. Our NMT systems are implemented with the OpenNMT toolkit (Klein et al., 2017). The two baselines are RNN-NMT and Transformer-NMT. For En-Chr, we also test adding BERT or Multilingual-BERT representations (Devlin et al., 2019), denoted NMT+BERT or NMT+mBERT, and with back-translation, NMT+BT. For Chr-En, we only test NMT+BT, treating the English monolingual data size as a hyper-parameter. For both En-Chr and Chr-En, we test transfer learning from and multilingual joint training with 4 other languages, denoted by NMT+X (T) and NMT+X (M) respectively, where X = Czech/German/Russian/Chinese. We treat the X-En data size as a hyper-parameter. All other detailed model designs and hyper-parameters are introduced in Appendix B.
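Since BPE is treated as a hyper-parameter rather than applied by default, a toy sketch of what a single BPE merge operation does may help; real experiments learn thousands of merges (e.g., with the subword-nmt tool) over the actual corpora:

```python
from collections import Counter

# Toy sketch of one BPE merge operation (Sennrich et al., 2016c):
# count adjacent symbol pairs over a frequency-weighted vocabulary,
# then merge the most frequent pair everywhere it occurs. Real systems
# repeat this for thousands of merge operations.
def most_frequent_pair(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def apply_merge(vocab, pair):
    a, b = pair
    return {w.replace(f"{a} {b}", a + b): f for w, f in vocab.items()}

vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6}
pair = most_frequent_pair(vocab)   # ('w', 'e'), seen 8 times
vocab = apply_merge(vocab, pair)
print(vocab)  # 'w e' fused into the subword 'we' wherever it occurs
```

For a morphologically rich language like Cherokee, such merges decide how aggressively long words are split into reusable subword units, which is why the setting matters enough to tune.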

Results

Chr-En vs. En-Chr.
Overall, the Cherokee-English translation gets higher BLEU scores than the English-Cherokee translation. It is reasonable because English has a smaller vocabulary and simpler morphology; thus, it is easier to generate.
SMT vs. NMT. For in-domain evaluation, the best NMT systems surpass SMT in both translation directions. This could result from our extensive architecture hyper-parameter search; or it may simply show that SMT is not necessarily better than NMT despite the different word orders. However, SMT is clearly better than NMT for out-of-domain evaluation, which is consistent with the results in Koehn and Knowles (2017).

RNN vs. Transformer.
Transformer-NMT performs worse than RNN-NMT, which contradicts the trend of some high-resource translations (Vaswani et al., 2017). We conjecture that the Transformer architecture is more complex than the RNN and thus requires larger-scale data to train properly. We also notice that Transformer models are very sensitive to hyper-parameters, so they could possibly be improved by a more extensive hyper-parameter search. The best Transformer-NMT has a 5-layer encoder/decoder and 2-head attention, which is smaller than the model used for high-resource translations (Vaswani et al., 2017). Another interesting observation is that previous works have shown that applying BPE and using a small vocabulary (by setting a minimum word frequency) are beneficial for low-resource translation (Sennrich et al., 2016c; Sennrich and Zhang, 2019); however, these techniques are not always favored during our model selection procedure, as shown in Appendix B.4.

Supervised vs. Semi-supervised.
As shown in Table 3, using a big language model and back-translation both only slightly improve the SMT baselines in both directions. For English-Cherokee translation, leveraging BERT representations improves RNN-NMT by 0.4/0.5 BLEU points on Dev/Test. Multilingual-BERT does not work better than BERT. Back-translation with our Cherokee monolingual data barely improves performance for both in-domain and out-of-domain evaluations, probably because the monolingual data is also out-of-domain: 72% of its unique Cherokee tokens are unseen in the whole parallel data. For Cherokee-English translation, back-translation improves the out-of-domain evaluation of RNN-NMT by 0.9/0.9 BLEU points on Out-dev/Out-test, while it does not obviously improve in-domain evaluation. A possible reason is that the English monolingual data we used is news data, which is not of the same domain as Dev/Test but closer to Out-dev/Out-test, so it helps the model with domain adaptation. We also investigate the influence of the English monolingual data size. We find that all of the NMT+BT systems perform best when using only 5K English monolingual sentences; see Figure 5 in Appendix B.5.
Transferring vs. Multilingual. Table 4 shows the transfer learning and multilingual joint training results. In most cases, the in-domain RNN-NMT baseline (N4) can be improved by both methods, which demonstrates that even though the 4 languages are not related to Cherokee, their translation knowledge can still be helpful. Transferring from the Chinese-English model and joint training with English-German data achieve our best in-domain Cherokee-English and English-Cherokee performance, respectively. However, there is barely any improvement on the out-of-domain evaluation sets, even though the X-En/En-X data is mostly news (the same domain as Out-dev/Out-test). On average, multilingual joint training performs slightly better than transfer learning and usually prefers a larger X-En/En-X data size (see details in Appendix B.4).

Qualitative Results
Automatic metrics are not always ideal for natural language generation (Wieting et al., 2019). As Cherokee is a new language to the NLP community, we are also not sure whether BLEU is a good metric for Cherokee evaluation. Therefore, we conduct a small-scale human (expert) pairwise comparison between the translations generated by our NMT and SMT systems. We randomly sample 50 examples from Test or Out-test, anonymously shuffle the translations from the two systems, and ask a co-author to choose which one they think is better (this co-author was not involved in the development of the MT systems). As shown in Table 5, human preference does not always follow the trends of BLEU scores. For English-Cherokee translation, though RNN-NMT+BERT (N5) has a better BLEU score than SMT+BT (S3) (12.2 vs. 9.9), it is preferred less often by the human judge (21 vs. 29), indicating that BLEU is possibly not a suitable metric for Cherokee evaluation. A detailed study is beyond the scope of this paper but is an interesting future work direction.

Conclusion and Future Work
In this paper, we make our effort to revitalize the Cherokee language by introducing a clean Cherokee-English parallel dataset, ChrEn, with 14K sentence pairs, and 5K Cherokee monolingual sentences. It can not only be another resource for low-resource machine translation research but will also help attract attention from the NLP community to save this dying language. Besides, we propose our Chr-En and En-Chr baselines, including both SMT and NMT models, using both supervised and semi-supervised methods, and exploring both transfer learning and multilingual joint training with 4 other languages. Experiments show that SMT is significantly better than NMT under the out-of-domain condition, while NMT is better for in-domain evaluation; and semi-supervised learning, transfer learning, and multilingual joint training can all improve purely supervised baselines. Overall, our best models achieve 15.8/12.7 BLEU for in-domain Chr-En/En-Chr translations and 6.5/5.0 BLEU for out-of-domain Chr-En/En-Chr translations. We hope these diverse baselines will serve as useful strong starting points for future work by the community. Our future work involves converting the monolingual data to parallel data and collecting more data from the news domain.

A.2 Comparison with Existing Data
Here, we compare our parallel data with the data provided on OPUS (Tiedemann, 2012), which is also present in our data. Table 6 shows the detailed statistics of our parallel data versus OPUS's. In summary, 99% of the OPUS data is also present in our parallel data, i.e., our data has 6K more sentence pairs that are not sacred texts (novels, news, etc.).

B.1 Data and Preprocessing
For semi-supervised learning, we sample additional English monolingual data from News Crawl 2017. For all the data we use, the same tokenizer and truecaser from Moses (Koehn et al., 2007) are applied. For some NMT systems, we also apply BPE subword tokenization (Sennrich et al., 2016c) with 20,000 merge operations for Cherokee and English separately. For NMT systems with BERT, we apply the WordPiece tokenizer from BERT (Devlin et al., 2019) for English. Before evaluation, the translation outputs are detokenized and detruecased. We use SacreBLEU (Post, 2018) to compute the BLEU (Papineni et al., 2002) scores of all translation systems.

B.3 NMT Systems
Our NMT systems are all implemented with OpenNMT (Klein et al., 2017). As shown in Table 3 and Table 4, there are 16 NMT systems in total (N4-N19). For each of these systems, we conduct a limited amount of hyper-parameter grid search on Dev or Out-dev. The search space includes applying BPE or not, the minimum word frequency threshold, the number of encoder/decoder layers, hidden size, dropout, etc. The detailed hyper-parameter tuning procedure is discussed in the next subsection. During decoding, all systems use beam search with beam size 5 and replace unknown words with the source word that has the highest attention weight.

B.4 Hyper-parameters
We observed that the NMT models, especially the Transformer-NMT models, are sensitive to hyper-parameters. Thus, we did a limited amount of hyper-parameter grid search when developing NMT models. For building the vocabulary, we take BPE (Sennrich et al., 2016c) (used or not) and the minimum word frequency (0, 5, 10) as two hyper-parameters. For the model architecture, we explore different numbers of encoder/decoder layers (1, 2, 3 for RNN; 4, 5, 6 for Transformer), hidden sizes (512, 1024), embedding sizes (equal to the hidden size, except 768 for BERT), tied decoder embeddings (Press and Wolf, 2017) (used or not), and numbers of attention heads (2, 4, 8). For training techniques, we tune dropout (0.1, 0.2, 0.3), label smoothing (Szegedy et al., 2016) (0.1, 0.2), average decay (1e-4 or not used), batch type (tokens or sentences), batch size (1000, 4000 for tokens; 32, 64 for sentences), and warmup steps (3000, 4000). We take the English monolingual data size (5K, 10K, 20K, 50K, 100K) as a hyper-parameter when we do back-translation for Cherokee-English translation. We take the size of the Czech/German/Russian/Chinese-English parallel data (12K, 23K, 58K) and whether to sample from Bible-uedin (yes or no) as hyper-parameters when we do transfer or multilingual training.
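For concreteness, part of the grid above can be expanded into candidate configurations as sketched below; a real search would train one OpenNMT model per configuration and keep the best by Dev accuracy, and the subset of settings shown here is illustrative:

```python
from itertools import product

# Sketch of expanding a subset of the hyper-parameter grid listed
# above into concrete configurations. A real search trains one model
# per configuration and selects by translation accuracy on Dev.
grid = {
    "bpe": [True, False],
    "min_word_freq": [0, 5, 10],
    "enc_dec_layers": [1, 2, 3],     # RNN settings
    "hidden_size": [512, 1024],
    "dropout": [0.1, 0.2, 0.3],
}
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 2 * 3 * 3 * 2 * 3 = 108 candidate settings
```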
Besides, we take how we incorporate BERT as a hyper-parameter, chosen from the following five settings and their combinations:
• BERT embedding: initializing the NMT model's word embedding matrix with BERT's pre-trained word embedding matrix I_B, corresponding to ① in Figure 4;
• BERT embedding (fix): the same as "BERT embedding" except we fix the word embedding during training;
• BERT input: concatenating the NMT encoder's input I_E with BERT's output H_B, corresponding to ② in Figure 4;
• BERT output: concatenating the NMT encoder's output H_E with BERT's output H_B, corresponding to ③ in Figure 4;
• BERT output (attention): using another attention mechanism to feed BERT's output H_B into the decoder, corresponding to ④ in Figure 4.
"BERT embedding" and "BERT embedding (fix)" are not applied simultaneously, and "BERT output" and "BERT output (attention)" are not applied simultaneously. Multilingual-BERT is used in the same ways. At most, there are 576 searches per model, but oftentimes we did fewer because we cut off unpromising settings early. All hyper-parameters are tuned on Dev or Out-dev for in-domain or out-of-domain evaluation, and model selection is based on translation accuracy on Dev or Out-dev. Tables 7 through 13 list the hyper-parameters of all the systems shown in Table 3 and Table 4. Since our parallel dataset is small (14K sentence pairs), even the slowest experiment, Transformer-NMT+mBERT, takes only 2 minutes per epoch on one Tesla V100 GPU. We train at most 100 epochs and use early stopping when the translation accuracy on Dev or Out-dev does not improve for 10 epochs.

B.5 English Monolingual Data Size Influence
In the semi-supervised experiments of Cherokee-English translation, we investigate the influence of the English monolingual data size. As mentioned above, we use 5K, 10K, 20K, 50K, and 100K English monolingual sentences. Figure 5 shows the influence on translation performance. It can be observed that increasing the English monolingual data size does not lead to higher performance; in particular, all NMT+BT systems achieve their best performance when using only 5K English sentences.