Bootstrapping Transliteration with Constrained Discovery for Low-Resource Languages

Generating the English transliteration of a name written in a foreign script is an important and challenging step in multilingual knowledge acquisition and information extraction. Existing approaches to transliteration generation require a large (>5000) number of training examples. This difficulty contrasts with transliteration discovery, a somewhat easier task that involves picking a plausible transliteration from a given list. In this work, we present a bootstrapping algorithm that uses constrained discovery to improve generation, and can be used with as few as 500 training examples, which we show can be sourced from annotators in a matter of hours. This opens the task to languages for which a large number of training examples is unavailable. We evaluate transliteration generation performance itself, as well as the improvement it brings to cross-lingual candidate generation for entity linking, a typical downstream task. We present a comprehensive evaluation of our approach on nine languages, each written in a unique script.


Introduction
Transliteration is the process of transducing names from one writing system to another (e.g., ओबामा in Devanagari to Obama in Latin script) while preserving their pronunciation (Knight and Graehl, 1998; Karimi et al., 2011). In particular, back-transliteration from foreign languages to English has applications in multilingual knowledge acquisition tasks including named entity recognition (Darwish, 2013) and information retrieval (Virga and Khudanpur, 2003). Two tasks feature prominently in the transliteration literature: generation (Knight and Graehl, 1998), which involves producing an appropriate transliteration for a given word in an open-ended way, and discovery (Sproat et al., 2006; Klementiev and Roth, 2008), which involves selecting an appropriate transliteration for a word from a list of candidates. This work develops transliteration generation approaches for low-resource languages. Code is available at github.com/shyamupa/hma-translit.
Existing transliteration generation models require supervision in the form of source-target name pairs (≈5-10k), which are often collected from names in Wikipedia inter-language links (Irvine et al., 2010). However, most languages that use non-Latin scripts are under-represented in terms of such resources. Table 1 illustrates this issue, and the extra coverage one can achieve by extending to low-resource languages. A model that requires 50k name pairs as supervision can only support 6 languages, while one that just needs 500 could support 56. For a model to be widely applicable, it must function in low-resource settings.

Table 1 (caption fragment): [...] Wikipedia inter-language links. While previous approaches for transliteration generation were applicable to only 24 languages (spanning 15 scripts), our approach is applicable to 56 languages (23 scripts). When counting scripts we exclude variants (e.g., all Cyrillic scripts and variants count as one).

Our Approach
We propose a new bootstrapping algorithm that uses a weak generation model to guide discovery of good transliterations, which in turn aids future bootstrapping iterations. By carefully controlling the interaction of discovery and the generation model via constrained inference, we show how to bootstrap a generation model using a dictionary of names in English, a list of words in the foreign script, and little initial supervision (≈500 name pairs). To the best of our knowledge, ours is the first work to accomplish transliteration generation in such a low-resource setting.
We demonstrate the practicality of our approach in truly low-resource scenarios and downstream applications through two case studies. First, in §8.1 we show that one can obtain the initial supervision from a single human annotator within a few hours for two languages, Armenian and Punjabi. This is a realistic scenario where language access is limited to a single native informant. Second, in §8.2 we show that our approach benefits a typical downstream application, namely candidate generation for cross-lingual entity linking, by improving recall on two low-resource languages, Tigrinya and Macedonian. We also present an analysis (§7) of the inherent challenges of transliteration, and the trade-off between native (i.e., source) and foreign (i.e., target) vocabulary.

Related Work
We briefly review the limitations of existing generation and discovery approaches, and provide an overview of how our work addresses them.

Transliteration Generation
Transliteration generation (Haizhou et al., 2004; Jiampojamarn et al., 2009; Ravi and Knight, 2009; Jiampojamarn et al., 2010; Finch et al., 2015, inter alia) requires a generous number of name pairs (≈5-10k) in order to learn to map words in the source script to the target script. While some approaches (Irvine et al., 2010; Tsai and Roth, 2018) use Wikipedia inter-language links to identify name pairs for supervision, a truly low-resource language (like Tigrinya) is likely to have limited Wikipedia presence as well.

Transliteration Discovery
Transliteration discovery (Sproat et al., 2006; Chang et al., 2009) is considerably easier than generation, owing to the smaller search space. However, discovery often uses features derived from resources that are unavailable for low-resource languages, like comparable corpora (Sproat et al., 2006; Klementiev and Roth, 2008).
A key limitation of discovery is the assumption that the correct transliteration(s) is in the list of candidates N. Since discovery models always pick something from N, they can produce false positives if no correct transliteration is present in N.
To overcome this, it is prudent to develop generation models which can handle input for which the transliteration does not belong in N.

Our Work
We show that a weak generation model can be iteratively improved using constrained discovery. In particular, our work uses a weak generation model to discover new training pairs, using constraints to drive the bootstrapping. Our generation model is inspired by the success of sequence to sequence generation models (Sutskever et al., 2014; Bahdanau et al., 2015) for string transduction tasks like inflection and derivation generation (Faruqui et al., 2016; Cotterell et al., 2017; Aharoni and Goldberg, 2017; Makarov et al., 2017). Our bootstrapping framework can be viewed as an instance of constraint driven learning (Chang et al., 2007, 2012).

Transliteration Generation with Hard Monotonic Attention: Seq2Seq(HMA)
We view generation as a string transduction task and use a sequence to sequence (Seq2Seq) generation model that uses hard monotonic attention (Aharoni and Goldberg, 2017), henceforth referred to as Seq2Seq(HMA). During generation, Seq2Seq(HMA) directly models the monotonic source-to-target sequence alignments, using a pointer that attends to a single input character at a time. Monotonic attention is a natural fit for transliteration because even though the number of characters needed to represent a sound in the source and target language varies, the sequence of sounds is presented in the same order. We review Seq2Seq(HMA) below, and describe how it can be applied to transliteration generation.

Encoding Input Word
Let $\Sigma_f$ be the source alphabet and $\Sigma_e$ be the English alphabet. Let $x = (x_1, x_2, \cdots, x_n)$ denote an input word where each character $x_i \in \Sigma_f$. The characters are first encoded using an embedding matrix $W \in \mathbb{R}^{|\Sigma_f| \times d}$ to get character embeddings $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n$ where each $\mathbf{x}_i \in \mathbb{R}^{d}$. These embeddings are fed into a bidirectional RNN encoder to generate encoded vectors $\mathbf{h}_1, \mathbf{h}_2, \cdots, \mathbf{h}_n$ where each $\mathbf{h}_i \in \mathbb{R}^{2k}$, and $k$ is the size of the output vector of the forward (and backward) encoder. The encoded vectors $\mathbf{h}_1, \mathbf{h}_2, \cdots, \mathbf{h}_n$ are then fed into the decoder.

Monotonic Decoding with Hard Attention
Figure 1 illustrates the decoding process. The decoder RNN generates a sequence of actions $\{s_1, s_2, \cdots\}$, such that each $s_i \in \Sigma_e \cup \{\text{step}\}$. The step action controls an attention position $a$, attending on input character $x_a$, with encoded vector $\mathbf{h}_a$. Each action $s_i$ is embedded into $\mathbf{s}_i \in \mathbb{R}^{d}$ using an output embedding matrix $A \in \mathbb{R}^{(|\Sigma_e|+1) \times d}$. At any time during decoding, the decoder uses its last hidden state, the embedding $\mathbf{s}_i$ of the previous action, and the encoded vector $\mathbf{h}_a$ of the current attended position to generate the next action $s_{i+1}$. If the generated action is step, the decoder increments the attention position by one. This ensures that the decoding is monotonic, as the attention position can only move forward or stay at the same position during generation. We use Inference(G, x) to refer to the above decoding process for a trained generation model G and input word x.

Figure 1 (caption): How decoding proceeds for transliterating "थनोस" to "thanos". During decoding, the model attends to a source character (e.g., थ shown in blue) and outputs target characters (t, h, a) until a step action is generated, which moves the attention position forward by one character (to न), and so on.

Training requires the oracle action sequence $\{s_i\}$ for input $x_{1:n}$ that generates the correct transliteration $y_{1:m}$. The oracle sequence is generated using the training name pairs and Algorithm 1 in Aharoni and Goldberg (2017), with the character-level alignment between $x_{1:n}$ and $y_{1:m}$ being generated using the algorithm in Cotterell et al. (2016).
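To make the role of the step action concrete, the following is a minimal sketch that replays a given action sequence over an input word. It is not the trained decoder: the neural scoring of actions is omitted, and the names `STEP` and `apply_actions` are hypothetical. It only shows how write actions accumulate on the attended character and how step advances the pointer, using the "थनोस" to "thanos" example.

```python
STEP = "<step>"  # hypothetical symbol standing in for the paper's `step` action


def apply_actions(source, actions):
    """Replay a hard-monotonic action sequence over `source`.

    Returns the output string and, for each source character, the target
    characters that were written while the attention pointer rested on it.
    """
    position = 0                  # attention position a (0-based here)
    emitted = [""] * len(source)  # output written per source character
    for action in actions:
        if action == STEP:
            position += 1         # the pointer can only move forward
            if position >= len(source):
                raise ValueError("step moved past the end of the input")
        else:
            emitted[position] += action  # write while attending to source[position]
    return "".join(emitted), list(zip(source, emitted))


source = "थनोस"
actions = ["t", "h", "a", STEP, "n", STEP, "o", STEP, "s"]
output, alignment = apply_actions(source, actions)
print(output)
print(alignment)
```

Because the pointer never moves backward, the recovered alignment is monotonic by construction, which is the property the oracle action sequences used for training rely on.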

Inference Strategies
We describe an unconstrained and a constrained inference strategy to select the best transliteration $\hat{y}$ from a beam $\{y_i\}_{i=1}^{k}$ of transliteration hypotheses, sorted in descending order by likelihood. The constrained strategy uses a name dictionary N to guide the inference. These strategies are applicable to any generation model.
• Unconstrained (U) selects the most likely item $y_1$ in the beam as $\hat{y}$.
• Dictionary-Constrained (DC) selects the highest scoring hypothesis that is present in N, and defaults to $y_1$ if none are in N.
It is tempting to disallow the model from generating hypotheses which are not in the dictionary N. However, dictionaries are always incomplete, and restricting the search to generate from N inevitably leads to incorrect predictions if the correct transliteration is not in N. This is essentially the same as the problem inherent to discovery models.
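The two strategies above can be sketched in a few lines. This is an illustrative sketch, assuming the beam is a list of (hypothesis, score) pairs already sorted by descending score; the function names and the example beam with its scores are made up for illustration.

```python
def unconstrained(beam):
    """U: return the most likely hypothesis y_1.

    `beam` is a list of (hypothesis, score) pairs sorted by descending score.
    """
    return beam[0][0]


def dictionary_constrained(beam, name_dictionary):
    """DC: return the highest scoring hypothesis found in the dictionary N,
    falling back to y_1 when no hypothesis is in N."""
    for hypothesis, _score in beam:
        if hypothesis in name_dictionary:
            return hypothesis
    return beam[0][0]


# Hypothetical beam and dictionary, for illustration only.
beam = [("obamaa", -0.4), ("obama", -0.9), ("obma", -2.1)]
print(unconstrained(beam))
print(dictionary_constrained(beam, {"obama", "osama"}))
print(dictionary_constrained(beam, {"clinton"}))
```

The fallback to $y_1$ is what separates DC from pure discovery: when N contains none of the hypotheses, the model still returns its own best guess instead of being forced to choose from N.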

Other Strategies in Previous Work
A related constrained inference strategy was proposed by Lin et al. (2016), who use an entity linking system to correct and re-rank hypotheses, using any available context to aid hypothesis correction. Our constrained inference strategy is much simpler, requiring only a name dictionary N. We experimentally show that our approach outperforms that of Lin et al. (2016).

Low-Resource Bootstrapping
Low-resource languages will have a limited number of name pairs for training a generation model. To learn a good generation model in this setting, we propose a new bootstrapping algorithm that uses constrained discovery to mine name pairs to re-train the generation model. Our algorithm requires a small (≈500) seed list of name pairs S for supervision, a dictionary N containing names in English, and a list of words $V_f$ in the foreign script.
Below we describe our algorithm and the constraints used to guide discovery of new name pairs.

The Bootstrapping Algorithm
Algorithm 1 shows the pseudo-code of the bootstrapping procedure. We initialize a weak generation model $G_0$ using a seed list of name pairs S (line 1). At iteration t, the current generation model $G_t$ produces the top-k transliteration hypotheses for each word x in $V_f$. A source word and hypothesis pair $(x, y_i)$ is added to the set of mined name pairs B if it satisfies a set of discovery constraints (described below) (line 8). A new generation model $G_{t+1}$ is trained for the next iteration using the union of the seed list S and the mined name pairs B (line 12). B is purged after every iteration (line 3) to prevent $G_{t+1}$ from being influenced by possibly incorrect name pairs mined in earlier iterations. The algorithm converges when accuracy@1 stops increasing on a development set. We note that our bootstrapping approach is applicable to any transliteration generation model.

Algorithm 1
Input: English name dictionary N; seed training pairs S; vocabulary in the target language $V_f$.
Hyper-parameters: initial minimum length threshold $L^{\min}_0$; minimum likelihood threshold $\delta_{\min}$; length ratio tolerance $\epsilon$.
Output: Generation model $G_T$.
 1: initialize $G_0$ using S
 2: ...
 3: purge B
 4-6: ...
 7: if $(x, y_i)$ satisfies constraints in §4.2 then
 8:   add $(x, y_i)$ to B
 9: end if
10: end for
11: end for
12: train $G_{t+1}$ using $S \cup B$
...
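The bootstrapping procedure described in this section can be sketched as a generic loop. This is a sketch under stated assumptions, not the authors' implementation: `train`, `inference`, `satisfies_constraints`, and `dev_accuracy` are hypothetical callables supplied by the caller, `max_iterations` is an added safety cap, and returning the best model seen on the development set is one plausible reading of "converges when accuracy@1 stops increasing".

```python
def bootstrap(seed_pairs, foreign_vocab, train, inference,
              satisfies_constraints, dev_accuracy, max_iterations=20):
    """Iteratively re-train a generation model on seed pairs plus mined pairs.

    train(pairs) -> model
    inference(model, x) -> list of (hypothesis, score), best first (top-k beam)
    satisfies_constraints(x, y, score, t) -> bool (discovery constraints at iteration t)
    dev_accuracy(model) -> accuracy@1 on a development set
    """
    model = train(list(seed_pairs))              # weak model G_0 from the seed list S
    best_model, best_acc = model, dev_accuracy(model)
    for t in range(max_iterations):
        mined = []                               # B starts empty in every iteration
        for x in foreign_vocab:                  # words in the foreign script V_f
            for y, score in inference(model, x):
                if satisfies_constraints(x, y, score, t):
                    mined.append((x, y))
        model = train(list(seed_pairs) + mined)  # G_{t+1} trained on S union B
        acc = dev_accuracy(model)
        if acc <= best_acc:                      # accuracy@1 stopped increasing
            break
        best_model, best_acc = model, acc
    return best_model
```

Rebuilding `mined` from scratch in each iteration mirrors the purge of B: a pair mined by an early, weaker model survives only if the current model still proposes it and it still passes the constraints.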
To ensure that high quality name pairs are added to the mined set B during bootstrapping, we use the following discovery constraints.

Discovery Constraints
A word-transliteration pair (x, y) is added to the set of mined pairs B only if all the following constraints are satisfied:
1. $y \in N$, i.e., y belongs in the dictionary.
2. $P(y \mid x) > \delta_{\min}$. The model is sufficiently confident about the transliteration.
3. The ratio of lengths $|y|/|x|$ should be close to the average ratio estimated from S (Matthews, 2007). We encode this using the constraint $\left|\, |y|/|x| - r(S) \,\right| \le \epsilon$, where $\epsilon$ is a tunable tolerance and $r(S)$ is the average ratio in S.
4. The hypothesis y meets the current minimum length threshold $L^{\min}_t$. We found that false positives were more likely to be short hypotheses in early iterations. As the model improves with each iteration, $L^{\min}_t$ is lowered to allow more new pairs to be mined.
We note that our bootstrapping algorithm can be formulated as an instance of constraint driven learning (Chang et al., 2007, 2012).
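The discovery constraints can be checked with a small predicate. The sketch below is illustrative: the function and parameter names are hypothetical, and the minimum-length check is assumed to apply to the hypothesis y with a non-strict comparison, since the text attributes early false positives to short hypotheses but does not spell out the inequality.

```python
def average_length_ratio(seed_pairs):
    """r(S): mean of |y| / |x| over the seed name pairs (x, y)."""
    return sum(len(y) / len(x) for x, y in seed_pairs) / len(seed_pairs)


def satisfies_constraints(x, y, prob, name_dictionary,
                          delta_min, ratio, epsilon, min_length):
    """Return True only if the pair (x, y) passes every discovery constraint."""
    return (
        y in name_dictionary                        # y is in the dictionary N
        and prob > delta_min                        # P(y | x) > delta_min
        and abs(len(y) / len(x) - ratio) <= epsilon  # length ratio close to r(S)
        and len(y) >= min_length                    # assumed form of the L_min_t check
    )


# Hypothetical seed pairs and thresholds, for illustration only.
seed = [("ab", "abc"), ("abcd", "abcdef")]
r = average_length_ratio(seed)
print(r)
print(satisfies_constraints("xy", "bob", 0.8, {"bob"}, 0.5, r, 0.2, 3))
```

Lowering `min_length` across iterations reproduces the schedule described for $L^{\min}_t$: early iterations admit only long, safer hypotheses, and later ones admit shorter names.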

Experimental Setup
Unless otherwise specified, we evaluate all generation models by comparing the best model prediction $\hat{y}$ against the reference transliteration $y^{*}$ and reporting accuracy@1 (acc@1).
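As a point of reference, acc@1 can be computed as the fraction of test words whose top prediction matches the reference. The sketch below assumes exact, case-sensitive string match against a single reference per word, which the text does not spell out; the function name is hypothetical.

```python
def accuracy_at_1(predictions, references):
    """Fraction of inputs whose top prediction equals the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(pred == ref for pred, ref in zip(predictions, references))
    return correct / len(references)


print(accuracy_at_1(["obama", "vidul"], ["obama", "vidyul"]))
```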

Training and Evaluation Dataset
We use the train and development sets from the Named Entities Workshop 2015 (Duan et al., 2015) (NEWS2015) for Hindi (hi), Kannada (kn), Bengali (bn), Tamil (ta) and Hebrew (he) as our training and evaluation sets. The sizes of the train sets were ∼12k, 10k, 14k, 10k and 10k respectively, and all evaluation sets were ∼1k.
For the low-resource experiments, we subsample 500 examples from each train set in the NEWS2015 dataset using five random seeds and report the averaged results. We also set aside 1k name pairs from the corresponding NEWS2015 train set of each language as development data. The foreign script portion of the remaining train data is used as $V_f$ in the bootstrapping algorithm.

Model and Tuning Details
We implemented Seq2Seq(HMA) using PyTorch. We used 50 dimensional character embeddings, and a single layer GRU (Cho et al., 2014) encoder with 20 hidden states for all experiments. The Adam (Kingma and Ba, 2014) optimizer was used with default hyperparameters, a learning rate of 0.001, a batch size of 1, and a maximum of 20 iterations in all experiments. Beam search used a width of 10. For low-resource experiments, all bootstrapping parameters were tuned on the development data set aside above. $L^{\min}_0$ is chosen from {10, 15, 20, 25}.

Name Dictionary
We use a name dictionary of 1.05 million names constructed from the English Wikipedia (dump dated 05/20/2017) by taking the list of title tokens in Wikipedia sorted by frequency, and removing tokens which appear only once.
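The dictionary construction can be approximated as a token frequency count over page titles. This is a rough sketch, assuming whitespace tokenization and no case folding or other normalization (the text does not specify these details); `build_name_dictionary` and the toy titles are hypothetical.

```python
from collections import Counter


def build_name_dictionary(titles, min_count=2):
    """Return title tokens sorted by frequency, dropping tokens seen only once."""
    counts = Counter(token for title in titles for token in title.split())
    return [token for token, count in counts.most_common() if count >= min_count]


# Toy titles for illustration only.
titles = ["Barack Obama", "Michelle Obama", "Chicago", "Chicago Bulls"]
names = build_name_dictionary(titles)
print(names)
```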

Comparisons
We compare with the following generation models.

P&R (Pasternack and Roth, 2009)
A probabilistic transliteration generation approach that learns latent alignments between substrings in the source and the target words. The model is trained to score all possible segmentations and their alignments, using an EM-like algorithm.

DirecTL+ (Jiampojamarn et al., 2009)
An HMM-like discriminative string transduction model that predicts the output transliteration using many-to-many alignments between the source word and target transliteration. Following Jiampojamarn et al. (2009), we use the m2maligner (Jiampojamarn et al., 2007) to generate the many-to-many alignments, and the public implementation of DirecTL+ to train models.

RPI-ISI (Lin et al., 2016)
A transliteration approach that uses a language-independent entity linking system to jointly correct and re-rank the hypotheses produced by the generation model. We compare to both the unconstrained inference (U) approach and the entity linking constrained inference (+EL) approach.

Seq2Seq w/ Att
A sequence to sequence generation model which uses soft attention as described in Bahdanau et al. (2015). This model does not enforce monotonicity at inference time, and serves as a direct comparison for Seq2Seq(HMA).

Experiments
This section aims to analyze: (a) how effective is Seq2Seq(HMA) for transliteration generation when provided all available supervision (§6.1)? and (b) how effective is the bootstrapping algorithm in the low-resource setting when only 500 examples are available (§6.2)?

Full Supervision Setting
We compare Seq2Seq(HMA) with previous approaches when provided all available supervision, to see how it fares under standard evaluation.
Results in the unconstrained inference (U) setting (Table 2, top 5 rows) show that Seq2Seq(HMA), denoted by "Ours", outperforms previous approaches on Hindi, Kannada, and Bengali, with gains of roughly 3-4%. Improvements over the Seq2Seq with Attention (Seq2Seq w/ Att) model demonstrate the benefit of imposing the monotonicity constraint in the generation model. On Tamil and Hebrew, Seq2Seq(HMA) is on par with the best approaches, with a negligible gap (∼0.3) in scores. Overall, we see that Seq2Seq(HMA) can achieve better (and sometimes competitive) scores than state-of-the-art approaches in full supervision settings. When comparing approaches which use constrained inference (Table 2), we see that using dictionary-constrained inference (as in Ours(DC)) is more effective than using an entity-linking model for re-ranking (RPI-ISI + EL).

Low-Resource Setting
In Table 2 (rows under "Low-Resource Setting"), we evaluate different models in a low-resource setting when provided only 500 name pairs as supervision. Results are averaged over 5 different random sub-samples of 500 examples.
The results clearly demonstrate that all generation models suffer a drop in performance when provided limited training data. Note that models like Seq2Seq with Attention suffer a larger drop than those which enforce monotonicity, suggesting that incorporating monotonicity into the inference step in the low-resource setting is essential. After bootstrapping our weak generation model using Algorithm 1, the performance improves substantially (last row in Table 2). On almost all languages, the generation model improves by at least 6%, with performance for Hindi and Bengali improving by more than 10%. Bootstrapping results for the languages are within 2-4% of the best model trained with all available supervision.
To better analyze the progress of the transliteration model during bootstrapping, we plot the accuracy@1 of the current transliteration model after each bootstrapping iteration for each of the languages (solid lines in Figure 2). For reference, we also show the best performance for a generation model using all available supervision from §6.1 (dotted horizontal lines in Figure 2). From Figure 2, we can see that after about 5 bootstrapping iterations, the generation model attains performance competitive with the respective state-of-the-art models trained with full supervision.

Error Analysis
Though our model is state of the art, it has a few weaknesses. We have found that the dictionary sometimes misleads the model during constrained inference. For example, the correct transliteration "vidyul" of the Hindi विद्युल is not present in the dictionary, but another hypothesis "vidul" is. Another issue comes from the proportion of native (i.e., from the source language) and foreign (i.e., from English or other languages) names in the training data. It is usually not the case that the source and target scripts have the same transliteration rules. For example, य in Hindi might represent ya in English or Hindi names, but ja in German. Similarly, while अ should be a in Hindi names, it could be any of a few vowels in English. The NEWS2015 dataset does not report a native/foreign ratio, but by our estimation, it is about 70/30 for each language. This dichotomy between native and foreign names is one of the inherent challenges in transliteration, which we discuss in detail in the next section.

Challenges Inherent to Transliteration
The fact that all models in Table 2 perform well or poorly on the same languages suggests that most of the observed performance variation is the result of factors intrinsic to the specific languages. Here we analyze some challenges that are inherent to the transliteration task, and explain why the performance ceiling is well under 100% for all languages, and lower for languages like Tamil and Hebrew than the others. The Hebrew script also introduces error because it tends to omit vowels or write them ambiguously, leaving the model to guess between plausible choices. For example, the word ‫מלך‬ could be transliterated melech "king" just as easily as malach "he ruled." When Hebrew does write vowels, it reuses consonant letters, again ambiguously. For example, ‫ה‬ can be used to express a or e, so ‫שמונה‬ can be either shmona or shmone "eight masculine/feminine". The script also does not reliably distinguish b from v or p from f, among others.

Source and Target-Specific Issues
All languages run into problems when they are faced with writing sounds that they do not natively distinguish. For example, Hindi does not make a distinction between w and v, so both vest and west are written as वेस्ट in its script.
These script-specific deficiencies explain why all models struggle on Tamil and Hebrew relative to the others. These issues cannot be completely resolved without memorizing individual source-target pairs and leveraging context.

Target-Driven
[...]ters, decide whether to use f or ph for /f/; use k, c, ck, ch, or q for /k/, and so on. The problem is made worse because English is not the only language that uses Latin script. For example, German names like Schmidt should be written with sch instead of sh, and for French names like Margot and Margeau (which are pronounced the same), we have to resort to memorization. The arbitrariness extends into borrowings from the source languages as well. For example, the Indian name Bangalore is written with a silent e, and the name Lakshadweep contains ee instead of the expected i.

Disparity between Native and Foreign
All these issues come together to create a performance disparity between native names, which are well-integrated into the source language etymologically (Indian names like Jasodhara or Ramanathan for Hindi), and foreign names (French Grenoble or Japanese Honshu for Hindi), which are not. The above datasets include an unspecified mix of native and foreign names. This is a problem since any model must learn essentially separate transliteration schemes for each.
To quantify the effect of this, we annotate native and foreign names in the test split of the four Indian languages, and evaluate performance for both categories. Table 3 shows that our model performs significantly better on native names for all the languages. A possible reason for this is that the source scripts were designed for writing native names (e.g., Tamil script lacks separate {ta, da, tha, dha} characters because the Tamil language does not distinguish these sounds). Furthermore, foreign names have a wide variety of origins with their own conventions as discussed in §7.1. The performance gap is proportionally greatest for Tamil, likely due to its script.

Case Studies
In this section, we evaluate the practical utility of our approach in low-resource settings and for downstream applications through two case studies.
We first show that obtaining an adequate seed list is possible with a few hours of manual annotation (§8.1) from a single human annotator. We then show the positive impact that our approach has on a downstream task, by evaluating its contribution to candidate generation for Tigrinya and Macedonian entity linking (§8.2).

Manual Annotation
The manual annotation exercises simulate a low-resource setting in which only a single human annotator is available. We judge the usability of the annotations by training models on them and evaluating the models on test sets of 1000 names each, obtained from Wikipedia inter-language links. For bootstrapping experiments, we use the corpora shown in Table 4 to obtain foreign vocabulary $V_f$.

Languages Studied
We investigate performance on two languages: Armenian and Punjabi. Spoken in Armenia and Turkey, Armenian is an Indo-European language with no close relatives. It has Eastern and Western dialects with different spelling conventions. Armenian Wikipedia is primarily written in the Eastern dialect, while our annotator was a native Western speaker. Punjabi is an Indic language from Northwest India and Pakistan that is closely related to Hindi. Our annotator grew up primarily speaking Hindi.

Annotation Guidelines
Annotators were given two tasks. First, they were asked to write two names and their English transliterations for each letter in the source script: one beginning with the letter and another containing it elsewhere (e.g., "Julia" and "Benjamin" for the letter "j" if the source were English). This is done to ensure good coverage over the alphabet. Next, annotators were shown a list of English words and were asked to provide plausible transliteration(s) into the target script. The list had a mix of recognizable foreign (e.g., Clinton, Helsinki) and native names (e.g., Sarkessian, Yerevan for Armenian). We collected about 600 and 500 annotated pairs respectively for Armenian and Punjabi.
Table 5 shows that the performance of the models trained on the annotated data is comparable to that on the standard test corpora for other languages. This shows that our approach is robust to human inconsistencies and regional spelling variations, and that obtaining an adequate seed list is possible with just a few hours of manual annotation.

Candidate Generation (CG)
Since transliteration is an intermediate step in many downstream multilingual information extraction tasks (Darwish, 2013; Kim et al., 2012; Jeong et al., 1999; Virga and Khudanpur, 2003; Chen et al., 2006), it is possible to gauge its performance extrinsically by the impact it has on such tasks. We use the task of candidate generation (CG), which is a key step in cross-lingual entity linking.
The goal of cross-lingual entity linking (McNamee et al., 2011; Tsai and Roth, 2016; Upadhyay et al., 2018) is to ground spans of text written in any language to an entity in a knowledge base (KB). For instance, [Chicago] in the following German sentence should be grounded to Chicago_(band).

[Chicago] wird in Woodstock aufzutreten. (Translation: Chicago will perform at Woodstock.)
The role of CG in cross-lingual entity linking is to create a set of plausible entities given a string while ensuring the correct KB entity belongs to that set. For the above German sentence, it would provide a list of possible KB entities for the string Chicago: Chicago_(band), Chicago_(city), Chicago_(font), etc., so that entity linking can select the band. Foreign scripts pose an additional challenge for CG because they must be transliterated before they are passed on to candidate generation. For instance, any mention of "Chicago" in Amharic must first be transliterated from ሺካጎ.
Most approaches for CG use Wikipedia inter-language links to generate the lists of candidates (Tsai and Roth, 2016). While recent approaches such as Tsai and Roth (2018) have resorted to name translation for CG, they require over 10k examples for languages written in non-Latin scripts, which is prohibitive for low-resource languages with little Wikipedia presence.

Candidate Generation with Transliteration
We evaluate the extent to which our approach improves recall of a naive CG baseline that generates candidates by performing exact name match. For each span of text to be linked (or query mention), we first check if the naive name matching strategy finds any candidates in the KB. If none are found, the query mention is back-transliterated to English, and at most 20 candidates are generated using an inverted index from English names to KB entities. The evaluation metric is recall@20, i.e., whether the gold KB entity is in the top 20 candidates. We use Tigrinya and Macedonian as our test languages.
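The fallback scheme described above can be sketched as follows. This is a simplified illustration, not the evaluated system: it assumes a single dictionary `kb_index` serves both as the exact-match lookup and as the inverted index from English names to KB entities, and `transliterate` is a hypothetical stand-in for the trained generation model.

```python
def generate_candidates(mention, kb_index, transliterate, max_candidates=20):
    """Exact name match first; back-transliterate only when nothing matches."""
    candidates = kb_index.get(mention, [])
    if not candidates:
        english = transliterate(mention)          # back-transliteration to English
        candidates = kb_index.get(english, [])
    return candidates[:max_candidates]


def recall_at_k(queries, kb_index, transliterate, k=20):
    """Fraction of (mention, gold entity) queries whose gold entity is retrieved."""
    hits = sum(
        gold in generate_candidates(mention, kb_index, transliterate, k)
        for mention, gold in queries
    )
    return hits / len(queries)


# Toy index and a toy transliterator, for illustration only.
kb_index = {"chicago": ["Chicago_(city)", "Chicago_(band)", "Chicago_(font)"]}
toy_transliterate = lambda mention: {"ሺካጎ": "chicago"}.get(mention, mention)

queries = [("ሺካጎ", "Chicago_(band)"), ("xyz", "Some_Entity")]
print(generate_candidates("ሺካጎ", kb_index, toy_transliterate))
print(recall_at_k(queries, kb_index, toy_transliterate))
```

Because transliteration is only consulted when exact match returns nothing, any gain in recall@20 over the baseline comes from mentions the baseline could not match at all.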
Tigrinya is a South Semitic language related to Amharic, written in the Ethiopic script, and spoken primarily in Eritrea and northern Ethiopia. The Tigrinya Wikipedia has <200 articles, so we use inter-language links (∼7.5k) from the Amharic Wikipedia instead to extract 1k name pairs for the seed set. We use the monolingual corpus in Table 4 for bootstrapping and evaluate on the unsequestered set provided under the NIST LoReHLT evaluation, containing 4,630 query mentions.
The Ethiopic script is an alphasyllabary, where each character is a consonant-vowel pair. For example, the character መ is mä, ሚ with a tail is mi, and ሞ with a line is mo. With 26 consonants and 8 vowels, this leads to a set of >200 characters, creating a sparsity problem since each character has its own Unicode code point. However, the code points are organized so that they can be automatically split into unique consonant and vowel codes without explicitly understanding the script. We assign arbitrary ASCII codes to each consonant and vowel so that መ/mä becomes "D 1" and ሞ/mo becomes "D 6." This consonant-vowel splitting (CV-split) reduces the number of unique input characters to 55.
[...] Table 4 is used for bootstrapping.
Table 6 shows the results for the two languages. For Tigrinya, candidate generation with transliteration improves on the baseline by 4.2%. Splitting the characters (CV-split) gives another 5.7%, and adding bootstrapping gives 4.9% more. Our approach yields an overall 14.8% improvement in recall over the baseline, showing that we can effectively exploit the little available supervision by bootstrapping. Macedonian yields more dramatic results, where transliteration provides a 38.6% improvement (more than double the baseline), with bootstrapping providing another 4.6%. The differences between Tigrinya and Macedonian are likely due to their test sets, corpora, and writing systems.
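The consonant-vowel split described above relies on the layout of the Unicode Ethiopic block, where each consonant occupies a row of eight consecutive code points starting at U+1200. The sketch below is one way to exploit that layout and is an assumption rather than the authors' exact scheme: it returns numeric (consonant, vowel) indices instead of the arbitrary ASCII codes mentioned in the text, so its output does not reproduce "D 1" or "D 6", and it only covers the range U+1200 to U+1357, which contains some unassigned and labialized slots.

```python
ETHIOPIC_BASE = 0x1200   # first code point of the Unicode Ethiopic block
SLOTS_PER_ROW = 8        # each consonant row spans eight code points
SYLLABLE_SPAN = 0x158    # assumed span of basic syllables: U+1200..U+1357


def cv_split(char):
    """Split one Ethiopic syllable into (consonant_index, vowel_index)."""
    offset = ord(char) - ETHIOPIC_BASE
    if not 0 <= offset < SYLLABLE_SPAN:
        raise ValueError(f"not a basic Ethiopic syllable: {char!r}")
    return divmod(offset, SLOTS_PER_ROW)


def cv_split_word(word):
    """Split every syllable in a word, passing other characters through."""
    pieces = []
    for char in word:
        try:
            pieces.append(cv_split(char))
        except ValueError:
            pieces.append(char)
    return pieces


print(cv_split("\u1218"))   # the syllable mä
print(cv_split("\u121e"))   # the syllable mo
```

Syllables that share a consonant land in the same row and therefore share the first index, which is what lets the model reuse evidence across the vowel forms of a consonant.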

Conclusion and Future Work
We presented a new transliteration generation model, namely Seq2Seq(HMA), and a new bootstrapping algorithm that can iteratively improve a weak generation model using constrained discovery. The model presented here achieves state-of-the-art results on typical training set sizes, and more importantly, works well in a low-resource setting with the aid of the bootstrapping algorithm.
The key benefit of the bootstrapping approach is that it can "recover" most of the performance lost in the low-resource setting when little supervision is available by training with a smaller seed set, an English name dictionary, and a list of unannotated words in the target script. Additionally, our bootstrapping algorithm admits any generation model, giving it wide applicability. Through case studies, we showed that collecting an adequate seed list is practical with a few hours of annotation. The benefit of incorporating our transliteration approach in a downstream task, namely candidate generation, was also demonstrated. Finally, we discussed some of the inherent challenges of learning transliteration and the deficits of existing training sets.
There are several interesting directions for future work. Performing model combination, either by developing hybrid transliteration models (Nicolai et al., 2015) or by ensembling (Finch et al., 2016), can further improve low-resource transliteration. Jointly leveraging similarities between related languages, such as writing systems or phonetic properties (Kunchukuttan et al., 2018), also shows promise for low-resource settings. Our analysis suggests value in revisiting "transliteration in context" approaches (Goto et al., 2003; Hermjakob et al., 2008), especially for languages like Hebrew. We would also like to expand on the analyses provided in §7 which uncover challenges inherent to the transliteration task, particularly the impact of the native/foreign distinction in the train and test data, the difficulties posed by specific scripts or pairs of scripts, and how these impact both back- and forward-transliteration. Recent work from Merhav and Ash (2018) suggests many useful analyses that we would like to incorporate.