Leveraging Entity Linking and Related Language Projection to Improve Name Transliteration

Traditional name transliteration methods largely ignore source context information and inter-dependency among entities for entity disambiguation. We propose a novel approach to leverage state-of-the-art Entity Linking (EL) techniques to automatically correct name transliteration results, us-ing collective inference from source contexts and additional evidence from knowledge base. Experiments on transliterating names from seven languages to English demonstrate that our approach achieves 2 : 6 % to 15 : 7 % absolute gain over the baseline model, and significantly advances state-of-the-art. When contextual information exists, our approach can achieve further gains ( 24 : 2 %) by collectively translit-erating and disambiguating multiple re-lated entities. We also prove that combining Entity Linking and projecting resources from related languages obtained comparable performance as the method us-ing the same amount of training pairs in the original languages without Entity Linking. 1


Introduction
In Machine Translation and Cross-lingual Information Extraction tasks, an important problem is translating out-of-vocabulary words, mostly names. For some names, we can perform transliteration (Knight and Graehl, 1997;Knight and Graehl, 1998), namely converting them to their approximate phonetic equivalents. Previous methods have generally followed the two-step approach proposed by (Al-Onaizan and Knight, 2002): Generating transliteration hypotheses based on phoneme, grapheme or correspondence, and validating or re-ranking hypotheses using language modeling (Oh and Isahara, 2007) or Information Extraction from the target language (Ji et al., 2009).
In this paper, we focus on back-transliteration from languages lacking in Natural Language Processing (NLP) resources to English for two reasons: (1) In NLP tasks such as name tagging, we can take advantage of rich English resources by transliterating a name to English. Our analysis of 986 transliteration pairs from the Named Entities Workshop 2015 (NEWS2015) 2 Bengali development set shows that 574 English names can be found in the DBpedia 3 , while only 47 Bengali names exist in the same knowledge base (KB). (2) Back-transliterating names in other languages to English make them understandable by more users since English is widely spoken as a global lingua franca.
In this paper we analyze the following remaining challenges from previous methods: Challenge 1: Lack of Entity Grounding. Previous methods developed for transliteration benchmark tasks such as Named Entity Workshop (NEWS) Shared Task (Li et al., 2009) usually focus on transliterating independent names without properties (or contextual information). For example, "Kalashnikov" and "Calashnikov" are both acceptable transliterations for " " (kǎ lā shí ní kē fū) in terms of pronunciation. If we know that it refers to a rifle series, however, we should transliterate " " to "Ka" instead of "Ca" here. Therefore, we propose to ground the transliteration results to a KB whenever the contexts are available.
Challenge 2: Information-Losing. As pointed out in (Knight and Graehl, 1998), the informationlosing problem of transliteration makes it difficult to invert. For example, "la" and "ra" are two distinct sounds in English, while they usually collapse to " " (lā) in Chinese, which lacks the English "ra" sound.
To tackle these two challenges, we propose a novel approach that links a given name to a KB in target language, and subsequently exploits the linking results to correct transliteration hypotheses.   Table 1 shows some instances which require entity disambiguation before transliteration. Take the name " " (yà sè) as an example, it has several possible transliteration hypotheses. If we find " "(Lackey) in the same document, we can apply collective inference to link their transliteration candidates to a KB. Since "James Lackey" appears in the infobox of "Usher (Singer)", we take "Usher" and "Lackey" as transliterations for " " and " " respectively. Challenge 4: Lack of Training Pairs. Statistical transliteration models usually rely on thousands of name pairs for training. However, it might be costly to collect required training data for lowresource languages. To address this issue, we propose a simple but effective method which transliterates names in a low-resource language using a model trained on one of its similar languages by means of a character mapping table derived from Unicode charts. Figure 1 illustrates the overall framework of our approach, which consists of four steps as follows.

Approach Overview
1. Training. We employ a many-to-many align- ment model (m2m-aligner) (Jiampojamarn et al., 2007) to segment and align each transliteration pair in the training data.

Transliteration.
For each name in the test set, we apply a joint source-channel model (JSCM)  to generate a list of transliteration hypotheses, where the probabilities of n-grams of transliteration unit pairs are estimated from the alignment result.
3. Linking. We link each transliteration hypothesis to an English KB using a languageindependent entity linker (Wang et al., 2015). If context exists, we apply collective inference to link multiple related names simultaneously.
4. Hypotheses Correction. Finally, we revise each hypothesis using the surface forms of the linked entities, and merge and rank the revised hypotheses.
The detailed techniques for each step will be presented in the following sections.

Transliteration Hypotheses Generation
We use the joint source-channel model (JSCM) proposed in ) as our baseline model.
In our experiments, we estimate the conditional probability P (⟨s, t⟩ k |⟨s, t⟩ k−1 k−n+1 ) from the alignment result generated by the many-to-many alignment model (m2m-aligner).
Originally designed for letter-to-phoneme conversion, the m2m-aligner has also been used in previous transliteration-related tasks Cook and Stevenson, 2009;Jiampojamarn et al., 2008). We apply the m2malingner to the training data to obtain segmentations and alignments. For languages with a large number of characters, training pairs may not cover all characters. As a fallback option, we extend the m2m-aligner's output with pronunciations or romanizations of characters out of the training data. For example, if the Chinese character (kǒng) is absent in the training set, we use kong as its transliteration.

Entity Linking
We apply a state-of-the-art language-independent Entity Linker (Wang et al., 2015) to link each transliteration hypothesis to an English KB (DB-Pedia in our experiment). For each entity name mention m, this entity linker uses the surface form dictionary ⟨f, e 1 , e 2 , ..., e k ⟩, where e 1 , e 2 , ..., e k is the set of entities with surface form f in the KB according to their properties (e.g., labels, names, aliases), to locate a list of candidate entities and apply salience ranking by an entropy based approach. After that, it computes similarity scores for each entity mention and candidate entity pair ⟨m, e⟩ and re-ranks the candidate entities. If the context is available, the linker adopts an unsupervised collective inference approach which links multiple entity mentions simultaneously and selects corresponding entity candidates which are most strongly connected in the KB as the final linking results. Figure 2 shows the workflow of Collective Inference based Entity Linking. It first constructs a Mention Context Graph G m for all entity mentions M = {m 1 , m 2 , ..., m n } which co-occur within a context window 4 and generates a ranked entity candidate list for each entity mention. KB can be represented as a graph G k that consists of entities as vertices and weighed relations as edges. Hence, it also constructs a Candidate Graph which is a set of graphs G i c (i = 1, 2, ..). Here, each graph G i c represents a set of entity candidate for mentions in M . Finally, it applies a Candidate Graph collective validation approach that computes similarity scores between G m and G i c and selects G i c with the highest score as the final linking results.
In our experiment, we first transliterate all mentions with the JSCM. During the entity linking step, transliteration hypotheses of mentions within a context windows are linked simultaneously.

Hypothesis Correction
With a linked entity set E i for each transliteration hypothesis β i , we revise β i using the entity surface form based on following rules: (1) Split a transliteration hypothesis and the surface form of a linked entity into tokens. (2) Compute string similarity between every hypothesis token and entity token.
(3) Revise each hypothesis token to the its most similar entity token.
For example, the top-1 transliteration hypothesis of "พอร์ เตอร์ , แคเทอรี น แอนน์ " (Thai, Poter, Katherine Anne) is "Porter, Catherine Ann", and Katherine_Anne_Porter is one of the linked entities. Katherine, Anne, and Porter are used to revise Catherine, Ann, and Porter, respectively, regardless of the token order.
We compute the score of the revised hypothesis of transliteration β i and entity candidate c j as a product of the transliteration score and the linking score.  " (kǎ lā shí ní kē fū, Kalashnikov), surface forms of top-3 linked entities for each linkable hypothesis, and the revised hypotheses. Identical revision results from divergent transliteration hypotheses and entity candidates are considered as the same hypothesis, and their scores are summed. After merging revised hypotheses, we rank them and select the top-1 as the final transliteration, which in this example is Kalashnikov (3.2).
Ambiguous names which refer to more than one entity may be transliterated in different ways. If the context is provided, we can eliminate ambiguity based on collective inference. For example, in the sentence, " (Reno is the third most populous city in the state of Nevada after Las Vegas and Henderson)," the transliterations of " " (léi nuò) include Reno, a city in Nevada, Renault, a French automobile manufacturer, and Raynor, a virtual role in StarCraft. In order to link it to the correct entity, we first independently transliterate all mentions, namely " " (Las Vegas), " " (Henderson) " " (Reno) and " " (Nevada). Then we apply collective entity linking to these mentions. Since Reno has explicit and strong relations with Las Vegas (another city in Nevada), Henderson (another city in Nevada) and Nevada (the state of Reno) in the KB, it is ranked higher than other candidates by the linker.

Cross-lingual Projection
For low-resource languages, it is not feasible to directly apply this framework because manually collecting transliteration training pairs takes considerable time and effort.
We observed that some related languages share the same or similar character sets, linguistic characteristics and transliteration conventions. For example, a virama is employed to suppress the inherent vowel, namely schwa, of its preceding consonant in many Indic scripts. In the light of this fact, it is possible to transfer words or transliteration rules across related languages and thereby avoid collecting extra training data for each language. Therefore, we propose a Unicode namebased projection scheme that transfers IL words to their character equivalents in a high-resource related language so that we can apply the transliteration model trained for the high-resource language. This method is very similar to our recent work on building grapheme-to-phoneme models across related scripts (Deri and Knight, 2016).
In Unicode character code charts 5 , most vowels, consonants and signs are assigned a name with the following format:

SCRIPT TYPE NAME
For example, Bengali independent vowel "অ", dependent vowel sign " ু " and consonant "ক" are named BENGALI LETTER A, BENGALI VOWEL SIGN U and BENGALI LETTER KA, respectively.
Utilizing these Unicode character names as a bridge, our approach consists of the following steps: (1) For a low resource language L, select its related language L ′ whose transliteration pairs can be extracted from existing resources with minimal effort; (2) Construct a L to L ′ character mapping table based on Unicode character names; (3) Convert a L name α to α ′ using its corresponding L ′ characters in the mapping table; (4) Transliterate α ′ with the model trained on pairs in L ′ .
To illustrate this idea, the following example shows how to transliterate Hindi word "अमे रिका" (America) using a Bengali-to-English transliteration model.  Table 3 shows. For example, the Hindi vowel आ (DEVANAGARI LETTER AA) is mapped to its Bengali counterpart ই (BENGALI LETTER AA).

Hindi Name Bengali Name
Next, the Hindi word "अमे िरका" is converted to "অমেরিকা" character by character following the mapping table. Note that the result is not exactly the same as "আমেরিকা", the actual Bengali word representing "America". Finally, "অমেরিকা" is transliterated into "America" using the Bengalito-English transliteration model.

Experiments
In this section we will present experimental results for context-independent, context-dependent, and cross-lingual projection settings respectively.

Context-Independent Transliteration
We train transliteration models of six languages with the NEWS2015 data sets and use corresponding development sets as our test set since the official test sets are not publicly available. Additionally, because some source names lack matched entities in DBpedia, we also evaluate with subsets containing only linkable names, whose gold transliterations match at least one entity in DBpedia. Table 4 summarizes the data statistics 6,7 . We evaluate the performance based on a strict accuracy metric by checking whether our top 1 hypothesis of each name exactly matches the ground-truth transliteration.   Table 5: Back-transliteration accuracy on NEWS2015 development sets (%).
We use JSCM as our baseline model, which has comparable performance with other models used in the NEWS Shared Task. Since we focus on back-transliteration from low-resource languages to English in this work, and only Thai has the back-transliteration task and data set, we reverse the source and target names of other languages in our experiments. In addition, we train a unigram language model (LM) from Gigaword 8 to revise  The overall performance is shown in Table 5. We can see that entity linking improves the transliteration accuracy for all languages, especially for the linkable subsets. Previous results on back-transliteration are only available for Thai-to-English, and our top-1 accuracy (44.0%) notably advances the previous highest score (39.5%) reported in (Nicolai et al., 2015). Besides, our experiments show that the entity linker outperforms the language model trained on a large corpus.
In Table 6, we list some correction examples for languages we evaluate. Entity Linking has mainly made three types of corrections as follows.

Double consonants.
Repeated consonant letters in English, such as "tt", are usually transliterated into a single character in languages not written in Roman script. Since double consonants are less frequent than single ones, statistical models tend to back-transliterate a character into a single corresponding English letter, e.g., " "(sī) to "s" instead of "ss".

Consonant.
Because some orthographies lack characters to represent all English consonants, different English consonants may correspond to the same character in other languages. For example, "k" and "g" are usually transliterated into "க" in Tamil.

Vowel.
The correspondences between English vowel letters and phonemes are complex. A vowel letter may be pronounced in different ways, e.g., the three a's in banana, while distinct vowel letters may have the same pronunciation, e.g., the first e and second i in ingredient. Such inconsistency makes back-transliteration from shallow orthographies more difficult.
Although entity linking explicitly improves the transliteration quality, we observe that some correct transliteration hypotheses are mistakenly revised due to the lack of context and corresponding entities. For example, "लखपति" is correctly transliterated to Lakhpati but subsequently revised to Lakhpat because Lakhpati is absent in the KB.

Context-Dependent Transliteration
In order to evaluate context-dependent transliteration, we train a Chinese-to-English transliteration model on 44, 146 name pairs (Ji et al., 2009), and use the Chinese-to-English Entity Discovery and Linking data set in NIST TAC-KBP2015 evaluation  since the NEWS2015 Chinese data set only contains transliteration pairs without any contextual information. This data set contains 160 documents and 11, 066 Chinese mentions, including 1, 239 person names (469 unique ones). Each name has a ground-truth transliteration derived from English KB title.  Table 7: Impact of collective inference on transliteration (%). Table 7 presents the results for person names. We can see that by exploiting the contextual information, collective inference provides more accurate entity linking and hence further enhances transliteration.  Figure 5: Transliteration accuracy score with different sizes of training pairs. The first language of each pair is the language that the model is trained on, and the second one is the test language.

Cross-lingual Projection
In projection experiments, we train transliteration models on the NEWS2015 data sets described in Section 7.1, and evaluate each with other languages using the projection approach proposed in Section 6. Among the six languages, our results show that the Hindi model is the best for transliterating Bengali and Kannada names, while the Kannada model is the best for transliterating Hindi names. Learning curves of these three pairs depicted in Figure 5 show that it is feasible to transliterate names using a model trained on a related language without extra data. For example, a train-ing set of 10, 000 Hindi-English pairs achieves the same performance as 4, 000 Kannada-English pairs on transliterating Kannada names.
Nevertheless, we still observe a performance gap between training from the related language and the source language itself. This indicates that after applying our character mapping-based method, dissimilarities may still exist between the representations of the same name from two related languages. For example, because "भ" is the most similar Hindi character for Bengali character "ভ", we mapped "ভ" to "भ". However, apart from their common transliteration "bh", "ভ" can also be transliterated to "v", such as "ভেঙ্ কটেশ" (Venkatesh), thereby leading to some incorrect transliterations, e.g., "अभिनभास" (Abhinbhaas) to "Abhinvass". Another example is that the combination of "ॉ" and "क" is usually transliterated to "ock" in the Bengali training set, whereas its Hindi equivalent "োক" is usually transliterated to "och" or "ok", and therefore "लॉके ट"(locket) is incorrectly transliterated to "loket". In addition, representations referring to the same entity may have divergent origins in different languages (e.g., "German" has variants including "Germanisch", "Deutsch", "Alemannisch", "niemy" and "Sachsen").
However, we see that such performance gap can be narrowed or filled by entity linking. In Figure  We also train combination models using different sizes of pairs of a source language and 10, 000 pairs of its related language. Such a combination dramatically improves the performance, which means that by borrowing pairs from a related language, we can develop a high-performance model with only a small number of transliteration pairs of the source language.
We also find that it is difficult to construct mapping tables for Thai and Hebrew since they share few similar character names with Hindi, Kannada, Bengali, and Tamil. Additionally, despite of the fact that Kannada and Tamil lie on the same language family branch, they have evolved independently for centuries and have different representations of sounds, which to some extent explains why the projection between them is not as effective. For example, in Kannnda script, there are different letters for ka (ಕ) and ga (ಗ), whereas Tamil only uses one letter க (ka).

Remaining Challenges
Regardless of our improvement and promising results, the overall strict accuracy name transliteration is still quite low. We categorize the remaining challenges as follows.
• Name Segmentation. Some additional splitting or merging operations are needed for some names. For example, "अग् निपु राण" in Hindi should be transliterated into two tokens "Agni Purana".
• Source Language Specific Features. For example, "-istan" is a common country suffix in Turkish and Persian, and thus it can be ignored during transliteration (e.g., when transliterating "Gürcistan" in Turkish to English, we can focus on transliterating "Gürc" to "Georgia").
• Entity Profile. Name transliteration might follow specific conventions based on the entity's origin, gender, title and characteristic. For example, "Monroe" is transliterated to " " (mén luó) in "James Monroe", the fifth President of the United States, while " " (mèng lù) in "Marilyn Monroe", a famous American actress, where " ", " ", " " and " " refer to "door", "net", "dream" and "dew", respectively. For celebrities with corresponding entities in the KB, the collective inference method we employ can resolve the ambiguity and hence generate correct transliterations, while it does not work for out-of-KB ones. In order to transliterate such out-of-KB names, some of their properties, such as gender, need to be inferred from the text.
Since names are inherently associated with entities, it is natural to leverage entity linking to improve name transliteration. To the best of our knowledge, this is the first study using entity linking results to revise transliteration hypotheses. We also take the specific context of a name into consideration to improve the quality of entity linking and reduce ambiguity.
Additionally, to tackle the data sparsity challenge in low-resource languages, we propose a simple but effective cross-lingual projection approach to take advantage of resources in related languages. Similar cross-lingual projection methods based on data/annotation transfer have also been exploited for other Natural Language Processing tasks, including relation extraction, data annotation, entity recognition, and grapheme-tophoneme models (Xia and Lewis, 2007;Padó and Lapata, 2009;Kim et al., 2010;Faruqui and Kumar, 2015;Deri and Knight, 2016).

Conclusions and Future Work
For many names we need to know the real-world entities they refer to before generating their correct transliterations. In this paper we developed a novel context-aware name transliteration approach by leveraging Entity Linking and related language projection. Experiments have demonstrated that our approach can significantly enhance the transliteration performance. In the future we will explore more knowledge from the KB such as types and properties of entities to improve disambiguation and transliteration. We will also aim to incorporate morphology analysis, acquire and incorporate language-specific and culture-specific characteristics to address the remaining challenges.