Report of NEWS 2009 Machine Transliteration Shared Task


This report documents the Machine Transliteration Shared Task conducted as a part of the Named Entities Workshop (NEWS 2009), an ACL-IJCNLP 2009 workshop. The shared task features machine transliteration of proper names from English to a set of languages. It has witnessed enthusiastic participation of 31 teams from all over the world, with wide coverage of each language pair (more than a dozen participants per language pair) and a diversity of transliteration methodologies adequately represented for each, underscoring that the workshop may truly indicate the state of the art in machine transliteration for these language pairs. We measure and report 6 performance metrics on the submitted results. We believe that the shared task has successfully achieved the following objectives: (i) bringing together the community of researchers in the area of Machine Transliteration to focus on various research avenues, (ii) calibrating systems on common corpora, using common metrics, thus creating a reasonable baseline for the state of the art of transliteration systems, and (iii) providing a quantitative basis for meaningful comparison and analysis between the various algorithmic approaches used in machine transliteration. We believe that the results of this shared task will uncover a host of interesting research problems, giving impetus to research in this significant area.

Introduction
Names play a significant role in many Natural Language Processing (NLP) and Information Retrieval (IR) systems. They have a critical role in Cross-Language Information Retrieval (CLIR) and Machine Translation (MT) systems, as system performance has been shown in several studies to correlate positively with the correct conversion of names between the languages (Demner-Fushman and Oard, 2002; Mandl and Womser-Hacker, 2005; Hermjakob et al., 2008; Udupa et al., 2009). The traditional source of name equivalence, bilingual dictionaries (whether hand-crafted or statistical), offers only limited support, as dictionaries do not have sufficient coverage of names: new names are introduced to the vocabulary of a language every day.
All of the above point to the critical need for robust Machine Transliteration technology and systems, and this need has attracted attention from the research community. Over the last decade, scores of papers on Machine Transliteration have appeared in the top Computational Linguistics, Information Retrieval and Data Management conferences, exploring diverse algorithmic approaches in a wide variety of languages (Knight and Graehl, 1998; Li et al., 2004; Zelenko and Aone, 2006; Sproat et al., 2006; Sherif and Kondrak, 2007; Hermjakob et al., 2008; Goldwasser and Roth, 2008; Goldberg and Elhadad, 2008; Klementiev and Roth, 2006). However, there has not been any coordinated effort in calibrating the state-of-the-art technical capabilities of machine transliteration: the studies explore different algorithmic approaches in different language pairs, report their performance with different metrics, and test on different corpora.
The overarching objective of this shared task is to drive machine transliteration technology forward, to measure and baseline the state of the art, and to provide a meaningful comparison between the most promising algorithmic approaches in order to stimulate discussion among researchers. The NLP community in Asia is especially interested in transliteration, as several major Asian languages do not use Latin script in their native writing systems. The Named Entity Workshop (NEWS 2009) at ACL-IJCNLP 2009 in Singapore provides an ideal platform for such a shared task to take off, and this is precisely what we address in this shared task on machine transliteration, conducted as a part of that workshop.
The shared task aims at achieving the following objectives:
• Providing a forum to bring together the community of researchers in the area of Machine Transliteration to focus on various research avenues in this important research area.
• Calibrating systems on common hand-crafted corpora, using common metrics, in many different languages, thus creating a reasonable baseline for the state-of-the-art of transliteration systems.
• Analysing the results so that a reasonable comparison of different algorithmic approaches and their trade-offs (such as transliteration quality vs. generality of approach across languages vs. training data size) may be explored.
We believe that a substantial part of what we set out to achieve has been accomplished, and we present this report as a record of the task process, system participation, results, and our findings. It is our hope that this report will generate lively discussions during the NEWS workshop and stimulate subsequent research in this important area. This introduction outlines the purpose of the transliteration shared task conducted as a part of the NEWS workshop. Section 2 outlines the machine transliteration task and the corpora used, and Section 3 discusses the metrics chosen for evaluation, along with the rationale for choosing them. Section 4 sketches the participation. Section 5 presents the results of the shared task and their analysis. Section 6 summarises the queries and feedback we have received from the participants, and Section 7 concludes, presenting some lessons learnt from the current edition of the shared task and some ideas we want to pursue in future editions of the Machine Transliteration tasks.

Transliteration Shared Task
In this section, we outline the definition of the task, the process followed and the rationale for the decisions.

"Transliteration": A definition
Several terms are used interchangeably in the contemporary research literature for the conversion of names between two languages, such as transliteration, transcription, and sometimes Romanisation, especially if Latin scripts are used for target strings (Halpern, 2007).
Our aim is not only to capture the name conversion process from a source to a target language, but also its ultimate utility for downstream applications, such as CLIR and MT. We have narrowed the task down to three specific requirements, as follows: "Transliteration is the conversion of a given name in the source language (a text string in the source writing system or orthography) to a name in the target language (another text string in the target writing system or orthography), such that the target language name: (i) is phonemically equivalent to the source name, (ii) conforms to the phonology of the target language, and (iii) matches the user intuition of the equivalent of the source language name in the target language." Given that the phoneme sets of languages may not be exactly the same, the first requirement must be diluted from "equivalent" to "close to". The second requirement ensures that the target string is a valid string as per the target language phonology. The third requirement is introduced to produce what a typical user would expect (at least for popular names), and to make the output useful for downstream applications such as MT or CLIR systems. Though the third requirement may lead systems to produce target language strings that marginally violate the first or second requirement, it ensures that such a transliteration system is of value to downstream systems. All the above requirements are implicitly enforced by the choice of name pairs used to define the training and test corpora in a given language pair. In cases where multiple equivalent target language names are possible for a source language name, we include all of them.
After much debate, we have retained the task name "transliteration", though our definition may be closest to the "popular transcription" of Halpern (2007), owing to the popularity of the term "Machine Transliteration" among language technology researchers.

Shared Task Description
The shared task is specified as development of machine transliteration systems in one or more of the specified language pairs. Each language pair of the shared task consists of a source and a target language, implicitly specifying the transliteration direction. Training and development data in each of the language pairs have been made available to all registered participants for developing a transliteration system for that specific language pair using any approach that they find appropriate.
At evaluation time, a standard hand-crafted test set consisting of between 1,000 and 3,000 source names (approximately 10% of the training data size) has been released, on which the participants are required to produce a ranked list of transliteration candidates in the target language for each source name. The system output is tested against a reference set (which may include multiple correct transliterations for some source names), and the performance of a system is captured in multiple metrics (defined in Section 3), each designed to capture a specific performance dimension.
For every language pair, every participant is required to submit one run (designated as a "standard" run) that uses only the data provided by the NEWS workshop organisers in that language pair, and no other data or linguistic resources. This standard run ensures parity between systems and enables a meaningful comparison of the performance of various algorithmic approaches in a given language pair. Participants are allowed to submit further runs (designated as "non-standard") for every language pair, using data beyond that provided by the shared task organisers, linguistic resources in a specific language, or both. This may enable a participant to demonstrate the limits of performance of their system in a given language pair.
The shared task timelines provide adequate time for development, testing (approximately 2 months after the release of the training data) and the final result submission (5 days after the release of the test data).

Shared Task Corpora
We have had two specific constraints in selecting languages for the shared task: language diversity and data availability. To make the shared task interesting and to attract wider participation, it is important to ensure a reasonable variety among the languages in terms of linguistic diversity, orthography and geography. Equally, the ability to procure and distribute a reasonably large hand-crafted corpus of paired names (approximately 10K pairs for training and testing together) is critical for this process. At the end of the planning stage, and after discussion with the data providers, we chose the set of 7 languages shown in Table 1 for the task (Li et al., 2004; Kumaran and Kellner, 2007; MSRI, 2009; CJKI, 2009).
For all of the languages chosen, we have been able to procure paired-name data between English and the respective languages and to make them available to the participants. In addition, we have been able to procure a specific corpus of about 40K Romanised Japanese names and their Kanji counterparts, and the corresponding language pair (Japanese names from their Romanised form to Kanji) has been included as one of the task language pairs.
It should be noted here that each corpus has a definite skew in its characteristics: the names in the Chinese, Japanese and Korean (CJK) language corpora are Western names; the Indic language (Hindi, Kannada and Tamil) corpora consist of a mix of Indian and Western names. The Romanised Kanji to Kanji corpus consists only of native Japanese names. While such characteristics may have provided us an opportunity to measure performance specifically for forward transliteration (in CJK) and backward transliteration (in Romanised Kanji), we do not highlight such fine distinctions in this edition.
Finally, it should be noted that the corpora procured and released for NEWS 2009 represent perhaps the most diverse and largest corpora used for any common transliteration task today.

Evaluation Metrics and Rationale
The participants have been asked to submit results of one standard and up to four non-standard runs. Each run contains a ranked list of up to 10 candidate transliterations for each source name.
The submitted results are compared to the ground truth (reference transliterations) using 6 evaluation metrics capturing different aspects of transliteration performance. Since a name may have multiple correct transliterations, all these alternatives are treated equally in the evaluation, that is, any of these alternatives is considered as a correct transliteration, and all candidates matching any of the reference transliterations are accepted as correct ones.
The following notation is assumed throughout:
• $N$: the total number of names (source words) in the test set
• $n_i$: the number of reference transliterations for the $i$-th name in the test set ($n_i \geq 1$)
• $r_{i,j}$: the $j$-th reference transliteration for the $i$-th name in the test set
• $c_{i,k}$: the $k$-th candidate transliteration (system output) for the $i$-th name in the test set ($1 \leq k \leq 10$)
• $K_i$: the number of candidate transliterations produced by a transliteration system for the $i$-th name
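To make the notation concrete, the following minimal Python sketch shows one way a test set and a system's ranked output can be represented (the names are hypothetical and purely illustrative; the metric sketches later in this section reuse these structures):

```python
# Hypothetical test data mirroring the notation above; purely illustrative.
references = [
    ["Tokio", "Tokyo"],   # r_{1,j}: n_1 = 2 reference transliterations
    ["London"],           # r_{2,j}: n_2 = 1 reference transliteration
]
candidates = [
    ["Tokyo", "Tokio", "Tokyu"],  # c_{1,k}: ranked, K_1 = 3 candidates
    ["London", "Landon"],         # c_{2,k}: ranked, K_2 = 2 candidates
]

N = len(references)                 # total number of source names
assert len(candidates) == N
assert all(len(c) <= 10 for c in candidates)  # at most 10 candidates per name
```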

Word Accuracy in Top-1 (ACC)
Also reported in the literature through its complement, Word Error Rate, this metric measures the correctness of the first transliteration candidate in the candidate list produced by a transliteration system:

$$\mathrm{ACC} = \frac{1}{N}\sum_{i=1}^{N}\begin{cases}1 & \text{if } \exists\, j:\ c_{i,1} = r_{i,j}\\ 0 & \text{otherwise.}\end{cases}$$

ACC = 1 means that all top candidates are correct transliterations, i.e. each matches one of its references, and ACC = 0 means that none of the top candidates is correct.
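As a concrete illustration, here is a minimal Python sketch of ACC over the `references` and `candidates` structures introduced above (an illustrative re-implementation, not the official scoring script):

```python
def acc(candidates, references):
    """Word Accuracy in Top-1: the fraction of source names whose
    top-ranked candidate matches at least one reference."""
    correct = sum(
        1
        for cands, refs in zip(candidates, references)
        if cands and cands[0] in refs
    )
    return correct / len(references)
```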
Fuzziness in Top-1 (Mean F-score)

The mean F-score measures how different, on average, the top transliteration candidate is from its closest reference. The F-score for each source word is a function of Precision and Recall, and equals 1 when the top candidate matches one of the references, and 0 when there are no common characters between the candidate and any of the references. Precision and Recall are calculated based on the length of the Longest Common Subsequence (LCS) between a candidate and a reference:

$$\mathrm{LCS}(c, r) = \frac{1}{2}\left(|c| + |r| - \mathrm{ED}(c, r)\right)$$

where ED is the edit distance and $|x|$ is the length of $x$. For example, the longest common subsequence between "abcd" and "afcde" is "acd", and its length is 3. The best matching reference, that is, the reference for which the edit distance is minimal, is taken for the calculation. If the best matching reference is given by

$$r_{i,m} = \arg\min_{j}\ \mathrm{ED}(c_{i,1}, r_{i,j}),$$

then Recall, Precision and F-score for the $i$-th word are calculated as

$$R_i = \frac{\mathrm{LCS}(c_{i,1}, r_{i,m})}{|r_{i,m}|},\qquad P_i = \frac{\mathrm{LCS}(c_{i,1}, r_{i,m})}{|c_{i,1}|},\qquad F_i = \frac{2\,R_i P_i}{R_i + P_i}.$$

Two further conventions apply (a sketch of the computation follows these notes):
• The length is computed in distinct Unicode characters.
• No distinction is made between different character types of a language (e.g., vowels vs. consonants vs. combining diacritics).
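To make the above concrete, here is a minimal Python sketch of the mean F-score over the structures introduced earlier; it computes the LCS directly by dynamic programming rather than through the edit-distance identity (illustrative only, not the official scoring script):

```python
def lcs_len(a, b):
    """Length of the Longest Common Subsequence, by dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, start=1):
        for j, cb in enumerate(b, start=1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def mean_f_score(candidates, references):
    """Mean F-score of the top candidate against its closest reference."""
    total = 0.0
    for cands, refs in zip(candidates, references):
        if not cands:
            continue  # no output for this name: contributes 0
        top = cands[0]
        # Best-matching reference: minimal insertion/deletion edit distance,
        # using ED(c, r) = |c| + |r| - 2 * LCS(c, r).
        best = min(refs, key=lambda r: len(top) + len(r) - 2 * lcs_len(top, r))
        lcs = lcs_len(top, best)
        if lcs == 0:
            continue  # no common characters: F = 0
        recall, precision = lcs / len(best), lcs / len(top)
        total += 2 * recall * precision / (recall + precision)
    return total / len(references)
```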

Mean Reciprocal Rank (MRR)
This measures the traditional Mean Reciprocal Rank of the first correct answer produced by the system among its candidates:

$$\mathrm{MRR} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{RR}_i,\qquad \mathrm{RR}_i = \begin{cases}1/\min\{k : \exists\, j,\ c_{i,k} = r_{i,j}\} & \text{if such a } k \text{ exists}\\ 0 & \text{otherwise.}\end{cases}$$

1/MRR gives, approximately, the average rank of the correct transliteration. An MRR close to 1 implies that correct answers are mostly produced close to the top of the n-best lists.
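A minimal Python sketch of MRR over the same structures (illustrative only):

```python
def mrr(candidates, references):
    """Mean Reciprocal Rank of the first correct candidate per name;
    names with no correct candidate contribute 0."""
    total = 0.0
    for cands, refs in zip(candidates, references):
        for rank, cand in enumerate(cands, start=1):
            if cand in refs:
                total += 1.0 / rank
                break  # only the highest-ranked correct candidate counts
    return total / len(references)
```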

MAP_ref
This measures tightly the precision in the n-best candidates for the $i$-th source name, for which $n_i$ reference transliterations are available. If all of the references are produced, then the MAP is 1. Let $\mathit{num}(i,k)$ denote the number of correct candidates for the $i$-th source word in the $k$-best list. MAP_ref is then given by

$$\mathrm{MAP}_{ref} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\sum_{k=1}^{n_i}\frac{\mathit{num}(i,k)}{k}.$$

MAP_10

MAP_10 measures the precision in the 10-best candidates for the $i$-th source name provided by the system. In general, the higher MAP_10 is, the better the quality of the transliteration system at capturing the multiple references. (A sketch covering both MAP_ref and MAP_10 is given below.)
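A minimal Python sketch covering both metrics over the structures introduced earlier; it assumes, by analogy with the MAP_ref formula above, that MAP_10 normalises by the cutoff 10 (an illustrative re-implementation, not the official scoring script):

```python
def mean_ap(candidates, references, cutoff=None):
    """Mean Average Precision over k-best lists.

    cutoff=None evaluates k = 1..n_i (MAP_ref); cutoff=10 evaluates
    k = 1..10 (MAP_10, assuming normalisation by the cutoff).
    """
    total = 0.0
    for cands, refs in zip(candidates, references):
        limit = len(refs) if cutoff is None else cutoff
        score = 0.0
        for k in range(1, limit + 1):
            num_ik = sum(1 for c in cands[:k] if c in refs)  # num(i, k)
            score += num_ik / k
        total += score / limit
    return total / len(references)

map_ref = mean_ap(candidates, references)
map_10 = mean_ap(candidates, references, cutoff=10)
```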

MAP_sys
MAP_sys measures the precision in the top $K_i$-best candidates produced by the system for the $i$-th source name, for which $n_i$ reference transliterations are available. This measure allows systems to produce a variable number of transliterations, based on their confidence in identifying and producing correct transliterations.

Participation in Shared Task
There have been 31 systems from around the world that participated in the shared task and submitted transliteration results for a common test set, produced by their systems trained on the common training corpora. A few teams participated in all or almost all tasks (that is, language pairs); most others participated in 3 tasks on average. Each language pair attracted around 13 teams on average. The participation details are shown in Table 3, and the demographics of the participating teams by country are shown in Figure 1.
,-."  Teams are required to submit at least one standard run for every task they participated in. In total 104 standard and 86 non-standard runs have been submitted. Table 2 shows the number of standard and non-standard runs submitted for each task. It is clear that the most "popular" tasks are transliteration from English to Hindi and from English to Chinese, attempted by 21 and 18 participants respectively. Overall, as can be noted from the results, each task has received significant participation.

Standard runs
The 8 individual plots in Figure 2 summarise, for each task, the results of the standard runs on 3 metrics that evaluate the top or first correct candidate per source word, namely accuracy in top-1 (ACC), mean F-score and Mean Reciprocal Rank (MRR). The plots in Figure 3 summarise, for each task, the results on 3 metrics over the rank-ordered transliteration output of the systems, namely MAP_ref, MAP_10 and MAP_sys. All results are presented numerically in Tables 8-11, for all evaluation metrics. These are the official evaluation results published for this edition of the transliteration shared task. Note that two teams updated their results (after fixing bugs in their systems) after the deadline; their results are identified specifically.

We find that two approaches to transliteration are the most popular in the shared task submissions. One is phrase-based statistical machine transliteration (Finch and Sumita, 2008), an approach initially developed for machine translation (Koehn et al., 2003); systems that adopted this approach include (Song, 2009; Haque et al., 2009; Noeman, 2009; Rama and Gali, 2009; Chinnakotla and Damani, 2009). The other is based on Conditional Random Fields (Lafferty et al., 2001).

With only a few exceptions, most implementations are based on language-independent approaches. Indeed, many of the participants fielded their systems on multiple languages, as can be seen from Table 3. We also note that the combination of several different models (CRF, Maximum Entropy Model, Margin Infused Relaxed Algorithm) via re-ranking of their outputs proves to be very successful (Oh et al., 2009); their system (reported as Team ID 6) produced the best or second-best transliteration performance consistently across all metrics, in all tasks except Japanese back-transliteration. Other examples of model combination include (Das et al., 2009).
At least two teams (reported as Team IDs 14 and 27) incorporate language-origin detection in their systems (Bose and Sarkar, 2009; Khapra and Bhattacharyya, 2009). The Indian language corpora contain names of both English and Indic origin. Khapra and Bhattacharyya (2009) demonstrate how much transliteration performance can be improved when language-of-origin detection is employed, followed by a language-specific transliteration model for decoding.
Some systems merit specific mention, as they adopt rather unique approaches. Jiampojamarn et al. (2009) propose DirecTL, a language-independent discriminative sequence prediction model (reported as Team ID 7); their transliteration accuracy is among the highest in several tasks (EnCh, EnHi and EnRu). Zelenko (2009) presents an approach to the transliteration problem based on the Minimum Description Length (MDL) principle. Freitag and Wang (2009) approach the problem of transliteration with bidirectional perceptron edit models.
Finally, in Figure 4 we present a plot in which each point represents a standard run by a system, with each task marked by a specific shape and colour. This plot gives a bird's-eye view of system performance across the two least correlated evaluation metrics, namely accuracy in top-1 (ACC) and mean F-score. Not surprisingly, we notice very high performance in terms of F-score on the English to Russian transliteration task, likely because Russian orthography follows pronunciation very closely, except for characters such as the soft and hard signs, which can hardly be recovered from English words.
We also observe that Japanese back-transliteration has proven to be much harder than the other (forward-transliteration) tasks. In general, we note that a well-performing transliteration system performs well across all metrics. We are curious about the correlation between the different metrics; the Spearman's rank correlation coefficients are as follows:
• Accuracy in top-1 vs. F-score: 0.40
• F-score vs. MRR: 0.44
We find that F-score is the least correlated metric: the Spearman's rank correlation coefficient between F-score and accuracy in top-1 is 0.40, and between F-score and MRR it is 0.44. This is likely because all metrics except F-score are based on word accuracy, while F-score is based on word similarity, allowing non-matching words to have scores well above 0.
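Such rank correlations are straightforward to reproduce from per-run scores; below is a minimal sketch using SciPy's `spearmanr` (the score arrays are hypothetical placeholders, not the shared task data):

```python
from scipy.stats import spearmanr

# Hypothetical per-run scores, one entry per submitted standard run.
acc_scores = [0.73, 0.61, 0.45, 0.52, 0.68]
f_scores = [0.92, 0.87, 0.80, 0.85, 0.90]

rho, p_value = spearmanr(acc_scores, f_scores)
print(f"Spearman's rank correlation: {rho:.2f} (p = {p_value:.3f})")
```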

Non-standard runs
For the non-standard runs, no restrictions were placed on the use of additional data or other linguistic resources. The purpose of the non-standard runs is to see how accurate personal name transliteration can be for a given language pair. The approaches used in the non-standard runs are typical and may be summarised as follows:
• Dictionary lookup.
• Pronunciation dictionaries to convert words to their phonetic transcription.
• Additional corpora for training and dictionary lookup, such as LDC English-Chinese named entity list LDC2005T34 (Linguistic Data Consortium, 2005).
• Web search, and in particular Wikipedia search: first, transliteration candidates are generated; then a Web search is performed to see whether any of the candidates appear in the search results, and the candidates are re-ranked based on this evidence (a minimal sketch of this re-ranking step is given below).
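A minimal sketch of such web-based re-ranking; `web_hit_count` is a hypothetical stand-in for a search backend (no particular API is implied), and the actual submissions differ in how they query and combine the evidence:

```python
def web_hit_count(query):
    """Hypothetical stand-in for a web or Wikipedia search call that
    returns the number of hits for the query string."""
    raise NotImplementedError("plug in a real search backend here")


def rerank_by_web(source_name, ranked_candidates):
    """Re-rank transliteration candidates by web evidence; ties are
    broken by the original model rank (earlier = better)."""
    scored = [
        (web_hit_count(f'"{source_name}" "{cand}"'), -rank, cand)
        for rank, cand in enumerate(ranked_candidates)
    ]
    scored.sort(reverse=True)  # more hits first, then better model rank
    return [cand for _, _, cand in scored]
```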
The results are shown in Tables 16-19. For the English to Chinese and English to Russian transliteration tasks, the accuracy in top-1 goes as high as 0.909 and 0.955, respectively, when Web search is used to aid transliteration.

Post-evaluation
Two participants found bugs in their system implementations and re-evaluated their results after the deadline. Their results are marked specifically in Tables 4-8 and 16.

Process Analysis and Fine-tuning
In this section we highlight some of the suggestions and feedback that we have received from the participants during the course of this shared task. While a few of them have been implemented in the current edition, many of these may be considered in future editions of the shared task.
More or different languages There is quite a bit of interest in extending the short-list of language pairs. While we are constrained in this edition by the availability of manually verified data, more languages will certainly be included in future editions; some specific data have already been promised for them.
Bidirectional transliteration Many participants expressed interest in transliteration into English; this reverse-direction task will be added in future editions. We believe it will encourage more participation, as it will be easy to read and verify system output in English for teams not familiar with the non-English side of a language pair.
Forward vs. backward transliteration Quite a bit of interest has been expressed in explicitly separating forward and backward transliteration tasks. However, such a separation requires specific corpora with a known origin for each name pair, and we are clearly constrained by the availability of such corpora. When suitable corpora are available, the tasks may be designated explicitly in future editions.

Number of standard runs
The number of standard runs that may be submitted may be increased in future editions, as many participants would like to submit multiple standard runs trained with different parameters.

Errors in training and development corpora
While we have taken all precautions in acquiring and creating the corpora, some errors still remain. We thank those who sent us errata. However, since the affected portion is less than 0.5% of the data, we believe the effect on the final results is minimal. The errata will be made available to all participants.

Conclusions and Future Plans
We are pleased to report a comprehensive calibration and baselining of machine transliteration approaches, as most state-of-the-art machine transliteration techniques are represented in the shared task. The most popular techniques, such as phrase-based machine transliteration (Koehn et al., 2003) and Conditional Random Fields (Lafferty et al., 2001), are inspired by recent progress in machine translation. As the standard runs are limited to the provided corpora, most of the systems are implemented under the direct orthographic mapping (DOM) framework (Li et al., 2004). While the standard runs allow us to conduct a meaningful comparison across different algorithms, we recognise that the non-standard runs open up more opportunities for exploiting larger linguistic corpora. It is also noteworthy that several systems report improved performance over any previously reported results on similar corpora. The NEWS 2009 Shared Task represents a successful debut of a community effort in driving machine transliteration techniques forward. The overwhelming response to this first shared task also warrants the continuation of such an effort in future ACL or IJCNLP events.

Acknowledgements
The organisers of the NEWS 2009 Shared Task would like to thank the Institute for Infocomm Research (Singapore), Microsoft Research India and the CJK Institute (Japan) for providing the corpora and technical support; without them, the Shared Task would not have been possible. We thank the participants who identified errors in the data and sent us errata. We also want to thank Monojit Choudhury for his contribution to the metrics defined for the shared task, and the members of the programme committee for their invaluable comments, which improved the quality of the shared task papers. Finally, we wish to thank all the participants for their active participation, which has made this first machine transliteration shared task a comprehensive one.