Report of NEWS 2012 Machine Transliteration Shared Task

This report documents the Machine Transliteration Shared Task conducted as a part of the Named Entities Workshop (NEWS 2012), an ACL 2012 workshop. The shared task features machine transliteration of proper names from English to 11 languages and from 3 languages to English, for a total of 14 tasks. 7 teams participated in the evaluations, submitting 57 standard and 1 non-standard runs in which diverse transliteration methodologies were explored and reported on the evaluation data. We report the results with 4 performance metrics. We believe that the shared task has successfully achieved its objective of providing a common benchmarking platform on which the research community can evaluate state-of-the-art technologies, to the benefit of future research and development.


Introduction
Names play a significant role in many Natural Language Processing (NLP) and Information Retrieval (IR) systems. They are important in Cross Lingual Information Retrieval (CLIR) and Machine Translation (MT), as several studies have shown that system performance correlates positively with the correct conversion of names between languages (Demner-Fushman and Oard, 2002; Mandl and Womser-Hacker, 2005; Hermjakob et al., 2008; Udupa et al., 2009). The traditional sources of name equivalence, bilingual dictionaries (whether handcrafted or statistical), offer only limited support because new names always emerge.
All of the above point to the critical need for robust machine transliteration technology and systems. Much research effort has been made to address the transliteration issue in the research community (Knight and Graehl, 1998; Meng et al., 2001; Li et al., 2004; Zelenko and Aone, 2006; Sproat et al., 2006; Sherif and Kondrak, 2007; Hermjakob et al., 2008; Al-Onaizan and Knight, 2002; Goldwasser and Roth, 2008; Goldberg and Elhadad, 2008; Klementiev and Roth, 2006; Oh and Choi, 2002; Virga and Khudanpur, 2003; Wan and Verspoor, 1998; Kang and Choi, 2000; Gao et al., 2004; Li et al., 2009b; Li et al., 2009a). These previous works fall into three categories: grapheme-based, phoneme-based and hybrid methods. Grapheme-based methods (Li et al., 2004) treat transliteration as a direct orthographic mapping and use only orthography-related features, while phoneme-based methods (Knight and Graehl, 1998) make use of phonetic correspondence to generate the transliteration. Hybrid methods combine several different models or knowledge sources to support transliteration generation.
The first machine transliteration shared task (Li et al., 2009b; Li et al., 2009a) was held in NEWS 2009 at ACL-IJCNLP 2009. It was the first to provide common benchmarking data in diverse language pairs for the evaluation of state-of-the-art techniques. While the focus of the 2009 shared task was on establishing the quality metrics and on baselining transliteration quality under those metrics, the 2010 shared task (Li et al., 2010a; Li et al., 2010b) expanded the scope of the transliteration generation task to about a dozen languages and explored how quality depends on the direction of transliteration between the languages. In NEWS 2011 (Zhang et al., 2011a; Zhang et al., 2011b), we significantly increased the hand-crafted parallel named-entity corpora to include 14 different language pairs from 11 language families, and made them available as the common dataset for the shared task. NEWS 2012 is a continuation of NEWS 2011, NEWS 2010 and NEWS 2009. The rest of the report is organised as follows. Section 2 outlines the machine transliteration task and the corpora used, and Section 3 discusses the metrics chosen for evaluation, along with the rationale for choosing them. Sections 4 and 5 present the participation in the shared task and the results with their analysis, respectively. Section 6 concludes the report.

Transliteration Shared Task
In this section, we outline the definition and the description of the shared task.

"Transliteration": A definition
Several terms are used interchangeably in the contemporary research literature for the conversion of names between two languages, such as transliteration, transcription, and sometimes Romanisation, especially if Latin scripts are used for the target strings (Halpern, 2007).
Our aim is not only to capture the name conversion process from a source to a target language, but also to address its practical utility for downstream applications, such as CLIR and MT. Therefore, we adopted the same definition of transliteration as in the NEWS 2009 workshop (Li et al., 2009a), narrowing "transliteration" down to three specific requirements for the task, as follows: "Transliteration is the conversion of a given name in the source language (a text string in the source writing system or orthography) to a name in the target language (another text string in the target writing system or orthography), such that the target language name is: (i) phonemically equivalent to the source name; (ii) conforms to the phonology of the target language; and (iii) matches the user intuition of the equivalent of the source language name in the target language, considering the culture and orthographic character usage in the target language." Following NEWS 2011, NEWS 2012 retains the three back-transliteration tasks. We define back-transliteration as the process of restoring transliterated words to their original languages. For example, NEWS 2012 offers tasks to convert western names written in Chinese and Thai into their original English spellings, and romanized Japanese names into their original Kanji writings.

Shared Task Description
Following the tradition of the NEWS workshop series, the shared task at NEWS 2012 is specified as the development of machine transliteration systems in one or more of the specified language pairs. Each language pair of the shared task consists of a source and a target language, implicitly specifying the transliteration direction. Training and development data in each of the language pairs have been made available to all registered participants for developing a transliteration system for that specific language pair using any approach that they find appropriate.
At evaluation time, a standard hand-crafted test set consisting of between 500 and 3,000 source names (approximately 5-10% of the training data size) has been released, on which the participants are required to produce a ranked list of transliteration candidates in the target language for each source name. The system output is tested against a reference set (which may include multiple correct transliterations for some source names), and the performance of a system is captured in multiple metrics (defined in Section 3), each designed to capture a specific performance dimension.
For every language pair, each participant is required to submit at least one run (designated as a "standard" run) that uses only the data provided by the NEWS workshop organisers for that language pair, and no other data or linguistic resources. This standard run ensures parity between systems and enables meaningful comparison of the performance of various algorithmic approaches in a given language pair. Participants are allowed to submit additional "standard" runs, up to 4 in total. If more than one "standard" run is submitted, one of them must be named the "primary" run, which is used to compare results across different systems. In addition, up to 4 "non-standard" runs may be submitted for every language pair, using data beyond that provided by the shared task organisers, linguistic resources in a specific language, or both. This enables participants to demonstrate the limits of their system's performance in a given language pair.
The shared task timelines provide adequate time for development and testing (more than 1 month after the release of the training data) and the final result submission (4 days after the release of the test data).

Shared Task Corpora
We considered two specific constraints in selecting languages for the shared task: language diversity and data availability. To make the shared task interesting and to attract wider participation, it is important to ensure a reasonable variety among the languages in terms of linguistic diversity, orthography and geography. Clearly, the ability to procure and distribute reasonably large (approximately 10K paired names for training and testing together) hand-crafted corpora consisting primarily of paired names is critical for this process. At the end of the planning stage and after discussion with the data providers, we chose the set of 14 tasks shown in Table 1 (Li et al., 2004; Kumaran and Kellner, 2007; MSRI, 2009; CJKI, 2010). The names given in the training sets for the Chinese, Japanese, Korean, Thai, Persian and Hebrew languages are Western names and their respective transliterations; the Japanese Name (in English) → Japanese Kanji data set consists only of native Japanese names; the Arabic data set consists only of native Arabic names. The Indic data sets (Hindi, Tamil, Kannada, Bangla) consist of a mix of Indian and Western names.
For all of the tasks chosen, we were able to procure paired-names data between the source and the target scripts and to make them available to the participants. For some language pairs, such as English-Chinese and English-Thai, there are both transliteration and back-transliteration tasks. Most of the tasks are one-way transliteration, although the Indian data sets contain a mixture of names of both Indian and Western origins. The language of origin of the names for each task is indicated in the first column of Table 1.
Finally, it should be noted that the corpora procured and released for NEWS 2012 represent perhaps the largest and most diverse corpora used for any common transliteration task to date.

Evaluation Metrics and Rationale
The participants have been asked to submit results of up to four standard and four non-standard runs. One standard run must be named the primary submission and is used for the performance summary. Each run contains a ranked list of up to 10 candidate transliterations for each source name. The submitted results are compared to the ground truth (reference transliterations) using 4 evaluation metrics that capture different aspects of transliteration performance. As in NEWS 2011, we have dropped the two MAP metrics used in NEWS 2009 because they offer no additional information beyond MAP_ref. Since a name may have multiple correct transliterations, all these alternatives are treated equally in the evaluation: any of the alternatives is considered a correct transliteration, and all candidates matching any of the reference transliterations are accepted as correct.
The following notation is further assumed:

• N: total number of names (source words) in the test set
• n_i: number of reference transliterations for the i-th name in the test set (n_i ≥ 1)
• r_{i,j}: j-th reference transliteration for the i-th name in the test set
• c_{i,k}: k-th candidate transliteration (system output) for the i-th name in the test set (1 ≤ k ≤ 10)
• K_i: number of candidate transliterations produced by a transliteration system
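To make the notation concrete, the following minimal Python sketch (ours, not part of the official evaluation tooling) shows one way to represent a test set: for each name, a list of references and a ranked candidate list. The metric sketches in the following subsections assume this toy layout.

```python
# Hypothetical data layout for the notation above (not the official format):
# for each of the N source names, a list of n_i reference transliterations
# and a ranked list of up to 10 candidates (K_i entries).
references = [            # references[i][j] corresponds to r_{i,j}
    ["smith", "smyth"],   # n_1 = 2
    ["jones"],            # n_2 = 1
]
candidates = [            # candidates[i][k] corresponds to c_{i,k}
    ["smith", "smit"],    # K_1 = 2
    ["johns", "jones"],   # K_2 = 2
]
N = len(references)       # N = 2
```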

Word Accuracy in Top-1 (ACC)
The complement of Word Error Rate, ACC measures the correctness of the first transliteration candidate in the candidate list produced by a transliteration system:

ACC = (1/N) Σ_{i=1}^{N} 1{∃ j : c_{i,1} = r_{i,j}}

ACC = 1 means that all top candidates are correct transliterations, i.e. each matches one of its references, and ACC = 0 means that none of the top candidates is correct.
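As an illustration, here is a minimal Python sketch of ACC under this definition, using the toy data layout above (our own example, not the official evaluation script):

```python
def acc(candidates, references):
    """Top-1 word accuracy: fraction of names whose first candidate
    matches any of its reference transliterations."""
    hits = sum(1 for cands, refs in zip(candidates, references)
               if cands and cands[0] in refs)
    return hits / len(references)

references = [["smith", "smyth"], ["jones"]]
candidates = [["smith", "smit"], ["johns", "jones"]]
print(acc(candidates, references))  # 0.5: only the first name's top-1 is correct
```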

Fuzziness in Top-1 (Mean F-score)
The mean F-score measures how different, on average, the top transliteration candidate is from its closest reference. The F-score for each source word is a function of Precision and Recall: it equals 1 when the top candidate matches one of the references, and 0 when there are no common characters between the candidate and any of the references. Precision and Recall are calculated from the length of the Longest Common Subsequence (LCS) between a candidate and a reference:

LCS(c, r) = (|c| + |r| - ED(c, r)) / 2

where ED is the edit distance and |x| is the length of x. For example, the longest common subsequence between "abcd" and "afcde" is "acd", of length 3. The best matching reference, that is, the reference with the minimum edit distance to the candidate, is used in the calculation. If the best matching reference is

r_{i,m} = argmin_j ED(c_{i,1}, r_{i,j})

then Recall, Precision and F-score for the i-th word are calculated as

R_i = LCS(c_{i,1}, r_{i,m}) / |r_{i,m}|
P_i = LCS(c_{i,1}, r_{i,m}) / |c_{i,1}|
F_i = 2 R_i P_i / (R_i + P_i)

• The length is computed in distinct Unicode characters.
• No distinction is made between different character types of a language (e.g., vowels vs. consonants vs. combining diereses).
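A sketch of the top-1 F-score computation follows. The LCS length is computed directly by dynamic programming (equivalent to (|c| + |r| - ED(c, r))/2 when ED allows insertions and deletions only); for simplicity, this sketch selects the reference that maximizes the F-score rather than the minimum-edit-distance reference of the official definition, which can differ in edge cases.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of strings a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ca == cb
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]

def f_top1(candidate, references):
    """F-score of the top candidate against its best-matching reference."""
    best = 0.0
    for ref in references:
        lcs = lcs_len(candidate, ref)
        if lcs == 0:
            continue
        p, r = lcs / len(candidate), lcs / len(ref)
        best = max(best, 2 * p * r / (p + r))
    return best

print(f_top1("abcd", ["afcde"]))  # LCS "acd": P = 3/4, R = 3/5, F ≈ 0.667
```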

Mean Reciprocal Rank (MRR)
Measures the traditional Mean Reciprocal Rank of the first correct answer produced by the system among its candidates. If the first correct transliteration for the i-th name appears at rank j in the candidate list, its reciprocal rank is RR_i = 1/j; if no candidate is correct, RR_i = 0. Then

MRR = (1/N) Σ_{i=1}^{N} RR_i

1/MRR approximates the average rank of the correct transliteration; an MRR close to 1 implies that the correct answer is usually produced near the top of the n-best lists.
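A corresponding sketch of MRR on the same toy data (our illustration; candidate lists are assumed free of duplicates):

```python
def mrr(candidates, references):
    """Mean reciprocal rank of the first correct candidate per name."""
    total = 0.0
    for cands, refs in zip(candidates, references):
        for rank, cand in enumerate(cands, start=1):
            if cand in refs:
                total += 1.0 / rank
                break  # only the first correct candidate counts
    return total / len(references)

references = [["smith", "smyth"], ["jones"]]
candidates = [["smith", "smit"], ["johns", "jones"]]
print(mrr(candidates, references))  # (1/1 + 1/2) / 2 = 0.75
```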

MAP_ref
Measures the precision in the n-best candidate list for the i-th source name, for which n_i reference transliterations are available. If all of the references are produced at the top of the list, then the MAP is 1. Let num(i, k) denote the number of correct candidates in the k-best list for the i-th source word. MAP_ref is then given by

MAP_ref = (1/N) Σ_{i=1}^{N} (1/n_i) Σ_{k=1}^{n_i} num(i, k) / k
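A sketch of MAP_ref under the reading above, where num(i, k)/k is the precision of the k-best list and the cutoffs run up to n_i (our interpretation of the definition; duplicate candidates are assumed absent):

```python
def map_ref(candidates, references):
    """Mean average precision over the first n_i cutoffs per name."""
    total = 0.0
    for cands, refs in zip(candidates, references):
        n_i = len(refs)
        avg_p = sum(sum(1 for c in cands[:k] if c in refs) / k
                    for k in range(1, n_i + 1)) / n_i
        total += avg_p
    return total / len(references)

references = [["smith", "smyth"], ["jones"]]
candidates = [["smith", "smit"], ["johns", "jones"]]
print(map_ref(candidates, references))  # (0.75 + 0.0) / 2 = 0.375
```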

Participation in Shared Task

7 teams submitted their transliteration results; the participants included MIT@Lab of HIT, IASL Academia Sinica, and Yahoo Japan Corporation, among others. Table 3 shows the tasks for which each team registered. Teams were required to submit at least one standard run for every task they participated in. In total, we received 57 standard and 1 non-standard runs. Table 2 shows the number of standard and non-standard runs submitted for each task. The most "popular" task is clearly transliteration from English to Chinese, attempted by 7 participants.

Results

Standard runs
All the results are presented numerically in Tables 4-17, for all evaluation metrics. These are the official evaluation results published for this edition of the transliteration shared task. The methodologies used in the seven submitted system papers are summarized as follows.

Similar to their NEWS 2011 system, Finch et al. (2012) employ a non-parametric Bayesian method to co-segment bilingual named entities for model training, and report very good performance. Their system is based on phrase-based statistical machine transliteration (Finch and Sumita, 2008), an approach initially developed for machine translation (Koehn et al., 2003), in which the SMT system's log-linear model is augmented with a set of features specifically suited to the task of transliteration. In particular, the model utilizes a feature based on a joint source-channel model and a feature based on a maximum entropy model that predicts target grapheme sequences using the local context of graphemes and grapheme sequences in both source and target languages. Unlike their NEWS 2011 system, in order to mitigate data sparseness, they use two RNN-based LMs to project the grapheme set onto a smaller hidden representation: one for the target grapheme sequence and the other for the sequence of grapheme-sequence pairs used to generate the target.

Zhang et al. (2012) also use the statistical phrase-based SMT framework. They propose a fine-grained English segmentation algorithm together with other new features, and achieve very good performance.

Wu et al. (2012) use the m2m-aligner and the DirecTL-p decoder with two re-ranking methods: one based on co-occurrence in a web corpus, and a JLIS-reranking method based on features from the alignment results. They report very good performance on the English-Korean tasks.

Okuno (2012) studies mpaligner (an improvement of m2m-aligner) and shows that mpaligner is more effective than m2m-aligner. They also find that de-romanization is crucial for the JnJk task and that the mora is the best alignment unit for the EnJa task.

Ammar et al. (2012) use a CRF as the basic model, with two innovations: a training objective that optimizes toward any of a set of possible correct labels (i.e., multiple references), and k-best reranking with non-local features. Their results on ArEn show that the two techniques are very effective in improving accuracy.

Kondrak et al. (2012) study language-specific adaptations in the context of two language pairs: English to Chinese (Pinyin representation) and Arabic to English (letter mapping). They conclude that the Pinyin representation is useful, while letter mapping is less effective.

Kuo et al. (2012) explore a two-stage CRF for the English-to-Chinese task and show that the two-stage CRF outperforms the traditional one-stage CRF.

Non-standard runs
For the non-standard runs, we impose no restrictions on the use of data or other linguistic resources. The purpose of non-standard runs is to see how good personal name transliteration can be for a given language pair. In NEWS 2012, only one non-standard run (Wu et al., 2012) was submitted; their reported web-based re-validation method is very effective.

Conclusions and Future Plans
The Machine Transliteration Shared Task of NEWS 2012 shows that the community has a continued interest in this area. This report summarizes the results of the shared task. Again, we are pleased to report a comprehensive calibration and baselining of machine transliteration approaches, as most state-of-the-art machine transliteration techniques are represented in the shared task.
In addition to the most popular techniques of NEWS 2011, such as Phrase-Based Machine Transliteration (Koehn et al., 2003), CRF, re-ranking, the DirecTL-p decoder, Non-Parametric Bayesian Co-segmentation (Finch et al., 2011), and the Multi-to-Multi Joint Source Channel Model (Chen et al., 2011), we are delighted to see several new techniques proposed and explored, with promising results reported: RNN-based LMs (Finch et al., 2012), an English segmentation algorithm (Zhang et al., 2012), the JLIS-reranking method (Wu et al., 2012), an improved m2m-aligner (Okuno, 2012), a multiple-reference-optimized CRF (Ammar et al., 2012), language-dependent adaptations (Kondrak et al., 2012) and a two-stage CRF (Kuo et al., 2012). As the standard runs are limited to the provided corpora, most of the systems are implemented under the direct orthographic mapping (DOM) framework (Li et al., 2004). While the standard runs allow us to conduct meaningful comparisons across different algorithms, we recognise that the non-standard runs open up more opportunities for exploiting a variety of additional linguistic corpora.
Encouraged by the success of the NEWS workshop series, we would like to continue this event at future conferences to promote machine transliteration research and development.