Report of NEWS 2015 Machine Transliteration Shared Task

This report presents the results of the Machine Transliteration Shared Task conducted as part of the Fifth Named Entities Workshop (NEWS 2015) held at ACL 2015 in Beijing, China. As in previous editions of the NEWS Workshop, the Shared Task featured machine transliteration of proper names over 14 different language pairs, covering 12 different languages and two different Japanese scripts. A total of 6 teams participated in the evaluation, submitting 194 standard and 12 non-standard runs that involved a wide variety of transliteration methodologies. Four performance metrics were used to report the evaluation results. Once again, the NEWS shared task on machine transliteration has successfully achieved its objectives by providing a common ground for the research community to conduct comparative evaluations of state-of-the-art technologies, which will benefit future research and development in this area.


Introduction
Names play an important role in the performance of most Natural Language Processing (NLP) and Information Retrieval (IR) applications. They are also critical in cross-lingual applications such as Machine Translation (MT) and Cross-Language Information Retrieval (CLIR), as it has been shown that system performance correlates positively with the quality of name conversion across languages (Demner-Fushman and Oard 2002, Mandl and Womser-Hacker 2005, Hermjakob et al. 2008, Udupa et al. 2009). Bilingual dictionaries constitute the traditional source of information for name conversion across languages; however, they offer very limited support because, in most languages, names are continuously emerging and evolving.
All of the above points to the critical need for robust Machine Transliteration methods and systems. During the last decade, significant effort has been devoted by the research community to the problem of machine transliteration (Knight and Graehl 1998, Meng et al. 2001, Li et al. 2004, Zelenko and Aone 2006, Sproat et al. 2006, Sherif and Kondrak 2007, Hermjakob et al. 2008, Al-Onaizan and Knight 2002, Goldwasser and Roth 2008, Goldberg and Elhadad 2008, Klementiev and Roth 2006, Oh and Choi 2002, Virga and Khudanpur 2003, Wan and Verspoor 1998, Kang and Choi 2000, Gao et al. 2004, Li et al. 2009a, Li et al. 2009b). These previous works fall into three main categories: grapheme-based, phoneme-based and hybrid methods. Grapheme-based methods (Li et al. 2004) treat transliteration as a direct orthographic mapping and use only orthography-related features, while phoneme-based methods (Knight and Graehl 1998) make use of phonetic correspondences to generate the transliteration. The hybrid approach refers to the combination of several different models or knowledge sources to support the transliteration generation process.
The first machine transliteration shared task (Li et al. 2009b, Li et al. 2009a) was organized and conducted as part of NEWS 2009 at ACL-IJCNLP 2009. It was the first time that common benchmarking data in diverse language pairs was provided for evaluating state-of-the-art machine transliteration. While the focus of the 2009 shared task was on establishing the quality metrics and on setting up a baseline for transliteration quality based on those metrics, the 2010 shared task (Li et al. 2010a, Li et al. 2010b) focused on expanding the scope of the transliteration generation task to about a dozen languages and on exploring how quality depends on the direction of transliteration. In NEWS 2011 (Zhang et al. 2011a, Zhang et al. 2011b), the focus was on significantly increasing the hand-crafted parallel corpora of named entities to include 14 different language pairs from 11 language families, and on making them available as the common dataset for the shared task. The NEWS 2015 Shared Task on Transliteration continues this effort of evaluating machine transliteration performance over such a common dataset, following the NEWS 2011 and NEWS 2012 (Zhang et al. 2012) shared tasks.
In this paper, we present in full detail the results of the NEWS 2015 Machine Transliteration Shared Task. The rest of the paper is structured as follows. Section 2 provides a short review of the main characteristics of the machine transliteration task and the corpora used for it. Section 3 reviews the four metrics used for the evaluations. Section 4 reports specific details about participation in the 2015 edition of the shared task, and Section 5 presents and discusses the evaluation results. Finally, Section 6 presents our main conclusions and future plans.

Shared Task on Transliteration
Transliteration, sometimes also called Romanization, especially if Latin scripts are used for target strings (Halpern 2007), deals with the conversion of names between two languages and/or script systems. Within the context of the Transliteration Shared Task, we aim not only at addressing the name conversion process but also at its practical utility for downstream applications, such as MT and CLIR.
In this sense, we adopt the same definition of transliteration as proposed during the NEWS 2009 workshop (Li et al. 2009a). According to it, transliteration is understood as the "conversion of a given name in the source language (a text string in the source writing system or orthography) to a name in the target language (another text string in the target writing system or orthography)", conditioned on the following specific requirements regarding the name representation in the target language:
• it is phonetically equivalent to the source name,
• it conforms to the phonology of the target language, and
• it matches the user intuition on its equivalence with respect to the source language name.
Following NEWS 2011 and NEWS 2012, the three back-transliteration tasks are maintained. Back-transliteration attempts to restore transliterated names back into their original source language. For instance, the tasks of converting western names written in Chinese and Thai back into their original English spellings are considered. Similarly, a task for back-transliterating Romanized Japanese names into their original Kanji strings is also included.

Shared Task Description
Following the tradition of the NEWS workshop series, the shared task in NEWS 2015 consists of developing machine transliteration systems in one or more of the specified language pairs. Each language pair of the shared task consists of a source and a target language, implicitly specifying the transliteration direction. Training and development data in each of the language pairs was made available to all registered participants for developing their transliteration systems.
At evaluation time, a standard hand-crafted test set consisting of between 500 and 3,000 source names (approximately 5-10% of the training data size) was released, on which the participants were required to produce a ranked list of transliteration candidates in the target language for each source name. The system output was tested against a reference set (which may include multiple correct transliterations for some source names), and the performance of a system is captured in multiple metrics (defined in Section 3), each designed to capture a specific performance dimension.
For every language pair, each participant was required to submit at least one run (designated as a "standard" run) that uses only the data provided by the NEWS workshop organizers for that language pair; i.e. no other data or linguistic resources are allowed for standard runs. This ensures parity between systems and enables meaningful comparison of the performance of various algorithmic approaches in a given language pair. Participants were allowed to submit one or more standard runs for each task they participated in. If more than one standard run was submitted, participants were required to name one of them as the "primary" run, which was the one used to compare results across different systems.
In addition, one or more "non-standard" runs could be submitted for every language pair using data beyond that provided by the shared task organizers, any other available linguistic resources in a specific language pair, or both. This essentially enabled participants to demonstrate the limits of the performance of their systems in a given language pair.

Shared Task Corpora
Two specific constraints were considered when selecting languages for the shared task: language diversity and data availability. To make the shared task interesting and to attract wider participation, it is important to ensure a reasonable variety among the languages in terms of linguistic diversity, orthography and geography. Equally, the ability to procure and distribute reasonably large (approximately 10K paired names for training and testing together) hand-crafted corpora consisting primarily of paired names is critical for this process. Following NEWS 2011, the 14 tasks shown in Tables 1.a-e were used (Li et al. 2004, Kumaran and Kellner 2007, MSRI 2009, CJKI 2010). Additionally, the test sets from NEWS 2012 (each of size 1K) were also used for evaluation purposes in this shared task.
The names given in the training sets for the Chinese, Japanese, Korean, Thai, Persian and Hebrew languages are Western names and their respective transliterations; the Japanese Name (in English) → Japanese Kanji data set consists only of native Japanese names; the Arabic data set consists only of native Arabic names. The Indic data sets (Hindi, Tamil, Kannada, Bangla) consist of a mix of Indian and Western names.
For all of the tasks chosen, we were able to procure paired-name data between the source and the target scripts and to make it available to the participants. For some language pairs, such as English-Chinese and English-Thai, there are both transliteration and back-transliteration tasks. Most of the tasks are one-way transliteration only, although the Indian data sets contain a mixture of names of both Indian and Western origin.

Evaluation Metrics and Rationale
The participants were asked to submit standard and, optionally, non-standard runs. One of the standard runs must be named as the primary submission, which is the one used for the performance summary. Each run must contain a ranked list of up to ten candidate transliterations for each source name. The submitted results are compared to the ground truth (reference transliterations) using four evaluation metrics capturing different aspects of transliteration performance:
• Word Accuracy in Top-1 (ACC),
• Fuzziness in Top-1 (Mean F-score),
• Mean Reciprocal Rank (MRR), and
• Mean Average Precision (MAP_ref).
In the next subsections, we present a brief description of the four considered evaluation metrics. The following notation is further assumed:
• N: total number of names (source words) in the test set,
• n_i: number of reference transliterations for the i-th name in the test set (n_i ≥ 1),
• r_{i,j}: j-th reference transliteration for the i-th name in the test set,
• c_{i,j}: j-th candidate transliteration for the i-th name, as produced by a transliteration system.

Word Accuracy in Top-1 (ACC)
This metric, the complement of Word Error Rate, measures the correctness of the first transliteration candidate in the candidate list produced by a transliteration system. ACC = 1 means that all top candidates are correct transliterations, i.e. each matches one of its references, and ACC = 0 means that none of the top candidates is correct:

ACC = (1/N) Σ_{i=1}^{N} 1{∃ j : c_{i,1} = r_{i,j}}   (Eq. 1)
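As an illustration, the following is a minimal Python sketch of this metric. The data layout is hypothetical: `candidates[i]` is the ranked candidate list for the i-th source name, and `references[i]` is its set of reference transliterations.

```python
def acc(candidates, references):
    """Word Accuracy in Top-1: fraction of names whose top candidate
    matches at least one of its reference transliterations."""
    correct = sum(1 for cands, refs in zip(candidates, references)
                  if cands and cands[0] in refs)
    return correct / len(references)
```

For example, `acc([["beijing"], ["tokio"]], [{"beijing"}, {"tokyo"}])` returns 0.5, since only the first name's top candidate is correct.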

Fuzziness in Top-1 (Mean F-score)
The Mean F-score measures how different, on average, the top transliteration candidate is from its closest reference.F-score for each source word is a function of Precision and Recall and equals 1 when the top candidate matches one of the references, and 0 when there are no common characters between the candidate and any of the references.
Precision and Recall are calculated based on the length of the Longest Common Subsequence (LCS) between a candidate and a reference, which can be obtained from the edit distance as:

LCS(c, r) = (|c| + |r| - ED(c, r)) / 2   (Eq. 2)

where ED is the edit distance and |x| is the length of x. For example, the longest common subsequence between "abcd" and "afcde" is "acd", and its length is 3. The best matching reference, i.e. the reference for which the edit distance is minimal, is taken for the calculation:

r_{i,m} = arg min_j ED(c_{i,1}, r_{i,j})   (Eq. 3)

Given the best matching reference r_{i,m}, Recall, Precision and F-score for the i-th word are calculated as:

R_i = LCS(c_{i,1}, r_{i,m}) / |r_{i,m}|   (Eq. 4)
P_i = LCS(c_{i,1}, r_{i,m}) / |c_{i,1}|   (Eq. 5)
F_i = 2 R_i P_i / (R_i + P_i)   (Eq. 6)

and the Mean F-score is the average of F_i over the test set:

Mean F-score = (1/N) Σ_{i=1}^{N} F_i   (Eq. 7)

The lengths are computed in terms of distinct Unicode characters, and no distinctions are made between different character types of a language (e.g. vowel vs. consonant vs. combining diaereses, etc.).
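The computation can be sketched in Python as follows. Here the LCS length is computed directly by dynamic programming (equivalently, via the relation LCS(c, r) = (|c| + |r| - ED(c, r)) / 2, provided ED counts a substitution as a deletion plus an insertion), and, as a simplification of the official procedure, the reference is selected by maximum F-score rather than by minimum edit distance:

```python
def lcs_len(c, r):
    """Length of the longest common subsequence, by the classic
    O(|c| * |r|) dynamic program."""
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, cc in enumerate(c, 1):
        for j, rc in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if cc == rc else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(c)][len(r)]

def f_score_top1(candidate, references):
    """F-score of the top candidate against its best-scoring reference.
    Simplification: best reference chosen by max F, not min edit distance."""
    best = 0.0
    for ref in references:
        l = lcs_len(candidate, ref)
        if l == 0:
            continue  # no common characters with this reference
        p, r = l / len(candidate), l / len(ref)
        best = max(best, 2 * p * r / (p + r))
    return best
```

On the example above, `lcs_len("abcd", "afcde")` is 3, giving P = 3/4, R = 3/5 and F = 2/3.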

Mean Reciprocal Rank (MRR)
This metric is the traditional MRR, computed over the first correct answer produced by the system among the candidates. If the first correct candidate for the i-th name appears at rank j, its reciprocal rank is RR_i = 1/j; if no candidate is correct, RR_i = 0. Then:

MRR = (1/N) Σ_{i=1}^{N} RR_i   (Eq. 8)

1/MRR gives approximately the average rank of the correct transliteration; an MRR close to 1 implies that the correct answer is mostly produced close to the top of the n-best lists.
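A minimal Python sketch of MRR, under the same hypothetical data layout as before (`candidates[i]` is the ranked n-best list for the i-th name, `references[i]` its set of references):

```python
def mrr(candidates, references):
    """Mean Reciprocal Rank: each name contributes 1/rank of its first
    correct candidate, or 0 if none of its candidates is correct."""
    total = 0.0
    for cands, refs in zip(candidates, references):
        for rank, cand in enumerate(cands, start=1):
            if cand in refs:
                total += 1.0 / rank
                break
    return total / len(references)
```

For example, with one name answered correctly at rank 2 and another answered incorrectly, the MRR is (1/2 + 0) / 2 = 0.25.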

Mean Average Precision (MAP ref )
This metric measures the precision in the n-best candidate lists for the i-th source name, for which multiple reference transliterations may be available. If all of the references are produced at the top of the candidate list, then the MAP is 1.
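A minimal Python sketch of this metric under the same hypothetical data layout, assuming num(i, k) denotes the number of correct candidates among the top k, and the average is truncated at n_i, the number of references for the i-th name:

```python
def map_ref(candidates, references):
    """MAP_ref: mean over names of the average of num(i, k) / k for
    k = 1 .. n_i, i.e. precision averaged over the first n_i ranks."""
    total = 0.0
    for cands, refs in zip(candidates, references):
        n_i = len(refs)
        avg_prec = sum(
            sum(1 for c in cands[:k] if c in refs) / k  # num(i, k) / k
            for k in range(1, n_i + 1)
        ) / n_i
        total += avg_prec
    return total / len(references)
```

For example, with two references {"a", "b"} and the candidate list ["a", "x", "b"], the per-name score is (1/1 + 1/2) / 2 = 0.75; producing both references in the top two positions yields 1.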
If we denote the number of correct candidates in the top-k list for the i-th source word as num(i,k), then MAP_ref is given by:

MAP_ref = (1/N) Σ_{i=1}^{N} (1/n_i) Σ_{k=1}^{n_i} num(i,k)/k   (Eq. 9)

Participation in the Shared Task

A total of six teams from six different institutions participated in the NEWS 2015 Shared Task. More specifically, the participating teams were from the University of Alberta (UALB), Uppsala University (UPPS), Beijing Jiaotong University (BJTU), the National Institute of Information and Communications Technology (NICT), the Indian Institute of Technology Bombay (IITB) and National Taiwan University (NTU). Teams were required to submit at least one standard run for every task they participated in, and for both the NEWS 2011 and NEWS 2012 test sets. The former was used as a progress evaluation set and the latter as the official NEWS 2015 evaluation set. In total, we received 97 standard and 6 non-standard runs for each test set; i.e. 194 standard and 12 non-standard runs in total. Table 2 summarizes the number of standard runs, non-standard runs and teams participating per task. As seen from the table, the most popular task continues to be transliteration from English to Chinese (Zhang et al. 2012), followed by Chinese to English, English to Hindi, and English to Tamil. Non-standard runs were submitted for only 6 of the 14 tasks.

Task Results and Analysis
Figure 1 summarizes the results of the NEWS 2015 Shared Task. In the figure, only Mean F-scores over the NEWS 2012 evaluation test set (referred to as NEWS12/15) for all primary standard submissions are depicted. A total of 41 primary standard submissions were received.
As seen from the figure, with the exception of English to Japanese Katakana, only transliteration tasks involving Arabic, Persian and the four considered Indian languages consistently score above 80%. For the rest of the languages, with the exception of Hebrew, scores are consistently in the range from 60% to 80%. Notice also that, regardless of the availability of training data, the English to Chinese transliteration task seems to be the most demanding one for state-of-the-art systems with respect to the considered metric.
Another interesting observation that can be derived from the figure, when looking at the language pairs English-Chinese and English-Thai, is that systems tend to perform slightly better on the back-transliteration tasks. A much more comprehensive presentation of the results of the NEWS 2015 Shared Task is provided in the Appendix at the end of this paper. There, the resulting scores are reported for all received submissions, both standard and non-standard, over both the progress test (NEWS11) and the evaluation test (NEWS12/15), and for the four considered evaluation metrics. All results are presented in 28 tables, each of which reports the scores for one transliteration task over one test set. In the tables, all primary standard runs are highlighted in bold-italic fonts.
Regarding the systems participating in this year's evaluation, the UALB system (Nicolai et al. 2015) was based on multiple system combination. They presented experimental results involving three different well-known transliteration approaches: DirecTL+ (Jiampojamarn et al. 2009), Sequitur (Bisani and Ney 2008) and SMT (Koehn et al. 2007). They showed error reductions of up to 20% over a baseline system by using system combination.
The UPPS system (Shao et al. 2015) implemented a phrase-based transliteration approach, enhanced with refined alignments produced by the M2M-aligner (Jiampojamarn et al. 2007). They also implemented a ranking mechanism based on linear regression, showing a significant improvement on both the EnCh and ChEn transliteration tasks.
The BJTU system (Wang et al. 2015a) implemented an SMT (Koehn et al. 2007) log-linear model combination for transliteration, including standard SMT features such as language model scores and forward and reverse phrase translation probabilities, as well as transliteration-specific features such as name lengths and length penalties.
The NICT system (Finch et al. 2015) builds upon their previous SMT-based system used for NEWS 2012 (Finch et al. 2012). In this shared task, the rescoring step of the previous system is augmented with a neural network based transliteration model (Bahdanau et al. 2014). They showed significant improvements in 8 of the 14 transliteration tasks with respect to their 2012 system.
The IITB system (Kunchukuttan and Bhattacharyya 2015) also followed the SMT approach to transliteration. In this case, they include two specific preprocessing enhancements: the addition of word-boundary markers, and a language-independent overlapping character segmentation. They observed that word-boundary markers substantially improved transliteration accuracy, while overlapping segmentation showed some potential.
The NTU system (Wang et al. 2015b) is based on DirecTL+ with alignments generated by the M2M-aligner (Jiampojamarn et al. 2010). In preprocessing, they experimented with different grapheme segmentation methods for English, Chinese and Korean, while in post-processing they evaluated two re-ranking approaches: orthographic similarity ranking and web-based ranking.
As seen from the previous system descriptions, phrase-based SMT approaches still predominate in the state of the art for machine transliteration. Significant improvements are achieved by incorporating novel approaches in the pre-processing and post-processing stages, as well as by system combination. Regarding pre-processing, the main focus was on segmentation, while in post-processing, using neural networks for rescoring provided the most significant gains.
Finally, Figure 2 compares, in terms of Mean F-scores, the best primary standard submissions in NEWS 2012 with those in NEWS 2015. As seen from the figure, in most of the considered transliteration tasks, incremental improvements can be observed between the 2012 and 2015 shared tasks. The most significant improvements are in the tasks involving Japanese Katakana, Tamil, Bangla (Bengali) and Thai.
Regarding the observed drops in performance, only the one for the English to Korean Hangul task is significant. It is mainly explained by the fact that the best performing system for this task in 2012 did not participate in the 2015 shared task.

Conclusions
The Shared Task on Machine Transliteration at NEWS 2015 has shown, once again, the research community's continued interest in this area. This report summarizes the results of the NEWS 2015 Shared Task. We are pleased to report a comprehensive set of machine transliteration approaches and their evaluation results over two test sets, the progress test (NEWS11) and the evaluation test (NEWS12/15), as well as two conditions, standard runs and non-standard runs. While the standard runs allow for meaningful comparisons across different algorithms, the non-standard runs open up more opportunities for exploiting a variety of additional linguistic resources.
Six teams from six different institutions participated in the shared task. In total, we received 97 standard and 6 non-standard runs for each test set; i.e. 194 standard and 12 non-standard runs in total. Most of the current state of the art in machine transliteration is represented among the systems that participated in the shared task.
Encouraged by the continued success of the NEWS workshop series, we plan to continue this event in the future to further promote machine transliteration research and development.

Figure 1 :
Figure 1: Mean F-scores (Top-1) on the evaluation test set (NEWS12/15) for all primary standard submissions and all transliteration tasks.

Figure 2 :
Figure 2: Mean F-scores (Top-1) on the evaluation test set (NEWS12/15) for the best primary standard submissions in 2012 and 2015.

Table 1 .
c: Datasets provided by Microsoft Research India.

Table 2 :
Number of standard (Std) and non-standard (Non) runs submitted, and teams participating in each task.

Table A1 :
Results for the English to Chinese transliteration task (EnCh) on Progress Test.

Table A2 :
Results for the English to Chinese transliteration task (EnCh) on Evaluation Test.

Table A3 :
Results for the Chinese to English transliteration task (ChEn) on Progress Test.

Table A4 :
Results for the Chinese to English transliteration task (ChEn) on Evaluation Test.

Table A5 :
Results for the English to Korean Hangul transliteration task (EnKo) on Progress Test.

Table A6 :
Results for the English to Korean Hangul transliteration task (EnKo) on Evaluation Test.

Table A7 :
Results for the English to Japanese Katakana transliteration task (EnJa) on Progress Test.

Table A8 :
Results for the English to Japanese Katakana transliteration task (EnJa) on Evaluation Test.

Table A9 :
Results for the English to Japanese Kanji transliteration task (JnJk) on Progress Test.

Table A10 :
Results for the English to Japanese Kanji transliteration task (JnJk) on Evaluation Test.

Table A11 :
Results for the Arabic to English transliteration task (ArEn) on Progress Test.

Table A12 :
Results for the Arabic to English transliteration task (ArEn) on Evaluation Test.

Table A13 :
Results for the English to Hindi transliteration task (EnHi) on Progress Test.

Table A14 :
Results for the English to Hindi transliteration task (EnHi) on Evaluation Test.

Table A15 :
Results for the English to Tamil transliteration task (EnTa) on Progress Test.

Table A16 :
Results for the English to Tamil transliteration task (EnTa) on Evaluation Test.

Table A17 :
Results for the English to Kannada transliteration task (EnKa) on Progress Test.

Table A18 :
Results for the English to Kannada transliteration task (EnKa) on Evaluation Test.

Table A19 :
Results for the English to Bangla (Bengali) transliteration task (EnBa) on Progress Test.

Table A20 :
Results for the English to Bangla (Bengali) transliteration task (EnBa) on Evaluation Test.

Table A21 :
Results for the English to Hebrew transliteration task (EnHe) on Progress Test.

Table A22 :
Results for the English to Hebrew transliteration task (EnHe) on Evaluation Test.

Table A23 :
Results for the English to Thai transliteration task (EnTh) on Progress Test.

Table A24 :
Results for the English to Thai transliteration task (EnTh) on Evaluation Test.

Table A25 :
Results for the Thai to English transliteration task (ThEn) on Progress Test.

Table A26 :
Results for the Thai to English transliteration task (ThEn) on Evaluation Test.

Table A27 :
Results for the English to Persian transliteration task (EnPe) on Progress Test.

Table A28 :
Results for the English to Persian transliteration task (EnPe) on Evaluation Test.