Report of NEWS 2018 Named Entity Transliteration Shared Task

This report presents the results from the Named Entity Transliteration Shared Task conducted as part of The Seventh Named Entities Workshop (NEWS 2018) held at ACL 2018 in Melbourne, Australia. Similar to previous editions of NEWS, the Shared Task featured 19 tasks on proper name transliteration, including 13 different languages and two different Japanese scripts. A total of 6 teams from 8 different institutions participated in the evaluation, submitting 424 runs, involving different transliteration methodologies. Four performance metrics were used to report the evaluation results. The NEWS shared task on machine transliteration has successfully achieved its objectives by providing a common ground for the research community to conduct comparative evaluations of state-of-the-art technologies that will benefit the future research and development in this area.


Introduction
Names play an important role in the performance of most natural language processing and information retrieval applications. They are also critical in cross-lingual applications such as machine translation and cross-language information retrieval, as it has been shown that system performance correlates positively with the quality of name conversion across languages (Demner-Fushman and Oard 2002, Mandl and Womser-Hacker 2005, Hermjakob et al. 2008, Udupa et al. 2009). Bilingual dictionaries constitute the traditional source of information for name conversion across languages; however, they offer very limited support because, in most languages, names are continuously emerging and evolving.
All of the above points to the critical need for robust machine transliteration methods and systems. The research community has devoted significant effort to the problem of machine transliteration (Knight and Graehl 1998, Meng et al. 2001, Li et al. 2004, Zelenko and Aone 2006, Sproat et al. 2006, Sherif and Kondrak 2007, Hermjakob et al. 2008, Al-Onaizan and Knight 2002, Goldwasser and Roth 2008, Goldberg and Elhadad 2008, Klementiev and Roth 2006, Oh and Choi 2002, Virga and Khudanpur 2003, Wan and Verspoor 1998, Kang and Choi 2000, Gao et al. 2004, Li et al. 2009a, Li et al. 2009b). These efforts fall into three main categories: grapheme-based, phoneme-based and hybrid methods. Grapheme-based methods (Li et al. 2004) treat transliteration as a direct orthographic mapping and use only orthography-related features, while phoneme-based methods (Knight and Graehl 1998) make use of phonetic correspondences to generate the transliteration. The hybrid approach refers to the combination of several different models or knowledge sources to support the transliteration generation process. Recently, neural network approaches have been explored with varying success, depending on the size of the training data.
The first machine transliteration shared task (Li et al. 2009a, Li et al. 2009b) was organized and conducted as part of NEWS 2009 at ACL-IJCNLP 2009. It was the first time that common benchmarking data in diverse language pairs was provided for evaluating state-of-the-art machine transliteration. While the focus of the 2009 shared task was on establishing the quality metrics and on setting up a baseline for transliteration quality based on those metrics, the 2010 shared task (Li et al. 2010a, Li et al. 2010b) focused on expanding the scope of the transliteration generation task to about a dozen languages and on exploring the quality of the task depending on the direction of transliteration.
In NEWS 2011 (Zhang et al. 2011a, Zhang et al. 2011b), the focus was on significantly increasing the hand-crafted parallel corpora of named entities to include 14 different language pairs from 11 language families, and on making them available as the common dataset for the shared task.
The NEWS 2018 Shared Task on Named Entity Transliteration continues the effort to evaluate machine transliteration performance, following the NEWS editions of 2012 (Zhang et al. 2012), 2015 (Zhang et al. 2015) and 2016 (Duan et al. 2016).
In this paper, we present in full detail the results of the NEWS 2018 Named Entity Transliteration Shared Task. The rest of the paper is structured as follows. Section 2 provides a short review of the main characteristics of the machine transliteration task and the corpora used for it. Section 3 reviews the four metrics used for the evaluations. Section 4 reports specific details about participation in the shared task, and section 5 presents and discusses the evaluation results. Finally, section 6 presents our main conclusions and future plans.

Shared Task on Transliteration
Transliteration, sometimes also called Romanization, especially if Latin Scripts are used for target strings (Halpern 2007), deals with the conversion of names between two languages and/or script systems. Within the context of this transliteration shared task, we are aiming not only at addressing the name conversion process but also its practical utility for downstream applications, such as machine translation and cross-language information retrieval.
In this context, we adopt the same definition of transliteration as proposed during NEWS 2009 (Li et al. 2009a): transliteration is understood as the conversion of a given name in the source language (a text string in the source writing system or orthography) to a name in the target language (another text string in the target writing system or orthography), subject to the following requirements regarding the name representation in the target language:
• it is phonetically equivalent to the source name,
• it conforms to the phonology of the target language, and
• it matches the user intuition on its equivalence with respect to the source language name.
Following previous editions of NEWS, some back-transliteration tasks are considered. Back-transliteration attempts to restore transliterated names back into their original source language. NEWS 2018 included a total of six back-transliteration tasks.

Shared Task Description
As in previous editions of the workshop series, the shared task in NEWS 2018 consists of developing machine transliteration systems in one or more of the specified language pairs. Each language pair of the shared task consists of a source and a target language, implicitly specifying the transliteration direction. Training and development data in each of the language pairs was made available to all registered participants for developing their transliteration systems.
At the evaluation time, hand-crafted test sets of source names were released to the participants, who were required to produce a ranked list of transliteration candidates in the target language for each source name. The system outputs were tested against their corresponding reference sets (which may include multiple correct transliterations for some source names). The performance of a system is quantified using multiple metrics (defined in Section 3).
In this edition of the workshop, only standard runs (restricted to the train and development data provided) were considered. No other data or linguistic resources were allowed for standard runs. This ensures parity between systems and enables meaningful comparison of the performance of various algorithmic approaches in a given language pair. Participants were allowed to submit one or more standard runs for each task they participated in. If more than one standard run was submitted, participants were required to select one as the "primary" run by publishing it to the leaderboard. The primary runs are the ones used to compare results across different systems.

Shared Task Corpora
Two specific constraints were considered when selecting languages for the shared task: language diversity and data availability. To make the shared task interesting and to attract wider participation, it is important to ensure a reasonable variety of linguistic diversity, orthography and geography. Following NEWS 2016, the tasks were grouped into five categories based on the specific organizations providing the datasets. In Tables 1.a-e, Type refers to the type of task (transliteration, back-transliteration or mixed); Origin refers to the origin of the names; and Source/Target refer to the source/target scripts.

Evaluation Metrics and Rationale
The participants were asked to submit standard and, optionally, non-standard runs. One of the standard runs had to be named as the primary submission, which was the one used for the performance summary. Each run had to contain a ranked list of up to ten candidate transliterations for each source name. The submitted results are compared to the ground truth (reference transliterations) using four evaluation metrics capturing different aspects of transliteration performance. The four evaluation metrics are:
• Word Accuracy in Top-1 (ACC),
• Fuzziness in Top-1 (Mean F-score),
• Mean Reciprocal Rank (MRR), and
• Mean Average Precision ($\mathrm{MAP}_{ref}$).
In the next subsections, we present a brief description of each of these metrics. The following notation is assumed:
• $N$: total number of names (source words) in the test set,
• $n_i$: number of reference transliterations for the $i$-th name in the test set ($n_i \geq 1$),
• $r_{i,j}$: $j$-th reference transliteration for the $i$-th name in the test set,
• $c_{i,k}$: $k$-th candidate transliteration (system output) for the $i$-th name in the test set ($1 \leq k \leq 10$),
• $K_i$: number of candidate transliterations produced by a transliteration system for the $i$-th name.

Word Accuracy in Top-1 (ACC)
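This metric, the complement of the word error rate, measures the correctness of the first transliteration candidate in the candidate list produced by a transliteration system. ACC = 1 means that all top candidates are correct transliterations, i.e. they match one of the references, and ACC = 0 means that none of the top candidates are correct. In the notation above:
$$\mathrm{ACC} = \frac{1}{N} \sum_{i=1}^{N} \begin{cases} 1 & \text{if } \exists\, r_{i,j} : r_{i,j} = c_{i,1} \\ 0 & \text{otherwise} \end{cases}$$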
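As an illustration of this definition, the following minimal Python sketch computes ACC from ranked candidate lists and reference sets. The function name `word_accuracy` and the list-of-lists / list-of-sets data layout are assumptions made for this example; this is not the official evaluation script of the shared task.

```python
def word_accuracy(candidates, references):
    """Top-1 word accuracy.

    candidates: list of ranked candidate lists, one per source name.
    references: list of sets of reference transliterations, one per source name.
    Both layouts are assumptions made for this sketch.
    """
    if len(candidates) != len(references):
        raise ValueError("candidates and references must have the same length")
    if not references:
        raise ValueError("the test set must contain at least one name")
    # A name counts as correct when its first candidate matches any reference.
    correct = sum(
        1
        for cands, refs in zip(candidates, references)
        if cands and cands[0] in refs
    )
    return correct / len(references)


if __name__ == "__main__":
    system_output = [["a", "b"], ["x"], []]
    gold = [{"a"}, {"y", "z"}, {"q"}]
    print(word_accuracy(system_output, gold))
```

In this toy example only the first name has a correct top candidate, so one name out of three is counted as correct. A name with an empty candidate list is simply counted as incorrect.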

Fuzziness in Top-1 (Mean F-score)
The Mean F-score measures how different, on average, the top transliteration candidate is from its closest reference. F-score for each source word is a function of Precision and Recall and equals 1 when the top candidate matches one of the references, and 0 when there are no common characters between the candidate and any of the references.
Precision and Recall are calculated based on the length of the Longest Common Subsequence (LCS) between a candidate and a reference:
$$\mathrm{LCS}(c, r) = \frac{1}{2}\left(|c| + |r| - \mathrm{ED}(c, r)\right)$$
where ED is the edit distance and $|x|$ is the length of $x$. For example, the longest common subsequence between "abcd" and "afcde" is "acd" and its length is 3. The best matching reference, i.e. the reference for which the edit distance is minimal, is taken for the calculation. If the best matching reference is given by
$$r_{i,m} = \arg\min_{j} \mathrm{ED}(c_{i,1}, r_{i,j})$$
then the Recall, Precision and F-score for the $i$-th word are calculated as:
$$R_i = \frac{\mathrm{LCS}(c_{i,1}, r_{i,m})}{|r_{i,m}|}, \qquad P_i = \frac{\mathrm{LCS}(c_{i,1}, r_{i,m})}{|c_{i,1}|}, \qquad F_i = \frac{2\, R_i\, P_i}{R_i + P_i}$$
The lengths are computed with respect to distinct Unicode characters, and no distinctions are made for different character types of a language (e.g. vowel vs. consonant vs. combining diereses).
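The computation can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the official scorer: the function names are hypothetical, the LCS length is computed directly by dynamic programming, and the edit distance is assumed to count insertions and deletions only, so that ED(c, r) = |c| + |r| − 2·LCS(c, r). That assumption is consistent with the "abcd"/"afcde" example, where the LCS length is 3. Precision is taken as LCS/|candidate| and Recall as LCS/|reference|. Python strings are sequences of Unicode code points, which matches the character-level length convention described above.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def top1_f_score(candidate, refs):
    """F-score of the top candidate against its closest reference.

    Assumes an insertion/deletion-only edit distance:
    ED(c, r) = |c| + |r| - 2 * LCS(c, r).
    """
    best = min(refs, key=lambda r: len(candidate) + len(r) - 2 * lcs_length(candidate, r))
    lcs = lcs_length(candidate, best)
    if lcs == 0:
        return 0.0  # no common characters with the closest reference
    precision = lcs / len(candidate)
    recall = lcs / len(best)
    return 2 * precision * recall / (precision + recall)


def mean_f_score(candidates, references):
    """Average the per-name F-score of the first candidate; names without candidates score 0."""
    scores = [
        top1_f_score(cands[0], refs) if cands else 0.0
        for cands, refs in zip(candidates, references)
    ]
    return sum(scores) / len(scores)
```

For the pair from the example above, `top1_f_score("abcd", ["afcde"])` gives a precision of 3/4 and a recall of 3/5, hence an F-score of 2/3.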

Mean Reciprocal Rank (MRR)
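This metric measures the traditional MRR for any right answer produced by the system, from among the candidates. 1/MRR indicates approximately the average rank of the correct transliteration. An MRR closer to 1 implies that the correct answer is mostly produced close to the top of the n-best lists. With $RR_i$ denoting the reciprocal rank of the first correct candidate for the $i$-th name:
$$RR_i = \begin{cases} \dfrac{1}{\min\{k : \exists\, j,\ c_{i,k} = r_{i,j}\}} & \text{if } \exists\, r_{i,j}, c_{i,k} : r_{i,j} = c_{i,k} \\ 0 & \text{otherwise} \end{cases}, \qquad \mathrm{MRR} = \frac{1}{N} \sum_{i=1}^{N} RR_i$$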
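A minimal Python sketch of this metric is given below. It assumes the conventional definition of MRR, in which each name contributes the reciprocal of the rank of its first correct candidate, or 0 if no candidate is correct. The function names and data layout are hypothetical and chosen only for illustration.

```python
def reciprocal_rank(cands, refs):
    """1 / rank of the first candidate that matches any reference, or 0.0 if none does."""
    for rank, cand in enumerate(cands, start=1):
        if cand in refs:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(candidates, references):
    """Average reciprocal rank over all source names (assumes a non-empty test set)."""
    total = sum(
        reciprocal_rank(cands, refs)
        for cands, refs in zip(candidates, references)
    )
    return total / len(references)


if __name__ == "__main__":
    system_output = [["a", "b"], ["x", "y"], ["m"]]
    gold = [{"b"}, {"x"}, {"z"}]
    print(mean_reciprocal_rank(system_output, gold))
```

In the toy example the three names contribute reciprocal ranks of 1/2, 1 and 0, so the resulting MRR is 0.5.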

Mean Average Precision (MAP_ref)
This metric measures tightly the precision in the n-best candidates for the $i$-th source name, for which reference transliterations are available. If all of the references are produced, then the MAP is 1. If we denote the number of correct candidates for the $i$-th source word in the $k$-best list as $num(i,k)$, then $\mathrm{MAP}_{ref}$ is given by:
$$\mathrm{MAP}_{ref} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{n_i} \left( \sum_{k=1}^{n_i} num(i,k) \right)$$

Participation in the Shared Task
A total of six teams from eight different institutions participated in the NEWS 2018 Shared Task. More specifically, the participating teams were from the University of Alberta (UALB), the University of Edinburgh (EDI), the University of Jadavpur and Universität des Saarlandes (UJUS), Université du Québec à Montréal (UQAM), team SINGA (from the National University of Singapore and the Singapore University of Technology and Design), and WIPO (World Intellectual Property Organization) 1.
In total, we received 424 standard runs. Table 2 summarizes the number of standard runs and the teams that participated in each task.

Task Results and Analysis
In this section, we present the official results of the shared task along with brief descriptions of the different participant systems and some recommendations for future improvements. Most language pairs are able to achieve close to 80% or more in terms of F-score for at least some systems. An intriguing observation from Figure 1 is that for the language pair English-Chinese, the back-transliteration task from Chinese to English performs at least 15% better than the transliteration task from English to Chinese.
1 This last team did not submit a system paper, but we are including their submission result for the sake of completeness.

Shared Task Results
It can also be observed from the table that results for the T-EnPe and the B-PeEn tasks (Western names) are notably low. This resulted from a mismatch in the script conventions used for the Persian language between the original train and development sets and the newly developed test set.
A much more comprehensive presentation of results for the NEWS 2018 Shared Task is provided in the Appendix at the end of this paper, where the resulting scores are reported for all received submissions for all four metrics, including non-primary submissions. All results are presented in 19 tables, each of which reports the scores for one transliteration task. In the tables, all primary standard runs are highlighted in bold-italic font.

Participant Systems
This year, the SINGA team (Snigdha et al. 2018) provided two baseline systems using Sequitur and Moses (phrase-based machine translation). All other systems used some version of neural modeling. It is interesting to note that non-neural systems by SINGA, while not the highest in performance, are generally comparable to neural systems or system combinations which include neural models.
Regarding the systems participating in this year's evaluation, the UALB system (Najafi et al.) showed improvements of up to 8% absolute over a baseline system by using system combination.

Figure 1: Mean F-scores (Top-1) on the evaluation set for all primary submissions and tasks.
The UJUS system (Kundu et al. 2018) used an RNN-based NMT framework and a CNN-based NMT framework, where both byte-pair encoding and character-based segmentation were employed for both cases. They also adopted an ensemble method to choose the hypothesis that has the highest frequency of occurrence to further improve accuracy.
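The frequency-based ensemble step can be illustrated with a short Python sketch. This is only a generic majority-vote example under stated assumptions, not the UJUS implementation: the function name `vote_hypothesis` is hypothetical, each model is assumed to contribute a single top hypothesis, and ties are broken in favour of the hypothesis that appears first in the input, which is a choice made for this example.

```python
from collections import Counter


def vote_hypothesis(hypotheses):
    """Return the most frequent hypothesis among the outputs of several models.

    hypotheses: list of strings, one top hypothesis per model (assumed layout).
    Ties are resolved in favour of the earliest hypothesis in the list.
    """
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    counts = Counter(hypotheses)
    best_count = max(counts.values())
    for hyp in hypotheses:
        if counts[hyp] == best_count:
            return hyp


if __name__ == "__main__":
    # Toy outputs of four hypothetical models for the same source name.
    print(vote_hypothesis(["nguyen", "nguen", "nguyen", "ngyen"]))
```

Ordering the input by model confidence would make the tie-breaking rule favour the stronger models, but whether the original system did so is not stated in this report.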
The EDI system applied machine translation techniques such as dropout regularization, model ensembling, and re-scoring with right-left models. The EDI system is competitive, outperforming other teams in most of the tasks it participated in.
The UQAM system (Le et al. 2018) aligned the sequences in the English-Vietnamese language pair before an RNN-based machine transliteration system was trained.

Issues and Recommendations
In this section, we report some issues encountered during the shared task execution along with recommendations for future improvement of the Shared Task on Named Entity Transliteration.
• As mentioned in section 5.1, scripting discrepancies between the train/dev data and the test data occurred for Persian characters in the T-EnPe and B-PeEn tasks. Specifically, the newly developed test set happens to contain a mixture of the Persian and Arabic scripts, which includes visually similar characters that have distinct encodings. This dataset will be revised to resolve this problem for the next evaluation campaign.
• Some of the datasets for the shared task are available under specific licensing agreements that participants have to enter into directly with the data providers. The organizing team will explore alternative means to offer all the datasets in the shared task under a single centralized licensing agreement, which should ideally be free of cost for the participants.
• Some of the participants experienced failures and delays during submissions to the CodaLab system. Most of these problems were due to server overloads. The organizing team will contact CodaLab support to see how these problems can be fully resolved, or at least minimized, in future editions of the shared task.
• Participants also believe that better publicity for the shared task would result in increased participation in the task. NEWS workshop organizers receive a significant number of requests for datasets and information about the shared task throughout the year. However, the total number of participants in the shared task does not reflect this level of interest from the research community in the data and the tasks. Publicity strategies and shared task timelines will be revised accordingly.
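To illustrate the first issue above, the following Python sketch shows how two visually similar letters with distinct Unicode code points compare as different strings, and how a simple mapping can normalize them. The report does not list the affected characters, so the two pairs used here (Arabic Yeh U+064A vs. Farsi Yeh U+06CC, and Arabic Kaf U+0643 vs. Keheh U+06A9) are assumptions chosen as commonly cited examples, and `normalize_persian` is a hypothetical helper rather than the procedure the organizers will apply.

```python
# Assumed mapping from Arabic letter forms to their Persian counterparts.
# Only two well-known pairs are covered; a real normalizer may need more.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})


def normalize_persian(text):
    """Replace the Arabic letter forms listed above with their Persian forms."""
    return text.translate(ARABIC_TO_PERSIAN)


if __name__ == "__main__":
    arabic_form = "\u0643\u064A"   # Kaf + Yeh, Arabic code points
    persian_form = "\u06A9\u06CC"  # Keheh + Farsi Yeh, Persian code points
    # The two strings look alike when rendered but are not equal.
    print(arabic_form == persian_form)
    # After normalization they compare equal.
    print(normalize_persian(arabic_form) == persian_form)
```

Because the evaluation metrics compare strings character by character, such a mismatch between training and test data would cause otherwise correct outputs to be scored as errors.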

Conclusions
The Shared Task on Named Entity Transliteration in NEWS 2018 has shown that the research community has a continued interest in this area. This report summarizes the results of the NEWS 2018 Shared Task.
We are pleased to report a comprehensive set of machine transliteration approaches and their evaluation results from 6 teams from 8 different institutions that participated in the shared task. This year, we received 424 runs in total. Most of the current state of the art in machine transliteration is represented in the systems that participated in the shared task. Encouraged by the continued success of the NEWS workshop series, we plan to continue this event in the future to further promote machine transliteration research and development.

Acknowledgments
The organizers of the NEWS 2018 Shared Task would like to thank the Institute for Infocomm Research (Singapore), National University of Singapore, Artificial Intelligence Laboratory at the Ho Chi Minh City University of Science (AILab, VNU-HCMUS, Vietnam), Microsoft Research India, the Computer Science & Engineering Department of Jadavpur University (India), the CJK Institute (Japan), the National Electronics and Computer Technology Center (NECTEC, Thailand) and Sarvnaz Karim (RMIT, Australia) for providing the corpora and technical support. Without those, the Shared Task would not have been possible. In addition, we want to thank Grandee Lee and Snigdha Singhania for their help and support with CodaLab and the baseline systems, respectively. We also want to thank all programme committee members for their valuable comments that improved the quality of the shared task papers and, finally, we wish to thank all participants for their active participation, which has once again made the NEWS 2018 edition of the Shared Task on Named Entity Transliteration a successful competition.