Whitepaper of NEWS 2016 Shared Task on Machine Transliteration

Transliteration is defined as the phonetic translation of names across languages. Transliteration of Named Entities (NEs) is necessary in many applications, such as machine translation, corpus alignment, cross-language information retrieval, information extraction and automatic lexicon acquisition. All such systems call for high-performance transliteration, which is the focus of the shared task in the NEWS 2016 workshop. The objective of the shared task is to promote machine transliteration research by providing a common benchmarking platform for the community to evaluate state-of-the-art technologies.


Task Description
The task is to develop a machine transliteration system for one or more of the specified language pairs. Each language pair consists of a source and a target language. The training and development data sets released for each language pair are to be used for developing a transliteration system in whatever way the participants find appropriate. At evaluation time, a test set of source names only will be released, on which the participants are expected to produce a ranked list of transliteration candidates in the target language (i.e. n-best transliterations); this output will be evaluated using common metrics. For every language pair, the participants must submit at least one run that uses only the data provided by the NEWS workshop organisers for that language pair (designated as a "standard" run, the primary submission). Participants may submit further "standard" runs. They may also submit several "non-standard" runs for each language pair that use data other than those provided by the NEWS 2016 workshop (http://translit.i2r.a-star.edu.sg/news2016/); such runs will be evaluated and reported separately.
(a) The test data will be released on 25 April 2016, and the participants have a maximum of 7 days to submit their results in the expected format.
(b) One "standard" run must be submitted from every group on a given language pair. Additional "standard" runs may be submitted, up to 4 "standard" runs in total. However, the participants must indicate one of the submitted "standard" runs as the "primary submission". The primary submission will be used for the performance summary. In addition to the "standard" runs, more "non-standard" runs may be submitted. In total, a maximum of 8 runs (up to 4 "standard" runs plus up to 4 "non-standard" runs) can be submitted from each group on a registered language pair. The definition of "standard" and "non-standard" runs is given in Section 5.
(c) Any runs that are "non-standard" must be tagged as such.
(d) The test set is a list of names in the source language only. Every group will produce and submit a ranked list of transliteration candidates in the target language for each given name in the test set. Please note that this shared task is a "transliteration generation" task: given a name in a source language, the system is to generate one or more transliterations in a target language. It is not a "transliteration discovery" task, i.e., given a name in the source language and a set of names in the target language, finding the names in the target set that are transliterations of the given source name.
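The whitepaper only says the n-best output will be scored with "common metrics". As an illustration, transliteration shared tasks conventionally report measures such as top-1 word accuracy (ACC) and mean reciprocal rank (MRR); the sketch below computes these two and is an assumption about the metric set, not the official evaluation script:

```python
def evaluate(nbest, references):
    """Score n-best transliteration candidates.

    nbest: one ranked candidate list per source name.
    references: one set of acceptable target-language
        transliterations per source name.
    Returns (ACC, MRR): top-1 word accuracy and mean
    reciprocal rank, both averaged over all source names.
    """
    acc = 0.0
    mrr = 0.0
    for candidates, refs in zip(nbest, references):
        # ACC: credit only if the top-ranked candidate is correct.
        if candidates and candidates[0] in refs:
            acc += 1
        # MRR: credit 1/rank of the first correct candidate, if any.
        for rank, cand in enumerate(candidates, start=1):
            if cand in refs:
                mrr += 1.0 / rank
                break
    n = len(nbest)
    return acc / n, mrr / n
```

For example, if the correct answer appears at rank 1 for one name and rank 2 for another, ACC is 0.5 and MRR is (1 + 0.5) / 2 = 0.75.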

Results (6 May 2016)
(a) On 6 May 2016, the evaluation results will be announced and made available on the workshop website.
(b) Note that only the scores (in the respective metrics) of the participating systems on each language pair will be published; no explicit ranking of the participating systems will be published.
(c) Note that this is a shared evaluation task and not a competition; the results are meant to be used to evaluate systems on a common data set with common metrics, not to rank the participating systems. While the participants may cite the performance of their systems (scores on metrics) from the workshop report, they should not use any ranking information in their publications.
(d) Furthermore, all participants should agree not to reveal the identities of other participants in any of their publications unless they obtain permission from the respective participants. By default, all participants remain anonymous in published results, unless they indicate otherwise at the time of uploading their results. Note that the results of all systems will be published, but the identities of those participants that choose not to disclose their identity to other participants will be masked. In this case, your organisation name will still appear on the website as one of the participants, but it will not be linked explicitly to your results.
5. Short Papers on Task (16 May 2016)
(a) Each submitting site is required to submit a 4-page system paper (short paper) for its submissions, describing its approach, the data used, and the results on either the test set, the development set, or by n-fold cross-validation on the training set.
(b) The system papers will be reviewed to improve paper quality and readability and to make sure the authors' ideas and methods can be understood by the workshop participants. We aim to accept all system papers; selected ones will be presented orally at the NEWS 2016 workshop.
(c) All participants are required to register for and attend the workshop to present their work.

Language Pairs
The tasks are to transliterate personal names or place names from a source to a target language as summarised in Table 1.

Testing Data
Source names only; size 1K-2K. This is a held-out set, which will be used for evaluating the quality of the transliterations.

Progress Testing Data
Source names only; size 0.6K-2.6K. This is the NEWS 2011 test set, which is held out for the progress study.
1. Participants will need to obtain licenses from the respective copyright owners and/or agree to the terms and conditions of use given on the download website (Li et al., 2004; MSRI, 2010; CJKI, 2010). NEWS 2016 will provide the contact details of each individual database. The data are provided in Unicode UTF-8 encoding, in XML format; the results are expected to be submitted in UTF-8 encoding in XML format.
The XML formats details are available in Appendix A.
2. The data are provided in 3 sets as described above.
3. Name pairs are distributed as-is, as provided by the respective creators.
(a) While the databases are mostly manually checked, there may still be inconsistencies (that is, non-standard usage, region-specific usage, errors, etc.) or incompleteness (that is, not all valid variations may be covered).
(b) The participants may use any method to further clean up the data provided.
i. If the data are cleaned up manually, we appeal that such data be provided back to the organisers for redistribution to all the participating groups in that language pair; such sharing benefits all participants, and further ensures that the evaluation provides normalisation with respect to data quality.
ii. If automatic cleanup is used, such cleanup would be considered a part of the fielded system, and hence is not required to be shared with all participants.

4. Standard Runs
We expect the participants to use only the data (parallel names) provided by the Shared Task for a "standard" run, to ensure a fair evaluation. One such run (using only the data provided by the shared task) is mandatory for all participants for every language pair in which they participate.
5. Non-standard Runs
If more data (either parallel name data or monolingual data) are used, then all such runs using extra data must be marked as "non-standard". For such "non-standard" runs, the size and characteristics of the data used must be disclosed in the system paper.
6. A participant may submit a maximum of 8 runs for a given language pair (including the one mandatory "standard" run marked as the "primary submission").
6 Paper Format

Figure 2: Example file: NEWS2012_EnHi_TUniv_01_StdRunHMMBased.xml

The NEWS 2016 Shared Task offers 14 evaluation subtasks, among them ChEn and ThEn are the back-transliterations of the EnCh and EnTh tasks respectively. NEWS 2016 releases training, development and testing data for each of the language pairs. NEWS 2016 continues all language pairs that were evaluated in NEWS 2011, 2012 and 2015. In such cases, the training, development and test data in the NEWS 2016 release are the same as those in NEWS 2015. Please note that, in order to have an accurate study of the research progress of machine transliteration technology, different from previous practice, the test/reference sets of NEWS 2011 are not released to the research community. Instead, we use the test sets of NEWS 2011 as progress test sets in NEWS 2016. NEWS 2016 participants are requested to submit results on the NEWS 2016 progress test sets (i.e., the NEWS 2011 test sets). By doing so, we can carry out comparison studies, comparing the NEWS 2016 and NEWS 2011 results on the progress test sets, and comparing the NEWS 2016 and previous years' results on the test sets. We hope to obtain insightful research findings from these progress studies.

Table 1: Source and target languages for the shared task on transliteration.