Moses-based official baseline for NEWS 2016

Transliteration is the phonetic translation of words between two different languages. Many works approach transliteration using machine translation methods. This paper describes the official baseline system for the NEWS 2016 workshop shared task. The baseline is a standard phrase-based machine translation system built with Moses. Results fall between the best and the worst results from last year's workshop, providing a reasonable starting point for this year's participants.


Introduction
Transliteration of Named Entities is a useful task for many natural language processing applications, such as cross-language information retrieval, information extraction or even machine translation. For several editions, the NEWS workshop has provided the opportunity to share transliteration strategies and compare results among different sites. This year, the NEWS workshop offers training, development and test corpora for 14 language pairs. The goal of this paper is to provide a baseline system for the NEWS 2016 workshop. Since a common strategy for transliteration has been to apply machine translation techniques, e.g. (Rama and Gali, 2009; David, 2012), we have chosen the phrase-based system (Koehn et al., 2003).
The phrase-based machine translation system finds the most probable target sentence given the source sentence. The theory behind phrase-based systems has evolved from the noisy channel model to the log-linear model, which is the one used nowadays. This model combines several feature functions, including the translation and language models, the reordering model and the lexical models.
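The standard log-linear formulation can be written as follows, where $f$ is the source sentence, $h_i$ are the feature functions (translation, language, reordering and lexical models) and $\lambda_i$ their weights, tuned on development data:

```latex
\hat{e} = \arg\max_{e} \sum_{i=1}^{M} \lambda_i \, h_i(e, f)
```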
The only requirement to train a phrase-based system is a sentence-aligned parallel corpus. In the case of transliteration, words play the role of sentences and characters the role of words. For example, parallel sentences used to train an English-Hindi transliteration system are shown in Table 1.

English            Hindi
a a b h a a        [Devanagari characters]
a a b h e e r a    [Devanagari characters]
a a b i d a        [Devanagari characters]
a a b s h a r      [Devanagari characters]

Table 1: Example of parallel training sentences for English-Hindi transliteration; each word is written as a sequence of space-separated characters.

The next experimental section describes the preprocessing of the data and the final corpus statistics for the 14 tasks in the evaluation. We report the parameters used to train the phrase-based system, and finally we explain the results obtained in terms of several automatic measures. After the experimental section, we include a section of conclusions.

System Description
The phrase-based system was built using Moses (Koehn et al., 2007), version of 15th April 2016 from GitHub, with standard parameters, including: grow-diag-final for alignment symmetrization; Good-Turing smoothing of the relative frequencies; a 3-gram language model with Kneser-Ney discounting trained with SRILM (Stolcke, 2002); and lexicalized reordering, which includes 6 feature functions. Optimization was done using the MERT algorithm, and the MBR option was used for decoding. It is important to note that the same system was used for the 14 tasks without any change or modification.

Results
Official results are reported in Table 3. In most tasks, results were in the middle of the ranking. The best-ranked results were obtained in English-to-Japanese (Kanji) and Arabic-to-English (no merit in the latter, because the baseline was the only participant). The worst-ranked results were for English-to-Thai, English-to-Tamil, English-to-Hebrew, English-to-Korean and English-to-Japanese (Katakana).
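Among the automatic measures used in the evaluation, the simplest is word accuracy in top-1 (ACC). As an illustrative sketch only (assuming, for simplicity, a single reference transliteration per source name, whereas the official scorer also handles multiple references), it could be computed as:

```python
def accuracy(hypotheses, references):
    """Fraction of names whose top-1 transliteration exactly matches the reference."""
    assert len(hypotheses) == len(references)
    correct = sum(h == r for h, r in zip(hypotheses, references))
    return correct / len(references)

# One of two hypothetical system outputs matches its reference exactly.
print(accuracy(["aabhaa", "aabida"], ["aabhaa", "abida"]))  # 0.5
```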

Conclusions
This phrase-based system based on standard Moses has been offered to the NEWS organizers to provide a reasonable baseline for the competition. It also helps participants gauge the quality of their systems against state-of-the-art transliteration approached as a translation challenge.
In the next edition, we hope to provide an enhanced baseline system by tuning some parameters of the Moses system, and possibly to compete in the shared task with a related approach based on character-aware neural machine translation (Costa-jussà and Fonollosa, 2016).