Neural Machine Translation Techniques for Named Entity Transliteration

Transliterating named entities from one language into another can be approached as a neural machine translation (NMT) problem, for which we use deep attentional RNN encoder-decoder models. To build a strong transliteration system, we apply well-established techniques from NMT, such as dropout regularization, model ensembling, rescoring with right-to-left models, and back-translation. Our submission to the NEWS 2018 Shared Task on Named Entity Transliteration ranked first in several tracks.


Introduction
Transliteration of Named Entities (NEs) is defined as the phonetic translation of names across languages (Knight and Graehl, 1998). It is an important part of a number of natural language processing tasks, and machine translation in particular (Durrani et al., 2014; Sennrich et al., 2016c).
Machine transliteration can be approached as a sequence-to-sequence modeling problem (Finch et al., 2016; Ameur et al., 2017). In this work, we explore the Neural Machine Translation (NMT) approach based on an attentional RNN encoder-decoder neural network architecture, motivated by its successful application to other sequence-to-sequence tasks, such as grammatical error correction (Yuan and Briscoe, 2016), automatic post-editing (Junczys-Dowmunt and Grundkiewicz, 2016), sentence summarization (Chopra et al., 2016), or paraphrasing (Mallinson et al., 2017). We apply well-established techniques from NMT to machine transliteration, building a strong system that achieves state-of-the-art results. The techniques we exploit include:
• Regularization with various dropouts to prevent model overfitting;
• Ensembling strategies involving independently trained models and model checkpoints;
• Re-scoring of n-best lists of candidate transliterations by right-to-left models;
• Using synthetic training data generated via back-translation.
The developed system constitutes our submission to the NEWS 2018 Shared Task on Named Entity Transliteration, which ranked first in several tracks. We describe the shared task in Section 2, including the provided data sets and evaluation metrics. In Section 3, we present the model architecture and the adopted NMT techniques. The experiment details are presented in Section 4, the results are reported in Section 5, and we conclude in Section 6.

Shared task on named entity transliteration
The NEWS 2018 shared task (Chen et al., 2018) continues the tradition from the previous tasks (Xiangyu Duan et al., 2016; Zhang et al., 2012) and focuses on transliteration of personal and place names from English, into English, or in both directions.

Datasets
Five different datasets have been made available for use as the training and development data. The data for Thai (EnTh, ThEn) comes from the NECTEC transliteration dataset. The second dataset is the RMIT English-Persian dataset (Karimi et al., 2006, 2007) (EnPe, PeEn). The Chinese (EnCh, ChEn) and Vietnamese (EnVi) data come from the transliteration datasets of Haizhou et al. (2004) and the VNU-HCMUS dataset (Cao et al., 2010; Ngo et al., 2015), respectively. Hindi, Tamil, Kannada, Bangla (EnHi, EnTa, EnKa, EnBa), and Hebrew (EnHe, HeEn) are provided by Microsoft Research India. We do not evaluate our models on the dataset from the CJK Dictionary Institute as the data is not freely available for research purposes. We use 13 data sets for our experiments (Table 1). The data consists of genuine transliterations, back-translations, or both.
No other parallel or monolingual data are allowed for the constrained standard submissions that we participate in.

Evaluation
The quality of machine transliterations is evaluated with four automatic metrics in the shared task: word accuracy, mean F-score, mean reciprocal rank, and MAP_ref (Chen et al., 2018). As the main evaluation metric for our experiments we use word accuracy (Acc) on the top candidate:

$$\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \begin{cases} 1 & \text{if } \exists\, r_{i,j} : r_{i,j} = c_{i,1} \\ 0 & \text{otherwise} \end{cases}$$

The closer the value is to 1.0, the more top candidates $c_{i,1}$ are correct transliterations, i.e. they match one of the references $r_{i,j}$. $N$ is the total number of entries in a test set.
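As an illustration of the metric, the following minimal sketch computes word accuracy on the top candidate. The function name `word_accuracy` and the list-of-lists input layout are assumptions made for this example and are not part of the shared task tooling.

```python
from typing import Sequence


def word_accuracy(candidates: Sequence[Sequence[str]],
                  references: Sequence[Sequence[str]]) -> float:
    """Fraction of entries whose top candidate matches any reference.

    candidates[i] is the ranked candidate list for entry i,
    references[i] is the list of accepted transliterations for entry i.
    (Input layout is an assumption made for this sketch.)
    """
    if len(candidates) != len(references):
        raise ValueError("candidates and references must have the same length")
    if not candidates:
        raise ValueError("empty test set")
    correct = 0
    for cands, refs in zip(candidates, references):
        # Only the first (top-ranked) candidate c_{i,1} is considered.
        if cands and cands[0] in set(refs):
            correct += 1
    return correct / len(candidates)
```

For example, with two entries of which only the first has a matching top candidate, the function returns 0.5.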

Neural machine translation
Our machine transliteration system is based on a deep RNN-based attentional encoder-decoder model that consists of a bidirectional multi-layer encoder and decoder, both using GRUs as their RNN variants (Sennrich et al., 2017b). It utilizes the BiDeep architecture proposed by Miceli , which combines deep transitions with stacked RNNs. We employ the soft-attention mechanism (Bahdanau et al., 2014), and leave hard monotonic attention models (Aharoni and Goldberg, 2017) for future work. Layer normalization (Ba et al., 2016) is applied to all recurrent and feed-forward layers, except for layers followed by a softmax. We use weight tying between target and output embeddings (Press and Wolf, 2017). The model operates at the word level, and no special adaptation is made to the model architecture to support character-level transliteration, apart from data preprocessing (Section 4.1).

NMT techniques
Regularization Randomly dropping units from the neural network during training is an effective regularization method that prevents the model from overfitting (Srivastava et al., 2014).
For RNNs, Gal and Ghahramani (2016) proposed variational dropout over RNN inputs and states, which we adopt in our experiments. Following Sennrich et al. (2016a), we also drop out entire source and target words (characters in our case) with a given probability.
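Dropping entire tokens can be sketched as below. This is a simplified illustration under stated assumptions: the placeholder symbol `DROPPED` and the function `token_dropout` are hypothetical, and replacing a token with a placeholder only approximates what a toolkit may do internally (for instance, zeroing the token's embedding vector).

```python
import random
from typing import List, Optional

DROPPED = "<drop>"  # hypothetical placeholder symbol, not taken from the paper


def token_dropout(tokens: List[str], p: float,
                  rng: Optional[random.Random] = None) -> List[str]:
    """Replace each token independently with DROPPED with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    rng = rng or random.Random()
    # rng.random() is in [0, 1), so p=0 never drops and p=1 always drops.
    return [DROPPED if rng.random() < p else tok for tok in tokens]
```

The sequence length is preserved, so the positions of the remaining characters are unchanged; this would only be applied during training, never at decoding time.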
Model ensembling Model ensembling leads to consistent improvements for NMT (Sennrich et al., 2016a; Denkowski and Neubig, 2017). An ensemble of independent models usually outperforms an ensemble of different model checkpoints from a single training run as it results in more diverse models in the ensemble (Sennrich et al., 2017a). As an alternative to checkpoint ensembles, exponential smoothing of network parameters, which averages them over the entire training run, has been proposed.
We combine both methods and build ensembles of independently trained models with exponentially smoothed parameters.
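Exponential smoothing of parameters can be sketched as a running average that is updated after every training step. The update rule below is the standard exponential moving average and is an assumption about the exact form used; the names `Params` and `update_smoothed` are hypothetical.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # parameter name -> flat list of values


def update_smoothed(smoothed: Params, current: Params, alpha: float) -> None:
    """In-place update: smoothed <- (1 - alpha) * smoothed + alpha * current.

    alpha is the smoothing factor; a small alpha averages over many updates.
    """
    for name, values in current.items():
        avg = smoothed[name]
        for k, v in enumerate(values):
            avg[k] = (1.0 - alpha) * avg[k] + alpha * v
```

In this reading, the smoothed copy rather than the raw parameters is what gets saved and later combined with other independently trained models in an ensemble.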
Re-scoring with right-to-left models Re-scoring an n-best list of candidate translations obtained from one system with another system makes it possible to incorporate additional features into the model or to combine multiple different systems that cannot be easily ensembled. Sennrich et al. (2016a, 2017a) propose rescoring an NMT system with separate models trained on a reversed target side, which produce the target text from right to left. We adopt the following re-ranking technique: we first ensemble four standard left-to-right models to produce n-best lists of 20 transliteration candidates, then re-score them with two right-to-left models and re-rank.
Back-translation Monolingual data can be back-translated by a system trained on the reversed language direction to generate synthetic parallel corpora (Sennrich et al., 2016b). Additional training data can significantly improve an NMT system.
As the task is organized under a constrained setting and no data other than that provided by the organizers is allowed, we consider the English examples from all datasets as our monolingual data and use back-translations and "forward-translations" to enlarge the amount of parallel training data.
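The difference between back-translation and forward-translation can be sketched as follows. This is an illustrative reading, assuming that back-translations pair a synthetic source with a genuine English target (for systems into English) and that forward-translations pair a genuine English source with a synthetic target (for systems from English). The callable `en_to_x` stands in for any trained English-to-X transliteration system and is hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source, target)


def back_translate(english_names: Iterable[str],
                   en_to_x: Callable[[str], str]) -> List[Pair]:
    """Synthetic data for an X->En system: synthetic source, genuine target."""
    return [(en_to_x(name), name) for name in english_names]


def forward_translate(english_names: Iterable[str],
                      en_to_x: Callable[[str], str]) -> List[Pair]:
    """Synthetic data for an En->X system: genuine source, synthetic target."""
    return [(name, en_to_x(name)) for name in english_names]
```

In the first case the decoder is still trained on genuine English output, whereas in the second case it is trained on the system's own, possibly erroneous, output.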

Data preprocessing
We uppercase and tokenize all words into sequences of characters and treat the characters as words. Whitespaces are replaced by a special character so that word boundaries can be reconstructed after decoding.
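A minimal sketch of this preprocessing and its inverse is given below. The boundary symbol `SPACE` is an assumption (the text does not name the character used), and the sketch presumes that this symbol does not otherwise occur in the data.

```python
SPACE = "_"  # hypothetical boundary symbol; assumed absent from the raw names


def preprocess(name: str) -> str:
    """Uppercase a name and split it into space-separated characters."""
    return " ".join(SPACE if ch.isspace() else ch for ch in name.upper())


def postprocess(tokens: str) -> str:
    """Join decoded character tokens and restore word boundaries."""
    return "".join(tokens.split(" ")).replace(SPACE, " ")
```

For instance, `preprocess("New York")` yields `N E W _ Y O R K`, and `postprocess` maps that string back to `NEW YORK`.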
We use the training data provided in the NEWS 2018 shared task to create our training and validation sets, and the official development set as an internal test set. Each validation set consists of 500 randomly selected examples that are removed from the training data. If a named entity has alternative transliterations, we add them to the training data as separate examples with an identical source side. The number of training examples varies between 2,756 and 81,252 (Table 2).
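The data preparation described above might look like the following sketch. The function names, the fixed random seed, and the `(source, [targets])` input layout are assumptions; the text also does not state whether the validation examples are held out before or after alternatives are expanded.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def expand_alternatives(
        entries: Sequence[Tuple[str, Sequence[str]]]) -> List[Pair]:
    """Turn (source, [targets]) entries into one example per target."""
    return [(src, tgt) for src, targets in entries for tgt in targets]


def split_train_valid(pairs: Sequence[Pair], valid_size: int = 500,
                      seed: int = 0) -> Tuple[List[Pair], List[Pair]]:
    """Randomly hold out valid_size examples; return (train, valid)."""
    if valid_size > len(pairs):
        raise ValueError("valid_size exceeds the number of examples")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seed fixed only for repeatability
    return shuffled[valid_size:], shuffled[:valid_size]
```

Holding out whole entries before expansion would avoid sharing a source name between training and validation data; which variant was used is not specified in the text.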

Model architecture
We use the BiDeep model architecture for all systems. The model consists of 4 bidirectional alternating stacked encoders with 2-layer transition cells, and 4 stacked decoders with a transition depth of 4 in the base RNN of the stack and 2 in the higher RNNs. We augment it with layer normalization, skip connections, and parameter tying between all embeddings and the output layer. The RNN hidden state size is set to 1024 and the embedding size to 512. Source and target vocabularies are identical. The size of the vocabulary varies across language pairs and is determined by the number of unique characters in the training data.

Training settings
We limit the maximum input length to 80 characters during training. Variational dropout on all RNN inputs and states is set to 0.2, and source and target dropouts are set to 0.1. The exponential smoothing factor is set to 0.0001.
Optimization is performed with Adam (Kingma and Ba, 2014) with a mini-batch size fitted to 3GB of GPU memory. Models are validated and saved every 500 mini-batches. We stop training when the cross-entropy cost on the validation set fails to reach a new minimum for 5 consecutive validation steps. As the final model we choose the one that achieves the highest word accuracy on the validation set. We train with a learning rate of 0.003 and decrease it by a factor of 0.9 every time the validation score does not improve over the current best value. We do not change any training hyperparameters across languages.
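The stopping criterion and learning-rate schedule can be sketched as a small simulation over a sequence of validation costs. Two assumptions are made here: the learning rate is multiplied by 0.9 on each non-improving validation step, and the same cross-entropy cost drives both the decay and the stopping rule. The function `run_schedule` is hypothetical.

```python
from typing import Iterable, List, Tuple


def run_schedule(valid_costs: Iterable[float], lr: float = 0.003,
                 decay: float = 0.9,
                 patience: int = 5) -> Tuple[int, List[float]]:
    """Return (validation steps run, learning rate after each step)."""
    best = float("inf")
    stalled = 0  # consecutive validation steps without a new minimum
    lrs: List[float] = []
    steps = 0
    for cost in valid_costs:
        steps += 1
        if cost < best:
            best = cost
            stalled = 0
        else:
            stalled += 1
            lr *= decay  # assumption: the rate is multiplied by 0.9
        lrs.append(lr)
        if stalled >= patience:
            break  # early stopping
    return steps, lrs
```

With the costs `[1.0, 0.9, 0.95, 0.96, 0.97, 0.98, 0.99, 0.5]`, training stops after the seventh validation step, before the final improvement is ever seen, which illustrates the trade-off inherent in a fixed patience.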
Decoding is done by beam search with a beam size of 10. The scores for each candidate translation are normalized by sentence length.
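Length normalization and the re-ranking of an n-best list with right-to-left models (Section 3) can be sketched together as follows. The sketch assumes that log-probabilities from the left-to-right ensemble and from each right-to-left model are summed with equal weights before normalization; the actual combination may differ. The scorer callables are hypothetical stand-ins for trained models.

```python
from typing import Callable, List, Tuple


def length_normalized(log_prob: float, candidate: str) -> float:
    """Divide a log-probability by the number of tokens in the candidate."""
    return log_prob / max(len(candidate.split()), 1)


def rerank(nbest: List[Tuple[str, float]],
           r2l_scorers: List[Callable[[str], float]]
           ) -> List[Tuple[str, float]]:
    """Re-score (candidate, left-to-right log-prob) pairs and sort them.

    Each right-to-left scorer receives the candidate with its tokens
    reversed and returns a log-probability.
    """
    rescored = []
    for cand, l2r_log_prob in nbest:
        reversed_cand = " ".join(reversed(cand.split()))
        # Assumption: equal-weight sum of all model log-probabilities.
        total = l2r_log_prob + sum(s(reversed_cand) for s in r2l_scorers)
        rescored.append((cand, length_normalized(total, cand)))
    return sorted(rescored, key=lambda item: item[1], reverse=True)
```

A candidate ranked second by the left-to-right ensemble can move to the top when the right-to-left models assign it a clearly higher score.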

Results on the development set
We evaluate our methods on the official development set from the NEWS 2018 shared task (Table 3). Results for systems that do not use ensembles are averaged scores from four models.
Regularization with dropouts improves the word accuracy for all language pairs except English-Chinese. As expected, model ensembling brings significant and consistent gains. Re-ranking with right-to-left models is also an effective method for raising accuracy, even for languages for which a single right-to-left model is itself worse than a baseline left-to-right model, e.g. for EnHi, EnKa and EnHe systems.
The scale of the improvement for systems trained on additional synthetic data depends on the method that the synthetic examples are generated with: the systems into English benefit greatly from back-translations, while other systems that were supplied with forward-translations do not improve much or even slightly lose accuracy. (As monolingual English data we use, more specifically, the source side of the EnTh, EnPe, EnCh, EnVi, EnHi, EnTa, EnKa, EnBa, EnHe data sets, and the target side of the ThEn, PeEn, ChEn, HeEn data sets.)

Official results and conclusions
As the final systems submitted to the NEWS 2018 shared task we chose the ones that achieved the best performance on the development set (Table 3, last row). On the official test set, our systems are ranked first for most language pairs we experimented with.
The results show that the neural machine translation approach can be employed to build efficient machine transliteration systems achieving state-of-the-art results for multiple languages and providing strong baselines for future work.