Neural Network Transduction Models in Transliteration Generation

In this paper we examine the effectiveness of neural network sequence-to-sequence transduction in the task of transliteration generation. In this year’s shared evaluation we submitted two systems into all tasks. The primary system was based on the system used for the NEWS 2012 workshop, but was augmented with an additional feature which was the generation probability from a neural network. The secondary system was the neural network model used on its own together with a simple beam search algorithm. Our results show that adding the neural network score as a feature into the phrase-based statistical machine transliteration system was able to increase the performance of the system. In addition, although the neural network alone was not able to match the performance of our primary system (which exploits it), it was able to deliver a respectable performance for most language pairs which is very promising considering the recency of this technique.


Introduction
Our primary system for the NEWS shared evaluation on transliteration generation is based on the system entered into the 2012 evaluation (Finch et al., 2012) which in turn was a development of the 2011 system (Finch et al., 2011).
The system is based around the application of phrase-based statistical machine translation (PBSMT) techniques to the task of transliteration, as in (Finch and Sumita, 2008). The system differs from a typical phrase-based machine translation system in a number of important respects:
• Characters rather than words are used as the atomic elements in the transductive process.
• The generative process is constrained to be monotonic. No re-ordering model is used.
• The alignment process is constrained to be monotonic.
• A non-parametric Bayesian aligner is used instead of GIZA++ and extraction heuristics, to provide a joint alignment/phrase pair induction process.
• The log-linear weights are tuned towards the F-score evaluation metric used in the NEWS evaluation, rather than a machine translation oriented score such as BLEU (Papineni et al., 2001).
• A bilingual language model (Li et al., 2004) is used as a feature during decoding.
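An n-best list of hypotheses from the PBSMT system outlined above was then re-scored using the following set of models:
• A maximum entropy model (described in detail in (Finch et al., 2011)).
• A target RNN language model.
• A bilingual RNN language model.
• A neural network transliteration model.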
The re-scoring was done by extending the log-linear model of the PBSMT system with these 4 additional features. The weights for these features were tuned to maximize F-score in a second tuning step.
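As an illustration of this re-scoring step, the following minimal sketch extends a base PBSMT score with additional feature scores in a log-linear model and re-ranks an n-best list. The feature names, weights and scores are hypothetical placeholders chosen for the example; they are not values from our system, and the function names are not part of any toolkit.

```python
from typing import Dict, List, Tuple

Hypothesis = Tuple[str, Dict[str, float]]


def loglinear_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of log-domain feature scores; unweighted features contribute 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def rescore(nbest: List[Hypothesis], weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Score every hypothesis and return them sorted from best to worst."""
    scored = [(hyp, loglinear_score(feats, weights)) for hyp, feats in nbest]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical weights: one for the base system plus four re-scoring features.
weights = {
    "pbsmt": 1.0,
    "maxent": 0.5,
    "rnn_target_lm": 0.5,
    "rnn_joint_lm": 0.5,
    "nn_translit": 1.0,
}

# Hypothetical 2-best list; all scores are log-domain values (higher is better).
nbest = [
    ("hyp_a", {"pbsmt": -2.0, "maxent": -1.0, "rnn_target_lm": -1.0,
               "rnn_joint_lm": -1.0, "nn_translit": -4.0}),
    ("hyp_b", {"pbsmt": -2.5, "maxent": -1.0, "rnn_target_lm": -1.0,
               "rnn_joint_lm": -1.0, "nn_translit": -1.0}),
]

ranked = rescore(nbest, weights)
```

In this toy example hyp_a has the better PBSMT score, but hyp_b is ranked first once the additional features are taken into account.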
The novel aspect of our system in this year's evaluation is the use of a neural network that is capable of performing the entire transductive process. Neural networks capable of sequence-to-sequence transduction where the sequences are of different lengths (Hermann and Blunsom, 2013; Cho et al., 2014a; Bahdanau et al., 2014) are a very recent development in the field of machine translation. We believe this type of approach ought to be well suited to transliteration, a task strongly related to machine translation but with typically much smaller vocabulary sizes, no problems related to reordering, and in most cases no issues relating to out-of-vocabulary words (characters in our case). On the other hand, it is generally believed (for example (Ellis and Morgan, 1999)) that neural networks can require large amounts of data in order to train effective models. The data set sizes available in this shared evaluation are quite small, and this lack of data may have caused problems for the neural networks employed.
In all our experiments we have taken a strictly language-independent approach. Each of the language pairs was processed automatically from the character sequence representation supplied for the shared tasks, with no language-specific treatment for any of the language pairs.

Non-parametric Bayesian Alignment
To train the joint source-channel model(s) in our system, we perform a many-to-many sequence alignment. To discover this alignment we use the Bayesian non-parametric technique described in (Finch and Sumita, 2010). Bayesian techniques typically build compact models with few parameters that do not overfit the data and have been shown to be effective for transliteration (Finch and Sumita, 2010; Finch et al., 2011).

Phrase-based SMT Models
The decoding was performed using a specially modified version of the OCTAVIAN decoder (Finch et al., 2007), an in-house multistack phrase-based decoder.
The PBSMT component of the system was implemented as a log-linear combination of 4 different models: a joint source-channel model; a target language model; a character insertion penalty model; and a character sequence pair insertion penalty model. The following sections describe each of these models in detail. Due to the small size of many of the data sets in the shared tasks, we used all of the data to build models for the final systems.

N-gram joint source-channel model
The n-gram joint source-channel model used during decoding by the SMT decoder was trained from the Viterbi alignment arising from the final iteration (30 iterations were used) of the Bayesian segmentation process on the training data. We used the MIT language modeling toolkit (Bo-june et al., 2008) with modified Kneser-Ney smoothing to build this 5-gram model.
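To make the training data of the joint source-channel model concrete, the sketch below converts one aligned transliteration pair into a sequence of bilingual tokens and collects n-gram counts over that sequence. The example alignment, the "source|target" token format and the helper names are illustrative assumptions; the actual model was built with the MIT language modeling toolkit, not with code like this, and no smoothing is shown.

```python
from collections import Counter
from typing import List, Tuple


def to_joint_tokens(alignment: List[Tuple[str, str]]) -> List[str]:
    """Turn each aligned (source, target) character sequence pair into one token."""
    return [f"{src}|{tgt}" for src, tgt in alignment]


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count n-grams over a token sequence padded with sentence boundary markers."""
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    return Counter(tuple(padded[i:i + n]) for i in range(len(padded) - n + 1))


# Hypothetical many-to-many alignment for a single English-Katakana pair.
alignment = [("to", "トー"), ("kyo", "キョー")]

tokens = to_joint_tokens(alignment)   # ["to|トー", "kyo|キョー"]
bigrams = ngram_counts(tokens, n=2)   # bigram counts over bilingual tokens
```

An n-gram language model estimated over such token sequences assigns a probability to a whole derivation, which is how the bilingual model can be used as a feature during decoding.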

N-gram target language model
The target language model was trained on the target side of the training data. We used the MIT language modeling toolkit with Kneser-Ney smoothing to build this 5-gram model.

Insertion penalty models
Both character-based and character-sequence-pair-based insertion penalty models are simple models that add a constant value to their score each time a character (or character sequence pair) is added to the target hypotheses. These models control the tendency both of the joint source-channel model and the target language model to encourage derivations that are too short.
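A minimal sketch of the two penalty features follows, assuming a derivation is represented as a list of (source, target) character sequence pairs. The representation, the function name and the constants are assumptions made for illustration only.

```python
from typing import List, Tuple


def insertion_penalties(
    derivation: List[Tuple[str, str]],
    per_char: float = 1.0,
    per_pair: float = 1.0,
) -> Tuple[float, float]:
    """Return (character penalty, sequence-pair penalty) for one derivation.

    Each feature grows by a constant for every target character, or for every
    character sequence pair, that the derivation adds to the hypothesis.
    """
    char_penalty = per_char * sum(len(tgt) for _, tgt in derivation)
    pair_penalty = per_pair * len(derivation)
    return char_penalty, pair_penalty


# Hypothetical derivation with two sequence pairs and five target characters.
derivation = [("to", "トー"), ("kyo", "キョー")]
char_penalty, pair_penalty = insertion_penalties(derivation)
```

Because both values enter the log-linear model with tuned weights, the tuning step can reward longer outputs or derivations with more pairs, which counteracts the preference of the probabilistic models for short derivations.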

Overview
The system has a separate re-scoring stage that, like the SMT models described in the previous section, is implemented as a log-linear model. The log-linear weights are trained using the same MERT (Och, 2003) procedure. In principle, the weights for the models in this stage could be trained in a single step together with the SMT weights (Finch et al., 2011). However, the models in this stage are computationally expensive, and to reduce training time we train their weights in a second step. The four models used to re-score the 20-best list are described in the following sections.

Maximum-entropy model
The maximum entropy model used for re-scoring embodies a set of character and character-sequence based features designed to take the local context of source and target characters and character sequences into account; the reader is referred to (Finch et al., 2011) for a full description of this model.

RNN Language models
We introduce two RNN language models (Mikolov et al., 2011) into the re-scoring step of our system. The first model is a language model over character sequences in the target language; the second model is a joint source-channel model over bilingual character sequence pairs. These models were trained on the same data as their n-gram counterparts described in Sections 2.2.1 and 2.2.2. The models were trained using the training procedure described in (Finch et al., 2012).

Neural network transliteration model
The neural network transliteration model was trained directly from the source and target sequences themselves. The model used in tuning was trained only on the training data set; the model used for the final submission was trained on all of the data. The neural network software was developed using the GroundHog neural machine translation toolkit (Cho et al., 2014b), built on top of Theano (Bergstra et al., 2010; Bastien et al., 2012). For all of the experiments we used the same neural network architecture, which was the default architecture supplied with the toolkit. That is, we used networks of 1000 hidden units and used the RNNSearch technique reported in (Bahdanau et al., 2014). In a set of pilot experiments we evaluated a number of neural network models with fewer parameters on development data, under the hypothesis that these would be more suitable for the task of transliteration. However, the best results came from the default set of parameters, and therefore these were used in all runs. Due to the resources required to train the neural network models, only a few experiments could be performed, and only on the English-Katakana task. It may be the case that different architectures could lead to significantly higher performance than the results we obtained, and this remains an area for future research. The neural networks were trained for 50,000 iterations, based on an analysis of the convergence of the performance on development data of a network trained on the English-Katakana task. The models took from 1 to 9 days to train, depending on the language pair, on a single core of a Tesla K40 GPU.

Parameter Tuning
The exponential log-linear model weights of both the SMT and re-scoring stages of our system were set by tuning the system on development data using the MERT procedure (Och, 2003) by means of the publicly available ZMERT toolkit (Zaidan, 2009). The systems reported in this paper used a metric based on the word-level F-score, an official evaluation metric for the shared tasks (Zhang et al., 2012), which measures the relationship of the longest common subsequence of the transliteration pair to the lengths of both source and target sequences.
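The following sketch shows one way an F-score of this kind can be computed for a single candidate against a single reference. It assumes that precision and recall are the length of the longest common subsequence (LCS) divided by the candidate and reference lengths respectively, and that F is their harmonic mean. The official definition is given in the shared task description (Zhang et al., 2012) and may differ in detail, for example in how the LCS is derived or how multiple references are handled; the function names here are our own.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_f_score(candidate: str, reference: str) -> float:
    """Harmonic mean of LCS-based precision and recall (assumed definition)."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)
```

Unlike top-1 accuracy, a score of this form gives partial credit to candidates that are close to the reference, which makes it a smoother objective for weight tuning.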

Evaluation Results
The official scores for our system are given in Table 1. It is interesting to compare the results of the 2012 system with the results from this year's primary submission on the 2012 test set, since these results show the effect of adding the neural network transliteration scores into the re-scorer. In 11 out of 14 of the runs, the system's performance was improved, and for some language pairs, notably En-He, En-Hi, En-Ka, En-Pe, En-Ta, En-Th, Th-En and Jn-Jk, the improvement was substantial. Using the neural network model scores was ineffective for Ar-En, Ch-En and En-Ch. The Ar-En result was surprising, as the training corpus size for this task was considerably larger than for any other task, and we expected this to benefit the neural network approach. Overall, however, it is clear from the results that the neural network re-scoring was very effective and the effect was considerably greater than that from the RNN re-scoring models introduced in the 2012 system.
The results on the Jn-Jk task were surprising. The neural network transliteration system alone produced very low accuracy scores, but when used in combination with the PBSMT system gave a 9.7% increase in top-1 accuracy. One particular characteristic of this data set is the disparity in length between the sequences; kanji sequences were very short whereas the romanized form was much longer. Visual inspection of the output from the direct neural network transliteration showed that the output sequences derived from the roman character sequences, but were too long. When integrated with the PBSMT system, output sequences of this form were not a problem as they were rarely generated as candidates for re-scoring. We conducted two experiments in the reverse direction from Jk to Jn. The first was based on a neural network transliteration system from character to character in the same manner as the secondary submission. The second system was a neural network that transduced from character to character sequence. We used a 1-to-many sequence alignment induced by the Bayesian aligner to train this model. The character-to-character system had a top-1 accuracy of 0.245; the character-to-character-sequence system had a top-1 accuracy of 0.305. These results indicate that the neural network is capable of generating long sequences from short sequences with reasonably high accuracy, and that there may be something to be gained by using phrasal units in the neural network transduction process, as was the case when moving from word-based models to phrase-based models in machine translation.

Conclusion
The system used for this year's shared evaluation was implemented within a phrase-based statistical machine translation framework augmented by a bilingual language model trained from a many-to-many alignment from a non-parametric Bayesian aligner. The system had a re-scoring step that integrated features from a maximum entropy model, a target RNN language model, a bilingual RNN language model, and a neural network transliteration model.
Our results showed that the neural network transliteration model was a very effective component in the re-scoring stage of our system that substantially improved the performance of our system over the 2012 system for most language pairs. Furthermore, the neural network transliterator was a capable system in its own right on most of the tasks, and equaled or exceeded the performance of our 2012 system on 3 language pairs. These results are particularly impressive considering that this line of research is relatively new, and we believe neural network transliteration models will have a bright future in this field.