Neural System Combination for Machine Translation

Neural machine translation (NMT) has become a new approach to machine translation and generates much more fluent results than statistical machine translation (SMT). However, SMT is usually better than NMT in translation adequacy. It is therefore a promising direction to combine the advantages of both NMT and SMT. In this paper, we propose a neural system combination framework leveraging multi-source NMT, which takes as input the outputs of NMT and SMT systems and produces the final translation. Extensive experiments on the Chinese-to-English translation task show that our model achieves a significant improvement of 5.3 BLEU points over the best single system output and 3.4 BLEU points over the state-of-the-art traditional system combination methods.


Introduction
Neural machine translation has significantly improved the quality of machine translation in recent years (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015; Junczys-Dowmunt et al., 2016a). Although most of its translations are more fluent than those of statistical machine translation (SMT) (Koehn et al., 2003; Chiang, 2005), NMT struggles with translation adequacy, especially for rare and unknown words. Additionally, it suffers from over-translation and under-translation to some extent (Tu et al., 2016). Compared to NMT, SMT, such as phrase-based machine translation (PBMT; Koehn et al., 2003) and hierarchical phrase-based machine translation (HPMT; Chiang, 2005), does not need to limit the vocabulary and can guarantee translation coverage of source sentences. NMT and SMT thus have different strengths and weaknesses. To take full advantage of both NMT and SMT, system combination can be a good choice.
Traditionally, system combination has been explored at the sentence level, phrase level, and word level (Kumar and Byrne, 2004; Feng et al., 2009; Chen et al., 2009). Among them, word-level combination approaches that adopt a confusion network for decoding have been quite successful (Rosti et al., 2007; Ayan et al., 2008; Freitag et al., 2014). However, these approaches are mainly designed for SMT and do not consider the features of NMT results. NMT tends to produce diverse words and free word order, which differ considerably from SMT output, and this makes it hard to construct a consistent confusion network. Furthermore, traditional system combination approaches cannot guarantee the fluency of the final translation results.
In this paper, we propose a neural system combination framework, which is adapted from the multi-source NMT model (Zoph and Knight, 2016). Different encoders are employed to model the semantics of the source language input and each best translation produced by different NMT and SMT systems. The encoders produce multiple context vector representations, from which the decoder generates the final output word by word. Since the same training data is used for NMT, SMT and neural system combination, we further design a smart strategy to simulate the real training data for neural system combination.
Specifically, we make the following contributions in this paper:
• We propose a neural system combination method, which is adapted from the multi-source NMT model and can accommodate both source inputs and different system translations. It combines the fluency of NMT and the adequacy (especially the ability to address rare words) of SMT.
• We design a strategy to construct appropriate training data for neural system combination.
• Extensive experiments on Chinese-English translation show that our model achieves a significant improvement of 3.4 BLEU points over the state-of-the-art system combination methods and 5.3 BLEU points over the best individual system output.

Neural Machine Translation
The encoder-decoder NMT with an attention mechanism (Bahdanau et al., 2015) softly aligns each decoder state with the encoder states and computes the conditional probability of the translation given the source sentence. The encoder is a bidirectional recurrent neural network with gated recurrent units (GRU) (Cho et al., 2014), which reads an input sequence $X = (x_1, x_2, \ldots, x_m)$ and encodes it into a sequence of hidden states $H = (h_1, h_2, \ldots, h_m)$.
The decoder is a recurrent neural network that predicts a target sequence $Y = (y_1, y_2, \ldots, y_n)$. Each word $y_j$ is predicted based on a recurrent hidden state $s_j$, the previously predicted word $y_{j-1}$, and a context vector $c_j$. The context vector $c_j$ is obtained as the weighted sum of the annotations $h_i$. We use the latest implementation of attention-based NMT.
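The weighted sum that yields the context vector can be sketched in plain Python. This is a minimal illustration, not the implementation used in the experiments: the function name `attention_context`, the list-based vectors, and the parameter names `W_a`, `U_a`, `v_a` are assumptions chosen to mirror the additive score $v_a^T \tanh(W_a s + U_a h_i)$ given later in the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]


def attention_context(prev_state, annotations, W_a, U_a, v_a):
    """Return (weights, context) for one decoding step.

    prev_state  -- decoder state used to query the encoder (hypothetical name)
    annotations -- list of encoder annotations h_i, one vector per source word
    """
    projected_state = matvec(W_a, prev_state)
    scores = []
    for h in annotations:
        projected_h = matvec(U_a, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(dot(v_a, hidden))  # additive (tanh) alignment score
    weights = softmax(scores)            # attention weights sum to 1
    dim = len(annotations[0])
    context = [sum(w * h[d] for w, h in zip(weights, annotations))
               for d in range(dim)]
    return weights, context
```

With all-zero projection matrices every annotation receives the same score, so the context reduces to the plain average of the annotations; learned parameters shift the weights towards the source words most relevant to the current decoder state.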

Neural System Combination for Machine Translation
Macherey and Och (2007) gave empirical evidence that the systems to be combined need to be almost uncorrelated in order for system combination to be beneficial. Since NMT and SMT are two kinds of translation models with large differences, we attempt to build a neural system combination model that can take advantage of the different systems.
Model: Figure 1 illustrates the neural system combination framework, which can take as input the source sentence and the results of MT systems. Here, we use MT results as inputs to detail the model.
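Formally, given the result sequences $Z = (Z_n, Z_p, Z_h)$ of three MT systems for the same source sentence and the previously generated target sequence $Y_{<j} = (y_1, y_2, \ldots, y_{j-1})$, the probability of the next target word $y_j$ is

$$p(y_j \mid Y_{<j}, Z) = \mathrm{softmax}\big(f(c_j, y_{j-1}, s_j)\big)$$

Here $f(\cdot)$ is a non-linear function, $y_{j-1}$ represents the word embedding of the previously predicted word, and $s_j$ is the state of the decoder at time step $j$, calculated by

$$\tilde{s}_{j-1} = \mathrm{GRU}_1(y_{j-1}, s_{j-1}), \qquad s_j = \mathrm{GRU}_2(\tilde{s}_{j-1}, c_j)$$

where $s_{j-1}$ is the previous hidden state and $\tilde{s}_{j-1}$ is an intermediate state. $c_j$ is the context vector of system combination obtained by the attention mechanism, which is computed as a weighted sum of the context vectors of the three MT systems, as illustrated in the middle part of Figure 1:

$$c_j = \sum_{k=1}^{K} \beta_{jk}\, c_{jk}$$

where $K$ is the number of MT systems, and $\beta_{jk}$ is a normalized item calculated as follows:

$$\beta_{jk} = \frac{\exp(\tilde{s}_{j-1} \cdot c_{jk})}{\sum_{k'=1}^{K} \exp(\tilde{s}_{j-1} \cdot c_{jk'})}$$

Here, we calculate the $k$th MT system context $c_{jk}$ as a weighted sum of the source annotations:

$$c_{jk} = \sum_{i=1}^{m} \alpha^{k}_{ji}\, h_i$$

where $h_i$ is the annotation of $z_i$ from a bi-directional GRU, and its weight $\alpha^{k}_{ji}$ is computed by

$$\alpha^{k}_{ji} = \frac{\exp(e_{ji})}{\sum_{l=1}^{m} \exp(e_{jl})}$$

where $e_{ji} = v_a^{T} \tanh(W_a \tilde{s}_{j-1} + U_a h_i)$ scores how well $\tilde{s}_{j-1}$ and $h_i$ match.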
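The second level of attention, which merges the per-system context vectors into a single combination context, can be sketched as follows. This is an illustrative sketch only: the function name `combine_contexts` is hypothetical, vectors are plain Python lists, and scoring each system context by its dot product with the intermediate decoder state is an assumption of this sketch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_contexts(inter_state, system_contexts):
    """Merge K per-system context vectors into one combination context.

    inter_state     -- intermediate decoder state (same dimension as contexts)
    system_contexts -- list of K context vectors, one per MT system
    Returns (betas, combined), where betas are the normalized system weights.
    """
    # Assumed scoring: dot product between the state and each system context.
    scores = [sum(a * b for a, b in zip(inter_state, c))
              for c in system_contexts]
    betas = softmax(scores)  # normalized over the K systems
    dim = len(system_contexts[0])
    combined = [sum(b * c[d] for b, c in zip(betas, system_contexts))
                for d in range(dim)]
    return betas, combined
```

Because the weights are recomputed at every decoding step, the decoder can lean on one system for a rare word and on another for the surrounding function words.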
Training Data Simulation: The neural system combination framework should be trained on the outputs of multiple translation systems and the gold target translations. To keep training and testing consistent, we design a strategy to simulate the real scenario. We randomly divide the training corpus into two parts, then reciprocally train each MT system on one half and use it to translate the source sentences of the other half. In this way, both the MT translations and the gold target references are available for every source sentence in the training corpus.
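The reciprocal split described above can be sketched as follows. The names `simulate_combination_data`, `train_fn`, and `translate_fn` are hypothetical placeholders for whatever training and decoding routines an MT system exposes; the sketch only shows the data flow, in which no sentence is ever translated by a model that saw it during training.

```python
import random


def simulate_combination_data(pairs, train_fn, translate_fn, seed=0):
    """Build (source, system_output, reference) triples for one MT system.

    pairs        -- list of (source, reference) sentence pairs
    train_fn     -- hypothetical callable: list of pairs -> trained model
    translate_fn -- hypothetical callable: (model, source) -> translation
    """
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)          # random division of the corpus
    mid = len(shuffled) // 2
    halves = (shuffled[:mid], shuffled[mid:])

    examples = []
    # Train on one half, translate the other, then swap the roles.
    for train_half, held_out in (halves, halves[::-1]):
        model = train_fn(train_half)
        for source, reference in held_out:
            hypothesis = translate_fn(model, source)
            examples.append((source, hypothesis, reference))
    return examples
```

Running this once per MT system (PBMT, HPMT, NMT) with the same seed would give aligned triples from which the multi-source training examples can be assembled.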

Experiments
We perform our experiments on the Chinese-English translation task. The MT systems participating in system combination are PBMT, HPMT and NMT. The evaluation metric is case-insensitive BLEU (Papineni et al., 2002).

Data preparation
Our training data consists of 2.08M sentence pairs extracted from LDC corpora. We use the NIST 2003 Chinese-English dataset as the validation set and the NIST 2004-2006 datasets as test sets. We list all the translation methods as follows:
• PBMT: It is the state-of-the-art phrase-based SMT system. We use its default setting and train a 4-gram language model on the target portion of the bilingual training data.
• HPMT: It is a hierarchical phrase-based SMT system, which, like PBMT, uses its default configuration in Moses.
• NMT: It is an attention-based NMT system, with the same setting as given in Section 2.

Training Details
The hyper-parameters used in our neural combination system are described as follows. We limit both the Chinese and English vocabularies to 30k in our experiments. The word embedding dimension is 500 for all source and target words. The network parameters are updated with the Adadelta algorithm. We adopt beam search with beam size b=10 at test time.
For the confusion-network-based system Jane, we use its default configuration and train a 4-gram language model on the target data and the 10M Xinhua portion of the Gigaword corpus.

Figure 2: Translation results (RIBES score) for different machine translation and system combination methods.

Main Results
We compare our neural combination system with the best individual engines and with the state-of-the-art traditional combination system Jane (Freitag et al., 2014). Table 1 shows the BLEU scores of different models on development data and test data. The BLEU score of the multi-source neural combination model is 2.53 points higher than that of the best single model, HPMT. Adding the source language input gives a further improvement of +1.12 BLEU points.
As shown in Table 1, Jane outperforms the best single MT system by 1.92 BLEU points. However, our neural combination system with the source language input improves over Jane by 1.67 BLEU points. Furthermore, augmenting our neural combination system with ensemble decoding leads to another significant boost of +1.69 BLEU points.
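As a quick consistency check, and assuming the gains quoted above are averages over the same test sets and therefore additive, they reproduce the headline improvements stated in the abstract:

```latex
\begin{align*}
\text{over Jane:} \quad & 1.67 + 1.69 = 3.36 \approx 3.4 \\
\text{over the best single system:} \quad & 1.92 + 1.67 + 1.69 = 5.28 \approx 5.3
\end{align*}
```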

Word Order of Translation
Table 3: Translation results (BLEU score) when we replace original NMT with strong E-NMT, which uses ensemble strategy with four NMT models. All results of system combination are based on strong outputs of E-NMT.

We evaluate word order with the automatic evaluation metric RIBES (Isozaki et al., 2010), which is based on rank correlation coefficients combined with word precision. RIBES is known to correlate more strongly with human judgements than BLEU for English, as discussed in Isozaki et al. (2010). Figure 2 shows the RIBES scores, which demonstrate that our neural combination model outperforms both the best single MT system and Jane. Additionally, although the BLEU score of Jane is higher than that of the single NMT system, its word order is worse in terms of RIBES.
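To make the rank-correlation idea concrete, the sketch below computes a normalized Kendall's tau over word positions. It is a simplified illustration rather than the full RIBES metric: it assumes the hypothesis words have already been aligned to reference positions, and it omits the word-precision term. The function name `normalized_kendall_tau` is hypothetical.

```python
from itertools import combinations


def normalized_kendall_tau(ref_positions):
    """Fraction of word pairs that appear in the same order as the reference.

    ref_positions -- for each aligned hypothesis word, in hypothesis order,
                     the index of the matching word in the reference.
    Returns a value in [0, 1]; 1.0 means the word order is fully preserved.
    """
    pairs = list(combinations(ref_positions, 2))
    if not pairs:
        return 0.0  # convention of this sketch: fewer than two aligned words
    concordant = sum(1 for earlier, later in pairs if earlier < later)
    return concordant / len(pairs)
```

A hypothesis that preserves the reference order scores 1.0, a fully reversed one scores 0.0, and a single local swap such as positions `[1, 0, 2]` keeps two of the three pairs in order.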

Rare and Unknown Words Translation
It is difficult for NMT systems to handle rare words, because the neural network cannot learn reliable translation mappings for words that occur with low frequency in the training data. SMT, in contrast, does not need to limit its vocabulary and is often able to translate rare words that appear in the training data. As shown in Table 2, our proposed model produces 137 fewer unknown words than the original NMT model.
Table 4 shows an example of system combination. The Chinese word zuzhiwang is out-of-vocabulary (OOV) for NMT, and the baseline NMT cannot translate it correctly. Although PBMT and HPMT translate this word well, their outputs do not conform to the grammar. By combining the merits of NMT and SMT, our model gets the correct translation.

Effect of Ensemble Decoding
Table 4: Translation examples of single system, Jane and our proposed model.
Source      海珊 也 与 恐怖 组织网 建立 了 联系 。
Pinyin      haishan ye yu kongbu zuzhiwang jianli le lianxi 。
Reference   hussein has also established ties with terrorist networks .
PBMT        hussein also has established relations and terrorist group .
HPMT        hussein also and terrorist group established relations .
NMT         hussein also established relations with UNK .
Jane        hussein also has established relations with .
Multi       hussein also has established relations with the terrorist group .

The performance of the candidate systems is very important to the result of system combination, so we use an ensemble strategy with four NMT models to improve the performance of the original NMT system. As shown in Table 3, the E-NMT with the ensemble strategy outperforms the original NMT system by +1.40 BLEU points and becomes the best of all the MT systems, +0.68 BLEU points higher than HPMT. After replacing the original NMT with the strong E-NMT, Jane improves on its original result by +0.45 BLEU points, and our model gets an improvement of +3.08 BLEU points over Jane. These experiments further demonstrate that our proposed model is effective and robust for system combination.
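One common way to realize an ensemble at decoding time is to average the next-word distributions of the individual models at each step. The sketch below illustrates that idea only; whether the experiments use arithmetic averaging is not stated here, so this is an assumption, and the function name `ensemble_next_word` is hypothetical.

```python
def ensemble_next_word(distributions):
    """Average the next-word distributions of several models.

    distributions -- non-empty list of dicts mapping word -> probability,
                     one dict per model, all for the same decoding step.
    Returns (best_word, averaged_distribution).
    """
    vocab = set().union(*distributions)  # every word proposed by any model
    n_models = len(distributions)
    averaged = {
        word: sum(d.get(word, 0.0) for d in distributions) / n_models
        for word in vocab
    }
    best_word = max(averaged, key=averaged.get)
    return best_word, averaged
```

In a full decoder the averaged distribution would feed the beam search at every time step, so that a word must be plausible to most models in order to survive.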

Related Work
The recently proposed neural machine translation has drawn more and more attention. Most of the existing approaches and models mainly focus on designing better attention models (Luong et al., 2015a; Mi et al., 2016a,b; Tu et al., 2016; Meng et al., 2016), better strategies for handling rare and unknown words (Luong et al., 2015b; Zhang and Zong, 2016a), exploiting large-scale monolingual data (Sennrich et al., 2016a; Zhang and Zong, 2016b), and integrating SMT techniques (Shen et al., 2016; Junczys-Dowmunt et al., 2016b). Our focus in this work is to take advantage of both NMT and SMT through system combination, which attempts to find consensus translations among different machine translation systems. In the past several years, word-level, phrase-level and sentence-level system combination methods have been well studied (Bangalore et al., 2001; Rosti et al., 2008; Li and Zong, 2008; Heafield and Lavie, 2010; Freitag et al., 2014; Ma and Mckeown, 2015; Zhu et al., 2016) and have reported state-of-the-art performance on benchmarks for SMT. Here, we propose a neural system combination model that combines the advantages of NMT and SMT efficiently.
Recently, Niehues et al. (2016) used phrase-based SMT to pre-translate the inputs into target translations; an NMT system then generates the final hypothesis using the pre-translation. Moreover, multi-source MT has proven very effective for combining multiple source languages (Och and Ney, 2001; Zoph and Knight, 2016; Firat et al., 2016a,b; Garmash and Monz, 2016). Unlike previous work, we adapt multi-source NMT for system combination and design a strategy to simulate the real training data for our neural system combination.

Conclusion and Future Work
In this paper, we propose a novel neural system combination framework for machine translation. The central idea is to take advantage of NMT and SMT by adapting the multi-source NMT model. The neural system combination method can not only combine the fluency of NMT with the adequacy of SMT, but can also accommodate the source sentences as input. Furthermore, our approach can use ensemble decoding to boost performance further compared to traditional system combination methods. Experiments on Chinese-English datasets show that our approaches obtain significant improvements over the best individual system and the state-of-the-art traditional system combination methods. In future work, we plan to encode n-best translation results to further improve the system combination quality. Additionally, it would be interesting to extend this approach to other tasks such as sentence compression and text abstraction.