Agreement on Target-bidirectional Neural Machine Translation

Neural machine translation (NMT) with recurrent neural networks has proven to be an effective technique for end-to-end machine translation. However, in spite of its promising advances over traditional translation methods, it typically suffers from an issue of unbalanced outputs, which arises both from the nature of recurrent neural networks themselves and from the challenges inherent in machine translation. To overcome this issue, we propose an agreement model for neural machine translation and show its effectiveness on large-scale Japanese-to-English and Chinese-to-English translation tasks. Our results show the model can achieve improvements of up to 1.4 BLEU points over the strongest baseline NMT system. With the help of an ensemble technique, this new end-to-end NMT approach outperformed phrase-based and hierarchical phrase-based Moses baselines by up to 5.6 BLEU points.


Introduction
Recurrent neural networks (RNNs) have achieved great success on several structured prediction tasks (Graves, 2013; Watanabe and Sumita, 2015; Dyer et al., 2015), in which RNNs are required to make a sequence of dependent predictions. One of their advantages is that an unbounded history is available to enrich the context for the prediction at the current time-step.
Despite these successes, it was recently pointed out that RNNs suffer from a fundamental issue of generating unbalanced outputs: that is to say, the suffixes of their outputs are typically worse than the prefixes. This is because later predictions directly depend on the accuracy of previous predictions. The issue was empirically demonstrated on two simple sequence-to-sequence learning tasks: machine transliteration and grapheme-to-phoneme conversion.
On the more general sequence-to-sequence learning task of machine translation (MT), neural machine translation (NMT) based on RNNs has recently become an active research topic (Sutskever et al., 2014; Bahdanau et al., 2014). Compared to those two simple tasks, MT involves a much larger vocabulary and frequent reordering between input and output sequences, which makes the prediction at each time-step far more challenging. In addition, sequences in MT are much longer: their average length of 36.7 is about 5 times that in grapheme-to-phoneme conversion. Therefore, we believe that the history is more likely to contain incorrect predictions and that the issue of unbalanced outputs may be more serious. This hypothesis is supported later (see Table 1 in §4.1) by an analysis showing that the quality of the prefixes of translation hypotheses is much higher than that of the suffixes.
To address this issue for NMT, in this paper we extend the previously proposed agreement model to the task of machine translation. Its key idea is to encourage agreement between a pair of target-directional (left-to-right and right-to-left) NMT models in order to produce more balanced translations and thus improve the overall translation quality. Our contribution is two-fold:
• We introduce a simple and general method to address the issue of unbalanced outputs for NMT (§3). This method is robust, has no extra hyperparameters to tune, and is easy to implement. In addition, it is general enough to be applied on top of any existing RNN translation model, although in this paper it was implemented on top of the model in (Bahdanau et al., 2014).
• We provide an empirical evaluation of the technique on large-scale Japanese-to-English and Chinese-to-English translation tasks. The results show that our model generates more balanced translations and achieves substantial improvements (of up to 1.4 BLEU points) over the strongest NMT baseline (§4).

Overview of Neural Machine Translation
Suppose $x = x_1, x_2, \cdots, x_m$ denotes a source sentence and $y = y_1, y_2, \cdots, y_n$ denotes a target sentence. In addition, let $x_{<t} = x_1, x_2, \cdots, x_{t-1}$ denote a prefix of $x$. Neural machine translation (NMT) directly maps a source sentence into a target sentence within a probabilistic framework. Formally, it defines a conditional probability over a pair of sequences $x$ and $y$ via a recurrent neural network as follows:
$$p(y \mid x; \theta) = \prod_{t=1}^{n} p(y_t \mid y_{<t}, x; \theta) = \prod_{t=1}^{n} \mathrm{softmax}\big(g(h_t)\big)[y_t] \qquad (1)$$
where $\theta$ is the set of model parameters; $h_t$ denotes a hidden state (i.e., a vector) of $y$ at time-step $t$; $g$ is a transformation function from a hidden state to a vector whose dimension is the target-side vocabulary size; softmax is the softmax function; and $[i]$ denotes the $i$-th component of a vector (see footnote 2). Furthermore, $h_t = f(h_{t-1}, c(x, y_{<t}))$ is defined by a recurrent function over both the previous hidden state $h_{t-1}$ and the context $c(x, y_{<t})$. Note that both $h_t$ and $c(x, y_{<t})$ have dimension $d$ for all $t$.
Footnote 1: The absolute gains of our model can be expected to increase further by applying the well-known techniques in (Jean et al., 2015; Luong et al., 2015) that address the problems presented by unknown words, but these techniques are beyond the scope of this paper.
Footnote 2: In that sense, $y_t$ in Eq. (1) also denotes the index of this word in its vocabulary.
In this paper, we develop our model on top of the neural machine translation approach of (Bahdanau et al., 2014), and we refer the reader to that paper for a complete description of the model, for example, the definitions of f and c. The proposed method could just as easily have been implemented on top of any other RNN model, such as that in (Sutskever et al., 2014).
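As a minimal sketch of the factorization described in this section, the following Python snippet scores a target sequence as a sum over time-steps of log-softmax probabilities. The function names and the `step_scores` input (a stand-in for the vectors g(h_t), which a real system would compute with the recurrent functions f and c) are assumptions made for illustration and are not part of the original model code.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def sequence_log_prob(step_scores, target_ids):
    """Return log p(y | x) as the sum over t of log softmax(g(h_t))[y_t].

    step_scores[t] is a hypothetical stand-in for g(h_t): one score per
    target vocabulary entry. target_ids[t] is the vocabulary index of y_t.
    """
    if len(step_scores) != len(target_ids):
        raise ValueError("need exactly one score vector per target word")
    return sum(log_softmax(s)[y] for s, y in zip(step_scores, target_ids))
```

Working in log space turns the product over time-steps into a sum, which avoids numerical underflow on long sentences.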

Agreement on Target-bidirectional NMT
In this section, we extend the previously proposed agreement method to address the issue of unbalanced outputs for NMT. The key idea is to: 1) train two kinds of NMT models, one generating target words from left to right and the other from right to left; and 2) encourage agreement between them through joint search.

Training
The training objective function for our agreement (or joint) model is formalized as follows:
$$\ell(\theta_1, \theta_2) = -\sum_{\langle x, y \rangle} \Big( \log p(y \mid x; \theta_1) + \log p(y^r \mid x; \theta_2) \Big)$$
where $y^r = y_n, y_{n-1}, \cdots, y_1$ is the reverse of sequence $y$; $p(y \mid x; \theta_1)$ denotes the left-to-right model with parameters $\theta_1$, while $p(y^r \mid x; \theta_2)$ denotes the right-to-left model with parameters $\theta_2$, as defined in Eq. (1); and $\langle x, y \rangle$ ranges over a given training dataset. Following (Bahdanau et al., 2014), we employ AdaDelta (Zeiler, 2012) to minimize the loss $\ell$.
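The joint objective described above can be sketched in a few lines of Python. This is an illustrative sketch only: `joint_loss`, `log_prob_l2r` and `log_prob_r2l` are hypothetical names, and the two scoring callables are assumed to return log p(y | x; θ1) and log p(y^r | x; θ2) respectively.

```python
def joint_loss(dataset, log_prob_l2r, log_prob_r2l):
    """Negative joint log-likelihood summed over (x, y) training pairs.

    log_prob_l2r(x, y) scores y in its natural left-to-right order.
    log_prob_r2l(x, y_rev) scores the reversed target sequence.
    Both callables are assumptions standing in for trained NMT models.
    """
    total = 0.0
    for x, y in dataset:
        y_rev = list(reversed(y))  # y^r: the target read right to left
        total -= log_prob_l2r(x, y) + log_prob_r2l(x, y_rev)
    return total
```

If the two models share no parameters, the sum splits into one term per direction, so under that assumption each model can be trained separately on its own term.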
Note that, in parallel to our efforts, Cheng et al. (2016) have explored an agreement idea for NMT that is close to ours. However, unlike their work on agreement between the source and target sides, in the spirit of the general idea in (Liang et al., 2006), we focus on agreement between the left and right directions on the target side, which is oriented to an issue inherent in NMT itself. Although our model is orthogonal to theirs, one of its advantages is that it does not rely on any additional hyperparameters to encourage agreement, given that tuning such hyperparameters for NMT is too costly.

Approximate Joint Search
Given a source sentence $x$ and model parameters $\theta_1$, $\theta_2$, decoding can be formalized as follows:
$$\hat{y} = \operatorname*{argmax}_{y} \; p(y \mid x; \theta_1) \times p(y^r \mid x; \theta_2)$$
Performing an exact search is NP-hard, as pointed out in the work we build on, and so we adapt one of the approximate search methods proposed there to the machine translation scenario. The basic idea consists of two steps: 1) run beam search for the forward and reverse models independently to obtain two k-best lists; 2) re-score the union of the two k-best lists using the joint model to find the best candidate. We refer the reader to that work for further details.
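The two-step approximate search described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function and parameter names are hypothetical, the k-best lists are assumed to come from separate beam searches, and the right-to-left list is assumed to hold hypotheses in reversed word order.

```python
def joint_rescore(x, kbest_l2r, kbest_r2l, log_prob_l2r, log_prob_r2l):
    """Pick the best candidate from the union of two k-best lists.

    kbest_l2r: hypotheses (lists of tokens) in natural word order.
    kbest_r2l: hypotheses in reversed word order, as assumed to be
               produced by the right-to-left model.
    The two scoring callables are assumptions standing in for the
    trained left-to-right and right-to-left models.
    """
    # Step 1 is assumed done: both k-best lists are given as input.
    # Bring every hypothesis into natural order and take the union.
    candidates = {tuple(y) for y in kbest_l2r}
    candidates |= {tuple(reversed(y_rev)) for y_rev in kbest_r2l}
    if not candidates:
        raise ValueError("both k-best lists are empty")

    def joint_score(y):
        # Sum of log-probabilities equals the log of the joint product.
        return log_prob_l2r(x, list(y)) + log_prob_r2l(x, list(reversed(y)))

    # Step 2: re-score the union; sorting makes tie-breaking deterministic.
    return list(max(sorted(candidates), key=joint_score))
```

Because every candidate is scored by both models, a hypothesis proposed by only one direction can still win if the other direction also assigns it a high probability.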

Experiments
We conducted experiments on two challenging translation tasks: Japanese-to-English (JP-EN) and Chinese-to-English (CH-EN), using case-insensitive BLEU for evaluation.
For the JP-EN task, we used the data from NTCIR-9 (Goto et al., 2011): the training data consisted of 2.0M sentence pairs, and the development and test sets each contained 2K sentences with a single reference. For the CH-EN task, we used the data from the NIST2008 Open Machine Translation Campaign: the training data consisted of 1.8M sentence pairs, the development set was nist02 (878 sentences), and the test sets were nist05 (1082 sentences), nist06 (1664 sentences) and nist08 (1357 sentences).
Four baselines were used. The first two were conventional state-of-the-art translation systems, a phrase-based and a hierarchical phrase-based system, both from the latest version of the well-known Moses toolkit (Koehn et al., 2007) and referred to below as Moses and Moses-hier. The other two were single left-to-right and right-to-left NMT systems (NMT-l2r and NMT-r2l). The proposed joint model (NMT-J) was also implemented using NMT (Bahdanau et al., 2014). We followed the standard pipeline to train and run Moses. GIZA++ (Och and Ney, 2000) with grow-diag-final-and was used to build the translation model. We trained 5-gram target language models using the training set for JP-EN and the Gigaword corpus for CH-EN, and used a lexicalized distortion model. All experiments were run with the default settings except for a distortion-limit of 12 in the JP-EN experiment, as suggested by (Goto et al., 2013). To alleviate the negative effects of randomness, the final reported results are averaged over five runs of MERT.
To ensure a fair comparison, we employed the same settings for all NMT systems. Specifically, except for the maximum sequence length (seqlen, which was set to 80) and the stopping iteration, which was selected using development data, we used the default settings set out in (Bahdanau et al., 2014) for all NMT-based systems: the dimension of word embeddings was 620, the dimension of hidden units was 1000, the batch size was 80, the source and target side vocabulary sizes were 30000, and the beam size for decoding was 12. Training was conducted on a single Tesla K80 GPU, and it took about 6 days to train a single NMT system on our large-scale data.

Results and Analysis on the JP-EN Task
In §1, it was claimed that NMT generates unbalanced outputs. To demonstrate this, we have to evaluate partial translations, which is not trivial (Liu and Huang, 2014). Inspired by (Liu and Huang, 2014), we employ the idea of partial BLEU rather than potential BLEU, as there is no concept of a future string during NMT decoding. In addition, since lower-order n-grams (for example, 1-grams) are more easily aligned to uncovered words on the source side, which might negatively affect the absolute statistics of the evaluation, we employ the partial 4-gram as the metric to evaluate the quality of partial translations (both prefixes and suffixes). In Table 1, we can see that the prefixes are of higher quality than the suffixes for a single left-to-right model. In contrast, our joint model (NMT-J), which includes one left-to-right and one right-to-left model, successfully addresses this issue and produces balanced outputs.
Table 2 shows the main results on the JP-EN task. From this table, we can see that, although a single NMT model (either left-to-right or right-to-left) comfortably outperforms the Moses and Moses-hier baselines, our simple NMT-J (with one l2r and one r2l NMT model) obtains gains of 1.5 BLEU points over a single NMT model. In addition, the more powerful joint model NMT-J-5, which is an ensemble of five l2r and five r2l NMT models, gains 0.7 BLEU points over the strongest NMT ensemble NMT-r2l-5, i.e. an ensemble of five r2l NMT models. The ensemble of joint models achieved considerable gains of 5.6 and 4.8 BLEU points over the state-of-the-art Moses and Moses-hier, respectively. To the best of our knowledge, this is the first time that an end-to-end neural machine translation system has achieved such improvements on the very challenging task of JP-EN translation.
One might argue that our NMT-J-5 contained ten NMT models in total, while NMT-l2r-5 and NMT-r2l-5 only used five models each, and thus that such a comparison is unfair. Therefore, we integrated ten NMT models into the NMT-r2l-10 ensemble. In Table 2, we can see that NMT-r2l-10 is not necessarily better than NMT-r2l-5, which is consistent with the findings reported in (Zhou et al., 2002).

Results on the CH-EN Task
Table 3 shows the comparison between our method and the baselines on the CH-EN task. The results were similar in character to the results for JP-EN. The proposed joint model (NMT-J-5) consistently outperformed the strongest neural baseline (NMT-l2r-5), an ensemble of five l2r NMT models, on all the test sets, with gains of up to 1.4 BLEU points. Furthermore, our model again achieved substantial gains over the Moses and Moses-hier systems, in the range of 1.9 to 5.2 BLEU points, depending on the test set.

Related Work
Target-bidirectional transduction techniques were pioneered in the field of machine translation (Watanabe and Sumita, 2002; Finch and Sumita, 2009; Zhang et al., 2013). These works applied the techniques to traditional SMT models, under the IBM framework (Watanabe and Sumita, 2002) or feature-driven linear models (Finch and Sumita, 2009; Zhang et al., 2013). In contrast, the target-bidirectional techniques we have developed for the unified neural network framework target a pressing need that is directly motivated by a fundamental issue suffered by recurrent neural networks.
Target-directional neural network models have also been successfully employed in (Devlin et al., 2014). However, their approach was concerned with feedforward networks, which cannot make full use of rich contextual information. As a result, their models could only be used as features (i.e. submodels) to augment traditional translation techniques, in contrast to the end-to-end neural network framework for machine translation in our proposal.
Our approach is also related in some sense to prior work on the mismatch between the training and testing stages, which both approaches can alleviate: the history predictions are always correct in training, while they may be incorrect in testing. That work introduces noise into the history predictions in training to balance the mismatch, while we try to make the history predictions in testing as accurate as those in training by using two directional models. Their approach therefore tackles this problem from the viewpoint of training alone, whereas ours involves both modeling and training; it is nevertheless possible and promising to apply their approach to optimize our joint model.

Conclusion
In this paper, we investigate the issue of unbalanced outputs suffered by recurrent neural networks, and empirically show its existence in the context of machine translation. To address this issue, we propose an easy-to-implement agreement model that extends an existing agreement method from simple sequence-to-sequence learning tasks to machine translation.
On two challenging JP-EN and CH-EN translation tasks, our approach was empirically shown to be effective in addressing the issue; by generating balanced outputs, it was able to consistently outperform a respectable NMT baseline on all test sets, delivering gains of up to 1.4 BLEU points. To put these results in the broader context of machine translation research, our approach (even without special handling of unknown words) achieved gains of up to 5.6 BLEU points over strong phrase-based and hierarchical phrase-based Moses baselines, with the help of an ensemble technique.