Chunk-Based Bi-Scale Decoder for Neural Machine Translation

In typical neural machine translation (NMT), the decoder generates a sentence word by word, packing all linguistic granularities into the same time-scale of the RNN. In this paper, we propose a new type of decoder for NMT, which splits the decoder state into two parts and updates them at two different time-scales. Specifically, we first predict a chunk time-scale state for phrasal modeling, on top of which multiple word time-scale states are generated. In this way, the target sentence is translated hierarchically from chunks to words, leveraging information at different granularities. Experiments show that our proposed model significantly improves translation performance over a state-of-the-art NMT model.


Introduction
Recent work on neural machine translation (NMT) adopts the encoder-decoder framework for machine translation (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014), which employs a recurrent neural network (RNN) encoder to model the source context information and an RNN decoder to generate the translation. This differs significantly from previous statistical machine translation systems (Koehn et al., 2003; Chiang, 2005). The framework is further extended by an attention mechanism, which acquires source sentence context dynamically at different decoding steps (Bahdanau et al., 2014; Luong et al., 2015).
The decoder state stores translation information at different granularities, determining which segment should be expressed (the phrasal component) and which word should be generated (the lexical component). However, due to the extensive existence of multi-word phrases and expressions, the lexical component varies much faster than the phrasal one. For example, in the generation of "the French Republic", the lexical component of the decoder changes three times, once for each word, while the phrasal component may change only once. The inconsistent varying speeds of the two components may cause translation errors.
A typical NMT model generates target sentences at the word level, packing the phrasal and lexical information into one hidden state, which is not necessarily best for translation. Much previous work proposes to improve the NMT model by adopting finer-grained translation units such as characters or sub-words, which capture the intermediate information inside words (Ling et al., 2015; Costa-jussà and Fonollosa, 2016; Chung et al., 2016; Luong et al., 2016; Lee et al., 2016; Sennrich and Haddow, 2016; Sennrich et al., 2016; García-Martínez et al., 2016). However, higher-level structures such as phrases, which are very useful for machine translation (Koehn et al., 2007), have not been explicitly explored in NMT.
We propose a chunk-based bi-scale decoder for NMT, which explicitly splits the lexical and phrasal components into different time-scales.

Standard Neural Machine Translation Model
Generally, a neural machine translation system directly models the conditional probability of the translation y word by word (Bahdanau et al., 2014). Formally, given an input sequence x = [x_1, x_2, ..., x_J] and the previously generated sequence y_{<t} = [y_1, y_2, ..., y_{t-1}], the probability of the next target word y_t is

P(y_t | y_{<t}, x) = softmax(f(e_{y_{t-1}}, s_t, c_t))    (1)

where f(·) is a non-linear function and e_{y_{t-1}} is the embedding of y_{t-1}; s_t is the decoder state at time step t, which is computed by

s_t = g(s_{t-1}, e_{y_{t-1}}, c_t)    (2)

Here g(·) is the transition function of the decoder RNN, and c_t is the context vector computed by

c_t = ATT(s_{t-1}, h)    (3)

where ATT is an attention operation that outputs an alignment distribution α over the source annotations and returns their weighted sum, c_t = Σ_j α_{t,j} h_j, and h = [h_1, ..., h_J] is the annotation of x from a bi-directional RNN. The training objective is to maximize the likelihood of the training data. Beam search is adopted for decoding.
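As a concrete illustration, the attention operation of Equation 3 can be sketched in NumPy as additive attention over the source annotations. This is a minimal sketch, not the paper's exact parameterization: the weight matrices `W_s`, `W_h` and vector `v` are hypothetical names, and the scoring function is one common choice.

```python
import numpy as np

def attention(s_prev, h, W_s, W_h, v):
    """Additive attention: score each source annotation h_j against the
    previous decoder state s_prev, normalize with softmax to get the
    alignment distribution alpha, and return the context vector c_t."""
    scores = np.tanh(s_prev @ W_s + h @ W_h) @ v   # one score per source word, shape (J,)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                    # alignment distribution, sums to 1
    c_t = alpha @ h                                # weighted sum of annotations
    return c_t, alpha

# toy example: J = 4 source positions, hidden size d = 3
rng = np.random.default_rng(0)
J, d = 4, 3
h = rng.normal(size=(J, d))                        # bi-directional RNN annotations
s_prev = rng.normal(size=d)                        # previous decoder state s_{t-1}
W_s, W_h = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)
c_t, alpha = attention(s_prev, h, W_s, W_h, v)
```

The returned `alpha` corresponds to the alignment distribution α, and `c_t` to the context vector fed into Equations 1 and 2.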
b_t is either 0 or 1, where 1 denotes the boundary of a new chunk and 0 denotes otherwise. Two different operations are executed accordingly:

p_t = p_{t-1}    (COPY)
p_t = g_p(p_{t-1}, e_{p_{t-1}}, pc_t)    (UPDATE)

In the COPY operation, the chunk state is kept the same as at the previous step. In the UPDATE operation, e_{p_{t-1}}, the representation of the last chunk, is computed by the LSTM-minus approach (Wang and Chang, 2016):

e_{p_{t-1}} = m(s_{t-1} − s_{t'})

where t' is the boundary of the last chunk and m(·) is a linear function. pc_t is the context vector for the chunk state p_t, which is calculated by a chunk attention model:

pc_t = ATT_c(p_{t-1}, h)

The chunk attention model differs from the standard word attention model (i.e., Equation 3) in two ways: 1) it reads the chunk state p_{t-1} rather than the word state s_{t-1}, and 2) it is executed only at the boundary of each chunk rather than at every decoding step.
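The COPY/UPDATE logic and the LSTM-minus chunk representation can be sketched as follows. This is an illustrative NumPy sketch under simplifying assumptions: the chunk RNN transition is reduced to a single tanh layer, and the weight matrices `W_m` and `W_p` are hypothetical stand-ins for the linear function m(·) and the chunk transition parameters.

```python
import numpy as np

d = 4                                    # toy hidden size
rng = np.random.default_rng(1)
W_m = rng.normal(size=(d, d))            # linear map m(.) of the LSTM-minus
W_p = rng.normal(size=(3 * d, d))        # simplified chunk transition weights

def lstm_minus(word_states, start, end, W_m):
    """Represent the chunk spanning word steps (start, end] as the
    difference of the word states at its two boundaries (LSTM-minus),
    passed through a linear function."""
    return (word_states[end] - word_states[start]) @ W_m

def chunk_state_step(b_t, p_prev, e_p, pc_t, W_p):
    """Bi-scale chunk-state update controlled by the boundary gate b_t:
    COPY when b_t == 0, UPDATE when b_t == 1."""
    if b_t == 0:
        return p_prev                    # COPY: same chunk continues
    # UPDATE: refresh the chunk state from the last chunk's
    # representation e_p and the chunk attention context pc_t
    return np.tanh(np.concatenate([p_prev, e_p, pc_t]) @ W_p)

word_states = rng.normal(size=(6, d))    # word time-scale states s_1..s_6
p_prev = rng.normal(size=d)              # previous chunk state
pc_t = rng.normal(size=d)                # chunk attention context (assumed given)
e_p = lstm_minus(word_states, 1, 4, W_m) # last chunk spanned steps 2..4
p_copy = chunk_state_step(0, p_prev, e_p, pc_t, W_p)    # inside a chunk
p_update = chunk_state_step(1, p_prev, e_p, pc_t, W_p)  # at a chunk boundary
```

Note how COPY returns the previous chunk state unchanged, so the chunk time-scale only "ticks" at boundaries, while the word time-scale advances every step.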
In this way, our model extracts source context only once per chunk, and the words in a chunk share the same context for word generation. The chunk attention mechanism thus adds a constraint that target words in the same chunk share the same source context.
Training: To encourage the proposed model to learn reasonable chunk states, we add two additional objectives in training.

Chunk Tag Prediction: For each chunk, we predict the probability of its tag, P(l_k | x) = softmax(f(p_t, e_{p_t}, c_t)), where l_k is the syntactic tag of the k-th chunk, such as NP (noun phrase) or VP (verb phrase), and t is the time step of its boundary.

We also evaluate our model on the WMT German-English translation task; newstest2014 (DE14) is adopted as the development set, and newstest2012 and newstest2013 (DE1213) are adopted as the test sets.
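The chunk tag prediction objective described above amounts to a softmax classification over syntactic tags from the chunk's features. The sketch below is a minimal NumPy version; the tag set, the feature concatenation, and the weight matrix `W_tag` are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def chunk_tag_probs(p_t, e_pt, c_t, W_tag):
    """Auxiliary training objective: predict the syntactic tag of a
    chunk (e.g. NP, VP) from the chunk state p_t, the chunk
    representation e_pt, and the context vector c_t."""
    feats = np.concatenate([p_t, e_pt, c_t])
    return softmax(feats @ W_tag)

TAGS = ["NP", "VP", "PP", "ADVP"]        # illustrative tag inventory
rng = np.random.default_rng(2)
d = 3
W_tag = rng.normal(size=(3 * d, len(TAGS)))
probs = chunk_tag_probs(rng.normal(size=d), rng.normal(size=d),
                        rng.normal(size=d), W_tag)
```

At training time the negative log-probability of the gold tag would be added to the translation loss; at test time this head can also be read off to evaluate tag prediction accuracy.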

Chunk
The English sentences are labeled by a neural chunker implemented following Zhou et al. (2015). We use the case-insensitive 4-gram NIST BLEU score as our evaluation metric (Papineni et al., 2002).
In training, we limit the source and target vocabularies to the most frequent 30K words and train each model on sentences of up to 50 words. The sizes of the chunk representation and the chunk hidden state are set to 1000. All other settings are the same as in Bahdanau et al. (2014).

Results on Chinese-English
We list the BLEU scores of our proposed model in Table 1, compared with Moses (Koehn et al., 2007) and dl4mt (Bahdanau et al., 2014), which are state-of-the-art SMT and NMT models, respectively. For Moses, we use the default configuration with a 4-gram language model trained on the target portion of the training data. For dl4mt, we also report results (dl4mt-2) using two decoder layers (Wu et al., 2016) for better comparison.
As shown in Table 1, our proposed model outperforms the baselines on all sets, which verifies that the chunk-based bi-scale decoder is effective for NMT. Our model gives a 1.6 BLEU improvement over the standard NMT baseline (dl4mt). We also conducted an experiment with dl4mt-2 to see whether a deeper NMT system can automatically model the bi-scale components with different varying speeds. Surprisingly, we find that dl4mt-2 obtains lower BLEU scores than dl4mt. We speculate that the more complex dl4mt-2 model may need more training data to obtain reasonable results.

Effectiveness of Chunk Attention
As described in Section 3, we use chunk attention in place of the word-level attention, so that the source context extracted by the chunk attention is shared by all word generations within the corresponding chunk. For comparison, we also report the result of our model using conventional word attention. As shown in Table 2, our model with chunk attention gives a higher BLEU score than with word attention.
Intuitively, chunks are more specific in semantics, and thus could extract more specific source context for translation. Chunk attention can be considered a compromise between encoding the whole source sentence into the decoder without attention (Sutskever et al., 2014) and utilizing word-level attention at every step (Bahdanau et al., 2014). We also plot the alignments produced by the chunk attention (Figure 2), from which we can see that our chunk attention model explores the alignments from phrases to words well.

Table 3: Accuracies of predicted chunk boundaries and chunk labels.

Predictions of the Chunk Boundary and Chunk Label

We also compute the prediction accuracies of chunk boundaries and chunk labels on the auto-chunked development and test data (Table 3). We find that the chunk boundary is predicted well, with an average accuracy of 89%, which shows that our model can capture phrasal boundary information during translation. However, our model does not predict chunk labels as well as chunk boundaries.
We speculate that more syntactic context features should be added to improve the performance of predicting chunk labels.
Subjective Evaluation: Following Tu et al. (2016, 2017a,b), we also compare our model with the dl4mt baseline by subjective evaluation. The human evaluator is asked to give four scores: an adequacy score and a fluency score, each between 0 and 5 (the larger, the better); and an under-translation score and an over-translation score, which are set to 1 when under- or over-translation errors occur and 0 otherwise. We list the averaged scores in Table 5. We find that our proposed model improves over the dl4mt baseline in both translation adequacy and fluency. Specifically, the over-translation error rate drops by 6%, which confirms the assumption in the introduction that splitting the fast- and slow-varying components into different time-scales can help alleviate over-translation errors.

Results on German-English
We evaluate our model on the WMT15 German-to-English translation task. We find that our proposed chunk-based NMT model also obtains considerable accuracy improvements on German-English. However, the BLEU gains are not as large as on Chinese-English. We speculate that the difference between Chinese and English is larger than that between German and English, and that the chunk-based NMT model may be more useful for language pairs with larger differences.

Related Work
NMT with Various Granularities. A line of previous work proposes to utilize granularities other than words for NMT. By further exploiting character-level (Ling et al., 2015; Costa-jussà and Fonollosa, 2016; Chung et al., 2016; Luong et al., 2016; Lee et al., 2016) or sub-word-level (Sennrich and Haddow, 2016; Sennrich et al., 2016; García-Martínez et al., 2016) information, the corresponding NMT models capture information inside words and alleviate the problem of unknown words. While most of this work focuses on decomposing words into characters or sub-words, our work aims at composing words into phrases.
Incorporating Syntactic Information in NMT. Syntactic information has been widely used in SMT (Liu et al., 2006; Marton and Resnik, 2008; Shen et al., 2008), and much previous work has explored incorporating syntactic information into NMT, showing its effectiveness (Stahlberg et al., 2016). Shi et al. (2016) give empirical results showing that deep NMT networks are able to capture some useful syntactic information implicitly. Luong et al. (2016) propose a multi-task framework for NMT and neural parsing, achieving promising results. Eriguchi et al. (2016) propose a string-to-tree NMT system trained end-to-end. Different from previous work, we incorporate syntactic information on the target side of NMT. Ishiwatari et al. (2017) concurrently propose a chunk-based decoder to cope with free word-order languages; differently, they adopt word-level attention and predict the end of a chunk by generating end-of-chunk tokens instead of using a boundary gate.

Conclusion
We propose a chunk-based bi-scale decoder for neural machine translation, in which the target sentence is translated hierarchically from chunks to words, leveraging information at different granularities. Experiments show that our proposed model outperforms the standard attention-based neural machine translation baseline. Future work includes removing the reliance on labeled chunk data by adopting reinforcement learning to explore phrase boundaries automatically (Mou et al., 2016). Our code is released at https://github.com/zhouh/chunk-nmt.

Figure 1: The architecture of the chunk-based bi-scale NMT.
At each decoding step, our model first predicts a chunk state with chunk attention, based on which multiple word states are generated without attention. The word state is updated at every step, while the chunk state is updated only when a chunk boundary is automatically detected by a boundary gate. In this way, we incorporate soft phrases into NMT, making the model flexible in capturing both the global reordering of phrases and the local translation inside phrases.
Chunk-Based Bi-Scale Neural Machine Translation Model

Instead of the word-based decoder, we propose a chunk-based bi-scale decoder, which generates the translation hierarchically at chunk and word time-scales, as shown in Figure 1. Formally, the probability of the next word y_t is

P(y_t | y_{<t}, x) = softmax(f(e_{y_{t-1}}, s_t, p_t))    (6)
s_t = g(s_{t-1}, e_{y_{t-1}}, p_t)    (7)

where p_t is the chunk state at step t. Compared with Equations 1 and 2, the generation of the target word is based on the chunk state instead of the context vector c_t produced by the attention model. Since a chunk may correspond to multiple words, we employ a boundary gate b_t to decide the boundary of each chunk.
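Equations 6 and 7 can be sketched as a single word time-scale step conditioned on the chunk state. This NumPy sketch simplifies the RNN transition g(·) and the output function f(·) to single affine layers with hypothetical weights `W_s` and `W_out`; the point is only the information flow, i.e. that p_t replaces c_t in both equations.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def word_step(s_prev, e_prev, p_t, W_s, W_out):
    """One word time-scale step: the new word state s_t (Eq. 7) and the
    output distribution over the vocabulary (Eq. 6) are both
    conditioned on the chunk state p_t rather than an attention
    context computed at this step."""
    s_t = np.tanh(np.concatenate([s_prev, e_prev, p_t]) @ W_s)    # Eq. 7
    probs = softmax(np.concatenate([e_prev, s_t, p_t]) @ W_out)   # Eq. 6
    return s_t, probs

# toy dimensions: hidden size 3, vocabulary size 10
rng = np.random.default_rng(3)
d, V = 3, 10
W_s = rng.normal(size=(3 * d, d))
W_out = rng.normal(size=(3 * d, V))
s_t, probs = word_step(rng.normal(size=d), rng.normal(size=d),
                       rng.normal(size=d), W_s, W_out)
```

Because p_t is held fixed between chunk boundaries, every word generated inside a chunk sees the same phrasal context, which is exactly the constraint the chunk attention imposes.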
Boundary Prediction: At each decoding step, we predict the probability of a chunk boundary, P(b_t | x) = softmax(s_{t-1}, e_{y_{t-1}}). Accordingly, given a set of training examples {[x^n, y^n]}_{n=1}^N, the new training objective is
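The boundary gate is a binary softmax over the previous word state and last word embedding. A minimal sketch, assuming a single hypothetical weight matrix `W_b` projecting the concatenated features to two logits (for b_t = 0 and b_t = 1):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def boundary_prob(s_prev, e_prev, W_b):
    """Predict P(b_t | x) over {0, 1} from the previous word state
    s_{t-1} and the embedding of the last generated word."""
    return softmax(np.concatenate([s_prev, e_prev]) @ W_b)

rng = np.random.default_rng(4)
d = 3
W_b = rng.normal(size=(2 * d, 2))          # features -> two boundary logits
pb = boundary_prob(rng.normal(size=d), rng.normal(size=d), W_b)
```

At decoding time, `pb.argmax()` (or a threshold) decides whether the chunk state is COPYed or UPDATEd at this step.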

Table 1: BLEU scores for different systems.

Table 2: Results with different attention models.

Table 5: Results on German-English.