Improved Neural Machine Translation with a Syntax-Aware Encoder and Decoder

Most neural machine translation (NMT) models are based on the sequential encoder-decoder framework, which makes no use of syntactic information. In this paper, we improve this model by explicitly incorporating source-side syntactic trees. More specifically, we propose (1) a bidirectional tree encoder which learns both sequential and tree structured representations; (2) a tree-coverage model that lets the attention depend on the source-side syntax. Experiments on Chinese-English translation demonstrate that our proposed models outperform the sequential attentional model as well as a stronger baseline with a bottom-up tree encoder and word coverage.


Introduction
Recently, neural machine translation (NMT) models (Sutskever et al., 2014; Bahdanau et al., 2015) have obtained state-of-the-art performance on many language pairs. Their success depends on the representation they use to bridge the source and target language sentences. However, this representation, a sequence of fixed-dimensional vectors, differs considerably from most theories about mental representations of sentences, and from traditional natural language processing pipelines, in which semantics is built up compositionally using a recursive syntactic structure.
Perhaps as evidence of this, current NMT models still suffer from syntactic errors such as attachment (Shi et al., 2016). We argue that instead of letting the NMT model rely solely on the implicit structure it learns during training (Cho et al., 2014a), we can improve its performance by augmenting it with explicit structural information and using this information throughout the model. This has two benefits.
Figure 1: An example sentence pair (a), with its binarized source-side tree (b). We use $x_i$ to represent the $i$-th word in the source sentence. We will use this sentence pair as the running example throughout this paper.
First, the explicit syntactic information will help the encoder generate better source side representations. Li et al. (2015) show that for tasks in which long-distance semantic dependencies matter, representations learned from recursive models using syntactic structures may be more powerful than those from sequential recurrent models. In the NMT case, given syntactic information, it will be easier for the encoder to incorporate long distance dependencies into better representations, which is especially important for the translation of long sentences.
Second, it becomes possible for the decoder to use syntactic information to guide its reordering decisions better (especially for language pairs with significant reordering, like Chinese-English). Although the attention model (Bahdanau et al., 2015) and the coverage model (Tu et al., 2016; Mi et al., 2016) provide effective mechanisms to control the generation of translation, these mechanisms work at the word level and cannot capture phrasal cohesion between the two languages (Fox, 2002; Kim et al., 2017). With explicit syntactic structure, the decoder can generate the translation more in line with the source syntactic structure. For example, when translating the phrase zhu manila dashiguan in Figure 1, the tree structure indicates that zhu 'in' and manila form a syntactic unit, so that the model can avoid breaking this unit up to make an incorrect translation like "in embassy of manila".
arXiv:1707.05436v1 [cs.CL] 18 Jul 2017
In this paper, we propose a novel encoder-decoder model that makes use of a precomputed source-side syntactic tree in both the encoder and decoder. In the encoder (§3.3), we improve the tree encoder of Eriguchi et al. (2016) by introducing a bidirectional tree encoder. For each source tree node (including the source words), we generate a representation containing information both from below (as with the original bottom-up encoder) and from above (using a top-down encoder). Thus, the annotation of each node summarizes the surrounding sequential context, as well as the entire syntactic context.
In the decoder (§3.4), we incorporate source syntactic tree structure into the attention model via an extension of the coverage model of Tu et al. (2016). With this tree-coverage model, we can better guide the generation phase of translation, for example, to learn a preference for phrasal cohesion (Fox, 2002). Moreover, with a tree encoder, the decoder may try to translate both a parent and a child node, even though they overlap; the tree-coverage model enables the decoder to learn to avoid this problem.
To demonstrate the effectiveness of the proposed model, we carry out experiments on Chinese-English translation. Our experiments show that: (1) our bidirectional tree encoder based NMT system achieves significant improvements over the standard attention-based NMT system, and (2) incorporating source tree structure into the attention model yields a further improvement.
Figure 2: Illustration of the bidirectional sequential encoder. The dashed rectangle represents the annotation of word $x_i$.
In all, we demonstrate an improvement of +3.54 BLEU over a standard attentional NMT system, and +1.90 BLEU over a stronger NMT system with a Tree-LSTM encoder (Eriguchi et al., 2016) and a coverage model (Tu et al., 2016). To the best of our knowledge, this is the first work that uses source-side syntax in both the encoder and decoder of an NMT system.

Neural Machine Translation
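Most NMT systems follow the encoder-decoder framework with attention, first proposed by Bahdanau et al. (2015). Given a source sentence $x = x_1 \cdots x_I$ and its translation $y = y_1 \cdots y_J$, NMT aims to directly model the translation probability:
$$P(y \mid x; \theta) = \prod_{j=1}^{J} P(y_j \mid y_{<j}, x; \theta),$$
where $\theta$ is a set of parameters and $y_{<j}$ is the sequence of previously generated target words.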
Here, we briefly describe the underlying framework of the encoder-decoder NMT system.

Encoder Model
Following Bahdanau et al. (2015), we use a bidirectional gated recurrent unit (GRU) (Cho et al., 2014b) to encode the source sentence, so that the annotation of each word contains a summary of both the preceding and following words. The bidirectional GRU consists of a forward and a backward GRU, as shown in Figure 2. The forward GRU reads the source sentence from left to right and calculates a sequence of forward hidden states $(\overrightarrow{h}_1, \ldots, \overrightarrow{h}_I)$. The backward GRU scans the source sentence from right to left, resulting in a sequence of backward hidden states $(\overleftarrow{h}_1, \ldots, \overleftarrow{h}_I)$:
$$\overrightarrow{h}_i = \mathrm{GRU}(\overrightarrow{h}_{i-1}, s_i), \qquad \overleftarrow{h}_i = \mathrm{GRU}(\overleftarrow{h}_{i+1}, s_i),$$
where $s_i$ is the $i$-th source word's word embedding, and GRU is a gated recurrent unit; see the paper by Cho et al. (2014b) for a definition. The annotation of each source word $x_i$ is obtained by concatenating the forward and backward hidden states:
$$\overleftrightarrow{h}_i = [\overrightarrow{h}_i; \overleftarrow{h}_i].$$
The whole sequence of these annotations is used by the decoder.
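The data flow of the bidirectional encoder can be illustrated with a minimal sketch. The recurrent cell here is a hypothetical stand-in (a running sum rather than a real GRU), and the function names are illustrative assumptions; only the left-to-right pass, the right-to-left pass, and the concatenation follow the description above.

```python
def bidirectional_encode(embeddings, fwd_cell, bwd_cell, h0):
    """Return one annotation per source word: [forward state; backward state]."""
    forward, h = [], h0
    for s in embeddings:                 # left-to-right pass
        h = fwd_cell(h, s)
        forward.append(h)
    backward, h = [], h0
    for s in reversed(embeddings):       # right-to-left pass
        h = bwd_cell(h, s)
        backward.append(h)
    backward.reverse()                   # realign with word positions
    # List concatenation plays the role of vector concatenation.
    return [f + b for f, b in zip(forward, backward)]


def toy_cell(h_prev, s):
    """Stand-in for a GRU (assumption): a running sum, easy to trace by hand."""
    return [a + b for a, b in zip(h_prev, s)]


embeddings = [[1.0], [2.0], [3.0]]       # three "words", 1-dimensional
annotations = bidirectional_encode(embeddings, toy_cell, toy_cell, [0.0])
print(annotations)  # [[1.0, 6.0], [3.0, 5.0], [6.0, 3.0]]
```

With the running-sum cell, the first component of each annotation summarizes the words up to and including position i, and the second summarizes the words from position i to the end, which is the property the real bidirectional GRU is meant to provide.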

Decoder Model
The decoder is a forward GRU predicting the translation $y$ word by word. The probability of generating the $j$-th word $y_j$ is:
$$P(y_j \mid y_{<j}, x; \theta) = \mathrm{softmax}(t_{j-1}, d_j, c_j),$$
where $t_{j-1}$ is the word embedding of the $(j-1)$-th target word, $d_j$ is the decoder's hidden state at time $j$, and $c_j$ is the context vector at time $j$. The state $d_j$ is computed as
$$d_j = \mathrm{GRU}(d_{j-1}, t_{j-1}, c_j),$$
where GRU(·) is extended to more than two arguments by first concatenating all arguments except the first.
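The attention mechanism computes the context vector $c_j$ as a weighted sum of the source annotations,
$$c_j = \sum_{i=1}^{I} \alpha_{j,i} \overleftrightarrow{h}_i,$$
where the attention weight $\alpha_{j,i}$ is
$$\alpha_{j,i} = \frac{\exp(e_{j,i})}{\sum_{i'=1}^{I} \exp(e_{j,i'})}$$
and
$$e_{j,i} = v_a^{\top} \tanh(W_a d_{j-1} + U_a \overleftrightarrow{h}_i), \tag{7}$$
where $v_a$, $W_a$ and $U_a$ are the weight matrices of the attention model, and $e_{j,i}$ is an attention model that scores how well $d_{j-1}$ and $\overleftrightarrow{h}_i$ match. With this strategy, the decoder can attend to the source annotations that are most relevant at a given time.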
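As a worked illustration of the attention step, the following sketch computes the scores, the normalized weights, and the context vector for one decoding step. It assumes the standard additive scoring function of Bahdanau et al. (2015); the toy parameter values and helper names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(d_prev, annotations, W_a, U_a, v_a):
    """Return (alpha, context) for one decoding step.

    score_i = v_a . tanh(W_a d_prev + U_a h_i); alpha = softmax(scores);
    context = sum_i alpha_i * h_i.
    """
    Wd = matvec(W_a, d_prev)
    scores = []
    for h in annotations:
        Uh = matvec(U_a, h)
        scores.append(dot(v_a, [math.tanh(a + b) for a, b in zip(Wd, Uh)]))
    alpha = softmax(scores)
    dim = len(annotations[0])
    context = [sum(a * h[k] for a, h in zip(alpha, annotations))
               for k in range(dim)]
    return alpha, context


# Toy values (assumptions, chosen only to make the example runnable).
d_prev = [0.5, -0.2]
annotations = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_a = [[0.1, 0.2], [0.3, -0.1]]
U_a = [[0.4, 0.0], [0.0, 0.4]]
v_a = [1.0, -1.0]
alpha, context = attention(d_prev, annotations, W_a, U_a, v_a)
```

The weights in `alpha` are positive and sum to one, so `context` is a convex combination of the annotations.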

Tree Structure Enhanced Neural Machine Translation
Although syntax has shown its effectiveness in non-neural statistical machine translation (SMT) systems (Yamada and Knight, 2001; Koehn et al., 2003; Liu et al., 2006; Chiang, 2007), most proposed NMT models (a notable exception being that of Eriguchi et al. (2016)) process a sentence only as a sequence of words, and do not explicitly exploit the inherent structure of natural language sentences. In this section, we present models which directly incorporate source syntactic trees into the encoder-decoder framework.

Preliminaries
Like Eriguchi et al. (2016), we currently focus on source side syntactic trees, which can be computed prior to translation. Whereas Eriguchi et al. (2016) use HPSG trees, we use phrase-structure trees as in the Penn Chinese Treebank (Xue et al., 2005). Currently, we are only using the structure information from the tree without the syntactic labels.
Thus our approach should be applicable to any syntactic grammar that provides such a tree structure (Figure 1(b)). More formally, the encoder is given a source sentence $x = x_1 \cdots x_I$ as well as a source tree whose leaves are labeled $x_1, \ldots, x_I$. We assume that this tree is strictly binary branching. For convenience, each node is assigned an index. The leaf nodes get indices $1, \ldots, I$, which are the same as their word indices. For any node with index $k$, let $p(k)$ denote the index of the node's parent (if it exists), and $L(k)$ and $R(k)$ denote the indices of the node's left and right children (if they exist).
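The indexing scheme can be sketched as follows. The nested-tuple input format, the post-order numbering of interior nodes, and the example bracketing are all assumptions made for illustration; the text only fixes that leaves carry their word indices and that every node has an index.

```python
def index_tree(tree, num_leaves):
    """Assign indices to a strictly binary tree given as nested pairs.

    Leaves are ints 1..num_leaves (their word indices). Interior nodes are
    numbered from num_leaves + 1 in post-order, so children always have
    smaller indices than their parent. Returns (root, p, L, R), where the
    dicts map a node index to its parent, left child, and right child.
    """
    p, L, R = {}, {}, {}
    next_index = num_leaves + 1

    def visit(node):
        nonlocal next_index
        if isinstance(node, int):        # leaf: keep the word index
            return node
        left = visit(node[0])
        right = visit(node[1])
        k = next_index
        next_index += 1
        L[k], R[k] = left, right
        p[left] = p[right] = k
        return k

    root = visit(tree)
    return root, p, L, R


# Hypothetical bracketing over six words (not the tree of Figure 1).
tree = ((1, 2), (((3, 4), 5), 6))
root, p, L, R = index_tree(tree, 6)
print(root, L[root], R[root])  # 11 7 10
```

With six leaves this yields interior indices 7 to 11, with the root receiving the largest index and no entry in `p`.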
Following Eriguchi et al. (2016), we build a tree encoder on top of the sequential encoder (as shown in Figure 3(a)). If node $k$ is a leaf node, its hidden state is the annotation produced by the sequential encoder:
$$h^{\uparrow}_k = \overleftrightarrow{h}_k.$$
Thus, the encoder is able to capture both sequential context and syntactic context. If node $k$ is an interior node, its hidden state is the combination of its previously calculated left child hidden state $h^{\uparrow}_{L(k)}$ and right child hidden state $h^{\uparrow}_{R(k)}$:
$$h^{\uparrow}_k = f(h^{\uparrow}_{L(k)}, h^{\uparrow}_{R(k)}), \tag{8}$$
where $f(\cdot)$ is a nonlinear function, originally a Tree-LSTM (Tai et al., 2015; Eriguchi et al., 2016).
The first improvement we make to the above tree encoder is that, to be consistent with the sequential encoder model, we use Tree-GRU units instead of Tree-LSTM units. Like Tree-LSTMs, the Tree-GRU has gating mechanisms to control the information flow inside the unit for every node, but it has no separate memory cells. When Eq. 8 is calculated by a Tree-GRU, $r_L$ and $r_R$ are the reset gates and $z_L$ and $z_R$ are the update gates for the left and right children, and $z$ is the update gate for the internal hidden state $h^{\uparrow}_k$. The $U^{(\cdot)}$ and $b^{(\cdot)}$ are the weight matrices and bias vectors.
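The gate description above can be made concrete with a small sketch. The exact gate equations are not reproduced in the text, so the form below is an assumption modelled on the binary Tree-LSTM: every gate is a sigmoid of a linear function of both child states, the candidate state is a tanh of the reset-gated children, and the output mixes the two children and the candidate through the three update gates. The parameter layout and function names are hypothetical.

```python
import math
import random


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def add(*vectors):
    return [sum(xs) for xs in zip(*vectors)]


def mul(u, v):
    return [a * b for a, b in zip(u, v)]


def sigmoid_vec(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def init_params(dim, seed=0):
    """Random small weights; one (U_left, U_right, b) triple per gate."""
    rng = random.Random(seed)

    def mat():
        return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

    params = {g: (mat(), mat(), [0.0] * dim) for g in ("rL", "rR", "zL", "zR", "z")}
    params["cand"] = (mat(), mat())      # weights of the candidate state
    return params


def tree_gru(params, h_left, h_right):
    """Compose two child states into a parent state (assumed Tree-GRU form)."""
    def gate(name):
        U_l, U_r, b = params[name]
        return sigmoid_vec(add(matvec(U_l, h_left), matvec(U_r, h_right), b))

    r_l, r_r = gate("rL"), gate("rR")                  # reset gates
    z_l, z_r, z = gate("zL"), gate("zR"), gate("z")    # update gates
    U_l, U_r = params["cand"]
    candidate = [math.tanh(x) for x in add(matvec(U_l, mul(r_l, h_left)),
                                           matvec(U_r, mul(r_r, h_right)))]
    return add(mul(z_l, h_left), mul(z_r, h_right), mul(z, candidate))


params = init_params(dim=2)
h_parent = tree_gru(params, [0.3, -0.1], [0.2, 0.4])
```

A quick sanity check of this form: with all weights and biases set to zero, every gate equals 0.5 and the candidate is zero, so the parent state is the average of the two child states.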

Bidirectional Tree Encoder
Although the bottom-up tree encoder can take advantage of syntactic structure, the learned representation of a node is based on its subtree only; it contains no information from higher up in the tree. In particular, the representation of leaf nodes is still the sequential one. Thus no syntactic information is fed into words. By analogy with the bidirectional sequential encoder, we propose a natural extension of the bottom-up tree encoder: the bidirectional tree encoder (Figure 3(b)).
Unlike the bottom-up tree encoder or the right-to-left sequential encoder, the top-down encoder by itself would have no lexical information as input. To address this issue, we feed the hidden states of the bottom-up encoder to the top-down encoder. In this way, the information of the whole syntactic tree is handed to the root node and propagated to its offspring by the top-down encoder. In the top-down encoder, each hidden state has only one predecessor. In fact, the top-down path from the root of a tree to any node can be viewed as a sequential recurrent neural network. We can calculate the hidden states of each node top-down using a standard sequential GRU.
First, the hidden state of the root node $\rho$ is simply computed as follows:
$$h^{\downarrow}_{\rho} = \tanh(W h^{\uparrow}_{\rho} + b),$$
where $W$ and $b$ are a weight matrix and bias vector. Then, other nodes are calculated by a GRU. For hidden state $h^{\downarrow}_k$:
$$h^{\downarrow}_k = \mathrm{GRU}(h^{\downarrow}_{p(k)}, h^{\uparrow}_k),$$
where $p(k)$ is the parent index of $k$. We replace the weight matrices $W^{(r)}$, $U^{(r)}$, $W^{(z)}$, $U^{(z)}$, $W$ and $U$ in the standard GRU with $P^{(r)}_D$, $Q^{(r)}_D$, $P^{(z)}_D$, $Q^{(z)}_D$, $P_D$, and $Q_D$, respectively. The subscript $D$ is either $L$ or $R$ depending on whether node $k$ is a left or right child, respectively.
Finally, the annotation of each node is obtained by concatenating its bottom-up hidden state and top-down hidden state:
$$h^{\updownarrow}_k = [h^{\uparrow}_k; h^{\downarrow}_k].$$
This allows the tree structure information to flow from the root to the leaves (words). Thus, all the annotations are based on the full context of word sequence and syntactic tree structure.
Kokkinos and Potamianos (2017) propose a similar bidirectional Tree-GRU for sentiment analysis, which differs from ours in several respects: in the bottom-up encoder, we use separate reset/update gates for left and right children, analogous to Tree-LSTMs (Tai et al., 2015); in the top-down encoder, we use separate weights for left and right children.
Teng and Zhang (2016) also propose a bidirectional Tree-LSTM encoder for classification tasks. They use a more complex head-lexicalization scheme to feed the top-down encoder. We will compare their model with ours in the experiments.
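The order of computation in the bidirectional tree encoder can be sketched as below: a bottom-up pass, a top-down pass that consumes the bottom-up states, and a final concatenation. The three cell functions are hypothetical stand-ins (simple averages instead of the Tree-GRU and the direction-specific GRU), and the sketch assumes interior nodes are indexed so that children have smaller indices than their parent.

```python
def encode_bidirectional(leaf_states, L, R, root, compose, root_init, step_down):
    """Return {node index: bottom-up state + top-down state}.

    leaf_states: {leaf index: vector} from the sequential encoder.
    L, R: {interior index: left/right child index}.
    compose(h_left, h_right): bottom-up composition.
    root_init(h_up_root): top-down state of the root.
    step_down(h_down_parent, h_up_child, side): top-down step, side "L" or "R".
    """
    up = dict(leaf_states)
    for k in sorted(L):                       # children before parents
        up[k] = compose(up[L[k]], up[R[k]])
    down = {root: root_init(up[root])}
    for k in sorted(L, reverse=True):         # parents before children
        down[L[k]] = step_down(down[k], up[L[k]], "L")
        down[R[k]] = step_down(down[k], up[R[k]], "R")
    return {k: up[k] + down[k] for k in up}   # list concatenation


def mean(u, v):
    return [(a + b) / 2.0 for a, b in zip(u, v)]


# Stand-in cells (assumptions): the real model uses a Tree-GRU and a GRU.
compose = mean
root_init = list
def step_down(h_down_parent, h_up_child, side):
    return mean(h_down_parent, h_up_child)    # "side" would select the weights


# Three leaves; node 4 spans leaves 1-2, node 5 (root) spans 4 and 3.
leaf_states = {1: [1.0], 2: [3.0], 3: [5.0]}
L, R = {4: 1, 5: 4}, {4: 2, 5: 3}
annotations = encode_bidirectional(leaf_states, L, R, 5,
                                   compose, root_init, step_down)
print(annotations[1])  # [1.0, 1.875]
```

Every node, including each word, ends up with an annotation whose second half depends on the whole tree, because the root's top-down state is derived from its bottom-up state and then pushed down to the leaves.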

Tree-Coverage Model
We also extend the decoder to incorporate information about the source syntax into the attention model. We have observed two issues in translations produced using the tree encoder. First, a syntactic phrase in the source sentence is often incorrectly translated into discontinuous words in the output. Second, since the non-leaf node annotations contain more information than the leaf node annotations, the attention model prefers to attend to the non-leaf nodes, which may aggravate the over-translation problem (translating the same part of the sentence more than once).
As shown in Figure 4(a), almost all the non-leaf nodes are attended to too many times during decoding. As a result, the Chinese phrase zhu manila is translated twice because the model attends to the node spanning zhu manila even though both words have already been translated; there is no mechanism to prevent this.
[…] 2016), we propose to use prior knowledge to control the attention mechanism. In our case, the prior knowledge is the source syntactic information.
In particular, we build our model on top of the word coverage model proposed by Tu et al. (2016), which alleviates the problems of over-translation and under-translation (failing to translate part of a sentence). The word coverage model makes the attention at a given time step $j$ dependent on the attention at previous time steps via coverage vectors. The coverage vectors are, in turn, used to update the attention at the next time step, by a small modification to the calculation of $e_{j,i}$ in Eq. (7), giving Eq. (12). The word coverage model could be interpreted as a control mechanism for the attention model. Like the standard attention model, this coverage model sees the source-sentence annotations as a bag of vectors; it knows nothing about word order, still less about syntactic structure.
For our model, we extend the word coverage model to coverage on the tree structure by adding a coverage vector for each node in the tree. We further incorporate source tree structure information into the calculation of the coverage vector by requiring each node's coverage vector to depend on its children's coverage vectors and attentions at the previous time step.
Although both child and parent nodes of a subtree are helpful for translation, they may supply redundant information. With our mechanism, when the child node is used to produce a translation, the coverage vector of its parent node will reflect this fact, so that the decoder may avoid using the redundant information in the parent node. Figure 4(b) shows a heatmap of the attention of our tree structure enhanced attention model. The attention of non-leaf nodes becomes more concentrated and the over-translation of zhu manila is corrected.
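To make the dependency structure of the tree-coverage update concrete, the sketch below assembles the input that a node's coverage update would condition on. The exact update equation is not reproduced in the text, so everything here is an assumption: the update is taken to be a GRU whose non-recurrent arguments are concatenated (as for the decoder GRU), leaves are padded with zeros in place of missing children, and the caller decides which time step the attention weights in `alpha` come from. All names are hypothetical.

```python
def tree_coverage_input(i, alpha, C_prev, d_prev, h, L, R, cov_dim):
    """Concatenate what node i's coverage update depends on.

    alpha:  {node: attention weight} made available to the update.
    C_prev: {node: coverage vector from the previous time step}.
    d_prev: previous decoder state.
    h:      {node: annotation}.
    L, R:   {interior node: left/right child}; leaves are absent.
    The recurrent state C_prev[i] itself is passed to the update separately.
    """
    def child_part(children):
        c = children.get(i)
        if c is None:                          # leaf: zero padding (assumption)
            return [0.0] * (cov_dim + 1)
        return list(C_prev[c]) + [alpha[c]]    # child's coverage and attention

    return ([alpha[i]] + list(d_prev) + list(h[i])
            + child_part(L) + child_part(R))


# Two leaves (1, 2) under one parent (3); toy values for illustration.
L, R = {3: 1}, {3: 2}
alpha = {1: 0.2, 2: 0.3, 3: 0.5}
C_prev = {1: [0.1, 0.2], 2: [0.3, 0.4], 3: [0.5, 0.6]}
d_prev = [1.0]
h = {1: [7.0], 2: [8.0], 3: [9.0]}

parent_in = tree_coverage_input(3, alpha, C_prev, d_prev, h, L, R, cov_dim=2)
leaf_in = tree_coverage_input(1, alpha, C_prev, d_prev, h, L, R, cov_dim=2)
print(parent_in)  # [0.5, 1.0, 9.0, 0.1, 0.2, 0.2, 0.3, 0.4, 0.3]
```

Because the parent's input contains its children's coverage vectors and attention weights, attending to a child changes the parent's coverage at the next step, which is the mechanism by which the decoder can learn to skip redundant parent nodes.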

Data
We conduct experiments on the NIST Chinese-English translation task. The parallel training data consists of 1.6M sentence pairs extracted from LDC corpora, with 46.6M Chinese words and 52.5M English words, respectively. We use NIST MT02 as development data, and NIST MT03-06 as test data. These data are mostly in the same genre (newswire), avoiding the extra consideration of domain adaptation. Table 1 shows the statistics of the data sets. The Chinese side of the corpora is word segmented using ICTCLAS. We parse the Chinese sentences with the Berkeley Parser (Petrov and Klein, 2007) and binarize the resulting trees following Zhang and Clark (2009).
The English side of the corpora is lowercased and tokenized.
We filter out any translation pairs whose source sentences fail to be parsed. For efficient training, we also filter out the sentence pairs whose source or target lengths are longer than 50. We use a shortlist of the 30,000 most frequent words in each language to train our models, covering approximately 98.2% and 99.5% of the Chinese and English tokens, respectively. All out-of-vocabulary words are mapped to a special symbol UNK.
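The length filter, the frequency shortlist, and the UNK mapping described above can be sketched in a few lines. The function names and the toy corpus are hypothetical; the actual preprocessing scripts are not described in the text.

```python
from collections import Counter


def filter_by_length(pairs, max_len=50):
    """Keep pairs whose source and target both have at most max_len tokens."""
    return [(src, tgt) for src, tgt in pairs
            if len(src) <= max_len and len(tgt) <= max_len]


def build_shortlist(sentences, size):
    """Return the set of the `size` most frequent tokens."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(size)}


def map_unk(sentence, vocab, unk="UNK"):
    """Replace out-of-vocabulary tokens with the special symbol."""
    return [tok if tok in vocab else unk for tok in sentence]


def token_coverage(sentences, vocab):
    """Fraction of running tokens that are inside the shortlist."""
    total = sum(len(s) for s in sentences)
    kept = sum(1 for s in sentences for t in s if t in vocab)
    return kept / total


# Toy corpus with a shortlist of 2 (the paper uses 30,000 per language).
corpus = [["a", "b", "a"], ["a", "c"], ["b", "a", "d"]]
vocab = build_shortlist(corpus, size=2)
print(map_unk(["a", "c", "d"], vocab))   # ['a', 'UNK', 'UNK']
print(token_coverage(corpus, vocab))     # 0.75
```

The coverage figure computed this way corresponds to the token-level percentages quoted above, since it counts running tokens rather than distinct word types.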

Model and Training Details
We compare our proposed models with several state-of-the-art NMT systems and techniques:
• NMT: the standard attentional NMT model (Bahdanau et al., 2015).
We used the dl4mt implementation of the attentional model, reimplementing the tree encoder and word coverage models. The word embedding dimension is 512. The hidden layer sizes of both forward and backward sequential encoder are 1024 (except where indicated). Since our Tree-GRU encoders are built on top of the bidirectional sequential encoder, the size of the hidden layer (in each direction) is 2048. For the coverage model, we set the size of coverage vectors to 50. We use Adadelta (Zeiler, 2012) for optimization using a mini-batch size of 32. All other settings are the same as in Bahdanau et al. (2015).
We use case-insensitive 4-gram BLEU (Papineni et al., 2002) for evaluation, as calculated by multi-bleu.perl in the Moses toolkit.
Table 2: BLEU scores of different systems. "Sequential", "Tree-LSTM", "Tree-GRU" and "Bidirectional" denote the encoder part for the standard sequential encoder, Tree-LSTM encoder, Tree-GRU encoder and the bidirectional tree encoder, respectively. "no", "word" and "tree" in column "Coverage" represent the decoder part for using no coverage (standard attention), word coverage (Tu et al., 2016) and our proposed tree-coverage model, respectively.

Tree Encoders
This set of experiments evaluates the effectiveness of our proposed tree encoders. Table 2, row 2 confirms the finding of Eriguchi et al. (2016) that a Tree-LSTM encoder helps, and row 3 shows that our Tree-GRU encoder gets a better result (+0.87 BLEU vs. row 2). To verify our assumption that model consistency is important for performance, we also conduct experiments to compare Tree-LSTM and Tree-GRU on top of LSTM-based encoder-decoder settings. Tree-LSTM with an LSTM-based sequential model can obtain a 1.02 BLEU improvement (Table 3 […])
Since the annotation size of our bidirectional tree encoder is twice that of the Tree-LSTM encoder, we halved the size of the hidden layers in the sequential encoder to 512 in each direction, to make a fair comparison. These results are shown in Table 4. Row 4 shows that, even with the same annotation size, our bidirectional tree encoder works better than the original Tree-LSTM encoder (row 2). In fact, our half-sized unidirectional Tree-GRU encoder (row 3) also works better than the Tree-LSTM encoder (row 2) with half of its annotation size.
We also compared our bidirectional tree encoder with the head-lexicalization based bidirectional tree encoder proposed by Teng and Zhang (2016), which forms the input vector for each non-leaf node by a bottom-up head propagation mechanism (Table 4, row 14). Our bidirectional tree encoder gives a better result, suggesting that head word information may not be as helpful for machine translation as it is for syntactic parsing.
When we set the hidden size back to 1024, we found that training the bidirectional tree encoder was more difficult. Therefore, we adopted a two-phase training strategy: first, we train the parameters of the bottom-up encoder based NMT system; then, with the initialization of bottom-up encoder and random initialization of the top-down part and decoder, we train the bidirectional tree encoder based NMT system. Table 2, row 4 shows the results of this two-phase training: the bidirectional model (row 4) is 0.79 BLEU better than our unidirectional Tree-GRU (row 3).
Table 4: Experiments with 512 hidden units in each direction of the sequential encoder. The bidirectional tree encoder using head-lexicalization (Bidirectional-head), proposed by Teng and Zhang (2016), does not work as well as our simpler bidirectional tree encoder (Bidirectional).

Tree-Coverage Model
Rows 5-8 in Table 2 show that the word coverage model of Tu et al. (2016) consistently helps when used with our proposed tree encoders, with the bidirectional tree encoder remaining the best. However, the improvements of the tree encoder models are smaller than those of the baseline system. This may be caused by the fact that the word coverage model neglects the relationships within the tree, e.g., the relationship between children and parent nodes. Our tree-coverage model consistently improves performance further (rows 9-11).
Our best model combines our bidirectional tree encoder with our tree-coverage model (row 11), yielding a net improvement of +3.54 BLEU over the standard attentional model (row 1), and +1.90 BLEU over the stronger baseline that implements both the bottom-up tree encoder and coverage model from previous work (row 6).
As noted before, the original coverage model does not take word order into account. For comparison, we also implement an extension of the coverage model that lets each coverage vector also depend on those of its left and right neighbors at the previous time step. This model does not help; in fact, it reduces BLEU by about 0.2.

Analysis By Sentence Length
Following Bahdanau et al. (2015), we bin the development and test sentences by length and show BLEU scores for each bin in Figure 5. The proposed bidirectional tree encoder outperforms the sequential NMT system and the Tree-GRU encoder across all lengths. The improvements become larger for sentences longer than 20 words, and the biggest improvement is for sentences longer than 50 words. This provides some evidence for the importance of syntactic information for long sentences.
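The binning step of this analysis can be sketched as follows. The bin edges are an assumption chosen to match the thresholds mentioned above (20 and 50 words); the exact bins used in Figure 5 are not stated in the text. BLEU would then be computed separately on the sentences of each bin.

```python
from bisect import bisect_left


def bin_by_length(sentences, edges=(10, 20, 30, 40, 50)):
    """Group sentence positions by source length.

    Bin b holds sentences with edges[b-1] < length <= edges[b];
    the last bin (index len(edges)) holds everything longer than edges[-1].
    """
    bins = {}
    for idx, sent in enumerate(sentences):
        b = bisect_left(edges, len(sent))
        bins.setdefault(b, []).append(idx)
    return bins


# Toy source sentences of lengths 5, 10, 11 and 51.
sentences = [["w"] * n for n in (5, 10, 11, 51)]
bins = bin_by_length(sentences)
print(bins)  # {0: [0, 1], 1: [2], 5: [3]}
```

Returning sentence positions rather than the sentences themselves makes it straightforward to select the matching hypotheses and references for each bin.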

Related Work
Recently, many studies have focused on using explicit syntactic tree structure to help learn sentence representations for various sentence classification tasks. For example, Teng and Zhang (2016) and Kokkinos and Potamianos (2017) extend the bottom-up model to a bidirectional model for classification tasks, using Tree-LSTMs with head lexicalization and Tree-GRUs, respectively. We draw on some of these ideas and apply them to machine translation. We use the representation learnt from tree structures to enhance the original sequential model, and make use of this syntactic information during the generation phase.
In NMT systems, the attention model (Bahdanau et al., 2015) has become a crucial part of the decoder model. Cohn et al. (2016) and Feng et al. (2016) extend the attentional model to include structural biases from word based alignment models. Kim et al. (2017) incorporate richer structural distributions within deep networks to extend the attention model. Our contribution to the decoder model is to directly exploit structural information in the attention model combined with a coverage mechanism.

Conclusion
We have investigated the potential of using explicit source-side syntactic trees in NMT by proposing a novel syntax-aware encoder-decoder model. Our experiments have demonstrated that a top-down encoder is a useful enhancement for the original bottom-up tree encoder (Eriguchi et al., 2016), and that incorporating syntactic structure information into the decoder can better control the translation. Our analysis suggests that the benefit of source-side syntax is especially strong for long sentences.
Our current work only uses the structure part of the syntactic tree, without the labels. For future work, it will be interesting to make use of node labels from the tree, or to use syntactic information on the target side, as well.

Figure 3: Illustration of the proposed encoder models for the running example. The non-leaf nodes are assigned indices 7-11. The annotations $h^{\uparrow}_i$ of leaf nodes in (b) are identical to the annotations (dashed rectangles) of leaf nodes in (a). The dotted rectangles in (b) indicate the annotation produced by the bidirectional tree encoder.

Figure 4: The attention heatmap plotting the attention weights during different translation steps, for translating the sentence in Figure 1(a). The nodes [7]-[11] correspond to non-leaf nodes indexed in Figure 3. Incorporating the tree-coverage model produces more concentrated alignments and alleviates the over-translation problem.

Figure 5: Performance of translations with respect to the lengths of the source sentences. "+" indicates the improvement over the baseline sequential model.

Table 1: Experiment data and statistics.

Table 3: BLEU scores of different systems based on LSTM. "Seq-LSTM" denotes that both the encoder and decoder parts of the sequential model are based on LSTM; "SeqTree-LSTM" means using a Tree-LSTM encoder on top of "Seq-LSTM".