A Tree-based Decoder for Neural Machine Translation

Recent advances in Neural Machine Translation (NMT) show that adding syntactic information to NMT systems can improve the quality of their translations. Most existing work uses specific types of linguistically inspired tree structures, such as constituency and dependency parse trees. This is often done via a standard RNN decoder that operates on a linearized target tree structure. However, it remains an open question which linguistic formalism, if any, is the best structural representation for NMT. In this paper, we (1) propose an NMT model that can naturally generate the topology of an arbitrary tree structure on the target side, and (2) experiment with various target tree structures. Our experiments show the surprising result that our model delivers the best improvements with balanced binary trees constructed without any linguistic knowledge; this model outperforms standard seq2seq models by up to 2.1 BLEU points, and other methods for incorporating target-side syntax by up to 0.7 BLEU.


Introduction
Most NMT methods use sequence-to-sequence (seq2seq) models, taking in a sequence of source words and generating a sequence of target words (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015). While seq2seq models can implicitly discover syntactic properties of the source language (Shi et al., 2016), they do not explicitly model and leverage such information. Motivated by the success of adding syntactic information to Statistical Machine Translation (SMT) (Galley et al., 2004; Menezes and Quirk, 2007; Galley et al., 2006), recent work has established that explicitly leveraging syntactic information can improve NMT quality, either through syntactic encoders (Li et al., 2017; Eriguchi et al., 2016), multi-task learning objectives (Chen et al., 2017; Eriguchi et al., 2017), or direct addition of syntactic tokens to the target sequence (Nadejde et al., 2017; Aharoni and Goldberg, 2017). However, these syntax-aware models only employ the standard decoding process of seq2seq models, i.e. generating one target word at a time. One exception is Wu et al. (2017), which uses two RNNs to generate target dependency trees. Nevertheless, the model of Wu et al. (2017) is specifically designed for dependency tree structures and is not trivially applicable to other varieties of trees such as phrase-structure trees, which have been used more widely in other work on syntax-based machine translation. One potential reason for the dearth of work on syntactic decoders is that such parse tree structures are not friendly to recurrent neural networks (RNNs).
Our code is available at https://github.com/cindyxinyiwang/TrDec_pytorch.
In this paper, we propose TrDec, a method for incorporating tree structures in NMT. TrDec simultaneously generates a target-side tree topology and a translation, using the partially generated tree to guide the translation process (§ 2). TrDec employs two RNNs: a rule RNN, which tracks the topology of the tree based on rules defined by a Context Free Grammar (CFG), and a word RNN, which tracks words at the leaves of the tree (§ 3). This model is similar to neural models of tree-structured data from syntactic and semantic parsing (Dyer et al., 2016; Alvarez-Melis and Jaakkola, 2017; Yin and Neubig, 2017), but with the addition of the word RNN, which is especially important for MT where fluency of transitions over the words is critical.
TrDec can generate any tree structure that can be represented by a CFG. These structures include linguistically motivated syntactic tree representations, e.g. constituent parse trees, as well as syntax-free tree representations, e.g. balanced binary trees (§ 4). This flexibility of TrDec allows us to compare and contrast different structural representations for NMT.
In our experiments (§ 5), we evaluate TrDec using both syntax-driven and syntax-free tree representations. We benchmark TrDec on three tasks: Japanese-English and German-English translation with medium-sized datasets, and Oromo-English translation with an extremely small dataset. Our findings are surprising: TrDec performs well, but it performs the best with balanced binary trees constructed without any linguistic guidance.

Generation Process
TrDec simultaneously generates the target sequence and its corresponding tree structure. We first discuss the high-level generation process using an example, before describing the prediction model (§ 3) and the types of trees used by TrDec (§ 4).
Fig. 1 illustrates the generation process of the sentence "_The _cat _eat s _fi sh _.", where the sentence is split into subword units, delimited by the underscore "_" (Sennrich et al., 2016). The example uses a syntactic parse tree as the intermediate tree representation, but the process of generating with other tree representations, e.g. syntax-free trees, follows the same procedure.
Trees used in TrDec have two types of nodes: terminal nodes, i.e. the leaf nodes that represent subword units; and nonterminal nodes, i.e. the non-leaf nodes that represent a span of subwords. Additionally, we define a preterminal node to be a nonterminal node whose children are all terminal nodes. In Fig. 1 Left, the green squares represent preterminal nodes.
TrDec generates a tree in a top-down, left-to-right order. The generation process is guided by a CFG over target trees, which is constructed by taking all production rules extracted from the trees of all sentences in the training corpus. Specifically, a rule RNN first generates the top of the tree structure, and continues until a preterminal is reached. Then, a word RNN fills out the words under the preterminal. The model switches back to the rule RNN after the word RNN finishes. This process is illustrated in Fig. 1 Right. Details are as follows:
Step 1. The source sentence is encoded by a sequential RNN encoder, producing the hidden states.
Step 2. The generation starts with a derivation tree with only a ROOT node. A rule RNN, initialized by the last encoder hidden state, computes the probability distribution over all CFG rules whose left-hand side (LHS) is ROOT, and selects a rule to apply to the derivation. In our example, the rule RNN selects ROOT → S.
Step 3. The rule RNN applies production rules to the derivation in a top-down, left-to-right order, expanding the current opening nonterminal using a CFG rule whose LHS is the opening nonterminal. In the next two steps, TrDec applies the rules S → NP VP PUNC and NP → pre to the opening nonterminals S and NP, respectively. Note that after these two steps a preterminal node pre is created.
Step 4a. Upon seeing a preterminal node as the current opening nonterminal, TrDec switches to using a word RNN, initialized by the last state of the encoder, to populate this empty preterminal with phrase tokens, similar to a seq2seq decoder. For example, the subword units _The and _cat are generated by the word RNN, ending with a special end-of-phrase token, i.e. eop.
Step 4b. While the word RNN generates subword units, the rule RNN also updates its own hidden states, as illustrated by the blue cells in Fig. 1 Right.
Step 5. After the word RNN generates eop, TrDec switches back to the rule RNN to continue generating the derivation from where the tree left off. In our example, this next stage is the opening nonterminal node VP. From here, TrDec chooses the rule VP → pre NP.
TrDec repeats the process above, intermingling the rule RNN and the word RNN as described, and halts when the rule RNN generates the end-of-sentence token eos, completing the derivation.
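The steps above can be sketched as a small program that extracts CFG rules from a tree and linearizes the tree into the sequence of rule and word decisions the two RNNs would be trained to make. This is only an illustrative sketch, not the released implementation: the nested-tuple tree encoding, the function names, the PUNC → pre subtree, and the tagging of each action as "rule" or "word" are assumptions made for this example.

```python
from collections import defaultdict

# Assumed encoding: a nonterminal is (label, [children]); a terminal is a str.
# The tree follows the running example; the PUNC subtree is an assumption.
TREE = ("ROOT", [
    ("S", [
        ("NP", [("pre", ["_The", "_cat"])]),
        ("VP", [
            ("pre", ["_eat", "s"]),
            ("NP", [("pre", ["_fi", "sh"])]),
        ]),
        ("PUNC", [("pre", ["_."])]),
    ]),
])


def is_preterminal(node):
    """A preterminal is a nonterminal whose children are all terminals."""
    _, children = node
    kinds = {isinstance(child, str) for child in children}
    if len(kinds) > 1:
        raise ValueError("mixed terminal/nonterminal children are not handled here")
    return kinds == {True}


def rules_of(node):
    """Yield the (LHS, RHS) productions of one tree in top-down, left-to-right order."""
    label, children = node
    if is_preterminal(node):
        return
    yield (label, tuple(child[0] for child in children))
    for child in children:
        yield from rules_of(child)


def build_grammar(trees):
    """Collect every production seen in the given trees, indexed by LHS."""
    grammar = defaultdict(set)
    for tree in trees:
        for lhs, rhs in rules_of(tree):
            grammar[lhs].add(rhs)
    return grammar


def oracle_actions(tree):
    """Linearize a tree into interleaved rule actions and word actions."""
    actions = []

    def visit(node):
        label, children = node
        if is_preterminal(node):
            # The word RNN fills the preterminal, then emits end-of-phrase.
            actions.extend(("word", token) for token in children)
            actions.append(("word", "eop"))
        else:
            # The rule RNN expands the current opening nonterminal.
            actions.append(("rule", (label, tuple(c[0] for c in children))))
            for child in children:
                visit(child)

    visit(tree)
    actions.append(("rule", "eos"))  # the rule RNN ends the derivation
    return actions


if __name__ == "__main__":
    for action in oracle_actions(TREE):
        print(action)
```

Walking the tree in preorder reproduces the order of Steps 2 to 5: ROOT → S, S → NP VP PUNC, NP → pre, the subwords _The and _cat followed by eop, and then VP → pre NP.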

Model
We now describe the computations during the generation process discussed in § 2. First, a source sentence x, which is split into subwords, is encoded using a standard bidirectional Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997). This bidirectional LSTM outputs a set of hidden states, which TrDec will reference using an attention function (Bahdanau et al., 2015).
As discussed, TrDec uses two RNNs to generate a target parse tree. In our work, both of these RNNs use LSTMs, but with different parameters.
Rule RNN. At any time step t in the rule RNN, there are two possible actions. If at the previous time step t − 1, TrDec generated a CFG rule, then the state $s^{\text{tree}}_t$ is computed by:

$$s^{\text{tree}}_t = \mathrm{LSTM}\left(\left[y^{\text{CFG}}_{t-1};\ c_{t-1};\ s^{\text{tree}}_p;\ s^{\text{word}}_t\right],\ s^{\text{tree}}_{t-1}\right)$$

where $y^{\text{CFG}}_{t-1}$ is the embedding of the CFG rule at time step t − 1; $c_{t-1}$ is the context vector computed by attention at $s^{\text{tree}}_{t-1}$, i.e. input feeding (Luong et al., 2015); $s^{\text{tree}}_p$ is the hidden state at the time step that generates the parent of the current node in the partial tree; $s^{\text{word}}_t$ is the hidden state of the most recent time step before t that generated a subword (note that $s^{\text{word}}_t$ comes from the word RNN, discussed below); and [•] denotes a concatenation.
Meanwhile, if at the previous time step t − 1, TrDec did not generate a CFG rule, then the update at time step t must come from a subword being generated by the word RNN. In that case, we also update the rule RNN similarly by replacing the embedding of the CFG rule with the embedding of the subword.
Word RNN. At any time step t, if the word RNN is invoked, its hidden state $s^{\text{word}}_t$ is:

$$s^{\text{word}}_t = \mathrm{LSTM}\left(\left[s^{\text{tree}}_p;\ w_{t-1};\ c_{t-1}\right],\ s^{\text{word}}_{t-1}\right)$$

where $s^{\text{tree}}_p$ is the hidden state of the rule RNN that generated the CFG rule above the current terminal; $w_{t-1}$ is the embedding of the word generated at time step t − 1; and $c_{t-1}$ is the attention context computed at the previous word RNN time step t − 1.
Softmax. At any step t, our softmax logits are $W \cdot \tanh\left[s^{\text{tree}}_t;\ s^{\text{word}}_t\right]$, where W varies depending on whether a rule or a subword unit is needed.
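As described in § 2, the rule RNN chooses among the CFG rules whose LHS is the current opening nonterminal. One way to realize this restriction, sketched below, is to normalize the logits over the allowed rules only. The text does not say whether the model masks a shared output layer or handles the restriction differently, so the function name `rule_distribution` and the masking strategy are assumptions for illustration.

```python
import math


def rule_distribution(logits, rules, open_nonterminal):
    """Softmax restricted to the rules whose LHS equals the open nonterminal.

    logits: one score per rule, aligned with `rules`.
    rules: list of (LHS, RHS) pairs, where RHS is a tuple of symbols.
    """
    allowed = [i for i, (lhs, _) in enumerate(rules) if lhs == open_nonterminal]
    if not allowed:
        raise ValueError(f"no rule expands {open_nonterminal!r}")
    # Subtract the maximum allowed logit for numerical stability.
    top = max(logits[i] for i in allowed)
    weights = {i: math.exp(logits[i] - top) for i in allowed}
    total = sum(weights.values())
    return {rules[i]: weights[i] / total for i in allowed}


if __name__ == "__main__":
    rules = [
        ("ROOT", ("S",)),
        ("S", ("NP", "VP", "PUNC")),
        ("S", ("NP", "VP")),
        ("NP", ("pre",)),
    ]
    logits = [0.3, 1.0, 1.0, -2.0]
    print(rule_distribution(logits, rules, "S"))
```

Rules with a different LHS receive no probability mass, so a sampled or greedily chosen rule always yields a well-formed derivation under the extracted grammar.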

Tree Structures
Unlike prior work on syntactic decoders designed for utilizing a specific type of syntactic information (Wu et al., 2017), TrDec is a flexible NMT model that can utilize any tree structure. Here we consider two categories of tree structures:
Syntactic Trees are generated using a third-party parser, such as the Berkeley parser (Petrov et al., 2006; Petrov and Klein, 2007). Fig. 2 Top Left illustrates an example constituency parse tree. We also consider a variation of standard constituency parse trees where all of their nonterminal tags are replaced by a null tag, which is visualized in Fig. 2 Top Right. In addition to constituency parse trees, TrDec can also utilize dependency parse trees via a simple procedure that converts a dependency tree into a constituency tree. Specifically, this procedure creates a parent node with null tag for each word, and then attaches each word to the parent node of its head word while preserving the word order. An example of this procedure is provided in Fig. 3.
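The dependency-to-constituency conversion is described only briefly, so the following is one possible reading rather than the authors' exact procedure. It assumes that each word gets a null-tagged parent node containing the word itself, that the parent node of every dependent is attached under the parent node of its head, and that children are ordered by word position. The input format (a list of words plus a list of head indices, with -1 marking the root) and the function names are hypothetical.

```python
def dependency_to_constituency(words, heads):
    """Convert a dependency tree into a null-tagged constituency tree.

    words: list of tokens in sentence order.
    heads: heads[i] is the index of the head of words[i], or -1 for the root.
    Returns nested (tag, [children]) tuples; terminals are plain strings.
    """
    dependents = [[] for _ in words]
    root = None
    for index, head in enumerate(heads):
        if head < 0:
            root = index
        else:
            dependents[head].append(index)
    if root is None:
        raise ValueError("the dependency tree has no root")

    def build(index):
        # Order the word and the subtrees of its dependents by position.
        positions = sorted(dependents[index] + [index])
        children = [words[k] if k == index else build(k) for k in positions]
        return ("null", children)

    return build(root)


def leaves(node):
    """Return the terminals of a tree from left to right."""
    _, children = node
    result = []
    for child in children:
        result.extend([child] if isinstance(child, str) else leaves(child))
    return result


if __name__ == "__main__":
    words = ["The", "cat", "eats", "fish"]
    heads = [1, 2, -1, 2]  # The <- cat <- eats -> fish
    print(dependency_to_constituency(words, heads))
```

For projective dependency trees, this ordering keeps the leaves in the original word order; non-projective inputs would need extra handling that is not covered here.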
Balanced Binary Trees are syntax-free trees constructed without any linguistic guidance. We use two slightly different versions of binary trees. Version 1 (Fig. 2 Bottom Left) is constructed by recursively splitting the target sentence in half and creating left and right subtrees from the left and right halves of the sentence respectively. Version 2 (Fig. 2 Bottom Right) is constructed by applying Version 1 on a list of nodes where consecutive words are combined together. All tree nodes in both versions have the null tag. We discuss these construction processes in more detail in Appendix A.1.
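A minimal sketch of both constructions follows. The exact details are deferred to Appendix A.1, so several choices here are assumptions: the left half receives the extra item when a span has odd length, "combined together" is taken to mean grouping consecutive words into pairs, and each leaf group is wrapped in its own null-tagged preterminal. The function names are hypothetical.

```python
def _split(groups):
    """Recursively halve a list of word groups into a null-tagged binary tree."""
    if len(groups) == 1:
        # A single group becomes a preterminal holding its words.
        return ("null", list(groups[0]))
    middle = (len(groups) + 1) // 2  # assumption: the left half gets the extra item
    return ("null", [_split(groups[:middle]), _split(groups[middle:])])


def binary_tree_v1(words):
    """Version 1: recursively split the sentence in half, one word per leaf."""
    if not words:
        raise ValueError("cannot build a tree for an empty sentence")
    return _split([[word] for word in words])


def binary_tree_v2(words):
    """Version 2: combine consecutive words into pairs, then apply Version 1."""
    if not words:
        raise ValueError("cannot build a tree for an empty sentence")
    pairs = [words[i:i + 2] for i in range(0, len(words), 2)]
    return _split(pairs)


def leaves(node):
    """Return the terminals of a tree from left to right."""
    _, children = node
    result = []
    for child in children:
        result.extend([child] if isinstance(child, str) else leaves(child))
    return result


if __name__ == "__main__":
    sentence = "_The _cat _eat s _fi sh _.".split()
    print(binary_tree_v1(sentence))
    print(binary_tree_v2(sentence))
```

Both versions preserve the order of the subwords at the leaves. Under the pairing assumption, Version 2 has roughly half as many preterminals as Version 1 and is therefore shallower.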
In the experiments detailed later, we evaluated TrDec with four different settings of tree structures: 1) the fully syntactic constituency parse trees; 2) constituency parse trees with null tags; 3) dependency parse trees; 4) a concatenation of both version 1 and version 2 of the binary trees (which effectively doubles the amount of training data and leads to slight increases in accuracy).

Experiments
Datasets. We evaluate TrDec on three datasets: 1) the KFTT (ja-en) dataset (Neubig, 2011), which consists of Japanese-English Wikipedia articles; 2) the IWSLT2016 German-English (de-en) dataset (Cettolo et al., 2016), which consists of TED Talks transcriptions; and 3) the LORELEI Oromo-English (or-en) dataset (LDC2017E29), which largely consists of texts from the Bible. Details are in Tab. 1. English sentences are parsed using Ckylark (Oda et al., 2015) for the constituency parse trees, and Stanford Parser (de Marneffe et al., 2006; Chen and Manning, 2014) for the dependency parse trees. We use byte-pair encoding (Sennrich et al., 2016) with 8K merge operations on ja-en, 4K merge operations on or-en, and 24K merge operations on de-en.
Baselines. We compare TrDec against four baselines: 1) seq2seq: the standard seq2seq model with attention; 2) CCG: a syntax-aware translation model that interleaves Combinatory Categorial Grammar (CCG) tags with words on the target side of a seq2seq model (Nadejde et al., 2017); 3) CCG-null: the same model as CCG, but with all syntactic tags replaced by a null tag; and 4) LIN: a standard seq2seq model that generates linearized parse trees on the target side (Aharoni and Goldberg, 2017).
Results. Tab. 2 presents the performance of our model and the baselines. For our model, we report the performance of TrDec-con, TrDec-con-null, TrDec-dep, and TrDec-binary (settings 1, 2, 3, and 4 in § 4). On the low-resource or-en dataset, we observe a large variance across random seeds, so we run each model with 6 different seeds and report the mean and standard deviation of these runs. TrDec-con-null and TrDec-con achieved comparable results, indicating that the syntactic labels have neither a large positive nor a large negative impact on TrDec. For ja-en and or-en, syntax-free TrDec outperforms all baselines. On de-en, TrDec loses to CCG-null, but the difference is not statistically significant (p > 0.1).
Length Analysis. We performed a variety of analyses to elucidate the differences between the translations of different models, and the most conclusive results came from an analysis based on the length of the translations. First, we categorize the ja-en test set into buckets by the length of the reference sentences, and compare the models for each length category. Fig. 4 shows the gains in BLEU score over seq2seq for the tree-based models. Since TrDec-con outperforms TrDec-dep on all datasets, we focus only on TrDec-con when analyzing TrDec's performance with syntactic trees. The relative performance of CCG decreases on long sentences. However, TrDec, with both parse trees and syntax-free binary trees, delivers more improvement on longer sentences. This suggests that TrDec is better at capturing long-term dependencies during decoding. Surprisingly, TrDec-binary, which does not utilize any linguistic information, outperforms TrDec-con for all sentence length categories. Second, Fig. 5 shows a histogram of translations by the length difference between the generated output and the reference. This provides an explanation of the difficulty of using parse trees. Ideally, this distribution would be concentrated around zero, indicating that the MT system generates translations of about the same length as the reference. However, the distribution of TrDec-con is more spread out than that of TrDec-binary, which indicates that it is more difficult for TrDec-con to generate sentences with an appropriate target length. This is probably because constituency parse trees of sentences with a similar number of words can have very different depths, and thus a larger variance in the number of generation steps, likely making it difficult for the MT model to plan the sentence structure a priori before actually generating the child sentences.
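The two length-based analyses can be sketched as follows. The bucket width of 10 tokens, the open-ended final bucket, whitespace tokenization, and the function names are all assumptions made for illustration; the text does not specify how the buckets were chosen or how length was measured.

```python
from collections import Counter


def length_bucket(reference_length, width=10, max_length=50):
    """Map a reference length to a bucket label (hypothetical bucket boundaries)."""
    if reference_length >= max_length:
        return f">={max_length}"
    low = (reference_length // width) * width
    return f"{low}-{low + width - 1}"


def length_difference_histogram(hypotheses, references):
    """Count sentences by len(hypothesis) - len(reference), measured in tokens."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must be aligned")
    return Counter(
        len(hypothesis.split()) - len(reference.split())
        for hypothesis, reference in zip(hypotheses, references)
    )


if __name__ == "__main__":
    references = ["the cat eats fish .", "a dog"]
    hypotheses = ["the cat eats fish", "a dog"]
    print(length_difference_histogram(hypotheses, references))
    print([length_bucket(len(reference.split())) for reference in references])
```

A histogram concentrated at zero corresponds to a system whose outputs match the reference length; a wider spread corresponds to the behavior reported for TrDec-con.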

Conclusion
We propose TrDec, a novel tree-based decoder for NMT that generates translations along with the target-side tree topology. We evaluate TrDec on both linguistically inspired parse trees and synthetic, syntax-free binary trees. Our model, when used with synthetic balanced binary trees, outperforms CCG, the existing state of the art in incorporating syntax in NMT models.
The interesting result that syntax-free trees outperform their syntax-driven counterparts raises a natural question for future work: how do we better model syntactic structure in these models? It would also be interesting to study the effect of using source-side syntax together with the target-side syntax supported by TrDec.

Figure 1: An example generation process of TrDec. Left: A target parse tree. The green squares represent preterminal nodes. Right: How our RNNs generate the parse tree on the left. The blue cells represent the activities of the rule RNN, while the grey cells represent the activities of the word RNN. eop and eos are the end-of-phrase and end-of-sentence tokens. Best viewed in color.

Figure 2: An example of four tree structures (details of preterminals and subword units omitted for illustration purposes).

Figure 3: Conversion of a dependency tree for TrDec. Left: original dependency tree. Right: after conversion.

Table 2: BLEU scores of TrDec and other baselines. Statistical significance is indicated with * (p < 0.05) and ** (p < 0.001), compared with the best baseline.