Sequence-to-Dependency Neural Machine Translation

Nowadays a typical Neural Machine Translation (NMT) model generates translations from left to right as a linear sequence, during which latent syntactic structures of the target sentences are not explicitly considered. Inspired by the success of using syntactic knowledge of the target language to improve statistical machine translation, in this paper we propose a novel Sequence-to-Dependency Neural Machine Translation (SD-NMT) method, in which the target word sequence and its corresponding dependency structure are jointly constructed and modeled, and this structure is used as context to facilitate word generation. Experimental results show that the proposed method significantly outperforms state-of-the-art baselines on Chinese-English and Japanese-English translation tasks.


Introduction
Recently, Neural Machine Translation (NMT) with the attention-based encoder-decoder framework (Bahdanau et al., 2015) has achieved significant improvements in translation quality for many language pairs (Bahdanau et al., 2015; Luong et al., 2015a; Tu et al., 2016). In a conventional NMT model, an encoder reads in source sentences of various lengths and transforms them into sequences of intermediate hidden vector representations. After being weighted by attention operations, the combined hidden vectors are used by the decoder to generate translations. In most cases, both the encoder and the decoder are implemented as recurrent neural networks (RNNs).
* Contribution during internship at Microsoft Research.
Many methods have been proposed to further improve the sequence-to-sequence NMT model since it was first proposed by Sutskever et al. (2014) and Bahdanau et al. (2015). Previous work ranges from addressing the problem of out-of-vocabulary words (Jean et al., 2015) and designing attention mechanisms (Luong et al., 2015a) to more efficient parameter learning (Shen et al., 2016) and using source-side syntactic trees for better encoding (Eriguchi et al., 2016). All these NMT models employ a sequential recurrent neural network for target generation. Although in theory an RNN is able to remember a sufficiently long history, we still observe a substantial number of incorrect translations that violate long-distance syntactic constraints. This suggests that it is still very challenging for a linear RNN to learn models that effectively capture many subtle long-range word dependencies. For example, Figure 1 shows an incorrect translation related to a long-distance dependency. The translation fragment in italics is locally fluent around the word is, but from a global view the translation is ungrammatical. In fact, this part of the translation should be mostly affected by the distant plural noun foreigners rather than by the nearby words Venezuelan government.
Fortunately, such long-distance word correspondence can be well addressed and modeled by syntactic dependency trees. In Figure 1, the head word foreigners in the partial dependency tree (top dashed box) provides the correct structural context for the next target word. With this information, the model is more likely to generate the correct word will rather than is. This structure has been successfully applied to significantly improve the performance of statistical machine translation (Shen et al., 2008). On the NMT side, introducing target syntactic structures could help solve the problem of ungrammatical output because it brings two advantages over state-of-the-art NMT models: a) syntactic trees can be used to model the grammatical validity of translation candidates; b) partial syntactic structures can be used as additional context to facilitate future target word prediction.
Figure 1: Dependency trees help the prediction of the next target word. "NMT" refers to the translation result from a conventional NMT model, which fails to capture the long-distance word relation denoted by the dashed arrow.
However, it is not trivial to build and leverage syntactic structures on the target side in the current NMT framework. Several practical challenges arise: (1) how to model syntactic structures such as dependency parse trees with recurrent neural networks; (2) how to efficiently perform both target word generation and syntactic structure construction simultaneously in a single neural network; (3) how to effectively leverage target syntactic context to help target word generation.
To address these issues, we propose and empirically evaluate a novel Sequence-to-Dependency Neural Machine Translation (SD-NMT) model in this paper. An SD-NMT model encodes source inputs with bi-directional RNNs and associates them with target word prediction via an attention mechanism, as in most NMT models, but it comes with a new decoder which is able to jointly generate target translations and construct their syntactic dependency trees. The key difference from conventional NMT decoders is that we use two RNNs, one for translation generation and the other for dependency parse tree construction, in which incremental parsing is performed with the arc-standard shift-reduce algorithm proposed by Nivre (2004).
We will describe in detail how these two RNNs work interactively in Section 3.
We evaluate our method on publicly available data sets with Chinese-English and Japanese-English translation tasks. Experimental results show that our model significantly improves translation accuracy over the conventional NMT and SMT baseline systems.

Neural Machine Translation
As a new paradigm for machine translation, NMT is an end-to-end framework (Sutskever et al., 2014; Bahdanau et al., 2015) which directly models the conditional probability $P(Y|X)$ of target translation $Y = y_1, y_2, ..., y_n$ given source sentence $X = x_1, x_2, ..., x_m$. An NMT model consists of two parts: an encoder and a decoder. Both of them utilize recurrent neural networks, which in practice can be a Gated Recurrent Unit (GRU) (Cho et al., 2014) or a Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997). The encoder bidirectionally encodes a source sentence into a sequence of hidden vectors $H = h_1, h_2, ..., h_m$ with a forward RNN and a backward RNN. Then the decoder predicts target words one by one with probability
$$P(Y|X) = \prod_{j=1}^{n} P(y_j | y_{<j}, H) \qquad (1)$$
Typically, for the $j$th target word, the probability $P(y_j | y_{<j}, H)$ is computed as
$$P(y_j | y_{<j}, H) = g(s_j, y_{j-1}, c_j) \qquad (2)$$
where $g$ is a nonlinear function that outputs the probability of $y_j$, and $s_j$ is the RNN hidden state. The context $c_j$ is calculated at each timestamp $j$ based on $H$ by the attention network
$$c_j = \sum_{i=1}^{m} a_{ji} h_i \qquad (3)$$
$$a_{ji} = \frac{\exp(e_{ji})}{\sum_{k=1}^{m} \exp(e_{jk})} \qquad (4)$$
$$e_{ji} = v_a^{\top} \tanh(W_a s_{j-1} + U_a h_i) \qquad (5)$$
where $v_a$, $W_a$, $U_a$ are the weight matrices. The attention mechanism is effective at modeling the correspondences between source and target.
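As a rough illustration of the attention step described above, the following pure-Python sketch computes attention weights and a context vector using the additive scoring form of Bahdanau et al. (2015). The function names, the toy dimensions and all weight values are illustrative assumptions and not the authors' implementation.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def attention_context(s_prev, H, W_a, U_a, v_a):
    """Return (weights, context) for one decoder step.

    s_prev: previous decoder hidden state.
    H: list of encoder hidden vectors h_1..h_m.
    Each score is v_a . tanh(W_a s_prev + U_a h_i), followed by a softmax.
    """
    projected_state = matvec(W_a, s_prev)
    scores = []
    for h in H:
        projected_h = matvec(U_a, h)
        scores.append(sum(v * math.tanh(a + b)
                          for v, a, b in zip(v_a, projected_state, projected_h)))
    # Numerically stable softmax over the source positions.
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of encoder states.
    dim = len(H[0])
    context = [sum(w * h[d] for w, h in zip(weights, H)) for d in range(dim)]
    return weights, context


# Toy example with made-up numbers (hypothetical, for illustration only).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_prev = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attention_context(s_prev, H, identity, identity, [1.0, 1.0])
```

In this toy setting the third source position receives the largest weight, because its projection agrees most with the decoder state under the assumed parameters.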

Dependency Tree Construction
We use a shift-reduce transition-based dependency parser to build the syntactic structure for the target language in our work. Specifically, we adopt the arc-standard algorithm (Nivre, 2004) to perform incremental parsing during the translation process. In this algorithm, a stack and a buffer are maintained to store the parsing state, over which three kinds of transition actions are applied. Let $w_0$ and $w_1$ be the two topmost words in the stack, and $w$ be the current new word in the input sequence. The three transition actions are described below.
• Shift (SH): Push $w$ onto the stack.
• Left-Reduce (LR(d)): Link $w_0$ and $w_1$ with dependency label $d$ as $w_0 \xrightarrow{d} w_1$, and reduce them to the head $w_0$.
• Right-Reduce (RR(d)): Link $w_0$ and $w_1$ with dependency label $d$ as $w_0 \xleftarrow{d} w_1$, and reduce them to the head $w_1$.
During parsing, a specific structure is used to record the dependency relationships between the words of the input sentence. The parsing finishes when the stack is empty and all input words are consumed. As each word must be pushed onto the stack once and popped off once, the number of actions needed to parse a sentence is always 2n, where n is the length of the sentence (Nivre, 2004). Because each valid transition action sequence corresponds to a unique dependency tree, a dependency tree can also be equivalently represented by a sequence of transition actions.
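The transition system above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' parser: the action encoding ("SH", "LR:label", "RR:label"), the function name and the dependency labels in the example are assumptions made here. It also assumes that $w_0$ is the top of the stack and $w_1$ the word below it, which is the reading consistent with the "who are you" example used later in the paper.

```python
def apply_actions(words, actions):
    """Run arc-standard transitions and return (stack, arcs).

    Each arc is a (head_index, dependent_index, label) triple.
    Actions are "SH", "LR:<label>" or "RR:<label>" (an assumed encoding).
    """
    buffer = list(range(len(words)))  # indices of words not yet shifted
    stack, arcs = [], []
    for action in actions:
        name, _, label = action.partition(":")
        if name == "SH":
            # Shift: move the next input word onto the stack.
            stack.append(buffer.pop(0))
        elif name in ("LR", "RR"):
            w0 = stack.pop()  # topmost word
            w1 = stack.pop()  # second topmost word
            # LR keeps w0 as head, RR keeps w1 as head.
            head, dep = (w0, w1) if name == "LR" else (w1, w0)
            arcs.append((head, dep, label))
            stack.append(head)
        else:
            raise ValueError(f"unknown action: {action}")
    return stack, arcs


# Labels "attr" and "nsubj" are placeholders chosen for this example.
words = ["who", "are", "you"]
stack, arcs = apply_actions(words, ["SH", "SH", "LR:attr", "SH", "RR:nsubj"])
```

After these five actions, "are" (index 1) is the only word left on the stack and heads both "who" and "you".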

Sequence-to-Dependency Neural Machine Translation
An SD-NMT model is an extension of the conventional NMT model, augmented with syntactic structural information of the target translation. Given a source sentence $X = x_1, x_2, .., x_m$, its target translation $Y = y_1, y_2, .., y_n$ and $Y$'s dependency parse tree $T$, the goal of the extension is to enable us to compute the joint probability $P(Y, T|X)$. As in most structural learning tasks, the full prediction of $Y$ and $T$ is further decomposed into a chain of smaller predictions. The translation $Y$ is generated in left-to-right order as $y_1, y_2, .., y_n$, as in a normal sequence-to-sequence model. For $Y$'s parse tree $T$, instead of directly modeling the tree itself, we predict a parsing action sequence $A$ which maps $Y$ to $T$. Thus at the top level our SD-NMT model can be formulated as
$$P(Y, T|X) = P(Y, A|X) = P(y_1 y_2 .. y_n, a_1, a_2 .. a_l | X) \qquad (6)$$
where $A = a_1, a_2, .., a_j, .., a_l$ with length $l$ ($l = 2n$) and $a_j \in \{SH, RR(d), LR(d)\}$. Two recurrent neural networks, Word-RNN and Action-RNN, are used to model the generation processes of the translation sequence $Y$ and the parsing action sequence $A$ respectively. Figure 2 shows an example of how the translation $Y$ and its parsing actions are predicted step by step.
Figure 2: Decoding example of our SD-NMT model for target sentence "who are you" with transition action sequence "SH SH LR SH RR". The ending symbol EOS is omitted.
Because the lengths of Word-RNN and Action-RNN are different, they are designed to work in a mutually dependent way: a target word is only allowed to be generated when the SH action is predicted in the action sequence. In this way, we can perform incremental dependency parsing for translation Y and at the same time track the partial parsing status through the translation generation process.
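For notational clarity, we introduce a virtual translation sequence $\hat{Y} = \hat{y}_1, \hat{y}_2, .., \hat{y}_j, .., \hat{y}_l$ for Word-RNN, which has the same length $l$ as the transition action sequence. $\hat{y}_j$ is defined as
$$\hat{y}_j = \begin{cases} y_{v_j} & \text{if } a_j = SH \\ \hat{y}_{j-1} & \text{otherwise} \end{cases}$$
where $v_j$ is the number of SH actions among $a_1, .., a_j$, i.e. the index of the most recently generated target word. The mapping from $\hat{Y}$ to $Y$ is deterministic, and $Y$ can be easily derived given $\hat{Y}$ and $A$.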
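To make the relation between $Y$, $A$ and the virtual sequence concrete, here is a small Python sketch. It assumes that a word is emitted exactly when the action is SH and that the previous virtual word is copied otherwise, as the surrounding text describes; the function names are hypothetical and labels on reduce actions are omitted for brevity.

```python
def virtual_sequence(Y, A):
    """Build the virtual sequence (one entry per action) from words Y and actions A."""
    y_hat = []
    shifted = 0  # number of SH actions seen so far
    for action in A:
        if action == "SH":
            # A new target word is generated only on SH.
            shifted += 1
            y_hat.append(Y[shifted - 1])
        else:
            # On a reduce action the previous virtual word is copied.
            y_hat.append(y_hat[-1])
    return y_hat


def recover_words(y_hat, A):
    """Invert the mapping: keep the virtual words aligned with SH actions."""
    return [word for word, action in zip(y_hat, A) if action == "SH"]


Y = ["who", "are", "you"]
A = ["SH", "SH", "LR", "SH", "RR"]
y_hat = virtual_sequence(Y, A)
```

For this action sequence the third virtual word is a copy of the second target word, and filtering by SH recovers the original translation.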
With the notation of $\hat{Y}$, the sequence probabilities of $Y$ and $A$ can be written as
$$P(Y|X, A) = \prod_{j=1}^{l} P(\hat{y}_j | \hat{y}_{<j}, X, A_{\le j})^{\delta(SH, a_j)} \qquad (7)$$
$$P(A|X, \hat{Y}_{<l}) = \prod_{j=1}^{l} P(a_j | a_{<j}, X, \hat{Y}_{<j}) \qquad (8)$$
where $\hat{Y}_{<j}$ (equivalently $\hat{y}_{<j}$) refers to the subsequence $\hat{y}_1, \hat{y}_2, .., \hat{y}_{j-1}$, $A_{\le j}$ to $a_1, a_2, .., a_j$, and $\delta(SH, a_j)$ is 1 if $a_j = SH$ and 0 otherwise. Based on Equations 7 and 8, the overall joint model can be computed as
$$P(Y, A|X) = \prod_{j=1}^{l} P(\hat{y}_j | \hat{y}_{<j}, X, A_{\le j})^{\delta(SH, a_j)} \, P(a_j | a_{<j}, X, \hat{Y}_{<j})$$
As we have two RNNs in our model, the termination condition is also different from that of a conventional NMT model. In decoding, we maintain a stack to track the parsing configuration, and our model terminates once the Word-RNN predicts a special ending symbol EOS and all the words in the stack have been reduced.
Figure 3 (a) gives an overview of our SD-NMT model. Due to space limitations, the detailed interconnections between the two RNNs are only illustrated at timestamp $j$. The encoder of our model follows the standard bidirectional RNN configuration. At timestamp $j$ during decoding, our model first predicts an action $a_j$ by Action-RNN, then Word-RNN checks the condition gate $\delta$ according to $a_j$. If $a_j = SH$, the Word-RNN generates a new state (solid arrow) and predicts a new target word $y_{v_j}$; otherwise it just copies the previous state (dashed arrow) to the current state. For example, at timestamp 3, $a_3 \ne SH$, so the state of Word-RNN is copied from its previous one. Meanwhile, $\hat{y}_3 = y_2$ is used as the immediately preceding word in the translation history.
When computing attention scores, we extend Equation 5 by replacing the decoder hidden state with the concatenation of the Word-RNN hidden state $s$ and the Action-RNN hidden state $s'$ (gray boxes in Figure 3). The new attention score is then updated as
$$e_{ji} = v_a^{\top} \tanh(W_a [s_{j-1}; s'_{j-1}] + U_a h_i)$$

Syntactic Context for Target Word Prediction
Syntax has been proven useful for sentence generation tasks (Dyer et al., 2016). We propose to leverage target syntax to help translation generation. In our model, the syntactic context $K_j$ at timestamp $j$ is defined as a vector which is computed by a feed-forward network based on the current parsing configuration of Action-RNN. Let $w_0$ and $w_1$ be the two topmost words in the stack, $w_{0l}$ and $w_{1l}$ their leftmost modifiers in the partial tree, and $w_{0r}$ and $w_{1r}$ their rightmost modifiers. We define two unigram features and four bigram features. The unigram features are $w_0$ and $w_1$, which are represented by their word embedding vectors. The bigram features are $w_0 w_{0l}$, $w_0 w_{0r}$, $w_1 w_{1l}$ and $w_1 w_{1r}$; each of them is computed from the embeddings of its two words with the weight matrices $W_b$ and $U_b$. Feature templates of this kind have been proven effective in the dependency parsing task (Zhang and Clark, 2008). Based on these features, the syntactic context vector $K_j$ is computed with the weight matrices $W_k$ and $U_k$, where $E$ stands for the embedding matrix. Figure 3 (b) gives an overview of the construction of $K_j$. Note that a zero vector is used as padding for words that are not available in the partial tree, so that all the $K$ vectors have the same input size in computation.
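The feature slots described above can be read off a partial parse as in the following Python sketch. It only illustrates how the six words are selected and padded; the network layers that turn them into $K_j$ are omitted. The function names and the toy embeddings are assumptions, and so is the choice to look for the leftmost modifier among left dependents and the rightmost modifier among right dependents.

```python
SLOT_NAMES = ("w0", "w1", "w0l", "w0r", "w1l", "w1r")


def context_slots(stack, arcs):
    """Return the six feature slots as word indices (None if unavailable).

    stack: word indices, with the last element being the top (w0).
    arcs: (head, dependent) pairs of the partial dependency tree.
    """
    def modifiers(head):
        if head is None:
            return None, None
        left = [d for h, d in arcs if h == head and d < head]
        right = [d for h, d in arcs if h == head and d > head]
        return (min(left) if left else None, max(right) if right else None)

    w0 = stack[-1] if len(stack) >= 1 else None
    w1 = stack[-2] if len(stack) >= 2 else None
    w0l, w0r = modifiers(w0)
    w1l, w1r = modifiers(w1)
    return dict(zip(SLOT_NAMES, (w0, w1, w0l, w0r, w1l, w1r)))


def slot_vectors(slots, E, dim):
    """Look up embeddings, using a zero vector for missing words."""
    return [E[slots[name]] if slots[name] is not None else [0.0] * dim
            for name in SLOT_NAMES]


# State for "who are you" after SH SH LR SH: stack = [are, you], who <- are.
slots = context_slots(stack=[1, 2], arcs=[(1, 0)])
E = {0: [0.1, 0.2], 1: [0.3, 0.4], 2: [0.5, 0.6]}  # toy embeddings
vectors = slot_vectors(slots, E, dim=2)
```

Because every missing slot is replaced by a zero vector, the list of vectors always has six entries of the same size, which matches the fixed input size mentioned in the text.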
Adding $K_j$ to Equation 2, the probabilities of the transition action and the word in Equations 7 and 8 are then updated as
$$P(a_j | a_{<j}, X, \hat{Y}_{<j}) = g(s'_j, a_{j-1}, c_j, K_j) \qquad (12)$$
$$P(\hat{y}_j | \hat{y}_{<j}, X, A_{\le j}) = g(s_j, \hat{y}_{j-1}, c_j, K_j) \qquad (13)$$
After each prediction step in Word-RNN and Action-RNN, the syntactic context vector $K$ is updated accordingly. Note that $K$ is not used to calculate the recurrent states $s$ in this work.

Model Training and Decoding
For the SD-NMT model, we use the sum of the log-likelihoods of the word sequence and the action sequence as the training objective, so that the joint probability of target translations and their parse trees is maximized:
$$J(\theta) = \sum_{(X, Y, A) \in D} \left[ \log P(Y|X, A) + \log P(A|X, \hat{Y}_{<l}) \right] \qquad (14)$$
where $D$ denotes the training data and $\theta$ the model parameters. We also use mini-batches for model training. As the target dependency trees are known in the bilingual corpus during training, we pre-compute the partial tree state and syntactic context at each timestamp for each training instance. Thus it is easy for the model to process multiple trees in one batch.
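In the decoding process of an SD-NMT model, the score of each search path is the sum of the log probabilities of the target word sequence and the transition action sequence, each normalized by its sequence length:
$$S(Y, A) = \frac{1}{n} \sum_{j=1}^{l} \delta(SH, a_j) \log P(\hat{y}_j | \hat{y}_{<j}, X, A_{\le j}) + \frac{1}{l} \sum_{j=1}^{l} \log P(a_j | a_{<j}, X, \hat{Y}_{<j}) \qquad (15)$$
where $n$ is the word sequence length, $l$ is the action sequence length, and $\delta(SH, a_j)$ is 1 if $a_j = SH$ and 0 otherwise.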
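The length-normalized path score can be illustrated with a short Python sketch. It takes the per-step log probabilities of the generated words and of the predicted actions as plain lists; the function name and the toy probabilities are assumptions used only for illustration.

```python
import math


def path_score(word_logps, action_logps):
    """Sum of length-normalized word and action log probabilities.

    word_logps: log P of each generated target word (n entries).
    action_logps: log P of each transition action (l entries).
    """
    n, l = len(word_logps), len(action_logps)
    if n == 0 or l == 0:
        raise ValueError("both sequences must be non-empty")
    return sum(word_logps) / n + sum(action_logps) / l


# Toy hypothesis: two words and four actions with made-up probabilities.
word_logps = [math.log(0.5), math.log(0.25)]
action_logps = [math.log(0.5)] * 4
score = path_score(word_logps, action_logps)
```

Normalizing each term by its own length keeps the word and action contributions on a comparable scale even though an action sequence is roughly twice as long as the word sequence.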

Experiments
The experiments are conducted on the Chinese-English task as well as the Japanese-English translation task, where the same data set from the WAT 2016 ASPEC corpus is used for a fair comparison with other work. In addition to evaluating translation performance, we also investigate the quality of dependency parsing as a by-product and the effect of parsing quality on translation quality.

Setup
In the Chinese-English task, the bilingual training data consists of a set of LDC datasets, which has around 2M sentence pairs. We use NIST2003 as the development set, and the test sets contain NIST2005, NIST2006, NIST2008 and NIST2012. All English words are lowercased. In the Japanese-English task, we use the top 1M sentence pairs from the ASPEC Japanese-English corpus. The development data contains 1,790 sentences, and the test data contains 1,812 sentences with a single reference per source sentence.
To train the SD-NMT model, target dependency tree references are needed. As there is no gold annotation of parse trees for the target training data, we use pseudo parsing results as the target dependency references, which are obtained from an in-house arc-eager dependency parser based on the work in (Zhang and Nivre, 2011).
In the neural network training, the vocabulary size is limited to the 30K most frequent words for both source and target languages. All low-frequency words are normalized into a special token unk and post-processed following the work in (Luong et al., 2015b). The size of the word embedding and the transition action embedding is set to 512. The dimensions of the hidden states for all RNNs are set to 1024. All model parameters are initialized randomly with a Gaussian distribution (Glorot and Bengio, 2010) and trained on an NVIDIA Tesla K40 GPU. The stochastic gradient descent (SGD) algorithm is used to tune parameters with a learning rate of 1.0. The batch size is set to 96. In the update procedure, the Adadelta (Zeiler, 2012) algorithm is used to automatically adapt the learning rate. The beam sizes for both word prediction and transition action prediction are set to 12 in decoding.
The baselines in our experiments are a phrasal system and a neural translation system, denoted by HPSMT and RNNsearch respectively. HPSMT is an in-house implementation of the hierarchical phrase-based model (Chiang, 2005), where a 4-gram language model is trained using the modified Kneser-Ney smoothing (Kneser and Ney, 1995) algorithm over the English Gigaword corpus (LDC2009T13) plus the target data from the bilingual corpus. RNNsearch is an in-house implementation of the attention-based neural machine translation model (Bahdanau et al., 2015) using the same parameter settings as our SD-NMT model, including word embedding size, hidden vector dimension and beam size, as well as the same mechanism for OOV word processing.
The evaluation results are reported with case-insensitive IBM BLEU-4 (Papineni et al., 2002). A statistical significance test is performed using the bootstrap resampling method proposed by Koehn (2004) with a 95% confidence level. For the Japanese-English task, we use the official evaluation procedure provided by WAT 2016, where both BLEU and RIBES (Isozaki et al., 2010) are used for evaluation.
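The idea behind paired bootstrap resampling can be sketched as follows. This is a simplified illustration and not the evaluation script used in the experiments: it assumes a per-sentence quality score for each system and compares summed scores on each resampled test set, whereas corpus-level BLEU would have to be recomputed from n-gram statistics for every sample. The function name, the number of samples and the seed are assumptions.

```python
import random


def paired_bootstrap(scores_a, scores_b, samples=1000, seed=0):
    """Fraction of resampled test sets on which system A beats system B.

    scores_a, scores_b: per-sentence scores of the two systems on the
    same test sentences (a simplification of corpus-level BLEU).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        # Draw a test set of the same size, with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / samples


# Toy scores: system A is better on every sentence.
better = paired_bootstrap([0.4, 0.5, 0.6], [0.3, 0.4, 0.5])
same = paired_bootstrap([0.4, 0.5, 0.6], [0.4, 0.5, 0.6])
```

Under this scheme, a difference would be called significant at the 95% level when one system wins on at least 95% of the resampled test sets.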

Evaluation on Chinese-English Translation
We evaluate our method on the Chinese-English translation task. The evaluation results over all NIST test sets against the baselines are listed in Table 1. Generally, RNNsearch outperforms HPSMT by 3.78 BLEU points on average, while SD-NMT surpasses RNNsearch by 2.03 BLEU points on average. This shows that NMT models usually achieve better results than SMT models, and that our proposed sequence-to-dependency NMT model performs much better than the traditional sequence-to-sequence NMT model. We also investigate the effect of the syntactic knowledge context by excluding its computation in Equations 12 and 13. The alternative model is denoted by SD-NMT\K. According to Table 1, SD-NMT\K outperforms RNNsearch by 0.54 BLEU points but falls behind SD-NMT by 1.49 BLEU points on average, which demonstrates that the long-distance dependencies captured by the target syntactic knowledge context, such as leftmost/rightmost children together with their dependency relationships, bring strong positive effects on the prediction of target words.
In addition to translation quality, we compare the perplexity (PPL) changes on the development set in terms of the number of training mini-batches for RNNsearch and SD-NMT in Figure 4. We can see that the PPL of SD-NMT is initially higher than that of RNNsearch, but becomes lower over time. This is mainly because the quality of the parse trees is too poor at the beginning, which degrades translation quality and leads to higher PPL. After some training iterations, the SD-NMT
In our experiments, the time cost of SD-NMT is twice that of RNNsearch due to its more complicated model structure. But we think it is a worthwhile trade-off in pursuit of high-quality translations.

Evaluation on Japanese-English Translation
In this section, we report results on the Japanese-English translation task. To ensure fair comparisons, we use the same training data and follow the pre-processing steps recommended in WAT 2016. Table 2 shows the comparison results from 8 systems with the evaluation metrics of BLEU and RIBES. The results in the first 3 rows are produced by SMT systems taken from the official WAT 2016 results. The remaining results are produced by NMT systems, among which the bottom two rows are taken from our in-house NMT systems and the others refer to the work in (Cromieres, 2016), which gives the competitive NMT results on WAT 2016. According to Table 2, NMT results still outperform SMT results, similar to our Chinese-English evaluation results. The SD-NMT model significantly outperforms most other NMT models, which shows that our proposed approach to modeling the target dependency tree benefits NMT systems, since our RNNsearch baseline achieves performance comparable to the single-layer attention-based NMT system in (Cromieres, 2016). Note that our SD-NMT gets results comparable to the ensemble of 4 single-layer models in (Cromieres, 2016). We believe SD-NMT can obtain further improvements with an ensemble of multiple models in future experiments.

Effect of the Parsing Accuracy upon Translation Quality
The interaction between dependency tree construction and target word generation is investigated in this section. The experiments are conducted on the Chinese-English task over multiple test sets. We evaluate how the quality of the dependency trees affects translation performance.
In the decoding phase of SD-NMT, beam search is applied to the generation of both target words and transition actions, as illustrated in Equation 15. Intuitively, the larger the beam size of action prediction is, the better the dependency tree quality is. We fix the beam size for generating target words to 12 and vary the beam size for action prediction to see the difference. Figure 5 shows the evaluation results on all test sets. BLEU scores tend to increase as the action prediction beam size grows. The reason is that the translation quality increases as the quality of the dependency tree improves, which shows that the construction of dependency trees can boost the generation of target words, and, we believe, vice versa.
Figure 5: Translation performance against the beam size of action prediction.

Quality Estimation of Dependency Tree Construction
As a by-product, the quality of the dependency trees not only affects the performance of target word generation, but also influences possible downstream processors or tasks such as text analysis. Direct evaluation of tree quality is not feasible because gold references are unavailable. So we resort to estimating the consistency between the by-products and the parsing results of our stand-alone dependency parser, which has state-of-the-art performance. The higher the consistency is, the closer the performance of the by-product is to that of the stand-alone parser. To reduce the influence of ill-formed data as much as possible, we build the evaluation data set by heuristically selecting 360 SD-NMT translation results together with their dependency trees from the NIST test sets, where neither the source nor the target side contains unk and both have a length of 20-30. We then take the parsing results of the stand-alone parser for these translations as references to indirectly estimate the quality of the by-products. We get a UAS (unlabeled attachment score) of 94.96% and a LAS (labeled attachment score) of 93.92%, which demonstrates that the dependency trees produced by SD-NMT are very similar to the parsing results from the stand-alone parser.
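For readers unfamiliar with the two metrics, the following Python sketch shows how UAS and LAS are commonly computed when each token is represented by its head index and dependency label. The data layout, the function name and the toy trees are assumptions made for illustration; punctuation handling and other evaluation conventions are ignored.

```python
def attachment_scores(predicted, reference):
    """Return (UAS, LAS) for two parses of the same token sequence.

    predicted, reference: lists of (head_index, label), one per token.
    UAS counts tokens with the correct head; LAS additionally requires
    the correct dependency label.
    """
    if len(predicted) != len(reference) or not reference:
        raise ValueError("parses must be non-empty and of equal length")
    total = len(reference)
    uas_hits = sum(p[0] == r[0] for p, r in zip(predicted, reference))
    las_hits = sum(p == r for p, r in zip(predicted, reference))
    return uas_hits / total, las_hits / total


# Toy four-token example with made-up heads and labels.
reference = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
predicted = [(2, "nsubj"), (0, "root"), (2, "iobj"), (3, "punct")]
uas, las = attachment_scores(predicted, reference)
```

Here three of the four heads match and two of the four (head, label) pairs match, so LAS can never exceed UAS.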

Translation Example
In this section, we give a case study to explain how our method works. Figure 6 shows a translation example from the NIST test sets. SMT and RNNsearch refer to the translation results from the HPSMT and NMT baselines, respectively. For our SD-NMT model, we list both the generated translation and its corresponding dependency tree. We find that the translation of SMT is disfluent and ungrammatical, whereas RNNsearch is better than SMT. Although the translation of RNNsearch is locally fluent around the word "have" in the rectangle, its grammar is incorrect and its meaning is inaccurate from a global view. The word "have" should be in singular form, as its subject is "safety" rather than "workers". For our SD-NMT model, we can see that the translation is much better than the baselines and the dependency tree is reasonable. The reason is that after generating the word "workers", the previous subtree in the gray region is transformed into the syntactic context, which can guide the generation of the next word, as illustrated by the dashed arrow. Thus our model is more likely to generate the correct verb "is" in singular form. In addition, the global structure helps the model correctly identify the inverted sentence pattern of the previously translated part and make better choices for the subsequent translation ("only when .. can .." in our translation, "only when .. will .." in the reference), which remains a challenge for conventional NMT models.

Related Work
Incorporating linguistic knowledge into machine translation has been extensively studied in Statistical Machine Translation (SMT) (Galley et al., 2006; Shen et al., 2008; Liu et al., 2006). Liu et al. (2006) proposed a tree-to-string alignment template for SMT to leverage source-side syntactic information. Shen et al. (2008) proposed a target dependency language model for SMT to employ target-side structured information. These methods show promising improvements for SMT. Recently, neural machine translation (NMT) has achieved better performance than SMT in many language pairs (Luong et al., 2015a; Zhang et al., 2016; Shen et al., 2016; Neubig, 2016). In a vanilla NMT model, source and target sentences are treated as sequences, and the syntactic knowledge of both sides is neglected. Some efforts have been made to incorporate source syntax into NMT. Eriguchi et al. (2016) proposed a tree-to-sequence attentional NMT model in which a source-side parse tree was used, and achieved promising improvements. Intuitively, adding source syntactic information to NMT is straightforward, because the source sentence is fixed and it is easy to attach extra information to it. However, it is non-trivial to add target syntax, as the target words are uncertain during the decoding process. Up to now, there has been little work that attempts to build and leverage target syntactic information for NMT.
[Source] 只有 施工 人员 的 安全 得到 了 保证 , 才能 继续 施工 .
[Reference] only when the safety of the workers is guaranteed will they continue with the project .
[HPSMT] only safety is assured of construction personnel , to continue construction .
[RNNsearch] only when the safety of construction workers have been guaranteed to continue construction .
[SD-NMT] only when the safety of the workers is guaranteed can we continue to work .
Figure 6: Translation examples of SMT, RNNsearch and our SD-NMT on the Chinese-English translation task. The italic words on the arrows are dependency labels. The ending symbol EOS is omitted. RNNsearch fails to capture the long-distance dependency, which leads to an ungrammatical result, whereas with the help of the syntactic tree, our SD-NMT produces a much better translation.
There has been work that incorporates syntactic information into NLP tasks with neural networks. Dyer et al. (2016) presented an RNN grammar for parsing and language modeling. They replaced SH with a set of generative actions to generate words under a Stack LSTM framework (Dyer et al., 2015), which achieves promising results for language modeling on the Penn Treebank data. In our work, we propose to incorporate target syntactic trees into the NMT model to jointly learn target translation and dependency parsing, where the target syntactic context over the parse tree is used to improve translation quality.

Conclusion and Future Work
In this paper, we propose a novel string-to-dependency translation model over NMT. Our model jointly performs target word generation and arc-standard dependency parsing. Experimental results show that our method can boost the two procedures and achieve significant improvements in the translation quality of NMT systems.
In future work, along this research direction, we will try to integrate other prior knowledge, such as semantic information, into NMT systems. In addition, we will apply our method to other sequence-to-sequence tasks, such as text summarization, to verify its effectiveness.