Neural Machine Translation with Source Dependency Representation

Source dependency information has been successfully introduced into statistical machine translation. However, there have been only a few preliminary attempts for Neural Machine Translation (NMT), such as concatenating the representations of a source word and its dependency label. In this paper, we propose a novel NMT model with source dependency representation to improve translation performance, especially on long sentences. Empirical results on the NIST Chinese-to-English translation task show that our method achieves a 1.6 BLEU improvement on average over a strong NMT system.

In this paper, we enhance source representations with dependency information, which can capture source long-distance dependency constraints for word prediction. Indeed, source dependency information has been shown to be highly effective in SMT.

* Kehai Chen was an internship research fellow at NICT when conducting this work. † Corresponding author.
In this paper, we propose a novel NMT model with source dependency representation to improve translation performance. Compared with the simple approach of vector concatenation, we learn a Source Dependency Representation (SDR) to compute dependency context vectors and alignment matrices in a more sophisticated manner, which has the potential to make full use of source dependency information. To this end, we create a dependency unit for each source word to capture long-distance dependency constraints. We then design an Encoder with a convolutional architecture to jointly learn SDRs and source dependency annotations (Section 3), and compute dependency context vectors and hidden states with a novel double-context based Decoder for word prediction (Section 4). Empirical results on the NIST Chinese-to-English translation task show that the proposed approach achieves significant gains over the method of Sennrich and Haddow (2016), and thus delivers substantial improvements over the standard attentional NMT (Section 5).

Background
An NMT model consists of an Encoder process and a Decoder process, and hence it is often called the Encoder-Decoder model (Sutskever et al., 2014; Bahdanau et al., 2014). Typically, each unit of the source input x_j ∈ (x_1, ..., x_J) is first embedded as a vector V_{x_j}, and then represented as an annotation vector h_j by

    h_j = f_enc(V_{x_j}),                                        (1)

where f_enc is a bidirectional Recurrent Neural Network (RNN) (Bahdanau et al., 2014). These annotation vectors H = (h_1, ..., h_J) are used to generate the target words in the Decoder. An RNN Decoder computes the probability of the target word y_i by a softmax layer g:

    p(y_i | y_<i, x) = g(ŷ_{i-1}, s_i, c_i),                     (2)

where ŷ_{i-1} is the previously emitted word, and s_i is the RNN hidden state for the current time step:

    s_i = f(s_{i-1}, ŷ_{i-1}, c_i).                              (3)

The context vector c_i is computed as a weighted sum of the source annotations h_j:

    c_i = Σ_{j=1}^{J} α_{ij} h_j,                                (4)

where the normalized alignment weight α_{ij} is computed by

    α_{ij} = exp(e_{ij}) / Σ_{k=1}^{J} exp(e_{ik}),              (5)

and e_{ij} is an alignment score which indicates how well the inputs around position j and the output at position i match:

    e_{ij} = a(s_{i-1}, h_j),                                    (6)

where a is a feedforward neural network.
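As a rough sketch (not the authors' code), one decoding step of this attention mechanism can be written in a few lines of NumPy. The parameter names Wa, Ua, and va are illustrative stand-ins for the weights of the feedforward alignment network:

```python
import numpy as np

def attention_step(s_prev, H, Wa, Ua, va):
    """One attention step of the standard attentional Decoder.

    s_prev: previous decoder hidden state, shape (m,)
    H:      source annotations h_1..h_J stacked as rows, shape (J, n)
    Wa, Ua, va: weights of the feedforward alignment network (illustrative)
    Returns the context vector c_i and the alignment weights alpha_i.
    """
    # Alignment score for every source position j, matching the
    # feedforward scoring of the previous state against each annotation.
    e = np.tanh(s_prev @ Wa + H @ Ua) @ va        # shape (J,)
    # Normalized alignment weights (softmax over source positions).
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # Context vector: weighted sum of the source annotations.
    c = alpha @ H                                  # shape (n,)
    return c, alpha
```

The softmax subtracts the maximum score before exponentiation only for numerical stability; it does not change the resulting weights.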

Source Dependency Representation
In order to capture source long-distance dependency constraints, we extract a dependency unit U_j for each source word x_j from its dependency tree, inspired by the dependency-based bilingual composition sequence for SMT. The extracted U_j is defined as follows:

    U_j = ⟨PA_{x_j}, SI_{x_j}, CH_{x_j}⟩,

where PA_{x_j}, SI_{x_j}, and CH_{x_j} denote the parent, sibling, and children words of the source word x_j in the dependency tree. Take x_2 in Figure 2 as an example; the blue solid box U_2 denotes its dependency unit:

    U_2 = ⟨x_3, x_1, x_4, x_7, ε⟩.

We design a simplified neural network following Chen et al. (2017)'s Convolutional Neural Network (CNN) method to learn the SDR for each source dependency unit U_j, as shown in Figure 1. Our neural network consists of an input layer, two convolutional layers, two pooling layers, and an output layer:

• Input layer: the input layer takes the words of a dependency unit U_j in the form of an n×d matrix of embedding vectors, where n is the number of words in a dependency unit and d is the vector dimension of each word. In our experiments, we set n to 10 and d to 620. For dependency units shorter than n, we pad with "/" at the end of U_j. For example, the padded U_2 is ⟨x_3, x_1, x_4, x_7, ε, /, /, /, /, /⟩.
• Convolutional layer: the first convolutional layer applies one 3×d convolution kernel (with stride 1) to output an (n−2)×d matrix; the second convolutional layer applies one 3×d convolution kernel to output an ((n−2)/2)×d matrix.
• Max-pooling layer: the first pooling layer performs a row-wise max over every two consecutive rows to output an ((n−2)/4)×d matrix; the second pooling layer performs a row-wise max over every two consecutive rows to output an ((n−2)/8)×d matrix.
• Output layer: the output layer performs a row-wise average over the output of the second pooling layer to learn a compact d-dimensional vector V_{U_j} for U_j.

It should be noted that the dependency unit is similar to the source dependency feature of Sennrich and Haddow (2016), and the SDR is the same as the source-side representation of Chen et al. (2017). In comparison with Sennrich and Haddow (2016), who concatenate the source dependency labels and words to enhance the Encoder of NMT, we adopt a separate attention mechanism together with a CNN dependency Encoder. Compared with Chen et al. (2017), who extend the well-known neural network joint model (Devlin et al., 2014) with source dependency information to improve phrase-pair translation probability estimation for SMT, we focus on using source dependency information to enhance attention probability estimation and to learn the corresponding dependency context and RNN hidden state for improving translation.
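The layer pipeline above can be sketched in NumPy. This is an illustrative reading, not the authors' implementation: we assume the 3×d kernel acts as a length-3 filter applied down the rows of each embedding dimension (so a rows×d input yields a (rows−2)×d output), and that convolution and pooling alternate; the exact padding and layer ordering are not fully specified by the text.

```python
import numpy as np

def conv_rows(M, w):
    """Valid convolution down the rows with a length-3 filter w,
    shared across embedding dimensions: (rows, d) -> (rows-2, d)."""
    rows, k = M.shape[0], w.shape[0]
    return np.stack([w @ M[i:i + k] for i in range(rows - k + 1)])

def pool_rows(M):
    """Row-wise max over each pair of consecutive rows (halves the rows)."""
    return np.maximum(M[0::2], M[1::2])

def sdr(U_emb, w1, w2):
    """Compact SDR vector for one dependency unit given its (n, d)
    embedding matrix. For n = 10 the row counts are 10 -> 8 -> 4 -> 2 -> 1."""
    x = conv_rows(U_emb, w1)   # first convolutional layer
    x = pool_rows(x)           # first max-pooling layer
    x = conv_rows(x, w2)       # second convolutional layer
    x = pool_rows(x)           # second max-pooling layer
    return x.mean(axis=0)      # output layer: d-dimensional V_Uj
```

With n = 10 and d = 620 as in the experiments, the result is a single 620-dimensional vector per dependency unit.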

NMT with SDR
In this section, we propose two novel NMT models, SDRNMT-1 and SDRNMT-2, both of which make use of the SDR to enhance the Encoder and Decoder of NMT.

SDRNMT-1
Compared with standard attentional NMT, the Encoder of the SDRNMT-1 model consists of a convolutional architecture and a bidirectional RNN, as shown in Figure 2. Therefore, the proposed Encoder can not only learn compositional representations for dependency units but also alleviate the sparsity issues associated with large dependency units.
Motivated by Sennrich and Haddow (2016), we concatenate V_{x_j} and V_{U_j} as the input of the Encoder, as shown in the black dotted box in Figure 2. Source annotation vectors are learned from the concatenated representation with dependency information:

    h_j = f_enc([V_{x_j} : V_{U_j}]),

where ":" denotes vector concatenation. Finally, these learned annotation vectors serve as the input of the standard NMT Decoder to jointly learn alignment and translation. The only difference between our method and that of Sennrich and Haddow (2016) is that they use the dependency label representation instead of V_{U_j}.
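The SDRNMT-1 Encoder input is a simple per-position concatenation; a minimal sketch (function name and shapes are illustrative):

```python
import numpy as np

def encoder_inputs(word_embs, sdr_vecs):
    """Build the SDRNMT-1 Encoder inputs: for each source position j,
    concatenate the word embedding V_xj with its dependency-unit
    representation V_Uj (the ":" operation in the text).

    word_embs: (J, d) word embedding matrix
    sdr_vecs:  (J, d) SDR matrix, one row per source word
    Returns a (J, 2d) matrix that is fed to the bidirectional RNN Encoder."""
    return np.concatenate([word_embs, sdr_vecs], axis=1)
```

Everything downstream of this concatenation is the standard attentional Encoder-Decoder, which is what keeps SDRNMT-1 a small change over the baseline.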

SDRNMT-2
In SDRNMT-1, a single annotation, learned over the concatenation of the word representation and the SDR, is used to compute the context vector and the RNN hidden state for the current time step. To gain more translation performance from the SDR, we propose a double-context mechanism, as shown in Figure 3. First, the Encoder of SDRNMT-2 produces two independent annotations h_j and d_j:

    h_j = f^s_enc(V_{x_j}),    d_j = f^d_enc(V_{U_j}),

where H = [h_1, ..., h_J] and D = [d_1, ..., d_J] encode source sequential and long-distance dependency information, respectively. The Decoder learns the corresponding alignment matrices and context vectors over H and D. That is, following eq.(6), given the previous hidden states s^s_{i−1} and s^d_{i−1}, the current alignment scores e^s_{ij} and e^d_{ij} are computed over the source annotation vectors h_j and d_j, respectively:

    e^s_{ij} = a(s^s_{i−1}, h_j),    e^d_{ij} = a(s^d_{i−1}, d_j).

Following eq.(5), we further compute the current alignment weight ᾱ:

    ᾱ_{ij} = exp(λ e^s_{ij} + (1−λ) e^d_{ij}) / Σ_{k=1}^{J} exp(λ e^s_{ik} + (1−λ) e^d_{ik}),

where λ is a hyperparameter that controls the relative importance of H and D. Note that, compared with the original alignment model, which depends only on the sequential annotation vectors H, the alignment weight ᾱ_{ij} is jointly computed over the source sequential annotation vectors H and the dependency annotation vectors D.
The current context vectors c^s_i and c^d_i are computed by eq.(4):

    c^s_i = Σ_{j=1}^{J} ᾱ_{ij} h_j,    c^d_i = Σ_{j=1}^{J} ᾱ_{ij} d_j.

The current hidden states s^s_i and s^d_i are computed by eq.(3):

    s^s_i = f(s^s_{i−1}, ŷ_{i−1}, c^s_i),    s^d_i = f(s^d_{i−1}, ŷ_{i−1}, c^d_i).

Finally, according to eq.(2), the probability of the next target word is computed using the two hidden states s^s_i and s^d_i, the previously emitted word ŷ_{i−1}, the current sequential context vector c^s_i, and the dependency context vector c^d_i:

    p(y_i | y_<i, x) = g(ŷ_{i−1}, s^s_i, s^d_i, c^s_i, c^d_i).

Experiment
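A sketch of one SDRNMT-2 attention step follows. This is illustrative rather than the authors' code: we assume the two alignment scores are linearly interpolated by λ before the softmax normalization, and the score functions are passed in as black boxes.

```python
import numpy as np

def double_context_step(s_seq, s_dep, H, D, score_seq, score_dep, lam):
    """One double-context attention step (illustrative sketch).

    s_seq, s_dep: previous sequential / dependency decoder hidden states
    H, D:         sequential and dependency annotations, each (J, n)
    score_seq, score_dep: alignment models a(s, annotations) -> (J,) scores
    lam:          hyperparameter weighting the two alignment scores
    Returns the shared alignment weights and the two context vectors."""
    e_s = score_seq(s_seq, H)               # sequential alignment scores
    e_d = score_dep(s_dep, D)               # dependency alignment scores
    mixed = lam * e_s + (1.0 - lam) * e_d   # assumed score interpolation
    alpha = np.exp(mixed - mixed.max())
    alpha /= alpha.sum()                    # shared normalized weights
    c_s = alpha @ H                         # sequential context vector
    c_d = alpha @ D                         # dependency context vector
    return alpha, c_s, c_d
```

Both context vectors share the same weights ᾱ, so the dependency annotations directly influence where the sequential context attends, and vice versa.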

Setting up
We carry out experiments on Chinese-to-English translation. The training dataset consists of 1.42M sentence pairs extracted from LDC corpora. We use the Stanford dependency parser (Chang et al., 2009) to generate the dependency trees for Chinese. We choose the NIST 2002 (MT02) and the NIST 2003-2008 (MT03-08) datasets as the validation set and test sets, respectively. Case-insensitive 4-gram NIST BLEU score (Papineni et al., 2002) is used as the evaluation metric, and the sign-test (Collins et al., 2005) is used for statistical significance testing. The baseline systems include the standard Phrase-Based Statistical Machine Translation (PBSMT) system implemented in Moses (Koehn et al., 2007) and the standard attentional NMT (AttNMT) (Bahdanau et al., 2014), in which only the source word representation is utilized. We also compare with a state-of-the-art syntax-enhanced NMT method (Sennrich and Haddow, 2016). For a fair comparison, we only utilize dependency information for Sennrich and Haddow (2016), calling this system Sennrich-deponly. We re-implement the baseline methods on the Nematus toolkit (Sennrich et al., 2017).
For all NMT systems, we limit the source and target vocabularies to 30K, and the maximum sentence length is 80. The word embedding dimension is 620, the hidden layer dimension is 1000, and all layers use the dropout training technique (Hinton et al., 2012). We shuffle the training set before training, and the mini-batch size is 80. Training is conducted on a single Tesla P100 GPU. All NMT models are trained for 15 epochs using ADADELTA (Zeiler, 2012); training takes 6 days, which is 25% slower than the standard NMT.

Table 1: Results on NIST Chinese-to-English Translation Task. "*" indicates statistically significantly better than "Sennrich-deponly" at p-value < 0.05 and "**" at p-value < 0.01. AVG = average BLEU scores on the test sets.

Table 1 shows the translation performance on the test sets measured in BLEU. The AttNMT significantly outperforms PBSMT by 2.74 BLEU points on average, indicating that it is a strong baseline NMT system. The baseline Sennrich-deponly improves over the AttNMT by 0.58 BLEU points on average, which indicates that the source dependency constraint is beneficial for NMT performance. Moreover, SDRNMT-1 gains improvements of 0.92 and 0.34 BLEU points on average over the AttNMT and Sennrich-deponly, respectively. This shows that the proposed SDR captures source dependency information more effectively than vector concatenation. In particular, the proposed SDRNMT-2 outperforms the AttNMT and Sennrich-deponly by 1.64 and 1.03 BLEU points on average, which verifies that the proposed double-context method is effective for word prediction.

Effect of Translating Long Sentences
Following Bahdanau et al. (2014), we group sentences of similar lengths across all the test sets (MT03-08); for example, "40" indicates that the sentence length is between 30 and 40, and we compute a BLEU score per group. As demonstrated in Figure 4, the proposed models outperform the other baseline systems, especially in translating long sentences. These results show that the proposed models can effectively encode long-distance dependencies to improve translation.

Conclusion and Future Work
In this paper, we explored source dependency information to improve the performance of NMT. We proposed a novel attentional NMT model with source dependency representation to capture source long-distance dependencies. In the future, we will explore a general framework for utilizing richer syntactic knowledge.