Tree-to-Sequence Attentional Neural Machine Translation

Most of the existing Neural Machine Translation (NMT) models focus on the conversion of sequential data and do not directly use syntactic information. We propose a novel end-to-end syntactic NMT model, extending a sequence-to-sequence model with the source-side phrase structure. Our model has an attention mechanism that enables the decoder to generate a translated word while softly aligning it with phrases as well as words of the source sentence. Experimental results on the WAT'15 English-to-Japanese dataset demonstrate that our proposed model considerably outperforms sequence-to-sequence attentional NMT models and compares favorably with the state-of-the-art tree-to-string SMT system.


Introduction
Machine Translation (MT) has traditionally been one of the most complex language processing problems, but recent advances in Neural Machine Translation (NMT) make it possible to perform translation using a simple end-to-end architecture. In the Encoder-Decoder model (Cho et al., 2014b; Sutskever et al., 2014), a Recurrent Neural Network (RNN) called the encoder reads the whole sequence of source words to produce a fixed-length vector, and then another RNN called the decoder generates the target words from the vector. The Encoder-Decoder model has been extended with an attention mechanism (Bahdanau et al., 2015; Luong et al., 2015a), which allows the model to jointly learn the soft alignment between the source language and the target language. NMT models have achieved state-of-the-art results in English-to-French and English-to-German translation tasks (Luong et al., 2015b; Luong et al., 2015a). However, it is yet to be seen whether NMT is competitive with traditional Statistical Machine Translation (SMT) approaches in translation tasks for structurally distant language pairs such as English-to-Japanese.
Figure 1 shows a pair of parallel sentences in English and Japanese. English and Japanese are linguistically distant in many respects; they have different syntactic constructions, and words and phrases are defined in different lexical units. In this example, the Japanese word "緑茶" is aligned with the English words "green" and "tea", and the English word sequence "a cup of" is aligned with a special symbol "null", which is not explicitly translated into any Japanese words. One way to solve this mismatch problem is to consider the phrase structure of the English sentence and align the phrase "a cup of green tea" with "緑茶". In SMT, it is known that incorporating syntactic constituents of the source language into the models improves word alignment (Yamada and Knight, 2001) and translation accuracy (Liu et al., 2006; Neubig and Duh, 2014). However, the existing NMT models do not allow us to perform this kind of alignment.
In this paper, we propose a novel attentional NMT model to take advantage of syntactic information. Following the phrase structure of a source sentence, we encode the sentence recursively in a bottom-up fashion to produce a vector representation of the sentence and decode it while aligning the input phrases and words with the output. Our experimental results on the WAT'15 English-to-Japanese translation task show that our proposed model achieves state-of-the-art translation accuracy.

Encoder-Decoder Model
NMT is an end-to-end approach to data-driven machine translation (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015). In other words, the NMT models directly estimate the conditional probability $p(y|x)$ given a large collection of source and target sentence pairs $(x, y)$. An NMT model consists of an encoder process and a decoder process, and hence such models are often called Encoder-Decoder models. In the Encoder-Decoder models, a sentence is treated as a sequence of words. In the encoder process, the encoder embeds each of the source words $x = (x_1, x_2, \cdots, x_n)$ into a $d$-dimensional vector space. The decoder then outputs a word sequence $y = (y_1, y_2, \cdots, y_m)$ in the target language given the information on the source sentence provided by the encoder. Here, $n$ and $m$ are the lengths of the source and target sentences, respectively. RNNs allow one to effectively embed sequential data into the vector space.
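In the RNN encoder, the $i$-th hidden unit $h_i \in \mathbb{R}^{d \times 1}$ is calculated given the $i$-th input $x_i$ and the previous hidden unit $h_{i-1} \in \mathbb{R}^{d \times 1}$:
$$h_i = f_{\mathrm{enc}}(x_i, h_{i-1}), \tag{1}$$
where $f_{\mathrm{enc}}$ is a non-linear function, and the initial hidden unit $h_0$ is usually set to zeros. The encoding function $f_{\mathrm{enc}}$ is recursively applied until the $n$-th hidden unit $h_n$ is obtained. The RNN Encoder-Decoder models assume that $h_n$ represents a vector of the meaning of the input sequence up to the $n$-th word.
After encoding the whole input sentence into the vector space, we decode it in a similar way. The initial decoder unit $s_1$ is initialized with the input sentence vector ($s_1 = h_n$). Given the previous target word and the $j$-th hidden unit of the decoder, the conditional probability that the $j$-th target word is generated is calculated as follows:
$$p(y_j \mid y_{<j}, x) = g(s_j), \tag{2}$$
where $g$ is a non-linear function. The $j$-th hidden unit of the decoder is calculated by using another non-linear function $f_{\mathrm{dec}}$ as follows:
$$s_j = f_{\mathrm{dec}}(y_{j-1}, s_{j-1}). \tag{3}$$
We employ Long Short-Term Memory (LSTM) units (Hochreiter and Schmidhuber, 1997; Gers et al., 2000) in place of vanilla RNN units. The $t$-th LSTM unit consists of several gates and two different types of states: a hidden unit $h_t \in \mathbb{R}^{d \times 1}$ and a memory cell $c_t \in \mathbb{R}^{d \times 1}$,
$$\begin{aligned}
i_t &= \sigma(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)}), \\
f_t &= \sigma(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)}), \\
o_t &= \sigma(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)}), \\
\tilde{c}_t &= \tanh(W^{(\tilde{c})} x_t + U^{(\tilde{c})} h_{t-1} + b^{(\tilde{c})}), \\
c_t &= i_t \odot \tilde{c}_t + f_t \odot c_{t-1}, \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned} \tag{4}$$
where each of $i_t$, $f_t$, $o_t$ and $\tilde{c}_t \in \mathbb{R}^{d \times 1}$ denotes an input gate, a forget gate, an output gate, and a state for updating the memory cell, respectively. $W^{(\cdot)} \in \mathbb{R}^{d \times d}$ and $U^{(\cdot)} \in \mathbb{R}^{d \times d}$ are weight matrices, $b^{(\cdot)} \in \mathbb{R}^{d \times 1}$ is a bias vector, and $x_t \in \mathbb{R}^{d \times 1}$ is the word embedding of the $t$-th input word. $\sigma(\cdot)$ is the logistic function, and the operator $\odot$ denotes element-wise multiplication between vectors.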
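The recurrence of the sequential LSTM encoder described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' C++ implementation: the function names (`affine`, `init_params`, `lstm_step`, `encode`), the list-of-rows matrix layout, and the toy dimensions are assumptions made for the example. The initialization mirrors the values stated later in the Training Details (uniform weights in [−0.1, 0.1], zero biases, forget-gate bias 1.0).

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x, h):
    """Return W x + U h + b, with matrices stored as lists of rows."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]


def init_params(d, rng):
    """Hypothetical helper: uniform weights, zero biases, forget bias 1.0."""
    def matrix():
        return [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(d)]

    params = {g: (matrix(), matrix(), [0.0] * d) for g in ("i", "f", "o", "c")}
    W_f, U_f, _ = params["f"]
    params["f"] = (W_f, U_f, [1.0] * d)
    return params


def lstm_step(params, x_t, h_prev, c_prev):
    """One LSTM unit: gates, candidate state, memory cell, hidden unit."""
    i = [sigmoid(z) for z in affine(*params["i"], x_t, h_prev)]
    f = [sigmoid(z) for z in affine(*params["f"], x_t, h_prev)]
    o = [sigmoid(z) for z in affine(*params["o"], x_t, h_prev)]
    c_tilde = [math.tanh(z) for z in affine(*params["c"], x_t, h_prev)]
    # c_t = i_t * c~_t + f_t * c_{t-1}, element-wise
    c = [ik * ck + fk * cp for ik, ck, fk, cp in zip(i, c_tilde, f, c_prev)]
    # h_t = o_t * tanh(c_t), element-wise
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def encode(params, embeddings):
    """Apply the LSTM from left to right, starting from zero states."""
    d = len(embeddings[0])
    h, c = [0.0] * d, [0.0] * d
    states = []
    for x_t in embeddings:
        h, c = lstm_step(params, x_t, h, c)
        states.append((h, c))
    return states
```

Calling `encode(params, embeddings)` returns one `(h, c)` pair per source word; the last pair plays the role of the sentence vector $h_n$ in the text.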

Attentional Encoder-Decoder Model
The NMT models with an attention mechanism (Bahdanau et al., 2015;Luong et al., 2015a) have been proposed to softly align each decoder state with the encoder states. The attention mechanism allows the NMT models to explicitly quantify how much each encoder state contributes to the word prediction at each time step.
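In the attentional NMT model in Luong et al. (2015a), at the $j$-th step of the decoder process, the attention score $\alpha_j(i)$ between the $i$-th source hidden unit $h_i$ and the $j$-th target hidden unit $s_j$ is calculated as follows:
$$\alpha_j(i) = \frac{\exp(h_i \cdot s_j)}{\sum_{k=1}^{n} \exp(h_k \cdot s_j)}, \tag{5}$$
where $h_i \cdot s_j$ is the inner product of $h_i$ and $s_j$, which is used to directly calculate the similarity score between $h_i$ and $s_j$. The $j$-th context vector $d_j$ is calculated as the sum of the source hidden units weighted by $\alpha_j(i)$:
$$d_j = \sum_{i=1}^{n} \alpha_j(i) h_i. \tag{6}$$
To incorporate the attention mechanism into the decoding process, the context vector is used for the $j$-th word prediction by adding an additional hidden layer $\tilde{s}_j$:
$$\tilde{s}_j = \tanh(W_d [s_j; d_j] + b_d), \tag{7}$$
where $[s_j; d_j] \in \mathbb{R}^{2d \times 1}$ is the concatenation of $s_j$ and $d_j$, and $W_d \in \mathbb{R}^{d \times 2d}$ and $b_d \in \mathbb{R}^{d \times 1}$ are a weight matrix and a bias vector, respectively. The model predicts the $j$-th word by using the softmax function:
$$p(y_j \mid y_{<j}, x) = \mathrm{softmax}(W_s \tilde{s}_j + b_s), \tag{8}$$
where $W_s \in \mathbb{R}^{|V| \times d}$ and $b_s \in \mathbb{R}^{|V| \times 1}$ are a weight matrix and a bias vector, respectively. $|V|$ stands for the size of the vocabulary of the target language. Figure 2 shows an example of the NMT model with the attention mechanism.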
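As a rough illustration of the dot-product attention step described above, the following sketch computes the attention scores, the context vector, and the additional hidden layer for a single decoder step. The function names and the toy vectors in the usage note are hypothetical; vectors are plain Python lists and the weight matrix is a list of rows.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(encoder_states, s_j):
    """Return (alpha_j, d_j) for one decoder hidden unit s_j."""
    # Inner product h_i . s_j as the similarity score.
    scores = [sum(a * b for a, b in zip(h_i, s_j)) for h_i in encoder_states]
    alpha = softmax(scores)
    # Context vector: attention-weighted sum of the encoder hidden units.
    d_j = [
        sum(a * h_i[k] for a, h_i in zip(alpha, encoder_states))
        for k in range(len(s_j))
    ]
    return alpha, d_j


def attentional_hidden(W_d, b_d, s_j, d_j):
    """Additional hidden layer: tanh(W_d [s_j; d_j] + b_d)."""
    concat = list(s_j) + list(d_j)
    return [
        math.tanh(sum(w * v for w, v in zip(row, concat)) + b)
        for row, b in zip(W_d, b_d)
    ]
```

For example, with `encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]` and `s_j = [2.0, 0.0]`, the first and third source units receive equal weight, and both outweigh the second, because their inner products with `s_j` are larger.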

Objective Function of NMT Models
The objective function to train the NMT models is the sum of the log-likelihoods of the translation pairs in the training data:
$$J(\theta) = \sum_{(x, y) \in \mathcal{D}} \log p(y \mid x), \tag{9}$$
where $\mathcal{D}$ denotes a set of parallel sentence pairs. The model parameters $\theta$ are learned through Stochastic Gradient Descent (SGD).

Attentional Tree-to-Sequence Model

Tree-based Encoder + Sequential Encoder
The existing NMT models treat a sentence as a sequence of words and neglect the structure of a sentence inherent in language. We propose a novel tree-based encoder in order to explicitly take the syntactic structure into consideration in the NMT model. We focus on the phrase structure of a sentence and construct a sentence vector from phrase vectors in a bottom-up fashion. The sentence vector in the tree-based encoder is therefore composed of the structural information rather than the sequential data. Figure 3 shows our proposed model, which we call a tree-to-sequence attentional NMT model.
In Head-driven Phrase Structure Grammar (HPSG) (Sag et al., 2003), a sentence is composed of multiple phrase units and represented as a binary tree as shown in Figure 1. Following the structure of the sentence, we construct a tree-based encoder on top of the standard sequential encoder. The $k$-th parent hidden unit $h_k^{(phr)}$ for the $k$-th phrase is calculated using the left and right child hidden units $h_k^l$ and $h_k^r$ as follows:
$$h_k^{(phr)} = f_{\mathrm{tree}}(h_k^l, h_k^r), \tag{10}$$
where $f_{\mathrm{tree}}$ is a non-linear function.
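We construct a tree-based encoder with LSTM units, where each node in the binary tree is represented with an LSTM unit. When initializing the leaf units of the tree-based encoder, we employ the sequential LSTM units described in Section 2.1. Each non-leaf node is also represented with an LSTM unit, and we employ Tree-LSTM (Tai et al., 2015) to calculate the LSTM unit of the parent node which has two child LSTM units. The hidden unit $h_k^{(phr)} \in \mathbb{R}^{d \times 1}$ and the memory cell $c_k^{(phr)} \in \mathbb{R}^{d \times 1}$ for the $k$-th parent node are calculated as follows:
$$\begin{aligned}
i_k &= \sigma(U_l^{(i)} h_k^l + U_r^{(i)} h_k^r + b^{(i)}), \\
f_k^l &= \sigma(U_l^{(f_l)} h_k^l + U_r^{(f_l)} h_k^r + b^{(f_l)}), \\
f_k^r &= \sigma(U_l^{(f_r)} h_k^l + U_r^{(f_r)} h_k^r + b^{(f_r)}), \\
o_k &= \sigma(U_l^{(o)} h_k^l + U_r^{(o)} h_k^r + b^{(o)}), \\
\tilde{c}_k &= \tanh(U_l^{(\tilde{c})} h_k^l + U_r^{(\tilde{c})} h_k^r + b^{(\tilde{c})}), \\
c_k^{(phr)} &= i_k \odot \tilde{c}_k + f_k^l \odot c_k^l + f_k^r \odot c_k^r, \\
h_k^{(phr)} &= o_k \odot \tanh(c_k^{(phr)}),
\end{aligned} \tag{11}$$
where $i_k$, $f_k^l$, $f_k^r$, $o_k$, and $\tilde{c}_k \in \mathbb{R}^{d \times 1}$ are an input gate, the forget gates for left and right child units, an output gate, and a state for updating the memory cell, respectively. $c_k^l$ and $c_k^r$ are the memory cells for the left and right child units, respectively. $U^{(\cdot)} \in \mathbb{R}^{d \times d}$ denotes a weight matrix, and $b^{(\cdot)} \in \mathbb{R}^{d \times 1}$ represents a bias vector.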
Our proposed tree-based encoder is a natural extension of the conventional sequential encoder, since Tree-LSTM is a generalization of chain-structured LSTM (Tai et al., 2015). Our encoder differs from the original Tree-LSTM in the calculation of the LSTM units for the leaf nodes. The motivation is to construct the phrase nodes in a context-sensitive way, which, for example, allows the model to compute different representations for multiple occurrences of the same word in a sentence because the sequential LSTMs are calculated in the context of the previous units. This ability contrasts with the original Tree-LSTM, in which the leaves are composed only of the word embeddings without any contextual information.
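The bottom-up composition over a binary parse tree can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' code: the tree is given as nested pairs of leaf indices, the leaf states are assumed to be `(h, c)` pairs already produced by the sequential LSTM encoder, and all names (`compose`, `encode_tree`, `init_params`, the gate keys) are hypothetical.

```python
import math
import random

GATES = ("i", "fl", "fr", "o", "c")


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(U_l, U_r, b, h_l, h_r):
    """Return U_l h_l + U_r h_r + b, with matrices stored as lists of rows."""
    return [
        sum(u * v for u, v in zip(row_l, h_l))
        + sum(u * v for u, v in zip(row_r, h_r))
        + b_k
        for row_l, row_r, b_k in zip(U_l, U_r, b)
    ]


def init_params(d, rng):
    """Hypothetical helper: uniform weights, zero biases, forget biases 1.0."""
    def matrix():
        return [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(d)]

    params = {g: (matrix(), matrix(), [0.0] * d) for g in GATES}
    for g in ("fl", "fr"):
        U_l, U_r, _ = params[g]
        params[g] = (U_l, U_r, [1.0] * d)
    return params


def compose(params, left, right):
    """Binary Tree-LSTM unit: build the parent (h, c) from two children."""
    (h_l, c_l), (h_r, c_r) = left, right
    i = [sigmoid(z) for z in affine(*params["i"], h_l, h_r)]
    f_l = [sigmoid(z) for z in affine(*params["fl"], h_l, h_r)]
    f_r = [sigmoid(z) for z in affine(*params["fr"], h_l, h_r)]
    o = [sigmoid(z) for z in affine(*params["o"], h_l, h_r)]
    c_tilde = [math.tanh(z) for z in affine(*params["c"], h_l, h_r)]
    # Each child memory cell has its own forget gate.
    c = [
        ik * ct + fl * cl + fr * cr
        for ik, ct, fl, cl, fr, cr in zip(i, c_tilde, f_l, c_l, f_r, c_r)
    ]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def encode_tree(params, tree, leaves, phrases):
    """Recursively encode `tree`; phrase nodes are appended bottom-up."""
    if isinstance(tree, int):  # leaf: index of a sequential LSTM unit
        return leaves[tree]
    left = encode_tree(params, tree[0], leaves, phrases)
    right = encode_tree(params, tree[1], leaves, phrases)
    node = compose(params, left, right)
    phrases.append(node)
    return node
```

For a four-word sentence bracketed as `((0, 1), (2, 3))`, `encode_tree` appends three phrase nodes to `phrases`, and the last one is the root that represents the whole sentence.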

Initial Decoder Setting
We now have two different sentence vectors: one is from the sequence encoder and the other from the tree-based encoder. As shown in Figure 3, we provide another Tree-LSTM unit which has the final sequential encoder unit ($h_n$) and the tree-based encoder unit ($h_{root}^{(phr)}$) as two child units and set it as the initial decoder $s_1$ as follows:
$$s_1 = g_{\mathrm{tree}}(h_n, h_{root}^{(phr)}), \tag{12}$$
where $g_{\mathrm{tree}}$ is the same function as $f_{\mathrm{tree}}$ with another set of Tree-LSTM parameters. This initialization allows the decoder to capture information from both the sequential data and phrase structures. Zoph and Knight (2016) proposed a similar method using a Tree-LSTM for initializing the decoder, with which they translate multiple source languages to one target language. When the syntactic parser fails to output a parse tree for a sentence, we encode the sentence with the sequential encoder by setting $h_{root}^{(phr)} = 0$. Our proposed tree-based encoder therefore works with any sentences.

Attention Mechanism in Our Model
We adopt the attention mechanism into our tree-to-sequence model in a novel way. Our model gives attention not only to sequential hidden units but also to phrase hidden units. This attention mechanism tells us which words or phrases in the source sentence are important when the model decodes a target word. The $j$-th context vector $d_j$ is composed of the sequential and phrase vectors weighted by the attention score $\alpha_j(i)$:
$$d_j = \sum_{i=1}^{n} \alpha_j(i) h_i + \sum_{i=n+1}^{2n-1} \alpha_j(i) h_i^{(phr)}. \tag{13}$$
Note that a binary tree has $n - 1$ phrase nodes if the tree has $n$ leaves. We set a final decoder $\tilde{s}_j$ in the same way as Equation (7).
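A minimal sketch of attention over both word and phrase units is given below. It assumes, as one plausible reading of the description above, that the scores are inner products normalized jointly over all $2n - 1$ units; the function name `tree_attention` and the toy values are hypothetical.

```python
import math


def tree_attention(word_states, phrase_states, s_j):
    """Attend jointly over n word units and n - 1 phrase units.

    Assumption: one softmax is taken over all 2n - 1 inner-product scores.
    Returns the attention scores and the context vector d_j.
    """
    units = list(word_states) + list(phrase_states)
    scores = [sum(a * b for a, b in zip(h, s_j)) for h in units]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    d_j = [sum(a * h[k] for a, h in zip(alpha, units)) for k in range(len(s_j))]
    return alpha, d_j
```

With three word vectors and two phrase vectors, `alpha` has five entries: the first three weight the words and the last two weight the phrases, so a phrase can receive the highest score when it matches the decoder state best.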
In addition, we adopt the input-feeding method (Luong et al., 2015a) in our model, which is a method for feeding $\tilde{s}_{j-1}$, the previous unit to predict the word $y_{j-1}$, into the current target hidden unit $s_j$,
$$s_j = f_{\mathrm{dec}}(y_{j-1}, [s_{j-1}; \tilde{s}_{j-1}]), \tag{14}$$
where $[s_{j-1}; \tilde{s}_{j-1}]$ is the concatenation of $s_{j-1}$ and $\tilde{s}_{j-1}$. The input-feeding approach enriches the calculation of the decoder, because $\tilde{s}_{j-1}$ is an informative unit that is used to predict the output word and also incorporates the attentional context vector. Luong et al. (2015a) showed that the input-feeding approach improves BLEU scores. We also observed the same improvement in our preliminary experiments.

Sampling-Based Approximation to the NMT Models
The biggest computational bottleneck of training the NMT models is in the calculation of the softmax layer described in Equation (8), because its computational cost increases linearly with the size of the vocabulary. The speedup technique with GPUs has proven useful for sequence-based NMT models (Sutskever et al., 2014; Luong et al., 2015a) but it is not easily applicable when dealing with tree-structured data. In order to reduce the training cost of the NMT models at the softmax layer, we employ BlackOut (Ji et al., 2016), a sampling-based approximation method. BlackOut has been shown to be effective in RNN Language Models (RNNLMs) and allows a model to run reasonably fast even with a million word vocabulary with CPUs.
At each word prediction step in the training, BlackOut estimates the conditional probability in Equation (2) for the target word and $K$ negative samples using a weighted softmax function. The negative samples are drawn from the unigram distribution raised to the power $\beta \in [0, 1]$ (Mikolov et al., 2013). The unigram distribution is estimated using the training data and $\beta$ is a hyperparameter. BlackOut is closely related to Noise Contrastive Estimation (NCE) (Gutmann and Hyvärinen, 2012) and achieves better perplexity than the original softmax and NCE in RNNLMs. The advantages of BlackOut over the other methods are discussed in Ji et al. (2016). Note that BlackOut can be used as the original softmax once the training is finished.
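The negative-sampling step can be illustrated with a small sketch. It covers only the proposal distribution (unigram counts raised to the power β) and the drawing of K samples, not the weighted softmax or the BlackOut loss itself. The function names are hypothetical, and excluding the target word from the candidates is an assumption of this sketch; samples are drawn with replacement, so frequent words may appear several times.

```python
import random
from collections import Counter


def proposal_distribution(tokens, beta):
    """Unigram distribution raised to the power beta, then renormalized."""
    counts = Counter(tokens)
    weights = {w: c ** beta for w, c in counts.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}


def draw_negative_samples(dist, target, k, rng):
    """Draw k negative samples with replacement, skipping the target word."""
    words = [w for w in dist if w != target]
    weights = [dist[w] for w in words]
    return rng.choices(words, weights=weights, k=k)
```

Setting `beta=1.0` reproduces the plain unigram distribution, while `beta=0.0` yields a uniform distribution over the vocabulary; intermediate values flatten the distribution so that rare words are sampled more often.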

Training Data
We applied the proposed model to the English-to-Japanese translation dataset of the ASPEC corpus given in WAT'15. Following Zhu (2015), we extracted the first 1.5 million translation pairs from the training data. To obtain the phrase structures of the source sentences, i.e., English, we used the probabilistic HPSG parser Enju (Miyao and Tsujii, 2008). We used Enju only to obtain a binary phrase structure for each sentence and did not use any HPSG-specific information. For the target language, i.e., Japanese, we used KyTea (Neubig et al., 2011), a Japanese segmentation tool, and performed the pre-processing steps recommended in WAT'15. We then filtered out the translation pairs whose sentence lengths are longer than 50 or whose source sentences are not parsed successfully. We conducted experiments on a small training dataset to investigate the effectiveness of our proposed model and on a large training dataset to compare our proposed methods with the other systems. The vocabulary consists of words observed in the training data more than or equal to N times. We set N = 2 for the small training dataset and N = 5 for the large training dataset. The out-of-vocabulary words are mapped to the special token "unk". We added another special symbol "eos" for both languages and inserted it at the end of all the sentences. Table 2 shows the details of each training dataset and its corresponding vocabulary size.
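The vocabulary construction described above can be sketched as follows. The function names `build_vocab` and `preprocess` are hypothetical, and sentences are assumed to be already tokenized into lists of strings.

```python
from collections import Counter

UNK, EOS = "unk", "eos"


def build_vocab(sentences, min_count):
    """Keep words observed at least `min_count` times, plus the special tokens."""
    counts = Counter(word for sentence in sentences for word in sentence)
    vocab = {word for word, count in counts.items() if count >= min_count}
    return vocab | {UNK, EOS}


def preprocess(sentence, vocab):
    """Map out-of-vocabulary words to "unk" and append "eos"."""
    return [word if word in vocab else UNK for word in sentence] + [EOS]
```

With `min_count=2`, a word that occurs only once in the training sentences is replaced by "unk", and every processed sentence ends with "eos".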

Training Details
The biases, softmax weights, and BlackOut weights are initialized with zeros. The hyperparameter β of BlackOut is set to 0.4 as recommended by Ji et al. (2016). Following Józefowicz et al. (2015), we initialize the forget gate biases of LSTM and Tree-LSTM with 1.0. The remaining model parameters in the NMT models in our experiments are uniformly initialized in [−0.1, 0.1]. The model parameters are optimized by plain SGD with the mini-batch size of 128. The initial learning rate of SGD is 1.0. We halve the learning rate when the development loss becomes worse. Gradient norms are clipped to 3.0 to avoid exploding gradient problems (Pascanu et al., 2012).
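The optimization schedule described above (plain SGD, gradient-norm clipping at 3.0, and halving the learning rate when the development loss gets worse) can be sketched as below. This is an illustrative sketch on flat lists of floats: clipping by the global norm over all parameters is an assumption, the gradients are taken to be those of a loss to be minimized (e.g. the negative log-likelihood), and the function names are hypothetical.

```python
import math


def clip_by_norm(grads, max_norm=3.0):
    """Rescale the gradient vector if its L2 norm exceeds `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return list(grads)


def sgd_step(params, grads, lr):
    """One plain SGD update with clipped gradients."""
    clipped = clip_by_norm(grads)
    return [p - lr * g for p, g in zip(params, clipped)]


def update_learning_rate(lr, dev_loss, prev_dev_loss):
    """Halve the learning rate when the development loss becomes worse."""
    if prev_dev_loss is not None and dev_loss > prev_dev_loss:
        return lr / 2.0
    return lr
```

Starting from `lr = 1.0`, `update_learning_rate` would be called once per evaluation on the development data, passing the previous development loss (or `None` before the first evaluation).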

Small Training Dataset
We conduct experiments with our proposed model and the sequential attentional NMT model with the input-feeding approach. Each model has 256-dimensional hidden units and word embeddings. The number of negative samples K of BlackOut is set to 500 or 2000.
Our code is implemented in C++ using the Eigen library, a template library for linear algebra, and we run all of the experiments on multicore CPUs. It takes about a week to train a model on the large training dataset with d = 512.

Decoding process
We use beam search to decode a target sentence for an input sentence $x$ and calculate the sum of the log-likelihoods of the target sentence $y = (y_1, \cdots, y_m)$ as the beam score:
$$\mathrm{score}(x, y) = \sum_{j=1}^{m} \log p(y_j \mid y_{<j}, x). \tag{15}$$
Decoding in the NMT models is a generative process and depends on the target language model given a source sentence. The score becomes smaller as the target sentence becomes longer, and thus the simple beam search does not work well when decoding a long sentence (Cho et al., 2014a; Pouget-Abadie et al., 2014). In our preliminary experiments, the beam search with the length normalization in Cho et al. (2014a) was not effective in English-to-Japanese translation. The method in Pouget-Abadie et al. (2014) needs to estimate the conditional probability $p(x|y)$ using another NMT model and thus is not suitable for our work.
In this paper, we use statistics on sentence lengths in beam search. Assuming that the length of a target sentence correlates with the length of a source sentence, we redefine the score of each candidate as follows:
$$\mathrm{score}(x, y) = L_{x,y} + \sum_{j=1}^{m} \log p(y_j \mid y_{<j}, x), \tag{16}$$
$$L_{x,y} = \log p(\mathrm{len}(y) \mid \mathrm{len}(x)), \tag{17}$$
where $L_{x,y}$ is the penalty for the conditional probability of the target sentence length $\mathrm{len}(y)$ given the source sentence length $\mathrm{len}(x)$. It allows the model to decode a sentence by considering the length of the target sentence. In our experiments, we computed the conditional probability $p(\mathrm{len}(y) \mid \mathrm{len}(x))$ in advance following the statistics collected in the first one million pairs of the training dataset. We allow the decoder to generate up to 100 words.
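The length-aware beam score can be sketched as follows: the length model is estimated by relative frequency from parallel sentence pairs, and the penalty is added to the sum of word log-probabilities. The names `length_model` and `beam_score` are hypothetical, and the small `floor` probability for length pairs never observed in the statistics is an assumption of this sketch (the paper does not say how unseen lengths are handled).

```python
import math
from collections import Counter, defaultdict


def length_model(pairs):
    """Estimate p(len(y) | len(x)) by relative frequency over sentence pairs."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        counts[len(source)][len(target)] += 1
    model = {}
    for src_len, counter in counts.items():
        total = sum(counter.values())
        model[src_len] = {tgt_len: c / total for tgt_len, c in counter.items()}
    return model


def beam_score(word_log_probs, src_len, model, floor=1e-10):
    """Length penalty L_{x,y} plus the sum of the word log-probabilities."""
    tgt_len = len(word_log_probs)
    # `floor` is a hypothetical fallback for unseen (src_len, tgt_len) pairs.
    p_len = model.get(src_len, {}).get(tgt_len, floor)
    return math.log(p_len) + sum(word_log_probs)
```

During beam search, `beam_score` would replace the plain sum of log-likelihoods when ranking finished candidates, so that a longer candidate with a plausible length is not discarded merely because it accumulates more negative log-probability terms.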

Evaluation
We evaluated the models by two automatic evaluation metrics, RIBES (Isozaki et al., 2010) and BLEU (Papineni et al., 2002) following WAT'15. We used the KyTea-based evaluation script for the translation results. The RIBES score is a metric based on rank correlation coefficients with word precision, and the BLEU score is based on n-gram word precision and a Brevity Penalty (BP) for outputs shorter than the references. RIBES is known to have stronger correlation with human judgements than BLEU in translation between English and Japanese as discussed in Isozaki et al. (2010).

Small Training Dataset
Table 3 shows the perplexity, BLEU, RIBES, and the training time on the development data with the Attentional NMT (ANMT) models trained on the small dataset. We conducted the experiments with our proposed method using BlackOut and softmax. We decoded a translation by our proposed beam search with a beam size of 20. As shown in Table 3, the results of our proposed model with BlackOut improve as the number of negative samples K increases. Although the result of softmax is better than those of BlackOut (K = 500, 2000), the training time of softmax per epoch is about three times longer than that of BlackOut even with the small dataset.
As to the results of the ANMT model, reversing the word order in the input sentence decreases the scores in English-to-Japanese translation, which contrasts with the results of other language pairs reported in previous work (Sutskever et al., 2014;Luong et al., 2015a). By taking syntactic information into consideration, our proposed model improves the scores, compared to the sequential attention-based approach.
We found that better perplexity does not always lead to better translation scores with BlackOut as shown in Table 3. One of the possible reasons is that BlackOut distorts the target word distribution by the modified unigram-based negative sampling where frequent words can be treated as the negative samples multiple times at each training step.

Effects of the proposed beam search
Table 4 shows the results on the development data of the proposed method with BlackOut (K = 2000) by the simple beam search and our proposed beam search. The beam size is set to 6 or 20 in the simple beam search, and to 20 in our proposed search.
We can see that our proposed search outperforms the simple beam search in both scores. Unlike RIBES, the BLEU score is sensitive to the beam size and becomes lower as the beam size increases. We found that the BP had a relatively large impact on the BLEU score in the simple beam search as the beam size increased. Our search method works better than the simple beam search by keeping long sentences in the candidates with a large beam size.

Effects of the sequential LSTM units
We also investigated the effects of the sequential LSTMs at the leaf nodes in our proposed tree-based encoder. Table 5 shows the result on the development data of our proposed encoder and that of an attentional tree-based encoder without sequential LSTMs with BlackOut (K = 2000). The results show that our proposed encoder considerably outperforms the encoder without sequential LSTMs, suggesting that the sequential LSTMs at the leaf nodes contribute to the context-aware construction of the phrase representations in the tree.

Large Training Dataset
Table 6 shows the experimental results of RIBES and BLEU scores achieved by the trained models on the large dataset. We decoded the target sentences by our proposed beam search with the beam size of 20. The results of the other systems are the ones reported in Nakazawa et al. (2015). All of our proposed models show similar performance regardless of the value of d. Our ensemble model is composed of the three models with d = 512, 768, and 1024, and it shows the best RIBES score among all systems. As for the time required for training, our implementation needs about one day to perform one epoch on the large training dataset with d = 512. It would take about 11 days without using the BlackOut sampling.

Comparison with the NMT models
The model of Zhu (2015) is an ANMT model (Bahdanau et al., 2015) with a bi-directional LSTM encoder, and uses 1024-dimensional hidden units and 1000-[...]
[Table 6, surviving rows: (Neubig and Duh, 2014): RIBES 79.65, BLEU 36.58; + ANMT Rerank: RIBES 81.38, BLEU 38.17]

Comparison with the SMT models
PB, HPB and T2S are the baseline SMT systems in WAT'15: a phrase-based model, a hierarchical phrase-based model, and a tree-to-string model, respectively (Nakazawa et al., 2015). The best model in WAT'15 is a tree-to-string SMT model enhanced with reranking by ANMT using a bi-directional LSTM encoder. Our proposed end-to-end NMT model compares favorably with it.

Qualitative Analysis
We illustrate the translations of test data by our model with d = 512 and several attentional relations when decoding a sentence. In Figures 4 and 5, an English sentence represented as a binary tree is translated into Japanese, and several attentional relations between English words or phrases and Japanese words are shown with the highest attention score α. The additional attentional relations are also illustrated for comparison. We can see the target words softly aligned with source words and phrases.
Figure 4: Translation example of a short sentence and the attentional relations by our proposed model.
In Figure 4, the Japanese word "液晶" means "liquid crystal", and it has a high attention score (α = 0.41) with the English phrase "liquid crystal for active matrix". This is because the j-th target hidden unit $s_j$ has the contextual information about the previous words $y_{<j}$ including "活性 マトリックス の" ("for active matrix" in English). The Japanese word "セル" is softly aligned with the phrase "the cells" with the highest attention score (α = 0.35). In Japanese, there is no definite article like "the" in English, and it is usually aligned with null as described in Section 1.
In Figure 5, in the case of the Japanese word "示" ("showed" in English), the attention score with the English phrase "showed excellent performance" (α = 0.25) is higher than that with the English word "showed" (α = 0.01). The Japanese word "の" ("of" in English) is softly aligned with the phrase "of Si dot MOS capacitor" with the highest attention score (α = 0.30). This is because our attention mechanism takes each previous context of the Japanese phrases "優れ た 性能" ("excellent performance" in English) and "Ｓｉ ドット ＭＯＳ コンデンサ" ("Si dot MOS capacitor" in English) into account and softly aligns the target words with the whole phrase when translating the English verb "showed" and the preposition "of". Our proposed model can thus flexibly learn the attentional relations between English and Japanese.
We observed that our model translated the word "active" into "活性", a synonym of the reference word "アクティブ". We also found similar examples in other sentences, where our model outputs synonyms of the reference words, e.g., "女" and "女性" ("female" in English) and "NASA" and "航空宇宙局" ("National Aeronautics and Space Administration" in English). These translations are penalized in terms of BLEU scores, but they do not necessarily mean that the translations were wrong. This point may be supported by the fact that the NMT models were rated highly in the WAT'15 crowdsourcing evaluation (Nakazawa et al., 2015).

Related Work
Kalchbrenner and Blunsom (2013) were the first to propose an end-to-end NMT model using Convolutional Neural Networks (CNNs) as the source encoder and using RNNs as the target decoder. The Encoder-Decoder model can be seen as an extension of their model, and it replaces the CNNs with RNNs using GRUs (Cho et al., 2014b) or LSTMs (Sutskever et al., 2014). Sutskever et al. (2014) have shown that making the input sequences reversed is effective in a French-to-English translation task, and the technique has also proven effective in translation tasks between other European language pairs (Luong et al., 2015a). All of the NMT models mentioned above are based on sequential encoders. To incorporate structural information into the NMT models, Cho et al. (2014a) proposed to jointly learn structures inherent in source-side languages but did not report improvement of translation performance. These studies motivated us to investigate the role of syntactic structures explicitly given by existing syntactic parsers in the NMT models.
The attention mechanism (Bahdanau et al., 2015) has advanced NMT to the next stage. It enables the NMT models to translate while aligning the target with the source. Luong et al. (2015a) refined the attention model so that it can dynamically focus on local windows rather than the entire sentence. They also proposed a more effective attentional path in the calculation of ANMT models. Subsequently, several ANMT models have been proposed (Cheng et al., 2016; Cohn et al., 2016); however, each model is based on the existing sequential attentional models and does not focus on the syntactic structure of languages.

Conclusion
In this paper, we propose a novel syntactic approach that extends attentional NMT models. We focus on the phrase structure of the input sentence and build a tree-based encoder following the parsed tree. Our proposed tree-based encoder is a natural extension of the sequential encoder model, where the leaf units of the Tree-LSTM in the encoder can work together with the original sequential LSTM encoder. Moreover, the attention mechanism allows the tree-based encoder to align not only the input words but also input phrases with the output words. Experimental results on the WAT'15 English-to-Japanese translation dataset demonstrate that our proposed model achieves the best RIBES score and outperforms the sequential attentional NMT model.