Neural Machine Translation with Source-Side Latent Graph Parsing

This paper presents a novel neural machine translation model which jointly learns translation and source-side latent graph representations of sentences. Unlike existing pipelined approaches using syntactic parsers, our end-to-end model learns a latent graph parser as part of the encoder of an attention-based neural machine translation model, and thus the parser is optimized according to the translation objective. In experiments, we first show that our model compares favorably with state-of-the-art sequential and pipelined syntax-based NMT models. We also show that the performance of our model can be further improved by pre-training it with a small amount of treebank annotations. Our final ensemble model significantly outperforms the previous best models on the standard English-to-Japanese translation dataset.


Introduction
Neural Machine Translation (NMT) is an active area of research due to its outstanding empirical results (Bahdanau et al., 2015; Sutskever et al., 2014). Most of the existing NMT models treat each sentence as a sequence of tokens, but recent studies suggest that syntactic information can help improve translation accuracy (Eriguchi et al., 2016b, 2017; Stahlberg et al., 2016). The existing syntax-based NMT models employ a syntactic parser trained by supervised learning in advance, and hence the parser is not adapted to the translation tasks. An alternative approach for leveraging syntactic structure in a language processing task is to jointly learn syntactic trees of the sentences along with the target task (Socher et al., 2011; Yogatama et al., 2017).
[Figure 1: An example of a learned latent graph for the sentence "All the calculated electronic band structures are metallic." Edges with a small weight are omitted.]
Motivated by the promising results of recent joint learning approaches, we present a novel NMT model that can learn a task-specific latent graph structure for each source-side sentence. The graph structure is similar to the dependency structure of the sentence, but it can have cycles and is learned specifically for the translation task. Unlike the aforementioned approach of learning single syntactic trees, our latent graphs are composed of "soft" connections, i.e., the edges have real-valued weights (Figure 1). Our model consists of two parts: one is a task-independent parsing component, which we call a latent graph parser, and the other is an attention-based NMT model. The latent parser can be independently pre-trained with human-annotated treebanks and is then adapted to the translation task.
In experiments, we demonstrate that our model can be effectively pre-trained by the treebank annotations, outperforming a state-of-the-art sequential counterpart and a pipelined syntax-based model. Our final ensemble model outperforms the previous best results by a large margin on the WAT English-to-Japanese dataset.

Latent Graph Parser
We model the latent graph parser based on dependency parsing. In dependency parsing, a sentence is represented as a tree structure where each node corresponds to a word in the sentence and a unique root node (ROOT) is added. Given a sentence of length $N$, the parent node $H_{w_i} \in \{w_1, \ldots, w_N, \mathrm{ROOT}\}$ ($H_{w_i} \neq w_i$) of each word $w_i$ ($1 \leq i \leq N$) is called its head. The sentence is thus represented as a set of tuples $(w_i, H_{w_i}, \ell_{w_i})$, where $\ell_{w_i}$ is a dependency label.
In this paper, we remove the constraint of using the tree structure and represent a sentence as a set of tuples $(w_i, p(H_{w_i} \mid w_i), p(\ell_{w_i} \mid w_i))$, where $p(H_{w_i} \mid w_i)$ is the probability distribution of $w_i$'s parent nodes, and $p(\ell_{w_i} \mid w_i)$ is the probability distribution of the dependency labels. For example, $p(H_{w_i} = w_j \mid w_i)$ is the probability that $w_j$ is the parent node of $w_i$. Here, we assume that a special token EOS is appended to the end of the sentence, and we treat the EOS token as ROOT. This approach is similar to that of graph-based dependency parsing (McDonald et al., 2005) in that a sentence is represented with a set of weighted arcs between the words. To obtain the latent graph representation of the sentence, we use a dependency parsing model based on multi-task learning proposed by Hashimoto et al. (2017).

Word Representation
The $i$-th input word $w_i$ is represented with the concatenation of its $d_1$-dimensional word embedding $v_{dp}(w_i) \in \mathbb{R}^{d_1}$ and its character n-gram embedding $c(w_i) \in \mathbb{R}^{d_1}$: $x(w_i) = [v_{dp}(w_i); c(w_i)]$. $c(w_i)$ is computed as the average of the embeddings of the character n-grams in $w_i$.
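As a rough illustration of this word representation, the following Python sketch builds $x(w)$ from a word embedding and the average of character n-gram embeddings. The dictionaries `word_emb` and `ngram_emb`, the zero-vector fallback for unseen words, and the absence of word-boundary markers are assumptions made for this example; the n-gram sizes n = 2, 3, 4 follow the setting reported in the data section.

```python
def char_ngrams(word, ns=(2, 3, 4)):
    """Return all character n-grams of `word` for the given n values."""
    return [word[i:i + n] for n in ns for i in range(len(word) - n + 1)]


def average(vectors, dim):
    """Element-wise mean of equal-length vectors; zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def word_representation(word, word_emb, ngram_emb, dim):
    """Build x(w) = [v_dp(w); c(w)] as a plain list of length 2 * dim."""
    # Assumption: words without an embedding fall back to a zero vector.
    v = word_emb.get(word, [0.0] * dim)
    # c(w): average of the embeddings of the n-grams that have an embedding.
    known = [ngram_emb[g] for g in char_ngrams(word) if g in ngram_emb]
    c = average(known, dim)
    return v + c  # list concatenation
```

Because $c(w)$ depends only on character n-grams, a word that is missing from `word_emb` can still receive a non-trivial representation, which is the motivation given for these embeddings.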

POS Tagging Layer
Our latent graph parser builds upon multi-layer bi-directional Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units (Graves and Schmidhuber, 2005). In the first layer, POS tagging is handled by computing a hidden state $h^{(1)}_i = [\overrightarrow{h}^{(1)}_i; \overleftarrow{h}^{(1)}_i] \in \mathbb{R}^{2d_1}$ for $w_i$, where $\overrightarrow{h}^{(1)}_i = \mathrm{LSTM}(\overrightarrow{h}^{(1)}_{i-1}, x(w_i)) \in \mathbb{R}^{d_1}$ and $\overleftarrow{h}^{(1)}_i = \mathrm{LSTM}(\overleftarrow{h}^{(1)}_{i+1}, x(w_i)) \in \mathbb{R}^{d_1}$ are hidden states of the forward and backward LSTMs, respectively. $h^{(1)}_i$ is then fed into a softmax classifier to predict a probability distribution $p^{(1)}_i \in \mathbb{R}^{C^{(1)}}$ for word-level tags, where $C^{(1)}$ is the number of POS classes. The model parameters of this layer can be learned not only by human-annotated data, but also by backpropagation from higher layers, which are described in the next section.

Dependency Parsing Layer
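Dependency parsing is performed in the second layer. A hidden state $h^{(2)}_i \in \mathbb{R}^{2d_1}$ is computed for each word $w_i$ by the second bi-directional LSTM layer. Then, (soft) edges of our latent graph representation are obtained by computing the probabilities

$$p(H_{w_i} = w_j \mid w_i) = \frac{\exp(m(i, j))}{\sum_{k \neq i} \exp(m(i, k))}, \qquad (1)$$

where $m(i, k) = h^{(2)\top}_k W_{dp} h^{(2)}_i$ is a scoring function with a weight matrix $W_{dp} \in \mathbb{R}^{2d_1 \times 2d_1}$. While the models of Hashimoto et al. (2017) and Dozat and Manning (2017) learn the model parameters of their parsing models only by human-annotated data, we allow the model parameters to be learned by the translation task. Next, $[h^{(2)}_i; z(H_{w_i})]$ is fed into a softmax classifier to predict the probability distribution $p(\ell_{w_i} \mid w_i)$ of the dependency labels, where $z(H_{w_i}) \in \mathbb{R}^{2d_1}$ is the weighted average of the hidden states of the parent nodes: $z(H_{w_i}) = \sum_{j \neq i} p(H_{w_i} = w_j \mid w_i)\, h^{(2)}_j$. This results in the latent graph representation $(w_i, p(H_{w_i} \mid w_i), p(\ell_{w_i} \mid w_i))$ of the input sentence.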
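The soft head selection can be sketched in a few lines of Python. This is a minimal illustration rather than the actual implementation: it assumes a bilinear score between hidden states and a softmax over all candidate heads except the word itself, and the function names are hypothetical. `H` holds one hidden state per position, with the last position standing for EOS (treated as ROOT).

```python
import math


def bilinear(h_k, W, h_i):
    """Score h_k^T W h_i for vectors h_k, h_i and a square matrix W."""
    Wh = [sum(w * x for w, x in zip(row, h_i)) for row in W]
    return sum(a * b for a, b in zip(h_k, Wh))


def head_distribution(H, W, i):
    """Soft head probabilities of word i over all positions j != i.

    Assumes len(H) >= 2. Position i itself receives probability 0.
    """
    scores = {j: bilinear(H[j], W, H[i]) for j in range(len(H)) if j != i}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {j: math.exp(s - m) for j, s in scores.items()}
    total = sum(exps.values())
    return [exps[j] / total if j != i else 0.0 for j in range(len(H))]


def weighted_parent_state(H, probs):
    """Probability-weighted average of the candidate parents' hidden states."""
    dim = len(H[0])
    return [sum(p * h[d] for p, h in zip(probs, H)) for d in range(dim)]
```

Since every candidate head keeps a non-zero weight, the result is a weighted graph rather than a single tree, and gradients from the translation loss can reach all candidate edges.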

NMT with Latent Graph Parser
The latent graph representation described in Section 2 can be used for any sentence-level task, and here we apply it to an Attention-based NMT (ANMT) model. We modify the encoder and the decoder in the ANMT model to learn the latent graph representation.

Encoder with Dependency Composition
The ANMT model first encodes the information about the input sentence and then generates a sentence in another language. The encoder represents the word $w_i$ with a word embedding $v_{enc}(w_i) \in \mathbb{R}^{d_3}$. It should be noted that $v_{enc}(w_i)$ is different from $v_{dp}(w_i)$ because each component is separately modeled. The encoder then takes the word embedding $v_{enc}(w_i)$ and the hidden state $h^{(2)}_i$ as the input to a uni-directional LSTM:

$$h^{(enc)}_i = \mathrm{LSTM}(h^{(enc)}_{i-1}, [v_{enc}(w_i); h^{(2)}_i]), \qquad (2)$$

where $h^{(enc)}_i \in \mathbb{R}^{d_3}$ is the hidden state corresponding to $w_i$. That is, the encoder of our model is a three-layer LSTM network, where the first two layers are bi-directional.
In the sequential LSTMs, relationships between words in distant positions are not explicitly considered. In our model, we explicitly incorporate such relationships into the encoder by defining a dependency composition function (Equation (3)), which uses the weighted average of the hidden states of the parent nodes of each word.

Note on character n-gram embeddings
In NMT models, sub-word units are widely used to address rare or unknown word problems. In our model, the character n-gram embeddings are fed through the latent graph parsing component. To the best of our knowledge, the character n-gram embeddings have never been used in NMT models. Wieting et al. (2016), Bojanowski et al. (2017), and Hashimoto et al. (2017) have reported that the character n-gram embeddings are useful in improving several NLP tasks by better handling unknown words.

Decoder with Attention Mechanism
The decoder of our model is a single-layer LSTM network, and the initial state is set with $h^{(enc)}_{N+1}$ and its corresponding memory cell. Given the $t$-th hidden state $h^{(dec)}_t \in \mathbb{R}^{d_3}$, the decoder predicts the $t$-th word in the target language using an attention mechanism. The attention mechanism computes the weighted average of the hidden states $h^{(enc)}_i$ of the encoder, $\sum_i s(i, t)\, h^{(enc)}_i$, where $s(i, t)$ is a scoring function which specifies how much each source-side hidden state contributes to the word prediction. In addition, like the attention mechanism over constituency tree nodes (Eriguchi et al., 2016b), our model uses attention to the dependency composition vectors. To predict the target word, a hidden state $\tilde{h}^{(dec)}_t \in \mathbb{R}^{d_3}$ is then computed using the attention results. $\tilde{h}^{(dec)}_t$ is also used in the transition of the decoder LSTMs along with a word embedding $v_{dec}(w_t) \in \mathbb{R}^{d_3}$ of the target word $w_t$:

$$h^{(dec)}_{t+1} = \mathrm{LSTM}(h^{(dec)}_t, [v_{dec}(w_t); \tilde{h}^{(dec)}_t]),$$

where the use of $\tilde{h}^{(dec)}_t$ is called input feeding.
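To make the attention step concrete, here is a small Python sketch. The text only says that $s(i, t)$ is a scoring function, so the softmax-normalized dot product used below is an assumption made for illustration, and the function names are hypothetical. The same routine could be applied to the dependency composition vectors instead of the encoder states.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention(h_dec, states):
    """Return (weights, context) for one decoder step.

    weights[i] plays the role of s(i, t); context is the weighted
    average of `states` (e.g. the encoder hidden states).
    """
    scores = [dot(h_dec, h) for h in states]  # assumption: dot-product score
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(states[0])
    context = [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]
    return weights, context


def input_feeding(v_dec, h_tilde):
    """Concatenate the target word embedding with h~ for the next decoder step."""
    return v_dec + h_tilde
```

With input feeding, the decoder LSTM at the next step sees which source positions were attended to at the previous step.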
The overall model parameters, including those of the latent graph parser, are jointly learned by minimizing the negative log-likelihood of the prediction probabilities of the target words in the training data. To speed up the training, we use BlackOut sampling (Ji et al., 2016). By this joint learning using Equations (3) and (7), the latent graph representations are automatically learned according to the target task.

Implementation Tips
Inspired by Zoph et al. (2016), we further speed up BlackOut sampling by sharing noise samples across words in the same sentences. This technique has proven to be effective in RNN language modeling, and we have found that it is also effective in the NMT model. We have also found it effective to share the model parameters of the target word embeddings and the softmax weight matrix for word prediction (Inan et al., 2016; Press and Wolf, 2017). Also, we have found that a parameter averaging technique (Hashimoto et al., 2013) is helpful in improving translation accuracy.

Translation
At test time, we use a novel beam search algorithm which combines statistics of sentence lengths (Eriguchi et al., 2016b) and length normalization (Cho et al., 2014). During the beam search step, we score a generated word sequence $y = (y_1, y_2, \ldots, y_{L_y})$ given a source word sequence $x = (x_1, x_2, \ldots, x_{L_x})$ with a function that combines the length-normalized log-probability of $y$ with $p(L_y \mid L_x)$, the probability that sentences of length $L_y$ are generated given source-side sentences of length $L_x$. The statistics are taken by using the training data in advance. In our experiments, we have empirically found that this beam search algorithm helps the NMT models to avoid generating translation sentences that are too short.
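One way such a scoring function could look is sketched below in Python. The exact form is an assumption: the sketch adds the length-normalized sum of token log-probabilities to the log of an empirical $p(L_y \mid L_x)$ estimated by relative frequency, and uses a small floor value for unseen length pairs. The names `length_statistics`, `hypothesis_score`, and `floor` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def length_statistics(pairs):
    """Estimate p(L_y | L_x) from (source_len, target_len) pairs by relative frequency."""
    counts = defaultdict(Counter)
    for lx, ly in pairs:
        counts[lx][ly] += 1
    return {
        lx: {ly: c / sum(cnt.values()) for ly, c in cnt.items()}
        for lx, cnt in counts.items()
    }


def hypothesis_score(token_logprobs, src_len, length_probs, floor=1e-10):
    """Score a non-empty hypothesis given its per-token log-probabilities.

    Assumed form: mean token log-probability + log p(L_y | L_x).
    Unseen (L_x, L_y) pairs receive the probability `floor`.
    """
    ly = len(token_logprobs)
    p_len = length_probs.get(src_len, {}).get(ly, floor)
    return sum(token_logprobs) / ly + math.log(max(p_len, floor))
```

Under this form, a hypothesis whose length is rare for the given source length is penalized even when its average token probability is high, which discourages overly short outputs.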

Data
We used an English-to-Japanese translation task of the Asian Scientific Paper Excerpt Corpus (ASPEC) (Nakazawa et al., 2016b) used in the Workshop on Asian Translation (WAT), since it has been shown that syntactic information is useful in English-to-Japanese translation (Eriguchi et al., 2016b; Neubig et al., 2015). We followed the data preprocessing instructions for the English-to-Japanese task in Eriguchi et al. (2016b). The English sentences were tokenized by the tokenizer in the Enju parser (Miyao and Tsujii, 2008), and the Japanese sentences were segmented by the KyTea tool (footnote 1). Among the first 1,500,000 translation pairs in the training data, we selected 1,346,946 pairs where the maximum sentence length is 50. In what follows, we call this dataset the large training dataset. We further selected the first 20,000 and 100,000 pairs to construct the small and medium training datasets, respectively. The development data include 1,790 pairs, and the test data 1,812 pairs.
For the small and medium datasets, we built the vocabulary with words whose minimum frequency is two, and for the large dataset, we used words whose minimum frequency is three for English and five for Japanese. As a result, the vocabulary of the target language was 8,593 for the small dataset, 23,532 for the medium dataset, and 65,680 for the large dataset. A special token UNK was used to replace words which were not included in the vocabularies. The character n-grams (n = 2, 3, 4) were also constructed from each training dataset with the same frequency settings.
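The frequency-based vocabulary construction can be sketched as follows. The function names and the choice of index 0 for UNK are assumptions for illustration; "minimum frequency is two" is read as keeping every word that occurs at least twice.

```python
from collections import Counter


def build_vocab(sentences, min_freq):
    """Map each token occurring at least `min_freq` times to an integer id."""
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {"UNK": 0}  # assumption: UNK takes id 0
    for tok, c in counts.most_common():
        if c >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def replace_unknown(sentence, vocab):
    """Replace out-of-vocabulary tokens with the special token UNK."""
    return [tok if tok in vocab else "UNK" for tok in sentence]
```

For example, with `min_freq=2` a word seen only once in the training data is mapped to UNK both at training and at test time.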

Parameter Optimization and Translation
We tuned hyper-parameters of the model using development data. We set $(d_1, d_2) = (100, 50)$ for the latent graph parser. The word and character n-gram embeddings of the latent graph parser were initialized with the pre-trained embeddings in Hashimoto et al. (2017). The weight matrices in the latent graph parser were initialized with uniform random values in $[-\frac{\sqrt{6}}{\sqrt{row + col}}, +\frac{\sqrt{6}}{\sqrt{row + col}}]$, where $row$ and $col$ are the number of rows and columns of the matrices, respectively. All the bias vectors and the weight matrices in the softmax layers were initialized with zeros, and the bias vectors of the forget gates in the LSTMs were initialized by ones (Jozefowicz et al., 2015).
Footnote 1: http://www.phontron.com/kytea/.
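The weight initialization for the latent graph parser can be illustrated with a short Python sketch. It assumes the uniform range is $\pm\sqrt{6}/\sqrt{row + col}$, the usual range for this kind of row- and column-dependent initialization; the function name and the use of Python's `random.Random` are choices made for this example only.

```python
import math
import random


def init_uniform(rows, cols, rng):
    """Sample a rows x cols matrix uniformly from [-b, +b].

    Assumption: b = sqrt(6) / sqrt(rows + cols).
    """
    bound = math.sqrt(6.0) / math.sqrt(rows + cols)
    return [[rng.uniform(-bound, bound) for _ in range(cols)] for _ in range(rows)]


def init_zeros(size):
    """Bias vectors and softmax weights start at zero."""
    return [0.0] * size


def init_forget_bias(size):
    """Forget-gate biases start at one."""
    return [1.0] * size
```

Scaling the range by the matrix shape keeps the magnitude of activations roughly comparable across layers of different sizes.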
We set $d_3 = 128$ for the small training dataset, $d_3 = 256$ for the medium training dataset, and $d_3 = 512$ for the large training dataset. The word embeddings and the weight matrices of the NMT model were initialized with uniform random values in $[-0.1, +0.1]$. The training was performed by mini-batch stochastic gradient descent with momentum. For the BlackOut objective (Ji et al., 2016), the number of the negative samples was set to 2,000 for the small and medium training datasets, and 2,500 for the large training dataset. The mini-batch size was set to 128, and the momentum rate was set to 0.75 for the small and medium training datasets and 0.70 for the large training dataset. A gradient clipping technique was used with a clipping value of 1.0. The initial learning rate was set to 1.0, and the learning rate was halved when translation accuracy decreased. We used the BLEU scores obtained by greedy translation as the translation accuracy and checked it at every half epoch of the model training. We saved the model parameters at every half epoch and used the saved model parameters for the parameter averaging technique. For regularization, we used L2-norm regularization with a coefficient of $10^{-6}$ and applied dropout (Hinton et al., 2012) to Equation (8) with a dropout rate of 0.2.
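The learning-rate schedule and the use of saved checkpoints can be sketched as below. Two points are assumptions: "decreased" is taken to mean a drop relative to the previous half-epoch check, and parameter averaging is shown as a plain element-wise mean over the saved checkpoints, since the exact averaging scheme is not spelled out here. Parameters are represented as flat lists for simplicity.

```python
def update_learning_rate(lr, prev_bleu, curr_bleu):
    """Halve the learning rate when development BLEU decreased since the last check."""
    return lr * 0.5 if curr_bleu < prev_bleu else lr


def average_checkpoints(checkpoints):
    """Element-wise mean of saved parameter vectors (assumed averaging scheme)."""
    n = len(checkpoints)
    return [sum(values) / n for values in zip(*checkpoints)]
```

In a training loop, `update_learning_rate` would be called after each half-epoch evaluation, and `average_checkpoints` once at the end over the parameters saved at those points.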
The beam size for the beam search algorithm was 12 for the small and medium training datasets, and 50 for the large training dataset. We used BLEU (Papineni et al., 2002), RIBES (Isozaki et al., 2010), and perplexity scores as our evaluation metrics. Note that lower perplexity scores indicate better accuracy.

Pre-Training of Latent Graph Parser
The latent graph parser in our model can be optionally pre-trained by using human annotations for dependency parsing. In this paper, we used the widely used Wall Street Journal (WSJ) training data to jointly train the POS tagging and dependency parsing components. We used the standard training split (Sections 0-18) for POS tagging. We followed Chen and Manning (2014) to generate the training data (Sections 2-21) for dependency parsing. From each training dataset, we selected the first $K$ sentences to pre-train our model. The training dataset for POS tagging includes 38,219 sentences, and that for dependency parsing includes 39,832 sentences.
The parser including the POS tagger was first trained for 10 epochs in advance according to the multi-task learning procedure of Hashimoto et al. (2017), and then the overall NMT model was trained. When pre-training the POS tagging and dependency parsing components, we did not apply dropout to the model and did not fine-tune the word and character n-gram embeddings to avoid strong overfitting.

Model Configurations
LGP-NMT is our proposed model that learns the Latent Graph Parsing for NMT.
LGP-NMT+ is constructed by pre-training the latent parser in LGP-NMT as described in Section 4.3.
SEQ is constructed by removing the dependency composition in Equation (3), forming a sequential NMT model with the multi-layer encoder.
DEP is constructed by using pre-trained dependency relations rather than learning them. That is, $p(H_{w_i} = w_j \mid w_i)$ is fixed to 1.0 such that $w_j$ is the head of $w_i$. The dependency labels are also given by the parser which was trained by using all the training samples for parsing and tagging.
UNI is constructed by fixing $p(H_{w_i} = w_j \mid w_i)$ to $1/N$ for all the words in the same sentence. That is, the uniform probability distributions are used for equally connecting all the words.
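The two fixed-graph baselines differ from LGP-NMT only in how the head distributions are obtained, which the following Python sketch makes explicit. The matrix layout (one row per word, one column per candidate head) and the function names are assumptions made for this illustration.

```python
def uniform_heads(n):
    """UNI: every word spreads its head probability equally, 1/n per candidate."""
    return [[1.0 / n] * n for _ in range(n)]


def one_hot_heads(heads, n_candidates):
    """DEP: word i puts probability 1.0 on the head index predicted by a parser."""
    rows = []
    for h in heads:
        row = [0.0] * n_candidates
        row[h] = 1.0
        rows.append(row)
    return rows
```

LGP-NMT replaces these fixed matrices with distributions that are computed by the latent graph parser and updated through the translation objective.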

Results on Small and Medium Datasets
We first show our translation results using the small and medium training datasets. We report averaged scores with standard deviations across five different runs of the model training.

Small Training Dataset
Table 1 shows the results of using the small training dataset. LGP-NMT performs worse than SEQ and UNI, which shows that the small training dataset is not enough to learn useful latent graph structures from scratch. However, LGP-NMT+ ($K$ = 10,000) outperforms SEQ and UNI, and the standard deviations are the smallest. Therefore, the results suggest that pre-training the parsing and tagging components can improve the translation accuracy of our proposed model. We can also see that DEP performs the worst. This is not surprising because previous studies, e.g., Li et al. (2015), have reported that using syntactic structures does not always outperform competitive sequential models in several NLP tasks.
Now that we have observed the effectiveness of pre-training our model, one question arises naturally: how many training samples for parsing and tagging are necessary for improving the translation accuracy? Table 2 shows the results of using different numbers of training samples for parsing and tagging. The results of $K$ = 0 and $K$ = 10,000 correspond to those of LGP-NMT and LGP-NMT+ in Table 1, respectively. We can see that using a small number of the training samples performs better than using all the training samples. One possible reason is that the domains of the translation dataset and the parsing (tagging) dataset are considerably different. The parsing and tagging datasets come from WSJ, whereas the translation dataset comes from abstract text of scientific papers in a wide range of domains, such as biomedicine and computer science. These results suggest that our model can be improved by a small amount of parsing and tagging datasets in different domains. Considering the recent universal dependency project, which covers more than 50 languages, our model has the potential of being applied to a variety of language pairs.

Medium Training Dataset
[Table 3: Evaluation on the development data using the medium training dataset (100,000 pairs).]
Table 3 shows the results of using the medium training dataset. In contrast with using the small training dataset, LGP-NMT is slightly better than SEQ. LGP-NMT significantly outperforms UNI, which shows that our adaptive learning is more effective than using the uniform graph weights. By pre-training our model, LGP-NMT+ significantly outperforms SEQ in terms of the BLEU score. Again, DEP performs the worst among all the models.
By using our beam search strategy, the Brevity Penalty (BP) values of our translation results are equal to or close to 1.0, which is important when evaluating the translation results using the BLEU scores. A BP value ranges from 0.0 to 1.0, and larger values mean that the translated sentences have relevant lengths compared with the reference translations. As a result, our BLEU evaluation results are affected only by the word n-gram precision scores. BLEU scores are sensitive to the BP values, and thus our beam search strategy leads to more solid evaluation for NMT models.

Results on Large Dataset
Table 4 shows the BLEU and RIBES scores on the development data achieved with the large training dataset. Here we focus on our models and SEQ because UNI and DEP consistently perform worse than the other models as shown in Tables 1 and 3. The averaging technique and attention-based unknown word replacement (Jean et al., 2015; Hashimoto et al., 2016) [...] Again, we see that the translation scores of our model can be further improved by pre-training the model.
Table 5 shows our results on the test data, and the previous best results summarized in Nakazawa et al. (2016a) and the WAT website are also shown. Our proposed models, LGP-NMT and LGP-NMT+, outperform not only SEQ but also all of the previous best results. Notice also that our implementation of the sequential model (SEQ) provides a very strong baseline, the performance of which is already comparable to the previous state of the art, even without using ensemble techniques.

Table 5 (rows for previous systems): BLEU and RIBES scores on the test data.
System                     BLEU   RIBES
Cromieres et al. (2016)    38.20  82.39
Neubig et al. (2015)       38.17  81.38
Eriguchi et al. (2016a)    36.95  82.45
Neubig and Duh (2014)      36.58  79.65
Zhu (2015)                 36.21  80.91
Lee et al. (2015)          35.75  81.15

The confidence interval (p ≤ 0.05) of the RIBES score of LGP-NMT+ estimated by bootstrap resampling (Noreen, 1989) is (82.27, 83.37), and thus the RIBES score of LGP-NMT+ is significantly better than that of SEQ, which shows that our latent parser can be effectively pre-trained with the human-annotated treebank.
The sequential NMT model in Cromieres et al. (2016) and the tree-to-sequence NMT model in Eriguchi et al. (2016b) rely on ensemble techniques while our results mentioned above are obtained using single models. Moreover, our model is more compact than the previous best NMT model in Cromieres et al. (2016). By applying the ensemble technique to LGP-NMT, LGP-NMT+, [...]
[Figure 2: Translation examples. Source sentence of example (1): "As a result, it was found that a path which crosses a sphere obliquely existed."]

Selectional Preference
In the translation example (1) in Figure 2, we see that the adverb "obliquely" is interpreted differently across the systems. As in the reference translation, "obliquely" is a modifier of the verb "crosses". Our models correctly capture the relationship between the two words, whereas Google Translation and SEQ treat "obliquely" as a modifier of the verb "existed". This error is not a surprise since the verb "existed" is located closer to "obliquely" than the verb "crosses". A possible reason for the correct interpretation by our models is that they can better capture long-distance dependencies and are less susceptible to surface word distances. This is an indication of our models' ability to capture domain-specific selectional preference that cannot be captured by purely sequential models. It should be noted that simply using standard treebank-based parsers does not necessarily address this error, because our pre-trained dependency parser also interprets "obliquely" as a modifier of the verb "existed".

Adverb or Adjective
The translation example (2) in Figure 2 shows another example where the adverb "negatively" is interpreted as an adverb or an adjective. As in the reference translation, "negatively" is a modifier of the verb "controls". Only LGP-NMT+ correctly captures the adverb-verb relationship, whereas "negatively" is interpreted as the adjective "negative" to modify the noun "ImRNA" in the translation results from Google Translation and LGP-NMT. SEQ interprets "negatively" as both an adverb and an adjective, which leads to the repeated translations. This error suggests that the state-of-the-art NMT models are strongly affected by the word order. By contrast, the pre-training strategy effectively embeds the information about the POS tags and the dependency relations into our model.

Analysis on Learned Latent Graphs
Without Pre-Training
We inspected the latent graphs learned by LGP-NMT. Figure 1 shows an example of the learned latent graph obtained for a sentence taken from the development data of the translation task. It has long-range dependencies and cycles as well as ordinary left-to-right dependencies. We have observed that the punctuation mark "." is often pointed to by other words with large weights. This is primarily because the hidden state corresponding to the mark in each sentence has rich information about the sentence.
To measure the correlation between the latent graphs and human-defined dependencies, we parsed the sentences on the development data of the WSJ corpus and converted the graphs into dependency trees by Eisner's algorithm (Eisner, 1996). For evaluation, we followed Chen and Manning (2014) and measured Unlabeled Attachment Score (UAS). The UAS is 24.52%, which shows that the implicitly-learned latent graphs are partially consistent with the human-defined syntactic structures. Similar trends have been reported by Yogatama et al. (2017) in the case of binary constituency parsing. We checked the most dominant gold dependency labels which were assigned for the dependencies detected by LGP-NMT. The labels whose ratio is more than 3% are nn, amod, prep, pobj, dobj, nsubj, num, det, advmod, and poss. We see that dependencies between words in distant positions, such as subject-verb-object relations, can be captured.
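The UAS computation itself is simple and can be sketched as follows. Note the simplification: the sketch picks the highest-weight head for each word independently, which does not guarantee a well-formed tree, whereas the evaluation described above uses Eisner's algorithm for that step. The function names are hypothetical, and any punctuation filtering is assumed to have been applied to the inputs beforehand.

```python
def greedy_heads(head_probs):
    """Pick the most probable head index for each word (not guaranteed to form a tree)."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in head_probs]


def unlabeled_attachment_score(pred_heads, gold_heads):
    """Percentage of words whose predicted head matches the gold head."""
    correct = sum(p == g for p, g in zip(pred_heads, gold_heads))
    return 100.0 * correct / len(gold_heads)
```

A low UAS under this measure means that the heads preferred by the latent graph differ from the treebank heads, not that the graph is unhelpful for translation.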

With Pre-Training
We also inspected the pre-trained latent graphs. Figure 3-(a) shows the dependency structure output by the pre-trained latent parser for the same sentence in Figure 1. This is an ordinary dependency tree, and the head selection is almost deterministic; that is, for each word, the largest weight of the head selection is close to 1.0. By contrast, the weight values are more evenly distributed in the case of LGP-NMT as shown in Figure 1. After the overall NMT model training, the latent parser is adapted to the translation task, and Figure 3-(b) shows the adapted latent graph. Again, we can see that the adapted weight values are also distributed and different from the original pre-trained weight values, which suggests that human-defined syntax is not always optimal for the target task.
The UAS of the pre-trained dependency trees is 92.52% (footnote 9), and that of the adapted latent graphs is 18.94%. Surprisingly, the resulting UAS (18.94%) is lower than the UAS of our model without pre-training (24.52%). However, in terms of the translation accuracy, our model with pre-training is better than that without pre-training. These results suggest that human-annotated treebanks can provide useful prior knowledge to guide the overall model training by pre-training, but the resulting sentence structures adapted to the target task do not need to highly correlate with the treebanks.
Footnote 9: The UAS is significantly lower than the reported score in Hashimoto et al. (2017). The reason is described in Section 4.3.

Related Work
While initial studies on NMT treat each sentence as a sequence of words (Bahdanau et al., 2015; Sutskever et al., 2014), researchers have recently started investigating the use of syntactic structures in NMT models (Bastings et al., 2017; Chen et al., 2017; Eriguchi et al., 2016a,b, 2017; Li et al., 2017; Stahlberg et al., 2016; Yang et al., 2017). In particular, Eriguchi et al. (2016b) introduced a tree-to-sequence NMT model by building a tree-structured encoder on top of a standard sequential encoder, which motivated the use of the dependency composition vectors in our proposed model. Prior to the advent of NMT, the syntactic structures had been successfully used in statistical machine translation systems (Neubig and Duh, 2014; Yamada and Knight, 2001). These syntax-based approaches are pipelined; a syntactic parser is first trained by supervised learning using a treebank such as the WSJ dataset, and then the parser is used to automatically extract syntactic information for machine translation. They rely on the output from the parser, and therefore parsing errors are propagated through the whole system. By contrast, our model allows the parser to be adapted to the translation task, thereby providing a first step towards addressing ambiguous syntactic and semantic problems, such as domain-specific selectional preference and PP attachments, in a task-oriented fashion.
Our model learns latent graph structures in a source-side language. Eriguchi et al. (2017) have proposed a model which learns to parse and translate by using automatically-parsed data. Thus, it is also an interesting direction to learn latent structures in a target-side language.
As for the learning of latent syntactic structure, there are several studies on learning task-oriented syntactic structures. Yogatama et al. (2017) used a reinforcement learning method on shift-reduce action sequences to learn task-oriented binary constituency trees. They have shown that the learned trees do not necessarily highly correlate with the human-annotated treebanks, which is consistent with our experimental results. Socher et al. (2011) used a recursive autoencoder model to greedily construct a binary constituency tree for each sentence. The autoencoder objective works as a regularization term for sentiment classification tasks. Prior to these deep learning approaches, Wu (1997) presented a method for bilingual parsing. One of the characteristics of our model is directly using the soft connections of the graph edges with the real-valued weights, whereas all of the above-mentioned methods use one best structure for each sentence. Our model is based on dependency structures, and it is a promising future direction to jointly learn dependency and constituency structures in a task-oriented fashion.
Finally, more related to our model, Kim et al. (2017) applied their structured attention networks to a Natural Language Inference (NLI) task for learning dependency-like structures. They showed that pre-training their model by a parsing dataset did not improve accuracy on the NLI task. By contrast, our experiments show that such a parsing dataset can be effectively used to improve translation accuracy by varying the size of the dataset and by avoiding strong overfitting. Moreover, our translation examples show the concrete benefit of learning task-oriented latent graph structures.

Conclusion and Future Work
We have presented an end-to-end NMT model by jointly learning translation and source-side latent graph representations. By pre-training our model using treebank annotations, our model significantly outperforms both a pipelined syntax-based model and a state-of-the-art sequential model. On English-to-Japanese translation, our model outperforms the previous best models by a large margin. In future work, we will investigate the effectiveness of our approach in different types of target tasks.