Learning to Parse and Translate Improves Neural Machine Translation

There has been relatively little attention to incorporating linguistic prior into neural machine translation. Much of the previous work was further constrained to considering linguistic prior on the source side. In this paper, we propose a hybrid model, called NMT+RNNG, that learns to parse and translate by combining a recurrent neural network grammar with attention-based neural machine translation. Our approach encourages the neural machine translation model to incorporate linguistic prior during training, and lets it translate on its own afterward. Extensive experiments with four language pairs show the effectiveness of the proposed NMT+RNNG.


Introduction
Neural Machine Translation (NMT) has enjoyed impressive success without relying on much, if any, prior linguistic knowledge. Some of the most recent studies have, for instance, demonstrated that NMT systems work comparably to other systems even when the source and target sentences are given simply as flat sequences of characters (Lee et al., 2016; Chung et al., 2016) or statistically, not linguistically, motivated subword units (Wu et al., 2016). Shi et al. (2016) recently observed that the encoder of NMT captures syntactic properties of a source sentence automatically, indirectly suggesting that explicit linguistic prior may not be necessary.
On the other hand, there have only been a couple of recent studies showing the potential benefit of explicitly encoding the linguistic prior into NMT. One such study, for instance, proposed to augment each source word with its corresponding part-of-speech tag, lemmatized form and dependency label. Eriguchi et al. (2016) instead replaced the sequential encoder with a tree-based encoder which computes the representation of the source sentence following its parse tree. Stahlberg et al. (2016) let the lattice from a hierarchical phrase-based system guide the decoding process of neural machine translation, which results in two separate models rather than a single end-to-end one. Despite the promising improvements, these explicit approaches are limited in that the trained translation model strictly requires the availability of external tools at inference time.
We instead propose to implicitly incorporate linguistic prior based on the idea of multi-task learning (Caruana, 1998; Collobert et al., 2011). More specifically, we design a hybrid decoder for NMT, called NMT+RG, that combines a usual conditional language model and the recently proposed recurrent neural network grammar (RNNG, Dyer et al., 2016). This is done by plugging in the conventional language model decoder in place of the buffer in the RNNG, while sharing a subset of parameters, such as word vectors, between the language model and the RNNG. We train this hybrid model to maximize both the log-probability of a target sentence and the log-probability of a parse action sequence. We use an external parser (Andor et al., 2016) to generate target parse actions, but unlike the previous explicit approaches, we do not need it at test time.
We evaluate the proposed NMT+RG on four language pairs ({Cs, De, Ru, Jp}-En). We observe significant improvements in terms of BLEU scores on three out of four language pairs and RIBES scores on all the language pairs.

Neural Machine Translation
Neural machine translation is a recently proposed framework for building a machine translation system based purely on neural networks. It is often built as an attention-based encoder-decoder network (Cho et al., 2015) with two recurrent networks, an encoder and a decoder, and an attention model. The encoder, which is often implemented as a bidirectional recurrent network with long short-term memory units (LSTM, Hochreiter and Schmidhuber, 1997) or gated recurrent units (GRU, Cho et al., 2014), first reads a source sentence represented as a sequence of words $x = (x_1, x_2, \ldots, x_N)$. The encoder returns a sequence of hidden states $h = (h_1, h_2, \ldots, h_N)$. Each hidden state $h_i$ is a concatenation of those from the forward and backward recurrent networks: $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$.
The decoder is implemented as a conditional recurrent language model which models the target sentence, or translation, as
$$p(y \mid x) = \prod_{j=1}^{M} p(y_j \mid y_{<j}, x),$$
where $y = (y_1, \ldots, y_M)$. Each of the conditional probabilities on the r.h.s. is computed by
$$p(y_j = y \mid y_{<j}, x) \propto \exp\left(W_y^\top \tilde{s}_j\right), \tag{1}$$
$$\tilde{s}_j = \tanh\left(W_c [s_j; c_j]\right), \tag{2}$$
$$s_j = f_{\mathrm{dec}}\left(s_{j-1}, [V_y(y_{j-1}); \tilde{s}_{j-1}]\right), \tag{3}$$
where $f_{\mathrm{dec}}$ is a recurrent activation function, such as LSTM or GRU, $W_y$ is the output word vector of the word $y$, $W_c$ is a weight matrix, and $V_y(y_{j-1})$ is the input vector of the previously decoded word. $c_j$ is a time-dependent context vector that is computed by the attention model using the sequence $h$ of hidden states from the encoder.
The attention model first compares the current hidden state $s_j$ against each of the encoder hidden states and assigns a scalar score: $\beta_{i,j} = h_i^\top W_d s_j$ (Luong et al., 2015). These scores are then normalized across the hidden states to sum to 1, that is, $\alpha_{i,j} = \exp(\beta_{i,j}) / \sum_{i'} \exp(\beta_{i',j})$. The time-dependent context vector is then a weighted sum of the hidden states with these attention weights: $c_j = \sum_i \alpha_{i,j} h_i$.
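As an illustration of the attention step described above, the following Python sketch computes the normalized attention weights and the context vector for a single decoder step. It assumes a bilinear score between each encoder state and the current decoder state; the function names, the matrix `W_d` and the toy vectors are hypothetical and chosen only for this example, not taken from the authors' implementation.

```python
import math

def softmax(scores):
    """Normalize a list of scalar scores so that they sum to 1."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bilinear_score(h_i, W_d, s_j):
    """Scalar score h_i^T W_d s_j (assumed bilinear form)."""
    W_s = [sum(w * s for w, s in zip(row, s_j)) for row in W_d]
    return sum(h * v for h, v in zip(h_i, W_s))

def context_vector(hidden_states, W_d, s_j):
    """Return attention weights and the weighted sum of encoder states."""
    scores = [bilinear_score(h_i, W_d, s_j) for h_i in hidden_states]
    alphas = softmax(scores)
    dim = len(hidden_states[0])
    c_j = [sum(a * h_i[k] for a, h_i in zip(alphas, hidden_states))
           for k in range(dim)]
    return alphas, c_j

# Toy example: three encoder states of dimension 2 (illustrative values only).
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_d = [[1.0, 0.0], [0.0, 1.0]]
s_j = [2.0, 0.0]
alphas, c_j = context_vector(h, W_d, s_j)
```

In this toy setting the first and third encoder states receive the same score and therefore the same weight, while the second state, which is orthogonal to the decoder state, receives less.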

Recurrent Neural Network Grammars
A recurrent neural network grammar (RNNG, Dyer et al., 2016) is a probabilistic syntax-based language model. Unlike a usual recurrent language model (see, e.g., Mikolov et al., 2010), an RNNG simultaneously models both tokens and their tree-based composition. This is done by having an (output) buffer, a stack and an action history, each of which is implemented as a stack LSTM (sLSTM, Dyer et al., 2015). At each time step, the action sLSTM predicts the next action based on the (current) hidden states of the buffer, stack and action sLSTMs. That is,
$$p(a_t = a \mid a_{<t}) \propto \exp\left(W_a^\top f_{\mathrm{action}}(h_t^{\mathrm{buffer}}, h_t^{\mathrm{stack}}, h_t^{\mathrm{action}})\right), \tag{4}$$
where $W_a$ is the vector of the action $a$ and $f_{\mathrm{action}}$ combines the three hidden states. If the selected action is shift, the word at the beginning of the buffer is moved to the stack. When the reduce action is selected, the top two words in the stack are reduced to build a partial tree. Additionally, the action may be one of many possible non-terminal symbols, in which case the predicted non-terminal symbol is pushed to the stack.
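The hidden states of the buffer, stack and action sLSTMs are correspondingly updated by
$$h_t^{\mathrm{buffer}} = \mathrm{StackLSTM}\left(h_{\mathrm{top}}^{\mathrm{buffer}}, V_y(y_{t-1})\right), \tag{5}$$
$$h_t^{\mathrm{stack}} = \mathrm{StackLSTM}\left(h_{\mathrm{top}}^{\mathrm{stack}}, r_t\right), \tag{6}$$
$$h_t^{\mathrm{action}} = \mathrm{StackLSTM}\left(h_{\mathrm{top}}^{\mathrm{action}}, V_a(a_{t-1})\right), \tag{7}$$
where $V_y$ and $V_a$ are functions returning the target word and action vectors. The input vector $r_t$ of the stack sLSTM is computed recursively by
$$r_t = \tanh\left(W_r [r^d; r^p; V_a(a_t)]\right), \tag{8}$$
where $r^d$ and $r^p$ are the corresponding vectors of the dependent and parent phrases, respectively (Dyer et al., 2015). This process is repeated until the complete parse tree is built. When the complete sentence is provided, the buffer simply summarizes the shifted words. When the RNNG is used as a generator, the buffer further generates the next word when the selected action is shift. The latter can be done by replacing the buffer with a recurrent language model, which is the idea on which our proposal is based.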
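To make the transition mechanics concrete, the following Python sketch replays a sequence of shift and reduce actions over a buffer and a stack, leaving out the neural scoring and the non-terminal actions entirely. The action names `REDUCE-L` and `REDUCE-R`, the convention that `REDUCE-L` makes the topmost stack item the head, and the function `run_transitions` are assumptions made for this illustration.

```python
def run_transitions(words, actions):
    """Replay shift/reduce actions; return the final stack and the arcs built."""
    buffer = list(words)   # words still waiting to be shifted, left to right
    stack = []             # head words of the partial trees built so far
    arcs = []              # (head, dependent) pairs in the order they were built
    for action in actions:
        if action == "SHIFT":
            # Move the word at the beginning of the buffer onto the stack.
            stack.append(buffer.pop(0))
        elif action in ("REDUCE-L", "REDUCE-R"):
            # Combine the top two stack items into one partial tree.
            right = stack.pop()
            left = stack.pop()
            if action == "REDUCE-L":
                head, dependent = right, left   # assumed convention
            else:
                head, dependent = left, right
            arcs.append((head, dependent))
            stack.append(head)
        else:
            raise ValueError(f"unknown action: {action}")
    return stack, arcs

stack, arcs = run_transitions(
    ["the", "cat", "sleeps"],
    ["SHIFT", "SHIFT", "REDUCE-L", "SHIFT", "REDUCE-L"],
)
```

After the last action the stack holds a single item, the root of the tree, which corresponds to the point at which the parse is complete.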

NMT+RG
Our main proposal in this paper is to hybridize the decoder of the neural machine translation model and the RNNG. We continue from the earlier observation that we can replace the buffer of the RNNG with a recurrent language model that simultaneously summarizes the shifted words and generates future words. We replace the RNNG's buffer with the neural translation model's decoder in two steps.
Construction First, we replace the hidden state of the buffer $h^{\mathrm{buffer}}$ (in Eq. (5)) with the hidden state of the decoder of the attention-based neural machine translation from Eq. (3). As is clear from those two equations, both the buffer sLSTM and the translation decoder take as input the previous hidden state ($h_{\mathrm{top}}^{\mathrm{buffer}}$ and $s_{j-1}$, respectively) and the previously decoded word (or the previously shifted word in the case of the RNNG's buffer), and return a summary state. The only difference is that the translation decoder additionally considers the state $\tilde{s}_{j-1}$. Second, we let the next-word prediction of the translation decoder serve as the generator of the RNNG. In other words, the generator of the RNNG will output a word, when asked by the shift action, according to the conditional distribution defined by the translation decoder in Eq. (1). Once the buffer sLSTM is replaced with the neural translation decoder, the action sLSTM naturally takes as input the translation decoder's hidden state when computing the action conditional distribution in Eq. (4). We call this hybrid model NMT+RG.
Learning and Inference After this integration, our hybrid NMT+RG models the conditional distribution over all possible pairs of a translation and its parse given a source sentence, i.e., $p(y, a \mid x)$. Assuming the availability of parse annotation on the target side of a parallel corpus, we train the whole model jointly to maximize $\mathbb{E}_{(x,y,a) \sim \mathrm{data}}[\log p(y, a \mid x)]$. In doing so, we notice that there are two separate paths through which the neural translation decoder receives error signal. First, the decoder is updated in order to maximize the conditional probability of the correct next word, which is already present in the original neural machine translation. Second, the decoder is also updated to maximize the conditional probability of the correct parsing action, which is a novel learning signal introduced by the proposed hybridization. Furthermore, the second learning signal affects the encoder as well, encouraging the whole neural translation model to be aware of the syntactic structure of the target language. Later in the experiments, we show that this additional learning signal is useful for translation, even though we discard the RNNG (the stack and action sLSTMs) at inference time.
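The two learning signals can be sketched as a single training loss. The Python snippet below assumes that $\log p(y, a \mid x)$ decomposes into a sum of per-word and per-action log-probabilities, which is consistent with maximizing both the log-probability of the target sentence and that of the parse action sequence; the function name and the probability values are hypothetical.

```python
import math

def joint_negative_log_likelihood(word_probs, action_probs):
    """Loss for one training example under the assumed decomposition.

    word_probs:   probabilities the decoder assigned to the correct target words
    action_probs: probabilities the action model assigned to the correct actions
    """
    log_p_words = sum(math.log(p) for p in word_probs)      # translation signal
    log_p_actions = sum(math.log(p) for p in action_probs)  # parsing signal
    return -(log_p_words + log_p_actions)

# Illustrative values only: two target words and one parse action.
loss = joint_negative_log_likelihood([0.5, 0.5], [0.25])
```

At inference time only the word term is needed, since the stack and action components are discarded and the decoder translates on its own.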

Knowledge Distillation for Parsing
A major challenge in training the proposed hybrid model is that there is no parallel corpus augmented with gold-standard target-side parses, and no treebank of gold-standard parses accompanied by translations. In other words, we must either parse the target-side sentences of an existing parallel corpus or translate sentences with existing gold-standard parses. As the target task of the proposed model is translation, we start with a parallel corpus and annotate the target-side sentences. It is however costly to manually annotate any corpus of reasonable size (see Table 6 in Alonso et al., 2016). We instead resort to noisy, but automated annotation using an existing parser. This approach of automated annotation can be considered along the line of recently proposed techniques of knowledge distillation (Hinton et al., 2015) and distant supervision (Mintz et al., 2009). In knowledge distillation, a teacher network is trained purely on a training set with ground-truth annotations, and the annotations predicted by this teacher are used to train a student network. This is similar to our approach, where the external parser can be thought of as a teacher and the proposed hybrid network's RNNG as a student. On the other hand, what we propose here is a special case of distant supervision in that the external parser provides noisy annotations to an otherwise unlabeled training set.
Specifically, we use SyntaxNet, released by Andor et al. (2016), to parse each target sentence. We convert a parse tree into a sequence of transition actions, each of which is one of three types (SHIFT, REDUCE-L, REDUCE-R). We label each REDUCE action with the corresponding dependency label and treat it as a more fine-grained action.
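The conversion from a dependency tree to an action sequence can be sketched with a standard arc-standard oracle. The Python code below is an illustration under stated assumptions, not the authors' conversion script: it assumes a projective tree given as a list of head indices (with -1 for the root), assumes that `REDUCE-L` attaches the second stack item to the topmost one, and uses hypothetical names and example labels throughout.

```python
def arc_standard_oracle(heads, labels):
    """Derive SHIFT / REDUCE-L(label) / REDUCE-R(label) actions from a tree.

    heads[i]  is the index of the head of token i, or -1 for the root.
    labels[i] is the dependency label of the arc from heads[i] to token i.
    """
    n = len(heads)
    pending = [0] * n              # dependents of each token not yet attached
    for h in heads:
        if h >= 0:
            pending[h] += 1
    stack, actions, next_token = [], [], 0
    while next_token < n or len(stack) > 1:
        if len(stack) >= 2:
            top, second = stack[-1], stack[-2]
            # Attach `second` under `top` once `second` has all its dependents.
            if heads[second] == top and pending[second] == 0:
                actions.append(f"REDUCE-L({labels[second]})")
                del stack[-2]
                pending[top] -= 1
                continue
            # Attach `top` under `second` once `top` has all its dependents.
            if heads[top] == second and pending[top] == 0:
                actions.append(f"REDUCE-R({labels[top]})")
                stack.pop()
                pending[second] -= 1
                continue
        if next_token >= n:
            raise ValueError("tree is non-projective or has more than one root")
        stack.append(next_token)
        actions.append("SHIFT")
        next_token += 1
    return actions

# "the cat sleeps": the -> cat, cat -> sleeps, sleeps is the root.
actions = arc_standard_oracle([1, 2, -1], ["det", "nsubj", "root"])
```

Appending the dependency label to each reduce action is what turns the three coarse action types into the finer-grained action inventory.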

Language Pairs and Corpora
We compare the proposed NMT+RG against the baseline model on four different language pairs: Jp-En, Cs-En, De-En and Ru-En. The basic statistics of the training data are presented in Table 1.
Ja We use the ASPEC corpus ("train1.txt") from the WAT'16 Jp-En translation task. We tokenize each Japanese sentence with KyTea (Neubig et al., 2011) and preprocess according to the recommendations from WAT'16 (WAT, 2016). We use the first 100K sentence pairs of length shorter than 50 for training. The vocabulary is constructed with all the unique tokens that appear at least twice in the training corpus. We use "dev.txt" and "test.txt" provided by WAT'16 respectively as development and test sets.
Cs, De and Ru We use News Commentary v8 from WMT'16. We use the tokenizer from Moses (Koehn et al., 2007), and build a vocabulary of each language using unique tokens that appear at least 6, 6 and 5 times respectively for Cs, Ru and De. The target-side (English) vocabulary was constructed with all the unique tokens appearing more than three times in each corpus. We only use sentence pairs of length 50 or less for training. We use "newstest2015" and "newstest2016" as development and test sets respectively.
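The preprocessing described above, a frequency-thresholded vocabulary and a sentence length limit, can be sketched as follows. The function names, the `<unk>` placeholder token and the toy corpus are assumptions for illustration; the exact comparison used for the length limit differs slightly between the corpora (shorter than 50 versus 50 or less), so `max_len` here is only one possible choice.

```python
from collections import Counter

def build_vocab(sentences, min_count, specials=("<unk>",)):
    """Map every token seen at least `min_count` times to an integer id."""
    counts = Counter(tok for sent in sentences for tok in sent)
    kept = sorted(tok for tok, c in counts.items() if c >= min_count)
    return {tok: idx for idx, tok in enumerate(list(specials) + kept)}

def filter_by_length(pairs, max_len=50):
    """Keep sentence pairs whose source and target both have at most max_len tokens."""
    return [(src, tgt) for src, tgt in pairs
            if len(src) <= max_len and len(tgt) <= max_len]

# Toy tokenized corpus (illustrative only).
corpus = [["a", "b", "a"], ["b", "c"], ["a"]]
vocab = build_vocab(corpus, min_count=2)

pairs = [(["x"] * 50, ["y"] * 10), (["x"] * 51, ["y"] * 10)]
kept_pairs = filter_by_length(pairs, max_len=50)
```

Tokens below the threshold, such as `c` in the toy corpus, would be mapped to the placeholder token at training time.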

Models, Learning and Inference
In all our experiments, each recurrent network has a single layer of LSTM units of 256 dimensions, and the word vectors and the action vectors are of 256 and 128 dimensions, respectively. To reduce computational overhead, we use BlackOut (Ji et al., 2015) with 2000 negative samples and α = 0.4. For the proposed NMT+RG, we share the target word vectors between the decoder (buffer) and the stack sLSTM. We use stochastic gradient descent with minibatches of 128 examples. The learning rate starts from 1.0, and is halved each time the perplexity on the development set increases. We clip the norm of the gradient (Pascanu et al., 2012) with the threshold set to 3.0 (2.0 for the baseline models on Ru-En and Cs-En to avoid NaN and Inf). We use beam search at inference time, with the beam width selected based on the development set performance.
More details are described in the supplementary material.
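The learning-rate schedule and gradient clipping described in this section can be sketched as below. This is a minimal illustration with hypothetical function names; in particular, comparing the development perplexity against the immediately preceding evaluation (rather than, say, the best value so far) is an assumption, as the text does not specify it.

```python
import math

def clip_by_norm(grad, threshold):
    """Rescale the gradient vector if its L2 norm exceeds the threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    return list(grad)

def update_learning_rate(lr, prev_ppl, curr_ppl):
    """Halve the learning rate when development perplexity increased."""
    if prev_ppl is not None and curr_ppl > prev_ppl:
        return lr / 2.0
    return lr

# Gradient with norm 5.0 clipped to the threshold 3.0.
clipped = clip_by_norm([3.0, 4.0], threshold=3.0)

# Learning rate over a made-up sequence of development perplexities.
lr, prev = 1.0, None
for ppl in [100.0, 90.0, 95.0, 93.0, 97.0]:
    lr = update_learning_rate(lr, prev, ppl)
    prev = ppl
```

With the made-up perplexities above, the rate is halved twice, once at each evaluation where the perplexity went up.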

Results and Analysis
In Table 2, we report the translation qualities of the tested models on all four language pairs. We report both BLEU (Papineni et al., 2002) and RIBES (Isozaki et al., 2010). Except for De-En, measured in BLEU, we observe statistically significant improvements by the proposed NMT+RG over the baseline model. It is worthwhile to note that these significant improvements have been achieved without any additional parameters or computational overhead at inference time.
Table 2: BLEU and RIBES scores by the baseline and proposed models on the test set. We use the bootstrap resampling method from Koehn (2004) to compute the statistical significance. We use † to mark those significant cases with p < 0.005.
In Table 3, we see that the best performance could only be achieved when all the three components were present. Removing the stack had the most adverse effect, which was found to be the case for parsing as well by Kuncoro et al. (2017).
[...] into this possibility for future research.
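For readers unfamiliar with the significance test mentioned in the caption of Table 2, a paired bootstrap resampling test in the spirit of Koehn (2004) can be sketched as follows. This is a simplified illustration: it averages hypothetical per-sentence scores instead of recomputing corpus-level BLEU or RIBES on each resampled test set, and the function name, sample count and score values are assumptions.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system B does not beat system A."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    not_better = 0
    for _ in range(n_samples):
        # Draw a test set of the same size, sampling sentences with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_b <= mean_a:
            not_better += 1
    return not_better / n_samples

# Made-up per-sentence scores: the second system is better on every sentence.
baseline = [0.10, 0.20, 0.30, 0.40]
proposed = [0.15, 0.25, 0.35, 0.45]
p_value = paired_bootstrap(baseline, proposed)
```

A small returned fraction indicates that the second system outperforms the first on nearly all resampled test sets, which is the sense in which an improvement is reported as significant at a given threshold.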