Multi-Source Syntactic Neural Machine Translation

We introduce a novel multi-source technique for incorporating source syntax into neural machine translation using linearized parses. This is achieved by employing separate encoders for the sequential and parsed versions of the same source sentence; the resulting representations are then combined using a hierarchical attention mechanism. The proposed model improves over both seq2seq and parsed baselines by over 1 BLEU on the WMT17 English-German task. Further analysis shows that our multi-source syntactic model is able to translate successfully without any parsed input, unlike standard parsed methods. In addition, performance does not deteriorate as much on long sentences as for the baselines.


Introduction
Neural machine translation (NMT) typically makes use of a recurrent neural network (RNN) -based encoder and decoder, along with an attention mechanism (Bahdanau et al., 2015;Cho et al., 2014;Kalchbrenner and Blunsom, 2013;Sutskever et al., 2014). However, it has been shown that RNNs require some supervision to learn syntax (Bentivogli et al., 2016;Linzen et al., 2016;Shi et al., 2016). Therefore, explicitly incorporating syntactic information into NMT has the potential to improve performance. This is particularly true for source syntax, which can improve the model's representation of the source language.
Recently, there have been a number of proposals for using linearized representations of parses within standard NMT (Aharoni and Goldberg, 2017;Li et al., 2017;Nadejde et al., 2017). Linearized parses are advantageous because they can inject syntactic information into the models without significant changes to the architecture. However, using linearized parses in a sequence-to-sequence (seq2seq) framework creates some challenges, particularly when using source parses. First, the parsed sequences are significantly longer than standard sentences, since they contain node labels as well as words. Second, these systems often fail when the source sentence is not parsed. This can be a problem for inference, since the external parser may fail on an input sentence at test time. We propose a method for incorporating linearized source parses into NMT that addresses these challenges by taking both the sequential source sentence and its linearized parse simultaneously as input in a multi-source framework. Thus, the model is able to use the syntactic information encoded in the parse while falling back to the sequential sentence when necessary. Our proposed model improves over both standard and parsed NMT baselines.

Seq2seq Neural Parsing
Using linearized parse trees within sequential frameworks was first done in the context of neural parsing. Vinyals et al. (2015) parsed using an attentional seq2seq model; they used linearized, unlexicalized parse trees on the target side and sentences on the source side. In addition, as in this work, they used an external parser to create synthetic parsed training data, resulting in improved parsing performance. Choe and Charniak (2016) adopted a similar strategy, using linearized parses in an RNN language modeling framework.

NMT with Source Syntax
Among the first proposals for using source syntax in NMT was that of Luong et al. (2016), who introduced a multi-task system in which the source data was parsed and translated using a shared encoder and two decoders. More radical changes to the standard NMT paradigm have also been proposed. Eriguchi et al. (2016) introduced tree-tosequence NMT; this model took parse trees as input using a tree-LSTM (Tai et al., 2015) encoder. Bastings et al. (2017) used a graph convolutional encoder in order to take labeled dependency parses of the source sentences into account. Hashimoto and Tsuruoka (2017) added a latent graph parser to the encoder, allowing it to learn soft dependency parses while simultaneously learning to translate.

Linearized Parse Trees in NMT
The idea of incorporating linearized parses into seq2seq has been adapted to NMT as a means of injecting syntax. Aharoni and Goldberg (2017) first did this by parsing the target side of the training data and training the system to generate parsed translations of the source input; this is the inverse of our parse2seq baseline. Similarly, Nadejde et al. (2017) interleaved CCG supertags with words on the target side, finding that this improved translation despite requiring longer sequences.
Most similar to our multi-source model is the parallel RNN model proposed by Li et al. (2017). Like multi-source, the parallel RNN used two encoders, one for words and the other for syntax. However, they combined these representations at the word level, whereas we combine them on the sentence level. Their mixed RNN model is also similar to our parse2seq baseline, although the mixed RNN decoder attended only to words. As the mixed RNN model outperformed the parallel RNN model, we do not attempt to compare our model to parallel RNN. These models are similar to ours in that they incorporate linearized parses into NMT; here, we utilize a multi-source framework.

Multi-Source NMT
Multi-source methods in neural machine translation were first introduced by Zoph and Knight (2016) for multilingual translation. They used one encoder per source language, and combined the resulting sentence representations before feeding them into the decoder. Firat et al. (2016) expanded on this by creating a multilingual NMT system with multiple encoders and decoders.  applied multi-source NMT to multimodal translation and automatic post-editing and explored different strategies for combining attention over the two sources. In this paper, we apply the multi-source framework to a novel task, syntactic neural machine translation.

NMT with Linearized Source Parses
We propose a multi-source method for incorporating source syntax into NMT. This method makes use of linearized source parses; we describe these parses in section 3.1. Throughout this paper, we refer to standard sentences that do not contain any explicit syntactic information as sequential; see Table 1 for an example.

Linearized Source Parses
We use an off-the-shelf parser, in this case Stanford CoreNLP (Manning et al., 2014), to create binary constituency parses. These parses are linearized as shown in Table 1. We tokenize the opening parentheses with the node label (so each node label begins with a parenthesis) but keep the closing parentheses separate from the words they follow. For our task, the parser failed on one training sentence of 5.9 million, which we discarded, and succeeded on all test sentences. It took roughly 16 hours to parse the 5.9 million training sentences.
Following Sennrich et al. (2016b), our networks operate at the subword level using byte pair encoding (BPE) with a shared vocabulary on the source and target sides. However, the parser operates at the word level. Therefore, we parse then break into subwords, so a leaf may have multiple tokens without internal structure.
The proposed method is tested using both lexicalized and unlexicalized parses. In unlexicalized parses, we remove the words, keeping only the node labels and the parentheses. In lexicalized parses, the words are included. Table 1 shows an example of the three source sentence formats: sequential, lexicalized parse, and unlexicalized parse. Note that the lexicalized parse is significantly longer than the other versions.

Multi-Source
We propose a multi-source framework for injecting linearized source parses into NMT. This model consists of two identical RNN encoders with no shared parameters, as well as a standard RNN decoder. For each target sentence, two versions of the source sentence are used: the sequential (standard) version and the linearized parse (lexicalized or unlexicalized). Each of these is encoded simul-Example Sentence sequential history is a great teacher . lexicalized parse (ROOT (S (NP (NN history ) ) (VP (VBZ is ) (NP (DT a ) (JJ great ) (NN teacher ) ) ) (. . ) ) ) unlexicalized parse (ROOT (S (NP (NN ) ) (VP (VBZ ) (NP (DT ) (JJ ) (NN ) ) ) (. . ) ) ) target sentence die Geschichte ist ein großartiger Lehrmeister . taneously using the encoders; the encodings are then combined and used as input to the decoder. We combine the source encodings using the hierarchical attention combination proposed by . This consists of a separate attention mechanism for each encoder; these are then combined using an additional attention mechanism over the two separate context vectors. This multi-source method is thus able to combine the advantages of both standard RNN-based encodings and syntactic encodings.

Data
We base our experiments on the WMT17 (Bojar et al., 2017) English (EN) → German (DE) news translation task. All 5.9 million parallel training sentences are used, but no monolingual data. Validation is done on newstest2015, while new-stest2016 and newstest2017 are used for testing. We train a shared BPE vocabulary with 60k merge operations on the parallel training data. For the parsed data, we break words into subwords after applying the Stanford parser. We tokenize and truecase the data using the Moses tokenizer and truecaser (Koehn et al., 2007).

Implementation
The models are implemented in Neural Monkey . They are trained using Adam (Kingma and Ba, 2015) and have minibatch size 40, RNN size 512, and dropout probability 0.2 (Gal and Ghahramani, 2016). We train to convergence on the validation set, using BLEU (Papineni et al., 2002) as the metric.
For sequential inputs and outputs, the maximum sentence length is 50 subwords. For parsed inputs, we increase maximum sentence length to 150 subwords to account for the increased length due to the parsing labels; we still use a maximum output length of 50 subwords for these systems.

Seq2seq
The proposed models are compared against two baselines. The first, referred to here as seq2seq, is the standard RNN-based neural machine translation system with attention (Bahdanau et al., 2015). This baseline does not use the parsed data.

Parse2seq
The second baseline we consider is a slight modification of the mixed RNN model proposed by Li et al. (2017). This uses an identical architecture to the seq2seq baseline (except for a longer maximum sentence length in the encoder). Instead of using sequential data on the source side, the linearized parses are used. We allow the system to attend equally to words and node labels on the source side, rather than restricting the attention to words. We refer to this baseline as parse2seq. Table 2 shows the performance on EN→DE translation for each of the proposed systems and the baselines, as approximated by BLEU score. The multi-source systems improve strongly over both baselines, with improvements of up to 1.5 BLEU over the seq2seq baseline and up to 1.1 BLEU over the parse2seq baseline. In addition, the lexicalized multi-source systems yields slightly higher BLEU scores than the unlexicalized multi-source systems; this is surprising because the lexicalized systems have significantly longer sequences than the unlexicalized ones. Finally, it is interesting to compare the seq2seq and parse2seq baselines. Parse2seq outperforms  seq2seq by only a small amount compared to multi-source; thus, while adding syntax to NMT can be helpful, some ways of doing so are more effective than others.

Results
6 Analysis

Inference Without Parsed Sentences
The parse2seq and multi-source systems require parsed source data at inference time. However, the parser may fail on an input sentence. Therefore, we examine how well these systems do when given only unparsed source sentences at test time. Table 3 displays the results of these experiments. For the parse2seq baseline, we use only sequential (seq) data as input. For the lexicalized and unlexicalized multi-source systems, two options are considered: seq + seq uses identical sequential data as input to both encoders, while seq + null uses null input for the parsed encoder, where every source sentence is "( )".
The parse2seq system fails when given only sequential source data. On the other hand, both multi-source systems perform reasonably well without parsed data, although the BLEU scores are worse than multi-source with parsed data.

BLEU by Sentence Length
For models that use source-side linearized parses (multi-source and parse2seq), the source sequences are significantly longer than for the seq2seq baseline. Since NMT already performs relatively poorly on long sentences (Bahdanau et al., 2015), adding linearized source parses may exacerbate this issue. To detect whether this occurs, we calculate BLEU by sentence length.
We bucket the sentences in newstest2017 by source sentence length. We then compute BLEU scores for each bucket for the seq2seq and parse2seq baselines and the lexicalized multisource system. The results are in Figure 1.
In line with previous work on NMT on long sentences (Bahdanau et al., 2015;Li et al., 2017), we see a significant deterioration in BLEU for longer sentences for all systems. In particular, although the parse2seq model outperformed the seq2seq model overall, it does worse than seq2seq for sentences containing more than 30 words. This indicates that parse2seq performance does indeed suffer due to its long sequences. On the other hand, the multi-source system outperforms the seq2seq baseline for all sentence lengths and does particularly well for sentences with over 50 words. This may be because the multi-source system has both sequential and parsed input, so it can rely more on sequential input for very long sentences.

Conclusion
In this paper, we presented a multi-source method for effectively incorporating linearized parses of the source data into neural machine translation. This method, in which the parsed and sequential versions of the sentence were both taken as input during training and inference, resulted in gains of up to 1.5 BLEU on EN→DE translation. In addition, unlike parse2seq, the multi-source model translated reasonably well even when the source sentence was not parsed.
In the future, we will explore adding backtranslated (Sennrich et al., 2016a) or copied (Currey et al., 2017) target data to our multi-source system. The multi-source model does not require all training data to be parsed; thus, monolingual data can be used even if the parser is unreliable for the synthetic or copied source sentences.