Does String-Based Neural MT Learn Source Syntax?

We investigate whether a neural, encoder-decoder translation system learns syntactic information on the source side as a by-product of training. We propose two methods to detect whether the encoder has learned local and global source syntax. A fine-grained analysis of the syntactic structure learned by the encoder reveals which kinds of syntax are learned and which are missing.


Introduction
The sequence-to-sequence model (seq2seq) has been successfully applied to neural machine translation (NMT) (Sutskever et al., 2014; Cho et al., 2014) and can match or surpass the state of the art in MT. Non-neural machine translation systems consist chiefly of phrase-based systems (Koehn et al., 2003) and syntax-based systems (Galley et al., 2004; Galley et al., 2006; DeNeefe et al., 2007; Liu et al., 2011; Cowan et al., 2006), the latter of which add syntactic information to the source side (tree-to-string), the target side (string-to-tree), or both sides (tree-to-tree). Because the seq2seq model first encodes the source sentence into a high-dimensional vector and then decodes it into a target sentence, it is hard to understand and interpret what goes on inside such a procedure. Considering the evolution of non-neural translation systems, it is natural to ask: 1. Does the encoder learn syntactic information about the source sentence? 2. What kind of syntactic information is learned, and how much?
3. Is it useful to augment the encoder with additional syntactic information? In this work, we focus on the first two questions and propose two methods: • We create various syntactic labels for each source sentence and try to predict these labels with logistic regression, using the learned sentence encoding vectors (for sentence-level labels) or the learned word-by-word hidden vectors (for word-level labels). We find that the encoder captures both global and local syntactic information about the source sentence, and that different information tends to be stored at different layers. • We extract the whole constituency tree of the source sentence from the NMT encoding vectors using a retrained linearized-tree decoder. A deep analysis of these parse trees indicates that much syntactic information is learned, while various types of syntactic information are still missing.

Example
As a simple example, we train an English-French NMT system on 110M tokens of bilingual data (English side). We then take 10K separate English sentences and label their voice as active or passive. We use the learned NMT encoder to convert these sentences into 10K corresponding 1000-dimensional encoding vectors. We use 9000 sentences to train a logistic regression model to predict voice from the encoding cell states, and test on the other 1000 sentences. We achieve 92.8% accuracy. Even though the source sentence is compressed into a fixed-length vector, the NMT system has decided to store the voice of English sentences in an easily accessible way. When we carry out the same experiment on an English-English (auto-encoder) system, we find that English voice information is no longer easily accessed from the encoding vector. We can only predict it with 82.7% accuracy, no better than chance. Thus, in learning to reproduce input English sentences, the seq2seq model decides to use the fixed-length encoding vector for other purposes.
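The probing recipe in this example is simple enough to sketch end to end. The following toy reproduction (our own illustration, not the paper's code) trains a logistic-regression probe on synthetic 1000-dimensional "encoding vectors" in which one random direction plays the role of the voice signal; all data, sizes, and names here are stand-ins:

```python
import numpy as np

# Sketch of the probing setup: predict a binary syntactic label (e.g., voice)
# from fixed-length encoding vectors with logistic regression. The "encoding
# vectors" below are synthetic stand-ins for real NMT cell states.
rng = np.random.default_rng(0)

n_train, n_test, dim = 9000, 1000, 1000
true_direction = rng.normal(size=dim)  # pretend one direction carries voice

def make_data(n):
    X = rng.normal(size=(n, dim))
    y = (X @ true_direction > 0).astype(float)  # 0 = active, 1 = passive
    return X, y

X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

# Plain logistic regression trained by batch gradient descent.
w = np.zeros(dim)
lr = 0.1
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X_train @ w)))
    w -= lr * X_train.T @ (p - y_train) / n_train

accuracy = np.mean((X_test @ w > 0) == (y_test == 1))
print(f"probe accuracy: {accuracy:.3f}")
```

When the label is linearly readable from the vectors, as in this toy setup, the probe's test accuracy climbs well above chance; the paper's experiments apply the same recipe to real NMT cell states.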

Related work
Interpreting Recurrent Neural Networks. The most popular method for visualizing high-dimensional vectors, such as word embeddings, is to project them into two-dimensional space using t-SNE (van der Maaten and Hinton, 2008). Few works try to interpret recurrent neural networks in NLP. Karpathy et al. (2016) use a character-level LSTM language model as a test-bed and find several activation cells that track long-distance dependencies, such as line lengths and quotes; they also conduct an error analysis of the model's predictions. Li et al. (2016) explore the syntactic behavior of an RNN-based sentiment analyzer, including the compositionality of negation, intensification, and concessive clauses, by plotting a 60-dimensional heat map of hidden unit values; they also introduce a first-order-derivative-based method to measure each unit's contribution to the final decision. Verifying syntactic/semantic properties. Several works try to build good distributional representations of sentences or paragraphs (Socher et al., 2013; Kalchbrenner et al., 2014; Kim, 2014; Zhao et al., 2015; Le and Mikolov, 2014; Kiros et al., 2015). They implicitly verify the claimed syntactic/semantic properties of the learned representations by applying them to downstream classification tasks such as sentiment analysis, sentence classification, semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification, etc.
Novel contributions of our work include: • We locate a subset of activation cells that are responsible for certain syntactic labels, and we explore the concentration and layer distribution of different syntactic labels. • We extract whole parse trees from NMT encoding vectors in order to analyze syntactic properties directly and thoroughly. • Our methods are suitable for large-scale models; the models in this work are 2-layer, 1000-dimensional LSTM seq2seq models.

Datasets and models
We train two NMT models, English-French (E2F) and English-German (E2G). To determine whether these translation models' encoders learn to store syntactic information, and how much, we employ two benchmark models: • An upper-bound model, whose encoder learns quite a lot of syntactic information. For the upper bound, we train a neural parser that learns to "translate" an English sentence into its linearized constituency tree (E2P), following Vinyals et al. (2015). • A lower-bound model, whose encoder learns much less syntactic information. For the lower bound, we train two sentence auto-encoders: one translates an English sentence to itself (E2E), while the other translates a permuted English sentence to itself (PE2PE). We already had an indication above (Section 2) that a copying model does not necessarily need to remember a sentence's syntactic structure. Figure 1 shows sample inputs and outputs of the E2E, PE2PE, E2F, E2G, and E2P models.
We use English-French and English-German data from WMT2014 (Bojar et al., 2014). We take 4M English sentences from the English-German data to train E2E and PE2PE. For the neural parser (E2P), we construct the training corpus following the recipe of Vinyals et al. (2015). We collect 162K training sentences from publicly available treebanks, including Sections 0-22 of the Wall Street Journal Penn Treebank (Marcus et al., 1993), the corpus of Pradhan and Xue (2009), and the English Web Treebank (Petrov and McDonald, 2012). In addition to these gold treebanks, we take 4M English sentences from the English-German data and 4M English sentences from the English-French data, and we parse these 8M sentences with the Charniak-Johnson parser 1 (Charniak and Johnson, 2005). We call the resulting 8,162K pairs the CJ corpus. We use WSJ Section 22 as our development set and Section 23 as the test set, where we obtain an F1-score of 89.6, competitive with the previously published 90.5 (Table 4).

Model Architecture.
For all experiments, we use a two-layer encoder-decoder with long short-term memory (LSTM) units (Hochreiter and Schmidhuber, 1997). We use a minibatch size of 128, a hidden state size of 1000, and a dropout rate of 0.2. For the auto-encoders and translation models, we train for 8 epochs; the learning rate is initially set to 0.35 and starts to halve after 6 epochs. For the E2P model, we train for 15 epochs; the learning rate is initialized to 0.35 and starts to decay by a factor of 0.7 once the perplexity on the development set starts to increase. All gradients are rescaled when their global norm is larger than 5. All models are non-attentional, because we want the encoding vector to summarize the whole source sentence. Table 4 shows the settings of each model and reports the BLEU scores. 1 The CJ parser is available at https://github.com/BLLIP/bllipparser; we used the pretrained model "WSJ+Gigaword-v2".
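The optimization details above can be made concrete with a short sketch (our own illustration; the function names and the flat-list gradient representation are assumptions, not the authors' code):

```python
# Sketch of the training details described above: global-norm gradient
# rescaling and the two learning-rate schedules.

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients when their global L2 norm exceeds max_norm."""
    global_norm = sum(g * g for g in grads) ** 0.5
    if global_norm > max_norm:
        scale = max_norm / global_norm
        grads = [g * scale for g in grads]
    return grads

def nmt_lr(epoch, base=0.35, start_halving=6):
    """Auto-encoder/NMT schedule: halve the rate each epoch after epoch 6."""
    return base * (0.5 ** max(0, epoch - start_halving))

def e2p_lr(prev_lr, dev_ppl_increased, decay=0.7):
    """E2P schedule: decay by 0.7 once development perplexity increases."""
    return prev_lr * decay if dev_ppl_increased else prev_lr
```

For example, a gradient list with global norm 10 is rescaled by 0.5 so that its norm becomes exactly 5, while gradients already inside the threshold pass through unchanged.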

Experimental Setup
In this section, we test whether different seq2seq systems learn to encode syntactic information about the source (English) sentence.
With 1000 hidden states, it is impractical to investigate each unit one by one or to draw a heat map of the whole vector. Instead, we use the hidden states to predict syntactic labels of source sentences via logistic regression. For multi-class prediction, we use a one-vs-rest mechanism. Furthermore, to identify a subset of units responsible for certain syntactic labels, we use a recursive feature elimination (RFE) strategy: the logistic regression is first trained using all 1000 hidden states, after which we recursively prune those units whose weights have the smallest absolute values. We extract three sentence-level syntactic labels: 1. Voice: active or passive. 2. Tense: past or non-past. 3. TSS: the top-level syntactic sequence of the constituent tree; we use the 19 most frequent sequences ("NP-VP", "PP-NP-VP", etc.) and label the remainder as "Other". We also extract two word-level syntactic labels: 1. POS: the part-of-speech tag of each word. 2. SPC: the smallest phrase constituent above each word. The voice and tense labels are generated by rule-based systems operating on the constituent tree of the sentence. Figure 2 provides examples of our five syntactic labels. When predicting these syntactic labels from the corresponding cell states, we split the dataset into training and test sets. Table 4 shows statistics for each label.
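The RFE pruning loop can be sketched as follows (a toy illustration with synthetic data; the 1000-unit states, real labels, and exact pruning schedule of the paper are replaced by small stand-ins):

```python
import numpy as np

# Sketch of recursive feature elimination (RFE): train a logistic regression
# on all units, then repeatedly drop the units with the smallest absolute
# weights and retrain on the survivors.
rng = np.random.default_rng(1)
n, dim = 2000, 100
X = rng.normal(size=(n, dim))
# In this toy setup, only the first 5 "units" actually carry the label.
y = (X[:, :5].sum(axis=1) > 0).astype(float)

def train_logreg(X, y, steps=300, lr=0.5):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

active = np.arange(dim)          # indices of surviving units
while len(active) > 10:          # prune down to a top-10 subset
    w = train_logreg(X[:, active], y)
    keep = np.argsort(-np.abs(w))[: max(10, len(active) // 2)]
    active = active[np.sort(keep)]

print("surviving units:", active)
```

In this toy run, the informative units receive much larger weights than the noise units, so they survive every pruning round; the paper uses the same idea to find which of the 1000 cells "take charge of" a given syntactic label.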
For a source sentence s = [w_1, ..., w_i, ..., w_n], the two-layer encoder generates an array of cell vectors during encoding: c = [(c_{1,0}, c_{1,1}), ..., (c_{i,0}, c_{i,1}), ..., (c_{n,0}, c_{n,1})], where c_{i,0} and c_{i,1} are the lower- and upper-layer cell states after encoding word w_i. We extract a sentence-level syntactic label L_s and predict it using the final encoding cell states (c_{n,0}, c_{n,1}) that will be fed into the decoder. Similarly, for word-level syntactic labels L_w = [L_{w_1}, ..., L_{w_i}, ..., L_{w_n}], we predict each label L_{w_i} using the cell states (c_{i,0}, c_{i,1}) obtained immediately after encoding the word w_i.
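In code, the two feature-extraction choices look like this (a sketch with random stand-in activations; the array shapes are our assumptions based on the two-layer, 1000-dimensional models described above):

```python
import numpy as np

# Sketch: pick probe features out of a two-layer encoder's cell states.
# c has shape (n, 2, d): for source position i, c[i, 0] is the lower-layer
# cell state and c[i, 1] the upper-layer cell state. Values here are random
# stand-ins for real encoder activations.
n, d = 7, 1000                        # sentence length, hidden size
c = np.random.default_rng(2).normal(size=(n, 2, d))

# Sentence-level labels (voice, tense, TSS): use the final cell states,
# i.e., the ones handed to the decoder.
sentence_features = c[-1].reshape(-1)                 # vector of size 2d

# Word-level labels (POS, SPC) for word i: use the cell states taken
# immediately after the encoder consumes w_i.
word_features = [c[i].reshape(-1) for i in range(n)]  # n vectors of size 2d
```

Each feature vector then feeds the logistic-regression probe from the previous section.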

Result Analysis
Test-set prediction accuracy is shown in Figure 3. For voice and tense, the prediction accuracy of the two auto-encoders is almost the same as the majority-class accuracy, indicating that their encoders do not learn to record this information. By contrast, both the neural parser and the NMT systems achieve approximately 95% accuracy. When predicting the top-level syntactic sequence (TSS) of the whole sentence, the part-of-speech tags (POS), and the smallest phrase constituent (SPC) for each word, all five models achieve an accuracy higher than that of the majority class, but there is still a large gap between the accuracy of the NMT systems and the auto-encoders. These observations indicate that the NMT encoder learns significant sentence-level syntactic information: it can distinguish voice and tense of the source sentence, and it knows the sentence's structure to some extent. At the word level, the NMT encoder also tends to cluster together words that have similar POS and SPC labels.
Different syntactic information tends to be stored at different layers in the NMT models. For word-level syntactic labels (POS and SPC), the accuracy of the lower layer's cell states (C0) is higher than that of the upper layer's (C1); for sentence-level labels, especially tense, the accuracy of C1 is higher than that of C0. This suggests that local features are largely preserved in the lower layer, whereas more global, abstract information tends to be stored in the upper layer. For two-class labels, such as voice and tense, the accuracy gap between all units and the top-10 units is small. For the other labels, where we use a one-versus-rest strategy, the gap between all units and the top-10 units is large. However, when predicting POS, the gap for the neural parser (E2P) on the lower layer (C0) is much smaller. This comparison indicates that a small subset of units explicitly takes charge of POS tags in the neural parser, whereas in NMT the POS information is more distributed and implicit.
There are no large differences between the encoders of E2F and E2G regarding syntactic information. We now turn to whether NMT systems capture deeper syntactic structure as a by-product of learning to translate from English to another language. We do this by predicting full parse trees from the information stored in encoding vectors. Since this is a structured prediction problem, we can no longer use logistic regression. Instead, we extract a constituency parse tree from the encoding vector of a model E2X by building a new neural parser E2X2P with the following steps: 1. Take the E2X encoder as the encoder of the new model E2X2P. 2. Initialize the E2X2P decoder parameters with a uniform distribution. 3. Fine-tune the E2X2P decoder (while keeping its encoder parameters fixed), using the CJ corpus, the same corpus used to train E2P. Figure 4 shows how we construct model E2F2P from model E2F. For fine-tuning, we use the same dropout rate and learning-rate update configuration as for E2P, described in Section 4.
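The freezing in step 3 amounts to masking out updates for the encoder's parameter group. A minimal sketch (our own toy, not the actual seq2seq trainer; the shapes, parameter grouping, and update rule are assumptions):

```python
import numpy as np

# Toy illustration of step 3: update only the decoder parameters while the
# encoder parameters (copied from E2X) stay frozen.
rng = np.random.default_rng(3)
params = {
    "encoder": rng.normal(size=(4, 4)),           # taken from E2X, frozen
    "decoder": rng.uniform(-0.08, 0.08, (4, 4)),  # freshly initialized
}
frozen = {"encoder"}
encoder_before = params["encoder"].copy()
decoder_before = params["decoder"].copy()

def sgd_step(params, grads, lr=0.1):
    for name, g in grads.items():
        if name in frozen:
            continue                  # skip frozen parameter groups
        params[name] -= lr * g

for _ in range(5):                    # a few fake fine-tuning steps
    grads = {name: rng.normal(size=p.shape) for name, p in params.items()}
    sgd_step(params, grads)
```

After training, the decoder weights have moved while the encoder weights are bit-for-bit identical to their initial values, which is exactly the property that lets the retrained decoder act as a read-out of whatever the frozen encoder already represents.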

Evaluation
We train four new neural parsers using the encoders of the two auto-encoders and the two NMT models, respectively. We use three tools to evaluate and analyze: 1. The EVALB tool 3, to calculate the labeled bracketing F1-score. 2. The zxx package 4, to calculate tree edit distance (TED) (Zhang and Shasha, 1989). 3. The Berkeley Parser Analyser 5 (Kummerfeld et al., 2012), to analyze parsing error types. The linearized parse trees generated by these neural parsers are not always well-formed. They can be split into the following categories: • Malformed trees: the linearized sequence cannot be converted back into a tree, due to missing or mismatched brackets. • Well-formed trees: the sequence can be converted back into a tree. Tree edit distance can be calculated for this category.
-Wrong-length trees: the number of tree leaves does not match the number of source-sentence tokens. -Correct-length trees: the number of tree leaves matches the number of source-sentence tokens. Before moving to the results, we emphasize the following points. First, compared to the linear classifier used in Section 5, the retrained decoder that predicts a linearized parse tree is a highly non-linear method, and parsing performance will increase due to such non-linearity. We therefore do not draw conclusions from absolute performance values alone, but from comparisons against the designed baseline models: an improvement over the lower-bound models indicates that the encoder learns syntactic information, whereas a decline from the upper-bound model shows that the encoder loses certain syntactic information.
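The three-way split above can be checked mechanically. The sketch below (our own simplification; real linearized trees may use a different tokenization) classifies a bracketed sequence as malformed, wrong-length, or correct-length:

```python
# Classify a linearized parse tree, written in the usual bracketed format,
# against its source sentence. Tokenization here is a simplification:
# whitespace plus brackets.

def check_linearized_tree(linearized, source_tokens):
    """Return 'malformed', 'wrong-length', or 'correct-length'."""
    depth, leaves = 0, 0
    tokens = linearized.replace("(", " ( ").replace(")", " ) ").split()
    for i, tok in enumerate(tokens):
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth < 0:
                return "malformed"        # unmatched closing bracket
        elif i > 0 and tokens[i - 1] != "(":
            leaves += 1                   # token not preceding by '(' = leaf
    if depth != 0:
        return "malformed"                # unmatched opening bracket
    if leaves != len(source_tokens):
        return "wrong-length"
    return "correct-length"
```

For instance, "(S (NP (DT the) (NN cat)) (VP (VBZ sleeps)))" has balanced brackets and three leaves, so it is correct-length for a three-token source sentence and wrong-length for any other.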
Second, the NMT encoder maps a plain English sentence into a high-dimensional vector, and our goal is to test whether the projected vectors form a more syntactically-related manifold in the high-dimensional space. In practice, one could also predict parse structure for E2E in two steps: (1) use E2E's decoder to recover the original English sentence, and (2) parse that sentence with the CJ parser. But in this way, the manifold structure in the high-dimensional space is destroyed during the mapping. Table 5 reports perplexity on training and development sets, the labeled F1-score on WSJ Section 23, and the tree edit distance (TED) of various systems.

Result Analysis
Tree edit distance (TED) calculates the minimum-cost sequence of node edit operations (delete, insert, rename) between a gold tree and a test tree. When decoding with beam size 10, the four new neural parsers generate well-formed trees for almost all of the 2416 sentences in WSJ Section 23. This makes TED a robust metric for evaluating the overall performance of each parser; the NMT-based parsers achieve approximately 17 TED on average. Among the well-formed trees, around half have a mismatch between the number of leaves and the number of tokens in the source sentence. The labeled F1-score is reported over the remaining sentences only. Though biased, this still reflects the overall performance: we achieve around 80 F1 with NMT encoding vectors, much higher than with the E2E and PE2PE encoding vectors (below 60).
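For intuition about what TED measures, here is a naive memoized ordered-tree edit distance with unit-cost delete, insert, and rename operations (our own sketch; the zxx package implements the far more efficient Zhang-Shasha algorithm, and the tuple encoding of trees here is an assumption):

```python
from functools import lru_cache

# Trees are (label, children) tuples; forests are tuples of trees.

def size(forest):
    """Total number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_dist(f1, f2):
    if not f1 and not f2:
        return 0
    if not f1:
        return size(f2)                       # insert everything in f2
    if not f2:
        return size(f1)                       # delete everything in f1
    (l1, c1), (l2, c2) = f1[-1], f2[-1]
    return min(
        forest_dist(f1[:-1] + c1, f2) + 1,    # delete root of last tree in f1
        forest_dist(f1, f2[:-1] + c2) + 1,    # insert root of last tree in f2
        forest_dist(f1[:-1], f2[:-1])         # match the two last trees:
        + forest_dist(c1, c2)                 #   align their child forests
        + (l1 != l2),                         #   rename if labels differ
    )

def tree_edit_distance(t1, t2):
    return forest_dist((t1,), (t2,))
```

Identical trees have distance 0, relabeling a single node costs 1, and deleting a leaf costs 1; summing such costs over a whole parse is what the per-sentence TED numbers report.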

Fine-grained Analysis
Besides answering whether the NMT encoders learn syntactic information, it is interesting to know what kind of syntactic information is extracted and what is not.
As Table 5 shows, different parsers generate different numbers of trees that are acceptable to the Treebank evaluation software ("EVALB-trees"), having the correct number of leaves and so forth. We select the intersection of the different models' EVALB-trees, obtaining a total of 569 shared EVALB-trees. The average length of the corresponding sentences is 12.54, and the longest sentence has 40 tokens; the average length of all 2416 sentences in WSJ Section 23 is 23.46, and the longest is 67. As we do not apply an attention model in these neural parsers, it is difficult to handle longer sentences. While the intersection set may be biased, it allows us to explore how different encoders decide to capture the syntax of short sentences. Table 6 shows the labeled F1-scores and part-of-speech tagging accuracy on the intersection set. Extraction from the NMT encoders achieves around 86 percent tagging accuracy, far beyond that of the auto-encoder-based parsers.

Besides tagging accuracy, we also utilize the Berkeley Parser Analyzer (Kummerfeld et al., 2012) to gain a more linguistic understanding of the predicted parses. Like TED, the Berkeley Parser Analyzer is based on tree transformation: it repairs the parse tree via a sequence of sub-tree movements, node insertions, and deletions. During this process, multiple bracket errors are fixed, and each group of node errors is associated with a linguistically meaningful error type.
The first column of Figure 5 shows the average number of bracket errors per sentence for model E2P on the intersection set. For the other models, we report the ratio of each model's errors to those of model E2P. Kummerfeld et al. (2013) and Kummerfeld et al. (2012) describe the different error types. The NMT-based predicted parses introduce around twice as many bracket errors as E2P for the first 10 error types, whereas for "Sense Confusion" they introduce more than 16 times as many. Sense Confusion is the case where the head word of a phrase receives the wrong POS, resulting in an attachment error; Figure 6 shows an example. Even though we can predict 86 percent of parts of speech correctly from NMT encoding vectors, the remaining 14 percent introduce quite a few attachment errors. NMT sentence vectors encode a lot of syntax, but they still cannot grasp these subtle details.

Conclusion
We investigate whether NMT systems learn source-language syntax as a by-product of training on string pairs. We find that both local and global syntactic information about source sentences is captured by the encoder, and that different types of syntax are stored in different layers, with different degrees of concentration. We also carry out a fine-grained analysis of the constituency trees extracted from the encoder, highlighting what syntactic information is still missing.