Inducing Grammars with and for Neural Machine Translation

Machine translation systems require semantic knowledge and grammatical understanding. Neural machine translation (NMT) systems often assume this information is captured by an attention mechanism and a decoder that ensures fluency. Recent work has shown that incorporating explicit syntax alleviates the burden of modeling both types of knowledge. However, requiring parses is expensive and does not explore the question of what syntax a model needs during translation. To address both of these issues we introduce a model that simultaneously translates while inducing dependency trees. In this way, we leverage the benefits of structure while investigating what syntax NMT must induce to maximize performance. We show that our dependency trees (1) are language pair dependent and (2) improve translation quality.


Motivation
Language has syntactic structure, and translation models need to understand grammatical dependencies to resolve the semantics of a sentence and preserve agreement (e.g., number and gender). Many current approaches to MT have been able to avoid explicitly providing structural information by relying on advances in sequence-to-sequence (seq2seq) models. The most famous advances include attention mechanisms (Bahdanau et al., 2015) and gating in Long Short-Term Memory (LSTM) cells (Hochreiter and Schmidhuber, 1997).
In this work we aim to benefit from syntactic structure, without providing it to the model, and to disentangle the semantic and syntactic components of translation, by introducing a gating mechanism which controls when syntax should be used.
Figure 1: Our model aims to capture both syntactic (verb ordered → subj/obj boy, coffee) and alignment (noun girls → determiner the) attention. Example sentence: "The boy sitting next to the girls ordered a coffee".
Consider the process of translating the sentence "The boy sitting next to the girls ordered a coffee." (Figure 1) from English to German. Translating ordered into German requires knowledge of its subject boy to correctly predict the verb's number: bestellte (singular) instead of bestellten (plural). This is a case where syntactic agreement requires long-distance information. On the other hand, next can be translated in isolation. The model should uncover these relationships and decide when and which aspects of syntax are necessary. While in principle decoders can utilize previously predicted words (e.g., the translation of boy) to reason about subject-verb agreement, in practice LSTMs still struggle with long-distance dependencies. Moreover, Belinkov et al. (2017) showed that using attention reduces the decoder's capacity to learn target side syntax.
In addition to demonstrating improvements in translation quality, we are also interested in analyzing the predicted dependency trees discovered by our models. Recent work has begun analyzing task-specific latent trees (Williams et al., 2018). We present the first results on learning latent trees with a joint syntactic-semantic objective. We do this in the service of machine translation, which inherently requires access to both aspects of a sentence. Further, our results indicate that language pairs with rich morphology require and therefore induce more complex syntactic structure. Our use of a structured self attention encoder (§4) that predicts a non-projective dependency tree over the source sentence provides a soft structured representation of the source sentence that can then be transferred to the decoder, which alleviates the burden of capturing target syntax on the target side.
We will show that the quality of the induced trees depends on the choice of the target language (§7). Moreover, the gating mechanism will allow us to examine which contexts require source side syntax.
In summary, in this work:
• We propose a new NMT model that discovers latent structures for encoding and when to use them, while achieving significant improvements in BLEU scores over a strong baseline.
• We perform an in-depth analysis of the induced structures and investigate where the target decoder decides syntax is required.

Related Work
Recent work has begun investigating what syntax seq2seq models capture (Linzen et al., 2016), but this is evaluated via downstream tasks designed to test the model's abilities and not its representation. Simultaneously, recent research in neural machine translation (NMT) has shown the benefit of modeling syntax explicitly (Aharoni and Goldberg, 2017; Bastings et al., 2017; Li et al., 2017; Eriguchi et al., 2017) rather than assuming the model will automatically discover and encode it. Bradbury and Socher (2017) presented an encoder-decoder architecture based on RNNG (Dyer et al., 2016). However, their preliminary work was not scaled to a large MT dataset and omits analysis of the induced trees.
Unlike the previous work on source side latent graph parsing (Hashimoto and Tsuruoka, 2017), our structured self attention encoder allows us to extract a dependency tree in a principled manner. Therefore, learning the internal representation of our model is related to work done in unsupervised grammar induction (Klein and Manning, 2004; Spitkovsky et al., 2011), except that by focusing on translation we require both syntactic and semantic knowledge.
In this work, we attempt to contribute both to modeling syntax and to investigating a more interpretable interface for testing the syntactic content of a seq2seq model's internal representation.

Neural Machine Translation
Given a training pair of source and target sentences (x, y) of length n and m respectively, neural machine translation is a conditional probabilistic model p(y | x) implemented using neural networks:
$$\log p(y \mid x; \theta) = \sum_{j=1}^{m} \log p(y_j \mid y_{<j}, x; \theta) \qquad (1)$$
where θ denotes the model's parameters. We will omit the parameters θ herein for readability.
The NMT system used in this work is a seq2seq model that consists of a bidirectional LSTM encoder and an LSTM decoder coupled with an attention mechanism (Bahdanau et al., 2015; Luong et al., 2015). Our system is based on a PyTorch implementation of OpenNMT (Klein et al., 2017).
Here we use $S = [s_1; \dots; s_n] \in \mathbb{R}^{d \times n}$ as the concatenation of the encoder annotations $\{s_i\}$. The decoder is composed of stacked LSTMs with input-feeding. Specifically, the inputs of the decoder at time step $t$ are a concatenation of the embedding of the previously generated word $y_{t-1}$ and a vector $u_{t-1}$:
$$u_{t-1} = g(h_{t-1}, c_{t-1}) \qquad (2)$$
where $g$ is a one-layer feed-forward network, $h_{t-1}$ is the output of the LSTM decoder, and $c_{t-1}$ is a context vector computed by an attention mechanism
$$\alpha_{t-1} = \mathrm{softmax}(h_{t-1}^{\top} W_a S) \qquad (3)$$
$$c_{t-1} = S \alpha_{t-1}^{\top} \qquad (4)$$
where $W_a \in \mathbb{R}^{d \times d}$ is a trainable parameter. Finally, a single-layer feed-forward network $f$ takes $u_t$ as input and returns a multinomial distribution over all the target words: $y_t \sim f(u_t)$.
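The attention step of the decoder can be illustrated with a short NumPy sketch. It assumes bilinear scoring between the decoder output and the encoder annotations followed by a softmax; the function names, toy dimensions, and random inputs are illustrative assumptions and are not taken from the OpenNMT code base.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_context(h_prev, W_a, S):
    """Return attention weights alpha (n,) and context vector c (d,).

    h_prev: decoder output h_{t-1}, shape (d,)
    W_a:    trainable bilinear weight, shape (d, d)
    S:      encoder annotations, one column per source word, shape (d, n)
    """
    scores = h_prev @ W_a @ S   # one score per source word, shape (n,)
    alpha = softmax(scores)     # attention weights, as in Eq. (3)
    c = S @ alpha               # weighted sum of the columns of S, as in Eq. (4)
    return alpha, c

# Toy sizes and random values, chosen only for illustration.
rng = np.random.default_rng(0)
d, n = 4, 6
alpha, c = attention_context(rng.normal(size=d),
                             rng.normal(size=(d, d)),
                             rng.normal(size=(d, n)))
```

The weights form a distribution over the n source positions, and the context vector lives in the same d-dimensional space as the annotations.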

Syntactic Attention Model
We propose a syntactic attention model (Figure 2) that differs from standard NMT in two crucial aspects. First, our encoder outputs two sets of annotations: content annotations S and syntactic annotations M (Figure 2a). The content annotations are the outputs of a standard BiLSTM while the syntactic annotations are produced by a head word selection layer (§4.1). The syntactic annotations M capture syntactic dependencies amongst the source words and enable syntactic transfer from the source to the target. Second, we incorporate the source side syntax into our model by modifying the standard attention (from target to source) in NMT such that it attends to both S and M through a shared attention layer. The shared attention layer biases our model toward capturing source side dependency. It produces a dependency context d (Figure 2c) in addition to the standard context vector c (Figure 2b) at each time step. Motivated by the example in Figure 1, where some words can be translated without resolving their syntactic roles in the source sentence, we include a gating mechanism that allows the decoder to decide the amount of syntax needed when it generates the next word. Next, we describe the head word selection layer and how source side syntax is incorporated into our model.

Head Word Selection
The head word selection layer learns to select a soft head word for each source word. This layer transforms $S$ into a matrix $M$ that encodes the implicit dependency structure of $x$ using structured self attention. First we apply three trainable weight matrices $W_q, W_k, W_v \in \mathbb{R}^{d \times d}$ to map $S$ to query, key, and value matrices $S_q = W_q S$, $S_k = W_k S$, $S_v = W_v S \in \mathbb{R}^{d \times n}$ respectively. Then we compute the structured self attention probabilities $\beta \in \mathbb{R}^{n \times n}$ via a function sattn, and use them to weight the values:
$$\beta = \mathrm{sattn}\left(S_q^{\top} S_k / \sqrt{d}\right), \qquad M = S_v \beta$$
Here $n$ is the length of the source sentence, so $\beta$ captures all pairwise word dependencies. Each cell $\beta_{i,j}$ of the attention matrix $\beta$ is the posterior probability $p(x_i = \mathrm{head}(x_j) \mid x)$. The structured self attention function sattn is inspired by the work of Kim et al. (2017) but differs in two important ways. First, we model non-projective dependency trees. Second, we utilize Kirchhoff's Matrix-Tree Theorem (Tutte, 1984) instead of the sum-product algorithm presented in Kim et al. (2017) for fast evaluation of the attention probabilities. We note that Liu and Lapata (2018) were the first to propose using the Matrix-Tree Theorem for evaluating the marginals in end-to-end training of neural networks. Their work, however, focuses on the tasks of natural language inference (Bowman et al., 2015) and document classification, which arguably require less syntactic knowledge than machine translation.
Additionally, we will evaluate our structured self attention on datasets that are up to 20 times larger than the datasets studied in previous work.
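Let $z \in \{0, 1\}^{n \times n}$ be an adjacency matrix encoding a source sentence's dependency tree. Let $\phi = S_q^{\top} S_k / \sqrt{d} \in \mathbb{R}^{n \times n}$ be a scoring matrix such that cell $\phi_{i,j}$ scores how likely word $x_i$ is to be the head of word $x_j$. The probability of a dependency tree $z$ is therefore given by
$$p(z \mid x; \phi) = \frac{\exp\left(\sum_{i,j} z_{i,j} \phi_{i,j}\right)}{Z(\phi)}$$
where $Z(\phi)$ is the partition function.
In the head selection model, we are interested in the marginal
$$p(z_{i,j} = 1 \mid x; \phi) = \sum_{z \,:\, z_{i,j} = 1} p(z \mid x; \phi).$$
We use the framework presented by Koo et al. (2007) to compute the marginals of non-projective dependency structures. Koo et al. (2007) use Kirchhoff's Matrix-Tree Theorem (Tutte, 1984) to compute $p(z_{i,j} = 1 \mid x; \phi)$ by first defining the Laplacian matrix $L \in \mathbb{R}^{n \times n}$ as follows:
$$L_{i,j}(\phi) = \begin{cases} \sum_{k=1, k \neq j}^{n} \exp(\phi_{k,j}) & \text{if } i = j \\ -\exp(\phi_{i,j}) & \text{otherwise.} \end{cases}$$
Now we construct a matrix $\hat{L}$ that accounts for root selection:
$$\hat{L}_{i,j}(\phi) = \begin{cases} \exp(\phi_{j,j}) & \text{if } i = 1 \\ L_{i,j}(\phi) & \text{if } i > 1. \end{cases}$$
The marginals in $\beta$ are then
$$\beta_{i,j} = (1 - \delta_{1,j}) \exp(\phi_{i,j}) \left[\hat{L}^{-1}(\phi)\right]_{j,j} - (1 - \delta_{i,1}) \exp(\phi_{i,j}) \left[\hat{L}^{-1}(\phi)\right]_{j,i}$$
where $\delta_{i,j}$ is the Kronecker delta. For the root node, the marginals are given by
$$\beta_{k,k} = \exp(\phi_{k,k}) \left[\hat{L}^{-1}(\phi)\right]_{k,1}.$$
The computation of the marginals is fully differentiable, thus we can train the model in an end-to-end fashion by maximizing the conditional likelihood of the translation.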
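The marginal computation can be sketched in NumPy following the single-root construction of Koo et al. (2007). This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes that the diagonal entry phi[j, j] scores word j as the root, it uses a plain matrix inverse rather than anything tuned for batched GPU training, and the function name and toy scores are hypothetical.

```python
import numpy as np

def tree_marginals(phi):
    """Head marginals for single-root non-projective trees (Koo et al., 2007).

    phi[i, j] scores word i as the head of word j; phi[j, j] is ASSUMED to
    score word j as the root. Returns beta with beta[i, j] = p(i heads j)
    off the diagonal and beta[j, j] = p(j is the root).
    """
    n = phi.shape[0]
    A = np.exp(phi)
    r = np.diag(A).copy()            # root potentials exp(phi[j, j])
    np.fill_diagonal(A, 0.0)         # a word cannot head itself
    L = np.diag(A.sum(axis=0)) - A   # Laplacian: column sums on the diagonal
    L_hat = L.copy()
    L_hat[0, :] = r                  # first row replaced by root potentials
    inv = np.linalg.inv(L_hat)

    not_first = np.ones(n)
    not_first[0] = 0.0               # implements the (1 - delta) factors
    term1 = A * (not_first * np.diag(inv))[None, :]   # exp(phi_ij) * inv[j, j]
    term2 = A * inv.T * not_first[:, None]            # exp(phi_ij) * inv[j, i]
    beta = term1 - term2
    beta[np.arange(n), np.arange(n)] = r * inv[:, 0]  # root marginals
    return beta

rng = np.random.default_rng(0)
phi = rng.normal(size=(5, 5))        # toy scores for a 5-word sentence
beta = tree_marginals(phi)
```

Two properties are useful sanity checks: every column of beta sums to one (each word has exactly one head or is the root), and the diagonal sums to one (exactly one word is the root).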

Incorporating Syntactic Context
Having set the annotations $S$ and $M$ with the encoder, the LSTM decoder can utilize this information at every generation step by means of attention. At time step $t$, we first compute standard attention weights $\alpha_{t-1}$ and context vector $c_{t-1}$ as in Equations (3) and (4). We then compute a weighted syntactic vector:
$$d_{t-1} = M \alpha_{t-1}^{\top}.$$
Note that the syntactic vector $d_{t-1}$ and the context vector $c_{t-1}$ share the same attention weights $\alpha_{t-1}$. The main idea behind sharing attention weights (Figure 2c) is that if the model attends to a particular source word $x_i$ when generating the next target word, we also want the model to attend to the head word of $x_i$. We share the attention weights $\alpha_{t-1}$ because we expect that, if the model picks a source word $x_i$ to translate with the highest probability $\alpha_{t-1}[i]$, the contribution of $x_i$'s head in the syntactic vector $d_{t-1}$ should also be highest.
Figure 3 shows the latent tree learned by our translation objective for the sentence "The boy sitting next to the girls ordered a coffee". Unlike the gold tree provided in Figure 1, the model decided that "the boy" is the head of "ordered". This is common in our model because the BiLSTM context means that a given word's representation is actually a summary of its local context/constituent.
It is not always useful or necessary to access the syntactic context $d_{t-1}$ at every time step $t$. Ideally, we should let the model decide whether it needs to use this information or not. For example, the model might decide to only use syntax when it needs to resolve long distance dependencies on the source side. To control the amount of source side syntactic information, we introduce a gating mechanism:
$$\hat{d}_{t-1} = d_{t-1} \odot \sigma(W_g h_{t-1}).$$
The vector $u_{t-1}$ from Eq. (2) now becomes
$$u_{t-1} = g(h_{t-1}, c_{t-1}, \hat{d}_{t-1}).$$
Another approach to incorporating syntactic annotations $M$ in the decoder is to use a separate attention layer to compute the syntactic vector $d_{t-1}$ at time step $t$:
$$\gamma_{t-1} = \mathrm{softmax}(h_{t-1}^{\top} W_m M), \qquad d_{t-1} = M \gamma_{t-1}^{\top}.$$
We will provide a comparison to this approach in our results.
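The shared-attention syntactic vector and its gate can be sketched as follows. The exact parameterization of the gate is an assumption here: the sketch uses an elementwise sigmoid of a linear map of the decoder output, and the names `syntactic_context` and `W_g` as well as the toy inputs are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def syntactic_context(alpha, M, h_prev, W_g):
    """Return the gated syntactic vector and the gate activations.

    alpha:  attention weights shared with the content context, shape (n,)
    M:      syntactic annotations, one column per source word, shape (d, n)
    h_prev: decoder output h_{t-1}, shape (d,)
    W_g:    gate weights, shape (d, d) (ASSUMED form of the gate)
    """
    d_vec = M @ alpha               # same weights alpha as the content context
    gate = sigmoid(W_g @ h_prev)    # one value in (0, 1) per dimension
    return gate * d_vec, gate

# Toy sizes and random values, chosen only for illustration.
rng = np.random.default_rng(0)
dim, n = 4, 6
alpha = np.full(n, 1.0 / n)         # uniform attention as a placeholder
M = rng.normal(size=(dim, n))
d_gated, gate = syntactic_context(alpha, M,
                                  rng.normal(size=dim),
                                  rng.normal(size=(dim, dim)))
z = np.linalg.norm(gate)            # summary of how open the gate is
```

Because every gate value lies strictly between 0 and 1, the gated vector is never larger in magnitude than the ungated one, and the norm of the gate gives a single number summarizing how much syntactic information is let through at a time step.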

Hard Attention over Tree Structures
Finally, to simulate the scenario where the model has access to a dependency tree given by an external parser, we report results with hard attention. Forcing the model to make hard decisions during training mirrors the extraction and conditioning on a dependency tree (§7.1). We expect this technique will improve the performance on grammar induction, despite making translation lossy. A similar observation has been reported by Hashimoto and Tsuruoka (2017), who showed that translation performance degraded below their baseline when they provided dependency trees to the encoder.
Recall that the marginal $\beta_{i,j}$ gives us the probability that word $x_i$ is the head of word $x_j$. We convert these soft weights to hard ones $\bar{\beta}$ by
$$\bar{\beta}_{k,j} = \begin{cases} 1 & \text{if } k = \arg\max_i \beta_{i,j} \\ 0 & \text{otherwise.} \end{cases} \qquad (15)$$
We train this model using the straight-through estimator (Bengio et al., 2013). In this setup, each word has a parent but there is no guarantee that the structure given by hard attention will result in a tree (i.e., it may contain cycles). A more principled way to enforce a tree structure is to decode the best tree $T$ using the maximum spanning tree algorithm (Chu and Liu, 1965; Edmonds, 1967) and to set $\bar{\beta}_{k,j} = 1$ if the edge $(x_k \rightarrow x_j) \in T$. Maximum spanning tree decoding can be prohibitively slow as the Chu-Liu-Edmonds algorithm is not GPU friendly. We therefore greedily pick a parent word for each word $x_j$ in the sentence using Eq. (15). This is a principled simplification, as greedily assigning a parent to each word is the first step of the Chu-Liu-Edmonds algorithm.
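The greedy head selection, and the reason it may fail to produce a tree, can be illustrated with a small sketch. It covers only the forward computation; the straight-through gradient is not shown. The convention that a word choosing itself is the root, the function names, and the example matrices are assumptions made for illustration.

```python
import numpy as np

def greedy_heads(beta):
    """heads[j] = argmax_i beta[i, j]; heads[j] == j is read as 'j is the root'."""
    return np.argmax(beta, axis=0)

def one_hot_heads(beta):
    """Hard attention matrix with a single 1 per column."""
    n = beta.shape[0]
    hard = np.zeros_like(beta)
    hard[greedy_heads(beta), np.arange(n)] = 1.0
    return hard

def has_cycle(heads):
    """True if following head links from some word never reaches a root."""
    n = len(heads)
    for start in range(n):
        node, steps = start, 0
        while heads[node] != node:
            node = heads[node]
            steps += 1
            if steps > n:          # an acyclic path has fewer than n links
                return True
    return False

# Words 0 and 1 pick each other as head: greedy selection yields a cycle.
beta_cyclic = np.array([[0.1, 0.8, 0.1],
                        [0.7, 0.1, 0.2],
                        [0.2, 0.1, 0.7]])
# Word 0 is the root, 0 heads 1, and 1 heads 2: a valid chain.
beta_tree = np.array([[0.8, 0.6, 0.1],
                      [0.1, 0.3, 0.7],
                      [0.1, 0.1, 0.2]])
```

In the first example the greedy choice is locally optimal for each word yet globally inconsistent, which is the case a full maximum spanning tree decoder would repair by contracting the cycle.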

Experiments
Next we will discuss our experimental setup and report results for English↔German (En↔De), English↔Russian (En↔Ru), and Russian→Arabic (Ru→Ar) translation models.

Data
We use the WMT17 (Bojar et al., 2017) data in our experiments. The language pairs were chosen to compare results across and between morphologically rich and poor languages. This will prove particularly interesting in our grammar induction results, where different pairs must preserve different amounts of syntactic agreement information. We use BPE (Sennrich et al., 2016) with 32,000 merge operations. We run BPE separately for each language instead of on the concatenation of the source and target languages.

Baselines
Our baseline is an NMT model with input-feeding (§3). As we will be making several modifications to the basic architecture in our proposed structured self attention NMT (SA-NMT), we will verify each choice in our architecture design empirically. First, we validate the structured self attention module by comparing it to a self-attention module (Lin et al., 2017; Vaswani et al., 2017). Self attention computes attention weights $\beta$ simply as $\beta = \mathrm{softmax}(\phi)$. Since self-attention does not assume any hierarchical structure over the source sentence, we refer to it as flat-attention NMT (FA-NMT). Second, we validate the benefit of using two sets of annotations in the encoder. We combine the hidden states of the encoder $h$ with syntactic context $d$ to obtain a single set of annotations using the following equation:
$$\bar{s}_i = s_i + \sigma(W_g s_i) \odot d_i$$
Here we first down-weight the syntactic context $d_i$ before adding it to $s_i$. The sigmoid function $\sigma(W_g s_i)$ decides the weight of the head word of $x_i$ based on whether translating $x_i$ needs additional dependency information. We refer to this baseline as SA-NMT-1set. Note that in this baseline, there is only one attention layer from the target to the source $\bar{S} = \{\bar{s}_i\}_{i=1}^{n}$. In all the models, we share the weights of target word embeddings and the output layer as suggested by Inan et al. (2017) and Press and Wolf (2017).
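The two ablations can be sketched in NumPy. The sketch assumes that the softmax of the flat-attention baseline is taken over candidate heads, so that each column of the result is a distribution, and it uses the content annotations directly as values for brevity; both choices, along with all names and toy inputs, are assumptions for illustration.

```python
import numpy as np

def flat_attention(phi):
    """Column-wise softmax: column j is a distribution over heads of word j."""
    e = np.exp(phi - phi.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def merge_annotations(S, D, W_g):
    """SA-NMT-1set style merge: s_bar_i = s_i + sigmoid(W_g s_i) * d_i."""
    gate = 1.0 / (1.0 + np.exp(-(W_g @ S)))   # one gate vector per source word
    return S + gate * D

# Toy sizes and random values, chosen only for illustration.
rng = np.random.default_rng(0)
d, n = 4, 5
S = rng.normal(size=(d, n))
phi = S.T @ S / np.sqrt(d)          # toy scores; W_q and W_k omitted here
beta_flat = flat_attention(phi)
D = S @ beta_flat                   # soft head representation for each word
S_bar = merge_annotations(S, D, rng.normal(size=(d, d)))
```

Unlike the structured marginals, nothing in the column-wise softmax prevents two words from selecting each other as heads or every word from selecting a non-root head, which is what makes this baseline flat.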

Hyper-parameters and Training
For all the models, we set the word embedding size to 1024, the number of LSTM layers to 2, and the dropout rate to 0.3. Parameters are initialized uniformly in (−0.04, 0.04). We use the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 0.001. We evaluate our models on development data every 10,000 updates for De-En and Ru→Ar, and 5,000 updates for Ru-En. If the validation perplexity increases, we decay the learning rate by 0.5. We stop training after decaying the learning rate five times as suggested by Denkowski and Neubig (2017). The mini-batch size is 64 in Ru→Ar experiments and 32 in the rest. Finally, we report BLEU scores computed using the standard multi-bleu.perl script.
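The learning rate schedule described above can be written as a small helper. The sketch assumes that an increase is measured against the previous validation perplexity; the function name and the perplexity values are hypothetical and are not results from the paper.

```python
def step_schedule(lr, prev_ppl, ppl, n_decays, factor=0.5, max_decays=5):
    """Apply one validation check; return (lr, n_decays, stop)."""
    if prev_ppl is not None and ppl > prev_ppl:
        lr *= factor            # halve the learning rate on a perplexity increase
        n_decays += 1
    return lr, n_decays, n_decays >= max_decays

# Hypothetical validation perplexities, one per evaluation point.
val_ppls = [12.0, 10.5, 10.7, 10.1, 10.3, 10.4, 10.2, 10.6, 10.9]

lr, n_decays, prev, stop = 0.001, 0, None, False
for ppl in val_ppls:
    lr, n_decays, stop = step_schedule(lr, prev, ppl, n_decays)
    prev = ppl
    if stop:                    # fifth decay reached: end training
        break
```

With these illustrative values the perplexity rises five times, so training stops at the last evaluation with a learning rate of 0.001 × 0.5⁵.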
In our experiments, the SA-NMT models run at about half the speed of the baseline NMT, as measured by the number of target words generated per second. Table 2 shows the BLEU scores in our experiments. We test statistical significance using bootstrap resampling (Riezler and Maxwell, 2005). Statistical significance is marked as † p < 0.05 and ‡ p < 0.01 when compared against the baselines. Additionally, we report statistical significance at p < 0.05 and p < 0.01 when comparing against the FA-NMT models that have two separate attention layers from the decoder to the encoder. Overall, the SA-NMT (shared) model performs the best, gaining more than 0.5 BLEU on De→En wmt16, up to 0.82 BLEU on En→De wmt17, and 0.64 BLEU in the En→Ru direction over a competitive NMT baseline. The gain of the SA-NMT model on Ru→Ar is small (0.45 BLEU) but significant. The results show that structured self attention is useful when translating from English to languages that have long-distance dependencies and complex morphological agreements. We also see that the gain is marginal compared to self-attention models (FA-NMT-shared) and not significant. Within FA-NMT models, sharing attention is helpful. Our results also confirm the advantage of having two separate sets of annotations in the encoder when modeling syntax. The hard structured self attention model (SA-NMT-hard) performs comparably to the baseline. While this is a somewhat expected result from the hard attention model, we will show in Section 7 that the quality of induced trees from hard attention is often far better than those from soft attention.

Gate Activation Visualization
As mentioned earlier, our models allow us to ask the question: When does the target LSTM need to access source side syntax? We investigate this by analyzing the gate activations of our best model, SA-NMT (shared). At time step $t$, when the model is about to predict the target word $y_t$, we compute the norm of the gate activations
$$z_t = \left\lVert \sigma(W_g h_{t-1}) \right\rVert_2.$$
The activation norm $z_t$ allows us to see how much syntactic information flows into the decoder. We observe that $z_t$ has its highest value when the decoder is about to generate a verb, while it has its lowest value when the end of sentence token </s> is predicted. Figure 4 shows some examples of German target sentences. The darker colors represent higher activation norms. It is clear that translating verbs requires structural information. We also see that after verbs, the gate activation norms are highest at the nouns Zeit (time), Mut (courage), and Dach (roof), and then tail off as we move to function words, which require less context to disambiguate. Below are the frequencies with which the highest activation norm in a sentence is applied to a given part-of-speech tag on newstest2016. We include the top 7 most common activations. We see that while nouns are often the most common tag in a sentence, syntax is disproportionately used for translating verbs.
We analyze the induced grammars in two ways: 1) a quantitative attachment accuracy and 2) a qualitative look at the model's output.
Our results corroborate and refute previous work (Hashimoto and Tsuruoka, 2017;Williams et al., 2018). We provide stronger evidence that syntactic information can be discovered via latent structured self attention, but we also present preliminary results indicating that conventional definitions of syntax may be at odds with task specific performance.
Unlike in the grammar induction literature our model is not specifically constructed to recover traditional dependency grammars nor have we provided the model with access to part-of-speech tags or universal rules (Naseem et al., 2010;Bisk and Hockenmaier, 2013). The model only uncovers the syntactic information necessary for a given language pair, though future work should investigate if structural linguistic constraints benefit MT.

Extracting a Tree
For extracting non-projective dependency trees, we use the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967). First, we must collapse BPE segments into words. Assume the k-th word corresponds to BPE tokens from index u to v. We obtain a new matrix $\hat{\phi}$ by summing over the entries $\phi_{i,j}$ that correspond to the BPE segments of each word.
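Collapsing BPE-level scores into word-level scores can be sketched as a block sum. The exact formula is not given above, so this sketch assumes that the score between two words is the sum of all scores between their BPE segments; the function name, the span representation, and the toy matrix are hypothetical.

```python
import numpy as np

def collapse_bpe(phi, spans):
    """Sum BPE-level scores into word-level scores.

    phi:   (n, n) scores over BPE tokens
    spans: list of inclusive (u, v) BPE index ranges, one per word
    """
    m = len(spans)
    phi_hat = np.zeros((m, m))
    for k, (u, v) in enumerate(spans):
        for l, (p, q) in enumerate(spans):
            # ASSUMPTION: sum over every pair of segments of words k and l.
            phi_hat[k, l] = phi[u:v + 1, p:q + 1].sum()
    return phi_hat

# Four BPE tokens forming three words: tokens 0-1 belong to the first word.
phi = np.arange(16, dtype=float).reshape(4, 4)
spans = [(0, 1), (2, 2), (3, 3)]
phi_hat = collapse_bpe(phi, spans)
```

The resulting word-level matrix can then be passed to a maximum spanning tree decoder such as Chu-Liu-Edmonds.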

Grammatical Analysis
To analyze performance we compute unlabeled directed and undirected attachment accuracies of our predicted trees on gold annotations from the Universal Dependencies (UD version 2) dataset. We chose this representation because of its availability in many languages, though it is atypical for grammar induction. Our five model settings, in addition to left and right branching baselines, are presented in Table 3. The results indicate that the target language affects the source encoder's induction performance and that several settings are competitive with branching baselines for determining headedness.
Recall that syntax is being modeled on the source language so adjacent rows are comparable.
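Directed and undirected attachment accuracies can be computed from head indices as in the following sketch. It encodes the root as -1 and includes two chain baselines that attach every word to a neighbor; which of the two corresponds to the left or right branching baseline of Table 3 is not specified here, and the function names and the toy sentence are assumptions.

```python
def attachment_scores(pred, gold):
    """Return (directed, undirected) attachment accuracy.

    pred, gold: lists where entry j is the head index of word j, -1 for root.
    """
    n = len(gold)
    directed = sum(p == g for p, g in zip(pred, gold))
    gold_edges = {frozenset((j, g)) for j, g in enumerate(gold)}
    # An undirected match ignores which end of the edge is the head.
    undirected = sum(frozenset((j, p)) in gold_edges for j, p in enumerate(pred))
    return directed / n, undirected / n

def head_is_previous(n):
    """Chain baseline: the first word is the root, each other word attaches left."""
    return [-1] + list(range(n - 1))

def head_is_next(n):
    """Chain baseline: the last word is the root, each other word attaches right."""
    return list(range(1, n)) + [-1]

# Toy gold tree for "I went home": 'went' is the root and heads both other words.
gold = [1, -1, 1]
da, ua = attachment_scores(head_is_previous(3), gold)
```

For this toy example the left-attaching chain gets one of three heads exactly right and recovers a second edge with the wrong direction, so the undirected score is higher than the directed one.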
We observe a huge boost in DA/UA scores for EN and RU in FA-NMT and SA-NMT-shared models when the target languages are morphologically rich (RU and AR respectively). In comparison to previous work (Belinkov et al., 2017; Shi et al., 2016) on an encoder's ability to capture source side syntax, we show a stronger result: even when the encoders are designed to capture syntax explicitly, the choice of the target language influences the amount of syntax learned by the encoder.
Figure 6: Samples of induced trees for English by our (En→Ru) model, shown for the sentences "I still have surgically induced hair loss" and "I went to this urgent care center and was blown away with their service": (a) gold parses; (b) SA-NMT (shared). Notice the red arrows from subject↔verb which are necessary for translating Russian verbs.
We also see gains from hard attention and several models outperform baselines for undirected dependency metrics (UA). Whether hard attention helps in general is unclear. It appears to help when the target languages are morphologically rich.
Successfully extracting linguistic structure with hard attention indicates that models can capture interesting structures beyond semantic co-occurrence via discrete actions. Our approach also outperforms that of Hashimoto and Tsuruoka (2017) despite lacking access to additional resources like POS tags. 4
4 The numbers are not directly comparable since they use the WSJ corpus to evaluate the UA score.

Dependency Accuracies & Discrepancies
While the SA-NMT-hard model gives the best directed attachment scores on EN→DE, DE→EN and RU→AR, the BLEU scores of this model are below other SA-NMT models as shown in Table 2. The lack of correlation between syntactic performance and NMT contradicts the intuition of previous work and suggests that useful structures learned in service of a task might not necessarily benefit from or correspond directly to known linguistic formalisms. We want to raise three important differences between these induced structures and UD.
First, we see a blurred boundary between dependency and constituency representations. As noted earlier, the BiLSTM provides a local summary. When the model chooses a head word, it is actually choosing hidden states from a BiLSTM and therefore gaining access to a constituent or region. This means there is likely little difference between attending to the noun vs the determiner in a phrase (despite being wrong according to UD). Future work might force this distinction by replacing the BiLSTM with a bag-of-words, but this will likely lead to substantial losses in MT performance.
Second, because the model appears to use syntax for agreement, often verb dependencies link to subjects directly to capture predicate argument structures like those in CCG or semantic role labeling. UD instead follows the convention of attaching all verbs that share a subject to one another or their conjunctions. We have colored some subject-verb links in Figure 6: e.g., between I, went and was.
Finally, the model's notion of headedness is atypical as it roughly translates to "helpful when translating". The head word gets incorporated into the shared representation, which may cause the arrow to flip relative to traditional formalisms. Additionally, because the model can turn syntax on and off as necessary, it is likely to produce high confidence treelets rather than complete parses. This means arcs produced from words with weak gate activations (Figure 4) are not actually used during translation and are likely not syntactically meaningful.
We will not speculate if these are desirable properties or issues to address with constraints, but the model's decisions appear well motivated and our formulation allows us to have the discussion.

Conclusion
We have proposed a structured self attention encoder for NMT. Our models show significant gains in performance over a strong baseline on standard WMT benchmarks. The models presented here do not access any external information such as parse trees or part-of-speech tags, yet appear to use and induce structure when given the opportunity. Finally, we see our induction performance is language pair dependent, which invites an interesting research discussion as to the role of syntax in translation and the importance of working with morphologically rich languages.
Figure 7 shows a sample visualization of structured attention models trained on En→De data. It is worth noting that the shared SA-NMT model (Figure 7a) and the hard SA-NMT model (Figure 7b) capture similar structures of the source sentence. We hypothesize that when the objective function requires syntax, the induced trees are more consistent, unlike those discovered by a semantic objective (Williams et al., 2018). Both models correctly identify that the verb is the head of pronoun (hope→I,