Enhancing Machine Translation with Dependency-Aware Self-Attention

Most neural machine translation models only rely on pairs of parallel sentences, assuming syntactic information is automatically learned by an attention mechanism. In this work, we investigate different approaches to incorporate syntactic knowledge in the Transformer model and also propose a novel, parameter-free, dependency-aware self-attention mechanism that improves its translation quality, especially for long sentences and in low-resource scenarios. We show the efficacy of each approach on WMT English-German and English-Turkish, and WAT English-Japanese translation tasks.


Introduction
Research in neural machine translation (NMT) has mostly exploited corpora consisting of pairs of parallel sentences, with the assumption that a model can automatically learn prior linguistic knowledge via an attention mechanism (Luong et al., 2015). However, Shi et al. (2016) found that these models still fail to capture deep structural details, and several studies (Sennrich and Haddow, 2016; Eriguchi et al., 2017; Chen et al., 2017, 2018) have shown that syntactic information has the potential to improve these models. Nevertheless, the majority of syntax-aware NMT models are based on recurrent neural networks (RNNs; Elman 1990), with only a few recent studies that have investigated methods for the Transformer model (Vaswani et al., 2017). Wu et al. (2018) evaluated an approach to incorporate syntax in NMT with a Transformer model, which not only required three encoders and two decoders, but also target-side dependency relations (precluding its use for low-resource target languages). Zhang et al. (2019) integrate source-side syntax by concatenating the intermediate representations of a dependency parser to word embeddings.

* Work done while at Tokyo Institute of Technology.
In contrast to ours, this approach does not allow learning sub-word units on the source side, requiring a larger vocabulary to minimize out-of-vocabulary words. Saunders et al. (2018) interleave words with syntax representations, which results in longer sequences (requiring gradient accumulation for effective training) while only leading to +0.5 BLEU on WAT Ja-En when using ensembles of Transformers. Finally, Currey and Heafield (2019) propose two simple data augmentation techniques to incorporate source-side syntax: one that works well on low-resource data, and one that achieves a high score on a large-scale task. Our approach, on the other hand, performs equally well in both settings.
While these studies improve the translation quality of the Transformer, they do not exploit its properties. In response, we propose to explicitly enhance its self-attention mechanism (a core component of this architecture) to include syntactic information without compromising its flexibility. Recent studies have, in fact, shown that self-attention networks benefit from modeling local contexts by reducing the dispersion of the attention distribution (Shaw et al., 2018; Yang et al., 2018, 2019), and that they might not capture the inherent syntactic structure of languages as well as recurrent models do, especially in low-resource settings (Tran et al., 2018; Tang et al., 2018). Here, we present parent-scaled self-attention (PASCAL): a novel, parameter-free local attention mechanism that lets the model focus on the dependency parent of each token when encoding the source sentence. Our method is simple yet effective, improving translation quality with no additional parameter or computational overhead.
Our main contributions are:
• introducing PASCAL: an effective parameter-free local self-attention mechanism to incorporate source-side syntax into Transformers;
• adapting LISA (Strubell et al., 2018) to sub-word representations and applying it to NMT;
• finding, similar to concurrent work (Pham et al., 2019), that modeling linguistic knowledge in the self-attention mechanism leads to better translations than other approaches.
Our extensive experiments on standard En↔De, En→Tr and En→Ja translation tasks also show that (a) approaches to embed syntax in RNNs do not always transfer to the Transformer, and (b) PASCAL consistently exhibits significant improvements in translation quality, especially for long sentences.

Model
In order to design a neural network that is efficient to train and that exploits syntactic information while producing high-quality translations, we base our model on the Transformer architecture (Vaswani et al., 2017) and upgrade its encoder with parent-scaled self-attention (PASCAL) heads at layer l_s. PASCAL heads enforce contextualization from the syntactic dependencies of each source token. In practice, we replace standard self-attention heads with PASCAL heads in the first layer, as its inputs are word embeddings that lack any contextual information. Our PASCAL sub-layer has the same number H of attention heads as the other layers.
Source syntax Similar to previous work, instead of just providing sequences of tokens, we supply the encoder with dependency relations given by an external parser. Our approach explicitly exploits sub-word units, which enable open-vocabulary translation: after generating sub-word units, we compute the middle position of each word in terms of number of tokens. For instance, if a word in position 4 is split into three tokens, now in positions 6, 7 and 8, its middle position is 7. We then map each sub-word of a given word to the middle position of its parent. For the root word, we define its parent to be itself, resulting in a parse that is a directed graph. The input to our encoder is a sequence of T tokens and the absolute positions of their parents.
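The mapping above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it assumes the common "@@" BPE continuation marker, 0-indexed positions, and a right-of-center tie-break for words split into an even number of tokens.

```python
def middle_positions(bpe_tokens):
    """Middle token position of each word after BPE (word order preserved)."""
    spans, current = [], []
    for i, tok in enumerate(bpe_tokens):
        current.append(i)
        if not tok.endswith("@@"):          # last sub-word of the current word
            spans.append(current)
            current = []
    return [span[len(span) // 2] for span in spans]

def subword_parent_positions(bpe_tokens, word_heads):
    """Map every sub-word to the middle position of its word's parent.
    word_heads[w] is the word-level index of word w's dependency parent
    (the root word points to itself)."""
    middles = middle_positions(bpe_tokens)
    parents, w = [], 0
    for tok in bpe_tokens:
        parents.append(middles[word_heads[w]])
        if not tok.endswith("@@"):
            w += 1
    return parents
```

For example, a word split into three tokens occupying positions 6, 7 and 8 gets middle position 7, matching the example in the text.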

Parent-Scaled Self-Attention
Figure 1 shows our parent-scaled self-attention sub-layer. Here, for a sequence of length T, the input to each head is a matrix X ∈ R^(T×d_model) of token embeddings and a vector p ∈ R^T whose t-th entry p_t is the middle position of the t-th token's dependency parent. Following Vaswani et al. (2017), in each attention head h, we compute three vectors (called query, key and value) for each token, resulting in the three matrices Q_h ∈ R^(T×d), K_h ∈ R^(T×d), and V_h ∈ R^(T×d) for the whole sequence, where d = d_model/H. We then compute dot products between each query and all the keys, giving scores of how much focus to place on other parts of the input when encoding a token at a given position. The scores are divided by √d to alleviate the vanishing gradient problem arising if dot products are large:

    S_h = Q_h K_h^T / √d.    (1)

Our main contribution is in weighing the scores of the token at position t, s^h_t, by the distance of each token from the position of t's dependency parent:

    n^h_t = s^h_t ⊙ d^p_t,    (2)

where n^h_t is the t-th row of the matrix N_h ∈ R^(T×T) representing scores normalized by the proximity to t's parent, and d^p_t is the t-th row of the matrix D_p ∈ R^(T×T) whose (t, j)-th entry d^p_tj = dist(p_t, j) is the distance of token j from the middle position p_t of token t's dependency parent. In this paper, we compute this distance as the value at j of the probability density of a normal distribution centered at p_t and with variance σ²:

    d^p_tj = N(j; p_t, σ²).    (3)

Finally, we apply a softmax function to yield a distribution of weights for each token over all the tokens in the sentence, and multiply the resulting matrix with the value matrix V_h, obtaining the final representations M_h for PASCAL head h.
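A minimal numpy sketch of one PASCAL head following the steps above; the weight matrices and σ = 1 are illustrative, and a real implementation would batch this inside the Transformer encoder rather than process one sentence at a time.

```python
import numpy as np

def gaussian_pdf(x, mean, sigma=1.0):
    """Density of N(mean, sigma^2) at x, used as the distance function."""
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def pascal_head(X, p, Wq, Wk, Wv, sigma=1.0):
    """One PASCAL head. X: (T, d_model) embeddings; p: (T,) parent middle positions."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # (T, d) each
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                         # scaled dot-product scores
    j = np.arange(X.shape[0], dtype=float)
    Dp = gaussian_pdf(j[None, :], p[:, None].astype(float), sigma)  # (T, T) distances
    N = S * Dp                                       # parent-scaled scores
    A = np.exp(N - N.max(axis=-1, keepdims=True))    # numerically stable softmax
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V                                     # head output M_h
```

Note that D_p depends only on the parent-position vector p, so it adds no trainable parameters.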
One of the major strengths of our proposal is being parameter-free: no additional parameter is required to train our PASCAL sub-layer as D p is obtained by computing a distance function that only depends on the vector of tokens' parent positions and can be evaluated using fast matrix operations.
Parent ignoring Due to the lack of parallel corpora with gold-standard parses, we rely on noisy annotations from an external parser. However, the performance of syntactic parsers drops abruptly when evaluated on out-of-domain data (Dredze et al., 2007). To prevent our model from overfitting to noisy dependencies, we introduce a regularization technique for the PASCAL sub-layer: parent ignoring. In a similar vein to dropout (Srivastava et al., 2014), we disregard information during the training phase: here, we ignore the position of the parent of a given token by randomly setting each row of D_p to the all-ones vector 1 ∈ R^T with some probability q.
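A hedged sketch of parent ignoring, assuming the distance matrix D_p has already been computed; the function name and RNG handling are ours.

```python
import numpy as np

def parent_ignoring(Dp, q, rng=None):
    """With probability q, reset a row of D_p to all ones, leaving that
    token's attention scores unscaled (applied at training time only)."""
    rng = rng or np.random.default_rng()
    out = Dp.copy()
    mask = rng.random(Dp.shape[0]) < q   # one Bernoulli draw per token/row
    out[mask] = 1.0
    return out
```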
Gaussian weighing function The choice of weighing each score by a Gaussian probability density is motivated by two of its properties. First, its bell-shaped curve: it allows us to focus most of the probability density at the mean of the distribution, which we set to the middle position of the sub-word units of the dependency parent of each token. In our experiments, we find that most words in the vocabularies are not split into sub-words, hence allowing PASCAL to mostly focus on the actual parent. In addition, non-negligible weights are placed on the neighbors of the parent token, allowing the attention mechanism to also attend to them. This could be useful, for instance, to learn idiomatic expressions such as prepositional verbs in English. The second property of Gaussian-like distributions that we exploit is their support: while most of the weight is placed in a small window of tokens around the mean of the distribution, all the values in the sequence are actually multiplied by non-zero factors, allowing a token j farther away from the parent p_t of token t to still play a role in the representation of t if its score s^h_tj is high.
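These two properties can be checked numerically. The snippet below is only illustrative, placing the parent at position 4 of a 10-token sentence with σ = 1:

```python
import math

def normal_pdf(x, mean, sigma=1.0):
    """Density of N(mean, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Parent at position 4, sigma = 1: one weight per token position.
weights = [normal_pdf(j, mean=4.0) for j in range(10)]
```

The largest weight falls on the parent position itself, most of the mass lies within a window of about three tokens around it, and yet every position keeps a strictly positive factor.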
PASCAL can be seen as an extension of the local attention mechanism of Luong et al. (2015), with the alignment now guided by syntactic information. Yang et al. (2018) proposed a method to learn a Gaussian bias that is added to, instead of multiplied by, the original attention distribution. As we will see next, our model significantly outperforms this.

Experimental Setup
Data We evaluate the efficacy of our approach on standard, large-scale benchmarks and in low-resource scenarios, where the Transformer was shown to induce poorer syntax. Following Bastings et al. (2017), we use News Commentary v11 (NC11) with En-De and De-En tasks to simulate low resources and test multiple source languages. To compare with previous work, we train our models on WMT16 En-De and WAT En-Ja tasks, removing sentences in incorrect languages from the WMT16 data sets. For a thorough comparison with concurrent work, we also evaluate on the large-scale WMT17 En-De and low-resource WMT18 En-Tr tasks. We rely on Stanford CoreNLP (Manning et al., 2014) to parse source sentences.

Training We implement our models in PyTorch on top of the Fairseq toolkit. Hyperparameters, including the number of PASCAL heads, that achieved the highest validation BLEU (Papineni et al., 2002) score were selected via a small grid search.
We report previous results in syntax-aware NMT for completeness, and train a Transformer model as a strong, standard baseline. We also investigate the following syntax-aware Transformer approaches:
• +PASCAL: The model presented in §2. The variance of the normal distribution was set to 1 (i.e., an effective window size of 3), as 99.99% of the source words in our training sets are split into at most 7 sub-word units.
• +LISA: We adapt LISA (Strubell et al., 2018) to NMT and sub-word units by defining the parent of a given token as its first sub-word (which represents the root of the parent word).
• +MULTI-TASK: Our implementation of the multi-task approach by Currey and Heafield (2019), where a standard Transformer learns to both parse and translate source sentences.
• +S&H: Following Sennrich and Haddow (2016), we introduce syntactic information in the form of dependency labels in the embedding matrix of the Transformer encoder.
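For illustration, the +S&H variant can be sketched as below. Per the appendix, each dependency label is embedded into a 10-dimensional vector that replaces the last 10 dimensions of the token embedding; the array names are ours.

```python
import numpy as np

LABEL_DIM = 10  # per the appendix: label embeddings of size 10

def sh_embeddings(tok_emb, label_ids, label_table):
    """Replace the last LABEL_DIM dims of each token embedding with the
    embedding of its dependency label, keeping the overall size unchanged.
    tok_emb: (T, d_model); label_ids: (T,) ints; label_table: (n_labels, LABEL_DIM)."""
    out = tok_emb.copy()
    out[:, -LABEL_DIM:] = label_table[label_ids]
    return out
```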

Results
Table 1 presents the main results of our experiments. Clearly, the base Transformer outperforms previous syntax-aware RNN-based approaches, proving it to be a strong baseline in our experiments. The table shows that the simple approach of Sennrich and Haddow (2016) does not lead to notable advantages when applied to the embeddings of the Transformer model. We also see that the multi-task approach benefits from better parameterization, but it only attains comparable performance with the baseline on most tasks. On the other hand, LISA, which embeds syntax in a self-attention head, leads to modest but consistent gains across all tasks, proving that it is also useful for NMT.
Finally, PASCAL outperforms all other methods, with consistent gains over the Transformer baseline independently of the source language and corpus size: It gains up to +0.9 BLEU points on most tasks and a substantial +1.75 in RIBES (Isozaki et al., 2010), a metric with stronger correlation with human judgments than BLEU in En↔Ja translations. On WMT17, our slim model compares favorably to other methods, achieving the highest BLEU score across all source-side syntax-aware approaches. Overall, our model achieves substantial gains given the grammatically rigorous structure of English and German. Not only do we expect performance gains to further increase on less rigorous source languages and with better parses (Zhang et al., 2019), but also higher robustness to the noisier syntax trees obtained from back-translated data when using parent ignoring.

Performance by sentence length As shown in Figure 2, our model is particularly useful when translating long sentences, obtaining more than +2 BLEU points on long sentences in all low-resource experiments, and +3.5 BLEU points on the distant En-Ja pair. However, only a few sentences (1%) in the evaluation datasets are long.

Qualitative performance Table 2 presents examples where our model correctly translated the source sentence while the Transformer baseline made a syntactic error. For instance, in the first example (SRC: "In a cooling experiment, only a tendency agreed ..."), the Transformer misinterprets the adverb "only" as an adjective of "tendency:" the word "only" is an adverb modifying the verb "agreed." In the second example, "don't" is incorrectly translated into the past tense instead of the present.
PASCAL layer When we introduced our model, we motivated our design choice of placing PASCAL heads in the first layer in order to enrich the representations of words from their isolated embeddings by introducing contextualization from their parents. We ran an ablation study on the NC11 data in order to verify this hypothesis. As shown in Table 3a, the performance of our model on the validation sets is lower when placing PASCAL heads in upper layers, a trend that we also observed with the LISA mechanism. These results corroborate the findings of Raganato and Tiedemann (2018), who noticed that, in the first layer, more attention heads solely focus on the word to be translated itself rather than on its context. We can then deduce that enforcing syntactic dependencies in the first layer effectively leads to better word representations, which further enhance the translation accuracy of the Transformer model. Investigating the performance of multiple syntax-aware layers is left as future work.
Gaussian variance Another design choice we made was the variance of the Gaussian weighing function. We set it to 1 in our experiments, motivated by the statistics of our datasets, where the vast majority of words are split into at most a few tokens after applying BPE.

Conclusion
This study provides a thorough investigation of approaches to induce syntactic knowledge into self-attention networks. Through extensive evaluations on various translation tasks, we find that approaches effective for RNNs do not necessarily transfer to Transformers (e.g., +S&H). Conversely, dependency-aware self-attention mechanisms (LISA and PASCAL) best embed syntax across all corpus sizes, with PASCAL consistently outperforming all other approaches. Our results show that exploiting core components of the Transformer to embed linguistic knowledge leads to larger and more consistent gains than previous approaches.

A Experiment details
Data preparation We follow the same preprocessing steps as Vaswani et al. (2017). Unless otherwise specified, we first tokenize the data with Moses (Koehn et al., 2007) and remove sentences longer than 80 tokens on either the source or target side. Following Bastings et al. (2017), we train on the News Commentary v11 (NC11) data set with English→German (En-De) and German→English (De-En) tasks so as to simulate low-resource cases and to evaluate the performance of our models for different source languages. We also train on the full WMT16 data set for En-De, using newstest2015 and newstest2016 as validation and test sets, respectively, in each of these experiments. Moreover, we notice that these data sets contain sentences in different languages and use langdetect to remove sentences in incorrect languages.
We also train our models on WMT18 English→Turkish (En-Tr) as a standard low-resource scenario. Models are validated on newstest2016 and tested on newstest2017.
Previous studies on syntax-aware NMT have commonly been conducted on the WMT16 En-De and WAT English→Japanese (En-Ja) tasks, while concurrent approaches are evaluated on the WMT17 En-De task. In order to provide a generic and comprehensive evaluation of our proposed approach on large-scale data, we also train our models on the latter tasks. We follow the WAT18 preprocessing steps for experiments on En-Ja but use Cabocha to tokenize target sentences. On WMT17, we use newstest2016 and newstest2017 as validation and test sets, respectively.

Models In addition to the baseline Transformer, we evaluate the following syntax-aware approaches:
• +S&H: Following Sennrich and Haddow (2016), we introduce syntactic information in the form of dependency labels in the embedding matrix of the Transformer encoder. More specifically, each token is associated with its dependency label, which is first embedded into a vector representation of size 10 and then used to replace the last 10 embedding dimensions of the token embedding, ensuring a final size that matches the original one.
• +MULTI-TASK: Our implementation of the multi-task approach by Currey and Heafield (2019), where a standard Transformer learns to both parse and translate source sentences. Each source sentence is first duplicated and associated with its linearized parse as target sequence. To distinguish between the two tasks, a special tag indicating the desired task is prepended and appended to each source sentence. Finally, parsing and translation training data are shuffled together.
• +LISA: We adapt Linguistically-Informed Self-Attention (LISA; Strubell et al. 2018) to NMT. In one attention head h, Q_h and K_h are computed through a feed-forward layer, and the key-query dot product used to obtain attention weights is replaced by a bi-affine operator U. These attention weights are further supervised to attend to each token's parent by interpreting each row t as the distribution over possible parents for token t. Here, we extend the authors' approach to BPE by defining the parent of a given token as its first sub-word unit (which represents the root of the parent word). The model is trained to maximize the joint probability of translations and parent positions.

See Table 6 for the training times of each experiment. For each model, we run a small grid search over the hyperparameters and select the ones giving the highest BLEU scores on validation sets (Table 7).
Following Vaswani et al. (2017), we train Transformer-based models for 100K steps on large-scale data. On small-scale data, we train for 20K steps and use a dropout probability P_drop = 0.3, as these settings let the Transformer baseline achieve higher performance on this size of data. For instance, on WMT18 En-Tr, our baseline outperforms the one in Currey and Heafield (2019) by +3.5 BLEU.

B Analysis
Multiplication vs. addition In Equation (2), we calculated the weighing scores by multiplying the self-attention scores by the distance to the parent token. Multiplication is, in fact, the standard way to weight values (e.g., the gating mechanism of LSTMs and GRUs). In our case, it introduces sparseness in the attention scores for non-parent tokens. Moreover, it weights gradients in backpropagation: let x and y be the attention score and dependency weight, respectively, and consider a loss l = f(z) where z = xy; then dl/dx = df(z)/dz · y. The attention score thus receives larger gradients for dependent pairs (larger y) than for non-dependent ones (smaller y), which is sound for dependency information. In contrast, addition cannot obtain such an effect because the weight does not multiply the gradient: dl/dx = df(z)/dz when z = x + y. For completeness, we trained our best NC11 models replacing multiplication with addition. We find that BLEU scores still improve upon the baseline, meaning that our approach is robust, but they are slightly lower (−0.2) than with multiplication.
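A small finite-difference check of this argument (the linear stand-in loss f(z) = 3z and all names are ours): under multiplication the gradient with respect to the score equals 3y, so it scales with the dependency weight, while under addition it is a constant 3.

```python
def grad_x(f, combine, x, y, eps=1e-6):
    """Central finite difference of l = f(combine(x, y)) w.r.t. x."""
    return (f(combine(x + eps, y)) - f(combine(x - eps, y))) / (2 * eps)

f = lambda z: 3.0 * z          # stand-in loss with constant df/dz = 3
mul = lambda a, b: a * b       # z = x * y (our weighting)
add = lambda a, b: a + b       # z = x + y (the alternative)

# Multiplicative combination: gradient is df/dz * y, i.e. larger for
# dependent pairs (large y); additive combination: gradient is df/dz only.
```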
Ablation We introduced different techniques to improve neural machine translation with syntax information. Table 5 lists the contribution of each technique, in an incremental fashion, whenever they were used by the models reported in Table 1.
While removing sentences whose languages do not match the translation task can lead to better performance (NC11), the precision of the detection tool assumes a major role at large scale. In WMT16, langdetect removes more than 200K sentences and leads to performance losses. It would also drop 19K pairs on the clean WAT En-Ja data.
The proposed PASCAL mechanism is the component that most improves the performance of the models, achieving up to +1.0 and +1.3 BLEU on the distant En-Tr and En-Ja pairs, respectively. With the exception of NC11 En-De, we find parent ignoring useful on the noisier WMT18 En-Tr and WMT17 En-De datasets. In the former, low-resource case, the benefits of parent ignoring are minimal, but it proves fundamental on the large-scale WMT17 data, where it leads to significant gains when paired with the PASCAL mechanism. Finally, looking at the number of PASCAL heads in Table 7, we notice that most models rely on a large number of syntax-aware heads. Raganato and Tiedemann (2018) found that only a few attention heads per layer encoded a significant amount of syntactic dependencies. Our study shows that the Transformer model can be improved by having more attention heads learn syntactic dependencies.