Syntax-Based Attention Masking for Neural Machine Translation

We present a simple method for extending transformers to source-side trees. We define a number of masks that limit self-attention based on relationships among tree nodes, and we allow each attention head to learn which mask or masks to use. On translation from English to various low-resource languages, and translation in both directions between English and German, our method always improves over simple linearization of the source-side parse tree and almost always improves over a sequence-to-sequence baseline, by up to +2.1 BLEU.


Introduction
The transformer model for machine translation (Vaswani et al., 2017) was originally defined as a mapping from sequences to sequences. More recent work has explored extensions of transformers to other structures: a tree transformer would be able to make use of syntactic information, and a graph transformer would be able to make use of semantic graphs or knowledge graphs.
There have been a number of proposals for transformers on trees, including phrase-structure trees and dependency trees for natural languages, and abstract syntax trees for programming languages. One common strategy is to linearize a tree into a sequence (Ahmad et al., 2020; Currey and Heafield, 2019). Another strategy is to recognize that transformers are fundamentally defined not on sequences but on bags; all information about sequential order is contained in the positional encodings, so all that is needed to construct a tree transformer is to define new positional encodings on trees (Shiv and Quirk, 2019; Omote et al., 2019).
In this paper, we present a third approach, which is to enhance the encoder's self-attention mechanism with attention masks (Shen et al., 2018), which restrict the possible positions an attention head can attend to. We extend this idea in two new ways. First, our attention masks are based on relationships among tree positions (for example, "is an ancestor of" or "is a descendant of") rather than sequence positions ("is left of" or "is right of"). Second, instead of pre-assigning different masks to each attention head, we allow each attention head to learn separately which mask or masks to use.
We experiment on machine translation of several low-resource language pairs (Section 3). Compared to linearization without masks, our method always improves accuracy, by up to +1.7 BLEU (all BLEU reported as percent). Compared with a sequence-to-sequence baseline, our method improves accuracy by up to +2.1 BLEU. On tasks where linearization hurts, our method is usually, but not always, able to turn the loss into a gain.

Methods
Like several previous approaches, we use linearized syntax trees. But whereas the usual linearization visits a node both before and after its descendants (emitting an opening and a closing bracket), we use a preorder traversal of the tree. In other words, our linearization has no closing brackets. It therefore does not contain enough information to reconstruct the original tree; that information is instead carried by the attention masks, which we describe next.

Shen et al. (2018) introduce the idea of using masks in a string transformer to allow attention heads to attend only to the left or only to the right. We apply this idea to tree transformers, with two modifications. First, instead of masking out the left or right context, we use masks based on the structure of the tree. Second, instead of allocating a fixed number of heads to each mask, we let the model learn which mask(s) to use for each attention head.
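As a concrete illustration, the preorder linearization can be sketched as follows. The tree representation (label plus children, with leaves as plain strings) and the choice to keep the fused opening bracket "(X" as a single token are illustrative assumptions, not details taken from the paper's code:

```python
def preorder_linearize(tree):
    """Preorder linearization with no closing brackets.

    A tree is (label, children) for an interior node, or a plain string
    for a leaf. This representation is illustrative, not the paper's code.
    """
    if isinstance(tree, str):
        return [tree]                  # leaf: emit the word itself
    label, children = tree
    tokens = ["(" + label]             # opening bracket only, no matching ")"
    for child in children:
        tokens += preorder_linearize(child)
    return tokens

tree = ("S", [("NP", ["He"]),
              ("VP", [("V", ["is"]), ("NP", ["my", "father"])])])
print(" ".join(preorder_linearize(tree)))
# (S (NP He (VP (V is (NP my father
```

Note that the output cannot be parsed back into the original tree, which is exactly why the tree structure must be supplied separately through the attention masks.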
Given a query Q ∈ R^{n×d_k}, key K ∈ R^{n×d_k}, and value V ∈ R^{n×d_v} (where n is the number of input tokens and d_k = d_v is d_model divided by the number of attention heads), scaled dot-product attention is normally computed as

    Attention(Q, K, V) = αV        α = softmax(QKᵀ / √d_k)

where α ∈ R^{n×n} is the matrix of attention weights, and the softmax is performed per row. We modify the definition of α to

    α = softmax(QKᵀ / √d_k − Σ_m exp(s_m) M_m)

where, for each m, the matrix M_m ∈ {0, 1}^{n×n} is a fixed mask and s_m is its corresponding strength, which is learnable. If [M_m]_{ij} = 1 and s_m is large, then the attention at position i is prevented from attending to position j. If [M_m]_{ij} = 0 or s_m is very negative, then position i is free to attend to position j. With multiple attention heads, each head has its own strength parameters.
The strength parameters are initialized to zero and learned by backpropagation with the rest of the model. In this way, each attention head can learn separately which mask or masks to use.
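The computation above can be sketched in a few lines of pure Python (a minimal illustration assuming the mask penalty takes the form exp(s_m)·M_m subtracted from the scaled scores; this is not the paper's implementation, which would use a tensor library):

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(scores, masks, strengths):
    """Attention weights with learnable soft masks.

    scores: n x n raw dot-product scores, already scaled by 1/sqrt(d_k).
    masks: list of n x n 0/1 matrices M_m.
    strengths: list of scalars s_m; exp(s_m) * M_m is subtracted from
        the scores before the row-wise softmax.
    """
    n = len(scores)
    out = []
    for i in range(n):
        row = []
        for j in range(n):
            penalty = sum(math.exp(s) * M[i][j]
                          for M, s in zip(masks, strengths))
            row.append(scores[i][j] - penalty)
        out.append(softmax(row))
    return out
```

With s_m large, the masked entries are driven to (near-)zero weight; with s_m very negative, the penalty vanishes and the head behaves as if unmasked, which is what lets each head learn its own mask usage.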
It remains to define the masks M_m. A mask can be defined for any imaginable string or tree relationship; because the model can always choose not to use a mask, we can add as many masks as we want. We use the following set:

Although none of the above masks overlap, there would be no problem with defining masks that do. See Figure 1 for an example. In (a) is an English tree; (b) shows the same tree after applying byte pair encoding (BPE) subword segmentation (see Section 3 below); and (c) shows the relationships of all the nodes with the second NP (the one dominating my father).
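To make the construction concrete, the sketch below builds relation masks over the linearized nodes from a parent-pointer array. The relation inventory shown here (self/ancestor/descendant/sibling/other) is illustrative only; the paper's exact mask set may differ:

```python
def tree_relation_masks(parents):
    """Build 0/1 relation masks over linearized tree nodes.

    parents[i] is the index of node i's parent (None for the root).
    Returns a dict of n x n matrices; entry [i][j] = 1 means the
    relation holds between position i (the attender) and position j.
    The relation names are an illustrative inventory, not necessarily
    the paper's exact mask set.
    """
    n = len(parents)

    def ancestors(i):
        out = set()
        j = parents[i]
        while j is not None:
            out.add(j)
            j = parents[j]
        return out

    anc = [ancestors(i) for i in range(n)]
    names = ("self", "ancestor", "descendant", "sibling", "other")
    rel = {name: [[0] * n for _ in range(n)] for name in names}
    for i in range(n):
        for j in range(n):
            if i == j:
                name = "self"
            elif j in anc[i]:
                name = "ancestor"      # j is an ancestor of i
            elif i in anc[j]:
                name = "descendant"    # j is a descendant of i
            elif parents[i] == parents[j]:
                name = "sibling"
            else:
                name = "other"
            rel[name][i][j] = 1
    return rel
```

Because the relations partition all node pairs, every entry (i, j) is covered by exactly one mask, though as noted above overlapping masks would also be unproblematic.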

Data
We tested on the following datasets: en-vi English to Vietnamese, from the IWSLT 2015 shared task.¹ To test for dependence of our method on training data size, we also used random subsets of 20k and 50k lines.
de-en, en-de German↔English, from the WMT 2016 news translation task.² For training, we used random subsets of 20k, 50k, and 100k lines. We used news-test2013 for validation and news-test2014 for testing.
en-tu, en-ha, en-ur English to Turkish, Hausa, and Urdu, from the DARPA LORELEI program.
Some statistics of the datasets are shown in Table 1. This table lists the average number of source words and source interior nodes, from which the average number of tokens in the linearized and mask systems can be derived. We tokenized using the Moses tokenizer, then divided words into subwords using BPE (Sennrich et al., 2016). For en-vi, en-tu, en-ha, and en-ur, we used 8k joint BPE operations, and for en-de and de-en, we used 32k operations.
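For readers unfamiliar with BPE, the following toy sketch applies an already-learned, ordered list of merges to a single word. It is a simplified illustration: real toolkits such as subword-nmt also handle end-of-word markers, a vocabulary threshold, and the merge-learning step itself:

```python
def apply_bpe(word, merges):
    """Greedily apply a learned, ordered list of BPE merges to one word.

    merges is a list of symbol pairs; pairs learned earlier have higher
    priority. This toy version omits end-of-word markers and other
    details of real BPE toolkits.
    """
    symbols = list(word)
    ranks = {pair: r for r, pair in enumerate(merges)}
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break                       # no applicable merge remains
        i = pairs.index(best)
        symbols[i:i + 2] = [best[0] + best[1]]
    return symbols
```

For example, with the (hypothetical) merges (f,a), (t,h), (e,r), (th,er), the word "father" segments into the subwords "fa" and "ther", matching the fa@@ ther style segmentation shown in Figure 1b.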
To parse English or German sentences, we used the Berkeley Neural Parser (Kitaev and Klein, 2018; Kitaev et al., 2019) with the included benepar_en2 model for English and benepar_de for German. The parser reads in untokenized strings and writes out tokenized trees; we used the parser's tokenization, but applied BPE to the leaves, as shown in Figure 1b.

Evaluation
We compare against two baselines: Sequence is a standard sequence-to-sequence model, run on words only. Linearized is a standard sequence-to-sequence model, run on linearized trees: a leaf node w is linearized as w, and an interior node X is linearized as (X, followed by the linearization of its children, followed by ). Against these baselines, we compare our model, Mask, which uses a preorder traversal of the tree together with the masks described above in Section 2.
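For contrast with the preorder traversal used by the Mask model, the Linearized baseline's bracketed format can be sketched as follows (same illustrative tree representation as before; treating "(X" and ")" as single tokens is an assumption consistent with the description above):

```python
def linearize_with_brackets(tree):
    """Bracketed linearization for the Linearized baseline.

    An interior node X becomes "(X", its children's linearizations, then ")".
    A leaf is just its word. Tree representation is illustrative:
    (label, children) for interior nodes, a plain string for leaves.
    """
    if isinstance(tree, str):
        return [tree]
    label, children = tree
    tokens = ["(" + label]
    for child in children:
        tokens += linearize_with_brackets(child)
    tokens.append(")")                  # closing bracket, unlike preorder
    return tokens

tree = ("S", [("NP", ["He"]),
              ("VP", [("V", ["is"]), ("NP", ["my", "father"])])])
print(" ".join(linearize_with_brackets(tree)))
# (S (NP He ) (VP (V is ) (NP my father ) ) )
```

This makes the token-count difference between the two tree formats concrete: the bracketed form adds roughly twice as many extra tokens per interior node as the preorder form, which is relevant to the results discussion below.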
All systems are implemented on top of Witwicky,³ an open-source implementation of the transformer. We use all default settings; in particular, layer normalization is performed after residual connections (Nguyen and Salazar, 2019). We score detokenized system outputs using case-sensitive BLEU against raw references (except on en-vi, where we use tokenized outputs and references), using bootstrap resampling (Koehn, 2004; Zhang et al., 2004) for significance testing.

Results
The results are shown in Table 2. Relative to the Linearized baseline, our method (Mask) always improves, by up to +1.7 BLEU for English-Turkish. The difference is statistically significant (p < 0.05) except for English-Urdu.
Relative to the Sequence baseline, the story is more complex. Whenever Linearized helps over Sequence, our method helps more, up to a total of +2.1 BLEU for German↔English (50k). But when Linearized hurts, our method sometimes helps overall (all tasks with 20k lines of training) and sometimes doesn't (e.g., English-Urdu, with only 11k lines of training). A simple possible explanation is that the additional tokens make training more difficult on the very smallest datasets, and the effect is stronger for Linearized, which has twice as many extra tokens.

For the parse tree of the sentence "He is my father.", the resulting sum of the masks for each attention head shows a strong left-right asymmetry, with heads 2-4 and 7 attending to the left and heads 1 and 5-6 attending to the right. There is also a strong preference to attend to nodes that are nearby in the tree, with the strongest weights on the child, left-sib, and right-sib relations.

Figure 4 shows the minimum, maximum, and range of the mask strengths learned for various tasks. Generally, a mask's range correlates with its usefulness to the model. In particular, on English-Urdu, where the syntax-based models performed worst, the masks are also used the least and distinguished the least. English-Vietnamese is a clear exception, however, with the highest maximum and widest range but a small (insignificant) loss in BLEU.

Conclusion
In this paper, we've shown that syntax can be both helpful and easy to incorporate into low-resource neural machine translation. We introduced learnable attention masks for the transformer that allow each attention head to focus more narrowly on certain node relationships in the syntax tree, improving translation across a variety of low-resource datasets by up to +2.1 BLEU.

Table 2: Experiment results. In each column, the best score and any scores not significantly different from the best (p ≥ 0.05) are printed in boldface.