Long-Short Term Masking Transformer: A Simple but Effective Baseline for Document-level Neural Machine Translation

Many document-level neural machine translation (NMT) systems have explored the utility of context-aware architectures, usually at the cost of more parameters and higher computational complexity. However, little attention has been paid to the baseline model. In this paper, we extensively study the pros and cons of the standard transformer in document-level translation, and find that the auto-regressive property brings both the advantage of consistency and the disadvantage of error accumulation. Therefore, we propose a surprisingly simple long-short term masking self-attention on top of the standard transformer to both effectively capture long-range dependencies and reduce the propagation of errors. We evaluate our approach on two publicly available document-level datasets, where it achieves strong BLEU results and captures discourse phenomena.


Introduction
Recent advances in deep learning have led to significant improvements in Neural Machine Translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2014; Luong et al., 2015; Vaswani et al., 2017). In particular, sentence-level translation performance for both low- and high-resource language pairs has improved dramatically (Kudugunta et al., 2019; Lample et al., 2018; Lample and Conneau, 2019). However, when translating text with long-range dependencies, such as conversations or documents, the original mode of translating one sentence at a time ignores discourse phenomena (Voita et al., 2019a,b), introducing undesirable behaviors such as inconsistent pronouns across different translated sentences.
Document-level NMT, as a more realistic translation task in these scenarios, has been systematically investigated in the machine translation community. Most of the literature has focused on looking back at a fixed number of previous source or target sentences as the document-level context (Tu et al., 2018; Voita et al., 2018; Zhang et al., 2018; Miculicich et al., 2018; Voita et al., 2019a,b). Some recent works attempted to either get the most out of the entire document context or dynamically select the suitable context (Maruf and Haffari, 2018; Yang et al., 2019a; Maruf et al., 2019; Jiang et al., 2019). Because of the scarcity of document-level training data, the benefit gained from such approaches, as reflected in BLEU, is usually limited. We therefore elect to attend only to the context in the previous n sentences, where n is a small number and usually does not cover the entire document.
Almost all of the latest studies chose as their baseline the standard transformer model, which translates each sentence in the document with a model trained on sentence-level data. The resulting cohesion and consistency are in general poor. A more reasonable baseline is to train the transformer with the context prepended, a modification that can be implemented simply via data preprocessing. Bawden et al. (2018) conducted a detailed analysis of RNN-based NMT models on the question of whether or not to include the extended context. Consistency and precision are often viewed as a trade-off. We conduct a detailed analysis of the effect of document context on consistency in a transformer architecture accepting multi-sentence input.
When it comes to leveraging contextual information, the common approach is to model the interaction between the sentence and its context with specially designed attention modules (Kim et al., 2019). Such works tend to include more than one encoder or decoder, with a substantial number of parameters and additional computation. In our work, we reduce the contextual and regular attention modules into one single encoder and decoder. Our idea is motivated by the single transformer decoder with two-stream self-attention (Yang et al., 2019b). In particular, we maintain two different sets of hidden states and employ two different masking matrices to capture the long and short term dependencies.
The contributions of this paper are threefold: i) we extensively study the performance of the standard transformer in the setting of multi-sentence input and output; ii) we propose a simple but effective modification that adapts the transformer to document NMT with the aim of ameliorating the effect of error accumulation; iii) our experiments demonstrate that even the simple baseline can achieve comparable results.

The Proposed Approach
The standard transformer NMT follows the typical encoder-decoder architecture, using stacked self-attention, pointwise fully connected layers, and encoder-decoder attention layers. The self-attention in the decoder allows only positions from the left up to the current one to be attended to, preventing information flow to the right beyond the current target and preserving the auto-regressive property. The illegal connections are masked out by setting them to −∞ before the softmax operation. The attention probability can be succinctly written in a unified formulation:
$$P = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}} + M\right) \qquad (1)$$
where the matrices Q and K represent the queries and keys in the attention module (Vaswani et al., 2017), d is the dimensionality of the keys, and M is the masking matrix. For the encoder self-attention and the encoder-decoder attention, M = 0. For the decoder self-attention, M is zero on and below the diagonal and −∞ (≈ −10^9 in practice) strictly above the diagonal.
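The masked attention probability described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the authors' implementation: the function names `causal_mask` and `attention_probs`, the constant `NEG_INF`, and the random toy inputs are all assumptions introduced here.

```python
import numpy as np

NEG_INF = -1e9  # finite stand-in for -infinity, as suggested in the text


def causal_mask(n):
    """Decoder mask: 0 on and below the diagonal, NEG_INF strictly above it."""
    return np.triu(np.full((n, n), NEG_INF), k=1)


def attention_probs(Q, K, M):
    """Row-wise softmax of the scaled dot-product scores plus the mask M."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + M
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


# Toy inputs (hypothetical): 4 positions, key dimensionality 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))

P_enc = attention_probs(Q, K, np.zeros((4, 4)))  # encoder: M = 0
P_dec = attention_probs(Q, K, causal_mask(4))    # decoder: causal M
```

With the causal mask, every row of `P_dec` still sums to one, but all probability mass lies on the current and earlier positions.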

Long-Short Term Masking Transformer
The basic setup in this work is multi-sentence input and output, denoted as the k-to-k model. In other words, both the encoder and decoder need to consume k sentences during training and inference. Therefore, in our modified transformer, the regular self-attention is substituted by the long-short term masking self-attention (illustrated in Figure 1). While the idea of most context-aware models is to introduce another isolated attention module, we propose to maintain two attention streams via the local and global representations, with the parameters used to calculate queries, keys and values shared between them. The global self-attention, simply following the calculation in Eq (1), serves a similar role to the standard hidden states in the transformer. The keys and values can broadly look around from the first token to the last one, and the global hidden state of the next layer will summarize the information of both the context and the current sentence. The query vector comes directly from the global hidden states of the previous layer via a fully connected layer.

Figure 1: Illustration of the Long-Short Term Masking Self-Attention. Green nodes: global self-attention, which is the same as the standard self-attention. Pink nodes: local self-attention, which does not have access to the information from the document context. The red dashed lines are removed in the decoder attention.
The local self-attention only accesses the information of the current sentence, where the contextual information from the previous sentence(s) is blocked when computing the keys and values. Similar to the masking strategy of the transformer decoder, the implementation of the local self-attention is to mask out the tokens of the context via −∞ inside the scaled dot-product operation. Figure 1 depicts the masking matrices of the local self-attention for the encoder and decoder respectively. They are both block-diagonal matrices, where each block represents the local self-attention of the current sentence, and the blank and maroon dots denote the values 0 and −∞ respectively. When calculating attention weights, we only need to replace the M in Eq (1) with the block masking matrices.
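The block-diagonal local masks described above can be built from the sentence lengths alone. The following is a minimal sketch under stated assumptions: the helper names `sentence_ids` and `local_mask`, the `causal` flag used to distinguish decoder from encoder, and the toy lengths are all hypothetical and not taken from the paper.

```python
import numpy as np

NEG_INF = -1e9  # finite stand-in for -infinity


def sentence_ids(lengths):
    """Label every token with the index of the sentence it belongs to."""
    return np.repeat(np.arange(len(lengths)), lengths)


def local_mask(lengths, causal=False):
    """Block-diagonal mask: 0 inside a sentence's own block, NEG_INF elsewhere.

    With causal=True (decoder side, an assumption of this sketch), positions
    to the right of the current token are masked inside each block as well.
    """
    ids = sentence_ids(lengths)
    allowed = ids[:, None] == ids[None, :]
    if causal:
        n = len(ids)
        allowed = allowed & np.tril(np.ones((n, n), dtype=bool))
    return np.where(allowed, 0.0, NEG_INF)


# Toy 2-to-2 input: a 2-token sentence followed by a 3-token sentence.
enc_local = local_mask([2, 3])
dec_local = local_mask([2, 3], causal=True)
```

Either matrix can be passed as M in Eq (1) to obtain the local stream, while the global stream keeps the standard mask.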
Ref
"Before I die, I want to plant a tree." "Before I die, I want to live off the grid." "Before I die, I want to hold her one more time."

Sys0
"Before death, I want a tree." "Before I die, I want to live lives." "Before death, I want to hug her again."

Sys1
I want to be a tree before I die. "Before death, I want to become invisible." "Before death, I want to hug her again."

Sys2
"I want to create a tree before I die." "Before I die, I want to live a hidden life." "Before I die, I want to hug her again."

Src
在左边你能看到一个小船。这是一个约15英尺的船。我想让你们注意冰山的形状它在水面上的变形。

Ref
You can see on the left side a small boat. That's about a 15 foot boat. And I'd like you to pay attention to the shape of the iceberg and where it is at the waterline.

Sys0
On the left you see a small boat. It's about 15 feet. I want you to look at the shape of the iceberg that it deformed on the water.

Sys1
On the left you can see a small boat. This is a ship about 15 feet. I want you to notice the shape of the iceberg which is distorted on the water.

Sys2
On the left you see a small boat. This is a 15 foot boat. I want you to pay attention to the shape of the iceberg that's distorted on the surface of the water.

For the two sets of hidden representations in the final layer, we can either aggregate them with element-wise operation (such as summation or concatenation) or directly use global states to predict the distribution of target language model. In our work, we adopt the concatenation, and subsequently transform them via a fully connected layer to reduce dimensionality.
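The concatenate-then-project step for combining the two streams can be sketched as follows. This is an illustration under assumptions, not the authors' code: the weight matrix `W`, bias `b`, the function name `fuse`, the random toy states, and the choice of a plain linear projection without activation are all hypothetical. Only the model dimensionality of 512 comes from the experimental setup.

```python
import numpy as np

d_model = 512  # hidden size of each stream, per the experimental setup
rng = np.random.default_rng(0)

# Hypothetical parameters of the fully connected layer: 2 * d_model -> d_model.
W = rng.normal(scale=0.02, size=(2 * d_model, d_model))
b = np.zeros(d_model)


def fuse(h_global, h_local):
    """Concatenate both streams per token, then project back to d_model."""
    combined = np.concatenate([h_global, h_local], axis=-1)  # (tokens, 1024)
    return combined @ W + b                                  # (tokens, 512)


# Toy final-layer states for 7 target tokens.
h_global = rng.normal(size=(7, d_model))
h_local = rng.normal(size=(7, d_model))
fused = fuse(h_global, h_local)
```

The fused states have the same shape as a standard decoder output, so the usual output layer can be applied unchanged.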

Experiments

Experimental Setup
We carry out experiments with the Chinese-English IWSLT TED talks dataset and the English-Russian open-subtitle dataset. The widely used Zh-En IWSLT dataset contains around 200K training sentence pairs divided into 1713 documents. As is the convention, dev2010 and tst2010-2013 are used for validation and testing respectively. The En-Ru subtitle dataset contains around 1.5M conversations, where each conversation includes exactly 4 sentences. Two randomly selected subsets of 10,000 instances from movies not included in the training set are used for development and testing.
The BPE tokenization is learnt separately with 32K operations for each language in the dataset. The resulting source / target vocabulary sizes for the Zh-En and En-Ru datasets are 10296 / 16018 and 12273 / 22642, respectively. The token-level batch sizes are 8192 and 16384 for training the Zh-En and En-Ru datasets on two and four P-100 GPUs, respectively.
The model hyper-parameters and the optimizer of the standard transformer baseline follow the base setting in Vaswani et al. (2017). We set the number of layers in the encoder and decoder to 6, and the number of attention heads to 8. The dimensionality of input and output is 512. In addition, we add a feed-forward layer before the decoder output layer, with dimensionality 1024, to combine the local and global streams. We use the Adam optimizer with β1 = 0.9, β2 = 0.98 and ε = 10^-9, with 16000 warm-up steps and a scale of 4. The batch size for each GPU is 4000.
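The warm-up setting is not spelled out further. A plausible reading, given that the base setting follows Vaswani et al. (2017), is the inverse-square-root schedule from that paper multiplied by the stated scale. The sketch below assumes exactly that: the function name `learning_rate` and the way `scale` enters the formula are assumptions, not confirmed by the text.

```python
def learning_rate(step, d_model=512, warmup=16000, scale=4.0):
    """Assumed schedule: linear warm-up, then inverse-square-root decay.

    lr = scale * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# The rate rises linearly until step 16000 and decays afterwards.
peak = learning_rate(16000)
halfway = learning_rate(8000)
later = learning_rate(64000)
```

Under this assumption the rate at step 8000 is half the peak, and the rate at step 64000 is also half the peak, since quadrupling the step halves the inverse square root.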
The BLEU score is calculated with the script mteval-v13a.pl in Moses. All reported values are evaluated on the test set with the checkpoint that performs best on the development set.

Evaluation on BLEU
We first conduct a detailed analysis of the k-to-k translation model on the IWSLT Zh-En dataset. In this scenario, k source sentences and k target sentences are concatenated as the input and output to train the transformer. During inference, for every k consecutive source sentences, the model produces k target sentences. To translate a test set with a k-to-k model, we keep a sliding window of size k. Each sentence is translated k times (except for the first k − 1 sentences), each time as the j-th (j ≤ k) sentence. For example, in a 4-to-4 model, sentence 5 is translated 4 times: the 1st time as the last sentence in the chunk s_2, s_3, s_4, s_5, the 2nd time as the 3rd sentence in the chunk s_3, s_4, s_5, s_6, and so on. We can thus assemble different versions of the final translated test set, in each of which every sentence is translated as the j-th sentence (j ≤ k) of its chunk. Each of these final documents is evaluated separately. The results are illustrated in Figure 2.
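The sliding-window bookkeeping described above can be made concrete with a short sketch. The function names `sliding_chunks` and `positions_for` and the 8-sentence toy document are hypothetical. Sentence numbers are 1-based to mirror the s_2, ..., s_5 notation in the text.

```python
def sliding_chunks(num_sentences, k):
    """All windows of k consecutive sentences (1-based sentence numbers)."""
    return [list(range(start, start + k))
            for start in range(1, num_sentences - k + 2)]


def positions_for(sentence, num_sentences, k):
    """Map each in-chunk position j (1-based) to the chunk where
    `sentence` is translated as the j-th sentence."""
    result = {}
    for chunk in sliding_chunks(num_sentences, k):
        if sentence in chunk:
            result[chunk.index(sentence) + 1] = chunk
    return result


# Sentence 5 of an 8-sentence document under a 4-to-4 model.
views = positions_for(5, num_sentences=8, k=4)
```

Here `views[4]` is the chunk `[2, 3, 4, 5]`, where sentence 5 is decoded last with the full preceding context, and `views[1]` is the chunk `[5, 6, 7, 8]`, where it is decoded first. Collecting the j-th view of every sentence yields the j-th version of the translated test set.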
We can make two inferences from the results. First, with the standard transformer, the BLEU of the 1st sentence is always the highest (Figure 2(a)). This is likely the result of error propagation to subsequent sentences caused by the auto-regressive property mentioned above. Second, a larger k, i.e., more contextual information, does not necessarily result in a better BLEU score. In this case, k = 2 or 3 is better than k = 4. We hypothesize that training with longer sequences, which requires learning longer-range dependencies, is fundamentally difficult, especially for such a small dataset.
When we compare the results of our model with the standard transformer, we have two further findings. First, our k-to-k model outperforms the standard transformer in BLEU, and the j-th sentence BLEU does not decline as much as in the standard transformer. We believe that our long-short term masking self-attention can, to some extent, relieve the effect of error accumulation. Second, when document information is used (i.e., k > 1), decoding each sentence as the last sentence (i.e., using all previous context) achieves higher BLEU scores than decoding each sentence individually in the standard transformer. We pay more attention to the last sentence because presumably it has the richest contextual information; this is also the setting for the results in the next section.
Two qualitative examples are shown in Table 3 (more examples can be found in the supplementary materials). In the first case, compared to Sys0 and Sys1, Sys2 is more consistent in the segments "Before I die" and "I want to" across the three sentences. In the second case, the translation of "boat" in the second sentence is either omitted (Sys0) or inconsistent (Sys1), while Sys2 is more consistent.

Evaluation on Consistency
The publicly available open-subtitle En-Ru dataset includes a special test set for evaluating the consistency of document-level translation systems. The details of the data can be found in Voita et al. (2019b). The context of the training and test data contains exactly 3 sentences, so we mainly adopt a 4-to-4 model in our experiments, and each sentence is translated as the last sentence in a chunk of 4 sentences. In this section, we follow previous work and focus on the accuracy of Deixis, Lexical cohesion, Verb phrase ellipsis and Ellipsis (inflection).
In Table 2, we summarize the BLEU results as well as the consistency performance. s-hier-to-2.tied (Bawden et al., 2018) is an RNN-based NMT model, so its performance is relatively worse than that of the transformer-based models. In contrast, our approach achieves better performance with respect to both BLEU and consistency, except for lexical cohesion. In particular, the lexical cohesion accuracy of Partial Copy (Jean et al., 2019) exceeds ours by a large margin. Jean et al. (2019) filled the missing context with a partial copy strategy, and the resulting repetition naturally enhances lexical cohesion. When we apply the same partial copy trick to our model, lexical cohesion improves by 27%, but BLEU is sacrificed. The lexical cohesion of CADec (Voita et al., 2019b) is slightly higher than that of our approach without partial copy. Considering that CADec is almost double the size of our standard transformer and has a complicated architecture built on deliberation networks (Xia et al., 2017), its gain over the baseline comes at a much higher cost than ours. In summary, our model achieves strong results in both BLEU and consistency with few extra model parameters.

Discussions and Conclusions
In this work, we present a simple but effective variation of the transformer with the long-short term masking strategy, and we perform comparative studies against the k-to-k translation model of the standard transformer. Just as big, complex neural network architectures with a great many parameters have their power, small but efficient modifications to the classical transformer, like ours, have their own appeal. Other examples of simple but impactful ideas are data augmentation and round-trip back-translation (Voita et al., 2019a), to name just a few. Big or small, complex or simple, each has its distinct advantages. We are encouraged by our findings that, alongside the great machinery that brings powerful results, simple approaches can be just as efficacious.

C More Examples
We randomly selected three translation examples and show them in Table 3. For Example 1, the proposed system learnt to produce "And" at the beginning of the translation, which is a side effect of document-level training. For Example 2, the choice between "love" and "love to" is consistent in both the proposed system and the 1-to-1 baseline transformer. The 1-to-1 baseline appears to translate "极" as "radical", which does not appear in the reference; in our view, "extremely" would be a better translation. For Example 3, the reference itself is not consistent between "how are we" and "how do we", whereas our proposed system stays consistent by using "how do we".

Ref
It's got a feed conversion ratio of 15 to one. That means it takes fifteen pounds of wild fish to get you one pound of farm tuna. Not very sustainable.

Sys0
Feeding tuna is 15 to one. That means that every pound of tunas costs 15 pounds to feed feed on other wild fish. It's not sustainable.

Sys1
It's 15 to 1. What that means is that every pound-pound tuna produces 15 pounds of feed on every other wild fish. It's not sustainable.

Sys2
And the shift rate of breeding tuna is 15 to one. That means, for every one pound of tuna, it takes 15 pounds of feeding on other wild fish. It's not very sustainable.

Ref
We love innovation. We love technology. We love creativity. We love entertainment.

Sys0
We love radical innovation. We love technology. We love creation. We love entertainment.

Sys1
We love to be innovative. We love technology. We love to create. We love entertainment.

Sys2
We love innovation. We love technology. We love creating. We love entertainment.

Ref
Want to feed the world? Let's start by asking: how are we going to feed ourselves? Or better: how can we create conditions that enable every community to feed itself?

Sys0
Do you want to feed the world? So let's start asking: how do we feed ourselves? Or better, how can we build an environment that allows every group to feed themselves?

Sys1
How do we feed the world? So let's start asking: how do we feed ourselves? Or even better, how do we build an environment that will feed itself?

Sys2
Want to feed the world? Let's start asking: how do we feed ourselves? Or better, how do we build an environment that allows every single group to feed itself?

Table 3: Examples of translation results. Sys0: 1-to-1 transformer. Sys1: 3-to-3 transformer. Sys2: 3-to-3 long-short term masking transformer.