Context-Aware Cross-Attention for Non-Autoregressive Translation

Non-autoregressive translation (NAT) significantly accelerates the inference process by predicting the entire target sequence. However, due to the lack of target dependency modelling in the decoder, the conditional generation process heavily depends on the cross-attention. In this paper, we reveal a localness perception problem in NAT cross-attention, for which it is difficult to adequately capture source context. To alleviate this problem, we propose to enhance signals of neighbour source tokens into conventional cross-attention. Experimental results on several representative datasets show that our approach can consistently improve translation quality over strong NAT baselines. Extensive analyses demonstrate that the enhanced cross-attention achieves better exploitation of source contexts by leveraging both local and global information.


Introduction
Different from autoregressive translation (Bahdanau et al., 2015; Vaswani et al., 2017, AT) models that generate each target word conditioned on previously generated ones, non-autoregressive translation (Gu et al., 2018, NAT) models break the autoregressive factorization and produce the target words in parallel. Given a source sentence $x$, the probability of generating its target sentence $y$ with length $T$ is defined by NAT as: $p(y|x) = p_L(T|x; \theta) \prod_{t=1}^{T} p(y_t|x; \theta)$, where $p_L(\cdot)$ is a separate conditional distribution to predict the length of target sequence. As NAT models can predict all tokens independently and simultaneously, recent works have fully investigated their superiority on decoding efficiency (Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019; Kasai et al., 2020; Sun et al., 2019; Shu et al., 2020; Ran et al., 2019). However, there still exists a gap between AT and NAT models in terms of effectiveness.
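To make the NAT factorization concrete, the following minimal sketch scores a target sequence as the sum of a length log-probability and independent per-position token log-probabilities. The function names and the array layout (a length distribution indexed by T, and a T-by-vocabulary matrix of token log-probabilities) are illustrative assumptions, not part of any particular NAT implementation.

```python
import numpy as np


def nat_log_prob(length_log_probs, token_log_probs, target):
    """Return log p(y|x) = log p_L(T|x) + sum_t log p(y_t|x).

    length_log_probs: 1-D array, entry T holds log p_L(T|x) (assumed layout).
    token_log_probs:  array of shape (T, vocab), row t holds log p(. | x)
                      for target position t; rows are mutually independent.
    target:           list of T token ids.
    """
    T = len(target)
    positions = np.arange(T)
    # Each position is scored on its own: no dependence on other target tokens.
    token_term = token_log_probs[positions, target].sum()
    return float(length_log_probs[T] + token_term)


def nat_parallel_argmax(token_log_probs):
    """Pick the most likely token at every position in one parallel step."""
    return token_log_probs.argmax(axis=-1).tolist()
```

Because every position is scored without reference to the other target tokens, all rows of the token matrix can be computed and decoded in a single pass, which is where the decoding speed-up comes from.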
In encoder-decoder frameworks, the cross-attention module dynamically selects relevant source-side information (key) given a target-side token (query) (Yang et al., 2020;. Through qualitative and quantitative analyses, we found that it is difficult for the NAT decoder to adequately capture the source context due to the lack of autoregressive factorization. As shown in Table 1, when translating the Chinese word "交往", the source context word "女孩" should play a significant role in predicting the candidate word "dating". However, the NAT model inappropriately generates "socializing with", which is a lexical choice error. The AT model assigns a relatively high attention weight to this local context on the source side, while the NAT model pays far less attention to it (0.15 vs. 0.04). We present a further statistical analysis in Section 2 to show that this localness perception problem is widespread. Similar to our findings,  showed that distributions of cross-attention in NAT models are more ambiguous than those in AT ones.
To alleviate this localness perception problem in NAT, we propose a context-aware cross-attention to model both local and global contexts simultaneously. For local attention, we limit the scope of cross-attention to adjacent tokens surrounding the source word with the maximum alignment probability. We then combine the local attention weights with the original global ones by a gating mechanism (in Section 3).
Input: 弗兰克 找到 一间 公寓 ， 同时 在 跟 一个 女孩 交往 。
Reference: Frank found an apartment and was dating a girl at the same time.
NAT Output: Frank found an apartment and was socializing with a girl.
  Attention: 交往 0.68, 弗兰克 0.18, 女孩 0.04
AT Output: Frank found an apartment and was dating a girl.
  Attention: 交往 0.81, 女孩 0.15, 弗兰克 0.03
Ours Output: Frank found an apartment and was dating a girl.
  Attention: 交往 0.69, 女孩 0.11, 弗兰克 0.09
Table 1: Case study of localness perception problem. "NAT Output" and "AT Output" are generated by NAT and AT models, respectively. "Attention" shows top-3 cross-attention probabilities when generating the target word "dating" or other equivalents.
Experiments are conducted on four commonly-cited translation datasets (i.e., WMT16 Romanian⇒English, WAT17 Japanese⇒English, WMT14 English⇒German and WMT17 Chinese⇒English) and show that our approach consistently improves translation quality by around 0.5 BLEU points over advanced NAT models (Section 4). Further analyses reveal that our method enhances the ability of NAT to learn syntactic and semantic information as well as phrase patterns (Section 5).

Localness Perception Problem
To validate our motivation, we conduct a statistical analysis. Following Tu et al. (2014), we employ the locality entropy to measure how strongly the cross-attention concentrates around the source word that corresponds to $y_t$. As shown in Table 1, when generating the target-side word "dating", the concentrated source word is "交往" according to the maximum attention probability, and AT's attention distribution is clearly more concentrated than NAT's and therefore has a lower entropy. In our case, given a sentence pair $\{f_1, f_2, \ldots, f_n; e_1, e_2, \ldots, e_m\}$, for each decoding position $pos \in [1, m]$, we can obtain a probability distribution $P^i_{pos} = \{P^i(f_1|pos), \ldots, P^i(f_n|pos)\}$ by calculating cross-attention in the $i$-th decoding layer. Thus, the locality entropy of a given sentence is $LE = -\frac{1}{6m} \sum_{i \in [1,6]} \sum_{pos \in [1,m]} P^i_{pos} \log_2 P^i_{pos}$. Finally, we average all sentence-level LE values to obtain the corpus-level one. A lower LE means that attention is more concentrated on source-side localness, and vice versa.
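The locality entropy can be sketched in a few lines of NumPy. This is a minimal illustration under two assumptions that the text does not spell out: the term $P^i_{pos} \log_2 P^i_{pos}$ is read as a sum over the $n$ source positions of each attention distribution, and zero-probability entries contribute zero. The function names and the (layers, m, n) array layout are hypothetical.

```python
import numpy as np


def locality_entropy(attn):
    """Sentence-level locality entropy in bits.

    attn: array of shape (layers, m, n); attn[i, pos] is the cross-attention
          distribution over the n source tokens at target position pos in
          decoder layer i (each row sums to 1).
    """
    attn = np.asarray(attn, dtype=np.float64)
    # Replace zeros by 1 inside the log so that 0 * log2(0) is treated as 0.
    safe = np.where(attn > 0, attn, 1.0)
    # Assumption: entropy of each distribution sums over the source axis.
    per_position = -(attn * np.log2(safe)).sum(axis=-1)  # (layers, m)
    # Averaging over layers and positions gives the 1 / (6m) factor for 6 layers.
    return float(per_position.mean())


def corpus_locality_entropy(attn_per_sentence):
    """Average the sentence-level values to obtain the corpus-level LE."""
    return float(np.mean([locality_entropy(a) for a in attn_per_sentence]))
```

A uniform distribution over four source tokens gives 2 bits and a one-hot distribution gives 0 bits, matching the reading that lower LE means more concentrated attention.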

We compare the locality entropy of NAT and AT models on En-De and Zh-En. As shown in Table 2, the locality entropy (LE) of the NAT model is higher than that of the AT model, showing that the localness perception problem is more severe in NAT. With the help of our method (Section 3), this problem is alleviated (LE↓), leading to better translation quality (BLEU↑). This observation confirms the universality and the negative effect of the localness perception problem in NAT, validating our hypothesis in Section 1.

Context-Aware Cross-Attention for NAT
In this section, we introduce the details of our proposed context-aware cross-attention network (CCAN), which perceives the original and local cross-attention simultaneously.
Original Cross-Attention Given the target-side query $Q$, source-side key $K$ and value $V$, the $i$-th original cross-attention $\psi_i$ is calculated with a dot product: $\psi_i = Q_i K^\top$. The original attention output for the $i$-th element is the weighted sum of values, $\mathrm{ATT}(\psi_i, V) = \mathrm{softmax}(\psi_i) V$ (Figure 1(a)).
Our Approach For the $i$-th position on the target side, we propose a locally-sensitive cross-attention component for NAT to capture the neighbor signals. For simplicity, we adopt a straightforward approach that has been proven effective (Luong et al., 2015; Xu et al., 2019; You et al., 2020): restricting the attention scope to a nearby window around the aligned $j$-th element. In practice, we choose the source element with the highest attention weight as the aligned element and model the local range around it, where $\psi_{i,j}$ denotes the attention correlation between the $i$-th element in the decoder and the $j$-th element in the encoder, and win is the hard-coded localness modeling window. Furthermore, we design an interpolation gating mechanism to combine the original and local cross-attention (Equation 2), where $g = \sigma(W Q_i)$ is the interpolation weight conditioned on the decoder-side query $Q_i$ and $\sigma(\cdot)$ denotes the sigmoid function. Note that $W$ is the only additional parameter, used to estimate the importance of the original cross-attention operation, and we share it across cross-attention heads.
Figure 1: Illustration of our proposed approach, which combines (a) vanilla cross-attention and (b) localness-aware cross-attention. In (a), the word "交往" is assigned with the maximum attention weight while the adjacent word (local context) "女孩" is assigned with a low weight. In (b), we guide the model to perceive the local context.
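The mechanism described above can be sketched for a single attention head as follows. This is an illustrative reading, not the authors' implementation, and it rests on several assumptions: the window is symmetric with total width win and centred on the highest-scoring source position, positions outside the window are masked with negative infinity before the softmax, the gate g weights the original attention and (1 - g) the local one, and W is a vector that maps each query to a scalar. The dot product is left unscaled to mirror the formula in the text.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def context_aware_cross_attention(Q, K, V, W, win=9):
    """Single-head sketch. Q: (m, d), K: (n, d), V: (n, d_v), W: (d,)."""
    psi = Q @ K.T                                   # (m, n) attention scores
    n = K.shape[0]
    # Aligned source element: the one with the highest score for each query.
    center = psi.argmax(axis=-1)                    # (m,)
    # Assumption: symmetric window of total width `win` around the centre.
    distance = np.abs(np.arange(n)[None, :] - center[:, None])
    inside = distance <= win // 2
    psi_local = np.where(inside, psi, -np.inf)      # mask out-of-window scores
    global_out = softmax(psi) @ V                   # original cross-attention
    local_out = softmax(psi_local) @ V              # localness-aware attention
    # Gate conditioned on the decoder-side query (assumed scalar per position).
    g = 1.0 / (1.0 + np.exp(-(Q @ W)))
    g = g[:, None]
    return g * global_out + (1.0 - g) * local_out
```

With a window wider than the source sentence the local branch reduces to the original attention, so the output equals vanilla cross-attention; with win = 1 the local branch simply copies the value of the aligned source token.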

Models
We follow Gu et al. (2018) and apply sequence-level knowledge distillation (Kim and Rush, 2016) to simplify the training data. For the AT teachers, we train both BASE and BIG Transformer (Vaswani et al., 2017) models on the corresponding training data. For the BIG model, we adopt a large-batch strategy (458K tokens per batch) to optimize performance. The main results employ Transformer-BIG as the teacher for all directions except Ro-En, which is distilled by BASE. Our approach can be applied to different NAT architectures. In this paper, we mainly implement it on conditional masked language models (Ghazvininejad et al., 2019, CMLMs) and leave further investigation to future work. The model contains a 6-layer encoder and a 6-layer decoder, where the decoder is trained in the conditional masked language model fashion. The model dimension is 512 with 8 attention heads and a feed-forward dimension of 2048. We follow common practice (Ghazvininejad et al., 2019; Kasai et al., 2020) and average the top three checkpoints to avoid stochasticity.
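Checkpoint averaging, as mentioned above, can be illustrated with a short sketch. It assumes that each checkpoint is a dictionary from parameter name to array and that "top" means the highest validation score; both the data layout and the function name are hypothetical and not tied to any specific toolkit.

```python
import numpy as np


def average_top_checkpoints(checkpoints, scores, k=3):
    """Average the parameters of the k best checkpoints.

    checkpoints: list of dicts mapping parameter name -> ndarray.
    scores:      validation score per checkpoint (assumed: higher is better).
    """
    # Indices of the k highest-scoring checkpoints.
    best = np.argsort(scores)[::-1][:k]
    selected = [checkpoints[i] for i in best]
    # Element-wise mean of every parameter across the selected checkpoints.
    return {
        name: np.mean([ckpt[name] for ckpt in selected], axis=0)
        for name in selected[0]
    }
```

Averaging parameters in this way smooths out the variation between individual checkpoints saved late in training.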

Ablation Study
In order to make best use of our proposed component for NAT, we conducted extensive ablation studies. All models are trained and validated on WMT14 En-De training and validation sets.
(Gu et al., 2018) | 1 | 31.4 | 19.2 | n/a | n/a
4 | Iterative NAT (Lee et al., 2018) | 10 | 30.2 | 21.6 | n/a | n/a
5 | DisCo (Kasai et al., 2020) | 4.8 | 33.3 | 26.8 | n/a | n/a
6 | Levenshtein (Gu et al., 2019) | 2.5 | 33.3 | 27.3 | n/a | n/a
7 | CMLMs (Ghazvininejad et al.,

Effects of Localness Range
We investigate the localness window size within [3, 5, 7, 9, 11] and report the translation performance in Table 3 (left). As seen, our context-aware cross-attention with a window size of 9 achieves the best BLEU and is therefore used as the default setting.
As shown in Table 3 (right), deploying CCAN on the top layer slightly outperforms deploying it on the bottom layer ("[6]" > "[1]"). In NAT, multiple decoding layers can be cast as a refiner, and the source central word chosen by the bottom-layer cross-attention is not as accurate as the one chosen by the top layer. Our method, which is highly conditioned on the predicted central words, thus benefits more on the top layer than on the bottom layer. Finally, modelling all layers ("[1-6]") achieves the best performance, and we use this setting in the following experiments.
Table 4 lists the main results and a comparison with previous NAT models on the WMT16 Ro-En, WMT14 En-De, WMT17 Zh-En and WAT17 Ja-En datasets. We mainly implemented our approach on top of the advanced CMLMs model. As seen, our approach (Row 9) consistently improves translation performance (BLEU↑) over CMLMs on four language pairs. Note that our approach only modifies the cross-attention module and introduces few extra parameters, leading to a negligible loss in latency. Encouragingly, our approach even slightly outperforms its AT teachers (Transformer-BASE) on three tasks.

Analysis
In this section, we conduct extensive analyses on WMT14 En-De to better understand how our method contributes to the performance gains.

Importance of Localness
The importance of localness should differ across layers. We explore this by analyzing the gate (Equation 2). Specifically, we treat the weighting scalar of the local cross-attention as its importance degree and calculate the importance of localness for each decoder layer. As shown in Figure 2(a), as information flows from the bottom to the top layers, the importance of localness continues to decline until the penultimate layer and then increases. A possible reason for this increase is that the top layer is followed by the softmax, which requires more source-side context to choose lexicons.
Phrasal Patterns Our approach is expected to pay more attention to the most relevant source token and its neighbours, so that phrasal translation can be improved. To evaluate the accuracy of phrase translations, we calculate the improvement on n-gram tokens in Figure 2(b), where the golden dashed line indicates the window size of 9. As seen, CCAN consistently outperforms the baseline (∆Accuracy > 0), indicating that our method enhances the ability of the NAT model to capture phrasal information, which is consistent with the findings of Yang et al. (2018).
Linguistic Properties Intuitively, our proposed cross-attention component brings context-aware representations, which may affect the linguistic properties learned by the encoder. We quantitatively investigate this from linguistic perspectives with probing tasks (Conneau et al., 2018). These tasks can be categorized into three types: "Surface" focuses on the simple surface properties learned from the sentence embedding; "Syntactic" quantifies the ability to preserve syntactic information; and "Semantic" assesses the deeper semantic representation ability. To evaluate the representation ability of the CCAN-equipped NAT model, we compare the pre-trained vanilla NAT and CCAN-equipped NAT encoders, each followed by an MLP classifier. Specifically, the mean of the top encoding layer is used as the sentence representation and passed to the classifier. As Table 5 shows, the CCAN-equipped NAT encoder preserves rich syntactic and semantic information.

Conclusion and Future Work
We reveal a localness perception problem in NAT. To alleviate it, we propose a context-aware approach that makes the cross-attention pay more attention to source-side local words, which in turn improves translation performance on several benchmarks. In future work, we will investigate selectively choosing the context (Geng et al., 2020; rather than using a fixed window size. Besides, it would be interesting to enhance the NAT model with extra signals, such as cross-lingual position embedding (Ding et al., 2020), larger context (Wang et al., 2017) and pre-trained initialization .