Query-Key Normalization for Transformers

Low-resource language translation is a challenging but socially valuable NLP task. Building on recent work adapting the Transformer’s normalization to this setting, we propose QKNorm, a normalization technique that modifies the attention mechanism to make the softmax function less prone to arbitrary saturation without sacrificing expressivity. Specifically, we apply ℓ2 normalization along the head dimension of each query and key matrix prior to multiplying them, and then scale up by a learnable parameter instead of dividing by the square root of the embedding dimension. We show improvements averaging 0.928 BLEU over state-of-the-art bilingual benchmarks for 5 low-resource translation pairs from the TED Talks corpus and IWSLT’15.


Introduction
The Transformer (Vaswani et al., 2017) remains the architecture of choice for machine translation. Since its introduction, various architectural and functional modifications have been made to improve its performance on NMT datasets (Ahmed et al., 2017; Zhang et al., 2018; Wang et al., 2019; Dai et al., 2019; Zhao et al., 2019). Translating low-resource languages presents special challenges. Recent strategies for adapting Transformers to this socially valuable task include exploiting transfer learning with many-to-many multilingual models (Aharoni et al., 2019), reducing model depth (van Biljon et al., 2020), and adding a regularization penalty for diverging from the predictions of a monolingual language model pretrained on the target language (Baziotis et al., 2020). This paper builds on recent work on layer normalization for low-resource language pairs, introducing a normalization technique that tries to keep the input to softmax attention within an appropriate range.1

1 Code to reproduce our experiments is available at https://github.com/CyndxAI/QKNorm
Layer normalization. For Transformers and other NLP models, layer normalization (Ba et al., 2016) yields significantly better performance than batch normalization (Ioffe and Szegedy, 2015), in part because NLP models tend to exhibit greater variance in batch statistics during training, for example compared to computer vision (Shen et al., 2020). Layer normalization boosts performance in deeper networks chiefly by controlling their gradients (Xu et al., 2019). It re-scales and re-centers activation distributions (though re-centering may be unnecessary, see Zhang and Sennrich 2019). The type of normalization used and the placement of that normalization within the Transformer are both crucial to Transformer performance (Nguyen and Salazar, 2019).
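As a point of reference for the normalization variants discussed below, layer normalization itself can be sketched in a few lines of numpy (a minimal sketch; variable names and shapes are illustrative rather than taken from any particular implementation):

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    # LayerNorm (Ba et al., 2016): re-center and re-scale each embedding
    # vector across its feature (last) dimension, then apply learned
    # per-feature gain and bias.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias

x = np.random.randn(2, 4, 8)  # (batch, sequence, embedding)
out = layer_norm(x, gain=np.ones(8), bias=np.zeros(8))
# each embedding vector now has mean ~0 and variance ~1
```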
Softmax attention. Given a matrix X embedding a sequence of tokens, attention transforms each embedding into a mixture of itself and other elements of the sequence according to the importance of their connections for the modeling task at hand. In the case of multihead self-attention, the vectors of X are projected linearly into Query, Key, and Value matrices Q, K, and V. Applying the softmax function to each row of QK^T/√d defines a distribution for each token over all the others in its sequence that sums to 1. Multiplying by V then yields a new matrix where the embedding of each token is a weighted average of the vectors in V. Richter and Wattenhofer (2020) propose replacing the softmax function in attention because it constrains attention's output to the convex hull spanned by the vectors in V, limiting model flexibility. For the softmax over the vocabulary in next word prediction, Demeter et al. (2020) find that the norms of word embeddings drown out their angular displacements, with the consequence that words with smaller norms are systematically less likely to be predicted.
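For concreteness, standard scaled dot product attention for a single head can be sketched in a few lines of numpy (shapes and names here are illustrative only; real implementations batch this computation over heads and examples):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Vaswani et al. (2017): weights = softmax(QK^T / sqrt(d)); each output
    # row is a convex combination of the rows of V.
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
# each row of w is a distribution over the 5 tokens and sums to 1
```

Because each row of the weight matrix is a probability distribution, the output rows lie in the convex hull of the rows of V, which is exactly the constraint Richter and Wattenhofer (2020) point to.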
In this work, we replace the dot product inside of softmax attention with cosine similarity scaled up by a learnable parameter. This technique yields improved performance in low-resource bilingual translation, which we conjecture is because it binds QK^T to a narrower range in a way that makes it easier to learn more diffuse attention patterns wherever these prove valuable.

Background
Nguyen and Salazar (2019) achieve state-of-the-art bilingual performance on 5 low-resource translation pairs from the TED Talks (Qi et al., 2018) and IWSLT'15 (Cettolo et al., 2015) corpora. This work builds directly on theirs, applying our technique to the same 5 benchmarks. Their model combines three normalization techniques that we describe below: FIXNORM (Nguyen and Chiang, 2018), PRENORM (Klein et al., 2017; Domhan, 2018; Chen et al., 2018), and SCALENORM, which they introduce as a replacement for layer normalization. They report that each technique contributes about 0.3 BLEU for an average improvement of 1.1 BLEU across the test sets for their 5 language pairs. FIXNORM sets word embeddings to unit length, which aids rare word translation (Nguyen and Chiang, 2018). PRENORM simply changes the location of layer normalization within the Transformer architecture, applying it to the input to each sublayer instead of after the residual connection. Moving layer normalization ahead of the residual connection enhances stability because the residual path is allowed to stay an identity map, instead of contributing terms to the gradient that could cause it to explode or vanish (Wang et al., 2019; Nguyen and Salazar, 2019). Interestingly, Nguyen and Salazar (2019) find PRENORM to be superior in low-resource but not high-resource translation settings.
Lastly, SCALENORM replaces layer normalization with ℓ2 normalization along the embedding dimension, multiplied by a learnable scalar parameter initialized to √d (where d is the embedding dimension; the same √d term appears in scaled dot product attention (Vaswani et al., 2017)).
In other words, SCALENORM applies ℓ2 normalization along the embedding dimension of Q, K, and V, and it does so before the input to multihead attention gets split into heads.
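A minimal numpy sketch of SCALENORM as just described (the shapes are illustrative, and g would be a learned parameter in a real model rather than a fixed value):

```python
import numpy as np

d = 8                # embedding dimension (illustrative)
g = np.sqrt(d)       # learnable scalar, initialized to sqrt(d)

def scale_norm(x, g, eps=1e-5):
    # ScaleNorm (Nguyen and Salazar, 2019): l2-normalize each embedding
    # along the embedding (last) dimension, then rescale by one scalar g.
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return g * x / np.maximum(norm, eps)

x = np.random.randn(3, d)   # a sequence of 3 embeddings
y = scale_norm(x, g)
# every row of y now has l2 norm equal to g
```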
Building on their work, we combine FIXNORM, PRENORM, and vanilla layer normalization (LAYERNORM) with a new technique we call query-key normalization (QKNORM), surpassing their model's performance on each of the same 5 translation pairs by an average of 0.928 test BLEU.
QKNORM applies ℓ2 normalization to Q and K only, and it does so along the head dimension (which is the same dimension as the embedding dimension, but after multihead attention has split its input into separate heads). Q and K thus become Q̂ and K̂, where the ith row vector of Q̂ (the ith embedding in the sequence) is given by

Q̂_i = Q_i / ||Q_i||_2,    (1)

and likewise for K̂:

K̂_i = K_i / ||K_i||_2.    (2)

The effect is to make each element of Q̂K̂^T the cosine similarity of the corresponding pair of contextual token representations instead of their dot product. This is similar to Luo et al. (2018), who propose replacing the dot product in fully-connected networks between layer weights and previous layer outputs with cosine similarity. Like SCALENORM, we also multiply by a learnable parameter that we initialize according to a rule of thumb we describe below. Unlike SCALENORM, QKNORM complements LAYERNORM rather than replacing it. Since the dot product is unbounded, differences between elements that may be insignificantly small on a relative basis can silence all other signals in the attention weights applied to V. We conjecture that this limits the complexity of the patterns that attention heads can learn.
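A single-head numpy sketch of QKNORM attention (shapes and the value of g are illustrative; in the full model the normalization is applied per head and g is learned):

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Normalize each row to unit l2 norm.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def qknorm_attention(Q, K, V, g):
    # QKNorm: l2-normalize the rows of Q and K so that every entry of the
    # resulting product is a cosine similarity in [-1, 1], then scale the
    # logits by the learnable scalar g before the softmax.
    logits = g * (l2_normalize(Q) @ l2_normalize(K).T)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
out = qknorm_attention(Q, K, V, g=5.0)  # g's initialization is discussed below
```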

Dot Products and the Softmax Function
The impact is more obvious in less sophisticated Transformer implementations (perhaps in part because subsequent advances have mitigated the same issue in different ways). Figures 1 and 2 show a heatmap comparison of encoder weights trained using the code for The Annotated Transformer,2 the first with scaled dot product attention and the second with QKNORM.

Figure 1: Scaled Dot Product Attention. Self-attention heatmaps for 4 heads from one encoder layer displaying more "concentrated" attention, consistent with the conjecture that unnormalized dot products in QK^T saturate the softmax and limit the attention patterns that can be learned.
The models containing these encoders were trained for 10 epochs on IWSLT 2016 de→en (Cettolo et al., 2016) using the Annotated Transformer implementation, with the baseline model scoring 19.4 BLEU and the QKNORM model scoring 24.33 BLEU on the test set, computed with the SacreBLEU Python package (Post, 2018).
Though this heatmap comparison is obviously not systematic, we think the visual at least provides a plausible intuition for the incremental gain this technique achieves, with scaled dot product attention exhibiting the kind of "winner-take-all" behavior we would expect from a softmax near saturation.
In comparison to dot products, cosine similarities are bounded to [−1, 1], which creates the opposite problem as input to softmax: the differences between values are too small for softmax to let the model effectively ignore connections between words it should not attend to. Instead of dividing by √d as in scaled dot product attention, we scale up using a learnable parameter g that we initialize with a value that depends on the length of the sequences in the training data (and hence on the number of elements in QK^T):

g_0 = log_2(L^2 − L),    (3)

where L is the 97.5th percentile sequence length across all training data sequences for source and target. The attention operation thus changes from

Attention(Q, K, V) = softmax(QK^T / √d) V

to

Attention(Q, K, V) = softmax(g · Q̂K̂^T) V,    (4)

where Q̂ and K̂ are Q and K with ℓ2-normalization applied along their head dimensions and g is a learnable scalar parameter initialized with g_0 as computed in (3).

Table 2: Test BLEU (Papineni et al., 2002), scored using the Moses toolkit scripts provided in the repo for Nguyen and Salazar (2019); p < 0.01 using bootstrap resampling (Koehn, 2004). Both architectures use PRENORM and FIXNORM. The Nguyen and Salazar (2019) architecture uses SCALENORM where we instead use vanilla layer normalization (Ba et al., 2016), and scaled dot product attention where we use QKNORM.
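The initialization is straightforward to compute from the training corpus. A numpy sketch, assuming the rule g_0 = log_2(L^2 − L) with L the 97.5th-percentile sequence length (the corpus lengths below are made up for illustration):

```python
import numpy as np

def init_g(seq_lens, percentile=97.5):
    # Compute g0 = log2(L^2 - L), with L the chosen percentile of the
    # sequence lengths across all source and target training sequences.
    L = np.percentile(seq_lens, percentile)
    return np.log2(L**2 - L)

# hypothetical sequence lengths from a training corpus
lengths = [12, 40, 25, 33, 60, 18, 29, 51]
g0 = init_g(lengths)  # used to initialize the learnable scale g
```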

Experiments and Results
We follow the implementation in the repository for Nguyen and Salazar (2019), both in replicating their performance and as a starting point for our version (and also for computing BLEU as reported in Table 2).3 We train on the same 5 low-resource translation pairs as Nguyen and Salazar (2019): 4 from the TED Talks corpus (Qi et al., 2018),4 with Arabic, Slovak, and Galician translated to English and English translated to Hebrew, and 1 from the IWSLT'15 corpus (Cettolo et al., 2015), English to Vietnamese. The repository for Nguyen and Salazar (2019) provides the tokenized text they used for English to Vietnamese.
Tokenization and BLEU. Apart from BPE (Sennrich et al., 2016), their repository does not include the code they used for tokenization, so for the other 4 language pairs we used the tokenization script from the repository for Qi et al. (2018).5 The repository for Nguyen and Salazar (2019) includes two Moses6 scripts for scoring BLEU, multi-bleu.perl and multi-bleu-detok.perl. We can't use multi-bleu.perl for the 4 TED Talks pairs without being able to replicate their tokenization, because scores from that script are not comparable when there are differences in tokenization, unlike multi-bleu-detok.perl (Post, 2018). We use multi-bleu.perl to score en→vi (since we have their preprocessed text for this pair) and multi-bleu-detok.perl to score the 4 TED Talks pairs.

3 https://github.com/tnq177/Transformers_without_tears
4 http://phontron.com/data/ted_talks.tar.gz
5 https://github.com/neulab/word-embeddings-for-nmt/blob/master/ted_reader.py
6 https://github.com/moses-smt/mosesdecoder
For additional confirmation, we also score all models using SacreBLEU (Post, 2018) after detokenizing with NLTK's TreebankWordDetokenizer (Bird and Loper, 2004). These scores are reported in Table 3. All the detokenized BLEU scores from Table 2 are essentially unchanged in Table 3, with the exception of en→vi. The best scores for the baseline model we could get on en→vi were 32.48 for Moses multi-bleu.perl and 32.41 for SacreBLEU, though in Table 2 we report the multi-bleu.perl score from Nguyen and Salazar (2019), 32.79. Our model's score for the same pair comes in 0.06 BLEU lower as well.
Following the Nguyen and Salazar (2019) repository, we perform BPE using fastBPE 7 . We also use the same Moses code for bootstrap resampling (Koehn, 2004).
Model hyperparameters. Although PRENORM has been shown to make warmup less important for Transformers using scaled dot product attention (Nguyen and Salazar, 2019; Xiong et al., 2020), we obtained our best results using 8,000 steps of linear warmup. How much linear warmup matters for QKNORM and why it matters are both subjects for further investigation. We used the same validation-based decay scheme as Nguyen and Salazar (2019) and allowed models to train until they had reached the minimum learning rate. For all other model hyperparameters and preprocessing settings we followed Nguyen and Salazar (2019).

Table 3: Test BLEU (Papineni et al., 2002) for en→vi, ar→en, en→he, gl→en, and sk→en, scored using SACREBLEU (Post, 2018).

Conclusion
In this paper, we introduced a normalization technique that modifies the attention mechanism in Transformers and demonstrated its utility for low-resource bilingual translation by building it into an existing Transformer implementation with state-of-the-art performance on 5 low-resource language pairs. QKNORM improves performance for each of the 5 pairs, with an average test BLEU increase of 0.928. We pointed to possible explanations for its effectiveness, but identifying exactly where it helps and why requires further research. First, we plan to combine our approach with the fairseq Transformer implementation and apply it to the FLORES dataset (Guzmán et al., 2019), investigating the effect of QKNORM on the optimal depth, number of attention heads, and warmup schedule for low-resource translation, in combination with recent advances like BPE-dropout (Provilkov et al., 2020). Next, we plan to look at high-resource settings to see whether the benefits of query-key normalization dissipate with access to more training data.

A Varying the Number of Heads
In Table 4, we show the performance of QKNORM on the en→vi test set varying the number of heads. Even when the number of heads is 32 (with head dimension 16), the performance remains stable.

B Equation 3
Intuitively, longer sequences require more scaling to make it at least possible for the maximum values in QK^T to softmax to 1. We arrived at Equation 3 empirically by applying softmax to similarity matrices of word vectors scaled up with various heuristics. Like √d in scaled dot product attention (Vaswani et al., 2017), Equation 3 is a rule of thumb, but it initializes a learnable parameter.
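A small numpy experiment illustrates this intuition, assuming the initialization g_0 = log_2(L^2 − L) and an arbitrary sequence length: without scaling, even a maximally contrastive row of cosine similarities cannot push its softmax output near 1, while scaling by g_0 makes saturation possible.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

L = 100                        # hypothetical sequence length
sims = np.full(L, -1.0)
sims[0] = 1.0                  # best case: maximum contrast in [-1, 1]

unscaled = softmax(sims)[0]    # stays well below 1
g0 = np.log2(L**2 - L)         # Equation 3's initialization
scaled = softmax(g0 * sims)[0] # can now saturate toward 1
```

With L = 100 the unscaled maximum receives under a tenth of the probability mass, while the scaled version concentrates nearly all of it, so the model can still sharpen attention when that is useful.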
We determined the best value of L in Equation 3 by running the en→vi translation task with different percentile values. Table 5 shows the results from those experiments.

Table 6 shares test performance on en→vi when we ablate specific components of QKNORM. The biggest performance drop in these experiments comes from omitting g, the learnable scaling factor. This is unsurprising because if we don't scale up Q̂K̂^T, its values are all within [−1, 1] and softmax is a function of the differences between values.