A Mixture of h - 1 Heads is Better than h Heads

Multi-head attentive neural architectures have achieved state-of-the-art results on a variety of natural language processing tasks. Evidence has shown that they are overparameterized; attention heads can be pruned without significant performance loss. In this work, we instead “reallocate” them—the model learns to activate different heads on different inputs. Drawing connections between multi-head attention and mixture of experts, we propose the mixture of attentive experts model (MAE). MAE is trained using a block coordinate descent algorithm that alternates between updating (1) the responsibilities of the experts and (2) their parameters. Experiments on machine translation and language modeling show that MAE outperforms strong baselines on both tasks. In particular, on the WMT14 English to German translation dataset, MAE improves over “transformer-base” by 0.8 BLEU, with a comparable number of parameters. Our analysis shows that our model learns to specialize different experts to different inputs.


Introduction
The transformer architecture and its variants achieve state-of-the-art performance across a variety of NLP tasks, including machine translation (Vaswani et al., 2017; Ott et al., 2018), language modeling (Radford et al., 2018; Baevski and Auli, 2019), semantic role labeling (Strubell et al., 2018), and more (Devlin et al., 2019; Liu et al., 2019b; Yang et al., 2019b). Under the hood, multi-head attention provides the driving force: multiple separately parameterized attention functions act in parallel to contextualize the input representations; their outputs are then gathered by an affine transformation, and fed onward for further computation.1

1 Our implementation is publicly available at https://github.com/Noahs-ARK/MAE.

Figure 1: Illustration of MAE: a mixture of attentive experts. Each H_i box is an attention head in a given layer; there are h of them in total. Experts are groups of h − 1 attention heads. MAE learns an input-dependent distribution over the experts (g). At each training step, a single expert is selected and updated (solid line); during evaluation, the experts' outputs are linearly combined with weights produced by g.
Recent efforts by Voita et al. (2019) and Michel et al. (2019) suggest that typical transformer networks are overparameterized, in the sense that at test time, many of the heads, or even a full layer (Fan et al., 2020), can be removed without significant loss in performance.2 In response to this observation, they propose to prune the unimportant attention heads in the model after it is trained, aiming for faster inference.
In this paper, we ask whether, instead of reducing the model capacity, we can use it more effectively. We propose mixture of attentive experts (MAE). MAE retains all attention heads, and learns to activate different heads on different inputs (see the illustration in Figure 1). We start by showing that multi-head attention can be seen as a uniform, input-agnostic mixture of experts (Jacobs et al., 1991), by grouping a subset of attention heads as an expert (§2.2). We then introduce MAE, which, instead of uniformly weighting the experts, complements them with a learned, input-dependent function that assigns their responsibilities (§2.3). To train MAE, we propose a two-step algorithm based on block coordinate descent (§3), which alternates between updating the experts' responsibilities and their parameters.
We evaluate MAE on machine translation and language modeling (§4). Our approach outperforms strong baselines on both; on the WMT14 English to German MT dataset, MAE outperforms transformer-base (Vaswani et al., 2017) by 0.8 BLEU with a negligible increase in the number of parameters. Our analysis shows that MAE learns to encourage different experts to specialize on different inputs (§5).

MAE: Mixture of Attentive Experts
This section describes MAE in detail. It is inspired by a mixture-of-experts view of multi-head attention, which we present in §2.2. Specifically, we show that multi-head attention can be viewed as a mixture of uniformly weighted experts, each consisting of a subset of attention heads. Based on this observation, we propose MAE, which learns to weight the experts depending on the input (§2.3). We begin by laying out notation and necessary background in §2.1.

Background: Mixture of Experts
Mixture of experts is a well-established technique for ensemble learning (Jacobs et al., 1991). It jointly trains a set of expert models $\{f_i\}_{i=1}^{k}$ that are intended to specialize across different input cases. The outputs produced by the experts are aggregated by a linear combination, with a "gating function" $g = [g_1, \dots, g_k]$ determining the importance of each expert in the final decision:

$$\hat{y} = \sum_{i=1}^{k} g_i \, f_i(x). \quad (1)$$

The gating function can be parameterized by, e.g., a neural network. We will also refer to g as the responsibilities or weights of the experts.
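As a concrete illustration, Eq. 1 can be sketched in a few lines of NumPy. The linear experts and the uniform gate below are toy stand-ins for illustration, not the models used in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_output(x, experts, gate):
    """Eq. 1: combine expert outputs with gating weights g(x)."""
    g = gate(x)                                  # responsibilities, sum to 1
    outputs = np.stack([f(x) for f in experts])  # shape (k, d)
    return g @ outputs                           # weighted sum over experts

# toy experts: k linear maps of the input (hypothetical stand-ins)
k, d = 4, 3
Ws = [rng.standard_normal((d, d)) for _ in range(k)]
experts = [lambda x, W=W: x @ W for W in Ws]

def uniform_gate(x):
    # input-agnostic gate, for illustration only
    return np.full(k, 1.0 / k)

x = rng.standard_normal(d)
y = moe_output(x, experts, uniform_gate)
# with a uniform gate, the mixture reduces to a plain average
assert np.allclose(y, np.mean([f(x) for f in experts], axis=0))
```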

Multi-Head Attention: a Mixture-of-Experts Perspective
Multi-head attention is the key building block of the state-of-the-art transformer architectures (Vaswani et al., 2017). At its core are multiple separately parameterized attention heads. An attention head takes as input an n-by-d matrix X, with each row being the vector representation of an input element. It contextualizes the input using a dot-product attention mechanism:

$$H_i = \operatorname{softmax}\left(X Q_i K_i^\top X^\top\right) X V_i, \quad (2)$$

where $Q_i$, $K_i$, and $V_i$ are learned matrices,3 and the softmax normalizes row-wise. The outputs of the attention heads are then concatenated and fed through a learned affine transformation:

$$\operatorname{MultiHead}(X) = [H_1; \dots; H_h] \, W, \quad (3)$$

where W is a learned matrix, and h denotes the number of attention heads.
We now present a different but equivalent computation of Eq. 3, aiming for a smoother transition into the following sections. Let $W_i$ denote the i-th row block of W, i.e., $W = [W_1^\top, \dots, W_h^\top]^\top$, and let $\widetilde{H}_i = H_i W_i$. Then

$$\operatorname{MultiHead}(X) = \sum_{i=1}^{h} \widetilde{H}_i. \quad (4)$$

Eq. 4 provides a different view of the output computation of multi-head attention: each attention head first projects its contextualized representation with a learned matrix (i.e., $\widetilde{H}_i = H_i W_i$), then the outputs are gathered with a sum (Eq. 4).
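The equivalence between Eq. 3 (concatenate, then project) and Eq. 4 (project each head, then sum) can be checked numerically. The dimensions below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, h = 5, 16, 4          # sequence length, head dim, number of heads

Hs = [rng.standard_normal((n, d)) for _ in range(h)]   # per-head outputs H_i
W = rng.standard_normal((h * d, d))                    # output projection

# Eq. 3: concatenate the heads, then apply W
concat_then_project = np.concatenate(Hs, axis=-1) @ W

# Eq. 4: split W into row blocks W_i, project each head, then sum
W_blocks = np.split(W, h, axis=0)                      # each block is (d, d)
sum_of_projections = sum(H_i @ W_i for H_i, W_i in zip(Hs, W_blocks))

assert np.allclose(concat_then_project, sum_of_projections)
```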
We now show that this can be seen as a uniformly weighted mixture of experts.
A mixture-of-experts perspective. Let us take a closer look at Eq. 4 and rewrite it:

$$\operatorname{MultiHead}(X) = \sum_{i=1}^{h} \frac{1}{h}\, f_i(X; \theta_i), \qquad f_i(X; \theta_i) = \frac{h}{h-1} \sum_{j \ne i} \widetilde{H}_j. \quad (5)$$

Eq. 5 interprets multi-head attention as a mixture of $\binom{h}{h-1} = h$ experts. It first constructs a set of h experts $\{f_i(\cdot\,; \theta_i)\}$, with $\theta_i$ denoting $f_i$'s parameters. $f_i(\cdot\,; \theta_i)$ is a parameterized function of the input, which calculates a sum of the outputs of all but the i-th attention head. This is achieved by subtracting $\widetilde{H}_i$ from $\sum_{j=1}^{h} \widetilde{H}_j$, then scaling up the result by h/(h − 1). The experts share part of their parameters: any two share h − 2 attention heads. A uniform responsibility of 1/h is used.
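The identity behind Eq. 5, namely that a uniform mixture of the leave-one-out experts reproduces the standard multi-head output, can be verified directly:

```python
import numpy as np

rng = np.random.default_rng(2)
h, n, d = 8, 5, 16
H_tilde = [rng.standard_normal((n, d)) for _ in range(h)]  # the H_i W_i terms

total = sum(H_tilde)                 # Eq. 4: standard multi-head output

def expert(i):
    # expert i: all heads but the i-th, rescaled by h / (h - 1)  (Eq. 5)
    return (h / (h - 1)) * (total - H_tilde[i])

# uniform responsibility 1/h for each of the h experts
uniform_mixture = sum(expert(i) for i in range(h)) / h
assert np.allclose(uniform_mixture, total)   # the two computations coincide
```

The check works because summing the leave-one-out sums counts every head exactly h − 1 times, and the h/(h − 1) rescaling cancels that factor.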
Discussion. Viewing multi-head attention through this MoE lens suggests some interesting consequences. One can replace the input-agnostic responsibility in Eq. 5 with a function over the input. Indeed, we have good reasons for doing so. Voita et al. (2019) and Michel et al. (2019) show that for transformer networks, a handful of important attention heads are sufficient to achieve good test-time performance. They propose to prune the rest using an input-agnostic procedure. Instead of doing so, here we see a potential alternative: keep all the heads, but only activate those that are important to the input. This motivates MAE, which we now introduce.

MAE: Learning to Weight Experts
MAE is inspired by the connections between MoE and multi-head attention we draw in §2.2. On top of multi-head attention, MAE learns an input-dependent parameterized gating function g(·; φ) to complement the experts. More formally, the uniform responsibility 1/h in Eq. 5 is replaced by g(·; φ): given input X, MAE outputs

$$\operatorname{MAE}(X) = \sum_{i=1}^{h} g_i(X; \phi)\, f_i(X; \theta_i). \quad (6)$$

The experts $f_i$ are the same as those in Eq. 5. g(·; φ) is parameterized with a multi-layer perceptron (MLP) followed by a softmax. It first averages X along the rows (i.e., the sequence direction), and then feeds the result through a two-layer tanh-MLP. g(·; φ) outputs a normalized h-dimensional vector using a softmax, indicating the responsibilities of the experts. It can be seen as a learned probability distribution over the experts.
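A minimal sketch of the gating function follows; the hidden size and the randomly initialized parameters are placeholders for illustration, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(3)
h, n, d, d_hidden = 8, 5, 16, 32   # d_hidden is an assumed hidden size

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# hypothetical MLP parameters phi
W1 = rng.standard_normal((d, d_hidden))
b1 = np.zeros(d_hidden)
W2 = rng.standard_normal((d_hidden, h))
b2 = np.zeros(h)

def gate(X):
    """g(X; phi): average over the sequence, two-layer tanh-MLP, softmax."""
    x_bar = X.mean(axis=0)            # average along the rows of X
    hidden = np.tanh(x_bar @ W1 + b1)
    return softmax(hidden @ W2 + b2)  # h expert responsibilities

X = rng.standard_normal((n, d))
g = gate(X)
# a valid probability distribution over the h experts
assert g.shape == (h,) and np.isclose(g.sum(), 1.0) and (g >= 0).all()
```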
MAE can learn to assign more responsibility to the experts that are more important to the given input, allowing them to contribute more. MAE is applicable wherever multi-head attention is used. For example, in a machine translation experiment (§4.2), we replace with MAE all the multi-head attention in a transformer network, including the self-attention in all encoder and decoder layers, as well as the attention over the encoded source from the decoder. Each of them is separately treated as a mixture of experts, and has its own gating function. The additional parameter overhead is small: gating functions account for only 3–5% of the parameters of the full model (Appendix A).

Training MAE with Block Coordinate Descent
It is straightforward to jointly train the experts and the gating functions in an MAE model using backpropagation. However, in line with previous observations (Shen et al., 2019), we empirically observe that this is prone to degenerate solutions where the gating functions tend to learn to weight the experts similarly (see §5.1).4 As a remedy, we propose block coordinate descent (BCD) training. At a high level, training is decomposed into two interleaving steps: a G step updates the gating function g(·; φ), fixing the experts; an F step fixes the gating function and updates one randomly selected expert f_i(·; θ_i).5 The computations for G and F steps differ:

• In a G step, MAE outputs a linear combination of the experts' outputs, and only updates the gating function's parameters (Algorithm 1). No expert is updated.

• An F step computes the experts' responsibilities g(X), according to which an expert i is then sampled (Algorithm 2). MAE computes the output with f_i, which is then updated, without updating the gating function or other experts.6

Algorithm 1: A G step update for MAE, with step size η.
1: procedure GSTEP(X, η)
2:   Compute responsibilities g ← g(X; φ).
3:   Z ← Σ_i g_i f_i(X; θ_i).
4:   Forwardprop with Z and calculate L.
5:   Calculate ∇_φ L with backprop.
6:   φ ← φ − η ∇_φ L.
7: end procedure

Algorithm 2: An F step update for MAE, with step size η.
1: procedure FSTEP(X, η)
2:   Compute responsibilities g ← g(X; φ).
3:   Sample i ∼ g; Z ← f_i(X; θ_i).
4:   Forwardprop with Z and calculate L.
5:   Calculate ∇_θ_i L with backprop.
6:   θ_i ← θ_i − η ∇_θ_i L.
7: end procedure

A non-differentiable sampling from g is involved in F steps. It does not create difficulties for the backpropagation, since an F step never calculates the gradients w.r.t. φ. At test time, the computation is the same as that in a G step, i.e., MAE outputs a linear combination of the experts, weighted by g.
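To make the alternation concrete, here is a self-contained toy version of BCD on a 1-D regression problem. The scalar experts, the input-agnostic gate, the manual gradients, and the G-step frequency are all simplifications for illustration; in the paper the gate is input-dependent and training backpropagates through a transformer:

```python
import random
import math

random.seed(0)

# Toy setup: scalar experts f_i(x) = w_i * x; for simplicity the gate is
# input-agnostic here (a softmax over learned logits z).
h = 4
w = [random.gauss(0, 1) for _ in range(h)]   # expert parameters (theta)
z = [0.0] * h                                # gate logits (phi)
eta = 0.05                                   # step size

def gate():
    m = max(z)
    e = [math.exp(zi - m) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]

def g_step(x, y):
    """G step: update only phi, on the gated mixture of frozen experts."""
    g = gate()
    outs = [wi * x for wi in w]
    pred = sum(gi * oi for gi, oi in zip(g, outs))
    err = pred - y                           # d(0.5 err^2)/d pred
    for j in range(h):
        # softmax gradient: d pred / d z_j = g_j * (outs_j - pred)
        z[j] -= eta * err * g[j] * (outs[j] - pred)

def f_step(x, y):
    """F step: sample one expert from g and update only its parameters."""
    i = random.choices(range(h), weights=gate())[0]
    err = w[i] * x - y
    w[i] -= eta * err * x                    # gradient of 0.5 * err^2

# Alternate F steps with less frequent G steps on data from y = 2x.
for step in range(2000):
    x = random.uniform(-1, 1)
    y = 2.0 * x
    f_step(x, y)
    if step % 5 == 0:                        # G steps taken less often
        g_step(x, y)

mixture_slope = sum(gi * wi for gi, wi in zip(gate(), w))
assert abs(mixture_slope - 2.0) < 0.2        # the mixture recovers y = 2x
```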

Training time overhead.
A straightforward training procedure is to, for each training instance, first take a G step, and then an F step. This doubles the forward propagation overhead. In practice, it is not necessary to take G steps as frequently as F steps, since they only update a small portion of the model. In the experiments, we take G steps one fifth as frequently as F steps: we make G updates once every 5 epochs, while taking F steps in every epoch. In preliminary experiments, we find this reduces the training time overhead without significant impact on the performance.7 Algorithm 3 summarizes the block coordinate descent training in a given epoch.

Connections to dropout.
In the above block coordinate descent training algorithm, an F step samples an expert to update, and ignores the rest in both the forward and backward computation. This is reminiscent of dropout (Srivastava et al., 2014). Specifically, selecting expert f_i is equivalent to dropping head i.9 In other words, the F steps (Algorithm 2) can be seen as a structured dropout applied to the attention heads, but with learned, input-dependent drop probabilities. When g is a constant vector with elements 1/h, it recovers head dropout, which is also explored by concurrent work (Fan et al., 2020).
So far, we have viewed MAE as a mixture of h experts, each consisting of h − 1 attention heads. One can, of course, generalize this to other settings, e.g., mixing $\binom{h}{h-2}$ experts, each containing h − 2 heads. From the dropout view, this translates to dropping more attention heads: dropping t heads out of h is equivalent to applying a dropout with drop probability t/h, in the sense that their expected numbers of dropped units are the same.
Despite the similarity between MAE and dropout, a key difference exists between the two: with the latter, the constant dropout probability is set a priori, while MAE uses a gating function g(•; φ) to calculate a learned, input-dependent dropout probability.
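The expectation argument can be checked by enumerating all ways of dropping t of h heads; with t = 2 and h = 8 this enumeration also recovers the 28 experts of the MAE-6 setting below:

```python
from itertools import combinations

h, t = 8, 2   # drop t heads out of h

# Structured head dropout: choose t heads to drop, uniformly at random.
subsets = list(combinations(range(h), t))
assert len(subsets) == 28          # binom(8, 2) = 28 possible experts

# By symmetry, the probability that any particular head is dropped is t / h,
# matching a standard dropout with rate t / h in expectation.
drop_prob = sum(1 for s in subsets for i in s) / (len(subsets) * h)
assert drop_prob == t / h
```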

Experiments

MAE is evaluated under two settings:
• MAE-7 mixes 8 experts each with 7 attention heads.
• MAE-6 is similar to MAE-7, but mixes $\binom{8}{2} = 28$ experts, each with 6 attention heads.10

We compare MAE to the following baselines:
• BASE is a sequence-to-sequence model based on the transformer architecture.

• UNI-MAE-6 mixes 28 6-attention-head experts, and is otherwise the same as UNI-MAE-7.

We refer the readers to Appendix A for implementation details.

Machine Translation
Datasets. We experiment with two machine translation datasets:

• WMT14 EN-DE (Bojar et al., 2014).11 Following previous practice (Vaswani et al., 2017), we train on WMT14, and designate newstest2013 and newstest2014 as development and test data respectively. Our preprocessing follows that of Vaswani et al. (2017) and Ott et al. (2018). A shared source-target vocabulary is used, with 32K byte pair encoding types (BPE; Sennrich et al., 2016).

• IWSLT14 DE-EN (Cettolo et al., 2014).12 It is based on TED talks, and is much smaller compared to WMT14. We use the preprocessing from Edunov et al. (2018). Following previous practice, we use separate vocabularies for the source and target, with around 9K and 7K BPE types respectively.

Table 1 summarizes some statistics of the datasets.

10 Preliminary results show that mixing experts with fewer heads leads to underwhelming performance. We conjecture this is due to too strong a regularization effect (§3).

Evaluation. The models are evaluated using BLEU (Papineni et al., 2002). A beam search with beam size 5 is used. In the WMT14 experiments, we follow Vaswani et al. (2017), and apply a compound split postprocessing.13

Results. Table 2 summarizes WMT14 EN-DE translation test performance. The base and large sized transformer models are due to Vaswani et al. (2017). To control for compounding factors, we additionally compare to our implementation of the base sized model (BASE). It achieves slightly better performance than Vaswani et al. (2017), with a 0.3 BLEU edge. MAE-7 improves over the base transformer by 0.8 BLEU, obtaining similar performance to the large-size transformer of Vaswani et al.
(2017) using less than a third as many parameters. Since we do not see a similar improvement from UNI-MAE-7, we attribute this gain to input-dependent expert weighting. Having a smaller number of heads per expert, MAE-6 slightly underperforms MAE-7, as does UNI-MAE-6 in comparison to UNI-MAE-7. Finally, NOBCD performs worse than the transformer baseline, demonstrating the importance of the block coordinate descent training.
We observe similar trends on the IWSLT14 DE-EN dataset, summarized in Table 3. The BASE model here is similar to the base-sized transformer in the WMT14 experiment, but with a smaller hidden dimension. MAE-7 outperforms BASE by 0.9 BLEU. Interestingly, UNI-MAE-7 improves over BASE by 0.3 BLEU, possibly because the regularization effect of random expert selection training helps more on this smaller dataset.14
Language Modeling

We are not able to train MAE under the setting of Baevski and Auli (2019), nor replicate their results: our GPUs have far less memory, and it is impossible to even load a 3,072-token context chunk.15 Therefore we train and evaluate MAE and UNI-MAE-7 with smaller 512/480 context sizes, also explored by Baevski and Auli (2019), which allows for a head-to-head comparison.
Results. Table 4 shows the perplexity on the WikiText-103 test data. When trained under the same setting, MAE outperforms Baevski and Auli (2019) by more than 0.3 perplexity. Interestingly, despite the much smaller context at both training and test time, MAE matches the best setting of Baevski and Auli (2019). UNI-MAE-7 and NOBCD underperform the baseline (higher perplexity).

Analysis
This section first empirically confirms that MAE learns to activate different experts on different inputs (§5.1). We then run a synthetic experiment to explore MAE's potential in transfer learning (§5.2).

MAE is designed to activate different experts on different inputs. Does it learn to do so? We empirically study this question, and present evidence indicating that it does, at least in part. We consider the encoders of the UNI-MAE-7, NOBCD, and MAE-7 models trained on WMT14.16

We first study whether BCD training helps drift MAE away from uniformly weighting the experts agnostic to the inputs. We treat the gating values as probabilities, and calculate their entropies: $H(g) = -\sum_{i=1}^{h} g_i \log g_i$, which are then averaged across the layers. The average entropy on the development set for MAE-7 is 1.91, lower than the 2.02 of the NOBCD model trained without BCD. In comparison, UNI-MAE-7 uniformly weights the experts and has an entropy of 2.08. This indicates that the gating weights of MAE trained with BCD are more "focused" on one or a subset of experts than those trained without.
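The entropy computation used here is straightforward; the gating vectors below are invented examples, but note that a uniform gate over 8 experts gives ln 8 ≈ 2.08, matching the UNI-MAE-7 number above:

```python
import math

def entropy(g):
    """H(g) = -sum_i g_i log g_i, in nats."""
    return -sum(gi * math.log(gi) for gi in g if gi > 0)

h = 8
uniform = [1.0 / h] * h          # input-agnostic, uniform gate
peaked = [0.79] + [0.03] * 7     # a hypothetical "focused" gate

# uniform weights attain the maximum entropy, log h
assert abs(entropy(uniform) - math.log(h)) < 1e-12
# more "focused" gating weights have lower entropy
assert entropy(peaked) < entropy(uniform)
```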

Does MAE Learn to Activate Different Experts on Different Inputs?
Second, we study whether MAE learns to specialize different experts to different inputs. To do so, we attribute the development instances to the experts that maximize the gating weights. For the first encoder layer of MAE-7, the percentages of instances attributed to each of the 8 experts are relatively balanced: 13%, 14%, 9%, 16%, 10%, 15%, 10%, 12%.17 This suggests that all experts are assigned a substantial part of the input, and it is not the case that BCD leads to a "rich get richer" outcome.
We then continue and explore whether MAE performs reasonably well when using only the most "specialized" experts. For each development instance, we select the expert maximizing the gating weights and ignore the rest, instead of linearly combining them as in Eq. 6. We see from Table 5 a 0.3 BLEU decrease under this setting. In comparison, NOBCD has a larger performance decrease of 0.7 BLEU. NOBCD's performance drop is similar to that of UNI-MAE-7, for which we randomly select an expert at each layer and average the performance over 5 runs. These results support the proposition that MAE specializes better when trained with BCD.

Finally, we search for the tokens that are most likely to activate each expert. We compute the pointwise mutual information (PMI; Church and Hanks, 1990) between tokens and experts:

$$\operatorname{PMI}(\text{token}_i, \text{expert}_j) = \log \frac{p(\text{token}_i, \text{expert}_j)}{p(\text{token}_i)\, p(\text{expert}_j)}.$$
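A sketch of the PMI computation over token-expert co-occurrence counts; the counts below are invented for illustration:

```python
import math

# hypothetical token-expert co-occurrence counts: (token, expert) -> count
counts = {
    ("reuters", 2): 30, ("reuters", 0): 2,
    ("the", 2): 50, ("the", 0): 55,
}
total = sum(counts.values())

def p(pred):
    # marginal/joint probability of all (token, expert) pairs matching pred
    return sum(c for key, c in counts.items() if pred(key)) / total

def pmi(token, expert):
    p_joint = counts.get((token, expert), 0) / total
    p_tok = p(lambda k: k[0] == token)
    p_exp = p(lambda k: k[1] == expert)
    return math.log(p_joint / (p_tok * p_exp))

# "reuters" co-occurs disproportionately with expert 2: positive PMI,
# while a frequent, evenly spread token like "the" scores lower
assert pmi("reuters", 2) > 0
assert pmi("the", 2) < pmi("reuters", 2)
```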
Table 6 lists the most indicative tokens for each expert, in the first layer. While some of the terms for some experts seem loosely related (e.g., bell, reuters, and computing for expert 2), it is hard to find clear patterns in most of them.

MAE's Potential in Transfer Learning: A Case Study
We now turn to evaluate another property of MAE: its potential for data-efficient transfer learning, by only updating the gating functions and freezing the experts. We consider the pretrain-then-finetune setting. Due to computation limits, we are unable to explore MAE for pretraining contextual representations (Peters et al., 2018; Devlin et al., 2019). Rather, we focus on the following small-scale machine translation experiments.
Setting. We explore finetuning on IWSLT14 a model pretrained on the much larger WMT14 dataset.18 We compare three finetuning methods:

• FTG finetunes the gating functions' parameters (i.e., φ), keeping the rest frozen.

• FTG+ updates the parameter matrix W in Eq. 4 in addition to φ. The rest of the model parameters are fixed.
• FTALL updates all parameters.

As a baseline, NOFT is the out-of-box pretrained model without any finetuning. SCRATCH trains an MAE model from scratch.
Table 7 summarizes the IWSLT14 EN-DE development set performance. Surprisingly, NOFT already outperforms SCRATCH without any finetuning. We attribute this improvement to the larger pretraining (WMT14) data. Only updating the gating functions, FTG improves over NOFT by 0.8 BLEU. Yet there is still a significant gap of 1.8 BLEU between FTG and FTALL. Interestingly, FTG+ almost matches the performance of FTALL, but updates only 1/9 as many parameters. Both FTG and FTG+ reach their best performance after around 1K gradient updates, i.e., one epoch, significantly fewer than FTALL or SCRATCH.
We further compare FTG+ and FTALL where less downstream training data is available.

Related Work
Multi-head attention. An increasing amount of effort has been devoted to developing better attention mechanisms (Malaviya et al., 2018; Deng et al., 2018; Sukhbaatar et al., 2019; Correia et al., 2019; Maruf et al., 2019, inter alia), and improving transformer architectures (Shaw et al., 2018; Dehghani et al., 2019; Hao et al., 2019; Correia et al., 2019; Yang et al., 2019a, inter alia). Closely related, Iida et al. (2019) apply another attention mechanism over the attention heads, allowing a learned reweighting of them. Our work focuses on the connection between multi-head attention and MoE, and the BCD training it suggests and benefits from. Concurrent to our work, Fan et al. (2020) study structurally pruning transformer layers for more efficient inference.
Another line of work aims to better understand the inner workings of transformer models (Clark et al., 2019; Liu et al., 2019a; Tenney et al., 2019, inter alia).
Mixture of experts. One of the most successful applications of MoE is ensemble learning (Caruana et al., 2004; Liu et al., 2018; Dutt et al., 2017, inter alia). Recent efforts also explore MoE in sequence learning (Shazeer et al., 2017), and to promote diversity in text generation (He et al., 2018; Shen et al., 2019; Cho et al., 2019, inter alia).

Conclusion
We presented MAE. It is inspired by a mixture-of-experts perspective of multi-head attention. With a learned gating function, MAE activates different experts on different inputs. MAE is trained using a block coordinate descent algorithm, which alternates between updating the responsibilities of the experts and their parameters. Our experiments show that MAE outperforms the transformer baselines on machine translation and language modeling benchmarks. Our analysis shows that MAE learns to activate different experts. The code is publicly available at https://github.com/Noahs-ARK/MAE.

layer. No weight decay is used. φ is updated using SGD with a fixed learning rate of 1, separate from the optimizer used for the rest of the model. This aims to avoid using momentum-based optimization algorithms (e.g., Adam) for the gating functions, which we empirically find helps alleviate the "rich gets richer" degeneracy.23 In the language modeling experiments, the most recent 100 input vectors are averaged and then fed into the gating functions, while in machine translation we average all the input vectors as the input to g(·; φ).

B Learning Curve Comparison for MAE and NOBCD
In §3 (footnote 4) we discuss an overfitting issue arising from jointly updating the experts and the gating function. This section empirically studies it. We compare the learning curves of BASE, NOBCD, and MAE-7 trained on the IWSLT14 dataset, plotted in Figure 3. The models are described in §4.1. We tune dropout and ℓ2 regularization based on development performance. Other hyperparameters are the same for the compared models.
The training loss for NOBCD decreases much faster than that of BASE; however, on the development set, it never outperforms BASE, and the development loss starts increasing after epoch 40. MAE-7 finds a nice middle ground in terms of training loss. It outperforms both BASE and NOBCD on the validation set. This provides further evidence for the importance of BCD training.
C Additional Results for §5.1

§5.1 describes an experiment with the MAE-7 model where we attribute the development instances of WMT14 to the experts maximizing the gating weights. Table 8 presents more results. The number of instances each expert receives is relatively balanced, and the trend is consistent across different layers.

23 It is not entirely clear to us why using momentum-based optimization algorithms to learn the gating functions leads to degenerate solutions more often. One possible reason is that the accumulated momentum steers the gating functions toward keeping selecting the experts they pick at the early stage of training.

Algorithm 3: Block coordinate descent (BCD) training for MAE, at epoch e. D denotes the training data.

• NOBCD is the same model as MAE, but does not use block coordinate descent training. Instead, it jointly updates all experts and the gating function at training time, as discussed at the start of §3.

• UNI-MAE-7 is similar to MAE but does not have parameterized gating functions. It builds on BASE, and mixes 8 experts, each with 7 attention heads. Constant uniform responsibilities are assigned to the experts. At each training step, it updates one uniformly sampled expert; at test time, the outputs of all experts are averaged according to Eq. 5.

Table 1 :
Some statistics for WMT14 and IWSLT14 datasets.We use separate source and target vocabularies in IWSLT14 experiments.

Table 2 :
WMT14 EN-DE translation test performance on newstest2014. † randomly selects an expert to update for each training instance, and ‡ learns a gating function to weight the experts. Transformer performance in the first two rows is due to Vaswani et al. (2017).

Table 3 :
IWSLT14 DE-EN test set performance. See Table 2 caption for the indications of the superscripts.

Table 4: WikiText-103 test set perplexity. The first two rows are due to Baevski and Auli (2019, Table 5); their best setting uses a 3,072-token training context size, and 2,048 at test time (i.e., the model has access to 2,048 tokens before predicting any token at test time). MAE and UNI-MAE-7 use smaller 512/480 sized contexts. See Table 2 caption for the indications of other superscripts. Bold font indicates the best performance using smaller context sizes.

Table 6 :
Indicative tokens for each expert (§5.1). Tokens attributed to expert 2 are mostly computer science terminology; trends for other experts are less clear.
18 Here we reverse the translation direction of IWSLT14: §4.2 experimented with DE-EN; here we use EN-DE.

Table 8 :
The percentage of WMT14 development instances attributed to each of the experts in MAE-7's encoder layers (§5.1).

Figure 3: Learning curves of BASE, NOBCD, and MAE-7 (§B), trained on IWSLT14 EN-DE using the same setup. NOBCD quickly fits the training data, but does not outperform BASE on the validation set. Trained with BCD, MAE finds a nice middle ground. For better readability, the x-axis starts at epoch 8.