Efficiency through Auto-Sizing: Notre Dame NLP's Submission to the WNGT 2019 Efficiency Task

This paper describes the Notre Dame Natural Language Processing Group's (NDNLP) submission to the WNGT 2019 shared task (Hayashi et al., 2019). We investigated the impact of applying auto-sizing (Murray and Chiang, 2015; Murray et al., 2019) to the Transformer network (Vaswani et al., 2017) with the goal of substantially reducing the number of parameters in the model. Our method eliminated more than 25% of the model's parameters at a cost of only 1.1 BLEU.


Introduction
The Transformer network (Vaswani et al., 2017) is a neural sequence-to-sequence model that has achieved state-of-the-art results in machine translation. However, Transformer models tend to be very large, typically consisting of hundreds of millions of parameters. As the number of parameters directly corresponds to secondary storage requirements and memory consumption during inference, using Transformer networks may be prohibitively expensive in scenarios with constrained resources. For the 2019 Workshop on Neural Generation of Text (WNGT) Efficiency shared task (Hayashi et al., 2019), the Notre Dame Natural Language Processing (NDNLP) group looked at a method of inducing sparsity in parameters called auto-sizing in order to reduce the number of parameters in the Transformer at the cost of a relatively minimal drop in performance.
Auto-sizing, first introduced by Murray and Chiang (2015), uses group regularizers to encourage parameter sparsity. When applied over neurons, it can delete neurons in a network and shrink the total number of parameters. A nice advantage of auto-sizing is that it is independent of model architecture; although we apply it to the Transformer network in this task, it can easily be applied to any other neural architecture.
NDNLP's submission to the 2019 WNGT Efficiency shared task uses a standard, recommended baseline Transformer network. Following Murray et al. (2019), we investigate the application of auto-sizing to various portions of the network. Our setting differs from theirs in two ways: the shared task uses a significantly larger training dataset from WMT 2014 (Bojar et al., 2014), and its goal is to reduce model size even if doing so hurts translation performance. Our best system pruned over 25% of the parameters, yet had a BLEU drop of only 1.1 points. This corresponds to over 25 million parameters pruned and saves almost 100 megabytes of the disk space needed to store the model.

Auto-sizing
Auto-sizing is a method that encourages sparsity through use of a group regularizer. Whereas the most common applications of regularization act over parameters individually, a group regularizer works over groupings of parameters. For instance, applying a sparsity-inducing regularizer to a two-dimensional parameter tensor will encourage individual values to be driven to 0.0. A sparsity-inducing group regularizer instead acts over defined sub-structures, such as entire rows or columns, driving whole groups to zero. Depending on model specifications, one row or column of a tensor in a neural network can correspond to one neuron in the model.
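The difference between an element-wise penalty and a group penalty can be illustrated with a small sketch. It assumes the groups are the rows of a matrix and uses the two group norms named later in the paper ($\ell_{2,1}$ and $\ell_{\infty,1}$); the function names and the toy matrix are illustrative only and are not taken from the authors' toolkit.

```python
import math

def l1_norm(matrix):
    """Element-wise L1 penalty: every weight is penalized on its own."""
    return sum(abs(w) for row in matrix for w in row)

def l21_norm(matrix):
    """Group penalty: sum over rows of each row's Euclidean (L2) norm."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in matrix)

def linf1_norm(matrix):
    """Group penalty: sum over rows of each row's largest absolute value."""
    return sum(max(abs(w) for w in row) for row in matrix)

# Hypothetical 3x2 weight matrix; each row plays the role of one neuron.
W = [
    [3.0, 4.0],
    [0.0, 0.0],   # an entire row at zero: this neuron could be deleted
    [-1.0, 0.0],
]

print(l1_norm(W))     # 8.0
print(l21_norm(W))    # 6.0
print(linf1_norm(W))  # 5.0
```

A row contributes nothing to either group penalty only when all of its entries are zero, which is why minimizing these penalties tends to remove whole rows rather than scattered individual weights.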
Following the discussion of Murray and Chiang (2015) and Murray et al. (2019), auto-sizing works by training a neural network while using a regularizer to prune units from the network, minimizing:
$$\mathcal{L} = -\sum_{(f, e) \in \text{data}} \log P(e \mid f; W) + \lambda R(W)$$
where W are the parameters of the model, R is a regularizer, and λ is the regularization coefficient. Here, as with the previous work, we experiment with two regularizers:
$$R(W) = \sum_i \Big( \sum_j W_{ij}^2 \Big)^{1/2} \quad (\ell_{2,1})$$
$$R(W) = \sum_i \max_j |W_{ij}| \quad (\ell_{\infty,1})$$
The optimization is done using proximal gradient descent (Parikh and Boyd, 2014), which alternates between stochastic gradient descent steps and proximal steps, with η the learning rate:
$$W \leftarrow W - \eta \nabla_W \big( -\log P(e \mid f; W) \big)$$
$$W \leftarrow \arg\min_{W'} \Big( \frac{1}{2\eta} \lVert W - W' \rVert^2 + \lambda R(W') \Big)$$
Figure 1: Architecture of the Transformer (Vaswani et al., 2017). We apply the auto-sizing method to the feed-forward (blue rectangles) and multi-head attention (orange rectangles) in all N layers of the encoder and decoder. Note that there are residual connections that can allow information and gradients to bypass any layer we are auto-sizing. Following the robustness recommendations, we instead apply layer normalization before each sub-component.

Auto-sizing the Transformer
The Transformer network (Vaswani et al., 2017) is a sequence-to-sequence model in which both the encoder and the decoder consist of stacked self-attention layers. The multi-head attention uses two affine transformations, followed by a softmax layer. Each layer has a position-wise feed-forward neural network (FFN) with a hidden layer of rectified linear units. Both the multi-head attention and the feed-forward neural network have residual connections that allow information to bypass those layers. In addition, there are also word and position embeddings. Figure 1, taken from the original paper, shows the architecture. NDNLP's submission focuses on the N stacked encoder and decoder layers.
The Transformer has demonstrated remarkable success on a variety of datasets, but it is highly over-parameterized. For example, the baseline Transformer model has more than 98 million parameters, but the English portion of the training data in this shared task has only 116 million tokens and 816 thousand types. Early NMT models such as Sutskever et al. (2014) have most of their parameters in the embedding layers, but the Transformer has a larger percentage of the model in the actual encoder and decoder layers. Though the group regularizers of auto-sizing can be applied to any parameter matrix, here we focus on the parameter matrices within the encoder and decoder layers.
We note that there has been some recent work on shrinking networks through pruning. However, these methods differ from auto-sizing in that they frequently require an arbitrary threshold and are not part of the training process. For instance, See et al. (2016) prune networks based on a variety of thresholds and then retrain the model. Voita et al. (2019) also look at pruning, but of attention heads specifically. They do this through a relaxation of an $\ell_0$ regularizer that makes it differentiable, which removes the need for a proximal step. This method also starts with a pretrained model and then continues training. Michel et al. (2019) also look at pruning attention heads in the Transformer. They too use thresholding, but apply it only at test time. Auto-sizing requires neither a thresholding value nor a pre-trained model.
Of particular interest are the large, position-wise feed-forward networks in each encoder and decoder layer:
$$\mathrm{FFN}(x) = W_2 \max(0, W_1 x + b_1) + b_2$$
$W_1$ and $W_2$ are two large affine transformations that take inputs from D dimensions to 4D, then project them back to D again. These layers make use of rectified linear unit activations, which were the focus of auto-sizing in the work of Murray and Chiang (2015). No theory or intuition is given as to why this value of 4D should be used.
Figure 2: Auto-sizing FFN network. For a row in the parameter matrix $W_1$ that has been driven completely to 0.0 (shown in white), the corresponding column in $W_2$ (shown in blue) no longer has any impact on the model. Both the column and the row can be deleted, thereby shrinking the model.
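The row-and-column deletion described for the feed-forward network can be checked on a toy example. The sketch below is a minimal illustration, not the authors' code: it assumes a hidden unit corresponds to one row of W1 and one column of W2, and, as a simplification the paper does not discuss, it treats a unit as removable only when its bias is zero as well. All names and values are hypothetical.

```python
def relu_ffn(x, W1, b1, W2, b2):
    """Compute W2 * max(0, W1 x + b1) + b2 for a single input vector x."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(W2, b2)
    ]

def prune_dead_units(W1, b1, W2):
    """Drop hidden units whose W1 row and bias are all zero, plus the matching W2 columns."""
    keep = [
        i for i, (row, b) in enumerate(zip(W1, b1))
        if any(w != 0.0 for w in row) or b != 0.0
    ]
    W1_small = [W1[i] for i in keep]
    b1_small = [b1[i] for i in keep]
    W2_small = [[row[i] for i in keep] for row in W2]
    return W1_small, b1_small, W2_small

# Toy network: input size 2, hidden size 3; the middle hidden unit has been zeroed.
W1 = [[1.0, -1.0], [0.0, 0.0], [0.5, 2.0]]
b1 = [0.0, 0.0, 0.1]
W2 = [[1.0, 7.0, -1.0], [2.0, -3.0, 0.5]]
b2 = [0.0, 1.0]
x = [1.0, 2.0]

W1_small, b1_small, W2_small = prune_dead_units(W1, b1, W2)
full_output = relu_ffn(x, W1, b1, W2, b2)
small_output = relu_ffn(x, W1_small, b1_small, W2_small, b2)
print(len(W1), "->", len(W1_small), "hidden units")
print(full_output, small_output)
```

The two outputs agree because the zeroed unit always produces an activation of 0, so the weights in its W2 column are multiplied by 0 regardless of their values.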
Following Murray et al. (2019), we apply the auto-sizing method to the Transformer network, focusing on the two largest components, the feed-forward layers and the multi-head attentions (blue and orange rectangles in Figure 1). Because residual connections allow information to bypass the layers we are auto-sizing, information can still flow through the network even if the regularizer drives all the neurons in a layer to zero, effectively pruning out an entire layer.

Experiments
All of our models are trained using the fairseq implementation of the Transformer (Gehring et al., 2017). For the regularizers used in auto-sizing, we make use of an open-source proximal gradient toolkit implemented in PyTorch (Murray et al., 2019). For each mini-batch update, the stochastic gradient descent step is handled with a standard PyTorch forward-backward call. Then the proximal step is applied to parameter matrices.
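The two-phase update described above can be sketched in plain Python for the $\ell_{2,1}$ case. This is an illustration under stated assumptions, not the toolkit's implementation: it uses the standard closed-form proximal operator for a row-wise L2 penalty (group soft-thresholding), takes the threshold to be the learning rate times λ, and uses plain SGD where the experiments use Adam. The function names and numbers are hypothetical.

```python
import math

def sgd_step(W, grad, lr):
    """Plain gradient step on the unregularized loss."""
    return [
        [w - lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(W, grad)
    ]

def prox_l21(W, threshold):
    """Group soft-thresholding: shrink each row's L2 norm by `threshold`.

    Rows whose norm is at most `threshold` become exactly zero.
    """
    result = []
    for row in W:
        norm = math.sqrt(sum(w * w for w in row))
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0.0 else 0.0
        result.append([scale * w for w in row])
    return result

def proximal_update(W, grad, lr, lam):
    """One mini-batch update: gradient step, then proximal step."""
    return prox_l21(sgd_step(W, grad, lr), lr * lam)

# Toy values: the first row is large, the second row is small.
W = [[3.0, 4.0], [0.3, 0.4]]
grad = [[-3.0, -4.0], [0.0, 0.0]]
W_new = proximal_update(W, grad, lr=1.0, lam=1.0)
print(W_new)
```

Here the first row is scaled down slightly while the second row, whose norm is below the threshold, is set to exactly zero. No separate pruning threshold is chosen by hand; the zeroing falls out of the proximal step itself.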

Settings
We used the originally proposed Transformer architecture with six encoder and six decoder layers. Our model dimension was 512 and we used 8 attention heads. The feed-forward network sub-components were of size 2048. All of our systems were run using subword units (BPE) with 32,000 merge operations on concatenated source and target training data (Sennrich and Haddow, 2016). We clipped norms at 0.1, used label-smoothed cross-entropy with value 0.1, and stopped training early once the learning rate fell below $10^{-5}$. We used the Adam optimizer (Kingma and Ba, 2015), a learning rate of $10^{-4}$, and dropout of 0.1. Following recommendations in the fairseq and tensor2tensor (Vaswani et al., 2018) code bases, we applied layer normalization before a sub-component as opposed to after. At test time, we decoded using a beam of 5 with length normalization (Boulanger-Lewandowski et al., 2013) and evaluated using case-sensitive, tokenized BLEU (Papineni et al., 2002).
For the auto-sizing experiments, we looked at both $\ell_{2,1}$ and $\ell_{\infty,1}$ regularizers. We experimented over a range of regularizer coefficient strengths, λ, that control how large the proximal gradient step will be. Similar to Murray and Chiang (2015), but differing from Alvarez and Salzmann (2016), we use one value of λ for all parameter matrices in the network. We note that different regularization coefficient values are suited for different types of regularizers. Additionally, all of our experiments use the same batch size, which is also related to λ.
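For the $\ell_{\infty,1}$ regularizer the proximal step has no one-line closed form, so a sketch may help show how λ enters. The code below is an assumption-laden illustration and is not claimed to match the toolkit used in the paper: it relies on the standard identity that the proximal operator of a scaled max-norm equals the input minus its Euclidean projection onto an L1 ball (Moreau decomposition), with a sort-based projection. The function names are hypothetical, and `threshold` stands for the learning rate times λ.

```python
import math

def project_l1_ball(v, radius):
    """Euclidean projection of v onto the L1 ball of the given radius (sort-based)."""
    if radius <= 0.0:
        return [0.0 for _ in v]
    abs_v = [abs(x) for x in v]
    if sum(abs_v) <= radius:
        return list(v)  # already inside the ball
    theta = 0.0
    cumulative = 0.0
    for j, u_j in enumerate(sorted(abs_v, reverse=True), start=1):
        cumulative += u_j
        candidate = (cumulative - radius) / j
        if u_j - candidate > 0.0:
            theta = candidate  # the last index that passes gives the final threshold
    return [math.copysign(max(a - theta, 0.0), x) for a, x in zip(abs_v, v)]

def prox_linf_row(v, threshold):
    """Proximal operator of threshold * max_j |v_j| for a single row."""
    projection = project_l1_ball(v, threshold)
    return [x - p for x, p in zip(v, projection)]

def prox_linf1(W, threshold):
    """Apply the row-wise proximal operator to every row of W."""
    return [prox_linf_row(row, threshold) for row in W]

W = [[3.0, 1.0, -2.0], [0.3, -0.2, 0.0]]
print(prox_linf1(W, 1.0))  # [[2.0, 1.0, -2.0], [0.0, 0.0, 0.0]]
```

In this toy case the largest entry of the first row is clipped from 3.0 to 2.0, while the second row, whose absolute values sum to less than the threshold, is zeroed entirely. A larger λ raises the threshold and therefore zeroes more rows.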

Auto-sizing sub-components
We applied auto-sizing to the sub-components of the encoder and decoder layers, without touching the word or positional embeddings. Recall from Figure 1 that each layer has multi-head attention and feed-forward network sub-components. In turn, each multi-head attention sub-component is composed of two parameter matrices. Similarly, each feed-forward network has two parameter matrices, $W_1$ and $W_2$. We looked at three main experimental configurations:
• All: Auto-sizing is applied to every multi-head attention and feed-forward network sub-component in every layer of the encoder and decoder.
• Encoder: As with All, auto-sizing is applied to both multi-head attention and feed-forward network sub-components, but only in the encoder layers. The decoder remains the same.
• FFN: Auto-sizing is applied only to the feed-forward network sub-components $W_1$ and $W_2$, but not to the multi-head portions. This too is applied to both the encoder and decoder.
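The three configurations above amount to a rule for deciding which parameter matrices receive the proximal step. A minimal sketch of such a rule follows; the parameter naming scheme (`<encoder|decoder>.layers.<i>.<attn|ffn>.<matrix>`) and the function name are hypothetical and do not correspond to fairseq's actual parameter names.

```python
def should_auto_size(param_name, config):
    """Return True if the named parameter matrix is regularized under `config`.

    `config` is one of "all", "encoder", or "ffn". Names are assumed to look
    like "<encoder|decoder>.layers.<index>.<attn|ffn>.<matrix>".
    """
    parts = param_name.split(".")
    if len(parts) < 4 or parts[1] != "layers":
        return False  # embeddings and anything outside the stacked layers are left alone
    side, component = parts[0], parts[3]
    if component not in ("attn", "ffn"):
        return False
    if config == "all":
        return True
    if config == "encoder":
        return side == "encoder"
    if config == "ffn":
        return component == "ffn"
    raise ValueError(f"unknown configuration: {config}")

names = [
    "encoder.embed_tokens",
    "encoder.layers.0.attn.w_in",
    "encoder.layers.0.ffn.w1",
    "decoder.layers.5.attn.w_out",
    "decoder.layers.5.ffn.w2",
]
for config in ("all", "encoder", "ffn"):
    print(config, [n for n in names if should_auto_size(n, config)])
```

Keeping the selection in one function makes it easy to confirm that the embeddings are never regularized, whichever configuration is chosen.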

Results
Our results are presented in Table 1. The baseline system has 98.2 million parameters and a BLEU score of 29.7. It takes up 375 megabytes on disk. Our systems that applied auto-sizing only to the feed-forward network sub-components of the Transformer network maintained the best BLEU scores while also pruning out the most parameters of the model. Overall, our best system used $\ell_{2,1}$ regularization with $\lambda = 1.0$ for auto-sizing and left 73.1 million parameters remaining. On disk, the model takes 279 megabytes to store, roughly 100 megabytes less than the baseline. The performance drop compared to the baseline is 1.1 BLEU points, but the model is over 25% smaller.
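The headline figures can be reproduced with a few lines of arithmetic. The parameter counts below are the ones reported above; the storage estimate additionally assumes 4 bytes per parameter (32-bit floats) and binary megabytes of 2^20 bytes, neither of which is stated in the paper.

```python
BASELINE_PARAMS = 98.2e6   # reported baseline parameter count
PRUNED_PARAMS = 73.1e6     # reported parameter count of the best system

BYTES_PER_PARAM = 4        # assumption: 32-bit floating-point weights
BYTES_PER_MEGABYTE = 2 ** 20  # assumption: binary megabytes

def size_in_megabytes(num_params):
    """Estimated storage for the weights alone, ignoring any file overhead."""
    return num_params * BYTES_PER_PARAM / BYTES_PER_MEGABYTE

removed = BASELINE_PARAMS - PRUNED_PARAMS
fraction_removed = removed / BASELINE_PARAMS

print(f"{removed / 1e6:.1f} million parameters removed")  # 25.1 million parameters removed
print(f"{fraction_removed:.1%} of the model")             # 25.6% of the model
print(round(size_in_megabytes(BASELINE_PARAMS)))          # 375
print(round(size_in_megabytes(PRUNED_PARAMS)))            # 279
```

Under these assumptions the estimated sizes match the reported 375 and 279 megabytes, and the difference of about 96 megabytes agrees with the "almost 100 megabytes" saved.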
Applying auto-sizing to the multi-head attention and feed-forward network sub-components of only the encoder also pruned a substantial number of parameters. Though this too resulted in a smaller model on disk, the BLEU scores were worse than when auto-sizing just the feed-forward sub-components. Auto-sizing the multi-head attention and feed-forward network sub-components of both the encoder and decoder actually resulted in a larger model than the encoder-only configuration, but with a lower BLEU score. Overall, our results suggest that the attention portion of the Transformer network is more important for model performance than the feed-forward networks in each layer.

Conclusion
In this paper, we have investigated the impact of using auto-sizing on the Transformer network for the 2019 WNGT Efficiency task. We were able to delete more than 25% of the parameters in the model while suffering only a modest BLEU drop. In particular, focusing on the parameter matrices of the feed-forward networks in every layer of the encoder and decoder yielded the smallest models that still performed well.
A nice aspect of our proposed method is that the proximal gradient step of auto-sizing can be applied to a wide variety of parameter matrices. For the Transformer, the largest impact was on the feed-forward networks within a layer, but should a new architecture emerge in the future, auto-sizing can easily be adapted to its trainable parameters.
Overall, NDNLP's submission has shown that auto-sizing is a flexible framework for pruning parameters in a large NMT system. With an aggressive regularization scheme, large portions of the model can be deleted with only a modest impact on BLEU scores. This in turn yields a much smaller model on disk and at run-time.