SqueezeBERT: What can computer vision teach NLP about efficient neural networks?

Humans read and write hundreds of billions of messages every day. Further, due to the availability of large datasets, large computing systems, and better neural network models, natural language processing (NLP) technology has made significant strides in understanding, proofreading, and organizing these messages. Thus, there is a significant opportunity to deploy NLP in myriad applications to help web users, social networks, and businesses. Toward this end, we consider smartphones and other mobile devices as crucial platforms for deploying NLP models at scale. However, today’s highly-accurate NLP neural network models such as BERT and RoBERTa are extremely computationally expensive, with BERT-base taking 1.7 seconds to classify a text snippet on a Pixel 3 smartphone. To begin to address this problem, we draw inspiration from the computer vision community, where work such as MobileNet has demonstrated that grouped convolutions (e.g. depthwise convolutions) can enable speedups without sacrificing accuracy. We demonstrate how to replace several operations in self-attention layers with grouped convolutions, and we use this technique in a novel network architecture called SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive accuracy on the GLUE test set. A PyTorch-based implementation of SqueezeBERT is available as part of the Hugging Face Transformers library: https://huggingface.co/squeezebert


Introduction and Motivation
The human race writes over 300 billion messages per day (Sayce, 2019; Schultz, 2019; Al-Heeti, 2018; Templatify, 2017). Out of these, more than half of the world's emails are read on mobile devices, and nearly half of Facebook users exclusively access Facebook from a mobile device (Lovely Mobile News, 2017; Donnelly, 2018). Natural language processing (NLP) technology has the potential to aid these users and communities in several ways. When a person writes a message, NLP models can help with spelling and grammar checking as well as sentence completion. When content is added to a social network, NLP can facilitate content moderation before it appears in other users' news feeds. When a person consumes messages, NLP models can help classify messages into folders, compose news feeds, prioritize messages, and identify duplicates.
In recent years, the development and adoption of attention neural networks have led to dramatic improvements in almost every area of NLP. In 2017, Vaswani et al. proposed the multi-head self-attention module, which demonstrated superior accuracy to recurrent neural networks on English-to-German machine translation (Vaswani et al., 2017). These modules have since been adopted by GPT (Radford et al., 2018) and BERT (Devlin et al., 2019) for sentence classification, and by GPT-2 (Radford et al., 2019) and CTRL (Keskar et al., 2019) for sentence completion and generation. Recent works such as ELECTRA (Clark et al., 2020) and RoBERTa (Liu et al., 2019) have shown that larger datasets and more sophisticated training regimes can further improve the accuracy of self-attention networks.
Considering the enormity of the textual data created by humans on mobile devices, a natural approach is to deploy NLP models directly onto mobile devices, embedding them in the apps used to read, write, and share text. Unfortunately, highly-accurate NLP models are computationally expensive, making mobile deployment impractical. For example, we observe that running the BERT-base network on a Google Pixel 3 smartphone takes approximately 1.7 seconds to classify a single text data sample. Much of the research on efficient self-attention networks for NLP has just emerged in the past year. However, starting with SqueezeNet (Iandola et al., 2016b), the mobile computer vision (CV) community has spent the last four years optimizing neural networks for mobile devices. Intuitively, it seems like there must be opportunities to apply the lessons learned from the rich literature of mobile CV research to accelerate mobile NLP. In the following, we review what has already been applied and propose two additional techniques from CV that we will leverage to accelerate NLP models.

1.1 What has CV research already taught NLP research about efficient networks?

In recent months, novel self-attention networks have been developed with the goal of achieving faster inference. At present, the MobileBERT network defines the state-of-the-art in low-latency text classification for mobile devices (Sun et al., 2020). MobileBERT takes approximately 0.6 seconds to classify a text sequence on a Google Pixel 3 smartphone. On the GLUE benchmark, which consists of 9 natural language understanding (NLU) datasets (Wang et al., 2018), it achieves higher accuracy than other efficient networks such as DistilBERT (Sanh et al., 2019), PKD, and several others (Lan et al., 2019; Turc et al., 2019; Jiao et al., 2019; Xu et al., 2020). To achieve this, MobileBERT introduced two concepts into their NLP self-attention network that are already in widespread use in CV neural networks:

1. Bottleneck layers. In ResNet (He et al., 2016), the 3x3 convolutions are computationally expensive, so a 1x1 "bottleneck" convolution is employed to reduce the number of channels input to each 3x3 convolution layer. Similarly, MobileBERT adopts bottleneck layers that reduce the number of channels before each self-attention layer, reducing the computational cost of the self-attention layers.
2. High-information flow residual connections. In BERT-base, the residual connections serve as links between the low-channel-count (768 channels) layers. The high-channel-count (3072 channels) layers in BERT-base do not have residual connections. However, the ResNet and Residual-SqueezeNet (Iandola et al., 2016b) CV networks connect the high-channel-count layers with residuals, enabling higher information flow through the network. Similar to these CV networks, MobileBERT adds residual connections between the high-channel-count layers. (Both ideas are illustrated in the code sketch below.)
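To make these two ideas concrete, the following is a minimal PyTorch sketch of a bottleneck projection paired with a residual connection around the high-channel-count activations. The channel sizes (3072 and 768) follow BERT-base, but the module structure and the stand-in "expensive" layer are illustrative assumptions rather than MobileBERT's exact design.

```python
import torch
import torch.nn as nn

class BottleneckWithResidual(nn.Module):
    """Illustrative sketch: a 1x1 'bottleneck' projection reduces the channel
    count before an expensive layer, and a residual connection links the
    high-channel-count activations. Not MobileBERT's exact block."""
    def __init__(self, high_channels=3072, low_channels=768):
        super().__init__()
        self.reduce = nn.Conv1d(high_channels, low_channels, kernel_size=1)    # bottleneck
        self.expensive = nn.Conv1d(low_channels, low_channels, kernel_size=1)  # stand-in for self-attention
        self.expand = nn.Conv1d(low_channels, high_channels, kernel_size=1)

    def forward(self, x):                 # x: (batch, high_channels, positions)
        y = self.reduce(x)                # cheaper computation at low channel count
        y = self.expensive(y)
        y = self.expand(y)
        return x + y                      # residual between high-channel-count tensors

x = torch.randn(1, 3072, 128)
print(BottleneckWithResidual()(x).shape)  # torch.Size([1, 3072, 128])
```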
1.2 What else can CV research teach NLP research about efficient networks?

We are encouraged by the progress that MobileBERT has made in leveraging ideas that are popular in the CV literature to accelerate NLP. However, we are aware of two other ideas from CV, which weren't used in MobileBERT and which could be applied to accelerate NLP:

1. Convolutions. Since the 1980s, computer vision neural nets have relied heavily on convolutional layers (Fukushima, 1980; LeCun et al., 1989). Convolutions are quite flexible and well-optimized in software, and they can implement things as simple as a 1D fully-connected layer, or as complex as a 3D dilated layer that performs upsampling or downsampling.
2. Grouped convolutions. A popular technique in modern mobile-optimized neural networks is grouped convolutions (see Section 3). Proposed by Krizhevsky et al. in the 2012 winning submission to the ImageNet image classification challenge (Krizhevsky et al., 2011, 2012; Russakovsky et al., 2015), grouped convolutions disappeared from the literature for some years, then re-emerged as a key technique circa 2016 (Chollet, 2016; Xie et al., 2017), and today they are extensively used in efficient CV networks such as MobileNet (Howard et al., 2017), ShuffleNet, and EfficientNet (Tan and Le, 2019). While common in the CV literature, we are not aware of work applying grouped convolutions to NLP.
1.3 SqueezeBERT: Applying lessons learned from CV to NLP

In this work, we describe how to apply convolutions and particularly grouped convolutions in the design of a novel self-attention network for NLP, which we call SqueezeBERT. Empirically, we find that SqueezeBERT runs at lower latency on a smartphone than BERT-base, MobileBERT, and several other efficient NLP models, while maintaining competitive accuracy.

Implementing self-attention with convolutions
In this section, first, we review the basic structure of self-attention networks. Next, we identify that their biggest computational bottleneck is in their position-wise fully-connected (PFC) layers. We then show that these PFC layers are equivalent to a 1D convolution with a kernel size of 1.

Self-attention networks
In most BERT-derived networks there are typically 3 stages: the embedding, the encoder, and the classifier (Devlin et al., 2019; Liu et al., 2019; Clark et al., 2020; Sun et al., 2020; Lan et al., 2019). The embedding converts preprocessed words (represented as integer-valued tokens) into learned feature vectors of floating-point numbers. The encoder is comprised of a series of self-attention and other layers. The classifier produces the network's final output. As we will see later in Table 1, the embedding and the classifier account for less than 1% of the runtime of a self-attention network, so we focus our discussion on the encoder.

We now describe the encoder that is used in BERT-base (Devlin et al., 2019). The encoder consists of a stack of blocks. Each block consists of three position-wise fully-connected (PFC) layers, then a self-attention module, and finally a stack of three more position-wise fully-connected layers, known as feed-forward network (FFN) layers. The initial three PFC layers are used to generate the query (Q), key (K), and value (V) activation vectors for each position in the feature embedding. Each of these Q, K, and V layers applies the same operation to each position in the feature embedding independently. While neural networks traditionally multiply weights by activations, a distinguishing factor of attention neural networks is that they multiply activations by other activations, enabling dynamic weighting of tensor elements to adjust based on the input data. Further, attention networks allow modeling of arbitrary dependencies regardless of their distance in the input or output (Vaswani et al., 2017).

2.2 Benchmarking BERT for mobile inference

To identify the parts of BERT that are time-consuming to compute, we profile BERT on a smartphone. Specifically, we measure the neural network's latency using PyTorch (Paszke et al., 2019) and TorchScript on a Google Pixel 3 smartphone, with an input sequence length of 128 and a batch size of 1. This is a reasonable sequence length for text messages, instant messages, short emails, and other messages that are commonly written and read by smartphone users. In Table 1, we show the breakdown of FLOPs and latency among the main components of the BERT network, and we observe that the self-attention calculations (i.e., $\mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$) account for only 11.3% of the total latency. However, the PFC layers account for 88.3% of the latency.
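As a rough illustration of this measurement setup (the exact on-device harness is not described here), the sketch below traces a Hugging Face BERT-base model with TorchScript and times a single length-128, batch-size-1 forward pass. The model name, tokenizer call, and timing loop are our assumptions; on a workstation the absolute numbers will of course differ from the Pixel 3 measurements.

```python
import time
import torch
from transformers import BertModel, BertTokenizer

# Load BERT-base and prepare a single length-128 input (batch size 1).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", torchscript=True).eval()
inputs = tokenizer("a short example message", padding="max_length",
                   max_length=128, truncation=True, return_tensors="pt")

# Trace with TorchScript, mirroring the PyTorch + TorchScript setup described above.
traced = torch.jit.trace(model, (inputs["input_ids"], inputs["attention_mask"]))

with torch.no_grad():
    traced(inputs["input_ids"], inputs["attention_mask"])   # warm-up pass
    start = time.time()
    traced(inputs["input_ids"], inputs["attention_mask"])
    print(f"latency: {time.time() - start:.3f} s")
```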

Replacing the position-wise fully-connected (PFC) layers with convolutions

Given that PFC layers account for the overwhelming majority of the latency, we now focus on reducing the PFC layers' latency. In particular, we intend to replace the PFC layers with grouped convolutions, which have been shown to produce significant speedups in computer vision networks. As a first step in this direction, we now show that the position-wise fully-connected layers used throughout the BERT encoder are a special case of non-grouped 1D convolution.
Let $w$ denote the weights of the position-wise fully-connected layer, with dimensions $(C, C)$. Given an input feature vector $f$ of dimensions $(P, C)$ with $P$ positions and $C$ channels, and an output of dimensions $(P, C)$, the operation performed by the position-wise fully-connected layer for each output channel $c$ at position $p$ can be defined as:

$$\mathrm{PFC}_{p,c}(f, w) = \sum_{i=1}^{C} w_{c,i} \, f_{p,i}$$

Now consider the definition of a 1D convolution with kernel size $K$ with the same input and output dimensions. Let $q$ be the weights of the convolution, with dimensions $(C, C, K)$:

$$\mathrm{Conv}_{p,c}(f, q) = \sum_{i=1}^{C} \sum_{k=1}^{K} q_{c,i,k} \, f_{\left(p + k - \left\lceil K/2 \right\rceil\right),\, i}$$

Comparing these expressions, we observe that the position-wise fully-connected operation is equivalent to a convolution with a kernel size of $K = 1$. Thus, the PFC layers of Vaswani et al. (Vaswani et al., 2017), GPT, BERT, and similar self-attention networks can be implemented using convolutions without changing the networks' numerical properties or behavior.
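A minimal PyTorch check of this equivalence is sketched below: a position-wise fully-connected layer (nn.Linear applied independently at every position) and an nn.Conv1d with kernel size 1 produce identical outputs once their weights are shared. The tensor shapes and tolerance are our choices for illustration.

```python
import torch
import torch.nn as nn

P, C = 128, 768                       # positions, channels
f = torch.randn(1, P, C)              # (batch, positions, channels)

pfc = nn.Linear(C, C, bias=True)      # position-wise fully-connected layer
conv = nn.Conv1d(C, C, kernel_size=1, bias=True)

# Copy the PFC weights into the convolution: (C_out, C_in) -> (C_out, C_in, K=1).
with torch.no_grad():
    conv.weight.copy_(pfc.weight.unsqueeze(-1))
    conv.bias.copy_(pfc.bias)

out_pfc = pfc(f)                                     # applied independently at each position
out_conv = conv(f.transpose(1, 2)).transpose(1, 2)   # Conv1d expects (batch, channels, positions)
print(torch.allclose(out_pfc, out_conv, atol=1e-5))  # True (up to floating-point error)
```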

Incorporating grouped convolutions into self-attention
Now that we have shown how to implement the expensive PFC layers in self-attention networks using convolutions, we can incorporate efficient grouped convolutions into a self-attention network. Grouped convolutions are defined as follows.
Given an input feature vector of dimensions $(P, C)$ with $P$ positions and $C$ channels, and an output vector of dimensions $(P, C)$, a 1D convolution with kernel size $K = 1$, $g$ groups, and weight vector $q$ of dimensions $(C, \frac{C}{g})$ can be defined as follows. Let $N = \frac{C}{g}$, where $N$ is the number of channels in each group. Then, for each output channel $c$ at position $p$:

$$\mathrm{GConv}_{p,c}(f, q) = \sum_{i=1}^{N} q_{c,i} \, f_{p,\, \left(i + N \left\lfloor \frac{c-1}{N} \right\rfloor\right)}$$

Figure 1: Traditional vs. grouped convolutions. In panel (a), we illustrate the weight matrix of a traditional 1D convolution with 8 input channels, 8 output channels, and a kernel size of 1. In panel (b), we illustrate a grouped convolution with g = 4. White cells in the grid are empty. Observe that with g = 4, the weight matrix has one-fourth the number of parameters as a traditional convolution.
This is equivalent to splitting the input vector into $g$ separate vectors of size $(P, \frac{C_{in}}{g})$ along the $C$ dimension and running $g$ separate convolutions with independent weights, each computing vectors of size $(P, \frac{C_{out}}{g})$. The grouped convolution, however, requires only $\frac{1}{g}$ as many floating-point operations (FLOPs) and $\frac{1}{g}$ as many weights as an ordinary convolution, not counting the small (and unchanged) number of operations needed for the channel-wise bias term that is often included in convolutional layers. Finally, to complement the mathematical explanation of grouped convolutions, we illustrate the difference between traditional convolutions and grouped convolutions in Figure 1.
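The following sketch illustrates the 1/g reduction in weights (and, correspondingly, in multiply-accumulate operations) by comparing the parameter counts of an ordinary kernel-size-1 convolution and a grouped one. The channel count and g = 4 match the setting used later in SqueezeBERT, but the snippet itself is just a parameter-count check.

```python
import torch.nn as nn

C, g = 768, 4

dense = nn.Conv1d(C, C, kernel_size=1, groups=1, bias=False)
grouped = nn.Conv1d(C, C, kernel_size=1, groups=g, bias=False)

n_dense = sum(p.numel() for p in dense.parameters())      # C * C        = 589,824
n_grouped = sum(p.numel() for p in grouped.parameters())  # C * (C // g) = 147,456
print(n_dense, n_grouped, n_dense / n_grouped)             # 589824 147456 4.0
```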

SqueezeBERT
Now, we describe our proposed neural architecture, called SqueezeBERT, which uses grouped convolutions. SqueezeBERT is much like BERT-base, but with PFC layers implemented as convolutions, and grouped convolutions for many of the layers. Recall from Section 2 that each block in the BERT-base encoder has a self-attention module that ingests the activations from 3 PFC layers, and the block also has 3 more PFC layers called feed-forward network layers (FFN1, FFN2, and FFN3). FFN1 has $C_{in} = 768$ and $C_{out} = 768$, FFN2 has $C_{in} = 768$ and $C_{out} = 3072$, and FFN3 has $C_{in} = 3072$ and $C_{out} = 768$. In all PFC layers of the self-attention modules, and in the FFN2 and FFN3 layers, we use grouped convolutions with $g = 4$. To allow for mixing across channels of different groups, we use $g = 1$ in the less-expensive FFN1 layers. Note that in BERT-base, FFN2 and FFN3 each have 4 times more arithmetic operations than FFN1. However, when we use $g = 4$ in FFN2 and FFN3, all FFN layers have the same number of arithmetic operations. We illustrate one block of the SqueezeBERT encoder in Figure 2.
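A simplified sketch of one SqueezeBERT-style encoder block is shown below, with the Q/K/V projections, FFN2, and FFN3 implemented as grouped (g = 4) kernel-size-1 convolutions and FFN1 as a dense (g = 1) convolution. The layer norms, dropout, and other details are omitted, so this is an approximation of the block rather than the exact architecture.

```python
import torch
import torch.nn as nn

class SqueezeBertBlockSketch(nn.Module):
    """Simplified sketch of a SqueezeBERT-style encoder block (norms/dropout omitted)."""
    def __init__(self, c=768, c_ffn=3072, heads=12, g=4):
        super().__init__()
        self.heads = heads
        # Q, K, V projections as grouped kernel-size-1 convolutions.
        self.q = nn.Conv1d(c, c, 1, groups=g)
        self.k = nn.Conv1d(c, c, 1, groups=g)
        self.v = nn.Conv1d(c, c, 1, groups=g)
        self.ffn1 = nn.Conv1d(c, c, 1, groups=1)      # dense: mixes channels across groups
        self.ffn2 = nn.Conv1d(c, c_ffn, 1, groups=g)
        self.ffn3 = nn.Conv1d(c_ffn, c, 1, groups=g)

    def forward(self, x):                              # x: (batch, channels, positions)
        b, c, p = x.shape
        d = c // self.heads
        def split(t):                                  # -> (batch, heads, positions, head_dim)
            return t.view(b, self.heads, d, p).transpose(2, 3)
        q, k, v = split(self.q(x)), split(self.k(x)), split(self.v(x))
        attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1) @ v
        attn = attn.transpose(2, 3).reshape(b, c, p)
        x = x + self.ffn1(attn)                        # residual connection (norms omitted)
        return x + self.ffn3(torch.relu(self.ffn2(x))) # feed-forward network + residual

print(SqueezeBertBlockSketch()(torch.randn(1, 768, 128)).shape)  # torch.Size([1, 768, 128])
```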

Finally, the embedding size (768), the number of blocks in the encoder (12), the number of heads per self-attention module (12), the WordPiece tokenizer (Schuster and Nakajima, 2012; Wu et al., 2016), and other aspects of SqueezeBERT are adopted from BERT-base. Aside from the convolution-based implementation and the adoption of grouped convolutions, the SqueezeBERT architecture is identical to BERT-base.
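Since SqueezeBERT is released through the Hugging Face Transformers library (see the URL in the abstract), a pretrained checkpoint can be loaded roughly as follows. The model identifier "squeezebert/squeezebert-uncased" is our assumption about the published checkpoint name; see https://huggingface.co/squeezebert for the available models.

```python
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint name; check https://huggingface.co/squeezebert for published models.
name = "squeezebert/squeezebert-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("a short example message", return_tensors="pt")
hidden_states = model(**inputs).last_hidden_state
print(hidden_states.shape)   # (1, sequence_length, 768)
```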
4 Experimental Methodology

4.1 Datasets

Pretraining Data. For pretraining, we use a combination of Wikipedia and BooksCorpus (Zhu et al., 2015), setting aside 3% of the combined dataset as a test set. Following the ALBERT paper, we use Masked Language Modeling (MLM) and Sentence Order Prediction (SOP) as pretraining tasks (Lan et al., 2019).
Finetuning Data. We finetune and evaluate SqueezeBERT (and other baselines) on the General Language Understanding Evaluation (GLUE) set of tasks. This benchmark consists of a diverse set of 9 NLU tasks; thanks to the structure and breadth of these tasks (see supplementary material for detailed task-level information), GLUE has become the standard evaluation benchmark for NLP research. A model's performance across the GLUE tasks likely provides a good approximation of that model's generalizability (especially for text classification tasks).

Training Methodology
Many recent papers on efficient NLP networks report results on models trained with bells and whistles such as distillation, adversarial training, and/or transfer learning across GLUE tasks. However, there is no standardization of these training schemes across different papers, making it difficult to distinguish the contribution of the model from the contribution of the training scheme to the final accuracy number. Therefore, we first train SqueezeBERT using a simple training scheme (described in Section 4.2.1, with results reported in Section 5.1), and then we train SqueezeBERT with distillation and other techniques (described in Section 4.2.2, with results reported in Section 5.2).

Training without bells and whistles
We pretrain SqueezeBERT from scratch (without distillation) using the LAMB optimizer, and we employ the hyperparameters recommended by the LAMB authors: a global batch size of 8192, a learning rate of 2.5e-3, and a warmup proportion of 0.28 (You et al., 2020). Following the LAMB paper's recommendations, we pretrain for 56k steps with a maximum sequence length of 128 and then for 6k steps with a maximum sequence length of 512.
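For reference, the pretraining hyperparameters described above can be summarized in a small configuration dictionary; this is bookkeeping only, and the LAMB optimizer itself (which is not part of core PyTorch) is not shown here.

```python
# Pretraining hyperparameters from the LAMB-based recipe described above (two-phase schedule).
pretraining_config = {
    "optimizer": "LAMB",
    "global_batch_size": 8192,
    "learning_rate": 2.5e-3,
    "warmup_proportion": 0.28,
    "phases": [
        {"steps": 56_000, "max_sequence_length": 128},
        {"steps": 6_000, "max_sequence_length": 512},
    ],
}
```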
For finetuning, we use the AdamW optimizer with a batch size of 16, without momentum or weight decay, with β1 = 0.9 and β2 = 0.999 (Loshchilov and Hutter, 2019). As is common in the literature, during finetuning for each task, we perform hyperparameter tuning on the learning rate and dropout rate. We present more details on this in the supplementary material. In the interest of a fair comparison, we also train BERT-base using the aforementioned pretraining and finetuning protocol.
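A sketch of the finetuning optimizer setup follows, under the interpretation that "no weight decay" means weight_decay = 0 in AdamW; the learning rate shown is a placeholder, since the paper tunes it per GLUE task, and the placeholder model stands in for the finetuned network.

```python
import torch

model = torch.nn.Linear(768, 2)   # placeholder standing in for SqueezeBERT plus a task head
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-5,                 # placeholder; tuned per GLUE task in practice
    betas=(0.9, 0.999),
    weight_decay=0.0,        # "no weight decay"
)
```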

Training with bells and whistles
We now review recent techniques for improving the training of NLP networks, and we describe the approaches that we will use for the training and evaluation of SqueezeBERT in Section 5.2.
Distillation approaches used in other efficient NLP networks. While the term "knowledge distillation" was coined by Hinton et al. to describe a specific method and equation (Hinton et al., 2015), the term "distillation" is now used in reference to a diverse range of approaches where a "student" network is trained to replicate a "teacher" network. Some researchers distill only the final layer of the network (Sanh et al., 2019), while others also distill the hidden layers (Sun et al., 2020; Xu et al., 2020). When distilling the hidden layers, some apply layer-by-layer distillation warmup, where each module of the student network is distilled independently while downstream modules are frozen (Sun et al., 2020). Some distill during pretraining (Sun et al., 2020; Sanh et al., 2019), some distill during finetuning (Xu et al., 2020), and some do both (Jiao et al., 2019).
Bells and whistles used for training SqueezeBERT (for results in Section 5.2). Distillation is not a central focus of this paper, and there is a large design space of potential approaches to distillation, so we select a relatively simple form of distillation for use in SqueezeBERT training. We apply distillation only to the final layer, and only during finetuning. On the GLUE sentence classification tasks, we use soft cross entropy loss with respect to a weighted sum of the teacher's logits ($t$) and a one-hot encoding of the ground-truth ($g$). The weighting between the teacher logits and the ground-truth is controlled by a hyperparameter $\alpha$. Formally, we write this weighted sum as:

$$\alpha \, t + (1 - \alpha) \, g$$

Also note that GLUE has one regression task (STS-B text similarity), and for this task we replace the soft cross entropy loss with mean squared error. In addition to distillation, inspired by STILTs (Phang et al., 2018) and ELECTRA (Clark et al., 2020), we apply transfer learning from the MNLI GLUE task to other GLUE tasks as follows. The SqueezeBERT student model is pretrained using the approach described in Section 4.2.1, and then it is finetuned on the MNLI task. The weights from MNLI training are used as the initial student weights for all other GLUE tasks except CoLA. Similarly, the teacher model is a BERT-base model that is pretrained using the ELECTRA method and then finetuned on MNLI. The teacher model is then finetuned independently on each GLUE task, and these task-specific teacher weights are used for distillation.
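A minimal sketch of this final-layer distillation loss is given below, under our reading that the target distribution mixes the teacher's softened predictions with the one-hot ground truth; the softmax over the teacher's logits and the value of alpha are our assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.9):
    """Soft cross entropy against a weighted mix of teacher predictions and the
    one-hot ground truth. One possible reading of the loss described above;
    alpha is the mixing hyperparameter."""
    num_classes = student_logits.size(-1)
    one_hot = F.one_hot(labels, num_classes).float()
    target = alpha * F.softmax(teacher_logits, dim=-1) + (1.0 - alpha) * one_hot
    log_probs = F.log_softmax(student_logits, dim=-1)
    return -(target * log_probs).sum(dim=-1).mean()

# Example usage with random tensors (batch of 4, 3-way classification).
s = torch.randn(4, 3)
t = torch.randn(4, 3)
y = torch.tensor([0, 2, 1, 0])
print(distillation_loss(s, t, y))
```

For the STS-B regression task, the soft cross entropy above would simply be replaced with a mean squared error between the student's prediction and the corresponding weighted target, as described in the text.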

Results
We now turn our attention to comparing SqueezeBERT to other efficient neural networks.

Results without bells and whistles
In the upper portions of Tables 2 and 3, we compare our results to other efficient networks on the dev and test sets of the GLUE benchmark. Note that relatively few of the efficiency-optimized networks report results without bells and whistles, and most such results are reported on the development (not test) set of GLUE. Fortunately, the authors of MobileBERT (a network which, as we will see in the next section, compares favorably to other efficient networks with bells and whistles enabled) do provide development-set results without distillation on 4 of the GLUE tasks. 7 We observe in the upper portion of Table 2 that, when both networks are trained without distillation, SqueezeBERT achieves higher accuracy than MobileBERT on all of these tasks. This provides initial evidence that the techniques from computer vision that we have adopted can be applied to NLP, and reasonable accuracy can be obtained. Further, we observe that SqueezeBERT is 4.3x faster than BERT-base, while MobileBERT is 3.0x faster than BERT-base. 8

Due to the dearth of efficient neural network results on GLUE without bells and whistles, we also provide a comparison in Table 2 with the ALBERT-base network. ALBERT-base is a version of BERT-base that uses the same weights across multiple attention layers, and it has a smaller encoder than BERT. Due to these design choices, ALBERT-base has 9x fewer parameters than BERT-base. However, ALBERT-base and BERT-base have the same number of FLOPs, and we observe in our measurements in Table 2 that ALBERT-base does not offer a speedup over BERT-base on a smartphone. 9 Further, on the two GLUE tasks where the ALBERT authors reported the accuracy of ALBERT-base, MobileBERT and SqueezeBERT both outperform the accuracy of ALBERT-base.

Table 2: Comparison of neural networks on the development set of the GLUE benchmark. For tasks that have 2 metrics (e.g. MRPC's metrics are Accuracy and F1), we report the average of the 2 metrics. † denotes models trained by the authors of the present paper. Bells and whistles are: A = adversarial training; D = distillation of final layer; E = distillation of encoder layers; S = transfer learning across GLUE tasks (a.k.a. STILTs (Phang et al., 2018)); W = per-layer warmup. In GLUE accuracy, a dash means that accuracy for this task is not provided in the literature.

7 Note that some papers report results on only the development set or the test set, and some papers only report results on a subset of GLUE tasks. Our aim with this evaluation is to be as inclusive as possible, so we include papers with incomplete GLUE results in our results tables.

8 In our measurements, we find MobileBERT takes 572ms to classify one length-128 sequence on a Pixel 3 phone. This is slightly faster than the 620ms reported by the MobileBERT authors in the same setting (Sun et al., 2019b). We use the faster number in our comparisons. Further, all latencies in our results tables were benchmarked by us.

9 However, reducing the number of parameters while retaining a high number of FLOPs can present other advantages, such as faster distributed training (Lan et al., 2019; Iandola et al., 2016a) and superior energy-efficiency (Iandola and Keutzer, 2017).

Results with bells and whistles
Now, we turn our attention to comparing SqueezeBERT to other models, all trained with bells and whistles. Note that the bells and whistles come at the cost of extra training time, but they do not change the inference time or the model size. In the lower portion of Table 3, we first observe that, when trained with bells and whistles, MobileBERT matches or outperforms the accuracy of the other efficient models (except SqueezeBERT) on 8 of the 9 GLUE tasks. Further, on 4 of the 9 tasks SqueezeBERT outperforms the accuracy of MobileBERT; on 4 of 9 tasks MobileBERT outperforms SqueezeBERT; and on 1 task (WNLI) all models predict the most frequently occurring category. Also, SqueezeBERT achieves an average score across all GLUE tasks that is within 0.4 percentage-points of MobileBERT. Given the speedup of SqueezeBERT over MobileBERT, we think it is reasonable to say that SqueezeBERT and MobileBERT each offer a compelling speed-accuracy tradeoff for NLP inference on mobile devices.

Related Work
Quantization and Pruning. Quantization is a family of techniques which aims to reduce the number of bits required to store each parameter and/or activation in a neural network, while at the same time maintaining the accuracy of that network. This has been successfully applied to NLP in such works as (Shen et al., 2020;Zafrir et al., 2019). Pruning aims to directly eliminate certain parameters from the network while maintaining accuracy, thereby reducing the storage and potentially computational cost of that network; for an application of this to NLP, please see Sanh et al. (2020). These methods could be applied to SqueezeBERT to yield further efficiency improvements, but quantization and pruning are not a focus of this paper.
Addressing long sequence-lengths. In work such as SqueezeBERT and MobileBERT, the inference FLOPs and latency are evaluated using a sequence length of 128. This is a reasonable sequence length for use-cases such as classifying text messages, instant messages, and short emails. However, if the goal is to classify longer-form texts such as book chapters or even an entire book, then the typical sequence length is much longer. While the position-wise fully-connected (PFC) layers in BERT scale linearly in the sequence length, the self-attention calculations scale quadratically in the sequence length. So, when classifying a long sequence, the self-attention calculations are the dominant factor in the FLOPs and latency of the neural network. Several recent projects have worked to address this problem. For instance, Funnel Transformer downsamples the sequence length in the first few layers of the network, and it upsamples the sequence length in the final few layers of the network (Dai et al., 2020). This approach is similar to computer vision models for semantic segmentation such as U-Net (Ronneberger et al., 2015). In addition, Longformer reduces the number of FLOPs by introducing structured sparsity into the self-attention tensors (Beltagy et al., 2020). Further, Linformer projects long sequences into shorter fixed-length sequences (Wang et al., 2020b). Finally, Tay et al. (2020) provide an extensive survey of approaches for redesigning self-attention networks to efficiently classify long sequences.
Self-attention networks with dynamic computational cost. DeeBERT (Xin et al., 2020), FastBERT, and Schwartz et al. (2020) each describe a method to dynamically adjust the amount of computation for different sequences. The intuition is that some sequences are easier to classify than others, and the "easy" sequences can be correctly classified by only computing the first few layers of a BERT-like network.
Convolutions in self-attention networks for language-generation tasks. In this paper, our experiments focus on natural language understanding (NLU) tasks such as sentence classification. However, another widely-studied area is natural language generation (NLG), which includes the tasks of machine translation (e.g., English-to-German) and language modeling (e.g., automated sentence completion). While we are not aware of work that adopts convolutions in self-attention networks for NLU, we are aware of such work in NLG. For instance, the Evolved Transformer and Lite Transformer architectures contain self-attention modules and convolutions in separate portions of the network (So et al., 2019; Wu et al., 2020). Additionally, LightConv shows that well-designed convolutional networks without self-attention produce comparable results to self-attention networks on certain NLG tasks (Wu et al., 2019b). Also, Wang et al. sparsify the self-attention matrix multiplication using a pattern of nonzeros that is inspired by dilated convolutions (Wang et al., 2020a). Finally, while not an attention network, Kim applied convolutional networks to NLU several years before the development of multi-head self-attention (Kim, 2014).

Conclusions & Future Work
In this paper, we have studied how grouped convolutions, a popular technique in the design of efficient computer vision neural networks, can be applied to natural language processing. First, we showed that the position-wise fully-connected layers of self-attention networks can be implemented with mathematically-equivalent 1D convolutions. Further, we proposed SqueezeBERT, an efficient NLP model which implements most of the layers of its self-attention encoder with 1D grouped convolutions. This model yields an appreciable latency reduction of more than 4x over BERT-base when benchmarked on a Pixel 3 phone. We also successfully applied distillation to improve our approach's accuracy to a level that is competitive with a distillation-trained MobileBERT and with the original version of BERT-base.
We now discuss some possibilities for future work in the design of computationally-efficient neural networks for NLP. As we observed in Section 6, in recent months numerous approaches have been proposed for reducing the computational cost of self-attention neural architectures for natural language processing. These approaches include new model structures (e.g. MobileBERT), rethinking the dimensions of attention calculations (e.g. Linformer), grouped convolutions (SqueezeBERT), and much more. Further, once the neural architecture has been selected, approaches such as quantization and pruning can further reduce some of the costs associated with self-attention neural network inference. The combination of all of these potential techniques opens up a broad search-space of neural architecture designs for NLP. This motivates the application of automated neural architecture search (NAS) approaches such as those described in (Shaw et al., 2019; Wu et al., 2019a) to further improve the design of neural networks for NLP.