MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices

Natural Language Processing (NLP) has recently achieved great success by using huge pre-trained models with hundreds of millions of parameters. However, these models suffer from heavy model sizes and high latency such that they cannot be deployed to resource-limited mobile devices. In this paper, we propose MobileBERT for compressing and accelerating the popular BERT model. Like the original BERT, MobileBERT is task-agnostic, that is, it can be generically applied to various downstream NLP tasks via simple fine-tuning. Basically, MobileBERT is a thin version of BERT_LARGE, while equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks. To train MobileBERT, we first train a specially designed teacher model, an inverted-bottleneck incorporated BERT_LARGE model. Then, we conduct knowledge transfer from this teacher to MobileBERT. Empirical studies show that MobileBERT is 4.3x smaller and 5.5x faster than BERT_BASE while achieving competitive results on well-known benchmarks. On the natural language inference tasks of GLUE, MobileBERT achieves a GLUE score of 77.7 (0.6 lower than BERT_BASE), and 62 ms latency on a Pixel 4 phone. On the SQuAD v1.1/v2.0 question answering task, MobileBERT achieves a dev F1 score of 90.0/79.2 (1.5/2.1 higher than BERT_BASE).


Introduction
The NLP community has witnessed a revolution of pre-training self-supervised models. These models usually have hundreds of millions of parameters (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018; Radford et al., 2019; Yang et al., 2019). Among these models, BERT (Devlin et al., 2018) shows substantial accuracy improvements. However, as one of the largest models ever in NLP, BERT suffers from heavy model size and high latency, making it impractical for resource-limited mobile devices to deploy the power of BERT in mobile-based machine translation, dialogue modeling, and the like.
There have been some efforts that task-specifically distill BERT into compact models (Turc et al., 2019; Tang et al., 2019; Sun et al., 2019; Tsai et al., 2019). To the best of our knowledge, there is not yet any work for building a task-agnostic lightweight pre-trained model, that is, a model that can be generically fine-tuned on different downstream NLP tasks as the original BERT does. In this paper, we propose MobileBERT to fill this gap. In practice, task-agnostic compression of BERT is desirable. Task-specific compression needs to first fine-tune the original large BERT model into a task-specific teacher and then distill. Such a process is much more complicated (Wu et al., 2019) and costly than directly fine-tuning a task-agnostic compact model.
At first glance, it may seem straightforward to obtain a task-agnostic compact BERT. For example, one may just take a narrower or shallower version of BERT, and train it until convergence by minimizing a convex combination of the prediction loss and distillation loss (Turc et al., 2019; Sun et al., 2019). Unfortunately, empirical results show that such a straightforward approach results in significant accuracy loss (Turc et al., 2019). This may not be that surprising. It is well-known that shallow networks usually do not have enough representation power while narrow and deep networks are difficult to train.
Our MobileBERT is designed to be as deep as BERT_LARGE while each layer is made much narrower via adopting bottleneck structures and balancing between self-attentions and feed-forward networks (Figure 1). To train MobileBERT, a deep and thin model, we first train a specially designed teacher model, an inverted-bottleneck incorporated BERT_LARGE model (IB-BERT). Then, we conduct knowledge transfer from IB-BERT to MobileBERT. A variety of knowledge transfer strategies are carefully investigated in our empirical studies. Empirical evaluations show that MobileBERT is 4.3× smaller and 5.5× faster than BERT_BASE, while it can still achieve competitive results on well-known NLP benchmarks. On the natural language inference tasks of GLUE, MobileBERT can achieve a GLUE score of 77.7, which is only 0.6 lower than BERT_BASE, with a latency of 62 ms on a Pixel 4 phone. On the SQuAD v1.1/v2.0 question answering task, MobileBERT obtains a dev F1 score of 90.0/79.2, which is even 1.5/2.1 higher than BERT_BASE.

Related Work
Recently, compression of BERT has attracted much attention. Turc et al. (2019) propose to pre-train the smaller BERT models to improve task-specific knowledge distillation. Tang et al. (2019) distill BERT into an extremely small LSTM model. Tsai et al. (2019) distill a multilingual BERT into smaller BERT models on sequence labeling tasks. Clark et al. (2019b) use several single-task BERT models to teach a multi-task BERT. Liu et al. (2019a) distill knowledge from an ensemble of BERT models into a single BERT.
Concurrently to our work, Sun et al. (2019) distill BERT into shallower students through knowledge distillation and an additional knowledge transfer of hidden states on multiple intermediate layers. Jiao et al. (2019) propose TinyBERT, which also uses a layer-wise distillation strategy for BERT but in both pre-training and fine-tuning stages. Sanh et al. (2019) propose DistilBERT, which successfully halves the depth of the BERT model by knowledge distillation in the pre-training stage and an optional fine-tuning stage.
In contrast to this existing literature, we only use knowledge transfer in the pre-training stage and do not require a fine-tuned teacher or data augmentation (Wu et al., 2019) in the downstream tasks.
Another key difference is that these previous works try to compress BERT by reducing its depth, while we focus on compressing BERT by reducing its width, which has been shown to be more effective (Turc et al., 2019).

MobileBERT
In this section, we present the detailed architecture design of MobileBERT and training strategies to efficiently train MobileBERT. The specific model settings are summarized in Table 1. These settings are obtained by extensive architecture search experiments which will be presented in Section 4.1.

Bottleneck and Inverted-Bottleneck
The architecture of MobileBERT is illustrated in Figure 1(c). It is as deep as BERT_LARGE, but each building block is made much smaller. As shown in Table 1, the hidden dimension of each building block is only 128. On the other hand, we introduce two linear transformations for each building block to adjust its input and output dimensions to 512. Following the terminology in (He et al., 2016), we refer to such an architecture as bottleneck.
It is challenging to train such a deep and thin network. To overcome the training issue, we first construct a teacher network and train it until convergence, and then conduct knowledge transfer from this teacher network to MobileBERT. We find that this is much better than directly training MobileBERT from scratch. Various training strategies will be discussed in a later section. Here, we introduce the architecture design of the teacher network, which is illustrated in Figure 1(b). In fact, the teacher network is just BERT_LARGE augmented with inverted-bottleneck structures (Sandler et al., 2018) to adjust its feature map size to 512. In what follows, we refer to the teacher network as IB-BERT_LARGE. Note that IB-BERT and MobileBERT have the same feature map size, which is 512. Thus, we can directly compare the layer-wise output difference between IB-BERT and MobileBERT. Such a direct comparison is needed in our knowledge transfer strategy.
It is worth pointing out that the simultaneously introduced bottleneck and inverted-bottleneck structures result in a fairly flexible architecture design. One may either only use the bottlenecks for MobileBERT (correspondingly the teacher becomes BERT_LARGE) or only the inverted-bottlenecks for IB-BERT (then there is no bottleneck in MobileBERT) to align their feature maps. However, when using both of them, we can allow IB-BERT_LARGE to preserve the performance of BERT_LARGE while keeping MobileBERT sufficiently compact.
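To make the dimension flow of a bottleneck block concrete, the following is a minimal sketch in plain Python. It only tracks shapes: the two linear transformations map a 512-dimensional token vector down to 128 dimensions and back. The function names, the random weights, and the identity `body` placeholder are illustrative assumptions; in particular, the sketch does not model how attention and feed-forward modules are wired inside the block.

```python
import random

INTER_BLOCK = 512  # feature map size shared by IB-BERT and MobileBERT
INTRA_BLOCK = 128  # hidden size inside each MobileBERT building block


def rand_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters (assumption)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def linear(x, weight):
    """Compute y = x W for one token vector x, with W stored as in_dim rows."""
    return [sum(xi * row[j] for xi, row in zip(x, weight))
            for j in range(len(weight[0]))]


def bottleneck_block(x, w_in, w_out, body):
    """Project 512 -> 128, apply the block body, then project 128 -> 512."""
    h = linear(x, w_in)
    h = body(h)  # hypothetical placeholder for the attention and FFN modules
    return linear(h, w_out)


rng = random.Random(0)
w_in = rand_matrix(INTER_BLOCK, INTRA_BLOCK, rng)
w_out = rand_matrix(INTRA_BLOCK, INTER_BLOCK, rng)
token = [0.1] * INTER_BLOCK
out = bottleneck_block(token, w_in, w_out, body=lambda h: h)
print(len(out))  # 512
```

Because the output size equals the input size, blocks of this form can be stacked, and their outputs can be compared layer by layer with a teacher whose feature maps are also 512-dimensional.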

Stacked Feed-Forward Networks
A problem introduced by the bottleneck structure of MobileBERT is that the balance between the Multi-Head Attention (MHA) module and the Feed-Forward Network (FFN) module is broken. MHA and FFN play different roles in the Transformer architecture: the former allows the model to jointly attend to information from different subspaces, while the latter increases the non-linearity of the model. In original BERT, the ratio of the parameter numbers in MHA and FFN is always 1:2. But in the bottleneck structure, the inputs to the MHA are from wider feature maps (of inter-block size), while the inputs to the FFN are from narrower bottlenecks (of intra-block size). As a result, the MHA modules in MobileBERT contain relatively more parameters.
To fix this issue, we propose to use stacked feed-forward networks in MobileBERT to re-balance the relative size between MHA and FFN. As illustrated in Figure 1(c), each MobileBERT layer contains one MHA but several stacked FFNs. In MobileBERT, we use 4 stacked FFNs after each MHA.
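The re-balancing argument can be checked with rough parameter arithmetic. The sketch below is an illustration under stated assumptions, not the paper's exact accounting: biases are ignored, the query, key, and value projections are all assumed to read the 512-dimensional inter-block input, and the FFN intermediate size of 512 is an assumed value. The BERT_BASE sizes (hidden 768, FFN intermediate 3072) are the standard ones.

```python
def mha_params(input_dim, attn_dim):
    """Q, K, V projections from input_dim plus one output projection (no biases)."""
    return 3 * input_dim * attn_dim + attn_dim * attn_dim


def ffn_params(hidden_dim, intermediate_dim):
    """Two linear layers: hidden -> intermediate -> hidden (no biases)."""
    return 2 * hidden_dim * intermediate_dim


# Original BERT layer: MHA and FFN both operate on the same hidden size.
bert_ratio = mha_params(768, 768) / ffn_params(768, 3072)

# Bottleneck layer (assumed wiring): MHA reads 512-d inputs, FFN works on 128-d.
mha = mha_params(512, 128)
one_ffn = ffn_params(128, 512)
ratio_one_ffn = mha / one_ffn
ratio_four_ffn = mha / (4 * one_ffn)

print(bert_ratio)      # 0.5
print(ratio_one_ffn)   # 1.625
print(ratio_four_ffn)  # 0.40625
```

Under these assumptions, a single FFN leaves the MHA with more parameters than the FFN, while stacking four FFNs brings the MHA-to-FFN ratio back near the 1:2 balance of the original BERT.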

Operational Optimizations
By model latency analysis, we find that layer normalization (Ba et al., 2016) and gelu activation (Hendrycks and Gimpel, 2016) account for a considerable proportion of total latency. Therefore, we propose to replace them with new operations in our MobileBERT.
Remove layer normalization We replace the layer normalization of an n-channel hidden state h with an element-wise linear transformation:
$$\mathrm{NoNorm}(h) = \gamma \circ h + \beta,$$
where γ, β ∈ R^n and ∘ denotes the Hadamard product. Please note that NoNorm has different properties from LayerNorm even in test mode since the original layer normalization is not a linear operation for a batch of vectors.
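The difference between the two operations can be illustrated with a short sketch in plain Python. NoNorm only scales each channel by γ and shifts it by β, whereas layer normalization first computes the mean and variance of the vector. The function names and the `eps` value are illustrative assumptions.

```python
import math


def layer_norm(h, gamma, beta, eps=1e-12):
    """Standard layer normalization of one hidden vector h."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [g * (x - mean) / math.sqrt(var + eps) + b
            for x, g, b in zip(h, gamma, beta)]


def no_norm(h, gamma, beta):
    """Element-wise linear transformation: gamma * h + beta, no statistics."""
    return [g * x + b for x, g, b in zip(h, gamma, beta)]


h = [1.0, 2.0, 3.0]
gamma = [2.0, 2.0, 2.0]
beta = [0.5, 0.5, 0.5]

print(no_norm(h, gamma, beta))  # [2.5, 4.5, 6.5]

# LayerNorm ignores the scale of its input; NoNorm does not.
doubled = [2.0 * x for x in h]
ln_same = all(abs(a - b) < 1e-6
              for a, b in zip(layer_norm(h, gamma, beta),
                              layer_norm(doubled, gamma, beta)))
print(ln_same)                        # True
print(no_norm(doubled, gamma, beta))  # [4.5, 8.5, 12.5]
```

NoNorm needs no reduction over the hidden dimension, which is one plausible reason it is cheaper on device, although the sketch itself does not measure latency.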

Use relu activation
We replace the gelu activation with simpler relu activation (Nair and Hinton, 2010).

Embedding Factorization
The embedding table in BERT models accounts for a substantial proportion of model size. To compress the embedding layer, as shown in Table 1, we reduce the embedding dimension to 128 in MobileBERT. Then, we apply a 1D convolution with kernel size 3 on the raw token embedding to produce a 512-dimensional output.
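A minimal sketch of the factorization follows, assuming a zero-padded ("same") convolution along the sequence and a vocabulary of 30522 WordPiece tokens as in the original English BERT; both are assumptions, since the text specifies only the kernel size and the two dimensions. The convolution is run on tiny toy dimensions to keep the example readable, while the parameter counts use the real sizes.

```python
def conv1d_same(seq, kernel):
    """1D convolution over a token sequence with zero padding.

    seq:    T vectors of size c_in
    kernel: kernel[k][c_in][c_out], k taps
    Returns T vectors of size c_out.
    """
    taps, c_in, c_out = len(kernel), len(kernel[0]), len(kernel[0][0])
    pad = taps // 2
    zero = [0.0] * c_in
    padded = [zero] * pad + list(seq) + [zero] * pad
    out = []
    for t in range(len(seq)):
        vec = [0.0] * c_out
        for dk in range(taps):
            x = padded[t + dk]
            for i in range(c_in):
                for j in range(c_out):
                    vec[j] += x[i] * kernel[dk][i][j]
        out.append(vec)
    return out


# Toy run: 3 tokens, embedding size 2, output size 4, kernel of all ones.
toy_seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_kernel = [[[1.0] * 4 for _ in range(2)] for _ in range(3)]
toy_out = conv1d_same(toy_seq, toy_kernel)
print(toy_out[1])  # [4.0, 4.0, 4.0, 4.0]

# Parameter arithmetic with the real sizes (vocabulary size is an assumption).
VOCAB, EMB, OUT, TAPS = 30522, 128, 512, 3
factorized = VOCAB * EMB + TAPS * EMB * OUT
full_table = VOCAB * OUT
print(factorized)  # 4103424
print(full_table)  # 15627264
```

Under these assumptions, the 128-dimensional table plus the convolution uses roughly a quarter of the parameters of a full 512-dimensional table.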

Training Objectives
We propose to use the following two knowledge transfer objectives, i.e., feature map transfer and attention transfer, to train MobileBERT. Figure 1 illustrates the proposed layer-wise knowledge transfer objectives. Our final layer-wise knowledge transfer loss $\mathcal{L}_{KT}^{\ell}$ for the ℓ-th layer is a linear combination of the two objectives stated below.
Feature Map Transfer (FMT) Since each layer in BERT merely takes the output of the previous layer as input, the most important thing in layer-wise knowledge transfer is that the feature maps of each layer should be as close as possible to those of the teacher. In particular, the mean squared error between the feature maps of the MobileBERT student and the IB-BERT teacher is used as the knowledge transfer objective:
$$\mathcal{L}_{FMT}^{\ell} = \frac{1}{TN} \sum_{t=1}^{T} \sum_{n=1}^{N} \left( H^{tr}_{t,\ell,n} - H^{st}_{t,\ell,n} \right)^{2},$$
where $H^{tr}$ and $H^{st}$ denote the feature maps of the teacher and the student, ℓ is the index of layers, T is the sequence length, and N is the feature map size. In practice, we find that decomposing this loss term into normalized feature map discrepancy and feature map statistics discrepancy can help stabilize training.
Attention Transfer (AT) The attention mechanism greatly boosts the performance of NLP models and has become a crucial building block in Transformer and BERT (Clark et al., 2019a; Jawahar et al., 2019). This motivates us to use self-attention maps from the well-optimized teacher to help the training of MobileBERT in addition to the feature map transfer. In particular, we minimize the KL-divergence between the per-head self-attention distributions of the MobileBERT student and the IB-BERT teacher:
$$\mathcal{L}_{AT}^{\ell} = \frac{1}{TA} \sum_{t=1}^{T} \sum_{a=1}^{A} D_{KL}\left( a^{tr}_{t,\ell,a} \,\|\, a^{st}_{t,\ell,a} \right),$$
where $a^{tr}$ and $a^{st}$ denote the attention distributions of the teacher and the student, and A is the number of attention heads.
Pre-training Distillation (PD) Besides layer-wise knowledge transfer, we can also use a knowledge distillation loss when pre-training MobileBERT. We use a linear combination of the original masked language modeling (MLM) loss, next sentence prediction (NSP) loss, and the new MLM Knowledge Distillation (KD) loss as our pre-training distillation loss:
$$\mathcal{L}_{PD} = \alpha \mathcal{L}_{MLM} + (1 - \alpha) \mathcal{L}_{KD} + \mathcal{L}_{NSP},$$
where α is a hyperparameter in (0, 1).
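The three objectives can be sketched in a few lines of plain Python for a single layer. This is an illustration under assumptions: feature maps are given as T lists of N floats, attention maps as T lists of A probability distributions, the KL divergence is taken from teacher to student, and the combination weights α on the MLM term and 1 − α on the distillation term with the NSP term unweighted. The function names are hypothetical.

```python
import math


def feature_map_transfer(h_teacher, h_student):
    """Mean squared error over T tokens and N features of one layer."""
    t_len, n_feat = len(h_teacher), len(h_teacher[0])
    total = sum((ht - hs) ** 2
                for row_t, row_s in zip(h_teacher, h_student)
                for ht, hs in zip(row_t, row_s))
    return total / (t_len * n_feat)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def attention_transfer(a_teacher, a_student):
    """Average KL(teacher || student) over T query positions and A heads."""
    t_len, heads = len(a_teacher), len(a_teacher[0])
    total = sum(kl_divergence(a_teacher[t][a], a_student[t][a])
                for t in range(t_len) for a in range(heads))
    return total / (t_len * heads)


def pretraining_distillation(l_mlm, l_kd, l_nsp, alpha=0.5):
    """Assumed form: alpha * MLM + (1 - alpha) * KD + NSP."""
    return alpha * l_mlm + (1.0 - alpha) * l_kd + l_nsp


fmt = feature_map_transfer([[1.0, 2.0]], [[0.0, 0.0]])
at_same = attention_transfer([[[0.5, 0.5]]], [[[0.5, 0.5]]])
at_diff = attention_transfer([[[0.9, 0.1]]], [[[0.5, 0.5]]])
pd = pretraining_distillation(2.0, 4.0, 1.0)
print(fmt)          # 2.5
print(at_same)      # 0.0
print(at_diff > 0)  # True
print(pd)           # 4.0
```

The feature map term requires the student and teacher to share the same feature map size, and the attention term requires the same number of heads, which is why both are aligned in the architecture design.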

Training Strategies
Given the objectives defined above, there can be various combination strategies in training. We discuss three strategies in this paper.
Auxiliary Knowledge Transfer In this strategy, we regard intermediate knowledge transfer as an auxiliary task for knowledge distillation. We use a single loss, which is a linear combination of knowledge transfer losses from all layers as well as the pre-training distillation loss.
Joint Knowledge Transfer However, the intermediate knowledge of the IB-BERT teacher (i.e., attention maps and feature maps) may not be an optimal solution for the MobileBERT student. Therefore, we propose to separate these two loss terms: we first train MobileBERT with all layer-wise knowledge transfer losses jointly, and then further train it by pre-training distillation.
Progressive Knowledge Transfer One may also be concerned that if MobileBERT cannot perfectly mimic the IB-BERT teacher, the errors from the lower layers may affect the knowledge transfer in the higher layers. Therefore, we propose to progressively train each layer in the knowledge transfer. The progressive knowledge transfer is divided into L stages, where L is the number of layers.
Diagram of three strategies Figure 2 illustrates the diagram of the three strategies. For joint knowledge transfer and progressive knowledge transfer, there is no knowledge transfer for the beginning embedding layer and the final classifier in the layer-wise knowledge transfer stage. They are copied from the IB-BERT teacher to the MobileBERT student. Moreover, for progressive knowledge transfer, when we train the ℓ-th layer, we freeze all the trainable parameters in the layers below. In practice, we can soften the training process as follows: when training a layer, we further tune the lower layers with a small learning rate rather than entirely freezing them.
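The staged schedule of progressive knowledge transfer can be sketched as a per-layer learning-rate multiplier. The helper below is hypothetical and only illustrates the freezing rule and its softened variant; the value 0.1 for the lower-layer scale is an arbitrary example, not a setting reported in the paper.

```python
def progressive_lr_multipliers(stage, num_layers, lower_layer_scale=0.0):
    """Learning-rate multiplier for each layer during one stage.

    stage:             1-based index of the layer currently being trained
    lower_layer_scale: 0.0 freezes the layers below; a small positive value
                       gives the softened variant that keeps tuning them
    """
    multipliers = []
    for layer in range(1, num_layers + 1):
        if layer == stage:
            multipliers.append(1.0)                # layer being trained
        elif layer < stage:
            multipliers.append(lower_layer_scale)  # frozen or slowly tuned
        else:
            multipliers.append(0.0)                # not reached in this stage
    return multipliers


print(progressive_lr_multipliers(3, 5))
# [0.0, 0.0, 1.0, 0.0, 0.0]
print(progressive_lr_multipliers(3, 5, lower_layer_scale=0.1))
# [0.1, 0.1, 1.0, 0.0, 0.0]
```

Running the helper for stage = 1, ..., L yields the L stages of the schedule, one per layer.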

Experiments
In this section, we first present our architecture search experiments which lead to the model settings in Table 1, and then present the empirical results on benchmarks from MobileBERT and various baselines.

Model Settings
We conduct extensive experiments to search for good model settings for the IB-BERT teacher and the MobileBERT student. We start with SQuAD v1.1 dev F1 score as the performance metric in the search of model settings. In this section, we only train each model for 125k steps with 2048 batch size, which halves the training schedule of original BERT (Devlin et al., 2018; You et al., 2019).
Architecture Search for IB-BERT Our design philosophy for the teacher model is to use as small an inter-block hidden size (feature map size) as possible, as long as there is no accuracy loss. Under this guideline, we design experiments to manipulate the inter-block size of a BERT_LARGE-sized IB-BERT, and the results are shown in Table 2 with labels (a)-(e). We can see that reducing the inter-block hidden size doesn't damage the performance of BERT until it is smaller than 512. Hence, we choose IB-BERT_LARGE with its inter-block hidden size being 512 as the teacher model.
One may wonder whether we can also shrink the intra-block hidden size of the teacher. We conduct experiments and the results are shown in Table 2 with labels (f)-(i). We can see that when the intra-block hidden size is reduced, the model performance is dramatically worse. This means that the intra-block hidden size, which represents the representation power of non-linear modules, plays a crucial role in BERT. Therefore, unlike the inter-block hidden size, we do not shrink the intra-block hidden size of our teacher model.

Architecture Search for MobileBERT
We seek a compression ratio of 4× for BERT_BASE, so we design a set of MobileBERT models, all with approximately 25M parameters but with different balances between the parameter numbers in MHA and FFN, to select a good MobileBERT student model. Table 3 shows our experimental results. From the table, we can see that the model performance reaches its peak when the ratio of parameters in MHA and FFN is 0.4 ∼ 0.6. This may justify why the original Transformer chooses a parameter ratio of MHA to FFN of 0.5.
We choose the architecture with 128 intra-block hidden size and 4 stacked FFNs as the MobileBERT student model in consideration of model accuracy and training efficiency. We also accordingly set the number of attention heads in the teacher model to 4 in preparation for the layer-wise knowledge transfer. Table 1 lists the model settings of our IB-BERT_LARGE teacher and MobileBERT student.
One may wonder whether reducing the number of heads will harm the performance of the teacher model. By comparing (a) and (f) in Table 2, we can see that reducing the number of heads from 16 to 4 does not affect the performance of IB-BERT_LARGE.

Implementation Details
Following BERT (Devlin et al., 2018), we use the BooksCorpus (Zhu et al., 2015) and English Wikipedia as our pre-training data. To make the IB-BERT_LARGE teacher reach the same accuracy as original BERT_LARGE, we train IB-BERT_LARGE on 256 TPU v3 chips for 500k steps with a batch size of 4096 and LAMB optimizer (You et al., 2019). For a fair comparison with the original BERT, we do not use training tricks in other BERT variants (Liu et al., 2019b; Joshi et al., 2019). For MobileBERT, we use the same training schedule in the pre-training distillation stage. Additionally, we use progressive knowledge transfer to train MobileBERT, which takes an additional 240k steps over 24 layers. In ablation studies, we halve the pre-training distillation schedule of MobileBERT to accelerate experiments. Moreover, in the ablation study of knowledge transfer strategies, for a fair comparison, joint knowledge transfer and auxiliary knowledge transfer also take an additional 240k steps.
For the downstream tasks, all reported results are obtained by simply fine-tuning MobileBERT just like what the original BERT does. To fine-tune the pre-trained models, we search the optimization hyperparameters in a search space including different batch sizes (16/32/48), learning rates ((1-10) × 1e-5), and the number of epochs (2-10). The search space is different from the original BERT because we find that MobileBERT usually needs a larger learning rate and more training epochs in fine-tuning. We select the model for testing according to its performance on the development (dev) set.
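The fine-tuning search space can be enumerated as a simple grid. The sketch below assumes that learning rates are tried at integer multiples of 1e-5 and epochs at every integer from 2 to 10; the text gives only the ranges, so the granularity is an assumption, and the `evaluate` callback is a hypothetical stand-in for a fine-tuning run that returns a dev-set score.

```python
from itertools import product

BATCH_SIZES = (16, 32, 48)
LEARNING_RATES = tuple(k * 1e-5 for k in range(1, 11))  # assumed step of 1e-5
EPOCHS = tuple(range(2, 11))                            # assumed integer epochs

grid = list(product(BATCH_SIZES, LEARNING_RATES, EPOCHS))
print(len(grid))  # 270


def select_best(configs, evaluate):
    """Return the configuration with the highest dev-set score."""
    return max(configs, key=evaluate)


# Toy scoring function, for illustration only: prefers lr near 5e-5, 4 epochs.
def toy_score(config):
    batch_size, lr, epochs = config
    return -abs(lr - 5e-5) * 1e5 - abs(epochs - 4) - batch_size / 1000.0


print(select_best(grid, toy_score))  # (16, 5e-05, 4)
```

In practice the search need not be exhaustive; the enumeration only shows how large the stated ranges are when fully crossed.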

Results on GLUE
The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) is a collection of 9 natural language understanding tasks. We compare MobileBERT with BERT_BASE and a few state-of-the-art pre-BERT models on the GLUE leaderboard: OpenAI GPT (Radford et al., 2018) and ELMo (Peters et al., 2018). We also compare with recently proposed compressed BERT models: BERT-PKD (Sun et al., 2019) and DistilBERT (Sanh et al., 2019). To further show the advantage of MobileBERT over recent small BERT models, we also evaluate a smaller variant of our model with approximately 15M parameters called MobileBERT_TINY, which reduces the number of FFNs in each layer and uses a lighter MHA structure. Besides, to verify the performance of MobileBERT on real-world mobile devices, we export the models with TensorFlow Lite APIs and measure the inference latencies on a 4-thread Pixel 4 phone with a fixed sequence length of 128. The results are listed in Table 4.
The metrics for these tasks can be found in the GLUE paper (Wang et al., 2018). "OPT" denotes the operational optimizations introduced in Section 3.3. † denotes that the results are taken from (Jiao et al., 2019). * denotes that it can be unfair to directly compare MobileBERT with these models since MobileBERT is task-agnostically compressed while these models use the teacher model in the fine-tuning stage. † marks our runs with the official code. ‡ denotes that the results are taken from (Jiao et al., 2019).
From the table, we can see that MobileBERT is very competitive on the GLUE benchmark. MobileBERT achieves an overall GLUE score of 77.7, which is only 0.6 lower than BERT_BASE, while be-

Results on SQuAD
SQuAD is a large-scale reading comprehension dataset. SQuAD1.1 (Rajpurkar et al., 2016) only contains questions that always have an answer in the given context, while SQuAD2.0 (Rajpurkar et al., 2018) contains unanswerable questions. We evaluate MobileBERT only on the SQuAD dev datasets, as there is nearly no single model submission on the SQuAD test leaderboard. We compare our MobileBERT with BERT_BASE, DistilBERT, and a strong baseline DocQA (Clark and Gardner, 2017). As shown in Table 5, MobileBERT outperforms all the other models with smaller or similar model sizes by a large margin.

Quantization
We apply the standard post-training quantization in TensorFlow Lite to MobileBERT. The results are shown in Table 6. We find that while quantization can further compress MobileBERT by 4×, there is nearly no performance degradation from it. This indicates that there is still considerable room for further compression of MobileBERT.
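The 4× figure is consistent with storing each 32-bit floating-point weight as one 8-bit integer. The sketch below illustrates this with a simple symmetric per-tensor scheme in plain Python. It is an assumption-laden illustration of the general idea and not a description of the exact scheme that TensorFlow Lite applies.

```python
from array import array


def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return array("b", q), scale  # "b" stores signed 8-bit integers


def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

fp32_bytes = len(weights) * array("f").itemsize
int8_bytes = len(q) * q.itemsize
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(fp32_bytes // int8_bytes)  # 4
print(max_error <= scale / 2 + 1e-12)  # True
```

The rounding error per weight is bounded by half the quantization step, which gives some intuition for why accuracy can survive quantization, although the sketch makes no claim about end-task performance.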

Operational Optimizations
We evaluate the effectiveness of the two operational optimizations introduced in Section 3.3, i.e., replacing layer normalization (LayerNorm) with NoNorm and replacing gelu activation with relu activation. We report the inference latencies using the same experimental setting as in Section 4.6.1. From Table 7, we can see that both NoNorm and relu are very effective in reducing the latency of MobileBERT, while the two operational optimizations do not reduce FLOPS. This reveals the gap between the real-world inference latency and the theoretical computation overhead (i.e., FLOPS).

Training Strategies
We also study how the choice of training strategy, i.e., auxiliary knowledge transfer, joint knowledge transfer, and progressive knowledge transfer, can affect the performance of MobileBERT. As shown in Table 8, progressive knowledge transfer consistently outperforms the other two strategies. We notice that there is a significant performance gap between auxiliary knowledge transfer and the other two strategies. We think the reason is that the intermediate layer-wise knowledge (i.e., attention maps and feature maps) from the teacher may not be optimal for the student, so the student needs an additional pre-training distillation stage to fine-tune its parameters.

Training Objectives
We finally conduct a set of ablation experiments with regard to Attention Transfer (AT), Feature Map Transfer (FMT) and Pre-training Distillation (PD).The operational OPTimizations (OPT) are removed in these experiments to make a fair comparison between MobileBERT and the original BERT.
The results are listed in Table 9. We can see that the proposed Feature Map Transfer contributes most to the performance improvement of MobileBERT, while Attention Transfer and Pre-training Distillation also play positive roles. We can also find that our IB-BERT_LARGE teacher is as powerful as the original BERT_LARGE, while MobileBERT degrades greatly when compared to its teacher. So we believe that there is still considerable room for improving MobileBERT.

Conclusion
We have presented MobileBERT, which is a task-agnostic compact variant of BERT. Empirical results on popular NLP benchmarks show that MobileBERT is comparable with BERT_BASE while being much smaller and faster. MobileBERT can enable various NLP applications to be easily deployed on mobile devices.
In this paper, we show that 1) it is crucial to keep MobileBERT deep and thin, 2) bottleneck/inverted-bottleneck structures enable effective layer-wise knowledge transfer, and 3) progressive knowledge transfer can efficiently train MobileBERT. We believe our findings are generic and can be applied to other model compression problems.

Appendix for "MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices"

A Extra Related Work on Knowledge Transfer
Exploiting knowledge transfer to compress model size was first proposed by Bucilu et al. (2006). The idea was then adopted in knowledge distillation (Hinton et al., 2015), which requires the smaller student network to mimic the class distribution output of the larger teacher network. Fitnets (Romero et al., 2014) make the student mimic the intermediate hidden layers of the teacher to train narrow and deep networks. Luo et al. (2016) show that the knowledge of the teacher can also be obtained from the neurons in the top hidden layer. Similar to our proposed progressive knowledge transfer scheme, Yeo et al. (2018) proposed a sequential knowledge transfer scheme to distill knowledge from a deep teacher into a shallow student in a sequential way. Zagoruyko and Komodakis (2016) proposed to transfer the attention maps of the teacher on images. Li et al. (2019) proposed to transfer the similarity of hidden states and word alignment from an autoregressive Transformer teacher to a non-autoregressive student.

B Extra Related Work on Compact Architecture Design
Much recent research has focused on improving efficient Convolutional Neural Networks (CNN) for mobile vision applications (Iandola et al., 2016; Howard et al., 2017; Zhang et al., 2017, 2018; Sandler et al., 2018; Tan et al., 2019; Howard et al., 2019), but these techniques are usually tailored for CNN.
Popular lightweight operations such as depth-wise convolution (Howard et al., 2017) cannot be directly applied to Transformer or BERT. In the NLP literature, the most relevant work is arguably group LSTMs (Kuchaiev and Ginsburg, 2017; Gao et al., 2018), which bring the idea of group convolution (Zhang et al., 2017, 2018) into Recurrent Neural Networks (RNN).

C Visualization of Attention Distributions
We visualize the attention distributions of the 1st and the 12th layers of a few models in the ablation study for further investigation. They are shown in Figure 3. We find that the proposed attention transfer can help the student mimic the attention distributions of the teacher very well. Surprisingly, we find that the attention distributions in the attention heads of "MobileBERT(bare)+PD+FMT" are exactly a re-order of those of "MobileBERT(bare)+PD+FMT+AT" (also the teacher model), even if it has not been trained by the attention transfer objective. This phenomenon indicates that multi-head attention is a crucial and unique part of the non-linearity of BERT. Moreover, it can explain the minor improvements of Attention Transfer in the ablation study table, since the alignment of feature maps leads to the alignment of attention distributions.

D Extra Experimental Settings
For a fair comparison with original BERT, we follow the same pre-processing scheme as BERT, where we mask 15% of all WordPiece (Kudo and Richardson, 2018) tokens in each sequence at random and use next sentence prediction.Please note that MobileBERT can be potentially further improved by several training techniques recently introduced, such as span prediction (Joshi et al., 2019) or removing next sentence prediction objective (Liu et al., 2019b).We leave it for future work.
In pre-training distillation, the hyperparameter α is used to balance the original masked language modeling loss and the distillation loss.Following (Kim and Rush, 2016), we set α to 0.5.

E Architecture of MobileBERT_TINY
We use a lighter MHA structure for MobileBERT_TINY.
As illustrated in Figure 4, instead of using hidden states from the inter-block feature maps as inputs to MHA, we use the reduced intra-block feature maps as key, query, and values in MHA for MobileBERT_TINY. This can effectively reduce the parameters in MHA modules, but might harm the model capacity.

F GLUE Dataset
In this section, we provide a brief description of the tasks in the GLUE benchmark (Wang et al., 2018).
CoLA The Corpus of Linguistic Acceptability (Warstadt et al., 2018) is a collection of English acceptability judgments drawn from books and journal articles on linguistic theory. The task is to predict whether an example is a grammatical English sentence and is evaluated by Matthews correlation coefficient (Matthews, 1975).

SST-2 The Stanford Sentiment Treebank (Socher et al., 2013) is a collection of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence and is evaluated by accuracy.
MRPC The Microsoft Research Paraphrase Corpus (Dolan and Brockett, 2005) is a collection of sentence pairs automatically extracted from online news sources.They are labeled by human annotations for whether the sentences in the pair are semantically equivalent.The performance is evaluated by both accuracy and F1 score.

STS-B The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data. Each pair is human-annotated with a similarity score from 1 to 5. The task is to predict these scores and is evaluated by Pearson and Spearman correlation coefficients.
QQP The Quora Question Pairs (Chen et al., 2018; https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) dataset is a collection of question pairs from the community question-answering website Quora. The task is to determine whether a pair of questions is semantically equivalent and is evaluated by both accuracy and F1 score.
MNLI The Multi-Genre Natural Language Inference Corpus (Williams et al., 2018) is a collection of sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts the hypothesis (contradiction), or neither (neutral) and is evaluated by accuracy on both matched (in-domain) and mismatched (cross-domain) sections of the test data.
QNLI The Question-answering NLI dataset is converted from the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016). The task is to determine whether the context sentence contains the answer to the question and is evaluated by the test accuracy.
RTE The Recognizing Textual Entailment (RTE) datasets come from a series of annual textual entailment challenges (Bentivogli et al., 2009). The task is to predict whether the sentences in a sentence pair stand in an entailment relation and is evaluated by accuracy.
WNLI The Winograd Schema Challenge (Levesque et al., 2011) is a reading comprehension task in which a system must read a sentence with a pronoun and select the referent of that pronoun from a list of choices. We follow Devlin et al. (2018) to skip this task in our experiments, because few previous works do better than predicting the majority class for this task.

Figure 2: Diagrams of (a) auxiliary knowledge transfer (AKT), (b) joint knowledge transfer (JKT), and (c) progressive knowledge transfer (PKT). Lighter colored blocks represent that they are frozen in that stage.

Figure 3: The visualization of the attention distributions in some attention heads of the IB-BERT teacher and different MobileBERT models.

Figure 4: Illustration of MobileBERT_TINY. Red lines denote inter-block flows while blue lines denote intra-block flows.

Table 1: The detailed model settings of a few models. h_inter, h_FFN, h_embedding, #Head and #Params denote the inter-block hidden size (feature map size), FFN intermediate size, embedding table size, the number of heads in multi-head attention, and the number of parameters, respectively.

Table 2: Experimental results on SQuAD v1.1 dev F1 score in search of good model settings for the IB-BERT_LARGE teacher. The number of layers is set to 24 for all models.

Table 3: Experimental results on SQuAD v1.1 dev F1 score in search of good model settings for the MobileBERT student. The number of layers is set to 24 and the inter-block hidden size is set to 512 for all models.

Table 4: The test results on the GLUE benchmark (except WNLI). The number below each task denotes the number of training examples.

Table 5: The results on the SQuAD dev datasets.

Table 7: The effectiveness of operational optimizations on real-world inference latency for MobileBERT.

Table 8: Ablation study of MobileBERT on GLUE dev accuracy and SQuAD v1.1 dev F1 score with Auxiliary Knowledge Transfer (AKT), Joint Knowledge Transfer (JKT), and Progressive Knowledge Transfer (PKT).

Table 9: Ablation on the dev sets of GLUE benchmark. BERT_BASE and the bare MobileBERT (i.e., w/o PD, AT, FMT & OPT) use the standard BERT pre-training scheme. PD, AT, FMT, and OPT denote Pre-training Distillation, Attention Transfer, Feature Map Transfer, and operational OPTimizations respectively. † marks our runs with the official code.