BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth Mover’s Distance

Pre-trained language models (e.g., BERT) have achieved significant success in various natural language processing (NLP) tasks. However, high storage and computational costs hinder pre-trained language models from being effectively deployed on resource-constrained devices. In this paper, we propose a novel BERT distillation method based on many-to-many layer mapping, which allows each intermediate student layer to learn from any intermediate teacher layer. In this way, our model can learn from different teacher layers adaptively for various NLP tasks. In addition, we leverage Earth Mover's Distance (EMD) to compute the minimum cumulative cost that must be paid to transform knowledge from the teacher network to the student network, which enables effective matching for many-to-many layer mapping. Furthermore, we propose a cost attention mechanism that learns the layer weights used in EMD automatically, which further improves the model's performance and accelerates convergence. Extensive experiments on the GLUE benchmark demonstrate that our model achieves competitive performance compared to strong competitors in terms of both accuracy and model compression.


Introduction
In recent years, pre-trained language models, such as GPT (Radford et al., 2018), BERT (Devlin et al., 2018), and XLNet (Yang et al., 2019), have been proposed and applied to many NLP tasks, yielding state-of-the-art performance. However, the promising results of pre-trained language models come with high costs of computation and memory at inference time, which prevent these models from being deployed on resource-constrained devices and in real-time applications. For example, the original BERT-base model, which achieved great success in many NLP tasks, has 12 layers and about 110 million parameters. It is therefore critical to accelerate inference and reduce the computational workload while maintaining accuracy. This research issue has attracted increasing attention (Wang et al., 2019; Shen et al., 2019; Tang et al., 2019), among which knowledge distillation (Tang et al., 2019) is considered a practical solution. Typically, knowledge distillation techniques train a compact and shallow student network under the guidance of a complicated, larger teacher network with a teacher-student strategy (Watanabe et al., 2017). Once trained, this compact student network can be directly deployed in real-life applications.

* Equal contribution. † Min Yang is the corresponding author.
So far, there have been several studies, such as DistilBERT (Tang et al., 2019), BERT-PKD, and TinyBERT (Jiao et al., 2019), which attempt to compress the original BERT into a lightweight student model without performance sacrifice based on knowledge distillation. For example, BERT-PKD and TinyBERT (Jiao et al., 2019) are two representative BERT compression approaches, which encourage the student model to extract knowledge from both the last layer and the intermediate layers of the teacher network.
Despite the effectiveness of previous studies, several challenges remain for distilling comprehensive knowledge from the teacher model, which are not addressed well in prior works. First, existing compression methods learn a one-to-one layer mapping, where each student layer is guided by only one specific teacher layer. For example, BERT-PKD uses teacher layers 2, 4, 6, 8, and 10 to guide student layers 1 to 5, respectively. However, these one-to-one layer mapping strategies are assigned based on empirical observations without theoretical guidance. Second, as revealed in (Clark et al., 2019), different BERT layers can learn different levels of linguistic knowledge. The one-to-one layer mapping strategy cannot learn an optimal, unified compressed model for different NLP tasks. In addition, most previous works do not consider the importance of each teacher layer and use the same layer weights across various tasks, which creates a substantial barrier to generalizing the compressed model to different NLP tasks. Therefore, an adaptive compression model should be designed to transfer knowledge from all teacher layers dynamically and effectively for different NLP tasks.
To address the aforementioned issues, we propose a novel BERT compression approach based on many-to-many layer mapping and Earth Mover's Distance (EMD) (Rubner et al., 2000), called BERT-EMD. First, we design a many-to-many layer mapping strategy, where each intermediate student layer has the chance to learn from all the intermediate teacher layers. In this way, BERT-EMD can learn from different intermediate teacher layers adaptively for different NLP tasks, motivated by the intuition that different NLP tasks require different levels of linguistic knowledge contained in the intermediate layers of BERT. Second, to learn an optimal many-to-many layer mapping strategy, we leverage EMD to compute the minimum cumulative cost that must be paid to transform knowledge from the teacher network to the student network. EMD is a well-studied optimization problem and provides a suitable solution for transferring knowledge from the teacher network in a holistic fashion.
We summarize our main contributions as follows. (1) We propose a novel many-to-many layer mapping strategy for compressing the intermediate layers of BERT in an adaptive and holistic fashion.
(2) We leverage EMD to formulate the distance between the teacher and student networks, and learn an optimal many-to-many layer mapping based on a solution to the well-known transportation problem. (3) We propose a cost attention mechanism to learn the layer weights used in EMD automatically, which can further improve the model's performance and accelerate convergence time. (4) Extensive experiments on GLUE tasks show that BERT-EMD achieves better performance than the state-of-the-art BERT distillation methods.

Related Work
Language models pre-trained on large-scale corpora can learn universal language representations, which have proven to be effective in many NLP tasks (Mikolov et al., 2013; Pennington et al., 2014; Joulin et al., 2016). Early efforts mainly focused on learning good word embeddings, such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). Although these pre-trained embeddings can capture the semantic meanings of words, they are context-free and fail to capture higher-level concepts in context, such as syntactic structures and polysemy disambiguation. Subsequently, researchers shifted attention to learning contextual word embeddings, such as ELMo (Peters et al., 2018), ULMFiT (Howard and Ruder, 2018), GPT (Radford et al., 2018), BERT (Devlin et al., 2018), ERNIE (Zhang et al., 2019), XLNet (Yang et al., 2019), and RoBERTa. For example, Devlin et al. (2018) released BERT-base with 110 million parameters and BERT-large with 340 million parameters, which achieved significantly better results than previous methods on GLUE tasks.
However, along with their high performance, pre-trained language models (e.g., BERT) usually have a large number of parameters, which incur high computation and memory costs at inference time. Recently, many attempts have been made to reduce the computation overhead and model storage of pre-trained language models without performance sacrifice. Existing compression techniques can be divided into three categories: low-rank matrix factorization (Wang et al., 2019), quantization (Shen et al., 2019), and knowledge distillation (Tang et al., 2019). Next, we mainly review the related works that use knowledge distillation to compress the BERT model.
Knowledge distillation using the teacher-student strategy learns a lightweight student network under the guidance of a large and complicated teacher network. Mukherjee and Awadallah (2019) distilled BERT into an LSTM network via both hard and soft distillation. The BERT-PKD model transfers the knowledge from both the final layer and the intermediate layers of the teacher network. Jiao et al. (2019) proposed the TinyBERT model, which performs Transformer distillation at both the pre-training and fine-tuning stages. Xu et al. (2020) proposed the BERT-of-Theseus model, which learns a compact student network by replacing the teacher layers with their substitutes. Sun et al. (2020) introduced the MobileBERT model, which has the same number of layers as the teacher network but is much narrower thanks to bottleneck structures. Wang et al. (2020) distilled the self-attention module of the last Transformer layer of the teacher network.
However, the aforementioned BERT compression approaches struggle to find an optimal layer mapping between the teacher and student networks. Each student layer merely learns from a single teacher layer, which may lose rich linguistic knowledge contained in the teacher network. Different from previous methods, we propose a many-to-many layer mapping method for BERT distillation, where each intermediate student layer can learn from any intermediate teacher layer adaptively. In addition, the Earth Mover's Distance is applied to learn the optimal many-to-many layer mapping solution.

Methodology
In this section, we propose a novel BERT compression method based on many-to-many layer mapping and Earth Mover's Distance (called BERT-EMD). In addition, we also propose a cost attention mechanism to learn the layer weights used in EMD automatically.

Overview of BERT-EMD
The main idea behind BERT-EMD is to transfer knowledge from a large teacher network T (large BERT) to a small student network S (BERT-EMD). Both the student and teacher networks are implemented with an embedding layer, several Transformer layers, and a prediction layer. We assume that the teacher network has M Transformer layers and the student network has N Transformer layers. Each Transformer layer contains an attention layer and a hidden layer.
Similar to TinyBERT (Jiao et al., 2019), our method also includes three primary distillation components: the embedding-layer distillation, the Transformer distillation, and the prediction-layer distillation. Concretely, both the embedding-layer distillation and the prediction-layer distillation employ the one-to-one layer mapping as in TinyBERT and BERT-PKD, where the two student layers are guided by the corresponding teacher layers, respectively. However, different from previous works, we propose to exploit the many-to-many layer mapping for the Transformer (intermediate-layer) distillation, which consists of attention-based distillation and hidden states-based distillation, where each student attention layer (resp. hidden layer) can learn from any teacher attention layer (resp. hidden layer). In this way, BERT-EMD can learn from different intermediate teacher layers adaptively for different NLP tasks, motivated by the intuition that different NLP tasks require different levels of linguistic knowledge contained in the attention and hidden layers of BERT. Next, we will describe the four distillation strategies of BERT-EMD in detail.

Embedding-layer Distillation
Word embeddings are vital in NLP tasks and have been extensively studied in recent years. Better representations of words have come at the cost of huge memory footprints, so compressing embedding matrices without sacrificing model performance is essential for real-world applications. To this end, we minimize the mean squared error (MSE) between the embedding layers of the teacher and student networks:

L_emb = MSE(E^S W_e, E^T)

where the matrices E^S and E^T represent the embeddings of the student and teacher networks, and W_e is a projection parameter to be learned, which maps the student embeddings into a space of the same shape as the teacher embeddings.
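As a rough, self-contained illustration (not the authors' implementation), the embedding-layer loss can be sketched in numpy; the toy shapes and the random projection `W_e` below are hypothetical stand-ins for the learned parameter:

```python
import numpy as np

def embedding_distillation_loss(E_S, W_e, E_T):
    """MSE between projected student embeddings and teacher embeddings.

    E_S: student embeddings, shape (seq_len, d_student)
    W_e: learnable projection, shape (d_student, d_teacher)
    E_T: teacher embeddings, shape (seq_len, d_teacher)
    """
    diff = E_S @ W_e - E_T
    return float(np.mean(diff ** 2))

# Toy example with hypothetical sizes (seq_len=4, d_student=3, d_teacher=6).
rng = np.random.default_rng(0)
E_S = rng.normal(size=(4, 3))
W_e = rng.normal(size=(3, 6))
E_T = rng.normal(size=(4, 6))
loss = embedding_distillation_loss(E_S, W_e, E_T)
```

In training, `W_e` would be optimized jointly with the student so that the projected student embeddings approach the teacher's.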

Prediction-layer Distillation
The student network also learns from the probability logits provided by the teacher network. We minimize the prediction-layer distillation loss:

L_pred = CE(z^T / t, z^S / t)

where CE denotes the soft cross-entropy, z^T and z^S represent the logits predicted by the teacher and student, respectively, and t is a temperature value.
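A minimal sketch of the temperature-softened soft cross-entropy (our reading of the loss, not the authors' code; the logits and temperature are toy values):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def prediction_distillation_loss(z_T, z_S, t=1.0):
    """Soft cross-entropy between temperature-scaled teacher/student logits."""
    p_T = softmax(np.asarray(z_T, dtype=float) / t)
    log_p_S = np.log(softmax(np.asarray(z_S, dtype=float) / t))
    return float(-(p_T * log_p_S).sum())

# The loss shrinks as the student's logits approach the teacher's.
z_T = [2.0, 0.5, -1.0]
far = prediction_distillation_loss(z_T, [0.0, 0.0, 0.0], t=3.0)
near = prediction_distillation_loss(z_T, [2.0, 0.5, -1.0], t=3.0)
```

A larger `t` softens both distributions, exposing more of the teacher's relative preferences among non-argmax classes.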

Transformer Distillation with Earth Mover's Distance
Instead of imposing a one-to-one layer mapping as in previous works (Jiao et al., 2019), our Transformer distillation approach allows many-to-many layer mapping and is capable of generalizing to various NLP tasks. The Earth Mover's Distance (EMD) is employed to measure the dissimilarity (distance) between the teacher and student networks as the minimum cumulative cost of transforming knowledge from the teacher network to the student network. The key insight is to view network layers as distributions, and the desired transformation should make the two distributions (teacher and student layers) close.

Attention-based Distillation
We use the attention-based distillation to transform the linguistic knowledge from the teacher network to the student network based on EMD. Formally, let A^T = {(A^T_1, w_{A^T_1}), ..., (A^T_M, w_{A^T_M})} be the teacher attention layers and A^S = {(A^S_1, w_{A^S_1}), ..., (A^S_N, w_{A^S_N})} be the student attention layers, where M and N represent the numbers of attention layers in the teacher and student networks, respectively. Each A^T_i (resp. A^S_j) represents the i-th teacher (resp. j-th student) attention layer, and w_{A^T_i} (resp. w_{A^S_j}) indicates the corresponding layer weight, initialized as 1/M (resp. 1/N). We also define a "ground" distance matrix D^A = [d^A_{ij}], where d^A_{ij} = MSE(A^T_i, A^S_j) measures the cost of transferring attention knowledge from the i-th teacher layer to the j-th student layer. Then, we attempt to find a mapping flow F^A = [f^A_{ij}], with f^A_{ij} the mapping flow between A^T_i and A^S_j, that minimizes the cumulative cost required to transform knowledge from the teacher attention layers A^T to the student attention layers A^S:

min_{F^A} Σ_{i=1}^{M} Σ_{j=1}^{N} f^A_{ij} d^A_{ij}

subject to the following constraints:

f^A_{ij} ≥ 0,  1 ≤ i ≤ M, 1 ≤ j ≤ N
Σ_{j=1}^{N} f^A_{ij} ≤ w_{A^T_i},  1 ≤ i ≤ M
Σ_{i=1}^{M} f^A_{ij} ≤ w_{A^S_j},  1 ≤ j ≤ N
Σ_{i=1}^{M} Σ_{j=1}^{N} f^A_{ij} = min(Σ_{i=1}^{M} w_{A^T_i}, Σ_{j=1}^{N} w_{A^S_j})

where the first constraint forces the mapping flow to be positive. The second constraint limits the amount of attention information that can be sent by A^T to their weights. The third constraint limits the attention information that can be received by A^S. The fourth constraint limits the amount of total flow. The above optimization is a well-studied transportation problem (Hitchcock, 1941), which can be solved by previously developed methods (Rachev, 1985). Once the optimal mapping flow F^A is learned, we can define the Earth Mover's Distance as the work normalized by the total flow:

EMD(A^T, A^S) = (Σ_{i=1}^{M} Σ_{j=1}^{N} f^A_{ij} d^A_{ij}) / (Σ_{i=1}^{M} Σ_{j=1}^{N} f^A_{ij})

Finally, the objective function for the attention-based distillation can be defined by the EMD between A^T and A^S:

L_attn = EMD(A^T, A^S)

Hidden States-based Distillation

Similar to attention-based distillation, we also learn the hidden layer mapping based on EMD. Formally, let H^T = {(H^T_1, w_{H^T_1}), ..., (H^T_M, w_{H^T_M})} be the teacher hidden layers and H^S = {(H^S_1, w_{H^S_1}), ..., (H^S_N, w_{H^S_N})} be the student hidden layers, with the ground distance d^H_{ij} = MSE(H^S_j W_h, H^T_i), where W_h is a learnable projection that maps the student hidden states into the teacher's hidden space. We then find the mapping flow F^H = [f^H_{ij}] that minimizes Σ_{i,j} f^H_{ij} d^H_{ij}, subject to the following constraints, analogous to those of the attention-based distillation:

f^H_{ij} ≥ 0;  Σ_{j} f^H_{ij} ≤ w_{H^T_i};  Σ_{i} f^H_{ij} ≤ w_{H^S_j};  Σ_{i,j} f^H_{ij} = min(Σ_{i} w_{H^T_i}, Σ_{j} w_{H^S_j})

After solving the above optimization problem, we obtain the optimal mapping flow F^H. The Earth Mover's Distance can then be defined as the work normalized by the total flow:

EMD(H^T, H^S) = (Σ_{i,j} f^H_{ij} d^H_{ij}) / (Σ_{i,j} f^H_{ij})

Finally, the objective function for the hidden states-based distillation can be defined by the Earth Mover's Distance between H^T and H^S:

L_hidden = EMD(H^T, H^S)
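To make the transportation-problem view concrete, the following self-contained sketch builds a ground distance matrix from per-layer MSEs and computes a feasible flow with a simple cheapest-cell-first greedy rule. The greedy rule is only a stand-in for the exact solvers the paper relies on (Hitchcock, 1941; Rachev, 1985), and all layer shapes and weights here are toy assumptions:

```python
import numpy as np

def greedy_transport(w_T, w_S, D):
    """Feasible flow for the transportation problem: fill cheapest cells first.

    w_T: teacher layer weights (supplies), length M
    w_S: student layer weights (demands), length N
    D:   ground distance matrix, shape (M, N)
    Returns (F, emd): flow matrix and the resulting normalized cost.
    Note: greedy gives a feasible, not necessarily optimal, EMD flow.
    """
    supply, demand = list(w_T), list(w_S)
    M, N = D.shape
    F = np.zeros((M, N))
    # Visit cells in order of increasing transfer cost.
    for i, j in sorted(((i, j) for i in range(M) for j in range(N)),
                       key=lambda ij: D[ij]):
        move = min(supply[i], demand[j])
        F[i, j] = move
        supply[i] -= move
        demand[j] -= move
    emd = float((F * D).sum() / F.sum())
    return F, emd

# Toy setup: M=4 teacher layers, N=2 student layers, uniform weights.
rng = np.random.default_rng(1)
A_T = [rng.normal(size=(8, 8)) for _ in range(4)]  # fake attention maps
A_S = [rng.normal(size=(8, 8)) for _ in range(2)]
D = np.array([[np.mean((t - s) ** 2) for s in A_S] for t in A_T])
F, emd = greedy_transport([0.25] * 4, [0.5] * 2, D)
```

Each row of `F` shows how much of one teacher layer's weight flows to each student layer, i.e. a soft many-to-many layer mapping.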

Weight Update with Cost Attention
In the EMD defined above, each teacher layer (resp. student layer) is assigned an equal weight w^T = 1/M (resp. w^S = 1/N). Since different attention and hidden layers of BERT can learn different levels of linguistic knowledge, these layers should have different weights for various NLP tasks. Therefore, we propose a cost attention mechanism to assign a weight to each attention and hidden layer automatically.
The main idea behind the cost attention is to make the teacher and student Transformer networks as close as possible. That is, we can reduce the overall cost of EMD by increasing the weights of the layers with low flow cost, while adaptively decreasing the weights of the layers with high flow cost.
We take the weight-updating process of the teacher network as an example. The cost attention mechanism is performed in three steps after learning the optimal solution (flow matrices F^A and F^H in EMD). First, we compute the transferring cost between each teacher layer and the student layers (unit transferring cost). Formally, let C̄_{A^T_i} and C̄_{H^T_i} be the unit transferring costs of the i-th teacher attention and hidden layers, respectively, which can be computed as:

C̄_{A^T_i} = (Σ_{j=1}^{N} f^A_{ij} d^A_{ij}) / (Σ_{j=1}^{N} f^A_{ij}),  C̄_{H^T_i} = (Σ_{j=1}^{N} f^H_{ij} d^H_{ij}) / (Σ_{j=1}^{N} f^H_{ij})

Second, we update the weights (w_{A^T_i} and w_{H^T_i}) of the teacher attention and hidden layers based on the learned unit transferring cost. Specifically, we compute the updated weights w̃_{A^T_i} and w̃_{H^T_i} as the inverse ratio of the transferring costs:

w̃_{A^T_i} = 1 / C̄_{A^T_i},  w̃_{H^T_i} = 1 / C̄_{H^T_i}

Finally, we normalize the updated layer weights used in EMD via a softmax, introducing a temperature coefficient τ to smooth the results, and update the weight w̃^T_i of the i-th Transformer layer used in EMD by averaging the corresponding weights of the attention and hidden layers:

w̃^T_i = (1/2) (softmax(w̃_{A^T}/τ)_i + softmax(w̃_{H^T}/τ)_i)

It is noteworthy that the learned new weights are used as the constraints when optimizing the EMD problem in the next batch. Specifically, we initialize the i-th teacher attention and hidden layer weights (w_{A^T_i} and w_{H^T_i}) in the η-th batch with the updated weight w̃^T_i learned in the (η−1)-th batch. In this way, we can further improve the performance of BERT-EMD and accelerate convergence.
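The three steps can be sketched as follows; the exact inverse-ratio and averaging formulas here are our reading of the description above, not the authors' code, and the flow/distance matrices are toy values:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cost_attention_update(F_A, D_A, F_H, D_H, tau=1.0):
    """Update teacher-layer weights from the EMD flows and ground costs.

    F_A, D_A: flow and distance matrices for the attention layers, shape (M, N)
    F_H, D_H: the same for the hidden layers
    Returns new per-teacher-layer weights (length M, summing to 1).
    """
    eps = 1e-12
    # Step 1: unit transferring cost of each teacher layer.
    c_A = (F_A * D_A).sum(axis=1) / (F_A.sum(axis=1) + eps)
    c_H = (F_H * D_H).sum(axis=1) / (F_H.sum(axis=1) + eps)
    # Step 2: inverse ratio -- cheap-to-transfer layers get larger weights.
    inv_A, inv_H = 1.0 / (c_A + eps), 1.0 / (c_H + eps)
    # Step 3: temperature-smoothed softmax, then average attention/hidden.
    return 0.5 * (softmax(inv_A / tau) + softmax(inv_H / tau))

# Toy example: 3 teacher layers, 2 student layers.
rng = np.random.default_rng(2)
F_A = rng.uniform(size=(3, 2)); D_A = rng.uniform(size=(3, 2))
F_H = rng.uniform(size=(3, 2)); D_H = rng.uniform(size=(3, 2))
w_T = cost_attention_update(F_A, D_A, F_H, D_H, tau=2.0)
```

The returned weights would then seed the supply constraints of the EMD problem solved on the next batch.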

Overall Learning Objective
Finally, we combine the embedding-layer distillation, attention-based distillation, hidden states-based distillation, and prediction-layer distillation objectives to form the overall knowledge distillation objective as follows:

L_model = β (L_emb + L_attn + L_hidden) + L_pred

where β is a factor that controls the weights of the three distillation objectives (L_emb, L_attn, L_hidden).
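As a trivial sketch of the weighted combination (β and the individual loss values are placeholders):

```python
def overall_distillation_loss(l_emb, l_attn, l_hidden, l_pred, beta=0.01):
    """Weighted combination: beta scales the three intermediate distillation
    losses, while the prediction-layer loss keeps unit weight."""
    return beta * (l_emb + l_attn + l_hidden) + l_pred

total = overall_distillation_loss(0.4, 1.2, 0.9, 0.05, beta=0.01)
```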

Experimental Data
We evaluate our BERT-EMD model on the General Language Understanding Evaluation (GLUE) (Wang et al., 2018) benchmark, which is a collection of nine diverse sentence-level classification tasks. Concretely, GLUE consists of (i) Microsoft Research Paraphrase Matching (MRPC), Quora Question Pairs (QQP) and Semantic Textual Similarity Benchmark (STS-B) for paraphrase similarity matching; (ii) Stanford Sentiment Treebank (SST-2) for sentiment classification; (iii) Multi-Genre Natural Language Inference Matched (MNLI-m), Multi-Genre Natural Language Inference Mismatched (MNLI-mm), Question Natural Language Inference (QNLI) and Recognizing Textual Entailment (RTE) for natural language inference task; and (iv) the Corpus of Linguistic Acceptability (CoLA) for linguistic acceptability.

Evaluation Metrics
Following previous works (Jiao et al., 2019), we use classification accuracy as the evaluation metric for the SST-2, MNLI-m, MNLI-mm, QNLI, and RTE datasets. For a fair comparison with TinyBERT (Jiao et al., 2019), the F1 metric is adopted for the MRPC and QQP datasets, the Spearman correlation for STS-B, and the Matthews correlation for CoLA. The results reported for the GLUE test set are in the same format as on the official leaderboard.

Implementation Details
Similar to TinyBERT, our BERT-EMD method also contains a general distillation stage and a task-specific distillation stage. In particular, we initialize our student model with the general distillation model provided by TinyBERT. The teacher model is a 12-layer BERT-Base model, which is fine-tuned for each task to perform knowledge distillation.
We employ a grid search on the validation set to tune the hyper-parameters. Since there are many hyper-parameter combinations, we first perform the grid search over β and the learning rate. Then, we fix the values of these two hyper-parameters and tune the remaining ones. Specifically, the batch size is 32, the learning rate is tuned over {5e-5, 2e-5, 1e-5}, the temperature t of the prediction-layer distillation is tuned over {1, 3, 7, 10}, the temperature coefficient τ is tuned over {1, 2, 5, 10}, and β is tuned over {0.01, 0.001, 0.005}.
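The two-stage search can be sketched as follows; `evaluate` is a hypothetical stand-in for fine-tuning the student and scoring it on the validation set, and the grids mirror the values above:

```python
import itertools

def two_stage_grid_search(evaluate):
    """Stage 1: tune beta and the learning rate. Stage 2: fix them and tune
    the softening temperature t and the cost-attention temperature tau."""
    lrs = [5e-5, 2e-5, 1e-5]
    betas = [0.01, 0.001, 0.005]
    ts = [1, 3, 7, 10]
    taus = [1, 2, 5, 10]
    # Stage 1: coarse search over (beta, lr) with default t and tau.
    best_beta, best_lr = max(
        itertools.product(betas, lrs),
        key=lambda p: evaluate(beta=p[0], lr=p[1], t=1, tau=1))
    # Stage 2: fix (beta, lr) and search the remaining grid.
    best_t, best_tau = max(
        itertools.product(ts, taus),
        key=lambda p: evaluate(beta=best_beta, lr=best_lr, t=p[0], tau=p[1]))
    return dict(beta=best_beta, lr=best_lr, t=best_t, tau=best_tau)

# Dummy scorer for illustration: prefers lr=2e-5, beta=0.01, t=3, tau=2.
def dummy_eval(beta, lr, t, tau):
    return -abs(lr - 2e-5) * 1e4 - abs(beta - 0.01) - abs(t - 3) - abs(tau - 2)

best = two_stage_grid_search(dummy_eval)
```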

Baseline Methods
In this paper, we compare our BERT-EMD with several state-of-the-art BERT compression approaches, including the original 4/6-layer BERT models (Devlin et al., 2018), DistilBERT (Tang et al., 2019), BERT-PKD, TinyBERT (Jiao et al., 2019), and BERT-of-Theseus (Xu et al., 2020). However, the original TinyBERT employs a data augmentation strategy in the training process, which the other baseline models do not use. For a fair comparison, we re-implement the TinyBERT model without the data augmentation strategy.
It is noteworthy that we do not compare BERT-EMD with the recent MobileBERT (Sun et al., 2020) and MiniLM (Wang et al., 2020), since

Main Results
We summarize the experimental results on the GLUE test sets in Table 1. The number below each task denotes the number of training instances. Following previous works, we also report the average values over these nine tasks (the "AVE" column). From the results, we can observe that BERT-EMD substantially outperforms state-of-the-art baseline methods by a noticeable margin on most tasks. Among all the 4-layer BERT approaches, our BERT-EMD4 method achieves the best results on almost all tasks except SST-2 and CoLA. First, BERT-EMD4 achieves significantly better results than BERT-SMALL4 on all the GLUE tasks, with a large improvement of 4.48% on average. Second, BERT-EMD4 also outperforms DistilBERT4 and BERT-PKD4 by a substantial margin, even with only 30% of their parameters and inference time. Furthermore, BERT-EMD4 exceeds the TinyBERT model (the best competitor) by 2.3% accuracy on RTE, 2.2% F1 on MRPC, and 1.9% Spearman correlation on STS-B. This verifies the effectiveness of our BERT-EMD model in improving the performance of small BERT-based methods on various language understanding tasks.
We can observe similar trends for the 6-layer BERT models. Table 1 shows that the proposed BERT-EMD6 method can effectively compress the 12-layer BERT-Base into a 6-layer BERT model without performance sacrifice. Specifically, BERT-EMD6 performs better than the 12-layer BERT-Base model on 7 out of 9 tasks, with only about 50% of the parameters and inference time of the original model. For example, BERT-EMD6 achieves a noticeable improvement of 5.3% accuracy on RTE and 1% Spearman correlation on STS-B over the BERT-Base model.

Ablation Study
To verify the effectiveness of EMD and the cost attention mechanism, we perform an ablation test of BERT-EMD on two large datasets (MNLI and QQP) and two small datasets (MRPC and RTE) by removing EMD (denoted as w/o EMD) and cost attention (w/o CA), respectively. In particular, for the variant without EMD, we retain the many-to-many layer mapping but simply replace the EMD with the mean squared error when measuring the distance between the teacher and student layers.
The ablation test results are summarized in Table 2. Generally, both EMD and cost attention contribute noticeable improvements to our method. The performance decreases sharply, especially on the STS-B task, when removing the EMD module. This is within our expectation, since the EMD module formulates the distance between the teacher and student networks as an optimal transport problem, which helps to learn an optimal many-to-many layer mapping. The cost attention also contributes to the effectiveness of BERT-EMD. This verifies that the cost attention can further improve the many-to-many layer mapping by learning the importance of each teacher layer in guiding the student network. It is noteworthy that when removing the EMD module from the many-to-many layer mapping process, our w/o EMD4 performs slightly worse than TinyBERT4 on the MNLI and QQP tasks. This is because we cannot automatically control the information flow during the many-to-many layer mapping without using EMD, which further verifies the effectiveness of EMD in the many-to-many layer mapping process.

Figure 2: The visualization of flow matrices (F) and distance matrices (D) in developing BERT-EMD4 (above) and BERT-EMD6 (below) for two examples from the MNLI and RTE tasks, respectively. The abscissa represents the Transformer layers of the 12-layer BERT-Base, and the ordinate represents the Transformer layers of BERT-EMD4/BERT-EMD6. The color depth represents the values (weights) of the layers.

Visualization of Compression Process
To better understand the many-to-many layer mapping process, we illustrate the flow matrices F and cost (distance) matrices D in developing BERT-EMD4 (above) and BERT-EMD6 (below) for two examples from the MNLI and RTE tasks, respectively. In Figure 2, we report with heat maps the averaged values of the flow and cost matrices over the entire epoch that achieves the best performance on the validation set.
From the results in Figure 2, we have several key observations. First, different tasks emphasize different teacher layers when compressing the Transformer. The diagonal positions of the matrices are almost always important for the MNLI task, which exhibits trends similar to TinyBERT with its one-to-one "Skip" layer mapping strategy. However, for the RTE task, each student Transformer layer can learn from any teacher Transformer layer, so the previous one-to-one layer mapping methods cannot take full advantage of the teacher network. This argument is supported by the quantitative results in Table 1, where our BERT-EMD shows a much larger improvement over TinyBERT on RTE than on MNLI. Second, comparing BERT-EMD4 and BERT-EMD6, we observe that BERT-EMD4 usually needs to learn more comprehensive information from the skipped teacher Transformer layers, resulting in more divergent many-to-many layer mappings.

Conclusion
In this paper, we propose a novel BERT compression method based on many-to-many layer mapping with Earth Mover's Distance (EMD). To our knowledge, BERT-EMD is the first work that allows each intermediate student layer to learn from any intermediate teacher layer adaptively. In addition, a cost attention mechanism is designed to further improve the model's performance and accelerate convergence by learning the layer weights used in EMD automatically. Extensive experiments on GLUE tasks show that BERT-EMD achieves competitive performance with the large BERT-Base model while significantly reducing the model size and inference time.