TopicBERT for Energy Efficient Document Classification

Prior research notes that BERT's computational cost grows quadratically with sequence length, leading to longer training times, higher GPU memory requirements, and greater carbon emissions. While recent work addresses these scalability issues during pre-training, they are also prominent in fine-tuning, especially for long-sequence tasks such as document classification. Our work therefore focuses on reducing the computational cost of fine-tuning for document classification. We achieve this through complementary learning of topic and language models in a unified framework, named TopicBERT, which significantly reduces the number of self-attention operations, a main performance bottleneck. Consequently, our model achieves a 1.4x (~40%) speedup and a ~40% reduction in CO2 emission while retaining 99.9% of the classification performance over 5 datasets.


Introduction
Natural Language Processing (NLP) has recently witnessed a series of breakthroughs driven by the evolution of large-scale language models (LMs) such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and XLNet (Yang et al., 2019), owing to improved capabilities for language understanding (Bengio et al., 2003; Mikolov et al., 2013). However, this massive increase in model size comes at the expense of very high computational costs: longer training time, high GPU/TPU memory requirements, adversely high carbon footprints, and unaffordable invoices for small-scale enterprises. Figure 1 shows the computational cost (training time in milliseconds/batch, CO2 emission, and GPU memory usage) of BERT, all of which grow quadratically with sequence length N. We note that this is primarily due to self-attention operations. Moreover, as shown in Table 1, the staggering energy cost is not limited to the pre-training stage but is also encountered during fine-tuning when processing long sequences, as is needed in the task of document classification. The computational cost incurred can be quite significant, especially because fine-tuning is performed more frequently than pre-training and BERT is increasingly used for processing long sequences. Therefore, this work focuses on reducing the computational cost of the fine-tuning stage of BERT, especially for the task of document classification.

Recent studies address the excessive computational cost of large LMs in the pre-training stage using two main compression techniques: (a) pruning (Michel et al., 2019; Lan et al., 2020), which reduces model complexity, and (b) knowledge distillation (Hinton et al., 2015; Tang et al., 2019; Turc et al., 2019; Sanh et al., 2019a), in which a compact student model is trained to reproduce a large teacher model by leveraging the teacher's knowledge. Finally, in order to process long sequences, Xie et al. (2019), among others, investigate simple approaches of truncating or partitioning them into smaller sequences, e.g., to fit within the 512-token limit of BERT for classification; however, such partitioning leads to a loss of discriminative cross-partition information and is still computationally inefficient. In our work, we address this limitation by learning a complementary representation of text using topic models (TMs) (Blei et al., 2003; Miao et al., 2016; Gupta et al., 2019). Because topic models are bag-of-words models, they are more computationally efficient than large-scale contextual language models. Our work thus leverages this computational efficiency of TMs for efficient and scalable fine-tuning of BERT in document classification.
Specifically, our contributions are: (1) Complementary Fine-tuning: We present a novel framework, TopicBERT, i.e., topic-aware BERT, that leverages the advantages of both a neural network-based TM and Transformer-based BERT to achieve improved document-level understanding. We report gains in the document classification task with the full self-attention mechanism and topical information.
(2) Efficient Fine-tuning: TopicBERT offers efficient fine-tuning of BERT for long sequences by reducing the number of self-attention operations and jointly learning with a TM. We achieve a 1.4x (~40%) speedup while retaining 99.9% of classification performance over 5 datasets. Our approach is model agnostic; we therefore extend both BERT and DistilBERT. Code is available at https://github.com/YatinChaudhary/TopicBERT.

Carbon footprint (CO2) estimation: We follow Lacoste et al. (2019) and use the ML CO2 Impact calculator to estimate the carbon footprint (CO2) of our experiments; its estimate (equation 1) multiplies the power consumption (kW), the run-time (hours), and the carbon intensity of the local power grid (kg CO2 eq. per kWh). In Figure 1, we run BERT for different sequence lengths (32, 64, 128, 256 and 512) with batch size 4 to estimate the GPU memory consumed and the CO2 using equation 1. We run each model for 15 epochs and compute the run-time (in hours).
For Table 1, we estimate CO2 for document classification tasks (BERT fine-tuning) with a sequence length of 512. We first estimate the total BERT fine-tuning time in terms of research activities and/or applications beyond them, using multiple factors; the final CO2 is then computed using equation 1 (see the supplementary material for the detailed computation).
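For illustration, the estimate above can be reproduced with a few lines of Python. This is a minimal sketch (not the authors' code): the power draw (0.07 kW) and grid carbon intensity (0.61 kg CO2 eq. per kWh) are the constants quoted in the supplementary computation and are assumptions, not measurements for a specific GPU.

    # Sketch of the ML CO2 Impact style estimate used for Figure 1 / Table 1.
    def co2_kg(power_kw: float, hours: float, carbon_intensity_kg_per_kwh: float) -> float:
        """CO2 (kg eq.) = power consumption (kW) * run-time (h) * grid carbon intensity."""
        return power_kw * hours * carbon_intensity_kg_per_kwh

    if __name__ == "__main__":
        POWER_KW = 0.07              # assumed average power draw (kW)
        CARBON_INTENSITY = 0.61      # assumed grid carbon intensity (kg CO2 eq. / kWh)

        # Table 1 style estimate: 5532 BERT-based papers, x4 for work beyond accepted
        # papers, x5 datasets each, 12 h of fine-tuning per dataset (see supplementary).
        runs = 5532 * 4 * 5
        hours_total = runs * 12
        total_kg = co2_kg(POWER_KW, hours_total, CARBON_INTENSITY)
        print(f"Estimated fine-tuning footprint: {total_kg:,.0f} kg eq. "
              f"({total_kg * 2.20462:,.0f} lbs eq.)")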

Methodology: TopicBERT
TopicBERT: Complementary Fine-tuning
Given a document D = [w_1, ..., w_N] of sequence length N, let V ∈ R^Z be its bag-of-words (BoW) representation, v_i ∈ R^Z be the one-hot representation of the word at position i, and Z be the vocabulary size.
The neural topic model component (Figure 2, left) is based on the Neural Variational Document Model (NVDM) (Miao et al., 2016), which can be seen as a variational autoencoder for document modeling in an unsupervised generative fashion, such that: (a) an MLP encoder f_MLP and two linear projections l_1 and l_2 compress the input document V into a continuous hidden vector h_TM ∈ R^K:

π = g(f_MLP(V))  and  ε ∼ N(0, I),
µ(V) = l_1(π)  and  σ(V) = l_2(π),

where h_TM is sampled from the posterior distribution q(h_TM | V) parameterized by the mean µ(V) and variance σ(V), both generated by the neural network. We call h_TM the document-topic representation (DTR), summarizing document semantics.
(b) a softmax decoder, i.e., p(V | h_TM) = ∏_{i=1}^{N} p(v_i | h_TM), reconstructs the input document V by generating all words {v_i} independently.

Here, b denotes the batch size, n_b the number of batches, and n_l the number of layers in BERT. Note that the compute costs of NVDM and of the self-attention operations are K·Z and (N² · H_B / p) · n_l, respectively. In TopicBERT, p = 1 for complementary learning, and p ∈ {2, 4, 8} for complementary + efficient learning.
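To make the NVDM component concrete, the following PyTorch-style sketch shows the encoder (f_MLP with projections l_1 and l_2), the reparameterized DTR h_TM, and the softmax decoder with its reconstruction and KL terms. The layer sizes, single-sample reparameterization, and module layout are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NVDM(nn.Module):
        """Illustrative sketch of the NVDM topic model (Miao et al., 2016)."""

        def __init__(self, vocab_size: int, hidden: int = 500, n_topics: int = 100):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU())  # f_MLP
            self.l1 = nn.Linear(hidden, n_topics)            # mean projection
            self.l2 = nn.Linear(hidden, n_topics)            # log-variance projection
            self.decoder = nn.Linear(n_topics, vocab_size)   # softmax decoder

        def forward(self, bow: torch.Tensor):
            # bow: (batch, vocab_size) bag-of-words counts V
            pi = self.encoder(bow)
            mu, logvar = self.l1(pi), self.l2(pi)
            eps = torch.randn_like(mu)                        # eps ~ N(0, I)
            h_tm = mu + eps * torch.exp(0.5 * logvar)         # document-topic representation (DTR)
            log_probs = F.log_softmax(self.decoder(h_tm), dim=-1)
            # Reconstruction: sum of log p(v_i | h_TM) over observed word counts.
            recon = -(bow * log_probs).sum(dim=-1)
            # KL divergence between q(h_TM | V) and the standard normal prior.
            kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)
            return h_tm, (recon + kl).mean()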
[…] for D, and L is the total number of labels. During training, TopicBERT maximizes the joint objective of document classification and NVDM (given in Section A.3). The framework is model agnostic: we also apply it to DistilBERT (Sanh et al., 2019a), and the resulting variant is named TopicDistilBERT.
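The sketch below illustrates one plausible way to combine the two representations: the pooled BERT output and the DTR are concatenated and projected to the L labels, and the model is trained with the weighted joint loss of Section A.3. It assumes the Hugging Face transformers BertModel API and the hypothetical NVDM sketch above; the concatenation-based head is an illustrative assumption, not necessarily the authors' exact architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from transformers import BertModel

    class TopicBertClassifier(nn.Module):
        """Sketch: topic-aware document classifier combining BERT and NVDM."""

        def __init__(self, nvdm: nn.Module, n_labels: int, n_topics: int = 100, alpha: float = 0.9):
            super().__init__()
            self.bert = BertModel.from_pretrained("bert-base-uncased")
            self.nvdm = nvdm            # e.g., the NVDM sketch above
            self.alpha = alpha          # weight between classification and NVDM losses
            hidden = self.bert.config.hidden_size
            self.classifier = nn.Linear(hidden + n_topics, n_labels)

        def forward(self, input_ids, attention_mask, bow, labels):
            # Contextual document representation from BERT (pooled [CLS] output).
            h_bert = self.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
            # Document-topic representation (DTR) and NVDM loss from the topic model.
            h_tm, loss_nvdm = self.nvdm(bow)
            logits = self.classifier(torch.cat([h_bert, h_tm], dim=-1))
            loss_cls = F.cross_entropy(logits, labels)
            # Joint objective: alpha * classification loss + (1 - alpha) * NVDM loss.
            return self.alpha * loss_cls + (1.0 - self.alpha) * loss_nvdm, logits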

TopicBERT: Efficient Fine-tuning
Since the computational cost of BERT grows quadratically, O(N²), with sequence length N and the input is limited to 512 tokens, there is a need to deal with longer sequences. The TopicBERT model offers efficient fine-tuning by reducing the number of self-attention operations in the BERT component.
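One plausible way to realize this reduction is sketched below: split the N-token input into p partitions, encode each partition independently with BERT, and pool the partition representations, while the topic model still sees the full document as a bag of words and supplies the cross-partition (DTR) information. The chunking and mean-pooling strategy is an illustrative assumption, not necessarily the authors' exact scheme.

    import torch

    def chunk_input(input_ids: torch.Tensor, attention_mask: torch.Tensor, p: int):
        """Split a token sequence of length N into p partitions of length N/p.

        Per-layer self-attention cost then drops from O(N^2) to p * O((N/p)^2) = O(N^2 / p),
        matching the (N^2 * H_B / p) * n_l term quoted above.
        """
        n = input_ids.size(-1)
        assert n % p == 0, "for simplicity, assume N is divisible by p"
        return input_ids.view(-1, p, n // p), attention_mask.view(-1, p, n // p)

    def encode_partitions(bert, input_ids, attention_mask, p: int) -> torch.Tensor:
        """Encode each partition independently and mean-pool the pooled outputs.

        `bert` is assumed to behave like a Hugging Face BertModel (exposing
        .pooler_output); the pooling choice here is an assumption for illustration.
        """
        ids, mask = chunk_input(input_ids, attention_mask, p)
        chunk_vecs = [bert(input_ids=ids[:, i], attention_mask=mask[:, i]).pooler_output
                      for i in range(p)]
        return torch.stack(chunk_vecs, dim=1).mean(dim=1)   # (batch, hidden)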
Results: Table 3 illustrates the gains in performance and efficiency of TopicBERT due to complementary and efficient fine-tuning, respectively. For example, on Reuters8, TopicBERT-512 achieves a gain of 1.6% in F1 over BERT and also outperforms DistilBERT. In the efficient setup, TopicBERT-128 achieves a significant speedup of 1.9x (a 1.9x reduction in CO2) in fine-tuning while retaining (Rtn) 99.25% of BERT's F1. For IMDB and 20NS, TopicBERT-256 reports performance similar to BERT, but with a speedup of 1.2x, and also outperforms DistilBERT in F1 while consuming similar time per epoch (T_epoch). Additionally, TopicBERT-512 exceeds DistilBERT in F1 for all datasets. At p = 8, TopicBERT-64 does not achieve the expected efficiency, perhaps due to saturated GPU parallelization (a trade-off between decreasing sequence length and increasing the number of batches).
Overall, TopicBERT-x achieves gains in: (a) performance: 1.604%, 0.850%, 0.537%, 0.260% and 0.319% in F1 for Reuters8, 20NS, IMDB, Ohsumed and AGnews (in the supplementary), respectively, and (b) efficiency: a speedup of 1.4x (~40%) and thus a ~40% reduction in CO2 over the 5 datasets, while retaining 99.9% of BERT's F1. This suggests that topical semantics improves document classification in TopicBERT (and in TopicDistilBERT, which obtains a further 1.55x speedup over DistilBERT).

Analysis (Pareto Frontier): As shown in Table 3, the gains of TopicBERT can be analyzed on two fronts: (a) gain in performance (F1 score), and (b) gain in efficiency (fine-tuning time/CO2). Figure 4 shows the corresponding Pareto frontier plots for the Reuters8 dataset: (a) F1 score vs fine-tuning time (left), and (b) F1 score vs CO2 (right), to find the solution that best balances both fronts. Observe that TopicBERT-512 outperforms all other TopicBERT variants and the BERT baseline (B-512) in terms of performance, i.e., F1 score. However, TopicBERT-256 outperforms BERT-512 in terms of both performance (F1 score) and efficiency (fine-tuning time/CO2). Therefore, TopicBERT-256 represents the optimal solution, with an optimal sequence length of 256, for the Reuters8 dataset.
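For readers who want to repeat the Pareto analysis on their own (cost, F1) measurements, a minimal sketch is given below; the data points are placeholders, not the paper's numbers.

    from typing import List, Tuple

    def dominates(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
        """True if a = (cost, f1) is at least as good as b on both axes and strictly
        better on at least one (lower cost is better, higher F1 is better)."""
        return a[0] <= b[0] and a[1] >= b[1] and (a[0] < b[0] or a[1] > b[1])

    def pareto_frontier(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
        """Keep only the points that are not dominated by any other point."""
        return [p for p in points if not any(dominates(q, p) for q in points if q is not p)]

    if __name__ == "__main__":
        # Placeholder (fine-tuning time in hours, F1) pairs, NOT the paper's measurements.
        models = {
            "BERT-512": (3.0, 0.90),
            "TopicBERT-512": (3.1, 0.92),
            "TopicBERT-256": (2.2, 0.91),
            "TopicBERT-128": (1.6, 0.89),
        }
        frontier = pareto_frontier(list(models.values()))
        print("Pareto-optimal variants:", [name for name, p in models.items() if p in frontier])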

Conclusion
We have presented two novel architectures, TopicBERT and TopicDistilBERT, for improved and efficient (in fine-tuning time/CO2) document classification, leveraging complementary learning of topic (NVDM) and language (BERT) models.

A Supplementary Material
A.1 CO2: Carbon footprint estimation

For Table 1, we estimate CO2 for document classification tasks (BERT fine-tuning) with a sequence length of 512. We first estimate the frequency of BERT fine-tuning in terms of research activities and/or applications beyond them. We estimate the following items:

1. Number of scientific papers based on BERT = 5532 (number of BERT citations as of 01 June 2020)
2. Conference acceptance rate: 25% (i.e., 4 times the original number of submissions, or research/applications beyond the submissions)
3. Average number of datasets used = 5
4. Average run-time of 15 epochs of fine-tuning BERT over 5000 documents (Reuters8-sized data) of maximum 512 sequence length = 12 hours on the hardware type used

Therefore, using equation 1 in the main paper, the CO2 estimate for fine-tuning BERT is 0.07 x (5532 x 4 x 5) x 12 x 0.61 kg eq. = 56,692 kg eq. = 56,692 x 2.20462 lbs eq. = 124,985 lbs eq.

A.2 Data statistics and preprocessing
Table 4 shows the data statistics of the 5 datasets used in the complementary + efficient fine-tuning evaluation of our proposed TopicBERT model on the document classification task. 20Newsgroups (20NS), Reuters8, and AGnews are news-domain datasets, whereas IMDB and Ohsumed belong to the sentiment and medical domains, respectively. For the NVDM component, we preprocess each dataset and extract the vocabulary Z as follows: (a) tokenize documents into words, (b) lowercase all words, (c) remove stop words (using the NLTK tool), and (d) remove words with frequency less than F_min. Here, F_min = 100 for the large datasets, i.e., IMDB, 20NS, AGnews and Ohsumed, whereas F_min = 10 for Reuters8, which is a small dataset.

Table 4: Preprocessed data statistics: #docs → number of documents, k → thousand, Z → vocabulary size of NVDM, L → total number of unique labels, N → sequence length used for BERT fine-tuning, b → batch size used for BERT fine-tuning, (:) → multi-labeled dataset.
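A minimal sketch of this vocabulary extraction, assuming NLTK's word_tokenize and English stop-word list (the NLTK "punkt" and "stopwords" resources must be downloaded beforehand); the helper name and return format are illustrative only.

    from collections import Counter
    from nltk.corpus import stopwords       # requires nltk.download("stopwords")
    from nltk.tokenize import word_tokenize # requires nltk.download("punkt")

    def build_nvdm_vocabulary(documents, f_min=100):
        """Tokenize, lowercase, remove stop words, and drop words rarer than f_min."""
        stops = set(stopwords.words("english"))
        counts = Counter()
        for doc in documents:
            tokens = [t.lower() for t in word_tokenize(doc)]
            counts.update(t for t in tokens if t not in stops)
        vocab = sorted(w for w, c in counts.items() if c >= f_min)
        return {w: i for i, w in enumerate(vocab)}   # word -> index, |vocab| = Z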

A.3 Experimental setup
Tables 5 and 7 show the hyperparameter settings of the NVDM and BERT components of our proposed TopicBERT model for the document classification task. We initialize the BERT component with the pre-trained BERT-base model released by Devlin et al. (2019). Fine-tuning of TopicBERT is performed as follows: (1) pre-train the NVDM component, (2) initialize the BERT component with the BERT-base model, and (3) perform complementary + efficient fine-tuning for 15 epochs using the joint loss objective

L_TopicBERT = α log p(y = y_l | D) + (1 − α) L_NVDM,

where α ∈ {0.1, 0.5, 0.9}. For CNN, we follow the experimental setup of Kim (2014).
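A compact sketch of this three-step schedule, reusing the hypothetical NVDM and TopicBertClassifier sketches from earlier; the Adam optimizers, learning rates, and epoch counts other than the 15 joint epochs are placeholder assumptions.

    import torch

    def finetune_topicbert(nvdm, classifier, nvdm_loader, doc_loader,
                           nvdm_epochs=100, joint_epochs=15, lr=2e-5):
        # (1) Pre-train the NVDM component on bag-of-words inputs.
        opt_nvdm = torch.optim.Adam(nvdm.parameters(), lr=1e-3)
        for _ in range(nvdm_epochs):
            for bow in nvdm_loader:
                _, loss_nvdm = nvdm(bow)
                opt_nvdm.zero_grad()
                loss_nvdm.backward()
                opt_nvdm.step()

        # (2) The BERT component is already initialized from pre-trained BERT-base
        #     inside the classifier (see the TopicBertClassifier sketch).

        # (3) Complementary (+ efficient) fine-tuning with the joint objective
        #     L = alpha * log p(y|D) + (1 - alpha) * L_NVDM for 15 epochs.
        opt = torch.optim.Adam(classifier.parameters(), lr=lr)
        for _ in range(joint_epochs):
            for input_ids, attention_mask, bow, labels in doc_loader:
                loss, _ = classifier(input_ids, attention_mask, bow, labels)
                opt.zero_grad()
                loss.backward()
                opt.step()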
[…] (performance) compared to BERT, and (b) a significant speedup of 1.3x over BERT while retaining (Rtn) 100% of BERT's F1 at the same time. This gain arises due to the improved document understanding using complementary topical semantics, via NVDM, in TopicBERT and its energy-efficient versions.

Additionally, TopicBERT-512 exceeds DistilBERT in F1 for the two datasets. At p = 4, TopicDistilBERT-128 does not achieve the expected efficiency, perhaps due to saturated GPU parallelization (a trade-off between decreasing sequence length and increasing the number of batches), and therefore we do not partition further.

This suggests that topical semantics improves document classification in TopicDistilBERT (and TopicBERT) and their energy-efficient variants. Based on our two extensions, TopicBERT and TopicDistilBERT, we assert that our proposed approach of complementary learning (fine-tuning) is agnostic to the choice of BERT model.

A.6 Interpretability Analysis in TopicBERT
To analyze the gain in performance (F1 score) of TopicBERT over BERT, Figure 5 shows document labels misclassified by the BERT model. The TopicBERT model, however, is able to predict these labels correctly using the document-topic representation (DTR); the correct predictions can be explained by the top key terms of the dominant topic discovered.