To BERT or Not to BERT: Comparing Task-specific and Task-agnostic Semi-Supervised Approaches for Sequence Tagging

Leveraging large amounts of unlabeled data with Transformer-based architectures, such as BERT, has gained popularity in recent times owing to their effectiveness in learning general representations that can be fine-tuned for downstream tasks to great success. However, training these models can be costly both economically and environmentally. In this work, we investigate how to use unlabeled data effectively: we explore the task-specific semi-supervised approach, Cross-View Training (CVT), and compare it with the task-agnostic BERT in multiple settings that include domain- and task-relevant English data. CVT uses a much lighter model architecture, and we show that it achieves performance similar to BERT on a set of sequence tagging tasks, with a smaller financial and environmental footprint.


Introduction
Exploiting unlabeled data to improve performance has become the foundation of many natural language processing tasks. The question we investigate in this paper is how to use unlabeled data effectively: in a task-agnostic or a task-specific way. An example of the former is training models with language model (LM) style objectives on a large unlabeled corpus to learn general representations, as in ELMo (Embeddings from Language Models) (Peters et al., 2018) and BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019). These representations are then reused in supervised training on a downstream task. Such pre-trained models, particularly those based on the Transformer architecture (Vaswani et al., 2017), have achieved state-of-the-art results on a variety of NLP tasks, but come at great financial and environmental cost (Strubell et al., 2019; Schwartz et al., 2019).
In contrast, Cross-View Training (CVT) is a semi-supervised approach that uses unlabeled data in a task-specific manner, rather than trying to learn general representations that can be used for many downstream tasks. Inspired by self-learning (McClosky et al., 2006; Yarowsky, 1995) and multi-view learning (Blum and Mitchell, 1998; Xu et al., 2013), the key idea is that the primary prediction module, which has an unrestricted view of the data, trains on the task using labeled examples and makes task-specific predictions on unlabeled data. The auxiliary modules, with restricted views of the unlabeled data, attempt to replicate the primary module's predictions. This helps the model learn better representations for the task.
We present an experimental study that investigates different task-agnostic and task-specific approaches to using unsupervised data, and evaluates them in terms of performance as well as financial and environmental impact. On the one hand, we use BERT in three different settings: 1) the standard BERT setup, in which BERT pretrained on a generic corpus is fine-tuned on a supervised task; 2) pretraining BERT on domain- and/or task-relevant unlabeled data and fine-tuning on a supervised task (Pretrained BERT); and 3) continued pretraining of BERT on domain- and/or task-relevant unlabeled data, followed by fine-tuning on a supervised task (Adaptively Pretrained BERT) (Gururangan et al., 2020). On the other hand, we use CVT, based on a much lighter architecture (CNN-BiLSTM), which uses domain- and/or task-relevant unlabeled data in a task-specific manner. We experiment on several tasks framed as sequence labeling problems: opinion target expression detection, named entity recognition and slot-labeling. We find that the CVT-based approach, while using less unlabeled data, achieves performance similar to the BERT-based models, and is superior in terms of financial and environmental cost as well.

Background, Tasks and Datasets
Before presenting the models and their training setups, we discuss the relevant literature and introduce the tasks and datasets used in our experiments. We focus on three tasks: opinion target expression (OTE) detection, named entity recognition (NER), and slot-labeling, each of which can be modeled as a sequence tagging problem (Xu et al., 2018; Liu et al., 2019a; Louvan and Magnini, 2018). The IOB sequence tagging scheme (Ramshaw and Marcus, 1999) is used for each of these tasks.

Related Work. The usefulness of continued training of large Transformer-based models on domain/task-related unlabeled data has been shown recently (Gururangan et al., 2020; Rietzler et al., 2019; Xu et al., 2019), with varied terminology for the process. Xu et al. (2019) and Rietzler et al. (2019) show gains from further tuning BERT on in-domain unlabeled data, and refer to this as Post-training and LM finetuning, respectively. More recently, Gururangan et al. (2020) use the term Domain-Adaptive Pretraining and show benefits over RoBERTa (Liu et al., 2019b). There have also been efforts to reduce model sizes for BERT, such as DistilBERT (Sanh et al., 2019), although these come at significant losses in performance.
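As a concrete illustration of the IOB scheme used for all three tasks, the following minimal helper (a hypothetical function, not part of any cited codebase) converts labeled token spans into IOB tags:

```python
def spans_to_iob(tokens, spans):
    """Convert token-level spans into IOB tags.

    tokens: list of tokens in the sentence.
    spans:  list of (start, end_exclusive, label) tuples over token indices.
    The first token of a span gets B-label, subsequent tokens I-label,
    and all remaining tokens get O.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

# "The tuna roll was great" with an OTE span covering "tuna roll"
tokens = ["The", "tuna", "roll", "was", "great"]
print(spans_to_iob(tokens, [(1, 3, "OTE")]))
# ['O', 'B-OTE', 'I-OTE', 'O', 'O']
```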

Opinion Target Expression (OTE) Detection:
An integral component of fine-grained sentiment analysis is the ability to identify segments of text towards which opinions are expressed. These segments are referred to as Opinion Target Expressions, or OTEs. An example of this task is provided in Figure 1. The commonly used labeled datasets for Opinion Target Expression (OTE) detection are those released as part of the SemEval Aspect-based Sentiment shared tasks: SemEval-2014 Laptops (Pontiki et al., 2014) (SE14-L) and SemEval-2016 Restaurants (Pontiki et al., 2016) (SE16-R). These consist of reviews from the laptop and restaurant domains, respectively, with OTEs annotated for each sentence of a review. We use the provided train-test splits, but further split the training data randomly into 90% training and 10% validation sets. As unlabeled data that is similar to the domain and task, we extract restaurant reviews from the Yelp dataset (Yelp-R) and reviews of electronics products from the Amazon Product Reviews dataset (Amazon-E) (see Table 1).
Named Entity Recognition (NER): NER is the task of identifying and categorizing named entities in unstructured text into pre-defined categories such as Person (PER), Location (LOC), Organization (ORG), etc. Figure 1 contains an example of this task. CONLL-2003 (Tjong Kim Sang and De Meulder, 2003) and CONLL-2012 (OntoNotes v5.0) (Pradhan et al., 2012) are the commonly used labeled datasets for building and evaluating Named Entity Recognition models (Lample et al., 2016; Ma and Hovy, 2016; Akbik et al., 2018, inter alia). We focus on the English portions of these datasets. CONLL-2003 contains annotations of Reuters news for 4 entity types (Person, Location, Organization, and Miscellaneous). CONLL-2012 contains 18 entity types across various genres (weblogs, news, talk shows, etc.), with newswire being the majority. We use the provided train, validation and test splits for these datasets. As newswire is the predominant genre in these datasets, we use stories from the CNN and Daily Mail datasets 4 (CNN-DM) as an unlabeled dataset from the news genre (see Table 1).

Figure 2: CVT explained using the OTE task (figure adapted from the original CVT paper). In the labeled example, tuna roll is the OTE, hence tuna has B-OTE as the gold label.
Slot-labeling: Slot-labeling is a key component of Natural Language Understanding (NLU) in dialogue systems, which involves labeling the words of an utterance with pre-defined attributes, called slots. For this task, we use the widely used MIT-Movie dataset 5 as labeled data, which contains queries related to movie information, with 12 slot labels such as Plot, Actor, Director, etc. An example from this dataset is shown in Figure 1. We use the default train-test split, and create a validation set by randomly selecting 10% of the training samples. The IMDb Movie review dataset (IMDb) is used as in-domain unlabeled data (Maas et al., 2011) (see Table 1).

Models and Experimental Setup
We describe the various models we compare in this work and the experimental setup for each of them. The experiments are geared towards comparing the predictive performance of the models, while also measuring their environmental impact and the resources required to train them. Details on model architecture and training are provided in Appendix A.
Cross-View Training (CVT) CVT is a semi-supervised approach that leverages unlabeled data in a task-specific manner. The underlying model is a two-layer CNN-BiLSTM sentence encoder, followed by a linear layer and a softmax per prediction module. There are two kinds of prediction modules: primary and auxiliary. CVT alternates between learning from labeled and unlabeled data during training. The key idea is that the primary prediction module, which has an unrestricted view of the data, trains on the task using labeled examples and makes task-specific predictions on unlabeled data. The auxiliary modules, with different restricted views of the unlabeled data, attempt to mimic the predictions of the primary module. Standard cross-entropy loss is minimized when learning from labeled examples, while for unlabeled examples, the KL-divergence (Kullback and Leibler, 1951) between the predicted primary and auxiliary probability distributions is minimized (see the original CVT paper for more details). We illustrate the training strategy in Figure 2. The model is thus trained to produce consistent predictions despite seeing only partial views of the input, thereby improving the underlying representations.

4 https://github.com/abisee/cnn-dailymail
5 https://groups.csail.mit.edu/sls/downloads/movie/
We use GloVe 840B.300d embeddings (Pennington et al., 2014) instead of the GloVe 6B.300d embeddings used by the authors, for larger vocabulary coverage. For each of the labeled datasets (Section 2), we use the corresponding domain/task-relevant unlabeled data to train a sequence tagging model for 400K steps, with early stopping enabled using validation set convergence.
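The two alternating objectives can be sketched in NumPy as follows (a minimal illustration of the losses, not the authors' implementation; function names and shapes are assumptions):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def labeled_loss(primary_logits, gold):
    """Supervised step: standard cross-entropy of the primary module
    against the gold tags of labeled tokens."""
    probs = softmax(primary_logits)
    return -np.mean(np.log(probs[np.arange(len(gold)), gold]))

def unlabeled_loss(primary_logits, aux_logits):
    """Unsupervised step: KL(primary || auxiliary). The primary module's
    predictions (made with a full view of the input) act as a fixed
    target that each auxiliary module (restricted view) learns to match."""
    p = softmax(primary_logits)  # teacher distribution (no gradient in practice)
    q = softmax(aux_logits)      # student distribution
    return np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1))
```

Training alternates between minimizing `labeled_loss` on a labeled batch and `unlabeled_loss` (summed over all auxiliary modules) on an unlabeled batch.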
BERT Base BERT (Devlin et al., 2019) has achieved state-of-the-art results on many NLP tasks.
The key innovation lies in the use of bi-directional Transformers together with the Masked Language Model (MLM) and Next Sentence Prediction (NSP) training objectives. Learning happens in two steps: 1) training the model on a very large generic dataset (using the two objectives above); 2) fine-tuning the learned representations on a downstream task in a supervised fashion. For our experiments we use BERT Base, which has 12 layers, 768 hidden dimensions per token and 12 attention heads, and is pre-trained on the cased English Wikipedia and Books Corpus data (Wiki+Books). To fine-tune on the downstream sequence tagging task, our model consists of the BERT Base encoder, followed by a dropout layer and a classification layer that classifies each token into B-label, I-label or O, where label ranges over the task's label set {label_1, ..., label_n}. Cross-entropy is used as the loss function.
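The fine-tuning head described above amounts to a per-token linear classifier over the encoder's hidden states. A minimal NumPy sketch (function names and shapes are illustrative, not from the actual implementation; dropout is omitted):

```python
import numpy as np

def token_classification_head(hidden, W, b):
    """Linear per-token classifier over encoder outputs.

    hidden: (seq_len, hidden_dim) encoder states (768-dim for BERT Base).
    W:      (hidden_dim, num_tags) weight matrix; b: (num_tags,) bias.
    Returns per-token logits over the IOB tag set.
    """
    return hidden @ W + b

def token_cross_entropy(logits, gold, ignore_index=-100):
    """Mean cross-entropy over real tokens; padding/special positions are
    marked with `ignore_index` and excluded from the loss."""
    keep = gold != ignore_index
    z = logits[keep] - logits[keep].max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(keep.sum()), gold[keep]].mean()
```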
Pretrained BERT Base (Pre-BERT Base) In this setup, we use the BERT Base architecture and pre-train it from scratch on the domain/task-relevant unlabeled data. Each training step trains on a batch of size 256. A validation set is created from each unlabeled dataset by random sampling (details in Appendix A). The convergence criterion is a validation MLM accuracy improvement of ≥ 0.05 when evaluated every 30K steps. We then perform the second step of fine-tuning on the downstream task data, as in the regular BERT setup.
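The convergence check can be sketched as follows, under our reading that pretraining stops once the between-evaluation improvement drops below 0.05 (the helper name is hypothetical):

```python
def mlm_converged(val_accuracies, min_improvement=0.05):
    """Stopping criterion for MLM pretraining: training stops once
    validation MLM accuracy improves by less than `min_improvement`
    between two consecutive evaluations (run every 30K steps)."""
    if len(val_accuracies) < 2:
        return False  # need at least two evaluations to measure improvement
    return (val_accuracies[-1] - val_accuracies[-2]) < min_improvement
```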

Adaptively Pretrained BERT Base (APBERT Base )
Here, we start with BERT Base trained on the generic unlabeled dataset (English Wikipedia and Books Corpus) and continue pretraining on the corresponding domain/task-relevant unlabeled data (Section 2). Following the nomenclature of Gururangan et al. (2020), we refer to this model as Adaptively Pretrained BERT Base. We then fine-tune on the downstream task data, as with the previous BERT models.

Results
We present here a metrics-based and resource-based comparison of the CVT and BERT models on all tasks. State-of-the-art (SOTA) baselines are included for reference.

Performance Metrics
We report the mean F1 (with standard deviation) on the labeled test split of each task over 5 randomized runs, and compare the models using statistical significance tests over these runs. We also report the approximate number of unlabeled sentences seen by each model. Table 2 shows the results for the OTE detection task. Of the three BERT-based variants, the best result is achieved by the APBERT Base model on both SemEval datasets. For SemEval-2016 Restaurants, we find the mean F1 of the APBERT Base model to be comparable to that of CVT (p-value 0.26). Both models outperform the SOTA baseline. For SemEval-2014 Laptops, APBERT Base has a statistically significantly higher F1 than CVT (p-value 0.04), and both models outperform SOTA.
In Tables 3 and 4, we present F1 results on the NER and Slot-labeling tasks, respectively. For all three datasets, we find that CVT outperforms all BERT models (statistically significant for the CONLL-2003 and MIT-Movie datasets, at p-values 0.0086 and 0.0085, respectively). For these tasks, BERT Base outperforms the APBERT Base models. Furthermore, CVT outperforms SOTA on the Slot-labeling task.
These results show that the CVT model, using unlabeled data in a task-specific manner, is more robust across different tasks and types of unlabeled data.

Table 4: Model performance for Slot-labeling. The same unlabeled dataset is used for training CVT, Pre-BERT Base and APBERT Base, and Unlabeled Data indicates the approximate number of sentences seen by each model during training until the convergence criterion is met. HSCRF + softdict (Louvan and Magnini, 2018) is the SOTA baseline for this task.
For OTE detection, the unlabeled data is closely related to both the domain and the task, while for NER and Slot-labeling, the unlabeled data is related to the genre (newswire) and domain (movies), but not necessarily to the specific tasks. In line with the findings of Gururangan et al. (2020), Adaptive Pretraining shows the best results when using unlabeled data that is both domain- and task-relevant (superior results for the OTE task). It is also worth noting that CVT requires a significantly smaller amount of unlabeled data than the BERT models (Tables 2, 3 and 4).

Table 5: Estimated CO2 emissions and computational cost for CVT and BERT models, using models trained on Yelp Restaurants (Yelp-R) as an example. These computations hold for the other tasks and datasets discussed in this work. HW (hardware) refers to #GPUs/#CPUs used. Cost refers to the approximate cost in USD. Power stands for total power consumption (in kWh), computed as the combined GPU, CPU and DRAM consumption, multiplied by a Power Usage Effectiveness (PUE) coefficient to account for the additional energy needed for infrastructure support (Strubell et al., 2019). CO2 represents CO2 emissions in pounds.

Table 5 shows the computational cost and environmental impact, by means of estimated CO2 emissions, incurred during training. We follow the procedure described by Strubell et al. (2019). Tesla V100 GPUs are used for training. For computational cost, we use the average cost per hour of the training instances used. To compute the energy consumed, we query the NVIDIA System Management Interface multiple times during training to record the average GPU power consumption. For CPU and DRAM power usage, we use Linux's turbostat package. The models trained using the Yelp Restaurants unlabeled data are used as an example in Table 5, but the same computations hold for the other models. Note that we perform neither the initial pretraining of BERT Base nor the pretraining of the GloVe 840B.300d embeddings used in CVT; these come at a one-time cost that we consider constant.
It is worth noting, though, that BERT Base pretraining is more expensive than GloVe pretraining. As is evident, training the CVT model incurs a much lower financial cost than the corresponding BERT models (∼11x lower than APBERT Base), while also emitting far less CO2 (∼18x lower than APBERT Base).
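The estimation procedure described above can be sketched as follows (a minimal sketch with a hypothetical function name; the PUE and emissions constants are the values reported by Strubell et al. (2019), and the power-draw arguments are the averages read from nvidia-smi and turbostat):

```python
PUE = 1.58               # Power Usage Effectiveness (Strubell et al., 2019)
LBS_CO2_PER_KWH = 0.954  # avg. US grid emissions, lbs CO2/kWh (Strubell et al., 2019)

def training_footprint(hours, gpu_watts, num_gpus, cpu_watts, dram_watts):
    """Estimate total energy (kWh) and CO2 emissions (lbs) for a training
    run: combined average GPU, CPU and DRAM power draw, scaled by the
    PUE coefficient to account for infrastructure overhead."""
    kwh = PUE * hours * (cpu_watts + dram_watts + num_gpus * gpu_watts) / 1000.0
    return kwh, LBS_CO2_PER_KWH * kwh
```

For example, a 10-hour run on one GPU drawing 250W on average, with 50W CPU and 25W DRAM draw, would be estimated at about 5.1 kWh and 4.9 lbs of CO2.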

Conclusion & Future Work
We compare the task-specific semi-supervised method, CVT, with a task-agnostic semi-supervised approach, BERT (with and without adaptive pretraining), on a variety of problems that can be modeled as sequence tagging tasks. We find that the CVT-based approach is more robust than the BERT-based models across tasks and across the types of unsupervised data available to them. Furthermore, the financial and environmental costs incurred are significantly lower for CVT than for BERT. As future work, we will explore CVT on other sequence-labeling tasks such as chunking, elementary discourse unit segmentation and argumentative discourse unit segmentation, thus moving beyond entity-level spans. Other supervised tasks, such as classification, could also be studied in this context. Furthermore, we intend to implement CVT as a training strategy over Transformers (BERT) and compare it with Adaptively Pretrained BERT.

References

David Yarowsky. 1995. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In 33rd Annual Meeting of the Association for Computational Linguistics.

A.1 Source Code and Data Preprocessing Steps
For CVT, we use the official author-provided codebase. The unlabeled datasets are preprocessed to have one sentence per line using NLTK's sentence tokenizer, as required by the model. For BERT pretraining, we use GluonNLP's open-source code. For each unlabeled dataset, we create a randomly sampled validation set of about 30K samples for these experiments. The unlabeled data is processed into the required format.

A.2 Model Hyperparameters, Training Details and Validation F1
Here, we list the hyperparameters used for each model and describe the training process. conlleval is used as the evaluation script for each of the models.

CVT: A batch size of 64 is used for both labeled and unlabeled data. We use character embeddings of size 50, with char-CNN filter widths of [2, 3, 4] and 300 char-CNN filters. The encoder LSTMs have sizes 1024 and 512 for the first and second layers, respectively, with a projection size of 512. Dropout of 0.5 for labeled examples and 0.8 for unlabeled examples is used. A base learning rate of 0.5 is used with an adaptive learning rate scheme, using SGD with Momentum as the optimizer.
Pretrained BERT Base (Pre-BERT Base) and Adaptively Pretrained BERT Base (APBERT Base): A batch size of 256 is used during training. The number of gradient accumulation steps is set to 4. BERTAdam is used as the optimizer. A base learning rate of 0.0001 is used, adapted with respect to the number of steps. The maximum input sequence length is set to 512.