Go Simple and Pre-Train on Domain-Specific Corpora: On the Role of Training Data for Text Classification

Pre-trained language models provide the foundations for state-of-the-art performance across a wide range of natural language processing tasks, including text classification. However, most classification datasets assume a large amount of labeled data, which is commonly not the case in practical settings. In this paper we compare the performance of a light-weight linear classifier based on word embeddings, i.e., fastText (Joulin et al., 2017), against a pre-trained language model, i.e., BERT (Devlin et al., 2019), across a wide range of datasets and classification tasks. In general, the results show the importance of domain-specific unlabeled data, whether in the form of word embeddings or language models. As for the comparison, BERT outperforms all baselines on standard datasets with large training sets. However, in settings with small training datasets a simple method like fastText coupled with domain-specific word embeddings performs equally well or better than BERT, even when BERT is pre-trained on domain-specific data.


Introduction
Language models pre-trained on large text corpora form the foundation of today's NLP (Gururangan et al., 2020; Rogers et al., 2020). They have been shown to provide state-of-the-art performance on most standard NLP benchmarks (Wang et al., 2019a; Wang et al., 2019b). However, these models require large computational resources that are not always available and have important environmental implications (Strubell et al., 2019). Moreover, there is limited research on the applicability of pre-trained models to classification tasks with small amounts of labelled data. Some related studies (Lee et al., 2020; Nguyen et al., 2020; Alsentzer et al., 2019) investigate whether it is helpful to tailor a pre-trained model to the domain, while others (Sun et al., 2019; Chronopoulou et al., 2019; Radford et al., 2018) analyse methods for fine-tuning BERT to a given task. However, these studies perform their evaluation on a limited range of datasets and classification models and do not consider scenarios with limited amounts of training data.
In particular, this paper aims to assess the role of labeled and unlabeled data in supervised text classification. Our study is similar to Gururangan et al. (2020), who investigate whether it is still helpful to tailor a pre-trained model to the domain of a target task. In this paper, however, we focus our evaluation on text classification and compare different types of classifiers across different domains (social media, news and reviews). Unlike other tasks such as natural language inference or question answering that may require a more subtle understanding of language, text classification is a task where feature-based linear models are still considered competitive (Kowsari et al., 2019). However, to the best of our knowledge there has not been an extensive comparison between such methods and newer pre-trained language models. To this end, we compare the light-weight linear classification model fastText (Joulin et al., 2017), coupled with generic and corpus-specific word embeddings, and the pre-trained language model BERT (Devlin et al., 2019), trained on generic and on domain-specific data. Specifically, we analyze the effect of training size on the performance of the classifiers in settings where training data is limited, both in few-shot scenarios with a balanced set and keeping the original label distributions. In both cases, our results show that a large pre-trained language model may not provide significant gains over a linear model that leverages word embeddings, especially when these belong to the given domain.

Supervised Text Classification
Given a sentence or a document, the task of text classification consists of associating it with a label from a pre-defined set. For example, in a simplified sentiment analysis setting the pre-defined labels could be positive, negative and neutral. In the following we describe standard linear methods and explain recent techniques based on neural models that we compare in our quantitative evaluation.

Supervised machine learning models
Linear models. Linear models such as SVMs or logistic regression coupled with frequency-based handcrafted features have traditionally been used for text classification. Despite their simplicity, they are considered a strong baseline for many text classification tasks (Joachims, 1998; McCallum et al., 1998; Fan et al., 2008), even more recently on noisy corpora such as social media text (Çöltekin and Rama, 2018; Mohammad et al., 2018). In general, however, these methods tend to struggle with out-of-vocabulary (OOV) words, fine-grained distinctions and unbalanced datasets. FastText (Joulin et al., 2017), which is the linear model evaluated in this paper, partially addresses these issues by combining a linear model with a rank constraint that allows parameters to be shared among features and classes, and by using word embeddings that are averaged into a text representation.
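For illustration, a minimal sketch of training such a classifier with the official fastText Python bindings (the training and test files are hypothetical and follow the library's "__label__<class> <text>" format):

```python
# Minimal sketch: fastText linear classifier (hypothetical data files).
import fasttext

model = fasttext.train_supervised(
    input="train.txt",   # e.g. "__label__positive great movie , loved it"
    loss="softmax",      # plain softmax loss over the label set
    dim=300,             # dimensionality of the averaged text representation
)

# Returns (number of examples, precision@1, recall@1) on the held-out file.
print(model.test("test.txt"))
print(model.predict("the plot was predictable but enjoyable"))
```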
Neural models. Neural models can learn non-linear and complex relationships, which makes them a preferable choice for many NLP tasks such as sentiment analysis or question answering (Sun et al., 2019). In particular, LSTMs, sometimes combined with CNNs for text classification (Xiao and Cho, 2016; Pilehvar et al., 2017), capture long-range dependencies in a sequential manner where data is read in only one direction (referred to as the 'unidirectionality constraint'). Recent state-of-the-art language models, such as BERT (Devlin et al., 2019), overcome this constraint by using transformer-based masked language modelling to learn deep bidirectional representations. These pre-trained models acquire generic knowledge from large unlabeled corpora and can then be fine-tuned on a specific task starting from the pre-trained parameters. BERT, which is the pre-trained language model tested in this paper, has been shown to provide state-of-the-art results on most standard NLP benchmarks (Wang et al., 2019b), including text classification.
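As a rough illustration, a minimal fine-tuning sketch with the Hugging Face transformers library (the texts, labels and number of classes below are toy placeholders, not our exact experimental setup):

```python
# Minimal sketch: fine-tuning BERT for sequence classification (toy data).
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

texts = ["service was great", "terrible battery life"]          # toy examples
labels = torch.tensor([0, 1])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(4):                       # a few passes over the (toy) batch
    out = model(**batch, labels=labels)  # returns loss and logits
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```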

Pre-trained word embeddings and language models
Most state-of-the-art NLP models nowadays use unlabeled data in addition to labeled data to improve generalization (Goldberg, 2016). This unlabeled data comes in the form of word embeddings for fastText and a pre-trained language model for BERT.
Word embeddings. Word embeddings represent words in a vector space and are generally learned with shallow neural networks trained on text corpora, with Word2Vec (Mikolov et al., 2013) being one of the most popular and efficient approaches. A more recent model based on the Word2Vec architecture is fastText, where words are additionally represented as the sum of character n-gram vectors. This allows building vectors for rare words, misspelt words or concatenations of words.
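As an illustration, a minimal sketch of learning such embeddings from a raw domain corpus with the fastText library (the corpus file name is hypothetical):

```python
# Minimal sketch: domain-specific fastText embeddings from an unlabeled corpus.
import fasttext

emb = fasttext.train_unsupervised(
    "domain_corpus.txt",  # hypothetical file, one plain-text document per line
    model="skipgram",     # Word2Vec-style objective with subword information
    dim=300,
    minn=3, maxn=6,       # character n-gram range used for subword vectors
)

# Subword vectors yield representations even for rare or misspelt words.
vec = emb.get_word_vector("mispeled")
print(emb.get_nearest_neighbors("windows", k=5))
```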
Language models. A limitation of the word embedding models described above is that they produce a single vector per word, regardless of the context in which it appears. In contrast, contextualized embeddings such as ELMo (Peters et al., 2018) or BERT (Devlin et al., 2019) produce word representations that are dynamically informed by the words around them. The main drawback of these models, however, is that they are computationally very demanding, as they are generally based on large transformer-based language models (Strubell et al., 2019).
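A small sketch illustrating the difference, assuming the Hugging Face transformers library and the generic bert-base-uncased model: the same surface word receives different vectors in different sentences.

```python
# Minimal sketch: context-dependent word vectors from BERT.
import torch
from transformers import BertTokenizerFast, BertModel

tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]  # (seq_len, 768)
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[idx]

v1 = word_vector("the bank approved the loan", "bank")
v2 = word_vector("we sat on the bank of the river", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # the two vectors differ
```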

Experimental Setting

Datasets. Our evaluation covers classification datasets from different domains (social media, news and reviews), namely the SE-16 and SE-18 Twitter datasets (the latter on emoji prediction; Barbieri et al., 2018), AG News (Zhang et al., 2015), 20 Newsgroups (Lang, 1995) and IMDB (Maas et al., 2011). The main features and statistics of each dataset are summarized in Table 1.

Comparison models. As mentioned in Section 2, our evaluation is focused on fastText (Joulin et al., 2017, FT) and BERT (Devlin et al., 2019). For completeness we include a simple baseline based on frequency-based features and a suite of classification algorithms available in the Scikit-Learn library (Pedregosa et al., 2011), namely Gaussian Naive Bayes (GNB), Logistic Regression and Support Vector Machines (SVM). Of the three, the best results were achieved with Logistic Regression, which is the model we include as the baseline in our experiments.
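A minimal sketch of such a frequency-based baseline with Scikit-Learn (the training data below is a toy placeholder):

```python
# Minimal sketch: bag-of-words (TF-IDF) features + Logistic Regression baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["great phone", "awful screen", "battery lasts long"]  # toy data
train_labels = ["pos", "neg", "pos"]

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),     # unigram and bigram frequencies
    LogisticRegression(max_iter=1000),
)
baseline.fit(train_texts, train_labels)
print(baseline.predict(["screen is awful"]))
```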
Training. As pre-trained word embeddings we downloaded 300-dimensional fastText embeddings trained on Common Crawl. To learn domain-specific word embedding models we used the corresponding training set of each dataset, except for the Twitter datasets, for which we leveraged an existing collection of unlabeled tweets from October 2015 to July 2018 to train 300-dimensional fastText embeddings (Camacho-Collados et al., 2020). Word embeddings are then fed as input to the fastText classifier, for which we used default parameters and softmax as the loss function. As for BERT, we fine-tune it for the classification task using a sequence classifier, a learning rate of 2e-5 and 4 epochs. In particular, we made use of the default Hugging Face transformers implementation of BERT for sentence classification (Wolf et al., 2019) and the hierarchical principles described in Pappagari et al. (2019) for pre-processing long texts before feeding them to BERT. We used the generic base uncased pre-trained BERT model and BERT-Twitter, both from Hugging Face (Wolf et al., 2019).
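For reference, a minimal sketch of initialising the fastText classifier with pre-trained vectors (file names are hypothetical; the pretrainedVectors option expects a .vec file whose dimensionality matches dim):

```python
# Minimal sketch: fastText classifier initialised with pre-trained vectors.
import fasttext

model = fasttext.train_supervised(
    input="train.txt",
    loss="softmax",
    dim=300,                            # must match the .vec file dimensionality
    pretrainedVectors="crawl-300d.vec", # swap in domain-specific vectors here
)
```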
Evaluation metrics. We report results based on standard micro- and macro-averaged F1 (Yang, 1999). In our setting, since the systems provide an output for every instance, micro-averaged F1 is equivalent to accuracy.
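A short sketch illustrating the two metrics with Scikit-Learn (toy labels):

```python
# Micro- vs macro-averaged F1; with one predicted label per instance,
# micro-F1 coincides with accuracy.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]

print(f1_score(y_true, y_pred, average="micro"))  # equals accuracy
print(f1_score(y_true, y_pred, average="macro"))  # unweighted mean over classes
print(accuracy_score(y_true, y_pred))
```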

Analysis
We perform two main types of analysis. First, we look at the effect of training size on the classifiers' performance by randomly sampling subsets of different sizes from the original labeled datasets (Section 4.1). Then, we perform a few-shot experiment where we compare the classifiers' performance on balanced subsets of the training data of different sizes (Section 4.2).

Effect of training size

Table 2 shows the results with different sizes of training data randomly extracted from the training set. Surprisingly, the classification model based on corpus-trained embeddings achieves higher performance with less labelled data than the classifier based on pre-trained contextualised models. However, with more than 5,000 training samples, fine-tuned BERT significantly outperforms the fastText corpus-based classifier, especially when the domain-trained BERT model (i.e., BERT (Twitter)) is used. Furthermore, the performance of the fine-tuned model improves at a higher rate than that of the classifier based on corpus-trained embeddings for training sets with more than 2,000 instances. For instance, on the SE-18 dataset, fastText with domain embeddings improves by 0.112 micro-F1 points when the entire dataset is used compared to using only 200 instances, while BERT-Twitter obtains a 0.360 absolute improvement. In contrast, fastText with pre-trained embeddings performs similarly to the baseline. This shows the advantage of fine-tuning pre-trained models on the given domain and task.

Sentences vs. documents. In order to avoid confounds such as the type of input data in each of the experiments, we also break down the results by sentences and documents (see Table 1 for the split of datasets into each category). Figure 1 shows the results for this experiment. As can be observed, training set size affects both types of input similarly, with BERT being especially sensitive to the amount of training data.
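A minimal sketch of the random subsampling used in this experiment, assuming the labeled data is held in a pandas DataFrame (the column names and subset sizes are illustrative):

```python
# Minimal sketch: random training subsets of increasing size, sampled
# uniformly so the original label distribution is roughly preserved.
import pandas as pd

def random_subsets(train_df: pd.DataFrame, sizes=(200, 500, 1000, 2000, 5000), seed=42):
    """Yield (size, subset) pairs drawn at random from the training set."""
    for n in sizes:
        yield n, train_df.sample(n=min(n, len(train_df)), random_state=seed)
```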

Few-shot experiment
A few-shot comparison of the classifiers trained on balanced data is shown in Table 3. We balance the dataset for the few-shot experiments to ensure that instances of all labels occur in the training set, even for datasets with 20 labels when 5-shot and 10-shot experiments are performed. We also look at the effect of balanced training data on the classifiers' performance. The results show that balancing the dataset leads to improvements in classification performance with limited training data, especially for BERT. For example, using a subset of 1,000 training instances of the 20 Newsgroups corpus, the macro-F1 for randomly sampled data is 0.42, while the macro-F1 for balanced data (i.e., 50 instances per label) is 0.556. Similarly to the experiments with randomized data samples, fastText based on corpus-trained embeddings is the best performing classification model for very small amounts of balanced labeled data (see Figure 2). However, as the amount of training data increases, BERT outperforms fastText on average by 0.0442 absolute points. As in the previous experiment, the classification model based on pre-trained embeddings performs poorly compared to the corpus-trained embeddings and the models fine-tuned to the task. Furthermore, BERT (Twitter) leads to significant improvements over BERT when only 10 instances per label are used (e.g., for SE-16, BERT (Twitter) obtains a macro-F1 of 0.370, similar to domain-based fastText with 0.384, versus base BERT with 0.200).
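A minimal sketch of building such balanced k-shot subsets, again assuming a pandas DataFrame with hypothetical column names:

```python
# Minimal sketch: balanced k-shot training set (k instances per label).
import pandas as pd

def k_shot_subset(train_df: pd.DataFrame, k: int, seed: int = 42) -> pd.DataFrame:
    """Sample k instances per label so every class is represented."""
    return (
        train_df.groupby("label", group_keys=False)
        .apply(lambda g: g.sample(n=min(k, len(g)), random_state=seed))
        .reset_index(drop=True)
    )
```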

Conclusion and Future Work
In this paper, we analyzed the role of labeled training data and unlabeled domain-specific data in supervised text classification. We compared both linear models and neural models based on transformer-based language models.
In settings with small training data, a simple method such as fastText coupled with domain-specific word embeddings appears to be more robust than a more data-hungry model such as BERT, even when BERT is pre-trained on domain-relevant data. However, the same classifier with generic pre-trained word embeddings does not perform consistently better than a traditional frequency-based linear baseline. Generic embeddings tend to fail to represent the meaning of more domain-specific words, which may explain their lower performance. This is confirmed by a nearest neighbour analysis (see Table 5), which showed that the generic embeddings do not provide accurate representations of more technical words such as 'Windows' and 'Sun'. In the IMDB reviews, words such as 'Toothless', used within a very specific context, are also not correctly represented by the generic model. Moreover, tweets are rich in abbreviations with domain-specific meanings, such as 'SF' referring to 'San Francisco'. Finally, BERT pre-trained on domain-specific data (i.e., Twitter) leads to improvements over generic BERT, especially in the few-shot experiments. For future work, it would be interesting to further delve into the role of unlabeled data in text classification, both in terms of word embeddings (e.g., by making use of meta-embeddings (Yin and Schütze, 2016)) and of the data used to train language models (Gururangan et al., 2020). Moreover, this quantitative analysis could be extended to more classification tasks and different models, e.g., larger language models such as RoBERTa (Liu et al., 2019) and GPT-3 (Brown et al., 2020), which appear to be better suited to few-shot experiments.