DATAMAFIA at WNUT-2020 Task 2: A Study of Pre-trained Language Models along with Regularization Techniques for Downstream Tasks

This paper describes the system developed by team datamafia for WNUT-2020 Task 2: Identification of Informative COVID-19 English Tweets. It contains a thorough study of pre-trained language models on a downstream binary classification task over noisy, user-generated Twitter data. The solution submitted to the final test leaderboard is a fine-tuned RoBERTa model that achieves F1 scores of 90.8% and 89.4% on the dev and test data, respectively. In the later part of the paper, we explore several techniques for injecting regularization explicitly into language models to generalize predictions over noisy data. Our experiments show that adding explicit regularization to the pre-trained RoBERTa model makes it robust to data and annotation noise and improves overall performance by more than 1.2%.


Introduction
The recent outbreak of Coronavirus disease (COVID-19) has turned the world topsy-turvy, with more than 25M infected people so far and 800K+ deaths across the globe (https://covid19.who.int/). Government officials, researchers, health workers, and fearful citizens are largely relying on online information to monitor, tackle, and overcome the situation. Social media platforms, particularly Twitter and Facebook, have become easily accessible sources of information on current affairs. Very recently, a few researchers (Drias and Drias, 2020; Samuel et al., 2020) have conducted large-scale analyses of Twitter data in the context of COVID-19. However, as mentioned by Nguyen et al. 2020b, a huge majority of the information shared on Twitter is not informative and can pose an additional burden to those who are relying on social media to monitor the pandemic. For example, a tweet like "Half of Uruguay's COVID-19 cases can be traced to a single fashion designer" can be speculative and may not contain any insightful information. On the other hand, a tweet like "Currently 32000+ deaths and their talking spreading it far and wide...BBC News -Coronavirus: Trump unveils plan to reopen states in phases" can be very useful to a larger population.
To address this situation, shared task 2 of WNUT 2020 (Nguyen et al., 2020b) aims to automatically identify whether a tweet is informative in the context of COVID-19 or not. The task dataset contains 10K tweets (written mostly in English), each assigned one of two labels, INFORMATIVE or UNINFORMATIVE, by human annotators.
In this task, we use a fine-tuned pre-trained RoBERTa-base model (Liu et al., 2019) to learn contextual representations of the text. We further discover that the last 4 layers of RoBERTa contain semantically rich and diverse hidden representations which, when used together, can lead to better performance. In our final submitted model, described in section 2.1, we use the concatenated hidden states of all the tokens from the last 4 layers of RoBERTa-base. Upon further investigation, we realize that over-parameterized large transformer models can be prone to overfitting when fine-tuned on noisy data with ambiguous annotations. In the later part of our study (section 2.3), we explore various techniques for injecting regularization externally into pre-trained language models to improve their generalization capabilities. Although ensembling a diverse set of classifiers (Opitz and Maclin, 1999) is known to be an effective technique for improving generalization, large ensemble systems are not well suited for drawing inferences on low-resource devices in real-life applications. Further, interpreting model predictions is also difficult for complex ensemble systems. Due to these operational challenges, we refrain from using ensembles and focus on single-model systems in our work. We have open-sourced our system and the experiments on GitHub.

System Description
In this shared task we use the original train, dev, and test datasets provided in the challenge. The dev dataset is used only for validation and evaluation of our models. We omit a description of the datasets in this paper due to page constraints; it can be found in the task description by Nguyen et al. 2020b.

System Model
After the introduction of the self-attention based transformer architecture (Vaswani et al., 2017), several large auto-regressive and auto-encoder based language models (Devlin et al., 2018; Liu et al., 2019; Yang et al., 2019) and their variants have been developed and have shown strong results on various downstream NLP tasks, including text classification, named entity recognition (NER), and natural language inference (NLI). Very recently, Nguyen et al. 2020a developed a pre-trained model specifically for English tweets. We use RoBERTa-base (Liu et al., 2019) as our base language model to learn hidden representations from the text data. Clark et al. 2019, Kovaleva et al. 2019, and Hao et al. 2019 showed that different attention heads from different layers of BERT learn different features from text data. Keeping this in mind, we evaluate all 12 layers of RoBERTa-base and find that the last 4 layers learn a diverse set of hidden representations and influence the final output the most. Although the original BERT and RoBERTa use only the [CLS] token embedding for classification tasks, our experiments show that using the hidden states of all tokens can lead to better generalization. The architecture of our submitted model (Model_system) is shown in Figure 1; a rough sketch follows.
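Since the exact wiring of Model_system is deferred to Figure 1, the following is a minimal sketch of the architecture under stated assumptions: the mask-aware mean pooling over token hidden states, the class name, and the dropout placement are our own illustrative choices, and the 0.2 dropout rate is assumed to match the setting we use before the final logistic activation in our other models. The last-4-layer concatenation and the logistic output follow the description above.

import torch
import torch.nn as nn
from transformers import RobertaModel

class InformativenessClassifier(nn.Module):
    def __init__(self, num_last_layers=4, dropout=0.2):
        super().__init__()
        # output_hidden_states=True exposes the hidden states of every layer.
        self.roberta = RobertaModel.from_pretrained(
            "roberta-base", output_hidden_states=True
        )
        self.num_last_layers = num_last_layers
        self.dropout = nn.Dropout(dropout)
        hidden = self.roberta.config.hidden_size  # 768 for roberta-base
        self.classifier = nn.Linear(hidden * num_last_layers, 1)

    def forward(self, input_ids, attention_mask):
        out = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        # out.hidden_states holds the embedding output plus all 12 layers,
        # each of shape (batch, tokens, hidden); keep the last 4 layers.
        concat = torch.cat(out.hidden_states[-self.num_last_layers:], dim=-1)
        # Pool over ALL token positions (not just the first token) with a
        # mask-aware mean, so padding does not dilute the representation.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (concat * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        # Logistic output: probability of the tweet being INFORMATIVE.
        return torch.sigmoid(self.classifier(self.dropout(pooled))).squeeze(-1)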

Other Baselines
Apart from our original submission, we explore other language models and their variants in detail. Each of these models is used to learn an overall representation of the text, which is followed by a logistic dense layer to calculate the probability of a text being INFORMATIVE. In all these models, we use a dropout of 0.2 before applying the final logistic activation.

Techniques for Injecting Regularizations into Language Models
Although BERT and its variants are highly over-parameterized models, they are fairly robust to overfitting during fine-tuning (Hao et al., 2019); however, empirical results from Lee et al. 2020 show instability when such models are fine-tuned on small and noisy data. Unlike BookCorpus (Zhu et al., 2015) or the English Wikipedia data used by most language models for pre-training, Twitter data is very noisy, unstructured, and lacks many linguistic characteristics. To tackle the noisy nature of the dataset, we explore various strategies for regularizing the base language model to make it robust to text noise.
• Transformer Hidden Dropout - Dropout (Srivastava et al., 2014) is an effective technique for reducing overfitting. The original BERT and RoBERTa language models use a hidden dropout rate of 0.1 in the FFN layers. We experiment with various dropout rates dropout(p) ∈ [0.0, 0.3].
• L2 Regularization - As explored by Schwarz et al. 2018 and Kirkpatrick et al. 2017, we add an additional L2 penalty term to the final loss, i.e., loss = L_CE + λ·‖θ‖², where the regularization coefficient λ controls the effect of the penalty term on the overall loss.
• Mixout - Mixout is a technique recently proposed by Lee et al. 2020 that stochastically replaces fine-tuned parameters with their pre-trained values, and it shows strong performance improvements when used with BERT on downstream fine-tuning tasks. We use the parameter mixout(w_pre) to tune the effect of mixout in our model.
• Multi-Sample Dropout - Inoue 2019 proposed multi-sample dropout to accelerate training as well as improve generalization. Multi-sample dropout averages the outputs of multiple dropout masks for a single sample (see the first sketch after this list).
• Text Augmentation - We inject artificial noise into the training data by randomly masking a certain percentage of tokens and replacing them with contextually similar words predicted by BERT. For text augmentation we use the nlpaug package (Ma, 2019a), with the parameter aug_p ∈ [0.0, 0.3] denoting the proportion of tokens to be masked in each text (see the second sketch after this list).
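Two of these regularizers benefit from a concrete illustration. First, a minimal sketch of multi-sample dropout under stated assumptions: the class name and the number of dropout samples (5) are our own illustrative choices, not taken from the paper.

import torch
import torch.nn as nn

class MultiSampleDropoutHead(nn.Module):
    def __init__(self, in_features, p=0.2, num_samples=5):
        super().__init__()
        # Each nn.Dropout draws an independent mask on every forward pass.
        self.dropouts = nn.ModuleList([nn.Dropout(p) for _ in range(num_samples)])
        self.classifier = nn.Linear(in_features, 1)

    def forward(self, pooled):
        # Average the logits produced under the different dropout masks;
        # at inference time dropout is a no-op, so all samples coincide.
        logits = torch.stack([self.classifier(d(pooled)) for d in self.dropouts])
        return logits.mean(dim=0).squeeze(-1)

Second, a sketch of the augmentation step with nlpaug's contextual word-embedding augmenter; the choice of bert-base-uncased as the masking model and the aug_p value of 0.1 are assumptions within the range explored above. The hidden-dropout and L2 settings need no sketch: the former is a one-line configuration change (e.g. hidden_dropout_prob in Huggingface's RobertaConfig) and the latter is a single extra term in the loss.

import nlpaug.augmenter.word as naw

aug = naw.ContextualWordEmbsAug(
    model_path="bert-base-uncased",  # BERT predicts the replacement words
    action="substitute",             # replace masked tokens, do not insert
    aug_p=0.1,                       # proportion of tokens to perturb
)
# Returns the augmented text (a list in recent nlpaug versions).
augmented_tweet = aug.augment("Trump unveils plan to reopen states in phases")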
To the best of our knowledge, after the work of Lee et al. 2020, ours is the first large-scale empirical study to show the effectiveness of different regularization techniques on pre-trained language models over noisy text data.

Hyperparameter Settings
In this work, for training, validation, and testing, we use the raw data only, without any further pre-processing. All the pre-trained language models are kept with their default configurations. For the base language models, we use Huggingface's transformers library (Wolf et al., 2019). We use the default byte-pair encoding (BPE) tokenizer of each language model to tokenize the raw texts, with a maximum sequence length of 100.
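As a point of reference, a minimal sketch of this tokenization setup with Huggingface's transformers; the padding strategy and the example string are our own illustrative choices.

from transformers import RobertaTokenizer

# Default BPE tokenizer shipped with the pre-trained checkpoint.
tok = RobertaTokenizer.from_pretrained("roberta-base")
enc = tok(
    ["Trump unveils plan to reopen states in phases"],
    padding="max_length",   # pad every tweet to the same length
    truncation=True,
    max_length=100,         # max sequence length used in this work
    return_tensors="pt",
)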

Results
We evaluate the performance of all the models using F1, precision, and recall scores. We also observe that using more than one layer of RoBERTa usually works better than using only the last layer. Table 3 shows the performance on the dev data of the regularization techniques described in section 2.3. We observe that the RoBERTa language model with any sort of regularization works better than the one without any regularization added. Figure 2 shows the effect of each regularization method on Model_system. Among all the methods, multi-sample dropout and augmented data show the most stability with respect to all the evaluation metrics. The individual dropout layers in Model_multi act differently on each sample and show high variability among each other, with low average correlation. On the other hand, by randomly replacing word tokens in the augmented texts, the language model learns the overall context better without depending too much on any particular phrase.

Result Analysis
In this section, we inspect the language models and explain their predictive capabilities. In exploratory data analysis (EDA), we plot the top words present in the corpus, conditioned on the INFORMATIVE and UNINFORMATIVE classes, and find that "case", "covid", "death", "virus", etc. remain top words for both classes of documents. We observe that any language model can achieve an F1 score of more than 89% in just 2-3 epochs of fine-tuning; however, due to the inherent noise of the tweets, around 10% of the examples are ambiguous and difficult to classify correctly for almost all standalone models. Figure 3 shows the two clusters of text representations extracted by Model_system (embedded into a lower-dimensional space) on the dev dataset, with several misclassified ambiguous examples. From Table 2 we can understand that all the models have an inductive bias towards the positive class, which leads to relatively poor precision but high recall. There are 89 examples in the validation set that are wrongly classified by Model_system. However, 47 of them are correctly predicted by at least one of the regularized models (the models described in Table 3), and 70% of those examples are originally UNINFORMATIVE. We closely inspect the predictions using the model interpretation tool Captum (Kokhlikyan et al., 2019), which uses gradient-based attribution methods to explain predictions. A token with a high positive attribution score is assigned more importance by the model and correlates positively with the overall prediction; similarly, a word with a high negative attribution score affects the final outcome adversely. In Figure 4, we explain the predictions of Model_system (panel a) and Model_aug (panel b) on an INFORMATIVE tweet, with tokens carrying high (low) attribution scores highlighted in green (red). We find that the system model fails to capture the overall semantics correctly, whereas Model_aug looks at contextually more important words like "declare", "lockdown", "immediately", etc. and predicts the tweet to be INFORMATIVE. Similar observations are found in Figure 5, where Model_system assigns more importance to frequently occurring words like "cases" and predicts wrongly. On the contrary, Model_aug understands the subtly sarcastic tone of the tweet by looking at the phrases "interesting" and "was in China" and classifies it correctly with high confidence.
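For reproducibility, a hedged sketch of the attribution setup, reusing the model and tokenizer from the earlier sketches. The paper does not name the exact Captum algorithm, so Layer Integrated Gradients over the embedding layer is shown as one representative gradient-based choice; the example tweet is illustrative.

import torch
from captum.attr import LayerIntegratedGradients

model = InformativenessClassifier()
model.eval()
enc = tok("Govt to declare lockdown immediately across the state",
          return_tensors="pt", truncation=True, max_length=100)
input_ids, attention_mask = enc["input_ids"], enc["attention_mask"]

def forward_prob(ids, mask):
    # Scalar P(INFORMATIVE) per example: the quantity being attributed.
    return model(ids, mask)

lig = LayerIntegratedGradients(forward_prob, model.roberta.embeddings)

# Baseline input: every position replaced by the pad token.
baselines = torch.full_like(input_ids, tok.pad_token_id)
attributions = lig.attribute(
    inputs=input_ids,
    baselines=baselines,
    additional_forward_args=(attention_mask,),
)
# Sum over the embedding dimension to obtain one signed score per token;
# positive scores push towards INFORMATIVE, negative scores away from it.
token_scores = attributions.sum(dim=-1).squeeze(0)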

Conclusion
In this paper, we present a large-scale empirical study of language models with explicit regularization. We conclude that using hidden states from multiple layers of a language model helps in understanding the context better, and that adding explicit regularization further improves the stability and generalization capabilities of large pre-trained models. In the future, we wish to use the insights from this work to build a custom, robust language model specifically for the noisy user-generated texts found on social media. Another interesting extension would be to derive theoretical justifications and generalization bounds for each of the explored regularization methods. We strongly believe that our study will help the research community use language models more effectively in real-life applications.