Pre-trained Transformer-based Classification and Span Detection Models for Social Media Health Applications

This paper describes our approach to six classification tasks (Tasks 1a, 3a, 3b, 4, 5, and 6) and one span detection task (Task 1b) from the Social Media Mining for Health (SMM4H) 2021 shared tasks. We developed two separate systems, one for classification and one for span detection, both based on pre-trained Transformer-based models. In addition, we applied oversampling and classifier ensembling in the classification tasks. Our submissions scored above the median in all tasks except Task 1a. Furthermore, our model achieved first place in Task 4 and obtained an F1-score 7% higher than the median in Task 1b.


Introduction
Social media platforms such as Twitter are widely used to share experiences and health information, including adverse drug effects (ADEs), attracting an increasing number of researchers to health-related research on this data. However, because social media data consist of user-generated content that is noisy and written in informal language, health language processing on social media remains challenging. To promote the use of social media for health information extraction and analysis, the Health Language Processing Lab of the University of Pennsylvania organized the Social Media Mining for Health Applications (SMM4H) shared tasks. This year, the SMM4H shared tasks included 8 subtasks (Magge et al., 2021). Our team, the Sarker Lab at Emory University, participated in six classification tasks (Tasks 1a, 3a, 3b, 4, 5, and 6) and one span detection task (Task 1b) of the SMM4H 2021 shared tasks. In recent years, Transformer-based models such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), whose strength is modeling long-range context semantics, have revolutionised the field of NLP and achieved state-of-the-art results on a variety of NLP tasks. Encouraged by these successes, we developed separate systems for classification and span detection, both based on pre-trained Transformer-based models. We experimented with different Transformer-based model variants, and the model that achieved the best result on the validation set was selected as the final system. In addition, we performed undersampling and oversampling to address data imbalance and applied an ensemble technique in the classification tasks. Our submissions scored above the median F1-score in all tasks except Task 1a. Furthermore, our model achieved first place in Task 4 and obtained an F1-score 7% higher than the median in Task 1b.

Problem Definition and Datasets
We participated in six classification tasks including Task 1a: Classification of adverse effect mentions in English tweets; Tasks 3a and 3b: Classification of change in medication regimen in tweets and drug reviews from WebMD.com; Task 4: Classification of tweets self-reporting adverse pregnancy outcomes; Task 5: Classification of tweets self-reporting potential cases of COVID-19; and Task 6: Classification of COVID-19 tweets containing symptoms. Further details about the data can be found in Magge et al. (2021). Among the six classification tasks, Task 6 was three-way classification and used micro-averaged F1-score as the evaluation metric, while the remaining tasks were binary classification and used the F1-score for the positive class for evaluation. For all tasks, we split the training data into a training set (TRN) and an internal validation set (TRN_VAL) with a 90/10 ratio, and evaluated the model on the validation set (VAL) released by the organizers.
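The 90/10 split described above can be sketched with the standard library (the fixed seed and function name are illustrative, not our actual script):

```python
import random

def split_train_val(examples, val_ratio=0.1, seed=42):
    """Shuffle and split examples into a training set (TRN) and
    an internal validation set (TRN_VAL) with a 90/10 ratio."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_ratio)
    return shuffled[n_val:], shuffled[:n_val]  # TRN, TRN_VAL

trn, trn_val = split_train_val(list(range(1000)))
```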

Method
We used a uniform framework for all classification tasks, which consists of a Transformer-based encoder, a pooling layer, a linear layer, and an output layer with Softmax activation. For each instance, the encoder converts each token into an embedding vector, and the pooling layer generates a document embedding by averaging the token embeddings. The document embedding is then fed into the linear layer and the output layer. The output is a probability vector with values between 0 and 1, which is used to compute a logistic loss during the training phase; the class with the highest probability is chosen during the inference phase.
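A minimal NumPy sketch of this pipeline, with the encoder stubbed by random token embeddings (in the actual system the embeddings come from a pre-trained Transformer, and the head is trained rather than random):

```python
import numpy as np

def classify(token_embeddings, W, b):
    """Mean-pool token embeddings into a document embedding, apply a
    linear layer, and return class probabilities via softmax."""
    doc = token_embeddings.mean(axis=0)        # pooling layer
    logits = W @ doc + b                       # linear layer
    exp = np.exp(logits - logits.max())        # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
tokens = rng.normal(size=(12, 768))            # 12 tokens, 768-dim (stub encoder output)
W, b = rng.normal(size=(2, 768)), np.zeros(2)  # binary classification head
probs = classify(tokens, W, b)                 # probability vector over 2 classes
```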
Encoder: Encouraged by the success of pre-trained Transformer-based language models, we experimented with four Transformer-based models pre-trained on different corpora: BERTweet (Nguyen et al., 2020), Bio_ClinicalBERT, RoBERTa Base, and RoBERTa Large (Liu et al., 2019).
Oversampling: As described in Magge et al. (2021), the class distributions of Tasks 1a, 3a, and 5 are imbalanced. To address this, we oversampled the minority class in the training set by picking samples at random with replacement using the Python toolkit imbalanced-learn. The script is available on GitHub.2 After oversampling, the new training sets included 28,942, 9,644, and 9,786 instances for Tasks 1a, 3a, and 5, respectively.
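We used imbalanced-learn for the oversampling itself; the underlying idea, sampling the minority class with replacement until class counts are balanced, can be sketched with the standard library (a toy illustration, not our actual script):

```python
import random
from collections import Counter

def oversample(examples, labels, seed=42):
    """Randomly duplicate minority-class examples (with replacement)
    until every class matches the majority-class count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(examples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(examples, labels) if y == cls]
        out_x += rng.choices(pool, k=target - n)   # sample with replacement
        out_y += [cls] * (target - n)
    return out_x, out_y

# Toy imbalanced set: three negatives, one positive
x, y = oversample(["a", "b", "c", "d"], [0, 0, 0, 1])
```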
Ensemble Modeling: To improve performance over individual classifiers, we applied an ensemble technique to combine the results of different models: we averaged the outputs (i.e., the probability vectors) of each model and selected the class with the highest value as the inference result.
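The probability-averaging ensemble is simple enough to state in a few lines (the example outputs are hypothetical):

```python
import numpy as np

def ensemble_predict(prob_vectors):
    """Average per-model probability vectors and return the argmax class
    together with the averaged vector."""
    avg = np.mean(prob_vectors, axis=0)
    return int(np.argmax(avg)), avg

# Three hypothetical model outputs for one binary-task instance
pred, avg = ensemble_predict([[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]])
```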

Experiments and Results
We trained each model for 10 epochs, and the checkpoints that achieved the best performance on TRN_VAL were selected for evaluation. We experimented with two learning rates ∈ {2e-5, 3e-5} and three different random initializations, yielding six checkpoints in total for each model.3 For each type of model, the median of the six checkpoints was used when we reported the results of individual models (i.e., BT, CL, RBB, and RBL). For each ensemble model, all six checkpoints of the same type of model were used; therefore, an ensemble model that combines k types of models consists of 6 × k checkpoints. Table 1 shows the results of individual models and ensemble models trained on the oversampled training sets. For each task, we submitted the model that performed best on the validation set; the results on the test sets are shown in Table 2. The performance of our systems was above the median for each task except Task 1a, and we achieved first place on Task 4. Compared to the baseline model (Weissenbacher et al., 2020), our system achieved an 18% higher F1-score on Task 3a and a comparable result on Task 3b.4

Analysis
In general, for individual models, RoBERTa Base and RoBERTa Large performed better than or comparably to BERTweet, while Bio_ClinicalBERT underperformed the other models on all tasks, which is consistent with our previous findings (Guo et al., 2020). Ensemble models outperformed individual models on all tasks except Task 1a. We observed that for Task 1a, all models achieved high F1-scores (around 97%) on the TRN_VAL set after training for 1 epoch, but performance dropped by 25%-35% on the VAL set. Similarly, our F1-score on the test set of Task 1a is 40%, which is lower than that on the VAL set. Since the same trend is not present for the other tasks, we hypothesized that the types of ADE in the training set and validation set of Task 1a may have low overlap.
To test our hypothesis, we counted the number of distinct ADE labels and normalized ADE labels using the data of Task 1b and Task 1c, as shown in Table 3. Interestingly, the overlap percentage of normalized ADE labels is as high as 85.5%, while that of unnormalized ADE labels is much lower. This suggests that most ADE types in the validation set are included in the training set, but the ADE descriptions can vary widely. This indicates that the gap between training-set and validation-set performance may be attributed to the limited ability of pre-trained Transformer-based models to generalize across different expressions of the same ADE.


Task 1b - ADE Span Detection

Problem Definition and Dataset
Task 1b aims at distinguishing adverse effect mentions from Non-ADE expressions and identifying the text spans of these adverse effect mentions. A tweet can have more than one ADE mention, and an ADE mention can be a sequence of words as well.
The training set consists of 17,385 tweets annotated with 1,713 ADE mentions across 1,235 tweets, and the validation set consists of 915 tweets annotated with 87 ADE mentions across 65 tweets.

Method
We implemented several Transformer-based models, including BioBERT (Lee et al., 2020), SciBERT (Beltagy et al., 2019), BERTweet (Nguyen et al., 2020), and two BERT models, BERT Base and BERT Large (Devlin et al., 2019), and compared their performances.5 BioBERT is trained specifically on biomedical text and is widely used for biomedical text mining tasks such as NER. SciBERT is trained on scientific publications from multiple domains, including computer science. BERTweet is a pre-trained language model for English tweets. In addition, since the dataset is highly imbalanced, we performed undersampling to change the composition of the training set. Specifically, we randomly divided the negatively labeled training data into 10 non-overlapping subsets, each of slightly larger size (2,000 tweets) than the positive data (the same 1,235 positive tweets), and then randomly selected 5 subsets for our experiments.
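The shuffle-and-chunk step for building the non-overlapping negative subsets can be sketched as follows (subset count and sizes are parameters; the seed and data are illustrative):

```python
import random

def undersample_subsets(negatives, n_subsets=10, seed=42):
    """Shuffle the negative examples and divide them into n non-overlapping
    subsets; each subset is later paired with all positive examples to form
    one undersampled training set."""
    rng = random.Random(seed)
    shuffled = negatives[:]
    rng.shuffle(shuffled)
    size = len(shuffled) // n_subsets
    return [shuffled[i * size:(i + 1) * size] for i in range(n_subsets)]

subsets = undersample_subsets(list(range(20000)))  # 20,000 stand-in negatives
```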

Experiments and Results
In our experiments, since tweets are relatively short, we set the maximum sequence length to 128, and the batch size to 128 for BERT Large and 256 for the other models. The learning rate was set to 5e-5, and the number of epochs to 20, for all 5 models. The final submissions were evaluated by the official evaluation scripts provided by the organizers in terms of precision, recall, and F1-score, counting each extracted ADE whose span overlaps the gold span either entirely or partially. However, for convenience when comparing models during our experiments, we used Seqeval (https://github.com/chakki-works/seqeval), a Python framework for sequence labeling evaluation, to compare all methods on the validation set, also by precision, recall, and F1-score, at the token level. Table 4 shows the performance of these 5 models. From Table 4, it can be observed that BERT Large outperforms all other models, with the highest recall and F1-score. As a result, we chose BERT Large for the final submission. Finally, the result we received from the organizers was similar to the performance on the validation set and above the median: although our recall is 17% lower than the median recall, our precision is 68.1 (+19%), and our F1-score of 49.0 is 7% higher than the median F1-score.
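The official partial-overlap criterion described above can be sketched as follows; this is our reading of the evaluation, not the organizers' actual script, and the example spans are made up:

```python
def spans_overlap(pred, gold):
    """True if two (start, end) character spans overlap at least partially."""
    return pred[0] < gold[1] and gold[0] < pred[1]

def overlap_f1(pred_spans, gold_spans):
    """Precision/recall/F1 where any partial overlap counts as a match."""
    tp = sum(any(spans_overlap(p, g) for g in gold_spans) for p in pred_spans)
    prec = tp / len(pred_spans) if pred_spans else 0.0
    matched_gold = sum(any(spans_overlap(p, g) for p in pred_spans)
                       for g in gold_spans)
    rec = matched_gold / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# One predicted span partially overlaps the single gold span; one does not
p, r, f = overlap_f1([(0, 5), (10, 15)], [(3, 8)])
```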

Comparison Between Models
We investigated the learning efficiency and performance of each model over 20 epochs, evaluating on the validation set after each epoch. The precision, recall, and F1-score for each epoch are shown in Figure 1. The three plots show that BERT Large learns very quickly: by epoch 2, its precision, recall, and F1-score reach about 35%, while the scores of the other models are only around 15% at this stage. In addition, as shown in the plots, BERT Large consistently outperforms the other models during training, which may benefit from its larger pre-training dataset. However, it is surprising that, unlike the curves of BioBERT, SciBERT, and BERTweet, the curves of the BERT Base model are relatively unstable, with some fluctuations.

Undersampling Experiments
Since BERT Large was the best model in our experiments, we separately fine-tuned BERT Large for 10 epochs on each of the 5 undersampled datasets and compared the average scores across these 5 subsets with the scores obtained without undersampling. These results were also evaluated on the validation set at the token level and are shown in Table 5. The average F1-score across the undersampled subsets is significantly lower than the best performance. Although we used all the positive data, it is possible that the drastic reduction in the amount of negative data, and of the total training data, had a large impact on the results. Furthermore, randomly sampling the negative examples changes the classifier's prior class distribution. Due to time constraints associated with the shared task deadline, we were unable to try more advanced heuristics for selecting negative examples for undersampling, which is worth exploring in future work.

Performance Analysis
To analyze the results we received from the organizers, we compared the validation-set annotations provided by the organizers with the predictions of BERT Large. This analysis revealed two primary reasons why our model did not receive higher scores. First, the number of true positives is relatively small: the validation set contains 87 annotations labeled "ADE", but our model predicted only 60 ADE mentions (including both true positives and false positives), 23 of which were only partially correct, meaning that many true ADEs were not detected (false negatives). Second, most of the ADE mentions predicted by our model that are not labeled "ADE" in the validation set did not appear out of nowhere: they are in fact labeled "ADE" in the training set. For example, "nosleep", which does not seem ambiguous, is marked as an ADE in one tweet but not in another, which might be due to differences in the contexts in which it is mentioned. In some tweets, "nosleep" appears within the hashtag "teamnosleep"; although our model predicted it as an ADE mention after tokenization, annotators did not label it as one.

Conclusion
In this work, we developed two systems based on pre-trained Transformer-based models for multiple health-related classification tasks and one span detection task of the SMM4H 2021 shared tasks. We experimented with different Transformer-based model variants as well as sampling strategies, and applied an ensemble technique in the classification tasks. Our submissions scored above the median F1-score in all tasks except Task 1a. Furthermore, our model achieved first place in Task 4 and obtained an F1-score 7% higher than the median in Task 1b. In future work, we will investigate methods to improve the generalizability of pre-trained Transformer-based models to the varied health-related expressions found in social media data.