BERT based Adverse Drug Effect Tweet Classification

This paper describes models developed for the Social Media Mining for Health (SMM4H) 2021 shared tasks. Our team participated in the first subtask, which classifies tweets with Adverse Drug Effect (ADE) mentions. Our best-performing model uses BERTweet followed by a single BiLSTM layer. The system achieves an F1-score of 0.45 on the test set without using any auxiliary resources such as Part-of-Speech tags, dependency tags, or knowledge from medical dictionaries.


Introduction
In this effort, we focus on detecting tweets that contain ADE mentions as part of the Social Media Mining for Health (#SMM4H) 2021 shared tasks (Magge et al., 2021). The organizers of SMM4H Task 1 provided datasets of English tweets with binary annotations of 1 and 0, indicating the presence or absence of ADE mentions in the tweet. We develop a system, robust to the class imbalance in the dataset, that classifies tweets containing at least one ADE mention. We also empirically validate the importance of emojis and hashtags in ADE classification.

Dataset
The dataset consists of a training set (18,000 tweets), validation set (953 tweets), and test set (10,000 tweets). The dataset is highly imbalanced, with only 7% of the tweets containing ADE mentions. We tackle this challenge using sampling and per-class penalties in the objective function.

Preprocessing
We performed the following preprocessing on the dataset: we removed all user mentions and web/URL links from each tweet, while retaining emojis and hashtags (the effect of these choices is analyzed in the Discussion section).
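As a rough sketch, the mention and URL removal could be implemented as below; the function name and regular expressions are our own illustration, not the paper's exact code:

```python
import re

def preprocess_tweet(text):
    """Remove user mentions and web/URL links; emojis and hashtags
    are deliberately retained (sketch, not the authors' code)."""
    text = re.sub(r"@\w+", "", text)                    # strip user mentions
    text = re.sub(r"https?://\S+|www\.\S+", "", text)   # strip web links
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace
```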

Method
We explore three BERT-based models for classification: (i) BERT (Devlin et al., 2019), (ii) RoBERTa (Liu et al., 2019), and (iii) BERTweet (Nguyen et al., 2020). We pass the input through our BERT-based models to obtain token representations. To compute the sentence representation, we consider two cases: (i) the [CLS] token (fine-tuning); (ii) we pass the token representations, excluding [CLS] and [SEP], through a single-layer BiLSTM and concatenate the forward and backward context. The sentence representation is passed through a fully connected layer followed by a sigmoid activation to predict probabilities.
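A minimal PyTorch sketch of the second (BiLSTM) head, assuming a BERT hidden size of 768; the class name and BiLSTM width are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ADEClassifier(nn.Module):
    """BiLSTM + sigmoid head over BERT token representations (sketch).

    `token_reprs` is assumed to already exclude [CLS] and [SEP];
    the hidden sizes are illustrative."""

    def __init__(self, bert_hidden=768, lstm_hidden=256):
        super().__init__()
        self.bilstm = nn.LSTM(bert_hidden, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * lstm_hidden, 1)

    def forward(self, token_reprs):
        # h_n has shape (2, batch, lstm_hidden): final forward/backward states
        _, (h_n, _) = self.bilstm(token_reprs)
        sent = torch.cat([h_n[0], h_n[1]], dim=-1)       # forward ++ backward
        return torch.sigmoid(self.fc(sent)).squeeze(-1)  # ADE probability
```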
To tackle class imbalance, we experiment with oversampling, undersampling, and the addition of per-class penalties in the objective function. For the oversampling approach, we randomly sample positive examples with replacement until each class contains 10,000 tweets. For the undersampling approach, we randomly sample negative examples to create a balanced training dataset.
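The oversampling step can be sketched as follows; the function name and seed are our own, and the paper's target of 10,000 tweets per class appears here as a parameter:

```python
import random

def oversample(examples, labels, target_per_class, seed=0):
    """Randomly duplicate examples with replacement until each class
    reaches `target_per_class` items (illustrative sketch)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    out = []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(max(0, target_per_class - len(xs)))]
        out.extend((x, y) for x in xs + extra)
    return out
```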

Experiments
For the classification task, each BERT model is trained for 10 epochs with a learning rate of 1 × 10^-5 using the Adam optimizer (Kingma and Ba, 2017). We set the batch size to 32 and the maximum sequence length to 128. To tackle class imbalance, we add weights to the standard cross-entropy loss, setting them to 0.7 and 0.3 for the ADE and NoADE classes, respectively. We use the PyTorch implementation of BERT for training. We train RoBERTa-over, RoBERTa-under, and BERT-base-unweighted using the standard unweighted cross-entropy loss. We perform model selection every 200 steps against the validation set, using the F1-score of the ADE class for comparison.
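The weighted loss amounts to scaling the two cross-entropy terms by the per-class weights 0.7 and 0.3; a minimal single-example sketch (the function name is ours):

```python
import math

def weighted_bce(p, y, w_pos=0.7, w_neg=0.3, eps=1e-12):
    """Per-class-weighted binary cross-entropy for one prediction p in (0, 1)
    and gold label y in {0, 1}; w_pos/w_neg mirror the 0.7/0.3 weights
    used for the ADE/NoADE classes (sketch)."""
    return -(w_pos * y * math.log(p + eps)
             + w_neg * (1 - y) * math.log(1 - p + eps))
```

With these weights, misclassifying an ADE tweet is penalized more heavily than misclassifying a NoADE tweet of equal confidence, which counteracts the 7% class prior.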

Discussion
It is evident from Table 1 that BERT-base outperforms BERT-base-Fine-Tune, validating that a BiLSTM layer on top of BERT improves both precision and recall. Table 1 also shows that the use of per-class penalties in the objective function (BERT-base) yields better performance than the model with an unweighted objective function (BERT-base-unweighted). Table 2 shows that retaining emojis and hashtags in tweets achieves better performance with BERT-base than excluding them. Table 1 shows that RoBERTa outperformed BERT-base on all evaluation metrics; however, RoBERTa-over and RoBERTa-under gave results only comparable to BERT-base. These results indicate that oversampling the ADE class and undersampling the NoADE class did not handle the class imbalance problem well; hence, we resort to adding class weights to our objective function.
BERTweet outperforms BERTweet-raw, which uses the preprocessing techniques described by Nguyen et al. (2020). Our preprocessing steps are inspired by Nguyen et al. (2020), with the only difference being that we remove all user mentions and web/URL links from the tweet. We empirically validate our intuition that user mentions and web links act as noise and do not provide valuable information for the classification task. Table 3 shows the performance of BERTweet on the test set. Our model's performance is relatively poor on the test set compared to the validation set, which can be attributed to overfitting. This overfitting can be reduced by adding dropout to the model; Table 3 also shows the performance of BERTweet on the test set in the post-evaluation phase, after the addition of dropout to the BiLSTM layers.
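One way to realize the dropout addition described above is to apply it to the BiLSTM sentence representation before the output layer; the dropout rate and dimensions below are assumptions for illustration, not values reported in the paper:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: regularize the BiLSTM output with dropout.
dropout = nn.Dropout(p=0.2)    # rate is an assumed value
fc = nn.Linear(512, 1)         # 512 = 2 x assumed BiLSTM hidden size

sent = torch.randn(4, 512)                 # concatenated fwd/bwd context
logits = fc(dropout(sent))                 # dropout active only in train mode
probs = torch.sigmoid(logits).squeeze(-1)  # ADE probabilities
```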

Conclusion
In this work, we explore an application of BERT to the task of binary classification of English tweets. We validate that the use of per-class penalties in the objective function helps overcome the class imbalance problem. We empirically evaluated differently tuned model versions and preprocessing methods using the F1-score for the ADE class. Experiments show that our model achieves an F1-score of 0.46, a precision of 0.523, and a recall of 0.409 on the test set. Future directions include evaluating the potential of auxiliary resources in our model, such as Part-of-Speech tags, dependency tags, and knowledge from medical dictionaries (such as MedDRA).