Transformer-based Multi-Task Learning for Adverse Effect Mention Analysis in Tweets

This paper presents our contribution to the Social Media Mining for Health Applications Shared Task 2021. We addressed all three subtasks of Task 1: Subtask A (classification of tweets containing adverse effects), Subtask B (extraction of text spans containing adverse effects), and Subtask C (adverse effect resolution). We explored various pre-trained Transformer-based language models and focused on a multi-task training architecture. For the first subtask, we also applied adversarial augmentation techniques and formed model ensembles to improve the robustness of the predictions. Our system ranked first on Subtask B with a 0.51 F1-score, 0.514 precision, and 0.514 recall. On Subtask A we obtained a 0.44 F1-score, 0.49 precision, and 0.39 recall, while on Subtask C we obtained a 0.16 F1-score, with 0.16 precision and 0.17 recall.


Introduction
Information extraction from social media is widely studied nowadays, as platforms like Facebook, Twitter, Instagram, or Reddit have become the main place for people to share their opinions and experiences. Concurrently, a wide range of applications on completely different topics arises with the current advances in Natural Language Processing (NLP), as the volume of posted information has become impossible to analyse manually. For example, the Social Media Mining for Health (SMM4H) Applications Shared Task (Sarker and Gonzalez-Hernandez, 2017) is focused on health applications and introduces a dataset of annotated tweets with the aim of analysing adverse drug effects (ADE) mentioned by users. This year's edition (Magge et al., 2021) proposed eight different tasks, out of which we focused on the first task, entitled Classification, Extraction and Normalization of Adverse Effect mentions in English tweets. This task was further divided into three subtasks, as follows. Subtask 1a was a binary classification task of tweets, focused on identifying whether a message contains an ADE or not. Subtask 1b was a named entity recognition task on top of subtask 1a, centered on extracting the span of text containing the ADE of the medication. Subtask 1c was a named entity resolution task on top of both previous subtasks, aimed at predicting the normalized concept of the extracted adverse effect from the preferred terms included in the Medical Dictionary for Regulatory Activities (MedDRA).
All three subtasks are addressed simultaneously using a Multi-Task Learning (MTL) architecture (Caruana, 1997) that leverages acquired knowledge from one subtask to another. Furthermore, we approached the challenge of unbalanced classes in the first subtask by considering class weights and by augmenting the training data set.
The paper is structured as follows. The second section describes previous work that inspired our solution, while the third section presents our employed method. The fourth section presents the results of our work, followed by discussions and the final section that summarizes our findings and presents future research paths.

Health-Related Applications
Given that Task 1 was present in previous editions of the SMM4H shared task (Weissenbacher et al., 2018, 2019; Klein et al., 2020), several approaches were employed to address its challenges. For example, the winning team from 2019 (Miftahutdinov et al., 2019) used an ensemble of BioBERT-CRF models for the ADE extraction task, while addressing the resolution task as a classification. The system that ranked first in the end-to-end 2020 competition used the pretrained EnDR-BERT model and the CSIRO Adverse Drug Event Corpus (CADEC) (Karimi et al., 2015) for further training the model. In addition, Dima et al. (2020) showed that bidirectional Transformers trained using class weighting, together with ensembles that combine various configurations, achieve an F1-score of .705 on the dataset made available for that edition of the competition.

MTL-Based Methods
Multi-Task Learning represents a training strategy where a shared model simultaneously learns multiple tasks. Ruder (2017) analysed the techniques applied in MTL and compared the hard parameter sharing and soft parameter sharing paradigms, concluding that the former is still pervasive in current approaches. MTL has proved to speed up convergence and to improve model performance in a variety of NLP applications, including named entity recognition (Aguilar et al., 2018), fake news detection (Wu et al., 2019), multilingual offensive language identification (Chen et al., 2020b), sentiment analysis, humor classification, recommender systems (Tang et al., 2020), and even question answering (Kongyoung et al., 2020). MTL also increases performance in conjunction with semi-supervised learning (Liu et al., 2007), curriculum learning (Dong et al., 2017), sequence-to-sequence learning (Zaremoodi and Haffari, 2018), reinforcement learning (Gupta et al., 2020), and adversarial learning (Liu et al., 2017).

Corpus
The SMM4H 2021 Task 1 dataset included 17,385 training samples, out of which 1,235 (7.10%) belong to the positive class (i.e., contain an ADE), as well as 915 samples in the development set, out of which 65 are labeled as positive; the heavily unbalanced distribution of the two classes therefore poses a challenge.
Subtask 1c required labeling the extracted text span with the corresponding MedDRA term; the number of possible labels exceeds 23,000. Only 476 labels are present in the training set, meaning that most labels are not covered at all. Additionally, the number of appearances of each ID has a long-tail distribution (see Figure 1), with some IDs being present in more than 60 examples and most IDs occurring in fewer than four examples.

Multi-Task Learning Neural Architecture
An MTL architecture based on hard parameter sharing (Ruder, 2017) was employed for Task 1 (see Figure 2). Given that all three subtasks are highly related, our assumption was that knowledge acquired while learning one subtask would help increase performance on the other two. Three modules are added on top of BioBERT (Lee et al., 2020): (a) the Classifier, a binary classifier for tweet classification in subtask 1a; (b) the Extractor, a named entity recognition layer for ADE span extraction in subtask 1b; and (c) the Normalizer, a multi-class classifier for span resolution in subtask 1c. All three modules share the same pre-trained BERT encoder; the first 11 layers out of 12 were frozen, whereas the last layer was kept as a shared trainable encoder.
The training dataset was processed in the following manner. The positive tweets from the training set were selected for subtask 1b, and each token was tagged with either "O" (outside adverse effect) or "AE" (adverse effect entity). Two approaches were considered for subtask 1c: (a) create a dataset using the spans labeled with their corresponding PTID, and (b) concatenate the span tokens with the corresponding tweets as: [CLS] <ADE span> [SEP] <entire tweet> [SEP]. The second approach aimed to leverage context information in the MedDRA ID prediction.
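A minimal sketch of this preprocessing: token-level "O"/"AE" tagging for subtask 1b and the span-tweet concatenation for subtask 1c. The whitespace tokenization and character-offset matching below are simplifying assumptions; the actual system aligns BERT subword tokens with the annotated span.

```python
def tag_tokens(tweet: str, ade_span: str):
    """Tag each whitespace token as 'AE' if it falls inside the
    annotated adverse-effect span, otherwise 'O' (simplified:
    the real system aligns subword tokens with character offsets)."""
    start = tweet.lower().find(ade_span.lower())
    end = start + len(ade_span)
    tags, pos = [], 0
    for token in tweet.split():
        tok_start = tweet.index(token, pos)
        pos = tok_start + len(token)
        inside = start != -1 and tok_start >= start and pos <= end
        tags.append((token, "AE" if inside else "O"))
    return tags

def normalizer_input(ade_span: str, tweet: str) -> str:
    """Second input format for subtask 1c: ADE span plus full tweet."""
    return f"[CLS] {ade_span} [SEP] {tweet} [SEP]"

pairs = tag_tokens("this drug makes me sleepy", "sleepy")
```

For the toy tweet above, only the final token receives the "AE" tag, and the subtask 1c input becomes `[CLS] sleepy [SEP] this drug makes me sleepy [SEP]`.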
The modules for the three subtasks were trained in parallel. At each training step, a mini-batch was randomly drawn from one of the training datasets, with a probability proportional to the dataset size (see Equation 1):

p(b_i) = |D_i| / (|D_1| + |D_2| + |D_3|)    (1)

where D_i represents the dataset for subtask i, and b_i represents a mini-batch from D_i.
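This sampling step can be sketched as follows, assuming the selection probability is proportional to dataset size; the datasets are plain lists standing in for the real mini-batch iterators:

```python
import random

def choose_subtask(datasets, rng=random):
    """Pick a subtask index with probability proportional to
    its dataset size, as done at each MTL training step."""
    sizes = [len(d) for d in datasets]
    total = sum(sizes)
    r = rng.random() * total
    cumulative = 0
    for i, size in enumerate(sizes):
        cumulative += size
        if r < cumulative:
            return i
    return len(sizes) - 1

# Example: subtask 1a dominates (17,385 vs. 1,235 samples),
# so it is sampled most often over many steps.
datasets = [list(range(17385)), list(range(1235)), list(range(1235))]
counts = [0, 0, 0]
rng = random.Random(0)
for _ in range(10000):
    counts[choose_subtask(datasets, rng)] += 1
```

Over 10,000 simulated steps, roughly 87% of the batches come from the subtask 1a dataset, mirroring its share of the total training data.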
All three subtasks minimize the cross-entropy loss (see Equation 2), whereas only subtask 1a considers values different from one for the class weights w_j:

L_i = - Σ_j w_j · y_j · log(ŷ_j)    (2)

The final loss minimized at each step k of Algorithm 1 is expressed in Equation 3:

L_k = Σ_i t_k^i · L_i    (3)

where L_i is the loss of subtask i, and t_k^i = 1 when task i is in training, or t_k^i = 0 otherwise.
Algorithm 2 describes the processing pipeline, which begins by passing the input tweet through the Classifier. If a tweet is labeled as not containing an adverse effect, the label is memorized and the flow stops. Otherwise, the tweet is passed to the Extractor and, afterwards, to the Normalizer. Line 10 highlights that the input tweet can also be used besides the text span that contains the ADE, in order to leverage the context information when predicting the MedDRA ID; this feature is optional and can be disabled.

Language Models: We experimented with BERT-base (Devlin et al., 2019) and with domain-specific Transformers, namely BioBERT and BioClinicalBERT (Alsentzer et al., 2019). After a preliminary fine-tuning on subtask 1a, the most promising results were obtained by BioBERT.
Given the limited resources available, we kept it as the default pre-trained solution in all further experiments.
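The inference pipeline of Algorithm 2 can be sketched as follows. The three model calls are stand-in stubs (the real modules are BioBERT heads), the PTID value is hypothetical, and the optional use of the full tweet by the Normalizer corresponds to Line 10 of the algorithm:

```python
def predict(tweet, classifier, extractor, normalizer, use_tweet_context=False):
    """Algorithm 2 sketch: stop early on negative tweets; otherwise
    extract the ADE span and normalize it to a MedDRA PTID."""
    if not classifier(tweet):
        return {"has_ade": False, "span": None, "ptid": None}
    span = extractor(tweet)
    # Optionally concatenate span and tweet (Line 10 of Algorithm 2).
    norm_input = f"[CLS] {span} [SEP] {tweet} [SEP]" if use_tweet_context else span
    return {"has_ade": True, "span": span, "ptid": normalizer(norm_input)}

# Toy stubs standing in for the trained modules; the PTID is a
# hypothetical placeholder, not a verified MedDRA code.
classifier = lambda t: "sleepy" in t
extractor = lambda t: "sleepy"
normalizer = lambda s: "PTID_SLEEPINESS"

out = predict("this drug makes me sleepy", classifier, extractor, normalizer)
```

A negative tweet short-circuits after the Classifier, so the Extractor and Normalizer are never invoked for it.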
Hyperparameters: All three modules (Classifier, Extractor, and Normalizer) were trained with a learning rate of 5e-5. Batch sizes of 64 were used for subtask 1a, which had the most entries, while batch sizes of 16 were considered for subtasks 1b and 1c, in which only positive samples are considered. Training was performed for 30 epochs, computing the performance on the validation set after each epoch and saving the system that performed best.
Class Weights: The class imbalance problem from subtask 1a is addressed using the weighted version of the cross-entropy loss. The weights of the two classes were computed using the balanced heuristic (King and Zeng, 2001) from the scikit-learn library (Pedregosa et al., 2011).
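The balanced heuristic sets each class weight to n_samples / (n_classes · count_j); a pure-Python sketch using the Task 1 training counts from above (17,385 tweets, 1,235 positive), mirroring scikit-learn's `class_weight="balanced"` behaviour:

```python
from collections import Counter

def balanced_weights(labels):
    """scikit-learn-style 'balanced' class weights:
    w_j = n_samples / (n_classes * count_j)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# 1,235 positive (ADE) vs. 16,150 negative training tweets.
labels = [1] * 1235 + [0] * (17385 - 1235)
weights = balanced_weights(labels)
```

The rare positive class ends up weighted roughly 7x, while the majority class is weighted just over 0.5, so both contribute comparably to the loss.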
Augmented Training Dataset: Another explored solution for the imbalance in subtask 1a consists of augmenting the poorly represented class (the positive class). We leverage the predefined augmentation approaches integrated into the TextAttack library (Morris et al., 2020). New positive examples are generated by character swapping, by replacing words with synonyms from the WordNet thesaurus (Miller, 1995), and by using methods from CheckList testing, i.e., transformations like location replacement or number alteration (Ribeiro et al., 2020). Five positive examples are automatically added for each initial positive sample, thus increasing the proportion of the poorly represented class from 7% to almost 45%.
Class Number Reduction for the Normalizer: We considered subtask 1c a multi-class classification task where the Normalizer module receives as input the text span containing an ADE (i.e., the output of the Extractor module) and classifies the span into one of the classes (i.e., MedDRA PTIDs) present in the training set. The long-tail distribution of the 476 MedDRA IDs led us to reduce the number of classes. As such, the final classifier considers only the 108 most frequent PTIDs (i.e., IDs that appear more than three times in the training dataset), since there were too few examples to properly generalize over all PTIDs; as a trade-off, the module covers only 69.5% of the training samples.
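A sketch of this label-reduction step: keep only PTIDs occurring more than three times and measure what fraction of training spans they still cover. The toy label list is illustrative; on the real data this procedure yields the 108 kept PTIDs covering 69.5% of samples:

```python
from collections import Counter

def reduce_labels(ptids, min_count=4):
    """Keep PTIDs with at least `min_count` occurrences and report
    the fraction of training samples those labels cover."""
    counts = Counter(ptids)
    kept = {p for p, c in counts.items() if c >= min_count}
    coverage = sum(1 for p in ptids if p in kept) / len(ptids)
    return kept, coverage

# Toy long-tail distribution: two frequent labels, two rare ones.
ptids = ["A"] * 10 + ["B"] * 4 + ["C"] * 3 + ["D"] * 1
kept, coverage = reduce_labels(ptids)
```

Here only "A" and "B" survive the cutoff, covering 14 of the 18 samples; rare labels like "C" and "D" become unclassifiable, which is exactly the coverage trade-off described above.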

Results
Four configurations were compared in terms of performance. The first configuration (MTL) is a baseline relying on the previously described MTL architecture. Weighted binary cross-entropy loss and feedback from the Extractor to the Normalizer (Line 12 from Algorithm 2) are enabled, but the Normalizer uses only the ADE span, without the entire tweet (Line 10 from Algorithm 2).
The second configuration (MTL + BoostingEnsemble) starts from MTL but, instead of the single Classifier model, it uses an ensemble of three models trained in a boosting manner. The first classifier (Classifier1) is identical to the classifier from the first configuration. The second classifier (Classifier2) was trained on a modified training set in which the examples misclassified by Classifier1 are over-sampled by a factor of three, whereas the correctly classified examples are down-sampled by the same factor. The third classifier, Classifier3, is also trained on a modified training set, in which examples receiving different results from Classifier1 and Classifier2 are over-sampled by a factor of three, while the rest of the examples are down-sampled by the same factor.
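The boosting-style resampling used for Classifier2 can be sketched as follows; treating the down-sampling as a random one-in-three draw is an assumption about the exact procedure:

```python
import random

def resample(examples, misclassified, factor=3, rng=random):
    """Repeat misclassified examples `factor` times; keep each
    correctly classified example with probability 1/factor."""
    out = []
    for ex in examples:
        if ex in misclassified:
            out.extend([ex] * factor)
        elif rng.random() < 1.0 / factor:
            out.append(ex)
    return out

# 100 toy examples; suppose Classifier1 misclassified the first 10.
examples = list(range(100))
missed = set(range(10))
rng = random.Random(0)
new_train = resample(examples, missed, rng=rng)
```

The resampled set triples every hard example while shrinking the easy majority to roughly a third, so Classifier2 concentrates on Classifier1's mistakes. The same scheme, applied to the disagreements between Classifier1 and Classifier2, yields the training set for Classifier3.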
The third configuration, denoted MTL + EnhancedEnsemble, further tries to improve the performance of the second configuration by adding two more classifiers to the ensemble, Classifier4 and Classifier5, this time trained on the augmented training set while considering equivalent over- and down-sampling approaches.
The fourth configuration, namely MTL + EnhancedNormalizer, is similar to the first configuration, but with the Normalizer trained on both the ADE span and the entire tweet.

Table 1 introduces the comparative results for all configurations. On the development dataset, MTL + EnhancedEnsemble obtains the best performance for subtasks 1a and 1b, while MTL + EnhancedNormalizer has the highest F1-score for subtask 1c. On the test dataset, MTL has the highest F1-score for subtask 1a; although the other configurations gain a boost in precision, recall is negatively influenced. MTL + BoostingEnsemble has the best performance on subtask 1b, whereas MTL + EnhancedNormalizer remains the best configuration for subtask 1c. Although MTL + EnhancedEnsemble obtains better development results when integrating the augmented dataset, these improvements do not carry over to the test dataset.

Discussions

Table 2 introduces classification problems that provide additional insight into how our MTL + BoostingEnsemble model works. Overall, it correctly extracts and classifies most text spans containing common words for adverse effects (e.g., "sick"), but it has occasional difficulties in distinguishing between the desired effect of a medication and its adverse effects. For instance, in the first example from Table 2, our model does not make the association that the described medication is supposed to help the subject sleep; on the contrary, it assumes sleepiness to be an adverse effect.
Another limitation of our method is highlighted in the second example. The MedDRA term of Slurred speech is a rather rare label, not even present in the training set. Even though our system correctly extracts the span containing the adverse effect, it is unable to correctly predict the PTID.
The false positive example of "drunk" labeled as Drunk like effect shows that our model finds it hard to discern appearances from facts. A similar bias can be observed in the third example, where the model fails to extract the spans "sleep" and "stomach is a cement mixer", most likely because it learned that questions ask about adverse effects rather than offer information about them.
The fourth example denotes subtle errors, like grasping the difference between the MedDRA terms of Sleepiness and Somnolence, which are likely to be mislabeled even by humans.
While considering the differences between development and test set performances, another limitation emerges, namely that our configurations did not generalize as expected on the test set for subtask 1a. This can be explained by the reduced development set, which contains only 5% of the provided labeled examples, coupled with our training procedure of always saving the model at its best validation score.

Conclusions and Future Work
We introduced a Transformer-based Multi-Task Learning architecture employed for Task 1 from the Social Media Mining for Health Applications Shared Task 2021. Task 1 was concerned with the classification of tweets incorporating adverse effects of medication and, for the positive tweets, with the extraction and normalization of the adverse effects. We started from a pretrained domain-specific BERT language model (i.e., BioBERT), which was further fine-tuned in a multi-task setting. A hard parameter sharing MTL model was trained on the three subtasks of SMM4H Task 1. Furthermore, class weights and data augmentation were considered to overcome the problem of the unbalanced dataset from subtask 1a.
Our model achieved the highest score for subtask 1b (i.e., adverse effect span detection) with an F1-score of 51%, suggesting that MTL can enhance adverse effect extraction from social media posts. In terms of future work, adversarial training (Miyato et al., 2018; Chen et al., 2020a) will be considered to improve the robustness of our approach.