A Joint Training Approach to Tweet Classification and Adverse Effect Extraction and Normalization for SMM4H 2021

In this work we describe our submissions to the Social Media Mining for Health (SMM4H) 2021 Shared Task. We investigated the effectiveness of a joint training approach to Task 1, specifically the classification, extraction, and normalization of Adverse Drug Effect (ADE) mentions in English tweets. Our approach performed well on the normalization task, achieving an above average f1 score of 24%, but less so on classification and extraction, with f1 scores of 22% and 37% respectively. Our experiments also showed that a larger dataset with more negative examples led to stronger performance than a smaller, more balanced dataset, even when both datasets contained the same positive examples. Finally, we also submitted a tuned BERT model for Task 6: Classification of Covid-19 tweets containing symptoms, which achieved an above average f1 score of 96%.


Introduction
Social media platforms such as Twitter are regarded as potentially valuable tools for monitoring public health, including identifying ADEs to aid pharmacovigilance efforts. They do, however, pose a challenge due to the relative scarcity of relevant tweets, and their more fluid use of language further complicates identifying and classifying specific instances of health-related issues.
In this year's task, as in previous SMM4H runs (Klein et al., 2020), a distinction is made between classification, extraction, and normalization. This is atypical of NER systems: many other NER datasets present these subtasks together, and are consequently solved with a joint approach. Gattepaille (2020) showed that simply tuning a base BERT (Devlin et al., 2019) model could achieve strong results, even beating methods that rely on transformers pretrained on more academic texts, such as SciBERT (Beltagy et al., 2019) and BioBERT (Lee et al., 2020), or ensembles of them, while approaching the performance of BERT models specifically pretrained on noisy health-related comments.

Pre-processing
Despite the noisy nature of Twitter data, for Task 1 we attempted to keep any pre-processing to a minimum. This was motivated by the presence of spans within usernames and hashtags, in addition to overlapping spans and spans that included preceding or trailing whitespace. For training and validation data we ignored overlapping and nested spans and chose the longest span as the training/tuning example.
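As an illustration, here is a minimal sketch of the longest-span selection described above, assuming spans are given as (start, end) character offsets; the function name and example values are hypothetical:

```python
# Minimal sketch: keep only the longest span from each group of
# overlapping or nested spans, as described in the text.

def select_longest_spans(spans):
    kept = []
    # Consider spans longest-first so longer spans win any conflict.
    for start, end in sorted(spans, key=lambda s: s[1] - s[0], reverse=True):
        # Keep the span only if it does not overlap any span already kept.
        if all(end <= k_start or start >= k_end for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)

# Example: the nested span (5, 10) is dropped in favour of (5, 20).
print(select_longest_spans([(5, 10), (5, 20), (30, 35)]))  # [(5, 20), (30, 35)]
```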
We also compiled a list of characters used in the training data for use in creating character embeddings. This was not limited to alpha-numeric characters, but also included emojis, punctuation, and non-Latin characters. We then removed any character appearing fewer than 20 times in the training set, and a special UNK character embedding was added. Additionally, for the training, validation, and testing data we tokenized the tweets and obtained part-of-speech tags using the default English model for the Stanza (Qi et al., 2020) pipeline.
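The vocabulary construction and Stanza tagging might look as follows; `train_tweets` and the example tweets are hypothetical stand-ins, and the 20-occurrence threshold follows the text:

```python
from collections import Counter

import stanza

# Hypothetical stand-in for the raw training tweets.
train_tweets = ["this med gave me a headache :(", "feeling fine today!"]

# Character vocabulary: count every character (emojis, punctuation, and
# non-Latin characters included), drop those appearing fewer than 20 times,
# and reserve a special UNK slot for unseen/rare characters.
char_counts = Counter(ch for tweet in train_tweets for ch in tweet)
char_vocab = {"<UNK>": 0}
for ch, count in char_counts.items():
    if count >= 20:
        char_vocab[ch] = len(char_vocab)

# Tokenization and UPOS tagging with the default English Stanza model.
# stanza.download("en")  # first run only
nlp = stanza.Pipeline(lang="en", processors="tokenize,pos")
doc = nlp(train_tweets[0])
tokens = [word.text for sent in doc.sentences for word in sent.words]
upos = [word.upos for sent in doc.sentences for word in sent.words]
```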
Our training set was supplemented with the CSIRO Adverse Drug Event Corpus (CADEC) (Karimi et al., 2015), which was processed in the same manner as above.
For Task 6 no pre-processing was done.

Task 1 Model
Word Representation
The BERT vectors produced for each tweet are not necessarily aligned with the tokens produced by the Stanza tokenizer. For this reason we additionally compile a sub-word token map to construct word embeddings from the token embeddings produced by our BERT model (excluding the [CLS] vector). The final word embedding is the summation of the component vectors.
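A sketch of this sub-word summation, here using the word map kept by Huggingface fast tokenizers rather than a hand-built one; the `bert-base-uncased` checkpoint and the example words are assumptions:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

# Hypothetical Stanza tokens for one tweet.
words = ["this", "med", "gave", "me", "a", "headache"]

# Encode the pre-tokenized words; the fast tokenizer records which word
# each sub-word piece came from.
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    subword_vecs = bert(**enc).last_hidden_state[0]  # (num_subwords, 768)

# Sum the component sub-word vectors for each word, skipping special
# tokens such as [CLS]/[SEP] (their word_id is None).
word_vecs = []
for word_idx in range(len(words)):
    piece_ids = [i for i, w in enumerate(enc.word_ids()) if w == word_idx]
    word_vecs.append(subword_vecs[piece_ids].sum(dim=0))
word_vecs = torch.stack(word_vecs)  # (num_words, 768)
```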

POS Tags & Char-LSTM
We use randomly initialized, trainable embeddings for the universal POS (UPOS) tags predicted by the Stanza POS tagger.
For each word we also use a 1-layer LSTM to produce an additional representation. The input to this LSTM is the embedding of each character in the word, in order of appearance. This is intended both to capture recurring patterns indicating prefixes/suffixes and to learn to disregard repeated letters and misspellings, so as to overcome the noisiness of the data.
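A minimal sketch of the UPOS embeddings and the character-level LSTM; all dimensions and vocabulary sizes here are assumptions, not values reported in the paper:

```python
import torch
import torch.nn as nn

# Trainable embeddings for the 17 UPOS tags, plus one slot for UNK
# (embedding size is an assumption).
upos_emb = nn.Embedding(num_embeddings=18, embedding_dim=16)

class CharLSTM(nn.Module):
    """Character-level word representation: a 1-layer LSTM over the word's
    characters, taking the final hidden state."""

    def __init__(self, n_chars, char_dim=32, hidden_dim=64):
        super().__init__()
        self.emb = nn.Embedding(n_chars, char_dim)
        self.lstm = nn.LSTM(char_dim, hidden_dim, num_layers=1, batch_first=True)

    def forward(self, char_ids):
        # char_ids: (batch, word_length), characters in order of appearance.
        _, (h_n, _) = self.lstm(self.emb(char_ids))
        return h_n[-1]  # (batch, hidden_dim)

char_lstm = CharLSTM(n_chars=200)
rep = char_lstm(torch.tensor([[5, 17, 17, 17, 3]]))  # e.g. "nooo!" as char ids
```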
Bi-LSTM Hidden Layer
While BERT is itself a bi-directional, context-aware representation of a given sentence, we experimented with the addition of a bidirectional LSTM (Bi-LSTM) layer in order to incorporate the additional POS tag and char-LSTM embeddings and model the interactions between them across the whole context of a tweet.
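One way to sketch this layer, assuming the three per-word representations are concatenated before the Bi-LSTM (the dimensions follow the assumptions in the earlier sketches):

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Bi-LSTM over per-word vectors built by concatenating the BERT word
    embedding with the UPOS and char-LSTM representations (768 + 16 + 64
    input size is an assumption, not a reported value)."""

    def __init__(self, input_dim=768 + 16 + 64, hidden_dim=256):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim, num_layers=1,
                              batch_first=True, bidirectional=True)

    def forward(self, bert_vecs, upos_vecs, char_vecs):
        # Each input: (batch, num_words, dim); concatenate along features.
        words = torch.cat([bert_vecs, upos_vecs, char_vecs], dim=-1)
        hidden, _ = self.bilstm(words)
        return hidden  # (batch, num_words, 2 * hidden_dim)
```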

Task 6 Model
Task 6 proved to be a substantially easier challenge than Subtask 1(a), as can be seen in Subsection 3.2.
Our approach was to simply tune a BERT model, with the [CLS] vector being used as input to a softmax classification layer.
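A minimal sketch of this setup; the exact checkpoint and head are assumptions:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class TweetClassifier(nn.Module):
    """Tuned BERT classifier: the [CLS] vector feeds a softmax layer
    (a sketch; the submission's exact head is not specified)."""

    def __init__(self, n_classes=2):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = out.last_hidden_state[:, 0]  # [CLS] token vector
        return torch.log_softmax(self.classifier(cls_vec), dim=-1)
```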

Experiments & Results
We implemented our models using the PyTorch (Paszke et al., 2019) framework, and for the core BERT model we used the pretrained bert-base model from the Huggingface transformers (Wolf et al., 2020) library. For both tasks we optimize parameters using Adam (Kingma and Ba, 2014). We experiment with different learning rates but keep the default parameters for β₁, β₂, and ε.
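In PyTorch this configuration amounts to the following; the stand-in model is illustrative, and 2 × 10⁻⁵ is one of the learning rates explored later:

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 2)  # stand-in for any of the modules sketched above

# Adam with a tuned learning rate; beta_1, beta_2, and epsilon keep the
# library defaults, matching the setup described in the text.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=2e-5,               # one of the explored learning rates
    betas=(0.9, 0.999),    # default β₁, β₂
    eps=1e-8,              # default ε
)
```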

Task 1
One of the largest challenges of Task 1 is the huge imbalance of tweets containing ADEs vs. tweets that do not. This is demonstrated in Table 3, where just over 7% of tweets in both the training and validation sets contain ADEs. In contrast, the CADEC dataset has ≈ 37% of examples with ADEs. To explore the effect of this distribution we constructed two training sets. The first contains all of the CADEC data in addition to the training-data tweets containing ADEs, resulting in a dataset with ≈ 46% of examples with ADEs, which we will refer to as the Partial dataset going forward. The second dataset we use for training is all of the task training data plus the whole CADEC dataset, which we will refer to as the Full dataset, with the proportion of ADE examples being ≈ 16%.
We train the model jointly over all three subtasks, minimizing the sum of negative log likelihood losses, $L_{SUM} = L_{DET} + L_{NER}$, for the classification layer, $L_{DET} = -\sum_{i}^{N}\sum_{c}^{C_{DET}} y_{ic}\log(\hat{y}_{ic})$, and the extraction & normalization layer, $L_{NER} = -\sum_{i}^{N}\sum_{c}^{C_{NER}} y_{ic}\log(\hat{y}_{ic})$, where $N$ is the total number of minibatches, $C_{DET}$ and $C_{NER}$ are the classes for classification and extraction & normalization respectively, and $y_{*}$ and $\hat{y}_{*}$ are the target and predicted classes.
Our experiments on the Partial dataset yielded weak results, with only a slight improvement when using a learning rate of 1 × 10⁻⁴ over 2 × 10⁻⁵. Training on the Full dataset with a learning rate of 2 × 10⁻⁵ produced far stronger results on the validation set, with the f1 score increasing from 14.9% to 70.1% for tweet classification, from 10.5% to 26.9% for span extraction, and from 19.1% to 50.4% for span normalization. Training our model with a learning rate of 1 × 10⁻⁴ yielded unusable results and an unstable model, which suggests that this learning rate is too high for larger datasets. It is interesting to note that while training on the Full dataset dramatically improved f1 scores for all three subtasks, there was a general drop in recall and an increase in precision. This suggests that the model trained on the Partial dataset was far more likely to produce false positives, and was unable to recognize the absence of ADEs despite negative examples constituting ≈ 53% of that dataset. The results of our experiments are summarized in Table 2.
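For concreteness, the joint objective described above can be sketched as follows, assuming both output heads produce log-probabilities (the function and shapes are illustrative):

```python
import torch.nn.functional as F

def joint_loss(det_log_probs, det_targets, ner_log_probs, ner_targets):
    """Joint objective: the classification and extraction/normalization
    heads each contribute a negative log likelihood term, and the model
    is trained on their sum (L_SUM = L_DET + L_NER)."""
    # det_log_probs: (batch, C_DET); ner_log_probs: (batch * words, C_NER)
    loss_det = F.nll_loss(det_log_probs, det_targets)
    loss_ner = F.nll_loss(ner_log_probs, ner_targets)
    return loss_det + loss_ner
```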
Our final submission was trained on the Full dataset and showed a similar pattern on the test set, producing better precision, beating the arithmetic mean of all submissions for extraction and normalization, but worse recall for all three subtasks. This resulted in the model only achieving an above average f1 score on Subtask 1(c).

Task 6
Our approach to Task 6 is essentially the same as that for Subtask 1(a), but with a smaller, more balanced dataset. We experiment with two learning rates, 1 × 10⁻⁵ and 2 × 10⁻⁵, and minimize a negative log likelihood loss $L = -\sum_{i}^{N}\sum_{c}^{C} y_{ic}\log(\hat{y}_{ic})$. The resulting models produced strong results, as shown in Table 4, with close validation f1 scores (98.6% and 98.3%). We submitted classifications from both models, and both beat the median of all submissions with an f1 score of 94%.
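A single training step under these assumptions might look like the following; all arguments are hypothetical placeholders, with the classifier returning log-probabilities as in the earlier sketch:

```python
import torch.nn.functional as F

def train_step(model, optimizer, input_ids, attention_mask, labels):
    """One Task 6 training step: since the classifier outputs
    log-probabilities, the loss is a plain negative log likelihood."""
    log_probs = model(input_ids, attention_mask)  # (batch, 2)
    loss = F.nll_loss(log_probs, labels)          # L = -Σ_i Σ_c y_ic log(ŷ_ic)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```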

Conclusion
In this work we explored the efficacy of training a BERT model to jointly learn classification, extraction, and normalization of ADEs in the tweets provided for Task 1 of the SMM4H 2021 Shared Task. While this approach did not produce above-median classification or extraction scores, it did achieve an above-median normalization score. Additionally, our experiments show that the seemingly lopsided ratio of tweets with/without ADEs resulted in stronger performance than a more "balanced" dataset. Finally, we showed that tuning a BERT model produces very strong results on Task 6, classifying tweets related to Covid-19.