MIDAS@SMM4H-2019: Identifying Adverse Drug Reactions and Personal Health Experience Mentions from Twitter

In this paper, we present our approach and system description for the Social Media Mining for Health Applications (SMM4H) Shared Tasks 1, 2 and 4 (2019). Our main contribution is to show the effectiveness of transfer learning approaches such as BERT and ULMFiT, and how well they generalize to classification tasks such as identifying adverse drug reaction mentions and reports of personal health problems in tweets. We show the use of stacked embeddings combined with a BLSTM+CRF tagger for identifying spans mentioning adverse drug reactions in tweets. We also show that these approaches perform well even on an imbalanced dataset, in comparison to undersampling and oversampling.


Introduction
Drugs administered to alleviate common ailments are the fourth leading cause of death in the US, following cancer and heart disease (Giacomini et al., 2007), making adverse drug reactions one of the most important medical problems for society. While heart disease and cancer are commonly reported and studied, adverse reactions to drugs either go unreported or are confused with, or lost within, other narratives. While tackling the first problem is the onus of the government and society as a whole, the second is an overwhelmingly computational task.
With the advent of near-universal internet access and smartphones, the reportage of such incidents is steadily increasing, thanks to a host of social media platforms such as Twitter, Facebook and Instagram. This unique situation presents a challenging as well as rewarding opportunity: to improve our current computational systems for dealing with existing incidents more sensibly, and to increase their reportage through electronic media.
With this motivation, four shared tasks were conducted as part of the Social Media Mining for Health Applications (SMM4H) Workshop 2019 (Weissenbacher et al., 2019). Our team participated in Tasks 1, 2 and 4 of the workshop. The problems for these tasks were:
Problem Definition Sub-task 1: Given a labeled dataset D of tweets, the objective is to learn a classification/prediction function that can predict a label l for a given tweet t, where l ∈ {reporting adverse effects of drugs (ADR): 1, no adverse effects of drugs (non-ADR): 0}.
Examples of tweets mentioning adverse drug reactions: • I feel siiiiiiiiiiiiiiick. Damn you venlafaxine.
• Who need alcohol when you have gabapentin and tramadol that makes you feel drunk at 12oclock.
Problem Definition Sub-task 2: The goal of this sub-task is to first discern ADR tweets from non-ADR ones, and then identify the span of a tweet where an adverse drug effect is reported.
An example of a span from a tweet that mentions an adverse drug reaction: • losing it. could not remember the word power strip. wonder which drug is doing this memory lapse thing. my guess the cymbalta. #helps. Here, not remember is the adverse drug reaction that needs to be identified and extracted from the tweet, most likely caused by the intake of the drug cymbalta.
Problem Definition Sub-task 4: Given a labeled dataset D of tweets, the objective is to learn a classification/prediction function that can predict a label l for a given tweet t, where l ∈ {reporting personal health experience: 1, no mention of personal health experience: 0}.
Examples of tweets reporting personal health experience mentions: • This flu shot got my arm killing me.
• man i am so sick i feel terrible i got all the symptoms of the swine flu i am scared.
Our Contributions: Towards the objectives of the tasks described above, we present the following contributions in this paper:
1. We train ULMFiT and BERT models for Tasks 1 and 4, and show that these models are agnostic to the effects of undersampling and oversampling, given a highly imbalanced dataset.
2. We make an initial attempt at studying the effectiveness of transfer learning using ULMFiT and BERT for problems in the health-care domain pertaining to the shared tasks.
3. We show the use of stacked embeddings combined with a BLSTM+CRF tagger for identifying spans mentioning adverse drug reactions in tweets.
4. We also show the use of pretrained BERT embeddings combined with GloVe embeddings, fed to a BLSTM text classifier, for sub-tasks 1 and 4.

Related Work
In general, self-reporting of drug effects by patients is a highly noisy source of data. However, despite the noise, it captures a lot of information that might not be available in other, cleaner sources of data such as limited clinical trials or a doctor's office (Leaman et al., 2010). Taking cognizance of this, the International Society of Drug Bulletins in 2005 stated, "...patient reporting systems should periodically sample the scattered drug experiences patients reported on the internet...". The resulting branch, which lies at the intersection of information systems and medicine, is pharmacovigilance (Leaman et al., 2010). Detecting and tracking information about certain diseases has been the focus of a substantial body of work (Nakhasi et al., 2012; Paul and Dredze, 2011), for instance cancer investigation (Ofran et al., 2012), flu (Aramaki et al., 2011; Lamb et al., 2013) and depression (De Choudhury et al., 2013; Yazdavar et al., 2017). There has also been recent work in the domain of pharmacovigilance (Mahata et al., 2018a,b,c; Mathur et al., 2018; Sarker et al., 2018).
The body of work most relevant to ours uses transfer learning in the health domain. Data in the health domain is typically hard to obtain and process, so many researchers have resorted to transfer learning to deal with this data paucity. Works using transfer learning generally employ word embeddings to improve the generalization of classifiers to unseen text. In this work we heavily use ULMFiT (Howard and Ruder, 2018) and BERT (Devlin et al., 2018) in our experiments, and make an initial attempt at understanding how transfer learning works in the health domain, applying both models to the different text classification tasks of the Social Media Mining for Health Workshop. Next, we give a brief description of the datasets used in this work for the different tasks.

Dataset
The dataset for the shared tasks was collected from the social networking website Twitter. It consists of mentions of drug effects and other health-related issues.
1. For shared task 1, a total of 25,672 tweets were made available for training, of which 2,374 contain an adverse drug reaction (ADR) mention and the rest (23,298) do not. Only training data was provided by the organizers; for our experiments we segmented the provided dataset into train and validation splits. Figure 1 shows the distribution of data in the training and validation splits. The evaluation metric for this task was the F-score for the ADR class. Due to the pronounced class imbalance, in the various experiments for this sub-task we oversample ADR tweets and undersample non-ADR tweets: for oversampling we duplicate ADR tweets, and for undersampling we randomly select a subset of non-ADR tweets such that the two sets become equal in size. For instance, "feeling a little dizzy from the quetiapine i just popped!" is a positive sample from the dataset, while "don't say no to pills! latuda won't kill!" is a non-ADR tweet. We also try imbalanced proportions ranging from 1:2 to 1:10.
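The random over- and undersampling described above can be sketched as follows. This is a minimal illustration, not the exact code used in our experiments; the helper `balance` and its arguments are hypothetical names.

```python
import random

def balance(adr, non_adr, ratio=1, seed=42, mode="undersample"):
    """Balance ADR and non-ADR tweet lists toward a non-ADR:ADR ratio.

    mode="undersample": randomly keep at most ratio * len(adr) non-ADR tweets.
    mode="oversample":  duplicate ADR tweets until non-ADR/ADR == ratio.
    """
    rng = random.Random(seed)
    if mode == "undersample":
        kept = rng.sample(non_adr, min(len(non_adr), ratio * len(adr)))
        return adr, kept
    # Oversample: repeat ADR tweets (plain copies) to reach the target size.
    target = max(len(adr), len(non_adr) // ratio)
    reps, rem = divmod(target, len(adr))
    boosted = adr * reps + rng.sample(adr, rem)
    return boosted, non_adr
```

With ratio=1 this produces the equal-sized sets described above; passing ratios from 2 to 10 reproduces the intermediate proportions we also tried.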
2. For shared task 2, we received a total of 2,367 tweets, of which 1,212 were positive and 1,155 were negative. In the positive samples, the ADR portion was marked. For instance, the tweet "friends! anybody taken #cipro? (antibiotic) complications?? big side effect is tendon rupture...figured my dr would know better?" is an ADR tweet, and the portion "tendon rupture" is where the author mentions the ADR.
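Treating a marked ADR portion as an entity span (as our tagger later does) requires converting the character-level annotation into token-level BIO tags. A minimal sketch, assuming whitespace tokenization; `bio_tags` is a hypothetical helper, not part of our released code.

```python
def bio_tags(text, spans):
    """Convert character-level (start, end) ADR span annotations into
    token-level BIO tags, using simple whitespace tokenization."""
    tagged, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)   # locate token at/after current offset
        end = start + len(tok)
        pos = end
        label = "O"
        for s, e in spans:
            if start >= s and end <= e:
                # First token of the span gets B-ADR, the rest I-ADR.
                label = "B-ADR" if start == s else "I-ADR"
                break
        tagged.append((tok, label))
    return tagged
```

For the #cipro example above, the tokens "tendon" and "rupture" would be tagged B-ADR and I-ADR respectively, and everything else O.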
3. For shared task 4, we were given a total of 10,876 tweets, of which only 7,388 (67.9%) were still available on Twitter for downloading. In the original data, 3,598 tweets were positive and the rest were negative. Positive tweets contain a personal mention of one's health (for example, sharing one's health status or opinion), whereas negative samples contain a generic discussion of the health issue, or some unrelated mention of the word. For instance, tweet 9,832 in the original data contains a flu-vaccination context, while in tweet 1,046 the author discusses the disease context of flu. Of the available data, we had 2,426 positive samples and 4,962 negative samples in which the author initiates a general health discussion as opposed to mentioning any particular context of flu. For our experiments we segmented the provided dataset into train and validation splits; Figure 2 shows the distribution of data in the two splits.

Preprocessing
Before feeding the dataset to any machine learning model, we preprocessed the data as follows. Normalization of tokens was done using hand-crafted rules, mainly to deal with short forms such as thru (through), abt (about), etc. The '@user' and URL tokens were removed. Hashtags containing two or more words were segmented into their component words using the ekphrasis library; for example, #NotFeelingWell was converted to not feeling well.
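These steps can be sketched roughly as follows. This is a simplified illustration with a toy abbreviation map, not our full rule set; hashtag segmentation, which our pipeline delegates to ekphrasis, is omitted here.

```python
import re

# Toy normalization map; the paper's actual rules are hand-crafted
# and cover many more short forms.
ABBREV = {"thru": "through", "abt": "about"}

def preprocess(tweet):
    # Drop @user mentions and URLs.
    tweet = re.sub(r"@\w+", "", tweet)
    tweet = re.sub(r"https?://\S+", "", tweet)
    # Expand known short forms token by token.
    tokens = [ABBREV.get(t.lower(), t) for t in tweet.split()]
    return " ".join(tokens)
```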

Training Models
For all the tasks, we mainly concentrated on training the recently introduced ULMFiT and BERT models, which are well known for their transfer learning capabilities and for generalizing well across various natural language processing tasks in different domains. We describe our models in this section. We extensively used the fast.ai, BERT, and flair libraries for training our models for all the tasks. The different models trained and their corresponding hyperparameters are presented in Table 1. We provide a brief description of each next.
ULMFiT - We used ULMFiT (Howard and Ruder, 2018) for Tasks 1 and 4. One of the main advantages of ULMFiT is that it works very well on a small dataset such as the one provided in the task, and it avoids training a classification model from scratch, which reduces overfitting. We used the base fast.ai implementation of this model. ULMFiT has two main parts: the language model and the classification model. We observe that fine-tuning the language model on a larger dataset provides a significant improvement in performance (Tuhin Chakrabarty, 2019). Therefore, we fine-tune the language model over 190,823 tweets containing mentions related to 250 drugs (Sarker and Gonzalez, 2015). Default fast.ai parameters were used to train the language models. Finally, we find the best hyperparameters and train the classifier over the original training data.
BERT - We use the provided TensorFlow implementation of BERT and fine-tune BERT-base-uncased. We find the best parameters and train the model over the original dataset.
BLSTM - We train a bidirectional LSTM text classifier fed with different types of pretrained embeddings, as presented in Table 1. It is important to note that, due to the long time needed to train the BLSTM models with these embeddings and the unavailability of GPUs, we could not finish training before submitting our results on the test data provided by the organizers. We leave making predictions with the fully trained model as future work.
BLSTM+CRF Tagger - We treated the problem posed in sub-task 2 as a named entity recognition and extraction problem: the text span corresponding to an adverse drug reaction mention is treated as an entity that further needs to be classified into one of the two categories, ADR or non-ADR. Following the current state of the art, we trained a BLSTM+CRF tagger implemented in the flair library (referenced above).
Apart from that, we also used the BLSTM+CRF architecture with two different combinations of stacked embeddings.
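Conceptually, stacking embeddings simply concatenates, per token, the vectors from each pretrained source, which is what flair's StackedEmbeddings does. A toy sketch, with plain dictionaries standing in for the pretrained lookups (the function name and the dictionary contents are illustrative only):

```python
def stack_embeddings(tokens, sources):
    """Concatenate per-token vectors from several embedding sources.

    sources: list of (lookup_dict, dim) pairs; tokens missing from a
    source map to a zero vector of that source's dimensionality.
    """
    stacked = []
    for tok in tokens:
        vec = []
        for lookup, dim in sources:
            vec.extend(lookup.get(tok, [0.0] * dim))
        stacked.append(vec)
    return stacked
```

The resulting vectors, whose dimensionality is the sum of the source dimensionalities, are what the BLSTM+CRF tagger consumes in place of a single embedding.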
Next, we present the results obtained on the test data provided by the organizers for sub-tasks 1, 2 and 4. Table 2 presents the F1 scores on the test data for sub-task 1. The ULMFiT model showed the best performance. As already mentioned, the data provided for sub-task 1 was highly imbalanced. We performed undersampling with different ratios of the classes (ADR : non-ADR); Figure 3 presents the performance of the ULMFiT and BERT models on the training data for different undersampling ratios. We also tried oversampling, but did not observe any improvement in performance. The best performance with both BERT and ULMFiT was obtained without any undersampling or oversampling; therefore, the model used on the test data was trained on the full training dataset, maintaining the given ratio of ADR to non-ADR tweets. Table 3 presents the performance scores for sub-task 2 on the test data. The different metrics presented in the table were implemented by the organizers, and the scores were provided by them.
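For reference, the ADR-class F-score used as the evaluation metric for sub-task 1 is the standard per-class F1; a minimal sketch (`f1_positive` is a hypothetical helper name, not the organizers' scoring script):

```python
def f1_positive(gold, pred, positive=1):
    """F1 for the positive (ADR) class from parallel label lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if p == positive and g != positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```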

Task 4: Generalized Identification of Personal Health Experience Mentions
The objective of this task is to classify whether a tweet contains a personal mention of one's health (for example, sharing one's own health status or opinion), as opposed to a more general discussion of the health issue or an unrelated mention of the word. Each model was evaluated using four F1-scores: F1 on the held-out influenza data, F1 on the second and third undisclosed contexts, and the overall F1-score. The results our models obtained on the test data are presented in Table 4 (Results for Task 4: generalized identification of personal health experience mentions). As already mentioned, training of the BLSTM models with pretrained embeddings could not be completed; despite not having a fully trained model, we still see decent performance using the BLSTM with a combination of pretrained embeddings on the provided dataset.

Future Work and Conclusion
In this work, we presented our initial attempt at using BERT and ULMFiT for text classification tasks in the domain of pharmacovigilance. We obtained decent results on three different tasks organized as shared tasks of the Social Media Mining for Health Workshop 2019. We noticed that BERT and ULMFiT were agnostic to undersampling and oversampling, unlike the traditional text classifiers previously reported on a similar task (Sarker et al., 2018) that was part of the same workshop held in 2017. We consider the work reported in this paper a preliminary attempt and would like to extend it in the future: we plan to train better models using BERT for all three sub-tasks we participated in, and to interpret the predictions of the models. We also think domain-specific training of different embeddings could help, and would like to try this in the future.