Detection of Adverse Drug Reaction Mentions in Tweets Using ELMo

This paper describes the models used by our team in the SMM4H 2019 shared task. We submitted results for subtasks 1 and 2. For task 1, which aims to detect tweets with Adverse Drug Reaction (ADR) mentions, we used ELMo embeddings, a deep contextualized word representation that captures both syntactic and semantic characteristics. For task 2, which focuses on extracting the text spans of ADR mentions, the same architecture as in task 1 was first used to identify whether or not a tweet contains an ADR. Then, for tweets classified as mentioning an ADR, the relevant text span was identified by similarity matching against three lexicon sets.


Introduction and task description
Twitter is an ever-growing store of daily generated data. Given the huge number of tweets discussing drug-related issues, social media mining is applicable to areas such as pharmacovigilance (Lee et al., 2017; Nikfarjam et al., 2015; Ginn et al., 2014; Freifeld et al., 2014; Bian et al., 2012).
Tasks 1 and 2 focus on detecting tweets that mention ADRs and on identifying the location of those mentions. We were provided with 25,672 labeled tweets (2,374 positive and 23,298 negative) and approximately 5,000 unlabeled tweets as a validation set. For the second task, a subset of 2,367 tweets from the first task was provided (1,212 positive and 1,155 negative). The evaluation data comprise 1,000 tweets (~500 positive, ~500 negative).

Preprocessing
Stop words and punctuation were removed from tweets, and all drug names found in the FDA's Approved Drug Products list 1 were replaced by the word "drug". Word stemming and tokenization were performed using the NLTK Python library.
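As an illustration, the preprocessing pipeline might look like the following sketch. The tiny drug-name and stop-word sets here are illustrative stand-ins for the full FDA Approved Drug Products list and a standard stop-word list; the tokenizer and stemmer choices are also assumptions, since the paper only states that NLTK was used.

```python
import string
from nltk.stem import PorterStemmer
from nltk.tokenize import TreebankWordTokenizer

# Illustrative stand-in lists; the paper uses the full FDA Approved
# Drug Products list and a standard stop-word list.
DRUG_NAMES = {"ibuprofen", "metformin"}
STOP_WORDS = {"a", "an", "the", "is", "me", "my", "and"}

stemmer = PorterStemmer()
tokenizer = TreebankWordTokenizer()

def preprocess(tweet):
    tokens = tokenizer.tokenize(tweet.lower())
    cleaned = []
    for tok in tokens:
        if tok in string.punctuation or tok in STOP_WORDS:
            continue  # drop punctuation and stop words
        if tok in DRUG_NAMES:
            cleaned.append("drug")  # normalize all drug names to "drug"
        else:
            cleaned.append(stemmer.stem(tok))
    return cleaned

print(preprocess("Ibuprofen is giving me headaches!"))
```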

Methods for task 1
For this task, we used four deep learning models. The architectures of the first three were similar, differing only in the embedding layer.
The first model involves a character embedding with dimension equal to the total number of unique characters in the training set, including emojis. The output of this layer is fed to a series of six convolutional neural network (CNN) layers with ReLU activation. Each CNN layer used 256 filters, with a filter size of 7 for the first two layers and 3 for the rest. Max pooling with size 3 was applied after the first two and the last CNN layers. The CNN output was fed into a bidirectional LSTM (Bi-LSTM) with 2×200 units, whose output was flattened and fed into two fully connected layers with 1024 units each, ReLU activation, and dropout of 0.5. Finally, we used a dense layer of size two with softmax activation. We used Adam as the optimizer and binary cross-entropy as the loss function. The model was trained for 10 epochs with a batch size of 128.
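The first model can be sketched in Keras roughly as follows. The character-vocabulary size and maximum tweet length are assumed placeholder values, not figures from the paper:

```python
from tensorflow.keras import layers, models

NUM_CHARS = 120  # assumed number of unique characters (incl. emojis)
MAX_LEN = 280    # assumed maximum tweet length in characters

# Character-level CNN + Bi-LSTM classifier: six CNN layers (256 filters;
# kernel 7 for the first two, 3 for the rest), max pooling of size 3
# after layers 1, 2, and 6, a Bi-LSTM with 2x200 units, then two
# 1024-unit dense layers with dropout 0.5 and a softmax output.
model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(NUM_CHARS, NUM_CHARS),
    layers.Conv1D(256, 7, activation="relu"),
    layers.MaxPooling1D(3),
    layers.Conv1D(256, 7, activation="relu"),
    layers.MaxPooling1D(3),
    layers.Conv1D(256, 3, activation="relu"),
    layers.Conv1D(256, 3, activation="relu"),
    layers.Conv1D(256, 3, activation="relu"),
    layers.Conv1D(256, 3, activation="relu"),
    layers.MaxPooling1D(3),
    layers.Bidirectional(layers.LSTM(200, return_sequences=True)),
    layers.Flatten(),
    layers.Dense(1024, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1024, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```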
The second architecture was identical to the first, except the first layer was a word embedding using GloVe 2 pre-trained on Twitter data with embedding dimension of 100.
The third model was a concatenation of word and character embeddings. We combined the Bi-LSTM output of the first and second models and then applied dense layers as before.
After building the above models, we tried to improve the outcomes by adding layers and features. We used a multi-head self-attention with an attention width of 15 and ReLU activation. We also explored the effect of sentiment features. Since the data classes were imbalanced, we tried to make class sizes equal by downsampling and upsampling. In downsampling, samples from the majority class (tweets without ADR mentions) were randomly sampled without replacement. In upsampling we did the opposite, adding samples from the minority class with replacement. None of these strategies substantially altered our baseline results.
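The down- and upsampling strategies above amount to the following sketch, using standard random sampling (the class sizes mirror the task 1 data):

```python
import random

def downsample(majority, minority, seed=0):
    """Randomly subsample the majority class without replacement,
    down to the size of the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + list(minority)

def upsample(majority, minority, seed=0):
    """Randomly duplicate minority-class samples with replacement,
    up to the size of the majority class."""
    rng = random.Random(seed)
    return list(majority) + rng.choices(minority, k=len(majority))

# Class sizes mirroring the task 1 imbalance (23,298 negative, 2,374 positive).
neg = ["neg"] * 23298
pos = ["pos"] * 2374
balanced_down = downsample(neg, pos)
balanced_up = upsample(neg, pos)
print(len(balanced_down), len(balanced_up))  # 4748 46596
```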
In our final model, we used ELMo (Peters et al., 2018) (Embeddings from Language Models) with 1024 dimensions. In contrast to traditional word embeddings such as GloVe and word2vec, ELMo assigns each word a vector that is a function of the entire sentence containing that word. Therefore, the same word can have different embeddings depending on its context. Since ELMo already captures character-level information under the hood, we chose to encapsulate the complexity within the embedding layer and used only two additional dense layers with 256 and 2 units, using ReLU and softmax activations, respectively.

Methods for task 2
To identify the text spans of ADR mentions, first the model developed for task 1 was used to determine whether each tweet mentions an ADR. Then the similarity between each tweet and 3 different lexicon sets (Nikfarjam et al. 3 , MedDRA (Medical Dictionary for Regulatory Activities) 4 , and CHV (Consumer Health Vocabulary) 5 ) was measured.
To calculate similarity, each tweet and each lexicon entry was converted to a set of word stems. Since similarity measures such as cosine or Jaccard are strongly affected by the other, non-ADR words in a tweet, we instead defined similarity as the percentage of a lexicon entry's word stems that appear in the tweet. For each tweet, only lexicon entries with a 100% match were kept.
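The matching rule above can be sketched as follows; the stemmed tweet and the two lexicon entries are hypothetical examples for illustration:

```python
def lexicon_match(tweet_stems, lexicon_entry_stems):
    """Fraction of a lexicon entry's word stems that appear in the tweet.
    Unlike Jaccard or cosine similarity, extra non-ADR words in the
    tweet do not lower the score."""
    tweet = set(tweet_stems)
    entry = set(lexicon_entry_stems)
    return len(entry & tweet) / len(entry)

# Hypothetical stemmed tweet and lexicon entries.
tweet = ["drug", "give", "terribl", "headach", "today"]
lexicon = {"headache": ["headach"], "stomach pain": ["stomach", "pain"]}

# Keep only entries with a 100% stem match.
matches = [term for term, stems in lexicon.items()
           if lexicon_match(tweet, stems) == 1.0]
print(matches)  # ['headache']
```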

Results, discussion, and next steps
Among all architectures, the best results came from the ELMo embedding (F1 = 0.64). Therefore, we submitted only ELMo results, trained for 5, 10, and 15 epochs. The model performed less well on the validation set (F1 = 0.41), below the average F1 score of 0.50 across all teams, which may indicate overfitting. Using a more sophisticated architecture after the embedding layer might improve performance.
Since task 2's performance depends strongly on task 1, we also scored below the team average on this task (0.40 vs. 0.54). Because ADR phrases and tweets do not always match lexically, approaches such as named entity recognition (NER) might perform better.
Other approaches that might improve performance on task 1 include trying other contextualized embeddings such as BERT.

Acknowledgment
I would like to thank Maheedhar Kolla who provided insight and expertise that significantly assisted this work.
I would also like to show my gratitude to Peter Leimbigler for comments that greatly improved the manuscript.
Finally, special thanks go to Alfred Whitehead for supporting me to participate in this challenge.