Classification of Medication-Related Tweets Using Stacked Bidirectional LSTMs with Context-Aware Attention

This paper describes the system that team UChicagoCompLx developed for the 2018 Social Media Mining for Health Applications (SMM4H) Shared Task. We use a variant of the Message-level Sentiment Analysis (MSA) model of (Baziotis et al., 2017), a word-level stacked bidirectional Long Short-Term Memory (LSTM) network equipped with attention, to classify medication-related tweets in the four subtasks of the SMM4H Shared Task. Without any subtask-specific tuning, the model is able to achieve competitive results across all subtasks. We make the datasets, model weights, and code publicly available.


Introduction
The Shared Task of the 2018 Social Media Mining for Health Applications (SMM4H) workshop (Weissenbacher et al., 2018) proposed four subtasks in the domain of social media mining for health monitoring and surveillance. From a Natural Language Processing (NLP) viewpoint, these tasks present a considerable challenge, since the nature of social media posts requires dealing with both significant language variation and widespread noise (spelling mistakes, syntactic errors, etc.). Any classifier designed for this textual domain should take these intricacies into account and should, furthermore, be able to deal with the semantic complexity of the various ways people express medication-related concepts and outcomes.
To address these challenges, we use a variant of the Message-level Sentiment Analysis (MSA) model of (Baziotis et al., 2017), originally developed for sentiment analysis of Twitter posts, to classify tweets in all four subtasks. The model is a word-level stacked bidirectional LSTM (BiLSTM) with context-aware attention that uses word embeddings pretrained by (Baziotis et al., 2017) on a corpus of ≈ 330M tweets. Without additional hyperparameter tuning or subtask-specific modifications, the model outperforms the average of all submitted systems in subtasks 1 and 4 and achieves first place (by an F1-score margin of 0.234 over the next team) in subtask 2. In subtask 3 our model placed 6th out of 9 systems.
In the following sections, we introduce the datasets, discuss preprocessing steps we took, present the model and its training setup, report results, and conclude with potential avenues for future research.

Datasets
In this section, we describe the datasets of each subtask. Subtasks 1, 3 and 4 are binary classification problems while subtask 2 is a three-class classification problem. The data was manually annotated by the organizers.
Subtask 1 concerns the automatic detection of posts mentioning the name of a drug or dietary supplement, as defined by the United States Food and Drug Administration (FDA). A tweet is assigned label 1 if it contains the name of one or more drugs or supplements and 0 otherwise.

Subtask 2 poses the challenge of automatically classifying posts describing medication intake. A tweet is assigned label 1 if "the user clearly expresses a personal medication intake/consumption", 2 if the tweet suggests (without certainty) that "the user may have taken the medication", and 3 if the tweet mentions medication names but does not indicate personal intake.

Subtask 3 concerns the automatic classification of posts mentioning an adverse drug reaction (ADR). A tweet is assigned label 1 if it mentions an ADR and 0 otherwise.

Finally, Subtask 4 deals with the automatic detection of posts mentioning vaccination behavior related to influenza vaccines. The annotators were asked the question "Does this message indicate that someone received, or intended to receive, a flu vaccine?" and a tweet was assigned label 1 if the answer was affirmative and 0 otherwise.

Subtasks 1, 3 and 4 are evaluated using the F1-score for the positive class, while subtask 2 uses the micro-averaged F1-score for classes 1 and 2. Subtask 1 is additionally evaluated on precision and recall for the positive class.

Due to Twitter privacy policies, the training set for each subtask did not contain the actual tweet text. To obtain that text, participants were provided with the tweet ID of each dataset example, along with a script for downloading the text using this ID. The process inevitably yielded fewer tweets than the number of IDs contained in the original dataset, primarily because some tweets had been removed (either by the users themselves, or by Twitter because e.g. the user deleted their account), while others failed to download (due to e.g. lag issues when requesting the HTML of the tweet).
To avoid such issues in the evaluation datasets, the organizers decided to provide the tweet text along with the ID. Table [1] provides a short summary of the number of tweets that were available to our team for each subtask.

Pre-processing
We applied identical preprocessing to all datasets. We replaced Twitter-specific strings with appropriate tokens (e.g. emojis were replaced by $EMOJI$, numbers by $NUMBER$, website URLs by $URL$, etc.) to reduce the vocabulary size and to mitigate the noisy nature of the text. All non-alphanumeric characters and all tokens that were too short (fewer than 2 characters) or too long (more than 15 characters) were removed. Finally, all text was converted to lower case and any excess whitespace (i.e. newlines and tabs) was removed.
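The steps above can be sketched as a single function; the placeholder tokens ($URL$, $NUMBER$, $EMOJI$) follow the paper, but the particular regular expressions are our assumptions, not the released preprocessing code.

```python
import re

def preprocess_tweet(text):
    """Minimal sketch of the preprocessing pipeline described above.
    The exact regexes are illustrative assumptions."""
    text = text.lower()
    # replace Twitter-specific strings with placeholder tokens
    text = re.sub(r"https?://\S+|www\.\S+", " $URL$ ", text)
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", " $EMOJI$ ", text)
    text = re.sub(r"\d+(?:\.\d+)?", " $NUMBER$ ", text)
    # keep placeholders and alphanumeric tokens; drops punctuation
    tokens = re.findall(r"\$\w+\$|[a-z0-9]+", text)
    # remove tokens shorter than 2 or longer than 15 characters
    tokens = [t for t in tokens if t.startswith("$") or 2 <= len(t) <= 15]
    # joining collapses all excess whitespace (newlines, tabs)
    return " ".join(tokens)

# preprocess_tweet("Check http://t.co/abc I took 2 Advil!!")
# -> "check $URL$ took $NUMBER$ advil"
```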

Model description

Model architecture
We use a variant of the Message-level Sentiment Analysis (MSA) model of (Baziotis et al., 2017). The model consists of two stacked BiLSTMs with a context-attention mechanism à la (Yang et al., 2016) that identifies the maximally informative words for each label. We describe the individual network layers below.
The input is a tweet, regarded as a sequence of words, which is projected to a vector space of fixed size via the Embedding Layer. The weights of the embedding layer are initialized using pre-trained word embeddings that (Baziotis et al., 2017) trained on a Twitter corpus of ≈ 330M tweets. We opt for these embeddings instead of the standard Word2Vec ones (Mikolov et al., 2013a,b) since they have been trained on a textual domain similar to the tasks at hand.
An LSTM Layer placed on top of the embedding layer takes the embeddings as input and produces a representation $\{h_i\}_{i=1}^{T}$, where $h_i$ is the hidden state of the LSTM at time-step $i$, intuitively corresponding to a summary of all the information in the sentence (viewed as a sequence $\{w_i\}_{i=1}^{T}$ of words) up to $w_i$. This constitutes a forward LSTM. Since we are using a bidirectional LSTM, we also have an LSTM that scans the sequence of words in the reverse direction. The final representation of a word is produced by concatenating the representations from the forward and backward LSTMs:

$$h_i = \overrightarrow{h_i} \,\|\, \overleftarrow{h_i}, \qquad (1)$$

where $\|$ denotes the concatenation operator. We opt for a stacked BiLSTM, and consequently we place an additional BiLSTM layer on top of the preceding layer. The motivation for this choice comes from the literature on the interpretation of hidden states of Recurrent Neural Networks (RNNs) (Belinkov et al., 2017; Belinkov, 2018), in which it has been claimed that deeper layers are able to learn more abstract semantic representations of sentences, thus achieving superior performance in downstream tasks.
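To make the bidirectional concatenation in equation (1) concrete, the following is a minimal NumPy sketch of a single BiLSTM layer: one LSTM pass over the sequence, one over its reversal, and the two state sequences concatenated per time-step. The weight shapes and gate ordering are our assumptions for illustration, not the authors' Keras implementation.

```python
import numpy as np

def lstm_pass(x, W, U, b, L):
    """One-direction LSTM over x of shape (T, d).
    W: (4L, d), U: (4L, L), b: (4L,); gate order i, f, o, g assumed.
    Returns the hidden states as an array of shape (T, L)."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    h, c = np.zeros(L), np.zeros(L)
    states = []
    for x_t in x:
        z = W @ x_t + U @ h + b
        i, f, o, g = z[:L], z[L:2*L], z[2*L:3*L], z[3*L:]
        c = sig(f) * c + sig(i) * np.tanh(g)   # cell state update
        h = sig(o) * np.tanh(c)                # hidden state
        states.append(h)
    return np.stack(states)

def bilstm(x, params_fwd, params_bwd, L):
    """h_i = forward state || backward state, per equation (1)."""
    h_f = lstm_pass(x, *params_fwd, L)
    h_b = lstm_pass(x[::-1], *params_bwd, L)[::-1]  # re-align to time order
    return np.concatenate([h_f, h_b], axis=1)       # shape (T, 2L)
```

Stacking simply feeds the (T, 2L) output of one such layer as the input sequence of the next.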
To account for the fact that not all words contribute equally to the assignment of a label, we place an Attention Layer on top of the BiLSTMs, following prior work that successfully used attention mechanisms for sequence-to-sequence neural machine translation. We use context-attention, following (Yang et al., 2016). A context vector $u_h$ is initialized and the attention is governed by the following equations:

$$e_i = \tanh(W_h h_i + b_h),$$
$$a_i = \frac{\exp(e_i^{\top} u_h)}{\sum_{t=1}^{T} \exp(e_t^{\top} u_h)},$$
$$r = \sum_{i=1}^{T} a_i h_i,$$

where $W_h$, $b_h$ and $u_h$ are learned parameters, $h_i \in \mathbb{R}^{2L}$ is the concatenation of the representations of the forward and backward LSTM introduced in equation (1), and $L$ is the number of cells in one LSTM layer. Finally, we feed the representation $r$ produced by the attention layer to a Dense Layer with sigmoid activation (softmax for subtask 2) and obtain a probability distribution over the classes. If the probability assigned to a tweet is greater than 0.5 we assign label 1; otherwise we assign label 0.
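The context-attention computation above amounts to a softmax over similarities between projected hidden states and the context vector, followed by a weighted sum. A minimal NumPy sketch, with parameter shapes assumed for illustration:

```python
import numpy as np

def context_attention(H, W_h, b_h, u_h):
    """Context-aware attention in the style of Yang et al. (2016).
    H: (T, 2L) BiLSTM states; W_h: (2L, 2L), b_h: (2L,), u_h: (2L,)
    are the learned parameters named in the text (shapes assumed)."""
    e = np.tanh(H @ W_h.T + b_h)      # e_i = tanh(W_h h_i + b_h)
    scores = e @ u_h                  # similarity with the context vector
    a = np.exp(scores - scores.max())
    a = a / a.sum()                   # softmax attention weights a_i
    r = a @ H                         # r = sum_i a_i h_i
    return r, a
```

Subtracting `scores.max()` before exponentiating is a standard numerical-stability trick and does not change the softmax result.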

Training setup
We train the model to minimize the negative log-likelihood loss using back-propagation with stochastic gradient descent and a mini-batch size of 50. We use the Adam optimizer (Kingma and Ba, 2015) with gradient norm clipping (Pascanu et al., 2013) at 1. For subtasks 1, 2 and 3 we use a 90-10 train-validation split, while for subtask 4 we use 10-fold stratified cross-validation in consideration of the very small test set. Table [1] summarizes the information on train-validation splits.
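Gradient norm clipping, as cited above, rescales the gradient whenever its L2 norm exceeds the threshold (here 1), leaving its direction unchanged. A pure-Python sketch over a flat gradient vector:

```python
import math

def clip_by_norm(grads, max_norm=1.0):
    """Gradient norm clipping (Pascanu et al., 2013): if ||g|| exceeds
    max_norm, rescale g to g * max_norm / ||g||; otherwise leave it."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        grads = [g * scale for g in grads]
    return grads

# clip_by_norm([3.0, 4.0], max_norm=1.0) -> [0.6, 0.8]  (norm 5 -> norm 1)
```

In Keras this corresponds to passing a `clipnorm` argument to the optimizer.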

Regularization
To make the model more robust to over-fitting, we employ, following (Baziotis et al., 2017), a number of regularization techniques. We add Gaussian noise at the embedding layer and use dropout (Srivastava et al., 2014) to ignore the signal from a set of randomly selected neurons in the network. Dropout is also applied after each LSTM layer as well as to the recurrent connections of the LSTM (Gal and Ghahramani, 2016). L2 regularization along with class weights is applied to the loss function to prevent overly large weights and to account for class imbalance. Class weights are computed as follows: assuming that x is the vector of class counts, the weight for class i is defined as w_i = max(x)/x_i. Finally, early stopping (Caruana et al., 2001) is employed to terminate training once the validation loss stops decreasing.
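The class-weight rule above gives the majority class weight 1 and rarer classes proportionally larger weights. A direct implementation:

```python
def class_weights(counts):
    """w_i = max(x) / x_i for the vector x of per-class counts, as
    defined above; rarer classes get larger weights in the loss."""
    m = max(counts)
    return [m / c for c in counts]

# class_weights([100, 25, 50]) -> [1.0, 4.0, 2.0]
```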

Hyperparameter tuning
We use similar hyperparameters to (Baziotis et al., 2017). In particular, we use 150 as the size of the LSTM hidden states (300 in total, since we are using a BiLSTM), the Gaussian noise parameter is set to σ = 0.3, the dropout rate on top of the embedding layer is set to 0.3, and the dropout rate on top of the LSTM layers is set to 0.5. Dropout at the recurrent connections is also set to 0.3. The L2 regularization coefficient in the loss function is set to 0.0001. Finally, we initialize the learning rate at 0.001. Departing from (Baziotis et al., 2017), we use word embeddings of dimension 100. Vocabulary size and maximum sequence length are set to 7000 and 50 respectively for all subtasks, and the early-stopping patience is set to 5 epochs with a minimum validation-loss improvement of 0.001.

Experimental setup
The model was developed using Keras with the Tensorflow (Abadi et al., 2016) backend. For data preparation and processing we use Scikit-learn (Pedregosa et al., 2011). Given the small size of the datasets, we do not use GPUs for training the model; a standard 8-core CPU is sufficient. Finally, for designing the network architecture, we use part of the code released by (Baziotis et al., 2017).

Results
For subtasks 1 and 4, the organizers chose to disclose to each team only their respective score along with the average score of all submitted systems. These results are summarized in Table [2]. Our system performed better than the average in both subtasks, considerably so in subtask 1.
For subtasks 2 and 3, the organizers released the complete leaderboards, presented in Tables [3] and [4] respectively. Our system ranked first in subtask 2, greatly outperforming all other models in terms of precision, recall and F1-score. In subtask 3, our system ranked 6th (out of 9 participants), potentially because the other teams developed specialized systems for this particular subtask. The model's performance in this Shared Task is further testament to the ability of attentive RNNs to perform at state-of-the-art level in short text classification where individual word meaning is essential.
In the future, we aim to investigate whether ensembles of word- and character-level attentive RNNs can perform even better. The benefits of ensembling for text classification can be seen in numerous NLP tasks, ranging from Natural Language Inference (Gong et al., 2018, among many others) to product categorization (Skinner, 2018). Word-level models perform well in capturing aspects of the semantics (Belinkov et al., 2017), while character-level models succeed in capturing syntactic information about the text. Ensembles of these diverse types of models can potentially lead to improved performance.
A second avenue to pursue would be multi-task learning, an area of active research that has shown promising results in text classification (Liu et al., 2016, 2017). Given that all subtasks are nearly identical in nature (all but one of them being binary classification problems) and share a highly overlapping lexicon, they provide an excellent ground for testing the merits of multi-task learning.