Identifying Adverse Drug Events Mentions in Tweets Using Attentive, Collocated, and Aggregated Medical Representation

Identifying mentions of medical concepts in social media is challenging because of the high variability of free text. In this paper, we propose a novel neural network architecture, the Collocated LSTM with Attentive Pooling and Aggregated representation (CLAPA), that integrates a bidirectional LSTM model with an attention and pooling strategy and utilizes collocation information from the training data to improve the representation of medical concepts. The collocation and aggregation layers improve model performance on the task of identifying mentions of adverse drug events (ADE) in tweets. Using the dataset made available as part of the workshop shared task, we show that careful selection of neighborhood contexts can help uncover useful local information and improve the overall medical concept representation.


Introduction
Multiple studies have analyzed health forums and other social media for drug use, pharmacovigilance, and the effectiveness of medications (Nikfarjam et al., 2015; Daniulaityte et al., 2016), and research related to drugs and adverse drug events (ADE) in social media continues to grow rapidly. Automatically detecting ADE mentions in social media posts is challenging due to the large variability of free text. Another main challenge in studying natural language processing (NLP) approaches for medical information extraction is the limited access to health-related information on social media.
A robust representation of words is important for training high-performance information extraction approaches. In domain-specific tasks, being able to properly represent domain words or concepts can significantly improve the models. While many studies have undertaken classification of ADE mentions in posts using various state-of-the-art techniques (Nikfarjam et al., 2015; Weissenbacher et al., 2018), there is still room for improvement on the task. For example, in many trained word embedding models (Pennington et al., 2014; Godin et al., 2015; Joulin et al., 2017), the embedding of each word is a vector summarizing multiple semantic meanings of the word as independent dimensions. Pre-trained embeddings trained on a large corpus usually provide a robust representation for common words, compared to traditional feature-based techniques such as bag of words. Yet, for domain-specific tasks, a drawback of pre-trained embeddings is that the representations of domain words may not be sufficiently tuned to capture the expected meaning.
Attempts have been made previously to capture word embeddings for medical concepts from a variety of medical data sources (Huang et al., 2016). Similarly, domain-specific knowledge graphs have been shown to be effective external resources for feature expansion to represent medical concepts (Choi et al., 2017; Wang et al., 2017). However, even domain-based knowledge graphs sometimes contain redundant information stemming from how they are constructed (Yu et al., 2014; Paulheim, 2017; Zaveri et al., 2016). Following prior work by Turenne (2003) showing that co-occurrence patterns of terms can benefit classification tasks, we consider an alternative graph-based representation that utilizes local information derived from the training data set. We build a collocation graph, a word-based graph built from the training data set in which nodes correspond to vocabulary words and an edge between two nodes indicates the co-occurrence of the corresponding words. We investigate whether a model built over the collocation graph can use pre-trained word embeddings and other information to recognize medical concepts in data. We hypothesize that the representation of a medical word can be further enriched by its neighbors in the collocation graph.
In this paper, we propose the Collocated LSTM with Attentive Pooling and Aggregated representation (CLAPA), a novel approach that integrates a bidirectional LSTM model with an attention and pooling strategy and utilizes the collocation information in the training data set to enhance the pre-trained word embeddings of medical concepts. We show that our model leads to a significant improvement on an ADE detection task. To the best of our knowledge, this is the first attempt to utilize local collocation information to improve the representation of domain concepts in social media.
To summarize, we make the following contributions in this paper:
• We propose a novel architecture that encodes locally stored domain information into the sentence representation.
• Our work explores the possibility that limited training data could be better exploited by including attentive collocation information.
• We provide implications for other domain-related work where better representation of domain terms is important, especially when the data set is highly imbalanced.

Related work
Researchers have tackled the problem of identifying posts mentioning ADEs in social media in different ways. Various methods have been used in the 2018 Social Media Mining for Health Applications (SMM4H) shared task, ranging from statistical models such as support vector machines (SVM) to deep neural network models such as convolutional neural network (CNN), long short-term memory (LSTM), and bidirectional LSTM models. Fourteen teams participated in the 2018 SMM4H shared tasks (Weissenbacher et al., 2018), and used deep neural network models and various text processing steps such as correcting misspellings, accounting for class imbalance in data, and incorporating external resources. For the ADE mention classification task, the best system achieved an F1 score of 0.522, while the next best system achieved an F1 score of 0.478. The best system (Wu et al., 2018) was based on a bidirectional LSTM model with hierarchical tweet representation and multi-head self-attention.
In recent years, models such as CNNs (Kim, 2014) and bidirectional LSTMs (Graves and Schmidhuber, 2005) have been used for text classification. In addition, models with an attention mechanism, which incorporates information from other input tokens to improve the representation of each token, were introduced by Vaswani et al. (2017). Several max-pooling techniques, which help detect important ngrams, were explored by Jacovi et al. (2018) and Zhou et al. (2016). Such mechanisms and techniques have been powerful tools for building better text classification systems. To train distributed representations of words, Mikolov et al. (2013) introduced Word2Vec, in which each word is represented in a low-dimensional vector space. Other popular pre-trained word embeddings include GloVe (Pennington et al., 2014), Word2vec over Twitter (Godin et al., 2015), and FastText (Joulin et al., 2017). Similarly, graph embedding techniques over large-scale networks have been studied in numerous prior works, including LINE (Tang et al., 2015), DeepWalk (Perozzi et al., 2014), and Node2Vec (Grover and Leskovec, 2016). Although graph embedding is similar to word embedding, it is trained not only on the nodes adjacent to each node but on the entire local network around the node. Graph embeddings can therefore capture relations between nodes, and have been used for multi-label classification and community detection (Grover and Leskovec, 2016; Qiu et al., 2018). Since most text-based graphs are typically reducible to a linear chain, and the ADE detection task is a binary classification problem, we focus only on word embedding-based approaches in this paper.

Collocation and aggregated representation models
In this section, we describe the architecture of our model in detail. The model contains three key components: medical collocation embedding, sentence encoding, and max pooling.
The overall architecture of our model is shown in Figure 1. For each word, the embedding is composed of two parts: a pre-trained word embedding and an attentive neighborhood embedding. The attentive neighborhood embedding is derived from the Concept-Neighbor (C-N) tensor, in which each N_i represents the neighborhood of the i-th concept. Based on an attention vector (MedAttn_i), a concept embedding matrix C is formed in which c_i is the embedding for the i-th concept. The collocation embedding for a word w_t is c_i if w_t is the i-th concept; otherwise, the collocation embedding is initialized to the zero vector. The concatenated embedding is then fed into an LSTM layer, and multi-head attention and max pooling are applied to extract informative neurons, which are then concatenated with (1) the final state of the LSTM (the sentence encoding) and (2) the sum of the concept embedding matrix. The final output is then computed via a fully connected neural network with a softmax function. Table 1 summarizes the notation used in this paper.

Figure 1: Overall architecture of the proposed model for identifying adverse drug events.

Medical collocation embedding
To better utilize the medical information embedded in text, we propose two word embedding methods: a pre-trained word embedding, and a second embedding that enhances the pre-trained representation of medical terms by extracting information around those terms from the collocation graph.
Table 1: Notation used in this paper.

  N_i: neighborhood matrix in R^{K×d} for the i-th concept
  C-N tensor: neighborhood tensor of size R^{|C|×K×d} composed of the neighborhoods of the concepts
  w_t: t-th word in a text sequence
  s_i: i-th medical concept word in S
  c_i: medical collocation embedding of s_i
  n_ik: word embedding of the k-th neighbor of the i-th concept in the concept set
  m_t: medical collocation embedding for the word w_t
  |S|: total number of concepts
  T: total number of words in a sequence
  K: maximum neighborhood size
  L: total number of attention heads
  d: dimension of the word embeddings
  d_h: dimension of the hidden states in the LSTM

Our medical collocation embedding can therefore be written as:

  MedAttn_i = softmax(f(N_i) W_1),   c_i = Σ_{j=1}^{K} MedAttn_i^j n_ij   (1)

where f(·) represents a linear transformation and W_1 ∈ R^{K×1} is a trainable parameter matrix. MedAttn_i^j calculates the attention that should be paid to the j-th neighbor of the concept s_i. Therefore, the embedding c_i is the embedding of the neighborhood of s_i weighted by the attention scores. Lastly, m_t denotes the medical collocation embedding for the t-th word w_t in the text: if w_t matches the i-th medical concept, then m_t = c_i; otherwise, m_t is the zero vector.
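The attention-weighted neighborhood sum of Eq. 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the raw attention scores are passed in directly, standing in for f(N_i) W_1, and the function names are invented for this example.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def collocation_embedding(neighbors, scores):
    """Attention-weighted neighborhood embedding c_i (Eq. 1, sketched).

    neighbors: K neighbor word vectors (the rows of N_i), each of dim d.
    scores:    K raw attention scores, standing in for f(N_i) W_1.
    """
    attn = softmax(scores)                  # MedAttn_i
    d = len(neighbors[0])
    c_i = [0.0] * d
    for a, n in zip(attn, neighbors):
        for j in range(d):
            c_i[j] += a * n[j]              # sum_j MedAttn_i^j * n_ij
    return c_i

# Two neighbors with equal scores: c_i is the plain average of their vectors.
c = collocation_embedding([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With unequal scores, the softmax weights shift the embedding toward the higher-scoring neighbor, which is the effect the attention vector is meant to capture.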

Aggregated Medical Representation
In addition to the word-based medical concept embedding described in Sec. 3.1, we propose an aggregated medical representation strategy that uses the collocation information to aggregate the medical concept information in a sentence into a fixed feature space. First, we use the attentive embedding c_i described in Eq. 1 to construct a medical concept representation from the neighborhood information. Then, the aggregated representation g is constructed as follows:

  g = Σ_{i=1}^{|S|} δ(s_i) (e(s_i) + c_i)   (2)

where e(·) is the function that retrieves the original representation of a medical concept word from the pre-trained embedding, and δ(s_i) = 1 when the sentence contains the concept word s_i, and 0 otherwise. This aggregated medical representation serves as residual medical information that is added to the output layer.
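The aggregation step can be sketched as follows. The source defines e(·), c_i, and δ(·) but not their exact combination, so summing e(s_i) + c_i is an assumption of this sketch, and all names are illustrative.

```python
def aggregated_representation(sentence, concepts, pretrained, colloc):
    """Aggregated medical representation g (Eq. 2, sketched).

    Sums e(s_i) + c_i over every concept word s_i present in the sentence
    (delta(s_i) = 1). Combining e(.) and c_i by addition is an assumption;
    the paper lists the ingredients but not the exact operator.
    """
    d = len(next(iter(pretrained.values())))
    g = [0.0] * d
    for s in concepts:
        if s in sentence:                       # delta(s_i) = 1
            for j in range(d):
                g[j] += pretrained[s][j] + colloc[s][j]
    return g

g = aggregated_representation(
    ["this", "drug", "causes", "drowsiness"],
    ["drowsiness", "seroquel"],                           # concept set S
    {"drowsiness": [1.0, 2.0], "seroquel": [5.0, 5.0]},   # e(.)
    {"drowsiness": [0.5, 0.5], "seroquel": [0.0, 0.0]},   # c_i
)
# only "drowsiness" occurs in the sentence, so only its vectors are summed
```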

Sentence encoding
To encode a sentence for the classification task, we use an attention-based LSTM to encode the entire sentence into a fixed vector space. L attention heads are applied to re-represent the hidden states. The new hidden states from the l-th attention head can be described as follows:

  SentAttn^l_t = softmax_t(f(h_t)),   ĥ^l_t = SentAttn^l_t h_t   (3)

where H = [h_1, ..., h_T] ∈ R^{d_h×T} is the hidden state matrix representing the information status at each time step, and d_h is the hidden dimension. e(·) and f(·) are the same as defined in Eq. 1. SentAttn^l_t is a scalar representing the attention that should be paid to h_t, so ĥ^l_t is the attentive hidden state scaled by the attention value in the l-th attention head.
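The per-head re-representation can be sketched like this; the raw scores stand in for f(h_t), and the function name is invented for the example.

```python
import math

def attention_head(H, scores):
    """Attentive re-representation of hidden states for one head (Eq. 3, sketched).

    H:      T hidden states h_t, each of dimension d_h.
    scores: T raw attention scores, standing in for f(h_t).
    Returns h-hat_t = SentAttn_t * h_t for every time step.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    attn = [e / z for e in exps]                # SentAttn_t, sums to 1
    return [[a * x for x in h] for a, h in zip(attn, H)]

# Equal scores over two time steps: every state is scaled by 0.5.
H_hat = attention_head([[2.0, 4.0], [6.0, 8.0]], [0.0, 0.0])
```

Running L independent heads simply repeats this with L different score functions, producing L scaled copies of the hidden state sequence.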

Max pooling layer
Motivated by previous studies (Jacovi et al., 2018; Zhou et al., 2016) showing that max pooling can highlight important signals among features and hence improve classification, we apply a max pooling layer to extract important signals from the attentive hidden states in each attention head:

  signal^l = max_{t=1,...,T} Ĥ^l_{·,t}   (4)

where Ĥ^l = [ĥ^l_1, ..., ĥ^l_T] ∈ R^{d_h×T}, and the pooling is applied across the time steps so that signal^l ∈ R^{d_h} retains the strongest signal in each hidden dimension.
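The pooling itself is a per-dimension maximum over time, which a short sketch makes concrete (names are illustrative):

```python
def max_pool_over_time(H_hat):
    """Max pooling over time (Eq. 4, sketched).

    H_hat: T attentive hidden states of dimension d_h (the columns of
    H-hat^l). Returns signal^l, the per-dimension maximum over all
    time steps, a vector in R^{d_h}.
    """
    d_h = len(H_hat[0])
    return [max(h[j] for h in H_hat) for j in range(d_h)]

# Each hidden dimension keeps its strongest activation across 3 time steps.
sig = max_pool_over_time([[0.1, 0.9], [0.7, 0.2], [0.3, 0.4]])
```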

Classification layer
In the final output layer, the classification decision is made on whether or not a sentence contains an ADE mention. A fully connected network module is implemented as:

  ŷ = softmax(U_2 φ(U_1 r + b_1) + b_2)   (5)

where φ is a nonlinear activation and r is the concatenation of the final state of the LSTM, the pooled states from max pooling, and the aggregated medical concept representation. Each pooled state vector signal^l comes from one of the L attention heads applied in sentence encoding (Eq. 3). U_1, U_2, b_1, and b_2 are parameters to be trained. Cross-entropy is used as the loss function for training:

  Loss = −Σ_i [y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)]

Experiments

Data
For our experiments, we used the data set provided as part of Task 1 of the SMM4H 2019 shared tasks (Gonzalez-Hernandez et al., 2019). As summarized in Table 2, the total number of annotated tweets is 25,678. The data set was randomly split into a training set (80%) and a validation set (20%), while maintaining the target class proportions of the original distribution. As a result, our training set contains 1,892 tweets that mention an ADE (positive cases) and 18,650 tweets that do not mention any ADE (negative cases). The validation set contains 485 positive and 4,651 negative tweets. We cleaned the tweets by separating punctuation marks, removing special characters, and replacing mentions, URLs, and numbers with normalized tokens. Finally, we used fastText (Joulin et al., 2017) as the pre-trained word embedding model.

Collocation graph
To build our collocation graph, we treat each unique word in the training set as a node and add undirected edges from a word to its adjacent words in a tweet. The resulting graph consists of 27,440 nodes and 188,329 edges. To reduce the graph size, we removed all words that appeared fewer than three times in the corpus, leaving 12,438 nodes and 159,759 edges. The mean degree is 25.39 (sd = 114.59); 50% of the nodes have a degree less than 8, and 75% of the nodes have a degree less than 17.
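The construction above can be sketched directly from its description; the function name and the toy tweets are illustrative, not from the authors' code.

```python
from collections import Counter, defaultdict

def build_collocation_graph(tweets, min_count=3):
    """Build an undirected collocation graph from tokenized tweets.

    Nodes are vocabulary words; an edge links each word to the words
    adjacent to it in a tweet. Words appearing fewer than min_count
    times are dropped, mirroring the pruning step described above.
    A sketch, not the authors' implementation.
    """
    counts = Counter(w for tweet in tweets for w in tweet)
    keep = {w for w, c in counts.items() if c >= min_count}
    graph = defaultdict(set)
    for tweet in tweets:
        for a, b in zip(tweet, tweet[1:]):      # adjacent word pairs
            if a in keep and b in keep and a != b:
                graph[a].add(b)
                graph[b].add(a)
    return graph

tweets = [["tysabri", "infusion", "today"],
          ["tysabri", "infusion", "again"],
          ["tysabri", "infusion", "done"]]
graph = build_collocation_graph(tweets, min_count=3)
# rare words are pruned; only the tysabri -- infusion edge survives
```

A node's degree is then simply `len(graph[word])`, which is the quantity the centrality statistics above summarize.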
Figure 2: Examples of collocation graphs: Tysabri is considered a medical concept while Walgreens is not.

Figure 2 shows two example collocation graphs. The red nodes are words identified as medical concepts, while the grey nodes are words that are not. The collocation graph on the left is for the medical word Tysabri; its neighborhood comprises both medical and non-medical words, including other medical words such as infusion, treatment, and gilenya. The collocation graph on the right is for the word walgreens, which has few medical words as neighbors, such as cipro and miralax.

Medical concepts extraction
MetaMap, a widely used system for identifying medical concepts from the Unified Medical Language System (UMLS), is used to extract potential concepts from our tweet data set (Aronson, 2006). Given a sentence as input, MetaMap identifies phrases that could be medical concepts and maps each to a preferred name in UMLS. However, since MetaMap is designed to parse clinical documents rather than free text on social media, we consider only those marked phrases that are identical to the preferred name as valid medical concepts. After processing, 1,340 concepts were extracted by MetaMap from ADE tweets and 3,921 concepts from non-ADE tweets. Concepts are later split into single words.

Training setup
All parameters are jointly trained with a learning rate of 0.001 for ten epochs. In the experiments, we use fastText pre-trained embeddings, the hidden size of the LSTM is set to 300, and the number of attention heads is set to 3. For each experiment, the reported score is the average over five runs.

Results
To evaluate our model, we set two baselines: an attention-based LSTM model (Eq. 3) and an attention-based LSTM model with max pooling (Eq. 4). The results are presented in Table 3 as rows (1) and (4). As presented in Table 3, model performance is significantly improved by the addition of the collocation medical embedding and the aggregated embedding, over the attention-based bidirectional LSTM baselines. Further, adding the aggregated medical information helps improve recall but reduces precision, and only slightly increases the F1 score compared to the collocation-based model. Hence, while highlighting medical information reduces false negative decisions, it also causes more instances to be labeled as ADE tweets, thereby increasing the false positive rate as well. The CLAPA model, which integrates both the collocation and aggregated representations along with the attentive pooling strategy, performs best.
When run against the test set for the shared task, the CLAPA model achieves an F1 score of 0.5676 (see Table 3). For comparison, the average F1 score among the systems participating in this task is 0.5019, showing that CLAPA performs substantially better than average on this task.

Model learning stability
To show that our model consistently works well even with less training data, we independently and randomly sampled 10%, 30%, 50%, 70%, and 90% of the training set and retrained the models. Figure 3 shows that, compared to the baseline bidirectional LSTM model with attentive pooling (the "LAP" model), our model consistently performed well on the validation set even with reduced training size. The results are similar to those on the full data set in Table 3: even when only a fraction of the training data is available, the model achieves a higher F1 score because of significantly better recall at a relatively small reduction in precision.

Effect of concept vocabulary
Figure 3: Effects of training size on model performance stability.

Next, we analyzed the medical concepts observed in ADE tweets to understand whether medical concepts are used differently in ADE tweets vs. non-ADE tweets. We calculated a propensity ratio for each medical term, based on the number of times it appears in ADE tweets compared to non-ADE tweets. We found that causing, gain, drowsiness, and sweats are about 15 times more likely to appear in ADE tweets than in non-ADE tweets. Similarly, crippled is about 26 times more likely to appear in an ADE tweet than in a non-ADE tweet. Considering the highly skewed appearance ratio for certain concepts, we analyzed the effect of using concepts from the ADE tweets alone. We compared two models: one trained over medical concepts identified from the ADE tweets, and another trained over concepts from the entire training set, i.e., both ADE and non-ADE tweets. As summarized in Table 4, the model trained with concepts from just the ADE tweets achieved a higher F1 score. While its precision is slightly lower, the model trained over concepts from ADE tweets has significantly higher recall. On further analysis, we find that out of the 1,183 concept words extracted from the ADE tweets, 866 concepts (73.2%) occur more frequently in ADE tweets than in non-ADE tweets. When using the concept words extracted from both ADE and non-ADE tweets, the number of concepts is higher (n = 4,643), but only 1,094 of those concepts (23.6%) appear more frequently in ADE tweets. This indicates that the propensity ratio could be used to select the medical concepts used in ADE tweets as features.
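A propensity ratio of this kind can be sketched as a class-size-normalized rate ratio. The paper does not spell out its exact formula, so the add-one smoothing and the normalization below are assumptions of this sketch, and the counts are invented for illustration.

```python
def propensity_ratio(term, ade_counts, non_ade_counts, n_ade, n_non_ade):
    """Propensity of a term toward ADE tweets (sketched).

    Compares the term's rate of appearance in ADE tweets with its rate
    in non-ADE tweets, normalized by class sizes. Add-one smoothing
    (an assumption) keeps unseen terms from dividing by zero.
    """
    rate_ade = (ade_counts.get(term, 0) + 1) / n_ade
    rate_non = (non_ade_counts.get(term, 0) + 1) / n_non_ade
    return rate_ade / rate_non

# A term seen 30 times among 2,000 ADE tweets and 19 times among
# 20,000 non-ADE tweets is roughly 15x more likely in ADE tweets.
r = propensity_ratio("drowsiness", {"drowsiness": 30}, {"drowsiness": 19},
                     n_ade=2000, n_non_ade=20000)
```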

Effects of neighborhood selection
We analyzed two additional questions related to parameter tuning. (1) What method should be used to pick neighbors? To answer this question, we fixed the neighborhood size at 15 words and selected neighbors with one of the following three methods: (a) Random: Given a node n, we randomly select k of its neighbors n_1, n_2, ..., n_k ∈ N, where N is the set of all neighbors of node n.
(b) Popularity: We say that node n_i is more popular than node n_j when n_i has more neighbors than n_j. Given a node n, we select the k most popular neighbors n_1, n_2, ..., n_k, i.e., those with the highest degree. Ties in popularity are broken by selecting at random among the tied neighbors.
(c) Medical neighbor: Given node n, we add k medically-related neighbors.
For all three selection methods, if the total number of first-degree neighbors is less than k, the gap is filled by random selection among second-degree neighbors. Table 5 shows the results for the different selection methods under the two scenarios described in Section 4.7: the left column depicts the model trained on concepts from all tweets, and the right column the model trained with concepts from ADE tweets alone. Targeting neighbors by either popularity or medical attributes always leads to better performance, regardless of scenario. However, when using medical concepts from both ADE and non-ADE tweets, picking medical neighbors is the better choice, whereas popular neighbors are preferred when concepts are identified from ADE tweets alone. A medical neighborhood has a higher probability of including informative words related to ADEs, and when only ADE tweets are considered, the frequency of co-occurrence between a neighbor and the concept becomes more important. This explanation also aligns with how language models are usually trained.

(2) How should we decide the neighborhood size? We experimented with different neighborhood sizes. As shown in Fig. 4, performance is not affected much when k is small (from 5 to 20), but drops significantly when k is larger (k > 20). We explain this by referring back to our neighborhood selection results, where we found that choosing good neighbors (popular or medically related) favors the model: we want to choose informative neighbors rather than all neighbors. When k is small, the selected (high-degree) neighbors are easily differentiated from those not selected; when k is large, the selection becomes less informative because many unimportant, noisy neighbor words (low-degree, non-popular) may be included, which harms the model.
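The popularity-based selection (method (b)), including the random tie-breaking and the second-degree fallback shared by all three methods, can be sketched as follows; the function name and the toy graph are illustrative.

```python
import random

def select_neighbors(node, graph, k, rng=random):
    """Popularity-based neighborhood selection (method (b), sketched).

    Picks the k first-degree neighbors with the highest degree, breaking
    ties at random; if fewer than k exist, the gap is filled with random
    second-degree neighbors, as described for all three methods.
    """
    first = list(graph.get(node, set()))
    # Sort by degree, descending; rng.random() shuffles tied neighbors.
    first.sort(key=lambda n: (-len(graph.get(n, set())), rng.random()))
    chosen = first[:k]
    if len(chosen) < k:
        second = {m for n in first for m in graph.get(n, set())}
        second -= set(chosen) | {node}
        pool = list(second)
        rng.shuffle(pool)
        chosen += pool[: k - len(chosen)]
    return chosen

graph = {"tysabri": {"infusion", "treatment"},
         "infusion": {"tysabri", "today", "done"},
         "treatment": {"tysabri"},
         "today": {"infusion"}, "done": {"infusion"}}
# infusion (degree 3) is chosen before treatment (degree 1); the third
# slot is filled from the second-degree neighbors "today" or "done".
picked = select_neighbors("tysabri", graph, k=3)
```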

Limitation and future work
Our model suffers from three main limitations. First, although MetaMap has been found useful for parsing medical notes, the different linguistic usage on social media means that running MetaMap on tweets may miss relevant concepts. Second, the use of the collocation graph and the aggregated medical concept representation reduced model precision, although the overall recall and F1 improved; additional studies are needed to further improve precision. Third, the collocation graph is built solely on the training data set, which may hurt the model when the data set is not representative enough to provide neighborhoods of high quality. To address the first two issues, we believe a pre-trained state-of-the-art medication detection system could help identify high-quality medical concepts in tweets. For the third issue, we plan to use a domain knowledge base such as UMLS to expand the coverage of the limited data.
We used fastText as the pre-trained word embedding for our model. While fastText is trained on sub-word representations, models trained over medical or larger text corpora might provide additional contextual representation. Additional studies are needed to test our model with different pre-trained word embeddings, such as Word2vec over Twitter (Godin et al., 2015). We also note a difference in the use of medical concepts across classes, based on our two scenarios: a model using medical concepts identified from both ADE and non-ADE cases, and one using concepts from the ADE cases alone. In the future, we plan to extend this approach by exploring the use of class-specific nodes. Meanwhile, the application of our approach to other domain-specific tasks should be verified to examine its generalization.

Conclusion
In this work, we argue that a collocation graph can be utilized to enrich the representation of a medical concept. We further propose a novel neural network architecture that uses attentive information from a collocation graph to re-embed medical words. Our experiments show that, with a good selection of neighborhood, more useful local information can be accessed, which in turn improves the medical concept representation and the overall model performance in detecting mentions of adverse drug events in tweets.