Treatment Side Effect Prediction from Online User-generated Content

With Health 2.0, patients and caregivers increasingly seek information about possible drug side effects during their medical treatments in online health communities. These platforms are helpful sources of non-professional medical opinions, yet their content risks being unreliable in quality and insufficient in quantity to cover the wide range of potential drug reactions. Existing approaches that analyze such user-generated content in online forums rely heavily on feature engineering of both documents and users, and often overlook the relationships between posts within a common discussion thread. Inspired by recent advancements, we propose a neural architecture that models the textual content of user-generated documents and user experiences in online communities to predict side effects during treatment. Experimental results show that our proposed architecture outperforms baseline models.


Introduction
Seeking medical opinions from online health communities has become commonplace: 71% of those aged 18-29 (equivalent to 59% of all U.S. adults) reported consulting online health opinions (Fox and Duggan, 2013). These opinions come from an estimated twenty to one hundred thousand health-related websites (Diaz et al., 2002), including online health communities that network patients with each other to provide information and social support (Johnston et al., 2013). Platforms such as HealthBoards 1 and MedHelp 2 feature users reporting their own health experiences, including their own reviews of drugs and medical treatments. Hence, they are valuable sources for researchers (Leyens et al., 2017; Martin-Sanchez and Verspoor, 2014).
Although readers use these platforms to get valuable information about potential drug reactions during treatment, this is not without potentially serious problems. There is lexical variation: users refer to side effects differently, so "dizziness" can be expressed as "giddiness" or "my head is spinning". More concerning is that discussions rarely cover all possible prescribed drugs and their side effects during a treatment, and some topics refer to a condition without mentioning any particular drug. Relying on such information could lead to adverse reactions.
1 https://www.healthboards.com/ 2 https://medhelp.ord/
It is important to note that a tool that looks up mentioned drugs' side effects from a static database would not return answers with sufficient coverage. There are also common concerns regarding the credibility of user-generated content: Impicciatore et al. (1997) have shown that online health information is of variable quality and should be approached with caution.
With these caveats in mind, experienced users can nevertheless provide valuable expertise. For instance, when reporting expected side effects for a specific treatment, patients with long-term use of certain drugs can be valuable authorities. For example: While my experience of 10 years is with Paxil, I expect that Zoloft will be the same. You should definitely feel better within 2 weeks. One way I found to make it easier to sleep was to get lots of exercize. Walk or run or whatever to burn off that anxiety. -User 3690.
This is an answer to a thread asking about expected side effects of depression treatment with Zoloft. User 3690's history of actively discussing other anti-depressants such as Lexapro and Xanax gives insight into predicting potential drug reactions during the treatment of depression. Table 1 shows that Zoloft (mentioned in the thread) shares many common side effects with the other two anti-depressants: "changed behavior," "dry mouth," and "sleepiness or unusual drowsiness."
A method that could differentiate trustworthy user-generated content would be valuable, allowing us to macroscopically harness a large amount of online information and paving the way to many critical tasks such as digital pharmacovigilance (Salathé, 2016) and disease monitoring (St Louis and Zorlu, 2012). Even on the microscopic level of individual posts, such a tool offers users suggestions for drug reactions and improves the quality of user-generated content.
We address this need in our work. We build a neural architecture that models each post's textual content and its author's experience to predict expected side effects during treatments. Crucially, our supervised neural approach jointly learns posts' content and users' experience level within a thread. A key observation we make is that users can be grouped into clusters that share the same expertise or interest in certain drugs, possibly due to their common treatment or medical history. We leverage this expertise by embedding it into a low-dimensional vector learned by the model, and subsequently predict side effects that are unmentioned in the discussion. We believe that our model represents trustworthiness more robustly than representations such as a single weight (Li et al., 2016) and traditional drug side effect extraction (Aramaki et al., 2010). Furthermore, inspired by Halder et al. (2018), we train a cluster-sensitive attention mechanism that allows our model to emphasize varied parts of the post.
We also follow the general definition of truth discovery and let the model learn a credibility score that is unique to every user and reflects her trustworthiness. Our experimental results show that integrating the above components outperforms baseline text classification models.
The contributions of our work are summarized as follows:
• We propose a neural network architecture that captures user expertise, user credibility, and the semantic content of both individual posts and the overall thread.
• We formulate side effect prediction during treatment as a supervised multi-label classification task and apply our proposed method to it.
• We record and analyze the performance of our proposed model through a set of progressively designed experiments. Additionally, we compare the obtained results with traditional text encoding algorithms.

Related Work
Our approach learns the representation of posts, threads and users, and then integrates them to apply to the task of drug side effect prediction during treatment. We thus review works on the representation of fundamental objects in online communities, and the discovery of drug side effects.

Modeling Objects in Online Communities
Post content modeling. In statement credibility prediction, linguistic features of a post are strong indicators of reliability. Stylistic features (i.e., the number of strong/weak modals, conditionals, or negations) and affective features (i.e., words that depict an author's attitude and emotion) are adopted to represent a post's content (Mukherjee et al., 2014). Such feature engineering requires a great amount of correlation analysis when applied to a novel problem or dataset. Linguistic features also often fail to fully capture document content, as most do not account for distinctive words in exchange for scalability. Their counterparts, bag-of-words and per-vocabulary features, loosely capture textual content but disregard semantics and suffer from scalability and sparsity issues. To address this, state-of-the-art architectures use more expressive models to capture subtle dependencies and rely on word embeddings to address scalability issues, achieving robust results in text classification (Kim, 2014) and neural machine translation (Luong et al., 2016), among others. Inspired by the success of these approaches, we adopt the recurrent neural network (RNN) architecture for post content modeling. Coupled with an attention mechanism, our approach adaptively weights the importance of the parts of each post (Luong et al., 2015).
Thread content modeling. Most research on thread-level modeling obtains a thread content representation by aggregating the content of its posts (Yang et al., 2014). However, we hypothesize that each post contributes differently to thread content and should be weighted to reflect specific factors, such as its author's level of credibility.

Table 2: A sample thread, including its list of post-user pairs, mentioned drugs, and side effects.
User 3690: While my experience of 10 years is with Paxil, I expect that Zoloft will be the same. You should definitely feel better within 2 weeks. One way I found to make it easier to sleep was to get lots of exercize. Walk or run or whatever to burn off that anxiety.
User 26521: I've heard of people going "cold turkey" and having withdrawal at 6 months! Please, get in contact with a doctor ASAP! "common symptoms include dizziness, electric shock-like sensations, sweating, nausea, insomnia, tremor, confusion, nightmares and vertigo"
Drugs: Zoloft
Side effects: changed behavior, decreased sexual desire, diarrhea, dry mouth, heartburn, sleepiness or unusual drowsiness, ...
User modeling. Statement credibility prediction often represents users by a single scalar that indicates their trustworthiness. The intuition is that users who frequently provide trustworthy information will be assigned high reliability scores (Li et al., 2017). Such a representation is effective yet insufficient. Recent work has shown that encoding users into high-dimensional embeddings can improve system performance (Yu et al., 2016), an approach we adopt in our model.

Side Effect Discovery
Most drug reaction discovery methods focus on extracting mentioned side effects. A common technique is to apply Named Entity Recognition (NER) and Relation Extraction (RE) systems in a supervised manner. Sampathkumar et al. (2014) demonstrate the effectiveness of this technique in detecting drugs and side effects that appear in a target document (in-context), and in predicting whether they are related.
However, in side effect prediction during treatment, our model is required to cover potentially encountered reactions, many of which are not explicitly mentioned in the given post (out-of-context). Hence, we do not identify our task with the traditional task of adverse drug side effect extraction (Leaman et al., 2010). Our approach overcomes the limitations of existing works by modeling user experience and credibility during post and thread encoding, and subsequently predicting both in- and out-of-context side effects.

Preliminaries
Basic Terminologies. To ensure a consistent representation, let us first define some terminology:
• A drug $d$ has a set of side effects, $S_d = \{s_1, s_2, \dots, s_k\}$.
• A post $p$ is the most basic document, containing a sequence of sentences. It is written by a user $u$ and belongs to a thread $t$.
• A user $u$ is a member of an online community. She participates in certain threads, i.e., $T_u = \{t_1, t_2, \dots, t_l\}$, by writing at least one post in each thread. We use the terms user and author, as well as user experience and user expertise, interchangeably.
• A thread $t$ (see Table 2) is an ordered collection of post-user pairs, $Q_t = \{(p_1, u_1), (p_2, u_2), \dots\}$. Every thread discusses the treatment of a particular condition and entails a list of drugs $d_1, d_2, \dots, d_m$. Hence, every thread has a list of aggregated side effects defined as $S_t = S_{d_1} \cup S_{d_2} \cup \dots \cup S_{d_m}$, which is also the list of potential side effects experienced during the treatment.
Task Definition. Drug side effect prediction during treatment is the task of assigning the most relevant subset of side effects to threads discussing a certain treatment, from a large collection of potential side effects. We view the drug side effect prediction problem as a multi-label classification task. In our setting, an item-label instance is a tuple $(x_t, y)$, where $x_t$ is the feature vector of thread $t$ derived from its list of post-user pairs $Q_t$ and $y$ is the side effect label vector, i.e., $y \in \{0, 1\}^{S}$, where $S$ is the number of possible side effect labels. Given training instances, we train our classifier to predict the list of treatment side effects in unseen threads.
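To make the label construction concrete, the following minimal Python sketch builds the aggregated side effect set of a thread and the corresponding multi-hot label vector. The drug names, the side effect table, and the helper names (`thread_side_effects`, `label_vector`) are hypothetical placeholders chosen for illustration; they are not taken from the dataset.

```python
# Hypothetical drug-to-side-effect table (placeholder names, illustration only).
SIDE_EFFECTS = {
    "drug_a": {"dry mouth", "diarrhea", "changed behavior"},
    "drug_b": {"dry mouth", "sleepiness or unusual drowsiness"},
}

def thread_side_effects(drugs, table):
    """Union of the side effect sets S_d of all drugs entailed by a thread (S_t)."""
    effects = set()
    for drug in drugs:
        effects |= table.get(drug, set())  # unknown drugs contribute nothing
    return effects

def label_vector(effects, label_index):
    """Multi-hot vector y; label_index maps each label to its fixed position."""
    y = [0] * len(label_index)
    for effect in effects:
        y[label_index[effect]] = 1
    return y

# Fix a global label order so every thread uses the same vector layout.
ALL_LABELS = sorted(set().union(*SIDE_EFFECTS.values()))
LABEL_INDEX = {label: i for i, label in enumerate(ALL_LABELS)}

y_single = label_vector(thread_side_effects(["drug_a"], SIDE_EFFECTS), LABEL_INDEX)
```

A thread entailing several drugs simply receives the union of their labels, which mirrors the definition of the aggregated side effects given above.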
Formal Hypothesis. Given a thread $t$ with $Q_t$, we hypothesize that considering the credibility and experience of each user $u$ with $(p, u) \in Q_t$ improves the quality of the feature representation of thread $t$, resulting in better treatment side effect prediction.

Proposed Method
Figure 1 shows the detailed network architecture of our model. It has several components, which we detail sequentially. Ablations of certain components will serve as baseline systems for comparative evaluation later.
User Expertise Representation (UE): We embed each user $u \in U$ as a vector $v_u$ that captures user $u$'s experience with certain drugs. As each user $u$ participates in the threads $T_u$, entailing a list of experienced drugs, we derive the user drug experience vector $v^*_u \in \mathbb{R}^{|D|}$, where $D$ is the set of all possible drugs and $v^*_{u,i} = n_{u,i}$ is the number of threads in which user $u$ has mentioned the $i$-th drug. We obtain a user drug experience matrix $M^* \in \mathbb{R}^{|U| \times |D|}$ whose $j$-th row is the user drug experience vector of the $j$-th user $u_j \in U$. Since the average number of drugs experienced per user is far smaller than the total number of drugs (see Table 3), $M^*$ suffers from data sparsity and limited scalability. Without dimensionality reduction, the model learns at least $|D|$ parameters for every user, amounting to $|D| \times |U|$ when aggregated over all users. Data sparsity leads to a large number of insufficiently tuned parameters, which significantly increases training time and storage and reduces the system's robustness.
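As a rough illustration of how the user drug experience matrix could be assembled, the sketch below counts, for each user, the threads in which each drug occurs. It assumes thread-level drug annotations as a proxy for what a user has mentioned; the function name `experience_matrix` and the toy identifiers are hypothetical.

```python
def experience_matrix(user_threads, thread_drugs, users, drugs):
    """Build M*: entry (j, i) counts the threads of user j that entail drug i.

    Assumes every drug in thread_drugs also appears in the drugs list.
    """
    drug_index = {d: i for i, d in enumerate(drugs)}
    matrix = []
    for user in users:
        row = [0] * len(drugs)
        for thread in user_threads.get(user, ()):
            # set() ensures a drug is counted once per thread
            for drug in set(thread_drugs.get(thread, ())):
                row[drug_index[drug]] += 1
        matrix.append(row)
    return matrix

# Toy data with placeholder identifiers.
users = ["u1", "u2"]
drugs = ["drug_a", "drug_b"]
user_threads = {"u1": ["t1", "t2"], "u2": ["t2"]}
thread_drugs = {"t1": ["drug_b"], "t2": ["drug_a", "drug_b"]}

M_star = experience_matrix(user_threads, thread_drugs, users, drugs)
```

With a realistic drug list most entries of each row are zero, which is the sparsity problem that motivates the dimensionality reduction.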
We apply Principal Component Analysis (PCA) (Jolliffe, 1986) to the $M^*$ obtained from the training set. Figure 2 plots the percentage of variance explained against the number of included principal components (PCs), which we use to determine the number of PCs $g$. Since our PCA plots show no additional explained variance beyond 50 components, we use $g = 50$ components, reducing our original $M^* \in \mathbb{R}^{|U| \times |D|}$ to the user expertise matrix $M \in \mathbb{R}^{|U| \times g}$.
User Clustering: To model per-user expertise, in a naïve setting, we would train $\approx |U| \times g$ parameters. Given limited data, this is infeasible, as it faces sparsity issues. We make a second, key assumption that our set of users $U$ can be grouped into a set of meaningful clusters $C$ of size $k$, where $k \ll |U|$. Users within a cluster have experience with similar drugs and can hence be represented by a single vector, reducing the number of learned parameters to $k \times g$.
We apply the K-means clustering algorithm (MacQueen, 1967) to cluster the users into $k$ groups.
To determine the number of clusters $k$, we analyze the total distance to the nearest centroid versus the number of potential clusters in set $C$, as in Figure 3. Here $D(C)$ denotes the total distance from users to their nearest centroids in $C$, reported relative to argmax $D(C)$, the maximum total distance, which is obtained when $|C| = 1$.
Since clustering does not significantly reduce the total distance beyond 100 clusters, we sort users into $k = 100$ clusters. For each user, we consider the centroid vector of her assigned cluster to be her expertise vector.
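The cluster-count analysis and the centroid lookup can be sketched as follows. This is a simplified illustration, not the scikit-learn pipeline used in the experiments: it assumes Euclidean distance and takes the mean of all user vectors as the single-cluster centroid, and the function names are hypothetical.

```python
import math

def nearest_centroid(vector, centroids):
    """Return (index, distance) of the centroid closest to vector (Euclidean)."""
    distances = [math.dist(vector, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances[best]

def total_distance(vectors, centroids):
    """Sum over users of the distance to their nearest centroid."""
    return sum(nearest_centroid(v, centroids)[1] for v in vectors)

def mean_vector(vectors):
    """Component-wise mean, used here as the centroid when |C| = 1."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def relative_distance(vectors, centroids):
    """Total distance divided by the single-cluster total distance."""
    baseline = total_distance(vectors, [mean_vector(vectors)])
    return total_distance(vectors, centroids) / baseline

# Toy user expertise vectors forming two obvious groups.
user_vectors = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centroids = [[0.0, 1.0], [10.0, 1.0]]

ratio = relative_distance(user_vectors, centroids)
cluster_id, _ = nearest_centroid([9.0, 0.0], centroids)
expertise_vector = centroids[cluster_id]  # a user inherits her cluster's centroid
```

Plotting `relative_distance` for increasing numbers of centroids gives the kind of curve from which a cut-off such as 100 clusters can be read.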
Post Content Encoding: The network takes the content of a thread $t$ as input, which is a list of post-user pairs $Q_t$. Post $p_i$ of pair $(p_i, u_i) \in Q_t$ consists of a sequence of words $(w_1, \dots, w_n)$. We seek to represent a post $p_i$ as a vector $v_p$ that effectively captures its semantics. We embed each word into a low-dimensional vector and transform the post into a sequence of word vectors $\{v_{w_1}, v_{w_2}, \dots, v_{w_n}\}$. Each word vector is initialized using Google's pre-trained word2vec (Mikolov et al., 2013). Additionally, while each out-of-vocabulary word vector is initialized randomly, we keep it tunable during training to capture domain-specific meanings. Such model adaptation is necessary, as the model needs to learn the embeddings for the drug names, most of which are not included in the pre-trained embeddings but are critical for predicting the side effects.
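The embedding initialization described above can be sketched in a few lines. The sketch only covers initialization (fine-tuning happens inside the training framework); the initialization range of ±0.25, the function names, and the toy vectors are assumptions made for illustration.

```python
import random

def build_embeddings(vocabulary, pretrained, dim, seed=0):
    """Return (table, oov): pre-trained vectors where available, random otherwise."""
    rng = random.Random(seed)
    table, oov = {}, []
    for word in vocabulary:
        if word in pretrained:
            table[word] = list(pretrained[word])
        else:
            # Out-of-vocabulary words (e.g. many drug names) start from random values.
            table[word] = [rng.uniform(-0.25, 0.25) for _ in range(dim)]
            oov.append(word)
    return table, oov

def embed_post(words, table):
    """Map a tokenized post to its sequence of word vectors."""
    return [table[w] for w in words]

# Toy 2-dimensional "pre-trained" vectors (placeholders, not real word2vec values).
pretrained = {"sleep": [0.1, 0.2], "anxiety": [0.3, -0.1]}
table, oov_words = build_embeddings(["sleep", "anxiety", "zoloft"], pretrained, dim=2)
sequence = embed_post(["zoloft", "sleep"], table)
```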
We employ Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) to encode the textual content. A bi-directional LSTM encodes the word vector sequence, outputting two sequences of hidden states: a forward sequence, $H^f = h^f_1, h^f_2, \dots, h^f_n$, that starts from the beginning of the text; and a backward sequence, $H^b = h^b_1, h^b_2, \dots, h^b_n$, that starts from the end of the text. For many sequence encoding tasks, knowing both past (left) and future (right) contexts has proven to be beneficial (Dyer et al., 2015). The states $h^f_i$ and $h^b_j$ in the forward and backward sequences are computed as follows:
$$h^f_i = \mathrm{LSTM}(v_{w_i}, h^f_{i-1}), \qquad h^b_j = \mathrm{LSTM}(v_{w_j}, h^b_{j+1}) \tag{1}$$
where $h^f_i, h^b_j \in \mathbb{R}^e$, and $e$ is the number of encoder units.
Cluster-sensitive Attention (CA): Inspired by Halder et al. (2018), we initialize an attention vector $v^a_i \in \mathbb{R}^e$ for each cluster $c_i$. Given a forward sequence $H^f = h^f_1, h^f_2, \dots, h^f_n$ and a backward sequence $H^b = h^b_1, h^b_2, \dots, h^b_n$ of hidden states of a post $p$ written by user $u$ belonging to cluster $c_i$, a weight $w^a_j$ is computed for the hidden states $h^f_j$ and $h^b_j$ of both sequences based on their similarity with the attention vector (Equation 2). The intuition behind Equation (2), inspired by Luong et al. (2015), is that hidden states which are similar to the attention vector $v^a_i$ should receive more attention, and hence are weighted higher during document encoding. $v^a_i$ is adjusted during training to capture hidden states that are significant in forming the final post representation. $w^a_j$ is then used to compute the forward and backward weighted feature vectors (Equation 3). We concatenate the forward and backward vectors to obtain a single vector, following previous bi-directional RNN practice (Ma and Hovy, 2016).
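One plausible reading of the cluster-sensitive attention step is sketched below. The exact similarity function is not reproduced here, so the sketch assumes a dot-product score normalized with a softmax (in the spirit of Luong-style attention), computes weights separately for each direction, and uses hypothetical function names throughout.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_weights(states, attention_vector):
    """Softmax over dot-product similarities between states and the attention vector."""
    scores = [dot(h, attention_vector) for h in states]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(states, weights):
    """Attention-weighted sum of hidden states."""
    return [sum(w * h[k] for w, h in zip(weights, states))
            for k in range(len(states[0]))]

def encode_post(forward_states, backward_states, attention_vector):
    """Attend over each direction, then concatenate the two summary vectors."""
    v_f = weighted_sum(forward_states,
                       attention_weights(forward_states, attention_vector))
    v_b = weighted_sum(backward_states,
                       attention_weights(backward_states, attention_vector))
    return v_f + v_b  # list concatenation, length 2e

# Toy hidden states with e = 2 and the attention vector of the author's cluster.
H_f = [[1.0, 0.0], [0.0, 1.0]]
H_b = [[0.5, 0.5], [1.0, 0.0]]
v_a = [2.0, 0.0]

weights = attention_weights(H_f, v_a)
post_vector = encode_post(H_f, H_b, v_a)
```

Because the attention vector is selected by the author's cluster, two users from different clusters can emphasize different words of the same text.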
Thread Content Encoding with Credibility Weights (CW): For every post-user pair $(p_i, u_i)$ of thread $t$, we first compute the feature vector $v_{p_i}$ of post $p_i$. It is then concatenated with user $u_i$'s expertise vector $v_{u_i}$ to form the post-user complex vector $v^{n}_{p_i}$. This post-user complex is weighted by a user credibility score $w_{u_i}$, which is initially randomized and updated while training, to obtain the final post-user pair representation $v^{n*}_{p_i}$. This follows the general intuition from the truth discovery literature that users providing high-quality answers should be assigned higher credibility scores, and that answers from credible users are more significant. Thus, the thread content representation can be defined as the weighted sum of the post-user complex vectors:
$$v_t = \sum_{(p_i, u_i) \in Q_t} v^{n*}_{p_i} = \sum_{(p_i, u_i) \in Q_t} w_{u_i} \, v^{n}_{p_i} \tag{4}$$
Multi-label Prediction: We feed the thread content representation $v_t$ through a fully connected layer whose output is computed as:
$$s_t = W v_t + b \tag{5}$$
where $W$ and $b$ are the weights and biases of the layer. The output vector $s_t \in \mathbb{R}^{|S|}$ is finally passed through a sigmoid activation function $\sigma$, and the model is trained using the cross-entropy loss $L$:
$$L = -\sum_t \big( y_t \cdot \log(\sigma(s_t)) + (1 - y_t) \cdot \log(1 - \sigma(s_t)) \big) + \lambda \sum_u \lVert v_u \rVert_2 \tag{6}$$
We adopt regularization that penalizes the training loss with the L2 norm of the user experience matrix, scaled by a factor of $\lambda = 0.0065$ obtained via hyperparameter tuning. The loss function is differentiable and thus trainable with Adam (Kingma and Ba, 2015). During gradient-based learning, the credibility score $w_{u_i}$ of user $u_i$ is updated using $\partial L / \partial w_{u_i}$, computed by back-propagation.
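The forward pass of the thread encoder and the prediction layer can be illustrated with a small pure-Python sketch. It covers only a single thread and the cross-entropy part of the loss (the regularization term is omitted), and all names and toy values are assumptions made for illustration rather than the trained model.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def thread_vector(post_vectors, expertise_vectors, credibility):
    """Credibility-weighted sum of concat(post vector, author expertise vector)."""
    dim = len(post_vectors[0]) + len(expertise_vectors[0])
    v_t = [0.0] * dim
    for v_p, v_u, w_u in zip(post_vectors, expertise_vectors, credibility):
        complex_vector = v_p + v_u  # list concatenation
        for k, value in enumerate(complex_vector):
            v_t[k] += w_u * value
    return v_t

def predict(v_t, W, b):
    """Sigmoid of the fully connected layer: one probability per side effect."""
    return [sigmoid(sum(w * x for w, x in zip(row, v_t)) + bias)
            for row, bias in zip(W, b)]

def cross_entropy(y, probs, eps=1e-12):
    """Multi-label cross-entropy for one thread (eps guards against log(0))."""
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for t, p in zip(y, probs))

# Two posts, 2-dimensional post vectors, 1-dimensional expertise vectors.
posts = [[1.0, 0.0], [0.0, 1.0]]
expertise = [[1.0], [0.0]]
credibility = [0.5, 2.0]  # the second author is trusted more

v_t = thread_vector(posts, expertise, credibility)
probs = predict(v_t, W=[[1.0, 0.0, 0.0], [0.0, -1.0, 0.0]], b=[0.0, 0.0])
loss = cross_entropy([1, 0], probs)
```

Raising a user's credibility score scales her contribution to the thread vector, which is why the score can be learned by gradient descent together with the other parameters.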

Experiments
We conduct experiments to validate the effectiveness of our proposed model. Specifically, we (1) compare our architecture with text encoding baselines, (2) highlight performance improvements incrementally, and (3) evaluate and analyze the obtained results at both the macroscopic and microscopic levels.

Baselines
As a competitive baseline from prior work, CNN-KIM (Kim, 2014) constructs a document matrix that incorporates word embeddings, then applies a convolution filter to obtain feature maps. These feature maps are passed through a max-pooling filter to construct a document representation. During prediction, the representation is fed through a fully connected layer. We replace the final softmax layer of the author's model with sigmoid to make it work in a multi-label prediction setting.
The following baselines are used to perform an ablation study of our model.
• RNN: We implement a bi-directional LSTM baseline, which is equivalent to our proposed method without CA, UE and CW.
• Weighted Post Encoder (WPE): We construct the thread representation by summing its post-user complex vectors, each weighted by user credibility. This is equivalent to our proposed method without CA and UE.
• Weighted Post Encoder with User Expertise (WPEU): We concatenate user expertise with the post vector to create the post-user complex vector. This is equivalent to our proposed method without CA.

Dataset
We conduct our experiments on the same dataset as Mukherjee et al. (2014), which includes 15,000 users and 2.8 million posts extracted from 620,510 HealthBoards 1 threads. Ground-truth side effects possibly experienced during treatment are defined as the side effects of the drugs mentioned in the discussion. As annotating such a large number of posts is expensive, drug side effects are extracted from Mayo Clinic's Drugs and Supplements portal and used as surrogates for the potential reactions of treatments.

Experimental Settings
We apply standard natural language preprocessing, namely Snowball stemming (Porter, 1980) and stop-word elimination, before representation modeling. From the original dataset, we only extract threads that are annotated with drugs and their side effects, along with the lists of contained posts and corresponding users. Table 3 shows the dataset statistics. We divide our data into 10 folds to perform cross-validation (8, 1, and 1 folds for training, validation, and testing, respectively). We perform PCA and K-means clustering on the training set, using scikit-learn's built-in modules (Pedregosa et al., 2011), with $g = 50$ principal components and $k = 100$ clusters. For CNN-KIM, we experiment with filters of window sizes varying from 2 to 5, and set the number of feature maps for each filter to 256 and dropout to 0.5. For our proposed model and the baseline models using the RNN architecture, we set the number of units in the LSTM cell to 128 when performing post content encoding. Dropout rates of 0.2 and 0.5 are used in our LSTM cells and FC layers, respectively. Cluster attention vectors and user credibility values are initialized with values ranging from -1.0 to 1.0. For each user $u$, we initialize her expertise vector with the value of $v_u$ obtained in the Proposed Method section and allow training to fine-tune it. All models are trained using the TensorFlow library.
Table 4: Experimental results with both actual (Experiment 1) and strict (Experiment 2) settings. In the Component columns, "CW", "UE", and "CA" denote the "Credibility Weights", "User Expertise", and "Cluster Attention" module components, respectively.
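The 10-fold protocol with 8 training folds, 1 validation fold, and 1 test fold can be sketched as below. How the folds are rotated between rounds and how threads are assigned to folds are not specified in the settings, so both the rotation scheme and the round-robin assignment are illustrative assumptions, as are the function names.

```python
def rotate_folds(n_folds=10):
    """Yield (train, validation, test) fold indices, 8/1/1 in each round."""
    for r in range(n_folds):
        test = r
        validation = (r + 1) % n_folds  # assumed: the next fold validates
        train = [f for f in range(n_folds) if f not in (test, validation)]
        yield train, validation, test

def assign_folds(thread_ids, n_folds=10):
    """Assign threads to folds round-robin (an illustrative, deterministic choice)."""
    return {tid: i % n_folds for i, tid in enumerate(thread_ids)}

rounds = list(rotate_folds())
fold_of = assign_folds([f"thread_{i}" for i in range(25)])
```

In each round, PCA and K-means would be fitted on the threads of the training folds only, so that no information from validation or test threads leaks into the user expertise vectors.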
We conducted two separate experiments:
• Experiment 1: We keep the text as-is. Any mentioned drugs are retained inside the thread.
• Experiment 2: We remove all mentions of any drug in our drug list. This is a more aggressive experiment, which asks the model to predict the treatment's side effects without any mention of the experienced drugs.

Results and Evaluation
Table 4 shows the precision, recall, and $F_1$ obtained by our method and the four baselines.
Macroscopic Analysis: Firstly, all three models that apply credibility weighting (CW), namely WPE, WPEU, and our model, outperform both the RNN and CNN baselines in both experiments. Specifically, weighting each post by its author's credibility improves the performance of the naive post encoder by 6.32%, 2.15%, and 3.86% on precision, recall, and $F_1$, respectively, for Experiment 1. Results for Experiment 2 are similar. This demonstrates the effectiveness of accounting for author credibility when encoding thread content, which improves side effect prediction.
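For reference, precision, recall, and $F_1$ for a multi-label prediction can be computed per thread as in the following sketch. The averaging scheme used to aggregate scores over threads is not stated here, so the per-thread, set-based computation and the function name are assumptions; the side effect names in the toy example are placeholders.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 for one thread's predicted side effects."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(
    predicted={"nausea", "fever", "headache"},
    gold={"nausea", "fever", "dry mouth", "diarrhea"},
)
```

A model that predicts more side effects tends to raise recall at some cost in precision, which is the trade-off that $F_1$ summarizes.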
Improvements from incorporating user experience (UE) are less pronounced. In Experiment 1, adding UE (WPEU vs. WPE) improves recall by 2.65% and $F_1$ by 0.8%. Again, the stricter Experiment 2 shows similar performance trends. On a macro scale, these statistics indicate that our model successfully learns to include more side effects in its prediction, many of which are relevant to the ground truth. This is consistent with our hypothesis that considering the experience of each post's author is effective in predicting out-of-context side effects.
Applying cluster-sensitive attention (CA) when combining the RNN's hidden states also improves performance. In Experiment 1, we observe that adding CA (our model vs. WPEU) also improves recall and $F_1$; Experiment 2 again demonstrates similar but slightly more pronounced performance changes. These indicate that the attention mechanism is more effective when the drugs are present since the drug names in our documents are the phrases that receive greater emphasis.
As the settings in Experiment 1 start with more information than those in Experiment 2, the task is easier and thus performance is improved (12.7% to 14.15% in $F_1$). The margin for improvement in Experiment 2 is larger, which explains why absolute score improvements are larger in Experiment 2. When measuring relative improvement, the gains are comparable.
Generally, according to the macroscopic analysis of the results in Table 4, we conclude that all three components of our proposed architecture, namely CW, UE, and CA, have a positive impact on the overall performance of the model. The consistent improvements in $F_1$ after adding each component agree with our stated hypotheses in both experimental settings.
Microscopic Analysis: We also analyze our model's performance at the per-sample level to check whether it is consistent with the macroscopic results. We aim to confirm three hypotheses: (1) Considering author expertise improves prediction of out-of-context side effects. (2) Considering author credibility improves the extraction of both in- and out-of-context side effects from trustworthy users' content. (3) Placing attention on different parts of the document enhances the performance of in-context side effect extraction. Tables 5 and 6 show a sample testing thread, its users' commonly experienced drugs, and its side effects.
We observe that CNN-KIM and the simple, RNN-based post encoding can capture side effects that are mentioned both directly (e.g.,"skin rash") as well as indirectly (e.g., "diarrhea"), but fail to capture the remaining symptoms, many of which are out-of-context.
Considering User 1537's credibility shows performance improvements. In her posts, User 1537 indirectly refers to "headache" by mentioning "bug crawling under my skalp sensations". The higher credibility score computed for User 1537 gives her experience with "sleepiness" more weight in the WPEU (CW + UE) baseline prediction, which is correct. These observations are consistent with our hypothesis about user credibility.
User experience is effective in predicting out-of-context symptoms. In the illustrated sample, all four users have experience with similar drugs that share common side effects such as "unusual tiredness and weakness", "nausea", and "fever". As "bad breath" is also a shared side effect, it is understandable that the model outputs "bad breath". It is intuitive for the model to pick up such commonalities among users and compute relevant results. These observations are consistent with our hypothesis on user experience.
Finally, the model with CA can learn to attend to different parts of the documents. In particular, for User 16248's posts that mention digestive problems, hidden states that encode phrases such as "increasingly sensitive to more foods" and "damage to your intestines" receive higher attention, resulting in the prediction of "heartburn", "belching", "indigestion", "acid stomach", and "difficult bowel movement". This functionality is consistent with our original purpose and expectation for adding attention to the post encoder architecture.

Conclusion
We have addressed the importance of user experience and credibility in modeling the thread contents of online communities, specifically through the task of drug side effect prediction during treatment. We suggest a subset of side effects relevant to the treatment mentioned in a given discussion, taking into account each post's content and its author's expertise in certain treatments. Mainstream models for online communities fail to fully capture the semantics of post content and users' experience with previous drugs.
We model users' expertise by examining their experience with different drugs, then group users with similar experience into clusters that share a common experience vector representation. Experimental results show that our proposed thread content encoder outperforms state-of-the-art document encoders, and that our neural components play a significant role in improving task performance.
We believe that our model is adaptable to other domains. In the future, we aim to use it for downstream applications in online health communities, such as credibility analysis and thread recommendation.