Fine-Grained Emotion Detection in Health-Related Online Posts

Detecting fine-grained emotions in online health communities provides insightful information about patients’ emotional states. However, current computational approaches to emotion detection from health-related posts focus only on identifying messages that contain emotions, with no emphasis on the emotion type, using a set of handcrafted features. In this paper, we take a step further and propose to detect fine-grained emotion types from health-related posts and show how high-level and abstract features derived from deep neural networks combined with lexicon-based features can be employed to detect emotions.


Introduction
Emotions are an essential part of our lives and reflect feelings such as joy, sadness, and fear, which affect our overall wellbeing. Emotion detection from text using computational models has been extensively studied on data such as news headlines, social media, blog posts, and song lyrics (Katz et al., 2007; Abdul-Mageed and Ungar, 2017; Mohammad and Turney, 2013; Strapparava and Mihalcea, 2007). Recently, emotion detection has started to emerge in online health communities (OHCs) (Biyani et al., 2014; Wang et al., 2012). OHCs provide a user-friendly environment for patients, their families, and friends to share thoughts and socialize with each other on topics such as therapeutic processes, side effects, and mental and emotional health. Emotion detection in OHCs differs substantially from emotion detection in general text because of the health-related vocabulary that people use in OHCs. For example, the phrase "hot flashes" may not have a specific meaning in the general domain, but it bears a very negative feeling in patients' posts. Similarly, the post "Just received notice from my doctor, I have a positive Cologuard result" is associated with a very sad emotion in the health domain, since a positive test means that the disease is present.
Table 1: "My doctor's office is very clean, who cares when he has prescribed me a wrong medication for six months!"
Despite the emergence of emotion detection in OHCs, most recent work has been devoted to high-level emotion analysis, i.e., identifying messages that contain emotions, with no emphasis on the unique challenges associated with fine-grained emotion detection. To correctly detect the types of emotions present in messages posted in OHCs, a deep understanding of the text and of the writer's intention is required. Table 1 shows an example of a message that contains a sad emotion, which is hidden in the text. We ran several sentiment tools on this message, including Stanford CoreNLP (Socher et al., 2013), and interestingly, all showed a positive sentiment, while the emotion expressed is clearly one of sadness (mixed with sarcasm). Thus, even an approach that predicts the negative sentiment of the message would not suffice.
In this paper, we analyze messages in OHCs to understand the most prominent emotions in health-related posts and propose a computational model that is able to exploit the semantic information from text and coherently combines high-level (abstract) features with surface and lexicon-based features to automatically detect fine-grained emotions. Our contributions are as follows:
1. We study fine-grained emotions and their distribution in messages posted in OHCs by constructing and analyzing two health-related datasets for fine-grained emotion detection. Identifying emotions in patients' messages augments the capability of OHCs' moderators, caregivers, and doctors to provide high-quality services to OHCs' users and patients. To our knowledge, we are the first to address fine-grained emotion detection in OHCs.
2. We propose a computational model, called ConvLexLSTM, for emotion detection in OHCs. Our model combines the output of a Convolutional Neural Network (CNN) with lexicon-based features, which are all fed into a Long Short-Term Memory (LSTM) network that produces the final output via softmax. We show empirically that ConvLexLSTM significantly outperforms strong baselines and prior works. Moreover, we show that the proposed model continues to perform well even in the absence of lexicon features.
3. Finally, we apply ConvLexLSTM in a large-scale experiment to study the correlation between US holidays and users' emotional states, which can help design smarter approaches to improve patients' moods.

Related Work
Emotion detection has long been studied in computational linguistics (Mohammad and Turney, 2013; Strapparava et al., 2004). Ekman's basic emotion set, which includes six emotions (anger, disgust, fear, happiness, sadness, and surprise), is arguably the best-known emotion categorization (Ekman, 1992). Strapparava and Mihalcea (2008) proposed knowledge-based and corpus-based methods for classifying emotions based on Ekman's basic emotions. Co-occurrence of general words with emotional words has been used by Katz et al. (2007) for identifying emotion types latent in news headlines. Keyword-based approaches, which rely on finding emotional words in text, cannot classify text that lacks specific keywords. Therefore, Bao et al. (2009) proposed to use topical relations between words and emotion types for emotion classification in online news. Strapparava et al. (2012) studied emotions in song lyrics. Emotion detection has been studied in social media as well, which brings additional challenges due to its informal context, in which people do not follow grammatical rules and use many characters that do not occur in formal text (e.g., #, :)). Emotion lexicons derived from social media, e.g., based on emotion word hashtags, have been shown to improve models' performance for emotion detection (Mohammad, 2012; Sykora et al., 2013). Abdul-Mageed and Ungar (2017) used distant supervision to construct a large dataset from general Twitter data for fine-grained emotion detection and explored deep learning models to detect emotions. Liew and Turtle (2016) and  also created a dataset of about 15,500 tweets labeled with 28 emotion types, using Amazon Mechanical Turk. Some studies combined knowledge-based and keyword-based approaches with linguistic features and used a machine learning algorithm to classify sentences with no emotional keywords reasonably well (Yang et al., 2012; Neviarouskaya et al., 2010).
Emotional support is considered one of the main advantages of using OHCs, bringing better feelings (Khanpour et al., 2017; Zhang et al., 2017a,b; Qiu et al., 2011) and lower mortality odds to patients (Kroenke et al., 2013). Interestingly, to date, only a few studies have analyzed emotions in OHCs using computational models (Wang et al., 2012; Biyani et al., 2014; Wang et al., 2014b,a). For example, Wang et al. (2012) used linear regression to identify emotional support in a cancer forum. Their model predicts to what extent each sentence contains either emotional or non-emotional support. Their features include LIWC features, POS tags, message length, subjectivity intensity, and LDA topical features. Biyani et al. (2014) identified messages that contain emotions in a breast cancer forum using unigrams, POS tags, structural patterns, and five lexicons that contain strong and weak subjective words, cancer drugs, side effects, and cancer procedures, and showed that lexicon features have a high impact on the results. Khanpour and Caragea (2018) used deep learning to extract therapeutic processes and side effects from patients' posts. Wang et al. (2014b) classified OHCs' messages based on the intention of the participant when writing messages (e.g., seeking or providing information) and used a combination of features from Wang et al. (2012) coupled with the lexicon features used in Biyani et al. (2014).

Data Collection and Annotation
To study the most prominent emotions and their distribution in OHCs and to evaluate our model for fine-grained emotion detection in OHCs, we constructed two benchmark datasets, since, to our knowledge, no labeled dataset is available for this task. The first dataset, denoted as B-DS, is created from the data of Biyani et al. (2014), which contains 1066 sentences from the breast cancer discussion board in the Cancer Survivors' Network (CSN) of the American Cancer Society. Note that Biyani et al. (2014) performed sentence-level classification since longer messages often comprise different topics. A sentence-level analysis gives a better estimate of the commenter's purpose in writing a given sentence, i.e., whether or not it expresses his or her emotions. For the second dataset, we randomly selected 225 comments from 21 discussion threads in the lung cancer discussion board of CSN. We denote this second dataset as L-DS. Following Biyani et al. (2014), we extracted all sentences from the comments and chose sentences with a length greater than four words. We ended up with 1041 sentences in L-DS. In total, we have 2107 sentences labeled with emotion types.
For our annotation task, we followed the six emotions suggested by Ekman (1992). Our annotators were allowed to attribute one or more emotions to a single sentence, e.g., a sentence could be annotated as bearing sadness or a combination of sadness and fear. The annotation task was conducted iteratively following prior studies and guidelines, using three training rounds (D'Mello, 2016; Fort et al., 2016; Shanahan et al., 2006). In each round, 300 sentences drawn from both B-DS and L-DS were assigned to the annotators (three graduate students), and we asked them to meet with the researchers as a group to discuss disagreements and document their discussion before the next 300 instances were assigned. Upon completing the training period, the annotators labeled the remaining sentences from B-DS and L-DS, reaching a Kappa inter-annotator agreement of 83%. For the remaining 17%, the annotators expressed their views on each case in the presence of the researchers, and 100% agreement was finally achieved within 20 days. All these sentences plus the 900 used during the three training rounds became part of the final dataset. Table 2 shows the distribution of emotions in the 2107 sentences. Note that some sentences do not contain any emotion. As can be seen from the table, both datasets have a similar distribution of emotions, with joy and sadness being the most prominent.
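As a reading aid for the agreement figure, the following sketch computes Cohen's kappa for two annotators who each assign a single label per sentence. The text does not state which Kappa variant was used or how multi-label annotations from three annotators were handled, so the two-annotator, single-label setting, the function name, and the toy labels are all assumptions made for illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items that received the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Made-up toy annotations, for illustration only (not from B-DS or L-DS).
ann_1 = ["joy", "sadness", "none", "joy", "fear", "sadness"]
ann_2 = ["joy", "sadness", "none", "sadness", "fear", "sadness"]
kappa = cohen_kappa(ann_1, ann_2)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because a few labels (joy, sadness, no emotion) dominate the data.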

Model
Given a sentence of n words, we apply a CNN to extract high-level (abstract) features that capture the semantics of the text (Lai et al., 2015). We combine these high-level features with surface-level and lexicon-based features. Our proposed model, ConvLexLSTM, is shown graphically in Figure 1. As can be seen from the figure, we use a combination of CNN and LSTM models, where the final feature vectors from the CNN, augmented with lexicon-based features, are fed as input to the LSTM network. The architecture of our proposed classification model is close to the models described in Kim et al. (2016) and Xiao and Cho (2016), which apply a character-level CNN to create high-level features, whereas our model works at the word level and uses lexicon features. We use word-level input to benefit from embedding vectors trained on OHC data. We use the character-level model by Kim et al. (2016), denoted by C-ConvLSTM, as one of our baselines.
Lexicon-based Features: Lexicon-based approaches for detecting emotions in text have been the mainstay of many models (Strapparava et al., 2004; Strapparava and Mihalcea, 2008; Mohammad, 2012; Liu, 2012). In our work, we used the lexicons provided by Biyani et al. (2014): weak subjective words (numWeak), strong subjective words (numStrong) (Stoyanov et al., 2005), cancer drugs (numDrug), side effects (numSide), and therapeutic procedures (numProc). These lexicons help differentiate emotional from non-emotional messages. However, we need more granular information to differentiate between a variety of emotion types. Hence, we also used the lexicons introduced by Strapparava and Mihalcea (2007), denoted as EmoLex1, and by Mohammad and Turney (2013), denoted as EmoLex2. We use frequencies of lexicon words to construct the lexicon-based feature vectors.
Note that prior work (Biyani et al., 2014) showed that LIWC did not generate high quality features for classification in OHCs, and thus, we did not use it in our study.
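The frequency-based lexicon features can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the tiny word lists are hypothetical stand-ins for the real lexicons, the tokenizer is a simple assumption, and multi-word entries (e.g., "hot flashes") would need phrase matching that this sketch omits.

```python
import re

# Hypothetical miniature lexicons; the real word lists are not reproduced here.
LEXICONS = {
    "numStrong": {"terrible", "wonderful"},
    "numWeak": {"okay", "fine"},
    "numDrug": {"tamoxifen"},
    "numSide": {"nausea", "fatigue"},
}

def lexicon_features(sentence, lexicons=LEXICONS):
    """Count, for each lexicon, how many tokens of the sentence it contains."""
    # Assumed tokenization: lowercase alphabetic tokens (apostrophes kept).
    tokens = re.findall(r"[a-z']+", sentence.lower())
    # One count per lexicon, in the insertion order of the dictionary.
    return [sum(tok in words for tok in tokens) for words in lexicons.values()]

features = lexicon_features("The nausea and fatigue from tamoxifen were terrible")
```

Each sentence thus yields one count per lexicon; how this vector is joined with the CNN output before the LSTM is not specified in enough detail in the text to reproduce here.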

Experiments and Results
Next, we describe the evaluation of ConvLexLSTM, using the joy and sadness emotions, which have at least 5% coverage in our labeled data (see Table 2). Also, since binary tasks are considered easier to learn than multi-class tasks (Bishop, 2006), we trained our models in the two-class setting, joy/non-joy and sad/non-sad, by binarizing the datasets. In all experiments, we used word embeddings as input to the neural networks, which were generated with the Word2Vec module in Gensim (Řehůřek and Sojka, 2010) on the data (users' comments) from all discussion boards of the Cancer Survivors' Network (CSN) of the American Cancer Society, between June 2000 and June 2012. We also experimented with word embeddings generated using Wikipedia, but these embeddings resulted in lower performance than those generated using CSN data. We estimated hyper-parameters for each deep neural network via a grid search over combinations of important hyper-parameters (e.g., learning rate, decay rate, dropout, number of layers, filter region size, and number of filters). The grid search was done on a development set formed by holding out 20% of the training instances in each iteration of 10-fold cross-validation. We report precision, recall, and F1-score.
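The binarization step and the reported measures can be illustrated with a short sketch. It assumes that each sentence carries a set of emotion labels and that precision, recall, and F1 are computed for the positive class (e.g., joy); the text does not state the averaging scheme, and the function names and toy data below are hypothetical.

```python
def binarize(label_sets, target):
    """Map multi-label annotations to 1 (target present) or 0 (absent)."""
    return [1 if target in labels else 0 for labels in label_sets]

def precision_recall_f1(gold, pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up annotations and predictions, for illustration only.
annotations = [{"joy"}, {"sadness", "fear"}, set(), {"joy", "surprise"}, {"sadness"}]
gold_joy = binarize(annotations, "joy")      # joy / non-joy task
pred_joy = [1, 1, 0, 0, 0]                   # hypothetical classifier output
p, r, f1 = precision_recall_f1(gold_joy, pred_joy)
```

The sad/non-sad task follows by calling `binarize(annotations, "sadness")` on the same annotations.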

Performance of ConvLexLSTM
First, we evaluate ConvLexLSTM performance in an ablation experiment to determine the role played by each component for emotion detection. Specifically, we compare ConvLexLSTM with ConvLSTM (a model that has the same architecture as ConvLexLSTM, but does not use any external lexicon), CNN, LSTM, and a support vector machine (SVM) with the (concatenated) features from the seven lexicons (described above). Table 3 shows the results of this comparison. As can be seen from the table, ConvLexLSTM achieves the best results consistently throughout all experiments in terms of all compared measures. This ablation experiment confirms our intuition that all components contribute to the final emotion detection. For example, removing the seven lexicon features from ConvLexLSTM, which yields ConvLSTM, results in a drop in F1-score of 5.8% on joy in B-DS and of 3.9% on sadness in B-DS. Still, ConvLSTM is the second-best performing model in terms of F1-score. These results show that our model can be successfully applied in a health domain even in the absence of health lexicons, which are often expensive to obtain. Not surprisingly, the SVM with the seven lexicon-based features (denoted as Seven-Lexicon) performs the worst among the compared models, suggesting that capturing the semantic information from text via deep neural networks improves emotion detection.
Second, we compare ConvLexLSTM with three baselines: C-ConvLSTM (i.e., a character-level CNN-LSTM) (Kim et al., 2016), SWAT (Katz et al., 2007) (i.e., an emotion detection model from SemEval-2007), and EmoSVM (i.e., an SVM with a set of handcrafted features: unigrams, bigrams, POS tags, the word-emotion association lexicon by Mohammad (2012), the WordNet-Affect lexicon by Strapparava et al. (2004), and the output of the Stanford sentiment tool by Socher et al. (2013)). Table 3 shows the results of this comparison as well. As can be seen, ConvLexLSTM outperforms all three baselines on both datasets, including, more importantly, the character-level CNN-LSTM by Kim et al. (2016) (i.e., the C-ConvLSTM model). This result confirms our belief that applying word embedding vectors, which are trained directly on data from OHCs, yields an improvement in performance over character-level models.
It is worth mentioning that all deep neural networks that capture high-level semantic features (ConvLexLSTM, ConvLSTM, CNN, LSTM, and C-ConvLSTM) perform better than the traditional models on emotion detection. In ConvLexLSTM, the lexicon-based features complement the high-level semantic features by looking at exact words in the text. According to a paired t-test, the F1-score improvements of ConvLexLSTM over the compared models are statistically significant (p < 0.05).
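A paired t-test of this kind can be sketched as below, under the assumption that the paired observations are per-fold F1-scores of two models from the same 10-fold cross-validation; the text does not say what the pairs were. The scores are made up for illustration and are not results from the paper. The critical value 2.262 is the two-sided 5% threshold of Student's t distribution with 9 degrees of freedom.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1

# Made-up per-fold F1-scores, for illustration only.
model_a = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.80, 0.81, 0.79]
model_b = [0.77, 0.78, 0.79, 0.77, 0.80, 0.75, 0.80, 0.78, 0.77, 0.76]

T_CRIT_DF9 = 2.262  # two-sided critical value at alpha = 0.05, df = 9
t_stat, dof = paired_t_statistic(model_a, model_b)
significant = abs(t_stat) > T_CRIT_DF9
```

Pairing by fold removes the variation between folds from the comparison, so only the per-fold difference between the two models is tested.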

Impact of holidays on emotional states
We further analyzed the impact of several US holidays, i.e., 4th of July, Labor Day, Thanksgiving Day, Christmas Day, and New Year's Eve, on two emotional states of CSN users, joy and sadness. For this experiment, we extracted from the entire CSN data the messages posted on each holiday itself and on the two days before and the two days after it. We also collected messages written on five random days (June 15, October 20, March 9, April 3, and January 29), which are not close to any holiday, to be used as a baseline for comparing the emotional states of participants across holidays. Consistent with our labeled datasets, we removed messages with fewer than four tokens. We used our best performing model, ConvLexLSTM, to classify joy and sadness on these data. Figure 2 shows the percentage of messages predicted as joy and as sad for each holiday and for the five random days. As can be seen from the figure, on the random days the percentages of joy and sad emotions are similar, as they are for 4th of July and Labor Day, whereas Christmas and Thanksgiving show more joyful spirits, possibly due to family gatherings and other social events around these holidays, in which people feel supported and hence feel better. Christmas shows an increase in joy and a slight decrease in sad emotions compared with Thanksgiving. Interestingly, around New Year's Eve, the percentage of sad messages is almost twice as high as the percentage of joy messages, which can be attributed to the end of the holiday season and family gatherings and the beginning of a new challenging year.
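The five-day windows around each holiday can be sketched as follows. The holiday rules (Labor Day as the first Monday of September, Thanksgiving as the fourth Thursday of November) are standard US calendar facts, while the function names, the message representation, and the toy predictions are assumptions made for illustration.

```python
from datetime import date, timedelta

def nth_weekday(year, month, weekday, n):
    """Return the n-th occurrence of a weekday (Monday=0) in a month."""
    first = date(year, month, 1)
    offset = (weekday - first.weekday()) % 7
    return first + timedelta(days=offset + 7 * (n - 1))

def holiday_dates(year):
    """The five holidays named in the text, for one calendar year."""
    return {
        "4th of July": date(year, 7, 4),
        "Labor Day": nth_weekday(year, 9, 0, 1),          # first Monday
        "Thanksgiving Day": nth_weekday(year, 11, 3, 4),  # fourth Thursday
        "Christmas Day": date(year, 12, 25),
        "New Year's Eve": date(year, 12, 31),
    }

def window(day, radius=2):
    """The event day plus `radius` days before and after it."""
    return {day + timedelta(days=k) for k in range(-radius, radius + 1)}

def emotion_percentages(messages, days):
    """Percentage of joy- and sad-predicted messages posted on `days`.

    `messages` holds (post_date, is_joy, is_sad) tuples; the two flags
    stand in for the outputs of the two binary classifiers.
    """
    selected = [(j, s) for d, j, s in messages if d in days]
    if not selected:
        return 0.0, 0.0
    total = len(selected)
    joy = 100.0 * sum(j for j, _ in selected) / total
    sad = 100.0 * sum(s for _, s in selected) / total
    return joy, sad

# Made-up predictions, for illustration only.
toy = [
    (date(2011, 12, 24), True, False),
    (date(2011, 12, 25), True, False),
    (date(2011, 12, 26), False, True),
    (date(2011, 6, 15), False, True),   # outside the Christmas window
]
xmas = window(holiday_dates(2011)["Christmas Day"])
joy_pct, sad_pct = emotion_percentages(toy, xmas)
```

Note that the window around New Year's Eve extends into January of the following year, which the date arithmetic handles without special cases.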

Conclusion
In this paper, we addressed the problem of fine-grained emotion detection from OHCs messages. To this end, we first annotated a dataset from a cancer forum (i.e., the Cancer Survivors' Network of the American Cancer Society) with the six most common emotions suggested by Ekman (1992) and studied the most prominent emotions and their distribution in OHCs. We found that joy and sadness occur most frequently in the forum, followed by anger and fear. Not surprisingly, disgust and surprise appear the fewest times. We then proposed a computational model that combines the strengths of CNNs, LSTMs, and lexicon-based approaches to capture the hidden semantics in OHCs messages and to provide a more insightful understanding of emotional messages by identifying their emotion types.
Our results are promising and show that our proposed model, with or without lexicon-based features, which are often expensive to obtain or maintain in a health domain, provides a better emotion type detection compared with strong baselines and prior works. Given our initial success, in the future, it would be valuable to construct a large health-related dataset to cover other types of emotions, e.g., anger or fear.