Modeling Empathy and Distress in Reaction to News Stories

Computational detection and understanding of empathy is an important factor in advancing human-computer interaction. Yet to date, text-based empathy prediction has the following major limitations: It underestimates the psychological complexity of the phenomenon, adheres to a weak notion of ground truth where empathic states are ascribed by third parties, and lacks a shared corpus. In contrast, this contribution presents the first publicly available gold standard for empathy prediction. It is constructed using a novel annotation methodology which reliably captures empathy assessments by the writer of a statement using multi-item scales. This is also the first computational work distinguishing between multiple forms of empathy, namely empathic concern and personal distress, as recognized throughout psychology. Finally, we present experimental results for three different predictive models, of which a CNN performs the best.


Introduction
Over two decades after the seminal work by Picard (1997) the quest of Affective Computing, to ease the interaction with computers by giving them a sense of how emotions shape our perception and behavior, is still far from being fulfilled. Undoubtedly, major progress has been made in NLP, with sentiment analysis being one of the most vivid and productive areas in recent years (Liu, 2015).
However, the vast majority of contributions has focused on polarity prediction, typically only distinguishing between positive and negative feeling or evaluation, usually in social media postings or product reviews (Rosenthal et al., 2017; Socher et al., 2013). Only very recently, researchers started exploring more sophisticated models of human emotion on a larger scale (Wang et al., 2016; Abdul-Mageed and Ungar, 2017; Mohammad and Bravo-Marquez, 2017a; Buechel and Hahn, 2017, 2018a,b). Yet such approaches, often rooted in psychological theory, also turned out to be more challenging with respect to annotation and modeling (Strapparava and Mihalcea, 2007).
* These authors contributed equally to this work. Anneke Buffone designed and supervised the crowdsourcing task and the survey described in Section 2, and provided psychological background knowledge. Sven Buechel was responsible for corpus creation, data analysis, and modeling. The technical set-up of the crowdsourcing task and the survey was done jointly by both first authors.
† Work conducted while being at the University of Pennsylvania.
Surprisingly, one of the most valuable affective phenomena for improving human-machine interaction has received little attention: Empathy. Prior work focused mostly on spoken dialogue, commonly addressing conversational agents, psychological interventions, or call center applications (McQuiggan and Lester, 2007; Fung et al., 2016; Pérez-Rosas et al., 2017; Alam et al., 2017).
In contrast, to the best of our knowledge, only three contributions (Xiao et al., 2012; Gibson et al., 2015; Khanpour et al., 2017) previously addressed text-based empathy prediction 1 (see Section 4 for details). Yet, all of them are limited in three ways: (a) none of their corpora is available, leaving the NLP community without shared data, (b) empathy ratings were provided by people other than the one actually experiencing it, which qualifies only as a weak form of ground truth, and (c) their notion of empathy is quite basic, falling short of current and past theory.
1 Psychological studies commonly distinguish between state and trait empathy. While the former construct describes the amount of empathy a person experiences as a direct result of encountering a given stimulus, the latter refers to how empathetic one is on average and across situations. This study exclusively addresses state empathy. For a contribution addressing trait empathy from an NLP perspective, see Abdul-Mageed et al. (2017).
In this contribution we present the first publicly available gold standard for text-based empathy prediction. It is constructed using a novel annotation methodology which reliably captures empathy assessments via multi-item scales. The corpus as well as our work as a whole is also unique in being, to the best of our knowledge, the first computational approach differentiating multiple types of empathy, empathic concern and personal distress, a distinction well recognized throughout psychology and other disciplines. 2

Corpus Design and Methodology
Background. Most psychological theories of empathic states are focused on reactions to negative rather than positive events. Empathy for positive events remains less well understood and is thought to be regulated differently (Morelli et al., 2015). Thus we focus on empathetic reactions to need or suffering. Despite the fact that everyone has an immediate, implicit understanding of empathy, research has been vastly inconsistent in its definition and operationalization (Cuff et al., 2016). There is agreement, however, that there are multiple forms of empathy (see below). By far the most widely cited state empathy scale is Batson's Empathic Concern - Personal Distress Scale (Batson et al., 1987), henceforth empathy and distress.
Distress is a self-focused, negative affective state that occurs when one feels upset due to witnessing an entity's suffering or need, potentially via "catching" the suffering target's negative emotions. Empathy is a warm, tender, and compassionate feeling for a suffering target. It is other-focused, retains self-other separation, and is marked by relatively more positive affect (Batson and Shaw, 1991; Goetz et al., 2010; Mikulincer and Shaver, 2010; Sober and Wilson, 1997).
Selection of News Stories. Two research interns (psychology undergraduates) collected a total of 418 articles from popular online news platforms, selected to likely evoke empathic reactions, after being briefed on the goal and background of this study. These articles were then used to elicit empathic responses in participants.
Acquiring Text and Ratings. The corpus acquisition was set up as a crowdsourcing task on MTurk.com pointing to a Qualtrics.com questionnaire. The participants completed background measures on demographics and personality, and then proceeded to the main part of the survey where they read a random selection of five of the news articles. After reading each of the articles, participants were asked to rate their level of empathy and distress before describing their thoughts and feelings about it in writing.
In contrast to previous work, this set-up allowed us to acquire empathy scores of the actual writer of a text, instead of having to rely on an external evaluation by third parties (often student assistants with background in computer science). Arguably, our proposed annotation methodology yields more appropriate gold data, yet also leads to more variance in the relationship between linguistic features and empathic state ratings. That is because each rating reflects a single individual's feelings rather than a more stable average assessment by multiple raters. To account for this, we use multi-item scales as is common practice in psychology. That is, participants give ratings for multiple items measuring the same construct (e.g., empathy) which are then averaged to obtain more reliable results. As far as we know, this is the first time that multi-item scales are used in sentiment analysis. 3 In our case, participants used Batson's Empathic Concern - Personal Distress Scale (see above), i.e., rating 6 items for empathy (e.g., warm, tender, moved) and 8 items for distress (e.g., troubled, disturbed, alarmed) using a 7-point scale for each of those (see Appendix for details). After completing the ratings, participants were asked to share their feelings about the article as they would with a friend in a private message or with a group of friends as a social media post in 300 to 800 characters. Our final gold standard consists of these messages combined with the numeric ratings for empathy and distress.
In sum, 403 participants completed the survey. Median completion time was 32 minutes and each participant received 4 USD as compensation.
Post-Processing. Each message was manually reviewed by the authors. Responses which deviated from the task description (e.g., mere copying from the articles on display) were removed (31 responses, 155 messages), leading to a total of 1860 messages in our final corpus. Gold ratings for empathy and distress were derived by averaging the respective items of the two multi-item scales.
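The averaging step can be illustrated with a minimal sketch. The item names are those listed in the Appendix; the function name `gold_ratings`, the example ratings, and the assumption that the 7-point scale runs from 1 to 7 are ours for illustration and are not part of the original survey code.

```python
from statistics import mean

# Item names as listed in the Appendix.
EMPATHY_ITEMS = ["warm", "tender", "sympathetic", "softhearted", "moved", "compassionate"]
DISTRESS_ITEMS = ["worried", "upset", "troubled", "perturbed",
                  "grieved", "disturbed", "alarmed", "distressed"]


def gold_ratings(item_ratings):
    """Average the item ratings of each scale into one score per construct.

    Assumes (hypothetically) that the 7-point scale is coded from 1 to 7.
    """
    for item, value in item_ratings.items():
        if not 1 <= value <= 7:
            raise ValueError(f"rating for {item!r} is outside the 7-point scale: {value}")
    empathy = mean(item_ratings[item] for item in EMPATHY_ITEMS)
    distress = mean(item_ratings[item] for item in DISTRESS_ITEMS)
    return empathy, distress


# Made-up response of one participant to one article.
example = {
    "warm": 5, "tender": 4, "sympathetic": 6, "softhearted": 4, "moved": 5, "compassionate": 6,
    "worried": 3, "upset": 4, "troubled": 3, "perturbed": 2,
    "grieved": 3, "disturbed": 4, "alarmed": 2, "distressed": 3,
}
empathy_score, distress_score = gold_ratings(example)
```

Each message thus receives exactly two gold values, one per scale, regardless of how many items the scale contains.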

Table 1: Example messages with their gold ratings for empathy (E) and distress (D).
E D Message
(1) 4.8 3.1 I'm sorry to hear that about Dakota's parents. Even when you are adult it must be hard to see your parents splitting up. No one wants that to happen and it's unfortunate that her parents couldn't work it out. I hope they are able to still remain civil around the kids and family. Just because it didn't work romantically doesn't mean it won't work at all.
(2) 4.0 5.5 Here's an article about crazed person who murdered two unfortunate women overseas. Life is crazy. I can't imagine what the families are going through. Having to go to or being forced into sex work is bad enough, but for it to end like this is just sad. It feels like there's no place safe in this world to be a woman sometimes.
(3) 1.0 1.3 I just read an article about some chowder-head who used a hammer and a pick ax to destroy Donald Trump's star on the Hollywood walk of fame. Wow, what a great protest. You sure showed him. Good job. Lol, can you believe this garbage? Who has such a hollow and pathetic life that they don't have anything better to do with their time than commit petty vandalism because they dislike some politician? What a dingus.

Corpus Analysis
For a first impression of the language of our new gold standard, we provide illustrative examples in Table 1. The participant in Example (1) displays higher empathy than distress, (2) displays higher distress than empathy, and (3) shows neither empathic state, but employs sarcasm, colloquialisms and social-media-style acronyms to express lack of emotional response to the article. As can be seen, the language of our corpus is diverse and authentic, featuring many phenomena of natural language which render its computational understanding difficult, thus constituting a sound but challenging gold standard for empathy prediction.
Token Counts. We tokenized the 1860 messages using NLTK tools (Bird, 2006). In total, our corpus amounts to 173,686 tokens. Individual message length varies between 52 and 198 tokens, the median being 84. See Appendix for details.
Rating Distribution. Figure 1 displays the bivariate distribution of empathy and distress ratings. As can be seen, both target variables have a clear linear dependence, yet show only a moderate Pearson correlation of r=.451, similar to what was found in prior research (Batson et al., 1987, 1997). This finding supports that the two scales capture distinct affective phenomena and underscores the importance of our decision to describe empathic states in terms of multiple target variables, constituting a clear advancement over previous work. Both kinds of ratings show good coverage over the full range of the scales.
Reliability of Ratings. Since each message is annotated by only one rater, its author, typical measures of inter-rater agreement are not applicable. Instead, we compute split-half reliability (SHR), a standard approach in psychology (Cronbach, 1947) which is also becoming increasingly popular in sentiment analysis (Mohammad and Bravo-Marquez, 2017a; Buechel and Hahn, 2018a). SHR is computed by splitting the ratings for the individual scale items (e.g., warm, tender, etc. for empathy) of all participants randomly into two groups, averaging the individual item ratings for each group and participant, and then measuring the correlation between both groups. This process is repeated 100 times with random splits, before again averaging the results. Doing so for empathy and distress, we find very high 4 SHR values of r=.875 and .924, respectively.
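The SHR procedure can be sketched as follows. This is one possible reading of the description above, assuming that each repetition draws a single random split of the items which is then applied to all responses; the function names and the synthetic toy ratings are hypothetical and only serve to make the sketch runnable.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def split_half_reliability(ratings, n_splits=100, seed=0):
    """ratings: one list of item ratings per response, all of equal length."""
    rng = random.Random(seed)
    n_items = len(ratings[0])
    correlations = []
    for _ in range(n_splits):
        items = list(range(n_items))
        rng.shuffle(items)  # random assignment of items to the two halves
        half_a, half_b = items[: n_items // 2], items[n_items // 2:]
        means_a = [sum(r[i] for i in half_a) / len(half_a) for r in ratings]
        means_b = [sum(r[i] for i in half_b) / len(half_b) for r in ratings]
        correlations.append(pearson(means_a, means_b))
    return sum(correlations) / len(correlations)


# Synthetic toy data (not from the corpus): 200 responses to a 6-item scale,
# each generated as a latent level plus item-specific noise.
toy_rng = random.Random(1)
toy_ratings = []
for _ in range(200):
    level = toy_rng.uniform(1, 7)
    toy_ratings.append(
        [min(7, max(1, round(level + toy_rng.gauss(0, 1)))) for _ in range(6)]
    )
shr = split_half_reliability(toy_ratings)
```

Because the items of one scale are meant to measure the same construct, the two half-scale means should correlate strongly whenever the scale is reliable.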

Modeling Empathy and Distress
In this section, we provide experimental results for modeling empathy and distress ratings based on the participants' messages (see Section 2). We examine three different types of models, varying in design complexity. Distinct models were trained for empathy and distress prediction.
First, ten percent of our newly created gold standard was randomly sampled to be used in development experiments. Then, the main experiment was conducted using 10-fold cross-validation (CV), providing each model with identical train-test splits to increase reliability. The dev set was excluded from the CV experiment.
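The data partitioning described above might look like the following sketch. The function name, the seed, and the round-robin assignment of examples to folds are assumptions made for illustration; the original experiments may have drawn their splits differently.

```python
import random


def dev_and_cv_splits(n_examples, dev_fraction=0.1, n_folds=10, seed=42):
    """Hold out a dev set, then build train/test index lists for k-fold CV."""
    rng = random.Random(seed)
    indices = list(range(n_examples))
    rng.shuffle(indices)
    n_dev = round(n_examples * dev_fraction)
    dev, rest = indices[:n_dev], indices[n_dev:]
    # Round-robin assignment of the remaining examples to folds.
    folds = [rest[k::n_folds] for k in range(n_folds)]
    splits = []
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((train, test))
    return dev, splits


# Computed once and reused, so that every model sees identical splits.
dev, splits = dev_and_cv_splits(1860)
```

With 1860 messages, this leaves 186 messages for development and 1674 for cross-validation.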
Model performance is measured in terms of Pearson correlation r between predicted values and the human gold ratings. Thus, we phrase the prediction of empathy and distress as regression problems.
The input to our models is based on word embeddings, namely the publicly available FastText embeddings which were trained on Common Crawl (≈600B tokens) (Bojanowski et al., 2017; Mikolov et al., 2018).
Ridge. Our first approach is Ridge regression, an L2-regularized version of linear regression. The centroid of the word embeddings of the words in a message is used as features (embedding centroid). The regularization coefficient α is automatically chosen from {1, .5, .1, ..., .0001} during training.
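The embedding centroid feature can be sketched as below. The toy three-dimensional vectors stand in for the actual FastText embeddings, and the handling of out-of-vocabulary tokens (skipping them, and returning a zero vector if no token is known) is an assumption that the text does not specify.

```python
def embedding_centroid(tokens, embeddings, dim):
    """Element-wise mean of the embedding vectors of all known tokens."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        # Assumption: fall back to a zero vector if no token is in the vocabulary.
        return [0.0] * dim
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


# Toy embedding table (hypothetical values, not FastText vectors).
toy_embeddings = {
    "sad": [1.0, 0.0, 2.0],
    "news": [0.0, 1.0, 0.0],
    "story": [2.0, 2.0, 1.0],
}
centroid = embedding_centroid(["sad", "news", "story", "unknownword"], toy_embeddings, dim=3)
# centroid == [1.0, 1.0, 1.0]
```

The resulting fixed-length vector discards word order, which is why it suits models such as Ridge regression and the FFN that expect one feature vector per message.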
FFN. Our second approach is a Feed-Forward Net with two hidden layers (256 and 128 units, respectively) with ReLU activation. Again, the embedding centroid is used as features.
CNN. The last approach is a Convolutional Neural Net. 5 We use a single convolutional layer with filter sizes 1 to 3, each with 100 output channels, followed by an average pooling layer and a dense layer of 128 units. ReLUs were used for the convolutional and again for the dense layer.
Both deep learning models were trained using the Adam optimizer (Kingma and Ba, 2015) with a fixed learning rate of 10^-3 and a batch size of 32. We trained for a maximum of 200 epochs yet applied early stopping if the performance on the validation set did not improve for 20 consecutive epochs. We applied dropout with probabilities of .2, .5 and .5 on input, dense and pooling layers, respectively. Moreover, L2 regularization of .001 was applied to the weights of conv and dense layers. Word embeddings were not updated.
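The early stopping rule can be sketched independently of any deep learning framework. In this sketch, `validate_epoch` is a hypothetical callback that trains for one epoch and returns the validation score; we assume that higher scores are better, which the text does not state explicitly.

```python
def run_early_stopping(validate_epoch, max_epochs=200, patience=20):
    """Train until max_epochs or until `patience` consecutive epochs bring no improvement.

    Returns (best_epoch, best_score, last_epoch_run).
    """
    best_score, best_epoch = float("-inf"), 0
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        score = validate_epoch(epoch)  # hypothetical: train one epoch, then validate
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_score, epoch


# Made-up validation curve that peaks at epoch 30 and declines afterwards.
best_epoch, best_score, last_epoch = run_early_stopping(
    lambda epoch: 0.4 - abs(epoch - 30) * 0.001
)
# best_epoch == 30, last_epoch == 50
```

In practice one would also keep the model weights from the best epoch; that bookkeeping is omitted here.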
The results are provided in Table 2. As can be seen, all of our models achieve satisfactory performance figures ranging between r=.379 and .444, given the assumed difficulty of the task (see Section 3). On average over the two target variables, the CNN performs best, followed by Ridge and the FFN. While the CNN significantly outperforms the other models in every case, the differences between Ridge and the FFN are not statistically significant for either empathy or distress. 6 The improvements of the CNN over the other two approaches are much more pronounced for distress than for empathy. Since only the CNN is able to capture semantic effects from composition and word order, our data suggest that these phenomena are more important for predicting distress, whereas lexical features alone already perform quite well for empathy.
5 Recurrent models did not perform well during development due to high sequence length.
Discussion. In comparison to closely related tasks such as emotion prediction (Mohammad and Bravo-Marquez, 2017a) our performance figures for empathy and distress prediction are generally lower. However, given the small amount of previous work for the problem at hand, we argue that our results are actually quite strong. This becomes obvious, again, in comparison with emotion analysis where early work achieved correlation values around r=.3 at most (Strapparava and Mihalcea, 2007). Yet state-of-the-art performance literally doubled over the last decade (Beck, 2017), in part due to much larger training sets.
Comparison to the limited body of previous work in text-based empathy prediction is difficult for a number of reasons, e.g., differences in domain, evaluation metric, as well as methodology and linguistic level of annotation. Khanpour et al. (2017) annotate and model empathy in online health communities on the sentence-level, whereas the instances in our corpus are much longer and comprise multiple sentences. In contrast to our work, they treat empathy prediction as a classification problem. Their best performing model, a CNN-LSTM, achieves an F-score of .78. Gibson et al. (2015) predict therapists' empathy in motivational interviews. Each therapy session transcript received one numeric score. Thus, each prediction is based on much more language data than our individual messages comprise. Their best model achieves a Spearman rank correlation of .61 using n-gram and psycholinguistic features.
Our contribution goes beyond both of these studies by, first, enriching empathy prediction with personal distress and, second, by annotating and modeling the empathic state actually felt by the writer, instead of relying on external assessments.

Conclusion
This contribution was the first to attempt empathy prediction in terms of multiple target variables, empathic concern and personal distress. We proposed a novel annotation methodology capturing empathic states actually felt by the author of a statement, instead of relying on third-party assessments. To ensure high reliability in this single-rating setting, we employ multi-item scales in line with best practices in psychology. We thereby create the first publicly available gold standard for empathy prediction in written language, with our survey being set up and supervised by an expert psychologist. Our analysis shows that the data set excels with high rating reliability and an authentic and diverse language, rich in challenging phenomena such as sarcasm. We provide experimental results for three different predictive models, with our CNN performing best.

Before being used in our survey, the selected news articles were categorized by the research interns who gathered them in terms of their intensity of suffering (major or minor), cause of suffering (political, human, nature or other), patient of suffering (humans, animals, environment, or other) and scale of suffering (individual or mass). Research interns also provided a short list of key words for each article. This additional information was gathered to examine the influence of these factors on empathy elicitation and modeling performance in later studies.
At the beginning of the survey participants completed background items covering general demographics (including age, gender, and ethnicity), the most commonly used trait empathy scale, the Interpersonal Reactivity Index (Davis, 1980), a brief assessment of the Big 5 personality traits (Gosling et al., 2003), life satisfaction (Diener et al., 1985), as well as a brief measure of generalized trust.
After reading each of the articles, participants rated their level of empathic concern and personal distress using multi-item scales. Figure 2 shows a cropped screenshot of the survey hosted on Qualtrics.com. The first six items (warm, tender, sympathetic, softhearted, moved, and compassionate) refer to empathy. The last eight items (worried, upset, troubled, perturbed, grieved, disturbed, alarmed, and distressed) refer to distress.
Figure 2: Multi-item scales for empathic concern and personal distress.
After completing the rating items, participants were instructed to describe their reactions in writing as follows: Now that you have read this article, please write a message to a friend or friends about your feelings and thoughts regarding the article you just read. This could be a private message to a friend or something you would post on social media. Please do not identify your intended friend(s) -just write your thoughts about the article as if you were communicating with them. Please use between 300 and 800 characters.

Further Corpus Analyses
The word clouds in Figure 3 and Figure 4 show 1-grams of our corpus which correlate significantly (Benjamini-Hochberg corrected p < .05) with high empathy and high distress ratings, respectively. In the word clouds, larger size indicates higher correlation and the color scale, gray-blue-red, indicates word frequency, dark red being most prevalent. The Differential Language Analysis Toolkit (Schwartz et al., 2017) was utilized for this analysis. As can be seen, the word clouds display high face-validity, giving further evidence for the soundness of our acquisition methodology.
Figure 5 displays the distribution of the message length of our corpus in tokens. As can be seen, the majority of messages contain between 60 and 100 tokens. Yet outliers go up to almost 200. The introduction of a character cap for the writing task proved successful in comparison to a pilot study where this measure had not been in place. In the latter case, the maximum number of tokens was nearly twice as high due to even stronger outliers.
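For readers unfamiliar with the Benjamini-Hochberg correction applied in the word cloud analysis above, the standard step-up procedure can be sketched as follows. This is a generic textbook version with made-up p-values, not the implementation used inside the Differential Language Analysis Toolkit.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True if rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose p-value satisfies p_(k) <= k / m * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject all hypotheses up to and including that rank.
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


# Made-up p-values for eight hypothetical 1-grams.
toy_p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
significant = benjamini_hochberg(toy_p_values)
# significant == [True, True, False, False, False, False, False, False]
```

Note that several p-values below .05 are not rejected in this toy example, which illustrates why the correction matters when thousands of 1-grams are tested at once.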