EmoBank: Studying the Impact of Annotation Perspective and Representation Format on Dimensional Emotion Analysis

We describe EmoBank, a corpus of 10k English sentences balancing multiple genres, which we annotated with dimensional emotion metadata in the Valence-Arousal-Dominance (VAD) representation format. EmoBank excels with a bi-perspectival and bi-representational design. On the one hand, we distinguish between writer’s and reader’s emotions; on the other hand, a subset of the corpus complements dimensional VAD annotations with categorical ones based on Basic Emotions. We find evidence for the supremacy of the reader’s perspective in terms of inter-annotator agreement (IAA) and rating intensity, and achieve close-to-human performance when mapping between dimensional and categorical formats.


Introduction
In the past years, the analysis of affective language has become one of the most productive and vibrant areas in computational linguistics. In the early days, the prediction of semantic polarity (positiveness or negativeness) was at the center of interest, but in the meantime, research activities have shifted towards a more fine-grained modeling of sentiment. This includes the extension from only two to multiple polarity classes or even real-valued scores (Strapparava and Mihalcea, 2007), the aggregation of multiple aspects of an opinion item into a composite opinion statement for the whole item (Schouten and Frasincar, 2016), and sentiment compositionality (Socher et al., 2013).
Yet, two important features of fine-grained modeling still lack appropriate resources, namely shifting towards psychologically more adequate models of emotion (Strapparava, 2016) and distinguishing between writer's vs. reader's perspective on emotion ascription (Calvo and Mac Kim, 2013). We close both gaps with EMOBANK, the first large-scale text corpus which builds on the Valence-Arousal-Dominance model of emotion, an approach that has only recently gained increasing popularity within sentiment analysis. EMOBANK not only excels with a genre-balanced selection of sentences, but is based on a bi-perspectival annotation strategy (distinguishing the emotions of writers and readers), and includes a bi-representationally annotated subset (which has previously been annotated with Ekman's Basic Emotions) so that mappings between both representation formats can be performed. EMOBANK is freely available for academic purposes.

Related Work
Models of emotion are commonly subdivided into categorical and dimensional ones, both in psychology and natural language processing (NLP). Dimensional models consider affective states to be best described relative to a small number of independent emotional dimensions (often two or three): Valence (corresponding to the concept of polarity), Arousal (degree of calmness or excitement), and Dominance (perceived degree of control over a situation); the VAD model. Formally, the VAD dimensions span a three-dimensional real-valued vector space as illustrated in Figure 1. Alternatively, categorical models, such as the six Basic Emotions by Ekman (1992) or the Wheel of Emotion by Plutchik (1980), conceptualize emotions as discrete states. In contrast to categorical models which were used early on in NLP (Ovesdotter Alm et al., 2005; Strapparava and Mihalcea, 2007), dimensional models have only recently received increased attention in tasks such as word and document emotion prediction (see, e.g., Yu et al. (2015), Köper and Schulte im Walde (2016), Buechel and Hahn (2016)).

Figure 1: The affective space spanned by the three VAD dimensions. As an example, we here include the positions of Ekman's six Basic Emotions as determined by Russell and Mehrabian (1977).
In spite of this shift in modeling focus, VA(D)-annotated corpora are surprisingly rare in number and small in size, and also tend to be restricted in reliability. ANET, for instance, comprises only 120 sentences designed for psychological research (Bradley and Lang, 2007), while Preoţiuc-Pietro et al. (2016) created a corpus of 2,895 English Facebook posts relying on only two annotators. Another recently presented corpus comprises 2,009 Chinese sentences from various online texts.
As far as categorical models for emotion analysis are concerned, many studies use incompatible subsets of category systems, which limits their comparability (Buechel and Hahn, 2016; Calvo and Mac Kim, 2013). This also reflects the situation in psychology where there is still no consensus on a set of fundamental emotions (Sander and Scherer, 2009). Here, the VAD model has a major advantage: Since the dimensions are designed as being independent, results remain comparable dimension-wise even in the absence of others (e.g., Dominance). Furthermore, dimensional models are the predominant format for lexical affective resources in behavioral psychology as evident from the huge number of datasets available for a wide range of languages (see, e.g., Warriner et al. (2013), Stadthagen-Gonzalez et al. (2016), Moors et al. (2013) and Schmidtke et al. (2014)).
For the acquisition of VAD values from participants' self-perception, the Self-Assessment Manikin (SAM; Lang (1980), Bradley and Lang (1994)) has turned out to be the most important and (to our knowledge) only standardized instrument (Sander and Scherer, 2009). SAM iconically displays differences in Valence, Arousal and Dominance by a set of anthropomorphic cartoons on a multi-point scale (see Figure 2).
While it is common for more basic sentiment analysis systems in NLP to map the many different possible interpretations of a sentence's affective meaning into a single assessment ("its sentiment"), there is an increasing interest in a more fine-grained approach where emotion expressed by writers is modeled separately from emotion evoked in readers. An utterance like "Italy defeats France in the World Cup Final" may be completely neutral from the writer's viewpoint (presumably a professional journalist), but is likely to evoke rather opposing emotions in Italian and French readers (Katz et al., 2007).
In this line of work, Tang and Chen (2012) examine the relation between the sentiment of microblog posts and the sentiment of their comments (as a proxy for reader emotion). Liu et al. (2013) model the emotion of a news reader jointly with the emotion of a comment writer using a co-training approach. This contribution was followed up by Li et al. (2016), who propose a two-view label propagation approach instead. However, to our knowledge, only Mohammad and Turney (2013) investigated the effects of these perspectives on annotation quality, finding differences in inter-annotator agreement (IAA) relative to the exact phrasing of the annotation task.
In a similar vein to the writer-reader distinction, identifying the holder or source of an opinion or sentiment also aims at describing the affective information entailed in a sentence in more detail (Wiebe et al., 2005; Seki et al., 2009). Thus, opinion statements that can directly be attributed to the writer can be distinguished from references to others' opinions. A related task, the detection of stance, focuses on inferring the writer's (dis)approval towards a given issue from a piece of text (Sobhani et al., 2016).

Corpus Design and Creation
The following criteria guided the data selection process of the EMOBANK corpus: First, complementing existing resources which focus on social media and/or review-style language (Quan and Ren, 2009), we decided to address several genres and domains of general English. Second, we conducted a pilot study on two samples (one consisting of movie reviews, the other pulled from a genre-balanced corpus) to compare the IAA resulting from different annotation perspectives (e.g., the writer's and the reader's perspective) in different domains (see Buechel and Hahn (2017) for details). Since we found differences in IAA but the results remained inconclusive, we decided to annotate the whole corpus bi-perspectivally, i.e., each sentence was rated according to both the (perceived) writer and reader emotion (henceforth, WRITER and READER).
Third, since many problems of comparing emotion analysis studies result from the diversity of emotion representation schemes (see Section 2), the ability to accurately map between such alternatives would greatly improve comparability across systems and boost the reusability of resources. Therefore, at least parts of our corpus should be annotated bi-representationally as well, complementing dimensional VAD ratings with annotations according to a categorical emotion model.
Following these criteria, we composed our corpus out of several categories of the Manually Annotated Sub-Corpus of the American National Corpus (MASC; Ide et al. (2008), Ide et al. (2010)) and the corpus of SemEval-2007 Task 14 Affective Text (SE07; Strapparava and Mihalcea (2007)). MASC is already annotated on various linguistic levels. Hence, our work will allow for research at the intersection of emotion and other language phenomena. SE07, on the other hand, bears annotations for each of Ekman's six Basic Emotions (see Section 2) on a [0, 100] scale. This collection of raw data comprises 10,548 sentences (see Table 1).
Given this large volume of data, we opted for a crowdsourcing approach to annotation. We chose CROWDFLOWER (CF) over AMAZON MECHANICAL TURK (AMT) for its quality control mechanisms and accessibility (customers of AMT, but not CF, must be US-based). CF's main quality control mechanism rests on gold questions, items for which the acceptable ratings have been previously determined by the customer. These questions are inserted into a task to restrict the workers to those performing trustworthily. We chose these gold items by automatically extracting highly emotional sentences from our raw data according to JEMAS, a lexicon-based tool for VAD prediction (Buechel and Hahn, 2016). The acceptable ratings were determined based on manual annotations by three students trained in linguistics. The process was individually performed for WRITER and READER with different annotators.

For each of the two perspectives, we launched an independent task on CF. The instructions were based on those by Bradley and Lang (1999) to whom most of the VAD resources developed in psychology refer (see Section 2). We changed the 9-point SAM scales to 5-point scales (see Figure 2) in order to reduce the cognitive load during decision making for crowdworkers. For the writer's perspective, we presented a number of linguistic clues supporting the annotators in their rating decisions, while, for the reader's perspective, we asked what emotion would be evoked in an average reader (rather than asking for the rater's personal feelings). Both adjustments were made to establish more objective criteria for the exclusion of untrustworthy workers. We provide the instructions along with our dataset.
For each sentence, five annotators generated VAD ratings. Thus, a total of 30 ratings were gathered per sentence (five ratings for each of the three VAD dimensions and two annotation perspectives, WRITER and READER). Ten sentences were presented at a time. The task was available for workers located in the UK, the US, Ireland, Canada, Australia or New Zealand. The total annotation costs amounted to $1,578.
Upon inspection of the individual judgments, we found that the VAD rating (1, 1, 1) was heavily overrepresented. We interpret this skewed coding distribution as a bias mainly due to fraudulent responses since, from a psychological view, this rating is highly improbable (Warriner et al., 2013). Accordingly, we decided to remove all of these ratings (about 10% for each of the tasks; the 'Filtered' condition in Table 1) because these annotations would have inserted a systematic bias into our data which we consider more harmful than erroneously removing a few honest outliers. For each sentence with two or more remaining judgments, its final emotion annotation is determined by averaging these valid ratings, leading to a total of 10,062 sentences bearing VAD values for both perspectives (see Table 1).
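The filtering and aggregation step described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the function name `aggregate_vad`, its parameters, and the toy judgments are assumptions introduced here.

```python
from statistics import mean

def aggregate_vad(ratings, min_valid=2, spam=(1, 1, 1)):
    """Drop suspicious (1, 1, 1) judgments and average the rest.

    ratings: list of (valence, arousal, dominance) tuples for one sentence.
    Returns the averaged (V, A, D) tuple, or None if fewer than
    `min_valid` judgments survive the filter (hypothetical convention).
    """
    valid = [r for r in ratings if tuple(r) != spam]
    if len(valid) < min_valid:
        return None  # sentence is excluded from the final corpus
    # zip(*valid) regroups the tuples into one sequence per dimension
    return tuple(mean(dim) for dim in zip(*valid))

# Toy example (invented numbers): two of five judgments are (1, 1, 1)
judgments = [(4, 3, 3), (1, 1, 1), (5, 4, 3), (4, 2, 3), (1, 1, 1)]
print(aggregate_vad(judgments))
```

A sentence whose judgments are almost all (1, 1, 1) yields `None` here, mirroring the drop from 10,548 raw sentences to 10,062 annotated ones.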
This makes EMOBANK, to the best of our knowledge, by far the largest corpus for dimensional emotion models and, with the exception of the dataset by Quan and Ren (2009) (which is problematic in having only one annotator per sentence), the largest gold standard for any emotion format (both dimensional and categorical). Even compared with polarity corpora it is still reasonably large (e.g., similar in size to the Stanford Sentiment Treebank (Socher et al., 2013)).

Analysis and Results
For continuous, real-valued numbers, well-known metrics for IAA, such as Cohen's κ or F-score, are inappropriate as these are designed for nominally scaled variables. Instead, Pearson's correlation coefficient (r) or Mean Absolute Error (MAE) are often applied for this setting (Strapparava and Mihalcea, 2007). Accordingly, for each annotator, we compute r and MAE between their own and the aggregated EMOBANK annotation and average these values for each VAD dimension. This results in one IAA value per metric (r or MAE), perspective and dimension (Table 2).
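As a rough sketch of this IAA computation for a single VAD dimension, one might proceed as below. The data layout (nested dictionaries keyed by annotator and sentence ID), the function names, and the toy ratings are assumptions made for illustration only.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mae(xs, ys):
    """Mean absolute error between two equally long rating lists."""
    return mean([abs(x - y) for x, y in zip(xs, ys)])

def iaa(per_annotator, aggregated):
    """per_annotator: {annotator_id: {sentence_id: rating}} for one dimension.
    aggregated: {sentence_id: averaged rating}.
    Returns (mean r, mean MAE), averaged over annotators."""
    rs, maes = [], []
    for ratings in per_annotator.values():
        ids = sorted(ratings)  # sentences this annotator actually rated
        own = [ratings[i] for i in ids]
        gold = [aggregated[i] for i in ids]
        rs.append(pearson_r(own, gold))
        maes.append(mae(own, gold))
    return mean(rs), mean(maes)

# Toy Valence ratings (invented numbers)
valence_by_annotator = {
    "a1": {"s1": 4, "s2": 2, "s3": 3},
    "a2": {"s1": 5, "s2": 1, "s3": 3},
}
aggregated = {"s1": 4.5, "s2": 1.5, "s3": 3.0}
print(iaa(valence_by_annotator, aggregated))
```

Running this once per dimension and perspective would give the six r values and six MAE values that a table such as Table 2 reports.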
Averaged over the VAD dimensions, we achieve a satisfactory IAA of r > .6 for both perspectives. READER results in significantly higher correlation, but also higher error, than WRITER (p < .05 for Valence in r and for all dimensions in MAE using a two-tailed t-test).
Prior work found that a large portion of language may actually be neutral in terms of emotion (Ovesdotter Alm et al., 2005). However, an overly narrow rating distribution (i.e., most of the ratings being rather neutral relative to the three VAD dimensions) may be a disadvantageous property for training data. Therefore, we regard the emotionality of ratings as another quality criterion for emotion annotation complementary to IAA.
We capture this notion as the absolute difference of a sentence's aggregated rating from the neutral rating (3, in our case), averaged over all VAD dimensions. Comparing the average emotionality of all sentences between WRITER and READER, we find that the latter perspective also excels with significantly higher emotionality than the WRITER (p < .001; two-tailed t-test).
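The emotionality measure defined above is simple enough to state in a few lines. The function name and the example values are illustrative assumptions; the neutral point 3 is the midpoint of the 5-point scale used in the corpus.

```python
def emotionality(vad, neutral=3.0):
    """Mean absolute distance of an aggregated (V, A, D) rating
    from the neutral rating, averaged over the three dimensions."""
    return sum(abs(x - neutral) for x in vad) / len(vad)

# Toy examples (invented numbers)
print(emotionality((3.0, 3.0, 3.0)))  # fully neutral sentence
print(emotionality((4.5, 3.5, 3.0)))  # positive, mildly arousing sentence
```

Comparing the mean of this value over all sentences between the two perspectives corresponds to the comparison reported in the text.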
These beneficial characteristics of the READER perspective (better correlation-based IAA and emotionality) contrast with its worse error-based IAA. Thus, we decided to examine the relationship between error and emotionality between the two perspectives more closely: Let V, A, D be three m × n-matrices where m corresponds to the number of sentences and n to the number of annotators so that the three matrices yield all the individual ratings for Valence, Arousal and Dominance, respectively. Then we define the sentence-wise error for sentence i (SWE_i) as

$$\mathrm{SWE}_i := \frac{1}{3} \sum_{X \in \{V, A, D\}} \frac{1}{n} \sum_{j=1}^{n} \left| X_{ij} - \bar{X}_i \right|$$

where $\bar{X}_i := \frac{1}{n} \sum_{j=1}^{n} X_{ij}$. We compute SWE values for reader and writer perspective individually. We can now examine the dependency between error and emotionality by subtracting, for each sentence, SWE and emotionality for both perspectives from one another (resulting in one difference in error and one difference in emotionality value).
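A small sketch of the sentence-wise error may help make the definition concrete. It assumes that SWE is the mean absolute deviation of the individual ratings from their sentence mean, averaged over the three VAD dimensions, in line with the MAE-based IAA used earlier; the function names and toy ratings are hypothetical.

```python
def mean_abs_deviation(xs):
    """Average absolute distance of individual ratings from their mean."""
    m = sum(xs) / len(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

def sentence_wise_error(v, a, d):
    """v, a, d: the individual Valence, Arousal and Dominance ratings
    that the annotators gave to one sentence (one row of V, A and D)."""
    return (mean_abs_deviation(v)
            + mean_abs_deviation(a)
            + mean_abs_deviation(d)) / 3

# Toy ratings of one sentence by four annotators (invented numbers)
swe = sentence_wise_error(v=[4, 5, 4, 5], a=[3, 3, 3, 3], d=[2, 4, 2, 4])
print(swe)
```

Computing this value once from the READER ratings and once from the WRITER ratings of the same sentence, and subtracting one from the other, gives the per-sentence difference in error used in the subsequent analysis.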
Our data reveal a strong correlation (r = .718) between these data series, so that the more the ratings for a sentence differ in emotionality (comparing between the perspectives), the more they differ in error as well. Running linear regression on these two data series, we find that the regression line runs straight through the origin (intercept is not significantly different from 0; p = .992; see Figure 3). This means that without difference in emotionality, WRITER and READER ratings for a sentence do not, on average, differ in error. Hence, our data strongly suggest that READER is the superior perspective, yielding better inter-annotator correlation and emotionality without overproportionally increasing inter-annotator error.
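The regression on the per-sentence differences can be illustrated with an ordinary least-squares fit. This sketch makes several assumptions not stated in the text: differences are taken as READER minus WRITER, the difference in emotionality serves as the predictor, and the toy numbers are invented. It does not reproduce the significance test on the intercept.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Per-sentence differences, READER minus WRITER (invented numbers)
emotionality_diff = [-0.4, -0.1, 0.0, 0.2, 0.5]
error_diff = [-0.2, -0.05, 0.0, 0.1, 0.25]

slope, intercept = linear_fit(emotionality_diff, error_diff)
print(slope, intercept)
```

An intercept close to zero, as in this toy data, corresponds to the reading given in the text: where the two perspectives do not differ in emotionality, they do not differ in error either.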

Mapping between Emotion Formats
Making use of the bi-representational subset of our corpus (SE07), we now examine the feasibility of automatically mapping between dimensional and categorical models. For each Basic Emotion category, we train one k-Nearest-Neighbor (KNN) model given all VAD values of either WRITER, READER or both combined as features. Training and hyperparameter selection were performed using 10-fold cross-validation. Comparing the correlation between our models' predictions and the actual annotations (in categorical format) with the IAA as reported by Strapparava and Mihalcea (2007), we find that this approach already comes close to human performance (see Table 3). Once again, READER turns out to be superior in terms of the achieved mapping performance compared to WRITER. However, both perspectives combined yield even better results. In this case, our models' correlation with the actual SE07 rating is as good as or even better than the average human agreement. Note that the SE07 ratings are in turn based on averaged human judgments. Also, the human IAA differs a lot between the Basic Emotions and is even r < .5 for Disgust and Surprise. For the four categories with a reasonable IAA, Joy, Anger, Sadness and Fear, our best models, on average, actually outperform human agreement. Thus, our data show that automatically mapping between representation formats is feasible at a performance level on par with or even surpassing human annotation capability. This finding suggests that, for a dataset with high-quality annotations for one emotion format, automatic mappings to another format may be just as good as creating these new annotations by manual rating.

Table 3: IAA by Strapparava and Mihalcea (2007) compared to mapping performance of KNN models using the writer's, the reader's or both perspectives' VAD scores as features (W, R and WR, respectively), both in Pearson's r. Bottom section: difference of respective model performance (W, R and WR) and IAA.
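The mapping experiment described above can be approximated with a plain k-Nearest-Neighbor regressor evaluated by 10-fold cross-validation. The sketch below is only an illustration under several assumptions: Euclidean distance, unweighted averaging of neighbor targets, interleaved folds, and synthetic data are all choices made here, since the text does not specify these details.

```python
from math import dist, sqrt

def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k training points closest to `query`."""
    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def cross_val_r(xs, ys, k, folds=10):
    """Collect out-of-fold predictions, then correlate them with the targets."""
    preds = [0.0] * len(xs)
    for f in range(folds):
        test_idx = set(range(f, len(xs), folds))  # interleaved fold
        tr_x = [x for i, x in enumerate(xs) if i not in test_idx]
        tr_y = [y for i, y in enumerate(ys) if i not in test_idx]
        for i in test_idx:
            preds[i] = knn_predict(tr_x, tr_y, xs[i], k)
    return pearson_r(preds, ys)

def select_k(xs, ys, candidates=(1, 3, 5), folds=10):
    """Pick the neighborhood size with the best cross-validated r."""
    return max(candidates, key=lambda k: cross_val_r(xs, ys, k, folds))

# Synthetic data: a 'Joy' score on [0, 100] that grows with Valence.
# For the combined setting, each feature tuple would simply hold six
# values (WRITER VAD followed by READER VAD); this is an assumption.
features = [(1 + 4 * i / 39, 3.0, 3.0) for i in range(40)]
joy = [25 * (v - 1) for v, _, _ in features]

best_k = select_k(features, joy)
print(best_k, cross_val_r(features, joy, best_k))
```

One such model would be fitted per Basic Emotion category. Note that selecting k and reporting r on the same folds, as done here for brevity, gives a slightly optimistic estimate.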

Conclusion
We described the creation of EMOBANK, the first large-scale corpus employing the dimensional VAD model of emotion and one of the largest gold standards for any emotion format. This genre-balanced corpus is also unique for having two kinds of double annotations. First, we annotated for both writer and reader emotion; second, for a subset of EMOBANK, ratings for categorical Basic Emotions as well as VAD dimensions are now available. The statistical analysis of our corpus revealed that the reader perspective yields both better IAA values and more emotional ratings. For the bi-representationally annotated subcorpus, we showed that an automatic mapping between categorical and dimensional formats is feasible with near-human performance using standard machine learning techniques.