Readers vs. Writers vs. Texts: Coping with Different Perspectives of Text Understanding in Emotion Annotation

We here examine how different perspectives of understanding written discourse, like the reader’s, the writer’s or the text’s point of view, affect the quality of emotion annotations. We conducted a series of annotation experiments on two corpora, a popular movie review corpus and a genre- and domain-balanced corpus of standard English. We found statistical evidence that the writer’s perspective yields superior annotation quality overall. However, the quality one perspective yields compared to the other(s) seems to depend on the domain the utterance originates from. Our data further suggest that the popular movie review data set suffers from an atypical bimodal distribution which may decrease model performance when used as a training resource.


Introduction
In the past years, the analysis of subjective language has become one of the most popular areas in computational linguistics. In the early days, a simple classification according to the semantic polarity (positiveness, negativeness or neutralness) of a document was predominant, whereas in the meantime, research activities have shifted towards a more sophisticated modeling of sentiments. This includes the extension from only a few basic to more varied emotional classes, sometimes even assigning real-valued scores (Strapparava and Mihalcea, 2007), the aggregation of multiple aspects of an opinion item into a composite opinion statement for the whole item (Schouten and Frasincar, 2016), and sentiment compositionality on the sentence level (Socher et al., 2013).
There is also an increasing awareness of different perspectives one may take to interpret written discourse in the process of text comprehension. A typical distinction which mirrors different points of view is the one between the writer and the reader(s) of a document as exemplified by utterance (1) below (taken from Katz et al. (2007)):
(1) Italy defeats France in World Cup Final
The emotion of the writer, presumably a professional journalist, can be expected to be more or less neutral, but French or Italian readers may show rather strong (and most likely opposing) emotional reactions when reading this news headline. Consequently, such finer-grained emotional distinctions must also be considered when formulating instructions for an annotation task.
NLP researchers are aware of this multiperspectival understanding of emotion, as contributions often target either one or the other form of emotion expression or mention it as a subject of future work (Mukherjee and Joshi, 2014; Lin and Chen, 2008; Calvo and Mac Kim, 2013). However, contributions aiming at quantifying the effect of altering perspectives are rare (see Section 2). This is especially true for work examining differences in annotation results relative to these perspectives. Although this is obviously a crucial design decision for gold standards for emotion analytics, we know of only one such contribution (Mohammad and Turney, 2013).
In this paper, we systematically examine differences in the quality of emotion annotation regarding different understanding perspectives. Apart from inter-annotator agreement (IAA), we will also look at other quality criteria such as how well the resulting annotations cover the space of possible ratings and check for the representativeness of the rating distribution. We performed a series of annotation experiments with varying instructions and domains of raw text, making this the first study ever to address the impact of text understanding perspective on sentence-level emotion annotation. The results we achieved directly influenced the design and creation of EMOBANK, a novel large-scale gold standard for emotion analysis employing the VAD model for affect representation (Buechel and Hahn, 2017).

Related Work
Representation Schemes for Emotion. Due to the multi-disciplinary nature of research on emotions, different representation schemes and models have emerged hampering comparison across different approaches (Buechel and Hahn, 2016).
In NLP-oriented sentiment and emotion analysis, the most popular representation scheme is based on semantic polarity, the positiveness or negativeness of a word or a sentence, while slightly more sophisticated schemes include a neutral class or even rely on a multi-point polarity scale (Pang and Lee, 2008).
Despite their popularity, these bi- or tri-polar schemes have only loose connections to emotion models currently prevailing in psychology (Sander and Scherer, 2009). From an NLP point of view, those can be broadly subdivided into categorical and dimensional models (Calvo and Mac Kim, 2013). Categorical models assume a small number of distinct emotional classes (such as Anger, Fear or Joy) that all human beings are supposed to share. In NLP, the most popular of those models are the six Basic Emotions by Ekman (1992) or the 8-category scheme of the Wheel of Emotion by Plutchik (1980). Dimensional models, on the other hand, are centered around the notion of compositionality. They assume that emotional states can be best described as a combination of several fundamental factors, i.e., emotional dimensions. One of the most popular dimensional models is the Valence-Arousal-Dominance (VAD; Bradley and Lang (1994)) model which postulates three orthogonal dimensions, namely Valence (corresponding to the concept of polarity), Arousal (a calm-excited scale) and Dominance (perceived degree of control in a (social) situation); see Figure 1 for an illustration. An even more widespread version of this model uses only the Valence and Arousal dimensions, the VA model (Russell, 1980). For a long time, categorical models were predominant in emotion analysis (Ovesdotter Alm et al., 2005; Strapparava and Mihalcea, 2007; Balahur et al., 2012). Only recently, the VA(D) model found increasing recognition (Paltoglou et al., 2013; Yu et al., 2015; Buechel and Hahn, 2016). When one of these dimensional models is selected, the task of emotion analysis is most often interpreted as a regression problem (predicting real-valued scores for each of the dimensions) so that a different set of metrics must be taken into account than those typically applied in NLP (see Section 3).
[Figure 1: illustration of the VAD model; the surviving caption fragment cites Russell and Mehrabian (1977).]
Despite its growing popularity, the first large-scale gold standard for dimensional models has only very recently been developed as a follow-up to this contribution (EMOBANK; Buechel and Hahn (2017)). The results we obtained here were crucial for the design of EMOBANK regarding the choice of annotation perspective and the domain the raw data were taken from. However, our results are not only applicable to VA(D) but also to semantic polarity (as Valence is equivalent to this representation format) and may well generalize to other models of emotion.
Resources and Annotation Methods. For the VAD model, the Self-Assessment Manikin (SAM; Bradley and Lang (1994)) is the most important and to our knowledge only standardized instrument for acquiring emotion ratings based on human self-perception in behavioral psychology (Sander and Scherer, 2009). SAM iconically displays differences in Valence, Arousal and Dominance by a set of anthropomorphic cartoons on a multi-point scale (see Figure 2). Subjects refer to one of these figures per VAD dimension to rate their feelings as a response to a stimulus.
SAM and derivatives therefrom have been used for annotating a wide range of resources for word-emotion associations in psychology (such as Warriner et al. [...] 2015)), a method only recently introduced into NLP by Kiritchenko and Mohammad (2016). This annotation method exploits the fact that humans are typically more consistent when comparing two items relative to each other with respect to a given scale rather than attributing numerical ratings to the items directly. For example, deciding whether one sentence is more positive than the other is easier than scoring them (say) as 8 and 6 on a 9-point scale.
Although BWS provided promising results for polarity (Kiritchenko and Mohammad, 2016), in this paper, we will use SAM scales. First, with this decision, there are far more studies to compare our results with and, second, the adequacy of BWS for emotional dimensions other than Valence (polarity) remains to be shown.
Perspectival Understanding of Emotions. As stated above, research on the linkage of different annotation perspectives (typically reader vs. writer) is rare. Tang and Chen (2012) examine the relation between the sentiment of microblog posts and the sentiment of their comments (as a proxy for reader emotion) using a positive-negative scheme. They examine which linguistic features are predictive for certain emotion transitions (combinations of an initial writer and a responsive reader emotion). Liu et al. (2013) model the emotion of a news reader jointly with the emotion of a comment writer using a co-training approach. This contribution was followed up by Li et al. (2016) who criticized that important assumptions underlying co-training, viz. sufficiency and independence of the two views, had actually been violated in that work. Instead, they propose a two-view label propagation approach.
Various (knowledge) representation formalisms have been suggested for inferring sentiment or opinions by either readers, writers or both from a piece of text. Reschke and Anand (2011) propose the concept of predicate-specific evaluativity functors which allow for inferring the writer's evaluation of a proposition based on the evaluation of the arguments of the predicate. Using description logics as modeling language, Klenner (2016) advocates the concept of polarity frames to capture polarity constraints verbs impose on their complements as well as polarity implications they project on them. Deng and Wiebe (2015) employ probabilistic soft logic for entity and event-based opinion inference from the viewpoint of the author or intra-textual entities. Rashkin et al. (2016) introduce connotation frames of (verb) predicates as a comprehensive formalism for modeling various evaluative relationships (being positive, negative or neutral) between the arguments of the predicate as well as the reader's and author's view on them. However, up until now, the power of this formalism is still restricted by assuming that author and reader evaluate the arguments in the same way.
In summary, different from our contribution, this line of work tends to focus less on the reader's perspective and also addresses cognitive evaluations (opinions) rather than instantaneous affective reactions. Although these two concepts are closely related, they are yet different and in fact their relationship has been the subject of a long lasting and still unresolved debate in psychology (Davidson et al., 2003) (e.g., are we afraid of something because we evaluate it as dangerous, or do we evaluate something as dangerous because we are afraid?).
To the best of our knowledge, only Mohammad and Turney (2013) investigated the effects of different perspectives on annotation quality. They conducted an experiment on how to formulate the emotion annotation question and found that asking whether a term is associated with an emotion actually resulted in higher IAA than asking whether a term evokes a certain emotion. Arguably, the former phrasing is rather unrelated to either writer or reader emotion, while the latter clearly targets the emotion of the reader. Their work renders evidence for the importance of the perspective of text comprehension for annotation quality. Note that they focused on word emotion rather than sentence emotion.

Methods
Inter-Annotator Agreement. Annotating emotion on numerical scales demands a different statistical tool set than the one that is common in NLP. Well-known metrics such as the κ-coefficient should not be applied for measuring IAA because these are designed for nominal-scaled variables, i.e., ones whose possible values do not have any intrinsic order (such as part-of-speech tags as compared to (say) a multi-point sentiment scale).
In the literature, there is no consensus on what metrics for IAA should be used instead. However, there is a set of repetitively used approaches which are typically only described verbally. In the following, we offer comprehensive formal definitions and a discussion of them.
First, we describe a leave-one-out framework for IAA where the ratings of an individual annotator are compared against the average of the remaining ratings. Strapparava and Mihalcea (2007) were among the first to use and verbally describe it; it was later adopted by, among others, Preoţiuc-Pietro et al. (2016).
Let $X := (x_{ij}) \in \mathbb{R}^{m \times n}$ be a matrix where $m$ corresponds to the number of items and $n$ corresponds to the number of annotators. $X$ stores all the individual ratings of the $m$ items (organized in rows) and $n$ annotators (organized in columns) so that $x_{ij}$ represents the rating of the $i$-th item by the $j$-th annotator. Since we use the three-dimensional VAD model, in practice, we will have one such matrix for each VAD dimension.
Let $b_j := (x_{1j}, ..., x_{mj})$, the vector composed out of the $j$-th column of the matrix, and let $f : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ be an arbitrary metric for comparing two data series. Then $\mathrm{L1O}_f(X)$, the leave-one-out IAA for the rating matrix $X$ relative to the metric $f$, is defined as
$$\mathrm{L1O}_f(X) := \frac{1}{n} \sum_{j=1}^{n} f\left(b_j, b_j^{\emptyset}\right) \tag{1}$$
where $b_j^{\emptyset}$ is the average annotation vector of the remaining raters:
$$b_j^{\emptyset} := \frac{1}{n-1} \sum_{k \in \{1, ..., n\} \setminus \{j\}} b_k \tag{2}$$
For our experiments, we will use three different metrics specifying the function $f$, namely $r$, MAE and RMSE.
In general, the Pearson correlation coefficient $r$ captures the linear dependence between two data series, $x = x_1, x_2, ..., x_m$ and $y = y_1, y_2, ..., y_m$. In our case $x$, $y$ correspond to the rating vector of an individual annotator and the aggregated rating vector of the remaining annotators, respectively.
$$r(x, y) := \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{m} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{m} (y_i - \bar{y})^2}} \tag{3}$$
where $\bar{x}$, $\bar{y}$ denote the mean value of $x$, $y$, respectively.
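The leave-one-out framework with $r$ as the metric $f$ can be sketched in a few lines of plain Python. This is a minimal illustration and not the code used for the experiments: the function names `pearson` and `leave_one_out` and the toy rating matrix are assumptions made for this example, and the per-annotator scores are averaged with equal weight.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient r of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def leave_one_out(X, f):
    """Compare each annotator's column with the mean of the other columns.

    X is a list of m rows (items), each holding n ratings (annotators);
    f is any function comparing two rating vectors.
    """
    m, n = len(X), len(X[0])
    scores = []
    for j in range(n):
        b_j = [X[i][j] for i in range(m)]
        # average rating of the remaining n - 1 annotators, per item
        rest = [sum(X[i][k] for k in range(n) if k != j) / (n - 1)
                for i in range(m)]
        scores.append(f(b_j, rest))
    return sum(scores) / n

# Toy matrix (hypothetical data): 4 items rated by 3 annotators on a 9-point scale.
X = [
    [1, 2, 1],
    [5, 5, 4],
    [9, 8, 9],
    [3, 4, 2],
]
print(round(leave_one_out(X, pearson), 3))
```

Because `f` is passed in as a parameter, the same function can be reused with an error-based metric in place of `pearson`.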
When comparing a model's prediction to the actual data, it can be very important not only to take correlation-based metrics like r into account, but also error-based metrics (Buechel and Hahn, 2016). This is so because a model may produce very accurate predictions in terms of correlation, while at the same time it may perform poorly when taking errors into account (for instance, when the predicted values range in a much smaller interval than the actual values).
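The point about value ranges can be made concrete with a small worked example. The sketch below uses invented numbers: the predictions are a compressed copy of the gold ratings (pulled toward the midpoint 5 of a 9-point scale), so the correlation is perfect while the mean absolute error and the root mean square error are large. All names and values are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient r of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def mae(x, y):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

def rmse(x, y):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Hypothetical gold ratings spanning the full 9-point scale.
gold = [1.0, 3.0, 5.0, 7.0, 9.0]
# Hypothetical predictions: same ordering, but squeezed into the interval [4, 6].
pred = [5 + 0.25 * (g - 5) for g in gold]

print("r    =", round(pearson(gold, pred), 3))
print("MAE  =", round(mae(gold, pred), 3))
print("RMSE =", round(rmse(gold, pred), 3))
```

A correlation-only evaluation would rate these predictions as flawless even though the extreme items are off by three scale points each.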
To be able to compare a system's performance more directly to the human ceiling, we also apply error-based metrics within this leave-one-out framework. The most popular ones for emotion analysis are Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) (Paltoglou et al., 2013):
$$\mathrm{MAE}(x, y) := \frac{1}{m} \sum_{i=1}^{m} |x_i - y_i| \tag{4}$$
$$\mathrm{RMSE}(x, y) := \sqrt{\frac{1}{m} \sum_{i=1}^{m} (x_i - y_i)^2} \tag{5}$$
One of the drawbacks of this framework is that each $x_{ij}$ from matrix $X$ has to be known in order to calculate the IAA. An alternative method was verbally described by Buechel and Hahn (2016) which can be computed out of mean and SD values for each item alone (a format often available from psychological papers). Let $X$ be defined as above and let $a_i$ denote the mean value for the $i$-th item. Then, the Average Annotation Standard Deviation (AASD) is defined as
$$\mathrm{AASD}(X) := \frac{1}{m} \sum_{i=1}^{m} \sqrt{\frac{1}{n} \sum_{j=1}^{n} (x_{ij} - a_i)^2} \tag{6}$$
Emotionality. While IAA is indubitably the most important quality criterion for emotion annotation, we argue that there is at least one additional criterion that is not covered by prior research: When using numerical scales (especially ones with a large number of rating points, e.g., the 9-point scales we will use in our experiments) annotations where only neutral ratings are used will be unfavorable for future applications (e.g., training models). Therefore, it is important that the annotations are properly distributed over the full range of the scale. This issue is especially relevant in our setting as different perspectives may very well differ in the extremity of their reactions, as evident from Example (1). We call this desirable property the emotionality (EMO) of the annotations.
For the EMO metric, we first derive aggregated ratings from the individual rating decisions of the annotators, i.e., the ratings that would later form the final ratings of a corpus. For that, we aggregate the rating matrix $X$ from Equation 1 into the vector $y$ consisting of the respective row means $y_i$:
$$y_i := \frac{1}{n} \sum_{j=1}^{n} x_{ij}$$
$$y := (y_1, ..., y_i, ..., y_m)$$
Since we use the VAD model, we will have one such aggregated vector per VAD dimension. We denote them $y^1$, $y^2$ and $y^3$. Let the matrix $Y = (y_i^j) \in \mathbb{R}^{m \times 3}$ hold the aggregated ratings of item $i$ for dimension $j$, and let $N$ denote the neutral rating (e.g., 5 on a 9-point scale). Then,
$$\mathrm{EMO}(Y) := \frac{1}{3m} \sum_{j=1}^{3} \sum_{i=1}^{m} |y_i^j - N|$$
Representative Distribution. A closely related quality indicator relates to the representativeness of the resulting rating distribution. For large sets of stimuli (words as well as sentences), numerous studies consistently report that when using SAM-like scales, typically the emotion ratings closely resemble a normal distribution, i.e., the density plot displays a Gaussian, "bell-shaped" curve (see Figure 3b) (Preoţiuc-Pietro et al., 2016; Warriner et al., 2013; Stadthagen-Gonzalez et al., 2016; Montefinese et al., 2014).
Intuitively, it makes sense that most of the sentences under annotation should be rather neutral, while only few of them carry extreme emotions. Therefore, we argue that ideally the resulting aggregated ratings for an emotion annotation task should be normally distributed. Otherwise, it must be seriously called into question to what extent the respective data set can be considered representative, possibly reducing the performance of models trained thereon. Consequently, we will also take the density plot of the ratings into account when comparing different set-ups.
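The two item-level measures introduced in this section, AASD and EMO, can be sketched as follows. This is an illustrative reading of the verbal definitions rather than the authors' implementation: it assumes the population standard deviation (dividing by the number of annotators) for AASD and an average over all items and dimensions for EMO, and the function names and toy values are invented for this example.

```python
import math

def aasd(X):
    """Average Annotation Standard Deviation.

    X is a list of m rows (items), each holding the ratings of n annotators.
    Assumption: population SD (division by n) per item.
    """
    total = 0.0
    for row in X:
        a_i = sum(row) / len(row)  # mean rating of this item
        total += math.sqrt(sum((x - a_i) ** 2 for x in row) / len(row))
    return total / len(X)

def emo(Y, neutral=5.0):
    """Mean absolute distance of aggregated ratings from the neutral point.

    Y is a list of m rows, each holding the aggregated (Valence, Arousal,
    Dominance) ratings of one item; neutral is N (5 on a 9-point scale).
    """
    dims = len(Y[0])
    return sum(abs(y - neutral) for row in Y for y in row) / (dims * len(Y))

# Hypothetical data: two items rated by two annotators (one VAD dimension).
ratings = [[4, 6], [5, 5]]
# Hypothetical aggregated VAD ratings for two items.
aggregated = [[5.0, 5.0, 5.0], [9.0, 1.0, 5.0]]

print("AASD =", aasd(ratings))
print("EMO  =", round(emo(aggregated), 3))
```

Lower AASD means better agreement, whereas higher EMO means that the aggregated ratings stray further from the neutral midpoint.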

Experiments
Perspectives to Distinguish. Considering Example (1) and our literature review from Section 2, it is obvious that at least the perspective of the writer and the reader of an utterance must be distinguished. Accordingly, writer emotion refers to how someone feels while producing an utterance, whereas reader emotion relates to how someone feels right after reading or hearing this utterance.
Also taking into account the finding by Mohammad and Turney (2013) that agreement among annotators is higher when asking whether a word is associated with an emotion rather than asking whether it evokes this emotion, we propose to extend the common writer-reader framework by a third category, the text perspective, where no actual person is specified as perceiving an emotion. Rather, we assume for this perspective that emotion is an intrinsic property of a sentence (or an alternative linguistic unit like a phrase or the entire text). In the following, we will use the terms WRITER, TEXT and READER to concisely refer to the respective perspectives.
Data Sets. We collected two data sets, a movie review data set highly popular in sentiment analysis and a balanced corpus of general English. In this way, we can estimate the annotation quality resulting from different perspectives, also covering interactions regarding different domains.
The first data set builds upon the corpus originally introduced by Pang and Lee (2005). It consists of about 10k snippets from movie reviews by professional critics collected from the website rottentomatoes.com. The data was further enriched by Socher et al. (2013) who annotated individual nodes in the constituency parse trees according to a 5-point polarity scale, forming the Stanford Sentiment Treebank (SST) which contains 11,855 sentences.
Upon closer inspection, we noticed that the SST data have some encoding issues (e.g., Absorbing character study by AndrÃ c Turpin .) that are not present in the original Rotten Tomatoes data set. So we decided to replicate the creation of the SST data from the original snippets. Furthermore, we filtered out fragmentary sentences automatically (e.g., beginning with comma, dashes, lower case, etc.) as well as manually excluded grammatically incomplete and therefore incomprehensible sentences, e.g., "Or a profit" or "Over age 15?". Subsequently, a total of 10,987 sentences could be mapped back to SST IDs forming the basis for our experiments (the SST* collection).
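The automatic part of this filtering can be approximated by a simple heuristic on the first character of each snippet. The sketch below is an assumption about how such a filter might look, not the procedure actually used: the function name `looks_fragmentary` and the exact set of leading punctuation marks are invented for this example.

```python
# Leading characters treated as signs of a fragment (assumed set):
# comma, hyphen, en dash, em dash, semicolon, colon.
LEADING_PUNCTUATION = ",-\u2013\u2014;:"

def looks_fragmentary(sentence):
    """Flag snippets that start with punctuation or a lower-case letter."""
    stripped = sentence.strip()
    if not stripped:
        return True
    first = stripped[0]
    return first in LEADING_PUNCTUATION or first.islower()

# Hypothetical snippets for illustration.
candidates = [
    ", but it works .",
    "- a thriller without thrills",
    "absorbing character study",
    "Or a profit",
    "A solid film .",
]
kept = [s for s in candidates if not looks_fragmentary(s)]
print(kept)
```

A surface heuristic of this kind cannot detect grammatically incomplete snippets that start with a capital letter, such as "Or a profit", which is why a manual pass remains necessary.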
To complement our review language data set, a domain heavily focused on in sentiment analysis (Liu, 2015), for our second data set, we decided to rely on a genre-balanced corpus. We chose the Manually Annotated Sub-Corpus (MASC) of the American National Corpus which is already annotated for various linguistic levels (Ide et al., 2008; Ide et al., 2010). We excluded registers containing spoken, mainly dialogic or non-standard language, e.g., telephone conversations, movie scripts and tweets. To further enrich this collection of raw data for potential emotion analysis applications, we additionally included the corpus of the SemEval-2007 Task 14 focusing on Affective Text (SE07; Strapparava and Mihalcea (2007)), one of the most important data sets in emotion analysis. This data set already bears annotations according to Ekman's six Basic Emotions (see Section 2) so that the gold standard we ultimately supply already contains a bi-representational part (being annotated according to a dimensional and a categorical model of emotion). Such a double encoding will easily allow for research on automatically mapping between different emotion formats (Buechel and Hahn, 2017).
In order to identify individual sentences in MASC, we relied on the already available annotations. We noticed, however, that a considerable portion of the sentence boundary annotations were duplicates which we consequently removed (about 5% of the preselected data). This left us with a total of 18,290 sentences from MASC and 1,250 headlines from SE07. Together, they form our second data set, MASC*.

Study Design. We drew a random sample of 40 sentences from MASC* and SST*, respectively. For each of the three perspectives WRITER, READER and TEXT, we prepared a separate set of instructions. Those instructions are identical, except for the exact phrasing of what a participant should annotate: For WRITER, it was consistently asked "what emotion is expressed by the author", while TEXT and READER queried "what emotion is conveyed" by and "how do you [the participant of the survey] feel after reading" an individual sentence, respectively.
After reviewing numerous studies from NLP and psychology that had created emotion annotations (e.g., Katz et al. (2007), Strapparava and Mihalcea (2007), Mohammad and Turney (2013), Pinheiro et al. (2016), Warriner et al. (2013)), we largely relied on the instructions used by Bradley and Lang (1999) as this is one of the first and probably the most influential resource from psychology which also greatly influenced work in NLP (Preoţiuc-Pietro et al., 2016).
The instructions were structured as follows. After a general description of the study, the individual scales of SAM were explained to the participants. After that, they performed three trial ratings to familiarize themselves with the usage of the SAM scales before proceeding to judge the actual 40 sentences of interest. The study was implemented as a web survey using Google Forms. 1 The sentences were presented in randomized order, i.e., they were shuffled for each participant individually.
For each of the six resulting surveys (one for each combination of perspective and data set), we recruited 80 participants via the crowdsourcing platform crowdflower.com (CF). The number was chosen so that the differences in IAA may reach statistical significance (according to the leave-one-out evaluation (see Section 3), the number of cases is equal to the number of raters). The surveys went online one after the other, so that as few participants as possible would do more than one of the surveys. The task was available from within the UK, the US, Ireland, Canada, Australia and New Zealand.
We preferred using an external survey over running the task directly via the CF platform because this set-up offers more design options, such as randomization, which is impossible via CF; there, the data is only shuffled once and will then be presented in the same order to each participant. The drawback of this approach is that we cannot rely on CF's quality control mechanisms.
In order to still be able to exclude malicious raters, we introduced an algorithmic filtering process where we summed up the absolute error the participants made on the trial questions. These questions asked them to indicate the VAD values for a verbally described emotion so that the correct answers were evident from the instructions. Raters whose absolute error was above a certain threshold were excluded.
We set this parameter to 20 (removing about a third of the responses) because this was approximately the ratio of raters which struck us as unreliable when manually inspecting the data while, at the same time, leaving us with a reasonable number of cases to perform statistical analysis.
1 https://forms.google.com/
Table 1: IAA values obtained on the SST* and the MASC* data set. r, MAE and RMSE refer to the respective leave-one-out metric (see Section 3).
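The rater filter described above amounts to summing absolute deviations on the trial items and applying a cut-off. The following sketch assumes nine trial values per rater (three trial ratings with one value each for Valence, Arousal and Dominance); the expected answers, rater identifiers and function names are hypothetical, and only the threshold of 20 is taken from the text.

```python
THRESHOLD = 20  # cut-off reported in the text

def trial_error(answers, expected):
    """Summed absolute error of one rater on the trial questions."""
    return sum(abs(a - e) for a, e in zip(answers, expected))

def keep_reliable(raters, expected, threshold=THRESHOLD):
    """Keep raters whose summed trial error does not exceed the threshold."""
    return {rater_id: answers
            for rater_id, answers in raters.items()
            if trial_error(answers, expected) <= threshold}

# Hypothetical correct answers: 3 trial emotions x (Valence, Arousal, Dominance).
expected = [9, 7, 7, 1, 3, 2, 5, 1, 5]
# Hypothetical raters: "a" answers carefully, "b" always clicks the midpoint.
raters = {
    "a": [8, 7, 6, 2, 3, 2, 5, 2, 5],
    "b": [5, 5, 5, 5, 5, 5, 5, 5, 5],
}
kept = keep_reliable(raters, expected)
print(sorted(kept))
```

A rater who ignores the instructions and always chooses the scale midpoint accumulates error on every clearly non-neutral trial item, which is the kind of behavior such a threshold is meant to catch.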
The results of this analysis are presented in the following section. Our two small-sized yet multiperspectival data sets are publicly available for further analysis.

Results
In this section, we compare the three annotation perspectives (WRITER, READER and TEXT) on two different data sets (SST* and MASC*; see Section 4), according to three criteria for annotation quality: IAA, emotionality and distribution (see Section 3).
Inter-Annotator Agreement. Since there is no consensus on a fixed set of metrics for numerical emotion values, we compare IAA according to a range of measures. We use r, MAE and RMSE in the leave-one-out framework, as well as AASD (see Section 3). Table 1 displays our results for the SST* and MASC* data set. We calculated IAA individually for Valence, Arousal and Dominance. However, to keep the number of comparisons feasible, we restrict ourselves to presenting the respective mean values (average over VAD), only. The relative ordering between the VAD dimensions is overall consistent with prior work so that Valence shows better IAA than Arousal or Dominance (in line with findings from Warriner et al. (2013) and Schmidtke et al. (2014)).
We find that on the review-style SST* data, WRITER displays the best IAA according to all of the four metrics (p < 0.05 using a two-tailed t-test, respectively). Note that MAE, RMSE and AASD are error-based so that the smaller the value the better the agreement. Concerning the ordering of the remaining perspectives, TEXT is marginally better regarding r, while the results from the three error-based metrics are clearly in favor of READER. Consequently, for IAA on the SST* data set, WRITER yields the best performance, while the order of the other perspectives is not so clear. Surprisingly, the results look markedly different on the MASC* data. Here, regarding r, WRITER and TEXT are on par with each other. This contrasts with the results from the error-based metrics. There, TEXT shows the best value, while WRITER, in turn, improves upon READER only by a small margin. Most importantly, for none of the four metrics do we obtain statistical significance between the best and the second best perspective (p ≥ 0.05 using a two-tailed t-test, respectively). Thus, concerning IAA on the MASC* sample, the results remain rather opaque.
The fact that, contrary to that, on SST* the results are conclusive and statistically significant, strongly suggests that the resulting annotation quality is not only dependent on the annotation perspective. Instead, there seem to be considerable dependencies and interactions concerning the domain of the raw data, as well.
Interestingly, on both corpora correlation- and error-based sets of metrics behave inconsistently, which we interpret as a piece of evidence for using both types of metrics in parallel (Buechel and Hahn, 2016).
Emotionality. For emotionality, we rely on the EMO metric which we defined in Section 3 (see Table 2 for our results). For both corpora, the ordering of the perspectives according to the EMO score is consistent: WRITER yields the most emotional ratings, followed by TEXT and READER (p < 0.05 for each of the pairs using a two-tailed t-test). These unanimous and statistically significant results further underpin the advantage of the TEXT and especially the WRITER perspective as already suggested by our findings for IAA.
Distribution. We also looked at the distribution of the resulting aggregated annotations relative to the chosen data sets and the three perspectives by examining the respective density plots. In Figure 3, we give six examples of these plots, displaying the Valence density curve for both corpora, SST* and MASC*, as well as the three perspectives. For Arousal and Dominance, the plots show the same characteristics although slightly less pronounced. The left density plots, for the SST*, display a bimodal distribution (having two local maxima), whereas the MASC* plots are much closer to a normal distribution. This second shape has been consistently reported by many contributions (see Section 3), whereas we know of no other study reporting a bimodal emotion distribution. This highly atypical finding for SST* might be an artifact of the website from which the original movie review snippets were collected. There, movies are classified into either fresh (positive) or rotten (negative). Consequently, this binary classification scheme might have influenced the selection of snippets from full-scale reviews (as performed by the website) so that these snippets are either clearly positive or negative.
Thus, our findings seriously call into question to what extent the movie review corpus by Pang and Lee (2005), one of the most popular data sets in sentiment analysis, can be considered representative for review language or general English. Ultimately, this may result in a reduced performance of models trained on such skewed data.
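The distinction between a unimodal and a bimodal rating distribution was judged from density plots in this study. As a rough complement, the number of peaks in a coarse histogram of the aggregated ratings can be counted. The sketch below is a crude stand-in for visual inspection and not the method used here: the bin layout, the function names and the toy ratings are assumptions, and real data would typically need smoothing before peaks are counted.

```python
def histogram(values, low=1.0, high=9.0, bins=8):
    """Count values per equally wide bin on the rating scale [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / width), bins - 1)  # top edge goes in last bin
        counts[index] += 1
    return counts

def count_peaks(counts):
    """Count local maxima, treating a run of equal counts as one plateau."""
    padded = [0] + list(counts) + [0]
    runs = [padded[0]]
    for c in padded[1:]:
        if c != runs[-1]:
            runs.append(c)
    return sum(1 for i in range(1, len(runs) - 1)
               if runs[i - 1] < runs[i] > runs[i + 1])

# Hypothetical aggregated Valence ratings.
two_clusters = [2.0, 2.5, 2.5, 3.0, 3.0, 3.0, 7.0, 7.0, 7.0, 7.5, 7.5, 8.0]
one_cluster = [3.5, 4.5, 4.5, 5.0, 5.0, 5.0, 5.5, 5.5, 6.5]

print(count_peaks(histogram(two_clusters)))
print(count_peaks(histogram(one_cluster)))
```

With few items or narrow bins, sampling noise alone produces spurious peaks, so the count should be read as a hint to look at the density plot rather than as a test of bimodality.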

Discussion
Overall, we interpret our data as suggesting the WRITER perspective to be superior to TEXT and READER: Considering IAA, it is significantly better on one data set (SST*), while it is on par with or only marginally worse than the best perspective on the other data set (MASC*). Regarding emotionality of the aggregated ratings (EMO), the superiority of this perspective is even more obvious.
The relative order of TEXT and READER, on the other hand, is not so clear. Regarding IAA, TEXT is better on MASC* while for SST* READER seems to be slightly better (almost on par regarding r but markedly better relative to the error measures we propose here). However, regarding the emotionality of the ratings, TEXT clearly surpasses READER.
Our data suggest that the results of Mohammad and Turney (2013) (the only comparable study so far, though considering emotion on the word rather than sentence level) may also hold for sentences in most of the cases. They found that phrasing the emotion annotation task relative to the TEXT perspective yields higher IAA than relating to the READER perspective. However, our data indicate that the validity of their findings may depend on the domain the raw data originate from. More importantly, our data complement their results by presenting evidence that WRITER seems to be even better than any of the two perspectives they took into account.

Conclusion
This contribution presented a series of annotation experiments examining which annotation perspective (WRITER, TEXT or READER) yields the best IAA, also taking domain differences into account. It is the first study of this kind for sentence-level emotion annotation. We began by reviewing different popular representation schemes for emotion before (formally) defining various metrics for annotation quality; for the VAD scheme we use, this task was so far neglected in the literature.
Our findings strongly suggest that WRITER is overall the superior perspective. However, the exact ordering of the perspectives strongly depends on the domain the data originate from. Our results are thus mainly consistent with, but substantially go beyond, the only comparable study so far (Mohammad and Turney, 2013). Furthermore, our data provide strong evidence that the movie review corpus by Pang and Lee (2005), one of the most popular ones for sentiment analysis, may not be representative in terms of its rating distribution, potentially casting doubt on the quality of models trained on this data.
For the subsequent creation of EMOBANK, a large-scale VAD gold standard, we took the following decisions in the light of these not fully conclusive outcomes. First, we decided to annotate a 10k-sentence subset of the MASC* corpus considering the atypical rating distribution in the SST* data set. Furthermore, we decided to annotate the whole corpus bi-perspectivally (according to WRITER and READER viewpoint) as we hope that the resulting resource helps clarify which factors exactly influence emotion annotation quality. This freely available resource is further described in Buechel and Hahn (2017).