Annotation, Modelling and Analysis of Fine-Grained Emotions on a Stance and Sentiment Detection Corpus

There is a rich variety of data sets for sentiment analysis (viz., polarity and subjectivity classification). For the more challenging task of detecting discrete emotions following the definitions of Ekman and Plutchik, however, there are much fewer data sets, and notably no resources for the social media domain. This paper contributes to closing this gap by extending the SemEval 2016 stance and sentiment dataset with emotion annotation. We (a) analyse annotation reliability and annotation merging; (b) investigate the relation between emotion annotation and the other annotation layers (stance, sentiment); (c) report modelling results as a baseline for future work.


Introduction
Emotion recognition is a research area in natural language processing concerned with associating words, phrases or documents with predefined emotions from psychological models. Discrete emotion recognition assigns categorial emotions (Ekman, 1999;Plutchik, 2001), namely Anger, Anticipation, Disgust, Fear, Joy, Sadness, Surprise und Trust. Compared to the very active area of sentiment analysis, whose goal is to recognize the polarity of text (e. g., positive, negative, neutral, mixed), few resources are available for discrete emotion analysis.
Emotion analysis has been applied to several domains, including tales (Alm et al., 2005), blogs (Aman and Szpakowicz, 2007) and microblogs (Dodds et al., 2011). The latter in particular provides a major data source in the form of user messages from platforms such as Twitter (Costa et al., * We thank Marcus Hepting, Chris Krauter, Jonas Vogelsang, Gisela Kollotzek for annotation and discussion. 2014) which contain semi-structured information (hashtags, emoticons, emojis) that can be used as weak supervision for training classifiers (Suttles and Ide, 2013). The classifier then learns the association of all other words in the message with the "self-labeled" emotion (Wang et al., 2012).
While this approach provides a practically feasible approximation of emotions, there is no publicly available, manually vetted data set for Twitter emotions that would support accurate and comparable evaluations. In addition, it has been shown that distant annotation is conceptually different from manual annotation for sentiment and emotion (Purver and Battersby, 2012).
With this paper, we contribute manual emotion annotation for a publicly available Twitter data set. We annotate the SemEval 2016 Stance Data set (Mohammad et al., 2016) which provides sentiment and stance information and is popular in the research community (Augenstein et al., 2016;Wei et al., 2016;Dias and Becker, 2016;Ebrahimi et al., 2016). It therefore enables further research on the relations between sentiment, emotions, and stances. For instance, if the distribution of subclasses of positive or negative emotions is different for against and in-favor, emotion-based features could contribute to stance detection.
An additional feature of our resource is that we do not only provide a "majority annotation" as is usual. We do define a well-performing aggregated annotation, but additionally provide the individual labels of each of our six annotators. This enables further research on differences in the perception of emotions.

Background and Related Work
For a review of the fundaments of emotion and sentiment and the differences between these concepts, we refer the reader to Munezero et al. (2014).  For sentiment analysis, a large number of annotated data sets exists. These include review texts from different domains, for instance from Amazon and other shopping sites (Hu and Liu, 2004;Ding et al., 2008;Toprak et al., 2010;Lakkaraju et al., 2011), restaurants (Ganu et al., 2009), news articles (Wiebe et al., 2005), blogs (Kessler et al., 2010), as well as microposts on Twitter. For the latter, shown in the upper half of Table 1, there are general corpora (Nakov et al., 2013;Spina et al., 2012;Thelwall et al., 2012) as well as ones focused on very specific subdomains, for instance on Obama-McCain Debates (Shamma et al., 2009), Health Care Reforms (Speriosu et al., 2011). A popular example for a manually annotated corpus for sentiment, which includes stance annotation for a set of topics is the SemEval 2016 data set (Mohammad et al., 2016).
For emotion analysis, the set of annotated resources is smaller (compare the lower half of Table 1). A very early resource is the ISEAR data set (Scherer and Wallbott, 1997) which contains descriptions of emotional events. While motivated by psychological research, it was later repurposed for computational research. The first data set developed specifically for computational research was the tales corpus by Alm et al. (2005). Aman and Sz-pakowicz (2007) published a corpus of blog posts. In the context of SemEval, Strapparava and Mihalcea (2007) annotated news headlines.
A notable gap is the unavailability of a publicly available set of microposts (e. g., tweets) with emotion labels. To the best of our knowledge, there are only three previous approaches to labeling tweets with discrete emotion labels. One is the recent data set on for emotion intensity estimation, a shared task aiming at the development of a regression model. The goal is not to predict the emotion class, but a distribution over their intensities, and the set of emotions is limited to fear, sadness, anger, and joy (Mohammad and Bravo-Marquez, 2017).
Most similar to our work is a study by Roberts et al. (2012) which annotated 7,000 tweets manually for 7 emotions (anger, disgust, fear, joy, love, sadness and surprise). They chose 14 topics which they believe should elicit emotional tweets and collect hashtags to help identify tweets that are on these topics. After several iterations, the annotators reached κ = 0.67 inter-annotator agreement on 500 tweets. Unfortunately, the data appear not to be available any more. An additional limitation of that dataset was that 5,000 of the 7,000 tweets were annotated by one annotator only. In contrast, we provide several annotations for each tweet.  Table 2: Corpus Statistics. The threshold t measures that a fraction of more than t annotators labeled the respective emotion (e. g., t=0.0: at least one annotator t=0.99: all annotators). Overall number of tweets: 4,868. Mohammad et al. (2015) annotated electoral tweets for sentiment, intensity, semantic roles, style, purpose and emotions. This is the only available corpus similar to our work we are aware of. However, the focus of this work was not emotion annotation in contrast to ours. In addition, we publish the data of all annotators.

Annotation Procedure
As motivated above, we re-annotate the extended SemEval 2016 Stance Data set (Mohammad et al., 2016) which consists of 4,870 tweets (a subset of which was used in the SemEval competition). For a discussion of the differences of these data sets, we refer to . We omit two tweets with special characters, which leads to an overall set of 4,868 tweets used in our corpus. 1 We frame annotation as a multi-label classification task at the tweet level. The tweets were annotated by a group of six independent annotators, with a minimum number of three annotations for each tweet (696 tweets were labeled by 6 annotators, 703 by 5 annotators, 2,776 by 4 annotators and 693 by 3 annotators). All annotators were undergraduate students of media computer science and between the age of 20 and 30. Only one annotator is female. All students are German native speak-1 Our annotations and original tweets are available at http://www.ims.uni-stuttgart.de/data/ ssec and http://alt.qcri.org/semeval2016/ task6/data/uploads/stancedataset.zip, see also http://alt.qcri.org/semeval2016/task6.  To train the annotators on the task, we performed two training iterations based on 50 randomly selected tweets from the SemEval 2016 Task 4 corpus (Nakov et al., 2016). After each iteration, we discussed annotation differences (informally) in face-to-face meetings.
For the final annotation, tweets were presented to the annotators in a web interface which paired a tweet with a set of binary check boxes, one for each emotion. Taggers could annotate any set of emotions. Each annotator was assigned with 5/7 of the corpus with equally-sized overlap of instances based on an offset shift. Not all annotators finished their task. 2

Emotion Annotation Reliability and Aggregated Annotation
Our annotation represents a middle ground between traditional linguistic "expert" annotation and crowdsourcing: We assume that intuitions about emotions diverge more than for linguistic structures. At the same time, we feel that there is information in the individual annotations beyond the simple "majority vote" computed by most crowdsourcing studies. In this section, we analyse the annotations intrinsically; a modelling-based evaluation follows in Section 5. Our first analysis, shown in Table 2, compares annotation strata with different agreement. For example, the column labeled 0.0 lists the frequencies of emotion labels assigned by at least one annotator, a high recall annotation. In contrast, the column labeled 0.99 lists frequencies for emotion labels that all annotators agreed on. This represents a high  These numbers confirm that emotion labeling is a somewhat subjective task: only a small subset of the emotions labeled by at least one annotator (t=0.0) is labeled by most (t=0.66) or all of them (t=0.99). Interestingly, the exact percentage varies substantially by emotion, between 2 % for sadness and 20 % for anger.
Many of these disagreements stem from tweets that are genuinely difficult to categorize emotionally, like That moment when Canadians realised global warming doesn't equal a tropical vacation for which one annotator chose anger and sadness, while one annotator chose surprise. Arguably, both annotations capture aspects of the meaning. Similarly, the tweet 2 pretty sisters are dancing with cancered kid (a reference to an online video) is marked as fear and sadness by one annotator and with joy and sadness by another. Naturally, not all differences arise from justified annotations. For instance the tweet #BIBLE = Big Irrelevant Book of Lies and Exaggerations has been labeled by two annotators with the emotion trust, presumably because of the word bible. This appears to be a classical oversight error, where the tweet is labeled on the basis of the first spotted keyword, without substantially studying its content.
To quantify these observations, we follow general practice and compute a chance-corrected measure of inter-annotator agreement. Table 3 shows the minimum and maximum Cohen's κ values for pairs of annotators, computed on the intersection of instances annotated by either annotator within each pair. We obtain relatively high κ values of anger, joy, and trust, but lower values for the other emotions.
These small κ values could be interpreted as indicators of problems with reliability. However, κ is notoriously difficult to interpret, and a number of studies have pointed out the influence of marginal frequencies (Cicchetti and Feinstein, 1990): In the presence of skewed marginals (and most of our emotion labels are quite rare, cf. Table 2), the expected agreement (referred to as P (E) in contrast to P (A) for the empirical agreement) is quite high. This makes it hard to obtain high κ values; thus, low κ values do not necessarily indicate unreliable annotation.
To avoid these methodological problems, we assess the usefulness of our annotation extrinsically by comparing the performance of computational models for different values of t. In a nutshell, these experiments will show best results t=0.0, i. e., the  Table 5: Tweet Counts (above diagonal) and odds ratio (below diagonal) for cooccurring annotations for all classes in the corpus (emotions based on majority annotation, t=0.5).
high-recall annotation (see Section 5 for details). We therefore define t=0.0 as our aggregated annotation. For comparison, we also consider t=0.5, which corresponds to the majority annotation as generally adopted in crowdsourcing studies.

Distribution of Emotions
As shown in Table 2, nearly 60 % of the overall tweet set are annotated with anger by at least one annotator. This is the predominant emotion class, followed by anticipation and sadness. This distribution is comparably uncommon and originates from the selection of tweets in SemEval as a stance data set. However, while anger clearly dominates in the aggregated annotation, its predominance weakens for the more precision-oriented data sets. For t=0.99, joy becomes the second most frequent emotion. In uniform samples from Twitter, joy typically dominates the distribution of emotions (Klinger, 2017). It remains a question for future work how to reconciliate these observations. Table 4 shows the number of cooccurring label pairs (above the diagonal) and the odds ratios (below the diagonal) for emotion, stance, and sentiment annotations on the whole corpus for our aggregated annotation (t=0.0). Odds ratio is

Emotion vs. other Annotation Layers
where P (A) is the probability that both labels (at row and column in the table) hold for a tweet and P (B) is the probability that only one holds. A ratio of x means that the joint labeling is x times more likely than the independent labeling. Table 5 shows the same numbers for the majority annotation, t=0.5. We first analyze the relationship between emotions and sentiment polarity in Table 4. For many emotions, the polarity is as expected: Joy and trust occur predominantly with positive sentiment, and anger, disgust, fear and sadness with negative sentiment. The emotions anticipation and surprise are, in comparison, most balanced between polarities, however with a majority for positive sentiment in anticipation and a negative sentiment for surprise. For most emotions there is also a non-negligible number of tweets with the sentiment opposite to a common expectation. For example, anger occurs 28 times with positive sentiment, mainly tweets which call for (positive) change regarding a controversial topic, for instance Lets take back our country! Whos with me? No more Democrats!2016 Why criticise religions? If a path is not your own. Don't be pretentious. And get down from your throne.
Conversely, more than 15 % of the joy tweets carry negative sentiment. These are often cases in which either the emotion annotator or the sentiment annotator assumed some non-literal meaning to be associated with the text (mainly irony), for instance Global Warming! Global Warming! Global Warming! Oh wait, it's summer.
I love the smell of Hillary in the morning. It smells like Republican Victory.
Disgust occurs almost exclusively with negative sentiment.
For the majority annotation (Table 5), the number of annotations is smaller. However, the average size of the odds ratios increase (from 1.96 for t=0.0 to 5.39 for t=0.5).
A drastic example is disgust in combination with negative sentiment, the predominant combination. Disgust is only labeled once with positive sentiment in the t=0.5 annotation: #WeNeedFeminism because #NoMeansNo it doesnt mean yes, it doesnt mean try harder! Similarly, the odds ratio for the combination anger and negative sentiment nearly doubles from 20.3 for t=0.0 to 41.47 for t=0.5. These numbers are an effect of the majority annotation having a higher precision in contrast to more "noisy" aggregation of all annotations (t=0.0).
Regarding the relationship between emotions and stance, most odds ratios are relatively close to 1, indicating the absence of very strong correlations. Nevertheless, the "Against" stance is associated with a number of negative emotions (anger, disgust, sadness, the "In Favor" stance with joy, trust, and anticipation, and "None" with an absence of all emotions except surprise.

Models
We apply six standard models to provide baseline results for our corpus: Maximum Entropy (MAXENT), Support Vector Machines (SVM), a Long-Short Term Memory Network (LSTM), a Bidirectional LSTM (BI-LSTM), and a Convolutional Neural Network (CNN).
MaxEnt and SVM classify each tweet separately based on a bag-of-words. For the first, the linear separator is estimated based on log-likelihood optimization with an L2 prior. For the second, the optimization follows a max-margin strategy.
LSTM (Hochreiter and Schmidhuber, 1997) is a recurrent neural network architecture which includes a memory state capable of learning long distance dependencies. In various forms, they have proven useful for text classification tasks (Tai et al., 2015;Tang et al., 2016). We implement a standard LSTM which has an embedding layer that maps the input (padded when needed) to a 300 dimensional vector. These vectors then pass to a 175 dimensional LSTM layer. We feed the final hidden state to a fully-connected 50-dimensional dense layer and use sigmoid to gate our 8 output neurons. As a regularizer, we use a dropout (Srivastava et al., 2014) of 0.5 before the LSTM layer.
Bi-LSTM has the same architecture as the normal LSTM, but includes an additional layer with a reverse direction. This approach has produced stateof-the-art results for POS-tagging (Plank et al., 2016), dependency parsing (Kiperwasser and Goldberg, 2016) and text classification (Zhou et al., 2016), among others. We use the same parameters as the LSTM, but concatenate the two hidden layers before passing them to the dense layer.
CNN has proven remarkably effective for text classification (Kim, 2014;dos Santos and Gatti, 2014;Flekova and Gurevych, 2016) . We train a simple one-layer CNN with one convolutional layer on top of pre-trained word embeddings, following Kim (2014). The first layer is an embeddings layer that maps the input of length n (padded when needed) to an n x 300 dimensional matrix. The embedding matrix is then convoluted with filter sizes of 2, 3, and 4, followed by a pooling layer of length 2. This is then fed to a fully connected dense layer with ReLu activations and finally to the 8 output neurons, which are gated with the sigmoid function. We again use dropout (0.5), this time before and after the convolutional layers.
For all neural models, we initialize our word representations with the skip-gram algorithm with negative sampling (Mikolov et al., 2013), trained on nearly 8 million tokens taken from tweets collected using various hashtags. We create 300-dimensional vectors with window size 5, 15 negative samples and run 5 iterations. For OOV words, we use a vector initialized randomly between -0.25 and 0.25 to approximate the variance of the pretrained vectors. We train our models using ADAM (Kingma and Ba, 2015) and a minibatch size of 32. We set 10 % of  Table 6: Results of linear and neural models for labels from the aggregated annotation (t=0.0). For the neural models, we report the average of five runs and standard deviation in brackets. Best F 1 for each emotion shown in boldface.
the training data aside to tune the hyperparameters for each model (hidden dimension size, dropout rate, and number of training epochs). Table 6 shows the results for our canonical annotation aggregation with t=0.0 (aggregated annotation) for our models. The two linear classifiers (trained as MAXENT and SVM) show comparable results, with an overall micro-average F 1 of 58 %. All neural network approaches show a higher performance of at least 2 percentage points (3 pp for LSTM, 4 pp for BI-LSTM, 2 pp for CNN). BI-LSTM also obtains the best F-Score for 5 of the 8 emotions (4 out of 8 for LSTM and CNN). We conclude that the BI-LSTM shows the best results of all our models. Our discussion focuses on this model. The performance clearly differs between emotion classes. Recall from Section 3.2 that anger, joy and trust showed much higher agreement numbers than the other annotations. There is however just a mild correlation between reliability and modeling performance. Anger is indeed modelled very well: it shows the best prediction performance with a similar precision and recall on all models. We ascribe this to it being the most frequent emotion class. In contrast, joy and trust show only middling performance, while we see relatively good results for anticipation and sadness even though there was considerable disagreement between annotators. We find the overall worst results for surprise. This is not surprising, surprise being a scarce label with also very low agreement. This might point towards underlying problems in the definition of surprise as an emotion. Some authors have split this class into positive and negative surprise in an attempt to avoid this (Alm et al., 2005).

Results
We finally come to our justification for choosing t=0.0 as our aggregated annotation. Table 7 shows results for the best model (BI-LSTM) on the datasets for different thresholds. We see a clear downward monotone trend: The higher the threshold, the lower the F 1 measures. We obtain the best results, both for individual emotions and at the average level, for t=0.0. This is at least partially counterintuitive -we would have expected a dataset with "more consensual" annotation to yield better models -or at least models with higher precision. This is not the case. Our interpretation is that frequency effects outweigh any other considerations: As Table 2 shows, the amount of labeled data points drops sharply with higher thresholds: even between t=0.0 and t=0.33, on average half of the labels are lost. This interpretation is supported by the behavior of the individual emotions: for emotions where the data sets shrink gradually (anger, joy), performance drops gradually, while it dips sharply for emotions where the data sets shrink fast (disgust, fear). Somewhat surprisingly, therefore, we conclude that t=0.0 appears to be the  In terms of how to deal with diverging annotations, we believe that this result bolsters our general approach to pay attention to individual annotators' labels rather than just majority votes: if the individual labels were predominantly noisy, we would not expect to see relatively high F 1 scores.

Conclusion and Future Work
With this paper, we publish the first manual emotion annotation for a publicly available micropost corpus. The resource we chose to annotate already provides stance and sentiment information. We analyzed the relationships among emotion classes and between emotions and the other annotation layers.
In addition to the data set, we implemented wellknown standard models which are established for sentiment and polarity prediction for emotion classification. The BI-LSTM model outperforms all other approaches by up to 4 points F 1 on average compared to linear classifiers.
Inter-annotator analysis showed a limited agreement between the annotators -the task is, at least to some degree, driven by subjective opinions. We found, however, that this is not necessarily a problem: Our models perform best on a high-recall aggregate annotation which includes all labels assigned by at least one annotator. Thus, we believe that the individual labels have value and are not, like generally assumed in crowdsourcing, noisy inputs suitable only as input for majority voting.
In this vein, we publish all individual annotations. This enables further research on other methods of defining consensus annotations which may be more appropriate for specific downstream tasks. More generally, we will make all annotations, resources and model implementations publicly available.