IEST: WASSA-2018 Implicit Emotions Shared Task

Past shared tasks on emotions use data with both overt expressions of emotions (I am so happy to see you!) and subtle expressions where the emotions have to be inferred, for instance from event descriptions. Further, most datasets do not focus on the cause or the stimulus of the emotion. Here, for the first time, we propose a shared task where systems have to predict the emotions in a large, automatically labeled dataset of tweets without access to words denoting emotions. We call this the Implicit Emotion Shared Task (IEST) because the systems have to infer the emotion mostly from the context. Every tweet contains an occurrence of an explicit emotion word that is masked. The tweets are collected in a manner such that they are likely to include a description of the cause of the emotion – the stimulus. Altogether, 30 teams submitted results, with macro F1 scores ranging from 21 % to 71 %. The baseline (MaxEnt with bag of words and bigrams), which was available to the participants during the development phase, obtains an F1 score of 60 %. A study with human annotators suggests that automatic methods outperform human predictions, possibly by honing in on subtle textual clues not used by humans. Corpora, resources, and results are available at the shared task website at http://implicitemotions.wassa2018.com.


Introduction
The definition of emotion has long been debated. The main subjects of discussion are the origin of the emotion (physiological or cognitive), the components it has (cognition, feeling, behaviour), and the manner in which it can be measured (categorically or with continuous dimensions). The Implicit Emotion Shared Task (IEST) is based on Scherer (2005), who considers emotion as "an episode of interrelated, synchronized changes in the states of all or most of the five organismic subsystems (information processing, support, executive, action, monitor) in response to the evaluation of an external or internal stimulus event as relevant to major concerns of the organism".
This definition suggests that emotion is triggered by the interpretation of a stimulus event (i.e., a situation) according to its meaning, its relevance to personal goals, needs, and values, and the capacity to react. As such, while most situations will trigger the same emotional reaction in most people, some situations may trigger different affective responses in different people. This is explained in more detail by the psychological theories of emotion known as "appraisal theories" (Scherer, 2005).
Emotion recognition from text is a research area in natural language processing (NLP) concerned with the classification of words, phrases, or documents into predefined emotion categories or dimensions. Most research focuses on discrete emotion recognition, which assigns categorical emotion labels (Ekman, 1992; Plutchik, 2001), e.g., Anger, Anticipation, Disgust, Fear, Joy, Sadness, Surprise, and Trust. Previous research developed statistical, dictionary-based, and rule-based models for several domains, including fairy tales (Alm et al., 2005), blogs (Aman and Szpakowicz, 2007), and microblogs (Dodds et al., 2011). Presumably, most models built on such datasets rely on emotion words (or their representations) whenever accessible and are therefore not forced to learn associations for more subtle descriptions. Such models might fail to predict the correct emotion when such overt words are not accessible. Consider the instance "when my child was born" from the ISEAR corpus, a resource in which people were asked to report on events during which they felt a specific predefined emotion. This example does not contain any emotion word itself, though one might argue that the words "child" and "born" have a positive prior connotation. Balahur et al. (2012b) showed that the inference of affect from text often results from the interpretation of the situation presented therein. Therefore, specific approaches have to be designed to understand the emotion that is generally triggered by a situation. Such approaches require common sense and world knowledge (Liu et al., 2003; Cambria et al., 2009). Gathering world knowledge to support NLP is challenging, although different resources have been built to this aim, e.g., Cyc and ConceptNet (Liu and Singh, 2004).
In a different branch of research, the fields of distant supervision and weak supervision address the challenge that manually annotating data is tedious and expensive. Distant supervision tackles this by making use of structured resources to automatically label data (Mintz et al., 2009; Riedel et al., 2010; Mohammad, 2012). This approach has been adapted to emotion analysis by using information that authors assign to their own text, namely hashtags and emoticons (Wang et al., 2012).
With the Implicit Emotion Shared Task (IEST), we aim at combining these two research branches: on the one hand, we use distant supervision to compile a corpus of substantial size; on the other hand, we limit the corpus to those texts which are likely to contain descriptions of the cause of the emotion – the stimulus. Due to the ease of access and the variability and richness of data on Twitter, we opt for compiling a corpus of microposts, from which we sample tweets that contain an emotion word followed by 'that', 'when', or 'because'. We then mask the emotion word and ask systems to predict the emotion category associated with that word. The emotion category can be one of six classes: anger, disgust, fear, joy, sadness, and surprise. Examples from the data are: (1) "It's [#TARGETWORD#] when you feel like you are invisible to others." (2) "My step mom got so [#TARGETWORD#] when she came home from work and saw that the boys didn't come to Austin with me." In Example 1, the inference is that feeling invisible typically makes us sad. In Example 2, the context is presumably that the mother expected something other than what happened. In isolation this might cause anger or sadness; however, since "the boys are home", the mother is likely happy. Note that such examples can be used as a source of commonsense or world knowledge to detect emotions in contexts where the emotion is not explicitly stated.
The shared task was conducted between 15 March 2018 (publication of train and trial data) and the evaluation phase, which ran from 2 to 9 July. Submissions were managed on CodaLab. The best performing systems are all ensembles of deep learning approaches. Several systems make use of additional external resources such as pretrained word vectors, affect lexicons, and language models fine-tuned to the task.
The rest of the paper is organized as follows: we first review related work (Section 2). Section 3 introduces the shared task, the data used, and the setup. The results are presented in Section 4, including the official results and a discussion of different submissions. The automatic systems' predictions are then compared to human performance in Section 5, where we report on a crowdsourcing study with the data used for the shared task. We conclude in Section 6.

Related Work
Related work falls into different directions of research on emotion detection in NLP: resource creation and emotion classification, as well as the modeling of emotion itself.
Modeling emotion computationally has been approached from the perspective of human needs and desires, with the goal of simulating human reactions. Dyer (1987) presents three models which take into account characters, arguments, emotion experiencers, and events. These aspects are modeled with first-order logic in a procedural manner. Similarly, Subasic and Huettner (2001) use fuzzy logic for such modeling in order to capture gradual differences. A similar approach is followed by the OCC model (Ortony et al., 1990), for which Udochukwu and He (2015) show how to connect it to text in a rule-based manner for implicit emotion detection. Despite this early work on holistic computational models of emotions, NLP has focused mostly on a more coarse-grained level.
One of the first corpora annotated for emotions is that by Alm et al. (2005), who analyze sentences from fairy tales. Strapparava and Mihalcea (2007) annotate news headlines with emotions and valence, Mohammad et al. (2015) annotate tweets on elections, and Schuff et al. (2017) tweets of a stance dataset (Mohammad et al., 2017). The SemEval-2018 Task 1: Affect in Tweets (Mohammad et al., 2018) includes several subtasks on inferring the affectual state of a person from their tweet: emotion intensity regression, emotion intensity ordinal classification, valence (sentiment) regression, valence ordinal classification, and multi-label emotion classification. In all of these prior shared tasks and datasets, no distinction is made between implicit and explicit mentions of emotions. We refer the reader to Bostan and Klinger (2018) for a more detailed overview of emotion classification datasets.
Few authors specifically analyze which phrase triggers the perception of an emotion. Aman and Szpakowicz (2007) focus on annotation at the document level but also mark emotion indicators. Mohammad et al. (2014) annotate electoral tweets for semantic roles such as emotion and stimulus (from FrameNet). Ghazi et al. (2015) annotate a subset of Aman and Szpakowicz (2007) with causes (inspired by the FrameNet structure). Kim and Klinger (2018) and Neviarouskaya and Aono (2013) similarly annotate emotion holders, targets, and causes, as well as the trigger words.
One of the oldest resources nowadays used for emotion recognition is the ISEAR set (Scherer, 1997), which consists of self-reports of emotional events. As the task of the participants in this psychological study was not to express an emotion but to report on an event in which they experienced a given emotion, this resource can be considered similar to our goal of focusing on implicit emotion expressions.
With the aim of extending the coverage of ISEAR, Balahur et al. (2011, 2012a) build EmotiNet, a knowledge base to store situations and the affective reactions they have the potential to trigger. They show how the stored knowledge can be expanded using lexical and semantic similarity, as well as through the use of Web-extracted knowledge (Balahur et al., 2013). The patterns used to populate the database are of the type "I feel [emotion] when [situation]", which was also a starting point for our task.
Finally, several approaches take distant supervision into consideration (Mohammad and Kiritchenko, 2015; Abdul-Mageed and Ungar, 2017; De Choudhury et al., 2012; Liu et al., 2017, i.a.). This is motivated by the high availability of user-generated text and by the challenge that manual annotation is typically tedious or expensive, which contrasts with the data demand of current machine learning, and especially deep learning, approaches.
With our work in IEST, we combine the goal of developing models that are able to recognize emotions from implicit descriptions, without having access to explicit emotion words, with the paradigm of distant supervision.
Shared Task

Data
The aim of the Implicit Emotion Shared Task is to force models to infer emotions from the context of emotion words without having access to the words themselves. Specifically, the aim is that models infer the emotion through the causes mentioned in the text. Thus, we build the corpus of Twitter posts by polling the Twitter API for the expression 'EMOTION-WORD (that|because|when)', where EMOTION-WORD is a synonym for one of six emotions. The synonyms are shown in Table 1. The requirement that tweets have either 'that', 'because', or 'when' immediately after the emotion word means that the tweet likely describes the cause of the emotion. The initially retrieved large dataset has a distribution of 25 % surprise, 23 % sadness, 18 % joy, 16 % fear, 10 % anger, and 8 % disgust. We discard tweets with more than one emotion word, as well as exact duplicates, and mask usernames and URLs. From this set, we randomly sample 80 % of the tweets to form the training set (153,600 instances), 5 % as trial set (9,600 instances), and 15 % as test set (28,800 instances). We perform stratified sampling to obtain a balanced dataset. While the shared task took place, two errors in the data preprocessing were discovered by participants: the use of the word unhappy as a synonym for sadness, which led to inconsistent preprocessing in the context of negated expressions, and the occurrence of instances without emotion words. To keep the change to the data at a minimum, the erroneous instances were simply removed, which leads to the distribution of the data shown in Table 2.
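The retrieval-and-masking procedure described above can be sketched as follows. The query pattern (emotion synonym immediately followed by 'that', 'when', or 'because') and the mask token follow the description in the text; the abbreviated synonym list and the function names are illustrative assumptions (Table 1 lists the actual synonyms).

```python
import re

# Abbreviated, hypothetical synonym list; the full set is in Table 1.
SYNONYMS = {
    "anger": ["angry", "furious"],
    "joy": ["happy", "glad"],
    "sadness": ["sad", "depressed"],
}

def build_pattern(words):
    """Match an emotion word immediately followed by 'that', 'when', or 'because'."""
    alt = "|".join(re.escape(w) for w in words)
    return re.compile(r"\b(" + alt + r")\s+(that|when|because)\b", re.IGNORECASE)

def mask_tweet(tweet, pattern, mask="[#TARGETWORD#]"):
    """Replace the matched emotion word with the mask token, keeping the connective."""
    return pattern.sub(lambda m: mask + " " + m.group(2), tweet, count=1)

pattern = build_pattern([w for ws in SYNONYMS.values() for w in ws])
masked = mask_tweet("I was so happy when you called me", pattern)
# masked == "I was so [#TARGETWORD#] when you called me"
```

A system then has to recover the emotion class of the masked word from the remaining context.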

Task Setup
The shared task was announced through a dedicated website (http://implicitemotions.wassa2018.com/) and computational-linguistics-specific mailing lists. The organizers published an evaluation script which calculates precision, recall, and F1 measure for each emotion class as well as their micro and macro averages. Due to the nearly balanced dataset, the official metric chosen for ranking submitted systems is the macro F1 measure.
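The official metric can be sketched as follows; this is a minimal re-implementation of macro F1 for illustration, not the organizers' actual evaluation script. The label set is taken as the union of labels occurring in the gold data and in the predictions, matching the evaluation behavior discussed in the results section.

```python
def per_class_f1(gold, pred, label):
    """F1 for one emotion class from paired gold/predicted label lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_f1(gold, pred):
    # All labels occurring in gold or predictions count toward the average,
    # so predicting a label outside the gold set lowers the score.
    labels = sorted(set(gold) | set(pred))
    return sum(per_class_f1(gold, pred, l) for l in labels) / len(labels)

gold = ["joy", "fear", "joy", "anger"]
pred = ["joy", "fear", "fear", "anger"]
# macro_f1(gold, pred) == (1.0 + 2/3 + 2/3) / 3, roughly 0.778
```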
In addition to the data, the participants were provided with a list of resources they might want to use (and they were allowed to use any other resources they had access to or created themselves). We also provided access to a baseline system.

Baseline
The intention of the baseline implementation was to provide participants with an intuition of the difficulty of the task. It reaches 59.88 % macro F1 on the test data, which is very similar to the trial data result (60.1 % F1). The confusion matrix for the baseline is presented in Table 3; the confusion matrix for the best submitted system is shown in Table 4.

Submission Results
Table 5 shows the main results of the shared task. We received submissions through CodaLab from thirty participants. Twenty-six teams responded to a post-competition survey providing additional information regarding team members (56 people in total) and the systems that were developed. For the remaining analyses and the ranking, we only report on these twenty-six teams.
The table shows results from 31 systems, including the baseline results, which were made available to participants when the shared task started. Of all submissions, 19 scored above the baseline. The best scoring system is from team Amobee, followed by IIIDYT and NTUA-SLP. The first two results are not significantly different, as tested with the Wilcoxon (1945) sign test (p < 0.01) and with bootstrap resampling (confidence level 0.99).
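As a sketch of how such a paired bootstrap comparison can be run, the following uses accuracy as the evaluation measure for brevity (the shared task used macro F1); the function name, resample count, and seed are our own choices, not the organizers' exact procedure.

```python
import random

def bootstrap_diff(gold, pred_a, pred_b, n_resamples=1000, seed=0):
    """Paired bootstrap: fraction of resamples in which system A does NOT
    outperform system B (a small fraction suggests A is reliably better)."""
    rng = random.Random(seed)
    n = len(gold)
    worse = 0
    for _ in range(n_resamples):
        # Resample instance indices with replacement, identically for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        acc_a = sum(gold[i] == pred_a[i] for i in idx) / n
        acc_b = sum(gold[i] == pred_b[i] for i in idx) / n
        if acc_a <= acc_b:
            worse += 1
    return worse / n_resamples
```

With a confidence level of 0.99, two systems would be called significantly different only if the returned fraction stays below 0.01.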
Table 10 in the Appendix shows a breakdown of the results by emotion class. Though the data is nearly balanced, joy is mostly predicted with the highest performance, followed by fear and disgust. The prediction of surprise and anger shows lower performance. Note that the macro F1 evaluation took into account all classes which were either predicted or present in the gold data. Two teams submitted results which contain labels not present in the gold data, which reduced their macro F1 dramatically. With an evaluation taking into account only the 6 gold labels, id 22 would be on rank 9 and id 28 would be on rank 10.

Review of Methods
Table 6 shows that many participants use high-level libraries like Keras or NLTK. TensorFlow is only of medium popularity, and Theano is used by only one participant. Table 7 shows a summary of machine learning methods used by the teams, as reported by themselves. Nearly every team uses embeddings and neural networks; many teams use an ensemble of architectures. Several teams use language models, reflecting a current trend in NLP of fine-tuning them to specific tasks (Howard and Ruder, 2018). Presumably, these are particularly helpful in our task due to its word-prediction aspect.
Finally, Table 8 summarizes the different kinds of information sources taken into account by the teams. Several teams use affect lexicons in addition to word information and emoji-specific information. The incorporation of statistical knowledge from unlabeled corpora is also popular.

Top 3 Submissions
In the following, we briefly summarize the approaches used by the top three teams: Amobee, IIIDYT, and NTUA-SLP. For more information on these approaches and those of the other teams, we refer the reader to the individual system description papers. The three best performing systems are all ensemble approaches. However, they make use of different underlying machine learning architectures and rely on different kinds of information.

Amobee
The top-ranking system, Amobee, is an ensemble of several models (Rozental et al., 2018). First, the team trains a Twitter-specific language model based on the transformer decoder architecture, using 5B tweets as training data. This model is used to find the probabilities of potential missing words, conditional upon the missing word describing one of the six emotions. Next, the team applies transfer learning from the models they developed for SemEval-2018 Task 1: Affect in Tweets (Rozental and Fleischer, 2018). Finally, they directly train on the data provided in the shared task while incorporating outputs from DeepMoji (Felbo et al., 2017) and the Universal Sentence Encoder (Cer et al., 2018) as features.

IIIDYT
The second-ranking system, IIIDYT (Balazs et al., 2018), preprocesses the dataset by tokenizing the sentences (including emojis) and normalizing the USERNAME, NEWLINE, URL, and TRIGGERWORD indicators. Then, it feeds word-level representations returned by a pretrained ELMo layer into a Bi-LSTM with 1 layer of 2048 hidden units for each direction. The Bi-LSTM output word representations are max-pooled to generate sentence-level representations, followed by a single hidden layer of 512 units and an output size of 6. The team trains six models with different random initializations, obtains the probability distributions for each example, and then averages these to obtain the final label prediction.
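The final ensemble step, averaging the per-model probability distributions and taking the argmax, can be sketched as follows; the model internals (ELMo, Bi-LSTM) are omitted, and the probability values below are invented purely for illustration.

```python
EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]

def ensemble_predict(model_probs):
    """model_probs: one probability vector over EMOTIONS per ensemble member."""
    n = len(model_probs)
    # Average the distributions element-wise, then pick the most probable class.
    avg = [sum(p[i] for p in model_probs) / n for i in range(len(EMOTIONS))]
    return EMOTIONS[avg.index(max(avg))]

# Hypothetical outputs of three of the six independently initialized models:
probs = [
    [0.10, 0.05, 0.15, 0.50, 0.10, 0.10],
    [0.20, 0.05, 0.05, 0.40, 0.20, 0.10],
    [0.10, 0.10, 0.10, 0.45, 0.15, 0.10],
]
# ensemble_predict(probs) == "joy"
```

Averaging probabilities rather than majority-voting over hard labels lets a confident model outweigh two uncertain ones.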

NTUA-SLP
The NTUA-SLP system (Chronopoulou et al., 2018) is an ensemble of three different generic models. For the first model, the team pretrains Twitter embeddings with the word2vec skip-gram model using a large Twitter corpus. These pretrained embeddings are then fed to a neural classifier with 2 layers, each consisting of 400 bi-LSTM units with attention. For the second model, they transfer a classifier pretrained on a 3-class sentiment classification task (SemEval-2017 Task 4A) and fine-tune it on the IEST dataset. Finally, for the third model, the team uses transfer learning of a pretrained language model, following Howard and Ruder (2018): they first train 3 language models on 3 different Twitter corpora (2M, 3M, and 5M tweets) and then fine-tune them on the IEST dataset with gradual unfreezing.

Error Analysis
Table 11 in the Appendix shows a subsample of instances which were predicted correctly by all teams (marked as +, including the baseline system and teams which did not report system details) and instances which were not predicted correctly by any team (marked as −), separated by correct emotion label.
For the positive examples which are correctly predicted by all teams, specific patterns reoccur. For anger, the author of the first example encourages the reader not to be afraid – a prompt which might be less likely for other emotions. For several emotions, single words or phrases are presumably associated with those emotions, e.g., "hungry" with anger; "underwear", "sweat", and "ewww" with disgust; "leaving" and "depression" with sadness; "why am i not" with surprise.
Several examples which are correctly predicted by all teams for joy include the syllable "un" preceding the trigger word – a pattern more frequent for this emotion than for others. Another pattern is the phrase "fast and furious" (with furious for anger), which should be considered a mistake in the sampling procedure, as it refers to a movie rather than an emotion expression.
Negative examples appear reasonable given the gold emotion but may also be valid under labels other than the gold one. For disgust, the respective emotion synonyms are often used as strong expressions actually referring to other negative emotions. Especially for sadness, the negative examples include comparably long event descriptions.

Comparison to Human Performance
An interesting research question is how accurately native speakers of a language can predict the emotion class when the emotion word is removed from a tweet. Thus, we conducted a crowdsourced study asking humans to perform the same task as proposed for automatic systems in this shared task.
We sampled 900 instances from the IEST data: 50 tweets for each of the 18 pairwise combinations of the six emotions with 'because', 'that', and 'when'. The tweets and annotation questionnaires were uploaded to a crowdsourcing platform, Figure Eight (earlier called CrowdFlower). The questionnaire asked for the best guess for the emotion (Q1) as well as any other emotion that might apply (Q2).
About 5 % of the tweets were annotated internally beforehand for Q1 (by one of the authors of this paper). These tweets are referred to as gold tweets. The gold tweets were interspersed with other tweets. If a crowd-worker got a gold tweet question wrong, they were immediately notified of the error. If the worker's accuracy on the gold tweet questions fell below 70 %, they were refused further annotation, and all of their annotations were discarded. This served as a mechanism to avoid malicious annotations.
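A minimal sketch of this gold-question gating, assuming a simple data layout (worker mapped to a list of (tweet id, label) judgments); all names here are our own illustration, not Figure Eight's actual mechanism.

```python
def filter_workers(judgments, gold, threshold=0.70):
    """Keep only workers whose accuracy on gold tweets meets the threshold.

    judgments: {worker: [(tweet_id, label), ...]}
    gold: {tweet_id: correct_label} for the internally annotated gold tweets
    """
    kept = {}
    for worker, answers in judgments.items():
        gold_answers = [(t, l) for t, l in answers if t in gold]
        if not gold_answers:
            continue  # no gold questions seen; cannot assess this worker
        acc = sum(gold[t] == l for t, l in gold_answers) / len(gold_answers)
        if acc >= threshold:
            kept[worker] = answers  # below threshold: all annotations discarded
    return kept

kept = filter_workers(
    {"w1": [("t1", "joy"), ("t2", "fear")],    # 2/2 gold questions correct
     "w2": [("t1", "anger"), ("t2", "joy")]},  # 0/2 gold questions correct
    {"t1": "joy", "t2": "fear"},
)
# kept contains only "w1"
```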
Each tweet was annotated by at least three people. A total of 3,619 human judgments of the emotion associated with the trigger word were obtained. Each judgment included the best guess for the emotion (response to Q1) as well as any other emotion that might apply (response to Q2). The answer to Q1 corresponds to the shared task setting. However, automatic systems were not given the option of providing additional emotions that might apply (Q2).
The macro F1 for predicting the emotion is 45 % (Q1; micro F1 of 47 %). Observe that human performance is lower than what automatic systems reach in the shared task. The correct emotion was present in the top two guessed emotions in 57 % of the cases. Perhaps the automatic systems are honing in on some subtle systematic regularities in how particular emotion words are used (for example, the function words in the immediate neighborhood of the target word). It should also be noted, however, that the data used for human annotation was only a subsample of the IEST data.
An analysis of the subsets of tweets containing the words because, that, and when after the emotion word shows that tweets with "that" are more difficult for humans (41 % accuracy) than those with "when" (49 %) and "because" (51 %). This relationship between performance and query string is not observed in the baseline system: here, accuracy on the test data (on the data used for human evaluation) is 61 % (60 %) for the "that" subset, 62 % (53 %) for "when", and 55 % (50 %) for "because". Therefore, the automatic system is most challenged by "because", while humans are most challenged by "that". Note that this comparison on the test data is somewhat unfair, since the data for the human analysis was sampled in a stratified manner, while that for the automatic prediction was not. The test data contains 5,635 "because" tweets, 13,649 with "that", and 9,474 with "when".
There are differences in the difficulty of the task for different emotions: the accuracy (F1) by emotion is 57 % (46 %) for anger, 15 % (21 %) for disgust, 42 % (51 %) for fear, 77 % (58 %) for joy, 59 % (52 %) for sadness, and 34 % (39 %) for surprise. The confusion matrix is depicted in Table 9. Disgust is often confused with anger, followed by fear being confused with sadness. Surprise is often confused with anger and joy.

Conclusions & Future Work
With this paper and the Implicit Emotion Shared Task, we presented the first dataset and joint effort focusing on causal descriptions for inferring, on a large scale, emotions that are triggered by specific life situations. A substantial number of participating systems applied the current state of the art in text classification and transferred it to the task of emotion classification.
Based on the experiences made during the organization and preparation of this shared task, we plan the following steps for a potential second iteration. The dataset was constructed via distant supervision, which might be a cause for inconsistencies in the dataset. We plan to use crowdsourcing, as applied for the estimation of human performance, to improve the preprocessing of the data. In addition, as one participant noted, the emotion words which were used to retrieve the data were removed, but, in a subset of the data, other emotion words were retained.
The next step, which we suggest to the participants and future researchers, is introspection of the models – carefully analysing them to verify that the models actually learn to infer emotions from subtle descriptions of situations, instead of purely associating emotion words with emotion labels. Similarly, an open research question is how models developed on the IEST data perform on other datasets. Bostan and Klinger (2018) showed that transferring models from one corpus to another in emotion analysis leads to drops in performance. Therefore, an interesting option is to use transfer learning from established corpora (which do not distinguish explicit and implicit emotion statements) to the IEST data and to compare those models to models directly trained on IEST, and vice versa.
Finally, another line of future research is the application of the knowledge inferred to other tasks, such as argument mining and sentiment analysis.

Table 1: Emotion synonyms used when polling Twitter.

Table 2: Distribution of IEST data.

Table 3: Confusion Matrix on Test Data for Baseline.

Table 4: Confusion Matrix on Test Data of Best Submitted System.

This baseline is a maximum entropy classifier with L2 regularization. Strings matching [#a-zA-Z0-9_=]+|[ˆ] form tokens. As preprocessing, all symbols which are neither alphanumeric nor the # sign are removed. Based on that, unigrams and bigrams form Boolean features (as a set of words) for the classifier.
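The feature extraction for this baseline can be sketched as follows; the token regex is simplified (the [ˆ] alternative is dropped) and the function name is our own. The resulting Boolean feature set would then feed a maximum entropy classifier, e.g., a logistic regression model with L2 penalty.

```python
import re

# Simplified token pattern following the description above.
TOKEN_RE = re.compile(r"[#a-zA-Z0-9_=]+")

def features(tweet):
    """Boolean unigram and bigram features as a set of strings."""
    tokens = TOKEN_RE.findall(tweet.lower())
    unigrams = set(tokens)
    bigrams = {f"{a} {b}" for a, b in zip(tokens, tokens[1:])}
    return unigrams | bigrams

feats = features("So [#TARGETWORD#] when you came home")
# feats contains unigrams like "when" and bigrams like "#targetword# when"
```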

Table 6: Overview of tools employed by different teams (sorted by popularity from left to right).

Table 7: Overview of methods employed by different teams (sorted by popularity from left to right).

Table 8: Overview of information sources employed by different teams (sorted by popularity from left to right).