WASSA-2017 Shared Task on Emotion Intensity

We present the first shared task on detecting the intensity of emotion felt by the speaker of a tweet. We create the first datasets of tweets annotated for anger, fear, joy, and sadness intensities using a technique called best–worst scaling (BWS). We show that the annotations lead to reliable fine-grained intensity scores (rankings of tweets by intensity). The data was partitioned into training, development, and test sets for the competition. Twenty-two teams participated in the shared task, with the best system obtaining a Pearson correlation of 0.747 with the gold intensity scores. We summarize the machine learning setups, resources, and tools used by the participating teams, with a focus on the techniques and resources that are particularly useful for the task. The emotion intensity dataset and the shared task are helping improve our understanding of how we convey more or less intense emotions through language.


Introduction
We use language to communicate not only the emotion we are feeling but also the intensity of the emotion.For example, our utterances can convey that we are very angry, slightly sad, absolutely elated, etc.Here, intensity refers to the degree or amount of an emotion such as anger or sadness. 1utomatically determining the intensity of emotion felt by the speaker has applications in commerce, public health, intelligence gathering, and social welfare.
Twitter has a large and diverse user base which entails rich textual content, including nonstandard language such as emoticons, emojis, creatively spelled words (happee), and hashtagged words (#luvumom).Tweets are often used to convey one's emotion, opinion, and stance (Mohammad et al., 2017).Thus, automatically detecting emotion intensities in tweets is especially beneficial in applications such as tracking brand and product perception, tracking support for issues and policies, tracking public health and well-being, and disaster/crisis management.Here, for the first time, we present a shared task on automatically detecting intensity of emotion felt by the speaker of a tweet: WASSA-2017 Shared Task on Emotion Intensity. 2pecifically, given a tweet and an emotion X, the goal is to determine the intensity or degree of emotion X felt by the speaker-a real-valued score between 0 and 1. 3 A score of 1 means that the speaker feels the highest amount of emotion X.A score of 0 means that the speaker feels the lowest amount of emotion X.We first ask human annotators to infer this intensity of emotion from a tweet.Later, automatic algorithms are tested to determine the extent to which they can replicate human annotations.Note that often a tweet does not explicitly state that the speaker is experiencing a particular emotion, but the intensity of emotion felt by the speaker can be inferred nonetheless.Sometimes a tweet is sarcastic or it conveys the emotions of a different entity, yet the annotators (and automatic algorithms) are to infer, based on the tweet, the extent to which the speaker is likely feeling a particular emotion.
In order to provide labeled training, development, and test sets for this shared task, we needed to annotate instances for degree of affect.This is a substantially more difficult undertaking than annotating only for the broad affect class: respondents are presented with greater cognitive load and it is particularly hard to ensure consistency (both across responses by different annotators and within the responses produced by an individual annotator).Thus, we used a technique called Best-Worst Scaling (BWS), also sometimes referred to as Maximum Difference Scaling (MaxDiff).It is an annotation scheme that addresses the limitations of traditional rating scales (Louviere, 1991;Louviere et al., 2015;Kiritchenko andMohammad, 2016, 2017).We used BWS to create the Tweet Emotion Intensity Dataset, which currently includes four sets of tweets annotated for intensity of anger, fear, joy, and sadness, respectively (Mohammad and Bravo-Marquez, 2017).These are the first datasets of their kind.
The competition is organized on a CodaLab website, where participants can upload their submissions, and the leaderboard reports the results. 4wenty-two teams participated in the 2017 iteration of the competition.The best performing system, Prayas, obtained a Pearson correlation of 0.747 with the gold annotations.Seven teams obtained scores higher than the score obtained by a competitive SVM-based benchmark system (0.66), which we had released at the start of the competition. 5Low-dimensional (dense) distributed representations of words (word embeddings) and sentences (sentence vectors), along with presence of affect-associated words (derived from affect lexicons) were the most commonly used features.Neural network were the most commonly used machine learning architecture.They were used for learning tweet representations as well as for fitting regression functions.Support vector machines (SVMs) were the second most popular regression algorithm.Keras and Tensor-Flow were some of the most widely used libraries.
The top performing systems used ensembles of models trained on dense distributed representations of the tweets as well as features drawn from affect lexicons.They also made use of a substantially larger number of affect lexicons than systems that did not perform as well.
The emotion intensity dataset and the corresponding shared task are helping improve our understanding of how we convey more or less intense emotions through language.The task also adds a dimensional nature to model of basic emotions, which has traditionally been viewed as categorical (joy or no joy, fear or no fear, etc.).On going work with annotations on the same data for valence , arousal, and dominance aims to better understand the relationships between the circumplex model of emotions (Russell, 2003) and the categorical model of emotions (Ekman, 1992;Plutchik, 1980).Even though the 2017 WASSA shared task has concluded, the CodaLab competition website is kept open.Thus new and improved systems can continually be tested.The best results obtained by any system on the 2017 test set can be found on the CodaLab leaderboard.
The rest of the paper is organized as follows.We begin with related work and a brief background on best-worst scaling (Section 2).In Section 3, we describe how we collected and annotated the tweets for emotion intensity.We also present experiments to determine the quality of the annotations.Section 4 presents details of the shared task setup.In Section 5, we present a competitive SVM-based baseline that uses a number of common text classification features.We describe ablation experiments to determine the impact of different feature types on regression performance.In Section 6, we present the results obtained by the participating systems and summarize their machine learning setups.Finally, we present conclusions and future directions.All of the data, annotation questionnaires, evaluation scripts, regression code, and interactive visualizations of the data are made freely available on the shared task website. 2 2 Related Work

Emotion Annotation
Psychologists have argued that some emotions are more basic than others (Ekman, 1992;Plutchik, 1980;Parrot, 2001;Frijda, 1988).However, they disagree on which emotions (and how many) should be classified as basic emotions-some propose 6, some 8, some 20, and so on.Thus, most efforts in automatic emotion detection have focused on a handful of emotions, especially since manually annotating text for a large number of emotions is arduous.Apart from these categorical models of emotions, certain dimensional models of emotion have also been proposed.The most popular among them, Russell's circumplex model, asserts that all emotions are made up of two core dimensions: valence and arousal (Russell, 2003).We created datasets for four emotions that are the most common amongst the many proposals for basic emotions: anger, fear, joy, and sadness.However, we have also begun work on other affect categories, as well as on valence and arousal.
The vast majority of emotion annotation work provides discrete binary labels to the text instances (joy-nojoy, fear-nofear, and so on) (Alm et al., 2005;Aman and Szpakowicz, 2007;Brooks et al., 2013;Neviarouskaya et al., 2009;Bollen et al., 2009).The only annotation effort that provided scores for degree of emotion is by Strapparava and Mihalcea (2007) as part of one of the SemEval-2007 shared task.Annotators were given newspaper headlines and asked to provide scores between 0 and 100 via slide bars in a web interface.It is difficult for humans to provide direct scores at such fine granularity.A common problem is inconsistency in annotations.One annotator might assign a score of 79 to a piece of text, whereas another annotator may assign a score of 62 to the same text.It is also common that the same annotator assigns different scores to the same text instance at different points in time.Further, annotators often have a bias towards different parts of the scale, known as scale region bias.

Best-Worst Scaling
Best-Worst Scaling (BWS) was developed by Louviere (1991), building on some ground-breaking research in the 1960s in mathematical psychology and psychophysics by Anthony A. J. Marley and Duncan Luce.Annotators are given n items (an ntuple, where n > 1 and commonly n = 4).They are asked which item is the best (highest in terms of the property of interest) and which is the worst (lowest in terms of the property of interest).When working on 4-tuples, best-worst annotations are particularly efficient because each best and worst annotation will reveal the order of five of the six item pairs.For example, for a 4-tuple with items A, B, C, and D, if A is the best, and D is the worst, then A > B, A > C, A > D, B > D, and C > D.
BWS annotations for a set of 4-tuples can be easily converted into real-valued scores of association between the items and the property of interest (Orme, 2009;Flynn and Marley, 2014) been empirically shown that annotations for 2N 4-tuples is sufficient for obtaining reliable scores (where N is the number of items) (Louviere, 1991;Kiritchenko and Mohammad, 2016). 6iritchenko and Mohammad (2017) show through empirical experiments that BWS produces more reliable fine-grained scores than scores obtained using rating scales.Within the NLP community, Best-Worst Scaling (BWS) has thus far been used only to annotate words: for example, for creating datasets for relational similarity (Jurgens et al., 2012), word-sense disambiguation (Jurgens, 2013), word-sentiment intensity (Kiritchenko et al., 2014), and phrase sentiment composition (Kiritchenko and Mohammad, 2016).However, we use BWS to annotate whole tweets for intensity of emotion.

Data
Mohammad and Bravo-Marquez (2017) describe how the Tweet Emotion Intensity Dataset was created.We summarize below the approach used and the key properties of the dataset.Not included in this summary are: (a) experiments showing marked similarities between emotion pairs in terms of how they manifest in language, (b) how training data for one emotion can be used to improve prediction performance for a different emotion, and (c) an analysis of the impact of hashtag words on emotion intensities.
For each emotion X, we select 50 to 100 terms that are associated with that emotion at different intensity levels.For example, for the anger dataset, we use the terms: angry, mad, frustrated, annoyed, peeved, irritated, miffed, fury, antagonism, and so on.For the sadness dataset, we use the terms: sad, devastated, sullen, down, crying, dejected, heartbroken, grief, weeping, and so on.We will refer to these terms as the query terms.
We identified the query words for an emotion by first searching the Roget's Thesaurus to find categories that had the focus emotion word (or a close synonym) as the head word. 7The categories chosen for each head word are shown in Table 1.We chose all single-word entries listed within these categories to be the query terms for the corresponding focus emotion. 8Starting November 22, 2016, and continuing for three weeks, we polled the Twitter API for tweets that included the query terms.We discarded retweets (tweets that start with RT) and tweets with urls.We created a subset of the remaining tweets by: • selecting at most 50 tweets per query term.
• selecting at most 1 tweet for every tweeterquery term combination.
Thus, the master set of tweets is not heavily skewed towards some tweeters or query terms.
To study the impact of emotion word hashtags on the intensity of the whole tweet, we identified tweets that had a query term in hashtag form towards the end of the tweet-specifically, within the trailing portion of the tweet made up solely of hashtagged words.We created copies of these tweets and then removed the hashtag query terms from the copies.The updated tweets were then added to the master set.Finally, our master set of 7,097 tweets includes: 1. Hashtag Query Term Tweets (HQT Tweets): 1030 tweets with a query term in the form of a hashtag (#<query term>) in the trailing portion of the tweet; 2. No Query Term Tweets (NQT Tweets): 1030 tweets that are copies of '1', but with the hashtagged query term removed; 3. Query Term Tweets (QT Tweets): 5037 tweets that include: a. tweets that contain a query term in the form of a word (no #<query term>) b. tweets with a query term in hashtag form followed by at least one non-hashtag word.
The master set of tweets was then manually annotated for intensity of emotion.Table 3 shows a breakdown by emotion.
7 The Roget's Thesaurus groups words into about 1000 categories, each containing on average about 100 closely related words.The head word is the word that best represents the meaning of the words within that category. 8The full list of query terms is available on request.

Annotating with Best-Worst Scaling
We followed the procedure described in Kiritchenko and Mohammad (2016) to obtain BWS annotations.For each emotion, the annotators were presented with four tweets at a time (4tuples) and asked to select the speakers of the tweets with the highest and lowest emotion intensity.2 × N (where N is the number of tweets in the emotion set) distinct 4-tuples were randomly generated in such a manner that each item is seen in eight different 4-tuples, and no pair of items occurs in more than one 4-tuple.We refer to this as random maximum-diversity selection (RMDS).RMDS maximizes the number of unique items that each item co-occurs with in the 4-tuples.
After BWS annotations, this in turn leads to direct comparative ranking information for the maximum number of pairs of items.9 It is desirable for an item to occur in sets of 4tuples such that the the maximum intensities in those 4-tuples are spread across the range from low intensity to high intensity, as then the proportion of times an item is chosen as the best is indicative of its intensity score.Similarly, it is desirable for an item to occur in sets of 4-tuples such that the minimum intensities are spread from low to high intensity.However, since the intensities of items are not known before the annotations, RMDS is used.
Every 4-tuple was annotated by three independent annotators. 10The questionnaires used were developed through internal discussions and pilot annotations.(See the Appendix (8.1) for a sample questionnaire.All questionnaires are also available on the task website.) The 4-tuples of tweets were uploaded on the crowdsourcing platform, CrowdFlower.About 5% of the data was annotated internally beforehand (by the authors).These questions are referred to as gold questions.The gold questions are interspersed with other questions.If one gets a gold question wrong, they are immediately notified of it.If one's accuracy on the gold questions falls below 70%, they are refused further annotation, and all of their annotations are discarded.This serves as a mechanism to avoid malicious annotations. 11he BWS responses were translated into scores by a simple calculation (Orme, 2009;Flynn and Marley, 2014): For each item t, the score is the percentage of times the t was chosen as having the most intensity minus the percentage of times t was chosen as having the least intensity.12 Since intensity of emotion is a unipolar scale, we linearly transformed the the −100 to 100 scores to scores in the range 0 to 1.

Reliability of Annotations
A useful measure of quality is reproducibility of the end result-if repeated independent manual annotations from multiple respondents result in similar intensity rankings (and scores), then one can be confident that the scores capture the true emotion intensities.To assess this reproducibility, we calculate average split-half reliability (SHR), a commonly used approach to determine consistency (Kuder and Richardson, 1937;Cronbach, 1946).The intuition behind SHR is as follows.
All annotations for an item (in our case, tuples) are randomly split into two halves.Two sets of scores are produced independently from the two halves.Then the correlation between the two sets of scores is calculated.If the annotations are of good quality, then the correlation between the two halves will be high.
Since each tuple in this dataset was annotated by three annotators (odd number), we calculate SHR by randomly placing one or two annotations per tuple in one bin and the remaining (two or one) annotations for the tuple in another bin.Then two sets of intensity scores (and rankings) are calculated from the annotations in each of the two bins.The process is repeated 100 times and the correlations across the two sets of rankings and intensity scores are averaged.Table 2 shows the split-half reliabilities for the anger, fear, joy, and sadness tweets in the Tweet Emotion Intensity Dataset. 13bserve that for fear, joy, and sadness datasets, both the Pearson correlations and the Spearman rank correlations lie between 0.84 and 0.88, indicating a high degree of reproducibility.However, the correlations are slightly lower for anger indicating that it is relative more difficult to ascertain the degrees of anger of speakers from their tweets.Note that SHR indicates the quality of annotations obtained when using only half the number of annotations.The correlations obtained when repeating the experiment with three annotations for each 4-tuple is expected to be even higher.Thus the numbers shown in Table 2 are a lower bound on the quality of annotations obtained with three annotations per 4-tuple.
4 Task Setup

The Task
Given a tweet and an emotion X, automatic systems have to determine the intensity or degree of emotion X felt by the speaker-a real-valued score between 0 and 1.A score of 1 means that the speaker feels the highest amount of emotion X.A score of 0 means that the speaker feels the lowest amount of emotion X.The competition is organized on a CodaLab website, where participants can upload their submissions, and the leaderboard reports the results.The training and development sets were made available more than two months before the twoweek official evaluation period.Participants were told that the development set could be used to tune ones system and also to test making a submission on CodaLab.Gold intensity scores for the development set were released two weeks before the evaluation period, and participants were free to train their systems on the combined training and development sets, and apply this model to the test set.The test set was released at the start of the evaluation period.

Resources
Participants were free to use lists of manually created and/or automatically generated wordemotion and word-sentiment association lexicons. 15Participants were free to build a system from scratch or use any available software packages and resources, as long as they are not against the spirit of fair competition.In order to assist testing of ideas, we also provided a baseline Weka system for determining emotion intensity, that participants can build on directly or use to determine the usefulness of different features. 16We describe the baseline system in the next section.Each team was allowed to make as many as ten submissions during the evaluation period.However, they were told in advance that only the final submission would be considered as the official submission to the competition.
Once the evaluation period concluded, we released the gold labels and participants were able to determine results on various system variants that they may have developed.We encouraged participants to report results on all of their systems (or system variants) in the system-description paper that they write.However, they were asked to clearly indicate the result of their official submission.
During the evaluation period, the CodaLab leaderboard was hidden from participants-so they were unable see the results of their submissions on the test set until the leaderboard was subsequently made public.Participants were, however, able to immediately see any warnings or errors that their submission may have triggered.

Evaluation
For each emotion, systems were evaluated by calculating the Pearson Correlation Coefficient of the system predictions with the gold ratings.Pearson coefficient, which measures linear correlations between two variables, produces scores from -1 (perfectly inversely correlated) to 1 (perfectly correlated).A score of 0 indicates no correlation.The correlation scores across all four emotions was averaged to determine the bottom-line competition metric by which the submissions were ranked.
In addition to the bottom-line competition metric described above, the following additional metrics were also provided: • Spearman Rank Coefficient of the submission with the gold scores of the test data.Motivation: Spearman Rank Coefficient considers only how similar the two sets of ranking are.The differences in scores between adjacently ranked instance pairs is ignored.On the one hand this has been argued to alleviate some biases in Pearson, but on the other hand it can ignore relevant information.
• Correlation scores (Pearson and Spearman) over a subset of the testset formed by taking instances with gold intensity scores ≥ 0.5.Motivation: In some applications, only those instances that are moderately or strongly emotional are relevant.Here it may be much more important for a system to correctly determine emotion intensities of instances in the higher range of the scale as compared to correctly determine emotion intensities in the lower range of the scale.
Results with Spearman rank coefficient were largely inline with those obtained using Pearson coefficient, and so in the rest of the paper we report only the latter.However, the CodaLab leaderboard and the official results posted on the task website show both metrics.The official evaluation script (which calculates correlations using both metrics and also acts as a format checker) was made available along with the training and development data well in advance.Participants were able to use it to monitor progress of their system by crossvalidation on the training set or testing on the development set.The script was also uploaded on the CodaLab competition website so that the system evaluates submissions automatically and updates the leaderboard.
5 Baseline System for Automatically Determining Tweet Emotion Intensity

System
We implemented a package called Affec-tiveTweets (Mohammad and Bravo-Marquez, 2017) for the Weka machine learning workbench (Hall et al., 2009).It provides a collection of filters for extracting features from tweets for sentiment classification and other related tasks.These include features used in Kiritchenko et al. (2014) and Mohammad et al. (2017). 17We use the AffectiveTweets package for calculating feature vectors from our emotion-intensity-labeled tweets and train Weka regression models on this transformed data.The regression model used is an L 2 -regularized L 2 -loss SVM regression model with the regularization parameter C set to 1, 17 Kiritchenko et al. (2014) describes the NRC-Canada system which ranked first in three sentiment shared tasks: SemEval-2013 Task 2, SemEval-2014 Task 9, and SemEval-2014 Task 4. Mohammad et al. (2017) describes a stancedetection system that outperformed submissions from all 19 teams that participated in SemEval-2016 Task 6.
implemented in LIBLINEAR18 .The system uses the following features:19 a. Word N-grams (WN): presence or absence of word n-grams from n = 1 to n = 4. b.Character N-grams (CN): presence or absence of character n-grams from n = 3 to n = 5. c.Word Embeddings (WE): an average of the word embeddings of all the words in a tweet.We calculate individual word embeddings using the negative sampling skip-gram model implemented in Word2Vec (Mikolov et al., 2013).Word vectors are trained from ten million English tweets taken from the Edinburgh Twitter Corpus (Petrović et al., 2010).We set Word2Vec parameters: window size: 5; number of dimensions: 400.20 d.Affect Lexicons (L): we use the lexicons shown in Table 4 by aggregating the information for all the words in a tweet.If the lexicon provides nominal association labels (e.g, positive, anger, etc.), then the number of words in the tweet matching each class are counted.If the lexicon provides numerical scores, the individual scores for each class are summed.and whether the affective associations provided are nominal or numeric.

Experiments
We developed the baseline system by learning models from each of the Tweet Emotion Intensity Dataset training sets and applying them to the corresponding development sets.Once the system parameters were frozen, the system learned new models from the combined training and development corpora.This model was applied to the test sets.Table 5 shows the results obtained on the test sets using various features, individually and in combination.The last column 'avg.' shows the macro-average of the correlations for all of the emotions.
fect lexicons produces results ranging from avg. r = 0.19 with SentiWordNet to avg.r = 0.53 with NRC-Hash-Emo.Combining all the lexicons leads to statistically significant improvement over individual lexicons (avg.r = 0.63).Combining the different kinds of features leads to even higher scores, with the best overall result obtained using word embedding and lexicon features (avg.r = 0.66). 22The feature space formed by all the lexicons together is the strongest single feature category.The results also show that some features such as character ngrams are redundant in the presence of certain other features.Among the lexicons, NRC-Hash-Emo is the most predictive single lexicon.Lexicons that include Twitter-specific entries, lexicons that include intensity scores, and lexicons that label emotions and not just sentiment, tend to be more predictive on this task-dataset combination.NRC-Aff-Int has real-valued fine-grained wordemotion association scores for all the words in NRC-EmoLex that were marked as being associated with anger, fear, joy, and sadness. 23Improvement in scores obtained using NRC-Aff-Int over the scores obtained using NRC-EmoLex also show that using fine intensity scores of word-emotion association are beneficial for tweet-level emotion intensity detection.The correlations for anger, fear, and joy are similar (around 0.65), but the correlation for sadness is markedly higher (0.71).We can observe from Table 5 that this boost in performance for sadness is to some extent due to word embeddings, but is more so due to lexicon features, especially those from SentiStrength.Sen-tiStrength focuses solely on positive and negative classes, but provides numeric scores for each.
To assess performance in the moderate-to-high range of the intensity scale, we calculated correla- 22 The increase from 0.63 to 0.66 is statistically significant.tion scores over a subset of the test data formed by taking only those instances with gold emotion intensity scores ≥ 0.5.The last row in Table 5 shows the results.We observe that the correlation scores are in general lower here in the 0.5 to 1 range of intensity scores than in the experiments over the full intensity range.This is simply because this is a harder task as now the systems do not benefit by making coarse distinctions over whether a tweet is in the lower range or in the higher range.
Official System Submissions to the Shared Task Twenty-two teams made submissions to the shared task.In the subsections below we present the results and summarize the approaches and resources used by the participating systems.

Results
Table 6 shows the Pearson correlations (r) and ranks (in brackets) obtained by the systems on the full test sets.The bottom-line competition metric, 'r avg.', is the average of Pearson correlations obtained for each of the four emotions.(The task website shows Spearman rank coefficient as well. Those scores are close in value to the Pearson correlations, and most teams rank the same by either metric.)The top ranking system, Prayas, obtained an r avg. of 0.747.It obtains slightly better correlations for joy and anger (around 0.76) than for fear and sadness (around 0.73).IMS, which ranked second overall, obtained slightly higher correlation on anger, but lower scores than Prayas on the other emotions.The top 12 teams all obtain their best correlation on anger as opposed to any of the other three emotions.They obtain lowest correlations on fear and sadness.Seven teams obtained scores higher than that obtained by the publicly available benchmark system (r avg.= 0.66).
Table 7 shows the Pearson correlations (r) and ranks (in brackets) obtained by the systems on those instances in the test set with intensity scores ≥ 0.5.Prayas obtains the best results here too with r avg.= 0.571.SeerNet, which ranked third on the full test set, ranks second on this subset.As found in the baseline results, system results on this subset overall are lower than than on the full test set.Most systems perform best on the joy data and worst on the sadness data.

Machine Learning Setups
Systems followed a supervised learning approach in which tweets were mapped into feature vectors that were then used for training regression models.
Features were drawn both from the training data as well as from external resources such as large tweet corpora and affect lexicons.Table 8 lists the feature types (resources) used by the teams.(To save space, team names are abbreviated to just their rank on the full test set (as shown in Table 6).)Commonly used features included word embeddings and sentence repre-sentations learned using neural networks (sentence embeddings).Some of the word embeddings models used were Glove (SeerNet, UWaterloo, YZU NLP), Word2Vec (SeerNet), and Word Vector Emoji Vectors (SeerNet).The models used for learning sentence embeddings included LSTM (Prayas, IITP), CNN (SGNLP), LSTM-CNN combinations (IMS, YMU-HPCC), bi-directional versions (YZU NLP), and augmented LSTMs models with attention layers (Todai).High-dimensional sparse representations such as word n-grams or character n-grams were rarely used.Affect lexicons were also widely used, especially by the top eight teams.Some teams built their own affect lexicons from additional data (IMS, XRCE).
The regression algorithms applied to the feature vectors included SVM regression or SVR (IITP, Code Wizards, NUIG, H.Niemstov), Neural Networks (Todai, YZU NLP, SGNLP), Random Forest (IMS, SeerNet, XRCE), Gradient Boosting (UWaterLoo, PLN PUCRS), AdaBoost (SeerNet), and Least Square Regression (UWaterloo).Table 9 provides the full list.Some teams followed a popular deep learning trend wherein the feature representation and the prediction model are trained in conjunction.In those systems, the regression algorithm corresponds to the output layer of the neural network (YZU NLP, SGNLP, Todai).
Many libraries and tools were used for implementing the systems.The high-level neural networks API library Keras was the most widely used off-the-shelf package.It is written in Python and runs on top of either TensorFlow or Theano.Ten-sorFlow and Sci-kit learn were also popular (also Python libraries). 24Our AffectiveTweets Weka baseline package was used by five participating teams, including the teams that ranked first, second, and third.The full list of tools and libraries used by the teams is shown in Table 10.
In the subsections below, we briefly summarize the three top-ranking systems.The Appendix (8.3) provides participant-provided summaries about each system.See system description papers for detailed descriptions.9: Regression methods used by the participating systems.Teams are indicated by their rank.Table 10: Tools and libraries used by the participating systems.Teams are indicated by their rank.

Prayas: Rank 1
The best system, Prayas, used an ensemble of three different models: The first is a feed-forward neural network whose input vector is formed by concatenating the average word embedding vector with the lexicon features vector provided by the AffectiveTweets package (Mohammad and Bravo-Marquez, 2017).These embeddings were trained on a collection of 400 million tweets (Godin et al., 2015).The network has four hidden layers and uses rectified linear units as activation functions.Dropout is used a regularization mechanisms and the output layer consists of a sigmoid neuron.The second model treats the problem as a multi-task learning problem with the labeling of the four emotion intensities as the four sub-tasks.Authors use the same neural network architecture as in the first model, but the weights of the first two network layers are shared across the four subtasks.The weights of the last two layers are independently optimized for each subtask.In the third model, the word embeddings of the words in a tweet are concatenated and fed into a deep learning architecture formed by LSTM, CNN, max pooling, fully connected layers.Several architectures based on these layers are explored.The final predictions are made by combining the first two models with three variations of the third model into an ensemble.A weighted average of the individual predictions is calculated using cross-validated performances as the relative weights.Experimental results show that the ensemble improves the performance of each individual model by at least two percentage points.

IMS: Rank 2
IMS applies a random forest regression model to a representation formed by concatenating three vectors: 1. a feature vector drawn from existing affect lexicons, 2. a feature vector drawn from expanded affect lexicons, and 3. the output of a neural network.The first vector is obtained using the lexicons implemented in the AffectiveTweets package.The second is based on an extended lexicons built from feed-forward neural networks trained on word embeddings.The gold training words are taken from existing affective norms and emotion lexicons: NRC Hashtag Emotion Lexicon (Mohammad, 2012b;Mohammad and Kiritchenko, 2015), affective norms from Warriner et al. (2013), Brysbaert et al. (2014), and ratings for happiness from Dodds et al. (2011).The third vector is taken from the output of neural network that combines CNN and LSTM layers.

SeerNet: Rank 3
SeerNet creates an ensemble of various regression algorithms (e.g, SVR, AdaBoost, random forest, gradient boosting).Each regression model is trained on a representation formed by the affect lexicon features (including those provided by AffectiveTweets) and word embeddings.Authors also experiment with different word embeddings models: Glove, Word2Vec, and Emoji embeddings (Eisner et al., 2016).

Conclusions
We conducted the first shared task on detecting the intensity of emotion felt by the speaker of a tweet.We created the emotion intensity dataset using best-worst scaling and crowdsourcing.We created a benchmark regression system and conducted experiments to show that affect lexicons, especially those with fine word-emotion association scores, are useful in determining emotion intensity.
Twenty-two teams participated in the shared task, with the best system obtaining a Pearson correlation of 0.747 with the gold annotations on the test set.As in many other machine learning competitions, the top ranking systems used ensembles of multiple models (Prayas-rank1, SeerNet-rank3).IMS, which ranked second, used random forests, which are ensembles of multiple decision trees.The top eight systems also made use of a substantially larger number of affect lexicons to generate features than systems that did not perform as well.It is interesting to note that despite using deep learning techniques, training data, and large amounts of unlabeled data, the best systems are finding it beneficial to include features drawn from affect lexicons.
We have begun work on creating emotion intensity datasets for other emotion categories beyond anger, fear, sadness, and joy.We are also creating a dataset annotated for valence, arousal, and dominance.These annotations will be done for English, Spanish, and Arabic tweets.The datasets will be used in the upcoming SemEval-2018 Task #1: Affect in Tweets (Mohammad et al., 2018 • Which of the four speakers is likely to be the MOST fearful, and • Which of the four speakers is likely to be the LEAST fearful.

Important Notes
• This task is about fear levels of the speaker (and not about the fear of someone else mentioned or spoken to).
• If the answer could be either one of two or more speakers (i.e., they are likely to be equally fearful), then select any one of them as the answer.
• Most importantly, try not to over-think the answer.Let your instinct guide you.The questionnaires for other emotions are similar in structure.In a post-annotation survey, the respondents gave the task high scores for clarity of instruction (4.2/5) despite noting that the task itself requires some non-trivial amount of thought (3.5 out of 5 on ease of task).

An Interactive Visualization to Explore the Tweet Emotion Intensity Dataset
We created an interactive visualization to allow ease of exploration of the Tweet Emotion Intensity Dataset.This visualization was made public after the the official evaluation period had concludedso participants in the shared task did not have access to it when building their system.It is worth noting that if one intends to evaluate their emotion intensity detection system on the Tweet Emotion Intensity Dataset, then as a matter of commonlyfollowed best practices, they should not use the visualization to explore the test data in the system development phase (until all the system parameters are frozen).
The visualization has three main components: 1. Tables showing the percentage of instances in each of the emotion partitions (train, dev, test).
Hovering over a row shows the corresponding number of instances.Clicking on an emotion filters out data from all other emotions, in all visualization components.Similarly, one can click on just the train, dev, or test partitions to view information just for that data.Clicking again deselects the item.2. A histogram of emotion intensity scores.A slider that one can use to view only those tweets within a certain score range.3. The list of tweets, emotion label, and emotion intensity scores.
Notably, the three components are interconnected, such that clicking on an item in one component will filter information in all other components to show only the relevant details.For example, clicking on 'joy' in 'a' will cause 'b' to show the histogram for only the joy tweets, and 'c' to show only the 'joy' tweets.Similarly one can click on the test/dev/train set, a particular band of emotion intensity scores, or a particular tweet.Clicking again deselects the item.One can use filters in combination.For e.g., clicking on fear, test data, and setting the slider for the 0.5 to 1 range, shows information for only those fear-testdata instances with scores ≥ 0.5.• TweetToSparseFeatureVector filter: calculates the following sparse features: word n-grams (adding a NEG prefix to words occurring in negated contexts), character n-grams (CN), POS tags, and Brown word clusters. 26 • TweetToLexiconFeatureVector filter: calculates features from a fixed list of affective lexicons. 26The scope of negation was determined by a simple heuristic: from the occurrence of a negator word up until a punctuation mark or end of sentence.We used a list of 28 negator words such as no, not, won't and never.
• TweetToInputLexiconFeatureVector: calculates features from any lexicon.The input lexicon can have multiple numeric or nominal wordaffect associations.This filter allows users to experiment with their own lexicons.
• TweetToSentiStrengthFeatureVector filter: calculates positive and negative sentiment intensities for a tweet using the SentiStrength lexiconbased method (Thelwall et al., 2012) • TweetToEmbeddingsFeatureVector filter: calculates a tweet-level feature representation using pre-trained word embeddings supporting the following aggregation schemes: average of word embeddings; addition of word embeddings; and concatenation of the first k word embeddings in the tweet.The package also provides Word2Vec's pre-trained word. 27  Once the feature vectors are created, one can use any of the Weka regression or classification algorithms.Additional filters are under development.

4. 4
Official Submission to the Shared Task System submissions were required to have the same format as used in the training and test sets.Each line in the file should include: id[tab]tweet[tab]emotion[tab]score

Table 1 :
. It has Categories from the Roget's Thesaurus whose words were taken to be the query terms.

Table 2 :
Split-half reliabilities (as measured by Pearson correlation and Spearman rank correlation) for the anger, fear, joy, and sadness tweets in the Tweet Emotion Intensity Dataset. 14

Table 3 :
The number of instances in the Tweet Emotion Intensity dataset.

Table 6 :
Official Competition Metric: Pearson correlations (r) and ranks (in brackets) obtained by the systems on the full test sets.The bottom-line competition metric, 'r avg.', is the average of Pearson correlations obtained for each of the four emotions.

Table 7 :
Pearson correlations (r) and ranks (in brackets) obtained by the systems on a subset of the test set where gold scores ≥ 0.5

Table 8 :
Feature types (resources) used by the participating systems.Teams are indicated by their rank.