SemEval-2015 Task 11: Sentiment Analysis of Figurative Language in Twitter

This report summarizes the objectives and evaluation of the SemEval 2015 task on the sentiment analysis of figurative language on Twitter (Task 11). This is the first sentiment analysis task wholly dedicated to analyzing figurative language on Twitter. Specifically, three broad classes of figurative language are considered: irony, sarcasm and metaphor. Gold standard sets of 8000 training tweets and 4000 test tweets were annotated using workers on the crowdsourcing platform CrowdFlower. Participating systems were required to provide a fine-grained sentiment score on an 11-point scale (-5 to +5, including 0 for neutral intent) for each tweet, and systems were evaluated against the gold standard using both a Cosine-similarity and a Mean-Squared-Error measure.


Introduction
The limitations on text length imposed by microblogging services such as Twitter do nothing to dampen our willingness to use language creatively. Indeed, such limitations further incentivize the use of creative devices such as metaphor and irony, as such devices allow strongly-felt sentiments to be expressed effectively, memorably and concisely. Nonetheless, creative language can pose certain challenges for NLP tools that do not take account of how words can be used playfully and in original ways. In the case of language using figurative devices such as irony, sarcasm or metaphor (when literal meanings are discounted and secondary or extended meanings are intentionally profiled), the affective polarity of the literal meaning may differ significantly from that of the intended figurative meaning. Nowhere is this effect more pronounced than in ironical language, which delights in using affirmative language to convey critical meanings. Metaphor, irony and sarcasm can each sculpt the affect of an utterance in complex ways, and each tests the limits of conventional techniques for the sentiment analysis of supposedly literal texts.
Figurative language thus poses an especially significant challenge to sentiment analysis systems, as standard approaches anchored in the dictionary-defined affect of individual words and phrases are often shown to be inadequate in the face of indirect figurative meanings. It would be convenient if such language were rare and confined to specific genres of text, such as poetry and literature. Yet the reality is that figurative language is pervasive in almost any genre of text, and is especially commonplace in the texts of the Web and on social media platforms such as Twitter. Figurative language often draws attention to itself as a creative artifact, but is just as likely to be viewed as part of the general fabric of human communication. In any case, Web users widely employ figures of speech (both old and new) to project their personality through a text, especially when their texts are limited to the 140 characters of a tweet.
Natural language researchers have attacked the problems associated with figurative interpretations at multiple levels of linguistic representation. Some have focused on the conceptual level, of which the text is a surface instantiation, to identify the schemas and mappings that are implied by a figure of speech (see e.g. Veale and Keane (1992); Barnden (2008); Veale (2012)). These approaches yield a depth of insight but not a robustness of analysis in the face of textual diversity. More robust approaches focus on the surface level of a text, to consider word choice, syntactic order, lexical properties and affective profiles of the elements that make up a text (e.g. Rosso (2012, 2014)). Surface analysis yields a range of discriminatory features that can be efficiently extracted and fed into machine-learning algorithms.
When it comes to analyzing the texts of the Web, the Web can also be used as a convenient source of ancillary knowledge and features. Veale and Hao (2007) describe a means of harvesting a commonsense knowledge-base of stereotypes from the Web, by directly targeting simile constructions of the form "as X as Y" (e.g. "as hot as an oven", "as humid as a jungle", "as big as a mountain", etc.). Though largely successful in their efforts, Veale and Hao were surprised to discover that up to 20% of Web-harvested similes are ironic (examples include "as subtle as a freight train", "as tanned as an Irishman", "as sober as a Kennedy", "as private as a park bench"). Having initially filtered ironic similes manually (as irony is the worst kind of noise when acquiring knowledge from the Web), Hao & Veale (2010) report good results for an automatic, Web-based approach to distinguishing ironic from non-ironic similes. Their approach exploits specific properties of similes and is thus not directly transferable to the detection of irony in general. Reyes, Rosso and Veale (2013) and Reyes, Rosso and Buscaldi (2012) thus employ a more general approach that applies machine learning algorithms to a range of structural and lexical features to learn a robust basis for detecting humor and irony in text.
The current task is one that calls for such a general approach. Note that the goal of Task 11 is not to detect irony, sarcasm or metaphor in a text, but to perform robust sentiment analysis on a fine-grained 11-point scale over texts in which these kinds of linguistic usages are pervasive. A system may find detection to be a useful precursor to analysis, or it may not. We present a description of Task 11 in section 2, before presenting our dataset in section 3 and the scoring functions in section 4. Descriptions of each participating system are then presented in section 5, before an overall evaluation is reported in section 6. The report then concludes with some general observations in section 7.

Task Description
The task concerns itself with the classification of overall sentiment in micro-texts drawn from the micro-blogging service Twitter. These texts, called tweets, are chosen so that the set as a whole contains a great deal of irony, sarcasm or metaphor, though no particular tweet is guaranteed to manifest a specific figurative phenomenon. Since irony and sarcasm are typically used to criticize or to mock, and thus skew the perception of sentiment toward the negative, it is not enough for a system to simply determine whether the sentiment of a given tweet is positive or negative. We thus use an 11-point scale, ranging from -5 (very negative, for tweets with highly critical meanings) to +5 (very positive, for tweets with flattering or very upbeat meanings). The point 0 on this scale is used for neutral tweets, or those whose positivity and negativity cancel each other out. While the majority of tweets will have sentiments in the negative part of the scale, the challenge for participating systems is to decide just how negative or positive a tweet seems to be.
So, given a set of tweets that are rich in metaphor, sarcasm and irony, the goal is to determine whether a user has expressed a positive, negative or neutral sentiment in each, and the degree to which this sentiment has been communicated.

Dataset Design and Collection
Even humans have difficulty in deciding whether a given text is ironic or metaphorical. Irony can be remarkably subtle, while metaphor takes many forms, ranging from the dead to the conventional to the novel. Sarcasm is easier for humans to detect, and is perhaps the least sophisticated form of non-literal language. We sidestep problems of detection by harvesting tweets from Twitter that are likely to contain figurative language, either because they have been explicitly tagged as such (using e.g. the hashtags #irony, #sarcasm, #not, #yeahright) or because they use words commonly associated with the use of metaphor (ironically, the words "literally" and "virtually" are reliable markers of metaphorical intent, as in "I literally want to die").
Datasets were collected using the Twitter4j API (http://twitter4j.org/en/index.html), which supports the harvesting of tweets in real time using search queries. Queries for hashtags such as #sarcasm, #sarcastic and #irony, and for words such as "figuratively", yielded our initial corpora of candidate tweets to annotate. We then developed a Latent Semantic Analysis (LSA) model to extend this seed set of hashtags so as to harvest a wider range of figurative tweets (see Li et al., 2014). This tweet dataset was collected over a period of 4 weeks, from June 1st to June 30th, 2014. Though URLs have been removed from tweets, all other content, including hashtags (even those used to retrieve each tweet), has been left in place. Tweets must contain at least 30 characters when hashtags are not counted, or 40 characters when hashtags are counted. All others are eliminated as too short.
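The cleaning and length criteria described above can be sketched in a few lines of Python. This is an illustration only, not the script used for the task: the function names and regular expressions are hypothetical, and the sketch assumes that a tweet is kept when it satisfies either of the two length thresholds.

```python
import re

# Hypothetical patterns; the source does not say how hashtags or URLs were matched.
HASHTAG = re.compile(r"#\w+")
URL = re.compile(r"https?://\S+")


def strip_urls(text):
    """Remove URLs but leave all other content, including hashtags, in place."""
    return " ".join(URL.sub(" ", text).split())


def is_long_enough(text, min_plain=30, min_with_tags=40):
    """Assumed reading of the rule: keep a tweet that has at least 30
    characters without hashtags, or at least 40 characters with them."""
    with_tags = strip_urls(text)
    without_tags = " ".join(HASHTAG.sub(" ", with_tags).split())
    return len(without_tags) >= min_plain or len(with_tags) >= min_with_tags


if __name__ == "__main__":
    for tweet in ["So glad it is raining again today #not http://t.co/x",
                  "Great #sarcasm"]:
        print(is_long_enough(tweet), "|", strip_urls(tweet))
```

Under this reading, a very short tweet that consists mostly of hashtags is discarded, while a short remark padded out by several hashtags can still pass the 40-character threshold.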

Dataset Annotation on an 11-point scale
A trial dataset, consisting of 1025 tweets, was first prepared by harvesting tweets from Twitter users that are known for their use of figurative language (e.g. comedians). Each trial tweet was annotated by seven annotators from an internal team, three of whom are native English speakers, the other four of whom are competent non-native speakers. Each annotator was asked to assign a score ranging from -5 (for any tweets conveying disgust or extreme discontent) to +5 (for tweets conveying obvious joy and approval or extreme pleasure), where 0 is reserved for tweets in which positive and negative sentiment is balanced. Annotators were asked to use ±5, ±3 and ±1 as scores for tweets calling for strong, moderate or weak sentiment, and to use ±4 and ±2 for tweets with nuanced sentiments that fall between these gross scores. An overall sentiment score for each tweet was calculated as a weighted average of all 7 annotators, where a double weighting was given to native English speakers.
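As an illustration of the weighting just described, the following Python sketch computes a trial score from seven annotations, counting each native speaker twice. The function name and the example scores are hypothetical; only the double weighting comes from the text.

```python
def trial_score(scores, is_native, native_weight=2.0):
    """Weighted average of annotator scores, with native English speakers
    given double weight (the other annotators have weight 1)."""
    if not scores or len(scores) != len(is_native):
        raise ValueError("need one native/non-native flag per score")
    weights = [native_weight if native else 1.0 for native in is_native]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


if __name__ == "__main__":
    # Hypothetical annotations: three native and four non-native annotators.
    scores = [-4, -3, -5, -2, -3, -4, -3]
    is_native = [True, True, True, False, False, False, False]
    print(round(trial_score(scores, is_native), 2))
```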
Sentiment was assigned on the basis of the perceived meaning of each tweet (the meaning an author presumably intends a reader to unpack from the text) and not the superficial language of the tweet. Thus, a sarcastic tweet that expresses a negative message in language that feigns approval or delight should be marked with a negative score (as in "I just love it when my friends throw me under the bus."). Annotators were explicitly asked to consider all of a tweet's content when assigning a score, including any hashtags (such as #sarcasm, #irony, etc.), as participating systems are expected to use all of the tweet's content, including hashtags.
Tweets of the training and test datasets, comprising 8000 and 4000 tweets respectively, were each annotated on a crowd-sourcing platform, CrowdFlower.com, following the same annotation scheme as for the trial dataset. Some examples of tweets and their ideal scores, given as guidelines to CrowdFlower annotators, are shown in Table 1.

Table 1: Examples of tweets and their ideal scores, given as guidelines to CrowdFlower annotators.
| Tweet Content | Score |
| --- | --- |
| @ThisIsDeep_ you are about as deep as a turd in a toilet bowl. Internet culture is #garbage and you are bladder cancer. | -4 |
| A paperless office has about as much chance as a paperless bathroom | -3 |
| Today will be about as close as you'll ever get to a "PERFECT 10" in the weather world! Happy Mother's Day! Sunny and pleasant! High 80. | 3 |
| I missed voting due to work. But I was behind the Austrian entry all the way, so to speak. I might enter next year. Who knows? | 1 |

Scammers tend to give identical or random scores for all units in a task. To prevent scammers from abusing the task, trial tweets were thus interwoven as test questions for annotators on training and test tweets. Each annotator was expected to provide judgments for test questions that fall within the range of scores given by the original members of the internal team. Annotators are dismissed if their overall accuracy on these questions is below 70%. The standard deviation $\mathrm{std}_u(u_i)$ of all judgments provided by annotator $u_i$ also indicates that $u_i$ is likely to be a scammer when $\mathrm{std}_u(u_i) = 0$. Likewise, the standard deviation $\mathrm{std}_t(t_j)$ of all judgments given for a tweet $t_j$ allows us to judge that annotation $A_{i,j}$ as given by $u_i$ for $t_j$ is an outlier if: [equation missing in the source text]
If 60% or more of an annotator's judgments are judged to be outliers in this way then the annotator is deemed a scammer and dismissed from the task. Each tweet-set was cleaned of all annotations provided by those deemed to be scammers. After cleaning, each tweet has 5 to 7 annotations.
The ratio of in-range judgments on trial tweets, which was used to detect scammers on the annotation of training and test data, can also be used to assign a reliability score to each annotator. The reliability of an annotator $u_i$ is given by $R(u_i) = m_i / n_i$, where $n_i$ is the number of judgments contributed by $u_i$ on trial tweets, and $m_i$ is the number of these judgments that fall within the range of scores provided by the original annotators of the trial data. The final sentiment score $S(t_j)$ for tweet $t_j$ is the weighted average of scores given for it, where the reliability of each annotator is used as a weight.
The weighted sentiment score is a real number in the range [-5 … +5], where the most reliable annotators contribute most to each score. These scores were provided to task participants in two CSV formats: tweet-ids mapped to real number scores, and tweet-ids to rounded integer scores.
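The reliability ratio and the reliability-weighted score can be sketched as follows. This is a minimal illustration under stated assumptions, not the organizers' code: all function names and data layouts are hypothetical, and only two of the scammer checks (accuracy below 70% on test questions, and zero standard deviation) are shown.

```python
from statistics import pstdev


def reliability(judgments, allowed_ranges):
    """R(u_i) = m_i / n_i: the share of an annotator's judgments on trial
    tweets that fall within the range given by the original annotators.
    judgments: {tweet_id: score}; allowed_ranges: {tweet_id: (low, high)}."""
    in_range = sum(1 for tid, score in judgments.items()
                   if allowed_ranges[tid][0] <= score <= allowed_ranges[tid][1])
    return in_range / len(judgments)


def is_suspect(all_scores, accuracy, min_accuracy=0.7):
    """Flag an annotator whose accuracy on test questions is below 70%,
    or whose judgments never vary (standard deviation of zero)."""
    return accuracy < min_accuracy or pstdev(all_scores) == 0


def weighted_score(annotations, reliabilities):
    """S(t_j): average of the scores for one tweet, weighted by R(u_i).
    annotations: {annotator: score}; reliabilities: {annotator: R}."""
    total = sum(reliabilities[u] for u in annotations)
    return sum(reliabilities[u] * s for u, s in annotations.items()) / total


if __name__ == "__main__":
    r = reliability({"t1": -3, "t2": 2, "t3": 0, "t4": 5},
                    {"t1": (-4, -2), "t2": (-1, 1), "t3": (-1, 1), "t4": (3, 5)})
    print(r, is_suspect([-3, 2, 0, 5], r))
    print(weighted_score({"a": -4, "b": -2}, {"a": 1.0, "b": 0.5}))
```

Because the weights are normalized by their sum, the resulting score stays within the range of the individual annotations, and hence within [-5, +5].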

Tweet Delivery
The actual text of each tweet was not included in the released datasets due to copyright and privacy concerns that are standard for use of Twitter data. Instead, a script was provided for retrieving the text of each tweet given its released tweet-id.
Tweets are a perishable commodity and may be deleted, archived or otherwise made inaccessible over time by their original creators. To ensure that tweets did not perish in the interval between their first release and final submission, all training and test tweets were re-tweeted via a dedicated account to give them new, non-perishable tweet-ids. The distributed tweet-ids refer to this dedicated account.

Dataset Statistics
The trial dataset contains a mix of figurative tweets chosen manually from Twitter. It consists of 1025 tweets annotated by an internal team of seven members. Table 2 shows the number of tweets in each category. The trial dataset is small enough to allow these category labels to be applied manually. The training and test datasets were annotated by CrowdFlower users from countries where English is spoken as a native language. The 8000 tweets of the training set were allocated as in Table 3. As the datasets are simply too large for the category labels Sarcasm, Irony and Metaphor to be assigned manually, the labels here refer to our expectations of the kind of tweets in each segment of the dataset, which were each collated using harvesting criteria specific to different kinds of figurative language. To provide balance, an additional category Other was also added to the Test dataset. Tweets in this category were drawn from general Twitter content, and so were not chosen to capture any specific figurative quality. Rather, the category was added to ensure the ecological validity of the task, as sentiment analysis is never performed on texts that are wholly figurative. The 4000 tweets of the Test set were drawn from four categories as in Table 4.

Scoring Functions
The Cosine-similarity scoring function represents the gold-standard annotations for the Test dataset as a vector of the corresponding sentiment scores. The scores provided by each participating system are represented in a comparable vector format, so that the cosine of the angle between these vectors captures the overall similarity of both score sets. A score of 1 is achieved only when a system provides all the same scores as the human gold-standard. A script implementing this scoring function was released to all registered participants, who were required in turn to submit the outputs of their systems as a tab-separated file of tweet-ids and integer sentiment scores (as systems may be based either on a regression or a classification model).
A multiplier $p_{cos}$ is applied to all submissions, to penalize any that do not give scores for all tweets. Thus,
$$p_{cos} = \frac{\#\text{submitted-entries}}{\#\text{all-entries}}$$
E.g., a cherry-picking system that scores just 75% of the test tweets is hit with a 25% penalty.
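The penalized cosine measure can be sketched in Python as below. This is not the released scoring script: the function name is hypothetical, and the sketch assumes that the cosine is computed over the tweets a system actually scored before the multiplier is applied.

```python
import math


def penalized_cosine(gold, predicted):
    """Cosine similarity between gold and system scores, multiplied by
    p_cos = #submitted-entries / #all-entries.
    gold, predicted: dicts mapping tweet-id to sentiment score."""
    shared = [tid for tid in gold if tid in predicted]
    if not shared:
        return 0.0
    dot = sum(gold[t] * predicted[t] for t in shared)
    norm_gold = math.sqrt(sum(gold[t] ** 2 for t in shared))
    norm_pred = math.sqrt(sum(predicted[t] ** 2 for t in shared))
    if norm_gold == 0 or norm_pred == 0:
        return 0.0  # cosine is undefined for an all-zero vector
    p_cos = len(shared) / len(gold)
    return p_cos * dot / (norm_gold * norm_pred)


if __name__ == "__main__":
    gold = {"1": -3, "2": 2, "3": -1, "4": 4}
    print(penalized_cosine(gold, dict(gold)))
    # Scoring only three of the four tweets incurs a 25% penalty.
    print(penalized_cosine(gold, {"1": -3, "2": 2, "3": -1}))
```

Cosine similarity is insensitive to a uniform positive rescaling of a system's scores, so it is usefully read alongside an error measure that is sensitive to magnitude.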
Mean-Squared-Error (MSE) offers a standard basis for measuring the performance of predictive systems, and is favored by some developers as a basis for optimization. When calculating MSE, in which lower measures indicate better performance, the penalty-coefficient $p_{MSE}$ is instead given by:
$$p_{MSE} = \frac{\#\text{all-entries}}{\#\text{submitted-entries}}$$
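A corresponding sketch for the MSE measure follows. As before, the function name is hypothetical, and the sketch assumes that the error is averaged over the tweets a system scored and then multiplied by the penalty coefficient, so that incomplete submissions receive a larger (worse) value.

```python
def penalized_mse(gold, predicted):
    """Mean squared error over the scored tweets, multiplied by
    p_MSE = #all-entries / #submitted-entries.
    gold, predicted: dicts mapping tweet-id to sentiment score."""
    shared = [tid for tid in gold if tid in predicted]
    if not shared:
        raise ValueError("the submission scores none of the gold tweets")
    mse = sum((gold[t] - predicted[t]) ** 2 for t in shared) / len(shared)
    p_mse = len(gold) / len(shared)
    return p_mse * mse


if __name__ == "__main__":
    gold = {"1": -3, "2": 2, "3": -1, "4": 4}
    print(penalized_mse(gold, {"1": -2, "2": 2, "3": -1, "4": 2}))
    # Scoring only half of the tweets doubles the reported error.
    print(penalized_mse(gold, {"1": -2, "2": 2}))
```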

Overview of Participating Systems
A total of 15 teams participated in Task 11, submitting results from 29 distinct runs. A clear preference for supervised learning methods can be observed, with two types of approach (SVMs and regression models over carefully engineered features) making up the bulk of systems. Team UPF used regression with a Random-Sub-Space using M5P as a base algorithm. They exploited additional external resources such as SentiWordNet, Depeche Mood, and the American National Corpus. Team ValenTo used a regression model combined with affective resources such as SenticNet (see Poria et al., 2014) to assign polarity scores. Team Elirf used an SVM-based approach, with features drawn from character N-grams (2 < N < 10) and a bag-of-words model of the tf-idf coefficient of each N-gram feature. Team BUAP also used an SVM approach, taking features from dictionaries, POS tags and character n-grams. Team CLaC used four lexica, one that was automatically generated and three that were manually crafted. Term frequencies, POS tags and emoticons were also used as features. Team LLT_PolyU used a semi-supervised approach with a Decision Tree Regression Learner, using word-level sentiment scores and dependency labels as features. Team CPH used ensemble methods and ridge regression (without stopwords), and is notable for its specific avoidance of sentiment lexicons. Team DsUniPi combined POS tags and regular expressions to identify useful syntactic structures, and brought sentiment lexicons and WordNet-based similarity measures to bear on their supervised approach. Team RGU's system learnt a sentiment model from the training data, and used a linear Support Vector Classifier to generate integer sentiment labels. Team ShellFBK also used a supervised approach, extracting grammatical relations for use as features from dependency tree parses.
Team HLT also used an SVM-based approach, using lexical features such as negation, intensifiers and other markers of amusement and irony. Team KElab constructed a supervised model based on term co-occurrence scores and the distribution of emotion-bearing terms in training tweets. Team LT3 employed a combined, semi-supervised SVM- and regression-based approach, exploiting a range of lexical features, a terminology extraction system and both WordNet and DBpedia. Team PRHLT used a deep auto-encoder to extract features, employing both words and character 3-grams as tokens for the autoencoder. Their best results were obtained with ensembles of Extremely Random Trees with character n-grams as features.

Results and Discussions
For comparison purposes, we constructed three baseline systems, each implemented as a naïve classifier with shallow bag-of-words features. The results of these baseline systems for both the MSE and Cosine metrics are shown in Table 5. Table 6 shows the results for each participating system using these metrics. Team CLaC achieves the best overall performance on both, with 0.758 on the Cosine metric and 2.117 on the MSE metric. Most of the other systems also show a clear advantage over the baselines reported in Table 5.
Table 6: Overall results, sorted by cosine metric. Scores are for the last run submitted for each system.
The best performance on sarcasm and irony tweets was achieved by teams LLT_PolyU and Elirf, who ranked 3rd and 4th respectively. Team CLaC came first on tweets in the Metaphor category. One run of team CPH excelled on the Other (non-figurative) category, but scored poorly on figurative tweets. Most teams performed well on sarcasm and irony tweets, but the Metaphor and Other categories prove more of a challenge. Table 7 presents the Spearman's rank correlation between the ranking of a system overall, on all tweet categories, and its ranking on different categories of tweets. The right column limits this analysis to the top 10 systems. When we consider all systems, their performance on each category of tweet is strongly correlated to their overall performance. However, looking only at the top 10 performing systems, we see a strikingly strong correlation between performance overall and performance on the category Metaphor. Performance on Metaphor tweets is a bellwether for performance on figurative language overall. The category Other also plays an important role here. Both the trial data and the training datasets are heavily biased to negative sentiment, given their concentration of ironic and sarcastic tweets. In contrast, the distribution of sentiment scores in the test data is more balanced due to the larger proportion of Metaphor tweets and the addition of non-figurative Other tweets. To excel at this task, systems must not treat all tweets as figurative, but learn to spot the features that cause figurative devices to influence the sentiment of a tweet.
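For readers unfamiliar with the rank correlation used in this analysis, the following Python sketch computes Spearman's coefficient for two rankings of the same systems. It uses the standard formula for rankings without ties; the function name and the example rankings are hypothetical and do not reproduce any figure from the task's tables.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman's rank correlation for two tie-free rankings of the same
    n items: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(ranks_a)
    if n != len(ranks_b) or n < 2:
        raise ValueError("need two rankings of the same length (n >= 2)")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    # Hypothetical ranks of five systems: overall vs. on one tweet category.
    overall = [1, 2, 3, 4, 5]
    category = [2, 1, 3, 5, 4]
    print(spearman_rho(overall, category))
```

A value near 1 means that systems are ordered almost identically on the category and overall, which is the sense in which a category can act as a bellwether.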

Summary and Conclusions
This paper has described the design and evaluation of Task 11, which concerns the determination of sentiment in tweets which are likely to employ figurative devices such as irony, sarcasm and metaphor. The task was constructed so as to avoid questions of what specific device is used in which tweet: a glance at Twitter, and the use of the #irony hashtag in particular, indicates that there are as many folk theories of irony as there are users of the hashtag #irony. Instead, we have operationalized the task to put it on a sound and more ecologically valid footing. The effect of figurativity in tweets is thus measured via an extrinsic task: measuring the polarity of tweets that use figurative language. The task is noteworthy in its use of an 11-point sentiment scoring scheme, ranging from -5 to +5. The use of 11 fine-grained categories precludes the measurement of inter-annotator agreement as a reliable guide to annotator/annotation quality, but it allows us to measure system performance on a task and a language type in which negativity dominates. We expect the trial, training and test datasets will prove useful to future researchers who wish to explore the complex relation between figurativity and sentiment. To this end, we have taken steps to preserve the tweets used in this task, to ensure that they do not perish through the actions of their original creators. Detailed results of the evaluation of all systems and runs are shown in Tables 9 and 10, or can be found online here: http://alt.qcri.org/semeval2015/task11/