SemEval-2016 Task 4: Sentiment Analysis in Twitter

This paper discusses the fourth year of the ``Sentiment Analysis in Twitter Task''. SemEval-2016 Task 4 comprises five subtasks, three of which represent a significant departure from previous editions. The first two subtasks are reruns from prior years and ask to predict the overall sentiment, and the sentiment towards a topic in a tweet. The three new subtasks focus on two variants of the basic ``sentiment classification in Twitter'' task. The first variant adopts a five-point scale, which confers an ordinal character to the classification task. The second variant focuses on the correct estimation of the prevalence of each class of interest, a task which has been called quantification in the supervised learning literature. The task continues to be very popular, attracting a total of 43 teams.


Introduction
Sentiment classification is the task of detecting whether a textual item (e.g., a product review, a blog post, an editorial, etc.) expresses a POSITIVE or a NEGATIVE opinion in general or about a given entity, e.g., a product, a person, a political party, or a policy. Sentiment classification has become a ubiquitous enabling technology in the Twittersphere. Classifying tweets according to sentiment has many applications in political science, social sciences, market research, and many others (Martínez-Cámara et al., 2014; Mejova et al., 2015).
* Fabrizio Sebastiani is currently on leave from Consiglio Nazionale delle Ricerche, Italy.
As a testament to the prominence of research on sentiment analysis in Twitter, the tweet sentiment classification (TSC) task has attracted the highest number of participants in the last three SemEval campaigns (Nakov et al., 2013; Rosenthal et al., 2014; Rosenthal et al., 2015; Nakov et al., 2016b).
Previous editions of the SemEval task involved binary (POSITIVE vs. NEGATIVE) or single-label multi-class classification (SLMC) when a NEUTRAL class is added (POSITIVE vs. NEGATIVE vs. NEUTRAL). SemEval-2016 Task 4 represents a significant departure from these previous editions. Although two of the subtasks (Subtasks A and B) are reincarnations of previous editions (SLMC classification for Subtask A, binary classification for Subtask B), SemEval-2016 Task 4 introduces two completely new problems, taken individually (Subtasks C and D) and in combination (Subtask E):

Ordinal Classification
We replace the two- or three-point scale with a five-point scale {HIGHLYPOSITIVE, POSITIVE, NEUTRAL, NEGATIVE, HIGHLYNEGATIVE}, which is now ubiquitous in the corporate world where human ratings are involved: e.g., Amazon, TripAdvisor, and Yelp all use a five-point scale for rating sentiment towards products, hotels, and restaurants.
Moving from a categorical two/three-point scale to an ordered five-point scale means, in machine learning terms, moving from binary to ordinal classification (a.k.a. ordinal regression).

Quantification
We replace classification with quantification, i.e., supervised class prevalence estimation. With regard to Twitter, hardly anyone is interested in whether a specific person has a positive or a negative view of the topic. Rather, applications look at estimating the prevalence of positive and negative tweets about a given topic. Most (if not all) tweet sentiment classification studies conducted within political science (Borge-Holthoefer et al., 2015; Kaya et al., 2013; Marchetti-Bowick and Chambers, 2012), economics (Bollen et al., 2011; O'Connor et al., 2010), social science (Dodds et al., 2011), and market research (Burton and Soboleva, 2011; Qureshi et al., 2013) use Twitter with an interest in aggregate data and not in individual classifications.
Estimating prevalences (more generally, estimating the distribution of the classes in a set of unlabelled items) by leveraging training data is called quantification in data mining and related fields. Previous work has argued that quantification is not a mere byproduct of classification, since (a) a good classifier is not necessarily a good quantifier, and vice versa, see, e.g., (Forman, 2008); (b) quantification requires evaluation measures different from classification. Quantification-specific learning approaches have been proposed over the years; Sections 2 and 5 of (Esuli and Sebastiani, 2015) contain several pointers to such literature.
Note that, in Subtasks B to E, tweets come labelled with the topic they are about and participants need not classify whether a tweet is about a given topic. A topic can be anything that people express opinions about; for example, a product (e.g., iPhone6), a political candidate (e.g., Hillary Clinton), a policy (e.g., Obamacare), an event (e.g., the Pope's visit to Palestine), etc.
The rest of the paper is structured as follows. In Section 2, we give a general overview of SemEval-2016 Task 4 and the five subtasks. Section 3 focuses on the datasets, and on the data generation procedure. In Section 4, we describe in detail the evaluation measures for each subtask. Section 5 discusses the results of the evaluation and the techniques and tools that the top-ranked participants used. Section 6 concludes, discussing the lessons learned and some possible ideas for a followup at SemEval-2017.

Task Definition
SemEval-2016 Task 4 consists of five subtasks:
1. Subtask A: Given a tweet, predict whether it is of positive, negative, or neutral sentiment.
2. Subtask B: Given a tweet known to be about a given topic, predict whether it conveys a positive or a negative sentiment towards the topic.
3. Subtask C: Given a tweet known to be about a given topic, estimate the sentiment it conveys towards the topic on a five-point scale ranging from HIGHLYNEGATIVE to HIGHLYPOSITIVE.
4. Subtask D: Given a set of tweets known to be about a given topic, estimate the distribution of the tweets in the POSITIVE and NEGATIVE classes.
5. Subtask E: Given a set of tweets known to be about a given topic, estimate the distribution of the tweets across the five classes of a five-point scale, ranging from HIGHLYNEGATIVE to HIGHLYPOSITIVE.
Subtask A is a rerun: it was present in all three previous editions of the task. In the 2013-2015 editions, it was known as Subtask B. We ran it again this year because it was the most popular subtask in the three previous task editions. It was the most popular subtask this year as well (see Section 5). Subtask B is a variant of SemEval-2015 Task 10 Subtask C (Rosenthal et al., 2015; Nakov et al., 2016b), which used POSITIVE, NEUTRAL, and NEGATIVE as the classification labels.
Subtask E is similar to SemEval-2015 Task 10 Subtask D, which consisted of the following problem: Given a set of messages on a given topic from the same period of time, classify the overall sentiment towards the topic in these messages as strongly positive, weakly positive, neutral, weakly negative, or strongly negative. Note that in SemEval-2015 Task 10 Subtask D, exactly one of the five classes had to be chosen, while in our Subtask E, a distribution across the five classes has to be estimated.

Datasets
In this section, we describe the process of collection and annotation of the training, development and testing tweets for all five subtasks. We dub this dataset the Tweet 2016 dataset in order to distinguish it from datasets generated in previous editions of the task.

Tweet Collection
We provided the datasets from the previous editions (see Table 2) of this task (Nakov et al., 2013; Rosenthal et al., 2014; Rosenthal et al., 2015; Nakov et al., 2016b) for training and development. In addition, we created new training and testing datasets.
We employed the following annotation procedure. As in previous years, we first gathered tweets that express sentiment about popular topics. For this purpose, we extracted named entities from millions of tweets, using a Twitter-tuned named entity recognition system (Ritter et al., 2011). The collected tweets were greatly skewed towards the neutral class. In order to reduce the class imbalance, we removed those that contained no sentiment-bearing words. We used SentiWordNet 3.0 (Baccianella et al., 2010) as a repository of sentiment words. Any word listed in SentiWordNet 3.0 with at least one sense having a positive or a negative sentiment score greater than 0.3 was considered sentiment-bearing.
The training and development tweets were collected from July to October 2015. The test tweets were collected from October to December 2015. We used the public streaming Twitter API to download the tweets.
We then manually filtered the resulting tweets to obtain a set of 200 meaningful topics with at least 100 tweets each (after filtering out near-duplicates). We excluded topics that were incomprehensible, ambiguous (e.g., Barcelona, which is the name both of a city and of a sports team), or too general (e.g., Paris, which is the name of a big city). We then discarded tweets that merely mentioned the topic but were not really about it.
Note that the topics in the training and in the test sets do not overlap, i.e., the test set consists of tweets about topics different from the topics the training and development tweets are about.
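The sentiment-bearing filter described above can be illustrated with a minimal Python sketch. The lexicon below is a toy stand-in with invented scores, not SentiWordNet 3.0 itself, and the names `TOY_LEXICON`, `is_sentiment_bearing`, and `keep_tweet`, as well as the whitespace tokenization, are assumptions made for illustration only.

```python
# Toy stand-in for SentiWordNet 3.0: word -> list of (positive, negative) scores,
# one pair per sense. All entries are invented for illustration.
TOY_LEXICON = {
    "great": [(0.75, 0.0), (0.25, 0.0)],
    "awful": [(0.0, 0.625)],
    "table": [(0.0, 0.0)],
    "fine": [(0.25, 0.125)],
}
THRESHOLD = 0.3  # score threshold stated in the text


def is_sentiment_bearing(word, lexicon=TOY_LEXICON, threshold=THRESHOLD):
    """True if at least one sense has a positive or negative score above the threshold."""
    senses = lexicon.get(word.lower(), [])
    return any(pos > threshold or neg > threshold for pos, neg in senses)


def keep_tweet(tweet, lexicon=TOY_LEXICON):
    """Keep a tweet only if it contains at least one sentiment-bearing word."""
    tokens = tweet.split()  # naive whitespace tokenization (an assumption)
    return any(is_sentiment_bearing(tok.strip(".,!?"), lexicon) for tok in tokens)
```

With this toy lexicon, a tweet containing "great" is kept, while one whose only lexicon words score at or below 0.3 in every sense is discarded.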

Annotation
The 2016 data consisted of four parts: TRAIN (for training models), DEV (for tuning models), DEVTEST (for development-time evaluation), and TEST (for the official evaluation). The first three datasets were annotated using Amazon's Mechanical Turk, while the TEST dataset was annotated on CrowdFlower.
Instructions: Given a Twitter message and a topic, identify whether the message is highly positive, positive, neutral, negative, or highly negative (a) in general and (b) with respect to the provided topic. If a tweet is sarcastic, please select the checkbox "The tweet is sarcastic". Please read the examples and the invalid responses before beginning if this is the first time you are working on this HIT.
Annotation with Amazon's Mechanical Turk. A Human Intelligence Task (HIT) consisted of providing all required annotations for a given tweet message. In order to qualify to work on our HITs, a Mechanical Turk annotator (a.k.a. "Turker") had to have an approval rate greater than 95% and to have completed at least 50 approved HITs. Each HIT was carried out by five Turkers and consisted of five tweets to be annotated. A Turker had to indicate the overall polarity of the tweet message (on a five-point scale) as well as the overall polarity of the message towards the given target topic (again, on a five-point scale). The annotation instructions along with an example are shown in Figure 1. We made available to the Turkers several additional examples, which are shown in Table 3.
We rejected HITs with the following problems:
• one or more responses do not have the overall sentiment marked;
• one or more responses do not have the sentiment towards the topic marked;
• one or more responses appear to be randomly selected.
Annotation with CrowdFlower. We annotated the TEST data using CrowdFlower, as it allows better quality control of the annotations across a number of dimensions. Most importantly, it allows us to find and exclude unreliable annotators based on hidden tests, which we created starting with the highest-confidence and highest-agreement annotations from Mechanical Turk. We added some more tests manually. Otherwise, we set up the annotation task giving exactly the same instructions and examples as in Mechanical Turk.

Consolidation of annotations.
In previous years, we used majority voting to select the true label (and discarded cases where a majority had not emerged, which amounted to about 50% of the tweets). As this year we have a five-point scale, where the expected agreement is lower, we used a two-step procedure. If three out of the five annotators agreed on a label, we accepted the label. Otherwise, we first mapped the categorical labels to the integer values −2, −1, 0, 1, 2. Then we calculated the average, and finally we mapped that average to the closest integer value. In order to counter-balance the tendency of the average to stay away from −2 and 2, and also to prefer 0, we did not use rounding at ±0.5 and ±1.5, but at ±0.4 and ±1.4 instead.
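The two-step consolidation procedure can be sketched in Python as follows. This is a minimal illustration, not the organizers' actual script: the function name `consolidate` is hypothetical, and the treatment of averages that fall exactly on ±0.4 or ±1.4 (here mapped to the more extreme label) is an assumption, since the text does not say which side the boundary belongs to.

```python
from collections import Counter


def consolidate(labels):
    """Consolidate integer labels in {-2, ..., 2} from several annotators into one label."""
    n = len(labels)
    value, count = Counter(labels).most_common(1)[0]
    if count >= 3:  # step 1: at least three annotators agree on a label
        return value
    # Step 2: average the labels and round with thresholds shifted to 0.4 and 1.4.
    total = sum(labels)
    magnitude = abs(total)
    sign = 1 if total > 0 else -1
    # Integer comparison avoids floating-point issues:
    # |total| / n >= 1.4  is equivalent to  10 * |total| >= 14 * n.
    if 10 * magnitude >= 14 * n:
        return 2 * sign
    if 10 * magnitude >= 4 * n:  # |average| >= 0.4
        return 1 * sign
    return 0
```

For example, the annotations `[2, 1, 0, 0, -1]` have no three-way agreement and average 0.4; standard rounding would yield 0, whereas the shifted threshold in this sketch yields 1.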
To give the reader an idea about the degree of agreement, we will look at the TEST dataset as an example. It included 20,632 tweets. For 2,760, all five annotators assigned the same value, and for another 9,944 there was a majority value. For the remaining 7,928 cases, we had to perform averaging as described above.
The consolidated statistics from the five annotators on a three-point scale for Subtask A are shown in Table 4. Note that, for consistency, we annotated the data for Subtask A on a five-point scale, which we then converted to a three-point scale.
The topic annotations on a two-point scale for Subtasks B and D are shown in Table 5, while those on a five-point scale for Subtasks C and E are in Table 6. Note that, as for Subtask A, the two-point scale annotation counts for Subtasks B and D derive from summing the POSITIVEs with the HIGHLYPOSITIVEs, and the NEGATIVEs with the HIGHLYNEGATIVEs from Table 6; moreover, this time we also remove the NEUTRALs.
As we use the same test tweets for all subtasks, the submission of results by participating teams was subdivided in two stages: (i) participants had to submit results for Subtasks A, C, E, and (ii) only after the submission deadline for A, C, E had passed did we distribute to participants the unlabelled test data for Subtasks B and D. Otherwise, since for Subtasks B and D we filter out the NEUTRALs, we would have leaked information about which tweets are NEUTRAL, and this information could have been used in Subtasks C and E.
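The label mappings described in this section (five-point to three-point for Subtask A, and five-point to two-point with NEUTRALs removed for Subtasks B and D) can be sketched as follows. The integer encoding and the function names are assumptions made for illustration.

```python
def to_three_point(label):
    """Collapse a label in {-2, -1, 0, 1, 2} to its sign in {-1, 0, 1}."""
    return (label > 0) - (label < 0)


def to_two_point(labels):
    """Collapse to positive (+1) / negative (-1) and drop NEUTRAL (0) items."""
    return [to_three_point(y) for y in labels if y != 0]
```

Note that the two-point list is shorter than the input whenever NEUTRAL items are present, which is exactly the information that the two-stage submission schedule was designed not to leak.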
Finally, as the same tweets can be selected for different topics, we ended up with some duplicates; arguably, these are true duplicates for Subtask A only, as for the other subtasks the topics still differ. This includes 25 duplicates in TRAIN, 3 in DEV, 2 in DEVTEST, and 116 in TEST. There is a larger number in TEST, as TEST is about twice as large as TRAIN, DEV, and DEVTEST combined. This is because we wanted a large TEST set with 100 topics and 200 tweets per topic on average for Subtasks C and E.

Evaluation Measures
This section discusses the evaluation measures for the five subtasks of our SemEval-2016 Task 4. A document describing the evaluation measures in detail (Nakov et al., 2016a), and scoring software implementing all five "official" measures, were made available to the participants via the task website together with the training data.
For Subtasks B to E, the datasets are each subdivided into a number of "topics", and the subtask needs to be carried out independently for each topic. As a result, each of the evaluation measures will be "macroaveraged" across the topics, i.e., we compute the measure individually for each topic, and we then average the results across the topics.

Subtask A: Message polarity classification
Subtask A is a single-label multi-class (SLMC) classification task. Each tweet must be classified as belonging to exactly one of the following three classes C={POSITIVE, NEUTRAL, NEGATIVE}.
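We adopt the same evaluation measure as the 2013-2015 editions of this subtask, $F_1^{PN}$:
$$F_1^{PN} = \frac{F_1^{P} + F_1^{N}}{2} \tag{1}$$
$F_1^{P}$ is the $F_1$ score for the POSITIVE class:
$$F_1^{P} = \frac{2 \pi^{P} \rho^{P}}{\pi^{P} + \rho^{P}} \tag{2}$$
Here, $\pi^{P}$ and $\rho^{P}$ denote precision and recall for the POSITIVE class, respectively:
$$\pi^{P} = \frac{PP}{PP + PU + PN} \tag{3}$$
$$\rho^{P} = \frac{PP}{PP + UP + NP} \tag{4}$$
where $PP$, $UP$, $NP$, $PU$, $PN$ are the cells of the confusion matrix shown in Table 7. $F_1^{N}$ is defined analogously, and the measure we finally adopt is $F_1^{PN}$ as from Equation 1.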
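As an illustration of the measure just defined, the following Python sketch computes $F_1^{PN}$ from parallel lists of gold and predicted labels. It is not the official scorer; the string labels, the function name `f1_pn`, and the convention of returning 0 when a precision, recall, or $F_1$ value is undefined are assumptions.

```python
def f1_pn(gold, pred):
    """Average of the F1 scores of the POSITIVE and NEGATIVE classes."""

    def f1(cls):
        tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
        predicted = sum(p == cls for p in pred)  # items the classifier labelled cls
        actual = sum(g == cls for g in gold)     # items whose gold label is cls
        precision = tp / predicted if predicted else 0.0  # assumption: 0 if undefined
        recall = tp / actual if actual else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    return (f1("positive") + f1("negative")) / 2
```

NEUTRAL tweets still influence the score, because a NEUTRAL tweet predicted as POSITIVE or NEGATIVE lowers the precision of that class.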

Subtask B: Tweet classification according to a two-point scale
Subtask B is a binary classification task. Each tweet must be classified as either POSITIVE or NEGATIVE. For this subtask we adopt macroaveraged recall:
$$\rho^{PN} = \frac{1}{2}\left(\rho^{P} + \rho^{N}\right) = \frac{1}{2}\left(\frac{PP}{PP + NP} + \frac{NN}{NN + PN}\right) \tag{5}$$
In the above formula, $\rho^{P}$ and $\rho^{N}$ are the positive and the negative class recall, respectively. Note that U terms are entirely missing in Equation 5; this is because we do not have the NEUTRAL class for SemEval-2016 Task 4, Subtask B.
$\rho^{PN}$ ranges in [0, 1], where a value of 1 is achieved only by the perfect classifier (i.e., the classifier that correctly classifies all items), a value of 0 is achieved only by the perverse classifier (the classifier that misclassifies all items), while 0.5 is both (i) the value obtained by a trivial classifier (i.e., the classifier that assigns all tweets to the same class, be it POSITIVE or NEGATIVE), and (ii) the expected value of a random classifier. The advantage of $\rho^{PN}$ over "standard" accuracy is that it is more robust to class imbalance. The accuracy of the majority-class classifier is the relative frequency (a.k.a. "prevalence") of the majority class, which may be much higher than 0.5 if the test set is imbalanced. Standard $F_1$ is also sensitive to class imbalance for the same reason. Another advantage of $\rho^{PN}$ over $F_1$ is that $\rho^{PN}$ is invariant with respect to switching POSITIVE with NEGATIVE, while $F_1$ is not. See (Sebastiani, 2015) for more details on $\rho^{PN}$.
As we noted before, the training dataset, the development dataset, and the test dataset are each subdivided into a number of topics, and Subtask B needs to be carried out independently for each topic. As a result, the evaluation measures discussed in this section are computed individually for each topic, and the results are then averaged across topics to yield the final score.
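A minimal Python sketch of macroaveraged recall and of the averaging across topics is given below. The function names and the data layout (a dictionary mapping each topic to a pair of gold and predicted label lists) are assumptions, and the sketch assumes that both classes occur among the gold labels of every topic.

```python
def macro_recall(gold, pred, classes=("positive", "negative")):
    """Unweighted mean of the per-class recalls."""
    recalls = []
    for cls in classes:
        # Predictions for the items whose gold label is cls
        # (assumption: every class occurs at least once in gold).
        preds_for_cls = [p for g, p in zip(gold, pred) if g == cls]
        recalls.append(sum(p == cls for p in preds_for_cls) / len(preds_for_cls))
    return sum(recalls) / len(recalls)


def topic_average(measure, by_topic):
    """Compute a measure separately for each topic, then average across topics."""
    scores = [measure(gold, pred) for gold, pred in by_topic.values()]
    return sum(scores) / len(scores)
```

On a topic with 8 POSITIVE and 2 NEGATIVE tweets, the classifier that labels everything POSITIVE reaches an accuracy of 0.8 but a macroaveraged recall of only 0.5, which illustrates the robustness to class imbalance discussed above.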

Subtask C: Tweet classification according to a five-point scale
Subtask C is an ordinal classification (OC, also known as ordinal regression) task, in which each tweet must be classified into exactly one of the classes in C={HIGHLYPOSITIVE, POSITIVE, NEUTRAL, NEGATIVE, HIGHLYNEGATIVE}, represented in our dataset by numbers in {+2,+1,0,−1,−2}, with a total order defined on C.
The essential difference between SLMC (see Section 4.1 above) and OC is that not all mistakes weigh equally in the latter. For example, misclassifying a HIGHLYNEGATIVE example as HIGHLYPOSITIVE is a bigger mistake than misclassifying it as NEGATIVE or NEUTRAL.
As our evaluation measure, we use macroaveraged mean absolute error ($MAE^{M}$):
$$MAE^{M}(h, Te) = \frac{1}{|C|} \sum_{j=1}^{|C|} \frac{1}{|Te_j|} \sum_{x_i \in Te_j} |h(x_i) - y_i|$$
where $y_i$ denotes the true label of item $x_i$, $h(x_i)$ is its predicted label, $Te_j$ denotes the set of test documents whose true class is $c_j$, $|h(x_i) - y_i|$ denotes the "distance" between classes $h(x_i)$ and $y_i$ (e.g., the distance between HIGHLYPOSITIVE and NEGATIVE is 3), and the "M" superscript indicates "macroaveraging". The advantage of $MAE^{M}$ over "standard" mean absolute error, which is defined as:
$$MAE^{\mu}(h, Te) = \frac{1}{|Te|} \sum_{x_i \in Te} |h(x_i) - y_i| \tag{6}$$
is that it is robust to class imbalance (which is useful, given the imbalanced nature of our dataset). On perfectly balanced datasets $MAE^{M}$ and $MAE^{\mu}$ are equivalent.
Unlike the measures discussed in Sections 4.1 and 4.2, $MAE^{M}$ is a measure of error, and not accuracy, and thus lower values are better. See (Baccianella et al., 2009) for more detail on $MAE^{M}$.
Similarly to Subtask B, Subtask C needs to be carried out independently for each topic. As a result, $MAE^{M}$ is computed individually for each topic, and the results are then averaged across all topics to yield the final score.
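The difference between the macroaveraged and the standard (microaveraged) mean absolute error can be made concrete with a short Python sketch. Labels are assumed to be integers in {−2, ..., +2}; the function names are hypothetical, and skipping classes that do not occur among the gold labels is an assumption, since the formula is undefined for empty classes.

```python
def mae_micro(gold, pred):
    """Standard mean absolute error over all items."""
    return sum(abs(p - g) for g, p in zip(gold, pred)) / len(gold)


def mae_macro(gold, pred, classes=(-2, -1, 0, 1, 2)):
    """Mean of the per-class mean absolute errors."""
    per_class = []
    for c in classes:
        errors = [abs(p - g) for g, p in zip(gold, pred) if g == c]
        if errors:  # assumption: classes absent from the gold labels are skipped
            per_class.append(sum(errors) / len(errors))
    return sum(per_class) / len(per_class)
```

With four NEUTRAL tweets and one HIGHLYPOSITIVE tweet, a system that always predicts NEUTRAL gets a standard error of 0.4 but a macroaveraged error of 1.0, since its error on the rare class is no longer diluted by the frequent one.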

Subtask D: Tweet quantification according to a two-point scale
Subtask D also assumes a binary quantification setup, in which each tweet is classified as POSITIVE or NEGATIVE. The task is to compute an estimate $\hat{p}(c_j)$ of the relative frequency (in the test set) of each of the classes. The difference between binary classification (as from Section 4.2) and binary quantification is that errors of different polarity (e.g., a false positive and a false negative for the same class) can compensate each other in the latter. Quantification is thus a more lenient task since a perfect classifier is also a perfect quantifier, but a perfect quantifier is not necessarily a perfect classifier. We adopt normalized cross-entropy, better known as Kullback-Leibler Divergence ($KLD$). $KLD$ was proposed as a quantification measure in (Forman, 2005), and is defined as follows:
$$KLD(\hat{p}, p, C) = \sum_{c_j \in C} p(c_j) \log_e \frac{p(c_j)}{\hat{p}(c_j)} \tag{7}$$
$KLD$ is a measure of the error made in estimating a true distribution $p$ over a set $C$ of classes by means of a predicted distribution $\hat{p}$. Like $MAE^{M}$ in Section 4.3, $KLD$ is a measure of error, which means that lower values are better. $KLD$ ranges between 0 (best) and $+\infty$ (worst).
Note that the upper bound of $KLD$ is not finite since Equation 7 has predicted prevalences, and not true prevalences, at the denominator: that is, by making a predicted prevalence $\hat{p}(c_j)$ infinitely small we can make $KLD$ infinitely large. To solve this problem, in computing $KLD$ we smooth both $p(c_j)$ and $\hat{p}(c_j)$ via additive smoothing, i.e.,
$$p_s(c_j) = \frac{p(c_j) + \epsilon}{\left(\sum_{c_j \in C} p(c_j)\right) + \epsilon \cdot |C|}$$
where $p_s(c_j)$ denotes the smoothed version of $p(c_j)$ and the denominator is just a normalizer (same for the $\hat{p}_s(c_j)$'s); the quantity $\epsilon = \frac{1}{2 \cdot |Te|}$ is used as a smoothing factor, where $Te$ denotes the test set.
The smoothed versions of $p(c_j)$ and $\hat{p}(c_j)$ are used in place of their original versions in Equation 7; as a result, $KLD$ is always defined and still returns a value of 0 when $p$ and $\hat{p}$ coincide.
KLD is computed individually for each topic, and the results are averaged to yield the final score.
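The smoothed version of KLD for a single topic can be sketched in Python as follows. This is an illustrative sketch rather than the official scorer; the function names and the representation of distributions as lists of prevalences in a fixed class order are assumptions.

```python
import math


def smooth(prevalences, n_items):
    """Additive smoothing with epsilon = 1 / (2 * |Te|), followed by renormalization."""
    eps = 1.0 / (2.0 * n_items)
    denom = sum(prevalences) + eps * len(prevalences)
    return [(p + eps) / denom for p in prevalences]


def kld(true_prev, pred_prev, n_items):
    """Kullback-Leibler divergence between smoothed true and predicted prevalences."""
    p = smooth(true_prev, n_items)
    q = smooth(pred_prev, n_items)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Thanks to the smoothing, a prediction such as `[1.0, 0.0]` yields a large but finite value instead of a division by zero, while identical true and predicted distributions still yield 0.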

Subtask E: Tweet quantification according to a five-point scale
Subtask E is an ordinal quantification (OQ) task, in which (as in OC) each tweet belongs exactly to one of the classes in C={HIGHLYPOSITIVE, POSITIVE, NEUTRAL, NEGATIVE, HIGHLYNEGATIVE}, where there is a total order on C. As in binary quantification, the task is to compute an estimate $\hat{p}(c_j)$ of the relative frequency $p(c_j)$ in the test tweets of all the classes $c_j \in C$.
The measure we adopt for OQ is the Earth Mover's Distance (Rubner et al., 2000) (also known as the Vasersteȋn metric (Rüschendorf, 2001)), a measure well-known in the field of computer vision. $EMD$ is currently the only known measure for ordinal quantification. It is defined for the general case in which a distance $d(c', c'')$ is defined for each $c', c'' \in C$. When there is a total order on the classes in $C$ and $d(c_i, c_{i+1}) = 1$ for all $i \in \{1, ..., (|C| - 1)\}$ (as in our application), the Earth Mover's Distance is defined as
$$EMD(\hat{p}, p) = \sum_{j=1}^{|C|-1} \left| \sum_{i=1}^{j} \hat{p}(c_i) - \sum_{i=1}^{j} p(c_i) \right|$$
and can be computed in $|C|$ steps from the estimated and true class prevalences.
Like $KLD$ in Section 4.4, $EMD$ is a measure of error, so lower values are better; $EMD$ ranges between 0 (best) and $|C| - 1$ (worst). See  for more details on $EMD$.
As before, $EMD$ is computed individually for each topic, and the results are then averaged across all topics to yield the final score.
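For totally ordered classes at unit distance from each other, the Earth Mover's Distance reduces to a sum of absolute differences between cumulative prevalences, which the following Python sketch computes for one topic. The function name and the representation of each distribution as a list ordered from the lowest to the highest class are assumptions.

```python
from itertools import accumulate


def emd(true_prev, pred_prev):
    """Earth Mover's Distance between two distributions over ordered classes."""
    cum_true = list(accumulate(true_prev))
    cum_pred = list(accumulate(pred_prev))
    # The last cumulative value is 1 for both distributions, so it is left out,
    # matching the upper limit |C| - 1 of the outer sum.
    return sum(abs(a - b) for a, b in zip(cum_pred[:-1], cum_true[:-1]))
```

Placing all the probability mass on HIGHLYNEGATIVE when the true distribution is concentrated on HIGHLYPOSITIVE gives the maximum value of 4 for five classes, whereas a prediction that is off by a single adjacent class costs 1.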

Participants and Results
A total of 43 teams (see Table 15 at the end of the paper) participated in SemEval-2016 Task 4, representing 25 countries; the country with the highest participation was China (5 teams), followed by Italy, Spain, and USA (4 teams each). The subtask with the highest participation was Subtask A (34 teams), followed by Subtask B (19 teams), Subtask D (14 teams), Subtask C (11 teams), and Subtask E (10 teams).
It was not surprising that Subtask A proved to be the most popular: it was a rerun from previous years; conversely, none among Subtasks B to E had previously been offered in precisely the same form. Quantification-related subtasks (D and E) generated 24 participations altogether, while subtasks with an ordinal nature (C and E) attracted 21 participations. Only three teams participated in all five subtasks; conversely, no fewer than 23 teams took part in one subtask only (with a few exceptions, this was Subtask A). Many teams that participated in more than one subtask used essentially the same system for all of them, with little tuning to the specifics of each subtask.
A few trends stand out among the participating systems. In terms of the supervised learning methods used, there is a clear dominance of methods based on deep learning, including convolutional neural networks and recurrent neural networks (and, in particular, long short-term memory networks); the software libraries for deep learning most frequently used by the participants are Theano and Keras. Conversely, kernel machines seem to be less frequently used than in the past, and the use of learning methods other than the ones mentioned above is scarce.
The use of distant supervision is ubiquitous; this is natural, since there is an abundance of freely available tweets labelled according to sentiment (possibly with silver labels only, e.g., emoticons), and it is intuitive that their use as additional training data could be helpful. Another ubiquitous technique is the use of word embeddings, usually generated via either word2vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014); most authors seem to use general-purpose, pre-trained embeddings, while some authors also use customized word embeddings, trained either on the Tweet 2016 dataset or on tweet datasets of some sort.
Nothing radically new seems to have emerged with respect to text preprocessing; as in previous editions of this task, participants use a mix of by now obvious techniques, such as negation scope detection, elongation normalization, detection of amplifiers and diminishers, plus the usual extraction of word n-grams, character n-grams, and POS n-grams. The use of sentiment lexicons (alone or in combination with each other; general-purpose or Twitter-specific) is obviously still frequent.
In the next five subsections, we discuss the results of the participating systems in the five subtasks, focusing on the techniques and tools that the top-ranked participants have used. We also focus on how the participants tailored (if at all) their approach to the specific subtask. When discussing a specific subtask, we will adopt the convention of adding to a team name a subscript which indicates the position in the ranking for that subtask that the team obtained; e.g., when discussing Subtask E, "Finki$_2$" indicates team "Finki, which placed 2nd in the ranking for Subtask E". The papers describing the participants' approach are quoted in Table 15.

Subtask A: Message polarity classification
Table 8 ranks the systems submitted by the 34 teams who participated in Subtask A "Message Polarity Classification" in terms of the official measure $F_1^{PN}$. We further show the result for two other measures, $\rho^{PN}$ (the measure that we adopted for Subtask B) and accuracy ($Acc = \frac{TP + TN}{TP + TN + FP + FN}$). We also report the result for a baseline classifier that assigns to each tweet the POSITIVE class. For Subtask A evaluated using $F_1^{PN}$, this is the equivalent of the majority class classifier for (binary or SLMC) classification evaluated via vanilla accuracy, i.e., this is the "smartest" among the trivial policies that attempt to maximize $F_1^{PN}$.
Table 8: Results for Subtask A "Message Polarity Classification" on the Tweet 2016 dataset. The systems are ordered by their $F_1^{PN}$ score. In each column the rankings according to the corresponding measure are indicated with a subscript. Teams marked as "(*)" are late submitters, i.e., their original submission was deemed irregular by the organizers, and a revised submission was entered after the deadline.
All 34 participating systems were able to outperform the baseline on all three measures, with the exception of one system that scored below the baseline on $Acc$. The top-scoring team (SwissCheese$_1$) used an ensemble of convolutional neural networks, differing in their choice of filter shapes, pooling shapes and usage of hidden layers. Word embeddings generated via word2vec were also used, and the neural networks were trained by using distant supervision. Out of the 10 top-ranked teams, 5 teams (SwissCheese$_1$, SENSEI-LIF$_2$, UNIMELB$_3$, INESC-ID$_4$, INSIGHT-1$_8$) used deep NNs of some sort, and 7 teams (SwissCheese$_1$, SENSEI-LIF$_2$, UNIMELB$_3$, INESC-ID$_4$, aueb.twitter.sentiment$_5$, I2RNTU$_7$, INSIGHT-1$_8$) used either general-purpose or task-specific word embeddings, generated via word2vec or GloVe.
Historical results. We also tested the participating systems on the test sets from the three previous editions of this subtask. Participants were not allowed to use these test sets for training. Results (measured on $F_1^{PN}$) are reported in Table 9. The top-performing systems on Tweet 2016 are also top-ranked on the test datasets from previous years. There is a general pattern: the top-ranked system in year x outperforms the top-ranked system in year (x − 1) on the official dataset of year (x − 1). Top-ranked systems tend to use approaches that are universally strong, even when tested on out-of-domain test sets such as SMS, LiveJournal, or sarcastic tweets (yet, for sarcastic tweets, there are larger differences in rank compared to systems rankings on Tweet 2016). It is unclear where improvements come from: (a) the additional training data that we made available this year (in addition to Tweet-train-2013, which was used in 2013-2015), thus effectively doubling the amount of training data, or (b) advances in learning methods.
We further look at the top scores achieved by any system in the period 2013-2016. The results are shown in Table 10. Interestingly, the results for a test set improve in the second year it is used (i.e., the year after it was used as an official test set) by 1-3 points absolute, but then do not improve further and stay stable, or can even decrease a bit. This might be due to participants optimizing their systems primarily on the test set from the preceding year.
Table 9: Historical results for Subtask A "Message Polarity Classification". The systems are ordered by their score on the Tweet 2016 dataset; the rankings on the individual datasets are indicated with a subscript. The meaning of "(*)" is as in Table 8.

Subtask B: Tweet classification according to a two-point scale
Table 11 ranks the 19 teams who participated in Subtask B "Tweet classification according to a two-point scale" in terms of the official measure $\rho^{PN}$. We also show $F_1^{PN}$ (the measure adopted for Subtask A) and accuracy ($Acc$). We also report the result of a baseline that assigns to each tweet the positive class. This is the "smartest" among the trivial policies that attempt to maximize $\rho^{PN}$. This baseline always returns $\rho^{PN} = 0.500$. Note however that this is also (i) the value returned by the classifier that assigns to each tweet the negative class, and (ii) the expected value returned by the random classifier; for more details see (Sebastiani, 2015, Section 5), where $\rho^{PN}$ is called K.
The top-scoring team (Tweester$_1$) used a combination of convolutional neural networks, topic modeling, and word embeddings generated via word2vec. Similar to Subtask A, the main trend among all participants is the widespread use of deep learning techniques. Conversely, the use of classifiers such as support vector machines, which were dominant until a few years ago, seems to have decreased, with only one team (TwiSE$_8$) in the top 10 using them.
Table 11: Results for Subtask B "Tweet classification according to a two-point scale" on the Tweet 2016 dataset. The systems are ordered by their $\rho^{PN}$ score (higher is better). The meaning of "(*)" is as in Table 8.

Subtask C: Tweet classification according to a five-point scale
Table 12 ranks the 11 teams who participated in Subtask C "Tweet classification according to a five-point scale" in terms of the official measure $MAE^{M}$; we also show $MAE^{\mu}$ (see Equation 6). We also report the result of a baseline system that assigns to each tweet the middle class (i.e., NEUTRAL); for ordinal classification evaluated via $MAE^{M}$, this is the equivalent of the majority-class classifier for (binary or SLMC) classification evaluated via vanilla accuracy, i.e., this is (Baccianella et al., 2009) the "smartest" among the trivial policies that attempt to minimize $MAE^{M}$.
Table 12: Results for Subtask C "Tweet classification according to a five-point scale" on the Tweet 2016 dataset. The systems are ordered by their $MAE^{M}$ score (lower is better). The meaning of "(*)" is as in Table 8.
The top-scoring team (TwiSE$_1$) used a single-label multi-class classifier to classify the tweets according to their overall polarity. In particular, they used logistic regression that minimizes the multinomial loss across the classes, with weights to cope with class imbalance. Note that they ignored the given topics altogether.
Table 13: Results for Subtask D "Tweet quantification according to a two-point scale" on the Tweet 2016 dataset. The systems are ordered by their $KLD$ score (lower is better). The meaning of "(*)" is as in Table 8.
Only 2 of the 11 participating teams tuned their systems to exploit the ordinal (as opposed to binary, or single-label multi-class) nature of this subtask. The two teams who did exploit the ordinal nature of the problem are PUT$_3$, which uses an ensemble of ordinal regression approaches, and ISTI-CNR$_7$, which uses a tree-based approach to ordinal regression. All other teams used general-purpose approaches for single-label multi-class classification, in many cases relying (as for Subtask B) on convolutional neural networks, recurrent neural networks, and word embeddings.

Subtask D: Tweet quantification according to a two-point scale
Table 13 ranks the 14 teams who participated in Subtask D "Tweet quantification according to a two-point scale" in terms of the official measure KLD. Other measures are also reported, among them relative absolute error (RAE), where the notation is the same as in Equation 7. We also report the result of a "maximum likelihood" baseline system (dubbed Baseline 1 ). This system assigns to each test topic the distribution of the training tweets (the union of TRAIN, DEV, DEVTEST) across the classes. This is the "smartest" among the trivial policies that attempt to minimize KLD. We also report the result of a further (less smart) baseline system (dubbed Baseline 2 ), i.e., one that assigns a prevalence of 1 to the majority class (which happens to be the POSITIVE class) and a prevalence of 0 to the other class.
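The two trivial baselines can be sketched as follows. This is an illustration under stated assumptions, not the official scorer: KLD is taken to be $\sum_c p(c) \log (p(c)/\hat{p}(c))$, RAE is taken to be the mean over classes of $|\hat{p}(c) - p(c)|/p(c)$, and additive smoothing with a small `eps` is applied so that a predicted prevalence of 0 does not yield an infinite score. The label counts and the value of `eps` are invented.

```python
import math

CLASSES = ["POSITIVE", "NEGATIVE"]

def prevalence(labels, classes=CLASSES):
    """Relative frequency of each class in a list of labels."""
    return {c: labels.count(c) / len(labels) for c in classes}

def smooth(p, eps):
    """Additive smoothing (an assumption here): keeps every prevalence above 0."""
    denom = 1.0 + eps * len(p)
    return {c: (v + eps) / denom for c, v in p.items()}

def kld(true_p, pred_p, eps):
    """Kullback-Leibler divergence of the prediction from the true distribution."""
    t, q = smooth(true_p, eps), smooth(pred_p, eps)
    return sum(t[c] * math.log(t[c] / q[c]) for c in t)

def rae(true_p, pred_p, eps):
    """Relative absolute error, averaged over the classes."""
    t, q = smooth(true_p, eps), smooth(pred_p, eps)
    return sum(abs(q[c] - t[c]) / t[c] for c in t) / len(t)

# Made-up label counts for the training data and for one test topic.
train_labels = ["POSITIVE"] * 70 + ["NEGATIVE"] * 30
test_labels = ["POSITIVE"] * 60 + ["NEGATIVE"] * 40

true_p = prevalence(test_labels)
baseline1 = prevalence(train_labels)                  # training distribution
majority = max(baseline1, key=baseline1.get)
baseline2 = {c: 1.0 if c == majority else 0.0 for c in CLASSES}

eps = 1.0 / (2 * len(test_labels))                    # hypothetical smoothing factor
kld_b1 = kld(true_p, baseline1, eps)
kld_b2 = kld(true_p, baseline2, eps)
print(kld_b1, kld_b2)
```

In this toy setting Baseline 1 obtains a much lower KLD than Baseline 2, because the latter puts (almost) no probability mass on a class that does occur in the test topic.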
The top-scoring team (Finki 1 ) adopts an approach based on "classify and count", a classification-oriented (instead of quantification-oriented) approach, using recurrent and convolutional neural networks, and GloVe word embeddings.
Indeed, only 5 of the 14 participating teams tuned their systems to the fact that the subtask deals with quantification (as opposed to classification). Among the teams who did rely on quantification-oriented approaches, teams LYS 2 and HSENN 14 used an existing structured prediction method that directly optimizes KLD; teams QCRI 5 and ISTI-CNR 11 used existing probabilistic quantification methods; team NRU-HSE 7 used an existing iterative quantification method based on cost-sensitive learning. Interestingly, team TwiSE 2 used a "classify and count" approach after comparing it with a quantification-oriented method (similar to the one used by teams LYS 2 and HSENN 14 ) on the development set, and concluding that the former works better than the latter. All other teams used "classify and count" approaches, mostly based on convolutional neural networks and word embeddings.
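For reference, "classify and count" amounts to labeling each tweet of a topic individually and then reporting the relative frequencies of the predicted labels. The sketch below illustrates this; the keyword-based `toy_classifier` and the example tweets are hypothetical stand-ins for a trained classifier and real data.

```python
from collections import Counter

CLASSES = ["POSITIVE", "NEGATIVE"]

def classify_and_count(tweets, classify, classes=CLASSES):
    """Estimate class prevalences by classifying each tweet and counting the labels."""
    counts = Counter(classify(t) for t in tweets)
    return {c: counts[c] / len(tweets) for c in classes}

def toy_classifier(text):
    """Hypothetical stand-in for any trained binary sentiment classifier."""
    return "POSITIVE" if ("good" in text or "love" in text) else "NEGATIVE"

# Made-up tweets about a single topic.
topic_tweets = [
    "love this phone",
    "good battery",
    "terrible screen",
    "good camera",
]

estimated = classify_and_count(topic_tweets, toy_classifier)
print(estimated)
```

Any systematic bias of the classifier (e.g., more false positives than false negatives) carries over directly to the estimated prevalences, which is what quantification-oriented methods attempt to correct.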

Subtask E: Tweet quantification according to a five-point scale
Table 14 lists the results obtained by the 10 participating teams on Subtask E "Tweet quantification according to a five-point scale". We also report the result of a "maximum likelihood" baseline system (dubbed Baseline 1 ), i.e., one that assigns to each test topic the same distribution, namely the distribution of the training tweets (the union of TRAIN, DEV, DEVTEST) across the classes; this is the "smartest" among the trivial policies (i.e., those that do not require any genuine work) that attempt to minimize EMD.
We further report the result of a less smart baseline system (dubbed Baseline 2 ), one that assigns a prevalence of 1 to the majority class (which coincides with the POSITIVE class) and a prevalence of 0 to all other classes.
Table 14: Results for Subtask E "Tweet quantification according to a five-point scale" on the Tweet 2016 dataset. The systems are ordered by their EMD score (lower is better). The meaning of "(*)" is as in Table 8.
Only 3 of the 10 participants tuned their systems to the specific characteristics of this subtask, i.e., to the fact that it deals with quantification (as opposed to classification) and to the fact that it has an ordinal (as opposed to binary) nature.
In particular, the top-scoring team (QCRI 1 ) used a novel algorithm explicitly designed for ordinal quantification, which leverages an ordinal hierarchy of binary probabilistic quantifiers.
Team NRU-HSE 4 used an existing quantification approach based on cost-sensitive learning, and adapted it to the ordinal case.
Team ISTI-CNR 6 instead used a novel adaptation to quantification of a tree-based approach to ordinal regression.
Teams LYS 7 and HSENN 9 also used an existing quantification approach, but did not exploit the ordinal nature of the problem.
The other teams mostly used approaches based on "classify and count" (see Section 5.4), and viewed the problem as single-label multi-class (instead of ordinal) classification; some of these teams (notably, team Finki 2 ) obtained very good results, which testifies to the quality of the (general-purpose) features and learning algorithm they used.
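To see why ignoring the ordinal nature of the problem can matter under EMD, consider the following minimal sketch. It assumes the common formulation of the Earth Mover's Distance for ordered classes with unit distance between adjacent classes, i.e., the sum of the absolute differences between the two cumulative distributions; the distributions are invented.

```python
from itertools import accumulate

def emd(true_p, pred_p):
    """EMD between two distributions over the same ordered classes.

    Both arguments are lists of prevalences ordered along the ordinal scale.
    The last cumulative value is always 1 for both, so it is left out.
    """
    cum_true = list(accumulate(true_p))
    cum_pred = list(accumulate(pred_p))
    return sum(abs(a - b) for a, b in zip(cum_true[:-1], cum_pred[:-1]))

# Made-up topic in which every tweet belongs to the fourth class of the scale.
true_p = [0.0, 0.0, 0.0, 1.0, 0.0]

near_miss = [0.0, 0.0, 0.0, 0.0, 1.0]   # all mass one step away
far_miss = [1.0, 0.0, 0.0, 0.0, 0.0]    # all mass three steps away

emd_near = emd(true_p, near_miss)
emd_far = emd(true_p, far_miss)
print(emd_near, emd_far)
```

Both wrong predictions would receive the same score under a measure that ignores class order, whereas EMD penalizes the far miss three times as much as the near miss.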

Conclusion and Future Work
We described SemEval-2016 Task 4 "Sentiment Analysis in Twitter", which comprised five subtasks, three of which represent a significant departure from previous editions. The three new subtasks focused, individually or in combination, on two variants of the basic "sentiment classification in Twitter" task that had not been previously explored within SemEval. The first variant adopts a five-point scale, which confers an ordinal character to the classification task. The second variant focuses on the correct estimation of the prevalence of each class of interest, a task which has been called quantification in the supervised learning literature. In contrast, previous years' subtasks have focused on the correct labeling of individual tweets. As in previous years (2013-2015), the 2016 task was very popular and attracted a total of 43 teams.
A general trend that emerges from SemEval-2016 Task 4 is that most teams who were ranked at the top in the various subtasks used deep learning, including convolutional NNs, recurrent NNs, and (general-purpose or task-specific) word embeddings. In many cases, the use of these techniques allowed the teams using them to obtain good scores even without tuning their system to the specifics of the subtask at hand, e.g., even without exploiting the ordinal nature of the subtask (for Subtasks C and E) or the quantification-related nature of the subtask (for Subtasks D and E). Conversely, several teams that have indeed tuned their system to the specifics of the subtask at hand, but have not used deep learning techniques, have performed less satisfactorily. This is a further confirmation of the power of deep learning techniques for tweet sentiment analysis.
Concerning Subtasks D and E, if quantification-based subtasks are proposed again, we think it might be a good idea to generate, for each test topic $t_i$, multiple "artificial" test topics $t_i^1, t_i^2, \ldots$, where class prevalences are altered with respect to the ones of $t_i$ by means of selectively removing from $t_i$ tweets belonging to a certain class. In this way, the evaluation can take into consideration (i) class prevalences in the test set and (ii) levels of distribution drift (i.e., of the divergence of the test distribution from the training distribution) that are not present in the "naturally occurring" data.
By varying the amount of removed tweets at will, one may obtain many test topics, thus augmenting the magnitude of the experimentation at will while at the same time keeping constant the amount of manual annotation needed.
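The generation of such artificial test topics can be sketched as follows. This is only an illustration of the idea; the function name `make_artificial_topic`, its parameters, and the toy topic are hypothetical.

```python
import random

def make_artificial_topic(tweets, target_class, keep_fraction, rng):
    """Return a copy of a topic in which only a share of one class is kept.

    tweets: list of (text, label) pairs belonging to one test topic.
    keep_fraction: share of the target-class tweets to keep, between 0 and 1.
    """
    targets = [i for i, (_, label) in enumerate(tweets) if label == target_class]
    n_keep = round(keep_fraction * len(targets))
    keep = set(rng.sample(targets, n_keep))
    return [tw for i, tw in enumerate(tweets)
            if tw[1] != target_class or i in keep]

def prevalence_of(tweets, label):
    """Share of tweets in the topic that carry the given label."""
    return sum(1 for _, lab in tweets if lab == label) / len(tweets)

# Made-up test topic: 60 POSITIVE and 40 NEGATIVE tweets.
topic = [("pos tweet %d" % i, "POSITIVE") for i in range(60)]
topic += [("neg tweet %d" % i, "NEGATIVE") for i in range(40)]

rng = random.Random(0)  # fixed seed so that the sampling is reproducible
variants = [make_artificial_topic(topic, "POSITIVE", f, rng)
            for f in (0.75, 0.5, 0.25)]

for v in variants:
    print(len(v), round(prevalence_of(v, "POSITIVE"), 3))
```

Each variant reuses the existing annotations, so the number of test conditions grows without any additional labeling effort.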
In terms of possible follow-ups of this task, it might be interesting to have a subtask whose goal is to distinguish tweets that are NEUTRAL about the topic (i.e., do not express any opinion about the topic) from tweets that express a FAIR opinion (i.e., lukewarm, intermediate between POSITIVE and NEGATIVE) about the topic.
Another possibility is to have a multi-lingual tweet sentiment classification subtask, where training examples are provided for the same topic for two languages (e.g., English and Arabic), and where participants can improve their performance on one language by leveraging the training examples for the other language via transfer learning. Alternatively, it might be interesting to include a cross-lingual tweet sentiment classification subtask, where training examples are provided for a given language (e.g., English) but not for the other (e.g., Arabic); the second language could also be a surprise language, which could be announced at the last moment.
Table 15: Participating teams (Column 2), their affiliation (Column 3) and nationality (Column 4), the subtasks they have participated in (Column 1), and the paper they have contributed (Column 5). Teams whose "Affiliation" column is typeset on more than one row include researchers with different affiliations. Teams marked with a (**) include some of the SemEval 2016 Task 4 organizers. An empty entry for the "Paper" column indicates that the team has not contributed a system description paper.