SemEval-2017 Task 4: Sentiment Analysis in Twitter

This paper describes the fifth year of the Sentiment Analysis in Twitter task. SemEval-2017 Task 4 continues with a rerun of the subtasks of SemEval-2016 Task 4, which include identifying the overall sentiment of the tweet, sentiment towards a topic with classification on a two-point and on a five-point ordinal scale, and quantification of the distribution of sentiment towards a topic across a number of tweets: again on a two-point and on a five-point ordinal scale. Compared to 2016, we made two changes: (i) we introduced a new language, Arabic, for all subtasks, and (ii) we made available information from the profiles of the Twitter users who posted the target tweets. The task continues to be very popular, with a total of 48 teams participating this year.


Introduction
The identification of sentiment in text is an important field of study, with social media platforms such as Twitter garnering the interest of researchers in language processing as well as in the political and social sciences. The task usually involves detecting whether a piece of text expresses a POSITIVE, a NEGATIVE, or a NEUTRAL sentiment; the sentiment can be general or about a specific topic, e.g., a person, a product, or an event.
SemEval is the International Workshop on Semantic Evaluation, formerly SensEval. It is an ongoing series of evaluations of computational semantic analysis systems, organized under the umbrella of SIGLEX, the Special Interest Group on the Lexicon of the Association for Computational Linguistics. Other related tasks at SemEval have explored sentiment analysis of product reviews and their aspects (Pontiki et al., 2014, 2015, 2016), sentiment analysis of figurative language on Twitter (Ghosh et al., 2015), implicit event polarity (Russo et al., 2015), detecting stance in tweets (Mohammad et al., 2016a), out-of-context sentiment intensity of words and phrases (Kiritchenko et al., 2016), and emotion detection (Strapparava and Mihalcea, 2007). Some of these tasks featured languages other than English, such as Arabic (Pontiki et al., 2016; Mohammad et al., 2016a); however, they did not target tweets, nor did they focus on sentiment towards a topic.
This year, we performed a re-run of the subtasks of SemEval-2016 Task 4, which, in addition to the overall sentiment of a tweet, featured classification, ordinal regression, and quantification with respect to a topic. Furthermore, we introduced a new language, Arabic. Finally, we made available to the participants demographic information about the users who posted the tweets, which we extracted from their public profiles.
Ordinal Classification As last year, SemEval-2017 Task 4 includes sentiment analysis on a five-point scale {HIGHLYPOSITIVE, POSITIVE, NEUTRAL, NEGATIVE, HIGHLYNEGATIVE}, which is in line with product ratings in the corporate world, e.g., on Amazon, TripAdvisor, and Yelp. In machine learning terms, moving from a categorical two-point scale to an ordered five-point scale means moving from binary to ordinal classification (aka ordinal regression).
Tweet Quantification SemEval-2017 Task 4 includes tweet quantification tasks along with tweet classification tasks, also on 2-point and 5-point scales. While the tweet classification task is concerned with whether a specific tweet expresses a given sentiment towards a topic, the tweet quantification task is about estimating the distribution of tweets about a given topic across the different sentiment classes. Most (if not all) tweet sentiment classification studies within political science (Borge-Holthoefer et al., 2015; Kaya et al., 2013; Marchetti-Bowick and Chambers, 2012), economics (Bollen et al., 2011; O'Connor et al., 2010), social science (Dodds et al., 2011), and market research (Burton and Soboleva, 2011; Qureshi et al., 2013) study Twitter with an interest in aggregate statistics about sentiment, and are not interested in the sentiment expressed in individual tweets. We should also note that quantification is not a mere byproduct of classification, as it can be addressed using different approaches and it also needs different evaluation measures (Forman, 2008; Esuli and Sebastiani, 2015).
Analysis in Arabic This year, we added a new language, Arabic, in order to encourage participants to experiment with multilingual and cross-lingual approaches for sentiment analysis. Our objective was to expand the Twitter sentiment analysis resources available to the research community, not only for general multilingual sentiment analysis, but also for multilingual sentiment analysis towards a topic, which is still a largely unexplored research direction for many languages, in particular for morphologically complex languages such as Arabic.
Arabic has become an emergent language for sentiment analysis, especially as more resources and tools for it have recently become available. It is also both interesting and challenging due to its rich morphology and the abundance of dialectal use on Twitter. Early Arabic studies focused on sentiment analysis in newswire (Abdul-Mageed and Diab, 2011; Elarnaoty et al., 2012), but recently there has been a lot more work on social media, especially Twitter (Mourad and Darwish, 2013; Abdul-Mageed et al., 2014; Refaee and Rieser, 2014; Salameh et al., 2015), where the challenges of sentiment analysis are compounded by the presence of multiple dialects and orthographical variants, which are frequently used in conjunction with the formal written language. Some work has studied the utility of machine translation for sentiment analysis of Arabic texts (Salameh et al., 2015; Mohammad et al., 2016b; Refaee and Rieser, 2015), the identification of sentiment holders (Elarnaoty et al., 2012), and sentiment targets (Al-Smadi et al., 2015; Farra et al., 2015; Farra and McKeown, 2017). We believe that the development of a standard Arabic Twitter dataset for sentiment, particularly with respect to topics, will encourage further research in this regard.
User Information Demographic information in Twitter has been studied and analyzed using network analysis and natural language processing (NLP) techniques (Mislove et al., 2011; Nguyen et al., 2013; Rosenthal and McKeown, 2016). Recent work has shown that user information and information from the network can help sentiment analysis in other corpora (Hovy, 2015) and in Twitter (Volkova et al., 2013; Yang and Eisenstein, 2015). Thus, this year we encouraged participants to use information from the public profiles of Twitter users, such as demographics (e.g., age, location), as well as information from the rest of the social network (e.g., the sentiment of the tweets of friends), with the goal of analyzing the impact of this information on improving sentiment analysis.
The rest of this paper is organized as follows. Section 2 presents in more detail the five subtasks of SemEval-2017 Task 4. Section 3 describes the English and the Arabic datasets and how we created them. Section 4 introduces and motivates the evaluation measures for each subtask. Section 5 presents the results of the evaluation and discusses the techniques and tools that the participants used. Finally, Section 6 concludes and points to some possible directions for future work.

Task Definition
SemEval-2017 Task 4 consists of five subtasks, each offered for both Arabic and English:

1. Subtask A: Given a tweet, decide whether it expresses POSITIVE, NEGATIVE, or NEUTRAL sentiment.

2. Subtask B: Given a tweet and a topic, classify the sentiment conveyed towards that topic on a two-point scale: POSITIVE vs. NEGATIVE.

3. Subtask C: Given a tweet and a topic, classify the sentiment conveyed towards that topic on a five-point scale: HIGHLYPOSITIVE, POSITIVE, NEUTRAL, NEGATIVE, HIGHLYNEGATIVE.

4. Subtask D: Given a set of tweets about a topic, estimate the distribution of the tweets across the POSITIVE and NEGATIVE classes.

5. Subtask E: Given a set of tweets about a topic, estimate the distribution of the tweets across the five classes of the five-point scale.

Subtask A has been run in all previous editions of the task and continues to be the most popular one (see Section 5). Subtasks B-E were all run at SemEval-2016 Task 4 (Nakov et al., 2016a), with variants running in 2015 (Rosenthal et al., 2015). Table 1 shows a summary of the subtasks.

Datasets
Our datasets consist of tweets annotated for sentiment on 2-point, 3-point, and 5-point scales. We made available to participants all the data from previous years (Nakov et al., 2016a) for the English training sets, and we collected new training data for Arabic, as well as new test sets for both English and Arabic. The annotation scheme remained the same as last year (Nakov et al., 2016a), with the key new contributions being the application of the task and instructions to Arabic and the provision of a script to download basic user information. All annotations were performed on CrowdFlower. Note that we release all our datasets to the research community to be used freely beyond SemEval.

Tweet Collection
We chose English and Arabic topics based on popular current events that were trending on Twitter, both internationally and in specific Arabic-speaking countries, using local and global Twitter trends. The topics included a range of named entities (e.g., Donald Trump, iPhone), geopolitical entities (e.g., Aleppo, Palestine), and other entities (e.g., Syrian refugees, Dakota Access Pipeline, Western media, gun control, and vegetarianism). We then used the Twitter API to download tweets containing mentions of these topics in the specified language, along with the corresponding user information. We intentionally chose some overlapping topics between the two languages in order to encourage cross-language approaches.
We automatically filtered the tweets for duplicates, removing those for which the bag-of-words cosine similarity exceeded 0.6. We then retained only the topics for which at least 100 tweets remained. The training tweets for Arabic were collected over the period September-November 2016, and all test tweets were collected over the period December 2016-January 2017.
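The duplicate-filtering step can be sketched as follows. The paper does not specify the tokenization or the comparison strategy, so whitespace tokenization and a greedy single pass over the collection are our assumptions:

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def filter_near_duplicates(tweets, threshold=0.6):
    """Greedily keep each tweet only if its bag-of-words cosine
    similarity to every previously kept tweet is at most threshold."""
    kept, kept_bags = [], []
    for text in tweets:
        bag = Counter(text.lower().split())
        if all(cosine(bag, kb) <= threshold for kb in kept_bags):
            kept.append(text)
            kept_bags.append(bag)
    return kept
```

For example, two retweet-like variants of the same message exceed the 0.6 threshold and only the first survives, while an unrelated tweet is kept.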
For both English and Arabic, the topics for the test dataset were different from those in the training and in the development datasets.

Annotation using CrowdFlower
We used CrowdFlower to annotate the new training and testing tweets. The annotators were asked to indicate the overall polarity of the tweet (on a five-point scale) as well as the polarity of the tweet towards the given target topic (again, on a five-point scale), as described in (Nakov et al., 2016a). We also provided additional examples, some of which are shown in Tables 2 and 3. In particular, we stressed that topic-level positive or negative sentiment needed to express an opinion about the topic itself rather than about a positive or a negative event occurring in the context of the topic (see, for example, the third row of Table 3).
Each tweet was annotated by at least five people, and we created many hidden tests for quality control, which we used to reject annotations by contributors who missed a large number of the hidden tests. We also created pilot runs, which helped us adjust the annotation instructions until we found, based on manual inspection, the quality of the annotated tweets to be satisfactory. For Arabic, the contributors tended to annotate somewhat conservatively, and thus a very small number of HIGHLYPOSITIVE and HIGHLYNEGATIVE annotations were consolidated, despite us having provided examples of such annotations.

Consolidating the Annotations
As the annotations are on a five-point scale, where the expected agreement is lower, we used a two-step procedure. If three out of the five annotators agreed on a label, we accepted the label. Otherwise, we first mapped the categorical labels to the integer values −2, −1, 0, 1, 2; we then calculated the average, and finally we mapped that average to the closest integer value. In order to counter-balance the tendency of the average to stay away from the extreme values −2 and 2, and also to prefer 0, we did not use rounding at ±0.5 and ±1.5, but at ±0.4 and ±1.4 instead. Finally, note that the values −2, −1, 0, 1, 2 are to be interpreted as STRONGLYNEGATIVE, WEAKLYNEGATIVE, NEUTRAL, WEAKLYPOSITIVE, and STRONGLYPOSITIVE, respectively.
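A minimal sketch of this two-step consolidation. Note one assumption: the paper does not say how a value falling exactly on a cut-point (e.g., an average of exactly 0.4) is treated, so here it rounds outward:

```python
def consolidate(labels):
    """Consolidate five annotations on the {-2,-1,0,1,2} scale.
    Step 1: if at least three annotators agree, accept that label.
    Step 2: otherwise average the labels and round with cut-points
    shifted to +/-0.4 and +/-1.4 (instead of +/-0.5 and +/-1.5)
    to counter the average's pull toward 0."""
    for v in set(labels):
        if labels.count(v) >= 3:
            return v
    avg = sum(labels) / len(labels)
    sign = 1 if avg >= 0 else -1
    mag = abs(avg)
    if mag < 0.4:       # inside (-0.4, 0.4) -> NEUTRAL
        return 0
    if mag < 1.4:       # inside [0.4, 1.4) -> weak label
        return sign
    return sign * 2     # otherwise -> strong label
```

For instance, [2, 1, 0, −1, 1] has no majority label; its average 0.6 falls in [0.4, 1.4) and consolidates to 1.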

Data Statistics
The English training and development data this year consisted of the data from all previous editions of this task (Nakov et al., 2013; Rosenthal et al., 2014, 2015; Nakov et al., 2016b). Unlike in previous years, we did not set aside data to assess progress compared to prior years; therefore, we allowed all data to be used for training and development.
For evaluation, we used the newly-created data described in the previous subsection. Tables 4 and 5 show the statistics for the English and Arabic data.
For English, we only show the aggregate statistics for the training data; the breakdown from prior years can be found in (Nakov et al., 2016a). Note that the same tweets were annotated for multiple subtasks, so there is overlap between the tweets across the tasks. Duplicates may have occurred where the same tweet was extracted for multiple topics.
As Arabic is a new language this year, we created for it a default train-development split of the Arabic data for the participants to use if they wished to do so.

Data Distribution
As in previous years, we provided the participants with a script to download the training tweets given their IDs. In addition, this year we also included in the script the option to download basic user information for the author of each tweet: user id, follower count, status count, description, friend count, location, language, name, and time zone. To ensure a fair evaluation, the test set was provided via download and included the tweets as well as the basic user information provided by the download script. The training and the test data are available for download on our task page.

Evaluation Measures
This section describes the evaluation measures for our five subtasks. Note that for Subtasks B to E, the datasets are each subdivided into a number of topics, and the subtask needs to be carried out independently for each topic. As a result, each of the evaluation measures is "macro-averaged" across the topics, i.e., we compute the measure individually for each topic, and we then average the results across the topics.

Subtask A: Overall Sentiment of a Tweet
Our primary measure is AvgRec, or average recall, which is recall averaged across the POSITIVE (P), NEGATIVE (N), and NEUTRAL (U) classes. This measure has desirable theoretical properties (Sebastiani, 2015), and is also the one we use as primary for Subtask B. It is computed as follows:

AvgRec = (R_P + R_N + R_U) / 3

where R_P, R_N, and R_U refer to recall with respect to the POSITIVE, the NEGATIVE, and the NEUTRAL class, respectively. See (Nakov et al., 2016a) for more detail.
AvgRec ranges in [0, 1], where a value of 1 is achieved only by the perfect classifier (i.e., the classifier that correctly classifies all items), a value of 0 is achieved only by the perverse classifier (the classifier that misclassifies all items), while 0.3333 is both (i) the value for a trivial classifier (i.e., one that assigns all tweets to the same class, be it POSITIVE, NEGATIVE, or NEUTRAL), and (ii) the expected value of a random classifier.
The advantage of AvgRec over "standard" accuracy is that it is more robust to class imbalance. The accuracy of the majority-class classifier is the relative frequency (aka "prevalence") of the majority class, which may be much higher than 0.5 if the test set is imbalanced. Standard F1 is also sensitive to class imbalance for the same reason. Another advantage of AvgRec over F1 is that AvgRec is invariant with respect to switching POSITIVE with NEGATIVE, while F1 is not. See (Sebastiani, 2015) for more detail on AvgRec.
We further use two secondary measures: accuracy and F1^PN. The latter was the primary evaluation measure for Subtask A in previous editions of the task; this year, we demoted it to a secondary evaluation measure. F1^PN is macro-averaged F1, calculated over the POSITIVE and the NEGATIVE classes (note the exclusion of NEUTRAL):

F1^PN = (F1^P + F1^N) / 2

where F1^P and F1^N refer to F1 with respect to the POSITIVE and the NEGATIVE class, respectively.
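Both measures are straightforward to compute from gold and predicted labels; a minimal sketch:

```python
def recall(gold, pred, cls):
    """Fraction of gold items of class cls that were labeled correctly."""
    idx = [i for i, g in enumerate(gold) if g == cls]
    return sum(pred[i] == cls for i in idx) / len(idx) if idx else 0.0

def precision(gold, pred, cls):
    """Fraction of items predicted as cls that really are cls."""
    idx = [i for i, p in enumerate(pred) if p == cls]
    return sum(gold[i] == cls for i in idx) / len(idx) if idx else 0.0

def avg_rec(gold, pred, classes=("positive", "negative", "neutral")):
    """AvgRec: recall averaged over the classes."""
    return sum(recall(gold, pred, c) for c in classes) / len(classes)

def f1_pn(gold, pred):
    """F1 macro-averaged over POSITIVE and NEGATIVE only."""
    f1s = []
    for c in ("positive", "negative"):
        p, r = precision(gold, pred, c), recall(gold, pred, c)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(f1s) / 2
```

Note that a trivial classifier assigning everything to one class scores exactly 1/3 on AvgRec regardless of how imbalanced the test set is, which is the robustness property discussed above.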

Subtask B: Topic-Based Classification on a 2-point Scale
As in 2016, our primary evaluation measure for Subtask B is average recall, or AvgRec, computed over the two classes of this subtask:

AvgRec = (R_P + R_N) / 2

We further use accuracy and F1 as secondary measures for Subtask B. Finally, as Subtask B is topic-based, we computed each metric individually for each topic, and we then averaged the result across the topics to yield the final score.

Subtask C: Topic-based Classification on a 5-point Scale
Subtask C is an ordinal classification (also known as ordinal regression) task, in which each tweet must be classified into exactly one of the classes in C = {HIGHLYPOSITIVE, POSITIVE, NEUTRAL, NEGATIVE, HIGHLYNEGATIVE}, represented in our dataset by the numbers in {+2, +1, 0, −1, −2}, with a total order defined on C.
We adopt an evaluation measure that takes the order of the five classes into account. For instance, misclassifying a HIGHLYNEGATIVE example as HIGHLYPOSITIVE is a bigger mistake than misclassifying it as NEGATIVE or as NEUTRAL.
As in SemEval-2016 Task 4, we use macro-averaged mean absolute error (MAE^M) as the main ordinal classification measure:

MAE^M(h, Te) = (1/|C|) Σ_{j=1..|C|} (1/|Te_j|) Σ_{x_i ∈ Te_j} |h(x_i) − y_i|

where y_i denotes the true label of item x_i, h(x_i) is its predicted label, Te_j denotes the set of test documents whose true class is c_j, |h(x_i) − y_i| denotes the "distance" between classes h(x_i) and y_i (e.g., the distance between HIGHLYPOSITIVE and NEGATIVE is 3), and the "M" superscript indicates "macro-averaging".
The advantage of MAE^M over "standard" mean absolute error, which is defined as

MAE^μ(h, Te) = (1/|Te|) Σ_{x_i ∈ Te} |h(x_i) − y_i|

is that it is robust to class imbalance (which is useful, given the imbalanced nature of our dataset). On perfectly balanced datasets, MAE^M and MAE^μ are equivalent.
MAE^M is an extension of macro-averaged recall for ordinal regression; yet, it is a measure of error, and thus lower values are better. We also use MAE^μ as a secondary measure, in order to provide better consistency with Subtasks A and B. These measures are computed for each topic, and the results are then averaged across all topics to yield the final score. See (Baccianella et al., 2009) for more detail about MAE^M and MAE^μ.
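A minimal sketch of the two measures, illustrating why the macro-averaged variant is the more demanding one on imbalanced data:

```python
def mae_macro(gold, pred, classes=(-2, -1, 0, 1, 2)):
    """Macro-averaged MAE: mean absolute error computed per true class,
    then averaged over the classes that occur in the gold standard."""
    per_class = []
    for c in classes:
        idx = [i for i, g in enumerate(gold) if g == c]
        if idx:
            per_class.append(sum(abs(pred[i] - c) for i in idx) / len(idx))
    return sum(per_class) / len(per_class)

def mae_micro(gold, pred):
    """Standard (micro-averaged) mean absolute error."""
    return sum(abs(p - g) for g, p in zip(gold, pred)) / len(gold)
```

With gold labels [2, 2, 2, −2] and predictions [2, 2, 2, 0], the micro-averaged MAE is a forgiving 0.5, while the macro-averaged MAE is 1.0: the single error on the rare class dominates its per-class average.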

Subtask D: Tweet Quantification on a 2-point Scale
Subtask D assumes a binary quantification setup, in which each tweet is classified as POSITIVE or NEGATIVE, and the distribution across the two classes must be estimated. The difference with binary classification is that errors of different polarity (e.g., a false positive and a false negative for the same class) can compensate for each other in quantification. Quantification is thus a more lenient task than classification: a perfect classifier is also a perfect quantifier, but a perfect quantifier is not necessarily a perfect classifier.
For evaluating binary quantification, we keep the Kullback-Leibler Divergence (KLD) measure used in 2016, along with additive smoothing (Nakov et al., 2016a; Forman, 2005). KLD was proposed as a quantification measure in (Forman, 2005), and is defined as follows:

KLD(p̂, p, C) = Σ_{c_j ∈ C} p(c_j) log( p(c_j) / p̂(c_j) )

KLD is a measure of the error made in estimating a true distribution p over a set C of classes by means of a predicted distribution p̂. Like MAE^M, KLD is a measure of error, which means that lower values are better. KLD ranges between 0 (best) and +∞ (worst).
Note that the upper bound of KLD is not finite, since the definition above has predicted prevalences, and not true prevalences, in the denominator: by making a predicted prevalence p̂(c_j) infinitely small, we can make KLD infinitely large. To solve this problem, in computing KLD we smooth both p(c_j) and p̂(c_j) via additive smoothing, i.e.,

p_s(c_j) = (ε + p(c_j)) / (ε·|C| + Σ_{c_j ∈ C} p(c_j))

where p_s(c_j) denotes the smoothed version of p(c_j) and the denominator is just a normalizer (the same holds for the p̂_s(c_j)'s); the quantity ε = 1 / (2·|Te|) is used as a smoothing factor, where Te denotes the test dataset.
The smoothed versions of p(c_j) and p̂(c_j) are used in place of their original versions; as a result, KLD is always defined and still returns a value of 0 when p̂ and p coincide.
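The smoothed KLD computation follows directly from these definitions; a minimal sketch:

```python
import math

def smooth(prevalences, n_test):
    """Additive smoothing with eps = 1 / (2 * |Te|)."""
    eps = 1.0 / (2 * n_test)
    denom = eps * len(prevalences) + sum(prevalences)
    return [(eps + p) / denom for p in prevalences]

def kld(true_prev, pred_prev, n_test):
    """Smoothed Kullback-Leibler Divergence between the true and the
    predicted class distributions (lower is better, 0 = perfect)."""
    p = smooth(true_prev, n_test)
    q = smooth(pred_prev, n_test)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Because both distributions are smoothed by the same ε, the measure stays 0 when the predicted and true prevalences coincide, and remains finite even when a predicted prevalence is 0.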
We further use two secondary error-based evaluation measures: absolute error (AE) and relative absolute error (RAE).
Again, the measures are computed individually for each topic, and the results are averaged across the topics to yield the final score.

Subtask E: Tweet Quantification on a 5-point Scale
Subtask E is an ordinal quantification task. As in binary quantification, the goal is to estimate the distribution across the classes, this time on a five-point scale.
Here, each tweet belongs to exactly one of the classes in C = {HIGHLYPOSITIVE, POSITIVE, NEUTRAL, NEGATIVE, HIGHLYNEGATIVE}, where there is a total order on C. As in binary quantification, the task is to compute an estimate p̂(c_j) of the relative frequency p(c_j) of each class c_j ∈ C in the test tweets.
The measure we adopt for ordinal quantification is the Earth Mover's Distance (Rubner et al., 2000), also known as the Vaserstein metric (Rüschendorf, 2001), a measure well known in the field of computer vision. EMD is currently the only known measure for ordinal quantification. It is defined for the general case in which a distance d(c′, c′′) is defined for each c′, c′′ ∈ C. When there is a total order on the classes in C and d(c_i, c_{i+1}) = 1 for all i ∈ {1, ..., |C| − 1}, the Earth Mover's Distance is defined as

EMD(p̂, p) = Σ_{j=1..|C|−1} | Σ_{i=1..j} p̂(c_i) − Σ_{i=1..j} p(c_i) |

and can be computed in |C| steps from the estimated and true class prevalences.
Like KLD, EMD is a measure of error, so lower values are better; EMD ranges between 0 (best) and |C| − 1 (worst). See (Esuli and Sebastiani, 2010) for more detail on EMD.
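Since adjacent classes are at unit distance, EMD reduces to a sum of absolute differences of cumulative prevalences; a minimal sketch:

```python
def emd(true_prev, pred_prev):
    """Earth Mover's Distance for ordinal quantification: sum, over the
    class prefixes, of |cumulative predicted - cumulative true| prevalence,
    assuming unit distance between adjacent (ordered) classes."""
    d, cum_true, cum_pred = 0.0, 0.0, 0.0
    # the last prefix always sums to 1 on both sides, so it is skipped
    for t, p in zip(true_prev[:-1], pred_prev[:-1]):
        cum_true += t
        cum_pred += p
        d += abs(cum_pred - cum_true)
    return d
```

The worst case, putting all predicted mass at one extreme when all true mass is at the other, yields exactly |C| − 1 (4 for the five-point scale).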
As before, EMD is computed individually for each topic, and the results are then averaged across all topics to yield the final score; see also last year's task description paper (Nakov et al., 2016a).

Participants and Results
A total of 48 teams participated in SemEval-2017 Task 4 this year. As in previous years, the most popular subtask was Subtask A, with 38 teams participating in the English version and 8 teams in the Arabic version. Overall, 46 teams participated in some English subtask and 9 teams in some Arabic subtask. There were 28 teams that participated in a subtask other than Subtask A. Moreover, two teams (OMAM and ELiRF-UPV) participated in all English and all Arabic subtasks. There were 9 teams that participated in the topic-based versions of the subtasks but not in Subtask A, reflecting a growing interest among researchers in developing systems for topic-specific analysis.

Common Resources and Methods
In terms of methods, the use of deep learning stands out in particular, and we also see an increase over last year: at least 20 teams used deep learning and neural network methods such as CNNs and LSTM networks. Supervised SVM and Liblinear were also very popular, with several participants combining SVM with neural network methods or with dense word embedding features. Other teams used classifiers such as Maximum Entropy, Logistic Regression, Random Forest, Naïve Bayes, and Conditional Random Fields.
Common software included Python (with the sklearn and numpy libraries), Java, TensorFlow, Weka, NLTK, Keras, Theano, and Stanford CoreNLP. The most commonly used external resources were the Sentiment140 lexicon and pre-trained word2vec embeddings. Many teams further gathered additional tweets using the Twitter API that were not annotated for sentiment; these were used for distant supervision, lexicon building, and word vector training.
In the following subsections, we present the results and the rankings for each subtask, and we highlight the best-performing systems. In each column, the rankings according to the corresponding measure are indicated with a subscript; Bx indicates a baseline.

Results for Subtask A: Overall Sentiment in a Tweet
Tables 6 and 7 show the results for Subtask A in English and Arabic, respectively, where the teams are ranked by macro-average recall.
For English, the best-ranking teams were BB twtr and DataStories, both achieving a macro-averaged recall of 0.681. Both top teams used deep learning; BB twtr used an ensemble of LSTMs. Both teams participated in all English subtasks and were also ranked first (BB twtr) and second (DataStories) for subtasks B-D; BB twtr was also ranked first for subtask E.
The top 5 teams for English were very closely scored. The following four best-ranked teams all used deep learning or deep learning ensembles. Three of the top-10 scoring teams (INGEOTEC, SiTAKA, and UCSC-NLP) used SVM classifiers instead, with various surface, lexical, semantic, and dense word embedding features. The use of ensembles clearly stood out, with five of the top-10 scoring systems (BB twtr, LIA, NNEMBs, Tweester, and INGEOTEC) using ensembles, hybrids, stacking, or some other mix of learning methods. All teams beat the baseline on macro-averaged recall; however, a few teams did not beat the harsher average F-measure and accuracy baselines.
For Arabic, the best-ranked team was NileTMRG, with a score of 0.583. They used a Naïve Bayes classifier with a combination of lexical and sentiment features; they further augmented the training dataset to about 13K examples using external tweets. The SiTAKA team was ranked second with a score of 0.55; their system used a feature-rich SVM with lexical features and embedding representations. Except for ELiRF-UPV, who used multi-layer neural networks (CRNNs), the remaining teams used SVM and Naïve Bayes classifiers, genetic algorithms, or conditional random fields (CRFs). All teams managed to beat all baselines for all metrics. The difference in the absolute scores for the two languages is probably partially due to the difference in the amount of training data, which was much smaller for Arabic compared to English, even when external datasets were taken into account. The results also reflect the linguistic complexity of Arabic as it is used in social media, which is characterized by the abundant use of dialectal forms and spelling variants. Overall, participants preferred to focus on developing Arabic-specific systems (varying in the extent to which they applied Arabic-specific preprocessing) rather than trying to leverage cross-language models that would enable them to use English data to augment their Arabic models.

Results for Subtasks B and C: Topic-Based Classification
The results of Subtasks B and C are shown in Tables 8-11. We can see that the system scores for subtask B are higher than those for subtask A, with the best team achieving 0.882 accuracy for English (compared to 0.681 for subtask A) and 0.768 for Arabic. funSentiment, ranked 6th and 9th for subtasks B and C, respectively, modeled the sentiment towards the topic using the left and the right context around a topic mention in the tweet. WarwickDCS, ranked 8th, used simple tweet-level classification, while ignoring the topic. Overall, almost all teams managed to outperform the majority-class baseline for subtask B, but only two teams outperformed the NEUTRAL-class baseline for subtask C.
For Arabic, four teams participated in Subtask B and two teams in Subtask C. NileTMRG was once again ranked first for Subtask B, with a system based on ensembles of topic-specific and topic-agnostic models. For Subtask C, OMAM also used combinations of such models, applied in succession. All teams easily outperformed the baselines for Subtask B, but only the OMAM team managed to do so for Subtask C.

Results for Subtasks D and E: Tweet Quantification
Tables 12-15 show the results for the tweet quantification subtasks. The bottom of each table reports the result of a baseline system, B1, that assigns a prevalence of 1 to the majority class (the POSITIVE class for subtask D, and the WEAKLYPOSITIVE/NEUTRAL class for subtask E, English/Arabic) and 0 to the other class(es). We further show the results for a smarter "maximum likelihood" baseline, which assigns to each test topic the distribution of the training tweets (the union of TRAIN, DEV, and DEVTEST) across the classes. This is the "smartest" among the trivial policies that attempt to minimize KLD. For this baseline, for English we use for training either (i) the 2016 data only, or (ii) data from both 2015 and 2016; we also experiment with (i) micro-averaging and (ii) macro-averaging over the topics. It turns out that macro-averaging over the 2015+2016 data is the strongest baseline in terms of KLD. For Arabic, we use the train-2017 data, and micro-averaging works better there.

There were 15 teams competing in Subtask D: 15 for English and 3 for Arabic (these 3 teams all participated in English). As in the other subtasks, BB twtr was ranked first in English, achieving an improvement of 0.50 points absolute in KLD over the best baseline, and a 0.01 improvement over the next best team, DataStories. For Arabic, the best team was NileTMRG, with an improvement of 0.17 over the best baseline and of 0.08 over the next best team, OMAM. All but the last two teams in English, and all but the last team for Arabic, outperformed all baselines.
In Subtask E, there were 12 participating teams, with OMAM and ELiRF-UPV competing for both English and Arabic. Once again, BB twtr was the best for English, improving over the best baseline by 0.31 EMD points absolute. Interestingly, this is the first subtask where DataStories was not the second-ranked team: BB twtr outperformed the second-best team, TwiSe, by 0.02 points. For English, all but the last two teams outperformed the baselines. However, for Arabic, neither of the two participating teams could do so.

User Information
This year, we encouraged teams to explore using in their models information about the user who wrote the tweet, which can be extracted from the public profiles of the respective Twitter users. Participants could also try features based on following relations and the structure of the social network in general, as well as make use of other tweets by the target user when analyzing a particular tweet. Four teams tried this: SINAI, ECNU, TakeLab, and OMAM. OMAM and TakeLab did not find any improvements, and ultimately decided not to use any user information. ECNU used profile information such as favorited, favorite count, retweeted, and retweet count; they ended up 15th in Subtask A. SINAI, ranked 12th in Subtask B, generated a user model from the timeline of a given target user. They built a general SVM model on word2vec embeddings; then, for each user in the test set, they downloaded the last 200 tweets published by that user and classified their sentiment using the SVM classifier. If the classification of the user's tweets achieved an accuracy above a threshold (0.7), the user model was applied to that user's tweets in the test set; otherwise, the general SVM model was used.

It is difficult to judge whether and by how much user information could help the best approaches, as they did not try to use such information. However, we believe that building and using a Twitter user profile is a promising research direction, and we would like to encourage more teams to explore using this information in the future. We would also like to provide more user information, such as age and gender, which we can predict automatically (Rosenthal and McKeown, 2016) when it is not directly available from the user profile. Another promising direction is to make use of "conversations" on Twitter, i.e., to take into account the replies to tweets. For example, previous work
(Vanzo et al., 2014) has shown that it is beneficial to model the polarity detection problem as a sequential classification task over streams of tweets, where the stream is a "conversation" on Twitter containing tweets, replies to these tweets, replies to these replies, etc.
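SINAI's per-user model selection described above can be sketched as follows. This is an illustrative reconstruction, not SINAI's actual code: the description does not fully specify which quantity is thresholded, so we assume here that it is the agreement rate between the user-specific model and the general SVM on the user's timeline, and all names (`choose_model`, the model callables) are hypothetical.

```python
# Illustrative sketch of a threshold-based model-selection step, loosely
# following SINAI's description. All names here are hypothetical.

def choose_model(general_model, user_model, timeline_texts, threshold=0.7):
    """Pick which classifier to apply to a user's test tweets.

    Both models are callables mapping a tweet's text to a sentiment label;
    `timeline_texts` stands in for the user's last 200 timeline tweets.
    We assume the thresholded quantity is the agreement rate between the
    user-specific model and the general model on the timeline.
    """
    if not timeline_texts:
        return general_model  # no evidence about the user: fall back
    agreement = sum(
        user_model(t) == general_model(t) for t in timeline_texts
    ) / len(timeline_texts)
    return user_model if agreement >= threshold else general_model
```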

Conclusion and Future Work
Sentiment Analysis in Twitter continues to be a very popular task, attracting 48 teams this year. The task provides immense value to the sentiment analysis community by offering a large, accessible benchmark dataset of over 70,000 tweets across two languages, on which researchers can evaluate their methods and compare them to the state of the art. This year, we introduced a new language for the first time and also encouraged the use of user information. These additions drew new participants and ideas to the task: the Arabic subtasks attracted nine teams, and four teams took advantage of user information. Although these are respectable participation numbers for an inaugural year, further exploration of both areas would be useful in the future, such as collecting more training data for Arabic and encouraging the use of cross-lingual training data. In future editions, we would like to explore additional languages, provide further user information, and add related tasks such as irony and emotion detection. Finally, deep learning continues to be popular and is employed by the state-of-the-art approaches. We expect this trend to continue in sentiment analysis research, but we also look forward to new, innovative ideas.

Table 16: Alphabetical list of the participating teams, their affiliations, countries, the subtasks they participated in, and the system description papers they contributed to SemEval-2017. Teams whose Affiliation column is typeset on more than one row include researchers from different institutions who collaborated to build a joint system submission. An N/A entry in the Paper column indicates that the team did not contribute a system description paper. Finally, the last row gives statistics about the total number of system submissions for each subtask.

Table 1 :
Summary of the subtasks.

Table 2 :
Some English example annotations that we provided to the annotators.

Table 3 :
Some Arabic example annotations that we provided to the annotators.

Table 4 :
Statistics about the English training and testing datasets. The training data is the aggregate of all data from prior years, while the testing data is new.

Table 5 :
Statistics about the newly collected Arabic training and testing datasets.

Table 6 :
Results for Subtask A "Message Polarity Classification", English. The systems are ordered by average recall AvgRec (higher is better).
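For context, AvgRec is recall macro-averaged over the three sentiment classes (POSITIVE, NEGATIVE, NEUTRAL), so that each class counts equally regardless of its frequency. A minimal sketch of the measure (illustrative code, not the official scorer; the label strings are placeholders):

```python
from collections import defaultdict

def avg_rec(gold, pred, labels=("positive", "negative", "neutral")):
    """Recall macro-averaged over the sentiment classes."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p in zip(gold, pred):
        total[g] += 1
        if g == p:
            correct[g] += 1
    # Per-class recall, restricted to classes that occur in the gold data.
    recalls = [correct[c] / total[c] for c in labels if total[c]]
    return sum(recalls) / len(recalls)
```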

Table 7 :
Results for Subtask A "Message Polarity Classification", Arabic.The systems are ordered by average recall AvgRec (higher is better).

Table 8 :
Results for Subtask B "Tweet classification according to a two-point scale", English.

Table 9 :
Results for Subtask B "Tweet classification according to a two-point scale", Arabic.The systems are ordered by average recall AvgRec (higher is better).Bx indicates a baseline.

Table 10 :
Results for Subtask C "Tweet classification according to a five-point scale", English. The systems are ordered by their MAE^M score (lower is better). Bx indicates a baseline.
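MAE^M is the mean absolute error between predicted and gold ordinal labels, macro-averaged over the classes so that infrequent classes weigh as much as frequent ones. A minimal sketch, assuming integer labels for the five-point scale (e.g., -2..2); illustrative code, not the official scorer:

```python
def macro_mae(gold, pred):
    """Mean absolute error, macro-averaged over the gold classes.

    `gold` and `pred` are parallel sequences of integer ordinal labels.
    """
    classes = sorted(set(gold))
    per_class = []
    for c in classes:
        # Absolute errors on the examples whose gold label is c.
        errs = [abs(p - g) for g, p in zip(gold, pred) if g == c]
        per_class.append(sum(errs) / len(errs))
    return sum(per_class) / len(per_class)
```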

Table 11 :
Results for Subtask C "Tweet classification according to a five-point scale", Arabic.The systems are ordered by their M AE M score (lower is better).Bx indicates a baseline.

Table 12 :
Results for Subtask D "Tweet quantification according to a two-point scale", English.The systems are ordered by their KLD score (lower is better).Bx indicates a baseline.
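KLD is the Kullback-Leibler divergence of the predicted class prevalences from the true ones. The official scorer smooths both distributions to keep the measure finite; the sketch below instead floors the predicted prevalence at a small epsilon, which is a simplifying assumption for illustration:

```python
import math

def kld(true_prev, pred_prev, eps=1e-9):
    """KL divergence of predicted prevalences from true prevalences.

    Both arguments are sequences of class proportions summing to 1.
    Zero predicted prevalences are floored at `eps` (the official
    scorer uses additive smoothing instead).
    """
    return sum(
        p * math.log(p / max(q, eps))
        for p, q in zip(true_prev, pred_prev)
        if p > 0
    )
```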

Table 14 :
Results for Subtask E "Tweet quantification according to a five-point scale", English. The systems are ordered by their EMD score (lower is better). Bx indicates a baseline.
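EMD is the Earth Mover's Distance between the predicted and true prevalence distributions. Over ordered classes with unit inter-class distance, it reduces to the sum of absolute differences between the two cumulative distributions. A minimal sketch of that cumulative form (illustrative, not the official scorer):

```python
def emd(true_prev, pred_prev):
    """Earth Mover's Distance between two prevalence distributions
    over ordered classes, assuming unit inter-class distance."""
    dist, cum_t, cum_p = 0.0, 0.0, 0.0
    # The last cumulative value is 1 for both distributions, so the
    # sum runs over the first |C| - 1 classes only.
    for t, p in zip(true_prev[:-1], pred_prev[:-1]):
        cum_t += t
        cum_p += p
        dist += abs(cum_t - cum_p)
    return dist
```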

Table 15 :
Results for Subtask E "Tweet quantification according to a five-point scale", Arabic. The systems are ordered by their EMD score (lower is better). Bx indicates a baseline.