aueb.twitter.sentiment at SemEval-2016 Task 4: A Weighted Ensemble of SVMs for Twitter Sentiment Analysis

This paper describes the system with which we participated in SemEval-2016 Task 4 (Sentiment Analysis in Twitter) and specifically the Message Polarity Classification subtask. Our system is a weighted ensemble of two systems. The first one is based on a previous sentiment analysis system and uses manually crafted features. The second system of our ensemble uses features based on word embeddings. Our ensemble was ranked 5th among 34 teams. The source code of our system is publicly available.


Introduction
This paper describes the system with which we participated in SemEval-2016 Task 4 (Sentiment Analysis in Twitter) and specifically the Message Polarity Classification subtask (Nakov et al., 2016). In this subtask, each tweet is classified as expressing a positive, negative, or no opinion (neutral). Our system is a weighted ensemble of two systems. The first one is based on a previous sentiment analysis system (Karampatsis et al., 2014) and uses manually crafted features. The second system of our ensemble uses features based on word embeddings (Mikolov et al., 2013; Pennington et al., 2014). Our ensemble was ranked 5th among 34 teams.
Section 2 discusses the datasets we used to train and tune our ensemble. Sections 3 and 4 describe our ensemble and its performance, respectively. Finally, Section 5 concludes and discusses future work. The organisers also provided 6,908 tweets from old SemEval data, to allow system evaluation during development. These data could not be used directly for training or tuning and were the following:

Figure 1: Ensemble of two sentiment polarity classifiers, SP1 and SP2, which are influenced by two subjectivity detection classifiers, SD1 and SD2, respectively.

System Overview
The main objective of SemEval-2016 Task 4 is to detect sentiment polarity, i.e., to identify whether a message (tweet) expresses positive, negative, or no sentiment at all. We used a weighted ensemble of two sentiment polarity classifiers, namely SP1 and SP2 (Figure 1), each influenced by a subjectivity detection classifier, SD1 and SD2, respectively. A correlation analysis between the confidence scores of SP1 and SP2 (C_SP1 and C_SP2, respectively) revealed that the two systems make different mistakes, which motivated combining them in an ensemble. Given a message and the confidence scores of the two systems (i.e., C_SP1 and C_SP2), the ensemble computes a new confidence score for every sentiment label (C_pos, C_neg, and C_neu) as follows:

C_l = w_l · C_SP1(l) + (1 − w_l) · C_SP2(l),  for l ∈ {pos, neg, neu}

where w_pos, w_neg, w_neu are weights tuned on the development data. The sentiment with the highest confidence score is assigned to each tweet.[1] Below, we describe the two sentiment polarity classifiers, along with the two subjectivity detection classifiers that influence them.
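The weighted combination described above can be sketched as follows. This is a minimal illustration, not the authors' code; the function names and the example confidence values are ours, while the weights use the tuned value of 0.66 reported in the paper.

```python
def ensemble_confidences(c_sp1, c_sp2, weights):
    """Combine per-label confidence scores of SP1 and SP2.

    c_sp1, c_sp2: dicts mapping label -> confidence score.
    weights: dict mapping label -> ensemble weight w_label.
    """
    return {label: weights[label] * c_sp1[label]
                   + (1.0 - weights[label]) * c_sp2[label]
            for label in c_sp1}

def classify(c_sp1, c_sp2, weights):
    # The label with the highest combined confidence is assigned to the tweet.
    combined = ensemble_confidences(c_sp1, c_sp2, weights)
    return max(combined, key=combined.get)

# Tuned weights from the paper (0.66 for all three labels); the per-system
# confidence scores below are made-up example values.
weights = {"positive": 0.66, "negative": 0.66, "neutral": 0.66}
c_sp1 = {"positive": 0.7, "negative": 0.1, "neutral": 0.2}
c_sp2 = {"positive": 0.4, "negative": 0.5, "neutral": 0.1}
print(classify(c_sp1, c_sp2, weights))  # -> positive
```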

SP1 and SD1
First, each message is preprocessed by a Twitter-specific tokeniser and part-of-speech (POS) tagger (Owoputi et al., 2013) to obtain the tokens and the corresponding POS tags, which are necessary for some features.[2] Then, we extract features, which can be categorized as follows:[3]

• features based on morphology,
• POS based features,
• sentiment lexicon based features,
• negation based features,
• features based on clusters of tweets.

[1] Tuning led to w_pos = w_neg = w_neu = 0.66.
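To make the feature categories concrete, here is a toy sketch of a few hand-crafted features of the kinds listed above (morphology, lexicon, and negation based). The feature names, lexicons, and negation list are our own illustrative stand-ins, not the actual features of SP1, which are documented in the report accompanying the source code.

```python
import re

# Illustrative resources only; SP1 uses real sentiment lexicons.
NEGATION = {"not", "no", "never", "n't", "cannot"}
POS_LEXICON = {"good": 1.0, "great": 1.5, "love": 1.2}
NEG_LEXICON = {"bad": -1.0, "awful": -1.5, "hate": -1.2}

def handcrafted_features(tokens):
    """Toy features, one or two per category (POS and cluster features omitted)."""
    lexicon = {**POS_LEXICON, **NEG_LEXICON}
    return {
        # morphology: elongated words like "cooool" often signal sentiment
        "num_elongated": sum(1 for t in tokens if re.search(r"(.)\1\1", t)),
        "num_exclamations": sum(t.count("!") for t in tokens),
        # sentiment lexicon: sum of prior polarities of known words
        "lexicon_score": sum(lexicon.get(t.lower(), 0.0) for t in tokens),
        # negation: count of negation tokens in the message
        "num_negations": sum(1 for t in tokens if t.lower() in NEGATION),
    }

feats = handcrafted_features(["I", "do", "n't", "love", "this", "baaad", "movie", "!!"])
```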
We used a linear SVM classifier (Vapnik, 1998; Cristianini and Shawe-Taylor, 2000; Joachims, 2002) trained on three labels, namely positive, negative, and neutral.[4] As already mentioned, SP1 is influenced by a subjectivity detection classifier called SD1: SP1 uses the confidence score of SD1 as a feature. SD1 is also a linear SVM classifier, trained on data with two labels, neutral and subjective (i.e., positive or negative).[5] The higher the confidence score of SD1, the more likely it is that the message expresses sentiment (positive or negative). Apart from the score of SD1 (which is used only by SP1), SP1 and SD1 use the same features.
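The two-classifier arrangement (SD1's confidence fed to SP1 as an extra feature) can be sketched with Scikit-learn, which the paper itself used for all its SVMs. The random data below stands in for the hand-crafted feature vectors, and the sigmoid squashing of the decision scores is a simplification of the exponential normalization the paper applies to confidence scores.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-ins for the hand-crafted feature vectors and the gold labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
y_subj = rng.integers(0, 2, size=60)  # 0 = neutral, 1 = subjective
y_sent = rng.choice(["positive", "negative", "neutral"], size=60)

# Stage 1: subjectivity detector SD1 (two labels), a linear SVM.
sd1 = LinearSVC(C=1.0).fit(X, y_subj)

# Squash SD1's decision scores into (0, 1) as confidence values
# (the paper instead normalizes confidence scores exponentially).
conf_subjective = 1.0 / (1.0 + np.exp(-sd1.decision_function(X)))
conf_subjective = conf_subjective.reshape(-1, 1)

# Stage 2: SP1 sees the original features plus SD1's confidence.
X_sp1 = np.hstack([X, conf_subjective])
sp1 = LinearSVC(C=1.0).fit(X_sp1, y_sent)
```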

SP2 and SD2
The second system of our ensemble uses word embeddings (Mikolov et al., 2013; Pennington et al., 2014). We use the centroid of the word embeddings of each tweet as the feature vector of the tweet. The centroid c(M) of a tweet (message) M is computed as follows:

c(M) = (1/|M|) · Σ_{i=1}^{|M|} w_i

where |M| is the number of tokens in M and w_i is the embedding of the i-th token.[6] We used the 200-dimensional word vectors for Twitter produced by GloVe (Pennington et al., 2014).[7] As with SP1, SP2 incorporates the confidence score of SD2 as a feature. SD2 is a classifier trained on neutral and subjective (i.e., positive or negative) data, again with centroid feature vectors. Given a message M, the confidence score of SD2 for M was added as a feature to its centroid, and the resulting 201-dimensional feature vector was used as input to SP2.[8] SP2 was then trained on the same three classes as SP1 (positive, negative, neutral).[9]

[2] No lemmatization or stemming was used, and tokens could be words, emoticons, hashtags, etc.
[3] All the features of SP1 are described in detail in a publicly available report accompanying the source code of the system. The code and the report are available at https://github.com/nlpaueb/aueb.twitter.sentiment.
[4] We used the SVM implementation of Scikit-learn (Pedregosa et al., 2011; Fan et al., 2008). The same implementation was used for all our SVM classifiers. The optimal C value was found to be 0.00341, using 5-fold cross-validation on TWtrain16.
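The centroid representation and the appended SD2 confidence can be sketched as follows. The tiny embedding table is a made-up stand-in for the 200-dimensional GloVe Twitter vectors, and the function names are ours; out-of-vocabulary tokens are ignored, as in the paper.

```python
import numpy as np

# Toy embedding table standing in for the 200-dimensional GloVe Twitter vectors.
DIM = 200
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=DIM) for w in ["good", "movie", ":)", "#fun"]}

def centroid(tokens, embeddings=vocab, dim=DIM):
    """Average the embeddings of a tweet's tokens.

    Repeated tokens contribute once per occurrence; tokens without
    an embedding are ignored, as in the paper.
    """
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

def sp2_features(tokens, sd2_confidence):
    # Append SD2's confidence score to the centroid -> 201-dimensional vector.
    return np.append(centroid(tokens), sd2_confidence)

vec = sp2_features(["good", "movie", "good", "unseen_token"], 0.83)
print(vec.shape)  # -> (201,)
```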

Experiments & Discussion
Our system was ranked 5th among 34 teams.[10] All teams were ranked by their score on the Twitter2016 test dataset of Task 4. Table 1 shows our rankings on each dataset. Below we discuss the results of our ensemble and show how the subjectivity detection classifiers affect our system.

[6] We allow multiple occurrences of the same word in a message, while we ignore words without embeddings.
[7] The word vectors were pre-trained on a corpus of 2 billion tweets. See http://nlp.stanford.edu/projects/glove/.
[8] The confidence scores of SD1 and SD2 were exponentially normalized (Bishop, 2006).
[9] The optimal C values were found to be 1.40688 for SD2 and 7.39618 for SP2, using 5-fold cross-validation.
[10] http://alt.qcri.org/semeval2016/task4/data/uploads/semeval2016_task4_results.pdf

A strict two-stage approach, like the one suggested by Karampatsis et al. (2014), discards messages that the subjectivity detection (SD) classifier (first stage) decides do not express sentiment, and classifies the rest as positive or negative. However, errors of the first stage propagate to the second and thus play a significant role in the overall performance. We extend their approach and use the output of the subjectivity detection stage in a less rigorous manner, i.e., as a confidence score alongside various other features. Recall that our SD1 is actually the first stage of the system of Karampatsis et al. (2014), and that we use the confidence of SD1 as a feature of SP1. Table 2 shows that SP1 (with the confidence of SD1 as a feature) outperforms the strict two-stage approach by 4.57%, yielding an increase of 12 positions in the ranking. Another interesting observation is that SP2 (with the confidence of SD2 as a feature) achieves a score only 1.9% lower than that of SP1 (with SD1), yielding a ranking around the middle of the list. This is achieved using only features based on word embeddings, along with the confidence of SD2, and no sophisticated feature engineering at all. A final, also very interesting, observation is that the ensemble of SP1 and SP2 improves the results further, yielding the 5th place in the ranking.
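The error-propagation problem of the strict two-stage approach can be illustrated with a small sketch: a hard subjectivity gate means the polarity classifier is never consulted for messages the gate mislabels as neutral. The data and function names below are our own toy illustration, not the authors' pipeline.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic stand-in data: feature 0 drives subjectivity, feature 1 polarity.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 5))
y_subj = np.where(X[:, 0] > 0, "subjective", "neutral")
y_pol = np.where(X[:, 1] > 0, "positive", "negative")

sd = LinearSVC().fit(X, y_subj)  # first stage: subjectivity detector
sp = LinearSVC().fit(X[y_subj == "subjective"],  # second stage: polarity,
                     y_pol[y_subj == "subjective"])  # trained on subjective data

def strict_two_stage(x):
    # Hard gate: if SD says neutral, the polarity stage is never consulted,
    # so any SD error propagates directly to the final label.
    if sd.predict([x])[0] == "neutral":
        return "neutral"
    return sp.predict([x])[0]

label = strict_two_stage(X[0])
```

By contrast, the soft approach of the paper feeds SD's confidence to the polarity classifier as one feature among many, letting the second stage override a borderline subjectivity decision.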

Conclusions and future work
In this paper we presented the system with which we participated in the Message Polarity Classification subtask of SemEval-2016 Task 4. We used a weighted ensemble of two systems, each operating in two stages. In a first, subjectivity detection stage, each message is assigned a confidence score representing the probability that the message expresses an opinion. This probability is then used as a feature by a classifier that detects sentiment. We used two different systems, one based on previous work by Karampatsis et al. (2014) (SP1, with the confidence of SD1 as a feature) and a second system that represents the messages by the centroids of their word embeddings (SP2, with the confidence of SD2 as a feature). The two systems are then combined with a weighted linear ensemble scheme in order to obtain the final sentiment label. Our experiments show that using the confidence of the subjectivity detection stage as a feature, instead of a strict two-stage approach, can lead to improved performance. Also, the ensemble performs better than either of its two component systems on its own.
Despite the encouraging results of our approach (5th among 34 participating teams), there is still much room for improvement. A better continuous space vector representation of the messages might improve SD2 and SP2. Much research has been conducted recently on obtaining better continuous space vector representations of sentences (Le and Mikolov, 2014; Kiros et al., 2015; Hill et al., 2016), instead of centroid vectors. Another direction for future work would be to investigate replacing the SVM classifiers with multilayer perceptrons, possibly on top of recurrent neural networks that would compute vector representations of sentences.