Tweester at SemEval-2016 Task 4: Sentiment Analysis in Twitter Using Semantic-Affective Model Adaptation

We describe our submission to SemEval-2016 Task 4: Sentiment Analysis in Twitter. The proposed system ranked first in subtask B. Our system comprises multiple independent models, such as neural networks, semantic-affective models and topic modeling, that are combined in a probabilistic way. The novelty of the system is the use of a topic modeling approach to adapt the semantic-affective space to each tweet. In addition, significant enhancements were made to the main system in data pre-processing and feature extraction, including the use of word embeddings. Each model predicts a tweet's sentiment (positive, negative or neutral) and a late fusion scheme is adopted for the final decision.


Introduction
Nowadays, the usage of social networks such as Twitter dominates the daily communication of hundreds of millions of people around the world. People often share opinions and express their feelings about various topics through social networks. Tasks such as opinion mining and sentiment analysis (Pang and Lee, 2008) have become very popular, since they can capture a large portion of the public opinion.
Sentiment analysis of tweets has been applied in various domains, such as commerce (Jansen et al., 2009), disaster management (Verma et al., 2011) and health (Chew and Eysenbach, 2010). The task is especially challenging due to the terse and informal writing style, the semantic diversity of content, as well as the often "unconventional" grammar and orthography. Many computational systems, like those submitted to SemEval 2015 Task 10 (Rosenthal et al., 2015), incorporate bag-of-words models with Twitter-specific features like hashtags and emoticons (Davidov et al., 2010; Büchner and Stein, 2015). Word embeddings obtained from large amounts of tweets have been used in unsupervised approaches to sentiment analysis (Astudillo et al., 2015). Additionally, deep learning models have recently become very popular for Twitter sentiment analysis (Severyn and Moschitti, 2015). Topic modeling approaches to sentiment analysis can also be found in the literature, e.g., (Mei et al., 2007; Lin and He, 2009; Lu et al., 2011; Alam et al., 2016; Rao, 2016). In this paper, we present systems submitted to SemEval 2016 Task 4 (Nakov et al., 2016) that deal with sentiment analysis of tweets at the sentence level. The submitted systems are based on the fusion of different classifiers. Specifically, 1) we enhanced the system submitted by Malandrakis et al. (2014) to SemEval 2014 Task 9 (Rosenthal et al., 2014), 2) we used the open-source system submitted by Büchner and Stein (2015) to SemEval 2015 Task 10 (Rosenthal et al., 2015), 3) we trained a convolutional neural network on a large amount of unlabelled Twitter data, 4) we developed a system based on topic modeling, and 5) we trained a classifier using word embeddings as features. Our system was submitted to two subtasks, namely subtask A (message polarity classification) and subtask B (tweet classification according to a two-point scale), and ranked in fifth and first place, respectively.

System Description

Baseline
The baseline system is an enhanced version of the system submitted by Malandrakis et al. (2014) to SemEval 2014 Task 9 (Rosenthal et al., 2014). The major changes include the different handling of hashtags, multiword expressions and affective features, as well as the incorporation of new features. A plethora of features is extracted, the majority of which are affective. Feature extraction is performed at the tweet, suffix and prefix level. Consider the following tweet: "Lol Red Sox just slid through 3rd base #out" (tweet level). A window applied at the beginning extracts the prefix, e.g., "Lol Red Sox" (prefix level), and a window at the end of the tweet extracts the suffix, e.g., "3rd base #out" (suffix level).

Affective features
The baseline is based on affective features derived from the semantic-affective model proposed by Malandrakis et al. (2013). The semantic-affective model relies on the assumption that semantic similarity implies affective similarity (Malandrakis et al., 2013). First, a semantic model is built and then affective ratings are estimated for unknown tokens by exploiting the affective ratings of semantically similar words. The model is applicable to both single words and short multiword expressions, as shown in (1):

υ̂(t_j) = a_0 + Σ_{i=1..N} a_i υ(w_i) S(t_j, w_i),    (1)

where t_j is the unknown lexical token, w_1..N are the seed words, υ(w_i) and a_i are the affective rating and the weight corresponding to the word w_i, and S(·) is the semantic similarity metric between two tokens. The semantic model was built as shown in (Palogiannidi et al., 2015), using word-level contextual feature vectors and adopting a scheme based on mutual information for feature weighting. The dimensionality of the affective features was reduced by retaining only the polarity features (instead of using additional affective dimensions, such as arousal and dominance). Affective lexica were created using a generic corpus (116M sentences) (Iosif et al., 2016) and a Twitter corpus (115M tweets) created for the purpose of our submission. In the former case, task-independent affective ratings are estimated. Task-dependent affective ratings can be estimated by keeping domain-specific sentences in the generic corpus: a language model was built using domain-relevant sentences, and the top 30% of the most relevant entries of the generic corpus were selected. Affective features were also derived through semantic models built from corpora collected with the topic modeling approach described in the Topic Modeling section. Third-party affective lexica were also used.
AFINN (Nielsen, 2011) contains discrete polarity ratings in the range [−5, 5]; nrc and nrctag (Mohammad et al., 2013) contain continuous polarity ratings for tokens, generated from a collection of tweets that include a positive or a negative word hashtag.
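The rating rule in (1) can be sketched in a few lines. This is a minimal illustration with a toy similarity function and made-up seed ratings, not the trained model from the paper:

```python
def affective_rating(token, seeds, similarity, a0=0.0):
    """Estimate the polarity of an unknown token from seed words, as in (1):
    v(t_j) = a0 + sum_i a_i * v(w_i) * S(t_j, w_i).
    `seeds` maps each seed word to a (rating v, weight a) pair;
    `similarity` plays the role of the semantic similarity metric S."""
    return a0 + sum(a * v * similarity(token, w) for w, (v, a) in seeds.items())

# Toy similarity: 1.0 for identical tokens, 0.5 for a shared first letter, else 0.
def toy_sim(t1, t2):
    if t1 == t2:
        return 1.0
    return 0.5 if t1[0] == t2[0] else 0.0

# Hypothetical seed ratings (the real system bootstraps these from ANEW).
seeds = {"good": (0.8, 1.0), "bad": (-0.7, 1.0)}
rating = affective_rating("great", seeds, toy_sim)  # 0.8*1.0*0.5 + (-0.7)*1.0*0.0
```

In the actual system, S(·) comes from the corpus-based semantic model and the weights a_i are learned; here they are stand-ins to show the shape of the computation.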

Tokenization
Based on the assumption that hashtags in different positions within a tweet may have different semantic interpretations, the tweets are transformed as follows: if a hashtag occurs at the end of the tweet, it is assumed to convey semantic information. Otherwise, the hashtag is treated as a word or possibly a union of words. In the latter case, only the corresponding word is kept (e.g., "#moon is so beautiful tonight" → "moon is so beautiful tonight", but "What a beautiful night under the moonlight #romantic" is preserved as is). A hashtag that contains a union of words is expanded, e.g., #Hockeyisback → Hockey is back. Hashtags were expanded using the Viterbi algorithm (Forney and David, 1973), exploiting n-gram datasets. The n-gram dataset we used is a concatenation of the Google n-gram corpus, which contains 1 trillion tokens from publicly accessible web pages (Brants and Franz, 2006), and an n-gram corpus based on 75 million English tweets (Herdagdelen, 2013). The absolute and relative frequencies of hashtags to be expanded were also used as features, as well as binary indicators of whether a tweet contains hashtags that require expansion. POS tagging and tokenization were then performed using the ARK Tweet NLP tagger (Owoputi et al., 2013), a Twitter-specific tagger. A tweet may contain single words, emoticons, punctuation marks or multiword expressions. Multiword expressions (MWEs) are non-compositional expressions that are processed as a single token. They were detected using the Gensim library (Řehůřek and Sojka, 2010) and added to the affective lexica.
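The Viterbi-style hashtag expansion can be sketched as a dynamic program that maximizes the product of unigram probabilities over candidate splits. The counts below are made up for illustration; the real system uses the Google and Twitter n-gram corpora:

```python
import math

# Hypothetical unigram counts standing in for the Google/Twitter n-gram data.
COUNTS = {"hockey": 500, "is": 10000, "back": 800, "hock": 5, "ey": 3}
TOTAL = sum(COUNTS.values())

def segment(text):
    """Split a concatenated hashtag body into words by maximizing the sum of
    log unigram probabilities (Viterbi-style dynamic programming).
    best[i] holds the best (score, words) for the prefix text[:i]."""
    n = len(text)
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for j in range(max(0, i - 20), i):  # cap candidate word length at 20
            word = text[j:i]
            if word in COUNTS and best[j][0] > -math.inf:
                score = best[j][0] + math.log(COUNTS[word] / TOTAL)
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[n][1]

words = segment("hockeyisback")  # the #Hockeyisback example from the text
```

With real n-gram frequencies the same recurrence recovers the most probable word sequence for arbitrary hashtags.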
Some parts of a tweet may be crucial for the correct understanding of its affective meaning. We assume that such parts are the prefix, the suffix and the negated parts. Negations were detected using the list proposed by Potts (2011). When a negation token is detected, the tokens that follow are marked as negated until a punctuation mark is reached. Feature extraction is then applied to the negated part of the tweet. Windows are used to split a tweet into prefix and suffix, attempting to capture the cognitive dissonance phenomenon that is associated with sarcasm, irony and humour (Reyes et al., 2012). The suffix is extracted by applying windows that keep 20%, 50% and 70% of the tokens occurring at the end of the tweet. Feature extraction is performed for both suffixes and prefixes.
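The negation-marking rule can be sketched as follows. The negation word list here is a tiny excerpt standing in for the Potts (2011) list, and the `_NEG` suffix is one common convention for marking the scope:

```python
NEGATIONS = {"not", "no", "never", "cannot"}  # tiny excerpt of the Potts (2011) list
PUNCT = {".", ",", "!", "?", ";", ":"}

def mark_negated(tokens):
    """Mark every token that follows a negation word as negated,
    until the next punctuation mark is reached."""
    out, negated = [], False
    for tok in tokens:
        if tok in PUNCT:
            negated = False  # punctuation closes the negation scope
            out.append(tok)
        elif negated:
            out.append(tok + "_NEG")
        else:
            out.append(tok)
            if tok.lower() in NEGATIONS:
                negated = True
    return out

marked = mark_negated(["i", "do", "not", "like", "this", ".", "really"])
```

Feature extraction can then be run separately over the `_NEG`-marked span, mirroring the per-part extraction described above.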

Word2vec features
In addition to the semantic similarity features used in (Malandrakis et al., 2014), we use semantic representations, i.e., word embeddings, for the semantic similarity estimation S(·) in (1). Word embedding features were derived using word2vec (Mikolov et al., 2013), representing each word as a 300-D vector. Since tweet-level features are required, a 300-D vector is generated for each tweet by averaging the vectors of its constituent words.
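The tweet-level averaging step is straightforward; a minimal sketch with toy 3-D vectors in place of the 300-D word2vec embeddings:

```python
import numpy as np

def tweet_vector(tokens, embeddings, dim=300):
    """Average the embedding vectors of a tweet's in-vocabulary tokens;
    out-of-vocabulary tokens are skipped, and a zero vector is returned
    when no token is covered."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Toy 3-D embeddings (the real system uses 300-D word2vec vectors).
emb = {"red": np.array([1.0, 0.0, 0.0]), "sox": np.array([0.0, 1.0, 0.0])}
v = tweet_vector(["red", "sox", "oov"], emb, dim=3)
```

The resulting vector serves both as input to similarity estimation and, later, as the feature vector of the word2vec-based classifier.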

Additional features
Additional features based on characters and subjectivity lexica were used. Character features include the absolute and relative frequencies of selected characters. The selected characters are capitalized letters, punctuation marks and emoticons, as well as character repetitions, i.e., at least three identical successive characters in a word. Subjectivity features were also extracted based on the subjectivity lexicon of (Wilson et al., 2005). Specifically, the absolute and relative frequencies of the strong positive/negative and weak positive/negative words were used as features.
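A minimal sketch of the character-level features; the emoticon pattern and punctuation set are simplified assumptions, not the exact inventories used in the system:

```python
import re

def char_features(tweet):
    """Absolute and relative frequencies of capitals and punctuation, plus
    counts of emoticons and of >=3 identical successive characters."""
    n = max(len(tweet), 1)
    caps = sum(c.isupper() for c in tweet)
    punct = sum(c in "!?.,;:" for c in tweet)
    emoticons = len(re.findall(r"[:;][-]?[)(DP]", tweet))  # e.g. :) ;-( :D
    repeats = len(re.findall(r"(.)\1{2,}", tweet))         # e.g. "sooo", "!!!"
    return {"caps": caps, "caps_rel": caps / n,
            "punct": punct, "punct_rel": punct / n,
            "emoticons": emoticons, "repeats": repeats}

f = char_features("Sooo HAPPY!!! :)")
```

Such surface cues are cheap to compute and correlate with emphatic, affect-laden writing on Twitter.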

Statistics extraction
The statistics of the token-level polarity features were estimated in order to extract tweet-level features. The following statistics were computed: length (cardinality), min, max, max amplitude, sum, average, range (max minus min), standard deviation and variance. Normalized versions of the same statistics were also calculated.
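The statistics listed above can be computed directly over the token-level polarity values; a compact sketch:

```python
import statistics

def polarity_stats(values):
    """Tweet-level statistics over token-level polarity ratings, as listed
    in the text: length, min, max, max amplitude, sum, average, range,
    standard deviation and variance (population versions here)."""
    if not values:
        return {}
    return {
        "length": len(values),
        "min": min(values),
        "max": max(values),
        "max_amplitude": max(abs(v) for v in values),
        "sum": sum(values),
        "average": sum(values) / len(values),
        "range": max(values) - min(values),
        "stdev": statistics.pstdev(values),
        "variance": statistics.pvariance(values),
    }

s = polarity_stats([0.5, -0.2, 0.1])
```

Normalized variants (e.g., dividing by tweet length) follow the same pattern.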

Topic Modeling
Topic modeling is a method for discovering the "topics" that occur in a collection of documents; typically, multiple topics are present in each document. The most widely used topic modeling approach is Latent Dirichlet Allocation (LDA) (Blei et al., 2003), which builds on Latent Semantic Analysis (LSA) (Deerwester, 1988) and probabilistic Latent Semantic Analysis (pLSA) (Hofmann, 2000). Here, we used topic modeling in order to adapt the semantic space to each tweet. In essence, the corpus was split in a probabilistic way based on the detected topics, and a semantic model was built for each subcorpus. Since this technique is probabilistic, each corpus sentence can belong to each revealed topic with a given probability. Clusters of sentences per topic were then created by assigning each sentence to its most probable topic, and a semantic model was built for each cluster using word2vec (Mikolov et al., 2013). A mixture of the semantic models, weighted by the topic posteriors, is used for the estimation of a tweet's semantic similarities, i.e., S(·) in (1).
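The topic-weighted mixture can be sketched as a weighted sum of per-topic similarity functions. The two toy similarity functions and the posterior values below are illustrative assumptions:

```python
def mixture_similarity(w1, w2, topic_posteriors, topic_models):
    """Semantic similarity as a mixture of per-topic semantic models,
    weighted by the tweet's topic posteriors:
    S(w1, w2) = sum_k p(k | tweet) * S_k(w1, w2)."""
    return sum(p * topic_models[k](w1, w2)
               for k, p in enumerate(topic_posteriors))

# Two hypothetical per-topic similarity functions: the first behaves like a
# sports-topic model, the second like a generic background model.
models = [lambda a, b: 0.9 if {a, b} == {"goal", "match"} else 0.1,
          lambda a, b: 0.2]

# Posteriors p(topic | tweet) for a sports-heavy tweet.
sim = mixture_similarity("goal", "match", [0.7, 0.3], models)
```

The effect is that word pairs that are close in the dominant topic of a tweet receive a higher similarity, which in turn adapts the affective ratings in (1) to that tweet.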

Convolutional Neural Network
A deep convolutional neural network (CNN) was also developed. The network's architecture is inspired by sentence classification tasks (Collobert et al., 2011; Kalchbrenner et al., 2014; Kim, 2014). Each tweet is represented by a "sentence" matrix S that is created as follows. First, each word is represented as a 300-D vector using word2vec; then, the word vectors are concatenated: S = W_1 ⊕ W_2 ⊕ W_3 ⊕ ··· ⊕ W_n, S ∈ R^{d×n}, where ⊕ denotes the concatenation operation. Each column i of S is a vector W_i ∈ R^d corresponding to the i-th word of the tweet, so the order of the words in the tweet is preserved. In order to obtain the same length for all tweets, zero padding was applied, concatenating zero word vectors until the length n of the longest tweet is reached. The size of S is d × n, where d is the dimension of the word embedding and n is the length of the longest tweet. The matrix S is the input to the network, where a convolution operation is applied between S and a kernel F ∈ R^{d×m}. The width m was set to 5, while the parameters of the model, i.e., the kernel values, are learned during training. The result of the convolutional layer is a vector c ∈ R^{n−m+1} (Kim, 2014). In fact, the network uses multiple kernels with varying sliding windows and generates multiple features. These features are the inputs to the next layer, which selects the maximum value of each feature by applying a max-over-time pooling operation (max-pooling layer) (Collobert et al., 2011). Next, the output of the max-pooling layer is passed to a dropout layer (Srivastava et al., 2014). A softmax layer that classifies each instance into one of the possible classes is the final step.
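The convolution-plus-pooling step can be made concrete with a small numpy sketch; the dimensions are toy values (d=2, n=4, m=2) instead of the real 300-D embeddings and width-5 kernels:

```python
import numpy as np

def conv_max_over_time(S, F):
    """Slide a kernel F (d x m) over the sentence matrix S (d x n),
    producing c in R^{n-m+1}, then apply max-over-time pooling so each
    kernel yields a single scalar feature (as in Kim, 2014)."""
    d, n = S.shape
    _, m = F.shape
    c = np.array([np.sum(S[:, i:i + m] * F) for i in range(n - m + 1)])
    return c.max()

# Toy sentence matrix: d=2 embedding dimensions, n=4 words.
S = np.array([[1.0, 0.0, 2.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
F = np.ones((2, 2))  # a single d x m kernel with m=2
feat = conv_max_over_time(S, F)
```

In the full network many such kernels run in parallel, and their pooled scalars form the feature vector passed to the dropout and softmax layers.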

Word2vec System
Based on the assumption that similarity of meaning implies affective similarity (Malandrakis et al., 2013), we built a system that relies exclusively on the tweets' semantic representation. Specifically, word2vec was applied over large text corpora in order to compute the semantic representations of words (formulated as vectors). Then, the vectors of each tweet's constituent words were averaged in order to create a single vector. These vectors were used to train a random forest classifier.

Webis
The Webis open-source system (Büchner and Stein, 2015), submitted to SemEval 2015 Task 10 (Rosenthal et al., 2015), is an ensemble of subsystems that ranked at the top of the SemEval 2013 and SemEval 2014 sentiment analysis tasks (Nakov et al., 2013; Rosenthal et al., 2014). The following systems were combined in a late fusion scheme, based on the mean posterior probabilities: 1) NRC-Canada (Mohammad et al., 2013) is based on morphological, linguistic and polarity features; 2) GU-MLT-LT (Günther and Furrer, 2013) trains a linear classifier by stochastic gradient descent, using social-media-specific text preprocessing and linguistic features; 3) KLUE (Proisl et al., 2013) employs a Maximum Entropy classifier with bag-of-words models, sentiment, emoticon and internet slang abbreviation features; 4) TeamX (Miura et al., 2014) is similar to NRC-Canada but uses more lexicon-based features and handles the unbalanced distribution of the tweets' sentiment by adopting a weighting scheme to bias the output.

Fusion of the systems
The motivation behind the development of multiple systems for sentiment classification is that different systems may capture different aspects of sentiment, so combining them can yield more accurate predictions of a tweet's sentiment. The system architecture, including the fusion scheme, is summarized in Figure 1: the various classifiers were trained with features extracted from Twitter datasets and their posterior probabilities were combined in a late fusion scheme. Various techniques can be used for the late fusion of classifiers, e.g., voting-based methods (Tulyakov et al., 2008), multi-classifier fusion that uses the posteriors as features for training a new classifier (Kittler et al., 1998; Gutierrez et al., 2016), or algebraic combinations (Kittler et al., 1998). Algebraic combinations are based on operations such as mean, median, product, max or min.
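The mean algebraic combination used in our submission can be sketched in a few lines; the posterior values below are made-up examples:

```python
def late_fusion(posteriors):
    """Mean algebraic combination of classifier posteriors: average each
    class probability over the systems, then pick the argmax class."""
    classes = posteriors[0].keys()
    avg = {c: sum(p[c] for p in posteriors) / len(posteriors) for c in classes}
    return max(avg, key=avg.get), avg

# Hypothetical posteriors from three subsystems for one tweet.
systems = [{"positive": 0.6, "negative": 0.1, "neutral": 0.3},
           {"positive": 0.2, "negative": 0.5, "neutral": 0.3},
           {"positive": 0.7, "negative": 0.2, "neutral": 0.1}]
label, avg = late_fusion(systems)
```

Replacing `sum(...) / len(...)` with a median, product, max or min yields the other algebraic combinations mentioned above.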

Data
We trained our systems using both general purpose and Twitter data. We used a large generic corpus that contains 116M sentences (G-116M). The dataset was created by posing queries on a web search engine and aggregating the resulting snippets. A Twitter specific dataset was also created collecting 115M tweets (T-115M). We also used the training data provided by SemEval 2016 for subtasks A and B (Sem/A-2016, Sem/B-2016), as well as training data from the corresponding task of SemEval 2013 (Sem-2013). We also used ANEW (Bradley and Lang, 1999) for bootstrapping the affective lexicon expansion process.

Systems
The following subsystems were combined for subtask A: Baseline (B), CNN, the Word2vec system (W2V) and Webis (W). Baseline and Webis were trained on the concatenation of Sem/A-2016 and Sem-2013. A two-stage feature selection was applied: the first stage took place on each feature set separately and the second on the combined feature set of the first stage. Finally, a Naive Bayes tree classifier was trained. Affective features derived by the topic modeling approach were also incorporated. The word vectors required for the CNN are derived from different corpora, i.e., the combination of a Google News dataset and the T-115M corpus. Specifically, for each word in a tweet, the word vector extracted from the latter corpus is used only if the word does not exist in the former corpus. The vectors of out-of-vocabulary words are initialized randomly from a uniform distribution. For the word2vec-based system, we trained a random forest classifier with 100 trees, using 300-D feature vectors, on the concatenation of Sem/A-2016 and Sem-2013. We experimented with various fusion schemes and system combinations; the mean algebraic fusion scheme was selected for the reported results. The submission for subtask B includes the following systems: Baseline (B′), Topic Modeling (TM) and three Webis subsystems, i.e., NRC-Canada, GU-MLT-LT and TeamX. The Baseline (B′) and the selected Webis systems were trained on the Sem/B-2016 dataset. Both the feature selection and the classification of B′ are similar to those of the Baseline used in subtask A. The difference between B and B′ is that the former, in contrast to the latter, includes affective features extracted through the topic-modeling-based approach. The topic modeling system applies the LDA algorithm on G-116M, using 16 topics. Then, affective ratings are estimated and used to train a Naive Bayes tree classifier. Similarly to subtask A, a mean algebraic fusion scheme was selected for combining the systems.

Results
The developed systems for subtask A classify each tweet into one of three sentiment classes (positive vs. negative vs. neutral); however, performance is measured taking into consideration only the two polarity classes, i.e., positive and negative. The macro-averaged F-score of the positive and negative classes is reported in Table 1 for various datasets. The first column lists the integrated systems, while the submitted system is highlighted in grey. Our system ranked fifth; experimenting with different system combinations, we could climb to the third position.
Subtask B is a binary classification task (positive vs. negative). The macro-averaged recall, p_PN = (p_Pos + p_Neg)/2, p_PN ∈ [0, 1], and the macro-averaged F-score are reported in Table 2 for various system combinations. The systems and their macro-averaged recall are listed in the first and second row of the table, respectively. The reported scores were derived from the 2016 test tweets that belong to specific topics. Each row that follows indicates a unique combination, and the submitted system is highlighted in grey. When all the available systems are combined, the macro-averaged recall (AvgR) is 0.803 and the macro-averaged F-score (AvgF1) is 0.808. The combinations that follow contain a subset of the systems (the selected systems are denoted with √ and the omitted systems with ×).
The baseline proved to be the most robust system, achieving the highest performance among the subsystems and a higher performance than the submitted system (0.797). When all subsystems but one are used, performance decreases, except when the Webis system is omitted, in which case AvgR increases to 0.818. The largest performance drop occurs when the baseline system is omitted. Investigating more combination schemes, AvgR rises to 0.827. The combination of CNN, TM and B′ with two of the best Webis systems yields robust performance, with AvgR and AvgF1 of 0.824 and 0.805, respectively.

Conclusions
We presented a system for the sentiment classification of tweets for SemEval 2016 Task 4: Sentiment Analysis in Twitter. We participated in subtasks A and B, and we won subtask B. We developed various systems, including a CNN and a topic modeling approach for the adaptation of the semantic space to each tweet. Regarding subtask A, the submitted system ranked fifth, while alternative system combinations reach up to the third position. The performance for subtask B can be higher (up to +3% compared to the submitted system) for various system combinations. Future work will focus on the fusion of the systems, as well as on their enhancement, in order to achieve higher performance on the three-point scale sentiment classification.