UNIBA: Sentiment Analysis of English Tweets Combining Micro-blogging, Lexicon and Semantic Features

This paper describes the UNIBA team participation in the Sentiment Analysis in Twitter task (Task 10) at SemEval-2015. We propose a supervised approach relying on keyword, lexicon and micro-blogging features, as well as a representation of tweets in a word space.


Introduction
Sentiment analysis is the study of the subjectivity and polarity (positive vs. negative) of a text (Pang and Lee, 2008). With the worldwide diffusion of social media, a huge amount of textual data has become available, attracting the interest of researchers in this domain (Rosenthal et al., 2014). Sentiment analysis on such informal texts poses new challenges due to the presence of slang, misspelled words and micro-blogging features such as hashtags or links; traditional approaches may not transfer successfully to this domain. Previous research has successfully exploited approaches based on lexical and micro-blogging features (Mohammad et al., 2013). In this study, we investigate a supervised approach including three kinds of features based on keywords and micro-blogging properties of tweets, sentiment lexicons, and semantics. Rather than using word-sense disambiguation (Miura et al., 2014), we represent tweets in a distributional semantic model (DSM) (Vanzo et al., 2014), which learns the contexts of usage of words by analysing co-occurrences in large corpora.
This paper describes our participation in the SemEval 2015 Sentiment Analysis in Twitter task (Rosenthal et al., 2015). We discuss methods and results of our experimental study for the overall polarity classification of tweets (message level subtask B). The Sentiment Analysis task focuses on English tweets. Data provided for training are annotated according to the overall polarity of each tweet (i.e., 'negative', 'positive' or 'neutral'). The system evaluation is performed on different test sets. In particular, the rank of the systems is calculated on the official Twitter 2015 test set. Further evaluation is performed on a progress set including test instances from the previous edition of the task, to allow comparison with previous studies (Rosenthal et al., 2014). We build a supervised system based on our sentiment classifier for Italian tweets, which ranked 1st in both the polarity and subjectivity tasks at Evalita 2014 (Basile and Novielli, 2014).
The paper is structured as follows: we introduce our system and report the details about features in Section 2. We describe the evaluation and the system setup in Section 3. We conclude by reporting results and discussion in Section 4.

System Description
Our system is built upon our classifier for sentiment analysis of Italian tweets (Basile and Novielli, 2014). We adopt a supervised approach using Support Vector Machine as a classification algorithm. We investigate three groups of features based on: (i) keyword and micro-blogging characteristics, (ii) sentiment lexicons, and (iii) a Distributional Semantic Model (DSM).
Keywords and micro-blogging features. Keyword-based features exploit tokens occurring in the tweets (Table 1). During tokenization we replace user mentions, URLs and hashtags with three metatokens, "USER", "URL" and "TAG", for which we also count the total occurrences. As for keywords, we consider unigrams and bigrams. To deal with negations, all the n-grams occurring in a negated context receive the 'neg' suffix. A negated context is a tweet fragment starting with a negation word and ending with a punctuation mark (Pang et al., 2002). Moreover, we create features capturing typical aspects of micro-blogging, such as the upper-case ratio and character repetitions, positive and negative emoticons, informal expressions of laughter, as well as the presence of exclamation and interrogative marks, negations, and intensifiers. Finally, we include features based on word counts for 1,000 large-scale word clusters built on English tweets.
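As an illustration, the negated-context tagging described above can be sketched as follows (the negation word list here is a small hypothetical subset, and `mark_negated` is our own name, not the system's):

```python
# Simplified negation list for illustration; the actual system uses a longer one.
NEGATIONS = {"not", "no", "never", "nothing", "cannot", "don't", "won't"}
PUNCTUATION = {".", ",", ";", ":", "!", "?"}

def mark_negated(tokens):
    """Append the 'neg' suffix to every token between a negation word
    and the next punctuation mark (Pang et al., 2002)."""
    negated = False
    out = []
    for tok in tokens:
        if tok in PUNCTUATION:
            negated = False          # punctuation closes the negated context
            out.append(tok)
        elif tok.lower() in NEGATIONS:
            negated = True           # open a negated context
            out.append(tok)
        else:
            out.append(tok + "_neg" if negated else tok)
    return out
```

N-grams would then be built over the tagged tokens, so that, e.g., "like_neg it_neg" stays distinct from its affirmative counterpart.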
Lexicon-based Features. The second group contains features calculated for each of the eight lexicons we consider in this study. These lexicons can be differentiated based on how they represent the information about prior polarity of words.
The NRC Emotion Lexicon (Mohammad and Turney, 2010), the MPQA Lexicon (Wilson et al., 2005) and the Bing Liu Lexicon (Hu and Liu, 2004) provide lists of positive and negative words. We assign a score of +1 to the positive sentiment terms and -1 to the negative ones. Similarly, the NRC Hashtag Sentiment Lexicon and the Sentiment140 Lexicon provide a list of words with their sentiment association scores, calculated as pointwise mutual information with respect to collections of positive and negative tweets (Mohammad et al., 2013). Positive and negative scores are associated with positive and negative sentiment, respectively, while the magnitude indicates the degree of association. We also consider the lexicon used by SentiStrength, a state-of-the-art tool for extracting sentiment strength from informal English text on social media (Thelwall et al., 2010). The SentiStrength lexicon is structured as a list of words with scores ranging in [−5, +5]. A set of booster words is also provided, to increase or decrease the strength of the prior polarity of terms. Finally, we use a list of emoticons taken from Wikipedia: we assign +1 and -1 as scores for positive and negative emoticons, respectively. In all the lexicons mentioned so far, either a positive or a negative score is associated with each term. Using these lexicons, we extract a set of features based on the prior polarity of words occurring in the tweets, as reported in Table 2. The features are computed separately for terms in affirmative contexts and terms in negated contexts.
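For lexicons of this kind (a single positive or negative score per word), the Table 2 features can be sketched as below; the lexicon is assumed to be a plain word-to-score dictionary, and the function name is ours:

```python
def lexicon_features(tokens, lexicon):
    """Prior-polarity features for one lexicon (cf. Table 2): counts,
    sums, maxima and last scores of positive and negative tokens."""
    scores = [lexicon[t] for t in tokens if t in lexicon]
    pos = [s for s in scores if s > 0]
    neg = [s for s in scores if s < 0]
    return {
        "o_pos": len(pos),                  # tokens with positive score
        "o_neg": len(neg),                  # tokens with negative score
        "o_subj": len(pos) + len(neg),      # subjective tokens
        "sum_pos": sum(pos),
        "sum_neg": sum(neg),
        "sum_subj": sum(pos) + sum(neg),
        "max_pos": max(pos, default=0.0),
        "max_neg": min(neg, default=0.0),   # most negative score observed
        "last_pos": pos[-1] if pos else 0.0,
        "last_neg": neg[-1] if neg else 0.0,
    }
```

In the actual system these features are computed twice per lexicon, once over tokens in affirmative contexts and once over tokens in negated contexts.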
In addition, we use SentiWordNet 3.0 (Esuli and Sebastiani, 2006). SentiWordNet extends WordNet by associating positive, negative and objective scores to each synset, where the three scores sum up to 1. A lemma can receive multiple polarity scores if it occurs in more than one synset. In such cases, we select the most frequent sense for the lemma, with respect to its part-of-speech. Thanks to the availability of the objective scores, additional features can be computed to model the presence of neutral terms, as reported in (Basile and Novielli, 2014). The features based on SentiWordNet are also calculated separately for affirmative and negated contexts.
Finally, we consider the word classes defined in the Linguistic Inquiry and Word Count (LIWC) taxonomy, developed in the scope of psycholinguistic research (Pennebaker and Francis, 2001). LIWC organizes words into psychologically meaningful categories based on the assumption that words and language reflect most of the cognitive and emotional phenomena involved in communication. Previous research has shown how language use varies with respect to the communicative intention, thus making it possible to distinguish between objective and subjective statements as well as between agreement and disagreement expressions (Novielli and Strapparava, 2013). Therefore, we include word count features for each word class in LIWC. Similarly, we include word count features for the emotion word classes in the NRC Emotion Lexicon.
Semantic Features. Finally, we calculate features based on a Distributional Semantic Model (DSM). Given a set of 15M unlabelled downloaded tweets, we build a geometric space in which each word is represented as a mathematical point (Sahlgren, 2006). The similarity between words is computed as their closeness in the space. To represent a tweet in the geometric space, we adopt the superposition operator (Smolensky, 1990), that is, the vector sum of all the vectors of the words occurring in the tweet. We use the tweet vector $\vec{t}$ as a semantic feature in training our classifier.
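The superposition operator amounts to a component-wise sum of word vectors; a minimal sketch with toy vectors (names are ours):

```python
import numpy as np

def tweet_vector(tokens, dsm, dim):
    """Superposition (Smolensky, 1990): the tweet vector is the sum of
    the DSM vectors of the words it contains; OOV words are skipped."""
    v = np.zeros(dim)
    for tok in tokens:
        if tok in dsm:
            v += dsm[tok]
    return v
```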
In the same fashion, we build prototype vectors for each class based on the sentiment lexicons that provide prior polarity scores for words (i.e. SentiWordNet, SentiStrength, and the merge of the NRC Hashtag and Sentiment140 lexicons). For example, the prototype vector for the positive class $\vec{p}_{pos}$ based on SentiStrength is obtained by summing up all the vectors of words with positive prior polarity in the SentiStrength lexicon. We use three prototype vectors to represent, for each lexicon, the positive $\vec{p}_{pos}$, negative $\vec{p}_{neg}$, and subjective $\vec{p}_{s}$ classes (the latter defined by considering both positive and negative words). In the case of SentiWordNet, objectivity scores are also available and allow us to build a prototype for objectivity $\vec{p}_{o}$. To capture the subjectivity and the polarity of a tweet $\vec{t}$, we compute the cosine similarity between $\vec{t}$ and each prototype vector.
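Under the same toy assumptions, the prototype vectors and the resulting similarity features could be sketched as:

```python
import numpy as np

def prototype(words, dsm, dim):
    """Prototype vector of a class: sum of the vectors of all the
    words that the lexicon assigns to that class."""
    v = np.zeros(dim)
    for w in words:
        if w in dsm:
            v += dsm[w]
    return v

def cosine(u, v):
    """Cosine similarity, with a guard for zero vectors."""
    n = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / n) if n else 0.0

# For each lexicon, the semantic features of a tweet vector t would be
# cosine(t, p_pos), cosine(t, p_neg) and cosine(t, p_s).
```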

Evaluation
The message level subtask (subtask B) is designed for evaluating systems on their ability to predict the overall polarity of a given tweet, with respect to three classes: positive, negative, and neutral.
Organizers provided 8,006 manually annotated tweets as training data. We use this training set to extract the features described in Section 2 (further development data provided by the organizers are not used for training). Details on our system setup are reported in Section 3.1. As test set, organizers provided a collection of 2,390 manually annotated tweets (Official 2015 test set). Further data from different sources (8,987 tweets overall) are included in the progress test set and are provided to allow comparison with systems participating in previous editions. Systems are compared against the gold standard of the official test set in terms of macro-averaged F measure calculated over the positive and negative classes. For the sake of completeness, we also report the weighted F measure considering all three categories in the classification task (see Section 4).
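As a sketch, the official measure can be computed as follows (helper names are ours; a direct reading of "macro-averaged F over positive and negative"):

```python
def f1(tp, fp, fn):
    """Standard F1 from confusion counts, guarding against division by zero."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f_pos_neg(gold, pred):
    """Mean of the F1 scores for 'positive' and 'negative'; 'neutral'
    labels still affect precision/recall but not the average."""
    total = 0.0
    for c in ("positive", "negative"):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        total += f1(tp, fp, fn)
    return total / 2
```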

System Setup
The system is developed entirely in Java. We use the Liblinear implementation of the L2-loss support vector classifier. Tweets are tokenized using the Twitter NLP and Part-of-Speech Tagging API. We use both the tokenizer and the part-of-speech tagger to preprocess the data.
Regarding the DSM, we download 15 million tweets using the Twitter Streaming API. Tweets are downloaded by querying the API using three lexicons, one per class, extracted from the training data based on Kullback-Leibler divergence (KLD), as described in (Basile and Novielli, 2014).
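As a sketch, a KLD-style term score over class and background term distributions (assumed here to be given as word-to-probability dictionaries; names and filtering choices are ours) could look like:

```python
import math

def kld_term_scores(p_class, p_background):
    """Score each term by its contribution p * log(p / q) to the KL
    divergence of the class distribution from the background; the
    top-scoring terms would form the query lexicon for that class."""
    return {
        w: p * math.log(p / p_background[w])
        for w, p in p_class.items()
        if p > 0 and w in p_background and p_background[w] > 0
    }
```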
We download the same number of tweets for each lexicon. We exploit these unlabelled tweets to build a DSM using the word2vec tool, based on a revised implementation of the Recurrent Neural Net Language Model (Mikolov et al., 2013) with a log-linear approach. We use the skip-gram model, which is more accurate in the presence of infrequent words, with 300 vector dimensions, and remove the terms with fewer than ten occurrences, obtaining 308,493 terms overall.
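To illustrate the skip-gram objective, the following toy generator enumerates the (center, context) training pairs the model is trained on (this is only the pair-extraction scheme, not word2vec itself; the function name is ours):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) pairs: each word predicts the words
    within `window` positions on either side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```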
In training our classifier, we set the C parameter to 0.01. We select this value via 10-fold cross validation on the training data. The total number of features exploited is 145,967.
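A stdlib-only sketch of the fold splitting behind such a 10-fold validation (helper name is ours; the real tuning also scores each candidate C value on the held-out folds):

```python
def kfold_indices(n, k=10):
    """Yield (train, test) index lists for k-fold cross validation;
    the last fold absorbs the remainder when n is not divisible by k."""
    fold = n // k
    idx = list(range(n))
    for i in range(k):
        test = idx[i * fold:(i + 1) * fold] if i < k - 1 else idx[i * fold:]
        held = set(test)
        train = [j for j in idx if j not in held]
        yield train, test
```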

Results and Discussion
The final ranking issued by the organizers considers the system performance in terms of the average of the F measures for the positive and negative classes only. Table 3 reports the system performance and its rank. The system rank on the progress set is calculated on the performance on the Twitter 2014 subset. For completeness, we also report the F measure calculated considering all three classes in our model, including the neutral category (Table 4).

Table 1: Description of keyword and micro-blogging features.
  n-grams: uni- and bi-grams are considered; user mentions, URLs and hashtags are replaced with metatokens
  count_USER: total occurrences of user mentions
  count_URL: total occurrences of URLs
  count_TAG: total occurrences of hashtags
  uppercase_ratio: the ratio between the number of upper-case characters and the total number of characters
  emo_pos: the number of positive emoticons
  emo_neg: the number of negative emoticons
  count_Laugh: the count of sequences of 'ah' as slang expressions of laughter
  count_Intensif: the ratio between the number of tokens with repeated characters and the total number of tokens
  count_QMark: the total occurrences of question marks
  count_ExMark: the total occurrences of exclamation marks
  count_Negation: the total occurrences of negation words
  count_cluster_i: the total occurrences of words belonging to the i-th cluster

Table 2: Description of sentiment lexicon based features.
  o_pos: the number of tokens with a positive score
  o_neg: the number of tokens with a negative score
  o_subj: the number of tokens with either a positive or a negative score
  last_pos: the score of the last positive token in the tweet
  last_neg: the score of the last negative token in the tweet
  last_emo: the score of the last emoticon in the tweet
  sum_pos: the sum of positive scores for the tokens in the tweet
  sum_neg: the sum of negative scores for the tokens in the tweet
  sum_subj: the subjectivity polarity, i.e. the sum of the positive and negative scores
  sum_Maxpos: the maximum positive score observed for tokens in the tweet
  sum_Maxneg: the maximum negative score observed for tokens in the tweet
  count_C_i: the total occurrences of words belonging to the i-th word class C_i, where word classes are defined by the LIWC and NRC Emotion Lexicon taxonomies
The results are very encouraging: even if far from the optimum, the system differs by only 3.29 points from the top-ranked one (F=64.84). Furthermore, we observe that even though our system is trained only on tweets, it is able to generalize to datasets from other domains, such as SMS and other micro-blogging services (i.e., LiveJournal). Conversely, the system performance drops on the Twitter 2014 Sarcasm set. This is consistent with results observed in our previous study on Italian tweets (Basile and Novielli, 2014), where 43% of misclassified negative cases were mostly ironic and would require common-sense reasoning to detect the negative opinion expressed. Moreover, a drop in performance on the sarcasm test set had already been observed for systems participating in the previous edition of the task (Rosenthal et al., 2014) and can be observed for all systems in the current edition. However, our system had a greater-than-average performance drop and we are currently studying this issue.
Observing the detailed scores for each class (first row of Table 4) we discover that the system performs better in the recognition of positive and neutral cases, in contrast with previous evidence from the experiment on the Italian corpus.
To further investigate the predictive power of the features in our model, we perform an ablation test on the Twitter 2015 test set, for which organizers provided the gold standard. We remove each group of features in turn to assess the decrease in F measure on the test data with respect to the setting including all features. Results are reported in Table 4 and demonstrate the importance of all feature groups.
Removing the sentiment lexicon group of features causes the highest decrease in performance. This is in contrast with previous evidence from our experiment on the Italian dataset of tweets, where a performance drop of only 1% was observed. A possible explanation is that only one sentiment lexicon was adopted in the study on the Italian dataset. On the contrary, in the current experiment on English tweets we can rely on a richer set of features due to the availability of numerous lexicons, as explained in Section 2. Moreover, the Sentiment140 Lexicon and the Hashtag Sentiment Lexicon are both developed specifically to address sentiment analysis of tweets, thus providing higher coverage of the lexical cues that are typical of micro-blogging.
Keyword and micro-blogging features are the second most useful group. This is consistent with evidence from the Italian experiment, where we observed a comparable drop in performance on the polarity detection task. However, in the current experiment we also consider n-grams, which are not included in the feature set of the system for Italian. This suggests that n-grams might contribute differently to the performance of sentiment classifiers depending on the language, indicating directions for further investigation.
Finally, semantic features lead to the smallest drop in F measure when removed (-0.69%). This is in contrast with our previous findings in the Italian setting, where the semantic features played a key role. This might be due to the prevalence of political topics in the Italian dataset, possibly causing a bias in our classifier due to the domain-specific lexicon about politics. This discrepancy indicates further directions for future investigation of the ability of semantic features to disambiguate polarity in micro-blogging, with respect to the topic being discussed and the language being used.
Future replications of this study will involve further data to validate and generalize our findings.