ELiRF: A SVM Approach for SA tasks in Twitter at SemEval-2015

This paper describes our participation in Task 10 (subtask B, Message Polarity Classification) and Task 11 (Sentiment Analysis of Figurative Language in Twitter) of SemEval-2015. We describe the Support Vector Machine system we used in this competition and the feature set that our models take into account. Finally, we report the results we obtained in this competition and draw some conclusions.


Introduction
Nowadays social media, such as Twitter, produce a vast amount of information that poses new challenges in the fields of Machine Learning (ML) and Natural Language Processing (NLP). Twitter is a micro-blogging service which, according to the latest statistics, has 284 million active users, 77% of them outside the US, who generate 500 million tweets a day in 35 different languages. That means about 5,700 tweets per second, with peaks of activity of 43,000 per second. These numbers justify the great interest in the automatic processing of this information. The study (Analytics, 2009) estimates that 50.9% of tweets carry some useful information that is capable of mobilizing opinions on the Internet and also in the real world. Therefore, the opinions of social media users have great strategic value for different organizations.
Our work focuses on automatically identifying the prevailing sentiment in a tweet using ML and NLP techniques. We developed a system for determining the polarity of tweets for Tasks 10B and 11 at the SemEval-2015 competition. The aim of Task 10 (subtask B) (Rosenthal et al., 2015) is to classify tweets as positive, negative, or neutral. In Task 11 (Ghosh et al., 2015) we had to deal with figurative language and assign to each tweet a polarity score in the range [-5..5] that represents the degree of the sentiment. Because of this last requirement, we formalized this task as a regression problem.
Our approach to both tasks shares several points. The preprocessing and feature extraction processes were similar for both corpora. We considered some common problems that arise when dealing with text from social media, and in particular from Twitter: short texts, slang, and peculiarities of the language (hashtags, retweets, user mentions, etc.). We represented the extracted features using a bag of n-grams. We used the Support Vector Machine (SVM) formalism because of its ability to handle large feature spaces and to determine the relevant features. Task 10B was treated as a classification problem and modeled by means of SVM classifiers. For Task 11 we used SVM regression, due to the granularity of the scores. Both tasks were solved using supervised techniques: our systems learned from the training sets supplied by the SemEval organization. We also used external resources such as polarity dictionaries.
The rest of this paper is organized as follows. In Section 2, we briefly present some relevant works related to these tasks. In Section 3, we describe the main features of the corpora used. In Section 4, we present the system we developed to solve these tasks. Section 5 presents our experimental work and the results we obtained for the SemEval tasks. Finally, in Section 6, we share some conclusions from our work and possible future directions.

Related Work
Sentiment Analysis has been widely studied in the last decade in multiple domains. Most work focuses on classifying the polarity of the texts as positive, negative, mixed, or neutral. The pioneering works in this field used supervised (Pang et al., 2002) or unsupervised (knowledge-based) (Turney, 2002) approaches. In (Pang et al., 2002), the performance of different classifiers on movie reviews was evaluated. In (Turney, 2002), some patterns containing POS information were used to identify subjective sentences in reviews to then estimate their semantic orientation.
In (Pang and Lee, 2008) we can find a comprehensive study of the different techniques used to identify the polarity of a text. Many efforts have been made to transfer this knowledge to language extracted from social media. In the literature we can find recent attempts to solve this problem using different machine learning approaches such as SVM, Maximum Entropy, Naive Bayes, etc. (Barbosa and Feng, 2010; O'Connor et al., 2010a; Zhu et al., 2014). At best, these works achieve an F1-score close to 70%, so there is still room to improve on the proposed systems. The construction of polarity lexicons is another widely explored field of research. Opinion lexicons have been obtained for English (Liu et al., 2005; Wilson et al., 2005) and also for Spanish (Perez-Rosas et al., 2012). A good presentation of the SA problem and a description of the state of the art of the most relevant approaches to SA can be found in (Liu, 2012).
Research on SA on Twitter is much more recent. Twitter appeared in 2006, and the early works in this field date from 2009, when Twitter started to become popular. Some of the most significant works are (Barbosa and Feng, 2010), (Jansen et al., 2009), and (O'Connor et al., 2010b). A survey of the most relevant approaches to SA on Twitter can be found in (Vinodhini and Chandrasekaran, 2012). The SemEval competition has also dedicated specific tasks to SA on Twitter (Rosenthal et al., 2014a,b), which shows the great interest of the scientific community in this field. The TASS workshop has proposed different tasks for SA focused on the Spanish language (Villena-Román and García-Morera, 2013; Villena-Román et al., 2014). In this paper, we have included some ideas that we used in previous works in the context of SA tasks at the TASS competition for Spanish (Hurtado, 2013, 2014b,a).

Corpus Description
In this section, we describe the main features of the SemEval-2015 corpora used in Tasks 10B and 11, respectively.

Task 10 B
The corpus supplied by the SemEval-2015 organization is composed of 7,236 tweets for training, 1,242 tweets for tuning (development set) and 2,880 tweets for test-time development, composed of part of the corpora used in the SemEval-2013 edition. The test corpus has an official test set with 2,390 tweets and a progress test set with 8,987 tweets. Figure 1 plots the polarity distribution over the training, tuning and test-time development corpora. On average, 16.53% of the tweets are negative, 45.75% are neutral and 37.72% are positive. After removing stop words, the vocabulary of the training corpus has 25,973 words, that of the development corpus has 6,700 words and that of the test-time development corpus has 13,672 words. We found that 57.57% of the words in the test-time development corpus were never seen in training.
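The unseen-word percentage reported above is a set difference over vocabularies. The following is a minimal sketch of that computation, assuming whitespace tokenization and lowercasing; the function names, the toy tweets and the stop-word list are hypothetical and are not the tools or data used for the paper.

```python
def vocabulary(tweets, stop_words=frozenset()):
    """Return the set of lowercased tokens in `tweets`, minus stop words."""
    vocab = set()
    for tweet in tweets:
        for token in tweet.lower().split():
            if token not in stop_words:
                vocab.add(token)
    return vocab


def unseen_rate(train_tweets, other_tweets, stop_words=frozenset()):
    """Fraction of the vocabulary of `other_tweets` that is absent from training."""
    train_vocab = vocabulary(train_tweets, stop_words)
    other_vocab = vocabulary(other_tweets, stop_words)
    if not other_vocab:
        return 0.0
    return len(other_vocab - train_vocab) / len(other_vocab)


# Toy data, invented for illustration only.
train = ["good morning twitter", "what a good day"]
dev = ["good night twitter", "bad day"]
rate = unseen_rate(train, dev, stop_words={"a", "what"})
```

The same two functions can measure lexicon coverage by passing the lexicon entries in place of the training vocabulary.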
We studied the Zipf distribution of the words in the training, tuning and test-time development corpora and found that words with fewer synsets, that is, less ambiguous words, appear more frequently. We used this information in the normalization of the SentiWordNet lexicon. Since we used lexicons as features for training our systems, it is important to know the percentage of words in the corpus that appear in these lexicons. It is noteworthy that only 19.98% of the training tweets and 20.31% of the tweets in the test-time development set have hashtags. Users tag the content of their tweets with hashtags, so their meaning may be relevant when classifying a tweet. However, hashtags often concatenate several words, and segmenting these words is a problem in itself.

Task 11
The Task 11 corpus is similar to the previous one, but its main feature is that it contains figurative language such as irony and affective metaphor. This kind of language increases the complexity of the task. This task also requires a much more fine-grained polarity identification. Two corpora were provided to address this task.
The trial and training corpora share some tweets. We had 7,135 unique tweets to train and tune our systems. The corpus has 22,227 words after removing stop words. Table 2 shows the percentage of words from the Task 11 corpus that we could find in the lexicons. As with the Task 10 vocabulary, only a small percentage of the vocabulary has a polarity score. As expected, the corpus for this task contains a lot of figurative language. If we assume that Twitter users tag their tweets semantically using hashtags, and that tags such as #irony or #sarcasm indicate the presence of figurative text, then at least 46.22% of the corpus contains figurative text. This was the only knowledge we added to deal with Task 11 beyond the knowledge used in Task 10. Finally, a remarkable 85.58% of the tweets have at least one hashtag. Therefore, these features will be relevant in our classification system.

Our System
In this section we describe the main features of the system developed for the SemEval tasks. We determined the baseline for both tasks by selecting the most probable class in the training set. In Task 10B the baseline obtained an F1-score of 26.49%, a precision of 43.61% and a recall of 43.61%. In Task 11 it obtained an F1-score of 19.53%, a precision of 36.51% and a recall of 36.51%. After studying the corpus, we trained and tuned different classifiers using features extracted from the text and from the lexicons. We used 10-fold cross-validation to tune the SVM models.
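A most-probable-class baseline of this kind can be sketched as follows. This is only an illustration under stated assumptions: the labels are toy values, the helper names are hypothetical, and the sketch reports plain accuracy because the averaging scheme behind the precision, recall and F1 figures is not specified here.

```python
from collections import Counter


def majority_class(labels):
    """Return the most frequent label in the training data."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(gold, predicted):
    """Fraction of positions where the prediction matches the gold label."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy labels, invented for illustration only.
train_labels = ["neutral", "positive", "neutral", "negative", "positive", "neutral"]
test_labels = ["positive", "neutral", "negative", "neutral"]

# The baseline ignores the tweet text and always predicts the same label.
baseline_label = majority_class(train_labels)
baseline_predictions = [baseline_label] * len(test_labels)
baseline_accuracy = accuracy(test_labels, baseline_predictions)
```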

Feature Extraction
We selected the best set of features in order to solve each task. The best features considered were:
N-grams: We used a bag-of-words approach to represent each tweet as a feature vector that contains the tf-idf weights of the selected features of the training set. After tokenizing the tweet and deleting its stop words, we extract character n-grams. We considered two approaches: extracting all n-grams, including those that span word boundaries, or extracting only n-grams within words. In Task 10 we used 1-grams to 6-grams and vectorized them using tf-idf coefficients. In Task 11 we used the same approach with 3-grams to 9-grams.
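The two character n-gram variants and their tf-idf weighting can be sketched as below. This is a hedged illustration, not the system's code: the function names are hypothetical, and the idf formula `log(N / df) + 1` is an assumption, since the exact tf-idf variant is not stated in the text.

```python
import math
from collections import Counter


def char_ngrams(text, n_min, n_max, within_words=False):
    """Extract character n-grams of length n_min..n_max.

    within_words=False lets n-grams span word boundaries (spaces included);
    within_words=True extracts n-grams inside each word only.
    """
    units = text.split() if within_words else [text]
    grams = []
    for unit in units:
        for n in range(n_min, n_max + 1):
            for i in range(len(unit) - n + 1):
                grams.append(unit[i:i + n])
    return grams


def tfidf_vectors(documents, n_min, n_max, within_words=False):
    """Return one sparse {n-gram: weight} dict per document."""
    counts = [Counter(char_ngrams(d, n_min, n_max, within_words)) for d in documents]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    n_docs = len(documents)
    vectors = []
    for c in counts:
        # Assumed weighting: raw term frequency times (log(N / df) + 1).
        vectors.append({g: tf * (math.log(n_docs / doc_freq[g]) + 1.0)
                        for g, tf in c.items()})
    return vectors


spanning = char_ngrams("ab cd", 2, 3)
inside = char_ngrams("ab cd", 2, 3, within_words=True)
vectors = tfidf_vectors(["aa", "ab"], 1, 1)
```

Under these assumptions, the Task 10 setting would correspond to `n_min=1, n_max=6` and the Task 11 setting to `n_min=3, n_max=9`.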
Negation: We need to deal with negation to predict polarity correctly. Thus, we label every word in a negation context. We assume that a negation context begins with a negation word such as "never", "no", "nothing", "none", . . . , and ends with a punctuation mark, following the approach of (Pang et al., 2002). We used this strategy only in Task 10. After labeling the negation contexts, our system extracted the n-grams from the labeled tweets.
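The negation labeling described above can be sketched as a single pass over the tokens. This is a minimal sketch under assumptions: the negation list contains only the four words quoted in the text (the full list is not given), and the `_NEG` tag and the punctuation set are hypothetical choices made for illustration.

```python
import re

# Only the negation words quoted in the text; the full list is not given.
NEGATION_WORDS = {"never", "no", "nothing", "none"}

# Assumption: a token made only of these marks closes the negation context.
PUNCTUATION = re.compile(r"^[.,:;!?]+$")


def mark_negation(tokens, suffix="_NEG"):
    """Tag every token between a negation word and the next punctuation mark."""
    marked = []
    in_scope = False
    for token in tokens:
        if PUNCTUATION.match(token):
            in_scope = False          # punctuation ends the negation context
            marked.append(token)
        elif in_scope:
            marked.append(token + suffix)
        else:
            marked.append(token)
            if token.lower() in NEGATION_WORDS:
                in_scope = True       # the context starts after this word
    return marked


tokens = ["i", "never", "liked", "this", "phone", ",", "but", "ok"]
labeled = mark_negation(tokens)
```

Because the tag becomes part of the token, n-grams extracted afterwards differ inside and outside a negation context.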
Lexicons: In order to use the lexicons, tweets are tokenized, stop words are removed and all the tokens are converted to lowercase. We applied five lexicons.
1. Pattern (De Smedt and Daelemans, 2012): Given a tweet this lexicon will return a score with the polarity and another one with the objectivity.
2. Afinn-111 (Hansen et al., 2011): This lexicon has a set of words tagged with a score. We sum the polarity of every word in a tweet to get a score for the whole tweet: $\sum_{w \in W} \mathrm{Afinn}(w)$, where $W$ is the set of words in the tweet.
3. Jeffrey (Hu and Liu, 2004): This lexicon has two sets of words: a positive and a negative word set. We obtained two scores from this lexicon. The first score is the count of positive words and the second one is the count of negative words.
5. SentiWordNet (Baccianella et al., 2010): In this lexicon each word can belong to multiple sets of meanings (synsets S); therefore, we normalize the score of a word by its number of meanings. This lexicon provides three scores (positive, negative and objective), and we used all three.
Features from Twitter: We count the number of hashtags, retweets, mentions and URLs for each tweet. Some hashtags, such as #irony, #sarcasm or #not, are useful for identifying the presence of figurative text in a tweet. We count the number of these hashtags as a feature.
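The lexicon scores described for Afinn-111, the Hu and Liu word sets and SentiWordNet can be sketched as follows. The lexicon entries below are toy values invented for illustration, the function names are hypothetical, and averaging over synsets is one possible reading of "normalize by its number of meanings", not a confirmed detail of the system.

```python
# Toy lexicons, invented for illustration; they are not the real resources.
TOY_AFINN = {"good": 3, "bad": -3}
TOY_POSITIVE = {"good", "great"}
TOY_NEGATIVE = {"bad", "awful"}
# word -> one (positive, negative, objective) triple per synset
TOY_SWN = {
    "good": [(0.75, 0.0, 0.25), (0.5, 0.0, 0.5)],
    "bad": [(0.0, 0.625, 0.375)],
}


def afinn_score(tokens, lexicon):
    """Sum of the word scores; words missing from the lexicon contribute 0."""
    return sum(lexicon.get(t, 0) for t in tokens)


def count_scores(tokens, positive, negative):
    """Return (number of positive words, number of negative words)."""
    return (sum(t in positive for t in tokens),
            sum(t in negative for t in tokens))


def swn_scores(tokens, lexicon):
    """Sum per-word scores, each averaged over the word's synsets (assumption)."""
    totals = [0.0, 0.0, 0.0]
    for t in tokens:
        synsets = lexicon.get(t)
        if not synsets:
            continue
        for k in range(3):
            totals[k] += sum(s[k] for s in synsets) / len(synsets)
    return tuple(totals)


tokens = ["good", "but", "bad"]
afinn = afinn_score(tokens, TOY_AFINN)
pos_count, neg_count = count_scores(tokens, TOY_POSITIVE, TOY_NEGATIVE)
swn = swn_scores(tokens, TOY_SWN)
```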
Encoding: We count the number of capitalized words and the number of words with elongated characters.
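The Twitter-specific and encoding counts can be sketched with simple token tests. Everything here is an illustrative assumption: whitespace tokenization, "capitalized" read as all-uppercase words of more than one letter (excluding the retweet marker), "elongated" read as a letter repeated three or more times, and a figurative hashtag set limited to the three tags named in the text.

```python
import re

# Only the hashtags named in the text; the full list is not given.
FIGURATIVE_HASHTAGS = {"#irony", "#sarcasm", "#not"}

# Assumption: a word is elongated if some character repeats 3+ times in a row.
ELONGATED = re.compile(r"(\w)\1{2,}")


def surface_features(tweet):
    """Count Twitter-specific and encoding cues in a whitespace-tokenized tweet."""
    tokens = tweet.split()
    hashtags = [t for t in tokens if t.startswith("#") and len(t) > 1]
    return {
        "hashtags": len(hashtags),
        "figurative_hashtags": sum(h.lower() in FIGURATIVE_HASHTAGS for h in hashtags),
        "mentions": sum(t.startswith("@") and len(t) > 1 for t in tokens),
        "urls": sum(t.lower().startswith(("http://", "https://")) for t in tokens),
        "retweets": sum(t == "RT" for t in tokens),
        "capitalized": sum(t.isalpha() and t.isupper() and len(t) > 1 and t != "RT"
                           for t in tokens),
        "elongated": sum(t.isalpha() and bool(ELONGATED.search(t)) for t in tokens),
    }


features = surface_features(
    "RT @user: I LOVE Mondays sooooo much #sarcasm #not http://t.co/x"
)
```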
We also tried other sets of features, such as POS tags, word n-grams and a binary bag of words, as well as different combinations of features in order to optimize the system.

Classification
We classified tweets using an SVM approach. In Task 10B we used a linear kernel for classification, and in Task 11 we also used a linear kernel, for regression. The feature selection process was performed in Task 10 using the development corpus and in Task 11 using 10-fold cross-validation on the training set. We selected the set of features that optimized the accuracy of the system on the development set. We used the scikit-learn toolkit (Pedregosa et al., 2011), and we developed a framework to define functional classification models. These models include preprocessing, mining, feature vectorization, and classification functions. The framework receives 1 to N models. A tweet is classified using the most voted category or, in the case of regression, using the mean of the predictions.
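The combination step of the framework can be sketched as follows: the most voted category for classification and the mean of the predictions for regression. The models are represented here as plain callables, which is a hypothetical stand-in for the trained SVM pipelines, and the sketch does not define a tie-breaking policy for votes.

```python
from collections import Counter
from statistics import mean


def combine_votes(predictions):
    """Return the most voted category among the model outputs."""
    return Counter(predictions).most_common(1)[0][0]


def combine_scores(predictions):
    """Return the mean of the model outputs (regression case)."""
    return mean(predictions)


def ensemble_predict(models, tweet, regression=False):
    """Apply 1..N models to a tweet and combine their predictions."""
    outputs = [model(tweet) for model in models]
    return combine_scores(outputs) if regression else combine_votes(outputs)


# Hypothetical stand-ins for trained models: each maps a tweet to a prediction.
classifiers = [lambda t: "positive", lambda t: "negative", lambda t: "positive"]
regressors = [lambda t: -2.0, lambda t: -3.0, lambda t: -4.0]

label = ensemble_predict(classifiers, "some tweet")
score = ensemble_predict(regressors, "some tweet", regression=True)
```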

Experiments
We tested a set of configurations in order to obtain a competitive classifier. In this section, we present only the systems that achieved the best performance during development. We submitted only the best system to the SemEval-2015 competition.
Each one of them was trained with this set of features:
• 1-grams to 6-grams of characters from the tweet.
• A lexicon. Each SVM has its own lexicon.
Then we used a majority voting system to combine these classifiers. For the competition we submitted model 2, which achieved the best performance in the development phase. Table 4 shows its evaluation performance. Forty teams participated in this task. Our system achieved the 24th position in the official ranking and the 35th position in the progress test.

Task 11
Our best model for this task was trained using these features:
• 3-grams to 9-grams of characters from the tweet.
• Features extracted from Twitter including the number of figurative hashtags.
We selected this set of features by cross-validation. We tuned our system using the official measure, the cosine similarity. Table 5 shows the official results of our system in Task 11. We achieved the 5th position in the ranking. Our system obtained the first position in detecting sarcasm. We achieved a cosine similarity of 0.918. For non-figurative language, our system performed worse, obtaining the 8th position in the ranking. We think this is because the training corpus lacks non-figurative tweets, so our system was not able to learn this class properly.
The mean squared error (MSE) metric was also considered by the Task 11 organizers. Table 6 shows the results achieved using this metric. We obtained worse results because we did not tune the system for this metric.
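Both Task 11 measures compare the vector of predicted scores with the vector of gold scores. The sketch below uses the standard definitions of cosine similarity and mean squared error; the score vectors are toy values, and the official scorer may differ in details such as how missing predictions are handled.

```python
import math


def cosine_similarity(gold, predicted):
    """Cosine of the angle between the gold and predicted score vectors."""
    dot = sum(g * p for g, p in zip(gold, predicted))
    norm_gold = math.sqrt(sum(g * g for g in gold))
    norm_pred = math.sqrt(sum(p * p for p in predicted))
    if norm_gold == 0 or norm_pred == 0:
        return 0.0
    return dot / (norm_gold * norm_pred)


def mean_squared_error(gold, predicted):
    """Average of the squared differences between gold and predicted scores."""
    return sum((g - p) ** 2 for g, p in zip(gold, predicted)) / len(gold)


# Toy scores in the range [-5..5], invented for illustration only.
gold_scores = [-3, -2, 1, 4]
predicted_scores = [-2, -2, 0, 3]

cosine = cosine_similarity(gold_scores, predicted_scores)
mse = mean_squared_error(gold_scores, predicted_scores)
```

The two measures reward different things: cosine similarity is insensitive to a uniform rescaling of the predictions, whereas MSE penalizes every absolute deviation, so a system tuned for one is not necessarily good on the other.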

Conclusions
We have presented a system for Tasks 10B and 11 at SemEval-2015. We used a machine learning approach based on the SVM formalism for both tasks. We handled both tasks uniformly with regard to preprocessing, feature extraction and feature representation. We have not included any knowledge about the tasks other than the resources used, that is, corpora and dictionaries. In this respect, our system should be easy to adapt to other SA tasks and to other languages for which these kinds of resources exist.
Although we did not include any external knowledge, we plan to study the impact of including external resources to improve our system. Moreover, we are also interested in extending existing Twitter-based corpora in order to increase the accuracy of the machine learning system.