IOA: Improving SVM-Based Sentiment Classification Through Post-Processing

This paper describes our systems for expression-level and message-level sentiment analysis, two subtasks of SemEval-2015 Task 10 on sentiment analysis in Twitter. We first built baseline systems for the two subtasks using SVM with a variety of features, then improved them through model iteration and probability-output weighting respectively. Our submissions ranked 3rd and 2nd among eleven teams on the 2015 test set and the progress test set in subtask A, and 7th and 4th among 40 teams on the two test sets in subtask B.


Introduction
Recently sentiment analysis has become one of the most popular research topics in the natural language processing community, mainly due to the exponential growth of social media data replete with subjective information. The once neglected topic has spurred immense interest from both academia and industry. Many approaches have been proposed for sentiment analysis in customer reviews, blogs and microblogs (for good reviews, see (Pang and Lee, 2008; Liu, 2012; Kiritchenko et al., 2014)). These approaches can be roughly divided into two categories. One comprises knowledge-intensive or rule-based approaches, e.g., (Taboada et al., 2011; Reckman et al., 2013). Such approaches can achieve reasonably good results when tailored to a specific domain, but their maintainability and cross-domain portability are usually weak. The other comprises data-intensive or machine-learning approaches, which learn to analyse sentiment from data; this is currently the predominant line of work, including supervised learning, deep learning, etc. Sentiment analysis is often cast as a classification task. Widely used classifiers include Support Vector Machines (SVM), Maximum Entropy models (MaxEnt), and naive Bayes classifiers. Common features include word/character n-grams and sentiment lexicons, among others. Key research issues for learning approaches include feature engineering, model selection, and ensemble learning.
SemEval-2015 Task 10 (Rosenthal et al., 2015) is a sequel to the two tasks on sentiment analysis in Twitter from the past two years (Nakov et al., 2013; Rosenthal et al., 2014). These tasks have provided freely available, annotated corpora as a common testbed and significantly promoted sentiment analysis of tweet-like short and informal texts. The same metric, i.e., the average F1 score of the positive and negative classes, is used to measure performance. This year, however, there are some changes. Besides the classical expression-level (A) and message-level (B) subtasks, three more subtasks are added: subtask C, topic-based message polarity classification; subtask D, detecting trends towards a topic; and subtask E, determining the strength of association of Twitter terms with positive sentiment. The organisers make no distinction between constrained and unconstrained systems, which means participants may use any additional data, provided it is described in the submission form.
We submitted systems only for the expression-level and message-level subtasks. In this paper, we provide some details behind the systems.

Our System
Our systems are built with an SVM classifier using various features and resources, including sentiment lexicons and word vectors. To further improve the performance, we use model iteration and probability-output weighting.

Resources
The resources used in our system are as follows. Labeled training and test data: Although the organisers make no distinction between constrained and unconstrained systems, it is not easy to make additional data effective (Rosenthal et al., 2014), so we used only the provided labeled data. However, since we did not participate in the past two evaluations, we could not obtain the full labeled data because some tweets are no longer available; we crawled as much data as possible using the provided script. Table 1 shows the sizes of the labeled data and test data we obtained. The 2015 test data was released directly, with results to be submitted within one week. We take the training data and development data as our training data; the test data from the previous years can be used for tuning parameters (but NOT for training).
Sentiment Lexicons and Word Embedding: As many researchers have shown, e.g., (Mohammad et al., 2013), sentiment lexicons play an important role in sentiment analysis. Our system uses seven sentiment lexicons: the Hashtag Sentiment lexicon, the Sentiment140 lexicon (Mohammad and Turney, 2010), the MPQA lexicon (Wilson et al., 2005), the Bing Liu lexicon (Hu and Liu, 2004), AFINN-111 (Nielsen, 2011), SentiWordNet (Baccianella et al., 2010) and the Hedonometer lexicon 1. In addition, as word embeddings have produced promising results in various NLP applications, we use sentiment-specific word embeddings (Tang et al., 2014) in our system. LibSVM: We used the LibSVM package (Chang and Lin, 2011) to construct the classification models for both subtasks.
CMU Tweet NLP: An open resource (Owoputi et al., 2013) for analysing tweets, used here for tokenisation, POS tagging and cluster features.

Preprocessing
The main preprocessing steps are the following:
• All upper-case letters are converted to lower case
• URLs and user names are replaced with the strings 'http://someurl' and '@someuser' respectively
• Tweets are tokenised and labeled with part-of-speech tags using the Carnegie Mellon University (CMU) tool (Owoputi et al., 2013)
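As a rough illustration, the normalisation steps above can be sketched in Python; the regular expressions and the final whitespace split are simplifications of our own, since the actual system relies on the CMU tokeniser:

```python
import re

def preprocess(tweet):
    """Normalise a raw tweet before feature extraction (sketch)."""
    tweet = tweet.lower()                                      # lower-case all letters
    tweet = re.sub(r"https?://\S+", "http://someurl", tweet)   # normalise URLs
    tweet = re.sub(r"@\w+", "@someuser", tweet)                # normalise user mentions
    return tweet.split()                                       # placeholder for CMU tokenisation

preprocess("Check THIS out http://t.co/abc123 @JohnDoe :)")
# -> ['check', 'this', 'out', 'http://someurl', '@someuser', ':)']
```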

Features
After preprocessing, each tweet is represented as a feature vector made up of a subset of the following features; the features used in each subtask are shown in Table 2.
• Word N-grams: Binary values for contiguous n-grams of 1, 2, 3, and 4 tokens and for non-contiguous n-grams (n = 3, 4). Non-contiguous n-grams are those whose intermediate tokens are replaced with a special symbol such as '*'. For example, the 4-gram "I * * guys" is the non-contiguous counterpart of the contiguous 4-gram "I love you guys".
• Character N-grams: Although character n-grams have been used in sentiment analysis by many researchers, we found these features ineffective for subtask B, so they are only used in subtask A. They are binary values over the two- and three-character prefixes and suffixes of tokens.
• POS: Ten features derived from POS tagging: the counts of interjections, adverbs, prepositions, articles, verbs, punctuation marks, nouns, pronouns, adjectives and hashtags in a tweet.
• Clusters: Every token in a tweet is mapped to one of the Twitter word clusters produced by the CMU tool (Owoputi et al., 2013). The extracted features are a boolean vector indicating the presence or absence in the tweet of each of the 1,000 clusters, which were generated from about 56 million tweets.
• Word Vector: Words are represented as 50-dimensional vectors. We then apply min, average and max functions to convert the embeddings into fixed-length features, similar to the pooling technique used in CNNs, to obtain a tweet vector representation. This adds another three sets of features.
• Negation: A binary value indicating negated contexts. The "_NEG" suffix is appended to tokens that fall within a negation scope, which starts with a negation word and ends with certain punctuation marks 2 .
• Lexicons: Each token in a tweet that appears in one of the sentiment lexicons of section 2.1 is mapped to its score there. For lexicons without numeric scores, positive entries are set to +1 and negative entries to -1; all other tokens score zero. A tweet is then represented by its total score, maximal score, minimal score, negative score, the score of the last word with a non-zero score, and the count of tokens with a non-negative score.
2 http://sentiment.christopherpotts.net/lingstruc.html#negation
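To make the n-gram, negation and lexicon features above concrete, here is a minimal Python sketch; the negation-word and punctuation sets are small illustrative subsets, and the function names are ours, not the system's:

```python
def ngrams(tokens, n):
    """Contiguous n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def noncontiguous_ngrams(tokens, n):
    """n-grams (n >= 3) with the intermediate tokens replaced by '*'."""
    return [(g[0],) + ("*",) * (n - 2) + (g[-1],) for g in ngrams(tokens, n)]

NEGATORS = {"not", "no", "never", "n't"}    # illustrative subset
SCOPE_END = {".", ",", "!", "?", ";", ":"}  # illustrative subset

def mark_negation(tokens):
    """Append '_NEG' to tokens inside a negation scope."""
    out, in_scope = [], False
    for t in tokens:
        if t in SCOPE_END:
            in_scope = False          # punctuation closes the scope
            out.append(t)
        elif in_scope:
            out.append(t + "_NEG")
        else:
            out.append(t)
            if t in NEGATORS:
                in_scope = True       # negation word opens a scope
    return out

def lexicon_features(tokens, lexicon):
    """Summary statistics over per-token lexicon scores."""
    scores = [lexicon.get(t, 0.0) for t in tokens]
    nonzero = [s for s in scores if s != 0.0]
    return {
        "total": sum(scores),
        "max": max(scores),
        "min": min(scores),
        "negative": sum(s for s in scores if s < 0),
        "last_nonzero": nonzero[-1] if nonzero else 0.0,
        "count_nonneg": sum(1 for s in scores if s >= 0),
    }

# "i love you guys" -> non-contiguous 4-gram ("i", "*", "*", "guys")
```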

Training
SVM is used as the classifier in our systems with the features described in section 2.3. We trained the SVM on the labeled tweets with an RBF kernel and tuned the parameters on the dev dataset. For both subtasks, we tuned the parameters for the Twitter2015 test data using the Twitter2013 and Twitter2014 test data as the dev dataset, and tuned the parameters for the progress2015 test data using all the previous test data as the dev dataset. The parameters were tuned to maximise the average F1 score of the positive and negative classes using brute-force grid search.
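The tuning loop amounts to a plain brute-force grid search over the RBF-kernel parameters; in this sketch, `train_fn` and `score_fn` are hypothetical stand-ins for the LibSVM training and dev-set evaluation calls, and the parameter grids are illustrative, not the ones we actually searched:

```python
from itertools import product

def grid_search(train_fn, score_fn, Cs, gammas):
    """Brute-force search over (C, gamma) for an RBF-kernel SVM.

    train_fn(C, gamma) is assumed to train and return a model;
    score_fn(model) to return the average F1 of the positive and
    negative classes on the dev data.
    """
    best_params, best_score = None, float("-inf")
    for C, gamma in product(Cs, gammas):
        score = score_fn(train_fn(C, gamma))
        if score > best_score:
            best_params, best_score = (C, gamma), score
    return best_params, best_score

# toy run: a surrogate score that peaks at C = 8, gamma = 0.125
params, score = grid_search(
    train_fn=lambda C, g: (C, g),
    score_fn=lambda m: -abs(m[0] - 8) - abs(m[1] - 0.125),
    Cs=[2 ** k for k in range(-2, 6)],
    gammas=[2 ** k for k in range(-5, 1)],
)
# params == (8, 0.125)
```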

Post-processing
We tried different strategies for the two subtasks. For subtask A, we adopted the model iteration approach described in Algorithm 1. For subtask B, we used probability-output weighting to adapt the RBF-kernel SVM model to the data set, similar to (Miura et al., 2014).

Model iteration for expression-level subtask
It has been found that utilising more external data does not improve performance as expected, because of differences in data sources and annotation methods (Rosenthal et al., 2014). So we tried a model iteration approach instead: test instances labeled with high confidence are added to the training data, and a new model is retrained. The algorithm for subtask A is given in Algorithm 1 and the experimental results in section 3.1. Algorithm 1 takes the maximum number of iterations I as input and outputs, for each instance x ∈ T, the probability p(c|x) and the label l(x) ∈ C.
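The iteration scheme described above amounts to a self-training loop, sketched here with placeholder components; `fit`, `predict_proba`, the confidence threshold `p` and the iteration cap stand in for the LibSVM calls and our actual tuned values:

```python
def model_iteration(train, test, fit, predict_proba, p=0.9, max_iters=2):
    """Self-training loop in the spirit of Algorithm 1.

    fit(data) -> model and predict_proba(model, x) -> (label, prob)
    are assumed interfaces; p=0.9 and max_iters=2 are placeholders.
    """
    train, remaining, labels = list(train), list(test), {}
    for _ in range(max_iters):
        model = fit(train)
        confident, still_remaining = [], []
        for x in remaining:
            label, prob = predict_proba(model, x)
            labels[x] = label
            if prob >= p:
                confident.append((x, label))   # keep high-confidence pseudo-label
            else:
                still_remaining.append(x)
        if not confident:                      # nothing new to add: stop early
            break
        train.extend(confident)                # grow the training set
        remaining = still_remaining
    model = fit(train)                         # final model labels the rest
    for x in remaining:
        labels[x] = predict_proba(model, x)[0]
    return labels
```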

Probability output weighting for message-level subtask
We applied probability-output weighting (Miura et al., 2014) to our SVM and adapted it to subtask B. For a tweet x, the base model outputs a probability p(c|x) for each polarity c (c ∈ {pos, neg, neu}). A weighting factor w_c is introduced to adjust the probability output p(c|x). The system labels the tweet with the polarity c that maximises the product of w_c and p(c|x), namely arg max_c w_c × p(c|x). The weighting parameter w_c for each polarity was tuned to maximise accuracy using grid search on the corresponding dev data. The results can be seen in section 3.2.
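The decision rule itself is a one-liner; the probabilities and weights below are invented purely for illustration:

```python
def weighted_label(probs, weights):
    """Choose the polarity c maximising w_c * p(c|x)."""
    return max(probs, key=lambda c: weights[c] * probs[c])

probs = {"pos": 0.30, "neg": 0.25, "neu": 0.45}   # example SVM probability output
weights = {"pos": 1.6, "neg": 1.3, "neu": 1.0}    # hypothetical tuned weights
weighted_label(probs, weights)  # -> "pos" (the unweighted argmax would be "neu")
```

With uniform weights the rule reduces to the plain argmax over p(c|x); the tuned weights can flip borderline decisions, as in the example above.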

Experiments and Results
The official evaluation metric of the task is the average F1 score of the positive and negative classes. After the base training (section 2.4), we obtained the baseline results in the "baseline" columns of Table 4. We then focused on improving the systems for both subtasks; the resulting scores, improved or not, are shown in the "submitted" columns.
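For reference, the metric can be computed as follows; this is a straightforward sketch with a function name of our own choosing:

```python
def avg_f1_pos_neg(gold, pred):
    """Average F1 over the positive and negative classes only;
    neutral contributes no F1 of its own to the average, though
    neutral predictions still affect the other classes' counts."""
    def f1(cls):
        tp = sum(g == p == cls for g, p in zip(gold, pred))
        fp = sum(p == cls and g != cls for g, p in zip(gold, pred))
        fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return (f1("pos") + f1("neg")) / 2
```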

Subtask A: expression-level sentiment analysis
We built the system on 8,568 tweets (7,639 training tweets and 929 development tweets, described in section 2.1) using the features in section 2.3. After the release of the labeled test data, we reran the test data with the same model to compare performance under different values of the threshold parameter p referred to in section 2.5. The experimental results are given in Table 5.

Subtask B: message-level sentiment analysis
We adapted the probability-output weighting approach to subtask B. The experimental results show that weighting is effective for this subtask; the improvement obtained with the parameters in Table 3 can be seen in Table 4. The approach improves the Twitter F1 score but degrades performance on the Sarcasm data, possibly because the weights depend too heavily on the tuning data.

Experiment analysis
For subtask A, we stopped the iteration at i = 2. There is little improvement, for two reasons: (1) after each iteration, the number of new instances added to the training data for retraining is rather small; (2) once the classifier assigns a label with high confidence, the instance is very likely to be similar to existing training instances, so the added instances contribute little to classification.
In experiments after the submission, we tried interchanging the improvement methods between the subtasks, but both subtasks showed a slight decrease. When the model iteration approach was used for subtask B, we did not obtain the expected improvement. This may be because the performance on subtask B is lower than on subtask A, making it more likely that wrongly labeled samples are added to the training data. When the probability-output weighting approach was used for subtask A, we obtained only a limited improvement in the F1 score.

Conclusion
We described our systems for two subtasks of SemEval-2015 Task 10, Sentiment Analysis in Twitter. Our systems integrate a variety of features into an SVM as baselines and are then improved by model iteration and probability-output weighting for the expression-level and message-level subtasks respectively. We compared the results and analysed the reasons for the improvements. Our submissions ranked 3rd and 2nd among eleven teams on the 2015 test set and the progress test set in subtask A, and 7th and 4th among 40 teams on the two test sets in subtask B.