UCSC-NLP at SemEval-2017 Task 4: Sense n-grams for Sentiment Analysis in Twitter

This paper describes the system developed by the UCSC-NLP team and submitted to SemEval-2017 Task 4-A, Sentiment Analysis in Twitter. We studied how relationships between sense n-grams and sentiment polarities, i.e., co-occurrences of WordNet senses in a tweet and its polarity, can contribute to this task. Furthermore, we evaluated the effect of discarding a large set of character n-gram features reported in preceding works. Based on these elements, we developed an SVM system that exploits SentiWordNet as a polarity lexicon. It achieves an average F_1 = 0.624. Among 39 submissions to this task, we ranked 10th.


Introduction
Determining whether a text expresses a POSITIVE, NEGATIVE or NEUTRAL opinion has attracted increasing attention. In particular, sentiment classification of tweets has immediate applications in areas such as marketing, politics, and social analysis (Nakov et al., 2016). Different approaches have proven very promising for polarity classification of tweets, such as Convolutional Neural Networks trained with large amounts of data (Deriu et al., 2016).
Several authors have studied machine learning approaches based on lexicon, surface, and semantic features. The proposal of Mohammad et al. (2013), as well as an improved version by Zhu et al. (2014), shows very competitive scores.
The latter approach was re-implemented by Hagen et al. (2015) as part of an ensemble of Twitter polarity classifiers that was top-ranked in SemEval-2015 Task 10: Sentiment Analysis in Twitter. Our system proposes to enrich the set of features used by Mohammad et al. (2013). We describe here only the features most relevant for our experiments; further details on all features can be found in Mohammad et al. (2013) and Hagen et al. (2015).
• N-gram Based Features (WG and CG) Each 1- to 4-word n-gram present in the training corpus is associated with a binary feature that indicates whether the tweet includes that n-gram. For characters, all occurrences of 3- to 5-character n-grams are considered.
Given this definition, the number of generated n-gram features is variable and depends on the training corpus. In experiments with the SemEval 2017 training data, we obtained nearly three million features of this type, far more than the number of tweets.
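As an illustration, the WG and CG feature extraction can be sketched as follows. This is a minimal toy version, not the actual implementation; function names are ours.

```python
def word_ngrams(tokens, n_min=1, n_max=4):
    """All word n-grams of length 1 to 4 (the WG features)."""
    return {" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)}

def char_ngrams(text, n_min=3, n_max=5):
    """All character 3- to 5-grams (the CG features)."""
    return {text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)}

def binary_features(tokens, vocabulary):
    """Binary vector over a vocabulary collected from the training
    corpus: does the tweet contain each n-gram or not?"""
    grams = word_ngrams(tokens)
    return [1 if g in grams else 0 for g in vocabulary]
```

Since the vocabulary is the union of all n-grams seen in training, the dimensionality grows with the corpus, which is the source of the millions of features mentioned above.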

• Cluster Based Features (CB)
For each of the 1000 clusters identified by Owoputi et al. (2013) using the Brown algorithm (Brown et al., 1992), a feature indicates whether any term of the tweet belongs to that cluster. Mohammad et al. (2013) studied the effect of removing individual sets of features as well as whole groups of them. Their empirical results suggest that the lexicon and n-gram based features are the most important, since removing them causes the greatest drop in classifier efficacy, measured as the macro-averaged F-score on the test set.
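The CB features can be sketched as follows. The cluster map below is a toy stand-in; the real system uses the 1000 Brown clusters for Twitter released by Owoputi et al. (2013).

```python
# Hypothetical toy cluster map (word -> cluster id); the real map comes
# from the Brown clustering of Owoputi et al. (2013).
BROWN_CLUSTERS = {"happy": 17, "glad": 17, "sad": 42, "angry": 42}

def cluster_features(tokens, n_clusters=1000, clusters=BROWN_CLUSTERS):
    """One binary feature per cluster: 1 if any token of the tweet
    belongs to that cluster, 0 otherwise."""
    feats = [0] * n_clusters
    for tok in tokens:
        cid = clusters.get(tok.lower())
        if cid is not None:
            feats[cid] = 1
    return feats
```

Unlike the n-gram features, this yields a fixed-size vector of 1000 features regardless of the corpus.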
In this work, we studied how to reduce the number of generated features by removing some of the n-gram based ones. The next sections describe further details of our approach.

System Description
We trained a Support Vector Machine (SVM) as in (Mohammad et al., 2013; Zhu et al., 2014; Hagen et al., 2015). The SVM algorithm has proven very effective in sentiment analysis tasks. Moreover, to better assess the effect of removing or including new features, we decided to use the same classifier as the aforementioned authors.
In the first stage of our system, tweets were preprocessed as in Hagen et al. (2015). To avoid missing emoticon symbols, we ensure UTF-8 encoding in all stages. In addition, instead of detecting emoticons using a regular expression (http://sentiment.christopherpotts.net/tokenizing.html), we use the tag provided by the CMU POS-tagging tool. In our case, negation was not considered when generating the word n-gram features.

New Predictor Features
We aim to explore the relation between the polarity of a tweet and the presence of certain sense combinations in its text. Due to synonymy, two semantically equivalent tweets can lead to very different word n-grams, while their sense n-grams could be the same.
After a word sense disambiguation (WSD) stage, we generate a new version of the tweet in which each word is replaced by its sense. A set of new n-gram features is then computed over this new text. This approach allows one sense n-gram to represent two or more different word n-grams when the words have the same sense.
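The idea can be sketched as follows, with a toy sense inventory standing in for the real WSD step (the system uses NLTK's Lesk implementation or the most frequent WordNet sense, as described in the experiments):

```python
# Hypothetical toy sense inventory (word -> WordNet-style sense id),
# standing in for a real WSD component.
TOY_SENSES = {"great": "good.a.01", "awesome": "good.a.01",
              "film": "movie.n.01", "movie": "movie.n.01"}

def to_senses(tokens, senses=TOY_SENSES):
    """Replace each word by its sense id; keep the word if no sense is found."""
    return [senses.get(t.lower(), t.lower()) for t in tokens]

def sense_ngrams(tokens, n_max=4):
    """SG features: n-grams computed over the sense-substituted tweet."""
    seq = to_senses(tokens)
    return {" ".join(seq[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(seq) - n + 1)}
```

Here "great film" and "awesome movie" produce the same sense bigram, which is exactly the synonymy effect described above.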
To enrich our model with respect to those in (Mohammad et al., 2013; Hagen et al., 2015), we considered SentiWordNet (Baccianella et al., 2010) as a polarity dictionary, an idea explored in (Günther and Furrer, 2013). In this case, after WSD, we can use SentiWordNet to compute positive or negative scores for a given word, generating features as with the other lexicons.
Considering that elongated words (e.g., greaaaat) can emphasize the expressed sentiment, similar features were computed considering only the lengthened words in the tweet. In this case, we did not consider bi-gram lexicons, and we normalized the elongated words before querying the lexicons.
Finally, we studied the following set of new features.
• Additional Features
-Sense n-grams (SG): one feature for each sense n-gram in the training corpus.
-SentiWordNet polarity scores (SW): eight features similar to those defined for the other lexicons in Section 1.
-Polarity scores of elongated words (EW): eight features similar to those defined for the other lexicons in Section 1, but considering only the lengthened words, if any. All lexicons except the NRC-Sentiment140 and NRC-Hashtag bi-gram lexicons were used.

Model Ensemble
With the available training data, we trained several models using different combinations of feature types. Our final submission was an ensemble of the top 10 models trained. Classifiers were combined by weighted voting, as explained by Kuncheva (2004). To classify a tweet, we query each model, which outputs a single label and a weight for that label, proportional to the accuracy of the classifier for that class in previous tests. After querying the 10 models, the final classification of the tweet is the most voted class. Let A^C_ij be the accuracy of model i over class C on test data j, where j = 1 refers to the SemEval 2013 test data and so on up to S = 4, and M = 10 is the number of models in the ensemble. The weight of model i for class C is its accuracy summed over the S test sets, normalized over the M models: w^C_i = (sum_{j=1..S} A^C_ij) / (sum_{m=1..M} sum_{j=1..S} A^C_mj).
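The weighted voting scheme can be sketched as follows, assuming the normalized-accuracy reading of the weights described above (function names are ours):

```python
from collections import defaultdict

def class_weights(acc, n_models, classes):
    """acc[i][c] is the list of accuracies of model i on class c over
    the S test sets. The weight of model i for class c is its summed
    accuracy, normalized over all models."""
    w = {}
    for c in classes:
        totals = [sum(acc[i][c]) for i in range(n_models)]
        denom = sum(totals)
        for i in range(n_models):
            w[(i, c)] = totals[i] / denom
    return w

def ensemble_predict(tweet, models, w):
    """Each model votes for one label with its class-specific weight;
    the label with the largest summed weight wins."""
    votes = defaultdict(float)
    for i, model in enumerate(models):
        label = model(tweet)
        votes[label] += w[(i, label)]
    return max(votes, key=votes.get)
```

With two toy models, one reliable on POSITIVE and one on NEGATIVE, the class-specific weights let the more accurate model dominate its own class.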
The next section describes the experiments we carried out to assess the different feature sets, how the weights were computed, as well as the results.

Experiments
Our predictor is based on an ensemble of Support Vector Machines with linear kernel and C = 0.005, trained with all the features proposed by Mohammad et al. (2013). As Mohammad et al. (2013), we want to evaluate how removing n-gram and cluster based features affects the results of our models. Table 1 shows the eight base models resulting from removing combinations of features of the types WG, CG and CB, with X indicating the feature sets included in each model. Table 2 shows the different arrangements of the new features, which were combined with the base models for a total of 96 experiments. We replicated the experiments that included SG twice: once disambiguating with the Lesk algorithm (Lesk, 1986), and once taking the most frequent sense of a word. In all experiments, we used implementations from NLTK (Bird et al., 2009) to disambiguate. In total, 160 different models were evaluated. Note that some of these models simply augmented the features in (Mohammad et al., 2013) with some of the new ones.
With the training data of the previous SemEval editions, 2013 to 2016, we simulated our participation in those competitions. We trained SVMs for each model and evaluated them on the corresponding test data using the F_1 score for the POSITIVE class. Table 3 shows the best (B) and the worst (W) results for each test dataset.
These results allowed us to rank the models. A final ranking was computed by averaging the positions of each model across the different test datasets. A drawback of this approach is that, although one model may be ranked better than another, the difference between their results can be very small. The 10 top-ranked models result from base model 3 (character n-grams discarded) combined with the new feature arrangements [4, 2, 5, 9, 4*, 12, 9*, 8, 10, 1], where * indicates that WSD used the Lesk algorithm.
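The average-rank selection can be sketched as follows (a minimal version; function name is ours):

```python
def average_ranks(scores):
    """scores[m][j] is the F1 of model m on test set j. Rank the models
    on each test set (1 = best), then average each model's rank across
    the test sets. Lower average rank is better."""
    n_models = len(scores)
    n_sets = len(scores[0])
    avg = [0.0] * n_models
    for j in range(n_sets):
        order = sorted(range(n_models),
                       key=lambda m: scores[m][j], reverse=True)
        for rank, m in enumerate(order, start=1):
            avg[m] += rank / n_sets
    return avg
```

This averaging is what makes the drawback above possible: two models with nearly identical scores can still receive distinctly ordered ranks on every test set.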
Given the results on all previous SemEval test data, the accuracy over each category was obtained for each model, as well as the weights for the top 10. Finally, the submitted system was built as follows. We trained versions of each of the top 10 models using the SemEval-2017 training data; after removing duplicates, this amounts to 52,780 tweets. The 10 trained classifiers were combined by weighted voting, with weights computed as explained before. Table 4 shows the results for each category over the 12,284 test tweets. Regarding the measures used to evaluate the systems, our proposal achieves an average recall of ρ = 0.642, F_1^{PN} = 0.624, and accuracy Acc = 0.565. The submitted system stood 10th among the participants. Further details about the training and test datasets, as well as the results of the other participants, can be found in (Rosenthal et al., 2017).

Conclusions and Future Works
Our proposal is based on (Mohammad et al., 2013). We assessed a new set of features and analyzed the effect of removing some of the features used in that system. The data in Table 3, as well as the top 10 models trained, show that the inclusion of the new features can improve results.
Experiments in (Mohammad et al., 2013) suggest that removing the character n-gram attributes degrades classifier outcome. We observed the same, but when the feature set is extended with the new features, excluding character n-grams appears to be convenient: all models in the top 10, and indeed in the top 30, are models where character n-grams were excluded but some of the new features were included.
Another interesting fact is that the systems seem to be more sensitive to the word n-gram and cluster based attributes. The best ranked model without word n-grams stood 23rd in our ranking. Character n-grams were also omitted in this model, which was extended with the SG, SW and LE features. After the release of the gold labels, we evaluated the predictions of other models that were not submitted but were also trained with the SemEval 2017 training data. The aforementioned model achieves F_1^{PN} = 0.652, better than the model we submitted. It is important to note that this model used only 822,650 features, substantially fewer than the 2,993,189 used by the best of our single models on the test data, which discards only the character n-grams, adds the SG, EW and LE features, and achieves F_1^{PN} = 0.654. These results open an interesting direction for future work: studying how to minimize the set of features without a noticeable degradation of prediction results, ideally identifying a set of features whose size is independent of the corpus, as with the lexicon based ones.