ECNU: Multi-level Sentiment Analysis on Twitter Using Traditional Linguistic Features and Word Embedding Features

This paper reports our submission to task 10 (Sentiment Analysis on Tweet, SAT) (Rosenthal et al., 2015) in SemEval 2015, which contains five subtasks, i.e., contextual polarity disambiguation (subtask A: expression-level), message polarity classification (subtask B: message-level), topic-based message polarity classification and detecting trends towards a topic (subtask C and D: topic-level), and determining sentiment strength of Twitter terms (subtask E: term-level). For the first four subtasks, we built supervised models using traditional features and word embedding features to perform sentiment polarity classification. For subtask E, we first expanded the training data with the aid of external sentiment lexicons and then built a regression model to estimate the sentiment strength. Despite the simplicity of features, our systems rank above the average.


Introduction
In the past few years, hundreds of millions of people have shared and expressed their opinions through microblogging websites such as Twitter. Research on this platform is drawing increasing attention from many researchers and organizations. Given the character limit on tweets, sentiment orientation classification of tweets is usually treated as analogous to sentence-level sentiment analysis (Kouloumpis et al., 2011; Kim and Hovy, 2004; Yu and Hatzivassiloglou, 2003). However, since opinions in tweets may attach to different topics and be conveyed by various expression words, Wang et al. (2011), Jiang et al. (2011) and Chen et al. (2012) have investigated various ways to address these target-dependent issues. Recently, inspired by the use of neural networks to construct distributed word representations (word embeddings) (Mikolov et al., 2013a), several researchers have employed neural networks to perform sentiment analysis. For example, Kim (2014) and dos Santos and Gatti (2014) adopted convolutional neural networks to learn sentiment-bearing sentence vectors, and Mikolov et al. (2013b) proposed the Paragraph Vector, which outperformed bag-of-words models for sentiment analysis.
The task of Sentiment Analysis in Twitter (SAT) in SemEval 2015 consists of five subtasks. The first three subtasks focus on determining the polarity of the given tweet, phrase or topic (i.e., subtask A aims at classifying the sentiment of a marked instance in a given message, subtask B is to determine the polarity of the whole message and subtask C focuses on identifying the sentiment of the message towards the given topic). The fourth subtask D is to detect the sentiment trends of a given set of messages towards a topic from the same period of time. The last subtask E is to predict a score between 0 and 1, which is indicative of the strength of association of Twitter terms with positive sentiment.
The remainder of this paper is organized as follows. Section 2 describes our systems, including preprocessing, feature engineering, evaluation metrics, etc. The data sets and experiments are described in Section 3. Finally, we conclude this paper in Section 4.

System Description
For subtask A and B, we compared two classifiers built on traditional NLP features (linguistic and sentiment lexicon features) and word embedding features respectively. We also combined the results of these two classifiers by summing up their predicted probability scores. Due to time constraints, for subtask C and D, we only used the traditional feature sets to build a classifier. Unlike the above four subtasks, for subtask E we built a regression model to calculate a sentiment strength score for each target term with the aid of sentiment lexicon score features and word embedding features.
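
The combination step described above can be illustrated with a minimal sketch. The function name `combine_by_sum`, the label set and the dictionary-based score format are assumptions made for illustration; they are not taken from the original system.

```python
from typing import Dict

# Assumed three-way label set for the polarity classifiers.
LABELS = ["positive", "negative", "neutral"]


def combine_by_sum(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> str:
    """Sum the per-class scores of two classifiers and return the best label."""
    summed = {
        label: scores_a.get(label, 0.0) + scores_b.get(label, 0.0)
        for label in LABELS
    }
    # Pick the label with the largest combined score.
    return max(summed, key=lambda label: summed[label])
```

For example, if the traditional-feature classifier prefers "positive" only slightly while the embedding-based classifier strongly prefers "negative", the summed scores select "negative".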

Data Preprocessing
First, we collected about 5,000 slang terms and abbreviations from the Internet to convert irregular writing to its formal form, e.g., "asap" to "as soon as possible" and "3q" to "thank you". In the same step, we also restored elongated words to their original forms, e.g., "goooooood" to "good". The processed data was then tokenized, POS-tagged and parsed using the CMU parsing tools (Owoputi et al., 2013).
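
A minimal sketch of this normalization step is given below. The `SLANG` dictionary is a hypothetical two-entry excerpt of the collected list, and collapsing runs of three or more identical characters down to two is an assumed heuristic (it restores "goooooood" to "good" but would need a dictionary check for words such as "sooo").

```python
import re

# Hypothetical excerpt of the slang/abbreviation dictionary.
SLANG = {"asap": "as soon as possible", "3q": "thank you"}

# A character followed by two or more copies of itself.
_ELONGATED = re.compile(r"(.)\1{2,}")


def normalize(text: str) -> str:
    """Collapse elongated words and expand known slang, token by token."""
    tokens = []
    for tok in text.split():
        # Assumption: reduce any run of 3+ identical characters to 2.
        tok = _ELONGATED.sub(r"\1\1", tok)
        # Case-insensitive lookup; unknown tokens are kept unchanged.
        tokens.append(SLANG.get(tok.lower(), tok))
    return " ".join(tokens)
```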

Feature Engineering
Although the first four subtasks all focus on sentiment polarity classification, they are defined quite differently. For example, since subtask B focuses on sentiment classification of the whole tweet, we extract features from all words in the tweet. However, the other three subtasks, i.e., A, C, and D, perform sentiment polarity classification only on a certain piece of the tweet, i.e., the expression words or the topic in the tweet. Since the organizers have provided the annotated target words (for A) and topics (for C and D) for each tweet, we only chose related words rather than all words in the whole tweet as pending words for subsequent feature extraction. To pick out related words from the whole tweet, following (Kiritchenko et al., 2014), for each annotated target word we only treated the surrounding words within distance d ≤ 2 in the parse tree as its relevant words.
In this task, we used four types of features: sentiment lexicon features (the scores calculated from seven sentiment lexicons), linguistic features (n-grams, POS tags, negations, etc.), tweet-specific features (emoticons, all-caps, hashtags, etc.) and word embedding features.
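
The selection of relevant words by parse-tree distance can be sketched as a breadth-first search. The representation of the parse as a list of head indices, the undirected treatment of tree edges, and the function name `relevant_words` are assumptions for illustration only.

```python
from collections import deque
from typing import Dict, List, Set


def relevant_words(heads: List[int], target: int, max_dist: int = 2) -> Set[int]:
    """Return indices of tokens within max_dist tree edges of the target.

    heads[i] is the index of the parent of token i, or -1 for the root.
    The target itself (distance 0) is included in the result.
    """
    # Build an undirected adjacency list from the head indices.
    neighbours: Dict[int, List[int]] = {i: [] for i in range(len(heads))}
    for child, head in enumerate(heads):
        if head >= 0:
            neighbours[child].append(head)
            neighbours[head].append(child)

    # Breadth-first search that stops expanding at max_dist.
    dist = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if dist[node] == max_dist:
            continue
        for nxt in neighbours[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return set(dist)
```

With the toy parse of "I really love the new phone" encoded as `heads = [2, 2, -1, 5, 5, 2]`, the target "new" (index 4) yields the tokens "love", "the", "new" and "phone".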
Sentiment Lexicon Features (SentiLexi): We employed the following seven sentiment lexicons to extract sentiment lexicon features: Bing Liu lexicon, General Inquirer lexicon, IMDB, MPQA, SentiWordNet, NRC Hashtag Sentiment Lexicon, and NRC Sentiment140 Lexicon. We transformed the scores of all words in all sentiment lexicons to the range of −1 to 1, where negative values denote negative sentiment and positive values denote positive sentiment.
Given the extracted pending words, we first converted them to lowercase. Then for each sentiment lexicon, we calculated the following five sentiment scores on the processed pending words: (1) the ratio of positive words to pending words, (2) the ratio of negative words to pending words, (3) the maximum sentiment score, (4) the minimum sentiment score, (5) the sum of sentiment scores. If a pending word does not exist in a sentiment lexicon, its corresponding score is set to zero. Specifically, before looking up a term in the SentiWordNet lexicon, we lemmatized the word and selected the first entry in the search results that matches its POS tag.
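
The five per-lexicon scores can be computed as in the following sketch. The lexicon is assumed to be a plain dictionary from lowercase words to scores in [−1, 1], and returning all zeros for an empty word list is an added assumption.

```python
from typing import Dict, List


def lexicon_features(words: List[str], lexicon: Dict[str, float]) -> List[float]:
    """Return [pos ratio, neg ratio, max score, min score, sum of scores]."""
    if not words:
        # Assumption: an empty set of pending words yields all-zero features.
        return [0.0] * 5
    # Words missing from the lexicon receive a score of zero.
    scores = [lexicon.get(w.lower(), 0.0) for w in words]
    n = len(scores)
    pos_ratio = sum(1 for s in scores if s > 0) / n
    neg_ratio = sum(1 for s in scores if s < 0) / n
    return [pos_ratio, neg_ratio, max(scores), min(scores), sum(scores)]
```

Running this once per lexicon and concatenating the outputs would give 35 features for seven lexicons.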
Linguistic Features:
-Word n-grams: We first converted all pending words to lowercase and removed URLs, mentions, hashtags, and low-frequency words (those occurring fewer than 10 times). Then we extracted unigram and bigram features. In addition, inspired by (Kiritchenko et al., 2014), pairs of words connected in the parse tree are extracted as pairgram features.
-Negation Features: The sentiment orientation of a message or phrase can usually be reversed by a negation word. Thus, we collected 29 negations from the Internet, and each binary feature is set to 1 or 0 depending on whether the corresponding negation is present or absent in the pending words.
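
The n-gram and negation features can be sketched as follows. The `NEGATIONS` list is a hypothetical four-entry subset of the 29 collected negations, the prefix tests for URLs, mentions and hashtags are assumed heuristics, and the low-frequency filter is omitted because it requires corpus counts.

```python
from typing import List

# Hypothetical subset of the collected negation list.
NEGATIONS = ["not", "no", "never", "n't"]


def clean(tokens: List[str]) -> List[str]:
    """Lowercase tokens and drop URLs, mentions and hashtags."""
    skip = ("http://", "https://", "@", "#")
    return [t.lower() for t in tokens if not t.startswith(skip)]


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return all contiguous n-grams joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def negation_features(tokens: List[str]) -> List[int]:
    """One binary feature per negation word: 1 if present, else 0."""
    present = {t.lower() for t in tokens}
    return [1 if neg in present else 0 for neg in NEGATIONS]
```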

Tweet Specific Features (PAHE):
-Emoticon: We gathered 69 emoticons from Internet and this binary feature records whether the corresponding emoticon is present or absent in pending words.
-All-caps: It indicates the number of words written entirely in uppercase letters.
-Hashtag: It is the number of hashtags in the sentence or phrase.
-Elongated: It represents the number of words with one character repeated more than two times, e.g., "gooooood".
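
The four tweet-specific features listed above might be extracted as in this sketch. The `EMOTICONS` list is a hypothetical subset of the 69 collected emoticons, and counting only fully uppercase tokens of at least two characters for the all-caps feature is an assumption.

```python
import re
from typing import Dict, List, Union

# Hypothetical subset of the collected emoticon list.
EMOTICONS = [":)", ":(", ":D"]

# One character repeated more than two times in a row.
_ELONGATED = re.compile(r"(.)\1{2,}")


def tweet_features(tokens: List[str]) -> Dict[str, Union[int, List[int]]]:
    """Compute emoticon, all-caps, hashtag and elongated-word features."""
    return {
        # Binary presence flag for each known emoticon.
        "emoticons": [1 if e in tokens else 0 for e in EMOTICONS],
        # Assumption: a token counts if it is fully uppercase and has 2+ characters.
        "all_caps": sum(1 for t in tokens if len(t) >= 2 and t.isupper()),
        "hashtags": sum(1 for t in tokens if t.startswith("#")),
        "elongated": sum(1 for t in tokens if _ELONGATED.search(t)),
    }
```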
Word Embedding Features: A word embedding is a continuous-valued representation of a word which usually carries syntactic and semantic information (Zeng et al., 2014). Since a phrase or sentence contains more than one word, there are usually two strategies to convert the word vectors into a sentence vector: (1) summing up all word vectors; (2) rolling up the sequential words to obtain a vector that contains context information (i.e., a convolutional neural network). The convolutional neural network (CNN) is usually employed in image recognition, but many researchers have adopted it in Natural Language Processing (Kim, 2014; dos Santos and Gatti, 2014) and achieved good performance. For subtask A and B, we adopted the CNN tools in (Kim, 2014) and extracted the penultimate hidden layer content as the sentence-level word embedding features to perform classification. For subtask E, we simply adopted the first strategy and summed up the word vectors in the given phrase.
Specifically, in this work we used the publicly available word2vec vectors with a dimensionality of 300, which were trained on 100 billion words from Google News (Mikolov et al., 2013b). If a word is not in the word2vec vocabulary, we initialize its vector with random values.
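
The first strategy (summing word vectors, with random initialization for out-of-vocabulary words) can be sketched in plain Python. The uniform range of ±0.25 and the caching of random vectors so that a repeated unknown word keeps the same vector are assumptions; the source only states that unknown words receive random values.

```python
import random
from typing import Dict, List


def phrase_vector(
    words: List[str],
    embeddings: Dict[str, List[float]],
    dim: int,
    rng: random.Random,
) -> List[float]:
    """Sum the vectors of all words; unknown words get a random vector."""
    total = [0.0] * dim
    for word in words:
        vec = embeddings.get(word)
        if vec is None:
            # Assumption: uniform initialization in [-0.25, 0.25].
            vec = [rng.uniform(-0.25, 0.25) for _ in range(dim)]
            # Assumption: cache it so the same word maps to the same vector.
            embeddings[word] = vec
        total = [a + b for a, b in zip(total, vec)]
    return total
```

In the setting described above, `dim` would be 300 and `embeddings` would hold the pre-trained word2vec vectors.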

Evaluation Metrics
For subtask A, B and C, we used the macro-averaged F score of the positive and negative classes (i.e., $F_{macro} = \frac{F_{pos} + F_{neg}}{2}$) to evaluate the performance, which takes the effectiveness on small classes into account. For subtask D, the averaged absolute difference (avgAbsDiff) is employed, which is a common measure of how much a set of observations differ from the average. Since subtask E aims at predicting the sentiment score for a target term, in order to make the comparison of the predicted strength of different terms reasonable, the Kendall rank correlation coefficient (which measures the association between two measured quantities) and the Spearman rank correlation (a nonparametric measure of statistical dependence between two variables) are adopted in this subtask, where the Kendall rank correlation coefficient is the official evaluation criterion.
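
Two of these metrics can be sketched directly. The label strings "positive" and "negative" are assumed names, and the Kendall coefficient below is the simple tau-a variant without tie correction, which may differ from the official scorer when ties occur.

```python
from typing import List, Sequence


def f1_for(label: str, gold: List[str], pred: List[str]) -> float:
    """F1 score of a single class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f_macro(gold: List[str], pred: List[str]) -> float:
    """Average of the F1 scores of the positive and negative classes."""
    return (f1_for("positive", gold, pred) + f1_for("negative", gold, pred)) / 2


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Note that neutral instances still influence $F_{macro}$ indirectly, because a neutral tweet predicted as positive counts as a false positive for the positive class.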

Datasets
The organizers provided tweet ids and a script for all participants to collect data. Table 1 shows the statistics of the data sets we used in our experiments.
For subtask A and B, the training data set is composed of the SemEval 2013 Task 2 training and development data (Nakov et al., 2013), and the development data set is made up of the test sets from the same tasks in the previous two years. For subtask C and D, this data is divided into many topic sets.
With regard to subtask E, the organizers provided 200 terms labeled with a decimal in the range of 0 to 1. We observed that among these 200 given terms, 22% are hashtags and 15% contain a negator. Given the lack of training data, we expanded it with 1,346 terms collected from the following sources: 916 terms that are present in all 7 sentiment lexicons mentioned above, and 230 terms with a hashtag and 200 terms with a negator randomly extracted from the NRC Hashtag Sentiment Lexicon. The provided 200 terms were used as development data. To predict the strength values of the extended data, we used the MPQA sentiment lexicon label as reference. There are 6 polarity types in MPQA, i.e., strong positive, weak positive, both strong, both weak, weak negative and strong negative. We converted them to the numeric scores 1, 0.75, 0.5, 0.5, 0.25 and 0, respectively. By doing so, if a target term is present in this expanded lexicon, the output is its corresponding score. Otherwise, we split the term into several words and calculated their averaged sentiment score as output.
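
The label-to-score conversion and the fallback for multi-word terms can be sketched as below. The function name `term_strength`, the whitespace split, averaging only over words found in the lexicon, and the neutral default of 0.5 when no word is found are all assumptions not stated in the source.

```python
from typing import Dict

# Mapping of the six MPQA polarity types to numeric scores, as given in the text.
MPQA_SCORE = {
    "strong positive": 1.0,
    "weak positive": 0.75,
    "both strong": 0.5,
    "both weak": 0.5,
    "weak negative": 0.25,
    "strong negative": 0.0,
}


def term_strength(term: str, labels: Dict[str, str]) -> float:
    """Score a term via its MPQA label, or average the scores of its words."""
    if term in labels:
        return MPQA_SCORE[labels[term]]
    # Assumption: only words present in the lexicon enter the average.
    scores = [MPQA_SCORE[labels[w]] for w in term.split() if w in labels]
    if not scores:
        # Assumption: fall back to the neutral midpoint.
        return 0.5
    return sum(scores) / len(scores)
```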

Subtask A and B
To address subtask A and B, we conducted a series of experiments to examine the effects of different traditional features. Table 2 describes the experiments with various traditional features on subtask A and B. From Table 2, we find that: (1) SentiLexi and unigram are the most effective feature types for detecting polarity; (2) the POS feature improves the performance on subtask B but not on subtask A, possibly because the proportion of neutral instances in subtask B (45.58%) is much higher than that in subtask A (5.01%); (3) the emoticon features are not as effective as expected, since most emoticons are already present in the unigram features.
Besides, following (Kim, 2014), we adopted sentence modeling and extracted the penultimate hidden layer content as a novel word embedding feature to build another classifier. Furthermore, we combined the intermediate results (i.e., the distances from a point to the multiple hyperplanes returned by the SVM) of the two classifiers. The experimental results of using word embedding features in isolation and in combination are shown in Table 3. From Table 3, we find that the word embedding features alone perform slightly worse than the traditional features. This may be because the traditional features outnumber the word embedding features by dozens of times, which impairs the effectiveness of the word embeddings. However, the combination of the two classifiers achieves the best performance in both subtasks. This indicates that although the number of word embedding features is small, they still contribute to performance improvement.
We also compared several classification algorithms implemented in scikit-learn tools (Pedregosa et al., 2011) (e.g., SVM with kernel={linear, rbf}, c=0.1, 1, 10, SGD with loss={hinge, log}, RandomForestClassifier with n={10, 50, 100}, etc.). Table 4 shows the configuration of classifiers with best performance. Thus, in subsequent experiments, we adopted the configurations listed in Table 4.

Subtask C and D
Table 2 lists the experimental results using several traditional features on subtask C. Since the sentiment trend of a given topic in subtask D is calculated from the results of subtask C (i.e., sentiment trend = positive/(positive + negative)), we have not conducted additional experiments for subtask D. Similar to the first two subtasks, we adopted the SVM classification algorithm with kernel=linear, c=0.1 as the system configuration for follow-up experiments.
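
The trend computation for subtask D, positive/(positive + negative), can be sketched as follows. The label strings and the choice to return `None` when a topic has no positive or negative predictions are assumptions; the source does not say how that case is handled.

```python
from typing import List, Optional


def sentiment_trend(predicted_labels: List[str]) -> Optional[float]:
    """Return positive / (positive + negative) over one topic's predictions."""
    positive = predicted_labels.count("positive")
    negative = predicted_labels.count("negative")
    if positive + negative == 0:
        # Assumption: the trend is undefined without polar predictions.
        return None
    return positive / (positive + negative)
```

Neutral predictions are ignored by this ratio, so a topic with two positive, one negative and any number of neutral tweets has a trend of 2/3.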

Subtask E
We transformed the informal terms to their normal forms and used the sentiment lexicons mentioned in Section 2.2, except MPQA, to extract sentiment lexicon features. If the target term contained more than one word, we averaged their scores as its final sentiment lexicon feature. Besides, the word embedding features were also adopted in this subtask.
To explore the effectiveness of different feature types, we conducted several feature combination experiments, shown in Table 5. From Table 5, we find that: (1) the combination of SentiLexi and word embedding is the most effective feature set for sentiment score prediction; (2) the word embedding features achieved a better result than the SentiLexi features, with an improvement of about 4.7% in terms of the Kendall measure, which indicates that word embedding features preserve sentiment information and semantic relationships between words.
We also performed a series of experiments to optimize the parameters of the SVM model. Similarly, we found that the SVM with kernel=linear and c=1 obtained the best performance. Thus, in the following experiments on test data, we adopted this configuration with SentiLexi and word embedding features together.

Results on Test Data
Using the optimum feature set and configurations described in Section 3.2, we trained separate models for each subtask and evaluated them against the SemEval-2015 Task 10 test set. Table 6 shows the results of our systems and the top-ranked systems on subtask A, B, C and D.
Table 6: Performances of our systems and top-ranked systems for subtask A, B, C ($F_{macro}$ (%)) and D (avgAbsDiff) on test data. The numbers in the brackets are the rankings on the corresponding data set.
From Table 6, we observe the following. Firstly, in accordance with previous work (Rosenthal et al., 2014), the results of subtask B are much worse than those of subtask A. On one hand, the text in the message-level task is long and contains multiple or mixed sentiments of different strength, whereas the text in the expression-level task usually contains a single sentiment orientation. On the other hand, the polarity distributions of subtask A and B are significantly different (i.e., about 6.14% of instances at the expression level are neutral, compared with 41.30% at the message level).
Secondly, the performances on LiveJournal and SMS are comparable to the results on Twitter2013 and Twitter2014 in both subtasks, which suggests that Twitter, SMS and LiveJournal data have similar characteristics; we may therefore consider using SMS as training data when the available tweet data is insufficient.
Thirdly, the submissions for subtask C and D adopted only traditional linguistic features without combining them with word embeddings, which may explain the poor performance in subtask C and D.
Our systems ranked 7th out of 11 submissions for subtask A and 19th out of 40 submissions for subtask B, and performed well on the LiveJournal and SMS2013 data sets. For subtask C and D, our systems ranked 5th out of 7 submissions and 5th out of 6 submissions respectively.
Table 7 shows the results of our system and the top-ranked system provided by the organizers for subtask E. Our system ranked 3rd out of 10 submissions. Although word embedding features learned from a large amount of context are mainly believed to capture semantic information, they also carry some sentiment information induced from that context. Consequently, combining sentiment lexicon and word embedding features makes our system promising.

Conclusion
In this paper, we combined the results of two classifiers (adopting traditional features and word embedding features respectively) to detect sentiment polarity at the expression level and message level (i.e., subtask A, B), adopted several basic feature types to address the topic-level task (i.e., subtask C, D), and built a regression model with the aid of sentiment lexicon features and word embedding features to predict the degree of polarity strength at the term level (i.e., subtask E). Using word embedding features alone may not yield good results, but they contribute to performance improvement in combination with traditional linguistic features. In future work, we plan to construct word representations that carry sentiment information to address sentiment analysis.