ECNU at SemEval-2016 Task 4: An Empirical Investigation of Traditional NLP Features and Word Embedding Features for Sentence-level and Topic-level Sentiment Analysis in Twitter

This paper reports our submissions to Task 4, i.e., Sentiment Analysis in Twitter (SAT), in SemEval 2016, which consists of five subtasks grouped into two levels: (1) sentence level, i.e., message polarity classification (subtask A), and (2) topic level, i.e., tweet classification and quantification according to a two-point scale (subtasks B and D) or a five-point scale (subtasks C and E). We participated in all five subtasks. To address them, we investigated several traditional Natural Language Processing (NLP) features, including sentiment lexicon, linguistic and domain-specific features, as well as word embedding features, together with supervised machine learning methods. The officially released results showed that our systems ranked above average.


Introduction
In recent years, with the emergence of social media, more and more users have shared and obtained information through microblogging websites such as Twitter. The resulting large amount of available data has attracted many researchers. SemEval 2016 provides a common platform for researchers to explore the task of Sentiment Analysis in Twitter (Nakov et al., 2016) (Task 4), which includes five subtasks grouped into two levels, i.e., sentence level and topic level. Subtask A is a sentence-level task aiming at sentiment polarity classification of the whole tweet. The other four subtasks are at topic level, i.e., given one topic, the sentiment polarity of tweets is classified or quantified on a two-point scale (i.e., subtasks B and D) or on a five-point scale (i.e., subtasks C and E). Specifically, subtask B is to identify the sentiment polarity label (i.e., Positive or Negative) of tweets with respect to the given topic, while subtask D aims at estimating the sentiment distribution of tweets with respect to the given topic. Both subtasks B and D use a two-point scale. The purposes of subtasks C and E are similar to those of subtasks B and D, except that they use a five-point scale, that is, the class labels take 5 values, i.e., 2, 1, 0, -1 and -2, representing Very Positive, Positive, Neutral, Negative and Very Negative.
Given the character limit on tweets, sentiment orientation classification of tweets can be regarded as sentence-level sentiment analysis. Many researchers focus on feature engineering to improve the performance of SAT. For example, (Turian et al., 2010; Liu, 2012; Zhang et al., 2006) showed that a one-hot representation of n-gram features is a relatively strong baseline. Furthermore, (Mohammad et al., 2013) proposed a state-of-the-art model which used several sentiment lexicons and a variety of manual features. Apart from these traditional methods, more and more researchers have turned their attention to deep learning methods. Word embedding is one such method, where each word is represented as a continuous, low-dimensional vector; it has been applied to NLP tasks as a critical and fundamental step. There are several types of word embedding models, e.g., (Bengio et al., 2003) proposed a Neural Probabilistic Language Model (NNLM) to learn a distributed representation for each word, and (Mikolov et al., 2013) simplified the structure of NNLM and presented two efficient log-linear models. Moreover, (Tang et al., 2014) further proposed learning sentiment-based word embeddings to address SAT. Meanwhile, topic-based opinion always adheres to certain words or phrases rather than to the whole tweet. To address topic-based SAT, (Wang et al., 2011) used hashtag information, and (Lin and He, 2009) utilized a topic model to extract topic information from tweets and picked out related words, rather than all words in the whole tweet, as candidate words for subsequent feature extraction.
Previous work showed that feature engineering has a significant impact on this task. Thus, in this work, we used multiple types of traditional NLP features to perform SAT, e.g., sentiment lexicon features (e.g., MPQA, IMDB, Bing Liu opinion lexicon, etc.), linguistic features (e.g., negations, n-grams at the word level and character level, etc.) and tweet specific features (e.g., emoticons, capitalized words, elongated words, hashtags, etc.). In addition, word embedding features were adopted. We also performed a series of experiments to select effective feature subsets and supervised machine learning algorithms with optimal parameters. The rest of this paper is organized as follows. Section 2 describes our system framework, including preprocessing, feature engineering, evaluation metrics, etc. The experiments are reported in Section 3. Finally, this work is concluded in Section 4.

Data Preprocessing
With the aid of approximately 5,000 abbreviations and slang terms collected from the Internet, we converted informal writing into regular forms, e.g., "asap" was replaced by "as soon as possible", "3q" by "thank you", etc. We also restored elongated words to their original forms, e.g., "soooooo" to "so". Finally, we performed tokenization, POS tagging and parsing on the processed data using the CMU parsing tools (Owoputi et al., 2013).
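The two normalization steps above can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the `SLANG` dictionary is a hypothetical two-entry excerpt (the full list of about 5,000 entries is not given), and treating three or more repeated characters as an elongation is an assumption.

```python
import re

# Hypothetical excerpt of the abbreviation/slang dictionary; the real list
# of about 5,000 entries is not given in the paper.
SLANG = {"asap": "as soon as possible", "3q": "thank you"}

# Assumption: a run of three or more identical characters marks an elongation.
ELONGATED = re.compile(r"(\w)\1{2,}")


def normalize(tweet: str) -> str:
    """Expand slang terms and collapse elongated character runs."""
    tokens = []
    for token in tweet.split():
        # Replace the token if it is a known abbreviation or slang term.
        token = SLANG.get(token.lower(), token)
        # Collapse each elongated run to a single character.
        tokens.append(ELONGATED.sub(r"\1", token))
    return " ".join(tokens)


print(normalize("soooooo good asap"))
```

Collapsing a run to a single character matches the "soooooo" to "so" example, but it would turn "coool" into "col" rather than "cool"; a dictionary check on the candidate forms would be one way to handle such cases.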

Linguistic Features:
• POS: The absolute frequency of each part-of-speech tag is recorded.
• Negation: Negation in a message always reverses its sentiment orientation. We collected 29 negation words from the Internet and recorded the frequency of negations in the whole tweet.
• Cluster: The CMU TweetParser tool provides 1,000 token clusters produced with the Brown clustering algorithm on 56 million English-language tweets. We recorded the presence of tokens in tweets with respect to these 1,000 clusters.
• Dependency triple: The dependency tree is generated by the Stanford Parser and each tweet contains several dependency triples (e.g., relation(governor, dependent)). We used a binary feature to record whether a dependency triple is present or absent in a tweet.
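As a rough illustration of how such features could be assembled, the sketch below builds POS tag frequencies, a negation count and binary dependency-triple indicators from already tagged and parsed input. The `NEGATIONS` set is a hypothetical subset of the 29 negation words, the feature names are invented for this example, and the tags and triples are assumed to come from an external tagger and parser.

```python
from collections import Counter

# Hypothetical subset of the 29 negation words; the full list is not given.
NEGATIONS = {"not", "no", "never", "n't", "cannot"}


def linguistic_features(tagged_tokens, triples):
    """Build a sparse feature dict for one tweet.

    tagged_tokens: list of (token, pos_tag) pairs from a POS tagger.
    triples: list of (relation, governor, dependent) tuples from a parser.
    """
    features = Counter()
    for token, tag in tagged_tokens:
        features["pos=" + tag] += 1           # absolute POS tag frequency
        if token.lower() in NEGATIONS:
            features["negation_count"] += 1   # negation frequency in the tweet
    for relation, governor, dependent in triples:
        # Binary indicator: 1 if the triple occurs at all in the tweet.
        features[f"dep={relation}({governor},{dependent})"] = 1
    return dict(features)


example = linguistic_features(
    [("I", "O"), ("do", "V"), ("not", "R"), ("like", "V")],
    [("neg", "like", "not")],
)
print(example)
```

Cluster features could be added in the same way, with one binary indicator per cluster identifier looked up for each token.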

Tweet Specific Features:
• Punctuation: Punctuation marks (e.g., the exclamation mark (!) and the question mark (?)) usually indicate the strength of sentiment. Therefore, we recorded the numbers of these marks in isolation and in combination. In addition, the position of punctuation in the tweet is also an important clue for sentiment, so we used a binary feature to indicate whether a punctuation mark is the last token of the tweet.
• All-caps: The number of words in uppercase is recorded.
• Hashtag: We recorded the number of hashtags in the tweet.
• Emoticon: We collected 67 emoticons from the Internet, and this feature type records the numbers of positive and negative emoticons respectively. Moreover, two binary values record whether the last token is a positive or a negative emoticon respectively.
• Elongated: The number of elongated words in the raw text of the tweet is recorded.
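A minimal sketch of several of these counts is given below. It assumes whitespace tokenization rather than the tokenizer used in the paper, approximates "last token is punctuation" by checking the final character of the last token, and leaves out the emoticon features because the list of 67 emoticons is not given. All feature names are illustrative.

```python
import re


def tweet_specific_features(raw_tweet: str) -> dict:
    """Count punctuation, all-caps words, hashtags and elongated words."""
    tokens = raw_tweet.split()
    last = tokens[-1] if tokens else ""
    # Maximal runs of "!" and "?"; a run containing both counts as a combination.
    runs = re.findall(r"[!?]+", raw_tweet)
    return {
        "exclamation_count": raw_tweet.count("!"),
        "question_count": raw_tweet.count("?"),
        "mixed_punct_runs": sum(1 for r in runs if "!" in r and "?" in r),
        "ends_with_punct": int(last.endswith(("!", "?"))),
        # Single letters such as "I" are ignored (an assumption).
        "all_caps_count": sum(
            1 for t in tokens if len(t) > 1 and t.isalpha() and t.isupper()
        ),
        "hashtag_count": sum(1 for t in tokens if t.startswith("#") and len(t) > 1),
        # Assumption: three or more repeated characters mark an elongated word.
        "elongated_count": sum(1 for t in tokens if re.search(r"(\w)\1{2,}", t)),
    }


print(tweet_specific_features("I LOVE this soooo much #happy ?!"))
```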

Sentiment Lexicon Features (SentiLexi):
We employed the following eight sentiment lexicons to extract sentiment lexicon features: Bing Liu lexicon, General Inquirer lexicon, AFINN, IMDB, MPQA, NRC Emotion Sentiment Lexicon, NRC Hashtag Sentiment Lexicon, and NRC Sentiment140 Lexicon. Generally, we transformed the scores of all words in all sentiment lexicons to the range of -1 to 1, where a positive number indicates positive sentiment and a negative number indicates negative sentiment.
The following six scores are calculated on the whole data for each sentiment lexicon: (1) the ratio of positive words to all words, (2) the ratio of negative words to all words, (3) the maximum sentiment score, (4) the minimum sentiment score, (5) the sum of sentiment scores, and (6) the sentiment score of the last word in the tweet. If a word does not exist in a sentiment lexicon, its corresponding score is set to 0.
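The six scores can be sketched for a single lexicon as follows. The scaling rule in `scale_lexicon` (division by the largest absolute score) is an assumption, since the paper only states the target range of -1 to 1, and the toy lexicon is invented for illustration.

```python
def scale_lexicon(raw_scores: dict) -> dict:
    """Map raw lexicon scores into [-1, 1] (assumed rule: divide by max |score|)."""
    largest = max((abs(s) for s in raw_scores.values()), default=0.0)
    if largest == 0:
        return dict(raw_scores)
    return {word: score / largest for word, score in raw_scores.items()}


def lexicon_scores(tokens, lexicon):
    """Return the six lexicon features for one tweet and one lexicon."""
    if not tokens:
        return [0.0] * 6
    # Words missing from the lexicon receive a score of 0.
    scores = [lexicon.get(t.lower(), 0.0) for t in tokens]
    n = len(scores)
    return [
        sum(1 for s in scores if s > 0) / n,  # (1) ratio of positive words
        sum(1 for s in scores if s < 0) / n,  # (2) ratio of negative words
        max(scores),                          # (3) maximum sentiment score
        min(scores),                          # (4) minimum sentiment score
        sum(scores),                          # (5) sum of sentiment scores
        scores[-1],                           # (6) score of the last word
    ]


toy_lexicon = scale_lexicon({"good": 3, "bad": -2, "awful": -3})  # toy values
print(lexicon_scores(["good", "movie", "bad", "awful"], toy_lexicon))
```

With eight lexicons, concatenating the six scores per lexicon would give 48 values per tweet under this sketch.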

Word Embedding Features:
In this work, we employed three different types of word vectors. The general word vectors were trained by Google on a large amount of news text, which is a different domain from Twitter. The other two kinds of sentiment word vectors were both trained on tweets but with different methods. The purpose of this feature type is to examine the effects of general word embeddings and sentiment word embeddings on performance.
• General Word Vector (GeneralW2V): We adopted the word2vec tool to obtain word vectors with a dimensionality of 300 (i.e., GeneralW2V), trained on 100 billion words from Google News.
• Sentiment Word Vector (SWV): A Combined-Sentiment Word Embedding Model has been proposed to learn sentiment word vectors (SWV) for the sentiment analysis task. In this work, we learned SWV on the NRC140 tweet corpus (Go et al., 2009), which is made up of 1.6 million tweets (0.8 million positive and 0.8 million negative). The vector dimension is set to 100.
• Sentiment-specific Word Embedding (SSWE): Similar to SWV, the sentiment-specific word embedding model proposed by (Tang et al., 2014) used a neural network with multiple hidden layers to train SSWE with a dimensionality of 50.
To convert the above word vectors into a sentence vector, we simply adopted the min, max and average operations. Obviously, this combination strategy neglects the word order in the tweet, but it is simple and straightforward. As a result, the final sentence vector $V(s)$ is the concatenation of $V_{min}(s)$, $V_{max}(s)$ and $V_{average}(s)$.
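The min, max and average combination can be sketched in plain Python as below. Skipping out-of-vocabulary words and returning a zero vector for tweets with no known word are assumptions, and the two-dimensional toy embeddings are for illustration only.

```python
def sentence_vector(tokens, embeddings, dim):
    """Concatenate element-wise min, max and average of the word vectors."""
    # Assumption: words without an embedding are skipped.
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * (3 * dim)  # assumed fallback for tweets with no known word
    columns = list(zip(*vectors))  # one tuple per embedding dimension
    v_min = [min(c) for c in columns]
    v_max = [max(c) for c in columns]
    v_avg = [sum(c) / len(c) for c in columns]
    return v_min + v_max + v_avg


toy_embeddings = {"good": [1.0, -2.0], "day": [3.0, 0.0]}  # toy 2-d vectors
print(sentence_vector(["good", "day", "zzz"], toy_embeddings, 2))
```

Under this scheme a 300-dimensional word vector yields a 900-dimensional sentence vector, a 100-dimensional one yields 300 dimensions, and a 50-dimensional one yields 150.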

Evaluation Metrics
For subtask A, we used the macro-averaged F score of the positive and negative classes, i.e., $F_{macro} = \frac{F_{pos} + F_{neg}}{2}$, to evaluate performance. Subtasks B and D contain only positive and negative labels. The official metric for subtask B is the macro-averaged recall over the positive and negative classes, i.e., $R_{macro} = \frac{R_{pos} + R_{neg}}{2}$. For subtask D, it is the Kullback-Leibler Divergence (KLD) between the distributions over the two classes, i.e., $KLD(pos, neg) = \sum_{c_j \in \{pos, neg\}} \hat{P}(c_j) \cdot \log \frac{\hat{P}(c_j)}{P(c_j)}$, where $P$ denotes the probability of the predicted label and $\hat{P}$ is the probability of the gold label. Subtasks C and E have 5 classes, and the organizers adopted the macro-averaged Mean Absolute Error ($MAE^M$) and the Earth Mover's Distance ($EMD$) over the 5 predefined classes for these two subtasks respectively; the details of both metrics are described in the official document available on the website 10.
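The metrics for subtasks A, B and D can be sketched as follows. This is an illustrative re-implementation rather than the official scorer: the label strings are assumptions, and the `eps` guard against a zero predicted probability stands in for whatever smoothing the official evaluation applies. The KLD here weights each class by its gold probability.

```python
import math


def _counts(gold, pred, label):
    """Return true positives, false positives and false negatives for a label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    return tp, fp, fn


def recall(gold, pred, label):
    tp, _, fn = _counts(gold, pred, label)
    return tp / (tp + fn) if tp + fn else 0.0


def f1(gold, pred, label):
    tp, fp, _ = _counts(gold, pred, label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    rec = recall(gold, pred, label)
    return 2 * precision * rec / (precision + rec) if precision + rec else 0.0


def macro_f_pos_neg(gold, pred):
    """Subtask A: average F score of the positive and negative classes only."""
    return (f1(gold, pred, "positive") + f1(gold, pred, "negative")) / 2


def macro_recall(gold, pred):
    """Subtask B: average recall of the positive and negative classes."""
    return (recall(gold, pred, "positive") + recall(gold, pred, "negative")) / 2


def kld(gold_dist, pred_dist, eps=1e-12):
    """Subtask D: KL divergence of the predicted from the gold distribution."""
    total = 0.0
    for label, p_gold in gold_dist.items():
        if p_gold > 0:
            # eps avoids division by zero; it is not the official smoothing.
            total += p_gold * math.log(p_gold / max(pred_dist.get(label, 0.0), eps))
    return total


gold = ["positive", "positive", "negative", "neutral"]
pred = ["positive", "negative", "negative", "neutral"]
print(macro_f_pos_neg(gold, pred), macro_recall(gold, pred))
print(kld({"pos": 0.5, "neg": 0.5}, {"pos": 0.25, "neg": 0.75}))
```

Note that the neutral class still affects the subtask A score indirectly, because neutral tweets misclassified as positive or negative lower the precision of those classes.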

Datasets
Since only tweet IDs are provided by the organizers, different participants may collect different numbers of tweets due to missing tweets or system errors. Subtasks B and D use the same data set, and subtasks C and E share another common data set. The statistics of all data sets for these subtasks are shown in Tables 1, 2, and 3, respectively.
For subtask A, the training data set consists of four parts, which are shown in Table 1, i.e., 2013train, 2013dev, 2016train and 2016dev. The data set 2013train denotes the SemEval-2013 Task 2 training data set (Nakov et al., 2013), and the other data sets are named in the same way. In consideration of the difference in polarity distribution between the data set 2016devtest and 2013&2014test, we adopted only 2016devtest as development data. For subtasks B, C, D and E, the data is divided into many topic sets.

Experiments on Training Data
In order to improve the performance of sentiment analysis, we performed feature selection experiments on all subtasks, and the optimum feature sets are shown in Table 4. From Table 4, it is interesting to find that: (1) Negation features and tweet specific features such as emoticon and all-caps contribute to almost all subtasks. (2) The feature set with the best performance on subtask B is not very beneficial for subtask D even though they share the same data set, perhaps because of the essential difference between binary classification and binary quantification: in the latter, errors of different polarity compensate for each other. A similar observation holds for subtasks C and E. (3) The sentiment lexicon features improve performance on subtasks A, B and C, but are not very useful for subtasks D and E. A possible reason is that the latter two subtasks focus on quantification, while the sentiment lexicons only contain sentiment orientation rather than sentiment strength. (4) The word embedding features are not as effective as expected. This may be because we obtained sentence vectors by the simplest combination method described above, which does not take into account contextual information and semantic relations among words. In addition, since subtasks B, C, D and E focus on topic-level sentiment analysis, we tried to extract features from topic-related words rather than from the whole tweet. However, the preliminary experimental results showed that extracting features from related words underperformed extracting features from the whole tweet. A possible reason is that in many cases a tweet has only one sentiment polarity. Thus the sentiment polarity of the sentence can usually represent that of the topic, and extracting features only from the related words may drop important information.
10 http://alt.qcri.org/semeval2016/task4/data/uploads/eval.pdf

Results on Test Data
Based on the optimum feature sets shown in Table 4 and the configuration of classifiers described above, we trained separate models for each subtask. Table 5 shows the results of our systems and the top-ranked systems on all five subtasks. Our systems ranked 10th out of 34 submissions for subtask A, 4th/19 for subtask B, 2nd/11 for subtask C, 10th/14 for subtask D and 5th/10 for subtask E. Compared with the top-ranked systems, there is much room for improvement in our work. Although word embedding features were adopted in this work, we used the simplest combination method to convert word vectors into sentence vectors. A more effective convolution method is expected to improve the performance of sentiment analysis.
11 https://www.csie.ntu.edu.tw/~cjlin/liblinear/

Conclusion
In this paper, we extracted several traditional NLP features (e.g., linguistic features, tweet specific features, etc.) and word embedding features from the whole tweet and constructed classifiers using supervised machine learning algorithms to perform sentiment analysis at the sentence level (i.e., subtask A) and the topic level (i.e., subtasks B, C, D and E). Word embedding features were not as effective as expected, but since our way of using these features is quite simple and naive, it would be premature to conclude that word embedding features make only a marginal contribution. In future work, we plan to focus on developing an advanced convolutional neural network to model sentences with the aid of sentiment word vectors.