ECNU at SemEval-2017 Task 4: Evaluating Effective Features on Machine Learning Methods for Twitter Message Polarity Classification

This paper reports our submission to subtask A of Task 4 (Sentiment Analysis in Twitter, SAT) in SemEval 2017, i.e., Message Polarity Classification. We investigated several traditional Natural Language Processing (NLP) features, domain-specific features and word embedding features together with supervised machine learning methods to address this task. The officially released results showed that our system ranked above average.


Introduction
In recent years, with the emergence of social media, more and more users have shared and obtained information through microblogging websites such as Twitter. Research on this platform is drawing increasing attention from researchers and organizations. SemEval 2017 provides a common platform for researchers to explore sentiment analysis in Twitter (Task 4, Sentiment Analysis in Twitter, SAT) (Rosenthal et al., 2017), which includes five subtasks; we participated in subtask A: Message Polarity Classification. It aims at classifying the sentiment polarity of a whole tweet on a three-point scale (i.e., Positive, Negative and Neutral).
Given the character limit on tweets, sentiment orientation classification of tweets can be regarded as a sentence-level sentiment analysis task. Following previous work (Mohammad et al., 2013; Zhang et al., 2015; Wasi et al., 2014), we adopted a rich set of traditional NLP features, i.e., linguistic features (e.g., word n-grams, part-of-speech (POS) tags, etc.), sentiment lexicon features (i.e., the scores calculated from eight sentiment lexicons), and domain content features (e.g., emoticons, capitalized words, elongated words, etc.).
Considering the rich information in tweet metadata, we also extracted metadata features. Moreover, several word embeddings (including general word embeddings and sentiment word vectors) were adopted. We performed a series of experiments to explore the effectiveness of each feature type and of several supervised machine learning algorithms.

System Description
We first performed data preprocessing, then extracted several types of features from tweets and metadata for sentiment analysis and constructed supervised classification models for this task.

Data Preprocessing
First, we used a list of about 5,000 abbreviations and slang terms 1 to convert informal writing into regular forms, e.g., "3q" was replaced by "thank you" and "asap" by "as soon as possible". We also restored elongated words to their original forms, e.g., "soooooo" to "so". The processed data was then tokenized, POS-tagged, parsed, stemmed and lemmatized using Stanford CoreNLP (Manning et al., 2014).
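The normalization steps can be sketched as follows. This is a minimal illustration rather than the system's actual code: the two-entry `SLANG` dictionary is a hypothetical stand-in for the roughly 5,000-entry list, and collapsing any run of three or more identical letters to a single letter is an assumed rule (it would also shorten a word such as "gooooood" to "god").

```python
import re

# Hypothetical stand-in for the ~5,000-entry abbreviation and slang list.
SLANG = {"3q": "thank you", "asap": "as soon as possible"}

# A run of three or more identical letters (assumed definition of elongation).
ELONGATED = re.compile(r"([A-Za-z])\1{2,}")


def normalize(tweet):
    """Expand known slang tokens and collapse elongated letter runs."""
    words = []
    for token in tweet.split():
        token = SLANG.get(token.lower(), token)      # slang lookup is case-insensitive
        words.append(ELONGATED.sub(r"\1", token))    # "soooooo" -> "so"
    return " ".join(words)
```

The normalized text would then be passed to a tokenizer and tagger such as Stanford CoreNLP.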

Feature Engineering
In this task, we evaluated four types of features, i.e., linguistic features, sentiment lexicon features, domain-specific features and word embedding features.

Linguistic Features
• Word RF n-grams: We extracted unigram, bigram and trigram features at two different levels, i.e., the original word level and the word stem level. Considering that different words contribute differently to sentiment expression, we calculated the rf (relevance frequency) value (Lan et al., 2009) of each n-gram feature to weight its importance.
• POS: Generally, sentences carrying subjective emotions (i.e., positive and negative sentiment) tend to contain more adjectives and adverbs, while sentences without sentiment orientation (i.e., neutral) tend to contain more nouns. Therefore, we recorded the count of each POS tag in each sentence.
• Negation: Negation in a message often reverses its sentiment orientation. We manually collected 29 negations 2 from previous work (Zhang et al., 2015) and designed two binary features. One indicates whether there is any negation in the tweet and the other records whether the tweet contains more than one negation.
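To illustrate the rf weighting, the sketch below uses the relevance frequency formula of Lan et al. (2009), rf = log2(2 + a / max(1, c)), where a is the number of documents of the target category containing the term and c is the number of documents of the other categories containing it. Treating one sentiment class as the target and the remaining classes as the rest is an assumption made here for illustration; the paper does not state how rf was adapted to the three-class setting. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rf_weights(docs, labels, target, max_n=3):
    """Compute rf = log2(2 + a / max(1, c)) for every 1..max_n-gram.

    a: documents labelled `target` that contain the term
    c: documents with any other label that contain the term (assumed split)
    """
    in_target, in_rest = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        # Use a set so each term is counted once per document.
        terms = {g for n in range(1, max_n + 1) for g in ngrams(tokens, n)}
        (in_target if label == target else in_rest).update(terms)
    vocabulary = set(in_target) | set(in_rest)
    return {t: math.log2(2 + in_target[t] / max(1, in_rest[t])) for t in vocabulary}
```

A term that occurs mostly in the target class receives a larger weight, while a term seen only in the other classes receives the minimum weight of 1.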

Sentiment Lexicon Features (SentiLexi)
We employed the following eight sentiment lexicons to extract sentiment lexicon features: Bing Liu lexicon 3 , General Inquirer lexicon 4 , IMDB 5 , MPQA 6 , NRC Emotion Sentiment Lexicon 7 , AFINN 8 , NRC Hashtag Sentiment Lexicon 9 , and NRC Sentiment140 Lexicon 10 . Since certain words may carry mixed sentiments depending on context, it is not appropriate to assign only one sentiment score to such words. Therefore, the first five lexicons use two values for each word to represent its sentiment scores, i.e., one for positive sentiment and the other for negative sentiment. To unify the formats, we transformed the two scores into a single value by subtracting the negative score from the positive score. As a result, in all sentiment lexicons a positive score indicates a positive emotion and a negative score indicates a negative emotion. Given a tweet, we first converted all words into lowercase. Then, for each sentiment lexicon, we calculated the following six scores for the message: (1) the ratio of positive words to all words, (2) the ratio of negative words to all words, (3) the maximum sentiment score, (4) the minimum sentiment score, (5) the sum of sentiment scores, (6) the sentiment score of the last word in the tweet. If a word does not exist in a sentiment lexicon, its corresponding score is set to 0.
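The six per-lexicon scores can be sketched as below. The lexicon is assumed to be already unified into a single word-to-score dictionary; the function name and the toy lexicon in the usage note are hypothetical.

```python
def lexicon_scores(tokens, lexicon):
    """Return the six lexicon features for one tokenized tweet.

    `lexicon` maps a lowercase word to one signed score
    (positive > 0, negative < 0); unknown words score 0.
    """
    scores = [lexicon.get(token.lower(), 0.0) for token in tokens]
    if not scores:                      # empty tweet: all features are 0
        return [0.0] * 6
    n = len(scores)
    return [
        sum(1 for s in scores if s > 0) / n,   # (1) ratio of positive words
        sum(1 for s in scores if s < 0) / n,   # (2) ratio of negative words
        max(scores),                           # (3) maximum sentiment score
        min(scores),                           # (4) minimum sentiment score
        sum(scores),                           # (5) sum of sentiment scores
        scores[-1],                            # (6) score of the last word
    ]
```

For example, with the toy lexicon `{"love": 3.0, "awful": -2.0}`, the tokens `["I", "love", "this", "awful", "movie"]` give ratios of 0.2 and 0.2, a maximum of 3.0, a minimum of -2.0, a sum of 1.0 and a last-word score of 0.0. Applying the function to each of the eight lexicons would yield 48 features per tweet.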

Domain-Specific Features
Domain-specific features are extracted from two sources. One is from the content of tweets and the other is from tweet metadata information.
First, the domain-specific features extracted from tweet content are as follows:
• All-caps: One binary feature checks whether the tweet contains words in uppercase.
• Bag-of-Hashtags: We constructed a vocabulary of hashtags appearing in the training data and then adopted the bag-of-hashtags method for each tweet.
• Elongated: It indicates whether the raw text of tweet contains words with one continuous character repeated more than two times, e.g., "gooooood".
• Emoticon: We manually collected 67 emoticons from Internet 11 and designed the following 4 binary features:
  - to record the presence or absence of positive and negative emoticons respectively in the tweet;
  - to record whether the last token is a positive or a negative emoticon.
• Punctuation: Punctuation marks (e.g., exclamation mark (!) and question mark (?)) usually indicate the expression of sentiment. Therefore, we designed the following 6 binary features to record:
  - whether the tweet contains an exclamation mark;
  - whether the tweet contains more than one exclamation mark;
  - whether the tweet has a question mark;
  - whether the tweet contains more than one question mark;
  - whether the tweet contains both exclamation marks and question marks;
  - whether the last token of this tweet is an exclamation or question mark.
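Several of the content features listed above reduce to simple string checks, as in the following sketch. Whitespace tokenization, ignoring single-letter words for the all-caps check, and testing the final character rather than the final token are simplifying assumptions; the emoticon and hashtag features are omitted because they depend on resources not reproduced here.

```python
import re

# One character repeated more than two times in a row, e.g. "gooooood".
ELONGATED = re.compile(r"(\w)\1{2,}")


def content_features(text):
    """Binary all-caps, elongated and punctuation features for one raw tweet."""
    tokens = text.split()                      # assumed whitespace tokenization
    n_excl = text.count("!")
    n_ques = text.count("?")
    return {
        # Single-letter words such as "I" are ignored (assumption).
        "all_caps": any(len(t) > 1 and t.isupper() for t in tokens),
        "elongated": ELONGATED.search(text) is not None,
        "has_exclamation": n_excl >= 1,
        "multi_exclamation": n_excl > 1,
        "has_question": n_ques >= 1,
        "multi_question": n_ques > 1,
        "exclamation_and_question": n_excl >= 1 and n_ques >= 1,
        "ends_with_mark": text.rstrip().endswith(("!", "?")),
    }
```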
Recently, several studies using tweet metadata have reported good performance on sentiment classification (Tang et al., 2015; Chen et al., 2016). Inspired by them, we extracted a second group of domain-specific features from tweet metadata. We first used the Twitter API 12 to collect tweet metadata and then designed the following two types of features.
• Tweet metadata: Two binary features check whether the tweet has been retweeted and whether it has been liked by the authenticating user. In addition, two numeric features record the retweet count and the like count of the tweet. These two numeric features were scaled using [0, 1] normalization.
• User metadata: In addition to tweet metadata, information about the users who write tweets may also be useful. Thus the following 5 user metadata features were collected: friends count, followers count, statuses count, verified and default profile image. The first three numeric items were scaled using [0, 1] normalization and the remaining two are binary values.
In total, we collected 9 metadata features.
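A possible construction of the 9 metadata features is sketched below. It assumes that [0, 1] normalization means min-max scaling over the data set, that each tweet's metadata has been flattened into one dictionary keyed by Twitter API field names, and that missing fields default to 0 or False; none of these details is stated in the paper.

```python
NUMERIC = ["retweet_count", "favorite_count",
           "friends_count", "followers_count", "statuses_count"]
BINARY = ["retweeted", "favorited", "verified", "default_profile_image"]


def min_max(values):
    """Rescale numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def metadata_features(records):
    """Return one 9-dimensional row per tweet (5 scaled counts, 4 binary flags)."""
    # Missing fields default to 0 / False (assumption for unavailable metadata).
    columns = [min_max([float(r.get(k, 0)) for r in records]) for k in NUMERIC]
    columns += [[1.0 if r.get(k, False) else 0.0 for r in records] for k in BINARY]
    return [list(row) for row in zip(*columns)]
```

In practice the minimum and maximum would be estimated on the training set and reused for development and test data.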

Word Embedding Features
Word embedding is a continuous-valued vector representation of a word, which usually carries syntactic and semantic information. In this work, we employed five different word embeddings. GoogleW2V and GloVe are two sets of pre-trained word vectors downloaded from the Internet; the former is pre-trained on the news domain and the latter on tweets. We also trained TweetW2V on the tweet domain using the Google word2vec tool. Besides, to take the sentiment information of each word into account, previous work (Tang et al., 2014; Lan et al., 2016) presented methods to learn sentiment word vectors rather than general word vectors. The last two word vectors, i.e., SWV and SSWE, are expected to endow word embeddings with both sentiment and semantic information.
• GloVe: The 100-dimensional word vectors are pre-trained on Twitter using GloVe and are available from GloVe 14 .
• SWV: Our previous work (Lan et al., 2016) proposed a combined model to learn sentiment word vectors (SWV) for sentiment analysis. In this work, we learned SWV on the NRC140 tweet corpus with the dimension set to 200.
• SSWE: The sentiment-specific word embedding (SSWE) model proposed by Tang et al. (2014) uses a neural network with multiple hidden layers to train SSWE on 10 million tweets, with a dimensionality of 50.
To obtain a sentence vector, we simply applied min, max and mean pooling over the vectors of all words in a tweet. This combination strategy neglects word order in the tweet, but it is simple and straightforward. The final sentence vector V(s) is the concatenation [V_min(s) ⊕ V_max(s) ⊕ V_mean(s)].
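The pooling step can be sketched in plain Python as follows. Skipping out-of-vocabulary words and returning a zero vector when no word is covered are assumptions; the paper does not describe how such cases were handled.

```python
def sentence_vector(tokens, embeddings, dim):
    """Concatenate element-wise min, max and mean of the word vectors.

    `embeddings` maps a word to a list of `dim` floats; the result has
    3 * dim entries: [V_min(s), V_max(s), V_mean(s)].
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]  # skip OOV words
    if not vectors:
        return [0.0] * (3 * dim)
    columns = list(zip(*vectors))               # one tuple per embedding dimension
    v_min = [min(c) for c in columns]
    v_max = [max(c) for c in columns]
    v_mean = [sum(c) / len(c) for c in columns]
    return v_min + v_max + v_mean
```

With 100-dimensional GloVe vectors, for instance, each tweet would be represented by a 300-dimensional vector regardless of its length.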

Learning Algorithms
We treated this task as a three-way classification task and explored four supervised machine learning algorithms: Logistic Regression (LR) implemented in Liblinear 16 , and Support Vector Machine (SVM), Stochastic Gradient Descent (SGD) and AdaBoost, all implemented in scikit-learn 17 .

Evaluation Metric
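To evaluate system performance, the official evaluation criterion is macro-averaged recall, which is calculated over the three classes (i.e., positive, negative and neutral) as follows:
R_macro = (R_pos + R_neg + R_neu) / 3
where R_pos, R_neg and R_neu denote the recall of the positive, negative and neutral classes, respectively.

Experiments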
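As an illustration of the official metric, macro-averaged recall over the three classes can be computed as in the sketch below. The label strings and the function name are assumptions, and a class with no gold instances is given a recall of 0 here by convention.

```python
def macro_averaged_recall(gold, predicted,
                          classes=("positive", "negative", "neutral")):
    """Average the per-class recall over the given classes."""
    recalls = []
    for c in classes:
        total = sum(1 for g in gold if g == c)
        correct = sum(1 for g, p in zip(gold, predicted) if g == c and p == c)
        recalls.append(correct / total if total else 0.0)
    return sum(recalls) / len(classes)
```

Because every class contributes equally, a system cannot obtain a high score by favoring the majority class.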

Datasets
For the training set, the organizers provided only a list of tweet IDs and a script for participants to collect the tweets and their corresponding metadata. However, since not all tweets and their metadata were still available at download time, participants may have collected slightly different numbers of tweets as training data.

Experiments on Training Data
First, to explore the effectiveness of each feature type, we performed a series of experiments. Table 2 compares the contributions of the different features on the development set with the Logistic Regression algorithm. We make the following observations.
(1) All feature types contribute to sentiment polarity classification, and their combination achieves the best performance (i.e., 63.14%).
(2) Linguistic features act as the baseline and show their effectiveness for sentiment polarity prediction. Besides, SentiLexi contributes more than the domain-specific and word embedding features. Since sentiment lexicons are constructed from expert knowledge, they are beneficial for tweet sentiment polarity prediction.
(3) The domain-specific metadata is not as effective as expected. One possible reason is that some metadata could not be downloaded through the Twitter API.
Second, we also explored the performance of different learning algorithms. Table 3 compares the supervised learning algorithms using all of the above features. Logistic Regression outperformed the other algorithms. Therefore, the system configuration for submission uses all features and the LR algorithm.
Table 4 shows the results of our system and of the top-ranked systems provided by the organizers for this sentiment classification task. Compared with the top-ranked systems, there is much room for improvement in our work. There are several possible reasons for this performance lag. First, although the linguistic features are effective, the dimensionality of the word RF n-gram features is very large (approximately 79K n-grams), so these features dominate classification performance over the other, low-dimensional features. Second, our use of word embeddings is simple and straightforward and neglects word order and sentence structure. Third, the effect of metadata may be reduced because much of the metadata is missing.

Conclusion
In this paper, we extracted several traditional NLP features, domain-specific features and word embedding features from tweets and their metadata and adopted supervised machine learning algorithms to perform sentiment polarity classification.
The system performance ranks above average. In future work, we plan to focus on developing neural network methods to model sentences with the aid of sentiment word vectors.