MDSENT at SemEval-2016 Task 4: A Supervised System for Message Polarity Classification

This paper describes our system submitted for the Sentiment Analysis in Twitter task of SemEval-2016, specifically for the Message Polarity Classification subtask. Our system combines a Convolutional Neural Network and Logistic Regression for sentiment prediction, where the former makes use of embedding features while the latter utilizes various features such as lexicons and dictionaries.


Introduction
Recently, the rapid growth of user-generated content on the web has prompted increasing interest in research on sentiment analysis and opinion mining. A typical example is Twitter, where many users express feelings and opinions about various subjects. However, unlike in traditional media, the language used in social network services like Twitter is often informal, posing new challenges for text analysis.
The SemEval-2016 Sentiment Analysis in Twitter task (SESA-16) focuses on the sentiment analysis of tweets. As a continuation of SemEval-2015 Task 10, SESA-16 introduces several new challenges, including the replacement of classification with quantification and the move from a two/three-point scale to a five-point scale.
We participated in Subtask A of SESA-16, namely Message Polarity Classification, which seeks to predict a sentiment label for a given text. We model the problem as multi-class classification and combine the predictions given by two different classifiers: a Convolutional Neural Network (CNN) and Logistic Regression (LR). The former takes embedding-based features while the latter utilizes various features such as lexicons and dictionaries.
The remainder of this paper is structured as follows. In Section 2, we describe our system in detail, including the features and approaches. In Section 3, we give the details of the datasets used in the experiments, along with hyperparameter settings and training techniques. In Section 4, we report the experimental results and discuss them.

System Description
Our system aims at predicting the sentiment of a given message, i.e., whether the message expresses positive, negative or neutral emotion. To achieve that, we adopt two separate classifiers, CNN and LR, designed to utilize different types of features. The final prediction for sentiment is a combination of predictions given by both classifiers.

Data Preprocessing
Tweets often include informal text, making it essential to preprocess tweets before they are fed to the system. However, we keep the preprocessing to a minimum by only removing URLs and @User tags. We then further tokenize and tag tweets with arktweetnlp (Gimpel et al., 2011). In addition, all tweets are lower-cased.
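The preprocessing above can be sketched as follows. This is a minimal illustration: the actual tokenization and tagging are done with arktweetnlp, for which a plain whitespace split stands in here, and the regular expressions are our own assumptions about what counts as a URL or @User tag.

```python
import re

# Minimal preprocessing sketch: remove URLs and @User tags, then
# lower-case. A whitespace split stands in for arktweetnlp here.
def preprocess(tweet):
    tweet = re.sub(r"https?://\S+", "", tweet)   # strip URLs
    tweet = re.sub(r"@\w+", "", tweet)           # strip @User tags
    return tweet.lower().split()
```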

Logistic Regression
We use the LR classifier for features from sentiment lexicons and token clusters. We have used the following:
• clusters: the 1000 token clusters provided by the CMU tweet NLP tool. These clusters were produced with the Brown clustering algorithm on 56 million English-language tweets.
• manually-constructed sentiment lexicons: the NRC Emotion Lexicon (Mohammad and Turney, 2010), MPQA (Wilson et al., 2005), the Bing Liu Lexicon (Hu and Liu, 2004), and the AFINN-111 Lexicon (Nielsen, 2011).
For the Sentiment140 Lexicon and the Hashtag Sentiment Lexicon, we compute separate lexicon features for uni-grams and bi-grams, while for the other lexicons, only uni-gram lexicon features are produced. For each lexicon, let t be the token (uni-gram or bi-gram), p be the polarity, and s(t, p) be the score provided by the lexicon. We use the same features that are adopted by the NRC-Canada system (Mohammad et al., 2013):
• the total count of tokens in a tweet with s(t, p) > 0;
• the total score of tokens in a tweet, Σ_t s(t, p);
• the maximal score of tokens in a tweet, max_t s(t, p);
• the score of the last token in the tweet with s(t, p) > 0.
For each token, we also use features to describe whether it is present or absent in each of the 1000 token clusters. There are in total 1051 features for a tweet.
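The four lexicon features above can be sketched as follows. This is a toy illustration: the lexicon contents, scores, and function names are hypothetical, not those of the actual lexicons.

```python
# Sketch of the four lexicon features for one (lexicon, polarity) pair.
# The lexicon is assumed to map (token, polarity) pairs to real-valued
# scores; all names and scores here are illustrative.
def lexicon_features(tokens, lexicon, polarity="positive"):
    scores = [lexicon.get((t, polarity), 0.0) for t in tokens]
    positive = [s for s in scores if s > 0]      # tokens with s(t, p) > 0
    return {
        "count": len(positive),                              # total count with s(t, p) > 0
        "total": sum(scores),                                # Σ_t s(t, p)
        "max": max(scores) if scores else 0.0,               # max_t s(t, p)
        "last": positive[-1] if positive else 0.0,           # score of last token with s(t, p) > 0
    }

# Toy usage with a hypothetical two-entry lexicon:
toy_lexicon = {("good", "positive"): 0.8, ("great", "positive"): 0.9}
feats = lexicon_features(["a", "good", "day", "great"], toy_lexicon)
```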

Convolutional Neural Network
Deep learning models have achieved remarkable results for various NLP tasks, with most of them based on embeddings that represent words, characters, etc. with vectors of real values. Some work on embeddings suggests that word vectors generated by some embedding algorithms preserve many linguistic regularities (Mikolov et al., 2013a).
Among the various deep learning models, we use Convolutional Neural Networks, which have already been used for sentiment classification with promising results (Kim, 2014). We show the network architecture in Figure 1.
In general, the architecture contains two separate CNNs: one for word-based input maps and the other for character-based input maps. In our system, an input map for a tweet is a stack of the embeddings of its words/characters w.r.t. their order in the tweet. We initialize word embeddings with the publicly available 300-dimensional Google News embeddings trained with Word2Vec, but randomly initialize character embeddings with the same dimension. We fine-tune both kinds of embeddings during training.
Each of the two separate CNNs has its own set of convolutional filters. We fix the width of all filters to be the same as the corresponding embedding dimension, but set their height according to predefined types of n-grams. For example, a filter for bi-grams on an input map constructed with 300 dimensional word embeddings will have shape (2, 300), where 2 is the height and 300 is the width. In other words, we use each filter to capture and extract features w.r.t. a specific type of n-gram from an input map.
The feature maps generated by a particular filter may have different shapes for different input maps, due to variable tweet lengths. Thus we adopt a pooling scheme called max-over-time pooling (Collobert et al., 2011), which captures the most important feature, i.e., the one with highest value, for each feature map. This pooling scheme naturally deals with the variable tweet length problem.
After pooling, we first generate a representation for each CNN by concatenating its own pooled features, and then form a final representation by concatenating the two separate representations. The final representation is then fed into a multi-layer perceptron (MLP) classifier for predictions.
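A single convolutional filter with max-over-time pooling, as described above, can be sketched as follows. Dimensions and values here are toy illustrations, not the trained system.

```python
import numpy as np

# Sketch of one n-gram filter applied to an input map, followed by
# max-over-time pooling. The filter width equals the embedding
# dimension, so each window produces a single activation.
def conv_max_over_time(input_map, filt):
    """input_map: (tweet_len, dim) stack of embeddings;
    filt: (n, dim) filter for n-grams of size n."""
    n, dim = filt.shape
    length = input_map.shape[0] - n + 1
    # One feature per n-gram window (a valid 1-D convolution).
    feature_map = np.array(
        [np.sum(input_map[i:i + n] * filt) for i in range(length)]
    )
    # Max-over-time pooling keeps only the highest activation, making
    # the output independent of the (variable) tweet length.
    return feature_map.max()

rng = np.random.default_rng(0)
tweet = rng.standard_normal((7, 300))       # 7 tokens, 300-dim embeddings
bigram_filter = rng.standard_normal((2, 300))
pooled = conv_max_over_time(tweet, bigram_filter)   # a single scalar
```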

Regularization
For regularization we employ dropout with a constraint on the l2-norms of the weight vectors (Hinton et al., 2012). The key idea of dropout is to prevent co-adaptation of feature detectors (hidden units) by randomly dropping out a portion of hidden units during training. At test time, the learned weight vectors are scaled according to that portion and no dropout is applied.
In addition to dropout, we constrain the weight vectors by introducing an upper limit on their l2-norms. That is, for a weight vector w, we rescale it to have ||w||2 = l whenever ||w||2 > l, after each gradient descent step.
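The norm constraint can be sketched as:

```python
import numpy as np

# Sketch of the l2-norm constraint: after a gradient step, rescale any
# weight vector whose norm exceeds the limit l back onto the ball of
# radius l; vectors within the limit are left unchanged.
def clip_l2(w, l=3.0):
    norm = np.linalg.norm(w)
    return w * (l / norm) if norm > l else w
```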

Combination
We combine the predictions of the two classifiers in the form of a weighted summation. Let x be the input instance, P_LR(x) the prediction given by Logistic Regression, and P_CNN(x) the prediction given by the CNN. We introduce a scalar w such that the final prediction is given as

P(x) = w · P_CNN(x) + (1 − w) · P_LR(x)

We do not simply feed the features of LR together with the features generated by the CNN into a single classifier, because they are fundamentally different. The LR features derive from manually-created or automatically-generated dictionaries, scores, and clusters; they are a mixture of binary and real-valued features with high variance. The CNN features, in contrast, are generated by convolutional kernels over distributed representations (embeddings), leading to strong correlation and relatively smaller variance. Our preliminary experiments show that simply adding the LR features to the CNN features does not increase the performance of our system, but decreases it.
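The combination can be sketched as follows, assuming the weighted summation takes the convex form w · P_CNN + (1 − w) · P_LR over class-probability vectors (positive, neutral, negative); the example values are illustrative.

```python
import numpy as np

# Sketch of the weighted combination of the two classifiers'
# class-probability vectors (assumed order: positive, neutral, negative).
def combine(p_cnn, p_lr, w):
    return w * np.asarray(p_cnn) + (1.0 - w) * np.asarray(p_lr)

final = combine([0.7, 0.2, 0.1], [0.5, 0.3, 0.2], w=0.8)
label = int(np.argmax(final))   # index of the predicted class
```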

Datasets
We test our model on the SemEval-2016 benchmark dataset with two different settings. Setting 1 uses only the 2016 datasets while Setting 2 uses a combination of 2016 and 2013 datasets. We list the details of the two settings in Table 1.
For Setting 2, the merge of the two datasets is conducted w.r.t. the train/dev splits. Although we did not remove any "Not Available" tweets for Setting 1, we found a relatively high number of such tweets in the combined dataset, which may significantly influence system performance; thus we removed all "Not Available" tweets for Setting 2.

Settings    Train   Dev    Test
Setting 1   5975    1997   32009
Setting 2   12964   3100   32009

Table 1: Dataset statistics for the two settings. Setting 2 combines the SemEval-2016 and SemEval-2013 datasets, merged w.r.t. the train/dev splits, with "Not Available" tweets removed.

CNN
For both settings, we use rectified linear units. For the word-based CNN, we use filters of heights 1, 2, 3, and 4, while for the character-based CNN, we use filters of heights 3, 4, and 5. We use 100 feature maps for each filter. We also use a dropout rate of 0.5, an l2-norm constraint of 3, and a mini-batch size of 50. These values were picked on the Dev dataset of Setting 1.
We perform early stopping on the dev datasets during training. We use Adadelta as the optimization algorithm (Zeiler, 2012).

LR
We use the publicly available tool LibLinear for LR training. The cost is set to 0.5, with all other parameters left at their default settings. The cost is chosen based on the Dev dataset of Setting 1.

Combination
The scalar w is picked via grid search on the Dev dataset for both settings. Because of the random initialization of weights and the random shuffling of batches during CNN training, w differs across runs; thus we consider it a weight to be trained together with the other weights.
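The grid search can be sketched as follows. This is a hypothetical illustration: the grid granularity and the use of accuracy as the selection criterion on the Dev data are our assumptions.

```python
import numpy as np

# Hypothetical grid search for the combination scalar w on a dev set,
# scoring each candidate by dev accuracy (grid step and criterion are
# assumptions, not the paper's exact procedure).
def grid_search_w(p_cnn, p_lr, labels, grid=np.linspace(0, 1, 101)):
    p_cnn, p_lr = np.asarray(p_cnn), np.asarray(p_lr)
    labels = np.asarray(labels)
    best_w, best_acc = 0.0, -1.0
    for w in grid:
        preds = np.argmax(w * p_cnn + (1 - w) * p_lr, axis=1)
        acc = float(np.mean(preds == labels))
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc
```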

Embeddings
It is popular to initialize word vectors with pre-trained embeddings obtained by unsupervised algorithms trained over a large corpus, to improve system performance (Kim, 2014; Socher et al., 2011). We use the publicly available Word2Vec vectors, trained on 100 billion words from Google News using the continuous bag-of-words architecture (Mikolov et al., 2013b), to initialize word embeddings, but randomly initialize character embeddings. All embeddings have a dimensionality of 300. We also randomly initialize the embeddings of words that are not present in the vocabulary of the pre-trained word vectors.

Results and Discussion
We adopt the same evaluation measure as the one used in previous years, i.e.,

F1^PN = (F1^Pos + F1^Neg) / 2

where F1^Pos is defined as

F1^Pos = 2 π^Pos ρ^Pos / (π^Pos + ρ^Pos)

with ρ^Pos defined as the precision of predicted positive tweets, i.e., the fraction of tweets predicted to be positive that are indeed positive,

ρ^Pos = PP / (PP + PU + PN)

and π^Pos defined as the recall of positive tweets, i.e., the fraction of positive tweets that are predicted to be such,

π^Pos = PP / (PP + UP + NP)

where PP, PU, PN, UP, and NP are defined in Table 2, the confusion matrix for Subtask A provided by (Nakov et al.): an entry XY denotes the number of tweets that were labeled as X and should have been labeled as Y, with P, U, and N standing for Positive, Neutral, and Negative, respectively. F1^Neg is defined analogously.
We show the evaluation results of our system in Table 3, along with the top 15 reported systems. Originally we tested the system only with Setting 1, where it ranks 12th among 34 systems. However, we found that the system with Setting 1 performs poorly on older datasets, which may be due to the lack of training data. We therefore also tested our model with Setting 2 and report ranks computed against the same list of evaluation results reported by the 34 systems. It is apparent that our system benefits from more training data and shows a significant performance improvement (ranking 6th).
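The evaluation measure can be sketched as follows, using the confusion-matrix convention described above (rows are predicted labels, columns gold labels, in positive/neutral/negative order); the example matrix is illustrative.

```python
# Sketch of the F1^PN evaluation measure: the average of the F1 scores
# of the positive and negative classes, computed from a 3x3 confusion
# matrix C where C[x][y] counts tweets labeled as class x whose gold
# label is y (class order: positive, neutral, negative).
def f1_pn(C):
    def f1(k):
        predicted = sum(C[k])                 # row sum: tweets predicted as k
        actual = sum(row[k] for row in C)     # column sum: gold-label-k tweets
        rho = C[k][k] / predicted if predicted else 0.0   # precision
        pi = C[k][k] / actual if actual else 0.0          # recall
        return 2 * pi * rho / (pi + rho) if pi + rho else 0.0
    return (f1(0) + f1(2)) / 2   # average of positive and negative F1
```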
Another interesting observation is that, when provided with a large amount of training data, the CNN itself can perform very well, with LR assigned a very small weight during the combination procedure. We further test this finding by making 5 individual runs for both settings and checking the combination scalar w and the final evaluation score F1^PN. We list the corresponding results in Table 4. With more training data, w increased from an average of 0.654 to an average of 0.98, which is very close to 1, while the performance improved from an average of 0.587 to an average of 0.604. This suggests the possibility of using only deep learning techniques along with embeddings to achieve similar or even better performance than traditional systems that require many human-engineered features and knowledge bases.
Our future work includes a finer design of the CNN, e.g., performing classification in two stages: first subjectivity detection, then polarity classification. We will also explore unsupervised learning with the CNN, which would allow us to make use of the large number of tweets on the Internet. With such an increased amount of training data, our system may further improve its performance.