RoseMerry: A Baseline Message-level Sentiment Classification System

In this paper, we propose a baseline message-level sentiment classification method, as developed for SemEval-2015 Task 10, Subtask B. This system leverages both hand-crafted features and message-level embedding features, and uses an SVM classifier for message-level sentiment classification. In pre-training the embedding features, we use one million randomly-selected tweets. We present results over SemEval-2015 Task 10, Subtask B, as well as the Stanford Sentiment Treebank. Our experiments show the effectiveness of our method over both datasets.


Introduction
The rise of social media such as blogs and microblogs (e.g., Twitter) has fueled interest in sentiment analysis (Liu, 2012; Pang and Lee, 2008). One of the most popular settings for carrying out sentiment analysis is at the sentence level or over individual micro-blog posts, using the simple three-label class set of POSITIVE, NEGATIVE and NEUTRAL (Liu, 2012; Pang and Lee, 2008; Rosenthal et al., 2014). Sentiment classification has been shown to have utility in various business intelligence applications, including product marketing, identifying new business opportunities, and managing a company's reputation (Liu, 2012; Pang and Lee, 2008).
Learning effective features plays an important role in building sentiment classification systems (Liu, 2012; Pang and Lee, 2008). For example, the winning system in the SemEval-2013 message polarity classification task (Nakov et al., 2013) was based on a rich set of hand-tuned features such as word-sentiment association lexicon features, word n-grams, punctuation, and emoticons, which were combined using a simple SVM-based classifier (Mohammad et al., 2013). Recently, there has been a surge of interest in representation learning, i.e. automatically learning word and document representations, often in the form of continuous-valued vectors or "embeddings", using auto-encoders or neural network language models (Mikolov et al., 2013; Le and Mikolov, 2014). Of particular relevance to message-level sentiment analysis, Tang et al. (2014) proposed a deep learning approach to learn sentiment-specific word representation features, and Le and Mikolov (2014) proposed a neural network auto-encoder to learn message-level vectors.
In this paper, we detail RoseMerry, a (strong) baseline sentiment analysis method that combines hand-crafted features with message-level embeddings generated by doc2vec (Le and Mikolov, 2014), using a linear-kernel SVM. Our interest in sentiment analysis stems from a desire to use it as part of a commercial text analytics system. As such, there is an overarching constraint on the system: all third-party components must be licensed in a manner which is compatible with commercial use. In our description below, we point out places where we were unable to use notable resources because of this constraint.
The message-level embeddings are pre-trained using doc2vec over the combination of the training data and a random sample of 1M English tweets, as detailed in Section 2.1. The hand-crafted features are based heavily on the work of Mohammad et al. (2013), and are detailed in Section 2.2. Finally, the d-dimensional message-level embedding is concatenated with the N-dimensional hand-crafted feature vector to form a (d + N)-dimensional combined feature vector. We experiment with each of the two feature subsets, in addition to the combined feature set. One significant divergence from Mohammad et al. (2013) is that we do not use many of the sentiment lexicons, due to non-commercial licensing. Given that one of the key findings in that work was that lexicons are among the most reliable features, we expect that this will have a large impact on our results.
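The concatenation step can be illustrated with a minimal sketch. The function name `build_feature_vector` and the string keys `"manual"`, `"doc2vec"` and `"all"` are hypothetical names chosen for illustration; they are not taken from the system itself.

```python
from typing import List, Sequence


def build_feature_vector(
    embedding: Sequence[float],
    handcrafted: Sequence[float],
    feature_set: str = "all",
) -> List[float]:
    """Return the feature vector for one message.

    embedding   -- d-dimensional message-level embedding
    handcrafted -- N-dimensional hand-crafted feature vector
    feature_set -- hypothetical selector: "manual", "doc2vec" or "all"
    """
    if feature_set == "manual":
        return list(handcrafted)
    if feature_set == "doc2vec":
        return list(embedding)
    if feature_set == "all":
        # Plain concatenation: the result has d + N dimensions.
        return list(embedding) + list(handcrafted)
    raise ValueError(f"unknown feature set: {feature_set}")
```

With d = 2 and N = 3, the combined vector has 5 dimensions, with the embedding values first and the hand-crafted values after them.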

Message-level embeddings
The message-level embeddings are generated using doc2vec (Le and Mikolov, 2014). In this framework, words and documents are represented in a common d-dimensional space, using real-valued vectors. The embeddings are learned by prediction of each word in a given document based on the document embedding and word embeddings of its surrounding context. The document vector acts as another word which captures the larger context of a word that is missing from its immediate word context.
The word and document vectors are trained using stochastic gradient descent, based on back propagation.
After pre-training, the document vector of each training document is used as its representation, and test documents are fed through the pre-trained autoencoder to generate a message-level embedding.
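The prediction objective described above can be written out as follows. This is a sketch in the style of the distributed-memory formulation of Le and Mikolov (2014); the text here does not state which doc2vec variant was used, so that choice is an assumption. The symbols are notation introduced for illustration: $\mathbf{d}_m$ is the vector of document $m$, $w_1, \ldots, w_T$ are its tokens, $k$ is the context half-width, $c_t$ is the set of context words around position $t$, $h$ combines the document and context word vectors (e.g. by averaging or concatenation), and $U$, $b$ are the softmax parameters.

```latex
\max \; \frac{1}{T} \sum_{t=k+1}^{T-k} \log p\left(w_t \mid \mathbf{d}_m, c_t\right),
\qquad c_t = \{w_{t-k}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+k}\}

p\left(w_t \mid \mathbf{d}_m, c_t\right) = \frac{\exp\left(y_{w_t}\right)}{\sum_{w'} \exp\left(y_{w'}\right)},
\qquad y = b + U \, h\left(\mathbf{d}_m, c_t\right)
```

Under this formulation, an unseen test document is typically embedded by holding the word vectors and softmax parameters fixed and running gradient descent on a fresh document vector only.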

Hand-crafted features
The hand-crafted features are largely lexical:
• word n-grams: binary features capturing the presence or absence of word n-grams observed in the training data, i.e. contiguous sequences of n words (n ∈ {1, 2, 3, 4}); we also included binary features for non-contiguous 3- and 4-grams included in the training data (n-grams with one non-final word removed)
• character n-grams: continuous features capturing the proportion of contiguous character n-grams (n ∈ {3, 4, 5}) of each type observed in the training data, which make up a given message
• proportion of words in all caps: the proportion of words which are in all caps (e.g. YAY)
• punctuation features: the proportion of tokens which are made up of multiple exclamation marks, question marks, or a combination of the two (e.g. ??!)
• elongated words: the proportion of words which have "elongated" vowels, i.e. a given vowel repeated more than twice (e.g. coool)
• proportion of emoticons: the proportion of tokens which are (a) positive- and (b) negative-polarity emoticons, as identified by Chris Potts' scripts
• polarity of message-final emoticon: if the last token is a polarised emoticon, its polarity (NEGATIVE, POSITIVE or None)
• negated words: the presence or absence of words in "negated contexts", where a negated context is defined as the span from a negation word to a punctuation mark (matching the regular expression [,.:;!?])
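Several of the features above can be sketched in a few lines of Python. This is an illustrative sketch only, not the system's actual extraction code. It assumes pre-tokenised input, and the negation word list, the `*` placeholder for a removed word, and the `_NEG` suffix are all assumptions made for this example.

```python
import re
from typing import Callable, List, Set

# Assumed, abbreviated negation list; the list used by the system is not given here.
NEGATION_WORDS = {"not", "no", "never", "cannot"}
CLAUSE_PUNCT = re.compile(r"^[,.:;!?]+$")                    # ends a negated context
ELONGATED = re.compile(r"([aeiou])\1{2,}", re.IGNORECASE)    # same vowel 3+ times
MULTI_PUNCT = re.compile(r"^[!?]{2,}$")                      # e.g. "??!"


def word_ngrams(tokens: List[str], max_n: int = 4) -> Set[str]:
    """Contiguous 1- to max_n-grams, plus 3- and 4-grams with one non-final
    word replaced by '*' (the placeholder symbol is an assumption)."""
    feats: Set[str] = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            feats.add(" ".join(gram))
            if n >= 3:
                for j in range(n - 1):  # every position except the last
                    feats.add(" ".join(gram[:j] + ["*"] + gram[j + 1:]))
    return feats


def proportion(tokens: List[str], pred: Callable[[str], bool]) -> float:
    """Fraction of tokens satisfying pred (0.0 for an empty message)."""
    return sum(1 for t in tokens if pred(t)) / len(tokens) if tokens else 0.0


def all_caps_proportion(tokens: List[str]) -> float:
    return proportion(tokens, lambda t: t.isalpha() and t.isupper())


def elongated_proportion(tokens: List[str]) -> float:
    return proportion(tokens, lambda t: ELONGATED.search(t) is not None)


def multi_punct_proportion(tokens: List[str]) -> float:
    return proportion(tokens, lambda t: MULTI_PUNCT.match(t) is not None)


def mark_negated(tokens: List[str]) -> List[str]:
    """Append '_NEG' to tokens between a negation word and the next
    punctuation token matching [,.:;!?]."""
    out: List[str] = []
    negated = False
    for t in tokens:
        if CLAUSE_PUNCT.match(t):
            negated = False
            out.append(t)
        elif negated:
            out.append(t + "_NEG")
        else:
            out.append(t)
            if t.lower() in NEGATION_WORDS or t.lower().endswith("n't"):
                negated = True
    return out
```

For example, `mark_negated("i do n't like this , but ok".split())` marks `like` and `this` as negated and leaves the tokens after the comma unchanged.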

Experiments
In this section, we will detail the experimental setup and the results of our experiments.

Datasets
We evaluate our method over two labelled datasets, and also use two unlabelled datasets to pre-train doc2vec, as detailed below.

Stanford Sentiment Treebank Dataset: a collection of movie review documents from www.rottentomatoes.com, which have been sentence tokenised and annotated for sentiment at the sentence level (Maas et al., 2011) and pre-partitioned into training and test data, as detailed in Table 2. Socher et al. (2013) additionally annotated the data at the phrase and lexical levels, but we use only the sentence-level annotations in this paper.

Unlabelled Datasets
Twitter Dataset: a random sample of 10M English tweets from a 5.3TB Twitter dataset crawled from 18 June to 4 Dec, 2014 using the Twitter Trending API. This is used as additional data to pre-train the message-level embeddings for the SemEval-2015 Dataset.
IMDB Dataset: a 100K sentence movie review dataset from www.imdb.com, collected by Maas et al. (2011). This is used as additional data to pre-train the message-level embeddings for the Stanford Sentiment Treebank dataset.

Experimental setup
To evaluate the effectiveness of the different feature sets, we report on results as follows:
• RM-manual: only hand-crafted features
• RM-doc2vec: only message-level embeddings
• RM-all: both hand-crafted features and message-level embeddings

As our primary evaluation metric, we use $F1_{PN}$, which is the average of the F1 scores for the POSITIVE (i.e., $F1_{pos}$) and NEGATIVE (i.e., $F1_{neg}$) classes:

$$F1_{PN} = \frac{F1_{pos} + F1_{neg}}{2}$$

We also report the overall classification accuracy (Acc) across the three classes, and the F1 score of each class (i.e., $F1_{pos}$, $F1_{neg}$ and $F1_{neu}$).
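The evaluation metrics described above (the mean of the POSITIVE and NEGATIVE F1 scores, plus accuracy) can be sketched as follows. This is a minimal illustration rather than the official task scorer; the function names are hypothetical, and returning 0.0 for a class with no true positives is a convention assumed here.

```python
from typing import Sequence


def f1_for_class(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """Per-class F1: harmonic mean of precision and recall for `label`."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0  # assumed convention when precision or recall is undefined/zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_pn(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Average of the POSITIVE and NEGATIVE F1 scores; NEUTRAL is predicted
    and can cause errors, but its own F1 is not part of the average."""
    return (f1_for_class(gold, pred, "POSITIVE")
            + f1_for_class(gold, pred, "NEGATIVE")) / 2


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of messages whose predicted label matches the gold label."""
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)
```

Note that NEUTRAL still influences the primary metric indirectly: a NEUTRAL message misclassified as POSITIVE lowers the precision of the POSITIVE class.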
For the message-level embeddings, we used d = 100 and a context window size of 10. We used LibSVM with a linear kernel and default parameter settings.

Experimental results
In this section, we present the results first over the SemEval-2015 datasets, and then over the Stanford Sentiment Treebank.

Results for SemEval-2015
The results for the SemEval-2015 test set and progress test set are shown in Table 3. Figure 2a is a learning curve of RM-doc2vec, pre-trained over varying numbers of documents. We can see that the results plateau at 1M tweets; this is the document collection size we used for pre-training RM-doc2vec and RM-all in our official runs. The overall Acc and F1 of each class for the three feature sets are shown in Figure 2b. RM-doc2vec is marginally better than RM-manual overall, and for the NEGATIVE class in particular. When combined, RM-all outperforms the two component feature sets across all classes, pointing to (weak) complementarity between the two feature sets.

Results for the Stanford Sentiment Treebank
RM-doc2vec performed best when pre-trained over 50K documents (plus the Stanford Sentiment Treebank data), and this is the model we include in the remainder of our results over this dataset. Figure 3b shows the Acc, in addition to the per-class F1, over the Stanford Sentiment Treebank for the three feature sets. The overall trend is strikingly similar to that for SemEval-2015, with the combined feature set performing marginally better than the two component feature sets in all cases.

Conclusion
In this paper, we described the method used in our official submission to the SemEval-2015 message polarity classification task, which combines message-level embeddings with hand-crafted features using a simple linear-kernel SVM. We presented results over the SemEval-2015 dataset and Stanford Sentiment Treebank, and showed that the combined feature set achieved the best results. The difference between the combined feature set and the two component feature sets is not statistically significant (based on randomised estimation, p > 0.05).
While we were not able to achieve state-of-the-art results, we commend the proposed approach as a strong baseline method.