Don’t Count, Predict! An Automatic Approach to Learning Sentiment Lexicons for Short Text

We describe an efficient neural network method to automatically learn sentiment lexicons without relying on any manual resources. The method takes inspiration from the NRC method, which gives the best results in SemEval13 by leveraging emoticons in a large collection of tweets, using the PMI between words and tweet sentiments to define the sentiment attributes of words. We show that better lexicons can be learned by using them to predict the tweet sentiment labels. By using a very simple neural network, our method is fast and can take advantage of the same data volume as the NRC method. Experiments show that our lexicons give significantly better accuracies on multiple languages compared to the current best methods.


Introduction
Sentiment lexicons contain the sentiment polarity and/or the strength of words or phrases (Baccianella et al., 2010; Taboada et al., 2011; Tang et al., 2014a; Ren et al., 2016a). They have been used for rule-based (Taboada et al., 2011), unsupervised (Turney, 2002; Hu and Liu, 2004) and supervised (Mohammad et al., 2013; Tang et al., 2014b; Vo and Zhang, 2015) machine-learning-based sentiment analysis. As a result, constructing sentiment lexicons is an important research task in sentiment analysis.
Many approaches have been proposed to construct sentiment lexicons. Traditional methods manually label the sentiment attributes of words (Hu and Liu, 2004; Wilson et al., 2005; Taboada et al., 2011). One benefit of such lexicons is high quality. On the other hand, the methods are time-consuming, requiring language and domain expertise. Recently, statistical methods have been exploited to learn sentiment lexicons automatically (Esuli and Sebastiani, 2006; Baccianella et al., 2010; Mohammad et al., 2013). Such methods leverage knowledge resources (Bravo-Marquez et al., 2015) or labeled sentiment data (Tang et al., 2014a), giving significantly better coverage compared to manual lexicons.
Among the automatic methods, Mohammad et al. (2013) proposed to use tweets with emoticons or hashtags as training data. The main advantage is that such training data are abundant, and manual annotation can be avoided. Although emoticons or hashtags can be noisy indicators of the sentiment of a tweet, existing research (Go et al., 2009; Pak and Paroubek, 2010; Agarwal et al., 2011; Kalchbrenner et al., 2014; Ren et al., 2016b) has shown the effectiveness of such data when used to supervise sentiment classifiers. Mohammad et al. (2013) collect sentiment lexicons by calculating pointwise mutual information (PMI) between words and emoticons. The resulting lexicons give the best results in a SemEval13 benchmark (Nakov et al., 2013). In this paper, we show that a better lexicon can be learned by directly optimizing the prediction accuracy, taking the lexicon as input and the emoticon as output. The relationship between our method and the method of Mohammad et al. (2013) is analogous to the "predicting" vs "counting" relationship between distributed and distributional word representations (Baroni et al., 2014).
We follow Esuli and Sebastiani (2006) in using two simple attributes to represent each sentiment word, and take inspiration from Mikolov et al. (2013) in using a very simple neural network for sentiment prediction. The method can leverage the same data as Mohammad et al. (2013) and therefore benefits from both scale and annotation independence. Experiments show that the neural model gives the best results on standard benchmarks across multiple languages. Our code and lexicons are publicly available at https://github.com/duytinvo/acl2016.

Related work
Existing methods for automatically learning sentiment lexicons can be classified into three main categories. The first category augments existing lexicons with sentiment information. For example, Esuli and Sebastiani (2006) and Baccianella et al. (2010) use a tuple (pos, neg, neu) to represent each word, where pos, neg and neu stand for positivity, negativity and neutrality, respectively, training these attributes by extracting features from WordNet. These methods rely on the taxonomic structure of existing lexicons, which are limited to specific languages.
The second approach expands existing lexicons, which are typically manually labeled. For example, Tang et al. (2014a) apply a neural network to learn sentiment-oriented embeddings from a small amount of annotated tweets, and then expand a set of seed sentiment words by measuring vector space distances between words. Bravo-Marquez et al. (2015) extend an existing lexicon by classifying words using manual features. These methods are also limited to domains and languages with manual resources.
The third line of methods constructs lexicons from scratch by accumulating statistical information over large data. Turney (2002) proposes to estimate the sentiment polarity of words by calculating PMI between seed words and search hits. Mohammad et al. (2013) improve the method by computing sentiment scores using distantly supervised data from emoticon-bearing tweets instead of seed words. This approach can be used to automatically extract multilingual sentiment lexicons (Mohammad et al., 2015) without using manual resources, which makes it more flexible compared to the first two methods. We consider it as our baseline.
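We use the same data source as Mohammad et al. (2013) to train lexicons. However, rather than relying on PMI, we take a machine-learning approach, optimizing the accuracy of predicting emoticons from the lexicons. To leverage large data, we use a very simple neural network to train the lexicons.

Baseline
In the method of Mohammad et al. (2013), the sentiment score (SS) of a term w is computed as

$$\mathrm{SS}(w) = \mathrm{PMI}(w, pos) - \mathrm{PMI}(w, neg) \quad (1)$$

where pos represents the positive label and neg represents the negative label. PMI stands for pointwise mutual information, which is

$$\mathrm{PMI}(w, pos) = \log_2 \frac{\mathrm{freq}(w, pos) \cdot N}{\mathrm{freq}(w) \cdot \mathrm{freq}(pos)} \quad (2)$$

Here freq(w, pos) is the number of times the term w occurs in positive tweets, freq(w) is the total frequency of term w in the corpus, freq(pos) is the total number of tokens in positive tweets, and N is the total number of tokens in the corpus. PMI(w, neg) is calculated in a similar way. Since N and freq(w) cancel in the difference, Equation 1 is equal to:

$$\mathrm{SS}(w) = \log_2 \frac{\mathrm{freq}(w, pos) \cdot \mathrm{freq}(neg)}{\mathrm{freq}(w, neg) \cdot \mathrm{freq}(pos)} \quad (3)$$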
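As a rough illustration of the count-based baseline, the sketch below computes SS(w) as the log-ratio log2(freq(w, pos) · freq(neg) / (freq(w, neg) · freq(pos))), which is what the difference PMI(w, pos) − PMI(w, neg) reduces to. The function name `pmi_sentiment_scores`, the input format and the additive smoothing (used here only so that words seen in a single class do not cause a division by zero) are assumptions made for illustration; they are not details taken from Mohammad et al. (2013).

```python
import math
from collections import Counter


def pmi_sentiment_scores(tweets, smoothing=1.0):
    """Count-based sentiment scores from emoticon-labelled tweets.

    tweets: iterable of (tokens, label) pairs, label in {"pos", "neg"}.
    Both classes must be non-empty. Returns {word: SS(word)}.
    """
    freq = {"pos": Counter(), "neg": Counter()}
    for tokens, label in tweets:
        freq[label].update(tokens)

    # freq(pos) and freq(neg): total number of tokens in each class.
    total_pos = sum(freq["pos"].values())
    total_neg = sum(freq["neg"].values())

    scores = {}
    for word in set(freq["pos"]) | set(freq["neg"]):
        # Additive smoothing is an assumption of this sketch.
        f_pos = freq["pos"][word] + smoothing
        f_neg = freq["neg"][word] + smoothing
        scores[word] = math.log2((f_pos * total_neg) / (f_neg * total_pos))
    return scores
```

A word that is equally frequent in both classes receives a score of 0, and the sign of the score gives the polarity.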

Model
We follow Esuli and Sebastiani (2006), using positivity and negativity attributes to define lexicons. In particular, each word takes the form w = (n, p), where n denotes negativity and p denotes positivity (n, p ∈ R). As shown in Figure 1, given a tweet tw = w_1, w_2, ..., w_n, a simple neural network is used to predict its two-dimensional sentiment label y, which is [1, 0] for negative and [0, 1] for positive tweets. The predicted sentiment probability ŷ of a tweet is computed as:

$$h = \sum_{i} w_i$$
$$\hat{y} = \mathrm{softmax}(hW)$$

where W is fixed to the diagonal matrix (W ∈ R^{2×2}). We follow Go et al. (2009) in defining the sentiment labels of tweets via emoticons. Each token is first initialized with random negative and positive attribute scores in [-0.25, 0.25], and then trained by supervised learning. The cross-entropy error is employed as the objective function:

$$\mathrm{loss}(tw) = -\sum_{j} y_j \log \hat{y}_j$$

Backpropagation is applied to learn (n, p) for each token. Optimization is done using stochastic gradient descent over shuffled mini-batches, with the AdaDelta update rule (Zeiler, 2012). All models are trained over 5 epochs with a batch size of 50. Due to its simplicity, the method is very fast, training a sentiment lexicon over 9 million tweets within 35 minutes per epoch on an Intel Core™ i7-3770 CPU @ 3.40 GHz.
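The training procedure can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the released implementation: it takes W to be the identity matrix, sums the (n, p) vectors of a tweet, applies a softmax, and updates the word attributes with per-example plain SGD instead of AdaDelta over mini-batches of 50. The names `train_lexicon` and `softmax2` and the learning rate are hypothetical.

```python
import math
import random


def softmax2(a, b):
    """Numerically stable softmax over two logits."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)


def train_lexicon(tweets, epochs=5, lr=0.1, seed=0):
    """Learn (n, p) attributes per token by predicting tweet labels.

    tweets: list of (tokens, label), label 0 = negative, 1 = positive.
    Returns {token: [n, p]}.
    """
    rng = random.Random(seed)
    lexicon = {}
    for tokens, _ in tweets:
        for tok in tokens:
            if tok not in lexicon:
                # Random initialisation in [-0.25, 0.25], as in the text.
                lexicon[tok] = [rng.uniform(-0.25, 0.25),
                                rng.uniform(-0.25, 0.25)]

    data = list(tweets)
    for _ in range(epochs):
        rng.shuffle(data)
        for tokens, label in data:
            # h is the sum of the word vectors (W assumed to be identity).
            h_neg = sum(lexicon[t][0] for t in tokens)
            h_pos = sum(lexicon[t][1] for t in tokens)
            y_neg, y_pos = softmax2(h_neg, h_pos)
            # Gradient of cross-entropy w.r.t. h: predicted minus gold.
            g_neg = y_neg - (1.0 if label == 0 else 0.0)
            g_pos = y_pos - (1.0 if label == 1 else 0.0)
            # Plain SGD step (a simplification; the paper uses AdaDelta).
            for t in tokens:
                lexicon[t][0] -= lr * g_neg
                lexicon[t][1] -= lr * g_pos
    return lexicon
```

Because h is a plain sum, every token in a tweet receives the same gradient, so a word's attributes are shaped by the labels of all tweets it appears in.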

Sentiment Classification
The resulting lexicon can be used in both unsupervised and supervised sentiment classifiers. The former is implemented by summing the sentiment scores of all tokens contained in a given document (Taboada et al., 2011). If the total sentiment score is larger than 0, the document is classified as positive. Here only a single score per word is required, so we use the contrast between the positivity and negativity attributes (p − n) as the score. The supervised method makes use of sentiment lexicons as features for machine learning classification. Given a document D, we follow prior work and extract the following features:
• The number of sentiment tokens in D, where sentiment tokens are word tokens whose sentiment scores are not zero in a lexicon;
• The total sentiment score of a document: $\sum_{w_i \in D} \mathrm{SS}(w_i)$;
• The maximal score: $\max_{w_i \in D} \mathrm{SS}(w_i)$;
• The total scores of positive and negative words in D;
• The sentiment score of the last token in D.
Again we use SS(w_i) = p_{w_i} − n_{w_i} as the sentiment score of each word w_i, because the methods are based on a single sentiment score value for each word.
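Both uses of the lexicon can be sketched as follows, with SS(w) = p − n as the per-word score. This is an illustrative sketch: the function names and feature keys are hypothetical, words missing from the lexicon are assumed to score 0, and labelling every document whose total is not larger than 0 as negative is an assumption (the text only states the positive case).

```python
def sentiment_score(lexicon, token):
    """SS(w) = p - n; tokens missing from the lexicon score 0 (assumption)."""
    n, p = lexicon.get(token, (0.0, 0.0))
    return p - n


def classify_unsupervised(lexicon, tokens):
    """Positive if the summed score is larger than 0, otherwise negative."""
    total = sum(sentiment_score(lexicon, t) for t in tokens)
    return "positive" if total > 0 else "negative"


def lexicon_features(lexicon, tokens):
    """Lexicon features of a document for a supervised classifier."""
    scores = [sentiment_score(lexicon, t) for t in tokens]
    return {
        "num_sentiment_tokens": sum(1 for s in scores if s != 0),
        "total_score": sum(scores),
        "max_score": max(scores, default=0.0),
        "total_positive": sum(s for s in scores if s > 0),
        "total_negative": sum(s for s in scores if s < 0),
        "last_token_score": scores[-1] if scores else 0.0,
    }
```

The resulting dictionary can be turned into a feature vector for any linear classifier.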

Experimental Settings
Training data: To automatically obtain training data, we use the Twitter Developers API to crawl English and Arabic emoticon tweets from February 2014 to September 2014. We follow Go et al. (2009), removing all emoticons used to collect training data from the tweets, and Tang et al. (2014b), ignoring tweets that are shorter than 7 tokens. A Twitter tokenizer (Gimpel et al., 2011) is applied to preprocess all tweets. Rare words that occur fewer than 5 times are removed from the vocabulary. HTTP links and usernames are replaced by http and user, respectively. The statistics of the training data are shown in Table 1.
Sentiment classifier: We use LibLinear (Fan et al., 2008) as the supervised classifier on benchmark datasets. The parameter c is tuned by grid search (Hsu et al., 2003), using accuracy on the development set for the English dataset and five-fold cross-validation for the Arabic dataset.
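The preprocessing steps described under "Training data" might look roughly like the sketch below. Several details are assumptions made for illustration: the regular expressions for links and usernames, whitespace splitting as a stand-in for the Twitter tokenizer of Gimpel et al. (2011), and applying the length filter before the rare-word filter. Emoticon removal is omitted, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed patterns for links and @-mentions.
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")


def normalize(text):
    """Replace links and usernames by the placeholders http and user."""
    text = URL_RE.sub("http", text)
    return USER_RE.sub("user", text)


def build_training_set(labeled_tweets, min_tokens=7, min_count=5):
    """Tokenize, drop short tweets, then drop rare words.

    labeled_tweets: iterable of (text, label) pairs.
    Returns a list of (tokens, label) pairs.
    """
    tokenized = []
    for text, label in labeled_tweets:
        tokens = normalize(text).split()  # stand-in for a Twitter tokenizer
        if len(tokens) >= min_tokens:
            tokenized.append((tokens, label))

    # Count word frequencies over the tweets that were kept.
    counts = Counter(tok for tokens, _ in tokenized for tok in tokens)
    return [([t for t in tokens if counts[t] >= min_count], label)
            for tokens, label in tokenized]
```

The defaults `min_tokens=7` and `min_count=5` mirror the thresholds stated in the text.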
Evaluation: Following prior work, we employ precision (P), recall (R) and F1 score (F) to evaluate unsupervised classification. We follow Hsu et al. (2003) and use accuracy (acc), the tuning criterion, to evaluate supervised classification.
Code and lexicons: We make the Python implementation of our models and the resulting sentiment lexicons available at https://github.com/duytinvo/acl2016.
Table 4: Standard splits of ASTD.

English Lexicons
The Twitter benchmark of SemEval13 (Nakov et al., 2013) is used as the English test set. In order to evaluate both unsupervised and supervised methods, we follow Tang et al. (2014b), removing neutral tweets. The statistics are shown in Table 2. We compare our lexicon with the lexicons of NRC (Mohammad et al., 2013), HIT (Tang et al., 2014a) and WEKA (Bravo-Marquez et al., 2015). As shown in Table 3, using the unsupervised sentiment classification method (unsup) in Section 5, our lexicon gives significantly better results than the count-based lexicons of NRC. Under both the unsupervised and supervised settings, our lexicon yields the best results compared to other methods.

Arabic Lexicons
We employ the standard Arabic Twitter dataset ASTD (Nabil et al., 2015), which consists of about 10,000 tweets with 4 labels: objective (obj), negative (neg), positive (pos) and mixed subjective (mix). The standard splits of ASTD are shown in Table 4. We follow Nabil et al. (2015) in merging the training and validation data to learn the model. We compare our lexicon only with the lexicon of NRC, because the resources required by the other methods, such as that of Tang et al. (2014a), are not available. As shown in Table 5, our lexicon consistently gives the best performance on both the balanced and unbalanced datasets, showing the advantage of "predicting" over "counting".

Analysis
Table 6 shows examples of our predicting-based lexicon and the counting-based lexicon of Mohammad et al. (2013). First, both lexicons can correctly reflect the strength of emotional words (e.g. bad, worse, worst), which demonstrates that our method can learn statistical relevance as effectively as PMI. Second, we find many cases where our lexicon gives the correct polarity (e.g. suitable, lazy) but the lexicon of Mohammad et al. (2013) does not. To quantitatively compare the lexicons, we calculated the accuracies of their polarities (i.e. sign) by using the manually annotated lexicon of Hu and Liu (2004) as the gold standard. We take the intersection between the automatic lexicons and the lexicon of Hu and Liu (2004) as the test set, which contains 3270 words. The polarity accuracy of our lexicon is 78.2%, in contrast to 76.9% for the lexicon of Mohammad et al. (2013), demonstrating the relative strength of our method. Third, by having two attributes (n, p) instead of one, our lexicon is better in compositionality (e.g. SS(strong memory) > 0, SS(strong snowstorm) < 0).
Table 6: Example sentiment scores, where * denotes incorrect polarity.
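The polarity comparison against a gold lexicon amounts to checking sign agreement over the shared vocabulary. A minimal sketch is given below; the function name and input formats are hypothetical, and treating a score of exactly 0 as negative is an assumption of this sketch rather than something stated in the text.

```python
def polarity_accuracy(auto_scores, gold_polarity):
    """Fraction of shared words whose predicted sign matches the gold sign.

    auto_scores: {word: sentiment score} from an automatic lexicon.
    gold_polarity: {word: +1 or -1} from a manually annotated lexicon.
    Only words present in both lexicons are evaluated.
    """
    shared = [w for w in auto_scores if w in gold_polarity]
    if not shared:
        return 0.0
    correct = sum(
        1 for w in shared
        if (auto_scores[w] > 0) == (gold_polarity[w] > 0)
    )
    return correct / len(shared)
```

Restricting the evaluation to the intersection means that coverage differences between lexicons do not affect the accuracy figure.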

Conclusion
We constructed a sentiment lexicon for short text automatically using an efficient neural network, showing that prediction-based training is better than counting-based training for learning from large collections of tweets with emoticons. In standard evaluations, the method gave better accuracies across multiple languages compared to the state-of-the-art counting-based method.