INESC-ID: A Regression Model for Large Scale Twitter Sentiment Lexicon Induction

We present the approach followed by INESC-ID in the SemEval 2015 Twitter Sentiment Analysis challenge, subtask E. The goal was to determine the strength of association of Twitter terms with positive sentiment. Using two labeled lexicons, we trained a regression model to predict the sentiment polarity and intensity of words and phrases. Terms were represented as word embeddings induced in an unsupervised fashion from a corpus of tweets. Our system produced the top-ranked submission, attesting to the general adequacy of the proposed approach.


Introduction
Sentiment lexicons are one of the key resources for the automatic analysis of opinionated, emotive and subjective text (Liu, 2012). They compile words annotated with their prior polarity of sentiment, regardless of context. For instance, words such as beautiful or amazing tend to express a positive sentiment, whereas words like boring or ugly are considered negative. Most sentiment analysis systems either use word count methods, based on sentiment lexicons, or rely on text classifiers. In the former, the polarity of a message is estimated by computing the ratio of positive and negative sentiment-bearing words. Despite its simplicity, this method has been widely used (O'Connor et al., 2010; Bollen and Mao, 2011; Mitchell et al., 2013). Even more sophisticated systems, based on supervised classification, can be greatly improved with features derived from lexicons (Kiritchenko et al., 2014). However, manually created sentiment lexicons consist of a few carefully selected words. Consequently, they fail to capture the non-conventional word spellings and slang commonly found in social media.
This problem motivated the creation of a task in the SemEval 2015 Twitter Sentiment Analysis challenge. This task (subtask E) was intended to evaluate automatic methods of generating Twitter-specific sentiment lexicons. Given a set of words or phrases, the goal was to assign a score between 0 and 1, reflecting the intensity and polarity of sentiment these terms express. Participants were asked to submit a list with the candidate terms ranked according to sentiment score. This list was then compared to a ranked list obtained from human annotations, and the submissions were evaluated using the Kendall Tau rank correlation metric (Kendall, 1938).
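To make the evaluation metric concrete, the following pure-Python sketch computes the Kendall Tau rank correlation between two lists of scores. It is a simplified version that ignores ties; the function name and toy values are ours, for illustration only:

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation between two aligned score lists (no tie handling)."""
    n = len(scores_a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        product = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if product > 0:
            concordant += 1   # pair ordered the same way in both lists
        elif product < 0:
            discordant += 1   # pair ordered differently
    return (concordant - discordant) / (n * (n - 1) / 2)

gold = [0.9, 0.7, 0.4, 0.1]
print(kendall_tau(gold, gold))        # 1.0 (identical rankings)
print(kendall_tau(gold, gold[::-1]))  # -1.0 (reversed rankings)
```

A perfect submission thus scores 1, a fully inverted one scores -1, and random orderings hover around 0.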
In this paper, we describe a system developed for this challenge, based on a novel method to create large scale, domain-specific sentiment lexicons. The task is addressed as a regression problem, in which terms are represented as word embeddings, induced from a corpus of 52 million tweets. Then, using manually annotated lexicons, a regression model was trained to predict the polarity and intensity of sentiment of any word or phrase from that corpus. We found this approach to be effective for the proposed problem.
The rest of the paper proceeds as follows: we review work related to lexicon expansion in Section 2, describe the methods used to derive word embeddings in Section 3, and present the labeled data in Section 4. Our approach and the experimental results are presented in Sections 5 and 6, respectively. We conclude in Section 7.
Related Work

Most of the literature on automatic lexicon expansion consists of dictionary-based or corpora-based approaches. In the former, the main idea is to use a dictionary, such as WordNet, to extract semantic relations between words. Kim and Hovy (2006) simply assign to synonyms the same polarity as known words, and to antonyms the opposite polarity. Others create a graph from the semantic relationships to find new sentiment words and their polarity. Starting from a set of seed words, new terms are classified using a distance measure (Kamps et al., 2004), or by propagating labels along the edges of the graph (Rao and Ravichandran, 2009). However, given that dictionaries mostly describe conventional language, these methods are unsuited for social media.
Corpora-based approaches follow the assumption that the polarity of new words can be inferred from co-occurrence patterns with known words. Hatzivassiloglou and McKeown (1997) discovered new polar adjectives by looking at conjunctions found in a corpus: adjectives connected with and received the same polarity, whereas adjectives connected with but were assigned opposing polarities. Turney and Littman (2003) created two small sets of prototypical polar words, one containing positive and the other containing negative examples. The polarity of a new term was computed using the point-wise mutual information between that word and each of the prototypical sets (Lin, 1998). The same method was used by Kiritchenko et al. (2014) to create large-scale sentiment lexicons for Twitter.
A recently proposed alternative is to learn word embeddings specific to Twitter sentiment analysis, using distant supervision (Tang et al., 2014). The resulting features are then used in a supervised classifier to predict the polarity of phrases. This work is the most closely related to our approach, but it differs in that we use general word embeddings, learned from unlabeled data, and predict both the polarity and the intensity of sentiment.

Unsupervised Word Embeddings
In recent years, several models have been proposed to derive word embeddings from large corpora. These are essentially dense vector representations that implicitly capture syntactic and semantic properties of words (Collobert et al., 2011; Mikolov et al., 2013a; Pennington et al., 2014). Moreover, a notion of semantic similarity, as well as other linguistic regularities, seems to be encoded in the embedding space (Mikolov et al., 2013b). In word2vec, Mikolov et al. (2013a) induce word vectors with two simple neural network architectures, CBOW and skip-gram. These models estimate the optimal word embeddings by maximizing the probability that words within a given window size are predicted correctly.

Skip-gram and Structured Skip-gram
Central to the skip-gram is a log-linear model of word prediction. Given the i-th word from a sentence, w_i, the skip-gram estimates the probability of each word at a distance p from w_i as:

p(w_{i+p} | w_i) = softmax(C_p · E · w_i)

Here, w_i ∈ {0, 1}^{v×1} is a one-hot representation of the word, i.e., a sparse column vector of the size of the vocabulary v with a 1 in the position corresponding to that word. The model is parametrized by two matrices: E ∈ R^{e×v} is the embedding matrix, transforming the one-hot sparse representation into a compact real-valued space of size e; C_p ∈ R^{v×e} is a matrix mapping the real-valued representation to a vector with the size of the vocabulary v. A distribution over all possible words is then attained by exponentiating and normalizing over the v possible options. In practice, due to the large value of v, various techniques are used to avoid having to normalize over the whole vocabulary (Mikolov et al., 2013a). In the particular case of the structured skip-gram model, the matrix C_p depends on the relative position p between the words (Ling et al., 2015).
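The prediction step can be sketched with toy matrices. All numbers below are illustrative (real models use large v and e, and avoid computing the full softmax); selecting a column of E stands in for the product E · w_i with a one-hot w_i:

```python
import math

def softmax(logits):
    """Exponentiate and normalize a list of scores into a distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]

# Toy sizes: vocabulary v = 4, embedding size e = 2.
E = [[0.1, 0.3, -0.2, 0.5],   # e x v embedding matrix
     [0.4, -0.1, 0.2, 0.0]]
C_p = [[0.2, 0.1],            # v x e output matrix for relative position p
       [-0.3, 0.4],
       [0.5, -0.2],
       [0.1, 0.3]]

def skipgram_probs(word_index):
    # E . w_i for a one-hot w_i is just column `word_index` of E
    emb = [row[word_index] for row in E]
    logits = [sum(c * x for c, x in zip(row, emb)) for row in C_p]
    return softmax(logits)

probs = skipgram_probs(0)
print(probs)       # distribution over the 4 vocabulary words
print(sum(probs))  # sums to 1
```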
After training, the low-dimensional embedding E · w_i ∈ R^{e×1} encapsulates the information about each word and its surrounding contexts. In the CBOW model, the probability of a word given its context is instead defined as:

p(w_i | w_{i−d}, ..., w_{i+d}) = softmax(C · S_{i−d}^{i+d})

where S_{i−d}^{i+d} is the point-wise sum of the embeddings of all context words, from E · w_{i−d} to E · w_{i+d}, excluding the central word w_i, and once again C ∈ R^{v×e} is a matrix mapping the embedding space into the output vocabulary space of size v.

GloVe
The models discussed above rely on different assumptions about the relations between words within a context window. The Global Vector model, referred to as GloVe (Pennington et al., 2014), combines this approach with ideas drawn from matrix factorization methods, such as LSA (Deerwester et al., 1990). The embeddings are derived with an objective function that combines context window information with corpus statistics computed efficiently from a global term co-occurrence matrix.

Labeled Data
The evaluation of the shared task was performed on a labeled test set, consisting of 1315 words and phrases. To support the development of the systems, the organizers released a trial set with 200 examples. The terms are representative of the informal style of Twitter text, containing hashtags, slang, abbreviations and misspelled words. Negated expressions were also included. We show a sample of the words and phrases in Table 1. For more details on these datasets, see (Kiritchenko et al., 2014).
Given the small size of the trial set, we used an additional labeled lexicon: the Language Assessment by Mechanical Turk (LabMT) lexicon (Dodds et al., 2011). It consists of 10,000 words collected from different sources. Words were rated on a scale of 1 (sad) to 9 (happy), by users of Amazon's Mechanical Turk service, resulting in a measure of average happiness for each given word. Note that LabMT contains annotations for happiness but our goal is to label words in terms of sentiment polarity. We rely on the fact that some emotions are correlated with sentiment, namely, joy/happiness are associated with positivity, while sadness/disgust relate to negativity (Liu, 2012).
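Since the task's target scores lie in [0, 1] while LabMT ratings lie on a 1-to-9 happiness scale, the ratings must be mapped to a comparable range before being used as regression targets. The sketch below shows one natural linear rescaling; the function name and the exact mapping are our own illustration, not a documented step of the system:

```python
def rescale_happiness(h, lo=1.0, hi=9.0):
    """Linearly map a LabMT happiness rating in [1, 9] to a [0, 1] score."""
    return (h - lo) / (hi - lo)

print(rescale_happiness(1.0))  # 0.0 (saddest)
print(rescale_happiness(5.0))  # 0.5 (neutral)
print(rescale_happiness(9.0))  # 1.0 (happiest)
```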
This complementary dataset was used for two purposes: first, as the development set to evaluate and tune our system, and second, as additional training data for the candidate submission.

Proposed Approach
We addressed the task of inducing large scale sentiment lexicons for Twitter as a regression problem. Each term w_i was represented with an embedding E · w_i ∈ R^{e×1}, with e ∈ {50, 200, 400, 600, 1250}, where the 1250-dimensional vectors correspond to the concatenation of all the other embeddings, as discussed in Section 3. Then, the manually annotated lexicons were used to train a model that, given a new term w_j, predicts a score y ∈ [0, 1] reflecting the polarity and intensity of sentiment it conveys. Note that the embeddings represent words, so to deal with phrases we leveraged the compositional properties of word vectors (Mikolov et al., 2013b). Given that algebraic operations in the embedding space preserve meaning, we represented phrases as the sum or mean of the individual word vectors.
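The sum-or-mean composition of word vectors can be sketched as follows (toy two-dimensional vectors; real embeddings have e dimensions):

```python
def compose_phrase(words, embeddings, mode="mean"):
    """Represent a phrase as the sum or mean of its word vectors.
    `embeddings` maps each word to a list of floats."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        raise KeyError("no known words in phrase")
    summed = [sum(dim) for dim in zip(*vectors)]  # element-wise sum
    if mode == "sum":
        return summed
    return [x / len(vectors) for x in summed]

# Toy embedding table, for illustration only
toy = {"not": [0.1, -0.4], "happy": [0.8, 0.6]}
print(compose_phrase(["not", "happy"], toy, mode="sum"))   # ~[0.9, 0.2]
print(compose_phrase(["not", "happy"], toy, mode="mean"))  # ~[0.45, 0.1]
```

The composed vector is then fed to the regression model exactly as a single-word embedding would be.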

Learning the Word Embeddings
The first step of our approach requires a corpus of tweets to support the unsupervised learning of the embedding matrix E. We resorted to the corpus of 52 million tweets used by Owoputi et al. (2013) and the tokenizer described in the same work.
The CBOW and skip-gram embeddings were induced using the word2vec tool (https://code.google.com/p/word2vec/), while we used our own implementation of the structured skip-gram. The default values in word2vec were employed for most of the parameters, but we set the negative sampling rate to 25 words (Goldberg and Levy, 2014). For the GloVe model, we used the available implementation (http://nlp.stanford.edu/projects/GloVe/) with the default parameters. In all the models, words occurring less than 100 times in the corpus were discarded, resulting in a vocabulary of around 210,000 tokens.
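The vocabulary pruning step can be sketched as follows (toy corpus and a lowered threshold, for illustration; the actual system used min_count = 100 over 52 million tweets):

```python
from collections import Counter

def build_vocabulary(tokenized_tweets, min_count=100):
    """Keep only tokens occurring at least `min_count` times in the corpus."""
    counts = Counter(tok for tweet in tokenized_tweets for tok in tweet)
    return {tok for tok, c in counts.items() if c >= min_count}

corpus = [["good", "day"], ["good", "vibes"], ["good", "night"]]
print(build_vocabulary(corpus, min_count=2))  # {'good'}
```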

Hyperparameter Optimization and Model Selection
Regarding the choice of learning algorithm, several linear regression models were considered: least squares and regularized variants, namely the lasso, ridge and elastic net regressors. We also experimented with Support Vector Regression (SVR) using non-linear kernels, namely polynomial, sigmoid and Radial Basis Function (RBF). Most of these models have hyperparameters, so the combination of possible algorithms and parameters spans a huge configuration space. A brute-force search for the optimal model would be cumbersome and time-consuming. Instead, for each parameter, we defined meaningful distributions and ranges of values. Then, a hyperparameter optimization algorithm was used to find the best combination of model and parameters by sampling from the specified configuration pool. We used the Tree of Parzen Estimators algorithm, as implemented in HyperOpt (Bergstra et al., 2013).
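To make the idea of sampling from a configuration pool concrete, the sketch below draws random model/parameter combinations from a hypothetical space. Note that TPE, unlike this uniform sampler, adapts its sampling to the results of previous trials; the space and names here are our own illustration:

```python
import random

# Hypothetical configuration space (models and ranges are illustrative)
SPACE = {
    "model": ["ridge", "lasso", "elastic_net", "svr_rbf", "svr_poly"],
    "C": (0.1, 100.0),      # uniform range, used by the SVR models
    "alpha": (1e-4, 1.0),   # uniform range, used by the linear models
}

def sample_configuration(space, rng):
    """Draw one candidate configuration uniformly from the space."""
    model = rng.choice(space["model"])
    if model.startswith("svr"):
        return {"model": model, "C": rng.uniform(*space["C"])}
    return {"model": model, "alpha": rng.uniform(*space["alpha"])}

rng = random.Random(0)  # seeded for reproducibility
for _ in range(3):
    print(sample_configuration(SPACE, rng))
```

Each sampled configuration would then be trained and scored on the validation set, and the best-scoring one kept.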

Experiments
Learning word embeddings from large corpora allowed us to derive representations for a considerable number of words. Thus, we were able to find embeddings for 94% of the candidate terms. Using simple normalization steps, we could find embeddings for the remaining terms. However, we found that this improvement in recall had almost no impact on the performance of the system.
After mapping terms to their respective embeddings, we performed experiments to find the best regression model and respective hyperparameters. For this purpose, the LabMT lexicon was employed as the development set and the trial data as a validation set, against which different configurations were evaluated. After 1000 trials, the SVR model with RBF kernel was selected. Finally, we performed detailed experiments to compare word embedding models and vectors of different dimensions.

Submitted System
The evaluation on the trial data indicated that several configurations of embedding model and size could achieve the optimal results. Therefore, our candidate system was based on structured skip-gram embeddings with 600 dimensions, and SVR with RBF kernel. The hyperparameters were set to C = 50, ε = 0.05 and γ = 0.01, and the system was trained using the trial data and the LabMT lexicon.
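For reference, the RBF kernel underlying the submitted SVR model is k(x, z) = exp(−γ‖x − z‖²); a minimal sketch with the selected γ = 0.01:

```python
import math

def rbf_kernel(x, z, gamma=0.01):
    """RBF kernel k(x, z) = exp(-gamma * ||x - z||^2); gamma defaults to the
    value selected for the submitted system."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # 1.0 (identical vectors)
print(rbf_kernel([0.0] * 4, [1.0] * 4))    # exp(-0.04), close to 1
```

With such a small γ, the kernel decays slowly with distance, which suits the relatively diffuse geometry of high-dimensional embeddings.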

Results
The experiments showed that all the word embeddings have comparable capabilities. In Figure 1, we compare the results of different embeddings with the same regression model. Regarding the representation of phrases, the skip-gram and structured skip-gram embeddings performed better when averaged. However, GloVe and CBOW seemed to be more effective when summing the individual word vectors. These results were consistent across all the experiments. In terms of embedding size, we observed that smaller vectors tend to perform worse and, in general, concatenating vectors of different dimensionality improved performance. The CBOW representations were the only exception. This suggests that embeddings of different size capture different aspects of words.
Our final method attained the highest ranking result of the competition, with 0.63 rank correlation. Figure 2a shows the results of the top 4 submissions to SemEval. Further experiments were conducted after the release of the test set labels. We found that the concatenation of GloVe embeddings outperforms our previous choice of features on the test set. Surprisingly, these embeddings obtained the worst results on the trial data, but are much better than the others on the test set, achieving a rank correlation of 0.67. At this point, it is still not clear why this is the case. Figure 2b shows the performance of each embedding model under different combinations of training and test data. We can see that the proposed approach is effective, and our models outperform the other systems with as few as 200 training examples.

Conclusions
We described the approach followed by INESC-ID for subtask E of the SemEval 2015 Twitter Sentiment Analysis challenge. This work presents the first steps towards a general method to extract large-scale lexicons with fine-grained annotations from Twitter data. Although the results are encouraging, further investigation is required to shed light on some unexpected outcomes (e.g., the inconsistent behavior of the GloVe features on the trial and test sets). It should nonetheless be noted that, given the small size of the labeled sets, it is hard to draw definitive conclusions about the soundness of any method. Furthermore, the merit of a sentiment lexicon should be assessed in terms of its impact on the performance of concrete sentiment analysis applications.