Context-Sensitive Lexicon Features for Neural Sentiment Analysis

Sentiment lexicons have been leveraged as a useful source of features for sentiment analysis models, leading to state-of-the-art accuracies. However, most existing methods use sentiment lexicons without considering context, typically taking the count, sum of strength, or maximum sentiment score over the whole input. We propose a context-sensitive lexicon-based method based on a simple weighted-sum model, using a recurrent neural network to learn the sentiment strength, intensification and negation of lexicon sentiments in composing the sentiment value of sentences. Results show that our model can not only learn such operation details, but also give significant improvements over state-of-the-art recurrent neural network baselines that do not use lexical features, achieving the best results on a Twitter benchmark.


Introduction
Sentiment lexicons (Hu and Liu, 2004; Wilson et al., 2005; Esuli and Sebastiani, 2006) have been a useful resource for opinion mining (Kim and Hovy, 2004; Agarwal et al., 2011; Moilanen and Pulman, 2007; Choi and Cardie, 2008; Mohammad et al., 2013; Guerini et al., 2013; Vo and Zhang, 2015). Containing sentiment attributes of words such as polarities and strengths, they can provide a word-level foundation for analyzing the sentiment of sentences and documents. We investigate an effective way to use sentiment lexicon features.
A traditional way of deciding the sentiment of a document is to sum the sentiment values of all words in the document that exist in a sentiment lexicon (Turney, 2002; Hu and Liu, 2004). This simple method has been shown to give surprisingly competitive accuracies on several sentiment analysis benchmarks, and is still the standard practice for specific research communities with mature domain-specific lexicons, such as finance (Kearney and Liu, 2014) and product reviews (Ding et al., 2008).
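As a concrete illustration, the lexicon-sum method can be sketched in a few lines; the lexicon values and sentence here are invented for the example.

```python
# Sketch of the traditional lexicon-sum method: classify a document by the
# sign of the summed lexicon values of its matched words.
# The lexicon entries below are illustrative, not from a real lexicon.
def lexicon_sum_polarity(tokens, lexicon):
    """Classify by the sign of the summed lexicon values of matched words."""
    total = sum(lexicon.get(tok, 0.0) for tok in tokens)
    return "positive" if total > 0 else "negative"

lexicon = {"great": 1.0, "boring": -1.5, "fine": 0.5}
label = lexicon_sum_polarity("a great but slightly boring film".split(), lexicon)
```

Note that the method ignores context entirely: "great" contributes +1.0 regardless of any surrounding negation, which is exactly the limitation discussed next.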
More sophisticated sentence-level features, such as the counts of positive and negative words, their total strength, and the maximum strength, have also been exploited (Kim and Hovy, 2004; Wilson et al., 2005; Agarwal et al., 2011). Such lexicon features have been shown to be highly effective, leading to the best accuracies in the SemEval shared task (Mohammad et al., 2013). On the other hand, they are typically based on bag-of-word models, and hence suffer from two limitations. First, they do not explicitly handle semantic compositionality (Polanyi and Zaenen, 2006; Moilanen and Pulman, 2007; Taboada et al., 2011), some examples of which are shown in Figure 1. The composition effects can exhibit intricacies such as negation over intensification (e.g. not very good), shifting negation (e.g. not terrific) vs. flipping negation (e.g. not acceptable), content-word negation (e.g. removes my doubts) and unbounded dependencies (e.g. Nobody gives a good performance).
Second, they cannot effectively deal with word sense variations (Devitt and Ahmad, 2007; Denecke, 2009). Guerini et al. (2013) show the challenges in modeling the correlation between context-dependent posterior word sentiments and their context-independent priors. For example, the sentiment value of "cold" varies between "cold beer", "cold pizza" and "cold person" due to sense and context differences. Such variations raise difficulties for a sentiment classifier of a bag-of-word nature, since they can depend on semantic information over long phrases or the full sentence.
We investigate a method that can potentially address the above issues, using a recurrent neural network to capture context-dependent semantic composition effects over sentences. As shown in Figure 2, the model is conceptually simple, using a weighted sum of lexicon sentiments and a sentence-level bias to estimate the sentiment value of a sentence. The key idea is to use a bi-directional long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997; Graves et al., 2013) model to capture global syntactic dependencies and semantic information, based on which the weight of each sentiment word, together with a sentence-level sentiment bias score, is predicted. Such weights are context-sensitive, and can express flipping negation by taking negative values.
The advantages of the recurrent network model over existing semantic-composition-aware discrete models such as that of Choi and Cardie (2008) include its capability of representing non-local and subtle semantic features without the challenge of designing sparse manual features. On the other hand, compared with neural network models, which have recently given state-of-the-art accuracies (Li et al., 2015; Tai et al., 2015), our model has the advantage of leveraging sentiment lexicons as a useful resource. To our knowledge, we are the first to integrate sentiment lexicons and their composition operations into a deep neural model for sentiment analysis.
The conceptually simple model gives strong empirical performance. Results on standard sentiment benchmarks show that our method gives accuracies competitive with state-of-the-art models in the literature. As a by-product, the model can also correctly identify the compositional changes to the sentiment value of each word given a sentential context. Our code is released at https://github.com/zeeeyang/lexicon_rnn.

Figure 2: Overall model structure. The sentiment score of the sentence "not a bad movie at all" is a weighted sum of the scores of the sentiment words "not" and "bad" and a sentence-level bias score b. score(not) and score(bad) are prior scores obtained from sentiment lexicons. γ1 and γ3 are context-sensitive weights for the sentiment words "not" and "bad", respectively.

Related Work
There exist many statistical methods that exploit sentiment lexicons (Kim and Hovy, 2004; Agarwal et al., 2011; Mohammad et al., 2013; Guerini et al., 2013; Tang et al., 2014b; Vo and Zhang, 2015; Cambria, 2016). Mohammad et al. (2013) leverage a large sentiment lexicon in an SVM model, achieving the best results in the SemEval 2013 benchmark on sentence-level sentiment analysis (Nakov et al., 2013). Compared to these methods, our model has two main advantages. First, we use a recurrent neural network to model context, thereby exploiting non-local semantic information. Second, our model offers context-sensitive operational details for each word.
Several previous methods move beyond bag-of-word models in leveraging lexicons. Most notably, Moilanen and Pulman (2007) introduce ideas from compositional semantics (Montague, 1974) into sentiment operations, developing a set of composition rules for handling negation. Along this line, Taboada et al. (2011) develop a lexicon and a collection of sophisticated rules for addressing intensification, negation and other phenomena. Different from these rule-based methods, Choi and Cardie (2008) use a structured linear model to learn semantic compositionality relying on a set of manual features. In contrast, we leverage a recurrent neural model to induce semantic composition features automatically. Our weighted-sum representation of semantic compositionality is formally simpler than fine-grained rules such as those of Taboada et al. (2011). However, it is sufficient for describing the resulting effect of complex and context-dependent operations, with the semantic composition process being modeled by the LSTM. Our sentiment analyzer also enjoys a more competitive LSTM baseline compared with traditional discrete models.
Our work is also related to recent work on using deep neural networks for sentence-level sentiment analysis, which exploits convolutional (Kalchbrenner et al., 2014; Kim, 2014; Ren et al., 2016), recursive (Socher et al., 2013; Dong et al., 2014; Nguyen and Shirai, 2015) and recurrent neural networks (Liu et al., 2015), giving highly competitive accuracies. As our baseline, the LSTM (Tai et al., 2015; Li et al., 2015) stands among the best neural methods. Our model differs from these prior methods in mainly two aspects. First, we introduce sentiment lexicon features, which effectively improve classification accuracies. Second, we learn extra operation details, namely the weights on each word, automatically as hidden variables. While the baseline uses LSTM features to perform end-to-end mapping between sentences and sentiments, our model uses them to induce the lexicon weights, via which word-level sentiments are composed to derive sentence-level sentiment.

Model
Formally, given a sentence s = w_1 w_2 ... w_n and a sentiment lexicon D, denote the subjective words in s as w^D_{j_1} w^D_{j_2} ... w^D_{j_m}. Our model calculates the sentiment score of s according to D as

$$\mathrm{Score}(s) = \sum_{t=1}^{m} \gamma_{j_t}\, \mathrm{Score}(w^D_{j_t}) + b \quad (1)$$

where Score(w^D_{j_t}) is the sentiment value of w_{j_t}, γ_{j_t} are sentiment weights and b is a sentence-level bias.
The sentiment values of words and sentences are real numbers, with the sign indicating the polarity and the absolute value indicating the strength.
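The weighted-sum scoring scheme described above can be sketched as follows; the lexicon entries, weights and bias are invented illustrative values rather than learned ones.

```python
# Sketch of the weighted-sum scoring scheme: sum context-sensitive weights
# times prior word sentiments, plus a sentence-level bias.
# Lexicon scores, weights gamma and bias b are illustrative stand-ins for
# the values the model would predict from context.
def sentence_score(tokens, lexicon, gamma, b):
    """Weighted sum of lexicon sentiments for subjective words, plus bias b."""
    score = b
    for tok in tokens:
        if tok in lexicon:
            score += gamma[tok] * lexicon[tok]
    return score

lexicon = {"not": -0.5, "bad": -1.5}   # prior sentiment values from a lexicon
gamma = {"not": -0.3, "bad": -0.8}     # context-sensitive weights
tokens = "not a bad movie at all".split()
s = sentence_score(tokens, lexicon, gamma, b=0.2)
```

With the negative weights above, the two negative prior scores are flipped into positive contributions, yielding an overall positive sentence score, matching the intended reading of "not a bad movie at all".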
As shown in Figure 2, our neural model consists of three main layers, namely the input layer, the feature layer and the output layer. The input layer maps each word in the input sentence into a dense real-value vector. The feature layer exploits a bidirectional LSTM (Graves and Schmidhuber, 2005;Graves et al., 2013) to extract non-local semantic information over the sequence. The output layer calculates a weight score for each sentiment word, as well as an overall sentiment bias of the sentence.
In this figure, the score of the sentence "not a bad movie at all" is decided by a weighted sum of the sentiments of "bad" and "not", and a sentiment shift bias based on the sentence structure. Ideally, the weight on "not" should be a small negative value, which results in a slightly positive sentiment shift. The weight on "bad" should be negative, which represents a flip in polarity. These weights jointly model a negation effect that involves both shifting and flipping.

Bidirectional LSTM
We use an LSTM (Hochreiter and Schmidhuber, 1997) for feature extraction, which recurrently processes the sentence s token by token. For each word w_t, the model calculates a hidden state vector h_t. An LSTM cell block makes use of an input gate i_t, a memory cell c_t, a forget gate f_t and an output gate o_t to control information flow from the history x_1 ... x_t and h_1 ... h_{t-1} to the current state h_t. Formally, h_t is computed as follows:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

Here x_t is the word embedding of word w_t, σ denotes the sigmoid function and ⊙ is element-wise multiplication.

We apply a bidirectional extension of the LSTM (BiLSTM) (Graves and Schmidhuber, 2005; Graves et al., 2013), shown in Figure 2, to encode the input sentence s both left-to-right and right-to-left. The BiLSTM maps each word w_t to a pair of hidden vectors h^L_t and h^R_t, which denote the hidden vectors of the left-to-right and right-to-left LSTMs, respectively. We use different parameters for the left-to-right LSTM and the right-to-left LSTM. These state vectors are used as features for calculating the sentiment weights γ.
In addition, we append a sentence end marker w_<e> to the left-to-right LSTM and a sentence start marker w_<s> to the right-to-left LSTM. The hidden state vectors of w_<s> and w_<e> are denoted h^R_s and h^L_e, respectively.
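As a minimal sketch of the feature layer, the following numpy code runs an LSTM in both directions over a toy input; the parameters are random stand-ins for trained weights, and the tiny dimensions are chosen purely for illustration.

```python
import numpy as np

# Minimal numpy sketch of the BiLSTM feature layer. Parameters are random
# stand-ins; each direction uses its own parameter set, as in the text.
def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b stack the input/forget/output/cell transforms."""
    z = W @ x + U @ h_prev + b
    H = h_prev.size
    i = 1 / (1 + np.exp(-z[0:H]))        # input gate
    f = 1 / (1 + np.exp(-z[H:2*H]))      # forget gate
    o = 1 / (1 + np.exp(-z[2*H:3*H]))    # output gate
    g = np.tanh(z[3*H:4*H])              # candidate cell value
    c = f * c_prev + i * g               # new memory cell
    h = o * np.tanh(c)                   # new hidden state
    return h, c

def run_lstm(xs, W, U, b, H):
    """Run the LSTM over a sequence of embeddings, returning all states."""
    h, c = np.zeros(H), np.zeros(H)
    states = []
    for x in xs:
        h, c = lstm_step(x, h, c, W, U, b)
        states.append(h)
    return states

rng = np.random.default_rng(0)
E, H, n = 4, 3, 5                        # embedding size, hidden size, length
xs = [rng.normal(size=E) for _ in range(n)]
params = lambda: (rng.normal(size=(4*H, E)), rng.normal(size=(4*H, H)),
                  np.zeros(4*H))
hL = run_lstm(xs, *params(), H)              # left-to-right states h^L_t
hR = run_lstm(xs[::-1], *params(), H)[::-1]  # right-to-left states h^R_t
```

Each word w_t then has the feature pair (h^L_t, h^R_t) used by the output layer.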

Output Layer
The base score. Given a lexicon word w_{j_t} in the sentence s (w_{j_t} ∈ D), we use the hidden state vectors h^L_{j_t} and h^R_{j_t} in the feature layer to calculate a weight value τ_{j_t}. As shown in Figure 3, a two-layer neural network is used to induce τ_{j_t}. In particular, a hidden layer combines h^L_{j_t} and h^R_{j_t} using a nonlinear tanh activation:

$$p^s_{j_t} = \tanh(W^L_{ps} h^L_{j_t} + W^R_{ps} h^R_{j_t} + b_{ps})$$

The resulting hidden vector p^s_{j_t} is then mapped into τ_{j_t} using another tanh layer:

$$\tau_{j_t} = 2\tanh(W_{pw}\, p^s_{j_t} + b_{pw})$$
We choose the 2tanh function to make the learned weights conceptually useful. The factor 2 is introduced to model the effect of intensification. Since the range of the tanh function is [−1, 1], the range of 2tanh is [−2, 2]. Intuitively, a weight value of 1 maps the word sentiment directly to the sentence sentiment, such as the weight for "good" in "This is good". A weight value in (1, 2] represents intensification, such as the weight for "bad" in "very bad". Similarly, a weight value in (0, 1) represents weakening, and a weight in (−2, 0) represents various scales of negation. Given all lexicon words w^D_{j_t} in the sentence, we calculate a base score for the sentence:

$$S_{base} = \frac{1}{m} \sum_{t=1}^{m} \tau_{j_t}\, \mathrm{Score}(w^D_{j_t})$$

By averaging the score of each word, the resulting S_base is confined to [−2α, 2α], where α is the maximum absolute value of word sentiment. In the above equations, W^L_{ps}, W^R_{ps}, b_{ps}, W_{pw} and b_{pw} are model parameters.
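The weight computation and base score can be sketched as below, with random stand-in parameters and invented prior scores; the two-layer tanh/2tanh structure follows the description above.

```python
import numpy as np

# Sketch of the weight network: a hidden tanh layer over a lexicon word's
# BiLSTM states, a 2*tanh output layer, then an averaged base score.
# All parameters and inputs are illustrative random values, not trained ones.
rng = np.random.default_rng(1)
H, P = 3, 4                                    # LSTM and hidden layer sizes
W_L, W_R = rng.normal(size=(P, H)), rng.normal(size=(P, H))
b_ps, W_pw, b_pw = np.zeros(P), rng.normal(size=P), 0.0

def weight(h_left, h_right):
    """Compute tau for one lexicon word from its two BiLSTM state vectors."""
    p = np.tanh(W_L @ h_left + W_R @ h_right + b_ps)   # hidden layer
    return 2 * np.tanh(W_pw @ p + b_pw)                # tau in [-2, 2]

# Two lexicon words with invented prior scores, e.g. "not" and "bad".
scores = np.array([-0.5, -1.5])
taus = np.array([weight(rng.normal(size=H), rng.normal(size=H))
                 for _ in scores])
S_base = float(np.mean(taus * scores))                 # averaged base score
```

Because each τ lies in [−2, 2], the averaged base score stays within [−2α, 2α] for word sentiments bounded by α (here α = 1.5).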
The bias score. We use the same neural network structure as in Figure 3 to calculate the overall bias of the input sentence. The input to the neural network includes h^R_s and h^L_e, and the output is a bias score S_bias. Intuitively, the calculation of S_bias relies on information from the full sentence. h^R_s and h^L_e are chosen because they have commonly been used in the research literature to represent overall sentential information (Graves et al., 2013).
We use a dedicated set of parameters for calculating the bias:

$$p^b = \tanh(W^L_{pb} h^L_e + W^R_{pb} h^R_s + b_{pb})$$
$$S_{bias} = 2\tanh(W_b\, p^b + b_b)$$

where W^L_{pb}, W^R_{pb}, b_{pb}, W_b and b_b are parameters.

Final Score Calculation
The base score S_base and the bias score S_bias are linearly interpolated to derive the final sentiment value of the sentence s.
λ ∈ [0, 1] reflects the relative importance of the base score for the sentence. It offers a new degree of model flexibility, and should be calculated for each sentence specifically. We use an attention model to this end. In particular, the base score features h^L_t/h^R_t and the bias score features h^L_e/h^R_s are combined in the calculation:

$$\lambda = \sigma\Big(W_{s\lambda}\, \frac{1}{n} \sum_{t=1}^{n} \big(h^L_t \oplus h^R_t\big) + W_{b\lambda}\, \big(h^L_e \oplus h^R_s\big) + b_\lambda\Big)$$

Here σ denotes the sigmoid activation function and ⊕ denotes vector concatenation. W_{sλ}, W_{bλ} and b_λ are model parameters.
The final score of the sentence is

$$\mathrm{Score}(s) = \lambda\, S_{base} + (1 - \lambda)\, S_{bias}$$

This corresponds to the original Equation 1 with γ_{j_t} = (λ/m)·τ_{j_t} and b = (1 − λ)·S_bias.
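The interpolation step can be sketched as follows; the feature vectors, parameters and component scores are illustrative, and the exact inputs to the attention function are an assumption based on the description above.

```python
import numpy as np

# Sketch of the interpolation step: a sigmoid attention weight lambda mixes
# the lexicon-based base score with the sentence-level bias score.
# Features and parameters are random illustrative values.
rng = np.random.default_rng(2)
H = 3
base_feats = np.concatenate([rng.normal(size=H), rng.normal(size=H)])  # pooled h^L ⊕ h^R (assumption)
bias_feats = np.concatenate([rng.normal(size=H), rng.normal(size=H)])  # h^L_e ⊕ h^R_s
W_s, W_b, b_lam = rng.normal(size=2*H), rng.normal(size=2*H), 0.0

lam = 1 / (1 + np.exp(-(W_s @ base_feats + W_b @ bias_feats + b_lam)))
S_base, S_bias = 1.1, -0.4                       # example component scores
final = lam * S_base + (1 - lam) * S_bias        # final sentence score
```

Since λ is a sigmoid output in (0, 1), the final score is always a convex combination of the two component scores.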

Training and Testing
Our training data covers two different settings. The first is binary sentiment classification. In this task, every sentence s_i is annotated with a sentiment label l_i, where l_i = 0 and l_i = 1 indicate negative and positive sentiment, respectively. We apply logistic regression on the output layer. Denote the probabilities of a sentence s_i being positive and negative as p^1_{s_i} and p^0_{s_i}, respectively; they are estimated as

$$p^1_{s_i} = \sigma\big(\mathrm{Score}(s_i)\big), \qquad p^0_{s_i} = 1 - p^1_{s_i}$$

Suppose that there are N training sentences; the loss function over the training set is defined as

$$L(\Theta) = -\sum_{i=1}^{N} \log p^{l_i}_{s_i} + \lambda_r \lVert \Theta \rVert^2,$$

where Θ is the set of model parameters and λ_r is a parameter for L2 regularization.

The second setting is multi-class classification. In this task, every sentence s_i is assigned a sentiment label l_i from 0 to 4, representing very negative, negative, neutral, positive and very positive, respectively. We apply least square regression on the output layer. Since the output range of 2tanh is [−2, 2], the base score and the bias score both belong to [−2, 2]; the final score, being a weighted sum of the two, also belongs to [−2, 2]. However, the gold sentiment labels range from 0 to 4. We therefore add an offset of −2 to every gold sentiment label, both to adapt our model to the training data and to increase the interpretability of the learned weights. The loss function for this problem is then defined as

$$L(\Theta) = \sum_{i=1}^{N} \big(\mathrm{Score}(s_i) - (l_i - 2)\big)^2 + \lambda_r \lVert \Theta \rVert^2$$

During testing, we predict the sentiment label l^*_i of a sentence s_i by taking the most probable label under p^l_{s_i} for binary classification, and the valid sentiment value closest to Score(s_i) for multi-class classification.

Experiments

Experimental Settings
Data. We test our model on three datasets: a dataset on Twitter sentiment classification, a dataset of movie reviews and a dataset with mixed domains. The Twitter dataset is taken from SemEval 2013 (Nakov et al., 2013). We downloaded the dataset according to the released ids. The statistics of the dataset are shown in Table 1. The movie review dataset is the Stanford Sentiment Treebank (SST) (Socher et al., 2013). For each sentence in this treebank, a corresponding constituent tree is given. Each internal constituent node is annotated with a sentiment label ranging from 0 to 4. We follow Socher et al. (2011) and Li et al. (2015) to perform five-class and binary classification, with the data statistics shown in Table 2.
In order to examine cross-domain robustness, we apply our model on a product review corpus (Täckström and McDonald, 2011), which contains 196 documents covering 5 domains: books, dvds, electronics, music and videogames. The document distribution is listed in Table 3.
Lexicons. We use four sentiment lexicons, namely TS-Lex, S140-Lex, SD-Lex and SWN-Lex. TS-Lex is a large-scale sentiment lexicon built from Twitter by Tang et al. (2014a) for learning sentiment-specific phrase embeddings. S140-Lex is the Sentiment140 lexicon, which is built from point-wise mutual information using distant supervision (Go et al., 2009; Mohammad et al., 2013).
SD-Lex is built from the SST. We construct a sentiment lexicon from the training set by excluding all neutral words and adding the aforementioned offset of −2 to each entry. SWN-Lex is a sentiment lexicon extracted from SentiWordNet 3.0 (Baccianella et al., 2010). For words with different part-of-speech tags, we keep the minimum negative score or the maximum positive score. The original score in SentiWordNet 3.0 is a probability value between 0 and 1, and we scale it to [−2, 2].
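The rescaling step can be illustrated as below; the linear mapping and the signed-score convention are assumptions, since the text states only the source and target ranges, and the lexicon entries are hypothetical.

```python
# Sketch of lexicon score rescaling. We assume each word carries a signed
# score in [-1, 1] (positive or negative polarity with probability-valued
# strength) mapped linearly to the model's [-2, 2] range; the text states
# only the target range, so the linear form is an assumption.
def rescale(signed_prob):
    """Map a signed score in [-1, 1] to the model's [-2, 2] range."""
    assert -1.0 <= signed_prob <= 1.0
    return 2.0 * signed_prob

entries = {"good": 0.75, "terrible": -0.875}   # hypothetical lexicon entries
scaled = {w: rescale(v) for w, v in entries.items()}
```

After rescaling, lexicon scores share the [−2, 2] range of the model's 2tanh outputs, keeping word priors and learned scores comparable.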
When building these lexicons, we only use the sentiment scores for unigrams. Ambiguous words are discarded. Both TS-Lex and S140-Lex are Twitter-specific sentiment lexicons. They are used in the Twitter sentiment classification task. SD-Lex and SWN-Lex are exploited for the Stanford dataset. The statistics of lexicons are listed in Table 4.

Implementation Details
We implement our model based on the CNN toolkit. Parameters are optimized using stochastic gradient descent with momentum (Sutskever et al., 2013). The decay rate is 0.1. For the initial learning rate, L2 regularization and other hyper-parameters, we adopt the default values provided by the CNN toolkit. We select the best model parameters according to the classification accuracy on the development set.
For the Twitter data, we use glove.twitter.27B as pretrained word embeddings. For the Stanford dataset, following Li et al. (2015), we use glove.840B.300d as pretrained word embeddings. Words that exist in neither the training set nor the pretrained lookup table are treated as out-of-vocabulary (OOV) words. Following Dyer et al. (2015), singletons in the training data are randomly mapped to UNK with a probability p_unk during training. We set p_unk = 0.1. All word embeddings are fine-tuned. We use dropout (Srivastava et al., 2014) in the input layer to prevent overfitting during training.
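The singleton-to-UNK mapping can be sketched as follows, with an invented toy vocabulary; the counts and tokens are illustrative.

```python
import random

# Sketch of the singleton-to-UNK trick of Dyer et al. (2015): during
# training, words seen only once in the training data are replaced by a
# special UNK token with probability p_unk = 0.1.
# The vocabulary counts below are invented for illustration.
rng = random.Random(0)

def maybe_unk(word, counts, rng, p_unk=0.1):
    """Replace a singleton word with <UNK> with probability p_unk."""
    if counts.get(word, 0) == 1 and rng.random() < p_unk:
        return "<UNK>"
    return word

counts = {"movie": 5, "soporific": 1}
mapped = [maybe_unk("soporific", counts, rng) for _ in range(1000)]
unk_rate = mapped.count("<UNK>") / len(mapped)
```

This exposes the model to UNK during training, so the UNK embedding is learned rather than left untrained for test-time OOV words.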
A one-layer BiLSTM is used for all tasks. The dimension of the hidden vectors in the LSTM is 150. The size of the second layer in Figure 3 is 64.

Table 5 shows results on the Twitter development set. Bi-LSTM is our model using the bias score S_bias only, which is equivalent to the bidirectional LSTM models of Li et al. (2015) and Tai et al. (2015), since they use the same features and differ only in the output layer. Bi-LSTM+avg.lexicon is a baseline model integrating the average sentiment score of lexicon words as a feature, and Bi-LSTM+flex.lexicon is our final model, which considers both the Bi-LSTM score (S_bias) and the context-sensitive score (S_base).

Method               Test (%)
SVM6                 78.5
Tang et al. (2014a)  82.4
Bi-LSTM              86.7
Bi-LSTM + TS-Lex     87.6
Bi-LSTM + S140-Lex   88.0

Table 6: Results on the Twitter test set.

Bi-LSTM+avg.lexicon improves the classification accuracy over Bi-LSTM by 0.7 points, which shows the usefulness of sentiment lexicons to recurrent neural models even with a vanilla integration method. This is consistent with previous research on discrete models. By considering context-sensitive weighting for sentiment words, Bi-LSTM+flex.lexicon further outperforms Bi-LSTM+avg.lexicon, improving the accuracy by 1.5 points (84.9 → 86.4), which demonstrates the strength of context-sensitive scoring. Based on the development results, we use Bi-LSTM+flex.lexicon for the remaining experiments.

Main Results
Twitter. Table 6 shows results on the Twitter test set. SVM6 is our implementation of a discrete SVM baseline, which extracts six types of manual features from TS-Lex for SVM classification. The features include: (1) the number of sentiment words in the sentence; (2) the total sentiment score of the sentence; (3) the maximum sentiment score; (4) the total positive and negative sentiment scores; (5) the sentiment score of the last word in the sentence. The system of Tang et al. (2014a) is a state-of-the-art system that extracts various manually designed features from TS-Lex, such as bag-of-words, term frequency, parts-of-speech, and the sum of the sentiment scores of all words in a tweet, for an SVM. The Bi-LSTM rows are our final models with different lexicons.
Cross-domain Results. Lexicon-based methods can be robust for cross-domain sentiment analysis (Taboada et al., 2011). We test the robustness of our model on the mixed-domain dataset of product reviews (Täckström and McDonald, 2011). This dataset contains document-level sentiments; we take a majority-voting strategy over sentence-level predictions to derive document-level sentiment. Table 8 shows the results. Introducing the sentiment lexicons SD-Lex and SWN-Lex consistently improves the classification accuracy across the five domains compared with the baseline Bi-LSTM model. When trained and tested using the same lexicon, SWN-Lex gives better performance on three out of five domains; SD-Lex gives better results only on Electronics. This shows that the results are sensitive to the domain of the sentiment lexicon, which is intuitive.
We also investigate a model trained using SD-Lex but tested by replacing SD-Lex with SWN-Lex. This examines the generalizability of a source-domain model to different target domains by plugging in relevant domain-specific lexicons, without retraining. Results show that the model still outperforms the SD-Lex results on two out of five domains, but is less accurate than full retraining with SWN-Lex.

Figure 4 shows the details of sentiment composition for two sentences in the SST, learned automatically by our model. For the first sentence, the three subjective words in the lexicon, "pure", "excitement" and "not", receive weights of 1.6, 1.9 and −0.6, respectively, and the overall bias of the sentence is positive. A λ value (0.58) that slightly biases towards the base score leads to a final sentiment score of 1.8, which is close to the gold label 2.
In the second example, both negation words receive positive weight values, and the bias over the sentence is negative. A λ value (0.3) that biases towards the bias score results in a final score of −1.2, which is close to the gold label −1. These results demonstrate the capacity of the model to decide how word-level sentiments compose according to sentence-level context. Table 9 shows three sentences in the Stanford test set which are incorrectly classified by the Bi-LSTM model, but correctly labeled by our Bi-LSTM+SWN-Lex model. These examples show that our model is more sensitive to context-dependent sentiment changes, thanks to the use of lexicons as a basis.

Conclusion
We proposed a conceptually simple yet empirically effective method of introducing sentiment lexicon features to state-of-the-art LSTM models for sentiment analysis. Compared to the simple averaging method of traditional bag-of-word models, our system leverages the strength of semantic feature learning by LSTM models to calculate a context-dependent weight for each word given an input sentence. The method gives competitive results on various sentiment analysis benchmarks. In addition, thanks to the use of lexicons, our model improves the cross-domain robustness of recurrent neural models for sentiment analysis.