Attentive Gated Lexicon Reader with Contrastive Contextual Co-Attention for Sentiment Classification

This paper proposes a new neural architecture that exploits readily available sentiment lexicon resources. The key idea is that incorporating a word-level prior can aid the representation learning process, eventually improving model performance. To this end, our model employs two distinct components: (1) a lexicon-driven contextual attention mechanism that imbues lexicon words with long-range contextual information, and (2) a contrastive co-attention mechanism that models contrasting polarities between all positive and negative words in a sentence. Via extensive experiments, we show that our approach outperforms many other neural baselines on sentiment classification tasks on multiple benchmark datasets.


Introduction
Across the rich history of sentiment analysis research (Kim and Hovy, 2004; Liu, 2012; Pang et al., 2008), sentiment lexicons have been extensively used as features for sentiment classification tasks. Lexicons, either handcrafted or algorithmically generated, consist of words and their associated polarity scores. For instance, lexicons assign a high positive score to the word 'excellent' but a negative score to the word 'terrible'. Traditionally, the summation of lexicon scores has been treated as a reasonable heuristic estimate (or feature) that is capable of supporting opinion mining applications. Over the years, many lexicons have been built for specific domains or for general purposes (Hu and Liu, 2004; Mohammad et al., 2013; Wilson et al., 2005). They are valuable resources that should be exploited.
However, sentiment lexicons are in reality hardly useful without context. After all, the complexity and ambiguity of natural language pose great challenges for the crude bag-of-words generalization of lexicons. Firstly, simple lexicon approaches have no notion of semantic compositionality, which raises problems when handling flipping negation ('not happy'), content word negation ('ameliorates pain') or unbounded dependencies ('nobody passed the exam'). Secondly, lexicons do not handle word sense, e.g., they cannot differentiate the meaning of 'hot' in the phrases 'a hot, attractive person' and 'a scorching hot day'. Thirdly, simple summation over lexicon scores cannot deal with sentences with double contrasting polarities, e.g., the lexicon polarity score of 'Thanks for making this uncomfortable situation more comfortable' becomes negative because 'uncomfortable' has a negative lexicon score of greater magnitude than the positive score of 'comfortable'. Lastly, strongly positive or negative words may occur in a neutral context, which biases predictions towards a non-neutral polarity. As such, the exploitation of readily available lexicon lists is an inherently challenging task.
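To make these failure modes concrete, the following minimal sketch implements the summation heuristic over a toy lexicon. The words and scores in `TOY_LEXICON` are invented for illustration and are not taken from any published lexicon; the function name `lexicon_sum` is likewise hypothetical.

```python
# Toy lexicon: entries and scores are made up for illustration only.
TOY_LEXICON = {
    "comfortable": 0.6,
    "uncomfortable": -0.8,
    "happy": 0.7,
}


def lexicon_sum(sentence, lexicon):
    """Sum the polarity scores of every lexicon word found in the sentence."""
    tokens = sentence.lower().split()
    return sum(lexicon.get(token, 0.0) for token in tokens)


contrast = "Thanks for making this uncomfortable situation more comfortable"
negation = "not happy"

# The stronger negative word dominates, so the total is below zero.
print(lexicon_sum(contrast, TOY_LEXICON))
# The negator 'not' is ignored, so the total stays above zero.
print(lexicon_sum(negation, TOY_LEXICON))
```

Under these assumed scores, the first sentence receives a negative total and the second a positive one, mirroring the contrasting-polarity and negation problems described above.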
Deep learning has demonstrated highly competitive performance in many NLP tasks (Liu et al., 2015; Bradbury et al., 2016; Tai et al., 2015). Sentiment analysis is no exception and has recently been dominated by neural architectures. This is evidenced by the fact that the top systems in recent SemEval sentiment analysis challenges (notably 2016 and 2017) have mainly leveraged deep learning models. The main advantage of the deep learning approach is that it is effective in exploring both linguistic and semantic relations between words, and can thus overcome the problems of the lexicon-based approach. However, current deep learning approaches to sentiment analysis usually face a major shortcoming, i.e., they are limited by the quantity of high-quality labeled data. Manual labeling of data is costly and requires domain expert knowledge, which is not always available in practice.
Given the pros and cons of the two previous approaches, we aim to combine the best of both worlds: the traditional sentiment lexicon and modern deep learning architectures. To the best of our knowledge, the only work that combines the two paradigms within end-to-end neural networks is the Lexicon RNN model. In their approach, the representations of lexicon words are extracted from the hidden states of a recurrent neural network and passed through a simple feedforward neural network to produce a new polarity weight. This approach, however, has some limitations, which we illustrate using the following example: "Thanks for making this horrible situation at work more bearable." Firstly, the Lexicon RNN does not consider the interactions between positive and negative lexicon words, which makes it susceptible to misleading strong lexicon priors. In the above example, the word 'horrible' is a strongly negative word in most lexicons. As a result, the Lexicon RNN (and many other lexicon-based approaches in general) will assign a negative polarity to the sentence. Clearly, modeling the similarity between two contrasting polarity words ('horrible' and 'bearable') can help the model resolve this confusion. Secondly, the RNN encoder in the Lexicon RNN is restricted by the sequential nature of the recurrent model, resulting in a limited global view of the entire sentence. For example, the word pairs ('horrible', 'bearable') and ('thanks', 'bearable') are useful for detecting the polarity of the sentence but do not have any explicit interaction even with a sequential RNN encoder. Moreover, the words in the pair ('thanks', 'bearable') are very far apart in the above example sentence, making it challenging for RNN encoders to capture interactions between them. Finally, the Lexicon RNN has difficulty dealing with more than two classes due to its design, i.e., a linear combination of two scalar scores. To cope with this weakness, the authors define hardcoded, dataset-specific thresholds for 5-way classification. Adapting this to 3-way classification (positive, negative and neutral) is cumbersome, as thresholds have to be found either by maximizing over the development set or by defining them heuristically.
In this paper, we introduce a new end-to-end paradigm that integrates lexicon information into neural networks for the task of sentiment analysis. More specifically, instead of learning a lexicon-based score, we propose to learn an auxiliary embedding by exploiting lexicon information. The key motivation behind the auxiliary representation is that compositional learning with prior/global knowledge of positive and negative inclined words can lead to improved representations. Next, a gating mechanism controls the additive blend between this lexicon-based representation and a standard attention-based recurrent model. In essence, this supporting network aims to learn a 'lexicon-based' view of the sentence and can be interpreted as 'learning to compose' by exploiting lexicon information. Finally, instead of the combination of two scalar values (the base lexicon score and sentence bias score) as in the Lexicon RNN model, we propose to use the k-class softmax function at the final layer. Intuitively, it is a more natural solution for fine-grained sentiment classification than the cumbersome tuning of ad-hoc threshold values. Our contributions can be summarized as follows:
• We propose to learn an auxiliary embedding by exploiting lexicon information rather than learning a lexicon-based score. Its design is a more natural and flexible solution for k-class sentiment classification.
• We propose a contextual attention (CA) mechanism that learns to attend to lexicon words based on the context. Unlike Lexicon RNN which extracts the hidden representations from the recurrent model, contextual attention allows a wider, global and more complete view of the context (sentence) by matching against every single word in the sentence. In addition to semantic compositionality, our model also benefits from semantic similarity.
• We propose to model the interaction between the positive and negative lexicon words inside the neural network. Positive and negative lexicon words are modeled separately and subsequently compared using contrastive co-attention (CC), which learns the relative importance of positive lexicon words with respect to negative lexicon words (and vice versa). Modeling such intricacies between positive and negative words allows our model to deal with scenarios such as contrasting polarities, neutrality and also sarcasm. We also discover that our CC mechanism produces a neutralizing effect which negates misleading attention on words with intense polarity when the context is neutral.
Overall, we propose AGLR (Attentive Gated Lexicon Reader), a new attention-based neural architecture that exploits sentiment lexicons for learning to compose an auxiliary sentence embedding. Our model achieves state-of-the-art performance on several benchmark datasets. Finally, our AGLR, a single neural model, also achieves competitive performance with respect to top teams in SemEval runs which are mostly comprised of extensively engineered ensembles.

Related Work
Sentiment lexicons have a rich tradition in sentiment analysis research and have been exploited in many statistical methods across the years (Hu and Liu, 2004; Kim and Hovy, 2004; Agarwal et al., 2011; Mohammad et al., 2013; Tang et al., 2014b,a). It is easy to see how sentiment lexicons are able to benefit opinion mining applications. More specifically, sentiment lexicons play an integral role in the winning solutions of SemEval 2013 (Mohammad et al., 2013) and 2014 (Miura et al., 2014). In many of these approaches, standard machine learning classifiers (such as Support Vector Machines) are trained on discrete features partly derived from resources such as sentiment lexicons.
In recent years, the state of the art has shifted from discrete models to neural models (Socher et al., 2013; Kim, 2014; Dong et al., 2014; Tang et al., 2016; Tai et al., 2015; Ren et al., 2016). This ranges from learning sentiment-specific word embeddings (Tang et al., 2014b; Faruqui et al., 2015) to end-to-end neural architectures (Angelidis and Lapata, 2017). The winning solution of SemEval 2016 (Deriu et al., 2016) utilized ensembles of convolutional neural networks (CNN). Recurrent models such as the bidirectional long short-term memory (BiLSTM) (Hochreiter and Schmidhuber, 1997; Graves et al., 2013) are popular and standard strong baselines for many opinion mining tasks including sentiment analysis (Tay et al., 2017) and sarcasm detection (Tay et al., 2018c). Neural models such as the BiLSTM are capable of modeling semantic compositionality and produce a feature vector which can be used for classification.
To integrate lexicon information, the Lexicon RNN model uses the hidden representations from a BiLSTM to influence the lexicon score, i.e., it learns context-sensitive lexicon features. In contrast, our method follows a vastly different paradigm: it learns a d-dimensional embedding using neural attention (Bahdanau et al., 2014; Luong et al., 2015) instead of a lexicon score. The key idea of neural attention is that it allows neural networks to look at (or attend to) certain words in a sequence. This concept has profoundly impacted the field of NLP, giving rise to many variant architectures including end-to-end memory networks (Sukhbaatar et al., 2015; Li et al., 2017).
Our approach draws inspiration from memory networks and co-attentive models for machine comprehension (Seo et al., 2016). In fact, the auxiliary network can be interpreted as a form of multi-layered attention, which draws a connection to vanilla memory networks. Attending over two sequences (or bidirectional attention) is an intuitive approach for NLP tasks such as information retrieval (Tay et al., 2018b) and generic text matching (Tay et al., 2018a). In our work, we adapt this to model the similarities between (1) lexicon and context and (2) contrasting polarities, which borrows inspiration from Riloff et al. (2013). Since our matching problem is derived from the same sequence (identified by a lexicon prior), this work can be interpreted as a form of self-attention (Vaswani et al., 2017), which relates it to the intra-attentive model for sarcasm detection (Tay et al., 2018c).

Attentive Gated Lexicon Reader
In this section, we describe our proposed deep learning model for sentiment classification. The key idea of our model is to generate two representations, i.e., a lexicon-based auxiliary embedding of the sentence and also a generic compositional representation of the sentence. The former is generated via a supporting network that consists of contextual attention and contrastive co-attention layers. The latter is generated by a vanilla attention-based BiLSTM model. A gating mechanism then combines them for prediction.

Embedding Layer
Firstly, we extract all lexicon words from the input sequence and then separately denote them as positive or negative words. Overall, our model accepts three sequences as input: (1) the original sentence, (2) a list of positive lexicon words found in the sentence and (3) a list of negative lexicon words found in the sentence. The three sequences are indexed into a word embedding layer $W \in \mathbb{R}^{|V| \times d}$ which outputs three matrices $S \in \mathbb{R}^{d \times L_s}$ (sentence embeddings), $P \in \mathbb{R}^{d \times G_p}$ (positive lexicon embeddings) and $N \in \mathbb{R}^{d \times G_n}$ (negative lexicon embeddings). $d$ is the dimensionality of the word embeddings and $L_s$, $G_p$ and $G_n$ are the maximum sequence lengths of the sentence, positive lexicon and negative lexicon respectively.
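The input preparation described above can be sketched as follows. This is only an illustration: the lexicon entries, the whitespace tokenizer, the `<pad>` token and the function names are assumptions and are not specified in the text.

```python
# Hypothetical lexicon entries, chosen only to match the running example.
POSITIVE_WORDS = {"thanks", "bearable"}
NEGATIVE_WORDS = {"horrible"}


def split_by_lexicon(tokens, positive, negative):
    """Return the sentence plus the positive and negative lexicon words in it."""
    pos = [tok for tok in tokens if tok in positive]
    neg = [tok for tok in tokens if tok in negative]
    return tokens, pos, neg


def pad(sequence, max_len, pad_token="<pad>"):
    """Truncate or right-pad a token list to a fixed maximum length."""
    return (sequence + [pad_token] * max_len)[:max_len]


sentence = "Thanks for making this horrible situation at work more bearable"
tokens = sentence.lower().split()
tokens, pos, neg = split_by_lexicon(tokens, POSITIVE_WORDS, NEGATIVE_WORDS)

# Three sequences that would then be looked up in the shared embedding layer.
model_inputs = (pad(tokens, 12), pad(pos, 3), pad(neg, 3))
print(model_inputs)
```

Padding to fixed lengths is one way to realize the maximum sequence lengths $L_s$, $G_p$ and $G_n$; other batching schemes would work equally well.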

Learning Sentence Representation
To learn sentence representations of the input sequence, we pass $S = (w_1, w_2, \cdots, w_{L_s})$ into a bidirectional Long Short-Term Memory (BiLSTM) layer. The BiLSTM produces a hidden representation $h_t$ at each step $t$: given a sequence of inputs $w_1, w_2, \cdots, w_L$, the output of the BiLSTM layer is a sequence of hidden states $h_1, h_2, \cdots, h_L$. Note that since we use a bidirectional LSTM, $h_t \in \mathbb{R}^{2r}$ where $r$ is the dimensionality of the BiLSTM layer. In our case $r$ is set to $d/2$ such that the output vector has dimensionality $d$.

Sentence Attention
To learn a final representation of the sentence, we adopt an attention mechanism over the hidden states of the BiLSTM layer. The attention layer has parameters $W_y \in \mathbb{R}^{d \times d}$ and $w_y \in \mathbb{R}^{d}$ and outputs the sentence representation $s \in \mathbb{R}^{d}$. Intuitively, the attention layer learns to pay attention to important segments of the sentence, producing a weighted representation of the hidden states of the BiLSTM layer.
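As an illustration, the sketch below shows one common form of attention pooling that is consistent with the parameter shapes above. The exact scoring function (a tanh of a linear map followed by a dot product with $w_y$) is an assumption, as are the helper names; the layer actually used may differ in detail.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def attention_pool(H, W_y, w_y):
    """Pool hidden states H (L rows of size d) into one d-dimensional vector.

    Assumed scoring: score_t = w_y . tanh(W_y h_t).
    """
    scores = []
    for h in H:
        projected = [math.tanh(x) for x in matvec(W_y, h)]
        scores.append(sum(w * p for w, p in zip(w_y, projected)))
    a = softmax(scores)  # one attention weight per time step
    d = len(H[0])
    s = [sum(a[t] * H[t][j] for t in range(len(H))) for j in range(d)]
    return s, a


H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy hidden states, d = 2
W_y = [[1.0, 0.0], [0.0, 1.0]]            # identity, for readability
w_y = [1.0, 1.0]
s, a = attention_pool(H, W_y, w_y)
print(a, s)
```

In this toy setting the third hidden state receives the largest weight because both of its components contribute to its score.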

Learning Auxiliary Lexicon Embedding
This layer aims to learn a single d-dimensional lexicon-based representation of the sentence. In order to learn the lexicon embedding, our model adopts a two-layer attention mechanism, namely contextual attention (CA) and contrastive co-attention (CC).
Contextual Attention (CA) We utilize an attention mechanism to learn the relative importance of each lexicon word based on the sentence representation. This layer is applied to both $P$ and $N$ and is functionally identical for each. As such, for notational convenience, we use $Q$ to represent either positive ($P$) or negative ($N$), and $G$ to represent the maximum number of lexicon words. Let $Q \in \mathbb{R}^{G \times d}$ be a sequence of lexicon words and $H \in \mathbb{R}^{L_s \times d}$ be the intermediate hidden representations obtained from the contextual BiLSTM layer. We first learn an affinity matrix $M$ between $Q$ and $H$, where $U \in \mathbb{R}^{d \times d}$ are the parameters of this layer. Next, we apply a column-wise max pooling of $M$.
The key idea is to generate an attention vector $a = \mathrm{softmax}(\max_{col}(M))$, where $a \in \mathbb{R}^{G}$. The softmax function normalizes the values of the vector $\max_{col}(M)$ into a probability distribution. To learn the context-sensitive importance of each lexicon word, we then apply the attention vector to $Q$, which yields $C = \{c_1, c_2, \cdots, c_G\}$, the context-sensitive lexicon representation of $Q$. Intuitively, the CA mechanism attends to each lexicon word based on its maximum influence on each word of the main sentence. There are several advantages to our contextual attention mechanism. Unlike Lexicon RNN, which simply extracts the hidden representation (generated from the BiLSTM) of the lexicon word, our approach has a global view of the entire sentence, which allows each lexicon word to benefit from wider contextual knowledge as opposed to being limited to the temporal compositionality provided by the BiLSTM layer. Overall, the outputs of this layer are two context-sensitive matrices (positive and negative lexicon embeddings). Note that these lexicon embeddings retain their dimensionality when passing through this layer.
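The contextual attention step can be sketched as below. This is a minimal illustration under stated assumptions: the affinity matrix is taken to be the plain bilinear form $Q U H^\top$ (any additional nonlinearity is omitted), pooling is taken over the sentence axis so that one weight per lexicon word remains, and each lexicon embedding is rescaled by its weight. All function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def contextual_attention(Q, H, U):
    """Q: G x d lexicon embeddings, H: Ls x d hidden states, U: d x d."""
    M = matmul(matmul(Q, U), transpose(H))  # assumed affinity matrix, G x Ls
    pooled = [max(row) for row in M]        # strongest match per lexicon word
    a = softmax(pooled)                     # attention vector of length G
    C = [[a_i * q for q in q_row] for a_i, q_row in zip(a, Q)]  # still G x d
    return C, a


Q = [[1.0, 0.0], [0.0, 1.0]]              # two toy lexicon words
H = [[2.0, 0.0], [1.0, 0.0], [0.0, 0.5]]  # three toy sentence positions
U = [[1.0, 0.0], [0.0, 1.0]]              # identity, for readability
C, a = contextual_attention(Q, H, U)
print(a, C)
```

The first lexicon word matches a sentence position more strongly than the second, so it receives the larger weight, while the output keeps the $G \times d$ shape of the input.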

Contrastive Co-Attention (CC)
This layer aims to model the contrast between polarities. Intuitively, this layer helps to model sentences with double or conflicting polarities. It also aims to negate strongly positive or negative words in the case of a neutral context. In order to do so, we employ a contrastive co-attention model that learns to weight the relative importance of each positive lexicon word based on the negative lexicon words (and vice versa). We accept the contextualized positive and negative lexicon embeddings from the previous layer as input. Let $\bar{P} \in \mathbb{R}^{G \times d}$ be the contextualized positive lexicon embeddings and $\bar{N} \in \mathbb{R}^{G \times d}$ be the contextualized negative lexicon embeddings. Our co-attention layer learns a soft attention alignment between the positive and negative lexicon embeddings. Similar to our contextual attention layer, we first learn an affinity matrix $Z$ that models the relationship between positive and negative lexicon embeddings. Next, we apply both column-wise and row-wise max-pooling on the affinity matrix $Z$ to obtain two attention vectors. The two attention vectors are then normalized with the softmax function (denoted as sm).
$$a_p = \mathrm{sm}(\max_{col}(Z)) \; ; \quad a_n = \mathrm{sm}(\max_{row}(Z)) \qquad (7)$$
Here $a_p$ is the attention vector for the positive lexicon embeddings and $a_n$ is the attention vector for the negative lexicon embeddings. Applying these attention vectors to the lexicon embeddings produces the final vector representations $p_f \in \mathbb{R}^{d}$ and $n_f \in \mathbb{R}^{d}$ for the positive lexicon and negative lexicon respectively. Note that this layer, unlike the contextual attention layer, is named 'co-attention' because $P$ and $N$ are both 'attended over' concurrently. Note also that the attention weights are applied to the original embeddings $P$, $N$ and not to the contextualized embeddings $\bar{P}$, $\bar{N}$.
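A minimal sketch of the contrastive co-attention step is given below. It assumes a bilinear affinity matrix between the contextualized embeddings (the parameter matrix `U` and the absence of a nonlinearity are assumptions), takes each max-pooling over the axis that leaves one weight per word of the corresponding polarity, and forms the final vectors as attention-weighted sums of the original embeddings. Function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def weighted_sum(weights, rows):
    """Attention-weighted sum of row vectors."""
    d = len(rows[0])
    return [sum(w * row[j] for w, row in zip(weights, rows)) for j in range(d)]


def contrastive_co_attention(P_ctx, N_ctx, P, N, U):
    """P_ctx, N_ctx: contextualized embeddings; P, N: original embeddings."""
    Z = matmul(matmul(P_ctx, U), transpose(N_ctx))  # assumed affinity, Gp x Gn
    a_p = softmax([max(row) for row in Z])          # one weight per positive word
    a_n = softmax([max(col) for col in zip(*Z)])    # one weight per negative word
    p_f = weighted_sum(a_p, P)  # weights applied to the original embeddings
    n_f = weighted_sum(a_n, N)
    return p_f, n_f, a_p, a_n


P = [[1.0, 0.0], [0.0, 1.0]]  # two toy positive words
N = [[1.0, 0.0]]              # one toy negative word
U = [[1.0, 0.0], [0.0, 1.0]]
p_f, n_f, a_p, a_n = contrastive_co_attention(P, N, P, N, U)
print(a_p, a_n, p_f, n_f)
```

In this toy case the first positive word has the higher affinity with the negative word, so it dominates $p_f$, which is the kind of contrast-driven reweighting the layer is meant to capture.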
Fully-Connected Layer Next, we pass the concatenation of $p_f$ and $n_f$ through a fully-connected layer to learn the final representation for the auxiliary lexicon embedding, i.e., $r = \tanh(W_h([p_f; n_f]) + b_h)$ where $W_h \in \mathbb{R}^{2d \times d}$ are the parameters of the hidden layer and $b_h$ is the bias. The output $r \in \mathbb{R}^{d}$ is the final auxiliary lexicon-based embedding.

Learning Final Representations
To combine the lexicon-based representation with the sentence representation, we adopt a gating mechanism:
$$\hat{s} = \sigma(w_g r)\, r + (1 - \sigma(w_g r))\, s \qquad (9)$$
where $w_g \in \mathbb{R}^{d}$ are the parameters of this layer, $\sigma$ is the sigmoid function and $\hat{s}$ is the overall final representation.
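The gate in Equation (9) can be sketched as follows. The sketch reads $w_g r$ as a dot product, so that the gate is a single scalar shared across all dimensions; this reading and the function names are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(r, s, w_g):
    """Blend the lexicon embedding r and the sentence embedding s.

    The gate is a scalar in (0, 1) computed from r alone, as in Equation (9).
    """
    gate = sigmoid(sum(w * x for w, x in zip(w_g, r)))
    return [gate * r_j + (1.0 - gate) * s_j for r_j, s_j in zip(r, s)]


r = [1.0, 0.0]  # toy auxiliary lexicon embedding
s = [0.0, 1.0]  # toy sentence embedding

print(gated_fusion(r, s, [0.0, 0.0]))   # zero gate weights: an even blend
print(gated_fusion(r, s, [50.0, 0.0]))  # large gate input: output close to r
```

When the gate saturates near 1 the model relies almost entirely on the lexicon view, and near 0 it falls back to the attention-based BiLSTM representation.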

Final Layer and Optimization
Finally, we pass the overall final representation $\hat{s}$ into a softmax layer that produces $y \in \mathbb{R}^{k}$, where $k$ is the number of classes (2 for positive and negative, and 3 when neutral is included). $W_f$ and $b_f$ are the parameters of the linear layer that precedes the softmax. For optimization, we adopt the standard cross-entropy loss function with L2 regularization, where $o$ denotes the output of the softmax layer in the loss and $R = \lambda \lVert \psi \rVert_2^2$ is the L2 regularization term.
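The final layer and objective can be sketched as below. The sketch assumes the usual categorical cross-entropy against a single gold class index and treats the regularized parameters as a flat list; both choices, along with the function names, are assumptions rather than details given in the text.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def classify(s_hat, W_f, b_f):
    """Linear layer (k rows of size d) followed by a k-class softmax."""
    logits = [sum(w * x for w, x in zip(row, s_hat)) + b for row, b in zip(W_f, b_f)]
    return softmax(logits)


def regularized_loss(o, gold_index, params, lam):
    """Cross-entropy of softmax output o plus lam times the squared L2 norm."""
    cross_entropy = -math.log(o[gold_index])
    l2_penalty = lam * sum(p * p for p in params)
    return cross_entropy + l2_penalty


s_hat = [1.0, 2.0]
W_f = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # k = 3 classes, all-zero weights
b_f = [0.0, 0.0, 0.0]
o = classify(s_hat, W_f, b_f)               # uniform over the three classes
print(o, regularized_loss(o, 0, [3.0, 4.0], 0.01))
```

Because the output is a full distribution over $k$ classes, moving from binary to 3-way classification only changes the number of rows in $W_f$, with no thresholds to tune.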

Empirical Evaluation
This section describes our empirical experiments.

Evaluation Procedure
In this section, we describe the datasets used, evaluation metric and implementation details.
Datasets We conduct our experiments on subsets of sentiment analysis benchmarks from SemEval 2013 (Nakov et al., 2013), SemEval 2014 (Rosenthal et al., 2014) and SemEval 2016 (Nakov et al., 2016). More specifically, we focus on sentence-level sentiment analysis and evaluate on the datasets of SemEval 2013 task 2, SemEval 2014 task 9 and SemEval 2016 task 4, which we refer to as SemEval13, SemEval14 and SemEval16 respectively in this section. For fair comparison, we use the same training, development and testing splits as in the SemEval competitions.
To further evaluate the performance of methods when data is limited, we experiment with two different training settings for SemEval16. The first, TRAIN, uses only the 2016 training set and explores the setting where training data is limited. The second, TRAIN-ALL, appends the 2013 training set to the 2016 training set, following the official setting of SemEval 2016.

Evaluation Metrics
We evaluate on two settings, i.e., 3-way (positive, negative and neutral) and also binary (positive and negative) classification. We report the accuracy and macro-averaged F1 score for all settings.
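For reference, the two metrics can be computed as in the following sketch. The label names and the choice to average F1 over every label passed in `labels` are assumptions made for illustration; which classes enter the average should follow the evaluation protocol actually used.

```python
def accuracy(gold, pred):
    """Fraction of examples whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels):
    """Unweighted mean of the per-class F1 scores over the given labels."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = ["pos", "neg", "neu", "pos"]
pred = ["pos", "neg", "pos", "pos"]
print(accuracy(gold, pred))
print(macro_f1(gold, pred, ["pos", "neg", "neu"]))
```

Macro-averaging weights every class equally, so a model that never predicts the neutral class is penalized even if its accuracy remains high.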
Compared Baselines In this section, we list the neural baselines we use for comparisons.
• NBOW-MLP (Neural Bag-of-Words + Multi-layered Perceptron) is a simple sum of all word embeddings which is connected to a 2-layer MLP of 100 dimensions.
• CNN (Convolutional Neural Network) is another popular neural encoder for learning sentence representations. We use a filter size of 3 and 150 filters.
• BiLSTM (Bidirectional Long Short-Term Memory) is a standard strong neural baseline for many NLP tasks. The size of the LSTM is set to 150.
• AT-BiLSTM (Attention-based BiLSTM) is an extension of the BiLSTM model with neural attention.
• Lexicon RNN (Lexicon Recurrent Neural Network) is the first neural model that incorporates sentiment lexicons. The size of the BiLSTM in this model is also set to 150.
All models except Lexicon RNN optimize the softmax cross-entropy loss. The authors of Lexicon RNN use it for binary and 5-way classification. In order to adapt Lexicon RNN to 3-way classification (positive, negative, neutral), we adapt the 5-way formulation, which minimizes the mean squared error (MSE) loss, to 3-way. The output is scaled to $s \in [-1, 1]$, where $s > 0.25$ is treated as positive, $s < -0.25$ is treated as negative and everything in between is neutral.

Experimental Results
Table 1 and Table 2 report the results of our experiments. The results on TRAIN-ALL are higher than those on TRAIN for SemEval16 owing to the larger dataset. Firstly, we observe that our proposed AGLR outperforms all neural baselines on 3-way classification. Overall, AGLR achieves state-of-the-art performance. On average, AGLR outperforms Lexicon RNN and AT-BiLSTM by 1% to 3% in terms of F1 score. We also observe that AGLR always improves over AT-BiLSTM, which confirms the effectiveness of learning auxiliary lexicon embeddings. The key idea here is that the auxiliary lexicon embeddings provide a different view of the sentence which supports the network in making predictions.
We also observe that Lexicon RNN does not handle 3-way classification well. Even though it achieves good performance on binary classification, its performance on 3-way classification is lackluster (AGLR outperforms Lexicon RNN by up to 8% on SemEval16 TRAIN). This could be attributed to the MSE-based loss function. Conversely, by learning an auxiliary embedding (instead of a scalar score), our model becomes more flexible at the final layer and can be adapted to use a k-class softmax function. Finally, we observe that BiLSTM and AT-BiLSTM outperform Lexicon RNN on average, with Lexicon RNN being slightly better on binary classification.

Comparisons against Top SemEval Systems
We observe that AGLR achieves competitive performance relative to the top runs in SemEval 2013, 2014 and 2016. It is worth noting that SemEval approaches are often heavily engineered, containing ensembles and many handcrafted features which include extensive use of sentiment lexicons, POS tags and negation detectors. Recent SemEval runs gravitate towards neural ensembles. For instance, SwissCheese, the winning approach of SemEval 2016, uses an ensemble of 6 CNN models along with a meta-classifier (a random forest classifier). In contrast, our proposed model is a single neural model. In addition, SwissCheese uses emoticon-based distant supervision which exploits a huge corpus of sentences (millions) for training. Conversely, our approach only uses the 2013 and 2016 training sets, which are significantly smaller. Given these conditions, we find it remarkable that our single model is able to achieve competitive performance relative to the extensively engineered approach of SwissCheese. Moreover, AGLR significantly outperforms it in terms of accuracy. AGLR performs competitively on SemEval 2013 and 2014 as well. The good performance on the sarcasm dataset could be attributed to our contrastive attention mechanism.
Ablation Study In this section, we study the impact and contribution of the different components of our model. Specifically, we test three settings. In the first, we remove CC only. In this case, positive and negative lexicon embeddings are summed instead of being combined by an attention-weighted sum. In the second setting, we remove CA only. Similarly, embeddings are summed instead of attentively summed. Finally, we remove both CA and CC. In this case, all lexicon words are treated as a neural bag-of-words (NBOW) and passed to the MLP layer. Table 4 shows the results of this ablation study on SemEval16 using the TRAIN-ALL setting.

It is clear that both CC and CA are critical to the performance of AGLR. Removing either or both causes performance to degrade. In particular, we observe that CA seems to be less important than CC, i.e., performance drops more when CC is removed than when CA is removed. We also note that removing both and using a simple NBOW for lexicons can degrade performance, since the base AT-BiLSTM is better than using NBOW lexicons as an auxiliary support. As such, the design of the auxiliary embeddings must be treated with care.
Qualitative Analysis In order to study the specific roles of the contextual and contrastive attention mechanisms, we inspect the attention maps over the positive and negative lexicons. We use the following example, in which the ground truth label is positive: "Very excited about Tuesday night @user free iced coffee and smoothies courtesy of Dunkin Donuts will be set up." Figure 2a shows the attention maps for contextual attention. We observe that contextual attention focuses more on the context, i.e., on words such as 'night', 'iced coffee' and 'smoothies'. On the other hand, Figure 2b shows the attention maps after contrastive attention. We observe that contrastive attention learns more polarity-specific attention, i.e., it shifts some focus to 'very excited'. We also observe that contrastive attention tends to shift its attention weights to less meaningful words for the negative lexicon if the ground truth label is positive (and vice versa). We believe that this indicates an absence of negative sentiment.

Conclusion
We proposed a novel method of incorporating lexicons into neural models for the task of sentiment analysis. More specifically, we learn an auxiliary lexicon embedding using neural attention. Our proposed model AGLR achieves an overall state-of-the-art performance on multiple benchmark datasets outperforming strong neural baselines such as AT-BiLSTM and Lexicon RNN. The performance of AGLR is also competitive relative to top SemEval systems which utilized neural ensembles or very extensive feature engineering.