Attention Modeling for Targeted Sentiment

Neural network models have been used for target-dependent sentiment analysis. Previous work focuses on learning a target-specific representation of a given input sentence, which is then used for classification. However, these models do not explicitly model the contribution of each word in a sentence with respect to targeted sentiment polarities. We investigate an attention model to this end. In particular, a vanilla LSTM model is used to induce an attention value over the whole sentence. Following previous work, the model is further extended to differentiate the left and right contexts of a given target. Results show that by using attention to model the contribution of each word with respect to the target, our model gives significantly improved results on two standard benchmarks. We report the best accuracy for this task.


Introduction
Targeted sentiment analysis investigates the classification of opinion polarities towards specific target entity mentions in given sentences (Jiang et al., 2011; Dong et al., 2014; Tang et al., 2016). The input is a sentence with given target entity mentions, and the output consists of two-way or three-way sentiment classes for each target mention. For example, the sentence "She began to love miley ray cyrus since 2013 :)" is marked with a positive sentiment label on the target "miley ray cyrus".
One important problem of targeted sentiment classification is how to model the relation between targets and their context. Earlier methods defined rich features by exploiting POS tags and syntactic structures (Jiang et al., 2011; Dong et al., 2014). Compared with discrete manual features, embedding features are less sparse and can be learnt from large raw texts, capturing distributional syntactic and semantic information. Dong et al. (2014) use a target-specific recurrent neural network to represent a sentence. Rich pooling functions have also been used to extract the feature vector for a given target.
One important contribution of this line of work is to split a sentence into three sections: the target, its left context and its right context, as shown in Figure 1. Another model represents the words in the input using a bidirectional gated recurrent neural network, and then uses a three-way gated neural network structure to model the interaction between the target and its left and right contexts. Tang et al. (2016) learn target-specific sentence representations by combining word embeddings with the corresponding target embeddings, and then using two recurrent neural networks to encode the left context and the right context, respectively.
The above methods use different neural network structures to model the relation between contexts and targets, but they do not explicitly model the importance of each word in contributing to the sentiment polarity of the target. For example, the sentence "#nowplaying [lady gaga]_0 - let love down" is neutral for the target "lady gaga": the word "love" contributes little to the targeted sentiment, even though it is a positive word.
To address this, we use the attention mechanism to calculate the contribution of each word towards targeted sentiment classes, as shown in Figure 1, where the gray level in the spectrum indicates the contribution of each word. In particular, we build a vanilla model that uses a bidirectional LSTM to extract word embeddings over the sentence and then applies attention over the hidden nodes to estimate the importance of each word. Furthermore, following Tang et al. (2016) and related work, we differentiate the left and right contexts given a target. Our final models give significantly improved results on two standard benchmarks compared to previous methods, resulting in the best reported accuracy so far. Our source code is released at https://github.com/LeonCrashCode/AttentionTargetSentiment.

Related Work
Traditional sentiment classification methods rely on manual discrete features (Pang et al., 2002; Go et al., 2009; Mohammad et al., 2013). Recently, distributed word representations (Socher et al., 2013) and neural network methods (Irsoy and Cardie, 2013; dos Santos and Gatti, 2014; Dong et al., 2014; Teng et al., 2016; Ren et al., 2016) have shown promising results on this task. The success of such work suggests that word embeddings and deep neural network structures can automatically exploit syntactic and semantic structures. Our work is in line with these methods.
The seminal work using the attention mechanism is in neural machine translation (Bahdanau et al., 2015), where different weights are assigned to source words to implicitly learn alignments for translation. Subsequently, the attention mechanism has been applied to various other natural language processing tasks, including parsing (Vinyals et al., 2015; Kuncoro et al., 2016; Liu and Zhang, 2017), document classification (Yang et al., 2016), question answering (He and Golub, 2016) and text understanding (Kadlec et al., 2016).
For sentiment analysis, the attention mechanism has been applied to cross-lingual sentiment (Zhou et al., 2016), aspect-level sentiment (Wang et al., 2016) and user-oriented sentiment (Chen et al., 2016). To our knowledge, we are the first to use the attention mechanism to model sentences with respect to targeted sentiments.

Models
We use a bidirectional LSTM to represent the input word sequence $w_0, w_1, \ldots, w_n$ as hidden nodes $h_0, h_1, \ldots, h_n$. The target is denoted as $h_t$, which is the average of the word embeddings in the target phrase $[h_{t_0}; \ldots; h_{t_m}]$. We propose three variants of attention to model the relation between context words and targets.

Vanilla Model
We build a vanilla attention model by calculating a weight $\alpha_i$ for each word in the sentence. The final representation of the sentence $s$ is then given by $s = \sum_{i} \alpha_i h_i$, where $\alpha_i = \exp(\beta_i) / \sum_{j}^{n} \exp(\beta_j)$ and the weight scores $\beta$ are calculated from the target representation and the context word representation. The sentence representation $s$ is then used to predict the probability distribution $p$ of sentiment labels on the target. We refer to this vanilla model as BILSTM-ATT.
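The weighted sum and the normalization of the scores can be illustrated with a minimal sketch in plain Python. The scoring function `score` and the parameter vectors `w` and `u` are hypothetical stand-ins: the text only states that the scores depend on the target representation and the context word representation, so the tanh form used here is an assumption, not the parameterization of the paper.

```python
import math

def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def score(h_i, h_t, w, u):
    """Hypothetical score: tanh of a linear function of h_i and h_t (an assumption)."""
    z = sum(a * b for a, b in zip(w, h_i)) + sum(a * b for a, b in zip(u, h_t))
    return math.tanh(z)

def attend(hidden, h_t, w, u):
    """Return (s, alpha): the attention-weighted sum of hidden vectors and the weights."""
    beta = [score(h_i, h_t, w, u) for h_i in hidden]
    alpha = softmax(beta)
    dim = len(hidden[0])
    s = [sum(a * h[k] for a, h in zip(alpha, hidden)) for k in range(dim)]
    return s, alpha

# Toy example: three words with 2-dimensional hidden vectors (made-up values).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_t = [0.5, 0.5]
w = [1.0, -1.0]  # made-up parameters
u = [0.2, 0.2]
s, alpha = attend(hidden, h_t, w, u)
```

In this toy setting the first word receives the largest weight because its score is the highest; in a trained model the scores would come from learned parameters.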

Contextualized Attention
We make two extensions to the vanilla attention method. The first is a contextualized attention model (BILSTM-ATT-C), where the sentence is divided into two segments with respect to the target, namely the left context and the right context (Tang et al., 2016). Attention is applied to the left and right contexts separately. In particular, the representation of the left context $s_l$ is obtained by applying attention to the words on the left of the target, and the representation of the right context $s_r$ by applying attention to the words on the right of the target. Together with the vanilla representation $s$, these representations are used to predict the distribution of sentiment labels.
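A rough sketch of the context split is given here in plain Python. The dot-product score inside `attend`, the zero vector returned for an empty context, and the function name `contextual_representations` are all assumptions made for illustration; only the split into left and right contexts and the averaged target representation come from the text.

```python
import math

def attend(hidden, h_t):
    """Attention-weighted sum of hidden vectors, using a dot-product score (an assumption)."""
    if not hidden:  # empty context: fall back to a zero vector (an assumption)
        return [0.0] * len(h_t)
    beta = [sum(a * b for a, b in zip(h, h_t)) for h in hidden]
    m = max(beta)
    exps = [math.exp(b - m) for b in beta]
    total = sum(exps)
    alpha = [e / total for e in exps]
    return [sum(a * h[k] for a, h in zip(alpha, hidden)) for k in range(len(h_t))]

def contextual_representations(hidden, start, end):
    """hidden holds one vector per word; the target spans positions start..end inclusive."""
    target = hidden[start:end + 1]
    h_t = [sum(col) / len(target) for col in zip(*target)]  # average over target words
    left, right = hidden[:start], hidden[end + 1:]
    s = attend(hidden, h_t)    # whole-sentence representation
    s_l = attend(left, h_t)    # left-context representation
    s_r = attend(right, h_t)   # right-context representation
    return s, s_l, s_r

# Toy example: four words, target at positions 1 and 2 (made-up values).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 2.0]]
s, s_l, s_r = contextual_representations(hidden, 1, 2)
```

With a single word on each side of the target, `s_l` and `s_r` reduce to the hidden vectors of those words, which makes the split easy to check by hand.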

Contextualized Attention with Gates
A second extension is to add gates to control the flow of context information (BILSTM-ATT-G). This is motivated by the fact that sentiment signals can be dominated by the left context, the right context or the entire sentence. Three gates, $z$, $z_l$ and $z_r$, controlled by the target and the corresponding context, are used, where $z + z_l + z_r = 1$. The linear interpolation among $s$, $s_l$ and $s_r$ is formulated as $\tilde{s} = z\,s + z_l\,s_l + z_r\,s_r$.
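The gated interpolation can be sketched as follows in plain Python. Two simplifications are assumptions of this sketch rather than statements from the text: the gates are scalars, and the constraint that they sum to one is enforced with a softmax over three raw gate scores. The names `gate_weights`, `interpolate` and `s_tilde` are illustrative.

```python
import math

def gate_weights(scores):
    """Softmax over three raw gate scores so that z + z_l + z_r = 1 (an assumption)."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def interpolate(s, s_l, s_r, gate_scores):
    """Linear interpolation of the three representations with scalar gates."""
    z, z_l, z_r = gate_weights(gate_scores)
    return [z * a + z_l * b + z_r * c for a, b, c in zip(s, s_l, s_r)]

# Toy example with made-up values; equal raw scores give equal gates.
s, s_l, s_r = [1.0, 1.0], [1.0, 0.0], [0.0, 2.0]
gate_scores = [0.0, 0.0, 0.0]
s_tilde = interpolate(s, s_l, s_r, gate_scores)
```

Raising one raw gate score relative to the others moves the interpolated vector towards the corresponding representation, which is how a dominant left or right context could take over.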
The probability distribution of sentiment labels is then predicted from this interpolated representation.

Training
Our models are trained to minimize a cross-entropy loss objective with an $\ell_2$ regularization term, where $\theta$ is the set of parameters, $p_{t_i}$ is the probability of the $i$th training example given by the model and $\lambda$ is a regularization hyper-parameter, $\lambda = 10^{-6}$. We use momentum stochastic gradient descent (Sutskever et al., 2013) with a learning rate of $\eta = 0.01$ for optimization.

Table 1 shows the corpus statistics. Both datasets are three-way classification data.
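The training objective and the optimizer described in the Training paragraph can be sketched in plain Python. The values of $\lambda$ and $\eta$ are taken from the text; the momentum coefficient `MU`, the factor 0.5 in front of the penalty, and the classical form of the momentum update are assumptions, since the text does not state them.

```python
import math

LAMBDA = 1e-6  # regularization weight reported in the text
ETA = 0.01     # learning rate reported in the text
MU = 0.9       # momentum coefficient: an assumption, not stated in the text

def loss(gold_probs, params):
    """Cross-entropy over gold-label probabilities plus an l2 penalty."""
    cross_entropy = -sum(math.log(p) for p in gold_probs)
    l2_penalty = 0.5 * LAMBDA * sum(x * x for x in params)  # the 0.5 factor is an assumption
    return cross_entropy + l2_penalty

def momentum_step(params, grads, velocity):
    """One classical momentum update; returns the new (params, velocity)."""
    new_velocity = [MU * v - ETA * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity

# Toy example with made-up probabilities, parameters and gradients.
value = loss([0.5, 0.25], [1.0, -2.0])
params, velocity = momentum_step([1.0, -2.0], [0.1, -0.2], [0.0, 0.0])
```

With such a small $\lambda$ the penalty barely changes the loss value; its effect is a slight pull of the parameters towards zero over many updates.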

Parameters & Metrics
The hyper-parameters are given in Table 2. We use GloVe vectors (Pennington et al., 2014) with 200 dimensions as pre-trained word embeddings, which are tuned during training. Two metrics are used to evaluate model performance: the classification accuracy and the macro F1-measure over the three sentiment classes.
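For reference, the two metrics can be computed as in the following plain-Python sketch. The label encoding (-1, 0, 1 for negative, neutral, positive) and the toy predictions are assumptions made for illustration.

```python
def accuracy(gold, pred):
    """Fraction of targets whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred, labels=(-1, 0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy example with made-up labels.
gold = [1, 0, -1, 0, 1, 0]
pred = [1, 0, 0, 0, -1, 0]
acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred)
```

Because macro F1 averages over classes without weighting by frequency, a model that does well only on the majority class scores lower on macro F1 than on accuracy, as in this toy example.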

Development Experiments
We run three variants of targeted sentiment classification models on the development section of Z-Dataset to investigate the effectiveness of the attention mechanism. A simple BILSTM without attention is deployed as our baseline. Table 3 shows the development results. We find that BILSTM-C gives a 0.6% accuracy improvement by differentiating the left and right contexts. However, surprisingly, BILSTM-G does not give much improvement despite using gates to control the contexts. This differs from observations in previous work, where the gate mechanism was found to improve accuracy without using attention. Finally, compared to baseline models without attention, our models give an average 1.2% accuracy improvement and a 1.8% macro F1 improvement. Our final model (BILSTM-ATT-G) gives a significant 2.3% accuracy improvement (p < 0.01 using a t-test) and a 3.0% macro F1 improvement over the strongest baseline.

Final Results
We compare our models with previous work. The final results are shown in Table 4. Our final models outperform previous work, including Tang et al. (2016), achieving 73.55% accuracy and 72.07% macro F1 on T-Dataset, and 75.04% accuracy and 72.29% macro F1 on Z-Dataset. Compared with previous work on the Z-Dataset, our final models give significant improvements (p < 0.05).

Analysis
We compare the performance of the various models against OOV rates. In particular, we split the test sentences into two sets, where one contains sentences that have no OOV words and the other consists of sentences that have at least one OOV word. The results are shown in Figure 2. BILSTM-ATT-G performs the best, especially on OOV sentences, which shows the robustness of BILSTM-ATT-G.
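The split of the test set by OOV status can be sketched as follows in plain Python. What counts as the vocabulary is an assumption here (for example, the words seen in training or covered by the pre-trained embeddings); the text does not specify it, and the function name `split_by_oov` is illustrative.

```python
def split_by_oov(sentences, vocab):
    """Partition tokenized sentences into (no_oov, has_oov) lists."""
    no_oov, has_oov = [], []
    for tokens in sentences:
        if all(tok in vocab for tok in tokens):
            no_oov.append(tokens)
        else:
            has_oov.append(tokens)
    return no_oov, has_oov

# Toy example with a made-up vocabulary and sentences.
vocab = {"she", "began", "to", "love", "music"}
sentences = [
    ["she", "began", "to", "love", "music"],
    ["she", "began", "to", "love", "xd"],
]
no_oov, has_oov = split_by_oov(sentences, vocab)
```

Each model would then be evaluated separately on the two lists to obtain a comparison of the kind reported in Figure 2.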
We compare the performance of the various models on each distinct polarity. The results are shown in Figure 5. Interestingly, compared to BILSTM-ATT without contextualized attention, BILSTM-ATT-C loses accuracy on positive targets (-1.1%). However, BILSTM-ATT-G gives large improvements on positive (+4.2%) and neutral (+1.2%) targets but loses accuracy on negative targets (-2.4%). Overall, both BILSTM-ATT-C and BILSTM-ATT-G outperform BILSTM-ATT on neutral cases, which account for 50% of all targets.
The examples in Figure 3 are consistent with intuition. The words "most", "famous", "history" and "XD" lead to a positive label, while the word "damn" leads to a negative label. In Figure 3(c), although "haha" could be a positive word, the sentiment class of the target is neutral. This can be explained by the fact that the word "haha" shows the happiness of the speaker rather than a sentiment towards the target "Nicolas Cage". Figure 3(d) shows an example of a long sentence, where the left context dominates the sentiment. Applying the attention mechanism to the left and right contexts of the target is therefore meaningful and beneficial.

Conclusion
Prior work on targeted sentiment analysis investigates sentence representations that are target-specific but does not explicitly model the contribution of each word towards targeted sentiment. We investigated various attentional neural networks for targeted sentiment classification. Experiments demonstrated that attention over words is highly useful for targeted sentiment analysis. Our model gives the best reported results on two different benchmarks.