Attention-based LSTM for Aspect-level Sentiment Classification

Aspect-level sentiment classification is a fine-grained task in sentiment analysis. Since it provides more complete and in-depth results, aspect-level sentiment analysis has received much attention in recent years. In this paper, we reveal that the sentiment polarity of a sentence is not only determined by the content but is also highly related to the concerned aspect. For instance, in “The appetizers are ok, but the service is slow.”, for aspect taste the polarity is positive, while for service the polarity is negative. Therefore, it is worthwhile to explore the connection between an aspect and the content of a sentence. To this end, we propose an Attention-based Long Short-Term Memory Network for aspect-level sentiment classification. The attention mechanism can concentrate on different parts of a sentence when different aspects are taken as input. We experiment on the SemEval 2014 dataset and results show that our model achieves state-of-the-art performance on aspect-level sentiment classification.


Introduction
Sentiment analysis (Nasukawa and Yi, 2003), also known as opinion mining (Liu, 2012), is a key NLP task that has received much attention in recent years. Aspect-level sentiment analysis is a fine-grained task that can provide complete and in-depth results. In this paper, we deal with aspect-level sentiment classification, and we find that the sentiment polarity of a sentence is highly dependent on both content and aspect. For example, the sentiment polarity of "Staffs are not that friendly, but the taste covers all." will be positive if the aspect is food but negative when considering the aspect service. Polarity can thus be opposite when different aspects are considered.
Neural networks have achieved state-of-the-art performance in a variety of NLP tasks such as machine translation (Lample et al., 2016), paraphrase identification (Yin et al., 2015), question answering (Golub and He, 2016) and text summarization (Rush et al., 2015). However, neural network models are still in their infancy for aspect-level sentiment classification. In some works, target-dependent sentiment classification can benefit from taking target information into account, such as in Target-Dependent LSTM (TD-LSTM) and Target-Connection LSTM (TC-LSTM) (Tang et al., 2015a). However, those models can only take into consideration the target, not the aspect information, which has proved to be crucial for aspect-level classification.
Attention has become an effective mechanism for obtaining superior results, as demonstrated in image recognition (Mnih et al., 2014), machine translation (Bahdanau et al., 2014), reasoning about entailment (Rocktäschel et al., 2015) and sentence summarization (Rush et al., 2015). Moreover, neural attention can improve machine reading comprehension (Hermann et al., 2015). In this paper, we propose an attention mechanism to force the model to attend to the important part of a sentence in response to a specific aspect. We design an aspect-to-sentence attention mechanism that can concentrate on the key part of a sentence given the aspect.
We explore the potential correlation of aspect and sentiment polarity in aspect-level sentiment classification. In order to capture important information in response to a given aspect, we design an attention-based LSTM. We evaluate our approach on a benchmark dataset (Pontiki et al., 2014), which contains restaurant and laptop data.
The main contributions of our work can be summarized as follows:
• We propose attention-based Long Short-Term Memory networks for aspect-level sentiment classification. The models are able to attend to different parts of a sentence when different aspects are concerned. Results show that the attention mechanism is effective.
• Since aspect plays a key role in this task, we propose two ways to take into account aspect information during attention: one way is to concatenate the aspect vector into the sentence hidden representations for computing attention weights, and another way is to additionally append the aspect vector into the input word vectors.
• Experimental results indicate that our approach can improve the performance compared with several baselines, and further examples demonstrate the attention mechanism works well for aspect-level sentiment classification.
The rest of our paper is structured as follows: Section 2 briefly discusses related work, Section 3 gives a detailed description of our attention-based proposals, Section 4 presents extensive experiments to justify their effectiveness, and Section 5 summarizes this work and discusses future directions.

Related Work
In this section, we briefly review related work on aspect-level sentiment classification and on neural networks for sentiment classification.

Sentiment Classification at Aspect-level
Aspect-level sentiment classification is typically considered a classification problem in the literature. As mentioned before, it is a fine-grained classification task. The majority of current approaches attempt to detect the polarity of an entire sentence, regardless of the entities or aspects mentioned. Traditional approaches to this problem manually design a set of features. With the abundance of sentiment lexicons (Rao and Ravichandran, 2009; Perez-Rosas et al., 2012; Kaji and Kitsuregawa, 2007), lexicon-based features have been built for sentiment analysis (Mohammad et al., 2013). Most of these studies focus on building sentiment classifiers with features such as bag-of-words and sentiment lexicons, using SVMs (Mullen and Collier, 2004). However, the results highly depend on the quality of the features, and feature engineering is labor intensive.

Sentiment Classification with Neural Networks
Since a simple and effective approach to learning distributed representations was proposed (Mikolov et al., 2013), neural networks have advanced sentiment analysis substantially. Classical models, including the Recursive Neural Network (Socher et al., 2011; Dong et al., 2014; Qian et al., 2015), Recursive Neural Tensor Network (Socher et al., 2013), Recurrent Neural Network (Mikolov et al., 2010; Tang et al., 2015b), LSTM (Hochreiter and Schmidhuber, 1997) and Tree-LSTMs (Tai et al., 2015), have been applied to sentiment analysis. By utilizing the syntax structures of sentences, tree-based LSTMs have proved quite effective for many NLP tasks. However, such methods may suffer from syntax parsing errors, which are common in resource-lacking languages. LSTM has achieved great success in various NLP tasks. TD-LSTM and TC-LSTM (Tang et al., 2015a), which take target information into consideration, achieved state-of-the-art performance in target-dependent sentiment classification. TC-LSTM obtains a target vector by averaging the vectors of the words that the target phrase contains. However, simply averaging the word embeddings of a target phrase is not sufficient to represent its semantics, resulting in suboptimal performance.
Despite the effectiveness of those methods, it is still challenging to discriminate different sentiment polarities at a fine-grained aspect level. Therefore, we are motivated to design a powerful neural network which can fully employ aspect information for sentiment classification.
Attention-based LSTM with Aspect Embedding

Long Short-term Memory (LSTM)
The Recurrent Neural Network (RNN) is an extension of the conventional feed-forward neural network. However, standard RNNs suffer from the vanishing or exploding gradient problem. To overcome these issues, the Long Short-term Memory network (LSTM) was developed and has achieved superior performance (Hochreiter and Schmidhuber, 1997). In the LSTM architecture, there are three gates and a cell memory state. Figure 1 illustrates the architecture of a standard LSTM. {w_1, w_2, ..., w_N} represent the word vectors of a sentence whose length is N. {h_1, h_2, ..., h_N} are the hidden vectors.
More formally, each cell in the LSTM can be computed as follows:

X = [h_{t-1}; x_t]
i_t = σ(W_i · X + b_i)
f_t = σ(W_f · X + b_f)
o_t = σ(W_o · X + b_o)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c · X + b_c)
h_t = o_t ⊙ tanh(c_t)

where W_i, W_f, W_o ∈ R^{d×2d} are the weight matrices and b_i, b_f, b_o ∈ R^d are the biases of the LSTM, learned during training and parameterizing the transformations of the input, forget and output gates respectively; W_c and b_c likewise parameterize the candidate cell state. σ is the sigmoid function and ⊙ stands for element-wise multiplication. x_t is the input to the LSTM cell unit, representing the word embedding vector w_t in Figure 1. h_t is the vector of the hidden layer.
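As a concrete illustration, the cell computation above can be traced in a few lines of numpy. This is a minimal sketch, not the paper's implementation: the parameter names mirror the equations, the toy dimension d = 4 and random initialization are ours, and the weights act on the concatenation of h_{t-1} and x_t (hence the d × 2d shapes).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: the gates act on X = [h_{t-1}; x_t]."""
    X = np.concatenate([h_prev, x_t])            # in R^{2d}
    i = sigmoid(p["W_i"] @ X + p["b_i"])         # input gate
    f = sigmoid(p["W_f"] @ X + p["b_f"])         # forget gate
    o = sigmoid(p["W_o"] @ X + p["b_o"])         # output gate
    c = f * c_prev + i * np.tanh(p["W_c"] @ X + p["b_c"])  # cell state
    h = o * np.tanh(c)                           # hidden vector
    return h, c

# toy usage with hidden/input size d = 4
d = 4
rng = np.random.default_rng(0)
p = {f"W_{g}": 0.1 * rng.standard_normal((d, 2 * d)) for g in "ifoc"}
p.update({f"b_{g}": np.zeros(d) for g in "ifoc"})
h, c = lstm_step(rng.standard_normal(d), np.zeros(d), np.zeros(d), p)
```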
We regard the last hidden vector h_N as the representation of the sentence and feed it into a softmax layer after a linear transformation into a vector whose length equals the number of class labels. In our work, the set of class labels is {positive, negative, neutral}.

LSTM with Aspect Embedding (AE-LSTM)
Aspect information is vital when classifying the polarity of a sentence given an aspect. We may get opposite polarities if different aspects are considered.
To make the best use of aspect information, we propose to learn an embedding vector for each aspect.
A vector v_{a_i} ∈ R^{d_a} represents the embedding of aspect i, where d_a is the dimension of the aspect embedding. The matrix A ∈ R^{d_a×|A|} is made up of all aspect embeddings. To the best of our knowledge, this is the first work to propose aspect embeddings.
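The bookkeeping for A amounts to one learnable column per aspect. In the sketch below, the aspect inventory matches the SemEval restaurant aspects used later in the paper, but the dimension d_a = 300 and the small uniform initialization are illustrative assumptions, not prescribed here.

```python
import numpy as np

# aspect inventory (from the SemEval 2014 restaurant data);
# d_a is the aspect-embedding dimension (illustrative choice)
aspects = ["food", "price", "service", "ambience", "anecdotes/miscellaneous"]
d_a = 300
rng = np.random.default_rng(0)

# A in R^{d_a x |A|}: one learnable column per aspect, small random init
A = rng.uniform(-0.01, 0.01, size=(d_a, len(aspects)))

def aspect_embedding(aspect):
    """Look up v_{a_i}: the column of A for aspect i."""
    return A[:, aspects.index(aspect)]

v_a = aspect_embedding("service")
```

During training, the columns of A are updated by backpropagation just like any other parameter.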

Attention-based LSTM (AT-LSTM)
A standard LSTM cannot detect which part of a sentence is important for aspect-level sentiment classification. To address this issue, we design an attention mechanism that can capture the key part of a sentence in response to a given aspect. Figure 2 shows the architecture of an Attention-based LSTM (AT-LSTM).
Let H ∈ R^{d×N} be a matrix consisting of the hidden vectors [h_1, ..., h_N] that the LSTM produced, where d is the size of the hidden layers and N is the length of the given sentence. Furthermore, v_a represents the embedding of the aspect and e_N ∈ R^N is a vector of 1s. The attention mechanism produces an attention weight vector α and a weighted hidden representation r:

M = tanh([W_h H; W_v v_a ⊗ e_N])
α = softmax(w^T M)
r = H α^T

where W_h, W_v and w are projection parameters, α is a vector consisting of attention weights, and r is a weighted representation of the sentence given the aspect. The operator ⊗ means v_a ⊗ e_N = [v_a; v_a; ...; v_a]; that is, it repeatedly concatenates v_a for N times, so W_v v_a ⊗ e_N repeats the linearly transformed v_a as many times as there are words in the sentence. The final sentence representation is given by:

h* = tanh(W_p r + W_x h_N)

where h* ∈ R^d, and W_p and W_x are projection parameters to be learned during training. We find that the model works better in practice if we add W_x h_N into the final representation of the sentence, which is inspired by (Rocktäschel et al., 2015).
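Putting the attention computation together, the following is a minimal numpy sketch under the definitions above. The toy dimensions (d = 4, d_a = 3, N = 5) and the random stand-ins for the learned weights are assumptions for illustration only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def aspect_attention(H, h_N, v_a, W_h, W_v, w, W_p, W_x):
    """Aspect-conditioned attention: M = tanh([W_h H; W_v v_a (x) e_N]),
    alpha = softmax(w^T M), r = H alpha^T, h* = tanh(W_p r + W_x h_N)."""
    N = H.shape[1]
    Va = np.tile((W_v @ v_a)[:, None], (1, N))  # W_v v_a (x) e_N: repeat N times
    M = np.tanh(np.vstack([W_h @ H, Va]))       # shape (d + d_a) x N
    alpha = softmax(w @ M)                      # attention weights over words
    r = H @ alpha                               # weighted sentence representation
    return np.tanh(W_p @ r + W_x @ h_N), alpha  # h*, alpha

# toy dimensions: hidden size d = 4, aspect size d_a = 3, sentence length N = 5
d, d_a, N = 4, 3, 5
rng = np.random.default_rng(0)
H = rng.standard_normal((d, N))
h_star, alpha = aspect_attention(
    H, H[:, -1], rng.standard_normal(d_a),
    rng.standard_normal((d, d)), rng.standard_normal((d_a, d_a)),
    rng.standard_normal(d + d_a),
    rng.standard_normal((d, d)), rng.standard_normal((d, d)))
```

Note how the aspect enters only through the attention scores here; the hidden states H themselves are aspect-agnostic, which is exactly what ATAE-LSTM later changes.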
The attention mechanism allows the model to capture the most important part of a sentence when different aspects are considered.
h* is considered the feature representation of a sentence given an input aspect. We add a linear layer to convert the sentence vector into e, a real-valued vector whose length equals the number of classes |C|. Then a softmax layer transforms e into a conditional probability distribution:

y = softmax(W_s h* + b_s)

where W_s and b_s are the parameters of the softmax layer.

Attention-based LSTM with Aspect Embedding (ATAE-LSTM)
The way of using aspect information in AE-LSTM is to let the aspect embedding play a role in computing the attention weights. In order to take better advantage of aspect information, we append the input aspect embedding to each word input vector. The structure of this model is illustrated in Figure 3. In this way, the output hidden representations (h_1, h_2, ..., h_N) can carry information from the input aspect (v_a). Therefore, the following step that computes the attention weights can model the interdependence between words and the input aspect.
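The ATAE-LSTM input construction reduces to a per-word concatenation, sketched below. The dimensions (d_w = d_a = 300, a six-word sentence) and random vectors are illustrative assumptions.

```python
import numpy as np

def atae_inputs(word_vectors, v_a):
    """Append the aspect embedding v_a to every word vector, so each
    LSTM input has dimension d_w + d_a and the resulting hidden states
    carry aspect information."""
    return [np.concatenate([w, v_a]) for w in word_vectors]

rng = np.random.default_rng(0)
words = [rng.standard_normal(300) for _ in range(6)]  # d_w = 300, N = 6
v_a = rng.standard_normal(300)                        # d_a = 300
xs = atae_inputs(words, v_a)                          # each x_t in R^600
```

This is why the LSTM weight matrices grow in the AE variants: the cell now consumes inputs of size d_w + d_a rather than d_w.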

Model Training
The model can be trained in an end-to-end way by backpropagation, where the objective (loss) function is the cross-entropy loss. Let y be the target distribution for a sentence and ŷ be the predicted sentiment distribution. The goal of training is to minimize the cross-entropy error between y and ŷ for all sentences:

loss = −Σ_i Σ_j y_i^j log ŷ_i^j + λ‖θ‖²
where i is the index of a sentence and j is the index of a class. Our classification is three-way. λ is the L2-regularization weight and θ is the parameter set. Similar to a standard LSTM, the parameter set contains the LSTM weights and biases {W_i, b_i, W_f, b_f, W_o, b_o, W_c, b_c}; furthermore, the word embeddings are parameters too. Note that the dimensions of W_i, W_f, W_o, W_c change along with different models: if the aspect embeddings are added into the input of the LSTM cell unit, the dimensions of W_i, W_f, W_o, W_c are enlarged correspondingly. The additional parameters are listed as follows:
AT-LSTM: The aspect embedding matrix A is naturally added into the set of parameters. In addition, W_h, W_v, W_p, W_x and w are the parameters of the attention mechanism. Therefore, the additional parameter set of AT-LSTM is {A, W_h, W_v, W_p, W_x, w}.
AE-LSTM: The parameters include the aspect embedding matrix A. Besides, the dimensions of W_i, W_f, W_o, W_c are expanded since the aspect vector is concatenated to the input. Therefore, the additional parameter set consists of {A}.
ATAE-LSTM: The additional parameter set consists of {A, W_h, W_v, W_p, W_x, w}. Additionally, the dimensions of W_i, W_f, W_o, W_c are expanded by the concatenation of the aspect embedding.
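The training objective above can be sketched as follows. This is a minimal numpy illustration of the cross-entropy plus L2 term; the toy gold/predicted distributions and the single dummy parameter matrix are ours, not from the paper.

```python
import numpy as np

def objective(Y, Y_hat, params, lam=0.001):
    """loss = -sum_i sum_j y_i^j * log(yhat_i^j) + lam * ||theta||^2,
    where rows of Y / Y_hat index sentences and columns index classes."""
    ce = -np.sum(Y * np.log(Y_hat))                     # cross-entropy term
    l2 = lam * sum(np.sum(th ** 2) for th in params)    # L2 regularization
    return ce + l2

# three-way classification over two toy sentences
Y = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])       # gold distributions
Y_hat = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])   # predicted distributions
loss = objective(Y, Y_hat, params=[np.ones((2, 2))])
```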
The word embeddings and aspect embeddings are optimized during training. The percentage of out-of-vocabulary words is about 5%, and they are randomly initialized from U(−ϵ, ϵ), where ϵ = 0.01.
In our experiments, we use AdaGrad (Duchi et al., 2011) as our optimization method, which has remarkably improved the robustness of SGD on large-scale learning tasks in a distributed environment (Dean et al., 2012). AdaGrad adapts the learning rate to the parameters, performing larger updates for infrequent parameters and smaller updates for frequent parameters.
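The per-parameter scaling AdaGrad performs can be sketched in a few lines (a toy illustration; the real update runs over every parameter in θ, and the epsilon smoothing constant is a standard assumption, not stated in the paper):

```python
import numpy as np

def adagrad_update(theta, grad, accum, lr=0.01, eps=1e-8):
    """One AdaGrad step: accumulate squared gradients per parameter and
    divide the step size by their square root, so infrequently updated
    parameters receive larger effective learning rates."""
    accum += grad ** 2
    theta -= lr * grad / (np.sqrt(accum) + eps)
    return theta, accum

theta = np.array([1.0, 1.0])   # toy parameters
accum = np.zeros(2)            # running sum of squared gradients
theta, accum = adagrad_update(theta, np.array([0.5, 2.0]), accum)
```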

Experiment
We apply the proposed models to aspect-level sentiment classification. In our experiments, all word vectors are initialized with GloVe (Pennington et al., 2014). The word embedding vectors are pre-trained on an unlabeled corpus of about 840 billion tokens. The other parameters are initialized by sampling from a uniform distribution U(−ϵ, ϵ). The dimensions of the word vectors and aspect embeddings, as well as the size of the hidden layer, are 300. The length of the attention weight vector is the same as the length of the sentence. Theano (Bastien et al., 2012) is used to implement our neural network models. We trained all models with a batch size of 25 examples, a momentum of 0.9, an L2-regularization weight of 0.001 and an initial learning rate of 0.01 for AdaGrad.

Dataset
We experiment on the dataset of SemEval 2014 Task 4 (Pontiki et al., 2014). The dataset consists of customer reviews. Each review contains a list of aspects and their corresponding polarities. Our aim is to identify the aspect polarity of a sentence with the corresponding aspect. The statistics are presented in Table 1.

Task Definition
Aspect-level Classification Given a set of pre-identified aspects, this task is to determine the polarity of each aspect. For example, given the sentence "The restaurant was too expensive.", there is an aspect price whose polarity is negative. The set of aspects is {food, price, service, ambience, anecdotes/miscellaneous}. In the dataset of SemEval 2014 Task 4, only the restaurant data has aspect-specific polarities. Table 2 illustrates the comparative results.
Aspect-Term-level Classification Given a set of aspect terms within a sentence, this task is to determine whether the polarity of each aspect term is positive, negative or neutral. We conduct experiments on the dataset of SemEval 2014 Task 4. The sentences in both the restaurant and laptop datasets are annotated with the location and sentiment polarity of each occurrence of an aspect term. For example, there is an aspect term fajitas whose polarity is positive in the sentence "I loved their fajitas.". Experimental results are shown in Table 3 and Table 4. Similar to the experiment on aspect-level classification, our models achieve state-of-the-art performance.

Comparison with baseline methods
We compare our model with several baselines, including LSTM, TD-LSTM, and TC-LSTM.
Figure 4: Attention Visualizations. The aspects of (a) and (b) are service and food respectively. The color depth expresses the importance of each weight in the attention vector α. From (a), attention can dynamically detect the important words in the whole sentence, even a multi-semantic phrase such as "fastest delivery times" that could be used in other domains. From (b), attention can detect multiple key points when more than one exists.

LSTM: A standard LSTM cannot capture any aspect information in a sentence, so it must predict the same sentiment polarity even though different aspects are given. Since it cannot take advantage of the aspect information, the model unsurprisingly has the worst performance.
TD-LSTM: TD-LSTM can improve the performance of a sentiment classifier by treating an aspect as a target. Since there is no attention mechanism in TD-LSTM, however, it cannot "know" which words are important for a given aspect.
TC-LSTM: TC-LSTM extended TD-LSTM by incorporating a target into the representation of a sentence. It is worth noting that TC-LSTM performs worse than LSTM and TD-LSTM in Table 2. TC-LSTM added target representations, obtained by averaging word vectors, into the input of the LSTM cell unit.
In our models, we embed aspects into another vector space, and the embedding vectors of aspects can be learned well in the process of training. ATAE-LSTM not only addresses the mismatch between word vectors and aspect embeddings, but also captures the most important information in response to a given aspect. In addition, ATAE-LSTM can attend to the important and different parts of a sentence when different aspects are given.

Qualitative Analysis
It is enlightening to analyze which words decide the sentiment polarity of a sentence given an aspect. We can obtain the attention weights α in Equation 8 and visualize them accordingly, using the visualization tool Heml (Deng et al., 2014). Figure 4 shows how attention focuses on words under the influence of a given aspect. The sentences in Figure 4 are "I have to say they have one of the fastest delivery times in the city." and "The fajita we tried was tasteless and burned and the mole sauce was way too sweet.", with corresponding aspects service and food respectively. Clearly, attention can pick out the important parts of the whole sentence dynamically. In Figure 4 (a), "fastest delivery times" is a multi-word phrase, but our attention-based model can detect it when service is the input aspect. Besides, attention can detect multiple keywords if more than one exists: in Figure 4 (b), tasteless and too sweet are both detected.

Case Study
As we demonstrated, our models obtain state-of-the-art performance. In this section, we further show the advantages of our proposals through some typical examples.
In Figure 5, we list some examples from the test set which have typical characteristics and cannot be correctly inferred by LSTM. In sentence (a), "The appetizers are ok, but the service is slow.", there are two aspects, food and service. Our model can discriminate different sentiment polarities for different aspects. In sentence (b), "I highly recommend it for not just its superb cuisine, but also for its friendly owners and staff.", there is a negation word, not. Our model obtains the correct polarity, unaffected by this word, which does not express negation here. In the last instance (c), "The service, however, is a peg or two below the quality of food (horrible bartenders), and the clientele, for the most part, are rowdy, loud-mouthed commuters (this could explain the bad attitudes from the staff) getting loaded for an AC/DC concert or a Knicks game.", the sentence has such a long and complicated structure that existing parsers can hardly obtain correct parse trees, so it is difficult for tree-based neural network models to predict polarity correctly. In contrast, our attention-based LSTMs work well on such sentences with the help of the attention mechanism and aspect embeddings.

Conclusion and Future Work
In this paper, we have proposed attention-based LSTMs for aspect-level sentiment classification.
The key idea of these proposals is to learn aspect embeddings and let aspects participate in computing attention weights. Our proposed models can concentrate on different parts of a sentence when different aspects are given, making them more competitive for aspect-level classification. Experiments show that our proposed models, AE-LSTM and ATAE-LSTM, obtain superior performance over the baseline models.
Though the proposals have shown potential for aspect-level sentiment analysis, different aspects are input separately. As future work, an interesting and promising direction would be to model more than one aspect simultaneously with the attention mechanism.