Multi-grained Attention Network for Aspect-Level Sentiment Classification

We propose a novel multi-grained attention network (MGAN) model for aspect-level sentiment classification. Existing approaches mostly adopt coarse-grained attention mechanisms, which may cause information loss when the aspect contains multiple words or the context is long. We propose a fine-grained attention mechanism that captures the word-level interactions between aspect and context, and we combine the fine-grained and coarse-grained attention mechanisms to compose the MGAN framework. Moreover, unlike previous works that train each aspect with its context separately, we design an aspect alignment loss to depict the aspect-level interactions among aspects that share the same context. We evaluate the proposed approach on three datasets: laptop and restaurant reviews from SemEval 2014, and a Twitter dataset. Experimental results show that the multi-grained attention network consistently outperforms state-of-the-art methods on all three datasets. We also conduct experiments to evaluate the effectiveness of the aspect alignment loss, which indicate that the aspect-level interactions bring extra useful information and further improve the performance.


Introduction
Aspect-level sentiment classification is a fundamental task in sentiment analysis (Pang et al., 2008; Liu, 2012), which aims to infer the sentiment polarity (e.g. positive, neutral, negative) of a sentence with respect to given aspects. For example, in the sentence "I like coming back to Mac OS but this laptop is lacking in speaker quality compared to my $400 old HP laptop", the polarity towards the aspect "Mac OS" is positive while the polarity towards the aspect "speaker quality" is negative.
Many statistical methods, such as support vector machines (SVM) (Wagner et al., 2014; Kiritchenko et al., 2014), have been employed with well-designed handcrafted features. In recent years, neural network models (Socher et al., 2011; Dong et al., 2014; Nguyen and Shirai, 2015) have been studied to automatically learn low-dimensional representations for aspects and their contexts. Attention mechanisms (Wang et al., 2016; Ma et al., 2017) have also been studied to characterize the effect of the aspect, enforcing the model to pay more attention to the important words in the context. Previous works (Tang et al., 2016b; Chen et al., 2017) mainly employed the simple averaged aspect vector to learn the attention weights on the context words. Ma et al. [2017] further proposed a bidirectional attention mechanism, which interactively learns the attention weights on context/aspect words with respect to the averaged vector of the aspect/context, respectively.
The above attention methods are all at the coarse-grained level, which simply averages the aspect/context vector to guide the learning of attention weights on the context/aspect words. This simple average pooling might cause information loss, especially for aspects with multiple words or a long context. For example, in the sentence "I like coming back to Mac OS but this laptop is lacking in speaker quality compared to my $400 old HP laptop", the simple averaged vector of the long context might lose information when steering the attention weights on the aspect words. Similarly, the simple averaged vector of the aspect (i.e. "speaker quality") may deviate from its intuitive core meaning (i.e. "quality") when enforcing the model to pay varying attention to the context words. From another perspective, previous works all regard an aspect and its context words as one instance and train each instance separately; they do not consider the relationships among instances that share the same context words. The aspect-level interactions among instances with the same context might bring extra useful information. Considering the above example, intuitively, the aspect "speaker quality" should pay more attention to "lacking" and less attention to "like" than the aspect "Mac OS", since they have different sentiment polarities.
In this paper, we propose a multi-grained attention network to address the above two issues in aspect-level sentiment classification. Specifically, we propose a fine-grained attention mechanism (i.e. F-Aspect2Context and F-Context2Aspect), which characterizes the word-level interactions between aspect and context words and relieves the information loss that occurs in coarse-grained attention mechanisms. In addition, we utilize the bidirectional coarse-grained attentions (i.e. C-Aspect2Context and C-Context2Aspect) and combine them with the fine-grained attention vectors to compose the multi-grained attention network for the final sentiment polarity prediction, leveraging the advantages of both. More importantly, in order to make use of the valuable aspect-level interaction information, we design an aspect alignment loss in the objective function to enhance the difference of the attention weights towards aspects that have the same context but different sentiment polarities. As far as we know, we are the first to explore the interactions among aspects with the same context.
To evaluate the proposed approach, we conduct experiments on three datasets: laptop and restaurant reviews from SemEval 2014 Task 4, and a tweet collection. Experimental results show that our method achieves the best performance on all three datasets.

Related Work
Aspect-level sentiment analysis is a branch of sentiment classification, which requires considering both the sentence and aspect information.
Traditional approaches (Jiang et al., 2011; Kiritchenko et al., 2014; Vo and Zhang, 2015) regard this task as a text classification problem and design effective features, which are used in statistical learning algorithms to train a classifier. Kiritchenko et al. [2014] proposed an SVM based on n-gram, parse and lexicon features, which achieved the best performance in SemEval 2014. Vo and Zhang [2015] designed sentiment-specific word embeddings and sentiment lexicons as rich features for prediction. These methods depend heavily on laborious feature engineering and easily reach a performance bottleneck.
In recent works, there have been growing studies on neural network based methods due to their capability of encoding original features as continuous and low-dimensional vectors without feature engineering. Recursive neural networks (Socher et al., 2011; Dong et al., 2014; Nguyen and Shirai, 2015) have been studied to perform semantic composition over tree structures and generate representations for prediction. LSTM-based methods (Hochreiter and Schmidhuber, 1997) were proposed to model the context information and use an aggregated vector for prediction. TD-LSTM (Tang et al., 2016a) adopted LSTMs to model the left and right contexts of the aspect and concatenated them as the representation for prediction. However, these works only focus on modeling the contexts without considering the aspects, which play an important role in estimating the sentiment polarity.
Attention mechanisms (Wang et al., 2016; Lei et al., 2016) have been studied to enhance the influence of the aspect on the final representation for prediction. Many approaches (Tang et al., 2016b; Chen et al., 2017) adopted the averaged aspect vector to learn the attention weights on the hidden vectors of the context words. Ma et al. [2017] further proposed a bidirectional attention mechanism, which also learns the attention weights on the aspect words towards the averaged vector of the context words. These attention methods only consider coarse-grained attention, using the simple averaged aspect/context vector to steer the attention weight learning on the context/aspect words, which might cause information loss in the long-aspect or long-context case.
In contrast, motivated by the bidirectional attention flow approaches (Seo et al., 2017; Pan et al., 2017) in machine comprehension, we propose a fine-grained attention mechanism that is responsible for linking and fusing information from the aspect and the context words. Furthermore, we leverage both the coarse-grained and fine-grained attentions to compose the multi-grained attention network (MGAN). In addition, existing works train each instance separately; however, we observe that the interactions among aspects that share the same context words can bring extra useful information. Thus we design an aspect alignment loss in the objective function to depict this kind of relationship; to our knowledge, this is the first work to explore the aspect-level interactions.

Task Definition
Given a sentence $s = \{w_1, w_2, \cdots, w_N\}$ consisting of $N$ words and an aspect list $A = \{a_1, \cdots, a_k\}$ of size $k$, where each aspect $a_i = \{w^i_1, \cdots, w^i_M\}$ is a subsequence of $s$ containing $M \in [1, N)$ words, aspect-level sentiment classification evaluates the sentiment polarity of the sentence $s$ with respect to each aspect $a_i$.
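To make the setting concrete, one training sentence with its aspect list can be pictured as below; the field names are ours, not a format prescribed by the paper.

```python
# Illustrative encoding of one labeled sentence with its aspect list.
# Field names are hypothetical; the paper prescribes no data format.
instance = {
    "sentence": "I like coming back to Mac OS but this laptop is lacking "
                "in speaker quality compared to my $400 old HP laptop",
    "aspects": [
        {"term": "Mac OS", "polarity": "positive"},
        {"term": "speaker quality", "polarity": "negative"},
    ],
}
```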
We present the overall architecture of the proposed Multi-grained Attention Network (MGAN) model in Figure 1. It consists of the Input Embedding Layer, the Contextual Layer, the Multi-grained Attention Layer and the Output Layer.

Input Embedding Layer
The input embedding layer maps each word to a high-dimensional vector space. We employ pretrained GloVe word vectors (Pennington et al., 2014) to obtain a fixed embedding for each word. Specifically, we denote the embedding lookup matrix as $L \in \mathbb{R}^{d_v \times |V|}$, where $d_v$ is the word vector dimension and $|V|$ is the vocabulary size.

Contextual Layer
We employ a bidirectional Long Short-Term Memory network (BiLSTM) on top of the embedding layer to capture the temporal interactions among words. Specifically, at time step $t$, given the input word embedding $x_t$, the update process of the forward LSTM network can be formalized as follows:

$$i_t = \sigma(W_i x_t + U_i \overrightarrow{h}_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f \overrightarrow{h}_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o \overrightarrow{h}_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c \overrightarrow{h}_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$\overrightarrow{h}_t = o_t \odot \tanh(c_t)$$

where $\sigma$ is the sigmoid activation function, $i_t$, $f_t$, $o_t$ are the input gate, forget gate and output gate, $W_* \in \mathbb{R}^{d \times d_v}$, $U_* \in \mathbb{R}^{d \times d}$ and $b_* \in \mathbb{R}^{d}$ are learnable parameters, and $d$ is the hidden dimension size. The backward LSTM performs the same process in reverse, and we obtain the concatenated output $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t] \in \mathbb{R}^{2d}$. Given the word embeddings of a context sentence $s$ and a corresponding aspect $a_j$, we apply the BiLSTM to each separately and obtain the sentence contextual output $H \in \mathbb{R}^{2d \times N}$ and the aspect contextual output $Q \in \mathbb{R}^{2d \times M}$.
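For reference, a minimal PyTorch sketch of the embedding and contextual layers is shown below. The module structure and names are ours; we also run one shared encoder over both sequences, while the paper only says the BiLSTM is applied to the sentence and the aspect separately.

```python
import torch
import torch.nn as nn

class ContextualEncoder(nn.Module):
    """Embedding lookup + BiLSTM, a sketch of the two layers above.
    Pretrained GloVe vectors would be copied into `embed.weight`."""
    def __init__(self, vocab_size: int, d_v: int = 300, d: int = 300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_v)
        self.bilstm = nn.LSTM(d_v, d, bidirectional=True, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)   # (batch, length, d_v)
        h, _ = self.bilstm(x)       # (batch, length, 2d)
        return h

encoder = ContextualEncoder(vocab_size=10_000)
H = encoder(torch.randint(0, 10_000, (1, 20)))  # context output, (1, 20, 600)
Q = encoder(torch.randint(0, 10_000, (1, 2)))   # aspect output,  (1, 2, 600)
```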
In addition, considering that context words closer to an aspect may have a higher influence on it, we utilize a position encoding mechanism to model this observation. Formally, the weight $u_j$ for a context word $w_j$ that has word-level distance $l$ from the aspect (here we treat the aspect phrase as a single unit) is defined as:

$$u_j = 1 - \frac{l}{N}$$

Specifically, we set the weights of words within the aspect to 0 in order to focus on the context words in the sentence. We then obtain the final contextual output of the context words as $\hat{H}_j = u_j \cdot H_j$.
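As a concrete illustration, the sketch below computes these position weights under the linear-decay formula above; the function name and the inclusive-span convention are ours.

```python
import torch

def position_weights(n: int, aspect_start: int, aspect_end: int) -> torch.Tensor:
    """Linear position decay 1 - l/N for context words; words inside the
    aspect span [aspect_start, aspect_end] get weight 0 (a sketch of the
    mechanism described above)."""
    u = torch.zeros(n)
    for i in range(n):
        if aspect_start <= i <= aspect_end:
            continue  # aspect words are zeroed out
        l = aspect_start - i if i < aspect_start else i - aspect_end
        u[i] = 1.0 - l / n
    return u

# Weighted contextual output: H_hat = u.unsqueeze(-1) * H, with H of shape (N, 2d)
u = position_weights(n=20, aspect_start=5, aspect_end=6)
```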

Multi-grained Attention Layer
Attention mechanisms are a common way to capture the interactions between the aspect and context words. Previous methods (Tang et al., 2016b; Ma et al., 2017; Chen et al., 2017) only adopt coarse-grained attentions, which simply use the averaged aspect/context vector as the guide to learn the attention weights on the context/aspect. However, the simple average pooling in generating the guide vector might bring some information loss, especially for aspects with multiple words or a long context. We propose a fine-grained attention mechanism, which is responsible for linking and fusing information from the aspect and context words. This mechanism is designed to capture the word-level interactions, estimating how each aspect/context word affects each context/aspect word. In addition, we concatenate both the fine-grained and coarse-grained attention vectors to obtain the final representation. From another perspective, we observe that the relationships among aspects can introduce extra valuable information. Hence, we propose an aspect alignment loss, which is designed to strengthen the attention difference among aspects with the same context and different sentiment polarities.

Coarse-grained Attention
Coarse-grained attention is a widely used mechanism to capture the interactions between aspect and context, which utilizes an averaged aspect vector to steer the attention weights on the context words. Following the work of Ma et al. [2017], we employ a bidirectional attention mechanism, namely C-Aspect2Context and C-Context2Aspect.
(1) C-Aspect2Context learns to assign attention scores to the context words with respect to the averaged aspect vector. Here we employ an average pooling layer over the aspect contextual output $Q$ to generate the averaged aspect vector $Q_{avg} \in \mathbb{R}^{2d}$. For each word vector $H_i$ in the context, we compute the attention weight $a^{ca}_i$ as follows:

$$s^{ca}_i = \tanh(H_i^{\top} W^{ca} Q_{avg}), \quad a^{ca}_i = \frac{\exp(s^{ca}_i)}{\sum_{j=1}^{N} \exp(s^{ca}_j)}$$

where the score function $s^{ca}$ computes the weight that indicates the importance of a context word for the aspect sentiment, and $W^{ca} \in \mathbb{R}^{2d \times 2d}$ is the attention weight matrix. The weighted combination of the context output $m^{ca} \in \mathbb{R}^{2d}$ is then calculated as:

$$m^{ca} = \sum_{i=1}^{N} a^{ca}_i H_i$$

(2) C-Context2Aspect learns to assign attention weights to the aspect words, following a similar process to C-Aspect2Context. We utilize average pooling to obtain the averaged context vector $H_{avg}$, compute the weight $a^{cc}_i$ for each word $w_i$ in the aspect phrase, and obtain the final weighted combination of the aspect output $m^{cc} \in \mathbb{R}^{2d}$:

$$s^{cc}_i = \tanh(Q_i^{\top} W^{cc} H_{avg}), \quad a^{cc}_i = \frac{\exp(s^{cc}_i)}{\sum_{j=1}^{M} \exp(s^{cc}_j)}, \quad m^{cc} = \sum_{i=1}^{M} a^{cc}_i Q_i$$

where $W^{cc} \in \mathbb{R}^{2d \times 2d}$ is the attention weight matrix.
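The following sketch shows one plausible implementation of this bilinear coarse-grained attention; the tanh-bilinear score follows the equations above, while the function and variable names are ours.

```python
import torch
import torch.nn.functional as F

def coarse_attention(H: torch.Tensor, guide: torch.Tensor, W: torch.Tensor):
    """Bilinear coarse-grained attention (sketch). For C-Aspect2Context,
    H is the context output (N, 2d) and guide is Q_avg (2d,); for
    C-Context2Aspect, pass the aspect output Q and H_avg instead."""
    s = torch.tanh(H @ W @ guide)    # (N,) attention scores
    a = F.softmax(s, dim=0)          # (N,) attention weights
    m = a @ H                        # (2d,) weighted combination
    return m, a

two_d = 600                                    # 2d with d = 300
H = torch.randn(20, two_d)                     # context contextual output
Q = torch.randn(2, two_d)                      # aspect contextual output
W_ca = torch.randn(two_d, two_d) * 0.01
m_ca, a_ca = coarse_attention(H, Q.mean(dim=0), W_ca)  # C-Aspect2Context
m_cc, a_cc = coarse_attention(Q, H.mean(dim=0), W_ca)  # C-Context2Aspect (uses its own W_cc in practice)
```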

Fine-grained Attention
As introduced above, we propose a fine-grained attention mechanism to characterize the word-level interactions and evaluate how each aspect/context word affects each context/aspect word. Considering the previous example "I like coming back to Mac OS but this laptop is lacking in speaker quality compared to my $400 old HP laptop", the word "quality" in the aspect "speaker quality" should have more effect on the context words than the word "speaker". Accordingly, the context words should pay more attention to "quality" instead of "speaker". Formally, we define an alignment matrix $U \in \mathbb{R}^{N \times M}$ between the contextual outputs of the context $H$ and the aspect $Q$, where $U_{ij}$ indicates the similarity between the $i$-th context word and the $j$-th aspect word. The similarity matrix $U$ is computed by

$$U_{ij} = W^{u}\,[H_i; Q_j; H_i * Q_j]^{\top}$$

where $W^{u} \in \mathbb{R}^{1 \times 6d}$ is the weight matrix, $[;]$ is vector concatenation across row, and $*$ is elementwise multiplication. We then use $U$ to calculate the attention vectors in both directions.
(1) F-Aspect2Context estimates which context word has the closest similarity to one of the aspect words and is hence critical for determining the sentiment. We compute the attention weights $a^{fa}$ on the context words by

$$s^{fa}_i = \max_{j \in [1, M]} U_{ij}, \quad a^{fa}_i = \frac{\exp(s^{fa}_i)}{\sum_{j=1}^{N} \exp(s^{fa}_j)}$$

where $s^{fa}_i$ takes the maximum similarity across the columns of $U$. We then obtain the attended vector $m^{fa} \in \mathbb{R}^{2d}$ as follows:

$$m^{fa} = \sum_{i=1}^{N} a^{fa}_i H_i$$

(2) F-Context2Aspect measures which aspect words are most relevant to each context word. Let $a^{fc}_i \in \mathbb{R}^{M}$ be the attention weights on the aspect contextual output $Q$ with respect to the $i$-th context word vector $H_i$. The attended aspect vectors $q^{fc} \in \mathbb{R}^{2d \times N}$ are defined as follows:

$$a^{fc}_{ij} = \frac{\exp(U_{ij})}{\sum_{k=1}^{M} \exp(U_{ik})}, \quad q^{fc}_i = \sum_{j=1}^{M} a^{fc}_{ij} Q_j$$

We then apply an average pooling layer over $q^{fc}$ to obtain the attended vector $m^{fc} \in \mathbb{R}^{2d}$:

$$m^{fc} = \frac{1}{N} \sum_{i=1}^{N} q^{fc}_i$$
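A compact sketch of both fine-grained directions is given below; it mirrors the equations above, with all function and tensor names being our own.

```python
import torch
import torch.nn.functional as F

def fine_attention(H: torch.Tensor, Q: torch.Tensor, W_u: torch.Tensor):
    """Fine-grained attention (sketch). U[i, j] scores context word i against
    aspect word j via W_u @ [H_i; Q_j; H_i * Q_j], as in the text above."""
    N, M = H.size(0), Q.size(0)
    Hi = H.unsqueeze(1).expand(N, M, -1)             # (N, M, 2d)
    Qj = Q.unsqueeze(0).expand(N, M, -1)             # (N, M, 2d)
    U = torch.cat([Hi, Qj, Hi * Qj], dim=-1) @ W_u   # (N, M) similarity matrix
    # F-Aspect2Context: max over aspect words, softmax over context words.
    a_fa = F.softmax(U.max(dim=1).values, dim=0)     # (N,)
    m_fa = a_fa @ H                                  # (2d,)
    # F-Context2Aspect: softmax over aspect words per context word, then average.
    a_fc = F.softmax(U, dim=1)                       # (N, M)
    q_fc = a_fc @ Q                                  # (N, 2d)
    m_fc = q_fc.mean(dim=0)                          # (2d,)
    return m_fa, m_fc

H, Q = torch.randn(20, 600), torch.randn(2, 600)
W_u = torch.randn(3 * 600)                           # the 1 x 6d weight, here a vector
m_fa, m_fc = fine_attention(H, Q, W_u)
```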

Output Layer
At last, we concatenate both the coarse-grained and fine-grained attention vectors into the final representation $m = [m^{ca}; m^{cc}; m^{fa}; m^{fc}] \in \mathbb{R}^{8d}$, which is fed to a softmax layer for determining the aspect sentiment polarity:

$$p = \mathrm{softmax}(W^{p} m + b^{p})$$

where $p \in \mathbb{R}^{C}$ is the probability distribution over aspect sentiment polarities, and $W^{p} \in \mathbb{R}^{C \times 8d}$ and $b^{p} \in \mathbb{R}^{C}$ are the weight matrix and bias, respectively. Here we set $C = 3$, which is the number of aspect sentiment classes.
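A minimal sketch of this output layer follows, with placeholder attended vectors standing in for the four attention outputs; the concatenation order and tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

two_d = 600                                     # 2d with d = 300
m_ca, m_cc, m_fa, m_fc = (torch.randn(two_d) for _ in range(4))
m = torch.cat([m_ca, m_cc, m_fa, m_fc])         # final representation, (8d,)
W_p, b_p = torch.randn(3, 4 * two_d), torch.zeros(3)
p = F.softmax(W_p @ m + b_p, dim=0)             # distribution over C = 3 polarities
```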

Aspect Alignment Loss
Existing approaches train each aspect with its context separately, without considering the relationships among the aspects. However, we observe that aspect-level interactions can bring extra valuable information. In order to enhance the attention differences between aspects that have the same context but different sentiment polarities, we design the aspect alignment loss on the C-Aspect2Context attention weights. C-Aspect2Context is employed to find the important context words for a specific aspect. With the constraint of the aspect alignment loss, each aspect will pay more attention to its important words through comparisons with other related aspects. In terms of the previous example, the aspect "speaker quality" should pay more attention to "lacking" and less attention to "like" than the aspect "Mac OS" does, due to their different sentiment polarities. Specifically, for each aspect pair $a_i$ and $a_j$ in the aspect list $A$, we compute the square loss on the coarse-grained attention vectors $a^{ca}_i$ and $a^{ca}_j$, and also estimate a distance $d_{ij} \in [0, 1]$ between $a_i$ and $a_j$ as the loss weight:
$$d_{ij} = \sigma\big(W^{d}\,[Q^{i}_{avg}; Q^{j}_{avg}; Q^{i}_{avg} * Q^{j}_{avg}]^{\top}\big)$$

$$L_{align} = -\sum_{\substack{a_i, a_j \in A \\ y_i \neq y_j}} d_{ij} \sum_{k=1}^{N} \big(a^{ca}_{ik} - a^{ca}_{jk}\big)^{2}$$

where $\sigma$ is the sigmoid function, $W^{d} \in \mathbb{R}^{1 \times 6d}$ is the weight matrix for computing the distance, $Q^{i}_{avg}$ and $Q^{j}_{avg}$ are the averaged aspect vectors of $a_i$ and $a_j$, $y_i$ and $y_j$ are the true labels of the aspects $a_i$ and $a_j$, and $a^{ca}_{ik}$ and $a^{ca}_{jk}$ are the attention weights on the $k$-th context word towards aspects $a_i$ and $a_j$, respectively.
For training the multi-grained attention network (MGAN), we optimize all the parameters $\Theta$: the LSTM parameters $[W_*, U_*, b_*]$, the attention and alignment-loss parameters $[W^{ca}, W^{cc}, W^{u}, W^{d}]$, and the softmax parameters $[W^{p}, b^{p}]$. The final loss function consists of the cross-entropy loss, the aspect alignment loss and a regularization term:

$$L = -\sum_{c=1}^{C} \hat{y}_c \log p_c + \beta L_{align} + \lambda \|\Theta\|_{2}^{2}$$

where $\hat{y}$ is the one-hot vector of the true label, and $\beta \geq 0$ and $\lambda \geq 0$ control the influence of the aspect alignment loss and the $L_2$ regularization term, respectively. We employ the stochastic gradient descent (SGD) optimizer to compute and update the training parameters. In addition, we utilize a dropout strategy to avoid overfitting.
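A sketch of the alignment term is below. The negative sign (pushing the attention distributions of differently-labeled aspects apart, per the description above) and all names reflect our reading, not the authors' code.

```python
import torch

def alignment_loss(a_ca, Q_avg, y, W_d):
    """Aspect alignment loss (sketch). a_ca: (k, N) C-Aspect2Context weights
    for the k aspects sharing one sentence; Q_avg: (k, 2d) averaged aspect
    vectors; y: (k,) gold labels; W_d: (6d,) distance weights. The negative
    sign encourages differently-labeled aspects to attend differently."""
    k = a_ca.size(0)
    loss = a_ca.new_zeros(())
    for i in range(k):
        for j in range(i + 1, k):
            if y[i] == y[j]:
                continue  # only pairs with different polarities contribute
            pair = torch.cat([Q_avg[i], Q_avg[j], Q_avg[i] * Q_avg[j]])
            d_ij = torch.sigmoid(W_d @ pair)          # loss weight in [0, 1]
            loss = loss - d_ij * ((a_ca[i] - a_ca[j]) ** 2).sum()
    return loss

# Total objective would add cross-entropy and L2: L = ce + beta * L_align + lam * l2.
a_ca, Q_avg = torch.rand(2, 20), torch.randn(2, 600)
L_align = alignment_loss(a_ca, Q_avg, y=torch.tensor([0, 2]), W_d=torch.randn(3 * 600))
```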

Experiments
In this section, we conduct experiments to evaluate two hypotheses: (1) whether the word-level interactions between aspect and context can relieve the information loss and improve the performance; and (2) whether the relationships among aspects that have the same context but different sentiment polarities can bring extra useful information.

Experiment Setting
We conduct experiments on three datasets, as shown in Table 1. The first two are from SemEval 2014 Task 4 (Pontiki et al., 2014), which contain reviews in the laptop and restaurant domains, respectively. (A detailed task introduction can be found at http://alt.qcri.org/semeval2014/task4/.) The third is a tweet collection gathered by Dong et al. [2014]. Each aspect with its context is labeled with one of three sentiment polarities, namely Positive, Neutral and Negative. In addition, we adopt Accuracy and Macro-F1 as the metrics to evaluate the performance of aspect-level sentiment classification, as widely used in previous works (Tang et al., 2016b; Ma et al., 2017; Chen et al., 2017; Wang et al., 2016).
In our experiments, the word embeddings for both context and aspect words are initialized with GloVe (Pennington et al., 2014). The word embedding dimension $d_v$ and the hidden state dimension $d$ are both set to 300. The weight matrices and biases are initialized by sampling from a uniform distribution $U(-0.01, 0.01)$. The coefficient $\lambda$ of the $L_2$ regularization term is $10^{-5}$, and the aspect alignment loss parameter $\beta$ and the dropout rate are both set to 0.5.

Compared Methods
To evaluate the performance of the proposed approach, we compare with the following methods: Majority, a basic baseline that assigns the most frequent sentiment polarity in the training set to each instance in the test set, as well as Feature+SVM, LSTM, TD-LSTM, ATAE-LSTM, IAN, MemNet, BILSTM-ATT-G and RAM, which are discussed in the overall comparison below. We also list variants of the MGAN model, which are used to analyze the effects of the coarse-grained attention, the fine-grained attention and the aspect alignment loss, respectively. MGAN-C only employs the coarse-grained attentions for prediction, which is similar to IAN. MGAN-F only utilizes the proposed fine-grained attentions for prediction. MGAN-CF adopts both the coarse-grained and fine-grained attentions, but without the aspect alignment loss. MGAN is the complete multi-grained attention network model.

Overall Performance Comparison
Table 2 shows the performance comparison of MGAN with the baseline methods; we make the following observations. (1) Majority performs worst since it only utilizes the data distribution information. Feature+SVM achieves much better performance on all the datasets thanks to its well-designed feature engineering. Our method MGAN outperforms both Majority and Feature+SVM since MGAN can learn high-quality representations for prediction.
(2) ATAE-LSTM is better than LSTM since it employs an attention mechanism over the hidden states and combines the attention embedding to generate the final representation. TD-LSTM, which employs two LSTM networks to capture the left and right contexts of the aspect, performs slightly better than ATAE-LSTM. TD-LSTM still performs worse than our method MGAN since it cannot properly pay more attention to the important parts of the context.
(3) IAN, which interactively learns the attended aspect and context vectors as the final representation, achieves slightly better results than the previous LSTM-based methods. Our method consistently performs better than IAN since we utilize the fine-grained attention vectors to relieve the information loss in IAN. MemNet repeatedly learns an attended vector over a context word-embedding memory, updating the query vector at each hop. BILSTM-ATT-G models the left and right contexts using attention-based LSTMs, and achieves better performance than MemNet. RAM performs better than the other baselines: it employs a bidirectional LSTM network to generate a contextual memory and learns multiple attended vectors over that memory. Similar to MemNet, it utilizes the averaged aspect vector to learn the attention weights on the context words.
Our proposed MGAN consistently performs better than MemNet, BILSTM-ATT-G and RAM on all three datasets. On one hand, they only learn attention weights on the context towards the aspect, and do not learn weights on the aspect words towards the context. On the other hand, they use only the averaged aspect vector to guide the attention, which loses information, especially for aspects with multiple words. From another perspective, our method employs the aspect alignment loss, which brings extra useful information from the aspect-level interactions.

Analysis of the MGAN Model
Table 3 shows the performance comparison among the variants of the MGAN model; we make the following observations. (1) The proposed fine-grained attention mechanism MGAN-F, which is responsible for linking and fusing the information between the context and aspect words, achieves competitive performance compared with MGAN-C, especially on the laptop dataset. To investigate this, we report the percentage of aspects with different word lengths in Table 4. The laptop dataset has the highest percentage of aspects with more than two words, and the second-highest percentage of two-word aspects. This suggests that MGAN-F performs better on aspects with more words, making use of the word-level interactions to relieve the information loss that occurs in the coarse-grained attention mechanism.
(2) MGAN-CF is better than both MGAN-C and MGAN-F, which demonstrates that the coarse-grained and fine-grained attentions improve the performance from different perspectives. Compared with MGAN-CF, the complete MGAN model gains a further improvement by adding the aspect alignment loss, which is designed to capture the aspect-level interactions. Specifically, we report sentence-level statistics on the number of aspects per sentence in Table 5. Both the laptop and restaurant datasets have a relatively high percentage of sentences with multiple aspects.

Case Study
In order to demonstrate the effect of the aspect alignment loss, we visualize the attention weights of the C-Aspect2Context mechanism. Figure 2 shows the attention weights for two aspects, "resolution" and "fonts", whose sentiment polarities are positive and negative, respectively. From the top two bars, we can observe that C-Aspect2Context enforces the model to pay more attention to the important words with respect to each aspect. For example, for the aspect "resolution", the words "has", "higher" and "but" receive higher attention weights than the other words. In contrast, the aspect "fonts" pays more attention to the words "but", "fonts" and "small". We then evaluate the effect of the aspect alignment loss, which enhances the attention difference between the aspects "resolution" and "fonts". From the bottom two bars, we can see that the aspect "fonts" puts more attention on "small" and less attention on "higher" than the aspect "resolution" does. This shows that, under the constraint of the aspect alignment loss, C-Aspect2Context not only learns the important context words for each aspect, but also makes the attention gaps on the important words as large as possible for aspects with different polarities.

Figure 2: Attention visualizations for the aspects "resolution" and "fonts". The top two bars are from the C-Aspect2Context attention mechanism; the bottom two bars are from the C-Aspect2Context attention mechanism with the constraint of the aspect alignment loss.

Conclusion
In this paper, we propose a multi-grained attention network (MGAN) for aspect-level sentiment classification. Specifically, we propose a fine-grained attention mechanism, which is responsible for linking and fusing the words from the aspect and the context, to capture the word-level interactions, and we combine it with the coarse-grained attention mechanism to compose the MGAN model. In addition, we design an aspect alignment loss to characterize the aspect-level interactions among aspects that have the same context and different sentiment polarities, in order to explore extra valuable information. Experimental results demonstrate the effectiveness of our approach on three public datasets.