CAN: Constrained Attention Networks for Multi-Aspect Sentiment Analysis

Aspect level sentiment classification is a fine-grained sentiment analysis task. To detect the sentiment towards a particular aspect in a sentence, previous studies have developed various attention-based methods for generating aspect-specific sentence representations. However, the attention may inherently introduce noise and downgrade the performance. In this paper, we propose constrained attention networks (CAN), a simple yet effective solution, to regularize the attention for multi-aspect sentiment analysis, which alleviates the drawback of the attention mechanism. Specifically, we introduce orthogonal regularization on multiple aspects and sparse regularization on each single aspect. Experimental results on two public datasets demonstrate the effectiveness of our approach. We further extend our approach to multi-task settings and outperform the state-of-the-art methods.


Introduction
Sentiment analysis (Nasukawa and Yi, 2003; Liu, 2012), an important task in natural language understanding, has received much attention in recent years. Aspect level sentiment classification (ALSC) is a fine-grained sentiment analysis task, which aims at detecting the sentiment towards a particular aspect in a sentence. ALSC is especially critical for multi-aspect sentences, which contain multiple aspects. A multi-aspect sentence can be categorized as overlapping or non-overlapping. A sentence is annotated as non-overlapping if any two of its aspects have no overlap. Our study found that around 85% of the multi-aspect sentences in the two public datasets are non-overlapping. Figure 1 shows a simple example: the non-overlapping sentence "sometimes i get good food and ok service ." contains two aspects. The aspect food is on the left side of the aspect service, and their distributions over words are orthogonal to each other. Another observation is that only a few words relate to the opinion expression of each aspect. As shown in Figure 1, only the word "good" is relevant to the aspect food and "ok" to service; the distribution of the opinion expression of each aspect is sparse.

* This work was done when Mengting Hu was a research intern at IBM Research - China.
† Corresponding author.
To detect the sentiment towards a particular aspect, previous studies (Wang et al., 2016; Ma et al., 2017; Cheng et al., 2017; Ma et al., 2018) have developed various attention-based methods for generating aspect-specific sentence representations. In these works, the attention may inherently introduce noise and downgrade the performance, since the attention scatters across the whole sentence and is prone to attend to noisy words, or to opinion words of other aspects. Take Figure 1 as an example: for the aspect food, we visualize the attention weights from the model of Wang et al. (2016). Much of the attention focuses on the noisy word "sometimes", and on the opinion word "ok", which is relevant to the aspect service rather than food.
To alleviate the above issue, we propose a model for multi-aspect sentiment analysis, which regularizes the attention by handling multiple aspects of a sentence simultaneously. Specifically, we introduce orthogonal regularization for attention weights among multiple non-overlapping aspects. The orthogonal regularization tends to make the attention weights of multiple aspects concentrate on different parts of the sentence with less overlap. We also introduce sparse regularization, which tends to make the attention weights of each aspect concentrate on only a few words. We call our networks with such regularizations constrained attention networks (CAN). There have been works on introducing sparsity in attention weights in machine translation (Malaviya et al., 2018) and orthogonal constraints in domain adaptation (Bousmalis et al., 2016). In this paper, we add both sparse and orthogonal regularizations in a unified form inspired by Lin et al. (2017). The details will be introduced in Section 3.
In addition to aspect level sentiment classification (ALSC), aspect category detection (ACD) is another task of aspect based sentiment analysis. ACD (Zhou et al., 2015; Schouten et al., 2018) aims to identify the aspect categories discussed in a given sentence from a predefined set of aspect categories (e.g., price, food, service). In Figure 1, for example, the aspect categories food and service are mentioned. We introduce ACD as an auxiliary task to assist the ALSC task, benefiting from the shared context of the two tasks. We also apply our attention constraints to the ACD task. By applying attention weight constraints to both the ALSC and ACD tasks in an end-to-end network, we can further evaluate the effectiveness of CAN in multi-task settings.
In summary, the main contributions of our work are as follows: • We propose CAN for multi-aspect sentiment analysis. Specifically, we introduce orthogonal and sparse regularizations to constrain the attention weight allocation, helping learn better aspect-specific sentence representations.
• We extend CAN to multi-task settings by introducing ACD as an auxiliary task, and applying CAN on both ALSC and ACD tasks.
• Extensive experiments are conducted on public datasets. Results demonstrate the effectiveness of our approach for aspect level sentiment classification.

Related Work
Aspect level sentiment analysis is a fine-grained sentiment analysis task. Earlier methods are usually based on explicit features (Liu et al., 2010; Vo and Zhang, 2015). With the development of deep learning technologies, various neural attention mechanisms have been proposed to solve this fine-grained task (Wang et al., 2016; Ruder et al., 2016; Ma et al., 2017; Tay et al., 2017; Cheng et al., 2017; Chen et al., 2017; Tay et al., 2018; Ma et al., 2018; Wang and Lu, 2018). To name a few, Wang et al. (2016) propose an attention-based LSTM network for aspect level sentiment classification. Ma et al. (2017) use interactive attention networks to generate representations for targets and contexts separately. Cheng et al. (2017) and Ruder et al. (2016) both propose hierarchical neural network models for aspect level sentiment classification. Wang and Lu (2018) employ a segmentation-attention-based LSTM model for aspect level sentiment classification. All these works can be categorized as single-aspect sentiment analysis, which handles the aspects in a sentence separately, without considering the relationships between aspects.
More recently, a few works have been proposed to take the relationship among multiple aspects into consideration. Hazarika et al. (2018) make simultaneous classification of all aspects in a sentence using recurrent networks. Majumder et al. (2018) employ memory network to model the dependency of the target aspect with the other aspects in the sentence. Fan et al. (2018) design an aspect alignment loss to enhance the difference of the attention weights towards the aspects which have the same context and different sentiment polarities. In this paper, we introduce orthogonal regularization to constrain the attention weights of multiple non-overlapping aspects, as well as sparse regularization on each single aspect.
Multi-task learning (Caruana, 1997) solves multiple learning tasks at the same time, achieving improved performance by exploiting commonalities and differences across tasks. It has been used successfully in many machine learning applications. Huang and Zhong (2018) learn the main task and an auxiliary task jointly with shared representations, achieving improved performance in question answering. Toshniwal et al. (2017) use low-level auxiliary tasks for encoder-decoder based speech recognition, which suggests that the addition of auxiliary tasks can help in either optimization or generalization. Yu and Jiang (2016) use auxiliary tasks to learn a sentence embedding that works well across domains for sentiment classification. In this paper, we adopt the multi-task learning approach by using ACD as the auxiliary task to help the ALSC task.

Model
We first formulate the problem. There are N predefined aspect categories in the dataset, A = {A_1, ..., A_N}. Given a sentence S = {w_1, w_2, ..., w_L} which contains K aspects A^s = {A_1^s, ..., A_K^s}, A^s ⊆ A, multi-task learning simultaneously solves the ALSC and ACD tasks: the ALSC task predicts the sentiment polarity of each aspect A_k^s ∈ A^s, and the auxiliary ACD task checks each aspect A_n ∈ A to see whether the sentence S mentions it. Note that we only focus on aspect-category rather than aspect-term (Xue and Li, 2018) sentiment analysis in this paper.
We propose CAN for multi-aspect sentiment analysis, supporting both ALSC and ACD tasks by a multi-task learning framework. The network architecture is shown in Figure 2. We will introduce all components sequentially from left to right.

Embedding and LSTM Layers
Traditional single-aspect sentiment analysis handles each aspect separately, one at a time. In such settings, a sentence S with K aspects is copied to form K instances. For example, if a sentence S contains two aspects, A_1^s with polarity p_1 and A_2^s with polarity p_2, two instances, ⟨S, A_1^s, p_1⟩ and ⟨S, A_2^s, p_2⟩, will be constructed.
Our multi-aspect sentiment analysis method handles multiple aspects together and takes the single instance ⟨S, [A_1^s, A_2^s], [p_1, p_2]⟩ as input. The input sentence {w_1, w_2, ..., w_L} is first converted to a sequence of vectors {v_1, v_2, ..., v_L}, and the K aspects of the sentence are transformed to vectors {u_1^s, ..., u_K^s}, a subset of {u_1, ..., u_N}, the vectors of all aspect categories. The word embeddings of the sentence are then fed into an LSTM network (Hochreiter and Schmidhuber, 1997), which outputs hidden states H = {h_1, h_2, ..., h_L}. The sizes of the embedding and the hidden state are both set to d.
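The grouping of conventional single-aspect instances into one multi-aspect instance per sentence can be sketched as below. The helper and its names are illustrative only; the paper does not prescribe a preprocessing API.

```python
from collections import defaultdict

def group_instances(single_instances):
    """Merge single-aspect instances (sentence, aspect, polarity) into one
    multi-aspect instance per sentence: (sentence, [aspects], [polarities])."""
    grouped = defaultdict(lambda: ([], []))
    for sentence, aspect, polarity in single_instances:
        grouped[sentence][0].append(aspect)
        grouped[sentence][1].append(polarity)
    return [(s, aspects, pols) for s, (aspects, pols) in grouped.items()]

# the two single-aspect copies of the Figure 1 sentence...
singles = [
    ("sometimes i get good food and ok service .", "food", "positive"),
    ("sometimes i get good food and ok service .", "service", "neutral"),
]
# ...become one multi-aspect instance
multi = group_instances(singles)
```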

Task-Specific Attention Layer
The ALSC and ACD tasks share the hidden states from the LSTM layer, but compute their own attention weights separately. The attention weights are then used to compute aspect-specific sentence representations.
ALSC Attention Layer The key idea of aspect level sentiment classification is to learn different attention weights for different aspects, so that different aspects can concentrate on different parts of the sentence. We follow Bahdanau et al. (2015) to compute the attention. Given the sentence S with K aspects A^s = {A_1^s, ..., A_K^s}, the attention weights of each aspect A_k^s are calculated by:

α_k = softmax(z_a^T tanh(W_1^a H + W_2^a (u_k^s ⊗ e_L)))    (1)

where u_k^s is the embedding of the aspect A_k^s, e_L ∈ R^L is a vector of 1s, u_k^s ⊗ e_L is the operation repeatedly concatenating u_k^s for L times, and W_1^a ∈ R^{d×d}, W_2^a ∈ R^{d×d} and z_a ∈ R^d are the weight matrices.
ACD Attention Layer We treat the ACD task as a multi-label classification problem over the set of N aspect categories. For each aspect A_n ∈ A, its attention weights are calculated by:

β_n = softmax(z_b^T tanh(W_1^b H + W_2^b (u_n ⊗ e_L)))    (2)

where u_n is the embedding of the aspect A_n, and W_1^b ∈ R^{d×d}, W_2^b ∈ R^{d×d} and z_b ∈ R^d are the weight matrices. The ALSC and ACD tasks use the same attention mechanism, but they do not share parameters. The reason for using separate parameters is that, for the same aspect, the attention of ALSC concentrates more on opinion words, while that of ACD focuses more on aspect target terms (see the attention visualizations in Section 4.6).
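Both attention layers are instances of additive attention. Since the equation bodies were lost in extraction, the sketch below assumes the standard Bahdanau-style form, softmax over positions l of z · tanh(W1 h_l + W2 u), in pure Python with toy dimensions.

```python
import math

def additive_attention(H, u, W1, W2, z):
    """Attention weights over hidden states H (list of L vectors of size d)
    for aspect embedding u (size d): score_l = z . tanh(W1 h_l + W2 u),
    followed by a softmax over the L positions. Applying W2 to u once is
    equivalent to the u (x) e_L repetition, since the same aspect vector
    is used at every position."""
    def matvec(W, x):
        return [sum(w * xj for w, xj in zip(row, x)) for row in W]
    Wu = matvec(W2, u)  # shared across positions
    scores = []
    for h in H:
        Wh = matvec(W1, h)
        t = [math.tanh(a + b) for a, b in zip(Wh, Wu)]
        scores.append(sum(zi * ti for zi, ti in zip(z, t)))
    m = max(scores)  # numerically stabilized softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# toy example: L = 3 positions, d = 2, identity weights
H = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
u = [1.0, 0.0]
I2 = [[1.0, 0.0], [0.0, 1.0]]
alpha = additive_attention(H, u, I2, I2, [1.0, 1.0])
```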

Regularization Layer
We simultaneously handle multiple aspects by adding constraints to their attention weights. Note that this layer is only used in the training stage, in which the ground-truth aspects are known for calculating the regularization loss, which then influences parameter updating in back propagation. In the testing/inference stage, the true aspects are unknown, the regularization loss is not calculated, and this layer is omitted from the architecture.
In this paper, we introduce two types of regularizations: the sparse regularization on each single aspect, and the orthogonal regularization on multiple non-overlapping aspects.
Sparse Regularization For each aspect, the sparse regularization constrains the distribution of the attention weights (α_k or β_n) to concentrate on fewer words. For simplicity, we use α_k = {α_k1, α_k2, ..., α_kL} as an example. To make α_k sparse, the sparse regularization term is defined as:

R_s = |Σ_{l=1}^L α_kl^2 − 1|    (3)

where Σ_{l=1}^L α_kl = 1 and α_kl > 0. Since α_k is normalized as a probability distribution, its L1 norm is always equal to 1 (the sum of the probabilities) and thus does not work as a sparse regularizer in the usual way.
Minimizing Equation 3 forces the sparsity of α_k. It has a similar effect to minimizing the entropy of α_k, which places more probability mass on fewer words.
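The body of Equation 3 is garbled in this copy; a form consistent with the surrounding description (the L1 norm is constant, so the squared weights are pushed toward a one-hot distribution) is R_s = |Σ_l α_kl² − 1|, sketched below as an assumption.

```python
def sparse_reg(alpha):
    """Assumed form of Eq. (3): R_s = | sum_l alpha_l^2 - 1 |.
    Zero when alpha is one-hot; larger the more the probability
    mass is spread across positions."""
    return abs(sum(a * a for a in alpha) - 1.0)

# one-hot attention gives the minimum, uniform attention a large value
r_onehot = sparse_reg([0.0, 1.0, 0.0, 0.0])   # 0.0
r_uniform = sparse_reg([0.25, 0.25, 0.25, 0.25])  # |4 * 0.0625 - 1| = 0.75
```

Since the weights are normalized, Σ_l α_kl² ranges from 1/L (uniform) to 1 (one-hot), so minimizing R_s pushes the attention toward a few words.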
Orthogonal Regularization This regularization term forces orthogonality among the attention weight vectors of multiple aspects, so that different aspects attend to different parts of the sentence with less overlap. Note that we only apply this regularization to non-overlapping multi-aspect sentences. Assume that the sentence S contains K non-overlapping aspects {A_1^s, ..., A_K^s} with attention weight vectors {α_1, ..., α_K}. We pack them together as a two-dimensional attention matrix M ∈ R^{K×L} to calculate the orthogonal regularization term:
R_o = ‖M M^T − I‖_2    (4)

where I is an identity matrix. In M M^T, each non-diagonal element is the dot product between the attention weight vectors of two aspects; minimizing the non-diagonal elements forces orthogonality between the corresponding attention weight vectors. The diagonal elements of M M^T are subtracted by 1, which is the same as R_s defined in Equation 3. As a whole, R_o includes both the sparse and the orthogonal regularization terms. Note that in the ACD task, we do not pack all N attention vectors {β_1, ..., β_N} into a matrix. The sentence S contains K aspects. For simplicity, let {β_1, ..., β_K} be the attention vectors of the K mentioned aspects, and {β_{K+1}, ..., β_N} be the attention vectors of the N − K aspects not mentioned. We compute the average of these N − K attention vectors, denoted by β_avg, and then construct the attention matrix G = {β_1, ..., β_K, β_avg}, G ∈ R^{(K+1)×L}. The reason for calculating β_avg is that if an aspect is not mentioned in the sentence, its attention weights often attend to meaningless stop words, such as "to", "the", and "was". We do not need to distinguish among the N − K aspects not mentioned, so they can share the stop words in the sentence by being averaged as a whole, which keeps the K mentioned aspects away from such stop words.
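A minimal sketch of the orthogonality penalty, assuming the Gram-matrix form ‖M Mᵀ − I‖ of Lin et al. (2017): for M ∈ R^{K×L}, off-diagonal entries of M Mᵀ are dot products between the attention vectors of two aspects, and diagonal entries minus 1 reproduce the per-aspect sparse term.

```python
def orthogonal_reg(M):
    """Assumed form of the orthogonal penalty: || M M^T - I ||.
    M is a K x L list of attention rows (one per aspect)."""
    K = len(M)
    # G = M M^T (K x K): G[i][j] is the dot product of rows i and j
    G = [[sum(mi * mj for mi, mj in zip(M[i], M[j])) for j in range(K)]
         for i in range(K)]
    # Frobenius-style norm of (G - I)
    return sum((G[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(K) for j in range(K)) ** 0.5

# two one-hot, non-overlapping attention vectors: penalty is 0
M_disjoint = [[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]]
# two fully overlapping attention vectors: penalty is positive
M_overlap = [[1.0, 0.0, 0.0],
             [1.0, 0.0, 0.0]]
```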

Task-Specific Prediction Layer
Given the attention weights of each aspect, we can generate aspect-specific sentence representation, and then make prediction for the ALSC and ACD tasks respectively.
ALSC Prediction The weighted hidden state is combined with the last hidden state to generate the final aspect-specific sentence representation:

r_k^s = tanh(W_1^r h̄_k + W_2^r h_L)    (5)

where W_1^r ∈ R^{d×d}, W_2^r ∈ R^{d×d}, and h̄_k = Σ_{l=1}^L α_kl h_l is the weighted hidden state of aspect k. r_k^s is then used to predict the sentiment polarity:

ŷ_k^a = softmax((W_p^a)^T r_k^s + b_p^a)    (6)

where W_p^a ∈ R^{d×c} and b_p^a ∈ R^c are the parameters of the projection layer, and c is the number of sentiment classes.
For the sentence S with K aspects mentioned, we make K predictions simultaneously. That is why we call our approach multi-aspect sentiment analysis.
ACD Prediction We directly use the weighted hidden state as the sentence representation for ACD prediction:

r_n = Σ_{l=1}^L β_nl h_l    (7)

We do not combine it with the last hidden state h_L since the aspect may not be mentioned by the sentence. We make N predictions, one for each predefined aspect category:

ŷ_n^b = sigmoid((W_p^b)^T r_n + b_p^b)    (8)

where W_p^b ∈ R^{d×1} and b_p^b is a scalar.
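The per-category sigmoid prediction can be sketched as below; the weight vector `w` stands in for the projection parameters, and the names and threshold are illustrative assumptions.

```python
import math

def acd_predict(r_vectors, w, b, threshold=0.5):
    """Multi-label ACD prediction: for each aspect category's
    representation r_n, compute sigmoid(w . r_n + b); a category is
    predicted as mentioned when its probability exceeds the threshold."""
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))
    probs = [sigmoid(sum(wi * ri for wi, ri in zip(w, r)) + b)
             for r in r_vectors]
    return probs, [p > threshold for p in probs]

# toy example: two categories, d = 2
probs, preds = acd_predict([[1.0, 2.0], [-3.0, 0.5]], [0.5, -0.25], 0.0)
```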

Loss
For the ALSC task, the loss function over the K aspects of the sentence S is defined by:

L_a = −Σ_{k=1}^K Σ_{j=1}^c y_kj^a log ŷ_kj^a    (9)

where c is the number of sentiment classes. For the ACD task, as each prediction is a binary classification problem, the loss function over the N aspect categories of the sentence S is defined by:

L_b = −Σ_{n=1}^N [y_n^b log ŷ_n^b + (1 − y_n^b) log(1 − ŷ_n^b)]    (10)

We jointly train our model on the two tasks. The parameters are trained by minimizing the combined loss function:

L = L_a + L_b / N + λR    (11)

where R is the regularization term mentioned previously, which can be R_s or R_o, and λ is the hyperparameter tuning the impact of the regularization loss on the overall loss. To prevent L_b from overwhelming the overall loss, we divide it by the number of aspect categories N.
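The combined objective, cross-entropy over the K mentioned aspects plus binary cross-entropy over all N categories divided by N plus the weighted regularizer, can be sketched as follows; the toy probabilities stand in for model outputs.

```python
import math

def combined_loss(y_alsc, p_alsc, y_acd, p_acd, R, lam=0.1):
    """L = L_a + L_b / N + lambda * R.
    y_alsc: gold class index per mentioned aspect (K entries);
    p_alsc: predicted class distribution per mentioned aspect;
    y_acd:  0/1 mention label per aspect category (N entries);
    p_acd:  predicted mention probability per aspect category;
    R:      attention regularization term (R_s or R_o)."""
    L_a = -sum(math.log(p[y]) for y, p in zip(y_alsc, p_alsc))
    N = len(y_acd)
    L_b = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
               for y, p in zip(y_acd, p_acd))
    return L_a + L_b / N + lam * R

# toy example: K = 2 mentioned aspects, N = 5 categories, c = 3 classes
loss = combined_loss(
    y_alsc=[0, 2],
    p_alsc=[[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]],
    y_acd=[1, 0, 1, 0, 0],
    p_acd=[0.9, 0.1, 0.8, 0.2, 0.3],
    R=0.5,
)
```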

Datasets
We conduct experiments on two public datasets from SemEval 2014 task 4 (Pontiki et al., 2014) and SemEval 2015 task 12 (Pontiki et al., 2015), denoted by Rest14 and Rest15 respectively. These two datasets consist of restaurant customer reviews with annotations identifying the mentioned aspects and the sentiment polarity of each aspect. To apply orthogonal regularization, we manually annotate the multi-aspect sentences as overlapping or non-overlapping. We randomly split the original training set into training and validation sets in a 5:1 ratio, where the validation set is used to select the best model. We count the single-aspect and multi-aspect sentences separately. Detailed statistics are summarized in Table 1. In particular, 85.23% and 83.73% of the multi-aspect sentences are non-overlapping in Rest14 and Rest15, respectively.

Comparison Methods
Since we focus on aspect-category sentiment analysis, many works (Ma et al., 2017; Fan et al., 2018) which focus on aspect-term sentiment analysis are excluded.
• LSTM: We implement the vanilla LSTM to model the sentence and use the average of all hidden states as the sentence representation. In this model, aspect information is not used.

Table 2: Results of the ALSC task in single-task settings in terms of accuracy (%) and Macro-F1 (%).
• AT-LSTM (Wang et al., 2016): It adopts the attention mechanism in LSTM to generate a weighted representation of a sentence. The aspect embedding is used to compute the attention weights as in Equation 1. Unlike the original work (Wang et al., 2016), we do not concatenate the aspect embedding to the hidden state, which gains a small performance improvement. We use this modified version in all experiments.
• ATAE-LSTM (Wang et al., 2016): This method is an extension of AT-LSTM. In this model, the aspect embedding is concatenated to each word embedding of the sentence as the input to the LSTM layer.
• GCAE (Xue and Li, 2018): This state-of-the-art method is based on a convolutional neural network with gating mechanisms, and handles both aspect-category and aspect-term sentiment analysis. We compare with its aspect-category sentiment analysis variant.

Our Methods
To verify the performance gain of introducing constraints on attention weights, we first create several variants of our model for single-task settings.
• AT-CAN-R_s: Add sparse regularization R_s to AT-LSTM to constrain the attention weights of each single aspect.
• AT-CAN-R_o: Add orthogonal regularization R_o to AT-CAN-R_s to constrain the attention weights of multiple non-overlapping aspects.
We then extend the attention constraints to multi-task settings, creating variants with different options: 1) no constraints, 2) adding regularizations only to the ALSC task, and 3) adding regularizations to both tasks.
• M-AT-LSTM: This is the basic multi-task model without regularizations.
• M-CAN-R_s: Add R_s to the ALSC task in M-AT-LSTM.
• M-CAN-R_o: Add R_o to the ALSC task in M-AT-LSTM.
• M-CAN-2R_s: Add R_s to both the ALSC and ACD tasks.
• M-CAN-2R_o: Add R_o to both the ALSC and ACD tasks.

Implementation Details
We set λ = 0.1 with the help of the validation set. All models are optimized by the Adagrad optimizer (Duchi et al., 2011) with learning rate 0.01. The batch size is 25. We apply a dropout of p = 0.7 after the embedding and LSTM layers. All words in the sentences are initialized with 300-dimensional GloVe embeddings (Pennington et al., 2014). The aspect embedding matrix and the other parameters are initialized by sampling from a uniform distribution U(−ε, ε), ε = 0.01. d is set to 300. The models are trained for 100 epochs, during which the model with the best performance on the validation set is saved. We also apply early stopping: training stops if the performance on the validation set does not improve for 10 epochs.

Table 3: Results of the ALSC task in multi-task settings in terms of accuracy (%) and Macro-F1 (%).
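The model-selection and early-stopping rule described above can be sketched as follows; `evaluate` is a hypothetical callback (not part of the paper) that runs one training epoch and returns the validation score.

```python
def train_with_early_stopping(evaluate, max_epochs=100, patience=10):
    """Track the epoch with the best validation score; stop once the
    score has not improved for `patience` consecutive epochs.
    `evaluate(epoch)` is an assumed callback that trains one epoch and
    returns the validation score."""
    best_score, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        score = evaluate(epoch)
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_score

# toy run: validation score peaks at epoch 2, then plateaus
val_scores = [0.1, 0.2, 0.3] + [0.25] * 15
best_epoch, best_score = train_with_early_stopping(lambda e: val_scores[e])
```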

Table 4: Results of the ACD task. Rest14 has 5 aspect categories while Rest15 has 13.

Results
Tables 2 and 3 show our experimental results on the two public datasets for single-task and multi-task settings respectively. In both tables, "3-way" stands for 3-class classification (positive, neutral, and negative), and "Binary" for binary classification (positive and negative). The best scores are marked in bold.

Single-task Settings

Table 2 shows our experimental results of ALSC in single-task settings. First, we observe that by introducing attention regularizations (either R_s or R_o), most of our proposed methods outperform their counterparts. In particular, AT-CAN-R_s and AT-CAN-R_o outperform AT-LSTM in all results; ATAE-CAN-R_s and ATAE-CAN-R_o also outperform ATAE-LSTM in 15 of 16 results. For example, on the Rest15 dataset, ATAE-CAN-R_o outperforms ATAE-LSTM by up to 5.39% of accuracy and 6.46% of the F1 score in the 3-way classification. Second, regularization R_o achieves a larger performance improvement than R_s in all results, because R_o includes both the orthogonal and sparse regularizations for non-overlapping multi-aspect sentences. Third, our approaches, especially ATAE-CAN-R_o, outperform the state-of-the-art baseline model GCAE. Finally, the LSTM method yields the worst results in all cases, because it cannot distinguish different aspects.

Multi-task Settings Table 3 shows experimental results of ALSC in multi-task settings. We first observe that the overall results in multi-task settings outperform those in single-task settings, which demonstrates the effectiveness of multi-task learning by introducing the auxiliary ACD task to help the ALSC task. Second, in almost all cases, applying attention regularizations to both tasks gains more performance improvement than applying them only to the ALSC task, which shows that our attention regularization approach can be extended to different tasks that involve aspect level attention weights, and works well in multi-task settings. For example, for the Binary classification on the Rest15 dataset, M-AT-LSTM outperforms AT-LSTM by 3.57% of accuracy and 4.96% of the F1 score, and M-CAN-2R_o further outperforms M-AT-LSTM by 3.28% of accuracy and 4.0% of the F1 score. Table 4 shows the results of the ACD task in multi-task settings. Our proposed regularization terms can also improve the performance of ACD, and regularization R_o achieves the best performance in almost all metrics.

Attention Visualizations

As shown in Figure 3(a), the attention weights of the aspects food and service are both high on the words "outstanding", "and", and "the", but actually, the word "outstanding" describes the aspect food rather than service. The same situation occurs with the word "tops", which should associate with service rather than food. The attention mechanism alone is not good enough to locate aspect-specific opinion words and generate aspect-specific sentence representations in the ALSC task. As shown in subfigure (b), the issue is mitigated in M-AT-LSTM: multi-task learning can learn better hidden states of the sentence and better aspect embeddings. However, it is still not good enough. For instance, the attention weights on the word "tops" are high for both aspects, and the weights overlap in the middle part of the sentence.

As shown in subfigure (c), M-CAN-2R_o generates the best attention weights. The attention weights of the aspect food are almost orthogonal to those of service: food concentrates on the first part of the sentence while service on the second part. Meanwhile, the key opinion words "outstanding" and "tops" get the highest attention weights for their corresponding aspects.
We also visualize the attention for the auxiliary task ACD. Figure 4 depicts the attention weights from the method M-CAN-2R o . There are five predefined aspect categories (food, ambience, price, anecdotes/miscellaneous, service) in the dataset, two of which are mentioned in the sentence. In the ACD task, we need to calculate the attention weights for all the five aspect categories, and then generate aspect-specific sentence representations to determine whether the sentence contains each aspect. As shown in Figure 4, attention weights for aspects food and service are pretty good. The aspect food concentrates on words "food" and "outstanding", and the aspect service focuses on the word "service". It is interesting that for aspects which are not mentioned in the sentence, their attention weights often attend to meaningless stop words, such as "the", "was", etc. We do not distinguish these aspects and just treat them as a whole.
We plot the regularization loss curves in Figure 5, which shows that both R_s and R_o decrease during the training of AT-CAN-R_o.

Case Studies
Overlapping Case We only add sparse regularization to overlapping sentences, in which multiple aspects share the same opinion snippet. As shown in Figure 6, the sentence "I was highly disappointed by their service and food." contains two aspects, food and service, both described by the opinion snippet "highly disappointed". Our method can locate the aspect terms and the shared opinion words for both aspects, and then classify the sentiment correctly.

Figure 6: Examples of the overlapping case ("I was highly disappointed by their service and food.") and the error cases ("But dinner here is never disappointing, even if the prices are a bit over the top." and "A thai restaurant out of rice during dinner ?"). The a/m is short for anecdotes/miscellaneous.
Error Case With the help of attention visualization, we can conveniently conduct error analysis of our model. As shown in Figure 6, for the first sentence of the error case, the aspect food attends to the right word "disappointing", but fails to include the negation word "never". This may be caused by an inaccurate sentence representation or aspect embedding; our regularizations cannot rebuild the connection between the aspect and the negation word. The second sentence is negative but classified as neutral. Its attention weights distribute evenly since the sentence does not contain any explicit opinion words. As there is no signal indicating which words to concentrate on, our model cannot adjust the attention weights and does not help in such cases.

Conclusions
We propose constrained attention networks for multi-aspect sentiment analysis, which handles multiple aspects of a sentence simultaneously. Specifically, we introduce orthogonal and sparse regularizations on attention weights. Furthermore, we introduce an auxiliary task ACD for promoting the ALSC task, and apply CAN on both tasks. Experimental results demonstrate that our approach outperforms the state-of-the-art methods.