Capsule Network with Interactive Attention for Aspect-Level Sentiment Classification

Aspect-level sentiment classification is a crucial task for sentiment analysis, which aims to identify the sentiment polarities of specific targets in their context. The main challenge comes from multi-aspect sentences, which express multiple sentiment polarities towards different targets, resulting in overlapped feature representations. However, most existing neural models rely on a static pooling operation or an attention mechanism to identify sentimental words, and are therefore insufficient for dealing with overlapped features. To solve this problem, we propose to utilize a capsule network to construct vector-based feature representations and cluster features by an EM routing algorithm. Furthermore, an interactive attention mechanism is introduced in the capsule routing procedure to model the semantic relationship between aspect terms and context. The iterative routing also enables encoding the sentence from a global perspective. Experimental results on three datasets show that our proposed model achieves state-of-the-art performance.


Introduction
Aspect-level sentiment classification is a fine-grained task in the field of sentiment analysis (Pang and Lee, 2008; Liu, 2012), which aims to infer the sentiment polarity (e.g., positive, neutral, negative) of a sentence with respect to a given aspect. It requires differentiating the sentiments towards different targets when there are multiple targets in one sentence. For example, consider the aspect terms {food, price, drinks} and the sentence "The food was definitely good, but when all was said and done, I just could not justify it for the price including 2 drinks, $100/person." For the aspect term food, the sentimental polarity is positive, but for the aspect term price, the polarity is negative, while for the aspect term drinks, the polarity is neutral. Recently, with the development of deep learning techniques, various neural networks have been designed for this task and have obtained promising results (Wang et al., 2016a; Ma et al., 2017; Fan et al., 2018).
The main challenge in aspect-level sentiment classification is that one sentence can express multiple sentiment polarities, resulting in overlapped feature representations. In the example above, the sentence simultaneously reviews 'food', 'price', and 'drinks', and expresses three different sentiment polarities. Such highly overlapped features can seriously confuse the classifier. However, most existing methods either keep only the most active feature through a max-pooling operation or use an attention mechanism to find the sentimental words, and both approaches fail to distinguish the overlapped features.
Therefore, we propose a novel capsule network with an iterative EM routing method and interactive attention (IACapsNet) to solve this problem. A capsule network (Hinton et al., 2011; Sabour et al., 2017; Hinton et al., 2018) constructs vector-based feature representations. Capsules in adjacent layers are connected by dynamic routing, which shows strengths in distinguishing overlapped features by feature clustering (Sabour et al., 2017; Zhang et al., 2019). In the aspect-level sentiment classification task, the vector-based overlapped sentimental features towards different aspect terms are clustered by an Expectation-Maximization (EM) routing algorithm, which makes the subsequent classification clearer. Furthermore, we devise an interactive attention-based routing mechanism in order to highlight word-level differences and model the semantic relationship between aspect terms and context. Moreover, our iterative routing mechanism can be viewed as a top-down attention mechanism, which is more effective than the standard attention mechanism because of its global perspective. The standard attention mechanism in this task considers only a part of the context information in a sentence without considering the overall meaning conveyed by the sentence, which may introduce noise and degrade the prediction accuracy, especially for complex sentences. For example, in an ironic statement for the aspect term "mac os": "Maybe the mac os improvement were not the product they want to offer.", the standard attention mechanism will highlight the sentimental word 'improvement' and mislead the classifier into predicting a positive polarity. Our routing mechanism can tackle this by adjusting the contribution of each low-level capsule based on the high-level capsules (the overall representation), and the iterative update makes the overall representation more accurate than that of other similar top-down attention methods (Liu et al., 2018).
Our proposed model (IACapsNet) is evaluated on three datasets: the laptop and restaurant datasets from SemEval 2014 Task 4 and a Twitter collection. The experimental results show that our model outperforms other baseline methods and achieves state-of-the-art performance. Our contributions are summarized as follows:
• We apply a capsule network to aspect-level sentiment classification to tackle the overlapped features by feature clustering. To the best of our knowledge, no prior work has investigated the performance of capsule networks in this task.
• An interactive attention mechanism is introduced in the capsule routing to help model the semantic relationship between aspect term and context.

Related Work

Aspect Level Sentiment Classification
Traditional approaches have designed rich features about content and syntactic structures to capture the sentiment polarity (Jiang et al., 2011; Pérez-Rosas et al., 2012). However, these feature-based methods are labor-intensive, and their performance depends heavily on the quality of the features. Recently, deep learning methods have become popular for aspect-level sentiment classification. Recurrent Neural Networks (RNNs) are the most commonly used technique for this task (Tang et al., 2016a). The attention mechanism was further introduced to model the target-context association (Wang et al., 2016b; Ma et al., 2017). Furthermore, Fan et al. (2018) proposed MGAN to integrate fine-grained attention mechanisms, which are employed to characterize the word-level interactions between aspect and context words. Very recently, CNN-based models have shown strengths in efficiency for aspect-level sentiment classification (Xue and Li, 2018; Huang and Carley, 2018). However, all the previous methods utilize a static pooling operation or an attention mechanism to locate the sentimental words, which fails to handle the overlapped features. We introduce vector-based feature representation and feature clustering to address this.

Capsule Network
Capsule networks were proposed to overcome the representational limitations of CNNs and RNNs by extracting features in the form of vectors. The technique was first proposed by Hinton et al. (2011) and improved by Sabour et al. (2017) and Hinton et al. (2018), and was mainly devised for the image processing domain. Introducing capsules allows us to use a routing mechanism instead of a pooling operation to generate high-level features, which is a more efficient way of encoding features. Routing-by-agreement is able to cluster features in an iterative way, and has achieved impressive performance in recognizing highly overlapped digits. Several types of capsule networks have been proposed for natural language processing. Yang et al. (2018) investigated capsule networks for text classification. They also found that capsule networks exhibit significant improvement when transferring from single-label to multi-label text classification. A similar property has also been observed in the task of relation extraction (Zhang et al., 2019). However, interactive word-level attention is not considered in these typical capsule routing methods.

sentence over a specific aspect term, where $y \in \{\text{positive}, \text{negative}, \text{neutral}\}$. The overall architecture is shown in Figure 1. It consists of the input embedding layer, the bidirectional RNN layer, the primary capsule layer, and the output layer.

Input Embedding Layer
The context's input representations of IACapsNet include word embeddings $w_n$ and position embeddings $p_n$. The aspect term's input representation consists only of the word embedding $w^a_n$. A word embedding is a distributed representation of a word, where words from the vocabulary are mapped to vectors. Initializing word vectors with pre-trained word vectors can improve performance because they capture syntactic and semantic information about words from large-scale unlabeled text. In our model, we employ the pre-trained GloVe word vectors (Pennington et al., 2014) to obtain the fixed word embeddings $w_n, w^a_n \in \mathbb{R}^{d_w}$, where $d_w$ is the word vector dimension.
Considering that context words closer to an aspect may have a higher influence on the sentiment analysis, we introduce a position embedding to encode the relative distance $r_n$ from word $w_n$ to the aspect term. We define the position embedding matrix $P \in \mathbb{R}^{d_p \times N}$, which is randomly initialized and updated during the training process. Here, $d_p$ is the position embedding dimension and $N$ denotes the length of the sentence. The position embedding $p_n$ of a word is obtained by looking up the position embedding matrix $P$ using $r_n$.
The input representation for each context word is the concatenation of its word embedding and position embedding: $x_n = [w_n; p_n] \in \mathbb{R}^{d_w + d_p}$.
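The lookup and concatenation described above can be illustrated with a minimal NumPy sketch. The function name `build_inputs`, the inclusive aspect span arguments, and the choice to measure $r_n$ as the distance to the nearest aspect token are assumptions made for illustration; the text does not specify how $r_n$ is computed for multi-word aspects. The table is stored as $(N, d_p)$, i.e. the transpose of $P$, so that a row lookup returns $p_n$.

```python
import numpy as np

def build_inputs(word_vecs, pos_table, aspect_start, aspect_end):
    """Concatenate word and position embeddings for each context word.

    word_vecs    : (N, d_w) array of word embeddings w_n
    pos_table    : (N, d_p) position table (transpose of P); row r is the
                   embedding for relative distance r
    aspect_start, aspect_end : inclusive token span of the aspect term
                   (hypothetical arguments, not named in the paper)
    """
    n = word_vecs.shape[0]
    idx = np.arange(n)
    # Assumed definition of r_n: 0 inside the aspect span, otherwise the
    # distance to the nearest aspect token.
    dist = np.where(idx < aspect_start, aspect_start - idx,
                    np.where(idx > aspect_end, idx - aspect_end, 0))
    p = pos_table[dist]                              # (N, d_p) lookup of p_n
    return np.concatenate([word_vecs, p], axis=1)    # x_n = [w_n; p_n]
```

For a five-word sentence whose aspect occupies tokens 2 to 3, the distances are 2, 1, 0, 0, 1, so the first word receives the position vector for distance 2.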

Bidirectional Recurrent Networks Layer
Recurrent neural networks can capture long-distance dependencies within a sentence. A bidirectional recurrent neural network is the first layer of IACapsNet. The forward direction captures the left context $h^l$ of a word and the backward direction captures the right context $h^r$. We concatenate the left context and the right context to form the contextualized word representations $h^c_n, h^a_n \in \mathbb{R}^{2 \times d_l}$, where $d_l$ is the dimension of the hidden state, and $h^a$ and $h^c$ are the word representations for the aspect term and the context, respectively.

Primary Capsule Layer
A primary capsule is a group of neurons obtained from the output of the convolutional operation performed on $h^a_n$ and $h^c_n$. The output of a capsule is therefore a vector representing different properties of the same object. In the aspect-level sentiment classification task, these properties may include the sentiment and aspect term features.
The EM-based routing method (Hinton et al., 2018) is implemented in our model. In addition to the high-dimensional output $M$, each capsule has an activation probability $a$, which plays a role similar to the activity of a unit in a standard neural net (shown in Figure 1).

Interactive Attention EM Routing
We have already decided on the outputs of all the capsules $\Omega_L$ in the primary capsule layer, and we now want to decide which capsules $\Omega_{L+1}$ to activate in the layer above and how to assign each active low-level capsule to one active higher-level capsule.
The vector-based features are clustered in the high-level capsules by an EM-based algorithm in which the outputs of the high-level capsules play the role of Gaussians and the output vectors of the low-level capsules play the role of the datapoints. The means, variances, and activation probabilities of the output capsules, as well as the assignment probabilities $R$ of the input capsules, are iteratively updated by alternating between an E-step and an M-step. This can also be viewed as a parallel attention mechanism in the opposite direction, which adjusts each low-level word's contribution based on the global representation of the sentence. Moreover, in order to model the semantic relationship between the aspect term and the context, we further devise an interactive attention-based routing mechanism.

M-step
The M-step holds the assignment probabilities R constant and adjusts each Gaussian (i.e. high-level capsules) to maximize the sum of the weighted log probabilities that the Gaussian would generate the datapoints (i.e. low-level capsules) assigned to it. This procedure aims to obtain the overall representation among the low-level capsules for a given iteration.
First, every primary capsule $i$ is transformed by $W_{ij}$ to cast a vote $V_{ij} = M_i W_{ij}$ for the output of high-level capsule $j$. We then obtain the mean $\mu_j$ of the votes from the input capsules and the variance $\sigma_j$ about that mean for each dimension $h$ (Eq. 1 and 2), where $\mu^h_j$ is the $h$-th component of capsule $j$'s vectorized output $M_j$.
The activation probability $a_j$ of capsule $j$ is calculated by Eq. 4, where $\beta_u$ and $\beta_\alpha$ are trainable parameters denoting, respectively, the fixed cost per input capsule when not activating capsule $j$ and the fixed cost for coding the mean and variance of capsule $j$ when activating it. The variance $\sigma$ reflects the degree of agreement. Intuitively, if the votes from the low-level capsules do not agree on a high-level capsule, the activation of that high-level capsule should be low. $\lambda$ is an inverse temperature parameter set to 1e−3 with a fixed schedule.
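As a rough illustration of the M-step, the sketch below follows the standard EM routing update of Hinton et al. (2018), which this section builds on. It is an assumption that the paper's equations match this form exactly. The function name `m_step`, the argument `beta_a` (standing in for the activation cost parameter), and the small constant `eps` are illustrative choices, and the sketch does not include the interactive attention weighting.

```python
import numpy as np

def m_step(R, a_in, V, beta_u, beta_a, lam, eps=1e-8):
    """One M-step of EM routing (standard form, assumed here).

    R    : (I, J) assignment probabilities
    a_in : (I,)   activation probabilities of the low-level capsules
    V    : (I, J, H) votes V_ij
    Returns mu (J, H), var (J, H) and a_out (J,).
    """
    Rw = R * a_in[:, None]                     # weight assignments by input activation
    R_sum = Rw.sum(axis=0)                     # (J,) total weight per high-level capsule
    # Weighted mean and variance of the votes, per dimension h.
    mu = (Rw[:, :, None] * V).sum(axis=0) / (R_sum[:, None] + eps)
    var = (Rw[:, :, None] * (V - mu[None]) ** 2).sum(axis=0) / (R_sum[:, None] + eps) + eps
    # Cost of describing the assigned votes; log(sigma) = 0.5 * log(var).
    cost = (beta_u + 0.5 * np.log(var)) * R_sum[:, None]
    # Logistic of the scaled difference between the two costs.
    a_out = 1.0 / (1.0 + np.exp(-lam * (beta_a - cost.sum(axis=1))))
    return mu, var, a_out
```

In this form, a high-level capsule whose incoming votes agree closely has a small variance, hence a low cost and a higher activation, which matches the intuition given above.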

E-step
The E-step adjusts the assignment probabilities $R$ of each datapoint (i.e. low-level capsule) to the Gaussians (i.e. high-level capsules). This procedure aims to adjust the contribution of each capsule based on the high-level capsule (i.e. the overall representation) for a given iteration. We first compute the negative log probability density of the vectorized vote under capsule $j$'s Gaussian distribution (Eq. 5). For capsule $i$ in the primary capsule layer, the assignment probability $R_{ij}$ is then adjusted accordingly (Eq. 6). Alternating the E-step and the M-step routes the output of each capsule to a capsule in the layer above that receives a cluster of high-dimensional features.
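The E-step can be sketched in the same hedged way. The code assumes the standard EM routing form of Hinton et al. (2018): a diagonal Gaussian density for each vote, followed by a normalization over the high-level capsules weighted by their activations. The function name `e_step` is illustrative, and the computation is done in log space purely for numerical stability.

```python
import numpy as np

def e_step(mu, var, a_out, V):
    """One E-step of EM routing (standard form, assumed here).

    mu, var : (J, H) mean and variance of each high-level capsule
    a_out   : (J,)   activation probabilities of the high-level capsules
    V       : (I, J, H) votes V_ij
    Returns R : (I, J), with each row summing to 1.
    """
    # Log density of vote V_ij under capsule j's diagonal Gaussian.
    log_p = -0.5 * (np.log(2 * np.pi * var)[None]
                    + (V - mu[None]) ** 2 / var[None]).sum(axis=2)
    logits = np.log(a_out)[None, :] + log_p       # log(a_j * p_ij)
    logits -= logits.max(axis=1, keepdims=True)   # stabilise the normalisation
    R = np.exp(logits)
    return R / R.sum(axis=1, keepdims=True)       # R_ij = a_j p_ij / sum_k a_k p_ik
```

A low-level capsule whose vote lies close to the mean of one high-level capsule and far from the others is assigned almost entirely to that capsule.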
Algorithm 1 Interactive attention EM routing. Capsule $i$ and $j$ denote a low-level and a high-level capsule. $\Omega_L$ and $\Omega_{L+1}$ denote the sets of low-level and high-level capsules, respectively.
    ∀ capsule $j$: compute $\mu_j$ and $\sigma_j$ by Eq. 1 and 2
    ∀ capsule $j$: compute $a_j$ by Eq. 4
end procedure
procedure E-STEP($\mu_j$, $\sigma_j$, $a_j$, $V_{ij}$)
    ∀ capsule $j$: compute $p_j$ and update $R_{ij}$ by Eq. 5 and 6
end procedure

Interactive Attention
The overlapped features in the primary capsule layer can be routed and clustered into high-level capsules in an iterative way. However, purely alternating the E-step and the M-step ignores the relationship between the context and the aspect term. It has been demonstrated that the target and the context can determine the representation of each other, and that coordinating targets with their contexts can remarkably enhance the performance of sentiment classification (Ma et al., 2017; Fan et al., 2018). Word-level attention, which has already been demonstrated to be essential, is also ignored. So, apart from the assignment probabilities, we further introduce an interactive attention weight $\alpha$, which is learned interactively between the context and the aspect term.
Specifically, we implement scaled dot-product attention, which can be described as mapping a query and a set of key-value pairs to a weight on each word-level token. The queries are the averaged representations of the context $h^c$ and the aspect term $h^a$, which are transformed to dimension $d_k$ by trainable parameters. The keys are the corresponding words' representations from the aspect term and the context, which are also transformed to dimension $d_k$:

$$k^a_n = h^a_n W^k_a \in \mathbb{R}^{d_k},$$

where $W^q_c, W^q_a, W^k_c, W^k_a \in \mathbb{R}^{2d_l \times d_k}$. The attention weights can be computed as follows:

$$s^{c \to a}_n(q^c, k^a_n) = k^a_n \, (q^c)^{\top},$$

$$\alpha^{c \to a}_n = \frac{\exp\left(s^{c \to a}_n(q^c, k^a_n)\right)}{\sum_{n'=1}^{N} \exp\left(s^{c \to a}_{n'}(q^c, k^a_{n'})\right)}.$$

In order to keep $s_n$ at a proper magnitude and avoid pushing the softmax function into regions where it has extremely small gradients, we introduce the scaling factor $1/\sqrt{d_k}$ following Vaswani et al. (2017). We can thus obtain the word-level significance of each token in the context and the aspect term in an interactive way, which is then applied to each primary capsule's activation probability $a$. We detail the whole routing algorithm in Algorithm 1.
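The context-to-aspect direction of this attention can be illustrated with a short NumPy sketch. The function name `context_to_aspect_attention` and the argument names are hypothetical. Applying the $1/\sqrt{d_k}$ factor directly to the scores before the softmax is an assumption based on Vaswani et al. (2017), since the text does not state exactly where the factor enters.

```python
import numpy as np

def context_to_aspect_attention(h_c, h_a, W_q, W_k):
    """Attention weights over aspect tokens, queried by the averaged context.

    h_c : (N, 2*d_l) context hidden states
    h_a : (M, 2*d_l) aspect-term hidden states
    W_q : (2*d_l, d_k) query projection for the context
    W_k : (2*d_l, d_k) key projection for the aspect term
    Returns a length-M vector of weights that sums to 1.
    """
    d_k = W_k.shape[1]
    q_c = h_c.mean(axis=0) @ W_q         # (d_k,) averaged context as the query
    k_a = h_a @ W_k                      # (M, d_k) one key per aspect token
    s = (k_a @ q_c) / np.sqrt(d_k)       # scaled dot-product scores (assumed placement)
    s = s - s.max()                      # stabilise the softmax
    w = np.exp(s)
    return w / w.sum()
```

The aspect-to-context direction would be obtained symmetrically, by averaging the aspect states for the query and using the context states as keys.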

Training Objective
In IACapsNet, each top-level capsule corresponds to a sentiment category. The activation probability $a$ of each top-level capsule represents the probability that the input sentence belongs to the corresponding category. We use a spread margin loss $L_k$ for each top-level capsule $k$ to directly maximize the gap between the activation of the target class ($a_t$) and the activation of the other classes. The total loss $L$ is simply the sum of the losses of all top-level capsules, where $m$ is the margin, which we set to 0.9 with a fixed schedule.
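For concreteness, the sketch below implements the spread loss as defined by Hinton et al. (2018), $L_k = \max(0, m - (a_t - a_k))^2$ summed over the non-target classes. It is an assumption that the paper uses exactly this form. The function name `spread_loss` is illustrative.

```python
import numpy as np

def spread_loss(a, target, m=0.9):
    """Spread margin loss over top-level capsule activations.

    a      : (K,) activation probabilities of the top-level capsules
    target : index t of the gold sentiment class
    m      : margin (0.9 in the text)
    """
    a = np.asarray(a, dtype=float)
    # Penalise every class whose activation comes within m of the target's.
    losses = np.maximum(0.0, m - (a[target] - a)) ** 2
    losses[target] = 0.0        # the target class itself contributes no term
    return float(losses.sum())
```

With activations (0.5, 0.5, 0.0) and the first class as target, the two non-target terms are $0.9^2$ and $0.4^2$, giving a loss of 0.97.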

Experiments
In this section, we conduct extensive experiments on three datasets to evaluate the proposed IACapsNet.

Experimental Setup
The experiments are conducted on three datasets. The first two datasets are from SemEval 2014 Task 4 (Pontiki et al., 2014) and contain reviews about laptops and restaurants, respectively. The third one is a Twitter dataset collected by Dong et al. (2014). The statistics of these datasets are listed in Table 2. Following Tang et al. (2016c), the conflict category is removed from the SemEval 2014 datasets to avoid making the datasets unbalanced. Sentences are zero-padded to the length of the longest sentence in the respective dataset. Results are measured by accuracy and macro-averaged F1 score.
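The two evaluation metrics can be computed as in the following sketch, which assumes integer class labels 0, 1, 2 for the three polarities and assigns an F1 of 0 to a class that has no predicted and no gold instances. The function name `accuracy_and_macro_f1` is illustrative.

```python
import numpy as np

def accuracy_and_macro_f1(y_true, y_pred, num_classes=3):
    """Return (accuracy, macro-averaged F1) for integer class labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    acc = float((y_true == y_pred).mean())
    f1s = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1s.append(2 * tp / denom if denom > 0 else 0.0)
    return acc, float(np.mean(f1s))
```

Macro averaging gives each polarity the same weight regardless of its frequency, so it is more sensitive than accuracy to errors on the smaller classes.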
In our experiments, the pre-trained GloVe (Pennington et al., 2014) is used to initialize the word embeddings from context and aspect term. The

Model Comparisons
IACapsNet is compared with the following methods:
ATAE-LSTM (Wang et al., 2016a): An LSTM-based model which learns attention embeddings and combines them with the LSTM hidden states to predict the polarity.
TD-LSTM (Tang et al., 2016b): It employs two LSTMs to model the left context and the right context, respectively. The concatenated context representations are used for prediction.
IAN (Ma et al., 2017): An interactive attention is implemented on the representation of context and aspect learned by two LSTMs.
MemNet (Tang et al., 2016c): It applies an attention mechanism over the word embeddings multiple times and predicts sentiment based on the topmost sentence representation.
RAM (Chen et al., 2017): Similar to MemNet, RAM is a multi-layer architecture where each layer consists of attention-based aggregation of word features and a GRU cell to learn the sentence representation.
BILSTM-ATT-G (Zhang and Liu, 2017): It models left and right contexts using two attention-based LSTMs and introduces gates to measure the importance of the left context, the right context, and the entire sentence for the prediction.
MGAN (Fan et al., 2018): MGAN leverages the fine-grained and coarse-grained attention, which is further employed to characterize the word-level interactions between aspect and context words.
PBAN (Gu et al., 2018): PBAN concentrates on the position information of aspect terms and mutually models the relation between aspect term and sentence by employing bidirectional attention.
TNet: It employs a CNN layer to extract salient features from the transformed word representations originating from a bidirectional RNN layer.
Cabasc (Liu et al., 2018): Cabasc employs sentence-level content attention mechanism to capture the important information about given aspects from a global perspective.

Main Results
As shown in Table 1, IACapsNet achieves the best performance on all the datasets. From Table 1, we can make the following observations. ATAE-LSTM performs better than TD-LSTM. One main reason may be the attention mechanism in ATAE-LSTM, which enables it to focus on the important parts of the sentence based on the aspect term. BILSTM-ATT-G adopts an architecture similar to TD-LSTM, modeling the left context and the right context using attention-based LSTMs, and achieves better results than ATAE-LSTM. IAN and MGAN introduce interactive attention in coarse-grained and multi-grained ways, respectively, and bring remarkable improvements. PBAN similarly utilizes a fine-grained bidirectional attention and performs comparably with MGAN. MemNet utilizes a more complex structure that contains nine computational layers and updates the query vector at each hop. RAM also learns multiple attended vectors on the memory and achieves superior results among the baseline models, especially on the Laptop dataset.
Our proposed IACapsNet consistently performs best on all three datasets. The improvement is mainly attributed to the feature clustering ability, which tackles the overlapped features, and to the iterative updating of the coupling coefficients, which takes the overall meaning of the context into account. Moreover, compared with Cabasc, which also incorporates the overall representation into a typical attention mechanism but in a static way, our iterative method shows remarkable strengths.

Ablation Study
To analyze the effect of different components, including the routing mechanism and the introduced interactive attention, we report the results of variants of IACapsNet. The results in Table 3 indicate: (1) The EM routing based IACapsNet outperforms IACapsNet-Cosine, which routes capsules by cosine similarity (Sabour et al., 2017). One main reason may be that cosine similarity saturates at 1, which makes it insensitive to the difference between a quite good agreement and a very good agreement. (2) Integrating interactive attention into the routing mechanism brings a remarkable improvement for both routing mechanisms, which demonstrates the necessity of considering the relationship between the aspect and the context during the routing procedure.
Moreover, EM routing also improves efficiency, with fewer trainable parameters and faster speed, as shown in Table 4 (IAN is listed as a baseline). From the table, we can conclude that IACapsNet is considerably more efficient than IACapsNet-Cosine, with decreases of about 10% in the number of trainable parameters and 36% in running time. Moreover, compared to IAN, IACapsNet achieves a much better accuracy with fewer trainable parameters, meaning that the capsule network encodes features more efficiently. However, IACapsNet costs more time than IAN, which is an inherent deficiency of capsule networks caused by the iterative calculation during routing.

Effects of Routing Iteration Number
As our proposed IACapsNet involves an iterative procedure during routing, in this section we investigate the effects of different routing iteration numbers. Specifically, we conduct experiments on all three datasets and vary the routing iteration number r from 1 to 4. The results are illustrated in Figure 3.
The results show that IACapsNet achieves the best performance at routing iteration numbers 2, 3, and 3 on the RESTAURANT, LAPTOP, and Twitter datasets, respectively. When the number of iterations is 1, our capsule network degrades to a standard network, which obtains results comparable with IAN. When r is increased to 4, the performance gets dramatically worse. Moreover, with 4 iterations, IACapsNet becomes much more difficult to train: the model becomes more sensitive and fluctuates greatly in loss and accuracy during training. Therefore, it is appropriate to limit the routing iteration number r and set it to 2 or 3 depending on the performance.

Case Study
In order to assess the effect of our EM routing with the interactive attention mechanism, we visualize the coupling coefficients. Our model is able to adjust the contribution of each part based on the global meaning of a sentence and shows superiority in modeling complicated sentences. In this section, we pick an example from the RESTAURANT dataset consisting of a long and complicated sentence and 3 aspect terms with 3 different sentimental polarities. Figure 2 shows this example and the visualization of the word-level coupling coefficients, which are the sum of one word's coupling coefficients to the category. The attention weights from IAN are also shown as a baseline and are normalized to the same scale as the routing coupling coefficients. The line in the chart reflects the difference ∆. 'F' in Figure 2 means false sentiment classification, and 'T' means correct classification.
From the figure, we can observe that our routing method can adjust the attended words according to the different aspect terms, which helps make all the predictions correct. Moreover, our routing method can locate the more important words more efficiently. For example, for the aspect 'price', the words "n't" and "justify" are attended, which are the corresponding essential sentimental words. However, they are ignored by the ordinary attention mechanism, which leads to a wrong prediction. This shows that our routing mechanism can capture the important parts of a sentence more accurately.

Figure 3: Effects of Routing Iteration Number

Conclusion and Future Work
We re-examine the deficiencies of existing attention-based models for aspect-level sentiment classification. We propose to utilize a capsule network to handle the overlapped sentiment features by feature clustering, and to iteratively adjust the attention weights from a global perspective. To the best of our knowledge, this is the first application of capsule networks to this task. Moreover, interactive attention is introduced into the dynamic routing to model the semantic relationship between the aspect term and the sentence. The experimental results verify that IACapsNet outperforms the baseline models. The ablation and case studies show the efficacy of the different proposed modules.
In the future, our approach can be generalized to other tasks that depend heavily on the attention mechanism, such as reading comprehension and machine translation.