Transfer Capsule Network for Aspect Level Sentiment Classification

Aspect-level sentiment classification aims to determine the sentiment polarity of a sentence towards an aspect. Due to the high cost in annotation, the lack of aspect-level labeled data becomes a major obstacle in this area. On the other hand, document-level labeled data like reviews are easily accessible from online websites. These reviews encode sentiment knowledge in abundant contexts. In this paper, we propose a Transfer Capsule Network (TransCap) model for transferring document-level knowledge to aspect-level sentiment classification. To this end, we first develop an aspect routing approach to encapsulate the sentence-level semantic representations into semantic capsules from both the aspect-level and document-level data. We then extend the dynamic routing approach to adaptively couple the semantic capsules with the class capsules under the transfer learning framework. Experiments on SemEval datasets demonstrate the effectiveness of TransCap.


Introduction
Aspect-level sentiment classification (ASC) is a fine-grained subtask in sentiment analysis. Given a sentence and an aspect occurring in the sentence, ASC aims to determine the sentiment polarity of the aspect. Traditional methods mostly use machine learning models with handcrafted features to build sentiment classifiers for ASC tasks (Jiang et al., 2011; Mohammad et al., 2013). Such methods need either laborious feature engineering or massive linguistic resources. With the development of deep learning techniques, a number of neural models have been proposed (Wang et al., 2016b; Tang et al., 2016; Chen et al., 2017) for ASC tasks. All these models train classifiers in a supervised manner and require a sufficient amount of labeled data to get promising results. However, the annotation of opinion targets in ASC is extremely expensive.
The lack of labeled data is a major obstacle in this field. Publicly available datasets for ASC often contain a limited number of training samples. On the other hand, document-level labeled data like reviews are easily accessible from online websites such as Yelp and Amazon. Since each review has an accompanying rating score indicating the user's overall satisfaction towards an item, such a score can naturally serve as the label of sentiment polarity of the review document. Intuitively, the document-level data contain useful sentiment knowledge for analysis on aspect-level data since they may share many linguistic and semantic patterns. Unfortunately, for ASC tasks, only one study (He et al., 2018) has taken the utilization of document-level data into account. The PRET+MULT framework proposed in (He et al., 2018) is a successful attempt that adopts pre-training and multi-task learning approaches. However, their model only shares shallow embedding and LSTM layers between ASC and DSC (document-level sentiment classification) tasks. In other words, the document-level knowledge is merely used for improving the word representations in ASC. Consequently, PRET+MULT is unable to handle complicated patterns like euphemism and irony, which require high-level semantic knowledge from the entire sentence. For example, given the sentence "The staff should be a bit more friendly", PRET+MULT makes a wrong prediction (details are given in the case study).
In this paper, we propose a novel Transfer Capsule Network (TransCap) model to transfer sentence-level semantic knowledge from DSC to ASC. Our work is inspired by the capsule network (Hinton et al., 2011;Sabour et al., 2017) which uses capsule vectors and the dynamic routing approach to store and cluster features, but we move one step further in that we develop an aspect routing approach which can generate sentence-level semantic features shared by ASC and DSC. Moreover, we extend the dynamic routing approach by adapting it to the transfer learning framework. We conduct extensive experiments on two SemEval datasets. Results demonstrate that our TransCap model consistently outperforms the state-of-the-art methods.

Related Work
Aspect-level Sentiment Classification Traditional methods for sentiment classification (Nakagawa et al., 2010; Jiang et al., 2011; Taboada et al., 2011; Mohammad et al., 2013) mostly use machine learning algorithms to build sentiment classifiers with carefully extracted features, which take massive time and resources to collect. Early studies focus on document-level sentiment classification (DSC) tasks. In recent years, a number of deep learning methods have been proposed for aspect-level sentiment classification (ASC) tasks (Dong et al., 2014; Vo and Zhang, 2015; Tang et al., 2016; Wang et al., 2016a; Ma et al., 2017; Ma et al., 2018). In general, there are three types of neural networks for ASC tasks: LSTM based (Wang et al., 2016b; Ma et al., 2017; Tay et al., 2018), memory based (Tang et al., 2016; Chen et al., 2017; Zhu and Qian, 2018), and hybrid methods (Xue and Li, 2018). For example, Wang et al. (2016b) use an attention mechanism to model the inter-dependence between LSTM hidden units and aspects. Tang et al. (2016) utilize a memory network to store context words and conduct multi-hop attention to get the sentiment representation towards aspects. Chen et al. (2017) apply recurrent attention to multi-layer memory. Xue and Li (2018) employ a CNN and a gating mechanism to extract aspect-specific information from contexts.
Although various types of approaches have been proposed, the inherent obstacle, i.e., the lack of labeled data, is still a big challenge for all ASC tasks. Without sufficient labeled data, training procedures in these approaches are likely to converge in a sub-optimal state. We differentiate our work from aforementioned models in that we aim to utilize the abundant labeled DSC data to alleviate the scarcity of labeled data in ASC tasks.
Transfer Learning Transfer learning aims to extract knowledge from one or more source tasks and then apply the knowledge to a target task. It can be categorized into three types based on different situations in the source and target domains/tasks (Pan and Yang, 2010). Our work belongs to "inductive transfer learning (ITL)" type since ASC (target) and DSC (source) in our framework are different but related tasks. In this case, ITL is similar to multi-task learning (MTL) with a slight difference: ITL only aims at achieving high performance in the target task while MTL tries to improve both simultaneously.
Several recent attempts have taken ITL or MTL methods for sentiment classification tasks. Dong and de Melo (2018) present a transfer learning framework by utilizing trained models. Xiao et al. (2018) employ capsule network for multitask learning. Both these methods are designed for document-level text/sentiment classification tasks, and are inappropriate for the fine-grained ASC task in this work. He et al. (2018) propose a multitask framework to combine ASC with DSC tasks together. This is the closest work to ours. However, the method in (He et al., 2018) is based on an existing AT-LSTM model (Wang et al., 2016b), whereas our framework is a totally new one which employs capsule network with carefully designed strategies for ASC tasks.

Our Proposed TransCap Model
In this section, we introduce our Transfer Capsule Network (TransCap) model. TransCap is proposed to conduct aspect-level sentiment classification with auxiliary knowledge transferred from document-level data. We first present the problem definitions and preliminaries. We then describe the architecture of TransCap in detail.

Definitions and Preliminary
Definition 1 (TransCap) Given a source document-level corpus $C_D$ and the learning task $T_D$, a target aspect-level corpus $C_A$ and the learning task $T_A$, TransCap aims to help improve the learning of the target predictive function $f_A(\cdot)$ in $T_A$ using the knowledge transferred from $T_D$.
Definition 2 ($T_A$ and $T_D$) Given a sentence $S = \{w_1, ..., w_a, ..., w_L\} \in C_A$ and an aspect $w_a$ occurring in $S$, an aspect-level sentiment classification task $T_A$ aims to determine the sentiment polarity of $S$ towards $w_a$. Note there might be multiple aspects in one sentence. Given an opinion sentence (or document) $D \in C_D$, a document-level sentiment classification task $T_D$ aims at assigning an overall sentiment polarity for $D$. Note that $T_A$ is the main task and $T_D$ is only for providing auxiliary knowledge in our TransCap model.
Preliminary (CapsNet) The capsule network was first proposed for image classification in computer vision (Hinton et al., 2011; Sabour et al., 2017). Compared with CNN, it replaces the scalar-output feature detectors with vector-output capsules and has the ability to preserve additional information such as position and thickness. The vanilla CapsNet consists of two capsule layers. The primary layer stores low-level image feature maps and the class layer generates the classification probability with each capsule corresponding to one class.
Recently, CapsNet has been applied to several NLP tasks like text classification and relation extraction (Yang et al., 2018b; Gong et al., 2018; Xiao et al., 2018; Wang et al., 2018b). CapsNet is able to adaptively decide the information transferred between layers by using dynamic routing. Furthermore, each class in CapsNet has distinctive parameters to aggregate features and an independent probability of being present. Therefore, CapsNet meets our needs in the transfer learning scenario, which includes multiple polarities and tasks. Our TransCap model is the first attempt to exploit the power of CapsNet under the transfer learning framework for ASC tasks.

An Overview of Architecture
The architecture of TransCap is shown in Figure 1. It consists of four layers: 1) the Input layer converts words in a sentence into low-dimensional real-valued vectors, 2) the FeatCap layer extracts n-gram features from word vectors and transforms them into feature capsules, 3) the SemanCap layer aggregates feature capsules into a set of aspect-related sentence-level semantic capsules, and 4) the ClassCap layer generates class capsules which correspond to sentiment polarities in $T_A$ and $T_D$, respectively.
Note that the $T_A$ and $T_D$ tasks share the first three layers, and they separate only in the last ClassCap layer. Since $T_A$ and $T_D$ are related tasks that both aim to identify the sentiment polarity, features useful for one task might be useful for the other. We expect the features produced by the shared layers to be improved in a mutual way.

Input Layer
The input layer consists of two lookup layers. Let $E_w \in \mathbb{R}^{d_w \times |V|}$ be the pre-trained word embedding lookup table, where $d_w$ is the dimension of word vectors and $|V|$ is the vocabulary size. The word lookup layer maps the word sequence in $S$ (or $D$) to a list of word vectors $\{e_1, ..., e_a, ..., e_L\} \in \mathbb{R}^{d_w \times L}$.
Following Gu et al. (2018), we also use a position lookup layer. For $T_A$, by calculating the absolute distance from every context word $w_i$ to the aspect word $w_a$, we get an additional position sequence for $S$. For $T_D$, the position sequence is a zero sequence since there is no aspect information. Let $E_l \in \mathbb{R}^{d_l \times |L|}$ be the randomly initialized position embedding lookup table. The position lookup layer maps the position sequence to a list of position vectors $\{l_1, ..., l_a, ..., l_L\} \in \mathbb{R}^{d_l \times L}$.
The final representation of each word $w_i$ is calculated as
$$x_i = e_i \oplus l_i, \tag{1}$$
where $\oplus$ denotes concatenation and $x_i \in \mathbb{R}^{d_h}$ with $d_h = d_w + d_l$. The sentence $S$ (or document $D$) is thus transformed into a sentence embedding $X = \{x_1, ..., x_L\} \in \mathbb{R}^{d_h \times L}$.
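The two lookups can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the word and position vectors are taken to be concatenated, the aspect is a single token at a known index, and the toy tables `word_table` and `pos_table` are hypothetical values, not the embeddings used in the experiments.

```python
def position_sequence(length, aspect_index=None):
    """Absolute distance of every token to the aspect token.

    For document-level input there is no aspect, so the sequence is all zeros.
    """
    if aspect_index is None:
        return [0] * length
    return [abs(i - aspect_index) for i in range(length)]


def embed(tokens, word_table, pos_table, aspect_index=None):
    """Concatenate the word vector and the position vector of each token."""
    positions = position_sequence(len(tokens), aspect_index)
    return [word_table[w] + pos_table[p] for w, p in zip(tokens, positions)]


# Hypothetical toy tables: word dimension 2, position dimension 1.
word_table = {"great": [0.9, 0.1], "food": [0.2, 0.7], "but": [0.0, 0.0]}
pos_table = {0: [0.0], 1: [0.5], 2: [1.0]}

# Aspect-level input: the aspect "food" sits at index 1.
X = embed(["great", "food", "but"], word_table, pos_table, aspect_index=1)
```

Each row of `X` has three entries here, the two word dimensions followed by the one position dimension.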

Feature Capsule Layer
This layer extracts n-gram features from the sentence embedding $X$. N-gram features contain raw and local semantic meaning in a fixed window. We apply multiple convolution operations to the $i$-th n-gram in $X$ and get its feature vector $r_i$:
$$r_i = F \ast X_{i:i+K} + b, \tag{2}$$
where $\ast$ denotes the convolution operation, $F \in \mathbb{R}^{d_p \times (d_h \times K)}$ is the kernel group, $(d_h \times K)$ is the size of one convolutional kernel, $K$ is the n-gram size and $d_p$ is the dimension of one feature capsule. After sliding $F$ over $X$, we get a set of feature capsules $r \in \mathbb{R}^{d_p \times (L-K+1)}$ encapsulating n-gram features extracted from the whole sentence $S$ (or document $D$).
Since one kernel group $F$ corresponds to one category of semantic meaning, we repeat the above procedure $C$ times with different kernel groups, and get multiple channels of feature capsules representing $C$ categories of semantic meaning. The final output of the feature capsule layer is arranged as $R \in \mathbb{R}^{C \times d_p \times (L-K+1)}$:
$$R = [r^{(1)}, r^{(2)}, ..., r^{(C)}],$$
where $r^{(c)}$ is the set of feature capsules produced by the $c$-th kernel group.
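The sliding-window extraction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: each kernel is assumed to be stored as a flat list of K times d_h weights, a bias per kernel is assumed, and the output is laid out as channels by positions by capsule dimension (the transpose of the layout in the text) because that is more natural for nested Python lists.

```python
def ngram_capsules(X, F, b, K):
    """Slide one kernel group over X and return L-K+1 feature capsules.

    X: list of L token vectors, each of size d_h
    F: list of d_p kernels, each a flat list of K*d_h weights
    b: list of d_p biases
    """
    capsules = []
    for i in range(len(X) - K + 1):
        # Flatten the K token vectors of the current window.
        window = [x for token in X[i:i + K] for x in token]
        capsules.append([sum(w * x for w, x in zip(kernel, window)) + bias
                         for kernel, bias in zip(F, b)])
    return capsules


def feature_capsule_layer(X, kernel_groups, biases, K):
    """Repeat the convolution for C kernel groups: C x (L-K+1) x d_p."""
    return [ngram_capsules(X, F, b, K) for F, b in zip(kernel_groups, biases)]


# Toy input: L = 4 tokens, d_h = 2, bigrams (K = 2), d_p = 2, C = 2 channels.
X = [[1, 0], [0, 1], [1, 1], [2, 0]]
kernel_groups = [
    [[1, 0, 0, 0], [0, 0, 0, 1]],   # channel 1
    [[1, 1, 1, 1], [0, 0, 0, 0]],   # channel 2
]
biases = [[0, 0], [0, 1]]
R = feature_capsule_layer(X, kernel_groups, biases, K=2)
```

With four tokens and bigram windows, every channel holds three feature capsules.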

Semantic Capsule Layer
Aspect Routing Approach The inputs from the two corpora $C_A$ and $C_D$ differ in whether an aspect term occurs in the sentence or document. The $T_D$ task does not contain aspects. Meanwhile, it is crucial for the $T_A$ task to determine the relation between contexts and aspects. Especially when a sentence contains two opposite sentiment polarities, different contexts must be separated for different aspects. For example, given the sentence "Great food but the service is dreadful !", the context word "dreadful" should be strengthened for the aspect "service" and weakened for the aspect "food".
To this end, we propose a novel aspect routing approach to compute the aspect weight for the context words in a $K$-size window in $T_A$. Formally, we apply a fusing convolution operation to the sentence embedding $X$ with a kernel $F_a \in \mathbb{R}^{d_h \times K}$, and we get the aspect routing weight $a_i$:
$$a_i = \mathrm{sigmoid}(F_a \ast X_{i:i+K} + T_a e_a + b_a), \tag{3}$$
where $\ast$ denotes the convolution operation, $e_a$ is the aspect embedding (or the average embedding in the case of a multi-word aspect), $T_a \in \mathbb{R}^{1 \times d_w}$ is a transfer matrix that maps $e_a$ to a scalar value, and $b_a$ is a bias. The generated routing weight $a_i \in [0, 1]$ fuses aspect information with respect to its context. It controls how much information in the current context can be transmitted to the next layer. If $a_i$ is zero, the feature capsule is totally blocked.
A minor challenge is that, for a $T_D$ task, there is actually no aspect in the document, and we need to distinguish the two types of sources $C_A$ and $C_D$. Hence we design a piecewise function for calculating the aspect routing weight $g_i$ of an arbitrary feature vector $r_i$ from $X$:
$$g_i = \begin{cases} a_i, & S \in C_A \\ 1, & D \in C_D \end{cases} \tag{4}$$
After sliding over $X$, we get $g \in \mathbb{R}^{1 \times (L-K+1)}$ for the whole sentence $S$ (or document $D$). Since we have $C$ channels of feature capsules, we repeat the above procedure $C$ times to get the entire aspect routing weights $G \in \mathbb{R}^{C \times 1 \times (L-K+1)}$:
$$G = [g^{(1)}, g^{(2)}, ..., g^{(C)}]. \tag{5}$$
Finally, the feature capsules are routed using these weights:
$$P = R \odot G, \tag{6}$$
where $P \in \mathbb{R}^{C \times d_p \times (L-K+1)}$ are the aspect-customized feature capsules, and $\odot$ denotes element-wise multiplication (with broadcasting).
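A small sketch may help make the gating behaviour of aspect routing concrete. It rests on several assumptions that should be read as such: the weight is squashed into [0, 1] with a sigmoid, document-level inputs receive a constant weight of 1.0 (matching the "-A" ablation setting), the kernel is stored as a flat list, and all parameter values below are hypothetical zeros chosen so the arithmetic is easy to follow.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def aspect_routing_weights(X, F_a, T_a, e_a, b_a, K, has_aspect):
    """Return one routing weight per K-gram window of X.

    X: L token vectors of size d_h; F_a: flat kernel of K*d_h weights;
    T_a: d_w weights mapping the aspect embedding e_a to a scalar; b_a: bias.
    """
    n_windows = len(X) - K + 1
    if not has_aspect:
        # Assumption: document-level input is passed through unfiltered.
        return [1.0] * n_windows
    aspect_term = sum(t * e for t, e in zip(T_a, e_a))
    weights = []
    for i in range(n_windows):
        window = [x for token in X[i:i + K] for x in token]
        conv = sum(w * x for w, x in zip(F_a, window))
        weights.append(sigmoid(conv + aspect_term + b_a))
    return weights


def route(capsules, weights):
    """Scale every feature capsule by its weight (broadcast over d_p)."""
    return [[g * value for value in capsule]
            for capsule, g in zip(capsules, weights)]


# Toy setting: L = 3, d_h = 3, K = 2, d_w = 2, all parameters zero.
X = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0], [1.0, 1.0, 0.5]]
F_a = [0.0] * 6
T_a, e_a, b_a = [0.0, 0.0], [0.3, 0.7], 0.0

g_aspect = aspect_routing_weights(X, F_a, T_a, e_a, b_a, K=2, has_aspect=True)
g_doc = aspect_routing_weights(X, F_a, T_a, e_a, b_a, K=2, has_aspect=False)
P = route([[2.0, 4.0], [1.0, 3.0]], g_aspect)
```

With all parameters at zero the sigmoid returns 0.5, so every capsule of the aspect-level input is halved while the document-level weights stay at 1.0.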

Semantic Capsule Generation
The aspect-customized capsules $P$ generated above are transformed from the n-gram feature capsules. Though they encode aspect-related information, they are still local features without a sentence-level view. Moreover, the large number of capsules in $P$ may prevent the next layer from learning robust representations. Hence we adopt the element-wise maximum function (Lai et al., 2015) over $P$ to aggregate all feature capsules in the same channel horizontally:
$$U = \max_{t=1}^{L-K+1} P_t, \tag{7}$$
where $P_t \in \mathbb{R}^{C \times d_p}$ denotes the capsules at window position $t$ and $U \in \mathbb{R}^{C \times d_p}$ are the generated semantic capsules. Eq. 7 condenses all local features in each channel, and thus we can obtain more precise and global semantic representations from subtle expressions, e.g., a euphemistic sentence. Finally, we want the length of each semantic capsule $u_i$ to represent the probability that the semantic meaning of $u_i$ is present in the current input, so we use a nonlinear "squash" function (Sabour et al., 2017) to limit its length to [0,1]:
$$u_i \leftarrow \frac{\|u_i\|^2}{1 + \|u_i\|^2} \frac{u_i}{\|u_i\|}. \tag{8}$$
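The aggregation and squashing steps can be illustrated with a short sketch. It assumes the squash function of Sabour et al. (2017) and a channels by positions by capsule dimension layout; the small `eps` term is an addition for numerical safety with zero vectors and is not part of the description above.

```python
import math


def max_over_positions(capsules):
    """Element-wise maximum over all capsules of one channel."""
    return [max(values) for values in zip(*capsules)]


def squash(vector, eps=1e-9):
    """Shrink a vector so its length lies in [0, 1), keeping its direction."""
    sq_norm = sum(v * v for v in vector)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm + eps)
    return [scale * v for v in vector]


def semantic_capsules(P):
    """P: C channels x (L-K+1) capsules x d_p -> C semantic capsules."""
    return [squash(max_over_positions(channel)) for channel in P]


def length(vector):
    return math.sqrt(sum(v * v for v in vector))


# One channel with two feature capsules of dimension 2.
P = [[[1.0, 5.0], [3.0, 2.0]]]
U = semantic_capsules(P)
```

Long vectors are mapped to a length just below 1 and short vectors to a length near 0, which is what allows the length to be read as a presence probability.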

Class Capsule Layer
In the original capsule network, there is only one classification task, and it uses class capsules to denote classes and their lengths as classification probabilities. However, there are two different tasks in our problem, and it is necessary to discern the sentiment polarities (classes) in these tasks. To achieve this, we introduce two types of class capsules into TransCap, with six capsules in total (three polarity capsules for each task). Such a structure makes it possible for our model to train $T_A$ and $T_D$ in a unified framework. Given input data from the two tasks in turn, the first three layers share most parameters (except those in Eq. 3) to jointly train $T_D$ and $T_A$, so that knowledge from document-level data can be successfully transferred into the aspect-level task. In the last layer, each class capsule is used for calculating the classification probability of each class in $T_D$ and $T_A$ separately. Hence each class capsule should have its own routing weights to adaptively aggregate semantic capsules from the previous layer. Below we give the details.
A semantic capsule $i$ generates a "prediction vector" $\hat{u}_{j|i}$ towards a class capsule $j$ as:
$$\hat{u}_{j|i} = W_{ij} u_i, \tag{9}$$
where $W_{ij} \in \mathbb{R}^{d_c \times d_p}$ is a weight matrix, $d_p$ and $d_c$ are the dimensions of semantic capsule $i$ and class capsule $j$, and $u_i$ is the vector representation of semantic capsule $i$. All "prediction vectors" generated by semantic capsules are summed up with weights $c_{ij}$ to obtain the vector representation $s_j$ of class capsule $j$:
$$s_j = \sum_i c_{ij} \hat{u}_{j|i}, \tag{10}$$
where $c_{ij}$ is a coupling coefficient defined by a "routing softmax":
$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}, \tag{11}$$
where each $b_{ij}$ is the log prior probability that semantic capsule $i$ should pass to class capsule $j$. It is computed using a dynamic routing approach which will be presented later.
After that, we again apply the non-linear "squash" function (Sabour et al., 2017) to $s_j$ in Eq. 10 to get the final representation $v_j$ of class capsule $j$:
$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|}, \tag{12}$$
where the length of $v_j$ is limited to [0,1] to represent the active probability of class capsule $j$.

Dynamic Routing Approach
The logit $b_{ij}$ in Eq. 11 determines the intensity of the connection between semantic capsule $i$ and class capsule $j$. It is initialized with 0 and is updated with an agreement coefficient $a_{ij}$:
$$a_{ij} = \hat{u}_{j|i} \cdot v_j. \tag{13}$$
This agreement coefficient is added to the initial logit $b_{ij}$ before computing the new values for all coupling coefficients $c_{ij}$ linking semantic capsules to class capsules.
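The routing loop can be sketched as below. This follows the general dynamic routing procedure of Sabour et al. (2017) rather than any released TransCap code; the number of iterations (3) is an assumption, the agreement is taken to be the dot product between a prediction vector and the class capsule, and the prediction vectors in the toy example are made-up values.

```python
import math


def squash(vector, eps=1e-9):
    sq_norm = sum(v * v for v in vector)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm + eps)
    return [scale * v for v in vector]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, iterations=3):
    """u_hat[i][j] is the prediction vector from semantic capsule i to class j.

    Returns the class capsules v and the coupling coefficients c that
    were used to build them in the last iteration.
    """
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]          # routing logits
    for _ in range(iterations):
        c = [softmax(row) for row in b]               # couplings per capsule i
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        for i in range(n_in):                          # agreement update
            for j in range(n_out):
                b[i][j] += sum(x * y for x, y in zip(u_hat[i][j], v[j]))
    return v, c


def length(vector):
    return math.sqrt(sum(x * x for x in vector))


# Two semantic capsules whose predictions both favour class 0.
u_hat = [
    [[1.0, 0.0], [0.1, 0.0]],
    [[0.9, 0.1], [0.0, 0.1]],
]
v, c = dynamic_routing(u_hat)
```

Because both semantic capsules agree on class 0, its capsule ends up longer than that of class 1 and attracts the larger coupling coefficients.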

Margin Loss
The length of a class capsule is used to represent the probability of the sentiment polarity. The capsule length of the active class should be larger than the others. Hence we adopt a separate margin loss $L_j$ for each class capsule $j$ in each task:
$$L_j = Y_j \max(0, m^+ - \|v_j\|)^2 + \lambda (1 - Y_j) \max(0, \|v_j\| - m^-)^2, \tag{14}$$
where $Y_j = 1$ if the sentiment polarity is present in class capsule $j$, and we simply set $m^+ = 0.9$, $m^- = 0.1$, $\lambda = 0.5$ following (Sabour et al., 2017). The loss for a single task is
$$L_T = \sum_j L_j, \tag{15}$$
where the sum runs over the class capsules of the task and $T$ is either A or D, denoting the losses $L_A$ and $L_D$ for tasks $T_A$ and $T_D$, respectively. The final loss $L$ for our TransCap model is the linear combination of the two single-task losses:
$$L = L_A + \gamma L_D, \tag{16}$$
where $\gamma \in [0,1]$ is a hyper-parameter controlling the weight of $T_D$. When training converges, the class capsule with the largest active probability in a task is chosen as the prediction of the sentiment polarity.
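A worked example may clarify how the margin loss behaves. The sketch assumes the margin loss of Sabour et al. (2017) with the stated settings (0.9, 0.1, 0.5) and assumes the joint objective is the aspect-level loss plus the document-level loss scaled by the balance factor; the capsule lengths are invented for illustration.

```python
def margin_loss(lengths, target, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Sum of per-capsule margin losses for one sample.

    lengths: lengths of the class capsules of one task
    target:  index of the gold polarity
    """
    total = 0.0
    for j, capsule_length in enumerate(lengths):
        if j == target:
            # The gold capsule is penalised when shorter than m_pos.
            total += max(0.0, m_pos - capsule_length) ** 2
        else:
            # Other capsules are penalised when longer than m_neg.
            total += lam * max(0.0, capsule_length - m_neg) ** 2
    return total


def joint_loss(loss_aspect, loss_document, gamma):
    """Assumed combination: aspect loss plus gamma times document loss."""
    return loss_aspect + gamma * loss_document


confident = margin_loss([0.95, 0.05, 0.05], target=0)
uncertain = margin_loss([0.5, 0.3, 0.1], target=0)
combined = joint_loss(1.0, 2.0, gamma=0.5)
```

A confident correct prediction incurs no loss, while the uncertain one is charged 0.4 squared for the gold capsule plus half of 0.2 squared for the second capsule, about 0.18 in total.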

Datasets and Settings
Datasets for $T_A$ We evaluate TransCap on two aspect-level datasets from SemEval 2014 Task 4 (Pontiki et al., 2014). The datasets contain reviews from the Restaurant and Laptop domains, respectively, with 3-way sentiment polarity labels: positive, neutral and negative. Both datasets have a fixed training/test split. We further randomly sample 20% of the training data as the development set, and use the remaining 80% for training.
Datasets for $T_D$ We use three document-level datasets to transfer knowledge: Yelp, Amazon and Twitter. All the documents (reviews) in the Yelp Review and Amazon Electronics (McAuley et al., 2015) datasets have accompanying five-star ratings (1 to 5). We consider reviews with a score <3 as negative, =3 as neutral and >3 as positive. The Twitter dataset is collected from SemEval 2013 to 2017, where the original tweets are already labeled with 3-way polarities. Each dataset for $T_D$ contains 30,000 samples with balanced class labels. All samples in these datasets are used for auxiliary training. We do not report performance for the $T_D$ task since it is not our focus. Also note that the first two datasets in $T_D$ are of the same topics as those in $T_A$, while the topics in Twitter are more general and less relevant to our main task $T_A$.
Compared Methods To demonstrate the superiority of our TransCap for ASC tasks, we compare it with the following baselines: ATAE-LSTM (Wang et al., 2016b), IAN (Ma et al., 2017), AF-LSTM(CONV) (Tay et al., 2018), AF-LSTM(CORR) (Tay et al., 2018), PBAN (Gu et al., 2018), MemNN (Tang et al., 2016), RAM (Chen et al., 2017), CEA (Yang et al., 2018a), DAuM (Zhu and Qian, 2018), IARM (Majumder et al., 2018), PRET+MULT (He et al., 2018) and GCAE (Xue and Li, 2018). Most of them are the latest methods published in 2018. The rest are frequently used classical models.

Main Results
The comparison results for all models are shown in Table 2. For clarity, we classify the models into four categories: the first is the LSTM-based methods (M1 to M5), the second is the memory-based ones (M6 to M10), the third is the hybrid ones (M11 and M12), and the last three lines (M13 to M15) are the variants of our model, where TransCap{S} denotes the one with the $T_A$ task only, and TransCap{Y,A} and TransCap{T,T} utilize the knowledge from different sources in $T_D$.
It is clear that our TransCap model consistently outperforms all baselines on both datasets. The hybrid model PRET+MULT, which is a multi-task learning based model, also performs better overall than the other baselines. Both observations demonstrate that the aspect-level sentiment classification task $T_A$ can benefit a lot from transferring knowledge from the auxiliary task $T_D$. PRET+MULT is nevertheless inferior to our model, because it only shares low-level features and transfers limited knowledge between tasks.
We also find that the two multi-task variants of our model, TransCap{Y,A} and TransCap{T,T}, achieve similar performance. {Y,A} provides knowledge from relevant domains, but its labels are not very accurate since they may contain a lot of noise. Though the knowledge in {T,T} is from tweets of mixed and less relevant topics, its labels are manually annotated and thus quite reliable. Overall, given a sufficient number of training samples in the auxiliary task $T_D$, the performance on $T_A$ can be significantly enhanced over its single-task counterpart TransCap{S}.
Among LSTM-based models, PBAN and IAN achieve higher performance than others since they use the bi-directional attention mechanism. RAM is better than other memory-based models because it utilizes a non-linear combination for attention results in different hops. GCAE performs the worst among all baselines, as its simple CNN-based model cannot capture the long-term dependencies between context words.

Ablation Study
To investigate the effects of different components in our model, we conduct the following ablation study on TransCap. (i) "-A": We remove the aspect routing approach, and set the same weight 1.0 for all feature capsules. (ii) "-S": We remove semantic capsules, and pass weighted feature capsules directly to class capsules. (iii) "-D": We remove the dynamic routing approach, i.e., a semantic capsule is coupled to all class capsules with equal probabilities.
Results for the ablation study are shown in Table 3, where "Ori." denotes results for the original TransCap model, and "-*" denotes those obtained after removing the corresponding component.
As expected, results for the simplified models all drop considerably. This clearly demonstrates the effectiveness of these components. Specifically, TransCap-A performs the worst, since it cannot generate aspect-related feature capsules after removing aspect routing from TransCap. Dynamic routing is critical as it helps TransCap to reduce the interference between $T_A$ and $T_D$. The drop in performance of TransCap-S also shows that semantic capsules are important for building robust and precise connections between features and polarities.

Parameter Analysis
Influence of Auxiliary Corpus Size To show the influence of the DSC task on our major ASC task, we vary the size of the auxiliary document-level corpus $C_D$ and observe the performance changes in $T_A$. We use a percentage in [0, 1] to control the ratio of $C_D$ and present results in Figure 2.
As can be seen, all curves in Figure 2 tend to rise with the increasing amount of document-level knowledge. This shows the effectiveness of our model in transferring knowledge from document-level data. At the initial stages, where only 20% or 40% of $C_D$ is introduced, we find small decreases in performance. The reason may be that when the auxiliary document-level corpus $C_D$ is small, the model in $T_D$ has not been well trained. Hence it provides limited transferable knowledge to train the shared input, feature capsule and semantic capsule layers. Consequently, the ASC task $T_A$ gets misleading information from these layers and then performs worse. After getting sufficient document-level data, $T_D$ becomes robust and stable, and $T_A$ also improves its performance.
Effects of Balance Factor $\gamma$ The balance factor $\gamma$ determines how important the DSC task $T_D$ is in the model. To evaluate its effects, we vary $\gamma$ in the range [0,1] and present results in Figure 3. The key observation from Figure 3 is that there are Turning Points (denoted as TP) for both datasets: TP≈0.7 for Restaurant and TP≈0.3 for Laptop. The curves have an overall upward trend when $\gamma$ < TP, but become flat or go downward once $\gamma$ > TP. This phenomenon can be explained by the multi-task learning mechanism. In the upward part, much useful sentiment knowledge is transferred from document-level data to aspect-level data, and thus the performance of $T_A$ improves. Once the weight for $T_D$ exceeds TP, $T_D$ begins to dominate the whole TransCap model, while $T_A$ gradually loses its leading role and performs worse.

Case Study
To have a close look, we further select three samples from different datasets for a case study.

Part 1 We first illustrate what kind of knowledge TransCap transfers. Below is an example from Laptop, where the target is enclosed in [] with a subscript denoting its true polarity:
1. "It has so much more speed and the [screen]_pos is very sharp."
Humans can easily identify the positive polarity towards the aspect [screen]. However, the single-task variant TransCap{S} and most baselines wrongly predict negative. This is because "sharp" is a multi-polarity word in the training set, as the following two examples show:
2. "Once open, the [leading edge]_neg is razor sharp."
3. "[Graphics]_pos are clean and sharp, internet interfaces are seamless."
The training set in Laptop contains only 8 samples including "sharp", with 5 of them labeled as negative. It is hard for single-task models to learn a correct meaning for "sharp" from so few contradictory samples. Hence they simply treat it as a negative token because the negative polarity is in the majority, and make false predictions. However, for TransCap{Y,A}, the auxiliary Amazon dataset contains 294 samples where "sharp" co-occurs with many different contexts. With the help of sufficient training samples, the three shared layers have learned to recognize the true polarity of "sharp" with respect to its contexts, and thus the class capsule layer in TransCap{Y,A} finally makes a correct prediction.
Part 2 This part aims to visualize the decision-making process of TransCap with an example from the Restaurant dataset:
4. "Great [food]_pos but the [service]_neg is dreadful !"
The coupling coefficients $c_{ij} \in [0,1]$ for this example are visualized in Figure 4, which presents the $c_{ij}$ between each pair of (semantic capsule, class capsule) after dynamic routing with respect to different aspects. Note that the sum of $c_{ij}$ in every column (not row as in the attention mechanism) is 1.0.
When the input aspect is [service] (the upper part in Figure 4), the detailed decision-making process is as follows. Firstly, several semantic capsules such as 4 and 8 have already captured the corresponding sentence-level semantic meaning from the review's content. Secondly, by calculating the coupling coefficient $c_{ij}$ after dynamic routing, these semantic capsules are highly coupled with the negative class capsule, and thus this negative capsule gets a higher active probability than the other class capsules. As a result, TransCap makes the negative prediction for the aspect [service]. Similarly, when the input aspect is [food] (the lower part in Figure 4), the positive class capsule gets a high active probability and TransCap then makes a correct prediction for this aspect.
Part 3 In the last part, we present an example from Restaurant to show the advantage of TransCap over PRET+MULT (He et al., 2018):
5. "The [staff]_neg should be a bit more friendly."
This is a euphemistic negative review towards the aspect [staff], though no single word in the sentence conveys a negative sentiment by itself. PRET+MULT generates features and transfers knowledge only at the word level. Although the embedding of each word is enhanced by the auxiliary document-level data, PRET+MULT cannot recognize the overall negative sentiment behind the words and makes a false positive prediction due to the word "friendly". In contrast, TransCap generates sentence-level semantic capsules containing the overall semantic meaning of the sentence, and shares these sentence-level features between the ASC and DSC tasks. Both of these help TransCap make a correct decision.

Conclusion
In this paper, we present a novel Transfer Capsule Network (TransCap) model for aspect-level sentiment classification. To address the lack of aspect-level labeled data, we utilize the abundant document-level labeled data. We develop a transfer learning framework to transfer knowledge from the document-level task to the aspect-level task. We implement it with a carefully designed capsule network, which mainly consists of the aspect routing and dynamic routing approaches. Experiments on two SemEval datasets demonstrate that TransCap outperforms the state-of-the-art baselines by a large margin.