Attention Transfer Network for Aspect-level Sentiment Classification

Aspect-level sentiment classification (ASC) aims to detect the sentiment polarity of a given opinion target in a sentence. In neural network-based methods for ASC, most works employ the attention mechanism to capture the sentiment words corresponding to the opinion target, then aggregate them as evidence to infer the sentiment of the target. However, aspect-level datasets are all relatively small-scale due to the complexity of annotation. Data scarcity sometimes causes the attention mechanism to fail to focus on the sentiment words corresponding to the target, which ultimately weakens the performance of neural models. To address this issue, we propose a novel Attention Transfer Network (ATN), which exploits attention knowledge from resource-rich document-level sentiment classification datasets to improve the attention capability of the aspect-level sentiment classification task. In the ATN model, we design two different methods to transfer attention knowledge and conduct experiments on two ASC benchmark datasets. Extensive experimental results show that our methods consistently outperform state-of-the-art works. Further analysis also validates the effectiveness of ATN.


Introduction
Aspect-level sentiment classification (ASC) is a fundamental task in sentiment analysis (Pang et al., 2008; Liu, 2012; Pontiki et al., 2014), which aims to infer the sentiment polarity (e.g., positive, neutral, negative) of a given opinion target in a review sentence. An opinion target, also known as an aspect term, is a word or phrase in a review that describes an aspect of an entity. For example, the sentence "The tastes are great, but the service is dreadful" contains two opinion targets, namely "tastes" and "service". The user's sentiment towards the opinion target "tastes" is positive, while the sentiment towards "service" is negative. Traditional methods usually focus on designing a set of features, such as bag-of-words or sentiment lexicons, to train a classifier (e.g., SVM) for ASC (Jiang et al., 2011; Kiritchenko et al., 2014). Motivated by the great success of deep learning in computer vision, speech recognition (Dahl et al., 2012) and natural language processing (Bengio et al., 2003), recent works use neural networks to learn low-dimensional and continuous text representations without any feature engineering, and achieve competitive results on the ASC task (Tang et al., 2016a).
From the above example, we can see that a sentence sometimes refers to several opinion targets that may carry different sentiment polarities, so one main challenge of ASC is to separate the opinion contexts of different targets. To this end, many state-of-the-art works employ the attention mechanism (Bahdanau et al., 2014) to capture sentiment words related to the given target, and then aggregate them to make the sentiment prediction (Wang et al., 2016; Tang et al., 2016b; Ma et al., 2017; Chen et al., 2017; Majumder et al., 2018; Fan et al., 2018). Despite the effectiveness of the attention mechanism, we argue that it fails to reach its full potential because labeled ASC data is limited. It is well known that the promising results of deep learning rely heavily on sufficient training data. However, the annotation of ASC data is very labour-intensive and expensive in real-world scenarios, because annotators need not only to identify all opinion targets in a sentence but also to determine their corresponding sentiment polarities. As a result of this annotation difficulty, existing public aspect-level datasets are all relatively small-scale, which ultimately limits the potential of the attention mechanism.
Despite the lack of ASC data, an enormous amount of labeled data for document-level sentiment classification (DSC) is available at online review sites such as Amazon and Yelp. These reviews contain substantial sentiment knowledge and semantic patterns. Therefore, one meaningful but challenging research question is how to leverage resource-rich DSC data to improve the low-resource ASC task. For this purpose, He et al. (2018) design the PRET+MULT framework to transfer sentiment knowledge from DSC data to the ASC task by sharing the shallow embedding and LSTM layers. Inspired by the capsule network (Sabour et al., 2017), Chen and Qian (2019) propose TransCap, which shares the bottom three capsule layers and separates the two tasks only in the last ClassCap layer. Fundamentally, PRET+MULT and TransCap improve ASC through parameter sharing and multi-task learning, but they cannot precisely control or interpret what knowledge is transferred. In this work, we directly focus on the aforementioned attention issue in the ASC task and propose a novel framework, Attention Transfer Network (ATN), to explicitly transfer attention knowledge from the DSC task and thereby improve the attention capability of the ASC task. Compared with PRET+MULT and TransCap, our model achieves better results and retains good interpretability.
In the ATN framework, we adopt two attention-based BiLSTM networks as the DSC module and the base ASC module, respectively, and propose two different methods to transfer attention from DSC to ASC. The first transfer approach is called Attention Guidance. Specifically, we first pre-train an attention-based BiLSTM on large-scale DSC data, then exploit the attention weights from the DSC module as a learning signal to guide the ASC module to capture sentiment clues more accurately, thereby achieving improvements. The second approach, Attention Fusion, directly incorporates the attention weights of the DSC module into the ASC module. The two approaches work in different ways and have different advantages. Attention Guidance aims to learn the attention ability of the DSC module and has a faster inference speed, since it does not use external attention from DSC during the testing stage. In contrast, Attention Fusion can leverage the attention knowledge of the DSC module during the testing stage and make more comprehensive predictions.
We conduct experiments on two benchmark datasets to evaluate different methods. The results indicate that the base ASC model is substantially improved by incorporating either of the two attention transfer approaches, and that the resulting ATN models outperform all compared methods on the ASC task.

Attention Transfer Network
Figure 1 shows the overall architecture of the Attention Transfer Network (ATN). It consists of four main parts: the pre-trained DSC module, the base ASC module, and the two attention transfer approaches. In this section, we first give the task formalization of ASC and DSC, then introduce the attention-based pre-trained DSC module and the base ASC module. Finally, we present the details of our two proposed attention transfer approaches, namely Attention Guidance and Attention Fusion.

Task Formalization
ASC Formalization. Formally, given a sample $\langle s, t \rangle$ from the ASC dataset $A$, $s = \{w_1, w_2, ..., w_n\}$ is a review sentence consisting of $n$ words and $t = \{w_l, w_{l+1}, ..., w_r\}$ is a given opinion target containing $r - l + 1$ words. The opinion target $t$ is a contiguous subsequence of $s$. The goal of ASC is to predict the sentiment polarity (i.e., positive, neutral, or negative) of the opinion target $t$ in the sentence $s$.
DSC Formalization. For a review document $d$ from the DSC dataset $D$, we regard it as a special long sentence $\{w^d_1, w^d_2, ..., w^d_n\}$ consisting of $n$ words. DSC aims to determine the overall sentiment polarity of the review document $d$.

Pre-training DSC Module
Before transferring attention knowledge, we first pre-train a DSC module on the large-scale DSC dataset $D$. In this work, we employ a conventional attention-based BiLSTM as our DSC module. To obtain the document representation $r^d$, we employ the attention mechanism to aggregate the sentiment words that are significant for sentiment classification as follows:
$$r^d = \sum_{i=1}^{n} \alpha_i h^d_i, \qquad (1)$$
where $\alpha_i$ is the attention weight of the hidden state $h^d_i$, defined in Equation 2 in terms of $h^d_i$ and $h^d_{avg}$. Here $h^d_{avg}$ is the average of all the hidden states, i.e., $h^d_{avg} = \sum_{i=1}^{n} h^d_i / n$, and $W_d$ and $b_d$ are the weight matrix and bias, respectively.
Finally, the representation r d is fed to a linear layer and a softmax layer to predict the sentiment label of the review document d. We pre-train the DSC module by minimizing the cross-entropy loss between the predicted sentiment distribution and the ground truth. After pre-training is finished, all parameters in the DSC module are fixed.
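The attention pooling described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the scoring function `bilinear_score` (a tanh of a bilinear form between each hidden state and the averaged hidden state) is an assumed form, since the text only names the symbols $W_d$, $b_d$ and $h^d_{avg}$; all function names and the toy inputs are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vector(states: Sequence[Vector]) -> Vector:
    """h_avg: element-wise average of all hidden states."""
    n = len(states)
    return [sum(col) / n for col in zip(*states)]


def bilinear_score(h: Vector, W: Sequence[Vector], q: Vector, b: float) -> float:
    """ASSUMED scoring function: tanh(h^T W q + b)."""
    Wq = [sum(w_ij * q_j for w_ij, q_j in zip(row, q)) for row in W]
    return math.tanh(sum(h_i * v for h_i, v in zip(h, Wq)) + b)


def attention_pool(
    states: Sequence[Vector], W: Sequence[Vector], b: float
) -> Tuple[List[float], Vector]:
    """Return attention weights alpha and the pooled representation r^d."""
    q = mean_vector(states)  # the averaged hidden state acts as the query
    alpha = softmax([bilinear_score(h, W, q, b) for h in states])
    dim = len(states[0])
    r_d = [sum(a * h[k] for a, h in zip(alpha, states)) for k in range(dim)]
    return alpha, r_d


# Toy usage: three 2-dimensional "hidden states" and an identity weight matrix.
toy_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_alpha, toy_r = attention_pool(toy_states, W=[[1.0, 0.0], [0.0, 1.0]], b=0.0)
```

In this toy case the third state is the most similar to the average, so it receives the largest weight; the weights always sum to one because they come from a softmax.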

Base ASC Module
As shown in the left part of Figure 1, the base ASC module has a similar architecture to the DSC module. The difference is that the ASC task needs to model opinion target information. To obtain target-aware context representations, we additionally employ position embedding besides word embedding, which is an effective method of modeling position information (Lin et al., 2016;Gehring et al., 2016). Therefore, the base ASC module is an attention-based BiLSTM network enhanced with position embedding.
Specifically, given a sentence $s = \{w_1, w_2, ..., w_n\}$ and an opinion target $t = \{w_l, w_{l+1}, ..., w_r\}$ in $s$, we first map each word $w_i$ to its word embedding representation $\mathbf{w}_i$ by looking it up in the word embedding table.
To incorporate opinion target information with position embedding, we calculate the relative distance $l_i$ of each word $w_i$ to the opinion target $t$ (Equation 4). The distance index $l_i$ is mapped to the positional representation $\mathbf{p}_i$ by looking up a position embedding table $E_{pos} \in \mathbb{R}^{L \times d_p}$, where $L$ denotes the maximal position index and $d_p$ is the embedding dimension. Then we concatenate the word embedding representation $\mathbf{w}_i$ and the position embedding representation $\mathbf{p}_i$ as the representation $e_i$ of the word $w_i$, i.e., $e_i = [\mathbf{w}_i; \mathbf{p}_i]$, where $[\cdot; \cdot]$ denotes the vector concatenation operation. As in the DSC module, we employ a BiLSTM that receives the word representations $\{e_1, e_2, \cdots, e_n\}$ as input and generates target-aware context representations $\{h_1, h_2, \cdots, h_n\}$. Different from the attention part of the DSC module, we use the opinion target representation $t = \sum_{i=l}^{r} h_i / (r - l + 1)$ as the query in the ASC task to extract target-dependent sentiment clues, yielding the attention weights $\beta_i$ (Equation 6) and the target-dependent sentence representation $r^s = \sum_{i=1}^{n} \beta_i h_i$ (Equation 7), where $W_s$ and $b_s$ are the weight matrix and bias, respectively. Finally, the target-dependent sentence representation $r^s$ is used for detecting the sentiment polarity of the target $t$, and the base ASC module can be optimized by minimizing the cross-entropy loss between $\hat{y}_i$ and $y_i$, which are the predicted class distribution and the gold class distribution, respectively.
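The position-aware input construction above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: positions are 0-indexed here, the relative distance is assumed to be 0 inside the target and the distance to the nearest target boundary elsewhere (the exact definition in Equation 4 is not reproduced), and clipping distances to the size of the position table is a hypothetical choice. All function names are made up for the example.

```python
from typing import List, Sequence

Vector = List[float]


def relative_distances(n: int, l: int, r: int) -> List[int]:
    """ASSUMED form of l_i: 0 inside the target span [l, r] (0-indexed,
    inclusive), otherwise the distance to the nearest target boundary."""
    if not 0 <= l <= r < n:
        raise ValueError("target span must satisfy 0 <= l <= r < n")
    return [l - i if i < l else i - r if i > r else 0 for i in range(n)]


def build_inputs(
    word_vecs: Sequence[Vector], pos_table: Sequence[Vector], dists: Sequence[int]
) -> List[Vector]:
    """e_i = [w_i ; p_i]: concatenate word and position embeddings.
    Distances beyond the table are clipped to its last row (an assumption)."""
    max_index = len(pos_table) - 1
    return [
        list(w) + list(pos_table[min(d, max_index)])
        for w, d in zip(word_vecs, dists)
    ]


def target_query(hidden: Sequence[Vector], l: int, r: int) -> Vector:
    """Average the hidden states of the target words h_l..h_r (inclusive)."""
    span = hidden[l : r + 1]
    return [sum(col) / len(span) for col in zip(*span)]


# Toy usage with the example sentence from the introduction.
tokens = "The tastes are great , but the service is dreadful".split()
target_index = tokens.index("service")
dists = relative_distances(len(tokens), target_index, target_index)
```

For the single-word target "service", words next to it get distance 1 and the target itself gets distance 0, so nearby words such as "dreadful" map to position embeddings that mark them as close to the target.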

Attention Guidance
To leverage the attention knowledge of the DSC module, we simultaneously input the sentence $s$ into the base ASC module and the pre-trained DSC module when performing the ASC task, generating the attention weights $\beta_i$ in Equation 6 and $\alpha_i$ in Equation 2. As mentioned before, the attention mechanism of the ASC module cannot reach its full potential due to limited training data, which means that the attention weights $\beta_i$ may fail to capture target-relevant sentiment words. In contrast, sufficient DSC data enables the DSC module to extract sentiment words more accurately. Thus we propose the Attention Guidance approach to guide the learning of the attention weights $\beta_i$ with the help of $\alpha_i$. Nevertheless, there is a gap between the attention weights $\alpha_i$ and $\beta_i$. Since the DSC task only detects the overall sentiment of a review, the sentiment words captured by $\alpha_i$ are global and target-irrelevant. To bridge this gap, we use a heuristic method to transform the target-irrelevant attention weight $\alpha_i$ into a target-relevant weight $\delta_i$ based on $l_i$, the relative distance between the word and the target as in Equation 4. Under $\delta_i$, a word nearer to the target receives a higher attention weight, because a closer word is more likely to be a modifier of the target.
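Finally, we apply the Kullback-Leibler (KL) divergence to measure the difference between the attention distributions $\beta$ and $\delta$:
$$\mathrm{KL}(\delta \,\|\, \beta) = \sum_{i=1}^{n} \delta_i \log \frac{\delta_i}{\beta_i} = \sum_{i=1}^{n} \delta_i \log \delta_i - \sum_{i=1}^{n} \delta_i \log \beta_i. \qquad (13)$$
Since the pre-trained DSC module is fixed, the term $\sum_{i=1}^{n} \delta_i \log \delta_i$ in Equation 13 is invariant for the given sentence $s$ and opinion target $t$. Therefore, we can minimize the loss $L_a = -\sum_{i=1}^{n} \delta_i \log \beta_i$ to guide the ASC module to focus on target-relevant sentiment words. In the Attention Guidance approach, the final loss is defined in Equation 14, where $\lambda$ is the hyperparameter that controls the importance of $L_a$.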
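The relationship between the KL divergence and the guidance loss $L_a$ can be checked numerically with a small Python sketch. Several parts are assumptions made only for illustration: `distance_reweight` is a hypothetical stand-in for the heuristic that turns $\alpha_i$ into $\delta_i$ (it merely down-weights distant words and renormalizes), and `total_loss` assumes the final loss is the classification loss plus $\lambda L_a$; neither form is confirmed by the text.

```python
import math
from typing import List, Sequence

EPS = 1e-12  # guards against log(0)


def distance_reweight(alpha: Sequence[float], dists: Sequence[int]) -> List[float]:
    """HYPOTHETICAL heuristic: scale alpha_i by (1 - l_i / n) and renormalize,
    so that words closer to the target receive more weight."""
    n = len(alpha)
    raw = [a * (1.0 - d / n) for a, d in zip(alpha, dists)]
    total = sum(raw)
    return [x / total for x in raw]


def guidance_loss(delta: Sequence[float], beta: Sequence[float]) -> float:
    """L_a = -sum_i delta_i * log(beta_i)."""
    return -sum(d * math.log(b + EPS) for d, b in zip(delta, beta))


def kl_divergence(delta: Sequence[float], beta: Sequence[float]) -> float:
    """KL(delta || beta) = sum_i delta_i * log(delta_i / beta_i)."""
    return sum(
        d * math.log((d + EPS) / (b + EPS)) for d, b in zip(delta, beta) if d > 0
    )


def total_loss(
    class_loss: float, delta: Sequence[float], beta: Sequence[float], lam: float = 0.4
) -> float:
    """ASSUMED combination: classification loss + lambda * L_a."""
    return class_loss + lam * guidance_loss(delta, beta)


# Toy usage: uniform DSC attention over four words, target at position 2.
toy_delta = distance_reweight([0.25, 0.25, 0.25, 0.25], [2, 1, 0, 1])
toy_beta = [0.4, 0.3, 0.2, 0.1]
toy_gap = kl_divergence(toy_delta, toy_beta)
```

Because the entropy term of $\delta$ does not depend on the ASC parameters, `guidance_loss` and `kl_divergence` differ only by a constant for a fixed $\delta$, so minimizing either one moves $\beta$ towards $\delta$.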

Attention Fusion
Attention Guidance learns the attention ability of the DSC module through an auxiliary supervision signal. However, it cannot leverage the attention weights from the DSC module during the testing stage, and thus wastes part of the pre-trained knowledge. To make full use of the additional attention capacity, we further propose the Attention Fusion approach, which incorporates these weights directly. Specifically, we design a fusion gate $g$ to integrate the global attention weight $\alpha_i$ from the DSC module and the target-dependent attention weight $\beta_i$ from the ASC module, thereby generating a more comprehensive and accurate attention weight $\gamma_i$. In the gate, $\sigma$ denotes the sigmoid function and $W_g$ is the weight matrix. Finally, we replace $\beta_i$ in Equation 7 with the new attention weight $\gamma_i$ to obtain the target-dependent sentence representation $r^s$ for sentiment prediction.
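A gated fusion of two attention distributions might look like the following Python sketch. The exact gate is not reproduced in the text, so everything here is an assumption for illustration: the gate is taken to be one scalar per word, $g_i = \sigma(w_g \cdot x_i)$ for some per-word feature vector $x_i$, the fused weight is the convex combination $g_i \alpha_i + (1 - g_i) \beta_i$, and the result is renormalized to sum to one. The names `fuse_attention`, `features` and `w_g` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fuse_attention(
    alpha: Sequence[float],
    beta: Sequence[float],
    features: Sequence[Vector],
    w_g: Vector,
) -> Tuple[List[float], List[float]]:
    """ASSUMED fusion: gamma_i is proportional to g_i*alpha_i + (1-g_i)*beta_i,
    with a per-word scalar gate g_i = sigmoid(w_g . x_i)."""
    gates = [sigmoid(sum(w * x for w, x in zip(w_g, feat))) for feat in features]
    mixed = [g * a + (1.0 - g) * b for g, a, b in zip(gates, alpha, beta)]
    total = sum(mixed)
    gamma = [m / total for m in mixed]  # renormalize to a distribution
    return gamma, gates


# Toy usage: zero features give g_i = 0.5, i.e. a plain average of the two.
toy_gamma, toy_gates = fuse_attention(
    alpha=[0.7, 0.2, 0.1],
    beta=[0.1, 0.3, 0.6],
    features=[[0.0, 0.0]] * 3,
    w_g=[1.0, -1.0],
)
```

A gate close to 1 makes the fused weight follow the DSC attention, and a gate close to 0 makes it follow the ASC attention, which is the trade-off the fusion gate is meant to learn.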

Datasets and Metrics
We evaluate our model on two ASC benchmark datasets from SemEval 2014 Task 4 (Pontiki et al., 2014). They respectively contain reviews from Restaurant and Laptop domains. Following previous studies (Tang et al., 2016b;Chen et al., 2017;He et al., 2018), we remove samples with conflicting polarities in all datasets. The statistics of the ASC datasets are shown in Table 1.
To pre-train the DSC module, we employ two large-scale DSC datasets, namely Yelp Review and Amazon Review (Li et al., 2018a). The Yelp Review dataset is used to transfer attention knowledge for the Restaurant ASC dataset, and the Amazon Review dataset is used for the Laptop dataset. Table 2 shows their statistics. In this work, we adopt Accuracy and Macro-F1 score as the metrics to evaluate the performance of different methods on the ASC task.
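For reference, the two evaluation metrics can be computed as in the following plain-Python sketch. It uses the standard definitions (Macro-F1 is the unweighted mean of the per-class F1 scores); the function names and the toy labels are hypothetical, and this is not the evaluation script used in the experiments.

```python
from typing import Hashable, List, Optional, Sequence


def accuracy(gold: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Fraction of instances whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(
    gold: Sequence[Hashable],
    pred: Sequence[Hashable],
    labels: Optional[List[Hashable]] = None,
) -> float:
    """Unweighted mean of per-class F1 scores."""
    if labels is None:
        labels = sorted(set(gold) | set(pred))
    scores = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy usage with three polarity classes.
toy_gold = ["pos", "pos", "neg", "neu", "neg", "pos"]
toy_pred = ["pos", "neg", "neg", "neu", "pos", "pos"]
toy_acc = accuracy(toy_gold, toy_pred)   # 4 of 6 correct
toy_f1 = macro_f1(toy_gold, toy_pred)    # mean of 2/3 (pos), 1/2 (neg), 1 (neu)
```

Macro-F1 weights every class equally, so it is more sensitive than accuracy to errors on infrequent classes such as neutral.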

Experimental Settings
In our experiments, word embeddings are initialized with 300-dimensional GloVe vectors (Pennington et al., 2014). After initialization, the word vectors are fixed and not fine-tuned during the training stage. All the weight matrices and biases are initialized by sampling from the uniform distribution $U(-0.1, 0.1)$. The dimension of the LSTM cell hidden states is set to 300. We employ stochastic gradient descent (SGD) with momentum (Qian, 1999) to train models. The initial learning rate and momentum parameter are set to 0.1 and 0.9, respectively. In addition, we apply dropout with probability 0.5 on the embedding layer as a regularizer. The parameter λ in the Attention Guidance approach is set to 0.4. All hyper-parameters were tuned on 20% randomly held-out training data. Finally, we run each model five times and report the average result.
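The initialization and optimizer settings above can be made concrete with a short Python sketch. It shows uniform initialization in $U(-0.1, 0.1)$ and one common formulation of SGD with momentum (velocity $v \leftarrow \mu v - \eta g$, parameters $w \leftarrow w + v$); deep learning frameworks may use a slightly different but closely related update, and the function names here are hypothetical rather than taken from the authors' code.

```python
import random
from typing import List, Optional, Sequence, Tuple


def uniform_init(
    rows: int, cols: int, bound: float = 0.1, seed: Optional[int] = None
) -> List[List[float]]:
    """Sample a rows x cols matrix from U(-bound, bound)."""
    rng = random.Random(seed)
    return [[rng.uniform(-bound, bound) for _ in range(cols)] for _ in range(rows)]


def momentum_step(
    params: Sequence[float],
    grads: Sequence[float],
    velocity: Sequence[float],
    lr: float = 0.1,
    momentum: float = 0.9,
) -> Tuple[List[float], List[float]]:
    """One classical momentum update: v <- mu*v - lr*g, then w <- w + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


# Toy usage: minimize f(w) = w^2 (gradient 2w) with lr = 0.1, momentum = 0.9.
w, v = [1.0], [0.0]
for _ in range(300):
    w, v = momentum_step(w, [2.0 * w[0]], v)
```

On this toy quadratic the iterate oscillates around the minimum while shrinking towards zero, which is the typical behaviour of momentum with these settings.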

Compared Methods
We divide the compared methods into two groups according to whether they use transferred knowledge.
(I). The first group contains some classic methods for the ASC task:
Majority assigns each instance in the test set the most frequent sentiment label in the training set.
Feature-based SVM (Kiritchenko et al., 2014) is the top system of SemEval 2014 Task 4. It uses n-gram features, parse features and lexicon features to train an SVM classifier.
TD-LSTM (Tang et al., 2016a) applies two LSTM networks to model the left context and right context of opinion target respectively, then concatenates their last hidden states for sentiment prediction.
ATAE-LSTM (Wang et al., 2016) concatenates the word embedding and target embedding as the input of LSTM, then employs the attention mechanism to capture target-dependent sentiment information.
IAN (Ma et al., 2017) proposes the interactive attention to interactively learn representations of the context and target. The two representations are then concatenated for prediction.
MemNet (Tang et al., 2016b) uses multi-hop attention on the word embeddings to generate the target-dependent sentence representation.
RAM (Chen et al., 2017) works similarly to MemNet. It employs a BiLSTM to build the memory and applies GRU-based multi-hop attention.
IARM (Majumder et al., 2018) incorporates information related to neighboring targets for ASC by using memory networks.
MGAN (Fan et al., 2018) proposes a fine-grained attention mechanism to capture the word-level interaction between target and context, then combines it with coarse-grained attention for ASC.
GCAE (Xue and Li, 2018) uses a convolutional neural network (CNN) with gating mechanisms to perform the ASC task.
TNet (Li et al., 2018b) proposes a target-specific transformation component to integrate target information into the word representations.
(II). In addition, we compare with two existing methods that use knowledge transferred from large-scale DSC data to facilitate the ASC task:
PRET+MULT (He et al., 2018) shares shallow embedding and LSTM layers between the ASC model and the DSC model through multi-task learning.
TransCap (Chen and Qian, 2019) employs capsule network to share the bottom features between the ASC task and the DSC task.

Main Results and Analysis
The main results are shown in Table 3. We classify the results into three groups: the first lists the classic methods for the ASC task, the second presents two existing transfer-based methods, and the last contains our base ASC model and its enhanced versions with transferred attention knowledge. We use ATN-AG and ATN-AF to denote ATN using Attention Guidance and Attention Fusion, respectively.

| Method | Restaurant Acc (%) | Restaurant Macro-F1 (%) | Laptop Acc (%) | Laptop Macro-F1 (%) |
|---|---|---|---|---|
| Feature-based SVM (Kiritchenko et al., 2014) | 80.16 | N/A | 70.49 | N/A |
| ATAE-LSTM (Wang et al., 2016) | 77.20 | N/A | 68.70 | N/A |
| TD-LSTM (Tang et al., 2016a) | 78.00 | 66.73 | 71.83 | 68.43 |
| IAN (Ma et al., 2017) | 78.60 | N/A | 72.10 | N/A |
| MemNet (Tang et al., 2016b) | 80.32 | N/A | 72.37 | N/A |
| RAM (Chen et al., 2017) | 80.23 | 70.80 | 74.49 | 71.35 |
| IARM (Majumder et al., 2018) | 80.00 | N/A | 73.80 | N/A |
| MGAN (Fan et al., 2018) | 81... | | | |

The Feature-based SVM obtains competitive results on the restaurant dataset but performs poorly on the laptop dataset. This may be because the performance of simple feature-based methods relies heavily on the quality of hand-crafted features. IAN achieves better performance than TD-LSTM and ATAE-LSTM by using the interactive attention mechanism to learn the representations of the context and the opinion target. By combining fine-grained and coarse-grained attention mechanisms, MGAN achieves the best performance among all pure attention-based models. Among the memory-based methods, RAM outperforms MemNet and IARM on the laptop dataset, which validates the effectiveness of multi-hop attention based on a recurrent network. GCAE performs poorly compared with other neural methods, as CNNs are not good at capturing long-term dependencies between context words. TNet achieves state-of-the-art performance by designing a target-specific transformation mechanism between the LSTM and the CNN.
PRET+MULT and TransCap implicitly transfer knowledge from large-scale DSC data to the ASC task through parameter sharing and multi-task learning. They show superiority over some methods that do not transfer knowledge. For example, the base model of PRET+MULT is an attention-based LSTM similar to ATAE-LSTM. We can observe that PRET+MULT outperforms ATAE-LSTM significantly, achieving accuracy improvements of 2.78% and 5.44% on the restaurant and laptop datasets, respectively. TransCap obtains better results than PRET+MULT, which verifies the effectiveness of the capsule network for capturing shared features. Our base ASC model, an attention-based BiLSTM enhanced with position embedding, performs better than some attention-based models, such as ATAE-LSTM and IAN. This result indicates that position embedding is beneficial for modeling target information in the ASC task. On this basis, our attention transfer models ATN-AG and ATN-AF achieve accuracy improvements of about 1% and 2%, respectively, on the restaurant dataset, and of over 2.8% on the laptop dataset. In addition, they clearly surpass the two existing methods that use transferred knowledge, i.e., PRET+MULT and TransCap. These comparisons demonstrate the effectiveness of explicitly transferring attention knowledge from resource-rich DSC data to the ASC task. Compared with ATN-AG, ATN-AF achieves better performance on the restaurant dataset. This is reasonable because ATN-AG cannot leverage the attention weights of the DSC module during the testing stage. Nevertheless, ATN-AG still obtains comparable results on the laptop dataset and has a faster inference speed than ATN-AF.

Effect of DSC Data Size
To investigate the effect of DSC data size on our approaches, we vary the percentage of DSC data used from 0% to 100% and report the results of ATN-AG and ATN-AF. The endpoints 0% and 100% correspond to using no DSC data and the complete DSC dataset, respectively. The results are shown in Figure 2. We can observe that both ATN-AG and ATN-AF improve steadily on the two datasets as the DSC data size increases. This indicates that the ASC task indeed benefits from the attention knowledge transferred from the pre-trained DSC module. The consistent and stable improvements show the robustness of our approaches.

Effect of Hyper-parameter λ
To analyze the effect of the hyper-parameter λ in Equation 14 on ATN-AG, we vary it over [0, 1] in steps of 0.1. Figure 3 shows the performance of ATN-AG with different values of λ on the restaurant and laptop datasets.
We can see that the curves on the two datasets have an overall upward trend when λ < 0.4, but become flat or decline once λ > 0.4. In the upward part, the attention knowledge from the DSC module is a useful guidance signal that helps the ASC module focus on sentiment words more accurately, and thus improves the performance of ASC. Once the weight λ exceeds 0.4, the transferred attention knowledge begins to dominate the attention process, so the ASC module loses control of its own attention and performs worse. Therefore, we set λ to 0.4 on both datasets.

Case Study
In the ATN model, we propose the Attention Guidance and Attention Fusion approaches to help the ASC module capture sentiment clues more accurately. To verify this, we analyze dozens of instances from the test set. Compared with the base ASC model, we find that our attention transfer methods can deal with low-frequency sentiment words and complex sentiment patterns such as negation. Table 4 shows the attention visualizations of two examples and the corresponding sentiment predictions under the base model, ATN-AG and ATN-AF. Note that a darker color means a higher attention weight.
In the first example, the base ASC model mainly focuses on the adverb "mostly", but fails to capture the critical sentiment clue "reliable". According to the statistics, the word "reliable" appears only five times in the training set. This indicates that the base model is not good at catching low-frequency sentiment words, and thus makes a wrong sentiment prediction. In contrast, the enhanced models ATN-AG and ATN-AF with transferred attention knowledge both successfully capture the informative word "reliable" and give the right predictions.
From the second example, we can see that the base ASC model mainly focuses on the word "enjoy" rather than the sentiment negator "not". It is hard for the base model to learn negation from insufficient labeled data. With the help of the external attention knowledge, our approaches ATN-AG and ATN-AF pay more attention to the negator "not" and make correct sentiment predictions.
The above observations show that our approaches indeed improve the low-resource task ASC with the transferred attention knowledge and retain good interpretability.

Aspect-level Sentiment Classification
Early works adopt supervised learning and are devoted to designing effective features for the ASC task, such as n-gram features (Kiritchenko et al., 2014) and sentiment lexicons (Vo and Zhang, 2015). The performance of these methods heavily depends on labor-intensive feature engineering. With the development of deep learning, Tang et al. (2016a) use two Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) networks to model the left context and right context of the given opinion target, respectively. However, this approach cannot capture the association between the context and the opinion target. To address the issue, recent works employ the attention mechanism to capture target-dependent sentiment context and achieve very promising results (Wang et al., 2016; Ma et al., 2017; Fan et al., 2018). Instead of single attention, some works propose multi-hop attention based on memory networks (Sukhbaatar et al., 2015) to detect more powerful sentiment clues (Tang et al., 2016b; Chen et al., 2017; Majumder et al., 2018).
Although attention-based models show potential for ASC, they rely heavily on a data-driven attention mechanism. Unfortunately, public ASC datasets are all small-scale because of the complexity of annotation. Insufficient labeled data ultimately limits the effectiveness of the attention mechanism for the ASC task. Different from the above methods, we improve the attention capacity of the ASC model in this work by transferring substantial attention knowledge from a DSC model pre-trained with resource-rich document-level sentiment classification data.

Transfer Learning
Transfer learning aims to extract knowledge from one or more source tasks and then apply it to a target task. Neural transfer learning has proven effective for image recognition (Donahue et al., 2014) and natural language processing tasks (Mou et al., 2016; Dong and De Melo, 2018; Wu et al., 2020). He et al. (2018) are the first to transfer knowledge from document-level review data to improve the ASC task by sharing embedding and LSTM layers. Chen and Qian (2019) employ a capsule network to share bottom features between the ASC task and the DSC task. In this work, we aim to explicitly transfer attention knowledge from the DSC model to improve the effectiveness of the attention mechanism for the ASC task. In contrast to the two existing works, our proposed approaches show better performance and good interpretability.

Conclusion
Insufficient labeled data limits the effectiveness of attention-based models for the ASC task. In this paper, we propose a novel attention transfer framework with two different attention transfer methods. Both exploit attention knowledge from resource-rich document-level sentiment classification corpora to enhance the attention process of resource-poor aspect-level sentiment classification, thereby improving the performance of ASC. Experimental results indicate that our approaches outperform state-of-the-art works. Further analysis validates the effectiveness and benefits of transferring attention knowledge from DSC data to the ASC task.