A Novel Aspect-Guided Deep Transition Model for Aspect Based Sentiment Analysis

Aspect based sentiment analysis (ABSA) aims to identify the sentiment polarity towards the given aspect in a sentence, while previous models typically exploit an aspect-independent (weakly associative) encoder for sentence representation generation. In this paper, we propose a novel Aspect-Guided Deep Transition model, named AGDT, which utilizes the given aspect to guide the sentence encoding from scratch with the specially-designed deep transition architecture. Furthermore, an aspect-oriented objective is designed to enforce AGDT to reconstruct the given aspect with the generated sentence representation. In doing so, our AGDT can accurately generate aspect-specific sentence representation, and thus conduct more accurate sentiment predictions. Experimental results on multiple SemEval datasets demonstrate the effectiveness of our proposed approach, which significantly outperforms the best reported results with the same setting.


Introduction
Aspect based sentiment analysis (ABSA) is a fine-grained task in sentiment analysis, which can provide important sentiment information for other natural language processing (NLP) tasks. There are two different subtasks in ABSA, namely, aspect-category sentiment analysis and aspect-term sentiment analysis (Pontiki et al., 2014; Xue and Li, 2018). Aspect-category sentiment analysis aims at predicting the sentiment polarity towards the given aspect, which belongs to one of several predefined categories and may not appear in the sentence. For instance, in Table 1, aspect-category sentiment analysis predicts the sentiment polarity towards the aspect "food", which does not appear in the sentence. By contrast, the goal of aspect-term sentiment analysis is to predict the sentiment polarity over the aspect term, which is a subsequence of the sentence. For instance, aspect-term sentiment analysis predicts the sentiment polarity towards the aspect term "The appetizers", which is a subsequence of the sentence. Additionally, the number of distinct aspect terms is more than one thousand in the training corpus.

Table 1
Sentence: The appetizers are ok, but the service is slow.
Aspect-Category: food | service
Aspect-Term: The appetizers | service
Sentiment Polarity: Neutral | Negative
As shown in Table 1, sentiment polarity may be different when different aspects are considered. Thus, the given aspect (term) is crucial to ABSA tasks (Jiang et al., 2011; Ma et al., 2017; Xing et al., 2019; Liang et al., 2019). Besides, Li et al. (2018a) show that not all words of a sentence are useful for the sentiment prediction towards a given aspect (term). For instance, when the given aspect is "service", the words "appetizers" and "ok" are irrelevant for the sentiment prediction. Therefore, an aspect-independent (weakly associative) encoder may encode such background words (e.g., "appetizers" and "ok") into the final representation, which may lead to an incorrect prediction.
Numerous existing models (Tang et al., 2016b; Tay et al., 2017; Fan et al., 2018; Xue and Li, 2018) typically utilize an aspect-independent encoder to generate the sentence representation, and then apply the attention mechanism (Luong et al., 2015) or gating mechanism to conduct feature selection and extraction. As a result, feature selection and extraction may be based on noisy representations. In addition, some models (Tang et al., 2016a; Wang et al., 2016; Majumder et al., 2018) simply concatenate the aspect embedding with each word embedding of the sentence, and then leverage conventional Long Short-Term Memories (LSTMs) (Hochreiter and Schmidhuber, 1997) to generate the sentence representation. However, this is insufficient to exploit the given aspect and conduct potentially complex feature selection and extraction.
To address this issue, we investigate a novel architecture to enhance the capability of feature selection and extraction with the guidance of the given aspect from scratch. Based on the deep transition Gated Recurrent Unit (GRU) (Pascanu et al., 2014; Miceli Barone et al., 2017; Meng and Zhang, 2019), an aspect-guided GRU encoder is proposed, which utilizes the given aspect to guide the sentence encoding procedure at the very beginning stage. In particular, we specially design an aspect-gate for the deep transition GRU to control the information flow of each token input, with the aim of guiding feature selection and extraction from scratch, i.e., during sentence representation generation. Furthermore, we design an aspect-oriented objective to enforce our model to reconstruct the given aspect from the sentence representation generated by the aspect-guided encoder. We name this Aspect-Guided Deep Transition model AGDT. With all the above contributions, our AGDT can accurately generate an aspect-specific representation for a sentence, and thus conduct more accurate sentiment predictions towards the given aspect.
We evaluate the AGDT on multiple datasets of two subtasks in ABSA. Experimental results demonstrate the effectiveness of our proposed approach. The AGDT significantly surpasses existing models with the same setting and achieves state-of-the-art performance among the models without using additional features (e.g., BERT (Devlin et al., 2018)). Moreover, we also provide empirical and visualization analysis to reveal the advantages of our model. Our contributions can be summarized as follows:
• We propose an aspect-guided encoder, which utilizes the given aspect to guide the encoding of a sentence from scratch, in order to conduct the aspect-specific feature selection and extraction at the very beginning stage.
• We propose an aspect-reconstruction approach to further guarantee that the aspect-specific information has been fully embedded into the sentence representation.
• Our AGDT substantially outperforms previous systems with the same setting, and achieves state-of-the-art results on benchmark datasets compared to those models without leveraging additional features (e.g., BERT).

Model Description
As shown in Figure 1, the AGDT model mainly consists of three parts: aspect-guided encoder, aspect-reconstruction and aspect concatenated embedding. The aspect-guided encoder is specially designed to guide the encoding of a sentence from scratch for conducting the aspect-specific feature selection and extraction at the very beginning stage. The aspect-reconstruction aims to guarantee that the aspect-specific information has been fully embedded in the sentence representation for more accurate predictions. The aspect concatenated embedding part is used to concatenate the aspect embedding and the generated sentence representation so as to make the final prediction.

Aspect-Guided Encoder
The aspect-guided encoder is the core module of AGDT, which consists of two key components: the Aspect-guided GRU and the Transition GRU.
A-GRU: The Aspect-guided GRU (A-GRU) is a specially-designed unit for the ABSA tasks, which is an extension of the L-GRU proposed by Meng and Zhang (2019). In particular, we design an aspect-gate to select aspect-specific representations through controlling the transformation scale of token embeddings at each time step.

Figure 1: The overview of AGDT. The bottom right dark node (above the aspect embedding) is the aspect gate, and the other dark nodes (⊗) denote element-wise multiplication of the input token and the aspect gate. The aspect-guided encoder consists of an L-GRU (the circle frames fused with a small circle above) at the bottom, followed by several T-GRUs (the circle frames) from bottom to top.

At time step t, the hidden state h_t is computed as in Eq. (1), where ⊙ represents the element-wise product; z_t is the update gate; and h̃_t is the candidate activation, which is computed as in Eq. (2), where g_t denotes the aspect-gate; x_t represents the input word embedding at time step t; r_t is the reset gate; H_1(x_t) and H_2(x_t) are linear transformations of the input x_t; and l_t is the linear transformation gate for x_t (Meng and Zhang, 2019). r_t, z_t, l_t, g_t, H_1(x_t) and H_2(x_t) are computed as in Eqs. (3) to (8), where "a" denotes the embedding of the given aspect, which is the same at each time step. The update gate z_t and reset gate r_t are the same as those in the conventional GRU. In Eqs. (2) to (8), the aspect-gate g_t controls both the non-linear and linear transformations of the input x_t under the guidance of the given aspect at each time step. Besides, we also exploit a linear transformation gate l_t to control the linear transformation of the input, according to the current input x_t and the previous hidden state h_{t−1}, which has been proved powerful in the deep transition architecture (Meng and Zhang, 2019).
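One form of Eqs. (1) to (8) that is consistent with the definitions above can be sketched as follows. This is a hedged reading rather than a verbatim statement of the model: the weight names (W_{xr}, W_{hh}, W_a and so on), the ReLU activation and the dependence on h_{t−1} in the aspect-gate g_t, and the exact placement of g_t inside the candidate activation are all assumptions.

```latex
\begin{align}
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{1}\\
\tilde{h}_t &= \tanh\bigl(g_t \odot H_1(x_t) + r_t \odot (W_{hh} h_{t-1})\bigr)
              + l_t \odot g_t \odot H_2(x_t) \tag{2}\\
r_t &= \sigma(W_{xr} x_t + W_{hr} h_{t-1}) \tag{3}\\
z_t &= \sigma(W_{xz} x_t + W_{hz} h_{t-1}) \tag{4}\\
l_t &= \sigma(W_{xl} x_t + W_{hl} h_{t-1}) \tag{5}\\
g_t &= \mathrm{relu}(W_{a} a + W_{hg} h_{t-1}) \tag{6}\\ % assumed activation and inputs
H_1(x_t) &= W_1 x_t \tag{7}\\
H_2(x_t) &= W_2 x_t \tag{8}
\end{align}
```

Under this reading, g_t scales the input inside the tanh (the non-linear path) and again in the additive term gated by l_t (the linear path), which matches the statement that the aspect-gate controls both transformations of x_t.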
As a consequence, A-GRU can control both the non-linear and the linear transformation of the input x_t at each time step, with the guidance of the given aspect, i.e., A-GRU can guide the encoding of aspect-specific features and block the aspect-irrelevant information at the very beginning stage.
T-GRU: The Transition GRU (T-GRU) (Pascanu et al., 2014) is a crucial component of the deep transition block. It is a special case of the GRU with only the "state" as an input, namely its input embedding is a zero embedding. As in Figure 1, a deep transition block consists of an A-GRU followed by several T-GRUs at each time step. For the current time step t, the output of one A-GRU/T-GRU is fed into the next T-GRU as the input. The output of the last T-GRU at time step t is fed into the A-GRU at time step t + 1. For a T-GRU, each hidden state at time step t and transition depth i is computed as in Eqs. (9) to (12).
The AGDT encoder is based on deep transition cells, where each cell is composed of one A-GRU at the bottom, followed by several T-GRUs. Such an AGDT model can encode the sentence representation with the guidance of aspect information by utilizing the specially designed architecture.
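The deep transition cell described above (one A-GRU followed by several T-GRUs per time step) can be illustrated with a minimal NumPy sketch. It is not the authors' TensorFlow implementation. The parameter names, the ReLU aspect-gate that reads the aspect embedding and the previous state, the placement of the gate in the candidate activation, the shared dimensionality of word and aspect embeddings, and the argument `num_t_grus` are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(d_in, d_h, num_t_grus, rng):
    """Randomly initialise weights. All names here are hypothetical."""
    def mat(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))

    a_gru = {
        "W_xr": mat(d_h, d_in), "W_hr": mat(d_h, d_h),  # reset gate
        "W_xz": mat(d_h, d_in), "W_hz": mat(d_h, d_h),  # update gate
        "W_xl": mat(d_h, d_in), "W_hl": mat(d_h, d_h),  # linear transformation gate
        "W_a": mat(d_h, d_in), "W_hg": mat(d_h, d_h),   # aspect gate (assumed form)
        "W_1": mat(d_h, d_in), "W_2": mat(d_h, d_in),   # H_1 and H_2
        "W_hh": mat(d_h, d_h),
    }
    t_grus = [
        {"W_r": mat(d_h, d_h), "W_z": mat(d_h, d_h), "W_h": mat(d_h, d_h)}
        for _ in range(num_t_grus)
    ]
    return a_gru, t_grus


def a_gru_step(p, x_t, a, h_prev):
    """One A-GRU step: the aspect gate g scales both paths of the input."""
    r = sigmoid(p["W_xr"] @ x_t + p["W_hr"] @ h_prev)
    z = sigmoid(p["W_xz"] @ x_t + p["W_hz"] @ h_prev)
    l = sigmoid(p["W_xl"] @ x_t + p["W_hl"] @ h_prev)
    g = np.maximum(0.0, p["W_a"] @ a + p["W_hg"] @ h_prev)  # assumed ReLU gate
    h_tilde = (np.tanh(g * (p["W_1"] @ x_t) + r * (p["W_hh"] @ h_prev))
               + l * g * (p["W_2"] @ x_t))
    return (1.0 - z) * h_prev + z * h_tilde


def t_gru_step(p, h_in):
    """One T-GRU step: a GRU whose only input is the incoming state."""
    r = sigmoid(p["W_r"] @ h_in)
    z = sigmoid(p["W_z"] @ h_in)
    h_tilde = np.tanh(r * (p["W_h"] @ h_in))
    return (1.0 - z) * h_in + z * h_tilde


def encode(a_gru, t_grus, xs, a):
    """Run the deep transition cell over a sentence; returns one state per token."""
    h = np.zeros(a_gru["W_hh"].shape[0])
    states = []
    for x_t in xs:
        h = a_gru_step(a_gru, x_t, a, h)   # aspect-guided bottom unit
        for p in t_grus:                   # stacked transition units
            h = t_gru_step(p, h)
        states.append(h)                   # last T-GRU output feeds step t + 1
    return np.stack(states)
```

Calling `encode(a_gru, t_grus, xs, a)` with `xs` of shape (sentence length, d_in) yields an array of shape (sentence length, d_h). In this sketch, a gate that is zero everywhere removes the token input from both the tanh path and the linear path, which is the blocking behaviour attributed to the aspect-gate.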

Aspect-Reconstruction
We propose an aspect-reconstruction approach to guarantee that the aspect-specific information has been fully embedded in the sentence representation. In particular, we devise two objectives for the two subtasks in ABSA, respectively. In the aspect-category sentiment analysis datasets, there are only several predefined aspect categories, while in the aspect-term sentiment analysis datasets, the number of distinct terms is more than one thousand. In a real-life scenario, the number of terms is unbounded, while the words that make up terms are limited. Thus we design different loss functions for these two scenarios.
For the aspect-category sentiment analysis task, we aim to reconstruct the aspect according to the aspect-specific representation. It is a multi-class problem. We take the softmax cross-entropy as the loss function L_c (Eq. 13), where C1 is the number of predefined aspects in the training example; y^c_i is the ground-truth and p^c_i is the estimated probability of an aspect. For the aspect-term sentiment analysis task, we intend to reconstruct the aspect term (which may consist of multiple words) according to the aspect-specific representation. It is a multi-label problem and thus the sigmoid cross-entropy L_t (Eq. 14) is applied, where C2 denotes the number of words that constitute all terms in the training example, y^t_i is the ground-truth and p^t_i represents the predicted value of a word.
Our aspect-oriented objective consists of L_c and L_t, which guarantee that the aspect-specific information has been fully embedded into the sentence representation.
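For reference, the textbook forms of the two objectives named above (softmax cross-entropy for categories, sigmoid cross-entropy for term words) are given below. They are written with the symbols defined in this section, and any normalization or averaging used in the actual model is not specified here, so the sums are an assumption.

```latex
\begin{align}
L_c &= -\sum_{i=1}^{C1} y_i^{c} \log\bigl(p_i^{c}\bigr) \tag{13}\\
L_t &= -\sum_{i=1}^{C2} \Bigl[\, y_i^{t} \log\bigl(p_i^{t}\bigr)
        + \bigl(1 - y_i^{t}\bigr) \log\bigl(1 - p_i^{t}\bigr) \Bigr] \tag{14}
\end{align}
```

In L_c exactly one y^c_i is 1 (a single category per instance), whereas in L_t several y^t_i can be 1 at once, one for each word of a multi-word term.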

Training Objective
The final loss function is given in Eq. (15), where the underlined part denotes the conventional loss function; C is the number of sentiment labels; y_i is the ground-truth and p_i represents the estimated probability of the sentiment label; L is the aspect-oriented objective, where Eq. 13 is for the aspect-category sentiment analysis task and Eq. 14 is for the aspect-term sentiment analysis task; and λ is the weight of L. As shown in Figure 1, we employ the aspect-reconstruction approach to reconstruct the aspect (term), where "softmax" is for the aspect-category sentiment analysis task and "sigmoid" is for the aspect-term sentiment analysis task. Additionally, we concatenate the aspect embedding with the aspect-guided sentence representation to predict the sentiment polarity. Under that loss function (Eq. 15), the AGDT can produce aspect-specific sentence representations.
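As an illustration of how the conventional sentiment loss and the weighted aspect-oriented objective combine, the following NumPy sketch computes a per-example loss of the form "sentiment cross-entropy + λ · L". The function names, the `task` switch, and the small constant `EPS` are hypothetical, and the per-example (unaveraged) sum is an assumption.

```python
import numpy as np

EPS = 1e-12  # numerical guard inside the logarithms (illustrative choice)


def softmax(logits):
    shifted = logits - np.max(logits)  # subtract the max for stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax_cross_entropy(logits, one_hot):
    """Multi-class loss: used for sentiment labels and aspect categories."""
    p = softmax(np.asarray(logits, dtype=float))
    return float(-np.sum(np.asarray(one_hot) * np.log(p + EPS)))


def sigmoid_cross_entropy(logits, multi_hot):
    """Multi-label loss: used to reconstruct the words of an aspect term."""
    p = sigmoid(np.asarray(logits, dtype=float))
    y = np.asarray(multi_hot, dtype=float)
    return float(-np.sum(y * np.log(p + EPS) + (1.0 - y) * np.log(1.0 - p + EPS)))


def total_loss(sent_logits, sent_label, aspect_logits, aspect_target, lam, task):
    """Sentiment loss plus lam times the aspect-oriented objective."""
    sentiment = softmax_cross_entropy(sent_logits, sent_label)
    if task == "category":
        recon = softmax_cross_entropy(aspect_logits, aspect_target)
    elif task == "term":
        recon = sigmoid_cross_entropy(aspect_logits, aspect_target)
    else:
        raise ValueError("task must be 'category' or 'term'")
    return sentiment + lam * recon
```

Setting `lam=0` recovers the conventional sentiment loss alone, which makes it easy to see the contribution of the reconstruction term when sweeping λ.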

Datasets and Metrics
Data Preparation. We conduct experiments on two datasets of the aspect-category based task and two datasets of the aspect-term based task. For these four datasets, we name the full dataset "DS". In each "DS", there are some sentences like the example in Table 1, containing different sentiment labels, each of which is associated with an aspect (term). For instance, Table 1 shows the customer's different attitudes towards two aspects: "food" ("The appetizers") and "service". In order to measure whether a model can detect different sentiment polarities in one sentence towards different aspects, we extract a hard dataset from each "DS", named "HDS", in which every sentence has different sentiment labels associated with different aspects. When processing an original sentence s that has multiple aspects a_1, a_2, ..., a_n and corresponding sentiment labels l_1, l_2, ..., l_n (n is the number of aspects or terms in the sentence), the sentence is expanded into (s, a_1, l_1), (s, a_2, l_2), ..., (s, a_n, l_n) in each dataset (Ruder et al., 2016b,a; Xue and Li, 2018), i.e., there will be n duplicated sentences associated with different aspects and labels.

Table 3: Statistics of datasets for the aspect-term sentiment analysis task. 'NC' indicates No "Conflict" label, i.e., the dataset with the "conflict" label removed, which is prepared for the three-class experiment.
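The expansion into (s, a_i, l_i) triples and the extraction of a hard subset can be sketched in a few lines of Python. The function names are hypothetical, and the hard-subset rule used here (keep a sentence when its aspects carry at least two distinct sentiment labels) is one reading of the description above, not a confirmed specification.

```python
from collections import defaultdict


def expand(annotated):
    """Turn (sentence, [(aspect, label), ...]) records into (s, a, l) triples."""
    return [(s, a, l) for s, pairs in annotated for a, l in pairs]


def hard_subset(triples):
    """Keep triples whose sentence has at least two distinct labels (assumed rule)."""
    labels_by_sentence = defaultdict(set)
    for s, _, l in triples:
        labels_by_sentence[s].add(l)
    return [t for t in triples if len(labels_by_sentence[t[0]]) > 1]


# The first record is the Table 1 example; the second is a made-up sentence
# with a single polarity, included only to show what gets filtered out.
data = [
    ("The appetizers are ok, but the service is slow.",
     [("food", "neutral"), ("service", "negative")]),
    ("The pasta and the wine were great.",
     [("food", "positive"), ("drinks", "positive")]),
]
ds = expand(data)
hds = hard_subset(ds)
```

Here `ds` holds four triples (two per sentence), while `hds` keeps only the two triples of the first sentence, since the second sentence has the same label for both aspects.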
Aspect-Category Sentiment Analysis. For comparison, we follow Xue and Li (2018) and use the restaurant reviews dataset of SemEval 2014 ("restaurant-14") Task 4 (Pontiki et al., 2014) to evaluate our AGDT model. The dataset contains five predefined aspects and four sentiment labels. A large dataset ("restaurant-large") involves restaurant reviews of three years, i.e., 2014 ∼ 2016 (Pontiki et al., 2014). There are eight predefined aspects and three labels in that dataset. When creating the "restaurant-large" dataset, we follow the same procedure as in Xue and Li (2018). Statistics of datasets are shown in Table 2.
Aspect-Term Sentiment Analysis. We use the restaurant and laptop review datasets of SemEval 2014 Task 4 (Pontiki et al., 2014) to evaluate our model. Both datasets contain four sentiment labels. Meanwhile, we also conduct a three-class experiment, in order to compare with some work (Wang et al., 2016; Ma et al., 2017; Li et al., 2018a) which removed "conflict" labels. Statistics of both datasets are shown in Table 3.
Metrics. The evaluation metric is accuracy. All instances are shown in Table 2 and Table 3. Each experiment is repeated five times, and the mean and the standard deviation are reported.

Implementation Details
We use the pre-trained 300d GloVe embeddings (Pennington et al., 2014) to initialize word embeddings, which are fixed in all models. [...] (2017), we take the averaged word embedding as the aspect representation for multi-word aspect terms. The transition depth of the deep transition model is 4 (see Section 3.4). The hidden size is set to 300. We set the dropout rate (Srivastava et al., 2014) to 0.5 for input token embeddings and 0.3 for hidden states. All models are optimized using the Adam optimizer (Kingma and Ba, 2014) with gradient clipping set to 5 (Pascanu et al., 2012). The initial learning rate is set to 0.01 and the batch size is set to 4096 at the token level. The weight of the reconstruction loss λ in Eq. 15 is fine-tuned (see Section 3.4) and respectively set to 0.4, 0.4, 0.2 and 0.5 for the four datasets. The neural model is implemented in TensorFlow (Abadi et al., 2016) and all computations are done on an NVIDIA Tesla M40 GPU.
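The averaged aspect representation for multi-word terms can be sketched as follows. The lookup table is a toy dictionary with 3-dimensional vectors rather than real 300d GloVe vectors, and both the lowercasing and the zero vector returned for terms with no known word are assumptions, since the handling of out-of-vocabulary words is not stated here.

```python
import numpy as np


def aspect_embedding(term, table, dim):
    """Average the word vectors of a (possibly multi-word) aspect term."""
    tokens = term.lower().split()
    vectors = [table[tok] for tok in tokens if tok in table]
    if not vectors:
        return np.zeros(dim)  # assumed fallback when no word is known
    return np.mean(vectors, axis=0)


# Toy vectors for illustration only; not taken from any real embedding file.
toy_table = {
    "the": np.array([0.0, 2.0, 4.0]),
    "appetizers": np.array([2.0, 0.0, 0.0]),
    "service": np.array([1.0, 1.0, 1.0]),
}
multi = aspect_embedding("The appetizers", toy_table, dim=3)
single = aspect_embedding("service", toy_table, dim=3)
```

A single-word aspect such as "service" is returned unchanged, so category and term aspects can share the same interface.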

Baselines
To comprehensively evaluate our AGDT, we compare it with several competitive models.
ATAE-LSTM. It is an attention-based LSTM model. It appends the given aspect embedding to each word embedding, and the concatenated embedding is then taken as the input of the LSTM. The aspect embedding is appended again to the output of the LSTM. Furthermore, attention is applied to extract features for final predictions.
CNN. This model focuses on extracting n-gram features to generate the sentence representation for the sentiment classification.
TD-LSTM. This model uses two LSTMs to capture the left and right context of the term to generate target-dependent representations for the sentiment prediction.
IAN. This model employs two LSTMs and interactive attention mechanism to learn representations of the sentence and the aspect, and concatenates them for the sentiment prediction.
RAM. This model applies multiple attentions and memory networks to produce the sentence representation.
GCAE. It uses CNNs to extract features and then employs two Gated Tanh-Relu units to selectively output the sentiment information flow towards the aspect for predicting sentiment labels.

Aspect-Category Sentiment Analysis Task
We present the overall performance of our model and baseline models in Table 4. Results show that our AGDT outperforms all baseline models on both the "restaurant-14" and "restaurant-large" datasets. ATAE-LSTM employs an aspect-weakly associative encoder to generate the aspect-specific sentence representation by simply concatenating the aspect, which is insufficient to exploit the given aspect. Although GCAE incorporates the gating mechanism to control the sentiment information flow according to the given aspect, the information flow is generated by an aspect-independent encoder. Compared with GCAE, our AGDT improves the performance by 2.4% and 1.6% in the "DS" part of the two datasets, respectively. These results demonstrate that our AGDT can sufficiently exploit the given aspect to generate the aspect-guided sentence representation, and thus conduct accurate sentiment prediction. Our model benefits from the following aspects. First, our AGDT utilizes an aspect-guided encoder, which leverages the given aspect to guide the sentence encoding from scratch and generates the aspect-guided representation. Second, the AGDT guarantees that the aspect-specific information has been fully embedded in the sentence representation via reconstructing the given aspect. Third, the given aspect embedding is concatenated with the aspect-guided sentence representation for final predictions.
The "HDS", which is designed to measure whether a model can detect different sentiment polarities in a sentence, consists of replicated sentences with different sentiments towards multiple aspects. Our AGDT surpasses GCAE by a very large margin (+11.4% and +4.9% respectively) on both datasets. This indicates that the given aspect information is pivotal to accurate sentiment prediction, especially when the sentence has different sentiment labels, which is consistent with existing work (Jiang et al., 2011; Ma et al., 2017). Those results demonstrate the effectiveness of our model and suggest that our AGDT has a better ability to distinguish the different sentiments of multiple aspects compared to GCAE.

Aspect-Term Sentiment Analysis Task
As shown in Table 5, our AGDT consistently outperforms all compared methods on both domains. In this task, TD-LSTM and ATAE-LSTM use an aspect-weakly associative encoder. IAN, RAM and GCAE employ an aspect-independent encoder. In the "DS" part, our AGDT model surpasses all baseline models, which shows that the inclusion of A-GRU (aspect-guided encoder), aspect-reconstruction and aspect concatenated embedding has an overall positive impact on the classification process.
In the "HDS" part, the AGDT model obtains +3.6% higher accuracy than GCAE on the restaurant domain and +4.2% higher accuracy on the laptop domain, which shows that our AGDT handles the multi-sentiment problem better than GCAE. These results further demonstrate that our model works well across tasks and datasets.
From Table 6 and Table 7, we can conclude:
1) [...] previous work (Miceli Barone et al., 2017; Meng and Zhang, 2019) (2 vs. 1).
2) Utilizing "AG" to guide the encoding of aspect-related features from scratch has a significant impact on achieving highly competitive results, particularly in the "HDS" part, which demonstrates that it has a stronger ability to identify different sentiment polarities towards different aspects (3 vs. 2).
3) Aspect concatenated embedding can improve the accuracy to a degree (4 vs. 3).
4) The aspect-reconstruction approach ("AR") substantially improves the performance, especially in the "HDS" part (5 vs. 4).
5) The results in 6 show that all modules have an overall positive impact on the sentiment classification.

Table 6: Ablation study of the AGDT on the aspect-category sentiment analysis task. Here "AC", "AG" and "AR" represent aspect concatenated embedding, A-GRU and aspect-reconstruction, respectively; '√' and '×' denote whether the operation is applied. 'Rest-[...]
Table 7: Ablation study of the AGDT on the aspect-term sentiment analysis task.

Impact of Model Depth
We have demonstrated the effectiveness of the AGDT. Here, we investigate the impact of the model depth of AGDT, varying the depth from 1 to 6. [...] sets as depth increases. We find that the best results are obtained when the depth is equal to 4 in most cases, and that greater depth does not provide a considerable performance improvement.

Effectiveness of Aspect-reconstruction Approach
Here, we investigate how well the AGDT can reconstruct the aspect information. For the aspect-term reconstruction, we count a reconstruction as correct only when all words of the term are reconstructed. Table 9 shows all results on the four test datasets, which again demonstrates the effectiveness of the aspect-reconstruction approach.
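The all-words criterion for aspect-term reconstruction can be made concrete with a short sketch. The decision threshold of 0.5 on the sigmoid outputs and the function names are assumptions. The check only requires that every gold word is predicted and does not penalize extra predicted words, which is one possible reading of the criterion.

```python
import numpy as np


def term_reconstructed(probs, gold_ids, threshold=0.5):
    """True when every gold word id of the term is predicted (assumed threshold)."""
    predicted = set(np.flatnonzero(np.asarray(probs) >= threshold).tolist())
    return set(gold_ids) <= predicted


def reconstruction_accuracy(batch_probs, batch_gold, threshold=0.5):
    """Fraction of instances whose aspect term is fully reconstructed."""
    hits = [term_reconstructed(p, g, threshold)
            for p, g in zip(batch_probs, batch_gold)]
    return sum(hits) / len(hits)


# Hypothetical sigmoid outputs over a 3-word term vocabulary.
example_probs = [[0.9, 0.8, 0.1], [0.9, 0.2, 0.1]]
example_gold = [[0, 1], [0, 1]]
example_acc = reconstruction_accuracy(example_probs, example_gold)
```

In the toy example the first instance recovers both gold words and the second misses one, so half of the terms count as reconstructed.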

Impact of Loss Weight λ
We randomly sample a temporary development set from the "HDS" part of the training set to choose λ for each dataset. We also investigate the impact of λ on the aspect-oriented objectives. Specifically, λ is increased from 0.1 to 1.0. Figure 2 illustrates the results on the four "HDS" datasets, which show that reconstructing the given aspect can enhance aspect-specific sentiment features and thus obtain better performance.

Comparison on Three-Class for the Aspect-Term Sentiment Analysis Task
We also conduct a three-class experiment to compare our AGDT with previous models, i.e., IARM, TNet, VAE, PBAN, AOA and MGAN, in Table 10.

Table 10 (columns: Models; Rest.; Laptop).
Figure 4: The top is the output of A-GRU. The bottom is the output after reconstructing the given aspect.
These previous models are based on an aspect-independent (weakly associative) encoder to generate sentence representations. Results on all domains suggest that our AGDT substantially outperforms most competitive models, except for TNet on the laptop dataset. The reason may be that TNet incorporates additional features (e.g., position features, local n-grams and word-level features) compared to ours (only word-level features).

Analysis and Discussion
Case Study and Visualization. To give an intuitive understanding of how the proposed A-GRU works from scratch with different aspects, we take a review sentence as an example. The example "the appetizers are ok, but the service is slow.", shown in Table 1, has different sentiment labels towards different aspects. The color depth denotes the level of semantic relatedness between the given aspect and each word. A deeper color means a stronger relation to the given aspect. Figure 3 shows that the A-GRU can effectively guide the encoding of the aspect-related features with the given aspect and identify the corresponding sentiment. In another case, "overpriced Japanese food with mediocre service.", there are two extremely strong sentiment words. As the top of Figure 4 shows, our A-GRU assigns almost the same weight to the words "overpriced" and "mediocre". The bottom of Figure 4 shows that reconstructing the given aspect can effectively enhance aspect-specific sentiment features and produce correct sentiment predictions.
Error Analysis. We further investigate the errors made by AGDT, which can be roughly divided into three types. 1) The decision boundary among the sentiment polarities is unclear; even the annotators cannot be sure of the sentiment orientation towards the given aspect in the sentence. 2) The "conflict/neutral" instances are easily misclassified as "positive" or "negative", due to the imbalanced label distribution in the training corpus. 3) The polarity of complex instances is hard to predict, such as sentences that express subtle emotions, which are hard to capture effectively, or sentences containing negation words (e.g., never, less and not), which easily affect the sentiment polarity.

Related Work
Sentiment Analysis. There are various kinds of sentiment analysis tasks, such as document-level (Thongtan and Phienthrakul, 2019), sentence-level, aspect-level (Pontiki et al., 2014; Wang et al., 2019a) and multimodal (Chen et al., 2018; Akhtar et al., 2019) sentiment analysis. For aspect-level sentiment analysis, previous work typically applies the attention mechanism (Luong et al., 2015) combined with a memory network (Weston et al., 2014) or gating units to solve this task (Tang et al., 2016b; He et al., 2018a; Xue and Li, 2018; Duan et al., 2018; Tang et al., 2019; Yang et al., 2019; Bao et al., 2019), where an aspect-independent encoder is used to generate the sentence representation. In addition, some work leverages the aspect-weakly associative encoder to generate the aspect-specific sentence representation (Tang et al., 2016a; Wang et al., 2016; Majumder et al., 2018). All of these methods make insufficient use of the given aspect information. There is also some work which jointly extracts the aspect term (and opinion term) and predicts its sentiment polarity (Schmitt et al., 2018; Li et al., 2018b; Ma et al., 2018; Angelidis and Lapata, 2018; He et al., 2019; Hu et al., 2019; Dai and Song, 2019; Wang et al., 2019b). In this paper, we focus on the latter problem and leave aspect extraction (Shu et al., 2017) to future work. Some work (He et al., 2018b; Xu and Tan, 2018; Chen and Qian, 2019; He et al., 2019) employs the well-known BERT (Devlin et al., 2018) or document-level corpora to enhance ABSA tasks, which will be considered in our future work to further improve the performance.
Deep Transition. Deep transition has proven superior in language modeling (Pascanu et al., 2014) and machine translation (Miceli Barone et al., 2017; Meng and Zhang, 2019). We follow the deep transition architecture in Meng and Zhang (2019) and extend it by incorporating a novel A-GRU for ABSA tasks.

Conclusions
In this paper, we propose a novel aspect-guided encoder (AGDT) for ABSA tasks, based on a deep transition architecture. Our AGDT can guide the sentence encoding from scratch for the aspect-specific feature selection and extraction. Furthermore, we design an aspect-reconstruction approach to enforce AGDT to reconstruct the given aspect with the generated sentence representation. Empirical studies on four datasets suggest that the AGDT substantially outperforms existing state-of-the-art models on both the aspect-category sentiment analysis task and the aspect-term sentiment analysis task of ABSA without additional features.