Improving Aspect-based Sentiment Analysis with Gated Graph Convolutional Networks and Syntax-based Regulation

Aspect-based Sentiment Analysis (ABSA) seeks to predict the sentiment polarity of a sentence toward a specific aspect. Recently, it has been shown that dependency trees can be integrated into deep learning models to produce the state-of-the-art performance for ABSA. However, these models tend to compute the hidden/representation vectors without considering the aspect terms and fail to benefit from the overall contextual importance scores of the words that can be obtained from the dependency tree for ABSA. In this work, we propose a novel graph-based deep learning model to overcome these two issues of the prior work on ABSA. In our model, gate vectors are generated from the representation vectors of the aspect terms to customize the hidden vectors of the graph-based models toward the aspect terms. In addition, we propose a mechanism to obtain the importance scores for each word in the sentences based on the dependency trees that are then injected into the model to improve the representation vectors for ABSA. The proposed model achieves the state-of-the-art performance on three benchmark datasets.


Introduction
Aspect-based Sentiment Analysis (ABSA) is a finegrained version of sentiment analysis (SA) that aims to find the sentiment polarity of the input sentences toward a given aspect. We focus on the term-based aspects for ABSA where the aspects correspond to some terms (i.e., sequences of words) in the input sentence. For instance, an ABSA system should be able to return the negative sentiment for input sentence "The staff were very polite, but the quality of the food was terrible." assuming "food" as the aspect term.
Due to its important applications (e.g., for opinion mining), ABSA has been studied extensively * Equal contribution. in the literature. In these studies, deep learning has been employed to produce the state-of-the-art performance for this problem (Wagner et al., 2016;Dehong et al., 2017). Recently, in order to further improve the performance, the syntactic dependency trees have been integrated into the deep learning models (Huang and Carley, 2019; for ABSA (called the graph-based deep learning models). Among others, dependency trees help to directly link the aspect term to the syntactically related words in the sentence, thus facilitating the graph convolutional neural networks (GCN) (Kipf and Welling, 2017) to enrich the representation vectors for the aspect terms.
However, there are at least two major issues in these graph-based models that should be addressed to boost the performance. First, the representation vectors for the words in different layers of the current graph-based models for ABSA are not customized for the aspect terms. This might lead to suboptimal representation vectors where the irrelevant information for ABSA might be retained and affect the model's performance. Ideally, we expect that the representation vectors in the deep learning models for ABSA should mainly involve the related information for the aspect terms, the most important words in the sentences. Consequently, in this work, we propose to regulate the hidden vectors of the graph-based models for ABSA using the information from the aspect terms, thereby filtering the irrelevant information for the terms and customizing the representation vectors for ABSA. In particular, we compute a gate vector for each layer of the graph-based model for ABSA leveraging the representation vectors of the aspect terms. This layer-wise gate vector would be then applied over the hidden vectors of the current layer to produce customized hidden vectors for ABSA. In addition, we propose a novel mechanism to explicitly increase the contextual distinction among the gates to further improve the representation vectors.
The second limitation of the current graph-based deep learning models is the failure to explicitly exploit the overall importance of the words in the sentences that can be estimated from the dependency trees for the ABSA problem. In particular, a motivation of the graph-based models for ABSA is that the neighbor words of the aspect terms in the dependency trees would be more important for the sentiment of the terms than the other words in the sentence. The current graph-based models would then just focus on those syntactic neighbor words to induce the representations for the aspect terms. However, based on this idea of important words, we can also assign a score for each word in the sentences that explicitly quantify its importance/contribution for the sentiment prediction of the aspect terms. In this work, we hypothesize that these overall importance scores from the dependency trees might also provide useful knowledge to improve the representation vectors of the graph-based models for ABSA. Consequently, we propose to inject the knowledge from these syntaxbased importance scores into the graph-based models for ABSA via the consistency with the modelbased importance scores. In particular, using the representation vectors from the graph-based models, we compute a second score for each word in the sentences to reflect the model's perspective on the importance of the word for the sentiment of the aspect terms. The syntax-based importance scores are then employed to supervise the model-based importance scores, serving as a method to introduce the syntactic information into the model. In order to compute the model-based importance scores, we exploit the intuition that a word would be more important for ABSA if it is more similar the overall representation vector to predict the sentiment for the sentence in the final step of the model. In the experiments, we demonstrate the effectiveness of the proposed model with the state-of-the-art performance on three benchmark datasets for ABSA. In summary, our contributions include: • A novel method to regulate the GCN-based representation vectors of the words using the given aspect term for ABSA.
• A novel method to encourage the consistency between the syntax-based and model-based importance scores of the words based on the given aspect term.
• Extensive experiments on three benchmark datasets for ABSA, resulting in new state-of-theart performance for all the datasets.

Related Work
Sentiment analysis has been studied under different settings in the literature (e.g., sentence-level, aspect-level, cross-domain) Chauhan et al., 2019;Hu et al., 2019). For ABSA, the early works have performed feature engineering to produce useful features for the statistical classification models (e.g., SVM) (Wagner et al., 2014). Recently, deep learning models have superseded the feature based models due to their ability to automatically learn effective features from data (Wagner et al., 2016;Johnson and Zhang, 2015;Tang et al., 2016). The typical network architectures for ABSA in the literature involve convolutional neural networks (CNN) (Johnson and Zhang, 2015), recurrent neural networks (RNN) (Wagner et al., 2016), memory networks (Tang et al., 2016), attention (Luong et al., 2015) and gating mechanisms (He et al., 2018). The current state-of-the-art deep learning models for ABSA feature the graph-based models where the dependency trees are leveraged to improve the performance. (Huang and Carley, 2019;Hou et al., 2019). However, to the best of our knowledge, none of these works has used the information from the aspect term to filter the graph-based hidden vectors and exploited importance scores for words from dependency trees as we do in this work.

Model
The task of ABSA can be formalized as follows: Given a sentence X = [x 1 , x 2 , . . . , x n ] of n words/tokens and the index t (1 ≤ t ≤ n) for the aspect term x t , the goal is to predict the sentiment polarity y * toward the aspect term x t for X. Our model for ABSA in this work consists of three major components: (i) Representation Learning, (ii) Graph Convolution and Regulation, and (iii) Syntax and Model Consistency.
(i) Representation Learning: Following the recent work in ABSA (Huang and Carley, 2019;, we first utilize the contextualized word embeddings BERT (Devlin et al., 2019) to obtain the representation vectors for the words in X. In particular, we first generate a sequence of words of the formX where [CLS] and [SEP ] are the special tokens in BERT. This word sequence is then fed into the pre-trained BERT model to obtain the hidden vectors in the last layer. Afterwards, we obtain the embedding vector e i for each word x i ∈ X by averaging the hidden vectors of x i 's sub-word units (i.e., wordpiece). As the result, the input sentence X will be represented by the vector sequence E = e 1 , e 2 , . . . , e n in our model. Finally, we also employ the hidden vector s for the special token [CLS] inX from BERT to encode the overall input sentence X and its aspect term x t .
(ii) Graph Convolution and Regulation: In order to employ the dependency trees for ABSA, we apply the GCN model (Nguyen and Grishman, 2018;Veyseh et al., 2019) to perform L abstraction layers over the word representation vector sequence E. A hidden vector for a word x i in the current layer of GCN is obtained by aggregating the hidden vectors of the dependency-based neighbor words of x i in the previous layer. Formally, let h l i (0 ≤ l ≤ L, 1 ≤ i ≤ n) be the hidden vector of the word x i at the l-th layer of GCN. At the beginning, the GCN hidden vector h 0 i at the zero layer will be set to the word representation vector e i . Afterwards, h l i (l > 0) will be computed by: is the set of the neighbor words of x i in the dependency tree. We omit the biases in the equations for simplicity.
One problem with the GCN hidden vectors h l i GCN is that they are computed without being aware of the aspect term x t . This might retain irrelevant or confusing information in the representation vectors (e.g., a sentence might have two aspect terms with different sentiment polarity). In order to explicitly regulate the hidden vectors in GCN to focus on the provided aspect term x i , our proposal is to compute a gate g l for each layer l of GCN using the representation vector e t of the aspect term: g l = σ(W g l e t ). This gate is then applied over the hidden vectors h l i of the l-th layer via the element-wise multiplication •, generating the regulated hidden vectorh l i for h l i : h l i = g l • h l i . Ideally, we expect that the hidden vectors of GCN at different layers would capture different levels of contextual information in the sentence. The gate vectors g t for these layer should thus also exhibit some difference level for contextual information to match those in the GCN hidden vectors. In order to explicitly enforce the gate diversity in the model, our intuition is to ensure that the regu-lated GCN hidden vectors, once obtained by applying different gates to the same GCN hidden vectors, should be distinctive. This allows us to exploit the contextual information in the hidden vectors of GCN to ground the information in the gate vectors for the explicit gate diversity promotion.
In particular, given the l-th layer of GCN, we first obtain an overall representation vectorh l for the regulated hidden vectors at the l-th layer using the max-pooled vector:h l = max pool(h l 1 , . . . ,h l n ). Afterwards, we apply the gate vectors g l from the other layers (l = l) to the GCN hidden vectors h l i at the l-th layer, resulting in the regulated hidden vectorsh l,l i = g l • h l i . For each of these other layers l , we also compute an overall representation vectorh l,l with max-pooling: h l,l = max pool(h l,l 1 , . . . ,h l,l n ). Finally, we promote the diversity between the gate vectors g l by enforcing the distinction betweenh l andh l,l for l = l. This can be done by minimizing the cosine similarity between these vectors, leading to the following regularization term L div to be added to the loss function of the model: (iii) Syntax and Model Consistency: As presented in the introduction, we would like to obtain the importance scores of the words based on the dependency tree of X, and then inject these syntaxbased scores into the graph-based deep learning model for ABSA to improve the quality of the representation vectors. Motivated by the contextual importance of the neighbor words of the aspect terms for ABSA, we use the negative of the length of the path from x i to x t in the dependency tree to represent the syntax-based importance score syn i for x i ∈ X. For convenience, we also normalize the scores syn i with the softmax function.
In order to incorporate syntax-based scores syn i into the model, we first leverage the hidden vectors in GCN to compute a model-based importance score mod i for each word x i ∈ X (also normalized with softmax). Afterwards, we seek to minimize the KL divergence between the syntaxbased scores syn 1 , . . . , syn n and the model-based scores mod 1 , . . . , mod n by introducing the following term L const into the overall loss function: L const = −syn i log syn i mod i . The rationale is to promote the consistency between the syntax-based and model-based importance scores to facilitate the in-jection of the knowledge in the syntax-based scores into the representation vectors of the model.
For the model-based importance scores, we first obtain an overall representation vector V for the input sentence X to predict the sentiment for x t . In this work, we compute V using the sentence representation vector s from BERT and the regulated hidden vectors in the last layer of GCN: V = [s, max pool(ĥ L 1 , . . . ,ĥ L n )]. Based on this overall representation vector V , we consider a word x i to be more contextually important for ABSA than the others if its regulated GCN hidden vectorĥ L i in the last GCN layer is more similar to V than those for the other words. The intuition is the GCN hidden vector of a contextually important word for ABSA should be able capture the necessary information to predict the sentiment for x t , thereby being similar to V that is supposed to encode the overall relevant context information of X to perform sentiment classification. In order to implement this idea, we use the dot product of the transformed vectors for V andĥ L i to determine the model-based importance score for x i in the model: Finally, we feed V into a feed-forward neural network with softmax in the end to estimate the probability distribution P (.|X, x t ) over the sentiments for X and x t . The negative log-likelihood L pred = − log P (y * |X, x t ) is then used as the prediction loss in this work. The overall loss to train the proposed model is then: L = L div +αL const +βL pred where α and β are trade-off parameters.

Experiments
Datasets and Parameters: We employ three datasets to evaluate the models in this work. Two datasets, Restaurant and Laptop, are adopted from the SemEval 2014 Task 4 (Pontiki et al., 2014) while the third dataset, MAMS, is introduced in . All the three datasets involve three sentiment categories, i.e., positive, neural, and negative. The numbers of examples for different portions of the three datasets are shown in Table 1.
As only the MAMS dataset provides the development data, we fine-tune the model's hyperparameters on the development data of MAMS and use the same hyper-parameters for the other datasets. The following hyper-parameters are suggested for the proposed model by the fine-tuning process: 200 dimensions for the hidden vectors of Dataset Pos. Neu. Neg. Train 2164 637  807  Restaurant-Test  728  196  196  Laptop-Train  994  464  870  Laptop-Test  341  169  128  MAMS-Train  3380 5042 2764  MAMS-Dev  403  604  325  MAMS-Test  400  607  329   Table 1: Statistics of the datasets the feed forward networks and GCN layers, 2 hidden layers in GCN, the size 32 for the mini-batches, the learning rate of 0.001 for the Adam optimizer, and 1.0 for the trade-off parameters α and β. Finally, we use the cased BERT base model with 768 hidden dimensions in this work.

Restaurant-
Results: To demonstrate the effectiveness of the proposed method, we compare it with the following baselines: (1) the feature-based model that applies feature engineering and the SVM model (Wagner et al., 2014), (2) the deep learning models based on the sequential order of the words in the sentences, including CNN, LSTM, attention and the gating mechanism (Wagner et al., 2016;Wang et al., 2016;Tang et al., 2016;Huang et al., 2018;, and (3) the graph-based models that exploit dependency trees to improve the deep learning models for ABSA (Huang and Carley, 2019;Hou et al., 2019;Wang et al., 2020). Table 2 presents the performance of the models on the test sets of the three benchmark datasets. This table shows that the proposed model outperforms all the baselines over different benchmark datasets. The performance gaps are significant with p < 0.01, thereby demonstrating the effectiveness of the proposed model for ABSA.
Ablation Study: There are three major components in the proposed model: (1) the gate vectors g l to regulate the hidden vectors of GCN (called Gate), (2) the gate diversity component L div to promote the distinction between the gates (called Div.), and (3) the syntax and model consistency component L const to introduce the knowledge from the syntax-based importance scores (called Con.). Table 3 reports the performance on the MAMS development set for the ablation study when the components mentioned in each row are removed from the proposed model. Note that the exclusion of Gate would also remove Div. due to their dependency. It is clear from the table that all three components are necessary for the proposed model   Gate Diversity Analysis: In order to enforce the diversity of the gate vectors g t for different layers of GCN, the proposed model indirectly minimizes the cosine similarities between the regulated hidden vectors of GCN at different layers (i.e., in L div ). The regulated hidden vectors are obtained by applying the gate vectors to the hidden vectors of GCN, serving as a method to ground the information in the gates with the contextual information in the input sentences (i.e., via the hidden vectors of GCN) for diversity promotion. In order to demonstrate the effectiveness of such gatecontext grounding mechanism for the diversity of the gates, we evaluate a more straightforward baseline where the gate diversity is achieved by directly minimizing the cosine similarities between the gate vectors g t for different GCN layers. In particular, the diversity loss term L div in this baseline would be: L div = 1 L(L−1) Σ L l=1 Σ L l =1,l =l g l · g l . We call this baseline GateDiv. for convenience. Table 4 report the performance of GateDiv. and the proposed model on the development dataset of MAMS. As can be seen, the proposed model is significantly better than GateDiv., thereby testifying to the effectiveness of the proposed gate diversity component with information-grounding in this work. We attribute this superiority to the fact that the regulated hidden vectors of GCN provide richer contextual information for the diversity term L div than those with the gate vectors. This offers better grounds to support the gate similarity comparison in L div , leading to the improved performance for the proposed model.

Model
Acc. The proposed model 87.98 GateDiv.
86.13 Table 4: Model performance on the MAMS development set when the diversity term L div is directly computed from the gate vectors.

Conclusion
We introduce a new model for ABSA that addresses two limitations of the prior work. It employs the given aspect terms to customize the hidden vectors. It also benefits from the overall dependency-based importance scores of the words. Our extensive experiments on three benchmark datasets empirically demonstrate the effectiveness of the proposed approach, leading to state-of-the-art results on these datasets. The future work involves applying the proposed model to the related tasks for ABSA, e.g., event detection (Nguyen and Grishman, 2015).