Aspect-based Sentiment Analysis with Type-aware Graph Convolutional Networks and Layer Ensemble

Neural graph-based models are widely applied in existing aspect-based sentiment analysis (ABSA) studies to utilize word relations from dependency parses, which provide better semantic guidance for analyzing context and aspect words. However, most of these studies only leverage dependency relations without considering their dependency types, and lack effective mechanisms to distinguish important relations and to learn from the different layers of graph-based models. To address these limitations, in this paper we propose an approach that explicitly utilizes dependency types for ABSA with type-aware graph convolutional networks (T-GCN), where attention is used in T-GCN to distinguish different edges (relations) in the graph and attentive layer ensemble is proposed to comprehensively learn from different layers of T-GCN. The validity and effectiveness of our approach are demonstrated in the experimental results, where state-of-the-art performance is achieved on six English benchmark datasets. Further experiments are conducted to analyze the contributions of each component in our approach and to illustrate how different layers in T-GCN help ABSA through quantitative and qualitative analysis.


Introduction
Aspect-based sentiment analysis (ABSA) predicts fine-grained sentiment polarities towards specific aspects, where in many cases it is required to identify different sentiments for multiple aspects in the same context. For example, in the sentence "The drink menu is limited but the wines are excellent.", the sentiment polarity towards "drink menu" is negative while that towards "wines" is positive; an ABSA system may predict wrongly if it fails to capture the important contextual information for each aspect. Therefore, to model such contextual information, neural models (e.g., Bi-LSTM and Transformer (Vaswani et al., 2017)) have been widely used for ABSA and demonstrated to be useful for this task (Wang et al., 2016; Tang et al., 2016a; Chen et al., 2017; Ma et al., 2017; Fan et al., 2018). The code and models involved in this paper are released at https://github.com/cuhksz-nlp/ASA-TGCN.
As a further enhancement of encoding contextual information for ABSA, some studies (Huang and Carley, 2019; Zhang et al., 2019a) use graph convolutional networks (GCN) to learn from a graph that is often built over the dependency parsing results of the input texts. As a result, the GCN models are able to learn from distant word-word relations that are helpful to ABSA. However, the GCN models used in these studies omit the information carried in dependency types and treat all word-word relations in the graph equally, so that unimportant relations cannot be distinguished and may mislead ABSA accordingly. For example, Figure 1 illustrates an example sentence with an aspect highlighted in red, where the aspect word "menu" is connected with three other words, i.e., "the", "drink", and "limited". The connection between "menu" and "limited" could be the most important one since its dependency type, i.e., "nsubj", suggests that "menu" is the nominal subject of "limited", which strongly guides sentiment analysis towards "menu". In this case, if the dependency type is not modeled, one may not be able to leverage such beneficial information. In addition, although previous GCN models learn such word-word relations through multiple GCN layers, they only use the output from the last layer for ABSA, where the encodings from intermediate layers are omitted and some essential information may be lost because different contextual information is modeled across layers. Thus, an appropriate approach is required to enhance current GCN models for ABSA.
In this paper, we propose type-aware graph convolutional networks (T-GCN) with multiple layers to enhance ABSA by incorporating both word relations and their dependency types, so as to comprehensively learn from dependency parsing results. Specifically, we first obtain the dependency parsing results of the input texts through off-the-shelf toolkits, then build the graph over the dependency tree with each edge labeled by the dependency type between the two connected words, next apply an attention mechanism to the graph to weight all edges according to their contributions to the task, and finally use attentive layer ensemble to weight and combine the contextual information learned from different GCN layers. In doing so, our proposed T-GCN model can not only model word-word relations and their dependency types, but also distinguish the important contextual information from such relations to enhance ABSA. Experiments on six English benchmark datasets are conducted to evaluate the proposed model, where the results illustrate its effectiveness, with state-of-the-art performance observed over previous studies on all datasets. We also perform further analysis to investigate the contribution of each component (i.e., the type-aware graph, attention for edges, and attentive layer ensemble) in our approach, and illustrate how different layers in T-GCN help ABSA with quantitative and qualitative studies.

The Approach
Given an input sentence $\mathcal{X} = x_1, x_2, \cdots, x_n$ and the aspect terms $\mathcal{A} \subset \mathcal{X}$ ($\mathcal{A}$ is usually a sub-string of $\mathcal{X}$), conventional ABSA approaches often take the sentence-aspect pair as the input and predict $\mathcal{A}$'s sentiment polarity $\hat{y}$ (Tang et al., 2016b; Ma et al., 2017; Xue and Li, 2018; Hazarika et al., 2018; Fan et al., 2018; Huang and Carley, 2018; Tang et al., 2019; Chen and Qian, 2019; Tan et al., 2019; Tang et al., 2020). We follow this paradigm, and the overview of our approach is illustrated in Figure 2, with a contextual encoder (i.e., BERT), the proposed T-GCN, and the attentive layer ensemble (ALE). The overall conceptual formalism of our approach can be written as

$$\hat{y} = \arg\max_{y \in \mathcal{T}} p\,(y \mid \mathcal{X}, \mathcal{A}) \quad (1)$$

where $\mathcal{T}$ denotes the set of all sentiment labels for $y$ (i.e., positive, neutral, and negative) and $p$ computes the probability of predicting $y \in \mathcal{T}$ given $\mathcal{X}$ and $\mathcal{A}$ through T-GCN and ALE. In the following text, we first describe the construction of the graph with dependency types, then elaborate the details of our T-GCN model and the ALE that incorporates contextual information from different T-GCN layers, and finally illustrate how T-GCN is applied to ABSA.

Type-aware Graph Construction
Contextual features such as n-grams and syntactic information have been demonstrated to be useful for enhancing text representations and thus improving model performance for many NLP tasks (Sun and Xu, 2011; Gong et al., 2012; Xu et al., 2015; Chen et al., 2017; Zhang et al., 2019b; Tang et al., 2020). In addition, many recent studies have demonstrated that GCN models are effective in capturing contextual features that are represented in graph-like signals, i.e., dependencies among words, of an input sentence (Huang and Carley, 2019; Zhang et al., 2019a; Tian et al., 2020c). In the graph for conventional GCN models, an edge between any two words $x_i$ and $x_j$ in the input sentence is added to the graph if there is a dependency relation between them. Therefore, such graphs fail to comprehensively use the dependency parsing results because dependency types are omitted. To leverage such type information, we propose the type-aware graph to feed our T-GCN, built via the following steps.

Figure 3: An illustration of how we build the type-aware graph from dependency parsing results and the details of a T-GCN layer that consumes the graph. Edges and their dependency types are illustrated in the adjacency matrix and the relation matrix, respectively.
First, we use off-the-shelf toolkits to obtain the dependency parsing results, which can be represented by a list of dependency tuples $(x_i, x_j, r_{i,j})$, with $r_{i,j}$ denoting the dependency type between $x_i$ and $x_j$. Second, we use an adjacency matrix $A = \{a_{i,j}\}_{n \times n}$ to represent the graph by recording the word relations in all tuples, and a relation type matrix $R = \{r_{i,j}\}_{n \times n}$ to represent the edges with their dependency types. Therefore, $A$ is a 0-1 matrix where $a_{i,j} = 1$ if there is an edge between $x_i$ and $x_j$, and $a_{i,j} = 0$ otherwise. For $R$, each element $r_{i,j}$ uses a mark to denote the dependency type between $x_i$ and $x_j$. Figure 3 illustrates the dependency parsing results of an example sentence as well as its type-aware graph represented by $A$ and $R$, with the marks for $r_{i,j}$ listed in the "Type Reference". Finally, to leverage the relation types, we use a transition matrix to map all $r_{i,j}$ to their embeddings $e_{r_{i,j}}$.
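To make these steps concrete, the following is a minimal sketch (not the authors' released code) of building $A$ and $R$; stanza is an assumed stand-in for the SAPar + Stanford Converter pipeline described in the Implementation Details section, and the 0-based indexing, undirected edges, and the "self" mark for self-loops are illustrative assumptions.

```python
# A minimal sketch of building the adjacency matrix A and the relation type
# matrix R from dependency tuples; stanza is an assumed stand-in for the
# parsing pipeline used in the paper.
import numpy as np
import stanza  # assumes stanza.download("en") has been run once

nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")
doc = nlp("The drink menu is limited but the wines are excellent.")
words = doc.sentences[0].words
n = len(words)

type2id = {"self": 0}                    # marks for dependency types
A = np.zeros((n, n), dtype=np.int64)     # adjacency matrix, a_ij in {0, 1}
R = np.zeros((n, n), dtype=np.int64)     # relation type matrix

for i in range(n):                       # self-loop on every word
    A[i, i] = 1
    R[i, i] = type2id["self"]

for w in words:
    if w.head == 0:                      # skip the artificial root
        continue
    i, j = w.head - 1, w.id - 1          # 0-based head and dependent indices
    rid = type2id.setdefault(w.deprel, len(type2id))
    A[i, j] = A[j, i] = 1                # treat dependency edges as undirected
    R[i, j] = R[j, i] = rid
```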

T-GCN
With the type-aware graph, we propose an L-layer T-GCN, where each layer applies attention to the edges in the graph to weight them by their contributions to the ABSA task. Figure 4 illustrates this process for the aspect word "menu" in the sentence "The drink menu is limited but all the wines are excellent.".

Figure 4: An illustration of computing $h_i^{(l)}$ for $x_3$ = "menu" through a T-GCN layer. All words $x_j$ connected to "menu", with their dependency types (in embeddings $e_{r_{i,j}}$), are shown at the bottom part.

In detail, for each edge between $x_i$ and $x_j$, the $l$-th T-GCN layer takes the hidden vectors $h_i^{(l-1)}$ and $h_j^{(l-1)}$ from the previous layer and concatenates each of them with the type embedding $e_{r_{i,j}}$:

$$s_i^{(l)} = h_i^{(l-1)} \oplus e_{r_{i,j}} \quad (2)$$
$$s_j^{(l)} = h_j^{(l-1)} \oplus e_{r_{i,j}} \quad (3)$$

Then, we compute the weight $p_{i,j}^{(l)}$ for this edge by

$$p_{i,j}^{(l)} = \frac{a_{i,j} \cdot \exp\big(s_i^{(l)} \cdot s_j^{(l)}\big)}{\sum_{j'=1}^{n} a_{i,j'} \cdot \exp\big(s_i^{(l)} \cdot s_{j'}^{(l)}\big)} \quad (4)$$

and align the dimension of $e_{r_{i,j}}$ to $h_j^{(l-1)}$ by a trainable matrix $W_R^{(l)}$:

$$\widetilde{e}_{r_{i,j}}^{(l)} = W_R^{(l)} \cdot e_{r_{i,j}} \quad (5)$$

Finally, we apply $p_{i,j}^{(l)}$ to this edge and compute the output for $x_i$ at the $l$-th layer following a process similar to the conventional GCN:

$$h_i^{(l)} = \sigma\Big(\sum_{j=1}^{n} p_{i,j}^{(l)} \cdot \big(W^{(l)} \cdot (h_j^{(l-1)} + \widetilde{e}_{r_{i,j}}^{(l)}) + b^{(l)}\big)\Big) \quad (6)$$

where $W^{(l)}$ and $b^{(l)}$ denote trainable parameters in the $l$-th T-GCN layer and $\sigma$ refers to the ReLU activation function. The above process is conducted for every $x_i$ and throughout all T-GCN layers, so that the information of dependency types is incorporated into the GCN to enhance ABSA accordingly.
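As a concrete reference, here is a minimal, unbatched PyTorch sketch of one such layer following Eqs. (2)-(6); the class and variable names, tensor shapes, and the per-layer type embedding table are our own assumptions, not the authors' released implementation.

```python
# A sketch of one T-GCN layer under Eqs. (2)-(6); a schematic
# re-implementation under our own naming/batching assumptions.
import torch
import torch.nn as nn

class TGCNLayer(nn.Module):
    def __init__(self, dim, num_types, type_dim):
        super().__init__()
        self.type_emb = nn.Embedding(num_types, type_dim)   # e_{r_ij}
        self.align = nn.Linear(type_dim, dim, bias=False)   # W_R^{(l)}, Eq. (5)
        self.W = nn.Linear(dim, dim)                        # W^{(l)}, b^{(l)}

    def forward(self, h, A, R):
        # h: (n, dim) hidden vectors from the previous layer
        # A: (n, n) 0-1 adjacency; R: (n, n) type ids (both long tensors)
        n = h.size(0)
        e = self.type_emb(R)                                # (n, n, type_dim)
        s_i = torch.cat([h.unsqueeze(1).expand(n, n, -1), e], dim=-1)  # Eq. (2)
        s_j = torch.cat([h.unsqueeze(0).expand(n, n, -1), e], dim=-1)  # Eq. (3)
        score = (s_i * s_j).sum(-1)                         # s_i^(l) · s_j^(l)
        score = score.masked_fill(A == 0, float("-inf"))    # keep edges only
        p = torch.softmax(score, dim=-1)                    # Eq. (4)
        msg = self.W(h.unsqueeze(0) + self.align(e))        # W(h_j + W_R e) + b
        return torch.relu((p.unsqueeze(-1) * msg).sum(1))   # Eq. (6), sigma = ReLU
```

Masking non-edges to negative infinity before the softmax reproduces the $a_{i,j}$ terms in Eq. (4): non-edges receive zero weight and the normalization runs over connected words only (the self-loops added during graph construction guarantee every row has at least one edge).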

Attentive Layer Ensemble
Since every T-GCN layer incorporates, for each word $x_i$, information from the words directly connected to it, stacking multiple T-GCN layers allows the model to learn indirect word relations over long distances. It is thus assumed that different layers have their unique capabilities to encode contextual information. To utilize such capabilities, we propose to comprehensively learn from all T-GCN layers with attentive layer ensemble.
In doing so, we first obtain the output $o^{(l)}$ from each T-GCN layer by averaging the output hidden vectors of all aspect terms $x_k \in \mathcal{A}$:

$$o^{(l)} = \frac{1}{|\mathcal{A}|} \sum_{x_k \in \mathcal{A}} h_k^{(l)} \quad (7)$$

where $|\mathcal{A}|$ is the number of words in the aspect terms $\mathcal{A}$. Then we attentively ensemble the outputs of all T-GCN layers through a weighted average:

$$o = \sum_{l=1}^{L} \delta^{(l)} \cdot o^{(l)} \quad (8)$$

where $o$ is the final vector output for ABSA and $\delta^{(l)}$ is a trainable weight assigned to $o^{(l)}$ to balance its contribution, satisfying $\sum_{l=1}^{L} \delta^{(l)} = 1$.
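The two steps above can be sketched as follows; using a softmax over trainable logits to enforce that the $\delta^{(l)}$ weights sum to one is our assumption, as the text only states the constraint.

```python
# A sketch of attentive layer ensemble (Eqs. (7)-(8)).
import torch
import torch.nn as nn

class AttentiveLayerEnsemble(nn.Module):
    def __init__(self, num_layers):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_layers))   # -> delta^{(l)}

    def forward(self, layer_outputs, aspect_mask):
        # layer_outputs: list of L tensors of shape (n, dim)
        # aspect_mask: (n,) float tensor, 1.0 for aspect words, else 0.0
        delta = torch.softmax(self.logits, dim=0)             # sums to 1
        o = 0.0
        for l, h in enumerate(layer_outputs):
            o_l = (h * aspect_mask.unsqueeze(-1)).sum(0) / aspect_mask.sum()  # Eq. (7)
            o = o + delta[l] * o_l                            # Eq. (8)
        return o
```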

Encoding and Decoding with T-GCN
To support applying T-GCN to ABSA, necessary encoding and decoding processes are required. For encoding, there are two options. The first is to take the sentence $\mathcal{X}$ as the input and obtain the hidden vectors of all its words by

$$\mathcal{H}_{\mathcal{X}} = \text{BERT}(\mathcal{X}) \quad (9)$$

where $\mathcal{H}_{\mathcal{X}}$ contains the hidden vectors of all words in $\mathcal{X}$, and we use BERT as the encoder (same below).
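For reference, below is a minimal sketch of this encoding step using the HuggingFace transformers library as an assumed stand-in for the paper's BERT setup; it is not the authors' released code.

```python
# A minimal sketch of Eq. (9) with HuggingFace transformers.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

sentence = "The drink menu is limited but the wines are excellent."
inputs = tokenizer(sentence, return_tensors="pt")     # single-sentence input
# the sentence-aspect pair input of Eq. (10) below would instead be:
# inputs = tokenizer(sentence, "drink menu", return_tensors="pt")
with torch.no_grad():
    H_X = bert(**inputs).last_hidden_state            # (1, seq_len, 768)
# Note: BERT produces subword-level states; aligning them to the word-level
# graph is an implementation detail not covered by this sketch.
```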
The second is to take the sentence-aspect pair as the input, which can be formalized by

$$\mathcal{H}_{\mathcal{X}} \oplus \mathcal{H}_{\mathcal{A}} = \text{BERT}(\mathcal{X} \oplus \mathcal{A}) \quad (10)$$

where $\mathcal{H}_{\mathcal{A}}$ contains the hidden vectors of all aspect words. Then, the hidden vectors from $\mathcal{H}_{\mathcal{X}}$ or $\mathcal{H}_{\mathcal{A}}$ are fed into the T-GCN model as described in §2.2.

For decoding, after we obtain $o$ from ALE, we first map $o$ to the label space by a fully connected layer:

$$u = W \cdot o + b \quad (11)$$

where $W$ and $b$ are a trainable matrix and bias, respectively, and each dimension of $u$ corresponds to a sentiment type. Then, we apply a softmax function to $u$ and predict the output sentiment $\hat{y}$ for the aspect $\mathcal{A}$ in $\mathcal{X}$ by

$$\hat{y} = \arg\max_{t \in \mathcal{T}} \frac{\exp(u_t)}{\sum_{t'=1}^{|\mathcal{T}|} \exp(u_{t'})} \quad (12)$$

where $u_t$ is the value at dimension $t$ of $u$.
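A corresponding sketch of the decoding step (Eqs. (11)-(12)) is given below; the label order and the 768-dimensional input (the BERT-base hidden size) are illustrative assumptions.

```python
# A sketch of decoding: map the ALE output o to the label space and pick
# the sentiment with the highest probability.
import torch
import torch.nn as nn

LABELS = ["positive", "neutral", "negative"]

def decode(o: torch.Tensor, classifier: nn.Linear) -> str:
    u = classifier(o)                         # u = W · o + b, Eq. (11)
    probs = torch.softmax(u, dim=-1)          # distribution over sentiment types
    return LABELS[int(torch.argmax(probs))]   # argmax over u_t, Eq. (12)

classifier = nn.Linear(768, len(LABELS))
print(decode(torch.randn(768), classifier))   # e.g., "neutral"
```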
Table 2: Experimental results (accuracy and F1 scores) of using two encoders, i.e., BERT-base and BERT-large, with different configurations on six benchmark datasets. "GCN" refers to the normal GCN model without the type-aware graph, the attention mechanism, or ALE. "S" and "P" refer to the settings where the input is a single sentence and a sentence-aspect pair, respectively.

Datasets

Some of the original datasets contain an additional conflict label, which identifies aspects with conflicting sentiment polarities. For example, the aspect "sushi" is assigned a conflict label in "Certainly not the best sushi in New York, however, it is always fresh." from REST14. Therefore, we follow Tang et al. (2016b) to clean the datasets by removing all aspects with the aforementioned conflict label, as well as sentences without an aspect. The statistics (numbers of aspects with positive, neutral, and negative labels) of the six processed datasets are reported in Table 1.

Implementation Details
To build the graph for T-GCN, we first use the current best performing constituency parser, i.e., SAPar (Tian et al., 2020d), to parse all input texts into constituency trees, then convert the trees into dependency trees with the Stanford Converter, and finally build the graph over the dependency relations and types from the trees. Since high-quality text representations can improve the performance of NLP models (Mikolov et al., 2013; Song et al., 2017; Bojanowski et al., 2017), we employ BERT (Devlin et al., 2019) as the context encoder, which, together with its variants (Diao et al., 2020; Dai et al., 2019; Joshi et al., 2020), has demonstrated its effectiveness in encoding contextual information and achieved state-of-the-art performance in many NLP tasks (Huang and Carley, 2019; Tian et al., 2020a,b; Tang et al., 2020; Nie et al., 2020; Wang et al., 2020). Specifically, we use the uncased BERT-base and BERT-large with their default settings, i.e., 12 layers of self-attention with 768-dimensional hidden vectors for BERT-base and 24 layers of self-attention with 1024-dimensional hidden vectors for BERT-large, and use three T-GCN layers. We try two ways to encode the input, where the first encodes the single sentence and the second encodes the sentence-aspect pair. For all models, we use the pre-trained parameters of BERT and initialize all other trainable parameters with Xavier initialization (Glorot and Bengio, 2010). Moreover, we use the cross-entropy loss function for our models and follow previous studies (Tang et al., 2016a; Chen et al., 2017; He et al., 2018a; Zhang et al., 2019a) to evaluate them via accuracy and macro-averaged F1 scores over all sentiment polarities. For datasets without an official development set, we randomly sample 10% of the instances from the training set as the development set to find the best hyper-parameter setting, which is then used to train different models on the entire training set.

Table 3: Performance (accuracy and F1 scores) comparison of our best model (i.e., T-GCN and ALE on BERT-large with sentence-aspect pair input) with previous studies on all six benchmark datasets. Models using BERT-large and dependency information are marked by "*" and "†", respectively.

Effect of T-GCN
In the main experiments, for each encoder (i.e., BERT-base and BERT-large), we run two baselines: (1) only using BERT and (2) BERT with a normal GCN, where all edges are treated equally and the ABSA result is predicted based on the output of the last GCN layer. Table 2 reports the experimental results from all baselines and our models (the mean and standard deviation of the results for each group of models are reported in Appendix C). There are several observations. First, for both BERT-base and BERT-large encoders, although the models with normal GCN are able to enhance the BERT baselines, our models further improve the performance in both accuracy and F1 scores on all datasets. This observation clearly illustrates the effectiveness of incorporating dependency type information into GCN, which improves ABSA accordingly. Second, in most cases, our models that encode the sentence-aspect pair achieve higher results than the ones encoding the single sentence, which is not surprising because the aspect is emphasized in the input and thus provides more contextual information to be modeled for ABSA.

Comparison with Previous Studies
To further demonstrate the effectiveness of our approach, we compare the performance of our best model (i.e., T-GCN using the BERT-large encoder with sentence-aspect pair input) with previous studies on all datasets. The results are reported in Table 3, where our model outperforms previous studies, including the ones using BERT-large (marked by "*") and dependency information (marked by "†") (Huang and Carley, 2019; Wang et al., 2020; Tang et al., 2020), on all datasets in terms of both accuracy and F1 scores. In particular, compared with our approach, Huang and Carley (2019) use a variant of graph attention networks (GAT) but do not use dependency types; Wang et al. (2020) also use a variant of GAT and leverage relation types as well, but do not assign different weights to distinguish word-word relations; Tang et al. (2020) use a variant of GCN but do not use dependency type information. Our model shows its superiority over these studies since we not only assign different weights to dependencies, but also comprehensively leverage the dependency parsing results with both word relations and their dependency types, as well as fine-grained encoding results from multiple T-GCN layers.

Ablation Study
To explore the effectiveness of the different components in our model, i.e., the type-aware graph (TG), attention (Att), and ALE, we conduct an ablation study based on our best model (i.e., T-GCN on the BERT-large encoder with sentence-aspect pair input). The experimental results on all datasets with respect to different combinations of these components are reported in Table 4, with the results of the full model and the baseline with normal GCN illustrated in the first (ID: 1) and last (ID: 8) rows, respectively. Herein, models without ALE (ID: 4-6) use the output of the last T-GCN layer (i.e., the third layer) to predict the sentiment polarity (we obtain similar results when using the output of intermediate layers; the details are reported in Appendix D).

Table 4: Experimental results of the ablation study on the six datasets, with different configurations applied to our best model. "TG" refers to the type-aware graph; "Att" denotes the attention mechanism in T-GCN; "ALE" stands for the attentive layer ensemble. "√" and "×" indicate whether the corresponding component is used.

Here are some observations. First, the results clearly indicate that model performance drops on all datasets if any component is excluded from the full model. This indicates that all three components play important roles in our approach to enhance ABSA; each one makes its unique contribution to the full model. Second, for each single component, comparing the results of the models with a particular module (ID: 5-7) against the GCN baseline (ID: 8) demonstrates that the attention mechanism is the most important one for improving model performance: on all datasets, the model with attention (ID: 6) outperforms the others. This observation complies with our intuition, because the attention directly guides the model to distinguish the contextual information relevant to the aspect words, so that informative words are highlighted and ABSA is improved accordingly.

Impact of Different T-GCN Layers
Besides those components, we also investigate the effect of each layer when our model is trained on different datasets. In doing so, we perform experiments on all datasets using our best performing model and use the weight ($\delta^{(l)}$ in Eq. (8)) assigned to each T-GCN layer to identify its contribution. The results are illustrated in Figure 5, with the weights for the 1st, 2nd, and 3rd T-GCN layers drawn in blue, green, and orange bars, respectively.

Figure 5: The histograms of weights assigned to different T-GCN layers (blue, green, and orange bars refer to the weights for the 1st, 2nd, and 3rd layers, respectively) in ALE with respect to each dataset.
We have the following observations. First, all layers contribute to the final prediction for ABSA, which complies with our expectation and confirms the validity of leveraging information from all GCN layers. The model is therefore able to exploit more comprehensive contextual information than one that only uses the output of the last layer. Second, interestingly, as shown in the histograms, for most datasets (i.e., LAP14, REST14, REST15, REST16, and MAMS), the second layer of T-GCN contributes the most among the three layers. A possible reason is that (1) the second layer is able to encode contextual information from a larger range (the edges in the first layer only cover words with direct relations, while the second and third layers provide indirect relations, i.e., second- and third-order dependencies in practice); and (2) compared to the third layer, the second layer may introduce less irrelevant information from multi-word relations. Third, we also notice that for TWITTER, the weight distribution among the three layers is rather different from the other datasets, with the first and last layers contributing more to ABSA. This can be explained by the fact that TWITTER is social media data, where sentences are generally short and less organized, so that our model may rely more on information from either the local context or the entire sentence for ABSA.

Figure 6: Visualization of the weights assigned to different edges and dependency types in each T-GCN layer for an example sentence with two aspects (in red) with conflicting sentiment polarities. The edge and type weights (in blue) for "OK" in the first and second layers are illustrated on the left, while such weights (in green) for "food" and the ALE weights (in yellow) for each layer are illustrated on the right. Deeper color refers to higher weight.

Case Study
To further illustrate the effectiveness of T-GCN in leveraging dependency type information and weighting salient word relations to improve ABSA, we conduct a case study using our model to process the sentence "The food was OK but the service was so poor that the food was cold by the time everyone in my party was served" from REST16. In this sentence, there are two aspects with contrasting sentiment polarities, i.e., "food" and "service" have positive and negative sentiments suggested by "OK" and "poor", respectively.
To demonstrate the effectiveness of our model in processing such sentences with conflicting sentiments, on the right part of Figure 6 we visualize the weights (in green) assigned to the edges connected to "food" by the attention in all T-GCN layers, as well as the ALE weights (in yellow) for each layer, where deeper color refers to higher weight. Among those edges, except for the self-connection, the edge between "food" and "OK" receives the highest weight in every layer, and the second layer receives the highest weight in ALE. Note that in this case, the reason why T-GCN works can be explained as follows: when multiple layers are used in a GCN model, the edges connecting to "OK" also influence the ABSA results because indirect relations are introduced across layers. As a result, the noisy connection between "OK" and "poor" may contribute to the prediction, and a normal GCN could fail on this case because it lacks a mechanism to distinguish this connection from other edges. Therefore, on the left part of Figure 6, we also visualize the weights for the edges connecting to "OK" in the first and second T-GCN layers, where the informative word relations and their dependency types receive much heavier weights than the noisy ones. Moreover, we notice that the dependency type for the edge between "OK" and "poor" is "conj" (conjunction), which suggests that "poor" is syntactically parallel with "OK" and thus less likely to provide essential sentiment guidance for "OK". Overall, this case study illustrates that, with the help of the dependency types and attention used in T-GCN, our model successfully identifies "OK" as the most important contextual information for determining the sentiment of "food", and also shows that the final prediction relies on contributions from different T-GCN layers.

Related Work
ABSA is in the line of research on sentiment analysis at a fine-grained level, focusing on categorizing sentiment polarities for a specific aspect (e.g., "chicken") or category (e.g., "food") in a sentence. Conventionally, this task is formulated as classifying a sentence-aspect pair, and most studies explore the contextual information between the aspect and the entire sentence to facilitate sentiment analysis (Dong et al., 2014; Wang et al., 2016; Tang et al., 2016a; Ma et al., 2017; Chen et al., 2017; Xue and Li, 2018; Wang et al., 2020; Tang et al., 2020). To further enhance the modeling of contextual information, dependency parses have been leveraged by many studies, through, e.g., adaptive recursive neural networks (Dong et al., 2014), attention mechanisms (He et al., 2018a), and key-value memory networks (Tian et al., 2021). More recently, studies such as Zhang et al. (2019a), Huang and Carley (2019), Wang et al. (2020), and Tang et al. (2020) leveraged graph neural models (e.g., GCN) for ABSA, with their graphs built upon the dependency trees obtained from off-the-shelf dependency parsers, and demonstrated promising results. The models in these studies normally focus on building the graph with the dependency structure without considering dependency types, and meanwhile treat the edges in the graph equally. In addition, they usually use the output of the last layer to predict sentiment labels although their models consist of multiple layers. Thus, our approach differs from previous graph-based ones in several aspects, including the integration of dependency type information, the application of attention to edges, and the ensemble of multiple layers to comprehensively learn from the graph model.

Conclusion
In this paper, we propose a neural approach for ABSA with T-GCN, where the input graph is built on the dependency tree of the input sentence. Specifically, the edges in the graph are constructed on top of both the dependency relations and their types for the input sentence; for each word, we use attention to weight all type-aware edges associated with it in the T-GCN; we also apply attentive layer ensemble to comprehensively learn contextual information from different T-GCN layers. Experimental results on six widely used English benchmark datasets demonstrate the effectiveness of our approach, where state-of-the-art performance is achieved on all datasets. Further analyses illustrate the validity of incorporating type information into our model as well as applying attentive ensemble to learn from its multiple layers.

B. Model Size and Inference Speed

Table 6 reports the number of trainable parameters and the inference speed (sentences per second) of the baseline models (BERT) and our best performing models (i.e., T-GCN and ALE using the BERT-large encoder with sentence-aspect pair input) on all datasets. All models are run on an Nvidia Quadro RTX 6000 GPU.

C. Mean and Deviation of the Results
In our experiments, we run models using the BERT-base and BERT-large encoders with different configurations, where models using single-sentence input (S) or sentence-aspect pair input (P), as well as models using normal GCN (+ GCN) or T-GCN (+ T-GCN), are tested. For each model, we train it with the best hyper-parameter setting using five different random seeds. We report the mean (µ) and standard deviation (σ) of the experimental results (accuracy and F1 scores) on all datasets in Table 7.

D. Effect of T-GCN Layers
In our ablation study, we run models with different configurations of the type-aware graph (TG), attention (Att), and ALE, where three T-GCN layers are used. For the settings without ALE (ID: 4-6 and 8), we also try different numbers of T-GCN layers and report the results in Table 8, where a similar trend is observed.

Table 6: Numbers of trainable parameters (Para.) in different models and the inference speed (sentences per second) of these models on the test sets of all datasets.