Aspect-Category based Sentiment Analysis with Hierarchical Graph Convolutional Network

Most research on aspect-based sentiment analysis aims at identifying the sentiment polarities toward explicit aspect terms, while ignoring implicit aspects in text. To capture both explicit and implicit aspects, we focus on aspect-category based sentiment analysis, which involves joint aspect category detection and category-oriented sentiment classification. However, only a few preliminary studies have addressed this problem so far. Shortcomings in the way they define the task make it difficult for their approaches to effectively learn the inner-relations between categories and the inter-relations between categories and sentiments. In this work, we re-formalize the task as a category-sentiment hierarchy prediction problem, with a hierarchy output structure that first identifies the multiple aspect categories in a piece of text and then predicts the sentiment for each identified category. Specifically, we propose a Hierarchical Graph Convolutional Network (Hier-GCN), where a lower-level GCN models the inner-relations among multiple categories, and a higher-level GCN captures the inter-relations between aspect categories and sentiments. Extensive evaluations demonstrate that our hierarchy output structure is superior to existing ones, and that the Hier-GCN model consistently achieves the best results on four benchmarks.


Introduction
As an important fine-grained subtask in the field of sentiment analysis, Aspect-Based Sentiment Classification (ABSC) aims to detect the sentiment polarities of aspect terms mentioned in a review sentence. For example, in Table 1, given the aspect term food, the model is expected to identify its sentiment polarity as positive. The main limitation of ABSC lies in the fact that aspect terms must be annotated before aspect sentiment classification, which is impractical in real applications. To address this problem, many studies have explored Aspect Term-based Sentiment Analysis (ATSA), which performs aspect term extraction and aspect sentiment classification jointly (Mitchell et al., 2013; Zhang et al., 2015; Luo et al., 2019; Hu et al., 2019).
However, ATSA still suffers from a major obstacle: it considers only explicit aspects and completely ignores implicit aspects in text. Take the review in Table 1 as an example. Although the second clause does not mention any aspect term, it clearly expresses the user's negative sentiment towards the service. More importantly, we observe that existing benchmark datasets contain a large number of such reviews. For instance, Table 2 shows the proportion of reviews with implicit aspects in the Restaurant datasets from SemEval 2015 and 2016, and it is clear that nearly 25% of the examples contain implicit aspects.
Since these examples also convey valuable information, we should no longer ignore them, as the previous ATSA methods do.
Motivated by this, we focus on Aspect-Category based Sentiment Analysis (ACSA) in this paper, aiming to perform joint aspect category detection and category-oriented sentiment classification. Compared with ATSA, ACSA has two advantages. On the one hand, for each aspect mentioned in a review sentence, even if it does not have a corresponding aspect term, there must be a corresponding aspect category, so that we can identify the user's sentiment over it. On the other hand, from the perspective of application in real scenarios, although ACSA does not extract the aspect terms explicitly, it already meets the demand of opinion summarization at the aspect-level granularity. However, research in this area is relatively rare, and only a few preliminary studies have been carried out. Schmitt et al. (2018) proposed a joint model that extends the sentiment labels with one more dimension to indicate the occurrence of each aspect category, which is shown to outperform traditional pipeline methods. Another feasible solution is to take the Cartesian product of aspect categories and sentiment labels, which essentially performs multi-label sentiment classification for each aspect category. Nevertheless, most of these methods fail to explicitly model the hierarchical relationship between aspect category detection and category-oriented sentiment classification. In particular, when there are many aspect categories, it is difficult for these methods to learn the inner-relations among multiple categories and the inter-relations between categories and sentiments.
In this paper, we re-formalize the task as a category-sentiment hierarchy prediction problem, which contains a two-layer hierarchy output structure. The lower layer is to detect aspect categories, which can be modelled as a multi-label classification problem (i.e., one review may contain more than one category). The higher layer is to perform category-oriented sentiment classification, which can be modelled as a multi-class classification problem for each detected category.
Under the hierarchy output structure, our model contains three modules: the bottom module leverages BERT to obtain hidden representations of the two sub-tasks respectively. In the middle module, we propose a Hierarchical Graph Convolutional Network (Hier-GCN), where the lower-level GCN is to model the inner-relations among multiple categories, and the higher-level GCN is to capture the interrelations between categories and category-oriented sentiments. Based on the interactive representations generated from Hier-GCN, the top module performs category-sentiment hierarchy prediction to generate the final output.
We conduct experiments on four benchmark product review datasets from SemEval 2015 and 2016. The results prove that the hierarchy output structure achieves better performance than other existing structures. On this basis, the proposed Hier-GCN architecture can bring additional performance gains, and consistently achieves the best results across the four datasets. Further analysis also proves the effectiveness of Hier-GCN in cases of both explicit aspect and implicit aspect.

Problem Formalization
Given a review sentence with n words r = [w_1, ..., w_n], Aspect Category-based Sentiment Analysis (ACSA) aims to detect all the mentioned aspect categories and identify the sentiment for each detected category. Formally, let C = {c_1, ..., c_m} be a set of m predefined aspect categories, and s = {positive, negative, neutral} be the label set of sentiment polarities. For each input r, the goal of ACSA is to generate a set of category-sentiment pairs, denoted as {..., (ŷ^c_i, ŷ^s_i), ...}, where ŷ^c_i is the i-th aspect category mentioned in r and ŷ^s_i is its corresponding sentiment.

As shown in Fig. 1(a), one possible solution to this task is to consider all the combinations of category-sentiment pairs, denoted as Cartesian Product. Specifically, for each aspect category c_i, we can model its sentiment prediction as a multi-label classification task to generate the output ŷ^s_i ∈ {0, 1, 2}, which indicates negative, neutral, and positive sentiment polarity, respectively. Note that if ŷ^c_i = 0, it indicates the absence of the category c_i and its corresponding sentiment polarity. Since Cartesian Product suffers from the risk of generating multiple sentiments for each category, an alternative solution is to add one label to the sentiment label space to predict the presence or absence of each category (Schmitt et al., 2018). In this case, the occurrence and the sentiment prediction of each category are unified as a simple multi-class classification problem.

Figure 1: Comparison between the hierarchical prediction structure and other available structures: (a) refers to "Cartesian product", (b) refers to "Add one dimension", and (c) refers to "Hierarchy", respectively.
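To make the two flat formulations concrete, here is a hypothetical sketch (the toy category set and helper names are ours, not the paper's) of how a review's gold category-sentiment pairs map onto each label space:

```python
# Toy illustration of the two flat label encodings for m = 3 categories.
CATEGORIES = ["food", "service", "ambience"]
SENTIMENTS = ["negative", "neutral", "positive"]

def add_one_dim_labels(gold_pairs):
    """One 4-class label per category: index 0 = category absent,
    1..3 = negative/neutral/positive (the Schmitt et al. style)."""
    labels = [0] * len(CATEGORIES)
    for cat, sent in gold_pairs:
        labels[CATEGORIES.index(cat)] = 1 + SENTIMENTS.index(sent)
    return labels

def cartesian_labels(gold_pairs):
    """Binary indicator over all m * 3 (category, sentiment) combinations."""
    labels = [0] * (len(CATEGORIES) * len(SENTIMENTS))
    for cat, sent in gold_pairs:
        labels[CATEGORIES.index(cat) * 3 + SENTIMENTS.index(sent)] = 1
    return labels

pairs = [("food", "positive"), ("service", "negative")]
print(add_one_dim_labels(pairs))   # [3, 1, 0]
print(cartesian_labels(pairs))     # [0, 0, 1, 1, 0, 0, 0, 0, 0]
```

The Cartesian encoding makes it possible (in principle) for a model to activate two sentiments for the same category, which is exactly the risk noted above.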
However, both approaches simply unify aspect category detection and category-oriented sentiment classification as one task, while ignoring the internal relationship between the two sub-tasks. Intuitively, if an aspect category is not detected, it is unnecessary to perform sentiment classification for it; and if multiple categories are detected, the relations among their sentiment outputs should be fully utilized. Motivated by this, we re-formalize the ACSA task as a category-sentiment hierarchy prediction problem:

p(y^c, y^s | r) = p(y^c | r) p(y^s | y^c, r),  (1)

where the first component p(y^c | r) is modeled as a multi-label classification problem for aspect category detection, and the second component p(y^s | y^c, r) is modeled as a multi-class classification problem to perform sentiment classification for each detected category.
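As a toy illustration (our own sketch, not the paper's code), the hierarchy implies a simple decoding rule in which sentiment is only read out for categories that pass the detection step:

```python
# Hierarchical decoding implied by p(y^c, y^s | r) = p(y^c | r) p(y^s | y^c, r):
# the lower layer detects categories; the higher layer assigns one sentiment
# to each detected category. Probabilities below are toy inputs.
SENTIMENTS = ["negative", "neutral", "positive"]

def decode(category_probs, sentiment_probs, threshold=0.5):
    pairs = []
    for i, p_c in enumerate(category_probs):
        if p_c > threshold:                      # lower layer: detection
            j = max(range(3), key=lambda k: sentiment_probs[i][k])
            pairs.append((i, SENTIMENTS[j]))     # higher layer: sentiment
    return pairs

cat_p = [0.9, 0.7, 0.1]                          # m = 3 categories
sent_p = [[0.1, 0.1, 0.8], [0.7, 0.2, 0.1], [0.3, 0.3, 0.4]]
print(decode(cat_p, sent_p))  # [(0, 'positive'), (1, 'negative')]
```

By construction, each category receives at most one sentiment, avoiding the multiple-sentiment risk of the Cartesian formulation.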

The Proposed Approach
The overall architecture of the proposed framework is illustrated in Fig. 2. We use BERT as the basic encoder of our model to encode the contextual information of a sentence and generate context-aware category representations and sentiment representations, respectively (Section 3.1). Next, the proposed hierarchical graph convolutional network captures the inner-relations among multiple categories and the inter-relations between categories and sentiment polarities (Section 3.2). Finally, a hierarchical output and integration module is designed to construct the final prediction from the category prediction and the sentiment prediction (Section 3.3).

Feature Extraction with BERT
We adopt Bidirectional Encoder Representations from Transformers (BERT) as our sentence encoder, which is pre-trained on a huge amount of text with a masked language model objective and has been shown to achieve state-of-the-art results on a broad set of NLP tasks. Let H ∈ R^{d×(n+2)} denote the final hidden states generated from BERT, where we insert two special tokens (i.e., [CLS] and [SEP]) at the beginning and the end of each input r. Due to space limitations, we omit a detailed description of BERT and refer readers to (Devlin et al., 2018). For category representations, we further use m separate self-attention sub-layers on top of H to obtain the representations of the m categories, denoted by C ∈ R^{d×m}. Besides, following the practice in (Devlin et al., 2018), we employ the hidden state of the first token [CLS] as the shared sentiment representation for each category, denoted by S ∈ R^d.
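As an illustration of this pooling step, the following NumPy sketch (shapes, parameter names, and the use of one learned query per category are our assumptions, not the released code) shows how m category-specific attention queries turn the BERT states into C, with the [CLS] state as S:

```python
import numpy as np

# Toy sketch of category-specific attention pooling over BERT states.
rng = np.random.default_rng(0)
d, n = 8, 5                        # hidden size, number of word tokens
H = rng.normal(size=(n + 2, d))    # stand-in for BERT states ([CLS]..[SEP])
m = 3                              # number of aspect categories
queries = rng.normal(size=(m, d))  # one (normally learned) query per category

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# C[i] = attention-weighted sum of token states for category i
C = np.stack([softmax(H @ q) @ H for q in queries])   # shape (m, d)
S = H[0]                           # [CLS] state as shared sentiment rep
print(C.shape, S.shape)            # (3, 8) (8,)
```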

Hierarchical Graph Convolutional Network (Hier-GCN)
[Figure 2: The overall architecture of the proposed framework, consisting of a BERT feature extractor, a category GCN sub-layer with max pooling over the category representations for multi-label category prediction, and a category-sentiment sub-layer combining the global sentiment ([CLS]) representation for category-sentiment prediction, illustrated on the review "The food here is rather good, but only if you like to wait for it."]

Motivation. As introduced before, we model the task of ACSA as a hierarchy prediction problem with two sub-tasks: Aspect Category Detection and Category-oriented Sentiment Classification. For Aspect Category Detection, since some aspect categories tend to frequently co-occur with each other, it is necessary to model the inner-relations between categories (e.g., in Fig. 1, if food is mentioned in a user review, the service of the restaurant is likely to be mentioned as well). Similarly, for Category-oriented Sentiment Classification, since the sentiment prediction is highly dependent on the prediction of the aspect category, it is crucial to model the inter-relations between categories and sentiments (e.g., in Fig. 1, we only need to predict sentiment for the Food and Service categories but not the others). As the graph convolutional network (GCN) can capture these two kinds of relations by incorporating category-category and category-sentiment co-occurrence as prior knowledge, we employ it to model the inner-relations and the inter-relations simultaneously. Specifically, we propose a hierarchical GCN model with two sub-layers, where the lower GCN sub-layer learns the inner interactions between categories, and the higher GCN sub-layer learns the inter interactions between categories and sentiments. The two sub-layers are coupled together as a Hier-GCN structure.

Category GCN Sub-Layer
To adapt GCN to model the inner-relations between multiple categories, we construct a directed graph by treating each category as a node, and obtain the adjacency matrix M^c ∈ R^{m×m} by calculating the co-occurrence between every category pair over the entire training corpus. Formally, we use M^c_{i,j} to denote the transition probability of having the j-th category given the i-th category:

M^c_{i,j} = count(c_i, c_j) / count(c_i),

where count(c_i) refers to the number of samples with the i-th category, and count(c_i, c_j) denotes the number of samples with both the i-th and the j-th categories. Note that in most cases, M^c_{i,j} is not equal to M^c_{j,i}, which indicates that for a co-occurring category pair (c_i, c_j), the high-frequency category will have a strong effect on the low-frequency one, but not vice versa. This implies that modeling the inner-relations between categories may help detect more low-frequency categories, which is especially significant for large-scale multi-label classification with many low-frequency categories.
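The construction of M^c can be sketched as follows (a minimal Python version; the handling of the diagonal is not specified in the text, so self-loops are omitted here):

```python
from collections import Counter

# Build M^c from training labels: M[i][j] = count(c_i, c_j) / count(c_i),
# i.e. the probability of seeing category j in a sample given category i.
def build_category_adjacency(samples, m):
    single = Counter()
    pair = Counter()
    for cats in samples:                 # cats: set of category ids per sample
        for i in cats:
            single[i] += 1
            for j in cats:
                if i != j:
                    pair[(i, j)] += 1
    M = [[0.0] * m for _ in range(m)]
    for (i, j), c in pair.items():
        M[i][j] = c / single[i]          # asymmetric: M[i][j] != M[j][i]
    return M

samples = [{0, 1}, {0, 1}, {0}, {0}, {1, 2}]
M = build_category_adjacency(samples, 3)
print(M[0][1], M[1][0])   # 0.5 vs 0.666…: the asymmetry discussed above
```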
With the adjacency matrix M^c, we perform the standard graph convolution on the node representations:

X^{l+1} = f(M^c X^l W^l + b^l),

where the category representation C serves as the initial state X^0, l refers to the l-th category GCN sub-layer, W^l ∈ R^{d×d} and b^l ∈ R^d are the linear transformation weight and bias, and f(·) is a nonlinear activation function, i.e., GELU.
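A single sub-layer, under the shapes assumed above, can be sketched in NumPy (parameters are random stand-ins here; in the model W^l and b^l are learned):

```python
import numpy as np

# One category-GCN sub-layer: X^{l+1} = GELU(M^c X^l W^l + b^l), X^0 = C.
def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def category_gcn_layer(X, M, W, b):
    """X: (m, d) node states, M: (m, m) adjacency, W: (d, d), b: (d,)."""
    return gelu(M @ X @ W + b)

rng = np.random.default_rng(0)
m, d = 4, 8
X = rng.normal(size=(m, d))        # initial states X^0 (the category reps C)
M = np.full((m, m), 1.0 / m)       # toy adjacency matrix
W, b = rng.normal(size=(d, d)), np.zeros(d)
print(category_gcn_layer(X, M, W, b).shape)   # (4, 8)
```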
After obtaining the output X^{l+1} of the (l+1)-th sub-layer, we not only feed it to the next category GCN sub-layer as the input category representation, but also combine it with the sentiment representations as the initial state of the (l+1)-th category-sentiment GCN sub-layer. Note that the category representation from the final Hier-GCN layer is used to perform multi-label classification for aspect category detection.

Category-Sentiment GCN Sub-Layer
Similarly, to employ GCN to model the inter-relations between the m categories and the m × 3 category-oriented sentiments, we construct another directed graph by treating the m categories and the m × 3 sentiments as graph nodes. Since our final goal is to predict the sentiment for each detected category, we obtain a separate adjacency matrix for each of the positive, negative, and neutral sentiments. We use M^{c-s}_{i,j}(s) to denote the transition probability of the sentiment polarity s towards the j-th category given the i-th category:

M^{c-s}_{i,j}(s) = count(c_i, (s|c_j)) / count(c_i),

where (s|c_j) denotes the sentiment polarity s towards the j-th category, and s ∈ {positive, negative, neutral}.
Next, we obtain the fused sentiment-sensitive category representation F^l ∈ R^{m×d} for each input node:

F^l_i = f(W^{c,s}_l (X^{l+1}_i ⊕ S^l) + b^{c,s}_l),

where ⊕ denotes the concatenation operation, S^l is the sentiment representation from the previous layer (with S^0 = S), and W^{c,s}_l ∈ R^{d×2d} and b^{c,s}_l ∈ R^d refer to the weight and bias.
With the node representations and the three adjacency matrices, the standard graph convolution is then performed on the input nodes to obtain F'^l ∈ R^{m×d}:

F'^l = f(Σ_{s} M^{c-s}(s) F^l W^s_l + b^s_l),

where W^s_l ∈ R^{d×d} and the bias b^s_l are learnable parameters. Finally, we obtain the category-oriented sentiment representation S^{l+1} as:

S^{l+1} = dense(F'^l),

where dense(·) is a parameter-sharing feed-forward network with tanh as the activation function.
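Putting the fusion and convolution steps together, one plausible reading of this sub-layer is the following NumPy sketch (the summation over the three sentiment adjacency matrices, the tanh fusion, and all shapes are our assumptions, not the released implementation):

```python
import numpy as np

# Sketch of the category-sentiment GCN sub-layer: fuse each category state
# with the sentiment state, then convolve with one adjacency matrix per
# sentiment polarity, and map through a shared tanh feed-forward layer.
def cs_gcn_layer(X, S, M_cs, params):
    m, d = X.shape
    # 1) fused sentiment-sensitive category representation F^l
    fused_in = np.concatenate([X, np.tile(S, (m, 1))], axis=1)  # (m, 2d)
    F = np.tanh(fused_in @ params["W_cs"] + params["b_cs"])     # (m, d)
    # 2) graph convolution with the three sentiment adjacency matrices
    out = sum(A @ F for A in M_cs) @ params["W_s"] + params["b_s"]
    # 3) shared feed-forward with tanh to produce S^{l+1}
    return np.tanh(out @ params["W_o"] + params["b_o"])         # (m, d)

rng = np.random.default_rng(1)
m, d = 4, 8
params = {"W_cs": rng.normal(size=(2 * d, d)), "b_cs": np.zeros(d),
          "W_s": rng.normal(size=(d, d)), "b_s": np.zeros(d),
          "W_o": rng.normal(size=(d, d)), "b_o": np.zeros(d)}
X, S = rng.normal(size=(m, d)), rng.normal(size=d)
M_cs = [np.full((m, m), 1.0 / m) for _ in range(3)]  # toy pos/neg/neu matrices
print(cs_gcn_layer(X, S, M_cs, params).shape)        # (4, 8)
```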

Hierarchical Prediction Integration
Based on the final category representation X^L_i and sentiment representation S^L_i generated from the L-th Hier-GCN layer, we obtain the probability p^c_i ∈ R^1 of having the i-th category and its corresponding sentiment probability distribution p^s_i ∈ R^3 as follows:

p^c_i = sigmoid(W^c X^L_i + b^c),  p^s_i = softmax(W^s S^L_i + b^s),

where the parameters W^s and b^s used for sentiment classification are shared across all categories. The Hierarchical Prediction Module is then used to obtain the final prediction of the i-th category-sentiment pair:

(ŷ^c_i, ŷ^s_i) = (I(p^c_i > 0.5), argmax_j p^s_{i,j}),

where I(·) is an indicator function. The loss of ACSA has two parts, in correspondence with Equation (1). The first component (category detection) is modeled as a multi-label classification problem, and its cross-entropy loss, i.e., the negative log-likelihood of p(y^c|r), is defined as:

L^c = -Σ_{i=1}^{m} [y^c_i log p^c_i + (1 - y^c_i) log(1 - p^c_i)],

where y^c_i ∈ {0, 1} is the ground truth of the i-th category. The second component (category-oriented sentiment classification) is modeled as a multi-class classification problem for each category, and its cross-entropy loss, i.e., the negative log-likelihood of p(y^s|y^c, r), is defined as:

L^s = -Σ_{i=1}^{m} y^c_i Σ_{j=1}^{3} y^s_{i,j} log p^s_{i,j},

where y^s_{i,j} is the ground truth of the j-th sentiment for the i-th category. Since the cross-entropy loss is equivalent to the negative log-likelihood, the final loss function of ACSA, i.e., the negative log-likelihood of p(y^c, y^s|r), equals the sum of the two component losses:

L = L^c + L^s.
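The two-part loss can be sketched as follows (toy probabilities, our own notation; the gold category indicator masks the sentiment term so that only detected gold categories contribute):

```python
import numpy as np

# Binary cross-entropy for category detection plus sentiment cross-entropy
# restricted to gold categories (y^c_i acts as a mask on the second term).
def acsa_loss(p_c, p_s, y_c, y_s):
    """p_c: (m,) detection probs; p_s: (m, 3) sentiment distributions;
    y_c: (m,) 0/1 gold categories; y_s: (m,) gold sentiment index."""
    eps = 1e-12
    l_cat = -np.sum(y_c * np.log(p_c + eps)
                    + (1 - y_c) * np.log(1 - p_c + eps))
    l_sent = -np.sum(y_c * np.log(p_s[np.arange(len(y_c)), y_s] + eps))
    return l_cat + l_sent

p_c = np.array([0.9, 0.2, 0.8])
p_s = np.array([[0.1, 0.1, 0.8], [1/3, 1/3, 1/3], [0.7, 0.2, 0.1]])
y_c = np.array([1, 0, 1])
y_s = np.array([2, 0, 0])          # sentiment index (ignored where y_c = 0)
print(round(acsa_loss(p_c, p_s, y_c, y_s), 3))   # 1.131
```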

Experiments

Datasets and Experimental Settings
We evaluate our model on the benchmark SemEval 2015 and 2016 datasets (Pontiki et al., 2015; Pontiki et al., 2016), which contain customer reviews from the laptop and restaurant domains. To ensure that an aspect category can accurately describe the object of an opinion expression, a category is predefined in the SemEval 2015 and 2016 datasets as a specific entity type E and its attribute label A; the E#A pair defines an aspect category. The details of the datasets are summarized in Table 3. We compute the Precision and Recall of the predicted (category, sentiment) pairs, and use the Micro-F1 score as the final evaluation metric. We randomly split the original training set 9:1 into training and validation sets. The reported results are averaged over 10 runs to obtain statistically stable scores. We use BERT-base as the basic encoder, and refer readers to (Devlin et al., 2018) for the detailed BERT-base model settings. We adopt the same AdamW optimizer as in BERT. The learning rate and batch size are set to 5e-5 and 8, respectively. We set the maximum sentence length to 128 for SemEval 2015 and 100 for SemEval 2016, run 20 epochs for every fold, use a dropout rate of 0.1, and set the hidden size of the Transformer to 768.
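The pair-level evaluation described above can be sketched with a small helper (our own, assuming exact matching of (category, sentiment) pairs against the gold annotations):

```python
# Micro-averaged precision/recall/F1 over predicted (category, sentiment)
# pairs: true/false positives and false negatives are pooled over reviews.
def micro_f1(pred_pairs, gold_pairs):
    """Each argument: one set of (category, sentiment) pairs per review."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_pairs, gold_pairs):
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

pred = [{("food", "pos"), ("service", "neg")}, {("food", "neg")}]
gold = [{("food", "pos")}, {("food", "neg"), ("ambience", "pos")}]
print(round(micro_f1(pred, gold), 3))   # 0.667
```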

Compared Systems
In Section 2, we summarized several problem formalization methods, including Cartesian product and Add one dimension, which we use as baseline systems for comparison. We further implement a pipeline method of category detection and sentiment classification. All compared systems are as follows:
• Cartesian-BERT: The Cartesian product method with BERT as the sentence encoder.
• Pipeline-BERT: A pipeline of individual aspect category detection and category-oriented sentiment classification, both with BERT as the review encoder. This method can better model each sub-task, but ignores the associations between the two sub-tasks.
• AddOneDim-LSTM: The Add one dimension method with LSTM as the sentence encoder. This represents the state-of-the-art work of Schmitt et al. (2018).

Table 4: Main results of the ACSA task on four benchmark datasets.
• AddOneDim-BERT: The Add one dimension method with BERT as the sentence encoder.
• Hier-BERT: The hierarchy method with BERT as the sentence encoder.
• Hier-Transformer-BERT: As a strong baseline, we use Transformer to model the inner-relations between categories and the inter-relations between categories and sentiments on top of Hier-BERT.

It should be noted that, except for AddOneDim-LSTM, all the above systems are proposed in this work.

Table 4 shows the results of the different systems on the four datasets. It can be observed that AddOneDim-LSTM has difficulty achieving good performance when the number of categories is large; for example, on the Laptop dataset, its training fails to converge. In comparison, AddOneDim-BERT performs much better. Cartesian-BERT is less effective, probably due to the class-sparsity problem, especially when the number of categories is large (e.g., its Recall on Laptop is very low). The precision of the pipeline method is low although its recall is high, because it ignores the relations between the sub-tasks and produces too many sentiment predictions without taking the category-sentiment restriction into consideration. This also indicates that our proposed category-sentiment hierarchy structure is more suitable for joint aspect category detection and category-oriented sentiment classification.

Main Results
We can see that among all systems, our hierarchical methods (Hier-BERT, Hier-Transformer-BERT, and Hier-GCN-BERT) achieve significantly better F1 performance than the other baseline systems; among them, Hier-GCN-BERT obtains the best performance on three of the datasets. This indicates that modeling the hierarchical relationship can better describe the original problem on the basis of hierarchical label prediction.

GCN vs Transformer
In Table 4, we further present the results of Hier-Transformer-BERT, where we replace GCN with Transformer to learn the relations among categories and sentiments. The number of Transformer layers is kept the same as the number of GCN layers in Hier-GCN-BERT: the first Transformer layer is used to model the inner-relations between categories, and the second layer is used to capture the inter-relations between categories and sentiments. It can be seen that Hier-Transformer-BERT is quite competitive, but it still performs below Hier-GCN-BERT on three datasets and is only slightly better on Restaurant 2015. The improvement of GCN over Transformer is more substantial when the number of categories is larger (e.g., on the Laptop dataset), owing to GCN's ability to learn the inner-relations between categories and the inter-relations between categories and sentiments.

Ablation Study
In Table 5, we conduct experiments to test the effectiveness of each component in Hier-GCN-BERT.
It can be seen that both C-GCN-BERT and CS-GCN-BERT bring incremental gains over the basic hierarchy prediction model Hier-BERT. Of the two sub-GCNs, the category GCN plays a more important role than the category-sentiment GCN, because it captures the inner-relations among multiple categories and lays the basis for the subsequent category-sentiment inter-relation learning.

Table 6: Impact of the number of layers in Hier-GCN-BERT.

Impact of the Number of Hier-GCN Layers
We further investigate the impact of the number of Hier-GCN layers L. Considering the over-smoothing problem caused by highly co-occurring nodes, we vary L from 1 to 3 and report the results in Table 6. The best performance is obtained when L = 2. Stacking more GCN layers may incorporate too much node co-occurrence information and result in non-discriminative representations.

Results on Implicit Aspects
As mentioned in Table 2, nearly 25% of the aspects in the Restaurant 2015 dataset have no annotated aspect terms; we refer to them as implicit aspects. To test our approach's ability to capture implicit aspects, we split the Restaurant 2015 test set into two parts, Explicit Aspects (containing 212 samples) and Implicit Aspects (containing 360 samples), and report the performance of category-sentiment hierarchy prediction in Table 7. We find that although the performance of our approach on the Implicit Aspects part is lower than on the Explicit Aspects part, it can still accurately identify a considerable number of category-sentiment pairs, and it outperforms all the compared systems on implicit aspects.
Case study: To intuitively verify the advantages of our model in detecting categories and predicting category-oriented sentiment polarity, we further study cases from AddOneDim-BERT, Hier-BERT, and Hier-GCN-BERT on the implicit aspects of Restaurant 2015, as shown in Table 8. The review "Once you try it for a special occasion beware.. you can't stop!" implies positive sentiment towards the miscellaneous category of the restaurant; only AddOneDim-BERT fails to detect the corresponding category. The review "I have never ever had such an unpleasant experience." implies a general negative sentiment towards the restaurant category; AddOneDim-BERT cannot detect the mentioned category, while Hier-BERT detects it correctly but predicts the sentiment polarity as positive. Only Hier-GCN-BERT predicts the category-sentiment pair correctly. Besides, in a complex context with both implicit and explicit aspects, Hier-GCN-BERT still performs better. The review "Highly impressed from the decor to the food to the hospitality to the great night I had!" contains two explicit aspects and two implicit aspects; all approaches can predict the explicit aspects, while only Hier-GCN-BERT finds one more category-sentiment pair.

Related Work
Aspect-Based Sentiment Analysis: Aspect-Based Sentiment Analysis (Hu and Liu, 2004;Ding et al., 2008) is an important fine-grained subtask in sentiment analysis (also called opinion mining). Most recent approaches can be categorized into two branches, namely Aspect Term-based Sentiment Analysis (ATSA) and Aspect Category-based Sentiment Analysis (ACSA).
ATSA aims to detect aspect terms and identify their corresponding sentiment polarities given a piece of text. Mitchell et al. (2013) first explored the end-to-end task by applying traditional approaches with shallow features to extract aspect terms and their sentiments jointly. Later, Zhang et al. (2015) extended this line of work with neural network models.

Different from ATSA, ACSA targets identifying the aspect category as well as its corresponding sentiment. Most existing studies on ACSA focus on its sub-tasks, namely, aspect category detection (ACD) and category-oriented sentiment classification (CSC). For the task of ACD, Zhou et al. (2015) and Schouten et al. (2017) proposed algorithms to predict categories. More recently, Movahedi et al. (2019) proposed an attention-based neural network method to identify different aspect categories based on different topics. For the task of CSC, neural networks have been proposed to make the most of context and aspect category information (Ruder et al., 2016; Wang et al., 2016; Xue and Li, 2018; Tay et al., 2018; Sun et al., 2019), and attention-based networks have been proposed to perform category and sentiment co-classification (Ma et al., 2018; Hu et al., 2018). Apart from the research on these two sub-tasks, a few studies have focused on the end-to-end ACSA task. Schmitt et al. (2018) proposed a joint model that transforms the task into a multi-class classification problem. However, most of these studies fail to explore the relations between ACD and CSC. Therefore, we propose a hierarchical graph network to achieve this goal.
Hierarchical Classification & GCN: The hierarchical classification problem has been studied in previous work. Silla and Freitas (2011) defined the task of hierarchical classification and presented a new perspective on hierarchical classification approaches, and Aly et al. (2019) investigated simple shallow capsule networks for hierarchical multi-label text classification. GCN can be applied to data with non-Euclidean structure: Bruna et al. (2013) proposed spectral graph convolutional neural networks, and Kipf and Welling (2016) designed a convolutional architecture via a localized first-order approximation of spectral graph convolutions. Inspired by these works, we propose a BERT-based Hier-GCN model, which decomposes the inner-relations between categories and the inter-relations between categories and sentiment polarities into two hierarchically related sub-graphs and makes the final predictions in a hierarchical manner.

Conclusion
In this paper, we first examined the limitations of existing approaches to the task of Aspect Category-based Sentiment Analysis, and proposed a hierarchy output structure that first detects the aspect categories and then performs sentiment classification for each detected category. Given this output structure, we further proposed a Hierarchical Graph Convolutional Network to capture the inner-relations between multiple categories as well as the inter-relations between categories and sentiments. Experimental results on four benchmark datasets show that our hierarchy output structure performs significantly better than the existing output structures, and that Hier-GCN can outperform a number of highly competitive approaches, including a strong hierarchical Transformer model that we also propose.