Joint Aspect Extraction and Sentiment Analysis with Directional Graph Convolutional Networks

End-to-end aspect-based sentiment analysis (EASA) consists of two sub-tasks: the first extracts the aspect terms in a sentence and the second predicts the sentiment polarities of those terms. Compared to pipeline and multi-task approaches, joint aspect extraction and sentiment analysis provides a one-step solution that predicts both aspect terms and their sentiment polarities through a single decoding process, which avoids mismatches between the predicted aspect terms and sentiment polarities, as well as error propagation. Previous studies on this task, especially recent ones, focus on using powerful encoders (e.g., Bi-LSTM and BERT) to model contextual information from the input, with limited effort paid to using advanced neural architectures (such as attention and graph convolutional networks) or leveraging extra knowledge (such as syntactic information). To extend such efforts, in this paper we propose directional graph convolutional networks (D-GCN) to jointly perform aspect extraction and sentiment analysis while encoding syntactic information, where dependencies among words are integrated into our model to enhance its ability to represent input sentences and thereby help EASA. Experimental results on three benchmark datasets demonstrate the effectiveness of our approach, where D-GCN achieves state-of-the-art performance on all datasets.


Introduction
End-to-end aspect-based sentiment analysis (EASA) aims to extract aspect terms in the text and predict their sentiment polarities so as to understand targeted sentiment towards particular objects. For example, in the sentence "The ambiance is minimal but food is not phenomenal", the aspect terms are "ambiance" and "food" and the sentiment polarities towards them are positive and negative, respectively. In general, there are three main types of approaches for this task: pipeline, multi-task, and joint-label approaches. Pipeline approaches (Mitchell et al., 2013; Zhang et al., 2015; Hu et al., 2019) perform aspect extraction and sentiment analysis in sequence, which is not straightforward and suffers from error propagation among different steps; multi-task approaches (Mitchell et al., 2013; Zhang et al., 2015; Ma et al., 2018; Luo et al., 2019; He et al., 2019; Hu et al., 2019) apply an encoder to the input and use separate decoding processes to extract aspects and predict their sentiments, where there could be mismatches between the two decoding results. In comparison, joint-label approaches (Mitchell et al., 2013; Zhang et al., 2015; Li and Lu, 2017; Li et al., 2019a; Hu et al., 2019) extract aspect terms and predict their sentiments simultaneously through a unified labeling scheme, which not only provides a one-step solution to EASA, but also avoids the aforementioned problems of the other two approaches.
In most cases, previous studies demonstrate that good modeling of contextual information is effective in improving EASA performance. However, these studies mainly rely on powerful encoders (e.g., Bi-LSTM, CNN, BERT) (Zhang et al., 2015; Ma et al., 2018; Schmitt et al., 2018; Li et al., 2019a; Li et al., 2019b; Luo et al., 2019; He et al., 2019; Hu et al., 2019) and pre-trained embeddings (e.g., GloVe, word2vec, FastText) (Schmitt et al., 2018; Li et al., 2019a) to learn contextual information, with limited effort paid to leveraging advanced architectures and extra knowledge for this task. Among such architectures, graph convolutional networks (GCN) have shown their effectiveness in conventional sentiment analysis, as well as in other tasks, e.g., text classification (Kipf and Welling, 2016), neural machine translation (Bastings et al., 2017), semantic role labeling, etc. Moreover, considering that discriminatively modeling the contextual features of a given word according to their positional relations to the word is helpful in text representation learning (Zhang et al., 2017; Song et al., 2018; Shaw et al., 2018; Tian et al., 2020b), any encoder for EASA could also benefit from adding such treatment to model the input text. Therefore, enhancing conventional GCN with directional information for different parts of the input is expected to help distinguish them and appropriately model the contextual information for EASA.

Figure 1: The overall architecture of our approach, where the graph is built upon the dependency tree of an input sentence, with all edges in the graph illustrated in the adjacency matrix. The red, blue, and orange colors illustrate our modeling for contextual features in left, right, and self positional relations with a specific word, respectively.

In this paper, we propose directional graph convolutional networks (D-GCN) for EASA, which perform the task following the sequence labeling paradigm and model dependency relations among words in the input with an appropriate architecture. Specifically, for an input sentence, we first build the word relation graph upon its auto-processed dependency tree; then, we apply a direction mechanism in GCN, where for each word, we separately encode its associated contextual features (which are suggested by the graph) with respect to different positional relations (i.e., on the left, on the right, or self). To further distinguish the importance of different contextual features, we also propose an attention mechanism, which assigns different weights to such features, computed according to comparisons among them, so as to emphasize the important ones for EASA. To illustrate the effectiveness of our approach, experiments are performed on three benchmark datasets, where the results confirm that D-GCN is an appropriate model for leveraging dependency-based word relations for EASA, with state-of-the-art performance observed on all datasets.

The Approach
The overall architecture of our approach is illustrated in Figure 1. It follows the sequence labeling paradigm for EASA, where an input sentence $X = x_1 \cdots x_i \cdots x_n$ is tagged with a corresponding joint label sequence $Y = y_1 \cdots y_i \cdots y_n$. For D-GCN, $L$ layers are placed between the context encoder (i.e., BERT) and the output layer. To feed them, an adjacency matrix (shown at the lower right of Figure 1) representing the graph is built on the dependency tree of the input sentence, and an attention matrix (shown at the upper right of Figure 1) is applied to the edges in the graph to weight the contextual features associated with a specific word (i.e., "soup" in the figure). In the following text, we first introduce normal GCN, then elaborate on our proposed D-GCN, and finally illustrate EASA labeling with D-GCN.
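To make the joint labeling idea concrete, the following minimal sketch decodes a joint label sequence into (aspect term, polarity) pairs. The tag inventory (`B-POS`, `I-POS`, `O`, and so on) and the helper name `decode_joint_labels` are assumptions made for illustration only; the label set actually used in our experiments may differ.

```python
def decode_joint_labels(tokens, labels):
    """Collect (aspect term, polarity) pairs from a joint BIO-style label sequence.

    Assumed scheme: "B-<POL>" opens an aspect, "I-<POL>" continues it, "O" is outside.
    """
    pairs, current, polarity = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A new aspect starts; close any aspect that is still open.
            if current:
                pairs.append((" ".join(current), polarity))
            current, polarity = [token], label[2:]
        elif label.startswith("I-") and current and label[2:] == polarity:
            current.append(token)
        else:
            # "O" or an inconsistent "I-" tag ends the open aspect.
            if current:
                pairs.append((" ".join(current), polarity))
            current, polarity = [], None
    if current:
        pairs.append((" ".join(current), polarity))
    return pairs


tokens = "The ambiance is minimal but food is not phenomenal".split()
labels = ["O", "B-POS", "O", "O", "O", "B-NEG", "O", "O", "O"]
pairs = decode_joint_labels(tokens, labels)
print(pairs)
```

Because the aspect boundary and the polarity are carried by the same tag, a single decoding pass yields both results, so they cannot disagree with each other.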

Graph Convolutional Networks
The representation of an input sentence plays an important role in achieving good model performance across natural language processing (NLP) tasks (Song et al., 2017; Babanejad et al., 2020). Contextual features, such as n-grams and syntactic information, have been demonstrated to be highly useful for enhancing text representation and thus improving NLP model performance (Huang et al., 2007; Jiang et al., 2009; Wang et al., 2011; Song and Xia, 2013; Dong et al., 2014; Miller et al., 2016; Bastings et al., 2017; Yoon et al., 2018; Seyler et al., 2018; Kumar et al., 2018; Diao et al., 2019; Huang and Carley, 2019; Margatina et al., 2019; De Cao et al., 2019; Tian et al., 2020a; Tian et al., 2020c; Tian et al., 2020d). In addition, GCN has been shown to be a powerful model for capturing contextual features suggested by graph-like signals, e.g., the dependency tree, of an input sentence.
Normal GCN models usually have $L$ layers, and their input graph can be built upon the dependency tree of the input sentence, where an edge is added between any two words $x_i$ and $x_j$ if there exists a dependency relation between them. In general, a 0-1 adjacency matrix $A = \{a_{i,j}\}_{n \times n}$ is used to represent the graph, where $a_{i,j} = 1$ if there is an edge between $x_i$ and $x_j$ and $a_{i,j} = 0$ otherwise. Based on $A$, for any $x_i$ in $X$, the $l$-th GCN layer takes the output of the $(l-1)$-th GCN layer³, i.e., $h_j^{(l-1)}$ for every word $x_j$, and computes its output $h_i^{(l)}$ by

$$h_i^{(l)} = \sigma\Big(\sum_{j=1}^{n} a_{i,j} \big(W^{(l)} \cdot h_j^{(l-1)} + b^{(l)}\big)\Big) \qquad (1)$$

where $W^{(l)}$ and $b^{(l)}$ are the trainable matrix and bias for the $l$-th layer, and $\sigma$ is an activation function (e.g., ReLU). Therefore, all contextual features associated with $x_i$ (i.e., all $x_j$ satisfying $a_{i,j} = 1$) are treated equally in normal GCN models.
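The graph construction and the layer computation described above can be sketched as follows. This is a dependency-free illustration under stated assumptions, not our implementation: the function names are hypothetical, ReLU is assumed as the activation, and self-loops are assumed in the adjacency matrix so that each word also sees its own features.

```python
def build_adjacency(n, dependency_edges, self_loops=True):
    """0-1 adjacency matrix from (head, dependent) index pairs; edges are undirected."""
    a = [[0] * n for _ in range(n)]
    for i, j in dependency_edges:
        a[i][j] = a[j][i] = 1
    if self_loops:  # assumption: every word is connected to itself
        for i in range(n):
            a[i][i] = 1
    return a


def matvec(w, v):
    """Multiply matrix w (list of rows) by vector v."""
    return [sum(x * y for x, y in zip(row, v)) for row in w]


def gcn_layer(h, a, w, b):
    """One normal GCN layer: every connected word contributes with the same weight."""
    out = []
    for i in range(len(h)):
        acc = [0.0] * len(b)
        for j in range(len(h)):
            if a[i][j]:
                msg = matvec(w, h[j])
                acc = [t + m + bias for t, m, bias in zip(acc, msg, b)]
        out.append([max(0.0, x) for x in acc])  # ReLU (assumed activation)
    return out


# Toy chain graph over three words: 0 - 1 - 2
a = build_adjacency(3, [(0, 1), (1, 2)])
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out = gcn_layer(h, a, identity, [0.0, 0.0])
print(out)
```

With an identity weight matrix and zero bias, each output row is simply the sum of the neighboring input vectors, which makes the equal treatment of all connected words easy to see.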

Directional Graph Convolutional Networks
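The motivation of D-GCN is to separately model contextual features that have different positional relationships with their associated word, and to further weight such features according to the comparison among them. Therefore, following the same notation as in normal GCN, the $l$-th D-GCN layer computes the output $h_i^{(l)}$ for $x_i$ by

$$h_i^{(l)} = \sigma\Big(\sum_{j=1}^{n} p_{i,j} \big(W_{dir}^{(l)} \cdot h_j^{(l-1)} + b^{(l)}\big)\Big)$$

where $\sigma$ is the activation function, and $W_{dir}^{(l)}$ and $p_{i,j}$ (which correspond to $W^{(l)}$ and $a_{i,j}$ in Eq. (1), respectively) show our improvement to normal GCN through direction modeling and the attention mechanism. For the direction information, $W_{dir}^{(l)}$ is chosen from separate trainable matrices according to the positional relation of $x_j$ to $x_i$, i.e., whether $x_j$ is on the left of $x_i$, on its right, or $x_i$ itself. Compared with Eq. (1), attention (through $p_{i,j}$) is applied to the edge between $x_i$ and $x_j$ to weight different contextual features. Specifically, $p_{i,j}$ is computed via

$$p_{i,j} = \frac{a_{i,j} \cdot \exp\big(h_i^{(l-1)} \cdot h_j^{(l-1)}\big)}{\sum_{k=1}^{n} a_{i,k} \cdot \exp\big(h_i^{(l-1)} \cdot h_k^{(l-1)}\big)}$$

where $h_i^{(l-1)} \cdot h_j^{(l-1)}$ computes the interaction between $x_i$ and $x_j$ through inner product. Note that we also apply $a_{i,j}$ from $A$ when computing $p_{i,j}$, so that the attention between any two words is ignored if there is no edge between them ($a_{i,j} = 0$).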
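A minimal sketch of one D-GCN layer, combining the direction-specific matrices with the edge attention, is given below. The function names, the dictionary keys `"left"`, `"right"`, and `"self"`, and the ReLU activation are assumptions for illustration; the attention is taken to be a softmax over inner products restricted to connected words.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def matvec(w, v):
    return [dot(row, v) for row in w]


def direction(i, j):
    """Positional relation of context word x_j to the current word x_i."""
    if j < i:
        return "left"
    if j > i:
        return "right"
    return "self"


def attention_row(h, a, i):
    """p_{i,j} for all j: softmax of inner products, masked by a_{i,j}."""
    neighbours = [j for j in range(len(h)) if a[i][j]]
    if not neighbours:
        return [0.0] * len(h)
    scores = {j: dot(h[i], h[j]) for j in neighbours}
    top = max(scores.values())  # subtracting the max leaves the softmax unchanged
    exps = {j: math.exp(s - top) for j, s in scores.items()}
    total = sum(exps.values())
    return [exps[j] / total if j in exps else 0.0 for j in range(len(h))]


def dgcn_layer(h, a, w_dir, b):
    """One D-GCN layer: attention-weighted messages with direction-specific matrices."""
    out = []
    for i in range(len(h)):
        p = attention_row(h, a, i)
        acc = [0.0] * len(b)
        for j in range(len(h)):
            if p[j] == 0.0:
                continue
            msg = matvec(w_dir[direction(i, j)], h[j])
            acc = [t + p[j] * (m + bias) for t, m, bias in zip(acc, msg, b)]
        out.append([max(0.0, x) for x in acc])  # ReLU (assumed activation)
    return out


a = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]  # chain graph 0 - 1 - 2 with self-loops
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
# Keep only right-hand context to make the effect of direction visible.
w_dir = {"left": zero, "right": identity, "self": zero}
p0 = attention_row(h, a, 0)
out = dgcn_layer(h, a, w_dir, [0.0, 0.0])
print(p0, out)
```

In this toy setting only right-hand context passes through, so the last word, which has no right neighbor, receives a zero vector. In a trained model the three matrices are learned rather than fixed.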

Tagging with Directional Graph Convolutional Networks
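In our approach, we use BERT (Devlin et al., 2019) to encode the input $X$ and obtain the hidden vector $h_i$ for each $x_i$, which is then passed through the D-GCN layers and the output layer to produce a vector $o_i$ over the labels. Finally, we apply a softmax decoder to $o_i$ to predict the joint label $\hat{y}_i$ for aspect extraction and sentiment analysis via

$$\hat{y}_i = \arg\max_{t \in T} \frac{\exp(o_i^t)}{\sum_{t' \in T} \exp(o_i^{t'})}$$

where $T$ denotes the label set and $o_i^t$ refers to the value at dimension $t$ in $o_i$.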
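The decoding step can be illustrated with a short sketch. The label inventory `LABELS` and the scores in the example are made up for illustration, and `predict_label` is a hypothetical helper name.

```python
import math


def softmax(o):
    """Numerically stable softmax over a list of scores."""
    top = max(o)
    exps = [math.exp(x - top) for x in o]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(o, label_set):
    """Return the label with the highest softmax probability, and that probability."""
    probs = softmax(o)
    best = max(range(len(label_set)), key=lambda t: probs[t])
    return label_set[best], probs[best]


# Assumed joint label set; the actual inventory may differ.
LABELS = ["O", "B-POS", "I-POS", "B-NEG", "I-NEG", "B-NEU", "I-NEU"]
label, prob = predict_label([0.1, 2.0, -1.0, 0.3, 0.0, 0.5, -0.5], LABELS)
print(label, round(prob, 3))
```

Since the softmax is monotonic, the predicted label is the same as taking the arg max over the raw scores; the normalization matters only when probabilities are needed, e.g., for the training loss.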

Settings
In our experiments, we use three benchmark datasets: the restaurant (REST) dataset from the SemEval ABSA challenges (Pontiki et al., 2014; Pontiki et al., 2015; Pontiki et al., 2016), the laptop (LPTP) dataset from Pontiki et al. (2014), and the Twitter (TWTR) dataset from Mitchell et al. (2013). All these datasets contain ground-truth labels of target aspects and their sentiment polarities. Following previous studies (Li et al., 2019a; Li et al., 2019b; He et al., 2019; Hu et al., 2019), we only consider three sentiment polarities, i.e., positive, negative, and neutral, and ignore all cases with the conflict label in the REST and LPTP datasets. We report the statistics (the number of sentences and the number of aspects with positive, neutral, and negative sentiment polarities) of the three datasets in Table 1. For the TWTR dataset, since there is no standard train-test split, we only report its total statistics and follow previous studies (Mitchell et al., 2013; Zhang et al., 2015; Li et al., 2019a; Luo et al., 2019; Hu et al., 2019) in using ten-fold cross validation on it in our experiments. We use an off-the-shelf system, i.e., Stanford CoreNLP Toolkits (SCT), to obtain the dependency tree of each sentence and construct its D-GCN graph, since SCT is a well-known NLP toolkit that has been used in many previous studies (Huang and Carley, 2019; Tian et al., 2020a). We use the uncased versions of BERT-Base and BERT-Large (Devlin et al., 2019) under their default settings. All trainable parameters in our D-GCN model are randomly initialized. Following previous studies (Li et al., 2019a; Li et al., 2019b; Luo et al., 2019; He et al., 2019; Hu et al., 2019), we evaluate all models by F1 score.

³ The input to the first GCN layer is the hidden vector h
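As an illustration of the evaluation metric, the sketch below computes F1 over predicted aspect-sentiment items. It assumes exact matching, i.e., a prediction counts as correct only if both the aspect span and its polarity match the gold annotation, with scores micro-averaged over all sentences; the tuple layout `(sentence_id, start, end, polarity)` and the function name are hypothetical.

```python
def f1_score(gold, predicted):
    """Micro-averaged F1 over (sentence_id, start, end, polarity) tuples."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


gold = [(0, 1, 2, "POS"), (0, 5, 6, "NEG"), (1, 0, 2, "POS")]
# The second prediction finds the right span but the wrong polarity,
# so it counts as both a false positive and a false negative.
predicted = [(0, 1, 2, "POS"), (0, 5, 6, "NEU"), (1, 0, 2, "POS")]
score = f1_score(gold, predicted)
print(round(score, 4))
```

Under this matching criterion, an error in either sub-task lowers the score, which is why mismatches between aspect extraction and sentiment prediction are costly for EASA.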

Results
In the main experiments, we run the baselines and our models (i.e., without and with D-GCN), and try different numbers (from 1 to 4) of D-GCN layers. We also run a baseline that uses graph attention networks (GAT) (Veličković et al., 2017) for reference. Table 2 shows the results (F1 scores) on all datasets. There are several observations. First, D-GCN works well with both BERT-Base and BERT-Large, where consistent improvement is observed over the baselines (including the GAT baseline) across datasets. Second, for models using BERT-Base, three layers of D-GCN achieve the best result, and using more or fewer layers leads to inferior performance. One possible explanation is that, although we only model the contextual features directly linked to a specific word in each D-GCN layer, contextual information from a larger range can be leveraged indirectly across layers as the number of D-GCN layers increases, so that EASA performance improves accordingly. However, adding further layers could lead to over-fitting and introduce more noise, and thus harm the EASA results. Different from BERT-Base, models using BERT-Large require fewer D-GCN layers to achieve their best performance, because BERT-Large is more powerful in encoding contextual information and therefore relies less on the long-range contextual information encoded by higher D-GCN layers. Moreover, we also compare our best models using BERT-Base and BERT-Large with previous studies, where the results (F1 scores) are presented in Table 3. Our models (especially with BERT-Large) outperform all previous EASA studies. In particular, although the pipeline approach shows surprisingly good performance compared with other previous studies, our results suggest that an appropriate model design can effectively take full advantage of the joint approach.

Ablation Study
To explore the effect of the direction modeling (DIR) and the attention mechanism (ATT) applied in our D-GCN, we conduct an ablation study on our best model (BERT-Large with one D-GCN layer), where either DIR or ATT is ablated. The results (F1 scores) on different datasets are reported in Table 4, where the scores of the baselines with (ID: 4) and without (ID: 5) normal GCN are also presented. The ablation of either DIR or ATT (ID: 2, 3) clearly hurts model performance, which suggests that both parts contribute to improving EASA. In addition, directly using normal GCN (ID: 4) leads to even worse results than the baseline without it (ID: 5), which emphasizes the necessity of our design to weight different contextual features through D-GCN.

Table 4: Ablation results from our best model (i.e., BERT-Large encoder with 1 D-GCN layer). "DIR" and "ATT" denote direction modeling and attention mechanism, respectively. √ and × indicate whether a component is used.

Figure 2: An example sentence with its sentiment outputs from our D-GCN model (with BERT-Large encoder) (a) and a reference model without the direction modeling (i.e., ID 3 in Table 4) (b). The predicted aspect term "Safari browser" is highlighted in yellow. The correct and incorrect predicted sentiment polarities are presented in green and red, respectively. We visualize the weights assigned to the contextual features associated with "browser" on the arcs (including the arc linking "browser" to itself) between them, where thicker arcs refer to higher weights.

Case Study
To explore how the D-GCN model captures position information to improve model performance, we examine the effect of direction modeling by comparing the output of our D-GCN model with BERT-Large encoder against that of a reference model without the direction modeling (i.e., ID 3 in Table 4). In Figure 2, we show an example sentence with the outputs from the two models, where both models correctly recognize the aspect term "Safari browser" (highlighted in yellow). In addition, our D-GCN model also correctly predicts the sentiment polarity "positive" (in green), while the reference model fails to do so (its output "neutral" is highlighted in red). In the figure, for "browser", we visualize the weight assigned to each of its associated words (i.e., the words connected to "browser" by a dependency arc, or "browser" itself) on the arc between them, where thicker arcs refer to higher weights. From Figure 2, we find that the reference model (i.e., GCN + Att.) assigns the highest weight to "browser" itself, so that its associated contextual features fail to contribute to predicting the joint label for "browser". On the contrary, our D-GCN approach, which considers the directional information, allows the attention mechanism to assign higher weights to the contextual features than the reference model does, especially to the context word "quick", which may provide a useful cue for predicting the "positive" sentiment polarity. To summarize, this example shows a typical case in which, by allowing the attention mechanism to assign appropriate weights to the contextual features, our D-GCN model can leverage the positional relationship between a word and its contextual features to improve EASA.

Conclusion
In this paper, we propose a joint approach for EASA with D-GCN, whose graph is built upon the dependency tree of the input sentence obtained from off-the-shelf toolkits. The novelty of this work lies in the direction modeling and attention applied in GCN: in each D-GCN layer, for each word, we separately model its contextual features according to their direction relative to the word, and weight these features according to the comparisons among them. Experimental results on three widely used benchmark datasets illustrate the effectiveness of our approach, with state-of-the-art performance achieved on all datasets. Further analysis confirms that both direction modeling and the attention mechanism are helpful for the task.