Syntax-Aware Graph Attention Network for Aspect-Level Sentiment Classification

Aspect-level sentiment classification aims to distinguish the sentiment polarities of aspect terms in a sentence. Existing approaches mostly focus on modeling the relationship between the given aspect words and their context with attention, and ignore more elaborate knowledge implicit in the context. In this paper, we bring syntactic awareness to the model through a graph attention network over the dependency tree structure, and external pre-training knowledge through the BERT language model, which together help to better model the interaction between the context and the aspect words. Moreover, the subwords produced by BERT are integrated into the dependency-tree graphs, so that graph attention can yield more accurate word representations. Experiments demonstrate the effectiveness of our model.

Figure 1: An example of the graph in our model. The complete sentence is "They like the desks in their dormitories.", tokenized into the subwords they, like, the, desk, ##s, in, their, dorm, ##itor, ##ies. The aspect word is desk (colored red), and the keyword that gives out the sentiment information is like (colored yellow). The distance between the two words is 1 in the dependency tree, shorter than 2 in the sequence. Moreover, the meaning of desk is more important than that of they for the word like.
As shown in Figure 1, they and desk are both child nodes of like, but they should play different roles in the ABSA task.
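The distance argument behind Figure 1 can be checked with a few lines of code. The sketch below hand-annotates the dependency edges of the example sentence (in practice they would come from a parser) and compares the sequence distance with the tree distance between like and desks:

```python
from collections import deque

# Hand-annotated dependency edges for the Figure 1 sentence
# "They like the desks in their dormitories." (illustrative; a real
# system would obtain these from a dependency parser).
tokens = ["They", "like", "the", "desks", "in", "their", "dormitories"]
edges = [("like", "They"), ("like", "desks"), ("desks", "the"),
         ("desks", "in"), ("in", "dormitories"), ("dormitories", "their")]

def tree_distance(src, dst, edges):
    """Shortest path length between two words on the (undirected) tree."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    queue, seen = deque([(src, 0)]), {src}
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

seq_dist = abs(tokens.index("like") - tokens.index("desks"))
print(seq_dist, tree_distance("like", "desks", edges))  # 2 1
```

The tree path is a single hop (like -> desks), while two positions separate the words in the sequence, matching the figure.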
Based on the survey above, we aim to take a further step toward solving the long-distance dependency problem and introducing syntactic information into the ABSA task. In this paper, we propose a novel model named Syntax-Aware Graph ATtention network (SAGAT), which fully exploits the potential of BERT and carefully models the syntactic relations between different words. Specifically, SAGAT first obtains the representation of each subword in the sentence from BERT. Unlike other BERT-based models (Zhao et al., 2019; Rietzler et al., 2019), which mostly merge the representations of subwords by mean pooling to form word-level embeddings, we retain all subword representations and use them in the subsequent dependency-tree graph, so that the graph attention performed later can obtain more accurate word representations. Following Sun et al. (2019), we build a graph for each sentence according to its dependency tree. As stated above, in such a graph a common parent word may treat its different child words differently, as in Figure 1. To capture this difference, we design a graph attention network that pays different attention to different children. Finally, we use the fusion and alignment layers to let the context and the aspects interact. The fusion layer first performs max-pooling on the context and the aspects to extract their most significant features; the hidden state of each token in both the context and the aspects is then updated with those captured features. The alignment layer utilizes an attention mechanism to capture the most relevant features, which are also used to update the hidden states of the tokens. After all the operations above, the representations of the context and the aspects are concatenated and passed through a linear layer to produce the final output. Experiments demonstrate the effectiveness of our model.
The main contributions of this paper are as follows:
• To the best of our knowledge, SAGAT is the first model to combine both syntactic information and external pre-training knowledge for the aspect-level sentiment classification task.
• We use syntax-guided graph attention to shorten the distance between keywords, which eases the long-distance dependency problem of RNNs, and to distinguish between different child nodes.
• We preserve subwords of BERT in the graph to fully release the power of BERT.
• We evaluate our method on the SemEval 2014 datasets and experiments show that SAGAT achieves outstanding performance.

Related Work
In this section, we will briefly review works on aspect-level sentiment analysis, graph neural network, and BERT language model.

Aspect-Level Sentiment Analysis
Aspect-level sentiment analysis is widely used in scenarios like e-commerce and social networks (Zhang, 2008). Researchers usually focus on the fusion of contexts and aspects to obtain the corresponding results. Traditional methods utilize handcrafted features such as sentiment lexicon features and bag-of-words features to train sentiment classifiers (Rao and Ravichandran, 2009). With the development of deep neural networks, more DNN-based models have been proposed, falling mainly into two categories: semantic-based and syntactic-based methods. Semantic-based models (Wang et al., 2016; Ma et al., 2017; He et al., 2018) usually use the attention mechanism to capture and amplify the key semantic information of sentences and aspects. However, most previous works neglect the power of syntactic information. Syntactic-based methods (Sun et al., 2019; Zhao et al., 2019; Huang and Carley, 2019; Lin et al., 2019) introduce the results of dependency parsing into DNN models to shorten the distance between the aspect and the keyword and to introduce syntactic information. Both categories of DNN-based models can generate dense sentence representations without handcrafted features.

Graph Neural Network
Graph neural networks have recently become very popular in NLP research. The GNN was first proposed by Scarselli et al. (2009) and has been used in many NLP tasks, including text classification (Defferrard et al., 2016), sequence labeling (Zhang et al., 2018), neural machine translation (Bastings et al., 2017), and relational reasoning (Battaglia et al., 2016). Tai et al. (2015) made the first attempt to use a GNN in the sentiment classification task. Recently, Zhao et al. (2019) proposed to model the sentiment dependencies within one sentence by building graphs between aspects, and Sun et al. (2019) introduced dependency parsing and applied simple graph convolution to propagate their graph. Both achieve outstanding performance by applying GNNs in their models. However, Zhao et al. (2019) must first parse out all the aspects in the sentence, and no syntax information is introduced. Sun et al. (2019) make use of syntax information, but they only apply simple graph convolution on their syntax graph, which cannot tell the differences between nodes, and they simply use mean pooling to obtain the final results after the GCN, without deeper integration between aspect and context.

BERT
BERT is one of the key innovations in the recent progress of contextualized representation learning (Peters et al., 2018; Devlin et al., 2018). Traditional word embeddings like GloVe (Mikolov et al., 2013; Pennington et al., 2014) are trained on large-scale corpora, but these methods ultimately yield a superposition of the different meanings of each word. BERT adopts a fine-tuning mechanism that requires almost no task-specific architecture for each end task. Recently, some BERT-based models (Zhao et al., 2019) have been adopted in the sentiment analysis task, but almost all of them utilize BERT merely as an embedding layer that is better than GloVe.

Methodology
In this section, we will introduce the details of all the layers in our model separately. Figure 2 shows the overall architecture of the proposed Syntax-Aware Graph ATtention Network (SAGAT), which consists of an encoding layer, graph attention layer, fusion layer, alignment layer, and output layer.

Encoding
The raw text sequences of the context and the aspect are first represented as embedding vectors and fed into a pre-trained BERT. The input context sequence S_c = {w_1^c, w_2^c, ..., w_m^c} with length m and the aspect sequence S_a = {w_1^a, w_2^a, ..., w_n^a} with length n are first cut into subword tokens and concatenated as

S = {[CLS], w_1^{c#1}, ..., w_m^{c#k_m}, [SEP], w_1^{a#1}, ..., w_n^{a#k_n}, [SEP]},  (1)

where w_n^{#k} represents the k-th subword of w_n, and [CLS] and [SEP] are the special tokens of BERT.
Then the transformer encoder captures the semantic information of each subword through self-attention and produces an embedded sequence carrying that semantic information.
We notice that almost all previously proposed BERT-based models merge subwords by mean pooling after encoding because their subsequent layers cannot deal with the structure of subwords. It is indisputable that the importance of different subwords is not consistent, so merging them by mean pooling at this point is not appropriate. It is worth mentioning that we do not merge the subwords after the BERT encoding layer; they will be processed more effectively by the graph attention layer introduced later. The final output of the encoding layer is therefore

X = {x_1, x_2, ..., x_N},

where x_i ∈ R^d, d is the dimension of the encoding vectors, and N is the total number of tokens in S.
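To make the input layout concrete, the sketch below builds the [CLS] context [SEP] aspect [SEP] sequence while retaining subwords. The greedy longest-match splitter and the toy vocabulary are illustrative stand-ins for BERT's real WordPiece tokenizer:

```python
# Illustrative sketch of the input fed to BERT: context and aspect are
# split into subwords and concatenated with the special tokens.
# VOCAB and the greedy splitter are toy stand-ins for real WordPiece.
VOCAB = {"they", "like", "the", "desk", "##s", "in", "their",
         "dorm", "##itor", "##ies"}

def wordpiece(word):
    """Greedy longest-match-first split of one word into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return [word]  # unknown word: keep it whole
    return pieces

def build_input(context, aspect):
    """Assemble the subword sequence [CLS] context [SEP] aspect [SEP]."""
    seq = ["[CLS]"]
    for w in context:
        seq += wordpiece(w)
    seq.append("[SEP]")
    for w in aspect:
        seq += wordpiece(w)
    seq.append("[SEP]")
    return seq

print(build_input(["they", "like", "the", "desks", "in", "their",
                   "dormitories"], ["desks"]))
```

Note that desks stays as the two tokens desk and ##s; both survive into the graph rather than being mean-pooled into one vector.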

Graph Attention
To obtain the syntax information of the text, we first perform dependency parsing on the input sequence S. The result of dependency parsing can be expressed as a list of tuples, each representing a parent-child pair in the parse tree. The parsing list can be written as P = [(w_{s_1}, w_{e_1}), (w_{s_2}, w_{e_2}), ..., (w_{s_p}, w_{e_p})], where w_{s_p} and w_{e_p} denote the start and end word of the p-th pair, and p is the length of the parsing result. After obtaining the parsing results, we build a graph for each input sequence. Each token (subword) has a corresponding node in the graph, as mentioned in Section 3.1, and the hidden states of the nodes are initialized with their encoding outputs. All the words are connected according to the dependency parsing result P, and each subword w_n^{#k} is connected to its original word w_n. Concretely, the graph of the input sequence S is defined as G = (V, E), where V contains one node per token and E = P ∪ {(w_n, w_n^{#k})}, with P being the parsing result mentioned before. We then perform graph attention on this graph. Since the children of a node can be its subwords or its children in the dependency tree, with different meanings to the parent node, attention can capture those differences much better than convolution. The graph attention is defined as follows:

e_{ij} = A [h_i ; h_j],
α_{ij} = softmax_{j ∈ N_i}(e_{ij}),
h_i' = σ( Σ_{j ∈ N_i} α_{ij} ⊙ W h_j ),

where ; represents the concatenation operation, ⊙ is element-wise multiplication, W ∈ R^{d×d} and A ∈ R^{d×2d} are learnable parameters of the model, N_i denotes the neighbors of node i, and σ is the activation function. After passing through the graph attention layer, the nodes containing the aspect words within the context are duplicated into a set N_a, and the nodes of the context make up another set N_c.
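As a concrete illustration, the NumPy sketch below runs one graph-attention update on a toy three-node graph (a parent with two children). The scalar attention vector a over the concatenation [Wh_i ; Wh_j], the tanh activation, and all sizes are simplifying assumptions, not the paper's trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy hidden size (the paper uses BERT's 768)

# Toy graph: node 0 is the parent of nodes 1 and 2.
# Self-loops let each node keep part of its own state when updating.
adj = np.array([[1, 1, 1],
                [1, 1, 0],
                [1, 0, 1]])
H = rng.normal(size=(3, d))     # node hidden states
W = rng.normal(size=(d, d))     # shared projection
a = rng.normal(size=(2 * d,))   # attention vector over [Wh_i ; Wh_j]

def gat_layer(H, adj, W, a):
    """One graph-attention update: each node aggregates its neighbors
    with weights from a softmax over concatenation-based scores."""
    Wh = H @ W
    n = H.shape[0]
    scores = np.full((n, n), -np.inf)  # -inf masks non-edges
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                scores[i, j] = np.tanh(np.concatenate([Wh[i], Wh[j]]) @ a)
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)  # softmax over neighbors
    return np.tanh(alpha @ Wh)

H_new = gat_layer(H, adj, W, a)
print(H_new.shape)  # (3, 4)
```

Unlike mean-pooling graph convolution, the learned scores let the parent weight its two children differently, which is exactly the behavior motivated by Figure 1.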

Fusion
It is essential to obtain the interaction information between the aspect and the context in aspect-level sentiment analysis. Note that since we attach the aspect to the end of the context when feeding the sequence into BERT, as in Eq. 1, there are two occurrences of the aspect words in one sequence. The occurrence inside the context contains more contextual information than the one at the end. Therefore, we fetch the aspect representation from the context to fuse with the context in this layer.
We fetch the aspect representation from the context and perform max-pooling on it to extract the most meaningful information. After passing through a linear layer, the fusion vector of the aspect, F_a, is obtained. Finally, we directly add the fusion vector to each vector in N_c to get the fused representation of the context, N_c'. The fused representation of the aspect, N_a', is obtained by the mirrored operation. The fusion layer is defined as follows:

F_a = W_a MaxPooling(N_a) + b_a,
N_c' = N_c + F_a,
F_c = W_c MaxPooling(N_c) + b_c,
N_a' = N_a + F_c,

where W_a, W_c ∈ R^{d×d} and b_a, b_c ∈ R^d are learnable parameters of the model. The shape of N is (B, S, D), where B is the batch size, S is the sequence length, and D is the embedding dimension; MaxPooling is performed over the second dimension S.
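A minimal NumPy sketch of the fusion step for the context side (toy sizes, single example rather than a batch; the aspect side is the mirror image):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n = 4, 6, 2             # toy hidden dim, context len, aspect len
N_c = rng.normal(size=(m, d))  # context node states after graph attention
N_a = rng.normal(size=(n, d))  # aspect node states fetched from the context
W_a = rng.normal(size=(d, d))
b_a = rng.normal(size=(d,))

def fuse(target_nodes, source_nodes, W, b):
    """Max-pool the source, project it through a linear layer, and add
    the resulting fusion vector to every target hidden state."""
    pooled = source_nodes.max(axis=0)  # most significant features
    fusion = pooled @ W + b            # linear layer
    return target_nodes + fusion       # broadcast add over the sequence

fused_context = fuse(N_c, N_a, W_a, b_a)
print(fused_context.shape)  # (6, 4)
```

Every context position receives the same aspect summary, so the fusion shifts all context states toward the aspect's strongest features.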

Alignment
Aspect words and context sequences can bring the most meaningful features to each other through the fusion layer discussed before. With the fusion layer alone, however, we cannot obtain the most accurate aspect and context representations because there is no direct interaction between them. Therefore we introduce the alignment layer into our model. In this layer, we first perform self-attention on the context to enhance the contextual information. Then the aspect words and the context interact with each other through attention. After these operations, the context and the aspect are fully integrated and ready for output. The Attention function used in this layer is defined as

Attention(Q, K, V) = softmax(Q K^T) V,

where the attention score is calculated by the dot product.
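The attention operations of this layer can be sketched in a few lines of NumPy. The particular sequence of calls (context self-attention, then cross-attention in both directions) is an illustrative reading of the description above, with toy sizes:

```python
import numpy as np

def attention(Q, K, V):
    """Dot-product attention: scores = Q K^T, softmax over the keys,
    then a weighted sum of the values."""
    scores = Q @ K.T
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(2)
context = rng.normal(size=(6, 4))  # toy context states
aspect = rng.normal(size=(2, 4))   # toy aspect states

ctx_self = attention(context, context, context)    # self-attention on context
ctx_aligned = attention(ctx_self, aspect, aspect)  # context attends to aspect
asp_aligned = attention(aspect, ctx_self, ctx_self)  # aspect attends to context
print(ctx_self.shape, ctx_aligned.shape, asp_aligned.shape)
```

Each output keeps the query's sequence length, so the aligned features can be mixed back into the corresponding hidden states.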

Output
We obtain the final representations of the previous outputs by max pooling, concatenate them, and use a fully connected layer to project the concatenated vector into the space of the C target classes:

y = softmax( W_o [MaxPooling(N_c) ; MaxPooling(N_a)] + b_o ),

where ; represents the concatenation operation and W_o ∈ R^{C×2d}, b_o ∈ R^C are learnable parameters.

Experiments
In this section, we describe our experimental setup and report our experimental results.

Experimental Setup
For our experiments, we utilize three datasets: the SemEval 2014 Task 4 dataset, composed of Restaurant reviews and Laptop reviews (Manandhar, 2014), and the ACL-14 Twitter dataset gathered by Dong et al. (2014). All the samples in these datasets are labeled with one of three sentiment polarities: positive, negative, or neutral. The details of these datasets are shown in Table 1. We fine-tune a pre-trained BERT 1 in our model. The embedding dimension d is set to 768. We utilize Adam (Kingma and Ba, 2014) as the optimizer with an initial learning rate of 2 × 10^−5. Dropout with a keep probability of 0.9 is applied after the dense layer. The batch size is 32, and the maximum sequence length is set to 128. The number of training epochs is 100. We use spaCy (Honnibal and Montani, 2017) as the dependency parsing tool.
The loss function L to be optimized in our model is the cross-entropy loss, which can be defined as

L = − Σ_i Σ_{c=1}^{C} g_i^c log y_i^c,

where y_i^c is the predicted probability of class c for sample i and g_i^c is the corresponding ground-truth indicator.
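To make the output and loss computation concrete, the NumPy sketch below max-pools toy context and aspect representations, projects their concatenation onto C = 3 polarity classes, and computes the cross-entropy against a one-hot gold label. All sizes and the names W_o and b_o are illustrative assumptions, not the trained parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
d, C = 4, 3                        # toy hidden dim, number of classes
context = rng.normal(size=(6, d))  # toy context representations
aspect = rng.normal(size=(2, d))   # toy aspect representations
W_o = rng.normal(size=(2 * d, C))  # output projection (hypothetical name)
b_o = rng.normal(size=(C,))

# Output layer: max-pool both sides, concatenate, project to C classes.
r = np.concatenate([context.max(axis=0), aspect.max(axis=0)])
logits = r @ W_o + b_o
probs = np.exp(logits - logits.max())
probs /= probs.sum()               # softmax over the C classes

# Cross-entropy against a one-hot gold label (class 1 here).
gold = np.array([0.0, 1.0, 0.0])
loss = -(gold * np.log(probs)).sum()
print(probs.shape)
```

With a one-hot label the double sum in the loss reduces to the negative log-probability of the gold class, which is what the last line computes.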

Baselines
We compare our model with the following baseline models:
• TD-LSTM: Tang

Table 2: Main results. The results of the baseline models are taken from the published papers. "-" means not reported.
• IAN: an RNN-based approach proposed by Ma et al. (2017), which encodes contexts and aspect words with LSTMs and lets them interact through attention to generate representations of aspects and contexts with respect to each other.
• RAM: Chen et al. (2017) proposed a method based on memory networks that represents the memory with an LSTM. A gated recurrent unit network is then applied to combine all the attention outputs into the sentence representation.
• TNet: Li et al. (2018) uses BiLSTM embeddings as target-specific embeddings, and utilizes a CNN model to extract final embeddings.
• HSCN: selects target words and extracts a target-specific contextual representation, then measures the deviation between the target-specific contextual representation and the target representation by capturing interactions between the context and the target.
• CDT: a graph-neural-network-based method proposed by Sun et al. (2019). CDT builds its graph by dependency parsing, which is similar but not identical to ours, and propagates the graph with a graph convolutional network.
• SDGCN: also a GNN-based method, proposed by Zhao et al. (2019), which is the first to consider the sentiment dependencies between aspects, employing a GCN to effectively capture the sentiment dependencies between different aspects in one sentence.
• AEN-BERT: designs an attentional encoder network to model the hidden states and semantic interactions between target and context words, and applies pre-trained BERT to this task, which enhances the basic BERT model and obtains better results.
It is noted that the results of the baseline models are directly retrieved from the published papers. Table 2 reports the main results of our model against the baselines. We can see that our model generally achieves state-of-the-art performance on the Restaurant and Twitter datasets. Although our model performs slightly worse than SDGCN-BERT on the Laptop dataset, which we will discuss later, we still achieve the best average performance over Restaurant and Laptop, which is exactly the coverage of SemEval 2014. Generally speaking, graph-based models introduce additional information compared to traditional methods. SDGCN (Zhao et al., 2019) connects the different aspects in one sentence to build two kinds of graphs: in one graph the nodes are connected in the order of their appearance in the sentence, and in the other they are fully pairwise connected, which does not utilize the dependency information between aspects. CDT (Sun et al., 2019) builds graphs by dependency parsing, which is similar to ours. The advantage of this method is that it can shorten the distance from keywords to aspect words, as shown in Figure 1. However, the graphs built by SDGCN still come from sequence information; no syntax information is introduced. And CDT utilizes a native GCN rather than an attention mechanism when propagating the graph, which causes child nodes of the same parent in the dependency tree to occupy exactly the same position when the graph state is updated, which is incorrect in actual language scenarios. As shown in Figure 1, for the word like, the meaning of desk (underlined) should be more influential than that of they. In our model, syntax information is introduced by the graph based on dependency parsing, and the differences between child nodes are well distinguished by the graph attention network.

Main Results
We can also find that BERT-based models perform better overall due to the strong ability of BERT. However, previous BERT-based models like SDGCN (Zhao et al., 2019) and AEN did not take advantage of subwords in their models and simply merged them. That means they used BERT only as an embedding layer, which obviously cannot exploit all of BERT's strength. In our model, the subwords serve as child nodes of their parent word and participate in the computation of the rest of the model. Besides, with the help of the GAT, subwords can also affect the meaning of their parent word with different weights corresponding to their importance.
We noticed that our model suffers a small performance degradation on the Laptop dataset. Comparing the Laptop dataset with the other two, we find from the perspective of data distribution that the reason for this phenomenon may be the large number of numerals and technical terms in the Laptop samples. The parser is more likely to produce parsing errors when facing these technical terms. We take a case from the Laptop dataset as an example: bluetooth (2.1) , fingerprint reader , full 1920x1080 screen -integrated mic/webcam * -dual touchpad mode is interesting , and easy to use -5 usb ports -runs about 38-41c on idle , up to 65 (for me) on load -very quiet -i could go on and on .
The example above contains a lot of numbers and technical terms, and the parser will usually get confusing results when parsing such sentences.

Ablation Study
To investigate the effects of different components in our model, we conduct the following ablation study on our model. The results of the ablation study are shown in Table 3.
(1) w.o. GAT: we remove the graph attention network in our model; no syntax information is introduced into the model.
(2) w.o. Fusion: we remove the fusion layer and remain other layers in our model the same as the original one.
(3) w.o. Alignment: the same settings as (2), but with the alignment layer removed instead.
(4) with GCN: we use graph convolution with mean pooling to propagate the graph instead of graph attention.
It is worth mentioning that we did not treat BERT as a component in the ablation study, since we introduce the subwords of BERT into our model instead of simply using BERT as the embedding layer like other models for this task.
As expected, the results of all the simplified models drop considerably, which demonstrates the effectiveness of these components.
When graph attention is removed in Experiment (1), the vectors encoded by BERT are sent to the subsequent layers directly. Thus our model cannot obtain the syntax information from the dependency tree, and it naturally degenerates to results similar to those of other models that use BERT as the embedding layer. The results of Experiment (2) show that the fusion layer, which fuses the most prominent features between context and aspect by max-pooling, is simple but effective, as mentioned in Section 3.3.
In Experiment (3), when the alignment layer is removed but the fusion layer is retained, performance degradation still occurs, which indicates that the two layers play different roles in our model. Since their operations on the context and the aspect are mirrored, let us take the context as an example: the fusion layer sends the most significant features of the aspect to the context, whereas the alignment layer mixes the most relevant features of the aspect into the current context.
To verify the actual contribution of the GAT, we designed Experiment (4), in which we use a GCN instead of the GAT for graph propagation. This means that nodes no longer use attention to weight their neighbors before updating their hidden states, but simply average their neighbors' hidden states, so a node cannot tell the different importance of its neighbors. The experimental results drop, in line with our expectations. Overall, however, this variant is still better than Experiment (1), where the GAT is removed completely.

Case Study
To better explore the role of the graph attention layer, we conducted a case study comparing SAGAT with and without the graph attention layer. We selected some cases from the classification results of the two models for discussion; they are shown in Table 4.
In examples (i) to (iii), the original model classifies correctly but fails when the GAT layer is removed. These cases have something in common: the aspect word and the keywords are far apart in the sequence but much closer on the dependency tree. We can see that the distance in the dependency tree is significantly reduced, whether in short sentences like example (i) or long sentences like (ii) and (iii). From this we observe two benefits of introducing a graph based on the dependency tree: one is to obtain syntax information that may be missing from the preceding encoding procedure, and the other is to bring keywords closer together, thus deepening their mutual influence.
Besides, there are some cases with less formal expressions in the dataset, like example (iv). Due to incorrect punctuation, the sentence is split into multiple dependency trees, and the keywords become unreachable from each other on the dependency tree, so the distance between them cannot be reduced. For such cases, the model gains some extra syntactic knowledge without further benefit, and the classification result is therefore wrong.

Conclusion
In this paper, we proposed a model named SAGAT for aspect-level sentiment classification. To fully obtain both syntax and semantic information in the sequence, we utilize graph attention network and BERT in our model. Experiments on three datasets demonstrate the effectiveness of our model. We also perform some additional experiments to show the power of components in SAGAT.