Dependency Graph Enhanced Dual-transformer Structure for Aspect-based Sentiment Classification

Aspect-based sentiment classification is a popular task aimed at identifying the sentiment corresponding to a specific aspect. One sentence may contain various sentiments for different aspects. Many sophisticated methods such as attention mechanisms and Convolutional Neural Networks (CNN) have been widely employed to handle this challenge. Recently, the semantic dependency tree, implemented via Graph Convolutional Networks (GCN), has been introduced to describe the inner connection between aspects and the associated emotion words. But the improvement is limited due to the noise and instability of dependency trees. To this end, we propose a dependency graph enhanced dual-transformer network (named DGEDT) that jointly considers the flat representations learnt from Transformer and the graph-based representations learnt from the corresponding dependency graph in an iterative interaction manner. Specifically, a dual-transformer structure is devised in DGEDT to support mutual reinforcement between flat representation learning and graph-based representation learning. The idea is to allow the dependency graph to guide the representation learning of the transformer encoder and vice versa. The results on five datasets demonstrate that the proposed DGEDT outperforms all state-of-the-art alternatives by a large margin.


Introduction
Aspect-based or aspect-level sentiment classification is a popular task with the purpose of identifying the sentiment polarity of a given aspect (Yang et al., 2017; Zhang and Liu, 2017; Zeng et al., 2019). The goal is to predict the sentiment polarity of a given (sentence, aspect) pair. Aspects in our study are mostly noun phrases appearing in the input sentence. As shown in Figure 1, where the comment is a laptop review, the sentiment polarities of the two aspects battery life and memory are positive and negative, respectively. Specifying the aspect is crucial for sentiment classification because one sentence often contains several aspects, and these aspects may carry different sentiment polarities.
Modern neural methods such as Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN) (Dong et al., 2014; Vo and Zhang, 2015) have already been widely applied to aspect-based sentiment classification. Inspired by the work (Tang et al., 2016a) which demonstrates the importance of modeling the semantic connection between contextual words and aspects, RNNs augmented by an attention mechanism (Bahdanau et al., 2015; Luong et al., 2015) are widely utilized in recent methods for exploring the potentially relevant words with respect to the given aspect (Yang et al., 2017; Zhang and Liu, 2017; Zeng et al., 2019; Wang et al., 2016). CNN-based attention methods (Xue and Li, 2018) have also been proposed to enhance the phrase-level representation and achieved encouraging results.
Although attention-based models have achieved promising performance on several tasks, the limitation is still obvious because attention module may highlight the irrelevant words owing to the syntactical absence. For example, given the sentence "it has a bad memory but a great battery life." and aspect "battery life", attention module may still assign a large weight to word "bad" rather than "great", which adversely leads to a wrong sentiment polarity prediction.
To take advantage of syntactical information among aspects and contextual words, Zhang et al. (2019) proposed a novel aspect-based GCN method which incorporates the dependency tree into attention models. Using GCN (Kipf and Welling, 2017) to encode the information conveyed by a dependency tree has already been investigated in several fields, e.g., modeling document-word relationships (Yao et al., 2019) and tree structures (Marcheggiani and Titov, 2017; Zhang et al., 2018).

Figure 1: A typical utterance sample of the aspect-based sentiment classification task with a proper dependency tree (Aspect: memory, Sentiment: Negative; Aspect: battery life, Sentiment: Positive). Note that different aspects may have different sentiment polarities.

As shown in Figure 1, an annotated dependency tree of the original sentence is provided, and we can observe that the word-aspect pairs (bad, memory) and (great, battery life) are well connected.
Direct application of the dependency tree has two obvious shortcomings: (1) noisy information is inevitably introduced through the dependency tree, due to imperfect parsing performance and the casualness of input sentences; (2) GCN is inherently inferior at modeling long-distance or disconnected words in the dependency tree. It has been reported that lower performance is achieved even with the gold dependency tree, compared against using only the flat structure (Zhang et al., 2019).
To address these two challenges, we propose a dependency graph enhanced dual-transformer network (named DGEDT) for aspect-based sentiment classification. DGEDT consists of a traditional transformer (Vaswani et al., 2017) and a transformer-like structure implemented via a dependency graph based bidirectional GCN (BiGCN). Specifically, a dual-transformer structure is introduced in DGEDT to fuse the flat representations learnt by the transformer and the graph-based representations learnt based on the dependency graph. These two kinds of representations are jointly refined through a mutual BiAffine transformation process, in which the dependency graph can guide and promote flat representation learning. The final flat representations derived by the transformer are then combined with aspect-based attention for sentiment classification. We have conducted extensive experiments on five benchmark datasets. The experimental results demonstrate that the proposed DGEDT achieves a large performance gain over existing state-of-the-art alternatives.
To the best of our knowledge, the proposed DGEDT is the first work that jointly considers the flat textual knowledge and dependency graph empowered knowledge in a unified framework. Furthermore, unlike other aspect-based GCN models, we aggregate the aspect embeddings from multiple aspect spans which share the same mentioned aspect before feeding these embeddings into submodules. We also introduce an aspect-modified dependency graph in DGEDT.

Related Work
Modern neural networks, such as CNNs (Kim, 2014; Johnson and Zhang, 2015), RNNs (Castellucci et al., 2014; Tang et al., 2016a) and Recurrent Convolutional Neural Networks (RCNNs) (Lai et al., 2015), have already achieved excellent performance on several sentiment analysis tasks, including aspect-based sequence-level sentiment classification. Many attention-based RNN or CNN methods (Yang et al., 2017; Zhang and Liu, 2017; Zeng et al., 2019) have also been proposed to handle sequence classification tasks. Tai et al. (2015) proposed a tree-LSTM structure enhanced with dependency trees or constituency trees, which outperforms the traditional LSTM. Dong et al. (2014) proposed an adaptive recursive neural network using dependency trees. Since being first introduced in (Kipf and Welling, 2017), GCN has shown a great ability to model graph-structured representations in the Natural Language Processing (NLP) field. Marcheggiani and Titov (2017) proposed a GCN-based model for semantic role labeling. Vashishth et al. (2018) and Zhang et al. (2018) used GCN over dependency trees in document dating and relation classification, respectively. Yao et al. (2019) introduced GCN to the text classification task with the guidance of document-word and word-word relations. Furthermore, Zhang et al. (2019) introduced an aspect-based GCN to cope with the aspect-level sentiment classification task using dependency graphs. On the other hand, Chen and Qian (2019) introduced and adapted Capsule Networks along with transfer learning to improve the performance of aspect-level sentiment classification. Gao et al. (2019) introduced BERT into a target-based method, and Sun et al. (2019) constructed BERT-based auxiliary sentences to further improve the performance.

Preliminaries
Since Transformer (Vaswani et al., 2017) and GCN are two crucial sub-modules in DGEDT, here we briefly introduce these two networks and illustrate the fact that GCN can be considered as a specialized Transformer.
Assume that there are three input matrices Q ∈ R^{n×d_k}, K ∈ R^{m×d_k}, V ∈ R^{m×d_v}, which represent the queries, keys and values, respectively, where n and m are the lengths of the two inputs, and d_k and d_v are the dimension sizes of keys and values. The scaled dot-product attention is computed as:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V    (1)

where Attention(Q, K, V) ∈ R^{n×d_v}. Transformer adopts a multi-head attention mechanism to further enhance the representative ability as follows:

h_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (2)
MultiHead(Q, K, V) = [h_1; h_2; ...; h_H] W^O    (3)

where h_i is the i-th head embedding and [·;·] denotes concatenation. Then, two normalization layers are employed to extract higher-level features as follows:

Q_1 = Norm(Q + MultiHead(Q, K, V))    (4)
Q_2 = Norm(Q_1 + FFN(Q_1))    (5)

where FFN(x) = Relu(x W_1 + b_1) W_2 + b_2 is a two-layer multi-layer perceptron (MLP) with the Relu activation function, Norm is a normalization layer, and Q_2 is the output vector of this transformer layer. Equations (1)-(5) can be repeated for T times. Note that if Q = K = V, this operation can be considered as self alignment.

Figure 2: An overall demonstration of our proposed DGEDT. The aspect representation is accumulated from the embeddings in its aspect span; thus the attention module is also aspect-sensitive.
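The transformer layer in Equations (1)-(5) can be sketched as follows. This is a minimal NumPy illustration under simplifying assumptions (random toy weights, plain layer normalization without learned gain/bias, d_k = d_v = h / heads), not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Plain normalization layer (no learned gain/bias for brevity).
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def attention(Q, K, V):
    # Eq. (1): scaled dot-product attention.
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # n x m alignment matrix
    return A @ V

def init_params(h, heads=4, seed=0):
    # Toy weights; shapes assume d_k = d_v = h // heads.
    rng, d = np.random.default_rng(seed), h // heads
    mk = lambda *s: rng.normal(size=s) * 0.1
    return {"heads": [{"Wq": mk(h, d), "Wk": mk(h, d), "Wv": mk(h, d)} for _ in range(heads)],
            "Wo": mk(h, h), "W1": mk(h, 2 * h), "b1": np.zeros(2 * h),
            "W2": mk(2 * h, h), "b2": np.zeros(h)}

def transformer_layer(Q, K, V, p):
    # Eqs. (2)-(3): per-head attention, then concatenate and project.
    heads = [attention(Q @ w["Wq"], K @ w["Wk"], V @ w["Wv"]) for w in p["heads"]]
    multi = np.concatenate(heads, axis=-1) @ p["Wo"]
    Q1 = layer_norm(Q + multi)                                       # Eq. (4)
    ffn = np.maximum(Q1 @ p["W1"] + p["b1"], 0) @ p["W2"] + p["b2"]  # FFN
    return layer_norm(Q1 + ffn)                                      # Eq. (5)
```

Passing the same matrix as Q, K and V gives the self-alignment case used inside DGEDT.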
As for GCN, the computation can be conducted as follows when the adjacency matrix of the words in the input is explicitly provided:

GCN(Q) = Norm(Q + Relu(A_adj Q W_g + b_g))    (6)

where A_adj ∈ R^{n×n} is the adjacency matrix formed from the dependency graph and n is the number of words. A_adj plays the same role as the generated alignment matrix in Equation (1), except for the main difference that A_adj is fixed and discrete. It is obvious that Equation (6) can be decomposed into Equations (1)-(4), and it can also be repeated for T times. In our perspective, GCN is a specialized Transformer with the head size set to one and the generated alignment matrix replaced by a fixed adjacency matrix.
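To make the correspondence concrete, here is an illustrative NumPy sketch of one GCN layer written in the same residual-plus-norm shape as Equations (1)-(4); the row normalization of A_adj and the weight names W_g, b_g are assumptions made for the example, not the paper's exact parameterization.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def gcn_layer(X, A_adj, W_g, b_g):
    # A_adj is fixed and discrete; row-normalize it so each word averages
    # over its dependency-graph neighbors (the analogue of a softmax
    # alignment matrix in the transformer layer).
    A_hat = A_adj / np.maximum(A_adj.sum(axis=1, keepdims=True), 1)
    # Fixed alignment + projection, then residual and norm.
    return layer_norm(X + np.maximum(A_hat @ X @ W_g + b_g, 0))
```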

DGEDT
The network architecture of our proposed DGEDT is shown in Figure 2. For a given input text, we first utilize BiLSTM or Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) as the aspect-based encoder to extract hidden contextual representations. These hidden representations are then fed into our proposed dual-transformer structure, with the guidance of the aspect-modified dependency graph. Finally, we aggregate all the aspect representations via max-pooling and apply an attention module to align contextual words with the target aspect. In this way, the model can automatically select relevant aspect-sensitive contextual words with the dependency information for sentiment classification.

Aspect-based Encoder
We use w_k to represent the k-th word embedding. Bidirectional LSTMs (BiLSTM) (Schuster and Paliwal, 1997; Hochreiter and Schmidhuber, 1997) are applied as the encoder if we do not use BERT.
h_k = Encoder(w_1, ..., w_N)_k    (7)

where h_k ∈ R^h is the k-th output of the encoder (BERT or BiLSTM), k ∈ [1, N], h is the hidden size, and N is the text length. Note that for a given aspect, there may exist M aspect mentions referring to the same aspect in the text. Also, each aspect mention could contain more than one word. To ease aspect-level representation in the later stage, we choose to collapse each aspect mention into a single word: the summation of the representations of the constituent words within the mention serves as its hidden representation. We also maintain a span set span of size N_s, where each span records the start and end positions of an aspect mention; span_j denotes the j-th span in the original text. Note that for non-aspect words, the spans involved in the computation are their original positions with length one.
s_j = Σ_{k ∈ span_j} h_k    (8)

where j ∈ [1, N_s], and N_s ≤ N denotes the number of words after the aspect-based sum operation. s_j is the j-th output of the aspect-based encoder layer. This process can be illustrated by an example transforming 'It has a bad memory but a great battery life' into 'It has a bad memory but a great [battery life]': N is ten and N_s is nine in this case.
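The span-based sum above amounts to summing encoder outputs inside each span. A small sketch, assuming spans are given as inclusive (start, end) index pairs:

```python
import numpy as np

def collapse_spans(H, spans):
    """Collapse each (start, end) span by summing its hidden vectors.
    Aspect mentions become a single position; non-aspect words are
    length-1 spans covering their own position."""
    return np.stack([H[s:e + 1].sum(axis=0) for (s, e) in spans])

# 'It has a bad memory but a great battery life': N = 10 word vectors;
# collapsing 'battery life' (positions 8-9) yields N_s = 9 rows.
```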

Dual-transformer Structure
After obtaining the contextual hidden representations from the aspect-based encoder, we develop a dual-transformer structure to fuse the flat textual knowledge and the dependency knowledge in a mutual reinforcement manner. Specifically, as demonstrated in Figure 3, the dual-transformer structure consists of a multi-layer Transformer and a multi-layer BiGCN.
Bidirectional GCN: We design a BiGCN by considering the direction of each edge in the dependency graph. Note that the dependency graph is constructed at the word level. Hence, similar to the aspect-level representation described in Section 4.1, we merge the edges corresponding to the constituent words of the given aspect in the adjacency matrix, resulting in an aspect-level adjacency matrix. Then, we derive the graph-based representations for the input text as follows:

S_out = GCN(S, Ã^out_adj)    (9)
S_in = GCN(S, Ã^in_adj)    (10)
S' = [S_out ; S_in] W_d    (11)

where Ã^out_adj and Ã^in_adj are the outgoing and incoming aspect-level adjacency matrices gathered from the dependency graph, respectively, and W_d projects the concatenation of the two directions back to the hidden size. Here, we concatenate the representations of the two directions to produce the final output in each iteration, while other similar methods conduct the merging only in the last iteration. BiGCN denotes Equations (9)-(11). We use a simple method to merge the adjacency entries of the words in the same aspect span:

Ã_adj[j][k] = min(1, Σ_{p ∈ span_j} Σ_{q ∈ span_k} A_adj[p][q])    (12)

where A_adj can be replaced by A^out_adj and A^in_adj, and we can thus get Ã^out_adj and Ã^in_adj; span_i again denotes the i-th span in the original text.
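One BiGCN step and the span-merging of the adjacency matrix can be sketched as follows; the row normalization, the Relu activation, and the projection W_d after concatenation are illustrative assumptions rather than the paper's exact choices.

```python
import numpy as np

def gcn(X, A, W):
    # One directional GCN pass with a row-normalized adjacency matrix.
    return np.maximum((A / np.maximum(A.sum(axis=1, keepdims=True), 1)) @ X @ W, 0)

def bigcn_step(S, A_out, A_in, W_out, W_in, W_d):
    # Run GCN along outgoing and incoming dependency edges, then
    # concatenate both directions and project back to the hidden size.
    S_fwd = gcn(S, A_out, W_out)
    S_bwd = gcn(S, A_in, W_in)
    return np.concatenate([S_fwd, S_bwd], axis=-1) @ W_d

def merge_spans(A, spans):
    # Aspect-level adjacency: span j connects to span k if any of their
    # constituent words are connected in the word-level graph.
    n = len(spans)
    M = np.zeros((n, n))
    for j, (js, je) in enumerate(spans):
        for k, (ks, ke) in enumerate(spans):
            M[j, k] = min(1.0, A[js:je + 1, ks:ke + 1].sum())
    return M
```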
BiAffine Module: Assume that there are two inputs S_1 ∈ R^{n×h} and S_2 ∈ R^{n'×h}. We introduce a mutual BiAffine transformation process to interchange their relevant features as follows:

A_1 = softmax(S_1 W_1 S_2^T)    (14)
A_2 = softmax(S_2 W_2 S_1^T)    (15)
S_1' = A_1 S_2    (16)
S_2' = A_2 S_1    (17)

where W_1, W_2 ∈ R^{h×h}. Here, S_1' can be considered as a projection from S_2 to S_1, and S_2' follows the same principle. BiAffine denotes Equations (14)-(17). A_1 and A_2 are temporary alignment matrices projecting from S_2 to S_1 and from S_1 to S_2, respectively.

The Whole Procedure: We can then assemble all the sub-modules mentioned above to construct our proposed dual-transformer structure; the detailed procedure is listed below:

S^Tr_0 = S^G_0 = H    (18)
S̃^Tr_{t+1} = Transformer(S^Tr_t)    (19)
S̃^G_{t+1} = BiGCN(S^G_t, Ã^out_adj, Ã^in_adj)    (20)
S^Tr', S^G' = BiAffine(S̃^Tr_{t+1}, S̃^G_{t+1})    (21)
S^Tr_{t+1} = Norm(S̃^Tr_{t+1} + S^Tr')    (22)
S^G_{t+1} = Norm(S̃^G_{t+1} + S^G')    (23)

where H ∈ R^{N_s×h} denotes the contextual hidden representations {s_1, ...} from the aspect-based encoder, and Transformer denotes the process described by Equations (1)-(5). Equations (19)-(23) can be repeatedly calculated for T times with t ∈ [0, T). We choose S^Tr_T (flat (with graph) in Figure 3) as the last representation, because S^G_T (graph (with flat) in Figure 3) heavily depends on the dependency graph.
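The mutual BiAffine exchange can be sketched directly from its description above; a minimal NumPy version with arbitrary toy weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def biaffine(S1, S2, W1, W2):
    """Mutual BiAffine transformation: each view attends over the other."""
    A1 = softmax(S1 @ W1 @ S2.T)  # alignment projecting S2 onto S1's positions
    A2 = softmax(S2 @ W2 @ S1.T)  # alignment projecting S1 onto S2's positions
    return A1 @ S2, A2 @ S1       # exchanged representations
```

In the dual-transformer loop, S1 and S2 would be the current Transformer and BiGCN outputs; the exchanged features are then fused back before the next iteration.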

Aspect-based Attention Module
Given that M aspect representations can be obtained through the above-mentioned procedure, we derive the final aspect representation a by a MaxPooling operation over them. We then utilize an attention mechanism to identify relevant words with respect to the aspect. However, there would be M aspect mentions which are all highly relevant to the aggregated aspect representation. To avoid these aspect mentions being assigned too high a weight, we utilize a mask mechanism to explicitly set the attention values of aspect mentions to zero. Let I be the index set of these M aspect mentions; we form the Mask vector as follows:

Mask_k = 0 if k ∈ I, and Mask_k = 1 otherwise    (24)

We then calculate the probability distribution p of the sentiment polarity as follows:

u_k = a^T W_3 s_k    (25)
α = softmax(u) ⊙ Mask    (26)
z = tanh(W Σ_k α_k s_k + b)    (27)
p = softmax(W_p z + b_p)    (28)

where W_3, W, W_p and b, b_p are learnable weights and biases, respectively.
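The masking idea can be sketched as follows; the dot-product scoring and the post-mask renormalization are illustrative choices for the example, not necessarily the paper's exact parameterization.

```python
import numpy as np

def masked_aspect_attention(S, a, aspect_idx):
    """Attend over contextual words with respect to the pooled aspect
    vector `a`, forcing the attention on the aspect mentions to zero."""
    mask = np.ones(len(S))
    mask[list(aspect_idx)] = 0.0          # attention on mentions forced to zero
    scores = S @ a                        # dot-product relevance (illustrative)
    e = np.exp(scores - scores.max()) * mask
    alpha = e / e.sum()                   # renormalized masked attention weights
    return alpha @ S, alpha
```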

Loss Function
The proposed DGEDT is optimized by the standard gradient descent algorithm with the cross-entropy loss and L2-regularization:

L = −Σ_{(x, y_p) ∈ D} log p_{y_p} + λ‖θ‖²    (29)

where D denotes the training dataset, y_p is the ground-truth label and p_{y_p} denotes the y_p-th element of p. θ represents all trainable parameters, and λ is the coefficient of the regularization term.
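The objective is standard; a minimal sketch of cross-entropy plus an L2 penalty over predicted distributions:

```python
import numpy as np

def loss(probs, labels, params, lam=1e-4):
    """Cross-entropy over the gold labels plus L2 regularization.
    `probs`: batch of predicted distributions p; `labels`: ground-truth
    indices y_p; `params`: list of trainable weight arrays (theta)."""
    ce = -np.log(probs[np.arange(len(labels)), labels]).sum()
    l2 = lam * sum((W ** 2).sum() for W in params)
    return ce + l2
```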
Datasets

We conduct experiments on five benchmark datasets: the Twitter dataset and SemEval datasets consisting of data from two categories: laptop and restaurant. The statistics of the datasets are presented in Table 1.

Experiment Setup
We compare the proposed DGEDT* with a line of baselines and state-of-the-art alternatives, including LSTM, MemNet (Tang et al., 2016b), AOA (Huang et al., 2018), IAN (Ma et al., 2017), TNet-LF, CAPSNet (Chen and Qian, 2019), Transfer-CAPS (Chen and Qian, 2019), TG-BERT (Gao et al., 2019), AS-CNN (Zhang et al., 2019) and AS-GCN (Zhang et al., 2019). We conduct the experiments with our proposed DGEDT using BiLSTM as the aspect-based encoder, and DGEDT+BERT using BERT as the aspect-based encoder. Several simplified variants of DGEDT are also investigated: DGEDT(Transformer) denotes that we keep the standard Transformer and remove the BiGCN part, while DGEDT(BiGCN) denotes that we keep the BiGCN and remove the Transformer part. The layer number or iteration number (i.e., T) of all available models is set to three for both Transformer and GCN. We use the Spacy toolkit† to generate dependency trees.

Parameter Settings
We use the BERT-base English version (Devlin et al., 2019), which contains 12 hidden layers and 768 hidden units per layer. We use Adam (Kingma and Ba, 2014) as the optimizer for BERT and for the rest of our model, with the learning rates initialized to 0.00001 and 0.001, respectively, and a learning-rate decay rate of 0.98. Besides the decay rate, the learning rate also decreases dynamically according to the current step number. Batch shuffling is applied to the training set. The hidden size of our basic BiLSTM is 256 and the size of all embeddings is set to 100. The vocabulary size of BERT is 30,522. The batch size of all models is set to 32. As for regularization, dropout is applied to word embeddings with a dropout rate of 0.3. Besides, the coefficient λ for the L2-norm regularization is set to 0.0001. We train our model for up to 50 epochs and repeat each experiment 10 times with random initialization. Accuracy and Macro-Averaged F1 are adopted as the evaluation metrics. We follow the experimental setup in (Zhang et al., 2019; Chen and Qian, 2019) and report the average maximum value of all metrics on the test set. If the model is not equipped with BERT, we use word vectors pre-trained with GloVe (Pennington et al., 2014).

* Available at https://github.com/tomsonsgs/DGEDT-sentimaster.
† Available at https://spacy.io/

Overall Results
As shown in Table 2, our model DGEDT outperforms all other alternatives on all five datasets. BERT brings further improvement, especially on Twitter, Rest14 and Rest15. We observe that the traditional Transformer variant DGEDT(Transformer) obtains better performance than DGEDT(BiGCN) on most datasets. DGEDT combines the two sub-modules (traditional Transformer and dependency graph enhanced GCN) and outperforms either single sub-module: using the dependency tree indeed contributes to the performance when acting as a supplement rather than as a single decisive module.

Ablation Study
Note that the performance of the individual modules is already reported in Table 2. As shown in Table 3, we investigate and report four typical ablation conditions. '-Mask' denotes that we remove the aspect-based attention mask mechanism, and '-MultiAspect' denotes that we only use the aspect representation of the first aspect mention instead of MaxPooling over all mentions; we can see that both mechanisms provide slight improvements. '-BiGCN(+GCN)' means that we remove the bidirectional connection and only use the original GCN; the results show that the bidirectional GCN outperforms the original GCN owing to the richer connection information. '-BiAffine' indicates that we remove the BiAffine process and use all the outputs of the dual-transformer structure; we can thus conclude that the BiAffine process is critical for our model, as a simple concatenation of the outputs of Transformer and BiGCN performs worse than DGEDT(Transformer).

Impact of Iteration Number
As shown in Figure 4, three is the best iteration number for Lap14 and Rest14. Dependency information is not fully propagated when the iteration number is too small, while the model suffers from over-fitting and redundant information passing when the iteration number is too large, resulting in a performance drop. Hence, experiments need to be conducted to determine a proper iteration number.

Case Study and Attention Distribution Exploration
As shown in Figure 5, DGEDT and DGEDT(BiGCN) output the correct prediction Negative while DGEDT(Transformer) fails for the sentence The management was less than accommodating. To figure out the essential cause, we demonstrate the attention of the self alignment in Figure 5. We can see that for the aspect management, DGEDT(Transformer) mainly focuses on accommodating, which is a positive word at the document level; thus, DGEDT(Transformer) produces an incorrect prediction Positive. In the dependency tree, less, which is often regarded as a negative word, has a closer connection with the aspect management, so DGEDT(BiGCN) outputs the right sentiment Negative. With the assistance of the supplementary dependency graph, DGEDT also obtains the right prediction Negative owing to the high attention value between management and less.

As shown in Figure 6, DGEDT and DGEDT(Transformer) output the correct prediction Positive while DGEDT(BiGCN) fails for the sentence This little place is wonderfully warm welcoming. To figure out the essential cause, we demonstrate the attention of the self alignment and the dependency tree in Figure 6. We can see that for the aspect place, DGEDT(Transformer) mainly focuses on wonderfully, which is a positive word at the document level; thus, DGEDT(Transformer) obtains a correct prediction Positive. In the dependency tree, little, which is often regarded as a negative word, has a closer connection with the aspect place, so DGEDT(BiGCN) outputs the incorrect sentiment Negative. Despite the disturbance of the inappropriate dependency tree, DGEDT still obtains the right prediction Positive owing to the high attention value between place and wonderfully.

Figure 6: Case Study 2: A testing example demonstrating that the information of the dependency tree may be harmful for the classification performance, while our dual-transformer model still obtains a proper attention distribution. Darker cell color indicates a higher attention value; the aspect is place and the golden sentiment is Positive.
We can see from two examples above that DGEDT is capable of achieving the proper balance between dependency graph enhanced BiGCN and traditional Transformer according to different situations.

Conclusion
Recently, neural structures with syntactical information such as semantic dependency trees and constituency trees have been widely employed to enhance the word-level representations of traditional neural networks. These structures are often modeled by TreeLSTMs or GCNs. To introduce Transformer into our task and diminish the error induced by incorrect dependency trees, we propose a dual-transformer structure which treats the connections in the dependency tree as a supplementary GCN module alongside a Transformer-like structure for self alignment, as in the traditional Transformer. The results on five datasets demonstrate that the dependency tree indeed promotes the final performance when utilized as a sub-module of the dual-transformer structure.
In future work, we can further improve our method in the following aspects. First, the edge information of the dependency trees needs to be exploited; we plan to employ an edge-aware graph neural network that considers the edge labels. Second, domain-specific knowledge can be incorporated into our method as an external learning source.