Inducing Target-specific Latent Structures for Aspect Sentiment Classification

Aspect-level sentiment analysis aims to recognize the sentiment polarity of an aspect or a target in a comment. Recently, graph convolutional networks based on linguistic dependency trees have been studied for this task. However, the dependency parsing accuracy of commercial product comments or tweets might be unsatisfactory. To tackle this problem, we associate linguistic dependency trees with automatically induced aspect-specific graphs. We propose gating mechanisms to dynamically combine information from word dependency graphs and latent graphs which are learned by self-attention networks. Our model can complement supervised syntactic features with latent semantic dependencies. Experimental results on five benchmarks show the effectiveness of our proposed latent models, giving significantly better results than models without using latent graphs.


Introduction
Aspect-level sentiment analysis aims to classify the sentiment polarities towards specific aspect terms in a given sentence (Jiang et al., 2011; Dong et al., 2014; Vo and Zhang, 2015). Aspects are also called opinion targets, which are typically product or service features in customer reviews. For example, in the user comment "The environment is romantic, but the food is horrible", the sentiments of the two aspects "environment" and "food" are positive and negative, respectively. The main challenge of aspect-level sentiment analysis is to effectively model the interaction between the aspect and its surrounding contexts. For example, identifying "romantic" rather than "horrible" as the opinion word is the key to correctly classifying the sentiment of "environment".
Recently, graph convolutional networks (GCNs; Kipf and Welling (2017)) over dependency trees (Marcheggiani and Titov, 2017; Sun et al., 2019b; Wang et al., 2020) have received much research attention. They have been shown to be more effective for learning aspect-specific representations than traditional sentence encoders that do not consider graph structures (Tang et al., 2016a,b; Liu and Zhang, 2017; Li et al., 2018a). Intuitively, dependency trees allow a model to better represent the correlation between aspect terms and their relevant opinion words. However, the existing methods suffer from two potential limitations. First, dependency parsing accuracies can be relatively low on noisy texts such as tweets, blogs and review comments, which are the main sources of aspect-level sentiment data. Second, dependency syntax according to a treebank may not be the most effective structure for capturing interaction between aspect terms and opinion words. Take Figure 1(a) for example. The aspect term "manager" is syntactically related to "not apologetic" only through complained → manager and complained → not apologetic, though semantically they are directly related.
[Figure 1: (a) An example dependency tree from the Stanford CoreNLP parser for the sentence "i complained to the manager , but he was not even apologetic". (b) A latent graph for the aspect term "portions" in the sentence "the portions are small but being that the food was so good makes up for that .". (c) A latent graph for the aspect term "food" in the same sentence.]
One intuitive solution to the aforementioned problems is to automatically induce semantic structures during the optimization process for sentiment classification. To this end, existing work has investigated latent structures for sentence-level sentiment classification (Yogatama et al., 2016; Kim et al., 2017; Choi et al., 2017; Zhang et al., 2018; Corro and Titov, 2019), but no existing work has considered aspect-level sentiment classification. For the aspect-level task, a different structure should ideally be learned for each aspect. As shown in Figure 1(b) and 1(c), given the sentence "the portions are small but being that the food was so good makes up for that.", ideal structures for the aspects "portions" and "food" consist only of links relevant to the terms and their opinion words, without introducing additional information.
We empirically investigate three different methods for inducing semantic dependencies, namely attention (Vaswani et al., 2017), sparse attention (Correia et al., 2019) and HardKuma discrete structures (Bastings et al., 2019). In particular, attention has been used as a soft alignment structure for tasks such as machine translation (Bahdanau et al., 2014), and sparse attention has been used for text generation (Martins et al., 2020). The Hard Kumaraswamy distribution has been used to induce discrete structures with full differentiability (Bastings et al., 2019). We build a unified self-attentive network (SAN) framework (Vaswani et al., 2017) for investigating the three structure induction methods, using a graph convolutional network on top of the induced aspect-specific structure for aspect-level sentiment classification. In addition, to exploit the mutual benefit with dependency syntax, we further consider a novel gate mechanism for merging multiple tree structures during GCN encoding.
Experiments on five benchmarks including Twitter, laptop and restaurant comments show the effectiveness of our proposed latent variable models. Our final methods give the state-of-the-art results in the literature, achieving significantly better accuracies than models without using latent graphs. To our knowledge, we are the first to investigate automatically inducing tree structures for targeted sentiment classification. We release our code at https://github.com/CCSoleil/latent graph atsc.

Related Work
Aspect-level sentiment analysis Aspect-level sentiment analysis includes three main subtasks, namely aspect term sentiment classification (ATSC) (Jiang et al., 2011; Dong et al., 2014), aspect category sentiment classification (ACSC) (Jo and Oh, 2011; Pontiki et al., 2015, 2016) and aspect-term or opinion word extraction (Li et al., 2018b; Fan et al., 2019; Wan et al., 2020). In this paper, we focus on ATSC. To model relationships between the aspect terms and the context words, Vo and Zhang (2015) designed target-aware pooling functions to extract discriminative contexts. Tang et al. (2016a) modeled the interaction of targets and context words by using target-dependent LSTMs. Tang et al. (2016b) used multi-hop attention and memory networks to correlate an aspect with its opinion words. Zhang et al. (2016) designed gating mechanisms to select useful contextual information for each target. Attention networks were further explored by subsequent work (Ma et al., 2017; Liu and Zhang, 2017). Li et al. (2018a) used target-specific transformation networks to learn target-specific word representations. Liang et al. (2019) used aspect-guided recurrent transition networks to generate aspect-specific sentence representations. Sun et al. (2019a) constructed aspect-related auxiliary sentences as inputs to BERT (Devlin et al., 2019) to obtain strong contextual encoders. BERT-based post-training has also been proposed for enhancing domain-specific contextual representations for aspect sentiment analysis.
Recently, a line of work has considered dependency tree information for ATSC. Lin et al. (2019) proposed a deep mask memory network based on dependency trees. Sun et al. (2019b), among others, encoded dependency trees using GCNs for aspect-level sentiment analysis. Zhao et al. (2019) used GCNs to model fully connected graphs between aspect terms, so that all targets can be classified using a shared representation. Huang and Carley (2019) proposed graph attention networks based on dependency trees for modeling structural relations. Wang et al. (2020) used relational graph attention networks to incorporate dependency edge type information, and constructed aspect-specific graph structures by heuristically reshaping dependency trees.
Latent graph induction Latent graphs can be induced to learn task-specific structures by end-to-end models jointly with downstream tasks. Kim et al. (2017) proposed structural attention networks to introduce latent dependency graphs as intermediate layers for neural encoders. Niculae et al. (2018) used SparseMAP to obtain a sparse distribution over latent dependency trees. Peng et al. (2018) implemented a differentiable proxy to the argmax operator over latent dependency trees, which can be regarded as a special case of introducing sparsity constraints into the softmax function (Niculae et al., 2018; Peters et al., 2019; Correia et al., 2019). Bastings et al. (2019) used HardKuma to sample stochastic interpretable discrete graphs for interpreting the classification results. Corro and Titov (2018) induced dependency structure for unsupervised parsing with a differentiable perturb-and-parsing method. While previous work obtains different structures using different methods, we investigate multiple methods for ATSC.
More in line with our work, Yogatama et al. (2016) and Zhang et al. (2018) considered reinforcement learning for inducing latent structures for text classification. Our work is similar in spirit but differs in two main respects. First, we consider aspect-based sentiment, learning a different structure for each aspect term in the same sentence. Second, we empirically compare different methods for latent graph induction, and investigate complementary effects with dependency trees. To our knowledge, we are the first to consider inducing structures automatically for aspect-based sentiment classification.

Model
The overall model structure is shown in Figure 1. The model consists of four main components: a sequence encoder layer for the input sentence, a structural representation layer that learns a latent induced structure A, a GCN that encodes the latent structure, and an aspect-oriented classification layer. Below we discuss each component in detail, from the bottom up.

Sentence Encoder
We separately explore two sentence encoders, namely a bidirectional long short-term memory network (BiLSTM) encoder and a BERT encoder. Given an input sentence $s = w_1 w_2 \ldots w_n$, we first obtain the embedding vector $x_i$ of each $w_i$ using a lookup table $E \in \mathbb{R}^{|V| \times d_w}$ (where $|V|$ is the vocabulary size and $d_w$ is the dimension of word vectors) and then use a standard BiLSTM encoder to obtain the contextual vectors of the input sentence. For the BERT encoder, we follow the standard practice of feeding the sentence paired with its aspect sequence $w_f w_{f+1} \ldots w_e$ in $s$ as the input. Since BERT uses a subword encoding mechanism (Sennrich et al., 2015), we apply average pooling over the subword-level representations to obtain the corresponding word-level representations. The output vectors from the sentence encoder are denoted as $\tilde{c}^0_i$ for each $w_i$.
Aspect mask In order to make the encoder learn aspect-specific representations, we use distance-based masks on the word representations $\tilde{c}^0_i$. Formally, given an aspect $w_f w_{f+1} \ldots w_e$, Eq 1 assigns each position a mask weight according to its distance from the aspect span and scales $\tilde{c}^0_i$ by that weight to give $h^0_i$. In this way, the closer the context words are to the aspect, the higher their weights are. We denote the sentence representation as $H = [h^0_1, h^0_2, \ldots, h^0_n]$, which is used for inducing latent graphs later.
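As a minimal sketch of a distance-based aspect mask, the snippet below assumes a linear decay of the form 1 - dist / n, where dist is the number of tokens between a word and the aspect span. The decay form, the 0-based indices and the function names are illustrative assumptions; the exact weights are those defined in Eq 1.

```python
def aspect_distance_mask(n, f, e):
    """Return one weight per token for a sentence of n tokens.

    f and e are the 0-based, inclusive start and end of the aspect span.
    Assumption: weights decay linearly with distance (1 - dist / n).
    """
    weights = []
    for i in range(n):
        if i < f:
            dist = f - i          # token lies to the left of the aspect
        elif i > e:
            dist = i - e          # token lies to the right of the aspect
        else:
            dist = 0              # token is part of the aspect
        weights.append(1.0 - dist / n)
    return weights


def apply_mask(vectors, weights):
    """Scale each contextual vector by its mask weight."""
    return [[w * x for x in vec] for vec, w in zip(vectors, weights)]


# Five tokens, aspect at position 2.
mask = aspect_distance_mask(n=5, f=2, e=2)
masked = apply_mask([[1.0, 1.0]] * 5, mask)
```

Under this assumed decay, tokens inside the aspect span keep their full representation and the weight falls off symmetrically on both sides.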

Dependency Tree Representation
Given a sentence $s = w_1 w_2 \ldots w_n$ and the corresponding dependency tree $t$ over $s$ (obtained using a parser), an undirected graph $G$ is built by taking each word as a node and representing head-dependent relations in $t$ as edges. Each head-dependent arc is converted into two undirected edges. In addition, self loops are included for each word. Formally, the adjacency matrix $A_{dep}$ is given by
$$A_{dep}[i, j] = \begin{cases} 1 & \text{if } i = j \text{ or } w_i \text{ and } w_j \text{ are linked by an arc in } t \\ 0 & \text{otherwise} \end{cases} \quad (2)$$
$A_{dep}$ represents the syntactic dependencies between word pairs.
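The construction of the dependency adjacency matrix can be illustrated with a short sketch. The input encoding (a list of 0-based head indices with -1 for the root), the function name and the toy parse are assumptions made for illustration only.

```python
def dependency_adjacency(heads):
    """Build a symmetric 0/1 adjacency matrix with self loops.

    heads[i] is the 0-based index of the head of token i, or -1 for the root
    (a hypothetical encoding chosen for this sketch).
    """
    n = len(heads)
    adj = [[0] * n for _ in range(n)]
    for i, h in enumerate(heads):
        adj[i][i] = 1             # self loop for every word
        if h >= 0:
            adj[i][h] = 1         # dependent -> head
            adj[h][i] = 1         # head -> dependent
    return adj


# Toy parse of "the food is horrible": the -> food, food -> horrible,
# is -> horrible, horrible is the root.
adj = dependency_adjacency([1, 3, 3, -1])
```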

Latent Graph
We propose to learn latent graphs $A_{lat}$ for each aspect, investigating three methods, namely self-attention, sparse self-attention and HardKuma.
Self-attention-based latent graph Self-attention networks (SANs) compute similarity scores between two arbitrary nodes in a graph. Formally, given a sentence representation $H$, the similarity score $\alpha_{ij}$ can be regarded as the interaction strength between node $i$ and node $j$. $A_{lat}$ is given by
$$A_{lat} = \mathrm{softmax}\left(\frac{(Q W_q)(K W_k)^\top}{\sqrt{d}}\right) \quad (3)$$
where $Q$ and $K$ are two copies of $H$, representing the query and key vectors, respectively. $W_q \in \mathbb{R}^{d \times d}$ and $W_k \in \mathbb{R}^{d \times d}$ are model parameters. The denominator $\sqrt{d}$ is a scale constant for controlling the magnitude of the dot-product operation. The softmax function is applied row-wise, normalizing the similarity scores over the column index so that each row of $A_{lat}$ sums to 1.
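A single-head version of the scaled dot-product latent graph can be sketched in plain Python. The helper names and the toy inputs (identity projection matrices, a three-word sentence with d = 2) are illustrative assumptions, not values from the experiments.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [x / total for x in exps]


def attention_graph(h, w_q, w_k):
    """Row-normalized latent graph: softmax((H W_q)(H W_k)^T / sqrt(d))."""
    d = len(h[0])
    q = matmul(h, w_q)
    k = matmul(h, w_k)
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    return [softmax([s / math.sqrt(d) for s in row]) for row in scores]


h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
a_lat = attention_graph(h, identity, identity)
```

Every row of `a_lat` is a dense distribution over all words, which is the property the sparse and HardKuma variants are meant to relax.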
Multi-head SANs partition the graph representation $H$ into multiple non-overlapping heads. For the $i$-th head, Eq 3 is independently applied to generate $A^i_{head}$. The final latent graph averages the latent graphs of all heads (Eq 4).
Sparse-self-attention-based latent graph SANs learn a fully connected latent graph, where dense attention weights can bring noise from irrelevant context. To address this issue, sparse SANs can enable each node to attend only to highly relevant contextual nodes. To achieve this goal, we replace the softmax operation in Eq 3 with the 1.5-entmax function (Niculae et al., 2018; Peters et al., 2019; Correia et al., 2019), which projects a real-valued vector onto the probability simplex and can assign exactly zero probability to some entries. Formally,
$$A_{lat} = \text{1.5-entmax}\left(\frac{(Q W_q)(K W_k)^\top}{\sqrt{d}}\right) \quad (5)$$
where 1.5-entmax is applied to each row of the resulting matrix, with
$$\text{1.5-entmax}(x) = \arg\max_{p \in \Delta^d} \; p^\top x + H^T_{1.5}(p).$$
Here $H^T_{1.5}(p)$ is an entropy function and $\Delta^d$ denotes the probability simplex. For more details, readers can refer to Peters et al. (2019).
Similar to Eq 4, multi-head SANs are used for sparse latent graph learning.
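To make the sparsity of 1.5-entmax concrete, the sketch below computes it for a single score vector. It relies on the closed form from Peters et al. (2019), where each output is the square of max(x_i / 2 - tau, 0) and the threshold tau is chosen so that the outputs sum to 1. Finding tau by bisection is a simplification assumed here for readability; it is not the exact sort-based algorithm, and the function name is illustrative.

```python
def entmax15(x, iters=60):
    """Approximate 1.5-entmax of a list of scores via bisection on the threshold."""
    z = [v / 2.0 for v in x]            # (alpha - 1) * x with alpha = 1.5
    lo, hi = max(z) - 1.0, max(z)       # total mass is >= 1 at lo and 0 at hi
    for _ in range(iters):
        tau = (lo + hi) / 2.0
        mass = sum(max(v - tau, 0.0) ** 2 for v in z)
        if mass > 1.0:
            lo = tau                    # threshold too low, raise it
        else:
            hi = tau                    # threshold too high, lower it
    tau = (lo + hi) / 2.0
    return [max(v - tau, 0.0) ** 2 for v in z]


p = entmax15([1.0, 0.5, -3.0])
uniform = entmax15([0.0, 0.0])
```

Unlike softmax, the clearly lower score receives a weight of exactly zero, so the corresponding edge disappears from the latent graph.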
HardKuma-based latent graph HardKuma (Bastings et al., 2019) is a method that produces stochastic graphs by sampling. Suppose that each edge $\alpha_{ij} \in [0, 1]$ between nodes $i$ and $j$ is a random variable with $\alpha_{ij} \sim \mathrm{HardKuma}(a, b, l, r)$, where HardKuma is a rectified Kumaraswamy distribution whose support additionally includes both 0 and 1, $a > 0$ and $b > 0$ are parameters that control the shape of the Hard Kumaraswamy probability distribution, and $l < 0$ and $r > 1$ define the supporting open interval $(l, r)$. A sample of $\alpha_{ij}$ can be obtained with the reparameterization trick (Kingma and Welling, 2013; Jang et al., 2016): $u \sim U(0, 1)$ is a uniform random variable that is sampled in place of drawing from HardKuma directly, $s_1$ is a sample of the Kumaraswamy distribution obtained from $u$, and $s_2$ is $s_1$ stretched to the supporting interval by shift and scale operations. $s_2$ is converted to $z$ by a hard-sigmoid function, which ensures that the value of $z$ falls into $[0, 1]$. $z$ is differentiable with respect to the shape parameters $a$ and $b$.
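The sampling steps can be sketched as follows. The sketch uses the standard inverse CDF of the Kumaraswamy distribution, (1 - (1 - u)^(1/b))^(1/a); the default interval l = -0.1, r = 1.1, the function name and the use of Python's `random` module are assumptions for illustration. No gradients are computed here.

```python
import random


def hardkuma_sample(a, b, l=-0.1, r=1.1, rng=random):
    """Draw one HardKuma sample through the four reparameterized steps."""
    u = rng.random()                                   # u ~ U(0, 1)
    s1 = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)   # Kumaraswamy sample (inverse CDF)
    s2 = l + (r - l) * s1                              # stretch to the interval (l, r)
    return min(1.0, max(0.0, s2))                      # rectify into [0, 1]


rng = random.Random(0)
samples = [hardkuma_sample(1.0, 1.0, rng=rng) for _ in range(1000)]
```

Because the stretched sample can fall below 0 or above 1 before rectification, a non-trivial share of edges comes out as exactly 0 or exactly 1, which is what makes the induced graph discrete.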
Denote the shape parameters for all edges as $a$ and $b$. With reparameterization, the sampling noise is independent of the model, and the main goal is to represent $a$ and $b$ using neural networks. Specifically, $a$ and $b$ are calculated by SANs. Formally, $a \in \mathbb{R}^{n \times n}$ for the whole graph is computed starting from $H_a = \mathrm{MHSAN}(H, H, H)$, where MHSAN, LN and FFN denote multi-head self-attention networks, layer normalization and position-wise feed-forward networks, respectively. In our model, the particular networks of MHSAN, LN and FFN are taken from Transformer (Vaswani et al., 2017). $b$ is defined in a similar way, but with different parameters. Here $H_a$ is the initial result of MHSAN, $C_a$ applies residual connections and feature transformations, $s_a$ is the initial similarity score calculated by self-attention, $n_a$ denotes the normalized similarity scores, and $a$ is made positive by applying the softplus activation function over $n_a$.

Graph Convolutional Networks
Graph convolutional networks (GCNs) (Kipf and Welling, 2017) encode graph-structured data with convolution operations. The representation of each node $v$ in a graph $G$ is aggregated from its neighbors. Suppose that the node set is $V = \{v_i\}_{i=1}^{n}$, where $n$ is the number of nodes, and the graph is $G = \{V, A\}$, where $A \in \mathbb{R}^{n \times n}$ is the adjacency matrix between nodes. Let the representation vector of $v_i$ at the $l$-th layer be $h^l_i \in \mathbb{R}^d$, where $d$ is the node vector dimension. The whole graph representation of the $l$-th layer $H^l$ is the concatenation of all the node vectors of this layer, namely $H^l = [h^l_1, h^l_2, \ldots, h^l_n]$ with $H^l \in \mathbb{R}^{n \times d}$. The graph convolution for $H^l$ is given by
$$H^l = \rho(A H^{l-1} W^l + b^l) \quad (7)$$
where $W^l \in \mathbb{R}^{d \times d}$ and $b^l \in \mathbb{R}^d$ are model parameters for the $l$-th layer. $\rho$ is an activation function over the input graph representation $H^{l-1}$ and is typically set to the ReLU function. The initial input $H^0$ is the sentence representation $H$.
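One graph convolution step of this form, with ReLU as the activation, can be written in a few lines. The helper names and the toy two-node graph are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, h, w, b):
    """One GCN layer: ReLU(A H W + b), with the bias added to every node row."""
    ahw = matmul(matmul(adj, h), w)
    return [[max(0.0, v + bias) for v, bias in zip(row, b)] for row in ahw]


adj = [[1, 1], [1, 1]]                    # two connected nodes with self loops
h = [[1.0, 2.0], [3.0, -4.0]]             # input node vectors (d = 2)
w = [[1.0, 0.0], [0.0, 1.0]]              # identity weights for readability
out = gcn_layer(adj, h, w, [0.0, 0.0])    # -> [[4.0, 0.0], [4.0, 0.0]]
```

Each node sums the vectors of its neighbors (including itself), and the negative second coordinate is clipped by the ReLU.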

Gated Combination
Given two graphs $A_{dep}$ and $A_{lat}$, we design gating mechanisms to combine the strengths of both. Formally, suppose that the input graph representation is $H_{in}$ and that the graph convolution weight matrix and bias are $W$ and $b$, respectively. We propose a gated GCN that outputs $H_{out}$ by considering both $A_{dep}$ and $A_{lat}$ (Eq 8), where $g$ is a gating function learned automatically from data and $0 \le \lambda \le 1$ is a hyper-parameter for prior knowledge. The graph convolution matrix $W$ is shared between $A_{dep}$ and $A_{lat}$, so the gating does not introduce any additional parameters. $AHW$ in Eq 8 can be replaced with $I_{com}$, which is a gated combination of $A_{dep} H W$ and $A_{lat} H W$. This combination is equivalent to first merging $A_{dep}$ and $A_{lat}$ into a single graph $A_{com}$ using dynamic gating mechanisms and then directly using $A_{com}$ as $A$ in Eq 7 to obtain the graph representations.
Gated GCN blocks In practice, we stack $N$ GCN layers. For different layers, the convolution parameters are different. A highway network is used to combine the feature representations in adjacent layers. Formally, let the input representation of the $l$-th block be $H^{l-1}$, where the input to the first block is the aspect-aware sentence representation $H^0$. The output of the $l$-th block $H^l$ is computed by applying a highway gate to the output of gatedcombine, the gated GCN function defined in Eq 8. We apply the highway gate to all the GCN blocks except for the last one.
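The following sketch shows one plausible instantiation of a gated combination with a shared weight matrix. The specific gate, a sigmoid of the latent-graph term scaled by λ, is an assumption made for illustration and is not confirmed by the text above; only the shared $W$, the range of λ and the reduction to a dependency-only GCN at λ = 0 are taken from the description.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gated_gcn(a_dep, a_lat, h, w, b, lam):
    """Combine A_dep H W and A_lat H W with an element-wise gate, then ReLU.

    Assumed gate (hypothetical): g = lam * sigmoid(A_lat H W).
    """
    i_dep = matmul(matmul(a_dep, h), w)    # syntactic term, shared W
    i_lat = matmul(matmul(a_lat, h), w)    # latent term, same W
    out = []
    for row_dep, row_lat in zip(i_dep, i_lat):
        row = []
        for x_dep, x_lat, bias in zip(row_dep, row_lat, b):
            g = lam / (1.0 + math.exp(-x_lat))
            row.append(max(0.0, (1.0 - g) * x_dep + g * x_lat + bias))
        out.append(row)
    return out


a_dep = [[1, 0], [0, 1]]
a_lat = [[0.5, 0.5], [0.5, 0.5]]
h = [[2.0, 0.0], [0.0, 2.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
dep_only = gated_gcn(a_dep, a_lat, h, w, [0.0, 0.0], lam=0.0)
mixed = gated_gcn(a_dep, a_lat, h, w, [0.0, 0.0], lam=0.2)
```

With λ = 0 the gate is closed and the layer behaves like a GCN over the dependency graph alone; increasing λ lets information from the latent graph flow into positions that the dependency graph leaves disconnected.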

Sentiment Classifier
Aspect-specific attention Based on the output of the last GCN block $H^N$, we obtain the representations for the aspect $w_f w_{f+1} \ldots w_e$ using $H^N_f, H^N_{f+1}, \ldots, H^N_e$. The final aspect-specific feature representation $z$ is given by an attention network over the sentence representation vectors $\tilde{c}^0$ (Eq 9), where $C = [\tilde{c}^0_1, \tilde{c}^0_2, \ldots, \tilde{c}^0_n]$ is the contextualized representation produced by the sentence encoder, $\gamma_t$ is the attention score of the $t$-th context word with respect to the aspect term, $\alpha$ denotes the normalized attention scores and $z$ is the final aspect-specific representation.
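A simple way to realize such an attention step is sketched below. It assumes that the score of each context word is the sum of its dot products with the GCN outputs at the aspect positions, followed by a softmax and a weighted sum of the encoder vectors. This scoring function, the 0-based span indices and all names are assumptions for illustration, since the exact form of Eq 9 is not reproduced here.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [x / total for x in exps]


def aspect_attention(c, h_n, f, e):
    """Attend over encoder vectors c using GCN outputs h_n at aspect span [f, e]."""
    scores = [
        sum(sum(x * y for x, y in zip(c_t, h_n[i])) for i in range(f, e + 1))
        for c_t in c
    ]
    alpha = softmax(scores)
    dim = len(c[0])
    z = [sum(alpha[t] * c[t][k] for t in range(len(c))) for k in range(dim)]
    return alpha, z


c = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # encoder outputs, one per word
h_n = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # last GCN block outputs
alpha, z = aspect_attention(c, h_n, f=1, e=1)
```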
Softmax classifier The aspect-specific representation vector $z$ is then used to calculate the sentiment scores by a linear transformation. A softmax classifier is used to predict the probability of each sentiment class according to the learned sentiment scores. Formally,
$$p = \mathrm{softmax}(W_o z + b_o) \quad (10)$$
where $W_o$ and $b_o$ are model parameters and $p$ is the predicted sentiment probability distribution.

Training
The classifier is trained by minimizing the negative log-likelihood of a set of training samples $D$, where each $x_i$ contains a set of aspects $c_{i,j}$. Formally, the loss function is the negative log-likelihood summed over all aspects of all training instances plus a regularization term on the parameters, where $N$ is the number of training instances, $\theta$ is the set of model parameters, $\lambda$ is a regularization hyper-parameter, $y_{i,j}$ is the training label of the $j$-th aspect $c_{i,j}$ in $x_i$ and $p_{y_{i,j}}$ is the aspect classification probability for $c_{i,j}$, which is given by Eq 10.
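The classification and training objective can be illustrated with a small sketch. The squared L2 form of the regularizer, the flattened list of (aspect, label) pairs and all function names are assumptions for illustration; the text only states that λ weights a regularization term.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [x / total for x in exps]


def predict(z, w_o, b_o):
    """Sentiment distribution: softmax(W_o z + b_o)."""
    logits = [sum(w * x for w, x in zip(row, z)) + b for row, b in zip(w_o, b_o)]
    return softmax(logits)


def training_loss(probs, labels, params, lam):
    """Negative log-likelihood over all aspects plus an assumed L2 penalty."""
    nll = -sum(math.log(p[y]) for p, y in zip(probs, labels))
    l2 = sum(v * v for v in params)
    return nll + lam * l2


w_o = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]    # three sentiment classes, d = 2
b_o = [0.0, 0.0, 0.0]
p = predict([0.5, -0.5], w_o, b_o)            # uniform, since all weights are zero
loss = training_loss([p], [1], params=[0.0] * 9, lam=0.01)
```

With all-zero parameters the prediction is uniform over the three classes, so the loss of a single aspect equals log 3.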

Experiments
We conduct experiments on five benchmark datasets for aspect-level sentiment analysis: Twitter posts (TWITTER) from Dong et al. (2014), laptop reviews (LAP14) and restaurant reviews (REST14, REST15 and REST16). Dependency parses are obtained with Stanza. No dependency labels are used. For the other settings, we follow previous work. Following previous conventions, we run each model three times and average the results, reporting accuracy (Acc.) and macro-F1 (F1).
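For reference, the two reported metrics can be computed as in the sketch below. The integer label encoding, the convention of scoring a class with no true or predicted instances as 0, and the toy predictions are assumptions for illustration.

```python
def accuracy(gold, pred):
    """Fraction of aspects whose predicted polarity matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = [0, 1, 2, 1]
pred = [0, 1, 1, 1]
acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred, labels=[0, 1, 2])
```

Because macro-F1 weights every class equally, a class that is never predicted lowers F1 much more than it lowers accuracy.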

Development Results
Effect of latent graphs Table 2 shows the performances on REST16. We enhance dependency tree based graphs with self-attention based latent graph models (sanGCN), sparse self-attention based latent graph models (sparseGCN) and HardKuma based latent graph models (kumaGCN). sparseGCN significantly outperforms depGCN. sanGCN is also better than depGCN in terms of F1 scores. kumaGCN performs the best, achieving an accuracy of 89.39 and an F1 score of 73.19, which empirically shows the importance of introducing stochastic, semantically direct connections between the aspect words and the context words.
We additionally test two model variants, −latent and −dep, which denote kumaGCN models without using latent graphs or dependency trees, respectively. Both underperform the full model, which demonstrates the strength of combining the two graphs for learning better aspect-specific graph representations. Additionally, −latent is worse than −dep especially in terms of F1, which shows that the automatically induced latent graph can be better than the dependency graph. As a result, we use kumaGCN as our final model.
Effect of λ To investigate how the trade-off between automatically induced latent graphs and dependency trees affects the ATSC performance, we vary λ in Eq 8 from 0 to 1 with a step size of 0.1. Figure 2 shows the F1 scores achieved by kumaGCN on REST16 and REST15 with different λ. When λ = 0, the model reduces to depGCN; when λ = 1, the model relies solely on automatically learned latent graphs. λ = 0.2 gives the best results, which shows that the two structures are complementary to each other. We thus set λ = 0.2.

Main Results
We compare our models with several baselines, including LSTM, TNet-LF and depGCN in the non-BERT setting, and BERT-SPC, depGCN+BERT and RGAT+BERT in the BERT setting.
Without BERT, using GloVe embeddings Table 3 shows the results. kumaGCN outperforms all the baselines in terms of both averaged accuracy scores and averaged F1 scores. In particular, it improves the performance by 2.77 F1 points compared with the depGCN method. This gain over depGCN empirically demonstrates the effectiveness of introducing latent graphs for aspect sentiment analysis tasks. In terms of running time, the self-attention module and the gated combination module make our model slower than depGCN. In practice, we compare our model with depGCN on the REST16 test dataset (616 examples). The inference time costs are 0.32s and 0.48s for depGCN and our model respectively, which shows that our model does not add too much computation overhead.
Our model also significantly outperforms the state-of-the-art non-depGCN model TNet-LF on all the datasets except for Twitter. On Twitter, sparseGCN gives 72.64 accuracy, which is comparable to the performance of TNet-LF (72.98). TNet-LF applies a topmost 2D convolution layer over a BiLSTM encoder to capture local n-grams and is thus less sensitive to informal texts without strong sequential patterns. We believe that the slight performance deficiency compared to TNet-LF is because of specific network settings. In particular, TNet-LF applies an attention-based context-preserving transformation to enhance the contextual representations produced by the BiLSTM encoder. For fair comparison with baselines, we do not use such modules. To our knowledge, our model gives the best results without using BERT.
Sun et al. (2019b) also proposed a GCN model based on dependency trees for aspect sentiment analysis, similar to depGCN. Sun et al. (2019b) use aspect-specific pooling over the dependency tree nodes to obtain the final representation vector, instead of the aspect mask and aspect-specific attention used by depGCN. Because the data settings of Sun et al. (2019b) differ from ours, we also add a head-to-head comparison with Sun et al. (2019b) using their data settings, as shown in Table 5. It can be seen that our model still achieves better F1 scores on all the datasets.

[Table 5: Comparison with Sun et al. (2019b)'s model.]
With BERT We compare our kumaGCN+BERT models with the state-of-the-art BERT-based models, and also implement the depGCN+BERT model as a baseline. Table 4 shows the results. depGCN+BERT generally performs better than BERT-SPC. Our model outperforms both depGCN+BERT and BERT-SPC on all the datasets. Compared to the current state-of-the-art dependency tree based model RGAT+BERT, our model is better on TWITTER and LAP14. On REST14, our model achieves accuracy comparable to RGAT+BERT even though it does not use dependency label information. In addition, our model gives 86.35/70.76 (REST15) and 92.53/79.24 (REST16) Acc./F1 scores, which are the best results on these datasets to our knowledge.

Parameter-based Transfer Learning
We further perform experiments using parameter-based transfer learning by training one source model on the Twitter dataset, and testing the trained model on the restaurant datasets. Table 6 shows the results. Our model outperforms BERT-SPC on all the target domains, which empirically demonstrates the strong aspect-specific semantic representation abilities of our proposed model. Compared to depGCN+BERT, our model improves accuracy on the three datasets by about 10.0 points, which suggests that the induced latent structures are robust in capturing aspect-opinion interactions.

Attention Distance
Figure 3 shows the distribution of the averaged attention weights of the context words as a function of their distance to the aspect terms on the test sentences of REST16. For both depGCN and kumaGCN, the attention scores defined in Eq 9 are shown. kumaGCN makes the distribution sharper than depGCN, focusing more on the context within 1 or 2 words. This observation also confirms data bias in the training set, where many opinion words are close to the aspect term (Tang et al., 2016b). Though depGCN can assign high weights to words far away from the target by using syntactic path dependencies, it may also bring in more noise. kumaGCN potentially circumvents this problem.

Case Study
To gain more insights into our model's behavior, we show one case study in Figure 4 using the example "when i got there i sat up stairs where the atmosphere was cozy & the service was horrible !". This example contains two aspects, "atmosphere" and "service". Both depGCN and kumaGCN correctly classify the sentiment of "service" as negative. However, depGCN cannot recognize the positive sentiment of "atmosphere" while kumaGCN can. Figure 4(a) compares the attention weights α defined in Eq 9 of each context word with respect to "atmosphere" between depGCN and kumaGCN. For the target "atmosphere", depGCN assigns the highest weight to the word "horrible", which is a sentiment word irrelevant to this target, leading to an incorrect prediction. In contrast, our model assigns the largest weight to the key sentiment word "cozy" and classifies the aspect correctly. Figure 4(b) shows pruned dependency trees that keep only the dependency edges related to these two aspects. We observe that the current parse contains an edge between "cozy" and "horrible", which might mislead depGCN to produce inappropriate representations. For further comparisons, we also extract the links of each target word $i$ from the latent graph $A_{lat}$ (Eq 6) by keeping an edge between $i$ and $j$ if and only if $j = \arg\max_{j'} A_{i,j'}$. If the maximizer is not unique, we return all the indices which correspond to the same highest value. Figure 4(c) and Figure 4(d) show the two latent graphs for "atmosphere" and "service", respectively. First, we observe that the two latent graphs are significantly different. Second, each of them contains only a few edges related to the semantic contexts for sentiment classification. We also verify that there is no edge between "cozy" and "horrible" when inducing the latent graph of "atmosphere". This example illustrates that our model can learn aspect-specific latent graphs. With these automatically induced graphs, our model can learn better aspect-aware representations, providing better attention weights than depGCN.
[Figure 4: (a) Attention comparisons between depGCN and kumaGCN. The number after each word is its attention weight with respect to the target word "atmosphere".
depGCN: when 0.03, i 0.02, got 0.02, there 0.01, i 0.03, sat 0.05, up 0.03, stairs 0.04, where 0.09, the 0.07, atmosphere 0.01, was 0.02, cozy 0.03, & 0.07, the 0.07, service 0.02, was 0.11, horrible 0.25, ! 0.03.
kumaGCN: when 0.00, i 0.00, got 0.00, there 0.00, i 0.00, sat 0.00, up 0.00, stairs 0.00, where 0.09, the 0.10, atmosphere 0.14, was 0.31, cozy 0.35, & 0.00, the 0.00, service 0.00, was 0.00, horrible 0.00, ! 0.00.
(b) Pruned dependency graph for "atmosphere" and "service". (c) Latent graph for "atmosphere". (d) Latent graph for "service".]
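The edge extraction rule used in the case study, keeping for a target word only the positions with the highest latent weight and returning all of them on ties, can be sketched as follows. The function name, the tolerance used to detect ties and the toy matrix are assumptions for illustration.

```python
def strongest_edges(a_lat, i, tol=1e-12):
    """Return every column index j whose weight in row i equals the row maximum."""
    row = a_lat[i]
    best = max(row)
    return [j for j, v in enumerate(row) if abs(v - best) <= tol]


a_lat = [
    [0.1, 0.6, 0.3],    # unique maximum at position 1
    [0.4, 0.2, 0.4],    # tie between positions 0 and 2
]
single = strongest_edges(a_lat, 0)
tied = strongest_edges(a_lat, 1)
```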

Conclusion
We considered latent graph structures for aspect sentiment classification by investigating a variety of neural networks for structure induction, and novel gated mechanisms to dynamically combine different structures. Compared with dependency tree GCN baselines, the model does not introduce additional model parameters, yet significantly enhances the representation power. Experiments on five benchmarks show the effectiveness of our model. To our knowledge, we are the first to investigate latent structures for aspect-level sentiment classification, achieving the best-reported accuracy on five benchmark datasets.