Aspect-Level Sentiment Analysis Via Convolution over Dependency Tree

We propose a method based on neural networks to identify the sentiment polarity of opinion words expressed on a specific aspect of a sentence. Although the large majority of works focus on leveraging the expressive power of neural networks in handling this task, we explore the possibility of integrating dependency trees with neural networks for representation learning. To this end, we present a convolution over a dependency tree (CDT) model which exploits a Bi-directional Long Short Term Memory (Bi-LSTM) to learn representations for features of a sentence, and further enhances the embeddings with a graph convolutional network (GCN) which operates directly on the dependency tree of the sentence. Our approach propagates both contextual and dependency information from opinion words to aspect words, offering discriminative properties for supervision. Experimental results rank our approach as the new state-of-the-art in aspect-based sentiment classification.


Introduction
The explosion in digital technology in recent years has led to a vast amount of opinionated material on the internet. In particular, individuals have expressed opinions on several aspects of products, services, blogs, and comments, and these opinions are deemed influential, especially when purchase decisions are based on product reviews (Schouten and Frasincar, 2015). However, due to the volume of content online, sifting through reviews to learn which opinions are expressed on specific aspects is cumbersome. This has led to an increase in research on aspect-based sentiment analysis (ABSA), which aims to find scalable solutions that address the problem automatically. More specifically, ABSA involves two tasks: (1) to identify aspects of a sentence, and (2) to determine the sentiment polarity (e.g. positive, negative, neutral) expressed on a specific aspect. In this paper, we focus on the second task: aspect-based sentiment classification.
Figure 1: An example of a dependency tree for the sentence "We ordered the special grilled branzino, that was so infused with bone, it was difficult to eat", where an opinion word (blue) and the specific aspect expression (red) are connected with other word tokens based on their syntactic dependencies.
Several methods have been developed to address the classification task. The majority of recent works, such as (Dong et al., 2014; Tang et al., 2015; Wang et al., 2016; Chen et al., 2017; Cheng et al., 2017), have exploited neural networks due to their ability to model representations for sentences automatically. Some recent methods have also integrated lexical resources with neural networks to achieve state-of-the-art performance in ABSA (Ouyang and Su, 2018).
Generally, we find that a dependency tree shortens the distance between the aspects and opinion words of a sentence, captures the syntactic relations between words, and offers discriminative syntactic paths on arbitrary sentences for information propagation across the tree. For instance, in the dependency tree depicted in Figure 1, the distance between the aspect expression 'grilled branzino' and the opinion word 'difficult' is shortened to a single path based on their syntactic dependencies. These properties allow neural network models to capture long-term syntactic dependencies effortlessly. Besides, dependency trees have graph-like structures, which brings into play a recent class of neural networks, namely graph convolutional networks (GCN) (Kipf and Welling, 2016). The GCN has been successful in learning representations for nodes, capturing the local position of nodes in the graph. Several AI applications such as link prediction (Schlichtkrull et al., 2018; Zitnik et al., 2018; Kong et al., 2019), semantic role labeling (Marcheggiani and Titov, 2017), and relation extraction have successfully exploited GCNs to improve representation learning.
These observations motivate us to develop a neural model which can operate on the dependency tree of a sentence, with the aim of making accurate sentiment predictions with respect to specific aspects. Specifically, we propose a convolution over a dependency tree (CDT) model which exploits a GCN to model the structure of a sentence through its dependency tree, where node (word) embeddings of the tree are initialized by means of a Bi-directional Long Short Term Memory (Bi-LSTM) network. Motivated by recent work on relation extraction, we find that the architecture of CDT allows the Bi-LSTM to account for contextual information between successive words, while the GCN enhances the embeddings by modeling the dependencies along the syntactic paths of the dependency tree. Such operations allow information to be transferred from opinion words to aspect words, implying that the encoding for aspect words is sufficient for supervision in the classification task. Experimental results, including visualizations, show the effectiveness of our proposed model.

Related Work
The performance bottleneck in the classification task of ABSA comes from modeling representations which efficiently encode the relationship between a specific aspect and the opinion words of a sentence. Most recent methods have focused on leveraging neural networks (Chen et al., 2017; Gu et al., 2018; Majumder et al., 2018; Fan et al., 2018a; Xue and Li, 2018; Huang and Carley, 2018; Zheng and Xia, 2018), which model representations automatically. Besides, in contrast to rule-based methods (Hu and Liu, 2004; Popescu and Etzioni, 2005; Ding et al., 2008), neural networks are more capable of dealing with situations where opinion words are found in more complicated contexts.
Among neural network methods, some model the sentence representation using RNN variants such as LSTM and gated recurrent units (GRU). Chen et al. (2017) handle the encoding of reviews using a BiLSTM and attention networks. Gu et al. (2018) improve the performance by considering the position of the aspect words. Similarly, Zheng and Xia (2018) use LSTMs to learn embeddings for the left context, right context and target phrase of sentences while considering the interactions between targets and contexts. Majumder et al. (2018), on the other hand, model the sentence representations using GRU and attention mechanisms. However convenient, these neural network based methods neglect informative resources such as dependency trees, which are capable of shortening the distance between aspect and opinion words, enabling dependency information to be preserved effectively in lengthy sentences.
The state-of-the-art methods for representation learning have integrated dependency trees with neural networks. Tai et al. (2015) proposed a tree-structured LSTM: a generalized class of LSTMs which enables the learning of dependency information between words and phrases. Mou et al. (2015) exploit the short paths of dependency trees to learn representations of sentences using convolutional neural networks, while preserving dependency information. Motivated by such works, Gu et al. (2018) proposed a position encoding convolutional neural network which takes into account the relative position of words and entities of a dependency tree for relation classification. Given that a dependency tree can be considered as a graph, Marcheggiani and Titov (2017) introduced a variant of a GCN to model representations for dependency graphs in semantic role labeling tasks.
In a recent relation extraction task, entity-based representations were extracted via a GCN which operates on a dependency tree, and stacking a GCN layer over an LSTM was observed to improve performance considerably. We follow a similar approach and propose CDT: a method which performs convolutions over a dependency tree to extract rich representations for aspect-based sentiment classification. CDT extracts a final representation for the ABSA classification task by aggregating only the aspect vectors. We believe this is sufficient because the GCN component can be interpreted as a message passing network which propagates information along edges. Thus, successive GCN operations allow information to be propagated across the network, and hence aspect vectors are encoded with information from opinion words, which should be sufficient for supervision.

Convolution over Dependency Tree Model
In this section, we describe the CDT model, which takes as input the dependency tree of a sentence. Node embeddings of the dependency tree are initially modeled by means of a BiLSTM, and the embeddings are further enhanced via a GCN. Finally, an aggregator is applied over the enhanced aspect embeddings to distill a dense vector embedding for the classification task. In particular, we aim to extract embeddings which encode both contextual and dependency information between a specific aspect expression and opinion words, providing supervisory signals for the aspect-based classification task.
We briefly describe the BiLSTM model, which takes as input the sentence $s$ with $n$ ordered word embeddings. The BiLSTM integrates context information in the word embeddings by keeping track of dependencies along the chain of words. We are given an aspect-sentence pair $(a, s)$, where $a = \{a_1, a_2, \ldots, a_l\}$ is a sub-sequence of the sentence $s = \{w_1, w_2, \ldots, w_n\}$. The sentence $s$ has corresponding word embeddings $x = \{x_1, x_2, \ldots, x_n\}$. The forward LSTM learns hidden state representations $\{\overrightarrow{h}^0_1, \overrightarrow{h}^0_2, \ldots, \overrightarrow{h}^0_n\}$ on the word embeddings in $x$, which allows contextual information to be captured in the forward direction. In a similar fashion, a backward LSTM learns $\{\overleftarrow{h}^0_1, \overleftarrow{h}^0_2, \ldots, \overleftarrow{h}^0_n\}$ in the backward direction. Finally, we concatenate the corresponding parallel representations modeled by the forward and backward LSTMs into higher dimensional representations $\{h^0_1, h^0_2, \ldots, h^0_n\}$, which contain the subsequence $\{h^0_{a_1}, h^0_{a_2}, \ldots, h^0_{a_l}\}$ corresponding to the aspect expression $a$. In doing so, we capture contextual information between opinion words and aspects. Besides, we integrate dependency information in the contextualized embeddings using a GCN which operates directly on the dependency tree of the sentence.

Graph Convolutional Network
The dependency tree can be interpreted as a graph G with n nodes, where nodes represent words in the sentence and edges represent syntactic dependency paths between words in the graph. The nodes of the dependency tree are given by real-valued vectors modeled by the BiLSTM as described above. This structure allows a GCN to operate directly on the graph to model dependencies that exist between words. To allow the GCN to model node embeddings efficiently, we allow G to have self-loops. The GCN approach ensures that the sentence structure represented by the dependency tree is encoded efficiently, whereby the representations for nodes encode the local position of opinion words and the target words in the dependency tree.
The dependency tree G for any arbitrary sentence can be represented as an n × n adjacency matrix A, with entries $A_{ij}$ signaling whether node i is connected to node j by a single dependency path in G. Specifically, $A_{ij} = 1$ if node i is connected to node j, and $A_{ij} = 0$ otherwise. Together with node embeddings modeled by BiLSTMs, we can exploit a GCN capable of operating directly on graphs. The GCN makes efficient use of dependency paths to transform and propagate information across the paths, and updates node embeddings by aggregating the propagated information. In such an operation, the GCN only considers the first-order neighborhood of a node when modeling its embeddings. However, k successive GCN operations result in the propagation of information across the k-th order neighborhood. A single node embedding update takes the form
$$h_i^{(k+1)} = \phi\left(\sum_{j=1}^{n} c_i A_{ij} \left(W^{(k)} h_j^{(k)} + b^{(k)}\right)\right),$$
where $h_j^{(k)}$ is the hidden state representation for node j at the k-th layer of the GCN, $b^{(k)}$ is a bias term, $W^{(k)}$ is a parameter matrix, and $c_i$ is a normalization constant, which we choose as $c_i = 1/d_i$. Here $d_i$ denotes the degree of node i in the graph, calculated as $d_i = \sum_{j=1}^{n} A_{ij}$, and $\phi(\cdot)$ is the ReLU elementwise non-linear activation function. Note that $h_i^{0}$ represents the initial embedding modeled by the BiLSTM, and $h_i^{(k+1)}$ is the final output for node i at layer k.
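The update rule described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names `build_adjacency` and `gcn_layer`, the representation of a parse as a list of head indices, and the choice to treat dependency edges as undirected are all assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def build_adjacency(heads: List[int]) -> Matrix:
    """Build an adjacency matrix with self-loops from head indices.

    heads[i] is the index of the syntactic head of token i, or -1 for the
    root. Treating edges as undirected is an assumption of this sketch.
    """
    n = len(heads)
    adj = [[0.0] * n for _ in range(n)]
    for i, head in enumerate(heads):
        adj[i][i] = 1.0  # self-loop, so a node keeps its own information
        if head >= 0:
            adj[i][head] = 1.0
            adj[head][i] = 1.0
    return adj


def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def gcn_layer(adj: Matrix, h: Matrix, weight: Matrix, bias: List[float]) -> Matrix:
    """One update: h_i' = relu(sum_j c_i * A_ij * (W h_j + b)), with c_i = 1 / d_i."""
    n = len(adj)
    out_dim = len(weight)
    # Transform every node once: W h_j + b.
    transformed = [
        [sum(w * x for w, x in zip(weight[r], h[j])) + bias[r] for r in range(out_dim)]
        for j in range(n)
    ]
    updated = []
    for i in range(n):
        degree = sum(adj[i])  # d_i = sum_j A_ij, at least 1 because of the self-loop
        updated.append([
            relu(sum(adj[i][j] * transformed[j][r] for j in range(n)) / degree)
            for r in range(out_dim)
        ])
    return updated


if __name__ == "__main__":
    # Hypothetical 3-token parse: token 1 is the root, tokens 0 and 2 attach to it.
    adj = build_adjacency([1, -1, 1])
    h = [[1.0], [0.0], [0.0]]  # only token 0 carries a signal initially
    for layer in range(2):
        h = gcn_layer(adj, h, weight=[[1.0]], bias=[0.0])
        print(layer + 1, h)
```

In this toy chain, the third token still has a zero embedding after one layer and receives a nonzero value only after the second layer, which mirrors the statement that k successive operations reach the k-th order neighborhood.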
Figure 2: Overview of the model architecture on an example sentence input, where $[w_{a_1}, w_{a_2}]$ is the specific aspect expression in $s$, and $k$ is the number of GCN layers.
In extracting a final embedding for the classification task, we exploit a simple aggregator. For our framework, we choose an average pool which aggregates information over the aspect vectors. We choose to aggregate only the aspect vectors because we believe that these vectors encode contextual and dependency information owing to the BiLSTM and the GCN respectively. The BiLSTM and the GCN can be interpreted as message passing networks. Specifically, the BiLSTM allows the aspect words of an arbitrary sentence to be contextualized, while the GCN finds the local position of aspect words in the syntactic dependency tree. The local position within the dependency tree encodes dependency information of a word with respect to its neighbors. As a result, the BiLSTM and the GCN allow embeddings for aspect words to have discriminative features, providing supervisory signals for the classification task. Moreover, we perform an average pool to retain most of the information in the aspect vectors. The pool operation over the aspect vectors takes the form
$$h_a^{(k+1)} = f\left(\left\{h_{a_1}^{(k+1)}, h_{a_2}^{(k+1)}, \ldots, h_{a_l}^{(k+1)}\right\}\right),$$
where $f(\cdot)$ is an average pool function applied over the enhanced aspect vectors. We present an overview of the model architecture in Figure 2 based on an example sentence input.
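As a minimal sketch of the aggregation step, the average pool over aspect positions can be written as below. The function name `aspect_average_pool` and the convention of passing aspect positions as a list of token indices are hypothetical choices for illustration, not details taken from the paper.

```python
from typing import List, Sequence


def aspect_average_pool(
    hidden: Sequence[Sequence[float]], aspect_idx: Sequence[int]
) -> List[float]:
    """Average the final-layer node vectors at the aspect positions only.

    hidden[i] is the enhanced embedding of token i after the last GCN layer;
    aspect_idx lists the token positions that form the aspect expression.
    """
    if not aspect_idx:
        raise ValueError("aspect_idx must contain at least one position")
    dim = len(hidden[0])
    return [
        sum(hidden[i][r] for i in aspect_idx) / len(aspect_idx)
        for r in range(dim)
    ]


if __name__ == "__main__":
    # Three tokens with 2-dimensional embeddings; the aspect spans tokens 1 and 2.
    hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(aspect_average_pool(hidden, [1, 2]))
```

Non-aspect tokens contribute to the result only indirectly, through whatever information the BiLSTM and GCN have already propagated into the aspect vectors.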

Model Training
The aspect-based representation $h_a^{(k+1)}$ is passed to a fully connected softmax layer σ whose output is a probability distribution over the different sentiment polarities. The model is trained end-to-end through backpropagation, where the objective function to be minimized is the cross entropy error defined as
$$L(\theta_1, \theta_2) = -\sum_{(a,s) \in D} \sum_{c \in C} y_c((a,s)) \log \hat{y}_c((a,s)),$$
where D is a collection of aspect-sentence pairs, C is the collection of distinct sentiment classes, and $y_c((a,s))$ is the ground truth for $(a, s)$, which takes the value of either 1 or 0. Besides, $(a, s)$ can belong to only one sentiment class. Hence $y_c((a,s)) = 1$ indicates that the ground truth sentiment class for $(a, s)$ is c. $\hat{y}_c((a,s))$ is the model's prediction for $(a, s)$. $\theta_1$ and $\theta_2$ are the trainable parameters of the BiLSTM and GCN respectively.
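The classification layer and training objective described above can be sketched as follows. Because each aspect-sentence pair belongs to exactly one class, the sum over classes reduces to the negative log-probability of the gold class. The helper names `softmax`, `classify` and `cross_entropy`, and the use of integer class indices as labels, are assumptions of this sketch.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over class logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(
    h_a: Sequence[float], weight: Sequence[Sequence[float]], bias: Sequence[float]
) -> List[float]:
    """Fully connected layer followed by softmax over sentiment classes."""
    logits = [sum(w * x for w, x in zip(row, h_a)) + b for row, b in zip(weight, bias)]
    return softmax(logits)


def cross_entropy(probs_batch: Sequence[Sequence[float]], gold: Sequence[int]) -> float:
    """Sum over pairs of -log(probability assigned to the gold class).

    With one-hot ground truth, this equals the double sum over pairs and classes.
    """
    return -sum(math.log(probs[c]) for probs, c in zip(probs_batch, gold))


if __name__ == "__main__":
    # Hypothetical 2-dimensional aspect vector and 3 classes (e.g. positive, negative, neutral).
    probs = classify([0.5, -1.0], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.0, 0.0, 0.0])
    print(probs, cross_entropy([probs], [0]))
```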

Experiment
In this section, we conduct experiments on benchmark datasets to validate our model, which we denote as CDT. We also present restricted versions of our model denoted as ASP-BiLSTM and ASP-GCN. Unlike our main model, ASP-BiLSTM only exploits a BiLSTM to model contextual information with respect to a specific aspect expression, while ASP-GCN exploits a GCN to model dependencies between words. Both models extract a final embedding from the aspect vectors. We propose these two models to observe the performance of the GCN and the BiLSTM individually, as well as the performance when we stack a GCN on a BiLSTM, which forms the CDT model. To establish CDT as the new state-of-the-art in aspect-based sentiment classification, we compare CDT with several well established models, showing that CDT outperforms very recent models in the classification task. In addition, we perform case studies with visualizations to verify our approach of aggregating only aspect vectors for the final embedding. We further present visualizations on case examples showing how the GCN improves on a simple BiLSTM model.

Datasets
We evaluate the performance of our model on SemEval 2014 (Pontiki et al., 2014), which consists of restaurant reviews (Rest14) and laptop reviews (Laptop14). We also evaluate our model on SemEval 2016, which contains restaurant reviews (Rest16). Experiments are also performed on a collection of tweets from Twitter provided by Dong et al. (2014). We summarize the statistics of the datasets in Table 1.

Implementation and parameter settings
For fairness in model comparison, we use similar parameters in the compared models. Specifically, we exploit 300-dimensional GloVe vectors (Pennington et al., 2014) for the word embeddings, as well as 30-dimensional part-of-speech (POS) embeddings and 30-dimensional position embeddings, which are used to identify the relative position of each word with respect to the aspect in the sentence. We concatenate the word, POS and position embeddings, and learn 50-dimensional BiLSTM embeddings for each word. The GCN operates on the dependency tree of the sentence to enhance the BiLSTM embeddings. All sentences are parsed by the Stanford parser. To encourage the GCN to model dependencies between words, we randomly drop out 10% of neurons per layer, and apply a dropout rate of about 0.7 at the input layer. The GCN model is trained for 100 epochs with batch size 32. We use the Adam optimizer with learning rate 0.01 for all datasets. The code for our model is available on GitHub.
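For quick reference, the reported settings can be collected in a small configuration object. This is only an illustrative sketch: the class name `CDTConfig` and all field names are hypothetical and do not come from the released code, and whether the 50 dimensions apply per direction or to the concatenated BiLSTM output is not specified in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CDTConfig:
    """Hyperparameters as reported in the text (field names are hypothetical)."""

    word_dim: int = 300          # GloVe word embeddings
    pos_dim: int = 30            # part-of-speech embeddings
    position_dim: int = 30       # relative position to the aspect
    lstm_dim: int = 50           # BiLSTM embedding size per word, as reported
    layer_dropout: float = 0.1   # dropout applied per layer
    input_dropout: float = 0.7   # approximate dropout at the input layer
    epochs: int = 100
    batch_size: int = 32
    learning_rate: float = 0.01
    optimizer: str = "adam"

    @property
    def input_dim(self) -> int:
        """Size of the concatenated word, POS and position embedding."""
        return self.word_dim + self.pos_dim + self.position_dim


if __name__ == "__main__":
    config = CDTConfig()
    print(config.input_dim)
```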

Compared Prior Art
As a baseline, we include CNN and LSTM models, which learn representations from both word embeddings and position embeddings. We denote these models as CNN+Position and LSTM+Position. We also include a CNN baseline method which exploits an attention mechanism to model the relation between aspect words and context words. We denote this model as CNN+ATT. These models extract a final embedding by aggregating all learned embeddings using an average pool. In particular, we compare our proposed model with very recent models on the benchmark datasets. The models we consider include:
• TNet (Li et al., 2018a): In this work, BiLSTM embeddings are transformed into target-specific embeddings, and a CNN model is used to extract a final embedding.
• PRET+MULT (He et al., 2018b): A multitask framework based on LSTMs is proposed to transfer knowledge from a document-level model task to an aspect-level model task.
• SA-LSTM-P (Wang and Lu, 2018): This work first learns embeddings using a BiLSTM and then models structural dependencies between words by means of a segmentation attention mechanism.
• LSTM+SynATT+TarRep (He et al., 2018a): This method models target representation as a weighted sum of aspect embeddings, and models the syntactic structure of the sentence using an attention mechanism.
• MGAN (Fan et al., 2018b): A BiLSTM is exploited to capture contextual information in the sentence, while a multi-grained attention mechanism is proposed to extract an embedding which effectively captures the interaction between the aspect and the context.
• MGAN (Li et al., 2018b): This work integrates an alignment mechanism in a multitask model comprising an aspect-term task and an aspect-category task to effectively extract aspect-specific representations.
• HSCN (Lei et al., 2019): A model is proposed to capture interactions between the context and target, select target words and extract target-specific contextual representation, while measuring the deviation between target-specific contextual representation and target representations.

Performance Comparison
In this section, we compare the performance of recent methods with CDT, ASP-BiLSTM and ASP-GCN. We implement and report results for the baseline methods CNN+Position, LSTM+Position and CNN+ATT, and take the results of the recent models under comparison from their original papers. The classification results are shown in Table 2.
From the table, we find that CDT generally outperforms all models on the different datasets, while trailing the TNet model by 0.31 in accuracy on the Twitter dataset. Since this difference is small, it is fair to conclude that both models are competitive on the Twitter dataset. Even with simple architectures, we find that ASP-BiLSTM and ASP-GCN have competitive performance with the recent models on the benchmark datasets. In particular, ASP-GCN outperforms the compared models on the Rest14 dataset.
ASP-BiLSTM, ASP-GCN and CDT extract final representations from only the aspect vectors. Based on the performance, this appears to be a sufficient technique for the classification task. We believe that the aspect vector is encoded with context and dependency information from the context and structure of the sentence by means of the BiLSTM and the GCN. The BiLSTM and GCN can be regarded as message passing networks, propagating information along a chain of words (BiLSTM) or along syntactic dependency paths (GCN). Because relevant information is passed to the aspect words, a simple average pool is all we need to retain information relevant to the classification task. Note that the information propagated in the network is learned; therefore, only weighted information is encoded within the aspect words.

GCN Performance
We conduct an experiment to demonstrate that the performance of our proposed models, namely CDT and ASP-GCN, depends on the number of layers of the GCN. We perform this experiment on the Rest14 dataset and present the result in Figure 4.
In our experiments, we find that accuracy increases with the number of layers up to a point. In particular, the performance of ASP-GCN increases over the first 6 layers of the GCN and becomes unstable after the 6-th layer. Since the GCN passes information within the local neighborhood of a node, successive operations on the dependency tree allow ASP-GCN to pass information to the furthest node. Overfitting takes effect when the number of layers rises beyond a threshold, which explains the accuracy curve after the 6-th layer in the figure. Another important observation is that the accuracy of ASP-BiLSTM and ASP-GCN converges at the 6-th layer. Note that ASP-BiLSTM only captures contextual information while ASP-GCN captures dependency information. However, both models converge in performance at the 6-th layer. By taking advantage of both the GCN and the BiLSTM, we expect to improve performance, capturing both context and dependencies with respect to the aspect expression. As seen in the accuracy curve of CDT, the GCN integrates dependency informa-

Mask Experiment
Our primary assumption was that the aspect embeddings learned by our GCN model contain sufficient information for the classification task. Based on this assumption, we aggregate only the aspect embeddings using an average pool with the aim of retaining most of the information. To verify this assumption, we trace from the input embeddings to the final embedding. We propose a mask method designed to estimate the relevance of a word with respect to the final embedding, and perform mask experiments using ASP-GCN.
The mask method works as follows. First, we follow the conventional procedure to extract a final embedding $h_s$ for a given sentence s using ASP-GCN. We then perform a subsequent run of ASP-GCN on the same sentence s to extract a final embedding, but in this instance we conceal a specific input word w. We conceal w by mapping it to the zero vector before applying ASP-GCN on s. As a result, a final embedding $h_{(s \setminus w)}$ is generated for s. If $h_s = h_{(s \setminus w)}$, the word w has no impact on the representation $h_s$. In other words, $h_s$ does not capture w, or no information flows from w to $h_s$. To this end, we can define a score function to estimate the relevance of w on $h_s$. We define the score function for w as
$$\gamma(w, s) = \frac{1}{m} \sum_{i=1}^{d} \frac{\left| h_s^{i} - h_{(s \setminus w)}^{i} \right|}{d},$$
where d is the dimension of the final embedding distilled by ASP-GCN, and m is a normalization constant which we choose as $m = \max_{w \in s} \gamma(w, s)$, taken over the scores before normalization so that the most relevant word in s receives a score of 1.
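The mask procedure can be sketched generically for any encoder that maps a list of word embeddings to a final embedding. The sketch below is not the authors' code: the name `mask_scores`, the `encode` callback standing in for ASP-GCN, and the use of the mean absolute difference between the two final embeddings are assumptions made for illustration.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Encoder = Callable[[List[Vector]], Vector]


def mask_scores(embeddings: Sequence[Vector], encode: Encoder) -> List[float]:
    """Relevance of each word, estimated by zeroing its input embedding.

    For each position w, the sentence is re-encoded with word w mapped to the
    zero vector, and the change in the final embedding is measured. Scores are
    divided by the largest change so that they lie in [0, 1].
    """
    base = encode([list(x) for x in embeddings])
    d = len(base)
    raw = []
    for w in range(len(embeddings)):
        masked = [list(x) for x in embeddings]
        masked[w] = [0.0] * len(masked[w])  # conceal word w
        h_masked = encode(masked)
        raw.append(sum(abs(a - b) for a, b in zip(base, h_masked)) / d)
    m = max(raw)  # normalization constant
    if m == 0.0:
        return [0.0] * len(raw)  # no word influences the final embedding
    return [r / m for r in raw]


if __name__ == "__main__":
    # Hypothetical stand-in encoder: weights word 1 three times as much as
    # word 0 and ignores word 2 entirely.
    def toy_encode(xs: List[Vector]) -> Vector:
        return [xs[0][0] + 3.0 * xs[1][0]]

    print(mask_scores([[1.0], [1.0], [1.0]], toy_encode))
```

With the toy encoder, the ignored word receives a score of zero and the most heavily weighted word receives the maximum score, which is the behaviour the mask method relies on to identify opinion words.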
Generally, the final embedding should capture information on opinion words with respect to the target aspect. Hence, we expect high scores for opinion words. Considering the word scores shown in Figure 3, we find that γ assigns higher values to opinion words as we increase the number of layers of ASP-GCN, while reducing the scores of irrelevant words. This implies that the final embedding captures information from opinion words. The results seen in these case examples convince us that the final embedding distilled by our model captures the information necessary for the classification task.

Case Study
In this section we study the behaviour of ASP-BiLSTM, TNet and CDT on case examples. To this end we present visualizations showing the attention these models place on words. For a good model, we expect the model to attend to words which influence the sentiment inferred on a specific aspect.
From Table 2 and Figure 4, it is clear that the GCN complements the BiLSTM to improve model performance. The BiLSTM can identify opinion words within the context with respect to a specific aspect, but in some complicated contexts it might perform poorly. The GCN can build upon the BiLSTM to attend to the correct opinion words by leveraging the dependencies among words. Consider the case example shown in Figure 5. ASP-BiLSTM correctly recognizes that the word 'good' is an opinion word with respect to the aspect 'Sangria'. However, ASP-BiLSTM fails to identify whether the 'good' on the far left or the 'good' on the far right is associated with 'Sangria'. Interestingly, we find that the GCN can analyze this further through the dependencies between words to identify that it is the 'good' on the far right. TNet, on the other hand, measures the association between 'Sangria' and 'good' in both directions to identify the correct 'good'.
In Figure 6, even though the BiLSTM is able to identify the opinion word 'GREAT', which expresses an opinion on the aspect 'parathra', CDT is able to capture the opinion word 'FRESH', which directly expresses the sentiment towards the aspect. However, from the visualization it is easily observed that CDT still attends to 'GREAT'. This suggests that the GCN is able to model the importance of the words with respect to the aspect, placing larger weights on words directly expressing the opinion on the aspect. At the same time, TNet misses the opinion word 'FRESH' and places attention on the word 'GREAT', just like ASP-BiLSTM.
In the case example shown in Figure 7, we find that ASP-BiLSTM places little attention on the opinion word 'BEST', which expresses the sentiment on the aspect word 'LASAGNA', while focusing its attention on 'WAS PROBABLY', which is not meaningful alone. Interestingly, CDT builds upon this little information and relies on the dependencies between the words through the dependency tree to learn that 'BEST' is the correct word to attend to. Similar to ASP-BiLSTM, TNet misses the important word 'BEST' and places attention on 'WAS PROBABLY'. This result suggests that TNet heavily depends on the representations modeled by its BiLSTM layer, while CDT considers other information such as the dependencies among words to accurately identify words which express opinions on specific aspects.

Conclusion
Modeling representations for aspect-based sentiment classification generally requires capturing informative words which express the sentiment inferred on the target aspect. Leveraging neural networks is highly desirable for representation learning, and BiLSTM-based models have been successful in capturing contextual information in prior works.
In this paper, we integrate a GCN with a simple BiLSTM model, with the aim to capture structural and contextual information of sentences. We have shown that the GCN successfully performs convolutions on the dependency tree to refine BiLSTM embeddings. Experimental results with visualizations support our argument on the extraction of a final embedding based on only the aspect vectors. In fact, the model we propose is simple and outperforms more complex and recent models tackling the same problem.