Graph Ensemble Learning over Multiple Dependency Trees for Aspect-level Sentiment Classification

Recent work on aspect-level sentiment classification has demonstrated the efficacy of incorporating syntactic structures such as dependency trees with graph neural networks (GNNs), but these approaches are usually vulnerable to parsing errors. To better leverage syntactic information in the face of unavoidable errors, we propose a simple yet effective graph ensemble technique, GraphMerge, to make use of the predictions from different parsers. Instead of assigning one set of model parameters to each dependency tree, we first combine the dependency relations from different parses before applying GNNs over the resulting graph. This allows GNN models to be robust to parse errors at no additional computational cost, and helps avoid overparameterization and overfitting from GNN layer stacking by introducing more connectivity into the ensemble graph. Our experiments on the SemEval 2014 Task 4 and ACL 14 Twitter datasets show that our GraphMerge model not only outperforms models that use a single dependency tree, but also beats other ensemble models without adding model parameters.


Introduction
Aspect-level sentiment classification is a fine-grained sentiment analysis task, which aims to identify the sentiment polarity (e.g., positive, negative, or neutral) of a specific aspect term in a sentence. For example, in "The exterior, unlike the food, is unwelcoming.", the polarities of the aspect terms "exterior" and "food" are negative and positive, respectively. This task has many applications, such as helping customers filter online reviews or make purchase decisions on e-commerce websites.
Recent studies have shown that syntactic information such as dependency trees is very effective in capturing long-range syntactic relations that are obscure from the surface form (Zhang et al., 2018).

Figure 1: An example where an incorrect parse (above the sentence) can mislead aspect-level sentiment classification for the term "food" by connecting it to the negative sentiment word "unwelcoming" by mistake. Although having its own issues, the parse below correctly captures the main syntactic structure between the aspect terms "exterior", "food" and the sentiment word, and is more likely to lead to a correct prediction.

Several successful approaches employed
graph neural network (GNN) (Kipf and Welling, 2016) models over dependency trees for aspect-level sentiment classification (Huang and Carley, 2019; Sun et al., 2019; Wang et al., 2020b), demonstrating that syntactic information helps associate the aspect term with relevant opinion words more directly for increased robustness in sentiment classification. However, existing approaches are vulnerable to parsing errors (Wang et al., 2020b). For example, in Figure 1, the blue parse above the sentence can mislead models to predict negative sentiment for the aspect term "food" through its direct association with "unwelcoming". Despite their high edge-wise parsing performance on standard benchmarks, state-of-the-art dependency parsers usually struggle to predict flawless parse trees, especially in out-of-domain settings. This poses a great challenge to dependency-based methods that rely on these parse trees: the added benefit from syntactic structure does not always prevail over the noise introduced by model-predicted parses (He et al., 2017; Sachan et al., 2021).
In this paper, we propose GraphMerge, a graph ensemble technique to help dependency-based models mitigate the effect of parsing errors. Our technique is based on the observation that different parsers, especially ones with different inductive biases, often err in different ways. For instance, in Figure 1, the green parse under the sentence is incorrect around "unlike the food", but it nevertheless correctly associates "unwelcoming" with the other aspect term "exterior", and therefore is less likely to mislead model predictions. Given dependency trees from multiple parses, instead of assigning each dependency tree a separate set of model parameters and ensembling model predictions or dependency-based representations of the same input, we propose to combine the different dependency trees before applying representation learners such as GNNs.
Specifically, we take the union of the edges in all dependency trees from different parsers to construct an ensemble graph, before applying GNNs over it. This exposes the GNN model to various graph hypotheses at once, and allows the model to learn to favor edges that contribute more to the task. To retain the syntactic dependency information between words in the original dependency trees, we also define two different edge types, parent-to-children and children-to-parent, which are encoded by applying relational graph attention networks (RGAT) (Busbridge et al., 2019) on the ensemble graph.
Our approach has several advantages. Firstly, since GraphMerge combines dependency trees from different parsers, the GNN models can be exposed to multiple parsing hypotheses and learn to choose edges that are more suitable for the task from data. As a result, the model is less reliant on any specific parser and more robust to parsing errors. Secondly, this improved robustness to parsing errors does not require any additional computational cost, since we are still applying GNNs to a single graph with the same number of nodes. Last but not least, GraphMerge helps prevent GNNs from overfitting by limiting over-parameterization. Aside from keeping the GNN computation over a single graph to avoid separate parameterization for each parse tree, GraphMerge also introduces more edges in the graph when parses differ, which reduces the diameter of graphs. As a result, fewer layers of GNNs are needed to learn good representations from the graph, alleviating the oversmoothing problem (Li et al., 2018b).
To summarize, the main contributions of our work are the following:

• We propose GraphMerge, a technique to combine dependency trees from different parsers to improve model robustness to parsing errors. The ensemble graph enables the model to learn from noisy graphs and select correct edges among nodes at no additional computational cost.

• We retain the syntactic dependency information in the original trees by parameterizing parent-to-children and children-to-parent edges separately, which improves the performance of the RGAT model on the ensemble graph.

• Our GraphMerge RGAT model outperforms recent state-of-the-art work on three benchmark datasets (Laptop and Restaurant reviews from SemEval 2014 and the ACL 14 Twitter dataset). It also outperforms its single-parse counterparts as well as other ensemble techniques.

Related Work
Much recent work on aspect-level sentiment classification has focused on applying attention mechanisms (e.g., co-attention, self-attention, and hierarchical attention) to sequence models such as recurrent neural networks (RNNs) (Tang et al., 2015, 2016; Liu and Zhang, 2017; Fan et al., 2018; Chen et al., 2017; Zheng and Xia, 2018; Wang and Lu, 2018; Li et al., 2018a,c). In a similar vein, pretrained transformer language models such as BERT (Devlin et al., 2018) have also been applied to this task, operating directly on word sequences (Xu et al., 2019; Rietzler et al., 2019). In parallel, researchers have also found syntactic information to be helpful for this task, and incorporated it into aspect-level sentiment classification models in the form of dependency trees (Dong et al., 2014; He et al., 2018) as well as constituency trees (Nguyen and Shirai, 2015). More recently, researchers have developed robust dependency-based models with the help of GNNs that operate either directly on dependency trees (Huang and Carley, 2019; Sun et al., 2019) or on reshaped dependency trees that center around aspect terms (Wang et al., 2020b). While most recent work stacks GNNs on top of BERT models, Tang et al. (2020) have also reported gains by jointly learning the two with a mutual biaffine attention mechanism.
Despite the success of these dependency-based models, they are usually vulnerable to parse errors since they rely on a single parser. Tu et al. (2012) used a dependency forest to combine multiple dependency trees; however, they tackled the sentence-level sentiment analysis task instead, and their proposed ensemble technique is also significantly different from ours. Furthermore, most prior work that leverages GNNs to encode dependency information treats the dependency tree as an undirected graph, and therefore ignores the syntactic relations between words in the sentence.

Proposed Model
We are interested in the problem of predicting the sentiment polarity of an aspect term in a given sentence. Specifically, given a sentence of n words {w_1, w_2, . . . , w_n} in which {w_τ, w_τ+1, . . . , w_τ+t−1} is the aspect term, the goal is to classify the sentiment polarity toward the term as positive, negative, or neutral. Applying GNNs over dependency trees has been shown effective for this problem; however, it is vulnerable to parsing errors. Therefore, we propose the GraphMerge technique to leverage multiple dependency trees and improve robustness to parsing errors. In this section, we first introduce GraphMerge, our proposed graph ensemble technique, and then introduce the GNN model over the GraphMerge graph for aspect-level sentiment analysis.

GraphMerge over Multiple Dependency Trees
To allow graph neural networks to learn dependency-based representations of words while being robust to parse errors that might occur, we introduce GraphMerge, which combines different parses into a single ensemble graph. Specifically, given a sentence {w_1, w_2, . . . , w_n} and M different dependency parses G_1, . . . , G_M, GraphMerge takes the union of the edges from all parses and constructs a single graph G as follows:

G = (V, E_1 ∪ E_2 ∪ · · · ∪ E_M),

where V is the shared set of nodes among all graphs and E_m (1 ≤ m ≤ M) is the set of edges in G_m (see the right side of Figure 2 for an example). As a result, G contains all of the (directed) edges from all dependency trees, on top of which we can apply the same GNN models as when a single dependency tree is used. Therefore, GraphMerge introduces virtually no computational overhead to existing GNN approaches, compared to traditional ensemble approaches where computational time and/or parameter count scale linearly in M. Note that parsing time is not counted toward the computational cost, because the dependency trees from the three parsers can be obtained in parallel, so the overall running time is the same as with a single parser.
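As a rough illustration, the edge union above amounts to a set union over (head, dependent) pairs; the parses and token indices below are hypothetical, not taken from the paper's figures:

```python
def graph_merge(parses):
    """Union the directed edge sets of multiple dependency parses.

    Each parse is a set of (head, dependent) token-index pairs over the
    same sentence, so the merged graph keeps the shared node set and
    simply accumulates edges; duplicates across parsers collapse.
    """
    merged = set()
    for edges in parses:
        merged |= edges
    return merged

# Three hypothetical parses of the same 5-token sentence:
corenlp  = {(2, 0), (2, 1), (2, 4), (4, 3)}
stanza   = {(2, 0), (2, 1), (4, 3), (0, 4)}
berkeley = {(2, 0), (2, 1), (2, 4), (2, 3)}

ensemble = graph_merge([corenlp, stanza, berkeley])
```

Because the node set is unchanged, any GNN that runs over a single tree runs over the ensemble graph at the same cost.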
Moreover, the resulting graph G is more likely to contain edges from the gold parse, which correctly capture the syntactic relations between words in the sentence, allowing the GNN to be robust to parse errors from any specific parser. Finally, since G contains more edges between words than any single parse when parses differ, which reduces the diameter of the graph, a shallower GNN model is more likely to suffice for learning good representations, thereby avoiding over-parameterization and thus overfitting from stacking more GNN layers.

RGAT over Ensemble Graph
To learn node representations from ensemble graphs, we apply graph attention networks (GAT; Veličković et al., 2017). In one layer of GAT, the hidden representation of each node in the graph is computed by attending over its neighbors with a multi-head self-attention mechanism. The representation for word i at the l-th layer of GAT can be obtained as follows:

h_i^l = ∥_{k=1}^{K} σ( Σ_{j ∈ N_i} α_{ij}^k W^k h_j^{l−1} ),

where K is the number of attention heads, N_i is the neighborhood of node i in the graph, and ∥ denotes the concatenation operation. W^k denotes the learnable weight matrix of head k, σ denotes the ReLU activation function, and α_{ij}^k is the attention score between node i and node j with head k.
Edge Types. To apply GAT to ensemble graphs, we first add reciprocal edges for each edge in the dependency tree, and label them with parent-to-children and children-to-parent types, respectively. This allows our model to retain the original syntactic relations between words in the sentence. We also follow previous work and add a self loop to each node in the graph, which we differentiate from dependency edges by introducing a third edge type.
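A minimal sketch of this edge typing, with integer type ids chosen arbitrarily for illustration:

```python
def add_edge_types(tree_edges, n_tokens):
    """Expand directed dependency edges into typed edges:
    type 0 = parent-to-children, type 1 = children-to-parent,
    type 2 = self loop (one per token)."""
    typed = []
    for head, dep in tree_edges:
        typed.append((head, dep, 0))   # parent-to-children edge
        typed.append((dep, head, 1))   # reciprocal children-to-parent edge
    for i in range(n_tokens):
        typed.append((i, i, 2))        # self loop
    return typed

typed = add_edge_types({(2, 0), (2, 1)}, n_tokens=3)
```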
We adapt Relational GAT (RGAT) to capture this edge type information. Specifically, we encode the edge type information when computing the attention score between two nodes. We assign each edge type an embedding e ∈ R^{d_h}, and incorporate it into the attention score computation as follows:

α_ij = softmax_{j ∈ N_i} ( a^T (W [h_i ; h_j]) + a_e^T e_ij ),

where e_ij is the representation of the type of the edge connecting nodes i and j, [· ; ·] denotes concatenation, and a ∈ R^{d_h}, W ∈ R^{d_h × 2d_h}, and a_e ∈ R^{d_h} are learnable parameters.
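A minimal single-head sketch of this typed attention, assuming a score of the form a · (W[h_i ; h_j]) + a_e · e_ij consistent with the parameter shapes stated above (all values and names below are illustrative, not the paper's learned weights):

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def matvec(W, v):
    return [dot(row, v) for row in W]

def attention_scores(h_i, h_neighbors, e_types, a, W, a_e):
    """Typed attention over node i's neighborhood (single head):
    psi(i, j) = a . (W [h_i ; h_j]) + a_e . e_ij, then a softmax
    over the neighbors j turns the scores into weights alpha_ij."""
    raw = []
    for h_j, e_ij in zip(h_neighbors, e_types):
        # h_i + h_j concatenates the two Python lists: [h_i ; h_j]
        psi = dot(a, matvec(W, h_i + h_j)) + dot(a_e, e_ij)
        raw.append(psi)
    m = max(raw)                              # stabilize the softmax
    exps = [math.exp(r - m) for r in raw]
    z = sum(exps)
    return [e / z for e in exps]

# Toy setup with d_h = 2: one query node, two neighbors, one
# edge-type embedding per edge.
alphas = attention_scores(
    [1.0, 0.0],                        # h_i
    [[0.0, 1.0], [1.0, 1.0]],          # neighbor representations h_j
    [[1.0, 0.0], [0.0, 1.0]],          # edge-type embeddings e_ij
    a=[1.0, 0.0],
    W=[[1, 0, 0, 0], [0, 0, 0, 1]],    # shape d_h x 2*d_h
    a_e=[0.5, 0.5],
)
```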

Sentiment Classification
We extract hidden representations from the nodes that correspond to aspect terms in the last RGAT layer, and conduct average pooling to obtain h_t ∈ R^{d_h}.
Then we feed it into a two-layer MLP to calculate the final classification scores ŷ_s:

ŷ_s = softmax( W_2 σ( W_1 h_t ) ),

where W_2 ∈ R^{C×d_out} and W_1 ∈ R^{d_out×d_h} denote learnable weight matrices, and C is the number of sentiment classes. We optimize the model to minimize the standard cross entropy loss, and apply weight decay to the model parameters.
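A pure-Python sketch of the pooling and classification head (the weights below are toy placeholders; in the model, W_1 and W_2 are learned):

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    z = sum(e)
    return [x / z for x in e]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def classify(term_vecs, W1, W2):
    """Average-pool the aspect-term node vectors into h_t, then apply
    a two-layer MLP: y = softmax(W2 . ReLU(W1 . h_t))."""
    n, dim = len(term_vecs), len(term_vecs[0])
    h_t = [sum(v[d] for v in term_vecs) / n for d in range(dim)]
    return softmax(matvec(W2, relu(matvec(W1, h_t))))

# Toy example: one aspect-term node, 2-dim hidden, C = 3 classes.
probs = classify([[1.0, 2.0]],
                 W1=[[1, 0], [0, 1]],
                 W2=[[1, 0], [0, 1], [0, 0]])
```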

RGAT Input
The initial word node features for RGAT are obtained from a BERT encoder, with positional information from positional embeddings.
BERT Encoder. We use the pre-trained BERT base model as the encoder to obtain word representations. Specifically, we construct the input as "[CLS] + sentence + [SEP] + term + [SEP]" and feed it into BERT. This allows BERT to learn term-centric representations from the sentence during fine-tuning. To feed the resulting wordpiece-based representations into the word-based RGAT model, we average pool the representations of the subwords of each word to obtain X, the raw input to RGAT.
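The subword-to-word pooling can be sketched as follows (the vector values are toy placeholders standing in for BERT outputs):

```python
def average_pool_subwords(wordpiece_vecs, word_spans):
    """Average the wordpiece vectors inside each word's span to get
    word-level RGAT input features. word_spans maps each word to its
    (start, end) wordpiece indices, end exclusive."""
    X = []
    for start, end in word_spans:
        n = end - start
        dim = len(wordpiece_vecs[0])
        X.append([sum(v[d] for v in wordpiece_vecs[start:end]) / n
                  for d in range(dim)])
    return X

# First word spans two wordpieces (e.g., "play" + "##ing"),
# second word keeps its single wordpiece vector.
X = average_pool_subwords([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                          [(0, 2), (2, 3)])
```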
Positional Encoding. Position information is beneficial for this task, especially when there are multiple aspect terms in one sentence, where it helps to locate opinion words relevant to an aspect term. Although the BERT encoder already takes the word position into consideration, it is dampened after layers of Transformers. Therefore, we explicitly encode the absolute position for each word and add it to the BERT output. Specifically, we add a trainable position embedding matrix to X before feeding the resulting representation into RGAT.
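Adding the absolute position signal is then a per-token vector addition (toy values; the position embedding matrix is trainable in the model):

```python
def add_position_embeddings(X, pos_emb):
    """Add an absolute-position embedding to each word vector before
    feeding it into RGAT. pos_emb[i] is the embedding of position i."""
    return [[x + p for x, p in zip(x_i, pos_emb[i])]
            for i, x_i in enumerate(X)]

out = add_position_embeddings([[1.0, 1.0], [2.0, 2.0]],
                              [[0.1, 0.2], [0.3, 0.4]])
```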

Setup
Data & Processing. We evaluate our model on three datasets: Restaurant and Laptop reviews from SemEval 2014 Task 4 (14Rest and 14Lap), and the ACL 14 Twitter dataset (Twitter) (Dong et al., 2014). We remove the examples with "conflict" sentiment polarity labels from the reviews. The statistics of these datasets are listed in Table 1. Following previous work, we report accuracy and macro F1 scores for sentiment classification. For dependency-based approaches, we tokenize sentences with Stanford CoreNLP (Manning et al., 2014), and then parse them with CoreNLP, Stanza (Qi et al., 2020), and the Berkeley neural parser (Kitaev and Klein, 2018). Since the Berkeley parser returns constituency parses, we further convert its output into dependency parses using CoreNLP.
Baselines. We compare our GraphMerge model against published work on these benchmarks, including BERT-SPC, which feeds the sentence and term pair into the BERT model and uses the BERT outputs for predictions, and AEN-BERT (Song et al., 2019), which uses BERT as the encoder and employs several attention layers. We also compare against BERT + dependency tree based models: DGEDT-BERT (Tang et al., 2020) proposes a mutual biaffine module to jointly consider the representations learned from the Transformer and the GNN model over the dependency tree; R-GAT+BERT (Wang et al., 2020b) reshapes and prunes the dependency tree into an aspect-oriented tree rooted at the aspect term, and then employs RGAT to encode the new tree for predictions. For fair comparison, we report the results of our GraphMerge model using the same data split (without a development set).
To understand the behavior of different models, we also implement several baseline models. In our experiments, we randomly sample 5% of the training data as a held-out development set for hyper-parameter tuning, use the remaining 95% for training, and report the mean and standard deviation over five runs with random initialization on the test set. We consider these baselines: 1. BERT-baseline, which feeds the sentence-term pair into the BERT-base encoder and then applies a classifier to the representation of the aspect term token.
2. GAT-baseline with Stanza, which employs a vanilla GAT model over the single dependency tree obtained from Stanza without differentiating edge types; the initial node features are the raw output of the BERT encoder.
3. RGAT over single dependency trees, where we apply RGAT models with parent-to-children and children-to-parent edge types over the dependency trees from the CoreNLP, Stanza, and Berkeley parsers. For a fair comparison with our GraphMerge model, the RGAT input comes from the BERT encoder plus position embeddings.
4. Two ensemble models that take advantage of multiple dependency trees: a Label-Ensemble model, which takes the majority vote of three models each trained on one parser's trees, and a Feature-Ensemble model, which applies three sets of RGAT parameters, one per parse, on top of the BERT encoder, with their output features concatenated. Both models have more parameters and are more computationally expensive than the GraphMerge model when operating on the same parses.
Parameter Setting. We implement our models in PyTorch (Paszke et al., 2019); the GAT implementation is based on the Deep Graph Library (Wang et al., 2019). During training, we set the learning rate to 10^-5 and the batch size to 4. We use the dev data to select the hidden dimension d_h for GAT/RGAT from {64, 128, 256}, the number of heads in the multi-head self-attention from {4, 8}, and the number of GAT/RGAT layers from {2, 3, 4}. The 2-layer GAT/RGAT models turn out to be the best based on the dev set. We apply dropout (Srivastava et al., 2014) and select the best dropout rate from the range [0.1, 0.3]. We set the weight of L2 regularization to 10^-6 and train the model for up to 5 epochs.

Results
We first compare our model to previous work following the evaluation protocol used in prior studies, and report results in Table 2. As we can see, the GraphMerge model achieves the best performance on all three datasets. On the Laptop dataset, the GraphMerge model outperforms the baselines by at least 1.42 in accuracy and 2.34 in Macro-F1. Table 3 shows performance comparisons of the GraphMerge model with other baselines in terms of accuracy and Macro-F1. We observe that: Syntax information benefits aspect-level sentiment classification. All GAT and RGAT models based on dependency trees outperform the BERT-baseline on all three datasets. This demonstrates that leveraging syntactic structure information is beneficial to this task.
Ensemble models benefit from multiple parses.
The Label-Ensemble, Feature-Ensemble, and GraphMerge models achieve better performance than their single-dependency-tree counterparts. This shows that ensemble models benefit from the presence of different parses and are thus less sensitive to parse errors from any single parser.
GraphMerge achieves the best performance overall. Our proposed GraphMerge model not only shows consistent improvements over all single dependency tree models, but also surpasses the other two ensemble models without additional parameters or computational overhead compared to the single-tree models. Note that although the best results on this task are achieved with three trees in GraphMerge, the optimal number of trees for the ensemble may vary with the task and dataset.

Model Analysis
We analyze the proposed GraphMerge model from two perspectives: an ablative analysis of model components and an analysis of the change in the dependency graphs after GraphMerge is applied.

Ablation Study
Model components. We conduct ablation studies of our modeling of edge type and position information in Table 4. We observe that: (1) On all three datasets, ablating the edge types degrades performance, indicating that the syntactic dependency information in the original dependency trees is important. Differentiating edges in the ensemble graph provides more guidance to the model for selecting useful connections among nodes.
(2) Removing the position embeddings hurts performance as well. Although the BERT encoder already incorporates position information at its input, this information is dampened over the layers of Transformers. Emphasizing sequence order again before applying RGAT benefits the task.
Edge Union vs. Edge Intersection. While GraphMerge keeps all edges from the different dependency trees for the RGAT model to learn over, this could also introduce structural noise and adversely impact performance. We therefore compare GraphMerge to edge intersection, which retains only the edges shared by all individual trees when constructing the ensemble graph, and which can be thought of as distilling syntactic information that an ensemble parser is confident about. We observe from the last row in Table 4 that the edge intersection strategy underperforms GraphMerge in average accuracy and Macro-F1. We postulate that this is because edge intersection over-prunes edges in the ensemble graph and can introduce more disjoint connected components where parsers disagree, which the RGAT model cannot easily recover from.
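For contrast with the union, edge intersection can be sketched the same way; on the hypothetical parses below (illustrative token indices, not from the paper) it keeps only the two edges all three parsers agree on:

```python
from functools import reduce

def edge_intersection(parses):
    """Keep only the (head, dependent) edges shared by every parser."""
    return reduce(lambda a, b: a & b, parses)

corenlp  = {(2, 0), (2, 1), (2, 4), (4, 3)}
stanza   = {(2, 0), (2, 1), (4, 3), (0, 4)}
berkeley = {(2, 0), (2, 1), (2, 4), (2, 3)}

shared = edge_intersection([corenlp, stanza, berkeley])
# Tokens 3 and 4 lose all their edges here, illustrating how
# intersection can leave disjoint components where parsers disagree.
```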

Graph Structure Analysis
Effect of GraphMerge on Graph Structure.
To better understand the effect of GraphMerge on dependency graphs, we conduct a statistical analysis on the test sets of 14Lap and 14Rest. Specifically, we are interested in the change in the shortest distance between the aspect term and its opinion words on the dependency graphs. For this analysis, we use the test sets with opinion words labeled by Fan et al. (2019) (see Table 5 for dataset statistics). We summarize the analysis results in Figure 3. We observe that: (1) Compared with single dependency trees, the ensemble graph effectively increases the number of one-hop and two-hop cases, meaning the overall distance between the term and opinion words is shortened on both datasets. (2) A shorter distance between the term and opinion words correlates with better performance. With the ensemble graph, the accuracy on one-hop and two-hop cases beats that of all single dependency tree models. These observations suggest that the ensemble graph from GraphMerge introduces important connectivity that helps alleviate overparameterization from stacking RGAT layers, and that the RGAT model is able to make use of the diversity of edges in the resulting graph to improve classification performance.
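The shortest-distance statistic above can be computed with a breadth-first search over the undirected view of each graph; the toy edges below (hypothetical indices) show how merging parses can shrink the term-to-opinion distance:

```python
from collections import deque

def shortest_distance(edges, src, dst):
    """Hop count between two token indices, treating directed
    dependency edges as undirected; returns None if disconnected."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

single_tree = {(0, 1), (1, 2), (2, 3)}   # term at index 0, opinion word at 3
merged      = single_tree | {(0, 3)}     # a second parse adds a direct edge
```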
Note that although a shorter distance correlates with improved results, distance alone is not sufficient for better performance. Although the BERT model can be seen as a GAT over a fully-connected graph in which every word is reachable from all other context words within one hop (Wang et al., 2020a), the BERT-baseline model performs worse than the dependency-based models. Therefore, encoding the syntactic structure information in dependency trees is crucial for this task. Our GraphMerge model achieves the best results by shortening the graph distance between the aspect term and opinion words with syntactic information.

Figure 4: Examples of partial dependency trees on which the single dependency tree models make wrong predictions, but the GraphMerge model makes correct predictions. Example sentences include "It's been a couple weeks since the purchase and I'm struggle with finding the correct keys (but that was expected)." and "Stick to the items the place does best, brisket, ribs, wings, cajun shrimp is good, not great."
Case Study. To gain more insight into the Graph-Merge model's behaviour, we find several examples and visualize their dependency trees from three parsers (Figure 4). Due to the space limit, we only show partial dependency trees that contain essential aspect terms and opinion words. These examples are selected from cases that all single dependency tree RGAT models predict incorrectly, but the GraphMerge model predicts correctly.
We observe that, in general, the three parsers do not agree in the neighborhood around the aspect term and opinion words in these sentences. As a result, GraphMerge tends to shorten the distance between the aspect term and the opinion words in the resulting graph. For instance, for all examples in Figure 4, the shortest distance between the aspect term and the opinion words is no more than two in the ensemble graphs, while it varies from 2 to 4 in the original parse trees. This allows the RGAT model to capture the relation between the words without an excessive number of layers, thus avoiding overfitting.
On the other hand, we observe that the resulting ensemble graph from GraphMerge is more likely to contain the gold parse for the words in question. For instance, in the first two examples, the gold parse for the words visualized in the figure can be found in the ensemble graph (despite no individual parser predicting it in the first example); the third example also has a higher recall of gold parse edges than any single parser despite being difficult to parse. This provides the RGAT model with the correct semantic relationships between these words in more examples during training and evaluation, which are often not accessible with single parse trees.
Aspect Robustness. To study the aspect robustness of the GraphMerge model, we test our model on the Aspect Robustness Test Set (ARTS) datasets proposed by Xing et al. (2020) (see Table 6 for statistics). The datasets enrich the original 14Lap and 14Rest datasets following three strategies: reversing the sentiment of the aspect term; reversing the sentiment of non-target terms that originally have the same sentiment as the target term; and generating more non-target aspect terms with sentiment polarities opposite to the target one. The authors propose a novel metric, the Aspect Robustness Score (ARS), which counts the correct classification of a source example and all of its variations generated by the three strategies as one unit of correctness. We compare three single dependency tree models with the GraphMerge model in Table 7. We directly evaluate the models trained on the original SemEval datasets on ARTS without further tuning. The results indicate that the GraphMerge model shows better aspect robustness than the single dependency tree and BERT models.

Table 7: Comparison of the GraphMerge model to the single dependency tree based models and the BERT model in terms of the Aspect Robustness Score (ARS) on the ARTS dataset (Xing et al., 2020).

Conclusion
We propose a simple yet effective graph ensemble technique, GraphMerge, to combine multiple dependency trees for aspect-level sentiment analysis. By taking the union of edges from different parsers, GraphMerge allows graph neural models to be robust to parse errors without additional parameters or computational cost. With different edge types to capture the original syntactic dependencies in parse trees, our model outperforms previous state-of-the-art models, single-parse models, as well as traditional ensemble models on three aspect-level sentiment classification benchmark datasets.