Does syntax matter? A strong baseline for Aspect-based Sentiment Analysis with RoBERTa

Aspect-based Sentiment Analysis (ABSA), which aims at predicting the polarities of aspects, is a fine-grained task in the field of sentiment analysis. Previous work showed that syntactic information, e.g. dependency trees, can effectively improve ABSA performance. Recently, pre-trained models (PTMs) have also shown their effectiveness on ABSA. Therefore, the question naturally arises whether PTMs contain sufficient syntactic information for ABSA, so that a good ABSA model can be obtained based on PTMs alone. In this paper, we first compare the trees induced from PTMs with dependency parsing trees on several popular models for the ABSA task, showing that the induced tree from fine-tuned RoBERTa (FT-RoBERTa) outperforms the parser-provided tree. Further analysis experiments reveal that the FT-RoBERTa Induced Tree is more sentiment-word-oriented and could benefit the ABSA task. The experiments also show that a pure RoBERTa-based model can outperform or approach the previous SOTA performance on six datasets across four languages, since it implicitly incorporates task-oriented syntactic information.


Introduction
Aspect-based sentiment analysis (ABSA) aims at fine-grained sentiment analysis towards aspects (Pontiki et al., 2014, 2016). Specifically, for one or more aspects in a sentence, the task calls for detecting the sentiment polarity of each aspect. Take the sentence "great food but the service was dreadful" as an example: the task is to predict the sentiments towards the aspects food and service, where the expected polarity is positive for food and negative for service. Generally, ABSA comprises aspect extraction (AE) and aspect-level sentiment classification (ALSC). We focus only on the ALSC task.
Early works on ALSC mainly rely on manually designed syntactic features, which is labor-intensive yet insufficient. In order to avoid designing hand-crafted features (Jiang et al., 2011; Kiritchenko et al., 2014), various neural network models have been proposed for ALSC (Dong et al., 2014; Vo and Zhang, 2015; Wang et al., 2016; Chen et al., 2017; He et al., 2018; Zhang et al., 2019b). Since the dependency tree can help the aspects find their contextual words, most of the recently proposed state-of-the-art (SOTA) ALSC models utilize the dependency tree to assist in modeling connections between aspects and their opinion words (Sun et al., 2019b; Zhang et al., 2019b). Generally, these dependency tree based ALSC models follow one of three methods. The first is to use the topological structure of the dependency tree (Dong et al., 2014; Zhang et al., 2019a; Huang and Carley, 2019; Sun et al., 2019b; Zheng et al., 2020; Tang et al., 2020). The second is to use the tree-based distance, which counts the number of edges on the shortest path between two tokens in the dependency tree (He et al., 2018; Zhang et al., 2019b; Phan and Ogunbona, 2020). The third is to use both the topological structure and the tree-based distance simultaneously.
Apart from the dependency tree, pre-trained models (PTMs) (Qiu et al., 2020), such as BERT (Devlin et al., 2019), have also been used to enhance the performance of the ALSC task (Sun et al., 2019a; Tang et al., 2020; Phan and Ogunbona, 2020). From the view of the interpretability of PTMs, Hewitt and Manning (2019) and others use probing methods to detect syntactic information in PTMs. Empirical results reveal that PTMs implicitly capture some kind of dependency tree structure. Therefore, the following two questions arise naturally.
Q1: Will the tree induced from PTMs achieve better performance than the tree given by a dependency parser when combined with different tree-based ALSC models? To answer this question, we choose one model from each of the three typical dependency tree based methods in ALSC, and compare their performance when combined with the parser-provided dependency tree and the off-the-shelf PTMs induced trees.
Q2: Will PTMs adapt the implicitly entailed tree structure to the ALSC task during fine-tuning? To answer this question, we not only use the trees induced from the off-the-shelf PTMs to enhance ALSC models, but also use the trees induced from the fine-tuned PTMs (FT-PTMs for short), which are fine-tuned on the ALSC datasets. Experiments show that trees induced from FT-PTMs help tree-based ALSC models achieve better performance than their counterparts before fine-tuning. Besides, models with trees induced from the ALSC fine-tuned RoBERTa can even outperform models with trees from the dependency parser.
Last but not least, we find that the base RoBERTa with an MLP layer is enough to achieve state-of-the-art (SOTA) or near-SOTA performance on all six ALSC datasets across four languages, while incorporating tree structures into RoBERTa-based ALSC models does not bring clear improvement.
Therefore, our contributions can be summarized as: (1) We extensively study the induced trees from PTMs and FT-PTMs. Experiments show that models using induced trees from FT-PTMs achieve better performance. Moreover, models using induced trees from fine-tuned RoBERTa outperform other trees.
(2) The analysis of the induced tree from FT-PTMs shows that it tends to be more sentiment-word-oriented, making the aspect term connect directly to its sentiment adjectives.
(3) We achieve SOTA or near-SOTA performance on six ALSC datasets across four languages based on RoBERTa. We find that RoBERTa adapts better to ALSC and helps the aspects find the sentiment words.
Related Work

ALSC without Dependencies
Vo and Zhang (2015) propose an early neural network model that does not rely on the dependency tree. Along this line, diverse neural network models have been proposed. Tang et al. (2016a) use the long short-term memory (LSTM) network to enhance the interactions between aspects and context words. In order to model the relations between aspects and their contextual words, Wang et al. (2016); Liu and Zhang (2017); Ma et al. (2017); Tay et al. (2018) incorporate the attention mechanism into LSTM-based neural network models. Other model structures such as the convolutional neural network (CNN) (Xue and Li, 2018), gated neural network (Zhang et al., 2016; Xue and Li, 2018), memory neural network (Tang et al., 2016b; Chen et al., 2017; Wang et al., 2018), and attention neural network (Tang et al., 2019) have also been applied to ALSC.

ALSC with Dependencies
Early works on ALSC mainly employ traditional text classification methods focusing on machine learning algorithms and manually designed features, which took syntactic structures into consideration from the very beginning. Kiritchenko et al. (2014) combine a set of features including sentiment lexicons and parsing dependencies, and their experiments show the effectiveness of context parsing features.
A myriad of works attempt to fuse the dependency tree into neural network models for ALSC. Dong et al. (2014) propose to first convert the dependency tree into a binary tree, then apply an adaptive recursive neural network to propagate information from the context words to the aspects. Despite the improvement in aspect-oriented feature modeling, converting the dependency tree into a binary tree might separate syntactically related words from each other. In general, owing to syntax parsing errors, early dependency tree based ALSC models do not show a clear advantage over models without the dependency tree.
However, the introduction of neural networks into dependency parsing has substantially enhanced parsing quality (Chen and Manning, 2014; Dozat and Manning, 2017). Recent advances that leverage graph neural networks (GNNs) to model the dependency tree (Zhang et al., 2019a; Huang and Carley, 2019; Sun et al., 2019b; Tang et al., 2020) have achieved significant performance. Among them, Zheng et al. (2020) and others attempt to convert the dependency tree into an aspect-oriented dependency tree. Instead of using the topological structure of the dependency tree, He et al. (2018); Zhang et al. (2019b); Phan and Ogunbona (2020) exploit the tree-based distance between two tokens in the dependency tree.

PTMs-based Dependency Probing
Over the past few years, pre-trained models (PTMs) have dominated various NLP tasks. Therefore, many researchers have investigated what linguistic knowledge is captured by PTMs (Clark et al., 2019; Hewitt and Liang, 2019; Hewitt and Manning, 2019). Clark et al. (2019) use a single attention head map of BERT, or a combination of them, to infer the dependencies. Since BERT has many attention heads, this method can hardly reveal the full dependency between two tokens. Hewitt and Manning (2019) propose a small learnable probing model to probe the syntactic dependencies encoded in BERT. Although very few parameters are added, it may still be hard to tell whether the syntactic information is encoded by BERT itself or by the additional parameters of the probing model. Therefore, a parameter-free dependency probing method may be preferable.

Method
In this section, we first introduce how to induce trees from PTMs, then we describe three tree-based ALSC models, which are selected from three representative methods of incorporating the dependency tree in ALSC task.

Inducing Tree Structure from PTMs
Perturbed Masking can induce trees from pre-trained models without additional parameters. In general, a broad range of PTMs can be used with the Perturbed Masking method. To be both representative and practical, we select BERT and RoBERTa as our base models.
In this subsection, we first briefly introduce the model structure of BERT and RoBERTa, then present the basic idea of the Perturbed Masking method. More details about them can be found in their respective reference papers.

BERT and RoBERTa
BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) both take Transformers (Vaswani et al., 2017) as the backbone architecture. Generally, they can be formulated as the following equations:

$$\hat{h}^{l} = \mathrm{LN}\left(h^{l-1} + \mathrm{MHAtt}\left(h^{l-1}\right)\right),$$
$$h^{l} = \mathrm{LN}\left(\hat{h}^{l} + \mathrm{FFN}\left(\hat{h}^{l}\right)\right),$$

where $h^{0}$ is the BERT/RoBERTa input representation, formed by the sum of token embeddings, position embeddings, and segment embeddings; LN is the layer normalization layer; MHAtt is the multi-head self-attention; FFN contains three layers: a linear projection layer, an activation layer, and another linear projection layer; and $l$ is the depth of the Transformer layers. The base and large versions of BERT and RoBERTa have 12 and 24 Transformer layers, respectively. BERT is pre-trained on the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) tasks. In the MLM task, 15% of the tokens in a sentence are manipulated in three ways. Specifically, 10%, 10%, and 80% of them are replaced by a random token, kept unchanged, or replaced by a "[MASK]" token, respectively. In the NSP task, two sentences A and B are concatenated before being sent to BERT. B is the actual next sentence of A 50% of the time, and BERT needs to use the vector representation of "[CLS]" to determine whether the input is continuous or not. RoBERTa is pre-trained only on the MLM task.
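The MLM corruption rule described above can be illustrated with a minimal sketch. The function name `mlm_corrupt`, its parameters, and the per-token sampling of the 15% selection are assumptions made for illustration; actual pre-training pipelines may select positions differently.

```python
import random

def mlm_corrupt(tokens, vocab, mask_token="[MASK]", select_prob=0.15, rng=None):
    """Return (corrupted_tokens, target_positions) using the 80/10/10 rule.

    Assumption: each token is selected independently with probability
    select_prob, which only approximates selecting exactly 15% of tokens.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    targets = []
    for i in range(len(tokens)):
        if rng.random() >= select_prob:
            continue  # token is not selected for prediction
        targets.append(i)
        r = rng.random()
        if r < 0.8:
            corrupted[i] = mask_token         # 80%: replace with the mask token
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, targets
```

The model is then trained to recover the original tokens at the returned target positions.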

Perturbed Masking
Perturbed Masking aims to detect syntactic information from pre-trained models. For a sentence $x = [x_1, \ldots, x_T]$, BERT and RoBERTa map each $x_i$ into a contextualized representation $H_\theta(x)_i$. Perturbed Masking derives a value $f(x_i, x_j)$ that denotes the impact a token $x_j$ has on another token $x_i$. To derive this value, it first replaces the token $x_i$ with "[MASK]" (or "<mask>" in RoBERTa), which returns a representation $H_\theta(x \setminus \{x_i\})_i$ for the masked $x_i$; secondly, it further masks the token $x_j$, which returns a representation $H_\theta(x \setminus \{x_i, x_j\})_i$ for $x_i$ with both tokens masked. The impact value is the distance between these two representations:

$$f(x_i, x_j) = \left\lVert H_\theta(x \setminus \{x_i\})_i - H_\theta(x \setminus \{x_i, x_j\})_i \right\rVert_2 .$$

By repeating this process for every pair of tokens in the sentence, we get an impact matrix $M \in \mathbb{R}^{T \times T}$ with $M_{i,j} = f(x_i, x_j)$. A tree decoding algorithm, such as the Eisner algorithm (Eisner, 1996) or the Chu-Liu/Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967), is then used to extract the dependency tree from the matrix $M$. Perturbed Masking can be applied to any layer of BERT or RoBERTa.
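The two-step masking procedure can be sketched as follows. This is an illustrative sketch only: `encode` stands for any function that maps a token list to one vector per position, `toy_encode` is a hypothetical stand-in for a real PTM, and the Euclidean distance is an assumed choice of distance between the two representations.

```python
import math

def impact_matrix(tokens, encode, mask_token="[MASK]"):
    """M[i][j]: change in the representation at position i (with x_i masked)
    caused by additionally masking x_j."""
    n = len(tokens)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        masked_i = list(tokens)
        masked_i[i] = mask_token
        h_i = encode(masked_i)[i]           # representation with x_i masked
        for j in range(n):
            if i == j:
                continue
            masked_ij = list(masked_i)
            masked_ij[j] = mask_token
            h_ij = encode(masked_ij)[i]     # representation with x_i and x_j masked
            M[i][j] = math.dist(h_i, h_ij)  # assumed Euclidean distance
    return M

def toy_encode(tokens):
    """Hypothetical encoder: each position sees itself and its two neighbours."""
    def val(tok):
        return 0.0 if tok == "[MASK]" else float(len(tok))
    out = []
    for k in range(len(tokens)):
        left = val(tokens[k - 1]) if k > 0 else 0.0
        right = val(tokens[k + 1]) if k + 1 < len(tokens) else 0.0
        out.append([val(tokens[k]), left + right])
    return out

M = impact_matrix(["great", "food", "but"], toy_encode)
```

With a real PTM, `encode` would return the hidden states of the chosen layer, and the resulting matrix would be passed to a tree decoding algorithm.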

ALSC Models Based on Trees
In this subsection, we introduce three representative tree-based ALSC models. Each model comes from one of the three methods mentioned in the Introduction (Section 1). For a fair comparison, all the selected models are among the most recent tree-based ALSC models. We briefly introduce these three models as follows.

Aspect-specific Graph Convolutional Networks (ASGCN)
The Aspect-specific Graph Convolutional Network (ASGCN) is proposed by Zhang et al. (2019a). It treats the dependency tree as a graph, where each word is viewed as a node and the dependencies between words are viewed as edges. After converting the dependency tree into a graph, ASGCN applies a Graph Convolutional Network (GCN) to this graph to model the dependencies between words.
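To make the graph view concrete, the following sketch runs one graph-convolution step over a dependency tree. It is a simplified illustration, not the ASGCN layer itself: the mean normalization with self-loops, the ReLU activation, and the names `gcn_layer`, `heads`, `features`, and `weight` are all assumptions.

```python
def gcn_layer(heads, features, weight):
    """One propagation step: average each node with its tree neighbours,
    apply a linear map, then ReLU. heads[i] is the head index of token i
    (-1 for the root)."""
    n = len(heads)
    # Undirected adjacency matrix with self-loops.
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for child, head in enumerate(heads):
        if head >= 0:
            adj[child][head] = adj[head][child] = 1.0
    dim_in, dim_out = len(weight), len(weight[0])
    out = []
    for i in range(n):
        degree = sum(adj[i])
        agg = [sum(adj[i][j] * features[j][k] for j in range(n)) / degree
               for k in range(dim_in)]
        out.append([max(0.0, sum(agg[k] * weight[k][m] for k in range(dim_in)))
                    for m in range(dim_out)])
    return out

# Star-shaped tree: tokens 0 and 2 both depend on token 1.
hidden = gcn_layer([1, -1, 1], [[1.0], [2.0], [3.0]], [[1.0]])
```

Stacking several such steps lets information travel along longer dependency paths.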

Proximity-Weighted Convolution Network (PWCN)
The Proximity-Weighted Convolution Network (PWCN) model is proposed by Zhang et al. (2019b). It aims to help the aspect find its contextual words. For an input sentence, PWCN first obtains its dependency tree and, based on this tree, assigns a proximity value to each word in the sentence. The proximity value of each word is calculated from the shortest path in the dependency tree between this word and the aspect.
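The tree-based distance behind the proximity value can be sketched with a breadth-first search. The parse below is hypothetical, and the weighting `1 - d / n` is an illustrative assumption rather than a statement of the exact PWCN formula.

```python
from collections import deque

def tree_distances(heads, source):
    """Number of edges on the path from `source` to every token.
    heads[i] is the head index of token i (-1 for the root)."""
    n = len(heads)
    adj = [[] for _ in range(n)]
    for child, head in enumerate(heads):
        if head >= 0:
            adj[child].append(head)
            adj[head].append(child)
    dist = [None] * n
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] is None:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def proximity_weights(heads, aspect_index):
    """Assumed weighting: closer words in the tree get larger weights."""
    n = len(heads)
    return [1.0 - d / n for d in tree_distances(heads, aspect_index)]

# Hypothetical parse of "great food but the service was dreadful".
heads = [1, -1, 6, 4, 6, 6, 1]
distances = tree_distances(heads, 4)      # aspect: "service"
weights = proximity_weights(heads, 4)
```

In this toy parse, "dreadful" is one edge away from "service" while "great" is three edges away, so "dreadful" receives the larger weight.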

Relational Graph Attention Network (RGAT)
The Relational Graph Attention Network (RGAT) model transforms the dependency tree into an aspect-oriented dependency tree. The aspect-oriented dependency tree uses the aspect as the root node, and all other words depend on the aspect directly. The relation between the aspect and another word is based on either the syntactic tag or the tree-based distance in the dependency tree. Specifically, RGAT keeps the syntactic tags for words at a tree-based distance of 1 from the aspect and assigns virtual tags to words at longer distances, such as "2:con" for a connection with a tree-based distance of 2. Therefore, the RGAT model exploits not only the topological structure of the dependency tree but also the tree-based distance between two words.

Table 1: Data statistics.

Dataset    Split   Positive   Negative   Neutral
Rest14     Train   2164       807        637
Rest14     Test    728        196        196
Laptop14   Train   994        870        464
Laptop14   Test    341        128        169
Twitter    Train   1561       1560       3127
Twitter    Test    173        173        346
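The aspect-oriented reshaping described for RGAT can be sketched as below. The function name, the single-token aspect, and the example parse with its labels are assumptions for illustration; only the "n:con" virtual tag format comes from the text.

```python
from collections import deque

def aspect_oriented_relations(heads, labels, aspect):
    """Map every non-aspect token to its relation with the aspect.
    Tokens one edge away keep the syntactic tag of that edge; tokens
    further away get a virtual tag such as "2:con"."""
    n = len(heads)
    adj = [[] for _ in range(n)]
    for child, head in enumerate(heads):
        if head >= 0:
            adj[child].append((head, labels[child]))
            adj[head].append((child, labels[child]))
    dist = {aspect: 0}
    relations = {}
    queue = deque([aspect])
    while queue:
        u = queue.popleft()
        for v, label in adj[u]:
            if v in dist:
                continue
            dist[v] = dist[u] + 1
            relations[v] = label if dist[v] == 1 else "%d:con" % dist[v]
            queue.append(v)
    return relations

# Hypothetical parse of "great food but the service was dreadful".
tokens = ["great", "food", "but", "the", "service", "was", "dreadful"]
heads = [1, -1, 6, 4, 6, 6, 1]
labels = ["amod", "root", "cc", "det", "nsubj", "cop", "conj"]
relations = aspect_oriented_relations(heads, labels, tokens.index("service"))
```

For an induced tree without syntactic tags, every entry of `labels` would be the same placeholder, which matches the uniform virtual tags used in the experiments.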

Experimental Setup
In this section, we present details about the datasets, the tree structures used in the experiments, and the implementation of the experiments. We conduct experiments on all six datasets across four languages. Due to limited space, we present our experiments on the non-English datasets in the Appendix.

Datasets
We run experiments on six benchmark datasets. Three of them, namely Rest14, Laptop14, and Twitter, are English datasets. Rest14 and Laptop14 are from SemEval 2014 task 4 (Pontiki et al., 2014) and contain reviews from the restaurant and laptop domains. Twitter is from Dong et al. (2014) and is built from tweets. The statistics of these datasets are presented in Table 1. Details of the other three non-English datasets can be found in the Appendix. Following previous works, we remove samples with conflicting polarities or with "NULL" aspects in all datasets.
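The filtering step can be sketched as follows. The record format (dictionaries with `aspect` and `polarity` fields) and the representation of conflicting samples by a polarity label outside the three classes are assumptions for illustration.

```python
VALID_POLARITIES = {"positive", "negative", "neutral"}

def filter_samples(samples):
    """Keep samples with a concrete aspect term and a non-conflicting polarity."""
    kept = []
    for sample in samples:
        if sample["aspect"] == "NULL":
            continue  # no explicit aspect term
        if sample["polarity"] not in VALID_POLARITIES:
            continue  # e.g. a "conflict" label
        kept.append(sample)
    return kept
```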

Tree Structures
For each dataset, we obtain five kinds of trees from three sources.
(1) The first one is derived from an off-the-shelf dependency parser, such as spaCy or AllenNLP, denoted as "Dep.". For the three English datasets, we use the biaffine parser from the AllenNLP package to get the dependency tree, since the biaffine parser has been reported to achieve better performance.
(2) We induce trees from the pre-trained BERT and RoBERTa using the Perturbed Masking method, denoted as "BERT Induced Tree" and "RoBERTa Induced Tree", respectively.
(3) We use the Perturbed Masking method to induce trees from BERT and RoBERTa after fine-tuning on the corresponding datasets. These two are denoted as "FT-BERT Induced Tree" and "FT-RoBERTa Induced Tree".
Besides, we add "Left-chain" and "Right-chain" in our experiments. "Left-chain", "Right-chain" mean that every word deems its previous or next word as the dependent child word.
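The two chain baselines can be written as head arrays. The reading below is an assumption about the direction of the arcs: in a left-chain every word takes its previous word as its child, and in a right-chain every word takes its next word as its child; `-1` marks the root.

```python
def left_chain(n):
    """Each word is the head of its previous word; the last word is the root."""
    heads = [i + 1 for i in range(n)]
    if n:
        heads[-1] = -1
    return heads

def right_chain(n):
    """Each word is the head of its next word; the first word is the root."""
    return [i - 1 for i in range(n)]
```

Both structures contain only neighboring connections, which makes them a natural reference point for induced trees that mostly link adjacent words.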

Implementation Details
In order to derive the FT-PTMs Induced Tree, we fine-tune BERT and RoBERTa on the ALSC datasets. To introduce as few parameters as possible, a rather simple MLP is used; the overall structure of our fine-tuning model is presented in Figure 1. The fine-tuning experiments use batch size b = 32, dropout rate d = 0.1, and learning rate µ = 2e-4 with the AdamW optimizer under its default settings.
As for the Perturbed Masking method, we apply the Chu-Liu/Edmonds algorithm for tree decoding. For the induced trees, we first induce trees from each layer of the PTMs, then test them with the model in Figure 1 on a dev set that consists of 20% of the training set. Experiments show that the trees induced from the 11th layer of the PTMs achieve the best performance among all layers, so this layer is used in all our experiments.
We conduct multiple experiments incorporating different trees (Section 4.2) into the aforementioned tree-based models (Section 3.2). Specifically, we use the 300-dimensional GloVe (Pennington et al., 2014) embeddings for English datasets. We keep the word embeddings fixed to avoid overfitting. It is worth noting that in experiments with the RGAT model, since the induced tree does not provide syntactic tags, we assign virtual tags to every dependency in a uniform way, which slightly damages the performance of the model.

ALSC Performance with Different Trees
The comparison between models with different trees is presented in Table 2, which contains the experimental results on the English datasets. The results on the non-English datasets can be found in the Appendix.
We observe that among all the trees, incorporating the FT-RoBERTa Induced Tree leads to the best results on all datasets. On average, models based on the FT-RoBERTa Induced Tree outperform "Dep." by about 1.1% in accuracy. This demonstrates the effectiveness and advantage of the FT-RoBERTa Induced Tree in this competitive comparison.
Models using the BERT Induced Tree and the RoBERTa Induced Tree in Table 2 show small performance differences in all but one dataset, and both are close to the "Left-chain" and "Right-chain" baselines. To get a better sense of this, we visualize trees induced from RoBERTa in Figure 2b. It shows that the RoBERTa Induced Tree has a strong pattern of neighboring connections. This behavior is expected, since the masked language modeling pre-training task makes words depend more on their neighboring words. This tendency may be the reason why PTMs induced trees perform similarly to the "Left-chain" and "Right-chain" baselines.
To answer question Q1 from the Introduction (Section 1), we compare the results of "Dep.", BERT Induced Tree, and RoBERTa Induced Tree. The results show that models with dependency trees usually achieve better performance than those with PTMs induced trees. This is to be expected, since words in PTMs induced trees tend to depend on the words immediately to their left or right, as shown in Figure 2. It is worth noting that this observation does not fully align with previously reported results: earlier experiments based on PWCN show that the BERT Induced Tree achieves results comparable to "Dep.", which is consistent with our PWCN results. However, this does not hold when the induced trees are used in a broader range of tree-based ALSC models, especially for the RGAT model at the bottom of Table 2. A more detailed analysis is provided in the next section.
Although models with the PTMs induced trees usually perform worse than those with the dependency parsing trees, models with trees induced from ALSC fine-tuned RoBERTa can surpass both of them. Take RoBERTa Induced Tree and FT-RoBERTa Induced Tree in Table 2

Table 3: Proportion of neighboring connections of different trees in all datasets. We use the short names of induced trees here as well as in Table 4 and Table 5.

Analysis
To further investigate the reasons for the differences between trees, we propose a set of quantitative metrics, presented in Table 3 and Table 4. The Proportion of Neighboring Connections is the proportion of neighboring connections in a sentence, shown in Table 3. A neighboring connection links a word to its left or right neighbor. From Table 3, we observe that on average over 70% of the relations in the BERT/RoBERTa Induced Tree are neighboring connections. This damages the performance of models that use the topological structure of trees. Thus, PTMs induced trees usually perform worse than "Dep.", with a slight improvement over left/right-chains.

(c) The FT-RoBERTa Induced Tree
Figure 2: Visualization of different trees. The colored box refers to the aspect terms. Since ROOT has no directional relation arcs, we omit the ROOT notation here. For the same two sentences, trees from the dependency parser, RoBERTa, and fine-tuned RoBERTa are displayed. As Figure 2b shows, trees induced from RoBERTa tend to have more neighboring connections. As the bottom two figures show, trees induced from fine-tuned RoBERTa tend to have connections between sentiment words and other words.
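The Proportion of Neighboring Connections can be computed directly from head arrays, as in the sketch below. Pooling the edges of all sentences before dividing is an assumption; the text does not state whether the proportion is averaged per sentence or over all edges.

```python
def neighboring_proportion(trees):
    """Fraction of edges that link a word to its immediate left or right
    neighbour. Each tree is a head array with -1 marking the root."""
    neighboring = 0
    total = 0
    for heads in trees:
        for child, head in enumerate(heads):
            if head < 0:
                continue  # the root has no incoming edge
            total += 1
            if abs(child - head) == 1:
                neighboring += 1
    return neighboring / total if total else 0.0
```

A pure chain yields a proportion of 1.0, so values close to 1.0 indicate that a tree carries little structure beyond word order.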
In comparison with the RoBERTa Induced Tree, the FT-RoBERTa Induced Tree shows a significant decline in this proportion in Table 3. We see the same tendency between the BERT Induced Tree and the FT-BERT Induced Tree. This marks a consistent structural change during fine-tuning, indicating a transition to a more diverse structure. As shown in Figure 2b, words in the RoBERTa Induced Tree clearly tend to depend on their neighboring words, whereas the FT-RoBERTa Induced Tree in Figure 2c shows a more diverse dependency pattern.
Aspects-sentiment Distance is the average distance between aspect and sentiment words. We pre-define a sentiment word set $C$. For a sentence $S_i$ in a dataset $S$, the set of aspect words in $S_i$ is denoted $w$, and $S_i \cap C$ is the set of sentiment words appearing both in the sentence $S_i$ and in the sentiment word set $C$. The Aspects-sentiment Distance (AsD) is calculated as follows:

$$\mathrm{AsD}(S_i) = \frac{\sum_{x_j \in w} \sum_{x_k \in S_i \cap C} \mathrm{dist}(x_j, x_k)}{|w| \, |S_i \cap C|},$$

where $|\cdot|$ is the number of elements in the set and $\mathrm{dist}(x_j, x_k)$ represents the relative distance between $x_j$ and $x_k$ in the tree. Specifically, $C$ contains sentiment words counted on Amazon-2 from Tian et al. (2020), which can be found in the Appendix. For Rest14 and Laptop14, paired annotations that link each aspect to its corresponding sentiment words are available, so we also calculate the paired Aspects-sentiment Distance (pAsD) on these two datasets, which only counts the distance between an aspect and its corresponding sentiment words.

We present the Aspects-sentiment Distance (AsD) of different trees on the English datasets in Table 4. The results show that FT-RoBERTa has the smallest AsD value, indicating the shortest aspects-sentiment distance. Compared to PTMs induced trees, the trees from FT-PTMs have a smaller AsD, indicating a shortened aspects-sentiment distance. This shows that the FT-PTMs induced trees are more sentiment-word-oriented, which partially reveals that fine-tuning on ALSC encourages the aspects to find sentiment words. However, for "Dep.", we notice that some Twitter results in Table 2 cannot be fully explained by these two proposed metrics. We conjecture that the casual grammar of the Twitter corpus makes it hard for the parser to provide an accurate dependency tree. Still, these two metrics are suitable for the induced trees.

Taken together, as the conclusion to Q2, these analyses demonstrate that fine-tuning on ALSC implicitly adapts the induced tree. On the one hand, the smaller proportion of neighboring connections after fine-tuning indicates an increase in long-range connections. On the other hand, the smaller Aspects-sentiment Distance after fine-tuning reflects a shorter distance between aspects and sentiment words, which helps to model the connections between them. Thus, as shown in Section 5.1, fine-tuning RoBERTa on ALSC not only makes the induced tree better suited to the ALSC task but also lets it outperform the dependency tree when combined with different tree-based ALSC models.

Table 5: The results (%) of SOTA ALSC models on English datasets. The results with "†" are retrieved from Sun et al. (2019b), and those with " " are retrieved from the original papers. Those without additional symbols are our own. We highlight the best results in bold.
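The Aspects-sentiment Distance for a single sentence can be sketched as follows. The helper names, the lowercase matching against the sentiment word set, the return value of `None` for sentences without aspects or sentiment words, and the example parse are assumptions for illustration.

```python
from collections import deque

def distances_from(heads, source):
    """Edge counts from `source` to every token in the tree (-1 marks the root)."""
    adj = [[] for _ in heads]
    for child, head in enumerate(heads):
        if head >= 0:
            adj[child].append(head)
            adj[head].append(child)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def aspect_sentiment_distance(tokens, heads, aspect_indices, sentiment_words):
    """Average tree distance between aspect tokens and sentiment tokens."""
    sentiment_indices = [i for i, tok in enumerate(tokens)
                         if tok.lower() in sentiment_words]
    if not aspect_indices or not sentiment_indices:
        return None  # AsD is undefined for this sentence
    total = 0
    for a in aspect_indices:
        dist = distances_from(heads, a)
        total += sum(dist[s] for s in sentiment_indices)
    return total / (len(aspect_indices) * len(sentiment_indices))

# Hypothetical parse of "great food but the service was dreadful".
tokens = ["great", "food", "but", "the", "service", "was", "dreadful"]
heads = [1, -1, 6, 4, 6, 6, 1]
asd_service = aspect_sentiment_distance(tokens, heads, [4], {"great"})
asd_food = aspect_sentiment_distance(tokens, heads, [1], {"great"})
```

A tree that attaches sentiment words closer to the aspect yields a smaller value, which is the sense in which a tree is called sentiment-word-oriented.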

Comparison between ALSC models
Additionally, we explore how well the fine-tuned RoBERTa model performs on the ALSC task. We select a set of top-performing ALSC models as state-of-the-art alternatives. The comparison results are shown in Table 5.
Surprisingly, compared with all these SOTA alternatives, RoBERTa with an MLP layer achieves SOTA or near-SOTA performance. In particular, compared to the other datasets, a significant improvement is obtained on the Laptop14 dataset. We assume that the pre-training corpus of RoBERTa may be more friendly to the laptop domain, since RoBERTa-MLP already obtains much better results than BERT-MLP on Laptop14. For the BERT-based models in the second row of Table 5, similar experiments using RoBERTa are conducted. However, only limited improvements are obtained over RoBERTa-MLP. We expect that induced trees from models specifically pre-trained for ALSC (Tian et al., 2020) may provide more information, which is left for future work.
The FT-RoBERTa Induced Tree can benefit GloVe-based ALSC models. However, incorporating trees on top of RoBERTa brings no significant improvement, and performance even declines in some cases. This may be caused by a failure to reconcile the implicitly entailed tree with the external tree. We argue that, with the currently widely used tree-based methods, incorporating trees on top of RoBERTa may cost more than it gains. Additionally, in reviewing previous ALSC works, we notice that very few employ RoBERTa as the base model. We attribute this to the difficulty of optimizing RoBERTa-based ALSC models, since the higher-level architecture, which is usually randomly initialized, needs a larger learning rate than RoBERTa itself. Inappropriate hyperparameters may be the reason for the lagging performance of previous RoBERTa-based ALSC works (Phan and Ogunbona, 2020).

Conclusion
In this paper, we analyze several tree structures for the ALSC task, including the parser-provided dependency tree and PTMs-induced trees. Specifically, we induce trees using the Perturbed Masking method from the original PTMs and from ALSC fine-tuned PTMs, and then compare the different tree structures on three typical tree-based ALSC models on six datasets across four languages. Experiments reveal that fine-tuning on the ALSC task forces PTMs to implicitly learn more sentiment-word-oriented trees, which can benefit GloVe-based ALSC models. Benefiting from its better implicit syntactic information, the fine-tuned RoBERTa with an MLP is enough to obtain SOTA or near-SOTA results on the ALSC task. Our work points to several promising directions, such as tree-based models that are suitable for PTMs and better methods for inducing trees from PTMs.

A Experiments on non-English Datasets
In this section, we provide details about our experiments on non-English datasets.

A.1 Datasets
We conduct experiments on three non-English datasets, which are named Dutch, French, and Spanish, respectively. All of them are restaurant review datasets from SemEval-2016 task 5 (Pontiki et al., 2016), whose languages are the same as dataset names. Detailed data statistics can be found in Table 6. Following previous works, we remove samples with conflicting polarities or with "NULL" aspect terms in all datasets.

A.2 Tree Structures
We obtain five kinds of trees for every dataset. The first one uses an off-the-shelf dependency parser to get parser-provided dependency trees, denoted as "Dep.". Specifically, we use the spaCy parser for the non-English datasets. The second method induces trees from the pre-trained mBERT and XLM-R (Conneau et al., 2020) base models using the Perturbed Masking method, denoted as "BERT Induced Tree" and "RoBERTa Induced Tree", respectively. The third method uses the same procedure to induce trees from mBERT and XLM-R after fine-tuning on the corresponding datasets with the same model structure as for the English datasets. These two are denoted as "FT-BERT Induced Tree" and "FT-RoBERTa Induced Tree" to keep the naming uniform with the English datasets. Similarly, we add "Left-chain" and "Right-chain" as baselines. "Left-chain" and "Right-chain" mean that every word takes its previous or next word, respectively, as its dependent child word.

A.3 Implementation Details
Similar to the English datasets, experiments incorporating different trees into tree-based ALSC models are conducted on the non-English datasets, as well as the fine-tuning of PLMs. All experiments are conducted on an NVIDIA GTX1080Ti. For experiments with tree-based models, we use 300-dimensional pre-trained embeddings (Ruder et al., 2016) for the non-English datasets. We keep the word embeddings fixed to avoid overfitting. Other parameters are initialized as in the original models. It is worth noting that in reproducing the RGAT model, since the induced tree does not provide relation labels, we assign virtual relations to every dependency in a uniform way.
We retain the fine-tuning experiments with batch size b = 32, dropout rate d = 0.1, learning rate µ = 2e-4 using the AdamW optimizer with the default settings.
As for the induced trees, we choose the trees induced from the 11th layer in all of our experiments.

A.4.1 ALSC Performance with Different Trees
The comparison between models with different trees is presented in Table 7, which contains the experimental results on the non-English datasets. The experimental results show that: (1) Incorporating the FT-RoBERTa Induced Tree leads to the best results on all datasets, which demonstrates the effectiveness and advantage of the FT-RoBERTa Induced Tree on the non-English datasets. Moreover, we find that the results of the FT-RoBERTa Induced Tree usually have more stable F1 scores.
(2) Limited by the quality of the parsers for non-English languages, models using "Dep." perform slightly worse than models using the PLMs induced trees. This illustrates that the dependency tree can be very sensitive to the parser and to the quality of the corpus.
(3) Similarly, from "RoBERTa Induced Tree" and "FT-RoBERTa Induced Tree", we conclude that fine-tuning can substantially enhance the ALSC performance through trees induced from PLMs.

A.4.2 Comparison between ALSC models
Similarly, we compare the performance of the fine-tuned XLM-R with a set of top-performing models. The results are presented in Table 8. We can see that XLM-R with an MLP is enough to achieve SOTA or near-SOTA results on the non-English datasets.

B Sentiment words set
To calculate the Aspects-sentiment Distance of different tree structures on English datasets, we predefine a set of sentiment words, shown in Table 9. Specifically, we use the sentiment words described in Tian et al. (2020), which are the selected 50 most frequent sentiment words counted on Amazon-2.

Positive sentiment words: great, good, like, just, will, well, even, love, best, better, back, want, recommend, worth, easy, sound, right, excellent, nice, real, fun, sure, pretty, interesting, stars
Negative sentiment words: too, little, bad, game, down, long, hard, waste, disappointed, problem, try, poor, less, boring, worst, trying, wrong, least, although, problems, cheap