Relation-Aware Collaborative Learning for Unified Aspect-Based Sentiment Analysis

Aspect-based sentiment analysis (ABSA) involves three subtasks, i.e., aspect term extraction, opinion term extraction, and aspect-level sentiment classification. Most existing studies focus on only one of these subtasks. Several recent studies have successfully attempted to solve the complete ABSA problem with a unified framework. However, the interactive relations among the three subtasks are still under-exploited. We argue that such relations encode collaborative signals between different subtasks. For example, when the opinion term is “delicious”, the aspect term must be “food” rather than “place”. In order to fully exploit these relations, we propose a Relation-Aware Collaborative Learning (RACL) framework which allows the subtasks to work in coordination via multi-task learning and relation propagation mechanisms in a stacked multi-layer network. Extensive experiments on three real-world datasets demonstrate that RACL significantly outperforms the state-of-the-art methods for the complete ABSA task.


Introduction
Aspect-based sentiment analysis (ABSA) is a fine-grained task which aims to summarize the opinions of users towards specific aspects in a sentence. ABSA normally involves three subtasks, namely aspect term extraction (AE), opinion term extraction (OE), and aspect-level sentiment classification (SC). For example, given a review "The place is small and cramped but the food is delicious.", AE aims to extract a set of aspect terms {"place", "food"}. OE aims to extract a set of opinion terms {"small", "cramped", "delicious"}. Meanwhile, SC is expected to assign the sentiment polarities "negative" and "positive" to the aspects "place" and "food", respectively.
Most existing works treat ABSA as a two-step task containing AE and SC. They develop one separate method for each subtask (Tang et al., 2016; Xu et al., 2018; Li et al., 2018a; Hu et al., 2019), or take OE as an auxiliary task of AE (Li et al., 2018b). In order to perform ABSA for practical use, the separate methods need to be pipelined together. Recently, several studies have attempted to solve ABSA in a unified framework (Wang et al., 2018a; He et al., 2019; Luo et al., 2019).
Despite their effectiveness, we argue that these methods are not sufficient to yield satisfactory performance for the complete ABSA task. The key reason is that the interactive relations among different subtasks have been largely neglected in existing studies. These relations convey collaborative signals which can enhance the subtasks in a mutual way. For example, the opinion term "delicious" can serve as evidence for the aspect term "food", and vice versa. In the following, we first analyze the interactive relations among different subtasks, and then present our RACL framework which is developed to exploit these relations. The detailed relations are summarized in Figure 1 (left), where each arrow ⇔ denotes one specific relation $R_i$.
• $R_1$ indicates the dyadic relation between AE and OE. In practice, the aspect terms must be the targets of opinion, indicating that most aspect terms like "place" can only be modified by corresponding opinion terms like "small" and "cramped" rather than a term like "delicious". Hence AE and OE might hold informative clues to each other.
• $R_2$ indicates the triadic relation between SC and $R_1$. One critical problem in SC is to determine the dependency between the aspect and its context. For example, the context "small and cramped" plays an important role in predicting the polarity of "place". Such a dependency is highly in accordance with $R_1$ which emphasizes the interaction between the aspect and opinion terms. Hence SC and $R_1$ can help refine the selection process for each other.
• $R_3$ indicates the dyadic relation between SC and OE. The specific opinion terms generally convey specific polarities. For example, "fantastic" is often positive. The opinion terms extracted in OE should be paid more attention when predicting the sentiment polarity in SC.
• $R_4$ indicates the dyadic relation between SC and AE. In the complete ABSA task, the aspect terms are unknown and SC will assign a polarity to every word. The aspect terms, e.g., "place" and "food", will have their corresponding polarities, while other words are considered as the background ones without sentiment. That is to say, the results from AE should be helpful in supervising the training of SC.
When reviewing the literature on the ABSA task, we find that existing separate methods either do not utilize any relations, or only utilize $R_1$ by treating OE as an auxiliary task of AE. Meanwhile, the unified methods at most explicitly utilize $R_3$ and $R_4$. In view of this, we propose a novel Relation-Aware Collaborative Learning (RACL) framework to fully exploit the interactive relations in the complete ABSA task.
We compare our model with existing methods by their capability in utilizing interactive relations in Table 1. RACL is a multi-layer multi-task learning framework with a relation propagation mechanism to mutually enhance the performance of subtasks. For multi-task learning, RACL adopts the shared-private scheme (Collobert and Weston, 2008; Liu et al., 2017). Subtasks AE, OE, and SC first jointly train the low-level shared features, and then they train their high-level private features independently. In this way, the shared and private features can embed the task-invariant and task-oriented knowledge respectively. For relation propagation, RACL improves the model capacity by exchanging informative clues among three subtasks. Moreover, RACL can be stacked to multiple layers to perform collaborative learning at different semantic levels. We conduct extensive experiments on three datasets. Results demonstrate that RACL significantly outperforms the state-of-the-art methods for both the single subtasks and the complete ABSA task.

Related Work
Aspect-based sentiment analysis (ABSA) was first proposed by Hu and Liu (2004) and has been widely studied in recent years (Zhang et al., 2018). We organize existing studies by how the subtasks are performed and combined to perform ABSA.
Separate Methods. Most existing studies treat ABSA as a two-step task containing aspect term extraction (AE) and aspect-level sentiment classification (SC), and develop separate methods for AE (Popescu and Etzioni, 2005; Wu et al., 2009; Li et al., 2010; Qiu et al., 2011; Liu et al., 2012; Chen et al., 2014; Chernyshevich, 2014; Toh and Wang, 2014; Vicente et al., 2015; Liu et al., 2015; Yin et al., 2016; Wang et al., 2016; Li and Lam, 2017; Clercq et al., 2017; He et al., 2017; Xu et al., 2018; Yu et al., 2019), and SC (Jiang et al., 2011; Mohammad et al., 2013; Kiritchenko et al., 2014; Dong et al., 2014; Ma et al., 2017; Wang et al., 2018b; Zhu and Qian, 2018; Zhu et al., 2019), respectively. Some of them resort to the auxiliary task of opinion term extraction (OE) and exploit the relation between AE and OE to boost the performance of AE. For the complete ABSA task, results from the two steps must be merged together in a pipeline manner. In this way, the relation between AE/OE and SC is totally neglected, and the errors from the upstream AE/OE will be propagated to the downstream SC. The overall performance of pipeline methods on the ABSA task is therefore not promising.
Unified Methods. Recently, several studies have attempted to solve the ABSA task in a unified framework. The unified methods fall into two types: collapsed tagging (Mitchell et al., 2013; Wang et al., 2018a) and joint training (He et al., 2019; Luo et al., 2019). The former combines the labels of AE and SC to construct collapsed labels like {B-senti, I-senti, O}. The subtasks need to share all trainable features without distinction, which is likely to confuse the learning process. Moreover, the relations among subtasks cannot be explicitly modeled for this type of methods.
Meanwhile, the latter constructs a multi-task learning framework where each subtask has independent labels and can have shared and private features. This allows the interactive relations among different subtasks to be modeled explicitly for the joint training methods. However, none of the existing studies along this line has fully exploited the power of such relations.
We differentiate our work from the aforementioned methods in that we propose a unified framework which exploits all dyadic and triadic relations among subtasks to enhance the learning capability.
• SC aims to predict a tag sequence $Y^S = \{y^S_1, ..., y^S_i, ..., y^S_n\}$ for sentiment classification, where $y^S_i \in \{pos, neu, neg\}$ denotes the positive, neutral, and negative sentiment polarities towards each word.
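To make the per-word tagging view concrete, the following minimal sketch labels the running example from the Introduction. The whitespace tokenization, the B/I/O scheme for aspect and opinion terms, and the helper `extract_terms` are assumptions made for illustration; they are not specified in the text above.

```python
# Running example from the Introduction, tokenized on whitespace (an assumption).
tokens = ["The", "place", "is", "small", "and", "cramped", "but",
          "the", "food", "is", "delicious", "."]

# Assumed BIO scheme for the extraction subtasks: B = begin, I = inside, O = outside.
aspect_tags = ["O", "B", "O", "O", "O", "O", "O", "O", "B", "O", "O", "O"]
opinion_tags = ["O", "O", "O", "B", "O", "B", "O", "O", "O", "O", "B", "O"]

# SC tags every word, but only aspect positions carry a meaningful polarity.
sentiment_tags = [None, "neg", None, None, None, None, None,
                  None, "pos", None, None, None]


def extract_terms(tokens, tags):
    """Collect maximal B/I spans as terms (hypothetical helper)."""
    terms, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                terms.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                terms.append(" ".join(current))
            current = []
    if current:
        terms.append(" ".join(current))
    return terms


aspects = extract_terms(tokens, aspect_tags)
opinions = extract_terms(tokens, opinion_tags)
polarities = {tok: pol for tok, pol in zip(tokens, sentiment_tags) if pol}
```

Under these assumptions, `aspects` and `opinions` reproduce the sets given in the Introduction, and `polarities` pairs each aspect with its sentiment.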

Model Architecture
Our proposed RACL is a unified multi-task learning framework which propagates the interactive relations (denoted by the same $R_1$..$R_4$ as those in Figure 1) to improve the ABSA performance, and it can be stacked to multiple layers so that the subtasks interact at different semantic levels. We present the overall architecture of RACL in Figure 2(a) and details of a single layer in Figure 2(b). In particular, a single RACL layer contains three modules: AE, OE, and SC, where each module is designed for the corresponding subtask. These modules receive a shared representation of the input sentence, then encode their task-oriented features. After that, they propagate relations $R_1$..$R_4$ for collaborative learning by exchanging informative clues to further enhance the task-oriented features. Finally, the three modules make predictions for the corresponding tag sequences $Y^A$, $Y^O$, and $Y^S$ based on the enhanced features.
In the following, we first illustrate the relation-aware collaborative learning in one layer, then show the stacking and the training of the entire RACL.

Relation-Aware Collaborative Learning
Input Word Vectors. Given a sentence $S_e$, we can map the word sequence in $S_e$ with either pre-trained word embeddings (e.g., GloVe) or pre-trained language encoders (e.g., BERT) to generate a sequence of word vectors $E = \{e_1, ..., e_i, ..., e_n\} \in \mathbb{R}^{d_w \times n}$, where $d_w$ is the dimension of word vectors. We will examine the effects of these two types of word vectors in the experiments.
Multi-task Learning with Shared-Private Scheme. To perform multi-task learning, different subtasks should focus on the different characteristics of a shared training sample. Inspired by the shared-private scheme (Collobert and Weston, 2008; Liu et al., 2017), we extract both the shared and private features to embed task-invariant and task-oriented knowledge for the AE, OE, and SC modules.
To encode the shared task-invariant features, we simply feed each $e_i$ in $E$ into a fully-connected layer and generate a transformed vector $h_i \in \mathbb{R}^{d_h}$. We then obtain a sequence of shared vectors $H = \{h_1, ..., h_i, ..., h_n\} \in \mathbb{R}^{d_h \times n}$ for each sentence which will be jointly trained by all subtasks.
Upon the shared task-invariant features $H$, the AE, OE, and SC modules will encode the task-oriented private features for the corresponding subtasks. We choose a simple CNN as the encoder function $F$ due to its high computational efficiency.
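As a rough illustration of what a CNN encoder $F$ over the shared vectors could look like, here is a minimal pure-Python sketch of a 1-D convolution along the sentence. The zero padding, the ReLU activation, and the function name `conv1d_same` are assumptions for illustration; the text only states that a simple CNN with kernel size $K$ is used.

```python
def conv1d_same(H, kernels, bias):
    """1-D convolution over word positions with zero padding (odd kernel size).

    H:       list of n word vectors, each of size d_in
    kernels: kernels[o][k][d] is the weight for output channel o,
             kernel offset k, and input dimension d
    bias:    one bias value per output channel
    """
    n, d_in = len(H), len(H[0])
    K = len(kernels[0])
    pad = K // 2
    out = []
    for i in range(n):
        row = []
        for o, kernel in enumerate(kernels):
            total = bias[o]
            for k in range(K):
                pos = i + k - pad
                if 0 <= pos < n:  # positions outside the sentence count as zeros
                    total += sum(kernel[k][d] * H[pos][d] for d in range(d_in))
            row.append(max(0.0, total))  # ReLU, an assumed activation
        out.append(row)
    return out


# Toy input: three words with one-dimensional features.
H = [[1.0], [2.0], [3.0]]
wide = conv1d_same(H, kernels=[[[1.0], [1.0], [1.0]]], bias=[0.0])  # K = 3
narrow = conv1d_same(H, kernels=[[[2.0]]], bias=[0.0])              # K = 1
```

With $K=1$ each output depends only on the current word, whereas $K=3$ also mixes in the two neighbours, which is the local context the AE and OE encoders rely on.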
For subtasks AE and OE, the key features for determining the existence of aspect and opinion terms are the representations of the original and adjacent words. Therefore, we construct two encoders to extract local AE-oriented features $X^A$ and OE-oriented features $X^O$:
$$X^A = F(H), \quad X^O = F(H) \tag{1}$$
For subtask SC, the process of feature generation is different from that in AE/OE. In order to determine the sentiment polarity towards an aspect term, we need to extract related semantic information from its context. The critical problem in SC is to determine the dependency between an aspect term and its context. Moreover, in the complete ABSA task, the aspect terms are unknown in SC and it needs to assign a polarity to every word in $S_e$. Based on these observations, we first encode the contextual features $X^{ctx}$ from $H$:
$$X^{ctx} = F(H) \tag{2}$$
Then we treat the shared vector $h_i$ as the query aspect and compute the semantic relation between the query and contextual features using the attention mechanism:
$$ds_{i,j}^{(i \neq j)} = \big((h_i)^\top X^{ctx}_j\big) \cdot [\log_2(2+|i-j|)]^{-1}, \quad M^{ctx}_{i,j} = \frac{\exp(ds_{i,j})}{\sum_{k \neq i} \exp(ds_{i,k})} \tag{3}$$
where $ds_{i,j}$ denotes the dependency strength between the $i$-th query word and the $j$-th context word, and $M^{ctx}_{i,j}$ is the normalized attention weight of $ds_{i,j}^{(i \neq j)}$. We add a coefficient $[\log_2(2+|i-j|)]^{-1}$ based on the absolute distance between two words. The rationale is that the adjacent context words should contribute more to the sentiment polarity. Finally, for the aspect query $w_i$, we can obtain the global SC-oriented features $X^S_i$ by a weighted sum of all contextual features (except the one for $w_i$):
$$X^S_i = \sum_{j \neq i} M^{ctx}_{i,j} \cdot X^{ctx}_j \tag{4}$$
[Figure 2: (a) Overall architecture of RACL, with stacked layers (1), (2), ..., (L) followed by average pooling. (b) Details of a single RACL layer.]
Propagating Relations for Collaborative Learning. After encoding task-oriented features, we propagate the interactive relations ($R_1$..$R_4$) among subtasks to mutually enhance the AE, OE, and SC modules.
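Before turning to the individual relations, the distance-weighted attention that yields the SC-oriented features can be sketched as follows. This is a minimal pure-Python illustration, assuming a dot-product score between the query and each context feature and a softmax over all other words; the function names `context_attention` and `sc_features` are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def context_attention(H, X_ctx):
    """Return the n x n attention matrix M_ctx with a zero diagonal.

    H[i] is the shared vector of word i (the query) and X_ctx[j] is the
    contextual feature of word j. Scores are scaled by 1 / log2(2 + |i - j|).
    """
    n = len(H)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        cols = [j for j in range(n) if j != i]
        if not cols:
            continue  # a one-word sentence has no context
        ds = [dot(H[i], X_ctx[j]) / math.log2(2 + abs(i - j)) for j in cols]
        peak = max(ds)  # subtract the maximum for numerical stability
        exps = [math.exp(s - peak) for s in ds]
        total = sum(exps)
        for j, e in zip(cols, exps):
            M[i][j] = e / total
    return M


def sc_features(M, X_ctx):
    """X_S[i] = sum_j M[i][j] * X_ctx[j]; the diagonal of M is zero."""
    dim = len(X_ctx[0])
    return [[sum(M[i][j] * X_ctx[j][d] for j in range(len(X_ctx)))
             for d in range(dim)] for i in range(len(M))]


# Three identical words: only the distance coefficient separates the weights.
H = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
M_ctx = context_attention(H, H)
X_S = sc_features(M_ctx, H)
```

In this toy case the word at distance 1 receives more attention than the word at distance 2, which reflects the stated rationale that adjacent context words should contribute more.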
(1) $R_1$ is the dyadic relation between AE and OE, which indicates that AE and OE might hold informative clues to each other. In order to model $R_1$, we want the AE-oriented features $X^A$ and the OE-oriented features $X^O$ to exchange useful information based on their semantic relations. Taking the subtask AE as an example, the semantic relation between the word in AE and that in OE is defined as follows:
$$sr_{i,j}^{(i \neq j)} = \big((X^A_i)^\top X^O_j\big) \cdot [\log_2(2+|i-j|)]^{-1}, \quad M^{O2A}_{i,j} = \frac{\exp(sr_{i,j})}{\sum_{k \neq i} \exp(sr_{i,k})} \tag{5}$$
For the word $w_i$ in AE, we can obtain the useful clues $X^{O2A}_i$ from OE by applying a weighted sum of semantic relations to all words in OE (except the word $w_i$ itself), i.e.,
$$X^{O2A}_i = \sum_{j \neq i} M^{O2A}_{i,j} \cdot X^O_j \tag{6}$$
We then concatenate the original AE-oriented features $X^A$ and the useful clues $X^{O2A}$ from OE as the final features for AE, and feed them into a fully-connected layer to predict the tags of aspect terms:
$$Y^A = \mathrm{softmax}\big(W^A (X^A \oplus X^{O2A})\big) \tag{7}$$
where $W^A \in \mathbb{R}^{3 \times 2d_c}$ is a transformation matrix and $Y^A \in \mathbb{R}^{3 \times n}$ is the predicted tag sequence of AE.
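For subtask OE, we use the transposed matrix of $sr_{i,j}^{(i \neq j)}$ in Eq. 5 to compute the corresponding $M^{A2O}$. In this way, the semantic relation between AE and OE will be consistent regardless of the direction. Then we can obtain the useful clues $X^{A2O}$ from AE and generate the predicted tag sequence $Y^O \in \mathbb{R}^{3 \times n}$ in a similar way, i.e.,
$$X^{A2O}_i = \sum_{j \neq i} M^{A2O}_{i,j} \cdot X^A_j, \quad Y^O = \mathrm{softmax}\big(W^O (X^O \oplus X^{A2O})\big) \tag{8}$$
Additionally, each $w_i$ cannot be an aspect term and an opinion term at the same time, so we add a regularization hinge loss to constrain $Y^A$ and $Y^O$:
$$\mathcal{L}^R = \sum_{i=1}^{n} \max\big(0,\; P(y^A_i \in \{B, I\}) + P(y^O_i \in \{B, I\}) - 1.0\big) \tag{9}$$
where $P$ denotes the probability under the given conditions.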
(2) $R_2$ is the triadic relation between SC and $R_1$. Recall that the dependency between the aspect term and its context is critical for subtask SC, and we have already calculated this dependency using the normalized attention weight $M^{ctx}$. Hence we can model $R_2$ by propagating $R_1$ to $M^{ctx}$. We use $M^{O2A}$ as the representative of $R_1$, and add it to $M^{ctx}$ to denote the influence from $R_1$ to SC. More formally, we define $R_2$ as the following operation:
$$M^{ctx}_{i,j} \leftarrow M^{ctx}_{i,j} + M^{O2A}_{i,j}, \quad i \neq j \tag{10}$$
In effect, $M^{O2A}$ characterizes the dependency between aspect terms and contexts in the view of term extraction while $M^{ctx}$ characterizes it in the view of sentiment classification. The dual-view relation $R_2$ can help refine the selection processes for both extraction and classification subtasks.
(3) $R_3$ is the dyadic relation between SC and OE, which indicates that the extracted opinion terms should be paid more attention when predicting the sentiment polarity. In order to model $R_3$, similarly to the method for $R_2$, we update $M^{ctx}$ in SC using the generated tag sequence $Y^O$ from OE:
$$M^{ctx}_{i,j} \leftarrow M^{ctx}_{i,j} + P(y^O_j \in \{B, I\}) \cdot [\log_2(2+|i-j|)]^{-1}, \quad i \neq j \tag{11}$$
By doing this, the opinion terms can get larger weights in the attention mechanism. Consequently, they will contribute more to the prediction of the sentiment polarity. After getting the interacted values for $M^{ctx}$, we can recompute the SC-oriented features $X^S$ in Eq. 4 accordingly. Then we concatenate $H$ and $X^S$ as the final features for SC and feed them into a fully-connected layer to predict sentiment polarities for the candidate aspect terms:
$$Y^S = \mathrm{softmax}\big(W^S (H \oplus X^S)\big) \tag{12}$$
where $W^S \in \mathbb{R}^{3 \times 2d_h}$ is a transformation matrix and $Y^S \in \mathbb{R}^{3 \times n}$ is the predicted tag sequence of SC.
(4) $R_4$ is the dyadic relation between SC and AE, which indicates that the results from AE are helpful in supervising the training of SC. Clearly, only aspect terms have sentiment polarities. Although SC needs to assign a polarity to every word, we know the ground truth aspect terms in AE during the training process. Therefore, we directly use the ground truth tag sequence $\hat{Y}^A$ of AE to refine the labeling process in SC. Specifically, only the predicted tags towards true aspect terms are counted in the training procedure:
$$y^S_i \leftarrow I(\hat{y}^A_i) \cdot y^S_i \tag{13}$$
where $I(\hat{y}^A_i)$ equals 1 if $w_i$ is an aspect term and 0 otherwise. Notice that this approach is only used in the training procedure.
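The four propagation steps can be sketched end to end in a few lines of pure Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes dot-product relation scores, a softmax over all other words, a hinge of the form max(0, P(aspect) + P(opinion) - 1), and a distance-scaled opinion probability for $R_3$. All function names and toy values are hypothetical.

```python
import math


def distance_coef(i, j):
    """Distance coefficient 1 / log2(2 + |i - j|)."""
    return 1.0 / math.log2(2 + abs(i - j))


def masked_row_softmax(scores):
    """Row-wise softmax over off-diagonal entries; the diagonal stays 0."""
    n = len(scores)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        cols = [j for j in range(n) if j != i]
        if not cols:
            continue  # a one-word sentence has nothing to attend to
        peak = max(scores[i][j] for j in cols)  # stabilizes exp
        exps = [math.exp(scores[i][j] - peak) for j in cols]
        total = sum(exps)
        for j, e in zip(cols, exps):
            out[i][j] = e / total
    return out


def relation_r1(X_A, X_O):
    """R1: attention from AE to OE and, via transposed scores, from OE to AE."""
    n = len(X_A)
    sr = [[0.0 if i == j else
           sum(a * o for a, o in zip(X_A[i], X_O[j])) * distance_coef(i, j)
           for j in range(n)] for i in range(n)]
    sr_t = [list(col) for col in zip(*sr)]  # transposed scores for the OE side
    return masked_row_softmax(sr), masked_row_softmax(sr_t)


def hinge_regularizer(p_aspect, p_opinion):
    """Penalize words whose aspect and opinion probabilities sum above 1."""
    return sum(max(0.0, pa + po - 1.0) for pa, po in zip(p_aspect, p_opinion))


def propagate_r2_r3(M_ctx, M_O2A, p_opinion):
    """R2 adds M_O2A to M_ctx; R3 adds distance-scaled opinion probabilities."""
    n = len(M_ctx)
    return [[0.0 if i == j else
             M_ctx[i][j] + M_O2A[i][j] + p_opinion[j] * distance_coef(i, j)
             for j in range(n)] for i in range(n)]


def mask_r4(sc_probs, is_aspect):
    """R4 (training only): keep SC predictions at gold aspect positions."""
    return [[p * flag for p in probs] for probs, flag in zip(sc_probs, is_aspect)]


# Toy 3-word example with 2-dimensional features (all values are made up).
X_A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
X_O = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
M_O2A, M_A2O = relation_r1(X_A, X_O)
M_ctx = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
M_new = propagate_r2_r3(M_ctx, M_O2A, p_opinion=[0.0, 0.0, 1.0])
reg = hinge_regularizer([0.9, 0.2, 0.6], [0.3, 0.1, 0.7])
masked = mask_r4([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]], is_aspect=[1, 0])
```

In the toy example the third word is a confident opinion term, so every other word's attention towards it grows after propagation, while the masking step zeroes out the sentiment prediction for the non-aspect word.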

Stacking RACL to Multiple Layers
When using a single RACL layer, the AE, OE, and SC modules only extract the corresponding features at a relatively low linguistic level, which may be insufficient as evidence for labeling each word. Hence we stack RACL to multiple layers to obtain high-level semantic features for the subtasks, which helps to conduct deep collaborative learning.
Specifically, we first encode features $X^{ctx(1)}$, $X^{A(1)} \oplus X^{O2A(1)}$, and $X^{O(1)} \oplus X^{A2O(1)}$ in layer (1). Then in layer (2), we input these features for SC, AE, and OE to generate $X^{ctx(2)}$, $X^{A(2)}$, and $X^{O(2)}$. In this way, we can stack RACL to $L$ layers. We then conduct average pooling on results from all layers to obtain the final prediction:
$$Y^T = \mathrm{avg}\big(Y^{T(1)}, Y^{T(2)}, ..., Y^{T(L)}\big) \tag{14}$$
where $T \in \{A, O, S\}$ denotes the specific subtask, and $L$ is the number of layers. This shortcut-like architecture can enforce the features in the low layers to be meaningful and informative, which in turn helps the high layers to make better predictions.
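The average pooling over layers can be sketched as an element-wise mean of the per-layer predictions. The nested-list layout and the name `average_layers` below are assumptions for illustration.

```python
def average_layers(layer_preds):
    """Element-wise mean over layers.

    layer_preds[l][i][c] is the predicted probability of class c for
    word i at layer l; all layers must share the same shape.
    """
    num_layers = len(layer_preds)
    n = len(layer_preds[0])
    num_classes = len(layer_preds[0][0])
    return [[sum(layer_preds[l][i][c] for l in range(num_layers)) / num_layers
             for c in range(num_classes)]
            for i in range(n)]


# Two layers, one word, two classes (toy values).
final = average_layers([[[0.2, 0.8]], [[0.6, 0.4]]])
```

Because every layer contributes directly to the final prediction, each layer's output must be a usable prediction on its own, which is the shortcut-like effect described above.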

Training Procedure
After generating the tag sequences $Y^A$, $Y^O$, and $Y^S$ for the sentence $S_e$, we compute the cross-entropy loss of each subtask:
$$\mathcal{L}^T = -\sum_{i=1}^{n} \sum_{j=1}^{J} \hat{y}^T_{ij} \log\big(y^T_{ij}\big) \tag{15}$$
where $T \in \{A, O, S\}$ denotes the subtask, $n$ is the length of $S_e$, $J$ is the number of label categories, and $y^T_i$ and $\hat{y}^T_i$ are the predicted tags and ground truth labels. The final loss $\mathcal{L}$ of RACL is the combination of the loss for subtasks and the loss for regularization, i.e., $\mathcal{L} = \sum_{T} \mathcal{L}^T + \lambda \cdot \mathcal{L}^R$, where $\lambda$ is a coefficient. We then train all parameters with back propagation.
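A minimal sketch of the loss combination follows, assuming the three subtask losses are summed before the regularization term is added and that gold labels are one-hot. The names `cross_entropy` and `total_loss` are hypothetical.

```python
import math


def cross_entropy(pred, gold):
    """Cross-entropy summed over words and label categories.

    pred[i][j] is the predicted probability of label j for word i;
    gold[i][j] is the one-hot ground truth.
    """
    return -sum(g * math.log(p)
                for p_row, g_row in zip(pred, gold)
                for p, g in zip(p_row, g_row)
                if g > 0)


def total_loss(task_losses, reg_loss, lam=1e-5):
    """Sum of subtask losses plus lam times the regularization loss."""
    return sum(task_losses.values()) + lam * reg_loss


ce = cross_entropy([[0.5, 0.5]], [[1, 0]])  # equals log(2)
combined = total_loss({"A": 1.0, "O": 2.0, "S": 3.0}, reg_loss=10.0, lam=0.1)
```

The default `lam=1e-5` mirrors the coefficient reported in the experimental settings; the larger value in the usage line is only there to make the arithmetic visible.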

Datasets and Settings
Datasets. We evaluate RACL on three real-world ABSA datasets from SemEval 2014 (Pontiki et al., 2014) and 2015 (Pontiki et al., 2015), which include reviews from two domains: restaurant and laptop. The original datasets only have ground truth labels for aspect terms and the corresponding sentiment polarities, while labels for opinion terms are annotated by two previous works (Wang et al., 2016). All datasets have a fixed training/test split. We further randomly sample 20% of the training data as the development set to tune hyper-parameters, and only use the remaining 80% for training. The statistics for datasets are summarized in Table 2.
Settings. We examine RACL with two types of word vectors: the pre-trained word embedding and the pre-trained language encoder. In the word embedding implementation, we follow the previous studies (Xu et al., 2018; He et al., 2019; Luo et al., 2019) and use two types of embeddings, i.e., general-purpose and domain-specific embeddings. The former is from GloVe vectors with 840B tokens (Pennington et al., 2014), and the latter is trained on a large domain-specific corpus using fastText and published by Xu et al. (2018). The two types of embeddings are concatenated as the word vectors. In the language encoder implementation, we follow Hu et al. (2019) by using BERT-Large (Devlin et al., 2019) as the backbone and fine-tuning it during the training process.
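The development split described in the Datasets paragraph can be sketched as a seeded random partition of the training samples. The function name, the seed, and the use of Python's `random` module are assumptions; the text does not say how the sampling was implemented.

```python
import random


def split_train_dev(samples, dev_ratio=0.2, seed=0):
    """Randomly hold out dev_ratio of the training samples for development."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_dev = int(len(samples) * dev_ratio)
    dev = [samples[i] for i in indices[:n_dev]]
    train = [samples[i] for i in indices[n_dev:]]
    return train, dev


train, dev = split_train_dev(list(range(100)))
```

The fixed test split is left untouched; only the original training portion is divided.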
We denote these two implementations as RACL-GloVe and RACL-BERT. For RACL-GloVe, we set the dimensions $d_w$=400, $d_h$=400, $d_c$=256 and the coefficient $\lambda$=1e-5. Other hyper-parameters are tuned on the development set. The kernel size $K$ of the CNN and the layer number $L$ are set to {3,3,5} and {4,3,4} for the three datasets, respectively. We train the model for a fixed number of epochs using the Adam optimizer (Kingma and Ba, 2015) with learning rate 1e-4 and batch size 8. For RACL-BERT, we set $d_w$ to 1024 and the learning rate to 1e-5 for fine-tuning BERT, and other hyper-parameters are directly inherited from RACL-GloVe.
We use four metrics for evaluation, i.e., AE-$F_1$, OE-$F_1$, SC-$F_1$, and ABSA-$F_1$. The first three denote the $F_1$-score of each subtask, while the last one measures the overall performance for complete ABSA. To compute ABSA-$F_1$, the result for an aspect term is considered correct only when both the AE and SC results are correct. The model achieving the minimum loss on the development set is used for evaluation on the test set.
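The joint correctness criterion behind ABSA-$F_1$ can be sketched as a set comparison in which an aspect term counts only if both its span and its polarity match. The tuple layout `(sentence_id, start, end, polarity)` and the micro-averaged $F_1$ are assumptions for illustration.

```python
def absa_f1(gold, pred):
    """F1 over (sentence_id, start, end, polarity) tuples.

    A predicted aspect term is correct only if its span (AE) and its
    polarity (SC) both match a gold tuple exactly.
    """
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {(0, 1, 2, "neg"), (0, 8, 9, "pos")}
pred = {(0, 1, 2, "neg"), (0, 8, 9, "neg")}  # second span found, wrong polarity
score = absa_f1(gold, pred)
```

Here both aspect spans are extracted correctly, yet only one counts towards the overall score because the second polarity is wrong.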
Baselines. To demonstrate the effectiveness of RACL for the complete ABSA task, we compare it with the following pipeline and unified baselines. The hyper-parameters for baselines are set to the optimal values as reported in their papers.
• {CMLA, DECNN}+{TNet, TCap}: CMLA and DECNN (Xu et al., 2018) are the state-of-the-art methods for AE, while TNet (Li et al., 2018a) and T(rans)Cap are the top-performing methods for SC. We then construct four pipeline baselines through combination.
• MNN (Wang et al., 2018a): a unified method utilizing the collapsed tagging scheme for AE and SC.
• E2E-ABSA: a unified method using the collapsed tagging scheme for AE and SC, which introduces the auxiliary OE task without explicit interaction.
• DOER (Luo et al., 2019): a multi-task unified method which jointly trains AE and SC, and explicitly models the relation $R_4$.
• IMN-D (He et al., 2019): a unified method involving joint training for AE and SC with separate labels. The OE task is fused into AE to construct five-class labels. It explicitly models relations $R_3$ and $R_4$.
• SPAN-BERT (Hu et al., 2019): a pipeline method using BERT-Large as the backbone. A multi-target extractor is used for AE, then a polarity classifier is used for SC.
• IMN-BERT: an extension of the best unified baseline IMN-D with BERT-Large. By doing this, we wish to conduct convincing comparisons for the BERT-style methods. The input dimension and learning rate of IMN-BERT are the same as our RACL-BERT, and other hyper-parameters are inherited from IMN-D.

Comparison Results
The comparison results for all methods are shown in Table 3. The methods are divided into three groups: M1∼M4 are GloVe-based pipeline methods, M5∼M9 are GloVe-based unified methods, and M10∼M12 are BERT-based methods.
Table 3: Comparison of different methods. We separate the GloVe-based (M1∼M9) and BERT-based (M10∼M12) methods for a fair comparison. The best scores are in bold, and second best ones are underlined. Results of M5, M6 and M8 are taken from He et al. (2019), while other results are the average scores of 5 runs with random initialization. "-" denotes that the method does not contain the subtask OE. Columns: Model, followed by AE-$F_1$, OE-$F_1$, SC-$F_1$, and ABSA-$F_1$ for each of Restaurant14 (Res14), Laptop14 (Lap14), and Restaurant15 (Res15).
Firstly, among all GloVe-based methods (M1∼M9), we can observe that RACL-GloVe consistently outperforms all baselines in terms of the overall metric ABSA-$F_1$, and achieves 2.12%, 2.92%, and 2.40% absolute gains over the strongest baselines on three datasets. The results prove that jointly training all subtasks and comprehensively modeling the interactive relations are critical for improving the performance of the complete ABSA task. Moreover, RACL-GloVe also achieves the best or second best results on all subtasks. This further demonstrates that the learning process of each subtask can be enhanced by the collaborative learning. Another observation from M1∼M9 is that the unified methods (M5∼M9) perform better than the pipeline ones (M1∼M4).
Secondly, among the GloVe-based unified methods, RACL-GloVe, IMN-D, and DOER perform better than MNN and E2E-TBSA in general. This is likely because the former three methods explicitly model interactive relations among subtasks while the latter two do not. We notice that DOER gets a poor SC-$F_1$ score. The reason might be that it utilizes an auxiliary sentiment lexicon to enhance the words with "positive" and "negative" sentiment. It is hard for DOER to handle words with "neutral" sentiment, and this results in a low SC-$F_1$ score.
Thirdly, the BERT-based methods (M10∼M12) achieve a better performance than GloVe-based methods by utilizing the large-scale external knowledge encoded in the pre-trained BERT-Large backbone. Specifically, SPAN-BERT is a strong baseline in subtask AE by reducing the search space with a multi-target extractor. However, its performance on SC drops a lot because it cannot capture the dependency between the extracted aspect terms in AE and the opinion terms in SC without interactions among subtasks. IMN-BERT achieves relatively high scores on OE and SC, but its performance on AE is the worst among the three without the guidance from the relations $R_1$ and $R_2$. In contrast, RACL-BERT gets significantly better overall scores than SPAN-BERT and IMN-BERT on all three datasets. This again shows the superiority of our RACL framework for the complete ABSA task by using all interactive relations.

Ablation Study
To investigate the effects of different relations on RACL-GloVe and RACL-BERT, we conduct the following ablation study. We sequentially remove each interactive relation and obtain four simplified variants.
As expected, all simplified variants in Table 4 show a decrease in ABSA-$F_1$. The results clearly demonstrate the effectiveness of the proposed relations. Moreover, we find that the relations play more important roles on small datasets than on large ones. The reason might be that it is hard to train a complicated model on small datasets, and the relations can absorb external knowledge from other subtasks.

Effects of Hyper-Parameters
There are two key hyper-parameters in our model: the kernel size $K$ of the CNN encoder and the layer number $L$. To investigate their impacts, we first vary $K$ in the range of [1, 9] stepped by 2 while fixing $L$ to the values in Section 4.1, and then vary $L$ in the range of [1, 7] stepped by 1 while fixing $K$. We only present the ABSA-$F_1$ results for RACL-GloVe in Figure 3 since the hyper-parameters of RACL-BERT are inherited from RACL-GloVe.
In Figure 3(a), $K$=1 yields extremely poor performance because the raw features are generated only by the current word. Increasing $K$ to 3 or 5 widens the receptive field and remarkably boosts the performance. However, when further increasing $K$ to 7 or 9, many irrelevant words are added as noise and thus deteriorate the performance. In Figure 3(b), increasing $L$ can, to some extent, expand the learning capability and achieve high performance. However, too many layers introduce excessive parameters and make the learning process overly complicated.
Table 5: Case study. The columns "AE+SC" and "OE" denote the results generated by corresponding subtasks, where "None" denotes that no aspect/opinion terms are extracted. Words in blue and italic are annotated opinion terms, and those in red are annotated aspect terms with the subscripts denoting their sentiment polarities.

Case Study
This section analyzes the results of different methods on several examples as a case study. We choose CMLA+TCap (denoted as PIPELINE), IMN-D, and RACL-GloVe as three competitors. We do not include the BERT-based methods as we wish to investigate the power of the models without the external resources.
S1 and S2 verify the effectiveness of relation $R_1$. In S1, due to the existence of the conjunction "and", the two baselines incorrectly extract "offers" as an opinion term, like "easy". In contrast, RACL-GloVe can successfully filter out "offers" in OE by using $R_1$. The reason is that "offers" has never co-occurred as an opinion term with the aspect term "OS" in the training set, and $R_1$, which connects the AE and OE subtasks, will treat them as irrelevant terms. This information will be passed to the OE subtask during the testing phase. Similarly, in S2, both baselines fail to recognize "looking" as an aspect term, because it might be the present participle of "look" without opinion information. Instead, RACL-GloVe correctly labels it, as $R_1$ provides useful clues from the opinion terms "faster" and "sleeker".
S3 shows the superiority of relation $R_2$, which is critical to connect the three subtasks but has never been employed in previous studies. Both baselines successfully extract "Dessert" and "die for" for AE and OE, but assign the incorrect "neutral" sentiment polarity even though IMN-D has emphasized the opinion terms. The reason is that these two terms have not co-occurred in the training samples, and it is hard for SC to recognize their dependency. In contrast, since "Dessert" and "die for" are typical words in AE and OE, RACL-GloVe is able to encode their dependency in $R_1$. By propagating $R_1$ to SC using $R_2$, RACL-GloVe can assign a correct polarity for "Dessert". To take a closer look, we visualize the averaged predicted results (left) and the attention weights (right) of all layers in Figure 4. Clearly, the original attention $M^{ctx-before}$ of "Dessert" does not concentrate on "die for". After getting enhanced by $M^{O2A}$ and OE, $M^{ctx-after}$ successfully highlights the opinion words and SC makes a correct prediction.
S4 shows the benefits from relation $R_3$. IMN-D and RACL-GloVe assign a correct polarity towards "Sushi" in SC since they both get the guidance from "fresh" in OE, while PIPELINE gets lost in contexts and makes a false prediction without the help of the opinion term. Notice that S1∼S4 simultaneously demonstrate the necessity for $R_4$, since RACL-GloVe is not biased by background words and can make correct sentiment predictions in all examples.

Analysis on Computational Cost
To demonstrate that our RACL model does not incur a high computational cost, we compare it with two strong baselines, DOER and IMN-D, in terms of the parameter number and running time. We run the three models on the Restaurant 2014 dataset with the same batch size 8 on a single 1080Ti GPU, and present the results in Table 6. Our proposed RACL has similar computational complexity to IMN-D, and both are much simpler than DOER.

Conclusion
In this paper, we highlight the importance of interactive relations in the complete ABSA task. In order to exploit these relations, we propose a Relation-Aware Collaborative Learning (RACL) framework with multi-task learning and relation propagation techniques. Experiments on three real-world datasets demonstrate that our RACL framework with its two implementations outperforms the state-of-the-art pipeline and unified baselines for the complete ABSA task.