An Interactive Multi-Task Learning Network for End-to-End Aspect-Based Sentiment Analysis

Aspect-based sentiment analysis produces a list of aspect terms and their corresponding sentiments for a natural language sentence. This task is usually done in a pipeline manner, with aspect term extraction performed first, followed by sentiment predictions toward the extracted aspect terms. While easier to develop, such an approach does not fully exploit joint information from the two subtasks and does not use all available sources of training information that might be helpful, such as document-level labeled sentiment corpus. In this paper, we propose an interactive multi-task learning network (IMN) which is able to jointly learn multiple related tasks simultaneously at both the token level as well as the document level. Unlike conventional multi-task learning methods that rely on learning common features for the different tasks, IMN introduces a message passing architecture where information is iteratively passed to different tasks through a shared set of latent variables. Experimental results demonstrate superior performance of the proposed method against multiple baselines on three benchmark datasets.


Introduction
Aspect-based sentiment analysis (ABSA) aims to determine people's attitude towards specific aspects in a review. This is done by extracting explicit aspect mentions, referred to as aspect term extraction (AE), and detecting the sentiment orientation towards each extracted aspect term, referred to as aspect-level sentiment classification (AS). For example, in the sentence "Great food but the service is dreadful", the aspect terms are "food" and "service", and the sentiment orientations towards them are positive and negative respectively.
In previous works, AE and AS are typically treated separately and the overall task is performed in a pipeline manner, which may not fully exploit the joint information between the two tasks.
Recently, two studies (Wang et al., 2018; Li et al., 2019) have shown that integrated models can achieve comparable results to pipeline methods. Both works formulate the problem as a single sequence labeling task with a unified tagging scheme. However, in their methods, the two tasks are only linked through unified tags, while the correlation between them is not explicitly modeled. Furthermore, the methods only learn from aspect-level instances, the size of which is usually small, and do not exploit available information from other sources such as related document-level labeled sentiment corpora, which contain useful sentiment-related linguistic knowledge and are much easier to obtain in practice.
In this work, we propose an interactive multi-task learning network (IMN), which solves both tasks simultaneously, enabling the interactions between both tasks to be better exploited. Furthermore, IMN allows AE and AS to be trained together with related document-level tasks, exploiting the knowledge from larger document-level corpora. IMN introduces a novel message passing mechanism that allows informative interactions between tasks. Specifically, it sends useful information from different tasks back to a shared latent representation. The information is then combined with the shared latent representation and made available to all tasks for further processing. This operation is performed iteratively, allowing the information to be modified and propagated across multiple links as the number of iterations increases. In contrast to most multi-task learning schemes, which share information through learning a common feature representation, IMN not only allows shared features, but also explicitly models the interactions between tasks through the message passing mechanism, allowing different tasks to better influence each other.
In addition, IMN allows fine-grained token-level classification tasks to be trained together with document-level classification tasks. We incorporate two document-level classification tasks, sentiment classification (DS) and domain classification (DD), to be jointly trained with AE and AS, allowing the aspect-level tasks to benefit from document-level information. In our experiments, we show that the proposed method is able to outperform multiple pipeline and integrated baselines on three benchmark datasets (see footnote 2).

Related Work
Aspect-Based Sentiment Analysis. Existing approaches typically decompose ABSA into two subtasks, and solve them in a pipeline setting. Both AE (Qiu et al., 2011; Yin et al., 2016; Wang et al., 2016a, 2017; Li and Lam, 2017; He et al., 2017; Li et al., 2018b; Angelidis and Lapata, 2018) and AS (Dong et al., 2014; Nguyen and Shirai, 2015; Vo and Zhang, 2015; Tang et al., 2016; Wang et al., 2016b; Zhang et al., 2016; Liu and Zhang, 2017; Chen et al., 2017; Cheng et al., 2017; Tay et al., 2018; Ma et al., 2018; He et al., 2018a,b; Li et al., 2018a) have been extensively studied in the literature. However, treating each task independently has several disadvantages. In a pipeline setting, errors from the first step tend to be propagated to the second step, leading to a poorer overall performance. In addition, this approach is unable to exploit the commonalities and associations between tasks, which may help reduce the amount of training data required to train both tasks.
Some previous works have attempted to develop integrated solutions. Zhang et al. (2015) proposed to model the problem as a sequence labeling task with a unified tagging scheme. However, their results were discouraging. Recently, two works (Wang et al., 2018; Li et al., 2019) have shown some promising results in this direction with more sophisticated network structures. However, in their models, the two subtasks are still only linked through a unified tagging scheme, while the interactions between them are not explicitly modeled. To address this issue, a better network structure allowing further task interactions is needed.
Footnote 2: Our source code can be obtained from https://github.com/ruidan/IMN-E2E-ABSA
Multi-Task Learning. One straightforward approach to perform AE and AS simultaneously is multi-task learning, where one conventional framework is to employ a shared network and two task-specific networks to derive a shared feature space and two task-specific feature spaces. Multi-task learning frameworks have been employed successfully in various natural language processing (NLP) tasks (Collobert and Weston, 2008; Luong et al., 2015; Liu et al., 2016). By learning semantically related tasks in parallel using a shared representation, multi-task learning can capture the correlations between tasks and improve the model generalization ability in certain cases. For ABSA, He et al. (2018b) have shown that aspect-level sentiment classification can be significantly improved through joint training with document-level sentiment classification. However, conventional multi-task learning still does not explicitly model the interactions between tasks: the two tasks only interact with each other through error back-propagation to contribute to the learned features, and such implicit interactions are not controllable. Unlike existing methods, our proposed IMN not only allows the representations to be shared, but also explicitly models the interactions between tasks, by using an iterative message passing scheme. The propagated information contributes to both learning and inference to boost the overall performance of ABSA.
Message Passing Architectures. Networked representations for message passing graphical model inference algorithms have been studied in computer vision (Arnab et al., 2018) and NLP (Gormley et al., 2015). Modeling the execution of these message passing algorithms as a network results in recurrent neural network architectures. We similarly propagate information in a network and learn the update operators, but the architecture is designed for solving multi-task learning problems. Our algorithm can similarly be viewed as a recurrent neural network since each iteration uses the same network to update the shared latent variables.

Proposed Method
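The IMN architecture is shown in Figure 1. A shared feature extraction component $f_{\theta_s}$, consisting of a word embedding layer followed by $m^s$ layers of CNNs, maps the input tokens $\{x_1, ..., x_n\}$ to a sequence of latent vectors shared among all tasks. We denote $h_i^{s(t)}$ as the value of the shared latent vector corresponding to $x_i$ after $t$ rounds of message passing, with $h_i^{s(0)}$ denoting the value after initialization.
The sequence of shared latent vectors $\{h_1^s, h_2^s, ..., h_n^s\}$ is used as input to the different task-specific components.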
Each task-specific component has its own sets of latent and output variables. The output variables correspond to a label sequence in a sequence tagging task; in AE, we assign to each token a label indicating whether it belongs to any aspect or opinion term, while in AS, we label each word with its sentiment. In a classification task, the output corresponds to the label of the input instance: the sentiment of the document for the sentiment classification task (DS), and the domain of the document for the domain classification task (DD). At each iteration, appropriate information is passed back to the shared latent vectors to be combined; this could be the values of the output variables or the latent variables, depending on the task. In addition, we also allow messages to be passed between the components in each iteration. Specifically for this problem, we send information from the AE task to the AS task as shown in Figure 1. After $T$ iterations of message passing, which allows information to be propagated through multiple hops, we use the values of the output variables as predictions. For this problem, we only use the outputs for AE and AS during inference as these are the end-tasks, while the other tasks are only used for training. We now describe each component and how it is used in learning and inference.

Aspect-Level Tasks
AE aims to extract all the aspect and opinion terms appearing in a sentence, which is formulated as a sequence tagging problem with the BIO tagging scheme. Specifically, we use five class labels: $Y^{ae}$ = {BA, IA, BP, IP, O}, indicating the beginning of and inside of an aspect term, the beginning of and inside of an opinion term, and other words, respectively. We also formulate AS as a sequence tagging problem with labels $Y^{as}$ = {pos, neg, neu}, indicating the token-level positive, negative, and neutral sentiment orientations. Table 1 shows an example of an aspect-level training instance with gold AE and AS labels. In aspect-level datasets, only aspect terms are annotated with sentiment. Thus, when modeling AS as a sequence tagging problem, we label each token that is part of an aspect term with the sentiment label of the corresponding aspect term. For example, as shown in Table 1, we label "fish" as pos, and label "variety", "of", "fish" as neg, based on the gold sentiment labels of the two aspect terms "fish" and "variety of fish" respectively. Since other tokens do not have AS gold labels, we ignore the predictions on them when computing the training loss for AS.
Input: The fish is fresh but the variety of fish is nothing out of ordinary .
Table 1: An aspect-level training instance with gold AE and AS labels.
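The labeling convention described above can be illustrated with a minimal sketch. The helper name `make_labels`, the `(start, end, sentiment)` span format, and the use of `None` for tokens without AS gold labels are assumptions made for illustration, not part of the released code; opinion-term labels (BP, IP) are omitted for brevity.

```python
def make_labels(tokens, aspects):
    """Build token-level AE and AS labels from aspect annotations.

    aspects: list of (start, end, sentiment) with `end` exclusive.
    Tokens outside any aspect term get AE label "O" and AS label None
    (None marks tokens whose AS predictions are ignored in the loss).
    """
    ae = ["O"] * len(tokens)
    as_ = [None] * len(tokens)
    for start, end, sentiment in aspects:
        for k in range(start, end):
            ae[k] = "BA" if k == start else "IA"
            as_[k] = sentiment
    return ae, as_


tokens = "The fish is fresh but the variety of fish is nothing out of ordinary .".split()
# "fish" (index 1) is positive; "variety of fish" (indices 6 to 8) is negative.
ae, as_ = make_labels(tokens, [(1, 2, "pos"), (6, 9, "neg")])
```

Every token of a multi-word aspect term receives the sentiment of the whole term, which is what allows AS to be treated as sequence tagging.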
The AE component $f_{\theta_{ae}}$ is parameterized by $\theta_{ae}$ and outputs $\{\hat{y}_1^{ae}, ..., \hat{y}_n^{ae}\}$. The AS component $f_{\theta_{as}}$ is parameterized by $\theta_{as}$ and outputs $\{\hat{y}_1^{as}, ..., \hat{y}_n^{as}\}$. The AE and AS encoders consist of $m^{ae}$ and $m^{as}$ layers of CNNs respectively, and they map the shared representations to $\{h_1^{ae}, h_2^{ae}, ..., h_n^{ae}\}$ and $\{h_1^{as}, h_2^{as}, ..., h_n^{as}\}$ respectively. For the AS encoder, we employ an additional self-attention layer on top of the stacked CNNs. As shown in Figure 1, we make $\hat{y}_i^{ae}$, the outputs from AE, available to AS in the self-attention layer, as the sentiment task could benefit from knowing the predictions of opinion terms. Specifically, the self-attention matrix $A \in \mathbb{R}^{n \times n}$ is computed as follows:

$$\mathrm{score}_{ij}^{(i \neq j)} = \left(h_i^{as} W^{as} (h_j^{as})^{\top}\right) \cdot \frac{1}{|i-j|} \cdot P_j^{op} \quad (1)$$

$$A_{ij}^{(i \neq j)} = \frac{\exp(\mathrm{score}_{ij})}{\sum_{k=1}^{n} \exp(\mathrm{score}_{ik})} \quad (2)$$

where the first term in Eq. (1) indicates the semantic relevance between $h_i^{as}$ and $h_j^{as}$ with parameter matrix $W^{as}$, the second term is a distance-relevant factor, which decreases with increasing distance between the $i$th token and the $j$th token, and the third term $P_j^{op}$ denotes the predicted probability that the $j$th token is part of any opinion term. The probability $P_j^{op}$ can be computed by summing the predicted probabilities on opinion-related labels BP and IP in $\hat{y}_j^{ae}$. In this way, AS is directly influenced by the predictions of AE. We set the diagonal elements in $A$ to zeros, as we only consider context words for inferring the sentiment of the target token. The self-attention layer outputs $h_i'^{as} = \sum_{j=1}^{n} A_{ij} h_j^{as}$. In AE, we concatenate the word embedding, the initial shared representation $h_i^{s(0)}$, and the task-specific representation $h_i^{ae}$ as the final representation of the $i$th token. In AS, we concatenate $h_i^{s(0)}$ and $h_i'^{as}$ as the final representation. For each task, we employ a fully-connected layer with softmax activation as the decoder, which maps the final token representation to the probability distribution $\hat{y}_i^{ae}$ ($\hat{y}_i^{as}$).
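The opinion-aware self-attention can be sketched in plain Python. This is a minimal illustration, assuming that the score is the product of the three factors described above (semantic relevance, inverse distance, opinion probability) and that each row is softmax-normalized over the context tokens; the function name `self_attention` and the toy inputs are hypothetical.

```python
import math


def self_attention(h, W, p_op):
    """Return the n x n attention matrix A with a zero diagonal.

    h: list of n vectors (length d), W: d x d parameter matrix,
    p_op: list of n opinion probabilities (P(BP) + P(IP) from AE).
    Assumed score: (h_i W h_j^T) * 1/|i-j| * p_op[j].
    """
    n, d = len(h), len(h[0])
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # h_i W, a length-d row vector
        hW = [sum(h[i][a] * W[a][b] for a in range(d)) for b in range(d)]
        scores = {}
        for j in range(n):
            if j == i:
                continue  # the target token does not attend to itself
            relevance = sum(hW[b] * h[j][b] for b in range(d))
            scores[j] = relevance * (1.0 / abs(i - j)) * p_op[j]
        z = sum(math.exp(s) for s in scores.values())
        for j, s in scores.items():
            A[i][j] = math.exp(s) / z
    return A


h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity, for illustration only
A = self_attention(h, W, [0.1, 0.9, 0.5])
```

Because the opinion probability multiplies the score, context tokens that AE considers likely opinion words receive more attention when the semantic relevance is positive.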

Document-Level Tasks
To address the issue of insufficient aspect-level training data, IMN is able to exploit knowledge from document-level labeled sentiment corpora, which are more readily available. We introduce two document-level classification tasks to be jointly trained with AE and AS. One is document-level sentiment classification (DS), which predicts the sentiment towards an input document. The other is document-level domain classification (DD), which predicts the domain label of an input document.
As shown in Figure 1, the task-specific operation $f_{\theta_o}$ consists of $m^o$ layers of CNNs that map the shared representations $\{h_1^s, ..., h_n^s\}$ to $\{h_1^o, ..., h_n^o\}$, an attention layer $att^o$, and a decoding layer $dec^o$, where $o \in \{ds, dd\}$ is the task symbol. The attention weight is computed as:

$$a_i^o = \frac{\exp(h_i^o W^o)}{\sum_{k=1}^{n} \exp(h_k^o W^o)} \quad (3)$$

where $W^o$ is a parameter vector. The final document representation is computed as $h^o = \sum_{i=1}^{n} a_i^o h_i^o$. We employ a fully-connected layer with softmax activation as the decoding layer, which maps $h^o$ to $\hat{y}^o$.
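The attention layer of a document-level task can be sketched as follows. This is a minimal illustration, assuming that the attention weights are a softmax over the dot products between token vectors and the parameter vector, and that the document representation is the attention-weighted sum of token vectors; the function name `attention_pool` and the toy inputs are hypothetical.

```python
import math


def attention_pool(h, w):
    """Softmax attention over tokens followed by a weighted sum.

    h: list of n token vectors (length d), w: parameter vector (length d).
    Returns (weights, pooled), where pooled is the document representation.
    """
    logits = [sum(x * y for x, y in zip(vec, w)) for vec in h]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    weights = [e / z for e in exps]
    d = len(h[0])
    pooled = [sum(weights[i] * h[i][k] for i in range(len(h))) for k in range(d)]
    return weights, pooled


weights, doc_vec = attention_pool([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]], [0.5, 0.5])
```

The per-token weights returned here are the quantities that the message passing mechanism later feeds back to the shared representation.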

Message Passing Mechanism
To exploit interactions between different tasks, the message passing mechanism aggregates predictions of different tasks from the previous iteration, and uses this knowledge to update the shared latent vectors $\{h_1^s, ..., h_n^s\}$ at the current iteration. Specifically, the message passing mechanism integrates knowledge from $\hat{y}_i^{ae}$, $\hat{y}_i^{as}$, $\hat{y}^{ds}$, $a_i^{ds}$, and $a_i^{dd}$ computed on an input $\{x_1, ..., x_n\}$, and the shared hidden vector $h_i^s$ is updated as follows:

$$h_i^{s(t)} = f_{\theta_{re}}\left(h_i^{s(t-1)} : \hat{y}_i^{ae(t-1)} : \hat{y}_i^{as(t-1)} : \hat{y}^{ds(t-1)} : a_i^{ds(t-1)} : a_i^{dd(t-1)}\right) \quad (4)$$

where $[:]$ denotes the concatenation operation and $f_{\theta_{re}}$ is a re-encoding function, for which we employ a fully-connected layer with ReLU activation. To update the shared representations, we incorporate $\hat{y}_i^{ae(t-1)}$ and $\hat{y}_i^{as(t-1)}$, the outputs of AE and AS from the previous iteration, such that this information is available to both tasks in the current round of computation. We also incorporate information from DS and DD. $\hat{y}^{ds}$ indicates the overall sentiment of the input sequence, which could be helpful for AS. The attention weights $a_i^{ds}$ and $a_i^{dd}$ generated by DS and DD respectively reflect how sentiment-relevant and domain-relevant the $i$th token is. A token that is more sentiment-relevant or domain-relevant is more likely to be an opinion word or aspect word. This information is useful for the aspect-level tasks.
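One message passing update for a single token can be sketched as below. This is a minimal illustration, assuming that the messages are concatenated with the previous shared vector and re-encoded by a fully-connected layer with ReLU activation; the function name `update_shared`, the weight layout, and the toy values are hypothetical.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def update_shared(h_prev, y_ae, y_as, y_ds, a_ds, a_dd, W, b):
    """One message passing update for a single token.

    Concatenates the previous shared vector with the messages from AE, AS,
    DS, and DD, then applies a fully-connected layer with ReLU (assumed
    form of the re-encoding step). W has one row per output dimension.
    """
    message = list(h_prev) + list(y_ae) + list(y_as) + list(y_ds) + [a_ds, a_dd]
    return [relu(sum(w * m for w, m in zip(row, message)) + bias)
            for row, bias in zip(W, b)]


h_prev = [0.2, -0.1]                   # previous shared vector (d = 2)
y_ae = [0.7, 0.1, 0.1, 0.05, 0.05]     # distribution over BA, IA, BP, IP, O
y_as = [0.6, 0.3, 0.1]                 # distribution over pos, neg, neu
y_ds = [0.5, 0.4, 0.1]                 # document-level sentiment distribution
W = [[0.1] * 15, [-0.1] * 15]          # toy weights: 2 outputs, 15 inputs
h_new = update_shared(h_prev, y_ae, y_as, y_ds, 0.3, 0.2, W, [0.0, 0.0])
```

Applying the same function for T rounds, with the task components recomputed between rounds, gives the recurrent view of the architecture mentioned in the related work.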

Learning
Instances for aspect-level problems only have aspect-level labels while instances for document-level problems only have document labels. IMN is trained on aspect-level and document-level instances alternately.
When trained on aspect-level instances, the loss function is as follows:

$$L_a = \frac{1}{N_a} \sum_{i=1}^{N_a} \frac{1}{n_i} \sum_{j=1}^{n_i} \left( l(y_{i,j}^{ae}, \hat{y}_{i,j}^{ae(T)}) + l(y_{i,j}^{as}, \hat{y}_{i,j}^{as(T)}) \right) \quad (5)$$

where $T$ denotes the maximum number of iterations in the message passing mechanism, $N_a$ denotes the total number of aspect-level training instances, $n_i$ denotes the number of tokens contained in the $i$th training instance, and $y_{i,j}^{ae}$ ($y_{i,j}^{as}$) denotes the one-hot encoding of the gold label for AE (AS). $l$ is the cross-entropy loss applied to each token. In aspect-level datasets, only aspect terms have sentiment annotations. We label each token that is part of any aspect term with the sentiment of the corresponding aspect term. During model training, we only consider AS predictions on these aspect term-related tokens for computing the AS loss and ignore the sentiments predicted on other tokens (see footnote 6).
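The masking of the AS loss can be made concrete with a short sketch for a single sentence. It assumes a per-sentence average over tokens and represents predictions as label-to-probability dictionaries; the function name `aspect_level_loss` and the toy values are hypothetical.

```python
import math


def aspect_level_loss(gold_ae, gold_as, prob_ae, prob_as):
    """Token-averaged AE + AS cross-entropy for one sentence.

    gold_ae / gold_as: gold label strings per token; gold_as is ignored
    unless the gold AE label is "BA" or "IA".
    prob_ae / prob_as: per-token dicts mapping label -> predicted probability.
    """
    total = 0.0
    for y_ae, y_as, p_ae, p_as in zip(gold_ae, gold_as, prob_ae, prob_as):
        total += -math.log(p_ae[y_ae])        # AE loss on every token
        if y_ae in ("BA", "IA"):
            total += -math.log(p_as[y_as])    # AS loss on aspect tokens only
    return total / len(gold_ae)


loss = aspect_level_loss(
    ["BA", "O"],
    ["pos", None],
    [{"BA": 0.5}, {"O": 0.5}],
    [{"pos": 0.25}, {"pos": 0.9}],
)
```

In this toy case the second token contributes only its AE term, so the AS prediction on it has no effect on the loss.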
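When trained on document-level instances, we minimize the following loss:

$$L_d = \frac{1}{N_{ds}} \sum_{i=1}^{N_{ds}} l(y_i^{ds}, \hat{y}_i^{ds}) + \frac{1}{N_{dd}} \sum_{i=1}^{N_{dd}} l(y_i^{dd}, \hat{y}_i^{dd}) \quad (6)$$

where $N_{ds}$ and $N_{dd}$ denote the number of training instances for DS and DD respectively, and $y_i^{ds}$ and $y_i^{dd}$ denote the one-hot encoding of the gold label. Message passing iterations are not used when training on document-level instances.
Footnote 6: Let $l(y_{i,j}^{as}, \hat{y}_{i,j}^{as(T)}) = 0$ in Eq. (5) if $y_{i,j}^{ae}$ is not BA or IA.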
For learning, we first pretrain the network on the document-level instances (minimizing $L_d$) for a few epochs, such that DS and DD can make reasonable predictions. Then the network is trained on aspect-level instances and document-level instances alternately with ratio $r$, to minimize $L_a$ and $L_d$. The overall training process is given in Algorithm 1. $D^a$, $D^{ds}$, and $D^{dd}$ denote the aspect-level training set and the training sets for DS and DD respectively. $D^{ds}$ and $D^a$ are from similar domains. $D^{dd}$ contains review documents from at least two domains with $y_i^{dd}$ denoting the domain label, where one of the domains is similar to the domains of $D^a$ and $D^{ds}$. In this way, linguistic knowledge can be transferred from DS and DD to AE and AS.
Datasets. Table 2 shows the statistics of the aspect-level datasets. We run experiments on three benchmark datasets, taken from SemEval 2014 (Pontiki et al., 2014) and SemEval 2015 (Pontiki et al., 2015). The opinion terms are annotated by Wang et al. (2016a). We use two document-level datasets from He et al. (2018b). One is from the Yelp restaurant domain, and the other is from the Amazon electronics domain. Each contains 30k instances with exactly balanced class labels of pos, neg, and neu. We use the concatenation of the two datasets with domain labels as $D^{dd}$. We use the Yelp dataset as $D^{ds}$ when $D^a$ is either D1 or D3, and use the electronics dataset as $D^{ds}$ when $D^a$ is D2.
Network details. We adopt the multi-layer-CNN structure from Xu et al. (2018) as the CNN-based encoders in our proposed network. See Appendix A for implementation details. For word embedding initialization, we concatenate a general-purpose embedding matrix and a domain-specific embedding matrix following Xu et al. (2018). We adopt their released domain-specific embeddings for restaurant and laptop domains with 100 dimensions, which are trained on a large domain-specific corpus using fastText. The general-purpose embeddings are pre-trained GloVe vectors (Pennington et al., 2014) with 300 dimensions.
One set of important hyper-parameters is the number of CNN layers in the shared encoder and the task-specific encoders. To decide the values of $m^s$, $m^{ae}$, $m^{as}$, $m^{ds}$, $m^{dd}$, we first investigate how many layers of CNNs would work well for each of the tasks when training it alone. We denote $c^o$ as the optimal number of CNN layers in this case, where $o \in \{ae, as, ds, dd\}$ is the task indicator. We perform AE and AS separately on the training set of D1, and perform DS and DD separately on the document-level restaurant corpus. Cross-validation is used for selecting $c^o$, which yields 4, 2, 2, 2 for $c^{ae}$, $c^{as}$, $c^{ds}$, $c^{dd}$. Based on this observation, we set $m^s$, $m^{ae}$, $m^{as}$, $m^{ds}$, $m^{dd}$ to 2, 2, 0, 0, 0 respectively, such that $m^s + m^o = c^o$. Note that there are other configurations satisfying the requirement, for example, $m^s$, $m^{ae}$, $m^{as}$, $m^{ds}$, $m^{dd}$ equal to 1, 3, 1, 1, 1. We select our setting as it involves the smallest set of parameters.
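The constraint on the layer counts can be checked by enumerating the admissible configurations. The sketch below uses the total number of CNN layers as a rough proxy for the parameter count, which is an assumption (layers differ slightly in size); the function name `layer_configs` is hypothetical.

```python
def layer_configs(c):
    """Enumerate (m_s, {task: m_o}) with m_s + m_o = c_o for every task o."""
    configs = []
    for m_s in range(min(c.values()) + 1):
        configs.append((m_s, {task: c_o - m_s for task, c_o in c.items()}))
    return configs


c = {"ae": 4, "as": 2, "ds": 2, "dd": 2}
configs = layer_configs(c)
# Total number of CNN layers, used here as a rough proxy for parameter count.
totals = {m_s: m_s + sum(m.values()) for m_s, m in configs}
```

Under this proxy, sharing two layers gives 4 layers in total, against 7 when sharing one layer and 10 when sharing none, which is consistent with the selected setting being the smallest.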
We tune the maximum number of iterations $T$ in the message passing mechanism by training IMN$^{-d}$ via cross-validation on D1. It is set to 2. With $T$ fixed as 2, we then tune $r$ by training IMN via cross-validation on D1 and the relevant document-level datasets. It is set to 2 as well.
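The alternation between aspect-level and document-level updates can be sketched as a step schedule. The text does not spell out how the ratio $r$ is applied, so the reading used here, one document-level update after every $r$ aspect-level updates, is an assumption, as is the function name `training_schedule`.

```python
def training_schedule(num_aspect_batches, r):
    """Yield "aspect" or "document" steps for one epoch of joint training.

    Assumed reading of the ratio r: one document-level update (on L_d)
    after every r aspect-level updates (on L_a).
    """
    for b in range(1, num_aspect_batches + 1):
        yield "aspect"
        if b % r == 0:
            yield "document"


steps = list(training_schedule(4, 2))
```

With four aspect-level batches and $r = 2$, the schedule interleaves two document-level updates into the epoch.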
We use the Adam optimizer with learning rate set to $10^{-4}$, and we set the batch size to 32. Learning rate and batch size are set to conventional values without specific tuning for our task.
During training, we randomly sample 20% of the training data from the aspect-level dataset as the development set and only use the remaining 80% for training. We train the model for a fixed number of epochs, and save the model at the epoch with the best F1-I score on the development set for evaluation.
Evaluation metrics. During testing, we extract aspect (opinion) terms, and predict the sentiment for each extracted aspect term based on $\hat{y}^{ae(T)}$ and $\hat{y}^{as(T)}$. Since the extracted aspect term may consist of multiple tokens and the sentiment predictions on them could be inconsistent in AS, we only output the sentiment label of the first token as the predicted sentiment for any extracted aspect term.
We employ five metrics for evaluation, where two measure the AE performance, two measure the AS performance, and one measures the overall performance. Following existing works for AE (Wang et al., 2017; Xu et al., 2018), we use F1 to measure the performance of aspect term extraction and opinion term extraction, which are denoted as F1-a and F1-o respectively. Following existing works for AS (Chen et al., 2017; He et al., 2018b), we adopt accuracy and macro-F1 to measure the performance of AS. We denote them as acc-s and F1-s. Since we are solving the integrated task without assuming that gold aspect terms are given, the two metrics are computed based on the correctly extracted aspect terms from AE. We compute the F1 score of the integrated task, denoted as F1-I, for measuring the overall performance. To compute F1-I, an extracted aspect term is taken as correct only when both the span and the sentiment are correctly identified. When computing F1-a, we consider all aspect terms, while when computing acc-s, F1-s, and F1-I, we ignore aspect terms with conflict sentiment labels.
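The matching rule behind F1-I can be sketched as an exact-match F1 over (sentence, span, sentiment) tuples. This is an illustrative sketch, not the evaluation script used in the experiments: the tuple format and the function name `f1_integrated` are hypothetical, and aspect terms with conflict labels are assumed to have been filtered out beforehand.

```python
def f1_integrated(gold, predicted):
    """F1 over (sentence_id, start, end, sentiment) tuples.

    A predicted aspect term counts as correct only if both its span and
    its sentiment match a gold term exactly.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = [(0, 1, 2, "pos"), (0, 6, 9, "neg")]
pred = [(0, 1, 2, "pos"), (0, 6, 9, "pos"), (0, 13, 14, "neu")]
score = f1_integrated(gold, pred)
```

In this toy case the second prediction has the right span but the wrong sentiment, so it counts as an error for both precision and recall.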

Models under Comparison
Pipeline approach. We select two top-performing models from prior works for each of AE and AS, to construct 2 × 2 pipeline baselines. For AE, we use CMLA (Wang et al., 2017) and DECNN (Xu et al., 2018). CMLA was proposed to perform co-extraction of aspect and opinion terms by modeling their interdependencies. DECNN is the state-of-the-art model for AE. It utilizes a multi-layer CNN structure with both general-purpose and domain-specific embeddings. We use the same structure as encoders in IMN. For AS, we use ATAE-LSTM (denoted as ALSTM for short) (Wang et al., 2016b) and the model from He et al. (2018b), which we denote as dTrans. ALSTM is a representative work with an attention-based LSTM structure. We compare with dTrans as it also utilizes knowledge from document corpora for improving AS performance, which achieves state-of-the-art results.
Thus, we compare with the following pipeline methods: CMLA-ALSTM, CMLA-dTrans, DECNN-ALSTM, and DECNN-dTrans.We also compare with the pipeline setting of IMN, which trains AE and AS independently (i.e., without parameter sharing, information passing, and document-level corpora).We denote it as PIPELINE.The network structure for AE in PIPELINE is the same as DECNN.
During testing of all methods, we perform AE in the first step, and then generate AS predictions on the correctly extracted aspect terms.
Integrated approach. We compare with two recently proposed methods that have achieved state-of-the-art results among integrated approaches: MNN (Wang et al., 2018) and the model from Li et al. (2019), which we denote as INABSA (integrated network for ABSA). Both methods model the overall task as a sequence tagging problem with a unified tagging scheme. Since during testing, IMN only outputs the sentiment on the first token of an extracted aspect term to avoid sentiment inconsistency, to enable fair comparison, we also perform this operation on MNN and INABSA. We also show results for a version of IMN that does not use document-level corpora, denoted as IMN$^{-d}$. The structure of IMN$^{-d}$ is shown as the solid lines in Figure 1. It omits the information $\hat{y}^{ds}$, $a_i^{ds}$, and $a_i^{dd}$ propagated from the document-level tasks in Eq. (4).

Results and Analysis
Main results. Table 3 shows the comparison results. Note that IMN performs co-extraction of aspect and opinion terms in AE, which utilizes additional opinion term labels during training, while the baseline methods except CMLA do not consider this information in their original models. To enable fair comparison, we slightly modify those baselines to perform co-extraction as well, with opinion term labels provided. Further details on model comparison are provided in Appendix B.
From Table 3, we observe that IMN$^{-d}$ is able to significantly outperform other baselines on F1-I. IMN further boosts the performance and outperforms the best F1-I results from the baselines by 2.29%, 1.77%, and 2.61% on D1, D2, and D3. Specifically, for AE (F1-a and F1-o), IMN$^{-d}$ performs the best in most cases. For AS (acc-s and F1-s), IMN outperforms other methods by large margins. PIPELINE, IMN$^{-d}$, and the pipeline methods with dTrans also perform reasonably well on this task, outperforming other baselines by moderate margins. All these models utilize knowledge from larger corpora by either joint training of document-level tasks or using domain-specific embeddings. This suggests that domain-specific knowledge is very helpful, and both joint training and domain-specific embeddings are effective ways to transfer such knowledge.
We also show the results of IMN$^{-d}$ and IMN when only the general-purpose embeddings (without domain-specific embeddings) are used for initialization. They are denoted as IMN$^{-d}$ wo DE and IMN wo DE. The performance of IMN$^{-d}$ degrades without domain-specific embeddings, while it still outperforms all other baselines except DECNN-dTrans. DECNN-dTrans is a very strong baseline as it exploits additional knowledge from larger corpora for both tasks. IMN$^{-d}$ wo DE is competitive with DECNN-dTrans even without utilizing additional knowledge, which suggests the effectiveness of the proposed network structure.
Ablation study. To investigate the impact of different components, we start with a vanilla model which consists of $f_{\theta_s}$, $f_{\theta_{ae}}$, and $f_{\theta_{as}}$ only without any informative message passing, and add other components one at a time. Table 4 shows the results of different model variants. +Opinion transmission denotes the operation of providing additional information $P_j^{op}$ to the self-attention layer as shown in Eq. (1). +Message passing-a denotes propagating the outputs from aspect-level tasks only at each message passing iteration. +DS and +DD denote adding DS and DD with parameter sharing only. +Message passing-d denotes involving the document-level information for message passing. We observe that +Message passing-a and +Message passing-d contribute to the performance gains the most, which demonstrates the effectiveness of the proposed message passing mechanism. We also observe that simply adding document-level tasks (+DS/DD) with parameter sharing only marginally improves the performance of IMN$^{-d}$. This again indicates that domain-specific knowledge has already been captured by domain embeddings, while knowledge obtained from DD and DS via parameter sharing could be redundant in this case. However, +Message passing-d is still helpful with considerable performance gains, showing that aspect-level tasks can benefit from knowing predictions of the relevant document-level tasks.
Impact of T. We have demonstrated the effectiveness of the message passing mechanism. Here, we investigate the impact of the maximum number of iterations $T$. Table 6 shows the change of F1-I on the test sets as $T$ increases. We find that convergence is quickly achieved within two or three iterations, and further iterations do not provide considerable performance improvement.
Case study.To better understand in which conditions the proposed method helps, we examine the instances that are misclassified by PIPELINE and INABSA, but correctly classified by IMN.
For aspect extraction, we find the message passing mechanism is particularly helpful in two scenarios. First, it helps to better recognize uncommon aspect terms by utilizing information from the opinion contexts. As observed in example 1 in Table 5, PIPELINE and INABSA fail to recognize "build" as it is an uncommon aspect term in the training set, while IMN is able to correctly recognize it. We find that when no message passing iteration is performed, IMN also fails to recognize "build". However, when we analyze the predicted sentiment distribution on each token in the sentence, we find that except for "durability", only "build" has a strong positive sentiment, while the sentiment distributions on the other tokens are more uniform. This is an indicator that "build" is also an aspect term. IMN is able to aggregate such knowledge with the message passing mechanism, such that it is able to correctly recognize "build" in later iterations. For the same reason, the message passing mechanism also helps to avoid extracting terms on which no opinion is expressed. As observed in example 2, both PIPELINE and INABSA extract "Pizza". However, since no opinion is expressed in the given sentence, "Pizza" should not be considered as an aspect term. IMN avoids extracting such terms by aggregating knowledge from opinion prediction and sentiment prediction.
For aspect-level sentiment, since IMN is trained on larger document-level labeled corpora with balanced sentiment classes, in general it better captures the meaning of domain-specific opinion words (example 3), better captures sentiments of complex expressions such as negation (example 4), and better recognizes minor sentiment classes in the aspect-level datasets (negative and neutral in our cases). In addition, we find that knowledge propagated by the document-level tasks through message passing is helpful. For example, the sentiment-relevant attention weights are helpful for recognizing uncommon opinion words, which in turn helps to correctly predict the sentiments of the aspect terms. As observed in example 5, PIPELINE and INABSA are unable to recognize "scratches easily" as the opinion term, and they also make a wrong sentiment prediction on the aspect term "aluminum". IMN learns that "scratches" is sentiment-relevant through knowledge from the sentiment-relevant attention weights aggregated via previous iterations of message passing, and is thus able to extract "scratches easily". Since the opinion predictions from AE are sent to the self-attention layer in the AS component, correct opinion predictions further help to infer the correct sentiment towards "aluminum".

Conclusion
We propose an interactive multi-task learning network IMN for jointly learning aspect and opinion term co-extraction, and aspect-level sentiment classification.The proposed IMN introduces a novel message passing mechanism that allows informative interactions between tasks, enabling the correlation to be better exploited.In addition, IMN is able to learn from multiple training data sources, allowing fine-grained token-level tasks to benefit from document-level labeled corpora.The proposed architecture can potentially be applied to similar tasks such as relation extraction, semantic role labeling, etc.

A Implementation Details
CNN-based Encoder
We adopt the multi-layer-CNN structure from Xu et al. (2018) as the CNN-based encoders for both the shared CNNs and the task-specific ones in the proposed network. Each CNN layer has many 1D-convolution filters, and each filter has a fixed kernel size $k = 2c + 1$, such that each filter performs a convolution operation on a window of $k$ word representations, and computes the representation for the $i$th word along with $2c$ nearby words in its context.
Following the settings in the original paper, the first CNN layer in the shared encoder has 128 filters with kernel size $k = 3$ and 128 filters with kernel size $k = 5$. The other CNN layers in the shared encoder and the CNN layers in each task-specific encoder have 256 filters with kernel size $k = 5$ per layer. ReLU is used as the activation function for each CNN layer. Dropout with $p = 0.5$ is employed after the embedding layer and each CNN layer.

Opinion Transmission
To alleviate the problem of unreliable predictions of opinion labels in the early stage of training, we adopt scheduled sampling for opinion transmission during training. We send gold opinion labels rather than the predicted ones generated by AE to AS with probability $\epsilon_i$. The probability $\epsilon_i$ depends on the number of epochs $i$ during training, for which we employ an inverse sigmoid decay $\epsilon_i = 5/(5 + \exp(i/5))$.
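The decay schedule is simple enough to sketch directly from the formula. The function names `gold_probability` and `use_gold_labels` are hypothetical, and sampling once per call is an assumption about granularity (the text does not say whether the choice is made per batch or per instance).

```python
import math
import random


def gold_probability(epoch):
    """Inverse sigmoid decay: probability of sending gold opinion labels."""
    return 5.0 / (5.0 + math.exp(epoch / 5.0))


def use_gold_labels(epoch, rng=random):
    """Sample whether to transmit gold (True) or predicted (False) labels."""
    return rng.random() < gold_probability(epoch)


schedule = [round(gold_probability(e), 3) for e in (0, 10, 30)]
```

At epoch 0 the probability is 5/6, so gold labels dominate early in training, and it decays towards zero as the AE predictions become more reliable.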

B Model Comparison Details
For CMLA, ALSTM, dTrans, and INABSA, we use the officially released source code for experiments. For MNN, we re-implement the model following the descriptions in the paper, as the source code is not available. We run each baseline multiple times with random initializations and save their predicted results. We use a unified evaluation script for measuring the outputs from different baselines as well as the proposed method.
The proposed IMN performs co-extraction of aspect terms and opinion terms in AE, which utilizes additional opinion term labels during model training. Among the baselines, the two integrated methods MNN and INABSA, and the pipeline methods with DECNN as the AE component, do not take opinion information during training. To make a fair comparison, we add the labels {BP, IP} to the original label sets of MNN, INABSA, and DECNN, indicating the beginning of and inside of an opinion term. We train those models on training sets with both aspect and opinion term labels to perform co-extraction as well. In addition, for pipeline methods, we also make the gold opinion terms available to the AS models (ALSTM and dTrans) during training. To make ALSTM and dTrans utilize the opinion label information, we modify their attention layer to assign higher weights to tokens that are more likely to be part of an opinion term. This is reasonable since the objective of the attention mechanism in an AS model is to find the relevant opinion context. The attention weight of the ith token before applying softmax normalization in an input sentence is modified as:
where $a_i$ denotes the attention weight computed by the original attention layer, $p^{op}_i$ denotes the probability that the ith token belongs to any opinion term, and $a'_i$ denotes the modified attention weight. At the training phase, since the gold opinion terms are provided, $p^{op}_i = 1$ for the tokens that are part of the gold opinion terms, while $p^{op}_i = 0$ for the other tokens. At the testing phase, $p^{op}_i$ is computed based on the predictions from the AE model in the pipeline method. It is computed by summing up the predicted probabilities on the opinion-related labels BP and IP for the ith token.

Table 7: Model comparison in a setting without opinion term labels. Average results over 5 runs with random initialization are reported. * indicates the proposed method is significantly better than the other baselines (p < 0.05) based on one-tailed unpaired t-test.

We also present the comparison results in a setting without using opinion term labels in Table 7. (We exclude the results of the pipeline methods with CMLA, as CMLA relies on opinion term labels during training and is difficult to modify.) In this setting, we modify the proposed IMN and IMN$^{-d}$ to recognize aspect terms only in AE. The opinion transmission operation, which sends the opinion term predictions from AE to AS, is omitted as well.
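The per-token opinion probability used by the modified baselines (1 or 0 from gold labels during training, the sum of the predicted BP and IP probabilities at test time) can be sketched as follows. The function names and the dictionary-based representation of the AE output are assumptions made for illustration; only the labels BP and IP come from the text.

```python
OPINION_LABELS = ("BP", "IP")  # beginning of / inside of an opinion term


def opinion_prob_train(in_gold_opinion):
    """Training phase: 1 for tokens inside a gold opinion term, else 0.

    in_gold_opinion : list of booleans, one per token.
    """
    return [1.0 if flag else 0.0 for flag in in_gold_opinion]


def opinion_prob_test(label_probs):
    """Testing phase: sum the predicted probabilities of BP and IP per token.

    label_probs : list of dicts, one per token, mapping a label name to the
                  probability predicted by the AE model (assumed format).
    """
    return [sum(probs.get(label, 0.0) for label in OPINION_LABELS)
            for probs in label_probs]
```

For a token whose predicted distribution puts 0.2 on BP and 0.1 on IP, the test-time value is approximately 0.3.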
Both IMN$^{-d}$ and IMN still significantly outperform other baselines in most cases under this setting. In addition, when comparing the results in Table 7 and Table 3, we observe that IMN$^{-d}$ and IMN consistently yield better F1-I scores on all datasets in Table 3, when opinion term extraction is also considered. Consistent improvements are not observed in other baseline methods when trained with opinion term labels. These findings suggest that knowledge obtained from learning opinion term extraction is indeed beneficial; however, a carefully designed network structure is needed to utilize such information. IMN is designed to exploit task correlations by explicitly modeling interactions between tasks, and thus it better integrates knowledge obtained from training different tasks.
It accepts a sequence of tokens $\{x_1, \ldots, x_n\}$ as input into a feature extraction component $f_{\theta_s}$ that is

Integer r > 0
for e ∈ [1, max-pretrain-epochs] do
    for minibatch B^{ds}, B^{dd} in D^{ds}, D^{dd} do
        compute L_d based on B^{ds} and B^{dd}
        update θ_s, θ_ds, θ_dd
    end for
end for
for e ∈ [1, max-epochs] do
    for b ∈ [1, batches-per-epoch] do
        sample B^a from D^a
        compute L_a based on B^a
        update θ_s, θ_ae, θ_as, θ_re
        if b is divisible by r then
            sample B^{ds}, B^{dd} from D^{ds}, D^{dd}
            compute L_d based on B^{ds} and B^{dd}
            update θ_s, θ_ds, θ_dd
        end if
    end for
end for
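The alternating schedule in the training pseudocode can be traced with a small Python sketch that records only the order of the updates, not the updates themselves. The function name `imn_update_schedule` and the tuple format are hypothetical; "doc" stands for an update of θ_s, θ_ds, θ_dd on L_d, and "aspect" for an update of θ_s, θ_ae, θ_as, θ_re on L_a.

```python
def imn_update_schedule(max_pretrain_epochs, doc_batches_per_epoch,
                        max_epochs, batches_per_epoch, r):
    """Return the sequence of updates as (kind, epoch, batch) tuples.

    Phase 1 pretrains on document-level batches only. Phase 2 trains on
    aspect-level batches and interleaves one document-level update after
    every r-th aspect-level batch.
    """
    if r <= 0:
        raise ValueError("r must be a positive integer")
    schedule = []
    # Phase 1: document-level pretraining.
    for e in range(1, max_pretrain_epochs + 1):
        for b in range(1, doc_batches_per_epoch + 1):
            schedule.append(("doc", e, b))
    # Phase 2: aspect-level training with periodic document-level updates.
    for e in range(1, max_epochs + 1):
        for b in range(1, batches_per_epoch + 1):
            schedule.append(("aspect", e, b))
            if b % r == 0:
                schedule.append(("doc", e, b))
    return schedule
```

With r = 2, every second aspect-level batch is followed by one document-level batch, so a smaller r gives the document-level tasks more influence during the second phase.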

Table 2:
Dataset statistics with numbers of aspect terms and opinion terms.

they are semantically relevant. We fix θ_ds and θ_dd when updating parameters for L_a, since we do not want them to be affected by the small number of aspect-level training instances.

Table 3:
Model comparison. Average results over 5 runs with random initialization are reported. * indicates the proposed method is significantly better than the other baselines (p < 0.05) based on one-tailed unpaired t-test.

−d/IMN wo DE. IMN wo DE performs only marginally below IMN. This indicates that the knowledge captured by domain-specific embeddings could be similar to that captured by joint training of the document-level tasks. IMN$^{-d}$ is more affected

Table 4 :
F1-I scores of different model variants. Average results over 5 runs are reported.
As shown in example 1 in

Table 5 :
Case analysis. The "Examples" column contains instances with gold labels. The "opinion" and "aspect" columns present the opinion terms and aspect terms with sentiments, generated by the corresponding model.

Table 6 :
F1 scores with different T values using IMN$^{-d}$. Average results over 5 runs are reported.