Transferable End-to-End Aspect-based Sentiment Analysis with Selective Adversarial Learning

Joint extraction of aspects and sentiments can be effectively formulated as a sequence labeling problem. However, this formulation limits the effectiveness of supervised methods because annotated sequence data are scarce in many domains. To address this issue, we are the first to explore an unsupervised domain adaptation setting for this task. Prior work can only use common syntactic relations between aspect and opinion words to bridge the domain gaps, which relies heavily on external linguistic resources. To remove this dependence, we propose a novel Selective Adversarial Learning (SAL) method to align the inferred correlation vectors that automatically capture the latent relations between aspect and opinion words. The SAL method dynamically learns an alignment weight for each word, so that more important words receive higher alignment weights, achieving fine-grained (word-level) adaptation. Extensive experiments demonstrate the effectiveness of the proposed SAL method.


Introduction
End-to-End Aspect-Based Sentiment Analysis (E2E-ABSA) aims to jointly detect the aspect terms explicitly mentioned in sentences and predict the sentiment polarities over them (Liu, 2012; Pontiki et al., 2014). For example, in the sentence "The AMD Turin Processor seems to always perform much better than Intel", the user mentions two aspect terms, i.e., "AMD Turin Processor" and "Intel", and expresses positive and negative sentiments over them, respectively.
Typically, prior work formulates E2E-ABSA as a sequence labeling problem over a unified tagging scheme (Mitchell et al., 2013; Zhang et al., 2015; Li et al., 2019a). The unified tagging scheme combines a set of aspect boundary tags (e.g., {B, I, E, S, O}, denoting the beginning of, inside of, end of, single-word, and no aspect term) with sentiment tags (e.g., {POS, NEG, NEU}, denoting positive, negative, or neutral sentiment) to constitute a joint label space for each word. As such, "AMD Turin Processor" and "Intel" should be tagged with {B-POS, I-POS, E-POS} and {S-NEG}, respectively, while the remaining words are tagged with O. This formulation makes joint modeling of the two sub-tasks easier, but it also tends to be low-resource: there are usually few annotated data for each new domain, since labeling each word with a unified tag is time-consuming and expensive.
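The unified tagging scheme described above can be illustrated with a small sketch. The function name `unified_tags` and the (start, end, sentiment) span format are hypothetical choices made for this example, not part of the original formulation.

```python
from typing import List, Tuple


def unified_tags(num_words: int, aspects: List[Tuple[int, int, str]]) -> List[str]:
    """Build unified tags from aspect spans given as (start, end_inclusive, sentiment)."""
    tags = ["O"] * num_words  # words outside any aspect term
    for start, end, sentiment in aspects:
        if start == end:
            tags[start] = f"S-{sentiment}"  # single-word aspect term
        else:
            tags[start] = f"B-{sentiment}"  # beginning of the aspect term
            for i in range(start + 1, end):
                tags[i] = f"I-{sentiment}"  # inside of the aspect term
            tags[end] = f"E-{sentiment}"  # end of the aspect term
    return tags


words = "The AMD Turin Processor seems to always perform much better than Intel".split()
tags = unified_tags(len(words), [(1, 3, "POS"), (11, 11, "NEG")])
for word, tag in zip(words, tags):
    print(f"{word}\t{tag}")
```

Each word receives exactly one tag from the joint label space, so a single sequence labeler predicts both the aspect boundary and its sentiment.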
To alleviate the dependence on domain supervision, we explore an unsupervised domain adaptation setting for E2E-ABSA, which aims to leverage knowledge from a labeled source domain to improve the sequence learning in an unlabeled target domain. The challenges of this setting are two-fold: (1) there exists a large feature distribution shift between domains, since aspect terms in different domains are usually disjoint. For example, users usually mention "pizza" in the Restaurant domain while "camera" is often discussed in the Laptop domain; (2) unlike domain adaptation in traditional sentiment classification (Blitzer et al., 2007), which learns shared sentence or document representations, we need to learn fine-grained (word-level) representations that are domain-invariant for sequence prediction.
Consider the first problem, i.e., what to transfer? Even though aspect terms from different domains behave distinctly, some association patterns between aspect and opinion words are common across domains; e.g., "The pizza is great." from the Restaurant domain and "The camera is excellent." from the Laptop domain share the same syntactic pattern (aspect words→nsubj→opinion words). Inspired by this, existing studies use general syntactic relations as the pivot to bridge the domain gaps for cross-domain aspect extraction (Ding et al., 2017), or aspect and opinion co-extraction (Li et al., 2012; Wang and Pan, 2018). Unfortunately, these methods rely heavily on prior knowledge (e.g., manually designed rules) or external linguistic resources (e.g., dependency parsers), which are inflexible and prone to introducing errors. Instead, we introduce a multi-hop Dual Memory Interaction (DMI) mechanism to automatically capture the latent relations among aspect and opinion words. The DMI iteratively infers the correlation vectors of each word by interacting its local memory (LSTM hidden state) with both the global aspect and opinion memories, such that the inter-correlations between aspects and opinions, and the intra-correlations within aspects or opinions, can be derived.
Second, how to transfer for this sequence prediction task? One straightforward way is to apply domain adaptation methods to align all words within the sentence; however, we observe that this does not yield significant improvements. Although fine-grained adaptation is required, not all words contribute equally to the domain-invariant feature space. Thus, we propose a novel Selective Adversarial Learning (SAL) method to dynamically learn an alignment weight for each word, where more important words receive higher alignment weights, to achieve a local semantic alignment based on adversarial training. Empirically, the proposed model outperforms the state-of-the-art fine-grained adaptation methods by a large margin on four benchmark datasets. We also conduct extensive ablation studies to quantitatively and qualitatively demonstrate the effectiveness of the selectivity of adversarial learning.
Overall, our main contributions are summarized as follows: (1) to the best of our knowledge, we are the first to explore an unsupervised domain adaptation setting for E2E-ABSA; (2) an effective SAL method is proposed to conduct a local semantic alignment for fine-grained domain adaptation; (3) extensive experiments verify the effectiveness of the proposed SAL method.

Task Definition
Single-domain: E2E-ABSA involves both aspect detection (AD) and aspect sentiment (AS) classification tasks, which are formulated as a unified sequence labeling problem. Given an input sequence of words $x = \{w_1, w_2, ..., w_T\}$ and its word embeddings $e = \{e_1, e_2, ..., e_T\}$, the goal is to predict a tag sequence $y = \{y_1, y_2, ..., y_T\}$ over the unified tags, with $y_i \in Y^U$ = {B-POS, I-POS, E-POS, S-POS, B-NEG, I-NEG, E-NEG, S-NEG, B-NEU, I-NEU, E-NEU, S-NEU, O}.
Cross-domain: Here we consider a more challenging unsupervised domain adaptation setting. Given a set of labeled data $D_s = \{(x^s_i, y^s_i)\}_{i=1}^{N_s}$ from a source domain and a set of unlabeled data $D_t = \{x^t_j\}_{j=1}^{N_t}$ from a target domain, we aim to transfer the knowledge of $D_s$ to improve the sequence learning in $D_t$.

Model Description
Overview: As shown in Figure 1, we adopt two stacked bi-directional LSTMs as the base model (Li et al., 2019a) for E2E-ABSA. The upper $\mathrm{LSTM}^U$ is for the high-level ADS (AD+AS) task and predicts the unified tags as output, while the lower $\mathrm{LSTM}^B$ is for the low-level AD task and predicts the aspect boundary tags as the guidance.
To adapt the base model, we design different components in terms of the two problems, i.e., what to transfer and how to transfer, respectively.
(1) To automatically capture the latent relations between aspect and opinion words as transferable knowledge across domains, we introduce a multi-hop Dual Memory Interaction (DMI) mechanism between the two LSTMs. At each hop, e.g., the 1st hop, each local memory $h^B_i$ interacts with both the global aspect and opinion memories, i.e., $m^1_a$ and $m^1_o$, based on the DMI, to produce two correlation vectors for aspect and opinion word co-detection, where opinion detection is used as an auxiliary task for the AD task. The "local" memory denotes the hidden representation ($\mathrm{LSTM}^B$ hidden state) of each word within the sentence, whereas the two "global" memories are shared by all input sentences, as is common in memory networks (Sukhbaatar et al., 2015; Kumar et al., 2016), and can be seen as high-level representations for aspect and opinion words, respectively. The A-attention and O-attention then aggregate the most relevant aspect or opinion word information to refine the two global memories for the next hop.
(2) To adapt these relations across domains, we propose a Selective Adversarial Learning (SAL) method to dynamically focus on aligning the aspect words between domains. This is because the informative aspect words contribute more to the shared feature space than the uninformative words tagged with O in the sentence (Zhou et al., 2019b). As such, an aspect tagger trained on a source domain can work well when applied to a target domain. Specifically, at the final hop, we adopt a domain discriminator for each word with a gradient reversal layer (Ganin et al., 2016) to perform domain adversarial learning over its correlation vector (alignment), while the A-attention module provides an aspect attention distribution as a selector to control a learnable alignment weight for each word (selectivity). Finally, each aligned correlation vector is used to predict aspect boundary tags (AD task) and fed to $\mathrm{LSTM}^U$ for the unified tag prediction (ADS task). In the following sections, we detail each component.

Base Model
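We adopt two stacked bi-directional LSTMs as the base model. We link these two LSTM layers so that the hidden representations generated by $\mathrm{LSTM}^B$ can be fed to $\mathrm{LSTM}^U$ as the guidance information. Specifically, their hidden representations $h^B_i \in \mathbb{R}^{dim^B_h}$ and $h^U_i \in \mathbb{R}^{dim^U_h}$ are calculated as follows:
$$h^B_i = [\overrightarrow{\mathrm{LSTM}}(e_i); \overleftarrow{\mathrm{LSTM}}(e_i)], \qquad h^U_i = [\overrightarrow{\mathrm{LSTM}}(h^B_i); \overleftarrow{\mathrm{LSTM}}(h^B_i)].$$
The probability scores $z^B_i \in \mathbb{R}^{|Y^B|}$ over the aspect boundary tags $Y^B$ = {B, I, E, S, O} are calculated by a fully-connected softmax layer:
$$z^B_i = \mathrm{softmax}(W_B h^B_i + b_B).$$
Similarly, the scores $z^U_i \in \mathbb{R}^{|Y^U|}$ over the unified tags $Y^U$ defined in Section 2 are obtained as:
$$z^U_i = \mathrm{softmax}(W_U h^U_i + b_U).$$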

Global-Local Memory Interaction
Before detailing the DMI module, we first introduce the Global-Local Memory Interaction (GLMI), which describes the interaction between a local memory $h_i \in \mathbb{R}^{dim_h}$ and a global memory $m \in \mathbb{R}^{dim_h}$. Formally, we parameterize the GLMI as $f(h_i, m; \Theta, G)$, with $\Theta = \{W, b\}$ and $G$, which consists of a residual transformation and a tensor product operation. Specifically, we first incorporate the global memory information $m$ into each local position with a residual transformation, which yields the transformed local memory $\tilde{h}_i$ by adding to $h_i$ a transformation, parameterized by $\Theta$, of the concatenation $[h_i : m]$, where $[:]$ denotes the vector concatenation. As such, the global memory can distill more correlated local information, and the two memories are mapped into the same space. Then we compute a correlation vector $r_i \in \mathbb{R}^K$ that encodes the strength of correlations between the global memory $m$ and the transformed local memory $\tilde{h}_i$ through a tensor product operation as:
$$r_i = \tilde{h}_i^\top G\, m, \quad \text{i.e.,} \quad r_{i,k} = \tilde{h}_i^\top G_k\, m \ \text{ for } k = 1, \dots, K,$$
where the tensor $G \in \mathbb{R}^{dim_h \times dim_h \times K}$ can be seen as multiple bilinear matrices that model $K$ kinds of latent relations between two objects. The $k$-th slice of $G$, i.e., $G_k \in \mathbb{R}^{dim_h \times dim_h}$, denotes one type of latent relation that interacts with the two vectors to constitute one type of composition.
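To make the GLMI concrete, the following is a minimal pure-Python sketch of the residual transformation followed by the tensor product. The ReLU non-linearity inside the residual branch is an assumption made for this illustration (the exact activation is not recoverable from the text above), and the function name `glmi` and the toy values are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def glmi(h: Vector, m: Vector, W: Matrix, b: Vector, G: List[Matrix]) -> Vector:
    """Correlation vector r (length K) between local memory h and global memory m."""
    concat = h + m  # vector concatenation [h : m]
    # Residual transformation. ASSUMPTION: ReLU is used as the non-linearity.
    h_tilde = [
        h[d] + max(0.0, sum(W[d][j] * concat[j] for j in range(len(concat))) + b[d])
        for d in range(len(h))
    ]
    # Tensor product: each slice G_k defines one bilinear form h_tilde^T G_k m.
    return [
        sum(h_tilde[p] * G_k[p][q] * m[q] for p in range(len(h)) for q in range(len(m)))
        for G_k in G
    ]


# Toy example with dim_h = 2 and K = 2 latent relations.
h = [1.0, 2.0]
m = [0.5, -1.0]
W = [[0.0] * 4 for _ in range(2)]  # zero weights, so h_tilde equals h here
b = [0.0, 0.0]
G = [
    [[1.0, 0.0], [0.0, 1.0]],  # slice 1: plain dot product
    [[0.0, 1.0], [1.0, 0.0]],  # slice 2: cross-dimension interaction
]
r = glmi(h, m, W, b, G)
print(r)
```

Each of the K slices scores a different way in which the local and global memories can be related, which is why the output is a vector rather than a single similarity score.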

Dual Memory Interaction
Following the notations in Section 3.2, we further define a global aspect memory $m_a \in \mathbb{R}^{dim^B_h}$ and a global opinion memory $m_o \in \mathbb{R}^{dim^B_h}$, and use the $\mathrm{LSTM}^B$ hidden states $h^B_i$ as the local memories. The global aspect and opinion memories are able to capture highly correlated aspect or opinion words from the local memories, respectively. Since aspect words are often collocated with opinion words across domains, their associations can act as the pivot information to bridge the domain gaps. To automatically capture their latent relations within the sentences, at the $l$-th hop, each local memory $h^B_i$ interacts with the global memories $m^l_a$ and $m^l_o$ through the Dual Memory Interaction (DMI) shown in Figure 2, to produce two correlation vectors $r^l_{a,i}$ and $r^l_{o,i}$ for aspect and opinion co-detection, where $G_a$, $G_o$ and $G_{ao}$ denote the composition tensors that model the latent relations between aspect and aspect, opinion and opinion, and aspect and opinion, respectively. The correlation vector measures the association strength between local and global memories; e.g., if $h^B_i$ for the word $w_i$ is both highly intra-correlated with the aspect memory $m_a$ and inter-correlated with the opinion memory $m_o$, then $w_i$ is more likely to be an aspect term. The two correlation vectors are then transformed into a scalar aspect attention (A-attention) weight and a scalar opinion attention (O-attention) weight $\alpha^l_{p,i}$, respectively, with $p \in \{a, o\}$ denoting the aspect or opinion, which indicates how likely each word in the sentence is to be an aspect word or an opinion word, where $W_p$ is the weight of the attention module. The aspect or opinion attention weights $\alpha^l_{p,i}$ summarize the local memories to update the global aspect and opinion memories, respectively, yielding $m^{l+1}_a$ and $m^{l+1}_o$ for the next hop. The updates gradually refine the global memories to incorporate more relevant candidates based on the attention mechanism. In the DMI, all parameters are shared across hops and domains.
At the final $L$-th hop, we use $r^L_{a,i}$ for the AD task and feed it to $\mathrm{LSTM}^U$ for the ADS task. For the auxiliary opinion detection task, we feed $r^L_{o,i}$ into a softmax layer to predict the probability scores $z^O_i \in \mathbb{R}^{|Y^O|}$ over the opinion labels $Y^O$, i.e., whether a word is an opinion word or not, as:
$$z^O_i = \mathrm{softmax}(W_O\, r^L_{o,i} + b_O).$$
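The attention and memory-update step of one DMI hop can be sketched as follows. This is only an illustration under stated assumptions: the attention weights are taken to be a softmax over word positions of a linear score of each correlation vector, and the global memory is assumed to be updated additively with the attention-weighted sum of local memories. Both forms, as well as the names `attention_weights` and `update_memory`, are assumptions for this sketch rather than the exact equations of the method.

```python
import math
from typing import List

Vector = List[float]


def attention_weights(corr_vectors: List[Vector], w_p: Vector) -> Vector:
    """Score each word's correlation vector, then normalize over the sentence."""
    scores = [sum(w * r for w, r in zip(w_p, r_i)) for r_i in corr_vectors]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def update_memory(memory: Vector, local_memories: List[Vector], alphas: Vector) -> Vector:
    """ASSUMED additive update: m^{l+1} = m^l + sum_i alpha_i * h_i."""
    summary = [
        sum(a * h[d] for a, h in zip(alphas, local_memories))
        for d in range(len(memory))
    ]
    return [m_d + s_d for m_d, s_d in zip(memory, summary)]


# Toy sentence with three words; the second word has the strongest correlation.
corr = [[0.0, 0.0], [2.0, 1.0], [0.0, 0.0]]
local = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas = attention_weights(corr, w_p=[1.0, 1.0])
new_memory = update_memory([0.0, 0.0], local, alphas)
print(alphas, new_memory)
```

Under these assumptions, the word with the strongest correlation dominates the refreshed global memory, which is the refinement behavior the multi-hop design relies on.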

Selective Adversarial Learning
To make the captured relations domain-invariant, we propose a Selective Adversarial Learning (SAL) method to dynamically align the words with a high probability of falling within the aspect boundaries, i.e., of being an aspect word. Specifically, we introduce a domain discriminator for each word, which aims to identify the domain label $y^D_i \in \mathbb{R}^{|Y^D|}$ of the input word, i.e., whether the word in the sentence is from the source or the target domain. Meanwhile, the feature extractor aims to produce a domain-invariant correlation vector $r^L_{a,i}$ that cannot be distinguished by the domain discriminator, via a Gradient Reversal Layer (GRL) (Ganin et al., 2016). Mathematically, we formulate the GRL as a 'pseudo-function' $R_\lambda(x) = x$ with a reversed gradient $\frac{\partial R_\lambda(x)}{\partial x} = -\lambda I$, where $\lambda$ is the adaptation rate. The correlation vector $r^L_{a,i}$ is fed to the GRL before the domain discriminator, which predicts the probability scores $z^D_i \in \mathbb{R}^{|Y^D|}$ over the domain labels $Y^D$. Meanwhile, the aspect attention weight $\alpha^L_{a,i}$ at the final hop serves as a selector that provides a learnable alignment weight for each word. Thus, the selective domain adversarial loss $L_D$ is a cross-entropy loss, weighted by $\alpha^L_{a,i}$, over all the words from the labeled source data $D_s$ and unlabeled target data $D_t$. Existing studies (Yosinski et al., 2014; Mou et al., 2016) have already shown some evidence that low-level neural layer features (i.e., low-level tasks) are more easily transferred to different tasks or domains. Thus, we choose to align $r^L_{a,i}$ from the low-level AD task instead of the feature $h^U_i$ from the high-level ADS task. Our ablation studies also confirm this assumption.
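A minimal sketch of the two ingredients described above, written in plain Python without an autograd library: a gradient reversal layer that is the identity in the forward pass and multiplies the incoming gradient by $-\lambda$ in the backward pass, and a per-word cross-entropy weighted by the aspect attention. The class and function names are hypothetical, and the loss is an unnormalized sum over words, which is an assumption since the exact normalization is not given in the text.

```python
import math
from typing import List


class GradientReversal:
    """Pseudo-function R_lambda: identity forward, gradient scaled by -lambda backward."""

    def __init__(self, lam: float) -> None:
        self.lam = lam

    def forward(self, x: List[float]) -> List[float]:
        return list(x)  # R_lambda(x) = x

    def backward(self, grad_output: List[float]) -> List[float]:
        return [-self.lam * g for g in grad_output]  # dR/dx = -lambda * I


def selective_adversarial_loss(
    domain_probs: List[List[float]],  # per-word discriminator output over {source, target}
    domain_labels: List[int],         # per-word gold domain index
    align_weights: List[float],       # per-word aspect attention alpha_i
) -> float:
    """Attention-weighted cross-entropy (ASSUMPTION: plain sum, no normalization)."""
    return -sum(
        a * math.log(p[y])
        for p, y, a in zip(domain_probs, domain_labels, align_weights)
    )


grl = GradientReversal(lam=0.1)
features = grl.forward([1.0, -2.0])
reversed_grad = grl.backward([1.0, -2.0])

probs = [[0.5, 0.5], [0.9, 0.1]]
labels = [0, 0]
selective = selective_adversarial_loss(probs, labels, [0.2, 0.8])
uniform = selective_adversarial_loss(probs, labels, [1.0, 1.0])  # no selectivity
print(features, reversed_grad, selective, uniform)
```

Setting every alignment weight to 1 recovers pure adversarial learning that treats all words equally, which corresponds to the non-selective variant compared against in the ablation study.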

Alternating Training
The primary task loss $L_M$ consists of the cross-entropy losses for both the guided AD and main ADS tasks on the labeled source data $D_s$. The auxiliary opinion detection loss $L_O$ is the cross-entropy loss on the labeled source data $D_s$ and unlabeled target data $D_t$. Traditionally, we can directly optimize the joint loss of Eqs. (1)-(3), i.e., $E = L_M + \rho L_O + \gamma L_D$, to obtain both discriminative and domain-invariant word representations, where $\rho$ and $\gamma$ are the trade-off factors. However, we found that the optimization process tends to be unstable, since it may be hard to jointly optimize many objectives. Thus, we propose an empirical alternating strategy that trains $L_M + \rho L_O$ and $L_D$ iteratively, which separates the whole word representation learning into a discriminative stage and a domain-invariant stage. Let $\theta_f$, $\theta_w$, and $\theta_d$ denote the parameters for feature learning of each word, the word predictors for the AD, ADS and opinion detection tasks, and the domain discriminators, respectively. Based on our strategy, we seek the parameters $(\theta_f, \theta_w, \theta_d)$ that deliver a saddle point of $E$ across the two stages. At the saddle point, the feature learning parameters $\theta_f$ minimize the word prediction losses (i.e., the features are discriminative) in the first stage. In the second stage, the domain classification loss is minimized by the domain discriminator parameters $\theta_d$ while being maximized by the feature learning parameters $\theta_f$ via the GRL (i.e., the features are domain-invariant).
[...] (Pontiki et al., 2014). Following the setup in Li et al. (2019a), R is the union set of the restaurant datasets from the SemEval ABSA challenges 2014, 2015, and 2016 (Pontiki et al., 2014, 2015, 2016). D is a combination of device reviews from 5 different digital products provided by Hu and Liu (2004). S is introduced by Toprak et al. (2010) and contains reviews from web services. Detailed statistics are shown in Table 1.
Settings: We construct 10 transfer pairs of the form $D_s \to D_t$ from the four domains mentioned above; we do not use the pairs L→D and D→L, as these two domains are very similar. Note that for the unsupervised domain adaptation setting, no labels are available for the target domain. Therefore, for each transfer pair, the training dataset is the combination of the labeled training data of the source domain and the unlabeled training data of the target domain. The labeled testing data of the source domain serve as the validation set, and the testing data of the target domain serve as the evaluation set. We report the results for both the AD and ADS tasks. The evaluation metric is the Micro-F1 score under exact match, which means that an output segment is considered correct only if it exactly matches the gold standard span of the aspect term for the AD task, or the aspect term and its sentiment for the ADS task. All experiments are repeated 5 times and we report the average results over the 5 runs.
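The exact-match Micro-F1 described above can be sketched as follows. The span decoding rule used here (a multi-word span counts only if it runs from a B tag to an E tag with a consistent sentiment) is a simplifying assumption for illustration, and the function names are hypothetical; the evaluation script behind the reported numbers may handle ill-formed tag sequences differently.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_inclusive, sentiment)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Decode unified tags into aspect spans (ASSUMED strict B..E decoding)."""
    spans: Set[Span] = set()
    start, sentiment = None, None
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        boundary, label = tag.split("-")
        if boundary == "S":
            spans.add((i, i, label))
            start = None
        elif boundary == "B":
            start, sentiment = i, label
        elif boundary == "I":
            if label != sentiment:
                start = None  # inconsistent sentiment breaks the open span
        elif boundary == "E":
            if start is not None and label == sentiment:
                spans.add((start, i, label))
            start = None
    return spans


def micro_f1(gold_seqs: List[List[str]], pred_seqs: List[List[str]],
             with_sentiment: bool = True) -> float:
    """Micro-F1 over all sentences; drop the sentiment to score the AD task."""
    tp = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold_seqs, pred_seqs):
        gold = extract_spans(gold_tags)
        pred = extract_spans(pred_tags)
        if not with_sentiment:
            gold = {(s, e) for s, e, _ in gold}
            pred = {(s, e) for s, e, _ in pred}
        tp += len(gold & pred)
        n_gold += len(gold)
        n_pred += len(pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


gold = [["O", "B-POS", "I-POS", "E-POS", "O", "S-NEG"]]
pred = [["O", "B-POS", "I-POS", "E-POS", "O", "S-POS"]]
ads_f1 = micro_f1(gold, pred, with_sentiment=True)
ad_f1 = micro_f1(gold, pred, with_sentiment=False)
print(ads_f1, ad_f1)
```

In this toy case both boundaries are found but one sentiment is wrong, so the AD score is perfect while the ADS score is halved, which shows why ADS is the stricter of the two tasks.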

Implementation details
The word embeddings are 100-dimensional word2vec (Mikolov et al., 2013) vectors pre-trained on the combination of the Yelp Challenge dataset and the electronics dataset from Amazon reviews. Out-of-vocabulary words are randomly initialized from a uniform distribution $U(-0.25, 0.25)$. The dimensions of the two LSTM layers, $dim^B_h$ and $dim^U_h$, are both set to 100. The number of hops $L$ is set to 2. The number of bilinear interactions $K$ is set to 50. The weight matrices are initialized from a uniform distribution $U(-0.2, 0.2)$. The adaptation rate $\lambda$ is 0.1 and the trade-off factor $\rho$ is 1.0. For training, the model is optimized by the Adam algorithm (Kingma and Ba, 2014) with an initial learning rate of 0.001. The batch size is 64, with half of each batch coming from the source domain and half from the target domain. Gradients with an $\ell_2$ norm larger than 40 are rescaled to have norm 40. To alleviate overfitting, we apply dropout with rate 0.5 on the word embeddings $e_i$ and the learned word representations $r^l_{a,i}$, $r^l_{o,i}$, and $h^U_i$. We use the same hyper-parameters, which are tuned on 10% randomly held-out training data of the source domain in R→L, for all transfer pairs.
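The gradient rescaling rule mentioned above can be sketched in a few lines. The sketch treats the gradient as one flat vector; whether the norm is computed globally over all parameters or per parameter tensor is not stated in the text, so the global form here is an assumption, and `clip_by_l2_norm` is a hypothetical name.

```python
import math
from typing import List


def clip_by_l2_norm(grads: List[float], max_norm: float = 40.0) -> List[float]:
    """Rescale the gradient so that its l2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # small gradients are left untouched
    scale = max_norm / norm
    return [g * scale for g in grads]  # direction kept, norm set to max_norm


large = clip_by_l2_norm([30.0, 40.0])  # norm 50, rescaled to norm 40
small = clip_by_l2_norm([3.0, 4.0])    # norm 5, unchanged
print(large, small)
```

Rescaling preserves the gradient direction, so it limits the step size without changing which way the parameters move.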

Baselines
We compare with several state-of-the-art fine-grained adaptation methods.
• TCRF: Transferable CRF that uses a linear-chain CRF for sequence prediction based on shared non-lexical features across domains, e.g., POS tags and dependency relations.
• RAP (Li et al., 2012): A cross-domain Relational Adaptive Bootstrapping method that iteratively expands target aspect and opinion lexicons according to common opinion words and syntactic relations.
• Hier-Joint (Ding et al., 2017): A recurrent neural network (RNN) with manually designed rule-based auxiliary tasks based on common syntactic relations among aspect and opinion words.
• RNSCN (Wang and Pan, 2018): A recursive neural structural correspondence network that incorporates syntactic structures and exploits an auto-encoder to denoise relation labels generated from the parser.
As the first to address cross-domain E2E-ABSA, we have to adapt all the baselines, which were originally proposed for cross-domain aspect detection or aspect and opinion co-detection, to return ADS results by replacing their aspect boundary tags with the unified tags. Without the proposed stacking architecture, the baselines cannot perform the auxiliary AD task at the same time. Thus, we only report their ADS results. Besides, E2E-ABSA aims to simultaneously learn aspect terms along with their sentiments. Thus, the ADS is exactly our main task, while the AD is only an auxiliary task used for guidance.
To be more convincing, we extend the neural models (i.e., Hier-Joint and RNSCN) to more powerful baselines, named Hier-Joint+ and RNSCN+, with the proposed stacking architecture. Both of them stack an additional RNN layer on top of the original framework to produce the unified tags, while the lower layer predicts the aspect boundary tags as the guidance. The validity of such extensions is supported by the fact that the extended versions achieve even better AD performance than the original versions. We use the source code of the baselines for experiments. For fair comparison, all baselines use the same pre-trained word embeddings, and the baselines that require opinion labels use the same opinion lexicon.
Table 3: Ablation results (%). ∆ refers to the improvements of the full model over ablation methods. The marker † means that the full model significantly outperforms the best ablation model ADS-SAL with p-value < 0.01.

Main Results
Based on the results in Table 2, we have the following observations:
• Our model consistently and significantly achieves the best results on almost all transfer pairs, outperforming the strongest baseline RNSCN+ by 3.07% and 7.62% Micro-F1 on average for the AD and ADS tasks, respectively. Our model can automatically model complicated relations among aspect and opinion words via the DMI as transferable knowledge. Besides, the proposed SAL method can dynamically learn an alignment weight for each word to achieve a local semantic alignment, which distills a better shared feature space and further improves the performance.
• Traditional non-neural methods like TCRF and RAP perform very poorly due to their reliance on hand-crafted features. Our model outperforms Hier-Joint and RNSCN, which are neural models, by 10.92% and 8.72% Micro-F1 on average for the ADS task, respectively. Both of them rely on a dependency parser to exploit syntactic relations, which makes them inflexible, because they are not end-to-end, and may introduce external errors.
• The extended versions Hier-Joint+ and RNSCN+ further improve the performance, which shows the benefit of the guidance from the low-level AD task. However, our model still outperforms them by a large margin, which demonstrates the effectiveness of the proposed methods.

Ablation Study
To investigate the effectiveness of each component, we conduct an ablation study to compare our full model with different ablation variants:
• Base Model (SO / TO): it uses two stacked Bi-LSTMs as the Base Model. SO (Source Only) and TO (Target Only) denote that the base model is only trained on the labeled data from the source and target domain, respectively. We usually refer to them as a lower bound and an upper bound, respectively.
• Base Model+DMI: it uses two stacked Bi-LSTMs with a multi-hop dual memory interaction (DMI) between them.
• AD-AL: it performs pure adversarial learning (removing the selective weight $\alpha^L_{a,i}$ from Eq. (1)) on each correlation vector $r^L_{a,i}$ for the low-level AD task.
• AD-SAL: it advances the AD-AL by conducting selective adversarial learning.
• ADS-SAL: it conducts selective adversarial learning on each word representation $h^U_i$ for the high-level ADS task.
Table 4: Case analysis for the R→L pair. Note that we only show the sentiment part of the unified labels (i.e., POS, NEG, and NEU) and use brackets to indicate the boundary. The marker denotes an incorrect prediction.
Note that AD-AL, AD-SAL (the full model), and ADS-SAL all use the same architecture as the Base Model+DMI. Based on the results in Table 3, we have the following observations:
• [...] that the original word hidden representations ($\mathrm{LSTM}^B$ hidden states) are not suitable for transfer. Thus, we need to resort to the correlation vectors inferred by the DMI, which models the transferable latent relations between aspect and opinion words.
• No SAL v.s. SAL: AD-SAL significantly and consistently exceeds Base Model+DMI by 6.46% and 4.92% Micro-F1 on average for the AD and ADS tasks, respectively. Without any adaptation, the captured relations by the DMI may not work well across domains, while the proposed SAL method can effectively align these latent relations to be domain-invariant.
• No Selectivity v.s. Selectivity: AD-SAL outperforms AD-AL by 5.18% and 4.11% Micro-F1 on average for the AD and ADS tasks, respectively. This supports the necessity of conducting selective alignment. The SAL method can dynamically learn to control an alignment weight for each word to achieve a local semantic alignment, which captures a better domain-invariant feature space than pure adversarial learning, which treats all words equally for the fine-grained adaptation.
• Low-level v.s. High-level: AD-SAL exceeds ADS-SAL by 2.31% and 1.94% Micro-F1 on average for the AD and ADS tasks, respectively. The label space of the unified tags for the high-level ADS task is more complicated than that of the aspect boundary tags for the low-level AD task. This provides evidence that low-level neural features are more easily transferred than high-level features.

Case Analysis
As illustrated in Table 4, we perform a case analysis of the R→L pair for the Base Model+DMI, AD-AL, and the full model AD-SAL to demonstrate the necessity of conducting selective alignment for fine-grained adaptation. The Base Model+DMI can identify some domain-specific aspect words (e.g., battery, ports) without any supervision from the target domain. This is because the DMI can infer relational representations that capture some common latent relations between aspect and opinion words. However, it still cannot completely capture multi-word aspect terms (e.g., bluetooth device, battery life), and sometimes it totally ignores them (e.g., window 7). The AD-AL performs pure adversarial learning to align all words in a sentence. Even though a domain adaptation method is adopted, it does not yield significant improvements, and sometimes it performs worse. For example, the AD-AL cannot even identify aspect words that are captured by the Base Model+DMI (e.g., ports, devices), or it wrongly identifies some non-aspect words (e.g., pc). The reason is that pure adversarial learning treats all the words equally for the alignment, which may bring noise from uninformative words into the shared feature space. To solve this, the full model AD-SAL performs a local semantic alignment to dynamically focus on aligning the aspect words that contribute more to the domain-invariant feature space. Hence, the AD-SAL model can precisely and completely identify all the aspect terms and make correct unified tag predictions.
Moreover, in Figure 3, we visualize the attention weights of these models, where deeper colors denote larger weights. Compared with the Base Model+DMI, the full model AD-SAL can precisely attend to the complete aspect words from the target domain (A-attention), i.e., bluetooth device and usb device, and make correct predictions, while the AD-AL cannot. The AD-AL can only align all the words equally, which hinders the model from attending to the aspect words, while the A-attention of the AD-SAL model serves both for discriminative word tag prediction and as a learnable alignment weight for each word. This shows that the proposed SAL method can learn to align important aspect words to improve the transferability of the model for fine-grained adaptation.

Related Works
E2E-ABSA can be broken into two sub-tasks, namely, aspect detection and aspect sentiment classification. Aspect detection aims to extract the aspect terms mentioned in the text and has been actively studied (Qiu et al., 2011; Liu et al., 2015; Poria et al., 2016; Wang et al., 2016a; He et al., 2017; Majumder et al., 2018; Li et al., 2018b; Xu et al., 2018). Aspect sentiment classification aims to predict the sentiment polarities of the given aspect terms and has also received a lot of attention recently (Dong et al., 2014; Tang et al., 2016; Wang et al., 2016b; Ma et al., 2017; Chen et al., 2017; Ma et al., 2018; He et al., 2018b; Li et al., 2018a, 2019b). For practical applications, a typical way is to pipeline these two sub-tasks together, which becomes ineffective due to the errors accumulated across tasks. These two sub-tasks are strongly coupled, and thus a unified formulation (Mitchell et al., 2013; Zhang et al., 2015; Li et al., 2019a) that handles them together in an end-to-end manner becomes a more promising direction. Despite its importance, existing studies only explore the performance in a single domain, while ignoring the transferability across domains. To address this problem, unsupervised domain adaptation methods can be applied. However, existing methods focus on traditional cross-domain sentiment classification to learn shared representations for sentences or documents, including pivot-based methods (Blitzer et al., 2007; Pan et al., 2010; Bollegala et al., 2013; Yu and Jiang, 2016), auto-encoders (Glorot et al., 2011; Chen et al., 2012; Zhou et al., 2016), domain adversarial networks (Ganin et al., 2016; Li et al., 2018c), or semi-supervised methods (He et al., 2018a).
Due to the difficulties in fine-grained adaptation, there exist very few methods for cross-domain aspect extraction (Ding et al., 2017), which acts as a sub-task of E2E-ABSA, or for aspect and opinion co-extraction (Li et al., 2012; Wang and Pan, 2018), which focuses on detecting aspect and opinion words, while E2E-ABSA needs to analyze more complicated correspondences between them. Besides, those methods can only rely on general syntactic relations between aspect and opinion words to bridge the domains. Different from them, our method leverages the attention mechanism (Bahdanau et al., 2014; Sukhbaatar et al., 2015; Shen et al., 2018, 2019) as a dynamic selector to automatically achieve the selective alignment.

Conclusion
The effectiveness of supervised methods for E2E-ABSA is limited by data scarcity. Our work is the first attempt to address cross-domain E2E-ABSA by leveraging knowledge from other related domains to enhance the sequence learning of the target domain. Extensive experiments show the effectiveness of the proposed SAL method. Ablation studies also support the necessity of performing selective alignment. In the future, the proposed SAL method could potentially be extended to other domain adaptation methods and applied to more general sequence labeling tasks, including named entity recognition (Zhou et al., 2019c), part-of-speech tagging (Zhou et al., 2019a), etc.