Syntactically Aware Cross-Domain Aspect and Opinion Terms Extraction

A fundamental task of fine-grained sentiment analysis is aspect and opinion terms extraction. Supervised-learning approaches have shown good results for this task; however, they fail to scale across domains where labeled data is lacking. Non pre-trained unsupervised domain adaptation methods that incorporate external linguistic knowledge have proven effective in transferring aspect and opinion knowledge from a labeled source domain to unlabeled target domains; however, pre-trained transformer-based models like BERT and RoBERTa already exhibit substantial syntactic knowledge. In this paper, we propose a method for incorporating external linguistic information into a self-attention mechanism coupled with the BERT model. This enables leveraging the intrinsic knowledge existing within BERT together with externally introduced syntactic information to bridge the gap across domains. We demonstrate enhanced results on three benchmark datasets.


Introduction
A fundamental task of fine-grained sentiment analysis is aspect and opinion terms extraction. For example, in the sentence "The chocolate cake was incredible", the aspect term is chocolate cake and the opinion term is incredible. Most of the work related to aspect and opinion term extraction is formulated as a supervised sequence-tagging task. RNN-based models (Liu et al., 2015) and Transformer-based models showed promising results in single-domain setups where the training and the testing data are from the same domain. However, these approaches typically do not scale across different domains, where only unlabeled data is available for the target domain, since aspect terms from two different domains are usually semantically different and are hence separated in the embedding space. For example, frequent aspect terms in the restaurant domain, like salad and dessert, have little or no semantic relatedness to frequent aspect terms in the laptop domain, like screen size and battery life. To date, only a handful of approaches for unsupervised domain adaptation of aspect and opinion term extraction have been proposed.
It has been shown that syntactic information is important for identifying aspect and opinion terms (Hu and Liu, 2004b; Qiu et al., 2011). A recent line of work encodes dependency-based aspect extraction rules (Ding et al., 2017) or automatically-generated dependency relations (Wang and Jialin Pan, 2018; Wang and Pan, 2020) as auxiliary supervision for non pre-trained models. This line of work demonstrates effective domain adaptation by incorporating syntactic knowledge into non pre-trained models during their training step. At the same time, recent studies (Clark et al., 2019; Htut et al., 2019) show that pre-trained transformer-based models such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) already exhibit substantial linguistic knowledge. In this paper we examine whether incorporating external syntactic knowledge into pre-trained models contributes to bridging the gap across domains. For this purpose, we propose an approach for unsupervised domain adaptation of aspect and opinion terms extraction based on incorporating linguistic knowledge into a pre-trained BERT model. Specifically, inspired by Strubell et al. (2018), we incorporate externally-generated dependency relations into a self-attention mechanism that is coupled with the pre-trained BERT model (Stickland and Murray, 2019), where the external information is introduced during the fine-tuning and testing stages of the model.
Figure 1: An example of opinionated sentences from two different domains with similar syntactic patterns. Opinion terms are colored green and aspect terms are colored blue.

Motivation and Background
Formally, the task of aspect and opinion terms extraction can be formulated as a sequence tagging task. The input is a sequence of tokens $X = \{x_1, x_2, \ldots, x_n\}$, and the objective is to predict a corresponding sequence of labels $Y = \{y_1, y_2, \ldots, y_n\}$ with $y_i \in \{BA, IA, BO, IO, N\}$, where BA and BO mark the beginning of an aspect and an opinion term, respectively, IA and IO mark the inside of an aspect and an opinion term, respectively, and N marks all other tokens. The goal of unsupervised domain adaptation is to predict the token-level labels $y_i^T$ of unlabeled target domain sentences $D^T = \{(x_i^T)\}$, given a set of labeled sentences from a source domain $D^S = \{(x_i^S, y_i^S)\}$.
It was observed that aspect and opinion terms maintain often-occurring syntactic patterns (Hu and Liu, 2004b; Qiu et al., 2011). Consider, for example, a sentence from the laptop domain, "The display is absolutely wonderful", and a sentence from the restaurant domain, "The cheesecake was simply wonderful". In the first sentence, an NSUBJ dependency relation exists between the opinion term ('wonderful') and the aspect term ('display'). If the pattern aspect-NSUBJ-opinion is frequently observed in the laptop domain, it can be used to extract the term cheesecake as an aspect term in the restaurant domain (Figure 1). This domain-independent trait of the syntactic structure can be leveraged for transferring knowledge from a labeled source domain to an unlabeled target domain. Based on this notion, Ding et al. (2017) proposed using dependency-based aspect extraction rules as auxiliary supervision for an RNN network. However, this method depends on the quality of manually-crafted rules. Wang and Jialin Pan (2018) addressed this issue by automatically encoding dependency relations into the hidden representations of words, thus shifting the representations of different aspect terms that have similar dependency relations closer to each other.
Wang and Pan (2020) have further enhanced this model by integrating a conditional domain-adversarial network that encodes both word features and syntactic parent relation types.
Analyses of pre-trained transformer-based models like BERT reveal substantial syntactic information captured within their attention mechanisms; however, those analyses also show that for many syntactic relations BERT only slightly improves over a simple baseline (Clark et al., 2019; Htut et al., 2019). Our goal is to design a neural network model that leverages both the information captured in the pre-trained model and externally introduced syntactic information to bridge the gap between the source and target domains.
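To make the tagging formulation at the start of this section concrete, the following minimal Python sketch converts a label sequence over {BA, IA, BO, IO, N} into aspect and opinion terms. The function name `extract_spans` and the choice to ignore an IA or IO label that does not continue a term of the same type are assumptions made for illustration; they are not taken from the paper's code.

```python
from typing import List, Tuple


def extract_spans(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group BA/IA and BO/IO label runs into (kind, term) pairs."""
    spans: List[Tuple[str, str]] = []
    current_kind = None          # "aspect", "opinion", or None when outside a term
    current_tokens: List[str] = []

    for token, label in zip(tokens, labels):
        continues = (label == "IA" and current_kind == "aspect") or (
            label == "IO" and current_kind == "opinion"
        )
        if continues:
            current_tokens.append(token)
            continue

        # Any other label closes the term that was open, if there was one.
        if current_kind is not None:
            spans.append((current_kind, " ".join(current_tokens)))
            current_kind, current_tokens = None, []

        # BA / BO open a new term; N and stray IA / IO labels are skipped.
        if label == "BA":
            current_kind, current_tokens = "aspect", [token]
        elif label == "BO":
            current_kind, current_tokens = "opinion", [token]

    if current_kind is not None:
        spans.append((current_kind, " ".join(current_tokens)))
    return spans


tokens = ["The", "chocolate", "cake", "was", "incredible"]
labels = ["N", "BA", "IA", "N", "BO"]
print(extract_spans(tokens, labels))
```

For the example sentence from the Introduction, the sketch yields chocolate cake as the aspect term and incredible as the opinion term.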

The Proposed Model
The basis for our model is a pre-trained BERT-base model (Devlin et al., 2019) with a fully connected sequence tagging classifier on top. Inspired by the work of Strubell et al. (2018), we incorporate dependency relations into a self-attention mechanism, forming a syntactically-aware attention head. Our approach differs from previous approaches, which modify an existing self-attention head within a transformer-based model and train it from scratch. Our method modifies the BERT function by adding syntactically-aware self-attention heads in parallel to the BERT model (Stickland and Murray, 2019), and introduces the syntactic knowledge during the fine-tuning and testing stages. This leaves the original pre-trained model intact, enabling the model to utilize both the external linguistic information that is incorporated into the model and the intrinsic knowledge gained during the pre-training stage of the model. We refer to this model as syntactically-aware extended attention layer (SA-EXAL).
Figure 2: Coupling a syntactically-aware self-attention with a multi-head self-attention layer in a BERT model.
Multi-Head Self-Attention. The basis of our implementation is BERT's multi-head self-attention mechanism (Vaswani et al., 2017), which consists of $I$ scaled dot-product attention heads. For each attention head $i$, the hidden token representations $h^l \in \mathbb{R}^{d \times T}$, at the input of layer $l$, are projected to key, query and value representations $K_i$, $Q_i$ and $V_i$ of dimensions $T \times d_k$, where $T$ is the number of tokens in the input sequence and $d_k = d/I$. Attention head $i$ produces attention weights that form a distinct distribution for every input token over all tokens in the sequence:
$$A_i = \mathrm{softmax}\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right)$$
The output of attention head $i$ is denoted by
$$\mathrm{SA}_i(h^l) = A_i V_i$$
Syntactically-Aware Self-Attention. Inspired by the work of Strubell et al. (2018), we incorporate syntactic information into the self-attention head, forming a syntactically-aware self-attention, by encouraging it to attend to specific tokens corresponding to the syntactic structure of the sentence. As in the original attention heads, we project $h^l$ to obtain $K_{parse}$, $Q_{parse}$ and $V_{parse}$ matrix representations of dimensions $T \times d_k$, but unlike the original heads, we also use an external syntactic parser (Dozat and Manning, 2017) to generate $P_{parse}$, a $T \times T$ matrix in which each row $t$ represents the probability of each token in the sentence being the syntactic head of token $t$. We encourage this self-attention head to attend to the syntactic head of each token by performing an element-wise multiplication between $P_{parse}$ and the dot product between the key and query matrices:
$$A_{parse} = \mathrm{softmax}\left(\frac{P_{parse} \odot \left(Q_{parse} K_{parse}^\top\right)}{\sqrt{d_k}}\right)$$
As in the original heads, the output of the syntactically-aware self-attention head is denoted by:
$$\mathrm{SA}_{parse}(h^l) = A_{parse} V_{parse}$$
Adding Syntactically-Aware Self-Attention to BERT.
Inspired by the work of Stickland and Murray (2019), we modify the BERT(·) function by adding a syntactically-aware self-attention head in parallel to each self-attention layer of the BERT model (see Figure 2) as follows:
$$h^{l+1} = \mathrm{LN}\left(h^l + \mathrm{SA}(h^l) + \mathrm{SA}_{parse}(h^l)\right)$$
where $\mathrm{SA}(h^l)$ is the output of BERT's original self-attention layer, LN is BERT's layer normalization function and $h^l \in \mathbb{R}^{d \times T}$ are the $T$ hidden token representations at the input of layer $l$. Note that the contribution of the $\mathrm{SA}_{parse}(h^l)$ component to the representation of each token $t$ in layer $l + 1$ is mostly the representation of the syntactic head of token $t$. This shifts the representations of aspect terms from distinct domains that syntactically relate to the same opinion term closer to each other, thus contributing to bridging the gap between the domains.
Data & Experimental Setup. Our experimental setup follows that of Wang and Pan (2020). We conduct experiments on benchmark datasets with customer reviews from three different domains: restaurant, laptop and digital devices. The restaurant domain combines reviews from SemEval 2014 (Pontiki et al., 2014) and SemEval 2015 (Pontiki et al., 2015). The laptop domain contains laptop reviews from SemEval 2014. Opinion term labels for these domains are obtained from Wang et al. (2016). For the device domain, we use reviews from Hu and Liu (2004a) pertaining to five different digital products. Each token in each sentence is labeled as described in Section 2. In order to make robust comparisons and to be comparable with previous work, for each domain we create three random splits of the data with a train/development/test ratio of 3:1:1 (see Table 1). Since results may vary across random seeds (Dodge et al., 2020), we repeat each experiment using three different seeds, and the final result is reported as the mean F1 score (and standard deviation) calculated over the three splits and the three seeds. We adopt the HuggingFace (Wolf et al., 2019) implementation of the BERT-base (uncased) model as the basis for all experiments, and open-source our code.
We fine-tune the model with a learning rate of 5e-5, a batch size of 16 and a maximum sequence length of 64 tokens, for 10 epochs with an early stopping mechanism according to the development set. The dependency relations obtained by the Biaffine parser (Dozat and Manning, 2017) are generated in advance and are introduced to the model during fine-tuning as well as during the development/test stages. Following prior work, only exact matches between the predicted aspect and opinion terms and the gold labels are counted as correct.
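Since the parser output is computed in advance and passed to the model alongside the tokens, it may help to see how such a matrix could enter an attention head. The following is a minimal pure-Python sketch, not the released implementation: it assumes token representations stored row-wise as a T × d list of lists, a single head without bias terms, and the hypothetical helper names `matmul`, `softmax_rows` and `syntactic_attention`. The head-probability matrix `p_parse` is multiplied element-wise with the query-key scores before the softmax.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two row-major matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(scores: Matrix) -> Matrix:
    """Numerically stable softmax applied to each row independently."""
    result = []
    for row in scores:
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def syntactic_attention(
    h: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix, p_parse: Matrix
) -> Tuple[Matrix, Matrix]:
    """One syntactically-aware head: returns (head output, attention weights).

    h is T x d, the projection matrices are d x d_k, and p_parse is T x T with
    row t holding the probability of each token being the syntactic head of t.
    """
    q, k, v = matmul(h, w_q), matmul(h, w_k), matmul(h, w_v)
    d_k = len(q[0])
    k_transposed = [list(col) for col in zip(*k)]
    qk = matmul(q, k_transposed)
    # Element-wise product of parser probabilities and scaled dot-product scores.
    scores = [
        [p * s / math.sqrt(d_k) for p, s in zip(p_row, s_row)]
        for p_row, s_row in zip(p_parse, qk)
    ]
    weights = softmax_rows(scores)
    return matmul(weights, v), weights


# Toy input: three tokens, d = d_k = 2, identity projections (illustrative values).
identity = [[1.0, 0.0], [0.0, 1.0]]
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
p_parse = [[0.1, 0.1, 0.8], [0.2, 0.1, 0.7], [0.05, 0.05, 0.9]]
output, weights = syntactic_attention(h, identity, identity, identity, p_parse)
print(weights)
```

Note that in this formulation a zero entry in `p_parse` sets the corresponding score to zero rather than masking it out, so the softmax still assigns that token some attention weight.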
Results. Table 2 shows a comparison of our proposed model (SA-EXAL) with notable baseline models, across different domain transfers. The baselines include:
• CrossCRF (Jakob and Gurevych, 2010): A linear-chain CRF with hand-engineered features (e.g. POS tags and dependencies).
• Hier-Joint (Ding et al., 2017): An RNN with auxiliary labels derived from manually designed rules that are based on frequently observed syntactic relations between aspect and opinion terms.
• RNCRF (Wang et al., 2016): A joint recursive neural network and CRF for in-domain aspect and opinion terms extraction.
• ARNN-GRU (Wang and Pan, 2020): A dependency-tree-based recursive neural network with GRU which uses an auto-encoder in the auxiliary task to reduce label noise.
• TRNN-GRU (Wang and Pan, 2020): An extension of ARNN-GRU which integrates a conditional domain-adversarial network that takes both word features and syntactic head relations as input.
• EXAL: A baseline model that shares the same size and structure as the proposed model SA-EXAL (Section 3) but does not incorporate syntactic information.
Our proposed model (SA-EXAL) shows an advantage over EXAL, which demonstrates that although the pre-trained BERT model has been shown to capture significant linguistic knowledge, informing it with explicit external dependency relations is effective for transferring knowledge across domains. Specifically, SA-EXAL outperforms EXAL in 10 out of 12 cases (underlined in the table), including 6.44%, 3.56% and 2.33% improvements for L → R (AS), R → L (AS) and R → D (AS), respectively. We also note that SA-EXAL outperforms the non pre-trained model baselines in 8 out of 12 cases.
Table 2: [...] (Wang and Pan, 2020). The best result for each dataset is highlighted in bold and the best result between EXAL and SA-EXAL is underlined.
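The scores compared here are F1 values in which a predicted term counts as correct only when it matches a gold term exactly, averaged over splits and seeds. A minimal Python sketch of that scoring is given below. The span representation (sentence index, first token, last token), the function names, and the use of the sample standard deviation are assumptions for illustration; the paper does not specify them.

```python
from statistics import mean, stdev
from typing import List, Set, Tuple

# A term is identified by (sentence index, first token index, last token index).
Span = Tuple[int, int, int]


def exact_match_f1(predicted: Set[Span], gold: Set[Span]) -> float:
    """F1 where a predicted term is correct only if its boundaries equal a gold term."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def summarize(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over runs (e.g. splits x seeds)."""
    return mean(scores), stdev(scores)


gold = {(0, 1, 2), (0, 4, 4)}
predicted = {(0, 1, 2), (0, 3, 4)}  # the second prediction overlaps a gold term but is not exact
print(exact_match_f1(predicted, gold))  # 0.5
```

Under this criterion a partially overlapping prediction, such as the second span in the example, lowers both precision and recall.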

Conclusion
We propose a method for incorporating external linguistic information into a self-attention mechanism coupled with the BERT model. We demonstrate that this model leverages both the intrinsic knowledge existing within the pre-trained model and the externally introduced syntactic information, to bridge the gap across domains.