DomBERT: Domain-oriented Language Model for Aspect-based Sentiment Analysis

This paper focuses on learning domain-oriented language models driven by end tasks, aiming to combine the worlds of general-purpose language models (such as ELMo and BERT) and domain-specific language understanding. We propose DomBERT, an extension of BERT that learns from both an in-domain corpus and relevant domain corpora. This helps learn domain language models under low-resource conditions. Experiments are conducted on an assortment of tasks in aspect-based sentiment analysis (ABSA), demonstrating promising results.


Introduction
Pre-trained language models (LMs) (Peters et al., 2018; Radford et al., 2018, 2019; Devlin et al., 2019) aim to learn general (or mixed-domain) knowledge for end tasks. Recent studies (Xu et al., 2019; Gururangan et al., 2020) show that learning domain-specific LMs is equally important. This is because the training corpus of general LMs is out-of-domain for end tasks in a particular domain and, more importantly, because general LMs may not capture long-tailed and underrepresented domain details. An intuitive example related to the corpus of aspect-based sentiment analysis (ABSA) can be found in Table 1, where all masked words sky, water, idea, screen and picture can appear in a mixed-domain corpus. A general-purpose LM may favor frequent examples and ignore long-tailed choices in certain domains.
In contrast, although domain-specific LMs can capture fine-grained domain details, they may suffer from an insufficient training corpus (Gururangan et al., 2020) to strengthen general knowledge within a domain. To this end, we propose a domain-oriented learning task that aims to combine the benefits of both the general and the domain-specific worlds.

Domain-oriented Learning: Given a target domain t and a set of diverse source domains S = {s_1, s_2, . . .}, perform (language model) learning that focuses on t and all its relevant domains in S.

This learning task resolves the issues of both the general and the domain-specific worlds. On one hand, LM training no longer needs to spend effort on unrelated domains (e.g., Books is one big domain but not very related to Laptop); on the other hand, although an in-domain corpus may be limited, other relevant domains can share a great amount of knowledge (e.g., Desktop in Table 1) to make the in-domain corpus more diverse and general.

This paper proposes an extremely simple extension of BERT (Devlin et al., 2019) called DomBERT to learn domain-oriented language models. DomBERT simultaneously learns masked language modeling and discovers relevant domains (with a built-in retrieval model (Lewis et al., 2020)) from which to draw training examples; the relevant domains are identified via domain embeddings learned from an auxiliary task of domain classification. We apply DomBERT to end tasks in aspect-based sentiment analysis (ABSA) in low-resource settings, demonstrating promising results. The code will be released at https://github.com/howardhsu/BERT-for-RRC-ABSA.


Related Work

Pre-trained language models have brought significant improvements over a wide spectrum of NLP tasks (Minaee et al., 2020), including ELMo (Peters et al., 2018), GPT/GPT-2 (Radford et al., 2018, 2019), BERT (Devlin et al., 2019), XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2019), and ELECTRA (Clark et al., 2019). This paper extends BERT's masked language model (MLM) with domain knowledge learning. Following RoBERTa, the proposed DomBERT leverages dynamic masking, removes the next sentence prediction (NSP) task (which has been shown to have negative effects on the pre-trained parameters), and packs training examples to the maximum length so that MLM fully utilizes the available computation. This paper also borrows ALBERT's removal of dropout, since pre-training an LM is, in general, an underfitting task that calls for more parameters rather than measures against overfitting.
The proposed domain-oriented learning task can be viewed as a type of transfer learning (Pan and Yang, 2009): it implicitly learns a strategy that transfers training examples from relevant (source) domains to the target domain. This transfer is carried out throughout the training of DomBERT.
The experiments of this paper focus on aspect-based sentiment analysis (ABSA), which typically requires a lot of domain-specific knowledge. Reviews serve as a rich resource for sentiment analysis (Pang et al., 2002; Hu and Liu, 2004; Liu, 2012, 2015). ABSA aims to turn unstructured reviews into structured fine-grained aspects (such as the aspect "battery" or an aspect category of a laptop) and their associated opinions (e.g., "good battery" is positive about the aspect battery). This paper focuses on three popular tasks in ABSA: aspect extraction (AE) (Hu and Liu, 2004; Li and Lam, 2017), aspect sentiment classification (ASC) (Hu and Liu, 2004; Dong et al., 2014; Nguyen and Shirai, 2015; Li et al., 2018; Tang et al., 2016; Wang et al., 2016a,b; Ma et al., 2017; Chen et al., 2017; Tay et al., 2018; He et al., 2018), and end-to-end ABSA (E2E-ABSA) (Li et al., 2019a,b). AE aims to extract aspects (e.g., "battery"), ASC identifies the polarity of a given aspect (e.g., positive for battery), and E2E-ABSA is a combination of AE and ASC that detects aspects and their associated polarities simultaneously. This paper focuses on self-supervised methods to improve ABSA.

DomBERT
This section presents DomBERT, an extension of BERT for domain knowledge learning. The goal of DomBERT is to discover relevant domains from the pool of source domains and to use their training examples for masked language model learning. To this end, DomBERT samples from a categorical distribution over all domains (including the target domain) to retrieve examples of relevant domains. Learning such a distribution requires estimating the similarity between each source domain and the target domain. DomBERT learns an embedding for each domain and computes these similarities from the embeddings; the domain embeddings are learned through an auxiliary task called domain classification.

Domain Classification
Given a pool of source and target domains, one can easily form a classification task on domain tags: each text document carries its domain label l. Following RoBERTa's max-length training examples, we pack different texts from the same domain, up to the maximum length, into a single training example.
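As an illustration, the packing step might look like the following minimal sketch. The greedy packing strategy, the truncation of over-long texts, and the helper name pack_domain_texts are our assumptions; only the 512-token limit and the same-domain constraint come from the description above.

    from transformers import BertTokenizerFast

    MAX_LEN = 512
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

    def pack_domain_texts(texts, domain_label):
        # Greedily concatenate tokenized texts from one domain into examples
        # of at most MAX_LEN tokens, each tagged with its domain label for
        # the domain classification task.
        examples, current = [], []
        for text in texts:
            ids = tokenizer.encode(text, add_special_tokens=False)
            ids = ids[:MAX_LEN - 2]                 # truncate over-long texts
            if current and len(current) + len(ids) > MAX_LEN - 2:  # reserve [CLS]/[SEP]
                examples.append({"input_ids": current, "domain": domain_label})
                current = []
            current = current + ids
        if current:
            examples.append({"input_ids": current, "domain": domain_label})
        return examples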
Let the number of source domains be |S| = n; the total number of domains (including the target domain) is then n + 1. Let h_[CLS] denote the hidden state of the [CLS] token of BERT, which serves as the document-level representation of one example. We first pass this hidden state through a dense layer (with weights W and bias b) to reduce its size to m, and then through a dense layer D ∈ R^((n+1) × m) to compute the logits l̂ over all domains:

l̂ = D (W h_[CLS] + b),

where m is the size of the dense layer and D, W, and b are trainable weights. Besides acting as a dense layer, D is essentially a concatenation of domain embeddings:

D = [d_1; d_2; . . . ; d_(n+1)], with d_i ∈ R^m the embedding of domain i.

We then apply the cross-entropy loss to the logits and the label to obtain the loss of domain classification:

L_DC = CrossEntropy(l̂, l).

To encourage the diversity of domain embeddings, we further compute a regularizer among the domain embeddings:

L_reg = || D̄ D̄^T − I ||^2_F,

where D̄ is the row-normalized D and I is the identity matrix. Minimizing this regularizer encourages the learned embeddings to be closer to orthogonal (and thus diverse) to each other. Finally, we add the losses of domain classification, BERT's masked language model, and the regularizer:

L = L_MLM + λ (L_DC + L_reg),

where λ controls the ratio between the masked language model loss and the domain classification losses.
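A minimal PyTorch sketch of this head and the combined loss is given below. The exact form of the regularizer, the initialization, and all variable names are our reconstruction from the description above, not the authors' released code.

    import torch
    import torch.nn.functional as F

    class DomainHead(torch.nn.Module):
        # Sketch of the domain classification head: reduce h_[CLS] to size m,
        # score it against all n+1 domain embeddings, and return the domain
        # classification loss plus the orthogonality regularizer.
        def __init__(self, hidden_size=768, m=64, num_domains=4680):
            super().__init__()
            self.reduce = torch.nn.Linear(hidden_size, m)                       # W, b
            self.domain_emb = torch.nn.Parameter(0.02 * torch.randn(num_domains, m))  # D

        def forward(self, h_cls, domain_labels):
            z = self.reduce(h_cls)                        # (batch, m)
            logits = z @ self.domain_emb.t()              # (batch, n+1)
            loss_dc = F.cross_entropy(logits, domain_labels)
            d = F.normalize(self.domain_emb, dim=-1)      # unit-norm rows
            gram = d @ d.t()
            eye = torch.eye(gram.size(0), device=gram.device)
            loss_reg = ((gram - eye) ** 2).mean()         # push embeddings apart
            return loss_dc, loss_reg

    # Combined objective for one step (loss_mlm comes from BERT's MLM head):
    # loss = loss_mlm + lambda_ * (loss_dc + loss_reg)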

Domain Sampler
As a side product of domain classification, DomBERT has a built-in data sampling process that draws examples from both the target domain and relevant domains for further learning. This process follows a single categorical distribution over all domains, which ensures that a good number of examples from both the target domain and the relevant domains are sampled. As such, it is important that the target domain t always has the highest probability of being sampled.
To this end, we use cosine similarity as the similarity function, which has the property that cos(d_t, d_t) = 1 always holds. For an arbitrary domain i, the probability P_i of domain i being sampled is computed with a softmax over the domain similarities:

P_i = exp(cos(d_t, d_i) / τ) / Σ_j exp(cos(d_t, d_j) / τ),

where τ is the temperature (Hinton et al., 2015) that controls the importance of highly-ranked domains versus long-tailed domains.
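As a small sketch (the function name and tensor layout are our assumptions), the sampling distribution can be computed as:

    import torch.nn.functional as F

    def domain_sampling_probs(domain_emb, target_idx, tau=0.1):
        # Softmax over cosine similarities between the target domain embedding
        # d_t and every domain embedding d_i, with temperature tau.
        d = F.normalize(domain_emb, dim=-1)   # rows become unit vectors
        sims = d @ d[target_idx]              # cos(d_t, d_i) for all i
        return F.softmax(sims / tau, dim=0)   # categorical distribution P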
To form a mini-batch for the next training step, we sample domains following the categorical distribution s ∼ P and retrieve the next available example from each sampled domain. To support this, we maintain a randomly shuffled queue of examples for each domain; when the examples of a domain are exhausted, a new randomly shuffled queue is generated for that domain.
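The queue-based retrieval above might be implemented along the following lines; the class and function names are ours and the batch size is illustrative.

    import random
    import torch

    class DomainQueues:
        # Per-domain example queues: each domain keeps a randomly shuffled
        # queue that is rebuilt whenever it runs out of examples.
        def __init__(self, examples_by_domain):
            self.pools = examples_by_domain    # dict: domain_id -> list of examples
            self.queues = {d: random.sample(v, len(v)) for d, v in self.pools.items()}

        def next_example(self, domain_id):
            if not self.queues[domain_id]:     # exhausted: reshuffle the pool
                pool = self.pools[domain_id]
                self.queues[domain_id] = random.sample(pool, len(pool))
            return self.queues[domain_id].pop()

    def sample_batch(queues, probs, batch_size=32):
        # Draw one domain per batch slot from the categorical distribution P,
        # then pop the next example from that domain's queue.
        domain_ids = torch.multinomial(probs, batch_size, replacement=True)
        return [queues.next_example(int(d)) for d in domain_ids]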

Datasets
We apply DomBERT to end tasks in aspect-based sentiment analysis from the SemEval datasets, focusing on the Laptop and Restaurant domains. We choose three end tasks: aspect extraction (AE), aspect sentiment classification (ASC), and end-to-end ABSA (E2E-ABSA).
For AE, we choose SemEval-2014 Task 4 for laptop and SemEval-2016 Task 5 for restaurant, to be consistent with previous work. For ASC, we use SemEval-2014 Task 4 for both laptop and restaurant, as existing research frequently uses this version. We hold out 150 examples from the training set of each of these datasets for validation. For E2E-ABSA, we adopt the formulation of Li et al. (2019a), where the laptop data comes from SemEval-2014 Task 4 and the restaurant data is a combination of the SemEval 2014-2016 datasets.
Based on the domains of the end tasks from the SemEval datasets, we explore the large-scale unlabeled corpora from the Amazon review datasets (He and McAuley, 2016) and the Yelp dataset (https://www.yelp.com/dataset/challenge, 2019 version). Following Xu et al. (2019), we select all laptop reviews from the Electronics department, which yields a corpus of about 100 MB. Similarly, we simulate a low-resource setting for restaurants and randomly select about 100 MB of reviews whose first category tag is Restaurants from the Yelp reviews. For the source domains S, we choose all reviews from the 5-core version of the Amazon review datasets and all Yelp reviews, excluding Laptop and Restaurants. Note that Yelp is not solely about restaurants but also covers other location-based domains such as car services, banks, theatres, etc. This yields a total of 4,680 domains, of which n = 4,679 are source domains. The total size of the corpus is about 20 GB.
The number of examples for each domain is plotted in Figure 1, where the distribution of domains is heavily long-tailed.
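For concreteness, the restaurant-domain selection could be sketched as below. The file layout follows the public Yelp dataset JSON releases, and streaming until the size target (rather than true random sampling) is a simplification of the random selection described above; field names and the helper name are assumptions.

    import json

    TARGET_BYTES = 100 * 1024 * 1024   # roughly 100 MB of review text

    def collect_restaurant_reviews(business_path, review_path, out_path):
        # Collect Yelp businesses whose first category tag is "Restaurants",
        # then stream their reviews into a plain-text corpus until ~100 MB.
        restaurant_ids = set()
        with open(business_path) as f:
            for line in f:
                business = json.loads(line)
                categories = (business.get("categories") or "").split(", ")
                if categories and categories[0] == "Restaurants":
                    restaurant_ids.add(business["business_id"])

        written = 0
        with open(review_path) as f, open(out_path, "w") as out:
            for line in f:
                review = json.loads(line)
                if review["business_id"] in restaurant_ids:
                    out.write(review["text"].replace("\n", " ") + "\n")
                    written += len(review["text"])
                    if written >= TARGET_BYTES:
                        break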

Hyper-parameters
We adopt BERT-base (uncased) as the basis for all experiments due to the limited computational power of our academic setting. We choose the hidden size of the domain embeddings m = 64 to ensure that the regularizer term in the loss does not consume too much GPU memory, and τ = 0.1 as the temperature of the domain sampler. The maximum sequence length of DomBERT is 512, consistent with BERT. We use Adamax (Kingma and Ba, 2014) as the optimizer, with a learning rate of 5e-5.
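Collected in one place, the hyper-parameters reported in this section amount to the following (batch size and number of training steps are not reported here and are omitted):

    config = {
        "base_model": "bert-base-uncased",   # BERT-base, uncased
        "domain_emb_size": 64,               # m
        "sampler_temperature": 0.1,          # tau
        "max_seq_length": 512,
        "optimizer": "Adamax",
        "learning_rate": 5e-5,
    }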

Compared Methods
BERT: the vanilla BERT-base pre-trained model from Devlin et al. (2019), used to show the performance of BERT without any domain adaptation.

BERT-Review: post-trains BERT on all (mixed-domain) Amazon review datasets and Yelp datasets in a way similar to the original BERT training. We train on the whole corpus for 4 epochs, which took about 10 days of training (much longer than DomBERT).

BERT-DK: a baseline borrowed from Xu et al. (2019), which post-trains BERT on a large-scale in-domain corpus for each SemEval domain and is therefore not a low-resource case. We use this baseline to show that DomBERT can reach competitive performance.

DomBERT: the model proposed in this paper. (We do not compare DomBERT with LMs that require extra, directly or indirectly, annotated data.)

Evaluation Metrics
For AE, we use the F1 score. For ASC, we compute both accuracy and Macro-F1 over the 3 polarity classes; Macro-F1 is the main metric, since the imbalanced classes introduce biases into accuracy. Examples with the conflict polarity are dropped, as in Tang et al. (2016). For E2E-ABSA, we adopt the evaluation script of Li et al. (2019a), which reports precision, recall, and F1. Results are averages of 10 runs.
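As a small sketch (the function name is ours), the ASC metrics can be computed with scikit-learn as follows:

    from sklearn.metrics import accuracy_score, f1_score

    def asc_metrics(y_true, y_pred):
        # Accuracy and Macro-F1 over the three polarity classes; "conflict"
        # examples are assumed to have been dropped beforehand, as stated above.
        return {
            "accuracy": accuracy_score(y_true, y_pred),
            "macro_f1": f1_score(y_true, y_pred, average="macro"),
        }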

Result Analysis and Discussion
AE: The AE results are shown in Table 5.

ASC: ASC is a more domain-agnostic task because most sentiment words (e.g., "good" and "bad") are shared across domains. As such, in Table 6, we notice that ASC for restaurant is more domain-specific than for laptop. DomBERT is worse than BERT-Review on laptop because a 20+ GB corpus can learn general-purpose sentiment better, and BERT-DK is better than DomBERT because a much larger in-domain corpus matters more for this task.

E2E-ABSA: By combining AE and ASC, E2E-ABSA exhibits more domain-specificity, as shown in Table 7. In this case, we can see the full strength of DomBERT, because it learns both general and domain-specific knowledge well. BERT-Review is poor, probably because it spends much of its capacity on irrelevant domains such as Books.
We further investigate the relevant domains discovered by DomBERT in Table 8. The results match our intuition: most of the discovered domains are closely related to laptop and restaurant, respectively.

Conclusions
We propose DomBERT, which automatically exploits training corpora from domains relevant to a target domain. Experiments demonstrate that DomBERT is promising for ABSA.