Learning Domain Representation for Multi-Domain Sentiment Classification

Training data for sentiment analysis are abundant in some domains, yet scarce in others. It is useful to leverage the data available across all existing domains to enhance performance on each of them. We investigate this problem by learning domain-specific representations of input sentences using neural networks. In particular, a descriptor vector is learned to represent each domain, and is used to map adversarially trained, domain-general Bi-LSTM input representations into domain-specific representations. Based on this model, we further expand the input representation with exemplary domain knowledge, collected by attending over a memory network of domain training data. Results show that our model significantly outperforms existing methods on multi-domain sentiment analysis, giving the best accuracies on two different benchmarks.


Introduction
Sentiment analysis has received constant research attention due to its importance to business (Pang et al., 2002; Hu and Liu, 2004; Choi and Cardie, 2008; Socher et al., 2012; Vo and Zhang, 2015; Tang et al., 2014). For multiple domains, such as movies, restaurants and digital products, manually annotated datasets have been made available. A useful research question is how to leverage resources available across all domains to improve sentiment classification on a certain domain.
One naive domain-agnostic baseline is to combine all training data, ignoring domain differences. However, domain knowledge is one valuable source of information available. To utilize this, there has been recent work on domain-aware models via multi-task learning (Liu et al., 2016; Nam and Han, 2016), building an output layer for each domain while sharing a representation network. Given an input sentence and a specific test domain, the output layer of the test domain is chosen for calculating the output.
These methods have been shown to improve over the naive domain-agnostic baseline. However, a limitation is that outputs for different domains are constructed from the same domain-agnostic input representation, which leads to weak utilization of domain knowledge. Sentiment words can differ across domains. For example, the word "beast" can be a positive indicator of camera quality, but irrelevant to restaurants or movies. Also, "easy" is frequently used in the electronics domain to express positive sentiment (e.g. the camera is easy to use), while expressing negative sentiment in the movie domain (e.g. the ending of this movie is easy to guess).
We address this issue by investigating a model that learns domain-specific input representations for multi-domain sentiment analysis. In particular, given an input sentence, our model first uses a bidirectional LSTM to learn a general sentence-level representation. For better utilizing data from all domains, we use adversarial training (Ganin and Lempitsky, 2015;Goodfellow et al., 2014) on the Bi-LSTM representation.
The general sentence representation is then mapped into a domain-specific representation by attention over the input sentence using explicitly learned domain descriptors, so that the parts of the input most salient to the specific domain are selected for sentiment classification. Some examples are shown in Figure 2, where our model pays attention to the word "engaging" for movie reviews, but not for laptops, restaurants or cameras. Similarly, the word "beast" receives attention for laptops and cameras, but not for restaurants or movies.
In addition to the domain descriptors, we further introduce a memory network for explicitly representing domain knowledge. Here domain knowledge refers to example training data in a specific domain, which can offer useful background context. For example, given a sentence 'Keep cool if you think it's a wonderful life will be a heartwarming tale about life like finding nemo', algorithms can mistakenly classify it as positive based on 'wonderful' and 'heartwarming', ignoring the fact that 'it's a wonderful life' is a movie. In this case, necessary domain knowledge revealed in other sentences, such as 'The last few minutes of the movie: it's a wonderful life don't cancel out all the misery the movie contained', is helpful. Given a domain-specific input representation, we attend over the domain knowledge memory network to obtain a background context vector, which is used in conjunction with the input representation for sentiment classification. Results on two real-world datasets show that our model outperforms the aforementioned multi-task learning methods for domain-aware training, and also generalizes to unseen domains. Our code is released.

Problem Definition
Formally, we assume the existence of m sentiment datasets {D_i}_{i=1}^m. Each D_i contains data points (s_j^i, y_j^i, d_i), where s_j^i is a sequence of words w_1, w_2, ..., w_{|s_j^i|}, each drawn from a vocabulary V; y_j^i indicates the sentiment label (e.g. y_j^i ∈ {−1, +1} for binary sentiment classification); and d_i is a domain indicator (since we use 1 to m to number each domain, d_i = i). The task is to learn a function f which maps each input (s_j^i, d_i) to its corresponding sentiment label y_j^i. The challenge of the task lies in improving the generalization performance of the mapping function f both in-domain and cross-domain by exploiting the correlations between different domains.

Domain-Agnostic Model
One naive baseline solution ignores the domain characteristics when learning f . It simply combines the datasets {D_i}_{i=1}^m into one and learns a single mapping function f . We refer to this baseline as Mix, which is depicted in Figure 1 (a).
Given an input s_j^i, its word sequence w_1, w_2, ..., w_{|s_j^i|} is fed into a word embedding layer to obtain embedding vectors x_1, x_2, ..., x_{|s_j^i|}. The word embedding layer is parameterized by an embedding matrix E_w ∈ R^{K×|V|}, where K is the embedding dimension.
Bidirectional LSTM: To acquire a semantic representation of input s_j^i, a bidirectional extension (Graves and Schmidhuber, 2005) of the Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) is applied to capture sentence-level semantics both left-to-right and right-to-left. As a result, two sequences of hidden states are obtained, one for each direction. We concatenate the forward and backward states at each time step to obtain the hidden states h_1, h_2, ..., h_{|s_j^i|}, each of size 2K.
Output Layer: Average pooling (Boureau et al., 2010) is applied over the hidden states h_1, h_2, ..., h_{|s_j^i|} to obtain an input representation I_j^i for s_j^i:

I_j^i = (1 / |s_j^i|) Σ_t h_t    (1)

Finally, softmax is applied over I_j^i to obtain a probability distribution over all sentiment labels. During training, cross entropy is used as the loss function, denoted L(f(s_j^i), y_j^i) for data points (s_j^i, d_i, y_j^i), and AdaGrad (Duchi et al., 2011) is applied to update the parameters.
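The Mix forward pass can be sketched as follows. This is a minimal NumPy, forward-only illustration: the packed-gate layout, weight shapes and random initialization are our own choices for exposition, not the paper's released implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(xs, W, U, b, K):
    """Forward-only LSTM over xs of shape (T, K_in); returns (T, K) hidden states.
    Gates are packed into one 4K projection: input, forget, output, cell candidate."""
    h, c = np.zeros(K), np.zeros(K)
    hs = []
    for x in xs:
        z = W @ x + U @ h + b
        i, f, o = sigmoid(z[:K]), sigmoid(z[K:2*K]), sigmoid(z[2*K:3*K])
        g = np.tanh(z[3*K:])
        c = f * c + i * g
        h = o * np.tanh(c)
        hs.append(h)
    return np.array(hs)

def bilstm_avg_repr(xs, fwd, bwd, K):
    """Concatenate left-to-right and right-to-left states (2K per step),
    then average-pool over time to obtain the input representation I."""
    hf = lstm_forward(xs, *fwd, K)
    hb = lstm_forward(xs[::-1], *bwd, K)[::-1]
    H = np.concatenate([hf, hb], axis=1)   # (T, 2K)
    return H, H.mean(axis=0)               # I has size 2K
```

A softmax layer over the returned vector I, trained with cross entropy, completes the Mix baseline.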

Multi-Domain Training
We build a second baseline for domain-aware sentiment analysis. A state-of-the-art architecture (Liu et al., 2016; Nam and Han, 2016) is used as depicted in Figure 1 (b), where m mapping functions f_i are learned, one for each domain. Given the input representation I_j^i obtained in Equation 1, multi-task learning is conducted: each domain has a domain-specific set of softmax parameters for predicting sentiment labels, while the input representation layers are shared. The input domain indicator d_i indicates which set of softmax parameters to use, and each domain has its own cross entropy loss L_i(f_i(s_j^i, d_i), y_j^i) for data points (s_j^i, d_i, y_j^i). We denote this baseline as Multi.
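The per-domain output layers of Multi can be sketched as below. This is an illustrative NumPy stub (class name, initialization scale and seed are our assumptions), showing only how the domain indicator selects a softmax parameter set over a shared representation.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

class MultiHeadOutput:
    """One softmax layer per domain over a shared input representation I."""
    def __init__(self, num_domains, dim, num_labels, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((num_domains, num_labels, dim)) * 0.1
        self.b = np.zeros((num_domains, num_labels))

    def predict(self, I, d):
        # The domain indicator d selects that domain's softmax parameters.
        return softmax(self.W[d] @ I + self.b[d])
```

During training, only the selected head's cross-entropy loss L_i contributes gradients for a given data point; the shared layers receive gradients from every domain.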

Domain-Aware Input Representation
The above baseline Multi achieves state-of-the-art performance for multi-domain sentiment analysis (Liu et al., 2016), yet the domain indicator d_i is used solely to select softmax parameters. As a result, domain knowledge is hidden and under-utilized. Similar to Mix and Multi, we use a Bi-LSTM to learn representations shared across domains. However, we introduce domain-specific layers to better capture domain characteristics, as shown in Figure 1 (c). Different domains have their own sentiment lexicons, and domain differences largely lie in which words are relatively more important for determining the sentiment signal. We use the neural attention mechanism (Bahdanau et al., 2014) to select words, obtaining domain-specific input representations.
In our model, domain descriptors are introduced to explicitly capture domain characteristics, parameterized by a matrix N ∈ R^{2K×m}. Each domain descriptor corresponds to one column of N and has length 2K, the same as the bidirectional LSTM hidden states h_t. This matrix is automatically learned during training.
Given an input (s_j^i, d_i), we apply the embedding layer and Bi-LSTM to generate its domain-general representation h_1, h_2, ..., h_{|s_j^i|}, and use the corresponding domain descriptor N_i to weigh h_1, h_2, ..., h_{|s_j^i|}, obtaining a domain-specific representation. Two attention mechanisms are commonly used to this end: additive attention (Bahdanau et al., 2014) and dot-product attention (Vaswani et al., 2017). We choose additive attention, which utilizes a feed-forward network with a single hidden layer, since it achieves better accuracies in our development experiments. The input representation I_j^i becomes a weighted sum of hidden states:

I_j^i = Σ_t a_jt^i h_t    (2)

The weight a_jt^i reflects the similarity between domain i's descriptor N_i and the hidden state h_t, and is evaluated as:

l_jt^i = v^T tanh(P N_i + Q h_t),  a_jt^i = exp(l_jt^i) / Σ_{t'} exp(l_jt'^i)    (3)

Here P ∈ R^{4K×2K}, Q ∈ R^{4K×2K} and v ∈ R^{4K} are the parameters of additive attention. P and Q linearly project N_i and h_t to a hidden layer, respectively. The projected space is set to 4K empirically, since we find it beneficial to project the vectors into a larger layer. v serves as the output layer, and softmax is applied to normalize l_jt^i. We name this method DSR, for learning domain-specific representations.
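The additive attention step can be sketched as follows, assuming the Bi-LSTM states H and a learned descriptor N_i are given; the function name and random test parameters are illustrative.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def domain_attention(N_i, H, P, Q, v):
    """Additive attention with a domain descriptor:
    score l_t = v . tanh(P @ N_i + Q @ h_t),
    weights a = softmax(l), representation I = sum_t a_t * h_t."""
    scores = np.array([v @ np.tanh(P @ N_i + Q @ h) for h in H])
    a = softmax(scores)
    I = (a[:, None] * H).sum(axis=0)
    return a, I
```

Note that P @ N_i is constant across time steps, so in practice it would be computed once per sentence.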

Self-Attention over Domain Descriptors
DSR uses a single domain descriptor to attend over input words. However, relations between domains are not considered (e.g. the sentiment lexicon of the domain 'camera' is more similar to that of 'laptop' than to that of 'restaurant'). To model the interaction between domains, a self-attention layer is applied, empirically using dot-product attention, as shown in Figure 1 (c). We compute dot products between N_i and every domain descriptor, normalize them using the softmax function, and take N_i^new as the resulting weighted sum of all domain descriptors:

a_k = exp(N_k^T N_i) / Σ_{k'} exp(N_{k'}^T N_i),  N_i^new = Σ_k a_k N_k

N_i^new is then used to attend over the hidden states, following Equations 2 and 3. During back-propagation, the descriptors of similar domains can be updated simultaneously. We name this method DSR-sa, which denotes domain-specific representation with self-attention.
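The self-attention re-mixing of descriptors is a few lines of NumPy; the sketch below assumes the descriptor matrix N has one column per domain, as defined above.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def descriptor_self_attention(N, i):
    """Dot products between descriptor N[:, i] and every column of N,
    softmax-normalized; returns the re-mixed descriptor N_i^new and weights."""
    weights = softmax(N.T @ N[:, i])
    return N @ weights, weights
```

Because each N_i^new mixes in similar descriptors, gradients flowing into N_i^new also update those related domains' descriptors.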

Explicit Domain Knowledge
To further capture domain characteristics, we devise a memory network (Weston et al., 2014; Sukhbaatar et al., 2015; Kumar et al., 2016) framework to explicitly represent domain knowledge. Our memory networks hold example training data of a specific domain for retrieving context data during prediction.
Formally, we use a memory M_i ∈ R^{2K×|D_i|} (|D_i| is the total number of training instances of domain i) to hold the domain-specific representations I_j^i of the training instances of domain i.
Memory Network: We directly set I_j^i as the j-th column of the memory M_i.
Obtaining a Context Vector Using Background Knowledge: Given an input representation I_j^i, we generate a context vector C_j^i to support prediction by memory reading:

C_j^i = M_i softmax(M_i^T I_j^i)    (6)

Dot-product attention is applied here, which is faster and more space-efficient than additive attention, since it can be implemented using highly optimized matrix multiplication. Dot products are performed between I_j^i and each column of M_i, the scores are normalized using the softmax function, and the final context vector is a weighted sum of M_i's columns.
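The memory read is a single matrix product followed by a softmax-weighted sum; a minimal NumPy sketch, assuming the memory M already holds one representation per column:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def memory_read(I, M):
    """Dot-product attention over memory columns:
    context C = M @ softmax(M^T @ I)."""
    a = softmax(M.T @ I)
    return M @ a, a
```

Both the scores M^T I and the read M a are plain matrix multiplications, which is why this is cheaper than additive attention over the same memory.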
Output: We concatenate the context vector and the domain-specific input representation, feeding the result to the softmax layers. Similar to the baseline Multi, each domain has its own loss L_i. We name this method DSR-ctx, for context vector enhancement.
Reducing Memory Size: In the naive implementation, the memory size |M i | is equal to the total number of saved sequences, which can be very large in practice. We explore two ways to reduce memory size.
(1) Organizing memory by the vocabulary. We set |M_i| = |V|, where each memory column of M_i corresponds to a word in the vocabulary. During memory writing, I_j^i updates every column that corresponds to a word w in its input sequence s_j^i by exponential moving average:

M_i[:, w] ← γ M_i[:, w] + (1 − γ) I_j^i

where γ is a decay factor. In this way, two input representations update the same column of the memory network if and only if they share at least one common word.
(2) Fixing the memory size by clustering. |M_i| is set to a fixed size, and I_j^i only updates the memory column that is most similar to it, i.e. the column arg max (M_i)^T I_j^i. In this way, semantically similar inputs are clustered and update the same column.
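Both reduction schemes above can be sketched together; the decay value and the exact exponential-moving-average form are our illustrative assumptions.

```python
import numpy as np

def ema_vocab_update(M, word_ids, I, decay=0.9):
    """(1) Vocabulary-keyed memory: every column whose word appears in the
    input sequence is moved toward I by an exponential moving average."""
    for w in set(word_ids):
        M[:, w] = decay * M[:, w] + (1.0 - decay) * I
    return M

def cluster_update(M, I, decay=0.9):
    """(2) Fixed-size memory: only the column most similar to I (by dot
    product) is updated, so semantically similar inputs share a slot."""
    j = int(np.argmax(M.T @ I))
    M[:, j] = decay * M[:, j] + (1.0 - decay) * I
    return M, j
```

Scheme (1) bounds the memory at |V| columns; scheme (2) bounds it at an arbitrary fixed size, trading granularity for space.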

Adversarial Training
We use the embeddings and Bi-LSTM, parameterized by θ_dg, to generate domain-general representations. However, the distributions of domain-general representations can differ across domains (Goodfellow et al., 2014), which contaminates the representations (Liu et al., 2017) and has negative effects on in-domain predictions. For cross-domain testing, the discrepancies cause domain shift, which harms prediction accuracies on target domains (Ganin and Lempitsky, 2015). Thus, models that generate domain-invariant representations for all domains are favorable for utilizing multi-domain datasets.
We incorporate adversarial training to enhance the domain-general representations. As shown in Figure 1 (c), domain classifier layers are introduced, parameterized by θ_dc, which predict how likely the input sequence s_j^i is to come from each domain i. We denote its cross entropy loss as L_at(f_at(s_j^i), d_i) for data points (s_j^i, d_i, y_j^i) from domain i (note that d_i is used as the label here instead of as an input). The model now consists of domain-general layers, domain-specific layers and domain classifier layers, and is trained by a minimax game. For a dataset D_i drawn from domain i, we minimize the sentiment prediction loss L_i(f_i(s_j^i, d_i), y_j^i) while maximizing the domain classifier loss L_at(f_at(s_j^i), d_i), controlled by λ:

min_{θ_dg, θ_ds} Σ_j [ L_i(f_i(s_j^i, d_i), y_j^i) − λ L_at(f_at(s_j^i), d_i) ]

where θ_ds is the set of domain-specific parameters, including the domain descriptors, attention weights and softmax parameters. We fix θ_dc and update θ_dg and θ_ds here. The adversarial part maximizes the combined loss by updating θ_dc, while fixing θ_dg and θ_ds.
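The alternating minimax update can be illustrated on a deliberately tiny toy model: a scalar feature extractor z = wf·x and a logistic domain classifier p = sigmoid(wc·z). This is not the paper's model; it only shows the opposing gradient directions (the classifier descends on L_at, the extractor ascends on it, scaled by λ).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adversarial_step(wf, wc, x, d, lam, lr):
    """One minimax step on the toy model.
    L_at is binary cross entropy of the domain classifier; for a logistic
    output, dL_at/dlogit = p - d."""
    z = wf * x
    p = sigmoid(wc * z)
    g = p - d
    grad_wc = g * z            # dL_at/dwc
    grad_wf = g * wc * x       # dL_at/dwf
    wc_new = wc - lr * grad_wc         # classifier minimizes L_at
    wf_new = wf + lr * lam * grad_wf   # extractor's gradient reversed:
    return wf_new, wc_new              # it maximizes L_at (domain invariance)
```

In a full model the sentiment loss L_i would also contribute a normal (non-reversed) gradient to the extractor's parameters.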

Experiments
We evaluate the effectiveness of the model both in-domain and cross-domain. The former refers to the setting where the domain of the test data falls into one of the m training data domains, and the latter refers to the setting where the test data come from an unknown domain.

Experimental Settings
We conduct experiments on two benchmark datasets. The datasets are balanced, so we use accuracy as the evaluation metric in the experiments.
Dataset 1 contains four domains. The statistics are shown in Table 1, which also lists the accuracies of the baseline method Mix trained and tested on each individual domain. Camera consists of reviews of digital products such as cameras and MP3 players (Hu and Liu, 2004). Laptop and Restaurant are laptop and restaurant reviews, respectively, obtained from SemEval 2015 Task 12. Movie contains movie reviews provided by Pang and Lee (2004).
Dataset 2 is Blitzer's multi-domain sentiment dataset (Blitzer et al., 2007), which contains product reviews taken from Amazon.com covering 25 product types (domains), such as books, beauty and music. More statistics can be found on its official website. For each dataset, we randomly select 80%, 10% and 10% of the instances as training, development and test sets, respectively.

Baselines and Hyperparameters
In addition to the Mix baseline, the Multi baseline (Liu et al., 2016) and our domain-aware models DSR, DSR-sa, DSR-ctx and DSR-at, we also experiment with the following baselines: MTRL (Zhang and Yeung, 2012) is a state-of-the-art multi-task learning method with discrete features. The method models covariances between task classifiers, which in turn regularize the task-specific parameters. The feature extraction for MTRL follows Blitzer et al. (2007). We use this baseline to demonstrate the effectiveness of the dense features generated by neural models.
MDA (Chen et al., 2012) is a cross-domain baseline, which utilizes marginalized de-noising auto-encoders to learn a shared hidden representation by reconstructing pivot features from corrupted inputs.
FEMA (Yang and Eisenstein, 2015) is a cross-domain baseline, which utilizes techniques from neural language models to directly learn feature embeddings and is more robust to domain shift.
NDA (Kim et al., 2016) is a cross-domain baseline, which uses m + 1 LSTMs, where one LSTM captures global information across all m domains and the remaining m LSTMs capture domain-specific information.
We set the word embedding size K to 300; the embeddings are initialized using word2vec vectors trained on news text. To obtain the best performance, the hyper-parameters, including the dropout ratio and learning rate, are set by grid search based on development results. Gradient clipping (Pascanu et al., 2013) is adopted to prevent gradients from exploding or vanishing during training. Since all datasets have only thousands of instances, we set the memory network sizes to the numbers of training instances in the experiments.

Working with Known Domains
In this section, we perform in-domain evaluations. We first combine two datasets for training and test on each domain's held-out test set. The results on dataset 1 are shown in Table 2 (the results on Blitzer's dataset exhibit similar patterns and are omitted due to space constraints). The accuracies of MTRL are significantly lower than those of the neural models, which demonstrates the effectiveness of dense features over discrete features. The baseline Mix improves the average accuracy from 0.778 to 0.818, and most multi-domain training accuracies are better than the single-domain training accuracies in Table 1. Mix simply combines the two datasets for training and ignores domain characteristics, yet improves over single-dataset training. This demonstrates that more data reduce over-fitting and lead to better generalization. Multi further improves the average accuracy by 1.4%, which confirms the effectiveness of utilizing domain information.
Among our models, DSR further improves the accuracy over Multi by 1%, which confirms the effectiveness of domain-specific input representations for multi-domain sentiment analysis. DSR-sa slightly outperforms DSR by 0.03%. By adopting an additional self-attention layer, DSR-sa trains similar domain descriptors together, thus better modeling domain relations, which is further studied in Section 5.5.2. DSR-ctx outperforms DSR-sa by 1.2%, which demonstrates the effectiveness of memory networks in utilizing domain-specific example knowledge. DSR-at gives the best results by a significant margin, confirming that the domain-invariant representations achieved by adversarial training indeed benefit in-domain training. Significance is assessed using McNemar's test.
The results combining all 4 domains and all 25 domains of the two datasets are shown in the 'In domain' sections of Table 3 and Table 4, respectively. Here the models are trained using all domains' training data and tested on each domain's held-out test data. Similar patterns are observed as in Table 2, and DSR-at achieves significantly the best accuracies (0.867 and 0.907 for the two datasets, respectively).

Working with Unknown New Domains
We validate the algorithms cross-domain. For dataset 1, models are trained on three domains, yet validated and tested on the other domain. For dataset 2, models are trained on 24 domains, yet validated and tested on the 25th.
Since DSR-at has m outputs (one for each training domain), we adopt an ensemble approach to obtain a single output for unknown test domains. In particular, since the domain classifier outputs probabilities on how likely the test data come from each training domain, we use these probabilities as weights to average the m outputs.
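The ensemble step reduces to a weighted average of the m label distributions, with the domain classifier's probabilities as weights; a minimal sketch, assuming both inputs are already normalized distributions:

```python
import numpy as np

def ensemble_predict(domain_probs, outputs):
    """Average the m per-domain label distributions, weighted by the domain
    classifier's probabilities P(domain | input).
    domain_probs: shape (m,), sums to 1.
    outputs: shape (m, num_labels), each row a probability distribution."""
    return domain_probs @ outputs
```

Since the weights sum to 1 and each row of outputs is a distribution, the result is itself a valid probability distribution over labels.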
For NDA, Multi, DSR, DSR-sa and DSR-ctx, we use average pooling to combine the m outputs. Since MDA and FEMA are devised to train on a single source domain, we combine the training data of the m domains for training them.
The results are shown in the 'Cross domain' section of Table 3 and Table 4, respectively. One observation is that cross-domain accuracies are worse than in-domain accuracies, showing challenges in unknown-domain testing.
The contrast between our models and FEMA/NDA shows the advantage of leveraging resources from all domains, versus a single source domain, for cross-domain modelling. Among the baselines, NDA also considers domain-specific representations. On the other hand, it duplicates the full set of model parameters for each domain, yet underperforms DSR and DSR-sa, which record only one domain descriptor vector per domain. The contrast shows the advantages of learning domain descriptors explicitly in terms of both efficiency and accuracy. Similar to the known-domain results, DSR-ctx and DSR-at further improve upon DSR and DSR-sa, showing the effectiveness of domain memory and adversarial learning. On both datasets, DSR-at achieves significantly the best performance, which shows the advantages of domain-invariant representations for unknown-domain testing.

Input Attention
To obtain a better understanding of input attention with domain descriptors, we examine the attention weights of inputs and three examples are displayed in Figure 2, where the x axis denotes the four domains from the first dataset and the y axis shows the words.
In Figure 2 (a), the domain-specific word 'ease' is only selected for the domains LT and CR, while the domain-independent word 'great' is salient in all domains. Similarly, in Figure 2 (b), 'meaty' and 'engaging' are only salient in RT and M, respectively. In Figure 2 (c), the domain-specific word 'beast' is chosen in LT and CR.
These observations confirm the effectiveness of input attention: DSR-ctx is able to pick out sentiment words that conform to domain characteristics.

Domain Descriptors
With the self-attention layer, one interesting question is whether learned domain descriptors can reflect domain similarities/dissimilarities.
We take the twenty-five domain descriptors learned on Blitzer's dataset and calculate the cosine similarities between each pair. As ground truth, we also calculate the cosine similarities between the twenty-five domains based on unigram and bigram representations. The Pearson correlation coefficient is used to measure the correlation between the two sets of cosine values. The final score is 0.796, which shows that domain descriptor similarities can serve as indicators of domain similarities.
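The comparison above can be sketched in a few lines; the function names are illustrative, and the descriptor matrix layout follows the definition of N with one column per domain.

```python
import numpy as np

def pairwise_cosine(N):
    """Cosine similarity between every pair of descriptor columns of N."""
    X = N / np.linalg.norm(N, axis=0, keepdims=True)
    return X.T @ X

def pearson(a, b):
    """Pearson correlation between two flat vectors of similarity values."""
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In the experiment, pearson would be applied to the off-diagonal entries of the descriptor-based and n-gram-based cosine matrices.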

Memory Network Attention
We further study the attention of the memory networks by randomly picking instances from the test sets and listing the context instances with the greatest attention weights obtained from Equation 6. The results for three test instances and their context instances are shown in Table . One observation is that semantically similar instances are selected to provide extra knowledge for predictions (e.g. a1, a2, b3, c1, c2, c3). Another observation is that the sentiment polarities of test instances and their selected context instances are usually the same. We conclude that the memory networks are capable of selecting instructive instances to facilitate predictions.

Related Work
Domain Adaptation (Blitzer et al., 2007; Titov, 2011; Yu and Jiang, 2015) adapts classifiers trained on a source domain to an unseen target domain. One stream of work focuses on learning a general representation for different domains based on the co-occurrences of domain-specific and domain-independent features (Blitzer et al., 2007; Pan et al., 2011; Yu and Jiang, 2015; Yang et al., 2017). Another stream of work tries to identify domain-specific words to improve cross-domain classification (Bollegala et al., 2011; Li et al., 2012; Zhang et al., 2014; Qiu and Zhang, 2015). Different from previous work, we utilize multiple source domains for cross-domain evaluation, which makes our method more general and domain-aware.
Multi-domain Learning jointly learns multiple domains to improve generalization. One strand of work (Dredze and Crammer, 2008; Saha et al., 2011; Zhang and Yeung, 2012) uses covariances between task classifiers to regularize domain-specific parameters.