Learning Domain-Sensitive and Sentiment-Aware Word Embeddings

Word embeddings have been widely used in sentiment classification because of their efficacy for semantic representations of words. Given reviews from different domains, some existing methods for word embeddings exploit sentiment information, but they cannot produce domain-sensitive embeddings. On the other hand, some other existing methods can generate domain-sensitive word embeddings, but they cannot distinguish words with similar contexts but opposite sentiment polarity. We propose a new method for learning domain-sensitive and sentiment-aware embeddings that simultaneously captures the sentiment semantics and the domain sensitivity of individual words. Our method can automatically determine and produce domain-common embeddings and domain-specific embeddings. The differentiation of domain-common and domain-specific words enables our model to exploit the data augmentation of common semantics from multiple domains while capturing the varied semantics of domain-specific words across domains. Experimental results show that our model provides an effective way to learn domain-sensitive and sentiment-aware word embeddings which benefit sentiment classification at both the sentence level and the lexicon term level.


Introduction
Sentiment classification aims to predict the sentiment polarity, such as "positive" or "negative", of a piece of review text. It has been a longstanding research topic because of its importance for many applications such as social media analysis, e-commerce, and marketing (Liu, 2012; Pang et al., 2008). Deep learning has driven progress in various NLP tasks, including sentiment classification. Some researchers focus on designing RNN or CNN based models for predicting sentence level (Kim, 2014) or aspect level sentiment (Li et al., 2018; Chen et al., 2017; Wang et al., 2016). These works directly take word embeddings pre-trained for general purposes as the initial word representations and may fine-tune them in the training process. Other researchers look into the problem of learning task-specific word embeddings for sentiment classification, aiming to overcome some limitations of applying general pre-trained word embeddings. For example, Tang et al. (2014b) develop a neural network model to convey sentiment information in the word embeddings. As a result, the learned embeddings are sentiment-aware and able to distinguish words with similar syntactic context but opposite sentiment polarity, such as the words "good" and "bad". In fact, sentiment information can be easily obtained or derived at large scale from some data sources (e.g., the ratings provided by users), which allows reliable learning of such sentiment-aware embeddings.
Apart from these words (e.g. "good" and "bad") with consistent sentiment polarity in different contexts, the polarity of some sentiment words is domain-sensitive. For example, the word "lightweight" usually connotes a positive sentiment in the electronics domain since a lightweight device is easier to carry. In contrast, in the movie domain, the word "lightweight" usually connotes a negative opinion describing movies that do not invoke deep thoughts among the audience. This observation motivates the study of learning domain-sensitive word representations (Bollegala et al., 2014, 2015). These methods basically learn separate embeddings of the same word for different domains. To bridge the semantics of the individual embedding spaces, they select a subset of words that are likely to be domain-insensitive and align the dimensions of their embeddings. However, the sentiment information is not exploited in these methods, although they are intended for the task of sentiment classification.
In this paper, we aim at learning word embeddings that are both domain-sensitive and sentiment-aware.
Our proposed method can jointly model the sentiment semantics and domain specificity of words, and we expect the learned embeddings to achieve superior performance in sentiment classification. Specifically, our method can automatically determine and produce domain-common embeddings and domain-specific embeddings. Domain-common embeddings capture the case where the semantics of a word, including its sentiment and meaning, are very similar across domains. For example, the words "good" and "interesting" are usually domain-common and convey consistent semantic meanings and positive sentiments in different domains. Thus, they should have similar embeddings across domains. On the other hand, domain-specific word embeddings capture the case where the sentiment or meaning of a word differs across domains. For example, the word "lightweight" carries different sentiment polarities in the electronics domain and the movie domain. Moreover, some polysemous words have different meanings in different domains. For example, the term "apple" refers to the famous technology company in the electronics domain but to a kind of fruit in the food domain.
Our model exploits the information of sentiment labels and context words to distinguish domain-common and domain-specific words. If a word has similar sentiments and contexts across domains, it indicates that the word has common semantics in these domains, and thus it is treated as domain-common. Otherwise, the word is considered domain-specific. Learning domain-common embeddings allows us to exploit the data augmentation of common semantics from multiple domains, and meanwhile, domain-specific embeddings allow us to capture the varied semantics of specific words in different domains. Specifically, for each word in the vocabulary, we design a distribution to depict the probability of the word being domain-common. The inference of this probability distribution is conducted based on the observed sentiments and contexts. As mentioned above, we also exploit the information of sentiment labels to learn word embeddings that can distinguish words with similar syntactic context but opposite sentiment polarity.
To demonstrate the advantages of our domain-sensitive and sentiment-aware word embeddings, we conduct experiments on four domains: books, DVDs, electronics, and kitchen appliances. The experimental results show that our model outperforms the state-of-the-art models on the task of sentence level sentiment classification. Moreover, we conduct lexicon term sentiment classification on two common sentiment lexicons to evaluate the effectiveness of our sentiment-aware embeddings learned from multiple domains, and our model outperforms the state-of-the-art models on most domains.

Related Works
Traditional vector space models encode individual words using the one-hot representation, namely, a high-dimensional vector with all zeroes except in one component corresponding to that word (Baeza-Yates et al., 1999). Such representations suffer from the curse of dimensionality, as these vectors have as many components as the vocabulary has words. Another drawback is that the semantic relatedness of words cannot be modeled with such representations. To address these shortcomings, Rumelhart et al. (1988) propose to use distributed word representations instead, called word embeddings. Several techniques for generating such representations have been investigated. For example, Bengio et al. propose a neural network architecture for this purpose (Bengio et al., 2003; Bengio, 2009). Later, Mikolov et al. (2013) propose two considerably more efficient methods, namely skip-gram and CBOW. This work has made it possible to learn word embeddings from large data sets, which has led to the current popularity of word embeddings. Word embedding models have been applied to many tasks, such as named entity recognition (Turian et al., 2010), word sense disambiguation (Collobert et al., 2011; Iacobacci et al., 2016; Zhang and Hasan, 2017; Dave et al., 2018), parsing (Roth and Lapata, 2016), and document classification (Tang et al., 2014a,b; Shi et al., 2017).

Sentiment classification has been a longstanding research topic (Liu, 2012; Pang et al., 2008; Chen et al., 2017; Moraes et al., 2013). Given a review, the task aims at predicting the sentiment polarity at the sentence level (Kim, 2014) or the aspect level (Li et al., 2018; Chen et al., 2017). Supervised learning algorithms have been widely used in sentiment classification (Pang et al., 2002). People usually use different expressions of sentiment semantics in different domains.
Due to the mismatch of domain-specific words across domains, a sentiment classifier trained in one domain may not work well when it is directly applied to other domains. Thus, cross-domain sentiment classification algorithms have been explored (Pan et al., 2010; Li et al., 2009; Glorot et al., 2011). These works usually find common feature spaces across domains and then transfer learned parameters from the source domain to the target domain. For example, Pan et al. (2010) propose a spectral feature alignment algorithm to align words from different domains into unified clusters. The clusters are then used to reduce the gap between words of the two domains and to train sentiment classifiers in the target domain. Compared with the above works, our model focuses on learning both domain-common and domain-specific embeddings given the reviews from all domains, instead of only transferring common semantics from a source domain to a target domain.
Some researchers have proposed methods to learn task-specific word embeddings for sentiment classification (Tang et al., 2014a,b). Tang et al. (2014b) propose a model named SSWE to learn sentiment-aware embeddings by incorporating the sentiment polarity of texts in the loss functions of neural networks. Without considering the varied semantics of domain-specific words in different domains, their model cannot learn sentiment-aware embeddings across multiple domains. Some works learn word representations considering multiple domains (Bach et al., 2016; Bollegala et al., 2015). Most of them learn separate embeddings of the same word for different domains. They then choose pivot words according to frequency-based statistical measures to bridge the semantics of the individual embedding spaces, adding a regularization term to the original word embedding framework that enforces the representations of pivot words to be similar across domains. For example, one of these methods uses the Sørensen-Dice coefficient (Sørensen, 1948) for detecting pivot words and learns word representations across domains. Even though these models are evaluated on sentiment classification, the sentiment information associated with the reviews is not considered in the learned embeddings. Moreover, the selection of pivot words in the above works relies on frequency-based statistical measures, whereas in our model the domain-common words are jointly determined by sentiment information and context words.

Model Description
We propose a new model, named DSE, for learning Domain-sensitive and Sentiment-aware word Embeddings. For presentation clarity, we describe DSE based on two domains. Note that it can be easily extended to more than two domains; we remark on how to do so near the end of this section.

Design of Embeddings
We assume that the input consists of text reviews of two domains, namely D_p and D_q. Each review r in D_p and D_q is associated with a sentiment label y which can take the value 1 or 0, denoting that the sentiment of the review is positive or negative respectively. In our DSE model, each word w in the whole vocabulary Λ is associated with a domain-common vector U^c_w and two domain-specific vectors, namely U^p_w specific to the domain p and U^q_w specific to the domain q. The dimension of these vectors is d. The design of U^c_w, U^p_w and U^q_w reflects one characteristic of our model: allowing a word to have different semantics across different domains. The semantics of a word include not only its meaning but also its sentiment orientation. If the semantics of w are consistent in the domains p and q, we use the vector U^c_w for both domains. Otherwise, w is represented by U^p_w and U^q_w for p and q respectively. In traditional cross-domain word embedding methods (Bollegala et al., 2015, 2016), each word is represented by different vectors in different domains without differentiation of domain-common and domain-specific words. In contrast to these methods, for each word w, we use a latent variable z_w to depict its domain commonality. When z_w = 1, it means that w is common in both domains. Otherwise, w is specific to the domain p or the domain q.
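As a concrete illustration, the per-word parameters described above can be organized as follows. This is a minimal sketch: the class name, initialization scheme, and toy dimensions are our own assumptions, not from the paper.

```python
import numpy as np

class DSEParams:
    """Parameter containers for the DSE model (illustrative sketch).

    For each word w in the vocabulary:
      - U_c[w]: domain-common input vector
      - U_p[w], U_q[w]: domain-specific input vectors for domains p and q
      - V[w]:   output (context) vector
      - p_z[w]: probability that w is domain-common, i.e. p(z_w = 1)
    """
    def __init__(self, vocab, dim=200, seed=0):
        rng = np.random.default_rng(seed)
        n = len(vocab)
        self.index = {w: i for i, w in enumerate(vocab)}
        scale = 0.5 / dim
        self.U_c = rng.uniform(-scale, scale, (n, dim))
        self.U_p = rng.uniform(-scale, scale, (n, dim))
        self.U_q = rng.uniform(-scale, scale, (n, dim))
        self.V = np.zeros((n, dim))      # output vectors
        self.s = np.zeros(dim)           # sentiment boundary vector
        self.p_z = np.full(n, 0.5)       # prior: equally likely common/specific

params = DSEParams(["good", "lightweight", "apple"], dim=8)
```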
In the standard skip-gram model (Mikolov et al., 2013), the probability of predicting the context words is only affected by their relatedness with the target word. In our DSE model, predicting the context words also depends on the domain-commonality of the target word, i.e., z_w. For example, assume that there are two domains, e.g. the electronics domain and the movie domain. If z_w = 1, it indicates a high probability of generating some domain-common words such as "good", "bad" or "satisfied". Otherwise, domain-specific words are more likely to be generated, such as "reliable", "cheap" or "compact" for the electronics domain. For a word w, we assume that the probability of predicting the context word w_t is formulated as follows:

p(w_t | w) = p(w_t | w, z_w = 1) p(z_w = 1) + p(w_t | w, z_w = 0) p(z_w = 0). (1)

If w is a domain-common word, without differentiating p and q, the probability of predicting w_t can be defined as:

p(w_t | w, z_w = 1) = exp(U^c_w · V_{w_t}) / Σ_{w'∈Λ} exp(U^c_w · V_{w'}), (2)

where Λ is the whole vocabulary and V_{w'} is the output vector of the word w'. If w is a domain-specific word, the probability p(w_t | w, z_w = 0) is specific to the occurrence of w in D_p or D_q. For individual training instances, the occurrence of w in D_p or D_q is known. Then the probability p(w_t | w, z_w = 0) can be defined as follows:

p(w_t | w, z_w = 0) = exp(U^p_w · V_{w_t}) / Σ_{w'∈Λ} exp(U^p_w · V_{w'}) if w occurs in D_p, and analogously with U^q_w if w occurs in D_q. (3)
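The mixture over z_w can be sketched numerically. This is an illustrative implementation assuming full-softmax context probabilities of the form described above (the inference section later replaces these with negative sampling); the function names are ours.

```python
import numpy as np

def softmax_context_prob(u_w, V):
    """Full-softmax distribution over the vocabulary:
    p(w_t | w, z) = exp(u_w . V[w_t]) / sum_{w'} exp(u_w . V[w'])."""
    scores = V @ u_w
    scores = scores - scores.max()  # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def context_prob(w_idx, t_idx, U_c, U_dom, V, p_z1):
    """Mixture p(w_t|w) = p(w_t|w,z=1)p(z=1) + p(w_t|w,z=0)p(z=0).
    U_dom stands for U_p or U_q, depending on which domain the
    occurrence of w comes from."""
    p_common = softmax_context_prob(U_c[w_idx], V)[t_idx]
    p_specific = softmax_context_prob(U_dom[w_idx], V)[t_idx]
    return p_z1 * p_common + (1.0 - p_z1) * p_specific
```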

Exploiting Sentiment Information
In our DSE model, the prediction of review sentiment depends on not only the text information but also the domain-commonality. For example, the domain-common word "good" has a high probability of being positive in reviews across multiple domains. However, the word "lightweight" would be positive in the electronics domain but negative in the movie domain. We define the polarity y_w of each word w to be consistent with the sentiment label of the review: if we observe that a review is associated with a positive label, the words in the review are associated with a positive label too. Then, the probability of predicting the sentiment of the word w can be defined as:

p(y_w | w) = p(y_w | w, z_w = 1) p(z_w = 1) + p(y_w | w, z_w = 0) p(z_w = 0). (4)

If z_w = 1, the word w is a domain-common word. The probability p(y_w = 1 | w, z_w = 1) can be defined as:

p(y_w = 1 | w, z_w = 1) = σ(U^c_w · s), (5)

where σ(·) is the sigmoid function and the vector s with dimension d represents the boundary of the sentiment. Moreover, we have:

p(y_w = 0 | w, z_w = 1) = 1 − p(y_w = 1 | w, z_w = 1). (6)

If w is a domain-specific word, similarly, the probability p(y_w = 1 | w, z_w = 0) is defined as:

p(y_w = 1 | w, z_w = 0) = σ(U^p_w · s) if w occurs in D_p, and σ(U^q_w · s) if w occurs in D_q. (7)
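Assuming the word-level sentiment probability takes the standard logistic form σ(U_w · s) with the sentiment boundary vector s, it can be sketched as:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sentiment_prob(u_w, s):
    """p(y_w = 1 | w, z_w) = sigma(u_w . s), where u_w is the word's
    domain-common or domain-specific vector and s is the sentiment
    boundary vector; p(y_w = 0 | ...) is simply the complement."""
    return sigmoid(np.dot(u_w, s))
```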

Inference Algorithm
We need an inference method that can learn, given D_p and D_q, the values of the model parameters, namely, the domain-common embeddings U^c_w, the domain-specific embeddings U^p_w and U^q_w, and the domain-commonality distribution p(z_w) for each word w. Our inference method combines the Expectation-Maximization (EM) method with a negative sampling scheme; it is summarized in Algorithm 1. In the E-step, we use the Bayes rule to evaluate the posterior distribution of z_w for each word and derive the objective function. In the M-step, we maximize the objective function with the gradient descent method and update the corresponding embeddings U^c_w, U^p_w and U^q_w.

Algorithm 1: EM negative sampling for DSE
1:  Initialize U^c_w, U^p_w, U^q_w, V, s, p(z_w)
2:  for iter = 1 to Max_iter do
3:    for each review r in D_p and D_q do
4:      for each word w in r do
5:        Sample negative instances from the distribution P
6:        Update p(z_w | w, c_w, y_w) by Eq. 11, using Eq. 15
7:      end for
8:    end for
9:    Update p(z_w) using Eq. 13
10:   Update U^c_w, U^p_w, U^q_w, V, s via maximizing Eq. 14
11: end for
With the input of D_p and D_q, the likelihood function of the whole training set is:

L = L_p + L_q, (8)

where L_p and L_q are the likelihoods of D_p and D_q respectively. For each review r from D_p, to learn domain-specific and sentiment-aware embeddings, we wish to predict the sentiment label and the context words together. Therefore, the likelihood function is defined as follows:

L_p = Σ_{r∈D_p} Σ_{w∈r} log p(c_w, y_w | w), (9)

where y_w is the sentiment label and c_w is the set of context words of w. To simplify the model, we assume that the sentiment label y_w and the context words c_w of the word w are conditionally independent. Then the likelihood L_p can be rewritten as:

L_p = Σ_{r∈D_p} Σ_{w∈r} [ Σ_{w_t∈c_w} log p(w_t | w) + log p(y_w | w) ], (10)

where p(w_t | w) and p(y_w | w) are defined in Eq. 1 and Eq. 4 respectively. The likelihood of the reviews from D_q, i.e., L_q, is defined similarly. For each word w in the review r, in the E-step, the posterior probability of z_w given c_w and y_w is:

p(z_w = 1 | w, c_w, y_w) = p(z_w = 1) p(y_w | w, z_w = 1) Π_{w_t∈c_w} p(w_t | w, z_w = 1) / Σ_{z∈{0,1}} p(z_w = z) p(y_w | w, z_w = z) Π_{w_t∈c_w} p(w_t | w, z_w = z). (11)

In the M-step, given the posterior distribution of z_w in Eq. 11, the goal is to maximize the following Q function:

Q = Σ_r Σ_{w∈r} Σ_{z∈{0,1}} p(z_w = z | w, c_w, y_w) [ log p(z_w = z) + log p(y_w | w, z_w = z) + Σ_{w_t∈c_w} log p(w_t | w, z_w = z) ]. (12)

Using the Lagrange multiplier, we can obtain the update rule of p(z_w), satisfying the normalization constraint that Σ_{z∈{0,1}} p(z_w = z) = 1 for each word w:

p(z_w = z) = Σ_r n(w, r) p(z_w = z | w, c_w, y_w) / Σ_r n(w, r), (13)

where n(w, r) is the number of occurrences of the word w in the review r.
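The E-step posterior of Eq. 11 reduces to a two-way Bayes update. A hedged sketch, with the likelihood terms passed in precomputed and the helper name our own:

```python
def posterior_z(prior_z1, lik_common, lik_specific):
    """E-step Bayes update for p(z_w = 1 | w, c_w, y_w) (Eq. 11 style).

    prior_z1:     current estimate of p(z_w = 1)
    lik_common:   p(y_w | w, z_w=1) * product over context of p(w_t | w, z_w=1)
    lik_specific: the same likelihood evaluated with the domain-specific vector
    """
    num = prior_z1 * lik_common
    den = num + (1.0 - prior_z1) * lik_specific
    return num / den
```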
To obtain U^c_w, U^p_w and U^q_w, we collect the related items in Eq. 12 as follows:

Q_e = Σ_r Σ_{w∈r} Σ_{z∈{0,1}} p(z_w = z | w, c_w, y_w) [ log p(y_w | w, z_w = z) + Σ_{w_t∈c_w} log p(w_t | w, z_w = z) ]. (14)

Note that computing the value p(w_t | w, z_w) based on Eq. 2 and Eq. 3 is not feasible in practice, given that the computation cost is proportional to the size of Λ. However, similar to the skip-gram model, we can rely on negative sampling to address this issue. Therefore we estimate the log-probability of predicting the context word, log p(w_t | w, z_w = 1), as follows:

log p(w_t | w, z_w = 1) ≈ log σ(U^c_w · V_{w_t}) + Σ_{i=1}^{k} log σ(−U^c_w · V_{w_i}), (15)

where w_i is a negative instance sampled from the word distribution P(·). Mikolov et al. (2013) have investigated many choices for P(w) and found that the best P(w) is the unigram distribution Unigram(w) raised to the 3/4 power. We adopt the same setting. The probability p(w_t | w, z_w = 0) in Eq. 3 can be approximated in a similar manner. After substituting p(w_t | w, z_w), we use the Stochastic Gradient Descent method to maximize Eq. 14 and obtain the updates of U^c_w, U^p_w and U^q_w.
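The negative-sampling surrogate and the noise distribution can be sketched as follows. This is a non-authoritative illustration following the standard skip-gram recipe; the function names are ours.

```python
import numpy as np

def log_sigmoid(x):
    # log(sigma(x)); adequate for moderate x in this sketch
    return -np.log1p(np.exp(-x))

def neg_sampling_log_prob(u_w, v_t, v_negs):
    """Negative-sampling surrogate for log p(w_t | w, z):
    log sigma(u_w . v_t) + sum_i log sigma(-u_w . v_i)
    over the sampled negative output vectors v_negs."""
    total = log_sigmoid(u_w @ v_t)
    for v in v_negs:
        total += log_sigmoid(-(u_w @ v))
    return total

def unigram_pow_dist(counts, power=0.75):
    """Noise distribution P(w) proportional to Unigram(w)^(3/4),
    following Mikolov et al. (2013)."""
    p = np.asarray(counts, dtype=float) ** power
    return p / p.sum()
```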

More Discussions
In our model, to simplify the inference algorithm and save computational cost, we assume that the target word w_t in the context and the sentiment label y_w of the word w are conditionally independent. Such a technique has also been used in other popular models such as the bi-gram language model. Otherwise, we would need to consider the term p(w_t | w, y_w), which complicates the inference algorithm. We define the formulation of the term p(w_t | w, z) following the original skip-gram model instead of the CBOW model. The CBOW model averages the context words to predict the target word, whereas the skip-gram model uses pairwise training examples, which are much easier to integrate with sentiment information. Note that our model can be easily extended to more than two domains: each word is associated with one domain-specific vector per domain, plus a domain-common vector, and the probability distribution of z_w is extended from a Bernoulli distribution to a Multinomial distribution according to the number of domains.

Experimental Setup
We conducted experiments on the Amazon product reviews collected by Blitzer et al. (2007). We use four product categories: books (B), DVDs (D), electronic items (E), and kitchen appliances (K). A category corresponds to a domain. For each domain, there are 17,457 unlabeled reviews on average, each associated with a rating score from 1.0 to 5.0. For embedding learning, we treat unlabeled reviews with a rating score higher than 3.0 as positive reviews and those with a rating score lower than 3.0 as negative reviews. We first remove reviews shorter than 5 words, then remove punctuation and stop words, and stem each word to its root form using the Porter Stemmer (Porter, 1980). Note that this review data is used for embedding learning, and the learned embeddings are used as feature vectors of words in the experiments in the later two subsections.
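The preprocessing steps described above can be sketched as follows. This is an assumption-laden illustration: the stop-word list is a tiny stand-in, and crude_stem is a placeholder for the Porter stemmer (in practice one would use, e.g., nltk.stem.PorterStemmer).

```python
import re

# Illustrative stop-word list; the actual list used in the paper is not given.
STOP_WORDS = {"the", "a", "an", "is", "it", "this", "and", "of", "to"}

def crude_stem(token):
    # Stand-in for the Porter stemmer (Porter, 1980); in practice one
    # would call nltk.stem.PorterStemmer().stem(token) here.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(review, min_len=5):
    """Tokenize, drop reviews shorter than min_len words, remove stop
    words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z']+", review.lower())
    if len(tokens) < min_len:
        return None  # review discarded, as in the described setup
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```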
Given the reviews from two domains, namely, D_p and D_q, we compare our results with the following baselines and state-of-the-art methods:

SSWE The SSWE model proposed by Tang et al. (2014b) can learn sentiment-aware word embeddings from tweets. We employ this model on the combined reviews from D_p and D_q and then obtain the embeddings.
Yang's Work This method learns domain-sensitive word embeddings. It chooses pivot words and adds a regularization term to the original skip-gram objective function, enforcing that the representations of pivot words for the source and target domains be similar. The method trains the embeddings of the source domain first and then fixes the learned embeddings to train the embeddings of the target domain, so the learned embeddings of the target domain benefit from the source domain. We denote the method as Yang for short.
EmbeddingAll We learn word embeddings from the combined unlabeled review data of D p and D q using the skip-gram method (Mikolov et al., 2013).
EmbeddingCat We learn word embeddings from the unlabeled reviews of D p and D q respectively. To represent a word for review sentiment classification, we concatenate its learned word embeddings from the two domains.

EmbeddingP and EmbeddingQ
In EmbeddingP, we use the original skip-gram method (Mikolov et al., 2013) to learn word embeddings only from the unlabeled reviews of D_p. Similarly, we only adopt the unlabeled reviews from D_q to learn embeddings in EmbeddingQ.

Footnote 1: We use the SSWE implementation from https://github.com/attardi/deepnl/wiki/Sentiment-Speci
Footnote 2: For Yang's method, we use the implementation from http://statnlp.org/research/lr/.
BOW We use the traditional bag of words model to represent each review in the training data.
For our DSE model, we have two variants to represent each word. The first variant, DSE_c, represents each word by concatenating the domain-common vector and the domain-specific vector. The second variant, DSE_w, concatenates the domain-common word embedding and the domain-specific word embedding weighted by the domain-commonality distribution p(z_w). For individual review instances, the occurrence of w in D_p or D_q is known, and the representation of w is specific to this occurrence. Specifically, each word w occurring in D_p can be represented as:

w = p(z_w = 1) · U^c_w ⊕ p(z_w = 0) · U^p_w,

and analogously with U^q_w for occurrences in D_q, where ⊕ denotes the concatenation operator. For all word embedding methods, we set the dimension to 200. For the skip-gram based methods, we sample 5 negative instances, and the window size for each target word is 3. For our DSE model, the number of iterations over the whole review set is 100 and the learning rate is set to 1.0.
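The two representation variants can be sketched as below, assuming DSE_w scales the two halves by p(z_w = 1) and p(z_w = 0) before concatenation — our reading of the description, not a formula confirmed by the text.

```python
import numpy as np

def dse_c(u_common, u_specific):
    """DSE_c: plain concatenation of the domain-common and
    domain-specific vectors."""
    return np.concatenate([u_common, u_specific])

def dse_w(u_common, u_specific, p_z1):
    """DSE_w: concatenation weighted by the domain-commonality
    distribution, assuming [p(z=1) * U_c ; p(z=0) * U_dom]."""
    return np.concatenate([p_z1 * u_common, (1.0 - p_z1) * u_specific])
```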

Review Sentiment Classification
For the task of review sentiment classification, we use 1000 positive and 1000 negative reviews labeled by Blitzer et al. (2007) for each domain. We randomly select 800 positive and 800 negative labeled reviews from each domain as training data, and use the remaining 200 positive and 200 negative labeled reviews as testing data. We train an SVM classifier (Fan et al., 2008) with linear kernel on the training reviews of each domain, with each review represented as the average vector of its word embeddings.
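The review representation used for classification can be sketched as follows; the averaged vectors would then be fed to a linear-kernel SVM such as scikit-learn's LinearSVC, a stand-in here for the classifier of Fan et al. (2008).

```python
import numpy as np

def review_vector(tokens, emb, dim):
    """Represent a review as the average of its word embeddings;
    out-of-vocabulary tokens are skipped, and an all-zero vector is
    returned for reviews with no known tokens."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# The resulting vectors serve as features for a linear SVM
# (e.g., sklearn.svm.LinearSVC) trained on the 800+800 labeled
# reviews per domain described above.
```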
We use two metrics to evaluate the performance of sentiment classification. One is the standard accuracy metric. The other one is Macro-F1, which is the average of F1 scores for both positive and negative reviews.
We conduct multiple trials by selecting every possible pair of domains from books (B), DVDs (D), electronic items (E) and kitchen appliances (K), and report the average of the results for each pair. The experimental results are shown in Table 1.
From Table 1, we can see that, compared with the other baseline methods, our DSE_w model achieves the best sentiment classification performance across most combinations of the four domains. Statistical t-tests for most of the domain combinations show that the improvement of our DSE_w model over Yang and over SSWE is statistically significant (p-value < 0.05). These results show that our method captures domain-commonality and sentiment information at the same time.
Although both the SSWE model and our DSE model can learn sentiment-aware word embeddings, our DSE_w model outperforms SSWE. This demonstrates that, compared with general sentiment-aware embeddings, our learned domain-common and domain-specific word embeddings can capture the semantic variations of words across domains. Compared with the method of Yang, which learns cross-domain embeddings, our DSE_w model achieves better performance because we exploit sentiment information to distinguish domain-common and domain-specific words during the embedding learning process. The sentiment information also helps the model distinguish words which have similar contexts but different sentiments.
Compared with EmbeddingP and EmbeddingQ, the methods EmbeddingAll and EmbeddingCat achieve better performance. The reason is that the data augmentation from the other domain helps sentiment classification in the original domain. Our DSE model also benefits from this kind of data augmentation by using the reviews from both D_p and D_q.
We observe that our DSE_w variant performs better than DSE_c. Compared with DSE_c, DSE_w uses p(z_w) as the weight to combine domain-common embeddings and domain-specific embeddings. This shows that the domain-commonality distribution in our DSE model, i.e., p(z_w), can effectively model the domain-sensitive information of each word and help review sentiment classification.

Lexicon Term Sentiment Classification
To further evaluate the quality of the sentiment semantics of the learned word embeddings, we also conduct lexicon term sentiment classification on two popular sentiment lexicons, namely HL (Hu and Liu, 2004) and MPQA (Wilson et al., 2005). Words with neutral sentiment, as well as phrases, are removed. The statistics of HL and MPQA are shown in Table 3.
We conduct multiple trials by selecting every possible pair of domains from books (B), DVDs (D), electronics items (E) and kitchen appliances (K). For our DSE model, we only use the domain-common part to represent each word because the lexicons are usually not associated with a particular domain. For each lexicon, we select 80% of the terms to train the SVM classifier with linear kernel and use the remaining 20% to test the performance. The learned embedding is treated as the feature vector of the lexicon term. We conduct 5-fold cross validation on all the lexicons. The evaluation metric is Macro-F1 over positive and negative lexicon terms. Table 2 shows the experimental results of lexicon term sentiment classification. Our DSE method achieves competitive performance among all the methods. Compared with SSWE, our DSE is still competitive because both consider sentiment information in the embeddings. Our DSE model outperforms the methods which do not consider sentiment, such as Yang, EmbeddingCat and EmbeddingAll. Note that the advantage of domain-sensitive embeddings is limited for this task because the sentiment lexicons are not domain-specific.

Case Study

Table 4: Learned domain-commonality for some words. p(z = 1) denotes the probability that the word is domain-common. The letter in parentheses indicates the domain of the review.

Table 4 shows the probabilities of "lightweight", "die", "mysterious", and "great" being domain-common for different domain combinations. For "lightweight", its domain-common probability for the books domain and the DVDs domain ("B & D") is quite high, i.e., p(z = 1) = 0.999, and the review examples in the last column show that the word "lightweight" expresses the meaning of lacking depth of content in books or movies. Note that most reviews of DVDs are about movies. In the electronics domain and the kitchen appliances domain ("E & K"), "lightweight" means light material or weighing less than average, thus the domain-common probability for these two domains is also high, 0.696. In contrast, for the other combinations, the probability of "lightweight" being domain-common is much smaller, which indicates that the meaning of "lightweight" varies. Similarly, "die" in the domains of electronics and kitchen appliances ("E & K") means that something does not work any more, thus we have p(z = 1) = 0.712, while in the books domain it conveys the meaning that somebody passes away in some stories. The word "mysterious" conveys a positive sentiment in the books domain, indicating how wonderful a story is, but it conveys a negative sentiment in the electronics domain, typically describing a product that breaks down unpredictably. Thus, its domain-common probability is small. The last example is the word "great", which usually has positive sentiment in all domains and thus has large values of p(z = 1) for all domain combinations.

Conclusions
We propose a new method of learning domain-sensitive and sentiment-aware word embeddings. Compared with existing sentiment-aware embeddings, our model can distinguish domain-common and domain-specific words with consideration of the varied semantics across multiple domains. Compared with existing domain-sensitive methods, our model detects domain-common words according to not only similar context words but also sentiment information. Moreover, our learned embeddings, which consider sentiment information, can distinguish words with similar syntactic context but opposite sentiment polarity. We have conducted experiments on two downstream sentiment classification tasks, namely review sentiment classification and lexicon term sentiment classification. The experimental results demonstrate the advantages of our approach.