JEAM: A Novel Model for Cross-Domain Sentiment Classification Based on Emotion Analysis

Cross-domain sentiment classification (CSC) aims at learning a sentiment classifier for unlabeled data in the target domain based on the labeled data from a different source domain. Because the data distributions of the two domains differ in terms of raw features, the CSC problem is challenging. Previous research has mainly focused on concept mining by clustering words across domains, which ignores the importance of the authors' emotions contained in the data and the different representations of emotion between domains. In this paper, we propose a novel framework to solve the CSC problem by modelling emotion across domains. We first develop a probabilistic model named JEAM to model the author's emotion state when writing. Then, an EM algorithm is introduced to solve the likelihood maximization problem and to obtain the latent emotion distribution of the author. Finally, a supervised learning method is utilized to assign the sentiment polarity to a given online review. Experiments show that our approach is effective and outperforms state-of-the-art approaches.


Introduction
Cross-domain sentiment classification (CSC) is the task of learning a sentiment classifier for unlabeled data in the target domain based on the labeled data from the source domain. With the increasing amount of opinion information available on the Internet, CSC has become a hot topic in recent years. Traditional machine learning algorithms often train a classifier utilizing the labeled data for CSC. However, in some practical cases, we may have plenty of labeled data for some domains (source domains) but very few or no labeled data for other domains (target domains). Due to the differences between the distributions of the two domains in terms of raw features, e.g. raw term frequency, a classifier trained on the source domain often performs badly on the target domain. To overcome this issue, several feature-based studies have been proposed to improve domain adaptation for sentiment classification [Zhuang et al., 2013; He et al., 2011; Gao and Li, 2011; Li et al., 2012; Dai et al., 2007; Zhuang et al., 2010; Pan et al., 2010; Wang et al., 2011; Long et al., 2012; Lin and He, 2009].
Existing studies build various generative models to solve the domain adaptation problem for CSC. In most cases, the models are trained on the whole corpora without focusing specifically on the sentiment of the texts. For example, [Zhuang et al., 2013] propose a general framework, HIDC, to mine high-level concepts (e.g. word clusters) across various domains. However, their learned concepts contain many topics that are not restricted to sentiment. On the other hand, some researchers focus on the usage of sentiment in CSC [Mitra et al., 2013a; Mitra et al., 2013b; He et al., 2011]. [He et al., 2011] modify the JST model [Lin and He, 2009] by incorporating word polarity priors through adjusting the topic-word Dirichlet priors. However, they fail to consider the differences in expression among various domains.
To overcome the above issues, we employ "emotion", for its ubiquity among domains. The sentiment words in different domains might vary significantly, but the emotion can be effectively transferred. For example, when expressing the emotion "happiness", one uses "bravo" in the domain of sport, but "yummy" in the domain of food. Therefore, we propose an EA framework to model the latent emotions which are commonly contained in subjective articles and expressed by "emotional words". We infer the sentiment polarity of a document based on the emotion state. The hierarchy of EA is composed of four layers:
(1) Sentiment Layer Normally, the sentiment of a document is the general opinion towards a certain event or object. For example, a movie review in IMDB might voice the reviewer's feeling about the movie [Yu et al., 2013].
(2) Emotion Layer Based on the emotion classification theories in psychology [Plutchik, 2002], the emotion can be classified into the basic ones influenced by the physiological factors, e.g. happiness, sadness, anger, etc., and dozens of complicated ones formed under some specific social conditions, e.g. shame, guilt, abashment, etc. Additionally, the emotion can be classified as positive and negative (similar to the sentiment classification) based on dimensional models of emotion [Schlosberg, 1954;Plutchik, 2002;Rubin and Talerico, 2009]. Intuitively, we assume that a document tends to contain the emotions of similar polarity.
(3) Lexicon Layer To build the connection between words and the emotion, we introduce emotional words instead of raw word features into our model. By utilizing the emotional lexicon MPQA [Wiebe et al., 2005], we select groups of strongly polar words, which get high scores in the emotional lexicon. These words are considered highly correlated with emotions of the same polarity. Moreover, these strongly polar words have invariant polarity across domains. Therefore, an emotion can be made concrete by a series of emotional words drawn from the corresponding probability distribution.
(4) Expression Layer In many practical cases, data come from different domains. We suppose that the correlation between the emotion state and the sentiment orientation is stable across domains, but one emotion may have different expressions when the domain varies. E.g., "satisfaction" may be expressed as "interesting" or "attractive" for a book, while it may be expressed as "efficient" for an electronic device. Formally, we have
$$P(e \mid y, r_1) = P(e \mid y, r_2) \quad (1)$$
$$P(w \mid e, r_1) \neq P(w \mid e, r_2) \quad (2)$$
where $e$ denotes the emotion, $y$ denotes the author's sentiment orientation, $r_1$ and $r_2$ denote two different domains, and $w$ denotes the emotional words.
Along this line, we propose the Joint Emotion Analysis Model (named JEAM for abbreviation) utilizing the probabilistic methods. See details in the next section.

Problem Formulation
The CSC problem can be formulated as follows: Suppose we have two sets of data, denoted as and , which represent the source domain data and the target domain data respectively. In the CSC problem, the source domain data consist of labeled instances, denoted by

The JEAM Model
To model the author's emotion state contained in the document, we propose the JEAM model based on the probabilistic graphical principle. Note that all the factors and edges in JEAM are derived from the specific concepts and relations in EA, e.g., Eq(1) and Eq(2). We draw the graphical representation of JEAM in Figure 1, and show the notations of this paper in Table 1.
In Figure 1, $y$ denotes the sentiment orientation of the author, which is a latent variable in this model. $e$ denotes any emotion (topic) generated by $y$ from the conditional probability $P(e \mid y)$; $e$ is also a latent variable in this model. $r$ denotes any data domain, e.g., books, dvd, kitchen, and electronics. $d$ denotes any document chosen from domain $r$ with label $y$. For documents from the source domain, the conditional probability $P(y \mid d, r)$ is known, which can be used to supervise the modeling process. $u$ denotes the prior sentiment polarity of the corresponding emotional word. In practice, $u$ can be obtained from the emotional lexicon. $w$ denotes any emotional word with polarity $u$, which is chosen over words conditioned on emotion $e$ and domain $r$ from the conditional probability $P(w \mid e, r, u)$. In this paper, we only select emotional words with strong sentiment polarities to represent the vector of the document. Therefore, we rebuild the data with the help of the emotional lexicon, cutting out the non-emotional words. As a result, any word chosen from the rebuilt data is an emotional word, which is assumed not to change its polarity across domains. The joint probability over all the observed variables is obtained by summing over the hidden variables:
$$P(w, d, r, u) = \sum_{e, y} P(e, y, w, d, r, u) \quad (3)$$
Based on the graphical model, we have:
$$P(e, y, w, d, r, u) = P(w \mid e, r, u)\, P(y \mid d, r)\, P(e \mid y)\, P(d)\, P(r)\, P(u) \quad (4)$$
We need to learn the unobservable probabilities (e.g., $P(w \mid e, r, u)$, $P(y \mid d, r)$, $P(e \mid y)$, and the marginal probabilities) to infer the hidden emotion distribution. Therefore, we develop an Expectation-Maximization (EM) algorithm to maximize the log likelihood of generating the whole dataset and obtain the iterative formula in the E-step as follows:
$$P(e, y \mid w, d, r, u) = \frac{P(w \mid e, r, u)\, P(y \mid d, r)\, P(e \mid y)}{\sum_{e', y'} P(w \mid e', r, u)\, P(y' \mid d, r)\, P(e' \mid y')}$$
where all the factors are calculated in the M-step, similar to PLSA and HIDC [Hofmann, 1999; Zhuang et al., 2013].
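As a rough illustration of the E-step, the posterior over the latent pair (emotion e, sentiment y) for one observed tuple (word w, document d, domain r, word polarity u) can be obtained by normalizing the factorization in Eq. (4). The sketch below is plain Python under stated assumptions: the function name `posterior_e_y`, the toy emotion names and all probability values are hypothetical and are not taken from the paper.

```python
from itertools import product


def posterior_e_y(p_w_given_eru, p_y_given_dr, p_e_given_y, emotions, labels):
    """Return P(e, y | w, d, r, u) for one fixed observation (w, d, r, u).

    p_w_given_eru[e]  : P(w | e, r, u) for the observed w, r, u
    p_y_given_dr[y]   : P(y | d, r) for the observed d, r
    p_e_given_y[y][e] : P(e | y)
    The marginals P(d), P(r), P(u) do not depend on (e, y), so they cancel
    in the normalization and are left out here.
    """
    joint = {
        (e, y): p_w_given_eru[e] * p_y_given_dr[y] * p_e_given_y[y][e]
        for e, y in product(emotions, labels)
    }
    total = sum(joint.values())
    if total == 0.0:
        # No (e, y) pair is compatible with the observation.
        return {key: 0.0 for key in joint}
    return {key: value / total for key, value in joint.items()}


# Toy numbers (hypothetical): two emotions and two sentiment labels.
emotions = ["joy", "anger"]
labels = ["pos", "neg"]
post = posterior_e_y(
    p_w_given_eru={"joy": 0.30, "anger": 0.05},
    p_y_given_dr={"pos": 0.5, "neg": 0.5},
    p_e_given_y={
        "pos": {"joy": 0.9, "anger": 0.1},
        "neg": {"joy": 0.2, "anger": 0.8},
    },
    emotions=emotions,
    labels=labels,
)
```

In a full EM loop this posterior would be computed for every observed tuple and then used as soft counts when re-estimating the factors in the M-step; that part is omitted because the M-step formulas are not stated in the text.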

CSC via JEAM
To use JEAM to solve CSC problems, we adopt two optimizations. First, we supervise the EM optimization with the polarity information of the emotional words and of the instances in the source domain. On the one hand, we estimate $P(e, y \mid w, d, r, u)$ utilizing the polarity label of the emotional words. Let the emotion set be divided into a positive set and a negative set. We set $P(e, y \mid w, d, r, u) = 0$ throughout the EM process whenever the polarities of the emotion and the current emotional word differ. On the other hand, we estimate the probability $P(y \mid d, r)$ with the label information of the instances in the source domain. When the document is from the source domain, we set $P(y \mid d, r) = 0$ if $y$ differs from the ground truth.
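The two forms of supervision described above can be sketched as a masking step applied to the E-step posterior. This is a minimal sketch under assumptions: the function name `mask_posterior`, the `emotion_polarity` mapping and the toy numbers are hypothetical, and renormalizing after masking is a choice made here, since the text only states that the entries are set to zero. Because the probability of y given d and r is a factor of the posterior, zeroing it for a wrong label has the same effect as the second mask below.

```python
def mask_posterior(posterior, emotion_polarity, word_polarity, true_label=None):
    """Zero out posterior entries that violate the supervision, then renormalize.

    posterior        : dict mapping (emotion, label) -> probability
    emotion_polarity : dict mapping emotion -> "pos" or "neg" (hypothetical split)
    word_polarity    : prior polarity u of the current emotional word
    true_label       : ground-truth label for a source-domain document,
                       or None for a target-domain document
    """
    masked = {}
    for (e, y), p in posterior.items():
        if emotion_polarity[e] != word_polarity:
            p = 0.0  # emotion and emotional word have different polarities
        if true_label is not None and y != true_label:
            p = 0.0  # label contradicts the source-domain ground truth
        masked[(e, y)] = p
    total = sum(masked.values())
    # Renormalization is an assumption of this sketch, not stated in the text.
    return {k: (v / total if total > 0 else 0.0) for k, v in masked.items()}


posterior = {
    ("joy", "pos"): 0.72, ("joy", "neg"): 0.16,
    ("anger", "pos"): 0.02, ("anger", "neg"): 0.10,
}
polarity = {"joy": "pos", "anger": "neg"}

# Target-domain document: only the word-polarity mask applies.
target_doc = mask_posterior(posterior, polarity, word_polarity="pos")
# Source-domain document labeled positive: both masks apply.
source_doc = mask_posterior(posterior, polarity, word_polarity="pos", true_label="pos")
```

For the source-domain case all remaining mass ends up on the single pair that agrees with both the word polarity and the document label.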
Second, we reconstruct each document as a vector $[e_1, e_2, \ldots]$, the distribution over emotions for that document, whose entries can be computed from $P(w \mid e, r, u)$, $P(e \mid y)$ and the marginal probabilities obtained after the EM algorithm. The main purpose of this step is to process a newly given document quickly, without training JEAM again on the new input. Finally, a Support Vector Machine (SVM) is trained on the reconstructed labeled data from the source domain and used to assign polarities to the reconstructed documents from the target domain.

Experimental Setup
We demonstrate the effectiveness of JEAM on the Multi-Domain sentiment data set [Blitzer et al., 2007], which contains real-world product reviews taken from Amazon.com in four domains: books, dvd, electronics and kitchen. We randomly select 1800 documents from one domain (the source domain) and 200 documents from another domain (the target domain). Then, we train a sentiment classifier using the documents selected from the source domain and assign labels to the documents selected from the target domain, which generates 12 classification tasks. We perform 10 random selections and report the average results over the 10 runs. We use the MPQA subjective lexicon (footnote 1) as the emotional lexicon. In our experiments, only strongly subjective clues are considered as emotional words, consisting of 1717 positive and 3621 negative words. We rebuild the dataset by cutting out the non-emotional words. For experiment parameters, we set = 25, = 25, and = 100 after extensive experiments. Considering the data in practice, the sentiment orientation $y$ has only two values, positive or negative. Note that we apply neither instance selection nor complicated feature selection (we only filter out low-frequency words) to our proposed method or to the other methods in the comparison.
Table 1: Notations. $e$: emotion; $w$: emotional word; $r$: domain; $d$: document; $u$: prior sentiment polarity of the emotional word; $y$: sentiment polarity of the document. The table also lists the symbols for all the observed variables and all the model parameters.
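The count of 12 tasks follows from taking every ordered (source, target) pair of the four domains, and the dataset rebuilding step amounts to keeping only tokens found in the emotional lexicon. The following sketch illustrates both in plain Python; the function names and the two toy lexicon entries are hypothetical and do not reproduce the actual MPQA lexicon or the authors' preprocessing.

```python
from itertools import permutations

DOMAINS = ["books", "dvd", "electronics", "kitchen"]


def cross_domain_tasks(domains):
    """All ordered (source, target) pairs with source != target."""
    return list(permutations(domains, 2))


def keep_emotional_words(tokens, lexicon):
    """Drop every token that does not appear in the emotional lexicon."""
    return [t for t in tokens if t in lexicon]


tasks = cross_domain_tasks(DOMAINS)

# Hypothetical two-entry lexicon mapping a word to its prior polarity.
toy_lexicon = {"great": "pos", "awful": "neg"}
tokens = "the plot was great but the ending awful".split()
filtered = keep_emotional_words(tokens, toy_lexicon)
```

Four domains give 4 x 3 = 12 ordered pairs, so books-to-dvd and dvd-to-books count as separate tasks.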

Performance of Emotional Words
We show the effectiveness of introducing emotional words to solve the CSC problem. In JEAM, we reconstruct the documents by cutting out the non-emotional words. To compare the classification accuracy on the original documents and on the reconstructed (emotional) documents, we choose two common classification algorithms, linear SVM and PLSA (topic size = 10). The experimental results show that both SVM and PLSA perform better on the emotional documents (60.43% and 60.48%, respectively) than on the original documents (57.73% and 56.69%) in terms of average accuracy over the 12 classification tasks.

Effectiveness of using domain information and word polarity
We show the effectiveness of using domain information and word polarity, which are employed in our approach. For this purpose, we repeat the experiment without introducing domain and word polarity (node $r$ and node $u$) into the model. Figure 2 shows the results. The highest performance is achieved when domain information and word polarity are both used, while the lowest performance is obtained when neither of them is used.

Comparison with the Baselines
We compare our proposed approach with PLSA, SVM, SFA [Pan et al., 2010], JST [He et al., 2011] and HIDC [Zhuang et al., 2013]. The experimental results of the 12 classification tasks are shown in Figure 3. It can be observed that our proposed approach outperforms all the other approaches in general. Note that, in order to obtain a more precise comparison of the algorithms, we apply neither instance selection nor complicated feature selection. The results of our proposed approach could possibly be improved with the help of these selection strategies.
1 http://www.cs.pitt.edu/mpqa

Conclusion
In this paper, we propose a novel framework to solve the CSC problem by modelling emotions across domains. We analyze the relation between the author and the document based on emotion theories from the field of psychology. Along this line, we propose a framework named EA, which takes emotions and domains into account. Based on EA, we propose a novel model named JEAM to model the author's emotion state for cross-domain sentiment classification. We conduct extensive experiments on real datasets to evaluate JEAM. The experimental results show that emotion plays an important role in CSC and that JEAM outperforms existing state-of-the-art methods on the task of CSC.