Potential and Limitations of Cross-Domain Sentiment Classification

In this paper we investigate the cross-domain performance of a current state-of-the-art sentiment analysis system. For this purpose we train a convolutional neural network (CNN) on data from different domains and evaluate its performance on other domains. Furthermore, we evaluate the usefulness of combining a large number of smaller annotated corpora into one large corpus. Our results show that more sophisticated approaches are required to train a system that works equally well on various domains.


Introduction
Most work on sentiment analysis focuses on training and testing a sentiment classifier on data from the same domain. For example, a new classifier is trained on tweets and tested on tweets. However, in real-world scenarios the data might originate from different sources and domains. Often, sentiment analysis is performed on a domain for which no training data is available. Instead of investing large amounts of money to create such a corpus, it would make more sense to use an existing classifier. However, it is not always clear how well the existing classifier generalizes to the target domain. Although it is obvious that the performance will be affected negatively, the magnitude is not known. This missing information is useful for assessing the need to create a new classifier for a given domain, which is very costly.
Thus, our work is driven by the question of how useful sentiment classifiers are if we evaluate them with datasets from unseen domains, and if a combination of data from different domains might help to overcome the recurring problem of having too little data. Furthermore, we assess the usefulness of large weakly supervised corpora where the labels are inferred from properties of the text, e.g. the smileys in the text or the rating of a review. We answer the question of how much gain one can expect from leveraging such corpora.
Usually, cross-domain sentiment analysis has a low performance due to the vocabulary mismatch (Pan et al., 2010). We therefore assess the impact of word embeddings trained on large amounts of data, which guarantees a large coverage of the vocabulary. We then assess how word embeddings trained on different types of data (e.g. News, Twitter) impact the performance of the system. For this, we train a convolutional neural network (CNN) based on (Deriu et al., 2016) on data from different combinations of domains and evaluate its performance on foreign domains.
Related Work Some research has already been done in the field of cross-domain sentiment classification. Most of the work in this area focuses on the mismatch in the vocabularies of the different domains. (Pan et al., 2010) overcome the challenge of vocabulary mismatch by employing a spectral feature alignment algorithm to map domain-specific words to a unified representation, which can then be used in conjunction with the domain-independent words to lower the mismatch between the domains. (Blitzer et al., 2007) use structural correspondence learning (SCL) to adapt the vocabulary of the various domains. (Li et al., 2008) experiment with ensembles of classifiers, where each classifier is trained on a specific domain and the classifiers are then used in combination to boost the cross-domain performance. (Bollegala et al., 2011) use a semi-supervised algorithm, which leverages supervised and unsupervised data, to create a sentiment-sensitive thesaurus that is used to compute the relatedness of words from different domains. (Bollegala et al., 2016) use the aforementioned sentiment-sensitive thesaurus to generate sentiment-sensitive word embeddings. (Glorot et al., 2011) apply unsupervised cross-domain sentiment classification, where they use spectral embeddings to project words and documents into a low-dimensional embedding space. (Yu et al., 2016) borrow ideas from SCL and combine them with auxiliary binary prediction tasks to learn dense sentence embeddings which incorporate sentiment and can be used in a cross-domain context.

Contribution
Our work presents an in-depth analysis of the generalization power of the current state-of-the-art in a cross-domain setting. This work can be used to estimate the expected drop in performance for a given sentiment classifier.

Training
Model We use a state-of-the-art model based on the CNN used by (Deriu et al., 2016). The architecture is composed of two consecutive convolutional and pooling layers followed by a fully-connected layer and a softmax layer. Table 1 gives an overview of the hyper-parameters used for the CNN.

Hyper-Parameter                     Value
Number of convolutional filters     200
Filter width (both layers)          6
Pooling length (first layer)        4
Pooling stride (first layer)        2
Activation                          relu

Table 1: Overview of the hyper-parameters chosen for the CNN. Note that we define a layer as one convolutional layer followed by one pooling layer. For the second pooling layer the length is chosen over the whole feature.
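To make the effect of the hyper-parameters in Table 1 concrete, the following sketch computes the length of the feature maps after each layer. It is only an illustration under stated assumptions: the paper does not specify the padding mode or the convolution stride, so the sketch assumes no padding and a stride of 1, and the sentence length of 50 tokens is a hypothetical value.

```python
def conv_output_length(length, filter_width):
    """Length after a 1-D convolution (assumed: no padding, stride 1)."""
    return length - filter_width + 1


def pool_output_length(length, pool_length, stride):
    """Length after 1-D max pooling (assumed: no padding)."""
    return (length - pool_length) // stride + 1


def feature_map_lengths(sentence_length, filter_width=6, pool_length=4, pool_stride=2):
    """Trace the sequence length through the two convolution/pooling layers."""
    conv1 = conv_output_length(sentence_length, filter_width)
    pool1 = pool_output_length(conv1, pool_length, pool_stride)
    conv2 = conv_output_length(pool1, filter_width)
    pool2 = 1  # the second pooling spans the whole feature map
    return {"conv1": conv1, "pool1": pool1, "conv2": conv2, "pool2": pool2}


# Hypothetical input of 50 tokens.
lengths = feature_map_lengths(sentence_length=50)
print(lengths)
```

With 200 filters, the second pooling layer would then hand a 200-dimensional vector to the fully-connected layer, independent of the sentence length.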

3-Phase Learning
We apply the 3-Phase learning procedure (see Figure 1) proposed by (Severyn et al., 2015), where we first create word embeddings based on the skip-gram model (Mikolov et al., 2013). For our purposes we create embeddings with 52 dimensions as in (Deriu et al., 2016). In a second step we apply a distant-phase, where we pre-train the CNN on a large corpus of weakly supervised data in which the sentiment labels are inferred from properties of the texts. In this phase the word embeddings are updated to incorporate sentiment-specific information. The third and final phase is the supervised phase, where we train the CNN on a corpus of manually annotated texts.
Training For the distant-supervised and the supervised phase we employ the AdaDelta optimizer to train the CNN. The hyper-parameters are set to the default values of ε = 1e-6 and ρ = 0.95, and the learning rate is set to lr = 1.0. Many of the datasets are unbalanced (see Table 2) and, to mitigate this problem, we use class-weights during the learning procedure. The following formula was used to compute the class-weights for each dataset D and each class i ∈ S: where d_i denotes the number of elements in D that belong to class i. Thus, over-represented classes will get a lower weight than under-represented classes. The loss function is scaled with the class-weight for the respective class when training the model.
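The class-weight formula itself is not reproduced in the text above, so the following sketch should not be read as the authors' formula. It assumes a common inverse-frequency weighting, w_i = |D| / (|S| · d_i), which has the stated property that over-represented classes receive a lower weight. The function names and the toy label distribution are hypothetical.

```python
from collections import Counter
import math


def class_weights(labels):
    """Assumed inverse-frequency weighting: w_i = |D| / (|S| * d_i).

    This is an illustrative choice, not the formula from the paper.
    """
    counts = Counter(labels)  # d_i for every class i
    total = len(labels)       # |D|
    return {c: total / (len(counts) * d) for c, d in counts.items()}


def weighted_cross_entropy(probabilities, gold_label, weights):
    """Cross-entropy of one example, scaled by the weight of its gold class."""
    return -weights[gold_label] * math.log(probabilities[gold_label])


# Hypothetical unbalanced dataset with 6 positive, 3 neutral, 1 negative text.
labels = ["positive"] * 6 + ["neutral"] * 3 + ["negative"] * 1
weights = class_weights(labels)

# The same predicted probability costs more for the rare class.
loss_common = weighted_cross_entropy({"positive": 0.5}, "positive", weights)
loss_rare = weighted_cross_entropy({"negative": 0.5}, "negative", weights)
print(weights, loss_common, loss_rare)
```

Any other weighting that decreases monotonically in d_i would satisfy the description in the text equally well.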

Data
For each of the aforementioned phases we experiment with different corpora. We use 3 different corpora for word embeddings, 2 corpora for the distant-supervised phase, where the sentiment is inferred from the smileys in the case of the tweets and from the user ratings in the case of the product reviews, and 8 corpora for the supervised phase. A detailed overview of the data is provided in Table 2.
Evaluation For the evaluation we use the macro-averaged F1-score of the positive and negative classes, F1 = (F1_pos + F1_neg) / 2, since it is also used in SemEval (Nakov et al., 2016) as the standard measure of quality.
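The evaluation measure can be computed as in the following sketch. The label names and the toy predictions are hypothetical. Note that the neutral class has no F1 term of its own but still influences the score, because neutral texts predicted as positive or negative count as false positives and vice versa.

```python
def f1_for_class(gold, predicted, label):
    """F1 score of a single class, computed from parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1_pos_neg(gold, predicted, pos="positive", neg="negative"):
    """F1 = (F1_pos + F1_neg) / 2, averaged over the two polar classes only."""
    return (f1_for_class(gold, predicted, pos) + f1_for_class(gold, predicted, neg)) / 2


# Hypothetical gold labels and predictions.
gold = ["positive", "positive", "negative", "neutral", "negative", "neutral"]
pred = ["positive", "neutral", "negative", "neutral", "positive", "neutral"]
score = macro_f1_pos_neg(gold, pred)
print(round(score, 4))
```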

Experiments & Results
In the following we use four terms. A system trained on the data of a single target domain (TD) is called a specialized TD system. A system trained on one foreign domain (FD) dataset and evaluated on the TD test set is called a specialized FD system. A system trained on a combination of FD corpora is called a generalized FD system, and a system trained on all data is called a generalized system.

Word Embeddings and Distant-Phase
We train the CNN with all possible combinations of word-embeddings and distant-phases to assess which combination works best for each domain. Additionally we include experiments where we use randomly initialized word embeddings denoted as Random, as well as experiments where the distant-phase is omitted, denoted as None. Tables 4, 5, and 6 give an overview of the results. In the following we present the main findings.
The complexity among the domains varies. The differences between the averaged scores of the domains are very high. The average score of the DAI-tweets is 66 points in F1 score, whereas the average score of the JCR-quotations is only 39.3 points. These differences could be caused by the different sizes of the corpora, by variations in the quality of the annotations, or by the difficulty of the domains themselves.
Random word embeddings are not necessarily bad. Generally it is assumed that using pre-trained word embeddings increases the performance compared to using randomly initialized values. Indeed, the average performance of the random word embeddings (see Table 5.B) lies 3 points below the average achieved by the News-embeddings. Random word embeddings yield the best score for only one domain out of eight. However, a closer look at the averaged scores over the combinations of word embeddings and distant-phases (see Table 6) reveals that the combination of random word embeddings with a distant-phase on reviews achieves an average score of 59.4, which is the second-highest average score. Thus, a distant-phase can compensate for the lack of pre-trained word embeddings.
Pre-trained word embeddings are not necessarily good. The same analysis as above reveals a similar picture for the Wikipedia-embeddings. The average score achieved using the Wikipedia-embeddings lies 2 points below the average score achieved by the News-embeddings. The average scores achieved by using the Wikipedia-embeddings on each domain (see Table 5.B) are up to 6 points worse than the best score for the particular domain. Thus, pre-trained word embeddings do not imply an increase in score.
Vocabulary coverage is important.
Distant-Phase as score-booster. Performing a distant-phase yields the best scores for eight out of nine domains, the exception being the MPQ-reviews. The average scores achieved by performing a distant-phase show the same picture (see Table 5.C, Avg. column), where using the Review-corpus performs 7 points above omitting the distant-phase. Using tweets for the distant-phase improves the score by 4 points on average. Thus, a distant-phase boosts the performance of the system. This is consistent with the results shown in (Deriu et al., 2016). However, we cannot give any recommendation as to which corpus to use, even if using reviews mostly performed better in our case.
Table 6: Average F1 score for each combination of word embeddings and distant-phase corpus.

Cross-Domain Experiments
We train the system on the data of one domain, called the target domain (TD), and test it on the TD as well as on the foreign domains (FD). The system is optimized for the TD by using the test set of the TD to perform early-stopping. Furthermore, we trained the system on the union of all domains and tested it on each domain separately; here, too, we used the TD test set for early-stopping. For each domain we use the best combination of word embeddings and distant-phase from Section 3.1 as base model (see Table 7). The results are shown in Table 8.
The generalization power of specialized systems is poor. As expected, the best score is achieved by training and testing on the same domain. However, there is a large deterioration in score when the system is tested on a domain other than the one it was trained on. The average score achieved by a specialized FD system on the TD is far below the scores achieved by a specialized TD system. The differences range from 15 (JCR) up to 30 (DAI and DIL) points in F1 score.
The last row shows the average score achieved on a particular dataset. The scores in bold denote the best score achieved on the dataset. For each domain we denote the text-type as follows: T: Tweets, N: News, R: Reviews, H: Headlines and Q: Quotations. Alongside the text-type we also note the size of the corpus.
shows the difference between the best score of TD and FD Avg.
A general system does not increase the system's prediction power. The results achieved by training on the union of all data and optimizing for a specific TD show no increase in score on the TD. Only on the JCR-quotations did the score increase; on the Twitter datasets (DAI and SEval) the score is similar to the score of the target-specific system. In all other cases the systems trained on the union of all data perform worse. In the case of the HUL-reviews the drop is as large as 14 points.
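The comparison between a specialized TD system and the average specialized FD system can be read off a matrix of cross-domain scores. The sketch below shows the computation; the domain names A, B, C and all numbers are made up for illustration and are not results from this paper.

```python
def cross_domain_gaps(scores):
    """Return, per target domain, the TD score minus the average FD score.

    scores[train_domain][test_domain] holds the F1 score of a system
    trained on train_domain and evaluated on test_domain.
    """
    gaps = {}
    for target in scores:
        td_score = scores[target][target]
        fd_scores = [scores[source][target] for source in scores if source != target]
        gaps[target] = td_score - sum(fd_scores) / len(fd_scores)
    return gaps


# Made-up illustrative numbers, not results from the paper.
scores = {
    "A": {"A": 70.0, "B": 40.0, "C": 50.0},
    "B": {"A": 50.0, "B": 60.0, "C": 45.0},
    "C": {"A": 60.0, "B": 50.0, "C": 65.0},
}
gaps = cross_domain_gaps(scores)
print(gaps)
```

A large positive gap for a domain indicates that systems trained elsewhere transfer poorly to it.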

Ablation Experiments
To further assess the generalization performance we ran ablation experiments as follows: We combine all the training sets except for the target domain set, train the system on this combination of data, and then evaluate the system on the target domain.
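The construction of the training set for an ablation experiment can be sketched as a leave-one-domain-out union. The function name, the domain keys and the toy examples below are hypothetical.

```python
def ablation_training_set(corpora, target_domain):
    """Union of all training sets except the one of the target domain."""
    if target_domain not in corpora:
        raise KeyError(target_domain)
    return [
        example
        for domain, examples in corpora.items()
        if domain != target_domain
        for example in examples
    ]


# Hypothetical toy corpora: (text, label) pairs keyed by domain name.
corpora = {
    "tweets": [("great game", "positive"), ("so tired", "negative")],
    "news": [("markets fall", "negative")],
    "reviews": [("works well", "positive"), ("it arrived", "neutral")],
}

# Train on tweets and reviews, then evaluate on the held-out news domain.
train = ablation_training_set(corpora, target_domain="news")
print(len(train))
```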
The generalized FD system performs better than a specialized FD system. Table 9 shows the performance of the system trained on the combination of FD data excluding the TD. The results show that in most cases training on a mixture of FD data achieves better scores on the TD data than training on a single FD (see Table 8). As expected, the generalized FD system is usually not able to reach the score achieved on the TD data by the specialized TD system. Table 9 also shows the difference between the specialized TD system and the generalized FD system. The differences range from 3 points in the case of the DIL-reviews up to 17 points for the MPQ-news. Only for the JCR-quotations does the generalized FD system perform better. Thus, it is best to have TD data, although in some cases an acceptable score might be achieved using a generalized FD system.
Table 9: Results of the ablation experiments. The last column shows the difference between the specialized TD system and the ablation system trained on a mix of FD data excluding data from the TD.

Augmentation Experiments
To further investigate the difference between a specialized system and a general system we performed experiments where we start with a specialized TD, specialized FD, or a general FD system (referred to as base system) and gradually transform it to a generalized system by adding data. Let n be the number of texts used to train the base system. Then we augment the training set by adding n/2, n and 2n datapoints. The evaluation is always performed on the TD.
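The augmentation schedule can be sketched as follows. The text does not say how the additional texts are sampled, so the sketch assumes a single random shuffle of the augmentation pool with nested subsets, meaning that the n texts added in the second step contain the n/2 texts of the first. The function name, the seed and the toy data are hypothetical.

```python
import random


def augmentation_sets(base, pool, seed=0):
    """Yield (factor, training set) for the base set plus n/2, n and 2n added texts.

    Assumption: the added texts are nested prefixes of one shuffled pool.
    """
    n = len(base)
    rng = random.Random(seed)
    shuffled = list(pool)
    rng.shuffle(shuffled)
    for factor in (0.5, 1.0, 2.0):
        extra = int(n * factor)
        if extra > len(shuffled):
            raise ValueError("pool too small for augmentation factor %s" % factor)
        yield factor, list(base) + shuffled[:extra]


# Hypothetical base system data (n = 4) and augmentation pool.
base = [("td text %d" % i, "positive") for i in range(4)]
pool = [("fd text %d" % i, "negative") for i in range(10)]
sizes = {factor: len(train) for factor, train in augmentation_sets(base, pool)}
print(sizes)
```

Each resulting training set would be used to train a separate system, which is then evaluated on the TD test set.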
Adding FD to a specialized TD system decreases the performance on the target domain. For each of the 8 TDs we start with a specialized TD system and gradually add a combination of FD data (mixed FD augmentation) or data from a single FD (single FD augmentation) and evaluate the performance on the TD. Figure 2 shows the scores averaged over all experiments for each TD. The trend shows that adding more data from one or more FDs for training decreases the performance of the system.
Adding TD to an FD system increases the score. For each TD we start with a specialized FD system (single FD base) or a generalized FD system (mix FD base) and gradually add more data from the TD. In both cases adding more data from the TD increases the performance of the system when it is evaluated on the TD (see Figure 3).

Conclusion
In this work we gave an overview of the deterioration in quality when a sentiment classifier is used on a domain it was not trained on. Our in-depth analysis showed that having a large corpus of weakly labelled data boosts the score by 7 points on average. We also showed that using pre-trained word embeddings helps to increase the score by 3-4 points on average. This work can be used as a basis when evaluating sentiment classifiers that were trained on a domain different from the target domain. Future work in this area would include a more in-depth analysis of the interplay among different domains: for instance, our results show that a system trained on tweets performs better on reviews than a system trained on news. Here, a better understanding of these mechanisms is necessary to better assess the potential of cross-domain classification. Furthermore, one can analyse the effect of the distant-phases and word embeddings in the cross-domain setting. How does the usage of different types of word embeddings and weakly labelled data impact the performance in a cross-domain setting? Does the usage of weakly labelled data increase the performance of a sentiment classifier on a foreign domain? We are convinced that answering these questions will help to develop sentiment analysis systems that perform better on new, unknown domains.