Domain Adapted Word Embeddings for Improved Sentiment Classification

Generic word embeddings are trained on large-scale generic corpora; Domain Specific (DS) word embeddings are trained only on data from a domain of interest. This paper proposes a method to combine the breadth of generic embeddings with the specificity of domain specific embeddings. The resulting embeddings, called Domain Adapted (DA) word embeddings, are formed by aligning corresponding word vectors using Canonical Correlation Analysis (CCA) or the related nonlinear Kernel CCA. Evaluation results on sentiment classification tasks show that the DA embeddings substantially outperform both generic and DS embeddings when used as input features to standard or state-of-the-art sentence encoding algorithms for classification.

1 Dimensions of CCA and KCCA projections.
Using both KCCA and CCA, generic embeddings and DS embeddings are projected onto their d largest correlated dimensions. By construction, $d \le \min(d_1, d_2)$. The best d for each data set is obtained via 10-fold cross validation on the sentiment classification task. Table 2 provides dimensions of all word embeddings considered. Note that for LSA and DA, the average word embedding dimension across all four data sets is reported. Generic word embeddings such as GloVe and word2vec have fixed dimensions across all four data sets.
Parameter σ of the Gaussian kernel used in KCCA is obtained empirically from the data. The median (µ) of pairwise distances between data points mapped by the kernel function is used to determine σ. Typically σ = µ or σ = 2µ. In this section both values are considered for σ and results with the best performing σ are reported.
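The median heuristic described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the toy points are hypothetical, distances are assumed to be Euclidean distances between the input vectors, and the kernel is assumed to use the common $\exp(-\|a-b\|^2 / (2\sigma^2))$ parameterization.

```python
import math
from itertools import combinations
from statistics import median


def median_pairwise_distance(points):
    """Median Euclidean distance over all unordered pairs of points."""
    return median(math.dist(a, b) for a, b in combinations(points, 2))


def gaussian_kernel(a, b, sigma):
    """Gaussian kernel, assuming the exp(-||a - b||^2 / (2 sigma^2)) form."""
    return math.exp(-math.dist(a, b) ** 2 / (2.0 * sigma ** 2))


# Toy 2-d vectors, made up for illustration only.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
mu = median_pairwise_distance(points)

# The two candidate bandwidths mentioned in the text: sigma = mu and sigma = 2 mu.
candidate_sigmas = [mu, 2.0 * mu]
```

Each candidate value would then be evaluated on the downstream task, and the better performing one kept.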

Introduction
Generic word embeddings such as GloVe and word2vec (Pennington et al., 2014; Mikolov et al., 2013), which are pre-trained on large sets of raw text, have demonstrated remarkable success when used as features to a supervised learner in various applications such as the sentiment classification of text documents. There are, however, many applications with domain specific vocabularies and relatively small amounts of data. The performance of generic word embeddings in such applications is limited, since word embeddings pre-trained on generic corpora do not capture domain specific semantics/knowledge, while embeddings learned on small data sets are of low quality.
A concrete example of a small-sized domain specific corpus is the Substance Use Disorders (SUDs) data set (Quanbeck et al., 2014; Litvin et al., 2013), which contains messages on discussion forums for people with substance addictions. These forums are part of a mobile health intervention treatment that encourages participants to engage in sobriety-related discussions. The goal of such treatments is to analyze the content of participants' digital media and provide human intervention via machine learning algorithms. This data is both domain specific and limited in size. Other examples include customer support tickets reporting issues with taxi-cab services, product reviews, reviews of restaurants and movies, discussions by special interest groups and political surveys. In general, such data sets are common in domains where words have a different sentiment from what they would have elsewhere.
Such data sets present significant challenges for word embedding learning algorithms. First, words in data on specific topics have a different distribution than words from generic corpora. Hence, using generic word embeddings obtained from algorithms trained on a corpus such as Wikipedia may introduce considerable errors in performance metrics on specific downstream tasks such as sentiment classification. For example, in SUDs, discussions are focused on topics related to recovery and addiction; the sentiment behind the word 'party' may be very different in a dating context than in a substance abuse context. Thus domain specific vocabularies and word semantics may be a problem for pre-trained sentiment classification models (Blitzer et al., 2007). Second, there is insufficient data to completely retrain a new set of word embeddings. The SUDs data set consists of a few hundred people, and only a fraction of these are active (Firth et al., 2017; Naslund et al., 2015). This results in a small data set of text messages available for analysis. Furthermore, content is generated spontaneously on a day-to-day basis, and language use is informal and unstructured. Fine-tuning the generic word embeddings also leads to noisy outputs due to the highly non-convex training objective and the small amount of data. Since such data sets are common, a simple and effective method to adapt word embedding approaches is highly valuable. While existing work (Yin and Schütze, 2016; Luo et al., 2014; Mehrkanoon and Suykens, 2017; Anoop et al., 2015; Blitzer et al., 2011) combines word embeddings from different algorithms to improve upon intrinsic tasks such as similarities and analogies, there does not exist a concrete method to combine multiple embeddings to perform domain adaptation or improve on extrinsic tasks. This paper proposes a method for obtaining high quality word embeddings that capture domain specific semantics and are suitable for tasks on the specific domain.
The new Domain Adapted (DA) embeddings are obtained by combining generic embeddings and Domain Specific (DS) embeddings via CCA/KCCA. Generic embeddings are trained on large corpora and do not capture domain specific semantics, while DS embeddings are obtained from the domain specific data set via algorithms such as Latent Semantic Analysis (LSA) or other embedding methods. The two sets of embeddings are combined using a linear CCA (Hotelling, 1936) or a nonlinear kernel CCA (KCCA) (Hardoon et al., 2004). They are projected along the directions of maximum correlation, and a new (DA) embedding is formed by averaging the projections of the generic embeddings and DS embeddings. The DA embeddings are then evaluated in a sentiment classification setting. Empirically, it is shown that the CCA/KCCA combined DA embeddings improve substantially over the generic embeddings, DS embeddings and a concatenation-SVD (concSVD) based baseline.
The remainder of this paper is organized as follows. Section 2 briefly introduces CCA/KCCA and details the procedure used to obtain the DA embeddings. Section 3 describes the experimental setup. Section 4 discusses the results from sentiment classification tasks on benchmark data sets using standard classification as well as a sophisticated neural network based sentence encoding algorithm. Section 5 concludes this work.

Domain Adapted Word Embeddings
Training word embeddings directly on small data sets leads to noisy outputs, while embeddings from generic corpora fail to capture specific local meanings within the domain. Here we combine DS and generic embeddings using CCA/KCCA, which projects corresponding word vectors along the directions of maximum correlation.
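Let $W_{DS} \in \mathbb{R}^{|V_{DS}| \times d_1}$ be the matrix whose rows are the domain specific word embeddings (obtained by, e.g., running the LSA algorithm on the domain specific data set), where $V_{DS}$ is its vocabulary and $d_1$ is the dimension of the embeddings. Similarly, let $W_G \in \mathbb{R}^{|V_G| \times d_2}$ be the matrix of generic word embeddings (obtained by, e.g., running the GloVe algorithm on the Common Crawl data), where $V_G$ is the vocabulary and $d_2$ is the dimension of the embeddings. Let $V_\cap = V_{DS} \cap V_G$. Let $w_{i,DS}$ be the domain specific embedding of the word $i \in V_\cap$, and $w_{i,G}$ be its generic embedding. For one dimensional CCA, let $\phi_{DS}$ and $\phi_G$ be the projection directions of $w_{i,DS}$ and $w_{i,G}$ respectively. Then the projected values are
$$\bar{w}_{i,DS} = w_{i,DS}\,\phi_{DS}, \qquad \bar{w}_{i,G} = w_{i,G}\,\phi_G. \tag{1}$$
CCA maximizes the correlation between $\bar{w}_{i,DS}$ and $\bar{w}_{i,G}$ to obtain $\phi_{DS}$ and $\phi_G$ such that
$$\rho(\phi_{DS}, \phi_G) = \max_{\phi_{DS},\,\phi_G} \frac{E[\bar{w}_{i,DS}\,\bar{w}_{i,G}]}{\sqrt{E[\bar{w}_{i,DS}^2]\,E[\bar{w}_{i,G}^2]}}, \tag{2}$$
where $\rho$ is the correlation between the projected word embeddings and $E$ is the expectation over all words $i \in V_\cap$.
The d-dimensional CCA with $d > 1$ can be defined recursively. Suppose the first $d - 1$ pairs of canonical variables are defined. Then the $d$-th pair is defined by seeking vectors maximizing the same correlation function subject to the constraint that they be uncorrelated with the first $d - 1$ pairs. Equivalently, matrices of projection vectors $\Phi_{DS} \in \mathbb{R}^{d_1 \times d}$ and $\Phi_G \in \mathbb{R}^{d_2 \times d}$ are obtained for all vectors in $W_{DS}$ and $W_G$, where $d \le \min(d_1, d_2)$, and the embeddings $\bar{w}_{i,DS} = w_{i,DS}\,\Phi_{DS}$ and $\bar{w}_{i,G} = w_{i,G}\,\Phi_G$ are the projections along the directions of maximum correlation.
The final domain adapted embedding for word $i$ is given by $\hat{w}_{i,DA} = \alpha\,\bar{w}_{i,DS} + \beta\,\bar{w}_{i,G}$, where the parameters $\alpha$ and $\beta$ can be obtained by solving the following optimization,
$$\min_{\alpha,\,\beta}\; \left\|\bar{w}_{i,DS} - (\alpha\,\bar{w}_{i,DS} + \beta\,\bar{w}_{i,G})\right\|_2^2 + \left\|\bar{w}_{i,G} - (\alpha\,\bar{w}_{i,DS} + \beta\,\bar{w}_{i,G})\right\|_2^2. \tag{3}$$
Solving (3) gives a weighted combination with $\alpha = \beta = \tfrac{1}{2}$, i.e., the new vector is equal to the average of the two projections:
$$\hat{w}_{i,DA} = \tfrac{1}{2}\,\bar{w}_{i,DS} + \tfrac{1}{2}\,\bar{w}_{i,G}. \tag{4}$$
Because of its linear structure, the CCA in (2) may not always capture the best relationships between the two matrices. To account for nonlinearities, a kernel function, which implicitly maps the data into a high dimensional feature space, can be applied.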
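For example, given a vector $w \in \mathbb{R}^d$, a kernel function $K$ is written in the form of a feature map $\varphi$ defined by
$$\varphi : w = (w_1, \dots, w_d) \mapsto \varphi(w) = (\varphi_1(w), \dots, \varphi_m(w)), \qquad d < m,$$
such that, given $w_a$ and $w_b$, $K(w_a, w_b) = \langle \varphi(w_a), \varphi(w_b) \rangle$. In kernel CCA, data is first projected onto a high dimensional feature space before performing CCA. In this work the kernel function used is a Gaussian kernel, i.e.,
$$K(w_a, w_b) = \exp\!\left(-\frac{\|w_a - w_b\|^2}{2\sigma^2}\right).$$
The implementation of kernel CCA follows the standard algorithm described in several texts such as (Hardoon et al., 2004); see reference for details.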
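The linear CCA variant of this procedure can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names are hypothetical, the embeddings are mean-centered before computing covariances, a small ridge term `reg` is added for numerical stability, and the toy inputs are random. The rows of the two input matrices are assumed to be aligned, i.e., row i of each matrix is the embedding of the same word in the shared vocabulary.

```python
import numpy as np


def inv_sqrt(mat, eps=1e-8):
    """Inverse square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T


def cca_domain_adapt(w_ds, w_g, d, reg=1e-6):
    """Return (DA embeddings, top-d canonical correlations).

    w_ds: (n, d1) DS embeddings; w_g: (n, d2) generic embeddings.
    Row i of both matrices must belong to the same word.
    """
    # Centering is an assumption of this sketch.
    x = w_ds - w_ds.mean(axis=0)
    y = w_g - w_g.mean(axis=0)
    n = x.shape[0]
    # Regularized covariance and cross-covariance matrices.
    c_xx = x.T @ x / n + reg * np.eye(x.shape[1])
    c_yy = y.T @ y / n + reg * np.eye(y.shape[1])
    c_xy = x.T @ y / n
    # Whiten each view, then take the SVD of the whitened cross-covariance.
    k_x, k_y = inv_sqrt(c_xx), inv_sqrt(c_yy)
    u, s, vt = np.linalg.svd(k_x @ c_xy @ k_y)
    phi_ds = k_x @ u[:, :d]      # (d1, d) projection matrix for DS embeddings
    phi_g = k_y @ vt.T[:, :d]    # (d2, d) projection matrix for generic embeddings
    # Average the two projections (alpha = beta = 1/2).
    w_da = 0.5 * (x @ phi_ds) + 0.5 * (y @ phi_g)
    return w_da, s[:d]


# Toy data: two noisy linear views of a shared 3-d latent signal.
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 3))
w_ds = shared @ rng.normal(size=(3, 5)) + 0.1 * rng.normal(size=(200, 5))
w_g = shared @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(200, 8))
w_da, corrs = cca_domain_adapt(w_ds, w_g, d=3)
```

The singular values returned alongside the embeddings are the (regularized) canonical correlations, which can help when choosing d.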

Experimental Evaluation
This section evaluates DA embeddings in binary sentiment classification tasks on four standard data sets. Document embeddings are obtained via (i) a standard framework, i.e., document embeddings are a weighted combination of their constituent word embeddings, and (ii) initializing a state-of-the-art sentence encoding algorithm, InferSent (Conneau et al., 2017), with word embeddings to obtain sentence embeddings. Encoded sentences are then classified using a logistic regression classifier.
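As a rough illustration of the standard framework in (i), a document embedding can be formed as a weighted average of the embeddings of its in-vocabulary tokens. The weighting scheme is not specified here, so the sketch below assumes uniform weights unless a hypothetical `weights` dictionary is supplied; the function name and toy vectors are likewise made up for illustration.

```python
def document_embedding(tokens, embeddings, weights=None):
    """Weighted average of the embeddings of in-vocabulary tokens.

    tokens: list of str; embeddings: dict mapping token -> list of floats;
    weights: optional dict mapping token -> float (default weight is 1.0).
    Returns None when no token is in the vocabulary or all weights are zero.
    """
    known = [t for t in tokens if t in embeddings]
    if not known:
        return None
    dim = len(embeddings[known[0]])
    doc = [0.0] * dim
    total_weight = 0.0
    for t in known:
        w = 1.0 if weights is None else weights.get(t, 1.0)
        total_weight += w
        for j, value in enumerate(embeddings[t]):
            doc[j] += w * value
    if total_weight == 0.0:
        return None
    return [v / total_weight for v in doc]


# Toy 2-d embeddings, made up for illustration only.
toy = {"good": [1.0, 0.0], "food": [0.0, 1.0]}
doc_vec = document_embedding(["good", "food", "unknownword"], toy)
weighted_vec = document_embedding(["good", "food"], toy, weights={"good": 3.0})
```

The resulting vectors would then be passed to the classifier as input features.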

Datasets
The following balanced and imbalanced data sets are used for experimentation:
• Yelp: This is a balanced data set consisting of 1000 restaurant reviews obtained from Yelp. Each review is labeled as either 'Positive' or 'Negative'. There are a total of 2049 distinct word tokens in this data set.
• Amazon: In this balanced data set there are 1000 product reviews obtained from Amazon. Each product review is labeled either 'Positive' or 'Negative'. There are a total of 1865 distinct word tokens in this data set.
• IMDB: This is a balanced data set consisting of 1000 reviews for movies on IMDB. Each movie review is labeled either 'Positive' or 'Negative'. There are a total of 3075 distinct word tokens in this data set.
• A-CHESS: This is a proprietary data set 1 obtained from a study involving users with alcohol addiction. Text data is obtained from a discussion forum in the A-CHESS mobile app (Quanbeck et al., 2014). There are a total of 2500 text messages, with 8% of the messages indicative of relapse risk. Since this data set is part of a clinical trial, an exact text message cannot be provided as an example. However, the following message is typical of this data set: "I've been clean for about 7 months but even now I still feel like maybe I won't make it." Such a message is marked as 'threat' by a human moderator. On the other hand, there are benign messages that are marked 'not threat', such as "30 days sober and counting, I feel like I am getting my life back." The aim is to eventually automate this process since human moderation involves considerable effort and time. This is an unbalanced data set (8% of the messages are marked 'threat') with a total of 3400 distinct word tokens.
The first three data sets are obtained from (Kotzias et al., 2015).
Table 1: Results from the classification task using sentence embeddings obtained from weighted averaging of word embeddings. Metrics reported are average Precision, F-score and AUC and the corresponding standard deviations (STD). Best results are attained by KCCA (GlvCC, LSA) and are highlighted in boldface.
1 Center for Health Enhancement System Services at UW-Madison

Word embeddings and baselines:
This section briefly describes the various generic and DS embeddings used. We also compare against a basic DA embedding baseline in both the standard framework and while initializing the neural network baseline.
• Generic word embeddings: Generic word embeddings used are GloVe 2 from both Wikipedia and Common Crawl and the word2vec (Skip-gram) embeddings 3 . These generic embeddings will be denoted as Glv, GlvCC and w2v, respectively.
• DS word embeddings: DS embeddings are obtained via Latent Semantic Analysis (LSA) and via retraining word2vec on the test data sets using the implementation in gensim 4 . DS embeddings via LSA are denoted by LSA and DS embeddings via word2vec are denoted by DSw2v.
• concatenation-SVD baseline: Generic and DS embeddings are concatenated to form a single embedding matrix. SVD is performed on this matrix and the resulting singular vectors are projected onto the d largest singular values to form the resultant word embeddings. These meta-embeddings, proposed by (Yin and Schütze, 2016), have demonstrated considerable success in intrinsic tasks such as similarities and analogies.
Details about dimensions of the word embeddings and kernel hyperparameter tuning are found in the supplemental material.
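One plausible reading of the concatenation-SVD baseline described above is sketched below. The exact normalization is not given in the text, so this sketch assumes the embeddings are the top-d left singular vectors scaled by their singular values; the function name and the random toy inputs are hypothetical.

```python
import numpy as np


def conc_svd(w_ds, w_g, d):
    """Concatenate aligned DS and generic embeddings and keep d SVD components.

    w_ds: (n, d1), w_g: (n, d2); row i of both matrices belongs to the same word.
    Returns an (n, d) matrix of meta-embeddings.
    """
    concat = np.hstack([w_ds, w_g])                       # (n, d1 + d2)
    u, s, _ = np.linalg.svd(concat, full_matrices=False)  # singular values sorted descending
    # Assumption: scale the top-d left singular vectors by their singular values.
    return u[:, :d] * s[:d]


# Toy inputs, random and for illustration only.
rng = np.random.default_rng(0)
toy_ds = rng.normal(size=(50, 6))
toy_g = rng.normal(size=(50, 10))
meta = conc_svd(toy_ds, toy_g, d=4)
```

Unlike CCA, this construction keeps the directions of largest variance in the concatenated matrix, regardless of whether those directions are shared by the two embedding sets.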
The following neural network baselines are used in this work:
• InferSent: This is a bidirectional LSTM based sentence encoder (Conneau et al., 2017) that learns sentence encodings in a supervised fashion on a natural language inference (NLI) data set. The aim is to use the sentence encoder trained on the NLI data set to learn generic sentence encodings for use in transfer learning applications.
• RNTN: The Recursive Neural Tensor Network (Socher et al., 2013) baseline is a neural network based dependency parser that performs sentiment analysis. Since the data sets considered in our experiments have binary sentiments we compare against this baseline as well.
Note that InferSent is fine-tuned with a combination of GloVe common crawl embeddings and DA embeddings, and concSVD. The choice of GloVe common crawl embeddings is in keeping with the experimental conditions of the authors of InferSent. Since the data sets at hand do not contain all the tokens required to retrain InferSent, we replace word tokens that are common across our test data sets and InferSent training data with the DA embeddings and concSVD.
Since we have a combination of balanced and unbalanced test data sets, test metrics reported are Precision, F-score and AUC. We perform 10-fold cross validation to determine hyperparameters and so we report averages of the performance metrics along with the standard deviation.
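For reference, the three reported metrics can be computed for a single fold as in the sketch below. It is a plain-Python illustration under assumptions: the positive class is taken to be the class of interest (e.g., 'threat' in A-CHESS), AUC is computed with the pairwise ranking formulation, and the labels and scores are made up. How the metrics were computed in the experiments is not stated here.

```python
def precision_f1(y_true, y_pred, positive=1):
    """Precision and F-score for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, f1


def auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up imbalanced fold: 2 positives out of 10 examples.
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.1, 0.8, 0.2, 0.4, 0.3, 0.1, 0.2, 0.1, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

precision, f1 = precision_f1(y_true, y_pred)
fold_auc = auc(y_true, scores)
```

Averaging these per-fold values over the 10 folds, together with their standard deviation, gives numbers of the kind reported in the tables.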

Results and Discussion
From Tables 1 and 2 we see that DA embeddings perform better than concSVD as well as the generic and DS word embeddings, both when used in a standard classification task and when used to initialize a sentence encoding algorithm. As expected, LSA DS embeddings provide better results than word2vec DS embeddings. Note that on the imbalanced A-CHESS data set, on the standard classification task, KCCA embeddings perform better than the other baselines across all three performance metrics. However, in Table 2, GlvCC embeddings achieve a higher average F-score and AUC than KCCA embeddings, which obtain the highest precision.
While one can argue that when evaluating a classifier, the F-score and AUC are better indicators of performance, it is to be noted that A-CHESS is highly imbalanced and precision is calculated on the minority (positive) class that is of most interest. Also note that InferSent is retrained on the balanced NLI data set, which is much larger in size than the A-CHESS test set. Certainly such a training set has more instances of positive samples. Thus when using generic word embeddings to initialize the sentence encoder, which uses the outputs in the classification task, the overall F-score and AUC are better. From our hypothesis, KCCA embeddings are expected to perform better than the others because CCA/KCCA provides an intuitively better technique to preserve information from both the generic and DS embeddings. On the other hand, the concSVD based embeddings do not exploit information in both the generic and DS embeddings. Furthermore, Yin and Schütze (2016) propose to learn an 'ensemble' of meta-embeddings by learning weights to combine different generic word embeddings via a simple neural network. We determine the proper weight for combination of DS and generic embeddings in the CCA/KCCA space using the simple optimization problem given in Equation (3).
Thus, task specific DA embeddings formed by a proper weighted combination of DS and generic word embeddings are expected to do better than the concSVD embeddings and individual generic and/or DS embeddings and this is verified empirically. Also note that the LSA DS embeddings do better than the word2vec DS embeddings. This is expected due to the size of the test sets and the nature of the word2vec algorithm. We expect similar observations when using GloVe DS embeddings owing to the similarities between word2vec and GloVe.

Conclusion
This paper presents a simple yet effective method to learn Domain Adapted word embeddings that generally outperform generic and Domain Specific word embeddings in sentiment classification experiments on a variety of standard data sets. CCA/KCCA based DA embeddings generally outperform even a concatenation based method.