Learning Word Embeddings for Data Sparse and Sentiment Rich Data Sets

This research proposal describes two algorithms aimed at learning word embeddings for data sparse and sentiment rich data sets. The goal is to use word embeddings adapted to domain specific data sets in downstream applications such as sentiment classification. The first approach learns word embeddings in a supervised fashion via SWESA (Supervised Word Embeddings for Sentiment Analysis), an algorithm for sentiment analysis on data sets of modest size. SWESA leverages document labels to jointly learn polarity aware word embeddings and a classifier for unseen documents. In the second approach, domain adapted (DA) word embeddings are learned by exploiting the specificity of domain specific data sets and the breadth of generic word embeddings. The new embeddings are formed by aligning corresponding word vectors using Canonical Correlation Analysis (CCA) or the related nonlinear kernel CCA (KCCA). Experimental results on binary sentiment classification tasks on standard data sets are presented for both approaches.


Introduction
Generic word embeddings such as GloVe and word2vec (Pennington et al., 2014; Mikolov et al., 2013), which are pre-trained on large collections of raw text, have desirable structural properties and have demonstrated remarkable success when used as features for supervised learners in applications such as the sentiment classification of text documents. There are, however, many applications with domain specific vocabularies and relatively small amounts of data. The performance of word embedding approaches in such applications is limited: embeddings pre-trained on generic corpora do not capture domain specific semantics and knowledge, while embeddings trained on small data sets are of low quality. Since word embeddings are used to initialize most algorithms for sentiment analysis and related tasks, generic word embeddings also make for poor initializations on domain specific data sets.
A concrete example of a small domain specific corpus is the Substance Use Disorders (SUDs) data set (Quanbeck et al., 2014; Litvin et al., 2013), which contains messages from discussion forums for people with substance addictions. These forums are part of mobile health intervention treatments that encourage participants to engage in sobriety-related discussions. The aim of such digital intervention treatments is to analyze the daily content of participants' messages and predict relapse risk. This data is both domain specific and limited in size. Other examples include customer support tickets reporting issues with taxi-cab services, reviews of restaurants and movies, discussions by special interest groups, and political surveys. In general, such data sets are common in fields where words carry sentiments different from those they carry elsewhere.
Such data sets present significant challenges for algorithms based on word embeddings. First, the data is on specific topics and has a very different distribution from generic corpora, so pre-trained generic word embeddings, such as those trained on Common Crawl or Wikipedia, are unlikely to yield accurate results in downstream tasks. When performing sentiment classification using pre-trained word embeddings, differences between the domains of the training and test data sets limit the applicability of the embedding algorithm. For example, in SUDs, discussions are focused on topics related to recovery and addiction; the sentiment behind the word 'party' may be very different in a dating context than in a substance abuse context. Similarly, seemingly neutral words such as 'holidays' and 'alcohol' are indicative of stronger negative sentiment in these domains, while words like 'clean' are indicative of stronger positive sentiment. Thus domain specific vocabularies and word semantics are a problem for pre-trained sentiment classification models (Blitzer et al., 2007).
Second, there is insufficient data to fully train a word embedding. The SUD data set consists of a few hundred people, only a fraction of whom are active (Firth et al., 2017; Naslund et al., 2015), resulting in a small data set of text messages available for analysis. Furthermore, the content is generated spontaneously on a day to day basis, and language use is informal and unstructured. Running generic word embedding construction algorithms on such a data set leads to very noisy outputs that are not suitable as input to downstream applications like sentiment classification. Fine-tuning a generic word embedding also leads to noisy outputs, due to the highly nonconvex training objective and the small amount of data. This proposal briefly describes two possible solutions to this problem. Section 2 describes a biconvex optimization algorithm that jointly learns polarity aware word embeddings and a classifier. Section 3 describes a Canonical Correlation Analysis (CCA) based approach to obtain domain adapted word embeddings. Section 4 discusses results from both approaches and outlines potential future work.

Supervised Word Embeddings for Sentiment Analysis on Small Sized Data Sets
The Supervised Word Embeddings for Sentiment Analysis (SWESA) algorithm is an iterative procedure that minimizes a joint cost function over a classifier and word embeddings, under a unit norm constraint on the word vectors. SWESA incorporates document label information while learning word embeddings from small data sets.

Mathematical model and optimization
Text documents d_i in this framework are represented as weighted linear combinations of the words in a given vocabulary, with term frequencies as the weights φ_i. SWESA aims to find vector representations for words, and by extension for text documents, such that applying a nonlinear transformation f to the score θ^T W φ_i yields a binary label y_i indicating the polarity of document d_i. Mathematically, we assume that for some function f,

    y_i = f(θ^T W φ_i),  i = 1, . . . , N.   (1)

The model in (1) is estimated by solving the following minimization problem:

    min_{θ, W}  (1/N) Σ_{i=1}^{N} C_{y_i} L(y_i, θ^T W φ_i) + λ_θ ||θ||_2^2
    s.t.  ||w_j||_2 = 1,  j = 1, . . . , |V|,   (2)

where L is the loss under the assumed probability model f. Class imbalance is accounted for by the misclassification costs C_−, C_+ as in (Lin et al., 2002). The unit norm constraint in (2) is enforced on the word embeddings w_j (the columns of W) to discourage degenerate solutions. This optimization problem is bi-convex: it is convex in θ for fixed W, and convex in W for fixed θ. Algorithm 1 solves (2) via alternating minimization: it initializes the word embedding matrix W with W_0 and then alternates between minimizing the objective function w.r.t. the weight vector θ and w.r.t. the word embeddings W.
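As a concreteness check, the cost-sensitive objective in (2) can be sketched numerically. The sketch below assumes a logistic loss and random toy data; all names and dimensions are illustrative, not the authors' implementation.

```python
import numpy as np

def swesa_objective(theta, W, Phi, y, C_pos, C_neg, lam):
    """Cost-sensitive logistic loss as in (2): document i is W @ Phi[:, i],
    its score is theta^T W phi_i, plus an l2 penalty on theta."""
    scores = theta @ (W @ Phi)                   # shape (N,)
    costs = np.where(y == 1, C_pos, C_neg)       # per-class misclassification cost
    loss = np.mean(costs * np.log1p(np.exp(-y * scores)))
    return loss + lam * np.dot(theta, theta)

# toy example: k-dimensional embeddings for a 5-word vocabulary, 4 documents
rng = np.random.default_rng(0)
k, V, N = 3, 5, 4
W = rng.normal(size=(k, V))
W /= np.linalg.norm(W, axis=0)                   # unit-norm constraint on word vectors
Phi = rng.random(size=(V, N))                    # term-frequency weights
theta = rng.normal(size=k)
y = np.array([1, -1, 1, -1])
print(swesa_objective(theta, W, Phi, y, C_pos=1.0, C_neg=1.0, lam=0.1))
```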
The probability model used in this work is logistic regression. Under this assumption, the minimization problem in Step 3 of Algorithm 1 is a standard logistic regression problem. The optimization problem in Step 4 of Algorithm 1 is solved via projected stochastic gradient descent (SGD) with suffix averaging (Rakhlin et al., 2011); Algorithm 2 implements this procedure, using stochastic gradients instead of full gradients. W_0 is initialized via pre-trained word2vec embeddings and via Latent Semantic Analysis (LSA) (Dumais, 2004) based word embeddings obtained from a matrix of term frequencies of the given data. The dimension k of the word vectors is determined empirically, by selecting the dimension that provides the best performance across all pairs of training and test data sets.

Algorithm 1 Supervised Word Embeddings for Sentiment Analysis (SWESA)
Require: W_0, Φ, C_+, C_−, λ_θ, 0 < k < V, labels y = [y_1, . . . , y_N], iterations T > 0.
1: Initialize W ← W_0.
2: for t = 1, . . . , T do
3:   Solve θ_t ← argmin_θ J(θ, W_{t−1}).
4:   Solve W_t ← argmin_W J(θ_t, W).
5: end for
6: Return θ_T, W_T.

Algorithm 2 Stochastic Gradient Descent for W
Require: θ, γ, W_0, labels y = [y_1, . . . , y_N], iterations N, step size η > 0, suffix parameter 0 < τ ≤ N.
1: Randomly shuffle the dataset.
2: for t = 1, . . . , N do
3:   Take a projected stochastic gradient step on J(θ, W) with step size η.
4: end for
5: Return the suffix average of the last τ iterates of W.
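The projected SGD update of Algorithm 2 can be sketched as follows. This is a minimal illustration, assuming a logistic loss, a single pass over toy documents, and suffix averaging over the last τ iterates; it is not the authors' implementation.

```python
import numpy as np

def project_columns(W):
    """Project each word vector (column of W) back onto the unit sphere."""
    return W / np.linalg.norm(W, axis=0, keepdims=True)

def sgd_step_W(theta, W, phi_i, y_i, eta):
    """One projected stochastic gradient step on the logistic loss
    for a single document (sketch of the Algorithm 2 inner update)."""
    s = theta @ (W @ phi_i)
    g = -y_i / (1.0 + np.exp(y_i * s))       # d loss / d s
    grad_W = g * np.outer(theta, phi_i)      # chain rule: s = theta^T W phi_i
    return project_columns(W - eta * grad_W)

rng = np.random.default_rng(1)
k, V, N = 3, 6, 8
W = project_columns(rng.normal(size=(k, V)))
Phi = rng.random(size=(V, N))
theta = rng.normal(size=k)
y = rng.choice([-1, 1], size=N)

# one shuffled pass of the W-step, with suffix averaging over the last tau iterates
tau, suffix = 4, []
for t in rng.permutation(N):
    W = sgd_step_W(theta, W, Phi[:, t], y[t], eta=0.1)
    suffix.append(W)
W_avg = project_columns(np.mean(suffix[-tau:], axis=0))
```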

Experimental evaluation and results
SWESA is evaluated against the following baselines and data sets.
Datasets: Three balanced data sets (Kotzias et al., 2015) of 1000 reviews each from Amazon, IMDB and Yelp, with binary 'positive' and 'negative' sentiment labels, are considered, along with one imbalanced data set of 2500 text messages obtained from a study involving subjects with alcohol addiction. Only 8% of these messages are indicative of 'relapse risk' while the rest are 'benign'. Note that this imbalance influences the performance metrics, as can be seen by comparing against the scores achieved on the balanced data sets. Additional information, such as the number of word tokens, can be found in the supplemental section.
• Naive Bayes: A standard baseline that is well suited to classification on small data sets.
• Recursive Neural Tensor Network (RNTN): A parse tree based sentiment analysis algorithm. Both the pre-trained RNTN and the RNTN retrained on the data sets considered here are used to obtain classification accuracy. Note that the RNTN does not output class probabilities, hence AUC is not computed for it.
• Two-Step (TS): In this setup, embeddings obtained via word2vec (trained on the test data sets) and via LSA are used to obtain document representations via weighted averaging. Documents are then classified using logistic regression.
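The Two-Step baseline reduces to a weighted average of word vectors followed by a logistic regression. A minimal sketch with random stand-in embeddings (all names and dimensions illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def doc_embeddings(W, Phi):
    """Weighted average of word vectors: columns of W weighted by
    term frequencies phi_i, normalized by the total count."""
    return (W @ Phi) / np.maximum(Phi.sum(axis=0), 1e-12)

rng = np.random.default_rng(2)
k, V, N = 10, 50, 200
W = rng.normal(size=(k, V))             # stand-in for word2vec/LSA word vectors
Phi = rng.random(size=(V, N))           # term-frequency weights per document
y = rng.integers(0, 2, size=N)          # binary sentiment labels

X = doc_embeddings(W, Phi).T            # one row per document
clf = LogisticRegression().fit(X, y)
print(clf.score(X, y))
```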
Hyperparameters: Parameters such as the dimension of the word embeddings and the regularization of the logistic regression are determined via 10-fold cross validation.
Results: Average precision and AUC are reported in Table 2. Note that the word2vec embeddings used in TS are obtained by retraining the word2vec algorithm on the test data sets; results from both the pre-trained and the retrained RNTN are presented to reinforce the point that retraining neural network based algorithms on sparse data sets degrades their performance. Since SWESA makes use of document labels when learning word embeddings, the resulting word embeddings are polarity aware. Antonym pairs can be identified via cosine similarity: given the words 'good', 'fair' and 'awful', the antonym pair 'good/awful' is determined via the cosine similarity between w_good and w_awful. Figure 1 shows a small sample of word embeddings learned on the Amazon data set by SWESA and word2vec; the cosine similarity (angle) between the most dissimilar words is calculated and the words are depicted as points on the unit circle. These examples illustrate that SWESA captures sentiment polarity at the word embedding level despite limited data.
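The antonym-pair selection by cosine similarity can be illustrated with a toy example; the vectors below are hand-made for illustration, not learned embeddings.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy polarity-aware vectors (illustrative only, not learned by SWESA)
vecs = {
    "good":  np.array([0.9, 0.3]),
    "fair":  np.array([0.8, 0.5]),
    "awful": np.array([-0.9, -0.2]),
}
# the antonym of 'good' is the candidate with the lowest cosine similarity
antonym = min(["fair", "awful"], key=lambda w: cosine(vecs["good"], vecs[w]))
print(antonym)  # → awful
```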

Domain Adapted Word Embeddings for Improved Sentiment Classification
While SWESA learns embeddings from domain specific data alone, this second approach obtains high quality Domain Adapted (DA) embeddings by combining generic embeddings with Domain Specific (DS) embeddings via CCA/KCCA. Generic embeddings are trained on large corpora but do not capture domain specific semantics, while DS embeddings are obtained from the domain specific data set via algorithms such as Latent Semantic Analysis (LSA) or other embedding methods. DA embeddings thus exploit the breadth of the generic embeddings and the specificity of the DS embeddings. The two sets of embeddings are combined using linear CCA (Hotelling, 1936) or nonlinear kernel CCA (KCCA) (Hardoon et al., 2004): they are projected along the directions of maximum correlation, and a new (DA) embedding is formed by averaging the projections of the generic and DS embeddings. The DA embeddings are then evaluated in a sentiment classification setting. Empirically, it is shown that the combined DA embeddings improve substantially over the generic embeddings, the DS embeddings, and a concatenation-SVD (concSVD) based baseline.

Brief description of CCA/KCCA
Let W_DS ∈ R^{|V_DS| × d_1} be the matrix whose rows are the domain specific word embeddings (obtained by, e.g., running the LSA algorithm on the domain specific data set), where V_DS is its vocabulary and d_1 is the dimension of the embeddings. Similarly, let W_G ∈ R^{|V_G| × d_2} be the matrix of generic word embeddings (obtained by, e.g., running the GloVe algorithm on the Common Crawl data), where V_G is the vocabulary and d_2 is the dimension of the embeddings. Let V_∩ = V_DS ∩ V_G. Let w_{i,DS} be the domain specific embedding of word i ∈ V_∩ and w_{i,G} its generic embedding. For one dimensional CCA, let φ_DS and φ_G be the projection directions of w_{i,DS} and w_{i,G} respectively. Then the projected values are

    w̄_{i,DS} = φ_DS^T w_{i,DS},   w̄_{i,G} = φ_G^T w_{i,G}.   (3)

CCA obtains φ_DS and φ_G by maximizing the correlation ρ between w̄_{i,DS} and w̄_{i,G},

    ρ = E[w̄_{i,DS} w̄_{i,G}] / √( E[w̄_{i,DS}^2] E[w̄_{i,G}^2] ),   (4)

where the expectation is over all words i ∈ V_∩. The d-dimensional CCA with d > 1 is defined recursively: once the first d − 1 pairs of canonical variables are defined, the d-th pair maximizes the same correlation function subject to the constraint of being uncorrelated with the first d − 1 pairs. Equivalently, matrices of projection vectors Φ_DS ∈ R^{d_1 × d} and Φ_G ∈ R^{d_2 × d} are obtained for all vectors in W_DS and W_G, where d ≤ min(d_1, d_2). The DA embedding of word i is then obtained by seeking a vector close to both projections,

    w_{i,DA} = argmin_w ( ||w − Φ_DS^T w_{i,DS}||_2^2 + ||w − Φ_G^T w_{i,G}||_2^2 ).   (5)

The solution of (5) gives a weighted combination with α = β = 1/2, i.e., the new vector is equal to the average of the two projections:

    w_{i,DA} = (1/2) Φ_DS^T w_{i,DS} + (1/2) Φ_G^T w_{i,G}.

Because of its linear structure, the CCA in (4) may not always capture the best relationships between the two matrices.

Table 2: Results from the classification task using sentence embeddings obtained from weighted averaging of word embeddings. Metrics reported are average precision, F-score and AUC, with the corresponding standard deviations (STD). Best results are attained by KCCA (GlvCC, LSA) and are highlighted in boldface.
To account for nonlinearities, a kernel function, which implicitly maps the data into a high dimensional feature space, can be applied. For example, given a vector w ∈ R^d, a kernel function K is written in terms of a feature map ϕ defined by

    ϕ : w = (w_1, . . . , w_d) → ϕ(w) = (ϕ_1(w), . . . , ϕ_m(w))   (d < m),

such that for given w_a and w_b,

    K(w_a, w_b) = ⟨ϕ(w_a), ϕ(w_b)⟩.

In kernel CCA, data is first projected onto this high dimensional feature space before performing CCA. In this work the kernel function used is the Gaussian kernel,

    K(w_a, w_b) = exp( −||w_a − w_b||^2 / (2σ^2) ).

The implementation of kernel CCA follows the standard algorithm described in texts such as (Hardoon et al., 2004); see that reference for details.
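The Gaussian kernel above can be computed for a whole set of vectors at once via the Gram matrix; a small sketch (illustrative, assuming a shared bandwidth σ):

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix K[a, b] = exp(-||x_a - x_b||^2 / (2 sigma^2)),
    the kernel evaluations used before performing CCA in feature space."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # pairwise squared distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gaussian_gram(X, sigma=1.0)
```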

Experimental evaluation and results
DA embeddings are evaluated in binary sentiment classification tasks on the four data sets described in Section 2.2. Document embeddings are obtained i) via a standard framework that expresses documents as weighted combinations of their constituent word embeddings, and ii) by initializing a state of the art sentence encoding algorithm, InferSent (Conneau et al., 2017), with the word embeddings to obtain sentence embeddings. Encoded sentences are then classified using logistic regression.
Word embeddings and baselines: • Generic word embeddings: The generic word embeddings used are GloVe 1 trained on both Wikipedia and Common Crawl, and the word2vec (Skip-gram) embeddings 2 . These generic embeddings are denoted Glv, GlvCC and w2v respectively.
• DS word embeddings: DS embeddings are obtained via Latent Semantic Analysis (LSA) and via retraining word2vec on the test data sets using the implementation in gensim 3 . DS embeddings via LSA are denoted by LSA and DS embeddings via word2vec are denoted by DSw2v.
• concatenation-SVD baseline (concSVD): Generic and DS embeddings are concatenated to form a single embedding matrix. SVD is performed on this matrix, and the projections onto the d largest singular directions form the resulting word embeddings. These meta-embeddings, proposed by Yin and Schütze (2016), have demonstrated considerable success in intrinsic tasks such as word similarities and analogies.
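The concSVD baseline can be sketched in a few lines of NumPy; dimensions are illustrative, and the rank-d truncation follows the description above.

```python
import numpy as np

rng = np.random.default_rng(4)
n_words, d1, d2, d = 300, 20, 30, 10
W_ds = rng.normal(size=(n_words, d1))   # stand-in DS embeddings
W_g = rng.normal(size=(n_words, d2))    # stand-in generic embeddings

# concatenate the two embeddings, then keep the top-d singular directions
W_cat = np.hstack([W_ds, W_g])                     # (n_words, d1 + d2)
U, S, Vt = np.linalg.svd(W_cat, full_matrices=False)
W_concsvd = U[:, :d] * S[:d]                       # rank-d meta-embeddings
print(W_concsvd.shape)
```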
Details about the dimensions of the word embeddings and the kernel hyperparameter tuning can be found in the supplemental material. Note that InferSent is fine-tuned with a combination of GloVe Common Crawl embeddings and either the DA embeddings or concSVD. Since the data sets at hand do not contain all the tokens required to retrain InferSent, word tokens that are common to our test data sets and the InferSent training data are replaced with the DA or concSVD embeddings.

Discussion of results
From Tables 2 and 3 we see that the DA embeddings perform better than concSVD as well as the generic and DS word embeddings, both when used in a standard classification task and when used to initialize a sentence encoding algorithm. As expected, the LSA DS embeddings provide better results than the word2vec DS embeddings; this is expected given the small size of the test sets and the nature of the word2vec algorithm, and we expect similar observations with GloVe DS embeddings owing to the similarities between word2vec and GloVe. Since the A-CHESS data set is imbalanced, we focus on precision over the other metrics because the positive class is in the minority. Two factors explain these results: i) CCA/KCCA provides a principled way to preserve information from both the generic and DS embeddings, whereas the concSVD based embeddings do not exploit the information in both; ii) while Yin and Schütze (2016) learn an 'ensemble' of meta-embeddings by learning combination weights for different generic word embeddings via a simple neural network, the simple optimization problem we propose in equation (5) determines the proper weights for combining the DS and generic embeddings in the CCA/KCCA space. Thus, task specific DA embeddings formed by a properly weighted combination of DS and generic word embeddings are expected to do better than concSVD and the other embeddings, and this is verified empirically.

Future work and Conclusions
From these preliminary results we see that while SWESA learns embeddings from the domain specific data set alone, DA embeddings combine generic and domain specific embeddings, thereby achieving better performance than SWESA or DS embeddings alone. However, SWESA imparts potentially desirable structural properties to its word embeddings. As a next step we would like to draw on both approaches to learn word embeddings that are both polarity aware and domain adapted.