Refining Word Embeddings for Sentiment Analysis

Word embeddings that can capture semantic and syntactic information from contexts have been extensively used for various natural language processing tasks. However, existing methods for learning context-based word embeddings typically fail to capture sufficient sentiment information. This may result in words with similar vector representations having an opposite sentiment polarity (e.g., good and bad), thus degrading sentiment analysis performance. Therefore, this study proposes a word vector refinement model that can be applied to any pre-trained word vectors (e.g., Word2vec and GloVe). The refinement model is based on adjusting the vector representations of words such that they can be closer to both semantically and sentimentally similar words and further away from sentimentally dissimilar words. Experimental results show that the proposed method can improve conventional word embeddings and outperform previously proposed sentiment embeddings for both binary and fine-grained classification on Stanford Sentiment Treebank (SST).


Introduction
Word embeddings are a technique to learn continuous low-dimensional vector space representations of words by leveraging the contextual information from large corpora. Examples include C&W (Collobert and Weston, 2008; Collobert et al., 2011), Word2vec (Mikolov et al., 2013a, 2013b) and GloVe (Pennington et al., 2014). In addition to the contextual information, character-level subwords (Bojanowski et al., 2016) and semantic knowledge resources (Faruqui et al., 2015; Kiela et al., 2015) such as WordNet (Miller, 1995) are also useful information for learning word embeddings. These embeddings have been successfully used for various natural language processing tasks.
In general, existing word embeddings are semantically oriented. They can capture semantic and syntactic information from unlabeled data in an unsupervised manner but fail to capture sufficient sentiment information. This makes it difficult to directly apply existing word embeddings to sentiment analysis. Prior studies have reported that words with similar vector representations (similar contexts) may have opposite sentiment polarities, as in the example of happy-sad mentioned in (Mohammad et al., 2013) and good-bad in (Tang et al., 2016). Composing these word vectors may produce sentence vectors with similar vector representations but opposite sentiment polarities (e.g., a sentence containing happy and a sentence containing sad may have similar vector representations). Building on such ambiguous vectors will affect sentiment classification performance.
To enhance the performance of distinguishing words with similar vector representations but opposite sentiment polarities, recent studies have suggested learning sentiment embeddings from labeled data in a supervised manner (Maas et al., 2011; Labutov and Lipson, 2013; Lan et al., 2016; Ren et al., 2016; Tang et al., 2016). The common goal of these methods is to capture both semantic/syntactic and sentiment information such that sentimentally similar words have similar vector representations. They typically apply an objective function to optimize word vectors based on the sentiment polarity labels (e.g., positive and negative) given by the training instances. The use of such sentiment embeddings has improved the performance of binary sentiment classification.
[Figure 1: Example of nearest neighbor ranking. Panel label: Ranked by cosine similarity.]
This study adopts another strategy to obtain both semantic and sentiment word vectors. Instead of building a new word embedding model from labeled data, we propose a word vector refinement model to refine existing semantically oriented word vectors using sentiment lexicons. That is, the proposed model can be applied to the pre-trained vectors obtained by any word representation learning models (e.g., Word2vec and GloVe) as a post-processing step to adapt the pre-trained vectors to sentiment applications. The refinement model is based on adjusting the pre-trained vector of each affective word in a given sentiment lexicon such that it can be closer to a set of both semantically and sentimentally similar nearest neighbors (i.e., those with the same polarity) and further away from sentimentally dissimilar neighbors (i.e., those with an opposite polarity).
The proposed refinement model is evaluated by examining whether our refined embeddings can improve conventional word embeddings and outperform previously proposed sentiment embeddings. To this end, several deep neural network classifiers that performed well on the Stanford Sentiment Treebank (SST) (Socher et al., 2013) are selected, including convolutional neural networks (CNN) (Kim, 2014), deep averaging network (DAN) (Iyyer et al., 2015) and long short-term memory (LSTM) (Tai et al., 2015; Looks et al., 2017). The conventional word embeddings used in these classifiers are then replaced by our refined versions and by previously proposed sentiment embeddings, and the classification is re-run for performance comparison. The SST is chosen because it can show the effect of using different word embeddings on fine-grained sentiment classification, whereas prior studies only reported binary classification results.
The rest of this paper is organized as follows. Section 2 describes the proposed word vector refinement model. Section 3 presents the evaluation results. Conclusions are drawn in Section 4.

Word Vector Refinement
The refinement procedure begins with a set of pre-trained word vectors and a sentiment lexicon annotated with real-valued sentiment scores. Our goal is to refine the pre-trained vectors of the affective words in the lexicon such that they can capture both semantic and sentiment information. To accomplish this goal, we first calculate the semantic similarity between each affective word (target word) and the other words in the lexicon based on the cosine similarity of their pre-trained vectors, and then select the top-k most similar words as the nearest neighbors. These semantically similar nearest neighbors are then re-ranked according to their sentiment scores provided by the lexicon such that the sentimentally similar neighbors can be ranked higher and dissimilar neighbors lower. Finally, the pre-trained vector of the target word is refined to be closer to its semantically and sentimentally similar nearest neighbors and further away from sentimentally dissimilar neighbors. The following subsections provide a detailed description of the nearest neighbor ranking and refinement model.

Nearest Neighbor Ranking
The sentiment lexicon used in this study is the extended version of Affective Norms of English Words (E-ANEW) (Warriner et al., 2013). It contains 13,915 words and each word is associated with a real-valued score in [1, 9] for the dimensions of valence, arousal and dominance. The valence represents the degree of positive and negative sentiment, where values of 1, 5 and 9 respectively denote most negative, neutral and most positive sentiment. In Fig. 1, good has a valence score of 7.89, which is greater than 5, and thus can be considered positive. Conversely, bad has a valence score of 3.24 and is thus negative. In addition to the E-ANEW, other lexicons such as SentiWordNet (Esuli and Fabrizio, 2006) also provide real-valued sentiment scores.
For each target word to be refined, the top-k semantically similar nearest neighbors are first selected and ranked in descending order of their cosine similarities. In Fig. 1, the left ranked list shows the top 10 nearest neighbors for the target word good. The semantically ranked list is then sentimentally re-ranked based on the absolute difference of the valence scores between the target word and the words in the list. A smaller difference indicates that the word is more sentimentally similar to the target word, and thus will be ranked higher. As shown in the right ranked list in Fig. 1, the re-ranking step can rank the sentimentally similar neighbors higher and the dissimilar neighbors lower. In the refinement model, the higher ranked sentimentally similar neighbors will receive a higher weight to refine the pre-trained vector of the target word.
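The two-stage ranking described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 2-D toy vectors, and the valence values for great, excellent and table are hypothetical; only the valence scores of good (7.89) and bad (3.24) come from the text.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_neighbors(target, vectors, valence, k=10):
    """Return (semantic_ranking, sentiment_reranking) for one target word."""
    candidates = [w for w in vectors if w != target]
    # Stage 1: top-k neighbors in descending order of cosine similarity.
    semantic = sorted(
        candidates,
        key=lambda w: cosine(vectors[target], vectors[w]),
        reverse=True,
    )[:k]
    # Stage 2: re-rank by absolute valence difference, smallest first.
    # sorted() is stable, so ties keep their semantic order.
    reranked = sorted(semantic, key=lambda w: abs(valence[w] - valence[target]))
    return semantic, reranked

# Toy data (hypothetical except the valence of "good" and "bad").
vectors = {
    "good": [1.0, 0.0],
    "bad": [0.95, 0.05],
    "great": [0.9, 0.1],
    "excellent": [0.8, 0.3],
    "table": [0.0, 1.0],
}
valence = {"good": 7.89, "bad": 3.24, "great": 7.5, "excellent": 8.0, "table": 5.0}

semantic, reranked = rank_neighbors("good", vectors, valence, k=3)
print(semantic)
print(reranked)
```

In this toy setting bad is the closest neighbor of good by cosine similarity, but it drops to the bottom of the list once the valence gap is taken into account.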

Refinement Model
Once the word list ranked by both cosine similarity and valence scores for each target word is obtained, its pre-trained vector will be refined to be (1) closer to its sentimentally similar neighbors, (2) further away from its dissimilar neighbors, and (3) not too far away from the original vector.
Let V = {v1, v2, …, vn} be a set of the pre-trained vectors corresponding to the affective words in the sentiment lexicon. For each target word to be refined, the refinement model iteratively minimizes the distance between the target word and its top-k nearest neighbors. The objective function Φ(V) can thus be defined as

$$\arg\min \Phi(V) = \arg\min \sum_{i=1}^{n} \sum_{j=1}^{k} w_{ij}\, \mathrm{dist}(v_i, v_j) \tag{1}$$

where n denotes the total number of vectors in V to be refined, vi denotes the vector of a target word, vj denotes the vector of one of its nearest neighbors in the ranked list, dist(vi, vj) denotes the distance between vi and vj, and wij denotes the weight of the target word's nearest neighbor, defined as the reciprocal rank of that neighbor in the ranked list. For example, excellent in Fig. 1 will receive a weight of 1, great will receive a weight of 1/2, and so on. A word ranked higher will receive a higher weight. This weight is used to control the movement direction of the target word toward its nearest neighbors. That is, the target word will be moved closer to the higher-ranked sentimentally similar neighbors and further away from lower-ranked dissimilar neighbors, as shown in Fig. 2.
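As a rough illustration of the reciprocal-rank weighting and the weighted-distance objective described above, the following sketch evaluates the objective on a toy example. The function names and toy vectors are hypothetical, and the squared Euclidean distance is used for dist, as the paper specifies for its derivation.

```python
def reciprocal_rank_weights(k):
    """Weights 1, 1/2, ..., 1/k for neighbors at ranks 1..k."""
    return [1.0 / rank for rank in range(1, k + 1)]

def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def objective(vectors, ranked_neighbors):
    """Sum over targets i and ranked neighbors j of w_ij * dist(v_i, v_j)."""
    total = 0.0
    for target, neighbors in ranked_neighbors.items():
        weights = reciprocal_rank_weights(len(neighbors))
        for weight, neighbor in zip(weights, neighbors):
            total += weight * squared_euclidean(vectors[target], vectors[neighbor])
    return total

# Toy example (hypothetical vectors): "excellent" is rank 1, "great" is rank 2.
toy_vectors = {"good": [0.0, 0.0], "excellent": [1.0, 0.0], "great": [0.0, 2.0]}
toy_ranking = {"good": ["excellent", "great"]}
print(objective(toy_vectors, toy_ranking))  # 1 * 1.0 + 0.5 * 4.0 = 3.0
```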
To prevent too many words being moved to the same location and thereby producing too many similar vectors, we add a constraint to keep each pre-trained vector within a certain range from its original vector. The objective function is thus divided into two parts:

$$\arg\min \Phi(V) = \arg\min \sum_{i=1}^{n} \left[ \alpha\, \mathrm{dist}(v_i^{t+1}, v_i^{t}) + \beta \sum_{j=1}^{k} w_{ij}\, \mathrm{dist}(v_i^{t+1}, v_j^{t}) \right] \tag{2}$$

where the former term denotes the distance between the vector of the target word in step t and t+1, i.e., the distance between the refined vector and its original vector. The latter term represents the distance between the vector of the target word and that of its neighbors (similar to Eq. (1)). The parameters α and β together are used as a ratio to control how far the refined vector can be moved away from its original vector and toward its nearest neighbors. A greater ratio indicates a stronger constraint on keeping the refined vector closer to its original vector. For the extreme case of α=1 and β=0, the target word will not be moved (refined). As the ratio decreases, the constraint decreases accordingly and the refined vector can be moved closer to its nearest neighbors. The setting of α=0 and β=1 means that the constraint is disabled.
To facilitate the calculation of the partial derivative of Φ(V), dist(vi, vj) in the above equations is measured by the squared Euclidean distance, defined as

$$\mathrm{dist}(v_i, v_j) = \sum_{d=1}^{D} \left( v_i^{d} - v_j^{d} \right)^2 \tag{3}$$

where D is the dimensionality of the word vectors.
The global optimal solution of Φ(V) can be found by using an iterative update method. To do so, we take the partial derivative of Eq. (2) in step t with respect to the word vector $v_i^{t}$ and set

$$\frac{\partial \Phi(V)}{\partial v_i^{t}} = 0 \tag{4}$$

to obtain a new vector in step t+1. The iterative update procedure is defined as

$$v_i^{t+1} = \frac{\alpha\, v_i^{t} + \beta \sum_{j=1}^{k} w_{ij}\, v_j^{t}}{\alpha + \beta \sum_{j=1}^{k} w_{ij}} \tag{5}$$

Through the iterative procedure, the vector representation of each target word is updated until the change in the location of its vector converges. The refinement process terminates when all target words have been refined.
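The iterative update can be sketched as follows. With the squared Euclidean distance, setting the gradient of the two-part objective to zero gives a weighted average of the target's current vector (weight α) and its ranked neighbors (weights β·wij), which is what the loop below computes. This is a minimal sketch, not the authors' code: the function name, the synchronous update of all targets from the step-t vectors, and the `max_iter` and `tol` stopping parameters are assumptions made for illustration.

```python
import math

def refine(vectors, ranked_neighbors, alpha=1.0, beta=1.0, max_iter=100, tol=1e-6):
    """Iteratively move each target toward its rank-weighted neighbors.

    vectors: dict word -> list of floats (pre-trained vectors)
    ranked_neighbors: dict target -> list of neighbor words, best rank first
    Assumes alpha + beta * sum(weights) > 0 for every target.
    """
    current = {w: [float(x) for x in v] for w, v in vectors.items()}
    for _ in range(max_iter):
        updated = dict(current)  # all updates read the step-t vectors
        max_shift = 0.0
        for target, neighbors in ranked_neighbors.items():
            # Reciprocal-rank weights: 1, 1/2, 1/3, ...
            weights = [1.0 / rank for rank in range(1, len(neighbors) + 1)]
            denom = alpha + beta * sum(weights)
            new_vec = []
            for d, old in enumerate(current[target]):
                pull = sum(w * current[nb][d] for w, nb in zip(weights, neighbors))
                new_vec.append((alpha * old + beta * pull) / denom)
            shift = math.sqrt(
                sum((n - o) ** 2 for n, o in zip(new_vec, current[target]))
            )
            max_shift = max(max_shift, shift)
            updated[target] = new_vec
        current = updated
        if max_shift < tol:  # stop once no target vector moves noticeably
            break
    return current

# Toy check (hypothetical vectors): one step with alpha = beta = 1 moves
# "good" halfway toward its single neighbor "excellent".
toy = {"good": [0.0, 0.0], "excellent": [2.0, 0.0]}
print(refine(toy, {"good": ["excellent"]}, max_iter=1)["good"])
```

Setting `alpha=1.0, beta=0.0` leaves every vector unchanged, matching the extreme case described in the text. Note that the sketch anchors each step to the vector from the previous step, following the "step t and t+1" reading of the constraint term.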

Experimental Results
This section evaluates the proposed refinement model, conventional word embeddings and previously proposed sentiment embeddings using several deep neural network models for binary and fine-grained sentiment classification.
Dataset. SST was adopted as the evaluation corpus (Socher et al., 2013). The binary classification subtask (positive and negative) contains 6920/872/1821 samples for the train/dev/test sets, while the fine-grained ordinal classification subtask (very negative, negative, neutral, positive, and very positive) contains 8544/1101/2210 samples of the train/dev/test sets.
Word Embeddings. The word embeddings used for comparison included two conventional word embeddings (GloVe and Word2vec), our refined versions (Re(GloVe) and Re(Word2vec)), and previously proposed sentiment embeddings (HyRank) (Tang et al., 2016). We used the same dimensionality of 300 for all word embeddings.
• HyRank: It was trained using the SST, NRC Sentiment140 and IMDB datasets. We compared against this method because its code is publicly accessible.
After the proposed refinement model was applied, both the pre-trained Word2vec and GloVe were improved. Re(Word2vec) and Re(GloVe) respectively improved Word2vec and GloVe by 1.7% and 1.5% averaged over all classifiers for binary classification, and both by 1.6% for fine-grained classification. In addition, both Re(GloVe) and Re(Word2vec) outperformed the sentiment embeddings HyRank for all classifiers on both binary and fine-grained classification, indicating that the real-valued intensity scores used by the proposed refinement model are more effective than the binary polarity labels used by the previously proposed sentiment embeddings.
The proposed method yielded better performance because it can remove semantically similar but sentimentally dissimilar nearest neighbors for the target words by refining their vector representations. To demonstrate the effect, we define a measure noise@k to calculate the percentage of top k nearest neighbors with an opposite polarity (i.e., noise) to each word in E-ANEW. For instance, in Fig. 1, the noise@10 for good is 20% because there are two words with an opposite polarity to good among its top 10 nearest neighbors. Table 2 shows the average noise@10 for different word embeddings. For the two semantic-oriented word vectors, GloVe and Word2vec, on average around 24% of the top 10 nearest neighbors for each word are noisy words. After refinement, both Re(GloVe) and Re(Word2vec) can reduce noise@10 to around 14%. HyRank also yielded better performance than both GloVe and Word2vec.
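The noise@k measure defined above can be sketched as follows. This is an illustrative reading rather than the authors' evaluation script: the function name is hypothetical, polarity is taken relative to the neutral valence of 5 on the E-ANEW scale, and the neighbor words and their valence values in the example are made up (only good = 7.89 and bad = 3.24 come from the text).

```python
def noise_at_k(target, neighbors, valence, k=10, neutral=5.0):
    """Percentage of the top-k neighbors whose polarity is opposite to the target's.

    neighbors: list of words ordered by similarity to the target, best first.
    A word is positive if its valence is above `neutral`, negative if below.
    """
    top = neighbors[:k]
    target_side = valence[target] - neutral
    opposite = [w for w in top if (valence[w] - neutral) * target_side < 0]
    return 100.0 * len(opposite) / len(top)

# Hypothetical top-10 list for "good": eight positive words and two negative ones.
valence = {"good": 7.89, "bad": 3.24, "poor": 2.8}
valence.update({f"pos{i}": 7.0 for i in range(8)})
neighbors = ["pos0", "bad", "pos1", "pos2", "poor",
             "pos3", "pos4", "pos5", "pos6", "pos7"]
print(noise_at_k("good", neighbors, valence, k=10))  # 2 of 10 are opposite: 20.0
```

Averaging this value over every word in the lexicon would give the kind of per-embedding figure reported in Table 2.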

Conclusion
This study presents a word vector refinement model that requires no labeled corpus and can be applied to any pre-trained word vectors. The proposed method selects a set of semantically similar nearest neighbors and then ranks the sentimentally similar neighbors higher and dissimilar neighbors lower based on a sentiment lexicon. This ranked list can guide the refinement procedure to iteratively improve the word vector representations.
Experiments on SST show that the proposed method yielded better performance than both conventional word embeddings and sentiment embeddings for both binary and fine-grained sentiment classification. In addition, the performance of various deep neural network models was also improved. Future work will evaluate the proposed method on other datasets. More experiments will also be conducted to provide more in-depth analysis.
Table 2: Average percentages of noisy words in the top 10 nearest neighbors for different word embeddings.