Predicting Emotional Word Ratings using Distributional Representations and Signed Clustering

Inferring the emotional content of words is important for text-based sentiment analysis, dialogue systems and psycholinguistics, but word ratings are expensive to collect at scale and across languages or domains. We develop a method that automatically extends word-level ratings to unrated words using signed clustering of vector space word representations along with affect ratings. We use our method to determine a word's valence and arousal, which determine its position on the circumplex model of affect, the most popular dimensional model of emotion. Our method achieves superior out-of-sample word rating prediction on both affective dimensions across three different languages when compared to state-of-the-art methods based on word similarity. Our method can assist in building word ratings for new languages and improve downstream tasks such as sentiment analysis and emotion detection.


Introduction
Word-level ratings play an important role in computational linguistics and psychology research. Many studies have focused on collecting ratings related to the properties of words, such as frequency, complexity, concreteness, imagery, age of acquisition, familiarity and affective states (Kuperman et al., 2012; Schock et al., 2012; Juhasz and Yap, 2013; Brysbaert et al., 2014). Applications span from memory experiments to developing reading tests and analyzing texts from nonnative speakers (Mohammad and Turney, 2013). In NLP, these ratings can be used to quantify different properties in large-scale naturally occurring text, for example when analysing lexical choice between demographic groups or music lyrics (Maulidyani and Manurung, 2015).
Of particular importance to NLP research are ratings of affect, which can be used for sentiment analysis and emotion detection (Pang and Lee, 2008). The main dimensional model of affect is the circumplex model of Russell (1980), which posits that all affective states are represented as a linear combination of two independent systems: valence (or sentiment) and arousal (Posner et al., 2005). For example, the word 'fear' is rated by humans as low in valence (2.93/9) but relatively high in arousal (6.41/9), while the word 'sad' is low in both valence (2.1/9) and arousal (3.49/9).
However, collecting word ratings is very time-consuming and expensive for new languages, domains or properties, which hinders their applicability and reliability. In addition, although word ratings are collected using anchoring to control for differences between raters, implicit biases may exist when rating. These can be caused by certain demographic biases or halo effects, e.g., a high valence word is more likely to be rated higher in arousal. An independent way of measuring words could also help refine existing ratings, rather than only extending them to unrated words.
Automatically expanding affective word ratings has been studied based on the intuition that words similar in a reduced semantic space will have similar ratings (Recchia and Louwerse, 2015; Palogiannidi et al., 2015; Vankrunkelsven et al., 2015; Köper and Im Walde, 2016). For example, Bestgen and Vincze (2012) compute the rating of an unknown word as the average of its k-nearest neighbors from the low-dimensional semantic space. However, the downside is that antonyms are also semantically similar, which is expected to reduce the accuracy of these methods. Orthographic similarity has been shown to slightly improve results (Recchia and Louwerse, 2015). A different approach to rating prediction is based on graph methods inspired by label propagation (Wang et al., 2016). In a related task of adjective intensity prediction, Sharma et al. (2015) also use distributional methods, but their work is restricted to discrete categories and relative ranking within each semantic property. Another task related to affective norm prediction is building sentiment and polarity lexicons (Turney, 2002; Turney and Littman, 2003; Velikovich et al., 2010; Yih et al., 2012; Tang et al., 2014; Hamilton et al., 2016). However, polarity is assigned to words in order to determine whether a text is subjective and what its sentiment is, which is slightly different from word-level affective norms, e.g., 'sunshine' is an objective word (neutral polarity), but has a positive affective rating.
Our approach builds upon recent work in learning word representations and enriches these by integrating a set of existing ratings. Including this information allows our method to differentiate between words that are semantically similar, but on opposite sides of the rating scale. Results show that our automatic word prediction approach obtains better results than competitive methods and demonstrates the benefits of introducing existing ratings on top of the underlying word representations. The superiority of our approach holds for both valence and arousal word ratings across three languages.

Data
Our gold standard data is represented by affective norms of words. The ratings are obtained by asking human coders to indicate the emotional reaction evoked by specific words on 9-point scales: valence (1-negative to 9-positive) and arousal (from 1-calm to 9-excited).
Originally, word ratings were computed using trained raters in a laboratory setup. The Affective Norms for English Words (ANEW; Bradley and Lang, 1999) contained ratings for valence and arousal, as well as dominance, for only 1,034 English words. Similar norms were obtained for Spanish (Redondo et al., 2007). Recently, crowdsourcing was used to derive ratings for larger sets of words using the ANEW ratings for anchoring and validation. Warriner et al. (2013) computed valence, arousal, and dominance scores for 13,915 English lemmas. A similar methodology was used to obtain affective norms for Dutch (4,300 words; Moors et al., 2013) and Spanish (14,031 words; Stadthagen-Gonzalez et al., 2016). In our experiments, we use valence and arousal ratings for these three languages. Although some affective norms contain a third dimension of dominance (from feeling dominated to feeling dominant), we choose not to include it, as it was not present in all data sets.

Method
Our method consists of two separate steps. First, we leverage large corpora of naturally occurring text and the distributional hypothesis in order to represent words in a semantic space with reduced dimensionality. Words that are similar in this space will appear in similar contexts, hence are expected to have similar scores. However, words of opposite polarity have similar distributional properties and will also be very similar in this space (Landauer, 2002). Hence, we perform an additional second step which distorts the word representations, here implemented using signed spectral clustering.

Distributional Word Representations
Distributional word representations, or word embeddings, make use of the distributional hypothesis (a word is characterised by the company it keeps) to represent words as low-dimensional numeric vectors using large text corpora (Harris, 1954; Firth, 1957).
We use the word2vec algorithm (Mikolov et al., 2013), without loss of generality, to generate word vectors as it is arguably the most popular model out of the variety of existing word representations. The word2vec embeddings for English and Spanish have 300 dimensions and are trained on the Gigaword corpora (Parker et al., 2011;Mendonca et al., 2011). For Dutch, we use the word2vec embeddings with 320 dimensions from Tulkens et al. (2016). All words in the embeddings have minimal tokenization, with no additional stemming or lowercasing. Our vocabulary consists of the words that have ratings on either scale.
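Similarity between two words in this space is measured with the cosine of the angle between their vectors. The following is a minimal sketch of how a pairwise similarity matrix could be computed; the three-dimensional toy vectors and the names `cosine_similarity`, `similarity_matrix` and `toy_vectors` are illustrative assumptions, not the embeddings or code used in the experiments.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities for a list of vectors."""
    n = len(vectors)
    return [[cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical 3-dimensional vectors; real embeddings have hundreds of dimensions.
toy_vectors = {
    "happy": [0.9, 0.1, 0.3],
    "sad": [0.8, 0.2, 0.4],
    "table": [0.0, 1.0, 0.1],
}
words = list(toy_vectors)
w_emb = similarity_matrix([toy_vectors[w] for w in words])
```

In this toy setting 'happy' and 'sad' end up far more similar to each other than either is to 'table', which mirrors the antonym problem that motivates the second step of the method.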

Signed Spectral Clustering
To infer the score of an unrated word we use a clustering approach, rather than nearest neighbors, to automatically uncover the number of related words based on which the rating is computed. Distributional word representations capture semantic word similarity. However, a common pitfall is that words with different properties can be used in similar contexts, e.g., 'happy' and 'sad' are antonyms but are used similarly. Signed spectral clustering (SSC), described in Sedoc et al. (2016), is extremely well suited for this type of problem.
Figure 1: A continuous two-dimensional representation of a cluster (using K-means) of English words and their normalized valence ratings. After incorporating valence ratings using the signed clustering algorithm, "disappointed" is removed from the main cluster. The colors represent the resulting cluster memberships.
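Once words are assigned to clusters, predicting a rating reduces to averaging over the rated members of a cluster. A minimal sketch of that step follows; the cluster assignments, the ratings and the fallback to a global value are made-up assumptions for illustration, and `predict_from_clusters` is a hypothetical helper name.

```python
from statistics import mean


def predict_from_clusters(cluster_of, known_ratings, fallback):
    """Predict each unrated word as the mean rating of rated words in its cluster.

    cluster_of: dict mapping word -> cluster id (rated and unrated words).
    known_ratings: dict mapping rated word -> rating.
    fallback: value used when a cluster contains no rated word (an assumption).
    """
    ratings_in_cluster = {}
    for word, cluster in cluster_of.items():
        if word in known_ratings:
            ratings_in_cluster.setdefault(cluster, []).append(known_ratings[word])

    predictions = {}
    for word, cluster in cluster_of.items():
        if word not in known_ratings:
            ratings = ratings_in_cluster.get(cluster)
            predictions[word] = mean(ratings) if ratings else fallback
    return predictions


# Made-up cluster ids and valence ratings on a 1-9 scale.
cluster_of = {"happy": 0, "excited": 0, "elated": 0,
              "sad": 1, "disappointed": 1, "gloomy": 1, "stapler": 2}
known_ratings = {"happy": 8.0, "excited": 7.0, "sad": 2.0, "disappointed": 3.0}
predictions = predict_from_clusters(cluster_of, known_ratings, fallback=5.0)
```

Because the prediction depends only on cluster membership, a single word with an opposing rating inside a cluster shifts every prediction made from it, which is why separating such words matters.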
SSC is a multiclass optimization method which builds upon existing theory in spectral clustering (Shi and Malik, 2000;Yu and Shi, 2003;von Luxburg, 2007) and incorporates side information about word ratings in the form of negative edges which repel words with opposing scores from belonging to the same clusters. It minimizes the cumulative edge weights cut within clusters versus between clusters, while simultaneously minimizing the negative edge weights within the clusters.
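More formally, given a partition of the nodes of a graph into $k$ clusters, $(A_1, \ldots, A_k)$, signed spectral clustering using normalized cuts minimizes

$$\sum_{j=1}^{k} \frac{\mathrm{cut}(A_j, \overline{A_j}) + 2\,\mathrm{links}^{-}(A_j, A_j)}{\mathrm{vol}(A_j)}.$$

For any subset $A$ of the set of nodes, $V$, of the graph, let

$$\mathrm{vol}(A) = \sum_{v_i \in A} \sum_{j=1}^{|V|} |w_{ij}|,$$

where $w_{ij}$ is the similarity or dissimilarity of words $i$ and $j$. For any subset $A$ and its complement $\overline{A}$,

$$\mathrm{links}^{-}(A, A) = \sum_{v_i, v_j \in A,\; w_{ij} < 0} -w_{ij}, \qquad \mathrm{cut}(A, \overline{A}) = \sum_{v_i \in A,\; v_j \in \overline{A},\; w_{ij} \neq 0} |w_{ij}|.$$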
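To make the objective concrete, the sketch below evaluates a signed normalized cut for a given partition of a small signed graph. It assumes the objective has the form used for signed normalized cuts in Sedoc et al. (2016), namely the sum over clusters of (cut + 2 · negative links inside the cluster) / volume; the function name `signed_normalized_cut` and the toy weights are illustrative, and the code only scores a partition rather than performing the spectral optimization.

```python
def signed_normalized_cut(weights, clusters):
    """Score a partition of a signed graph (lower is better).

    weights: symmetric n x n list of lists; negative entries are repelling edges.
    clusters: list of lists of node indices that together partition range(n).
    Assumed objective: sum_j (cut(A_j, rest) + 2 * links_neg(A_j, A_j)) / vol(A_j).
    """
    n = len(weights)
    total = 0.0
    for nodes in clusters:
        inside = set(nodes)
        # Volume: absolute weight of all edges touching the cluster.
        vol = sum(abs(weights[i][j]) for i in inside for j in range(n))
        # Cut: absolute weight of edges leaving the cluster.
        cut = sum(abs(weights[i][j]) for i in inside
                  for j in range(n) if j not in inside)
        # Negative links kept inside the cluster (ordered pairs, counted as positive).
        neg_inside = sum(-weights[i][j] for i in inside for j in inside
                         if weights[i][j] < 0)
        if vol > 0:
            total += (cut + 2.0 * neg_inside) / vol
    return total


# Toy graph: nodes 0-1 and 2-3 attract each other; 0-2 and 1-3 repel.
toy_weights = [
    [0.0, 1.0, -1.0, 0.0],
    [1.0, 0.0, 0.0, -1.0],
    [-1.0, 0.0, 0.0, 1.0],
    [0.0, -1.0, 1.0, 0.0],
]
good_score = signed_normalized_cut(toy_weights, [[0, 1], [2, 3]])
bad_score = signed_normalized_cut(toy_weights, [[0, 2], [1, 3]])
```

The partition that keeps repelling pairs apart scores lower than the one that groups them, which is the behaviour the negative-link term is meant to enforce.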
Note that the main innovation of signed spectral clustering is minimizing the number of negative edges within each cluster, $\mathrm{links}^{-}(A_j, A_j)$. Without the addition of negative weights, signed spectral clustering is simply spectral clustering, i.e., normalized cuts (Yu and Shi, 2003). For this application, rather than incorporating a thesaurus knowledge base (a.k.a. side information) as in Sedoc et al. (2016), we used the continuous lexical scores from our arousal and valence ratings. To obtain signed information, we zero-centered the word ratings, which are originally between 1 and 9.
We create a similarity matrix where the weight between words $i$ and $j$ incorporates both the signed information and the word similarity, computed as the cosine similarity of the distributional word representations. The similarity matrix $W$ (a.k.a. weight matrix) is used to create word clusters which capture both the distributional features and the lexical features. We perform a separate clustering for each rating dimension (valence and arousal) and each language. More formally, the similarity matrix is

$$W = W^{emb} + \beta^{+}\, T^{+} \odot W^{emb} + \beta^{-}\, T^{-} \odot W^{emb},$$

where $W^{emb}$ is the matrix of cosine similarities between vector embeddings of words and $\odot$ is element-wise multiplication. The matrix $T = T^{+} + T^{-}$ is the outer product of the normalized lexical ratings, where the matrices $T^{+}$ and $T^{-}$ contain the positive and negative entries of this outer product, respectively: $T^{+}$ holds the pairs of words rated on the same side of the scale and $T^{-}$ the pairs rated on opposite sides. The values $\beta^{+}$ and $\beta^{-}$ are found using grid search on the training data.
Figure 1 shows the intuition behind signed clustering by presenting an example cluster obtained using K-means clustering on the reduced semantic space (here showing the first two principal components). This includes the word 'disappointed' together with words like 'happy', 'excited' and 'elated'. While this is relatively appropriate for arousal, it is not the case for valence, as they represent opposite ends of the rating spectrum. By incorporating valence information, 'disappointed' is separated from the cluster of words with positive valence, and thus its negative valence rating will not be considered when predicting the rating of a word belonging to this cluster.
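The construction of the signed similarity matrix can be sketched as follows. Several details are assumptions made only for this illustration: the additive form W = W_emb + β⁺ T⁺ ⊙ W_emb + β⁻ T⁻ ⊙ W_emb, the normalization of 1-9 ratings to [-1, 1] around the scale midpoint, the treatment of unrated words as having a normalized rating of 0, and the hand-picked β values (the text states only that they are found by grid search).

```python
def normalize_rating(rating, low=1.0, high=9.0):
    """Zero-center a rating and scale it to [-1, 1] (scaling is an assumption)."""
    midpoint = (low + high) / 2.0
    return (rating - midpoint) / ((high - low) / 2.0)


def signed_similarity(w_emb, ratings, beta_pos, beta_neg):
    """Combine cosine similarities with rating information.

    w_emb: n x n list of lists of cosine similarities.
    ratings: list of raw ratings, with None for unrated words.
    Assumed form: W = W_emb + beta_pos * T_pos * W_emb + beta_neg * T_neg * W_emb,
    with all products taken element-wise.
    """
    r = [0.0 if x is None else normalize_rating(x) for x in ratings]
    n = len(r)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            t = r[i] * r[j]               # entry of the outer product T
            t_pos = t if t > 0 else 0.0   # same side of the scale
            t_neg = t if t < 0 else 0.0   # opposite sides of the scale
            w[i][j] = w_emb[i][j] * (1.0 + beta_pos * t_pos + beta_neg * t_neg)
    return w


# Toy similarities for three words: two rated at opposite extremes, one unrated.
toy_w_emb = [
    [1.0, 0.9, 0.8],
    [0.9, 1.0, 0.7],
    [0.8, 0.7, 1.0],
]
toy_w = signed_similarity(toy_w_emb, [9.0, 1.0, None], beta_pos=1.0, beta_neg=2.0)
```

Under these assumptions, the edge between the two oppositely rated words turns negative even though their cosine similarity is high, while edges to the unrated word are left at their cosine values, so unrated words are still placed by distributional similarity alone.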
Note that we used signed spectral clustering (SSC) for our problem since, unlike when antonym pairs are used as side information, we need to incorporate continuous information. Other methods for adding antonym or arbitrary relationships on distributional word representations are unable to extrapolate these to unseen words or handle unpaired side information (Yih et al., 2012; Chang et al., 2013; Faruqui et al., 2015; Mrkšić et al., 2016). Furthermore, our information comes in lists rather than sets, contexts, or patterns, which presents a problem for other existing methods (Tang et al., 2014; Pham et al., 2015; Schwartz et al., 2015). An alternative to SSC, must-link / cannot-link clustering (Rangapuram and Hein, 2012), has the downside of requiring a choice of threshold for defining the must-link and cannot-link underlying graph edges. An extended comparison of SSC to related methods is presented in Sedoc et al. (2016).

Results
We compare the proposed method with other baselines and approaches which assign to the unrated word:
1. the mean of the available ratings (Mean);
2. the average of its k nearest rated neighbors in the semantic space, the method introduced by Bestgen and Vincze (2012) (K-NN);
3. the mean rating of words in its cluster using standard k-means clustering in the reduced semantic space (K-Means);
4. the linear regression value with the word embedding dimensions as features (Regression);
5. the mean rating of words in its cluster using vanilla spectral clustering (i.e., W = W emb), which uses normalized cuts (NCut), in order to measure the utility and impact of the signed spectral clustering.
We perform the experiment in a 10-fold cross-validation setup, where 90% of the ratings are known and used in training. Results are evaluated using both the Root Mean Squared Error (RMSE) between the human and automatic ratings and the Pearson correlation coefficient (ρ) between the lists of human and automatic ratings.
We used k = 10 nearest neighbors for K-NN, which generally outperforms k = {1, 5, 20} over valence and arousal in all three test languages. This is consistent with the original results of Bestgen and Vincze (2012), although Recchia and Louwerse (2015) found that k = 40 was optimal for predicting arousal ratings. For all clustering methods we used k ∼ 10% of the total ratings (k = 1000 for English and Spanish, k = 400 for Dutch). In English valence experiments, the K-means cluster sizes have a median of 13 with σ = 16.4, for NCut the median is 6 with σ = 62.5 and for our signed clustering method (SNCut) the median is 5 with σ = 78.1. In SNCut, smaller cluster sizes are associated with more extreme ratings.
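The K-NN baseline and the two evaluation measures can be sketched in a few lines. This is an illustrative reimplementation under simple assumptions (cosine similarity as the neighbor criterion, an unweighted average over the neighbors, toy two-dimensional vectors and made-up ratings), not the code used to produce the reported numbers; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def knn_predict(target, rated_vectors, rated_scores, k=10):
    """Average the ratings of the k rated words most similar to the target."""
    ranked = sorted(zip(rated_vectors, rated_scores),
                    key=lambda pair: cosine(target, pair[0]), reverse=True)
    nearest = [score for _, score in ranked[:k]]
    return sum(nearest) / len(nearest)


def rmse(gold, predicted):
    """Root mean squared error between two equal-length rating lists."""
    return math.sqrt(sum((g - p) ** 2 for g, p in zip(gold, predicted)) / len(gold))


def pearson(gold, predicted):
    """Pearson correlation coefficient between two equal-length rating lists."""
    n = len(gold)
    mean_g = sum(gold) / n
    mean_p = sum(predicted) / n
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(gold, predicted))
    var_g = sum((g - mean_g) ** 2 for g in gold)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_g * var_p)


# Toy data: two rated words close to the target, one far away.
toy_prediction = knn_predict([1.0, 0.05],
                             [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
                             [8.0, 6.0, 2.0], k=2)
```

In a cross-validation setup, `knn_predict` would be called for every held-out word using only the training fold as rated words, and `rmse` and `pearson` would then be computed over the held-out predictions.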
The results are presented in Table 1 and show that our method (SNCut) consistently performs best across both ratings (valence and arousal) and across all three languages. For English and Spanish, the larger margins of improvement over the mean baseline and K-NN are obtained on valence. This is intuitive, as opposite valence words are usually antonyms and are more useful to split apart than low/high arousal words, which might also not be as distributionally similar to each other. In all cases, the signed clustering step improves rating prediction significantly over vanilla spectral clustering (NCut), highlighting the utility of signed clustering. Out of the baseline methods, none consistently outperforms the others. In addition, we also used English 300-dimensional GloVe word embeddings (Pennington et al., 2014) instead of word2vec, which led to similar results using SNCut: RMSE = 0.82 and ρ = 0.76 for valence, and RMSE = 0.73 and ρ = 0.56 for arousal. As an upper bound comparison, Warriner et al. (2013) reported that the human inter-annotator agreements are 0.85 to 0.97 and 0.56 to 0.76 for valence and arousal, respectively, across various languages.
We also directly compare with results from previous work by matching the training and testing data sets where enough information was provided. When using only English ANEW words for out-of-sample analysis as in Recchia and Louwerse (2015), our results are slightly higher (ρ = .804 cf. ρ = .8 for valence, ρ = .632 cf. ρ = .62 for arousal). We did not have enough information to reproduce the remaining previous work.
Figure 2 presents the rating prediction error of our method when varying the number of ratings used as seeds in signed clustering. As expected, the error of our predictions decreases with the number of ratings available, with signs of reaching a plateau towards the end.

Conclusion
This study looked at the feasibility of automatically predicting word-level ratings, here valence and arousal, by combining distributional approaches with signed spectral clustering. Our experiments on word ratings of valence and arousal across three different languages showed that in an out-of-sample word rating prediction task, our proposed method consistently achieves the best prediction results when compared to a number of competitive methods and existing baselines.
Future work will include experiments on other word-level ratings, such as age-of-acquisition, dominance, imageability or abstractness, on other languages and using other word embeddings. Possible applications of our work include choosing the words to rate in an active learning setup for annotating new languages, automatically cleaning and checking word ratings, and applying automatically derived scores to improve downstream tasks such as sentiment analysis or emotion detection.