Rotated Word Vector Representations and their Interpretability

Vector representations of words improve performance in various NLP tasks, but the high-dimensional word vectors are very difficult to interpret. We apply several rotation algorithms to the vector representation of words to improve interpretability. Unlike previous approaches that induce sparsity, the rotated vectors are interpretable while preserving the expressive performance of the original vectors. Furthermore, any prebuilt word vector representation can be rotated for improved interpretability. We apply rotation to skip-gram and GloVe vectors and compare their expressive power and interpretability with those of the original vectors and the sparse overcomplete vectors. The results show that the rotated vectors outperform the original and the sparse overcomplete vectors on interpretability and expressiveness tasks.


Introduction
Vector representations of words contain rich semantic and syntactic information and thus improve the performance of numerous natural language processing tasks. The vectors also play a basic role as an embedding layer in deep learning models for NLP, affecting the expressive performance of the model (Iyyer et al., 2014; Tai et al., 2015; Yang et al., 2016). However, the many dimensions comprising the vector representation are not amenable to interpretation.
Previous research on vector representations of words has proposed improving interpretability while keeping the expressive performance by inducing sparsity in word vector dimensions (Murphy et al., 2012; Fyshe et al., 2014). More recent work builds sparse vector representations from a large corpus with a non-negativity constraint enforced by an improved projected gradient method (Luo et al., 2015), while Sun et al. (2016) learn l1-regularised vectors. However, these models cannot be learned over pre-trained word vectors based on skip-gram (Mikolov et al., 2013) or GloVe (Pennington et al., 2014), which are widely used. Faruqui et al. (2015) propose an alternative to such stand-alone models by forming sparse representations from the pre-trained models. To do this, they use overcomplete vectors, which are much higher in dimensionality than the original vectors.
Unlike these sparsity-inducing approaches, we construct an interpretable word vector representation by taking the pre-trained word vectors as input and applying a basis rotation algorithm from the Exploratory Factor Analysis (EFA) literature, which is used in developing psychological scales (Osborne and Costello, 2009). Like a word in a word vector representation, every item in a scale is represented as a numeric vector in the latent factor space. The set of item vectors is represented as a factor loading matrix, and the matrix is rotated such that the factors (i.e., dimensions) become interpretable. The rotation achieves a Simple Structure (Thurstone, 1947) by minimizing the row and the column complexity of the matrix (Crawford and Ferguson, 1970). We elaborate on this process in the next section. As in EFA, we rotate the word vector representation matrix to obtain dimension-wise interpretability while keeping the number of dimensions unchanged. For example, Figure 1 shows the rotated skip-gram vectors for two groups of words. These words are the top five words of two dimensions of the rotated Word2Vec vectors.
Our main contribution is applying the matrix rotation algorithm from psychometric analysis to word vector representation models to improve the interpretability of the vectors. This approach helps answer the question of why and how word vector representations work well by revealing a hidden structure of the original word vectors. That is, it is meaningful to transform the hard-to-interpret dimensions of the widely used pre-built word vectors into more interpretable ones. We also show that the rotated vectors retain their effectiveness on downstream tasks without re-building the vector representations. Our method can be applied to any type of word vectors as a post-processing step, so it does not require training on a large corpus. In addition, it does not require additional dimensions, so it does not increase the complexity of the model. Furthermore, we explore the characteristics of the rotated word vectors.

Factor Rotation
We take the rotation algorithm from exploratory factor analysis (EFA), which is conducted to verify the construct validity of a psychological scale in development. For example, when validating a scale measuring respondents' latent factors, such as "Engineering problem solving" and "Interest in engineering", items should be similar within a factor and distinguishable between factors. As shown in Table 1, EFA projects every item into the latent factor space as an unrotated factor loading matrix. However, since it is unclear what each factor means, factor rotation is applied to the matrix, producing a rotated factor loading matrix that enhances the interpretability of the dimensions (Osborne, 2015).

Rotating Factors
The rotation algorithm transforms the factor loading matrix into a simple structure, which is much easier to interpret (Thurstone, 1947). It involves postmultiplication of a p × m input matrix A by an m × m square matrix T to compute the rotated matrix Λ,

$$\Lambda = AT \tag{1}$$

which minimizes the cost function f(Λ), also known as the rotation criterion. The criterion measures the complexity of the matrix, so minimizing it drives the rotated matrix toward having only a few large values in each row or column. Minimizing the complexity allows non-binary values in the vector, and thus a more complex solution than the perfect simple structure. This is a more realistic solution, since a solution with binary vectors may be misleading in representing the factor of interest (Yates, 1988; Browne, 2001). More details are described in the next subsection.

Table 1: An example of the factor rotation process to verify the construct validity of the psychological scale and its intended latent factor (left) in development. Items and loadings are from Osborne (2015).
The intuition behind this approach is that factor rotation reshapes the word embedding matrix into a simple structure through a linear transformation. It encourages each word vector (row) and each dimension (column) to have only a few large values, leading to more interpretable dimensions, as shown in Figure 1.

Crawford-Ferguson Rotation Family
The rotation criterion introduced by Crawford and Ferguson (1970) is a family of complexity functions:

$$f(\Lambda) = (1-\kappa)\sum_{i=1}^{p}\sum_{j=1}^{m}\sum_{l \neq j}^{m}\lambda_{ij}^{2}\lambda_{il}^{2} + \kappa\sum_{j=1}^{m}\sum_{i=1}^{p}\sum_{l \neq i}^{p}\lambda_{ij}^{2}\lambda_{lj}^{2} \tag{2}$$

where $\lambda_{ij}$ is an element of Λ. The first term represents the row (item) complexity, and the second term represents the column (factor) complexity. The ratio between the two is adjusted by the parameter κ (0 ≤ κ ≤ 1). The criterion is a generalized version of the widely used criteria, the orthomax family (Harman, 1960), which includes quartimax (Carroll, 1953; Ferguson, 1954; Neuhaus and Wrigley, 1954), varimax (Kaiser, 1958), and direct quartimin (Carroll, 1960). It effectively reflects the simple structure as well (Browne, 2001). In this work, we apply the representative κ values listed in Table 2 (Sass and Schmitt, 2010).

In addition, constraints can be imposed on the rotation matrix T. Based on the constraint, we categorize a rotation as orthogonal or oblique. Orthogonal rotation assumes that the correlation between the rotated dimensions is zero. Hence, T should be an orthogonal matrix that, with m(m − 1)/2 constraints, satisfies:

$$T^{\top}T = I \tag{3}$$

Oblique rotations allow the correlation between dimensions to be non-zero, resulting in m constraints satisfying:

$$\operatorname{diag}\left(T^{-1}\,(T^{-1})^{\top}\right) = I \tag{4}$$

The solution for the input matrix is computed using the gradient projection algorithm (Jennrich, 2001, 2002). The algorithm minimizes equation 2 while satisfying the constraints on the rotation matrix.
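The orthogonal case of the procedure above can be sketched in NumPy. This is a minimal illustration, not the authors' TensorFlow implementation: the function names `cf_criterion` and `rotate_orthogonal`, the step-size schedule, and the stopping tolerance are assumptions made here, and the criterion is written in the standard Crawford-Ferguson form. Oblique rotation needs a different projection step and is not covered.

```python
import numpy as np

def cf_criterion(L, kappa):
    """Crawford-Ferguson complexity f(L) and its gradient with respect to L."""
    L2 = L ** 2
    # For entry (i, j): sum of squared loadings in the same row, other columns.
    row_other = L2.sum(axis=1, keepdims=True) - L2
    # For entry (i, j): sum of squared loadings in the same column, other rows.
    col_other = L2.sum(axis=0, keepdims=True) - L2
    f = (1 - kappa) * np.sum(L2 * row_other) + kappa * np.sum(L2 * col_other)
    grad = 4 * L * ((1 - kappa) * row_other + kappa * col_other)
    return f, grad

def rotate_orthogonal(A, kappa, max_iter=500, tol=1e-6):
    """Gradient projection over orthogonal T; returns (A @ T, T)."""
    m = A.shape[1]
    T = np.eye(m)
    f, Gq = cf_criterion(A @ T, kappa)
    G = A.T @ Gq                           # gradient of f with respect to T
    step = 1.0
    for _ in range(max_iter):
        M = T.T @ G
        Gp = G - T @ (M + M.T) / 2         # project onto the tangent space at T
        s = np.linalg.norm(Gp)
        if s < tol:
            break
        step *= 2
        improved = False
        for _ in range(20):                # backtracking line search
            U, _sv, Vt = np.linalg.svd(T - step * Gp)
            T_new = U @ Vt                 # nearest orthogonal matrix to the update
            f_new, Gq_new = cf_criterion(A @ T_new, kappa)
            if f_new < f - 0.5 * s ** 2 * step:
                improved = True
                break
            step /= 2
        if not improved:
            break
        T, f, G = T_new, f_new, A.T @ Gq_new
    return A @ T, T
```

Only accepted steps update T, so the criterion never increases relative to the unrotated input. The row and column sums avoid building a p × p matrix, which matters when p is the vocabulary size.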

Experimental Settings
We use the English Wikipedia articles to train the word vector models. The corpus contains 5.3M articles, 83M sentences and 1,676M tokens. For preprocessing, we keep only the alphanumeric tokens and lowercase all words. We then remove the words with frequency less than 50, leaving a vocabulary of 306,491 words.
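The preprocessing steps above can be sketched as follows. This is an illustrative reading rather than the authors' pipeline: the names `tokenize` and `build_vocab` are hypothetical, and treating "alphanumeric" as runs of ASCII letters and digits is an assumption.

```python
import re
from collections import Counter

# Assumption: an alphanumeric token is a run of ASCII letters and digits.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")

def tokenize(sentence):
    """Keep alphanumeric tokens only and lowercase them."""
    return [tok.lower() for tok in TOKEN_RE.findall(sentence)]

def build_vocab(sentences, min_count=50):
    """Return the set of words that occur at least min_count times."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    return {word for word, count in counts.items() if count >= min_count}
```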
We train skip-gram (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) on the corpus using existing implementations. We set the window size to 5 for both skip-gram and GloVe. We set the number of negative samples to 5 and the number of dimensions to 300. We use the default values for the other hyperparameters. The size of the resulting word vector matrix is 306,491 × 300. We compare our model with two baseline models: sparse overcomplete vector representations (SOV) and the non-negative version of the SOV. We set the hyperparameters of these models to λ = .5, τ = 10^−5, K = 3000 for skip-gram, and λ = 1.0, τ = 10^−5, K = 3000 for GloVe (Faruqui et al., 2015). We exclude as baselines the methods that construct interpretable word vectors from huge training corpora, because our method works with pre-trained vectors.
We apply four rotation algorithms, listed in Table 2, under each of the orthogonal and oblique constraints. Since we have two original word vector representations, we have 16 (4 x 2 x 2) rotated vectors in total. We implement the algorithm in TensorFlow (Abadi et al., 2016), and it is publicly available on GitHub.

Interpretability
In this section, we show how the rotation of word vectors results in improved dimension-wise interpretability using the word intrusion task (Murphy et al., 2012; Faruqui et al., 2015; Sun et al., 2016).

Word Intrusion
The word intrusion task seeks to measure the semantic coherence of a set of words. For example, consider a set of words consisting of ('daughter', 'wife', 'sister', 'mother', 'son'), and add an 'intruder' word ('bigram') to the set. Since the words other than the intruder have similar meanings, we can easily pick out the intruder and conclude that the five words share a coherent meaning.
We apply this task to measure the interpretability of every word vector dimension. We choose the words with the highest embedding values in a dimension (the top words for that dimension) and add a random intruder word; if the intruder can be easily identified, we conclude that the dimension is semantically coherent. In this way, the task measures the extent of interpretability of each dimension of a vector representation. Note that we pick the top words for a dimension by looking only at the values of that dimension, ignoring values in the other dimensions.
Specifically, we first choose the top five words in each dimension, and then we choose an intruder word based on two criteria: 1) it is in the lower half of that dimension, and 2) it is in the top 10% of some other dimension. We follow the settings of the measure (k = 5, top 10%) from previous work (Murphy et al., 2012; Sun et al., 2016), and we see similar results when we run experiments with larger k.

In the standard word intrusion task, human evaluators pick out the intruder words, and the results report the accuracy of the evaluators (Chang et al., 2009). This approach would be impractical for all of our experimental conditions, each with 300 dimensions, plus the baselines, so we instead use the distance ratio (DR) metric of Sun et al. (2016) with slight modifications. Another advantage of this metric is that it quantifies the distance between the intruder and the non-intruder words. We define the overall metric as the average of the ratio between $D_a^{inter}$ and $D_a^{intra}$ over the d dimensions:

$$DR_{overall} = \frac{1}{d}\sum_{a=1}^{d}\frac{D_a^{inter}}{D_a^{intra}}$$

where $D_a^{intra}$ is the average distance of every pair among the top k words in dimension a,

$$D_a^{intra} = \sum_{w_j \in top_k(a)}\ \sum_{\substack{w_k \in top_k(a) \\ w_k \neq w_j}} \frac{dist(w_j, w_k)}{k(k-1)}$$

and $D_a^{inter}$ is the average distance between the intruder word $w_{b_a}$ and each of the top k words in dimension a,

$$D_a^{inter} = \sum_{w_j \in top_k(a)} \frac{dist(w_j, w_{b_a})}{k}$$

We define $dist(w_j, w_k)$ as the cosine distance between $w_j$ and $w_k$. We set k = 5, repeat this three times for each dimension a, and use the average to compute $DR_{overall}$.

Table 3 shows the results of word intrusion in terms of the distance ratio metric. Overall, the rotated vector representations show improvements over SOV and the original word vector representations. For skip-gram, orthogonal parsimax shows the best result, while for GloVe, orthogonal varimax outperforms the others. Among the oblique rotations, varimax and quartimax show better performance than factor parsimony. In general, interpretability varies with the value of κ. It increases when κ is close to zero and decreases when κ is close to one, which puts more weight on the column complexity. Also, orthogonal rotation shows better performance than oblique rotation when κ is held fixed.
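The distance ratio described above can be sketched in NumPy. This is a minimal reading of the text and not the authors' code: the function names, the use of the median to define the "lower half", the quantile cutoff for "top 10%", and uniform sampling among eligible intruders are assumptions made here.

```python
import numpy as np

def cosine_distance(u, v):
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def pick_intruder(W, a, rng, top_frac=0.10):
    """Random word in the lower half of dimension a and the top 10% of another."""
    in_lower_half = W[:, a] < np.median(W[:, a])
    others = np.delete(W, a, axis=1)
    cutoffs = np.quantile(others, 1.0 - top_frac, axis=0)
    in_top_elsewhere = (others >= cutoffs).any(axis=1)
    candidates = np.flatnonzero(in_lower_half & in_top_elsewhere)
    return int(rng.choice(candidates))

def distance_ratios(W, k=5, repeats=3, seed=0):
    """Per-dimension ratio of inter-distance to intra-distance."""
    rng = np.random.default_rng(seed)
    ratios = np.empty(W.shape[1])
    for a in range(W.shape[1]):
        top = np.argsort(-W[:, a])[:k]     # top-k words by value in dimension a
        intra = np.mean([cosine_distance(W[i], W[j])
                         for i in top for j in top if i != j])
        inters = []
        for _ in range(repeats):           # a fresh intruder for each repeat
            b = pick_intruder(W, a, rng)
            inters.append(np.mean([cosine_distance(W[b], W[i]) for i in top]))
        ratios[a] = np.mean(inters) / intra
    return ratios
```

The overall score is then `distance_ratios(W).mean()`, where `W` is the word-by-dimension matrix. Values above one indicate that the intruder sits farther from the top words than the top words sit from each other.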

Qualitative Examples
We present the top words of five dimensions for skip-gram and rotated skip-gram (parsimax, orthogonal) in Table 4. The dimensions shown are randomly selected for both conditions.
Overall, the top words in each dimension of skip-gram do not clearly show a common topic among them. Only a few dimensions out of 300 are interpretable, such as the second row in the table which is related to numbers. The overall distance ratio of the original vectors is slightly higher than one.
For the rotated word vectors, the top words show clear semantic coherence. The first row shows words about social network services, the second row is about biology, the third row is about geographical locations in the US, and the fourth is about paintings. As the last row shows, some of these dimensions represent syntactic features.

Expressive Performance
We evaluate the expressive power of word vector representations on the following tasks and report Spearman's correlation coefficient for the first task, and accuracy for the other tasks. Table 5 shows the results.

Evaluation
We briefly describe the seven benchmark tasks: word similarity, semantic and syntactic analogy, and four classification tasks. For the classification tasks, we average the word vectors in each training sentence or phrase and use the result as features. SVM and random forest classifiers are trained to predict the target values, and hyperparameters are tuned on the validation set.
Word Similarity (Simil.) SimLex-999 (Hill et al., 2016) was presented to evaluate the similarity of word pairs, rather than their relatedness. We compute the cosine similarity between the given word pairs and report Spearman's correlation coefficient as a measure of consistency between the similarity and human ratings.
Semantic and Syntactic Analogies (Analg. sem, syn). The second and third tasks are the word analogy tasks proposed by Mikolov et al. (2013). The semantic task includes 8,869 questions (sem) and the syntactic task includes 10,675 questions (syn).
Sentiment Analysis (Sent.) The first classification task is sentiment classification on movie reviews (Socher et al., 2013), with 6,920, 872, and 1,821 sentences for training, development, and test, respectively. The goal of this task is to predict the positive or negative sentiment of the reviews.
Question Classification (Ques.) Next, we use the TREC dataset to classify the categories of questions (Faruqui et al., 2015). We divide the dataset into 4,952, 500, and 500 questions for training, development, and test. The dataset has six types of questions, including questions about persons, locations, etc.

Table 5: Evaluation results of the original skip-gram, sparse overcomplete vectors (SOV), and the rotated (orthogonal and oblique) word vectors on various tasks. The left three columns show tasks based on cosine similarity, and the right four columns show classification tasks using average word vectors as features. Overall, the rotated word vectors show higher or comparable performance to that of the SOV and the original. We observe a similar pattern in GloVe as well.

NP bracketing (NP brckt.) The final task is classifying noun phrases in terms of bracketing (Lazaridou et al., 2013; Faruqui et al., 2015). Each phrase consists of three words, and the task is to predict the correct bracketing to match the similar words. We average the word vectors of each NP and perform ten-fold cross-validation over 2,227 phrases. The classifiers are trained and the hyperparameters are tuned for every fold.
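The averaged-vector features used for the classification tasks above can be sketched as follows. The function name `average_vector`, the `vocab_index` mapping from word to row, and the handling of out-of-vocabulary tokens (skipped, with a zero vector when nothing remains) are assumptions for illustration. The source does not say how such cases were handled.

```python
import numpy as np

def average_vector(tokens, vocab_index, W):
    """Mean of the word vectors of the in-vocabulary tokens.

    tokens: list of preprocessed words in a sentence or phrase
    vocab_index: dict mapping a word to its row in W (hypothetical name)
    W: (vocabulary size, dimensions) word vector matrix
    """
    rows = [vocab_index[t] for t in tokens if t in vocab_index]
    if not rows:
        # Assumption: fall back to a zero vector when no token is known.
        return np.zeros(W.shape[1])
    return W[rows].mean(axis=0)
```

The resulting fixed-length vector can be passed to any standard classifier, such as the SVM and random forest models mentioned above.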

Results
Word Similarity and Analogies We observe improved performance of oblique rotation of word vectors compared to the original and the SOV in word similarity and semantic analogy tasks. In the syntactic analogy, orthogonal rotation shows the same performance as the original. Note that the orthogonal rotations preserve the cosine-based expressive performances because the cosine similarity between any two vectors does not change after the orthogonal rotation.
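The invariance noted above follows in a few lines. The sketch assumes that word vectors are rows x and y of the embedding matrix and that T is a square matrix with $T^{\top}T = I$, which for a square matrix also gives $TT^{\top} = I$; the symbols x and y are introduced here only for illustration.

```latex
\begin{align*}
(xT)(yT)^{\top} &= x\,T T^{\top} y^{\top} = x\,y^{\top}, \\
\lVert xT \rVert^{2} &= x\,T T^{\top} x^{\top} = \lVert x \rVert^{2}, \\
\cos(xT,\, yT) &= \frac{(xT)(yT)^{\top}}{\lVert xT \rVert\,\lVert yT \rVert}
                = \frac{x\,y^{\top}}{\lVert x \rVert\,\lVert y \rVert}
                = \cos(x,\, y).
\end{align*}
```

An oblique T does not satisfy $TT^{\top} = I$ in general, which is consistent with the oblique rotations producing different similarity and analogy scores from the original vectors.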
Classification Tasks The SOV models show slightly higher performance except on the question classification task. However, the rotated word vectors show improved performance over the original vectors. We observe a similar pattern in GloVe as well. In conclusion, the rotated representations preserve the expressive power of the original word vectors, and their performance is quite close to that of the sparse representation, which has 10 times larger dimensionality.

Understanding Rotated Word Vectors
In this section, we perform several experiments to understand the characteristics of the rotated word vector representations.

Directionality
One conventional approach to making word vectors more interpretable is forcing the representation to have non-negative values (Faruqui et al., 2015; Luo et al., 2015). However, the dimensions of the rotated vectors are not non-negative and spread in both directions. Hence, we investigate the relationship between directionality (positive / negative) and interpretability.

Table 6: Overall distance ratio based on the top words extracted from the values in word vectors sorted in descending order (Hi) and ascending order (Lo). Cor(Hi, Lo) is the correlation between the two distance ratios based on both directions. The next three columns present the correlation between the absolute word vector values of the top words and the distance ratios. The last column shows the selective distance ratio measure. The results imply that generally both directions are interpretable, that one direction is more interpretable than the other within a dimension, and that a larger absolute value in a dimension means higher interpretability. (* p < .05, ** p < .01, *** p < .001)

Overall Interpretability of both directions The first two columns (A) and (B) in Table 6 show the overall distance ratio computed over the top words extracted in descending order and ascending order, respectively. In other words, (A) refers to the top words having the highest positive values in each dimension, while (B) uses the lowest negative values. Note that we used descending order in the word intrusion task in the previous section.
Interestingly, the overall distance ratios in both directions are comparable to each other. On average, both sides of a dimension are more interpretable than the unrotated vector representations except the oblique factor parsimony rotation.
Interpretability of both directions within a dimension Next, we compare the interpretability of both directions within a dimension. We first define the distance ratio of an individual dimension a as follows:

$$DR_a = \frac{D_a^{inter}}{D_a^{intra}}$$

We compute the ratio using the top words extracted from the positive and negative directions of every dimension, and compute Spearman's correlation of the distance ratio pairs. Table 6 column (C) shows the results. All of the rotation conditions except the oblique factor parsimony show a significant (p < .05) negative correlation, meaning that it is hard for both directions of a dimension to be highly interpretable simultaneously.

Dir.  Top words
+     depends, depend, rely, focused, focuses
-     on, upon, onto, again, until

+     years, month, weeks, days, decades
-     many, several, ago, numerous, various

+     that, which, when, where, but
-     consists, includes, provides, contains, serves

+     criticizes, excelled, tended, much, criticized
-     october, july, april, september, june

+     were, hoc, recently, their, had
-     largest, oldest, longest, biggest, tallest

Table 7: Examples of top words in both directions. The words are extracted from a part of the orthogonal parsimax rotated skip-gram word vectors.

Case Study
We present the top words in both directions for some dimensions of the orthogonal parsimax rotated word vectors. As shown in Table 7, in some dimensions the two directions are related in that they contain words that are used consecutively, such as "rely on", "depends upon", "which includes", "that contains", "many years", and "weeks ago". In other dimensions, however, one direction is relatively more interpretable than the other.

Selecting the Direction
Next, it is natural to question whether the larger absolute value in word vectors means higher interpretability, regardless of its directionality. We verify the relation between them by investigating the size of the absolute value in a dimension and the individual distance ratios.
Relation to distance ratio Table 6 column (D) presents Spearman's correlation between the individual distance ratio and the mean absolute vector value of the top words for that dimension. The fifth column (E) shows the correlation between the intra-distance among the top words and the mean absolute value, and the sixth column (F) shows the correlation between the inter-distance from the top words to the intruder and the mean absolute value.
The correlation coefficients show that a larger mean absolute value means higher interpretability for that dimension. In detail, a larger mean absolute value in a dimension tends to reduce the intra-distances among the top words while increasing the inter-distances between the top words and the intruder.
Overall, we summarize our findings as follows: 1) generally both directions are somewhat interpretable, 2) one direction is usually more interpretable than the other within a dimension, and 3) a larger absolute value in a dimension means higher interpretability of the dimension.
Selective Distance Ratio We can select the more interpretable direction for each dimension by inspecting the mean absolute value of the top words in both directions. If we choose the direction with the larger mean absolute value among the top words, each dimension should be easier to interpret. Table 6 column (G) presents this distance ratio computed on the rotated vectors, which results in increased distance ratio values. We call this ratio the overall selective distance ratio. This measure can be used effectively when a vector representation is interpretable in both directions.
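The direction selection described above can be sketched as follows. The function name `select_direction` and its return format are hypothetical, and ties are resolved in favor of the positive direction as an arbitrary choice.

```python
import numpy as np

def select_direction(W, k=5):
    """For each dimension, pick the direction whose top-k words have the
    larger mean absolute value.

    Returns (signs, top_idx): signs[a] is +1 or -1, and top_idx[a] holds the
    k word indices taken from the chosen direction of dimension a.
    """
    order = np.argsort(W, axis=0)              # ascending within each column
    lo_idx = order[:k].T                       # (d, k): most negative values
    hi_idx = order[::-1][:k].T                 # (d, k): most positive values
    cols = np.arange(W.shape[1])[:, None]
    hi_mean = np.abs(W[hi_idx, cols]).mean(axis=1)
    lo_mean = np.abs(W[lo_idx, cols]).mean(axis=1)
    use_hi = hi_mean >= lo_mean                # tie goes to the positive side
    signs = np.where(use_hi, 1, -1)
    top_idx = np.where(use_hi[:, None], hi_idx, lo_idx)
    return signs, top_idx
```

The selective distance ratio is then the distance ratio computed on `top_idx[a]` for each dimension, rather than always on the descending-order top words.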

Effect of κ
We explore the effect on performance of the ratio between the row and the column complexity of the rotation criteria. As shown in section 4, choosing an appropriate κ is important for interpretability.
We vary κ from zero to one, with the values spaced on a log scale. We run the word similarity task and the word intrusion task to evaluate the performance, and we present Spearman's correlation and the overall selective distance ratio. Figure 2 shows that the performance on the similarity task tends not to change regardless of κ. However, the selective distance ratio starts to decrease when κ > .01. Considering the ratio between the number of rows and columns of the word vector matrix, giving too much weight to the column complexity results in degraded interpretability.
In our experiments, the κ values of the quartimax, varimax, and parsimax rotations are computed as 0, 3e-06, and 1e-04, respectively. Based on the results, our choices of κ effectively improve interpretability compared with factor parsimony (κ = 1). We observe these tendencies in orthogonal rotations as well.

Effect of the Number of Dimensions
To investigate the effect of the number of dimensions on the interpretability of dimensions, we also measure the overall distance ratio ($DR_{overall}$) on 50, 100 and 200 dimensions of unrotated skip-gram and of the parsimax (orthogonal) and varimax (oblique) rotated word vectors. Figure 3 shows the results. For all settings, both the orthogonal (parsimax) and the oblique (varimax) rotated vectors show a higher $DR_{overall}$ score than the original skip-gram vectors.

Related Work
Since distributed representations play an important role in various NLP tasks, they have been applied to semantics (Herbelot and Vecchi, 2015; Qiu et al., 2015; Woodsend and Lapata, 2015) and extended by incorporating external information (Tian et al., 2016; Nguyen et al., 2016). In addition, interpretable regularities in the representations are often sought through non-negative and sparse coding (Murphy et al., 2012; Faruqui et al., 2015; Luo et al., 2015; Kober et al., 2016) and regularization (Sun et al., 2016). Our approach instead uses rotation and shows better results in terms of interpretability. Meanwhile, various rotation criteria have been proposed, such as the CF family (Crawford and Ferguson, 1970), Infomax (McKeon, 1968), Minimum Entropy (Jennrich, 2006), Geomin (Yates, 1988), Procrustes (Hurley and Cattell, 1962), and promax (Hendrickson and White, 1964). Target rotations, which incorporate prior knowledge about the rotated matrix, have been proposed as well (Harman, 1960; Browne, 1972a,b). Among the various ways to rotate dimensions, we select the CF family because it covers the rotation methods frequently used in practice.

Conclusion and Discussions
In this paper, we applied the rotation algorithm to improve the interpretability of distributed representations of words. We applied quartimax, varimax, parsimax and factor parsimony rotation using the Crawford-Ferguson rotation criteria to construct the rotated word vector representations. We evaluated the expressive performance and interpretability of the rotated word vectors on word similarity, analogy, classification, and word intrusion tasks. The results show that the rotated word vector representations are highly interpretable while preserving expressive performance.
In addition, we explored the characteristics of the rotated word vectors: we observed 1) increased interpretability in both directions and 2) the positive relation between absolute value of the dimension and interpretability. Based on these observations, we proposed the selective distance ratio to measure and maximize the interpretability when the vector representation has interpretable meaning in both directions. We expect that the rotation algorithm can be easily applied to other word vector representations.
Our results imply that rotated word vectors can be used to understand what the word vectors are composed of. A lexical item can be decomposed into morphemes, can have multiple meanings through polysemy, can carry information about syntactic structure in its meaning (Carpenter et al., 1995; MacDonald et al., 1994; Trueswell et al., 1994), or can be divided into a variety of sub-components. Hence, we can investigate the lexical semantics of words by exploring the dimensions in which a word has higher values.
In addition, there are practical implications of interpreting the dimensions as well. Based on the meanings, we can remove irrelevant dimensions for a specific task of interest, in order to secure more efficient storage of the vectors and decrease the complexity of downstream NLP models. We will examine the issues in future work.
We plan to explore the following issues. First, we will apply target rotation (Harman, 1960; Browne, 1972a,b) to incorporate prior knowledge when constructing the rotated word vector representations. Second, we will investigate the interpretability of hidden structures of neural networks for NLP tasks, such as those of Yang et al. (2016) and Li et al. (2016), when the rotated word vectors are used as an embedding layer.