Word Embedding Calculus in Meaningful Ultradense Subspaces

We decompose a standard embedding space into interpretable orthogonal subspaces and a “remainder” subspace. We consider four interpretable subspaces in this paper: polarity, concreteness, frequency and part-of-speech (POS) subspaces. We introduce a new calculus for subspaces that supports operations like “−1 × hate = love” and “give me a neutral word for greasy” (i.e., oleaginous). This calculus extends analogy computations like “king − man + woman = queen”. For the tasks of Antonym Classification and POS Tagging our method outperforms the state of the art. We create test sets for Morphological Analogies and for the new task of Polarity Spectrum Creation.


Introduction
Word embeddings are usually trained on an objective that ensures that words occurring in similar contexts have similar embeddings. This makes them useful for many tasks, but has drawbacks for others; e.g., antonyms are often interchangeable in context and thus have similar word embeddings even though they denote opposites. If we think of word embeddings as members of a (commutative or Abelian) group, then antonyms should be inverses of (as opposed to similar to) each other. In this paper, we use DENSIFIER (Rothe et al., 2016) to decompose a standard embedding space into interpretable orthogonal subspaces, including a one-dimensional polarity subspace as well as concreteness, frequency and POS subspaces. We introduce a new calculus for subspaces in which antonyms are inverses, e.g., "−1 × hate = love".
The formula shows what happens in the polarity subspace; the orthogonal complement (all the remaining subspaces) is kept fixed. We show below that we can predict an entire polarity spectrum based on the subspace, e.g., the four-word spectrum hate, dislike, like, love. Similar to polarity, we explore other interpretable subspaces and do operations such as: given a concrete word like friend find the abstract word friendship (concreteness); given the frequent word friend find a less frequent synonym like comrade (frequency); and given the noun friend find the verb befriend (POS).

Word Embedding Transformation
We now give an overview of DENSIFIER; see Rothe et al. (2016) for details. Let $Q \in \mathbb{R}^{d \times d}$ be an orthogonal matrix that transforms the original word embedding space into a space in which certain types of information are represented by a small number of dimensions. The orthogonality can be seen as a hard regularization of the transformation. We choose this because we do not want to add information to or remove information from the original embedding space. This ensures that the transformed word embeddings behave differently only when looking at subspaces, but behave identically when looking at the entire space. By choosing an orthogonal and thus linear transformation we also assume that the information is already encoded linearly in the original word embedding. This is a frequent assumption, as vector addition is generally used for word embeddings.
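The claim that an orthogonal transformation changes behavior only in subspaces, not in the entire space, can be checked numerically. The following is a minimal sketch, assuming NumPy and a random orthogonal matrix obtained from a QR decomposition; DENSIFIER learns Q from lexicon supervision instead, and the vectors and the choice of the first two dimensions as a "subspace" are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5  # toy dimensionality; the paper uses d = 300

# Random orthogonal matrix via QR decomposition (illustrative stand-in for
# the learned Q; this is not how DENSIFIER obtains it).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

e_v = rng.standard_normal(d)  # hypothetical embedding of word v
e_w = rng.standard_normal(d)  # hypothetical embedding of word w
u_v, u_w = Q @ e_v, Q @ e_w   # transformed embeddings


def cosine(a, b):
    """Cosine similarity of two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Similarity in the entire space is unchanged by Q ...
full_before, full_after = cosine(e_v, e_w), cosine(u_v, u_w)
# ... whereas similarity restricted to a few dimensions generally is not.
sub_before, sub_after = cosine(e_v[:2], e_w[:2]), cosine(u_v[:2], u_w[:2])

print(f"entire space: {full_before:.4f} -> {full_after:.4f}")
print(f"first 2 dims: {sub_before:.4f} -> {sub_after:.4f}")
```

The first printed pair agrees up to floating-point error, since Q preserves inner products and norms; the second pair is under no such constraint.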
Concretely, we learn $Q$ such that the dimensions $D^p \subset \{1, \ldots, d\}$ of the resulting space correspond to a word's polarity information and the remaining dimensions $\{1, \ldots, d\} \setminus D^p$ correspond to non-polarity information. Analogously, the sets of dimensions $D^c$, $D^f$ and $D^m$ correspond to a word's concreteness, frequency and POS (or morphological) information, respectively. In this paper, we assume that these properties do not correlate and therefore the ultradense subspaces do not overlap, e.g., $D^p \cap D^c = \emptyset$. This might not be true for other settings, e.g., sentiment and semantic information. As we are using only four properties, there is also a subspace that lies in the orthogonal complement of all trained subspaces. This subspace contains the unclassified information, e.g., genre information in our case (e.g., "clunker" is a colloquial word for "automobile").

Figure 1: Illustration of the transformed embeddings. The horizontal axis is the polarity subspace. All non-polarity information, including concreteness, frequency and POS, is projected into a two-dimensional subspace for visualization (gray plane). A query word (bold) specifies a line parallel to the horizontal axis. We then construct a cylinder around this line. Words in this cylinder are considered to be part of the word spectrum.
If $e_v \in \mathbb{R}^d$ is the original embedding of word $v$, the transformed representation is $u_v = Q e_v$. We use $*$ as a placeholder for polarity ($p$), concreteness ($c$), frequency ($f$) and POS/morphology ($m$) and call $d^* = |D^*|$ the dimensionality of the ultradense subspace of property $*$. For each ultradense subspace, we create $P^* \in \mathbb{R}^{d^* \times d}$, an identity matrix for the dimensions in $D^*$. Thus, the ultradense (UD) representation $u^*_v \in \mathbb{R}^{d^*}$ of word $v$ is defined as:

$$u^*_v = P^* Q e_v \tag{1}$$

For notational simplicity, $u^*_v$ will either refer to a vector in $\mathbb{R}^{d^*}$ or to a vector in $\mathbb{R}^d$ where all dimensions $\notin D^*$ are set to zero.

To train the orthogonal transformation $Q$, we assume we have a lexicon resource. Let $L^*_{\not\sim}$ be a set of word index pairs $(v, w)$ with different labels, e.g., positive/negative, concrete/abstract or noun/verb. We want to maximize the distance for pairs in this group. Thus, our objective is:

$$\operatorname*{argmax}_Q \sum_{(v,w) \in L^*_{\not\sim}} \lVert P^* Q (e_v - e_w) \rVert \tag{2}$$

subject to $Q$ being an orthogonal matrix. Another goal is to minimize the distance of two words with identical labels. Let $L^*_{\sim}$ be a set of word index pairs $(v, w)$ with identical labels. In contrast to Eq. 2, we now want to minimize each distance. Thus, the objective is given by:

$$\operatorname*{argmin}_Q \sum_{(v,w) \in L^*_{\sim}} \lVert P^* Q (e_v - e_w) \rVert \tag{3}$$

subject to $Q$ being an orthogonal matrix. For training, Eq. 2 is weighted with $\alpha^*$ and Eq. 3 with $1 - \alpha^*$. We use batch gradient descent where each batch contains the same number of positive and negative examples. This means that the number of examples in the lexica (which give rise to more negative than positive examples) does not influence the training.
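To make the training signal concrete, here is a minimal sketch of the combined objective for one property, written as a loss to be minimized. It assumes that "distance" means the Euclidean norm of the projected difference of two embeddings; the helper names, the toy embeddings and the toy lexicon are hypothetical, and the orthogonality constraint on Q is not enforced here (it would have to be handled by the optimizer).

```python
import numpy as np


def ud_projection(dims, d):
    """P*: the rows of the d x d identity matrix that select `dims`."""
    return np.eye(d)[list(dims)]


def objective(Q, P, E, diff_pairs, same_pairs, alpha):
    """Loss to minimize: -alpha * (summed distance of differently labeled
    pairs) + (1 - alpha) * (summed distance of identically labeled pairs),
    both measured in the subspace selected by P."""
    def total(pairs):
        return sum(np.linalg.norm(P @ Q @ (E[v] - E[w])) for v, w in pairs)
    return -alpha * total(diff_pairs) + (1 - alpha) * total(same_pairs)


# Toy data: words 0 and 1 are "positive", words 2 and 3 are "negative".
E = np.array([[1.0, 0.0, 0.0],
              [0.8, 1.0, 0.0],
              [-1.0, 0.0, 0.0],
              [-0.9, 1.0, 0.0]])
diff_pairs = [(0, 2), (0, 3), (1, 2), (1, 3)]
same_pairs = [(0, 1), (2, 3)]
P = ud_projection([0], d=3)  # one-dimensional subspace on dimension 0

Q_good = np.eye(3)                 # polarity already sits on dimension 0
Q_bad = np.eye(3)[[1, 0, 2]]       # orthogonal, but moves it to dimension 1

loss_good = objective(Q_good, P, E, diff_pairs, same_pairs, alpha=0.5)
loss_bad = objective(Q_bad, P, E, diff_pairs, same_pairs, alpha=0.5)
print(loss_good, loss_bad)
```

Both candidate matrices are orthogonal, but only the first places the label-separating direction in the selected dimension, so it obtains the lower loss.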

Setup and Method
Eqs. 2/3 can be combined to train an orthogonal transformation matrix. We use pretrained 300-dimensional English word embeddings (Mikolov et al., 2013) (W2V). To train the transformation matrix, we use a combination of MPQA (Wilson et al., 2005), Opinion Lexicon (Hu and Liu, 2004) and NRC Emotion lexicons (Mohammad and Turney, 2013) for polarity; BWK, a lexicon of 40,000 English words (Brysbaert et al., 2014), for concreteness; the order in the word embedding file for frequency; and the training set of the FLORS tagger (Schnabel and Schütze, 2014) for POS. The application of the transformation matrix to the word embeddings gives us four subspaces for polarity, concreteness, frequency and POS. These subspaces and their orthogonal complements are the basis for an embedding calculus that supports certain operations. Here, we investigate four such operations.

The first operation computes the antonym of word $v$:

$$\mathrm{antonym}(v) = \mathrm{nn}(u_v - 2 u^p_v) \tag{4}$$

where $\mathrm{nn} : \mathbb{R}^d \to V$ returns the word whose embedding is the nearest neighbor to the input. Thus, our hypothesis is that antonyms are usually very similar in semantics except that they differ on a single "semantic axis," the polarity axis.

The second operation is "neutral version of word $v$":

$$\mathrm{neutral}(v) = \mathrm{nn}(u_v - u^p_v) \tag{5}$$

Thus, our hypothesis is that neutral words are words with a value close to zero in the polarity subspace.

The third operation produces the polarity spectrum of $v$:

$$\mathrm{spectrum}(v) = \{\mathrm{nn}(u_v + x u^p_v) \mid x \in \mathbb{R}\} \tag{6}$$

This means that we keep the semantics of the query word fixed, while walking along the polarity axis, thus retrieving different shades of polarity. Figure 1 shows two example spectra.

The fourth operation is "word $v$ with POS of word $w$":

$$\mathrm{POS}_w(v) = \mathrm{nn}(u_v - u^m_v + u^m_w) \tag{7}$$

This is similar to analogies like king − man + woman, except that the analogy is inferred by the subspace relevant for the analogy.

We create word spectra for some manually chosen words using the Google News corpus (W2V) and a Twitter corpus.
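The four operations of the calculus can be sketched on a toy vocabulary. This is an illustration under stated assumptions, not the authors' implementation: the vectors are hand-made stand-ins for already transformed embeddings, polarity is placed on dimension 0 and POS on dimension 1 (the paper later uses 8 POS dimensions), nearest neighbor is taken by cosine similarity, and the antonym, neutral, spectrum and POS operations are read as "subtract twice the polarity component", "subtract it once", "rescale it by x" and "swap the POS component", respectively. All function names are hypothetical.

```python
import numpy as np

POL = [0]  # polarity dimensions (assumed one-dimensional)
POS = [1]  # POS dimensions (a single one in this toy example)

# Hypothetical transformed embeddings: [polarity, POS, remainder, remainder]
VOCAB = {
    "hate":       [-1.0,  1.0, 1.0, 0.0],
    "dislike":    [-0.4,  1.0, 1.0, 0.0],
    "like":       [ 0.4,  1.0, 1.0, 0.0],
    "love":       [ 1.0,  1.0, 1.0, 0.0],
    "greasy":     [-0.8,  0.0, 0.0, 1.0],
    "oleaginous": [ 0.0,  0.0, 0.0, 1.0],
    "friend":     [ 0.3, -1.0, 0.5, 0.5],
    "befriend":   [ 0.3,  1.0, 0.5, 0.5],
}
WORDS = list(VOCAB)
U = np.array([VOCAB[w] for w in WORDS])


def vec(word):
    return U[WORDS.index(word)]


def sub(u, dims):
    """Copy of u with every dimension outside `dims` set to zero."""
    out = np.zeros_like(u)
    out[dims] = u[dims]
    return out


def nn(x):
    """Word whose embedding has the highest cosine similarity to x."""
    sims = U @ x / (np.linalg.norm(U, axis=1) * np.linalg.norm(x))
    return WORDS[int(np.argmax(sims))]


def antonym(v):
    return nn(vec(v) - 2 * sub(vec(v), POL))


def neutral(v):
    return nn(vec(v) - sub(vec(v), POL))


def spectrum(v, xs):
    """Distinct neighbors found while walking along the polarity axis,
    ordered by their own polarity value."""
    found = {nn(vec(v) + x * sub(vec(v), POL)) for x in xs}
    return sorted(found, key=lambda w: vec(w)[POL[0]])


def with_pos_of(v, w):
    return nn(vec(v) - sub(vec(v), POS) + sub(vec(w), POS))


print(antonym("hate"))
print(neutral("greasy"))
print(spectrum("hate", np.linspace(-3, 1, 41)))
print(with_pos_of("friend", "love"))
```

Because only the polarity (or POS) component is modified, the remaining dimensions act as an anchor that keeps the retrieved word semantically close to the query.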
Because the transformation is orthogonal and therefore does not rescale any dimension, we multiply the polarity dimension by 30 to give it a high weight, i.e., to pay more attention to it. We then use Eq. 6 with a sufficiently small step size for $x$, i.e., a step size for which further reduction does not add words to the spectrum. We also discard words that have a cosine distance of more than .6 in the non-polarity space. Table 1 shows examples. The results are highly domain dependent, with Twitter's spectrum indicating more negative views of politicians than news. While fall has negative associations, autumn's are positive, probably because autumn is of a higher register in American English.

We evaluate on Adel and Schütze (2014)'s data; the task is to decide for a pair of words whether they are antonyms or synonyms. The set has 2,337 positive and negative pairs each and is split into 80% training, 10% dev and 10% test. Adel and Schütze (2014) collected positive/negative examples from the nearest neighbors of the word embeddings to make it hard to solve the task using word embeddings. We train an SVM (RBF kernel) on three features that are based on the intuition depicted in Figure 1: the cosine distances in the polarity subspace, in the orthogonal complement, and in the entire space.

Table 3: Results for POS tagging. LSJU = Stanford. SVM = SVMTool. F = FLORS. We show three state-of-the-art taggers (lines 1-3), FLORS extended with 300-dimensional embeddings (4) and extended with UD embeddings (5). †: significantly better than the best result in the same column (α = .05, one-tailed Z-test).
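The three features used for antonym classification (cosine distance in the polarity subspace, in its orthogonal complement, and in the entire space) can be sketched as follows. The function names and example vectors are hypothetical, and the SVM itself is omitted; the resulting feature vectors would be passed to any RBF-kernel SVM implementation.

```python
import numpy as np


def cosine_distance(a, b):
    """1 - cosine similarity; undefined if either vector is all zeros."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def antonym_features(u_v, u_w, pol_dims):
    """Cosine distances in (polarity subspace, complement, entire space)."""
    mask = np.zeros(u_v.shape[0], dtype=bool)
    mask[pol_dims] = True
    return np.array([
        cosine_distance(u_v[mask], u_w[mask]),
        cosine_distance(u_v[~mask], u_w[~mask]),
        cosine_distance(u_v, u_w),
    ])


# Hypothetical transformed embeddings; dimension 0 is the polarity axis.
u_hate = np.array([-1.0, 1.0, 1.0, 0.0])
u_love = np.array([1.0, 1.0, 1.0, 0.0])

features = antonym_features(u_hate, u_love, pol_dims=[0])
print(features)
```

With a one-dimensional polarity subspace the first feature only takes the values 0 (same sign) or 2 (opposite sign), so for an antonym pair the expected pattern is a large distance in the polarity subspace combined with a small distance in the complement.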
this dictionary returns a list of up to 80 words of shades of meaning between two polar opposites. We look for words that are also present in Adel and Schütze (2014)'s Antonym Classification data and retrieve 35 spectra. Each word in a spectrum can be used as a query word; after intersecting the spectra with our vocabulary, we end up with 1301 test cases.
To evaluate PSC-SET, we calculate the 10 nearest neighbors of the m words in the spectrum and rank the 10m neighbors by the distance to our spectrum, i.e., the cosine distance in the orthogonal complement of the polarity subspace. We report mean average precision (MAP) and weighted MAP, where each average precision is weighted by the number of words in the spectrum. As shown in Table 4, the two numbers differ little, which suggests that our algorithm performs about equally well on small and large spectra.
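The MAP and weighted MAP computation can be sketched as below. This assumes the common definition of average precision that normalizes by the number of gold words in the spectrum (the text does not state the normalization), and the ranked lists and gold spectra are invented for illustration.

```python
def average_precision(ranked, relevant):
    """Average precision of a ranked candidate list against a gold set,
    normalized by the size of the gold set (an assumption)."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs, weighted=False):
    """runs: list of (ranked_candidates, gold_spectrum) pairs.
    If weighted, each spectrum counts in proportion to its size."""
    aps = [average_precision(ranked, gold) for ranked, gold in runs]
    weights = [len(gold) if weighted else 1 for _, gold in runs]
    return sum(a * w for a, w in zip(aps, weights)) / sum(weights)


# Two hypothetical spectra with their ranked neighbor candidates.
runs = [
    (["love", "car", "like"], {"love", "like"}),
    (["table", "ugly", "pretty", "beautiful"],
     {"ugly", "pretty", "beautiful", "plain"}),
]

map_plain = mean_average_precision(runs)
map_weighted = mean_average_precision(runs, weighted=True)
print(map_plain, map_weighted)
```

A large gap between the two values would indicate that performance depends on spectrum size, which is the comparison the paragraph above draws.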
To evaluate PSC-ORD, we calculate Spearman's ρ of the ranks in OAWT and the values on the polarity dimension. Again, there is no significant difference between average and weighted average of ρ. Table 4 also shows that the variance of the measures is low for PSC-SET and high for PSC-ORD; thus, we do well on certain spectra and worse on others. The best one, beautiful ↔ ugly, is given as an example. The worst performing spectrum is fat ↔ skinny (ρ = .13), presumably because both extremes are negative, contradicting our modeling assumption that spectra go from positive to negative. We test this hypothesis by separating the spectrum into two subspectra. We then report the weighted average correlation of the optimal separation. For fat ↔ skinny, this improves ρ to .67.
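The correlation-based evaluation and the split into two subspectra can be sketched as follows. Several details are assumptions, since the text does not specify them: ρ is computed without tie correction, the split is a single cut into two contiguous parts of at least two words each, the parts are weighted by their length, and the absolute value of ρ is used per part so that a part running from positive to negative counts as well as one running from negative to positive. The example values are invented.

```python
def ranks(values):
    """Rank positions (0-based) of the values; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order):
        result[i] = rank
    return result


def spearman(xs, ys):
    """Spearman's rho via the rank-difference formula (no tie correction)."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def best_split(gold_ranks, polarity):
    """Highest length-weighted mean |rho| over all cuts into two contiguous
    parts with at least two words each; returns (score, cut) or None."""
    n = len(gold_ranks)
    best = None
    for k in range(2, n - 1):
        left = abs(spearman(gold_ranks[:k], polarity[:k]))
        right = abs(spearman(gold_ranks[k:], polarity[k:]))
        score = (k * left + (n - k) * right) / n
        if best is None or score > best[0]:
            best = (score, k)
    return best


# Hypothetical spectrum whose two extremes are both negative: polarity
# rises towards the middle of the gold order and then falls again.
gold = [0, 1, 2, 3, 4, 5]
polarity = [-0.9, -0.3, 0.5, 0.6, 0.1, -0.7]

rho_whole = spearman(gold, polarity)
split_score, cut = best_split(gold, polarity)
print(rho_whole, split_score, cut)
```

On such a spectrum the correlation over the whole list is weak, while each part on its own is perfectly monotone, which mirrors the effect reported for fat ↔ skinny.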

Morphological Analogy.
The previous two subspaces were one-dimensional. Now we consider a POS subspace, because POS is not one-dimensional and cannot be modeled as a single scalar quantity. We create a word analogy benchmark by extracting derivational forms from WordNet (Fellbaum, 1998). We discard words with ≥2 derivational forms of the same POS and words not in the most frequent 30,000. We then randomly select 26 pairs for every POS combination for the dev set and 26 pairs for the test set. An example of the type of equation we solve here is prediction − predict + symbolize = symbol (from the dev set). W2V embeddings are our baseline. We can also rewrite the left side of the equation as POS(prediction) + Semantics(symbolize); note that this cannot be done using standard word embeddings. In contrast, our method can use meaningful UD embeddings and Eq. 7 with POS(v) being $u^m_v$ and Semantics(v) being $u_v - u^m_v$. The dev set indicates that an 8-dimensional POS subspace is optimal and Table 5 shows that this method outperforms the baseline.

Table 5: Accuracy @1 on test for Morphological Analogy. †: significantly better than the corresponding result in the same row (α = .05, one-tailed Z-test).

POS Tagging
Our final evaluation is extrinsic. We use FLORS (Schnabel and Schütze, 2014), a state-of-the-art POS tagger which was extended by Yin et al. (2015) with word embeddings as additional features. W2V gives us a consistent improvement on OOVs (Table 3, line 4). However, training this model requires about 500GB of RAM. When we use the 8-dimensional UD embeddings (the same as for Morphological Analogy), we outperform W2V except for a virtual tie on news (Table 3, line 5). So we perform better even though we only use 8 of 300 dimensions! However, the greatest advantage of UD is that we only need 100GB of RAM, 80% less than W2V.

Related Work
Yih et al. (2012) also tackled the problem of antonyms having similar embeddings. In their model, the antonym is the inverse of the entire vector whereas in our work the antonym is only the inverse in an ultradense subspace. Our model is more intuitive since antonyms invert only part of the meaning, not the entire meaning. Schwartz et al. (2015) present a method that switches an antonym parameter on or off (depending on whether a high antonym-synonym similarity is useful for an application) and learn multiple embedding spaces. We only need a single space, but consider different subspaces of this space. An unsupervised approach using linguistic patterns that ranks adjectives according to their intensity was presented by de Melo and Bansal (2013). Sharma et al. (2015) present a corpus-independent approach for the same problem. Our results (Table 1) suggest that polarity should not be considered to be corpus-independent.
There is also much work on incorporating additional information into the original word embedding training. Examples include Botha and Blunsom (2014) and Cotterell and Schütze (2015). However, postprocessing has several advantages. DENSIFIER can be trained on a normal workstation without access to the original training corpus. This makes the method more flexible, e.g., when new training data or desired properties become available.
On a general level, our method bears some resemblance to that of Weinberger and Saul (2009) in that we perform supervised learning on a set of desired (dis)similarities and that we can think of our method as learning specialized metrics for particular subtypes of linguistic information or particular tasks. Using the method of Weinberger and Saul (2009), one could learn k metrics for k subtypes of information and then simply represent a word w as the concatenation of (i) the original embedding and (ii) k representations corresponding to the k metrics. In case of a simple one-dimensional type of information, the corresponding representation could simply be a scalar. We would expect this approach to have similar advantages for practical applications, but we view our orthogonal transformation of the original space as more elegant and it gives rise to a more compact representation.

Conclusion
We presented a new word embedding calculus based on meaningful ultradense subspaces. We applied the operations of the calculus to Antonym Classification, Polarity Spectrum Creation, Morphological Analogy and POS Tagging. Our evaluation shows that our method outperforms previous work and is applicable to different types of information. We have published test sets and word embeddings at http://www.cis.lmu.de/~sascha/Ultradense/.