Model-based Word Embeddings from Decompositions of Count Matrices

This work develops a new statistical understanding of word embeddings induced from transformed count data. Using the class of hidden Markov models (HMMs) underlying Brown clustering as a generative model, we demonstrate how canonical correlation analysis (CCA) and certain count transformations permit efficient and effective recovery of model parameters with lexical semantics. We further show in experiments that these techniques empirically outperform existing spectral methods on word similarity and analogy tasks, and are also competitive with other popular methods such as WORD2VEC and GLOVE.


Introduction
The recent spike of interest in dense, low-dimensional lexical representations (i.e., word embeddings) is largely due to their ability to capture subtle syntactic and semantic patterns that are useful in a variety of natural language tasks. A successful method for deriving such embeddings is the negative sampling training of the skip-gram model suggested by Mikolov et al. (2013b) and implemented in the popular software WORD2VEC. The form of its training objective was motivated by efficiency considerations, but has subsequently been interpreted by Levy and Goldberg (2014b) as seeking a low-rank factorization of a matrix whose entries are word-context co-occurrence counts, scaled and transformed in a certain way. This observation sheds new light on WORD2VEC, yet also raises several new questions about word embeddings based on decomposing count data. What is the right matrix to decompose? Are there rigorous justifications for the choice of matrix and count transformations?
In this paper, we answer some of these questions by investigating the decomposition specified by CCA (Hotelling, 1936), a powerful technique for inducing generic representations whose computation is efficiently and exactly reduced to that of a matrix singular value decomposition (SVD). We build on and strengthen the work of Stratos et al. (2014) which uses CCA for learning the class of HMMs underlying Brown clustering. We show that certain count transformations enhance the accuracy of the estimation method and significantly improve the empirical performance of word representations derived from these model parameters (Table 1).
In addition to providing a rigorous justification for CCA-based word embeddings, we also supply a general template that encompasses a range of spectral methods (algorithms employing SVD) for inducing word embeddings in the literature, including the method of Levy and Goldberg (2014b). In experiments, we demonstrate that CCA combined with the square-root transformation achieves the best result among spectral methods and performs competitively with other popular methods such as WORD2VEC and GLOVE on word similarity and analogy tasks. We additionally demonstrate that CCA embeddings provide the most competitive improvement when used as features in named-entity recognition (NER).

Notation
We use [n] to denote the set of integers {1, . . . , n}. We denote the m × m diagonal matrix with values $v_1, \ldots, v_m$ along the diagonal by $\mathrm{diag}(v_1, \ldots, v_m)$. We write $[a_1 \ldots a_m]$ to denote a matrix whose i-th column is $a_i$. The expected value of a random variable X is denoted by E[X]. Given a matrix Ω and an exponent a, we distinguish the entrywise power operation $\Omega^{\langle a \rangle}$ (i.e., $[\Omega^{\langle a \rangle}]_{i,j} = (\Omega_{i,j})^a$) from the matrix power operation $\Omega^a$ (defined only for square Ω).

Background in CCA
In this section, we review the variational characterization of CCA. This provides a flexible framework for a wide variety of tasks. CCA seeks to maximize a statistical quantity known as the Pearson correlation coefficient between random variables $L, R \in \mathbb{R}$:
$$\mathrm{Cor}(L, R) := \frac{E[LR] - E[L]\,E[R]}{\sqrt{E[L^2] - E[L]^2}\,\sqrt{E[R^2] - E[R]^2}}$$
This is a value in [−1, 1] indicating the degree of linear dependence between L and R.

CCA objective
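Let $X \in \mathbb{R}^n$ and $Y \in \mathbb{R}^{n'}$ be two random vectors. Without loss of generality, we will assume that X and Y have zero mean. Let $m \le \min(n, n')$. CCA can be cast as finding a set of projection vectors (called canonical directions) $a_1, \ldots, a_m \in \mathbb{R}^n$ and $b_1, \ldots, b_m \in \mathbb{R}^{n'}$ such that for i = 1 . . . m, with Cor(·, ·) denoting the Pearson correlation coefficient:
$$(a_i, b_i) = \arg\max_{a \in \mathbb{R}^n,\; b \in \mathbb{R}^{n'}} \mathrm{Cor}(a^\top X, b^\top Y) \quad \text{subject to} \quad \mathrm{Cor}(a^\top X, a_j^\top X) = 0, \;\; \mathrm{Cor}(b^\top Y, b_j^\top Y) = 0 \quad \forall j < i \qquad (1)$$
That is, at each i we simultaneously optimize vectors a, b so that the projected random variables $a^\top X, b^\top Y \in \mathbb{R}$ are maximally correlated, subject to the constraint that the projections are uncorrelated with all previous projections.
Let $A := [a_1 \ldots a_m]$ and $B := [b_1 \ldots b_m]$. Then we can think of the joint projections
$$\underline{X} = A^\top X, \qquad \underline{Y} = B^\top Y \qquad (2)$$
as new m-dimensional representations of the original variables that are transformed to be as correlated as possible with each other. Furthermore, often $m \ll \min(n, n')$, leading to a dramatic reduction in dimensionality.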

Exact solution via SVD
Eq. (1) is non-convex due to the terms a and b that interact with each other, so it cannot be solved exactly using a standard optimization technique. However, a method based on SVD provides an efficient and exact solution. See Hardoon et al. (2004) for a detailed discussion.
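Lemma 3.1 (Hotelling (1936)). Assume X and Y have zero mean. The solution (A, B) to (1) is given by $A = E[XX^\top]^{-1/2} U$ and $B = E[YY^\top]^{-1/2} V$, where the i-th column of $U \in \mathbb{R}^{n \times m}$ ($V \in \mathbb{R}^{n' \times m}$) is the left (right) singular vector of
$$\Omega := E[XX^\top]^{-1/2}\, E[XY^\top]\, E[YY^\top]^{-1/2} \qquad (3)$$
corresponding to the i-th largest singular value $\sigma_i$. Furthermore, $\sigma_i = \mathrm{Cor}(a_i^\top X, b_i^\top Y)$.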
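The SVD route to CCA can be sketched numerically as follows. This is a minimal illustration on synthetic data, assuming dense sample covariance matrices that are invertible; the function names `inv_sqrt` and `cca` and the toy data are hypothetical and not part of the original work.

```python
import numpy as np

def inv_sqrt(M):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * vals ** -0.5) @ vecs.T

def cca(X, Y, m):
    """Sample CCA via SVD. X: (N, n) and Y: (N, n') sample matrices."""
    X = X - X.mean(axis=0)                  # center so the zero-mean assumption holds
    Y = Y - Y.mean(axis=0)
    N = X.shape[0]
    Wx = inv_sqrt(X.T @ X / N)              # estimate of E[XX^T]^{-1/2}
    Wy = inv_sqrt(Y.T @ Y / N)              # estimate of E[YY^T]^{-1/2}
    Omega = Wx @ (X.T @ Y / N) @ Wy         # whitened cross-covariance
    U, sigma, Vt = np.linalg.svd(Omega)
    return Wx @ U[:, :m], Wy @ Vt[:m].T, sigma[:m]

# Synthetic data (assumption): X and Y share a 2-dimensional latent signal Z.
rng = np.random.default_rng(0)
Z = rng.normal(size=(5000, 2))
X = np.hstack([Z, rng.normal(size=(5000, 2))])
Y = np.hstack([Z + 0.5 * rng.normal(size=(5000, 2)), rng.normal(size=(5000, 1))])

A, B, sigma = cca(X, Y, m=2)
proj_x = (X - X.mean(axis=0)) @ A           # projected X, one column per direction
proj_y = (Y - Y.mean(axis=0)) @ B           # projected Y
print(sigma)                                # sample canonical correlations
```

In this sketch the correlation between the i-th columns of `proj_x` and `proj_y` coincides with the i-th singular value, which is the relationship the lemma describes.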

Using CCA for word representations
As presented in Section 3.1, CCA is a general framework that operates on a pair of random variables. Adapting CCA specifically to inducing word representations results in a simple recipe for calculating (3).
A natural approach is to set X to represent a word and Y to represent the relevant "context" information about a word. We can use CCA to project X and Y to a low-dimensional space in which they are maximally correlated: see Eq. (2). The projected X can be considered as a new word representation.
Denote the set of distinct word types by [n]. We set $X, Y \in \mathbb{R}^n$ to be one-hot encodings of words and their associated context words. We define a context word to be a word occurring within ρ positions to the left and right (excluding the current word). For example, with ρ = 1, the following snippet of text where the current word is "souls":
    Whatever our souls are made of
will generate two samples of X × Y: a pair of indicator vectors for "souls" and "our", and a pair of indicator vectors for "souls" and "are".
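CCA requires performing SVD on the following matrix $\Omega \in \mathbb{R}^{n \times n}$:
$$\Omega = \left(E[XX^\top] - E[X]E[X]^\top\right)^{-1/2} \left(E[XY^\top] - E[X]E[Y]^\top\right) \left(E[YY^\top] - E[Y]E[Y]^\top\right)^{-1/2}$$
At a quick glance, this expression looks daunting: we need to perform matrix inversion and multiplication on potentially large dense matrices. However, Ω is easily computable with the following observations.
Observation 1. We can ignore the centering operation when the sample size is large (Dhillon et al., 2011). To see why, let $\{(x^{(i)}, y^{(i)})\}_{i=1}^N$ be N samples of X and Y. Consider the sample estimate of the term $E[XY^\top] - E[X]E[Y]^\top$:
$$\frac{1}{N}\sum_{i=1}^N x^{(i)} (y^{(i)})^\top - \frac{1}{N^2}\left(\sum_{i=1}^N x^{(i)}\right)\left(\sum_{i=1}^N y^{(i)}\right)^\top$$
The first term dominates the expression when N is large. This is indeed the setting in this task, where the number of samples (word-context pairs in a corpus) easily reaches billions.
Observation 2. The second-moment matrices $E[XX^\top]$ and $E[YY^\top]$ are diagonal. This follows from our definition of the word and context variables as one-hot encodings, since $E[X_w X_{w'}] = 0$ for $w \ne w'$ and $E[Y_c Y_{c'}] = 0$ for $c \ne c'$.
With these observations and the binary definition of (X, Y), each entry in Ω now has a simple closed-form solution:
$$\Omega_{w,c} = \frac{P(X_w = 1, Y_c = 1)}{\sqrt{P(X_w = 1)\, P(Y_c = 1)}} \qquad (4)$$
which can be readily estimated from a corpus.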
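As a rough illustration of estimating Ω from text, the sketch below collects word-context counts with a window of ρ positions and scales them by the square root of the word and context marginals, which is the count-based estimate of the ratio of joint to marginal probabilities (the normalizing sample size cancels). The helper names `cooccurrence_counts` and `cca_matrix` are hypothetical, and a dense matrix is used only because the toy vocabulary is tiny.

```python
import numpy as np

def cooccurrence_counts(tokens, vocab, rho):
    """counts[w, c] = number of times c occurs within rho positions of w."""
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - rho), min(len(tokens), i + rho + 1)
        for j in range(lo, hi):
            if j != i:                       # exclude the current word itself
                counts[index[w], index[tokens[j]]] += 1
    return counts

def cca_matrix(counts):
    """Omega[w, c] = #(w, c) / sqrt(#(w) * #(c)), with 0 where a marginal is 0."""
    word = counts.sum(axis=1)                # #(w)
    ctx = counts.sum(axis=0)                 # #(c)
    denom = np.sqrt(np.outer(word, ctx))
    return np.divide(counts, denom, out=np.zeros_like(counts), where=denom > 0)

tokens = "whatever our souls are made of".split()
vocab = sorted(set(tokens))
counts = cooccurrence_counts(tokens, vocab, rho=1)
omega = cca_matrix(counts)
print(vocab)
print(np.round(omega, 3))
```

On a real corpus the count matrix would be stored sparsely; the dense version here is only meant to make the scaling explicit.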

Using CCA for parameter estimation
In a less well-known interpretation of Eq. (4), CCA is seen as a parameter estimation algorithm for a language model (Stratos et al., 2014). This model is a restricted class of HMMs introduced by Brown et al. (1992), henceforth called the Brown model. In this section, we extend the result of Stratos et al. (2014) and show that its correctness is preserved under certain element-wise data transformations.

Clustering under a Brown model
A Brown model is a 5-tuple (n, m, π, t, o) for n, m ∈ N and functions π, t, o where
• [n] is a set of word types.
• [m] is a set of hidden states.
• π(h) is the probability of generating h ∈ [m] in the first position of a sequence.
• t(h'|h) is the probability of generating h' ∈ [m] given h ∈ [m].
• o(w|h) is the probability of generating w ∈ [n] given h ∈ [m].
Importantly, the model makes the following additional assumption:
Assumption 4.1 (Brown assumption). For each word type w ∈ [n], there is a unique hidden state H(w) ∈ [m] such that o(w|H(w)) > 0 and o(w|h) = 0 for all h ≠ H(w).
In other words, this model is an HMM in which observation states are partitioned by hidden states. Thus a sequence of N words $w_1 \ldots w_N \in [n]^N$ has probability
$$\pi(H(w_1)) \times \prod_{i=1}^{N} o(w_i \mid H(w_i)) \times \prod_{i=1}^{N-1} t(H(w_{i+1}) \mid H(w_i))$$
An equivalent definition of a Brown model is given by organizing the parameters in matrix form. Under this definition, a Brown model has parameters (π, T, O) where $\pi \in \mathbb{R}^m$ is a vector and $T \in \mathbb{R}^{m \times m}$, $O \in \mathbb{R}^{n \times m}$ are matrices whose entries are set to:
$$\pi_h = \pi(h), \qquad T_{h',h} = t(h' \mid h), \qquad O_{w,h} = o(w \mid h)$$
for h, h' ∈ [m] and w ∈ [n].
Our main interest is in obtaining some representations of word types that allow us to identify their associated hidden states under the model. For this purpose, representing a word by the corresponding row of O is sufficient. To see this, note that each row of O must have a single nonzero entry by Assumption 4.1. Let $v(w) \in \mathbb{R}^m$ be the w-th row of O normalized to have unit 2-norm: then v(w) = v(w') if and only if H(w) = H(w').
A crucial aspect of this representational scheme is that its correctness is invariant to scaling and rotation. In particular, clustering the normalized rows of $\mathrm{diag}(s_1)\, O^{\langle a \rangle}\, \mathrm{diag}(s_2)\, Q^\top$, where $O^{\langle a \rangle}$ is any element-wise power of O with any a ≠ 0, $Q \in \mathbb{R}^{m \times m}$ is any orthogonal transformation, and $s_1 \in \mathbb{R}^n$ and $s_2 \in \mathbb{R}^m$ are any positive vectors, yields the correct clusters under the model. See Figure 1(b) for illustration.
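The invariance to scaling and rotation can be checked on a toy emission matrix. The sketch below is an illustration only: the cluster assignment `H`, the emission probabilities, and the random scalings are made-up values, and a positive power a = 1/2 is used so that the zero entries of O stay zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 3
H = np.array([0, 0, 1, 1, 2, 2])                       # hypothetical hidden state of each word
O = np.zeros((n, m))
O[np.arange(n), H] = [0.7, 0.3, 0.4, 0.6, 0.9, 0.1]    # one nonzero per row, columns sum to 1

def normalize_rows(M):
    return M / np.linalg.norm(M, axis=1, keepdims=True)

a = 0.5
s1 = rng.uniform(0.5, 2.0, size=n)                     # arbitrary positive row scaling
s2 = rng.uniform(0.5, 2.0, size=m)                     # arbitrary positive column scaling
Q, _ = np.linalg.qr(rng.normal(size=(m, m)))           # random orthogonal matrix

M = np.diag(s1) @ O ** a @ np.diag(s2) @ Q.T           # scaled, powered, rotated emissions
V = normalize_rows(M)
G = V @ V.T                                            # cosine similarities between words
same = H[:, None] == H[None, :]
print(np.round(G, 6))
```

Words that share a hidden state end up with cosine similarity 1 and words from different states with cosine similarity 0, so any reasonable clustering of the rows of `V` recovers `H`.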

Spectral estimation
Thus we would like to estimate O and use its rows for representing word types. But the likelihood function under the Brown model is non-convex, making maximum likelihood estimation (MLE) of the model parameters difficult. However, the hard-clustering assumption (Assumption 4.1) allows for a simple spectral method for estimating O.
Theorem 4.1. Assume that a Brown model (π, T, O) generates a sequence of words. Let $X, Y \in \mathbb{R}^n$ be one-hot encodings of words and their associated context words. Let $U \in \mathbb{R}^{n \times m}$ be the matrix of m left singular vectors of $\Omega^{\langle a \rangle} \in \mathbb{R}^{n \times n}$ corresponding to nonzero singular values, where Ω is defined in Eq. (4) and a ≠ 0:
$$\Omega^{\langle a \rangle}_{w,c} = \frac{P(X_w = 1, Y_c = 1)^a}{\sqrt{P(X_w = 1)^a\, P(Y_c = 1)^a}}$$
Then there exists an orthogonal matrix $Q \in \mathbb{R}^{m \times m}$ and a positive $s \in \mathbb{R}^m$ such that $U = O^{\langle a/2 \rangle}\, \mathrm{diag}(s)\, Q^\top$.
This theorem states that the CCA projection of words in Section 3.3 is the rows of O up to scaling and rotation even if we raise each element of Ω in Eq. (4) to an arbitrary (nonzero) power. The proof is a variant of the proof in Stratos et al. (2014) and is given in Appendix A.
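The statement can be sanity-checked numerically on a small hand-built model. The sketch below constructs the population word-context matrix directly from the factorization used in the appendix (joint probabilities equal to O times a diagonal of state probabilities times a context emission matrix), rather than from sampled text. The state distribution `pi`, the matrix `T` (standing in for the state-to-context-state transition probabilities), and the emission values are arbitrary assumptions chosen to be well conditioned.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, a = 8, 3, 0.5
H = np.array([0, 0, 0, 1, 1, 2, 2, 2])            # hypothetical hidden state of each word
O = np.zeros((n, m))
O[np.arange(n), H] = rng.uniform(0.2, 1.0, size=n)
O /= O.sum(axis=0, keepdims=True)                 # each column is a distribution over words

pi = np.array([0.5, 0.3, 0.2])                    # assumed distribution of the word's state
T = np.array([[0.8, 0.1, 0.1],                    # assumed P(context state | word state),
              [0.1, 0.8, 0.1],                    # columns sum to 1
              [0.1, 0.1, 0.8]])

O_ctx = O @ T                                     # P(context word | word state)
B = O @ np.diag(pi) @ O_ctx.T                     # joint word-context probabilities
u, v = B.sum(axis=1), B.sum(axis=0)               # word and context marginals
Omega_a = (B / np.sqrt(np.outer(u, v))) ** a      # entrywise power of the CCA matrix

U_full, sigma, _ = np.linalg.svd(Omega_a)
U = U_full[:, :m]                                 # left singular vectors, nonzero sigma
V = U / np.linalg.norm(U, axis=1, keepdims=True)
G = V @ V.T
same = H[:, None] == H[None, :]

# Row norms predicted by U = O^<a/2> diag(s) Q^T with s_h = (sum_w o(w|h)^a)^(-1/2).
s = (O ** a).sum(axis=0) ** -0.5
expected_norms = O[np.arange(n), H] ** (a / 2) * s[H]
print(np.round(sigma, 6))
```

Only m singular values are nonzero, and the normalized rows of `U` coincide exactly for words that share a hidden state, as the theorem predicts for this toy setting.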

Choice of data transformation
Given a corpus, the sample estimate of $\Omega^{\langle a \rangle}$ is given by:
$$\hat{\Omega}^{\langle a \rangle}_{w,c} = \frac{\#(w, c)^a}{\sqrt{\#(w)^a\, \#(c)^a}}$$
where #(w, c) denotes the co-occurrence count of word w and context c in the corpus, $\#(w) := \sum_c \#(w, c)$, and $\#(c) := \sum_w \#(w, c)$. What choice of a is beneficial and why? We use a = 1/2 for the following reason: it stabilizes the variance of the term and thereby gives a more statistically stable solution.

Variance stabilization for word counts
The square-root transformation is a variance-stabilizing transformation for Poisson random variables (Bartlett, 1936; Anscombe, 1948). In particular, the square root of a Poisson variable has variance close to 1/4, independent of its mean.
Lemma 4.1 (Bartlett (1936)). Let X be a random variable with distribution Poisson(n × p) for any p ∈ (0, 1) and positive integer n. Define $Y := \sqrt{X}$. Then the variance of Y approaches 1/4 as n → ∞.
This transformation is relevant for word counts because they can be naturally modeled as Poisson variables. Indeed, if word counts in a corpus of length N are drawn from a multinomial distribution over [n] with N observations, then these counts have the same distribution as n independent Poisson variables (whose rate parameters are related to the multinomial probabilities), conditioned on their sum equaling N (Steel, 1953). Empirically, the peaky concentration of a Poisson distribution is well-suited for modeling word occurrences.
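The variance-stabilizing effect is easy to observe by simulation. The following sketch draws Poisson samples at a few arbitrary means (the means and sample size are illustrative choices) and compares the variance of the raw counts with the variance of their square roots.

```python
import numpy as np

rng = np.random.default_rng(0)
results = {}
for mean in [1, 10, 100, 1000]:
    x = rng.poisson(mean, size=200_000)
    # Variance of the raw counts grows with the mean; the square root flattens it.
    results[mean] = (x.var(), np.sqrt(x).var())
    print(f"mean={mean:5d}  Var[X]={results[mean][0]:9.3f}  Var[sqrt(X)]={results[mean][1]:.4f}")
```

The raw variance tracks the mean, while the variance of the square root settles near 1/4 once the mean is moderately large; for very small means the approximation is looser.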

Variance-weighted squared-error minimization
At the heart of CCA is computing the SVD of the $\Omega^{\langle a \rangle}$ matrix: this can be interpreted as solving the following (non-convex) squared-error minimization problem:
$$\min_{u_w, v_c \in \mathbb{R}^m} \sum_{w,c} \left(\Omega^{\langle a \rangle}_{w,c} - u_w^\top v_c\right)^2 \qquad (5)$$
But we note that minimizing unweighted squared-error objectives is generally suboptimal when the target values are heteroscedastic. For instance, in linear regression, it is well-known that a weighted least squares estimator dominates ordinary least squares in terms of statistical efficiency (Aitken, 1936; Lehmann and Casella, 1998). For our setting, the analogous weighted least squares optimization is:
$$\min_{u_w, v_c \in \mathbb{R}^m} \sum_{w,c} \frac{1}{\mathrm{Var}\left(\Omega^{\langle a \rangle}_{w,c}\right)} \left(\Omega^{\langle a \rangle}_{w,c} - u_w^\top v_c\right)^2 \qquad (6)$$
where $\mathrm{Var}(X) := E[X^2] - E[X]^2$. The variance-stabilizing property of the square-root transformation (Lemma 4.1) suggests that with a = 1/2 the target values have approximately equal variance, in which case the unweighted objective (5) approximates the weighted objective (6).

A template for spectral methods
Figure 2 gives a generic template that encompasses a range of spectral methods for deriving word embeddings. All of them operate on co-occurrence counts #(w, c) and share the low-rank SVD step, but they can differ in the data transformation method (t) and the definition of the matrix of scaled counts for SVD (s). We introduce two additional parameters α, β ≤ 1 to account for the following details. Mikolov et al. (2013b) proposed smoothing the empirical context distribution as $\hat{p}_\alpha(c) := \#(c)^\alpha / \sum_c \#(c)^\alpha$ and found α = 0.75 to work well in practice. We also found that setting α = 0.75 gave a small but consistent improvement over setting α = 1. Note that the choice of α only affects methods that make use of the context distribution (s ∈ {ppmi, cca}).
The parameter β controls the role of singular values in word embeddings. This is always 0 for CCA as it does not require singular values. But for other methods, one can consider setting β > 0 since the best-fit subspace for the rows of Ω is given by U Σ. For example, Deerwester et al. (1990) use β = 1 and Levy and Goldberg (2014b) use β = 0.5. However, it has been found by many (including ourselves) that setting β = 1 yields substantially worse representations than setting β ∈ {0, 0.5} (Levy et al., 2015).

[Figure 2, step 4] Define $v(w) \in \mathbb{R}^m$ to be the w-th row of $U \Sigma^\beta$ normalized to have unit 2-norm.
No scaling [t ∈ {-, log, sqrt}, s = -]. This is a commonly considered setting (e.g., in Pennington et al. (2014)) where no scaling is applied to the co-occurrence counts. It is however typically accompanied by some kind of data transformation.
Positive point-wise mutual information (PPMI) [t = -, s = ppmi]. Mutual information is a popular metric in many natural language tasks (Brown et al., 1992; Pantel and Lin, 2002). In this setting, each term in the matrix for SVD is set as the pointwise mutual information between word w and context c:
$$\log \frac{\hat{p}(w, c)}{\hat{p}(w)\, \hat{p}_\alpha(c)}$$
where $\hat{p}(w, c)$ and $\hat{p}(w)$ are empirical estimates from the counts and $\hat{p}_\alpha(c)$ is the smoothed context distribution. Typically negative values are thresholded to 0 to keep Ω sparse. The method of Levy and Goldberg (2014b) falls under this setting.
Regression [t ∈ {-, sqrt}, s = reg]. Another novelty of our work is considering a low-rank approximation of a linear regressor that predicts the context from words. Denoting the word sample matrix by $X \in \mathbb{R}^{N \times n}$ and the context sample matrix by $Y \in \mathbb{R}^{N \times n}$, we seek $U^* = \arg\min_{U \in \mathbb{R}^{n \times n}} \|Y - XU\|^2$ whose closed-form solution is given by:
$$U^* = (X^\top X)^{-1} X^\top Y$$
Thus we aim to compute a low-rank approximation of $U^*$ with SVD. This is inspired by other predictive models in the representation learning literature (Ando and Zhang, 2005; Mikolov et al., 2013a). We consider applying the square-root transformation for the same variance stabilizing effect discussed in Section 4.3.
CCA [t ∈ {-, two-thirds, sqrt}, s = cca]. This is the focus of our work. As shown in Theorem 4.1, we can take the element-wise power transformation on counts (such as the power of 1, 2/3, 1/2 in this template) while preserving the representational meaning of word embeddings under the Brown model interpretation. If there is no data transformation (t = -), then we recover the original spectral algorithm of Stratos et al. (2014).
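The four settings above can be put together in a compact sketch. This is an illustrative reading of the template, not a reproduction of Figure 2: the function name `spectral_embeddings`, the order in which smoothing and transformation are applied, and the omission of constant normalizers are assumptions, and a dense SVD is used only because the toy matrix is tiny. It also assumes every word and context has a nonzero count.

```python
import numpy as np

TRANSFORMS = {
    "-": lambda x: x,
    "log": np.log1p,                       # assumed form of the log transform
    "two-thirds": lambda x: x ** (2.0 / 3.0),
    "sqrt": np.sqrt,
}

def spectral_embeddings(counts, m, t="sqrt", s="cca", alpha=0.75, beta=0.0):
    raw = np.asarray(counts, dtype=float)
    word_raw = raw.sum(axis=1)                     # #(w)
    ctx_raw = raw.sum(axis=0) ** alpha             # smoothed context counts #(c)^alpha

    if s == "ppmi":
        # PMI on untransformed counts with a smoothed context distribution,
        # negative (and undefined) values thresholded to 0.
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log(raw * ctx_raw.sum() / np.outer(word_raw, ctx_raw))
        omega = np.where(raw > 0, np.maximum(pmi, 0.0), 0.0)
    else:
        f = TRANSFORMS[t]
        C, word, ctx = f(raw), f(word_raw), f(ctx_raw)
        if s == "-":
            omega = C                              # no scaling
        elif s == "reg":
            omega = C / word[:, None]              # one-hot least squares: #(w,c) / #(w)
        elif s == "cca":
            omega = C / np.sqrt(np.outer(word, ctx))
        else:
            raise ValueError(f"unknown scaling: {s}")

    U, sigma, _ = np.linalg.svd(omega, full_matrices=False)
    E = U[:, :m] * sigma[:m] ** beta               # rows of U Sigma^beta
    return E / np.linalg.norm(E, axis=1, keepdims=True)

# Toy counts with two hard clusters: words 0,1 and words 2,3 (made-up numbers).
H = np.array([0, 0, 1, 1])
f_w = np.array([3.0, 1.0, 2.0, 5.0])
g_c = np.array([1.0, 2.0, 3.0, 4.0])
M = np.array([[5.0, 1.0], [1.0, 5.0]])
counts = f_w[:, None] * g_c[None, :] * M[np.ix_(H, H)]

emb = spectral_embeddings(counts, m=2, t="sqrt", s="cca")
print(np.round(emb, 4))
```

With the CCA scaling and square-root transformation, the two words in each toy cluster receive identical unit vectors. For a realistic vocabulary, a sparse truncated SVD would replace the dense call.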

Related work
We make a few remarks on related works not already discussed earlier. Dhillon et al. (2011) and (2012) propose novel modifications of CCA (LR-MVL and two-step CCA) to derive word embeddings, but do not establish any explicit connection to learning HMM parameters or justify the square-root transformation. Pennington et al. (2014) propose a weighted factorization of log-transformed co-occurrence counts, which is generally an intractable problem (Srebro et al., 2003). In contrast, our method requires only efficiently computable matrix decompositions. Finally, word embeddings have also been used as features to improve performance in a variety of supervised tasks such as sequence labeling (Dhillon et al., 2011; Collobert et al., 2011) and dependency parsing (Lei et al., 2014; Chen and Manning, 2014). Here, we focus on understanding word embeddings in the context of a generative word class model, as well as in empirical tasks that directly evaluate the word embeddings themselves.

Word similarity and analogy
We first consider word similarity and analogy tasks for evaluating the quality of word embeddings. Word similarity measures Spearman's correlation coefficient between the human scores and the embeddings' cosine similarities for word pairs. Word analogy measures the accuracy on syntactic and semantic analogy questions. We refer to Levy and Goldberg (2014a) for a detailed description of these tasks. We use the multiplicative technique of Levy and Goldberg (2014a) for answering analogy questions. For the choice of corpus, we use a preprocessed English Wikipedia dump (http://dumps.wikimedia.org/). The corpus contains around 1.4 billion words. We only preserve word types that appear more than 100 times and replace all others with a special symbol, resulting in a vocabulary of size around 188k. We define context words to be 5 words to the left/right for all considered methods.
We use three word similarity datasets containing 353, 3000, and 2034 word pairs, respectively. We report the average similarity score across these datasets under the label AVG-SIM. We use two word analogy datasets that we call SYN (8000 syntactic analogy questions) and MIXED (19544 syntactic and semantic analogy questions). We implemented the template in Figure 2 in C++. We compared against the public implementation of WORD2VEC by Mikolov et al. (2013b) and GLOVE by Pennington et al. (2014). These external implementations have numerous hyperparameters that are not part of the core algorithm, such as random subsampling in WORD2VEC and the word-context averaging in GLOVE. We refer to Levy et al. (2015) for a discussion of the effect of these features. In our experiments, we enable all these features with the recommended default settings.
We reserve half of each dataset (by category) as a held-out portion for development and use the other half for final evaluation.
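The two evaluation measures can be sketched as follows. The similarity score is the Spearman correlation between cosine similarities and human ratings; the analogy scorer follows our understanding of the multiplicative objective of Levy and Goldberg (2014a), with cosines shifted to [0, 1] and a small ε in the denominator. The toy vectors, word list, ratings, and function names are made up for illustration, and the rank computation assumes no ties.

```python
import numpy as np

def unit_rows(E):
    return E / np.linalg.norm(E, axis=1, keepdims=True)

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of ranks (assumes no ties)."""
    rank = lambda v: np.argsort(np.argsort(v))
    return float(np.corrcoef(rank(x), rank(y))[0, 1])

def similarity_score(E, pairs, human_scores):
    """Correlate embedding cosine similarities with human similarity ratings."""
    E = unit_rows(E)
    cosines = [float(E[i] @ E[j]) for i, j in pairs]
    return spearman(cosines, human_scores)

def analogy_mul(E, a, a_star, b, eps=1e-3):
    """Answer 'a is to a_star as b is to ?' with a multiplicative score."""
    E = unit_rows(E)
    S = (E @ E.T + 1.0) / 2.0                      # cosine similarities mapped to [0, 1]
    score = S[:, a_star] * S[:, b] / (S[:, a] + eps)
    score[[a, a_star, b]] = -np.inf                # question words cannot be the answer
    return int(np.argmax(score))

# Toy embeddings (assumption): axes loosely read as (male, female, royal).
words = ["man", "woman", "king", "queen", "apple"]
E = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.1, 0.1, -1.0]])

answer = words[analogy_mul(E, a=0, a_star=1, b=2)]
rho = similarity_score(E, pairs=[(0, 2), (0, 1), (2, 3)], human_scores=[7.0, 3.0, 8.0])
print(answer, rho)
```

Analogy accuracy is then the fraction of questions for which the returned word matches the gold answer.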

Effect of data transformation for CCA
We first look at the effect of different data transformations on the performance of CCA. Table 1 shows the result on the development portion with 1000-dimensional embeddings. We see that without any transformation, the performance can be quite bad, especially in word analogy. But there is a marked improvement upon transforming the data. Moreover, the square-root transformation gives the best result, improving the accuracy on the two analogy datasets by 25.46% and 20.06% in absolute magnitude. This aligns with the discussion in Section 4.3.

Comparison among different spectral embeddings
Next, we look at the performance of various combinations in the template in Figure 2. We smooth the context distribution with α = 0.75 for PPMI and CCA. We use β = 0.5 for PPMI (which has a minor improvement over β = 0) and β = 0 for all other methods. We generally find that using β = 0 is critical to obtaining good performance for s ∈ {-, reg}. Table 2 shows the result on the development portion for both 500 and 1000 dimensions. Even without any scaling, SVD performs reasonably well with the square-root and log transformations. The regression scaling performs very poorly without data transformation, but once the square-root transformation is applied it performs quite well (especially in analogy questions). The PPMI scaling achieves good performance in word similarity but not in word analogy. The CCA scaling, combined with the square-root transformation, gives the best overall performance. In particular, it performs better than all other methods in mixed analogy questions by a significant margin.

Comparison with other embedding methods
We compare spectral embedding methods against WORD2VEC and GLOVE on the test portion. We use the following combinations based on their performance on the development portion:
• LOG: log transform, - scaling
• REG: sqrt transform, reg scaling
• PPMI: - transform, ppmi scaling
• CCA: sqrt transform, cca scaling
Table 3 shows the result for both 500 and 1000 dimensions, where SKIP and CBOW denote the skip-gram and continuous bag-of-words models of WORD2VEC. In word similarity, spectral methods generally excel, with CCA consistently performing the best. SKIP is the only external package that performs comparably, with GLOVE and CBOW falling behind. In word analogy, REG and CCA are significantly better than other spectral methods. They are also competitive with GLOVE and CBOW, but SKIP does perform the best among all compared methods on (especially syntactic) analogy questions.
Table 3: Performance of different word embedding methods on the test portion of data. See the main text for the configuration details of spectral methods.

As features in a supervised task
Finally, we use word embeddings as features in NER and compare the subsequent improvements between various embedding methods. The experimental setting is identical to that of Stratos et al. (2014). We use the Reuters RCV1 corpus which contains 205 million words. With frequency thresholding, we end up with a vocabulary of size around 301k. We derive LOG, REG, PPMI, and CCA embeddings as described in Section 7.1.3, and GLOVE, CBOW, and SKIP embeddings again with the recommended default settings. The number of left/right contexts is 2 for all methods. For comparison, we also derived 1000 Brown clusters (BROWN) on the same vocabulary and used the resulting bit strings as features (Brown et al., 1992). Table 4 shows the result for both 30 and 50 dimensions. In general, using any of these lexical features provides substantial improvements over the baseline (see footnote 6). In particular, the 30-dimensional CCA embeddings improve the F1 score by 2.84 on the development portion and by 4.88 on the test portion. All spectral methods perform competitively with external packages, with CCA and SKIP consistently delivering the biggest improvements on the development portion.

Conclusion
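In this work, we revisited SVD-based methods for inducing word embeddings. We examined a framework provided by CCA and showed that the resulting word embeddings can be viewed as cluster-revealing parameters of a certain model and that this result is robust to data transformation. Our proposed method gives the best result among spectral methods and is competitive with other popular word embedding techniques. This work suggests many directions for future work. Past spectral methods that involved CCA without data transformation (e.g., Cohen et al. (2013)) may be revisited with the square-root transformation. Using CCA to induce representations other than word embeddings is another important direction. It would also be interesting to formally investigate the theoretical merits and algorithmic possibility of solving the variance-weighted objective in Eq. (6). Even though the objective is hard to optimize in the worst case, it may be tractable under natural conditions.
Footnote 6: We mention that the well-known dev/test discrepancy in the CoNLL 2003 dataset makes the results on the test portion less reliable.

A Proof of Theorem 4.1
We first define some random variables. Let ρ be the number of left/right context words to consider in CCA. Let $(W_1, \ldots, W_N) \in [n]^N$ be a random sequence of words drawn from the Brown model where N ≥ 2ρ + 1, along with the corresponding sequence of hidden states $(H_1, \ldots, H_N) \in [m]^N$. Independently, pick a position I ∈ [ρ + 1, N − ρ] uniformly at random; pick an integer J ∈ [−ρ, ρ]\{0} uniformly at random. Define $B \in \mathbb{R}^{n \times n}$, $u, v \in \mathbb{R}^n$, $\tilde{\pi} \in \mathbb{R}^m$, and $\tilde{T} \in \mathbb{R}^{m \times m}$ as follows:
$$B_{w,c} := P(W_I = w, W_{I+J} = c), \quad u_w := P(W_I = w), \quad v_c := P(W_{I+J} = c), \quad \tilde{\pi}_h := P(H_I = h), \quad \tilde{T}_{h',h} := P(H_{I+J} = h' \mid H_I = h)$$
Lemma A.1. $\Omega^{\langle a \rangle} = A\Theta^\top$ where $\Theta \in \mathbb{R}^{n \times m}$ has rank m and $A \in \mathbb{R}^{n \times m}$ is defined as:
$$A := \mathrm{diag}(O\tilde{\pi})^{-a/2}\, O^{\langle a \rangle}\, \mathrm{diag}(\tilde{\pi})^{a/2}\, \mathrm{diag}(s)$$
for the positive vector $s \in \mathbb{R}^m$ with entries $s_h := \left(\sum_w o(w \mid h)^a\right)^{-1/2}$.
Proof. Let $\tilde{O} := O\tilde{T}$. It can be algebraically verified that $B = O\, \mathrm{diag}(\tilde{\pi})\, \tilde{O}^\top$, $u = O\tilde{\pi}$, and $v = \tilde{O}\tilde{\pi}$. By Assumption 4.1, each entry of B involves a single hidden state, so $B^{\langle a \rangle} = O^{\langle a \rangle}\, \mathrm{diag}(\tilde{\pi})^a\, (\tilde{O}^{\langle a \rangle})^\top$. Therefore,
$$\Omega^{\langle a \rangle} = \left(\mathrm{diag}(u)^{-1/2}\, B\, \mathrm{diag}(v)^{-1/2}\right)^{\langle a \rangle} = \mathrm{diag}(u)^{-a/2}\, B^{\langle a \rangle}\, \mathrm{diag}(v)^{-a/2}$$
$$= \left[\mathrm{diag}(O\tilde{\pi})^{-a/2}\, O^{\langle a \rangle}\, \mathrm{diag}(\tilde{\pi})^{a/2}\, \mathrm{diag}(s)\right] \left[\mathrm{diag}(s)^{-1}\, \mathrm{diag}(\tilde{\pi})^{a/2}\, (\tilde{O}^{\langle a \rangle})^\top\, \mathrm{diag}(\tilde{O}\tilde{\pi})^{-a/2}\right]$$
This gives the desired result.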
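Next, we show that the left component of $\Omega^{\langle a \rangle}$ is in fact the emission matrix O up to (nonzero) scaling and is furthermore orthonormal.
Lemma A.2. The matrix A in Lemma A.1 has the expression $A = O^{\langle a/2 \rangle}\, \mathrm{diag}(s)$ and has orthonormal columns.
Proof. By Assumption 4.1, each entry of A is simplified as follows:
$$A_{w,h} = \frac{o(w \mid h)^a \times \tilde{\pi}_h^{a/2} \times s_h}{o(w \mid H(w))^{a/2} \times \tilde{\pi}_{H(w)}^{a/2}} = o(w \mid h)^{a/2} \times s_h$$
This proves the first part of the lemma. Note that:
$$[A^\top A]_{h,h'} = \begin{cases} s_h^2 \sum_w o(w \mid h)^a & \text{if } h = h' \\ 0 & \text{otherwise} \end{cases}$$
Thus our choice of s, namely $s_h = \left(\sum_w o(w \mid h)^a\right)^{-1/2}$, gives $A^\top A = I_{m \times m}$.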
Proof of Theorem 4.1. With Lemma A.1 and A.2, the proof is similar to the proof of Theorem 5.1 in Stratos et al. (2014).
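The algebra in Lemma A.2 can be checked numerically on a toy model. In the sketch below the emission matrix, the state distribution `pi_tilde`, and the exponent are arbitrary illustrative values; the scaling vector is taken to be the inverse square root of the column sums of the entrywise power of O, which is the choice that makes the columns orthonormal.

```python
import numpy as np

n, m, a = 6, 3, 0.5
H = np.array([0, 0, 1, 1, 1, 2])                          # hypothetical hidden state of each word
O = np.zeros((n, m))
O[np.arange(n), H] = [0.6, 0.4, 0.5, 0.3, 0.2, 1.0]       # columns sum to 1
pi_tilde = np.array([0.5, 0.3, 0.2])                      # assumed state distribution

s = (O ** a).sum(axis=0) ** -0.5                          # s_h = (sum_w o(w|h)^a)^(-1/2)

# A as defined in Lemma A.1 ...
A_def = (np.diag((O @ pi_tilde) ** (-a / 2)) @ O ** a
         @ np.diag(pi_tilde ** (a / 2)) @ np.diag(s))
# ... and its simplified form from Lemma A.2.
A_simple = O ** (a / 2) @ np.diag(s)

print(np.round(A_simple.T @ A_simple, 6))                 # should be the identity
```

Both expressions agree entry by entry, and the Gram matrix of the columns is the identity, matching the two claims of the lemma for this example.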