Multiview LSA: Representation Learning via Generalized CCA

Multiview LSA (MVLSA) is a generalization of Latent Semantic Analysis (LSA) that supports the fusion of arbitrary views of data and relies on Generalized Canonical Correlation Analysis (GCCA). We present an algorithm for fast approximate computation of GCCA, which, when coupled with methods for handling missing values, is general enough to approximate some recent algorithms for inducing vector representations of words. Experiments across a comprehensive collection of test sets show our approach to be competitive with the state of the art.

Introduction
Winograd (1972) wrote that: "Two sentences are paraphrases if they produce the same representation in the internal formalism for meaning". This intuition is made soft in vector-space models (Turney and Pantel, 2010), where we say that expressions in language are paraphrases if their representations are close under some distance measure.
One of the earliest linguistic vector space models was Latent Semantic Analysis (LSA). LSA has been successfully used for Information Retrieval but it is limited in its reliance on a single matrix, or view, of term cooccurrences. Here we address the single-view limitation of LSA by demonstrating that the framework of Generalized Canonical Correlation Analysis (GCCA) can be used to perform Multiview LSA (MVLSA). This approach allows for the use of an arbitrary number of views in the induction process, including embeddings induced using other algorithms. We also present a fast approximate method for performing GCCA and approximately recover the objective of Pennington et al. (2014) while accounting for missing values.
Our experiments show that MVLSA is competitive with state-of-the-art approaches for inducing vector representations of words and phrases. As a methodological aside, we discuss the (in-)significance of conclusions drawn from comparisons on small datasets.

Motivation
LSA is an application of Principal Component Analysis (PCA) to a term-document cooccurrence matrix. The principal directions found by PCA form the basis of the vector-space in which to represent the input terms (Landauer and Dumais, 1997). A drawback of PCA is that it can leverage only a single source of data and it is sensitive to scaling.
An arguably better approach to representation learning is Canonical Correlation Analysis (CCA), which induces representations that are maximally correlated across two views, allowing the utilization of two distinct sources of data. While an improvement over PCA, being limited to only two views is unfortunate in light of the fact that many sources of data (perspectives) are frequently available in practice. In such cases it is natural to extend CCA's original objective of maximizing correlation between two views by maximizing some measure of the matrix Φ that contains all the pairwise correlations between linear projections of the covariates. This is how Generalized Canonical Correlation Analysis (GCCA) was first derived by Horst (1961). Recently these intuitive ideas about the benefits of leveraging multiple sources of data have received strong theoretical backing from the work of Sridharan and Kakade (2008), who showed that learning with multiple views is beneficial since it reduces the complexity of the learning problem by restricting the search space. Recent work by Anandkumar et al. (2014) showed that at least three views are necessary for recovering hidden variable models.
Note that there exist different variants of GCCA depending on the measure of Φ that we choose to maximize. Kettenring (1971) enumerated a variety of possible measures, such as the spectral norm of Φ. Kettenring noted that maximizing this spectral norm is equivalent to finding linear projections of the covariates that are most amenable to rank-one PCA, or that can be best explained by a single term factor model. This variant was named MAX-VAR GCCA and was shown to be equivalent to a proposal by Carroll (1968), which searched for an auxiliary orthogonal representation G that was maximally correlated to the linear projections of the covariates. Carroll's objective targets the intuition that representations leveraging multiple views should correlate with all provided views as much as possible.

Proposed Method: MVLSA
Let $X_j \in \mathbb{R}^{N \times d_j}$, $\forall j \in [1, \ldots, J]$, be the mean-centered matrix containing data from view $j$ such that row $i$ of $X_j$ contains the information for word $w_i$. Let the number of words in the vocabulary be $N$ and the number of contexts (columns in $X_j$) be $d_j$. Following standard notation (Hastie et al., 2009) we call $X_j^\top X_j$ the scatter matrix and $X_j (X_j^\top X_j)^{-1} X_j^\top$ the projection matrix.
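The objective of MAX-VAR GCCA can be written as the following optimization problem: find $G \in \mathbb{R}^{N \times r}$ and $U_j \in \mathbb{R}^{d_j \times r}$ that solve:
$$\arg\min_{G, U_j} \sum_{j=1}^{J} \left\| G - X_j U_j \right\|_F^2 \quad \text{subject to } G^\top G = I. \tag{1}$$
The matrix $G$ that solves problem (1) is our vector representation of the vocabulary. Finding $G$ reduces to the spectral decomposition of the sum of the projection matrices of the different views. Define
$$P_j = X_j (X_j^\top X_j)^{-1} X_j^\top, \tag{2}$$
$$M = \sum_{j=1}^{J} P_j. \tag{3}$$
Then, for some positive diagonal matrix $\Lambda$, $G$ and $U_j$ satisfy:
$$M G = G \Lambda, \tag{4}$$
$$U_j = (X_j^\top X_j)^{-1} X_j^\top G. \tag{5}$$
Computationally, storing $P_j \in \mathbb{R}^{N \times N}$ is problematic owing to memory constraints. Further, the scatter matrices may be singular, leading to an ill-posed procedure. We now describe a novel scalable GCCA with $\ell_2$-regularization to address these issues.
Approximate Regularized GCCA: GCCA can be regularized by adding $r_j I$ to the scatter matrix $X_j^\top X_j$ before doing the inversion, where $r_j$ is a small constant, e.g. $10^{-8}$. Projection matrices in (2) and (3) can then be written as
$$\tilde{P}_j = X_j (X_j^\top X_j + r_j I)^{-1} X_j^\top, \tag{6}$$
$$M = \sum_{j=1}^{J} \tilde{P}_j. \tag{7}$$
Next, to scale up GCCA to large datasets, we first form a rank-$m$ approximation of the projection matrices (Arora and Livescu, 2012) and then extend it to an eigendecomposition for $M$ following ideas by Savostyanov (2014). Consider the rank-$m$ SVD of $X_j$:
$$X_j \approx A_j S_j B_j,$$
where $S_j \in \mathbb{R}^{m \times m}$ is the diagonal matrix with the $m$ largest singular values of $X_j$, and $A_j \in \mathbb{R}^{N \times m}$ and $B_j \in \mathbb{R}^{m \times d_j}$ are the corresponding left and right singular vectors. Given this SVD, write the $j$th projection matrix as
$$\tilde{P}_j \approx A_j S_j^\top (r_j I + S_j S_j^\top)^{-1} S_j A_j^\top = A_j T_j T_j^\top A_j^\top,$$
where $T_j \in \mathbb{R}^{m \times m}$ is a diagonal matrix such that $T_j T_j^\top = S_j^\top (r_j I + S_j S_j^\top)^{-1} S_j$. Finally, we note that the sum of projection matrices can be expressed as $M = \tilde{M} \tilde{M}^\top$ where
$$\tilde{M} = \left[ A_1 T_1 \; \cdots \; A_J T_J \right] \in \mathbb{R}^{N \times mJ}.$$
Therefore, the eigenvectors of the matrix $M$, i.e. the matrix $G$ that we are interested in finding, are the left singular vectors of $\tilde{M}$, i.e. $\tilde{M} = G S V^\top$. These left singular vectors can be computed using Incremental PCA (Brand, 2002) since $\tilde{M}$ may be too large to fit in memory.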
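The approximate regularized procedure described above can be illustrated with a small dense NumPy sketch. It is only an illustration under stated assumptions: the function name `approx_regularized_gcca` and its arguments are hypothetical, every view is small enough for an exact SVD that is then truncated to rank $m$, the same ridge constant is used for all views, and a full SVD of $\tilde{M}$ stands in for the Incremental PCA step.

```python
import numpy as np

def approx_regularized_gcca(views, m, k, reg=1e-8):
    """Sketch of approximate regularized MAX-VAR GCCA (hypothetical helper).

    views: list of dense (N x d_j) arrays, one per view
    m:     rank of the per-view SVD
    k:     number of columns of G to keep
    reg:   ridge constant r_j (assumed equal for all views here)
    """
    blocks = []
    for X in views:
        X = X - X.mean(axis=0, keepdims=True)        # mean-center the view
        A, s, _ = np.linalg.svd(X, full_matrices=False)
        A, s = A[:, :m], s[:m]                        # rank-m left factors
        t = s / np.sqrt(reg + s ** 2)                 # diagonal of T_j
        blocks.append(A * t)                          # A_j T_j
    M_tilde = np.hstack(blocks)                       # N x (m * J)
    # Left singular vectors of M_tilde are eigenvectors of M = M_tilde M_tilde^T.
    G, sing, _ = np.linalg.svd(M_tilde, full_matrices=False)
    return G[:, :k], sing[:k]

# Toy usage: three views that are linear images of the same 2-d latent signal.
rng = np.random.default_rng(0)
latent = rng.standard_normal((50, 2))
views = [latent @ rng.standard_normal((2, d)) for d in (4, 5, 6)]
G, sing = approx_regularized_gcca(views, m=2, k=2)
```

Because all three toy views span the same two-dimensional subspace, the squared singular values returned here should be close to the number of views, and `G` has orthonormal columns.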

Computing SVD of mean centered $X_j$
Recall that we assumed $X_j$ to be mean-centered matrices. Let $Z_j \in \mathbb{R}^{N \times d_j}$ be sparse matrices containing mean-uncentered cooccurrence counts. Let $f_j = n_j \circ t_j$ be the preprocessing function that we apply to $Z_j$:
$$Y_j = f_j(Z_j).$$
In order to compute the SVD of the mean-centered matrices $X_j$ we first compute the partial SVD of the uncentered matrix $Y_j$ and then update it (Brand (2006) provides details). We experimented with representations created from the uncentered matrices $Y_j$ and found that they performed as well as the mean-centered versions, but we do not mention them further since it is computationally efficient to follow the principled approach. We note, however, that even the method of mean-centering the SVD produces an approximation.

Handling missing rows across views
With real data it may happen that a term was not observed in a view at all. A large number of missing rows can corrupt the learnt representations since the corresponding rows in the left singular matrix become zero. To counter this problem we adopt a variant of the "missing-data passive" algorithm from Van De Velden and Bijmolt (2006), who modified the GCCA objective to counter the problem of missing rows.$^{1}$ The objective now becomes:
$$\arg\min_{G, U_j} \sum_{j=1}^{J} \left\| K_j (G - X_j U_j) \right\|_F^2 \quad \text{subject to } G^\top G = I,$$
where $[K_j]_{ii} = 1$ if row $i$ of view $j$ is observed and zero otherwise. Essentially $K_j$ is a diagonal row-selection matrix which ensures that we optimize our representations only on the observed rows. Note that $X_j = K_j X_j$ since the rows that $K_j$ removed were already zero. Let $K = \sum_j K_j$; then the optima of the objective can be computed by modifying equation (7) as:
$$M = K^{-\frac{1}{2}} \Big( \sum_{j=1}^{J} P_j \Big) K^{-\frac{1}{2}}.$$
Again, if we regularize and approximate the GCCA solution we get $G$ as the left singular vectors of $K^{-\frac{1}{2}} \tilde{M}$. We mean-center the matrices using only the observed rows.
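Also note that other heuristic weighting schemes could be used here. For example, if we modify our objective as follows then we would approximately recover the objective of Pennington et al. (2014):
$$\arg\min_{G, U_j} \sum_{j=1}^{J} \left\| W_j K_j (G - X_j U_j) \right\|_F^2 \quad \text{subject to } G^\top G = I, \tag{12}$$
where $W_j$ is a diagonal matrix whose entry $[W_j]_{ii}$ weights row $i$ of view $j$ as a function of its counts, in the manner of the weighting function of Pennington et al. (2014).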
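The row-selection idea of this section can be sketched in NumPy as follows. This is a minimal illustration rather than the implementation used in the experiments: `gcca_with_missing_rows` and the `observed` masks are hypothetical names, the views are dense, the ridge constant is shared across views, and every word is assumed to be observed in at least one view so that $K$ is invertible.

```python
import numpy as np

def gcca_with_missing_rows(views, observed, m, k, reg=1e-8):
    """Hypothetical sketch: GCCA where some rows of some views are unobserved.

    views:    list of dense (N x d_j) arrays
    observed: list of boolean arrays of length N (the diagonals of K_j)
    """
    n_words = views[0].shape[0]
    k_diag = np.zeros(n_words)                   # diagonal of K = sum_j K_j
    blocks = []
    for X, obs in zip(views, observed):
        Xc = np.zeros(X.shape, dtype=float)      # unobserved rows stay zero
        Xc[obs] = X[obs] - X[obs].mean(axis=0)   # center on observed rows only
        A, s, _ = np.linalg.svd(Xc, full_matrices=False)
        A, s = A[:, :m], s[:m]
        blocks.append(A * (s / np.sqrt(reg + s ** 2)))   # A_j T_j
        k_diag += obs
    M_tilde = np.hstack(blocks)
    scaled = M_tilde / np.sqrt(k_diag)[:, None]  # K^{-1/2} M_tilde
    G, _, _ = np.linalg.svd(scaled, full_matrices=False)
    return G[:, :k]

# Toy usage: view 2 is complete, views 0 and 1 each miss ten words.
rng = np.random.default_rng(1)
views = [rng.standard_normal((30, d)) for d in (5, 6, 7)]
observed = [np.ones(30, dtype=bool) for _ in views]
observed[0][:10] = False
observed[1][20:] = False
G = gcca_with_missing_rows(views, observed, m=3, k=3)
```

Dividing row $i$ of $\tilde{M}$ by the square root of the number of views in which word $w_i$ was observed keeps rarely observed words from being pulled towards zero merely because fewer views contribute to them.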

Data
Training Data We used the English portion of the Polyglot Wikipedia dataset released by Al-Rfou et al. (2013) to create 15 non-redundant views of cooccurrence statistics where element $[z]_{ij}$ of view $Z_k$ represents the number of times word $w_j$ occurred $k$ words behind $w_i$. We selected the top 500K words by occurrence to create our vocabulary for the rest of the paper.
We extracted cooccurrence statistics from a large bitext corpus that was made by combining a number of parallel bilingual corpora as part of the Paraphrase Database (PPDB) project: Table 1 gives a summary, and Ganitkevitch et al. (2013) provide further details. Element $[z]_{ij}$ of the bitext matrix represents the number of times English word $w_i$ was automatically aligned to the foreign word $w_j$.
We also used the dependency relations in the Annotated Gigaword Corpus (Napoles et al., 2012) to create 21 views,$^{2}$ where element $[z]_{ij}$ of view $Z_d$ represents the number of times word $w_j$ occurred as the governor of word $w_i$ under dependency relation $d$.
We combined the knowledge of paraphrases present in FrameNet and PPDB by using the dataset created by Rastogi and Van Durme (2014) to construct a FrameNet view. Element $[z]_{ij}$ of the FrameNet view represents whether word $w_i$ was present in frame $f_j$. Similarly we combined the knowledge of morphology present in the CatVar database released by Habash and Dorr (2003) and morpha released by Minnen et al. (2001) along with morphy, which is a part of WordNet. The morphological views and the frame semantic views were especially sparse, with densities of 0.0003% and 0.03% respectively. While the approach allows for an arbitrary number of distinct sources of semantic information, such as going further to include cooccurrence in WordNet synsets, we considered the described views to be representative, with further improvements possible as future work.
Test Data We evaluated the representations on the word similarity datasets listed in Table 2. The first 10 datasets in Table 2 were annotated with different rubrics and rated on different scales, but broadly they all contain human judgements about how similar two words are. The "AN-SYN" and "AN-SEM" datasets contain 4-tuples of analogous words and the task is to predict the missing word given the first three. Both of these are open vocabulary tasks while TOEFL is a closed vocabulary task.
$^{2}$ Dependency relations employed: nsubj, amod, advmod, rcmod, dobj, prep_of, prep_in, prep_to, prep_on, prep_for, prep_with, prep_from, prep_at, prep_by, prep_as, prep

Significance of comparison
While surveying the literature we found that performance on word similarity datasets is typically reported in terms of the Spearman correlation between the gold ratings and the cosine distance between normalized embeddings. However, researchers do not report measures of significance of the difference between the Spearman correlations of competing methods. Consider two lists of ratings over the same items, produced respectively by algorithms $A$ and $B$, and then a list of gold ratings $T$. Let $r_{AT}$, $r_{BT}$ and $r_{AB}$ denote the Spearman correlations between $A : T$, $B : T$ and $A : B$ respectively. Let $\hat{r}_{AT}$, $\hat{r}_{BT}$, $\hat{r}_{AB}$ be their empirical estimates and assume that $\hat{r}_{BT} > \hat{r}_{AT}$ without loss of generality.
For word similarity datasets we define $\sigma^{r}_{p_0}$ as the Minimum Required Difference for Significance (MRDS), such that it satisfies the following proposition:
$$(r_{AB} < r) \wedge (|\hat{r}_{BT} - \hat{r}_{AT}| < \sigma^{r}_{p_0}) \implies \text{pval} > p_0.$$
Here pval is the probability of the test statistic under the null hypothesis that $r_{AT} = r_{BT}$, found using Steiger's test (Steiger, 1980). The above constraint ensures that as long as the correlation between the competing methods is less than $r$ and the difference between the correlations of the scores of the competing methods to the gold ratings is less than $\sigma^{r}_{p_0}$, the p-value of the null hypothesis will be greater than $p_0$. We can then ask what we consider a reasonable upper bound on the agreement of ratings produced by competing algorithms: for instance, two algorithms correlating above 0.9 might not be considered meaningfully different. That leaves us with the second part of the predicate, which ensures that as long as the difference between the correlations of the competing algorithms to the gold scores is less than $\sigma^{r}_{p_0}$, the p-value of the null hypothesis exceeds $p_0$.
We can find $\sigma^{r}_{p_0}$ as follows. Let stest-p denote Steiger's test predicate, which satisfies the following (with $n$ the number of rated items):
$$\text{stest-p}(\hat{r}_{AT}, \hat{r}_{BT}, r_{AB}, p_0, n) \implies \text{pval} < p_0.$$
Once we define this predicate we can use it to set up an optimization problem where our aim is to find $\sigma^{r}_{p_0}$ by solving the following:
$$\sigma^{r}_{p_0} = \min\{\sigma \mid \forall\, 0 < r' < 1:\ \text{stest-p}(r', \min(r' + \sigma, 1), r, p_0, n)\}.$$
Note that MRDS is a liberal threshold and it only guarantees that differences in correlations below that threshold can never be statistically significant (under the given parameter settings). MRDS might optimistically consider some differences as significant when they are not, but it is at least useful in reducing some of the noise in the evaluations. The values of $\sigma^{r}_{p_0}$ are shown in Table 2.
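The search for the MRDS threshold can be sketched with the Python standard library alone. The sketch rests on assumptions that the text above does not fix: it uses the Fisher-transformed Z statistic with a pooled correlation, which is one of the statistics discussed by Steiger (1980), with a two-sided p-value; it replaces the quantifier over $r'$ by a finite grid; and it skips grid points for which the statistic is undefined because the three correlations are mutually inconsistent. The names `steiger_pvalue` and `mrds` are hypothetical.

```python
import math

def steiger_pvalue(r_at, r_bt, r_ab, n):
    """Two-sided p-value for H0: r_AT == r_BT, given the correlation r_AB
    between the two systems and n rated items (assumed test variant).
    Returns None when the statistic is undefined for this triple."""
    lim = 1.0 - 1e-12                           # keep atanh finite
    r_at = max(-lim, min(lim, r_at))
    r_bt = max(-lim, min(lim, r_bt))
    rbar = 0.5 * (r_at + r_bt)                  # pooled correlation with T
    psi = (r_ab * (1 - 2 * rbar ** 2)
           - 0.5 * rbar ** 2 * (1 - 2 * rbar ** 2 - r_ab ** 2))
    c = psi / (1 - rbar ** 2) ** 2              # covariance of the two z's
    denom = 2.0 - 2.0 * c
    if denom <= 0.0:
        return None
    z = (math.atanh(r_bt) - math.atanh(r_at)) * math.sqrt((n - 3) / denom)
    return math.erfc(abs(z) / math.sqrt(2.0))   # 2 * (1 - Phi(|z|))

def mrds(n, p0=0.05, r=0.9, step=0.001):
    """Smallest sigma on the grid such that a gap of sigma is significant
    at level p0 for every admissible r' on the grid."""
    steps = int(round(1.0 / step))
    grid = [i * step for i in range(1, steps)]
    for i in range(1, steps + 1):
        sigma = i * step
        pvals = (steiger_pvalue(rp, min(rp + sigma, 1.0), r, n) for rp in grid)
        if all(p is None or p < p0 for p in pvals):
            return sigma
    return None

sigma_small_set = mrds(n=100)
sigma_large_set = mrds(n=1000)
```

Under these assumptions the threshold shrinks as the number of rated items grows, which matches the intuition that small datasets need larger gaps before a difference can be called significant.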
Unfortunately there are no widely reported train-test splits of the above datasets, leading to potential concerns of soft supervision (hyper-parameter tuning) on these evaluations, both in our own work and throughout the existing literature. We report on the resulting impact of various parameterizations, and our final results are based on a single set of parameters used across all evaluation sets.

Experiments and Results
We wanted to answer the following questions through our experiments: (1) How do hyper-parameters affect performance? (2) What is the contribution of the multiple sources of data to performance? (3) How does the performance of MVLSA compare with other methods? For brevity we show tuning runs only on the larger datasets. We also highlight the top performing configurations in bold using the small threshold values in column $\sigma^{0.9}_{0.05}$ of Table 2.
Effect of Hyper-parameters $f_j$: We modeled the preprocessing function $f_j$ as the composition of two functions, $f_j = n_j \circ t_j$. $n_j$ represents the nonlinear preprocessing that is usually employed with LSA. We experimented by setting $n_j$ to be: identity; logarithm of count plus one; and the fourth root of the count.
$t_j$ represents the truncation of columns and can be interpreted as a type of regularization of the raw counts themselves through which we prune away the noisy contexts. A decrease in $t_j$ also reduces the influence of views that have a large number of context columns and emphasizes the sparser views. Table 3 and Table 4 show the results.
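The composition $f_j = n_j \circ t_j$ can be sketched as below. The helper `preprocess` is hypothetical, and the truncation criterion is an assumption: the sketch keeps the `t` context columns with the largest total count, whereas the text only states that columns are truncated. Dense arrays are used for brevity; the real count matrices are sparse.

```python
import numpy as np

def preprocess(Z, t, nonlinearity="root4"):
    """Apply column truncation t_j, then the nonlinearity n_j (sketch).

    Z: (N x d_j) array of raw cooccurrence counts
    t: number of context columns to keep (assumed: the most frequent ones)
    """
    totals = Z.sum(axis=0)
    keep = np.sort(np.argsort(totals)[::-1][:t])   # top-t columns, original order
    Zt = Z[:, keep].astype(float)
    if nonlinearity == "identity":
        return Zt
    if nonlinearity == "log":
        return np.log1p(Zt)                        # logarithm of count plus one
    if nonlinearity == "root4":
        return Zt ** 0.25                          # fourth root of the count
    raise ValueError("unknown nonlinearity: " + nonlinearity)

Z = np.array([[16, 0, 1],
              [81, 0, 0]])
Y = preprocess(Z, t=2)
```

In this toy example the all-zero middle column is pruned and the remaining counts are replaced by their fourth roots.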

Table 5: Performance versus $m$, the number of left singular vectors extracted from raw cooccurrence counts. We set $n_j = \text{Count}^{1/4}$, $t = 100\text{K}$, $v = 25$, $k = 300$.
$k$: Table 6 demonstrates the variation in performance versus the dimensionality of the learnt vector representations of the words. Since the dimensions of the MVLSA representations are orthogonal to each other, creating lower dimensional representations is a trivial matrix slicing operation and does not require retraining.
Table 6: Performance versus $k$, the final dimensionality of the embeddings. We set $m = 300$ and other settings were the same as in Table 5.
$v$: Expression 12 describes a method to set $W_j$. We experimented with a different, more global, heuristic to set $[W_j]_{ii} = (K_{ww} \geq v)$, essentially removing all words that did not appear in at least $v$ views before doing GCCA. Table 7 shows that changes in $v$ are largely inconsequential for performance.
$r_j$: The regularization parameter ensures that all the inverses exist at all points in our method. We found that the performance of our procedure was invariant to $r_j$ over a large range from 1 to 1e-10. This was because even the 1000th singular value of our data was much higher than 1.

Contribution of different sources of data
Table 8 shows an ablative analysis of performance where we remove individual views or some combination of them and measure the performance. It is clear by comparing the last column to the second column that adding in more views improves performance. Also we can see that the Dependency based views and the Bitext based views give a larger boost than the morphology and FrameNet based views, probably because the latter are so sparse.

Comparison to other word representation creation methods
There are a large number of methods of creating representations, both multilingual and monolingual. Many new methods, such as those by Yu and Dredze (2014), Hill and Korhonen (2014), and Weston et al. (2014), perform multiview learning and could be considered here as baselines; however, it is not straightforward to use those systems to handle the variety of data that we are using. Therefore, we directly compare our method to Glove and the SkipGram model of Word2Vec, as the performance of those systems is considered state of the art. We trained these two systems on the English portion of the Polyglot Wikipedia dataset.$^{5}$ We also combined their outputs using MVLSA to create (MV-G-WSG) embeddings.
We trained our best MVLSA system with data from all views and by using the individual best settings of the hyper-parameters. Specifically the configuration we used was as follows: $n_j = \text{Count}^{1/4}$, $t = 12.5\text{K}$, $m = 500$, $k = 300$, $v = 16$.
To make a fair comparison we also provide results where we used only the views derived from the Polyglot Wikipedia corpus. See columns MVLSA (All Views) and MVLSA (Wiki) of Table 9 respectively. It is clearly visible that MVLSA on the monolingual data itself is competitive with Glove but worse than Word2Vec on the word similarity datasets, and it is substantially worse than both systems on the AN-SYN and AN-SEM datasets. However, with the addition of multiple views MVLSA makes substantial gains, shown in column MV Gain, and after consuming the Glove and WSG embeddings it again improves performance by some margin, as shown in column G-WSG Gain, and outperforms the original systems. Using GCCA itself for system combination provides closure for the MVLSA algorithm since multiple distinct approaches can now be simply fused using this method. Finally we contrast the Spearman correlations $r_s$ with Glove and Word2Vec before and after including them in the GCCA procedure. The values demonstrate that including Glove and WSG during GCCA actually increased the correlation between them and the learnt embeddings, which supports our motivation for performing GCCA in the first place.

Previous Work
Vector space representations of words have been created using diverse frameworks including Spectral methods (Dhillon et al., 2011; Dhillon et al., 2012),$^{6}$ Neural Networks (Mikolov et al., 2013b; Collobert and Lebret, 2013), and Random Projections (Ravichandran et al., 2005; Bhagat and Ravichandran, 2008; Chan et al., 2011).$^{7}$ They have been trained using either one (Pennington et al., 2014)$^{8}$ or two sources of cooccurrence statistics (Zou et al., 2013; Bansal et al., 2014; Levy and Goldberg, 2014)$^{9}$ or using multi-modal data (Hill and Korhonen, 2014; Bruni et al., 2012). Dhillon et al. (2011) and Dhillon et al. (2012) were the first to use CCA as the primary method to learn vector representations, and later work further demonstrated that incorporating bilingual data through CCA improved performance. More recently this same phenomenon was reported by Hill et al. (2014a) through their experiments over neural representations learnt from MT systems.
$^{5}$ We explicitly provided the vocabulary file to Glove and Word2Vec and set the truncation threshold for Word2Vec to 10. Glove was trained for 25 iterations. Glove was provided a window of 15 previous words and Word2Vec used a symmetric window of 10 words.
$^{7}$ code.google.com/p/word2vec, metaoptimize.com/projects/wordreprs
$^{8}$ nlp.stanford.edu/projects/glove
$^{9}$ ttic.uchicago.edu/~mbansal/data/syntacticEmbeddings.zip, cs.cmu.edu/~mfaruqui/soft.html
Table 9: Comparison of Multiview LSA against Glove and WSG (Word2Vec Skip Gram). Using $\sigma^{0.9}_{0.05}$ as the threshold we highlighted the top performing systems in bold font. † marks significant increments in performance due to use of multiple views in the Gain columns. The $r_s$ columns demonstrate that GCCA increased Pearson correlation.
Various other researchers have tried to improve the performance of their paraphrase systems or vector space models by using diverse sources of information such as bilingual corpora (Bannard and Callison-Burch, 2005; Huang et al., 2012; Zou et al., 2013),$^{10}$ structured datasets (Yu and Dredze, 2014) or even tagged images (Bruni et al., 2012). However, most previous work$^{11}$ did not adopt the general, simplifying view that all of these sources of data are just cooccurrence statistics coming from different sources with underlying latent factors.$^{12}$
Bach and Jordan (2005) presented a probabilistic interpretation for CCA. Though they did not generalize it to include GCCA, we believe that one could give a probabilistic interpretation of MAX-VAR GCCA. Such a probabilistic interpretation would allow for an online generative model of lexical representations, which, unlike methods like Glove or LSA, would allow us to naturally compute perplexity or generate sequences. We also note that Vía et al. (2007) presented a neural network model of GCCA and adaptive/incremental GCCA. To the best of our knowledge neither of these approaches has been used for word representation learning.
CCA is also an algorithm for multi-view learning (Kakade and Foster, 2007; Ganchev et al., 2008) and when we view our work as an application of multiview learning to NLP, this follows a long chain of effort started by Yarowsky (1995) and continued with Co-Training (Blum and Mitchell, 1998), CoBoosting (Collins and Singer, 1999) and two-view perceptrons (Brefeld et al., 2006).

Conclusion and Future Work
While previous efforts demonstrated that incorporating two views is beneficial in word-representation learning, we extended that thread of work to a logical extreme and created MVLSA to learn distributed representations using data from 46 views!$^{13}$ Through evaluation of our induced representations, shown in Table 9, we demonstrated that the MVLSA algorithm is able to leverage the information present in multiple data sources to improve performance on a battery of tests against state-of-the-art baselines. In order to perform MVLSA on large vocabularies with up to 500K words we presented a fast scalable algorithm. We also showed that a close variant of the Glove objective proposed by Pennington et al. (2014) could be derived as a heuristic for handling missing data under the MVLSA framework. In order to better understand the benefit of using multiple sources of data we performed MVLSA using views derived only from the monolingual Wikipedia dataset, thereby providing a more principled alternative to LSA that removes the need for heuristically combining word-word cooccurrence matrices into a single matrix. Finally, while surveying the literature we noticed that not enough emphasis was being given to establishing the significance of comparative results, and we proposed a method, MRDS, to filter out insignificant comparative gains between competing algorithms.
Future Work Column MVLSA (Wiki) of Table 9 shows us that MVLSA applied to monolingual data has mediocre performance compared to the baselines of Glove and Word2Vec on word similarity tasks and performs surprisingly worse on the AN-SEM dataset. We believe that the results could be improved by (1) either using recent methods for handling missing values mentioned in footnote 1 or using the heuristic count-dependent non-linear weighting mentioned by Pennington et al. (2014), which sits well within our framework as exemplified in Expression 12, and (2) using even more views, including views that look at the future words as well as views that contain PMI values. Finally, we note that Table 8 shows that certain datasets can actually degrade performance over certain metrics. Therefore we are exploring methods for performing discriminative optimization of weights assigned to views, for purposes of task-based customization of learned representations.
$^{11}$ Ganitkevitch et al. (2013) did employ a rich set of diverse cooccurrence statistics in constructing the initial PPDB, but without a notion of "training" a joint representation beyond random projection to a binary vector subspace (bit-signatures).
$^{12}$ Note that while some prior work performed belief propagation over a graph representation of its data, such an undirected weighted graph can be viewed as an adjacency matrix, which is then also a cooccurrence matrix.
$^{13}$ Code and data available at www.cs.jhu.edu/~prastog3/mvlsa