Low-Rank Tensors for Verbs in Compositional Distributional Semantics

Several compositional distributional semantic methods use tensors to model multi-way interactions between vectors. Unfortunately, the size of the tensors can make their use impractical in large-scale implementations. In this paper, we investigate whether we can match the performance of full tensors with low-rank approximations that use a fraction of the original number of parameters. We investigate the effect of low-rank tensors on the transitive verb construction, where the verb is a third-order tensor. The results show that, while the low-rank tensors require about two orders of magnitude fewer parameters per verb, they achieve performance comparable to, and occasionally surpassing, the unconstrained-rank tensors on sentence similarity and verb disambiguation tasks.


Introduction
Distributional semantic methods represent word meanings by their contextual distributions, for example by computing word-context co-occurrence statistics (Schütze, 1998; Turney and Pantel, 2010) or by learning vector representations for words as part of a context prediction model (Bengio et al., 2003; Collobert et al., 2011; Mikolov et al., 2013). Recent research has also focused on compositional distributional semantics (CDS): combining the distributional representations for words, often in a syntax-driven fashion, to produce distributional representations of phrases and sentences (Mitchell and Lapata, 2008; Baroni and Zamparelli, 2010; Socher et al., 2012; Zanzotto and Dell'Arciprete, 2012).
One method for CDS is the Categorial framework (Coecke et al., 2011), where each word is represented by a tensor whose order is determined by the Categorial Grammar type of the word. For example, nouns are an atomic type represented by a vector, and adjectives are matrices that act as functions transforming a noun vector into another noun vector (Baroni and Zamparelli, 2010). A transitive verb is a third-order tensor that takes the noun vectors representing the subject and object and returns a vector in the sentence space.
However, a concrete implementation of the Categorial framework requires setting and storing the values, or parameters, defining these matrices and tensors. These parameters can be quite numerous for even low-dimensional sentence spaces. For example, a third-order tensor for a given transitive verb, mapping two 100-dimensional noun spaces to a 100-dimensional sentence space, would have 100³ = 1,000,000 parameters in its full form. All of the more complex types have corresponding tensors of higher order, and therefore a barrier to the practical implementation of this framework is the large number of parameters required to represent an extended vocabulary and a variety of grammatical constructions.
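For concreteness, the parameter counts can be checked with a few lines of Python (our sketch; the rank-R count R(S + 2N) anticipates the CP decomposition introduced in the Model section):

```python
# Parameter counts for a transitive-verb tensor (illustrative arithmetic only).
S, N = 100, 100  # sentence and noun space dimensionalities from the paper

full_params = S * N * N  # full third-order tensor
print(full_params)       # 1000000

# A rank-R CP decomposition needs R * (S + N + N) parameters:
# one S-vector and two N-vectors per rank-1 component.
for R in (1, 50):
    print(R, R * (S + 2 * N))  # 300 at R=1; 15000 at R=50
```

Even at the largest rank used in the experiments (R=50), the decomposed form stores 15,000 rather than 1,000,000 values, consistent with the "two orders of magnitude" reduction claimed in the abstract.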
We aim to reduce the size of the models by demonstrating that reduced-rank tensors, which can be represented in a form requiring fewer parameters, can capture the semantics of complex types as well as the full-rank tensors do. We base our experiments on the transitive verb construction, for which there are established tasks and datasets (Grefenstette and Sadrzadeh, 2011). Previous work on the transitive verb construction within the Categorial framework includes a two-step linear-regression method for the construction of the full verb tensors (Grefenstette et al., 2013) and a multi-linear regression method combined with a two-dimensional plausibility space. Further work introduces several alternative ways of reducing the number of tensor parameters by using matrices. The best performing method uses two matrices, one representing the subject-verb interactions and the other the verb-object interactions. Some interaction between the subject and the object is re-introduced through a softmax layer. A similar method is presented in Paperno et al. (2014). Milajevs et al. (2014) use vectors generated by a neural language model to construct verb matrices, and several different composition operators to generate the composed subject-verb-object sentence representation.
In this paper, we use tensor rank decomposition (Kolda and Bader, 2009) to represent each verb's tensor as a sum of tensor products of vectors. We learn the component vectors and apply the composition without ever constructing the full tensors and thus we are able to improve on both memory usage and efficiency. This approach follows recent work on using low-rank tensors to parameterize models for dependency parsing (Lei et al., 2014) and semantic role labelling (Lei et al., 2015). Our work applies the same tensor rank decompositions, and similar optimization algorithms, to the task of constructing a syntax-driven model for CDS. Although we focus on the Categorial framework, the low-rank decomposition methods are also applicable to other tensor-based semantic models including Van de Cruys (2010), Smolensky and Legendre (2006), and Blacoe et al. (2013).

Model
Tensor Models for Verbs We model each transitive verb as a bilinear function mapping subject and object noun vectors, each of dimensionality N, to a single sentence vector of dimensionality S (Coecke et al., 2011; Maillard et al., 2014) representing the composed subject-verb-object (SVO) triple. Each transitive verb has its own third-order tensor, which defines this bilinear function. Consider a verb V with associated tensor V ∈ R^{S×N×N}, and vectors s ∈ R^N, o ∈ R^N for the subject and object nouns, respectively. Then the compositional representation for the subject, verb, and object is a vector V(s, o) ∈ R^S, produced by applying tensor contraction (the higher-order analogue of matrix multiplication) to the verb tensor and the two noun vectors. The l-th component of the vector for the SVO triple is given by

V(s, o)_l = Σ_{i=1}^{N} Σ_{j=1}^{N} V_{lij} s_i o_j.    (1)

We aim to learn distributional vectors s and o for subjects and objects, and tensors V for verbs, such that the output vectors V(s, o) are distributional representations of the entire SVO triple. While there are several possible definitions of the sentence space (Clark, 2013), we follow previous work (Grefenstette et al., 2013) by using a contextual sentence space consisting of content words that occur within the same sentences as the SVO triple.
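As an illustrative sketch (ours, not the authors' implementation), the contraction in Eq. (1) can be computed with NumPy's einsum; the tiny dimensions and random vectors are placeholders:

```python
import numpy as np

# Eq. (1): the SVO vector as a tensor contraction.
# Dimensions follow the paper: V in R^{S x N x N}, s and o in R^N.
S, N = 4, 3  # tiny dimensions for illustration
rng = np.random.default_rng(0)
V = rng.standard_normal((S, N, N))
s = rng.standard_normal(N)
o = rng.standard_normal(N)

# V(s, o)_l = sum_ij V_lij * s_i * o_j
svo = np.einsum('lij,i,j->l', V, s, o)

# The same contraction as an explicit double sum over i and j:
check = np.array([sum(V[l, i, j] * s[i] * o[j]
                      for i in range(N) for j in range(N))
                  for l in range(S)])
assert np.allclose(svo, check)
```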
Low-Rank Tensor Representations Following Lei et al. (2014), we represent each verb's tensor using a low-rank canonical polyadic (CP) decomposition to reduce the number of parameters that must be learned during training. As a higher-order analogue of singular value decomposition for matrices, CP decomposition factors a tensor into a sum of R tensor products of vectors. Given a third-order tensor V ∈ R^{S×N×N}, the CP decomposition of V is:

V = Σ_{r=1}^{R} P_r ⊗ Q_r ⊗ R_r,    (2)

where P ∈ R^{R×S}, Q ∈ R^{R×N}, R ∈ R^{R×N} are parameter matrices, P_r gives the r-th row of matrix P, and ⊗ is the tensor product. The smallest R that allows the tensor to be expressed as this sum of outer products is the rank of the tensor (Kolda and Bader, 2009). By fixing a value for R that is sufficiently small compared to S and N (forcing the verb tensor to have rank at most R), and directly learning the parameters of the low-rank approximation using gradient-based optimization, we learn a low-rank tensor requiring fewer parameters without ever having to store the full tensor.
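A minimal sketch of Eq. (2) in NumPy (our illustration; the factor matrix Rm stands in for R to avoid clashing with the rank, and the dimensions are arbitrary):

```python
import numpy as np

# Eq. (2): assemble a third-order tensor from CP factor matrices
# P (R x S), Q (R x N), Rm (R x N).
S, N, R = 4, 3, 2
rng = np.random.default_rng(1)
P = rng.standard_normal((R, S))
Q = rng.standard_normal((R, N))
Rm = rng.standard_normal((R, N))

# V = sum_r P_r (outer) Q_r (outer) Rm_r
V = sum(np.einsum('l,i,j->lij', P[r], Q[r], Rm[r]) for r in range(R))
assert V.shape == (S, N, N)
```

In training, only P, Q, and Rm are stored; the full tensor V above is built here purely to show what the factors represent.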
In addition to reducing the number of parameters, representing tensors in this form allows us to formulate the verb tensor's action on noun vectors as matrix multiplication. For a tensor in the form of Eq. (2), the output SVO vector is given by

V(s, o) = P^⊤((Qs) ⊙ (Ro)),    (3)

where ⊙ is the elementwise vector product.
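The equivalence between Eq. (3) and the full contraction of Eq. (1) can be checked numerically; this sketch is ours, with arbitrary small dimensions:

```python
import numpy as np

# Check that Eq. (3), P^T((Qs) . (Rm o)) with . the elementwise product,
# matches the full contraction of Eq. (1) on the reassembled tensor.
S, N, R = 4, 3, 2
rng = np.random.default_rng(2)
P, Q, Rm = (rng.standard_normal((R, d)) for d in (S, N, N))
s, o = rng.standard_normal(N), rng.standard_normal(N)

full = sum(np.einsum('l,i,j->lij', P[r], Q[r], Rm[r]) for r in range(R))
via_full = np.einsum('lij,i,j->l', full, s, o)  # Eq. (1) on the full tensor
via_lowrank = P.T @ ((Q @ s) * (Rm @ o))        # Eq. (3), no full tensor built
assert np.allclose(via_full, via_lowrank)
```

The low-rank route costs two matrix-vector products, an elementwise product, and one more matrix-vector product, so composition is cheap as well as memory-light.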

Training
We train the compositional model for verbs in three steps: extracting transitive verbs and their subject and object nouns from corpus data, producing distributional vectors for the nouns and the SVO triples, and then learning parameters of the verb functions, which map the nouns to the SVO triple vectors.
Corpus Data We extract SVO triples from an October 2013 download of Wikipedia, tokenized using Stanford CoreNLP (Manning et al., 2014), lemmatized with the Morpha lemmatizer (Minnen et al., 2001), and parsed using the C&C parser (Curran et al., 2007). We filter the SVO triples to a set containing 345 distinct verbs: the verbs from our test datasets, along with some additional high-frequency verbs included to produce more representative sentence spaces. For each verb, we select up to 600 triples that occurred more than once and contained subject and object nouns occurring at least 100 times (to allow sufficient context to produce a distributional representation for the triple). This results in approximately 150,000 SVO triples overall.

Distributional Vectors
We produce two types of distributional vectors for nouns and SVO triples using the Wikipedia corpus. Since these methods for producing distributional vectors for the SVO triples require that the triples occur in a corpus of text, the methods are not a replacement for a compositional framework that can produce representations for previously unseen expressions. However, they can be used to generate data to train such a model, as we will describe. 1) Count vectors (SVD): we count the number of times each noun or SVO triple co-occurs with each of the 10,000 most frequent words (excluding stopwords) in the Wikipedia corpus, using sentences as context boundaries. If the verb in the SVO triple is itself a content word, we do not include it as context for the triple. This produces one set of context vectors for nouns and another for SVO triples. We weight entries in these vectors using the t-test weighting scheme (Curran, 2004), and then reduce the vectors to 100 dimensions via singular value decomposition (SVD), decomposing the noun vectors and SVO vectors separately.
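The weighting-and-reduction step can be sketched as follows. This is our illustration with random counts; the t-test weighting formula here, (P(w,c) − P(w)P(c)) / sqrt(P(w)P(c)), is our reading of Curran (2004) and should be treated as an assumption:

```python
import numpy as np

def t_test_weight(counts):
    """t-test weighting of a word-by-context count matrix (assumed formula)."""
    total = counts.sum()
    p_wc = counts / total                            # joint probabilities
    p_w = counts.sum(axis=1, keepdims=True) / total  # row marginals
    p_c = counts.sum(axis=0, keepdims=True) / total  # column marginals
    expected = p_w * p_c
    return (p_wc - expected) / np.sqrt(expected + 1e-12)

def svd_reduce(weighted, k):
    """Truncated SVD: keep the top-k left singular directions, scaled."""
    U, Sg, Vt = np.linalg.svd(weighted, full_matrices=False)
    return U[:, :k] * Sg[:k]  # one k-dimensional row vector per noun/triple

# Toy matrix: 20 nouns/triples x 50 context words (the paper uses 10,000).
counts = np.random.default_rng(3).integers(0, 10, size=(20, 50)).astype(float)
vecs = svd_reduce(t_test_weight(counts), k=5)
assert vecs.shape == (20, 5)
```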
2) Prediction vectors (PV): we train vector embeddings for nouns and SVO triples by adapting the Paragraph Vector distributed bag of words method of Le and Mikolov (2014), an extension of the skip-gram model of Mikolov et al. (2013). In our experiments, given an SVO triple, the model must predict contextual words sampled from all sentences containing that triple. In the process, the model learns vector embeddings for both the SVO triples and for the words in the sentences such that SVO vectors have a high dot product with their contextual word vectors. While previous work (Milajevs et al., 2014) has used prediction-based vectors for words in a tensor-based CDS model, ours uses prediction-based vectors for both words and phrases to train a tensor regression model.
We learn 100-dimensional vectors for nouns and SVO triples with a modified version of word2vec, using the hierarchical sampling method with the default hyperparameters and 20 iterations through the training data.
Training Methods We learn the tensor V of parameters for a given verb V using multi-linear regression, treating the noun vectors s and o as input and the composed SVO triple vector V(s, o) as the regression output. Let M_V be the number of training instances for V, where the i-th instance is a triple of vectors (s^(i), o^(i), t^(i)), which are the distributional vectors for the subject noun, the object noun, and the SVO triple, respectively. We aim to learn a verb tensor V (either in full or in decomposed, low-rank form) that minimizes the mean of the squared residuals between the predicted SVO vectors V(s^(i), o^(i)) and those vectors obtained distributionally from the corpus, t^(i). Specifically, we attempt to minimize the following loss function:

L(V) = (1/M_V) Σ_{i=1}^{M_V} ‖V(s^(i), o^(i)) − t^(i)‖²,    (4)

where V(s^(i), o^(i)) is computed by Eq. (1) for full tensors, and by Eq.
(3) for tensors represented in low-rank form. In both the low-rank and full-rank tensor learning, we use mini-batch ADADELTA optimization (Zeiler, 2012) up to a maximum of 500 iterations through the training data, which we found to be sufficient for convergence for every verb. Rather than placing a regularization penalty on the tensor parameters, we use early stopping if the loss increases on a validation set consisting of 10% of the available SVO triples for each verb.
For low-rank tensors, we compare seven different maximal ranks: R=1, 5, 10, 20, 30, 40 and 50. To learn the parameters of the low-rank tensors, we use an alternating optimization method (Kolda and Bader, 2009; Lei et al., 2014): performing gradient descent on one of the parameter matrices (for example P) to minimize the loss function while holding the other two fixed (Q and R), then repeating for the other parameter matrices in turn. The parameter matrices are randomly initialized.
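The alternating scheme can be sketched as follows. This is a toy version of ours, not the paper's implementation: plain gradient descent stands in for ADADELTA, the data is synthetic, and all names are ours:

```python
import numpy as np

# Toy alternating optimization for a low-rank verb tensor: take gradient
# steps on one factor matrix at a time while holding the other two fixed.
rng = np.random.default_rng(4)
S, N, R, M = 4, 3, 2, 30
Ss = rng.standard_normal((M, N))  # subject vectors, one row per instance
Oo = rng.standard_normal((M, N))  # object vectors
T = rng.standard_normal((M, S))   # target SVO vectors

P, Q, Rm = (0.1 * rng.standard_normal((R, d)) for d in (S, N, N))

def predict(P, Q, Rm):
    # Eq. (3), batched over the M instances: returns an M x S matrix.
    return ((Q @ Ss.T) * (Rm @ Oo.T)).T @ P

def loss(P, Q, Rm):
    return np.mean((predict(P, Q, Rm) - T) ** 2)

lr, before = 0.01, loss(P, Q, Rm)
for _ in range(200):
    for which in range(3):            # alternate over P, then Q, then Rm
        E = predict(P, Q, Rm) - T     # residuals, M x S
        A, B = Q @ Ss.T, Rm @ Oo.T    # R x M intermediates
        if which == 0:
            P -= lr * (A * B) @ E / M
        elif which == 1:
            Q -= lr * ((P @ E.T) * B) @ Ss / M
        else:
            Rm -= lr * ((P @ E.T) * A) @ Oo / M
assert loss(P, Q, Rm) < before
```

In the real setting, the gradient steps would be replaced by ADADELTA updates with mini-batches and early stopping on a validation split, as described above.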

Evaluation
We compare the performance of the low-rank tensors against full tensors on two tasks. Both tasks require the model to rank pairs of sentences, each consisting of a subject, transitive verb, and object, by the semantic similarity of the sentences in the pair. The gold-standard ranking is given by similarity scores provided by human evaluators; the scores are not averaged across annotators. The model's ranking is evaluated against the ranking from the gold-standard similarity judgements using Spearman's ρ.
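A sketch of this evaluation protocol (our illustration: the scores are made up, the model score is assumed to be a similarity such as cosine between composed SVO vectors, and the ranking helper assumes no tied scores):

```python
import numpy as np

def rankdata(x):
    """Rank values from 1..n (no tie handling; assumes distinct scores)."""
    order = np.argsort(x)
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    return ranks

def spearman_rho(a, b):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    ra, rb = rankdata(np.asarray(a, float)), rankdata(np.asarray(b, float))
    ra, rb = ra - ra.mean(), rb - rb.mean()
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))

model_scores = [0.9, 0.2, 0.7, 0.4]  # hypothetical model similarities
human_scores = [6.5, 1.0, 5.0, 3.5]  # hypothetical 1-7 judgements
rho = spearman_rho(model_scores, human_scores)
# These two lists induce identical rankings, so rho is 1.0 here.
```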
The verb disambiguation task (GS11) (Grefenstette and Sadrzadeh, 2011) involves distinguishing between senses of an ambiguous verb, given subject and object nouns as context. The dataset consists of 200 sentence pairs, where the two sentences in each pair have the same subject and object but differ in the verb. Each of these pairs was ranked by human evaluators on a 1-7 similarity scale so that properly disambiguated pairs (e.g. author write book -author publish book) have higher similarity scores than improperly disambiguated pairs (e.g. author write book -author spell book).
The transitive sentence similarity dataset (KS14) (Kartsaklis and Sadrzadeh, 2014) consists of 72 subject-verb-object sentences arranged into 108 sentence pairs. As in GS11, each pair has a gold-standard semantic similarity score on a 1-7 scale. For example, the pair medication achieve result - drug produce effect has a high similarity rating, while author write book - delegate buy land has a low rating. In this dataset, however, the two sentences in each pair have no lexical overlap: neither subjects, objects, nor verbs are shared. In Table 1, we show the highest tensor result for each task and vector set in bold (and also bold the baseline when it outperforms the tensor method).

Results
Table 1 displays correlations between the systems' scores and human SVO similarity judgements on the verb disambiguation (GS11) and sentence similarity (KS14) tasks, for both the count (SVD) and prediction vectors (PV). We also give results for simple composition of word vectors using elementwise addition and multiplication (Mitchell and Lapata, 2008), using verb vectors produced in the same manner as for nouns. Consistent with prior work, the tensor-based models are surpassed by vector addition on the KS14 dataset (Milajevs et al., 2014), but perform better than both addition and multiplication on the GS11 dataset. Unsurprisingly, the rank-1 tensor has the lowest performance for both tasks and vector sets, and performance generally increases as we increase the maximal rank R. The full tensor achieves the best, or is tied for the best, performance on both tasks when using the PV vectors. However, for the SVD vectors, low-rank tensors surpass the performance of the full-rank tensor for R=40 and R=50 on GS11, and R=50 on KS14.
On GS11, the SVD and PV vectors have varying but mostly comparable performance, with PV having higher performance on 5 out of 8 models. However, on KS14, the PV vectors have better performance than the SVD vectors for every model by at least 0.05 points, which is consistent with prior work comparing count and predict vectors on these datasets (Milajevs et al., 2014).
The low-rank tensor models are also at least twice as fast to train as the full tensors: on a single core, training a rank-1 tensor takes about 5 seconds for each verb on average, ranks 5-50 each take between 1 and 2 minutes, and the full tensors each take about 4 minutes. Since a separate tensor is trained for each verb, this allows a substantial amount of time to be saved even when using the constrained vocabulary of 345 verbs.

Conclusion
We find that low-rank tensors for verbs achieve comparable or better performance than full-rank tensors on both verb disambiguation and sentence similarity tasks, while reducing the number of parameters that must be learned and stored for each verb by at least two orders of magnitude, and cutting training time in half.
While in our experiments the prediction-based vectors outperform the count-based vectors on both tasks for most models, Levy et al. (2015) indicate that tuning hyperparameters of the count-based vectors may be able to produce comparable performance. Regardless, we show that the low-rank tensors are able to achieve performance comparable to the full-rank tensors for both types of vectors. This is important for extending the model to many more grammatical types (including those with corresponding tensors of higher order than investigated here) to build a wide-coverage tensor-based semantic system using, for example, the CCG parser of Curran et al. (2007).