Embedding Lexical Features via Low-Rank Tensors

Modern NLP models rely heavily on engineered features, which often combine word and contextual information into complex lexical features. Such combination results in large numbers of features, which can lead to over-fitting. We present a new model that represents complex lexical features, composed of parts for words, contextual information and labels, in a tensor that captures conjunction information among these parts. We apply low-rank tensor approximations to the corresponding parameter tensors to reduce the parameter space and improve prediction speed. Furthermore, we investigate two methods for handling features that include $n$-grams of mixed lengths. Our model achieves state-of-the-art results on tasks in relation extraction, PP-attachment, and preposition disambiguation.


Introduction
Statistical NLP models usually rely on hand-designed features, customized for each task. These features typically combine lexical and contextual information with the label to be scored. In relation extraction, for example, there is a parameter for the presence of a specific relation occurring with a feature conjoining a word type (lexical) with dependency path information (contextual). In measuring phrase semantic similarity, a word type is conjoined with its position in the phrase to signal its role. Figure 1b shows an example in dependency parsing, where multiple types (words) are conjoined with POS tags or distance information.
To avoid model over-fitting that often results from features with lexical components, several smoothed lexical representations have been proposed and shown to improve performance on various NLP tasks; for instance, word embeddings (Bengio et al., 2006) help improve NER, dependency parsing and semantic role labeling (Miller et al., 2004; Koo et al., 2008; Turian et al., 2010; Sun et al., 2011; Roth and Woodsend, 2014; Hermann et al., 2014).
However, word embeddings alone are not sufficient to represent complex lexical features (e.g. $\phi$ in Figure 1c). In these features, the same word embedding conjoined with different non-lexical properties may yield features indicating different labels; the corresponding lexical feature representations should take these interactions into consideration. Such important interactions also increase the risk of over-fitting, since the feature space grows exponentially, yet how to capture these interactions in representation learning remains an open question.
To address the above problems,1 we propose a general and unified approach to reduce the feature space by constructing low-dimensional feature representations, which provides a new way of combining word embeddings, traditional non-lexical properties, and label information. Our model exploits the inner structure of features by breaking each feature into multiple parts: lexical, non-lexical and (optional) label. We demonstrate that the full feature is an outer product among these parts; thus, a parameter tensor scores each feature to produce a prediction. Our model then reduces the number of parameters by approximating the parameter tensor with a low-rank tensor: either the Tucker approximation of Yu et al. (2015), but applied to each embedding type (view), or the Canonical/Parallel-Factors Decomposition (CP). Our models use fewer parameters than previous work that learns a separate representation for each feature (Ando and Zhang, 2005; Yang and Eisenstein, 2015). The CP approximation also allows for much faster prediction, going from a method that is cubic in rank and exponential in the number of lexical parts to one linear in both. Furthermore, we consider two methods for handling features that rely on $n$-grams of mixed lengths.
Our model makes the following contributions when contrasted with prior work. Lei et al. (2014) applied CP to combine different views of features. Compared to their work, our use of the CP decomposition differs in its application to feature learning: (1) we focus on dimensionality reduction of existing, well-verified features, while Lei et al. (2014) generate new features (usually different from ours) by combining "atom" features; their approach may thus ignore some useful features, and it relies on binary features as a supplement, whereas our model need not. (2) The factorization of Lei et al. (2014) relies on views with explicit meanings, e.g. head/modifier/arc in dependency parsing, making it less general; its application to tasks like relation extraction is therefore less obvious.
Compared to our previous work (Gormley et al., 2015; Yu et al., 2015), this work allows for higher-order interactions, mixed-length $n$-gram features, and lower-rank representations. We also demonstrate the strength of the new model via applications to new tasks.
The resulting method learns smoothed feature representations that combine lexical, non-lexical and label information, achieving state-of-the-art performance on several tasks: relation extraction, preposition semantics and PP-attachment.

Notation and Definitions
We begin with some background on notation and definitions. Let $\mathcal{T} \in \mathbb{R}^{d_1 \times \cdots \times d_K}$ be a $K$-way tensor (i.e., a tensor with $K$ views). In this paper, we consider the tensor $k$-mode product, i.e. multiplying a tensor by a vector $x \in \mathbb{R}^{d_k}$ along mode $k$. The product is denoted by $\mathcal{T} \times_k x$ and is of size $d_1 \times \cdots \times d_{k-1} \times d_{k+1} \times \cdots \times d_K$. A mode-$k$ fiber is the $d_k$-dimensional vector obtained by fixing all but the $k$th index. The mode-$k$ unfolding $\mathcal{T}_{(k)}$ of $\mathcal{T}$ is the $d_k \times \prod_{i \neq k} d_i$ matrix obtained by concatenating all the $\prod_{i \neq k} d_i$ mode-$k$ fibers along columns.
Given two matrices $W_1 \in \mathbb{R}^{d_1 \times r_1}$ and $W_2 \in \mathbb{R}^{d_2 \times r_2}$, we write $W_1 \otimes W_2$ to denote the Kronecker product between $W_1$ and $W_2$ (the outer product, for vectors). We define the Frobenius product (matrix dot product) $\langle A, B \rangle = \sum_{i,j} A_{ij} B_{ij}$ between two matrices of the same size, and element-wise (Hadamard) multiplication $a \circ b$ between vectors of the same size.
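To make the notation concrete, the following is a minimal NumPy sketch of these operations (all sizes and values are illustrative, not from the paper):
```python
import numpy as np

T = np.random.randn(2, 3, 4)            # a 3-way tensor with d1=2, d2=3, d3=4

# Mode-2 product T x_2 x: contract mode 2 (axis 1, 0-indexed) with a vector
# x in R^{d2}; the result drops that mode and has size d1 x d3.
x = np.random.randn(3)
T_x2 = np.einsum('ijk,j->ik', T, x)     # shape (2, 4)

# Mode-k unfolding T_(k): a d_k x prod_{i != k} d_i matrix of mode-k fibers.
def unfold(T, k):
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1)

T1 = unfold(T, 0)                       # shape (2, 12)

# Kronecker, Frobenius (matrix dot) and Hadamard products.
A, B = np.random.randn(2, 3), np.random.randn(2, 3)
kron = np.kron(A, B)                    # Kronecker product of two matrices
frob = np.sum(A * B)                    # <A, B> = sum_ij A_ij B_ij
a, b = np.random.randn(5), np.random.randn(5)
had = a * b                             # element-wise (Hadamard) product
```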
Tucker Decomposition: the Tucker decomposition represents a $d_1 \times d_2 \times \cdots \times d_K$ tensor $\mathcal{T}$ as
$$\mathcal{T} = g \times_1 W_1 \times_2 W_2 \cdots \times_K W_K, \quad (1)$$
where each $\times_i$ is the tensor $i$-mode product and each $W_i$ is an $r_i \times d_i$ matrix. The tensor $g$ of size $r_1 \times r_2 \times \cdots \times r_K$ is called the core tensor. We say that $\mathcal{T}$ has Tucker rank $(r^{(1)}, r^{(2)}, \ldots, r^{(K)})$, where $r^{(i)} = \mathrm{rank}(\mathcal{T}_{(i)})$ is the rank of the mode-$i$ unfolding. To simplify learning, we define the Tucker rank as $r^{(i)} = \mathrm{rank}(g_{(i)})$, which can be bounded simply by the dimensions of $g$, i.e. $r^{(i)} \leq r_i$; this allows us to enforce a rank constraint on $\mathcal{T}$ simply by restricting the dimensions $r_i$ of $g$, as described in §6.
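A sketch of this reconstruction for a 3-way tensor (illustrative sizes; `np.einsum` chains the three mode products):
```python
import numpy as np

# Core g is r1 x r2 x r3; each Wi is ri x di, matching the text above.
r1, r2, r3 = 2, 3, 2
d1, d2, d3 = 4, 5, 6
g = np.random.randn(r1, r2, r3)
W1 = np.random.randn(r1, d1)
W2 = np.random.randn(r2, d2)
W3 = np.random.randn(r3, d3)

# T = g x_1 W1 x_2 W2 x_3 W3: each mode of the core is expanded to size di.
T = np.einsum('abc,ai,bj,ck->ijk', g, W1, W2, W3)
assert T.shape == (d1, d2, d3)
# Each mode-i unfolding of T then has rank at most ri.
```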
CP Decomposition: CP decomposition represents a d 1 ×d 2 ×. ..×dK tensor T as a sum of rank-one tensors (i.e. a sum of outer products of K vectors): where each W i is an r × d i matrix and W i [j, :] is the vector of its j-th row.For CP decomposition, the rank r of a tensor T is defined to be the number of rank-one tensors in the decomposition.CP decomposition can be viewed as a special case of Tucker decomposition in which r 1 = r 2 = . . .= r K = r and g is a superdiagonal tensor.
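The same construction in NumPy, including a check of the superdiagonal-core equivalence (illustrative sizes):
```python
import numpy as np

r = 4                                    # CP rank
d1, d2, d3 = 5, 6, 7
W1, W2, W3 = (np.random.randn(r, d) for d in (d1, d2, d3))

# T = sum_j  W1[j,:] (outer) W2[j,:] (outer) W3[j,:]
T = sum(np.einsum('a,b,c->abc', W1[j], W2[j], W3[j]) for j in range(r))

# Equivalent Tucker view: an r x r x r superdiagonal core.
g = np.zeros((r, r, r))
g[np.arange(r), np.arange(r), np.arange(r)] = 1.0
T2 = np.einsum('abc,ai,bj,ck->ijk', g, W1, W2, W3)
assert np.allclose(T, T2)
```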

Factorization of Lexical Features
Suppose we have a feature $\phi$ that includes information from a label $y$, multiple lexical items $w_1, \ldots, w_n$ and a non-lexical property $u$. This feature can be factorized as a conjunction of its parts: $\phi = y \wedge u \wedge w_1 \wedge \cdots \wedge w_n$. The feature fires when all $(n+2)$ parts fire in the instance (reflected by the $\wedge$ symbol in $\phi$).
The one-hot representation of $\phi$ can then be viewed as a tensor $e_\phi = y \otimes u \otimes w_1 \otimes \cdots \otimes w_n$, where each feature part is also represented as a one-hot vector (here $u$, $y$ and $w_i$ denote one-hot vectors rather than symbols). Figure 1d illustrates this case with two lexical parts. Given an input instance $x$ and its associated label $y$, we can extract a set of features $S(x, y)$. In a traditional log-linear model, we view the instance $x$ as a bag of features, i.e. a feature vector $F(x, y)$ in which each dimension corresponds to a feature $\phi$ and has value 1 if $\phi \in S(x, y)$. The log-linear model then scores the instance as $s(x, y; w) = w^\top F(x, y) = \sum_{\phi \in S(x,y)} s(\phi; w)$, where $w$ is the parameter vector. We can re-write $s(x, y; w)$ based on the factorization of the features using tensor multiplication, in which $w$ becomes a parameter tensor $\mathcal{T}$:
$$s(x, y; \mathcal{T}) = \sum_{\phi \in S(x,y)} s(\phi; \mathcal{T}), \quad (3)$$
$$s(\phi; \mathcal{T}) = \mathcal{T} \times_1 y \times_2 u \times_3 w_1 \cdots \times_{n+2} w_n. \quad (4)$$
Here each $\phi$ has the form $(y, u, w_1, \ldots, w_n)$. Note that the one-hot word vectors $w_i$ are themselves large ($|w_i| > 500$k), so the above formulation with parameter tensor $\mathcal{T}$ can be very large, making parameter estimation difficult. Instead of estimating only the values of the dimensions that appear in the training data, as in traditional methods, we reduce the size of the tensor $\mathcal{T}$ via a low-rank approximation. With different approximation methods, (4) will have different equivalent forms, e.g. (6) and (7) in §4.1.
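As a minimal sketch of (3)-(4) with a single lexical part (all sizes and indices illustrative), the full-rank score of a one-hot feature tensor reduces to a single lookup into $\mathcal{T}$, which is why traditional training only ever touches the dimensions seen in the data:
```python
import numpy as np

# |L| labels, |F| non-lexical properties, |V| words; n = 1 lexical part,
# so T is |L| x |F| x |V|.
dL, dF, dV = 3, 4, 10
T = np.random.randn(dL, dF, dV)

def one_hot(i, d):
    v = np.zeros(d)
    v[i] = 1.0
    return v

y, u, w1 = one_hot(1, dL), one_hot(2, dF), one_hot(7, dV)

# e_phi = y (outer) u (outer) w1; scoring against T is a Frobenius product,
# which for one-hot parts is exactly a table lookup.
e_phi = np.einsum('a,b,c->abc', y, u, w1)
score = np.sum(T * e_phi)
assert np.isclose(score, T[1, 2, 7])
```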

Optimization objective:
The loss function for training the log-linear model uses (3) for scores, e.g. the log-loss
$$\ell(x, y; \mathcal{T}) = -\log \frac{\exp\{s(x, y; \mathcal{T})\}}{\sum_{y' \in L} \exp\{s(x, y'; \mathcal{T})\}}.$$
Learning can be formulated as the following optimization problem:
$$\text{minimize } \sum_{(x,y)} \ell(x, y; \mathcal{T}) \quad \text{subject to rank constraints on } \mathcal{T}, \quad (5)$$
where the constraints on $\mathrm{rank}(\mathcal{T})$ depend on the chosen tensor approximation method (§2).
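For concreteness, a small sketch of the log-loss given precomputed instance scores (the score vector below is a placeholder, not model output):
```python
import numpy as np

scores = np.array([1.2, -0.3, 0.5])   # s(x, y'; T) for every y' in L
gold = 0                              # index of the gold label y
log_loss = -scores[gold] + np.log(np.sum(np.exp(scores)))
```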
The above framework has two advantages. First, as discussed in §1, we want the representations to capture rich interactions between the different parts of lexical features; the low-rank tensor approximation methods keep the most important interaction information of the original tensor while significantly reducing its size. Second, the low-rank structure encourages weight-sharing among lexical features with similar decomposed parts, leading to better model generalization. Note that there are cases where features have different numbers of lexical parts, such as the unigram and bigram features in PP-attachment; we use two different methods to handle such features (§5).

Remarks (advantages of our factorization)
Compared to prior work, e.g. (Lei et al., 2014; Lei et al., 2015), the proposed factorization overcomes the following two problems:
1. Parameter explosion when mapping a view with lexical properties to its representation vector (as will be discussed in §4.3). Our factorization allows the model to treat word embeddings as inputs to the views of lexical parts, dramatically reducing the number of parameters. Prior work cannot do this, since its views are mixtures of lexical and non-lexical properties. Note that Lei et al. (2014) use embeddings by concatenating them to specific views, which increases dimensionality, but the improvement is limited.

2. No weight-sharing among conjunctions with the same lexical property, such as the child word "word(c)" and its conjunction with the candidate-head word, "word(c) ∧ word(g)", in Figure 1(b). The factorization in prior work treats these as independent features, greatly increasing the dimensionality. Our factorization builds the representations of both features from the embedding of "word(c)", thus exploiting their connection and reducing the dimensionality.
Overcoming these two problems is also key to addressing the issues with prior work mentioned at the end of §1.

Feature Representations via Low-rank Tensor Approximations
Using one-hot encodings for each part of feature $\phi$ results in a very large tensor. This section shows how to compute the score in (4) without constructing the full feature tensor, using two tensor approximation methods (§4.1 and §4.2).
We begin with some intuition. To score the original (full-rank) tensor representation of $\phi$, we need a parameter tensor $\mathcal{T}$ of size $d_1 \times d_2 \times |V|^n$, where $|V|$ is the vocabulary size, $n$ is the number of lexical parts in the feature, and $d_1 = |L|$ and $d_2 = |F|$ are the numbers of different labels and non-lexical properties, respectively. (§5 will handle $n$ varying across features.) Our methods reduce the tensor size by embedding each part of $\phi$ into a lower-dimensional space, where we represent the label, the non-lexical property and the words with $r_1$-, $r_2$-, and $r_3, \ldots, r_{n+2}$-dimensional vectors, respectively ($r_i \ll d_i$, $\forall i$). These embedded features can then be scored by much smaller tensors. We denote the above transformations as matrices $W_1 \in \mathbb{R}^{r_1 \times d_1}$, $W_2 \in \mathbb{R}^{r_2 \times d_2}$ and $W_{i+2} \in \mathbb{R}^{r_{i+2} \times |V|}$ for $i = 1, \ldots, n$, and write the corresponding low-dimensional hidden representations as $h_y = W_1 y$, $h_u = W_2 u$ and $h^{(i)}_{w_i} = W_{i+2} w_i$. In our methods, these transformations are parts of the low-rank tensors as in (5), so the embeddings of non-lexical properties and labels can be trained simultaneously with the low-rank tensors. Note that for one-hot input encodings the transformation matrices are essentially lookup tables, making the computation of these transformations fast.

Tucker Form
For our first approximation, we assume that the tensor $\mathcal{T}$ has a low-rank Tucker decomposition. We can then express the scoring function (4) for a feature $\phi = (y, u, w_1, \ldots, w_n)$ with $n$ lexical parts as
$$s(\phi; \mathcal{T}) = g \times_1 (W_1 y) \times_2 (W_2 u) \times_3 (W_3 w_1) \cdots \times_{n+2} (W_{n+2} w_n), \quad (6)$$
which amounts to first projecting $y$, $u$ and the $w_i$ (for all $i$) to the lower-dimensional vectors $h_y$, $h_u$ and $h^{(i)}_{w_i}$, and then weighting these hidden representations using the (flattened) core tensor $g$. The low-dimensional representations and the corresponding weights are learned jointly using a discriminative (supervised) criterion. We call the model based on this representation the Low-Rank Feature Representation with Tucker form, or LRFR$_n$-TUCKER.
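A sketch of this scoring function with $n = 2$ lexical parts (sizes illustrative; the lexical views take the §4.3 embedding inputs rather than one-hot word vectors):
```python
import numpy as np

dL, dF, m = 30, 100, 200                 # #labels, #properties, emb. dim
r1, r2, r3, r4 = 10, 20, 50, 50
g = np.random.randn(r1, r2, r3, r4)      # core tensor
W1, W2 = np.random.randn(r1, dL), np.random.randn(r2, dF)
W3, W4 = np.random.randn(r3, m), np.random.randn(r4, m)

y = np.zeros(dL); y[3] = 1.0             # one-hot label
u = np.zeros(dF); u[42] = 1.0            # one-hot non-lexical property
e_w1, e_w2 = np.random.randn(m), np.random.randn(m)

# Project each part, then contract against the core (O(r^4) here).
h_y, h_u, h1, h2 = W1 @ y, W2 @ u, W3 @ e_w1, W4 @ e_w2
score = np.einsum('abcd,a,b,c,d->', g, h_y, h_u, h1, h2)
```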

CP Form
For the Tucker approximation, the number of parameters in (6) scales exponentially with the number of lexical parts: for instance, if each $h^{(i)}_{w_i}$ has dimensionality $r$, then $|g| \propto r^n$. To address scalability and further control the complexity of our tensor-based model, we approximate the parameter tensor with the CP decomposition as in (2), resulting in the scoring function
$$s(\phi; \mathcal{T}) = \sum_{j=1}^{r} (W_1 y)[j] \, (W_2 u)[j] \prod_{i=1}^{n} (W_{i+2} w_i)[j], \quad (7)$$
i.e. the sum over the Hadamard product of the projected parts. We call this model the Low-Rank Feature Representation with CP form (LRFR$_n$-CP).
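The corresponding sketch for the CP form; unlike the Tucker core contraction, the cost is linear in both $r$ and $n$ (again with illustrative sizes and $n = 2$):
```python
import numpy as np

r, dL, dF, m = 100, 30, 100, 200
W1, W2 = np.random.randn(r, dL), np.random.randn(r, dF)
W3, W4 = np.random.randn(r, m), np.random.randn(r, m)

y = np.zeros(dL); y[3] = 1.0
u = np.zeros(dF); u[42] = 1.0
e_w1, e_w2 = np.random.randn(m), np.random.randn(m)

# Project every view into the shared rank-r space, multiply element-wise,
# and sum: O(nr) once the projections are cached or looked up.
score = np.sum((W1 @ y) * (W2 @ u) * (W3 @ e_w1) * (W4 @ e_w2))
```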

Pre-trained Word Embeddings
One of the computational and statistical bottlenecks in learning these LRFR$_n$ models is the vocabulary size: the number of parameters to learn in each matrix $W_i$ scales linearly with $|V|$ and would require very large labeled training sets. To alleviate this problem, we use pre-trained continuous word embeddings (Mikolov et al., 2013) as input embeddings rather than one-hot word encodings.
We denote the $m$-dimensional word embeddings by $e_w$; the transformation matrices $W_i$ for the lexical parts are then of size $r_i \times m$, where $m \ll |V|$. We note that when sufficiently large labeled data is available, our model allows fine-tuning the pre-trained word embeddings to improve the expressive strength of the model, as is common with deep network models.
Remarks Our LRFRs introduce embeddings for non-lexical properties and labels, making them better suited to common NLP settings: rich linguistic properties, and large label sets such as in open-domain tasks (Hoffmann et al., 2010). The LRFR-CP better suits $n$-gram features, since when $n$ increases by 1, the only new parameters are the corresponding $W_i$. It is also very efficient during prediction ($O(nr)$), since the cost of the transformations can be ignored with the help of lookup tables and pre-computation.

Learning Representations for n-gram Lexical Features of Mixed Lengths
For features with $n$ lexical parts, we can train an LRFR$_n$ model to obtain their representations. However, we often have features of varying $n$ (e.g. both unigrams ($n{=}1$) and bigrams ($n{=}2$), as in Figure 1), and we require representations for features with arbitrary different $n$ simultaneously.
We propose two solutions. The first is a straightforward solution within our framework, which handles each $n$ with a separate $(n+2)$-way tensor; this strategy is common in NLP, e.g. Taub-Tabib et al. (2015) use different kernel functions for different orders of dependency features. The second is an approximation that uses a single tensor to handle all values of $n$.
Multiple Low-Rank Tensors Suppose we can divide the feature set $S(x, y)$ into subsets $S_1(x, y), S_2(x, y), \ldots, S_n(x, y)$ corresponding to features with one lexical part (unigram features), two lexical parts (bigram features), ..., and $n$ lexical parts ($n$-gram features), respectively. To handle these feature types, we modify the training objective to minimize $\sum_{(x,y)} \ell(x, y; \mathcal{T}_1, \ldots, \mathcal{T}_n)$ subject to rank constraints on each $\mathcal{T}_i$, where the score of a training instance $(x, y)$ is defined as $s(x, y) = \sum_{i=1}^{n} \sum_{\phi \in S_i(x,y)} s(\phi; \mathcal{T}_i)$. We use the Tucker-form low-rank tensor for $\mathcal{T}_1$ and the CP form for $\mathcal{T}_i$ ($\forall i > 1$). We refer to this method as LRFR$_1$-TUCKER & LRFR$_2$-CP.
Word Clusters Alternatively, to handle different numbers of lexical parts, we replace some lexical parts with discrete word clusters. Let $c(w)$ denote the word cluster (e.g. from Brown clustering) of word $w$. For bigram features we have
$$s(y, u, w_1, w_2; \mathcal{T}) = s(y, u \wedge c(w_1), w_2; \mathcal{T}) + s(y, u \wedge c(w_2), w_1; \mathcal{T}),$$
where for each word we have introduced an additional set of non-lexical properties that are conjunctions of word clusters and the original non-lexical properties. This reduces an $n$-gram feature representation to a unigram representation. The advantage of this method is that it uses a single low-rank tensor to score features with different numbers of lexical parts, which is particularly helpful when labeled data is very limited. We denote this method LRFR$_1$-BROWN, since we use Brown clusters in practice. In the experiments we use the Tucker form for LRFR$_1$-BROWN.
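A minimal sketch of this reduction, assuming a unigram scorer and a cluster lookup (both hypothetical interfaces, not the paper's released implementation):
```python
# `score_unigram` stands in for an LRFR_1 scorer; `cluster` maps a word to
# its Brown cluster, and a (property, cluster) pair encodes the conjoined
# non-lexical property u ∧ c(w) in the enlarged property set.
def score_bigram(y, u, w1, w2, cluster, score_unigram):
    return (score_unigram(y, (u, cluster[w1]), w2)    # s(y, u ∧ c(w1), w2)
            + score_unigram(y, (u, cluster[w2]), w1)) # s(y, u ∧ c(w2), w1)
```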

Parameter Estimation
The goal of learning is to find a tensor $\mathcal{T}$ that solves problem (5). Note that this objective is non-convex, so compared to the convex objective of a traditional log-linear model, we trade better feature representations for a harder optimization problem. While stochastic gradient descent (SGD) is a natural choice for learning representations in large-data settings, problem (5) involves rank constraints, which would require an expensive proximal operation at each SGD iteration. We therefore use a more efficient scheme: we fix the size of each transformation matrix $W_i \in \mathbb{R}^{r_i \times d_i}$ so that the smaller dimension ($r_i < d_i$) matches the upper bound on the rank. The rank constraints are then always satisfied throughout a run of SGD, and we in essence have an unconstrained optimization problem. Note that this does not guarantee orthogonality or full rank of the learned transformation matrices; these properties are commonly assumed, but are not necessary (Kolda and Bader, 2009). Gradients are computed via the chain rule. We use AdaGrad (Duchi et al., 2011) and apply L2 regularization to all the $W_i$ and to $g$, except in the case $r_i = d_i$, where we start with $W_i = I$ and regularize with $\|W_i - I\|^2$. We use early stopping on a development set.
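A sketch of the resulting unconstrained update, assuming a standard AdaGrad accumulator (the epsilon term is an implementation detail, not from the paper; the defaults mirror the tuned values $\eta = 0.05$, $\lambda = 0.005$ reported in §7):
```python
import numpy as np

# One AdaGrad step with L2 regularization for a parameter matrix W (the
# same update applies to the core g); `grad` is the chain-rule gradient
# of the loss with respect to W.
def adagrad_step(W, grad, G_sq, eta=0.05, lam=0.005):
    g = grad + lam * W                    # use lam * (W - I) when r_i = d_i
    G_sq += g * g                         # running sum of squared gradients
    W -= eta * g / (np.sqrt(G_sq) + 1e-8)
    return W, G_sq
```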

Experimental Settings
We evaluate LRFR on three tasks: relation extraction, PP-attachment and preposition disambiguation (see Table 1 for a task summary). We include detailed feature templates in Table 2.
PP-attachment and relation extraction are two fundamental NLP tasks, and we test our models on the largest English data sets. The preposition disambiguation task was designed for compositional semantics, an important application of deep learning and distributed representations. On all these tasks, we compare to the state of the art.
We use the same word embeddings as Belinkov et al. (2014) on PP-attachment for a fair comparison. For the other experiments, we use the same 200-dimensional word embeddings as Yu et al. (2015).
Relation Extraction We use the English portion of the ACE 2005 relation extraction dataset (Walker et al., 2006). Following Yu et al. (2015), we use both gold entity spans and types, train the model on the news domain and test on the broadcast conversation domain. To highlight the impact of training-data size, we evaluate with all 43,518 relations (entity mention pairs) and with a reduced training set of the first 10,000 relations. We report precision, recall, and F1.
We compare to two baseline methods: 1) a log-linear model with the rich binary feature set of Sun et al. (2011) and Zhou et al. (2005), as described in Yu et al. (2015) (BASELINE); 2) the embedding model (FCM) of Gormley et al. (2015), which uses rich linguistic features for relation extraction. We use the same feature templates and evaluate on fine-grained relations (sub-types, 32 labels) (Yu et al., 2015). This evaluates how well LRFR can utilize non-lexical linguistic features.

PP-attachment
We consider the prepositional phrase (PP) attachment task of Belinkov et al. (2014),3 where for each PP the correct head (a verb or noun) must be selected from the content words preceding the PP (within a 10-word window). We formulate the task as a ranking problem, where we optimize the score of the correct head from a candidate list of varying size.
PP-attachment suffers from data sparsity because of bi-lexical features, which we model with the methods in §5. Belinkov et al. show that rich features (POS, WordNet and VerbNet) help this task. The combination of these features gives a large number of non-lexical properties, for which the embeddings of non-lexical properties in LRFR should be useful.
We extract a dev set from section 22 of the PTB following the description in Belinkov et al. (2014).
Preposition Disambiguation We consider the preposition disambiguation task proposed by Ritter et al. (2014). The task is to determine the spatial relation a preposition indicates, based on the two objects connected by the preposition. For example, "the apple on the refrigerator" indicates the "Support by Horizontal Surface" relation, while "the apple on the branch" indicates the "Support from Above" relation. Since the meaning of a preposition depends on the combination of both its head and child word, we expect conjunctions between these word embeddings to help, i.e. features with two lexical parts. We include three baselines: point-wise addition (SUM) (Mitchell and Lapata, 2010), concatenation (Ritter et al., 2014), and an SVM based on the hand-crafted features in Table 2. Ritter et al. show that the first two methods beat other compositional models.
Hyperparameters are all tuned on the dev set. The chosen values are learning rate $\eta = 0.05$ and L2 regularizer weight $\lambda = 0.005$ for LRFR, except for the third LRFR in Table 3, which has $\lambda = 0.05$. We select the rank of LRFR-TUCKER with a grid search over $r_1 \in \{10, 20, d_1\}$, $r_2 \in \{20, 50, d_2\}$ and $r_3 \in \{50, 100, 200\}$. For LRFR-CP, we select $r \in \{50, 100, 200\}$. For the PP-attachment task there is no $r_1$, since it uses a ranking model; for preposition disambiguation we do not tune $r_1$, since the number of labels is small.

Results
Relation Extraction All LRFR-TUCKER models improve over BASELINE and FCM (Table 3), making these the best reported numbers for this task. However, LRFR-CP does not work as well on features with only one lexical part; the Tucker form does a better job of capturing interactions between different views. In the limited-training setting, we find that LRFR-CP does best.
Additionally, the primary advantage of the CP approximation is its reduction in the number of model parameters and in running time. We report each model's running time for a single pass over the development set: LRFR-CP is by far the fastest. The first three LRFR-TUCKER models are slightly slower than FCM, because they work on dense non-lexical property embeddings while FCM benefits from sparse vectors.

PP-attachment Table 4 shows that LRFR (89.6 and 90.3) improves over the previous best standalone system, HPCD (88.7), by a large margin, with exactly the same resources. Belinkov et al. (2014) also reported results for parsers and parser re-rankers, which have access to additional resources (complete parses for training and complete sentences as input), so it is unfair to compare them with standalone systems like HPCD and our LRFR.

Table 4 (excerpt): System; Resources Used; Acc. SVM (Belinkov et al., 2014): distance, word, embedding, clusters, POS, WordNet, VerbNet; 86.0. HPCD (Belinkov et al., 2014): distance, ...

Nonetheless, we also consider reduced variants of the RBG parser, including one with grand-head-modifier conjunctions removed (89.3). Note that compared to LRFR, RBG benefits from binary features, which also exploit grand-head-modifier structures. Yet the above reduced models still work better than RBG (88.4) without using additional resources (still, this is not an entirely fair comparison, since the training objectives differ; using RBG's factorization and training with our objective would give a fair comparison, which we leave to future work). Moreover, the results of LRFR can still potentially be improved by combining with binary features. The above results show the advantage of our factorization method, which allows for utilizing pre-trained word embeddings and can thus benefit from semi-supervised learning.
Preposition Disambiguation LRFR improves (Table 5) over the best methods (SUM and concatenation) in Ritter et al. (2014), as well as over the SVM based on the original lexical features (85.1). In this task LRFR$_1$-BROWN better represents the unigram and bigram lexical features, compared to using two low-rank tensors (LRFR$_1$-TUCKER & LRFR$_2$-CP). This may be because LRFR$_1$-BROWN has fewer parameters, which is better for smaller training sets. We also include a control setting (LRFR$_1$-BROWN-Control), which has a full-rank parameter tensor with the same inputs on each view as LRFR$_1$-BROWN, but represented as one-hot vectors without transformation to the hidden representations $h$. This is equivalent to an SVM with compound cluster features as in Koo et al. (2008). It performs much worse than LRFR$_1$-BROWN, showing the advantage of using word embeddings and low-rank tensors.

Summary For unigram lexical features, LRFR$_n$-TUCKER achieves better results than LRFR$_n$-CP. However, in settings with fewer training examples, with features with more lexical parts ($n$-grams), or when faster prediction is advantageous, LRFR$_n$-CP does best, as it has fewer parameters to estimate. For $n$-grams of variable length, LRFR$_1$-TUCKER & LRFR$_2$-CP does best; in settings with fewer training examples, LRFR$_1$-BROWN does best, as it has only one parameter tensor to estimate.

Related Work
Prior work on representation learning for features learns a separate representation for each feature, e.g. alternating structure optimization (Ando and Zhang, 2005), denoising autoencoders (Vincent et al., 2008), and feature embeddings (Yang and Eisenstein, 2015). These methods treat features as atomic elements and ignore their inner structure, so they learn a separate embedding for each feature without shared parameters. As a result, they still suffer from large parameter spaces when the feature space is huge.5 Another line of research studies the inner structure of lexical features: Koo et al. (2008), Turian et al. (2010), Sun et al. (2011), Nguyen and Grishman (2014), Roth and Woodsend (2014), and Hermann et al. (2014) used pre-trained word embeddings to replace the lexical parts of features; Srikumar and Manning (2014), Gormley et al. (2015) and Yu et al. (2015) proposed splitting lexical features into different parts and employing tensors to perform classification. These can therefore be seen as special cases of our model that embed only a certain part (view) of the complex features. This restriction also makes their model parameters form a full-rank tensor, resulting in data sparsity and high computational costs when the tensors are large.
Composition Models (Deep Learning) build representations for structures from their component word embeddings (Collobert et al., 2011; Bordes et al., 2012; Socher et al., 2012; Socher et al., 2013b). Using only word embeddings, these models have achieved success on several NLP tasks, but they sometimes fail to learn useful syntactic or semantic patterns beyond the strength of combinations of word embeddings, such as the dependency relation in Figure 1(a). To tackle this problem, some work designs model structures around a specific kind of linguistic pattern, e.g. dependency paths (Ma et al., 2015; Liu et al., 2015), while a recent trend enhances compositional models with linguistic features. For example, Belinkov et al. (2014) concatenate embeddings with linguistic features before feeding them to a neural network; Socher et al. (2013a) and Hermann and Blunsom (2013) enhance Recursive Neural Networks by refining the transformation matrices with linguistic features (e.g. phrase types). These models are similar to ours in that they learn representations based on linguistic features and embeddings.
Low-Rank Tensor Models for NLP aim to handle the conjunction among different views of features (Cao and Khudanpur, 2014; Lei et al., 2014; Chen and Manning, 2014). Yu and Dredze (2015) proposed a model for composing phrase embeddings from words that has an equivalent form to our CP-based method under certain restrictions. Our work applies a similar idea to exploiting the inner structure of complex features, and can handle $n$-gram features with different $n$. Our factorization (§3) is general and easy to adapt to new tasks; more importantly, it lets the model benefit from pre-trained word embeddings, as shown by the PP-attachment results.

Conclusion
We have presented LRFR, a feature representation model that exploits the inner structure of complex lexical features and applies a low-rank tensor to efficiently score features with this representation. LRFR attains the state of the art on several tasks, including relation extraction, PP-attachment and preposition disambiguation. We make our implementation available for general use.6

Figure 1:
An example of lexical features used in dependency parsing. To predict the "PMOD" arc (the dashed one) between "see" and "with" in (a), we may rely on the lexical features in (b); here $p$, $c$, $g$ are the indices of the word "with", its child ("telescope") and a candidate head. Figure (c) shows what the fifth feature ($\phi$) looks like when the candidate is "see". As is common in multi-class classification tasks, each template generates a different feature for each label $y$; thus a feature $\phi = w_g \wedge w_c \wedge u \wedge y$ is the conjunction of the four parts. Figure (d) is the one-hot representation of $\phi$, which is equivalent to the outer product (i.e. a 4-way tensor) of the four one-hot vectors. $v(x) = 1$ means the vector $v$ has a single non-zero element at position $x$.

Table 1:
Statistics of each task. PP-attachment and preposition disambiguation have both unigram and bigram features; we therefore list the numbers of non-lexical properties for both types.

Table 2:
Up-left: unigram lexical features (only showing non-lexical parts) for relation extraction (from Yu et al. (2014)). We denote the two target entities as $M_1$, $M_2$ (with head indices $h_1$, $h_2$ and NE types $t_{h_1}$, $t_{h_2}$), and their dependency path as $P$. Right: uni/bi-gram features for PP-attachment. Each feature is defined on a tuple $(w_m, w_p, w_h)$, where $w_p$ is the preposition word, $w_m$ is the child of the preposition, and $w_h$ is a candidate head of $w_p$; $t(w)$ is the POS tag of word $w$, $p(w)$ a preposition collocation of verb $w$ from VerbNet, $r(w)$ the root hypernym of word $w$ in WordNet, and $Dis(\cdot, \cdot)$ the number of candidate heads between two words. Down-left: uni/bi-gram features for preposition disambiguation (for each preposition word $p$, its modifier noun $w_m$ and head noun $w_h$). Since the sentences differ from one another only in $p$, $w_m$ and $w_h$, we ignore the words in the other positions.

Table 3:
Results on test for relation extraction. Y(es)/N(o) indicates whether embeddings are updated during training.

Table 5:
Accuracy for spatial classification of PPs.