Low-Rank Regularization for Sparse Conjunctive Feature Spaces: An Application to Named Entity Classification

Entity classification, like many other important problems in NLP, involves learning classifiers over sparse high-dimensional feature spaces that result from the conjunction of elementary features of the entity mention and its context. In this paper we develop a low-rank regularization framework for training max-entropy models in such sparse conjunctive feature spaces. Our approach handles conjunctive feature spaces using matrices and induces an implicit low-dimensional representation via low-rank constraints. We show that when learning entity classifiers under minimal supervision, using a seed set, our approach is more effective in controlling model capacity than standard techniques for linear classifiers.


Introduction
Many important problems in NLP involve learning classifiers over sparse high-dimensional feature spaces that result from the conjunction of elementary features. For example, to classify an entity in a document, it is standard to exploit features of the left and right context in which the entity occurs as well as spelling features of the entity mention itself. These sets of features can be grouped into vectors which we call elementary feature vectors. In our example, there will be one elementary feature vector for the left context, one for the right context and one for the features of the mention. Observe that, when the elementary vectors consist of binary indicator features, the outer product of any pair of vectors represents all conjunctions of the corresponding elementary features.
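The observation about outer products can be illustrated with a small sketch. The feature names and vectors below are hypothetical and chosen only for illustration; they are not taken from our feature set.

```python
import numpy as np

# Hypothetical elementary features (illustrative names only).
left_names = ["prev=the", "prev=in", "prev=mr."]
right_names = ["next=said", "next=,"]

# Binary indicator vectors for one mention: "mr." on the left, "said" on the right.
phi_l = np.array([0, 0, 1])
phi_r = np.array([1, 0])

# The outer product has one cell per (left, right) pair; a cell is 1 exactly
# when both elementary features fire, i.e. when the conjunction is active.
conjunctions = np.outer(phi_l, phi_r)

# List the active conjunctions by name.
active = [(left_names[i], right_names[j])
          for i, j in zip(*np.nonzero(conjunctions))]
```

Here `conjunctions` has one entry per pair of elementary features, so its size grows as the product of the sizes of the elementary vectors, while only a handful of entries are active for any given mention.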
Ideally, we would like to train a classifier that can leverage all conjunctions of elementary features, since among them there might be some that are discriminative for the classification task at hand. However, allowing for such an expressive high-dimensional feature space comes at a cost: data sparsity becomes a key challenge, and controlling the capacity of the model is crucial to avoid overfitting the training data.
The problem of data sparsity is even more severe when the goal is to train classifiers with minimal supervision, i.e. small training sets. For example, in the entity classification setting we might be interested in training a classifier using only a small set of examples of each entity class. This is a typical scenario in an industrial setting, where developers are interested in classifying entities according to their own classification schema and can only provide a handful of examples of each class.
A standard approach to control the capacity of a linear classifier is to use $\ell_1$ or $\ell_2$ regularization on the parameter vector. However, this type of regularization does not seem to be effective when dealing with sparse conjunctive feature spaces. The main limitation is that $\ell_1$ and $\ell_2$ regularization cannot let the model give weight to conjunctions that have not been observed at training. Without such ability it is unlikely that the model will generalize to novel examples, where most of the conjunctions will be unseen in the training set.
Of course, one could impose a strong prior on the weight vector so that it assigns weight to unseen conjunctions, but how can we build such a prior? What kind of reasonable constraints can we put on unseen conjunctions?
Another common approach to handle high-dimensional conjunctive feature spaces is to manually design the feature function so that it includes only a subset of "relevant" conjunctions. But designing such a feature function can be time consuming, and one might need to design a new feature function for each classification task. Ideally, we would have a learning algorithm that does not require such feature engineering and that can automatically leverage rich conjunctive feature spaces.
In this paper we present a solution to this problem by developing a regularization framework specifically designed for sparse conjunctive feature spaces. Our approach results in a more effective way of controlling model capacity and it does not require feature engineering.
Our strategy is based on:
• Employing tensors to define the scoring function of a max-entropy model as a multilinear form that computes weighted inner products between elementary vectors.
• Forcing the model to induce low-dimensional embeddings of elementary vectors via low-rank regularization on the tensor parameters.
The proposed regularization framework is based on a simple conceptual trick. The standard approach to handle conjunctive feature spaces in NLP is to regard the parameters of the linear model as long vectors computing an inner product with a high-dimensional feature representation that explicitly lists all possible conjunctions. Instead, the parameters of our model will be tensors, and the compatibility score between an input pattern and a class will be defined as the sum of multilinear functions over elementary vectors.
We then show that the rank of the tensor has a very natural interpretation. It can be seen as the intrinsic dimensionality of a latent embedding of the elementary feature vectors. Thus, by imposing a low-rank penalty on the tensor parameters we are encouraging the model to induce a low-dimensional projection of the elementary feature vectors. Using the rank itself as a regularization constraint in the learning algorithm would result in a non-convex optimization. Instead, we follow a standard approach, which is to use the nuclear norm as a convex relaxation of the rank.
In summary, the main contributions of this paper are:
• We develop a new regularization framework for training max-entropy models in high-dimensional sparse conjunctive feature spaces. Since the proposed regularization implicitly induces a low-dimensional embedding of feature vectors, our algorithm can also be seen as a way of implicitly learning a latent variable model.
• We present a simple convex learning algorithm for training the parameters of the model.
• We conduct experiments on learning entity classifiers with minimal supervision. Our results show that the proposed regularization framework is better for sparse conjunctive feature spaces than standard $\ell_2$ and $\ell_1$ regularization. These results lead us to conclude that encouraging the max-entropy model to operate on a low-dimensional space is an effective way of controlling the capacity of the model and ensuring good generalization.

Entity Classification with Log-linear Models
The formulation we develop in this paper applies to any prediction task whose inputs are some form of tuple. We focus on classification of entity mentions, or entities in the context of a sentence. Formally, our input objects are tuples $x = \langle l, e, r \rangle$ consisting of an entity $e$, a left context $l$ and a right context $r$. The goal is to classify $x$ into one entity class in the set $\mathcal{Y}$. We will use log-linear models of the form:
$$\Pr(y \mid x; \theta) = \frac{\exp\{s_\theta(x, y)\}}{\sum_{y' \in \mathcal{Y}} \exp\{s_\theta(x, y')\}} \qquad (1)$$
where $s_\theta : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is a scoring function of entity tuples with a candidate class, and $\theta$ are the parameters of this function, to be specified below.
In the literature it is common to employ a feature-based linear model. That is, one defines a feature function $\phi : \mathcal{X} \to \{0, 1\}^n$ that represents entity tuples in an $n$-dimensional binary feature space, and the model has a weight vector for each class, $\theta = \{w_y\}_{y \in \mathcal{Y}}$. Then $s_\theta(x, y) = \phi(x) \cdot w_y$.
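As a minimal sketch of such a feature-based log-linear classifier, the following computes class probabilities by normalizing the exponentiated scores $\phi(x) \cdot w_y$ over classes. The function name, the toy feature vector and the random weights are assumptions made for illustration.

```python
import numpy as np

def class_probabilities(phi_x, weights):
    """Log-linear model: softmax over the per-class scores phi(x) . w_y.

    phi_x:   binary feature vector of shape (n,)
    weights: matrix of shape (num_classes, n), one row w_y per class
    """
    scores = weights @ phi_x
    scores = scores - scores.max()      # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
phi_x = np.array([1.0, 0.0, 1.0, 0.0, 0.0])   # toy binary features
weights = rng.normal(size=(3, 5))             # toy parameters for 3 classes
probs = class_probabilities(phi_x, weights)
```

The predicted class is the one with the highest score, since the normalization does not change the ranking of the classes.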

Low-rank Entity Classification Models
In this section we propose a specific family of models for classifying entity tuples.

A Low-rank Model of Left-Right Contexts
We start from the observation that when representing tuple objects such as $x = \langle l, e, r \rangle$ with features, we often depart from a feature representation of each element of the tuple. Hence, let $\phi_l$ and $\phi_r$ be two feature functions representing left and right contexts, with $d_1$ and $d_2$ binary dimensions respectively. For now, we will define a model that ignores the entity mention $e$ and makes predictions using context features. It is natural to define conjunctions of left and right features. Hence, in its most general form, one can define a matrix $W_y \in \mathbb{R}^{d_1 \times d_2}$ for each class, such that $\theta = \{W_y\}_{y \in \mathcal{Y}}$ and the score is:
$$s_\theta(\langle l, e, r \rangle, y) = \phi_l(l)^\top W_y \, \phi_r(r) \qquad (2)$$
Note that this corresponds to a feature-based linear model operating in the product space of $\phi_l$ and $\phi_r$, that is, the score has one term for each pair of features:
$$s_\theta(\langle l, e, r \rangle, y) = \sum_{i=1}^{d_1} \sum_{j=1}^{d_2} \phi_l(l)[i] \, \phi_r(r)[j] \, W_y[i, j]$$
Note also that it is trivial to include elementary features of $\phi_l$ and $\phi_r$, in addition to conjunctions, by having a constant dimension in each of the two representations set to 1.
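The equivalence between the bilinear score and a linear model over explicit conjunctions can be checked numerically. The sketch below uses toy dimensions and random parameters, which are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 4, 3
W_y = rng.normal(size=(d1, d2))          # toy parameter matrix for one class
phi_l = np.array([1.0, 0.0, 1.0, 0.0])   # toy left-context indicators
phi_r = np.array([0.0, 1.0, 1.0])        # toy right-context indicators

# Bilinear form: phi_l^T W_y phi_r
score_bilinear = phi_l @ W_y @ phi_r

# Same score written with one weight W_y[i, j] per pair of elementary features.
score_pairs = sum(phi_l[i] * phi_r[j] * W_y[i, j]
                  for i in range(d1) for j in range(d2))

# Equivalent view: inner product between the flattened outer product
# (the explicit conjunctive feature vector) and the flattened weight matrix.
score_flat = np.outer(phi_l, phi_r).ravel() @ W_y.ravel()
```

All three quantities coincide, which is why keeping the parameters as a matrix loses nothing with respect to the usual long weight vector while exposing structure (the rank) that a vector does not have.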
In all, the model in Eq. (2) is very expressive, with the caveat that it can easily overfit the data, especially when we work only with a handful of labeled examples. The standard way to control the capacity of a linear model is via $\ell_1$ or $\ell_2$ regularization.
Regarding our parameters as matrices allows us to control the capacity of the model via regularizers that favor parameter matrices with low rank. To see the effect of these regularizers, consider that $W_y$ has rank $k$, and let $W_y = U_y \Sigma_y V_y^\top$ be the singular value decomposition, where $U_y \in \mathbb{R}^{d_1 \times k}$ and $V_y \in \mathbb{R}^{d_2 \times k}$ are orthonormal projections and $\Sigma_y \in \mathbb{R}^{k \times k}$ is a diagonal matrix of singular values. We can rewrite the score function as
$$s_\theta(\langle l, e, r \rangle, y) = (\phi_l(l)^\top U_y) \, \Sigma_y \, (V_y^\top \phi_r(r)) \qquad (3)$$
In words, the rank $k$ is the intrinsic dimensionality of the inner product behind the score function. A low-rank regularizer will favor parameter matrices that have low intrinsic dimensionality. Below we describe a convex optimization for low-rank models using nuclear norm regularization.
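The factorized form of the score can be sketched as follows. The matrix, its dimensions and the random indicator vectors are toy assumptions; the point is only that a rank-$k$ matrix scores a pair of contexts through two $k$-dimensional projections.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, k = 6, 5, 2

# Build a toy rank-k parameter matrix.
W_y = rng.normal(size=(d1, k)) @ rng.normal(size=(k, d2))

U, s, Vt = np.linalg.svd(W_y, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]   # keep the k non-zero directions

phi_l = rng.integers(0, 2, size=d1).astype(float)
phi_r = rng.integers(0, 2, size=d2).astype(float)

left_embedding = phi_l @ U_k        # k-dimensional projection of the left context
right_embedding = Vt_k @ phi_r      # k-dimensional projection of the right context

score_full = phi_l @ W_y @ phi_r
score_low_dim = left_embedding @ (s_k * right_embedding)
```

The two scores agree up to numerical precision, so the model behaves as a weighted inner product between $k$-dimensional embeddings of the left and right contexts.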

Adding Entity Features
The model above classifies entities based only on the context. Here we propose an extension to make use of features of the entity. Let $\mathcal{T}$ be a set of possible entity feature tags, i.e. tags that describe an entity, such as ISCAPITALIZED, CONTAINSDIGITS, SINGLETOKEN, . . . Let $\phi_e$ be a feature function representing entities. For this case, to simplify our expression, we will use a set notation and denote by $\phi_e(e) \subseteq \mathcal{T}$ the set of feature tags that describe $e$. Our model will be defined with one parameter matrix per feature tag and class label, i.e. $\theta = \{W_{t,y}\}_{t \in \mathcal{T}, y \in \mathcal{Y}}$. The model form is:
$$s_\theta(\langle l, e, r \rangle, y) = \sum_{t \in \phi_e(e)} \phi_l(l)^\top W_{t,y} \, \phi_r(r) \qquad (4)$$
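A minimal sketch of this extended scoring function is given below: the bilinear context score is summed over the tags that describe the entity. The dictionary layout, the class subset and the random parameters are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 4, 3
tags = ["ISCAPITALIZED", "CONTAINSDIGITS", "SINGLETOKEN"]
classes = ["PER", "LOC"]

# One parameter matrix per (tag, class) pair; toy random values.
W = {(t, y): rng.normal(size=(d1, d2)) for t in tags for y in classes}

def score(phi_l, phi_r, entity_tags, y):
    """Sum the bilinear context score over the tags that describe the entity."""
    return sum(phi_l @ W[(t, y)] @ phi_r for t in entity_tags)

phi_l = np.array([1.0, 0.0, 0.0, 1.0])
phi_r = np.array([0.0, 1.0, 0.0])
entity_tags = {"ISCAPITALIZED", "SINGLETOKEN"}   # tags of a hypothetical mention

scores = {y: score(phi_l, phi_r, entity_tags, y) for y in classes}
```

Because the score is linear in the parameters, summing over tags is the same as applying a single matrix obtained by adding the matrices of the active tags.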

Learning with Low-rank Constraints
In this section we describe a convex procedure to learn models of the above form that have low rank. We will define an objective that combines a loss and a regularization term.
Our first observation is that our parameters are a tensor with up to four axes, namely left and right context representations, entity features, and entity classes. While a matrix has a clear definition of rank, it is not the case for general tensors, and there exist various definitions in the literature. The technique that we use is based on matricization of the tensor, that is, turning the tensor into a matrix that has the same parameters as the tensor but organized in two axes. This is done by partitioning the tensor axes into two sets, one for matrix rows and another for columns. Once the tensor has been turned into a matrix, we can use the standard definition of matrix rank. A main advantage of this approach is that we can make use of standard routines like singular value decomposition (SVD) to decompose the matricized tensor. This is the main reason behind our choice.
In general, different ways of partitioning the tensor axes will lead to different notions of intrinsic dimensions. In our case we choose the left context axis as the row dimension, and the rest of the axes as the column dimension. In this section, we will denote as $W$ the matricized version of the parameters $\theta$ of our models.
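Matricization can be sketched with a simple reshape. The axis order (left, right, entity tag, class) and the toy sizes are assumptions made for illustration; any fixed order of the column axes yields a matrix with the same rank, since it only permutes columns.

```python
import numpy as np

d1, d2, num_tags, num_classes = 4, 3, 2, 5
rng = np.random.default_rng(0)

# Toy parameter tensor with axes (left, right, entity tag, class).
theta = rng.normal(size=(d1, d2, num_tags, num_classes))

# Matricization: left-context axis as rows, all remaining axes as columns.
W = theta.reshape(d1, d2 * num_tags * num_classes)

# The matrix holds exactly the same parameters; reshaping back recovers the tensor.
theta_back = W.reshape(d1, d2, num_tags, num_classes)

rank = np.linalg.matrix_rank(W)   # standard matrix rank of the matricized tensor
```

With this choice the rank is bounded by the number of left-context features, and a low rank means that all classes, tags and right contexts share a small set of directions in the left-context space.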
The second observation is that minimizing the rank of a matrix is a non-convex problem. We make use of a convex relaxation based on the nuclear norm (Srebro and Shraibman, 2005). The nuclear norm of a matrix $W$, denoted $\|W\|_*$, is the sum of its singular values: $\|W\|_* = \sum_i \Sigma_{i,i}$, where $W = U \Sigma V^\top$ is the singular value decomposition of $W$. This norm has been used in several applications in machine learning as a convex surrogate for imposing low rank, e.g. (Srebro et al., 2004).
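For concreteness, the nuclear norm can be computed from the singular values, as in the sketch below on a toy rank-2 matrix (the matrix and the tolerance used to count non-zero singular values are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 7))   # toy rank-2 matrix

singular_values = np.linalg.svd(W, compute_uv=False)
nuclear_norm = singular_values.sum()

# NumPy exposes the same quantity directly.
nuclear_norm_np = np.linalg.norm(W, ord="nuc")

# The rank counts non-zero singular values; the nuclear norm sums them,
# playing for singular values the role that the l1 norm plays for entries.
rank = int((singular_values > 1e-10).sum())
```

The analogy with $\ell_1$ is the usual intuition for the relaxation: penalizing the sum of singular values tends to drive many of them to zero, which lowers the rank.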
Thus, the nuclear norm is used as a regularizer. With this, we define our objective as follows:
$$\operatorname*{argmin}_{W} \; L(W) + \tau R(W) \qquad (5)$$
where $L(W)$ is a convex loss function, $R(W)$ is a regularizer, and $\tau$ is a constant that trades off error and capacity. In experiments we will compare nuclear norm regularization with $\ell_1$ and $\ell_2$ regularizers. In all cases we use the negative log-likelihood as loss function, denoting the training data as $D$:
$$L(W) = \sum_{(x, y) \in D} -\log \Pr(y \mid x; W) \qquad (6)$$
To solve the objective in Eq. (5) we use a simple optimization scheme known as forward-backward splitting (FOBOS) (Duchi and Singer, 2009). In a series of iterations, this algorithm performs a gradient update followed by a proximal projection of the parameters. Such projection depends on the regularizer used: for $\ell_1$ it thresholds the parameters; for $\ell_2$ it scales them; and for nuclear-norm regularization it thresholds the singular values. This means that, for nuclear norm regularization, each iteration requires decomposing $W$ using SVD. See (Madhyastha et al., 2014) for details about this optimization for a related application.
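The three proximal steps and one forward-backward iteration can be sketched as below. This is an illustrative sketch under stated assumptions, not the implementation used in our experiments: the function names are hypothetical, the $\ell_2$ case assumes a squared penalty, and the loss in the usage example is a toy quadratic rather than the negative log-likelihood.

```python
import numpy as np

def prox_l1(W, t):
    """Soft-threshold each parameter (proximal operator of t * ||W||_1)."""
    return np.sign(W) * np.maximum(np.abs(W) - t, 0.0)

def prox_l2_squared(W, t):
    """Rescale the parameters (proximal operator of (t / 2) * ||W||_F^2)."""
    return W / (1.0 + t)

def prox_nuclear(W, t):
    """Soft-threshold the singular values (proximal operator of t * ||W||_*)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

def fobos_step(W, grad_loss, step_size, tau, prox):
    """One forward-backward iteration: gradient step, then proximal step."""
    W_half = W - step_size * grad_loss(W)
    return prox(W_half, step_size * tau)

# Toy usage: minimize 0.5 * ||W - A||_F^2 + tau * ||W||_*.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

def grad(W):
    return W - A   # gradient of the toy quadratic loss

W = np.zeros_like(A)
for _ in range(200):
    W = fobos_step(W, grad, step_size=0.5, tau=1.0, prox=prox_nuclear)
```

For this toy loss the minimizer is known in closed form (singular value thresholding of `A`), which gives a convenient sanity check of the iteration. In the actual model, `grad` would be the gradient of the negative log-likelihood with respect to the matricized parameters.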

Related Work
The main aspect of our approach is the use of a spectral penalty (i.e., the rank) to control the capacity of multilinear functions parameterized by matrices or tensors. Previous work used nuclear-norm regularization to learn latent-variable max-margin sequence taggers. Madhyastha et al. (2014) defined bilexical distributions parameterized by matrices which result in lexical embeddings tailored for a particular linguistic relation. Like in our case, the low-dimensional latent projections in these papers are learned implicitly by imposing low-rank constraints on the predictions of the model. Lei et al. (2014) also use low-rank tensor learning in the context of dependency parsing, where like in our case dependencies are represented by conjunctive feature spaces. While the motivation is similar, their technical solution is different. We use the technique of matricization of a tensor combined with a nuclear-norm relaxation to obtain a convex learning procedure. In their case they explicitly look for a low-dimensional factorization of the tensor using a greedy alternating optimization.
Also recently, entity classification has been framed as a low-rank matrix completion problem. The idea is based on the fact that if two entities (in rows) have similar descriptions (in columns) they should have similar classes. The low-rank structure of the matrix defines intrinsic representations of entities and feature descriptions. The same idea was applied to relation extraction, using a matrix of entity pairs times descriptions that corresponds to a matricization of an entity-entity-description tensor. Very recently Singh et al. (2015) explored alternative ways of applying low-rank constraints to tensor-based relation extraction.
Another aspect of this paper is training entity classification models using minimal supervision, which has been addressed by multiple works in the literature. A classical successful approach for this problem is to use co-training (Blum and Mitchell, 1998): learn two classifiers that use different views of the data by using each other's predictions. In the same line, Collins and Singer (1999) trained entity classifiers by bootstrapping from an initial set of seeds, using a boosting version of co-training. Seed sets have also been exploited by graphical model approaches. Haghighi and Klein (2006) define a graphical model that is soft-constrained such that the prediction for an unlabeled example agrees with the labels of seeds that are distributionally similar. Li et al. (2010) present a Bayesian approach to expand an initial seed set, with the goal of creating a gazetteer.
Another approach to entity recognition that, like in our case, learns projections of contextual features is the method by Ando and Zhang (2005). They define a set of auxiliary tasks, which can be supervised using unlabeled data, and find a projection of the data that works well as input representation for the auxiliary tasks. This representation is then used for the target task. More recently Neelakantan and Collins (2014) presented another approach to gazetteer expansion using an initial seed. A novel aspect is the use of Canonical Correlation Analysis (CCA) to compute embeddings of entity contexts, which are used by the named entity classifier. Like in our case, their method learns a compressed representation of contexts that helps prediction.
Table 1: For each entity class, the seed of entities for the 10-30 set, together with the number of mentions in the training data that involve entities in the seed for various sizes of the seeds.

Experiments
In this section we evaluate our regularization framework for training models in high-dimensional sparse conjunctive feature spaces. We run experiments on learning entity classifiers with minimal supervision. We focus on classification of unseen entities to highlight the ability of the regularizer to generalize over conjunctions that are not observed at training. We simulate minimal supervision using the CoNLL-2003 Shared Task data (Tjong Kim Sang and De Meulder, 2003), and compare the performance to $\ell_1$ and $\ell_2$ regularizers.

Minimal Supervision Task
We use a minimal supervision setting where we provide the algorithm a seed of entities for each class, that is, a list of entities that is representative for that class. The assumption is that any mention of an entity in the seed is a positive example for the corresponding class. Given unlabeled data and a seed of entities for each class, the goal is to learn a model that correctly classifies mentions of entities that are not in the seed. In addition to standard entity classes, we also consider a special non-entity class, which is part of the classification but is excluded from evaluation.
Note that named entity classification for unseen entities is a challenging problem. Even in the standard fully-supervised scenario, when we measure the performance of state-of-the-art methods on unseen entities, the F1 values are in the range of 60%. This represents a significant drop with respect to the standard metrics for named entity recognition, which consider all entity mentions of the test set irrespective of whether they appear in the training data or not, and where F1 values at 90% levels are obtained (e.g. (Ratinov and Roth, 2009)). This suggests that part of the success of state-of-the-art models is in storing known entities together with their type (in the form of gazetteers or directly in lexicalized parameters of the model).

Setting
We use the CoNLL-2003 English data, which is annotated with four types: person (PER), location (LOC), organization (ORG), and miscellaneous (MISC). In addition, the data is tagged with parts-of-speech (PoS), and we compute word clusters running the Brown clustering algorithm (Brown et al., 1992) on the words in the training set.
We consider annotated entity phrases as candidate entities, and all single nouns that are not part of an entity as candidate non-entities (O). Both candidate entities and non-entities will be referred to as candidates in the remainder of this section. We lowercase all candidates and remove the ambiguous ones (i.e., those with more than one label in different mentions). To simulate minimal supervision, we create supervision seeds by picking the n most frequent training candidates for entity types, and the m most frequent candidate non-entities. We create seeds of various sizes n-m, namely 10-30, 40-120, 640-1920, as well as all of the candidates. For each seed, the training set consists of all training mentions that involve entities in the seed. Table 1 shows the smallest seed, as well as the number of mentions for each seed size.
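The seed construction can be sketched as follows. The function name and the toy mentions are hypothetical, and the sketch assumes that n is applied separately to each entity type; it also assumes that lowercasing and the removal of ambiguous candidates have already been done.

```python
from collections import Counter

def build_seed(mentions, n, m, non_entity_label="O"):
    """Pick the n most frequent candidates per entity type and the m most
    frequent non-entity candidates; return the seed and its training mentions.

    mentions: list of (candidate, label) pairs.
    """
    counts_by_label = {}
    for candidate, label in mentions:
        counts_by_label.setdefault(label, Counter())[candidate] += 1

    seed = {}
    for label, counts in counts_by_label.items():
        size = m if label == non_entity_label else n
        seed[label] = {cand for cand, _ in counts.most_common(size)}

    # The training set keeps every mention whose candidate is in the seed.
    training = [(cand, label) for cand, label in mentions if cand in seed[label]]
    return seed, training

# Toy mentions, for illustration only.
mentions = [("london", "LOC"), ("london", "LOC"), ("paris", "LOC"),
            ("clinton", "PER"), ("clinton", "PER"), ("yeltsin", "PER"),
            ("year", "O"), ("year", "O"), ("week", "O"), ("week", "O"),
            ("day", "O")]
seed, training = build_seed(mentions, n=1, m=2)
```

Mentions of candidates outside the seed (here "paris", "yeltsin" and "day") are left out of the training set, which mirrors the setting where only seed entities provide supervision.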
For evaluation we use the development and test sections of the data, but we remove the instances of candidates in the training data (i.e., that are in the all seed). We do not remove instances that are ambiguous in the tests. As the evaluation metric we use the average F1 score computed over all entity types, excluding the non-entity type.
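One way to compute such a metric is sketched below. The sketch assumes an unweighted (macro) average of per-class F1 scores, which is one reading of "average F1"; the function name and the toy labels are hypothetical.

```python
def average_f1(gold, predicted, excluded=("O",)):
    """Average of per-class F1 over all gold classes except the excluded ones."""
    classes = sorted(set(gold) - set(excluded))
    f1_scores = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, predicted))
        fp = sum(g != c and p == c for g, p in zip(gold, predicted))
        fn = sum(g == c and p != c for g, p in zip(gold, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

gold      = ["PER", "PER", "LOC", "O", "O",   "LOC"]
predicted = ["PER", "LOC", "LOC", "O", "PER", "LOC"]
score = average_f1(gold, predicted)   # PER F1 = 0.5, LOC F1 = 0.8
```

Note that predictions of an entity class on non-entity candidates still count as false positives for that class, even though the non-entity class itself is not averaged.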

Context Representations
We refer to context as the sequence of tokens before (left context) and after (right context) a candidate mention in a sentence. Different classifiers can be built using different representations of the contexts. For example we can change the window size of the context sequence (i.e., for a window size of 1 we only use the last token before the mention and the first token after the mention). We can treat the left and right contexts independently of each other, we can treat them as a unique combination, or we can use both. We can also choose to use the word form of a token, its PoS tag, a word cluster, or a combination of these. We will use the elementary features that are more predictive and compact: clusters and PoS tags in windows of size at most 2.
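Extracting the left and right windows around a mention can be sketched as below. The function name and the example sentence are hypothetical; in practice each token would be mapped to its cluster label or PoS tag before building the indicator features.

```python
def context_windows(tokens, start, end, window=2):
    """Return the left and right context of the mention tokens[start:end]."""
    left = tokens[max(0, start - window):start]
    right = tokens[end:end + window]
    return left, right

tokens = ["yesterday", "mr.", "john", "smith", "said", "that"]
# The mention "john smith" spans positions 2 and 3.
left, right = context_windows(tokens, start=2, end=4, window=2)
```

With a window of 1 the same call returns only the last token before the mention and the first token after it.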

Comparing Regularizers
We compare the performance of models trained using the nuclear norm regularizer with models trained using $\ell_1$ and $\ell_2$ regularizers. To train each model, we validate the regularization parameter and the number of iterations on development data, trying a wide range of values. The best performing configuration is then used for the comparison. Figure 1 shows results on the development set for different feature sets. We started by representing context using cluster labels, as it is the most compact representation obtaining good results in preliminary experiments. We tried several conjunctions: a conjunction of the left and right context, as well as conjunctions of left and right contexts and features of the candidate entity. We also tried all different conjunction combinations of the contexts and the candidate entity features, as well as adding PoS tags to represent contexts. To represent an entity candidate we use standard traits of the spelling of the mention, such as capitalization, the existence of symbols, as well as the number of tokens in the candidate. See Table 3 for the definition of the features describing entity candidates. Using our richest feature set, the model obtains an accuracy of 76.76 on the development set for the task of classifying entities with correct boundaries. If we add features capturing the full entity and its tokens, then the accuracy is 87.63, which is similar to state-of-the-art performance (the best results in the literature typically exploit additional gazetteers). Since our evaluation focuses on unknown entities, our features do not include information about the word tokens of entities. We observe that for most conjunction settings our regularizer performs better than the $\ell_1$ and $\ell_2$ regularizers. Using the best model from each regularizer, we evaluated on the test set. Table 4 shows the test results. For all seed sets, the nuclear norm regularizer obtains the best average F1 performance.
This shows that encouraging the max-entropy model to operate on a low-dimensional space is effective. Moreover, Figure 2 shows model performance as a function of the number of dimensions of the intrinsic projection. The model obtains a good performance even if only a few intrinsic dimensions are used. Figure 3 shows the parameter matrix of the low-rank model in Figure 1f trained with the 10-30 seed, with respect to observed features in training and development data. Many of the conjunctions of the development set were never observed in the training set. Our regularization framework is able to propagate weights from the conjunctive features seen in training to unseen conjunctive features that are close to each other in the projected space (these are the yellow and red cells in the matrix). In contrast, $\ell_1$ and $\ell_2$ regularization techniques cannot put weight on unseen conjunctions.
Figure 3: Parameter matrix of the low-rank model in Figure 1f trained with the 10-30 seed, with respect to observations of the associated features in training and development. Non-white conjunctions correspond to non-zero weights: black is for conjunctions seen in both the training and development sets; blue is for those seen in training but not in the development; red indicates that the conjunctions were observed only in the development; yellow is for those not observed in training nor development.
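The difference in how the regularizers treat unseen conjunctions can be illustrated on a toy example, which is not part of our experiments. The 2x2 matrix below stands in for the parameters after a gradient step, and the thresholds are arbitrary assumptions; the helper functions are the standard soft-thresholding operators for entries and for singular values.

```python
import numpy as np

def prox_l1(W, t):
    """Soft-threshold each entry independently."""
    return np.sign(W) * np.maximum(np.abs(W) - t, 0.0)

def prox_nuclear(W, t):
    """Soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

# Rows are left features, columns are right features. Conjunctions
# (0,0), (0,1) and (1,0) were observed in training; (1,1) was not,
# so no gradient ever reaches that cell.
G = np.array([[1.0, 1.0],
              [1.0, 0.0]])

W_l1 = prox_l1(G, 0.5)        # entries are treated independently
W_nuc = prox_nuclear(G, 0.7)  # singular values are shrunk jointly
```

Entry-wise thresholding leaves the unseen cell at exactly zero, whereas thresholding the singular values yields a rank-1 matrix in which the unseen cell receives a non-zero weight, determined by how its row and column behave in the observed conjunctions.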

Conclusion
We have developed a low-rank regularization framework for training max-entropy models in sparse conjunctive feature spaces. Our formulation is based on using tensors to parameterize classifiers. We control the capacity of the model using the nuclear-norm of a matricization of the tensor. Overall, our formulation results in a convex procedure for training model parameters.
We have experimented with these techniques in the context of learning entity classifiers. Compared to $\ell_1$ and $\ell_2$ penalties, the low-rank model obtains better performance, without the need to manually specify feature conjunctions. In our analysis, we have illustrated how the low-rank approach can assign non-zero weights to conjunctions that were unobserved at training, but are similar to observed conjunctions with respect to the low-dimensional projection of their elements. We have used matricization of a tensor to define its rank, using a fixed transformation of the tensor into a matrix. Future work should explore how to efficiently combine different transformations.