Regularizing Text Categorization with Clusters of Words

Overview of the regularizers studied in this paper:

- Statistical regularizers. Network of features: Ω_net(θ) = λ_net Σ_k θ_k^⊤ M θ_k, where M = α(I − P)^⊤(I − P) + βI. Sentence regularizer: Ω_sen(θ) = Σ_{d=1}^{D} Σ_{s=1}^{S_d} λ_{d,s} ‖θ_{d,s}‖_2.
- Semantic regularizers. LDA and LSI regularizers: Ω_{LDA,LSI}(θ) = Σ_{k=1}^{K} λ‖θ_k‖_2, where K is the number of topics.
- Graphical regularizers. Graph-of-words regularizer (community detection on the document collection graph): Ω_gow(θ) = Σ_{c=1}^{C} λ‖θ_c‖_2, where c ranges over the C communities. Word2vec regularizer (K-means clustering on word2vec vectors): Ω_word2vec(θ) = Σ_{k=1}^{K} λ‖θ_k‖_2, where K is the number of clusters.


Introduction
Harnessing the full potential of text data has always been a key task for the NLP and ML communities. The properties hidden under the inherent high dimensionality of text are of major importance in tasks such as text categorization and opinion mining.
Although simple models like bag-of-words manage to perform well, the problem of overfitting still remains. Regularization, as shown by Chen and Rosenfeld (2000), is of paramount importance in Natural Language Processing, and more specifically in language modeling, structured prediction, and classification. In this paper we build upon the work of Yogatama and Smith (2014b), who introduce prior knowledge of the data as a regularization term. One of the most popular structured regularizers, the group lasso (Yuan and Lin, 2006), was proposed to avoid large L2 norms for groups of weights. In this paper, we propose novel linguistic structured regularizers that capitalize on the clusters learned from texts using word2vec and the graph-of-words document representation, and which can be seen as group lasso variants. The extensive experiments we conducted demonstrate that these regularizers can boost standard bag-of-words models in most of the tested cases of the text categorization task, by injecting additional, previously unused information as a bias.

Background & Notation
We place ourselves in the scenario where we consider a prediction problem, in our case text categorization, as a loss minimization problem, i.e. we define a loss function L(x, θ, y) that quantifies the loss between the prediction h_{θ,b}(x) of a classifier parametrized by a vector of feature weights θ and a bias b, and the true class label y ∈ Y associated with the example x ∈ X. Given a training set of N data points {(x_i, y_i)}_{i=1...N}, we want to find the optimal set of feature weights θ* such that:

θ* = argmin_θ Σ_{i=1}^{N} L(x_i, θ, y_i)    (1)

In the case of logistic regression with binary predictions (Y = {−1, +1}), h_{θ,b}(x) = θ^⊤x + b and L(x, θ, y) = log(1 + e^{−y h_{θ,b}(x)}) (log loss).
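As a concrete illustration of the setup above, the following sketch evaluates the linear decision function and the log loss on a toy example; all feature values and weights are made up for illustration.

```python
import numpy as np

def decision(x, theta, b):
    """Linear decision function h_{theta,b}(x) = theta^T x + b."""
    return float(np.dot(theta, x) + b)

def log_loss(x, theta, b, y):
    """Log loss for a single labeled example, with y in {-1, +1}."""
    margin = y * decision(x, theta, b)
    # log1p(exp(-m)) is the numerically safer form of log(1 + e^{-m})
    return float(np.log1p(np.exp(-margin)))

x = np.array([1.0, 0.0, 2.0])       # toy bag-of-words feature vector
theta = np.array([0.5, -0.3, 0.1])  # toy feature weights
b = -0.2
# an example on the correct side of the boundary incurs a smaller loss
print(log_loss(x, theta, b, y=+1), log_loss(x, theta, b, y=-1))
```

Note that the loss stays strictly positive even for correctly classified examples, which is what makes the log loss smooth and amenable to gradient-based optimization.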

Regularization
Only minimizing the empirical risk can lead to overfitting, that is, the model no longer learns the underlying pattern we are trying to capture but fits the noise contained in the training data, and thus generalizes more poorly (e.g., lower performance on the test set). For instance, along with some feature space transformations to obtain non-linear decision boundaries in the original feature space, one could imagine a decision boundary that follows every quirk of the training data. Additionally, if two hypotheses lead to similarly low empirical risks, one should select the "simpler" model for better generalization power, with simplicity assessed using some measure of model complexity.
Loss+Penalty Regularization takes the form of additional constraints on the minimization problem, i.e. a budget on the feature weights, which are often relaxed into a penalty term Ω(θ) controlled via a Lagrange multiplier λ. We refer to the book of Boyd and Vandenberghe (2004) for the theory behind convex optimization. The overall expected risk (Vapnik, 1991) is therefore the weighted sum of two components, the empirical risk and a regularization penalty term, an expression referred to as "Loss+Penalty" by Hastie et al. (2009). Given a training set of N data points {(x_i, y_i)}_{i=1...N}, we now want to find the optimal set of feature weights θ* such that:

θ* = argmin_θ [ Σ_{i=1}^{N} L(x_i, θ, y_i) + λΩ(θ) ]    (2)

L1 and L2 regularization The two most used penalty terms are known as L1 regularization, a.k.a. lasso (Tibshirani, 1996), and L2 regularization, a.k.a. ridge (Hoerl and Kennard, 1970), as they correspond to penalizing the model with respectively the L1 and L2 norm of the feature weight vector θ:

Ω_lasso(θ) = ‖θ‖_1    Ω_ridge(θ) = ‖θ‖_2^2

Prior on the feature weights L1 (resp. L2) regularization can be interpreted as adding a Laplacian (resp. Gaussian) prior on the feature weight vector. Indeed, given the training set, we want to find the most likely hypothesis h* ∈ H, i.e. the one with maximum a posteriori probability. For the derivation, we assume that the hypothesis h does not depend on the examples alone (Eq. 5) and that the N labeled training examples are drawn from an i.i.d. sample (Eq. 6). In that last form, we see that the loss function can be interpreted as a negative log-likelihood and the regularization penalty term as a negative log-prior over the hypothesis. Therefore, if we assume a multivariate Gaussian prior on the feature weight vector with mean vector 0 and covariance matrix Σ = σ²I (i.e. independent features with the same prior standard deviation σ), we obtain L2 regularization. Similarly, if we assume a multivariate Laplacian prior on the feature weight vector (i.e. θ_i ∼ Laplace(0, 1/λ)), we obtain L1 regularization. In practice, in both cases, the priors basically mean that we expect weights around 0 on average. The main difference between L1 and L2 regularization is that the Laplacian prior will explicitly set some feature weights to 0 (feature sparsity) while the Gaussian prior will only reduce their values (shrinkage).
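The sparsity-versus-shrinkage distinction above can be seen directly in the proximal operators of the two penalties, sketched below on toy weights; the step sizes and λ value are arbitrary illustrations, not values from the paper.

```python
import numpy as np

def prox_l1(theta, lam):
    """Soft-thresholding: proximal operator of lam * ||theta||_1 (Laplacian prior)."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

def prox_l2(theta, lam):
    """Shrinkage: proximal operator of (lam/2) * ||theta||_2^2 (Gaussian prior)."""
    return theta / (1.0 + lam)

theta = np.array([0.05, -0.6, 0.02, 1.3])
print(prox_l1(theta, lam=0.1))  # small weights are set exactly to 0
print(prox_l2(theta, lam=0.1))  # all weights shrink, none reaches 0
```

The L1 update zeroes out every weight whose magnitude falls below λ (feature sparsity), while the L2 update rescales all weights uniformly (shrinkage), matching the behavior described above.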

Structured regularization
In L1 and L2 regularization, features are considered independent, which makes sense without any additional prior knowledge. However, similar features have similar weights in the case of linear classifiers (equal weights for redundant features in the extreme case) and therefore, if we have some prior knowledge on the relationships between features, we should include that information for better generalization, i.e. include it in the regularization penalty term. Depending on how the similarity between features is encoded, e.g., through sets, trees (Kim and Xing, 2010; Liu and Ye, 2010) or graphs, the penalization term varies, but in any case we take into account the structure between features, hence the "structured regularization" terminology. It should not be confused with "structured prediction", where it is the outcome that is a structured object, as opposed to, classically, a scalar (e.g., a class label).
Group lasso Bakin (1999) and later Yuan and Lin (2006) proposed an extension of L1 regularization to encourage groups of features to either go to zero (as a group) or not (as a group), introducing group sparsity in the model. To do so, they proposed to regularize with the L_{1,2} norm of the feature weight vector:

Ω_glasso(θ) = Σ_g λ_g ‖θ_g‖_2

where θ_g is the subset of feature weights restricted to group g. Note that the groups can be overlapping (Jacob et al., 2009; Schmidt and Murphy, 2010; Jenatton et al., 2011; Yuan et al., 2011), even though this makes the optimization harder.
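A minimal sketch of the L_{1,2} penalty above: the sum over groups of the L2 norm of each group's sub-vector. The weights and (overlapping) groups below are made up for illustration.

```python
import numpy as np

def group_lasso_penalty(theta, groups, lam=1.0):
    """Omega(theta) = lam * sum_g ||theta_g||_2 over (possibly overlapping) groups."""
    return lam * sum(np.linalg.norm(theta[list(g)]) for g in groups)

theta = np.array([0.0, 0.0, 0.3, -0.4, 1.0])
groups = [[0, 1], [2, 3], [3, 4]]   # group [3, 4] overlaps group [2, 3]
print(group_lasso_penalty(theta, groups))
```

Note that the first group contributes exactly zero to the penalty: once a group has been driven to zero as a whole, it costs nothing, which is what produces group sparsity.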

Learning
In our case we use the logistic regression loss function, in order to integrate our regularization terms easily. The framework can naturally be extended to other loss functions (e.g., hinge loss). For structured regularizers such as the group lasso, a plethora of optimization methods exist. Since our task involves overlapping groups, we adopt the method of Yogatama and Smith (2014b).

Algorithm 1 ADMM for overlapping group lasso
Require: augmented Lagrangian variable ρ, regularization strengths λ_glas and λ_las
1: while update in weights not small do
2:   update θ; update v; update u
3: end while

Their method uses the alternating direction method of multipliers (ADMM) (Hestenes, 1969; Powell, 1969). Given the lasso penalty on each feature and the group lasso regularizer, the problem becomes:

min_{θ,v} Σ_{i=1}^{N} L(x_i, θ, y_i) + λ_glas Σ_g λ_g ‖v_g‖_2 + λ_las ‖θ‖_1   subject to   v = Mθ

where v is a copy-vector of θ. The copy-vector v is needed because the group lasso regularizer contains overlaps between the groups. M is an indicator matrix of size L × V, where L is the sum of the sizes of all groups, and its ones link the actual weights θ to their copies v. Following Yogatama and Smith (2014b), a constrained optimization problem is formed that can be transformed into an augmented Lagrangian problem. Essentially, the problem becomes the iterative update of θ, v and the dual variable u. Convergence Yogatama and Smith (2014b) proved that ADMM for the sparse overlapping group lasso converges, and showed that a good approximate solution is reached within a few tens of iterations. Our experiments confirm this as well.
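The copy-vector construction above can be sketched as follows: v stacks one copy of each group's weights, and the indicator matrix M (of size L × V) maps each copy back to its original weight, so that overlapping groups each receive their own copy. The groups and weights are toy values for illustration.

```python
import numpy as np

def build_copy_matrix(groups, n_features):
    """Indicator matrix M of shape (L, V): one row per (group, feature) pair."""
    rows = []
    for g in groups:
        for j in g:
            row = np.zeros(n_features)
            row[j] = 1.0          # this copy points at original weight j
            rows.append(row)
    return np.vstack(rows)

groups = [[0, 1], [1, 2]]          # feature 1 is shared by both groups
M = build_copy_matrix(groups, n_features=3)
theta = np.array([0.5, -1.0, 2.0])
v = M @ theta                      # copies: [theta_0, theta_1, theta_1, theta_2]
print(M.shape)                     # (4, 3): L = 4 copies, V = 3 features
print(v)
```

Because feature 1 appears twice in v, the group lasso term can be applied to each group's copy independently, which is exactly what makes the v-update separable across groups in the ADMM iterations.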

Structured Regularization in NLP
Recent efforts have identified useful structures in text that can enhance the effectiveness of text categorization in an NLP context. Since the main regularization approaches we use are variants of the group lasso, we are interested in prior knowledge in the form of groups/clusters that can be found in the training text data. These groups can capture either semantic or syntactic structures that affiliate words to communities. In our work, we study both semantic and syntactic properties of text data, and incorporate them in structured regularizers. The grouping of terms is produced either by LSI or by clustering in the word2vec or graph-of-words space.

Statistical regularizers
In this section, we present statistical regularizers, i.e. with groups of words based on co-occurrences, as opposed to syntactic ones (Mitra et al., 1997). Sandler et al. (2009) introduced regularized learning with networks of features. They define a graph G whose edge weights are nonnegative, with larger weights indicating greater similarity; conversely, a weight of zero means that two features are not believed a priori to be similar. Previous work (Ando and Zhang, 2005; Raina et al., 2006; Krupka and Tishby, 2007) shows that such similarities can be inferred from prior domain knowledge and statistics computed on unlabeled data. The weights of G are stored in a matrix P, where P_ij ≥ 0 gives the weight of the directed edge from vertex i to vertex j. The out-degree of each vertex is constrained to sum to one, Σ_j P_ij = 1, so that no feature "dominates" the graph.

Network of features
The regularizer is:

Ω_net(θ) = λ_net Σ_k θ_k^⊤ M θ_k

where M = α(I − P)^⊤(I − P) + βI. The matrix M is symmetric positive definite, and therefore the regularizer admits a Bayesian interpretation in which the weight vector θ is a priori normally distributed with mean zero and covariance matrix 2M^{−1}. However, preliminary results show poorer performance compared to structured regularizers on larger datasets.
Sentence regularizer Yogatama and Smith (2014b) proposed to define groups from the sentences in the training dataset. The main idea is to define a group g_{d,s} for every sentence s in every training document d, so that each group holds the weights of the words occurring in its sentence. A word can thus be a member of one group for every distinct (training) sentence it occurs in. The regularizer is:

Ω_sen(θ) = Σ_{d=1}^{D} Σ_{s=1}^{S_d} λ_{d,s} ‖θ_{d,s}‖_2

where S_d is the number of sentences in document d.
Since modern text datasets typically contain thousands of sentences and many words appear in more than one sentence, the sentence regularizer can potentially lead to thousands of heavily overlapping groups. As noted in the work of Yogatama and Smith (2014b), an important property is that the regularizer will drive all the weights of a sentence to zero if the sentence is recognized as irrelevant. Conversely, it will keep all the weights of a relevant sentence, even if the group contains unimportant words. Fortunately, the latter problem can be resolved by adding a lasso penalty (Friedman et al., 2010).
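The sentence-group construction described above can be sketched as follows: one group per sentence per document, each holding the vocabulary indices of the words occurring in that sentence. Tokenization here is a naive whitespace/period split, and the documents and vocabulary are toy values.

```python
def sentence_groups(documents, vocab):
    """One group g_{d,s} per sentence s of each document d (word indices)."""
    groups = []
    for doc in documents:
        for sentence in doc.split("."):           # naive sentence split
            idx = {vocab[w] for w in sentence.split() if w in vocab}
            if idx:
                groups.append(sorted(idx))
    return groups

docs = ["the launch was delayed. the crew waited",
        "the medical study was delayed"]
vocab = {w: i for i, w in enumerate(
    "the launch was delayed crew waited medical study".split())}
print(sentence_groups(docs, vocab))
```

Notice how the words "the", "was" and "delayed" each appear in several groups: this is exactly the overlap that motivates the copy-vector formulation in the ADMM algorithm.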

Semantic regularizers
In this section, we present semantic regularizers that define groups based on how semantically close words are.
LDA regularizer Yogatama and Smith (2014a) considered topics as another type of structure. Textual data can contain a huge number of topics, and especially topics that overlap each other. Again, the main idea is to penalize jointly the weights of words that co-occur in the same topic, instead of treating the weight of each word separately.
Given a training corpus, topics can easily be extracted with the latent Dirichlet allocation (LDA) model (Blei et al., 2003). In our experiments, we form a group by extracting the n most probable words of a topic. We note that the extracted topics can vary depending on the text preprocessing methods we apply to the data.
LSI regularizer Latent Semantic Indexing (LSI) can also be used to identify topics or groups and thus discover correlations between terms (Deerwester et al., 1990). LSI uses singular value decomposition (SVD) on the document-term matrix to identify latent variables that link co-occurring terms with documents. The main premise behind LSI is that words used in the same contexts (i.e. the same documents) tend to have similar meanings. We compare the LSI regularizer with standard baselines as well as the other proposed structured regularizers. In our work we keep the top 10 words that contribute the most to each topic.
The regularizer for both LDA and LSI is:

Ω_{LDA,LSI}(θ) = Σ_{k=1}^{K} λ‖θ_k‖_2

where K is the number of topics.

Graphical regularizers
In this section we present our proposed regularizers based on graph-of-words and word2vec. Essentially, the word2vec space can be seen as a large graph where nodes represent terms and edges the similarities between them.
Graph-of-words regularizer Following the idea of the network of features, we introduce a simpler and faster technique to identify relationships between features. We create a large collection-level graph from the training documents, where nodes correspond to terms and edges to the co-occurrence of terms within a sliding window. We present a toy example of a graph-of-words in Figure 1. A critical advantage of graph-of-words is that it easily encodes term dependency and term order (via edge direction). The strength of the dependence between two words can also be captured by assigning a weight to the edge that links them.
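The sliding-window construction above can be sketched as follows. Nodes are terms and an edge links two terms whenever they co-occur within a window of size w; edge weights count co-occurrences. For simplicity this sketch builds the undirected variant (a directed variant encoding term order would keep the (u, v) pair as-is instead of sorting it). The documents and window size are toy values.

```python
from collections import defaultdict

def graph_of_words(documents, w=3):
    """Collection-level co-occurrence graph: {(term_a, term_b): weight}."""
    edges = defaultdict(int)
    for doc in documents:
        terms = doc.split()
        for i, u in enumerate(terms):
            for v in terms[i + 1:i + w]:       # co-occurrences inside the window
                if u != v:
                    edges[tuple(sorted((u, v)))] += 1   # undirected variant
    return dict(edges)

docs = ["space shuttle launch", "space launch delay"]
print(graph_of_words(docs, w=3))
```

The edge ("launch", "space") accumulates weight from both documents, illustrating how collection-level co-occurrence strength is captured before community detection is applied.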
Graph-of-words was originally an idea of Mihalcea and Tarau (2004) and Erkan and Radev (2004) who applied it to the tasks of unsupervised keyword extraction and extractive single document summarization. Rousseau and Vazirgiannis (2013) and Malliaros and Skianis (2015) showed it performs well in the tasks of information retrieval and text categorization. Notably, the former effort ranked nodes based on a modified version of the PageRank algorithm.
Community detection on graph-of-words Our goal is to identify groups or communities of words. Having constructed the collection-level graph-ofwords, we can now apply community detection algorithms (Fortunato, 2010).
In our case we use the Louvain method, a community detection algorithm for non-overlapping groups described in the work of Blondel et al. (2008). Essentially, it is a fast modularity maximization approach, which iteratively optimizes local communities until optimal global modularity is reached given some perturbations to the current community state. The regularizer becomes:

Ω_gow(θ) = Σ_{c=1}^{C} λ‖θ_c‖_2

where c ranges over the C communities. θ_c thus corresponds to the sub-vector of θ whose features are present in community c.
Note that in this case we do not have overlapping groups, since we use a non-overlapping version of the algorithm. Because the collection-level graph-of-words does not create well-separated communities of terms, overlapping community detection algorithms, like the one of Xie et al. (2013), fail to identify "good" groups and do not offer better results.

Mikolov et al. (2013) proposed the word2vec method for learning continuous vector representations of words from large text datasets. Word2vec manages to capture the meaning of words and map them to a multidimensional vector space, making it possible to apply vector operations on them. We introduce another novel regularizer by applying unsupervised clustering algorithms in the word2vec space.

Word2vec regularizer
Clustering on word2vec The pre-trained word2vec model contains millions of words represented as vectors.
Since word2vec succeeds in capturing semantic similarity between words, semantically related words tend to group together and create large clusters that can be interpreted as "topics".
In order to extract these groups, we use a fast clustering algorithm, K-means (MacQueen, 1967), and in particular MiniBatch K-means. The regularizer is:

Ω_word2vec(θ) = Σ_{k=1}^{K} λ‖θ_k‖_2

where K is the number of clusters extracted from the word2vec space. Clustering these semantic vectors is a very interesting area to study and could be a research topic by itself; the actual clustering output varies with the number of clusters we try to identify. In this paper we do not focus on optimizing the clustering process.
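The cluster-to-group step above can be sketched as follows. A tiny hand-rolled k-means stands in for MiniBatch K-means, and the 2-d "word vectors" are toy values with two obvious clusters; each resulting group is the list of word indices assigned to one cluster.

```python
import numpy as np

def kmeans_groups(vectors, k, iters=10, seed=0):
    """Cluster word vectors and return one group of word indices per cluster."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        # assign each word vector to its nearest centroid
        d = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):                     # recompute centroids
            if np.any(labels == c):
                centroids[c] = vectors[labels == c].mean(axis=0)
    return [np.flatnonzero(labels == c).tolist() for c in range(k)]

vecs = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9]])
groups = kmeans_groups(vecs, k=2)
print(groups)
```

Each group then plays the role of a θ_k in the regularizer above; in practice one would run MiniBatch K-means over the full pre-trained vocabulary rather than this toy set.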

Experiments
We evaluated our structured regularizers on several well-known datasets for the text categorization task: movie reviews (Pang and Lee, 2004; Zaidan and Eisner, 2008), floor speeches by U.S. Congressmen deciding "yea"/"nay" votes on the bill under discussion (Thomas et al., 2006), and product reviews from Amazon (Blitzer et al., 2007).

Experimental setup
As features we use unigram frequencies, concatenated with an additional unregularized bias term. We reproduce standard regularizers (lasso, ridge, elastic net) and state-of-the-art structured regularizers (sentence, LDA) as baselines, and compare them with our proposed methods. For LSI, LDA and word2vec we use the gensim package (Řehůřek and Sojka, 2010) in Python. For the learning part we used Matlab, and specifically code by Schmidt et al. (2007).
We split the training set in a stratified manner to retain the percentage of classes. We use 80% of the data for training and 20% for validation.
For LDA we set the number of topics to 1000 and keep the 10 most probable words of each topic as a group. For LSI we keep 1000 latent dimensions and select the 10 most significant words per topic. For the clustering process on word2vec we ran MiniBatch K-means with a maximum of 2000 clusters. For each word belonging to a cluster, we also keep the top 5 or 10 nearest words, thereby introducing overlapping groups. The intuition behind this is that words can be part of multiple "concepts" or topics, and thus can belong to many clusters.
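The overlap construction just described can be sketched as follows: each word's cluster group is extended with that word's nearest neighbors by cosine similarity, so a word can end up in several groups. The vectors and neighbor count are toy values for illustration.

```python
import numpy as np

def extend_group(group, vectors, top=2):
    """Add each member's `top` nearest neighbors (cosine similarity) to the group."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T                     # pairwise cosine similarities
    extended = set(group)
    for i in group:
        order = np.argsort(-sims[i])         # most similar first
        neighbors = [j for j in order if j != i][:top]
        extended.update(neighbors)
    return sorted(extended)

# toy 2-d "word vectors": words 0/1 point one way, words 2/3 the other
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(extend_group([0], vecs, top=1))   # word 1 joins word 0's group
```

Applied to every cluster, this produces the overlapping groups that the overlapping group lasso machinery (copy-vector v, matrix M) is designed to handle.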

Results
In Table 2 we report the results of our experiments on the aforementioned datasets; our proposed regularizers (LSI, GoW, word2vec) are distinguished by underlining. Our results are in line with and confirm those of Yogatama and Smith (2014a), showing the advantages of using structured regularizers in the text categorization task. The group-based regularizers perform systematically better than the baseline ones. We observe that the word2vec clustering-based regularizer performs very well, achieving the best performance on three out of the ten datasets, while being quite fast in terms of execution time, as shown in Table 3 (i.e. it is four to ten times faster than the sentence-based one).
The LSI-based regularization, proposed for the first time in this paper, performs surprisingly well, as it achieves the best performance on three of the ten datasets. This can be interpreted by the fact that the method extracts the inherent dimensions that best represent the different semantics of the documents, as can also be seen in the anecdotal examples in Tables 6, 7 and 8. The method also proves very fast, as shown in Table 5 (i.e. it is three to sixty times faster than the sentence-based one).
The GoW-based regularization, although very fast, did not outperform the other methods (while still performing very well in general). It remains to be seen whether more thorough parameter tuning and community detection algorithm selection would further improve the accuracy of the method.
In Table 3 we present the feature space sizes retained by each regularizer for each dataset. As expected, the lasso regularizer sets the vast majority of the feature weights to zero, and thus generates a very sparse feature space; as a consequence, accuracy decreases significantly. Our proposed structured regularizers managed to perform better in most cases, while producing sparser models than the state-of-the-art regularizers.

Time complexity
Although certain types of structured regularizers significantly improve accuracy and address the problem of overfitting, they require a notable amount of time in the learning process. As noted in Yogatama and Smith (2014b), a considerable disadvantage is the need to search for the optimal hyperparameters λ_glas, λ_las, and ρ, whereas standard baselines like lasso and ridge have only one hyperparameter and elastic net has two.
Parallel grid search can be critical for finding the optimal set of hyperparameters, since the runs do not depend on each other, but the process can still be very expensive. The sentence regularizer in particular can be extremely slow, due to two factors: first, the high number of sentences in text data; second, sentences form heavily overlapping groups, with words reappearing in one or more sentences. On the contrary, as shown in Table 4, the number of clusters in the clustering-based regularizers is significantly smaller than the number of sentences, and directly controlled by the designer, thus resulting in much faster computation. The update of v still remains time consuming for small datasets, even with parallelization.
Our proposed structured regularizers are considerably faster in reaching convergence, since they offer a smaller number of groups with less overlap between words. For example, on the computer subset of the 20NG dataset, learning models with the best hyperparameter value(s) for lasso, ridge, and elastic net took 7, 1.4, and 0.8 seconds respectively, on an Intel Xeon E5-1607 3.00 GHz machine with 4 cores and 128 GB RAM. Given the best hyperparameter values, the LSI regularizer takes 6 seconds to converge, the word2vec regularizer 10 seconds, and the graph-of-words regularizer 4 seconds, while the sentence regularizer requires 43 seconds. Table 5 summarizes the required learning times on the 20NG datasets.
We also need to consider the time needed to extract the groups. For word2vec, MiniBatch K-means requires 15 minutes to cluster the vectors pre-trained by Google; the clustering is executed only once. Retrieving the clusters of words that belong to the vocabulary of each dataset requires 20 minutes, but this can be further optimized. Finding the communities in the graph-of-words approach with the Louvain algorithm is also very fast, requiring a few minutes depending on the size and structure of the graph.
In Tables 6, 7 and 8 we show examples of groups removed and selected (in v) by our proposed regularizers on the science subset of the 20NG dataset. Words with weights (in w) of magnitude greater than 10^{-3} are highlighted in red (sci.med) and blue (sci.space).

Conclusion & Future Work
This paper proposes new types of structured regularizers that improve not only the accuracy but also the efficiency of the text categorization task. We mainly focused on how to find and extract semantic and syntactic structures that lead to sparser feature spaces and therefore to faster learning times. Overall, our results demonstrate that linguistic prior knowledge in the data can be used to improve categorization performance over baseline bag-of-words models, by mining inherent structures. We considered only logistic regression because of the interpretation of L2 regularization as a Gaussian prior on the feature weights; following Sandler et al. (2009), we considered a non-diagonal covariance matrix for L2 based on word similarity before moving to the group lasso presented in this paper. We do not expect a significant change in results with different loss functions, as the proposed regularizers are not log-loss specific.
Future work could involve a more thorough investigation of how to create and cluster graphs, i.e. covering weighted and/or signed cases. Finding better clusters in the word2vec space is also a critical part: this is not only restricted to finding the best number of clusters, but also to determining what type of clusters we are trying to extract. Gaussian mixture models (McLachlan and Basford, 1988) could be applied in order to capture overlapping groups, at the cost of higher complexity. Furthermore, topical word embeddings (Liu et al., 2015) could be considered for regularization; this approach could enhance regularization on topic-specific datasets. Additionally, we plan to explore alternative regularization algorithms diverging from the group lasso method.