Orthogonal Matching Pursuit for Text Classification

In text classification, the problem of overfitting arises due to the high dimensionality, making regularization essential. Although classic regularizers provide sparsity, they fail to return highly accurate models. On the contrary, state-of-the-art group-lasso regularizers provide better results at the expense of low sparsity. In this paper, we apply a greedy variable selection algorithm, called Orthogonal Matching Pursuit, for the text classification task. We also extend standard group OMP by introducing overlapping Group OMP to handle overlapping groups of features. Empirical analysis verifies that both OMP and overlapping GOMP constitute powerful regularizers, able to produce effective and very sparse models. Code and data are available online.


Introduction
The overall high dimensionality of textual data is of major importance in text classification (also known as text categorization), opinion mining, noisy text normalization and other NLP tasks. Since the number of distinct words is large in most cases, a model can easily overfit. Regularization remains a key element for addressing overfitting in tasks like text classification, domain adaptation and neural machine translation (Chen and Rosenfeld, 2000; Lu et al., 2016; Barone et al., 2017). Along with better generalization capabilities, a proper regularization scheme can also introduce sparsity. Recently, a number of text regularization techniques have been proposed in the context of deep learning (Qian et al., 2016; Ma et al., 2017; Zhang et al., 2017).
Apart from $\ell_1$, $\ell_2$ and elastic net, a very popular method for regularizing text classification is group lasso. Yogatama and Smith (2014b) introduced a group lasso variant to utilize groups of words for logistic models. Occasionally though, these groupings are either not available or hard to extract. Moreover, no ground truth groups of words exist to validate their quality. Furthermore, group lasso can also fail to create sparse models. Lastly, there has been little work on overlapping group regularization for text, even though words can appear in different groups, following the intuition that they can share multiple contexts or topics.

Figure 1: OMP pipeline, where $X \in \mathbb{R}^{N \times d}$ is the design matrix, $y \in \mathbb{R}^{N}$ is the response vector, $K$ is our budget and $I$ the set of active features. Steps shown: best feature (word), $j^{(k)} = \arg\max_{j \notin I} X_j^\top r^{(k-1)}$; update active set.

Footnote 1: github.com/y3nk0/OMP-for-Text-Classification
In this work, we apply two greedy variable selection techniques to the text classification task, Orthogonal Matching Pursuit (OMP) and overlapping group Orthogonal Matching Pursuit (GOMP). In the case of GOMP, we build upon the work of Lozano et al. (2011), where the authors propose the GOMP algorithm for Logistic Regression for selecting relevant groups of features. More specifically, standard GOMP is based on the assumption that a number of disjoint groups of features are available. Nevertheless, in most cases, these groups are not disjoint. To overcome this problem we extend GOMP to handle overlapping groups of features. We empirically show that both OMP and overlapping GOMP provide highly accurate models, while producing very sparse models compared to group lasso variants. Figure 1 illustrates schematically the pipeline of OMP.
Our contribution can be summarized in the following novel aspects: (1) apply OMP to text classification; (2) introduce overlapping GOMP, moving from disjoint to overlapping groups; (3) analyze their efficiency in accuracy and sparsity, compared to group lasso variants and state-of-the-art deep learning models.
The rest of the paper is organized as follows. Section 2 presents the background about the classification task, and Section 3 gives an overview of the related work. Section 4 formally introduces the proposed OMP and overlapping GOMP algorithms for the text classification problem. Experimental results are presented in Section 5. We conclude the paper in Section 6 by discussing possible future directions.

Background & Notation
In this section, we set out the theoretical and practical background needed to tackle text classification.

Loss minimization
In the binary classification problem, the objective is to assign an instance vector $x \in \mathbb{R}^d$, which represents a document in our setting, to a binary response $y \in \{-1, 1\}$. In text classification, $d$ represents the size of our dictionary concatenated with an additional bias term.
We begin by transforming the classification problem into a loss minimization problem. For that purpose, a loss function should be defined that quantifies the loss between the prediction of a classifier and the true class label, $y_i$, associated with a specific document (instance), $x_i$.
Logistic regression models the class conditional probability as:

$$P(y \mid x, \theta) = \frac{1}{1 + \exp\{-y\,(\theta^\top x)\}},$$

where the vector $\theta$ contains the unknown model parameters, and the hyperplane $\theta^\top x = 0$ is the decision boundary of the classifier that separates the two classes. Given a training set of i.i.d. data points $\{(x_i, y_i)\}_{i=1}^{N}$, we find the optimal model parameters, $\theta^*$, by minimizing the negative log likelihood:

$$\theta^* = \arg\min_{\theta} \sum_{i=1}^{N} L(x_i, \theta, y_i),$$

where $L(x_i, \theta, y_i) = \log[1 + \exp\{-y_i(\theta^\top x_i)\}]$ is the loss function of our model. Other loss functions can also be used, such as the hinge loss or the squared loss. For linear classifiers such as Linear Least Square Fit, Logistic Regression and linear Support Vector Machines, in the case of binary predictions $L(x, \theta, y)$ is respectively $[1 - y(\theta^\top x)]^2$ (squared loss), $\log[1 + \exp\{-y(\theta^\top x)\}]$ (log loss) and $[1 - y(\theta^\top x)]_+$ (hinge loss).
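The three losses above can be written down directly in terms of the margin $y(\theta^\top x)$. The following is a minimal NumPy sketch for illustration only; the function names and the toy data are hypothetical and are not taken from the released code.

```python
import numpy as np

def margins(X, y, theta):
    """Return y_i * (theta^T x_i) for every row x_i of X."""
    return y * (X @ theta)

def log_loss(X, y, theta):
    # log[1 + exp(-m)], computed stably as logaddexp(0, -m)
    return np.logaddexp(0.0, -margins(X, y, theta))

def squared_loss(X, y, theta):
    return (1.0 - margins(X, y, theta)) ** 2

def hinge_loss(X, y, theta):
    return np.maximum(0.0, 1.0 - margins(X, y, theta))

def prob(X, y, theta):
    """P(y_i | x_i, theta) under the logistic model."""
    return 1.0 / (1.0 + np.exp(-margins(X, y, theta)))

# Hypothetical toy data: two documents, two word counts plus a bias column.
X = np.array([[1.0, 0.0, 1.0], [0.0, 2.0, 1.0]])
y = np.array([1.0, -1.0])
theta = np.zeros(3)
nll = log_loss(X, y, theta).sum()  # negative log likelihood at theta = 0
```

At $\theta = 0$ every document has probability 1/2, so the negative log likelihood equals $N \log 2$, and the log loss is always the negative logarithm of the modelled probability.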

Regularization
By only minimizing the empirical risk, a model can be led to severe overfitting when the number of features (dictionary size) is much higher than the number of instances (documents) in the training set. In practice, this yields models with poor generalization capabilities (i.e., lower performance on the test set) that fit the noise contained in the training dataset instead of learning the underlying pattern we are trying to capture. Additionally, if two hypotheses lead to similarly low empirical risks, one should select the "simpler" model for better generalization capabilities.
Interpretation. The concept of regularization encompasses all these ideas. It can be interpreted as taking into account the model complexity by discouraging feature weights from reaching large values; incorporating prior knowledge to help the learning by making prior assumptions on the feature weights and their distribution; and helping compensate for ill-posed conditions.
Expected risk. Regularization takes the form of additional constraints on the minimization problem, i.e., a budget on the feature weights, which are often relaxed into a penalty term $\Omega(\theta)$ controlled via a Lagrange multiplier $\lambda$ (see Boyd and Vandenberghe (2004) for more details about the theory behind convex optimization). Therefore, the overall expected risk (Vapnik, 1991) can be expressed as the weighted sum of two components: the empirical risk and a penalty term, called "Loss+Penalty" (Hastie et al., 2009). In this way, the optimal set of feature weights $\theta^*$ is found as:

$$\theta^* = \arg\min_{\theta} \sum_{i=1}^{N} L(x_i, \theta, y_i) + \lambda\,\Omega(\theta),$$

where the free parameter $\lambda \ge 0$ governs the importance of the penalty term compared with the loss term.
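To make the "Loss+Penalty" form concrete, the sketch below minimizes the log loss plus a penalty by plain gradient descent. It assumes the ridge penalty $\Omega(\theta) = \|\theta\|_2^2$ purely as an example, and the function names, the solver and the toy data are all hypothetical choices rather than the setup used in the experiments.

```python
import numpy as np

def objective(theta, X, y, lam):
    """Log loss over the training set plus lam * ||theta||_2^2 (example penalty)."""
    m = y * (X @ theta)
    return np.logaddexp(0.0, -m).sum() + lam * (theta @ theta)

def gradient(theta, X, y, lam):
    m = y * (X @ theta)
    s = np.exp(-np.logaddexp(0.0, m))  # 1 / (1 + exp(m)), computed stably
    return -(X.T @ (y * s)) + 2.0 * lam * theta

def fit(X, y, lam, iters=2000):
    """Gradient descent with step 1/L, where L bounds the gradient's Lipschitz constant."""
    L = 0.25 * np.linalg.norm(X, 2) ** 2 + 2.0 * lam
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= gradient(theta, X, y, lam) / L
    return theta

# Hypothetical toy data: four documents, two word counts plus a bias column.
X = np.array([[2.0, 0.0, 1.0], [1.0, 1.0, 1.0], [0.0, 2.0, 1.0], [0.0, 1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
weak, strong = fit(X, y, lam=0.01), fit(X, y, lam=10.0)
```

With the larger $\lambda$ the penalty dominates and the learned weights shrink towards zero, which is the trade-off that $\lambda$ controls.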

Related Work
In this section, we review particularly relevant prior work on regularization for text classification and more specifically methods based on group lasso. In many applications of statistics and machine learning, the number of explanatory variables may be very large, while only a small subset may truly be relevant in explaining the response to be modelled. In certain cases, the dimensionality of the predictor space may also exceed the number of examples. Then the only way to avoid overfitting is via some form of capacity control over the family of dependencies being explored. Estimation of sparse models that are supported on a small set of input variables is thus highly desirable, with the additional benefit of leading to parsimonious models, which can be used not only for predictive purposes but also to understand the effects (or lack thereof) of the candidate predictors on the response.
More specifically, regularization in text scenarios is essential as it can lead to removing unnecessary words along with their weights. For example, in text classification we may only care about a small subset of the vocabulary that is important during learning, which we can isolate by penalizing noisy and irrelevant words either independently or in groups.
By noise we refer to user-generated words that may increase the dimensionality and complexity of a problem while clearly decreasing performance.
Another example task is text normalization, where we want to transform lexical variants of words to their canonical forms. Text normalization can be seen as a machine learning problem (Ikeda et al., 2016) and thus regularization techniques can be applied.
Next we present standard regularization methods, which prove to be effective for classification tasks. We also use them later as baselines for our experiments.
$\ell_1$, $\ell_2$ regularization. Two of the most used regularization schemes are $\ell_1$-regularization, called Lasso (Tibshirani, 1996) or basis pursuit in signal processing (Chen et al., 2001), and $\ell_2$-regularization, called ridge (Hoerl and Kennard, 1970) or Tikhonov (Tikhonov and Arsenin, 1977), which involve adding a penalty term ($\ell_1$ and $\ell_2$ norms of the parameter vector, respectively) to the error function:

$$\Omega_{las}(\theta) = \|\theta\|_1 = \sum_{j=1}^{d} |\theta_j|, \qquad \Omega_{rid}(\theta) = \|\theta\|_2^2 = \sum_{j=1}^{d} \theta_j^2.$$

Elastic net. A linear combination of the $\ell_1$ and $\ell_2$ penalties has also been introduced by Zou and Hastie (2005), called elastic net. Although $\ell_1$ and elastic net can be very effective in terms of sparsity, the accuracy achieved by these regularizers can be low. On the contrary, $\ell_2$ can deliver sufficient accuracy at the cost of zero sparsity. The need for new methods that outperform the aforementioned approaches in both accuracy and sparsity is evident.
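The contrast in sparsity between the two penalties can be seen from their proximal operators, a standard result from convex optimization that is not part of the paper itself. In the minimal sketch below (hypothetical function names, toy weight vector), the $\ell_1$ operator sets small weights exactly to zero, whereas the $\ell_2$ operator only rescales them.

```python
import numpy as np

def l1_penalty(theta):
    return np.abs(theta).sum()

def l2_penalty(theta):
    return np.sum(theta ** 2)

def elastic_net_penalty(theta, lam_las, lam_rid):
    """Linear combination of the l1 and (squared) l2 penalties."""
    return lam_las * l1_penalty(theta) + lam_rid * l2_penalty(theta)

def soft_threshold(theta, t):
    """Proximal operator of t * ||.||_1: shrinks weights and zeroes the small ones."""
    return np.sign(theta) * np.maximum(np.abs(theta) - t, 0.0)

def ridge_shrink(theta, t):
    """Proximal operator of t * ||.||_2^2: rescales weights but never zeroes them."""
    return theta / (1.0 + 2.0 * t)

theta = np.array([0.05, -0.3, 2.0])
sparse = soft_threshold(theta, 0.1)  # first weight becomes exactly zero
dense = ridge_shrink(theta, 0.1)     # all three weights stay non-zero
```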
Group structured regularization. In many problems a predefined grouping structure exists within the explanatory variables, and it is natural to incorporate this prior knowledge so that the support of the model is a union over some subset of these variable groups. Group structured regularization has been proposed to address the problem of overfitting when groups of features are available. Group lasso is a special case of group regularization proposed by Yuan and Lin (2006) to avoid large $\ell_2$ norms for groups of weights. The main idea is to penalize together features that may share some properties. Group structured regularization, or the variable group selection problem, is well studied and is based on minimizing a loss function penalized by a regularization term designed to encourage sparsity at the variable group level. Specifically, a number of variants of the $\ell_1$-regularized lasso algorithm (Tibshirani, 1996) have been proposed for the variable group selection problem, and their properties have been extensively studied recently. First, for linear regression, Yuan and Lin (2006) proposed the group lasso algorithm as an extension of lasso, which minimizes the squared error penalized by the sum of $\ell_2$-norms of the group variable coefficients across groups. Here the use of the $\ell_2$-norm within the groups and the $\ell_1$-norm across the groups encourages sparsity at the group level.
In addition, group lasso has been extended to logistic regression for binary classification, by replacing the squared error with the logistic error (Kim et al., 2006; Meier et al., 2008), and several extensions thereof have been proposed (Roth and Fischer, 2008). Later, sparse group lasso and overlapping group lasso were introduced (Obozinski et al., 2011): the former additionally penalizes features inside the groups, while the latter can be used when groups include features that can be shared between them.
In Figure 2, we illustrate the selection of features by the most used group lasso regularizers. In group lasso, a group of features is selected and all its features are used. Next, in the sparse group lasso case, groups of features are selected again but not all the features belonging to them are used. In the overlapping group lasso, groups can share features between them. Finally, we may have sparse group lasso with overlaps.
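The group penalties described above can be sketched in a few lines. This is an illustrative NumPy example with hypothetical function names and toy groups; group weights are omitted, and for overlapping groups it simply sums the norms of the listed groups, which is only the most direct reading and does not reproduce the latent-variable formulation used in the overlapping group lasso literature.

```python
import numpy as np

def group_lasso_penalty(theta, groups):
    """Sum of l2 norms of the coefficient sub-vectors, one term per group."""
    return sum(np.linalg.norm(theta[g]) for g in groups)

def sparse_group_lasso_penalty(theta, groups, lam_glas, lam_las):
    """Group penalty plus an l1 term that also penalizes features inside groups."""
    return lam_glas * group_lasso_penalty(theta, groups) + lam_las * np.abs(theta).sum()

def active_groups(theta, groups):
    """Indices of the groups that contain at least one non-zero weight."""
    return [i for i, g in enumerate(groups) if np.any(theta[g] != 0)]

theta = np.array([0.0, 0.0, 3.0, 4.0, 0.0])
disjoint = [[0, 1], [2, 3], [4]]
overlapping = [[0, 1], [2, 3], [3, 4]]  # feature 3 is shared by two groups
```

In the toy example, feature 3 belongs to two groups, so its weight is penalized twice under the naive overlapping sum.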
Linguistic structured regularizers. As mentioned previously, words that appear together in the same context, share topics or even have a similar meaning may form groups that capture semantic or syntactic prior information. Hence we can feed these groups to group lasso. Yogatama and Smith (2014a) used the Alternating Direction Method of Multipliers algorithm (ADMM) (Boyd et al., 2011) for group lasso, an algorithm that solves convex optimization problems by breaking them into smaller pieces. In that work, groups extracted by Latent Dirichlet Allocation (LDA) and sentences were used for structured regularization. Next, Skianis et al. (2016) extended their work by utilizing topics extracted by Latent Semantic Indexing (LSI) (Deerwester et al., 1990), communities in Graph-of-Word (GoW) (Rousseau and Vazirgiannis, 2013) and clusters in the word2vec (w2v) (Mikolov et al., 2013) space. They also performed a computational analysis in terms of the number and size of groups and how these can affect learning times.
While current state-of-the-art methods focus either on finding the most meaningful groups of features or on further "optimizing" the group lasso approach, these attempts also carry the disadvantages of group lasso architectures. In some cases, we may not be able to extract "good" groups of words. As presented in the next section, we want to explore new ways of regularization on groups, diverging from group lasso, that can give high accuracy with high sparsity.

OMP for Text Classification
The vanilla Matching Pursuit (MP) algorithm (Mallat and Zhang, 1993) has its origin in signal processing, where it is mainly used in the compressed sensing task. It approximates the original "signal" iteratively, improving the current solution by reducing the norm of the residual (approximation error). It can also be considered as a forward greedy algorithm for feature selection (dictionary learning problem) that at each iteration uses the correlation between the residual and the candidate features to (greedily) decide which feature to add next. The correlation between the residual and a candidate feature is taken to be the length of the orthogonal projection. Then, it subtracts the correlated part from the residual and performs the same procedure on the updated residual. The algorithm terminates when the norm of the residual is lower than a predefined threshold. The final solution is obtained by combining the selected features weighted by their respective correlation values, which are calculated at each iteration.
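The MP procedure described above can be sketched as follows for a least-squares signal approximation. This is a minimal illustration, assuming non-zero columns that are rescaled to unit norm; the function name and the default parameters are hypothetical.

```python
import numpy as np

def matching_pursuit(X, y, max_iter=50, tol=1e-6):
    """Plain MP: pick the column most correlated with the residual,
    subtract its projection, and repeat on the updated residual."""
    X = X / np.linalg.norm(X, axis=0)  # unit-norm atoms (assumes no zero column)
    coef = np.zeros(X.shape[1])
    r = y.astype(float).copy()
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol:   # residual small enough: stop
            break
        corr = X.T @ r                 # projection lengths onto each atom
        j = int(np.argmax(np.abs(corr)))
        coef[j] += corr[j]             # the same atom may be picked again later
        r -= corr[j] * X[:, j]
    return coef, r
```

Note that the returned coefficients refer to the normalized columns, and that, unlike OMP, nothing prevents plain MP from selecting the same column more than once.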
Orthogonal Matching Pursuit (Pati et al., 1993) is one of the most famous extensions of the matching pursuit algorithm. Similar to MP, OMP can be used for the dictionary learning task, where it constitutes a competitive alternative to the lasso algorithm. It differs from standard MP in that, at every step, all the coefficients extracted so far are updated by computing the orthogonal projection of the data onto the set of features selected so far. In this way, the newly derived residual is orthogonal not only to the feature selected at the current iteration, but also to all the features that have already been selected. Therefore, OMP never selects the same feature twice. Tropp (2004) provided a theoretical analysis of OMP, which has been generalized by Zhang (2009) to the stochastic noise case.

Algorithm 1: Logistic-OMP
Input: $X = [x_1, \dots, x_N]^\top \in \mathbb{R}^{N \times d}$, $y \in \{-1, 1\}^N$, $K$ (budget), $\epsilon$ (precision), $\lambda$ (regularization factor).
Initialize: $I = \emptyset$, $r^{(0)} = y$, $k = 1$
while $|I| \le K$ do
    $j^{(k)} = \arg\max_{j \notin I} X_j^\top r^{(k-1)}$
    if $|X_{j^{(k)}}^\top r^{(k-1)}| \le \epsilon$ then
        break
    end if
    $I = I \cup \{j^{(k)}\}$
    $\theta^{(k)} = \arg\min_{\theta} \sum_{i=1}^{N} L(x_i, \theta, y_i) + \lambda \|\theta\|_2^2$ s.t. $\mathrm{supp}(\theta) \subseteq I$
    $r^{(k)} = \mathbb{1}_{\{y\}} - \frac{1}{1 + \exp\{-X \theta^{(k)}\}}$
    $k \mathrel{+}= 1$
end while
return $\theta^{(k)}$, $I$
In the following part, we explain the main steps of the logistic OMP algorithm in detail. Given a training set, we define $X = [x_1, \dots, x_N]^\top \in \mathbb{R}^{N \times d}$ to be the (dictionary) matrix of feature (or variable) vectors, with each column $X_j$ representing a feature, $f_j \in \mathbb{R}^N$. Let also $y = [y_1, \dots, y_N]^\top$ denote the response vector. For any set of indices $I$, let $X_I$ denote the subset of features from $X$ such that feature $f_j$ is included in $X_I$ if $j \in I$. Thus, $X_I = \{f_j, j \in I\}$, with the columns $f_j$ arranged in ascending order.
OMP starts by setting the residual equal to the response vector, $r^{(0)} = y$, assuming that the set of indices $I$ (which contains the indices of the active features) is initially empty. At each iteration $k$, OMP activates the feature that has the maximum correlation with the residual $r^{(k-1)}$ (calculated in the previous step):

$$j^{(k)} = \arg\max_{j \notin I} X_j^\top r^{(k-1)}.$$

Then, we add the index $j^{(k)}$ to the set $I$, i.e., $I = I \cup \{j^{(k)}\}$. Afterwards, we apply ordinary logistic regression considering only the active features. More specifically, we get the optimal coefficients by minimizing the negative log likelihood along with an $\ell_2$ penalty term:

$$\theta^{(k)} = \arg\min_{\theta} \sum_{i=1}^{N} L(x_i, \theta, y_i) + \lambda \|\theta\|_2^2 \quad \text{s.t.} \quad \mathrm{supp}(\theta) \subseteq I, \tag{7}$$

where $\mathrm{supp}(\theta) = \{j : \theta_j \neq 0\}$. Roughly speaking, the coefficients that correspond to inactive features (indices) are forced to be equal to zero.

Finally, we calculate the updated residual:

$$r^{(k)} = \mathbb{1}_{\{y\}} - \frac{1}{1 + \exp\{-X \theta^{(k)}\}},$$

where $\mathbb{1}_{\{y\}} \triangleq \mathbb{1}\{y_i \in \{1\}, \forall i \in \{1, \dots, N\}\}$ indicates whether instance $x_i$ belongs to class 1 or not. We repeat the process until the residual becomes smaller than a predefined threshold, $\epsilon \ge 0$, or a desired number of active features, $K$ (budget), has been selected. In our empirical analysis we set $\epsilon = 0$, examining only the number of active features. An overview of logistic-OMP is given in Alg. 1. A detailed analysis of the algorithm's complexity is provided by Tropp and Gilbert (2007).
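The steps of logistic OMP described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the released Matlab implementation: the selection uses the absolute correlation, the restricted logistic regression is solved by plain gradient descent, the loop stops once exactly $K$ features are active, and all function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))  # numerically stable logistic function

def fit_restricted(Xa, y01, lam, iters=500):
    """l2-penalized logistic regression on the active columns (gradient descent, step 1/L)."""
    L = 0.25 * np.linalg.norm(Xa, 2) ** 2 + 2.0 * lam
    w = np.zeros(Xa.shape[1])
    for _ in range(iters):
        w -= (Xa.T @ (sigmoid(Xa @ w) - y01) + 2.0 * lam * w) / L
    return w

def logistic_omp(X, y, K, lam=1.0, eps=0.0):
    """y takes values in {-1, +1}; returns (theta, list of active indices)."""
    d = X.shape[1]
    y01 = (y == 1).astype(float)       # indicator of class 1
    r = y.astype(float).copy()         # r^(0) = y
    active, theta = [], np.zeros(d)
    while len(active) < K:
        corr = np.abs(X.T @ r)         # assumption: absolute correlation
        if active:
            corr[active] = -np.inf     # never re-select an active feature
        j = int(np.argmax(corr))
        if corr[j] <= eps:
            break
        active.append(j)
        theta = np.zeros(d)            # inactive coefficients stay at zero
        theta[active] = fit_restricted(X[:, active], y01, lam)
        r = y01 - sigmoid(X @ theta)   # updated residual
    return theta, active
```

Because the coefficients are refitted on the whole active set at every step, the support of the returned vector is exactly the active set, which gives direct control over the model size through $K$.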

Overlapping Group OMP
The Group OMP (GOMP) algorithm was originally introduced by Swirszcz et al. (2009) for linear regression models, and extended by Lozano et al. (2011) in order to select groups of variables in logistic regression models. Following the notion of group lasso, GOMP utilizes prior knowledge about groups of features in order to penalize large weights in a collective way. Given that we have words sharing some properties, we can leverage these groupings for regularization purposes.
Similar to Lozano et al. (2011), let us assume that a natural grouping structure exists within the variables, consisting of $J$ groups $X_{G_1}, \dots, X_{G_J}$, where $G_i \subset \{1, \dots, d\}$ and $X_{G_i} \in \mathbb{R}^{N \times |G_i|}$. The standard GOMP algorithm also assumes that the groups are disjoint, $G_i \cap G_j = \emptyset$ for $i \neq j$. We will remove this assumption later on, by proposing the overlapping GOMP algorithm, which is able to handle overlapping groups of features. GOMP operates in the same way as OMP but, instead of selecting a single feature, it selects the group of features with the maximum correlation with the residual:

$$j^{(k)} = \arg\max_{j} \|X_{G_j}^\top r^{(k-1)}\|_2.$$

In the case where the groups are not orthonormalized (i.e., $X_{G_j}^\top X_{G_j} \neq I_{G_j}$, where $I_{G_j}$ is the identity matrix in $\mathbb{R}^{|G_j| \times |G_j|}$), we select the best group based on the following criterion:

$$j^{(k)} = \arg\max_{j} \; r^{(k-1)\top} X_{G_j} (X_{G_j}^\top X_{G_j})^{-1} X_{G_j}^\top r^{(k-1)}.$$

During our empirical analysis, we noticed that the aforementioned criteria favor large groups. This becomes apparent especially when the sizes of the groups are not balanced. In this way, groups with a large number of "irrelevant" features are highly likely to be added. For instance, it is more probable to add a group that consists of 2 good features and 100 bad features than a group that contains only 2 good features. To deal with such situations, we consider the average correlation between the group's features and the residual, which normalizes the selection score by the group size $|G_j|$.

Overlapping GOMP extends standard GOMP to the case where the groups of indices overlap, i.e., $G_i \cap G_j \neq \emptyset$ for some $i \neq j$. The main difference with GOMP is that each time a group becomes active, we remove its indices from each inactive group: $G_i = G_i \setminus G_{j^{(k)}}$ for every inactive group $i \in \{1, \dots, J\}$. In this way, the theoretical properties of GOMP hold also in the case of the overlapping GOMP algorithm. A sketch of the overlapping GOMP is shown in Alg. 2.
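The two ingredients specific to overlapping GOMP, the size-normalized group score and the removal of activated indices from the remaining groups, can be sketched as below. The exact form of the average-correlation criterion is an assumption here (mean absolute correlation over the group's features), and all function names are hypothetical.

```python
import numpy as np

def group_score(X, r, group):
    """Assumed criterion: mean absolute correlation of the group's features with r."""
    return float(np.abs(X[:, group].T @ r).mean())

def select_group(X, r, groups, active_ids):
    """Return (index, score) of the best inactive, non-empty group."""
    best, best_score = None, -np.inf
    for g, idx in enumerate(groups):
        if g in active_ids or len(idx) == 0:
            continue
        s = group_score(X, r, idx)
        if s > best_score:
            best, best_score = g, s
    return best, best_score

def activate(groups, g):
    """Remove the indices of the newly active group g from every other group."""
    chosen = set(groups[g])
    return [idx if i == g else [j for j in idx if j not in chosen]
            for i, idx in enumerate(groups)]

# Toy example: group 1 contains the two useful features plus two useless ones.
X = np.eye(4)
r = np.array([2.0, 2.0, 0.0, 0.0])
groups = [[0, 1], [0, 1, 2, 3], [2, 3]]
g, score = select_group(X, r, groups, active_ids=set())
remaining = activate(groups, g)
```

In the toy example the unnormalized $\ell_2$ score cannot distinguish the small group from the large one that contains it, while the averaged score prefers the small group; after activation, the shared indices are removed from the large group, so the groups become disjoint again.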

Experiments
Next, we present the data, setup and results of our empirical analysis on the text classification task.

Datasets
Topic categorization. From the 20 Newsgroups dataset (qwone.com/~jason/20Newsgroups/), we examine four classification tasks.

Sentiment analysis. The sentiment analysis datasets we examined include movie reviews (Pang and Lee, 2004; Zaidan and Eisner, 2008), floor speeches by U.S. Congressmen deciding "yea"/"nay" votes on the bill under discussion (Thomas et al., 2006) and product reviews from Amazon (Blitzer et al., 2007). Table 1 summarizes statistics about the aforementioned datasets used in our experiments. We choose small datasets intentionally, like Yogatama and Smith (2014b), so that we can observe the regularization effect clearly.

Experimental setup
In our setup, as features we use unigram frequency concatenated with an additional bias term. We reproduce standard regularizers like lasso, ridge and elastic net, and state-of-the-art structured regularizers like sentence, LDA, GoW and w2v groups (Skianis et al., 2016) as baselines and compare them with the proposed OMP and GOMP. We use the pre-trained Google vectors introduced by Mikolov et al. (2013) and apply the k-means clustering algorithm (Lloyd, 1982) with a maximum of 2000 clusters. For each word belonging to a cluster, we also keep the top 5 nearest words so that we introduce overlapping groups.
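One way to build such overlapping groups is sketched below. It assumes that the cluster label of each word is already available (e.g., from k-means) and that "nearest" means highest cosine similarity; both the similarity measure and the function name are assumptions, and the dense similarity matrix is only practical for small vocabularies.

```python
import numpy as np

def overlapping_groups(emb, labels, top_k=5):
    """emb: (V, dim) word vectors; labels: (V,) cluster id of each word.
    Each cluster is extended with the top_k nearest words of each of its members."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T                     # cosine similarities (assumed measure)
    np.fill_diagonal(sim, -np.inf)          # a word is not its own neighbour
    nearest = np.argsort(-sim, axis=1)[:, :top_k]
    groups = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        extended = set(members.tolist()) | set(nearest[members].ravel().tolist())
        groups.append(sorted(extended))
    return groups

# Hypothetical 2-d embeddings for a four-word vocabulary in two clusters.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
groups = overlapping_groups(emb, labels, top_k=2)
```

Since a neighbour may belong to a different cluster, the same word index can end up in several groups, which is exactly the overlap that overlapping GOMP is designed to handle.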
For the learning part we used Matlab and specifically code provided by Schmidt et al. (2007). If no pre-defined split exists, we separate the training set in a stratified manner by 80% for training and 20% for validation.
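A stratified 80/20 split keeps the class proportions the same in both parts. The following is a minimal NumPy sketch of such a split; the function name, the seed handling and the rounding rule are illustrative assumptions rather than the procedure of the original Matlab code.

```python
import numpy as np

def stratified_split(y, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) with per-class proportions preserved."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))  # shuffle within the class
        n_val = int(round(val_fraction * len(idx)))
        val_idx.extend(idx[:n_val].tolist())
        train_idx.extend(idx[n_val:].tolist())
    return np.array(sorted(train_idx)), np.array(sorted(val_idx))

# Hypothetical labels: 60 positive and 40 negative documents.
y = np.array([1] * 60 + [-1] * 40)
train_idx, val_idx = stratified_split(y)
```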
All the hyperparameters are tuned on the development dataset, using accuracy for evaluation. For lasso and ridge regularization, we choose $\lambda$ from $\{10^{-2}, 10^{-1}, 1, 10, 10^{2}\}$. For elastic net, we perform grid search on the same set of values as in the ridge and lasso experiments for $\lambda_{rid}$ and $\lambda_{las}$. For group lasso, OMP and GOMP regularizers, we perform grid search on the same set of parameters as in the ridge and lasso experiments. When several models reach the same accuracy on the development data, the one with the highest sparsity is selected. In GOMP we considered all individual features as separate groups of size one, along with the w2v groups. Last but not least, in both OMP and GOMP the maximum number of features, $K$ (budget), is set to 2000.

Results
Table 2 reports the results of our experiments on the aforementioned datasets. The empirical results reveal the advantages of using OMP or GOMP for regularization in the text categorization task. The OMP regularizer performs systematically better than the baseline ones. More specifically, OMP outperforms the lasso, ridge and elastic net regularizers on all datasets in terms of accuracy. At the same time, the performance of OMP is quite close to, or even better than, that of structured regularizers. In fact, in the case of the electronics data, the model produced by OMP is the one with the highest accuracy. On the other hand, the proposed overlapping GOMP regularizer outperforms all the other regularizers in 3 out of 10 datasets.
Another important observation is how GOMP performs with different types of groups. GOMP only requires some "good" groups along with single features in order to achieve good accuracy. Smaller groups provided by LDA, LSI and w2v clusters provide a good solution and also fast computation, while others (GoW communities) can produce similar results with slower learning times. This phenomenon can be attributed to the different structure of groups. While LDA and LSI have a large number of groups with a small number of features in them (1000 groups, 10 words per group), w2v clusters and GoW communities consist of a smaller number of groups with a larger number of words belonging to each group. Nevertheless, we have reached the conclusion that the selection of groups is not crucial for the general performance of the proposed GOMP algorithm.
Table 3 shows the sparsity of all the regularizers we tested. As it becomes apparent, both OMP and GOMP yield super-sparse models with good generalization capabilities. More specifically, OMP produces sparse models similar to lasso, while GOMP keeps a significantly lower number of features compared to the other structured regularizers. In group regularization, GOMP achieves both the best accuracy and sparsity on two datasets (vote & books), while group lasso does so on only one (sports).

Table 5: Comparison in test accuracy with state-of-the-art classifiers: CNN (Kim, 2014), FastText (Joulin et al., 2017) with no pre-trained vectors. The proposed OMP and GOMP algorithms produce the most accurate model in 4 out of 10 datasets.
In Table 4 we demonstrate the ability of OMP to produce more discriminative features compared to lasso by showing the largest weights and their respective terms.
Finally, in Table 5 we compare state-of-the-art group lasso classifiers with deep learning architectures (Kim, 2014) with Dropout (Srivastava et al., 2014) for regularization and FastText (Joulin et al., 2017). We show that group lasso regularizers with simple logistic models remain very effective. Nevertheless, adding pre-trained vectors to the deep learning techniques and performing parameter tuning would definitely increase their performance against our models, but at a significant cost in time complexity.
Figure 3 visualizes accuracy vs. sparsity for all datasets and all classifiers, in order to identify the best models with respect to both metrics. Ideally, classifiers lie in the top right corner, offering high accuracy and high sparsity at the same time. We observe that OMP and GOMP tend to lie in the right part of the plot, having very high sparsity, often comparable to the aggressive lasso, even when they do not achieve the best accuracies.
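One way to read such an accuracy vs. sparsity plot is to keep only the models that are not dominated on both metrics. The sketch below is one possible formalization, not a procedure from the experiments: sparsity is assumed to be the fraction of zero weights, the function names are hypothetical, and the numbers in the example are made up for illustration rather than taken from the reported results.

```python
def sparsity(weights, tol=0.0):
    """Fraction of weights whose magnitude is at most tol (zero by default)."""
    return sum(abs(w) <= tol for w in weights) / len(weights)

def pareto_front(models):
    """models maps a name to (accuracy, sparsity); keep the non-dominated models."""
    front = []
    for name, (acc, sp) in models.items():
        dominated = any(
            a >= acc and s >= sp and (a > acc or s > sp)
            for other, (a, s) in models.items() if other != name
        )
        if not dominated:
            front.append(name)
    return front

# Made-up values for three hypothetical models.
models = {"model_a": (0.90, 0.99), "model_b": (0.93, 0.50), "model_c": (0.89, 0.40)}
best = pareto_front(models)
```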

Number of active features (atoms)
In both OMP and GOMP algorithms, the maximum desired number of active features ($K$, budget) was used as the stopping criterion. For instance, by setting $K = 1000$, the proposed methods return the learned values that correspond to the first $\{100, 200, \dots, 1000\}$ features, respectively. Thus, we exploit the feedforward feature selection structure of OMP and GOMP. Figure 4 presents the number of active features versus accuracy on the development subsets of the 20NG dataset. It can be easily observed that after selecting 1000 active atoms, the accuracy stabilizes or even drops (overfitting problem). For instance, the best numbers of active features are: i) science: 700, ii) sports: 1100, iii) religion: 400 and iv) computer: 1500. The reason for selecting $K = 2000$ as the number of features to examine was to provide a sufficient number for OMP to reach a good accuracy while providing a super-sparse solution comparable to lasso.
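Because features are added one at a time, a single greedy run yields a nested family of supports, one per budget, from which the best budget can be chosen on the development set. The sketch below illustrates this bookkeeping with hypothetical function names and made-up accuracies; it breaks ties in favor of the smaller budget, following the tie-breaking rule used for the other hyperparameters.

```python
def nested_supports(selection_order, step=100):
    """Prefixes of the greedy selection order: one support per budget."""
    return {k: selection_order[:k]
            for k in range(step, len(selection_order) + 1, step)}

def best_budget(dev_accuracy):
    """dev_accuracy maps budget -> accuracy; prefer accuracy, then fewer features."""
    return max(dev_accuracy, key=lambda k: (dev_accuracy[k], -k))

# Made-up development accuracies for four budgets.
dev_accuracy = {100: 0.90, 200: 0.93, 300: 0.93, 400: 0.91}
chosen = best_budget(dev_accuracy)
supports = nested_supports(list(range(250)), step=100)
```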

Time complexity
Although certain types of group lasso regularizers perform well, they require a notable amount of time in the learning process.
OMP offers fast learning time, given the hyperparameter values and the number of atoms. For example, on the computer subset of the 20NG dataset, learning models with the best hyperparameter value(s) for lasso, ridge, and elastic net took 7, 1.4, and 0.8 seconds, respectively, on a 4-core 3.00GHz CPU. On the other hand, OMP requires only 4 seconds for training, making it even faster than lasso, while providing a sparser model.
GOMP can have very slow learning time when adding the features as groups individually. This is due to the large number of groups that GOMP needs to explore in order to extract the most "contributing" ones. If we consider GOMP without the individual features as groups, then the learning process becomes faster, with a clear decreasing effect on accuracy. In general, groups need to be well structured for GOMP to manage to surpass OMP and other state-of-the-art group lasso regularizers.
The advantages of the proposed methods are that they (1) require no prior structural knowledge (in the case of OMP), (2) produce more discriminative features and (3) are fast when the number of dimensions is relatively small.
Moreover, compared to the implementation of Lozano et al. (2011), our implementation has the advantage of storing the weights instead of recomputing the whole matrices from scratch.
The drawbacks of the methods are that (1) OMP and GOMP are greedy algorithms, so GOMP becomes slow when we add the features as individual groups, and (2) the groups need to be "good".

Conclusion & Future Work
In this work, we introduced the OMP and GOMP algorithms to the text classification task. An extension of the standard GOMP algorithm was also proposed, which is able to handle overlapping groups. The main advantages of both OMP and GOMP compared to other regularization schemes are their simplicity (greedy feedforward feature selection) and their ability to produce accurate models with good generalization capabilities. We have shown that the proposed classifiers outperform standard baselines, as well as state-of-the-art structured regularizers, on various datasets. Similar to Mosci et al. (2010), Yen et al. (2017) and Xie et al. (2017), our empirical analysis validates that regularization remains a highly important topic, especially for deep learning models (Roth et al., 2017).
As mentioned previously, groups are not always specified in advance and can be hard to extract, especially in environments involving text. To address this problem, we plan to extend our work by learning the groups automatically with Simultaneous Orthogonal Matching Pursuit (Szlam et al., 2012). Another interesting future direction would be to additionally penalize features inside the groups, similarly to sparse group lasso. Moreover, it would be highly interesting to examine the theoretical properties of overlapping GOMP. Finally, as shown in recent work by Roth et al. (2017), regularization remains an open topic for deep learning models.