Feature Selection for Short Text Classification using Wavelet Packet Transform

Text classification tasks suffer from the curse of dimensionality due to their large feature space. Short text data further exacerbates the problem because of its sparse and noisy nature. Feature selection thus becomes an important step in improving classification performance. In this paper, we propose a novel feature selection method using the Wavelet Packet Transform (WPT). WPT has been used widely in various fields due to its efficiency in encoding transient signals. We demonstrate how short text classification can benefit from feature selection using WPT, owing to the sparse nature of short text. Our technique chooses the most discriminating features by computing inter-class distances in the transformed space. We experimented extensively with several short text datasets. Compared to well-known techniques, our approach reduces the feature space size and improves the overall classification performance significantly on all the datasets.


Introduction
The text classification task consists of assigning a document to one or more classes. This can be done using machine learning techniques by training a model on labelled documents. Documents are usually represented as vectors using a variety of techniques such as bag-of-words (unigram, bigram), TF-IDF, etc.
Typically, text corpora have a very high dimensional document representation, equal to the size of the vocabulary. This leads to the curse of dimensionality in machine learning models, thereby degrading performance.
Short text corpora, such as SMS messages and tweets, suffer in particular from a sparse, high dimensional feature space due to their large vocabulary and short document length. To give an idea of how these factors affect the size of the feature space, we compare the Reuters corpus with a Twitter corpus. The Reuters-21578 corpus has approximately 2.5 million words in total and 14506 unique vocabulary entries after standard preprocessing steps (which is the dimensionality of the feature space). In contrast, the Twitter 1 corpus we used for our experiments has approximately 15,000 words in total and a feature space of 7423 words. Additionally, the average length of an English tweet is 10-12 words, whereas the average length of a document in the Reuters-21578 news classification corpus is 200 words. Therefore, the dimensionality is extremely high even for small corpora of short texts. In addition, the average number of words in a document is significantly lower in short text data, leading to higher sparsity in the feature space representation of documents.
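The sparsity described above can be illustrated with a minimal sketch. The toy messages, the whitespace tokenisation, and the helper names bag_of_words and sparsity are assumptions made for illustration and do not come from the datasets or preprocessing used in the paper.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for whitespace-tokenised documents."""
    vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for tok, count in Counter(doc.lower().split()).items():
            vec[index[tok]] = count
        vectors.append(vec)
    return vocab, vectors


def sparsity(vectors):
    """Fraction of zero entries across all document vectors."""
    total = sum(len(v) for v in vectors)
    zeros = sum(1 for v in vectors for x in v if x == 0)
    return zeros / total


# Hypothetical toy messages, not taken from any dataset in the paper.
docs = ["free prize call now", "see you at lunch", "call me after lunch"]
vocab, vectors = bag_of_words(docs)
print(len(vocab), sparsity(vectors))
```

Even in this tiny example most entries of each vector are zero, and the proportion of zeros grows as the vocabulary grows while documents stay short.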
Owing to this high dimensionality problem, feature selection is one of the important steps in text classification workflows. Many feature selection techniques exist for traditional documents, and a few seminal survey articles have been written on the topic (Blitzer, 2008). In contrast, there is much less work on statistical feature selection for short text; more attention has gone to feature engineering for word normalization, canonicalization, etc. (Han and Baldwin, 2011).
In this paper, we propose a dimensionality reduction technique for short text based on the wavelet packet transform, called Improvised Adaptive Discriminant Wavelet Packet Transform (IADWPT). IADWPT performs dimensionality reduction by selecting discriminative features (wavelet coefficients) from the Wavelet Packet Transform (WPT) representation. Short text data resembles transient signals in its vector representation, and WPT encodes transient signals (signals lasting for a very short duration) well, using very few coefficients (Learned and Willsky, 1995). This leads to a considerable decrease in the dimensionality of the feature space along with an increase in classification accuracy. Additionally, we optimise the procedure for selecting the most discriminative features from the WPT representation. To the best of our knowledge, this is the first attempt to apply an algorithm based on the wavelet packet transform to feature selection in short text classification.

Related Work
Feature selection has been widely adopted for dimensionality reduction of text datasets in the past. Yang and Pedersen (1997) performed a comparative study of some of these methods, including document frequency, information gain (IG), mutual information (MI), the χ²-test (CHI) and term strength (TS). They concluded that IG and CHI are the most effective in aggressive dimensionality reduction. The mRMR technique proposed by Peng et al. (2005) selects the best feature subset by increasing the relevance of features to the target class and reducing redundancy between the chosen features.
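As a concrete reference point for the χ²-test baseline, the sketch below computes the statistic for one term and one class from a 2x2 contingency table, following the standard formulation used in comparative studies of text feature selection. The function name chi_square and the example counts are assumptions for illustration.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a term/class 2x2 contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that do not contain the term
    d: documents outside the class that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or the class never varies
    return n * (a * d - c * b) ** 2 / denom


# Hypothetical counts: a term that perfectly separates two classes of 10 documents,
# and a term that is independent of the class.
perfect = chi_square(10, 0, 0, 10)
independent = chi_square(5, 5, 5, 5)
print(perfect, independent)
```

Terms are then ranked by this score and the highest-scoring ones are kept.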
The wavelet transform provides a time-frequency representation of a given signal. This representation is useful for describing signals with time-varying frequency content. A detailed explanation of wavelet transform theory is beyond the scope of this paper; for the theory, refer to Daubechies (Daubechies, 2006; Daubechies, 1992; Coifman and Wickerhauser, 2006) and Robi Polikar (Polikar, ). The first use of the wavelet transform for compression was proposed by Ronald R. Coifman et al. (Coifman et al., 1994). Hammad Qureshi et al. (Qureshi et al., 2008) proposed an adaptive discriminant wavelet packet transform (ADWPT) for feature reduction.
In the past, the wavelet transform has been applied to natural language processing tasks. A survey on wavelet applications in data mining (Li et al., 2002) discusses the basics and properties of wavelets that make them a very effective technique in data mining. C. C. Aggarwal (Aggarwal, 2002) uses wavelets for string classification. He notes that the wavelet technique creates a hierarchical decomposition of the data which can capture trends at varying levels of granularity, and that this new representation helps the classification task. Geraldo Xexeo et al. (Xexeo et al., 2008) used the wavelet transform to represent documents for classification.

Wavelet Packet Transform for Short-text Dimensionality Reduction
Feature selection compresses the feature space while preserving the maximum discriminative power of the features for classification. We use this analogy to compress the document feature space using the Wavelet Packet Transform. The vector representation of a document (e.g. a dictionary-encoded vector) is equivalent to a digital signal representation. This vector can then be processed using the wavelet transform to obtain a compressed representation of the document in terms of wavelet coefficients. Document features are thus transformed into wavelet coefficients, which are ranked and selected based on their discrimination power between classes. The classification model is trained on these highly informative coefficients. Results show a considerable improvement in model accuracy using our dimensionality reduction technique. Typically, the vector representation of a short text has very few non-zero entries due to the short length of the documents. If we plot the count of each word in the dictionary on the y-axis against the distinct words on the x-axis, the resulting graph has very few spikes, just like a transient signal. Transient signals last for a very short time relative to the whole duration of the observation. Learned and Willsky (1995) show the efficacy of the wavelet packet transform in representing transient signals. This motivates our use of the Wavelet Packet Transform to encode short text.
The wavelet transform is a popular choice for feature representation in image processing. Our approach is inspired by the related work of Qureshi et al. (2008), who propose an Adaptive Discriminant Wavelet Packet Transform (ADWPT) based representation for meningioma subtype classification. ADWPT obtains a wavelet-based representation by optimising the discrimination power of the various features. The proposed technique, IADWPT, differs from ADWPT in the way discriminative features are selected. The next section provides details of the proposed approach.

IADWPT - Improvised Adaptive Discriminant Wavelet Packet Transform
This section presents the proposed short text feature selection technique, IADWPT. IADWPT uses the wavelet packet transform of the data to extract useful discriminative features from the sub-bands at various depths. Natural language processing tasks usually represent documents in a dictionary-encoded bag-of-words representation. This numerical vector representation of a document is equivalent to a signal representation. To obtain the IADWPT representation of a document, the following steps are performed:
1) Compute the full wavelet packet transform of the document vector representation.
2) Compute the discrimination power of each coefficient in the wavelet packet transform representation.
3) Select the most discriminative coefficients to represent all the documents in the corpus.
Once the 1-D wavelet transform is computed at a desired level l, the wavelet packet transform (WPT) produces $2^l$ different sets of coefficients (nodes in the WPT tree). These coefficients represent the magnitude of the various frequencies present in the signal at a given time. We select the most discriminative coefficients to represent all the documents in the corpus by calculating the discriminative power of each coefficient.
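The structure of the full wavelet packet tree can be sketched as follows. For brevity this illustration uses the Haar wavelet rather than the Coiflets used in the experiments, assumes the signal length is divisible by 2 raised to the chosen level, and returns sub-bands in natural (not frequency) order; the function names and the toy document vector are hypothetical.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def wavelet_packet(signal, level):
    """Full wavelet packet tree: return the 2**level sub-bands at `level`."""
    nodes = [list(signal)]
    for _ in range(level):
        next_nodes = []
        for node in nodes:
            # Unlike the plain wavelet transform, every node is split again,
            # not only the approximation branch.
            approx, detail = haar_step(node)
            next_nodes.extend([approx, detail])
        nodes = next_nodes
    return nodes


# Hypothetical count vector over a 16-word vocabulary with three words present.
doc = [0] * 16
doc[2], doc[7], doc[12] = 1, 2, 1
subbands = wavelet_packet(doc, level=3)
energy = sum(x * x for band in subbands for x in band)
print(len(subbands), energy)
```

Because the transform is orthonormal, the total energy of the coefficients equals that of the original vector, so selecting a subset of coefficients amounts to keeping part of the signal energy.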
The classification task consists of c classes with d documents. The 1-D wavelet packet transform of the k-th document $d_k$ yields l levels with f sub-bands, each consisting of m coefficients. $x_{m,f,l}$ denotes the coefficients of the wavelet packet transform. The following terms are defined for Algorithm 1.
• The probability density estimate $S_{m,f,l}$ of a particular sub-band in level l, for a training sample document $d^{i}_{k}$ of a given class $c_i$, is given by: [equation missing]. Here, $x_{m,f,l}$ is the m-th coefficient in the f-th sub-band of the l-th level of document $d_k$, where j varies [...]
• The discriminative power $D^{a,b}_{m,f,l}$ of the m-th coefficient in the f-th sub-band of the l-th level, between classes a and b, is defined as follows: [equation missing]. Discriminative power is the Hellinger distance between the average probability density estimates of a coefficient for the two classes. It quantifies the difference in the average value of a coefficient between a pair of classes. The larger the difference, the better the discriminative power of the coefficient. Thus discriminative features tend to have a higher average probability density in one of the classes, whereas redundant features cancel out when the difference is taken in computing the distance. Rajpoot (2003) has shown the efficacy of the Hellinger distance applied to ADWPT.
Algorithm 1 IADWPT algorithm for best discriminative feature selection
1: for all classes C do
[...]
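One form consistent with the definitions above is sketched below. It assumes the normalised-energy density estimate commonly used in discriminant wavelet packet methods, with j ranging over the coefficients of sub-band f at level l. The averaging symbol $A$, the class size $n_a$, and the particular Hellinger-style expression are assumptions introduced for illustration and may differ from the authors' exact definitions.

```latex
S^{k}_{m,f,l} = \frac{\left(x^{k}_{m,f,l}\right)^{2}}{\sum_{j}\left(x^{k}_{j,f,l}\right)^{2}},
\qquad
A^{a}_{m,f,l} = \frac{1}{n_a}\sum_{d_k \in a} S^{k}_{m,f,l},
\qquad
D^{a,b}_{m,f,l} = \left|\sqrt{A^{a}_{m,f,l}} - \sqrt{A^{b}_{m,f,l}}\right|
```

Under this reading, a coefficient that carries a large share of the sub-band energy in one class and a small share in the other receives a large value of $D^{a,b}_{m,f,l}$.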
Selecting coefficients with greater discriminative power helps the classifier perform well. The full procedure is given in Algorithm 1.
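The ranking and selection step can be sketched as follows for a single sub-band and a single pair of classes. The normalised-energy density and the absolute difference of square roots are assumed forms of the density estimate and Hellinger-style distance, not necessarily the exact expressions of Algorithm 1; all function names and the toy coefficients are hypothetical.

```python
import math


def density(coeffs):
    """Normalised energy of each coefficient within one sub-band (assumed form)."""
    energy = sum(x * x for x in coeffs)
    return [x * x / energy if energy else 0.0 for x in coeffs]


def class_average(docs):
    """Average density per coefficient over the documents of one class."""
    dens = [density(d) for d in docs]
    return [sum(col) / len(dens) for col in zip(*dens)]


def discriminative_power(docs_a, docs_b):
    """Hellinger-style distance per coefficient between two classes (assumed form)."""
    avg_a, avg_b = class_average(docs_a), class_average(docs_b)
    return [abs(math.sqrt(p) - math.sqrt(q)) for p, q in zip(avg_a, avg_b)]


def top_coefficients(power, m):
    """Indices of the m coefficients with the largest discriminative power."""
    return sorted(range(len(power)), key=lambda i: power[i], reverse=True)[:m]


# Hypothetical sub-band coefficients for two documents per class.
class_a = [[2.0, 0.0, 1.0, 0.0], [2.0, 0.0, 0.0, 1.0]]
class_b = [[0.0, 2.0, 1.0, 0.0], [0.0, 2.0, 0.0, 1.0]]
power = discriminative_power(class_a, class_b)
selected = top_coefficients(power, m=2)
print(power, selected)
```

In this toy case the first two coefficients separate the classes, while the last two have the same average density in both classes and receive zero discriminative power.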
Multi-class classification can be handled in this framework using one-vs-one classification. We select the top m features from the wavelet representation to represent the data in the classification task. The time complexity of the algorithm is polynomial. The method is based on the adaptive discriminant wavelet packet transform (ADWPT) (Qureshi et al., 2008); we therefore name it the improvised adaptive discriminant wavelet packet transform (IADWPT). ADWPT uses the best basis for classification, which is a union of selected sub-bands that spans the transformed space, so noise is still retained in the signal. IADWPT instead selects coefficients from the sub-band having maximal discriminative power, which improves the classification results. Unlike ADWPT, IADWPT is a one-way transform: the original signal cannot be recovered from the transform domain. Experimental results confirm that IADWPT performs better than ADWPT on short text datasets.
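In a one-vs-one scheme, discriminative power is computed once per unordered pair of classes. A minimal sketch of enumerating those pairs is shown below; the label names and the helper class_pairs are hypothetical.

```python
from itertools import combinations


def class_pairs(labels):
    """All unordered class pairs used by a one-vs-one scheme."""
    return list(combinations(sorted(set(labels)), 2))


# Hypothetical labels for a three-class problem.
pairs = class_pairs(["neg", "pos", "neutral", "pos"])
print(pairs)
```

For c classes this yields c(c-1)/2 pairs, each of which gets its own ranking of coefficients.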

Experiments and Results
We used multiple short text datasets to evaluate the efficacy of the proposed algorithm against state-of-the-art feature selection algorithms.
1) Twitter 1: This dataset is part of the SemEval 2013 task B dataset (Nakov et al., ) for two-class sentiment classification. We gathered 624 examples each in the positive and negative classes for our experiments.
2) SMS Spam 1: The UCI spam dataset (Almeida, ) consists of 5,574 SMS instances classified into SPAM and HAM classes. SPAM is the class of messages that are not useful, and HAM is the class of useful messages. We compare our results with those published by the dataset authors (Almeida et al., 2013) and therefore followed the experimental procedure described in that paper: the first 30% of the samples were used for training and the rest for testing.
3) SMS Spam 2: This dataset was published by Yadav et al. (2011). It consists of 2000 SMS messages, 1000 SPAM and 1000 HAM. The experimental settings are the same as for SMS Spam 1.
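The ordered split used for the SMS datasets can be sketched as below, assuming the samples are kept in their original order. The helper name head_split and the rounding of the cut point are assumptions for illustration.

```python
def head_split(samples, train_fraction=0.3):
    """Use the first `train_fraction` of the samples for training, the rest for testing."""
    cut = round(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]


# Hypothetical stand-in for a corpus of 2000 messages.
train, test = head_split(list(range(2000)))
print(len(train), len(test))
```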
The goal of our experiments is to examine the effectiveness of the proposed algorithm for feature selection on short text datasets. We measure the effectiveness of a feature selection technique by the increase in accuracy on the final machine learning task. Our method does not depend on the specific classifier used in the final classification. Therefore, we used Support Vector Machines (Cortes and Vapnik, 1995) and Logistic Regression, with and without dimensionality reduction, on the unigram representation to benchmark performance. All experiments were done with 10-fold cross-validation and grid search on the C parameter. We report classification accuracy, measured as (#correctly classified datapoints) / (#total datapoints). We conducted detailed experiments comparing IADWPT using Coiflets of order 2 with other feature selection techniques such as PCA, Mutual Information, χ², mRMR (Peng et al., 2005) and ADWPT (Qureshi et al., 2008). Results are reported in Table 1, which gives the best accuracy values and the respective feature set size selected by each technique. It can be observed that IADWPT gives the best accuracy in most cases with very few features.
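The accuracy measure and the fold structure of 10-fold cross-validation can be sketched as follows. This is an illustrative sketch using contiguous, unshuffled folds; the function names are hypothetical and the classifier training and grid search over C are omitted.

```python
def accuracy(y_true, y_pred):
    """Fraction of data points whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) for k contiguous folds over n samples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size


# Hypothetical labels and predictions for four data points.
score = accuracy([1, 0, 1, 1], [1, 1, 1, 0])
folds = list(k_fold_indices(25, k=10))
print(score, len(folds))
```

Each data point appears in exactly one test fold, and the reported accuracy would be the average of the per-fold accuracies.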
We compared the performance of our algorithm with mRMR. Results for the SMS Spam 2 dataset are shown in Figure 2 and Figure 3. The plots indicate that our algorithm outperforms the state-of-the-art mRMR algorithm on this dataset. mRMR could not finish execution on the remaining datasets. It can also be observed from the results in Table 1 and Figures 1 and 2 that the performance of feature selection algorithms follows a consistent pattern on short text. The observed order of performance, from best to worst, is IADWPT, mRMR, PCA, Chi Square, MI. Further, IADWPT performs feature selection without losing discriminative information even when the dimensionality is reduced to as little as 1/40th of the original feature space, and it steadily maintains accuracy as dimensionality is reduced, which makes it suitable for aggressive dimensionality reduction. The reduced dimensionality also makes machine learning (ML) models faster to train. We plotted the discriminative power of the coefficients in each dataset. The plots suggest that very few coefficients contain most of the discriminative power; working with only these coefficients can therefore yield good accuracy while reducing dimensionality aggressively. The results establish the effectiveness of IADWPT in compressing short text feature representations and reducing noise to improve classifi-

IADWPT Effectiveness
Short text data is noisy and contains many features that are irrelevant to the classification task. IADWPT effectively removes the noise in the approximations (as the signal strength is greater than the noise). The feature selection step at the sub-band level, as described in Algorithm 1, enforces the selection of good discriminative features and thus improves classifier accuracy while reducing feature space dimensionality at the same time.
Features from the sub-bands of the signal are chosen based on their discriminative power; therefore, the original signal information is lost and the transform is not reversible.
IADWPT compresses the data well without losing discriminative information, even when the dimensionality is reduced to as little as 1/40th of the original feature space, and it steadily maintains accuracy as dimensionality is reduced, which makes it a suitable technique for dimensionality reduction. The reduced dimensionality also makes machine learning models faster to train. Figure 4 shows the discriminative power values $D^{a,b}_{m,f,l}$ of the coefficients, arranged in descending order, for the SMS Spam 2 dataset. The other datasets displayed similar graphs. From the figure it can be observed that a few coefficients hold most of the discriminative power, and thus aggressive dimensionality reduction is possible with the IADWPT algorithm. The results establish the effectiveness of IADWPT in compressing short text feature representations and reducing noise.
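The observation that a few coefficients hold most of the discriminative power can be quantified with a simple cumulative count, sketched below. The power values are made up for illustration and the helper coefficients_for_share is hypothetical; it is not part of the reported experiments.

```python
def coefficients_for_share(power, share=0.8):
    """Smallest number of top-ranked coefficients whose power sums to `share` of the total."""
    ranked = sorted(power, reverse=True)
    total = sum(ranked)
    running = 0.0
    for count, value in enumerate(ranked, start=1):
        running += value
        if running >= share * total:
            return count
    return len(ranked)


# Hypothetical discriminative power values with two dominant coefficients.
power = [0.01, 0.5, 0.02, 0.4, 0.01, 0.03, 0.02, 0.01]
needed = coefficients_for_share(power, share=0.8)
print(needed, len(power))
```

A curve that drops off quickly, as in this toy example, means that a small value of m already captures most of the separation between classes.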

Conclusion and Future Work
In this paper, we have proposed the IADWPT algorithm for effective dimensionality reduction of short text corpora. The algorithm can be used in a number of scenarios where high dimensionality and sparsity pose a challenge. Experiments demonstrate the efficacy of IADWPT-based dimensionality reduction for short text data. This technique can prove useful in a number of social media data analysis applications. In future, we would like to explore theoretical bounds on the best number of dimensions to choose from the wavelet representation.