Intra-Topic Variability Normalization based on Linear Projection for Topic Classification

This paper proposes a variability normalization algorithm to reduce the variability between intra-topic documents for topic classification. Firstly, an optimization problem is constructed based on a linear variability removable assumption. Secondly, a new feature space for document representation is found by solving the optimization problem with kernel principal component analysis (KPCA). Finally, an effective feature transformation is performed through linear projection. In the experiments, the state-of-the-art SVM and KNN algorithms are adopted for topic classification. Experimental results on a free-style conversational corpus show that the proposed variability normalization algorithm achieves a 3.8% absolute improvement in micro-F1 measure.


Introduction
Topic classification is now faced with the problem of enormous variability between documents due to the exponential growth of free-style unstructured texts in recent years. This paper treats variability as differences between text documents and aims at reducing the intra-topic document variability for better topic classification. Various factors cause the intra-topic variability problem, such as the different language usages of different persons (Chambers, 1995; Fillmore et al., 2014). In the free-style conversations experimented on in this paper, different people use very different words to express their opinions. Therefore, documents on the same topic can be quite different because of the intra-topic variability problem.
In this work, we are interested in finding a robust document representation strategy to address the intra-topic variability problem. The traditional method represents a document by a high-dimensional TF-IDF vector based on the bag-of-words approach (Salton and McGill, 1986; Salton and Buckley, 1988). However, the TF-IDF feature reveals little semantic similarity information between terms, which increases the differences between intra-topic documents when different words are used. Beyond the TF-IDF strategy, there are two classes of techniques for document representation: unsupervised and supervised. The unsupervised techniques include latent semantic analysis methods. A typical method is Latent Semantic Indexing (LSI), in which the estimated features are linear combinations of the original features (Deerwester et al., 1990; Wang et al., 2013). Meanwhile, the popular Latent Dirichlet Allocation algorithm (Blei et al., 2003; Morchid et al., 2014) was proposed to represent documents with a generative probabilistic model. Moreover, in recent years, many neural-network-based methods have been investigated for document representation (Hinton and Salakhutdinov, 2006; Srivastava et al., 2013; Le and Mikolov, 2014). For example, in (Le and Mikolov, 2014), a model called paragraph vector was designed to represent each document by a dense vector, where the vector is trained by predicting all words in the corresponding document. On the other hand, supervised techniques for document representation include discriminative approaches, e.g., Linear Discriminant Analysis (Berry et al., 1995; Chakrabarti et al., 2003; Torkkola, 2004) and supervised latent semantic indexing (Sun et al., 2004; Chakraborti et al., 2007; Bai et al., 2009). Meanwhile, some improved linear analysis methods were proposed for encoding documents with reliable similarity information (Yih et al., 2011; Chang et al., 2013).
However, all those works for document representation paid little attention to the variability of intra-topic documents. Therefore, they could hardly solve the intra-topic variability problem in a direct way.
This paper makes a preliminary investigation into the intra-topic variability problem. The main purpose of this work is to find a new feature space with minimized intra-topic variability. An objective criterion is constructed for optimization. Mathematically, we make use of the topic label information of the training set to create a weighting matrix, and then sum over all the differences between intra-topic documents. A robust feature space with minimized intra-topic variability is then generated by solving the optimization problem with an effective KPCA-based algorithm. Finally, we accomplish the variability normalization operations for the baseline features. We also employ linear discriminant analysis as a supplementary algorithm. In the experiments, the state-of-the-art SVM and KNN algorithms are employed for topic classification. System performances are evaluated on a challenging free-style conversational database.
The rest of this paper is organized as follows. In Section 2, we introduce the proposed variability normalization algorithm for topic classification in detail. Section 3 then presents the experimental setup and results. Finally, conclusions and future work are given in Section 4.

Motivation for variability normalization
This work aims to find a robust document representation strategy for topic classification. The proposed algorithm is motivated by the Nuisance Attribute Projection (NAP) algorithm in the speaker verification field (Solomonoff et al., 2005; Solomonoff et al., 2007). We first make a linear variability removable assumption for document representation.
Mathematically, a given document can be denoted by a column vector x with dimensionality d as follows:

\[ \mathbf{x} = \mathbf{x}_t + \mathbf{x}_v \tag{1} \]

where x_t denotes the useful signal information in the current document, and x_v stands for the remaining noise. It is very difficult to model the noise in a document since it can come from various sources. Therefore, in this paper, we focus on the noise created by the variability among intra-topic documents. Our goal is to find a new document representation through linear projection:

\[ \hat{\mathbf{x}} = P\mathbf{x} \tag{2} \]

where P is the projection matrix. Since the goal of this paper is not dimensionality reduction, the new document representation has the same dimensionality as the source representation; the size of P is therefore d × d. This paper proposes to learn P by minimizing the following intra-topic variability:

\[ Q(P) = \sum_{i,j} w_{ij}\, \left\| P\mathbf{x}_i - P\mathbf{x}_j \right\|^2 \tag{3} \]

where w_ij is the element in the i-th row and j-th column of a weighting matrix W created in this work. The matrix is determined by the topic label information of the training set as follows:

\[ w_{ij} = \begin{cases} 1 & \text{if } \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ belong to the same topic} \\ 0 & \text{otherwise} \end{cases} \tag{4} \]
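As a concrete illustration of the criterion in (3) and the weighting matrix in (4), both can be computed directly as follows (a minimal NumPy sketch with illustrative function names and toy data; the double loop is written for clarity rather than efficiency):

```python
import numpy as np

def weighting_matrix(labels):
    """Eq. (4): w_ij = 1 if documents i and j share a topic, 0 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def intra_topic_variability(X, W, P=None):
    """Eq. (3): Q(P) = sum_ij w_ij * ||P x_i - P x_j||^2.

    X holds one document vector per column; P defaults to the identity."""
    d, n = X.shape
    Y = X if P is None else P @ X
    Q = 0.0
    for i in range(n):
        for j in range(n):
            if W[i, j]:
                diff = Y[:, i] - Y[:, j]
                Q += W[i, j] * float(diff @ diff)
    return Q
```

On a toy corpus of three documents where the first two share a topic, only the pair of intra-topic documents contributes to Q.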

Variability normalization algorithm
To derive the variability normalization algorithm, we follow the work of (Solomonoff et al., 2007) and re-write the projection matrix P in terms of the variability space (denoted by a unit vector v here) as follows:

\[ P = I - \mathbf{v}\mathbf{v}^T \tag{5} \]

Combining (3) and (5), we get

\[ Q = \sum_{i,j} w_{ij} \left[ \left\| \mathbf{x}_i - \mathbf{x}_j \right\|^2 - \left( \mathbf{v}^T (\mathbf{x}_i - \mathbf{x}_j) \right)^2 \right] \tag{6} \]

Since the first part of Q in (6) is independent of v, we discard it and create the final criterion

\[ Q' = -\sum_{i,j} w_{ij} \left( \mathbf{v}^T (\mathbf{x}_i - \mathbf{x}_j) \right)^2 \tag{7} \]

Unfolding (7) by linear operations, we get

\[ Q' = -2\, \mathbf{v}^T X (D - W) X^T \mathbf{v}, \qquad D = \mathrm{diag}(W\mathbf{1}) \tag{8} \]

where X denotes the training set matrix, each column of X represents one document vector, and 1 is a vector with all elements equal to 1. Minimizing (8) is equivalent to solving the following eigenvalue decomposition problem:

\[ X (D - W) X^T \mathbf{v} = \lambda \mathbf{v} \tag{9} \]

Here we apply the idea of KPCA (Solomonoff et al., 2007; Schölkopf et al., 1997) to solve (9). Denoting v by a new vector Xu, finding u turns into solving a generalized eigenvalue problem in kernel space:

\[ K (D - W) K \mathbf{u} = \lambda K \mathbf{u}, \qquad K = X^T X \tag{10} \]

The variability space is then constructed by selecting the set of eigenvectors corresponding to the d_1 largest eigenvalues:

\[ V = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_{d_1}] \tag{11} \]
Finally, a (d × d) projection matrix is obtained by combining (5), (11) and v = Xu. Based on this variability normalization algorithm, the baseline document vectors can be transformed into a new feature space with minimized intra-topic variability. The main procedure to implement intra-topic variability normalization can be divided into the following steps:
• Generate the sample matrix X using all n documents of the training set.
• Construct weighting matrix W according to (4) with the use of topic label information.
• Estimate a projection matrix P by solving the aforementioned eigenvalue problem.
• Transform all documents to the new feature space through linear projection according to (2).
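The steps above can be sketched in NumPy/SciPy as follows (an illustrative implementation under the one-document-per-column convention; the kernel regularization term and the final QR orthonormalization are implementation choices of this sketch, not prescribed by the derivation):

```python
import numpy as np
from scipy.linalg import eigh

def variability_projection(X, labels, d1, reg=1e-6):
    """Estimate the projection P = I - V V^T following Eqs. (5)-(11).

    X      : (d, n) matrix, one document vector per column
    labels : topic label for each of the n documents
    d1     : number of variability directions to remove
    """
    d, n = X.shape
    labels = np.asarray(labels)
    W = (labels[:, None] == labels[None, :]).astype(float)  # Eq. (4)
    D = np.diag(W.sum(axis=1))
    K = X.T @ X                                             # kernel matrix
    A = K @ (D - W) @ K                                     # Eq. (10), left side
    B = K + reg * np.eye(n)                                 # regularized kernel
    vals, U = eigh(A, B)                                    # generalized eigenproblem
    U = U[:, np.argsort(vals)[::-1][:d1]]                   # d1 largest eigenvalues
    V = X @ U                                               # back-project: v = X u
    V, _ = np.linalg.qr(V)                                  # orthonormal variability space
    return np.eye(d) - V @ V.T                              # Eq. (5) with d1 directions
```

Applying `P @ X` then realizes the transformation of Eq. (2); because P is an orthogonal projector, the intra-topic variability of Eq. (3) can only decrease after projection.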
It should be noticed that the feature transformation performed by the proposed variability normalization algorithm does not change the dimensionality of the document representation. This differs from existing dimensionality reduction methods, since our goal is to re-define the feature space for topic document representation. To demonstrate the effectiveness of the proposed algorithm, this paper presents experimental results on a challenging conversational dataset.

Experiments
In this section, we evaluate the proposed variability normalization method on a typical topic classification problem. We first introduce the experimental setup, including the dataset, evaluation criteria, and system description. All experimental results are then reported in detail.

Dataset
The data set used in this paper consists of the text transcripts of a free-style conversational speech database, the Fisher English corpus released by the LDC, which contains 11699 recorded conversations (Cieri et al., 2004). The corpus covers 40 different topics, including relatively distinct topics (e.g., "Comedy", "Smoking", "Terrorism") as well as topics covering similar subject areas (e.g., "Airport Security", "Bioterrorism", "Issues in the Middle East"). This paper randomly chooses 60 and 50 documents per topic for the training set and testing set, respectively. Another 50 documents per topic are randomly selected for the development set.

Evaluation criteria
We use two types of criteria to make a comprehensive evaluation of this work. The first evaluation criterion is the F1 measure derived from the recall and precision rates of a typical classification system. In detail, we report micro-average F1 and macro-average F1 results. Since topic classification is similar to topic verification, we choose the equal error rate (EER) as the second criterion, i.e., the operating point at which the miss probability equals the false-alarm probability.
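Both criteria can be computed with short self-contained routines such as the following (an illustrative sketch; the EER routine assumes one verification score per trial, which is an assumption about the evaluation setup rather than a detail given here):

```python
import numpy as np

def f1_scores(y_true, y_pred, n_topics):
    """Micro- and macro-averaged F1 over topic labels 0..n_topics-1."""
    tp = np.zeros(n_topics)
    fp = np.zeros(n_topics)
    fn = np.zeros(n_topics)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[p] += 1
        else:
            fp[p] += 1  # predicted topic gets a false positive
            fn[t] += 1  # true topic gets a false negative
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        per_topic = 2 * tp / (2 * tp + fp + fn)
    macro = np.nan_to_num(per_topic).mean()
    return micro, macro

def equal_error_rate(target_scores, nontarget_scores):
    """EER: threshold where miss and false-alarm rates coincide."""
    target_scores = np.asarray(target_scores, float)
    nontarget_scores = np.asarray(nontarget_scores, float)
    best_gap, eer = np.inf, 1.0
    for thr in np.concatenate([target_scores, nontarget_scores]):
        miss = np.mean(target_scores < thr)
        fa = np.mean(nontarget_scores >= thr)
        if abs(miss - fa) < best_gap:
            best_gap, eer = abs(miss - fa), (miss + fa) / 2
    return eer
```

For single-label classification, micro-F1 coincides with overall accuracy, while macro-F1 weights all topics equally regardless of their document counts.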

This paper constructs several systems for comparison. The configurations of our baseline system are shown in Table 1.

Table 1: Configurations of the baseline system.

Module          | Methods
Text processing | stop-word removal, stemming
Representation  | TF-IDF feature
Classification  | KNN, SVM

The Porter algorithm (Porter, 1980) is adopted for word stemming after stop-word removal. A vocabulary of 19534 unique words is then determined according to word-frequency information from the training set. Documents in the baseline system are represented using the popular TF-IDF term weighting strategy (Salton and Buckley, 1988). Two popular algorithms, SVM and KNN, are used separately for classification. The SVM classifier is implemented with the LIBSVM toolkit (Chang and Lin, 2011).
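For reference, the TF-IDF representation used in the baseline can be sketched as follows (a minimal pure-Python version using raw term frequency and a logarithmic IDF; the exact weighting variant of the baseline system may differ):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors for tokenized documents (bag-of-words).

    docs: list of documents, each a list of (stemmed) tokens.
    Returns the sorted vocabulary and one dense vector per document."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))            # document frequency per term
    vocab = sorted(df)
    idf = {w: math.log(n / df[w]) for w in vocab}
    vecs = []
    for doc in docs:
        tf = Counter(doc)              # raw term frequency
        vecs.append([tf[w] * idf[w] for w in vocab])
    return vocab, vecs
```

Note that a term appearing in every document receives zero weight under this IDF, which is one reason the TF-IDF feature carries little similarity information between distinct terms.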
Based on the baseline system, descriptions of other systems are given as below.
(1) LSI: documents are represented in latent semantic space estimated by the LSI algorithm (Deerwester et al., 1990) based on the baseline features.
(2) LDA: document features are transformed by linear discriminant analysis (here "LDA" denotes linear discriminant analysis rather than Latent Dirichlet Allocation). We select 50 eigenvectors to form the low-dimensional feature space.
(3) VarNorm: document features are transformed from the baseline TF-IDF vectors by the approach proposed in this paper. We select 60 eigenvectors for generating the projection matrix.
(4) VarNorm-LDA: a system combining VarNorm with LDA, which applies feature transformation twice to the original TF-IDF document features. The numbers of eigenvectors for VarNorm and LDA are set to 60 and 50, respectively.
All the parameters suggested in this paper are tuned on the development set. However, the eigenvector number is not restricted to 50 or 60; it is recommended to set it between 45 and 75, since we have 40 topics in the experiments.

Variability normalization performance
According to (3), we compare the intra-topic variability of the baseline and VarNorm systems; the only difference in the variability calculation is whether the projection matrix P is applied. Figure 1 shows the intra-topic variability on the 40 topics of the training set. The vertical axis represents the variability for each topic, while the horizontal axis stands for the 40 topics in the conversational corpus. As can be seen clearly, the variability of the baseline system is high, and after variability normalization it is reduced effectively. A detailed analysis shows that topic ENG06, whose theme is "Hypothetical Situations: Perjury - Do either of you think that you would commit perjury for a close friend or family member?", has the largest variability among documents in the whole corpus, whereas topic ENG13, "Movies: Do each of you enjoy going to the movies in a theater, or would you rather rent a movie and stay home? What was the last movie that you saw? Was it good or bad and why?", has the lowest. This reflects the difference between common and infrequent topics: since people use various words to express their ideas, it is reasonable that the variability problem is more serious for infrequent topics than for common ones.

Classification Results using KNN
Experimental results using the KNN classification algorithm are given in Table 2. The results show that, compared to the baseline system, the variability normalization system VarNorm achieves a 2% absolute F1 improvement and a 29% relative improvement in EER. When taking variability removal as a preliminary step and employing LDA as a secondary transformation, the VarNorm-LDA system achieves the best performance: the EER is improved by 65% relatively, and the micro-F1 measure is improved by 6.85% absolutely. The reason for this performance is straightforward: since the proposed algorithm effectively reduces the differences among intra-topic documents, the LDA algorithm can more easily and effectively maximize the ratio of between-class variance to within-class variance.

Classification Results using SVM
Similarly, the experimental results using the SVM classification algorithm are shown in Table 3. The baseline performance is better than that of the system using the KNN algorithm. The improvements achieved by LSI in the KNN system almost vanish here, while the VarNorm system keeps its improvement. The VarNorm system even works better than the LDA system, with nearly 15% relative improvement in EER and 3.4% absolute improvement in micro-F1 measure. The best results are obtained by the VarNorm-LDA system, with a 36% relative improvement in EER and a 3.75% absolute improvement in micro-F1 measure.

Conclusions and Future Work
In this paper, we investigated the intra-topic variability problem for topic classification. The major contribution of this work is an effective variability normalization approach for robust document representation. An optimization problem was constructed after making a linear variability removable assumption. To gain deeper insight into the performance of the proposed variability normalization algorithm, we conducted experiments on a challenging free-style conversational corpus. Experimental results based on both the SVM and KNN classification algorithms confirmed the robustness of the proposed approach. In conclusion, the variability normalization algorithm can be used as a front-end feature transformation strategy, and we suggest combining it with the linear discriminant analysis algorithm or other algorithms to further improve system performance.
Further study will investigate adaptive methods for constructing robust feature spaces. We will also combine this work with more document representation methods. Moreover, it would be very interesting to extend and combine our work with novel unsupervised machine learning techniques, such as the work of (Zhang and Jiang, 2015), which proposed a model for high-dimensional data by combining a linear orthogonal projection and a finite mixture model under a unified generative modeling framework.