Social Media Text Classification under Negative Covariate Shift

In a typical social media content analysis task, the user is interested in analyzing posts of a particular topic. Identifying such posts is often formulated as a classification problem. However, this problem is challenging. One key issue is covariate shift. That is, the training data is not fully representative of the test data. We observed that the covariate shift mainly occurs in the negative data because topics discussed in social media are highly diverse and numerous, but the user-labeled negative training data may cover only a small number of topics. This paper proposes a novel technique to solve the problem. The key novelty of the technique is the transformation of document representation from the traditional n-gram feature space to a center-based similarity (CBS) space. In the CBS space, the covariate shift problem is significantly mitigated, which enables us to build much better classifiers. Experimental results show that the proposed approach markedly improves classification.


Introduction
Applications using social media data, such as reviews, discussion posts, and (micro) blogs, are becoming increasingly popular. We observed from our collaborations with social science and health science researchers that in a typical application, the researcher first needs to obtain a set of posts of a particular topic that he/she wants to study, e.g., a political issue. Keyword search is often used as the first step. However, that is not sufficient due to low precision and low recall. A post containing the keyword "politics" may not be a political post, while a post that does not contain the keyword may be a political post. Thus, text classification is needed to make more sophisticated decisions to improve accuracy.
For classification, the user first manually labels a set of relevant posts (positive data) about the political issue and irrelevant posts (negative data) not about the political issue and then builds a classifier by running a learning algorithm, e.g. SVM or naïve Bayes. However, the resulting classifier may not be satisfactory. There may be many reasons. One key reason we observed is that the labeled negative training data is not fully representative of the negative test data.
Let the user-interested topic be P (positive), and the set of all other irrelevant topics discussed in a social media source be T = {T 1 , T 2 , …, T n }, which forms the negative data. n is usually large. However, due to the labor-intensive effort of manual labeling, the user can label only a certain number of training posts. Then the labeled negative training posts may cover only a small number of irrelevant topics S of T (S ⊆ T) as negative. Further, due to the highly dynamic nature of social media, it is probably impossible to label all possible negative topics. In testing, when posts of other negative topics in T−S show up, their classification can be unpredictable. For example, in an application, the training data has no negative examples about sports. However, in testing, some sports posts show up. These unexpected sports posts may be classified arbitrarily, which results in low classification accuracy. In this paper, we aim to solve this problem.
In machine learning, this problem is called covariate shift, a type of sample selection bias. In classic machine learning, it is assumed that the training and testing data are drawn from the same distribution. However, this assumption may not hold in practice such as in our case above, i.e., the training and the test distributions are different (Heckman 1979;Shimodaira 2000;Zadrozny 2004;Huang et al. 2007;Sugiyama et al. 2008;Bickel et al. 2009). In general, the sample selection bias problem is not solvable because the two distributions can be arbitrarily far apart from each other. Various assumptions were made to solve special cases of the problem. One main assumption was that the conditional distribution of the class given a data instance is the same in the training and test data sets (Shimodaira 2000;Huang et al. 2007;Bickel et al. 2009). This gives the covariate shift problem.
In this paper, we focus on a special case of the covariate shift problem. We assume that the covariate shift occurs mainly in the negative training and test data, and that little or no covariate shift exists in the positive training and test data. This assumption is reasonable because the user knows the type of posts/documents that s/he is looking for and can label many of them.
Following the notation of Bickel et al. (2009), our special case of the covariate shift problem can be stated formally as follows. Let the set of training examples be {(x 1 , y 1 ), (x 2 , y 2 ), …, (x k , y k )}, where x i is the data/feature vector and y i is the class label of x i . Let the set of test cases be {x k+1 , x k+2 , …, x n }, which have no class labels. Since we are interested in binary classification, y i is either 1 (positive class) or -1 (negative class). The labeled training data and the unseen test data have the same target conditional distribution p(y|x), and the marginal distributions of the positive data in training and testing are also the same. But the marginal distributions of the negative data in training and testing are different, i.e., $p_L(x^-) \neq p_T(x^-)$, where L, T, and the superscript - represent the labeled training data, the test data, and the negative class respectively.
Existing methods for addressing the covariate shift problem basically work as follows (see the Related Work section). First, they estimate the bias of the training data based on the given test data using some statistical techniques. Then, a classifier is trained on a weighted version of the original training set based on the estimated bias. Requiring the test data to be available in training is, however, a major weakness. In the social media post classification setting, the system needs to constantly classify the incoming data. It is infeasible to perform training constantly.
In this paper, we propose a novel learning technique that does not need the test data to be available during training due to the specific nature of our problem, i.e., the positive training data does not have the covariate shift issue.
One obvious solution to this problem is one-class classification (Schölkopf et al. 1999;Tax and Duin, 1999a), e.g., one-class SVM. We simply discard the negative training posts/documents completely because they have the covariate shift problem. Although this is a valid solution, as we will see in the evaluation section, the models built with one-class SVM perform poorly. Although it is conceivable to use an unsupervised method such as clustering, SVD (Alter et al., 2000) or LDA (Blei et al., 2003), supervised learning usually gives much higher accuracy.
In our proposed method, instead of performing supervised learning in the original document space based on n-grams, we perform learning in a similarity space. Thus, the key novelty of the method is the transformation from the original document space (DS) to a center-based similarity space (CBS). In the new space, the covariate shift problem is significantly mitigated, which enables us to build more accurate classifiers. The reason is that in CBS-based learning, the vectors in the similarity space enable SVM (the learning algorithm that we use) to find a good boundary of the positive class data based on similarity and to separate it from all possible negative class data, including negative data that are not represented in training. We will explain this in greater detail in Section 3.5 after we present the proposed algorithm, which we call CBS-L (for CBS Learning).
This paper makes three contributions: First, it formulates a special case of the covariate shift problem. This case occurs frequently in social media data classification as we discussed above. Second, it proposes a novel CBS space based learning method, CBS-L, which avoids the covariate shift problem to a large extent because it is able to find a good similarity boundary of the positive data. Third, it experimentally demonstrates the effectiveness of the proposed method.

Related Work
Traditional supervised learning assumes that the training and test examples are drawn from the same distribution. However, this assumption can be violated in many applications. This is especially the case for social media data because of the high topic diversity and constant changes of topics. This problem is known as covariate shift, which is a form of sample selection bias.
Sample selection bias was first introduced in econometrics by Heckman (1979). It came into the field of machine learning through the work of Zadrozny (2004). The main approach in machine learning is to first estimate the distribution bias of the training data based on the test data, and then learn using weighted training examples to compensate for the bias (Bickel et al. 2009).
For example, Shimodaira (2000) and Sugiyama and Muller (2005) proposed to estimate the training and test data distributions using kernel density estimation. The estimated density ratio is then used to generate weighted training examples. Dudik et al. (2005) and Bickel and Scheffer (2007) used maximum entropy density estimation, while Huang et al. (2007) proposed kernel mean matching. Sugiyama et al. (2008) and Tsuboi et al. (2008) estimated the weights for the training instances by minimizing the Kullback-Leibler divergence between the test and the weighted training distributions. Bickel et al. (2009) proposed an integrated model. As we discussed in the introduction, the need for the test data at the training time is a major weakness for social media data classification. The proposed technique CBS-L doesn't have this restriction.
As mentioned in the introduction, one-class classification is a suitable approach to solve the problem. Tax and Duin (1999a; 1999b) proposed a model for one-class classification called Support Vector Data Description (SVDD), which seeks a hyper-sphere around the positive data that encompasses the data points with the minimum radius. In order to balance between model over-fitting and under-fitting, Tax and Duin (2001) proposed a method that uses artificially generated outliers to optimize the model parameters. However, their experiments suggest that the procedure to generate artificial outliers in a hyper-sphere is only feasible for up to 30 dimensions. Also, as pointed out by Khan and Madden (2010; 2014), one drawback of these methods is that they often require a large dataset and become very inefficient in high dimensional feature spaces. Since text documents are usually represented in a much higher dimensional space, these methods are less suitable for text applications. Manevitz and Yousef (2001) performed one-class text classification using one-class SVM as proposed by Schölkopf et al. (1999). The method is based on identifying outlier data that are representative of the second class. Instead of assuming the origin is the only member of the outlier class, it assumes that data points with few non-zero entries are also outliers. However, as reported in the paper, their methods produce quite weak results (Schölkopf et al., 1999;2000). An improved version of one-class SVM for detecting anomalies has also been presented; its idea is to consider all data points that are close to the origin as outliers. Both Yang and Madden (2007) and Tian and Gu (2010) tried to refine Schölkopf's models by searching for optimal parameters. Luo et al. (2007) proposed a cost-sensitive one-class SVM algorithm for intrusion detection. We will see in the experiment section that one-class classification is far inferior to our proposed CBS-L method.
In this work, we propose to represent documents in the similarity space, and thus it is related to work on document representation. Alternative document representations have been proposed in the past and have been shown to perform well in many applications (Radev et al., 2000;He et al., 2004;Lebanon 2006;Ranzato and Szummer, 2008;Wang and Domeniconi, 2008). In (Radev et al., 2000), although the centroid sentence/document vector was computed, it was not transformed to a similarity space vector representation. Wang and Domeniconi (2008) proposed to use external knowledge to build semantic kernels for documents in order to improve text classification. In our problem, the main difficulty is that testing negative documents cannot be well covered in training. It is not clear how the enriched document representations could help solve our problem.
Our work is also related to learning from positive and unlabeled examples, also known as PU learning (Denis, 1998;Yu et al. 2002;Liu et al. 2003;Lee and Liu, 2003;Elkan and Noto, 2008;Li et al. 2010). In this learning model, there is a set of labeled positive training data and a set of unlabeled data, but there is no labeled negative training data. Clearly, their setting is different from ours too. There is also no guarantee that the unlabeled data has the same distribution as the future test data.
Our problem is also very different from domain adaptation as we work in the same domain. Due to the use of document similarity, our method has some resemblance to learning to rank (Li, 2011;Liu, 2011). However, CBS-L is very different because we perform supervised classification. Our similarity is also center-based rather than pair-wise document similarity, which is also used in (Qian and Liu 2013) for spam detection.

The Proposed CBS Learning
We now formulate the proposed supervised learning in the CBS space, called CBS-L. The key difference between CBS learning and the classic document space (DS) learning is in the document representation, which applies to both training and testing documents or posts. In the next subsection, we first give the intuitive idea and a simple example. The detailed algorithm follows. In Section 3.5, we explain why CBS-L is better than DS-based learning when unexpected negative data appear in the test set.

Basic Idea
In the proposed CBS-L formulation, each document d is still represented as a feature vector, but the vector no longer represents the document d itself based on n-grams. Instead, it represents a set of similarity values between document d and the center of the positive documents. Specifically, each document d in either class is transformed into a center-based similarity vector
    s d = Sim(d, c)
where c is the center vector of the positive class and Sim is a similarity function consisting of a set of similarity measures. Each feature in s d is called a cbs-feature. s d still has the same original class label as d. Let us see an actual example. We assume that our single center vector for the positive class has been computed (see Section 3.2) based on the unigram representation of documents:
    c: 1:1 2:1 6:2
where y:z represents a ds-feature y (e.g., a word) and its feature value z (e.g., term frequency, tf). We want to transform the following positive document d 1 and negative document d 2 (ds-vectors) to their cbs-vectors (the first number is the class):
    d 1 : 1 1:2 2:1 3:1
    d 2 : -1 2:2 3:1 5:2
If we use cosine as the first similarity measure in Sim, we can generate a cbs-feature 1:0.50 for d 1 (as cosine(c, d 1 ) = 0.50) and a cbs-feature 1:0.27 for d 2 (as cosine(c, d 2 ) = 0.27).
If we have more similarity measures, more cbs-features will be produced. The resulting cbs-vectors for d 1 and d 2 with their class labels, 1 and -1, are:
    d 1 : 1 1:0.50 …
    d 2 : -1 1:0.27 …
We now have a binary classification problem in the CBS space. The final step simply runs a classification algorithm, e.g., SVM, to build a classifier. We use SVM in our work.
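The cosine values in the example above can be checked with a short sketch. The dictionary layout for sparse vectors and the helper name `cosine` are illustrative assumptions, not part of the original system.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {feature_id: value}."""
    dot = sum(val * v.get(key, 0.0) for key, val in u.items())
    norm_u = math.sqrt(sum(val * val for val in u.values()))
    norm_v = math.sqrt(sum(val * val for val in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Center vector and the two ds-vectors from the example (feature_id: tf).
c = {1: 1, 2: 1, 6: 2}
d1 = {1: 2, 2: 1, 3: 1}   # positive document
d2 = {2: 2, 3: 1, 5: 2}   # negative document

# Each cbs-vector keeps the original class label and holds one cbs-feature here.
cbs_d1 = (1, [cosine(c, d1)])
cbs_d2 = (-1, [cosine(c, d2)])
print(round(cbs_d1[1][0], 2), round(cbs_d2[1][0], 2))
```

Rounded to two decimals, the two features match the 0.50 and 0.27 quoted in the text.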

CBS Based Learning
We are given a binary text classification problem. Let D = {(d 1 , y 1 ), (d 2 , y 2 ), …, (d n , y n )} be the set of training examples, where d i is a document and y i ∈ {1, -1} is its class label. Traditional classification directly uses D to build a binary classifier. However, in the CBS space, we learn a classifier that returns 1 for documents that are "close enough" to the center of the training positive documents and -1 for documents elsewhere. We now detail the proposed technique. As we mentioned above, instead of using one single ds-vector to represent a document d i ∈ D, we use a set R d of p ds-vectors, $R_d = \{x_1^d, x_2^d, \ldots, x_p^d\}$. Each vector $x_i^d$ denotes one document space representation of the document, e.g., the unigram representation. We then compute the center of the positive training documents, which is represented as a set of centroids C = {c 1 , c 2 , …, c p }, each of which corresponds to one document space representation in R d . The way to compute each center c i is similar to that in the Rocchio relevance feedback method in information retrieval (Rocchio, 1971;Manning et al. 2008), which uses the corresponding ds-vectors of all training positive and negative documents. The details are given below. Based on R d for document d and the center C, we can transform a document d from its document space representations R d to one center-based similarity vector cbs-v d by applying a similarity function to each element $x_i^d$ of R d and its corresponding center c i . We now detail the document transformation.

Training document transformation:
The training data transformation from ds-vectors to cbs-vectors performs the following two steps: Step 1: Compute the set C of centroids for the positive class. Each centroid vector c i ∈ C is for one document representation $x_i^d$ (the i-th ds-vector of a document d). It is computed by applying the Rocchio method to the corresponding ds-vectors of all documents in both the positive and negative training data:
$$c_i = \frac{\alpha}{|D^+|} \sum_{d \in D^+} \frac{x_i^d}{\|x_i^d\|} - \frac{\beta}{|D - D^+|} \sum_{d \in D - D^+} \frac{x_i^d}{\|x_i^d\|}$$
where $D^+$ is the set of documents in the positive class and |.| is the size function. $\alpha$ and $\beta$ are parameters, which are usually set empirically. It is reported that with the tf-idf representation, $\alpha$ = 16 and $\beta$ = 4 usually work quite well (Buckley et al. 1994). The subtraction is used to reduce the influence of those terms that are not discriminative (i.e., terms appearing in both positive and negative documents).
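A minimal sketch of Step 1, assuming the common Rocchio form in which every ds-vector is length-normalized, the positive documents are averaged with weight alpha, and the remaining (negative) documents are averaged and subtracted with weight beta. The function names and the toy vectors are hypothetical.

```python
import math
from collections import defaultdict

def l2_normalize(vec):
    """Scale a sparse vector {term: weight} to unit length (empty if all-zero)."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm > 0 else {}

def rocchio_center(pos_vectors, neg_vectors, alpha=16.0, beta=4.0):
    """alpha * mean(normalized positives) - beta * mean(normalized negatives)."""
    center = defaultdict(float)
    for vec in pos_vectors:
        for term, weight in l2_normalize(vec).items():
            center[term] += alpha * weight / len(pos_vectors)
    for vec in neg_vectors:
        for term, weight in l2_normalize(vec).items():
            center[term] -= beta * weight / len(neg_vectors)
    return dict(center)

# Toy ds-vectors for one representation (e.g., unigram tf-idf weights).
pos = [{"battery": 1.0, "the": 1.0}, {"battery": 1.0}]
neg = [{"the": 1.0}]
center = rocchio_center(pos, neg)
print(center)
```

In this toy case the term "the", which occurs in both classes, ends up with a much smaller weight in the center than "battery", which is the effect the subtraction is meant to have.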
Step 2: Compute the similarity vector cbs-v d (center-based similarity space vector) for each document d ∈D based on its set of document space vectors R d and the corresponding centroids C of the positive documents.

cbs-v d = Sim(R d , C)
Sim has a set of similarity measures, and each measure m j is applied to the p document representations $x_i^d$ in R d and their corresponding centers c i in C to generate p similarity features (cbs-features) in cbs-v d . We discuss the ds-features and similarity measures for computing cbs-features in the next two subsections.
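Step 2 can be sketched as follows: every measure is applied to each (representation, center) pair, so m measures over p representations yield m × p cbs-features. The function name `sim`, the feature ordering, and the second toy measure `shared_terms` are assumptions made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def shared_terms(u, v):
    """Toy measure (hypothetical): number of terms present in both vectors."""
    return float(len(set(u) & set(v)))

def sim(doc_reps, centers, measures):
    """Return cbs-v_d: each measure applied to each representation and its center."""
    if len(doc_reps) != len(centers):
        raise ValueError("need exactly one center per document representation")
    features = []
    for measure in measures:
        for x, c in zip(doc_reps, centers):
            features.append(measure(x, c))
    return features

# p = 2 representations (unigram, bigram) for one document, and their centers.
R_d = [{"battery": 1.0, "life": 1.0}, {"battery life": 1.0}]
C = [{"battery": 2.0}, {"battery life": 1.0, "screen size": 1.0}]
cbs_v = sim(R_d, C, [cosine, shared_terms])
print(cbs_v)
```

The same function serves test documents, since the centers are computed once from the training data and then reused.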

Complexity:
The data transformation step is clearly linear in the number of examples, i.e., n.

Test document transformation:
For each test document d, we can use step 2 above to produce a cbs-vector for d.

DS-Features
In order to compute cbs-features (center-based similarity space features) for each document, we need to have the ds-features of a document and the center of the positive class. We discuss ds-features first, which are extracted from each document itself.
Since our task is document classification, we use the popular unigram, bigram and trigram with tf-idf weighting as the ds-features for a document. These three types of ds-features also give us three different document representations.
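As an illustration, the three representations could be built as below. The sketch assumes whitespace tokenization and the plain tf × log(N/df) variant of tf-idf; the paper does not specify either, and the helper names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_vectors(token_lists, n):
    """One sparse tf-idf vector per document for n-grams of order n."""
    counts = [Counter(ngrams(tokens, n)) for tokens in token_lists]
    doc_freq = Counter(term for count in counts for term in count)
    num_docs = len(token_lists)
    return [
        {term: tf * math.log(num_docs / doc_freq[term]) for term, tf in count.items()}
        for count in counts
    ]

docs = ["great battery life", "battery died fast", "great screen"]
tokens = [doc.split() for doc in docs]

# Three document space representations: unigram, bigram, trigram.
representations = {n: tfidf_vectors(tokens, n) for n in (1, 2, 3)}
print(representations[2][0])
```

Short documents simply get an empty vector for n-gram orders longer than the document.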

CBS-Features
Ds-vectors are transformed into cbs-vectors by applying a set of similarity measures to each document space vector and the corresponding center vector. In this work, we employed five similarity measures from (Cha, 2007) to gauge the similarity of two vectors. Based on these measures, we produce 15 CBS features using the unigram, bigram, and trigram representations of each document. The similarity measures we used are listed in Table 1, where P and Q are two vectors and d represents the dimension of P and Q.
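The count of 15 features follows from 5 measures × 3 representations. The sketch below shows this with five stand-in measures taken from the inner product family surveyed by Cha (2007); whether they coincide with the five measures in Table 1 is an assumption, since the table is not reproduced in this text. P and Q are dense lists of equal dimension.

```python
import math

def inner_product(p, q):
    return sum(a * b for a, b in zip(p, q))

def cosine(p, q):
    denom = math.sqrt(inner_product(p, p)) * math.sqrt(inner_product(q, q))
    return inner_product(p, q) / denom if denom else 0.0

def jaccard(p, q):
    pq = inner_product(p, q)
    denom = inner_product(p, p) + inner_product(q, q) - pq
    return pq / denom if denom else 0.0

def dice(p, q):
    denom = inner_product(p, p) + inner_product(q, q)
    return 2.0 * inner_product(p, q) / denom if denom else 0.0

def harmonic_mean(p, q):
    # Terms with a zero denominator are skipped.
    return 2.0 * sum(a * b / (a + b) for a, b in zip(p, q) if a + b != 0)

# Stand-in measures (assumed, not necessarily those of Table 1).
MEASURES = [inner_product, cosine, jaccard, dice, harmonic_mean]

def cbs_features(doc_reps, centers):
    """Apply every measure to every (representation, center) pair."""
    return [m(x, c) for m in MEASURES for x, c in zip(doc_reps, centers)]

# Toy unigram, bigram, and trigram vectors for one document and the centers.
doc_reps = [[1.0, 0.0, 2.0], [0.5, 0.5], [1.0]]
centers = [[1.0, 1.0, 0.0], [0.5, 0.0], [2.0]]
features = cbs_features(doc_reps, centers)
print(len(features))
```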

Why Does CBS Space Learning Work?
We now try to explain why CBS learning (CBS-L) can deal with the covariate shift problem, and thus can perform better than document space learning. The reason is that, due to the use of similarity features, CBS-L is essentially trying to generate a boundary for the positive training data: similarity is not directional and thus covers all directions in a spherical shape in the space. In classification, negative data from any direction outside the spherical shape can be detected. The covariate shift problem will therefore not affect the classification much. Many types of documents that are not represented in the negative training data will still be detected due to their low similarity. For example, in Figure 1, we want to build an SVM classifier to separate positive data represented as black squares and negative data represented as empty circles. The constructed CBS-L classifier would look like a circle (in dashed line) in the original document space covering the positive data. The size of this (boundary) circle depends on the separation margin between the two classes. Although data points represented by empty triangles are not represented in the negative training data (which has only empty circles) in building the classifier, our classifier is able to identify them as not positive at test time because they are outside the boundary circle. If we had used the document space (DS) features to build an SVM classifier, the classifier would be a line (see Figure 1) between the positive data (black squares) and the negative data (empty circles). This line unfortunately will not be able to identify data points represented as empty triangles as not positive, because the triangles actually lie on the positive side and would be classified as positive, which is clearly wrong.
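The geometric argument can be mimicked with a toy two-dimensional sketch. It is not the paper's SVM: both decision rules use hand-set midpoint thresholds, distance to the positive center stands in for (inverse) similarity, and all points are made up to resemble the squares, circles, and triangles described above.

```python
import math

positives = [(5.0, 5.0), (5.5, 4.5), (4.5, 5.5)]        # black squares
train_negatives = [(9.0, 5.0), (9.5, 4.5), (8.5, 5.5)]  # empty circles (seen in training)
unseen_negatives = [(1.0, 5.0), (0.5, 5.5)]             # empty triangles (test only)

# Center of the positive class.
center = tuple(sum(axis) / len(positives) for axis in zip(*positives))

def distance_to_center(point):
    return math.hypot(point[0] - center[0], point[1] - center[1])

# Center-based rule: a circle halfway between the farthest positive
# and the nearest training negative.
radius = (max(map(distance_to_center, positives))
          + min(map(distance_to_center, train_negatives))) / 2.0

def center_rule(point):
    return 1 if distance_to_center(point) <= radius else -1

# Document-space stand-in: a vertical line halfway between the two training classes.
x_threshold = (max(p[0] for p in positives) + min(p[0] for p in train_negatives)) / 2.0

def linear_rule(point):
    return 1 if point[0] < x_threshold else -1

for point in unseen_negatives:
    print(point, "linear:", linear_rule(point), "center-based:", center_rule(point))
```

Both rules separate the training data, but only the center-based rule rejects the unseen negatives, which lie on the positive side of the line.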

Experiments
In this section, we evaluate the proposed learning in the center-based similarity space (CBS-L) and compare it with baselines.

Experimental Dataset
As stated at the beginning of the paper, this work was motivated by the real-life problem of identifying the right social media posts or documents for specific applications. For an effective evaluation, we need a large number of classes in the data to reflect the topic richness and diversity of social media. The whole data also has to be labeled for evaluation. Using online reviews of a large number of products is a natural choice because there are many types of products and services and there is no need to do manual labeling, which is very labor intensive, time consuming, and error prone. We obtained the Amazon review database from the authors of (Jindal and Liu 2008), and constructed a dataset with reviews of 50 types of products, which we also call 50 topics. Each topic (a type of product) has 1000 reviews. For each topic, we randomly sampled 700 reviews/documents for training and used the remaining 300 reviews for testing. Note that although we use this product review collection, we do not perform sentiment classification. Instead, we still perform the traditional topic-based classification. That is, given a review, the system decides what type of product the review is about. In our experiments, we use every topic as the positive class. This gives us 50 classification results.
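The data layout described above can be sketched as follows. The review identifiers are synthetic placeholders, and the helper names and seeds are assumptions made for illustration.

```python
import random

def split_topic(review_ids, n_train=700, seed=0):
    """Randomly split one topic's reviews into training and test portions."""
    rng = random.Random(seed)
    shuffled = list(review_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# 50 topics with 1000 (synthetic) review ids each.
topics = {
    f"topic_{t:02d}": [f"topic_{t:02d}/review_{i}" for i in range(1000)]
    for t in range(50)
}
splits = {name: split_topic(ids, seed=idx) for idx, (name, ids) in enumerate(topics.items())}

def build_training_set(splits, positive_topic, negative_topics):
    """Label the positive topic's training reviews 1 and the chosen negatives -1."""
    train = [(doc, 1) for doc in splits[positive_topic][0]]
    for topic in negative_topics:
        train += [(doc, -1) for doc in splits[topic][0]]
    return train

negative_topics = [f"topic_{t:02d}" for t in range(1, 11)]
train = build_training_set(splits, "topic_00", negative_topics)
print(len(train))
```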

Baselines
We use three baselines in our evaluation.

Document space one-class SVM (ds-osvm):
As we discussed earlier, due to the covariate shift problem in the negative training data, one solution is to drop the negative training data completely and build a one-class classifier. One-class SVM is the state-of-the-art one-class classification algorithm. We apply one-class SVM to the documents in the document space as one of the baselines. One-class SVM was first introduced by Schölkopf et al. (1999;2000) and is based on the assumption that the origin is the only member of the second class. The data is first mapped into a transformed feature space via a kernel, and then standard two-class SVM is employed to construct a hyper-plane that separates the data and the origin with maximum margin. As mentioned earlier, there is also the support vector data description (SVDD) formulation for one-class classification proposed by Tax and Duin (1999a;1999b). SVDD seeks to distinguish the positive class from all other possible data in the space. It basically finds a hyper-sphere around the positive class data that contains almost all points in the data set with the minimum radius. It has been shown that the use of the Gaussian kernel makes SVDD and one-class SVM equivalent, and the results reported in (Khan and Madden, 2014) demonstrate that SVDD and one-class SVM are comparable when the Gaussian kernel is applied. Thus in this paper, we just use one-class SVM, which is one of the SVM-based classification tools in the LIBSVM library (version 3.20) (Chang and Lin, 2011).

Center-based similarity space one-class SVM (cbs-osvm):
Instead of applying one-class SVM to documents in the original document space, this baseline applies it to the CBS space after the documents are transformed to CBS vectors.

SVM:
This baseline is the SVM applied in the original document space. Although in this case, there is covariate shift problem, we want to see how serious the problem might be, and how the proposed CBS-L technique can deal with the problem. We use the SVM tool in LIBSVM.

Kernels and Parameters
Because Khan and Madden (2014) pointed out that one-class SVM performs best when the Gaussian kernel is used, we use the Gaussian kernel as well. Manevitz and Yousef (2001) applied one-class SVM to text classification and reported that one-class SVM works best with the binary feature weighting scheme compared to the tf or tf-idf weighting schemes. They also reported that a small number of features (10) with the highest document frequency performed best with the Gaussian kernel. We also use the binary representation, but found that 10 features are already too many in our case. In fact, 5 features give the best results. Using a small number of features is intuitive because finding the boundary in a very high dimensional space is very difficult. We also tried more features, but the results were poorer.
For SVM classification in the document space, we use the linear kernel as it has been shown by many researchers that the linear kernel performs the best (e.g., Joachims, 1998;Colas and Brazdil, 2006). We experimented with RBF kernels extensively, but they did not perform well with the traditional document representation. The term weighting scheme is tf-idf (Colas and Brazdil, 2006) with no feature selection.
For our proposed method CBS-L, we use tf-idf values of unigram, bigram and trigram to represent a document in three ways in the document space. As mentioned earlier, five document similarity functions are used to transform document space vectors to CBS space vectors. And in order to filter out less useful features for the center vector of the positive class, we performed feature selection in the document space using the classic information gain method (Yang and Pedersen, 1997) to empirically choose the most effective 100 features for the positive class. For all the kernels, we use the default parameter settings in the LIBSVM systems. We tried to tune the parameters, but did not get better results.

Results
We now present the experiment results. As mentioned above, we treat each topic as the positive class. This gives 50 tests. To test the effect of covariate shift, we also vary the number of topics in the negative class. We used 10, 20, 30, and 40 topics in the training negative class. The test set always has 49 topics of negative data.
For each setting, we give three sets of results for the positive class, which is the target topic data that we are interested in obtaining through classification. Each set of results includes the standard measures of precision, recall, and F1-score for the positive class. The three sets are:
1. In-training: In this case, the test negative data contains only documents from the negative topics used in training.
2. Not-in-training: In this case, the test negative data contains only documents from the negative topics not used in training.
3. Combined: In this case, the test negative data contains documents from all 49 negative topics.
Table 2 summarizes the results. Notice that for ds-osvm, it does not make sense to have in-training and not-in-training results because it does not use any training negative data. Thus, there is only one set of results for "Combined," which is duplicated in the table for easy comparison. However, note that cbs-osvm uses negative data for training in order to compute the center for the positive class.
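A sketch of the evaluation protocol, assuming that "in-training" test negatives come from the topics used as training negatives and "not-in-training" test negatives from the remaining topics. The function names and the seed are hypothetical; the metrics are the standard positive-class precision, recall, and F1.

```python
import random

def split_negative_topics(all_topics, positive_topic, k, seed=0):
    """Pick k negative topics for training; the rest are unseen until test time."""
    others = [t for t in all_topics if t != positive_topic]
    rng = random.Random(seed)
    in_training = rng.sample(others, k)
    not_in_training = [t for t in others if t not in in_training]
    return in_training, not_in_training

def precision_recall_f1(gold, predicted, positive=1):
    """Precision, recall, and F1 for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

all_topics = [f"topic_{t:02d}" for t in range(50)]
in_training, not_in_training = split_negative_topics(all_topics, "topic_00", k=10)
combined = in_training + not_in_training   # all 49 negative topics

# Toy predictions for four test documents.
print(precision_recall_f1([1, 1, -1, -1], [1, -1, 1, -1]))
```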
From the table, we can make the following observations (since there are many numbers, we only focus on F1-scores).
1. The proposed CBS-L method performs markedly better than all baselines. For the results of in-training, not-in-training, and combined, CBS-L is consistently better than all baselines in all cases. Even for in-training, CBS-L performs better than SVM. This clearly shows the superiority of the proposed CBS-L method.
2. ds-osvm performs poorly. cbs-osvm is much better because it uses the negative data in feature selection and center computation.
3. SVM in the document space performed poorly (Combined) when only a small number of negative topics are used in training. It gets better than both one-class SVM baselines when more negative topics are used in training (see the reason in the next point).
4. Finally, we can also see that as the number of training negative topics increases, the results of the combined case of both SVM and CBS-L improve. This is expected because with the increased number of negative topics for training, the number of not-in-training negative topics for testing decreases and the covariate shift problem gets smaller. We can also see that the F1-scores of cbs-osvm, SVM and CBS-L for not-in-training improve with the increased training negative topics for the same reason. However, their F1-scores drop for in-training because with more negative topics, the data becomes more skewed, which hurts in-training classification.
Table 3: F1-score for each positive topic or class in the combined case (columns: ds-osvm, cbs-osvm, SVM, CBS-L).
To give a flavor of the detailed results for each topic (product), we give the full results for one setting with 30 randomly selected topics as the training negative data (Table 3). The results in the table are F1-scores of the combined case.

Conclusion
Accurately retrieving relevant posts about a topic from social media is a challenging problem. This paper attempted to solve this problem by identifying and dealing with the technical issue of covariate shift. The key idea of our technique is to transform the document representation from the traditional n-gram feature space to a similarity-based space. Our experimental results show that the proposed method CBS-L outperformed strong baselines by large margins.