Semi-Stacking for Semi-supervised Sentiment Classification

In this paper, we address semi-supervised sentiment learning via semi-stacking, which integrates two or more semi-supervised learning algorithms from an ensemble learning perspective. Specifically, we apply meta-learning to predict the labels of the unlabeled data given the outputs from the member algorithms, and we propose N-fold cross validation to guarantee a suitable amount of data for training the meta-classifier. Evaluation on four domains shows that such a semi-stacking strategy performs consistently better than its member algorithms.


Introduction
The past decade has witnessed rapidly growing interest in sentiment analysis from the natural language processing and data mining communities, due to its inherent challenges and wide applications (Pang et al., 2008; Liu, 2012). One fundamental task in sentiment analysis is sentiment classification, which aims to determine the sentiment orientation that a piece of text expresses (Pang et al., 2002). For instance, the sentence "I absolutely love this product." should be classified as positive in sentiment orientation. While early studies focus on supervised learning, where only labeled data are used to train the classification model (Pang et al., 2002), recent studies increasingly aim to reduce the heavy dependence on large amounts of labeled data by applying semi-supervised learning approaches, such as co-training (Wan, 2009; Li et al., 2011), label propagation (Sindhwani and Melville, 2008), and deep learning (Zhou et al., 2013), to sentiment classification. Empirical evaluation on various domains demonstrates the effectiveness of unlabeled data in enhancing the performance of sentiment classification. However, semi-supervised sentiment classification remains challenging for the following reason.
Although various semi-supervised learning algorithms are now available and have been shown to be successful in exploiting unlabeled data to improve the performance of sentiment classification, each algorithm has its own characteristics, with different pros and cons. It is rather difficult to tell which performs best in general, and therefore it remains difficult to pick a suitable algorithm for a specific domain. For example, as shown in Li et al. (2013), the co-training algorithm with personal and impersonal views yields better performance in two product domains, Book and Kitchen, while the label propagation algorithm yields better performance in the other two product domains, DVD and Electronics.
In this paper, we overcome the above challenge by combining two or more algorithms, instead of picking one of them, to perform semi-supervised learning. The basic idea of our algorithm ensemble approach is to apply meta-learning to re-predict the labels of the unlabeled data after obtaining their results from the member algorithms. First, a small portion of the labeled samples in the initial labeled data, namely meta-samples, are treated as unlabeled samples and added to the initial unlabeled data to form the new unlabeled data. Second, we use the remaining labeled data as the new labeled data to perform semi-supervised learning with each member algorithm. Third, we collect the meta-samples' probability results from all member algorithms to train a meta-learning classifier (called the meta-classifier). Fourth and finally, we use the meta-classifier to re-predict the unlabeled samples as new automatically-labeled samples. Because the amount of labeled data in semi-supervised learning is limited, we use N-fold cross validation to obtain more meta-samples for better learning of the meta-classifier. In principle, the above ensemble learning approach can be seen as an extension of the well-known stacking approach (Džeroski and Ženko, 2004) to semi-supervised learning. For convenience, we call it semi-stacking.
The remainder of this paper is organized as follows. Section 2 overviews the related work on semi-supervised sentiment classification. Section 3 proposes our semi-stacking strategy for semi-supervised sentiment classification. Section 4 proposes the data filtering approach to filter low-confidence unlabeled samples. Section 5 evaluates our approach on a benchmark dataset. Finally, Section 6 gives the conclusion and future work.

Related Work
Early studies on sentiment classification mainly focus on supervised learning methods, with an emphasis on algorithm design and feature engineering (Pang et al., 2002; Cui et al., 2006; Riloff et al., 2006; Li et al., 2009). Recently, most studies on sentiment classification aim to improve the performance by exploiting unlabeled data in two main ways: semi-supervised learning (Dasgupta and Ng, 2009; Wan, 2009; Li et al., 2010) and cross-domain learning (Blitzer et al., 2007; He et al., 2011; Li et al., 2013). Specifically, existing approaches to semi-supervised sentiment classification can be categorized into two main groups: bootstrapping-style and graph-based.
As for bootstrapping-style approaches, Wan (2009) considers two different languages as two views and applies co-training to conduct semi-supervised sentiment classification. Similarly, Li et al. (2010) propose two views, named personal and impersonal views, and apply co-training to use unlabeled data in a monolingual corpus. More recently, Gao et al. (2014) propose feature subspace-based self-training for semi-supervised sentiment classification. Empirical evaluation demonstrates that subspace-based self-training outperforms co-training with personal and impersonal views.
As for graph-based approaches, Sindhwani and Melville (2008) first construct a document-word bipartite graph to describe the relationship among the labeled and unlabeled samples and then apply label propagation to get the labels of the unlabeled samples.
Unlike the above studies, our research on semi-supervised sentiment classification does not focus on a single semi-supervised learning algorithm but combines two or more semi-supervised learning algorithms through ensemble learning. To the best of our knowledge, this is the first attempt to combine two or more semi-supervised learning algorithms in semi-supervised sentiment classification.

Semi-Stacking for Semi-supervised Sentiment Classification
In semi-supervised sentiment classification, the learning algorithm aims to learn a classifier from a small set of labeled samples, named the initial labeled data, together with a large number of unlabeled samples. In the sequel, we refer to the labeled data as $L$ and the unlabeled data as $U$ (see Table 1).

Framework Overview
In our approach, two member semi-supervised learning algorithms are involved, namely $l_{semi}^{1}$ and $l_{semi}^{2}$, and the objective is to leverage both of them to obtain a better-performing semi-supervised learning algorithm. Our basic idea is to apply meta-learning to re-predict the labels of the unlabeled data given the outputs from the member algorithms. Figure 1 shows the framework of our implementation of this basic idea. The core component in semi-stacking is the meta-classifier learned from the meta-learning process, i.e., $c_{meta}$. This classifier aims to make a better prediction on the unlabeled samples by combining the two different probability results from the two member algorithms.
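To make the role of the meta-classifier concrete, the following is a minimal sketch of a classifier that takes the four posterior probabilities produced by two member algorithms and outputs a new positive-class probability. The learner is an assumption: the text above does not say which model is used for the meta-classifier, so a small logistic regression trained by stochastic gradient descent stands in for it here. The function names, hyperparameters, and toy numbers are hypothetical.

```python
import math


def predict_positive(model, features):
    """Return the positive-class probability for one meta-feature vector."""
    weights, bias = model
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))


def train_meta_classifier(meta_samples, epochs=200, learning_rate=0.5):
    """Fit logistic regression on (features, label) pairs, label in {0, 1}.

    Assumption: logistic regression is only a stand-in for the meta-classifier.
    """
    dim = len(meta_samples[0][0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for features, label in meta_samples:
            # Gradient of the log loss with respect to the score.
            error = predict_positive((weights, bias), features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


# Toy meta-samples: [p1(pos), p1(neg), p2(pos), p2(neg)] with the true label.
# Member 1 is reliable here, member 2 is noisy (made-up numbers).
meta_samples = [
    ([0.90, 0.10, 0.60, 0.40], 1),
    ([0.10, 0.90, 0.55, 0.45], 0),
    ([0.80, 0.20, 0.30, 0.70], 1),
    ([0.20, 0.80, 0.70, 0.30], 0),
    ([0.85, 0.15, 0.80, 0.20], 1),
    ([0.15, 0.85, 0.20, 0.80], 0),
]
model = train_meta_classifier(meta_samples)

# Re-predict two unlabeled samples from their member-algorithm outputs.
p_a = predict_positive(model, [0.85, 0.15, 0.57, 0.43])
p_b = predict_positive(model, [0.15, 0.85, 0.48, 0.52])
```

In this toy setting the meta-classifier can learn to weight the more reliable member more heavily, which is the intuition behind combining the two probability results rather than trusting either one alone.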

Meta-learning
As shown above, the meta-classifier is the core component in semi-stacking and is trained through the meta-learning process. Here, meta- means that the learning samples are not represented by traditional descriptive features, e.g., bag-of-words features, but by the result features generated by the member algorithms. In our approach, the learning samples in meta-learning are represented by the posterior probabilities, produced by the member algorithms, of the unlabeled samples belonging to the positive and negative categories, i.e.,
$$x_{meta} = \langle p_1(pos \mid x),\ p_1(neg \mid x),\ p_2(pos \mid x),\ p_2(neg \mid x) \rangle \qquad (1)$$
where $p_k(\cdot \mid x)$ denotes the posterior probability estimated by the $k$-th member algorithm.
The framework of the meta-learning process is shown in Figure 2. In detail, we first split the initial labeled data $L$ into two partitions, $L_{new}$ and $L_{un}$, where $L_{new}$ is used as the new initial labeled data while $L_{un}$ is merged into the unlabeled data $U$ to form a new set of unlabeled data $L_{un} \cup U$. Second, the two semi-supervised algorithms are run with the labeled data $L_{new}$ and the unlabeled data $L_{un} \cup U$. Third and finally, the probability results of the samples in $L_{un}$, together with their real labels, are used as meta-learning samples to train the meta-classifier. The feature representation of each meta-sample is defined in Formula (1).
One problem of meta-learning is that $L_{un}$ might be too small to learn a good meta-classifier. To make better use of the labeled samples in the initial labeled data, we employ N-fold cross validation to generate more meta-samples. Specifically, we first split $L$ into $N$ folds. Second, we select one fold as $L_{un}$, take the others as $L_{new}$, and generate the meta-learning samples as described above. Third and finally, we repeat the second step $N-1$ times, selecting a different fold as $L_{un}$ each time. In this way, we obtain as many meta-learning samples as there are samples in the initial labeled data. Figure 3 presents the algorithm description of meta-learning with N-fold cross validation. In our implementation, we set $N$ to 10.
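The meta-sample generation with N-fold cross validation described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the member-algorithm interface (a callable that takes labeled samples and unlabeled texts and returns one `(p_pos, p_neg)` pair per text), the interleaved fold split, and the keyword-based `keyword_member` stand-in are all hypothetical.

```python
def build_meta_samples(labeled, unlabeled, members, n_folds=10):
    """Generate one meta-sample per initial labeled sample via N-fold CV.

    labeled:   list of (text, label) pairs, the initial labeled data L
    unlabeled: list of texts, the unlabeled data U
    members:   callables (labeled, texts) -> list of (p_pos, p_neg)
               (hypothetical interface for a member algorithm)
    """
    # Assumption: a simple interleaved split; shuffle `labeled` beforehand.
    folds = [labeled[i::n_folds] for i in range(n_folds)]
    meta_samples = []
    for k, l_un in enumerate(folds):
        # L_new: all folds except the held-out one.
        l_new = [s for j, fold in enumerate(folds) if j != k for s in fold]
        # L_un is stripped of its labels and merged with U.
        pool = [text for text, _ in l_un] + list(unlabeled)
        outputs = [member(l_new, pool) for member in members]
        # The first len(l_un) outputs belong to the held-out samples.
        for i, (_, label) in enumerate(l_un):
            features = [p for probs in outputs for p in probs[i]]
            meta_samples.append((features, label))
    return meta_samples


def keyword_member(cue_words):
    """Toy stand-in for a semi-supervised member algorithm (ignores training data)."""
    def run(labeled, texts):
        results = []
        for text in texts:
            hits = sum(word in text.lower() for word in cue_words)
            p_pos = (1 + hits) / (2 + hits)
            results.append((p_pos, 1 - p_pos))
        return results
    return run


labeled = [("great product, love it", 1), ("terrible, waste of money", 0)] * 10
unlabeled = ["love the design", "poor quality", "great value"]
members = [keyword_member({"love", "great"}),
           keyword_member({"great", "value", "design"})]
meta_samples = build_meta_samples(labeled, unlabeled, members, n_folds=10)
```

Each labeled sample is held out exactly once, so the number of meta-samples equals the size of the initial labeled data, and each meta-sample carries two probabilities per member algorithm.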
The dataset contains product reviews from four different domains: Book, DVD, Electronics and Kitchen appliances (Blitzer et al., 2007), each of which contains 1000 positive and 1000 negative labeled reviews. In each domain, we randomly select 100 instances as labeled data and 400 instances as test data, and use the remaining 1500 instances as unlabeled data.
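The per-domain split can be sketched as below. The text does not say whether the random selection is stratified by class, so this sketch assumes a plain random shuffle; the function name, the seed, and the placeholder review identifiers are hypothetical.

```python
import random


def split_domain(reviews, n_labeled=100, n_test=400, seed=0):
    """Randomly split one domain into labeled, test, and unlabeled parts.

    Assumption: plain (non-stratified) random selection.
    """
    rng = random.Random(seed)
    shuffled = list(reviews)
    rng.shuffle(shuffled)
    labeled = shuffled[:n_labeled]
    test = shuffled[n_labeled:n_labeled + n_test]
    unlabeled = shuffled[n_labeled + n_test:]
    return labeled, test, unlabeled


# Placeholder data: 1000 positive and 1000 negative reviews for one domain.
reviews = [(f"review-{i}", 1 if i < 1000 else 0) for i in range(2000)]
labeled, test, unlabeled = split_domain(reviews)
```

With 2000 reviews per domain, this yields 100 labeled, 400 test, and 1500 unlabeled instances, matching the setting described above.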

Features: Each review text is treated as a bag-of-words and transformed into a binary vector encoding the presence or absence of word unigrams and bigrams.
Supervised learning algorithm: The maximum entropy (ME) classifier, implemented with the public Mallet toolkit (http://mallet.cs.umass.edu/), which provides probability outputs.
Semi-supervised learning algorithms: (1) The first member algorithm is self-trainingFS, proposed by Gao et al. (2014). This approach can be seen as a special case of self-training. Unlike traditional self-training, self-trainingFS uses feature-subspace classifiers, instead of the whole-space classifier, to make predictions on the unlabeled samples. In our implementation, we use four random feature subspaces. (2) The second member algorithm is label propagation, a graph-based semi-supervised learning approach proposed by Zhu and Ghahramani (2002). In our implementation, the document-word bipartite graph is adopted to build the document-document graph (Sindhwani and Melville, 2008).
Significance testing: We perform a t-test to evaluate the significance of the performance difference between two systems with different approaches (Yang and Liu, 1999).
Figure 4 compares the performance of the baseline approach and three semi-supervised learning approaches. Here, the baseline approach is the supervised learning approach that uses only the initial labeled data (i.e., no unlabeled data is used). From the figure, we can see that both self-trainingFS and label propagation are successful in exploiting unlabeled data to improve the performance. Self-trainingFS outperforms label propagation in three domains, Book, DVD, and Kitchen, but performs worse in Electronics. Our approach (semi-stacking) performs much better than the baseline, with an improvement of 4.95% on average.
Compared to the two member algorithms, semi-stacking always yields better performance, although the improvement over the better-performing member algorithm is slight, only around 1%-2%. Significance testing shows that our approach performs significantly better than the worse-performing member algorithm (p-value<0.01) in all domains, and it also performs significantly better than the better-performing member algorithm (p-value<0.05) in three domains, i.e., Book, DVD, and Kitchen.
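For readers who want to reproduce the input representation used in these experiments, the binary unigram-and-bigram encoding from the feature setting can be sketched as follows. The tokenizer (lowercasing plus a simple regular expression) and the function names are assumptions; the tokenization actually used in the experiments is not specified in the text.

```python
import re


def binary_ngram_features(text):
    """Return the set of word unigrams and bigrams present in a review."""
    # Assumption: lowercase alphanumeric tokens; real tokenization may differ.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    unigrams = set(tokens)
    bigrams = {f"{a} {b}" for a, b in zip(tokens, tokens[1:])}
    return unigrams | bigrams


def to_binary_vector(features, vocabulary):
    """Encode presence (1) or absence (0) of each vocabulary term."""
    return [1 if term in features else 0 for term in vocabulary]


features = binary_ngram_features("I love love this product")
vector = to_binary_vector(features, ["love", "this product", "hate"])
```

Because the representation is binary, a repeated word such as "love" contributes the same value as a single occurrence.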

Conclusion
In this paper, we present a novel ensemble learning approach, named semi-stacking, for semi-supervised sentiment classification. Semi-stacking is implemented by re-predicting the labels of the unlabeled samples with meta-learning after two or more member semi-supervised learning approaches have been performed. Experimental evaluation in four domains demonstrates that semi-stacking outperforms both member algorithms.