Co-training for Semi-supervised Sentiment Classification Based on Dual-view Bags-of-words Representation

A review text is normally represented as a bag-of-words (BOW) in sentiment classification. Such a simplified BOW model has fundamental deficiencies in modeling some complex linguistic phenomena such as negation. In this work, we propose a dual-view co-training algorithm based on dual-view BOW representation for semi-supervised sentiment classification. In dual-view BOW, we automatically construct antonymous reviews and model a review text by a pair of bags-of-words with opposite views. We make use of the original and antonymous views in pairs, in the training, bootstrapping and testing process, all based on a joint observation of two views. The experimental results demonstrate the advantages of our approach, in meeting the two co-training requirements, addressing the negation problem, and enhancing the semi-supervised sentiment classification efficiency.


Introduction
In the past decade, there has been an explosion of user-generated subjective texts on the Internet in the form of online reviews, blogs and microblogs. With the need to automatically identify sentiments and opinions in those online texts, sentiment classification has attracted much attention in the field of natural language processing.
Much previous research has focused on the task of supervised sentiment classification. However, in some domains, it is hard to obtain a sufficient amount of labeled training data, and manual annotation is expensive and time-consuming. To address this problem, semi-supervised learning approaches have been employed in sentiment classification to reduce the need for labeled reviews by taking advantage of unlabeled reviews.
The dominant text representation in both supervised and semi-supervised sentiment classification is the bag-of-words (BOW) model, which struggles to capture the meaning of a review and to handle complex linguistic structures such as negation. For example, the BOW representations of two opposite reviews "It works well" and "It doesn't work well" are considered to be very similar by most statistical learning algorithms.
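The similarity problem in this example can be made concrete with a small sketch. The tokenizer and the choice of cosine similarity are assumptions made here for illustration; the paper does not commit to a particular similarity measure at this point.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text and count word tokens (apostrophes stay inside words)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

positive = bag_of_words("It works well")
negative = bag_of_words("It doesn't work well")
similarity = cosine_similarity(positive, negative)
print(round(similarity, 3))  # 0.577
```

Even without stemming, the two opposite reviews share two of their tokens ("it" and "well"), and the only trace of the polarity flip is the single token "doesn't".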
In supervised sentiment classification, many approaches have been proposed to address the negation problem (Pang et al., 2002; Na et al., 2004; Polanyi and Zaenen, 2004; Kennedy and Inkpen, 2006; Ikeda et al., 2008; Li et al., 2010b; Orimaye et al., 2012; Xia et al., 2013). Nevertheless, in semi-supervised sentiment classification, most current approaches directly apply standard semi-supervised learning algorithms, without paying attention to an appropriate representation for review texts. For example, Aue and Gamon (2005) applied the naïve Bayes EM algorithm (Nigam et al., 2000). Goldberg and Zhu (2006) applied the graph-based semi-supervised learning algorithm of Zhu et al. (2003). Wan (2009) employed a co-training approach for cross-language sentiment classification. Li et al. (2010a) employed co-training with personal and impersonal views. Ren et al. (2011) explored the use of label propagation (Zhu and Ghahramani, 2002).
As pointed out by Goldberg and Zhu (2006), it is necessary to investigate better review text representations and similarity measures based on linguistic knowledge, as well as reviews' sentiment patterns. However, to the best of our knowledge, such investigations remain scarce in research on semi-supervised sentiment classification.
In Xia et al. (2013), we developed a dual sentiment analysis approach, which creates antonymous reviews and makes use of original and antonymous reviews together for supervised sentiment classification. In this work, we propose a dual-view co-training approach based on dual-view BOW representation for semi-supervised sentiment classification. Specifically, we model both the original and antonymous reviews by a pair of bags-of-words with opposite views. Based on such a dual-view representation, we design a dual-view co-training approach. The training, bootstrapping and testing processes are all performed by observing two opposite sides of one review. That is, we consider not only how positive/negative the original review is, but also how negative/positive the antonymous review is.
In comparison with traditional methods, our dual-view co-training approach has the following advantages:
• It effectively addresses the negation problem;
• It automatically learns the associations among antonyms;
• It better meets the two co-training requirements in (Blum and Mitchell, 1998).

Related Work
The mainstream of research in sentiment classification has focused on supervised and unsupervised learning tasks. In comparison, semi-supervised sentiment classification has received far less attention. In this section, we focus on reviewing the work on semi-supervised sentiment classification. Aue and Gamon (2005) combined a small amount of labeled data with a large amount of unlabeled data in the target domain for cross-domain sentiment classification based on the EM algorithm. Goldberg and Zhu (2006) presented a graph-based semi-supervised learning algorithm (Zhu et al., 2003) for the sentiment analysis task of rating inference. Dasgupta and Ng (2009) proposed a semi-supervised approach that first mines the unambiguous reviews and then exploits them to classify the ambiguous reviews, via a combination of active learning, transductive learning and ensemble learning. Ren et al. (2011) explored the use of label propagation (LP) (Zhu and Ghahramani, 2002) in building a semi-supervised sentiment classifier, and compared their results with Transductive SVMs (T-SVM). LP and T-SVM are transductive learning methods in which the test data must participate in the training process. Zhou et al. (2010) proposed a deep learning approach called active deep networks to address semi-supervised sentiment classification with active learning. Socher et al. (2012) introduced a deep learning framework called semi-supervised recursive autoencoders for predicting sentence-level sentiment distributions. The limitations of deep learning approaches may be their dependence on a considerable amount of unlabeled data to learn the representations and their inability to explicitly model the negation problem.
One line of semi-supervised learning research is to bootstrap class labels using techniques like self-training, co-training and their variations. Wan (2009) proposed a co-training approach to address the cross-lingual sentiment classification problem. He made use of a machine translation service to produce two views (an English view and a Chinese view) for co-training a Chinese review sentiment classifier, based on an English corpus and an unlabeled Chinese corpus. Li et al. (2010a) first proposed an unsupervised method to automatically separate the review text into a personal view and an impersonal view, based on which the standard co-training algorithm is then applied to build a semi-supervised sentiment classifier. Li et al. (2011) further studied semi-supervised learning for imbalanced sentiment classification by using a dynamic co-training approach. Su et al. (2012) proposed a multi-view learning approach to semi-supervised sentiment classification with both feature partition and language translation strategies (Wan, 2009). Following Li et al. (2010a), Li (2013) proposed a co-training approach which exploits subjective and objective views for semi-supervised sentiment classification. Our approach can also be viewed as a variation of co-training. The innovation of our approach lies in the dual-view construction technique, which incorporates antonymous reviews, and in the bootstrapping mechanism, which observes two opposite sides of one review.

Given an original review, its antonymous review is automatically created as follows (footnote 1): 1) We first detect the negations in each subsentence of the review text; 2) If there is a negation, we remove the negators in that subsentence; 3) Otherwise, we reverse all the sentiment words in the subsentence into their antonyms, according to a pre-defined antonym dictionary (footnote 2).
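The three construction steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `NEGATORS` set, the toy `ANTONYMS` dictionary (the paper extracts its dictionary from WordNet), and the punctuation-based `split_subsentences` helper are all hypothetical. The output is a list of token lists rather than fluent text, since the result only feeds a BOW model.

```python
import re

# Hypothetical toy resources for illustration only.
NEGATORS = {"not", "no", "never", "don't", "doesn't", "didn't", "isn't", "wasn't"}
ANTONYMS = {
    "well": "badly", "badly": "well",
    "good": "bad", "bad": "good",
    "disappointing": "satisfactory", "satisfactory": "disappointing",
}

def split_subsentences(review):
    """Split on clause punctuation (an assumed, simplistic notion of subsentence)."""
    return [s.strip() for s in re.split(r"[.,;!?]+", review) if s.strip()]

def antonymous_subsentence(subsentence):
    tokens = subsentence.lower().split()
    if any(t in NEGATORS for t in tokens):
        # Steps 1-2: a negation is detected, so only the negators are removed.
        return [t for t in tokens if t not in NEGATORS]
    # Step 3: no negation, so every word found in the dictionary is reversed.
    return [ANTONYMS.get(t, t) for t in tokens]

def antonymous_review(review):
    """Return the antonymous view as one token list per subsentence."""
    return [antonymous_subsentence(s) for s in split_subsentences(review)]

original = "The app doesn't work well on my phone. Disappointing. Do not recommend it."
reversed_view = antonymous_review(original)
print(reversed_view)
```

Note that "well" in the first subsentence is left untouched because that subsentence contains a negator; reversing it as well would cancel the polarity flip.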
We subsequently use a dual-view BOW model to represent such a pair of reviews, as shown in Figure 1. The original and antonymous reviews are used in pairs in our dual-view semi-supervised learning approach. When determining the sentiment of a review, we can observe not only the original view, but also the antonymous view.
It is important to notice that the antonymous view removes all negations and incorporates antonymous features. On this basis, we design a dual-view co-training approach. We will introduce our approach in detail in Section 3.2, and analyze its potential advantages in Section 3.3.

Footnote 1: It is worth noting that our emphasis here is not to generate natural-language-like review texts. Since either the original or the created antonymous review will be represented as a vector of independent words in the BOW model, the grammatical requirement is not as strict as that in human languages.
Footnote 2: In our experiments, we extract the antonym dictionary from the WordNet lexicon, http://wordnet.princeton.edu/.

The Dual-view Co-training Approach
Since the original and antonymous views form two different views of one review text, it is natural to employ the co-training algorithm, which requires two views for semi-supervised classification.
Co-training is a typical bootstrapping algorithm that first learns a separate classifier for each view using the labeled data. The most confident predictions of each classifier on the unlabeled data are then used to construct additional labeled training data iteratively. Co-training has been extensively used in NLP, including statistical parsing (Sarkar, 2001), reference resolution (Ng and Cardie, 2003), part-of-speech tagging (Clark et al., 2003), word sense disambiguation (Mihalcea, 2004), and sentiment classification (Wan, 2009; Li et al., 2010a).
It should be noted, however, that the dual views in our approach differ from traditional views. One important property of our approach is that the two views are opposite and are therefore associated with opposite class labels. Figure 2 illustrates the process of dual-view co-training.
(1) Dual-view training For each instance in the initial labeled set, we construct the dual-view representations. Let $x_o^l$ and $x_a^l$ denote the bags of words in the original view and the antonymous view, respectively. Note that the class labels in the two views are kept opposite: $y_a^l = 1 - y_o^l$. That is, we reverse the class label in the original view (i.e., positive to negative, or vice versa) to obtain the class label of the created antonymous view.
Suppose $L$ is the labeled set, with $L_o$ and $L_a$ denoting the original-view and antonymous-view labeled sets, respectively. We train two distinct classifiers: the original-view classifier $h_o$ and the antonymous-view classifier $h_a$, based on $L_o$ and $L_a$, respectively. We further train a joint classifier by using $L_o$ and $L_a$ together as the training data, and refer to it as $h_d$.
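The construction of the three training sets can be sketched as below. The function name `build_dual_view_sets` and the tuple layout of the input are assumptions made for illustration; labels are encoded as 1 (positive) and 0 (negative), so that the label of the antonymous view is one minus the label of the original view.

```python
def build_dual_view_sets(labeled_pairs):
    """Build the original-view, antonymous-view and joint training sets.

    labeled_pairs: iterable of (original_bow, antonymous_bow, label),
    where label is 1 for positive and 0 for negative.
    """
    L_o, L_a = [], []
    for x_o, x_a, y_o in labeled_pairs:
        L_o.append((x_o, y_o))
        L_a.append((x_a, 1 - y_o))  # the antonymous view carries the reversed label
    L_d = L_o + L_a  # joint set used for the joint classifier
    return L_o, L_a, L_d

pairs = [
    ({"works": 1, "well": 1}, {"works": 1, "badly": 1}, 1),
    ({"disappointing": 1}, {"satisfactory": 1}, 0),
]
L_o, L_a, L_d = build_dual_view_sets(pairs)
print(len(L_o), len(L_a), len(L_d))
```

Any probabilistic classifier can then be fit on each of the three sets to obtain the original-view, antonymous-view and joint classifiers.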
(2) Dual-view bootstrapping In standard co-training, we allow each classifier to examine the unlabeled set $U$ and select the most confidently predicted examples in each category.
The selected examples are then added into $L$, along with the predicted class labels.
In this work, we design a dual-view co-training algorithm to bootstrap the class labels by a joint observation of two sides of one review. Specifically, we propose a new bootstrapping mechanism based on a principle called dual-view sentiment consensus. Given an unlabeled instance $\{x_o^u, x_a^u\}$, dual-view sentiment consensus requires that the original prediction $y_o^u$ and the antonymous prediction $y_a^u$ be opposite: $y_a^u = 1 - y_o^u$. In other words, we only select the instances for which the original prediction is positive/negative and, at the same time, the antonymous prediction is negative/positive. To increase the degree of sentiment consensus, we further require that the prediction $y_d^u$ of $h_d$ be the same as $y_o^u$. We sort all unlabeled instances according to the dual-view predictions in each class, filter the list according to the dual-view sentiment consensus principle, and add the top-ranked $s$ instances in each class to the labeled set. For each selected unlabeled instance, its original view $x_o^u$ is added into $L_o$ with class label $y_o^u$, and the antonymous view $x_a^u$ is added into $L_a$ with the opposite class label $y_a^u = 1 - y_o^u$. When $L_o$ and $L_a$ receive the supplemental labeled instances, we update $h_o$ and $h_a$.
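One bootstrapping round under the consensus principle can be sketched as follows. The function `select_by_consensus`, the 0.5 decision threshold, and the ranking score (the mean of the two per-view confidences) are assumptions for illustration; the text only states that instances are sorted by the dual-view predictions.

```python
def select_by_consensus(predictions, s):
    """Select up to s instances per class that satisfy dual-view sentiment consensus.

    predictions: list of (instance_id, p_o, p_a, p_d), where each p is the
    probability of the positive class given by the original-view, antonymous-view
    and joint classifier, respectively (the joint classifier sees the original view).
    Returns a list of (instance_id, y_o, y_a).
    """
    candidates = {0: [], 1: []}
    for idx, p_o, p_a, p_d in predictions:
        y_o, y_a, y_d = int(p_o >= 0.5), int(p_a >= 0.5), int(p_d >= 0.5)
        # Consensus: the two views disagree in label, and the joint classifier agrees with y_o.
        if y_a != 1 - y_o or y_d != y_o:
            continue
        conf_o = p_o if y_o == 1 else 1.0 - p_o
        conf_a = 1.0 - p_a if y_o == 1 else p_a
        # Assumed ranking score: average confidence of the two views.
        candidates[y_o].append(((conf_o + conf_a) / 2.0, idx))
    selected = []
    for label in (1, 0):
        top = sorted(candidates[label], reverse=True)[:s]
        selected.extend((idx, label, 1 - label) for _, idx in top)
    return selected

predictions = [
    ("a", 0.90, 0.20, 0.80),  # consistent, positive
    ("b", 0.90, 0.70, 0.90),  # both views say positive: rejected
    ("c", 0.10, 0.95, 0.20),  # consistent, negative
    ("d", 0.60, 0.40, 0.40),  # joint classifier disagrees: rejected
    ("e", 0.70, 0.10, 0.60),  # consistent, positive, less confident than "a"
]
chosen = select_by_consensus(predictions, s=1)
print(chosen)  # [('a', 1, 0), ('c', 0, 1)]
```

Instance "b" mirrors the negation case discussed in the paper: both views are predicted positive, so the instance is held back instead of being added with a possibly wrong label.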
Our bootstrapping mechanism differs from the traditional methods in two major aspects. First, in traditional co-training, given the same instance, the class labels in the two views are the same, whereas in our approach the class labels in the two views need to be opposite. Second, in traditional co-training, the most confidently predicted examples in each view are selected to extend the labeled data, and it is dangerous to trust predictions that are confident but incorrect. In our approach, the candidates are further filtered by the principle of dual-view sentiment consensus. In this way, the labeling accuracy and learning efficiency can be improved.
(3) Dual-view testing Finally, in the testing stage, standard co-training uses a joint set of features in two views to train the classifier. In dual-view testing, we use $h_o$ and $h_a$ to predict the test example in two views, and make the final prediction by considering both sides of the review.
Given a test example $x^{te}$ with its original view denoted by $x_o^{te}$ and antonymous view denoted by $x_a^{te}$, let $p_o(\cdot|x_o^{te})$ be the posterior probability predicted by the original-view classifier $h_o$, and $p_a(\cdot|x_a^{te})$ be the posterior probability predicted by $h_a$. The dual-view testing process can be formulated as follows:

$$p(+|x^{te}) = \frac{1}{2}\left[p_o(+|x_o^{te}) + p_a(-|x_a^{te})\right], \qquad p(-|x^{te}) = \frac{1}{2}\left[p_o(-|x_o^{te}) + p_a(+|x_a^{te})\right].$$

That is, the final positive score is assigned by measuring not only how positive the original review is, but also how negative the antonymous one is; the final negative score is assigned by measuring not only how negative the original review is, but also how positive the antonymous one is.
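The combination rule described above can be sketched in a few lines. The function name `dual_view_predict` is hypothetical, and the equal weighting of the two views is an assumption taken from the paper's later remark that both classifiers currently have the same predicting weight.

```python
def dual_view_predict(p_o_pos, p_a_pos):
    """Combine the two views for one test example.

    p_o_pos: probability of the positive class for the original view.
    p_a_pos: probability of the positive class for the antonymous view.
    Returns (label, positive_score, negative_score).
    """
    # Positive score: how positive the original is and how negative the antonymous one is.
    pos = 0.5 * (p_o_pos + (1.0 - p_a_pos))
    # Negative score: how negative the original is and how positive the antonymous one is.
    neg = 0.5 * ((1.0 - p_o_pos) + p_a_pos)
    label = "positive" if pos >= neg else "negative"
    return label, pos, neg

# The original view is (wrongly) leaning positive, but the antonymous view is strongly positive,
# which is evidence that the original review is negative.
label, pos, neg = dual_view_predict(0.6, 0.9)
print(label)  # negative
```

The two scores always sum to one, so they can be read as a posterior over the two classes.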

Advantages of Dual-view Co-training
Our proposed dual-view co-training approach has the following three advantages.
(1) Effectively address the negation issue We use the antonymous review as a view to effectively address the negation issue. Let us revisit the example in Section 3.1 and assume that the original review (i.e., "The app doesn't work well on my phone. Disappointing. Do not recommend it.") is an unlabeled sample. Because the traditional BOW model cannot represent negation structures well, the review is likely to be incorrectly labeled as positive and then added into the labeled set.
In our proposed approach, the antonymous review (i.e., "The app works well on my phone. Satisfactory. Recommend it.") removes all the negation structures, and is thus better suited to the BOW representation. In this example, the antonymous review is also likely to be marked as positive. Hence, in this case, both the original review and its antonymous review will be labeled as positive, which violates the principle of dual-view sentiment consensus described in Section 3.2. As a result, the unlabeled instance will not be added into the labeled set.
Therefore, our approach can overcome the limitations of the conventional methods in addressing the negation issue and reduce the labeling error rate (caused by the negative structures) during the bootstrapping process.
(2) Automatically learn the associations among antonyms In semi-supervised sentiment classification, only limited information about the association between words and categories can be obtained from a small amount of initial labeled data.
For instance, in the above example "disappointing" and "satisfactory" are a pair of antonyms. From the initial labeled data, we may only learn that "disappointing" is derogatory, but we cannot infer that "satisfactory" is commendatory.
During the bootstrapping process in our approach, when constructing the dual-view representation, the original view and its antonymous view are required to have opposite class labels. Hence we can automatically infer the relationship between "satisfactory" and "disappointing" (i.e., one is positive and the other is negative), thereby improving the learning efficiency of the system.

(3) Better meet two co-training requirements
Compared with traditional methods, our dual-view co-training can better meet the two co-training requirements: 1) sufficient condition (i.e., each view is sufficient for classification); 2) complementary condition (i.e., the two views are conditionally independent).
First, for the sufficient condition, we use a different view construction method. Most traditional methods construct the two views by feature partitioning (i.e., dividing the original feature set into two subsets), while we use data expansion by generating antonymous reviews. We will demonstrate in the experimental section (Section 4.6), that our data expansion method can construct better views than the feature partition method in terms of predicting the class labels from individual views.
Second, as we know, every coin has two sides and the two sides are often complementary. In our proposed approach, the original review and its antonymous review (i.e., two sides of one review) are used as two views for co-training and they can better meet the complementary condition. We will illustrate this point in Section 4.6 by calculating the KL divergence between the two views.

Datasets and Experimental Settings
We conduct the experiments on the multi-domain sentiment datasets, which were introduced in (Blitzer et al., 2007) and have been widely used in sentiment classification. They consist of four domains (Book, DVD, Electronics, and Kitchen) of reviews extracted from Amazon.com. Each of the four datasets contains 1,000 positive and 1,000 negative reviews. Following the experimental settings used in (Li et al., 2010a), we randomly separate all the reviews in each class into a labeled data set, an unlabeled data set, and a test set, with a proportion of 10%, 70% and 20%, respectively. We report the averaged results of 10-fold cross-validation in terms of classification accuracy.
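The per-class split can be sketched as follows. The helper `split_class` and its fixed random seed are assumptions for illustration; the paper does not describe how the random partition is implemented.

```python
import random

def split_class(reviews, seed=0, labeled=0.10, unlabeled=0.70):
    """Randomly split the reviews of one class into labeled, unlabeled and test sets.

    The test set receives whatever remains after the labeled and unlabeled shares.
    """
    shuffled = list(reviews)
    random.Random(seed).shuffle(shuffled)
    n_labeled = int(round(labeled * len(shuffled)))
    n_unlabeled = int(round(unlabeled * len(shuffled)))
    return (
        shuffled[:n_labeled],
        shuffled[n_labeled:n_labeled + n_unlabeled],
        shuffled[n_labeled + n_unlabeled:],
    )

# Review ids stand in for the 1,000 reviews of one class in one domain.
labeled_set, unlabeled_set, test_set = split_class(range(1000))
print(len(labeled_set), len(unlabeled_set), len(test_set))  # 100 700 200
```

Applying the split to both classes of a domain gives 200 labeled, 1,400 unlabeled and 400 test reviews, which matches the setting reported in the performance comparison.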
Note that our approach is a general framework that allows different classification algorithms. Due to space limitations, we only report the results obtained with logistic regression 3 . Similar conclusions can be obtained with other algorithms such as SVMs and naïve Bayes. The LibLinear toolkit 4 is utilized, with a dual L2-regularized factor, and a default tradeoff parameter c. Similar to (Wan, 2009; Li et al., 2010a), we carry out the experiments with unigram features without feature selection. Presence is used as the term weighting scheme, as it was reported in (Pang et al., 2002) to perform better than TF and TF-IDF. Finally, the paired t-test (Yang and Liu, 1999)

Compared Systems
We implement the following nine systems and compare them with our approach:
• Baseline, the supervised baseline trained with the initial labeled data only;
• Expectation Maximization (EM), with the naïve Bayes model proposed by Nigam et al. (2000);
• Label Propagation (LP), a graph-based semi-supervised learning method proposed by Zhu and Ghahramani (2002);
• Transductive SVM (T-SVM), an extension of SVM that can exploit unlabeled data in semi-supervised learning (Joachims, 1999);
• Self-Training, a bootstrapping model that first trains a classifier, uses it to classify the unlabeled data, and adds the most confident data to the labeled set;
• Self-Reserved, a variation of self-training proposed in (Liu et al., 2013), with a reserved procedure to incorporate some less confident examples;
• Co-Static, the co-training algorithm that uses two static partitions of the feature set as two views (Blum and Mitchell, 1998);
• Co-Dynamic, a variation of co-training that uses a dynamic feature space in each loop. It was reported in (Li et al., 2011) that Co-Dynamic significantly outperforms Co-Static;
• Co-PI, another variation of co-training proposed by Li et al. (2010a), which uses personal and impersonal views for co-training.

Performance Comparison
In Table 1, we report the semi-supervised classification accuracy of the ten evaluated systems. We report the results with 200 labeled, 1400 unlabeled and 400 test reviews. Note that similar conclusions can be obtained when the size of the initial labeled data changes. We will discuss its influence later. As can be seen, trained with only 200 labeled reviews, the supervised baseline yields an average accuracy of 0.709. Self-training gains an improvement of 1.1%. Self-reserved does not show significant superiority over Self-training. The three co-training systems (Co-static, Co-dynamic and Co-PI) obtain significant improvements. They improve over the supervised baseline by 2.0%, 2.8% and 2.4%, respectively.
It is somewhat surprising that T-SVM and LP do not outperform the supervised baseline, probably because the supervised baseline is obtained by logistic regression, which was reported to be more effective than SVMs in sentiment classification (the supervised result of SVMs is 0.695).
Our proposed approach significantly outperforms all the other methods. It improves over the supervised baseline, Self-training, Co-static, Co-dynamic and Co-PI by 4.3%, 3.2%, 2.3%, 1.5% and 1.9%, respectively. All of the improvements are significant according to the paired t-test.

Comparison of Bootstrapping Methods
In Figure 3, we further compare five bootstrapping methods by drawing the accuracy curve during the bootstrapping process. The x-axis denotes the number of newly labeled instances bootstrapped from the unlabeled data. We can roughly rank the five bootstrapping methods as follows: Our approach > Co-dynamic > Co-PI > Co-static > Self-training. Self-training gives the worst performance. Co-static works better, but the effect is limited. Co-PI and Co-dynamic are significantly better. Our proposed approach robustly outperforms the other systems as the number of newly labeled instances increases. This suggests that our approach is very efficient in bootstrapping the class labels from the unlabeled data.

Influence of the Size of the Initial Labeled Set
The above results are obtained with 200 labeled, 1400 unlabeled and 400 test reviews. We now vary the size of the initial labeled set (from 20 to 400), and report its influence in Figure 4. For all the settings, we fix the size of the test set at 400. The x-axis denotes the size of the initial labeled set. For example, "20" denotes the setting of 20 labeled and 1580 unlabeled reviews. We can observe that all methods improve as the initial size increases, but the improvements become limited when the size becomes larger. When the initial size is 400, the semi-supervised performance is close to the golden result obtained by the supervised classifier trained with all 1600 labeled reviews.
Our approach consistently performs the best across different initial sizes. The smaller the initial size is, the more improvement our approach gains in comparison with the other methods. This confirms our analysis in Section 3.3 that the technique of dual-view construction is very effective in boosting the semi-supervised classification performance, especially when the size of the initial labeled set is small.

Discussion on the Two Co-training Requirements
Ideally, co-training requires that each view is sufficient for classification (sufficient condition) and that the two views provide complementary information about the instance (complementary condition). In this section, we answer the following question empirically: does our approach meet the two requirements?
(1) Sufficient condition In Figure 5, we report the classification performance obtained by the classifiers trained with distinct views and compare them with the two views in Co-PI, on the DVD and Electronics datasets. The observations on Book are similar to those on Electronics, and the observations on Kitchen are similar to those on DVD. As seen in Figure 5, the classification performance of both the original-view and antonymous-view classifiers is satisfactory. This shows that, in our approach, each individual view is sufficient to predict the sentiment. In comparison with the two views in Co-PI (i.e., the personal and impersonal views), the two views in our approach perform significantly better.
As mentioned in Section 3.3, in traditional methods such as Co-PI and Co-dynamic, the two views are created by data partition (or feature partition). In comparison, the two views in our approach are constructed by data expansion. By creating a new antonymous view, our approach can provide more sufficient information about the reviews than traditional methods.
(2) Complementary condition Since we have not found a direct measure of the complementarity of two views, we instead calculate the Kullback-Leibler (KL) divergence between them, based on an assumption that two views with higher KL divergence can provide more complementary information of the instance.
KL divergence is a widely used measure of the distance between two probability distributions. We assume that the distribution of the review text is multinomial, and calculate the KL divergence between two views as follows:

$$D_{KL}(p \,\|\, q) = \sum_{i=1}^{V} p_i \log \frac{p_i}{q_i},$$

where $p_i$ and $q_i$ are the probabilities of the $i$-th word appearing in the two views, respectively. In our experiments, we use information gain (IG) to select a set of discriminative words with the dimension $V = 2000$.
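The divergence between the word distributions of two views can be computed along the following lines. This is a sketch under stated assumptions: the helper names are hypothetical, add-one smoothing (`alpha=1.0`) is introduced here only to avoid zero probabilities, and the natural logarithm is used; the paper specifies neither the smoothing nor the log base.

```python
import math
from collections import Counter

def word_distribution(documents, vocabulary, alpha=1.0):
    """Estimate a smoothed multinomial over the vocabulary from tokenized documents."""
    counts = Counter(w for doc in documents for w in doc if w in vocabulary)
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}

def kl_divergence(p, q):
    """KL divergence between two distributions defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

# Toy vocabulary; the paper selects 2000 words by information gain.
vocabulary = ["good", "bad"]
original_view = [["good", "good", "bad"]]
antonymous_view = [["bad", "bad", "good"]]

p = word_distribution(original_view, vocabulary)
q = word_distribution(antonymous_view, vocabulary)
divergence = kl_divergence(p, q)
print(round(divergence, 4))  # 0.0811
```

KL divergence is asymmetric in general, so the order of the two views matters whenever the two distributions are not mirror images of each other as they are in this toy case.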
In Table 2, we report the results of three different methods: 1) random partition of the dataset; 2) personal and impersonal views in Co-PI; 3) original and antonymous views in our approach. We can observe from Table 2 that random partition has the lowest KL divergence, which shows that the distributional distance between two randomly partitioned views is very small. Co-PI yields a higher value, but its two views still do not differ greatly. By contrast, the KL divergence between the original view and the antonymous view is much higher than that of both random partition and Co-PI, which demonstrates that the distributions of the two views in our approach are markedly different. We thereby infer that the two views constructed in our approach can provide more complementary information than traditional methods. This is reasonable, since the antonymous view incorporates antonyms that might not have appeared in the original view (e.g., "satisfactory" in the example in Section 3.2). These features might provide new information about the instance.

Table 2
Method             KL divergence
Random Partition   2.43
Co-PI              4.59
Our approach       12.33

The Effect of Dual-view Testing
In Figure 5, we can further observe the effect of dual-view testing. On the Electronics dataset, the antonymous view performs better than the original view. This suggests the advantage of the antonymous view, as it removes the negations and thus is more suitable for the BOW representation. On the DVD dataset, the original view is slightly better. This is also reasonable, because the antonymous review is automatically created and its quality might be limited in some cases. By taking the two opposite views into joint consideration, our dual-view testing technique achieves a satisfactory classification performance across different datasets.
Note that in the current version, the original-view and antonymous-view classifiers have the same predicting weight. We believe that learning the tradeoff between the two views in different settings may further improve the performance of our approach. For example, if the antonymous view on the Electronics dataset, which performs better there, gets a relatively larger weight, dual-view testing might gain more improvements.

Conclusions
In this work, a review text is represented by a pair of bags-of-words with opposite views (i.e., the original and antonymous views). By making use of the two views in pairs, a dual-view co-training algorithm is proposed for semi-supervised sentiment classification. The dual-view representation accords well with the two co-training requirements (i.e., the sufficient condition and the complementary condition). The experimental results demonstrate the effectiveness of our approach in addressing the negation problem and enhancing the bootstrapping efficiency for semi-supervised sentiment classification.