Semi-Supervised Learning with Auxiliary Evaluation Component for Large Scale e-Commerce Text Classification

The lack of high-quality labeled training data has been one of the critical challenges facing many industrial machine learning tasks. To tackle this challenge, in this paper, we propose a semi-supervised learning method to utilize unlabeled data and user feedback signals to improve the performance of ML models. The method employs a primary model Main and an auxiliary evaluation model Eval, where Main and Eval models are trained iteratively by automatically generating labeled data from unlabeled data and/or users’ feedback signals. The proposed approach is applied to different text classification tasks. We report results on both the publicly available Yahoo! Answers dataset and our e-commerce product classification dataset. The experimental results show that the proposed method reduces the classification error rate by 4% and up to 15% across various experimental setups and datasets. A detailed comparison with other semi-supervised learning approaches is also presented later in the paper. The results from various text classification tasks demonstrate that our method outperforms those developed in previous related studies.


Introduction
There are many ways to improve the performance of a machine learning model, and improving the training data is one of them. Obtaining high-quality training data, such as human-labeled data, is usually expensive and time-consuming. Many machine learning systems therefore use unlabeled data, or a mixture of labeled and unlabeled data, for training because it is cheaper and easier to collect enormous amounts of unlabeled data. Industry-deployed machine learning systems that serve millions of users generate vast amounts of unlabeled data and noisy user feedback signals every day. These data and signals are valuable and can be utilized in the training of real-world machine learning systems.
In this paper, we propose a new semi-supervised learning method with a feedback loop to leverage vast amounts of unlabeled data and feedback signals. In particular, we train two machine learning models iteratively. The main model, denoted Main, performs the main task at run time. The auxiliary model, denoted Eval, works offline and estimates the correctness of the Main model's output. The information available to the auxiliary model Eval is much richer than that available to the run-time model Main. Extra data, such as user feedback data and session context information, can be used when training the auxiliary model. The idea is to control the false positive rate of Eval to produce high-quality, automatically labeled data from unlabeled data. The entire process runs iteratively, and the performance of both models improves with each iteration. We assume that the run-time Main model has much less information available because, in some industry setups, the business logic flow and/or UX design constraints prevent it from collecting richer features.
In this paper, we use text classification experiments to illustrate the proposed approach. However, this semi-supervised learning approach can also be applied to other machine learning tasks, such as machine translation and search relevance.
Different semi-supervised learning approaches have been previously proposed to leverage unlabeled data, including Blum and Mitchell (1998), Weiss et al. (2016), Goodfellow et al. (2014) and Cohn et al. (1996). Experimental results on the public Yahoo! Answers dataset and on a new public e-commerce dataset for product classification demonstrate the advantages and potential of the proposed framework compared with previous work.
The contributions of this work are as follows:
• A new semi-supervised learning method that introduces an auxiliary evaluation component, and
• A scalable, cost-effective and efficient way to convert vast amounts of unlabeled data into high-quality labeled data for supervised training purposes.
In section 2, we present the details of the proposed semi-supervised learning approach in the context of text classification tasks. In section 3, the theoretical analysis of the proposed method is provided. In section 4, we give an overview of related work and highlight how it differs from our approach. Section 5 defines various experimental setups and presents the results for two different datasets. Finally, we present the paper's conclusions in section 6.

Modeling Approach
The suggested method is based on a few observations from real-world machine learning systems.
• The main model Main that serves online users may have limited information available at prediction time due to the business logic flow or UX design constraints in industry setups. Therefore, the predictions from the Main component may contain errors and cannot be used directly as labeled data for model training purposes.
• The evaluation model Eval runs offline. Hence, it can access much richer information than the main task component without those constraints. User feedback data, user behavioral data, prior and post task session history data, and the knowledge base about the user's main task constitute this richer offline information. All these data help the Eval model to reliably estimate whether Main's output is acceptable or not.
• Large-scale supervised machine learning methods typically need a large amount of labeled training data to perform well. However, in reality, it is very expensive to manually label millions of training examples. On the other hand, large-scale machine learning systems serving millions of users generate millions of unlabeled examples and user feedback signals every day that are not effectively utilized.
The main idea of the proposed approach is to train and deploy two parallel machine learning models. The first model, Main, is used to serve live user requests for the main task. The second model, Eval, is used as an offline model to estimate the accuracy of Main's prediction. The Eval model utilizes additional signals, such as user feedback signals, system logs that are related to user sessions, the output confidence scores from Main, etc. An EM-style iterative process is applied to train Main and Eval in a repeated manner. High-quality, automatically labeled data are extracted from unlabeled data by controlling the false positive rate and the false negative rate in Eval. Mathematical analysis (section 3) and experiments (section 5) on different datasets show that after multiple iterations, the accuracy of Main can be improved substantially.
Algorithm 1 shows the details of the proposed semi-supervised learning algorithm. The input to the algorithm is an initial small set of labeled data L, a large set of unlabeled data U and an optional set of user feedback data F. The algorithm leverages the auxiliary evaluation model Eval and the optional user feedback data F to produce high-quality, automatically labeled data AL from a large amount of unlabeled data U. Once the labeled data AL is extracted and added to the training corpus C, we use a shallow neural network to train the model Main for the text classification tasks.
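The iterative procedure described above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the function name `emaec_loop`, the callables `train_main` and `train_eval`, the acceptance threshold, and the choice to re-score only the still-unlabeled pool in each round are all assumptions made for illustration. The toy word-vote classifier and word-overlap scorer merely stand in for the shallow neural network and the Eval model.

```python
from collections import Counter, defaultdict


def emaec_loop(labeled, unlabeled, train_main, train_eval,
               threshold=0.9, max_iters=5):
    """Grow the training corpus C with automatically labeled data AL."""
    corpus = list(labeled)        # training corpus C, starts as L
    pool = list(unlabeled)        # remaining unlabeled data U
    main = train_main(corpus)
    for _ in range(max_iters):
        evaluate = train_eval(main, labeled)
        accepted, rejected = [], []
        for text in pool:
            label = main(text)                      # Main's prediction
            if evaluate(text, label) >= threshold:  # Eval accepts it
                accepted.append((text, label))
            else:
                rejected.append(text)
        if not accepted:          # nothing new to learn from: stop
            break
        corpus.extend(accepted)
        pool = rejected
        main = train_main(corpus)  # retrain Main on the enlarged corpus
    return main, corpus


# --- Toy stand-ins (hypothetical, for illustration only) ---
def train_word_vote(corpus):
    """Classifier that lets every known word vote for its labels."""
    votes = defaultdict(Counter)
    for text, label in corpus:
        for word in text.split():
            votes[word][label] += 1
    default = Counter(label for _, label in corpus).most_common(1)[0][0]

    def predict(text):
        tally = Counter()
        for word in text.split():
            tally.update(votes.get(word, {}))
        return tally.most_common(1)[0][0] if tally else default

    return predict


def train_overlap_eval(main, labeled):
    """Scorer: fraction of words already seen with the proposed label."""
    seen = defaultdict(set)
    for text, label in labeled:
        seen[label].update(text.split())

    def evaluate(text, label):
        words = text.split()
        return sum(w in seen[label] for w in words) / max(len(words), 1)

    return evaluate


L = [("red apple fruit", "food"), ("fast car engine", "auto")]
U = ["green apple", "car wheel", "blue sky"]
main, C = emaec_loop(L, U, train_word_vote, train_overlap_eval, threshold=0.5)
```

In this toy run, the two unlabeled texts that share a word with a labeled example are accepted and added to the corpus, while the text with no overlap stays unlabeled.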

Auxiliary Evaluation Model
Algorithm 1 Proposed Algorithm
Given:

The auxiliary evaluation model Eval is a binary classifier that predicts whether the automatic label is correct or not. It is trained using gradient boosting. If the false positive rate of the evaluation model Eval is controlled at a low threshold, vast amounts of high-quality, automatically labeled data can be extracted from the unlabeled data.
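One way to control the false positive rate is to pick the Eval acceptance threshold on a held-out set whose correctness labels are known. The sketch below is an assumption about how such a threshold could be chosen, not the procedure used in the paper. `threshold_for_fpr` is a hypothetical helper, and the false positive rate is taken in the standard sense: accepted incorrect labels divided by all incorrect labels.

```python
import math


def threshold_for_fpr(scores, labels, max_fpr):
    """Return the smallest score threshold whose FPR is at most max_fpr.

    scores: Eval confidence for each held-out (text, predicted label) pair
    labels: 1 if the predicted label was correct, 0 otherwise
    A pair is accepted when its score is >= the returned threshold.
    """
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not negatives:
        return min(scores)  # no incorrect pairs: accept everything
    # FPR never increases as the threshold grows, so the first
    # candidate that satisfies the bound is the smallest one.
    for t in sorted(set(scores)):
        fpr = sum(s >= t for s in negatives) / len(negatives)
        if fpr <= max_fpr:
            return t
    return math.inf         # no observed score meets the bound


scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.6]
labels = [0, 0, 1, 1, 1, 0]
loose = threshold_for_fpr(scores, labels, max_fpr=0.34)
strict = threshold_for_fpr(scores, labels, max_fpr=0.0)
```

A lower `max_fpr` yields a higher threshold, which trades the number of automatically labeled examples for their quality.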
For text classification tasks, the evaluation model Eval leverages a variety of features: the confidence score of the main model Main's prediction, the n-gram language model related ranking, the input sentence probability score evaluated by the language model of Main's predicted class, and the optional noisy user feedback signal about Main's output. The language model related ranking and input sentence probability scores are based on the assumption that sentences belonging to the same class are more similar than sentences belonging to different classes. Thus, we train a language model $SLM_i$ for each $CLASS_i$ using sentences belonging to that class. If a sentence belongs to $CLASS_i$, its sentence probability evaluated with $SLM_i$ should be higher than the sentence probability evaluated with $SLM_j$, where $i \neq j$. We use a trigram statistical language model to train $SLM_i$ for each $CLASS_i$.
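The per-class language model features can be illustrated with a small sketch. It assumes whitespace tokenization and add-k smoothing, neither of which is specified in the text, and the names `TrigramLM` and `slm_features` are hypothetical. It is not the language modeling toolkit used in the experiments.

```python
import math
from collections import Counter


class TrigramLM:
    """Add-k smoothed trigram language model for one class."""

    def __init__(self, sentences, k=1.0):
        self.k = k
        self.tri = Counter()   # counts of (w1, w2, w3)
        self.ctx = Counter()   # counts of the context (w1, w2)
        self.vocab = set()
        for sentence in sentences:
            toks = ["<s>", "<s>"] + sentence.split() + ["</s>"]
            self.vocab.update(toks)
            for a, b, c in zip(toks, toks[1:], toks[2:]):
                self.tri[(a, b, c)] += 1
                self.ctx[(a, b)] += 1

    def log_prob(self, sentence):
        toks = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        v = len(self.vocab) + 1  # one extra slot for unseen words
        total = 0.0
        for a, b, c in zip(toks, toks[1:], toks[2:]):
            num = self.tri[(a, b, c)] + self.k
            den = self.ctx[(a, b)] + self.k * v
            total += math.log(num / den)
        return total


def slm_features(text, predicted, lms):
    """Return (log-probability under the predicted class, rank of that class)."""
    scores = {cls: lm.log_prob(text) for cls, lm in lms.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    return scores[predicted], ranking.index(predicted) + 1


lms = {
    "food": TrigramLM(["red apple fruit", "green apple pie"]),
    "auto": TrigramLM(["fast car engine", "car wheel tire"]),
}
logp, rank = slm_features("red apple fruit", "food", lms)
```

Both returned values can be fed to Eval as features: a prediction whose class language model ranks first is more likely to be correct than one whose class ranks low.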
Details on how to train the auxiliary evaluation model Eval are described in Algorithm 2. The input to the algorithm is the Main model, the labeled data L and the optional set of user feedback data F. For each example $l_i$ in L, based on $l_i$'s labeled class, we create one positive training sample with the correct class and one negative training sample with a wrong class that is chosen randomly. The features can be generated using the aforementioned multiple signal sources, such as Main's prediction confidence scores, the SLM ranking scores and the optional user feedback.
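The construction of positive and negative training samples described above can be sketched as follows. The function name `build_eval_samples` and the tuple layout are assumptions for illustration, and feature extraction is left out.

```python
import random


def build_eval_samples(labeled, classes, rng=None):
    """Create one positive and one random negative sample per labeled example.

    Returns tuples (text, candidate_class, target), where target is 1 when
    candidate_class is the gold class and 0 when it is a wrong class.
    Requires at least two classes.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed keeps the sketch reproducible
    samples = []
    for text, gold in labeled:
        samples.append((text, gold, 1))
        wrong = rng.choice([c for c in classes if c != gold])
        samples.append((text, wrong, 0))
    return samples


samples = build_eval_samples(
    [("green coach bag", "bags"), ("running shoes", "shoes")],
    classes=["bags", "shoes", "phones"],
)
```

Each sample would then be turned into a feature vector (Main's confidence, SLM scores, optional feedback) before training the binary classifier.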

Algorithm 2 Train Auxiliary Evaluation Model
Given:

Theoretical Analysis
The EM-like semi-supervised learning approach with an auxiliary evaluation component is designed to tackle large-scale ML problems. In section 5, we demonstrate empirically that our framework consistently performs better in various real-world machine learning tasks. In this section, we first analyze and highlight some mathematical aspects of this dual-player, semi-supervised learning approach, and illustrate its deep connection to the Expectation-Maximization algorithm.
Suppose we are given an initial set of N manually labeled texts $S^{(0)}$, and our main task is to classify unseen text to a label. As described before, we use a shallow neural network similar to Joulin et al. (2016) to build the Main model. For this purpose, according to Joulin et al. (2016), we want to minimize the negative log-likelihood

$$-\frac{1}{N} \sum_{n=1}^{N} y_n \log\left(f(B A x_n)\right) \qquad (1)$$

where $x_n$ is the normalized bag of features of the $n$-th text, $y_n$ is the category label, $A$ and $B$ are the weight matrices, and $f$ is the softmax function. As part of the auxiliary evaluation component Eval, we established a machine learning system with richer context compared to Main. The task of Eval is to estimate the probability that the given input text belongs to the category predicted by Main. This probability is denoted $p_{\mathrm{text}_i, c_j}$.
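The Main objective above, a softmax negative log-likelihood over a linear model with weight matrices A and B, can be evaluated with a few lines of plain Python. This is a minimal sketch: the helper names are hypothetical, labels are given as class indices rather than one-hot vectors, and no training step is shown.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(A, B, xs, ys):
    """Average of -log softmax(B A x_n)[y_n] over all examples."""
    loss = 0.0
    for x, y in zip(xs, ys):
        probs = softmax(matvec(B, matvec(A, x)))
        loss -= math.log(probs[y])
    return loss / len(xs)


# Two examples with 3 features, a hidden layer of size 2, and 4 classes.
xs = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]
ys = [0, 3]
A = [[0.0] * 3 for _ in range(2)]   # hidden x features
B = [[0.0] * 2 for _ in range(4)]   # classes x hidden
uniform_loss = negative_log_likelihood(A, B, xs, ys)
```

With all-zero weights the softmax is uniform over the four classes, so the loss equals log 4. Training lowers this value by moving probability mass toward the labeled classes.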
Notice that the entire purpose of the evaluation system is to select newly labeled data to enrich the training set of the main machine learning system. Thus, the Eval model estimates the confidence score of this prediction for each sample.
The learning process Main → Eval iterates as described in the previous sections. We stress that it has a close connection to the popular Expectation-Maximization algorithm Dempster et al. (1977) via the following result.
Theorem 1. Given a two-player machine learning system comprised of Main and Eval, with a controlled false positive rate and given enough capacity, the Main model converges to a local minimum of the negative log-likelihood.
Proof. Given a set of training data $S^{(0)} = \{(x_i, y_i)\}$, $i = 1, \cdots, N$, which are the observed features and labels, let us denote by $z_i \in \{0, 1\}$ a hidden variable indicating the quality of the observation. $z_i$ takes a value of 1 if the label for the corresponding instance is correct or relevant, and 0 otherwise. Without loss of generality, suppose that the Main model is trained to maximize the log-likelihood function:

$$\ell(\Theta) = \sum_{i=1}^{N} \log p(x_i, y_i \mid \Theta) \qquad (3)$$

Using equation (1), equation (3) can be rewritten

$$\ell(\Theta) = \sum_{i=1}^{N} \log \sum_{z_i} p(z_i) \frac{p(x_i, y_i, z_i \mid \Theta)}{p(z_i)} \geq \sum_{i=1}^{N} \sum_{z_i} p(z_i) \log \frac{p(x_i, y_i, z_i \mid \Theta)}{p(z_i)} \qquad (4)$$

where the inequality is obtained by Jensen's inequality. The equality holds if and only if

$$p(z_i) = p(z_i \mid x_i, y_i, \Theta) \qquad (5)$$

The term $E_{p(z)} \log p(x, y, z \mid \Theta)$ is the expected complete log-likelihood (or Q-function). The two machine learning systems then iterate through the following two steps. From a set of noisy data, Eval plays the role of the E-step of the EM algorithm. For the $n$-th step ($n = 1, 2, \cdots$):

$$p(z)^{(n)} = p(z \mid x, y, \Theta^{(n-1)}) \qquad (6)$$

Notice that the conditional distribution of the hidden variable z is not necessarily fully predictable by the machine learning model even if the observed data and the model's parameters are given.
The evaluation system mainly provides a confidence score for the correctness of the prediction, which is defined by equation (5). By properly controlling the false positive rate, we select only those new training examples with a good estimate of $p(z)^{(n)}$ by the Eval model. This results in a set of filtered samples $S^{(n)}$ to be added to our Main system for the next iteration. The main system then plays the role of the maximization step in the EM algorithm framework, which is the M-step:

$$\Theta^{(n)} = \arg\max_{\Theta} E_{p(z)^{(n)}} \log p(x, y, z \mid \Theta)$$

over $S^{(0)} \cup S^{(1)} \cup \cdots \cup S^{(n)}$, which is readily solvable by the main machine learning system. Notice that in the M-step, it is not necessary to find the optimal values over the whole parameter space. Using the monotonic convergence property of the generalized EM algorithm, given enough capacity, the Main system would eventually converge to its local optimum after enough iterations.
In the e-commerce scenario, we have more informative features in the offline Eval system, and thus the evaluation system can have a very high accuracy. According to the proof, the main machine learning system eventually reaches a stable state.

Related Work
Various semi-supervised learning approaches have been proposed to leverage unlabeled data to improve the performance of machine learning systems Triguero et al. (2015).
Active learning Cohn et al. (1996); Nigam et al. (1998); Beygelzimer et al. (2009), which is a special kind of semi-supervised learning, provides ways to actively select the most informative data samples from a vast amount of unlabeled data. The selected samples are then labeled by humans. In this way, the total amount of data needed for manual labeling is reduced to save resources. How to handle the problem of label quality is one of the active areas of active learning research. Zhang and Chaudhuri (2015) studied the problem of active learning where labels were obtained from strong and weak labelers. In addition to the standard active learning setting, they considered the problem where extra weaker labelers may provide incorrect labels. Yan et al. (2016) studied the adaptive active learning problem where the labeler can return incorrect labels and also abstain from labeling.
Although active learning can significantly reduce the amount of manual labeling, it still requires extra human labeling, which is costly and time consuming. Compared with active learning, our approach does not require any additional manual labeling effort.
The self-labeled technique is another type of semi-supervised approach that boosts the model's performance by iteratively labeling parts of the unlabeled data. This approach aims to obtain an enlarged labeled set, based on the model's most confident predictions, to classify unlabeled data. Zhu and Goldberg (2009) divide the self-labeled methods into self-training and co-training.
In the self-training process Triguero et al. (2014); Yarowsky (1995), a model is trained with an initially small number of human-labeled examples and then used to predict labels for unlabeled data. Then, it is retrained with its most confident predictions, thus enlarging its labeled training set. The process iterates in the same manner.
In the co-training process Blum and Mitchell (1998); Chen et al. (2011), two learning models are trained separately to provide distinct views of the data set by using different feature sets of the data. These two models are initially trained with a small amount of human-labeled data, and then the most confident predictions of one model on the unlabeled data are used to construct the training data for the other model. This process is repeated iteratively.
Similar to our proposed approach, the self-labeled method uses an EM-based iterative process to boost the model's accuracy and does not need any further manual labeling effort. The major difference between the self-labeled approach and our approach is as follows. In the self-labeled method, with either self-training or co-training, all the models are main-task machine learning models. In our proposed approach, there is only one main-task model, together with an auxiliary evaluation model that runs offline. Using an offline auxiliary evaluation model has the benefit of utilizing offline information that is not available at prediction time. Thus, the auxiliary evaluation model has a better estimation capability than the main model regarding whether Main's output is correct or not.
The Generative Adversarial Network Goodfellow et al. (2014) is another semi-supervised approach that tries to generate unlimited synthetic samples that can mimic real data. The GAN also builds two models, namely, the generative model and the discriminative model, and pits them against each other. The generative model takes random inputs and tries to generate output data that looks similar to real data. The discriminative model takes input data from both the generative model and real data and tries to correctly distinguish between them. The GAN has been successfully applied to image and audio areas where the synthetic data is real-valued. It is quite challenging for the GAN to generate sequences of discrete tokens in the NLP domain. Yu et al. (2017) proposed the SeqGAN method to address this challenge by directly performing gradient policy update with reinforcement learning. Kusner and Hernández-Lobato (2016) proposed an alternative method to address this challenge using the Gumbel-softmax distribution.
One of the differences between our approach and GAN is that our approach relies on real unlabeled data while the GAN generates plausible data with random inputs. Another major difference is that the evaluation component in our approach tries to evaluate whether the main model results are correct or not. Meanwhile, in the GAN approach, the adversarial component learns to tell whether the current data sample is real or fake.

Experiments
To illustrate the effectiveness of the proposed semi-supervised learning method, we evaluate it with different text classification tasks. We compare the new method with a few other benchmark semi-supervised approaches using the public Yahoo! Answers topic classification dataset Joulin et al. (2016) and our e-commerce product categorization dataset.

Yahoo! Answers Dataset Experiments
The Yahoo! Answers topic classification dataset contains 10 classes. Each class contains 140K training samples and 6K testing samples. In this dataset, the total number of training instances is 1.4M and the total number of test instances is 60K. We shuffle and split the original 1.4M labeled training data into two sets. The first set contains 100K instances with labels and is used as the initial labeled dataset L. The second set contains 1.3M instances, and the labels are deleted to form the unlabeled dataset U. The 60K test samples are untouched as the blind test set T.
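The shuffle-and-split step can be sketched as follows. The function name `split_labeled_unlabeled` and the fixed seed are assumptions for illustration, and a tiny synthetic list stands in for the 1.4M training instances.

```python
import random


def split_labeled_unlabeled(examples, n_labeled, seed=0):
    """Shuffle (text, label) pairs, keep n_labeled with labels as L,
    and strip the labels from the rest to form U."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    labeled = shuffled[:n_labeled]
    unlabeled = [text for text, _ in shuffled[n_labeled:]]
    return labeled, unlabeled


# 14 toy examples over 10 classes, mirroring the 100K / 1.3M proportion.
examples = [(f"question {i}", i % 10) for i in range(14)]
L, U = split_labeled_unlabeled(examples, n_labeled=1)
```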

Results
Using the initial 100K labeled dataset L and the 1.3M unlabeled dataset U, we compare our approach to several benchmark approaches: supervised learning, co-training, self-training and active learning. In all experiments, once the labeled training corpus for the main model is derived, we use the shallow neural network classifier described in Joulin et al. (2016) to train Main.
1. 1.4M Supervised Learning: Use the entire Yahoo! dataset and build a model similar to Joulin et al. (2016). The accuracy is reported as 72.3%. This is the theoretical upper bound for semi-supervised training. Any proposed method that uses less labeled data tries to approach this accuracy.
2. 100K Supervised Learning: Use the L dataset and build a model similar to Joulin et al. (2016). The accuracy for this model is 65.9%. This is the lower bound result. Any proposed method to leverage unlabeled data should outperform this number as much as possible.
3. Co-Training: Use L to build two initial models and use U data in a co-training setup to enhance the initial models. The system converged to an accuracy of 69.03% after 40 iterations.
4. Self-Training: Use L to train an initial model and use this initial model to predict the labels of U. In the next step, mix L and U with its predictions to train a new model. Keep iterating: predict the labels for U, mix the corpus and retrain the classifier until the system converges. In our experiments, the self-labeled baseline converged after 30 iterations to an accuracy of 67.5%.

Active Learning:
Train the initial classifier with L, use it to evaluate and select the most informative samples from U and reveal their ground truth labels. Then, update the classifier with the mixed corpus of L and the samples whose labels were revealed. As more samples get selected, the performance of the active-learning algorithm improves. Fig. 1 shows the improvements in the accuracy with the increasing amount of manually labeled data.

EMAEC:
Use L and U and apply Algorithm 1 and Algorithm 2 to iteratively train Main and Eval. After approximately 20 iterations, Main can achieve an accuracy of 70.42%. The Eval model yields 92.8% precision with 83.5% recall. At the convergence stage, the system generated labels for 1.12M instances in U with a false positive rate of <8%. This approach automatically labels the majority (>86%) of the U dataset with high quality (error rate <8%).

EMAEC with Enriched Data:
The Yahoo! dataset does not contain any additional user session feedback data. To simulate the scenario where user-provided feedback is too unreliable to produce 100% correct automatic labels, we simulate user feedback by randomly introducing noise into the original ground truth labels in the dataset. Thus, we first reveal all the ground truth labels in U, and then randomly select x% of U. Next, we randomly flip their correct labels into wrong labels and then mix them with the remaining instances in U. We call the mixed and blurred dataset B, which is the U dataset with noisy labels. We use the B dataset to simulate noisy user feedback signals. In this experiment, we use L, B, Algorithm 1 and Algorithm 2 to iteratively train Main and Eval. As expected, the higher the level of blurring, the worse the system performs. The theoretical upper bound for this experiment would be a classifier trained with 100K + (1 − x%) * 1.3M ground truth labeled data. Figure 2 shows how the system performance varies with the noise level x%. We can see that user feedback loop data can further improve the system's performance even if we introduce 50% noise to B.
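The label-blurring procedure can be sketched as follows. The function name `blur_labels`, the rounding of the number of flipped instances, and the uniform choice among wrong classes are assumptions for illustration.

```python
import random


def blur_labels(examples, classes, noise, seed=0):
    """Flip the labels of a random `noise` fraction of examples to a wrong class.

    examples: list of (text, gold_label) pairs with revealed ground truth
    noise: fraction in [0, 1], i.e. x% / 100
    """
    rng = random.Random(seed)
    n_flip = round(noise * len(examples))
    flip_idx = set(rng.sample(range(len(examples)), n_flip))
    blurred = []
    for i, (text, label) in enumerate(examples):
        if i in flip_idx:
            # Replace the correct label with a randomly chosen wrong one.
            label = rng.choice([c for c in classes if c != label])
        blurred.append((text, label))
    return blurred


U_revealed = [(f"question {i}", i % 3) for i in range(10)]
B = blur_labels(U_revealed, classes=[0, 1, 2], noise=0.5)
n_wrong = sum(b[1] != u[1] for b, u in zip(B, U_revealed))
```

Because every flipped label is guaranteed to differ from the ground truth, the fraction of wrong labels in B matches the requested noise level up to rounding.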

Discussion
The experimental results on the Yahoo! Answers topic classification dataset are summarized in Table 1. Moreover, the proposed approach can automatically generate high-quality labels for over 86% of the unlabeled data with an error rate less than 8%.
The results also show that by adding simulated user feedback loop signals into the evaluation component, the final system accuracy can be further improved. Even with 50% label noise, the system achieves 71.33% accuracy. The active learning system can achieve the same accuracy only after adding an extra 800K manually labeled samples. It is also worth noting that with the same noisy blurred label dataset, the supervised learning approach performs much worse. Its classifier accuracy drops significantly to 57.8% with 100K golden labels and 1.3M blurred labels at a 20% noise level.
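To relate the accuracies reported above to error rate reductions, the arithmetic can be sketched as follows. The relative definition used here (reduction in error divided by the baseline error) is an assumption, since the text does not state which definition it uses, and `relative_error_reduction` is a hypothetical helper.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Relative error reduction in percent, given accuracies in percent."""
    base_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (base_err - new_err) / base_err


# 100K supervised baseline (65.9%) vs. the approach without feedback (70.42%)
no_feedback = relative_error_reduction(65.9, 70.42)
# 100K supervised baseline (65.9%) vs. simulated feedback at 50% noise (71.33%)
with_feedback = relative_error_reduction(65.9, 71.33)
```

Under this definition the two settings correspond to relative error reductions of roughly 13.3% and 15.9% over the 100K supervised baseline.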

E-commerce Product Categorization Dataset Experiments
The proposed method is designed to tackle large-scale text classification problems that occur in the e-commerce industry, where the main challenge is a significant lack of high-quality labeled data. For example, the e-commerce product categorization dataset contains product titles and 600 different categories for the product titles. This dataset contains four different parts: the product category description data for 600 categories, a manually labeled initial training dataset L with 6K observations, a manually labeled blind test set T with 28K observations, and an unlabeled dataset U with 3.5 million observations that includes rich user feedback session data F. The main task here is to predict the product category as soon as the online user enters the product title. For example, the user might enter a product title such as "green coach bag" to describe a product. The system should classify this input title into the most relevant product category, such as "women's purse & bag". The 3.5 million unlabeled user behavior dataset contains a seller-chosen category and a category suggested by a machine learning model. We consider these data to be unlabeled since the seller-chosen category has a greater than 30% error rate according to our evaluations. This high error rate arises because users are not familiar with the category tree or intentionally select the wrong category to increase the chance of selling their product. Note that for the main-task system that runs online, only the product title information is available to the main-task model.

Results
With the initial 6K labeled dataset L, the 3.5M unlabeled dataset U and the 3.5M feedback session dataset F, we compare our proposed EMAEC approach with the auxiliary evaluation component to a weak supervised learning baseline and a co-training baseline as described below. Similar to the previous set of experiments, once the labeled training corpus for the main model is derived, we use the shallow neural network classifier described in Joulin et al. (2016) to train Main.
3. Co-Training: Use L to build two initial models and use U data in a co-training setup to enhance the initial models. Different feature sets from F are used to train two different models. After approximately 30 iterations, the system converges to its best performance. The total error reduction rate for co-training is 15.2%.

EMAEC with Enriched Data:
Build the initial classifier Main using L and F. Apply Algorithm 1 and Algorithm 2 to iteratively train the Main and Eval models. After approximately 20 iterations, the main task classifier Main converges to its best performance. The total error reduction for our approach is 19.23%.

Discussions
The experimental results on the e-commerce product categorization dataset are summarized in Table 2. The results demonstrate that our proposed approach with the auxiliary evaluation component outperforms the co-training approach substantially. The classification error rate is reduced by 5%. This improvement is well aligned with the results on the public Yahoo! Answers dataset.

Conclusions
In this paper, we proposed a semi-supervised learning approach to tackle the challenge of lacking high-quality labeled data. The experimental results in text classification tasks with both the public Yahoo! Answers data and our e-commerce data show the effectiveness of the proposed approach. This general dual-player machine learning framework can also be applied to other machine learning tasks, such as search ranking, speech recognition, machine translation, etc. The proposed method comes with advantages and disadvantages over existing semi-supervised learning approaches. Its advantage, demonstrated in text classification tasks, is that it can automatically extract fairly high-quality predicted labeled data from massive unlabeled data. Thus, it can further improve prediction accuracy by adding those automatically enriched labeled data into the original training corpus.
On the other hand, a potential drawback is that its effectiveness may be limited by the prediction performance of the auxiliary evaluation model. If the auxiliary evaluation model is not able to generate many labeled samples with a low false positive rate, the automatically enriched labeled data might not be well distributed to reflect the real problem's underlying data distribution. To overcome this, we must rely on vast amounts of real-world unlabeled data.