Learning a Deep Hybrid Model for Semi-Supervised Text Classification

We present a novel fine-tuning algorithm in a deep hybrid architecture for semi-supervised text classification. During each increment of the online learning process, the fine-tuning algorithm serves as a top-down mechanism for pseudo-jointly modifying model parameters following a bottom-up generative learning pass. The resulting model, trained under what we call the Bottom-Up-Top-Down learning algorithm, is shown to outperform a variety of competitive models and baselines trained across a wide range of splits between supervised and unsupervised training data.


Introduction
Recent breakthroughs in learning expressive neural architectures have addressed challenging problems in domains such as computer vision, speech recognition, and natural language processing. This success is owed to the representational power afforded by deeper architectures supported by longstanding theoretical arguments (Hastad, 1987). These architectures efficiently model complex, highly varying functions via multiple layers of non-linearities, which would otherwise require very "wide" shallow models that need large quantities of samples (Bengio, 2012). However, many of these deeper models have relied on mini-batch training on large-scale, labeled data-sets, either using unsupervised pre-training (Bengio et al., 2007) or improved architectural components (such as activation functions) (Schmidhuber, 2015).
In an online learning problem, samples are presented to the learning architecture at a given rate (usually with one-time access to these data points), and, as in the case of a web crawling agent, most of these are unlabeled. Given this, batch training and supervised learning frameworks are no longer applicable. While incremental approaches such as co-training have been employed to help these models learn in a more updatable fashion (Blum and Mitchell, 1998; Gollapalli et al., 2013), neural architectures can naturally be trained in an online manner through the use of stochastic gradient descent (SGD).
Semi-supervised online learning not only addresses practical applications but also reflects some challenges of human category acquisition (Tomasello, 2001). Consider the case of a child learning to discriminate between object categories and mapping them to words, given only a small amount of explicitly labeled data (the mother pointing to the object), and a large portion of unsupervised learning, where the child comprehends an adult's speech or experiences positive feedback for his or her own utterances regardless of their correctness. The original argument in this respect applied to grammar (e.g., Chomsky, 1980; Pullum & Scholz, 2002). While neural networks are not necessarily models of actual cognitive processes, semi-supervised models can show learnability and illustrate possible constraints inherent to the learning process.
The contribution of this paper is the development of the Bottom-Up-Top-Down learning algorithm for training a Stacked Boltzmann Experts Network (SBEN) (Ororbia II et al., 2015) hybrid architecture. This procedure combines our proposed top-down fine-tuning procedure for jointly modifying the parameters of an SBEN with a modified form of the model's original layer-wise bottom-up learning pass (Ororbia II et al., 2015). We investigate the performance of the constructed deep model when applied to semi-supervised text classification problems and find that our hybrid architecture outperforms all baselines.

Related Work
Recent successes in the domain of connectionist learning stem from the expressive power afforded by models, such as the Deep Belief Network (DBN) (Hinton et al., 2006; Bengio et al., 2007) or Stacked Denoising Autoencoder (Vincent et al., 2010), that greedily learn layers of stacked non-linear feature detectors, equivalent to levels of abstraction of the original representation. In a variety of language-based problems, deep architectures have outperformed popular shallow models and classifiers (Salakhutdinov and Hinton, 2009; Liu, 2010; Socher et al., 2011; Glorot et al., 2011b; Lu and Li, 2013; Lu et al., 2014). However, these architectures often operate in a multi-stage learning process, where a generative architecture is pre-trained and then used to initialize parameters of a second architecture that can be discriminatively fine-tuned (using back-propagation of errors or drop-out: Hinton et al., 2012). Several ideas have been proposed to help deep models deal with potentially uncooperative input distributions or encourage learning of discriminative information earlier in the process, many leveraging auxiliary models in various ways (Bengio et al., 2007). A few methods for adapting deep architecture construction to an incremental learning setting have also been proposed (Calandra et al., 2012; Zhou et al., 2012). Recently, Ororbia II et al. (2015) showed that deep hybrid architectures, or multi-level models that integrate discriminative and generative learning objectives, offer a strong, viable alternative to multi-stage learners and are readily usable for categorization tasks.
For text-based classification, a dominant model is the support vector machine (SVM) (Cortes and Vapnik, 1995), with many useful innovations that further improve its discriminative performance (Subramanya and Bilmes, 2008). When used in tandem with prior human knowledge to hand-craft good features, this simple architecture has proven effective in solving practical text-based tasks, such as academic document classification (Caragea et al., 2014). However, while model construction may be fast (especially when using a linear kernel), this process is costly in that it requires a great deal of human labor to annotate the training corpus. Our approach, which builds on that of Ororbia II et al. (2015), provides a means for improving classification performance when labeled data is in scarce supply, learning structure and regularity within the text to reduce classification error incrementally.

A Deep Hybrid Model for Semi-Supervised Learning
To directly handle the problem of discriminative learning when labeled data is scarce, Ororbia II et al. (2015) proposed deep hybrid architectures that could effectively leverage small amounts of labeled and large amounts of unlabeled data. In particular, the best-performing architecture was the Stacked Boltzmann Experts Network (SBEN), which is a variant of the DBN. In its construction and training, the SBEN design borrows many recent insights from efficiently learning good DBN models (Hinton et al., 2006) and is essentially a stack of building block models where each layer of model parameters is greedily modified while freezing the parameters of all others. In contrast to the DBN, which stacks restricted Boltzmann machines (RBMs) and is often used to initialize a deep multi-layer perceptron (MLP), the SBEN model is constructed by composing hybrid restricted Boltzmann machines and can be directly applied to the discriminative task in a single learning phase.
The hybrid restricted Boltzmann machine (HRBM) (Schmah et al., 2008; Larochelle and Bengio, 2008; Larochelle et al., 2012) building block of the SBEN is itself an extension of the RBM meant to ultimately perform classification. The HRBM graphical model is defined via parameters $\Theta = (W, U, \mathbf{b}, \mathbf{c}, \mathbf{d})$ (where $W$ is the input-to-hidden weight matrix, $U$ the hidden-to-class weight matrix, $\mathbf{b}$ is the visible bias vector, $\mathbf{c}$ is the hidden unit bias vector, and $\mathbf{d}$ is the class unit bias vector), and is a model of the joint distribution of a binary feature vector $\mathbf{x} = (x_1, \cdots, x_D)$ and its label $y \in \{1, \cdots, C\}$ that makes use of a latent variable set $\mathbf{h} = (h_1, \cdots, h_H)$. The model assigns a probability to the triplet $(y, \mathbf{x}, \mathbf{h})$ using:
$$p(y, \mathbf{x}, \mathbf{h}) = \frac{e^{-E(y, \mathbf{x}, \mathbf{h})}}{Z},$$
where $Z$ is known as the partition function. The model's energy function is defined as
$$E(y, \mathbf{x}, \mathbf{h}) = -\mathbf{h}^T W \mathbf{x} - \mathbf{b}^T \mathbf{x} - \mathbf{c}^T \mathbf{h} - \mathbf{d}^T \mathbf{e}_y - \mathbf{h}^T U \mathbf{e}_y,$$
where $\mathbf{e}_y = (1_{i=y})_{i=1}^{C}$ is the one-hot vector encoding of $y$. It is often not possible to compute $p(y, \mathbf{x}, \mathbf{h})$ or the marginal $p(y, \mathbf{x})$ due to the intractable normalization constant.
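As a minimal sketch of the energy-based parameterization, the NumPy snippet below evaluates an HRBM energy of the standard form used by Larochelle and Bengio (2008), $E(y, \mathbf{x}, \mathbf{h}) = -\mathbf{h}^T W \mathbf{x} - \mathbf{b}^T \mathbf{x} - \mathbf{c}^T \mathbf{h} - \mathbf{d}^T \mathbf{e}_y - \mathbf{h}^T U \mathbf{e}_y$. The function name `hrbm_energy`, the array shapes ($W$ as $H \times D$, $U$ as $H \times C$), zero-based class indices, and the toy values are assumptions made for illustration only.

```python
import numpy as np

def hrbm_energy(x, y, h, W, U, b, c, d):
    """Energy of one (y, x, h) configuration; lower energy means higher probability."""
    e_y = np.zeros(d.shape[0])
    e_y[y] = 1.0  # one-hot encoding of the (zero-based) label
    return -(h @ W @ x + b @ x + c @ h + d @ e_y + h @ U @ e_y)

# Toy model (hypothetical values): D = 3 inputs, H = 2 hidden units, C = 2 classes.
W = np.array([[1.0, 0.0, -1.0], [0.5, 0.5, 0.0]])
U = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.zeros(3)
c = np.zeros(2)
d = np.array([0.1, -0.1])
x = np.array([1.0, 1.0, 0.0])
h = np.array([1.0, 0.0])

energy_class0 = hrbm_energy(x, 0, h, W, U, b, c, d)
energy_class1 = hrbm_energy(x, 1, h, W, U, b, c, d)
```

In this toy configuration the active hidden unit is strongly tied to the first class through $U$, so the first class receives the lower energy.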
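However, exploiting the model's lack of intra-layer connections, block Gibbs sampling may be used to draw samples of the HRBM's latent variable layer given the current state of the visible layer and vice versa. This yields the following equations:
$$p(\mathbf{h}|y, \mathbf{x}) = \prod_j p(h_j|y, \mathbf{x}), \quad p(h_j = 1|y, \mathbf{x}) = \sigma\Big(c_j + U_{jy} + \sum_i W_{ji} x_i\Big),$$
$$p(\mathbf{x}|\mathbf{h}) = \prod_i p(x_i|\mathbf{h}), \quad p(x_i = 1|\mathbf{h}) = \sigma\Big(b_i + \sum_j W_{ji} h_j\Big),$$
$$p(y|\mathbf{h}) = \frac{e^{d_y + \sum_j U_{jy} h_j}}{\sum_{y'} e^{d_{y'} + \sum_j U_{jy'} h_j}},$$
where $\sigma(v) = 1/(1 + e^{-v})$. Classification may be performed directly with the HRBM by using its free energy function $F(y, \mathbf{x})$ to compute the conditional distribution
$$p(y|\mathbf{x}) = \frac{e^{-F(y, \mathbf{x})}}{\sum_{y' \in \{1, \cdots, C\}} e^{-F(y', \mathbf{x})}},$$
where the free energy is formally defined as
$$F(y, \mathbf{x}) = -\Big(d_y + \sum_j \log\big(1 + e^{c_j + U_{jy} + \sum_i W_{ji} x_i}\big)\Big).$$
To construct an N-layer SBEN (or N-SBEN), as was shown in Ororbia II et al. (2015), one may learn a stack of HRBMs in one of two ways: (1) in a strict greedy, layer-wise manner, where layers are each trained in isolation on all of the data samples one at a time from the bottom-up; or (2) in a more relaxed disjoint fashion, where all layers are trained together on all of the data but still in a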

Ensembling of Layer-Wise Experts
The SBEN may be viewed as a natural vertical ensemble of layer-wise "experts", where each layer maps latent representations to predictions, which differs from standard methods such as boosting (Schapire, 1990). Traditional feedforward neural models propagate data through the final network to obtain an output prediction $y_t$ from a penultimate layer for a given $\mathbf{x}_t$. In contrast, this hybrid model is capable of producing a label $y^n_t$ at each level $n$ for $\mathbf{x}_t$.
To vertically aggregate layer-wise expert outputs, we compute a simple mean predictor, $p(y|\mathbf{x})_{ensemble}$, as follows:
$$p(y|\mathbf{x})_{ensemble} = \frac{1}{N} \sum_{n=1}^{N} p(y|\mathbf{x}^n)_n,$$
where $\mathbf{x}^n$ denotes the input to the $n$-th layer-wise expert. This ensembling scheme provides a simple way to incorporate acquired discriminative knowledge of different levels of abstraction into the model's final prediction. We note that the SBEN's inherent layer-wise discriminative ability stands as an alternative to coupling helper classifiers (Bengio et al., 2007) or the "companion objectives".
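The layer-wise prediction and the mean predictor can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes each expert computes $p(y|\mathbf{x})$ from free energies of the form $F(y, \mathbf{x}) = -(d_y + \sum_j \mathrm{softplus}(c_j + U_{jy} + \sum_i W_{ji} x_i))$ as in Larochelle et al. (2012), with $W$ shaped $H \times D$ and $U$ shaped $H \times C$. The function names are hypothetical, and the propagation of inputs from one layer to the next is left out.

```python
import numpy as np

def softplus(a):
    # Numerically stable log(1 + exp(a)).
    return np.logaddexp(0.0, a)

def hrbm_class_posterior(x, W, U, c, d):
    """p(y|x) for a single HRBM expert, computed from its free energies."""
    pre = (W @ x + c)[:, None] + U                 # H x C pre-activations, one column per class
    neg_free_energy = d + softplus(pre).sum(axis=0)
    neg_free_energy = neg_free_energy - neg_free_energy.max()  # stabilise the softmax
    p = np.exp(neg_free_energy)
    return p / p.sum()

def ensemble_posterior(layer_posteriors):
    """Simple mean over the class distributions of the N layer-wise experts."""
    return np.mean(np.stack(layer_posteriors), axis=0)

# Toy expert (hypothetical values): one hidden unit tied to the first class.
p_single = hrbm_class_posterior(
    x=np.array([1.0, 0.0]),
    W=np.zeros((1, 2)),
    U=np.array([[3.0, 0.0]]),
    c=np.zeros(1),
    d=np.zeros(2),
)
p_mean = ensemble_posterior([np.array([0.8, 0.2]), np.array([0.6, 0.4])])
```

Averaging probabilities, rather than taking a vote over hard labels, lets a confident expert outweigh an uncertain one.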

The Bottom-Up-Top-Down Learning Algorithm
With the SBEN architecture defined, we next present its simple two-step training algorithm, or the Bottom-Up-Top-Down procedure (BUTD), which combines a greedy, bottom-up pass with a subsequent top-down fine-tuning step. At every iteration of training, the model makes use of a single labeled sample (taken from an available, small labeled data subset) and an example from either a large unlabeled pool or a data-stream. We describe each of the two phases in the two subsections that follow.

Bottom-Up Layer-wise Learning (BU)
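The first phase of N-SBEN learning consists of a bottom-up pass where each layer-wise HRBM can be trained using a compound objective function. Data samples are propagated up the model to the layer targeted for layer-wise training using the feedforward schema described above. Each HRBM layer of the SBEN is greedily trained using the frozen latent representations of the one below, which are generated by using the lower level expert's input and prediction. The loss function for each layer balances a discriminative objective $L_{disc}$, a supervised generative objective $L_{gen}$, and an unsupervised generative objective $L_{unsup}$, fully defined as follows:
$$L = \gamma L_{disc}(D_{train}) + \alpha L_{gen}(D_{train}) + \beta L_{unsup}(D_{unlab}).$$
Unlike generative pre-training of neural architectures (Bengio et al., 2007), the additional free parameters $\gamma$, $\alpha$, and $\beta$ offer explicit control over the extent to which the final parameters discovered are influenced by generative learning (Larochelle et al., 2012; Ororbia II et al., 2015). More importantly, the generative objectives may be viewed as providing data-dependent regularization on the discriminative learning gradient of each layer. The objectives themselves are defined as:
$$L_{disc}(D_{train}) = -\sum_{t=1}^{|D_{train}|} \log p(y_t|\mathbf{x}_t),$$
$$L_{gen}(D_{train}) = -\sum_{t=1}^{|D_{train}|} \log p(y_t, \mathbf{x}_t), \text{ and}$$
$$L_{unsup}(D_{unlab}) = -\sum_{t=1}^{|D_{unlab}|} \log p(\mathbf{u}_t),$$
where $D_{train} = \{(\mathbf{x}_t, y_t)\}$ is the labeled training data-set and $D_{unlab} = \{(\mathbf{u}_t)\}$ is the unlabeled training data-set. The gradient for $L_{disc}$ follows the general form
$$\frac{\partial \log p(y_t|\mathbf{x}_t)}{\partial \theta} = -\mathbb{E}_{\mathbf{h}|y_t, \mathbf{x}_t}\Big[\frac{\partial}{\partial \theta} E(y_t, \mathbf{x}_t, \mathbf{h})\Big] + \mathbb{E}_{y, \mathbf{h}|\mathbf{x}_t}\Big[\frac{\partial}{\partial \theta} E(y, \mathbf{x}_t, \mathbf{h})\Big]$$
and can be calculated directly (see Larochelle et al., 2012, for details) or through a form of Dropping, such as Drop-Out or Drop-Connect (Tomczak, 2013). The generative gradients themselves follow the form
$$\frac{\partial \log p(y_t, \mathbf{x}_t)}{\partial \theta} = -\mathbb{E}_{\mathbf{h}|y_t, \mathbf{x}_t}\Big[\frac{\partial}{\partial \theta} E(y_t, \mathbf{x}_t, \mathbf{h})\Big] + \mathbb{E}_{y, \mathbf{x}, \mathbf{h}}\Big[\frac{\partial}{\partial \theta} E(y, \mathbf{x}, \mathbf{h})\Big]$$
and, despite being intractable for any sample $(\mathbf{x}_t, y_t)$, may be approximated via the contrastive divergence procedure (Hinton, 2002). The intractable second expectation is replaced with a point estimate using a single Gibbs sampling step.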
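The single-step contrastive divergence approximation of the generative gradient can be sketched as below. This is an illustrative NumPy version for one HRBM and one labeled sample, assuming the usual HRBM conditionals (sigmoid hidden and visible units, softmax class unit), $W$ shaped $H \times D$, $U$ shaped $H \times C$, and zero-based labels. The function name and the choice to use hidden probabilities rather than samples in the final statistics are assumptions, not details taken from the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_generative_grads(x0, y0, params, rng):
    """One CD-1 step: approximate ascent directions for log p(y0, x0)."""
    W, U, b, c, d = params
    n_classes = d.shape[0]
    e0 = np.eye(n_classes)[y0]
    # Positive phase: hidden probabilities with data and label clamped.
    h0 = sigmoid(c + W @ x0 + U @ e0)
    h0_sample = (rng.random(h0.shape) < h0).astype(float)
    # Negative phase: one block Gibbs step down to (x, y) and back up.
    x1 = (rng.random(x0.shape) < sigmoid(b + W.T @ h0_sample)).astype(float)
    logits = d + U.T @ h0_sample
    p_y = np.exp(logits - logits.max())
    p_y /= p_y.sum()
    e1 = np.eye(n_classes)[rng.choice(n_classes, p=p_y)]
    h1 = sigmoid(c + W @ x1 + U @ e1)
    # Point estimate: data statistics minus one-step reconstruction statistics.
    return (np.outer(h0, x0) - np.outer(h1, x1),  # W
            np.outer(h0, e0) - np.outer(h1, e1),  # U
            x0 - x1,                              # b
            h0 - h1,                              # c
            e0 - e1)                              # d

# Toy usage with hypothetical sizes: H = 4 hidden units, D = 6 terms, C = 3 classes.
rng = np.random.default_rng(0)
params = (0.01 * rng.standard_normal((4, 6)), 0.01 * rng.standard_normal((4, 3)),
          np.zeros(6), np.zeros(4), np.zeros(3))
x0 = (rng.random(6) < 0.5).astype(float)
grads = cd1_generative_grads(x0, 1, params, rng)
```

Under the compound objective, such a gradient would be scaled by $\alpha$ for a labeled sample, or by $\beta$ for an unlabeled sample paired with a pseudo-label, before being combined with the discriminative gradient.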
To calculate the generative gradient for an unlabeled sample $\mathbf{u}$, a pseudo-label must be obtained by using a layer-wise HRBM's current estimate of $p(y|\mathbf{u})$, which can be viewed as a form of self-training or Entropy Regularization (Lee, 2013). The online procedure for computing the generative gradient (for either a labeled or an unlabeled example) for a single HRBM can be found in Ororbia et al. (2015). Setting the coefficients that control learning objective influences can lead to different model configurations (especially with respect to $\gamma$) as well as impact the gradient-based training of each model layer (i.e., $\alpha$ and $\beta$). In this paper, we shall explore two particular configurations, namely 1) setting $\gamma = 0$ and $\alpha = 1$, which leads to constructing a purely generative model of $D_{train}$ and $D_{unlab}$, and 2) setting $\gamma = 1$ with $\alpha$ freely varying (which recovers the model of Ororbia et al., 2015). In both scenarios, $\beta$ is allowed to vary as a user-defined hyper-parameter. The second setting of $\gamma$ allows for training the SBEN directly with only the bottom-up phase defined in this section. However, if the first setting is used, a second, top-down fine-tuning phase may be added. A bottom-up pass simply entails computing this compound gradient for each layer of the model for 1 or 2 samples per training iteration. Notice that the first scenario reduces the number of hyper-parameters to explore in model selection, requiring only an appropriate value for $\beta$ to be found.

Algorithm 1: Top-down fine-tuning of an N-SBEN (ensemble back-propagation). Note that "·" indicates a Hadamard product, $\xi$ is an error signal vector, the prime superscript indicates a derivative (i.e., $\sigma'$ means the derivative function of the sigmoid), and $z$ is the symbol for linear pre-activation values.
    Input: $(\mathbf{x}_t, y_t) \in D$, learning rate $\lambda$ and model parameters $\Theta = \{\Theta_1, \Theta_2, ..., \Theta_N\}$
    function FINETUNEMODEL($(\mathbf{x}_t, y_t)$, $\lambda$, $\Theta$)
        $\Omega \leftarrow \emptyset$, $\mathbf{x}^n \leftarrow \mathbf{x}_t$, $y^n \leftarrow \emptyset$    // Initialize list of layer-wise model statistics & variables
        // Conduct feed-forward pass to collect layer-wise statistics
        for $\Theta_n \in \Theta$ do
            $(\mathbf{h}^n, \mathbf{z}^n, y^n_h, \mathbf{x}^n) \leftarrow$ COMPUTELAYERWISESTATISTICS($\mathbf{x}^n$, $\Theta_n$)
            $\Omega_n \leftarrow (\mathbf{h}^n, \mathbf{z}^n, y^n_h, \mathbf{x}^n)$, $\mathbf{x}^n \leftarrow \mathbf{h}^n$, $y^n \leftarrow y^n_h$
        // Conduct error back-propagation pass to adjust layer-wise parameters
        // Can re-use z to perform next step

Top-Down Fine-tuning (TD)
Although efficient, the bottom-up procedure described above is greedy, which means that the gradients are computed for each layer-wise HRBM independent of gradient information from other layers of the model. One way we propose to introduce a degree of joint training of parameters is to incorporate a second phase that adjusts the SBEN parameters via a modified form of back-propagation. Such a routine can further exploit the SBEN's multiple predictors (or entry points) where additional error signals may be computed and aggregated while signals are reverse-propagated down the network. We hypothesize that holistic fine-tuning ensures that discriminative information is incorporated into the generative

Algorithm 2: The Bottom-Up-Top-Down training procedure for learning an N-SBEN.

Fine-tuning in the context of training an SBEN is different from using a pre-trained MLP that is subsequently fine-tuned with back-propagation. First, since the SBEN is a more complex architecture than an MLP, pre-initializing an MLP would be insufficient given that one would be tossing potentially useful information stored in the SBEN's class filters (and corresponding class bias vectors) of each layer-wise expert (i.e., $U$ and $\mathbf{d}$). Second, merely using the SBEN as an intermediate model ignores the fact that the SBEN can already perform classification directly. To avoid losing such information and to fully exploit the model's predictive ability, we adapt the back-propagation algorithm for training MLPs to operate on the SBEN, which we shall call ensemble back-propagation since the fine-tuning method propagates error derivatives down the network from many points of entry. Ensemble back-propagation is described in Algorithm 1.
With this second online training step, the Bottom-Up-Top-Down (BUTD) training algorithm for fully training an SBEN proceeds with a single bottom-up modification step followed by a single top-down joint fine-tuning step using the ensemble back-propagation procedure defined in Algorithm 1 for each training time step. A full top-down phase can consist of up to two calls to the ensemble back-propagation procedure. One is used to jointly modify the SBEN's parameters with respect to the sample taken from $D_{train}$. A second one is potentially needed to tune parameters with respect to the sample drawn from $D_{unlab}$. For the unlabeled sample, if the highest class probability assigned by the SBEN is greater than a pre-set threshold (i.e., $\max[p_{ensemble}(y|\mathbf{u})] > \bar{p}$), a pseudo-label is created for that sample by converting the model's mean vector to a 1-hot encoding. The probability threshold $\bar{p}$ for the potential second call to the ensemble back-propagation routine allows us to incorporate a tunable form of pseudo-labeling (Lee, 2013) into the Bottom-Up-Top-Down learning algorithm.
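To make the idea of multiple error entry points concrete, the sketch below back-propagates through a simplified stand-in for the SBEN. It assumes, purely for illustration, that each layer is a sigmoid hidden layer with its own softmax head ($W$ shaped $H \times D$, $U$ shaped $H \times C$), that the next layer sees only the hidden state, and that the objective is the sum of the heads' cross-entropy losses. The actual model predicts through free energies and conditions hidden states on the lower expert's prediction, so this is not the authors' Algorithm 1, and all names here are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, layers):
    """Feed-forward pass collecting per-layer statistics (input, hidden state, head output)."""
    stats = []
    for W, U, c, d in layers:
        h = sigmoid(W @ x + c)      # hidden activation of this layer
        p = softmax(U.T @ h + d)    # this layer's own class prediction
        stats.append((x, h, p))
        x = h                       # the next layer sees this layer's hidden state
    return stats

def total_loss(x, y, layers):
    """Sum of the cross-entropy losses of all layer-wise heads."""
    return -sum(np.log(p[y]) for _, _, p in forward(x, layers))

def ensemble_backprop_grads(x, y, layers):
    """Gradients of total_loss; an error signal enters at every layer's head."""
    stats = forward(x, layers)
    grads = [None] * len(layers)
    xi = 0.0                                    # error arriving from the layer above
    for n in reversed(range(len(layers))):
        W, U, c, d = layers[n]
        x_n, h, p = stats[n]
        err = p.copy()
        err[y] -= 1.0                           # softmax cross-entropy error at this head
        dh = U @ err + xi                       # head error plus error from above
        dz = dh * h * (1.0 - h)                 # Hadamard product with the sigmoid derivative
        grads[n] = (np.outer(dz, x_n), np.outer(h, err), dz, err)  # (dW, dU, dc, dd)
        xi = W.T @ dz                           # pass the error down to the layer below
    return grads
```

A plain SGD step would then subtract the learning rate times each gradient from the matching parameter. The key difference from ordinary back-propagation is the line that adds `U @ err` to the signal arriving from above at every layer, rather than only at the top.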
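The thresholded pseudo-labeling rule can be written in a few lines. The sketch below assumes the ensemble's class distribution for the unlabeled sample is already available as a vector; the function name and the use of `None` to signal "no pseudo-label" are illustrative choices.

```python
import numpy as np

def pseudo_label(p_ensemble, threshold):
    """Return a one-hot pseudo-label if the top class probability exceeds the threshold, else None."""
    p_ensemble = np.asarray(p_ensemble, dtype=float)
    if p_ensemble.max() > threshold:
        one_hot = np.zeros_like(p_ensemble)
        one_hot[p_ensemble.argmax()] = 1.0
        return one_hot
    return None

confident = pseudo_label([0.10, 0.85, 0.05], threshold=0.8)
uncertain = pseudo_label([0.40, 0.35, 0.25], threshold=0.8)
```

Raising the threshold trades coverage of the unlabeled data for a lower risk of reinforcing incorrect predictions.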
The high-level view of the BUTD learning algorithm is depicted in Algorithm 2.
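As a rough outline of the control flow described in the text, the sketch below alternates one bottom-up step with up to two top-down calls per iteration. The model interface (`bottom_up_step`, `fine_tune`, `predict_proba`) and the `RecordingModel` stub are hypothetical stand-ins that only log calls; a real SBEN would update its parameters in those methods.

```python
from itertools import cycle

class RecordingModel:
    """Stand-in that only logs calls (hypothetical interface)."""
    def __init__(self, proba):
        self.proba = proba
        self.calls = []

    def bottom_up_step(self, x, y, u):
        self.calls.append("BU")

    def fine_tune(self, x, y):
        self.calls.append(("TD", y))

    def predict_proba(self, u):
        return self.proba

def butd_train(model, labeled_stream, unlabeled_stream, threshold, n_iters):
    """One bottom-up step, then one or two top-down steps, per training iteration."""
    for _ in range(n_iters):
        x_t, y_t = next(labeled_stream)
        u_t = next(unlabeled_stream)
        model.bottom_up_step(x_t, y_t, u_t)   # greedy, layer-wise update
        model.fine_tune(x_t, y_t)             # first top-down call (labeled sample)
        p = model.predict_proba(u_t)          # ensemble posterior for the unlabeled sample
        if max(p) > threshold:                # second call only if confident enough
            label = max(range(len(p)), key=lambda k: p[k])
            model.fine_tune(u_t, label)
    return model

model = butd_train(RecordingModel([0.05, 0.9, 0.05]),
                   cycle([([1, 0], 0)]), cycle([[0, 1]]),
                   threshold=0.8, n_iters=2)
```

With a confident prediction for the unlabeled sample, each iteration logs one bottom-up call followed by two top-down calls.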

Experimental Results
We investigate the viability of our deep hybrid architecture for semi-supervised text categorization. Model performance was evaluated on the WebKB data-set and a small-scale version of the 20NewsGroup data-set.
The original WebKB collection contains pages from a variety of universities (Cornell, Texas, Washington, and Wisconsin as well as miscellaneous pages from others). The 4-class classification problem we defined using this data-set was to determine if a web-page could be identified as one belonging to a Student, Faculty, Course, or a Project, yielding a usable subset of 4,199 samples. We applied simple pre-processing to the text, namely stop-word removal and stemming, chose to leverage only the k most frequently occurring terms (this varied across the two experiments), and binarized the document low-level representation (only 1 page vector was discarded because it contained no terms). The 20NewsGroup data-set, on the other hand, contained 16,242 total samples and was already pre-processed, with a 100-term, binary-occurrence low-level representation and tags for the four top-most highest level domains or meta-topics in the newsgroups array.
For both data-sets, we evaluated model generalization performance using a stratified 5-fold cross-validation (CV) scheme. For each possible train/test split, we automatically partitioned the training fold into separate labeled, unlabeled, and validation subsets using stratified random sampling without replacement. Generalization performance was evaluated by estimating classification error, average precision, average recall, and average F-Measure, where F-Measure was chosen to be the harmonic mean of precision and recall, $F_1 = 2(\text{precision} \cdot \text{recall})/(\text{precision} + \text{recall})$.
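The evaluation metrics can be illustrated with a short, dependency-free function. The text does not state how the averages are taken, so this sketch assumes per-class precision, recall, and $F_1$ that are then macro-averaged over classes; the function name and toy labels are hypothetical.

```python
def macro_scores(y_true, y_pred, n_classes):
    """Macro-averaged precision, recall, and F1 (harmonic mean of the two) over classes."""
    precisions, recalls, f1s = [], [], []
    for k in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == k and p == k)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != k and p == k)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == k and p != k)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    return (sum(precisions) / n_classes,
            sum(recalls) / n_classes,
            sum(f1s) / n_classes)

precision, recall, f1 = macro_scores([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
```

Macro-averaging weights every class equally, which matters when class sizes are unbalanced.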

Model Designs
We evaluated the BUTD version of our model, the 3-SBEN,BUTD, as described in Algorithm 2. For simplicity, the number of latent variables at each level of the SBEN was held equal to the dimensionality of the data (i.e., a complete representation). We compared this model trained with BUTD against a version utilizing only the bottom-up phase (3-SBEN,BU) as in Ororbia et al. (2015). Both SBEN models contained 3 layers of latent variables.
We compared against an array of baseline classifiers. We used our implementation of an incremental version of Maximum Entropy, or MaxEnt-ST (which, as explained in Sarikaya et al., 2014, is equivalent to a softmax classifier). Furthermore, we used our implementation of the Pegasos algorithm (SVM-ST) (Shalev-Shwartz et al., 2011), which was extended to follow a proper multi-class scheme (Crammer and Singer, 2002). This is the online formulation of the SVM, trained via sub-gradient descent on the primal objective followed by a projection step (for simplicity, we opted to use a linear kernel). Additionally, we implemented a semi-supervised Bernoulli Naive Bayes classifier (NB-EM) trained via Expectation-Maximization as in Nigam et al. (1999). We also compared our model against the HRBM (Larochelle and Bengio, 2008) (effectively a single layer SBEN), which serves as a powerful, non-linear shallow classifier in and of itself, as well as a 3-layer sparse deep Rectifier Network (Glorot et al., 2011a), or Rect, composed of leaky rectifier units.
All shallow classifiers (except NB-EM and the HRBM) were extended to the semi-supervised setting by leveraging a simple self-training scheme in order to learn from unlabeled data samples. The self-training scheme entailed using a classifier's estimate of $p(y|\mathbf{u})$ for an unlabeled sample and, if $\max[p(y|\mathbf{u})] > \bar{p}$, creating a 1-hot proxy encoding using the argmax of the model's predictor, where $\bar{p}$ is a threshold meta-parameter. Since we found this simple pseudo-labeling approach, similar in spirit to Lee (2013), to improve the results for all classifiers, we report all results utilizing this scheme. All classes of models (SBEN, HRBM, Rect, SVM-ST, MaxEnt-ST, NB-EM) were subject to the same model selection procedure described in the next section.

Model Selection
Model selection was conducted using a parallelized multi-setting scheme, where a configuration file for each model was specified, describing a set of hyper-parameter combinations to explore (this is akin to a coarse-grained grid search, where the points of model evaluation are set manually a priori). For the SBENs, we varied the learning rate ([0.01, 0.25]) and $\beta$ coefficient ([0.1, 1.0]) and experimented with stochastic and mean-field versions of the models (we found that mean-field did slightly better for this experiment and thus report the performance of this model in this paper). The HRBM's meta-parameters were tuned using a similar set-up to Larochelle et al. (2012). All models of all configurations were trained for a 10,000 iteration sweep incrementally on the data, and the model state with the lowest validation error for that particular run was used. The SBEN, HRBM, and Rect models were also set to use a momentum term of 0.9 (linearly increased from 0.1 in the first 1000 training iterations), and the Rect model made use of a small L1 regularization penalty to encourage additional hidden sparsity. For a data-set like the 20NewsGroup, which contained more unlabeled samples than there were training iterations, we view our schema as simulating access to a data-stream, since all models had access to any given unlabeled example only once during a training run.
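The momentum schedule mentioned above is easy to state in code. The sketch below assumes iterations are counted from zero and that the ramp ends exactly at iteration 1000; the function name and default arguments are illustrative.

```python
def momentum_at(t, start=0.1, end=0.9, ramp_iters=1000):
    """Momentum coefficient at iteration t: linear ramp from start to end, then constant."""
    if t >= ramp_iters:
        return end
    return start + (end - start) * t / ramp_iters

schedule = [momentum_at(t) for t in (0, 500, 1000, 5000)]
```

Starting with a small momentum keeps early, noisy gradient estimates from dominating the accumulated update direction.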

Model Performance
We first conducted an experiment, using the WebKB data-set, exploring classification error as a function of labeled data subset cardinality (Figure 2). In this setup, we repeated the stratified cross-fold scheme for each possible labeled data subset size, comparing the performance of the SVM model against 3-SBEN,BU (blue dotted curve) and 3-SBEN,BUTD (green dash-dotted curve). We see that as the number of labeled examples increases (which entails greater human annotation effort) all models improve, nearly reaching 90% accuracy. However, while the performance difference between models becomes negligible as the training set becomes more supervised, as expected, it is the regions of the plot where labeled data is scarce that interest us. We see that for small labeled proportions, both variants of the SBEN outperform the SVM, and furthermore, the SBEN trained via full BUTD can reach lower error, especially for the most extreme scenario where only 8 labeled examples per class are available. We notice a bump in the performance of BUTD as nearly the whole training set becomes labeled and posit that since the BUTD involves additional pseudo-labeling steps (as in the top-down phase), there is greater risk of reinforcing incorrect predictions in the pseudo-joint tuning of layer-wise expert parameters. For text collections where most of the data is labeled and unlabeled data is minimal, only a simple bottom-up pass is needed to learn a good hybrid model of the data.
The next set of experiments was conducted with only 1% of the training sets labeled. We observe (Tables 1 and 2) that our deep hybrid architecture trained via BUTD outperforms all other models with respect to all performance metrics. While the SBEN trained with only an online bottom-up pass performs significantly better than the SVM model, we note a further reduction of error using our proposed BUTD training procedure.
The additional top-down phase serves as a mechanism for unifying the layer-wise experts, where error signals for both labeled and pseudo-labeled examples increase agreement among all model layer experts.
For the 20NewsGroup data-set, we conducted a simple experiment to uncover some of the knowledge acquired by our model with respect to the target categorization task. We applied the mechanism from Larochelle et al. (2012) to extract the variables that are most strongly associated with each of the clamped target variables in the lowest layer of a BUTD-trained SBEN. The top-scored terms associated with each class variable are shown in Table 3, using the 10 hidden nodes most highly triggered by the clamped class node, in a model trained on all of the 20NewsGroup data using a model configuration determined from CV results for the 20NewsGroup data-set reported in the paper. Since the SBEN is a composition of layer-wise experts each capable of classification, we note that this procedure could be applied to each level to uncover which unobserved variables are most strongly associated with each class target. We speculate that this could serve as the basis for uncovering the model's underlying learnt hierarchy of the data and be potentially used for knowledge extraction, a subject for future work in analyzing black box neural models such as our own.
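One plausible reading of this term-extraction step is sketched below. It assumes that the hidden units "most highly triggered" by a class are those with the largest class weights in $U$ (shaped $H \times C$), and that terms are then ranked by the summed input weights in $W$ (shaped $H \times D$) of those units. Both the scoring rule and the function name are assumptions for illustration, not the exact mechanism of Larochelle et al. (2012).

```python
import numpy as np

def top_terms_per_class(W, U, vocab, n_hidden=10, n_terms=5):
    """For each class, pick the hidden units with the largest class weights,
    then the terms those units weight most strongly."""
    result = {}
    for y in range(U.shape[1]):
        top_hidden = np.argsort(U[:, y])[::-1][:n_hidden]
        term_scores = W[top_hidden].sum(axis=0)      # assumed scoring rule
        top = np.argsort(term_scores)[::-1][:n_terms]
        result[y] = [vocab[i] for i in top]
    return result

# Toy model (hypothetical values): 3 hidden units, 4 terms, 2 classes.
W = np.array([[3.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 4.0],
              [9.0, 9.0, 9.0, 9.0]])
U = np.array([[5.0, 0.0],
              [0.0, 5.0],
              [0.0, 0.0]])
terms = top_terms_per_class(W, U, ["a", "b", "c", "d"], n_hidden=1, n_terms=2)
```

Applying the same routine to the parameters of a higher layer would rank that layer's input units instead of vocabulary terms.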

Conclusions
We presented the Bottom-Up-Top-Down procedure for training the Stacked Boltzmann Experts Network, a hybrid architecture that balances both discriminative and generative learning goals, in the context of semi-supervised text categorization. It combines a greedy, layer-wise bottom-up approach with a top-down fine-tuning method for pseudo-joint modification of parameters.
Models were evaluated using two text corpora: WebKB and 20NewsGroup. We compared results against several baseline models and found that our hybrid architecture outperformed the others in all settings investigated. We found that the SBEN, especially when trained with the full Bottom-Up-Top-Down learning procedure, could in some cases improve classification error by as much as 39% over the Pegasos SVM, and nearly 17% over the HRBM, especially when data is in very limited supply. While we were able to demonstrate the viability of our hybrid model when using only simple surface statistics of text, future work shall include application of our models to more semantic-oriented representations, such as those leveraged in building log-linear language models (Mikolov et al., 2013).