Semi-Supervised Sequence Modeling with Cross-View Training

Unsupervised representation learning algorithms such as word2vec and ELMo improve the accuracy of many supervised NLP models, mainly because they can take advantage of large amounts of unlabeled text. However, the supervised models only learn from task-specific labeled data during the main training phase. We therefore propose Cross-View Training (CVT), a semi-supervised learning algorithm that improves the representations of a Bi-LSTM sentence encoder using a mix of labeled and unlabeled data. On labeled examples, standard supervised learning is used. On unlabeled examples, CVT teaches auxiliary prediction modules that see restricted views of the input (e.g., only part of a sentence) to match the predictions of the full model seeing the whole input. Since the auxiliary modules and the full model share intermediate representations, this in turn improves the full model. Moreover, we show that CVT is particularly effective when combined with multi-task learning. We evaluate CVT on five sequence tagging tasks, machine translation, and dependency parsing, achieving state-of-the-art results.


Introduction
Deep learning models work best when trained on large amounts of labeled data. However, acquiring labels is costly, motivating the need for effective semi-supervised learning techniques that leverage unlabeled examples. A widely successful semi-supervised learning strategy for neural NLP is pre-training word vectors (Mikolov et al., 2013). More recent work trains a Bi-LSTM sentence encoder to do language modeling and then incorporates its context-sensitive representations into supervised models (Dai and Le, 2015; Peters et al., 2018). Such pre-training methods perform unsupervised representation learning on a large corpus of unlabeled data followed by supervised training.
Footnote 1: Code will be made available at https://github.com/tensorflow/models/tree/master/research/cvt_text
A key disadvantage of pre-training is that the first representation learning phase does not take advantage of labeled data: the model attempts to learn generally effective representations rather than ones that are targeted towards a particular task. Older semi-supervised learning algorithms like self-training do not suffer from this problem because they continually learn about a task on a mix of labeled and unlabeled data. Self-training has historically been effective for NLP (Yarowsky, 1995; McClosky et al., 2006), but is less commonly used with neural models. This paper presents Cross-View Training (CVT), a new self-training algorithm that works well for neural sequence models.
In self-training, the model learns as normal on labeled examples. On unlabeled examples, the model acts as both a teacher that makes predictions about the examples and a student that is trained on those predictions. Although this process has shown value for some tasks, it is somewhat tautological: the model already produces the predictions it is being trained on. Recent research on computer vision addresses this by adding noise to the student's input, training the model so it is robust to input perturbations (Sajjadi et al., 2016;Wei et al., 2018). However, applying noise is difficult for discrete inputs like text.
As a solution, we take inspiration from multi-view learning (Blum and Mitchell, 1998; Xu et al., 2013) and train the model to produce consistent predictions across different views of the input. Instead of only training the full model as a student, CVT adds auxiliary prediction modules (neural networks that transform vector representations into predictions) to the model and also trains them as students. The input to each student prediction module is a subset of the model's intermediate representations corresponding to a restricted view of the input example. For example, one auxiliary prediction module for sequence tagging is attached to only the "forward" LSTM in the model's first Bi-LSTM layer, so it makes predictions without seeing any tokens to the right of the current one.
CVT works by improving the model's representation learning. The auxiliary prediction modules can learn from the full model's predictions because the full model has a better, unrestricted view of the input. As the auxiliary modules learn to make accurate predictions despite their restricted views of the input, they improve the quality of the representations they are built on top of. This in turn improves the full model, which uses the same shared representations. In short, our method combines the idea of representation learning on unlabeled data with classic self-training.
CVT can be applied to a variety of tasks and neural architectures, but we focus on sequence modeling tasks where the prediction modules are attached to a shared Bi-LSTM encoder. We propose auxiliary prediction modules that work well for sequence taggers, graph-based dependency parsers, and sequence-to-sequence models. We evaluate our approach on English dependency parsing, combinatory categorial grammar supertagging, named entity recognition, part-of-speech tagging, and text chunking, as well as English to Vietnamese machine translation. CVT improves over previously published results on all these tasks. Furthermore, CVT can easily and effectively be combined with multi-task learning: we just add additional prediction modules for the different tasks on top of the shared Bi-LSTM encoder. Training a unified model to jointly perform all of the tasks except machine translation improves results (outperforming a multi-task ELMo model) while decreasing the total training time.

Cross-View Training
We first present Cross-View Training and describe how it can be combined effectively with multi-task learning. See Figure 1 for an overview of the training method.

Method
Let $\mathcal{D}_l = \{(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)\}$ represent a labeled dataset and $\mathcal{D}_{ul} = \{x_1, x_2, ..., x_M\}$ represent an unlabeled dataset. We use $p_\theta(y|x_i)$ to denote the output distribution over classes produced by the model with parameters $\theta$ on input $x_i$. During CVT, the model alternates learning on a minibatch of labeled examples and learning on a minibatch of unlabeled examples. For labeled examples, CVT uses standard cross-entropy loss:
$$\mathcal{L}_{\text{sup}}(\theta) = \frac{1}{|\mathcal{D}_l|} \sum_{(x_i, y_i) \in \mathcal{D}_l} CE(y_i, p_\theta(y|x_i))$$
CVT adds $k$ auxiliary prediction modules to the model, which are used when learning on unlabeled examples. A prediction module is usually a small neural network (e.g., a hidden layer followed by a softmax layer). Each one takes as input an intermediate representation $h^j(x_i)$ produced by the model (e.g., the outputs of one of the LSTMs in a Bi-LSTM model). It outputs a distribution over labels $p_\theta^j(y|x_i)$. Each $h^j$ is chosen such that it only uses a part of the input $x_i$; the particular choice can depend on the task and model architecture. We propose variants for several tasks in Section 3. The auxiliary prediction modules are only used during training; the test-time predictions come from the primary prediction module that produces $p_\theta$.
Figure 1: An overview of Cross-View Training. The model is trained with standard supervised learning on labeled examples. On unlabeled examples, auxiliary prediction modules with different views of the input are trained to agree with the primary prediction module. This particular example shows CVT applied to named entity recognition. From the labeled example, the model can learn that "Washington" usually refers to a location. Then, on unlabeled data, auxiliary prediction modules are trained to reach the same prediction without seeing some of the input. In doing so, they improve the contextual representations produced by the model, for example, learning that "traveled to" is usually followed by a location.
On an unlabeled example, the model first produces soft targets $p_\theta(y|x_i)$ by performing inference. CVT trains the auxiliary prediction modules to match the primary prediction module on the unlabeled data by minimizing
$$\mathcal{L}_{\text{CVT}}(\theta) = \frac{1}{|\mathcal{D}_{ul}|} \sum_{x_i \in \mathcal{D}_{ul}} \sum_{j=1}^{k} D(p_\theta(y|x_i), p_\theta^j(y|x_i))$$
where $D$ is a distance function between probability distributions (we use KL divergence). We hold the primary module's prediction $p_\theta(y|x_i)$ fixed during training (i.e., we do not back-propagate through it) so the auxiliary modules learn to imitate the primary one, but not vice versa. CVT works by enhancing the model's representation learning. As the auxiliary modules train, the representations they take as input improve so they are useful for making predictions even when some of the model's inputs are not available. This in turn improves the primary prediction module, which is built on top of the same shared representations.
We combine the supervised and CVT losses into the total loss, $\mathcal{L} = \mathcal{L}_{\text{sup}} + \mathcal{L}_{\text{CVT}}$, and minimize it with stochastic gradient descent. In particular, we alternate minimizing $\mathcal{L}_{\text{sup}}$ over a minibatch of labeled examples and minimizing $\mathcal{L}_{\text{CVT}}$ over a minibatch of unlabeled examples.
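As an illustration of the unlabeled-data objective described above, the following minimal sketch computes the per-example CVT loss as a sum of KL divergences between a fixed primary prediction and each auxiliary prediction. The function names, the toy distributions, and the direction of the divergence (primary as the first argument) are assumptions made for this example; in an autodiff framework, holding the primary prediction fixed would correspond to stopping gradients through it.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def cvt_loss(primary, auxiliaries):
    """Sum of KL(primary || auxiliary_j) over the k auxiliary modules for one example.

    `primary` is treated as a constant target (assumption: no gradient flows
    through it), so only the auxiliary predictions would be updated.
    """
    return sum(kl_divergence(primary, aux) for aux in auxiliaries)


# Hypothetical three-class predictions for a single unlabeled token.
primary = [0.7, 0.2, 0.1]
auxiliaries = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2]]
loss = cvt_loss(primary, auxiliaries)
```

The first auxiliary module already agrees with the primary one and contributes nothing to the loss; only the second, which disagrees, produces a positive term.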
For most neural networks, adding a few additional prediction modules is computationally cheap compared to the portion of the model building up representations (such as an RNN or CNN). Therefore our method contributes little overhead to training time over other self-training approaches for most tasks. CVT does not change inference time or the number of parameters in the fully trained model because the auxiliary prediction modules are only used during training.

Combining CVT with Multi-Task Learning
CVT can easily be combined with multi-task learning by adding additional prediction modules for the other tasks on top of the shared Bi-LSTM encoder. During supervised learning, we randomly select a task and then minimize $\mathcal{L}_{\text{sup}}$ on a minibatch of labeled data for that task. When learning on the unlabeled data, we optimize $\mathcal{L}_{\text{CVT}}$ jointly across all tasks at once, first running inference with all the primary prediction modules and then learning from the predictions with all the auxiliary prediction modules. As before, the model alternates training on minibatches of labeled and unlabeled examples.
Examples labeled across many tasks are useful for multi-task systems to learn from, but most datasets are only labeled with one task. A benefit of multi-task CVT is that the model creates (artificial) all-tasks-labeled examples from unlabeled data. This significantly improves the model's data efficiency and training time. Since running prediction modules is computationally cheap, computing $\mathcal{L}_{\text{CVT}}$ is not much slower for many tasks than it is for a single one. Moreover, we find the all-tasks-labeled examples substantially speed up model convergence. For example, our model trained on six tasks takes about three times as long to converge as the average model trained on one task, a 2x decrease in total training time compared with training six separate single-task models.
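The alternating multi-task schedule described above can be sketched as follows. This is only an illustration under stated assumptions: the strict one-to-one alternation, the function name `training_schedule`, and the task identifiers are hypothetical, and the sketch records which tasks contribute to each step rather than performing any learning.

```python
import random


def training_schedule(tasks, num_steps, seed=0):
    """Yield (kind, tasks_used) pairs alternating labeled and unlabeled minibatches."""
    rng = random.Random(seed)
    for step in range(num_steps):
        if step % 2 == 0:
            # Labeled minibatch: one randomly chosen task contributes to L_sup.
            yield ("labeled", [rng.choice(tasks)])
        else:
            # Unlabeled minibatch: L_CVT is computed for all tasks at once.
            yield ("unlabeled", list(tasks))


# Illustrative identifiers for the six jointly trained tasks.
tasks = ["ccg", "chunking", "ner", "fine_grained_ner", "pos", "parsing"]
schedule = list(training_schedule(tasks, num_steps=4))
```

Under the figures quoted above, six separate models would cost roughly six units of single-task convergence time, while the joint model costs about three, which is where the 2x reduction comes from.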

Cross-View Training Models
CVT relies on auxiliary prediction modules that have restricted views of the input. In this section, we describe specific constructions of the auxiliary prediction modules that are effective for sequence tagging, dependency parsing, and sequence-to-sequence learning.

Bi-LSTM Sentence Encoder
All of our models use a two-layer CNN-BiLSTM (Chiu and Nichols, 2016; Ma and Hovy, 2016) sentence encoder. It takes as input a sequence of words $x_i = [x_i^1, x_i^2, ..., x_i^T]$. First, each word is represented as the sum of an embedding vector and the output of a character-level Convolutional Neural Network, resulting in a sequence of vectors $v = [v^1, v^2, ..., v^T]$. The encoder applies a two-layer bidirectional LSTM (Graves and Schmidhuber, 2005) to these representations. The first layer runs a Long Short-Term Memory unit (Hochreiter and Schmidhuber, 1997) in the forward direction (taking $v^t$ as input at each step $t$) and the backward direction (taking $v^{T-t+1}$ at each step) to produce vector sequences $[\overrightarrow{h}_1^1, ..., \overrightarrow{h}_1^T]$ and $[\overleftarrow{h}_1^1, ..., \overleftarrow{h}_1^T]$. The output of the Bi-LSTM is the concatenation of these vectors: $h_1 = [\overrightarrow{h}_1^1 \oplus \overleftarrow{h}_1^1, ..., \overrightarrow{h}_1^T \oplus \overleftarrow{h}_1^T]$. The second Bi-LSTM layer works the same, producing outputs $h_2$, except it takes $h_1$ as input instead of $v$.
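To make the data flow of the two-layer bidirectional encoder concrete, here is a small sketch in which a running-sum cell stands in for the LSTM unit. The cell, the function names, and the one-dimensional inputs are assumptions chosen so the result can be checked by hand; only the forward/backward passes, the per-position concatenation, and the stacking of the second layer on the first mirror the description above.

```python
def run_recurrent(inputs, cell, initial_state):
    """Apply a recurrent cell left to right and return the state at every step."""
    states, state = [], initial_state
    for v in inputs:
        state = cell(state, v)
        states.append(state)
    return states


def bidirectional_layer(inputs, cell, initial_state):
    """Concatenate forward and backward states position by position."""
    forward = run_recurrent(inputs, cell, initial_state)
    # Run over the reversed sequence, then flip back so index t lines up.
    backward = run_recurrent(inputs[::-1], cell, initial_state)[::-1]
    return [f + b for f, b in zip(forward, backward)]  # list concatenation


def running_sum_cell(state, v):
    """Toy stand-in for an LSTM unit: accumulate the sum of the input vector."""
    return [state[0] + sum(v)]


v = [[1.0], [2.0], [3.0]]
h1 = bidirectional_layer(v, running_sum_cell, [0.0])   # first layer reads v
h2 = bidirectional_layer(h1, running_sum_cell, [0.0])  # second layer reads h1
```

In `h1`, the forward half at position t summarizes tokens up to t and the backward half summarizes tokens from t onward. In `h2`, both halves already depend on the whole sequence, because every `h1` vector does.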

CVT for Sequence Tagging
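In sequence tagging, each token $x_i^t$ has a corresponding label $y_i^t$. The primary prediction module for sequence tagging produces a probability distribution over classes for the $t$-th label using a one-hidden-layer neural network applied to the corresponding encoder outputs:
$$p(y^t|x_i) = \text{NN}(h_1^t \oplus h_2^t) = \text{softmax}(U \cdot \text{ReLU}(W(h_1^t \oplus h_2^t)) + b)$$
The auxiliary prediction modules take $\overrightarrow{h}_1^t(x_i)$ and $\overleftarrow{h}_1^t(x_i)$, the outputs of the forward and backward LSTMs in the first Bi-LSTM layer (see footnote 2), as inputs. We add the following four auxiliary prediction modules to the model (see Figure 2):
$$p_\theta^{\text{fwd}}(y^t|x_i) = \text{NN}^{\text{fwd}}(\overrightarrow{h}_1^t(x_i))$$
$$p_\theta^{\text{bwd}}(y^t|x_i) = \text{NN}^{\text{bwd}}(\overleftarrow{h}_1^t(x_i))$$
$$p_\theta^{\text{future}}(y^t|x_i) = \text{NN}^{\text{future}}(\overrightarrow{h}_1^{t-1}(x_i))$$
$$p_\theta^{\text{past}}(y^t|x_i) = \text{NN}^{\text{past}}(\overleftarrow{h}_1^{t+1}(x_i))$$
The "forward" module makes each prediction without seeing the right context of the current token. The "future" module makes each prediction without the right context or the current token itself. Therefore it works like a neural language model that, instead of predicting which token comes next in the sequence, predicts which class of token comes next. The "backward" and "past" modules are analogous.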
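The restricted views of the four tagging modules can be illustrated by listing which tokens each module is able to see when predicting the label at position t. This is a sketch of the visibility pattern only, not of the neural modules themselves; the function name `visible_tokens`, the 0-based indexing, and the example sentence are assumptions made for illustration.

```python
def visible_tokens(module, t, length):
    """Return the 0-indexed token positions a module can see when predicting label t."""
    if module == "primary":   # full Bi-LSTM output: the whole sentence
        return list(range(length))
    if module == "forward":   # forward LSTM state at t: tokens 0..t
        return list(range(0, t + 1))
    if module == "future":    # forward LSTM state at t-1: tokens 0..t-1
        return list(range(0, t))
    if module == "backward":  # backward LSTM state at t: tokens t..length-1
        return list(range(t, length))
    if module == "past":      # backward LSTM state at t+1: tokens t+1..length-1
        return list(range(t + 1, length))
    raise ValueError(f"unknown module: {module}")


# Hypothetical sentence; we predict the label of "Washington".
sentence = ["They", "traveled", "to", "Washington", "by", "train"]
t = 3
views = {
    module: [sentence[i] for i in visible_tokens(module, t, len(sentence))]
    for module in ["primary", "forward", "future", "backward", "past"]
}
```

Here the "future" module sees only "They traveled to", so it has to predict the label from left context alone, while the "past" module sees only "by train".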

CVT for Dependency Parsing
In a dependency parse, words in a sentence are treated as nodes in a graph. Typed directed edges connect the words, forming a tree structure describing the syntactic structure of the sentence. In particular, each word $x_i^t$ in a sentence $x_i = x_i^1, ..., x_i^T$ receives exactly one in-going edge $(u, t, r)$ going from word $x_i^u$ (called the "head") to it (the "dependent") of type $r$ (the "relation"). We use a graph-based dependency parser similar to the one from Dozat and Manning (2017). This treats dependency parsing as a classification task where the goal is to predict which in-going edge $y_i^t = (u, t, r)$ connects to each word $x_i^t$.
First, the representations produced by the encoder for the candidate head and dependent are passed through separate hidden layers. A bilinear classifier applied to these representations produces a score for each candidate edge. Lastly, these scores are passed through a softmax layer to produce probabilities. Mathematically, the probability of an edge is given as:
$$p_\theta((u, t, r)|x_i) \propto e^{s(h_1^u(x_i) \oplus h_2^u(x_i),\ h_1^t(x_i) \oplus h_2^t(x_i),\ r)}$$
where $s$ is the scoring function:
$$s(z_1, z_2, r) = \text{ReLU}(W_{\text{head}} z_1 + b_{\text{head}}) (W_r + W) \text{ReLU}(W_{\text{dep}} z_2 + b_{\text{dep}})$$
The bilinear classifier uses a weight matrix $W_r$ specific to the candidate relation as well as a weight matrix $W$ shared across all relations. Note that unlike in most prior work, our dependency parser only takes words as inputs, not words and part-of-speech tags.
We add four auxiliary prediction modules to our model for cross-view training:
$$p_\theta^{\text{fwd-fwd}}((u, t, r)|x_i) \propto e^{s^{\text{fwd-fwd}}(\overrightarrow{h}_1^u(x_i),\ \overrightarrow{h}_1^t(x_i),\ r)}$$
$$p_\theta^{\text{fwd-bwd}}((u, t, r)|x_i) \propto e^{s^{\text{fwd-bwd}}(\overrightarrow{h}_1^u(x_i),\ \overleftarrow{h}_1^t(x_i),\ r)}$$
$$p_\theta^{\text{bwd-fwd}}((u, t, r)|x_i) \propto e^{s^{\text{bwd-fwd}}(\overleftarrow{h}_1^u(x_i),\ \overrightarrow{h}_1^t(x_i),\ r)}$$
$$p_\theta^{\text{bwd-bwd}}((u, t, r)|x_i) \propto e^{s^{\text{bwd-bwd}}(\overleftarrow{h}_1^u(x_i),\ \overleftarrow{h}_1^t(x_i),\ r)}$$
Each one has some missing context (not seeing either the preceding or following words) for the candidate head and candidate dependent.
Footnote 2: Modules taking inputs from the second Bi-LSTM layer would not have restricted views because information about the whole sentence gets propagated through the first layer.
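The edge scorer described above (separate hidden layers for head and dependent, a bilinear form with relation-specific plus shared weights, then a softmax over candidate edges) can be sketched in plain Python. The exact functional form used here, ReLU hidden layers combined as a^T (W_r + W) c, is an assumption of this sketch, as are the parameter names, the relation labels, the toy identity weights, and the omission of ROOT handling.

```python
import math


def relu_layer(W, b, z):
    """One hidden layer: ReLU(W z + b), with W given as a list of rows."""
    return [max(0.0, sum(w * x for w, x in zip(row, z)) + bi)
            for row, bi in zip(W, b)]


def bilinear(a, M, c):
    """Compute a^T M c."""
    return sum(a[i] * M[i][j] * c[j]
               for i in range(len(a)) for j in range(len(c)))


def edge_score(z_head, z_dep, p, relation):
    """Score a candidate edge with relation-specific plus shared bilinear weights."""
    a = relu_layer(p["W_head"], p["b_head"], z_head)
    c = relu_layer(p["W_dep"], p["b_dep"], z_dep)
    W_r, W = p["W_rel"][relation], p["W_shared"]
    M = [[W_r[i][j] + W[i][j] for j in range(len(W[0]))] for i in range(len(W))]
    return bilinear(a, M, c)


def edge_probabilities(reps, t, p):
    """Softmax over every candidate (head u, relation r) for dependent t."""
    scores = {(u, r): edge_score(reps[u], reps[t], p, r)
              for u in range(len(reps)) if u != t
              for r in p["W_rel"]}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}


# Toy 2-dimensional parameters and encoder outputs (all hypothetical).
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {
    "W_head": identity, "b_head": [0.0, 0.0],
    "W_dep": identity, "b_dep": [0.0, 0.0],
    "W_rel": {"nsubj": identity, "obj": zeros},
    "W_shared": identity,
}
reps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = edge_probabilities(reps, t=2, p=params)
```

An auxiliary module would reuse the same scoring code but feed it restricted inputs, for example only the forward-LSTM states of the head and dependent, with its own parameters.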

CVT for Sequence-to-Sequence Learning
We use an encoder-decoder sequence-to-sequence model with attention (Sutskever et al., 2014; Bahdanau et al., 2015). Each example consists of an input (source) sequence $x_i = x_i^1, ..., x_i^T$ and output (target) sequence $y_i = y_i^1, ..., y_i^K$. The encoder's representations are passed into an LSTM decoder using a bilinear attention mechanism. In particular, at each time step $t$ the decoder computes an attention distribution over source sequence hidden states as $\alpha_j \propto e^{h^j W_\alpha \bar{h}^t}$, where $\bar{h}^t$ is the decoder's current hidden state. The source hidden states weighted by the attention distribution form a context vector: $c_t = \sum_j \alpha_j h^j$. Next, the context vector and current hidden state are combined into an attention vector $a_t = \tanh(W_a [c_t, \bar{h}^t])$. Lastly, a softmax layer predicts the next token in the output sequence: $p(y_i^t | y_i^{<t}, x_i) = \text{softmax}(W_s a_t)$.
We add two auxiliary decoders when applying CVT. The auxiliary decoders share embedding and LSTM parameters with the primary decoder, but have different parameters for the attention mechanisms and softmax layers. For the first one, we restrict its view of the input by applying attention dropout, randomly zeroing out a fraction of its attention weights. The second one is trained to predict the next word in the target sequence rather than the current one: $p_\theta^{\text{future}}(y_i^t | y_i^{<t}, x_i) = \text{softmax}(W_s^{\text{future}} a_{t-1}^{\text{future}})$. Since there is no target sequence for unlabeled examples, we cannot apply teacher forcing to get an output distribution over the vocabulary from the primary decoder at each time step. Instead, we produce hard targets for the auxiliary modules by running the primary decoder with beam search on the input sequence. This idea has previously been applied to sequence-level knowledge distillation by Kim and Rush (2016).
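The bilinear attention step and the attention-dropout view described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and toy vectors are hypothetical, each weight is dropped independently with a fixed probability, and the surviving weights are not renormalized (the text does not say whether they are).

```python
import math
import random


def attention_weights(source_states, W_alpha, decoder_state):
    """Bilinear attention: alpha_j proportional to exp(h_j^T W_alpha h_bar_t)."""
    Wh = [sum(w * x for w, x in zip(row, decoder_state)) for row in W_alpha]
    scores = [sum(hj * v for hj, v in zip(h, Wh)) for h in source_states]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_dropout(alpha, rate, rng):
    """Zero out each attention weight independently with probability `rate`."""
    return [0.0 if rng.random() < rate else a for a in alpha]


def context_vector(alpha, source_states):
    """c_t = sum_j alpha_j h_j."""
    dim = len(source_states[0])
    return [sum(a * h[d] for a, h in zip(alpha, source_states))
            for d in range(dim)]


# Toy source hidden states and decoder state (hypothetical values).
source_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_alpha = [[1.0, 0.0], [0.0, 1.0]]
decoder_state = [0.5, 2.0]

alpha = attention_weights(source_states, W_alpha, decoder_state)
restricted = attention_dropout(alpha, rate=0.5, rng=random.Random(0))
c_primary = context_vector(alpha, source_states)       # full view
c_auxiliary = context_vector(restricted, source_states)  # restricted view
```

The primary decoder would consume `c_primary`, while the auxiliary decoder with attention dropout would build its attention vector from `c_auxiliary`, which ignores the source positions whose weights were zeroed.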

Experiments
We compare Cross-View Training against several strong baselines on seven tasks:
Combinatory Categorial Grammar (CCG) Supertagging: We use data from CCGBank (Hockenmaier and Steedman, 2007).
We use the 1 Billion Word Language Model Benchmark (Chelba et al., 2014) as a pool of unlabeled sentences for semi-supervised learning.

Model Details and Baselines
We apply dropout during training, but not when running the primary prediction module to produce soft targets on unlabeled examples. In addition to the auxiliary prediction modules listed in Section 3, we find it slightly improves results to add another one that sees the whole input rather than a subset (but unlike the primary prediction module, does have dropout applied to its representations). Unless indicated otherwise, our models have LSTMs with 1024-sized hidden states and 512-sized projection layers. See the supplementary material for full training details and hyperparameters. We compare CVT with the following other semi-supervised learning algorithms:
Word Dropout. In this method, we only train the primary prediction module. When acting as a teacher it is run as normal, but when acting as a student, we randomly replace some of the input words with a REMOVED token. This is similar to CVT in that it exposes the model to a restricted view of the input. However, it is less data efficient. CVT's carefully designed auxiliary prediction modules, by contrast, can be trained to match the primary one across many different views of the input at once, rather than just one view at a time.
Virtual Adversarial Training (VAT). VAT (Miyato et al., 2016) works like word dropout, but adds noise to the word embeddings of the student instead of dropping out words. Notably, the noise is chosen adversarially so it most changes the model's prediction. This method was applied successfully to semi-supervised text classification by Miyato et al. (2017).
ELMo. ELMo incorporates the representations from a large separately-trained language model into a task-specific model. Our implementation follows Peters et al. (2018). When combining ELMo with multi-task learning, we allow each task to learn its own weights for the ELMo embeddings going into each prediction module. We found applying dropout to the ELMo embeddings was crucial for achieving good performance.
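Of the baselines above, word dropout is the simplest to illustrate: the teacher sees the original sentence and the student sees a copy in which some words have been replaced by a placeholder. The sketch below assumes each word is replaced independently with a fixed probability; the function name, the dropout rate, and the example sentence are hypothetical.

```python
import random

REMOVED = "REMOVED"  # placeholder token named in the baseline description


def word_dropout(tokens, rate, rng):
    """Replace each token with the REMOVED placeholder independently with probability `rate`."""
    return [REMOVED if rng.random() < rate else tok for tok in tokens]


teacher_input = ["She", "traveled", "to", "Washington"]
student_input = word_dropout(teacher_input, rate=0.25, rng=random.Random(0))
```

Each call yields a single restricted view of the sentence, which is the sense in which this baseline learns from one view at a time.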

Results
Results are shown in Table 1. CVT on its own outperforms or is comparable to the best previously published results on all tasks. Figure 3 shows an example win for CVT over supervised learning.
Of the prior results listed in Table 1, only TagLM and ELMo are semi-supervised. These methods first train an enormous language model on unlabeled data and incorporate the representations produced by the language model into a supervised classifier. Our base models use 1024 hidden units in their LSTMs (compared to 4096 in ELMo), require fewer training steps (around one pass over the billion-word benchmark rather than many passes), and do not require a pipelined training procedure. Therefore, although they perform on par with ELMo, they are faster and simpler to train. Increasing the size of our CVT + Multi-Task model so it has 4096 units in its LSTMs like ELMo improves results further so they are significantly better than the ELMo + Multi-Task ones. We suspect there could be further gains from combining our method with language model pre-training, which we leave for future work.
Figure 3: An NER example that CVT classifies correctly but supervised learning does not. "Warner" only occurs as a last name in the train set, so the supervised model classifies "Warner Bros" as a person. The CVT model also mistakenly classifies "Warner Bros" as a person to start with, but as it sees more of the unlabeled data (in which "Warner" occurs thousands of times) it learns that "Warner Bros" is an organization.
CVT + Multi-Task. We train a single shared-encoder CVT model to perform all of the tasks except machine translation (as it is quite different and requires more training time than the other ones). Multi-task learning improves results on all of the tasks except fine-grained NER, sometimes by large margins. Prior work on many-task NLP such as Hashimoto et al. (2017) uses complicated architectures and training algorithms. Our result shows that simple parameter sharing can be enough for effective many-task learning when the model is big and trained on a large amount of data.
Interestingly, multi-task learning works better in conjunction with CVT than with ELMo. We hypothesize that the ELMo models quickly fit to the data primarily using the ELMo vectors, which perhaps hinders the model from learning effective representations that transfer across tasks. We also believe CVT alleviates the danger of the model "forgetting" one task while training on the other ones, a well-known problem in many-task learning (Kirkpatrick et al., 2017). During multi-task CVT, the model makes predictions about unlabeled examples across all tasks, creating (artificial) all-tasks-labeled examples, so the model does not only see one task at a time. In fact, multi-task learning plus self-training is similar to the Learning without Forgetting algorithm (Li and Hoiem, 2016), which trains the model to keep its predictions on an old task unchanged when learning a new task. To test the value of all-tasks-labeled examples, we trained a multi-task CVT model that only computes $\mathcal{L}_{\text{CVT}}$ on one task at a time (chosen randomly for each unlabeled minibatch) instead of for all tasks in parallel. The one-at-a-time model performs substantially worse (see Table 2).
Model Generalization. In order to evaluate how our models generalize to the dev set from the train set, we plot the dev vs. train accuracy for our different methods as they learn (see Figure 4).
Both CVT and multi-task learning improve model generalization: for the same train accuracy, the models get better dev accuracy than purely supervised learning. Interestingly, CVT continues to improve in dev set accuracy while close to 100% train accuracy for CCG, Chunking, and NER, perhaps because the model is still learning from unlabeled data even when it has completely fit to the train set. We also show results for a smaller multi-task + CVT model. Although it generalizes at least as well as the larger one, it halts making progress on the train set earlier. This suggests it is important to use sufficiently large neural networks for multi-task learning: otherwise the model does not have the capacity to fit to all the training data.
Auxiliary Prediction Module Ablation. We briefly explore which auxiliary prediction modules are more important for the sequence tagging tasks in Table 3. We find that both kinds of auxiliary prediction modules improve performance, but that the future and past modules improve results more than the forward and backward ones, perhaps because they see a more restricted and challenging view of the input.
[...] access to. Unsurprisingly, the improvement of CVT over purely supervised learning grows larger as the amount of labeled data decreases (see Figure 5, left). Using only 25% of the labeled data, our approach already performs as well as or better than a fully supervised model using 100% of the training data, demonstrating that CVT is particularly useful on small datasets.
Training Larger Models. Most sequence taggers and dependency parsers in prior work use small LSTMs (hidden state sizes of around 300) because larger models yield little to no gains in performance (Reimers and Gurevych, 2017). We found our own supervised approaches also do not benefit greatly from increasing the model size. In contrast, when using CVT, accuracy scales better with model size (see Figure 5, right). This finding suggests that appropriate semi-supervised learning methods may enable the development of larger, more sophisticated models for NLP tasks with limited amounts of labeled data.
Generalizable Representations. Lastly, we explore training the CVT + multi-task model on five tasks, freezing the encoder, and then only training a prediction module on the sixth task. This tests whether the encoder's representations generalize to a new task not seen during its training. Only training the prediction module is very fast because (1) the encoder (which is by far the slowest part of the model) has to be run over each example only once and (2) [...]
Table 4: Comparison of single-task models on the dev sets. "CVT-MT frozen" means we pretrain a CVT + multi-task model on five tasks, and then train only the prediction module for the sixth. "ELMo frozen" means we train prediction modules (but no LSTMs) on top of ELMo embeddings.
[...] outperforming ELMo embeddings and sometimes even a vanilla supervised model, showing the multi-task model is building up effective representations for language. In particular, the representations could be used like skip-thought vectors (Kiros et al., 2015) to quickly train models on new tasks without slow representation learning.

Related Work
Self-Training. One of the earliest approaches to semi-supervised learning is self-training (Scudder, 1965), which has been successfully applied to NLP tasks such as word-sense disambiguation (Yarowsky, 1995) and parsing (McClosky et al., 2006). In each round of training, the classifier, acting as a "teacher," labels some of the unlabeled data and adds it to the training set. Then, acting as a "student," it is retrained on the new training set. Many recent approaches (including the consistency regularization methods discussed below and our own method) train the student with soft targets from the teacher's output distribution rather than a hard label, making the procedure more akin to knowledge distillation (Hinton et al., 2015). It is also possible to use multiple models or prediction modules for the teacher, such as in tri-training (Zhou and Li, 2005; Ruder and Plank, 2018).
Consistency Regularization. Recent works add noise (e.g., drawn from a Gaussian distribution) or apply stochastic transformations (e.g., horizontally flipping an image) to the student's inputs. This trains the model to give consistent predictions to nearby data points, encouraging distributional smoothness in the model. Consistency regularization has been very successful for computer vision applications (Bachman et al., 2014;Laine and Aila, 2017;Tarvainen and Valpola, 2017). However, stochastic input alterations are more difficult to apply to discrete data like text, making consistency regularization less used for natural language processing. One solution is to add noise to the model's word embeddings (Miyato et al., 2017); we compare against this approach in our experiments. CVT is easily applicable to text because it does not require changing the student's inputs.
Multi-View Learning. Multi-view learning on data where features can be separated into distinct subsets has been well studied (Xu et al., 2013). Particularly relevant are co-training (Blum and Mitchell, 1998) and co-regularization (Sindhwani and Belkin, 2005), which train two models with disjoint views of the input. On unlabeled data, each one acts as a "teacher" for the other model. In contrast to these methods, our approach trains a single unified model where auxiliary prediction modules see different, but not necessarily independent views of the input.
Self Supervision. Self-supervised learning methods train auxiliary prediction modules on tasks where performance can be measured without human-provided labels. Recent work has jointly trained image classifiers with tasks like relative position and colorization (Doersch and Zisserman, 2017), sequence taggers with language modeling (Rei, 2017), and reinforcement learning agents with predicting changes in the environment (Jaderberg et al., 2017). Unlike these approaches, our auxiliary losses are based on self-labeling, not labels deterministically constructed from the input.
Multi-Task Learning. There has been extensive prior work on multi-task learning (Caruana, 1997; Ruder, 2017). For NLP, most work has focused on a small number of closely related tasks (Luong et al., 2016; Zhang and Weiss, 2016; Søgaard and Goldberg, 2016; Peng et al., 2017). Many-task systems are less commonly developed. Collobert and Weston (2008)

Conclusion
We propose Cross-View Training, a new method for semi-supervised learning. Our approach allows models to effectively leverage their own predictions on unlabeled data, training them to produce effective representations that yield accurate predictions even when some of the input is not available. We achieve excellent results across seven NLP tasks, especially when CVT is combined with multi-task learning.