Variational Sequential Labelers for Semi-Supervised Learning

We introduce a family of multitask variational methods for semi-supervised sequence labeling. Our model family consists of a latent-variable generative model and a discriminative labeler. The generative models use latent variables to define the conditional probability of a word given its context, drawing inspiration from word prediction objectives commonly used in learning word embeddings. The labeler helps inject discriminative information into the latent space. We explore several latent variable configurations, including ones with hierarchical structure, which enables the model to account for both label-specific and word-specific information. Our models consistently outperform standard sequential baselines on 8 sequence labeling datasets, and improve further with unlabeled data.


Introduction
Sequence labeling tasks in natural language processing (NLP) often have limited annotated data available for model training. In such cases regularization can be important, and it can be helpful to use additional unlabeled data. One approach for both regularization and semi-supervised training is to design latent-variable generative models and then develop neural variational methods for learning and inference (Kingma and Welling, 2014; Rezende and Mohamed, 2015).
Neural variational methods have been quite successful for both generative modeling and representation learning, and have recently been applied to a variety of NLP tasks (Mnih and Gregor, 2014; Bowman et al., 2016; Miao et al., 2016; Serban et al., 2017; Zhou and Neubig, 2017; Hu et al., 2017). They are also very popular for semi-supervised training; when used in such scenarios, they typically have an additional task-specific prediction loss (Kingma et al., 2014; Maaløe et al., 2016; Zhou and Neubig, 2017; Yang et al., 2017b). However, it is still unclear how to use such methods in the context of sequence labeling.
In this paper, we apply neural variational methods to sequence labeling by combining a latent-variable generative model and a discriminatively-trained labeler. We refer to this family of procedures as variational sequential labelers (VSLs). Learning maximizes the conditional probability of each word given its context and minimizes the classification loss given the latent space. We explore several models within this family that use different kinds of conditional independence structure among the latent variables within each time step. Intuitively, the multiple latent variables can disentangle information pertaining to label-oriented and word-specific properties.
We study VSLs in the context of named entity recognition (NER) and several part-of-speech (POS) tagging tasks, both on English Twitter data and on data from six additional languages. Without unlabeled data, our models consistently show 0.5-0.8% accuracy improvements across tagging datasets and a 0.8 $F_1$ improvement for NER. Adding unlabeled data further improves the model performance by 0.1-0.3% accuracy or 0.2 $F_1$ score. We obtain the best results with a hierarchical structure using two latent variables at each time step.
Our models, like generative latent variable models in general, have the ability to naturally combine labeled and unlabeled data. We obtain small but consistent performance improvements by adding unlabeled data. In the absence of unlabeled data, the variational loss acts as a regularizer on the learned representation of the supervised sequence prediction model. Our results demonstrate that this regularization improves performance even when only labeled data is used. We also compare different ways of applying the classification loss when using a latent variable hierarchy, and find that the most effective structure also provides the cleanest separation of information in the latent space.

Related Work
There is a growing amount of work applying neural variational methods to NLP tasks, including document modeling (Mnih and Gregor, 2014; Miao et al., 2016; Serban et al., 2017), machine translation (Zhang et al., 2016), text generation (Bowman et al., 2016; Serban et al., 2017; Hu et al., 2017), language modeling (Bowman et al., 2016; Yang et al., 2017b), and sequence transduction (Zhou and Neubig, 2017), but we are not aware of any such work for sequence labeling. Before the advent of neural variational methods, there were several efforts in latent variable modeling for sequence labeling (Quattoni et al., 2007; Sun and Tsujii, 2009).
Our work involves multi-task losses and is therefore also related to the rich literature on multi-task learning for sequence labeling (Plank et al., 2016; Augenstein and Søgaard, 2017; Bingel and Søgaard, 2017; Rei, 2017, inter alia).
Another related thread of work is learning interpretable latent representations. Zhou and Neubig (2017) factorize an inflected word into lemma and morphology labels, using continuous and categorical latent variables. Hu et al. (2017) interpret a sentence as a combination of an unstructured latent code and a structured latent code, which can represent attributes of the sentence.
There have been several efforts in combining variational autoencoders and recurrent networks (Gregor et al., 2015; Chung et al., 2015; Fraccaro et al., 2016). While the details vary, these models typically contain latent variables at each time step in a sequence. This prior work mainly focused on ways of parameterizing the time dependence between the latent variables, which gives them more power in modeling distributions over observation sequences. In this paper, we similarly use latent variables at each time step, but we adopt stronger independence assumptions, which lead to simpler models and inference procedures. Also, the models cited above were developed for modeling data distributions, rather than for supervised or semi-supervised learning, which is our focus here.
The key novelties in our work compared to the prior work mentioned above are the proposed variational sequential labelers and the investigation of latent variable hierarchies within these models. The empirical effectiveness of latent hierarchical structure in variational modeling is a key contribution of this paper and may be applicable to the other applications discussed above. Recent work, contemporaneous with this submission, similarly showed the advantages of combining hierarchical latent variables and variational learning for conversational modeling, in the context of a non-sequential model (Park et al., 2018).

Proposed Methods
We begin by describing variational autoencoders and the notation we will use in the following sections. We denote the input word sequence by $x_{1:T}$, the corresponding label sequence by $l_{1:T}$, the input words other than the word at position $t$ by $x_{-t}$, the generative model by $p_\theta(\cdot)$, and the posterior inference model by $q_\phi(\cdot)$.

Background: Variational Autoencoders
We review variational autoencoders (VAEs) by describing a VAE for an input sequence $x_{1:T}$. When using a VAE, we assume a generative model that generates an input using a latent variable $z$, typically assumed to follow a multivariate Gaussian distribution. We seek to maximize the marginal likelihood of inputs $x_{1:T}$ when marginalizing out the latent variable $z$. Since this is typically intractable, especially when using continuous latent variables, we instead maximize a lower bound on the marginal log-likelihood (Kingma and Welling, 2014):
$$\log p_\theta(x_{1:T}) \ge \underbrace{\mathbb{E}_{z \sim q_\phi(z \mid x_{1:T})}\left[\log p_\theta(x_{1:T} \mid z)\right]}_{\text{Reconstruction Loss}} - \underbrace{\mathrm{KL}\left(q_\phi(z \mid x_{1:T}) \,\|\, p_\theta(z)\right)}_{\text{KL divergence}} \tag{1}$$
where we have introduced the variational posterior $q$ parametrized by new parameters $\phi$. $q$ is referred to as an "inference model" as it encodes an input into the latent space. We also have the generative model probabilities $p$ parametrized by $\theta$. The parameters are trained in a way that reflects a classical autoencoder framework: encode the input into a latent space, decode the latent space to reconstruct the input. These models are therefore referred to as "variational autoencoders".
The lower bound consists of two terms: reconstruction loss and KL divergence. The KL divergence term provides a regularizing effect during learning by ensuring that the learned posterior remains close to the prior over the latent variables.
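To make the two terms concrete, the following is a minimal sketch of the KL divergence between two diagonal Gaussians (the closed form commonly used with Gaussian latent variables) and of a single-sample estimate of the bound. The function names and the choice to represent each Gaussian by lists of means and log-variances are assumptions made for illustration; they are not taken from the paper or its released code.

```python
import math


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for two diagonal Gaussians.

    Each Gaussian is given as a list of means and a list of log-variances
    (an illustrative representation, not the paper's).
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def lower_bound(log_likelihood, kl):
    """Single-sample estimate of the bound: reconstruction term minus KL term."""
    return log_likelihood - kl
```

When the posterior equals the prior, the KL term is zero and the bound reduces to the reconstruction term alone.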

Variational Sequential Labelers
We now introduce variational sequential labelers (VSLs) and propose several variants for sequence labeling tasks. Although the latent structure varies, a VSL maximizes the conditional probability $p_\theta(x_t \mid x_{-t})$ and minimizes a classification loss using the latent variables as the input to the classifier. Unlike VAEs, VSLs do not autoencode the input, so they are more similar to recent conditional variational formulations (Sohn et al., 2015; Miao et al., 2016; Zhou and Neubig, 2017). Intuitively, the VSL variational objective is to find the information that is useful for predicting the word $x_t$ from its surrounding context, which has similarities to objectives for learning word embeddings (Collobert et al., 2011; Mikolov et al., 2013). This objective serves as regularization for the labeled data and as an unsupervised objective for the unlabeled data.
All of our models use latent variables for each position in the sequence. These characteristics are shown in the visual depictions of our models in Figure 1. We consider variants with multiple latent variables per time step and attach the classifier to only particular variables. This causes the different latent variables to capture different characteristics.
In the following sections, we will describe various latent variable configurations that we will evaluate empirically in subsequent sections.

Single Latent Variable
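We begin by defining a basic VSL and corresponding parametrization, which will also be used in other variants. This first model (which we call VSL-G and show in Figure 1a) has a Gaussian latent variable at each time step. VSL-G uses two training objectives; the first is similar to the lower bound on log-likelihood used by VAEs:
$$\log p_\theta(x_t \mid x_{-t}) \ge \mathbb{E}_{z_t \sim q_\phi(z_t \mid x_{1:T}, t)}\left[\log p_\theta(x_t \mid z_t)\right] - \mathrm{KL}\left(q_\phi(z_t \mid x_{1:T}, t) \,\|\, p_\theta(z_t \mid x_{-t})\right) = U_0(x_{1:T}, t) \tag{2}$$
VSL-G additionally uses a classifier $f$ on the latent variable $z_t$ which is trained with the following objective:
$$C_0(x_{1:T}, l_t) = \mathbb{E}_{z_t \sim q_\phi(z_t \mid x_{1:T}, t)}\left[-\log f(l_t \mid z_t)\right] \tag{3}$$
The final loss is
$$L(x_{1:T}, l_{1:T}) = \sum_{t=1}^{T}\left[C_0(x_{1:T}, l_t) - \alpha\, U_0(x_{1:T}, t)\right]$$
where $\alpha$ is a trade-off hyperparameter. $\alpha$ is set to one during supervised training but it is tuned based on development set performance during semi-supervised training. The same procedure is adopted for the other VSL models below.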
For the generative model, we parametrize $p_\theta(x_t \mid z_t)$ as a feedforward neural network with two hidden layers and ReLU (Nair and Hinton, 2010) as activation function. As reconstruction loss, we use cross-entropy over the words in the vocabulary. We defer the description of the parametrization of $p_\theta(z_t \mid x_{-t})$ to Section 3.6.
We now discuss how we parametrize the inference model $q_\phi(z_t \mid x_{1:T}, t)$. We use a bidirectional gated recurrent unit (BiGRU; Chung et al., 2014) network to produce a hidden vector $h_t$ at position $t$. The BiGRU is run over the input $x_{1:T}$, where each $x_t$ is the concatenation of a word embedding and the concatenated final hidden states from a character-level BiGRU. The inference model $q_\phi(z_t \mid x_{1:T}, t)$ is then a single layer feedforward neural network that uses $h_t$ as input. When parametrizing the posterior over latent variables in the models below, we use this same procedure to produce hidden vectors with a BiGRU and then use them as input to feedforward networks. The structure of our inference model is similar to those used in previous state-of-the-art models for sequence labeling (Lample et al., 2016; Yang et al., 2017a).
In order to focus more on the effect of our variational objective, the classifier we use is always the same as our baseline model (see Section 4.3), which is a one layer feedforward neural network without a hidden layer, and it is also used in test-time prediction.
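The way the two VSL-G objectives combine for one token can be sketched as below, using a single sample of the latent variable. This is an illustrative sketch only: the function names, the use of unnormalized score lists for the word and tag distributions, and the treatment of an unlabeled token (classification term dropped, bound term kept) are assumptions, not the released implementation.

```python
import math


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for diagonal Gaussians given as means and log-variances."""
    return sum(
        0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
        for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p)
    )


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def vsl_g_token_loss(word_scores, word_id, tag_scores, tag_id,
                     mu_q, logvar_q, mu_p, logvar_p, alpha, labeled=True):
    """Per-token loss: classification loss minus alpha times the lower bound."""
    rec = log_softmax(word_scores)[word_id]           # log p(x_t | z_t)
    kl = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)  # KL(posterior || prior)
    bound = rec - kl                                  # single-sample bound
    # Assumption: unlabeled tokens contribute no classification term.
    cls = -log_softmax(tag_scores)[tag_id] if labeled else 0.0
    return cls - alpha * bound
```

Minimizing this quantity lowers the classification loss while raising the bound on the conditional word probability, with `alpha` controlling the balance between the two.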

Flat Latent Variables
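We next consider ways of factorizing the functionality of the latent variable into label-specific and other word-specific information. We introduce VSL-GG-Flat (shown in Figure 1b), which has two conditionally independent Gaussian latent variables at each time step, denoted $z_t$ and $y_t$ for time step $t$. The variational lower bound is:
$$\begin{aligned}\log p_\theta(x_t \mid x_{-t}) \ge{} & \mathbb{E}_{z_t \sim q_\phi(z_t \mid x_{1:T}, t),\; y_t \sim q_\phi(y_t \mid x_{1:T}, t)}\left[\log p_\theta(x_t \mid z_t, y_t)\right] \\ & - \mathrm{KL}\left(q_\phi(z_t \mid x_{1:T}, t) \,\|\, p_\theta(z_t \mid x_{-t})\right) - \mathrm{KL}\left(q_\phi(y_t \mid x_{1:T}, t) \,\|\, p_\theta(y_t \mid x_{-t})\right) = U_1(x_{1:T}, t)\end{aligned} \tag{4}$$
The classifier $f$ is on the latent variable $y_t$ and its loss is
$$C_1(x_{1:T}, l_t) = \mathbb{E}_{y_t \sim q_\phi(y_t \mid x_{1:T}, t)}\left[-\log f(l_t \mid y_t)\right] \tag{5}$$
The final loss for the model is
$$L(x_{1:T}, l_{1:T}) = \sum_{t=1}^{T}\left[C_1(x_{1:T}, l_t) - \alpha\, U_1(x_{1:T}, t)\right] \tag{6}$$
where $\alpha$ is a trade-off hyperparameter.
Similarly to the VSL-G model, $q_\phi(z_t \mid x_{1:T}, t)$ and $q_\phi(y_t \mid x_{1:T}, t)$ are parametrized by single layer feedforward neural networks using the hidden state $h_t$ as input.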

Hierarchical Latent Variables
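We also explore hierarchical relationships among the latent variables. In particular, we introduce the VSL-GG-Hier model which has two Gaussian latent variables with the hierarchical structure shown in Figure 1c. This model encodes the intuition that the word-specific latent information $z_t$ may differ depending on the class-specific information of $y_t$.
For this model, the derivations are similar to Equations (4) and (5). The first is:
$$\begin{aligned}\log p_\theta(x_t \mid x_{-t}) \ge{} & \mathbb{E}_{z_t, y_t \sim q_\phi(z_t, y_t \mid x_{1:T}, t)}\left[\log p_\theta(x_t \mid z_t)\right] - \mathrm{KL}\left(q_\phi(y_t \mid x_{1:T}, t) \,\|\, p_\theta(y_t \mid x_{-t})\right) \\ & - \mathbb{E}_{y_t \sim q_\phi(y_t \mid x_{1:T}, t)}\left[\mathrm{KL}\left(q_\phi(z_t \mid y_t, x_{1:T}, t) \,\|\, p_\theta(z_t \mid y_t, x_{-t})\right)\right] = U_2(x_{1:T}, t)\end{aligned} \tag{7}$$
The classifier $f$ uses $y_t$ as input and is trained with the following loss:
$$C_2(x_{1:T}, l_t) = \mathbb{E}_{y_t \sim q_\phi(y_t \mid x_{1:T}, t)}\left[-\log f(l_t \mid y_t)\right] \tag{8}$$
Note that $C_1$ and $C_2$ have the same form. The final loss is
$$L(x_{1:T}, l_{1:T}) = \sum_{t=1}^{T}\left[C_2(x_{1:T}, l_t) - \alpha\, U_2(x_{1:T}, t)\right] \tag{9}$$
where $\alpha$ is a trade-off hyperparameter.
The hierarchical posterior $q_\phi(z_t \mid y_t, x_{1:T}, t)$ is parametrized by concatenating the hidden vector $h_t$ and the random variable $y_t$ and then using them as input to a single layer feedforward network.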
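The hierarchical posterior described above can be sketched as a two-step sampling procedure: first draw $y_t$ from a Gaussian computed from $h_t$, then draw $z_t$ from a Gaussian computed from the concatenation of $h_t$ and the sampled $y_t$. The sketch below is pure Python for readability; the parameter dictionary, its key names, and the choice to have each layer output a mean and a log-variance are assumptions for illustration.

```python
import math
import random


def linear(weights, bias, inputs):
    """Single-layer feedforward map: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def sample_gaussian(mean, log_var, rng):
    """Draw one reparametrized sample from a diagonal Gaussian."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def sample_hier_posterior(h_t, params, rng):
    """Sample y_t from q(y_t | h_t), then z_t from q(z_t | [h_t; y_t])."""
    y_mean = linear(params["y_mean_w"], params["y_mean_b"], h_t)
    y_log_var = linear(params["y_lv_w"], params["y_lv_b"], h_t)
    y_t = sample_gaussian(y_mean, y_log_var, rng)

    hy = list(h_t) + y_t  # concatenation of the hidden vector and sampled y_t
    z_mean = linear(params["z_mean_w"], params["z_mean_b"], hy)
    z_log_var = linear(params["z_lv_w"], params["z_lv_b"], hy)
    z_t = sample_gaussian(z_mean, z_log_var, rng)
    return y_t, z_t
```

Because `z_t` is computed from the sampled `y_t`, the word-specific variable can vary with the class-specific one, which is the dependence the hierarchy is meant to express.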

Parametrization of Priors
Traditional variational models assume extremely simple priors (e.g., multivariate standard Gaussian distributions). Recently there have been efforts to learn the prior and posterior jointly during training (Fraccaro et al., 2016; Serban et al., 2017; Tomczak and Welling, 2018). In this paper, we follow this same idea but we do not explicitly parametrize the prior $p_\theta(z_t \mid x_{-t})$. This is partially due to the lack of computationally-efficient parametrization options for $p_\theta(z_t \mid x_{-t})$. In addition, since we are not seeking to do generation with our learned models, we can let part of the generative model be parametrized implicitly.
More specifically, the approach we use is to learn the priors by updating them iteratively. During training, we first initialize the priors of all examples as multivariate standard Gaussian distributions. As training proceeds, we use the last optimized posterior as our current prior based on a particular "update frequency" (see supplementary material for more details).
Our learned priors are implicitly modeled as
$$p_\theta^{k}(z_t \mid x_{-t}) = \sum_{x} q_\phi^{k-1}(z_t \mid X_t = x, x_{-t}, t)\; p_{\text{data}}(X_t = x \mid x_{-t}) \tag{10}$$
where $p_{\text{data}}$ is the empirical data distribution, $X_t$ is a random variable corresponding to the observation at position $t$, and $k$ is the prior update time step. The intuition here is that the prior is obtained by marginalizing over values for the missing observation represented by the random variable $X_t$.
The posterior $q_\phi^{k-1}$ is as defined in our latent variable models. We assume $p_{\text{data}}(X_t = x \mid x_{-t}) = 0$ for $x_{1:T} \notin$ training set. For context $x_{-t}$ that can pair with multiple values of $X_t$, its prior is the data-dependent weighted average posterior. For simplicity of implementation and efficient computation, however, if context $x_{-t}$ can pair with multiple values in our training data, we ignore this fact and simply use instance-dependent posteriors. Another way to view this is as conditioning on the index of the training examples while parametrizing the above. That is,
$$p_\theta^{k}(z_t \mid i) = q_\phi^{k-1}(z_t \mid x_{1:T}^{(i)}, t)$$
where $i$ is the index of the instance.
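One way to picture the instance-dependent prior update is as a lookup table keyed by instance index and position, initialized to a standard Gaussian and periodically overwritten with the most recent posterior. The class below is a hypothetical sketch of that bookkeeping; the class name, method names, and the exact update rule (refresh every `update_freq` calls to `step`) are assumptions, since the paper defers those details to its supplementary material.

```python
class InstancePriors:
    """Keeps one diagonal-Gaussian prior per (instance index, position)."""

    def __init__(self, dim, update_freq):
        self.dim = dim
        self.update_freq = update_freq  # assumed: number of steps between refreshes
        self.priors = {}    # (i, t) -> (mean, log-variance) currently used as prior
        self.pending = {}   # most recent posteriors, not yet promoted to priors
        self.steps = 0

    def get(self, i, t):
        """Return the current prior; standard Gaussian until the first update."""
        default = ([0.0] * self.dim, [0.0] * self.dim)  # zero mean, unit variance
        return self.priors.get((i, t), default)

    def record_posterior(self, i, t, mean, log_var):
        """Remember the latest posterior computed for instance i at position t."""
        self.pending[(i, t)] = (list(mean), list(log_var))

    def step(self):
        """Advance one training step; promote posteriors to priors when due."""
        self.steps += 1
        if self.steps % self.update_freq == 0:
            self.priors.update(self.pending)
            self.pending.clear()
```

In a training loop, `get` would supply the prior for the KL term, `record_posterior` would be called after each forward pass, and `step` once per update.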

Training
In this subsection, we introduce techniques we have used to address difficulties during training.
Reparametrization Trick. It is challenging to use gradient descent for a random variable as it involves a non-differentiable sampling procedure. Kingma and Welling (2014) introduced a reparametrization trick to tackle this problem. They parametrize a Gaussian random variable $z$ as $u_\varphi(x) + g_\psi(x) \circ \epsilon$ where $\epsilon \sim N(0, I)$ and $u_\varphi(x)$, $g_\psi(x)$ are deterministic and differentiable functions, so the gradient can go through $u_\varphi(\cdot)$ and $g_\psi(\cdot)$. In our experiments, we use one sample for each time step during training. For evaluation at test time, we use the mean value $u_\varphi(x)$.
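A minimal sketch of this sampling rule follows, with one noise draw per dimension during training and the mean returned at test time. The function name and the representation of the mean and scale outputs as plain lists are illustrative assumptions; in a real model both would be outputs of differentiable networks.

```python
import random


def reparameterize(mean, std, rng=None, training=True):
    """Return mean + std * eps with eps ~ N(0, 1) per dimension.

    At test time (training=False) the mean is returned unchanged.
    """
    if not training:
        return list(mean)
    rng = rng or random
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
```

The randomness enters only through `eps`, so the output is a deterministic function of `mean` and `std` once the noise is drawn, which is what allows gradients to flow through both.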
KL Divergence Weight Annealing. Although the use of prior updating lets us avoid tuning the weight of the KL divergence, the simple priors can still hinder learning during the initial stages of training. To address this, we follow the method described by Bowman et al. (2016) to add weights to all KL divergence terms and anneal the weights from a small value to 1.
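As an illustration of annealing from a small value to 1, the sketch below uses a linear schedule capped at 1. The linear shape, the function name, and the default `start` and `rate` values are assumptions chosen for simplicity; the text only specifies that the weight grows from a small value to 1, and other monotone schedules would fit that description equally well.

```python
def kl_weight(step, start=1e-4, rate=1e-4):
    """Weight applied to all KL terms at a given training step.

    Grows linearly from `start` and is capped at 1.0 (illustrative schedule).
    """
    return min(1.0, start + rate * step)
```

The returned weight would multiply every KL divergence term in the bound before the loss is computed.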

Experiments
We describe key details of our experimental setup in the subsections below but defer details about hyperparameter tuning to the supplementary material. Our implementation is available at https://github.com/mingdachen/vsl.
Twitter POS Dataset. The Twitter dataset has 25 tags. We use OCT27TRAIN and OCT27DEV as the training set, OCT27TEST as the development set, and DAILY547 as the test set. We randomly sample {1k, 2k, 3k, 4k, 5k, 10k, 20k, 30k, 60k} tweets from 56 million English tweets as our unlabeled data and tune the amount of unlabeled data based on development set accuracy.
UD POS Datasets. The UD datasets have 17 tags. We use French, German, Spanish, Russian, Indonesian and Croatian. We follow the same setup as Zhang et al. (2017), randomly sampling 20% of the original training set as our labeled data and 50% as unlabeled data. There is no overlap between the labeled and unlabeled data. See Zhang et al. (2017) for more details about the setup.
NER Dataset. We use the BIOES labeling scheme and report micro-averaged $F_1$. We preprocessed the text by replacing all digits with 0. We randomly sample 10% of the original training set as our labeled data and 50% as unlabeled data. We also ensure there is no overlap between the labeled and unlabeled data.
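A non-overlapping labeled/unlabeled split of this kind can be obtained by shuffling the training indices once and taking two disjoint slices. The helper below is a sketch under that assumption; its name, the fixed seed, and the rounding of fractional sizes are illustrative and are not taken from Zhang et al. (2017) or the released code.

```python
import random


def split_labeled_unlabeled(sentences, labeled_frac, unlabeled_frac, seed=0):
    """Sample disjoint labeled and unlabeled subsets from a training set."""
    if labeled_frac + unlabeled_frac > 1.0:
        raise ValueError("fractions must sum to at most 1 to avoid overlap")
    indices = list(range(len(sentences)))
    random.Random(seed).shuffle(indices)
    n_labeled = round(len(indices) * labeled_frac)
    n_unlabeled = round(len(indices) * unlabeled_frac)
    labeled = [sentences[i] for i in indices[:n_labeled]]
    # The unlabeled slice starts where the labeled slice ends, so they cannot overlap.
    unlabeled = [sentences[i] for i in indices[n_labeled:n_labeled + n_unlabeled]]
    return labeled, unlabeled
```

For the UD setup, the call would use `labeled_frac=0.2` and `unlabeled_frac=0.5`; the labels of the second subset would then simply be discarded.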
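The two preprocessing steps mentioned for NER, digit replacement and the BIOES scheme, can be sketched as follows. The sketch assumes the source tags are in BIO format (every entity starts with a `B-` tag), which the text does not state; the function names are likewise hypothetical.

```python
import re


def normalize_digits(token):
    """Replace every digit in a token with 0."""
    return re.sub(r"\d", "0", token)


def bio_to_bioes(tags):
    """Convert a BIO tag sequence to BIOES (assumes well-formed BIO input)."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + label  # does the entity extend further?
        if prefix == "B":
            bioes.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I"
            bioes.append(("I-" if continues else "E-") + label)
    return bioes
```

Single-token entities become `S-` tags and the last token of a longer entity becomes an `E-` tag, which gives the classifier explicit boundary information.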

Pretrained Word Embeddings
For all experiments, we use pretrained 100-dimensional word embeddings. For Twitter, we trained skip-gram embeddings (Mikolov et al., 2013) on a dataset of 56 million English tweets.
For the UD datasets, we trained skip-gram embeddings on Wikipedia for each of the six languages. For NER, we use 100-dimensional pretrained GloVe (Pennington et al., 2014) embeddings. Our models perform better with word embeddings kept fixed during training, while for the baselines the word embeddings are fine-tuned as this improves the baseline performance.

Baselines
Our primary baseline is a BiGRU tagger where the input consists of the concatenation of a word embedding and the concatenation of the final hidden states of a character-level BiGRU. This BiGRU architecture is identical to that used in the inference networks in our VSL models. Predictions are made using a one-layer feedforward transformation of the current hidden state. The output dimensionality of the transformation is task-dependent (e.g., 25 for Twitter tagging). We use the standard per-position cross entropy loss for training.
We also report results from the best systems from Zhang et al. (2017), namely the NCRF and NCRF-AE models. Both use feedforward networks as encoders and conditional random field layers for capturing sequential information. The NCRF-AE model additionally can benefit from unlabeled data.
Table 1: For dev and test, we show results when only using labeled data and the change in performances ("UL∆") when adding unlabeled data. Bold is highest in each column. Italic is the best model including unlabeled data. We only show test results for the baseline and our best-performing model, which achieves 91.9% accuracy on the Twitter test set and 84.7% $F_1$ on the NER test set when using unlabeled data.

Results
Table 1a shows results on the Twitter development and test sets. All of our VSL models outperform the baseline and our best VSL models outperform the BiGRU baseline by 0.8-1% absolute. When comparing different latent variable configurations, we find that a hierarchical structure performs best. Without unlabeled data, our models already outperform the BiGRU baseline. Adding unlabeled data enlarges the gap between the baseline and our models by 0.1-0.3% absolute.
Table 1b shows results on the CoNLL 2003 NER development and test sets. We observe similar trends as in the Twitter data, except that the model does not show improvement on the test set when adding unlabeled data.
Table 2 shows our results on the UD datasets. The trends are broadly consistent with those of Table 1a and 1b. The best performing models use hierarchical structure in the latent variables. There are some differences across languages. For French, German, Indonesian and Russian, VSL-G does not show improvement when using unlabeled data. This may be resolved with better tuning, since the model actually shows improvement on the dev set.
Table 2: Tagging accuracies (%) on UD test sets. For each language, we show test accuracy ("acc.") when only using labeled data and the change in test accuracy ("UL∆") when adding unlabeled data. Results for NCRF and NCRF-AE are from Zhang et al. (2017), though results are not strictly comparable because we used word embeddings pretrained on Wikipedia for all languages. Bold is highest in each column, excluding the NCRF variants. Italic is the best accuracy including the unlabeled data.
Note that results reported by Zhang et al. (2017) and ours are not strictly comparable as their word embeddings were only pretrained on the UD training sets while ours were pretrained on Wikipedia. Nonetheless, they also mentioned that using embeddings pretrained on larger unlabeled data did not help. We include these results to show that our baselines are indeed strong compared to prior results reported in the literature.
Table 3: Twitter and NER dev results (%), UD averaged test accuracies (%) for two choices of attaching the classification loss to latent variables in the VSL-GG-Hier model. All previous results for VSL-GG-Hier used the classification loss on y.

Effect of Position of Classification Loss
We investigate the effect of attaching the classifier to different latent variables. In particular, for the VSL-GG-Hier model, we compare the attachment of the classifier between z and y. See Figure 2. The results in Table 3 suggest that attaching the reconstruction and classification losses to the same latent variable (z) harms accuracy, although attaching the classifier to z effectively gives the classifier an extra layer. We can observe why this occurs by looking at the latent variable visualizations in Figure 3d. Compared with Figure 3e, where the two variables are more clearly disentangled, the latent variables in Figure 3d appear to be capturing highly similar information.

Effect of Latent Hierarchy
To verify our assumption of the latent structure, we visualize the latent space for Gaussian models using t-SNE (Maaten and Hinton, 2008) in Figure 3. The BiGRU baseline (Figure 3a) and the VSL-G (Figure 3b) do not show significant differences. However, when using multiple latent variables, the different latent variables capture different characteristics. In the VSL-GG-Flat model (Figure 3c), the y variable (the upper plot) reflects the clustering of the tagging space much more closely than the z variable (the lower plot). Since both variables are used to reconstruct the word, but only the y variable is trained to predict the tag, it appears that z is capturing other information useful for reconstructing the word. However, since they are both used for reconstruction, the two spaces show signs of alignment; that is, the "tag" latent variable y does not show as clean a separation into tag clusters as the y variable in the VSL-GG-Hier model in Figure 3e.
In Figure 3e (VSL-GG-Hier), the clustering of words with respect to the tag is clearest. This may account for the consistently better performance of this model relative to the others. The z variable reflects a space that is conditioned on y but that diverges from it, presumably in order to better reconstruct the word. The closer the latent variable is to the decoder output, the weaker the tagging information becomes while other word-specific information becomes more salient.
Figure 3d shows that VSL-GG-Hier with classification loss on z, which consistently underperforms both the VSL-GG-Flat and VSL-GG-Hier models in our experiments, appears to be capturing the same latent space in both variables. Since the z variable is used to both predict the tag and reconstruct the word, it must capture both the tag and word reconstruction spaces, and may be limited by capacity in doing so. The y variable does not seem to be contributing much modeling power, as its space is closely aligned to that of z.
Table 4: Results on Twitter and NER dev sets. For each model, we show supervised results for the models with variational regularization ("acc." or $F_1$) and results when replacing variational components with their deterministic counterparts ("no VR").
Model          | Twitter acc. | Twitter no VR | NER F1 | NER no VR
BiGRU baseline | 90.8         | -             | 87.6   | -
VSL-G          | 91.1         | 90.9          | 87.8   | 87.7
VSL-GG-Flat    | 91.4         | 90.9          | 88.0   | 87.8
VSL-GG-Hier    | 91.6         | 91.0          | 88.4   | 87.9

Effect of Variational Regularization
We investigate the beneficial effects of variational frameworks ("variational regularization") by replacing our variational components in VSLs with their deterministic counterparts, which do not have randomness in the latent space and do not use the KL divergence term during optimization. Note that these BiGRU encoders share the same architectures as their variational posterior counterparts and still use both the classification and reconstruction losses. While other subsets of losses could be considered in this comparison, our motivation is to compare two settings that correspond to well-known frameworks. The "no VR" setting corresponds roughly to the combination of a classifier and a traditional autoencoder. We note that these experiments do not use any unlabeled data.
The results in Table 4 demonstrate that compared to the baseline BiGRU, adding the reconstruction loss ("VSL-G, no VR") yields only 0.1 improvement for both Twitter and NER. Although adding hierarchical structure further improves performance, the improvements are small (+0.1 and +0.2 for Twitter and NER respectively). For VSL-GG-Hier, variational regularization accounts for relatively large differences of 0.6 for Twitter and 0.5 for NER. These results show that the improvements do not come solely from adding a reconstruction objective to the learning procedure. In limited preliminary experiments, we did not find a benefit from adding unlabeled data under the "no VR" setting.

Effect of Unlabeled Data
In order to examine the effect of unlabeled data, we report our Twitter dev accuracies when varying the unlabeled data size. We choose VSL-GG-Hier as the model for this experiment since it benefits the most from unlabeled data. As Figure 4 shows, gradually adding unlabeled data helps a little at the beginning. Further adding unlabeled data boosts the accuracy of the model. The improvements that come from unlabeled data quickly plateau after the amount of unlabeled data goes beyond 10,000. This suggests that with little unlabeled data, the model is incapable of fully utilizing the information in the unlabeled data. However, if the amount of unlabeled data is too large, the supervised training signal becomes too weak to extract something useful from the unlabeled data.
We also notice that when there is a large amount of unlabeled data, it is always better to pretrain the prior first using a small $\alpha$ (e.g., 0.1) and then use it as a warm start to train a new model using a larger $\alpha$ (e.g., 1.0). Tuning the weight of the KL divergence could achieve a similar effect, but it may require tuning the weight for labeled data and unlabeled data separately. We prefer to pretrain the prior as it is simpler and involves less hyperparameter tuning.

Conclusion
We introduced variational sequential labelers for semi-supervised sequence labeling. They consist of latent-variable generative models with flexible parametrizations for the variational posterior (using RNNs over the entire input sequence) and a classifier at each time step. Our best models use multiple latent variables arranged in a hierarchical structure. We demonstrate systematic improvements in NER and POS tagging accuracy across 8 datasets over a strong baseline. We also find small, but consistent, improvements by using unlabeled data.

Figure 2: Comparison of attaching classification loss to different latent variables in VSL-GG-Hier.
Table 3         | Twitter acc. | UL∆  | NER F1 | UL∆  | UD acc. | UL∆
classifier on y | 91.6         | +0.3 | 88.4   | +0.2 | 95.0    | +0.1
classifier on z | 91.1         | +0.2 | 87.8   | +0.1 | 94.4    | +0.0

Figure 3: t-SNE visualization of Gaussian latent variables and baseline hidden states for Twitter development set. In plots 3c, 3d, and 3e, the upper subplot is latent variable y and the lower is z. Each point in the plot is a token and the color represents the true tag of the token.

Figure 4: Twitter dev accuracies (%) when varying the amount of unlabeled data.