SeqVAT: Virtual Adversarial Training for Semi-Supervised Sequence Labeling

Virtual adversarial training (VAT) is a powerful technique to improve model robustness in both supervised and semi-supervised settings. It is effective and can be easily adopted on many image classification and text classification tasks. However, its benefits to sequence labeling tasks such as named entity recognition (NER) have not been shown to be as significant, mostly because previous approaches cannot combine VAT with the conditional random field (CRF). CRF can significantly boost accuracy for sequence models by putting constraints on label transitions, which makes it an essential component in most state-of-the-art sequence labeling model architectures. In this paper, we propose SeqVAT, a method which naturally applies VAT to sequence labeling models with CRF. Empirical studies show that SeqVAT not only significantly improves sequence labeling performance over baselines in supervised settings, but also outperforms state-of-the-art approaches in semi-supervised settings.


Introduction
While having achieved great success on various computer vision and natural language processing tasks, deep neural networks, even state-of-the-art models, are usually vulnerable to tiny input perturbations (Szegedy et al., 2014; Goodfellow et al., 2015). To improve the model robustness against perturbations, Goodfellow et al. (2015) proposed to train neural networks on both original training examples and adversarial examples (examples generated by adding small but worst-case perturbations to the original examples). This approach, named adversarial training (AT), has been reported to be highly effective on image classification (Goodfellow et al., 2015), text classification (Miyato et al., 2017), as well as sequence labeling (Yasunaga et al., 2018).
However, AT is limited to a supervised scenario, which uses the labels to compute adversarial losses.
To make use of unlabeled data, virtual adversarial training (VAT) was proposed to extend AT to semi-supervised settings (Miyato et al., 2019). Unlike AT, which treats adversarial examples as new training instances with the same labels as the original examples, VAT minimizes the KL divergence between the estimated label distribution of the original examples and that of the adversarial examples. In this manner, both labeled and unlabeled data can be used in training to improve accuracy and robustness. As a semi-supervised learning algorithm, VAT was reported to be effective on both image (Goodfellow et al., 2015; Miyato et al., 2019) and text classification (Miyato et al., 2017). Moreover, a recent study (Oliver et al., 2018) conducted comprehensive comparisons of popular semi-supervised learning algorithms, and VAT turned out to be the most effective one. Despite its success in classification tasks, VAT has not shown similar effectiveness in sequence labeling tasks. In a conventional classification task, the model learns a mapping between a sentence (a sequence of tokens) and a label. In a sequence labeling task, however, the target function becomes a mapping from a sequence of tokens to a sequence of labels. To apply VAT to sequence labeling, Clark et al. (2018) proposed to use a softmax layer on top of the token representations to obtain a label probability distribution for each token. In this fashion, VAT can take the KL divergence between tokens at the same position of the original sequence and the adversarial sequence as the adversarial loss. This approach shows marginal improvements over baseline models on several benchmarks, but fails to achieve performance comparable to other state-of-the-art models (Akbik et al., 2018; Peters et al., 2018; Devlin et al., 2019).
Although the approach above applies VAT to the entire sequence, it locally normalizes the label probability per token and assumes all transitions between labels are equally likely. But in sequence labeling tasks, label transition probabilities are not uniform. For example, a song name is more likely to appear after a singer name than after a travel company name.
To incorporate label transitions into sequence models, Lafferty et al. (2001) proposed the conditional random field (CRF). CRF models the probability distribution of the whole label sequence given the input sequence, instead of yielding a label probability distribution for each token. It takes into account both token features and transition features. Most state-of-the-art sequence labeling models apply a CRF on top of token representations as a decoder. Such neural-CRF models usually outperform models without CRF (Ma and Hovy, 2016; Akbik et al., 2018; Peters et al., 2018; Yasunaga et al., 2018).
To apply conventional VAT to a model with CRF, one can calculate the KL divergence of each token's label distribution between the original examples and the adversarial examples. However, this is sub-optimal because the transition probabilities are not taken into account.
To better address these issues, we propose SeqVAT, a variant of VAT that can be used along with CRF. Our evaluation demonstrates that SeqVAT brings significant improvements in supervised settings, rather than the marginal improvements reported for previous VAT-based approaches (Clark et al., 2018). In semi-supervised settings, SeqVAT also outperforms widely used methods such as self-training (ST) (Yarowsky, 1995) and entropy minimization (EM) (Grandvalet and Bengio, 2004), as well as the state-of-the-art semi-supervised sequence labeling algorithm, cross-view training (CVT) (Clark et al., 2018).

Sequence Labeling
Sequence labeling is a family of common natural language processing tasks that predict a label for each token within a sequence, rather than a single label for the whole sequence. Such tasks include named entity recognition, chunking and part-of-speech (POS) tagging. Most state-of-the-art sequence labeling models are based on a neural-CRF architecture (Ma and Hovy, 2016; Akbik et al., 2018; Peters et al., 2018; Yasunaga et al., 2018). More precisely, the general design is to use bidirectional recurrent neural network (RNN) layers for encoding and a CRF layer for decoding. In addition, one or more convolutional neural network (CNN) or RNN layers are usually applied before the neural-CRF architecture to encode character-level information as part of the input. In this paper, we instantiate the neural-CRF architecture as a CNN-LSTM-CRF model, which consists of a CNN layer to generate character embeddings, two bidirectional long short-term memory (LSTM) layers as the encoder, and a CRF layer as the decoder.
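For illustration, sequence labeling assigns one tag per token. The snippet below is a hypothetical NER example using the common BIO tagging scheme; the sentence is ours and does not come from any dataset used later in the paper.

```python
# Hypothetical NER example: sequence labeling assigns one tag per token (BIO scheme).
tokens = ["Alice", "joined", "Acme",  "Corp",  "in", "Berlin", "."]
tags   = ["B-PER", "O",      "B-ORG", "I-ORG", "O",  "B-LOC",  "O"]
assert len(tokens) == len(tags)  # one label per token, not one label per sentence
```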

Semi-Supervised Learning
Semi-supervised learning is an important approach to improve model performance when labeled data is scarce. It utilizes unlabeled data to obtain additional information that may benefit supervised tasks. Two robust and widely used semi-supervised approaches are self-training (ST) (Yarowsky, 1995) and entropy minimization (EM) (Grandvalet and Bengio, 2004). In natural language processing, ST has been successfully applied to word sense disambiguation (Yarowsky, 1995) and parsing (McClosky et al., 2006), and EM has also been applied successfully to text classification (Sachan et al., 2019).
Recently, a powerful semi-supervised approach, cross-view training (CVT), has achieved state-of-the-art results on several semi-supervised language tasks, including dependency parsing, machine translation and chunking (Clark et al., 2018). CVT forces the model to make consistent predictions when using the full input or partial inputs. Hence, it does not require label information and can be used for semi-supervised learning. To validate the effectiveness of our approach on semi-supervised sequence labeling, we make fair comparisons to these three semi-supervised learning methods in our experiments.

Virtual Adversarial Training
Adversarial training (Goodfellow et al., 2015) is a regularization method that enhances model robustness against input perturbations. It generates adversarial examples by injecting worst-case perturbations bounded by a small norm into the original examples, and adds them to training. As a consequence, model predictions become consistent regardless of the perturbations. Prior to AT, several papers investigated other ways of perturbing inputs (Xie et al., 2017). Adversarial training was demonstrated to be more effective, since it introduces the perturbations that lead to the largest increase in model loss within a constrained norm (Goodfellow et al., 2015). Goodfellow et al. (2015) demonstrated the effect of adversarial training in enhancing model robustness, especially on unseen samples, for image classification. In addition to computer vision tasks, adversarial training has also demonstrated its effectiveness on language tasks, such as text classification, POS tagging, named entity recognition and chunking (Miyato et al., 2017; Yasunaga et al., 2018).
To extend AT to semi-supervised settings, Miyato et al. (2019) proposed virtual adversarial training (VAT). "Virtual" means that label information is not required in this adversarial training approach, and consequently it can be applied to both labeled and unlabeled training instances. VAT achieved state-of-the-art performance on image classification tasks (Miyato et al., 2019), and a recent study (Oliver et al., 2018) found it to be more effective than traditional semi-supervised approaches such as entropy minimization (Grandvalet and Bengio, 2004) and self-training (Yarowsky, 1995).
However, despite the successful applications on text classification (Miyato et al., 2017), VAT has not shown great benefits to semi-supervised sequence labeling tasks, due to its incompatibility with CRF. In this paper, SeqVAT is proposed to make VAT compatible with CRF, and achieves significant improvements in sequence labeling.

Model Architecture
Our baseline model architecture is illustrated in Fig. 1. It adopts the basic architecture of several state-of-the-art sequence labeling models (Ma and Hovy, 2016; Peters et al., 2017; Akbik et al., 2018; Peters et al., 2018), referred to as CNN-LSTM-CRF (CLC) in this paper. We apply a CNN layer to extract character information and concatenate its output with word embeddings as input features. Then, we feed the input features into LSTM layers and decode with a CRF layer.

Word Embeddings
Randomly initialized 300-dimensional word embeddings serve as the word-level input. However, the model could learn embeddings with large norms, which would make the effect of adversarial perturbations with small norms insignificant (Miyato et al., 2017). To avoid this, we normalize the word embeddings at the beginning of each epoch. Denote $v = \{v_i \mid i = 1, 2, ..., n\}$ as the set of word embeddings, where n is the vocabulary size. A specific embedding $v_i$ is normalized by:

$\hat{v}_i = \dfrac{v_i - \mathrm{E}[v]}{\sqrt{\mathrm{Var}[v]}}$   (1)

where $\mathrm{E}[v]$ and $\mathrm{Var}[v]$ are the mean and variance over the embedding set. After normalization, word embeddings have zero mean and unit variance.
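As a minimal sketch (our own code, not the authors' implementation), the per-epoch normalization can be written as follows, assuming the statistics are taken over the whole vocabulary for each embedding dimension:

```python
import torch

def normalize_embeddings(emb: torch.Tensor) -> torch.Tensor:
    """Rescale an (n, d) embedding matrix to zero mean and unit variance.

    Run once at the start of each epoch so that adversarial perturbations
    with a small norm remain significant relative to the embeddings.
    """
    mean = emb.mean(dim=0, keepdim=True)
    std = emb.std(dim=0, keepdim=True)
    return (emb - mean) / (std + 1e-12)
```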

Character CNN Layer
Character-level information has been shown to improve sequence labeling accuracy by capturing morphological features (Ma and Hovy, 2016). In this paper, 32-dimensional embeddings are randomly initialized for each character. To ensure that adversarial perturbations have significant effects, character embeddings are also normalized at the beginning of each epoch in the same way as word embeddings. Denote $u = \{u_i \mid i = 1, 2, ..., m\}$ as the set of character embeddings, where m is the number of unique characters appearing in the dataset. A specific embedding $u_i$ is randomly initialized and normalized by:

$\hat{u}_i = \dfrac{u_i - \mathrm{E}[u]}{\sqrt{\mathrm{Var}[u]}}$   (2)

A CNN layer with 16 unigram, 16 bigram and 32 trigram filters is applied on top of the 32-dimensional embeddings of each word's characters. Hence, each word has a 64-dimensional character representation, which is the output of the CNN layer.

LSTM Layer
The character embeddings and word embeddings are concatenated and passed through two bidirectional LSTM layers with 256 units per direction to encode information for the whole sequence.

CRF Layer
To incorporate the probabilities of label transitions, the outputs of the LSTM layers are fed into a linear-chain CRF decoder (Lafferty et al., 2001). The negative log-likelihood is computed as the training loss, and the Viterbi algorithm (Viterbi, 1967) is used for decoding.
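Putting the components above together, the following is a minimal PyTorch sketch of the CLC baseline. It is our own illustration rather than the authors' code: the class and method names are ours, and the CRF layer is assumed to come from the third-party `pytorch-crf` package.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # assumes the third-party `pytorch-crf` package


class CNNLSTMCRF(nn.Module):
    """Sketch of the CLC baseline: char CNN + word embeddings -> BiLSTM -> CRF."""

    def __init__(self, vocab_size, char_vocab_size, num_labels,
                 word_dim=300, char_dim=32, hidden=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.char_emb = nn.Embedding(char_vocab_size, char_dim)
        # unigram/bigram/trigram filters (16 + 16 + 32 = 64 char features per word)
        self.char_convs = nn.ModuleList([
            nn.Conv1d(char_dim, 16, kernel_size=1),
            nn.Conv1d(char_dim, 16, kernel_size=2, padding=1),
            nn.Conv1d(char_dim, 32, kernel_size=3, padding=1),
        ])
        self.lstm = nn.LSTM(word_dim + 64, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def char_features(self, chars):
        # chars: (batch, seq_len, max_word_len) character indices
        b, t, n = chars.shape
        x = self.char_emb(chars).view(b * t, n, -1).transpose(1, 2)   # (b*t, char_dim, n)
        feats = [conv(x).max(dim=2).values for conv in self.char_convs]
        return torch.cat(feats, dim=-1).view(b, t, -1)                # (b, t, 64)

    def emissions(self, words, chars):
        x = torch.cat([self.word_emb(words), self.char_features(chars)], dim=-1)
        h, _ = self.lstm(x)
        return self.proj(h)

    def loss(self, words, chars, tags, mask):
        # CRF forward returns the log-likelihood; negate it for the training loss
        return -self.crf(self.emissions(words, chars), tags, mask=mask)

    def decode(self, words, chars, mask):
        return self.crf.decode(self.emissions(words, chars), mask=mask)
```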

Adversarial Training
Adversarial training (Goodfellow et al., 2015) is an effective method to improve model robustness against input perturbations. AT first generates adversarial examples, which are close to the original examples but on which the model is likely to predict the wrong labels (i.e., they lead to the largest increase in loss). Then, the model is trained on both the original examples and the adversarial examples, and the loss on the adversarial examples is treated as the adversarial loss. In this paper, adversarial perturbations are added to the word and character embeddings respectively. To prevent the vanishing effect of adversarial perturbations explained in the embedding sections above, embeddings are normalized at the beginning of each epoch. Denote w and c as the normalized word and character embeddings of the whole input sequence, θ as the model parameters, y as the vector of labels for all tokens in the sequence, and Loss as the loss (i.e., negative log-likelihood) of the whole sequence. Given the bounded norms δ_w and δ_c respectively, the worst-case perturbations d_w and d_c for w and c are:

$d_w = \arg\max_{\|\tau\| \le \delta_w} \mathrm{Loss}(y; w + \tau, c, \hat{\theta})$   (3)

$d_c = \arg\max_{\|\tau\| \le \delta_c} \mathrm{Loss}(y; w, c + \tau, \hat{\theta})$   (4)

Note that all variables y, w, c, d_w and d_c here are vectors over the whole sequence, since the last layer, the CRF, models the whole label sequence. In addition, $\hat{\theta}$ is the current estimate of θ; the constant value $\hat{\theta}$ is used instead of θ to emphasize that gradients should not propagate while generating adversarial examples. Hence, the worst-case perturbations d_w and d_c against the current model can be calculated through (3) and (4) at each training step, and the model can be trained on the examples plus those perturbations to improve robustness against them. Yet, computing the exact value of those perturbations via maximization is intractable for complex DNN models. As proposed by Goodfellow et al. (2015), a first-order approximation is applied to estimate d_w and d_c:

$d_w = \delta_w \, g_w / \|g_w\|$   (5)

$d_c = \delta_c \, g_c / \|g_c\|$   (6)

where $g_w = \nabla_w \mathrm{Loss}(y; w, c, \hat{\theta})$ and $g_c = \nabla_c \mathrm{Loss}(y; w, c, \hat{\theta})$. Then, the adversarial loss L_adv is formed by:

$L_{adv} = \mathrm{Loss}(y; w + d_w, c + d_c, \hat{\theta})$   (7)
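The two-step procedure above can be sketched in PyTorch as follows. This is our own hedged illustration: `model.crf_loss` is a hypothetical helper that runs the BiLSTM-CRF on pre-computed (normalized) word and character embeddings and returns the negative log-likelihood of the gold label sequence.

```python
import torch

def adversarial_perturbation(grad: torch.Tensor, delta: float) -> torch.Tensor:
    """First-order approximation (Goodfellow et al., 2015): rescale the loss
    gradient to the perturbation budget delta."""
    return delta * grad / (grad.norm() + 1e-12)

def at_loss(model, w_emb, c_emb, tags, mask, delta_w=0.4, delta_c=0.2):
    """Sketch of the AT adversarial loss on one batch (hypothetical helpers)."""
    w_emb = w_emb.detach().requires_grad_(True)
    c_emb = c_emb.detach().requires_grad_(True)
    loss = model.crf_loss(w_emb, c_emb, tags, mask)
    # Gradients w.r.t. the embeddings give the worst-case direction.
    g_w, g_c = torch.autograd.grad(loss, [w_emb, c_emb])
    d_w = adversarial_perturbation(g_w, delta_w)
    d_c = adversarial_perturbation(g_c, delta_c)
    # Loss on the perturbed examples; gradients flow into the model parameters
    # only, not through d_w / d_c (they are treated as constants).
    return model.crf_loss(w_emb + d_w, c_emb + d_c, tags, mask)
```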

Virtual Adversarial Training
Nevertheless, adversarial training cannot be applied to unlabeled data, since label information is required to generate adversarial examples and to compute the adversarial loss. Virtual adversarial training (Miyato et al., 2019) was proposed to adapt adversarial training to semi-supervised settings. In VAT, instead of using the regular loss on perturbed examples as the adversarial loss, the discrepancy (KL divergence) between the predictions on the original examples and those on the adversarial examples acts as the adversarial loss. With this modification, label information is not needed to compute the adversarial loss. Indeed, the adversarial loss for VAT is written as:

$L_{adv} = \mathrm{KL}(P_{ori} \,\|\, P_{adv})$   (8)

where $P_{ori} = P(\hat{y}; w, c, \hat{\theta})$ and $P_{adv} = P(\hat{y}; w + d_w, c + d_c, \hat{\theta})$. Here, $\hat{y}$ emphasizes that the KL divergence is computed over the current estimated distribution of y, so that label information is not required. P_ori and P_adv are the estimated label probability distributions on the original examples and the adversarial examples respectively. As explained in the introduction, VAT is not compatible with CRF. Hence, P_ori and P_adv here stand for sets of per-token label distributions, computed by applying a softmax on top of the LSTM output representations. As a consequence, the function P used here to estimate the label probability distributions is:

$P(\hat{y}; w, c, \hat{\theta}) = \mathrm{CLS}(w, c, \hat{\theta})$

where CLS means applying a softmax on top of the CNN-LSTM encoder. However, to compute the worst-case perturbations d_w and d_c, label information y is still needed, as in equations (3), (4), (5) and (6). To get rid of the label information, the worst-case perturbations are instead computed from the KL divergence between P_ori and P_adv, given the bounded norms δ_w and δ_c.
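Below is a hedged sketch of this token-level VAT loss. `model.token_logits` is a hypothetical helper returning per-token logits from the CNN-LSTM encoder plus a softmax projection, and the worst-case direction is estimated from a small random perturbation, as in the standard VAT recipe (Miyato et al., 2019); treat it as an illustration rather than the exact procedure used in the paper.

```python
import torch
import torch.nn.functional as F

def vat_loss(model, w_emb, c_emb, mask, delta_w=0.4, delta_c=0.2):
    """Token-level VAT (no CRF): KL between per-token label distributions of
    the original and the perturbed inputs."""
    with torch.no_grad():
        p_ori = F.softmax(model.token_logits(w_emb, c_emb), dim=-1)

    # One power-iteration step: estimate the worst-case direction from the
    # gradient of the KL divergence w.r.t. a small random perturbation.
    r_w = torch.randn_like(w_emb, requires_grad=True)
    r_c = torch.randn_like(c_emb, requires_grad=True)
    log_p = F.log_softmax(model.token_logits(w_emb + 1e-2 * r_w,
                                             c_emb + 1e-2 * r_c), dim=-1)
    kl = (F.kl_div(log_p, p_ori, reduction="none").sum(-1) * mask).sum()
    g_w, g_c = torch.autograd.grad(kl, [r_w, r_c])
    d_w = delta_w * g_w / (g_w.norm() + 1e-12)
    d_c = delta_c * g_c / (g_c.norm() + 1e-12)

    # Virtual adversarial loss: KL between original and perturbed predictions.
    log_p_adv = F.log_softmax(model.token_logits(w_emb + d_w, c_emb + d_c), dim=-1)
    return (F.kl_div(log_p_adv, p_ori, reduction="none").sum(-1) * mask).sum()
```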

SeqVAT
Because of its incompatibility with CRF, adapting VAT to sequence labeling has not yet been very successful (Clark et al., 2018). To fully release the power of VAT for sequence labeling models with CRF, we propose a CRF-friendly VAT, named SeqVAT. CRF models the conditional probability of the whole label sequence given the whole input sequence. Consequently, instead of using the label distribution of each individual token, we can use the probability distribution over whole label sequences to compute the KL divergence. This probability distribution can be denoted by:

$P(\hat{y}; w, c, \hat{\theta}) = \mathrm{CLC}(w, c, \hat{\theta})$

where $\hat{y}$ is the whole label sequence and CLC indicates the full CLC model. Nevertheless, given a sequence with t tokens and l possible labels per token, the total number of possible label sequences is $l^t$. Considering this substantial number, it is not feasible to compute the full probability distribution over all possible label sequences. To make the computation tractable, we estimate the full distribution by considering only the probabilities of the k most probable label sequences, with one additional dimension representing all remaining label sequences. Thus, the estimated probability distribution has (k + 1) dimensions and is feasible to compute.
To obtain the most probable label sequences, we apply k-best Viterbi decoding (Huang and Chiang, 2005) on the original sequence in each training step. Denote $S = (s_1, s_2, ..., s_k)$ as the k-best label sequences for the current input embeddings w and c, and $p_{crf}$ as the function giving the probability of a label sequence under the CRF. Given the current parameters $\hat{\theta}$, the estimated probability distribution P can be written as:

$P_i(S; w, c, \hat{\theta}) = p_{crf}(s_i; w, c, \hat{\theta})$ for $i = 1, ..., k$, and $P_{k+1}(S; w, c, \hat{\theta}) = 1 - \sum_{i=1}^{k} p_{crf}(s_i; w, c, \hat{\theta})$

Then, P_ori and P_adv can be denoted as:

$P_{ori} = P(S; w, c, \hat{\theta})$,  $P_{adv} = P(S; w + d_w, c + d_c, \hat{\theta})$

Here, d_w and d_c can be computed using the same first-order approximation as in VAT:

$d_w = \delta_w \, g_w / \|g_w\|$,  $d_c = \delta_c \, g_c / \|g_c\|$

where

$g_w = \nabla_\tau \mathrm{KL}(P(S; w, c, \hat{\theta}) \,\|\, P(S; w + \tau, c, \hat{\theta}))$,  $g_c = \nabla_\tau \mathrm{KL}(P(S; w, c, \hat{\theta}) \,\|\, P(S; w, c + \tau, \hat{\theta}))$

The adversarial loss for SeqVAT can then be computed as:

$L_{adv} = \mathrm{KL}(P_{ori} \,\|\, P_{adv})$
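A hedged sketch of the SeqVAT loss is given below. It is our own illustration: `model.emissions` and `model.kbest_decode` are hypothetical helpers (the former maps embeddings to CRF emission scores, the latter returns the k most probable label sequences from a k-best Viterbi decoder), and the CRF forward is assumed to behave like `pytorch-crf`, returning per-sentence log-likelihoods of given tag sequences.

```python
import torch
import torch.nn.functional as F

def seq_distribution(crf, emissions, k_best, mask):
    """(k+1)-dimensional estimate of the label-sequence distribution:
    probabilities of the k candidate sequences plus one dimension for the rest."""
    log_probs = torch.stack(
        [crf(emissions, tags, mask=mask, reduction="none") for tags in k_best],
        dim=-1)                                    # (batch, k)
    probs = log_probs.exp()
    rest = (1.0 - probs.sum(dim=-1, keepdim=True)).clamp_min(1e-12)
    return torch.cat([probs, rest], dim=-1)        # (batch, k + 1)

def seqvat_loss(model, w_emb, c_emb, mask, k=3, delta_w=0.4, delta_c=0.2):
    """Sketch of the SeqVAT adversarial loss (hypothetical helper methods)."""
    with torch.no_grad():
        emissions = model.emissions(w_emb, c_emb)
        k_best = model.kbest_decode(emissions, k, mask)   # fixed for this step
        p_ori = seq_distribution(model.crf, emissions, k_best, mask)

    # Estimate the worst-case direction from a random perturbation, as in VAT.
    r_w = torch.randn_like(w_emb, requires_grad=True)
    r_c = torch.randn_like(c_emb, requires_grad=True)
    p_adv = seq_distribution(model.crf,
                             model.emissions(w_emb + 1e-2 * r_w, c_emb + 1e-2 * r_c),
                             k_best, mask)
    kl = F.kl_div(p_adv.clamp_min(1e-12).log(), p_ori, reduction="sum")
    g_w, g_c = torch.autograd.grad(kl, [r_w, r_c])
    d_w = delta_w * g_w / (g_w.norm() + 1e-12)
    d_c = delta_c * g_c / (g_c.norm() + 1e-12)

    # SeqVAT loss: KL over (k+1)-dim sequence distributions, original vs. adversarial.
    p_adv = seq_distribution(model.crf,
                             model.emissions(w_emb + d_w, c_emb + d_c),
                             k_best, mask)
    return F.kl_div(p_adv.clamp_min(1e-12).log(), p_ori, reduction="sum")
```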

Training with Adversarial Loss
Regardless of the adversarial training method used (AT, VAT or SeqVAT), the sequence labeling loss is computed for all labeled data at each training step:

$L_{label} = \mathrm{Loss}(y; w, c, \hat{\theta})$

In addition, at every training step, adversarial examples are generated and the adversarial loss L_adv is calculated according to the corresponding adversarial training algorithm. To combine the sequence labeling loss and the adversarial loss, the total loss is the sum of the two:

$L = L_{label} + \lambda L_{adv}$

Here, the weight λ is introduced to balance model accuracy (sequence labeling loss) and robustness (adversarial loss). This objective is optimized with respect to θ.
Note that unlabeled data may be leveraged in VAT and SeqVAT; since such data lacks annotation, it has no sequence labeling loss, and L_label is set to 0 for unlabeled examples.
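A hedged sketch of how the two losses combine in one training step is shown below; the field and helper names are ours, and `seqvat_loss` refers to the sketch in the previous section.

```python
def training_step(model, batch, lambda_adv=0.6):
    """Combine the supervised sequence labeling loss with the adversarial loss.

    `model.embed` and `model.crf_loss` are hypothetical helpers; `batch.labeled`
    indicates whether gold tags are available for this batch.
    """
    w_emb, c_emb = model.embed(batch.words, batch.chars)
    if batch.labeled:
        l_label = model.crf_loss(w_emb, c_emb, batch.tags, batch.mask)
    else:
        l_label = 0.0  # unlabeled data contributes no sequence labeling loss
    l_adv = seqvat_loss(model, w_emb, c_emb, batch.mask)  # or at_loss / vat_loss
    return l_label + lambda_adv * l_adv
```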

Dataset
Our proposed method is evaluated on three datasets: CoNLL 2000 (Sang and Buchholz, 2000) for chunking, CoNLL 2003 (Sang and Meulder, 2003) for named entity recognition (NER), and an internal natural language understanding (NLU) dataset for slot filling.
For chunking and NER, the One Billion Word Language Model Benchmark (Chelba et al., 2014) is used as the unlabeled data pool for semi-supervised learning. Considering the relatively small size of those two datasets, we randomly sampled 1% of the benchmark as the unlabeled dataset, which is still about 20 times larger than the training sets of CoNLL 2000 and 2003. For slot filling, our NLU dataset contains labeled and unlabeled sentences for 6 domains (detailed statistics are shown in Table 1). We directly use its unlabeled data for the semi-supervised experiments.

Experiment Settings
All parameters are randomly initialized. All hyperparameters are chosen by grid search on the development set. Variational dropout (Blum et al., 2015) with rate 0.2 is applied to the input and output of each LSTM layer. The perturbation sizes for word and character embeddings, δ_w and δ_c, are 0.4 and 0.2 respectively. The weight for the adversarial loss (i.e., λ) is set to 0.6. k is set to 3 for the CoNLL datasets and 9 for our NLU dataset. The sequence labeling model is optimized with the Adam optimizer (Kingma and Ba, 2015) with batch size 64, learning rate 0.0006 and decay rate 0.992. Early stopping is applied based on model performance on the development set.
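For reference, the hyperparameters reported above can be gathered into a single configuration dictionary (our own summary; the variable names are illustrative):

```python
HPARAMS = {
    "variational_dropout": 0.2,    # input/output of each LSTM layer
    "delta_w": 0.4,                # word-embedding perturbation norm
    "delta_c": 0.2,                # character-embedding perturbation norm
    "lambda_adv": 0.6,             # weight on the adversarial loss
    "k_best": {"conll": 3, "nlu": 9},
    "optimizer": "Adam",
    "batch_size": 64,
    "learning_rate": 6e-4,
    "lr_decay": 0.992,
}
```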

Supervised Sequence Labeling
We evaluate our proposed SeqVAT technique in supervised settings and compare the results with other techniques designed to improve model robustness, including AT (Miyato et al., 2017), VAT (Miyato et al., 2019) and CVT (Clark et al., 2018).
To demonstrate the effectiveness of CRF, we compare results from models with and without CRF for each training technique mentioned above. As shown in Table 2, regardless of the training technique, models with CRF consistently perform better than those without it. This demonstrates that CRF is a crucial component in sequence labeling. Hence, we conduct the rest of our evaluation only on models with CRF.
Moreover, except that AT performs slightly better than SeqVAT in the Cook domain, SeqVAT outperforms all other approaches in all other domains/datasets. All improvements of SeqVAT over other approaches are statistically significant (p-value < 0.05 in a t-test). Compared with the VAT adaptation of Clark et al. (2018), SeqVAT consistently shows larger improvements, which indicates that SeqVAT is a better way of adapting the virtual adversarial loss to sequence labeling.

Semi-Supervised Sequence Labeling
VAT has proved to be very effective in semi-supervised learning (Oliver et al., 2018), and our proposed SeqVAT preserves this ability to utilize unlabeled data. In this work, we compare SeqVAT with two widely used semi-supervised learning algorithms, self-training (ST) (Yarowsky, 1995) and entropy minimization (EM) (Grandvalet and Bengio, 2004), as well as a state-of-the-art semi-supervised sequence labeling approach, cross-view training (CVT) (Clark et al., 2018). Detailed results are tabulated in the third set of Table 2. From this comparison, SeqVAT consistently outperforms conventional VAT, ST, EM and CVT. The improvements over other approaches are also statistically significant with p-value < 0.05. These results suggest that SeqVAT is also highly effective at utilizing unlabeled data.

K-best Selection in SeqVAT
To choose the optimal k for k-best decoding, we conduct experiments with different values of k on supervised sequence labeling. The F1 score for each k is plotted in Fig. 2. From these plots, we observe that each dataset has its own optimal k for SeqVAT, and there is no single k that gives the best results across datasets.
To achieve better generalization across datasets and tasks, we avoid selecting the optimal k for each dataset/domain. However, different sources of language have different characteristics, including vocabulary, sentence length, syntax, etc., and using the same k for all types of text might limit the effect of SeqVAT. To strike a balance between generalization and effectiveness, we use a different k for each type of text, but the same k for all datasets/domains from the same source: k = 3 for CoNLL 2000 and 2003 (news), and k = 9 for our internal NLU dataset (spoken language).

Impact of Unlabeled Data
To further understand the effect of unlabeled data in semi-supervised learning, we analyze the correlation between the amount of augmented unlabeled data and model performance on both the CoNLL 2000 and 2003 datasets. For this analysis, we focus on CVT and SeqVAT, which show the best accuracy across all datasets in Table 2. As shown in Fig. 3, the amount of unlabeled data is a crucial factor in the performance of both approaches; more specifically, performance increases with more unlabeled data. On the CoNLL 2000 dataset, CVT performs better when unlabeled data is limited, while SeqVAT gradually overtakes it as more unlabeled data is added. On the CoNLL 2003 dataset, SeqVAT shows consistently superior performance. This experiment shows that both approaches can provide significant benefits with a large amount of unlabeled data. In addition, SeqVAT makes better use of unlabeled data, especially when a substantial amount is available.

Comparison of Semi-Supervised Approaches
ST utilizes the unlabeled data by augmenting the training data with the teacher model's predictions, while EM makes the model more confident in its predictions on unlabeled data. Hence, both approaches force the model to trust the predictions of a teacher model; if the teacher initially makes wrong predictions, the errors propagate to the student model. Unlike them, CVT and VAT/SeqVAT construct similar sentences which are expected to have the same labels, and force the model to make consistent predictions on them. If the model makes an incorrect prediction for the original sentence, CVT and VAT/SeqVAT can form a "discussion" to reach an agreement between the prediction for the original sentence and those for the similar sentences; if the model predicts correctly on some of the similar utterances, it has a chance to fix the error. Consequently, CVT and VAT/SeqVAT are generally expected to be more effective than ST and EM in the use of unlabeled data. The major difference between CVT and VAT is the mechanism for selecting similar sentences: CVT takes segments of the original sentence, while VAT/SeqVAT generates new sentences by replacing tokens in the original sentence with their neighbors in the embedding space. Each approach has its own benefits and problems: 1) CVT can handle different tokens in a similar context, but introduces noise when the key words carrying the meaning are not in the segments; 2) VAT generates truly similar sentences, but it might not cover synonyms that are far apart in the embedding space. Hence, their effectiveness depends highly on the data. As shown in Table 2, CVT and VAT may outperform each other on different domains/datasets. The improvements of SeqVAT over CVT and VAT can be explained by its compatibility with CRF: CRF is a critical component for some sequence labeling tasks (including the three in this paper), so compatibility with CRF largely affects the effectiveness of semi-supervised approaches. In other tasks where label transitions are less important, we might not see significant gains from SeqVAT over VAT or CVT.

Insights from K-best Estimation
To make VAT compatible with CRF, we propose estimating the label sequence distribution with a k-best approximation. This provides a way to optimize the distribution at the label sequence level directly, rather than working on per-token label distributions. The idea could also benefit other tasks that need distribution transfer in sequence models, such as knowledge distillation and multi-source transfer learning.

Conclusion
In this paper, we propose a CRF-compatible VAT training algorithm and demonstrate that sequence labeling tasks can greatly benefit from it. Our proposed method, SeqVAT, substantially improves model robustness and accuracy on supervised sequence labeling tasks. In addition, SeqVAT is also highly effective in semi-supervised settings, outperforming traditional semi-supervised algorithms (ST and EM) as well as a state-of-the-art approach (CVT). Overall, our approach is highly effective for chunking, NER and slot filling, and can be easily extended to other sequence labeling problems in both supervised and semi-supervised settings.