GAN-BERT: Generative Adversarial Learning for Robust Text Classification with a Bunch of Labeled Examples

Recent Transformer-based architectures, e.g., BERT, provide impressive results in many Natural Language Processing tasks. However, most of the adopted benchmarks are made of (sometimes hundreds of) thousands of examples. In many real scenarios, obtaining high-quality annotated data is expensive and time-consuming; in contrast, unlabeled examples characterizing the target task can generally be collected easily. One promising method to enable semi-supervised learning has been proposed in image processing, based on Semi-Supervised Generative Adversarial Networks. In this paper, we propose GAN-BERT, which extends the fine-tuning of BERT-like architectures with unlabeled data in a generative adversarial setting. Experimental results show that the requirement for annotated examples can be drastically reduced (to as few as 50-100 annotated examples), while still obtaining good performance in several sentence classification tasks.


Introduction
In recent years, Deep Learning methods have become very popular in Natural Language Processing (NLP): they reach high performance by relying on very simple input representations (for example, in (Kim, 2014; Goldberg, 2016; Kim et al., 2016)). In particular, Transformer-based architectures, e.g., BERT (Devlin et al., 2019), provide representations of their inputs as a result of a pre-training stage. These are, in fact, trained over large-scale corpora and then effectively fine-tuned on a target task, achieving state-of-the-art results in different and heterogeneous NLP tasks. These achievements are obtained when thousands of annotated examples exist for the final task. As shown in this work, the quality of BERT fine-tuned on fewer than 200 annotated instances drops significantly, especially in classification tasks involving many categories. Unfortunately, obtaining annotated data is a time-consuming and costly process. A viable solution is adopting semi-supervised methods, such as in (Weston et al., 2008; Chapelle et al., 2010; Yang et al., 2016; Kipf and Welling, 2016), to improve the generalization capability when few annotated examples are available, while the acquisition of unlabeled sources is possible.
One effective semi-supervised method is implemented within Semi-Supervised Generative Adversarial Networks (SS-GANs). Usually, in GANs (Goodfellow et al., 2014) a "generator" is trained to produce samples resembling some data distribution. This training process "adversarially" depends on a "discriminator", which is instead trained to distinguish samples of the generator from the real instances. SS-GANs (Salimans et al., 2016) are an extension to GANs where the discriminator also assigns a category to each example while discriminating whether it was automatically generated or not.
In SS-GANs, the labeled material is thus used to train the discriminator, while the unlabeled examples (as well as the ones automatically generated) improve its inner representations. In image processing, SS-GANs have been shown to be effective: exposed to a few dozen labeled examples (but thousands of unlabeled ones), they obtain performance competitive with fully supervised settings.
In this paper, we extend the BERT training with unlabeled data in a generative adversarial setting. In particular, we enrich the BERT fine-tuning process with an SS-GAN perspective, in the so-called GAN-BERT model. That is, a generator produces "fake" examples resembling the data distribution, while BERT is used as a discriminator. In this way, we exploit both the capability of BERT to produce high-quality representations of input texts and the possibility to adopt unlabeled material to help the network in generalizing its representations for the final tasks. To the best of our knowledge, using SS-GANs in NLP has been investigated only by (Croce et al., 2019) with the so-called Kernel-based GAN. In that work, the authors extend a Kernel-based Deep Architecture (KDA, (Croce et al., 2017)) with an SS-GAN perspective. Sentences are projected into low-dimensional embeddings, which approximate the implicit space generated by using a Semantic Tree Kernel function. However, that work only marginally investigated how the GAN perspective could extend deep architectures for NLP tasks. In particular, a Kernel-based GAN operates in a pre-computed embedding space by approximating a kernel function (Annesi et al., 2014). While the SS-GAN improves the quality of the Multi-layered Perceptron used in the KDA, it does not affect the input representation space, which is statically derived by the kernel space approximation. In the present work, all the parameters of the network are instead considered during the training process, in line with the SS-GAN approaches.
We empirically demonstrate that the SS-GAN schema applied over BERT, i.e., GAN-BERT, reduces the requirement for annotated examples: even with fewer than 200 annotated examples it is possible to obtain results comparable with a fully supervised setting. In any case, the adopted semi-supervised schema always improves the results obtained by BERT.
In the rest of this paper, section 2 provides an introduction to SS-GANs. In sections 3 and 4, GAN-BERT and the experimental evaluations are presented. In section 5, conclusions are drawn.

Semi-supervised GANs
SS-GANs (Salimans et al., 2016) enable semi-supervised learning in a GAN framework. A discriminator is trained over a (k + 1)-class objective: "true" examples are classified into one of the target (1, ..., k) classes, while the generated samples are classified into the k + 1 class.
More formally, let $D$ and $G$ denote the discriminator and the generator, and let $p_d$ and $p_G$ denote the real data distribution and the distribution of generated examples, respectively. In order to train a semi-supervised $k$-class classifier, the objective of $D$ is extended as follows. Let us define $p_m(\hat{y} = y \mid x, y = k+1)$ as the probability provided by the model $m$ that a generic example $x$ is associated with the fake class, and $p_m(\hat{y} = y \mid x, y \in (1, \dots, k))$ as the probability that $x$ is considered real, thus belonging to one of the target classes. The loss function of $D$ is defined as $\mathcal{L}_D = \mathcal{L}_{D_{sup.}} + \mathcal{L}_{D_{unsup.}}$, where

$$\mathcal{L}_{D_{sup.}} = -\mathbb{E}_{x,y \sim p_d} \log\big[p_m(\hat{y} = y \mid x, y \in (1, \dots, k))\big]$$

$$\mathcal{L}_{D_{unsup.}} = -\mathbb{E}_{x \sim p_d} \log\big[1 - p_m(\hat{y} = y \mid x, y = k+1)\big] - \mathbb{E}_{x \sim G} \log\big[p_m(\hat{y} = y \mid x, y = k+1)\big]$$

$\mathcal{L}_{D_{sup.}}$ measures the error in assigning the wrong class to a real example among the original $k$ categories. $\mathcal{L}_{D_{unsup.}}$ measures the error in incorrectly recognizing a real (unlabeled) example as fake and in not recognizing a fake example.
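The two discriminator terms can be illustrated numerically. The following is a minimal NumPy sketch (not the authors' TensorFlow implementation; all names are hypothetical), where the last index of the (k+1)-way softmax plays the role of the fake class:

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the last axis
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def d_loss(logits_lab, labels, logits_unlab, logits_fake, k, eps=1e-8):
    """Discriminator loss over k+1 classes; column k is the 'fake' class."""
    p_lab = softmax(logits_lab)      # labeled real examples
    p_unlab = softmax(logits_unlab)  # unlabeled real examples
    p_fake = softmax(logits_fake)    # generated examples
    # supervised term: cross-entropy over the k real classes
    l_sup = -np.log(p_lab[np.arange(len(labels)), labels] + eps).mean()
    # unsupervised term: real examples should not fall into the fake class ...
    l_unsup = -np.log(np.clip(1.0 - p_unlab[:, k], eps, 1.0)).mean()
    # ... while fake examples should
    l_unsup += -np.log(p_fake[:, k] + eps).mean()
    return l_sup + l_unsup
```

The `eps` guard simply avoids `log(0)`; when D is confident and correct on all three batches, both terms approach zero.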
At the same time, $G$ is expected to generate examples similar to the ones sampled from the real distribution $p_d$. As suggested in (Salimans et al., 2016), $G$ should generate data approximating the statistics of real data as much as possible; in other words, the average example generated in a batch by $G$ should be similar to the real prototypical one. Formally, let $f(x)$ denote the activation on an intermediate layer of $D$. The feature-matching loss of $G$ is then defined as

$$\mathcal{L}_{G_{feature\,matching}} = \big\|\mathbb{E}_{x \sim p_d} f(x) - \mathbb{E}_{x \sim G} f(x)\big\|_2^2$$

that is, the generator should produce examples whose intermediate representations, provided as input to $D$, are very similar to the real ones. The $G$ loss also considers the error induced by fake examples correctly identified by $D$:

$$\mathcal{L}_{G_{unsup.}} = -\mathbb{E}_{x \sim G} \log\big[1 - p_m(\hat{y} = y \mid x, y = k+1)\big]$$

so that the overall generator loss is $\mathcal{L}_G = \mathcal{L}_{G_{feature\,matching}} + \mathcal{L}_{G_{unsup.}}$. While SS-GANs are usually applied to image inputs, we will show that they can be adopted in combination with BERT (Devlin et al., 2019) over inputs encoding linguistic information.
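The generator objective can likewise be sketched in NumPy (a minimal illustration with hypothetical names, not the authors' code); `f_real` and `f_fake` stand for D's intermediate activations f(x) on a real and a generated batch:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def g_loss(f_real, f_fake, logits_fake, k, eps=1e-8):
    """Generator loss: feature matching plus the unsupervised term (sketch)."""
    # feature matching: batch means of D's intermediate activations should agree
    l_feat = np.square(f_real.mean(axis=0) - f_fake.mean(axis=0)).sum()
    # fake examples that D confidently assigns to the fake class penalize G
    p_fake = softmax(logits_fake)
    l_unsup = -np.log(np.clip(1.0 - p_fake[:, k], eps, 1.0)).mean()
    return l_feat + l_unsup
```

When the generated batch statistics match the real ones, the feature-matching term vanishes and only the adversarial term remains.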

GAN-BERT
BERT (Devlin et al., 2019) is a very deep model that is pre-trained over large corpora of raw texts and then fine-tuned on target annotated data. The building block of BERT is the Transformer (Vaswani et al., 2017), an attention-based mechanism that learns contextual relations between words (or sub-words, i.e., word pieces (Schuster and Nakajima, 2012)) in a text.
BERT provides contextualized embeddings of the words composing a sentence as well as a sentence embedding capturing sentence-level semantics: the pre-training of BERT is designed to capture such information by relying on very large corpora. After the pre-training, BERT allows encoding (i) the words of a sentence, (ii) the entire sentence, and (iii) sentence pairs in dedicated embeddings. These can be used as input to further layers to solve sentence classification, sequence labeling, or relational learning tasks: this is achieved by adding task-specific layers and by fine-tuning the entire architecture on annotated data.
In this work, we extend BERT by using SS-GANs for the fine-tuning stage. We take an already pre-trained BERT model and adapt the fine-tuning by adding two components: i) task-specific layers, as in the usual BERT fine-tuning; ii) SS-GAN layers to enable semi-supervised learning. Without loss of generality, let us assume we are facing a sentence classification task over $k$ categories. Given an input sentence $s = (t_1, \dots, t_n)$, BERT produces as output $n + 2$ vector representations in $\mathbb{R}^d$, i.e., $(h_{CLS}, h_{t_1}, \dots, h_{t_n}, h_{SEP})$. As suggested in (Devlin et al., 2019), we adopt the $h_{CLS}$ representation as the sentence embedding for the target tasks.
As shown in figure 1, we add the SS-GAN architecture on top of BERT by introducing i) a discriminator $D$ for classifying examples and ii) a generator $G$ acting adversarially. In particular, $G$ is a Multi-Layer Perceptron (MLP) that takes as input a 100-dimensional noise vector drawn from $\mathcal{N}(\mu, \sigma^2)$ and produces as output a vector $h_{fake} \in \mathbb{R}^d$. The discriminator is another MLP that receives as input a vector $h_* \in \mathbb{R}^d$; $h_*$ can be either $h_{fake}$, produced by the generator, or $h_{CLS}$, for unlabeled or labeled examples from the real distribution. The last layer of $D$ is a softmax-activated layer, whose output is a vector of $k + 1$ logits, as discussed in section 2.
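The shapes involved can be checked with a toy forward pass. This is a minimal NumPy sketch assuming BERT-base sizes (768-dimensional embeddings, 100-dimensional noise, one hidden layer per MLP, as stated in the paper); the hidden width, the weight initialisation, and all names are hypothetical, and dropout and training are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, noise_dim, hidden = 768, 6, 100, 768  # assumed sizes; k target classes

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

# hypothetical parameter initialisation: one hidden layer each for G and D
W_g1 = rng.normal(0.0, 0.02, (noise_dim, hidden))
W_g2 = rng.normal(0.0, 0.02, (hidden, d))
W_d1 = rng.normal(0.0, 0.02, (d, hidden))
W_d2 = rng.normal(0.0, 0.02, (hidden, k + 1))

def generator(z):
    return leaky_relu(z @ W_g1) @ W_g2          # h_fake in R^d

def discriminator(h):
    f = leaky_relu(h @ W_d1)                    # intermediate features f(x)
    return f, f @ W_d2                          # features and (k+1)-way logits

z = rng.normal(0.0, 1.0, (4, noise_dim))        # a batch of 4 noise vectors
h_fake = generator(z)                           # stands in for h_CLS of real data
features, logits = discriminator(h_fake)
```

The same `discriminator` call would receive $h_{CLS}$ vectors for real (labeled or unlabeled) sentences.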
During the forward step, when real instances are sampled (i.e., $h_* = h_{CLS}$), $D$ should classify them into one of the $k$ categories; when $h_* = h_{fake}$, it should classify each example into the $k + 1$ category. As discussed in section 2, the training process tries to optimize two competing losses, i.e., $\mathcal{L}_D$ and $\mathcal{L}_G$.
During back-propagation, the unlabeled examples contribute only to $\mathcal{L}_{D_{unsup.}}$, i.e., they are considered in the loss computation only if they are erroneously classified into the $k + 1$ category; in all other cases, their contribution to the loss is masked out. The labeled examples instead contribute to the supervised loss $\mathcal{L}_{D_{sup.}}$. Finally, the examples generated by $G$ contribute to both $\mathcal{L}_D$ and $\mathcal{L}_G$, i.e., $D$ is penalized when it fails to recognize examples generated by $G$, and vice-versa. When updating $D$, we also change the BERT weights in order to fine-tune its inner representations, thus considering both the labeled and the unlabeled data.
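The masking of unlabeled examples in the supervised term can be sketched as follows (hypothetical NumPy code, not the authors' implementation): unlabeled rows are zeroed out of the cross-entropy before averaging, so only labeled examples drive the supervised loss:

```python
import numpy as np

def masked_sup_loss(probs, labels, is_labeled, eps=1e-8):
    """Supervised loss with unlabeled examples masked out (sketch).

    probs: (n, k+1) softmax outputs of D; labels: class indices,
    whose values are ignored wherever is_labeled is False."""
    nll = -np.log(probs[np.arange(len(labels)), labels] + eps)
    mask = is_labeled.astype(float)
    # average only over labeled rows; guard against an all-unlabeled batch
    return (nll * mask).sum() / max(mask.sum(), 1.0)
```

A symmetric mask (selecting only the fake-class probability) implements the unlabeled contribution described above.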
After training, G is discarded while retaining the rest of the original BERT model for inference. This means that there is no additional cost at inference time with respect to the standard BERT model. In the following, we will refer to this architecture as GAN-BERT.

Experimental Results
In this section, we assess the impact of GAN-BERT on sentence classification tasks characterized by different training conditions, i.e., number of examples and number of categories. We measure the performance of our approach, when exposed to few labeled examples, on the following tasks: Topic Classification on the 20 News Group (20N) dataset (Lang, 1995), Question Classification (QC) on the UIUC dataset (Li and Roth, 2006), and Sentiment Analysis on the SST-5 dataset (Socher et al., 2013). We will also report the performances on a sentence-pair task, i.e., the MNLI dataset (Williams et al., 2018). For each task, we report the metric commonly used for that specific dataset, i.e., accuracy for SST-5 and QC, while F1 is used for the 20N and MNLI datasets. As a comparison, we report the performances of the BERT-base model fine-tuned as described in (Devlin et al., 2019) on the available training material. We used BERT-base as the starting point also for the training of our approach.
GAN-BERT is implemented in Tensorflow by extending the original BERT implementation. In more detail, G is implemented as an MLP with one hidden layer activated by a leaky-relu function. G inputs consist of noise vectors drawn from a normal distribution N(0, 1); the noise vectors pass through the MLP and finally result in 768-dimensional vectors, which are used as fake examples in our architecture. D is also an MLP with one hidden layer activated by a leaky-relu function, followed by a softmax layer for the final prediction. For both G and D we used dropout=0.1 after the hidden layer. The learning rate was set to 2e-5 for all tasks, except for 20N (5e-6). We repeated the training of each model with an increasing set of annotated material (L), starting by sampling only 0.01% or 1% of the training set, in order to measure the performances under scarce labeled data.
In the QC task we observe similar outcomes. The training dataset is made of about 5,400 questions. In the coarse-grained setting (figure 2b), 6 classes are involved; in the fine-grained scenario (figure 2c), the number of classes is 50. In both cases, BERT diverges when only 1% of labeled questions are used, i.e., about 50 questions. It starts to compensate when using about 20% of the data in the coarse setting (about 1,000 labeled examples). In the fine-grained scenario, our approach performs better up to 50% of the labeled examples.
It seems that, when a large number of categories is involved, i.e., the classification task is more complex, the semi-supervised setting is even more beneficial.
The results are confirmed in sentiment analysis on the SST-5 dataset (figure 2d), i.e., sentence classification involving 5 polarity categories. Also in this setting, we observe that GAN-BERT is beneficial with respect to BERT when few annotated examples are available. Finally, we report the performances on Natural Language Inference on the MNLI dataset. We observe (in figures 2e and 2f) a systematic improvement starting from 0.01% labeled examples (about 40 instances): GAN-BERT provides about 6-10 additional points in F1 with respect to BERT (18.09% vs. 29.19% and 18.01% vs. 31.64%, for the mismatched and matched settings, respectively). This trend is confirmed until 0.5% of annotated material (about 2,000 annotated examples): GAN-BERT reaches 62.67% and 60.45% while BERT reaches 48.35% and 42.41%, for mismatched and matched, respectively. Using more annotated data results in very similar performances, with a slight advantage for GAN-BERT. Even if acquiring unlabeled examples for sentence pairs is not trivial, these results give a hint about the potential benefits on similar tasks (e.g., question-answer classification).

Conclusion
In this paper, we extended the limits of Transformer-based architectures (i.e., BERT) in poor training conditions. Experiments confirm that fine-tuning such architectures with few labeled examples leads to unstable models whose performances are not acceptable. We suggest here adopting adversarial training to enable semi-supervised learning in Transformer-based architectures. The evaluations show that the proposed variant of BERT, namely GAN-BERT, systematically improves the robustness of such architectures, while not introducing additional costs at inference time: in fact, the generator network is only used in training, while at inference time only the discriminator is necessary.
This first investigation paves the way to several extensions, including the adoption of other architectures, such as GPT-2 (Radford et al., 2019) or DistilBERT (Sanh et al., 2019), and other tasks, e.g., Sequence Labeling or Question Answering. Moreover, we will investigate the potential impact of adversarial training directly in the BERT pre-training. From a linguistic perspective, it is worth investigating what the generator encodes in the produced representations.