Cloze-driven Pretraining of Self-attention Networks

We present a new approach for pretraining a bi-directional transformer model that provides significant performance gains across a variety of language understanding problems. Our model solves a cloze-style word reconstruction task, where each word is ablated and must be predicted given the rest of the text. Experiments demonstrate large performance gains on GLUE and new state of the art results on NER as well as constituency parsing benchmarks, consistent with BERT. We also present a detailed analysis of a number of factors that contribute to effective pretraining, including data domain and size, model capacity, and variations on the cloze objective.


Introduction
Language model pretraining has recently been shown to provide significant performance gains for a range of challenging language understanding problems (Dai and Le, 2015; Peters et al., 2018; Radford et al., 2018). However, existing work has either used unidirectional (left-to-right) language models (LMs) (Radford et al., 2018) or bi-directional (both left-to-right and right-to-left) LMs (BiLMs) where each direction is trained with an independent loss function (Peters et al., 2018). In this paper, we show that even larger performance gains are possible by jointly pretraining both directions of a large language-model-inspired self-attention cloze model.
Our bi-directional transformer architecture predicts every token in the training data (Figure 1). We achieve this by introducing a cloze-style training objective where the model must predict the center word given left-to-right and right-to-left context representations. Our model separately computes both forward and backward states with a masked self-attention architecture that closely resembles a language model. At the top of the network, the forward and backward states are combined to jointly predict the center word. This approach allows us to consider both contexts when predicting words and to incur loss for every word in the training set if the model does not assign it high likelihood.
Figure 1: Illustration of the model. Block i is a standard transformer decoder block. Green blocks operate left to right by masking future time-steps and blue blocks operate right to left. At the top, states are combined with a standard multi-head self-attention module whose output is fed to a classifier that predicts the center token.
* Equal contribution.
Experiments on the GLUE (Wang et al., 2018) benchmark show strong gains over the state of the art for each task, including a 9.1 point gain on RTE over Radford et al. (2018). These improvements are consistent with, if slightly behind, BERT (Devlin et al., 2018), which we will discuss in more detail in the next section. We also show that it is possible to stack task-specific architectures for NER and constituency parsing on top of our pretrained representations, and achieve new state-of-the-art performance levels for both tasks.
We also present extensive experimental analysis to better understand these results, showing that (1) having multiple sentences in each training example is crucial for many tasks; (2) pre-training continues to improve performance with up to 18B tokens and would likely continue to improve with more data; and finally (3) our novel cloze-driven training regime is more effective than predicting left and right tokens separately.

Related work
There has been much recent work on learning sentence-specific representations for language understanding tasks. McCann et al. (2017) learn contextualized word representations from a sequence-to-sequence translation task and use the representations from the encoder network to improve a variety of language understanding tasks. Subsequent work focused on language modeling pretraining, which has been shown to be more effective and which does not require bilingual data (Zhang and Bowman, 2018).
Our work was inspired by ELMo (Peters et al., 2018) and the generative pretraining (GPT) approach of Radford et al. (2018). ELMo introduces language models to pretrain word representations for downstream tasks including a novel mechanism to learn a combination of different layers in the language model that is most beneficial to the current task. GPT relies on a left to right language model and an added projection layer for each downstream task without a task-specific model. Our approach mostly follows GPT, though we show that our model also works well with an ELMo module on NER and constituency parsing.
The BERT model (Devlin et al., 2018) is a transformer encoder model that captures left and right context. There is significant overlap between their work and ours but there are also significant differences: our model is a bi-directional transformer language model that predicts every single token in a sequence. Our model has two unidirectional components encoding either the left or right context and both are combined to predict center words. BERT is also a transformer encoder that has access to the entire input but this choice requires a special training regime. In particular, they multi-task between predicting a subset of masked input tokens, similar to a denoising autoencoder, and a next sentence prediction task. In comparison, we optimize a single loss function that requires the model to predict each token of an input sentence given all surrounding tokens. We use all tokens as training targets and therefore extract learning signal from every single token in the sentence and not just a subset. Melamud et al. (2016) follow a similar approach to ours by predicting the center word but their architecture is based on LSTMs and we include the center word when we actually fine-tune on downstream tasks.
BERT tailors pretraining to capture dependencies between sentences via a next sentence prediction task as well as by constructing training examples of sentence-pairs with input markers that distinguish between tokens of the two sentences. Our model is trained similarly to a classical language model since we do not adapt the training examples to resemble the end task data and we do not solve a denoising task during training.
Finally, BERT as well as Radford et al. (2018) consider only a single data source to pretrain their models, either BooksCorpus (Radford et al., 2018), or BooksCorpus and additional Wikipedia data (Devlin et al., 2018), whereas our study ablates the effect of various amounts of training data as well as different data sources.

Two tower model
Our cloze model represents a probability distribution $p(t_i \mid t_1, \dots, t_{i-1}, t_{i+1}, \dots, t_n)$ for a sentence with $n$ tokens $t_1, \dots, t_n$. There are two self-attentional towers each consisting of $N$ stacked blocks: the forward tower operates left-to-right and the backward tower operates in the opposite direction. To predict a token, we combine the representations of the two towers, as described in more detail below, taking care that neither representation contains information about the current target token.
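Written out, a training objective consistent with this description sums the negative log-likelihood of every token given its full left and right context. The name $\mathcal{L}_{\text{cloze}}$ is introduced here only for illustration, and summing (rather than averaging) over tokens is an assumption:

```latex
\mathcal{L}_{\text{cloze}}
  = -\sum_{i=1}^{n} \log p\left(t_i \mid t_1, \dots, t_{i-1}, t_{i+1}, \dots, t_n\right)
```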
The forward tower computes the representation $F^l_i$ for token $i$ at layer $l$ based on the forward representations of the previous layer $F^{l-1}_{\le i}$ via self-attention; the backward tower computes representation $B^l_i$ based on information from the opposite direction $B^{l-1}_{\ge i}$. When examples of uneven length are batched, one of the towers may not have any context at the beginning. We deal with this issue by adding an extra zero state over which the self-attention mechanism can attend.
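The visibility pattern inside each tower can be sketched as a pair of boolean attention masks. This is a minimal illustration in plain Python, not the fairseq implementation; the function name `tower_masks` is hypothetical, positions are 0-based, and the extra zero state used for padded batches is not modelled.

```python
def tower_masks(n):
    """Return (forward, backward) attention masks for a sequence of n tokens.

    mask[i][j] is True when position i may attend to position j.
    """
    # Forward tower: position i sees positions j <= i (future time-steps masked).
    forward = [[j <= i for j in range(n)] for i in range(n)]
    # Backward tower: position i sees positions j >= i (past time-steps masked).
    backward = [[j >= i for j in range(n)] for i in range(n)]
    return forward, backward


fwd, bwd = tower_masks(4)
for row in fwd:
    # Print one row per query position; "x" marks a visible key position.
    print("".join("x" if allowed else "." for allowed in row))
```

The forward mask is lower-triangular and the backward mask is its transpose, so the two towers can share the same block implementation and differ only in the mask they are given.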
We pretrain on individual examples as they occur in the training corpora (§5.1). For News Crawl these are individual sentences, while on Wikipedia, BooksCorpus, and Common Crawl examples are of paragraph length. Sample boundary markers <s> are prepended and appended to each example.

Block structure
The structure of the blocks follows most of the architectural choices described in Vaswani et al. (2017). Each block consists of two sub-blocks: the first is a multi-head self-attention module with $H = 16$ heads for which we mask out any subsequent time-steps, depending on whether we are dealing with the forward or backward tower. The second sub-block is a feed-forward module (FFN) of the form used by Vaswani et al. (2017), $\max(0, xW_1 + b_1)W_2 + b_2$. Different to Vaswani et al. (2017), we apply layer normalization before the self-attention and FFN blocks instead of after, as we find it leads to more effective training. Sub-blocks are surrounded by a residual connection (He et al., 2015). Position is encoded via fixed sinusoidal position embeddings and we use a character CNN encoding of the input tokens for word-based models (Kim et al., 2016). Input embeddings are shared between the two towers.
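The difference between applying layer normalization before and after a sub-block can be sketched on a single vector. This is an illustrative toy, assuming a layer norm without learned gain and bias; the helper names and the stand-in sub-layer are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_sub_block(x, sublayer):
    """Ordering described in the text: layer norm, sub-layer, then residual."""
    y = sublayer(layer_norm(x))
    return [xi + yi for xi, yi in zip(x, y)]


def post_norm_sub_block(x, sublayer):
    """Ordering of Vaswani et al. (2017): sub-layer, residual, then layer norm."""
    y = sublayer(x)
    return layer_norm([xi + yi for xi, yi in zip(x, y)])


def toy_sublayer(v):
    """Hypothetical stand-in for self-attention or the FFN: an element-wise ReLU."""
    return [max(0.0, a) for a in v]


x = [1.0, -2.0, 3.0]
print(pre_norm_sub_block(x, toy_sublayer))
print(post_norm_sub_block(x, toy_sublayer))
```

With the pre-norm ordering the residual path carries the input through unchanged, which is one commonly cited reason it can be easier to optimize in deep stacks.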

Combination of representations
The forward and backward representations computed by the two towers are combined to predict the ablated word. To combine them we use a self-attention module which is followed by an FFN block (§3.1). The output $f$ of the FFN block is projected by $W$ into $V$ classes representing the types in the vocabulary, $W^\top f$, to which a softmax is applied. When the model predicts token $i$, the inputs to the attention module are the forward states $F^L_1, \dots, F^L_{i-1}$ and the backward states $B^L_{i+1}, \dots, B^L_n$, where $n$ is the length of the sequence and $L$ is the number of layers. We implement this by masking $B^L_{\le i}$ and $F^L_{\ge i}$. The attention query for token $i$ is a combination of $F^L_{i-1}$ and $B^L_{i+1}$. For the base model we sum the two representations and for the larger models they are concatenated. Keys and values are based on the forward and backward states fed to the attention module. In summary, this module has access to information about the entire input surrounding the current target token. During training, we predict every token in this way. The output of this module is fed to an output classifier which predicts the center token. We use an adaptive softmax for the output classifier (Grave et al., 2017) for the word-based models and a regular softmax for the BPE-based models.
[Figure caption fragment: "... Figure 1). The red dot-dashed arrows show connections that are masked during training, but unmasked for fine-tuning."]
While all states that contain information about the current target word are masked in the final self-attention block during training, we found it beneficial to disable this masking when fine-tuning the pretrained model for downstream tasks. This is especially true for tasks that label each token, such as NER, as this allows the model to access the full context including the token itself.
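The masking rule of the combination layer, and its removal during fine-tuning, can be sketched as a function that lists which top-layer states are visible for a given target. The function name and the 1-based position convention are assumptions made for this illustration.

```python
def combination_inputs(n, i, fine_tuning=False):
    """List the top-layer tower states visible when predicting target i.

    Positions are 1-based, matching the text. Returns a pair
    (forward_positions, backward_positions).
    """
    if fine_tuning:
        # Masking disabled: every forward and backward state is visible.
        all_positions = list(range(1, n + 1))
        return all_positions, list(all_positions)
    # Pretraining: forward states strictly left of i, backward strictly right of i.
    forward = list(range(1, i))
    backward = list(range(i + 1, n + 1))
    return forward, backward


print(combination_inputs(5, 3))
print(combination_inputs(5, 3, fine_tuning=True))
```

During pretraining neither list contains position i, so the classifier cannot read the target token from its own states; with `fine_tuning=True` the token itself becomes visible.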

Fine-tuning
We use the following approach to fine-tune the pretrained two tower model to specific downstream tasks (Figure 2).
Classification and regression tasks. For single sentence classification tasks, we consider the language model outputs for the boundary tokens <s> which we add at the start and end of each sentence. The language model outputs are the representations $f$ just before the final softmax layer (§3.2). The outputs are of dimension $d = 1024$ and we concatenate them to project to the number of classes $C$ in the downstream task with $W_1 \in \mathbb{R}^{C \times 2d}$ (Radford et al., 2018); we add a bias term $b \in \mathbb{R}^{C}$ and initialize all weights as well as the bias to zero. The output of the projection is softmax-normalized and the model is optimized with cross-entropy for classification tasks. Regression tasks such as the Semantic Textual Similarity benchmark (STS-B; Cer et al., 2017) use $C = 1$ and are trained with mean squared error. For tasks involving sentence-pairs, we concatenate them and add a new separator token <sep> between them. We add the output of this token to the final projection $W_2 \in \mathbb{R}^{C \times 3d}$.
Structured prediction tasks. For named entity recognition and parsing we use task-specific architectures which we fine-tune together with the language model but with a different learning rate. The architectures are detailed in the respective results sections. The inputs to the architectures are the output representations of the pretrained language model.
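The single-sentence classification head described above can be sketched in a few lines of plain Python. The function name `classify` and the toy sizes are assumptions for illustration; the paper uses $d = 1024$.

```python
import math


def classify(f_start, f_end, W1, b):
    """Project the concatenated boundary outputs to C classes and apply softmax."""
    features = f_start + f_end  # list concatenation, length 2d
    logits = [sum(w * x for w, x in zip(row, features)) + bias
              for row, bias in zip(W1, b)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


d, C = 4, 3                                # toy sizes for the example
W1 = [[0.0] * (2 * d) for _ in range(C)]   # zero-initialized, shape C x 2d
b = [0.0] * C                              # zero-initialized bias
probs = classify([0.5] * d, [-0.5] * d, W1, b)
print(probs)
```

Because the weights and the bias start at zero, the head initially assigns equal probability to every class regardless of the input, so fine-tuning starts from an uninformative prediction rather than a random one.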
No Masking. For fine-tuning, we found it beneficial to remove masking of the current token in the final layer that pools the output of the two towers. This is different than in the actual pretraining. It is important to have access to information about the token to be classified for token level classification tasks such as NER but we also found this to perform better for sentence classification tasks. In practice, we completely disable masking in the combination layer so that it operates over all forward and backward states. However, disabling masking below the combination layer does not perform well.
Optimization. During fine-tuning we use larger learning rates for the new parameters, that is $W_1$, $W_2$, $b$ or the task-specific architecture, compared to the pretrained model. For GLUE tasks, we do so by simply scaling the output of the language model before the $W_1$ and $W_2$ projections by a factor of 16. For structured prediction tasks, we explicitly use different learning rates for the pretrained model and the task-specific parameters.
We fine-tune with the Adam optimizer (Kingma and Ba, 2015). For GLUE tasks, we disable dropout in the language model and add 0.1 dropout between the language model output and the final output projection; for structured prediction tasks, we use 0.3 at all levels (within the pretrained model, within the task-specific architecture, and on the weights connecting them). In all settings, we use a batch size of 16 examples. We linearly warm up the learning rate from 1e-07 to the target value over the first 10% of training steps, and then anneal it to 1e-06 following a cosine curve over the remaining steps. For GLUE tasks, we tuned the learning rate for each task and chose the best value over three settings: 1e-04, 5e-05 and 3e-05. For structured prediction tasks, we tuned pairs of learning rates; see the results section for details. For GLUE tasks, we train three seeds for each learning rate value for three epochs, evaluate after each epoch, and choose the model that performs best on the validation set. For structured prediction tasks, we train for up to 25 epochs and stop if the validation loss does not improve over the previous epoch.
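The fine-tuning learning rate schedule (linear warm-up followed by cosine annealing) can be sketched as a function of the step index. The function name and its parameter names are hypothetical; the default values are the ones quoted in the text, and the exact step at which warm-up ends is an assumption.

```python
import math


def fine_tune_lr(step, total_steps, target_lr,
                 init_lr=1e-7, final_lr=1e-6, warmup_frac=0.1):
    """Learning rate at `step` (0-based) for a run of `total_steps` steps."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        # Linear warm-up from init_lr towards target_lr.
        return init_lr + (target_lr - init_lr) * step / warmup_steps
    # Cosine anneal from target_lr down to final_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (target_lr - final_lr) * (1 + math.cos(math.pi * progress))


for s in (0, 50, 100, 550, 1000):
    print(s, fine_tune_lr(s, total_steps=1000, target_lr=1e-4))
```

The schedule starts at `init_lr`, reaches `target_lr` when warm-up ends, and decays to `final_lr` at the last step.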

Datasets for pretraining
We train the two tower model on several datasets.
Common Crawl. We consider various subsets of Common Crawl, which is web data. We follow the same pre-processing as Grave et al. (2018), which is based on the May 2017 Common Crawl dump. This setup adds 20 copies of English Wikipedia, so that about 14% of the final dataset is Wikipedia. We subsample up to 18B tokens. All experiments use Common Crawl subsampled to 9B tokens, except §6.4.
News Crawl. We use up to 4.5B words of English news web data distributed as part of WMT 2018 (Bojar et al., 2018).
BooksCorpus + Wikipedia. This is similar to the training data used by BERT which comprises the BooksCorpus (Zhu et al., 2015) of about 800M words plus English Wikipedia data of 2.5B words.

Pretraining hyper-parameters
We adapt the transformer implementation available in the fairseq toolkit to our two tower architecture (Ott et al., 2019). For hyper-parameter and optimization choices we mostly follow Baevski and Auli (2018). Our experiments consider three model sizes shown in Table 1: there are two CNN input models in a base and large configuration as well as a Byte-Pair-Encoding based model (BPE; Sennrich et al., 2016). The CNN models have an unconstrained input vocabulary, and an output vocabulary limited to the 1M most common types for the large model and the 700K most common types for the base model. CNN models use an adaptive softmax in the output: the head band contains the 60K most frequent types with dimensionality [...] Every setup uses model dimensionality $d = 1024$ with $H = 16$ attention heads for all but the final attention layer. Models based on character inputs use character embedding size 128 and we apply six filters of size 1x128, 2x256, 3x384, 4x512, 5x512, 6x512 followed by a single highway layer. The models are trained with model and attention dropout rate of 0.1 and ReLU dropout rate of 0.05. Different to Vaswani et al. (2017) we use Nesterov's accelerated gradient method (Sutskever et al., 2013) with a momentum of 0.99 and we renormalize gradients if their norm exceeds 0.1 (Pascanu et al., 2013). The learning rate is linearly warmed up from $10^{-7}$ to 1 for 16K steps and then annealed using a cosine learning rate schedule with a single phase to 0.0001 (Loshchilov and Hutter, 2016).
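Gradient renormalization at a threshold of 0.1 can be sketched as follows. This is a plain-Python illustration over a flat list of gradient values, assuming the L2 norm is taken over all parameters jointly; the function name is hypothetical and this is not the fairseq code path.

```python
import math


def renormalize_gradients(grads, max_norm=0.1):
    """Rescale gradient values so that their joint L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        # Shrink every component by the same factor; the direction is preserved.
        scale = max_norm / norm
        return [g * scale for g in grads]
    # Norm already within the threshold: leave the gradient untouched.
    return list(grads)


print(renormalize_gradients([3.0, 4.0]))    # norm 5.0 is rescaled down to 0.1
print(renormalize_gradients([0.03, 0.04]))  # norm 0.05 is left unchanged
```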
We run experiments on DGX-1 machines with 8 NVIDIA V100 GPUs; machines are interconnected by Infiniband. We also use the NCCL2 library and the torch.distributed package for inter-GPU communication. We train models with 16-bit floating point precision, following Ott et al. (2018). The BPE model trains much faster than the character CNN models (Table 1).

GLUE
First, we conduct experiments on the general language understanding evaluation benchmark (GLUE; Wang et al., 2018) and present a short overview of the tasks. More information can be found in Wang et al. (2018). There are two single-sentence classification tasks: First, the Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2018) is a binary task to judge sentence grammaticality; evaluation is in terms of the Matthews correlation coefficient (mcc). Second, the Stanford Sentiment Treebank (SST-2; Socher et al., 2013) requires judging whether movie reviews have positive or negative sentiment; evaluation is in terms of accuracy (acc).
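For reference, the Matthews correlation coefficient used for CoLA can be computed from the binary confusion counts. This is a minimal sketch of the standard formula, not the official GLUE scorer; returning 0 when the denominator vanishes is a common convention assumed here.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: the score is 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


print(matthews_corrcoef([1, 0, 1, 0], [1, 0, 1, 0]))
print(matthews_corrcoef([1, 1, 0, 0], [1, 0, 1, 0]))
```

Unlike accuracy, the coefficient stays near zero for a classifier that always predicts the majority class, which matters for unbalanced acceptability data.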
There are three tasks assessing sentence similarity: the Microsoft Research Paraphrase Corpus (MRPC; Dolan and Brockett, 2015) and the Quora Question Pairs benchmark (QQP), which we evaluate in terms of F1, and the Semantic Textual Similarity Benchmark (STS-B; Cer et al., 2017), which requires predicting a similarity score between 1 and 5 for a sentence pair; we report the Spearman correlation coefficient (scc).
We also report an average over the GLUE metrics. This figure is not comparable to the average on the official GLUE leaderboard since we exclude Winograd and do not report MRPC accuracy, STS-B Pearson correlation, or QQP accuracy. Table 2 shows results for three configurations of our approach (cf. Table 1). The BPE model has more parameters than the CNN model but does not perform better in aggregate; however, it is faster to train. All our models outperform the unidirectional transformer (OpenAI GPT) of Radford et al. (2018), although our model is about 50% larger than their model. We also show results for STILTs (Phang et al., 2018) and BERT (Devlin et al., 2018). Our CNN base model performs as well as STILTs in aggregate; however, on some tasks involving sentence-pairs, STILTs performs much better (MRPC, RTE). There is a similar trend for BERT.
STILTs adds another fine-tuning step on another downstream task which is similar to the final task. The technique is equally applicable to our approach. Training examples for our model are Common Crawl paragraphs of arbitrary length. We expect that tailoring training examples for language model pretraining to the end tasks would significantly improve performance. For example, BERT trains on exactly two sentences whereas we train on entire paragraphs.

Structured Prediction
We also evaluated performance on two structured prediction tasks, NER and constituency parsing. For both problems, we stacked task-specific architectures from recent work on top of our pretrained two tower models. We evaluate two ways of stacking: (1) ELMo-style, where the pretrained models are not fine-tuned but are linearly combined at different depths, and (2) with fine-tuning, where we set different learning rates for the task-specific layers but otherwise update all of the parameters during the task-specific training.
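ELMo-style stacking can be sketched as a learned weighted sum over the frozen per-layer representations of a token. The softmax-normalized scalar weights and the scale `gamma` follow the formulation of Peters et al. (2018) and are an assumption about how the linear combination is parameterized here; the function name is hypothetical.

```python
import math


def elmo_style_mix(layer_states, layer_logits, gamma=1.0):
    """Combine per-layer vectors for one token: gamma * sum_j softmax(w)_j * h_j."""
    m = max(layer_logits)
    exps = [math.exp(w - m) for w in layer_logits]
    total = sum(exps)
    weights = [e / total for e in exps]  # one scalar weight per layer
    dim = len(layer_states[0])
    return [gamma * sum(w * h[k] for w, h in zip(weights, layer_states))
            for k in range(dim)]


# Two layers with equal logits: the result is the plain average of the layers.
print(elmo_style_mix([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0]))
```

Only `layer_logits` and `gamma` would be trained for the downstream task in this setup, while the pretrained model that produces `layer_states` stays fixed.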

Named Entity Recognition
We evaluated span-level F1 performance on the CoNLL 2003 Named Entity Recognition (NER) task, where spans of text must be segmented and labeled as Person, Organization, Location, or Miscellaneous. We adopted the NER architecture in Peters et al. (2018), a biLSTM-CRF, with two minor modifications: (1) instead of two layers of biLSTM, we only used one, and (2) a linear projection layer was added between the token embedding and biLSTM layer. We did a grid search over pairs of learning rates, and found that the projection-biLSTM-CRF with 1E-03 and the pretrained language model with 1E-05 gave us the best result. Table 3 shows the results, with comparison to previously published ELMo BASE results (Peters et al., 2018) and the BERT models. Both of our stacking methods outperform the previous state of the art, but fine-tuning gives the biggest gain.
Table 5: Different loss functions on the development sets of GLUE (cf. Table 2). Results are based on the CNN base model (Table 1).

Constituency Parsing
We also report parseval F1 for Penn Treebank constituency parsing. We adopted the current state-of-the-art architecture (Kitaev and Klein, 2018). We again used grid search for learning rates and the number of layers in the parsing encoder, and used 8E-04 for language model fine-tuning, 8E-03 for the parsing model parameters, and two layers for the encoder. Table 4 shows the results. Here, fine-tuning is required to achieve gains over the previous state of the art, which used ELMo embeddings.

Objective functions for pretraining
The two-tower model is trained to predict the current token given representations of the entire left and right context (cloze). Next we compare this choice to two alternatives: First, Peters et al. (2018) train two language models operating left-to-right and right-to-left to predict the next word for each respective direction. We change the two-tower model to predict the next word using the individual towers only and remove the combination module on top of the two towers (bilm); however, we continue to jointly train the two towers.
Second, we combine the cloze loss with the bilm loss to obtain a triplet loss which trains the model to predict the current word given both left and right context, as well as just right or left context. The latter is much harder than the cloze task since less context is available, and therefore gradients for the bilm loss are much larger: the cloze model achieves a perplexity of about 4 whereas for the bilm it is 27-30, depending on the direction. This results in the bilm loss dominating the triplet loss, and we found that scaling the bilm term by a factor of 0.15 results in better performance. Table 5 shows that the cloze loss performs significantly better than the bilm loss and that combining the two loss types does not improve over the cloze loss by itself. We conjecture that individual left and right context prediction tasks are too different from center word prediction and that their learning signals are not complementary enough.
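The relationship between the quoted perplexities and the loss terms, and the effect of the 0.15 scaling, can be sketched numerically. The function names are hypothetical, losses are taken as average per-token negative log-likelihoods in nats, and applying the scale to the sum of both directional losses is an assumption.

```python
import math


def triplet_loss(cloze_nll, forward_nll, backward_nll, bilm_scale=0.15):
    """Cloze loss plus the down-weighted sum of the two directional bilm losses."""
    return cloze_nll + bilm_scale * (forward_nll + backward_nll)


def perplexity(nll):
    """Perplexity corresponding to an average per-token NLL in nats."""
    return math.exp(nll)


# Perplexities quoted in the text, converted to per-token NLL.
cloze_nll = math.log(4)    # about 1.39 nats
bilm_nll = math.log(27)    # about 3.30 nats per direction
print(triplet_loss(cloze_nll, bilm_nll, bilm_nll, bilm_scale=1.0))
print(triplet_loss(cloze_nll, bilm_nll, bilm_nll))
```

Without scaling, the two directional terms together contribute several times more than the cloze term; with the 0.15 factor their combined contribution drops below it.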

Domain and amount of training data
Next we investigate how much pretraining benefits from larger training corpora and how the domain of the data influences end-task performance. Figure 3 shows that more training data can significantly increase accuracy. We train all models with the exact same hyper-parameter settings on Common Crawl data using the CNN base architecture for 600K updates. We train on up to 18B Common Crawl tokens and the results suggest that more training data is likely to further increase performance. We also experiment with BooksCorpus (Zhu et al., 2015) as well as English Wikipedia, similar to Devlin et al. (2018). Examples in BooksCorpus are a mix of individual sentences and paragraphs; examples are on average 36 tokens. Wikipedia examples are longer paragraphs of 66 words on average. To reduce the effect of training on examples of different lengths, we adopted the following strategy: we concatenate all training examples into a single string and then crop blocks of 512 consecutive tokens from this string. We train on a batch of these blocks (BWiki-blck). It turns out that this strategy did not work better than our existing strategy of simply using the data as is (BWiki-sent). BooksCorpus and Wikipedia perform very well on QNLI and MNLI but less well on other tasks.
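The block-cropping strategy can be sketched as follows. The function name is hypothetical, examples are assumed to be already tokenized, and keeping the trailing partial block is an assumption since the text does not say how the remainder is handled.

```python
def crop_blocks(examples, block_size=512):
    """Concatenate tokenized examples and cut the stream into fixed-size blocks.

    The final block may be shorter than block_size (assumed behaviour).
    """
    # Flatten all examples into one token stream, ignoring example boundaries.
    stream = [tok for example in examples for tok in example]
    return [stream[i:i + block_size] for i in range(0, len(stream), block_size)]


toy = [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"]]
for block in crop_blocks(toy, block_size=4):
    print(block)
```

Note that a block can start or end in the middle of an example, so block boundaries no longer coincide with sentence or paragraph boundaries.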
In summary, more data for pretraining improves performance, keeping everything else equal. Also, pretraining on corpora that retain paragraph structure performs better than pretraining on individual sentences.

Conclusion
We presented a pretraining architecture based on a bi-directional transformer model that predicts every token in the training data. The model is trained with a cloze-style objective and predicts the center word given all left and right context.
Results on the GLUE benchmark show large gains over Radford et al. (2018) for each task, while experiments with model stacking set new state of the art performance levels for parsing and named entity recognition. We also did extensive experimental analysis to better understand these results, showing that (1) having multiple sentences in each training example is crucial for many tasks; (2) pre-training continues to improve performance up to 18B tokens and would likely continue to improve with more data; and finally (3) our novel cloze-driven training regime is more effective than predicting left and right tokens separately.
In future work, we will investigate variations of our architecture. In particular, we had initial success sharing the parameters of the two towers which allows training much deeper models without increasing the parameter count.