Can You Tell Me How to Get Past Sesame Street? Sentence-Level Pretraining Beyond Language Modeling

Natural language understanding has recently seen a surge of progress with the use of sentence encoders like ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2019) which are pretrained on variants of language modeling. We conduct the first large-scale systematic study of candidate pretraining tasks, comparing 19 different tasks both as alternatives and complements to language modeling. Our primary results support the use of language modeling, especially when combined with pretraining on additional labeled-data tasks. However, our results are mixed across pretraining tasks and show some concerning trends: In ELMo’s pretrain-then-freeze paradigm, random baselines are worryingly strong and results vary strikingly across target tasks. In addition, fine-tuning BERT on an intermediate task often negatively impacts downstream transfer. In a more positive trend, we see modest gains from multitask training, suggesting the development of more sophisticated multitask and transfer learning techniques as an avenue for further research.


Introduction
State-of-the-art models in natural language processing (NLP) often incorporate encoder functions which generate a sequence of vectors intended to represent the in-context meaning of each word in an input text. These encoders have typically been trained directly on the target task at hand, which can be effective for data-rich tasks and yields human performance on some narrowly defined benchmarks (Rajpurkar et al., 2018; Hassan et al., 2018), but is tenable only for the few tasks with millions of training data examples. This limitation has prompted interest in pretraining for these encoders: The encoders are first trained on outside data, and then plugged into a target task model. Howard and Ruder (2018), Peters et al. (2018a), Radford et al. (2018), and Devlin et al. (2019) establish that encoders pretrained on variants of the language modeling task can be reused to yield strong performance on downstream NLP tasks. Subsequent work has homed in on language modeling (LM) pretraining, finding that such models can be productively fine-tuned on intermediate tasks like natural language inference before transferring to downstream tasks (Phang et al., 2018). However, we identify two open questions: (1) How effective are tasks beyond language modeling in training reusable sentence encoders? (2) Given the recent successes of LMs with intermediate-task training, which tasks can be effectively combined with language modeling and each other?

[Figure: Intermediate Task Model; Intermediate Task Output]

* This paper supersedes "Looking for ELMo's Friends: Sentence-Level Pretraining Beyond Language Modeling", an earlier version of this work by the same authors. Correspondence to: alexwang@nyu.edu
The main contribution of this paper is a large-scale systematic study of these two questions. For the first question, we train reusable sentence encoders on 19 different pretraining tasks and task combinations and several simple baselines, using a standardized model architecture and procedure for pretraining. For the second question, we conduct additional pretraining on ELMo (Peters et al., 2018b) and BERT (Devlin et al., 2019) with 17 different intermediate tasks and task combinations. We evaluate each of these encoders on the nine target language-understanding tasks in the GLUE benchmark, yielding a total of 53 sentence encoders and 477 total trained models. We measure correlation in performance across target tasks and plot learning curves to show the effect of data volume on both pretraining and target task training.
We find that language modeling is the most effective pretraining task that we study. Multitask pretraining or intermediate task training offers modest further gains. However, we see several worrying trends:
• The margins between substantially different pretraining tasks can be extremely small in this transfer learning regime, and many pretraining tasks struggle to outperform trivial baselines.
• Many of the tasks used for intermediate task training adversely impact the transfer ability of LM pretraining.
• Different target tasks differ dramatically in what kinds of pretraining they benefit most from, but naïve multitask pretraining seems ineffective at combining the strengths of disparate pretraining tasks.
These observations suggest that while scaling up LM pretraining (as in Radford et al., 2019) is likely the most straightforward path to further gains, our current methods for multitask and transfer learning may be substantially limiting our results.

Related Work
Work on reusable sentence encoders can be traced back at least as far as the multitask model of Collobert et al. (2011). Several works focused on learning reusable sentence-to-vector encodings, where the pretrained encoder produces a fixed-size representation for each input sentence (Dai and Le, 2015; Kiros et al., 2015; Hill et al., 2016; Conneau et al., 2017). More recent reusable sentence encoders such as CoVe (McCann et al., 2017) and GPT (Radford et al., 2018) instead represent sentences as sequences of vectors. These methods work well, but most use distinct pretraining objectives, and none offers a substantial investigation of the choice of objective like the one we conduct here. We build on two methods for pretraining sentence encoders on language modeling: ELMo and BERT. ELMo consists of a forward and backward LSTM (Hochreiter and Schmidhuber, 1997), the hidden states of which are used to produce a contextual vector representation for each token in the input sequence. ELMo is adapted to target tasks by freezing the model weights and only learning a set of task-specific scalar weights that are used to compute a linear combination of the LSTM layers. BERT consists of a pretrained Transformer (Vaswani et al., 2017), and is adapted to downstream tasks by fine-tuning the entire model. Follow-up work has explored parameter-efficient fine-tuning (Stickland and Murray, 2019; Houlsby et al., 2019) and better target task adaptation via multitask fine-tuning (Phang et al., 2018; Liu et al., 2019), but work in this area is nascent.
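The layer-mixing step used to adapt ELMo can be illustrated with a small sketch. It assumes the formulation in Peters et al. (2018a), where the task-specific scalars are softmax-normalized and multiplied by a learned global scale; the function name scalar_mix and the nested-list data layout are hypothetical choices made for this illustration, not part of any released implementation.

```python
import math


def scalar_mix(layers, scalars, gamma=1.0):
    """Mix per-layer token vectors with softmax-normalized scalar weights.

    layers:  list of L layers, each a list of T token vectors of size D
             (the frozen encoder's outputs; hypothetical layout).
    scalars: list of L unnormalized weights, the only parameters that
             would be learned for a target task in this sketch.
    gamma:   global scale factor (assumed, following the ELMo paper).
    """
    # Numerically stable softmax over the layer scalars.
    largest = max(scalars)
    exps = [math.exp(s - largest) for s in scalars]
    total = sum(exps)
    weights = [e / total for e in exps]

    num_tokens = len(layers[0])
    dim = len(layers[0][0])
    mixed = []
    for t in range(num_tokens):
        # Weighted sum across layers for each dimension of token t.
        vec = [
            gamma * sum(w * layer[t][d] for w, layer in zip(weights, layers))
            for d in range(dim)
        ]
        mixed.append(vec)
    return mixed
```

With equal scalars the mix reduces to a plain average of the layers, which is a convenient sanity check when wiring such a component into a larger model.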
The successes of sentence encoder pretraining have sparked a line of work analyzing these models (Zhang and Bowman, 2018; Peters et al., 2018b; Tenney et al., 2019b; Peters et al., 2019; Tenney et al., 2019a; Liu et al., 2019, i.a.). Our work also attempts to better understand what is learned by pretrained encoders, but we study this question entirely through the lens of pretraining and fine-tuning tasks, rather than architectures or specific linguistic capabilities. Some of our experiments resemble those of Yogatama et al. (2019), who also empirically investigate transfer performance with limited amounts of data and find similar evidence of catastrophic forgetting.
Multitask representation learning in NLP is well studied, and again can be traced back at least as far as Collobert et al. (2011). Luong et al. (2016) show promising results combining translation and parsing; Subramanian et al. (2018) benefit from multitask learning in sentence-to-vector encoding; and Bingel and Søgaard (2017) and Changpinyo et al. (2018) offer studies of when multitask learning is helpful for lower-level NLP tasks.

Transfer Paradigms
We consider two recent paradigms for transfer learning: pretraining and intermediate training.
See Figure 1 for a graphical depiction.
Pretraining Our first set of experiments is designed to systematically investigate the effectiveness of a broad range of tasks in pretraining sentence encoders. For each task, we first train a randomly initialized model to convergence on that pretraining task, and then train a model for a target task on top of the trained encoder. For these experiments, we largely follow the procedure and architecture used by ELMo rather than BERT, but we expect similar trends with BERT-style models.
Intermediate Training Given the robust success of LM pretraining, we explore methods of further improving on such sentence encoders. In particular, we take inspiration from Phang et al. (2018), who show gains in first fine-tuning BERT on an intermediate task, and then fine-tuning again on a target task. Our second set of experiments investigates which tasks can be used for intermediate training to augment LM pretraining. We design experiments using both pretrained ELMo and BERT as the base encoder. When using ELMo, we follow standard procedure and train a task-specific LSTM and output component (e.g. MLP for classification, decoder for sequence generation, etc.) on top of the representations produced by ELMo. During this stage, the pretrained ELMo weights are frozen except for a set of layer mixing weights. When using BERT, we follow standard procedure and train a small task-specific output component using the [CLS] output vector while also fine-tuning the weights of the full BERT model.

For our pretraining and intermediate ELMo experiments, we train a target-task model on top of the encoder, which is again frozen throughout target task training except for a set of target-task-specific layer mixing weights. For our intermediate BERT experiments, we follow the same procedure as in intermediate training: We train a target-task model using the [CLS] representation and fine-tune the encoder throughout target task training.

We use the nine target tasks in GLUE to evaluate each of the encoders we train. GLUE is an open-ended shared task competition and evaluation toolkit for reusable sentence encoders, built around a set of nine sentence and sentence-pair tasks spanning a range of dataset sizes, paired with private test data and an online leaderboard. We evaluate each model on each of the nine tasks, and report the resulting scores and the GLUE score, a macro-average over tasks.
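As a concrete reading of "macro-average over tasks", the sketch below gives every task equal weight regardless of its dataset size. It assumes that a task reporting several metrics is first reduced to the mean of those metrics; the function name and the example task keys and numbers are placeholders for illustration, not results from our experiments.

```python
def macro_average(task_metrics):
    """Unweighted mean over tasks.

    task_metrics maps a task name to a list of metric values for that
    task. Metrics within a task are averaged first (an assumption of
    this sketch), then each task contributes equally to the final score.
    """
    if not task_metrics:
        raise ValueError("need at least one task")
    per_task = [sum(values) / len(values) for values in task_metrics.values()]
    return sum(per_task) / len(per_task)


# Placeholder values only, chosen to make the arithmetic easy to follow.
example = {"task_a": [50.0], "task_b": [80.0, 90.0]}
print(macro_average(example))
```

Because the average is unweighted, a small task moves the overall score as much as a large one, which is worth keeping in mind when reading aggregate comparisons.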

Tasks
Our experiments compare encoders pretrained or fine-tuned on a large number of tasks and task combinations, where a task is a dataset-objective function pair. We select these tasks either to serve as baselines or because they have shown promise in prior work, especially in sentence-to-vector encoding. See Appendix A for details and tasks we experimented with but which did not show strong enough performance to warrant a full evaluation.
Random Encoder A number of recent works have noted that randomly initialized, untrained LSTMs can obtain surprisingly strong downstream task performance (Zhang and Bowman, 2018; Wieting and Kiela, 2019; Tenney et al., 2019b). Accordingly, our pretraining and intermediate ELMo experiments include a baseline of a randomly initialized BiLSTM with no further training. This baseline is especially strong because our ELMo-style models use a skip connection from the input of the encoder to the output, allowing the task-specific component to see the input representations, yielding a model similar to Iyyer et al. (2015).

MNLI and QQP have previously been shown to be effective for pretraining in other settings (Conneau et al., 2017; Subramanian et al., 2018; Phang et al., 2018). Other tasks are included to represent a broad sample of labeling schemes commonly used in NLP.
Outside Tasks We train language models on two datasets: WikiText-103 (WT; Merity et al., 2017) and Billion Word Language Model Benchmark (BWB; Chelba et al., 2013). Because representations from ELMo and BERT capture left and right context, they cannot be used in conjunction with unidirectional language modeling, so we exclude this task from intermediate training experiments. We train machine translation (MT) models on WMT14 English-German (Bojar et al., 2014) and WMT17 English-Russian (Bojar et al., 2017). We train SkipThought-style sequence-to-sequence (seq2seq) models to read a sentence from WT and predict the following sentence (Kiros et al., 2015; Tang et al., 2017). We train DisSent models to read two clauses from WT that are connected by a discourse marker such as and, but, or so and predict the discourse marker (Jernite et al., 2017; Nie et al., 2019). Finally, we train seq2seq models to predict the response to a given comment from Reddit, using a previously existing dataset obtained by a third party (available on pushshift.io), comprising 18M comment-response pairs from 2008-2011. This dataset was used by Yang et al. (2018) to train sentence encoders.

1 data.quora.com/First-Quora-Dataset-Release-Question-Pairs
2 QNLI has been re-released with updated splits since the original release. We use the original splits.
Multitask Learning We consider three sets of these tasks for multitask pretraining and intermediate training: all GLUE tasks, all non-GLUE (outside) tasks, and all tasks.

Models and Experimental Details
We implement our models using the jiant toolkit, which is in turn built on AllenNLP (Gardner et al., 2017) and on a public PyTorch implementation of BERT. Appendix A presents additional details.
Encoder Architecture For both the pretraining and intermediate ELMo experiments, we process words using a pretrained character-level convolutional neural network (CNN) from ELMo. We use this pretrained word encoder for pretraining experiments to avoid potentially difficult issues with unknown word handling in transfer learning.
For the pretraining experiments, these input representations are fed to a two-layer 1024D bidirectional LSTM from which we take the sequence of hidden states from the top layer as the contextual representation. A task-specific model sees both the top-layer hidden states of this model and, through a skip connection, the input token representations. For the intermediate ELMo experiments, we compute contextual representations using the entire pretrained ELMo model, which are passed to a similar LSTM that is then trained on the intermediate task. We also include a skip connection from the ELMo representations to the task-specific model. Our experiments with BERT use the BASE case-sensitive version of the model.

Task-Specific Components
We design task-specific components to be as close to standard models for each task as possible. Though different components may have varying parameter counts, architectures, etc., we believe that results between tasks are still comparable and informative.
For BERT experiments we use the standard preprocessing and pass the representation of the special [CLS] token to a logistic regression classifier. For seq2seq tasks (MT, SkipThought, pushshift.io Reddit dataset) we replace the classifier with a single-layer LSTM word-level decoder and initialize the hidden state with the [CLS] representation.
For ELMo-style models, we use several model types:
• Single-sentence classification tasks: We train a linear projection over the output states of the encoder, max-pool those projected states, and feed the result to an MLP.
• Sentence-pair tasks: We perform the same steps on both sentences and use the heuristic feature vector [h_1; h_2; h_1 · h_2; h_1 − h_2] in the MLP, following Mou et al. (2016). When training target-task models on QQP, STS, MNLI, and QNLI, we use a cross-sentence attention mechanism similar to BiDAF (Seo et al., 2017). We do not use this mechanism in other cases as early results indicated it hurt transfer performance.
• Seq2seq tasks (MT, SkipThought, pushshift.io Reddit dataset): We use a single-layer LSTM decoder where the hidden state is initialized with the pooled input representation.
• Language modeling: We follow ELMo by concatenating forward and backward models and learning layer mixing weights.
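The heuristic feature vector from the sentence-pair bullet above can be sketched as follows. This is a simplified illustration: it omits the learned linear projection and the MLP, it reads the product and difference as element-wise operations (our reading of the Mou et al. (2016) heuristic), and the function names are hypothetical.

```python
def max_pool(states):
    """Element-wise max over a sequence of equal-length vectors."""
    return [max(column) for column in zip(*states)]


def pair_features(states_1, states_2):
    """Build [h1; h2; h1 * h2; h1 - h2] from two sequences of encoder states.

    states_1 and states_2 are lists of token vectors (one list per
    sentence). The sentences may differ in length, but all vectors must
    share the same dimension.
    """
    h1 = max_pool(states_1)
    h2 = max_pool(states_2)
    product = [a * b for a, b in zip(h1, h2)]
    difference = [a - b for a, b in zip(h1, h2)]
    # Concatenation: the result has four times the state dimension.
    return h1 + h2 + product + difference


features = pair_features([[1.0, 4.0], [3.0, 2.0]], [[2.0, 1.0]])
print(features)  # [3.0, 4.0, 2.0, 1.0, 6.0, 4.0, 1.0, 3.0]
```

The product and difference terms give the downstream MLP direct access to simple interaction features between the two sentence vectors, which it would otherwise have to learn from the raw concatenation.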
To use GLUE tasks for pretraining or intermediate training in a way that is more comparable to outside tasks, after pretraining we discard the learned GLUE classifier, and initialize a new classifier from scratch for target-task training.
Training and Optimization For BERT experiments, we train our models with the same optimizer and learning rate schedule as the original work. For all other models, we train our models with AMSGrad (Reddi et al., 2018). We do early stopping using development set performance of the task we are training on. Typical experiments (pretraining or intermediate training of an encoder and training nine associated target-task models) take 1-5 days to complete on an NVIDIA P100 GPU. When training on multiple tasks, we randomly sample a task with probability proportional to its training data size raised to the power of 0.75. This sampling rate is meant to balance the risks of overfitting small-data tasks and underfitting large ones, and performed best in early experiments. More extensive experiments with methods like this are shown in Appendix C. In the multitask setting, we perform early stopping based on an average of the tasks' validation metrics.
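The multitask sampling rule can be made concrete with a short sketch. The exponent 0.75 comes from the text; the function names, the task names, and the training set sizes below are hypothetical and chosen only so that the arithmetic is easy to check.

```python
import random


def task_sampling_probs(train_sizes, alpha=0.75):
    """Probability of sampling each task, proportional to size ** alpha."""
    weights = {task: size ** alpha for task, size in train_sizes.items()}
    total = sum(weights.values())
    return {task: weight / total for task, weight in weights.items()}


def sample_task(probs, rng=random):
    """Draw one task name according to the given probabilities."""
    tasks = list(probs)
    return rng.choices(tasks, weights=[probs[t] for t in tasks], k=1)[0]


# Hypothetical sizes: 16 ** 0.75 = 8 and 81 ** 0.75 = 27.
probs = task_sampling_probs({"small_task": 16, "large_task": 81})
print(probs)  # small_task is close to 8/35, large_task close to 27/35
print(sample_task(probs, random.Random(0)))
```

With alpha = 1 the small task in this toy example would be drawn with probability 16/97 (about 0.16); the exponent 0.75 raises that to 8/35 (about 0.23), which illustrates how the rule shifts sampling toward small-data tasks without making it uniform.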
Hyperparameters Appendix B lists the hyperparameter values used. As our experiments require more than 150 GPU-days on NVIDIA P100 GPUs to run (not counting debugging or learning curves), we do not have the resources for extensive tuning. Instead, we fix most hyperparameters to commonly used values. The lack of tuning limits our ability to diagnose the causes of poor performance when it occurs, and we invite readers to further refine our models using the public code. While it is not feasible to run each setting multiple times, we estimate the variance of the GLUE score by re-running three experiments five times each with different random seeds. We observe σ = 0.4 for the random encoder with no pretraining, σ = 0.2 for ELMo with intermediate MNLI training, and σ = 0.5 for BERT without intermediate training. This variation is substantial but many of our results surpass a standard deviation of our baselines.

Results
The WNLI dataset is both difficult and adversarial: The same hypotheses can be paired with different premises and opposite labels in the train and development sets, so models that overfit the train set (which happens quickly on the tiny training set) often show development set performance below chance, making early stopping and model selection difficult. Few of our models reached even the most frequent class performance (56.3), and when evaluating models that do worse than this, we replace their predictions with the most frequent label to simulate the performance achieved by not modeling the task at all.
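The WNLI fallback described above amounts to a simple rule, sketched here under stated assumptions: accuracy is the metric, the most frequent label is supplied by the caller, and the function names are hypothetical. It is an illustration of the rule, not the evaluation code used for our experiments.

```python
def accuracy(predictions, gold):
    """Fraction of positions where the prediction matches the gold label."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def fallback_to_majority(predictions, gold, majority_label):
    """Replace predictions with the majority label if the model does worse.

    Returns the original predictions when they match or beat the
    most-frequent-class baseline, and a constant prediction otherwise.
    """
    constant = [majority_label] * len(gold)
    if accuracy(predictions, gold) < accuracy(constant, gold):
        return constant
    return list(predictions)
```

The effect is that a reported score never falls below the most-frequent-class baseline, so below-chance development results do not drag down the aggregate score.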

Pretraining
From Table 2, among target tasks, we find the grammar-related CoLA task benefits dramatically from LM pretraining: The results achieved with LM pretraining are significantly better than the results achieved without. In contrast, the meaning-oriented STS sees good results with several kinds of pretraining, but does not benefit substantially from LM pretraining.
Among pretraining tasks, language modeling performs best, followed by MNLI. The remaining pretraining tasks yield performance near that of the random baseline. Even our single-task baseline gets less than a one-point gain over this simple baseline. The multitask models are tied or outperformed by models trained on one of their constituent tasks, suggesting that our approach to multitask learning does not reliably produce models that productively combine the knowledge taught by each task. However, of the two models that perform best on the development data, the multitask model generalizes better than the single-task model on test data for tasks like STS and MNLI where the test set contains out-of-domain data.

Intermediate Task Training
Looking to Table 3, using ELMo uniformly improves over training the encoder from scratch. The ELMo-augmented random baseline is strong, lagging behind the single-task baseline by less than a point. Most intermediate tasks beat the random baseline, but several fail to significantly outperform the single-task baseline. MNLI and English-German translation perform best with ELMo, with SkipThought and DisSent also beating the single-task baseline. Intermediate multitask training on all the non-GLUE tasks produces our best-performing ELMo model.

Many correlations are low, suggesting that different tasks benefit from different forms of pretraining to a substantial degree, and bolstering the observation that no single pretraining task yields good performance on all target tasks. For reasons noted earlier, the models that tended to perform best overall also tended to overfit the WNLI training set most, leading to a negative correlation between WNLI and overall GLUE score. STS also shows a negative correlation, likely due to the observation that it does not benefit from LM pretraining.
In contrast, CoLA shows a strong correlation with the overall GLUE scores, but has weak or negative correlations with many tasks: The use of LM pretraining dramatically improves CoLA performance, but most other forms of pretraining have little effect.

Learning Curves
Figure 2 shows performance on the overall GLUE metric for encoders pretrained to convergence on each task with varying amounts of data. Looking at pretraining tasks in isolation (left), most tasks improve slightly as the amount of data increases, with the LM and MT tasks showing the most promising combination of slope and maximum performance. Combining these tasks with ELMo (center) or BERT (right) yields less interpretable results: the relationship between training data volume and performance becomes weaker, and some of the best results reported in this paper are achieved by models that combine ELMo with restricted-data versions of intermediate tasks like MNLI and QQP. This effect is amplified with BERT, with training data volume having unclear or negative relationships with performance for many tasks. With large datasets for generation tasks, we see clear evidence of catastrophic forgetting, with performance decreasing sharply as the amount of training data increases.
We also measure target task performance for three fully pretrained encoders under varying amounts of target task data. We find that all tasks benefit from increasing data quantities, with no obvious diminishing returns, and that most tasks see a consistent improvement in performance with the use of pretraining, regardless of the data volume. We present these learning curves in Appendix E.
Results on the GLUE Diagnostic Set On GLUE's analysis dataset, we find that many of our pretraining tasks help on examples involving lexical-semantic knowledge and logical operations, but less so on examples that highlight world knowledge. See Appendix F for details.

Conclusions
We present a systematic comparison of tasks and task combinations for the pretraining and intermediate fine-tuning of sentence-level encoders like those seen in ELMo and BERT. With nearly 60 pretraining tasks and task combinations and nine target tasks, this represents a far more comprehensive study than any seen on this problem to date.
Our primary results are perhaps unsurprising: LM works well as a pretraining task, and no other single task is consistently better. Intermediate training of language models can yield modest further gains. Multitask pretraining can produce results better than any single task can. Target task performance continues to improve with more LM data, even at large scales, suggesting that further work scaling up LM pretraining is warranted.
We also observe several worrying trends. Target tasks differ significantly in the pretraining tasks they benefit from, with correlations between target tasks often low or negative. Multitask pretraining fails to reliably produce models better than their best individual components. When trained on intermediate tasks like MT that are very different from its original training task, BERT shows signs of catastrophic forgetting. These trends suggest that improving on LM pretraining with current techniques will be challenging.
While further work on language modeling seems straightforward and worthwhile, we believe that the future of this line of work will require a better understanding of the settings in which target task models can effectively utilize outside knowledge and data, and new methods for pretraining and transfer learning to do so.

Acknowledgments
Parts of this work were conducted as part of the Fifth Frederick Jelinek Memorial Summer Workshop (JSALT) at Johns Hopkins University, and benefited from support by the JSALT sponsors and a team-specific donation of computing resources from Google. We gratefully acknowledge the support of NVIDIA Corporation with the donation of a Titan V GPU used at NYU for this research. AW is supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1342536. PX and BVD were supported by DARPA AIDA. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.