Which *BERT? A Survey Organizing Contextualized Encoders

Pretrained contextualized text encoders are now a staple of the NLP community. We present a survey on language representation learning with the aim of consolidating a series of shared lessons learned across a variety of recent efforts. While significant advancements continue at a rapid pace, we find that enough has now been discovered, in different directions, that we can begin to organize advances according to common themes. Through this organization, we highlight important considerations when interpreting recent contributions and choosing which model to use.


Introduction
A couple of years ago, Peters et al. (2018, ELMo) won the NAACL Best Paper Award for creating strong-performing, task-agnostic sentence representations through large-scale unsupervised pretraining. Days later, its high level of performance was surpassed by Radford et al. (2018), which boasted representations beyond a single sentence and the flexibility of finetuning. This instability and competition between models has been a recurring theme for researchers and practitioners who have watched the rapidly narrowing gap between text representations and language understanding benchmarks. However, it has not discouraged research. Given the recent flurry of models, we often ask: "What, besides state-of-the-art, does this newest paper contribute? Which encoder should we use?" The goals of this survey are to outline the areas of progress, relate contributions in text encoders to ideas from other fields, describe how each area is evaluated, and present considerations for practitioners and researchers when choosing an encoder. This survey does not intend to compare specific model metrics, as tables from other works provide comprehensive insight. For example, Table 16 in Raffel et al. (2019) compares the scores on a large suite of tasks across different model architectures, training objectives, and hyperparameters, and Table 1 in Rogers et al. (2020) details early efforts in model compression and distillation. We also recommend other closely related surveys on contextualized word representations (Smith, 2019; Rogers et al., 2020; Liu et al., 2020a), transfer learning in NLP (Ruder et al., 2019), and integrating encoders into NLP applications (Wolf et al., 2019). Complementing these existing bodies of work, we look at the ideas and progress in the scientific discourse for text representations from the perspective of discerning their differences.
We organize this paper as follows. §2 provides brief background on encoding, training, and evaluating text representations. §3 identifies and analyzes two classes of pretraining objectives. In §4, we explore faster and smaller models and architectures in both training and inference. §5 notes the impact of both quality and quantity of pretraining data. §6 briefly discusses efforts on probing encoders and representations with respect to linguistic knowledge. §7 describes efforts in training and evaluating multilingual representations. Within each area, we conclude with high-level observations and discuss the evaluations that are used and their shortcomings.
We conclude in §8 by making recommendations to researchers: publicizing negative results in this area is especially important, owing to the sheer cost of experimentation and the need to ensure evaluation reproducibility. In addition, probing studies need to focus not only on the models and tasks, but also on the pretraining data. We pose questions for users of contextualized encoders, such as whether the compute requirement of a model is worth the benefits. We hope our survey serves as a guide for both NLP researchers and practitioners, orienting them to the current state of the field of contextualized encoders and the differences between models.

Background
Encoders Pretrained text encoders take as input a sequence of tokenized text, which is encoded by a multi-layered neural model. The representation of each (sub)token $x_t$ is either the set of hidden states $\{h_t^{(l)}\}$ across layers $l$, or just the top-layer hidden state, $h_t^{(-1)}$. Unlike fixed-sized word, sentence, or paragraph representations, the contextualized representations produced for a text depend on the length of the input text. Most encoders use the Transformer architecture (Vaswani et al., 2017).
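To make this concrete, the following sketch extracts per-layer contextualized representations with the HuggingFace transformers library (Wolf et al., 2019); the checkpoint and example sentence are purely illustrative, and any encoder that exposes its hidden states would work analogously.

```python
# Minimal sketch: obtaining contextualized (sub)token representations.
# Assumes the HuggingFace `transformers` library; the checkpoint choice is illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("Pretrained encoders are a staple of NLP.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple of (num_layers + 1) tensors, each of shape
# (batch, sequence_length, hidden_size): the set {h_t^(l)} over all layers l.
all_layers = outputs.hidden_states
top_layer = all_layers[-1]   # h_t^(-1): the top-layer representation of each subtoken
```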
Transfer: The Pretrain-Finetune Framework While text representations can be learned in any manner, ultimately, they are evaluated using specific target tasks. Historically, the learned representations (e.g., word vectors) were used as initialization for task-specific models. Dai and Le (2015) are credited with using pretrained language model outputs as initialization, McCann et al. (2017) use pretrained outputs from translation as frozen word embeddings, and Howard and Ruder (2018) and Radford et al. (2018) demonstrate the effectiveness of finetuning to different target tasks by updating the full (pretrained) model for each task. We refer to the embeddings produced by the pretrained models (or encoders) as contextualized text representations. As our goal is to discuss the encoders and their representations, we do not cover the innovations in finetuning (Liu et al., 2015; Ruder et al., 2019; Phang et al., 2018; Liu et al., 2019c; Zhu et al., 2020, inter alia).

Area I: Pretraining Tasks
To utilize data at scale, pretraining tasks are typically self-supervised. We categorize the contributions into two types: token prediction (over a large vocabulary space) and nontoken prediction (over a handful of labels). In this section, we discuss several empirical observations. While token prediction is clearly important, less clear is which variation of the token prediction task is the best (or whether it even matters). Nontoken prediction tasks appear to offer orthogonal contributions that marginally improve the language representations. We emphasize that in this section, we seek to outline the primary efforts in pretraining objectives and not to provide a comparison on a set of benchmarks.

Token Prediction
Predicting (or generating) the next word has historically been equivalent to the task of language modeling. Large language models perform impressively on a variety of language understanding tasks while maintaining their generative capabilities (Radford et al., 2018, 2019; Keskar et al., 2019; Brown et al., 2020), often outperforming contemporaneous models that use additional training objectives.
ELMo (Peters et al., 2018) is a BiLSTM model with a language modeling objective for the next (or previous) token given the forward (or backward) history. This idea of looking at the full context was further refined as a cloze task (Baevski et al., 2019), or as a denoising Masked Language Modeling (MLM) objective (Devlin et al., 2019, BERT). MLM replaces some tokens with a [mask] symbol and provides both right and left contexts (bidirectional context) for predicting the masked tokens. The bidirectionality is key to outperforming a unidirectional language model on a large suite of natural language understanding benchmarks (Devlin et al., 2019; Raffel et al., 2019).
The MLM objective is far from perfect, as the use of [mask] introduces a pretrain/finetune vocabulary discrepancy. Devlin et al. (2019) look to mitigate this issue by occasionally replacing [mask] with the original token or sampling from the vocabulary. Yang et al. (2019) convert the discriminative objective into an autoregressive one, which allows the [mask] token to be discarded entirely. Naively, this would result in unidirectional context. By sampling permutations of the factorization order of the joint probability of the sequence, they preserve bidirectional context. Similar ideas for permutation language modeling (PLM) have also been studied for sequence generation (Stern et al., 2019; Chan et al., 2019; Gu et al., 2019). The MLM and PLM objectives have since been unified architecturally (Song et al., 2020; Bao et al., 2020) and mathematically (Kong et al., 2020).
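As a minimal sketch of this masking scheme, the function below follows the 15% masking rate and the 80/10/10 split between [mask], random, and unchanged tokens described by Devlin et al. (2019); the function and variable names are illustrative.

```python
import random

def mask_for_mlm(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """BERT-style corruption: returns (corrupted inputs, prediction targets).

    A sketch of the scheme in Devlin et al. (2019): 15% of positions are
    selected; of those, 80% become [mask], 10% a random token, 10% unchanged.
    """
    inputs, targets = list(token_ids), [-100] * len(token_ids)  # -100: ignore in the loss
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            targets[i] = tok                          # predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_id                   # replace with [mask]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)  # sample from the vocabulary
            # else: keep the original token (mitigates the pretrain/finetune mismatch)
    return inputs, targets
```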
ELECTRA (Clark et al., 2020) replaces [mask] through the use of a small generator (trained with MLM) to sample a real token from the vocabulary. The main encoder, a discriminator, then determines whether each token was replaced.
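A heavily simplified sketch of this replaced-token-detection setup follows; `generator` and `discriminator` stand in for a small MLM and the main encoder, and all names and shapes are illustrative rather than ELECTRA's actual implementation.

```python
import torch
import torch.nn.functional as F

def replaced_token_detection_step(token_ids, mask_positions, mask_id,
                                  generator, discriminator):
    """One conceptual step of replaced token detection (Clark et al., 2020).

    `generator(ids)` is assumed to return logits over the vocabulary per position,
    and `discriminator(ids)` one logit per position (replaced vs. original).
    """
    corrupted = token_ids.clone()
    corrupted[mask_positions] = mask_id

    # The small generator fills in masked positions by sampling real tokens.
    gen_logits = generator(corrupted)                                  # (seq_len, vocab)
    sampled = torch.multinomial(
        gen_logits[mask_positions].softmax(-1), num_samples=1).squeeze(-1)
    filled = token_ids.clone()
    filled[mask_positions] = sampled

    # The discriminator (main encoder) labels every token: was it replaced?
    labels = (filled != token_ids).float()                             # 1 = replaced
    disc_logits = discriminator(filled)                                # (seq_len,)
    return F.binary_cross_entropy_with_logits(disc_logits, labels)
```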
A natural extension would mask units that are more linguistically meaningful, such as rarer words, whole words, or named entities (Devlin et al., 2019; Sun et al., 2019b). This idea can be simplified to random spans of texts (Yang et al., 2019; Song et al., 2019). Specifically, Joshi et al. (2020) add a reconstruction objective which predicts the masked tokens using only the span boundaries. They find that masking random spans is more effective than masking linguistic units.
An alternative architecture uses an encoder-decoder framework (or denoising autoencoder) where the input is a corrupted (masked) sequence and the output is the full original sequence (Wang et al., 2019d; Lewis et al., 2020; Raffel et al., 2019).

Nontoken Prediction
Bender and Koller (2020) argue that for the goal of natural language understanding, we cannot rely purely on a language modeling objective; there must be some grounding or external information that relates texts to each other or to the world. One solution is to introduce a secondary objective to directly learn these biases.
Self-supervised discourse structure objectives, such as text order, have garnered significant attention. To capture relationships between two sentences, Devlin et al. (2019) introduce the next sentence prediction (NSP) objective. In this task, either sentence B follows sentence A or B is a random negative sample. Subsequent works showed that this was not effective, suggesting the model simply learned topic (Yang et al., 2019; Liu et al., 2019d). Jernite et al. (2017) propose a sentence order task of predicting whether A is before, after, or unrelated to B, and Wang et al. (2020b) and Lan et al. (2020) use it for pretraining encoders. They report that (1) understanding text order does contribute to improved language understanding; and (2) harder-to-learn pretraining objectives are more powerful, as both modified tasks have lower intrinsic performance than NSP. It is still unclear, however, if this is the best way to incorporate discourse structure, especially since these works do not use real sentences.
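For concreteness, a sketch of how such sentence-pair examples can be constructed follows; the 50/50 sampling and label names are illustrative, mirroring the descriptions of NSP (Devlin et al., 2019) and sentence ordering (Lan et al., 2020).

```python
import random

def make_nsp_example(doc_sentences, all_sentences, i):
    """Next sentence prediction (Devlin et al., 2019): B follows A, or B is random."""
    a = doc_sentences[i]
    if random.random() < 0.5:
        return (a, doc_sentences[i + 1], "is_next")
    return (a, random.choice(all_sentences), "random")

def make_sop_example(doc_sentences, i):
    """Sentence order prediction (Lan et al., 2020): A and B in order, or swapped."""
    a, b = doc_sentences[i], doc_sentences[i + 1]
    if random.random() < 0.5:
        return (a, b, "in_order")
    return (b, a, "swapped")
```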
Additional work has focused on effectively incorporating multiple pretraining objectives. Sun et al. (2020a) use multi-task learning with continual pretraining (Hashimoto et al., 2017), which incrementally introduces newer tasks into the set of pretraining tasks, moving from word- to sentence- to document-level tasks. Encoders using visual features (and evaluated only on visual tasks) jointly optimize multiple different masking objectives over both token sequences and regions of interest in the image (Tan and Bansal, 2019). Prior to token prediction, discourse information had already been used in training sentence representations: Conneau et al. (2017, 2018a) use natural language inference sentence pairs, Jernite et al. (2017) use discourse-based objectives of sentence order, conjunction classification, and next sentence selection, and Nie et al. (2019) use discourse markers. While there is weak evidence suggesting that these types of objectives are less effective than language modeling (Wang et al., 2019a), we lack fair studies comparing the relative influence of the two categories of objectives.

Comments on Evaluation
We reviewed the progress on pretraining tasks, finding that token prediction is powerful but can be improved further by other objectives. Currently, successful techniques like span masking or arbitrarily sized "sentences" (segments containing no more than a fixed number of subtokens, possibly spanning a fractional number of real sentences) are linguistically unmotivated. We anticipate future work to further incorporate more meaningful linguistic biases in pretraining.
Our observations are informed by evaluations that are compared across different works. These benchmarks on downstream tasks do not account for ensembling or finetuning and can only serve as an approximation for the differences between the models. For example, Jiang et al. (2020) develop a finetuning method over a supposedly weaker model which leads to gains in GLUE score over reportedly stronger models. Furthermore, these evaluations aggregate vastly different tasks. Those interested in the best performance should first carefully investigate metrics on their specific task. Even if models were finetuned on an older encoder, it may be more cost-efficient, and enable fairer future comparisons, to reuse those models rather than restart finetuning or reintegrate new encoders into existing systems, since doing so does not necessarily guarantee improved performance.

Area II: Efficiency
As models perform better but cost more to train, some have called for research into efficient models to improve deployability, accessibility, and reproducibility (Amodei and Hernandez, 2018; Strubell et al., 2019; Schwartz et al., 2019). Encoders tend to scale effectively (Lan et al., 2020; Raffel et al., 2019; Brown et al., 2020), so efficient models will also result in improvements over inefficient ones of the same size. In this section, we give an overview of several efforts aimed at decreasing the computation budget (time and memory usage) during training and inference of text encoders. While these two axes are correlated, reductions in one axis do not always lead to reductions in the other.

Training
One area of research decreases wall-clock training time through more compute and larger batches. You et al. (2020) reduce the time of training BERT by introducing the LAMB optimizer, a large batch stochastic optimization method adjusted for attention models. Rajbhandari et al. (2020) analyze memory usage in the optimizer to enable parallelization of models, resulting in higher throughput in training. By reducing the training time, models can be practically trained for longer, which has also been shown to lead to benefits in task performance (Liu et al., 2019d; Lan et al., 2020, inter alia).
Another line of research reduces the compute through attention sparsification (discussed in §4.2) or increasing the convergence rate (Clark et al., 2020). These works report the hardware used and estimate the reduction in floating point operations (FPOs). These kinds of speedups are orthogonal to hardware parallelization and are most encouraging, as they pave the path for future work in efficient training.
Note that these approaches do not necessarily affect the latency of processing a single example or the compute required during inference, which is a function of the size of the computation graph.

Inference
Reducing model size without impacting performance is motivated by lower inference latency, hardware memory constraints, and the promise that naively scaling up dimensions of the model will improve performance. Size reduction techniques produce smaller and faster models, while occasionally improving performance. Rogers et al. (2020) survey BERT-like models and present in Table 1 the differences in sizes and performance across several models focused on inference efficiency.
Architectural changes have been explored as one avenue for reducing either the model size or inference time. In Transformers, the self-attention pattern scales quadratically in sequence length. To reduce the asymptotic complexity, the self-attention can be sparsified so that each token attends only to a small "local" set (Vaswani et al., 2017; Child et al., 2019; Sukhbaatar et al., 2019). This has further been applied to pretraining on longer sequences, resulting in sparse contextualized encoders (Qiu et al., 2019; Ye et al., 2019; Kitaev et al., 2020; Beltagy et al., 2020, inter alia). Efficient Transformers is an emerging subfield with applications beyond NLP; Tay et al. (2020) survey 17 Transformers that have implications for efficiency.
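As a toy illustration of local sparsification, the sketch below builds a banded attention mask in which each token attends only to a fixed window of neighbors; the window size and masking convention are illustrative and not those of any particular cited model.

```python
import torch

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where position i may attend to j only if |i - j| <= window.

    Dense self-attention touches seq_len**2 pairs; a local pattern like this
    touches O(seq_len * window) pairs, reducing the asymptotic cost.
    """
    idx = torch.arange(seq_len)
    return (idx[:, None] - idx[None, :]).abs() <= window   # (seq_len, seq_len) bool

mask = local_attention_mask(seq_len=8, window=2)
# Positions outside the band would have their attention scores set to -inf
# before the softmax, so they contribute nothing to the weighted sum.
```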
Another class of approaches carefully selects weights to reduce model size. Lan et al. (2020) use low-rank factorization to reduce the size of the embedding matrices, while Wang et al. (2019f) factorize other weight matrices. Additionally, parameters can be shared between layers (Dehghani et al., 2019; Lan et al., 2020) or between an encoder and decoder (Raffel et al., 2019). However, models that employ these methods do not always have smaller computation graphs. This greatly reduces the usefulness of parameter sharing compared to other methods that additionally offer greater speedups relative to the reduction in model size.
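A minimal sketch of the embedding factorization idea, in the spirit of Lan et al. (2020), is below; the dimensions and parameter counts are illustrative.

```python
import torch.nn as nn

vocab_size, hidden_size, embed_size = 30000, 768, 128

# Unfactorized: one large lookup table with vocab_size * hidden_size parameters.
full_embedding = nn.Embedding(vocab_size, hidden_size)            # ~23.0M parameters

# Factorized: a small lookup followed by a projection into the hidden size,
# i.e. vocab_size * embed_size + embed_size * hidden_size parameters.
factored_embedding = nn.Sequential(
    nn.Embedding(vocab_size, embed_size),                          # ~3.8M parameters
    nn.Linear(embed_size, hidden_size, bias=False),                # ~0.1M parameters
)
```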
Closely related, model pruning (Denil et al., 2013; Han et al., 2015; Frankle and Carbin, 2018) during training or inference has exploited the overparameterization of neural networks by removing up to 90-95% of parameters. This approach has been successful not only in reducing the number of parameters, but also in improving performance on downstream tasks. Related to efforts for pruning deep networks in computer vision (Huang et al., 2016), layer selection and dropout during both training and inference have been studied in both LSTM-based (Liu et al., 2018a) and Transformer-based (Fan et al., 2020) encoders. These also have a regularization effect, resulting in more stable training and improved performance. There are additional novel pruning methods that can be performed during training (Guo et al., 2019; Qiu et al., 2019). These successful results are corroborated by other efforts (Gordon et al., 2020) showing that low levels of pruning do not substantially affect pretrained representations. Additional successful efforts in model pruning directly target a downstream task (Sun et al., 2019a; Michel et al., 2019; McCarley, 2019; Cao et al., 2020a). Note that pruning does not always lead to speedups in practice, as sparse operations may be hard to parallelize.
Knowledge distillation (KD) uses an overparameterized teacher model to rapidly train a smaller student model with minimal loss in performance (Hinton et al., 2015) and has been used for translation (Kim and Rush, 2016), computer vision (Howard et al., 2017), and adversarial examples (Carlini and Wagner, 2016). This has been applied to ELMo (Li et al., 2019) and BERT (Tang et al., 2019; Sanh et al., 2019; Sun et al., 2020b, inter alia). KD can also be combined with adaptive inference, which dynamically adjusts model size (Liu et al., 2020b), or performed on submodules which are later substituted back into the full model (Xu et al., 2020).
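A minimal sketch of the distillation objective of Hinton et al. (2015) follows; the temperature, loss weighting, and tensor shapes are illustrative.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of soft-target KL (teacher guidance) and the usual hard-label loss.

    student_logits, teacher_logits: (batch, num_classes); labels: (batch,).
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale gradients for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```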
Quantization with custom low-precision hardware is also a promising method for both reducing the size of models and compute time, although it does not reduce the number of parameters or FPOs (Shen et al., 2020; Zafrir et al., 2019). This line of work is mostly orthogonal to other efforts specific to NLP.

Standardizing Comparison
There has yet to be a comprehensive and fair evaluation across all models. The closest, Table 1 in Rogers et al. (2020), compares 12 works in model compression. However, almost no two papers are evaluated against the same BERT with the same set of tasks. Many papers on attention sparsification do not evaluate on NLU benchmarks. We claim this is because finetuning is itself an expensive task, so it is not prioritized by authors: works on improving model efficiency have focused only on comparing to a BERT on a few tasks.
While it is easy for future research on pretraining to report model sizes and runtimes, it is harder for researchers in efficiency to report NLU benchmarks. We suggest extending versions of the leaderboards under different resource constraints so that researchers with access to less hardware could still contribute under the resource-constrained conditions. Some work has begun in this direction: the SustaiNLP 2020 Shared Task is focused on the energy footprint of inference for GLUE.

Area III: (Pretraining) Data

Unsurprisingly for our field, increasing the size of training data for an encoder contributes to increases in language understanding capabilities (Yang et al., 2019; Raffel et al., 2019; Kaplan et al., 2020). At current data scales, some models converge before consuming the entire corpus. In this section, we identify a weakness when given less data, advocate for better data cleaning, and raise technical and ethical issues with using web-scraped data.

Data Quantity
No ceiling has yet been observed on the amount of data that can be effectively used in training (Baevski et al., 2019; Liu et al., 2019d; Yang et al., 2019; Brown et al., 2020). Raffel et al. (2019) curate a 745GB subset of Common Crawl (CC), which starkly contrasts with the 13GB used in BERT. For multilingual text encoding, Wenzek et al. (2020) curate 2.5TB of language-tagged CC. As CC continues to grow, there will be even larger datasets (Brown et al., 2020). Sun et al. (2017) explore a similar question for computer vision, as years of progress iterated over 1M labeled images. By using 300M images, they improved performance on several tasks with a basic model. We echo their remarks that we should be cognizant of data sizes when drawing conclusions.
Is there a floor to the amount of data needed to achieve current levels of success on language understanding benchmarks? As we decrease the data size, LSTM-based models start to dominate in perplexity (Yang et al., 2019; Melis et al., 2020), suggesting there are challenges with either scaling up LSTMs or scaling down Transformers. While probing contextualized models and representations is an important area of study (see §6), prior work focuses on pretrained models or models further pretrained on domain-specific data (Gururangan et al., 2020). We are not aware of any work which probes identical models trained with progressively less data. How much (and which) data is necessary for high performance on probing tasks?

Data Quality
While text encoders should be trained on language, large-scale datasets may contain web-scraped and uncurated content (like code). Raffel et al. (2019) ablate different types of data for text representations and find that naively increasing dataset size does not always improve performance, partially due to data quality. This realization is not new. Parallel data and alignment in machine translation (Moore and Lewis, 2010; Duh et al., 2013; Xu and Koehn, 2017; Koehn et al., 2018, inter alia) and speech (Peddinti et al., 2016) often use language models to filter out misaligned or poor data. Sun et al. (2017) use automatic data filtering in vision. These successes on other tasks suggest that improved automated methods of data cleaning would let future models consume more high-quality data.
In addition to high quality, data uniqueness appears to be advantageous. Raffel et al. (2019) show that increasing the repetitions (number of epochs) of the pretraining corpus hurts performance. This is corroborated by Liu et al. (2019d), who find that random, unique masks for MLM improve over masks repeated across epochs. These findings together suggest a preference for seeing more new text. We suspect that representations of text spans appearing multiple times across the corpus are better shaped by observing them in unique contexts. Raffel et al. (2019) find that differences in domain mismatch in the pretraining data (web-crawled vs. news or encyclopedic) result in strikingly different performance on certain challenge sets, and Gururangan et al. (2020) find that continued pretraining on both domain- and task-specific data leads to gains in performance.

Datasets and Evaluations
With these larger and cleaner datasets, future research can better explore tradeoffs between size and quality, as well as strategies for scheduling data during training.
As we continue to scrape data off the web and publish challenge sets relying on other web data, we need to cautiously construct our training and evaluation sets. For example, the domains of many benchmarks (Wang et al. (2019c, GLUE), Rajpurkar et al. (2016, 2018, SQuAD), Wang et al. (2019b, SuperGLUE), Paperno et al. (2016, LAMBADA), Nallapati et al. (2016, CNN/DM)) now overlap with the data used to train language representations. Section 4 in Brown et al. (2020) more thoroughly discusses the effects of overlapping test data with pretraining data. Gehman et al. (2020) highlight the prevalence of toxic language in the common pretraining corpora and stress the importance of pretraining data selection, especially for deployed models. We are not aware of a comprehensive study that explores the effect of leaving out targeted subsets of the pretraining data. We hope future models will note the domains of their pretraining data and evaluation benchmarks, and that future language understanding benchmarks will focus on more diverse genres in addition to diverse tasks.
As we improve models by training on increasing sizes of crawled data, these models are also being picked up by NLP practitioners who deploy them in real-world software. These models learn biases found in their pretraining data (Gonen and Goldberg, 2019; May et al., 2019, inter alia). It is critical to clearly state the source of the pretraining data and clarify appropriate uses of the released models. For example, crawled data can contain incorrect facts about living people; while webpages can be edited or retracted, publicly released "language" models are frozen, which can raise privacy concerns (Feyisetan et al., 2020).

Area IV: Interpretability

Inspired by prior work (Lipton, 2018; Belinkov and Glass, 2019; Alishahi et al., 2019), we organize here the major probing methods that are applicable to all encoders in hopes that future work will use comparable techniques.

Probing with Tasks
One technique uses the learned model as initialization for a model trained on a probing task consisting of a set of targeted natural language examples. The probing task's format is flexible, as additional (simple) diagnostic classifiers are trained on top of a typically frozen model (Ettinger et al., 2016; Hupkes et al., 2018; Poliak et al., 2018; Tenney et al., 2019b). Task probing can also be applied to the embeddings at various layers to explore the knowledge captured at each layer (Tenney et al., 2019a; Lin et al., 2019; Liu et al., 2019a). Hewitt and Liang (2019) warn that expressive (nonlinear) diagnostic classifiers can learn more arbitrary information than constrained (linear) ones. This revelation, combined with the differences in probing task format and the need to train, leads us to be cautious in drawing conclusions from these methods.
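A minimal sketch of such a diagnostic classifier follows: a linear probe trained on representations from a frozen encoder for some token-level property. The `embed` function and the use of scikit-learn are assumptions for illustration, not the setup of any specific cited work.

```python
# Minimal sketch of a linear diagnostic probe over frozen representations.
# Assumes `embed(sentence)` returns a (num_tokens, hidden_size) array from a
# frozen encoder, and that each token carries a gold label (e.g., part of speech).
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_probe(sentences, token_labels, embed):
    X = np.concatenate([embed(s) for s in sentences])     # (total_tokens, hidden_size)
    y = np.concatenate(token_labels)                       # (total_tokens,)
    probe = LogisticRegression(max_iter=1000).fit(X, y)    # encoder stays frozen
    # In practice, probe accuracy should be measured on held-out data.
    return probe, probe.score(X, y)
```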

Model Inspection
Model inspection directly opens the metaphorical black box and studies the model weights without additional training. For example, the embeddings themselves can be analyzed as points in a vector space (Ethayarajh, 2019). Through visualization, attention heads have been matched to linguistic functions (Vig, 2019; Clark et al., 2019b). These works suggest inspection is a viable path to debugging specific examples. In the future, methods for analyzing and manipulating attention in machine translation (Lee et al., 2017; Liu et al., 2018b; Bau et al., 2019; Voita et al., 2019) can also be applied to text encoders.
Recently, interpreting attention as explanation has been questioned (Serrano and Smith, 2019; Jain and Wallace, 2019; Wiegreffe and Pinter, 2019; Clark et al., 2019b). The ongoing discussion suggests that this method may still be insufficient for uncovering the rationale for predictions, which is critical for real-world applications.

Input Manipulation
Input manipulation draws conclusions by recasting the probing task format into the form of the pretraining task and observing the model's predictions. As discussed in §3, word prediction (the cloze task) is a popular objective. This method has been used to investigate syntactic and semantic knowledge (Goldberg, 2019; Ettinger, 2020; Kassner and Schütze, 2019). For a specific probing task, Warstadt et al. (2019) show that cloze probes and diagnostic classifiers draw similar conclusions. As input manipulation is not affected by variables introduced by probing tasks and is as interpretable as inspection, we suggest more focus on this method: either by creating new datasets (Warstadt et al., 2020) or recasting existing ones (Brown et al., 2020) into this format. A disadvantage of this method (especially for smaller models) is its dependence on the pattern used to elicit an answer from the model and, in the few-shot case where a couple of examples are provided first, on the choice of those examples (Schick and Schütze, 2020).
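As an illustration of cloze-style input manipulation, the sketch below queries a masked language model directly, with no additional training; it assumes the HuggingFace transformers pipeline API, and the checkpoint and prompt are arbitrary choices.

```python
# Minimal sketch: probing a masked LM with a cloze prompt, no extra training.
# Assumes the HuggingFace `transformers` library; model choice and prompt are illustrative.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill("The capital of France is [MASK]."):
    # Each prediction exposes the filled-in token and the model's score for it.
    print(prediction["token_str"], round(prediction["score"], 3))
```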

Future Directions in Model Analysis
Most probing efforts have relied on diagnostic classifiers, yet these results are being questioned. Inspection of model weights has discovered what the models learn, but cannot explain their causal structure. We suggest researchers shift to the paradigm of input manipulation. By creating cloze tasks that assess linguistic knowledge, we can directly observe decisions made by the model, which would imply (a lack of) knowledge of a phenomenon. Furthermore, it will also enable us to directly interact with these models (by changing the input) without additional training, which currently introduces additional sources of uncertainty.
Bender and Koller (2020) also recommend a top-down view for model analysis that focuses on the end-goals for our field over hill-climbing individual datasets. While language models continue to outperform each other on these tasks, they argue these models do not learn meaning. If not meaning, what are these models learning?
We are overinvesting in BERT. While it is fruitful to understand the boundaries of its knowledge, we should look more across (simpler) models to see how and why specific knowledge is picked up as our models both become increasingly complex and perform better on a wide set of tasks. For example, how many parameters does a Transformer-based model need to outperform ELMo or even rule-based baselines?

Area V: Multilinguality

The majority of research on text encoders has been in English. Cross-lingual shared representations have been proposed as an efficient way to target multiple languages by using multilingual text for pretraining (Mulcaire et al., 2019; Devlin et al., 2019; Lample and Conneau, 2019; Liu et al., 2020c, inter alia). For evaluation, researchers have devised multilingual benchmarks mirroring those for NLU in English (Conneau et al., 2018b; Liang et al., 2020; Hu et al., 2020). Surprisingly, without any explicit cross-lingual signal, these models achieve strong zero-shot cross-lingual performance, outperforming prior cross-lingual word embedding-based methods (Wu and Dredze, 2019; Pires et al., 2019).
A natural follow-up question to ask is why these models learn cross-lingual representations. Some answers include the shared subword vocabulary (Pires et al., 2019; Wu and Dredze, 2019), shared Transformer layers across languages (Conneau et al., 2020b; Artetxe et al., 2020), and the depth of the network (K et al., 2020). Studies have also found that the geometry of representations of different languages in multilingual encoders can be aligned with linear transformations (Schuster et al., 2019; Wang et al., 2019e, 2020c; Liu et al., 2019b), which has also been observed in independent monolingual encoders (Conneau et al., 2020b). These alignments can be further improved (Cao et al., 2020b).
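One common way to compute such a linear alignment is orthogonal Procrustes over row-aligned representations of translation pairs; the sketch below is illustrative rather than the exact procedure of any cited work.

```python
import numpy as np

def procrustes_align(X_src, Y_tgt):
    """Orthogonal map W minimizing ||X_src @ W - Y_tgt||_F over row-aligned pairs.

    X_src, Y_tgt: (num_pairs, dim) representations of translation pairs
    (e.g., the same words or sentences encoded in two languages).
    """
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt                      # orthogonal rotation: X_src @ W approximates Y_tgt

# Usage: map new source-language vectors into the target space with X_new @ W,
# then compare to target-language vectors, e.g., by cosine similarity.
```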

Evaluating Multilinguality
All of the areas discussed in this paper are applicable to multilingual encoders. However, progress in training, architecture, datasets, and evaluations is occurring concurrently, making it difficult to draw conclusions. We need more comparisons between competitive multilingual and monolingual systems or datasets. To this end, Wu and Dredze (2020) find that monolingual BERTs in low-resource languages are outperformed by multilingual BERT. Additionally, as zero-shot (or few-shot) cross-lingual transfer has inherently high variance (Keung et al., 2020), the variance of models should also be reported.
We anticipate cross-lingual performance being a new dimension to consider when evaluating text representations. For example, it will be exciting to discover how a small, highly-performant monolingual encoder contrasts against a multilingual variant; e.g., what is the minimum number of parameters needed to support a new language? Or, how does model size relate to the phylogenetic diversity of languages supported?

Limitations and Recommendations
This survey, like others, is limited to only what has been shared publicly so far. The papers of many models described here highlight their best parts, where potential flaws are perhaps obscured within tables of numbers. Leaderboard submissions that do not achieve first place may never be published. Meanwhile, encoders are expensive to work with, yet they are a ubiquitous component in most modern NLP models. We strongly encourage more publication and publicizing of negative results and limitations. In addition to their scientific benefits, publishing negative results in contextualized encoders can avoid significant externalities of rediscovering what doesn't work: time, money, and electricity. Furthermore, we ask leaderboard owners to periodically publish surveys of their received submissions.
The flourishing research in improving encoders is rivaled by research in interpreting them, mainly focused on discovering the boundary of what knowledge is captured by the models. For investigations that aim to sharpen the boundary, it is logical to build off of these prior results. However, we raise a concern that these encoders are all trained on similar data and have similar sizes. Future work in probing should also look across different sizes and domains of training data, as well as study the effect of model size. This can be further facilitated by model creators who release (data) ablated versions of their models.
We also raise a concern about reproducibility and accessibility of evaluation. Already, several papers focused on model compression do not report full GLUE results, possibly due to the expensive finetuning process for each of the nine datasets. Finetuning currently requires additional compute and infrastructure, and the specific methods used impact task performance. As long as finetuning is still an essential component of evaluating encoders, devising cheap, accessible, and reproducible metrics for encoders is an open problem.
Ribeiro et al. (2020) suggest a practical solution to both probing model errors and reproducible evaluations by creating tools that quickly generate test cases for linguistic capabilities and find bugs in models. This task-agnostic methodology may be extensible to both challenging tasks and probing specific linguistic phenomena.
Which *BERT should we use?
Here, we discuss tradeoffs between metrics and synthesize the previous sections.We provide a series of questions to consider when working with encoders for research or application development.
Task performance vs. efficiency An increasingly popular line of recent work has investigated knowledge distillation, model compression, and sparsification of encoders (§4.2). These efforts have led to significantly smaller encoders that boast competitive performance, and under certain settings, non-contextual embeddings alone may be sufficient (Arora et al., 2020; Wang et al., 2020a). For downstream applications, ask: Is the extra iota of performance worth the significant costs of compute?

Leaderboards vs. real data As a community, we are hill-climbing on curated benchmarks that aggregate dozens of tasks. Performance on these benchmarks does not necessarily reflect that of specific real-world tasks, like understanding social media posts about a pandemic (Müller et al., 2020). Before picking the best encoder determined by average scores, ask: Is this encoder the best for our specific task? Should we instead curate a large dataset and pretrain again? Gururangan et al. (2020) suggest continued pretraining on in-domain data as a viable alternative to pretraining from scratch.
For real-world systems, practitioners should be especially conscious of the datasets on which these encoders are pretrained.There is a tradeoff between task performance and possible harms contained within the pretraining data.
Monolingual vs. Multilingual For some higher-resource languages, there exist monolingual pretrained encoders. For tasks in those languages, those encoders are a good starting point. However, as we discussed in §7, multilingual encoders can, surprisingly, perform competitively, yet these metrics are averaged over multiple languages and tasks. Again, we encourage looking at the relative performance for a specific task and language, and whether monolingual encoders (or embeddings) may be more suitable.
Ease-of-use vs. novelty With a constant stream of new papers and models (without peer review) innovating in each direction, we suggest using and building off encoders that are well-documented and have reproduced or reproducible results. Given the pace of the field and the large selection of models, unless aiming to reproduce prior work or improve the underlying encoder technology, we recommend proceeding with caution when reimplementing ideas from scratch.

Conclusions
In this survey, we categorize research in contextualized encoders and discuss some issues regarding its conclusions. We cover background on contextualized encoders, pretraining objectives, efficiency, data, approaches in model interpretability, and research in multilingual systems. As there is now a large selection of models to choose from, we discuss tradeoffs that emerge between models. We hope this work provides some assistance to both those entering the NLP community and those already using contextualized encoders in looking beyond SOTA (and Twitter) to make more educated choices.