BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Model Performance

Pretraining deep language models has led to large performance gains in NLP. Despite this success, Schick and Schütze (2020) recently showed that these models struggle to understand rare words. For static word embeddings, this problem has been addressed by separately learning representations for rare words. In this work, we transfer this idea to pretrained language models: We introduce BERTRAM, a powerful architecture based on BERT that is capable of inferring high-quality embeddings for rare words that are suitable as input representations for deep language models. This is achieved by enabling the surface form and contexts of a word to interact with each other in a deep architecture. Integrating BERTRAM into BERT leads to large performance increases due to improved representations of rare and medium frequency words on both a rare word probing task and three downstream tasks.

Contextualized representations obtained from pretrained deep language models (e.g., Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019; Liu et al., 2019b) already handle rare words implicitly using methods such as byte-pair encoding (Sennrich et al., 2016), WordPiece embeddings (Wu et al., 2016) and character-level CNNs (Baevski et al., 2019). Nevertheless, Schick and Schütze (2020) recently showed that BERT's (Devlin et al., 2019) performance on a rare word probing task can be significantly improved by explicitly learning representations of rare words using Attentive Mimicking (AM) (Schick and Schütze, 2019a). However, AM is limited in two important respects: • For processing contexts, it uses a simple bag-of-words model, making poor use of the available information.
• It combines form and context in a shallow fashion, preventing both input signals from interacting in a complex manner.
These limitations apply not only to AM, but to all previous work on obtaining representations for rare words by leveraging form and context. While using bag-of-words models is a reasonable choice for static embeddings, which are often themselves bag-of-words (e.g., Mikolov et al., 2013; Bojanowski et al., 2017), it stands to reason that they are not the best choice to generate input representations for position-aware, deep language models. To overcome these limitations, we introduce BERTRAM (BERT for Attentive Mimicking), a novel architecture for learning rare word representations that combines a pretrained BERT model with AM. As shown in Figure 1, the learned rare word representations can then be used as an improved input representation for another BERT model. By giving BERTRAM access to both surface form and contexts starting at the lowest layer, a deep integration of both input signals becomes possible.
Assessing the effectiveness of methods like BERTRAM in a contextualized setting is challenging: While most previous work on rare words was evaluated on datasets explicitly focusing on rare words (e.g., Luong et al., 2013; Herbelot and Baroni, 2017; Khodak et al., 2018; Liu et al., 2019a), these datasets are tailored to uncontextualized embeddings and thus not suitable for evaluating our model. Furthermore, rare words are not well represented in commonly used downstream task datasets. We therefore introduce rarification, a procedure to automatically convert evaluation datasets into ones for which rare words are guaranteed to be important. This is achieved by replacing task-relevant frequent words with rare synonyms obtained using semantic resources such as WordNet (Miller, 1995). We rarify three common text (or text pair) classification datasets: MNLI (Williams et al., 2018), AG's News (Zhang et al., 2015) and DBPedia (Lehmann et al., 2015). BERTRAM outperforms previous work on four English datasets by a large margin: on the three rarified datasets and on WNLaMPro (Schick and Schütze, 2020).
In summary, our contributions are as follows: • We introduce BERTRAM, a model that integrates BERT into Attentive Mimicking, enabling a deep integration of surface-form and contexts and much better representations for rare words.
• We devise rarification, a method that transforms evaluation datasets into ones for which rare words are guaranteed to be important.
• We show that adding BERTRAM to BERT achieves a new state-of-the-art on WNLaMPro (Schick and Schütze, 2020) and beats all baselines on rarified AG's News, MNLI and DBPedia, resulting in an absolute improvement of up to 25% over BERT.

Related Work
Surface-form information (e.g., morphemes, characters or character n-grams) is commonly used to improve word representations. For static word embeddings, this information can either be injected into a given embedding space (Luong et al., 2013; Pinter et al., 2017), or a model can directly be given access to it during training (Bojanowski et al., 2017; Salle and Villavicencio, 2018; Piktus et al., 2019).
In the area of contextualized representations, many architectures employ subword segmentation methods (e.g., Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019b), while others use convolutional neural networks to directly access character-level information (Kim et al., 2016; Peters et al., 2018; Baevski et al., 2019). Complementary to surface form, another useful source of information for understanding rare words are the contexts in which they occur (Lazaridou et al., 2017; Herbelot and Baroni, 2017; Khodak et al., 2018). Schick and Schütze (2019a,b) show that combining form and context leads to significantly better results than using just one of the two. While all of these methods are bag-of-words models, Liu et al. (2019a) recently proposed an architecture based on context2vec (Melamud et al., 2016). However, in contrast to our work, they (i) do not incorporate surface-form information and (ii) do not directly access the hidden states of context2vec, but instead simply use its output distribution.
Several datasets focus on rare words, e.g., Stanford Rare Word (Luong et al., 2013), Definitional Nonce (Herbelot and Baroni, 2017), and Contextual Rare Word (Khodak et al., 2018). However, unlike our rarified datasets, they are only suitable for evaluating uncontextualized word representations. Rarification is related to adversarial example generation (e.g., Ebrahimi et al., 2018), which manipulates the input to change a model's prediction. We use a similar mechanism to determine which words in a given sentence are most important and replace them with rare synonyms.

Form-Context Model
We first review the basis for our new model, the form-context model (FCM) (Schick and Schütze, 2019b). Given a set of d-dimensional high-quality embeddings for frequent words, FCM induces embeddings for rare words that are appropriate for the given embedding space. This is done as follows: Given a word w and a context C in which it occurs, a surface-form embedding v_form(w,C) ∈ R^d is obtained by averaging over embeddings of all character n-grams in w; the n-gram embeddings are learned during training. Similarly, a context embedding v_context(w,C) ∈ R^d is obtained by averaging over the embeddings of all words in C. Finally, both embeddings are combined using a gate

α = σ(x^T [v_form(w,C); v_context(w,C)] + y)

with parameters x ∈ R^{2d}, y ∈ R and σ denoting the sigmoid function, allowing the model to decide how to weight surface-form and context. The final representation of w is then a weighted combination of form and context embeddings:

v_(w,C) = α · v_form(w,C) + (1 − α) · v_context(w,C).

The context part of FCM is able to capture the broad topic of rare words, but since it is a bag-of-words model, it is not capable of obtaining a more concrete or detailed understanding (see Schick and Schütze, 2019b). Furthermore, the simple gating mechanism results in only a shallow combination of form and context. That is, the model is not able to combine form and context until the very last step: While it can learn to weight form and context components, the two embeddings (form and context) do not share any information and thus do not influence each other.
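The gating step above can be sketched in a few lines of NumPy (a minimal illustration; the function and variable names are ours, not from the FCM implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fcm_combine(v_form, v_context, x, y):
    # alpha = sigma(x^T [v_form; v_context] + y): a scalar gate deciding
    # how much weight goes to the surface form versus the context
    alpha = sigmoid(x @ np.concatenate([v_form, v_context]) + y)
    # weighted combination of form and context embeddings
    return alpha * v_form + (1.0 - alpha) * v_context

d = 4
rng = np.random.default_rng(0)
v_form = rng.normal(size=d)
v_context = rng.normal(size=d)
x, y = np.zeros(2 * d), 0.0  # zero gate parameters -> alpha = 0.5
v = fcm_combine(v_form, v_context, x, y)
```

With untrained (zero) gate parameters the gate is neutral and the result is the plain average of the two embeddings; training moves the gate toward whichever signal is more reliable for a given word.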

BERTRAM
To overcome these limitations, we introduce BERTRAM, a model that combines a pretrained BERT language model (Devlin et al., 2019) with Attentive Mimicking (Schick and Schütze, 2019a). We denote with e_t the (uncontextualized, i.e., first-layer) embedding assigned to a (wordpiece) token t by BERT. Given a sequence of such uncontextualized embeddings e = e_1, ..., e_n, we denote by h_j(e) the contextualized representation of the j-th token at the final layer when the model is given e as input.
Given a word w and a context C in which it occurs, let t = t_1, ..., t_m be the sequence obtained from C by (i) replacing w with a [MASK] token and (ii) tokenization (matching BERT's vocabulary); furthermore, let i denote the index of the [MASK] token in this sequence. We experiment with three variants of BERTRAM: BERTRAM-SHALLOW, BERTRAM-REPLACE and BERTRAM-ADD.

SHALLOW. Perhaps the simplest approach for obtaining a context embedding from C using BERT is to define

v_context(w,C) = h_i(e_{t_1}, ..., e_{t_m}).

This approach aligns well with BERT's pretraining objective of predicting likely substitutes for [MASK] tokens from their contexts. The context embedding v_context(w,C) is then combined with its form counterpart as in FCM.
While this achieves our first goal of using a more sophisticated context model that goes beyond bag-of-words, it still only combines form and context in a shallow fashion.
REPLACE. Before computing the context embedding, we replace the uncontextualized embedding of the [MASK] token with the word's surface-form embedding:

v_context(w,C) = h_i(e_{t_1}, ..., e_{t_{i−1}}, v_form(w,C), e_{t_{i+1}}, ..., e_{t_m}).
Our rationale for this is as follows: During regular BERT pretraining, words chosen for prediction are replaced with [MASK] tokens only 80% of the time and kept unchanged 10% of the time. Thus, standard pretrained BERT should be able to make use of form embeddings presented this way, as they provide a strong signal with regard to what the "correct" embedding of w may look like.
ADD. Before computing the context embedding, we prepad the input with the surface-form embedding of w, followed by a colon (e_:):

v_context(w,C) = h_{i+2}(v_form(w,C), e_:, e_{t_1}, ..., e_{t_m}).
The intuition behind this third variant is that lexical definitions and explanations of a word w are occasionally prefixed by "w :" (e.g., in some online dictionaries). We assume that BERT has seen many definitional sentences of this kind during pretraining and is thus able to leverage surface-form information about w presented this way.
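The three variants differ only in how the input sequence for BERT is assembled and which position's final-layer state is read off. A toy sketch (not the actual BERTRAM code; `embed` stands in for BERT's uncontextualized embedding lookup):

```python
import numpy as np

def build_variant_input(tokens, mask_index, v_form, embed, variant):
    """Return (input vectors, index whose final-layer state h_i becomes
    the context embedding) for one context, per BERTRAM variant."""
    e = [embed[t] for t in tokens]
    if variant == "SHALLOW":
        # plain masked context; read off the state at the [MASK] position
        return e, mask_index
    if variant == "REPLACE":
        # surface-form embedding takes the place of e_[MASK]
        e[mask_index] = v_form
        return e, mask_index
    if variant == "ADD":
        # prepad with "<surface form> :", shifting the [MASK] index by two
        return [v_form, embed[":"]] + e, mask_index + 2
    raise ValueError(f"unknown variant: {variant}")

# toy embedding table: each token maps to a constant 3-d vector
vocab = ["[CLS]", "other", "[MASK]", "such", "as", "trousers", ":"]
embed = {t: np.full(3, float(i)) for i, t in enumerate(vocab)}
v_form = np.full(3, -1.0)  # stands in for v_form(w, C)
tokens = ["[CLS]", "other", "[MASK]", "such", "as", "trousers"]
seq_add, idx_add = build_variant_input(tokens, 2, v_form, embed, "ADD")
seq_rep, idx_rep = build_variant_input(tokens, 2, v_form, embed, "REPLACE")
```

Note how ADD leaves the [MASK] token in place but shifts its index, while REPLACE overwrites it; in both cases the form embedding enters at the lowest layer, before any self-attention is applied.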
For both REPLACE and ADD, surface-form information is directly and deeply integrated into the computation of the context embedding. Figure 2 (left) shows how a single context is processed using ADD.
To exploit multiple contexts of a word if available, we follow the approach of Schick and Schütze (2019a) and add an AM layer on top of our model; see Figure 2 (right). Given a set of contexts C = {C_1, ..., C_m} and the corresponding embeddings v_(w,C_1), ..., v_(w,C_m), AM applies a self-attention mechanism to all embeddings, allowing the model to distinguish informative from uninformative contexts. The final embedding v_(w,C) is then a weighted combination of all embeddings:

v_(w,C) = Σ_{i=1}^{m} ρ_i · v_(w,C_i),

where the self-attention layer determines the weights ρ_i subject to Σ_{i=1}^{m} ρ_i = 1. For further details, see Schick and Schütze (2019a).
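The shape of this weighted combination can be illustrated with an untrained stand-in for the AM layer (the real layer uses a learned self-attention parameterization; here each context is simply scored by its agreement with the others):

```python
import numpy as np

def attentive_combine(context_embs):
    """Score each context embedding against all others via dot products,
    softmax the scores into weights rho_i (summing to 1), and return
    (rho, weighted sum). A toy stand-in for the learned AM layer."""
    V = np.stack(context_embs)           # (m, d)
    scores = (V @ V.T).mean(axis=1)      # agreement of each context with the rest
    rho = np.exp(scores - scores.max())
    rho = rho / rho.sum()                # weights sum to 1
    return rho, (rho[:, None] * V).sum(axis=0)

# two contexts that agree and one outlier: the outlier gets less weight
embs = [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])]
rho, v = attentive_combine(embs)
```

Even this untrained scoring already downweights the outlier context; the learned layer additionally adapts the scoring function during mimicking training.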

Training
Like previous work, we use mimicking (Pinter et al., 2017) as a training objective. That is, given a frequent word w with known embedding e_w and a set of corresponding contexts C, BERTRAM is trained to minimize ‖e_w − v_(w,C)‖². Training BERTRAM end-to-end is costly: the cost of processing a single training instance (w, C) with C = {C_1, ..., C_m} is the same as processing an entire batch of m examples in standard BERT. Therefore, we resort to the following three-stage training process:

1. We train only the context part, minimizing

‖e_w − Σ_{i=1}^{m} ρ_i · (A · v_context(w,C_i) + b)‖²,

where ρ_i is the weight assigned to each context C_i through the AM layer and A, b parameterize a linear transformation of the context embedding. Regardless of the selected BERTRAM variant, the context embedding is always obtained using SHALLOW in this stage. Furthermore, only A, b and all parameters of the AM layer are optimized.
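The stage-1 objective can be sketched as follows (a sketch of our reading of the reconstructed loss; the exact placement of the linear map A, b relative to the attention-weighted sum is an assumption):

```python
import numpy as np

def stage1_loss(e_w, context_embs, rho, A, b):
    """Mimicking loss for the context-only stage: the attention-weighted
    context embeddings, passed through a linear map (A, b), should
    reproduce the gold embedding e_w. Only A, b and the attention
    weights rho are trained in this stage; BERT itself stays frozen."""
    v = sum(r * np.asarray(c) for r, c in zip(rho, context_embs))
    diff = e_w - (A @ v + b)
    return float(diff @ diff)

d = 3
e_w = np.array([1.0, 2.0, 3.0])
contexts = [e_w.copy(), e_w.copy()]   # perfectly informative contexts
loss = stage1_loss(e_w, contexts, [0.5, 0.5], np.eye(d), np.zeros(d))
```

With perfectly informative contexts and an identity map the loss is zero; in practice the optimizer shapes A, b and the attention weights so that noisy contexts are averaged toward e_w.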
2. We train only the form part (i.e., only the n-gram embeddings); our loss for a single example (w, C) is ‖e_w − v_form(w,C)‖². Training in this stage is completely detached from the underlying BERT model.
3. In the third stage, we combine the pretrained form-only and context-only models and train all parameters. The first two stages are run only once and then used for all three BERTRAM variants because context and form are trained in isolation. The third stage must be run for each variant separately.
We freeze all of BERT's parameters during training as we - somewhat surprisingly - found that this slightly improves the model's performance while speeding up training. For ADD, we additionally found it helpful to freeze the form part in the third training stage. Importantly, for the first two stages of our training procedure, we do not have to backpropagate through BERT to obtain all required gradients, drastically increasing the training speed.

Dataset Rarification
The ideal dataset for measuring the quality of rare word representations would be one for which the accuracy of a model with no understanding of rare words is 0% whereas the accuracy of a model that perfectly understands rare words is 100%. Unfortunately, existing datasets do not satisfy this desideratum, not least because rare words - by their nature - occur rarely. This does not mean that rare words are not important: As we shift our focus in NLP from words and sentences as the main unit of processing to larger units like paragraphs and documents, rare words will occur in a high proportion of such larger "evaluation units". Rare words are also clearly a hallmark of human language competence, which should be the ultimate goal of NLP. Our work is part of a trend that sees a need for evaluation tasks in NLP that are more ambitious than what we have now. To create more challenging datasets, we use rarification, a procedure that automatically transforms existing text classification datasets in such a way that rare words become important. We require a pretrained language model M as a baseline, an arbitrary text classification dataset D containing labeled instances (x, y) and a substitution dictionary S, mapping each word w to a set of rare synonyms S(w). Given these ingredients, our procedure consists of three steps: (i) splitting the dataset into a train set and a set of test candidates, (ii) training the baseline model on the train set and (iii) modifying a subset of the test candidates to generate the final test set.
Dataset Splitting. We partition D into a training set D_train and a set of test candidates, D_cand. D_cand contains all instances (x, y) ∈ D such that for at least one word w in x, S(w) ≠ ∅ - subject to the constraint that the training set contains at least one third of the entire data.
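The splitting step can be sketched as follows (our own helper, not the paper's code; `synonyms(w)` returns the rare synonyms S(w)):

```python
def split_dataset(data, synonyms):
    """Partition data into (D_train, D_cand): instances with at least one
    rare-synonym-bearing word become test candidates, but D_train must
    keep at least a third of the data."""
    cand = [(x, y) for x, y in data if any(synonyms(w) for w in x)]
    train = [inst for inst in data if inst not in cand]
    # move candidates back into the train set until it holds >= 1/3 of the data
    while len(train) < len(data) / 3 and cand:
        train.append(cand.pop())
    return train, cand

syn = lambda w: {"movie": ["flick"]}.get(w, [])
data = [(("a", "movie"), 0), (("a", "book"), 1), (("the", "movie"), 0)]
train, cand = split_dataset(data, syn)
```

Here two of the three instances contain a substitutable word and become candidates; the remaining one already satisfies the one-third constraint for the train set.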
Baseline Training. We finetune M on D_train. Let (x, y) ∈ D_train where x = w_1, ..., w_n is a sequence of words. We deviate from the finetuning procedure of Devlin et al. (2019) in three respects: • We randomly replace 5% of all words in x with a [MASK] token. This allows the model to cope with missing or unknown words, a prerequisite for our final test set generation.
• As an alternative to overwriting the language model's uncontextualized embeddings for rare words, we also want to allow models to add an alternative representation during test time, in which case we simply separate both representations by a slash (cf. §5.3). To accustom the language model to this duplication of words, we replace each word w_i with "w_i / w_i" with a probability of 10%. To make sure that the model does not simply learn to always focus on the first instance during training, we randomly mask each of the two repetitions with probability 25%.
• We do not finetune the model's embedding layer. We found that this does not hurt performance, an observation in line with recent findings of Lee et al. (2019).
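The first two finetuning tweaks above can be sketched as a word-level augmentation (probabilities from the text; the helper itself is ours):

```python
import random

def augment(words, rng, p_mask=0.05, p_dup=0.10, p_copy_mask=0.25):
    """Mask 5% of words; duplicate a word as "w / w" 10% of the time,
    masking each copy of such a pair with probability 25%."""
    out = []
    for w in words:
        r = rng.random()
        if r < p_mask:
            out.append("[MASK]")
        elif r < p_mask + p_dup:
            pair = [w if rng.random() >= p_copy_mask else "[MASK]"
                    for _ in range(2)]
            out.extend([pair[0], "/", pair[1]])
        else:
            out.append(w)
    return out

rng = random.Random(0)
augmented = augment("the quick brown fox jumps".split(), rng)
```

Applied on the fly during finetuning, this exposes the baseline model to both missing words and slash-separated duplicates before it ever sees the rarified test set.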
Test Set Generation. Let p(y | x) be the probability that the finetuned model M assigns to class y given input x, and M(x) = argmax_{y∈Y} p(y | x) be the model's prediction for input x, where Y denotes the set of all labels. For generating our test set, we only consider candidates that are classified correctly by the baseline model, i.e., candidates (x, y) ∈ D_cand with M(x) = y. For each such entry, let x = w_1, ..., w_n and let x_{w_i=t} be the sequence obtained from x by replacing w_i with t. We compute

argmin_i p(y | x_{w_i=[MASK]}),

i.e., we select the word w_i whose masking pushes the model's prediction the farthest away from the correct label. If removing this word already changes the model's prediction - that is, M(x_{w_i=[MASK]}) ≠ y -, we select a random rare synonym ŵ_i ∈ S(w_i) and add (x_{w_i=ŵ_i}, y) to the test set. Otherwise, we repeat the above procedure; if the label still has not changed after masking up to 5 words, we discard the candidate. Each instance (x_{w_{i_1}=ŵ_{i_1}, ..., w_{i_k}=ŵ_{i_k}}, y) of the resulting test set has the following properties: • If each ŵ_{i_j} is replaced by a [MASK] token, the entry is classified incorrectly by M. In other words, understanding the words w_{i_j} is necessary for M to determine the correct label.
• If the model's internal representation of each ŵ_{i_j} is sufficiently similar to its representation of w_{i_j}, the entry is classified correctly by M. That is, if the model is able to understand the rare words ŵ_{i_j} and to identify them as synonyms of w_{i_j}, it will predict the correct label.

Note that the test set is closely coupled to the baseline model M because we select the words to be replaced based on M's predictions. Importantly, however, the model is never queried with any rare synonym during test set generation, so its representations of rare words are not taken into account for creating the test set. Thus, while the test set is not suitable for comparing M with an entirely different model M′, it allows us to compare various strategies for representing rare words in the embedding space of M. Definitional Nonce (Herbelot and Baroni, 2017) is subject to a similar constraint: it is tied to a specific (uncontextualized) embedding space based on Word2Vec (Mikolov et al., 2013).
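The generation loop above can be sketched as follows (the helper signatures `prob`, `predict`, `synonyms` are our assumptions, not the paper's code):

```python
import random

def rarify_instance(x, y, prob, predict, synonyms, rng, max_masked=5):
    """Greedily mask the word whose removal most lowers p(y|x); once the
    prediction flips, swap every masked word for a random rare synonym.
    Returns the rarified word sequence, or None to discard the candidate."""
    x = list(x)
    replaced = []
    for _ in range(max_masked):
        candidates = [i for i, w in enumerate(x)
                      if w != "[MASK]" and synonyms(w)]
        if not candidates:
            return None
        i = min(candidates,
                key=lambda i: prob(x[:i] + ["[MASK]"] + x[i + 1:], y))
        replaced.append((i, x[i]))
        x[i] = "[MASK]"
        if predict(x) != y:
            for j, w in replaced:
                x[j] = rng.choice(synonyms(w))
            return x
    return None  # label never flipped within 5 masks: discard

# toy model: masking any word flips the prediction away from "pos"
syn = lambda w: {"good": ["superb"]}.get(w, [])
prob = lambda x, y: 0.1 if "[MASK]" in x else 0.9
predict = lambda x: "neg" if "[MASK]" in x else "pos"
result = rarify_instance(["a", "good", "movie"], "pos", prob, predict,
                         syn, random.Random(0))
```

In the toy example, masking "good" flips the prediction, so it is the word whose understanding is necessary, and it gets replaced by its rare synonym.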

Setup
For our evaluation of BERTRAM, we follow the experimental setup of Schick and Schütze (2020). We experiment with integrating BERTRAM both into BERT base and RoBERTa large (Liu et al., 2019b). Throughout our experiments, when BERTRAM is used to provide input representations for one of the two models, we use the same model as BERTRAM's underlying language model. Further training specifications can be found in Appendix A.
While BERT was trained on BookCorpus (Zhu et al., 2015) and a large Wikipedia dump, we follow previous work and train BERTRAM only on the much smaller Westbury Wikipedia Corpus (WWC) (Shaoul and Westbury, 2010); this of course gives BERT a clear advantage over BERTRAM. This advantage is even more pronounced when comparing BERTRAM with RoBERTa, which is trained on a corpus that is an order of magnitude larger than the original BERT corpus. We try to at least partially compensate for this as follows: In our downstream task experiments, we gather the set of contexts C for each word from WWC+BookCorpus during inference.

WNLaMPro
We evaluate BERTRAM on the WNLaMPro dataset (Schick and Schütze, 2020). This dataset consists of cloze-style phrases like "A lingonberry is a ___." and the task is to correctly fill the slot with one of several acceptable target words (e.g., "fruit", "bush" or "berry"), which requires understanding of the meaning of the phrase's keyword ("lingonberry" in the example). As the goal of this dataset is to probe a language model's ability to understand rare words without any task-specific finetuning, Schick and Schütze (2020) do not provide a training set. The dataset is partitioned into three subsets based on the keyword's frequency in WWC: RARE (occurring fewer than 10 times), MEDIUM (occurring between 10 and 100 times), and FREQUENT (all remaining words).
For our evaluation, we compare the performance of a standalone BERT (or RoBERTa) model with one that uses BERTRAM as shown in Figure 1 (bottom). As our focus is to improve representations for rare words, we evaluate our model only on WNLaMPro RARE and MEDIUM. Table 1 gives results; our measure is mean reciprocal rank (MRR). We see that supplementing BERT with any of the proposed methods results in noticeable improvements for the RARE subset, with ADD clearly outperforming SHALLOW and REPLACE. Moreover, ADD performs surprisingly well for more frequent words, improving the score for WNLaMPro-MEDIUM by 58% compared to BERT base and 37% compared to Attentive Mimicking. This makes sense considering that the key enhancement of BERTRAM over AM lies in improving context representations and the interconnection of form and context; the more contexts are given, the more this comes into play. Noticeably, despite being both based on and integrated into a BERT base model, our architecture even outperforms BERT large by a large margin. While RoBERTa performs much better than BERT on WNLaMPro, BERTRAM still significantly improves results for both rare and medium frequency words. As it performs best for both the RARE and MEDIUM subsets, we always use the ADD configuration of BERTRAM in the following experiments.
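For reference, mean reciprocal rank over cloze queries with multiple acceptable answers can be computed as follows (a toy helper, not the WNLaMPro evaluation code):

```python
def mean_reciprocal_rank(ranked_predictions, gold_sets):
    """For each query, take 1/rank of the first acceptable target word in
    the model's ranked predictions (0 if none appears); average over queries."""
    total = 0.0
    for preds, gold in zip(ranked_predictions, gold_sets):
        for rank, p in enumerate(preds, start=1):
            if p in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_predictions)

# e.g., two queries like "A lingonberry is a ___." with acceptable
# answers {fruit, bush, berry}: first hit at rank 2, then at rank 1
mrr = mean_reciprocal_rank([["plant", "fruit", "tree"], ["berry"]],
                           [{"fruit", "bush", "berry"}] * 2)
```

Here the MRR is (1/2 + 1/1) / 2 = 0.75; improving a rare word's input representation moves acceptable targets up the ranking and thus directly raises this score.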

Downstream Task Datasets
To measure the effect of adding BERTRAM to a pretrained deep language model on downstream tasks, we rarify (cf. §4) the following three datasets: • MNLI (Williams et al., 2018), a natural language inference dataset where, given two sentences a and b, the task is to decide whether a entails b, a and b contradict each other, or neither; • AG's News (Zhang et al., 2015), a news classification dataset with four different categories (world, sports, business and science/tech); • DBPedia (Lehmann et al., 2015), an ontology dataset with 14 classes (e.g., company, artist) that have to be identified from text snippets.
For all three datasets, we create rarified instances both using BERT base and RoBERTa large as a baseline model and build the substitution dictionary S using the synonym relation of WordNet (Miller, 1995) and the pattern library (Smedt and Daelemans, 2012) to make sure that all synonyms have consistent parts of speech. Furthermore, we only consider synonyms for each word's most frequent sense; this filters out much noise and improves the quality of the created sentences. In addition to WordNet, we use the misspelling dataset of Piktus et al. (2019). To prevent misspellings from dominating the resulting datasets, we only assign misspelling-based substitutes to a randomly selected 10% of the words contained in each sentence. Motivated by the results on WNLaMPro-MEDIUM, we consider every word that occurs less than 100 times in WWC+BookCorpus as being rare. Example entries from the rarified datasets obtained using BERT base as a baseline model can be seen in Table 2. The average number of words replaced with synonyms or misspellings is 1.38, 1.82 and 2.34 for MNLI, AG's News and DBPedia, respectively.
Our default way of injecting BERTRAM embeddings into the baseline model is to replace the sequence of uncontextualized subword token embeddings for a given rare word with its BERTRAM-based embedding (Figure 1, bottom). That is, given a sequence of uncontextualized token embeddings e = e_1, ..., e_n where e_i, ..., e_j with 1 ≤ i ≤ j ≤ n is the sequence of embeddings for a single rare word w with BERTRAM-based embedding v_(w,C), we replace e with

e′ = e_1, ..., e_{i−1}, v_(w,C), e_{j+1}, ..., e_n.
As an alternative to replacing the original sequence of subword embeddings for a given rare word, we also consider BERTRAM-SLASH, a configuration where the BERTRAM-based embedding is simply added and both representations are separated using a single slash:

e_SLASH = e_1, ..., e_j, e_/, v_(w,C), e_{j+1}, ..., e_n.
The intuition behind this variant is that in BERT's pretraining corpus, a slash is often used to separate two variants of the same word (e.g., "useable / usable") or two closely related concepts (e.g., "company / organization", "web-based / cloud") and thus, BERT should be able to understand that both e_i, ..., e_j and v_(w,C) refer to the same entity. We therefore surmise that whenever some information is encoded in one representation but not in the other, giving BERT both representations is helpful.
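The two injection modes amount to a small sequence edit on the input embeddings (a toy sketch with integers standing in for embedding vectors; the helper is ours):

```python
def inject_bertram(embs, i, j, v_word, e_slash, slash=True):
    """Positions i..j (inclusive) hold the rare word's subword embeddings;
    v_word is its BERTRAM-based embedding, e_slash the embedding of "/".
    slash=True keeps the subwords and appends "/ v_word" (BERTRAM-SLASH);
    slash=False replaces the subwords with v_word (the default mode)."""
    if slash:
        return embs[:j + 1] + [e_slash, v_word] + embs[j + 1:]
    return embs[:i] + [v_word] + embs[j + 1:]

embs = [10, 11, 12, 13, 14]   # rare word occupies positions 1..2
slashed = inject_bertram(embs, 1, 2, 99, 7, slash=True)
replaced = inject_bertram(embs, 1, 2, 99, 7, slash=False)
```

Note that SLASH lengthens the sequence by two positions per rare word, whereas replacement shortens it to a single position for the whole word.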
By default, the set of contexts C for each word is obtained by collecting all sentences from WWC+BookCorpus in which it occurs. We also try a variant where we add in-domain contexts by giving BERTRAM access to all texts (but not labels) found in the test set; we refer to this variant as INDOMAIN. Our motivation for including this variant is as follows: Moving from the training stage of a model to its production use often causes a slight domain shift. This in turn leads to an increased number of input sentences containing words that did not - or only very rarely - appear in the training data. However, such input sentences can easily be collected as additional unlabeled examples during production use. While there is no straightforward way to leverage these unlabeled examples with an already finetuned BERT model, BERTRAM can easily make use of them without requiring any labels or any further training: They can simply be included as additional contexts during inference. As this gives BERTRAM a slight advantage, we also report results for all configurations without using INDOMAIN data. Importantly, adding INDOMAIN data increases the number of contexts for more than 90% of all rare words by at most 3, meaning that they can still be considered rare despite the additional INDOMAIN contexts.
Table 3 reports, for each task, the accuracy on the entire dataset (All) as well as scores obtained considering only instances where at least one word was replaced by a misspelling (Msp) or a WordNet synonym (WN), respectively. Consistent with the results on WNLaMPro, using the SLASH variant brings improvements across all datasets, as does adding INDOMAIN contexts (exception: BERT/AG's News). This makes sense considering that for a rare word, every single additional context can be crucial for gaining a deeper understanding. Correspondingly, it is not surprising that the benefit of adding BERTRAM to RoBERTa is less pronounced, because BERTRAM uses only a fraction of the contexts available to RoBERTa during pretraining. Nonetheless, adding BERTRAM significantly improves RoBERTa's accuracy for all three datasets, both with and without adding INDOMAIN contexts.
To further understand for which words using BERTRAM is helpful, Figure 3 looks at the accuracy of BERT base both with and without BERTRAM as a function of word frequency. That is, we compute the accuracy scores for both models when considering only entries (x_{w_{i_1}=ŵ_{i_1}, ..., w_{i_k}=ŵ_{i_k}}, y) where each substituted word ŵ_{i_j} occurs fewer than c_max times in WWC+BookCorpus, for different values of c_max. As one would expect, c_max is positively correlated with the accuracies of both models, showing that the rarer a word is, the harder it is to understand. Interestingly, the gap between standalone BERT and BERT with BERTRAM remains more or less constant regardless of c_max. This suggests that using BERTRAM may even be helpful for more frequent words.
To investigate this hypothesis, we perform another rarification of MNLI that differs from the previous rarification in two respects. First, we increase the threshold for a word to count as rare from 100 to 1000. Second, as this means that we have more WordNet synonyms available, we do not use the misspelling dictionary (Piktus et al., 2019) for substitution. We refer to the resulting datasets for BERT base and RoBERTa large as MNLI-1000.
Figure 4 shows results on MNLI-1000 for various rare word frequency ranges. For each interval [c_0, c_1) on the x-axis, the y-axis shows the improvement in accuracy compared to standalone BERT or RoBERTa when only dataset entries are considered for which each rarified word occurs between c_0 (inclusive) and c_1 (exclusive) times in WWC+BookCorpus. We see that for words with frequency less than 125, the improvement in accuracy remains similar even without using misspellings as another source of substitutions. Interestingly, for every single interval of rare word counts considered, adding BERTRAM-SLASH to BERT considerably improves its accuracy. For RoBERTa, adding BERTRAM brings improvements only for words occurring fewer than 500 times. While using INDOMAIN data is beneficial for rare words - simply because it gives us additional contexts for these words -, when considering only words that occur at least 250 times in WWC+BookCorpus, adding INDOMAIN contexts does not help.

Conclusion
We have introduced BERTRAM, a novel architecture for inducing high-quality representations for rare words in BERT's and RoBERTa's embedding spaces. This is achieved by employing a powerful pretrained language model and deeply integrating surface-form and context information. By replacing important words with rare synonyms, we created downstream task datasets that are more challenging and support the evaluation of NLP models on the task of understanding rare words, a capability that human speakers have. On all of these datasets, BERTRAM improves over standard BERT and RoBERTa, demonstrating the usefulness of our method.
Our analysis showed that BERTRAM is beneficial not only for rare words (our main target in this paper), but also for frequent words. In future work, we want to investigate BERTRAM's potential benefits for such frequent words. Furthermore, it would be interesting to explore more complex ways of incorporating surface-form information - e.g., by using a character-level CNN similar to the one of Kim et al. (2016) - to balance out the potency of BERTRAM's form and context parts.

A Training Details
Our implementation of BERTRAM is based on PyTorch (Paszke et al., 2017) and the Transformers library (Wolf et al., 2019). To obtain target embeddings for frequent multi-token words (i.e., words that occur at least 100 times in WWC+BookCorpus) during training, we use one-token approximation (OTA) (Schick and Schütze, 2020). For RoBERTa large, we found that increasing the number of iterations per word from 4,000 to 8,000 produces better OTA embeddings using the same evaluation setup as Schick and Schütze (2020). For all stages of training, we use Adam (Kingma and Ba, 2015) as the optimizer.
Context-Only Training. During the first stage of our training process, we train BERTRAM with a maximum sequence length of 96 and a batch size of 48 contexts for BERT base and 24 contexts for RoBERTa large. These parameters are chosen such that a batch fits on a single Nvidia GeForce GTX 1080Ti. Each context in a batch is mapped to a word w from the set of training words, and each batch contains at least 4 and at most 32 contexts per word. For BERT base and RoBERTa large, we pretrain the context part for 5 and 3 epochs, respectively. We use a maximum learning rate of 5 · 10^-5 and perform linear warmup for the first 10% of training examples, after which the learning rate is linearly decayed.
Form-Only Training. In the second stage of our training process, we use the same parameters as Schick and Schütze (2020), as our form-only model is the very same as theirs. That is, we use a learning rate of 0.01 and a batch size of 64 words, and we apply n-gram dropout with a probability of 10%. We pretrain the form-only part for 20 epochs.
Combined Training. For the final stage, we use the same training configuration as for context-only training, but we keep n-gram dropout from the form-only stage. We perform combined training for 3 epochs. For ADD, when using RoBERTa as the underlying language model, we do not just prepad the input with the surface-form embedding followed by a colon, but additionally wrap the surface-form embedding in double quotes. That is, we prepad the input with e_", v_form(w,C), e_", e_:. We found this to perform slightly better in preliminary experiments with some toy examples.

B Evaluation Details
WNLaMPro In order to ensure comparability with results of Schick and Schütze (2020), we use only WWC to obtain contexts for WNLaMPro keywords.
Rarified Datasets To obtain rarified instances of MNLI, AG's News and DBPedia, we train BERT base and RoBERTa large on each task's training set for 3 epochs. We use a batch size of 32, a maximum sequence length of 128 and a weight decay factor of 0.01. For BERT, we perform linear warmup for the first 10% of training examples and use a maximum learning rate of 5 · 10^-5. After reaching its peak value, the learning rate is linearly decayed. For RoBERTa, we found training to be unstable with these parameters, so we chose a lower learning rate of 1 · 10^-5 and performed linear warmup for the first 10,000 training steps.
To obtain results for our baselines on the rarified datasets, we use the original Mimick implementation of Pinter et al. (2017), the A La Carte implementation of Khodak et al. (2018) and the Attentive Mimicking implementation of Schick and Schütze (2019a) with their default hyperparameter settings. As A La Carte can only be used for words with at least one context, we keep the original BERT embeddings whenever no such context is available.
While using BERTRAM allows us to completely remove the original BERT embeddings for all rare words and still obtain improvements in accuracy on all three rarified downstream tasks, the same is not true for RoBERTa, where removing the original sequence of subword token embeddings for a given rare word (i.e., not using the SLASH variant) hurts performance, with accuracy dropping by 5.6, 7.4 and 2.1 points for MNLI, AG's News and DBPedia, respectively. We believe this to be due to the vast amount of additional contexts for rare words in RoBERTa's training set that are not available to BERTRAM.

Figure 3: BERT vs. BERT combined with BERTRAM-SLASH (BERT+BSL) on three downstream tasks for varying maximum numbers of contexts c_max

Table 3: Accuracy of standalone BERT and RoBERTa, various baselines and BERTRAM on rarified MNLI, AG's News and DBPedia. The five BERTRAM instances are BERTRAM-ADD. Best results per baseline model are underlined; results that do not differ significantly from the best results in a two-sided binomial test (p < 0.05) are bold. Msp/WN: subset of instances containing at least one misspelling/synonym. All: all instances.