Quantifying the Contextualization of Word Representations with Semantic Class Probing

Pretrained language models achieve state-of-the-art results on many NLP tasks, but there are still many open questions about how and why they work so well. We investigate the contextualization of words in BERT. We quantify the amount of contextualization, i.e., how well words are interpreted in context, by studying the extent to which semantic classes of a word can be inferred from its contextualized embedding. Quantifying contextualization helps in understanding and utilizing pretrained language models. We show that the top layer representations support highly accurate inference of semantic classes; that the strongest contextualization effects occur in the lower layers; that local context is mostly sufficient for contextualizing words; and that top layer representations are more task-specific after finetuning while lower layer representations are more transferable. Finetuning uncovers task-related features, but pretrained knowledge about contextualization is still well preserved.


Introduction
Pretrained language models like ELMo (Peters et al., 2018a), BERT (Devlin et al., 2019), and XLNet are top performers in NLP because they learn contextualized representations, i.e., representations that reflect the interpretation of a word in context as opposed to its general meaning, which is less helpful in solving NLP tasks. That pretrained language models contextualize words is clear qualitatively; however, there has been little work investigating contextualization quantitatively, i.e., measuring the extent to which a word can be interpreted in context.
We use BERT (Devlin et al., 2019) as our pretrained language model and quantify contextualization by investigating how well BERT infers semantic classes (s-classes) of a word in context, e.g., the s-class organization for "Apple" in "Apple stock rises" vs. the s-class food in "Apple juice is healthy". We use s-class inference as a proxy for contextualization since accurate s-class inference reflects a successful contextualization of a word: an effective interpretation of the word in context.
By probing for s-classes we quantify directly where and how contextualization happens in BERT. E.g., we find that the strongest contextual interpretation effects occur in the lower layers and that the top two layers contribute little to contextualization. We also investigate how the amount of context available affects contextualization.
In addition, since pretrained language models in practice need to be finetuned on downstream tasks (Devlin et al., 2019), we further investigate the interactions between finetuning and contextualization. We show that the pretrained knowledge about contextualization is well preserved in finetuned models. We make the following contributions: (i) We investigate how accurately BERT interprets words in context. We find that BERT's performance is high (almost 85% F1), but that there is still room for improvement. (ii) We quantify how much each additional layer in BERT contributes to contextualization. We find that the strongest contextual interpretation effects occur in the lower layers. The top two layers seem to be optimized only for the pretraining objective of predicting masked words (Devlin et al., 2019) and only add small increments to contextualization. (iii) We investigate the amount of context BERT needs to exploit for interpreting a word and find that BERT effectively integrates local context up to five words to the left and to the right (a 10-word context window). (iv) We investigate the dynamics of BERT's representations in finetuning. We find that finetuning has little effect on lower layers, suggesting that they are more easily transferable across tasks. Higher layers are strongly changed for word-level tasks like part-of-speech tagging, but less noticeably for sentence-level tasks like paraphrase classification. Finetuning uncovers task-related features, but the knowledge captured in pretraining is well preserved. We quantify these effects by s-class inference performance.

GloVe     | BERT
--------- | --------
suits     | suits
lawsuit   | suited
filed     | lawsuit
lawsuits  | ##suit
sued      | lawsuits
complaint | slacks
jacket    | 47th

Table 1: Nearest neighbors of "suit" in GloVe and in BERT (BERT-base-uncased) wordpiece embeddings

Motivation and Methodology
The key benefit of pretrained language models (McCann et al., 2017; Peters et al., 2018a; Radford et al., 2019; Devlin et al., 2019) is that they produce contextualized embeddings that are useful in NLP. The top layer contextualized word representations from pretrained language models are widely utilized; however, the fact that pretrained language models implement a process of contextualization, starting with a completely uncontextualized layer of wordpieces at the bottom, is not well studied. Table 1 gives an example: BERT's wordpiece embedding of "suit" is not contextualized: it contains several meanings of the word, including "to suit" ("be convenient"), lawsuit, and garment ("slacks"). Thus, there is no difference in this respect between BERT's wordpiece embeddings and uncontextualized word embeddings like GloVe (Pennington et al., 2014). Pretrained language models start out with an uncontextualized representation at the lowest layer, then gradually contextualize it. This is the process we analyze in this paper. For investigating the contextualization process, one possibility is to use word senses and to tap resources like the WordNet (WN) (Fellbaum, 1998) based word sense disambiguation benchmarks of the Senseval series (Edmonds and Cotton, 2001; Snyder and Palmer, 2004; Raganato et al., 2017). However, the abstraction level in WN sense inventories has been criticized as too fine-grained (Izquierdo et al., 2009), providing limited information to applications requiring higher level abstraction.

       words   comb's   contexts
train  35,399  62,184  2,178,895
dev     8,850  15,437    542,938
test   44,250  77,706  2,722,893

Table 2: Number of words, word-s-class combinations, and contexts per split in our probing dataset. Appendix §A.6 shows the 34 s-classes and statistics per class.
Various levels of granularity of abstraction have been explored such as WN domains (Magnini and Cavaglià, 2000), supersenses (Ciaramita and Johnson, 2003;Levine et al., 2019) and basic level concepts (Beviá et al., 2007). In this paper, we use semantic classes (s-classes) (Yarowsky, 1992;Resnik, 1993;Kohomban and Lee, 2005;Yaghoobzadeh et al., 2019) as the proxy for the meaning contents of words to study the contextualization capability of BERT. Specifically, we use the Wikipedia-based resource for Probing Semantics in Word Embeddings (Wiki-PSE) (Yaghoobzadeh et al., 2019) which is detailed in §3.1.

Probing dataset
For s-class probing, we use the s-class labeled corpus Wiki-PSE (Yaghoobzadeh et al., 2019). It consists of a set of 34 s-classes, an inventory of word→s-class mappings and an English Wikipedia text corpus in which words in context are labeled with the 34 s-classes. For example, contexts of "Apple" that refer to the company are labeled with "organization". We refer to a word labeled with an s-class as a word-s-class combination, e.g., "@apple@-organization". 1 The Wiki-PSE text corpus contains >550 million tokens, >17 million of which are annotated with an s-class. Working on the entire Wiki-PSE with BERT is not feasible, e.g., the word-s-class combination "@france@-location" has 98,582 contexts. Processing all these contexts by BERT consumes significant amounts of energy (Strubell et al., 2019; Schwartz et al., 2019) and time. Hence for each word-s-class combination, we sample a maximum of 100 contexts to speed up our experiments.

    ...
    if S ∈ sclasses then
        PosVecs.append(vector)
    else
        NegVecs.append(vector)
    classifier = Classifier()
    classifier.train(PosVecs, NegVecs)
    return classifier

Figure 1: Training a diagnostic classifier with uncontextualized word representations for an s-class S.
Wiki-PSE provides a balanced train/test split; we use 20% of the training set as our development set. Table 2 gives statistics of our dataset.

Probing for semantic classes
For each of the 34 s-classes in Wiki-PSE, we train a binary classifier to diagnose if an input embedding encodes information for inferring the s-class.

Probing uncontextualized embeddings
We make a distinction in this paper between two different factors that contribute to BERT's performance: (i) a powerful learning architecture that gives rise to high-quality representations and (ii) contextualization in applications, i.e., words are represented as contextualized embeddings for solving NLP tasks. Here, we adopt Schuster et al. (2019)'s method of computing uncontextualized BERT embeddings (AVG-BERT-ℓ, see §4.2.1) and show that (i) alone already has a strong positive effect on performance when compared to other uncontextualized embeddings. So BERT's representation learning yields high performance, even when used in a completely uncontextualized setting. We adopt the setup in Yaghoobzadeh et al. (2019) to probe uncontextualized embeddings: for each of the 34 s-classes, we train a binary classifier as shown in Figure 1. Table 2, column "words" shows the sizes of train/dev/test. The evaluation measure is micro F1 over all decisions of the 34 binary classifiers.

Probing contextualized embeddings
We probe BERT with the same setup: a binary classifier is trained for each of the 34 s-classes; each BERT layer is probed individually.
For uncontextualized embeddings, a word has a single vector, which is either a positive or negative example for an s-class. For contextualized embeddings, the contexts of a word will typically be mixed; for example, "food" contexts (a candy) of "@airheads@" are positive but "art" contexts (a film) of "@airheads@" are negative examples for the classifier of "food". Table 2, column "contexts" shows the sizes of train/dev/test when probing BERT. Figure 2 compares our two probing setups. In evaluation, we weight frequent word-s-class combinations (those having 100 contexts in our dataset) and the much larger number of less frequent word-s-class combinations equally. To this end, we aggregate the decisions for the contexts of a word-s-class combination. We stipulate that at least half of the contexts must be correctly classified. For example, "@airheads@-art" occurs 47 times, so we evaluate the "art" classifier as accurate for "@airheads@-art" if it classifies at least 24 contexts correctly. The final evaluation measure is micro F1 over all 15,437 (for dev) and 77,706 (for test) decisions (see Table 2) of the 34 classifiers for the word-s-class combinations.

Figure 2: Setups for probing uncontextualized and contextualized embeddings. For BERT, we input a context sentence to extract the contextualized embedding of a word, e.g., "airheads"; "food" is the correct s-class label for this context.
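The aggregation rule, counting a word-s-class combination as correct when at least half of its contexts are classified correctly, can be sketched as follows (the function name is ours):

```python
def combination_correct(context_predictions):
    """A word-s-class combination counts as correctly inferred if at
    least half of its contexts are classified correctly."""
    correct = sum(context_predictions)  # booleans: per-context correctness
    return correct * 2 >= len(context_predictions)

# "@airheads@-art" has 47 contexts; 24 correctly classified ones suffice.
preds = [True] * 24 + [False] * 23
```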

Data preprocessing
BERT uses wordpieces (Wu et al., 2016) to represent text and infrequent words are tokenized to several wordpieces. For example, "infrequent" is tokenized to "in", "##fr", "##e", and "##quent". Following He and Choi (2020), we average wordpiece embeddings to get a single vector representation of a word. 2 We limit the maximum sequence length of the context sentence input to BERT to 128. Consistent with the probing literature, we use a simple probing classifier: a 1-layer multilayer perceptron (MLP) with 1024 hidden dimensions and ReLU.
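The wordpiece-averaging step can be sketched as below; the piece vectors are toy values and the function name is ours (a real pipeline would obtain the pieces from the BERT tokenizer):

```python
import numpy as np

def average_wordpieces(piece_vectors):
    """Average the wordpiece vectors of one word into a single word
    vector, following the averaging scheme of He and Choi (2020).

    piece_vectors: array of shape (n_pieces, dim)
    """
    return np.asarray(piece_vectors).mean(axis=0)

# "infrequent" -> ["in", "##fr", "##e", "##quent"]: four piece vectors
# (toy 2-d values for illustration).
pieces = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]])
word_vec = average_wordpieces(pieces)
```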

Representation learners
Six uncontextualized embedding spaces are evaluated: (i) PSE. A 300-dimensional embedding space computed by running skipgram with negative sampling (Mikolov et al., 2013) on the Wiki-PSE text corpus. Yaghoobzadeh et al. (2019) show that PSE outperforms other embedding spaces. (ii) Rand. An embedding space with the same vocabulary and dimension size as PSE. Vectors are drawn from N(0, I_300). Rand is used to confirm that word representations indeed encode valid meaning contents that can be identified by diagnostic MLPs, rather than random weights. (iii) fastText. The 300-dimensional fastText (Bojanowski et al., 2017) embeddings. (iv) GloVe. The 300-dimensional space trained on 6 billion tokens (Pennington et al., 2014). Out-of-vocabulary (OOV) words are associated with vectors drawn from N(0, I_300). (v) BERTw. The 768-dimensional wordpiece embeddings in BERT. We tokenize a word with the BERT tokenizer, then average its wordpiece embeddings. (vi) AVG-BERT-ℓ. 3 For an annotated word in Wiki-PSE, we average all of its contextualized embeddings from BERT layer ℓ in the Wiki-PSE text corpus. Comparing AVG-BERT-ℓ with the others brings a new insight: to what extent does this "uncontextualized" variant of BERT outperform the others in encoding different s-classes of a word?
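A sketch of how such a type-level AVG-BERT-ℓ vector could be computed, assuming the per-occurrence layer-ℓ vectors of a word have already been extracted from the corpus (the function name and toy vectors are ours):

```python
import numpy as np

def avg_bert_layer(occurrence_vectors):
    """AVG-BERT-ℓ sketch: the type-level vector of a word is the mean of
    its layer-ℓ contextualized vectors over all corpus occurrences."""
    return np.mean(occurrence_vectors, axis=0)

# Two occurrences of the same word at some layer ℓ (toy 3-d vectors).
occ = [np.array([1.0, 2.0, 3.0]), np.array([3.0, 2.0, 1.0])]
anchor = avg_bert_layer(occ)
```

The result plays the role of an "anchor word" vector in the sense of Schuster et al. (2019).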
Four contextualized embedding models are considered: (i) BERT. We use the PyTorch (Paszke et al., 2019; Wolf et al., 2019) implementation of the 12-layer BERT-base-uncased model (Wiki-PSE is uncased). (ii) P-BERT. A bag-of-words model that "contextualizes" the wordpiece embedding of an annotated word by averaging the embeddings of all wordpieces of the sentence it occurs in. Comparing BERT with P-BERT reveals to what extent the self-attention mechanism outperforms an average pooling practice when contextualizing words. (iii) P-fastText. Similar to P-BERT, but we use fastText word embeddings. Comparing BERT with P-fastText indicates to what extent BERT outperforms uncontextualized embedding spaces when they also have access to contextual information. (iv) P-Rand. Similar to P-BERT, but we draw word embeddings from N(0, I_300). Wieting and Kiela (2019) show that a random baseline has good performance in tasks like sentence classification.

Footnote 3: BERTw and AVG-BERT-ℓ have more dimensions. But Yaghoobzadeh et al. (2019) showed that different dimensionalities have a negligible impact on relative performance when probing for s-classes using MLPs as diagnostic classifiers.

Figure 3 shows uncontextualized embedding probing results (full results in Table 5 in Appendix). Comparing with random weights, all embedding spaces encode informative features helping s-class inference. BERTw delivers results similar to GloVe and fastText, demonstrating our earlier point (cf. the qualitative example in Table 1) that the lowest embedding layer of BERT is uncontextualized; several meanings of a word are conflated into a single vector.

S-class inference results
PSE performs strongly, consistent with observations in Yaghoobzadeh et al. (2019). AVG-BERT-10 performs best among all spaces. Thus for a given word, averaging its contextualized embeddings from BERT yields a high quality type-level embedding vector, similar to "anchor words" in cross-lingual alignment (Schuster et al., 2019).
As expected, the top AVG-BERT layers outperform lower layers, given the deep architecture of BERT. Additionally, AVG-BERT-0 significantly outperforms BERTw, evidencing the importance of position embeddings and the self-attention mechanism (Vaswani et al., 2017) when composing the wordpieces of a word. Figure 4 shows contextualized embedding probing results (full results in Table 6 in Appendix). Comparing BERT layers, a clear trend can be identified: s-class inference performance increases monotonically with higher layers. This increase levels off in the top layers. Thus, the features from deeper layers improve word contextualization, advancing s-class inference. It also verifies previous findings: semantic tasks are mainly solved at higher layers (Liu et al., 2019; Tenney et al., 2019a). We can also observe that the strongest contextualization occurs early at lower layers: going up from layer 0 to layer 1 brings a 4% (absolute) improvement.
The very limited contextualization improvement brought by the top two layers may explain why representations from the top layers of BERT can deliver suboptimal performance on NLP tasks (Liu et al., 2019): the top layers are optimized for the pretraining objective, i.e., predicting masked words (Voita et al., 2019), not for the contextualization of words that is helpful for NLP tasks.
BERT layer 0 performs slightly worse than P-BERT, which may be due to the fact that some attention heads in lower layers of BERT attend broadly in the sentence, producing "bag-of-vector-like" representations (Clark et al., 2019), which is in fact close to the setup of P-BERT. However, starting from layer 1, BERT gradually improves and surpasses P-BERT, achieving a maximum gain of 0.16 in F1 in layer 11. Thus, BERT knows how to better interpret the word in context, i.e., contextualize the word, when progressively going to deeper (higher) layers. P-Rand performs strongly, but is noticeably worse than P-fastText and P-BERT. P-fastText outperforms P-BERT and BERT layers 0 and 1. We hypothesize that this may be due to the fact that fastText learns embeddings directly for words; P-BERT and BERT have to compose subwords to understand the meaning of a word, which is more challenging. Starting from layer 2, BERT outperforms P-fastText and P-BERT, illustrating the effectiveness of self-attention in integrating information from the context into contextualized word embeddings better than the average pooling practice in bag-of-words models. Figure 3 and Figure 4 jointly illustrate the high quality of word representations computed by BERT. The BERT-derived uncontextualized AVG-BERT-ℓ representations, modeled as Schuster et al. (2019)'s anchor words, show superior capability in inferring s-classes of a word, performing best among all uncontextualized embeddings. This suggests that BERT's powerful learning architecture may be the main reason for BERT's high performance, not contextualization proper, i.e., the representation of words as contextualized embeddings on the highest layer when BERT is applied to NLP tasks. This offers the intriguing possibility of creating (or distilling) strongly performing uncontextualized BERT-derived models that are more compact and more efficiently deployable.

4.2.3 Qualitative analysis

§4.2.2 quantitatively shows that BERT performs strongly in contextualizing words, thanks to its deep integration of information from the entire input sentence in each contextualized embedding. But there are scenarios where BERT fails. We identify two such cases in which the contextual information does not help s-class inference.
(i) Tokenization. In some domains, the annotated word and/or its context words are tokenized into several wordpieces due to their low frequency in the pretraining corpora. As a result, BERT may not be able to derive the correct composed meaning. Then the MLPs cannot identify the correct s-class from the noisy input. Consider the tokenized results of "@glutamate@-biology" and one of its contexts: "three ne ##uro ##tra ##ns ##mit ##ters that play important roles in adolescent brain development are g ##lu ##tama ##te . . . " Though "brain development" hints at a context related to "biology", this signal could be swamped by the noise in embeddings of other -especially short -wordpieces. Schick and Schütze (2020) propose a mimicking approach (Pinter et al., 2017) to help BERT understand rare words.
(ii) Uninformative contexts. Some contexts do not provide sufficient information related to the sclass. For example, according to probing results on BERTw, the wordpiece embedding of "goodfellas" does not encode the meaning of s-class "art" (i.e., movies); the context "Chase also said he wanted Imperioli because he had been in Goodfellas" of word-s-class combination "@goodfellas@-art" is not informative enough for inferring an "art" context, yielding incorrect predictions in higher layers.

Context size
We now quantify the amount of context required by BERT for properly contextualizing words to produce accurate s-class inference results. When probing for the s-class of word w, we define context size as the number of words surrounding w (left and right) in a sentence before wordpiece tokenization. For example, a context size of 5 means 5 words to the left and 5 words to the right. The context size seems to be picked heuristically in other work: Yarowsky (1992) and Gale et al. (1992) use 50 while Black (1988) uses 3-6. We experiment with a range of context sizes and then compare s-class inference results. We also include P-BERT for comparison. Note that this experiment is different from edge probing (Tenney et al., 2019b), which takes the full sentence as input. We only make input words within the context window available to BERT and P-BERT.
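Restricting the input to a context window before tokenization can be sketched as follows (the function name and example sentence are ours):

```python
def context_window(words, target_idx, k):
    """Keep only the target word plus k words to its left and k words to
    its right; the truncated word sequence is what the model then sees."""
    lo = max(0, target_idx - k)
    window = words[lo : target_idx + k + 1]
    return window, target_idx - lo  # window and new index of the target

# Target "zande" (index 3) with context size 2.
words = "the azande speak zande which they call pa-zande".split()
win, idx = context_window(words, 3, 2)
```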

Probing results
We report micro F1 on Wiki-PSE dev, with context size ∈ {0, 2, 4, 8, 16, 32}. Context size 0 means that the input consists only of the wordpiece embeddings of the input word. Figure 5 shows results.
Comparing context sizes. Larger context sizes have higher performance for all BERT layers. Improvements are most prominent for small context sizes, e.g., 2 and 4, meaning that often local features are sufficient to contextualize words and infer s-classes, supporting Black (1988)'s design choice of 3-6. Further increasing the context size improves contextualization only marginally.
A qualitative example showing informative local features is "The Azande speak Zande, which they call Pa-Zande." In this context, the gold s-class of "Zande" is "language" (instead of "people-ethnicity", i.e., the Zande people). The MLPs for BERTw and for context size 0 for BERT fail to identify s-class "language". But the BERT MLP for context size 2 predicts "language" correctly since it includes the strong signal "speak". This context is a case of selectional restrictions (Resnik, 1993; Jurafsky and Martin, 2009), in this case possible objects of "speak".
As small context sizes already contain noticeable information contextualizing the words, we hypothesize that it may not be necessary to exploit the full context in cases where the quadratic complexity of full-sentence self attention is problematic, e.g., on edge devices. Initial results on part-of-speech tagging with the Penn Treebank (Marcus et al., 1993) in Appendix §C confirm our hypothesis. We leave more experiments to future work. P-BERT shows a similar pattern when varying the context sizes. However, large context sizes such as 16 and 32 hurt contextualization, meaning that averaging too many embeddings results in a bag of words not specific to a particular token.
Comparing BERT layers. Higher layers of BERT yield better contextualized word embeddings. This phenomenon is more noticeable for large context sizes such as 8, 16 and 32. However for small context sizes, e.g., 0, embeddings from all layers perform similarly and badly. This means that without context information, simply passing the wordpiece embedding of a word through BERT layers does not help, suggesting that contextualization is the key ability of BERT yielding impressive performance across NLP tasks.
Again, P-BERT only outperforms layer 0 of BERT with most context sizes, suggesting that BERT layers, especially the top layers, contextualize words with abstract and informative representations, instead of naively aggregating all information within the context sentence.

Probing finetuned embeddings
We have done "classical" probing: extracting features from pretrained BERT and feeding them to diagnostic classifiers. However, pretrained BERT needs to be adapted, i.e., finetuned, for good performance on downstream tasks (Devlin et al., 2019). Following Devlin et al. (2019), we put a linear layer on top of the pretrained BERT, then finetune all parameters. We use Adam (Kingma and Ba, 2014) with learning rate 5e-5 for 5 epochs. We save the model from the step that performs best on dev (of MRPC/SST2/POS/NER), extract representations from Wiki-PSE using this model and then report results on Wiki-PSE dev.
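The finetuning recipe above can be summarized as a config sketch; the dictionary keys are illustrative names, only the values are taken from the text:

```python
# Finetuning setup as stated in the paper: a linear task head on top of
# BERT, all parameters updated with plain Adam (no warmup, no
# layer-wise learning rates; see Appendix B).
FINETUNE_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 5e-5,
    "epochs": 5,
    "head": "linear",          # single linear layer on top of BERT
    "max_seq_length": 128,     # same cap as in preprocessing (§3.3)
    "model_selection": "best step on dev of MRPC/SST2/POS/NER",
}
```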

Probing results
We now quantify the contextualization of word representations from finetuned BERT models. Two setups are considered: (a) directly apply the MLPs in §4.2 (trained with pretrained embeddings) to finetuned BERT embeddings; (b) train and evaluate a new set of MLPs on the finetuned BERT embeddings.
Comparing (a) with probing results on pretrained BERT (§4.2) gives us an intuition about how much the knowledge captured during pretraining has changed. Comparing (b) with §4.2 reveals whether or not the pretrained knowledge about contextualization is still preserved in finetuned models. Figure 6 shows s-class probing results of finetuned BERT with setups (a) and (b). For example, in Figure 6 (ii), layer 11 s-class inference performance of the POS-finetuned BERT decreases by 0.763 (0.835 → 0.072, from "Pretrained" to "POS-(a)") when using the MLPs from §4.2.
Comparing setup (a) and "Pretrained", we see that finetuning brings significant changes to the word representations. Finetuning on POS and NER introduces more obvious probing accuracy drops than finetuning on SST2 and MRPC. This may be due to the fact that the training objective of SST2 and MRPC takes as input only the [CLS] token while all words in a sentence are involved in the training objective of POS and NER.
Comparing setup (b) and "Pretrained". Finetuning BERT on MRPC introduces small but consistent improvements on s-class inference. For SST2 and NER, very small s-class inference accuracy drops are observed. Finetuning on POS brings more noticeable changes. Solving POS requires more syntactic information than the other tasks, inducing BERT to "propagate" the syntactic information that is represented in lower layers to the upper layers; due to their limited capacity, the fixed-size vectors from the upper layers may lose some semantic information, yielding a more noticeable performance drop on s-class inference.
Comparing (a) and (b), we see that the knowledge about contextualizing words captured during pretraining is still well preserved after finetuning. For example, the MLPs trained with layer 11 embeddings computed by the POS-finetuned BERT still achieve a reasonably good score of 0.735 (a 0.100 drop compared with "Pretrained"; compare black and green dotted lines in Figure 6 (ii)). Thus, the semantic information needed for inferring s-classes is still present to a large extent.
Finetuning may introduce large changes (setup (a)) to the representations, similar to the projection utilized to uncover divergent information in uncontextualized word embeddings (Artetxe et al., 2018), but relatively little information about contextualization is lost, as the good performance of the newly trained MLPs in setup (b) shows.


Related Work
A key goal of current research is to understand how these models work and what they represent on different layers.
Probing is a recent strand of work that investigates, via diagnostic classifiers, desired syntactic and semantic features encoded in pretrained language model representations. Shi et al. (2016) show that string-based RNNs encode syntactic information. Belinkov et al. (2017) investigate word representations at different layers in NMT. Linzen et al. (2016) assess the syntactic ability of LSTM (Hochreiter and Schmidhuber, 1997) encoders and Goldberg (2019) of BERT. Tenney et al. (2019a) find that information on POS tagging, parsing, NER, semantic roles, and coreference is represented on increasingly higher layers of BERT. In all of these studies, probing serves to analyze representations and reveal their properties. We employ probing to investigate the contextualization of words in pretrained language models quantitatively. In addition, we explore how finetuning affects word contextualization.
Ethayarajh (2019) quantitatively investigates contextualized embeddings, using unsupervised cosine-similarity-based evaluation. Inferring s-classes, we address a complementary set of questions because we can quantify contextualization with a uniform set of semantic classes. Brunner et al. (2020) employ token identifiability to compute the deviation of a contextualized embedding from the uncontextualized embedding. Voita et al. (2019) address this from the mutual information perspective, e.g., low mutual information between an uncontextualized embedding and its contextualized embedding can be viewed as a reflection of more contextualization. Similar observations are made: higher layer embeddings are more contextualized while lower layer embeddings are less contextualized. In contrast, we draw the observations from the perspective of s-class inference. The higher layer embeddings perform better when evaluating the semantic classes: they are better contextualized and fit the context better than the lower layer embeddings.
Two-stage NLP paradigm. Recent work (Dai and Le, 2015; Howard and Ruder, 2018; Devlin et al., 2019) introduces a "two-stage paradigm" in NLP: pretrain a language encoder on a large amount of unlabeled data via self-supervised learning, then finetune the encoder on task-specific benchmarks like GLUE (Wang et al., 2018, 2019). This transfer-learning pipeline yields good and robust results compared to models trained from scratch (Hao et al., 2019).
In this work, we shed light on how BERT's pretrained knowledge about contextualization changes during finetuning by comparing s-class inference ability of pretrained and finetuned models. Merchant et al. (2020) analyze BERT models finetuned on different downstream tasks with the edge probing suite (Tenney et al., 2019b) and make similar observations as us. They focus on "linguistic features" while we focus on the contextualization of words.

Conclusion
We presented a quantitative study of the contextualization of words in BERT by investigating BERT's semantic class inference capabilities. We focused on two key factors for successful contextualization by BERT: layer index and context size. By comparing pretrained and finetuned models, we showed that word-level tasks like part-of-speech tagging bring more noticeable changes than sentence-level tasks like paraphrase classification; and top layers of BERT are more sensitive to the finetuning objective than lower layers. We also found that BERT's pretrained knowledge about contextualizing words is still well retained after finetuning.
We showed that exploiting the full context may be unnecessary in applications where the quadratic complexity of full-sentence attention is problematic. Future work may evaluate this phenomenon on more datasets and downstream tasks.

A.3 Validation performance
Table 5 and Table 6 report the validation performance of probing uncontextualized and contextualized embeddings, respectively.

A.4 Evaluation metric
Our evaluation measure is the micro F1 over all decisions of the 34 probing classifiers. More details are available in §3.2 of the main paper.
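The metric can be sketched as follows: the decisions of all 34 binary classifiers are pooled before computing precision and recall (function name and toy decisions are ours):

```python
def micro_f1(decisions):
    """Micro F1 over pooled (gold, predicted) binary decisions: true
    positives, false positives and false negatives are summed over all
    classifiers before computing precision and recall."""
    tp = sum(1 for gold, pred in decisions if gold and pred)
    fp = sum(1 for gold, pred in decisions if not gold and pred)
    fn = sum(1 for gold, pred in decisions if gold and not pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Pooled (gold, predicted) pairs from several classifiers.
decs = [(True, True), (True, False), (False, True), (True, True)]
```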

A.5 Hyperparameter search
For probing tasks, we do not conduct hyperparameter search since our goal is to analyze the contextualization. The probing classifiers are trained with learning rate 1e-3 and 400 epochs. For finetuning BERT, we do not search hyperparameters but directly adopt the setup in Devlin et al. (2019) as shown in Table 4.

A.6 Datasets
List of the 34 semantic classes (s-classes), number of word-s-class combinations and contexts per s-class in the sampled Wiki-PSE (Yaghoobzadeh et al., 2019) are listed in Table 8. Some annotated contexts in Wiki-PSE are also displayed in Table 9. The Wiki-PSE developed by Yaghoobzadeh et al. (2019) is publicly available at https://github.com/yyaghoobzadeh/WIKI-PSE.

B Finetuning Details
Hyperparameters in Table 4 are used when we finetune BERT on POS, NER, SST2, and MRPC. For SST2 and MRPC, we use the embedding of [CLS] as the representation of the sentence (pair). For POS and NER, we use the embedding of the last wordpiece of the word, as in Liu et al. (2019). A plain Adam (Kingma and Ba, 2014) optimizer is used; we did not use strategies like learning rate warmup and layer-wise learning rates (Howard and Ruder, 2018) during finetuning, to avoid potential side effects and to ensure a clear comparison of different BERT layers.

C Context Sizes in POS
We investigate how the findings from §4.3 in the main paper transfer to downstream tasks. To this end we perform standard finetuning of BERT for different tasks, but we prune the attention matrix to a context size of length k. That is, we apply a mask on the attention matrix such that each word can only attend to the k words to its left and the k words to its right. This has great benefits as it reduces the memory and computation requirements from O(n^2) to O(nk), where n is the sequence length. We only consider part-of-speech tagging, as for sentence pair classification tasks such as SST2 and MRPC this is not a sensible approach. Table 7 confirms that small context windows are sufficient to achieve full performance for POS tagging. This indicates that the finding from the main paper (i.e., local context is sufficient for BERT to achieve a high degree of contextualization) is to some degree applicable to downstream tasks as well. Note that the median sentence length in the Penn Treebank dataset is 25 words (the number of wordpieces is even higher). Thus masking the context to the next 4 or 8 words does indeed reduce the available context. In future work we plan to investigate this effect not only during finetuning but also during pretraining.
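The attention pruning can be sketched as a banded boolean mask (an illustrative sketch with our own names, not the code used for the experiments):

```python
import numpy as np

def local_attention_mask(n, k):
    """Boolean mask of shape (n, n): position i may attend only to
    positions within k steps to its left or right, pruning the dense
    O(n^2) attention pattern down to O(n*k) allowed entries."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= k

# For a 5-token sequence with k=1, each token sees itself and its
# immediate neighbors; e.g., position 2 attends to positions 1, 2, 3.
mask = local_attention_mask(5, 1)
```

In practice such a mask would be added (as large negative values on the disallowed entries) to the attention logits before the softmax.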