Context Analysis for Pre-trained Masked Language Models

Pre-trained language models that learn contextualized word representations from a large unannotated corpus have become a standard component of many state-of-the-art NLP systems. Despite their successful applications in various downstream NLP tasks, the extent of contextual impact on word representations has not been fully explored. In this paper, we present a detailed analysis of contextual impact in Transformer- and BiLSTM-based masked language models. We follow two different approaches to evaluate the impact of context: a masking-based approach that is architecture-agnostic, and a gradient-based approach that requires back-propagation through the network. The findings suggest significant differences in contextual impact between the two model architectures. Through a further breakdown of the analysis by syntactic category, we find that the contextual impact in the Transformer-based MLM aligns well with linguistic intuition. We further explore Transformer attention pruning based on our findings from the contextual analysis.


Introduction
Pre-trained masked language models (MLM) such as BERT (Devlin et al., 2019) and ALBERT (Lan et al., 2019) have set state-of-the-art performance on a broad range of NLP tasks. The success is often attributed to their ability to capture complex syntactic and semantic characteristics of word use across diverse linguistic contexts (Peters et al., 2018). Yet, how these pre-trained MLMs make use of the context remains largely unanswered.
Recent studies have started to inspect the linguistic knowledge learned by pre-trained LMs, such as word sense (Liu et al., 2019a), syntactic parse trees (Hewitt and Manning, 2019), and semantic relations (Tenney et al., 2019). Others directly analyze models' intermediate representations and attention weights to understand how they work (Kovaleva et al., 2019; Voita et al., 2019).
While previous works either assume access to a model's internal states or take advantage of a model's special structures, such as self-attention maps, these analyses are difficult to generalize as architectures evolve. In this paper, our work complements these previous efforts and provides a richer understanding of how pre-trained MLMs leverage context without assumptions on architectures. We aim to answer the following questions: (i) How much context is relevant to and used by pre-trained MLMs when composing representations? (ii) How far do MLMs look when leveraging context? That is, what are their effective context window sizes? We further define a target word's essential context as the set of context words whose absence leaves the MLM with no basis for its prediction. We analyze the linguistic characteristics of these essential context words to better understand how MLMs manage context.
We investigate the contextual impacts in MLMs via two approaches. First, we propose a context perturbation analysis methodology that gradually masks out context words following a predetermined procedure and measures the change in the target word probability. For example, we iteratively mask the words that cause the least change to the target word probability until the probability deviates too much from its starting value. At this point, the remaining words are relevant to and used by the MLM to represent the target word, since further perturbation causes a notable prediction change. Being model-agnostic, our approach looks into the contextualization in the MLM task itself and quantifies it only at the output layer. We refrain from inspecting internal representations, since new architectures might not have a clear notion of "layer" with interleaving jump connections such as those in Guo et al. (2019) and Yao et al. (2020).
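As a concrete illustration, the perturbation loop can be sketched as below. The scorer here is a hypothetical toy function (distance-weighted evidence), not an actual pre-trained model; in practice `logprob_fn` would query BERT or the BiLSTM for log P(w_t | context).

```python
# Sketch of the context perturbation procedure: mask the least-impactful
# context word until the target probability deviates past a threshold.

MASK = "[MASK]"

def toy_target_logprob(tokens, target_idx):
    """Toy stand-in for log P(w_t | X_\\t): nearby unmasked words
    contribute more evidence than distant ones (hypothetical model)."""
    score = 0.0
    for i, tok in enumerate(tokens):
        if i != target_idx and tok != MASK:
            score += 1.0 / (1 + abs(i - target_idx))
    return score - 5.0  # arbitrary offset so values resemble log-probs

def perturb_least_impact(tokens, target_idx, logprob_fn, threshold=0.2):
    """Iteratively mask the context word whose removal changes the target
    log-probability the least, until the total deviation from the
    unperturbed log-probability exceeds `threshold`.  Returns indices of
    the words still unmasked (the context 'used' by the model)."""
    tokens = list(tokens)
    tokens[target_idx] = MASK
    base = logprob_fn(tokens, target_idx)
    remaining = [i for i in range(len(tokens)) if i != target_idx]
    while remaining:
        # Find the context word with the least absolute impact.
        best_i, best_delta = None, None
        for i in remaining:
            trial = list(tokens)
            trial[i] = MASK
            delta = abs(logprob_fn(trial, target_idx) - logprob_fn(tokens, target_idx))
            if best_delta is None or delta < best_delta:
                best_i, best_delta = i, delta
        trial = list(tokens)
        trial[best_i] = MASK
        if abs(base - logprob_fn(trial, target_idx)) > threshold:
            break  # further masking deviates too much: stop
        tokens = trial
        remaining.remove(best_i)
    return remaining
```

With the toy scorer, distant words are eliminated first and the local neighborhood of the target survives, mirroring the locality findings reported later in the paper.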
The second approach is adapted from Falenska and Kuhn (2019) and estimates the impact of an input subword on the target word probability via the norm of the gradients. We study pre-trained MLMs based on two different architectures: Transformer and BiLSTM. The former is essentially BERT and the latter resembles ELMo (Peters et al., 2018). Although the scope of this work is limited to the comparison between two popular architectures, the same methodology can be readily applied to multilingual models as well as other Transformer-based models pre-trained with MLM.
From our analysis, when encoding words using sentence-level inputs, we find that BERT is able to leverage 75% of the context on average in terms of sentence length, while BiLSTM has an effective context size of around 30%. The gap is compelling for long-range context more than 20 words away, wherein BERT still has a 65% chance of leveraging such words, compared to 10% or less for BiLSTM. In addition, when restricted to a local context window around the target word, we find that the effective context window size of BERT is around 78% of the sentence length, whereas BiLSTM has a much smaller window size of around 50%. Building on this extensive study of how different pre-trained MLMs operate when producing contextualized representations and what detailed linguistic behaviors can be observed, we exploit these insights to devise a pilot application: attention pruning that restricts the attention window of BERT based on our findings. Results show that performance remains the same while efficiency improves. Our main contributions can be briefly summarized as:
• Standardize the pre-training setup (model size, corpus, objective, etc.) for a fair comparison between different underlying architectures.
• Novel design of a straightforward and intuitive perturbation-based analysis procedure to quantify the impact of context words.
• Gain insights about how different architectures behave differently when encoding contexts, in terms of number of relevant context words, effective context window sizes, and more fine-grained break-down with respect to POS and dependency structures.
• Leverage insights from our analysis to conduct a pilot application of attention pruning on a sequence tagging task.

Related Work
Another line of research inspects the internal states of pre-trained LMs, such as attention weights (Kovaleva et al., 2019; Clark et al., 2019) or intermediate word representations (Coenen et al., 2019; Ethayarajh, 2019), to facilitate our understanding of how pre-trained LMs work. In particular, Voita et al. (2019) study the evolution of representations from the bottom to the top layers and find that, for MLM, the token identity tends to be recreated at the top layer. Closest to our work, Khandelwal et al. (2018) conduct context analysis on LSTM language models to learn how much context is used and how nearby and long-range context are represented differently.
Our work complements prior efforts by analyzing how models pre-trained with MLM make use of context, and provides insights showing that different architectures can have different patterns of capturing context. Distinct from previous works, we rely on no specific model architecture or intermediate representations while performing the context analysis.
Another related topic is generic model interpretation methods, including LIME (Ribeiro et al., 2016), SHAP (Lundberg and Lee, 2017), and Ancona et al. (2017). Despite the procedural similarity, our work focuses on analyzing how pre-trained MLMs behave when encoding contexts, and our methodology is both model-agnostic and training-free.

For context analysis, we perform the masking and predictions at the word level. Given a target word w_t, all of its subwords are masked: X_\t = (…, [MASK], …, [MASK], …), where every subword of w_t is replaced by [MASK]. Following Devlin et al. (2019), the conditional probability of w_t can be computed from the outputs of MLMs with the independence assumption between subwords:

P(w_t | X_\t) = ∏_{j=1..l_t} P(s_tj | X_\t)    (1)

where s_tj is the j-th subword of w_t and l_t is its number of subwords. To investigate how MLMs use context, we propose procedures that perturb the input sentence from X_\t to X̃_\t and monitor the change in the target word probability P(w_t | X̃_\t).
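Under this independence assumption, the word probability in Equation (1) is simply the product of its subwords' probabilities, i.e., a sum in log space. A minimal sketch:

```python
import math

def word_logprob(subword_logprobs):
    """log P(w_t | X_\\t) under the subword-independence assumption:
    the sum of the subwords' log-probabilities."""
    return sum(subword_logprobs)

def word_prob(subword_probs):
    """P(w_t | X_\\t) as the product of the subword probabilities."""
    return math.exp(word_logprob(math.log(p) for p in subword_probs))
```

For example, a word split into two subwords with probabilities 0.5 and 0.4 has word probability 0.2 under Equation (1).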

Approach
Our goal is to analyze the behaviors of pre-trained MLMs when leveraging context to recover the identity of the masked target word w_t, e.g., to answer questions such as how many context words are considered and how large the context window is. To this end, we apply two analysis approaches. The first is based on masking or perturbing the input context and is architecture-agnostic. The second, gradient-based approach requires back-propagation through the network.
Our first approach performs context perturbation analysis on pre-trained LMs at inference time and measures the change in the masked target word probability. To answer each question, we start from X_\t and design a procedure Ψ that iteratively processes the sentence from the last perturbation: X̃_\t^{k+1} = Ψ(X̃_\t^k). The patterns of P(w_t | X̃_\t^k) offer insights into our questions. An example of Ψ is to mask out the context word that causes the least or a negligible change in P(w_t | X̃_\t^k). It is worth mentioning that, as pre-trained LMs are often used off-the-shelf as general language encoders, we do not further fine-tune the model on the analysis dataset but directly analyze how it makes use of context. In practice, we loop over a sentence word by word, setting each word as the target in turn and using the rest of the words as the context for our masking process. Since we run the context analysis with model inference only, the whole process is fast: around half a day on a 4-GPU machine to process 12k sentences.
Our second approach estimates the impact of an input subword s_ij on P(w_t | X_\t) using derivatives. Specifically, we adapt the IMPACT score proposed in Falenska and Kuhn (2019) to our questions. The score IMPACT(s_ij, w_t) is computed from the gradient of the negative log likelihood (NLL) with respect to the subword embedding e(s_ij):

IMPACT(s_ij, w_t) = ||∇_{e(s_ij)} (−log P(w_t | X_\t))||_2    (2)

The l2-norm of the gradient is used as the impact measure and normalized over all the subwords in a sentence. In practice, we report the impact of a context word w_i by adding up the scores of its subwords: ∑_{j=1..l_i} IMPACT(s_ij, w_t).

We investigate two different encoder architectures for pre-trained MLMs. The first is BERT, which employs 12 Transformer encoder layers, 768 hidden dimensions, 3072 feed-forward hidden size, and 110 million parameters. The other uses a standard bidirectional LSTM (Hochreiter and Schmidhuber, 1997) and resembles ELMo (Peters et al., 2018).

We perform MLM context analysis on two English datasets from the Universal Dependencies (UD) project: the English Web Treebank (EWT) (Silveira et al., 2014) and the Georgetown University Multilayer corpus (GUM) (Zeldes, 2017). Datasets from the UD project provide consistent and rich linguistic annotations across diverse genres, enabling us to gain insights into the contexts used by MLMs. We use the training set of each dataset for analysis. EWT consists of 9,673 sentences from web blogs, emails, reviews, and social media, with a median length of 17 and a maximum length of 159 words. GUM comprises 3,197 sentences from Wikipedia, news articles, academic writing, fiction, and how-to guides, with a median length of 19 and a maximum length of 98 words. The statistics of the datasets are summarized in Table 2.
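The per-word aggregation of IMPACT scores can be sketched as follows. In practice the gradients would come from back-propagation through the MLM; here they are taken as a given matrix, and the `word_to_subwords` mapping is a hypothetical helper from words to their subword indices.

```python
import numpy as np

def impact_scores(grad_nll, word_to_subwords):
    """IMPACT(s_ij, w_t) is the l2-norm of d(NLL)/d e(s_ij), normalized
    over all subwords in the sentence (Eq. 2); the per-word score sums a
    word's subword scores.  `grad_nll` has shape [num_subwords, dim]."""
    norms = np.linalg.norm(grad_nll, axis=1)   # one norm per subword
    norms = norms / norms.sum()                # normalize over the sentence
    return {w: float(norms[idx].sum()) for w, idx in word_to_subwords.items()}
```

The normalization means IMPACT scores form a distribution over positions within a sentence, which is why, as noted below, they show relative rather than absolute magnitudes of contextual impact.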

How much context is used?
Self-attention is designed to encode information from any position in a sequence, whereas BiLSTMs model context through the combination of long- and short-term memories in both left-to-right and right-to-left directions. For MLMs, the entire sequence is provided to produce contextualized representations, but it is unclear how much of the context in the sequence is used by different MLMs.
In this section, we first propose a perturbation procedure Ψ that iteratively masks out the context word contributing the least absolute change to the target word probability P(w_t | X̃_\t^k). That is, we incrementally eliminate words that do not penalize the MLM's predictions, one by one, until further masking causes P(w_t | X̃_\t^k) to deviate too much from the original probability P(w_t | X_\t). At this point, the remaining unmasked words are considered to be used by the MLM, since corrupting any of them causes a notable change in the target word prediction.
In practice, we identify deviations using the negative log likelihood (NLL), which corresponds to the loss of MLMs. Assuming the NLL has a variance of ε at the start of masking, we stop the perturbation procedure when the increase in NLL, log P(w_t | X_\t) − log P(w_t | X̃_\t^k), exceeds 2ε. We observe that NLLs fluctuate at the start of masking; hence we terminate our procedure when the NLL increase reaches 0.2. We report the effective context size as a percentage of the sentence length to normalize for the impact of length. The analysis process is repeated using each word in a sentence as the target word, for all sentences in the dataset.
For our second approach, we follow Equation 2 to calculate the normalized impact of each subword on the target word and aggregate the scores for each context word to get IMPACT(w_i, w_t). We group the IMPACT scores by the relative position of a word w_i to the target word w_t and plot the average. To compare with our first approach, we also use the masking-based method to analyze, for a word at a specific relative position, its probability of being used by an MLM.

BERT uses distant context more than BiLSTM. After our masking process, a subset of context words is tagged as "being used" by the pre-trained LM. In Figure 1a, we aggregate results in terms of relative positions (context word to target word) over all targets and sentences. "Probability of being used (%)" denotes how likely a context word at a given relative position to the target is to be relevant to the pre-trained LM. Figure 1a shows that context words at all relative positions have substantially higher probabilities of being considered by BERT than by BiLSTM, and that BiLSTM focuses sharply on local context words while BERT leverages words at almost all positions. A notable observation is that both models consider words within a distance of around [−10, 10] far more often, and BERT has as high as a 90% probability of using the words just before and after the target word. Using the gradient-based analysis, Figure 1b shows similar results: BERT considers more distant context than BiLSTM, and local words have more impact on both models than distant words.
There are notable differences between the two analysis approaches. Since the gradient-based IMPACT score is normalized into a distribution across all positions, it does not show the magnitude of the contextual impact on the two different models. The masking-based analysis, by contrast, shows in absolute probabilities that BERT uses words at each position more than BiLSTM. Another important difference is that the gradient-based approach is a glass-box method and requires back-propagation through the network, assuming the model to be differentiable. The masking-based approach treats the model as a black box and makes no differentiability assumption. In the following sections, we continue the analysis with the masking-based approach.

BERT uses 75% of the words in a sentence as context, while BiLSTM considers 30%. Figure 2 shows the increase in NLL when gradually masking out the least relevant words. BERT's NLL increases considerably once 25% of the context is masked, suggesting that BERT uses around 75% of the context. For BiLSTM, the NLL goes up remarkably after 70% of the context words are masked, meaning that it considers around 30% of the context. Albeit having the same capacity, BERT takes more than twice as many context words into account as BiLSTM. This could explain BERT's superior fine-tuning performance on tasks demanding more context. We observe that the pre-trained MLMs behave consistently across the two datasets, which cover different genres. For the following analysis, we report results combining the EWT and GUM datasets.

Content words need more context than function words. We bucket instances based on the part-of-speech (POS) annotation of the target word. Our analysis covers content words, including nouns, verbs, and adjectives, and function words, including adpositions and determiners.
Figure 3a shows that both models use significantly more context to represent content words than function words, which aligns with linguistic intuitions (Boyd-Graber and Blei, 2009). The findings also show that MLMs handle content and function words in a similar manner to regular language models, as previously analyzed by Wang and Cho (2016) and Khandelwal et al. (2018).

BiLSTM's context usage percentage varies with input sentence length, whereas BERT's does not. We categorize sentences with length shorter than 25 as short, between 25 and 50 as medium, and longer than 50 as long. Figure 3b shows that BiLSTM uses 35% of context for short sentences, 20% for medium, and only 10% for long sentences. On the other hand, BERT leverages a fixed 75% of context words regardless of sentence length.

How far do MLMs look?
In the previous section, we looked at how much context is relevant to the two MLMs via an elimination procedure. From Figures 1a and 1b, we also observe that local context is more impactful than long-range context for MLMs. In this section, we investigate this notion of locality of context further and try to answer the question of how far away MLMs actually look in practice, i.e., what the effective context window size (cws) of each MLM is.
For the context perturbation analysis, we introduce a locality constraint to the perturbation procedure while masking words. We aim to identify how local versus distant context impacts the target word probability differently. We start by masking all the words around the target, i.e., the model relies only on its priors learned during pre-training (cws ∼ 0%). We iteratively increase the cws on both sides until all the surrounding context is available (cws ∼ 100%). Details of the masking procedure can be found in the Appendix. We report the increase in NLL compared to when the entire context is available, log P(w_t | X_\t) − log P(w_t | X̃_\t^k), with respect to the increasing cws. This process is repeated using each word as the target word, for all the sentences in the dataset. We aggregate and visualize the results similarly to Section 5 and use the same threshold (0.2) as before to mark the turning point.
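The locality-constrained masking can be sketched as below; `window_masked` is a hypothetical helper name, and the MLM scoring of each masked variant is omitted.

```python
def window_masked(tokens, target_idx, cws, mask="[MASK]"):
    """Keep only the context words within `cws` positions on either side
    of the target; the target itself and all distant words are masked."""
    return [
        tok if i != target_idx and abs(i - target_idx) <= cws else mask
        for i, tok in enumerate(tokens)
    ]

def cws_sweep(tokens, target_idx):
    """Sweep the window from cws = 0 (no context) toward the full
    sentence, producing the sequence of masked inputs whose NLLs are
    compared against the full-context NLL."""
    return [window_masked(tokens, target_idx, k) for k in range(len(tokens))]
```

Each element of the sweep would be scored by the MLM; the cws at which the NLL gap to the full-context input closes is the effective context window size.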
As shown in Figure 4, increasing the cws around the target word reduces the change in NLL until a point where the gap is closed. The plot clearly highlights the differences in the behavior of the two models: for BERT, words within a cws of 78% impact the model's ability to make target word predictions, whereas for BiLSTM, only words within a cws of 50% affect the target word probability. This shows that BERT, leveraging the entire sequence via self-attention, looks at a much wider context window (effective cws ∼ 78%) than the recurrent BiLSTM architecture (effective cws ∼ 50%). Besides, BiLSTM shows a clear notion of contextual locality: it tends to consider only very local context for target word prediction.
Furthermore, we investigate the symmetry of the cws on either side by following the same procedure, but now separately on each side of the target word. We iteratively increase the cws on either the left or the right side while keeping the rest of the words unmasked. More details of the analysis procedure can be found in the Appendix. The analysis results are further bucketed by the POS categories of target words as well as input sentence lengths, similarly to Section 5, to gain more fine-grained insights. In Figure 5, we show the symmetry analysis of cws for short sentences and target words with the POS tags NOUN and DET. The remaining plots, for medium and long sentences and target words with other POS tags, are shown in the Appendix due to lack of space.
From Figure 5, both models show similar behaviors across different POS tags when leveraging symmetric/asymmetric context. The cws attended to on either side is rather similar when target words are NOUN, whereas for DET we observe both models paying more attention to right context words than to the left. This observation aligns well with linguistic intuitions for English. We can also observe the striking difference between the two models in effective cws, with BERT attending to a much larger cws than BiLSTM. The difference between the left and right cws for DET appears more pronounced for BiLSTM than for BERT. We hypothesize that this is due to BiLSTM's overall smaller cws (left + right), which makes it attend only to the most important words, which happen to be mostly in the right context.

What kind of context is essential?
There is often a core set of context words that is essential to capture the meaning of the target word. Consider "Many people think cotton is the most comfortable [fabric] to wear in hot weather." Although most of the context is helpful for understanding the masked word fabric, cotton and wear are essential, as it would be almost impossible to make a guess without them.
In this section, we define essential context as the words whose absence leaves the MLM with no clue about the target word identity, i.e., the target word probability becomes close to that of masking out the entire sequence, P(w_t | X_mask-all). To identify essential context, we design the perturbation Ψ to iteratively mask the word bringing the largest drop in P(w_t | X̃_\t^k) until we reach the point where the increase in NLL just exceeds the 100%-mask setting, log P(w_t | X_\t) − log P(w_t | X_mask-all). The words masked by this procedure are labelled essential context words. We further analyze the linguistic characteristics of the identified essential context words.

BERT sees 35% of context as essential, whereas BiLSTM perceives around 20%. Figure 6 shows that, on average, BERT recognizes around 35% of context as essential when making predictions, i.e., by the time the increase in NLL is on par with masking all context. On the other hand, BiLSTM sees only 20% of context as essential. This implies that BERT would be more robust than the BiLSTM-based model.

Essential words are close to target words both in linear position and on the dependency tree. Table 3 reports the mean distances from identified essential words to the target words on the combined EWT and GUM datasets. Both models tend to identify words much closer to the target as essential, whether we consider linear positional distance or node distance in the dependency tree. We use the annotated dependency relations to extract the traversal paths from each essential word to the target word in the dependency tree. We find that the top 10 most frequent dependency paths often correspond to common syntactic structures in natural language. For example, when target words are NOUN, the top 3 paths are DET (up:det)⇒ NOUN, ADP (up:case)⇒ NOUN, and ADJ (up:amod)⇒ NOUN for both models. Further, we also look at the dependency paths of essential words that are unique to each model.
The comparison shows that words with common dependency paths are sometimes identified as essential by BERT but not by BiLSTM, and vice versa. This suggests there is room to improve MLMs by making them more consistently aware of the input's syntactic structure, possibly by incorporating dependency relations into pre-training. The full lists of top dependency paths are presented in the Appendix. Figure 7 shows examples of essential words from BERT with POS tags and dependency relations. Words in square brackets are target words and underlined words are essential words. We observe that words close to the target in the sentence as well as in the dependency tree are often seen as essential. We can also see that BERT often includes the root of the dependency tree as an essential word.
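The essential-context identification described in this section can be sketched as follows. The scorer is a toy stand-in (the `WEIGHTS` table is hypothetical), echoing the cotton/wear example: we greedily mask the highest-impact words until the prediction is as uninformed as masking everything.

```python
MASK = "[MASK]"

# Toy stand-in for the MLM: log P(target | context) grows with the
# weight of each unmasked informative context word (hypothetical).
WEIGHTS = {"cotton": 2.0, "wear": 1.0}

def toy_logprob(tokens, target_idx):
    return sum(WEIGHTS.get(t, 0.0) for i, t in enumerate(tokens) if i != target_idx)

def essential_context(tokens, target_idx, logprob_fn):
    """Greedily mask the context word causing the largest drop in the
    target log-probability until the NLL increase reaches the all-masked
    baseline; the masked words are the essential context."""
    tokens = list(tokens)
    tokens[target_idx] = MASK
    all_masked = logprob_fn([MASK] * len(tokens), target_idx)
    essential = []
    remaining = [i for i in range(len(tokens)) if i != target_idx]
    while remaining and logprob_fn(tokens, target_idx) > all_masked:
        def drop(i):
            trial = list(tokens)
            trial[i] = MASK
            return logprob_fn(tokens, target_idx) - logprob_fn(trial, target_idx)
        i = max(remaining, key=drop)  # word whose masking hurts most
        tokens[i] = MASK
        essential.append(i)
        remaining.remove(i)
    return essential
```

On the toy sentence below, only the informative words (cotton, wear) are flagged as essential; uninformative function words never need to be masked before the all-masked baseline is reached.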

Application: Attention Pruning for Transformer
As a pilot application, we leverage insights from the analysis in previous sections to perform attention pruning for the Transformer. The Transformer has achieved impressive results in NLP and has been used for long sequences with more than 10 thousand tokens (Liu et al., 2018). Self-attention for a sequence of length L has O(L^2) computation and memory complexity. Many works attempt to improve the efficiency of self-attention by restricting the number of tokens each input query can attend to (Child et al., 2019; Kitaev et al., 2020).
Our analysis in Section 6 shows that BERT has an effective cws of around 78%. We perform dynamic attention pruning by making self-attention neglect the furthest 22% of tokens. Due to the O(L^2) complexity, this saves around 39% of the computation in self-attention (1 − 0.78^2 ≈ 0.39). We apply this locality constraint to self-attention when fine-tuning BERT on a downstream task. Specifically, we use the CoNLL-2003 Named Entity Recognition (NER) dataset (Sang and Meulder, 2003) with 200k words for training. We fine-tune BERT for NER in the same way as Devlin et al. (2019). We also explore static attention pruning that restricts the attention span to [−5, +5]. Results in Table 4 show that BERT with attention pruning performs comparably to the original BERT, implying a successful application of our analysis findings. Note that we use an uncased vocabulary, which could explain the gap compared to Devlin et al. (2019).
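A minimal sketch of the static attention pruning, assuming a NumPy attention-score matrix; the helper names are ours, not from the BERT codebase.

```python
import numpy as np

def local_attention_mask(seq_len, span=5):
    """Boolean mask that lets each query position attend only to keys
    within `span` positions on either side (the static [-5, +5] pruning)."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= span

def prune_scores(scores, mask):
    """Apply the mask before softmax: disallowed positions get -inf,
    so they receive zero attention weight after normalization."""
    return np.where(mask, scores, -np.inf)
```

The dynamic variant would build the mask per sentence from the furthest 22% of tokens instead of a fixed span; the mechanism of masking scores before the softmax is the same.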

Conclusion
In our context analysis, we have shown that BERT has an effective context size of around 75% of the input length, while BiLSTM has about 30%. The difference in context usage is striking for long-range context beyond 20 words. Our extensive analysis of context window size demonstrates that BERT uses a much larger context window than BiLSTM. Besides, both models often identify words with common syntactic structures as essential context. These findings not only help to better understand the contextual impact in masked language models, but also encourage future model improvements in efficiency and effectiveness.


B Masking Procedures for Context Window Analysis
As mentioned in Section 6, for analyzing how far masked LMs look within the available context, we follow a masking strategy with locality constraints. The strategy is as follows: we start with no context available, i.e., all context words masked, and iteratively increase the available context window size (cws) on both sides simultaneously, until the entire context is available. This procedure is depicted in Figure 8. For the symmetry analysis of cws, we follow a similar process but consider each side of the target word separately. Hence, when considering context words to the left, we iteratively increase the cws on the left of the target word, keeping the context words on the right unmasked, as shown in Figure 9. In Figure 10, we show various plots investigating how the context around the target word impacts model performance when left and right context are examined separately. Figures 10a, 10d, 10g, 10j, and 10m show left and right cws for sentences belonging to the short length category (l ≤ 25).
The trends show that, while NOUN, ADJ, and VERB leverage somewhat symmetric context windows, DET and ADP show asymmetric behavior, relying more heavily on right context words, for both models (BERT and BiLSTM). Similar observations can be made for sentences in the medium length bucket (l > 25 and l ≤ 50), with ADP being an exception, where BiLSTM shows more symmetric context than BERT, as shown in Figures 10b, 10e, 10h, 10k, and 10n. Similar trends hold for sentences in the long length bucket. We can also see that, as we move to buckets of longer sentence lengths, BiLSTM leverages almost the same number of context words, whereas BERT can leverage more context when it is available. This aligns with our observation from Section 5.

C Dependency Paths from Essential Words to Target Words
Given a target word, BERT or BiLSTM identifies a subset of context words as essential. Based on the dependency relations provided in the datasets, we extract the dependency path from each essential word to the target word, i.e., the path traversed from the essential word to the target word in the dependency tree. We summarize the top 10 most frequent dependency paths recognized by BERT or BiLSTM for target words of a specific part-of-speech category. Tables 5, 6, 7, 8, and 9 show the results for NOUN, ADJ, VERB, DET, and ADP, respectively. The up and down denote the direction of traversal, followed by the corresponding relations in the dependency tree. We can see that the top dependency paths for BERT and BiLSTM largely overlap with each other. We also observe that these most frequent dependency paths are often aligned with common syntactic patterns. For example, the top 3 paths for NOUN are DET =(up:det)⇒ NOUN, as in "the" cat; ADP =(up:case)⇒ NOUN, as in "at" home; and ADJ =(up:amod)⇒ NOUN, as in "white" car. This implies that both models could be aware of common syntactic structures in natural language.
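The path extraction can be sketched as below, assuming the tree is given as head indices and relation labels (this representation is our assumption; the UD datasets provide equivalent annotations).

```python
def dependency_path(heads, rels, src, tgt):
    """Traversal path from essential word `src` to target word `tgt` in
    a dependency tree.  `heads[i]` is the parent index of token i (-1
    for the root) and `rels[i]` its relation to that parent; steps are
    rendered as 'up:rel' / 'down:rel', as in the tables."""
    def ancestors(i):
        chain = [i]
        while heads[chain[-1]] != -1:
            chain.append(heads[chain[-1]])
        return chain
    a_src, a_tgt = ancestors(src), ancestors(tgt)
    common = next(n for n in a_src if n in a_tgt)  # lowest common ancestor
    ups = ["up:" + rels[n] for n in a_src[:a_src.index(common)]]
    downs = ["down:" + rels[n] for n in reversed(a_tgt[:a_tgt.index(common)])]
    return ups + downs
```

For "the cat", with the determiner attached to the noun via det, the path from the to cat is simply ["up:det"], matching the DET =(up:det)⇒ NOUN pattern above.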
To further compare the behaviors of BERT and BiLSTM when identifying essential context, we count the occurrences of dependency paths based on the disjoint essential words. That is, given an input sentence, we only count the dependency paths of essential words that are unique to each model, e.g., words essential to BERT but not to BiLSTM. Our goal is to see whether, for these essential words unique to a model, some special dependency paths are captured by that model. Tables 10, 11, 12, 13, and 14 show the results for NOUN, ADJ, VERB, DET, and ADP, respectively. We observe that roughly the top 5 dependency paths for essential words unique to BERT or BiLSTM mostly overlap with each other, as well as with the results in Tables 5, 6, 7, 8, and 9. This implies that words with common dependency paths are sometimes identified as essential by BERT while BiLSTM fails to do so, and sometimes the other way around. In other words, there is room to make models more consistently aware of the syntactic structure of an input. This observation suggests that explicitly incorporating dependency relations into pre-training could potentially benefit masked language models.