On the Inductive Bias of Masked Language Modeling: From Statistical to Syntactic Dependencies

We study how masking and predicting tokens in an unsupervised fashion can give rise to linguistic structures and downstream performance gains. Recent theories have suggested that pretrained language models acquire useful inductive biases through masks that implicitly act as cloze reductions for downstream tasks. While appealing, we show that the success of the random masking strategy used in practice cannot be explained by such cloze-like masks alone. We construct cloze-like masks using task-specific lexicons for three different classification datasets and show that the majority of pretrained performance gains come from generic masks that are not associated with the lexicon. To explain the empirical success of these generic masks, we demonstrate a correspondence between the Masked Language Model (MLM) objective and existing methods for learning statistical dependencies in graphical models. Using this, we derive a method for extracting these learned statistical dependencies in MLMs and show that these dependencies encode useful inductive biases in the form of syntactic structures. In an unsupervised parsing evaluation, simply forming a minimum spanning tree on the implied statistical dependence structure outperforms a classic method for unsupervised parsing (58.74 vs. 55.91 UUAS).


Introduction
Pretrained masked language models (Devlin et al., 2019; Liu et al., 2019b) have benefited a wide range of natural language processing (NLP) tasks (Liu, 2019; Wadden et al., 2019; Zhu et al., 2020). Despite recent progress in understanding what useful information is captured by MLMs (Liu et al., 2019a; Hewitt and Manning, 2019), it remains a mystery why task-agnostic masking of words can capture linguistic structures and transfer to downstream tasks.
One popular justification of MLMs relies on viewing masking as a form of cloze reduction.
Figure 1: We study the inductive bias of MLM objectives and show that cloze-like masking (left; e.g., "I [MASK] this movie") does not account for much of the downstream performance gains. Instead, we show that MLM objectives are biased towards extracting both statistical and syntactic dependencies using random masks (right; "I like this movie").
Cloze reductions reformulate an NLP task into a prompt question with a blank and elicit answers by filling in the blank (Figure 1). When tested via cloze reductions, pretrained MLMs and left-to-right language models (LMs) have been shown to possess abundant factual knowledge (Petroni et al., 2019) and display impressive few-shot ability (Brown et al., 2020). This success has inspired recent hypotheses that some word masks are cloze-like and provide indirect supervision for downstream tasks (Lee et al., 2020). For example, a sentiment classification task (Pang et al., 2002) can be reformulated into filling in like or hate in the cloze I [MASK] this movie. Such cloze-like masks provide a clear way in which an MLM can implicitly learn to perform sentiment classification.
While this hypothesis is appealing, MLMs in practice are trained with uniform masking, which most of the time does not produce the special structure required by cloze-like masks. For example, predicting the generic word this in the cloze I like [MASK] movie would not offer task-specific supervision. We quantify the importance of cloze-like and generic masks by explicitly creating cloze-like masks using task-specific lexicons and comparing models pretrained on these masks. These experiments suggest that although cloze-like masks can be helpful, the success of uniform masking cannot be explained via cloze-like masks alone. In fact, we demonstrate that uniform masking performs as well as a negative control in which we explicitly remove cloze-like masks from the mask distribution.
To address this mismatch between theory and practice, we offer a new hypothesis of how generic masks can help downstream learning. We propose a conceptual model for MLMs by drawing a correspondence between masking and graphical model neighborhood selection (Meinshausen and Bühlmann, 2006). Using this, we show that MLM objectives are designed to recover statistical dependencies in the presence of latent variables and propose an estimator that can recover these learned dependencies from MLMs. We hypothesize that the statistical dependencies in the MLM objective capture useful linguistic dependencies, and demonstrate this by using the recovered statistical dependencies to perform unsupervised parsing, outperforming a classic unsupervised parsing baseline (58.74 vs. 55.91 UUAS; Klein and Manning, 2004). We release our implementation on GitHub.[1]

Related works
Theories inspired by Cloze Reductions. Cloze reductions are fill-in-the-blank tests that reformulate an NLP task into an LM problem. Existing work demonstrates that such reductions can be highly effective for zero/few-shot prediction (Radford et al., 2019; Brown et al., 2020) as well as relation extraction (Petroni et al., 2019). These fill-in-the-blank tasks provide a clear way by which LMs can obtain supervision about downstream tasks, and recent work demonstrates how such implicit supervision can lead to useful representations. More general arguments by Lee et al. (2020) show these theories hold across a range of self-supervised settings. While these theories provide compelling arguments for the value of pretraining with cloze tasks, they do not provide a clear reason why uniformly random masks such as those used in BERT provide such strong gains. In our work, we quantify this gap using lexicon-based cloze-like masks and show that cloze-like masks alone are unlikely to account for the complete success of MLM, since generic, non-cloze masks are responsible for a substantial part of the empirical performance of MLMs.

[1] https://github.com/tatsu-lab/mlm_inductive_bias

Theories for vector representations. Our goal of understanding how masking can lead to useful inductive biases and linguistic structures is closely related to that of papers studying the theory of word embedding representations (Mikolov et al., 2013; Pennington et al., 2014; Arora et al., 2015). Existing work has drawn a correspondence between word embeddings and low-rank factorization of a pointwise mutual information (PMI) matrix (Levy and Goldberg, 2014), and others have shown that PMI is highly correlated with human semantic similarity judgements (Hashimoto et al., 2016).
While existing theories for word embeddings cannot be applied to MLMs, we draw inspiration from them and derive an analogous set of results. Our work shows a correspondence between MLM objectives and graphical model learning through conditional mutual information, as well as evidence that the conditional independence structure learned by MLMs is closely related to syntactic structure.
Probing Pretrained Representations. Recent work has applied probing methods (Belinkov and Glass, 2019) to analyze what information is captured in pretrained representations. This line of work shows that pretrained representations encode a diverse range of knowledge (Peters et al., 2018; Tenney et al., 2019; Liu et al., 2019a; Hewitt and Manning, 2019). While probing provides intriguing evidence of linguistic structures encoded by MLMs, it does not address the goal of this work, which is to understand how the pretraining objective encourages MLMs to extract such structures.

Problem Statement
Masked Language Modeling asks the model to predict a token given its surrounding context. Formally, consider an input sequence X of L tokens x_1, ..., x_L, where each variable takes a value from a vocabulary V. Let X ∼ D be the data-generating distribution of X. Let x_i be the ith token in X, and let X_{\i} denote the sequence after replacing the ith token with a special [MASK] token. Similarly, define X_{\{i,j\}} as the sequence after replacing both x_i and x_j with [MASK] tokens.

Figure 2: In our case study, we append the true label to each input and create ideal cloze-like masks. We study how deviations from the ideal mask distribution affect downstream performance by adding in generic masks.

Given a distribution M over mask positions, MLM pretraining learns parameters θ that minimize

L_MLM(θ) = E_{X∼D} E_{i∼M} [ −log p_θ(x_i | X_{\i}) ].

In BERT pretraining, each input token is masked with a fixed, uniform probability, which is a hyperparameter to be chosen. We refer to this strategy as uniform masking.
Finetuning is the canonical method for using pretrained MLMs. Consider a prediction task where y ∈ Y is the target variable, e.g., the sentiment label of a review. Finetuning uses gradient descent to modify the pretrained parameters θ and learn a new set of parameters φ to minimize

L_finetune(θ, φ) = E_{X∼D} E_{y∼p(·|X)} [ −log p_{θ,φ}(y | X) ],

where p(y|X) is the ground-truth distribution and D is the data distribution of the downstream task.

Our goals. We will study how the mask distribution M affects downstream performance. We define perfect cloze reductions via a partition of the vocabulary into sets V_y, one per label, such that p(x_i ∈ V_y | X_{\i}) ≈ p(y|X). For a distribution M such that the masks we draw are perfect cloze reductions, the MLM objective offers direct supervision for finetuning since L_MLM ≈ L_finetune. In contrast to cloze-like masking, under uniform masking we can think of p_θ as implicitly learning a generative model of X (Wang and Cho, 2019). Therefore, as M moves away from the ideal distribution and becomes more uniform, we expect p_θ to model more of the full data distribution D instead of focusing on cloze-like supervision for the downstream task. This mismatch between theory and practice raises the question of how MLM with uniform masking can learn useful inductive biases.
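As a sketch of why perfect cloze masks align the two objectives (our reconstruction, writing V_y for the token set associated with label y, and ignoring the within-class entropy of tokens inside V_y):

```latex
\mathcal{L}_{\mathrm{MLM}}
  = \mathbb{E}_{X \sim \mathcal{D},\, i \sim \mathcal{M}}
      \big[ -\log p_\theta(x_i \mid X_{\setminus i}) \big]
  \;\approx\; \mathbb{E}_{X}\big[ -\log p_\theta(x_i \in V_y \mid X_{\setminus i}) \big]
  \;\approx\; \mathbb{E}_{X, y}\big[ -\log p(y \mid X) \big]
  = \mathcal{L}_{\mathrm{finetune}} .
```

The first approximation holds when M concentrates on positions whose token determines the label; the second is the definition of a perfect cloze reduction, p(x_i ∈ V_y | X_{\i}) ≈ p(y | X).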
When L_MLM is not L_finetune, what is L_MLM learning? We analyze L_MLM and show that it resembles a form of graphical model structure learning based on conditional mutual information.

Case Study for Cloze-like Masking
To motivate our subsequent discussions, we perform a controlled study for the case when L_MLM ≈ L_finetune and analyze how deviations from the ideal mask distribution affect downstream performance. We perform the analysis on the Stanford Sentiment Treebank (SST-2; Socher et al., 2013), which requires models to classify short movie reviews into positive or negative sentiment. We append the ground-truth label (as the word positive or negative) to each movie review (Figure 2). Masking the last word in each review is, by definition, an ideal mask distribution. To study how deviation from the ideal mask distribution degrades downstream performance, we vary the amount of cloze-like masks during training. We do this by masking out the last word p% of the time and masking out a random word in the movie review (100 − p)% of the time, and choose p ∈ {0, 20, 40, 60, 80, 100}.

Experimental details. We split the SST-2 training set into two halves, use one for pretraining, and the other for finetuning. For the finetuning data, we do not append the ground-truth label. We pretrain small transformers with L_MLM using different masking strategies and finetune them along with a baseline that is not pretrained (NOPRETRAIN). Further details are in Appendix A.
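The CLOZE-p% mixing scheme above can be sketched in a few lines (a minimal illustration; the function names and [MASK] handling are ours, not the paper's code):

```python
import random

def sample_mask_position(tokens, p_cloze):
    """Sample one position to mask for CLOZE-p% pretraining.

    With probability p_cloze, mask the final token (the appended
    "positive"/"negative" label word) -- the ideal cloze-like mask;
    otherwise mask a uniformly random word inside the review.
    """
    if random.random() < p_cloze:
        return len(tokens) - 1               # cloze-like mask on the label word
    return random.randrange(len(tokens) - 1)  # generic mask on a review word

def apply_mask(tokens, pos, mask_token="[MASK]"):
    """Replace tokens[pos] with [MASK]; return the masked sequence and target."""
    masked = list(tokens)
    target = masked[pos]
    masked[pos] = mask_token
    return masked, target
```

Setting p_cloze = 1.0 recovers the ideal mask distribution, while p_cloze = 0.0 gives purely generic masks.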
Results. We observe that while cloze-like masks can lead to successful transfer, even a small modification of the ideal mask distribution deteriorates performance. Figure 3 shows the development set accuracy of seven model variants averaged across ten random trials. We observe that as p decreases, the performance of CLOZE-p% degrades. Notably, CLOZE-80% is already worse than CLOZE-100%, and CLOZE-20% does not outperform NOPRETRAIN by much. We notice that CLOZE-0% in fact degrades finetuning performance, potentially because the pretrained model is over-specialized to the language modeling task (Zhang et al., 2020; Tamkin et al., 2020). While this is a toy example, we observe similar results for actual MLM models across three tasks (Section 5.1), and this motivates us to look for a framework that explains the success of generic masks in practice.

Analysis
In the previous section, we saw that cloze-like masks do not necessarily explain the empirical success of MLMs with uniform masking strategies. Understanding uniform masking seems challenging at first, as uniform-mask MLMs seem to lack task-specific supervision and are distinct from existing unsupervised learning methods such as word embeddings (which rely upon linear dimensionality reduction) and autoencoders (which rely upon denoising). However, we show in this section that there is a correspondence between MLM objectives and classic methods for graphical model structure learning. As a consequence, we demonstrate that MLMs are implicitly trained to recover statistical dependencies among observed tokens.

Intuition and Theoretical Analysis
Our starting point is the observation that predicting a single feature (x_i) from all others (X_{\i}) is the core subroutine in the classic Gaussian graphical model structure learning algorithm of Meinshausen and Bühlmann (2006). In this approach, L different Lasso regression models (Tibshirani, 1996) are trained, with each model predicting x_i from X_{\i}, and the nonzero coefficients of each regression correspond to the conditional dependence structure of the graphical model.
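As a concrete reference point, the Meinshausen-Bühlmann procedure can be sketched with a minimal, numpy-only Lasso solver (iterative soft-thresholding). This is our illustrative implementation, not the paper's code:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Lasso via iterative soft-thresholding (ISTA).

    Minimizes 0.5 * ||X beta - y||^2 + lam * n * ||beta||_1,
    i.e. lam is a per-sample penalty strength.
    """
    n, d = X.shape
    beta = np.zeros(d)
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        z = beta - step * grad
        # Soft-thresholding: the proximal operator of the L1 penalty.
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam * n, 0.0)
    return beta

def neighborhood_selection(X, lam=0.1):
    """Meinshausen-Buhlmann: regress each coordinate on all the others;
    nonzero Lasso coefficients are the estimated graph edges."""
    n, L = X.shape
    adj = np.zeros((L, L), dtype=bool)
    for i in range(L):
        others = np.delete(np.arange(L), i)
        beta = lasso_ista(X[:, others], X[:, i], lam)
        adj[i, others] = np.abs(beta) > 1e-6
    return adj | adj.T   # symmetrize with the "OR" rule
```

On data from a chain x1 → x2 → x3, this recovers the edges (1,2) and (2,3) but not the spurious edge (1,3), since x1 and x3 are conditionally independent given x2.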
The MLM objective can be interpreted as a nonlinear extension of this approach, much like a classical algorithm that uses conditional mutual information (MI) estimators to recover a graphical model (Anandkumar et al., 2012). Despite the similarity, real-world texts are better viewed as models with latent variables (e.g., topics; Blei et al., 2003), and many dependencies across tokens arise due to latent variables, which makes learning the direct dependencies difficult. We show that MLMs implicitly recover the latent variables and can capture the direct dependencies while accounting for the effect of latent variables. Finally, MLMs are only approximations to the true distribution, and we show that the MLM objective can induce high-quality approximations of conditional MI.
Analysis setup. To better understand MLMs as a way to recover graphical model structures, we show that mask-based models can recover latent variables and the direct dependencies among variables in the Gaussian graphical model setting of Meinshausen and Bühlmann (2006). Let X = [x_1, ..., x_L] ∈ R^L represent an input sequence where each coordinate x_i represents a token, and let Z ∈ R^k be a latent variable that controls the sequence generation process. We assume that all coordinates of X depend on the latent variable Z, and that there are sparse dependencies among the observed variables (Figure 4). In other words, we can write Z ∼ N(0, Σ_ZZ) and X | Z ∼ N(AZ, Σ_XX). Intuitively, we can imagine that Z represents shared semantic information, e.g., a topic, and Σ_XX represents the syntactic dependencies. In this Gaussian graphical model, the MLM is analogous to regressing each coordinate of X on all other coordinates, which we refer to as masked regression.
MLM representations can recover the latent variable. We now study the behavior of masked regression through the representation x_mask,i that is obtained by applying masked regression to the ith coordinate of X and using the predicted values. Our result shows that masked regression is similar to the two-step process of first recovering the latent variable Z from X_{\i} and then predicting x_i from Z.
Let Σ_XX,\i,i ∈ R^{L−1} be the vector formed by dropping the ith row and taking the ith column of Σ_XX, and let β_2SLS,i be the linear map resulting from the two-stage regression X_{\i} → Z → x_i.

Proposition 1. Assume Σ_XX is full rank. Then the masked-regression coefficients coincide with the two-stage coefficients β_2SLS,i up to an error term whose magnitude scales with Σ_XX,\i,i.

In other words, masked regression implicitly recovers the subspace that we would get if we first explicitly recovered the latent variables (β_2SLS,i), with an error term that scales with the off-diagonal terms in Σ_XX. The proof is presented in Appendix C.
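A quick numerical illustration of Proposition 1 (our sketch): with a diagonal Σ_XX the off-diagonal error term vanishes, so masked regression and the explicit two-stage route X_{\i} → Z → x_i should give nearly identical coefficients.

```python
import numpy as np

def masked_vs_two_stage(n=50_000, L=12, k=2, noise=0.1, seed=1):
    """Compare masked regression with the two-stage route X_{\\i} -> Z -> x_i.

    Data follow the section's model X = AZ + eps with a *diagonal*
    Sigma_XX (= noise**2 * I), so Proposition 1's error term, which
    scales with the off-diagonal entries Sigma_XX,\\i,i, should vanish
    up to sampling noise.
    """
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(L, k))
    Z = rng.normal(size=(n, k))                    # latent "topic" variables
    X = Z @ A.T + noise * rng.normal(size=(n, L))  # observed sequence

    i = 0
    others = np.delete(np.arange(L), i)
    # Masked regression: predict x_i directly from X_{\i}.
    beta_mask, *_ = np.linalg.lstsq(X[:, others], X[:, i], rcond=None)
    # Two-stage: first recover Z from X_{\i}, then predict x_i from Z.
    W, *_ = np.linalg.lstsq(X[:, others], Z, rcond=None)
    c, *_ = np.linalg.lstsq(Z, X[:, i], rcond=None)
    return beta_mask, W @ c
```

With these defaults the two coefficient vectors agree up to sampling error; adding off-diagonal noise to Σ_XX would introduce exactly the gap that Proposition 1 bounds.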
To give additional context for this result, let us consider the behavior of a different representation learning algorithm: PCA. It is well known that PCA can recover the latent variables as long as Σ_ZZ dominates the covariance Cov(X). We state this result in terms of X_PCA, the observed data projected onto the first k components of PCA.

Proposition 2. Let λ_k be the kth eigenvalue of AΣ_ZZ A^⊤, let λ_XX,k+1 be the (k+1)th eigenvalue of Σ_XX, and let V be the first k eigenvectors of Cov(X). Assuming λ_k > λ_XX,k+1, we have

‖V V^⊤ − V' V'^⊤‖_op ≤ λ_XX,k+1 / (λ_k − λ_XX,k+1),   E‖X_PCA − AZ‖_2 ≤ (λ_XX,k+1 / (λ_k − λ_XX,k+1)) E‖X‖_2 + √tr(Σ_XX),

where V' is the first k eigenvectors of AΣ_ZZ A^⊤, ‖·‖_op is the operator norm, and tr(·) is the trace.
This shows that whenever Σ XX is sufficiently small and λ k is large (i.e., the covariance is dominated by Z), then PCA recovers the latent information in Z. The proof is based on the Davis-Kahan theorem (Stewart and Sun, 1990) and is presented in Appendix C.
Comparing the bounds for PCA and masked regression, both have errors that scale with Σ_XX, but the key difference is that the error term for masked regression does not scale with the per-coordinate noise (diag(Σ_XX)) and thus can be thought of as focusing exclusively on interactions within X. Analyzing this more carefully, we find that Σ_XX,\i,i corresponds to the statistical dependencies between x_i and X_{\i}, which we might hope capture useful, task-agnostic structures such as syntactic dependencies.
MLM log-probabilities can recover direct dependencies. Another effect of latent variables is that many tokens have indirect dependencies through the latent variables, which poses a challenge for recovering the direct dependencies among tokens. We now show that MLMs can account for the effect of latent variables.
In the case where there are no latent variables, we can identify the direct dependencies via conditional MI (Anandkumar et al., 2012) because any x i and x j that are disconnected in the graphical model will have zero conditional MI, i.e., I(x i ; x j |X \{i,j} ) = 0. One valuable aspect of MLM is that we can identify direct dependencies even in the presence of latent variables.
If we naively measure statistical dependency by mutual information, the coordinates of X would appear dependent on each other because they are all connected through Z. However, the MLM objective resolves this issue by conditioning on X_{\{i,j\}}. We show that latent variables (such as topics) that are easy to predict from X_{\{i,j\}} can be ignored when considering conditional MI.

Proposition 3. The gap between conditional MI with and without the latent variable is bounded by the conditional entropy H(Z | X_{\{i,j\}}):

| I(x_i; x_j | X_{\{i,j\}}) − I(x_i; x_j | X_{\{i,j\}}, Z) | ≤ H(Z | X_{\{i,j\}}).

This suggests that when the context X_{\{i,j\}} captures enough of the latent information, conditional MI can remove the confounding effect of the shared topic Z and extract the direct and sparse dependencies within X (see Appendix C for the proof).
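A one-line sketch of where this bound comes from (our reconstruction; the full proof is in Appendix C): expanding I(x_i; x_j, Z | C) with the chain rule in two ways, writing C = X_{\{i,j\}} for the doubly-masked context,

```latex
I(x_i;\, x_j, Z \mid C)
  = I(x_i; x_j \mid C) + I(x_i; Z \mid C, x_j)
  = I(x_i; Z \mid C) + I(x_i; x_j \mid C, Z)
\;\Longrightarrow\;
I(x_i; x_j \mid C) - I(x_i; x_j \mid C, Z)
  = I(x_i; Z \mid C) - I(x_i; Z \mid C, x_j).
```

Both terms on the right lie in [0, H(Z | C)], so the gap is at most H(Z | X_{\{i,j\}}).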
MLM objective encourages capturing conditional MI. We have now shown that conditional MI captures direct dependencies among tokens, even in the presence of latent variables. Next, we show that the MLM objective ensures that an LM with low log-loss accurately captures conditional MI. Denote by X(i, v) the sequence obtained by substituting x_i with a new token v. Conditional MI is defined as the expected pointwise mutual information (PMI) conditioned on the rest of the tokens:

I_p(x_i; x_j | X_{\{i,j\}}) = E_{x_i, x_j ∼ p(·,·|X_{\{i,j\}})} [ log p(x_i | X(j, x_j)_{\i}) − log p(x_i | X_{\{i,j\}}) ],

where I_p abbreviates I_p(x_i; x_j | X_{\{i,j\}}). Our main result is that the log-loss MLM objective directly bounds the gap between the true conditional mutual information under the data distribution and an estimator that uses the log-probabilities from the model. More formally,

Proposition 4. Let Î_{p_θ} be the estimator obtained by replacing the log-probabilities of p with those of the model distribution p_θ in the expression above. Then we can show

| I_p − Î_{p_θ} | ≤ E[ D_kl( p(x_i | X_{\i}) ‖ p_θ(x_i | X_{\i}) ) ] + E[ D_kl( p(x_i | X_{\{i,j\}}) ‖ p_θ(x_i | X_{\{i,j\}}) ) ],

where D_kl represents the KL-divergence.
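To see why low MLM log-loss forces the estimator toward the true conditional MI, write C = X_{\{i,j\}} and expand the gap (our sketch of the argument, not the paper's exact derivation):

```latex
I_p - \hat{I}_{p_\theta}
 = \mathbb{E}\big[\log p(x_i \mid x_j, C) - \log p_\theta(x_i \mid x_j, C)\big]
 - \mathbb{E}\big[\log p(x_i \mid C) - \log p_\theta(x_i \mid C)\big]
 = \mathbb{E}\, D_{\mathrm{kl}}\!\big(p(x_i \mid x_j, C)\,\|\,p_\theta(x_i \mid x_j, C)\big)
 - \mathbb{E}\, D_{\mathrm{kl}}\!\big(p(x_i \mid C)\,\|\,p_\theta(x_i \mid C)\big).
```

Each KL term is exactly the excess of the MLM log-loss over the entropy of p at the corresponding masking pattern, so driving L_MLM toward its minimum drives both terms, and hence the gap, toward zero.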
Here, the KL-divergence terms correspond to the L_MLM objective, up to a constant entropy term that depends on p. We present the proof in Appendix C. In other words, the MLM objective implicitly encourages the model to match its implied conditional MI to that of the data. We now use this result to create an estimator that extracts the conditional independence structures implied by MLMs.

Extracting statistical dependencies implied by MLMs
Our earlier analysis in Proposition 4 suggests that an MLM with low loss has an accurate approximation of conditional mutual information. Using this result, we now propose a procedure that estimates Î_{p_θ}. The definition of Î_{p_θ} shows that if we can access samples of x_i and x_j from the true distribution p, then we can directly estimate the conditional mutual information using the log-probabilities from the MLM. Unfortunately, we cannot draw new samples of x_i, x_j | X_{\{i,j\}}, leading us to approximate this distribution using Gibbs sampling on the MLM distribution. Our Gibbs sampling procedure is similar to the one proposed in Wang and Cho (2019). We start with X^0 = X_{\{i,j\}}. For the tth iteration, we draw a sample x_i^t from p_θ(x_i | X^{t−1}_{\i}) and update X^t = X^{t−1}(i, x_i^t). Then, we draw a sample x_j^t from p_θ(x_j | X^t_{\j}) and set X^t = X^t(j, x_j^t). We repeat and use the samples (x_i^1, x_j^1), ..., (x_i^t, x_j^t) to compute the expectations for conditional MI.
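The Gibbs procedure above can be sketched against a generic MLM interface (our illustration; the `logp` callback signature is an assumption, not the paper's API):

```python
import math
import random

def conditional_mi(logp, X, i, j, n_samples=200, burn_in=20, seed=0):
    """Estimate I(x_i; x_j | X_{\\{i,j\\}}) from an MLM's conditionals.

    `logp(X, pos)` must return a dict {token: log p_theta(x_pos = token | X
    with position pos masked)}. Positions i and j of X are Gibbs-resampled;
    the remaining context is held fixed.
    """
    rng = random.Random(seed)
    X = list(X)
    X[i] = X[j] = "[MASK]"
    samples = []
    for t in range(burn_in + n_samples):
        # One Gibbs sweep: resample x_i given the rest, then x_j.
        for pos in (i, j):
            dist = logp(X, pos)
            toks, logps = zip(*dist.items())
            X[pos] = rng.choices(toks, weights=[math.exp(lp) for lp in logps])[0]
        if t >= burn_in:
            samples.append((X[i], X[j]))
    # Average the pointwise terms: log p(x_i | x_j, rest) - log p(x_i | rest).
    total = 0.0
    for xi, xj in samples:
        X_j_visible = list(X); X_j_visible[i] = "[MASK]"; X_j_visible[j] = xj
        X_both_masked = list(X); X_both_masked[i] = X_both_masked[j] = "[MASK]"
        total += logp(X_j_visible, i)[xi] - logp(X_both_masked, i)[xi]
    return total / len(samples)
```

For two tokens that are conditionally independent given the context, every pointwise term is zero and the estimate vanishes; dependent tokens yield a strictly positive estimate.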
This procedure relies upon an additional assumption that samples drawn from the MLM are faithful approximations of the data generating distribution. However, we show empirically that even this approximation is sufficient to test the hypothesis that the conditional independences learned by an MLM capture syntactic dependencies (Section 5.2).

Experiment
We now test two predictions from our analyses. First, similar to our observation in the case study, we show that cloze-like masks do not explain the success of uniform masks on three real-world datasets. Second, our alternative view of relating MLM to graphical models suggests that statistical dependencies learned by MLMs may capture linguistic structures useful for downstream tasks. We demonstrate this by showing that MLMs' statistical dependencies reflect syntactic dependencies.

Uniform vs Cloze-like Masking
Setup. We now demonstrate that real-world tasks and MLMs show a gap between task-specific cloze masks and random masks. We compare the MLM with random masking to two different control groups. In the positive control (CLOZE), we pretrain with only cloze-like masks, and in the negative control (NOCLOZE), we pretrain by explicitly excluding cloze-like masks. If the success of MLM can be mostly explained by implicit cloze reductions, then we should expect CLOZE to have strong downstream performance while NOCLOZE leads to a minimal performance gain. We compare pretraining with the uniform masking strategy used in BERT (UNIFORM) to these two control groups. If UNIFORM performs worse than the positive control and similarly to the negative control, then we know that uniform masking does not leverage cloze-like masks effectively.
Simulating Pretraining. Given computational constraints, we cannot retrain BERT from scratch. Instead, we approximate the pretraining process by continuing to update BERT with MLM (Gururangan et al., 2020), which we refer to as second-stage pretraining. Although this is an approximation to the actual pretraining process, the second-stage pretraining shares the same fundamental problem for pretraining: how can unsupervised training lead to downstream performance gains?
We study the effectiveness of different masking strategies by comparing to a BERT model without second-stage pretraining (VANILLA). We experiment with three text classification datasets: SST-2 (Socher et al., 2013), Hyperpartisan (Kiesel et al., 2019), and AGNews (Zhang et al., 2015). SST-2 classifies movie reviews by binary sentiment; Hyperpartisan is a binary classification task on whether a news article takes an extreme partisan standpoint; and AGNews classifies news articles into four different topics. On SST-2 and AGNews, we perform the second-stage pretraining on the training inputs (not using the labels). On Hyperpartisan, we use 100k unlabeled news articles that are released with the dataset. For SST-2 and AGNews, we study a low-resource setting and set the number of finetuning examples to be 20. For Hyperpartisan, we use the training set, which has 515 labeled examples. All evaluations are performed by fine-tuning a bert-base-uncased model (see Appendix A for full details).
Approximating Cloze-like Masking. We cannot identify the optimal set of cloze-like masks for an arbitrary downstream task, but these three tasks have associated lexicons which we can use to approximate the cloze-like masks. For SST-2, we take the sentiment lexicon compiled by Hu and Liu (2004); for Hyperpartisan, we take the NRC word-emotion association lexicon (Mohammad and Turney, 2013); and for AGNews, we extract topic words by training a logistic regression classifier and taking the top 1k features to be cloze-like masks.
Results. Figure 5 plots the finetuning performance of different masking strategies. We observe that UNIFORM outperforms VANILLA, which indicates that second-stage pretraining is extracting useful information and our experiment setup is useful for studying how MLM leads to performance gains. As expected, CLOZE achieves the best accuracy, which confirms that cloze-like masks can be helpful and validates our cloze approximations.
The UNIFORM mask is much closer to NOCLOZE than CLOZE. This suggests that uniform masking does not leverage cloze-like masks well and cloze reductions alone cannot account for the success of MLM. This view is further supported by the observation that NOCLOZE outperforms VANILLA, suggesting that generic masks that are not cloze-like still contain useful inductive biases.
Our results support our earlier view that there may be an alternative mechanism that allows generic masks that are not cloze-like to benefit downstream learning. Next, we will empirically examine BERT's learned conditional independence structure among tokens and show that the statistical dependencies relate to syntactic dependencies.

Analysis: Unsupervised Parsing
Our analysis in Section 4.1 shows that conditional MI (which is optimized by the MLM objective) can extract conditional independence structure. We will show that the statistical dependencies estimated by conditional MI are related to syntactic dependencies by using conditional MI for unsupervised parsing.
Background. One might expect that the statistical dependencies among words are correlated with syntactic dependencies. Indeed, Futrell et al. (2019) show that heads and dependents in dependency parse trees have high pointwise mutual information (PMI) on average. However, previous attempts (Carroll and Charniak, 1992; Paskin, 2002) show that unsupervised parsing approaches based on PMI achieve close-to-random accuracy. Our analysis suggests that MLMs extract a more fine-grained notion of statistical dependence (conditional MI) which does not suffer from the existence of latent variables (Proposition 3). We now show that the conditional MI captured by MLMs achieves far better performance, on par with classic unsupervised parsing baselines.
Baselines. We compare conditional MI to PMI as well as conditional PMI, an ablation in which we do not take expectation over possible words. For all statistical dependency based methods (cond. MI, PMI, and cond. PMI), we compute pairwise dependence for each word pair in a sentence and construct a minimum spanning tree on the negative values to generate parse trees. To contextualize our results, we compare against three simple baselines: RANDOM which draws a random tree on the input sentence, LINEARCHAIN which links adjacent words in a sentence, and a classic unsupervised parsing method (Klein and Manning, 2004).
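The spanning-tree step can be sketched directly (our minimal implementation; for real sentences the edge scores would be the pairwise conditional MI values from the MLM):

```python
import numpy as np

def mst_parse(score):
    """Undirected parse via a maximum spanning tree over pairwise dependence
    scores (equivalent to a minimum spanning tree on the negated scores).
    `score` is a symmetric n x n matrix, e.g. conditional MI per word pair."""
    n = len(score)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        # Prim's algorithm: greedily attach the highest-score crossing edge.
        best = max(((u, v) for u in in_tree for v in range(n) if v not in in_tree),
                   key=lambda e: score[e[0]][e[1]])
        edges.append(tuple(sorted(best)))
        in_tree.add(best[1])
    return sorted(edges)

def uuas(predicted, gold):
    """Undirected unlabeled attachment score: fraction of gold edges recovered."""
    return len(set(predicted) & set(gold)) / len(gold)
```

Running the maximum spanning tree on the scores is identical to running a minimum spanning tree on their negation, which is how the procedure is phrased in the text.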
Experimental Setup. We conduct experiments on the English Penn Treebank using the WSJ corpus and convert the annotated constituency parses to Stanford Dependency Formalism (de Marneffe et al., 2006). Following Yang et al. (2020), we evaluate on sentences of length ≤ 10 in the test split, which contains 389 sentences (Appendix B.1 describes the same experiment on longer sentences, which have similar results). We experiment with the bert-base-cased model (more details in Appendix A) and evaluate by the undirected unlabeled attachment score (UUAS).
Results. Table 1 shows a much stronger-than-random association between conditional MI and dependency grammar. In fact, the parses extracted from conditional MI have better quality than LINEARCHAIN and the classic method (Klein and Manning, 2004). Unlike conditional MI, PMI only has close-to-random performance, which is consistent with prior work. We also see that conditional MI outperforms conditional PMI, which is consistent with our theoretical framework that suggests that conditional MI (and not PMI) recovers the graphical model structure.

[Table 1: UUAS by method. RANDOM 28.50 ± 0.73; LINEARCHAIN 54.13; Klein and Manning (2004) 55.91; conditional MI 58.74.]

Figure 6: An example parse extracted from conditional MI for the sentence "The above represents a triumph of either apathy or civility." The black parse tree above the sentence represents the ground-truth parse and the red parse below is extracted from conditional MI. The correctly predicted edges are labeled with the annotated relations, and the incorrect ones are labeled as wrong.
We also perform a fine-grained analysis by investigating relations where conditional MI differs from LINEARCHAIN. Because the test split is small and conditional MI does not involve any training, we perform this analysis on 5,000 sentences from the training split. Table 2 presents the results and shows that conditional MI does not simply recover the linear chain bias. Meanwhile, we also observe a deviation between conditional MI and dependency grammar on relations like number and cc. This is reasonable because certain aspects of dependency grammar depend on human conventions that do not necessarily have a consensus (Popel et al., 2013). Figure 6 illustrates with an example parse extracted from conditional MI. We observe that conditional MI correctly captures dobj and conj. Knowing the verb, e.g., represents, limits the range of objects that can appear in a sentence, so intuitively we expect a high conditional MI between the direct object and the verb. Similarly, for phrases like "A and B", we would expect A and B to be statistically dependent. However, conditional MI fails to capture cc (between apathy and or). Instead, it links or with either, which certainly has a statistical dependence. This once again suggests that the 'errors' incurred by the conditional MI method are not simply failures to estimate dependence but natural differences in the definition of dependence.

Discussion and Conclusion
We study how MLM with uniform masking can learn useful linguistic structures and inductive biases for downstream tasks. Our work demonstrates that a substantial part of the performance gains of MLM pretraining cannot be attributed to taskspecific, cloze-like masks.

A Experimental Details
Experimental details for Section 3.2. Our transformers have 2 layers, and for each transformer block, the hidden size and the intermediate size are both 64. We finetune the models for 10 epochs and apply early stopping based on validation accuracy. We use Adam (Kingma and Ba, 2014) for optimization, with a learning rate of 1e-3 for pretraining and 1e-4 for finetuning.

Experimental details for Section 5.1. Table 3 summarizes the dataset statistics of the three real-world datasets we studied. For second-stage pretraining, we update the BERT model for 10 epochs. Following the suggestion in Zhang et al. (2020), we finetune the pretrained BERT models for 400 steps, using a batch size of 16 and a learning rate of 1e-5. We apply linear learning rate warmup for the first 10% of finetuning and linear learning rate decay for the rest. For SST-2 and AGNews, we average the results over 20 random trials. For Hyperpartisan, because the test set is small and the variation is larger, we average the results over 50 random trials and evaluate on the union of the development set and the test set for more stable results.
Experimental details for Section 5.2. We convert the annotated constituency parses using the Stanford CoreNLP package. We compute conditional MI and conditional PMI using the bert-base-cased model and run Gibbs sampling for 2000 steps. BERT's tokenization may split a word into multiple word pieces; we aggregate the dependencies between a word and multiple word pieces by taking the maximum value. We compute the PMI statistics and train the K&M model (Klein and Manning, 2004) on sentences of length ≤ 10 in the WSJ train split (sections 2-21). For DMV, we train with the annotated POS tags using a public implementation released by He et al. (2018). Results are averaged over three runs when applicable.

B.1 Additional Results in Section 5.2
We conduct an additional experiment on the English Penn Treebank to verify that conditional MI can extract parses for sentences longer than ten words. To expedite experimentation, we subsample 200 out of 2416 sentences from the test split of English Penn Treebank and the average sentence length of our subsampled dataset is 24.1 words. When applicable, we average over three random seeds and report standard deviations. Table 4 presents the UUAS of conditional MI and other methods. We draw similar conclusions as in Section 5.2, observing that the parses drawn by conditional MI have higher quality than those of other baselines.

C Proofs
Proof of Proposition 2. We first recall the statement.

Proposition 2. Let λ_k be the kth eigenvalue of AΣ_ZZ A^⊤, let λ_XX,k+1 be the (k+1)th eigenvalue of Σ_XX, and let V be the first k eigenvectors of Cov(X). Assuming λ_k > λ_XX,k+1, we have

‖V V^⊤ − V' V'^⊤‖_op ≤ λ_XX,k+1 / (λ_k − λ_XX,k+1),   E‖X_PCA − AZ‖_2 ≤ (λ_XX,k+1 / (λ_k − λ_XX,k+1)) E‖X‖_2 + √tr(Σ_XX),

where V' is the first k eigenvectors of AΣ_ZZ A^⊤, ‖·‖_op is the operator norm, and tr(·) is the trace.

Proof
We will use the Davis-Kahan Theorem for our proof.
Theorem (Davis-Kahan; Stewart and Sun, 1990). Let σ be the eigengap between the kth and the (k+1)th eigenvalues of two positive semidefinite symmetric matrices Σ and Σ'. Also, let V and V' be the first k eigenvectors of Σ and Σ', respectively. Then we have

‖V V^⊤ − V' V'^⊤‖_op ≤ ‖Σ − Σ'‖_op / σ.

That is, we can bound the error in the subspace projection in terms of the matrix perturbation.
In our setting, we choose Σ = AΣ_ZZ A^⊤ + Σ_XX and Σ' = AΣ_ZZ A^⊤. We know the eigengap of Σ' is λ_k because Σ' has only k nonzero eigenvalues. By Weyl's inequality, the kth eigenvalue is perturbed by at most λ_XX,k+1, the (k+1)th eigenvalue of Σ_XX. Let V be the top k eigenvectors of Σ. Assuming λ_k > λ_XX,k+1, we have

‖V V^⊤ − V' V'^⊤‖_op ≤ λ_XX,k+1 / (λ_k − λ_XX,k+1).

Turning this operator norm bound into an approximation bound, we have

E‖X_PCA − AZ‖_2 ≤ E‖(V V^⊤ − V' V'^⊤) X‖_2 + E‖V' V'^⊤ X − AZ‖_2 ≤ (λ_XX,k+1 / (λ_k − λ_XX,k+1)) E‖X‖_2 + √tr(Σ_XX).

For the last term, we use the fact that E_{X,Z}‖AZ − X‖_2^2 = tr(Σ_XX) and Jensen's inequality to bound E_X‖AZ − X‖_2 ≤ √tr(Σ_XX).