Perturbed Masking: Parameter-free Probing for Analyzing and Interpreting BERT

By introducing a small set of additional parameters, a probe learns to solve specific linguistic tasks (e.g., dependency parsing) in a supervised manner using feature representations (e.g., contextualized embeddings). The effectiveness of such probing tasks is taken as evidence that the pre-trained model encodes linguistic knowledge. However, this approach of evaluating a language model is undermined by the uncertainty about how much knowledge is learned by the probe itself. Complementary to those works, we propose a parameter-free probing technique for analyzing pre-trained language models (e.g., BERT). Our method does not require direct supervision from the probing tasks, nor do we introduce additional parameters to the probing process. Our experiments on BERT show that syntactic trees recovered from BERT using our method are significantly better than linguistically uninformed baselines. We further feed the empirically induced dependency structures into a downstream sentiment classification task and find their improvement comparable to or even better than that of a human-designed dependency schema.


Introduction
Recent prevalent pre-trained language models such as ELMo (Peters et al., 2018b), BERT (Devlin et al., 2018), and XLNet (Yang et al., 2019) achieve state-of-the-art performance for a diverse array of downstream NLP tasks. An interesting area of research is to investigate the interpretability of these pre-trained models (i.e., the linguistic properties they capture). Most recent approaches are built upon the idea of probing classifiers (Shi et al., 2016; Adi et al., 2017; Conneau et al., 2018; Peters et al., 2018a; Hewitt and Manning, 2019; Clark et al., 2019; Tenney et al., 2019b; Jawahar et al., 2019). A probe is a simple neural network (with a small additional set of parameters) that uses the feature representations generated by a pre-trained model (e.g., hidden state activations, attention weights) and is trained to perform a supervised task (e.g., dependency labeling). The performance of a probe is used to measure the quality of the generated representations, with the assumption that the measured quality is mostly attributable to the pre-trained language model.
One downside of such an approach, as pointed out by Hewitt and Liang (2019), is that a probe introduces a new set of additional parameters, which makes the results difficult to interpret. Is it the pre-trained model that captures the linguistic information, or is it the probe that learns the downstream task itself and thus encodes the information in its additional parameter space?
In this paper we propose a parameter-free probing technique called Perturbed Masking to analyze and interpret pre-trained models. The main idea is to perturb the input via the masked language modeling (MLM) objective to measure the impact a word x_j has on predicting another word x_i (Sec 2.2), and then to induce global linguistic properties (e.g., dependency trees) from this inter-word information.
Our contributions are threefold:
• We introduce a new parameter-free probing technique, Perturbed Masking, to estimate inter-word correlations. Our technique enables global syntactic information extraction.
• We evaluate the effectiveness of our probe over a number of linguistically driven tasks (e.g., syntactic parsing, discourse dependency parsing). Our results reinforce the claims of recent probing works, and further complement them by quantitatively evaluating the validity of their claims.
• We feed the empirically induced dependency structures into a downstream task to compare them with a parser-provided, linguist-designed dependency schema and find that our structures perform on par with or even better than the parser-created ones (Sec 6). This offers an insight into the remarkable success of BERT on downstream tasks.

Perturbed Masking
We propose the perturbed masking technique to assess the impact one word has on the prediction of another in MLM. The inter-word information derived serves as the basis for our later analysis.

Background: BERT
BERT (Devlin et al., 2018) is a large Transformer network that is pre-trained on 3.3 billion tokens of English text (in our experiments, we use the base, uncased version from Wolf et al. (2019)). It is trained on two tasks: (1) Masked Language Modeling (MLM): randomly select and mask 15% of all tokens in each given sequence, and then predict those masked tokens. In masking, a token is (a) replaced by the special token [MASK], (b) replaced by a random token, or (c) kept unchanged. These replacements are chosen 80%, 10%, and 10% of the time, respectively (a short sketch of this replacement rule is given below).
(2) Next Sentence Prediction: given a pair of sentences, predict whether the second sentence follows the first in the original document or is taken from another random document.
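As an illustration, the replacement rule for a position selected for masking can be sketched as follows (a simplified Python sketch, not BERT's actual pre-training code; corrupt and vocab are illustrative names):

import random

def corrupt(token, vocab, mask_token="[MASK]"):
    # Applied only to the 15% of positions selected for prediction.
    r = random.random()
    if r < 0.8:                      # 80%: replace with [MASK]
        return mask_token
    elif r < 0.9:                    # 10%: replace with a random token
        return random.choice(vocab)
    else:                            # 10%: keep the token unchanged
        return token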

Token Perturbation
Given a sentence as a list of tokens x = [x_1, . . . , x_T], BERT maps each x_i into a contextualized representation H_θ(x)_i, where θ represents the network's parameters. Our goal is to derive a function f(x_i, x_j) that captures the impact a context word x_j has on the prediction of another word x_i. We propose a two-stage approach to achieve this goal. First, we replace x_i with the [MASK] token and feed the new sequence x\{x_i} into BERT. We use H_θ(x\{x_i})_i to denote the representation of x_i. To calculate the impact x_j ∈ x\{x_i} has on H_θ(x\{x_i})_i, we further mask out x_j to obtain a second corrupted sequence x\{x_i, x_j}. Similarly, H_θ(x\{x_i, x_j})_i denotes the new representation of token x_i.
We define f(x_i, x_j) as

f(x_i, x_j) = d(H_θ(x\{x_i})_i, H_θ(x\{x_i, x_j})_i),

where d(x, y) is a distance metric that captures the difference between two vectors. We experimented with two options for d(x, y):
• Dist: the Euclidean distance between x and y.
• Prob: d(x, y) = a(x)_{x_i} − a(y)_{x_i}, where a(·) maps a vector into a probability distribution over the words in the vocabulary, and a(x)_{x_i} represents the probability of predicting token x_i based on x.
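For concreteness, the following is a minimal sketch (not our full implementation) of computing f(x_i, x_j) with the Dist metric using the HuggingFace transformers library; the model name, the use of the last hidden layer, and the omission of [CLS]/[SEP] handling are simplifying assumptions, and impact is an illustrative function name.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def impact(tokens, i, j):
    """f(x_i, x_j): how much additionally masking x_j changes the
    representation of the already-masked position i."""
    ids = tokenizer.convert_tokens_to_ids(tokens)
    mask_id = tokenizer.mask_token_id

    ids_1 = list(ids)
    ids_1[i] = mask_id            # stage 1: x\{x_i}
    ids_2 = list(ids_1)
    ids_2[j] = mask_id            # stage 2: x\{x_i, x_j}

    with torch.no_grad():
        h1 = model(torch.tensor([ids_1])).last_hidden_state[0, i]
        h2 = model(torch.tensor([ids_2])).last_hidden_state[0, i]
    return torch.dist(h1, h2).item()   # Euclidean distance (Dist)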
By repeating the two-stage perturbation on each pair of tokens x_i, x_j ∈ x and calculating f(x_i, x_j), we obtain an impact matrix F ∈ R^{T×T} with F_ij = f(x_i, x_j). Now, we can derive algorithms to extract syntactic trees from F and compare them with ground-truth trees obtained from benchmarks. Note that BERT uses byte-pair encoding (Sennrich et al., 2016) and may split a word into multiple tokens (subwords). To evaluate our approach on word-level tasks, we make the following changes to obtain inter-word impact matrices. In each perturbation, we mask all tokens of a split-up word. The impact on a split-up word is obtained by averaging the impacts over the split-up word's tokens. To measure the impact exerted by a split-up word, we assume the impacts given by its tokens are the same and use the impact given by its first token for convenience.
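A corresponding sketch of the matrix construction follows (subword grouping is omitted for brevity; impact_matrix is an illustrative name, and note that the procedure needs one pair of forward passes per token pair, i.e., O(T^2) passes per sentence):

import numpy as np

def impact_matrix(sentence):
    tokens = tokenizer.tokenize(sentence)            # wordpiece tokens
    T = len(tokens)
    F = np.zeros((T, T))
    for i in range(T):
        for j in range(T):
            if i != j:
                F[i, j] = impact(tokens, i, j)       # f(x_i, x_j) from the sketch above
    return tokens, F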

Span Perturbation
Given the token-level perturbation above, it is straightforward to extend it to span-level perturbation. We investigate how BERT models the relations between spans, which can be phrases, clauses, or paragraphs. As a preliminary study, we focus on how well BERT captures document structures.
We model a document D as N non-overlapping text spans D = [e_1, e_2, . . . , e_N], where each span e_i contains a sequence of consecutive tokens. For span-level perturbation, instead of masking one token at a time, we mask an array of tokens in a span simultaneously. We obtain the span representation by averaging the representations of all the tokens the span contains. Analogously to the token-level case, we calculate the impact e_j has on e_i by

f(e_i, e_j) = d(H_θ(x\{e_i})_{e_i}, H_θ(x\{e_i, e_j})_{e_i}),

where d is the Dist function and H_θ(·)_{e_i} denotes the averaged representation of span e_i.
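A span-level variant of the earlier token-level sketch might look as follows (span_impact is an illustrative name; ids is the list of token ids for the whole input, and span_i / span_j are (start, end) token offsets with end exclusive; the tokenizer and model loaded above are reused):

def span_impact(ids, span_i, span_j):
    """f(e_i, e_j): mask every token of e_i (stage 1), additionally mask
    every token of e_j (stage 2), and compare the averaged representation
    of e_i between the two corrupted sequences."""
    mask_id = tokenizer.mask_token_id
    ids_1 = list(ids)
    for t in range(*span_i):
        ids_1[t] = mask_id
    ids_2 = list(ids_1)
    for t in range(*span_j):
        ids_2[t] = mask_id

    with torch.no_grad():
        out_1 = model(torch.tensor([ids_1])).last_hidden_state[0]
        out_2 = model(torch.tensor([ids_2])).last_hidden_state[0]
    h1 = out_1[span_i[0]:span_i[1]].mean(dim=0)
    h2 = out_2[span_i[0]:span_i[1]].mean(dim=0)
    return torch.dist(h1, h2).item()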

Visualization with Impact Maps
Before we discuss specific syntactic phenomena, let us first analyze some example impact matrices derived from sample sentences. We visualize an impact matrix of a sentence by displaying a heatmap. We use the term "impact map" to refer to a heatmap of an impact matrix.
Setup. We extract impact matrices by feeding 1,000 sentences from the English Parallel Universal Dependencies (PUD) treebank of the CoNLL 2017 Shared Task (Zeman et al., 2017) into BERT. We follow the setup and pre-processing steps employed in pre-training BERT. An example impact map is shown in Figure 1.
Figure 1: Heatmap of the impact matrix for the sentence "For those who follow social media transitions on Capitol Hill, this will be a little different."
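One possible way to render such an impact map, building on the impact_matrix sketch above (the matplotlib plotting details are our own choice, and the axes show wordpiece tokens rather than words):

import matplotlib.pyplot as plt

sentence = ("For those who follow social media transitions on Capitol Hill, "
            "this will be a little different.")
tokens, F = impact_matrix(sentence)

plt.imshow(F, cmap="Blues")                       # darker = larger impact
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.colorbar()
plt.tight_layout()
plt.show()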
Dependency. We notice that the impact map contains many stripes, which are short series of vertical/horizontal cells, typically located along the diagonal. Take the word "different" as an example (illustrated by the second-to-last column in the impact matrix). We observe a clear vertical stripe above the main diagonal. The interpretation is that this particular occurrence of the word "different" strongly affects the occurrences of the words before it. These strong influences are shown by the darker-colored pixels in the second-to-last column of the impact map. This observation agrees with the ground-truth dependency tree, which selects "different" as the head of all remaining words in the phrase "this will be a little different." We also observe similar patterns for "transitions" and "Hill". Such correlations lead us to explore the idea of extracting dependency trees from the matrices (see Section 4.1).
Constituency. Figure 2 shows part of the constituency tree of our example sentence, covering the fragment "follow social media transitions on Capitol Hill", generated by Stanford CoreNLP (Manning et al., 2014). In this sentence, "media" and "on" are two words that are adjacent to "transitions". From the tree, however, we see that "media" is closer to "transitions" than "on" is in terms of syntactic distance. If a model were syntactically uninformed, we would expect "media" and "on" to have comparable impacts on the prediction of "transitions", and vice versa. However, we observe a far greater impact (darker color) between "media" and "transitions" than between "on" and "transitions". We will further support this observation with empirical experiments in Section 4.2.
Other Structures. Along the diagonal of the impact map, we see that words are grouped into four contiguous chunks that have specific intents (e.g., the noun phrase "on Capitol Hill"). We also observe that the two middle chunks have relatively strong inter-chunk word impacts and thus a bonding that groups them together, forming a larger verb phrase. This observation suggests that BERT may capture the compositionality of the language.
In the following sections we quantitatively evaluate these observations.

Syntactic Probe
We start with two syntactic probes: the dependency probe and the constituency probe.

Dependency Probe
With the goal of exploring the extent to which dependency relations are captured in BERT, we set out to answer the following question: Can BERT outperform linguistically uninformed baselines in unsupervised dependency parsing? If so, to what extent?
We begin by using the token-level perturbed masking technique to extract an impact matrix F for each sentence. We then utilize graph-based algorithms to induce a dependency tree from F and compare it against ground-truth trees whose annotations are linguistically motivated.
Experiment Setup. We evaluate the induced trees on two benchmarks: (1) the PUD treebank described in Section 3; (2) the WSJ10 treebank, which contains 7,422 sentences (all less than 10 words after punctuation removal) from the Penn Treebank (PTB) (Marcus et al., 1993). Note that the original PTB does not contain dependency annotations. We therefore convert them into Universal Dependencies using Stanford CoreNLP and denote this set as WSJ10-U.
Next, two parsing algorithms, namely the Eisner algorithm (Eisner, 1996) and the Chu-Liu/Edmonds (CLE) algorithm (Chu and Liu, 1965; Edmonds, 1967), are utilized to extract projective and non-projective unlabeled dependency trees, respectively. Given that our impact matrices carry no knowledge about the dependency root of the sentence, we use the gold root in our analysis. Introducing the gold root may artificially improve our results slightly. We therefore apply this bias evenly across all baselines to ensure a fair comparison, as done in (Raganato and Tiedemann, 2018; Htut et al., 2019).
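As an illustration of the non-projective case, a tree can be extracted with networkx's Chu-Liu/Edmonds implementation roughly as follows. Weighting a candidate head-to-dependent edge by F[dep, head] (the impact of the head candidate on the dependent) and fixing the gold root by deleting its incoming edges are assumptions of this sketch, not necessarily our exact setup; cle_tree is an illustrative name.

import networkx as nx

def cle_tree(F, root):
    """Sketch: non-projective unlabeled dependency tree from an impact
    matrix F via a maximum spanning arborescence (Chu-Liu/Edmonds)."""
    T = F.shape[0]
    G = nx.DiGraph()
    for head in range(T):
        for dep in range(T):
            if head != dep and dep != root:      # no edges into the gold root
                G.add_edge(head, dep, weight=float(F[dep, head]))
    tree = nx.maximum_spanning_arborescence(G, attr="weight")
    return sorted(tree.edges())                  # (head, dependent) pairs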
We compare our approach against the following baselines: (1) a right- (left-) chain baseline, which always selects the next (previous) word as the dependency head; (2) a random BERT baseline, for which we randomly initialize the weights of the BERT model (Htut et al., 2019) and then use our method to induce dependency trees.
We measure model performance using Unlabeled Attachment Score (UAS). We note that UAS has been shown to be highly sensitive to annotation variations (Schwartz et al., 2011;Tsarfaty et al., 2011;Kübler et al., 2009). Therefore, it may not be a fair evaluation metric for analyzing and interpreting BERT. To reflect the real quality of the dependency structures that are retained in BERT, we also report Undirected UAS (UUAS) (Klein and Manning, 2004) and the Neutral Edge Direction (NED) scores (Schwartz et al., 2011).
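For reference, UAS and UUAS can be computed along these lines (our own sketch; heads are 0-indexed word positions and the root's head is taken to be -1, a convention we adopt here only for illustration; NED is more involved and omitted):

def uas(pred_heads, gold_heads):
    """Unlabeled Attachment Score: fraction of words whose predicted head
    matches the gold head."""
    correct = sum(p == g for p, g in zip(pred_heads, gold_heads))
    return correct / len(gold_heads)

def uuas(pred_heads, gold_heads):
    """Undirected UAS: a predicted edge counts if it connects the same
    pair of words as some gold edge, regardless of direction."""
    gold = {frozenset((d, h)) for d, h in enumerate(gold_heads) if h >= 0}
    pred = {frozenset((d, h)) for d, h in enumerate(pred_heads) if h >= 0}
    return len(gold & pred) / len(gold)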
Table 1: Parsing UAS of different models on WSJ10-U and PUD.
Results. Tables 1 and 2 show the results of our dependency probes. From Table 1, we see that although BERT is trained without any explicit supervision from syntactic dependencies, to some extent the syntax-aware representation already exists in it. The best UAS scores it achieves (Eisner+Dist) are substantially higher than those of the random BERT baseline on both WSJ10-U (+41.7) and PUD (+31.5). Moreover, the Dist method significantly outperforms the Prob method on both datasets we evaluated. We thus use Dist as the default distance function in our later discussion. We also note that the Eisner algorithm shows a clear advantage over CLE since English sentences are mostly projective. However, our best-performing method does not go much beyond the strong right-chain baseline (with gold root modified), showing that the dependency relations learned are mostly simple and local ones. For reference, the well-known unsupervised parser DMV (Klein and Manning, 2004) achieves a 43.2 UAS on WSJ10 with Collins (1999) conventions. Note that the DMV parser utilizes POS tags for training while ours starts with the gold root. The results are therefore not directly comparable. Putting them together, however, we see potential room for improvement for current neural unsupervised dependency parsing systems in the BERT era.
From Table 2, we see that although BERT only outperforms the right-chain baseline modestly in terms of UAS, it shows significant improvements on UUAS (+12.2) and NED (+28.4). We make a similar observation on WSJ10-U. This suggests that BERT does capture inter-word dependencies even though it may not totally agree with one specific human-designed governor-dependent schema. We manually inspect those discrepancies and observe that they can also be syntactically valid. For instance, consider the sentence "It closed on Sunday." For the phrase "on Sunday", our method selects the functional word "on" as the head, while the gold-standard annotation uses a lexical head ("Sunday").
The above findings prove that BERT has learned its own syntax as a by-product of self-supervised training, rather than by directly copying any human design. However, given the superior performance of BERT on downstream tasks, it is natural to ask whether BERT is learning an empirically useful structure of language. We investigate this question in Sec 6.

Constituency Probe
We now examine the extent to which BERT learns about the constituent structure of sentences. We first present our algorithm for unsupervised constituent parsing, which executes in a top-down manner by recursively splitting larger constituents into smaller ones.
Top-Down Parsing. Given a sentence as a sequence of tokens x = [x_1, . . . , x_T] and the corresponding impact matrix F, we start by finding the best splitting position k that separates the sentence into constituents ((x_<k), (x_k, (x_>k))), where x_<k = [x_1, . . . , x_{k−1}]. The best splitting position ensures that each constituent has a large average impact between words within it (so those words are more likely to form a constituent), while the impact between words of different constituents is kept as small as possible (so they are unlikely to be in the same constituent). Concretely, for a constituent x = [x_i, x_{i+1}, . . . , x_j], we choose the k that maximizes the average intra-constituent impact of (x_<k) and (x_>k) while minimizing the average inter-constituent impact between them. We then recursively split (x_<k) and (x_>k) until only single words remain; a code sketch of this procedure is given at the end of this subsection. Note that this top-down strategy is similar to that of ON-LSTM (Shen et al., 2019) and PRPN (Shen et al., 2018), but differs from them in that ON-LSTM and PRPN decide the splitting position based on a "syntactic distance vector" that is explicitly modeled by a special network component. To distinguish our approach from the others, we denote our parser as MART (MAtRix-based Top-down parser).
Experiment Setup. We follow the experiment settings of Shen et al. (2018, 2019) and evaluate our method on the 7,422 sentences of the WSJ10 dataset and on PTB23 (the traditional PTB test set for constituency parsing).
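The top-down splitting procedure announced above can be sketched as follows; split_score is a simplified stand-in for the full objective (reward the average within-constituent impact on each side of the split, penalize the average cross-constituent impact), and the function names are illustrative.

import numpy as np

def split_score(F, i, k, j):
    # Simplified objective: high impact inside x[i:k] and x[k:j], low impact across.
    left = F[i:k, i:k].mean()
    right = F[k:j, k:j].mean()
    cross = (F[i:k, k:j].mean() + F[k:j, i:k].mean()) / 2
    return left + right - cross

def mart_parse(F, i=0, j=None):
    """Recursively produce the bracketing ((x_<k), (x_k, (x_>k)))."""
    if j is None:
        j = F.shape[0]
    if j - i == 1:
        return i                                   # a single word
    k = max(range(i + 1, j), key=lambda k: split_score(F, i, k, j))
    left = mart_parse(F, i, k)                     # (x_<k)
    if k == j - 1:
        return (left, k)                           # no (x_>k) part
    return (left, (k, mart_parse(F, k + 1, j)))    # ((x_<k), (x_k, (x_>k)))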

Results. Table 3 shows the results of our constituency probes. From the table, we see that BERT outperforms most baselines on PTB23, except for the second layer of ON-LSTM. Note that all these baselines have specifically designed architectures for the unsupervised parsing task, while BERT's knowledge about the constituent formalism emerges purely from self-supervised training on unlabeled text.
It is also worth noting that recent results (Dyer et al., 2019; Li et al., 2019a) suggest that the parsing algorithm used by ON-LSTM (PRPN) is biased towards the right-branching trees of English, leading to inflated F1 compared to unbiased parsers. To ensure a fair comparison with them, we also introduce this right-branching bias. However, our results show that our method is also robust without this bias (e.g., only a 0.9 F1 drop on PTB23).
To further understand the strengths and weaknesses of each system, we analyze their accuracies by constituent tags. In Table 3, we show the accuracies of the five most common tags in PTB23. We find that the success of PRPN and ON-LSTM mainly comes from the accurate identification of NP (noun phrase), which accounts for 38.5% of all constituents. For other phrase-level tags like VP (verb phrase) and PP (prepositional phrase), the accuracies of BERT are competitive. Moreover, for clause-level tags, BERT significantly outperforms ON-LSTM. Take SBAR (clause introduced by a subordinating conjunction) as an example: BERT achieves an accuracy of 51.9%, which is about 3.4 times higher than that of ON-LSTM. One possible interpretation is that BERT is pre-trained on long contiguous sequences extracted from a document-level corpus, and the masking strategy (randomly masking 15% of tokens) may encourage BERT to model long spans of words, which might form a clause.

Discourse Probe
Having shown that clause-level structures are well captured in BERT using the constituency probe, we now explore a more challenging probe: probing BERT's knowledge about the structure of a document. A document is commonly modeled as a sequence of elementary discourse units (EDUs) (Yang and Li, 2018; Polanyi, 1988). EDUs are connected to each other by discourse relations to form a document. We devise a discourse probe to investigate how well BERT captures structural correlations between EDUs. As the foundation of the probe, we extract an EDU-EDU impact matrix for each document using span-level perturbation.
Setup. We evaluate our probe on the discourse dependency corpus SciDTB (Yang and Li, 2018). We do not use the popular discourse corpora RST-DT (Carlson et al., 2003) and PDTB (Prasad et al.) because PDTB focuses on local discourse relations but ignores the whole document structure, while RST-DT introduces intermediate nodes and does not cover non-projective structures. We follow the same baseline settings and evaluation procedure as in Sec 4.1, except that we remove the gold root from our evaluation since we want to compare the accuracy by syntactic distances.
Results. Table 4 shows the performance of our discourse probes. We find that both Eisner and CLE achieve significantly higher UAS (+28) than the random BERT baseline. This suggests that BERT is aware of the structure of the document it is given. In particular, we observe a decent accuracy in identifying discourse relations between adjacent EDUs, perhaps due to the "next sentence prediction" task in pre-training, as pointed out by Shi and Demberg (2019). However, our probes fall behind the left-chain baseline, which benefits from its strong structural prior (the principal clause mostly precedes its subordinate clause). Our finding sheds some light on BERT's success in downstream tasks that take paragraphs as input (e.g., Question Answering).

BERT-based Trees vs. Parser-provided Trees
Our probing results suggest that although BERT has captured a certain amount of syntax, there are still substantial disagreements between the syntax BERT learns and the syntax designed by linguists. For instance, our constituency probe on PTB23 significantly outperforms most baselines, but it only roughly agrees with the PTB formalism (41.2% F1). However, BERT has already demonstrated its superiority on many downstream tasks. An interesting question is thus whether BERT is learning an empirically useful, or even better, structure of language.
To answer this question, we turn to neural networks that use dependency parse trees as an explicit structural prior to improve downstream tasks. We replace the ground-truth dependency trees that those networks use with trees induced from BERT and approximate the effectiveness of the different trees by the improvements they bring.
We conduct experiments on the Aspect Based Sentiment Classification (ABSC) task (Pontiki et al., 2014). ABSC is a fine-grained sentiment classification task aiming at identifying the sentiment expressed towards each aspect of a given target entity. As an example, in the following comment about a restaurant, "I hated their fajitas, but their salads were great", the sentiment polarity for the aspect fajitas is negative while that for salads is positive. It has been shown in Zhang et al. (2019) that injecting syntactic knowledge into neural networks can improve ABSC accuracy. Intuitively, given an aspect, a syntactically closer context word should play a more important role in predicting that aspect's sentiment. They integrate the distances between context words and the aspect on a dependency tree into a convolutional network and build a Proximity-Weighted Convolution Network (PWCN). As a naive baseline, they compare against a network weighted by the relative position between the aspect and context words.
Setup. We experimented on two datasets from SemEval 2014 (Pontiki et al., 2014), which consist of reviews and comments from two categories: LAPTOP and RESTAURANT. We adopt the standard evaluation metrics: Accuracy and Macro-Averaged F1. We follow the instructions of Zhang et al. (2019) to run the experiments 5 times with random initialization and report the averaged performance. We denote the original PWCN with relative-position information as PWCN-Pos, and the variant that utilizes dependency trees constructed by SpaCy as PWCN-Dep. SpaCy reports a UAS of 94.5 on the English PTB, so it can serve as a good reference for a human-designed dependency schema. We also compare our model against two trivial trees (left-chain and right-chain trees). For our model, we feed the corpus into BERT and extract dependency trees with the best-performing setting: Eisner+Dist. For parsing, we introduce an inductive bias to favor short dependencies (Eisner and Smith, 2010). To ensure a fair comparison, we induce the root word from the impact matrix F instead of using the gold root. Specifically, we select the root word x_k with k = arg max_i Σ_{j=1}^{T} f(x_i, x_j).
Results. Table 5 presents the performance of the different models. We observe that the trees induced from BERT are either on par (LAPTOP) or marginally better (RESTAURANT) in terms of downstream task performance compared with the trees produced by SpaCy. LAPTOP is considerably more difficult than RESTAURANT because its sentences are generally longer, which makes inducing dependency trees more challenging. We also see that the Eisner trees generally perform better than the right-/left-chain baselines. It is also worth noting that the right-chain baseline also outperforms PWCN-Dep on RESTAURANT, which suggests an exciting direction for future work: investigating how encoding structural knowledge can help ABSC.
Our results suggest that although the tree structures BERT learns can disagree with parser-provided, linguistically motivated ones to a large extent, they are also empirically useful for downstream tasks, at least for ABSC. As future work, we plan to extend our analysis to more downstream tasks and models, like those reported in Shi (2018).

Related Work
There has been substantial research investigating what pre-trained language models have learned about the structure of language.
One rising line of research uses probing classifiers to investigate the different syntactic properties captured by the model. These are generally referred to as "probing tasks" (Conneau et al., 2018), "diagnostic classifiers" (Giulianelli et al., 2018), or "auxiliary prediction tasks" (Adi et al., 2017). The syntactic properties investigated range from basic ones like sentence length (Shi et al., 2016; Jawahar et al., 2019), syntactic tree depth (Jawahar et al., 2019), and segmentation (Liu et al., 2019) to challenging ones like syntactic labeling (Tenney et al., 2019a,b), dependency parsing (Hewitt and Manning, 2019; Clark et al., 2019), and constituency parsing (Peters et al., 2018a). However, when a probe achieves high accuracy, it is difficult to tell whether it is the representation that encodes the targeted syntactic information, or the probe that just learns the task (Hewitt and Liang, 2019).
In line with our work, recent studies seek to find correspondences between parts of the neural network and certain linguistic properties, without explicit supervision.
Most of them focus on analyzing the attention mechanism, extracting a syntactic tree for each attention head and layer individually (Raganato and Tiedemann, 2018; Clark et al., 2019). Their goal is to check whether the attention heads of a given pre-trained model can track syntactic relations better than chance or baselines. In particular, Raganato and Tiedemann (2018) analyze a machine translation model's encoder by extracting dependency trees from its self-attention weights, using the Chu-Liu/Edmonds algorithm. Clark et al. (2019) conduct a similar investigation on BERT, but the simple head-selection strategy they use does not guarantee a valid dependency tree. Mareček and Rosa (2018) propose heuristic methods to convert attention weights to syntactic trees; however, they do not quantitatively evaluate their approach. In a later study, they propose a bottom-up algorithm to extract constituent trees from transformer-based NMT encoders and evaluate their results on three languages. Htut et al. (2019) reassess these works but find that there are no generalist heads that can do holistic parsing. Hence, analyzing attention weights directly may not reveal much of the syntactic knowledge that a model has learned. The recent dispute about attention as explanation (Jain and Wallace, 2019; Serrano and Smith, 2019; Wiegreffe and Pinter, 2019) also suggests that the attention's behavior does not necessarily reflect that of the original model.
Another line of research examines the outputs of language models on carefully chosen input sentences (Goldberg, 2019; Bacon and Regier, 2019). They extend previous work (Linzen et al., 2016; Gulordava et al., 2018; Marvin and Linzen, 2018) on subject-verb agreement tests (predicting the correct number of a verb far away from its subject) to provide a measure of the model's syntactic ability. Their results show that the BERT model captures syntax-sensitive agreement patterns well in general. However, subject-verb agreement cannot provide more nuanced tests of other complex structures (e.g., dependency structure, constituency structure), which are the focus of our work.
Two recent works also perturb the input sequence for model interpretability. However, these works only perturb the sequence once. The first utilizes the original MLM objective to estimate each word's "reducibility" and imports simple heuristics into a right-chain baseline to construct dependency trees. The second, Li et al. (2019b), focuses on evaluating word alignment in NMT, but unlike our two-stage masking strategy, they only replace the token of interest with a zero embedding or a randomly sampled word from the vocabulary.

Discussion & Conclusion
One concern shared by our reviewers is that the performance of our probes is underwhelming: the induced trees are barely closer to linguist-defined trees than simple baselines (e.g., right-branching) and are even worse in the case of discourse parsing. However, this does not mean that supervised probes are wrong or that BERT captures less syntax than we thought. In fact, there is no guarantee that our probe will find a strong correlation with human-designed syntax, since we do not introduce the human-designed syntax as supervision. What we find is the "natural" syntax inherent in BERT, which is acquired through self-supervised learning on plain text. We would rather say our probe complements the supervised probing findings in two ways. First, it provides a lower bound (on the unsupervised syntactic parsing ability of BERT). By improving this lower bound, we could uncover more "accurate" information to support supervised probes' findings. Second, we show that when combined with a downstream application (Sec 6), the syntax learned by BERT might be empirically helpful even though it is not identical to the human design.
In summary, we propose a parameter-free probing technique to complement the current line of work on interpreting BERT through probes. With a carefully designed two-stage perturbation, we obtain impact matrices from BERT. This matrix mirrors the function of the attention mechanism that captures inter-word correlations, except that it emerges from the output of the BERT model instead of from intermediate representations. We devise algorithms to extract syntactic trees from this matrix. Our results reinforce those of Hewitt and Manning (2019), Liu et al. (2019), Jawahar et al. (2019), and Tenney et al. (2019b,a), who demonstrated that BERT encodes rich syntactic properties. We also extend our method to probe document structure, which sheds light on BERT's effectiveness in modeling long sequences. Finally, we find that feeding the empirically induced dependency structures into a downstream system (Zhang et al., 2019) can further improve its accuracy. The improvement is comparable to or even better than that of a human-designed dependency schema. This offers an insight into BERT's success in downstream tasks. We leave it for future work to use our technique to test other linguistic properties (e.g., coreference) and to extend our study to more downstream tasks and systems.