What Does BERT Learn about the Structure of Language?

BERT is a recent language representation model that has performed surprisingly well across diverse language understanding benchmarks. This result suggests that BERT networks capture structural information about language. In this work, we provide novel support for this claim by performing a series of experiments to unpack the elements of English language structure learned by BERT. Our findings are fourfold. First, BERT's phrasal representations capture phrase-level information mostly in the lower layers. Second, the intermediate layers of BERT compose a rich hierarchy of linguistic information, with surface features at the bottom, syntactic features in the middle and semantic features at the top. Third, BERT requires deeper layers to track subject-verb agreement in harder cases involving long-distance dependencies. Finally, the compositional scheme underlying BERT mimics classical, tree-like structures.


Introduction
BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018) is a bidirectional variant of Transformer networks (Vaswani et al., 2017) trained to jointly predict a masked word from its context and to classify whether two sentences are consecutive or not. The trained model can be fine-tuned for downstream NLP tasks such as question answering and language inference without substantial modification. BERT outperforms previous state-of-the-art models on the eleven NLP tasks of the GLUE benchmark (Wang et al., 2018) by a significant margin. This remarkable result suggests that BERT could "learn" structural information about language.
Can we relate the representations learned by BERT to linguistic structures? Answering this question could not only help us understand the reasons behind the success of BERT but also reveal its limitations, in turn guiding the design of improved architectures. This question falls under the topic of the interpretability of neural networks, a growing field in NLP (Belinkov and Glass, 2019). An important step forward in this direction is Goldberg (2019), which shows that BERT captures syntactic phenomena well when evaluated on its ability to track subject-verb agreement.
In this work, we perform a series of experiments to probe the nature of the representations learned by different layers of BERT. We first show that the lower layers capture phrase-level information, which gets diluted in the upper layers. Second, we use the probing tasks of Conneau et al. (2018) to show that BERT captures a rich hierarchy of linguistic information, with surface features in lower layers, syntactic features in middle layers and semantic features in higher layers. Third, we test the ability of BERT representations to track subject-verb agreement and find that BERT requires deeper layers for handling harder cases involving long-distance dependencies. Finally, we use the recently introduced Tensor Product Decomposition Network (TPDN; McCoy et al., 2019) to explore different hypotheses about the compositional nature of BERT's representations and find that BERT implicitly captures classical, tree-like structures.

BERT
BERT (Devlin et al., 2018) builds on Transformer networks (Vaswani et al., 2017) to pre-train bidirectional representations by conditioning on both left and right contexts jointly in all layers. The representations are jointly optimized by predicting randomly masked words in the input and classifying whether the sentence follows a given sentence in the corpus or not. The authors of BERT claim that bidirectionality allows the model to adapt swiftly to a downstream task with little modification to the architecture. Indeed, BERT improved the state of the art for a range of NLP benchmarks (Wang et al., 2018) by a significant margin.

In this work, we investigate the linguistic structure implicitly learned by BERT's representations. We use the PyTorch implementation of BERT, which hosts the models trained by Devlin et al. (2018). All our experiments are based on the bert-base-uncased variant, which consists of 12 layers, each with a hidden size of 768 and 12 attention heads (110M parameters). Unless otherwise stated, we compute the BERT representation at every layer from the activation of the first input token ('[CLS]'), which summarizes the information from the actual tokens via self-attention.


Phrasal Syntax

Peters et al. (2018) have shown that the representations underlying LSTM-based language models (Hochreiter and Schmidhuber, 1997) can capture phrase-level (or span-level) information. It remains unclear whether this holds for models such as BERT that are not trained with a traditional language modeling objective, and, if it does, whether the information is present in multiple layers of the model. To investigate this question, we extract span representations from each layer of BERT. Following Peters et al. (2018), for a token sequence s_i, ..., s_j, we compute the span representation s_(s_i, s_j),l at layer l by concatenating the first hidden vector (h_{s_i,l}) and the last hidden vector (h_{s_j,l}), along with their element-wise product and difference. We randomly pick 3000 labeled chunks and 500 spans not labeled as chunks from the CoNLL 2000 chunking dataset (Sang and Buchholz, 2000).
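The span representation can be sketched in a few lines of NumPy; here random vectors stand in for hidden states extracted from a BERT layer:

```python
import numpy as np

def span_representation(hidden_states, i, j):
    """Span representation of Peters et al. (2018): concatenate the first
    and last hidden vectors of the span with their element-wise product
    and difference."""
    h_i, h_j = hidden_states[i], hidden_states[j]
    return np.concatenate([h_i, h_j, h_i * h_j, h_i - h_j])

# Toy stand-in for one BERT layer's hidden states: 5 tokens, hidden size 768.
layer_states = np.random.randn(5, 768)
span = span_representation(layer_states, 1, 3)  # span covering tokens 1..3
print(span.shape)  # (3072,) -- four concatenated 768-d vectors
```

The resulting vector has four times the hidden size, one slot each for the boundary vectors, their product and their difference.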
As shown in Figure 1, we visualize the span representations obtained from multiple layers using t-SNE (Maaten and Hinton, 2008), a non-linear dimensionality reduction algorithm for visualizing high-dimensional data. We observe that BERT mostly captures phrase-level information in the lower layers and that this information gets gradually diluted in the higher layers. The span representations from the lower layers map chunks of the same underlying category (e.g. the VP 'to demonstrate') close together. We further quantify this claim by performing k-means clustering on the span representations with k = 10, i.e. the number of distinct chunk types. Evaluating the resulting clusters with the Normalized Mutual Information (NMI) metric again shows that the lower layers encode phrasal information better than the higher layers (cf. Table 1).
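The clustering evaluation can be sketched with scikit-learn as follows; the span vectors and chunk labels here are random stand-ins for the real data, so the resulting NMI is low rather than matching Table 1:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.RandomState(0)
spans = rng.randn(300, 64)            # stand-in span representations
gold = rng.randint(0, 10, size=300)   # stand-in gold chunk-type labels

# k = 10, the number of distinct chunk types in CoNLL 2000.
pred = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(spans)

# NMI compares the induced clusters against the gold chunk labels.
nmi = normalized_mutual_info_score(gold, pred)
print(round(nmi, 3))
```

Running this per layer, with real span representations in place of `spans`, yields the layer-wise NMI comparison reported in Table 1.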

Probing Tasks
Probing (or diagnostic) tasks (Adi et al., 2017; Hupkes et al., 2018) help in unearthing the linguistic features possibly encoded in neural models. This is achieved by setting up an auxiliary classification task where the final output of a model is used as features to predict a linguistic phenomenon of interest. If the auxiliary classifier can predict a linguistic property well, then the original model likely encodes that property. In this work, we use probing tasks to assess individual model layers in their ability to encode different types of linguistic features. We evaluate each layer of BERT using the ten sentence-level probing datasets/tasks created by Conneau et al. (2018), which are grouped into three categories. Surface tasks probe for sentence length (SentLen) and for the presence of words in the sentence (WC). Syntactic tasks test for sensitivity to word order (BShift), the depth of the syntactic tree (TreeDepth) and the sequence of top-level constituents in the syntax tree (TopConst). Semantic tasks check for the tense (Tense), the subject (resp. direct object) number in the main clause (SubjNum, resp. ObjNum), the sensitivity to random replacement of a noun/verb (SOMO) and the random swapping of coordinated clausal conjuncts (CoordInv). We use the SentEval toolkit (Conneau and Kiela, 2018) along with the recommended hyperparameter space to search for the best probing classifier. As random encoders can surprisingly encode a lot of lexical and structural information (Zhang and Bowman, 2018), we also evaluate an untrained version of BERT, obtained by setting all model weights to random values.

Table 3: Subject-verb agreement scores for each BERT layer. The last five columns correspond to the number of nouns intervening between the subject and the verb (attractors) in the test instances. The average distance between the subject and the verb is given in parentheses next to each attractor category.
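The probing setup can be sketched as a simple classifier over layer activations; below is a scikit-learn logistic-regression probe on synthetic stand-in data (the actual experiments search SentEval's recommended hyperparameter space):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randn(600, 64)            # stand-in for per-layer [CLS] activations
y = (X[:, 0] > 0).astype(int)     # stand-in for a binary probing label (e.g. BShift)

# If a simple classifier predicts the property from the activations alone,
# the layer likely encodes that property.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
accuracy = probe.score(X_te, y_te)
print(accuracy)
```

Repeating this for every layer and every task produces a layer-by-task accuracy grid like Table 2.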
Table 2 shows that BERT embeds a rich hierarchy of linguistic signals: surface information at the bottom, syntactic information in the middle and semantic information at the top. BERT also surpasses the previously published results on two tasks: BShift and CoordInv. We find that on the task of predicting sentence length (SentLen), the higher layers of the untrained version of BERT outperform those of the trained version. This could indicate that untrained models contain sufficient information to predict a basic surface feature such as sentence length, whereas training makes the model store more complex information at the expense of its ability to predict such basic surface features.

Subject-Verb Agreement
Subject-verb agreement is a proxy task for probing whether a neural model encodes syntactic structure (Linzen et al., 2016). Predicting the verb number becomes harder as more nouns with the opposite number (attractors) intervene between the subject and the verb. Goldberg (2019) has shown that BERT learns such syntactic phenomena surprisingly well using various stimuli for subject-verb agreement. We extend this work by performing the test on each layer of BERT and controlling for the number of attractors. In our study, we use the stimuli created by Linzen et al. (2016) and the SentEval toolkit (Conneau and Kiela, 2018) to build the binary classifier with the recommended hyperparameter space, using as features the activations from the (masked) verb at hand.

Figure 2: Dependency parse tree induced from attention head #11 in layer #2, using the gold root ('are') as the starting node for the maximum spanning tree algorithm.

Table 3 shows that the middle layers perform well in most cases, which supports our probing task results showing that syntactic features are captured well in the middle layers. Interestingly, as the number of attractors increases, a higher BERT layer (#8) handles the long-distance dependency problems caused by the longer sequence of words intervening between the subject and the verb better than a lower layer (#7). This highlights the need for BERT's deeper layers to perform competitively on NLP tasks.
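The notion of attractors can be made concrete with a small sketch that counts intervening nouns of opposite number from POS tags. This is a hypothetical helper for illustration; the Linzen et al. (2016) stimuli already provide attractor counts:

```python
def count_attractors(pos_tags, subj_idx, verb_idx, subj_number):
    """Count nouns between the subject and the verb whose grammatical
    number (NN = singular, NNS = plural) differs from the subject's."""
    attractors = 0
    for k in range(subj_idx + 1, verb_idx):
        if pos_tags[k] in ("NN", "NNS"):
            number = "plural" if pos_tags[k] == "NNS" else "singular"
            if number != subj_number:
                attractors += 1
    return attractors

# "The keys to the cabinet are on the table": "cabinet" is one attractor
# between the plural subject "keys" and the verb "are".
tags = ["DT", "NNS", "IN", "DT", "NN", "VBP", "IN", "DT", "NN"]
print(count_attractors(tags, 1, 5, "plural"))  # 1
```

Grouping test instances by this count yields the attractor columns of Table 3.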

Compositional Structure
Can we understand the compositional nature of the representations learned by BERT, if any? To investigate this question, we use Tensor Product Decomposition Networks (TPDN; McCoy et al., 2019), which explicitly compose the input token ("filler") representations based on a pre-selected role scheme using a sum of tensor products. For instance, a word's role can be based on the path from the root node to itself in the syntax tree (e.g. 'LR' denotes the right child of the left child of the root). The authors assume that, for a given role scheme, if a TPDN can be trained to approximate the representation learned by a neural model well, then that role scheme likely characterizes the compositionality implicitly learned by the model. For each BERT layer, we work with five different role schemes: each word's role is computed from its left-to-right index, its right-to-left index, an ordered pair of its left-to-right and right-to-left indices, its position in a syntactic tree (produced by the Stanford PCFG Parser (Klein and Manning, 2003), with unary nodes and labels removed) or an index common to all the words in the sentence (bag-of-words), which ignores position. Additionally, we also define a role scheme based on random binary trees.
Following McCoy et al. (2019), we train our TPDN model on the premise sentences in the SNLI corpus (Bowman et al., 2015). We initialize the filler embeddings of the TPDN with the pre-trained word embeddings from BERT's input layer, freeze them, learn a linear projection on top and use a Mean Squared Error (MSE) loss function. The other trainable parameters are the role embeddings and a linear projection on top of the tensor product sum to match the embedding size of BERT. Table 4 displays the MSE between the representations of pretrained BERT and those of a TPDN trained to approximate BERT. We find that BERT implicitly implements a tree-based scheme, as a TPDN model following that scheme best approximates BERT's representation at most layers. This result is remarkable: BERT encodes classical, tree-like structures despite relying purely on attention mechanisms.
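The sum of tensor products at the heart of the TPDN can be sketched as follows (toy dimensions; in our setup the fillers are BERT's input embeddings and a trainable linear projection maps the sum to BERT's hidden size):

```python
import numpy as np

def tensor_product_sum(fillers, roles):
    """Compose a sequence as the sum over tokens of the outer product
    of each filler (word) vector with its role vector."""
    return sum(np.outer(f, r) for f, r in zip(fillers, roles))

rng = np.random.RandomState(0)
fillers = [rng.randn(4) for _ in range(3)]  # toy word embeddings, dim 4

# Left-to-right role scheme: a one-hot role vector per position.
roles = [np.eye(3)[i] for i in range(3)]

rep = tensor_product_sum(fillers, roles)    # shape (4, 3)
# With one-hot roles, column i of `rep` is exactly fillers[i]; richer role
# schemes (e.g. tree positions) instead entangle the fillers.
```

Swapping in a different role scheme only changes how `roles` is built, which is what lets the TPDN test each compositional hypothesis against BERT's representations.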
Motivated by this study, we perform a case study on dependency trees induced from self-attention weights, following Raganato and Tiedemann (2018). Figure 2 displays the dependencies inferred from an example sentence by obtaining the self-attention weights for every word pair from attention head #11 in layer #2, fixing the gold root as the starting node and invoking the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965). We observe that the determiner-noun dependencies ("the keys", "the cabinet" and "the table") and the subject-verb dependency ("keys" and "are") are captured accurately. Surprisingly, the predicate-argument structure seems to be partly modeled, as shown by the chain of dependencies between "keys", "cabinet" and "table".


Related Work

Peters et al. (2018) study how the choice of neural architecture (CNNs, Transformers or RNNs) used for language model pretraining affects downstream task accuracy and the qualitative properties of the learned contextualized word representations. They conclude that all architectures learn high-quality representations that outperform standard word embeddings such as GloVe (Pennington et al., 2014) on challenging NLP tasks. They also show that these architectures structure linguistic information hierarchically, such that morphological, (local) syntactic and (longer-range) semantic information tends to be represented in, respectively, the word embedding layer, lower contextual layers and upper layers. In our work, we observe that such a hierarchy also exists for BERT, which is not trained with a standard language modeling objective. Goldberg (2019) shows that BERT captures syntactic information well for subject-verb agreement. We build on this work by performing the test on each layer of BERT while controlling for the number of attractors, and show that BERT requires deeper layers to handle harder cases involving long-distance dependencies.
Tenney et al. (2019) is a contemporaneous work that introduces a novel edge probing task to investigate how contextual word representations encode sentence structure across a range of syntactic, semantic, local and long-range phenomena. They conclude that contextual word representations trained on language modeling and machine translation encode syntactic phenomena strongly, but offer comparably small improvements on semantic tasks over a non-contextual baseline. Their results on BERT's capture of linguistic hierarchy confirm our probing task results, although we use a set of relatively simple probing tasks. Liu et al. (2019) is another contemporaneous work that studies the features of language captured or missed by contextualized vectors, transferability across the layers of a model and the impact of pretraining on linguistic knowledge and transferability. They find that (i) contextualized word embeddings do not capture fine-grained linguistic knowledge, (ii) the higher layers of RNNs are task-specific (with no such pattern for Transformers) and (iii) pretraining on a closely related task yields better performance than language model pretraining. Hewitt and Manning (2019) is a very recent work showing that parse trees can be consistently recovered from a linear transformation of contextual word representations, better than with non-contextual baselines. They focus mainly on syntactic structure, while our work additionally experiments with linear structures (left-to-right, right-to-left) to show that the compositional scheme underlying BERT mimics traditional syntactic analysis.
The recent burst of papers around these questions illustrates the importance of interpreting contextualized word embedding models, and our work complements this growing literature with additional evidence of BERT's ability to learn syntactic structure.

Conclusion
With our experiments, which contribute to a currently bubbling line of work on neural network interpretability, we have shown that BERT does capture structural properties of the English language. Our results therefore confirm those of Goldberg (2019) and related work demonstrating that representations constructed from these models can encode rich syntactic phenomena. We have shown that the phrasal representations learned by BERT reflect phrase-level information and that BERT composes a hierarchy of linguistic signals ranging from surface to semantic features. We have also shown that BERT requires deeper layers to model long-range dependency information. Finally, we have shown that BERT's internal representations reflect a compositional modelling that shares parallels with traditional syntactic analysis. It would be interesting to see whether our results transfer to other domains with higher variability in syntactic structures (such as noisy user-generated content) and with higher word-order flexibility, as found in some morphologically rich languages.