Towards Interpreting BERT for Reading Comprehension Based QA

BERT and its variants have achieved state-of-the-art performance in various NLP tasks. Since then, various works have been proposed to analyze the linguistic information captured in BERT. However, current works do not provide insight into how BERT is able to achieve near-human-level performance on the task of Reading Comprehension based Question Answering (RCQA). In this work, we attempt to interpret BERT for RCQA. Since BERT layers do not have predefined roles, we define a layer's role or functionality using Integrated Gradients. Based on the defined roles, we perform a preliminary analysis across all layers. We observe that the initial layers focus on query-passage interaction, whereas the later layers focus more on contextual understanding and enhancing the answer prediction. Specifically for quantifier questions (how much/how many), we notice that BERT focuses on confusing words (i.e., on other numerical quantities in the passage) in the later layers, but still manages to predict the answer correctly. The fine-tuning and analysis scripts will be publicly available at https://github.com/iitmnlp/BERT-Analysis-RCQA .


Introduction
The past decade has witnessed a surge in the development of deep neural network models to solve NLP tasks. Pretrained language models such as ELMo (Peters et al., 2018a), BERT (Devlin et al., 2018), XLNet (Yang et al., 2019), etc. have achieved state-of-the-art results on various NLP tasks. This success motivated various studies to understand how BERT achieves human-level performance on these tasks. Tenney et al. (2019); Peters et al. (2018b) analyze syntactic and semantic roles played by different layers in such models. Clark et al. (2019) specifically analyze BERT's attention heads for syntactic and linguistic phenomena. Most of these works focus on tasks such as sentiment classification, syntactic/semantic tag prediction, natural language inference, and so on. However, to the best of our knowledge, BERT has not been thoroughly analyzed for complex tasks like RCQA. It is a challenging task because of 1) the large number of parameters and non-linearities in BERT, and 2) the absence of pre-defined roles across layers in BERT, as compared to pre-BERT models like BiDAF (Seo et al., 2016) or DCN (Xiong et al., 2016). In this work, we take a first step towards identifying each layer's role using the attribution method of Integrated Gradients (Sundararajan et al., 2017). We then try to map these roles to the following functions, deemed necessary in pre-BERT models to reach the answer: (i) learn contextual representations for the passage and the question individually, (ii) attend to information in the passage specific to the question, and (iii) predict the answer.
We perform analysis on the SQuAD (Rajpurkar et al., 2016) and DuoRC (Saha et al., 2018) datasets. We observe that the initial layers primarily focus on question words that are present in the passage. In the later layers, the focus on question words decreases, and more focus is on the supporting words that surround the answer and the predicted answer span. Further, through a focused analysis of quantifier questions (questions that require a numerical entity as the answer), we observe that BERT pays importance to many words similar to the answer (same type, such as numbers) in later layers. We find this intriguing since, even after marking confusing words spread across passage as important, BERT's prediction accuracy is high. We also provide qualitative analysis to demonstrate the above trends.

Related Work
In the past few years, various large-scale datasets have been proposed for the RCQA task (Nguyen et al., 2016; Joshi et al., 2017; Rajpurkar et al., 2016; Saha et al., 2018), which have led to various deep neural network (NN) based architectures such as Seo et al. (2016); Dhingra et al. (2016). Additionally, with complex pretraining, models such as Liu et al. (2019); Lan et al. (2019); Devlin et al. (2018) are very close to human-level performance. Due to the large number of parameters and the non-linearity of deep NN models, the answer to the question "how did the model arrive at the prediction?" is not known; hence, they are termed blackbox models. Motivated by this question, there have been many works that analyze the interpretability of deep NN models on NLP tasks; many of them analyze models based on in-built attention mechanisms (Jain and Wallace, 2019; Serrano and Smith, 2019; Wiegreffe and Pinter, 2019). Further, various attribution methods, such as layer-wise relevance propagation (Bach et al., 2015), have been used to explain model predictions, and other works study adversarial attacks on QA models, pointing out that BERT is prone to be fooled by such attacks. Unlike these earlier works, we focus on analyzing BERT's layers specifically for RCQA to understand their QA-specific roles and their behavior on potentially confusing quantifier questions.

Experimental Setup
For our BERT analysis, we use the BERT-BASE model, which has 12 Transformer blocks (layers), each with multi-head self-attention and a feed-forward neural network. We use the official code and pre-trained checkpoints 1 and fine-tune for two epochs on the SQuAD and DuoRC datasets, achieving F1 scores of 88.73 and 54.80 on their respective dev-splits. We use SQuAD 1.1 (Rajpurkar et al., 2016), with 90k/10k train/dev samples and passages of 100-300 words, and the SelfRC dataset in DuoRC (Saha et al., 2018), with 60k/13k train/dev samples and passages of 500 words on average. For each passage, both datasets provide a natural language query and an answer that is a span in the passage itself.

Layer-wise Functionality
As discussed earlier, we aim to understand each BERT layer's functionality for the RCQA task; we want to identify the passage words that are of primary importance at each layer for the answer. Intuitively, the initial layers should focus on question words, and the latter should zoom in on contextual words that point to the answer. To analyze the above, we use the attribution method Integrated Gradients (Sundararajan et al., 2017) on BERT at a layerwise level.
For a given passage P consisting of n words [w_1, w_2, ..., w_n], a query Q, and a model f with parameters θ, answer prediction is modeled as:

(w_s, w_e) = f(P, Q; θ)

where w_s, w_e are the predicted answer start and end words (positions). For any given layer l, the above is equivalent to:

(w_s, w_e) = f_l(E_l(P), E_l(Q); θ_l)

where f_l is the forward propagation from layer l to the prediction, and E_l(·) is the representation learnt for passage or query words by layer l. That is, we treat the network below layer l as a blackbox that generates the input representations for layer l. The Integrated Gradients attribution for a model M and a passage word w_i, embedded as x_i ∈ R^L, is:

IG(x_i) = (x_i − x̄) ⊙ ∫_{α=0}^{1} ∂M(x̄ + α(x_i − x̄)) / ∂x_i dα

where x̄ is a zero vector that serves as the baseline against which the integrated gradient of w_i is measured. We calculate the integrated gradients IG_l(x_i) at each layer l for all passage words w_i using Algorithm 1, approximating the integral with 50 uniform samples of α ∈ [0, 1]. We then compute an importance score for each w_i by taking the Euclidean norm of IG_l(x_i) and normalizing over all passage words to obtain a probability distribution I_l.
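As a concrete illustration, the layer-wise IG approximation and the importance-score computation above can be sketched as follows. This is a minimal Python sketch: `grad_fn` and the toy quadratic model in the usage note are our own illustrative stand-ins for BERT's backward pass, not the paper's released code.

```python
import numpy as np

def integrated_gradients(x, baseline, grad_fn, steps=50):
    # Riemann approximation of
    #   IG(x) = (x - baseline) * integral_0^1 grad(baseline + a*(x - baseline)) da
    # using `steps` uniform samples of a in [0, 1] (midpoint rule).
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_fn(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

def importance_scores(ig_vectors):
    # Euclidean norm of each word's IG vector, normalized into a
    # probability distribution I_l over the passage words.
    norms = np.linalg.norm(ig_vectors, axis=1)
    return norms / norms.sum()
```

With a toy model M(x) = Σ x², whose gradient is 2x, and a zero baseline, the sketch recovers IG's completeness property: the attributions sum to M(x) − M(x̄).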

JSD with top-k retained/removed
We quantify and visualize a layer's function as its distribution of importance over the passage words, I_l. To compute the similarity between any two layers x, y, we measure the Jensen-Shannon Divergence (JSD) between their corresponding importance distributions I_x, I_y. We calculate the JSD scores between every pair of layers in the model and visualize them as an n_l × n_l heatmap (n_l is the number of layers in the model). A higher JSD score means the two layers are more different, i.e., they consider different words salient. We visualize heatmaps for the dev-splits of SQuAD (Figures 1a, 1b) and DuoRC (Figures 1c, 1d), averaging over 1000 samples in each case.

(Algorithm 1: computing layer-wise Integrated Gradients for layer l.)
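The pairwise-JSD heatmap computation can be sketched as follows (a minimal Python sketch; the function names are ours, and a real analysis would average the heatmap over dev-set samples as described above):

```python
import numpy as np

def jsd(p, q):
    # Jensen-Shannon divergence (base 2) between two discrete distributions.
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pairwise_jsd_heatmap(importances):
    # importances: one distribution I_l per layer, all over the same passage.
    n = len(importances)
    heat = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            heat[i, j] = jsd(importances[i], importances[j])
    return heat
```

The resulting matrix is symmetric with a zero diagonal; identical layers give JSD 0 and disjoint distributions give the maximum of 1 (in base 2).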
We analyze the distribution in two parts: (i) we retain only the top-k scores in each layer and zero out the rest, which denotes the distribution's head; (ii) we zero out the top-k scores in each layer and retain the rest, which denotes the distribution's tail. In either case, we re-normalize to get a probability distribution. When comparing just the top-2 items, we observe higher JSD scores in heatmap 1c than in heatmap 1d (min 0.12/max 0.28). We conclude that a layer's function is reflected in the words high up in the importance distribution: as we remove them, we are left with an almost uniform distribution over the less important words. Hence, to correctly identify a layer's functionality, we need to focus only on the head (top-k words) and not on the tail.
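The head/tail re-normalization described above can be sketched as follows (illustrative Python; function names are ours):

```python
import numpy as np

def head_distribution(I, k):
    # Retain only the top-k importance scores, zero out the rest, renormalize.
    out = np.zeros_like(I)
    top = np.argsort(I)[-k:]
    out[top] = I[top]
    return out / out.sum()

def tail_distribution(I, k):
    # Zero out the top-k importance scores, retain the rest, renormalize.
    out = np.array(I, float)
    out[np.argsort(I)[-k:]] = 0.0
    return out / out.sum()
```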

Probing layers: QA functionality
Based on the defined layers' functionality I_l, we try to identify which layers focus more on the question, the context around the answer, etc. We segregate the passage words into three categories: answer words, supporting words, and query words, where supporting words are the words surrounding the answer within a window size of 5, and query words are the question words that appear in the passage. For each layer l, we take the top-5 words marked as important in I_l and count how many words from each of the above-defined categories appear among them (results in Tables 1 and 2). We observe similar overall trends for both SQuAD and DuoRC. From Column 3, it is evident that the model first tries to identify the part of the passage where the question words are present. As it gets more confident about the answer (Column 2), the importance of question words decreases. From Column 4, we infer that the layers' contextual role increases from the initial to the final layers.

(Table 3: Heatmap visualization of the I_l distribution over BERT's first and last three layers, for a sample from SQuAD. The initial layers focus on question-specific words; the later layers focus on supporting words that lead to the answer.)

Qualitative Example: We present a visualization of the top-5 words (with respect to I_l) of the first and last three layers in Table 3 for a sample from SQuAD. We see that all six layers give a high score to the answer span itself ('disastrous', 'situation'). Further, the initial layers 0, 1, and 2 also try to connect the passage and the query ('relegated', 'because', 'Polonia' get high importance scores). Hence, in this example, the initial layers incorporate interaction between the query and passage, whereas the last layers focus on enhancing and verifying the model's prediction.
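The top-5 category counting can be sketched as follows. This is an illustrative Python sketch; the function signature and the priority order used when a word falls into several categories (answer before supporting before query) are our assumptions.

```python
def topk_category_counts(importance, passage_words, answer_span, query_words,
                         k=5, window=5):
    # Count how many of the top-k most important passage words fall into each
    # category: answer words, supporting words (within `window` positions of
    # the answer span), and query words appearing in the passage.
    s, e = answer_span  # inclusive word indices of the answer in the passage
    top = sorted(range(len(passage_words)),
                 key=lambda i: importance[i], reverse=True)[:k]
    counts = {"answer": 0, "supporting": 0, "query": 0}
    qset = {w.lower() for w in query_words}
    for i in top:
        if s <= i <= e:
            counts["answer"] += 1
        elif s - window <= i <= e + window:
            counts["supporting"] += 1
        elif passage_words[i].lower() in qset:
            counts["query"] += 1
    return counts
```

Averaging these counts per layer over the dev-split produces tables of the kind reported in Tables 1 and 2.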

Visualizing Word Representations
We now qualitatively analyze the word representations of each layer. We visualize t-SNE plots for one passage, question, answer triplet from SQuAD (refer Table 4) in Figures 2 and 3, showing the answer, supporting words, query words, and special tokens; the other words in the passage are grayed out.

(Table 4 - Passage: "the panthers finished the regular season with a 15-1 record, ... the broncos ... finished the regular season with a 12-4 record. they joined the patriots, dallas cowboys, and pittsburgh steelers as one of teams that have made eight appearances in the super bowl." Question: "How many appearances have the Broncos made in the super bowl?")

In initial layers (such as layer 0), we observe that similar words, such as stop-words, team names, and numbers {eight, four}, are close to each other. In layer 4, the passage, question, and answer come closer to each other. By layer 9, the answer words are segregated from the rest of the words, even though the passage word 'four', which is of the same type (number) as the answer 'eight', is still close to 'eight'. Two further observations stand out: (i) in later layers, the question words separate from the answer and the supporting words; (ii) across all 12 layers, the embeddings for 'four' and 'eight' remain very close together, which could easily have led the model to a wrong prediction. However, the model still predicts the answer 'eight' correctly. We were not able to identify the layer where the distinction between the two confusing answers occurs.
Quantifier questions: For a more detailed analysis, we consider quantifier questions (how many, how much), which can have many confusing candidate answers (i.e., numerical words) in the passage; there are 799 and 310 such questions in SQuAD and DuoRC, respectively. Based on our layer-level functionality I_l, we count the words that are numerical quantities among the top-5 words and in the entire passage, and compute their ratio. This represents the fraction of confusing words that are marked as important by each layer. Interestingly, we observe that this ratio increases as we go up the layers (SQuAD: L0 - 5.6%, L10 - 17.7%, L11 - 15.5%; DuoRC: L0 - 12.9%, L10 - 21.6%, L11 - 22.6%). For the example in Table 4, we observed that in its later layers, BERT gives high importance to the words 'eight', 'four', and 'second' (numerical quantities), even though the latter two are not related or necessary to answer the question. This shows that BERT, in its later layers, distributes its focus over confusing words. Nevertheless, it still manages to predict the correct answer for such questions (87.35% EM in SQuAD and 53.5% in DuoRC), and with high confidence (86.5% vs. 80.4% for quantifier questions with more than one numerical entity in the passage vs. non-quantifier questions in SQuAD; 95.2% vs. 87.2% in DuoRC). This behavior is very different from the roles a layer might be assumed to take to answer the question, as one would expect such words to be considered in the initial rather than the final layers. This shows the complexity of BERT and the difficulty of interpreting it for the RCQA task.

(Figure caption, for the sample in Table 4: in layers 9-11, the answer 'eight' segregates from the other words; however, the numerical entity 'four' remains very close to the answer.)
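The confusing-word ratio for quantifier questions can be sketched as follows (illustrative Python; the crude `is_numeric` check is our stand-in, as the paper presumably identifies numerical quantities via POS/NER tags):

```python
NUMBER_WORDS = {"one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"}

def is_numeric(tok):
    # Crude numeric test: digits (with optional thousands separators)
    # or small spelled-out numbers. Illustration only.
    try:
        float(tok.replace(",", ""))
        return True
    except ValueError:
        return tok.lower() in NUMBER_WORDS

def confusing_word_ratio(importance, passage_words, k=5):
    # Ratio of numerical quantities among the top-k important words
    # to numerical quantities in the entire passage.
    top = sorted(range(len(passage_words)),
                 key=lambda i: importance[i], reverse=True)[:k]
    in_top = sum(is_numeric(passage_words[i]) for i in top)
    in_passage = sum(is_numeric(w) for w in passage_words)
    return in_top / in_passage if in_passage else 0.0
```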

Conclusion
In this work, we highlight that the lack of predefined roles for layers adds to the difficulty of interpreting highly complex BERT-based models. We first define each layer's functionality using Integrated Gradients. We then present results and analysis showing that BERT learns some form of passage-query interaction in its initial layers before arriving at the answer. We find the following observations interesting and worth probing further: (i) why do the question word representations move away from the contextual and answer representations in later layers? (ii) if the focus on confusing words increases from the initial to the later layers, how does BERT still achieve high accuracy? We hope that this work will help the research community interpret BERT for other complex tasks and explore the above open-ended questions.

A Probing layers: POS Tags
Based on the layers' functionality I_l, we analyze the top-5 important words in each layer on the basis of POS tags. The results can be found in Tables 5 and 6. We note that all 12 layers majorly focus on entity-based words (common nouns, proper nouns, and numerical entities). Surprisingly, all layers give approximately 10% of their importance each to punctuation marks and stopwords, the same level of importance that is given to verbs and adjectives. It is worth noting that, on average, answer spans in SQuAD are 82.04% entities, and answer spans in DuoRC are 79.78% entities.