Fine-grained Interpretation and Causation Analysis in Deep NLP Models

Deep neural networks have consistently pushed the state-of-the-art performance in natural language processing and are considered the de facto modeling approach for solving complex NLP tasks such as machine translation, summarization, and question answering. Despite the proven efficacy of deep neural networks at large, their opaqueness is a major cause of concern. In this tutorial, we will present research work on interpreting fine-grained components of a neural network model from two perspectives: i) fine-grained interpretation, and ii) causation analysis. The former is a class of methods that analyze neurons with respect to a desired language concept or task. The latter studies the role of neurons and input features in explaining the decisions made by the model. We will also discuss how interpretation methods and causation analysis can be connected towards better interpretability of model predictions. Finally, we will walk you through various toolkits that facilitate fine-grained interpretation and causation analysis of neural models.


Introduction
Deep neural networks have consistently pushed the state-of-the-art performance in natural language processing and are considered the de facto modeling approach for solving most complex NLP tasks such as machine translation, summarization, and question answering. Despite the benefits and usefulness of deep neural networks at large, their opaqueness is a major cause of concern. Interpreting neural networks is considered important for increasing trust in AI systems, providing additional information to decision makers, and assisting ethical decision making (Lipton, 2016).
Interpretation of neural network models is a broad area of research. Significant work has analyzed networks at the representation level (Belinkov et al., 2017; Conneau et al., 2018; Adi et al., 2016; Tenney et al., 2019) and at the neuron level (Bau et al., 2020; Mu and Andreas, 2020a; Bau et al., 2019; Dalvi et al., 2019a). Others have carried out behavioral studies to analyze models (Gulordava et al., 2018; Linzen et al., 2016; Marvin and Linzen, 2018). Moreover, a number of studies investigate the importance of input features and neurons with respect to a prediction (Dhamdhere et al., 2018a; Lundberg and Lee, 2017; Tran et al., 2018). The interpretation of neural models has gained a lot of attention in the last couple of years. For example, it has been added as a regular track in major *CL conferences, and there is an annual workshop, BlackboxNLP, dedicated to this purpose. ACL 2020 and EMNLP 2020 (https://2020.emnlp.org/tutorials) featured tutorials on the topic (Belinkov et al., 2020). The ACL tutorial focused on two subareas of interpretation: representation analysis and behavioral studies. The EMNLP tutorial focused solely on behavioral studies, i.e., assessing a model's behavior using constructed examples. Both of these tutorials serve as a great starting point for new researchers in this area.
Representation analysis, also called structural analysis, is useful for understanding how various core linguistic properties are learned in the model. However, it suffers from a few limitations. It mainly focuses on interpreting full vector representations and does not study the role of fine-grained components of the representation, i.e., neurons. Moreover, the findings of representation analysis do not link to the cause of a prediction (Belinkov and Glass, 2019). While behavioral analysis evaluates model predictions, it does not typically connect them with the influence of the input features and the internal components of a model (Vig et al., 2020).
In this tutorial, we aim to present and discuss research work on interpreting fine-grained components of a model from two perspectives: i) fine-grained interpretation, and ii) causation analysis. The former will introduce methods to analyze individual neurons and groups of neurons with respect to a desired language property or task. The latter will examine the role of neurons and input features in explaining decisions made by the model. We will cover important research questions such as: i) how is knowledge distributed across the model components? ii) what knowledge learned within the model is used for specific predictions? iii) does the inhibition of specific knowledge in the model change predictions? iv) how do different modeling and optimization choices impact the underlying knowledge?
Recent work on interpreting neurons has shown that, in addition to providing a better understanding of the inner workings of neural networks, neuron-level interpretation has applications in model distillation (Rethmeier et al., 2020), domain adaptation (Gu et al., 2021), and efficient feature selection (Dalvi et al., 2020), e.g., by removing unimportant neurons, facilitating architecture search, and mitigating model bias by identifying neurons responsible for sensitive attributes like gender, race, or politeness (Bau et al., 2019; Vig et al., 2020). These works are not only enabling a better understanding of these networks, but are also leading towards better, fairer, and more environmentally friendly models, which are all important goals for the Artificial Intelligence community at large.
The second part, Causation Analysis, will focus on methods that seek to characterize the role of neurons and layers towards a specific prediction. More concretely, we will discuss gradient- and perturbation-based attribution algorithms such as Integrated Gradients (Sundararajan et al., 2017), Layer Conductance (Dhamdhere et al., 2018b), Saliency (Simonyan et al., 2014), and SHapley Additive exPlanations (SHAP) (Lundberg and Lee, 2017), and showcase how they can help us identify important neurons in different layers of a deep neural network. Besides that, we will also dive deeper into more recent attribution algorithms that take feature or neuron interactions into account. More specifically, we will look into Integrated Hessians (Janizek et al., 2020), the Shapley Taylor interaction index (Dhamdhere et al., 2020), and Archipelago (Tsang et al., 2020).
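As a brief recap of the first of these methods, Integrated Gradients attributes the prediction to the i-th input feature by integrating gradients along the straight-line path from a baseline x' to the input x (Sundararajan et al., 2017):

```latex
\mathrm{IG}_i(x) \;=\; (x_i - x'_i)\int_{0}^{1}
\frac{\partial F\bigl(x' + \alpha\,(x - x')\bigr)}{\partial x_i}\, d\alpha
```

In practice, the integral is approximated with a summation over a small number of interpolation steps between the baseline and the input, which is what the toolkits discussed below implement.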
Lastly, we will mention various open source toolkits and libraries that provide implementations of notable techniques in the area. A few examples of such toolkits are: Captum (Kokhlikyan et al., 2020), InterpretML, NeuroX (Dalvi et al., 2019b), Ecco, and Diagnnose (Jumelet and Hupkes, 2019). We will walk through how some of these tools can be used for fine-grained interpretation and causation analysis.
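To give a flavor of such a walkthrough, below is a minimal sketch of feature attribution with Captum's Integrated Gradients. The toy classifier, input tensor, and baseline are hypothetical stand-ins; for transformer models such as BERT, one would typically attribute with respect to the embedding layer (e.g., via Captum's LayerIntegratedGradients).

```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

# Hypothetical toy classifier over a fixed-size input representation.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
model.eval()

inputs = torch.randn(1, 16)           # stand-in for an input representation
baseline = torch.zeros_like(inputs)   # all-zero baseline, a common choice

# Attribute the score of class 1 to each input feature.
ig = IntegratedGradients(model)
attributions, delta = ig.attribute(
    inputs, baselines=baseline, target=1, return_convergence_delta=True
)

print(attributions.shape)  # per-feature attributions, same shape as inputs
print(delta)               # approximation error of the path integral
```

Layer Conductance follows the same pattern via captum.attr.LayerConductance, with an additional argument selecting the layer whose neurons are to be scored.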
Throughout the tutorial, our goal will also be to critically evaluate where the strengths and weaknesses of each of the presented methods lie, and to provide ideas and recommendations for future directions.
Outline

1. Introduction: We will introduce the topic and motivate it by providing the vision of model interpretability, and how it leads towards fair and ethical models that generalize well. We will then describe various forms of interpretation and outline the scope of the tutorial (15 minutes).

2. Fine-grained Interpretation: We will present and discuss the work on neuron-level interpretation (90 minutes).

3. Causation Analysis: We will present methods that characterize the role of input features, neurons, and layers towards a specific prediction, including gradient- and perturbation-based attribution algorithms.

4. Concept-based Interpretation of Prediction: This part will aim to bridge the gap between fine-grained interpretation and causation analysis. We will discuss how fine-grained interpretation and causation analysis can be combined to establish concept-based interpretation of model predictions (10 minutes).

5. Discussion: The last part will discuss the overall challenges that the current work faces and suggest future directions (10 minutes).

Prerequisites
We assume basic knowledge of deep learning and familiarity with LSTM-based and transformer-based pre-trained models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019). Additionally, some familiarity with natural language processing tasks such as named entity tagging, natural language inference, etc. would be useful but not mandatory. We do not expect participants to be familiar with research on the interpretation and analysis of deep models. Familiarity with Python, PyTorch, and the Transformers library (Wolf et al., 2019) would be useful for following the practical part.

Reading List
• In order to get an overview of the interpretation field, trainees may look at the following survey papers: Belinkov and Glass (2019).

In addition to the above list, interested trainees may look at the papers mentioned in Section 2.

Presenters

http://alt.qcri.org/ndurrani/

Nadir Durrani is a Research Scientist in the Arabic Language Technologies group at Qatar Computing Research Institute. His research interests include interpretation of neural networks, neural and statistical machine translation (with a focus on reordering, domain adaptation, transliteration, dialectal translation, pivoting, and closely related and morphologically rich languages), eye-tracking for MT evaluation, spoken language translation, and speech synthesis. His recent work analyzes contextualized representations with a focus on linguistic interpretation, manipulation, feature selection, and model distillation. His work on analyzing deep neural networks has been published at venues such as Computational Linguistics, *ACL, AAAI, and ICLR. Nadir has been involved in co-organizing workshops such as the workshop on simultaneous machine translation and the WMT 2019/2020 machine translation robustness task. He regularly serves on program committees and has served as an Area Chair at ACL and AAAI this year.