Interpreting Predictions of NLP Models

Although neural NLP models are highly expressive and empirically successful, they also systematically fail in counterintuitive ways and are opaque in their decision-making process. This tutorial will provide a background on interpretation techniques, i.e., methods for explaining the predictions of NLP models. We will first situate example-specific interpretations in the context of other ways to understand models (e.g., probing, dataset analyses). Next, we will present a thorough study of example-specific interpretations, including saliency maps, input perturbations (e.g., LIME, input reduction), adversarial attacks, and influence functions. Alongside these descriptions, we will walk through source code that creates and visualizes interpretations for a diverse set of NLP tasks. Finally, we will discuss open problems in the field, e.g., evaluating, extending, and improving interpretation methods.


Tutorial Description
Neural models have become the de facto standard tool for NLP tasks. These models are becoming increasingly powerful: recent work shows that large neural models substantially improve accuracy on a wide range of downstream tasks (Devlin et al., 2019; Brown et al., 2020). However, today's models still make egregious errors: they reinforce racial biases (Sap et al., 2019), fail in counterintuitive ways (Jia and Liang, 2017; Feng et al., 2018), and often solve tasks using simple surface-level patterns (Gururangan et al., 2018; Min et al., 2019).
These model insufficiencies are exacerbated by our inability to understand why models make the predictions they do. Interpretation methods seek to fill this void. In particular, example-specific interpretations provide post-hoc explanations for individual model predictions. These explanations come in various forms, e.g., attributing the importance of input features through saliency maps (Smilkov et al., 2017), perturbing the inputs and observing the model's response (Feng et al., 2018; Ribeiro et al., 2018b), or locating a model's local decision boundary (Ribeiro et al., 2016).
This tutorial will provide an introduction to the various types of example-specific interpretations. We will present the technical details of existing methods, including saliency maps, adversarial attacks, input perturbations, and influence functions, among others. We will cover how these interpretations are applied to various tasks and input-output formats, e.g., text classification using LSTMs, masked language modeling using BERT (Devlin et al., 2019), and text generation using GPT-2 (Radford et al., 2019).
For each task, we will walk through example use cases of interpretations: highlighting model weaknesses (Jia and Liang, 2017), increasing/decreasing user trust (Feng et al., 2018), and understanding hard-to-formalize criteria such as bias, safety, and fairness (Doshi-Velez and Kim, 2017). Alongside the tutorial, we will present source code implementations of various interpretation methods using AllenNLP Interpret (Wallace et al., 2019b).

Details and Prerequisites
The tutorial will be of the cutting-edge type. The tutorial slides and the accompanying code are available online at https://www.ericswallace.com/interpretability.
Prerequisites Attendees should have a basic understanding of different tasks in NLP such as text classification, sequence tagging, and reading comprehension (predicting spans in a passage).
Attendees should also have a basic understanding of neural network methods for NLP, including:
• How backpropagation can compute gradients with respect to the parameters.
• How tokens/words are represented (i.e., word and sub-word embeddings).
• High-level ideas behind different model architectures (e.g., RNNs, Transformers).
• Optional knowledge of contextualized embedding models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019).
Finally, a portion of the tutorial will walk through Python code samples in PyTorch and AllenNLP (Gardner et al., 2018b). Participants do not need to understand this code to follow the main tutorial material.
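To make the backpropagation prerequisite concrete, here is a minimal PyTorch sketch of computing gradients with respect to both the parameters and the input embeddings; the toy bag-of-embeddings model is purely illustrative, and this mechanism underlies most of the gradient-based methods covered later.

```python
import torch

# A toy "model": an embedding layer followed by a linear classifier.
embedding = torch.nn.Embedding(num_embeddings=100, embedding_dim=8)
linear = torch.nn.Linear(8, 2)

tokens = torch.tensor([3, 17, 42])           # a toy 3-token input
vectors = embedding(tokens)                  # token embeddings, shape (3, 8)
logits = linear(vectors.mean(dim=0))         # bag-of-embeddings prediction
loss = torch.nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([1]))

loss.backward()                              # backpropagation
print(linear.weight.grad.shape)              # gradient w.r.t. parameters: (2, 8)
print(embedding.weight.grad[tokens].shape)   # gradient w.r.t. the input embeddings
```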
Reading List Doshi-Velez and Kim (2017) provide a great overview and motivation for interpretability research. Lipton (2018) and Jain and Wallace (2019) discuss some of the challenges of defining and evaluating interpretability. Jia and Liang (2017) help demonstrate the fragility of NLP models. LIME (Ribeiro et al., 2016) and saliency maps (Simonyan et al., 2014) are now standard interpretations. Wallace et al. (2019b) provide example NLP interpretations (interested readers can inspect their code).

Tutorial Outline
The tutorial will present three hours of content with a thirty-minute break.
Motivation This section will discuss why we care about interpretability. It will paint a landscape of today's neural models, describe how models are brittle and behave counterintuitively, and explain how interpretations can open the "black box" of machine learning.
Introduction to Interpretations This section will situate example-specific interpretations in the context of other ways to understand models. We will discuss, e.g., dataset analyses such as error analysis, Errudite, and diagnostic "challenge" test sets, as well as probing of internal representations.
Example-specific Interpretations This section will introduce example-specific interpretations in more detail. We will discuss the challenges of, and approaches to, evaluating such interpretations. We will also cover the critiques and shortcomings of using attention as an explanation (Jain and Wallace, 2019; Serrano and Smith, 2019). We will then explain why we focus on gradient-based methods: they are model-agnostic, easy to compute, and (largely) faithful to a model's behavior.
Understanding What Parts of an Input Led to a Prediction This section will discuss:
• Saliency maps, i.e., visualizations of the most "salient" input tokens. We will discuss how to generate saliency maps using gradient-based techniques (Simonyan et al., 2014; Sundararajan et al., 2017; Smilkov et al., 2017) and black-box techniques (Ribeiro et al., 2016); a minimal sketch follows this list.
• Input perturbations, i.e., showing how changes to the input do (or do not) change the prediction, for example leave-one-out (Li et al., 2016) and input reduction (Feng et al., 2018). We will also cover adversarial perturbations such as token flipping (Ebrahimi et al., 2018) and adding distractor sentences (Jia and Liang, 2017).
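To make these two families concrete, here is a minimal sketch of a gradient-based saliency map (gradient × input) and of leave-one-out; `model.embedding`, `model.forward_from_embeddings`, and `model.predict_proba` are hypothetical stand-ins for whatever interface the actual model exposes.

```python
import torch

def saliency_scores(model, token_ids, target_label):
    """Gradient-x-input saliency: one importance score per input token."""
    embeds = model.embedding(token_ids)             # (seq_len, dim)
    embeds.retain_grad()                            # keep grads on a non-leaf tensor
    logits = model.forward_from_embeddings(embeds)  # hypothetical hook, shape (num_labels,)
    loss = torch.nn.functional.cross_entropy(
        logits.unsqueeze(0), torch.tensor([target_label]))
    loss.backward()
    scores = (embeds.grad * embeds).sum(dim=-1).abs()  # |gradient . embedding| per token
    return scores / scores.sum()                    # normalize for visualization

def leave_one_out(model, token_ids, target_label):
    """Token importance = drop in target probability when the token is removed."""
    with torch.no_grad():
        base = model.predict_proba(token_ids)[target_label]  # hypothetical API
        drops = []
        for i in range(len(token_ids)):
            reduced = torch.cat([token_ids[:i], token_ids[i + 1:]])
            drops.append((base - model.predict_proba(reduced)[target_label]).item())
    return drops
```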

Break
Understanding How Global Decision Rules Led to a Prediction This section will discuss how certain global "decision rules" can explain model predictions. We will cover Anchors (Ribeiro et al., 2018a) and Universal Adversarial Triggers (Wallace et al., 2019a). We will also discuss how spurious patterns in datasets, e.g., lexical overlap in textual entailment, can cause models to learn certain undesirable decision rules.
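As a rough sketch of the first-order token-replacement step that underlies HotFlip-style attacks and the trigger search of Wallace et al. (2019a): the loss change from swapping a trigger token is approximated by a dot product with the gradient. Here `embedding_matrix` and `trigger_grad` are assumed given, and in practice one keeps a beam of top-k candidates and re-evaluates the true loss (the sign depends on whether the attack maximizes or minimizes it).

```python
import torch

def best_replacement(embedding_matrix, current_token_id, trigger_grad):
    # First-order Taylor approximation of the loss change from replacing the
    # current token's embedding e_cur with a candidate embedding e_w:
    #   delta_loss(w) ~= (e_w - e_cur) . grad
    delta = (embedding_matrix - embedding_matrix[current_token_id]) @ trigger_grad
    return int(delta.argmax())  # token that (approximately) increases the loss most
```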

Understanding Which Training Examples Caused a Prediction This section will discuss how to trace model predictions back to the training data, i.e., identifying "influential" training points. We will cover influence functions (Koh and Liang, 2017) and representer points (Yeh et al., 2018).
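As a heavily simplified sketch of the idea: Koh and Liang (2017) score a training point z by -∇L(z_test)ᵀ H⁻¹ ∇L(z), and approximate the inverse Hessian with Hessian-vector products. The code below takes the common identity-Hessian shortcut (a plain gradient dot product), which is not their full method but already surfaces similar, highly influential training examples; `loss_fn` is a hypothetical callable.

```python
import torch

def flat_grad(loss, params):
    """Flatten the gradient of `loss` w.r.t. `params` into one vector."""
    grads = torch.autograd.grad(loss, params, allow_unused=True)
    return torch.cat([
        (g if g is not None else torch.zeros_like(p)).reshape(-1)
        for g, p in zip(grads, params)
    ])

def influence_scores(model, loss_fn, test_example, train_examples):
    params = [p for p in model.parameters() if p.requires_grad]
    test_grad = flat_grad(loss_fn(model, test_example), params)
    scores = []
    for z in train_examples:
        train_grad = flat_grad(loss_fn(model, z), params)
        scores.append(torch.dot(test_grad, train_grad).item())  # H ~= I shortcut
    return scores  # higher = more "influential" on this test prediction
```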
Coding Interpretations This section will walk through source code for selected interpretation methods. Using AllenNLP Interpret (Wallace et al., 2019b), we will cover example use cases such as interpreting LSTM-based sentiment analysis models and BERT-based masked language models.
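As a taste of what the walkthrough covers, here is a minimal sketch using AllenNLP Interpret's Predictor and SaliencyInterpreter classes; the model archive path is a placeholder for any trained AllenNLP text classifier.

```python
from allennlp.predictors import Predictor
from allennlp.interpret.saliency_interpreters import SimpleGradient

# Placeholder path: substitute any trained sentiment-analysis archive.
predictor = Predictor.from_path("/path/to/sentiment-model.tar.gz")
interpreter = SimpleGradient(predictor)

# Returns one normalized gradient score per input token, e.g.
# {"instance_1": {"grad_input_1": [0.12, 0.03, ...]}}
print(interpreter.saliency_interpret_from_json(
    {"sentence": "a very funny and entertaining picture."}))
```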
Open Problems We will conclude with a discussion of areas for future research, e.g., evaluating, extending, and improving interpretation methods.