The Language Interpretability Tool: Extensible, Interactive Visualizations and Analysis for NLP Models

We present the Language Interpretability Tool (LIT), an open-source platform for visualization and understanding of NLP models. We focus on core questions about model behavior: Why did my model make this prediction? When does it perform poorly? What happens under a controlled change in the input? LIT integrates local explanations, aggregate analysis, and counterfactual generation into a streamlined, browser-based interface to enable rapid exploration and error analysis. We include case studies for a diverse set of workflows, including exploring counterfactuals for sentiment analysis, measuring gender bias in coreference systems, and exploring local behavior in text generation. LIT supports a wide range of models—including classification, seq2seq, and structured prediction—and is highly extensible through a declarative, framework-agnostic API. LIT is under active development, with code and full documentation available at https://github.com/pair-code/lit.


Introduction
Advances in modeling have brought unprecedented performance on many NLP tasks (e.g. Wang et al., 2019), but many questions remain about the behavior of these models under domain shift (Blitzer and Pereira, 2007) and adversarial settings (Jia and Liang, 2017), and for their tendencies to behave according to social biases (Bolukbasi et al., 2016; Caliskan et al., 2017) or shallow heuristics (e.g. McCoy et al., 2019; Poliak et al., 2018). For any new model, one might want to know: What kind of examples does my model perform poorly on? Why did my model make this prediction? And critically, does my model behave consistently if I change things like textual style, verb tense, or pronoun gender? Despite the recent explosion of work on model understanding and evaluation (e.g. Belinkov et al., 2020; Linzen et al., 2019; Ribeiro et al., 2020), there is no "silver bullet" for analysis. Practitioners must often experiment with many techniques, looking at local explanations, aggregate metrics, and counterfactual variations of the input to build a full understanding of model behavior.
Existing tools can assist with this process, but many come with limitations: offline tools such as TFMA (Mewald, 2019) can provide only aggregate metrics, interactive frontends (e.g. Wallace et al., 2019) may focus on single-datapoint explanation, and more integrated tools (e.g. Wexler et al., 2020; Mothilal et al., 2020; Strobelt et al., 2018) often work with only a narrow range of models. Switching between tools or adapting a new method from research code can take days of work, distracting from the real task of error analysis. An ideal workflow would be seamless and interactive: users should see the data, what the model does with it, and why, so they can quickly test hypotheses and build understanding.
With this in mind, we introduce the Language Interpretability Tool (LIT), a toolkit and browser-based user interface (UI) for NLP model understanding. LIT supports local explanations (including salience maps, attention, and rich visualizations of model predictions) as well as aggregate analysis (including metrics, embedding spaces, and flexible slicing), and allows users to seamlessly hop between them to test local hypotheses and validate them over a dataset. LIT provides first-class support for counterfactual generation: new datapoints can be added on the fly, and their effect on the model visualized immediately. Side-by-side comparison allows for two models, or two datapoints, to be visualized simultaneously.
We recognize that research workflows are constantly evolving, and designed LIT along the following principles:
• Flexible: Support a wide range of NLP tasks, including classification, seq2seq, language modeling, and structured prediction.
• Extensible: Designed for experimentation, and can be reconfigured and extended for novel workflows.
• Modular: Components are self-contained, portable, and simple to implement.
• Framework agnostic: Works with any model that can run from Python, including TensorFlow (Abadi et al., 2015), PyTorch (Paszke et al., 2019), or remote models on a server.
• Easy to use: Low barrier to entry, with only a small amount of code needed to add models and data (Section 4.3), and an easy path to access sophisticated functionality.

User Interface and Functionality
LIT has a browser-based UI composed of modules (Figure 1) which contain controls and visualizations for specific tasks (Table 1). [...] (Li et al., 2016) and LIME (Ribeiro et al., 2016), or look for patterns in attention heads (Figure 1, bottom).

| Module | Description |
| --- | --- |
| Attention | Displays an attention visualization for each layer and head. |
| Confusion Matrix | A customizable confusion matrix for single model or multi-model comparison. |
| Counterfactual Generator | Creates counterfactuals for selected datapoint(s) using a variety of techniques. |
| Data Table | A tabular view of the data, with sorting, searching, and filtering support. |
| Datapoint Editor | Editable details of a selected datapoint. |
| Embeddings | Visualizes dataset by layer-wise embeddings, projected down to 3 dimensions. |
| Metrics Table | Displays metrics such as accuracy or BLEU score, on the whole dataset and slices. |
| Predictions | Displays model predictions, including classification, text generation, language model probabilities, and a graph visualization for structured prediction tasks. |
| Salience Maps | Shows heatmaps for token-based feature attribution for a selected datapoint using techniques like local gradients and LIME. |
| Scalar Plot | Displays a jitter plot organizing datapoints by model output scores, metrics or other scalar values. |

Table 1: Built-in modules in the Language Interpretability Tool.
J5 - Compare side-by-side. Users can interactively compare two or more models on the same data, or a single model on two datapoints simultaneously. Visualizations automatically "replicate" for a side-by-side view.
J6 - Compute metrics. LIT calculates and displays metrics for the whole dataset, the current selection, as well as on manual or automatically-generated slices (Figure 3 (c)) to easily find patterns in model performance.
LIT's interface allows these user journeys to be explored interactively. Selecting a dataset and model(s) will automatically show compatible modules in a multi-pane layout (Figure 1). A tabbed bottom panel groups modules by workflow and functionality, while the top panel shows persistent modules for dataset exploration.
These modules respond dynamically to user interactions. If a selection is made in the embedding projector, for example, the metrics table will respond automatically and compute scores on the selected datapoints. Global controls make it easy to page through examples, enter a comparison mode, or save the selection as a named "slice". In this way, the user can quickly explore multiple workflows using different combinations of modules.
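The selection-driven metrics described above can be illustrated with a short sketch. This is not LIT's implementation: the record fields (`label`, `pred`, `genre`) and the helpers `accuracy` and `metrics_by_slice` are hypothetical names, chosen only to show how one metric can be recomputed for the whole dataset, the current selection, and automatically generated slices.

```python
from collections import defaultdict


def accuracy(examples):
    """Fraction of examples whose prediction matches the gold label."""
    if not examples:
        return float("nan")
    correct = sum(1 for ex in examples if ex["pred"] == ex["label"])
    return correct / len(examples)


def metrics_by_slice(examples, selection=None, facet=None):
    """Compute accuracy on the dataset, an index selection, and facet slices."""
    results = {"dataset": accuracy(examples)}
    if selection is not None:
        # The selection is a list of indices, e.g. points picked in a plot.
        results["selection"] = accuracy([examples[i] for i in selection])
    if facet is not None:
        # Automatically generated slices: one group per value of the facet field.
        groups = defaultdict(list)
        for ex in examples:
            groups[ex[facet]].append(ex)
        for value, group in sorted(groups.items()):
            results[f"{facet}={value}"] = accuracy(group)
    return results


# Toy records (hypothetical), standing in for model predictions on a dataset.
data = [
    {"label": 1, "pred": 1, "genre": "fiction"},
    {"label": 0, "pred": 1, "genre": "fiction"},
    {"label": 1, "pred": 1, "genre": "travel"},
    {"label": 0, "pred": 0, "genre": "travel"},
]
report = metrics_by_slice(data, selection=[0, 1], facet="genre")
```

Keeping the metric function independent of how the subset was chosen is what lets the same computation serve a manual selection, a saved slice, or a facet.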
A brief video demonstration of the LIT UI is available at https://youtu.be/j0OfBWFUqIE.

Case Studies
Sentiment analysis. How well does a sentiment classifier handle negation? We load the development set of the Stanford Sentiment Treebank (SST; Socher et al., 2013), and use the search function in LIT's data table (J1, J2) to find the 56 datapoints containing the word "not". Looking at the Metrics Table (J6), we find that, surprisingly, our BERT model (Devlin et al., 2019) gets 100% of these correct! But we might want to know if this is truly robust. With LIT, we can select individual datapoints and look for explanations (J3). For example, take the negative review, "It's not the ultimate depression-era gangster movie.". As shown in Figure 2, salience maps suggest that "not" and "ultimate" are important to the prediction.
We can verify this by creating modified inputs, using LIT's datapoint editor (J4). Removing "not" gets a strongly positive prediction from "It's the ultimate depression-era gangster movie.", while replacing "ultimate" to get "It's not the worst depression-era gangster movie." elicits a mildly positive score from our model.
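The two edits above can be mimicked in a few lines. The sketch below is only an illustration of the counterfactual workflow: `toy_sentiment_score` is a hypothetical lexicon-based stand-in, not the BERT model from the case study, and `remove_word` and `replace_word` are assumed helper names.

```python
def toy_sentiment_score(text):
    """Hypothetical stand-in for a model: lexicon score, flipped after 'not'."""
    lexicon = {"ultimate": 1.0, "great": 1.0, "worst": -1.0, "terrible": -1.0}
    score, negate = 0.0, False
    for token in text.lower().replace(".", "").split():
        if token == "not":
            negate = True
            continue
        value = lexicon.get(token, 0.0)
        score += -value if negate else value
    return score


def remove_word(text, word):
    """Drop every whitespace-delimited token equal to `word`."""
    return " ".join(t for t in text.split() if t != word)


def replace_word(text, old, new):
    """Swap every whitespace-delimited token equal to `old` for `new`."""
    return " ".join(new if t == old else t for t in text.split())


original = "It's not the ultimate depression-era gangster movie."
counterfactuals = {
    "original": original,
    "remove 'not'": remove_word(original, "not"),
    "ultimate -> worst": replace_word(original, "ultimate", "worst"),
}
# Score the original and each edited input side by side.
scores = {name: toy_sentiment_score(text) for name, text in counterfactuals.items()}
```

With this toy scorer the original is negative and both edits flip the sign; a real classifier, as noted above, may respond more mildly to the second edit.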
Gender bias in coreference. Does a system encode gendered associations, which might lead to incorrect predictions? We load a coreference model trained on OntoNotes (Hovy et al., 2006), and load the Winogender (Rudinger et al., 2018) dataset into LIT for evaluation. Each Winogender example has a pronoun and two candidate referents, one an occupation term (like "technician") and one an "other participant" (like "customer"). Our model predicts coreference probabilities for each candidate. We can explore the model's sensitivity to pronouns by comparing two examples side-by-side (see Figure 3 (a)). We can see how commonly the model makes similar errors by paging through the dataset (J1), or by selecting specific slices of interest. For example, we can use the scalar plot module (J2) (Figure 3 (b)) to select datapoints where the occupation term is associated with a high proportion of male or female workers, according to the U.S. Bureau of Labor Statistics (BLS; Caliskan et al., 2017).
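In the Metrics Table (J6), we can slice this selection by pronoun type and by the true referent. On the set of male-dominated occupations (< 25% female by BLS), we see the model performs well when the ground-truth agrees with the stereotype: e.g. when the answer is the occupation term, male pronouns are correctly resolved 83% of the time, compared to female pronouns only 37.5% of the time (Figure 3 (c)).
Debugging text generation. Does the training data explain a particular error in text generation? We analyze a T5 (Raffel et al., 2019) model on the CNN-DM summarization task (Hermann et al., 2015), and loosely follow the steps of Strobelt et al. (2018). LIT's scalar plot module (J2) allows us to look at per-example ROUGE scores, and quickly select an example with middling performance (Figure 4 (a)). We find the generated text (Figure 4 (b)) contains an erroneous constituent: "alastair cook was replaces as captain by former captain ...". We can dig deeper, using LIT's language modeling module (Figure 4 (c)) to see that the token "by" is predicted with high probability (28.7%).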
To find out how T5 arrived at this prediction, we utilize the "similarity searcher" component through the counterfactual generator tab (Figure 4 (d)). This performs a fast approximate nearest-neighbor lookup (Andoni and Indyk, 2006) from a pre-built index over the training corpus, using embeddings from the T5 decoder. With one click, we can retrieve 25 nearest neighbors and add them to the LIT UI for inspection (as in Figure A.1). We see that the words "captain" and "former" appear 34 and 16 times in these examples, along with 3 occurrences of "replaced by" (Figure 4 (e)), suggesting a strong prior toward our erroneous phrase.
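The retrieval step can be illustrated with a small sketch. Note the substitution: this is an exact, brute-force cosine-similarity search, not the approximate nearest-neighbor index used above, and the three-dimensional vectors and `train-*` identifiers are hypothetical placeholders for decoder embeddings of training examples.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, index, k):
    """Return the k (id, similarity) pairs most similar to the query vector."""
    scored = [(ex_id, cosine_similarity(query, vec)) for ex_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings of training examples, keyed by example id.
train_index = {
    "train-0": [1.0, 0.0, 0.0],
    "train-1": [0.9, 0.1, 0.0],
    "train-2": [0.0, 1.0, 0.0],
    "train-3": [0.0, 0.0, 1.0],
}
neighbors = nearest_neighbors([1.0, 0.05, 0.0], train_index, k=2)
neighbor_ids = [ex_id for ex_id, _ in neighbors]
```

A brute-force scan is linear in the corpus size, which is why a pre-built approximate index is the practical choice over a full training corpus.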

System design and components
The LIT UI is written in TypeScript, and communicates with a Python backend that hosts models, datasets, counterfactual generators, and other interpretation components. LIT is agnostic to modeling frameworks; data is exchanged using NumPy arrays and JSON, and components are integrated through a declarative "spec" system (Section 4.4) that minimizes cross-dependencies and encourages modularity. A more detailed design schematic is given in the Appendix, Figure A.2.

Frontend
The browser-based UI is a single-page web app, built with lit-element and MobX. A shared framework of "service" objects tracks interaction state, such as the active model, dataset, and selection, and coordinates a set of otherwise-independent modules which provide controls and visualizations.

Backend
The Python backend serves models, data, and interpretation components. The server is stateless, but includes a caching layer for model predictions, which frees components from needing to store intermediate results and allows interactive use of large models like BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019). Component types include:
• Models which implement a predict() function, input_spec(), and output_spec().
• Datasets which load data from any source and expose an .examples field and a spec().
• Interpreters are called on a model and a set of datapoints, and return output-such as a salience map-that may also depend on the model's predictions.
• Generators are interpreters that return new input datapoints from source datapoints.
• Metrics are interpreters which return aggregate scores for a list of inputs.
These components are designed to be self-contained and interact through minimalist APIs, with most exposing only one or two methods plus spec information. They communicate through standard Python and NumPy types, making LIT compatible with most common modeling frameworks, including TensorFlow (Abadi et al., 2015) and PyTorch (Paszke et al., 2019). Components are also portable, and can easily be used in a notebook or standalone script. For example, the LIME (Ribeiro et al., 2016) component can be run standalone, returning a list of tokens and their importance to the model prediction.
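To make the idea of a portable interpretation component concrete, here is a minimal sketch of one called from a plain script. It is an assumption-laden illustration rather than LIT's API: `ToyModel` is a hypothetical model with only a `predict()` method, and `leave_one_out_salience` uses simple token ablation in place of LIME.

```python
class ToyModel:
    """Hypothetical model exposing a predict() method over a list of inputs."""

    WEIGHTS = {"great": 2.0, "good": 1.0, "bad": -1.0, "terrible": -2.0}

    def predict(self, inputs):
        outputs = []
        for ex in inputs:
            tokens = ex["sentence"].split()
            outputs.append({"score": sum(self.WEIGHTS.get(t, 0.0) for t in tokens)})
        return outputs


def leave_one_out_salience(inputs, model):
    """Score each token by how much the model output drops when it is removed."""
    results = []
    for ex in inputs:
        tokens = ex["sentence"].split()
        base = model.predict([ex])[0]["score"]
        # One ablated copy of the input per token position.
        ablated = [
            {"sentence": " ".join(tokens[:i] + tokens[i + 1:])}
            for i in range(len(tokens))
        ]
        preds = model.predict(ablated)
        salience = [base - p["score"] for p in preds]
        results.append({"tokens": tokens, "salience": salience})
    return results


result = leave_one_out_salience([{"sentence": "a great but bad film"}], ToyModel())[0]
```

The interpreter only needs the model's `predict()` method and a list of input dicts, which is the kind of narrow interface that makes a component usable outside the UI.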

Running with your own model
LIT is built as a Python library, and its typical use is to create a short demo.py script that loads models and data and passes them to the lit.Server class:

    models = {'foo': FooModel(...), 'bar': BarModel(...)}
    datasets = {'baz': BazDataset(...)}
    server = lit.Server(models, datasets)
    server.serve()

A full example script is included in the Appendix (Figure A.3). The same server can host several models and datasets for side-by-side comparison, and can also reference remotely-hosted models.
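The model and dataset objects passed to the server can be small. The following sketch shows the general shape under stated assumptions: it does not import LIT, the classes `TsvDataset` and `OverlapModel` and the column names are hypothetical, spec values are plain strings standing in for LIT's type objects, and predictions are Python lists rather than NumPy arrays.

```python
import csv
import io


class TsvDataset:
    """Hypothetical dataset: reads tab-separated rows into a list of dicts."""

    def __init__(self, lines):
        reader = csv.DictReader(lines, delimiter="\t")
        self.examples = [dict(row) for row in reader]

    def spec(self):
        # Strings stand in for semantic type objects.
        return {"premise": "TextSegment", "hypothesis": "TextSegment",
                "label": "MulticlassLabel"}


class OverlapModel:
    """Hypothetical stand-in model: scores entailment by word overlap."""

    def input_spec(self):
        return {"premise": "TextSegment", "hypothesis": "TextSegment"}

    def output_spec(self):
        return {"probas": "MulticlassPreds"}

    def predict(self, inputs):
        outputs = []
        for ex in inputs:
            premise = set(ex["premise"].lower().split())
            hypothesis = ex["hypothesis"].lower().split()
            overlap = sum(w in premise for w in hypothesis) / max(len(hypothesis), 1)
            # Two-class output: [entailment, neutral].
            outputs.append({"probas": [overlap, 1.0 - overlap]})
        return outputs


tsv_text = (
    "premise\thypothesis\tlabel\n"
    "a dog runs fast\ta dog runs\tentailment\n"
    "a dog runs fast\tthe cat sleeps\tneutral\n"
)
datasets = {"toy_nli": TsvDataset(io.StringIO(tsv_text))}
models = {"overlap": OverlapModel()}
preds = models["overlap"].predict(datasets["toy_nli"].examples)
```

Dictionaries of named models and datasets like these are what a demo script would hand to the server class.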

Extensibility: the spec() system
NLP models come in many shapes, with inputs that may involve multiple text segments, additional categorical features, scalars, and more, and output modalities that include classification, regression, text generation, and span labeling.Models may have multiple heads of different types, and may also return additional values like gradients, embeddings, or attention maps.Rather than enumerate all variations, LIT describes each model and dataset with an extensible system of semantic types.
For example, a dataset class for textual entailment (Dagan et al., 2006; Bowman et al., 2015) might have a spec() describing available fields:
• premise: TextSegment()
• hypothesis: TextSegment()
• label: MulticlassLabel(vocab=...)
A model for the same task would have an input_spec() to describe required inputs:
• premise: TextSegment()
• hypothesis: TextSegment()
As well as an output_spec() to describe its predictions:
• probas: MulticlassPreds(vocab=..., parent="label")
Other LIT components can read this spec, and infer how to operate on the data. For example, the MulticlassMetrics component searches for MulticlassPreds fields (which contain probabilities), uses the vocab annotation to decode to string labels, and evaluates these against the input field described by parent. Frontend modules can detect these fields and display automatically: for example, the embedding projector will appear if Embeddings are available.
New types can be easily defined: a SpanLabels class might represent the output of a named entity recognition model, and custom components can be added to interpret it.
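A short sketch can show how a component might use specs to decide what to compute. The dataclasses below reuse the type names from the text but are standalone, hypothetical definitions, and `multiclass_accuracy` is an assumed helper, not the MulticlassMetrics component itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextSegment:
    pass


@dataclass
class MulticlassLabel:
    vocab: List[str]


@dataclass
class MulticlassPreds:
    vocab: List[str]
    parent: Optional[str] = None


def multiclass_accuracy(dataset_spec, output_spec, examples, predictions):
    """Score every MulticlassPreds field against the label field named by parent."""
    scores = {}
    if not examples:
        return scores
    for name, field in output_spec.items():
        if not isinstance(field, MulticlassPreds) or field.parent is None:
            continue  # Not a field this metric knows how to evaluate.
        if field.parent not in dataset_spec:
            continue  # No gold labels available for this prediction field.
        correct = 0
        for ex, pred in zip(examples, predictions):
            probas = pred[name]
            best = max(range(len(probas)), key=probas.__getitem__)
            # Decode the argmax index to a string label using the vocab.
            correct += int(field.vocab[best] == ex[field.parent])
        scores[name] = correct / len(examples)
    return scores


labels = ["entailment", "neutral", "contradiction"]
dataset_spec = {
    "premise": TextSegment(),
    "hypothesis": TextSegment(),
    "label": MulticlassLabel(vocab=labels),
}
output_spec = {"probas": MulticlassPreds(vocab=labels, parent="label")}
examples = [
    {"premise": "A dog runs.", "hypothesis": "An animal moves.", "label": "entailment"},
    {"premise": "A dog runs.", "hypothesis": "A cat sleeps.", "label": "contradiction"},
]
predictions = [{"probas": [0.8, 0.1, 0.1]}, {"probas": [0.2, 0.5, 0.3]}]
scores = multiclass_accuracy(dataset_spec, output_spec, examples, predictions)
```

Because the metric is driven entirely by the spec, it needs no knowledge of the task: any model that declares a compatible field is evaluated the same way.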

Related Work
A number of tools exist for interactive analysis of trained ML models. Many are general-purpose, such as the What-If Tool (Wexler et al., 2020), Captum (Kokhlikyan et al., 2019), Manifold (Zhang et al., 2018), or InterpretML (Nori et al., 2019), while others focus on specific applications like fairness, including FairVis (Cabrera et al., 2019) and FairSight (Ahn and Lin, 2019). And some provide rich support for counterfactual analysis, either within-dataset (What-If Tool) or dynamically generated as in DiCE (Mothilal et al., 2020).
While many components exist in other tools, LIT aims to integrate local explanations, aggregate analysis, and counterfactual generation into a single tool. In this, it is most similar to Errudite (Wu et al., 2019), which provides an integrated UI for NLP error analysis, including a custom DSL for text transformations and the ability to evaluate over a corpus. However, LIT is explicitly designed for flexibility: we support a broad range of workflows and provide a modular design for extension with new tasks, visualizations, and generation techniques.
Limitations. LIT is an evaluation tool, and as such is not directly useful for training-time monitoring. As LIT is built to be interactive, it does not scale to large datasets as well as offline tools such as TFMA (Mewald, 2019). (Currently, the LIT UI can handle about 10,000 examples at once.) Because LIT is framework-agnostic, it does not have the deep model integration of tools such as AllenNLP Interpret (Wallace et al., 2019) or Captum (Kokhlikyan et al., 2019). This makes many things simpler and more portable, but also requires more code for techniques like integrated gradients (Sundararajan et al., 2017) that need to directly manipulate parts of the model.

Conclusion and Roadmap
LIT provides an integrated UI and a suite of components for visualizing and exploring the behavior of NLP models. It enables interactive analysis both at the single-datapoint level and over a whole dataset, with first-class support for counterfactual generation and evaluation. LIT supports a diverse range of workflows, from explaining individual predictions to disaggregated analysis to probing for bias through counterfactuals. LIT also supports a range of model types and techniques out of the box, and is designed for extensibility through simple, framework-agnostic APIs.
LIT is under active development by a small team. Planned upcoming additions include new counterfactual generation plug-ins, additional metrics and visualizations for sequence and structured output types, and a greater ability to customize the UI for different applications.
LIT is open-source under an Apache 2.0 license, and we welcome contributions from the community at https://github.com/pair-code/lit.

A Appendices
Figure A.1: The counterfactual generator module, showing a set of generated datapoints in the staging area. Labels can be manually edited before adding these to the dataset. In this example, the counterfactuals were created using the word replacer, replacing the word "great" with "terrible" across the dataset.

Figure 1: The LIT UI, showing a fine-tuned BERT (Devlin et al., 2019) model on the Stanford Sentiment Treebank (Socher et al., 2013) development set. The top half shows a selection toolbar, and, left-to-right: the embedding projector, the data table, and the datapoint editor. Tabs present different modules in the bottom half; the view above shows classifier predictions, an attention visualization, and a confusion matrix.

Figure 2: Salience maps on "It's not the ultimate depression-era gangster movie.", suggesting that "not" and "ultimate" are important to the model's prediction.

Figure 3: Exploring a coreference model on the Winogender dataset.

Figure 4: Investigating a local generation error, from selection of an interesting example to finding relevant training datapoints that led to an error.
(c)). Debugging text generation. Does the training data explain a particular error in text generation? We analyze a T5 (Raffel et al., 2019) model on the CNN-DM summarization task (Hermann et al., 2015), and loosely follow the steps of Strobelt et al. (2018). LIT's scalar plot module (J2) allows us to look at per-example ROUGE scores, and quickly select an example with middling performance (Figure

Figure A.2: Overview of LIT system architecture. The backend manages models, datasets, metrics, generators, and interpretation components, as well as a caching layer to speed up interactive use. The frontend is a TypeScript single-page app consisting of independent modules (web components built with lit-element) which interact with shared "services" which manage interaction state. The backend can be extended by passing components to the lit.Server class in the demo script (Section 4.3 and Figure A.3), while the frontend can be extended by importing new components in a single file, layout.ts, which both lists available modules and specifies their position in the UI (Figure 1).

Figure A.3: Example demo script to run LIT with two NLI models and the MultiNLI (Williams et al., 2018) development sets. The actual model can be implemented in TensorFlow, PyTorch, C++, a REST API, or anything that can be wrapped in a Python class: to work with LIT, users need only define the spec fields and implement a predict() function which returns a dict of NumPy arrays for each input datapoint. The dataset loader is even simpler; a complete implementation is given above to read from a TSV file, but libraries like TensorFlow Datasets can also be used.

Figure A.4: Full UI screenshot, showing a BERT (Devlin et al., 2019) model on a sample from the "matched" split of the MultiNLI (Williams et al., 2018) development set.The embedding projector (top left) shows three clusters, corresponding to the output layer of the model, and colored by the true label.On the bottom, the metrics table shows accuracy scores faceted by genre, and a confusion matrix shows the model predictions against the gold labels.
Figure A.5: Confusion matrix (a) and side-by-side comparison of predictions and salience maps (b) on two sentiment classifiers. In model comparison mode, the confusion matrix can compare two models, and clicking an off-diagonal cell will select examples where the two models make different predictions. In (b) we see one such example, where the model in the second row ("sst 1") predicts incorrectly, even though gradient-based salience shows both models focusing on the same tokens.