VizSeq: a visual analysis toolkit for text generation tasks

Automatic evaluation of text generation tasks (e.g. machine translation, text summarization, image captioning and video description) usually relies heavily on task-specific metrics, such as BLEU and ROUGE. These metrics, however, are abstract numbers that are not perfectly aligned with human assessment, which suggests inspecting detailed examples as a complement in order to identify system error patterns. In this paper, we present VizSeq, a visual analysis toolkit for instance-level and corpus-level system evaluation on a wide variety of text generation tasks. It supports multimodal sources and multiple text references, and provides visualization in Jupyter notebook or a web app interface. It can be used locally or deployed onto public servers for centralized data hosting and benchmarking. It covers the most common n-gram-based metrics, accelerated with multiprocessing, and also provides the latest embedding-based metrics such as BERTScore.


Introduction
Many natural language processing (NLP) tasks can be viewed as conditional text generation problems, where natural language texts are generated given inputs in the form of text (e.g. machine translation), image (e.g. image captioning), audio (e.g. automatic speech recognition) or video (e.g. video description). Their automatic evaluation usually relies heavily on task-specific metrics. Due to the complexity of natural language expressions, those metrics are not always perfectly aligned with human assessment. Moreover, metrics only produce abstract numbers and are limited in illustrating system error patterns. This suggests the necessity of inspecting detailed evaluation examples to get a full picture of system behaviors as well as to seek directions for improvement. A number of software tools have emerged to facilitate the calculation of various metrics or to demonstrate examples with sentence-level scores in an integrated user interface: iBLEU (Madnani, 2011), MTEval, MT-ComparEval (Klejch et al., 2015), nlg-eval (Sharma et al., 2017), Vis-Eval Metric Viewer (Steele and Specia, 2018), compare-mt (Neubig et al., 2019), etc. Quite a few of them are collections of command-line scripts for metric calculation, which lack visualization to better present and interpret the scores. Some of them are able to generate static HTML reports to present charts and examples, but they do not allow updating visualization options interactively. MT-ComparEval is the only tool we found that has an interactive user interface. It is, however, written in PHP, which, unlike Python, lacks a complete NLP ecosystem. The number of metrics it supports is also limited, and the software is no longer under active development.

* Work carried out during an internship at Facebook.
Support of multiple references is not a prevalent standard across all the software we investigated, and none of it supports multiple sources or sources in non-text modalities such as image, audio and video. Almost all the metric implementations are single-process, which cannot leverage the multiple cores in modern CPUs for speedup and better scalability.

Table 1: Example tasks supported by VizSeq, grouped by source type.

  Source type   Example tasks
  Text          machine translation, text summarization, dialog generation,
                grammatical error correction, open-domain question answering
  Image         image captioning, visual question answering,
                optical character recognition
  Audio         speech recognition, speech translation
  Video         video description
  Multimodal    multimodal machine translation
With the above limitations identified, we want to provide a unified and scalable solution that gets rid of all those constraints and is enhanced with a user-friendly interface as well as the latest NLP technologies. In this paper, we present VizSeq, a visual analysis toolkit for a wide variety of text generation tasks, which can be used for: 1) instance-level and corpus-level system error analysis; 2) exploratory dataset analysis; 3) public data hosting and system benchmarking. It provides visualization in Jupyter notebook or a web app interface. A system overview can be found in Figure 1. We open source the software at https://github.com/facebookresearch/vizseq.

Multimodal Data and Task Coverage
VizSeq has built-in support for multiple sources and references. The number of references is allowed to vary across examples, and the sources are allowed to come from different modalities, including text, image, audio and video. This flexibility enables VizSeq to cover a wide range of text generation tasks and datasets, far beyond the scope of machine translation, on which previous software mainly focuses. Table 1 provides a list of example tasks supported by VizSeq. Table 2 shows a comparison of VizSeq and its counterparts on metric coverage.
Embedding-based metrics N-gram-based metrics have difficulty capturing semantic similarity since they are usually based on exact word matches. As a complement, VizSeq also integrates the latest embedding-based metrics, such as BERTScore (Zhang et al., 2019) and LASER (Artetxe and Schwenk, 2018), which are rarely seen in its counterparts.
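The intuition behind embedding-based metrics can be illustrated with a minimal sketch (not VizSeq's or BERTScore's actual implementation): sentences are mapped to vectors and compared by cosine similarity, so paraphrases score highly even without exact word overlap. The word vectors below are hypothetical toy values standing in for pretrained embeddings.

```python
import math

# Toy word vectors standing in for contextual embeddings
# (hypothetical values; real metrics use pretrained models).
VECS = {
    "the": [0.1, 0.3], "cat": [0.9, 0.2], "sat": [0.4, 0.8],
    "kitten": [0.85, 0.25], "rested": [0.45, 0.75],
}

def sentence_vec(tokens):
    """Average the word vectors of a sentence."""
    dims = zip(*(VECS[t] for t in tokens))
    return [sum(d) / len(tokens) for d in dims]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

hyp = "the kitten rested".split()
ref = "the cat sat".split()
sim = cosine(sentence_vec(hyp), sentence_vec(ref))
# The similarity is high even though only "the" matches exactly,
# so an n-gram metric would barely reward this hypothesis.
```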
Scalability We re-implemented all the n-gram-based metrics with multiprocessing, allowing users to fully utilize the power of modern multi-core CPUs. We tested our multi-process versions on large evaluation sets and observed significant speedup over the original single-process ones (see Figure 2). VizSeq's embedding-based metrics are implemented with the PyTorch (Paszke et al., 2017) framework, which automatically parallelizes their computation on CPU or GPU.

Figure 3: VizSeq metric API (excerpt):

    from vizseq.scorers import register_scorer

    @register_scorer('metric name')
    def calculate_score(
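The parallelization pattern can be sketched with Python's multiprocessing.Pool; the metric below is a toy unigram precision standing in for VizSeq's real scorers, and the helper names are illustrative rather than VizSeq's internals.

```python
from multiprocessing import Pool

def sentence_unigram_precision(pair):
    """Toy sentence-level unigram precision (stand-in for a real n-gram metric)."""
    hyp, ref = pair
    hyp_toks, ref_toks = hyp.split(), ref.split()
    if not hyp_toks:
        return 0.0
    matched = sum(1 for t in hyp_toks if t in ref_toks)
    return matched / len(hyp_toks)

def corpus_scores(hyps, refs, workers=4):
    """Score sentence pairs in parallel across worker processes."""
    with Pool(workers) as pool:
        return pool.map(sentence_unigram_precision, list(zip(hyps, refs)))

if __name__ == "__main__":
    hyps = ["the cat sat", "a dog barked"]
    refs = ["the cat sat down", "the dog barked loudly"]
    scores = corpus_scores(hyps, refs)
```

Since each sentence is scored independently, the work splits cleanly across cores; only the scoring function must be picklable (defined at module top level).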
Versatility VizSeq's rich metric collection is not only available in Jupyter notebook and the web app; it can also be used in any Python script. A typical use case is periodic metric calculation during model training. VizSeq's implementations save time, especially when evaluation sets are large or evaluation is frequent. To allow user-defined metrics, we designed an open metric API, whose definition can be found in Figure 3.
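The open metric API can be sketched as a simple decorator-based registry. The names below mirror the fragment shown in Figure 3, but the exact VizSeq signature may differ; the registry and scorer here are a self-contained illustration, not VizSeq's code.

```python
# Hypothetical minimal registry in the spirit of VizSeq's scorer API.
_SCORERS = {}

def register_scorer(name):
    """Decorator that registers a scoring function under a metric name."""
    def wrap(fn):
        _SCORERS[name] = fn
        return fn
    return wrap

@register_scorer("unigram_precision")
def calculate_score(hypotheses, references):
    """Return a (corpus-level score, per-sentence scores) pair."""
    sent_scores = []
    for hyp, ref in zip(hypotheses, references):
        toks = hyp.split()
        matched = sum(1 for t in toks if t in ref.split())
        sent_scores.append(matched / len(toks) if toks else 0.0)
    corpus = sum(sent_scores) / len(sent_scores) if sent_scores else 0.0
    return corpus, sent_scores

corpus, per_sent = _SCORERS["unigram_precision"](
    ["the cat sat"], ["the cat sat down"])
```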

User-Friendly Interface
Given the drawbacks of simple command-line interface and static HTML interface, we aim at visualized and interactive interfaces for better user experience and productivity. VizSeq provides visualization in two types of interfaces: Jupyter notebook and web app. They share the same visual analysis module (Figure 4). The web app interface additionally has a data uploading module (Figure 9) and a task/dataset browsing module (Figure 10), while the Jupyter notebook interface gets data directly from Python variables. The analysis module includes the following parts.
Example grouping VizSeq uses sentence tags to manage example groups (data subsets of different interest, which may overlap). Tags are both user-defined and machine-generated (e.g. labels for identified languages, long sentences, sentences with rare words or code-switching). Metrics are calculated and visualized per example group as a complement to scores over the entire dataset.

Figure 4 (caption excerpt): (2) left: example index; right: user-defined tags (blue) and machine-generated tags (grey); (3) multimodal sources and Google Translate integration; (4) model predictions with highlighted matched (blue) and unmatched (red) n-grams; (5) sentence-level scores (highest ones among models in boldface, lowest ones in italics with underscore).
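Group-level scoring amounts to averaging sentence-level scores per tag, with a sentence allowed to carry several tags so groups can overlap. The helper below is a hypothetical sketch of that aggregation, not VizSeq's implementation.

```python
from collections import defaultdict

def scores_by_tag(sent_scores, sent_tags):
    """Average sentence-level scores per tag; a sentence may carry
    several tags, so groups can overlap (illustrative helper)."""
    groups = defaultdict(list)
    for score, tags in zip(sent_scores, sent_tags):
        for tag in tags:
            groups[tag].append(score)
    return {tag: sum(v) / len(v) for tag, v in groups.items()}

avg = scores_by_tag(
    [0.9, 0.5, 0.7],
    [["long"], ["long", "rare-words"], ["rare-words"]],
)
# "long" averages the first two sentences, "rare-words" the last two.
```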

Example viewing
VizSeq presents examples with various sentence-level scores and visualized alignments of matched/unmatched reference n-grams in model predictions. It also integrates Google Translate to assist with understanding text sources in unfamiliar languages, as well as to provide a baseline translation model. Examples are listed in multiple pages (bookmarkable in the web app) and can be sorted in various orders, for example by a certain metric or by source sentence length. Tags or n-gram keywords can be used to filter examples of interest.
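The matched/unmatched n-gram highlighting boils down to partitioning a prediction's n-grams by whether they also occur in the reference. A sketch of that partitioning (not VizSeq's internal code):

```python
def ngrams(tokens, n):
    """All n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def match_ngrams(prediction, reference, n=2):
    """Split the prediction's n-grams into those found in the reference
    (shown in blue in the UI) and the rest (shown in red)."""
    pred_ngrams = ngrams(prediction.split(), n)
    ref_ngrams = ngrams(reference.split(), n)
    return pred_ngrams & ref_ngrams, pred_ngrams - ref_ngrams

matched, unmatched = match_ngrams("the cat sat down", "the cat lay down")
# Only the bigram ("the", "cat") is shared between the two sentences.
```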
Dataset statistics VizSeq provides various corpus-level statistics, including: 1) counts of sentences, tokens and characters; 2) source and reference length distributions; 3) token frequency distribution; 4) a list of the most frequent n-grams (with links to associated examples); 5) distributions of sentence-level scores by models (Figures 5, 6 and 7). Statistics are visualized in zoomable charts with hover text hints.
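The first three statistics can be computed in a few lines; the sketch below is illustrative of what VizSeq visualizes, not its implementation (tokenization is naive whitespace splitting).

```python
from collections import Counter

def corpus_stats(sentences):
    """Sentence, token and character counts plus token frequencies."""
    tokens = [t for s in sentences for t in s.split()]
    return {
        "sentences": len(sentences),
        "tokens": len(tokens),
        "characters": sum(len(s) for s in sentences),
        "token_freq": Counter(tokens),
    }

stats = corpus_stats(["the cat sat", "the dog barked"])
# 2 sentences, 6 tokens, 25 characters; "the" occurs twice.
```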
Data export Statistics in VizSeq are one-click exportable: charts into PNG or SVG images (with users' zooming applied) and tables into CSV or LaTeX (copied to clipboard).

Data Management and Public Hosting
VizSeq's web app interface gets new data from the data uploading module (Figure 9) or a RESTful API. Besides local deployment, the web app back-end can also be deployed onto public servers to provide a general solution for hosting public benchmarks on a wide variety of text generation tasks and datasets. In VizSeq, data is organized in a special folder structure that is easy to maintain:

    <task>/<eval set>/source_*.{txt,zip}
    <task>/<eval set>/reference_*.txt
    <task>/<eval set>/tag_*.txt
    <task>/<eval set>/<model>/prediction.txt
    <task>/<eval set>/__cfg__.json

When new data comes in, scores, n-grams and machine-generated tags are pre-computed and cached to disk automatically. A file monitoring and versioning system (based on file hashes, sizes or modification timestamps) detects file changes and triggers the necessary updates on pre-computed results. This is important for supporting evaluation during model training, where model predictions change over training epochs.
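Change detection of this kind can be sketched with a content-hash fingerprint per file; the helper names and cache shape below are hypothetical, shown only to illustrate the idea of invalidating cached scores when a prediction file changes.

```python
import hashlib
import os
import tempfile

def file_fingerprint(path):
    """Fingerprint a file by content hash and size; any change in
    either invalidates cached results for that file."""
    with open(path, "rb") as f:
        digest = hashlib.sha1(f.read()).hexdigest()
    return digest, os.stat(path).st_size

def changed(path, cache):
    """Return True (and update the cache) if the file differs from its
    cached fingerprint, signalling that pre-computed scores are stale."""
    fp = file_fingerprint(path)
    if cache.get(path) == fp:
        return False
    cache[path] = fp
    return True

# Demo: a prediction file is cached, then modified, then flagged stale.
cache = {}
fd, path = tempfile.mkstemp()
os.write(fd, b"model predictions v1")
os.close(fd)
first = changed(path, cache)    # new file: needs pre-computation
second = changed(path, cache)   # unchanged: cached results reused
with open(path, "wb") as f:
    f.write(b"model predictions v2")
third = changed(path, cache)    # content changed: recompute
os.remove(path)
```

A production system would also watch modification timestamps to avoid hashing unchanged files on every poll; hashing alone is shown here for brevity.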

Example Use Cases of VizSeq
We validate the usability of VizSeq on multiple tasks and datasets, which are included as examples in our GitHub repository:

• WMT14 English-German: a classic medium-size dataset for bilingual machine translation.
• COCO captioning 2015 (Lin et al., 2014): a classic image captioning dataset, where VizSeq can present source images with text targets.

Figure 9: VizSeq data uploading. Users need to organize the files into the given folder structure and pack them into a zip file for upload. VizSeq will unpack the files to the data root folder and perform integrity checks.

Figure 10: VizSeq task/dataset browsing. Users need to select a dataset and models of interest to proceed to the analysis module.
• WMT16 multimodal machine translation task 1: English-German translation with an image that the sentences describe. VizSeq can present both text and image sources, and calculate the official BLEU, METEOR and TER metrics.

Conclusion
In this paper, we present VizSeq, a visual analysis toolkit for {text, image, audio, video}-to-text generation system evaluation, dataset analysis and benchmark hosting. It is accessible as a web app, or as a Python package in Jupyter notebooks or Python scripts. VizSeq is currently under active development, and our future work includes: 1) enabling image-to-text and video-to-text alignments; 2) adding human assessment modules; 3) integration with popular text generation frameworks such as fairseq, opennmt and tensor2tensor.