Towards a dialogue system that supports rich visualizations of data

The goal of our research is to support full-fledged dialogue between a user and a system that transforms the user's queries into visualizations. So far, we have collected a corpus in which users explore data via visualizations, annotated the corpus for user intentions, and developed the core NL-to-visualization pipeline.


Introduction
Visualization, even in its simplest forms, remains a highly effective means of converting large volumes of raw data into insight. Still, even with the aid of robust visualization software such as Tableau¹ and ManyEyes (Viegas et al., 2007), users, and novices in particular, face challenges when attempting to translate their questions into appropriate visual encodings (Heer et al., 2008; Grammel et al., 2010). Ideally, users would like to tell the computer what they want to see, and have the system intelligently create the visualization. However, existing systems (Cox et al., 2001; Sun et al., 2013; Gao et al., 2015) do not offer two-way communication, support only limited types of queries, or are not grounded in how users explore data.
Our goal is to develop Articulate 2, a full-fledged conversational interface that will automatically generate visualizations. The contributions of our work so far are a new corpus, unique in its genre,² and a prototype system that can process a sequence of requests, create the corresponding visualizations, position them on the screen, and manage them.
¹ http://www.tableau.com/
² The corpus will be released at the end of the project.

Related Work
Much work has focused on the automatic generation of visual representations, but not via NL (Feiner, 1985; Roth et al., 1994; Mackinlay et al., 2007). Likewise, much work is devoted to multimodal interaction with visual representations (e.g. Walker et al., 2004; Meena et al., 2014), but not to automatically generating those visual representations. Systems like AutoBrief (Green et al., 2004) focus on producing graphics accompanied by text; other work focuses on finding the appropriate graphics to accompany existing text (Li et al., 2013). Cox et al. (2001) and Reithinger et al. (2005) were among the first to integrate a dialogue interface into an existing information visualization system, but they support only a small range of questions. Our own Articulate (Sun et al., 2013) maps NL queries to statistical visualizations using very simple NLP methods. When DataTone (Gao et al., 2015), the closest to our work, cannot resolve an ambiguity in an NL query, it presents the user with selection widgets to resolve it. However, only one visualization is presented to the user at a given time, and previous context is lost. Gao et al. (2015) compare DataTone to IBM Watson Analytics, which allows users to interact with data via structured language queries but does not support dialogic interaction either.

A new corpus
Fifteen subjects, 8 male and 7 female, interacted with a remote Data Analysis Expert (DAE) who assisted the subject in an exploratory data analysis task: analyze crime data from 2010-2014 to provide suggestions as to how to deploy police officers in four neighborhoods in Chicago. Each session consisted of multiple cycles of visualization construction [...] Subjects were instructed to ask spoken questions directly to the DAE (they knew the DAE was human, but couldn't make direct contact). Users viewed visualizations and limited communications from the DAE on a large, tiled-display wall. This environment allowed analysis across many different types of visualizations (heat maps, charts, line graphs) at once (see Figure 1). The DAE viewed the subject through two high-resolution, direct video feeds, and also had a mirrored copy of the tiled-display wall on two 4K displays. The DAE generated responses to questions using Tableau, and used SAGE2 (Marrinan et al., 2014), a collaborative large-display middleware, to drive the display wall. The DAE could also communicate via a chat window, but confined herself to messages of the types specified in Table 1. Apart from greetings and status messages (sorry, it's taking long), the DAE would occasionally ask for clarifications, e.g. Did you ask for thefts or batteries. That is, the DAE never responded with a message if the query could be directly visualized; neither did the DAE engage in multi-turn elicitations of the user requirements. Basically, the DAE tried to behave as a system with limited dialogue capabilities would.
Table 2 shows summary statistics for our data, which was transcribed in its entirety. So far, we [...] Since no appropriate coding scheme exists, we developed our own. Three coders identified the directly actionable utterances, namely, those utterances which directly affect what the DAE is doing.
Each utterance was either left unlabeled or labeled with one of 10 codes (κ = 0.84 (Cohen, 1960) on labeling an utterance or leaving it unlabeled; κ = 0.74 on the 10 codes). The ten codes derive from six different types of actionable utterances, which are further differentiated depending on the type of their argument. The six high-level labels are: requests to create new visualizations (8%, e.g. Can I see number of crimes by day of the week?); modifications to existing visualizations (45%, Umm, yeah, I want to take a look closer to the metro right here, umm, a little bit eastward of Greektown); window management instructions (12.5%, If you want you can close these graphs as I won't be needing it anymore); fact-based questions, whose answer doesn't necessarily require a visualization (7%, During what time is the crime rate maximum, during the day or the night?); requests for clarification (20.5%, Okay, so is this statistics from all 5 years? Or is this for a particular year?); and expressions of preference (7%, The first graph is a better way to visualize rather than these four separately).
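To make the agreement figures concrete, the following is a minimal sketch of how Cohen's κ is computed for a pair of coders. The label sequences are invented toy data, not the corpus annotations, and the paper does not state how the three coders were paired; the function name `cohen_kappa` is our own.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each coder's marginal label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both coders used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Invented toy data: "A" = directly actionable, "-" = left unlabeled.
coder_1 = ["A", "-", "-", "A", "-", "-", "A", "-", "-", "-"]
coder_2 = ["A", "-", "-", "-", "-", "-", "A", "-", "A", "-"]
kappa = cohen_kappa(coder_1, coder_2)
print(round(kappa, 3))
```

Here the coders agree on 8 of 10 items, but because both leave most utterances unlabeled, chance agreement is already 0.58, so κ is considerably lower than the raw agreement.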
Three main themes have emerged from the analysis of the data. 1) Directly actionable requests cover only about 15% of what the subject is saying; the remaining 85% provides context that informs the requests (see Section 6). 2) Even the directly actionable 15% cannot be directly mapped to visualization specifications; intermediate representations are needed. 3) Managing the visualizations that are generated and positioned on the screen is an orthogonal dimension.
So far, we have made progress on issues 2) and 3). The NL-to-visualization pipeline we describe next integrates state-of-the-art components to build a novel conversational interface. At the moment, the dialogue initiative is squarely with the user, since the system only executes the requests. However, strong foundations are in place for it to become a full conversational system.

The NL-to-visualization pipeline
The pipeline in Figure 2 illustrates how Articulate 2 processes a spoken utterance, first by translating it into a logical form and then into a visualization specification to be processed by the Visualization Executor (VE). For create/modify visualization requests, an intermediate SQL query is also generated.
Before providing more details on the pipeline, we present in Figure 3 one example comprising a sequence of four requests, which results in three visualizations. The user speaks the utterances to the system, which recognizes them using the Google Speech API. The first utterance asks for a heatmap of the "River North" and "Loop" neighborhoods (two downtown areas in Chicago). The system generates the visualization in the upper-left corner of the figure. In response to utterance b, Articulate 2 generates a new visualization, which is added to the first visualization (see bottom of screen in the middle); it is a line graph because the utterance requests the aggregate temporal attribute "year", as we discuss below. The third request contains no aggregate temporal attributes, and hence the system produces a bar chart, which is also added to the display. Finally, for the last request d), the system closes the most recently generated visualization, i.e. the bar chart (this is not shown in Figure 3).
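The choice between a line graph and a bar chart in this example can be sketched as a simple rule. This is an illustration under stated assumptions, not the Articulate 2 implementation: the set `TEMPORAL_ATTRIBUTES` and the function `choose_plot_type` are hypothetical, and only "year" is named as a temporal attribute in the text. Heat maps, which the first utterance requests explicitly, are not covered.

```python
# Assumed set of temporal attributes; the text names only "year".
TEMPORAL_ATTRIBUTES = {"year", "month", "day", "hour"}


def choose_plot_type(aggregate_attributes):
    """Return "LINE" if any aggregate attribute is temporal, else "BAR"."""
    if any(attr.lower() in TEMPORAL_ATTRIBUTES for attr in aggregate_attributes):
        return "LINE"
    return "BAR"


plot_for_b = choose_plot_type(["year"])      # utterance b aggregates by year
plot_for_c = choose_plot_type(["location"])  # utterance c has no temporal aggregate
print(plot_for_b, plot_for_c)
```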

Parsing
We begin by parsing the utterance we obtain from the Google Speech API into three NLP structures. ClearNLP (Choi, 2014) is used to obtain PropBank (Palmer et al., 2005) semantic role labels (SRLs), which are then mapped to VerbNet (Kipper et al., 2008) and WordNet using SemLink (Palmer, 2009). The Stanford Parser is used to obtain the remaining two structures, i.e. the syntactic parse tree and dependency tree. The final formulation is the conjunction $C_{predicate} \cap C_{agent} \cap C_{patient} \cap C_{det} \cap C_{mod} \cap C_{action}$. The first three clauses are extracted from the SRLs. The NPs from the syntactic parse tree contain the determiners for $C_{det}$, adjectives for $C_{mod}$, and nouns as arguments for $C_{action}$.
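As an illustration of how such a conjunction might be assembled, the sketch below collects the six clause types from already-parsed input. The input shapes (a dict of semantic roles and a list of noun-phrase dicts), the `Clause` class, and the example utterance are all assumptions made for this sketch; they are not the ClearNLP or Stanford Parser output formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clause:
    kind: str   # "predicate", "agent", "patient", "det", "mod" or "action"
    value: str


def build_logical_form(srl, noun_phrases):
    """Collect clauses into one conjunction, represented as a list.

    srl: dict with optional "predicate", "agent", "patient" entries.
    noun_phrases: list of dicts with "det", "adjectives", "nouns" entries.
    """
    clauses = []
    # The first three clause types come from the semantic role labels.
    for role in ("predicate", "agent", "patient"):
        if srl.get(role):
            clauses.append(Clause(role, srl[role]))
    # Determiners, adjectives and nouns come from the noun phrases.
    for phrase in noun_phrases:
        if phrase.get("det"):
            clauses.append(Clause("det", phrase["det"]))
        clauses.extend(Clause("mod", adj) for adj in phrase.get("adjectives", []))
        clauses.extend(Clause("action", noun) for noun in phrase.get("nouns", []))
    return clauses


def render(clauses):
    """Print the conjunction using the notation of the text."""
    return " ∩ ".join(f"{c.kind}({c.value})" for c in clauses)


# Hypothetical parse of "show the violent crimes".
srl = {"predicate": "show", "patient": "the violent crimes"}
noun_phrases = [{"det": "the", "adjectives": ["violent"], "nouns": ["crimes"]}]
logical_form = render(build_logical_form(srl, noun_phrases))
print(logical_form)
```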

Request Type Classification
A request is classified into one of the six actionable types mentioned earlier, for which we developed a multi-class classifier (see Table 3). We used Weka (Hall et al., 2009) to experiment with several classifiers. We will discuss their performance in Sec. 5; currently, we use the SVM model, which performs best.
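The classifiers in the paper were trained in Weka on features that are not reproduced here, and the deployed model is an SVM. Purely to illustrate request type classification, the sketch below implements a small multinomial Naive Bayes classifier (one of the classifier families compared in the evaluation) over bag-of-words features. The features, the training utterances, and the shortened label names are all invented for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


class MultinomialNB:
    """Bag-of-words multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training utterances for three of the six request types.
TRAIN = [
    ("can i see crimes by day of the week", "create"),
    ("show me a heat map of thefts", "create"),
    ("zoom in a little bit eastward", "modify"),
    ("now filter that graph to 2014 only", "modify"),
    ("you can close these graphs", "window"),
    ("please close that window", "window"),
]
clf = MultinomialNB().fit(*zip(*TRAIN))
prediction = clf.predict("close the graphs")
print(prediction)
```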

Window Management Requests
If the classifier assigns the window management type to an utterance, a logical form along the lines described above is generated, but no SQL query is produced. At the moment, keyword extraction is used to determine whether the window management instruction relates to closing, opening, or repositioning; the system only supports closing the most recently created visualization.
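A minimal sketch of this keyword-based handling follows. The keyword lists and function names are hypothetical, since the paper does not list the actual keywords; as in the text, only closing the most recently created visualization has any effect.

```python
# Hypothetical keyword lists; the actual keywords are not given in the text.
ACTION_KEYWORDS = {
    "close": {"close", "remove", "delete"},
    "open": {"open", "reopen"},
    "reposition": {"move", "drag", "reposition"},
}


def detect_window_action(utterance):
    """Return "close", "open", "reposition", or None if no keyword matches."""
    words = set(utterance.lower().replace(",", " ").replace(".", " ").split())
    for action, keywords in ACTION_KEYWORDS.items():
        if words & keywords:
            return action
    return None


def handle_window_request(utterance, open_visualizations):
    """Close the most recently created visualization when asked to close.

    open_visualizations is ordered from oldest to newest and is modified
    in place. Returns the closed visualization, or None if nothing is done.
    """
    if detect_window_action(utterance) == "close" and open_visualizations:
        return open_visualizations.pop()
    return None


screen = ["heat map", "line graph", "bar chart"]
closed = handle_window_request("If you want you can close these graphs", screen)
print(closed, screen)
```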

Create/Modify Visualization Requests
If the utterance is classified as a request to create or modify visualizations, the logical form is used to produce an SQL query. SQL was chosen partly because the crime data we obtained from the City of Chicago is stored in a relational database. Most often, users include constraints in their requests that can be conceptualized as standard filter and aggregate visualization operators. In utterance c in Figure 3, assaults can be considered as a filter, and location as an aggregator (location is meant as office, restaurant, etc.).
Figure 2: NL-to-Visualization Pipeline
We distinguish between filter and aggregate based on types stored in the KO, a small domain-dependent knowledge ontology. The KO contains relations, attributes, and attribute values. Filters such as "assault" are defined as attribute values in the KO, whereas aggregate objects such as "location" are attribute names. A synonym lexicon contains synonyms corresponding to each entry in the KO. SQL naturally supports these operators, since the data can be filtered using the "WHERE" clause and aggregated with the "GROUP BY" clause.
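The mapping from filters and aggregates to SQL can be sketched as follows. The toy ontology `KO`, the synonym lexicon, and the table, column, and value names are assumptions made for illustration; they do not reflect the actual Chicago crime schema or the Articulate 2 code. The sketch handles only one filter value per attribute.

```python
import sqlite3

# Toy knowledge ontology; relation, attribute and value names are assumed.
KO = {
    "relation": "crimes",
    "attributes": {
        "crime_type": {"assault", "theft", "battery"},
        "location": {"restaurant", "office", "street"},
        "year": set(),
    },
}
# Toy synonym lexicon mapping surface words to KO entries.
SYNONYMS = {"assaults": "assault", "thefts": "theft", "place": "location"}


def build_query(terms):
    """Map request terms to (sql, params) using the KO.

    A term that is an attribute value becomes a WHERE filter; a term that
    is an attribute name becomes a GROUP BY aggregate.
    """
    filters, aggregates = [], []
    for term in terms:
        term = SYNONYMS.get(term, term)
        if term in KO["attributes"]:
            aggregates.append(term)
            continue
        for attribute, values in KO["attributes"].items():
            if term in values:
                filters.append((attribute, term))
    select = ", ".join(aggregates + ["COUNT(*) AS total_crime"])
    sql = f"SELECT {select} FROM {KO['relation']}"
    if filters:
        sql += " WHERE " + " AND ".join(f"{attr} = ?" for attr, _ in filters)
    if aggregates:
        sql += " GROUP BY " + ", ".join(aggregates)
    return sql, [value for _, value in filters]


# Terms as they might be extracted from utterance c (assaults by location).
sql, params = build_query(["assaults", "location"])

# Run the query against a tiny in-memory table with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crimes (crime_type TEXT, location TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO crimes VALUES (?, ?, ?)",
    [("assault", "restaurant", 2012), ("assault", "restaurant", 2013),
     ("assault", "office", 2013), ("theft", "office", 2014)],
)
rows = sorted(conn.execute(sql, params).fetchall())
print(sql, rows)
```

Passing the filter values as query parameters rather than splicing them into the SQL string keeps user-supplied words out of the statement text.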

Visualization Specification
The final transformation is from SQL to visualization specification. Overall, the specification for creating a new visualization includes the x-axis, y-axis, and plot type. Finally, the VE uses Vega (Trifacta, 2014) to plot a graphical representation of the visualization specification on SAGE2. We currently support 2-D bar charts, line graphs, and heat maps. The different representations for sentence c) from Figure [...] "NON UNIT", "horizontalGroupAxis": "location", "verticalAxis": "TOTAL CRIME", "plotType": "BAR"}

Evaluation
Since the work is in progress, a controlled user study cannot be carried out until all the components of the system are in place. We have so far conducted smaller, piecemeal, and/or informal evaluations of its components. For example, we have manually inspected the results of the pipeline on the 38 requests that concern creating new visualizations. The pipeline produces the correct SQL expression (that is, the actual SQL that a human would produce for a given request) for 31 (81.6%) of them (spoken features such as filled pauses and restarts were removed, but the requests are otherwise processed unaltered). The seven unsuccessful requests fail for various reasons: two are fact-based questions that cannot be answered yet; two require mathematical operations on the data that are not currently supported; one does not have a main verb; and one does not name any attributes or values (can I see the map; in the future, our conversational interface will ask for clarification). For the last one, the SQL query is generated, but it is very complex and the system times out.
As concerns classifying the request type, Table 4 reports the results of the classifiers trained on the features discussed in Section 4.2. The SVM results are statistically significantly different from the Naive Bayes results (paired t-test), but indistinguishable from Random Forest or Multinomial Naive Bayes.
As concerns the whole pipeline, our preliminary [...]

Current Work
Annotation. We are focusing on referring expressions (see below), and on the taxonomy of abstract visualization tasks from Brehmer and Munzner (2013). This taxonomy, which includes why a task is performed, will help us analyze the 85% of the users' utterances that are not directly actionable. In fact, many of those indicate the why, i.e. the user's goal (e.g., "I want to identify the places with violent crimes.").
Dialogue Manager / Referring Expressions. We are developing a Dialogue Manager (DM) and a Visualization Planner (VP) that will be in a continuous feedback loop. If the DM judges the query to be unambiguous, it will pass it to the VP. If not, the DM will generate a clarification request for the user. We will focus on referring expression resolution, which is necessary when the user asks for a modification to a previous visualization or wants to manipulate a particular visualization. In this domain, referring expressions can refer to graphical elements or to what those graphical elements represent (color the short bar red vs. color the theft bar red), which creates an additional dimension of coding and an additional layer of ambiguity.
The Visualization Planner. The VP needs both to create more complex visualizations and to manage the screen real estate when several visualizations are generated (which is the norm in our data collection, see Figure 1, and is reflected in the system's output in Figure 3). The VP will determine the relationships between the visualizations on screen and make decisions about how to position them effectively. For instance, if a set of visualizations is part of a time series, they might be more effective if ordered on the display.