Photon: A Robust Cross-Domain Text-to-SQL System

Natural language interfaces to databases (NLIDB) democratize end user access to relational data. Due to fundamental differences between natural language communication and programming, it is common for end users to issue questions that are ambiguous to the system or fall outside the semantic scope of its underlying query language. We present PHOTON, a robust, modular, cross-domain NLIDB that can flag natural language input to which a SQL mapping cannot be immediately determined. PHOTON consists of a strong neural semantic parser (63.2% structure accuracy on the Spider dev benchmark), a human-in-the-loop question corrector, a SQL executor and a response generator. The question corrector is a discriminative neural sequence editor which detects confusion span(s) in the input question and suggests rephrasings until a translatable input is given by the user or a maximum number of iterations is reached. Experiments on simulated data show that the proposed method effectively improves the robustness of a text-to-SQL system against untranslatable user input. The live demo of our system is available at http://www.naturalsql.com.


... (Jagadish, 2014; Setlur et al., 2016, 2019). While they have been shown effective in pilot studies and production, rule-based approaches are limited in terms of coverage, scalability and naturalness: they are not robust against the diversity of human language expressions and are difficult to scale across domains.
We present PHOTON, a modular, cross-domain NLIDB that adopts deep learning in its core components. PHOTON consists of (1) a neural semantic parser, (2) a human-in-the-loop question corrector, (3) a SQL query executor and (4) a natural language response generator. The neural semantic parser assumes limited DB content access due to data privacy concerns (§3.1). It employs a BERT-based (Devlin et al., 2019) DB schema-aware question encoder and a pointer-generator decoder (See et al., 2017a) with a static SQL correctness check. It achieves competitive performance on the popular cross-domain text-to-SQL benchmark, Spider (Yu et al., 2018), with 63.2% structure accuracy on the dev set based on the official evaluation. The question corrector is a neural sequence editor which detects potential confusion span(s) in the input question and suggests possible corrections for the user to give feedback on. When an input question is successfully translated into an executable SQL query, the response generator generates a natural language response conditioned on the output of the SQL query executor.
A pilot study with non-expert SQL users shows that the system effectively increases the flexibility of users' natural language expression and is easy to adapt to unseen databases. Being able to detect and correct untranslatable questions reduces unexpected error cases during user interaction.

System Design
In this section, we elaborate on the system design of PHOTON. Note that we are continuously improving the performance of the neural semantic parser. Currently the semantic parser only accepts a standalone question as input; we plan to also model the interaction context in future work.

Overview
Figure 1 shows the overall workflow of our system. PHOTON is an end-to-end system that takes a user question and a database schema as input, and outputs the query result after executing the generated SQL query on the database. PHOTON is a modular framework designed towards practical industrial applications. The core modules in PHOTON are the SQL parser and the confusion detection mechanism. The SQL parser parses the input question and database schema and maps them into an executable SQL query via an encoder-decoder framework. The confusion detection module identifies untranslatable questions and captures the confusing span of an untranslatable question. The confusing tokens, together with their context, are fed into the auto-correction module to predict the user's intended question.
To make it more applicable and accessible for users to query the database in a natural way, PHOTON also provides a user interaction module that enables users to refine their queries while interacting with the system. Response generation handles the output of the system by transducing the database-style query result into natural language, or posting a warning when the query is not executable on the database, making the system more user-friendly. Note that the response generation module in the current version is implemented using a template-based approach and could be improved with more advanced response generation models.

UI Design
Our system UI consists of three panels: chat window, schema viewer and results viewer (Figure 3).
• Chat window: This is a standard chat window that facilitates communication between the user and PHOTON. The user types the natural language input and the natural language responses of the system are displayed.
• Schema viewer: This view provides a graph visualization of the underlying relational DB schema. The panel is hideable and will not be shown in case the DB schema is confidential.
• Result viewer: This view displays the returned results of an executable SQL query mapped from a confirmed input question. The SQL query is formatted and displayed at the top for user verification. Multi-record results are presented as sub-tables. A result consisting of a single table cell is presented as a one-cell sub-table. If the result comes from an aggregation operation such as counting, the data records supporting the calculation are also shown for explainability. Confidential DB records are hidden from the display and the user is informed of the number of hidden records.

Cross Domain
A relational DB for user queries must be selected before usage. PHOTON provides a collection of default databases and allows users to upload their own DBs for testing. Users can select the database they want to query by clicking the "Selected Database" drop-down button.

Dual Query Mode
PHOTON accepts both natural language questions and well-formed SQL queries as input. It automatically detects the user input type and executes the input immediately if it is a valid SQL query. We observed that the dual query mode can be more efficient than the NL-only mode, especially for users with a SQL background.

Neural Semantic Parser
The neural semantic parser is an end-to-end model that takes a user question and the DB schema as input and outputs a SQL query. Due to data privacy concerns, we assume that the neural semantic parser does not have full access to the DB content. Instead, we assume that for each DB field, the parser has access to the set of possible values of the field, for example, "Country.Region": {"Caribbean", "Porto Rico", ...}. We call such value sets "picklists" by industry convention.

Schema-Question Encoder
Following previous work (Hwang et al., 2019; Zhang et al., 2019), we serialize the relational DB schema and concatenate it to the user question. As shown in Figure 4, we represent each table with the table name followed by a sequence of field names. Each table name is preceded by the special token [T] and each field name is preceded by the special token [C]. The representations of multiple tables are concatenated together to form the serialization of the schema, which is surrounded by [SEP] tokens and concatenated to the question. Finally, the question is preceded by the [CLS] token, following the convention of the BERT encoder (Devlin et al., 2019). This sequence is fed into the pre-trained BERT model, followed by a bi-directional LSTM, to form a joint encoding of the question and schema, h_input. The text portion of h_input is passed through another bi-LSTM to obtain the question encoding h_Q. We represent each schema component (tables and fields) using the slices of h_input corresponding to the special tokens [T] and [C].
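To make the serialization concrete, here is a minimal sketch that builds the tagged input sequence; the toy schema and whitespace tokenization are illustrative assumptions (the actual encoder operates on BERT word-pieces).

```python
def serialize_input(question, schema):
    """Build the encoder input: [CLS] question [SEP] [T] table [C] col ... [SEP].
    `schema` is assumed to be a list of (table_name, [column_names]) pairs."""
    tokens = ["[CLS]"] + question.split() + ["[SEP]"]
    for table_name, columns in schema:
        tokens += ["[T]", table_name]
        for column in columns:
            tokens += ["[C]", column]
    tokens.append("[SEP]")
    return tokens

# Toy example with a single table.
print(serialize_input("How many countries exist ?",
                      [("country", ["name", "region", "population"])]))
```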

Meta-data Features
We further train dense look-up features to represent whether a field is a primary key (f_pri), whether a field appears in a foreign key pair (f_for), and the data type of the field (f_type). These meta-data features are fused with the representations in h_input via a projection layer g to obtain the final representation of each schema component:

h^{C_p} = g([h^{input}_m; f^{pri}_i; f^{for}_j; f^{type}_k]),    h^{T_q} = g([h^{input}_n; 0; 0; 0]),

where m is the index of the special token corresponding to the p-th column in the input, n is the index of the special token corresponding to the q-th table in the input, i, j and k are the feature indices indicating the corresponding properties of the p-th column, and [·; ·] denotes the concatenation of the four vectors. Since the meta-data features we include are specific to fields, the table representations are fused with zero place-holder vectors.
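A minimal PyTorch sketch of this fusion is given below; the embedding vocabulary sizes and the single linear projection standing in for g are illustrative assumptions, with the hidden size set to 512 as in our configuration.

```python
import torch
import torch.nn as nn

class SchemaFeatureFusion(nn.Module):
    """Fuse a field encoding from h_input with its meta-data embeddings
    (primary key, foreign key, data type) through a projection layer g."""
    def __init__(self, hidden=512, num_types=8):
        super().__init__()
        self.f_pri = nn.Embedding(2, hidden)            # is / is not a primary key
        self.f_for = nn.Embedding(2, hidden)            # in / not in a foreign key pair
        self.f_type = nn.Embedding(num_types, hidden)   # field data type
        self.g = nn.Linear(4 * hidden, hidden)

    def forward(self, h_field, pri, foreign, dtype):
        feats = torch.cat([h_field, self.f_pri(pri),
                           self.f_for(foreign), self.f_type(dtype)], dim=-1)
        return self.g(feats)

# Toy usage: one field encoding fused with (primary key, no foreign key, type 3).
fusion = SchemaFeatureFusion()
print(fusion(torch.randn(512), torch.tensor(1), torch.tensor(0),
             torch.tensor(3)).shape)   # torch.Size([512])
```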

Decoder
We use an LSTM-based sequential pointer-generator (See et al., 2017b) as the decoder. The generation vocabulary of our decoder consists of 70 SQL keywords and reserved tokens, plus the 10 digits. At each step, the decoder computes a probability distribution over actions that consist of generating a token from the reserved vocabulary, copying a token from the input text, or copying a schema component.
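The sketch below illustrates how one decoding step could score the three action types jointly; it is a simplified view that omits the attention and generation-gate machinery of the full pointer-generator (See et al., 2017b), and all dimensions are toy values.

```python
import torch
import torch.nn.functional as F

def decoder_action_distribution(dec_state, gen_vocab_emb, question_enc, schema_enc):
    """One decoding step: score generating a reserved SQL token vs. copying a
    question token vs. copying a schema component, normalized jointly.
    dec_state:     (hidden,)        current decoder LSTM state
    gen_vocab_emb: (V_gen, hidden)  embeddings of SQL keywords / digits
    question_enc:  (T_q, hidden)    encoder states of question tokens
    schema_enc:    (T_s, hidden)    encoder states of tables and fields
    """
    scores = torch.cat([gen_vocab_emb @ dec_state,   # generate actions
                        question_enc @ dec_state,    # copy-question actions
                        schema_enc @ dec_state])     # copy-schema actions
    return F.softmax(scores, dim=0)

# Toy usage: 70 keywords + 10 digits, 12 question tokens, 9 schema components.
dist = decoder_action_distribution(torch.randn(512), torch.randn(80, 512),
                                   torch.randn(12, 512), torch.randn(9, 512))
print(dist.shape)   # torch.Size([101])
```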

Static SQL Correctness Check
The sequential pointer-generator we adopted does not guarantee that the output SQL is syntactically correct. In practice, we perform beam-search decoding and run a static SQL correctness check to eliminate erroneous predictions from the beam. Specifically, we employ a tool implemented on top of the Mozilla SQL Parser to analyze the output SQL queries and ensure they satisfy the following criteria: (1) the SQL query is syntactically correct; (2) the SQL query satisfies schema consistency.
We found this approach to be very effective, resulting in an absolute improvement of 4-5% in the evaluation score on the Spider dev set (Yu et al., 2018).
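As an illustration of the idea (not the actual implementation, which is built on top of the Mozilla SQL Parser), the sketch below compiles each beam candidate with SQLite's EXPLAIN against the target schema, which rejects both syntactically invalid and schema-inconsistent queries without executing them.

```python
import sqlite3

def filter_beam(beam, schema_ddl):
    """Keep only beam candidates that compile against the schema.
    `schema_ddl` is a list of CREATE TABLE statements for the target DB."""
    conn = sqlite3.connect(":memory:")
    for ddl in schema_ddl:
        conn.execute(ddl)
    valid = []
    for sql, score in beam:
        try:
            conn.execute("EXPLAIN " + sql)   # compile only, do not run
            valid.append((sql, score))
        except sqlite3.Error:
            continue                          # drop erroneous predictions
    conn.close()
    return valid

# Toy usage: the second candidate references an unknown table and is filtered out.
ddl = ["CREATE TABLE country (name TEXT, region TEXT, population INTEGER)"]
beam = [("SELECT count(*) FROM country", -0.1),
        ("SELECT count(*) FROM countries", -0.3)]
print(filter_beam(beam, ddl))
```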

Picklist Incorporation
We use picklists to inform the semantic parser about potential value matches in the DB. For an input question Q and a field C_p, we compute the longest character sequence match between Q and each value in the picklist of C_p. We select the value with the top-1 matching score above a certain threshold θ as a match. For each field with a matched picklist value, we append the surface form of the value to it in the input sequence representation, separated by the special token [V]. The augmented sequence is used as the input to the schema-question encoder.

In practice, we found that picklist augmentation results in an absolute performance improvement of 1% on the Spider dev set.

Figure 5 illustrates the input sequence with augmented picklist values. In this example, the matching algorithm identifies "Caribbean", associated with the column "Country.Region", as a match. Hence it inserts "Caribbean" after [..., [C], "Region"], with [V] as a separation token. The representations of fields with no picklist value match are unchanged.
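A rough sketch of the matching step is shown below; the normalization of the match score and the threshold value are illustrative assumptions rather than the exact heuristics used in PHOTON.

```python
from difflib import SequenceMatcher

def match_picklist(question, picklist, theta=0.85):
    """Return the picklist value with the highest longest-character-sequence
    match against the question, if its normalized score exceeds theta."""
    best_value, best_score = None, 0.0
    q = question.lower()
    for value in picklist:
        m = SequenceMatcher(None, q, value.lower()).find_longest_match(
            0, len(q), 0, len(value))
        score = m.size / max(len(value), 1)   # fraction of the value matched
        if score > best_score:
            best_value, best_score = value, score
    return best_value if best_score >= theta else None

# "Caribbean" in the question matches the picklist of "Country.Region".
print(match_picklist("How many countries are in the Caribbean ?",
                     ["Caribbean", "Eastern Asia", "Polynesia"]))
```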

Confusion Detection: Handling Untranslatable and Ambiguous Input

In order to handle ambiguous and untranslatable input questions, PHOTON adopts a discriminatively trained classifier to detect user input to which a SQL mapping cannot be immediately determined. This covers questions that are incomplete (e.g. What is the total?), ambiguous or vague (e.g. Show me homes with good schools), beyond the representation scope of SQL (e.g. How many tourists visited all of the 10 attractions?), or simply noisy (e.g. Cyrus teaches physics in department).

Untranslatable Question Detection
Inspired by Rajpurkar et al. (2018), we create a synthetic dataset which consists of untranslatable questions generated by applying rule-based transformations and adversarial filtering (Zellers et al., 2018) to examples in existing text-to-SQL datasets.
We then train a stagewise model that first classifies if the input is translatable or not, and then predicts confusing spans in an untranslatable input.
Dataset Construction. In order to construct the untranslatable questions, we first examine the types of untranslatable questions seen in the manually constructed CoSQL (Yu et al., 2019a) and MultiWOZ (Budzianowski et al., 2018) datasets (Table 4 of A.1). We then design modification strategies to generate untranslatable questions from the original text-to-SQL dataset automatically. Specifically, for a text-to-SQL example that contains a natural language question, a DB schema and a SQL query, we first identify all non-overlapping question spans that possibly refer to a table field occurring in the SELECT and WHERE clauses of the SQL query, using string-matching heuristics. Then we apply Swap and Drop operations on the question and DB schema respectively to generate different types of untranslatable questions. The modified tokens are marked as the confusion spans of the synthetic untranslatable questions, except for the question Drop strategy. Table 5 in A.1 provides a detailed summary of all transformations applied. To introduce semantic variation and ensure grammatical fluency, we apply back-translation to the generated question using the Google Cloud Translation API (https://cloud.google.com/translate/), with Chinese as the intermediate language. For example, given the original question "How many countries exist?", "countries" is detected to be referring to a table field. We drop the token, and pass the modified question "How many exist?" to back-translation for grammar smoothing. After that, we obtain the untranslatable question "How many are there?". Once we have the synthetic untranslatable questions, adversarial filtering is employed to iteratively refine the set of untranslatable examples to be more indiscernible by trivial stylistic classifiers (Zellers et al., 2018).
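A minimal sketch of the Drop strategy is given below, assuming the referring spans have already been identified by the string-matching heuristic; back-translation for fluency (e.g. turning "How many exist?" into "How many are there?") is omitted.

```python
def apply_drop(question, field_spans):
    """Remove a question token that was linked to a table field, yielding an
    untranslatable question. `field_spans` holds the matched surface forms."""
    tokens = question.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in field_spans:
            return " ".join(tokens[:i] + tokens[i + 1:])
    return question

# "countries" was identified as referring to a table field, so it is dropped.
print(apply_drop("How many countries exist ?", {"countries"}))
# -> 'How many exist ?'
```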
Predicting Untranslatable Questions and Confusing Spans. We utilize the BERT-based contextualized question-schema representations produced by the encoder in Figure 5 to extract the confusion span in a question and hence predict its translatability. To do so, we normalize the aforementioned synthetic dataset such that each question Q in an example is paired with a confusion span label ζ: when the question is untranslatable but there is no identifiable confusion span, the whole question is marked as confusing; when the question is translatable, [CLS] is used as the dummy confusion span. We use the span extractor proposed by Devlin et al. (2019) for SQuAD v1.1 (Rajpurkar et al., 2016).
Specifically, the extractor introduces a start vector s and an end vector e. The probability of word i being the start of the span is computed as the dot product between h^{input}_i and s, followed by a softmax over [CLS] and all words in the question. The end of the span is predicted in a similar manner.
The score of a candidate span (i, j) is s · h^{input}_i + e · h^{input}_j, and the maximum-scoring span with j ≥ i is used as the prediction. The training objective is the sum of the log-likelihoods of the correct start and end indices.
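A compact sketch of the span scoring and argmax is shown below; the tensor shapes are toy values and the encoder outputs are random placeholders.

```python
import torch

def best_span(h_input, s, e, question_len):
    """Score every candidate span (i, j) with j >= i as s·h_i + e·h_j over
    [CLS] and the question tokens and return the argmax. Position 0 is [CLS];
    a predicted start of 0 is treated as "no confusion span" (translatable)."""
    h = h_input[: question_len + 1]                       # [CLS] + question rows
    start_scores, end_scores = h @ s, h @ e               # (L,), (L,)
    pair = start_scores.unsqueeze(1) + end_scores.unsqueeze(0)       # (L, L)
    invalid = torch.tril(torch.ones_like(pair), diagonal=-1).bool()  # j < i
    pair = pair.masked_fill(invalid, float("-inf"))
    idx = pair.argmax().item()
    return divmod(idx, pair.size(1))                      # (start, end)

# Toy usage: 6 question tokens, hidden size 512.
h_input, s, e = torch.randn(20, 512), torch.randn(512), torch.randn(512)
print(best_span(h_input, s, e, question_len=6))
```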

Database-aware Token Correction
Figure 6 illustrates the proposed token correction module in PHOTON. We use the masked language model (MLM) of BERT (Devlin et al., 2019) to auto-correct the confusing tokens. Specifically, we replace the confusing tokens with the [MASK] special token. The output distribution of the MLM head on the mask token is used to score candidate spans. We construct the candidate span list by extracting all the table names and column names from the database schema. After user confirmation, the confusing tokens in the input are replaced by the tokens predicted by the MLM.
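The sketch below shows one plausible realization with Huggingface's transformers library; scoring each candidate by its first word-piece only is a simplifying assumption rather than PHOTON's exact procedure.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def rank_corrections(question, confusing_span, candidates):
    """Replace the confusing span with [MASK] and rank candidate schema names
    (table/column names) by the MLM probability at the masked position."""
    masked = question.replace(confusing_span, tokenizer.mask_token)
    inputs = tokenizer(masked, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    log_probs = torch.log_softmax(logits, dim=-1)
    scores = []
    for cand in candidates:
        # Score a multi-word candidate by its first word-piece (simplification).
        piece_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(cand)[0])
        scores.append((cand, log_probs[piece_id].item()))
    return sorted(scores, key=lambda x: -x[1])

# Example: the confusing token "startphones" is scored against schema names.
print(rank_corrections("How many startphones are there ?", "startphones",
                       ["singer", "song", "concert"]))
```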

Evaluation
In this section, we empirically evaluate the robustness and effectiveness of PHOTON.
In particular, we examine two key modules of PHOTON: the confusion detection module and the neural semantic parser. The former aims to detect untranslatable questions and predict their confusing spans; if the question is translatable, PHOTON then applies the proposed neural semantic parser to perform text-to-SQL parsing. Since PHOTON is designed as a stagewise system, we can evaluate the performance of each module separately.

Experimental Setup
Dataset. We conduct experiments on Spider (Yu et al., 2018) and the Spider UTran dataset. Spider is a large-scale, human-annotated, cross-domain text-to-SQL benchmark. Spider UTran is our modified dataset for evaluating robustness, created by injecting untranslatable questions into Spider. We obtained 5,330 additional untranslatable questions (4,733 for training and 597 for development) from the original Spider dataset. To ensure the quality of our synthetic dataset, we hired SQL experts from Upwork (https://www.upwork.com/) to annotate the auto-generated untranslatable examples in the dev set. We conduct our evaluation following the database split setting, as illustrated in Table 1. The split follows the original dataset, hence there is no test set for Spider UTran (the test set of Spider is not publicly accessible).

Training and Inference Details. Our neural semantic parser is trained on Spider. We permute table order (up to 16 different orderings) during training. We use the uncased BERT-base model from Huggingface's transformers library (Wolf et al., 2019). We set all LSTMs to 1 layer and set the dimensions of h_input, f_pri, f_for, f_type and the decoder to 512. We employ Adam-SGD (Kingma and Ba, 2015) with a mini-batch size of 32 and default Adam parameters. We train for a maximum of 50,000 steps, setting the learning rate to 5e-4 in the first 5,000 iterations and linearly decaying it to 0 afterwards. We fine-tune BERT with a fine-tuning rate linearly increasing from 3e-5 to 8e-5 in the first 5,000 iterations and linearly decaying to 0 afterwards. We use a beam size of 128 in beam-search decoding.
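For concreteness, the BERT fine-tuning rate schedule described above corresponds to the simple piecewise-linear function sketched below (a sketch of the quoted values, not the training code itself).

```python
def bert_finetune_lr(step, warmup=5_000, total=50_000, lo=3e-5, hi=8e-5):
    """Increase linearly from lo to hi over the first `warmup` steps,
    then decay linearly to 0 by `total` steps."""
    if step < warmup:
        return lo + (hi - lo) * step / warmup
    return hi * max(0.0, (total - step) / (total - warmup))

# lo at step 0, peak near 8e-5 at step 5,000, and 0.0 at step 50,000.
print(bert_finetune_lr(0), bert_finetune_lr(5_000), bert_finetune_lr(50_000))
```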
Our confusion span extractor is trained on Spider UTran. It uses the same encoder architecture as the semantic parser. We set the dimension of s and e to 512. During inference, we determine a question to be untranslatable if the predicted span start index is greater than 0 (i.e., the span is not the dummy [CLS]). PHOTON highlights the confusion span to the user when it contains 5 or fewer tokens; otherwise, it generates a generic response prompting the user to rephrase.

Experimental Results
Confusion Detection. We examine the robustness of PHOTON by evaluating the performance of the confusion detection module in handling ambiguous and untranslatable input. In particular, we aim to examine whether PHOTON is effective in handling untranslatable questions by measuring its translatability detection accuracy and its confusing span prediction accuracy and F1 score. We compare to a baseline that uses a single-layer attentive bi-directional LSTM ("Att-biLSTM"). Table 2 shows the evaluation results on the Spider UTran dataset. As observed from Table 2, PHOTON achieves encouraging performance in determining the translatability of a question and predicting the confusing spans of untranslatable ones. In comparison to the Att-biLSTM baseline, PHOTON obtains significant improvements in both translatability accuracy and confusing span prediction accuracy. These improvements are partly attributed to the proposed schema encoding strategy.
Neural Semantic Parser. We then evaluate the performance of the proposed neural semantic parser of PHOTON on the original Spider dataset. In particular, we compare PHOTON with other existing text-to-SQL approaches by measuring exact set match (EM) accuracy (Yu et al., 2018). We compare with several existing approaches, including Global-GNN (Bogin et al., 2019b), EditSQL (Zhang et al., 2019), IRNet (Guo et al., 2019), and RYANSQL (Choi et al., 2020). Table 3 shows the evaluation results on the Spider dev set.

Related Work
Natural Language Interfaces to Databases. NLIDBs have been studied extensively in the past decades. Thanks to the availability of large-scale datasets (Zhong et al., 2017; Finegan-Dollak et al., 2018; Yu et al., 2018), data-driven approaches have dominated the field, in which deep learning based models achieve the best performance in both strongly (Hwang et al., 2019; Zhang et al., 2019; Guo et al., 2019) and weakly (Liang et al., 2017; Min et al., 2019) supervised settings. However, most existing text-to-SQL datasets include only questions that can be translated into a valid SQL query. Spider (Yu et al., 2018) specifically controlled question clarity during data collection to exclude poorly phrased and ambiguous questions. WikiSQL (Zhong et al., 2017) was constructed on top of manually written synchronous grammars, and the mapping between its questions and SQL queries can be effectively resolved via lexical matching in vector space (Hwang et al., 2019). CoSQL (Yu et al., 2019a) is, to our knowledge, by far the only existing corpus which enables data-driven modeling and evaluation of untranslatable question detection. Yet the dataset is context-dependent in nature and contains untranslatable questions of limited variety. We fill this gap by proposing PHOTON to cover a diverse set of untranslatable user input in text-to-SQL.
Noisy User Input in Semantic Parsing. Despite being absent from most large-scale text-to-SQL benchmarks, noisy user input has been frequently encountered and addressed by the semantic parsing community. Underspecification (Archangeli, 1988) and vagueness (Varzi, 2001) have a solid foundation in linguistic theory. Lexicon-based semantic parsers (Zettlemoyer and Collins, 2005; Roberts and Patra, 2017) may reject the input if the lexicon match is unsuccessful. Other approaches for handling untranslatable user input include inferring and generating defaults (Setlur et al., 2019), paraphrasing (Arthur et al., 2015, 2016), verification (Arthur et al., 2015) and confidence estimation (Dong et al., 2018). We adopt a data-augmentation and discriminative learning based approach, which has demonstrated superior performance in related domains (Rajpurkar et al., 2018).

Conclusion and Future Work
We present PHOTON, a robust, modular, cross-domain text-to-SQL system consisting of a semantic parser, an untranslatable question detector, a human-in-the-loop question corrector, and a natural language response generator. PHOTON has the potential to scale up to hundreds of different domains. It is the first cross-domain text-to-SQL system designed towards industrial applications with rich features, and it bridges the gap between the demand for sophisticated database analysis and users without any SQL background knowledge.
The current PHOTON system is still a prototype, with very limited user interactions and functions. We will continue to add features to PHOTON, such as voice input, spell checking, and visualization of the output, where appropriate, to inspect the translation process. We also plan to improve the performance of the core models in PHOTON, such as semantic parsing (text-to-SQL), response generation (table-to-text) and context-aware user interaction (text-to-text). A comprehensive evaluation will also be conducted among the users of our system.

Figure 1: PHOTON workflow. The question corrector (upper block) detects untranslatable questions from user input and scans for the confusion span(s) that need clarification or correction. The accepted question is mapped into a SQL query through a text-to-SQL model, and finally the SQL execution results are returned to the user.

Figure 2 illustrates the interaction process, which involves four types of response states: CONFIRM RESULT, CONFIRM CORRECTION, NEED REPHRASE, and INVALID QUERY. The set of response templates can be found at the bottom of Figure 2. When a user initiates the conversation by entering a query, PHOTON first predicts whether the query is translatable or not. If translatable, PHOTON generates the corresponding SQL command and checks the command's executability; otherwise, PHOTON provides a correction strategy (i.e., CONFIRM CORRECTION) based on the detected confusing span, or asks the user to rephrase the inquiry (i.e., NEED REPHRASE) if no span is captured.

Table 1: Data split of Spider and Spider UTran. Q represents all the questions; UTran Q represents the untranslatable questions.

Table 3: Experimental results on the Spider dev set (%). EM Acc. denotes the exact set match accuracy.