SQL-to-Text Generation with Graph-to-Sequence Model

Previous work approaches the SQL-to-text generation task using vanilla Seq2Seq models, which may not fully capture the inherent graph-structured information in a SQL query. In this paper, we propose a graph-to-sequence model to encode the global structure information into node embeddings. This model can effectively learn the correlation between the SQL query pattern and its interpretation. Experimental results on the WikiSQL dataset and Stackoverflow dataset show that our model outperforms the Seq2Seq and Tree2Seq baselines, achieving state-of-the-art performance.


Introduction
The goal of the SQL-to-text task is to automatically generate human-like descriptions interpreting the meaning of a given structured query language (SQL) query (Figure 1 gives an example). This task is critical to the natural language interface to a database since it helps non-expert users to understand the esoteric SQL queries that are used to retrieve the answers through the question-answering process (Simitsis and Ioannidis, 2009) using various text embedding techniques (Kim, 2014; Arora et al., 2017; Wu et al., 2018a).
Earlier attempts at the SQL-to-text task are rule-based and template-based (Koutrika et al., 2010; Ngonga Ngomo et al., 2013). Despite requiring intensive human effort to design templates or rules, these approaches still tend to generate rigid and stylized language that lacks the naturalness of human language. To address this, Iyer et al. (2016) propose a sequence-to-sequence (Seq2Seq) network to model the SQL query and natural language jointly. However, since SQL is designed to express graph-structured query intent, the sequence encoder may need an elaborate design to fully capture the global structure information. Intuitively, various graph encoding techniques based on deep neural networks (Kipf and Welling, 2016; Hamilton et al., 2017; Song et al., 2018) or on graph kernels (Vishwanathan et al., 2010; Wu et al., 2018b), whose goal is to learn node-level or graph-level representations for a given graph, are better suited to this problem.
In this paper, we first introduce a strategy to represent the SQL query as a directed graph (see §2) and further make full use of a novel graph-to-sequence (Graph2Seq) model (Xu et al., 2018) that encodes this graph-structured SQL query and then decodes its interpretation (see §3). On the encoder side, we extend the graph encoding work of Hamilton et al. (2017) by encoding the edge direction information into the node embedding. Our encoder learns the representation of each node by aggregating information from its K-hop neighbors. Unlike Hamilton et al. (2017), who neglect the edge direction, we classify the neighbors of a node, say v, into two classes according to the edge direction: forward nodes (those v directs to) and backward nodes (those that direct to v). We apply two distinct aggregators to aggregate the information of these two types of nodes, resulting in two representations. The node embedding of v is the concatenation of these two representations. Given the learned node embeddings, we further introduce a pooling-based and an aggregation-based method to generate the graph embedding.
On the decoder side, we develop an RNN-based decoder which takes the graph vector representation as the initial hidden state to generate the sequences while employing an attention mechanism over all node embeddings. Experimental results show that our model achieves state-of-the-art performance on the WikiSQL dataset and Stackoverflow dataset. Our code and data are available at https://github.com/IBM/SQL-to-Text.

Graph Representation of SQL Query
Representing the SQL query as a graph instead of a sequence could better preserve the inherent structure information in the query. An example is illustrated in the blue dashed frame in Figure 2. One can see that representing the query as a graph instead of a sequence could help the model to better learn the correlation between this graph pattern and the interpretation "...both X and Y higher than Z...". This observation motivates us to represent the SQL query as a graph. In particular, we use the following method to transform the SQL query to a graph:
SELECT Clause. For the SELECT clause such as "SELECT company", we first create a node assigned with the text attribute select. This SELECT node connects with column nodes whose text attributes are the selected column names, such as company. For SQL queries that contain aggregation functions such as count or max, we add one aggregation node which is connected with the column nodes. Similarly, its text attribute is the aggregation function name.
WHERE Clause. The WHERE clause usually contains more than one condition. For each condition, we use the same process as for the SELECT clause to create nodes. For example, in Figure 2, we create the nodes assets and >val0 for the first condition, and the nodes sales and >val0 for the second condition. We then merge the constraint nodes that have the same text attribute (e.g., >val0 in Figure 2). For a logical operator such as AND, OR and NOT, we create a node that connects with all column nodes that the operator works on. Finally, these logical operator nodes connect with the SELECT node.
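The construction rules above can be illustrated with a minimal Python sketch. It is not the released implementation: the function name `build_sql_graph`, the node-id scheme, the single logical operator per query, and in particular the edge directions are assumptions made for illustration, since the text only states which nodes are connected.

```python
def build_sql_graph(select_cols, conditions, logic_op="AND", agg=None):
    """Build (nodes, edges) for a simple single-table query.

    nodes maps a node id to its text attribute; edges is a set of
    directed (source, target) pairs. Edge directions are an assumption
    of this sketch.
    """
    nodes = {}
    edges = set()

    def node(node_id, text):
        # Reusing an id merges nodes that share the same text attribute.
        nodes.setdefault(node_id, text)
        return node_id

    select = node("select", "select")
    for col in select_cols:
        col_id = node("col:" + col, col)
        edges.add((select, col_id))
        if agg is not None:
            # One aggregation node, connected with the column nodes.
            agg_id = node("agg:" + agg, agg)
            edges.add((agg_id, col_id))

    if conditions:
        # The logical operator node connects with the SELECT node
        # and with every column node it works on.
        op_id = node("op:" + logic_op, logic_op.lower())
        edges.add((op_id, select))
        for col, comparator, value in conditions:
            col_id = node("col:" + col, col)
            constraint = comparator + value
            cons_id = node("cons:" + constraint, constraint)
            edges.add((op_id, col_id))
            edges.add((col_id, cons_id))
    return nodes, edges


nodes, edges = build_sql_graph(
    ["company"],
    [("assets", ">", "val0"), ("sales", ">", "val0")],
)
```

In this example the two conditions share a single `>val0` constraint node, which is the merged pattern that the text associates with interpretations such as "both X and Y higher than Z".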

Graph-to-sequence Model
Based on the constructed graphs for the SQL queries, we make full use of a novel graph-to-sequence model (Xu et al., 2018), which consists of a graph encoder to learn the embedding for the graph-structured SQL query, and a sequence decoder with an attention mechanism to generate sentences. Conceptually, the graph encoder generates the node embedding for each node by accumulating information from its K-hop neighbors, and produces a graph embedding for the entire graph by abstracting all node embeddings. Our decoder takes the graph embedding as the initial hidden state and calculates the attention over all node embeddings on the encoder side to generate natural language interpretations.
Node Embedding. Given the graph G = (V, E), since the text attribute of a node may include a list of words, we first use a Long Short-Term Memory (LSTM) network to generate the feature vector $a_v$ for every node $v \in V$ from v's text attribute. We use these feature vectors as initial node embeddings. Then, our model incorporates information from a node's neighbors within K hops into its representation by repeating the following process K times:

$$h^{k}_{N_{\vdash}(v)} = M^{\vdash}_{k}\left(\{h^{k-1}_{u\vdash}, \forall u \in N_{\vdash}(v)\}\right) \quad (2)$$
$$h^{k}_{v\vdash} = \sigma\left(W^{k} \cdot \mathrm{CONCAT}(h^{k-1}_{v\vdash}, h^{k}_{N_{\vdash}(v)})\right) \quad (3)$$
$$h^{k}_{N_{\dashv}(v)} = M^{\dashv}_{k}\left(\{h^{k-1}_{u\dashv}, \forall u \in N_{\dashv}(v)\}\right) \quad (4)$$
$$h^{k}_{v\dashv} = \sigma\left(W^{k} \cdot \mathrm{CONCAT}(h^{k-1}_{v\dashv}, h^{k}_{N_{\dashv}(v)})\right) \quad (5)$$

where $k \in \{1, ..., K\}$ is the iteration index, $N$ is the neighborhood function, $M^{\vdash}_{k}$ and $M^{\dashv}_{k}$ are the forward and backward aggregator functions, $W^{k}$ denotes weight matrices, and $\sigma$ is a non-linearity function.
For example, for node $v \in V$, we first aggregate the forward representations of its immediate neighbors $\{h^{k-1}_{u\vdash}, \forall u \in N_{\vdash}(v)\}$ into a single vector $h^{k}_{N_{\vdash}(v)}$ (equation 2). Note that this aggregation step only uses the representations generated at the previous iteration, and the initial representation of a node is $a_v$. Then we concatenate v's current forward representation $h^{k-1}_{v\vdash}$ with the newly generated neighborhood vector $h^{k}_{N_{\vdash}(v)}$. This concatenated vector is fed into a fully connected layer with nonlinear activation function $\sigma$, which updates the forward representation of v to be used at the next iteration (equation 3). Next, we update the backward representation of v in a similar fashion (equations 4 and 5). Finally, the concatenation of the forward and backward representations at the last iteration K is used as the final representation of v. Since the neighbor information from different hops may have a different impact on the node embedding, we learn a distinct aggregator function at each step. This aggregator feeds each neighbor's vector to a fully connected neural network, and an element-wise max-pooling operation is applied to capture different aspects of the neighbor set.
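The forward and backward update described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' code: the weights are random and untrained, separate update matrices are used for the two directions, nodes without neighbors receive a zero neighborhood vector, and all function names are hypothetical.

```python
import random


def relu(vec):
    return [max(0.0, x) for x in vec]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def max_pool_aggregate(neighbor_vecs, pool_weights, dim):
    """Feed each neighbor through a dense layer, then take an element-wise max."""
    if not neighbor_vecs:
        return [0.0] * dim  # assumption: zero vector for nodes without neighbors
    transformed = [relu(matvec(pool_weights, h)) for h in neighbor_vecs]
    return [max(column) for column in zip(*transformed)]


def embed_nodes(features, edges, num_hops, seed=0):
    """features: {node: vector}; edges: iterable of directed (u, v) pairs."""
    rng = random.Random(seed)
    dim = len(next(iter(features.values())))
    forward_nbrs = {v: [] for v in features}   # nodes that v directs to
    backward_nbrs = {v: [] for v in features}  # nodes that direct to v
    for u, v in edges:
        forward_nbrs[u].append(v)
        backward_nbrs[v].append(u)

    h_fw = dict(features)  # initial representations are the node features
    h_bw = dict(features)
    for _ in range(num_hops):
        # A distinct aggregator and update matrix per hop and per direction.
        pool_fw = random_matrix(dim, dim, rng)
        pool_bw = random_matrix(dim, dim, rng)
        w_fw = random_matrix(dim, 2 * dim, rng)
        w_bw = random_matrix(dim, 2 * dim, rng)
        new_fw, new_bw = {}, {}
        for v in features:
            # Only representations from the previous iteration are read here.
            agg_fw = max_pool_aggregate([h_fw[u] for u in forward_nbrs[v]], pool_fw, dim)
            agg_bw = max_pool_aggregate([h_bw[u] for u in backward_nbrs[v]], pool_bw, dim)
            # List addition concatenates the node vector with its neighborhood vector.
            new_fw[v] = relu(matvec(w_fw, h_fw[v] + agg_fw))
            new_bw[v] = relu(matvec(w_bw, h_bw[v] + agg_bw))
        h_fw, h_bw = new_fw, new_bw
    # Final embedding: concatenation of forward and backward representations.
    return {v: h_fw[v] + h_bw[v] for v in features}
```

Each returned embedding has twice the input dimension because the forward and backward representations are concatenated at the last iteration.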
Graph Embedding. Most existing work on graph convolutional neural networks focuses on node embeddings rather than graph embeddings (GE), since it targets node-wise classification tasks. However, graph embeddings that convey the entire graph information are essential to the downstream decoder, which is crucial to our task. For this purpose, we propose two ways to generate graph embeddings, namely, the pooling-based and node-based methods.
Pooling-based GE. This method feeds the obtained node embeddings into a fully connected neural network and applies the element-wise max-pooling operation on all node embeddings. In experiments, we did not observe significant performance improvement using min-pooling and average-pooling.
Node-based GE. Following Scarselli et al. (2009), this method adds a super node $v_s$ that is connected to all other nodes by a special type of edge. The embedding of $v_s$, which is treated as the graph embedding, is produced using the node embedding generation algorithm mentioned above.
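The two graph embedding strategies can be sketched as below. The function names are hypothetical, the dense layer in the pooling variant is kept linear for brevity, and both the zero initial feature of the super node and the direction of its edges (from every node to the super node) are assumptions, since the text does not specify them.

```python
def pooling_graph_embedding(node_embeddings, weights):
    """Apply a dense layer to every node embedding, then element-wise max-pool."""
    transformed = [
        [sum(w * x for w, x in zip(row, h)) for row in weights]
        for h in node_embeddings.values()
    ]
    return [max(column) for column in zip(*transformed)]


def add_super_node(features, edges, super_id="<super>"):
    """Return a copy of the graph with a super node linked from every node."""
    dim = len(next(iter(features.values())))
    new_features = dict(features)
    new_features[super_id] = [0.0] * dim  # assumption: zero initial feature
    new_edges = list(edges) + [(v, super_id) for v in features]
    return new_features, new_edges
```

For the node-based variant, the augmented graph would be passed through the same node embedding procedure, and the resulting embedding of the super node would serve as the graph embedding.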
Sequence Decoding. The decoder is an RNN which predicts the next token $y_i$ given all the previous words $y_{<i} = y_1, ..., y_{i-1}$, the RNN hidden state $s_i$ for time-step i, and the context vector $c_i$ that captures the attention over the encoder side. In particular, the context vector $c_i$ depends on a set of node representations $(h_1, ..., h_V)$ to which the encoder maps the input graph. The context vector $c_i$ is dynamically computed using an attention mechanism over the node representations. Our model is jointly trained to maximize the conditional log-probability of the correct description given a source graph with respect to the parameters θ of the model:

$$\theta^{*} = \arg\max_{\theta} \sum_{n=1}^{N} \sum_{i=1}^{T_n} \log p(y^{n}_{i} \mid y^{n}_{<i}, x^{n})$$

where $(x^n, y^n)$ is the n-th of the $N$ SQL-interpretation pairs in the training set, and $T_n$ is the length of the n-th target sentence $y^n$. In the inference phase, we use the beam search algorithm with beam size = 5.
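As an illustration of how a context vector can be computed over the node representations, the sketch below uses dot-product scoring followed by a softmax. The scoring function is an assumption, since the text does not state which attention variant is used, and the function name is hypothetical.

```python
import math


def attention_context(decoder_state, node_embeddings):
    """Return (weights, context) from dot-product scores and a softmax."""
    scores = [
        sum(s * h for s, h in zip(decoder_state, emb))
        for emb in node_embeddings
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(node_embeddings[0])
    # The context vector is the attention-weighted sum of node embeddings.
    context = [
        sum(w * emb[i] for w, emb in zip(weights, node_embeddings))
        for i in range(dim)
    ]
    return weights, context
```

The attention weights sum to one, so the context vector is a convex combination of the node embeddings that is recomputed at every decoding step.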

Experiments
We evaluate our model on two datasets, WikiSQL (Zhong et al., 2017) and Stackoverflow (Iyer et al., 2016). WikiSQL consists of a corpus of 87,726 hand-annotated SQL query and natural language question pairs. These  (Iyer et al., 2016). We use the BLEU-4 score (Papineni et al., 2002) as our automatic evaluation metric and also perform a human study. For human evaluation, we randomly sampled 1,000 predicted results and asked three native English speakers to rate each interpretation for both its correctness with respect to the input SQL and its grammaticality on a scale from 1 to 5. We compare some variants of our model against the template, Seq2Seq, and Tree2Seq baselines.
Graph2Seq-PGE. This method uses the pooling-based method for generating the graph embedding.
Graph2Seq-NGE. This method uses the node-based graph embedding.
Template. We implement a template-based method which first maps each element of a SQL query to an utterance and then uses simple rules to assemble these utterances. For example, we map SELECT to which, WHERE to where, > to more than. This method translates the SQL query of Figure 1 to which company where assets more than val0 and sales more than val0 and industry less than or equal to val1 and profits equals val2.
Seq2Seq. We choose two Seq2Seq models as our baselines. The first one is the attention-based Seq2Seq model proposed by Bahdanau et al. (2014), and the second one additionally introduces the copy mechanism on the decoder side (Gu et al., 2016). To evaluate these models, we employ a template to convert the SQL query into a sequence: "SELECT + <aggregation function> + <Split Symbol> + <selected column> + WHERE + <condition0>
Tree2Seq. We also choose the tree-to-sequence model proposed by Eriguchi et al. (2016) as our baseline. We use the SQL Parser tool to convert a SQL query into a tree structure, which is fed to the Tree2Seq model.
Our proposed models are trained using the Adam optimizer (Kingma and Ba, 2014), with a mini-batch size of 30. Our hyper-parameters are set based on performance on the validation set. The learning rate is set to 0.001. We apply the dropout strategy (Srivastava et al., 2014) with a ratio of 0.5 at the decoder layer to avoid overfitting. Gradients are clipped when their norm exceeds 20. We initialize word embeddings using GloVe word vectors from Pennington et al. (2014), and the word embedding dimension is 300. For the graph encoder, the hop size K is set to 6, the non-linearity function σ is implemented as ReLU (Glorot et al., 2011), and the parameters of the weight matrices $W^k$ are randomly initialized. The decoder has one layer, and its hidden state size is 300.
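The template baseline described above can be approximated with a token-level lookup. In this sketch only the mappings for SELECT, WHERE and > are stated in the text; the remaining entries are inferred from the example translation, the input query is reconstructed from that same translation, and the names `KEYWORD_UTTERANCES` and `template_translate` are hypothetical.

```python
KEYWORD_UTTERANCES = {
    "SELECT": "which",
    "WHERE": "where",
    ">": "more than",
    # The entries below are inferred from the example output in the text.
    "AND": "and",
    "<=": "less than or equal to",
    "=": "equals",
}


def template_translate(query):
    """Map each whitespace-separated token to an utterance; keep the rest as-is."""
    return " ".join(KEYWORD_UTTERANCES.get(tok, tok) for tok in query.split())


print(template_translate(
    "SELECT company WHERE assets > val0 AND sales > val0 "
    "AND industry <= val1 AND profits = val2"
))
```

The rigid, word-by-word output of such a lookup illustrates why template-based interpretations tend to read less naturally than generated ones.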
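For reference, the training settings listed above can be collected in a small configuration, together with a sketch of norm-based gradient clipping. The key names are hypothetical, and clipping by the global norm over all gradients is an assumption, since the text does not say whether the norm is taken per parameter or globally.

```python
import math

# Values are taken from the text; the key names are hypothetical.
HYPERPARAMS = {
    "optimizer": "adam",
    "batch_size": 30,
    "learning_rate": 0.001,
    "decoder_dropout": 0.5,
    "max_grad_norm": 20.0,
    "word_embedding_dim": 300,
    "hop_size": 6,
    "decoder_layers": 1,
    "decoder_hidden_size": 300,
}


def clip_by_global_norm(gradients, max_norm):
    """Rescale a list of gradient vectors if their joint L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if norm <= max_norm:
        return gradients, norm
    scale = max_norm / norm
    return [[g * scale for g in grad] for grad in gradients], norm


clipped, norm = clip_by_global_norm([[30.0, 40.0]], HYPERPARAMS["max_grad_norm"])
```

In this example the gradient has norm 50, so it is scaled down by a factor of 0.4 while its direction is preserved.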

Results and Discussion
Table 1 summarizes the results of our models and baselines. Although the template-based method achieves decent BLEU scores, its grammaticality score is substantially worse than that of the other baselines. We can see that on both datasets, our Graph2Seq models perform significantly better than the Seq2Seq and Tree2Seq baselines. One possible reason is that in our graph encoder, the node embedding retains the information of neighbor nodes within K hops. In the tree encoder, however, the node embedding only aggregates the information of descendants while losing the knowledge of ancestors. The pooling-based graph embedding is found to be more useful than the node-based graph embedding because Graph2Seq-NGE adds a nonexistent node into the graph, which introduces noise when calculating the embeddings of other nodes. We also conducted an experiment that treats the SQL query graph as an undirected graph and found that the performance degrades.
By manually analyzing the cases in which the Graph2Seq model performs better than Seq2Seq, we find the Graph2Seq model is better at interpreting two classes of queries: (1) complicated queries that have more than two conditions (Query 1); (2) queries whose columns have implicit relationships (Query 2). Table 2 lists some such SQL queries and their interpretations. One possible reason is that the Graph2Seq model can better learn the correlation between the graph pattern and natural language by utilizing the global structure information.
We find the hop size has a significant impact on our model since it determines how many neighbor nodes are considered during node embedding generation. As the hop size increases, the performance is found to improve significantly. However, after the hop size reaches 6, increasing it further no longer boosts the performance on WikiSQL. By analyzing the most complicated queries (around 6.2%) in WikiSQL, we find there are on average six hops between a node and its most distant neighbor. This result indicates that the selected hop size should guarantee that each node can receive information from the other nodes in the graph.
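The hop distance between a node and its most distant neighbor can be measured with a breadth-first search over the query graph. The sketch below ignores edge direction, on the assumption that information flows both forward and backward in the encoder; the function name `max_hops` is hypothetical.

```python
from collections import deque


def max_hops(edges):
    """Largest shortest-path length between connected nodes, ignoring direction."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    longest = 0
    for start in adjacency:
        # Breadth-first search gives the hop count from start to every node.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nbr in adjacency[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        longest = max(longest, max(dist.values()))
    return longest
```

A hop size at least as large as this value lets every node receive information from all other nodes in its connected component.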

Conclusions
Previous work approaches the SQL-to-text task using a Seq2Seq model, which does not fully capture the global structure information of the SQL query. To address this, we proposed a Graph2Seq model which includes a graph encoder and an attention-based sequence decoder. Experimental results show that our model significantly outperforms the Seq2Seq and Tree2Seq models on the WikiSQL and Stackoverflow datasets.

Figure 1: An example of SQL query and its interpretation.

Figure 2: The graph representation of the SQL query in Figure 1.
SQL Query & Interpretations
1. COUNT Player WHERE starter = val0 AND touchdowns = val1 AND position = val2
   S: How many players played in position val2
   G: number of players with starter val0 and get touchdowns val1 for val2
2. SELECT Tires WHERE engine = val0 AND chassis = val1 AND team = val2
   S: which tire has engine val0 and chassis val1 and val2
   G: which tire does val2 run with val0 engine and val1 chassis

Table 2: Example of SQL queries and predicted interpreta-