Generating Logical Forms from Graph Representations of Text and Entities

Structured information about entities is critical for many semantic parsing tasks. We present an approach that uses a Graph Neural Network (GNN) architecture to incorporate information about relevant entities and their relations during parsing. Combined with a decoder copy mechanism, this approach provides a conceptually simple mechanism to generate logical forms with entities. We demonstrate that this approach is competitive with the state-of-the-art across several tasks without pre-training, and outperforms existing approaches when combined with BERT pre-training.


Introduction
Semantic parsing maps natural language utterances into structured meaning representations. The representation languages vary between tasks, but typically provide a precise, machine interpretable logical form suitable for applications such as question answering (Zelle and Mooney, 1996;Zettlemoyer and Collins, 2007;Berant et al., 2013). The logical forms typically consist of two types of symbols: a vocabulary of operators and domain-specific predicates or functions, and entities grounded to some knowledge base or domain.
Recent approaches to semantic parsing have cast it as a sequence-to-sequence task (Dong and Lapata, 2016;Jia and Liang, 2016;Ling et al., 2016), employing methods similar to those developed for neural machine translation (Bahdanau et al., 2014), with strong results. However, special consideration is typically given to handling of entities. This is important to improve generalization and computational efficiency, as most tasks require handling entities unseen during training, and the set of unique entities can be large.
Some recent approaches have replaced surface forms of entities in the utterance with placeholders (Dong and Lapata, 2016). This requires a pre-processing step to completely disambiguate entities and replace their spans in the utterance. Additionally, for some tasks it may be beneficial to leverage relations between entities, multiple entity candidates per span, or entity candidates without a corresponding span in the utterance, while generating logical forms.
Other approaches identify only types and surface forms of entities while constructing the logical form (Jia and Liang, 2016), using a separate post-processing step to generate the final logical form with grounded entities. This ignores potentially useful knowledge about relevant entities.
Meanwhile, there has been considerable recent interest in Graph Neural Networks (GNNs) (Scarselli et al., 2009;Li et al., 2016;Kipf and Welling, 2017;Gilmer et al., 2017;Veličković et al., 2018) for effectively learning representations for graph structures. We propose a GNN architecture based on extending the self-attention mechanism of the Transformer (Vaswani et al., 2017) to make use of relations between input elements.
We present an application of this GNN architecture to semantic parsing, conditioning on a graph representation of the given natural language utterance and potentially relevant entities. This approach is capable of handling ambiguous and potentially conflicting entity candidates jointly with a natural language utterance, relaxing the need for completely disambiguating a set of linked entities before parsing. This graph formulation also enables us to incorporate knowledge about the relations between entities where available. Combined with a copy mechanism while decoding, this approach also provides a conceptually simple method for generating logical forms with grounded entities.
We demonstrate the capability of the proposed architecture by achieving competitive results across three semantic parsing tasks. Further improvements are possible by incorporating a pre-trained BERT (Devlin et al., 2018) encoder within the architecture.

Table 1: Examples from each dataset, with utterance x and meaning representation y.
GEO
x : which states does the mississippi run through ?
y : answer ( state ( traverse_1 ( riverid ( mississippi ) ) ) )
ATIS
x : in denver what kind of ground transportation is there from the airport to downtown
y : ( _lambda $0 e ( _and ( _ground_transport $0 ) ( _to_city $0 denver : ci ) ( _from_airport $0 den : ap ) ) )
SPIDER
x : how many games has each stadium held ?

Task Formulation
Our goal is to learn a model for semantic parsing from pairs of natural language utterances and structured meaning representations. Let the natural language utterance be represented as a sequence $x = (x_1, \ldots, x_{|x|})$ of $|x|$ tokens, and the meaning representation be represented as a sequence $y = (y_1, \ldots, y_{|y|})$ of $|y|$ elements. The goal is to estimate $p(y \mid x)$, the conditional probability of the meaning representation $y$ given utterance $x$, which is augmented by a set of potentially relevant entities.

Input Utterance
Each token $x_i \in V_{in}$ is from a vocabulary of input tokens.
Entity Candidates Given the input utterance $x$, we retrieve a set, $e = \{e_1, \ldots, e_{|e|}\}$, of potentially relevant entity candidates, with $e \subseteq V_e$, where $V_e$ is the set of all entities for a given domain. We assume the availability of an entity candidate generator for each task to generate $e$ given $x$, with details given in § 5.2.
For each entity candidate, $e \in V_e$, we require a set of task-specific attributes containing one or more elements from $V_a$. These attributes can be NER types or other characteristics of the entity, such as "city" or "river" for some of the entities listed in Table 1. Whereas $V_e$ can be quite large for open domains, or even infinite if it includes sets such as the natural numbers, $V_a$ is typically much smaller. Therefore, we can effectively learn representations for entities given their set of attributes, from our set of example pairs.
Edge Labels In addition to $x$ and $e$ for a particular example, we also consider the $(|x|+|e|)^2$ pairwise relations between all tokens and entity candidates, represented as edge labels.
The edge label between tokens $x_i$ and $x_j$ corresponds to the relative sequential position, $j - i$, of the tokens, clipped to within some range.
The edge label between token $x_i$ and entity $e_j$, and vice versa, corresponds to whether $x_i$ is within the span of the entity candidate $e_j$, or not.
The edge label between entities $e_i$ and $e_j$ captures the relationship between the entities. These edge labels can have domain-specific interpretations, such as relations in a knowledge base, or any other type of entity interaction features. For tasks where this information is not available or useful, a single generic label between entity candidates can be used.
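The three kinds of edge labels described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `build_edge_labels`, the tuple encoding of labels, the half-open span convention, and the `"generic"` fallback label are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple


def clip(value: int, limit: int) -> int:
    """Clip a relative position to the range [-limit, limit]."""
    return max(-limit, min(limit, value))


def build_edge_labels(
    num_tokens: int,
    entity_spans: List[Optional[Tuple[int, int]]],
    clip_distance: int = 8,
    entity_relations: Optional[Dict[Tuple[int, int], str]] = None,
) -> List[List[tuple]]:
    """Build the (|x|+|e|)^2 matrix of edge labels.

    Nodes 0..num_tokens-1 are tokens; the remaining nodes are entity
    candidates. entity_spans[k] is a hypothetical (start, end) token span
    with end exclusive, or None if candidate k has no span.
    """
    relations = entity_relations or {}
    n = num_tokens + len(entity_spans)
    labels: List[List[tuple]] = [[()] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            i_is_token, j_is_token = i < num_tokens, j < num_tokens
            if i_is_token and j_is_token:
                # Token-token: clipped relative position j - i.
                labels[i][j] = ("rel_pos", clip(j - i, clip_distance))
            elif i_is_token != j_is_token:
                # Token-entity (either direction): is the token in the span?
                token = i if i_is_token else j
                entity = (j if i_is_token else i) - num_tokens
                span = entity_spans[entity]
                inside = span is not None and span[0] <= token < span[1]
                labels[i][j] = ("span", inside)
            else:
                # Entity-entity: task-specific relation, else a generic label.
                key = (i - num_tokens, j - num_tokens)
                labels[i][j] = ("entity", relations.get(key, "generic"))
    return labels


# Eight tokens and one candidate spanning token 4 ("mississippi").
labels = build_edge_labels(8, [(4, 5)], clip_distance=2)
```

With a clipping distance of 2, every pair of tokens more than two positions apart shares the same label, which keeps the number of distinct edge labels small.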
Output We consider the logical form, $y$, to be a linear sequence (Vinyals et al., 2015b). We tokenize based on the syntax of each domain. Our formulation allows each element of $y$ to be either an element of the output vocabulary, $V_{out}$, or an entity copied from the set of entity candidates $e$. Therefore, $y_i \in V_{out} \cup V_e$. Some experiments in § 5.2 also allow elements of $y$ to be tokens in $V_{in}$ from $x$ that are copied from the input.

Model Architecture
Our model architecture is based on the Transformer (Vaswani et al., 2017), with the self-attention sub-layer extended to incorporate relations between input elements, and the decoder extended with a copy mechanism.

Figure 1: We use an example from SPIDER to illustrate the model inputs: tokens from the given utterance, x, a set of potentially relevant entities, e, and their relations. We selected two edge label types to highlight: edges denoting that an entity spans a token, and edges between entities that, for SPIDER, indicate a foreign key relationship between columns, or an ownership relationship between columns and tables.

GNN Sub-layer
We extend the Transformer's self-attention mechanism to form a Graph Neural Network (GNN) sub-layer that incorporates a fully connected, directed graph with edge labels. The sub-layer maps an ordered sequence of node representations, $u = (u_1, \ldots, u_{|u|})$, to a new sequence of node representations, $u' = (u'_1, \ldots, u'_{|u|})$, where each node is represented in $\mathbb{R}^d$. We use $r_{ij}$ to denote the edge label corresponding to $u_i$ and $u_j$.
We implement this sub-layer in terms of a function $f(m, l)$ over a node representation $m \in \mathbb{R}^d$ and an edge label $l$ that computes a vector representation in $\mathbb{R}^{d'}$. We use $n_{heads}$ parallel attention heads, with $d' = d / n_{heads}$. For each head $k$, the new representation for the node $u_i$ is computed by

$$u'^{k}_i = \sum_{j=1}^{|u|} \alpha_{ij} \, f(u_j, r_{ij}),$$

where each coefficient $\alpha_{ij}$ is a softmax over the scaled dot products $s_{ij}$,

$$s_{ij} = \frac{(W^q u_i)^\top f(u_j, r_{ij})}{\sqrt{d'}},$$

and $W^q$ is a learned matrix. Finally, we concatenate representations from each head,

$$u'_i = W^h \left[ u'^{1}_i \mid \cdots \mid u'^{n_{heads}}_i \right],$$

where $W^h$ is another learned matrix and $[\, \cdots \,]$ denotes concatenation.
If we implement $f$ as

$$f(m, l) = W^r m,$$

where $W^r \in \mathbb{R}^{d' \times d}$ is a learned matrix, then the sub-layer would be effectively identical to self-attention as initially proposed in the Transformer (Vaswani et al., 2017). We focus on two alternative formulations of $f$ that represent edge labels as learned matrices and learned vectors.
Edge Matrices The first formulation represents edge labels as linear transformations, a common parameterization for GNNs,

$$f(m, l) = W^l m,$$

where $W^l \in \mathbb{R}^{d' \times d}$ is a learned embedding matrix per edge label.

Edge Vectors
The second formulation represents edge labels as additive vectors using the same formulation as Shaw et al. (2018),

$$f(m, l) = W^r m + w^l,$$

where $W^r \in \mathbb{R}^{d' \times d}$ is a learned matrix shared by all edge labels, and $w^l \in \mathbb{R}^{d'}$ is a learned embedding vector per edge label $l$.
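To make the sub-layer concrete, the following is a minimal single-head sketch in plain Python using the edge-vector formulation. It assumes the additive form $f(m, l) = W^r m + w^l$ of Shaw et al. (2018) and scores each pair of nodes by a scaled dot product between a query projection of $u_i$ and $f(u_j, r_{ij})$. The helper names (`matvec`, `gnn_head`, and so on) and the toy inputs are illustrative assumptions, and the multi-head concatenation and output projection are omitted.

```python
import math
from typing import Dict, Hashable, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def f_edge_vector(W_r: Matrix, edge_vectors: Dict[Hashable, Vector],
                  m: Vector, label: Hashable) -> Vector:
    """Assumed edge-vector form: f(m, l) = W_r m + w_l."""
    return [p + b for p, b in zip(matvec(W_r, m), edge_vectors[label])]


def gnn_head(nodes: List[Vector], edge_labels: List[List[Hashable]],
             W_q: Matrix, W_r: Matrix,
             edge_vectors: Dict[Hashable, Vector]) -> List[Vector]:
    """One attention head over a fully connected, edge-labeled graph."""
    d_head = len(W_r)
    outputs = []
    for i, u_i in enumerate(nodes):
        query = matvec(W_q, u_i)
        # f(u_j, r_ij) serves as both the key and the value for node j.
        messages = [f_edge_vector(W_r, edge_vectors, u_j, edge_labels[i][j])
                    for j, u_j in enumerate(nodes)]
        scores = [sum(q * k for q, k in zip(query, msg)) / math.sqrt(d_head)
                  for msg in messages]
        alphas = softmax(scores)
        outputs.append([sum(a * msg[k] for a, msg in zip(alphas, messages))
                        for k in range(d_head)])
    return outputs


# Toy example: two 2-dimensional nodes and a single edge label "a".
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
toy_nodes = [[1.0, 0.0], [0.0, 1.0]]
toy_labels = [["a", "a"], ["a", "a"]]
toy_edges = {"a": [0.0, 0.0]}
attended = gnn_head(toy_nodes, toy_labels, identity, identity, toy_edges)
uniform = gnn_head(toy_nodes, toy_labels, zeros, identity, toy_edges)
```

With a zero query matrix all scores are equal, so each node receives the plain average of the messages; with the identity query matrix each node attends more strongly to itself.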

Encoder
Input Representations Before the initial encoder layer, tokens are mapped to initial representations using either a learned embedding table for $V_{in}$ or the output of a pre-trained BERT encoder. Entity candidates are mapped to initial representations based on learned embeddings of their attributes in $V_a$. We also concatenate an embedding representing the node type, token or entity, to each input representation.

Figure 2: Our model architecture is based on the Transformer (Vaswani et al., 2017), with two modifications. First, the self-attention sub-layer has been extended to be a GNN that incorporates edge representations. In the encoder, the GNN sub-layer is conditioned on tokens, entities, and their relations. Second, the decoder has been extended to include a copy mechanism (Vinyals et al., 2015a). We can optionally incorporate a pre-trained model such as BERT to generate contextual token representations.
We assume some arbitrary ordering for entity candidates, generating a combined sequence of initial node representations for tokens and entities. We have edge labels between every pair of nodes as described in § 2.
Encoder Layers Our encoder layers are essentially identical to the Transformer, except with the proposed extension to self-attention to incorporate edge labels. Therefore, each encoder layer consists of two sub-layers. The first is the GNN sub-layer, which yields new sets of token and entity representations. The second sub-layer is an element-wise feed-forward network. Each sub-layer is followed by a residual connection and layer normalization (Ba et al., 2016). We stack $N_{enc}$ encoder layers, yielding a final set of token representations, $w_x^{(N_{enc})}$, and entity representations, $w_e^{(N_{enc})}$.

Decoder
The decoder auto-regressively generates output symbols, $y_1, \ldots, y_{|y|}$. It is similarly based on the Transformer (Vaswani et al., 2017), with the self-attention sub-layer replaced by the GNN sub-layer. Decoder edge labels are based only on the relative timesteps of the previous outputs. The encoder-decoder attention layer considers both encoder outputs $w_x^{(N_{enc})}$ and $w_e^{(N_{enc})}$, jointly normalizing attention weights over tokens and entity candidates. We stack $N_{dec}$ decoder layers to produce an output vector representation at each output step, $z_j \in \mathbb{R}^{d_z}$, for $j \in \{1, \ldots, |y|\}$.
We allow the decoder to copy tokens or entity candidates from the input, effectively combining a Pointer Network (Vinyals et al., 2015a) with a standard softmax output layer for selecting symbols from an output vocabulary (Gu et al., 2016;Gulcehre et al., 2016;Jia and Liang, 2016). We define a latent action at each output step, $a_j$ for $j \in \{1, \ldots, |y|\}$, using similar notation as Jia and Liang (2016). We normalize action probabilities with a softmax over all possible actions.
Generating Symbols We can generate a symbol, denoted Generate[i],

$$P(a_j = \text{Generate}[i] \mid x, e, y_{1:j-1}) \propto \exp\left(z_j^\top w^{out}_i\right),$$

where $w^{out}_i$ is a learned embedding vector for the element of $V_{out}$ with index $i$. If $a_j = \text{Generate}[i]$, then $y_j$ will be the element of $V_{out}$ with index $i$.

Copying Entities
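We can also copy an entity candidate, denoted CopyEntity[i],

$$P(a_j = \text{CopyEntity}[i] \mid x, e, y_{1:j-1}) \propto \exp\left((z_j^\top W^e)\, w_{e_i}^{(N_{enc})}\right),$$

where $W^e$ is a learned matrix, and $i \in \{1, \ldots, |e|\}$. If $a_j = \text{CopyEntity}[i]$, then $y_j = e_i$.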
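The joint normalization over generate and copy actions can be illustrated with a small Python sketch. It assumes that a Generate action is scored by a dot product between the decoder output and an output embedding, and that a CopyEntity action is scored by a bilinear product between the decoder output and the final encoder representation of the entity. These scoring forms, the function name `action_distribution`, and the toy values are assumptions for illustration only.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Action = Tuple[str, int]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Vector) -> Vector:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def action_distribution(z: Vector, out_embeddings: List[Vector],
                        W_e: Matrix,
                        entity_reprs: List[Vector]) -> Dict[Action, float]:
    """Softmax over all Generate[i] and CopyEntity[i] actions at one step."""
    # Generate[i]: score is z . w_out_i (assumed form).
    generate_scores = [dot(z, w) for w in out_embeddings]
    # CopyEntity[i]: score is (z^T W_e) . w_e_i (assumed bilinear form).
    projected = [sum(z[r] * W_e[r][c] for r in range(len(z)))
                 for c in range(len(W_e[0]))]
    copy_scores = [dot(projected, w) for w in entity_reprs]
    actions: List[Action] = (
        [("Generate", i) for i in range(len(out_embeddings))]
        + [("CopyEntity", i) for i in range(len(entity_reprs))]
    )
    return dict(zip(actions, softmax(generate_scores + copy_scores)))


# Toy step: two output symbols and one entity candidate.
dist = action_distribution(
    z=[1.0, 0.0],
    out_embeddings=[[0.0, 0.0], [0.0, 0.0]],
    W_e=[[1.0, 0.0], [0.0, 1.0]],
    entity_reprs=[[2.0, 0.0]],
)
best_action = max(dist, key=dist.get)
```

Because all actions share one softmax, probability mass assigned to copying an entity is taken directly from the output vocabulary, so no separate gate between generating and copying is needed in this sketch.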

Related Work
Various approaches to learning semantic parsers from pairs of utterances and logical forms have been developed over the years (Tang and Mooney, 2000;Zettlemoyer and Collins, 2007;Kwiatkowski et al., 2011;Andreas et al., 2013). More recently, encoder-decoder architectures have been applied with strong results (Dong and Lapata, 2016;Jia and Liang, 2016). Even for tasks with relatively small domains of entities, such as GEO and ATIS, it has been shown that some special consideration of entities within an encoder-decoder architecture is important to improve generalization. This has included extending decoders with copy mechanisms (Jia and Liang, 2016) and/or identifying entities in the input as a pre-processing step (Dong and Lapata, 2016).
Other work has considered open domain tasks, such as WEBQUESTIONSSP (Yih et al., 2016). Recent approaches have typically relied on a separate entity linking model, such as S-MART (Yang and Chang, 2015), to provide a single disambiguated set of entities to consider. In principle, a learned entity linker could also serve as an entity candidate generator within our framework, although we do not explore such tasks in this work.
Considerable recent work has focused on constrained decoding of various forms within an encoder-decoder architecture to leverage the known structure of the logical forms. This has led to approaches that leverage this structure during decoding, such as using tree decoders (Dong and Lapata, 2016;Alvarez-Melis and Jaakkola, 2017) or other mechanisms (Dong and Lapata, 2018;Goldman et al., 2017). Other approaches use grammar rules to constrain decoding (Xiao et al., 2016;Yin and Neubig, 2017;Krishnamurthy et al., 2017;Yu et al., 2018b). We leave investigation of such decoder constraints to future work.
Many formulations of Graph Neural Networks (GNNs) that propagate information over local neighborhoods have recently been proposed (Kipf and Welling, 2017;Gilmer et al., 2017;Veličković et al., 2018). Recent work has often focused on large graphs (Hamilton et al., 2017) and effectively propagating information over multiple graph steps (Xu et al., 2018). The graphs we consider are relatively small and are fully connected, avoiding some of the challenges posed by learning representations for large, sparsely connected graphs.
Other recent work related to ours has considered GNNs for natural language tasks, such as combining structured and unstructured data for question answering (Sun et al., 2018), or for representing dependencies in tasks such as AMR parsing and machine translation (Beck et al., 2018;Bastings et al., 2017). The approach of Krishnamurthy et al. (2017) similarly considers ambiguous entity mentions jointly with query tokens for semantic parsing, although does not directly consider a GNN.
Previous work has interpreted the Transformer's self-attention mechanism as a GNN (Veličković et al., 2018;Battaglia et al., 2018), and extended it to consider relative positions as edge representations (Shaw et al., 2018). Previous work has also similarly represented edge labels as vectors, as opposed to matrices, in order to avoid over-parameterizing the model.

Semantic Parsing Datasets
We consider three semantic parsing datasets, with examples given in Table 1.
GEO The GeoQuery dataset consists of natural language questions about US geography along with corresponding logical forms (Zelle and Mooney, 1996). We follow the convention of Zettlemoyer and Collins (2005)
SPIDER This is a large-scale text-to-SQL dataset that consists of 10,181 questions and 5,693 unique complex SQL queries across 200 database tables spanning 138 domains (Yu et al., 2018c). We use the standard training set of 8,659 training examples and development set of 1,034 examples, split across different tables.

Experimental Setup
Model Configuration We configured hyperparameters based on performance on the validation set for each task, if provided, otherwise cross-validated on the training set.
For the encoder and decoder, we selected the number of layers from {1, 2, 3, 4} and embedding and hidden dimensions from {64, 128, 256}, setting the feed-forward layer hidden dimensions 4× higher.
We employed dropout at training time with $P_{dropout}$ selected from {0.1, 0.2, 0.3, 0.4, 0.5, 0.6}. We used 8 attention heads for each task. We used a clipping distance of 8 for relative position representations (Shaw et al., 2018).
We used the Adam optimizer (Kingma and Ba, 2015) with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$, and tuned the learning rate for each task. We used the same warmup and decay strategy for learning rate as Vaswani et al. (2017), selecting a number of warmup steps up to a maximum of 3000. Early stopping was used to determine the total training steps for each task. We used the final checkpoint for evaluation. We batched training examples together, and selected batch size from {32, 64, 128, 256, 512}. During training we used masked self-attention (Vaswani et al., 2017) to enable parallel decoding of output sequences. For evaluation, we used greedy search.
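For reference, the warmup and decay strategy of Vaswani et al. (2017) increases the learning rate linearly during warmup and then decays it with the inverse square root of the step number. A minimal sketch follows; the function name and the `scale` multiplier, which stands in for the per-task tuned learning rate, are assumptions for illustration.

```python
def transformer_lr(step: int, d_model: int = 256,
                   warmup_steps: int = 3000, scale: float = 1.0) -> float:
    """Learning rate schedule from Vaswani et al. (2017).

    lr = scale * d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    `scale` is a hypothetical task-specific multiplier.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return scale * d_model ** -0.5 * min(step ** -0.5,
                                         step * warmup_steps ** -1.5)


peak = transformer_lr(3000)
halfway = transformer_lr(1500)
later = transformer_lr(12000)
```

The two terms inside `min` coincide at `step == warmup_steps`, which is where the rate peaks; halfway through warmup the rate is half the peak, and after four times the warmup steps it has decayed back to half the peak.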
We used a simple strategy of splitting each input utterance on spaces to generate a sequence of tokens. We mapped any token that didn't occur at least 2 times in the training dataset to a special out-of-vocabulary token. For experiments that used BERT, we instead used the same wordpiece (Wu et al., 2016) tokenization as used for pre-training.
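The non-BERT tokenization and vocabulary thresholding described above amount to a few lines of Python. In this sketch the `<oov>` symbol and the function names are hypothetical, and `str.split()` is used so that repeated spaces do not produce empty tokens.

```python
from collections import Counter
from typing import Iterable, List, Set

OOV = "<oov>"  # hypothetical name for the out-of-vocabulary token


def tokenize(utterance: str) -> List[str]:
    """Split an utterance on whitespace."""
    return utterance.split()


def build_vocab(train_utterances: Iterable[str], min_count: int = 2) -> Set[str]:
    """Keep tokens that occur at least `min_count` times in training data."""
    counts = Counter(tok for utt in train_utterances for tok in tokenize(utt))
    return {tok for tok, count in counts.items() if count >= min_count}


def map_oov(tokens: List[str], vocab: Set[str]) -> List[str]:
    """Replace tokens outside the vocabulary with the OOV symbol."""
    return [tok if tok in vocab else OOV for tok in tokens]


train = ["which states border texas", "which rivers run through texas"]
vocab = build_vocab(train)
mapped = map_oov(tokenize("which states border ohio"), vocab)
```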
BERT For some of our experiments, we evaluated incorporating a pre-trained BERT (Devlin et al., 2018) encoder by effectively using the output of the BERT encoder in place of a learned token embedding table. We then continue to use graph encoder and decoder layers with randomly initialized parameters in addition to BERT, so there are many parameters that are not pre-trained. The additional encoder layers are still necessary to condition on entities and relations.
We achieved best results by freezing the pretrained parameters for an initial number of steps, and then jointly fine-tuning all parameters, similar to existing approaches for gradual unfreezing (Howard and Ruder, 2018). When unfreezing the pre-trained parameters, we restart the learning rate schedule. We found this to perform better than keeping pre-trained parameters either entirely frozen or entirely unfrozen during fine-tuning.
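The freeze-then-unfreeze procedure can be expressed as a small helper that decides, for each training step, whether the pre-trained parameters are trainable and which step to feed into the learning rate schedule. This is a sketch under the assumption that "restarting the schedule" means resetting the schedule's step counter to zero at the moment of unfreezing; the function name is hypothetical.

```python
from typing import Tuple


def training_phase(step: int, frozen_steps: int) -> Tuple[bool, int]:
    """Return (pretrained_trainable, schedule_step) for a global step.

    For the first `frozen_steps` steps the pre-trained parameters stay
    frozen. Afterwards all parameters are trained jointly and the learning
    rate schedule restarts from step 0 (an assumed interpretation).
    """
    if step < frozen_steps:
        return False, step
    return True, step - frozen_steps


phases = [training_phase(s, frozen_steps=100) for s in (0, 99, 100, 150)]
```

The returned `schedule_step` can be passed to whatever warmup and decay function is in use, so that warmup is repeated once the pre-trained parameters start receiving gradients.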
We used BERT$_{LARGE}$ (Devlin et al., 2018), which has 24 layers. For fine-tuning we used the same Adam optimizer with weight decay and learning rate decay as used for BERT pre-training. We reduced batch sizes to accommodate the significantly larger model size, and tuned learning rate, warmup steps, and number of frozen steps for pre-trained parameters.
Entity Candidate Generator We use an entity candidate generator that, given x, can retrieve a set of potentially relevant entities, e, for the given domain. Although all generators share a common interface, their implementation varies across tasks.
For GEO and ATIS we use a lexicon of entity aliases in the dataset and attempt to match with n-grams in the query. Each entity has a single attribute corresponding to the entity's type. We used binary valued relations between entity candidates based on whether entity candidate spans overlap, but experiments did not show significant improvements from incorporating these relations.
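A lexicon-based candidate generator of this kind can be sketched as follows. The lexicon contents, the entity identifiers, the maximum n-gram length, and the dictionary layout of each candidate are hypothetical choices made for the example, not details taken from the datasets.

```python
from typing import Dict, List, Tuple

# Hypothetical lexicon: alias -> list of (entity id, entity type).
Lexicon = Dict[str, List[Tuple[str, str]]]


def generate_candidates(tokens: List[str], lexicon: Lexicon,
                        max_ngram: int = 3) -> List[dict]:
    """Match every n-gram of the query against the alias lexicon."""
    candidates = []
    for n in range(1, max_ngram + 1):
        for start in range(len(tokens) - n + 1):
            alias = " ".join(tokens[start:start + n])
            for entity_id, entity_type in lexicon.get(alias, []):
                candidates.append({
                    "entity": entity_id,
                    "type": entity_type,           # the single attribute
                    "span": (start, start + n),    # end is exclusive
                })
    return candidates


toy_lexicon: Lexicon = {
    "mississippi": [("riverid:mississippi", "river"),
                    ("stateid:mississippi", "state")],
    "new york": [("cityid:new york", "city")],
}
query = "which states does the mississippi run through ?".split()
found = generate_candidates(query, toy_lexicon)
```

In this toy lexicon the alias "mississippi" maps to two entities of different types, so the same span yields two candidates. The generator does not resolve the ambiguity; both candidates are passed to the model.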
For SPIDER, we generalize our notion of entities to include tables and table columns. We include all relevant tables and columns as entity candidates, but make use of Levenshtein distance between query n-grams and table and column names to determine edges between tokens and entity candidates. We use attributes based on the types and names of tables and columns. Edges between entity candidates capture relations between columns and the table they belong to, and foreign key relations.
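For readers unfamiliar with it, Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below computes it with the standard dynamic program and uses it to decide whether a query n-gram should be linked to a table or column name. The relative threshold `max_ratio` and the lowercasing are assumptions, since the exact matching criterion is not given here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def links_to(ngram: str, name: str, max_ratio: float = 0.2) -> bool:
    """Hypothetical rule: link when distance is small relative to length."""
    distance = levenshtein(ngram.lower(), name.lower())
    return distance <= max_ratio * max(len(ngram), len(name))


plural_match = links_to("stadiums", "stadium")
unrelated = links_to("games", "stadium")
```

A fuzzy criterion of this kind lets inflected forms such as plurals link to schema names without requiring an exact string match.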
For GEO, ATIS, and SPIDER, this leads to 19.5%, 32.7%, and 74.6% of examples containing at least one span associated with multiple entity candidates, respectively, indicating some entity ambiguity.
Further details on how entity candidate generators were constructed are provided in § A.1.

Output Sequences
We pre-processed output sequences to identify entity argument values, and replaced those elements with references to entity candidates in the input. In cases where our entity candidate generator did not retrieve an entity that was used as an argument, we dropped the example from the training data set or considered it incorrect if in the test set.

Table 2: Accuracies on GEO, ATIS, and SPIDER.

Method                          GEO     ATIS
Kwiatkowski et al. (2013)       89.0    -
                                87.9    -
Wang et al. (2014)              -       91.3
Zhao and Huang (2015)           88.9    84.2
Jia and Liang (2016)            89

Method                          SPIDER
Ours
GNN w/ edge matrices            29.3
GNN w/ edge vectors             32.1
GNN w/ edge vectors + BERT      23.5
Evaluation To evaluate accuracy, we use exact match accuracy relative to gold logical forms. For GEO we directly compare output symbols. For ATIS, we compare normalized logical forms using canonical variable naming and sorting for unordered arguments (Jia and Liang, 2016). For SPIDER we use the provided evaluation script, which decomposes each SQL query and conducts set comparison within each clause without values. All accuracies are reported on the test set, except for SPIDER where we report and compare accuracies on the development set.
Copying Tokens To better understand the effect of conditioning on entities and their relations, we also conducted experiments that considered an alternative method for selecting and disambiguating entities similar to Jia and Liang (2016). In this approach we use our model's copy mechanism to copy tokens corresponding to the surface forms of entity arguments, rather than copying entities directly,

$$P(a_j = \text{CopyToken}[i] \mid x, e, y_{1:j-1}) \propto \exp\left((z_j^\top W^x)\, w_{x_i}^{(N_{enc})}\right),$$

where $W^x$ is a learned matrix, and where $i \in \{1, \ldots, |x|\}$ refers to the index of token $x_i \in V_{in}$. If $a_j = \text{CopyToken}[i]$, then $y_j = x_i$. This allows us to ablate entity information in the input while still generating logical forms. When copying tokens, the decoder determines the type of the entity using an additional output symbol. For GEO, the actual entity can then be identified as a post-processing step, as a type and surface form are sufficient. For other tasks this could require a more complicated post-processing step to disambiguate entities given a surface form and type.

Table 3: Experimental results for copying tokens instead of entities when decoding, with and without conditioning on the set of entity candidates, e.

Results and Analysis
Accuracies on GEO, ATIS, and SPIDER are shown in Table 2.
GEO and ATIS Without pre-training, and despite adding some entity ambiguity, we achieve similar results to other recent approaches that disambiguate and replace entities in the utterance as a pre-processing step during both training and evaluation (Dong and Lapata, 2016, 2018). When incorporating BERT, we increase absolute accuracies over Dong and Lapata (2018) on GEO and ATIS by 3.2% and 2.0%, respectively. Notably, they also present techniques and results that leverage constrained decoding, from which our approach would likely also benefit. For GEO, we find that when ablating all entity information in our model and copying tokens instead of entities, we achieve similar results to Jia and Liang (2016) when also ablating their data augmentation method, as shown in Table 3. This is expected, since when ablating entities completely, our architecture essentially reduces to the same sequence-to-sequence task setup. These results demonstrate the impact of conditioning on the entity candidates, as it improves performance even on the token copying setup. It appears that leveraging BERT can partly compensate for not conditioning on entity candidates, but combining BERT with our GNN approach and copying entities achieves 2.9% higher accuracy than using only a BERT encoder and copying tokens.
For ATIS, our results are outperformed by Wang et al. (2014) by 1.6%. Their approach uses hand-engineered templates to build a CCG lexicon. Some of these templates attempt to handle the specific types of ungrammatical utterances in the ATIS task.
SPIDER For SPIDER, a relatively new dataset, there is less prior work. Competitive approaches have been specific to the text-to-SQL task (Xu et al., 2017;Yu et al., 2018a,b), incorporating task-specific methods to condition on table and column information, and incorporating SQL-specific structure when decoding. Our approach improves absolute accuracy by 7.3% relative to Yu et al. (2018b) without using any pre-trained language representations, or constrained decoding. Our approach could also likely benefit from some of the other aspects of Yu et al. (2018b) such as more structured decoding, data augmentation, and using pre-trained representations (they use GloVe (Pennington et al., 2014)) for tokens, columns, and tables.
Our results were surprisingly worse when attempting to incorporate BERT. Successfully incorporating pre-trained representations is not always straightforward. In general, we found using BERT within our architecture to be sensitive to learning rates and learning rate schedules. Notably, the evaluation setup for SPIDER differs substantially from training, as evaluation examples involve tables unseen during training, and models may not generalize well to unseen tables and columns. It is likely that successfully incorporating BERT for SPIDER would require careful tuning of hyperparameters specifically for the database split configuration.

Entity Spans and Relations
Results for ablating span relations between entities and tokens for GEO and ATIS are shown in Table 4. The impact is more significant for ATIS, which contains many queries with multiple entities of the same type, such as "nonstop flights seattle to boston", where disambiguating the origin and destination entities requires knowledge of which tokens they are associated with, given that we represent entities based only on their types for these tasks. We leave for future work consideration of edges between entity candidates that incorporate relevant domain knowledge for these tasks.
For SPIDER, results ablating relations between entities and tokens, and relations between entities, are shown in Table 5. This demonstrates the importance of entity relations, as they include useful information for disambiguating entities such as which columns belong to which tables, and which columns have foreign key relations.

Edge Ablations                  SPIDER
GNN w/ edge vectors             32.1
− entity span edges             27.8
− entity relation edges         26.3

Table 5: Results for ablating information about relations between entity candidates and tokens for SPIDER.
Edge Representations Using additive edge vectors outperforms using learned edge matrix transformations for implementing $f$, across all tasks. While the vector formulation is less expressive, it also introduces far fewer parameters per edge type, which can be an important consideration given that our graph contains many similar edge labels, such as those representing similar relative positions between tokens. We leave further exploration of more expressive edge representations to future work. Another direction to explore is a heterogeneous formulation of the GNN sub-layer that employs different formulations for different subsets of nodes, e.g. for tokens and entities.

Conclusions
We have presented an architecture for semantic parsing that uses a Graph Neural Network (GNN) to condition on a graph of tokens, entities, and their relations. Experimental results have demonstrated that this approach can achieve competitive results across a diverse set of tasks, while also providing a conceptually simple way to incorporate entities and their relations during parsing. In future work, we are interested in exploring constrained decoding, better incorporating pre-trained language representations within our architecture, conditioning on additional relations between entities, and different GNN formulations.
More broadly, we have presented a flexible approach for conditioning on available knowledge in the form of entities and their relations, and demonstrated its effectiveness for semantic parsing.