Reranking for Neural Semantic Parsing

Semantic parsing is the task of transducing natural language (NL) utterances into machine-executable meaning representations (MRs). While neural network-based semantic parsers have achieved impressive improvements over previous methods, results are still far from perfect, and cursory manual inspection can easily identify obvious problems such as lack of adequacy or coherence of the generated MRs. This paper presents a simple approach to quickly iterate and improve the performance of an existing neural semantic parser by reranking an n-best list of predicted MRs, using features that are designed to fix observed problems with baseline models. We implement our reranker in a competitive neural semantic parser and test on four tasks spanning semantic parsing (GEO, ATIS) and Python code generation (Django, CoNaLa), improving the strong baseline parser by up to 5.7% absolute in BLEU (CoNaLa) and 2.9% in accuracy (Django), outperforming the best published neural parser results on all four datasets.

Figure 1: Illustration of the reranker with a real example from the CONALA code generation task (Yin et al., 2018a) with reconstruction (z → x) and discriminative matching (x ↔ z) scores.
While neural network-based semantic parsers have achieved impressive results, there is still room for improvement. A pilot analysis of incorrect predictions from a competitive neural semantic parser, TRANX (Yin and Neubig, 2018), indicates an obvious issue of incoherence. In the real example in Figure 1, the top prediction $z_1$ is semantically incoherent with the intent expressed in the utterance. Perhaps a more interesting issue is inadequacy: while the predicted MRs match the overall intent of the utterance, they still miss or misinterpret crucial pieces of information (e.g., missing or generating wrong arguments, as in $z_2$ and $z_9$). Indeed, we observe that around 41% of the failure cases of TRANX on a popular Python code generation task (DJANGO; Oda et al., 2015) are due to such inadequate predictions.
Although the top predictions from a semantic parser could fall short in adequacy or coherence, we found that the parser still maintains high recall, covering the gold-standard MR in its n-best list of predictions most of the time (footnote 2). This naturally motivates us to investigate whether the performance of an existing neural parser can be improved by reranking the n-best list of candidate MRs. In this paper, we propose a simple reranker powered mainly by two quality-measuring features of a candidate MR: (1) a generative reconstruction model, which tests the coherence and adequacy of an MR via the likelihood of reconstructing the original input utterance from the MR; and (2) a discriminative matching model, which directly captures the semantic coherence between utterances and MRs. We implement our reranker in a strong neural semantic parser and evaluate it on parsing NL into both domain-specific logical forms (GEO, ATIS) and general-purpose source code (DJANGO, CONALA). Our reranking approach improves upon this strong parser by up to 5.7% absolute in BLEU (CONALA) and 2.9% in accuracy (DJANGO), outperforming the best published neural parser results on all datasets.

Reranking Model
Figure 1 illustrates our approach. Given an input NL utterance x, we assume access to an existing neural semantic parser p(z|x), which outputs a ranked n-best list of system-generated meaning representations given x, $\{z_i\}_{i=1}^{n}$. In practice, such an n-best list is usually generated by approximate inference like beam search. The reranker R(·) takes as input the n-best list of MRs and the input utterance, and outputs the MR $\hat{z}$ with the highest reranking score, i.e., $\hat{z} = \arg\max_{z \in \{z_i\}_{i=1}^{n}} R(z, x)$. We parameterize R(·) as a (log-)linear model:

$$R(z, x) = \sum_{k} \alpha_k f_k(z, x) \qquad (1)$$

where $f_k$ is a feature function that scores a candidate prediction z, and $\{\alpha_k\}$ are tuned weights. We also include the original parser score in R(·). The idea of reranking the beam of candidate parses has been attempted for various NLP tasks (Collins and Koo, 2000), and was also previously applied to classical grammar-driven semantic parsers. Such reranking models typically use domain-specific syntactic features strongly coupled with the underlying parsing algorithm (e.g., an indicator feature for each grammar rule applied; Raymond and Mooney (2006); Srivastava et al. (2017)), while our reranker applies domain-general quality-measuring features compatible with both domain-specific (e.g., λ-calculus) and general-purpose (e.g., Python) MRs (more in § 3). Specifically, our reranker mainly uses two features, whose scores are given by two external models: a reconstruction model and a matching model.

Figure 2: Illustration of the discriminative matching model, adapted from Parikh et al. (2016), with three steps: (1) alignment, (2) comparison, and (3) aggregation. Punctuations (dots, parentheses) in MR are omitted for clarity.

Footnote 2: The parser registers 77.3% top-1 accuracy and 84.0% recall over the 15-best beam on the DJANGO dataset, a 6.7% absolute gap.
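The linear scoring rule of the reranker can be illustrated with a minimal sketch. The feature scores below are the ones reported for the two DJANGO candidates of Example 2 in Table 2; the dictionary keys, the function names, and the uniform weights are assumptions made for illustration (the tuned weights are not reported in the text).

```python
from typing import Dict, List

# Scores for the two DJANGO candidates of Example 2 in Table 2.
# The key names are illustrative, not identifiers from the paper's code.
candidates: List[Dict] = [
    {"mr": "raise",
     "parser": -0.1, "reconstruction": -61.7, "matching": -5.3},
    {"mr": "assert len(version) == 5",
     "parser": -3.3, "reconstruction": -17.0, "matching": -0.5},
]

# Hypothetical weights alpha_k: uniform values chosen only for this sketch.
weights: Dict[str, float] = {"parser": 1.0, "reconstruction": 1.0, "matching": 1.0}


def rerank_score(features: Dict, weights: Dict[str, float]) -> float:
    """Linear reranking score: sum over k of alpha_k * f_k(z, x)."""
    return sum(alpha * features[name] for name, alpha in weights.items())


def rerank(candidates: List[Dict], weights: Dict[str, float]) -> Dict:
    """Return the candidate MR with the highest reranking score."""
    return max(candidates, key=lambda cand: rerank_score(cand, weights))


best = rerank(candidates, weights)
print(best["mr"])  # assert len(version) == 5
```

With the parser score alone, the first candidate would stay on top (−0.1 versus −3.3); adding the reconstruction and matching scores reverses the order under these assumed weights.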
Generative Reconstruction Feature
Our reconstruction feature log p(z → x) is a generative model that scores the coherence and adequacy of an MR z using the probability of reproducing the original input utterance x from z. Intuitively, a good candidate MR should adequately encode the semantics of x, leading to a high reconstruction score. The idea of using reconstruction as a quality metric is closely related to reconstruction models in auto-encoders (Vincent et al., 2008), and to their applications in semi-supervised (Yin et al., 2018b) and weakly supervised (Cheng and Lapata, 2018) semantic parsing, where p(z → x) is used to score the quality of sampled MRs in optimization. Similar models have also been applied for pragmatic inference in instruction-following agents, modeling the likelihood of causing the speaker to produce the utterance given an inferred action (Fried et al., 2018), while we use p(z → x) as one quality-measuring feature in our reranker. Specifically, we implement p(z → x) using an attentional sequence-to-sequence network (Luong et al., 2015), which takes as input a tokenized MR z. The network is augmented with a copy mechanism (Gu et al., 2016), allowing out-of-vocabulary variable names (e.g., file_name in Figure 1) in z to be directly copied to the utterance x.
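For reference, a sequence-to-sequence decoder of this kind scores the utterance token by token, so the reconstruction feature can be written as a sum of per-token log-probabilities. The token notation below ($x_t$, $T$) is introduced here for illustration and does not appear in the original text:

```latex
\log p(z \to x) = \sum_{t=1}^{T} \log p\left(x_t \mid x_1, \ldots, x_{t-1}, z\right),
\qquad x = (x_1, \ldots, x_T)
```

Because each term is non-positive, an utterance token that is poorly supported by z (for instance, one describing an argument that z omits) lowers the total score, which is one way to read the claim that the feature tests adequacy.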

Discriminative Matching Feature
We use a matching model to measure the probability of the input utterance x and a candidate MR z being semantically coherent with each other. Intuitively, for a semantically coherent parse z (e.g., $z_3$ in Figure 1), each sub-piece in z (e.g., urllib.request.urlretrieve) could coarsely match a span (e.g., download the file) in the utterance, and vice versa. Motivated by this observation, we implement p(x ↔ z) as a decomposable attention model (Parikh et al., 2016), a discriminative model which computes a semantic coherence score based on the latent pairwise alignment between tokens in x and z. Figure 2 depicts an overview of the model, and we refer interested readers to Parikh et al. (2016) for technical details. Intuitively, the model measures the semantic equivalence of an utterance x and an MR z based on pairwise associations of tokens in x and z in three steps: (1) an alignment step, where alignment scores between each pair of tokens in x and z are computed using attention; (2) a comparison step, where a set of representations is produced from embeddings of pairwise aligned tokens, capturing their semantic similarities; and (3) an aggregation step, where all pairwise comparison results are combined to compute the semantic coherence score.
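The three steps can be sketched in plain Python on toy token embeddings. This is only a structural illustration under strong assumptions: the learned feed-forward networks of Parikh et al. (2016) are replaced by fixed stand-ins (a dot product for alignment, an elementwise product for comparison, a sigmoid over the summed vector for aggregation), and all names and embeddings are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights: Sequence[float], vectors: Sequence[Vector]) -> Vector:
    return [sum(w * vec[k] for w, vec in zip(weights, vectors))
            for k in range(len(vectors[0]))]


def compare(token: Vector, aligned: Vector) -> Vector:
    """Stand-in for the learned comparison network: elementwise product."""
    return [a * b for a, b in zip(token, aligned)]


def matching_score(x_emb: List[Vector], z_emb: List[Vector]) -> Tuple[float, List[List[float]]]:
    # (1) Alignment: attention between every pair of tokens in x and z.
    e = [[dot(a, b) for b in z_emb] for a in x_emb]
    x_attn = [softmax(row) for row in e]                      # x token over z tokens
    z_attn = [softmax([row[j] for row in e]) for j in range(len(z_emb))]
    x_aligned = [weighted_sum(w, z_emb) for w in x_attn]      # soft-aligned z span
    z_aligned = [weighted_sum(w, x_emb) for w in z_attn]      # soft-aligned x span
    # (2) Comparison: one vector per token and its aligned counterpart.
    v_x = [compare(a, b) for a, b in zip(x_emb, x_aligned)]
    v_z = [compare(b, a) for b, a in zip(z_emb, z_aligned)]
    # (3) Aggregation: pool the comparison vectors and map them to a score.
    pooled = [sum(col) for col in zip(*v_x)] + [sum(col) for col in zip(*v_z)]
    logit = sum(pooled)  # stand-in for the learned classifier
    return 1.0 / (1.0 + math.exp(-logit)), x_attn


# Hypothetical 2-d embeddings: two utterance tokens, two MR tokens.
x_emb = [[1.0, 0.0], [0.0, 1.0]]
z_good = [[0.9, 0.1], [0.1, 0.9]]      # each MR token resembles one utterance token
z_bad = [[-0.9, -0.1], [-0.1, -0.9]]   # MR tokens point away from the utterance

score_good, attn = matching_score(x_emb, z_good)
score_bad, _ = matching_score(x_emb, z_bad)
print(score_good > score_bad)  # True
```

In a trained model the three stand-ins would be learned networks and the embeddings would come from the training data; the sketch only shows how token-level alignments feed the final coherence score.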
Token Count Feature
Besides the two primary features introduced above, we also include an auxiliary token count feature |z| of an MR. Such a feature has been shown to be useful in preventing a machine translation model from favoring shorter predictions (Cho et al., 2014; Och and Ney, 2002); here we test it for reranking MRs, especially when the target metric is BLEU (§ 3).
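One plausible reason a length feature matters under BLEU is the metric's brevity penalty, which discounts outputs shorter than the reference. The sketch below implements the standard BLEU brevity penalty; it is background on the metric rather than something specified in this paper, and the token lengths in the usage lines are hypothetical.

```python
import math


def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """Standard BLEU brevity penalty for candidate length c and reference length r.

    Returns 1 when c > r, and exp(1 - r / c) otherwise.
    """
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)


# Hypothetical lengths: a 5-token prediction against a 10-token reference
# is discounted, while a 10-token prediction is not.
short_bp = brevity_penalty(5, 10)
full_bp = brevity_penalty(10, 10)
print(short_bp, full_bp)
```

A positive weight on |z| in the reranker counteracts this discount by nudging the selection toward longer candidates.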

Experiment
We test on four semantic parsing and code generation benchmarks.
GEO (Zelle and Mooney, 1996) and ATIS (Dahl et al., 1994) are two closed-domain semantic parsing datasets. The NL utterances are geographical (GEO) and flight booking (ATIS) inquiries (e.g., What is the latest flight to Boston?). The corresponding MRs are defined as λ-calculus logical forms (e.g., (argmax x (and (flight x) (to x boston)) (departure_time x))).
DJANGO (Oda et al., 2015) is a popular Python code generation dataset consisting of NL-annotated code from the Django framework. Around 70% of examples are simple cases of variable assignment (e.g., result = []), function definition/invocation, or condition tests, which can be easily inferred from the verbose NL utterances (e.g., Result is an empty list).
CONALA (Yin et al., 2018a) is a newly introduced task for open-domain code generation. It consists of 2,879 examples of manually annotated NL questions (e.g., Check if all elements in list 'my list' are the same) and their Python solutions (e.g., len(set(mylist)) == 1) on STACK OVERFLOW. Compared with DJANGO, examples in CONALA cover real-world NL queries issued by programmers with diverse intents, and are therefore significantly more difficult due to their broad coverage and the high compositionality of target MRs.
Base Semantic Parser p(z|x)
While our reranking model is parser-agnostic, in the experiments we are primarily interested in investigating whether the reranker can further improve the performance of an already-strong semantic parser. We use TRANX (Yin and Neubig, 2018), a general-purpose open-source neural semantic parser that maps an input utterance into MRs using a neural sequence-to-tree network, where MRs are represented as abstract syntax trees. We leave evaluating the performance of the reranker with other parsers as interesting future work.
Training Reranking Model
Deploying the reranker to a benchmark dataset involves three steps: (1) training the base parser, (2) training the reranking features (reconstruction and matching models), and (3) tuning the feature weights.
(1) Training Base Parser
We use the pre-processed version of each dataset shipped with the base parser TRANX. We train TRANX using its official configuration and collect the n-best list of candidates for each example using beam search (beam size is 5 for GEO and ATIS, and 15 for DJANGO and CONALA).
(2) Training Reranking Features
The reconstruction model is trained with standard maximum-likelihood estimation on utterances and their associated gold MRs in the training set. We then choose the model with the lowest perplexity on the development set. Training the matching model requires examples in the form of triplets ⟨x, z, y⟩, consisting of an utterance x, an MR z, and a binary label y indicating whether z is a correct parse of x.
We also made one special modification for the discriminative matching model on DJANGO. Unlike in the preprocessed versions of the other datasets, OOV variable names in DJANGO cannot be easily identified and canonicalized, which hurts the performance of the vanilla matching model. Therefore, for each input ⟨x, z⟩, we replace each OOV token (e.g., a variable name my_list) in x and z with a unique numbered slot (e.g., VAR0). Hence, different OOV variable names in the input can still be distinguished based on their slot IDs. We found that this simple trick improved the average classification accuracy on the development set from 77% to 81%.
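The slot replacement used for DJANGO can be sketched as follows. The function name, the toy vocabulary, and the choice to share one slot table between x and z (numbered by first appearance) are assumptions made for illustration; the text only states that each OOV token is replaced with a unique numbered slot.

```python
from typing import Dict, List, Set, Tuple


def canonicalize_oov(x_tokens: List[str], z_tokens: List[str], vocab: Set[str],
                     slot_prefix: str = "VAR") -> Tuple[List[str], List[str]]:
    """Replace each out-of-vocabulary token in x and z with a numbered slot.

    The same OOV token always maps to the same slot, so a variable name that
    occurs in both the utterance and the MR stays linked after replacement.
    """
    slots: Dict[str, str] = {}

    def replace(tokens: List[str]) -> List[str]:
        out = []
        for tok in tokens:
            if tok in vocab:
                out.append(tok)
            else:
                if tok not in slots:
                    slots[tok] = f"{slot_prefix}{len(slots)}"
                out.append(slots[tok])
        return out

    return replace(x_tokens), replace(z_tokens)


# Hypothetical vocabulary and example pair.
vocab = {"append", "to", ".", "(", ")"}
x = ["append", "item", "to", "my_list"]
z = ["my_list", ".", "append", "(", "item", ")"]

x_slots, z_slots = canonicalize_oov(x, z, vocab)
print(x_slots)  # ['append', 'VAR0', 'to', 'VAR1']
print(z_slots)  # ['VAR1', '.', 'append', '(', 'VAR0', ')']
```

Because the slot IDs differ, the matching model can still tell that item and my_list play different roles, even though neither name is in its vocabulary.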
(3) Tuning Feature Weights
Finally, given the trained features, we tune the feature weights in Eq. 1 using the minimum risk training (Smith and Eisner, 2006) algorithm implemented in the Travatar package (Neubig, 2013), which optimizes the expected metric over the n-best list of candidate MRs on the development sets. Steps (2) and (3) are quite efficient, taking less than 10 minutes on a server with a GPU.
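The objective behind this tuning step, the expected metric under the distribution that the reranker induces over an n-best list, can be sketched as below. This is not the Travatar implementation: the grid search is a deliberately naive stand-in for its optimizer, the feature names are illustrative, and the "metric" labels on the toy development set (built from the scores of Examples 2 and 4 in Table 2) are assumed.

```python
import itertools
import math
from typing import Dict, List, Sequence, Tuple

FEATURES = ("parser", "reconstruction", "matching")


def expected_metric(nbest: List[Dict[str, float]], weights: Sequence[float]) -> float:
    """Expected metric over one n-best list under softmax(reranking scores)."""
    scores = [sum(w * cand[f] for f, w in zip(FEATURES, weights)) for cand in nbest]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return sum(e / total * cand["metric"] for e, cand in zip(exps, nbest))


def tune_weights(dev_set: List[List[Dict[str, float]]],
                 grid: Sequence[float] = (0.0, 0.5, 1.0)) -> Tuple[Tuple[float, ...], float]:
    """Naive grid search for weights maximizing the mean expected metric."""
    best_weights, best_value = None, float("-inf")
    for weights in itertools.product(grid, repeat=len(FEATURES)):
        value = sum(expected_metric(nbest, weights) for nbest in dev_set) / len(dev_set)
        if value > best_value:
            best_weights, best_value = weights, value
    return best_weights, best_value


# Toy development set; the "metric" values (1.0 = correct) are assumed labels.
dev_set = [
    [{"parser": -0.1, "reconstruction": -61.7, "matching": -5.3, "metric": 0.0},
     {"parser": -3.3, "reconstruction": -17.0, "matching": -0.5, "metric": 1.0}],
    [{"parser": -5.9, "reconstruction": -15.5, "matching": -0.4, "metric": 0.0},
     {"parser": -6.7, "reconstruction": -11.3, "matching": -1.4, "metric": 1.0}],
]

parser_only = sum(expected_metric(nb, (1.0, 0.0, 0.0)) for nb in dev_set) / len(dev_set)
tuned_weights, tuned_value = tune_weights(dev_set)
print(tuned_value > parser_only)  # True
```

With only two examples the selected weights carry no meaning; the point is that the objective is a smooth expectation over candidates rather than the accuracy of the single top-ranked MR.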
Metric
We use the standard evaluation metric for each dataset: exact match accuracy for GEO, ATIS, DJANGO and corpus-level BLEU for CONALA.

Results
Table 1 lists evaluation results. We also report the oracle recall over n-best list as an upper-bound performance (last line in Table 1). First, we note that our base parser is indeed strong, performing competitively against existing neural systems on all datasets. This suggests that our base parser will serve as a reasonable testbed for the reranking model. Next, we observe that the reranker achieves improved results across the board, closing the gap between top-1 predictions and oracle recall. Notably, the reranker registers 2.9% absolute gain in accuracy on DJANGO and 5.7% in BLEU on CONALA, resp. This demonstrates that reranking is an effective approach to improve the performance of an already-strong neural parser.
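For the exact-match datasets, top-1 accuracy and oracle recall over the n-best list can be computed as in the following sketch. The MR strings and gold labels are hypothetical toy data, and exact match is taken here to mean string equality, which is an assumption.

```python
from typing import List


def top1_accuracy(nbest_lists: List[List[str]], golds: List[str]) -> float:
    """Fraction of examples whose top-ranked candidate equals the gold MR."""
    hits = sum(nbest[0] == gold for nbest, gold in zip(nbest_lists, golds))
    return hits / len(golds)


def oracle_recall(nbest_lists: List[List[str]], golds: List[str]) -> float:
    """Fraction of examples whose gold MR appears anywhere in the n-best list."""
    hits = sum(gold in nbest for nbest, gold in zip(nbest_lists, golds))
    return hits / len(golds)


# Hypothetical n-best lists (best first) and gold MRs for four examples.
nbest_lists = [
    ["result = []", "result = {}"],
    ["raise", "assert len(version) == 5"],
    ["return str('joined')", "return ' '.join(words)"],
    ["print(t)", "list(t)"],
]
golds = [
    "result = []",
    "assert len(version) == 5",
    "return ' '.join(words)",
    "list(set(t))",
]

print(top1_accuracy(nbest_lists, golds))  # 0.25
print(oracle_recall(nbest_lists, golds))  # 0.75
```

The gap between the two numbers is the headroom available to any reranker: it can fix the second and third examples, but not the fourth, whose gold MR is missing from the list.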
We also performed a feature ablation study by removing one feature at a time. For discussion, we also present qualitative examples of reranking results in Table 2. We are particularly interested in investigating the comparative utility of the discriminative matching and reconstruction features. Interestingly, we observe that while the matching feature seems to be important for semantic parsing tasks like ATIS, the reconstruction model performs generally better on the two Python code generation tasks, where target MRs are much more complex. We hypothesize that our matching model based on pairwise token associations between utterances and MRs is particularly effective for simpler MRs in ATIS, where there is a clear correspondence between utterance spans (e.g., round trip in Example 1, Table 2) and MR predicates (e.g., round_trip). This could also hold for some examples in DJANGO, where the verbose NL utterances could be roughly aligned with the MR (e.g., Example 2).

Table 2: Qualitative examples of reranking results. Each candidate MR is listed with its parser score p(z|x), reconstruction score (z → x), and discriminative matching score (x ↔ z).

Example 1 (ATIS): Show me all round trip flight from ci0 to ci1 nonstop
  z1: lambda $0 e (and (flight $0) (from $0 ci0) (to $0 ci1) (nonstop $0))
      p(z|x) = −2.0, z → x = −11.1, x ↔ z = −14.3
  z2: lambda $0 e (and (flight $0) (from $0 ci0) (to $0 ci1) (nonstop $0) (round_trip $0))
      p(z|x) = −2.3, z → x = −4.2, x ↔ z > −0.01

Example 2 (DJANGO): If length of version does not equals to integer 5, raise an exception
  z1: raise
      p(z|x) = −0.1, z → x = −61.7, x ↔ z = −5.3
  z2: assert len(version) == 5
      p(z|x) = −3.3, z → x = −17.0, x ↔ z = −0.5

Example 3 (DJANGO): and truncate, return the result. return elements of words joined in a string, separated with whitespace
  z1: return str('joined')
      p(z|x) = −1.6, z → x = −54.6, x ↔ z = −0.68
  z5: return ' '.join(words)
      p(z|x) = −2.6, z → x = −48.8, x ↔ z = −0.61

Example 4 (CONALA): Removing duplicates in list 't'
  z1: print(list(itertools.chain(*t)))
      p(z|x) = −5.9, z → x = −15.5, x ↔ z = −0.4
  z4: list(set(t))
      p(z|x) = −6.7, z → x = −11.3, x ↔ z = −1.4
On the other hand, we observe that the reconstruction model could potentially go beyond surface token-wise matching between NL utterances and MRs, promoting more complex (longer) candidate MRs that more adequately encode the semantics of the input utterance (footnote 7; e.g., in Example 3, $z_5$ receives a much higher reconstruction score, while the difference between the discriminative matching scores is small). Therefore, on the challenging CONALA dataset with much weaker alignment between its succinct NL utterances and highly compositional MRs (e.g., Example 4), the matching model does not function as well (e.g., the incorrect MR $z_1$ received a higher matching score). While a more careful investigation of the relative advantages of the reconstruction and discriminative matching features remains interesting future work, we remark that the reconstruction model p(z → x), when combined with the original parser score p(z|x) in Eq. 1, also implicitly functions as a matching model that measures semantic similarity using the bidirectional generation likelihood between x and z. Such an architecture could be an interesting future direction for modeling semantic similarity.

Footnote 7: On DJANGO, the average length of top-reranked MRs by the reconstruction and matching models is 8.6 and 7.7, resp.
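The bidirectional reading can be checked against the scores reported in Table 2. The sketch below sums the parser score and the reconstruction score for each candidate; the equal weights and all identifiers are assumptions for illustration, and the tuned weights of Eq. 1 may differ.

```python
from typing import Dict, List, Tuple

# (candidate id, parser score p(z|x), reconstruction score z -> x) from Table 2.
table2: Dict[str, List[Tuple[str, float, float]]] = {
    "Example 1 (ATIS)": [("z1", -2.0, -11.1), ("z2", -2.3, -4.2)],
    "Example 2 (DJANGO)": [("z1", -0.1, -61.7), ("z2", -3.3, -17.0)],
    "Example 3 (DJANGO)": [("z1", -1.6, -54.6), ("z5", -2.6, -48.8)],
    "Example 4 (CONALA)": [("z1", -5.9, -15.5), ("z4", -6.7, -11.3)],
}


def bidirectional_score(parser: float, reconstruction: float,
                        w_parser: float = 1.0, w_reconstruction: float = 1.0) -> float:
    """Weighted sum of the forward (x to z) and backward (z to x) scores."""
    return w_parser * parser + w_reconstruction * reconstruction


selected = {
    name: max(cands, key=lambda c: bidirectional_score(c[1], c[2]))[0]
    for name, cands in table2.items()
}
print(selected)
# {'Example 1 (ATIS)': 'z2', 'Example 2 (DJANGO)': 'z2',
#  'Example 3 (DJANGO)': 'z5', 'Example 4 (CONALA)': 'z4'}
```

In all four examples the parser alone prefers z1, while the combined score prefers the other listed candidate, including in Example 4, where the matching score favors z1.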
Additionally, the auxiliary token count feature is also effective, especially on CONALA, yielding a +1.66 gain in BLEU by promoting longer MRs.
Finally, we investigated the failure cases where our best-performing reranker generated incorrect MRs. We are particularly interested in the remaining failed examples on the simpler semantic parsing tasks (GEO, ATIS), where our reranker's accuracies are close to the oracle recall. For instance, on ATIS, only 6 incorrect examples (out of a total of 49) are due to reranking errors and 10 occur because the gold MRs are not included in the n-best list, while most (20) of the remaining cases arise because the specific NL patterns in testing utterances (e.g., the temporal NL pattern flight . . . prior to . . .) are not covered by the (small) training set. This interesting result suggests that incorporating external linguistic knowledge (e.g., Wang et al. (2014)) is important in order to further improve the performance of neural parsers on closed-domain semantic parsing tasks.

Conclusion
We proposed a feature-based reranker for neural semantic parsing, which achieved strong results on four semantic parsing and code generation tasks. In the future we plan to apply the reranker to other parsers and more benchmark datasets. We will also attempt to jointly train the base semantic parser and the reranker by using the reranker's output as supervision to fine-tune the base parser.