AdaNSP: Uncertainty-driven Adaptive Decoding in Neural Semantic Parsing

Neural semantic parsers use the encoder-decoder framework to learn an end-to-end model that transduces a natural language sentence into a formal semantic representation. To keep the model aware of the underlying grammar of target sequences, many constrained decoders have been devised in a multi-stage paradigm: they first decode to sketches or abstract syntax trees, and then decode to the target semantic tokens. We instead propose an adaptive decoding method that avoids such intermediate representations. The decoder is guided by model uncertainty and automatically uses deeper computations when necessary, so it can predict tokens adaptively. Our model outperforms the state-of-the-art neural models while requiring no expertise such as predefined grammars or sketches.


Introduction
Semantic Parsing (SP) maps a natural language utterance into a formal language, and is crucial to many tasks, such as question answering (Zettlemoyer and Collins, 2005, 2007) and code generation (Yin and Neubig, 2017). The prevailing neural semantic parsers view semantic parsing as a sequence transduction task and adopt an encoder-decoder framework similar to that of machine translation.
The distinguishing difference of semantic parsing, however, is in its target sequences, which are token sequences of well-formed semantic representations. SQL language and lambda expressions are typical examples of SP targets. The "SELECT..FROM..WHERE" pattern in SQL and the paired parentheses in lambda expressions are consequences of underlying grammars. However, standard Seq2Seq models ignore the patterns and may give ill-formed results.
To better model the grammatical and semantic constraints, many decoding methods have been devised. Some generate the tokens of an intermediate sketch first and then decode into the final formal targets. Others gradually build abstract syntax trees using a transition-based paradigm, generating tokens at the tree leaves or in the middle of the transitions (Krishnamurthy et al., 2017; Chen et al., 2018; Yin and Neubig, 2018). Some decoders also comprise several submodules, each intended to generate a different part of the semantic output (Yu et al., 2018a,b). However, these methods share a key issue: they explicitly require expertise to design intermediate representations or model structures, which is impractical for scenarios with Domain Specific Languages (DSLs) or new representations, because domains change and expert knowledge is incomplete.
To keep the benefits of these ideas while overcoming this issue, we introduce a novel adaptive decoding mechanism inspired by adaptive computing (Graves, 2016). Tokens that are pervasive in the training data are generated immediately, with little doubt. For tokens seen less often, the model may be less confident, and it is better to carry out more computations before predicting. In this way, no intermediate supervision needs to be pre-built for training, such as preprocessed sketches or predesigned grammars (Yin and Neubig, 2018), which must be manually redesigned for each unseen kind of target language. Furthermore, we use model uncertainty estimates to reflect the prediction confidence. Although different uncertainty estimates have been explored in semantic parsing, we use Dropout (Srivastava et al., 2014) as the uncertainty signal (Gal and Ghahramani, 2016) due to its simplicity, and use a policy gradient algorithm to guide the model search.
Our contributions are thus three-fold.
• We introduce the adaptive decoding mechanism into semantic parsing, which is free of intermediate representations and easily adaptable to new target languages.
• We adopt uncertainty estimates to bias the decoder search, which, to the best of our knowledge, is not covered in the architecture search literature.
• Our model outperforms the state-of-the-art neural models without any intermediate supervision.

Uncertainty-driven Adaptive Decoding Model
Our semantic parser is learned from pairs of natural language sentences and formal semantic representations. Let $x = \{x_1, x_2, \ldots, x_m\}$ denote the words in an input sentence, and $y = \{y_1, y_2, \ldots, y_n\}$ the tokens of the corresponding target lambda expression.

Adaptive Decoding Model
We first introduce the general model for adaptive decoding. In general, the model consists of an encoder, a decoder, a halting module, and the attention mechanism.
Encoder. Input words $x$ are first embedded using an embedding matrix $W_x \in \mathbb{R}^{d \times |V_x|}$, where $d$ is the dimension of the embedded vectors and $V_x$ is the set of all input words. We use a stacked two-layer BiLSTM to encode the input embeddings. The hidden states from both directions at the same position of the second layer are concatenated as the final encoder outputs $\{h_1, \ldots, h_m\}$.
Decoder. We stack two LSTM cells as one basic decoding unit. Similarly, we use a matrix $W_y$ to embed the target tokens $y$: $\mathbf{y}_i = W_y\, o(y_i)$. The token embedding serves as the input of the decoding cell.
$$s_t = \mathrm{LSTM}\big([\mathbf{y}_t;\ c^e_t;\ c^d_t;\ \mathrm{flag}],\ s_{t-1}\big), \qquad (1)$$
where $[\cdot; \cdot]$ denotes the concatenation of vectors, $c^e_t$ and $c^d_t$ are two attention context vectors described later, and flag is an additional input, either 1 or 0, concatenated to the input embedding to indicate whether the model is acting in pondering mode, which is introduced later. We further apply a linear mapping and a softmax function to the concatenation of $s_t$ and the attention vectors to obtain the token prediction probabilities. We decode tokens greedily at test time.

Figure 1: Illustration of our adaptive decoding. Attention and pondering mode are shown only at time $t$ for brevity. The decoder goes into pondering mode before every next timestep. The decoder cell is a stacked two-layer LSTM, initialized by the last forward states of the corresponding encoder layer.
Attention. We adopt two types of attention when decoding. The first attends from the decoder state to the encoder outputs and yields the input context vector $c^e_t$, where $[\cdot, \cdot]$ denotes vector stacking. The second similarly attends from the hidden state to the previous decoder outputs, yielding the context vector $c^d_t$ over the decoding history. We use a bilinear function for the encoder attention, $\mathrm{Attn}(x, y) = x^\top W y + b$, with trained parameters $W$ and $b$, and the dot-product function for the decoding history attention, $\mathrm{Attn}(x, y) = x^\top y$.
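The two scoring functions can be illustrated with a minimal pure-Python sketch. The helper names (`bilinear_score`, `dot_score`, `context_vector`) are ours, and normalizing the scores with a softmax before taking a weighted sum of the attended vectors is an assumption based on standard attention practice, not a detail stated in the text.

```python
import math

def bilinear_score(x, W, y, b):
    """Attn(x, y) = x^T W y + b, with W given as a list of rows."""
    Wy = [sum(w * y_j for w, y_j in zip(row, y)) for row in W]
    return sum(x_i * v for x_i, v in zip(x, Wy)) + b

def dot_score(x, y):
    """Attn(x, y) = x^T y."""
    return sum(a * c for a, c in zip(x, y))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(query, keys, score_fn):
    """Weighted sum of `keys`; weights are softmax-normalized scores (assumed)."""
    weights = softmax([score_fn(query, k) for k in keys])
    dim = len(keys[0])
    return [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
```

For the bilinear case, `score_fn` can be a small closure over hypothetical trained parameters, e.g. `lambda q, k: bilinear_score(q, W, k, b)`.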
Halting and Pondering. The key feature of our model is that it adaptively chooses the decoder depth before predicting each token. Given the output state $s_t$ from (1), the model enters the pondering mode. The output state $s_t$ is sent to a halting module, which generates a probability $p_t$ positively correlated with the model uncertainty. We use an MLP with ReLU and sigmoid activations for the halting module. A choice is then sampled from the Bernoulli distribution determined by $p_t$. If the model chooses to continue, we again use (1) to update the state, with the same embedding $\mathbf{y}_t$ as the input.
$$s^{(k)}_t = \mathrm{LSTM}\big([\mathbf{y}_t;\ c^e_k;\ c^d_k;\ \mathrm{flag}],\ s^{(k-1)}_t\big), \qquad (3)$$
where $s^{(0)}_t = s_t$, flag = 1, and $c^e_k$, $c^d_k$ are the attention vectors recomputed with $s^{(k-1)}_t$ using (2). The model keeps pondering until it chooses to stop or reaches our limit of $k = 3$. The final state $s^{(k)}_t$ then acts as the original $s_t$ in (1) for the other modules.
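The control flow of the pondering mode can be sketched as follows. This is only an illustration under stated assumptions: `step_fn` stands in for the decoder update with flag = 1, `halt_prob_fn` stands in for the halting module, and we assume $p_t$ is the probability of choosing Ponder; all three names are hypothetical.

```python
import random

MAX_PONDER_STEPS = 3  # the limit k = 3 mentioned in the text

def ponder(state, step_fn, halt_prob_fn, rng=random):
    """Refine `state` until Stop is sampled or the step limit is reached.

    step_fn(state) -> new state (placeholder for the flag = 1 decoder update)
    halt_prob_fn(state) -> p in [0, 1], assumed probability of Ponder
    Returns (final_state, number_of_pondering_steps).
    """
    k = 0
    while k < MAX_PONDER_STEPS:
        p = halt_prob_fn(state)
        if rng.random() >= p:  # the Bernoulli(p) sample came out as Stop
            break
        state = step_fn(state)
        k += 1
    return state, k
```

With a halting probability fixed at 1 the loop always runs to the limit, and with a probability fixed at 0 it stops immediately, which makes the two extremes easy to check.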

Uncertainty Estimates
Since the halting module outputs a Bernoulli distribution to guide the decoder, we have to provide some uncertainty quantification for training. Fortunately, Dropout (Srivastava et al., 2014) has been shown to give a good uncertainty estimate (Gal and Ghahramani, 2016). It is simple and effective, and neither the model nor the optimization algorithm needs to be changed. We leave other estimates to future work.
To estimate uncertainty with Dropout, we leave the model in training mode so that Dropout stays enabled. We run the forward pass of equation (3) $F$ times with the same inputs. The output states are then used to compute the token probabilities $q$, where $\Theta_i$ is the set of all parameters perturbed by Dropout in the $i$-th forward pass. We take the variance of $q$ to reflect the model uncertainty, $U(s_t) = \mathrm{Var}(q)$, as suggested by Gal and Ghahramani (2016). We disable gradient propagation when computing the variance so that the gradient-based optimization is not influenced.
Because the variance of a set of probabilities may not be large in practice, we rescale it to make it more numerically robust: $U_n(s_t) = \min(\gamma, \mathrm{Var}(q)) / \gamma$, where $\gamma = 0.15$ in our case.
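The rescaling step can be sketched in a few lines. We assume here that $q$ is a list of $F$ probabilities for the same token, one per Dropout-perturbed forward pass, and that the population variance is used; neither choice is specified in the text, and the function name is ours.

```python
from statistics import pvariance

GAMMA = 0.15  # clipping threshold from the text

def rescaled_uncertainty(q, gamma=GAMMA):
    """U_n = min(gamma, Var(q)) / gamma, mapped into [0, 1].

    q: token probabilities from F stochastic forward passes (assumed layout).
    """
    return min(gamma, pvariance(q)) / gamma
```

Since probabilities lie in [0, 1], their variance can never exceed 0.25, so clipping at $\gamma$ and dividing by it spreads the typically small variances over the full unit interval.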

Learning
Our model consists of the Seq2Seq part (encoder, decoder, and attention) and the halting module. For the former, we minimize the traditional cross-entropy loss with gradient descent, $J_{ent} = -\mathbb{E}_{(x,y)} \log p(y \mid x)$.
We use the REINFORCE algorithm to optimize the halting module. The module acts as our policy network, by which the model consecutively makes decisions from the action space $A = \{\text{Ponder}, \text{Stop}\}$. Each time the model makes a choice $a \in A$, the uncertainty of the Seq2Seq part is involved in the reward, where $a = 1$ means a Ponder choice and $a = 0$ a Stop choice. We measure correctness by checking whether the greedily decoded token matches the target, i.e., $\arg\max_y p(y \mid s^{(k)}_t) = y_{t+1}$. The model is rewarded for a Stop action if the prediction is correct, and for a Ponder action if the prediction is incorrect. This is similar to the ponder cost of ACT, which discourages excessive pondering steps.
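The reward logic described above can be illustrated with a deliberately simplified sketch. The ±1 reward below is an assumption that keeps only the correctness rule and omits the uncertainty term; `simple_reward` and `reinforce_loss` are hypothetical names, and the loss is the generic per-decision REINFORCE surrogate for a Bernoulli policy rather than the exact objective of our model.

```python
import math

PONDER, STOP = 1, 0  # a = 1 is Ponder, a = 0 is Stop, as in the text

def simple_reward(action, correct):
    """+1 when the action agrees with the rule in the text, else -1 (assumed)."""
    agrees = (action == STOP and correct) or (action == PONDER and not correct)
    return 1.0 if agrees else -1.0

def reinforce_loss(p_ponder, action, reward, eps=1e-8):
    """Surrogate loss -reward * log pi(action) for one halting decision."""
    prob = p_ponder if action == PONDER else 1.0 - p_ponder
    return -reward * math.log(prob + eps)
```

Minimizing this surrogate raises the probability of actions with positive reward and lowers it for actions with negative reward.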

Experiments
We compare our method with other models on two datasets. Our code is available at https://github.com/zxteloiv/AdaNSP.

Experimental Setup
Datasets. We use the preprocessed ATIS and GeoQuery datasets kindly provided by Dong and Lapata (2018). All natural language sentences are lowercased and stemmed with NLTK (Bird et al., 2009). Entity mentions such as city codes and flight numbers are anonymized using numbered placeholders.
Setups. We choose hyperparameters on the ATIS dataset using its validation set. The GeoQuery dataset does not come with a validation set, so we randomly shuffle the training set, take the first 100 records as the validation set, and use the rest as the new training data. After choosing the best hyperparameters, we go back to training on the original training set. The Dropout ratio is selected from {0.5, 0.6, 0.7, 0.8}, and the embedding dimension $d$ is chosen from {64, 128, 256, 512}. We fix the batch size to 20, and both the encoder and the decoder cell are two stacked LSTM layers. We apply scheduled sampling (Bengio et al., 2015) with a ratio of 0.2 during training. We run $F = 5$ forward passes before computing the variance. We use Adam (Kingma and Ba, 2015) as the optimizer, with the default parameters from the paper.
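The GeoQuery validation split can be sketched as below. The function name and the fixed `seed` are illustrative assumptions; the text only states that the training set is shuffled and the first 100 records are held out.

```python
import random

def split_validation(records, n_valid=100, seed=0):
    """Shuffle a copy of `records`; return (new_train, validation)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seed is a hypothetical choice
    return shuffled[n_valid:], shuffled[:n_valid]
```

The original list is left untouched, so retraining on the full training set after hyperparameter selection needs no extra bookkeeping.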
Evaluation. We use logical form accuracy as the evaluation metric, computed with the parsed trees of the predictions and the true labels. Two trees are considered identical as long as their structures are the same, i.e., the order of sibling predicates does not matter. We reuse existing STree parser code.
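An order-insensitive tree comparison of this kind can be sketched as follows. This is not the STree code; it assumes each tree is a `(label, children)` pair and treats the order of all siblings as irrelevant, which may be looser than the actual evaluation for predicates whose argument order matters.

```python
def canonical(tree):
    """Return a canonical form in which siblings are recursively sorted."""
    label, children = tree
    return (label, tuple(sorted(canonical(c) for c in children)))

def same_tree(pred, gold):
    """True if the two trees are equal up to sibling order."""
    return canonical(pred) == canonical(gold)
```

Logical form accuracy is then the fraction of test examples for which `same_tree` holds between the predicted and the gold tree.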

Results and Analysis
Our model outperforms the other neural semantic parsers compared on these two datasets. We reuse the reported results of the other models since the datasets are identical. Results are listed in Table 1. Our results are better than those of the state-of-the-art models Coarse2fine and TranX (Yin and Neubig, 2018) even without any intermediate representations, whereas Coarse2fine defines a sketch and TranX uses ASDL for every type of target semantic sequence. We outperform Coarse2fine by 0.7% and 0.9% on the GeoQuery and ATIS datasets, respectively. Although Jia and Liang (2016) have a slightly better result on GeoQuery, they introduced a synchronous CFG to learn new, recombined examples from the training data, which is a novel method of data augmentation and requires much human effort for preprocessing. In an ablation test, our degenerated model without the pondering part suffers considerable performance drops of 2.8% and 2.9% on the GeoQuery and ATIS datasets, respectively.

Table 1: Logical form accuracy (%) on the GeoQuery (Geo) and ATIS datasets.

| Model | Geo | ATIS |
|---|---|---|
| ZC07 (Zettlemoyer and Collins, 2007) | 86.1 | 84.6 |
| λ-WASP (Wong and Mooney, 2007) | 86.6 | - |
| FUBL (Kwiatkowski et al., 2011) | 88.6 | 82.8 |
| TISP (Zhao and Huang, 2015) | 88.9 | 84.2 |
| Neural network models | | |
| Seq2Seq (Dong and Lapata, 2016) | 84.6 | 84.2 |
| Seq2Tree (Dong and Lapata, 2016) | 87.1 | 84.6 |
| JL16 (Jia and Liang, 2016) | 89.3 | 83.3 |
| TranX (Yin and Neubig, 2018) | 88.2 | 86.2 |
| Coarse2fine | 88.2 | 87.7 |
| AdaNSP (ours) | 88.9 | 88.6 |
| - halting module | 86.1 | 85.7 |

Related Work
Some methods (Zettlemoyer and Collins, 2005, 2007; Kwiatkowski et al., 2010; Wong and Mooney, 2006, 2007) try to model the correlation between semantic tokens and the lexical meaning of natural language sentences. Methods based on dependency trees (Ge and Mooney, 2009; Liang et al., 2011; Reddy et al., 2016) instead convert the outputs of an existing syntactic parser into semantic representations, which can be easily adopted in languages with much fewer resources than English. Recently, neural semantic parsers, especially those under the encoder-decoder framework, have also sprung up (Dong and Lapata, 2016, 2018; Jia and Liang, 2016; Xiao et al., 2016). To make the model aware of the underlying grammar of the targets, people try to exert constraints on the decoder side through sketches, typing, grammars, and runtime execution guides (Krishnamurthy et al., 2017; Groschwitz et al., 2018; Wang et al., 2018). Moreover, learning algorithms in SP such as structural learning and maximum marginal likelihood are combined with reinforcement algorithms (Guu et al., 2017; Iyyer et al., 2017; Misra et al., 2018).
Adaptive Computing. Adaptive Computation Time (ACT) was first proposed to adaptively learn the depth of RNN models from data (Graves, 2016). Skip-RNN (Campos et al., 2018) used a similar idea to equip existing RNN cells with a skipping mechanism, which adaptively skips some recurrent blocks along the computational graph and thus saves many computations. BlockDrop (Wu et al., 2018) also introduced the REINFORCE algorithm to jointly learn a dropping policy and discard some blocks of the ResNet with the policy network.
Recently, Dehghani et al. (2019) proposed Universal Transformers (UT) as an alternative form of the vanilla Transformer (Vaswani et al., 2017). UT uses ACT to control how many times the basic layer block (with shared parameters) is applied recurrently, instead of stacking different block layers as in the vanilla Transformer. This helps UT mimic the inductive bias of RNNs; UT was shown to be Turing-complete and has outperformed the vanilla Transformer on many tasks.

Conclusion
We present AdaNSP, which adaptively searches the corresponding computation structure of RNNs for semantic parsing. Our method does not need

References
Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language