A Split-and-Recombine Approach for Follow-up Query Analysis

Context-dependent semantic parsing has proven to be an important yet challenging task. To leverage the advances in context-independent semantic parsing, we propose to perform follow-up query analysis, aiming to restate context-dependent natural language queries with contextual information. To accomplish the task, we propose STAR, a novel approach with a well-designed two-phase process. It is parser-independent and able to handle multifarious follow-up scenarios in different domains. Experiments on the FollowUp dataset show that STAR outperforms the state-of-the-art baseline by a large margin of nearly 8%. The superiority of the parsing results verifies the feasibility of follow-up query analysis. We also explore the extensibility of STAR on the SQA dataset, with very promising results.


Introduction
Recently, Natural Language Interfaces to Databases (NLIDB) have received considerable attention, as they allow users to query databases directly in natural language. Current studies mainly focus on context-independent semantic parsing, which translates a single natural language sentence into its corresponding executable form (e.g. Structured Query Language) and retrieves the answer from databases regardless of context. However, context does matter in real-world applications. Users tend to issue queries in a coherent way when communicating with NLIDB. For example, after the query "How much money has Smith earned?" (Precedent Query), users may pose another query by simply asking "How about Bill Collins?" (Follow-up Query) instead of the complete "How much money has Bill Collins earned?" (Restated Query). Therefore, contextual information is essential for more accurate and robust semantic parsing, namely context-dependent semantic parsing.

* Work done during an internship at Microsoft Research.
Compared with context-independent semantic parsing, context-dependent semantic parsing has received less attention. Several attempts include a statistical model with parser trees (Miller et al., 1996), a linear model with context-dependent logical forms (Zettlemoyer and Collins, 2009) and a sequence-to-sequence model (Suhr et al., 2018). However, none of these methods applies to different domains, since the ATIS dataset (Dahl et al., 1994) they rely on is domain-specific. A search-based neural method, DynSP*, arises along with the SequentialQA (SQA) dataset (Iyyer et al., 2017), taking the first step towards cross-domain context-dependent semantic parsing. Nevertheless, DynSP* focuses on relatively simple scenarios. All the aforementioned methods design context-dependent semantic parsers from scratch. Instead, inspired by Liu et al. (2019), we propose to directly leverage the technical advances in context-independent semantic parsing. We define follow-up query analysis as restating follow-up queries in natural language using contextual information; the restated queries can then be translated into their corresponding executable forms by existing context-independent parsers. In this way, we boost the performance of context-dependent semantic parsing.
In this paper, we focus on follow-up query analysis and present a novel approach. The main idea is to decompose the task into two phases by introducing a learnable intermediate structure, the span: the two queries first get split into several spans, which then undergo a recombination process. As no intermediate annotation is involved, we design rewards to jointly train the two phases by applying reinforcement learning (RL) (Sutton and Barto, 1998). Our major contributions are as follows:
• We propose a novel approach, named SpliT-And-Recombine (STAR), to restate follow-up queries via a two-phase process. It is parser-independent and can be seamlessly integrated with existing context-independent semantic parsers.
• We conduct experiments on the FollowUp dataset (Liu et al., 2019), which covers multifarious cross-domain follow-up scenarios. The results demonstrate that our approach significantly outperforms the state-of-the-art baseline.
• We redesign the recombination process and extend STAR to the SQA dataset, where the annotations are answers. Experiments show promising results, which demonstrates the extensibility of our approach.

Methodology
In this section, we first give an overview of our proposed method with the idea of two-phase process, then introduce the two phases in turn.

Overview of Split-And-Recombine
Let x = (x_1, ..., x_n), y = (y_1, ..., y_m) and z = (z_1, ..., z_l) denote the precedent query, follow-up query and restated query respectively, each of which is a natural language sentence. Our goal is to interpret the follow-up query y with its precedent query x as context, and generate the corresponding restated query z. The restated query has the same meaning as the follow-up query, but is complete and unambiguous, facilitating better downstream parsing. Formally, given the pair (x, y), we aim to learn a model P_model(z | x, y) and maximize the objective:

    L = E_{(x,y,z)∼D} [ log P_model(z | x, y) ],    (1)

where D represents the set of training data. As to P_model(z | x, y), since z always overlaps a great deal with x and y, it is intuitively more straightforward to find a way to merge x and y. To this end, we design a two-phase process and present a novel approach, STAR, to perform follow-up query analysis with reinforcement learning. A concrete example of the two-phase process is shown in Figure 1. (Code is available at http://github.com/microsoft/EMNLP2019-Split-And-Recombine.)

Figure 1: The two-phase process of an example from the FollowUp dataset (more real cases of diverse follow-up scenarios can be found in Table 3).

Phase I is to Split input queries
into several spans. For example, the precedent query is split into 3 spans: "How much money has", "Smith" and "earned". Let q denote a way to split (x, y); then Phase I can be formulated as P_split(q | x, y). Phase II is to Recombine the spans by finding the most probable conflicting way, and to generate the final output by restatement, denoted as P_rec(z | q). Two spans being conflicting means they are semantically similar; for example, "Smith" conflicts with "Bill Collins". A conflicting way contains all conflicts between the precedent and follow-up spans. Backed by the two-phase idea of splitting and recombination, the overall likelihood of generating z given x, y is:

    P_model(z | x, y) = Σ_{q∈Q} P_split(q | x, y) · P_rec(z | q),    (2)

where Q represents the set of all possible ways to split (x, y). Due to the lack of annotations for splitting and recombination, it is hard to directly perform supervised learning. Inspired by Liang et al. (2017), we employ RL to optimize P_model. Denoting the predicted restated query by z̄ and abbreviating E_{(x,y,z)∼D} as E, the goal of RL training is to maximize the following objective:

    L_rl = E [ Σ_{q∈Q} Σ_{z̄∈Z} P_split(q | x, y) · P_rec(z̄ | q) · r(z̄, z) ],    (3)

where Z is the space of all restated query candidates and r represents the reward defined by comparing z̄ and the annotation z. However, the overall candidate space Q × Z is vast, making it impossible to exactly maximize L_rl. The most straightforward usage of the REINFORCE algorithm (Williams, 1992), sampling both q and z̄, also poses challenges for learning. To alleviate the problem, we propose to sample q and enumerate all candidates z̄ once q is determined. This shrinks the sampling space at an acceptable computational cost, which will be discussed in Section 3.2.2. The problem thus turns into designing a reward function R(q, z) to evaluate q and guide the learning. To achieve this, we reformulate Equation 3 as:

    L_rl = E [ Σ_{q∈Q} P_split(q | x, y) · R(q, z) ],    (4)

and set R(q, z) as:

    R(q, z) = Σ_{z̄∈Z} P_rec(z̄ | q) · r(z̄, z).    (5)

The overview of STAR is summarized in Figure 2.
Given x, y, during training of Phase I (in blue), we fix P_rec to provide the reward R(q, z); then P_split can be learnt via the REINFORCE algorithm. During training of Phase II (in red), we fix P_split and utilize it to generate q; P_rec is trained to maximize Equation 5. In this way, P_split and P_rec are jointly trained. The details are introduced below.

Phase I: Split
As mentioned above, with P_rec fixed, Phase I updates P_split, the Split Neural Network (SplitNet). Taking the precedent query and follow-up query as input, as shown in Figure 2, splitting spans can be viewed as a sequence labeling problem over the input. For each word, SplitNet outputs a label Split or Retain, indicating whether a split operation will be performed after the corresponding word. A label sequence uniquely identifies a way of splitting (x, y), referred to as q in Section 2.1. Figure 3 gives an example at the bottom. In the precedent query, two split operations are performed after "has" and "Smith", since their labels are Split.
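To make the labeling scheme concrete, here is a minimal sketch of how a Split/Retain label sequence determines the resulting spans (the helper name and list representation are ours, not the released code):

```python
def spans_from_labels(words, labels):
    """Cut a query into spans at every position labeled 'Split'.

    labels[i] refers to the gap AFTER words[i]; the last word needs
    no label, so len(labels) == len(words) - 1.
    """
    spans, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        if i < len(labels) and labels[i] == "Split":
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

precedent = "How much money has Smith earned".split()
# Split after "has" and after "Smith", as in Figure 3
labels = ["Retain", "Retain", "Retain", "Split", "Split"]
print(spans_from_labels(precedent, labels))
# -> ['How much money has', 'Smith', 'earned']
```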

Split Neural Network
Intuitively, only after obtaining information from both the precedent query and follow-up query can SplitNet get to know the reasonable way to split spans. Inspired by BiDAF (Seo et al., 2017), we apply a bidirectional attention mechanism to capture the interrelations between the two queries.
Embedding Layer We consider embeddings at three levels: character, word and sentence, respectively denoted as φ_c, φ_w and φ_s. Character-level embedding maps each word to a vector in a high-dimensional space using Convolutional Neural Networks (Kim, 2014). Word-level embedding is initialized using GloVe (Pennington et al., 2014) and then updated along with the other parameters. Sentence-level embedding is a one-hot vector designed to distinguish between precedent and follow-up queries. The overall embedding φ(x_i) of a word x_i is the concatenation of the three levels.

Context Layer On top of the embedding layer, a Bidirectional Long Short-Term Memory Network (BiLSTM) (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997) is applied to capture contextual information within one query. For word x_i (i = 1, ..., n) in the precedent query x, the hidden state is the concatenation of the forward and backward hidden states:

    h_i = [→h_i; ←h_i],    (6)

where the forward hidden state is:

    →h_i = LSTM(φ(x_i), →h_{i−1}).    (7)

Similarly, a hidden state u_j is computed for word y_j (j = 1, ..., m). The BiLSTMs for x and y share the same parameters.
Attention Layer The interrelations between the precedent and follow-up queries are captured via the attention layer. Let H = [h_1, h_2, ..., h_n] and U = [u_1, u_2, ..., u_m] denote the hidden states of the two queries respectively; following Seo et al. (2017), the similarity matrix is computed as:

    A_{i,j} = w^⊤ [h_i; u_j; h_i ∘ u_j],    (8)

where A ∈ R^{n×m}, the entry A_{i,j} represents the similarity between words x_i and y_j, w is a trainable vector and ∘ denotes element-wise multiplication. Then the softmax function is used to obtain the precedent-to-follow (P2F) attention and the follow-to-precedent (F2P) attention. P2F attention represents y_j using the similarities between y_j and every word in x.
Specifically, let f_j = softmax(A_{:,j}), where f_j ∈ R^n denotes the attention weights on x according to y_j. Then y_j can be represented by a precedent-aware vector ũ_j = Σ_{k=1}^{n} f_j[k] · h_k. Similarly, F2P attention computes the attention weights on y according to x_i, and represents x_i as a follow-aware vector h̃_i.
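The bidirectional attention above can be sketched in a few lines of NumPy. For brevity the dot product stands in for the learned scoring function of Equation 8, so treat the similarity computation as illustrative only:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# Toy hidden states: n = 3 precedent words, m = 2 follow-up words, dim d = 4
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))   # h_1 .. h_n
U = rng.normal(size=(2, 4))   # u_1 .. u_m

# Dot-product similarity as a stand-in for the trainable scorer
A = H @ U.T                   # A[i, j] ~ similarity of x_i and y_j

# P2F attention: represent each follow-up word y_j by precedent words
f = np.stack([softmax(A[:, j]) for j in range(U.shape[0])], axis=1)
U_tilde = f.T @ H             # row j is the precedent-aware vector for y_j

# F2P attention: represent each precedent word x_i by follow-up words
b = np.stack([softmax(A[i, :]) for i in range(H.shape[0])], axis=0)
H_tilde = b @ U               # row i is the follow-aware vector for x_i
```

Each column of f and each row of b is a probability distribution, so the attended vectors are convex combinations of the other query's hidden states.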
Output Layer Combining the outputs of the context layer and the attention layer, we design the final hidden states as:

    c^x_i = [h_i; h̃_i; h_i ∘ h̃_i],    c^y_j = [u_j; ũ_j; u_j ∘ ũ_j],    (9)

where i ∈ {1, ..., n−1}, j ∈ {1, ..., m−1} and ∘ denotes element-wise multiplication (Lee et al., 2017). Let c = (c_t)_{t=1}^{T} = (c^x_1, ..., c^x_{n−1}, c^y_1, ..., c^y_{m−1}) denote the final hidden state sequence. At each position t, the probability of Split is σ(W c_t + b), where σ denotes the sigmoid function and {W, b} are parameters.

Training
It is difficult to train an RL model from scratch. Therefore, we propose to initialize SplitNet via pre-training, and then use the reward to optimize it.
Pre-training We obtain the pre-training annotation a by finding the common substrings between (x, y) and z. Each a is a label sequence whose elements are Split or Retain. Given the pre-training dataset D_pre, whose training instances are of the form (x, y, a), the objective of pre-training is:

    L_pre = Σ_{(x,y,a)∈D_pre} log p_θ(a | x, y),    (10)

where θ denotes the parameters of SplitNet.
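A minimal sketch of how Split/Retain labels could be derived by aligning the precedent query against the restated query. The bigram heuristic below (keep two adjacent words together iff they are also adjacent in the restated query) is our simplification of the paper's common-substring procedure:

```python
def pretrain_labels(query, restated):
    """Heuristic Split/Retain labels for `query` given `restated`.

    Adjacent words of `query` stay in one span iff they also occur
    adjacently in `restated`; otherwise a Split is placed between them.
    Simplified stand-in for the common-substring alignment.
    """
    restated_bigrams = set(zip(restated, restated[1:]))
    labels = []
    for a, b in zip(query, query[1:]):
        labels.append("Retain" if (a, b) in restated_bigrams else "Split")
    return labels

x = "How much money has Smith earned".split()
z = "How much money has Bill Collins earned".split()
print(pretrain_labels(x, z))
# -> ['Retain', 'Retain', 'Retain', 'Split', 'Split']
```

This reproduces the splits after "has" and "Smith" from the running example, since "has Smith" and "Smith earned" never occur in the restated query.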
Policy Gradient After pre-training, we treat the label sequence as a random variable ã. The reward R(ã, z) (details in Section 2.3) is used to optimize the parameters θ with policy gradient methods (Sutton et al., 1999). SplitNet is trained to maximize the following objective:

    J(θ) = E_{ã∼p_θ(ã|x,y)} [ R(ã, z) ].    (11)

In practice, the REINFORCE algorithm (Williams, 1992) is applied to approximate Equation 11 by sampling ã from p_θ(ã | x, y) M times, where M is a hyper-parameter representing the sample size. Furthermore, a baseline (Weaver and Tao, 2001) is subtracted from R(ã, z) to reduce variance. The final gradient estimate is:

    ∇_θ J(θ) ≈ (1/M) Σ_{k=1}^{M} ( R(ã_k, z) − b ) ∇_θ log p_θ(ã_k | x, y),    (12)

where b denotes the baseline.
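The REINFORCE step with a mean-reward baseline can be sketched as follows. The function name and the use of the batch mean as the baseline are our assumptions; in the real system the surrogate loss would be differentiated through SplitNet's parameters:

```python
import numpy as np

def reinforce_update(logps, rewards):
    """One REINFORCE step with a mean-reward baseline (sketch).

    logps[k]  : log p_theta(a_k | x, y) for sampled label sequence a_k
    rewards[k]: R(a_k, z)
    Returns the advantages and a surrogate loss whose gradient w.r.t.
    theta matches the policy-gradient estimate.
    """
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean()              # variance-reduction baseline
    advantages = rewards - baseline
    surrogate = -(advantages * np.asarray(logps, dtype=float)).mean()
    return advantages, surrogate
```

Subtracting the baseline leaves the gradient unbiased (the advantages always sum to zero) while reducing the variance of the estimate.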

Phase II: Recombine
Here we present Phase II by answering two questions: (1) given the sampled label sequence ã, how to compute its reward R(ã, z); and (2) how to train and perform inference with P_rec.

Reward Computation
Receiving the label sequence ã, we first enumerate all conflicting way candidates. Following the example in Figure 3, once we get a deterministic ã, the split of (x, y) is uniquely determined. Here x and y are split into 3 and 2 spans respectively. Treating spans as units, we enumerate all conflicting way candidates methodically. We adhere to the one-to-one conflict principle, which means a span either has no conflict (denoted as EMPTY) or conflicts with exactly one span in the other query. Let C denote the set of all conflicting way candidates, whose size is 13 in Figure 3. For each conflicting way, we deterministically generate a restated query via a process named Restatement. In general, we simply replace spans in the precedent query with their conflicting spans to generate the restated query. For example, in Figure 3, the first candidate in C is restated as "How about Bill Collins earned". Spans in the follow-up query that contain column names or cell values but have no conflict are appended to the tail of the precedent query. This is designed to remedy the sub-query situation where there is no conflict (e.g. "Which opponent received over 537 attendance" and "And which got the result won 5-4"). Specially, if a span in the follow-up query contains a pronoun, we instead replace it with its conflicting span to obtain the restated query.
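Under the one-to-one principle, C is exactly the set of partial injective matchings between precedent and follow-up spans, of size Σ_k C(n_x, k)·C(n_y, k)·k!. The sketch below (our own illustrative enumeration, not the released code) reproduces the count of 13 for the 3-span/2-span example in Figure 3:

```python
from itertools import combinations, permutations

def conflict_ways(n_x, n_y):
    """Enumerate all one-to-one conflict candidates between n_x precedent
    spans and n_y follow-up spans. Each way is a set of (u, v) index
    pairs; spans not mentioned in a way are implicitly EMPTY."""
    ways = []
    for k in range(min(n_x, n_y) + 1):          # number of matched pairs
        for xs in combinations(range(n_x), k):  # which precedent spans
            for ys in permutations(range(n_y), k):  # which follow-up spans
                ways.append(frozenset(zip(xs, ys)))
    return ways

print(len(conflict_ways(3, 2)))  # -> 13, matching the example in Figure 3
```

For 3 and 2 spans this is 1 (no conflict) + 6 (one pair) + 6 (two pairs) = 13.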
Finally, the reward can be computed. Here we use BLEU and SymAcc² to build the reward function, expanding r(z, z̄) in Equation 5 as:

    r(z, z̄) = α · BLEU(z, z̄) + β · SymAcc(z, z̄),    (13)

where α, β > 0 and α + β = 1. The reward for ã can then be obtained using Equation 5.
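The mixed reward is straightforward once the two metric values are available. In this sketch they are assumed to be precomputed in [0, 1], and α = β = 0.5 as in the experimental setup:

```python
def mixed_reward(bleu, symacc, alpha=0.5, beta=0.5):
    """r(z, z_bar) = alpha * BLEU + beta * SymAcc, with alpha + beta = 1.

    Metric values are assumed precomputed in [0, 1]; illustrative
    sketch, not the released implementation."""
    assert alpha > 0 and beta > 0 and abs(alpha + beta - 1.0) < 1e-9
    return alpha * bleu + beta * symacc

print(mixed_reward(0.8, 0.6))  # ~0.7
```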

Training and Inference
Besides the reward computation, the recombination model P_rec needs to be trained to maximize Equation 5. To achieve this, we define a conflicting probability matrix F ∈ R^{N_x×N_y}, where N_x and N_y denote the numbers of spans in x and y respectively. The entry F_{u,v}, the conflicting probability between the u-th span in x and the v-th span in y, is obtained by normalizing the cosine similarity between their representations. Here we adopt the subtraction representation (Wang and Chang, 2016; Cross and Huang, 2016): a span covering words x_s, ..., x_e is represented as h_e − h_{s−1}, computed from the same BiLSTM as in the context layer in Section 2.2.1. Given a conflicting way c̄ ∈ C, the probability of generating its corresponding z̄ can be written as the product over g(u, v):

    P_rec(z̄ | ã) = Π_{(u,v)} g(u, v),    (14)

² Their definitions, along with the motivations for using them, will be explained in Section 3.2.
where g(u, v) = F_{u,v} if the u-th span in x conflicts with the v-th span in y; otherwise, g(u, v) = 1 − F_{u,v}. With the above formulation, we can maximize Equation 5 through automatic differentiation. To reduce computation, we only maximize P_rec(z̄* | ã), a near-optimal solution to Equation 5, where z̄* = argmax_{z̄∈Z} r(z̄, z) denotes the best predicted restated query so far.
Guided by the golden restated query z, during training we find z̄* by computing the reward of each candidate. In inference, however, there is no golden restated query, so we can only obtain z̄* from F. Specifically, for the v-th span in the follow-up query, we find u* = argmax_u F_{u,v}. That is, compared with the other spans in the precedent query, the u*-th span has the highest probability of conflicting with the v-th span in the follow-up query. Moreover, similar to Lee et al. (2017), if F_{u*,v} < λ, the v-th span in the follow-up query is judged to have no conflict. The hyper-parameter λ > 0 denotes the threshold.
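The inference rule can be sketched directly: take the argmax over each column of F, falling back to "no conflict" below the threshold λ. The matrix values here are made up for illustration:

```python
import numpy as np

def infer_conflicts(F, lam=0.6):
    """For each follow-up span v, pick the precedent span u with the
    highest conflicting probability; if it falls below the threshold
    `lam`, the span is judged to have no conflict (None)."""
    conflicts = {}
    for v in range(F.shape[1]):
        u_star = int(np.argmax(F[:, v]))
        conflicts[v] = u_star if F[u_star, v] >= lam else None
    return conflicts

# Toy matrix: 3 precedent spans x 2 follow-up spans
F = np.array([[0.1, 0.2],
              [0.9, 0.3],
              [0.2, 0.4]])
print(infer_conflicts(F, lam=0.6))  # -> {0: 1, 1: None}
```

Here follow-up span 0 conflicts with precedent span 1 (probability 0.9), while follow-up span 1's best match (0.4) is below λ and is left unmatched.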

Extension
So far, we have introduced the whole process of STAR. Next we explore its extensibility. As observed, when the annotations are restated queries, STAR is parser-independent and can be incorporated into any context-independent semantic parser. But what if the annotations are answers to follow-up queries? Assuming we have an ideal semantic parser, a predicted restated query z̄ can be converted into its corresponding answer w̄. For example, given z̄ as "where are the players from", w̄ could be "Las Vegas". Therefore, revisiting Equation 3, in theory STAR can be extended by redesigning r as r(w, w̄), where w denotes the answer annotation. We conduct an extension experiment to verify this, as discussed in Section 3.3.

Experiments
In this section, we demonstrate the effectiveness of STAR on the FollowUp dataset 3 with restated query annotations, and its promising extensibility on the SQA dataset 4 with answer annotations.

Implementation details
We utilize PyTorch (Paszke et al., 2017) and AllenNLP (Gardner et al., 2018) for implementation, and adopt Adam (Kingma and Ba, 2015) as the optimizer. Dropout (Blum et al., 2015) is employed at the embedding layer for better generalization ability (with probability 0.5). The learning rate is set to 0.001 for pre-training, 0.0001 for RL training on FollowUp, and 0.0002 for SQA. In the implementation of the REINFORCE algorithm, we set M to 20. Finally, for the hyper-parameters, we set α = 0.5, β = 0.5 and λ = 0.6. All results are averaged over 5 runs with random initialization.

Results on FollowUp dataset
The FollowUp dataset contains 1000 natural language query triples (x, y, z). Each triple belongs to a single database table, and there are 120 tables in several different domains. Following previous work, we split them into sets of size 640/160/200 for train/dev/test. We evaluate the methods using both answer-level and query-level metrics. AnsAcc checks the answer accuracy of predicted queries manually. Concretely, 103 golden restated queries can be successfully parsed by COARSE2FINE (Dong and Lapata, 2018). We parse their corresponding predicted queries into SQL using COARSE2FINE and manually check the answers. Although AnsAcc is the most convincing metric, it cannot cover the entire test set. Therefore, we apply two query-level metrics. SymAcc detects whether all the SQL-related words (e.g. column names and cell values) are correctly involved in the predicted queries. It reflects the approximate upper bound of AnsAcc, as the correctness of SQL-related words is a prerequisite for correct execution in most cases. BLEU, referring to the cumulative 4-gram BLEU score, evaluates how similar the predicted queries are to the golden ones (Papineni et al., 2002). SymAcc focuses on limited keywords, so we introduce BLEU to evaluate the quality of the entire predicted query.

Model Comparison
Our baselines fall into two categories. Generation-based methods conform to the sequence-to-sequence architecture (Sutskever et al., 2014) and generate restated queries by decoding each word from scratch. SEQ2SEQ (Bahdanau et al., 2015) is the sequence-to-sequence model with attention, and COPYNET further incorporates a copy mechanism. COPY+BERT incorporates the pre-trained BERT model (Devlin et al., 2019) as the encoder of COPYNET. Rewriting-based methods obtain restated queries by rewriting the precedent and follow-up queries. CONCAT directly concatenates the two queries. E2ECR (Lee et al., 2017) obtains restated queries by performing coreference resolution on follow-up queries. FANDA (Liu et al., 2019) utilizes a structure-aware model to merge the two queries. Our method STAR also belongs to this category.
Answer Level Table 1 shows AnsAcc results of competitive baselines on the test set. Compared with them, STAR achieves the highest accuracy, 65.05%, which demonstrates its superiority. Meanwhile, it verifies the feasibility of follow-up query analysis in cooperation with context-independent semantic parsing. Compared with CONCAT, our approach boosts AnsAcc on COARSE2FINE by 39.81%, owing to its capability for context-dependent semantic parsing.
Query Level Table 1 also shows SymAcc and BLEU of different methods on the dev and test sets. As observed, STAR significantly outperforms all baselines, demonstrating its effectiveness. For example, STAR achieves an absolute improvement of 8.03% BLEU over the state-of-the-art baseline FANDA on testing. Moreover, the rewriting-based baselines, even the simplest CONCAT, perform better than the generation-based ones. This suggests that the idea of rewriting, which makes full use of the precedent and follow-up queries, is more suitable for the task.

Variant Analysis
Besides the baselines, we also conduct experiments with several variants of STAR to further validate the design of our model. As shown in Table 2, there are three ablation variants: "- Phase I" takes out SplitNet and performs Phase II at the word level; "- Phase II" performs random guessing in the recombination process at test time; and "- RL" only contains pre-training. SymAcc drops from about 55% to 40% by ablating Phase I, and to 23% by ablating Phase II. Their poor performance indicates that both phases are indispensable. "- RL" also performs worse, which again demonstrates the rationality of applying RL.
Three more variants are presented with different designs of R(q, z) to prove the efficiency and effectiveness of Equation 5 as a reward. "+ Basic Reward" represents the most straightforward REINFORCE algorithm, which samples both q ∈ Q and z̄ ∈ Z, then takes r(z̄, z) as R(q, z). "+ Oracle Reward" assumes the conflicts are always correct and rewrites R(q, z) as max_{z̄∈Z} r(z̄, z). "+ Uniform Reward" assigns the same probability to all z̄ and obtains R(q, z) as the mean of r(z̄, z) over z̄ ∈ Z. As shown in Table 2 and Figure 4, STAR learns better and faster than the variants due to its reasonable reward design. In fact, as mentioned in Section 2.1, the vast action space of the most straightforward REINFORCE algorithm leads to poor learning. STAR shrinks the space from |Q|·|Z| down to |Q| by enumerating z̄. Meanwhile, statistics show that STAR obtains a 15× speedup over "+ Basic Reward" in convergence time.

Figure 5 shows a concrete example of the similarity matrix A in the attention layer of SplitNet. The span "before week 10" is evidently more similar to "After the week 6" than to others, which meets our expectations. Moreover, the results of three real cases are shown in Table 3. The spans in blue are those that have conflicts, and the histograms represent the conflicting probabilities over all spans in the precedent queries. In Case 1, "glebe park", "hampden park" and "balmoor" are all cell values with similar meanings in the database table. STAR correctly finds the conflict between "compared to glebe park" and "compared to balmoor" with the highest probability. Case 2 shows STAR can discover the interrelation of words, where "the writer Nancy miller" is learnt as a whole span to replace "Nancy miller" in the precedent query. As for Case 3, STAR successfully performs coreference resolution and interprets "those two films" as "greatest love and promised land". Benefiting from the two phases, STAR is able to deal with diverse follow-up scenarios in different domains.

Error Analysis
Our approach works well in most cases, with only a few failures where SplitNet falls short. For example, given the precedent query "what's the biggest zone?" and the follow-up query "the smallest one", STAR prefers to recognize "the biggest zone" and "the smallest one" as two whole spans, rather than perform split operations inside them. SplitNet probably fails because the conflicting spans, "the biggest" ↔ "the smallest" and "zone" ↔ "one", are adjacent, which makes it difficult to identify the span boundaries well.

Table 4: Answer accuracy on the SQA test set (precedent / follow-up):
DynSP (Iyyer et al., 2017): 70.9 / 35.8
NP (Neelakantan et al., 2016): 58.9 / 35.9
NP + STAR: 58.9 / 38.1
DynSP + STAR: 70.9 / 39.5
DynSP* (Iyyer et al., 2017): 70.4 / 41.1

Extension on SQA dataset
Finally, we demonstrate STAR's extensibility in working with different annotations. As mentioned in Section 2.4, by designing r(w, w̄), STAR can cooperate with answer annotations. We conduct experiments on the SQA dataset, which consists of 6066 query sequences (5042/1024 for train/test). Each sequence contains multiple natural language queries and their answers, of which we are only interested in the first query and its immediate follow-up. As discussed in Iyyer et al. (2017), every answer can be represented as a set of cells in the tables, each of which is a multi-word value, and the intentions of the follow-up queries mainly fall into three categories. Column selection means the follow-up answer is an entire column; Subset selection means the follow-up answer is a subset of the precedent answer; and Row selection means the follow-up answer shares the same rows with the precedent answer. We employ two context-independent parsers, DynSP (Iyyer et al., 2017) and NP (Neelakantan et al., 2016), which are trained on the SQA dataset to provide relatively reliable answers for reward computation. Unfortunately, they both perform poorly on restated queries, as restated queries are quite different from the original queries in SQA. To address the problem, we redesign the recombination process. Instead of generating the restated query, we recombine the predicted precedent answer w̄_x and the predicted follow-up answer w̄_y to produce the restated answer w̄. The objective of Phase II is therefore to assign an appropriate intention to each follow-up span via an additional classifier, while the goal of Phase I turns to splitting out spans with obvious intentions, such as "of those". The way of recombining answers is determined by voting over the intentions of all spans.
If the intention column selection wins, then w̄ = w̄_y; for subset selection, we obtain the subset w̄ by taking the rows of w̄_y as the constraint and applying it to w̄_x; and for row selection, we take the rows of w̄_x and the columns of w̄_y as the constraints, then apply them to the whole database table to obtain the answer w̄ retrieved by the predicted SQL. The reward r(w, w̄) is computed based on the Jaccard similarity between the gold answer w and w̄, as in Iyyer et al. (2017), and the overall training process remains unchanged. Table 4 shows the answer accuracy of precedent and follow-up queries on the test set. DynSP* (Iyyer et al., 2017) is designed for SQA by introducing a special action, Subsequent, to handle follow-up queries on top of DynSP. DynSP* is incapable of being extended to work with restated-query annotations; we attempted to apply DynSP* (trained on SQA) directly to the FollowUp test set, which results in an extremely low AnsAcc. On the contrary, STAR is extensible. "+ STAR" means our method is incorporated into the context-independent parser, empowering it to perform follow-up query analysis. As observed, integrating STAR consistently improves performance on follow-up queries, which demonstrates the effectiveness of STAR in collaborating with different semantic parsers. The comparable results of DynSP + STAR to DynSP* further verify the promising extensibility of STAR.
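The three recombination rules can be sketched on toy answers represented as sets of (row, column) cells. This is an illustrative reimplementation of the rules, not the released code; in particular, the row-selection rule here only combines the row/column constraints and skips the final table lookup:

```python
def recombine_answers(w_x, w_y, intention):
    """Recombine predicted precedent answer w_x and follow-up answer w_y
    into the restated answer according to the winning intention.
    Answers are sets of (row, column) cell coordinates."""
    if intention == "column":      # follow-up answer is an entire column
        return set(w_y)
    if intention == "subset":      # constrain w_x by the rows of w_y
        rows = {r for r, _ in w_y}
        return {(r, c) for r, c in w_x if r in rows}
    if intention == "row":         # rows of w_x combined with columns of w_y
        rows = {r for r, _ in w_x}
        cols = {c for _, c in w_y}
        return {(r, c) for r in rows for c in cols}
    raise ValueError(f"unknown intention: {intention}")

w_x = {(0, "Player"), (1, "Player")}   # predicted precedent answer cells
w_y = {(1, "Player")}                  # predicted follow-up answer cells
print(recombine_answers(w_x, w_y, "subset"))  # -> {(1, 'Player')}
```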

Related Work
Our work is closely related to two lines of research: context-dependent sentence analysis and reinforcement learning. From the perspective of context-dependent sentence analysis, our work relates to research on reading comprehension in dialogue (Reddy et al., 2019; Choi et al., 2018), dialogue state tracking (Williams et al., 2013), conversational question answering over knowledge bases (Saha et al., 2018; Guo et al., 2018), context-dependent logical forms (Long et al., 2016), and non-sentential utterance resolution in open-domain question answering (Raghu et al., 2015; Kumar and Joshi, 2017). The main difference is that we focus on context-dependent queries in NLIDB, which involve complex scenarios. As for the most related area, context-dependent semantic parsing, Zettlemoyer and Collins (2009) propose a context-independent CCG parser and then conduct context-dependent substitution, Iyyer et al. (2017) present a search-based method for sequential questions, and Suhr et al. (2018) present a sequence-to-sequence model to solve the problem. Compared to their methods, our work achieves context-dependent semantic parsing via learnable restated queries and existing context-independent semantic parsers.
Moreover, reinforcement learning has been successfully applied to natural language tasks in dialogue, such as hyper-parameter tuning for coreference resolution (Clark and Manning, 2016), sequential question answering (Iyyer et al., 2017) and coherent dialogue response generation. In this paper, we employ reinforcement learning to capture the structures of queries, similar to Zhang et al. (2018) for text classification.

Conclusion and Future Work
We present a novel method, named Split-And-Recombine (STAR), to perform follow-up query analysis. A two-phase process has been designed: one for splitting precedent and follow-up queries into spans, and the other for recombining them. Experiments on two different datasets demonstrate the effectiveness and extensibility of our method. For future work, we may extend our method to other natural language tasks.