Paraphrase Generation from Latent-Variable PCFGs for Semantic Parsing

One of the limitations of semantic parsing approaches to open-domain question answering is the lexicosyntactic gap between natural language questions and knowledge base entries -- there are many ways to ask a question, all with the same answer. In this paper we propose to bridge this gap by generating paraphrases of the input question with the goal that at least one of them will be correctly mapped to a knowledge-base query. We introduce a novel grammar model for paraphrase generation that does not require any sentence-aligned paraphrase corpus. Our key idea is to leverage the flexibility and scalability of latent-variable probabilistic context-free grammars to sample paraphrases. We do an extrinsic evaluation of our paraphrases by plugging them into a semantic parser for Freebase. Our evaluation experiments on the WebQuestions benchmark dataset show that the performance of the semantic parser significantly improves over strong baselines.


Introduction
Semantic parsers map sentences onto logical forms that can be used to query databases (Zettlemoyer and Collins, 2005; Wong and Mooney, 2006), instruct robots (Chen and Mooney, 2011), extract information (Krishnamurthy and Mitchell, 2012), or describe visual scenes (Matuszek et al., 2012). In this paper we consider the problem of semantically parsing questions into Freebase logical forms for the goal of question answering. Current systems accomplish this by learning task-specific grammars (Berant et al., 2013), strongly-typed CCG grammars (Kwiatkowski et al., 2013; Reddy et al., 2014), or neural networks without requiring any grammar (Yih et al., 2015). These methods are sensitive to the words used in a question and their word order, making them vulnerable to unseen words and phrases. Furthermore, mismatch between natural language and Freebase makes the problem even harder. For example, Freebase expresses the fact that "Czech is the official language of Czech Republic" (encoded as a graph), whereas to answer a question like "What do people in Czech Republic speak?" one should infer people in Czech Republic refers to Czech Republic and What refers to the language and speak refers to the predicate official language.
We address the above problems by using paraphrases of the original question. Paraphrasing has been shown to be promising for semantic parsing (Fader et al., 2013; Berant and Liang, 2014; Wang et al., 2015). We propose a novel framework for paraphrasing using latent-variable PCFGs (L-PCFGs). Earlier approaches to paraphrasing used phrase-based machine translation for text-based QA (Duboue and Chu-Carroll, 2006; Riezler et al., 2007), or hand-annotated grammars for KB-based QA (Berant and Liang, 2014). We find that phrase-based statistical machine translation (MT) approaches mainly produce lexical paraphrases without much syntactic diversity, whereas our grammar-based approach is capable of producing both lexically and syntactically diverse paraphrases. Unlike MT-based approaches, our system does not require aligned parallel paraphrase corpora. In addition, we do not require hand-annotated grammars for paraphrase generation but instead learn the grammar directly from a large-scale question corpus.
The main contributions of this paper are twofold. First, we present an algorithm (§2) to generate paraphrases using latent-variable PCFGs. We use the spectral method of Narayan and Cohen (2015) to estimate L-PCFGs on a large-scale question treebank. Our grammar model leads to a robust and efficient system for paraphrase generation in open-domain question answering. While CFGs have been explored for paraphrasing using bilingual parallel corpora (Ganitkevitch et al., 2013), ours is the first implementation of a CFG that uses only monolingual data. Second, we show that generated paraphrases can be used to improve semantic parsing of questions into Freebase logical forms (§3). We build on a strong baseline of Reddy et al. (2014) and show that our grammar model competes with the MT baseline even without using any parallel paraphrase resources.

Paraphrase Generation Using Grammars
Our paraphrase generation algorithm is based on a model in the form of an L-PCFG. L-PCFGs are PCFGs where the nonterminals are refined with latent states that provide some contextual information about each node in a given derivation. L-PCFGs have been used in various ways, most commonly for syntactic parsing (Prescher, 2005; Matsuzaki et al., 2005; Petrov et al., 2006; Cohen et al., 2013; Narayan and Cohen, 2015; Narayan and Cohen, 2016).
In our estimation of L-PCFGs, we use the spectral method of Narayan and Cohen (2015) instead of EM, which has been used in the past by Matsuzaki et al. (2005) and Petrov et al. (2006). The spectral method we use enables the choice of a set of feature functions that indicate the latent states, which proves to be useful in our case. It also leads to sparse grammar estimates and compact models.
The spectral method works by identifying feature functions for "inside" and "outside" trees, and then clustering them into latent states. It then follows with a maximum likelihood estimation step that assumes the latent states are represented by the clusters obtained through the feature function clustering. For more details about these constructions, we refer the reader to Cohen et al. (2013) and Narayan and Cohen (2015).
The rest of this section describes our paraphrase generation algorithm.

Paraphrases Generation Algorithm
We define our paraphrase generation task as a sampling problem from an L-PCFG G_syn, which is estimated from a large corpus of parsed questions. Once this grammar is estimated, our algorithm follows a pipeline with two major steps. We first build a word lattice W_q for the input question q. We use the lattice to constrain our paraphrases to a specific choice of words and phrases that can be used. Once this lattice is created, a grammar G'_syn is then extracted from G_syn. This grammar is constrained to the lattice.
We experiment with three ways of constructing word lattices: naïve word lattices representing the words from the input question only, word lattices constructed with the Paraphrase Database (Ganitkevitch et al., 2013) and word lattices constructed with a bi-layered L-PCFG, described in §2.2. Figure 1 shows an example word lattice for the question What language do people in Czech Republic speak? using the lexical and phrasal rules from the PPDB. Once G'_syn is generated, we sample paraphrases of the input question q. These paraphrases are further filtered with a classifier to improve the precision of the generated paraphrases.

L-PCFG Estimation
We train the L-PCFG G_syn on the Paralex corpus (Fader et al., 2013). Paralex is a large monolingual parallel corpus, containing 18 million pairs of question paraphrases with 2.4M distinct questions in the corpus. It is suitable for our task of generating paraphrases since its large scale makes our model robust for open-domain questions. We construct a treebank by parsing the 2.4M distinct questions from Paralex using the BLLIP parser (Charniak and Johnson, 2005). Given the treebank, we use the spectral algorithm of Narayan and Cohen (2015) for constituency parsing to learn G_syn. We follow Narayan and Cohen (2015) and use the same feature functions for the inside and outside trees as they use, capturing contextual syntactic information about nonterminals. We refer the reader to Narayan and Cohen (2015) for a more detailed description of these features. In our experiments, we set the number of latent states to 24.
Once we estimate G_syn from the Paralex corpus, we restrict it for each question to a grammar G'_syn by keeping only the rules that could lead to a derivation over the lattice. This step is similar to lexical pruning in the standard grammar-based generation process, which avoids intermediate derivations that can never lead to a successful derivation (Koller and Striegnitz, 2002; Narayan and Gardent, 2012).
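The pruning step above can be illustrated with a minimal sketch. It assumes a simplified representation that is not taken from the paper: lexical rules are (nonterminal, word) pairs, syntactic rules are (nonterminal, right-hand side) pairs, and the lattice is reduced to its set of words. The function name `prune_grammar` and the toy rules are hypothetical. The sketch only checks that every kept rule can bottom out in lattice words; it ignores lattice phrases, word order and reachability from the root, which a full implementation would also need to consider.

```python
def prune_grammar(lexical_rules, syntactic_rules, lattice_words):
    """Keep only rules that can still yield words found in the lattice.

    lexical_rules:   list of (nonterminal, word) pairs
    syntactic_rules: list of (nonterminal, tuple of child nonterminals)
    lattice_words:   set of words that appear on some lattice path
    """
    # Lexical pruning: drop rules that emit a word outside the lattice.
    kept_lexical = [(a, w) for a, w in lexical_rules if w in lattice_words]
    productive = {a for a, _ in kept_lexical}

    # Fixed point: a syntactic rule survives once all of its children
    # are known to be productive.
    kept_syntactic = []
    remaining = list(syntactic_rules)
    changed = True
    while changed:
        changed = False
        still_unresolved = []
        for lhs, rhs in remaining:
            if all(sym in productive for sym in rhs):
                kept_syntactic.append((lhs, rhs))
                productive.add(lhs)
                changed = True
            else:
                still_unresolved.append((lhs, rhs))
        remaining = still_unresolved
    return kept_lexical, kept_syntactic


# Hypothetical toy grammar (latent states omitted for brevity).
LEXICAL = [("WP", "what"), ("NN", "language"), ("NN", "day"), ("VB", "speak")]
SYNTACTIC = [
    ("WHNP", ("WP", "NN")),
    ("SQ", ("VB",)),
    ("SBARQ", ("WHNP", "SQ")),
    ("SBARQ", ("WHNP", "VP")),
    ("VP", ("VBZ", "NP")),
]
kept_lex, kept_syn = prune_grammar(LEXICAL, SYNTACTIC, {"what", "language", "speak"})
```

In this toy case the rule emitting day is dropped because day is not in the lattice, and the two rules involving VP are dropped because VP can no longer derive any word sequence.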
Paraphrase Sampling. Sampling a question from the grammar G'_syn is done by recursively sampling nodes in the derivation tree, together with their latent states, in a top-down breadth-first fashion. Sampling from the pruned grammar G'_syn raises an issue of oversampling words that are more frequent in the training data. To lessen this problem, we follow a controlled sampling approach where sampling is guided by the word lattice W_q. Once a word w from a path e in W_q is sampled, all other parallel or conflicting paths to e are removed from W_q. For example, generating for the word lattice in Figure 1, when we sample the word citizens, we drop out the paths "human beings", "people's", "the population", "people" and "members of the public" from W_q and accordingly update the grammar. The controlled sampling ensures that each sampled question uses words from a single start-to-end path in W_q. For example, we could sample a question what is Czech Republic 's language? by sampling words from the path (what, language, do, people 's, in, Czech, Republic, is speaking, ?) in Figure 1. We repeat this sampling process to generate multiple potential paraphrases.
The resulting generation algorithm has multiple advantages over existing grammar generation methods. First, the sampling from an L-PCFG grammar lessens the lexical ambiguity problem evident in lexicalized grammars such as tree adjoining grammars (Narayan and Gardent, 2012) and combinatory categorial grammars (White, 2004). Our grammar is not lexicalized; only unary context-free rules are lexicalized. Second, the top-down sampling restricts the combinatorics inherent to bottom-up search (Shieber et al., 1990). Third, we do not restrict the generation by the order information in the input. The lack of order information in the input often leads to high combinatorics in lexicalist approaches (Kay, 1996). In our case, however, we use sampling to reduce this problem, and it allows us to produce syntactically diverse questions. And fourth, we impose no constraints on the grammar, thereby making it easier to maintain bi-directional (recursive) grammars that can be used both for parsing and for generation (Shieber, 1988).
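As an illustration of top-down breadth-first sampling, the following minimal sketch draws one sentence from a small PCFG whose nonterminals carry a latent-state suffix. The data layout (a dict from nonterminal to weighted right-hand sides), the function name `sample_sentence` and the toy grammar with its state numbers are assumptions made for the example. The lattice-guided removal of conflicting paths described above is not included, and the sketch assumes a non-recursive grammar so that sampling always terminates.

```python
import random
from collections import deque


def sample_sentence(rules, root, rng):
    """Sample a word sequence top-down, expanding nodes breadth-first.

    rules: dict mapping a nonterminal to a list of (rhs_tuple, probability);
           any symbol that is not a key of `rules` is treated as a word.
    """
    tree = [root, []]              # each node is [label, children]
    queue = deque([tree])
    while queue:
        node = queue.popleft()
        options = rules[node[0]]
        # Draw one expansion according to the rule probabilities.
        rhs = rng.choices([r for r, _ in options],
                          weights=[p for _, p in options])[0]
        for symbol in rhs:
            child = [symbol, []]
            node[1].append(child)
            if symbol in rules:    # nonterminal: expand later
                queue.append(child)

    def leaves(node):
        if not node[1]:
            return [node[0]]
        return [word for child in node[1] for word in leaves(child)]

    return leaves(tree)


# Hypothetical toy grammar with latent-state-annotated nonterminals.
TOY_RULES = {
    "SBARQ-1": [(("WHNP-2", "SQ-3"), 1.0)],
    "WHNP-2": [(("what",), 0.5), (("which",), 0.5)],
    "SQ-3": [(("VBZ-1", "NP-4"), 1.0)],
    "VBZ-1": [(("is",), 1.0)],
    "NP-4": [(("nochebuena",), 1.0)],
}
sampled = sample_sentence(TOY_RULES, "SBARQ-1", random.Random(0))
```

Repeated calls with different random states would give different samples, which mirrors how multiple candidate paraphrases are drawn.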

Bi-Layered L-PCFGs
As mentioned earlier, one of our lattice types is based on bi-layered PCFGs introduced here.
In their traditional use, the latent states in L-PCFGs aim to capture syntactic information. We introduce here the use of an L-PCFG with two layers of latent states: one layer is intended to capture the usual syntactic information, and the other aims to capture semantic and topical information by using a large set of states with specific feature functions.
To create the bi-layered L-PCFG, we again use the spectral algorithm of Narayan and Cohen (2015) to estimate a grammar G_par from the Paralex corpus. We use the word alignment of paraphrase question pairs in Paralex to map the inside and outside trees of each nonterminal in the treebank to bag-of-words features. The number of latent states we use is 1,000.
Once the two feature functions (syntactic in G_syn and semantic in G_par) are created, each nonterminal in the training treebank is assigned two latent states (cluster identifiers). Figure 2 shows an example annotation of trees for three paraphrase questions from the Paralex corpus. We compute the parameters of the bi-layered L-PCFG G_layered with a simple frequency count maximum likelihood estimate over this annotated treebank. As such, G_layered is a combination of G_syn and G_par, resulting in 24,000 latent states (24 syntactic × 1,000 semantic).
Consider an example where we want to generate paraphrases for the question what day is nochebuena. Parsing it with G_layered will lead to the leftmost hybrid structure shown in Figure 2. The assignment of the first latent state to each nonterminal ensures that we retrieve the correct syntactic representation of the sentence. Here, however, we are more interested in the second latent state assigned to each nonterminal, which captures the paraphrase information of the sentence at various levels. For example, we have a unary lexical rule (NN-*-142 day) indicating that we observe day with NN of the paraphrase type 142. We could use this information to extract unary rules of the form (NN-*-142 w) in the treebank that will generate words w which are paraphrases of day. Similarly, any node WHNP-*-291 in the treebank will generate paraphrases for what day, and any node SBARQ-*-403 will generate paraphrases for what day is nochebuena. This way we will be able to generate the paraphrases when is nochebuena and when is nochebuena celebrated, as they both have SBARQ-*-403 as their roots.
To generate a word lattice W_q for a given question q, we parse q with the bi-layered grammar G_layered. For each rule of the form X-m1-m2 → w in the bi-layered tree with X ∈ P, m1 ∈ {1, ..., 24}, m2 ∈ {1, ..., 1000} and w a word in q, we extract rules of the form X-*-m2 → w' from G_layered such that w' ≠ w. For each such (w, w'), we add a path w' parallel to w in the word lattice.
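The frequency-count estimate mentioned above amounts to relative-frequency estimation over rule occurrences in the doubly annotated treebank. The sketch below is a minimal illustration under assumed conventions: each rule occurrence is a (left-hand side, right-hand side) pair, labels follow the label-syntactic-semantic pattern of Figure 2, and the function name `estimate_rule_probs` and most of the toy labels are hypothetical.

```python
from collections import Counter


def estimate_rule_probs(rule_occurrences):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(rule_occurrences)
    lhs_counts = Counter(lhs for lhs, _ in rule_occurrences)
    return {rule: count / lhs_counts[rule[0]]
            for rule, count in rule_counts.items()}


# Hypothetical rule occurrences read off an annotated treebank.
OBSERVED = [
    ("NN-3-142", ("day",)),
    ("NN-3-142", ("day",)),
    ("NN-3-142", ("date",)),
    ("WP-7-254", ("what",)),
]
rule_probs = estimate_rule_probs(OBSERVED)
```

Because each nonterminal is the product of a syntactic and a semantic state, the number of distinct left-hand sides can grow with the 24 × 1,000 state combinations, although in practice only combinations observed in the treebank receive parameters under this estimate.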
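The lattice construction from the bi-layered grammar can be sketched as a lookup keyed on the syntactic label and the semantic state, ignoring the syntactic state. The tuple layout (label, syntactic state, semantic state, word), the function name `alternative_words` and the toy rules (including date as an alternative for day) are assumptions for illustration only.

```python
from collections import defaultdict


def alternative_words(tree_lexical_rules, grammar_lexical_rules):
    """For each word in the parsed question, collect words that share its
    syntactic label and semantic latent state in the grammar.

    Both arguments are iterables of (label, syn_state, sem_state, word).
    Returns a dict: word -> sorted list of alternative words.
    """
    # Index the grammar's lexical rules by (label, semantic state).
    index = defaultdict(set)
    for label, _syn, sem, word in grammar_lexical_rules:
        index[(label, sem)].add(word)

    alternatives = {}
    for label, _syn, sem, word in tree_lexical_rules:
        # Each alternative would become a path parallel to `word`.
        alternatives[word] = sorted(index[(label, sem)] - {word})
    return alternatives


# Hypothetical lexical rules of a bi-layered grammar.
GRAMMAR_LEXICAL = [
    ("NN", 3, 142, "day"),
    ("NN", 5, 142, "date"),
    ("NN", 3, 17, "night"),
    ("WP", 7, 254, "what"),
]
# Lexical rules found in the parse of a (partial) question.
TREE_LEXICAL = [("WP", 7, 254, "what"), ("NN", 3, 142, "day")]
lattice_alternatives = alternative_words(TREE_LEXICAL, GRAMMAR_LEXICAL)
```

Here night is not proposed for day even though it shares the syntactic state, because the match is made on the semantic state alone.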

Paraphrase Classification
Our sampling algorithm overgenerates and produces some incorrect paraphrases. To improve its precision, we build a binary classifier to filter the generated paraphrases. We randomly select 100 distinct questions from the Paralex corpus and generate paraphrases using our generation algorithm with various lattice settings. We randomly select 1,000 pairs of input-sampled sentences and manually annotate them as "correct" or "incorrect" paraphrases. We train our classifier on this manually created training data. We follow Madnani et al. (2012), who used MT metrics for paraphrase identification, and experiment with 8 MT metrics as features for our binary classifier. In addition, we experiment with a binary feature which checks if the sampled paraphrase preserves named entities from the input sentence. We use WEKA (Hall et al., 2009) to replicate the classifier of Madnani et al. (2012) with our new feature. We tune the feature set for our classifier on the development data.

Semantic Parsing using Paraphrasing
In this section we describe how the paraphrase algorithm is used for converting natural language to Freebase queries. Following Reddy et al. (2014), we formalize the semantic parsing problem as a graph matching problem, i.e., finding the Freebase subgraph (grounded graph) that is isomorphic to the input question semantic structure (ungrounded graph).
This formulation has a major limitation that can be alleviated by using our paraphrase generation algorithm. Consider the question What language do people in Czech Republic speak? The ungrounded graph corresponding to this question is shown in Figure 3(a). The Freebase grounded graph which results in the correct answer is shown in Figure 3(d). Note that these two graphs are non-isomorphic, making it impossible to derive the correct grounding from the ungrounded graph. In fact, at least 15% of the examples in our development set fail to satisfy the isomorphism assumption. In order to address this problem, we use paraphrases of the input question to generate additional ungrounded graphs, with the aim that one of those paraphrases will have a structure isomorphic to the correct grounding. Figure 3(b) and Figure 3(c) are two such paraphrases which can be converted to Figure 3(d) as described in ??. For a given input question, first we build ungrounded graphs from its paraphrases. We convert these graphs to Freebase graphs. To learn this mapping, we rely on manually assembled question-answer pairs. For each training question, we first find the set of oracle grounded graphs (Freebase subgraphs which when executed yield the correct answer) derivable from the question's ungrounded graphs. These oracle graphs are then used to train a structured perceptron model. These steps are discussed in detail below.

Ungrounded Graphs from Paraphrases
We use GRAPHPARSER (Reddy et al., 2014) to convert paraphrases to ungrounded graphs. This conversion involves three steps: 1) parsing the paraphrase using a CCG parser to extract syntactic derivations (Lewis and Steedman, 2014), 2) extracting logical forms from the CCG derivations (Bos et al., 2004), and 3) converting the logical forms to an ungrounded graph. The ungrounded graphs for the example question and its paraphrases are shown in Figure 3(a), Figure 3(b) and Figure 3(c).

Grounded Graphs from Ungrounded Graphs
The ungrounded graphs are grounded to Freebase subgraphs by mapping entity nodes, entity-entity edges and entity type nodes in the ungrounded graph to Freebase entities, relations and types, respectively. For example, the graph in Figure 3(b) can be converted to the Freebase graph in Figure 3(d) by replacing the entity node Czech Republic with the Freebase entity CZECHREPUBLIC, the edge (speak.arg2, speak.in) between x and Czech Republic with the Freebase relation (location.country.official_language.2, location.country.official_language.1), and the type node language with the Freebase type language.human_language, while the TARGET node remains intact. The rest of the nodes, edges and types are grounded to null. In a similar fashion, Figure 3(c) can be grounded to Figure 3(d), but Figure 3(a) cannot. If no paraphrase is isomorphic to the target grounded graph, our grounding fails.

Learning
We use a linear model to map ungrounded graphs to grounded ones. The parameters of the model are learned from question-answer pairs. For example, the question What language do people in Czech Republic speak? is paired with its answer {CZECHLANGUAGE}. In line with most work on question answering against Freebase, we do not rely on annotated logical forms associated with the question for training and treat the mapping of a question to its grounded graph as latent. Let q be a question, let p be a paraphrase, let u be an ungrounded graph for p, and let g be a grounded graph formed by grounding the nodes and edges of u to the knowledge base K (throughout we use Freebase as the knowledge base). Following Reddy et al. (2014), we use beam search to find the highest scoring tuple of paraphrase, ungrounded and grounded graphs (p̂, û, ĝ) under the model θ ∈ R^n:
$$(\hat{p}, \hat{u}, \hat{g}) = \arg\max_{(p, u, g)} \theta \cdot \Phi(p, u, g, q, K)$$
where Φ(p, u, g, q, K) ∈ R^n denotes the features for the tuple of paraphrase, ungrounded and grounded graphs. The feature function has access to the paraphrase, ungrounded and grounded graphs, the original question, as well as to the content of the knowledge base and the denotation |g|_K (the denotation of a grounded graph is defined as the set of entities or attributes reachable at its TARGET node). See ?? for the features employed. The model parameters are estimated with the averaged structured perceptron (Collins, 2002). Given a training question-answer pair (q, A), the update is:
$$\theta \leftarrow \theta + \Phi(p^{+}, u^{+}, g^{+}, q, K) - \Phi(\hat{p}, \hat{u}, \hat{g}, q, K)$$
where (p+, u+, g+) denotes the tuple of gold paraphrase, gold ungrounded and grounded graphs for q. Since we do not have direct access to the gold paraphrase and graphs, we instead rely on the set of oracle tuples, O_{K,A}(q), as a proxy:
$$(p^{+}, u^{+}, g^{+}) = \arg\max_{(p, u, g) \in O_{K,A}(q)} \theta \cdot \Phi(p, u, g, q, K)$$
where O_{K,A}(q) is defined as the set of tuples (p, u, g) derivable from the question q, whose denotation |g|_K has minimal F1-loss against the gold answer A. We find the oracle graphs for each question a priori by performing beam search with a very large beam.
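A single perceptron update with the oracle set as a proxy for the gold tuple can be sketched as follows. This is a simplified illustration, not the authors' implementation: candidate tuples are reduced to opaque identifiers, features are sparse dicts, parameter averaging is omitted, and the names `dot`, `perceptron_update` and the toy features are hypothetical.

```python
def dot(theta, feats):
    """Sparse dot product between weights and a feature dict."""
    return sum(theta.get(name, 0.0) * value for name, value in feats.items())


def perceptron_update(theta, candidates, oracle, features):
    """One structured perceptron step.

    candidates: list of candidate tuple ids, standing in for (p, u, g)
    oracle:     list of candidate ids whose denotation is closest to the answer
    features:   dict id -> sparse feature dict, standing in for Phi
    Returns a new weight dict; the input is left unchanged.
    """
    predicted = max(candidates, key=lambda c: dot(theta, features[c]))
    # Proxy for the gold tuple: best-scoring member of the oracle set.
    gold = max(oracle, key=lambda c: dot(theta, features[c]))
    updated = dict(theta)
    for name, value in features[gold].items():
        updated[name] = updated.get(name, 0.0) + value
    for name, value in features[predicted].items():
        updated[name] = updated.get(name, 0.0) - value
    return updated


# Hypothetical example with two candidate tuples.
FEATURES = {"a": {"f1": 1.0}, "b": {"f2": 1.0}}
theta_1 = perceptron_update({}, ["a", "b"], ["b"], FEATURES)
theta_2 = perceptron_update(theta_1, ["a", "b"], ["b"], FEATURES)
```

After the first step the model prefers the oracle candidate, so the second step adds and subtracts the same feature vector and leaves the weights unchanged.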

Experimental Setup
Below, we give details on the evaluation dataset and baselines used for comparison. We also describe the model features and provide implementation details.

Evaluation Data and Metric
We evaluate our approach on the WebQuestions dataset (Berant et al., 2013). WebQuestions consists of 5,810 question-answer pairs where the questions represent real Google search queries. We use the standard train/test splits, with 3,778 train and 2,032 test questions. For our development experiments we tune the models on held-out data consisting of 30% of the training questions, while for final testing we use the complete training data. We use average precision (avg P.), average recall (avg R.) and average F1 (avg F1) proposed by Berant et al. (2013) as evaluation metrics.
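The averaged metrics can be sketched as per-question precision, recall and F1 over answer sets, followed by a mean over questions. The treatment of empty prediction or gold sets (scored as zero) is an assumption of this sketch rather than something stated in the text, and the function names and toy answers are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 for a single question."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


def average_scores(pairs):
    """Average the per-question scores over (predicted, gold) pairs."""
    scores = [precision_recall_f1(p, g) for p, g in pairs]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


# Hypothetical predictions for two questions.
PAIRS = [
    ({"Czech"}, {"Czech"}),   # exact match
    ({"a", "b"}, {"a"}),      # one spurious answer
]
avg_p, avg_r, avg_f1 = average_scores(PAIRS)
```

Note that the averaged F1 is the mean of per-question F1 values, which in general differs from the F1 computed from the averaged precision and recall.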

Baselines
ORIGINAL. We use GRAPHPARSER without paraphrases as our baseline. This gives an idea about the impact of using paraphrases.
MT. We compare our paraphrasing models with a monolingual machine translation based model for paraphrase generation (Quirk et al., 2004; Wubben et al., 2010). In particular, we use Moses (Koehn et al., 2007) to train a monolingual phrase-based MT system on the Paralex corpus. Finally, we use the Moses decoder to generate 10-best distinct paraphrases for the test questions.

Implementation Details
Entity Resolution. For WebQuestions, we use 8 handcrafted part-of-speech patterns (e.g., the pattern (DT)?(JJ.?|NN.?){0,2}NN.? matches the noun phrase the big lebowski) to identify candidate named entity mention spans. We use the Stanford CoreNLP caseless tagger for part-of-speech tagging (Manning et al., 2014). For each candidate mention span, we retrieve the top 10 entities according to the Freebase API. We then create a lattice in which the nodes correspond to mention-entity pairs, scored by their Freebase API scores, and the edges encode the fact that no joint assignment of entities to mentions can contain overlapping spans. We take the top 10 paths through the lattice as possible entity disambiguations. For each possibility, we generate n-best paraphrases that contain the entity mention spans. In the end, this process creates a total of 10n paraphrases. We generate ungrounded graphs for these paraphrases and treat the final entity disambiguation and paraphrase selection as part of the semantic parsing problem.
GRAPHPARSER Features. We use the features from Reddy et al. (2014). These include edge alignments and stem overlaps between ungrounded and grounded graphs, and contextual features such as word and grounded relation pairs. In addition to these features, we add two new real-valued features: the paraphrase classifier's score and the entity disambiguation lattice score.
Beam Search. We use beam search to infer the highest scoring graph pair for a question. The search operates over entity-entity edges and entity type nodes of each ungrounded graph. For an entity-entity edge, there are two operations: ground the edge to a Freebase relation, or skip the edge. Similarly, for an entity type node, there are two operations: ground the node to a Freebase type, or skip the node. We use a beam size of 100 in all our experiments.
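The ground-or-skip search can be illustrated with a small beam search in which each edge or type node is either assigned a knowledge-base label or skipped (label None). The additive per-item scoring function is a simplifying assumption; the model described above scores whole graph tuples with a feature vector. The function name `beam_search`, the item identifiers and the toy scores are hypothetical.

```python
import heapq


def beam_search(items, candidates, score, beam_size=100):
    """Assign a label (or None for "skip") to each item, keeping the
    `beam_size` best partial assignments after every step.

    items:      ordered list of edge / type-node identifiers
    candidates: dict item -> list of candidate knowledge-base labels
    score:      function (item, label_or_None) -> float
    Returns (total_score, assignment) for the best complete hypothesis.
    """
    beam = [(0.0, ())]
    for item in items:
        expanded = []
        for total, assignment in beam:
            # Two kinds of operation: skip the item or ground it to a label.
            for label in [None] + list(candidates.get(item, [])):
                expanded.append((total + score(item, label),
                                 assignment + ((item, label),)))
        beam = heapq.nlargest(beam_size, expanded, key=lambda h: h[0])
    return beam[0]


# Hypothetical scores for one edge and one type node.
TOY_SCORES = {
    ("e1", "official_language"): 2.0,
    ("e1", "spoken_in"): 1.0,
    ("t1", "human_language"): 0.5,
}
TOY_CANDIDATES = {"e1": ["official_language", "spoken_in"],
                  "t1": ["human_language"]}
best_score, best_assignment = beam_search(
    ["e1", "t1"], TOY_CANDIDATES,
    lambda item, label: TOY_SCORES.get((item, label), 0.0))
```

With a beam of size 1 this reduces to greedy search; larger beams trade time for a lower risk of discarding the best complete assignment early.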

Results and Discussion
In this section, we present results from five different systems for our question-answering experiments: ORIGINAL, MT, NAIVE, PPDB and BILAYERED. The first two are baseline systems. The other three systems use paraphrases generated from an L-PCFG grammar. NAIVE uses a word lattice with a single start-to-end path representing the input question itself, PPDB uses a word lattice constructed using the PPDB rules, and BILAYERED uses a bi-layered L-PCFG to build word lattices. Note that NAIVE does not require any parallel resource to train, PPDB requires an external paraphrase database, and BILAYERED, like MT, needs a parallel corpus with paraphrase pairs. We tune our classifier features and GRAPHPARSER features on the development data. We use the best setting from tuning for evaluation on the test data.
Results on the Development Set. Table 1 shows the results with our best settings on the development data. We found that oracle scores improve significantly with paraphrases. ORIGINAL achieves an oracle score of 65.1 whereas with paraphrases we achieve an F1 greater than 70 across all the models. This shows that with paraphrases we eliminate a substantial part of the mismatch between Freebase and ungrounded graphs. This trend continues for the final prediction, with the paraphrasing models performing better than ORIGINAL.
All our proposed paraphrasing models beat the MT baseline. Even the NAIVE model, which does not use any parallel or external resource, surpasses the MT baseline in the final prediction. Upon error analysis, we found that the MT model produces paraphrases that are too similar to each other, mostly with only inflectional variations. For the question What language do people in Czech Republic speak, the top ten paraphrases produced by MT are mostly formed by replacing language with languages, do with does, people with person and speak with speaks. These paraphrases do not address the structural mismatch problem. In contrast, our grammar-based models generate syntactically diverse paraphrases.
Our PPDB model performs best across the paraphrase models (avg F1 = 47.9). We attribute its success to the high-quality paraphrase rules from the external paraphrase database. For the BILAYERED model we found that 1,000 latent semantic states are not sufficient for modeling topical differences. Though MT competes with NAIVE and BILAYERED, the performance of NAIVE is highly encouraging since it does not require any parallel corpus. Furthermore, we observe that the MT model has a larger search space. The number of oracle graphs (the number of ways in which one can produce the correct Freebase grounding from the ungrounded graphs of the given question and its paraphrases) is higher for MT (77.2) than for the grammar-based models (50-60).
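Results on the Test Set. Table 2 shows our final results on the test data. We get similar results on the test data as we reported on the development data. Again, the PPDB model performs best with an F1 score of 47.7. The baselines, ORIGINAL and MT, lag with scores of 45.0 and 47.1, respectively. We also present the results of existing literature on this dataset. Among these, Berant and Liang (2014) also use paraphrasing, but unlike ours, their approach is based on a template grammar (containing 8 grammar rules) and requires logical forms beforehand to generate paraphrases. Our PPDB model outperforms Berant and Liang's model by 7.8 F1 points. Yih et al. (2015) and Xu et al. (2016) use neural network models for semantic parsing, in addition to using sophisticated entity resolution (Yang and Chang, 2015) and a very large unsupervised corpus as additional training data. Note that we use GRAPHPARSER as our semantic parsing framework for evaluating our paraphrases extrinsically. We leave plugging our paraphrases into other existing methods and other tasks for future work.
Error Analysis. The upper bound of our paraphrasing methods is in the range of 71.2-71.8. We examine where the remaining performance is lost. For the PPDB model, the majority (78.4%) of the errors are partially correct answers occurring due to incomplete gold answer annotations or partially correct groundings. Note that the partially correct groundings may include incorrect paraphrases. A further 13.5% are due to mismatch between Freebase and the paraphrases produced, and the rest (8.1%) are due to wrong entity annotations.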

Conclusion
We described a grammar method to generate paraphrases for questions, and applied it to a question answering system based on semantic parsing.We showed that using paraphrases for a question answering system is a useful way to improve its performance.Our method is rather generic and can be applied to any question answering system.

Figure 1: An example word lattice for the question What language do people in Czech Republic speak? using the lexical and phrasal rules from the PPDB.

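Figure 2: Trees used for bi-layered L-PCFG training. The questions what day is nochebuena, when is nochebuena and when is nochebuena celebrated are paraphrases from the Paralex corpus. Each nonterminal is decorated with a syntactic label and two identifiers, e.g., for WP-7-254, WP is the syntactic label assigned by the BLLIP parser, 7 is the syntactic latent state, and 254 is the semantic latent state.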
Figure 3: Ungrounded graphs for an input question and its paraphrases along with its correct grounded graph.The green squares indicate NL or Freebase entities, the yellow rectangles indicate unary NL predicates or Freebase types, the circles indicate NL or Freebase events, the edge labels indicate binary NL predicates or Freebase relations, and the red diamonds attach to the entity of interest (the answer to the question).

Table 1: Oracle statistics and results on the WebQuestions development set.

Table 2: Results on WebQuestions test dataset.