Learning to Paraphrase for Question Answering

Question answering (QA) systems are sensitive to the many different ways natural language expresses the same information need. In this paper we turn to paraphrases as a means of capturing this knowledge and present a general framework which learns felicitous paraphrases for various QA tasks. Our method is trained end-to-end using question-answer pairs as a supervision signal. A question and its paraphrases serve as input to a neural scoring model which assigns higher weights to linguistic expressions most likely to yield correct answers. We evaluate our approach on QA over Freebase and answer sentence selection. Experimental results on three datasets show that our framework consistently improves performance, achieving competitive results despite the use of simple QA models.


Introduction
Enabling computers to automatically answer questions posed in natural language on any domain or topic has been the focus of much research in recent years. Question answering (QA) is challenging due to the many different ways natural language expresses the same information need. As a result, small variations in semantically equivalent questions may yield different answers. For example, a hypothetical QA system must recognize that the questions "who created microsoft" and "who started microsoft" have the same meaning and that they both convey the founder relation in order to retrieve the correct answer from a knowledge base.
Given the great variety of surface forms for semantically equivalent expressions, it should come as no surprise that previous work has investigated the use of paraphrases in relation to question answering. There have been three main strands of research. The first one applies paraphrasing to match natural language and logical forms in the context of semantic parsing. Berant and Liang (2014) use a template-based method to heuristically generate canonical text descriptions for candidate logical forms, and then compute paraphrase scores between the generated texts and input questions in order to rank the logical forms. Another strand of work uses paraphrases in the context of neural question answering models (Bordes et al., 2014a,b; Dong et al., 2015). These models are typically trained on question-answer pairs, and employ question paraphrases in a multi-task learning framework in an attempt to encourage the neural networks to output similar vector representations for the paraphrases.
The third strand of research uses paraphrases more directly. The idea is to paraphrase the question and then submit the rewritten version to a QA module. Various resources have been used to produce question paraphrases, such as rule-based machine translation (Duboue and Chu-Carroll, 2006), lexical and phrasal rules from the Paraphrase Database (Narayan et al., 2016), as well as rules mined from Wiktionary (Chen et al., 2016) and large-scale paraphrase corpora (Fader et al., 2013). A common problem with the generated paraphrases is that they often contain inappropriate candidates. Hence, treating all paraphrases as equally felicitous and using them to answer the question could degrade performance. To remedy this, a scoring model is often employed, but independently of the QA system used to find the answer (Duboue and Chu-Carroll, 2006; Narayan et al., 2016). Problematically, the separate paraphrase models used in previous work do not fully utilize the supervision signal from the training data, and as such cannot be properly tuned to the question answering tasks at hand. Based on the large variety of possible transformations that can generate paraphrases, it seems likely that the kinds of paraphrases that are useful would depend on the QA application of interest (Bhagat and Hovy, 2013). Fader et al. (2014) use features that are defined over the original question and its rewrites to score paraphrases. Examples include the pointwise mutual information of the rewrite rule, the paraphrase's score according to a language model, and POS tag features. In the context of semantic parsing, Chen et al. (2016) also use the ID of the rewrite rule as a feature. However, most of these features are not informative enough to model the quality of question paraphrases, or cannot easily generalize to unseen rewrite rules.

Figure 1: We use three different methods to generate candidate paraphrases for input q. The question and its paraphrases are fed into a neural model which scores how suitable they are. The scores are normalized and used to weight the results of the question answering model. The entire system is trained end-to-end using question-answer pairs as a supervision signal. (Panels: Question, Vectors, Scores, Answer. Depicted example: question q "who created microsoft?" with paraphrases q1 "who founded microsoft?", q2 "who is the founder of microsoft?", q3 "who is the creator of microsoft?", ..., qm "who designed microsoft?"; the answers Paul Allen and Bill Gates are reached through the org_founder relation of microsoft.)

In this paper, we present a general framework for learning paraphrases for question answering tasks. Given a natural language question, our model estimates a probability distribution over candidate answers. We first generate paraphrases for the question, which can be obtained by one or several paraphrasing systems. A neural scoring model predicts the quality of the generated paraphrases, while learning to assign higher weights to those which are more likely to yield correct answers. The paraphrases and the original question are fed into a QA model that predicts a distribution over answers given the question. The entire system is trained end-to-end using question-answer pairs as a supervision signal. The framework is flexible: it does not rely on specific paraphrase or QA models. In fact, this plug-and-play functionality allows us to learn specific paraphrases for different QA tasks and to explore the merits of different paraphrasing models for different applications.
We evaluate our approach on QA over Freebase and text-based answer sentence selection. We employ a range of paraphrase models based on the Paraphrase Database (PPDB; Pavlick et al. 2015), neural machine translation (Mallinson et al., 2016), and rules mined from the WikiAnswers corpus (Fader et al., 2014). Results on three datasets show that our framework consistently improves performance; it achieves state-of-the-art results on GraphQuestions and competitive performance on two additional benchmark datasets using simple QA models.

Problem Formulation
Let q denote a natural language question, and a its answer. Our aim is to estimate $p(a|q)$, the conditional probability of candidate answers given the question. We decompose $p(a|q)$ as:
$$p(a|q) = \sum_{q' \in H_q \cup \{q\}} p_\psi(a|q') \, p_\theta(q'|q) \tag{1}$$
where $H_q$ is the set of paraphrases for question q, ψ are the parameters of a QA model, and θ are the parameters of a paraphrase scoring model.
As shown in Figure 1, we first generate candidate paraphrases $H_q$ for question q. Then, a neural scoring model predicts the quality of the generated paraphrases, and assigns higher weights to the paraphrases which are more likely to obtain the correct answers. These paraphrases and the original question simultaneously serve as input to a QA model that predicts a distribution over answers for a given question. Finally, the results of these two models are fused to predict the answer.
In the following we will explain how $p_\theta(q'|q)$ and $p_\psi(a|q')$ are estimated.
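To make the fusion step concrete, the following minimal sketch marginalizes answer probabilities over a question and its paraphrases, weighting each QA prediction by the paraphrase probability. The function name, the example questions, and all numbers are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def answer_distribution(qa_probs, paraphrase_probs):
    """Fuse QA predictions over the question and its paraphrases.

    qa_probs[q'][a]      : p_psi(a | q'), the QA model's probability of answer a
    paraphrase_probs[q'] : p_theta(q' | q), normalized over H_q and q itself
    """
    fused = {}
    for rewrite, weight in paraphrase_probs.items():
        for answer, prob in qa_probs[rewrite].items():
            fused[answer] = fused.get(answer, 0.0) + weight * prob
    return fused


# Toy values (assumptions for illustration only).
paraphrase_probs = {
    "who created microsoft": 0.2,   # the original question q
    "who founded microsoft": 0.7,
    "who designed microsoft": 0.1,
}
qa_probs = {
    "who created microsoft": {"Bill Gates": 0.5, "Windows": 0.5},
    "who founded microsoft": {"Bill Gates": 0.9, "Windows": 0.1},
    "who designed microsoft": {"Bill Gates": 0.2, "Windows": 0.8},
}

fused = answer_distribution(qa_probs, paraphrase_probs)
best_answer = max(fused, key=fused.get)
```

In this toy setting the well-scored paraphrase dominates the sum, so a rewrite that the QA model handles confidently can outweigh an ambiguous original question.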

Paraphrase Generation
As shown in Equation (1), the term $p(a|q)$ is the sum over q and its paraphrases $H_q$. Ideally, we would generate all the paraphrases of q. However, since this set could quickly become intractable, we restrict the number of candidate paraphrases to a manageable size. In order to increase the coverage and diversity of paraphrases, we employ three methods based on: (1) lexical and phrasal rules from the Paraphrase Database (Pavlick et al., 2015); (2) neural machine translation models (Sutskever et al., 2014; Bahdanau et al., 2015); and (3) paraphrase rules mined from clusters of related questions (Fader et al., 2014). We briefly describe these models below; however, nothing in our framework is specific to them, and any other paraphrase generator could be used instead.

PPDB-based Generation
Bilingual pivoting (Bannard and Callison-Burch, 2005) is one of the most well-known approaches to paraphrasing; it uses bilingual parallel corpora to learn paraphrases based on techniques from phrase-based statistical machine translation (SMT, Koehn et al. 2003). The intuition is that two English strings that translate to the same foreign string can be assumed to have the same meaning.
The method first extracts a bilingual phrase table and then obtains English paraphrases by pivoting through foreign language phrases. Drawing inspiration from syntax-based SMT, Callison-Burch (2008) and Ganitkevitch et al. (2011) extended this idea to syntactic paraphrases, leading to the creation of PPDB (Ganitkevitch et al., 2013), a large-scale paraphrase database containing over a billion paraphrase pairs in 24 different languages. Pavlick et al. (2015) further used a supervised model to automatically label paraphrase pairs with entailment relationships based on natural logic (MacCartney, 2009). In our work, we employ bidirectionally entailing rules from PPDB. Specifically, we focus on lexical (single word) and phrasal (multiword) rules which we use to paraphrase questions by replacing words and phrases in them. An example is shown in Table 1 where we substitute car with vehicle and manufacturer with producer.
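The substitution step can be sketched as follows. This is a simplified illustration, not the authors' implementation: the rule table is a hypothetical two-entry stand-in for PPDB, the example question is invented, and the sketch applies one rule at a time on whole tokens.

```python
def apply_rules(question, rules, max_candidates=10):
    """Generate paraphrases by substituting one lexical or phrasal rule at a time.

    `rules` maps a source word or phrase to a list of replacements.
    Matching is done on whole tokens, so "car" does not match "cart".
    """
    tokens = question.split()
    candidates = []
    for source, targets in rules.items():
        src = source.split()
        n = len(src)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] != src:
                continue
            for target in targets:
                rewritten = tokens[:i] + target.split() + tokens[i + n:]
                candidate = " ".join(rewritten)
                # Skip trivial rewrites and duplicates.
                if candidate != question and candidate not in candidates:
                    candidates.append(candidate)
    return candidates[:max_candidates]


# Hypothetical rule table mirroring the car/vehicle example in the text.
toy_rules = {"car": ["vehicle"], "manufacturer": ["producer"]}
paraphrases = apply_rules("who is the manufacturer of this car", toy_rules)
```

Applying several rules jointly, or ranking candidates by rule confidence, would be natural extensions that are left out here for brevity.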

NMT-based Generation
Mallinson et al. (2016) revisit bilingual pivoting in the context of neural machine translation (NMT, Sutskever et al. 2014; Bahdanau et al. 2015) and present a paraphrasing model based on neural networks. At its core, NMT is trained end-to-end to maximize the conditional probability of a correct translation given a source sentence, using a bilingual corpus. Paraphrases can be obtained by translating an English string into a foreign language and then back-translating it into English. NMT-based pivoting models offer advantages over conventional methods such as the ability to learn continuous representations and to consider wider context while paraphrasing.
In our work, we select German as our pivot following Mallinson et al. (2016) who show that it outperforms other languages in a wide range of paraphrasing experiments, and pretrain two NMT systems, English-to-German (EN-DE) and German-to-English (DE-EN). A naive implementation would translate a question to a German string and then back-translate it to English. However, using only one pivot can lead to inaccuracies as it places too much faith in a single translation which may be wrong. Instead, we translate from multiple pivot sentences (Mallinson et al., 2016). As shown in Figure 2, question q is translated to K-best German pivots, $G_q = \{g_1, \ldots, g_K\}$. The probability of generating paraphrase $q' = y_1 \ldots y_{|q'|}$ is decomposed as:
$$p(q'|G_q) = \prod_{t=1}^{|q'|} p(y_t | y_{<t}, G_q) = \prod_{t=1}^{|q'|} \sum_{k=1}^{K} p(g_k|q) \, p(y_t | y_{<t}, g_k) \tag{2}$$
where $y_{<t} = y_1, \ldots, y_{t-1}$, and the probabilities $p(g_k|q)$ and $p(y_t | y_{<t}, g_k)$ are computed by the EN-DE and DE-EN models, respectively.
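One way to picture decoding from multiple pivots is as a weighted mixture of the next-token distributions that the back-translation model produces for each pivot. The sketch below assumes the pivot weights are normalized translation probabilities; the function name and all numbers are hypothetical, and a real system would apply this inside beam search over a full vocabulary.

```python
def mix_pivot_distributions(pivot_weights, next_token_probs):
    """Combine K per-pivot next-token distributions into a single one.

    pivot_weights[k]       : weight of pivot g_k (assumed to sum to 1)
    next_token_probs[k][y] : probability of token y given the prefix and pivot g_k
    """
    mixed = {}
    for weight, dist in zip(pivot_weights, next_token_probs):
        for token, prob in dist.items():
            mixed[token] = mixed.get(token, 0.0) + weight * prob
    return mixed


# Toy example with K = 2 pivots (all values are illustrative assumptions).
pivot_weights = [0.6, 0.4]
next_token_probs = [
    {"founded": 0.7, "created": 0.3},
    {"founded": 0.2, "started": 0.8},
]
mixed = mix_pivot_distributions(pivot_weights, next_token_probs)
next_token = max(mixed, key=mixed.get)
```

Because every pivot contributes at every step, a single poor translation cannot dictate the output on its own, which is the motivation given in the text for using several pivots.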
Compared to PPDB, NMT-based paraphrases are syntax-agnostic, operating on the surface level without knowledge of any underlying grammar. Furthermore, paraphrase rules are captured implicitly and cannot be easily extracted, e.g., from a phrase table. As mentioned earlier, the NMT-based approach has the potential of performing major rewrites as paraphrases are generated while considering wider contextual information, whereas PPDB paraphrases are more local, and mainly handle lexical variation.

Rule-Based Generation
Our third paraphrase generation approach uses rules mined from the WikiAnswers corpus (Fader et al., 2014) which contains more than 30 million question clusters labeled as paraphrases by WikiAnswers users. This corpus is a large resource (the average cluster size is 25), but is relatively noisy due to its collaborative nature: 45% of question pairs are merely related rather than genuine paraphrases. We therefore followed the method proposed in Fader et al. (2013) to harvest paraphrase rules from the corpus. We first extracted question templates (i.e., questions with at most one wild-card) that appear in at least ten clusters. Any two templates co-occurring (more than five times) in the same cluster and with the same arguments were deemed paraphrases. Table 2 shows examples of rules extracted from the corpus. During paraphrase generation, we consider substrings of the input question as arguments, and match them with the mined template pairs. For example, the stemmed input question in Table 1 can be paraphrased using the rules ("what be the zip code of ", "what be 's postal code") and ("what be the zip code of ", "zip code of "). If no exact match is found, we perform fuzzy matching by ignoring stop words in the question and templates.
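Template matching with a single wild-card can be sketched as below. The slot marker `__`, its position inside each template, and the example question are assumptions made for illustration, since the slot symbol is not visible in the rules quoted above. Fuzzy matching over stop words is omitted.

```python
WILDCARD = "__"  # hypothetical slot marker, not taken from the paper


def match_template(question, template):
    """Return the argument filling the wild-card, or None if there is no match."""
    q = question.split()
    t = template.split()
    slot = t.index(WILDCARD)
    prefix, suffix = t[:slot], t[slot + 1:]
    if len(q) <= len(prefix) + len(suffix):
        return None  # the wild-card must cover at least one token
    if q[:len(prefix)] != prefix:
        return None
    if suffix and q[len(q) - len(suffix):] != suffix:
        return None
    return " ".join(q[len(prefix):len(q) - len(suffix)])


def paraphrase_with_rules(question, rule_pairs):
    """Rewrite the question with every rule whose source template matches."""
    rewrites = []
    for source, target in rule_pairs:
        argument = match_template(question, source)
        if argument is not None:
            rewrites.append(target.replace(WILDCARD, argument))
    return rewrites


# Rule pairs modeled on the zip code example; slot placement is assumed.
toy_rule_pairs = [
    ("what be the zip code of __", "what be __ 's postal code"),
    ("what be the zip code of __", "zip code of __"),
]
rewrites = paraphrase_with_rules("what be the zip code of new york", toy_rule_pairs)
```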

Paraphrase Scoring
Recall from Equation (1) that $p_\theta(q'|q)$ scores the generated paraphrases $q' \in H_q \cup \{q\}$. We estimate $p_\theta(q'|q)$ using neural networks given their successful application to paraphrase identification tasks (Socher et al., 2011; Yin and Schütze, 2015; He et al., 2015). Specifically, the input question and its paraphrases are encoded as vectors. Then, we employ a neural network to obtain the score $s(q'|q)$ which after normalization becomes the probability $p_\theta(q'|q)$.
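Encoding Let $q = q_1 \ldots q_{|q|}$ denote an input question. Every word is initially mapped to a d-dimensional vector. In other words, vector $\mathbf{q}_t$ is computed via $\mathbf{q}_t = \mathbf{W}_q \mathbf{e}(q_t)$, where $\mathbf{W}_q \in \mathbb{R}^{d \times |V|}$ is a word embedding matrix, $|V|$ is the vocabulary size, and $\mathbf{e}(q_t)$ is a one-hot vector. Next, we use a bi-directional recurrent neural network with long short-term memory units (LSTM, Hochreiter and Schmidhuber 1997) as the question encoder, which is shared by the input questions and their paraphrases. The encoder recursively processes tokens one by one, and uses the encoded vectors to represent questions. We compute the hidden vectors at the t-th time step via:
$$\overrightarrow{\mathbf{h}}_t = \mathrm{LSTM}\left(\overrightarrow{\mathbf{h}}_{t-1}, \mathbf{q}_t\right), \quad t = 1, \ldots, |q|$$
$$\overleftarrow{\mathbf{h}}_t = \mathrm{LSTM}\left(\overleftarrow{\mathbf{h}}_{t+1}, \mathbf{q}_t\right), \quad t = |q|, \ldots, 1 \tag{3}$$
where $\overrightarrow{\mathbf{h}}_t, \overleftarrow{\mathbf{h}}_t \in \mathbb{R}^{n}$. In this work we follow the LSTM function described in Pham et al. (2014). The representation of q is obtained by:
$$\mathbf{q} = \left[\overrightarrow{\mathbf{h}}_{|q|}, \overleftarrow{\mathbf{h}}_1\right] \tag{4}$$
where $[\cdot, \cdot]$ denotes concatenation, and $\mathbf{q} \in \mathbb{R}^{2n}$.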
Scoring After obtaining vector representations for q and q', we compute the score $s(q'|q)$ via:
$$s(q'|q) = \mathbf{w}_s \cdot \left[\mathbf{q}, \mathbf{q}', \mathbf{q} \odot \mathbf{q}'\right] + b_s \tag{5}$$
where $\mathbf{w}_s \in \mathbb{R}^{6n}$ is a parameter vector, $[\cdot, \cdot, \cdot]$ denotes concatenation, $\odot$ is element-wise multiplication, and $b_s$ is the bias. Alternative ways to compute $s(q'|q)$ such as dot product or with a bilinear term were not empirically better than Equation (5) and we omit them from further discussion.
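Normalization For paraphrases $q' \in H_q \cup \{q\}$, the probability $p_\theta(q'|q)$ is computed via:
$$p_\theta(q'|q) = \frac{\exp\{s(q'|q)\}}{\sum_{r \in H_q \cup \{q\}} \exp\{s(r|q)\}} \tag{6}$$
where the paraphrase scores are normalized over the set $H_q \cup \{q\}$.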
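The scoring and normalization steps can be illustrated with plain Python lists standing in for the encoder outputs. This is a minimal sketch under stated assumptions: the vectors, the weight vector, and the bias are hand-picked toy values rather than learned parameters, and the encoder itself is not modeled.

```python
import math


def paraphrase_score(q_vec, p_vec, w_s, b_s):
    """Linear score over [q, q', q * q'], where * is the element-wise product."""
    features = list(q_vec) + list(p_vec) + [a * b for a, b in zip(q_vec, p_vec)]
    return sum(w * f for w, f in zip(w_s, features)) + b_s


def softmax(scores):
    """Normalize scores into probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 2-dimensional "encodings" (assumptions, not real BiLSTM outputs).
q_vec = [1.0, 0.0]
candidates = {
    "q (original)": [1.0, 0.0],
    "q1": [0.8, 0.2],
    "q2": [0.0, 1.0],
}
w_s = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]  # length is three times the vector size
b_s = 0.0

scores = [paraphrase_score(q_vec, vec, w_s, b_s) for vec in candidates.values()]
probs = softmax(scores)
```

The original question is included among the candidates, so the model can fall back on it when no rewrite scores higher.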

QA Models
The framework defined in Equation (1) is relatively flexible with respect to the QA model being employed as long as it can predict $p_\psi(a|q')$. We illustrate this by performing experiments across different tasks and describe below the models used for these tasks.
Knowledge Base QA In our first task we use the Freebase knowledge base to answer questions. Query graphs for the questions typically contain more than one predicate. For example, to answer the question "who is the ceo of microsoft in 2008", we need to use one relation to query "ceo of microsoft" and another relation for the constraint "in 2008". For this task, we employ the SIMPLEGRAPH model described in Reddy et al. (2016, 2017), and follow their training protocol and feature design. In brief, their method uses rules to convert questions to ungrounded logical forms, which are subsequently matched against Freebase subgraphs. The QA model learns from question-answer pairs: it extracts features for pairs of questions and Freebase subgraphs, and uses a logistic regression classifier to predict the probability that a candidate answer is correct. We perform entity linking using the Freebase/KG API on the original question (Reddy et al., 2016, 2017), and generate candidate Freebase subgraphs. The QA model estimates how likely it is for a subgraph to yield the correct answer.
Answer Sentence Selection Given a question and a collection of relevant sentences, the goal of this task is to select sentences which contain an answer to the question. The assumption is that correct answer sentences have high semantic similarity to the questions (Yu et al., 2014; Yang et al., 2015; Miao et al., 2016). We use two bidirectional recurrent neural networks (BILSTM) to separately encode questions and answer sentences to vectors (Equation (4)). Similarity scores are computed as shown in Equation (5), and then squashed to (0, 1) by a sigmoid function in order to predict $p_\psi(a|q')$.

Training and Inference
We use a log-likelihood objective for training, which maximizes the likelihood of the correct answer given a question:
$$\text{maximize} \sum_{(q,a) \in \mathcal{D}} \log p(a|q)$$
where $\mathcal{D}$ is the set of all question-answer training pairs, and $p(a|q)$ is computed as shown in Equation (1). For the knowledge base QA task, we predict how likely it is that a subgraph obtains the correct answer, and the answers of some candidate subgraphs are partially correct. So, we use the binary cross entropy between the candidate subgraph's F1 score and the prediction as the objective function. The RMSProp algorithm (Tieleman and Hinton, 2012) is employed to solve this nonconvex optimization problem. Moreover, dropout is used for regularizing the recurrent neural networks (Pham et al., 2014).
At test time, we generate paraphrases for the question q, and then predict the answer by:
$$\hat{a} = \arg\max_{a' \in C_q} p(a'|q)$$
where $C_q$ is the set of candidate answers, and $p(a'|q)$ is computed as shown in Equation (1).
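The two training objectives and the test-time decision rule can be sketched as follows. The function names are illustrative, the clipping constant `eps` is an implementation assumption for numerical safety, and the code shows only the loss computations, not the gradient updates.

```python
import math


def negative_log_likelihood(correct_answer_probs):
    """Negative of the log-likelihood objective, summed over training pairs.

    correct_answer_probs holds p(a|q) for the gold answer of each pair.
    """
    return -sum(math.log(p) for p in correct_answer_probs)


def soft_binary_cross_entropy(f1_target, prediction, eps=1e-12):
    """Binary cross entropy with a soft target, here a subgraph's F1 score."""
    p = min(max(prediction, eps), 1.0 - eps)
    return -(f1_target * math.log(p) + (1.0 - f1_target) * math.log(1.0 - p))


def predict(candidate_probs):
    """Return the candidate answer with the highest fused probability."""
    return max(candidate_probs, key=candidate_probs.get)


# Toy usage (all numbers are illustrative).
loss_full_credit = soft_binary_cross_entropy(1.0, 0.5)
loss_matched = soft_binary_cross_entropy(0.5, 0.5)
loss_mismatched = soft_binary_cross_entropy(0.5, 0.9)
prediction = predict({"Bill Gates": 0.75, "Windows": 0.25})
```

With a soft target, the loss is smallest when the predicted probability equals the F1 score, which is how partially correct subgraphs receive partial credit.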

Experiments
We compared our model which we call PARA4QA (as shorthand for learning to paraphrase for question answering) against multiple previous systems on three datasets. In the following we introduce these datasets, provide implementation details for our model, describe the systems used for comparison, and present our results.

Datasets
Our model was trained on three datasets, representative of different types of QA tasks. The first two datasets focus on question answering over a structured knowledge base, whereas the third one is specific to answer sentence selection.
WEBQUESTIONS This dataset (Berant et al., 2013)
WIKIQA This dataset (Yang et al., 2015) has 3,047 questions sampled from Bing query logs. The questions are associated with 29,258 candidate answer sentences, 1,473 of which contain the correct answers to the questions.

Implementation Details
Paraphrase Generation Candidate paraphrases were stemmed (Minnen et al., 2001) and lowercased. We discarded duplicate or trivial paraphrases which only rewrite stop words or punctuation. For the NMT model, we followed the implementation (github.com/sebastien-j/LV_groundhog) and settings described in Mallinson et al. (2016), and used English↔German as the language pair. The system was trained on data released as part of the WMT15 shared translation task (4.2 million sentence pairs). We also had access to back-translated monolingual training data (Sennrich et al., 2016a). Rare words were split into subword units (Sennrich et al., 2016b) to handle out-of-vocabulary words in questions. We used the top 15 decoding results as candidate paraphrases. We used the S size package of PPDB 2.0 (Pavlick et al., 2015) for high precision. At most 10 candidate paraphrases were considered.
We mined paraphrase rules from WikiAnswers (Fader et al., 2014) as described in Section 2.1.3. The extracted rules were ranked using the pointwise mutual information between template pairs in the WikiAnswers corpus. The top 10 candidate paraphrases were used.
Training For the paraphrase scoring model, we used GloVe (Pennington et al., 2014) vectors pretrained on Wikipedia 2014 and Gigaword 5 to initialize the word embedding matrix. We kept this matrix fixed across datasets. Out-of-vocabulary words were replaced with a special unknown symbol. We also augmented questions with start-of- and end-of-sequence symbols. Word vectors for these special symbols were updated during training. Model hyperparameters were validated on the development set. The dimensions of hidden vectors and word embeddings were selected from {50, 100, 200} and {100, 200}, respectively. The dropout rate was selected from {0.2, 0.3, 0.4}.
The BILSTM for the answer sentence selection QA model used the same hyperparameters. Parameters were randomly initialized from a uniform distribution U(−0.08, 0.08). The learning rate and decay rate of RMSProp were 0.01 and 0.95, respectively. The batch size was set to 150. To alleviate the exploding gradient problem (Pascanu et al., 2013), the gradient norm was clipped to 5.
Early stopping was used to determine the number of epochs.
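The reported settings can be collected into a small configuration sketch. The dictionary keys and helper names are hypothetical, and enumerating the full grid is only one plausible reading of "selected from"; the text does not state how the search was carried out. The clipping helper rescales a flat gradient list so that its L2 norm does not exceed the threshold.

```python
import itertools
import math

# Values taken from the text; key names are illustrative.
SEARCH_SPACE = {
    "hidden_size": [50, 100, 200],
    "embedding_size": [100, 200],
    "dropout": [0.2, 0.3, 0.4],
}
FIXED_SETTINGS = {
    "learning_rate": 0.01,
    "rmsprop_decay": 0.95,
    "batch_size": 150,
    "max_grad_norm": 5.0,
    "init_range": 0.08,  # parameters drawn from U(-0.08, 0.08)
}


def grid(space):
    """Yield every combination of the searchable hyperparameters."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def clip_by_norm(grads, max_norm):
    """Rescale gradients so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


configs = list(grid(SEARCH_SPACE))
clipped = clip_by_norm([30.0, 40.0], FIXED_SETTINGS["max_grad_norm"])
```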

Paraphrase Statistics
Table 3 presents descriptive statistics on the paraphrases generated by the various systems across datasets (training set). As can be seen, the average paraphrase length is similar to the average length of the original questions. The NMT method generates more paraphrases and has wider coverage, while the average number and coverage of the other two methods varies per dataset. As a way of quantifying the extent to which rewriting takes place, we report BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) scores between the original questions and their paraphrases. The NMT method and the rules extracted from WikiAnswers tend to paraphrase more (i.e., have lower BLEU and higher TER scores) compared to PPDB.
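To give a feel for how an edit-based measure reflects the amount of rewriting, the sketch below computes a word-level edit rate between a question and a paraphrase. It is a simplified stand-in, not TER itself: TER also allows block shifts, which are left out here, and treating the original question as the reference is an assumption.

```python
def word_edit_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference word
                curr[j - 1] + 1,     # insert a hypothesis word
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)


# Toy comparison: a lexical substitution versus a larger rewrite.
local_rewrite = word_edit_rate("who created microsoft", "who founded microsoft")
larger_rewrite = word_edit_rate(
    "who created microsoft", "who is the founder of microsoft"
)
```

A single-word substitution yields a low rate, while a restructured question yields a higher one, mirroring the contrast drawn above between local and more extensive paraphrasing.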

Comparison Systems
We compared our framework to previous work and several ablation models which either do not use paraphrases or paraphrase scoring, or are not jointly trained.
The first baseline only uses the base QA models described in Section 2.3 (SIMPLEGRAPH and BILSTM). The second baseline (AVGPARA) does not take advantage of paraphrase scoring. The paraphrases for a given question are used while the QA model's results are directly averaged to predict the answers. The third baseline (DATAAUGMENT) employs paraphrases for data augmentation during training. Specifically, we use the question, its paraphrases, and the correct answer to automatically generate new training samples.
In the fourth baseline (SEPPARA), the paraphrase scoring model is separately trained on paraphrase classification data, without taking question-answer pairs into account. In the experiments, we used the Quora question paraphrase dataset which contains question pairs and labels indicating whether they constitute paraphrases or not. We removed questions with more than 25 tokens and sub-sampled to balance the dataset. We used 90% of the resulting 275K examples for training, and the remaining for development. The paraphrase score $s(q'|q)$ (Equation (5)) was wrapped by a sigmoid function to predict the probability of a question pair being a paraphrase. A binary cross-entropy loss was used as the objective. The classification accuracy on the dev set was 80.6%.

Results
We first discuss the performance of PARA4QA on GRAPHQUESTIONS and WEBQUESTIONS. The first block in Table 4 shows a variety of systems previously described in the literature using average F1 as the evaluation metric (Berant et al., 2013). Among these, PARASEMP, SUBGRAPH, MCCNN, and BILAYERED utilize paraphrasing resources. The second block compares PARA4QA against various related baselines (see Section 3.4). SIMPLEGRAPH results on WEBQUESTIONS and GRAPHQUESTIONS are taken from Reddy et al. (2016) and Reddy et al. (2017), respectively.
Results on WIKIQA are shown in Table 5. We report MAP and MRR which evaluate the relative ranks of correct answers among the candidate sentences for a question. Again, we observe that PARA4QA outperforms related baselines (see BILSTM, DATAAUGMENT, AVGPARA, and SEPPARA). Ablation experiments show that performance drops most when NMT paraphrases are removed. When word matching features are used (see +CNT in the third block), PARA4QA reaches state-of-the-art performance.
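For readers unfamiliar with the ranking metrics, the sketch below computes mean average precision (MAP) and mean reciprocal rank (MRR) from relevance labels of candidate sentences sorted by model score. The label lists are invented for illustration and are not data from the experiments.

```python
def average_precision(labels):
    """labels: 1/0 relevance flags of candidates, best-scored candidate first."""
    hits, total = 0, 0.0
    for rank, label in enumerate(labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank(labels):
    """Inverse rank of the first correct candidate, or 0 if none is correct."""
    for rank, label in enumerate(labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Two toy questions with ranked candidate sentences (illustrative labels).
ranked_labels = [
    [0, 1, 0, 1],  # correct sentences at ranks 2 and 4
    [1, 0, 0],     # correct sentence at rank 1
]
map_score = mean(average_precision(labels) for labels in ranked_labels)
mrr_score = mean(reciprocal_rank(labels) for labels in ranked_labels)
```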
Examples of paraphrases and their probabilities $p_\theta(q'|q)$ (see Equation (6)) learned by PARA4QA are shown in Table 6. The two examples are taken from the development set of GRAPHQUESTIONS and WEBQUESTIONS, respectively. We also show the Freebase relations used to query the correct answers. In the first example, the original question cannot yield the correct answer because of the mismatch between the question and the knowledge base. The paraphrase contains "role" in place of "sort of part", increasing the chance of overlap between the question and the predicate words. The second question contains an informal expression "play 4", which confuses the QA model. The paraphrase model generates "play for" and predicts a high paraphrase score for it. More generally, we observe that the model tends to give higher probabilities $p_\theta(q'|q)$ to paraphrases biased towards delivering appropriate answers.
We also analyzed which structures were mostly paraphrased within a question. We manually inspected 50 (randomly sampled) questions from the development portion of each dataset, and their three top scoring paraphrases (Equation (5)). We grouped the most commonly paraphrased structures into the following categories: a) question words, i.e., wh-words and "how"; b) question focus structures, i.e., cue words or cue phrases for an answer with a specific entity type (Yao and Van Durme, 2014); c) verbs or noun phrases indicating the relation between the question topic entity and the answer; and d) structures requiring aggregation or imposing additional constraints the answer must satisfy (Yih et al., 2015). In the example "which year did Avatar release in UK", the question word is "which", the question focus is "year", the verb is "release", and "in UK" constrains the answer to a specific location.
Figure 3 shows the degree to which different types of structures are paraphrased. As can be seen, most rewrites affect Relation Verb, especially on WEBQUESTIONS. Question Focus, Relation NP, and Constraint & Aggregation are more often rewritten in GRAPHQUESTIONS compared to the other datasets.
Finally, we examined how our method fares on simple versus complex questions. We performed this analysis on GRAPHQUESTIONS as it contains a larger proportion of complex questions. We consider questions that contain a single relation as simple. Complex questions have multiple relations or require aggregation. Table 7 shows how our model performs in each group. We observe improvements for both types of questions, with the impact on simple questions being more pronounced. This is not entirely surprising as it is easier to generate paraphrases and predict the paraphrase scores for simpler questions.

Conclusions
In this work we proposed a general framework for learning paraphrases for question answering. Paraphrase scoring and QA models are trained end-to-end on question-answer pairs, which results in learning paraphrases with a purpose. The framework is not tied to a specific paraphrase generator or QA system. In fact, it allows several paraphrasing modules to be incorporated, and can serve as a testbed for exploring their coverage and rewriting capabilities. Experimental results on three datasets show that our method improves performance across tasks. There are several directions for future work. The framework can be used for other natural language processing tasks which are sensitive to the variation of input (e.g., textual entailment or summarization). We would also like to explore more advanced paraphrase scoring models (Parikh et al., 2016; Wang and Jiang, 2016) as well as additional paraphrase generators since improvements in the diversity and the quality of paraphrases could also enhance QA performance.

Figure 2: Overview of NMT-based paraphrase generation. NMT 1 (green) translates question q into pivots $g_1 \ldots g_K$ which are then back-translated by NMT 2 (blue), where K decoders jointly predict tokens at each time step, rather than only conditioning on one pivot and independently predicting outputs.

Table 2: Examples of rules used in the rule-based paraphrase generator.

Table 4: Model performance on GRAPHQUESTIONS and WEBQUESTIONS. Results with additional task-specific resources are shown in parentheses. The base QA model is SIMPLEGRAPH. Best results in each group are shown in bold.

Table 5: Model performance on WIKIQA. +CNT: word matching features introduced in Yang et al. (2015). The base QA model is BILSTM. Best results in each group are shown in bold.

Table 6: Questions and their top-five paraphrases with probabilities learned by the model. The Freebase relations used to query the correct answers are shown in brackets. The original question is underlined. Questions with incorrect predictions are in red.

Table 7: We group GRAPHQUESTIONS into simple and complex questions and report model performance in each split. Best results in each group are shown in bold. The values in brackets are absolute improvements of average F1 scores.