Recovering Question Answering Errors via Query Revision

Existing factoid QA systems often lack a post-inspection component that can help models recover from their own mistakes. In this work, we propose to cross-check the KB relations behind the predicted answers and identify potential inconsistencies. Instead of developing a new model that accepts evidence collected from these relations, we plug the evidence back into the original question directly and check whether the revised question makes sense. A bidirectional LSTM encodes the revised questions, and a scoring mechanism over these encodings refines the predictions of a base QA system. This approach improves the F1 score of STAGG (Yih et al., 2015), one of the leading QA systems, from 52.5% to 53.9% on the WEBQUESTIONS data.


Introduction
With recent advances in building large-scale knowledge bases (KBs) such as Freebase (Bollacker et al., 2008), DBpedia (Auer et al., 2007), and YAGO (Suchanek et al., 2007), which contain the world's factual information, KB-based question answering has attracted considerable research attention. Traditional semantic parsing is one of the most promising approaches; it tackles the problem by mapping questions onto logical forms using logical languages such as CCG (Kwiatkowski et al., 2013; Reddy et al., 2014; Choi et al., 2015; Reddy et al., 2016) and DCS (Berant et al., 2013; Liang, 2014, 2015), or directly onto query graphs (Yih et al., 2015) with predicates closely related to the KB schema. Recently, neural network based models have been applied to question answering (Bordes et al.; Yih et al., 2015; Xu et al., 2016a,b).

Figure 1: Sketch of our approach. Elements in solid round rectangles are KB relation labels. The relation on the left is correct, but the base QA system predicts the one on the right. Dotted rectangles represent revised questions with relation labels plugged in. The left revised question is semantically closer to the original question and is itself more consistent. Hence, it should be ranked higher than the right one.

While these approaches have yielded successful results, they often lack a post-inspection component that can help models recover from their own mistakes. Table 1 shows the potential improvement we can achieve if such a component exists. Can we leverage textual evidence related to the predicted answers to recover from a prediction error? In this work, we show that it is possible.
Our strategy is to cross-check the KB relations behind the predicted answers and identify potential inconsistencies. As an intermediate step, we define question revision as a tailored transformation of the original question using textual evidence collected from these relations in a knowledge base, and check whether the revised questions make sense. Figure 1 sketches this idea, which distinguishes our work from many existing QA studies. Given a question, we first create its revisions with respect to candidate KB relations. We encode question revisions using a bidirectional LSTM. A scoring mechanism over these encodings is jointly trained with the LSTM parameters, with the objective that the question revised by a correct KB relation scores higher than those revised by other candidate KB relations by a certain confidence margin. We evaluate our method using STAGG (Yih et al., 2015) as the base question answering system. Our approach improves the F1 performance of STAGG from 52.5% to 53.9% on the benchmark dataset WEBQUESTIONS (Berant et al., 2013). Certainly, one can develop specialized LSTMs that directly accommodate textual evidence without revising questions. We have modified QA-LSTM and ATTENTIVE-LSTM (Tan et al., 2016) accordingly (see Section 4). However, so far their performance is not as good as that of the question revision approach.

Question Revisions
We formalize three kinds of question revisions, namely entity-centric, answer-centric, and relation-centric, which revise the question with respect to evidence from the topic entity type, the answer type, and the relation description, respectively. As illustrated in Figure 2, we design the revisions to capture generalizations at different granularities while preserving the question structure.
Let $s_r$ (e.g., Activist) and $o_r$ (e.g., ActivismIssue) denote the subject and object types of a KB relation $r$ (e.g., AreaOfActivism), respectively.
Let $\alpha$ (type.object.name) denote a function returning the textual description of a KB element (e.g., relation, entity, or type). Assuming that a candidate answer set is retrieved by executing a KB relation $r$ from a topic entity in the question, we can uniquely identify the types of the topic entity and the answer for the hypothesis by $s_r$ and $o_r$, respectively. It is also possible that a chain of relations $r = r_1 r_2 \ldots r_k$ is used to retrieve an answer set from a topic entity. When $k = 2$, by abuse of notation, we define $s_{r_1 r_2} = s_{r_1}$, $o_{r_1 r_2} = o_{r_2}$, and $\alpha(r_1 r_2) = \mathrm{concat}(\alpha(r_1), \alpha(r_2))$.
Let $m : (q, r) \mapsto q'$ denote a mapping from a given question $q = [w_1, w_2, \ldots, w_L]$ and a KB relation $r$ to a revised question $q'$. We denote the index spans of the wh-words (e.g., "what") and the topic entity (e.g., "Mary Wollstonecraft") in question $q$ by $[i_s, i_e]$ and $[j_s, j_e]$, respectively.
Entity-Centric (EC). Entity-centric question revision aims at a generalization at the entity level. We construct it by replacing the topic entity tokens with the entity's type. For the running example, it becomes "what did [activist] fight for". Formally, it is defined as

$$m_{EC}(q, r) = [\, w_{1:j_s-1};\ \alpha(s_r);\ w_{j_e+1:L} \,].$$

Answer-Centric (AC). It is constructed by augmenting the wh-words of the entity-centric question revision with the answer type. The running example is revised to "[what activism issue] did [activist] fight for". We formally define it as

$$m_{AC}(q, r) = [\, w'_{1:i_e};\ \alpha(o_r);\ w'_{i_e+1:L'} \,],$$

where the $w'_i$ are the tokens of the entity-centric question revision $m_{EC}(q, r)$ of length $L'$, with $[i_s, i_e]$ still denoting the index span of the wh-words in $w'$.
Relation-Centric (RC). Here we augment the wh-words with the relation description instead of the answer type. This form of question revision has the most expressive power in distinguishing between KB relations in the question context, but it can suffer more from training data sparsity. For the running example, it maps to "[what area of activism] did [activist] fight for". Formally, it is defined as

$$m_{RC}(q, r) = [\, w'_{1:i_e};\ \alpha(r);\ w'_{i_e+1:L'} \,].$$
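The three revisions can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function names, the 0-based inclusive spans, and the whitespace tokenization are assumptions made for the example, and the textual descriptions returned by α are passed in as plain strings.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive, 0-based token indices (an assumption of this sketch)


def revise_ec(tokens: List[str], entity_span: Span, subject_type: str) -> List[str]:
    """Entity-centric: replace the topic entity tokens with the subject type text."""
    js, je = entity_span
    return tokens[:js] + subject_type.split() + tokens[je + 1:]


def shift_wh_span(wh_span: Span, entity_span: Span, subject_type: str) -> Span:
    """Recompute the wh-word span after the entity replacement moved tokens."""
    js, je = entity_span
    delta = len(subject_type.split()) - (je - js + 1)
    i_s, i_e = wh_span
    return (i_s + delta, i_e + delta) if i_s > je else (i_s, i_e)


def augment_wh(tokens: List[str], wh_span: Span, evidence: str) -> List[str]:
    """Insert the evidence text right after the wh-words."""
    _, i_e = wh_span
    return tokens[:i_e + 1] + evidence.split() + tokens[i_e + 1:]


def revise_ac(tokens, wh_span, entity_span, subject_type, object_type):
    """Answer-centric: EC revision with the answer type added after the wh-words."""
    ec = revise_ec(tokens, entity_span, subject_type)
    return augment_wh(ec, shift_wh_span(wh_span, entity_span, subject_type), object_type)


def revise_rc(tokens, wh_span, entity_span, subject_type, relation_text):
    """Relation-centric: EC revision with the relation description after the wh-words."""
    ec = revise_ec(tokens, entity_span, subject_type)
    return augment_wh(ec, shift_wh_span(wh_span, entity_span, subject_type), relation_text)


# Running example: wh-word "what" at index 0, topic entity at indices 2..3.
question = "what did mary wollstonecraft fight for".split()
wh, ent = (0, 0), (2, 3)
ec = " ".join(revise_ec(question, ent, "activist"))
ac = " ".join(revise_ac(question, wh, ent, "activist", "activism issue"))
rc = " ".join(revise_rc(question, wh, ent, "activist", "area of activism"))
```

For the running example, `ec`, `ac`, and `rc` hold the three revised questions quoted in the text, without the brackets.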

Task Formulation
Given a question $q$, we first run an existing QA system to answer $q$. Suppose it returns $r$ as the top predicted relation and $r'$ is a candidate relation that is ranked lower. Our objective is to decide whether $r$ needs to be replaced with $r'$. We formulate this task as finding a scoring function $s : (q, r) \to \mathbb{R}$ and a confidence margin threshold $t \in \mathbb{R}_{>0}$ such that the function

$$\mathrm{replace}(r, r', q) = \begin{cases} 1 & \text{if } s(q, r') - s(q, r) \geq t \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

makes the replacement decision.
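As a rough sketch of this decision rule, the snippet below keeps the top prediction unless the lower-ranked candidate outscores it by at least the margin. The function names, the toy scores, and the use of a non-strict comparison (≥) are assumptions of this example, not details confirmed by the text.

```python
from typing import Callable


def should_replace(score_top: float, score_candidate: float, threshold: float) -> bool:
    """True when the lower-ranked relation r' should replace the top prediction r."""
    if threshold <= 0:
        raise ValueError("the confidence margin threshold t must be positive")
    # Assumption: the candidate must win by at least t (non-strict comparison).
    return score_candidate - score_top >= threshold


def refine(question: str, top: str, candidate: str,
           score: Callable[[str, str], float], threshold: float) -> str:
    """Return the relation kept after the post-inspection step."""
    if should_replace(score(question, top), score(question, candidate), threshold):
        return candidate
    return top


# Hypothetical scores standing in for a trained scoring function s(q, r).
toy_scores = {"person.profession": 0.2, "activist.area_of_activism": 0.9}
toy_score = lambda q, r: toy_scores[r]

kept = refine("what did mary wollstonecraft fight for",
              "person.profession", "activist.area_of_activism", toy_score, 0.5)
```

A larger threshold makes the refinement more conservative: with the same toy scores and a threshold of 0.8, the base prediction would be kept.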

Encoding Question Revisions
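Let $q' = (w'_1, w'_2, \ldots, w'_l)$ denote a question revision. We first encode all the words into a $d$-dimensional vector space using an embedding matrix. Let $e_i$ denote the embedding of word $w'_i$. To obtain the contextual embeddings for words, we use a bidirectional LSTM

$$\overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}(\overrightarrow{h}_{i-1}, e_i), \qquad \overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}(\overleftarrow{h}_{i+1}, e_i).$$

We combine forward and backward contextual embeddings by $h_i = \mathrm{concat}(\overrightarrow{h}_i, \overleftarrow{h}_i)$. We then generate the final encoding of the revised question $q'$ by $\mathrm{enc}(q') = \mathrm{concat}(h_1, h_l)$.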
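The data flow of this encoder can be illustrated without a deep learning library. The sketch below assumes that forward and backward states are concatenated per position, and it deliberately uses a toy tanh recurrence (`toy_cell`, a hypothetical stand-in) instead of a real LSTM cell, so it shows only how the forward pass, the backward pass, and the final concatenation of the first and last positions fit together.

```python
import math
from typing import Callable, List

Vector = List[float]


def toy_cell(h_prev: Vector, x: Vector) -> Vector:
    """Stand-in recurrent cell (NOT an LSTM): h = tanh(0.5 * h_prev + x)."""
    return [math.tanh(0.5 * h + xi) for h, xi in zip(h_prev, x)]


def run_direction(cell: Callable[[Vector, Vector], Vector],
                  embeddings: List[Vector], reverse: bool = False) -> List[Vector]:
    """Run the cell over the sequence and return one state per position."""
    state = [0.0] * len(embeddings[0])
    order = list(reversed(embeddings)) if reverse else list(embeddings)
    states = []
    for e in order:
        state = cell(state, e)
        states.append(state)
    # Re-align backward states with the original token positions.
    return states[::-1] if reverse else states


def encode(embeddings: List[Vector],
           cell: Callable[[Vector, Vector], Vector] = toy_cell) -> Vector:
    """enc(q') = concat(h_1, h_l), with h_i = concat(forward_i, backward_i)."""
    forward = run_direction(cell, embeddings)
    backward = run_direction(cell, embeddings, reverse=True)
    h = [f + b for f, b in zip(forward, backward)]  # list concatenation
    return h[0] + h[-1]


# Three words with 2-dimensional embeddings give an 8-dimensional encoding.
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
encoding = encode(embeddings)
```

Because the first position carries a backward state that has seen the whole sequence and the last position carries a forward state that has seen the whole sequence, concatenating the two summarizes the question from both directions.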

Training Objective
Score Function. Given a question revision mapping $m$, a question $q$, and a relation $r$, our scoring function is defined as $s(q, r) = w^{T}\,\mathrm{enc}(m(q, r))$, where $w$ is a model parameter that is jointly learnt with the LSTM parameters.
Loss Function. Let $T = \{(q, a_q)\}$ denote a set of training questions paired with their true answer sets. Let $U(q)$ denote the set of all candidate KB relations for question $q$. Let $f(q, r)$ denote the F1 value of the answer set obtained by relation $r$ when compared to $a_q$. For each candidate relation $r \in U(q)$ with a positive F1 value, we define

$$N(q, r) = \{\, r' \in U(q) : f(q, r) > f(q, r') \,\}$$

as the set of its negative relations for question $q$. Similar to the hinge loss in (Bordes et al., 2014), we define the objective function $J(\theta, w, E)$ as

$$J(\theta, w, E) = \sum_{(q, r, r')} \max\bigl(0,\ \delta_\lambda(q, r, r') - s(q, r) + s(q, r')\bigr), \qquad (5)$$

where the sum is taken over all valid $(q, r, r')$ triplets and the penalty margin is defined as $\delta_\lambda(q, r, r') = \lambda\,(f(q, r) - f(q, r'))$.
We use this loss function because: i) it allows us to exploit partially correct answers via F1 scores, and ii) training with it updates the model parameters towards putting a large margin between the scores of correct ($r$) and incorrect ($r'$) relations, which is naturally aligned with our prediction refinement objective defined in Equation 1.
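The margin-based objective for a single question can be sketched as follows. The sketch assumes that the negatives of a relation are the candidates with strictly lower F1 and that each triplet contributes a hinge term of the form max(0, δ − s(q, r) + s(q, r')); the dictionary-based interface and the toy numbers are illustrative only.

```python
from typing import Dict


def margin_loss(scores: Dict[str, float], f1: Dict[str, float], lam: float = 1.0) -> float:
    """Sum of hinge terms over all (r, r') pairs for one question.

    scores: relation -> s(q, r); f1: relation -> f(q, r).
    """
    total = 0.0
    for r, f_r in f1.items():
        if f_r <= 0.0:
            continue  # only relations with positive F1 act as positives
        for r_neg, f_neg in f1.items():
            if f_r > f_neg:  # assumed negative set: strictly lower F1
                delta = lam * (f_r - f_neg)  # F1-scaled penalty margin
                total += max(0.0, delta - scores[r] + scores[r_neg])
    return total


# Toy candidates: "a" is fully correct, "b" partially correct, "c" wrong.
f1 = {"a": 1.0, "b": 0.5, "c": 0.0}
scores = {"a": 2.0, "b": 1.8, "c": 0.0}
loss = margin_loss(scores, f1)
```

In this toy case only the pair ("a", "b") is penalized: "a" outscores "b" by 0.2, but the F1 gap demands a margin of 0.5, so the loss is roughly 0.3.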

Alternative Solutions
Our approach directly integrates additional textual evidence with the question itself, so the result can be processed by any sequence-oriented model and can benefit from future improvements to such models without significant modification. However, we could also design models that take this textual evidence into specific consideration without appealing to question revision. We have explored this option with two methods that closely follow QA-LSTM and ATTENTIVE-LSTM (Tan et al., 2016). The latter model achieves the state of the art for passage-level question-answer matching. Unlike our approach, these models encode questions and the evidence for candidate answers in parallel, and measure the semantic similarity between them using cosine similarity. The effectiveness of these architectures has also been shown in other studies (Neculoiu et al., 2016; Hermann et al., 2015; Mueller and Thyagarajan, 2016).
We adapt these models to our setting as follows: (1) the textual evidence $\alpha(s_r)$ (equivalent of the EC revision), $\alpha(o_r)$ (equivalent of the AC revision), or $\alpha(r)$ (equivalent of the RC revision) of a candidate KB relation $r$ is used in place of a candidate answer $a$ in the original model; (2) we replace the entity mention with a universal #entity# token as in (Yih et al., 2015), because individual entities are rare and uninformative for semantic similarity; (3) we train the score function $\mathrm{sim}(q, r)$ using the objective defined in Eq. 5. Further details of the alternative solutions can be found in Appendix A.
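Two ingredients of this setup are easy to illustrate in isolation: masking the entity mention with the #entity# token, and scoring a pair of encodings by cosine similarity. The sketch below is a minimal illustration; the helper names and the inclusive 0-based span are assumptions, and the vectors stand in for encoder outputs that are not computed here.

```python
import math
from typing import List, Tuple


def mask_entity(tokens: List[str], entity_span: Tuple[int, int]) -> List[str]:
    """Replace the topic entity mention with a single universal #entity# token."""
    js, je = entity_span  # inclusive, 0-based (assumption of this sketch)
    return tokens[:js] + ["#entity#"] + tokens[je + 1:]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity in [-1, 1]; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


masked = mask_entity("what did mary wollstonecraft fight for".split(), (2, 3))
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
parallel = cosine([1.0, 2.0], [2.0, 4.0])
```

Here `masked` is the question side fed to the encoder, while the evidence text of a candidate relation plays the role of the answer side.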

Table 2 (excerpt): main results, F1 (%).

Method                                      F1
Ensemble STAGG-RANK (Yavuz et al., 2016)    54.0
QUESREV on STAGG-RANK                       54.3

Training Data Preparation. WEBQUESTIONS only provides question-answer pairs along with annotated topic entities. We generate candidates $U(q)$ for each question $q$ by retrieving 1-hop and 2-hop KB relations $r$ from the annotated topic entity $e$ in Freebase. For each relation $r$, we query $(e, r, ?)$ against Freebase and retrieve the candidate answers $r_a$. Then, we compute $f(q, r)$ by comparing the answer set $r_a$ with the annotated answers.
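The comparison between a retrieved answer set and the annotated answers can be sketched with the standard set-based F1. The text does not spell out the exact definition used, so the function below is an assumption, as are its name and the toy answer sets.

```python
from typing import Set


def answer_f1(predicted: Set[str], gold: Set[str]) -> float:
    """Set-based F1 between the answers retrieved by a relation and the gold answers."""
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical answer sets for two candidate relations of one question.
gold_answers = {"women's rights"}
exact = answer_f1({"women's rights"}, gold_answers)
partial = answer_f1({"women's rights", "writer"}, gold_answers)
```

A relation that retrieves the gold answer together with extra answers receives a partial score, which is what allows training to exploit partially correct candidates.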

Implementation Details
Word embeddings are initialized with pretrained GloVe (Pennington et al., 2014) vectors and updated during training. We set the dimension of the word embeddings equal to the size of the LSTM hidden layer and experiment with values in {50, 100, 200, 300}. We apply dropout regularization on both the input and the output of the LSTM encoder with probability 0.5. We hand-tuned the penalty margin scalar $\lambda$ to 1. The model parameters are optimized using Adam (Kingma and Ba, 2015) with a batch size of 32. We implemented our models in TensorFlow (Abadi et al., 2016). To refine the prediction $r$ of a base QA system, we take its second top-ranked prediction as the refinement candidate $r'$, and employ $\mathrm{replace}(r, r', q)$ in Eq. 1. The confidence margin threshold $t$ is tuned by grid search on the training data after the score function is trained. The QUESREV-AC + RC model is obtained by a linear combination of QUESREV-AC and QUESREV-RC, which is formally defined in Appendix B. To evaluate the alternative solutions for prediction refinement, we apply the same decision mechanism in Eq. 1 with the trained $\mathrm{sim}(q, r)$ in Section 4 as the score function.
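The threshold tuning step can be sketched as a plain grid search. The text does not state which quantity the search optimizes, so this example assumes it is the average F1 after refinement; the tuple layout, the grid values, and the toy examples are likewise assumptions.

```python
from typing import Iterable, List, Tuple

# (s(q, r), s(q, r'), f(q, r), f(q, r')) for the top prediction r and candidate r'
Example = Tuple[float, float, float, float]


def average_f1(examples: List[Example], threshold: float) -> float:
    """Average F1 obtained when r is replaced whenever r' wins by the margin."""
    total = 0.0
    for s_top, s_cand, f_top, f_cand in examples:
        total += f_cand if s_cand - s_top >= threshold else f_top
    return total / len(examples)


def tune_threshold(examples: List[Example], grid: Iterable[float]) -> float:
    """Return the grid value with the highest average F1 (first one on ties)."""
    return max(grid, key=lambda t: average_f1(examples, t))


toy_examples = [
    (1.0, 1.5, 0.0, 1.0),  # base system is wrong, candidate is right
    (1.0, 1.2, 1.0, 0.0),  # base system is right, candidate scores slightly higher
    (2.0, 1.0, 1.0, 0.0),  # base system is right and scores higher
]
best_t = tune_threshold(toy_examples, [0.1, 0.3, 0.6])
```

In the toy data, a threshold that is too small triggers a harmful replacement in the second example, while one that is too large blocks the helpful replacement in the first.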
We use a dictionary to identify wh-words in a question. We find topic entity spans using the Stanford NER tagger. If there are multiple matches, we use the first matching span for both.

Results
Table 2 presents the main result of our prediction refinement model using STAGG's results. Our approach improves the performance of a strong base QA system by 1.4 points and achieves 53.9% in F1 measure, which is slightly better than the state-of-the-art KB-QA system (Xu et al., 2016a). However, it is important to note that, in addition to Freebase, Xu et al. (2016a) use the DBpedia knowledge base and the Wikipedia corpus, which we do not utilize. Moreover, applying our approach to the STAGG predictions reranked by (Yavuz et al., 2016), referred to as STAGG-RANK in Table 2, leads to a further improvement over a strong ensemble baseline. These results suggest that our system captures signals orthogonal to the ones exploited in the base QA models. The improvements of QUESREV over both STAGG and STAGG-RANK are statistically significant.
In Table 3, we present variants of our approach. We observe that the AC model yields the best refinement results when trained only on WEBQUESTIONS data (i.e., the WebQ. column). This empirical observation is intuitively expected because AC has more generalization power than RC, which might make it more robust to training data sparsity. This intuition is further supported by the observation that augmenting the training data with SIMPLEQUESTIONS improves the performance of the RC model the most, as it has more expressive power.

Table 3: Variants of the refinement model (columns: Refinement Model, WebQ., + SimpleQ.).
Although both QA-LSTM and ATTENTIVE-LSTM lead to successful prediction refinements on STAGG, the question revision approach consistently outperforms both alternative solutions. This suggests that our way of incorporating the new textual evidence by naturally blending it into the question context leads to a better mechanism for checking the consistency of KB relations with the question. It is possible to argue that part of the improvement of the refinement models over STAGG in Table 3 may be due to model ensembling. However, the performance gap between QUESREV and the alternative solutions enables us to isolate this effect for the query revision approach.

Example Predictions and Replacements
1. What position did vince lombardi play in college ?
   STAGG: person.education / education.institution (2-hop) - what position did person play in college
   QUESREV-EC: football_player.position_s - what position did american football player play in college
2. What did mary wollstonecraft fight for ?
   STAGG: person.profession - what profession did person fight for
   QUESREV-AC: activist.area_of_activism - what activism issue did activist fight for
3. Where was anne boleyn executed ?
   STAGG: person.place_of_birth - where place of birth was person executed
   QUESREV-RC: deceased_person.place_of_death - where place of death was deceased person executed
4. Where does the zambezi river start ?
   STAGG: river.mouth - where mouth does the river start
   QUESREV-RC: river.origin - where origin does the river start

Table 4: Example predictions of STAGG (Yih et al., 2015) and replacements proposed by variants of QUESREV, followed by their corresponding question revisions. The colors red and blue indicate wrong and correct, respectively. Domain names of KB relations are dropped for brevity.

Related Work
One of the promising approaches for KB-QA is semantic parsing, which uses logical languages such as CCG (Kwiatkowski et al., 2013; Reddy et al., 2014; Choi et al., 2015) or DCS (Berant et al., 2013) to find the right grounding of natural language on a knowledge base. Another major line of work (Bordes et al., 2014; Yih et al., 2015; Xu et al., 2016b) exploits vector space embeddings to directly measure the semantic similarity between questions and candidate answer subgraphs in the KB. In this work, we propose a post-inspection step that can help existing KB-QA systems recover from answer prediction errors.
Our work is conceptually related to traditional query expansion, a well-explored technique (Qiu and Frei, 1993; Mitra et al., 1998; Navigli and Velardi, 2003; Riezler et al., 2007; Fang, 2008; Sordoni et al., 2014; Diaz et al., 2016) in the information retrieval area. The intuition behind query expansion is to reformulate the original query to improve retrieval performance. Our approach instead revises questions using candidate answers already retrieved by a base QA system. Revised questions are then used for reasoning about the corresponding predictions themselves, not for retrieving more candidates. Hence, our approach serves as a reasoning component rather than a retrieval one.
The hypothesis generation steps in (Téllez-Valero et al., 2008) and (Trischler et al., 2016) are related to our question revision process. However, hypotheses in these approaches need to be further compared against supporting paragraphs for reasoning. This limits their applicability in the KB-QA setting, where supporting texts are lacking. Our approach modifies the appropriate parts of the question using different KB evidence behind candidate answers, which is more informative and generalizable. This enables us to reason about candidate predictions directly via revised questions without relying on any supporting texts.

Conclusion
We present a prediction refinement approach for question answering over knowledge bases. We introduce question revision as a tailored augmentation of the question via various textual evidence from KB relations. We exploit revised questions as a way to reexamine the consistency of candidate KB relations with the question itself. We show that our method improves the quality of answers produced by STAGG on the WEBQUESTIONS dataset.

A Implementation details of alternative solutions
Following (Tan et al., 2016), we use the same bidirectional LSTM for both questions and textual evidence. For the attentive model, we apply the attention mechanism on the question side because, unlike in the original model, our objective is to match textual evidence to the question context. We use average pooling for both models and compute the general attention via a bilinear term, which has been shown to be effective in (Luong et al., 2015). For the model and training parameters, we follow the strategy described in Section 5.1, with the difference that $\lambda$ is tuned to 0.2 in this setting. This intuitively makes sense because the score $\mathrm{sim}(q, r)$ is in $[-1, 1]$.
To clarify the question and answer sides for the alternative models, we provide concrete examples in Table 5 for the running example.

Table 5: Question (q) and answer (a) sides used for the alternative (ALT.) solutions QA-LSTM and ATTENTIVE-LSTM.

B Combining multiple question revision strategies
We also performed experiments combining multiple question revisions that may capture complementary signals. To this end, let $s_1, \ldots, s_k$ be the trained scoring functions with question revisions constructed by $m_1, \ldots, m_k$. We define

$$s(q, r) = \sum_{i=1}^{k} \gamma_i \, s_i(q, r),$$

where $\gamma \in \mathbb{R}^k$ is a weight vector that is trained using the same objective defined in Equation 5. This strategy is used to obtain the AC+RC model reported in the experimental results by combining AC and RC for $k = 2$.
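The combination step amounts to a weighted sum of already trained scorers. The sketch below uses fixed, hypothetical weights and toy scorers purely for illustration; in the setting described above the weights are learned rather than set by hand.

```python
from typing import Callable, Dict, Sequence

Scorer = Callable[[str, str], float]


def combine(scorers: Sequence[Scorer], gammas: Sequence[float]) -> Scorer:
    """Build s(q, r) = sum_i gamma_i * s_i(q, r) from individual scorers."""
    if len(scorers) != len(gammas):
        raise ValueError("need exactly one weight per scoring function")

    def combined(question: str, relation: str) -> float:
        return sum(g * s(question, relation) for g, s in zip(gammas, scorers))

    return combined


# Hypothetical stand-ins for trained AC and RC scoring functions.
ac_table: Dict[str, float] = {"r1": 0.4, "r2": 0.9}
rc_table: Dict[str, float] = {"r1": 1.0, "r2": 0.1}
s_ac: Scorer = lambda q, r: ac_table[r]
s_rc: Scorer = lambda q, r: rc_table[r]

s_ac_rc = combine([s_ac, s_rc], [0.6, 0.4])  # illustrative weights, not learned
score_r1 = s_ac_rc("some question", "r1")
score_r2 = s_ac_rc("some question", "r2")
```

With these toy numbers the combined scorer prefers "r1" even though the AC scorer alone prefers "r2", which is the kind of complementary signal the combination is meant to capture.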