Discriminative Information Retrieval for Question Answering Sentence Selection

We propose a framework for discriminative IR atop linguistic features, trained to improve the recall of answer candidate passage retrieval, the initial step in text-based question answering. We formalize this as an instance of linear feature-based IR, demonstrating a 34%-43% improvement in recall for candidate triage for QA.


Introduction
Question answering (QA) with textual corpora is typically modeled as first finding a candidate set of passages (sentences) that may contain an answer to a question, followed by an optional candidate reranking stage, and then finally an information extraction (IE) step to select the answer string. QA systems normally employ an information retrieval (IR) system to produce the initial set of candidates, usually treated as a black box, bag-of-words process that selects candidate passages best overlapping with the content in the question.
Recent efforts in corpus-based QA have been focused heavily on reranking, or answer sentence selection: filtering the candidate set as a supervised classification task to single out those that answer the given question. Extensive research has explored employing syntactic/semantic features (Yih et al., 2013; Wang and Manning, 2010; Heilman and Smith, 2010; Yao et al., 2013a) and recently using neural networks (Yu et al., 2014; Severyn and Moschitti, 2015; Wang and Nyberg, 2015; Yin et al., 2016). The shared aspect of all these approaches is that the quality of reranking a candidate set is upper-bounded by the initial set of candidates: unless one plans on reranking the entire corpus for each question as it arrives, one is still reliant on an initial IR stage in order to obtain a computationally feasible QA system. Huang et al. (2013) used neural networks and cosine distance to rank the candidates for IR, but without providing a method to search for the relevant documents in sublinear time.
We propose a framework for performing this triage step for QA sentence selection and other related tasks in sublinear time. Our method shows a log-linear model can be trained to optimize an objective function for downstream reranking, and the resulting trained weights can be reused to retrieve a candidate set. The content that our method retrieves is what the downstream components are known to prefer: it is trainable using the same data as employed in training candidate reranking. Our approach follows Yao et al. (2013b) who proposed the automatic coupling of QA sentence selection and IR by augmenting a bag-of-words query with desired named entity (NE) types based on a given question. While Yao et al. showed improved performance in IR as compared with an off-the-shelf IR system, the model was proof-of-concept, employing a simple linear interpolation between bag-of-words and NE features with a single scalar value tuned on a development set, kept static across all types of questions at test time. We generalize Yao et al.'s intuition by casting the problem as an instance of classification-based retrieval (Robertson and Spärck Jones, 1976), formalized as a discriminative retrieval model (Cooper et al., 1992; Gey, 1994; Nallapati, 2004) allowing for the use of NLP features. Our framework can then be viewed as an instance of linear feature-based IR, following Metzler and Croft (2007).
To implement this approach, we propose a general feature-driven abstraction for coupling retrieval and answer sentence selection. Our experiments demonstrate state-of-the-art results on QA sentence selection on the dataset of Lin and Katz (2006), and we show significant improvements over a bag-of-words baseline on a novel Wikipedia-derived dataset we introduce here, based on WIKIQA (Yang et al., 2015).

Figure 1: Steps in mapping natural language questions into weighted features used in retrieval (example question: "What continent is Egypt in?"; final step: MIPS retrieval).

Approach
Formally, given a candidate set $D = \{p_1, \dots, p_N\}$, a query $q$ and a scoring function $F(q, p)$, an IR system retrieves the top-$k$ items under the objective

$$\arg\max_{p \in D} F(q, p). \tag{1}$$

If the function $F$ is simple enough (e.g., tf-idf), this can be solved easily with traditional IR techniques. However, tackling this problem with a complex $F$ via straightforward application of supervised classification (e.g., recent neural network based models) requires a traversal over all possible candidates, i.e., the corpus, which is computationally infeasible for any reasonable collection.
Let $f_Q(q)$ refer to feature extraction on the query $q$, with corresponding candidate-side feature extraction $f_P(p)$ on the candidate. Finally, $f_{QP}(q, p)$ extracts features from a (query, candidate) pair and is defined in terms of $f_Q$ and $f_P$ via a composition $C$ (defined later):

$$f_{QP}(q, p) = C(f_Q(q), f_P(p)). \tag{2}$$

From a set of query/candidate pairs we can train a model $M$ such that, given the feature vector of a pair $(q, p)$, its return value $M(f_{QP}(q, p))$ represents the predicted probability that the passage $p$ answers the question $q$. This model is chosen to be a log-linear model with the feature weight vector $\theta$, leading to the optimization problem

$$\arg\max_{p \in D} \theta \cdot f_{QP}(q, p). \tag{3}$$

This is in accordance with the pointwise reranker approach, and is an instance of the linear feature-based model of Metzler and Croft (2007). Under specific compositional operations in $f_{QP}$ the following transformation can be made:

$$\theta \cdot f_{QP}(q, p) = t_\theta(f_Q(q)) \cdot f_P(p). \tag{4}$$

This is elaborated in § 4. We project the original feature vector of the query $f_Q(q)$ to a transformed version $t_\theta(f_Q(q))$: this transformed vector depends on the model parameters $\theta$, so that the association learned between the query and the candidate is incorporated into the transformed vector. This is a weighted, trainable generalization of query expansion in traditional IR systems.
Under this transformation we observe that the joint feature function $f_{QP}(q, p)$ is decomposed into two parts with no interdependency: the original problem in Eq. (4) is reduced to a standard maximum inner product search (MIPS) problem, as seen on the RHS of Eq. (4). Under sparsity assumptions (where the query vector and the candidate feature vector are both sparse), this MIPS problem can be solved efficiently (in sublinear time) using classical IR techniques (multiway merging of postings lists).

Features
A feature vector can be seen as an associative array that maps features of the form "KEY=value" to real-valued weights. One item in a feature vector $f$ is denoted "(KEY=value, weight)", and a feature vector can be seen as a set of such tuples. We write $f(\text{KEY=value}) = \text{weight}$ to indicate that the features serve as keys to the associative array, and $\theta_X$ is the weight of the feature $X$ in the trained model $\theta$.
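As a minimal sketch of this representation (not the authors' implementation), a feature vector can be held in a Python dictionary keyed by "KEY=value" strings. The alias `FeatureVector` and the helper `dot` are names introduced here for illustration only, and the passage-side vector is an invented example.

```python
from typing import Dict

# Hypothetical alias: a sparse feature vector maps "KEY=value" to a real weight.
FeatureVector = Dict[str, float]


def dot(u: FeatureVector, v: FeatureVector) -> float:
    """Sparse inner product: only features present in both vectors contribute."""
    if len(u) > len(v):  # iterate over the smaller vector
        u, v = v, u
    return sum(weight * v[feature] for feature, weight in u.items() if feature in v)


# A question-side vector (weights may be fractional, e.g. tf-idf weights).
question: FeatureVector = {"QWORD=what": 1.0, "LAT=city": 1.0, "WORD=author": 0.454}
# An invented passage-side vector with binary weights.
passage: FeatureVector = {"WORD=author": 1.0, "NE-TYPE=PERSON": 1.0}

overlap = dot(question, passage)  # only "WORD=author" is shared
```

Because absent features are simply missing keys, the cost of `dot` depends on the number of non-zero features rather than on the size of the feature space.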

Question features
$f_{wh}$: Question word, typically the wh-word of a sentence. If it is a question like "How many", the word after the question word is also included in the feature, i.e., the feature "(QWORD=how many, 1)" will be added to the feature vector.
$f_{lat}$: Lexical answer type (LAT). If the query has the question word "what" or "which", we identify the LAT of the question (Ferrucci et al., 2010), which is defined as the head word of the first NP after the question word. E.g., "What is the city of brotherly love?" would result in "(LAT=city, 1)".
$f_{NE}$: All the named entities (NE) discovered in the question. E.g., "(NE-PERSON=Margaret Thatcher, 1)" would be generated if Thatcher is mentioned.
$f_{TfIdf}$: The $L_2$-normalized tf-idf weighted bag-of-words feature of the question. An example feature would be "(WORD=author, 0.454)".

Passage features
All passage features are constrained to be binary.
$f_{BoW}$: Bag-of-words: any distinct word x in the passage will generate a feature "(WORD=x, 1)".
$f_{NEType}$: Named entity type. If the passage contains the name of a person, a feature "(NE-TYPE=PERSON, 1)" will be generated.
$f_{NE}$: Same as the NE feature for questions.
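The purely lexical features above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's pipeline: the tokenizer, the `WH_WORDS` list, the rule of appending the next word after any "how", and the precomputed `idf` table are all hypothetical simplifications, and the LAT and NE features are omitted because they require a parser and an NER tagger.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Assumed list of question words; the paper does not enumerate them.
WH_WORDS = {"what", "which", "who", "whom", "whose", "when", "where", "why", "how"}


def tokenize(text: str) -> List[str]:
    """Lowercase and keep alphanumeric runs (a deliberate simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def f_wh(question: str) -> Dict[str, float]:
    """Question word; for "how", the following word is included as well."""
    tokens = tokenize(question)
    for i, token in enumerate(tokens):
        if token in WH_WORDS:
            if token == "how" and i + 1 < len(tokens):
                return {f"QWORD=how {tokens[i + 1]}": 1.0}
            return {f"QWORD={token}": 1.0}
    return {}


def f_tfidf(question: str, idf: Dict[str, float], default_idf: float = 1.0) -> Dict[str, float]:
    """L2-normalized tf-idf bag of words; `idf` is an assumed precomputed table."""
    counts = Counter(tokenize(question))
    raw = {word: tf * idf.get(word, default_idf) for word, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0:
        return {}
    return {f"WORD={word}": w / norm for word, w in raw.items()}


def f_bow(passage: str) -> Dict[str, float]:
    """Binary bag of words: one feature per distinct word in the passage."""
    return {f"WORD={word}": 1.0 for word in set(tokenize(passage))}
```

For example, `f_wh("How many people live in Egypt?")` yields the single feature `QWORD=how many` with weight 1, mirroring the "(QWORD=how many, 1)" example in the text.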

Feature vector operations
Composition Here we elaborate the composition $C$ of the question feature vector and passage feature vector, defining two operators on feature vectors: Cartesian product ($\otimes$) and join ($\bowtie$).
For any feature vector of a question $f_Q(q) = \{(k_i = v_i, w_i)\}$ with $w_i \le 1$, and any feature vector of a passage $f_P(p) = \{(k_j = v_j, 1)\}$, their Cartesian product and join are defined as

$$f_Q(q) \otimes f_P(p) = \{((k_i, k_j) = (v_i, v_j),\ w_i)\},$$
$$f_Q(q) \bowtie f_P(p) = \{((k_i = k_j) = 1,\ w_i) \mid v_i = v_j\}.$$

The notation $(k_i = k_j) = 1$ denotes a feature for a question/passage pair that, when present, witnesses the fact that the value for feature $k_i$ on the question side is the same as the value for feature $k_j$ on the passage side.
The composition that generates the feature vector for the question/passage pair is therefore defined as

$$C(f_Q(q), f_P(p)) = \big((f_{wh}(q) \otimes f_{lat}(q)) \otimes f_{NEType}(p)\big) \oplus \big((f_{wh}(q) \otimes f_{lat}(q)) \otimes f_{BoW}(p)\big) \oplus \big(f_{NE}(q) \bowtie f_{NE}(p)\big) \oplus \big(f_{TfIdf}(q) \bowtie f_{BoW}(p)\big),$$

where $\oplus$ denotes concatenation of feature vectors. The four components play the following roles.
$(f_{wh}(q) \otimes f_{lat}(q)) \otimes f_{NEType}(p)$ captures the association of question words and lexical answer types with the expected type of named entities.
$(f_{wh}(q) \otimes f_{lat}(q)) \otimes f_{BoW}(p)$ captures the relation between some question types and certain words in the answer.
$f_{NE}(q) \bowtie f_{NE}(p)$ captures named entity overlap between questions and answering sentences.
$f_{TfIdf}(q) \bowtie f_{BoW}(p)$ measures general tf-idf-weighted context word overlap. Using only this feature without the others effectively reduces the system to a traditional tf-idf-based retrieval system.
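The two operators can be sketched as follows. This is one possible reading rather than the authors' code: features are represented here as hypothetical `(key, value)` tuples, and the weights of identical join features are summed, an assumption that makes the tf-idf join behave like ordinary tf-idf overlap.

```python
from collections import defaultdict


def cartesian(fq, fp):
    """Cartesian product: pair every question feature with every passage feature."""
    return {
        ((ki, kj), (vi, vj)): wi * wj
        for (ki, vi), wi in fq.items()
        for (kj, vj), wj in fp.items()
    }


def join(fq, fp):
    """Join: emit the feature (ki = kj) = 1 whenever the two values are equal."""
    out = defaultdict(float)
    for (ki, vi), wi in fq.items():
        for (kj, vj), wj in fp.items():
            if vi == vj:
                out[((ki, "=", kj), 1)] += wi * wj  # assumption: duplicates are summed
    return dict(out)


# Question side: wh-word and lexical answer type, combined by a Cartesian product.
f_wh = {("QWORD", "what"): 1.0}
f_lat = {("LAT", "continent"): 1.0}
f_tfidf = {("WORD", "continent"): 0.6, ("WORD", "egypt"): 0.8}

# Passage side: binary features.
f_netype = {("NE-TYPE", "LOCATION"): 1.0}
f_bow = {("WORD", "egypt"): 1.0, ("WORD", "africa"): 1.0}

type_pair = cartesian(cartesian(f_wh, f_lat), f_netype)
word_overlap = join(f_tfidf, f_bow)
```

In this example `type_pair` holds one feature tying (what, continent) to the LOCATION entity type, and `word_overlap` holds the single feature `(WORD = WORD) = 1` whose weight is the tf-idf weight of the shared word "egypt".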
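Projection Given a question, we want to know what kinds of features its potential answers might have. Once this is known, an index searcher can do the work of retrieving the desired passages.
For the Cartesian product of features, we define

$$t^{\otimes}_\theta(f) = \{(k' = v',\ w\,\theta_{(k, k') = (v, v')}) \mid (k = v, w) \in f\}$$

for all $k'$, $v'$ such that $\theta_{(k, k') = (v, v')} \neq 0$, where $(k, k') = (v, v')$ is the pair feature produced by the Cartesian product. For join, we have

$$t^{\bowtie}_\theta(f) = \{(k' = v,\ w\,\theta_{(k = k') = 1}) \mid (k = v, w) \in f\}$$

for all $k'$ such that $\theta_{(k = k') = 1} \neq 0$, i.e., the feature $(k = k') = 1$ appears in the trained model. In both cases the weights of identical features are summed. It can be shown from the definitions above that, for a question-side feature vector $f$ and a binary passage-side feature vector $g$,

$$\theta \cdot (f \otimes g) = t^{\otimes}_\theta(f) \cdot g, \qquad \theta \cdot (f \bowtie g) = t^{\bowtie}_\theta(f) \cdot g.$$

Then the transformed feature vector $t(q) = t_\theta(f_Q(q))$ of an expected answer passage, given a feature vector of a question $f_Q(q)$, is

$$t_\theta(f_Q(q)) = t^{\otimes}_\theta(f_{wh}(q) \otimes f_{lat}(q)) + t^{\bowtie}_\theta(f_{NE}(q)) + t^{\bowtie}_\theta(f_{TfIdf}(q)).$$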

Calculating the vector t(q) is computationally efficient because it only involves sparse vectors.
We have thus shown that Eq. (4) holds for the feature vectors we proposed: given a question, we can reverse-engineer the features we expect to be present in a candidate using the transformation function $t_\theta$, whose output we then use as a query vector for retrieval.
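The identity behind Eq. (4) can be checked numerically on a toy example. The sketch below is an illustration under assumptions, not the authors' code: features are hypothetical `(key, value)` tuples, the weights in `theta` are invented, and `project_cartesian` and `project_join` are names introduced here for the two projections.

```python
from collections import defaultdict


def cartesian(fq, fp):
    """Pair every question feature with every passage feature."""
    return {((ki, kj), (vi, vj)): wi * wj
            for (ki, vi), wi in fq.items() for (kj, vj), wj in fp.items()}


def join(fq, fp):
    """Emit (ki = kj) = 1 for equal values; duplicate features are summed."""
    out = defaultdict(float)
    for (ki, vi), wi in fq.items():
        for (kj, vj), wj in fp.items():
            if vi == vj:
                out[((ki, "=", kj), 1)] += wi * wj
    return dict(out)


def dot(u, v):
    """Sparse inner product over shared features."""
    return sum(w * v[f] for f, w in u.items() if f in v)


def project_cartesian(theta, fq):
    """Passage features expected by the model for Cartesian-product pairs."""
    out = defaultdict(float)
    for (key, value), weight in theta.items():
        if len(key) == 2:  # pair key (k, k') of a Cartesian-product feature
            (k, k2), (v, v2) = key, value
            if (k, v) in fq:
                out[(k2, v2)] += fq[(k, v)] * weight
    return dict(out)


def project_join(theta, fq):
    """Passage features expected by the model for join features (k = k') = 1."""
    out = defaultdict(float)
    for (key, _), weight in theta.items():
        if len(key) == 3 and key[1] == "=":
            k, k2 = key[0], key[2]
            for (kq, v), w in fq.items():
                if kq == k:
                    out[(k2, v)] += w * weight
    return dict(out)


# Toy question and passage features (invented values).
question_type = cartesian({("QWORD", "what"): 1.0}, {("LAT", "continent"): 1.0})
question_words = {("WORD", "egypt"): 0.8, ("WORD", "continent"): 0.6}
passage_types = {("NE-TYPE", "LOCATION"): 1.0}
passage_words = {("WORD", "egypt"): 1.0, ("WORD", "africa"): 1.0}

# Invented model weights over pair features.
theta = {
    ((("QWORD", "LAT"), "NE-TYPE"), (("what", "continent"), "LOCATION")): 1.5,
    ((("QWORD", "LAT"), "NE-TYPE"), (("what", "continent"), "PERSON")): -0.7,
    (("WORD", "=", "WORD"), 1): 2.0,
}

# Left-hand side: score the explicit pair features (key spaces are disjoint here).
pair_features = {**cartesian(question_type, passage_types),
                 **join(question_words, passage_words)}
lhs = dot(theta, pair_features)

# Right-hand side: project the question once, then take a plain inner product.
query = {**project_cartesian(theta, question_type),
         **project_join(theta, question_words)}
passage = {**passage_types, **passage_words}
rhs = dot(query, passage)
```

Both sides evaluate to the same score, but only the right-hand side separates the question from the passage, which is what allows the passage vectors to be indexed ahead of time. Note that `query` also carries a negative weight for `NE-TYPE=PERSON`: the projection can penalize passage features as well as reward them.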
Retrieval We use Apache LUCENE (footnote 4) to build the index of the corpus, which, in the scenario of this work, consists of the feature vectors of all candidates $f_P(p)$, $p \in D$. This is an instance of a weighted bag-of-features instead of the common bag-of-words.
For a given question $q$, we first compute its feature vector $f_Q(q)$ and then compute its transformed feature vector $t_\theta(q)$ given model parameters $\theta$, forming a weighted query. We modified the similarity function LUCENE uses when executing multiway postings list merging so that fast, efficient maximum inner product search can be achieved. This classical IR technique ensures sublinear performance because only vectors with at least one overlapping feature, instead of the whole corpus, are traversed (footnote 5).
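The retrieval idea can be illustrated without LUCENE by a small in-memory inverted index. This sketch is a stand-in and does not reproduce LUCENE's API or its multiway merge: it uses simple term-at-a-time score accumulation, and `build_index`, `search`, and the toy passages are all hypothetical.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_index(passages: Iterable[Set[str]]) -> Dict[str, List[int]]:
    """Map each feature to the postings list of passage ids that contain it."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pid, features in enumerate(passages):
        for feature in features:
            index[feature].append(pid)
    return index


def search(index: Dict[str, List[int]], query: Dict[str, float], k: int) -> List[Tuple[int, float]]:
    """Top-k passages by inner product with the weighted query.

    Passage features are binary, so the inner product is the sum of the query
    weights of the features a passage contains. Only passages sharing at least
    one feature with the query are ever touched.
    """
    scores: Dict[int, float] = defaultdict(float)
    for feature, weight in query.items():
        for pid in index.get(feature, ()):
            scores[pid] += weight
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


passages = [
    {"WORD=egypt", "WORD=africa", "NE-TYPE=LOCATION"},
    {"WORD=egypt", "WORD=pharaoh", "NE-TYPE=PERSON"},
    {"WORD=river", "WORD=long"},
]
index = build_index(passages)
query = {"NE-TYPE=LOCATION": 1.5, "NE-TYPE=PERSON": -0.7, "WORD=egypt": 1.6}
top = search(index, query, k=3)
```

Here the third passage shares no feature with the query and is never scored, which is the source of the sublinear behaviour described above. One consequence of this design is that a passage with no overlapping feature is never returned, even if every scored passage ends up with a negative score.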

Experiments
TREC Data We use the training and test data from Yao et al. (2013b). Passages are retrieved from the AQUAINT Corpus (Graff, 2002), which is NER-tagged by the Illinois Named Entity Tagger (Ratinov and Roth, 2009) with an 18-label entity type set. Questions are parsed using the Stanford CORENLP (Manning et al., 2014) package. Each question is paired with 10 answer candidates from AQUAINT, each annotated via crowdsourcing for whether it answers the question. The test data derives from Lin and Katz (2006), which contains 99 TREC questions that can be answered in AQUAINT. We follow Nallapati (2004) and undersample the negative class: for each query, we take 50 sentences uniformly at random from the AQUAINT corpus, filtered to ensure that no such sentence matches the query's answer pattern, and add them to the training set as negative samples.
Wikipedia Data We introduce a novel evaluation for QA retrieval, based on WIKIQA (Yang et al., 2015), which pairs questions asked to Bing with their most associated Wikipedia article, along with sentence-level annotations on the introductory section of those articles as to whether they answer the question (footnote 6). We automatically aligned the WIKIQA annotations, which were based on an unreported version of Wikipedia, with the Feb. 2016 snapshot, using for our corpus the introductory section of all Wikipedia articles, processed with Stanford CORENLP. Alignment was performed via string edit distance, leading to a 55% alignment to the original annotations.
Footnote 4: http://lucene.apache.org.
Footnote 5: The closest work on indexing we are aware of is by Bilotti et al. (2007), who transformed linguistic structures to structured constraints, which is different from our approach of directly indexing linguistic features.
Footnote 6: Note that as compared to the TREC dataset, there are some questions in WIKIQA which are not answerable based on the provided context alone. E.g. "who is the guy in the wheelchair who is smart" has the answer "Professor Stephen Hawking , known for being a theoretical physicist , has appeared in many works of popular culture ." This sets the upper bound on performance with WIKIQA below 100% when using contemporary question answering techniques, as assumed
Setup The model is trained using LIBLINEAR (Fan et al., 2008), with heavy $L_1$-regularization (feature selection) added to the maximum likelihood objective. The model is tuned on the dev set, with the objective of maximizing recall.
Baseline systems Recent work in neural network based reranking is not directly applicable here, as those models are linear in the number of candidate sentences, which is computationally infeasible given a large corpus.
Off-the-shelf LUCENE: Directly index the sentences in LUCENE and perform sentence retrieval. This is equivalent to maximum tf-idf retrieval.
Yao et al. (2013b): A retrieval system that augments the bag-of-words query with desired named entity types based on a given question.
Evaluation metrics (1) R@1k: The recall within the top 1000 retrieved results. In contrast to typical IR systems, which optimize precision (as seen in metrics such as P@10), our system is a triaging system whose goal is to retrieve good candidates for downstream reranking: high recall within a large set of initial candidates is our foremost aim. (2)
We also plot the performance of these systems at different values of k on a log scale (shown in Fig. 2 and Fig. 3). We use two metrics here: recall at k (R@k) and success at k (S@k). Success at k is the percentage of queries for which there was at least one relevant answer sentence among the first k results retrieved by a specific system, which is the true upper bound for downstream tasks.
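The two plotted metrics can be computed as in the following sketch. The function names and toy rankings are hypothetical, and averaging per-query recall uniformly over queries is an assumption, since the text does not state how R@k is aggregated.

```python
from typing import List, Sequence, Set, Tuple

# One run per query: (ranked list of retrieved ids, set of relevant ids).
Run = Tuple[Sequence[str], Set[str]]


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear among the first k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs: List[Run], k: int) -> float:
    """R@k averaged uniformly over queries (an assumed aggregation)."""
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


def success_at_k(runs: List[Run], k: int) -> float:
    """Fraction of queries with at least one relevant item in the first k results."""
    hits = sum(1 for ranked, relevant in runs if set(ranked[:k]) & relevant)
    return hits / len(runs)


runs: List[Run] = [
    (["a", "b", "c", "d"], {"b", "d"}),  # one of two relevant items within top 2
    (["x", "y", "z"], {"q"}),            # the relevant item is never retrieved
]
r_at_2 = mean_recall_at_k(runs, 2)
s_at_2 = success_at_k(runs, 2)
```

S@k bounds what any downstream reranker can achieve: if no relevant sentence is among the k candidates for a query, no reordering of those candidates can answer it.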
Again, DiscIR demonstrated significantly higher recall than the baselines at different values of k and across different datasets. Success rates at different values of k are also uniformly higher than those of LUCENE, and at most values of k higher than those of the model of Yao et al.
Footnote 8: Results on dev data are not reported in Yao et al. (2013b).

Conclusion and Future Work
Yao et al. (2013b) proposed coupling IR with features from downstream question answer sentence selection. We generalized this intuition by recognizing it as an instance of discriminative retrieval, and proposed a new framework for generating weighted, feature-rich queries based on a question. This approach allows for the straightforward use of a downstream feature-driven model in the candidate selection process, and we demonstrated how this leads to a significant gain in recall, b-pref and MAP, hence providing a larger number of correct candidates that can be provided to a downstream (neural) reranking model, a clear next step for future work.