A dataset and baselines for sequential open-domain question answering

Previous work on question-answering systems mainly focuses on answering individual questions, assuming they are independent and devoid of context. Instead, we investigate sequential question answering, asking multiple related questions. We present QBLink, a new dataset of fully human-authored questions. We extend existing strong question answering frameworks to include previous questions to improve the overall question-answering accuracy in open-domain question answering. The dataset is publicly available at http://sequential.qanta.org.


Introduction
The framework of combining information retrieval and neural reading comprehension has been the basis of several systems for answering open-domain questions over unstructured text (Chen et al., 2017; Wang et al., 2018; Clark and Gardner, 2018; Htut et al., 2018). Typically, such systems take one input question at a time, retrieving and ranking multiple paragraphs that potentially contain the answer. A reading comprehension model then produces a ranked list of candidate answer spans from each paragraph. The final answer is then selected from the produced spans.
In information-seeking dialogs, e.g., personal assistants, users interact with a question answering system by asking a sequence of related questions, where questions share the same predicate, entities, or at least a topic. Answering each question in isolation is sub-optimal as information from previously asked questions and previously obtained answers can help better answer the current question.
We study the task of sequential open-domain question answering. We ask how a standard open-domain question answering system can incorporate connections between question-answer pairs in the same sequence. We introduce QBLink, a new dataset of about 18,000 question sequences (Figure 1); each sequence consists of three naturally occurring human-authored questions (totaling around 56,000 unique questions). The sequences themselves are also naturally occurring (i.e., we do not artificially combine individually-authored questions to form sequences), which allows us to focus more on the important connections between questions that should be incorporated to improve the end-to-end question answering accuracy.
We compare sequence-aware models to baselines that process each question separately. For our sequence-aware models, we tweak the retrieval component by incorporating previous questions and their answers together with the current question to better rank the retrieved paragraphs. For the reader, we use the semantic relations between entities in previous questions (or their corresponding answers) and entities mentioned in the paragraph being read (candidate answers) to better choose the answer entity. Both the retrieval and reading steps can be slightly improved by incorporating sequence information.
Our contributions are two-fold: first, we present a new dataset for sequential question answering. Our dataset is composed of complex questions about a variety of topics. We make the dataset publicly available to encourage future research. Second, we use our dataset to compare baselines in the open-domain question answering setup with the goal of showing that incorporating sequential connections between questions is advantageous.

Sequential Question Answering Task
We define the task of open-domain sequential question answering: given a document collection $D$ and questions grouped into disjoint sequences $\{S_i \mid i = 1 \ldots n\}$, where each $S_i$ is an ordered sequence of question-answer pairs $(q_i^j, a_i^j)$, each with document evidence $D_i^j$ that is a subset of $D$, the task is to answer question $q_i^{\hat{j}}$ with document evidence $D_i^{\hat{j}}$ given access to the previously asked questions in the same sequence and their corresponding answers $\{(q_i^j, a_i^j) \mid j < \hat{j}\}$. Following Chen et al. (2017), we split the task into two steps: a retrieval step and a reading step. In the retrieval step, the current question $q_i^{\hat{j}}$ and the previous questions and answers $\{(q_i^j, a_i^j) \mid j < \hat{j}\}$ are used to retrieve a ranked list of paragraphs $D_i^{\hat{j}}$ from $D$ that are likely to contain the correct answer to $q_i^{\hat{j}}$. The retrieved paragraphs $D_i^{\hat{j}}$ are the input to the reading step, which selects a span from $D_i^{\hat{j}}$ as the answer to $q_i^{\hat{j}}$. The reading step has access to the previous questions and answers $\{(q_i^j, a_i^j) \mid j < \hat{j}\}$ as well.
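The inputs available at each step of a sequence can be made concrete with a small data-structure sketch. The class and function names below (`QAPair`, `split_context`) are hypothetical and chosen for illustration only; the toy questions are invented and are not taken from QBLink.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class QAPair:
    """One question with its gold answer (field names are illustrative)."""
    question: str
    answer: str


def split_context(sequence: List[QAPair], j_hat: int) -> Tuple[str, List[QAPair]]:
    """Return the current question and the (question, answer) pairs asked before it."""
    if not 0 <= j_hat < len(sequence):
        raise IndexError(f"position {j_hat} is outside the sequence")
    return sequence[j_hat].question, sequence[:j_hat]


# A toy three-question sequence (invented text, not from the dataset).
sequence = [
    QAPair("Name this 40th U.S. president.", "Ronald Reagan"),
    QAPair("Name the man who shot him in 1981.", "John Hinckley Jr."),
    QAPair("Name the actress he hoped to impress.", "Jodie Foster"),
]
current, history = split_context(sequence, 2)
```

Both the retrieval step and the reading step would receive `current` together with `history`; the first question of a sequence simply has an empty history.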

Dataset Construction
This section describes QBLink's construction. QBLink is based on the bonus questions of Quiz Bowl tournaments. Unlike the starter (or tossup) questions used in previous work (Boyd-Graber et al., 2012), bonus questions are not interruptible (players always hear the complete question) and have greater variability in difficulty. Bonus questions start with a lead-in text, which sets the stage for the rest of the question, followed by a sequence of related questions. Figure 1 shows an example of a sequence of three questions. Specifically, we collect bonus questions from http://quizdb.org for the tournaments in 2008-2018. Each question is categorized by topic as history, literature, science, geography, fine arts, philosophy, religion, mythology, social sciences, current events or trash. We filter out questions that are too short (fewer than ten tokens), and only keep questions with exactly three sub-questions.
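The two filters can be sketched as a single predicate. This is a minimal illustration under two assumptions not stated in the text: tokens are obtained by whitespace splitting, and the length threshold is applied to each sub-question. The name `keep_bonus` is hypothetical.

```python
from typing import List

MIN_TOKENS = 10       # questions with fewer tokens are dropped
PARTS_PER_BONUS = 3   # only bonuses with exactly three sub-questions are kept


def keep_bonus(parts: List[str]) -> bool:
    """Return True if a bonus question passes both filters.

    Assumptions: whitespace tokenization, and the length filter
    applies to every sub-question of the bonus.
    """
    if len(parts) != PARTS_PER_BONUS:
        return False
    return all(len(part.split()) >= MIN_TOKENS for part in parts)
```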
We map the answers to unambiguous Wikipedia pages using a combination of rule-based matching and fuzzy string matching, then filter out the questions whose answers are not mapped to any Wikipedia page (12.5% of the questions).
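The exact rules and matcher are not described, so the following is only a sketch of how such a two-stage mapping could look. It assumes a single rule (dropping bracketed alternatives such as "[accept ...]" from the answer line) and uses Python's `difflib` as a stand-in fuzzy matcher; the function names and the 0.9 cutoff are illustrative choices.

```python
import difflib
import re
from typing import List, Optional


def normalize(answer: str) -> str:
    """Rule-based step (assumed): drop bracketed notes, collapse spaces, lowercase."""
    answer = re.sub(r"\[[^\]]*\]", " ", answer)
    return re.sub(r"\s+", " ", answer).strip().lower()


def link_answer(answer: str, titles: List[str], cutoff: float = 0.9) -> Optional[str]:
    """Map an answer string to a page title, or None if nothing is close enough."""
    by_norm = {normalize(title): title for title in titles}
    key = normalize(answer)
    if key in by_norm:  # exact match after normalization
        return by_norm[key]
    # Fuzzy fallback: best title whose similarity ratio reaches the cutoff.
    close = difflib.get_close_matches(key, list(by_norm), n=1, cutoff=cutoff)
    return by_norm[close[0]] if close else None
```

Answers for which `link_answer` returns `None` correspond to the questions that would be filtered out.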
To keep our development and test sets intact and covering a reasonable percentage of the questions, we use the questions from the 2014 tournaments (the year with the largest number of questions) for development and testing, and the rest of the questions for training. Table 1 shows the number of question sequences and questions per split as well as tokens and linked Wikipedia mentions per question. We use TagMe (Ferragina and Scaiella, 2010) for mention detection and linking question text to Wikipedia.

Baselines
We build our baselines on the DrQA framework of Chen et al. (2017) for open-domain question answering over Wikipedia (see footnote 1). The framework operates in a retrieval phase followed by a reading phase.
The retrieval phase uses a simple tf-idf (Salton and Buckley, 1987) ranking of Wikipedia articles given a question as the query.
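As a rough illustration of tf-idf ranking, the sketch below scores each article by the dot product of tf-idf weighted term counts. It is a simplified stand-in rather than the DrQA retriever: it assumes whitespace tokenization, unigram terms, and a smoothed idf, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenization (an assumption of this sketch)."""
    return text.lower().split()


def rank_by_tfidf(query: str, docs: Dict[str, str]) -> List[str]:
    """Return document ids sorted by descending tf-idf dot product with the query."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc_counts in counts.values() for term in doc_counts)
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    q_counts = Counter(tokenize(query))

    def score(doc_counts: Counter) -> float:
        return sum(
            q_counts[t] * doc_counts[t] * idf[t] ** 2
            for t in q_counts
            if t in doc_counts
        )

    return sorted(docs, key=lambda doc_id: score(counts[doc_id]), reverse=True)
```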
The reading phase is a multi-layer recurrent neural network model that extracts an answer span from the top d retrieved paragraphs. The reader model computes a contextualized representation of each token $t_i$ by running the token sequence through a multi-layer bidirectional long short-term memory network (BiLSTM) (Hochreiter and Schmidhuber, 1997) and taking the top-layer hidden state corresponding to each token. The question itself is encoded in a vector $q$ as a weighted average of the hidden states of a BiLSTM over the word embeddings of its individual tokens. The model then computes an unnormalized probability score of $t_i$ being the start or the end token of the answer span. To find the answer in multiple paragraphs at test time, we merge all paragraphs before feeding them to the reader (Clark and Gardner, 2018).
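For reference, the reader of Chen et al. (2017), on which this section builds, scores start and end positions with a bilinear term between each token representation and the question vector. Assuming that form carries over here (the symbols $s_{\text{start}}$, $s_{\text{end}}$, $W_s$, and $W_e$ are notation chosen for this sketch), the unnormalized scores can be written as:

```latex
s_{\text{start}}(i) = \exp\left(t_i^{\top} W_s \, q\right), \qquad
s_{\text{end}}(i) = \exp\left(t_i^{\top} W_e \, q\right)
```

Here $W_s$ and $W_e$ would be trainable matrices, and normalizing each score over all token positions yields the start and end probabilities.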

Answering Questions in Isolation
We experiment with three models that ignore the sequential connections between questions and answer each question in isolation. Our first model is a simple information retrieval (IR) baseline that only uses the retrieval component: the title of the top-1 Wikipedia article is predicted as the answer. Our second baseline is the full DrQA baseline whose reader is trained/tuned on the training/development questions of our dataset. To assign paragraphs to each of the training questions, we follow a distant-supervision approach similar to that of Chen et al. (2017). We retrieve the top twenty Wikipedia articles for each question, exclude the paragraphs that do not contain the gold answer, and then rank the remaining paragraphs using tf-idf. Each of the top ten paragraphs is paired with the question to form a data instance for training the reader.
Footnote 1: We use the Wikipedia dump of 2017-09-20.
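The distant-supervision procedure for building reader training instances can be sketched as follows. This is an illustration under stated assumptions: the articles are already ranked by the retriever, "contains the gold answer" is tested by case-insensitive substring match, and a simple token-overlap score stands in for tf-idf. All names are hypothetical.

```python
from typing import Callable, List, Tuple


def overlap_score(question: str, paragraph: str) -> float:
    """Stand-in for tf-idf: number of distinct question tokens found in the paragraph."""
    q_tokens = set(question.lower().split())
    p_tokens = set(paragraph.lower().split())
    return float(len(q_tokens & p_tokens))


def build_reader_instances(
    question: str,
    gold_answer: str,
    ranked_articles: List[List[str]],
    score: Callable[[str, str], float] = overlap_score,
    max_articles: int = 20,
    max_paragraphs: int = 10,
) -> List[Tuple[str, str]]:
    """Pair the question with its best-scoring paragraphs that contain the gold answer."""
    candidates = [
        paragraph
        for article in ranked_articles[:max_articles]
        for paragraph in article
        if gold_answer.lower() in paragraph.lower()  # assumed substring test
    ]
    candidates.sort(key=lambda p: score(question, p), reverse=True)
    return [(question, paragraph) for paragraph in candidates[:max_paragraphs]]
```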
Finally, we tweak the DrQA reader to limit the candidate answer spans to entity mentions that are linked to Wikipedia. We set the pre-normalization start and end scores of spans that are not detected mentions to zero.
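One way to picture this restriction is shown below. The sketch assumes that a span is scored by the product of its start and end scores, as in Chen et al. (2017), and zeroes that score for every span that is not a detected mention; the function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices


def masked_span_scores(
    start: List[float], end: List[float], mentions: Set[Span]
) -> Dict[Span, float]:
    """Score every span as start[i] * end[j], zeroing spans that are not mentions."""
    if len(start) != len(end):
        raise ValueError("start and end must have the same length")
    n = len(start)
    return {
        (i, j): (start[i] * end[j] if (i, j) in mentions else 0.0)
        for i in range(n)
        for j in range(i, n)
    }


def best_span(scores: Dict[Span, float]) -> Span:
    """Return the highest-scoring span."""
    return max(scores, key=scores.get)
```

Because the unmasked scores are positive, the best span after masking is always a detected mention whenever at least one mention exists.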

Incorporating Context in Retrieval
To incorporate the sequential connections between questions in the retrieval phase, we append the previously asked questions to the current question. We also compare appending, to the current question, either the predicted answers (top-1 span) or the gold answers of the previous questions.
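The retrieval variants differ only in what is appended to the current question before it is used as the query. A minimal sketch, with a hypothetical `build_query` helper and mode names chosen for illustration:

```python
from typing import List, Tuple


def build_query(current: str, history: List[Tuple[str, str]], mode: str = "answers") -> str:
    """Build the retrieval query from the current question and earlier (question, answer) pairs.

    mode="none":      current question only (isolated baseline)
    mode="questions": append the text of the previous questions
    mode="answers":   append the previous answers (predicted or gold,
                      whichever the caller stored in `history`)
    """
    if mode == "none":
        extra: List[str] = []
    elif mode == "questions":
        extra = [question for question, _ in history]
    elif mode == "answers":
        extra = [answer for _, answer in history]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return " ".join([current] + extra)
```

With a bag-of-words retriever such as tf-idf, the order in which the pieces are concatenated does not affect the ranking.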

Incorporating Context in Reader
In addition to encoding which entities have appeared in previous questions, we also want to provide our models with relationship information. However, pre-defined relationships from knowledge bases tend to be brittle. Instead, we use a continuous representation of relationships (Iyyer et al., 2016). For example, suppose we want to encode the relationships for an entity (answer candidate) that starts at token $i$ and ends at token $j$. We summarize that entity's relationships from each of the $k$ possible relation-spans. A relation-span is a sequence of tokens from Wikipedia that contains both the answer candidate and an answer to a previous question (for example, the correct answer in Figure 2 has the relation-span "He is best known for defending President Ronald Reagan during the assassination attempt by John Hinckley Jr." with the previous answer "Ronald Reagan"). The relation-spans are summarized in a vector $r_{ij}$ by merging all $k$ of them into a single span that is then fed through a BiLSTM whose hidden states are combined as a weighted sum, where the weights are computed with self-attention (Lin et al., 2017).
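Two pieces of this procedure can be sketched without a neural network library: collecting relation-spans and pooling hidden states with attention weights. The sketch makes several assumptions: spans are whole sentences, matching is by case-insensitive substring, the hidden states are given as plain lists (in the model they would come from the BiLSTM), and the attention uses a single scoring vector `w`, which is a simplification of the self-attention of Lin et al. (2017). All names are hypothetical.

```python
import math
from typing import List


def find_relation_spans(
    sentences: List[str], candidate: str, previous_answer: str, k: int = 5
) -> List[str]:
    """Return up to k sentences that mention both the candidate and the previous answer."""
    both = [
        s for s in sentences
        if candidate.lower() in s.lower() and previous_answer.lower() in s.lower()
    ]
    return both[:k]


def merge_spans(spans: List[str]) -> str:
    """Merge all relation-spans into a single span, as input to the encoder."""
    return " ".join(spans)


def attention_pool(hidden: List[List[float]], w: List[float]) -> List[float]:
    """Weighted sum of hidden states with weights softmax(w . h_t)."""
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in hidden]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden[0])
    return [sum(a * h[d] for a, h in zip(weights, hidden)) for d in range(dim)]
```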
The stronger the similarity between the relation that the question is asking about and the relation-spans, the higher the score of the candidate answer should be. We estimate that similarity $r$ by concatenating the elementwise absolute difference and the Hadamard product between $r_{ij}$ and the question embedding $q$. Then, we use a trainable weight vector $w_{rel}$ to combine the components of the concatenation output and produce a single similarity score as
$$r = w_{rel}^{\top} \left[\, |r_{ij} - q| \,;\, r_{ij} \odot q \,\right].$$
This can then influence the final selection of the answer span by adding the relation similarity score $r$ to the start and end scores of the candidate answer (Equation 1). The relation embedding module is trained jointly with the reader.
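The similarity computation described above can be sketched with plain Python lists. The function names are hypothetical, and the vectors are toy values rather than learned embeddings; `w_rel` must have twice the dimension of the embeddings because it weights the concatenation of two feature vectors.

```python
from typing import List, Tuple


def relation_similarity(r_ij: List[float], q: List[float], w_rel: List[float]) -> float:
    """Combine |r_ij - q| and r_ij * q (elementwise) with the weight vector w_rel."""
    if len(r_ij) != len(q) or len(w_rel) != 2 * len(q):
        raise ValueError("dimension mismatch")
    abs_diff = [abs(a - b) for a, b in zip(r_ij, q)]
    hadamard = [a * b for a, b in zip(r_ij, q)]
    features = abs_diff + hadamard  # concatenation of the two feature vectors
    return sum(w * f for w, f in zip(w_rel, features))


def add_relation_score(start_score: float, end_score: float, r: float) -> Tuple[float, float]:
    """Add the relation similarity to both boundary scores of a candidate answer."""
    return start_score + r, end_score + r
```

In the model, `w_rel` would be learned jointly with the reader; here it is simply passed in.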

Baseline Results
We use QBLink to compare the baselines' question answering accuracy. Incorporating previous questions and answers slightly improves the accuracy (Table 2). We set the maximum number of retrieved documents to ten, and each document is divided into paragraphs of 400 tokens each. At test time, we merge the ten top-ranked paragraphs and feed them to the reader. We use the reader network of Chen et al. (2017). We limit the number of relation description spans for each entity pair to five. We use an LSTM with one hidden layer and 128 hidden units for the paragraph, question, and relation description encoders. Each reader is trained for twenty epochs. Table 2 summarizes the results of the baselines (Section 4). Question-answering accuracy is exact-match accuracy since we limit the answer spans to entity mentions whose boundaries are fixed for all models.
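Two mechanical pieces of this setup are easy to make concrete: splitting a document into fixed-size paragraphs and computing exact-match accuracy. The sketch assumes the document is already tokenized and that predictions and gold answers are compared as exact strings; the function names are hypothetical.

```python
from typing import List


def split_into_paragraphs(tokens: List[str], size: int = 400) -> List[List[str]]:
    """Split a token list into consecutive paragraphs of at most `size` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def exact_match_accuracy(predictions: List[str], gold: List[str]) -> float:
    """Fraction of predictions that are identical to the gold answer."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```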
Figure 2:
Question: This man attempted to impress Jodie Foster by shooting Ronald Reagan, but he failed to kill the President. At trial, he was found not guilty by reason of insanity.
Gold answer to previous question: Ronald Reagan
Prediction without relation span: George H. W. Bush
Correct answer: John Hinckley Jr.
Relation span: He is best known for defending President Ronald Reagan during the assassination attempt by John Hinckley Jr.
Incorporating the previous answer in the retrieval and the reading components slightly improves the overall question answering accuracy (Table 2). The accuracy drops by more than 3% when using the entire text of previous questions in the retrieval phase. Modeling relations reduces the accuracy slightly compared to augmenting paragraphs with relation spans. One possible explanation is that our relation embedding model ends up being under-trained since we could not retrieve any relation-spans for many questions. Replacing Wikipedia with a larger corpus (e.g., ClueWeb) might help improve the training of the relation embedding model. Unsurprisingly, the gold answers to previous questions are more useful than the predicted answers, which highlights a need for models that take into account the uncertainty about previous answers when gold previous answers are not available. However, providing the answers to previous questions is consistent with most Quiz Bowl tournament play. Figure 2 gives an example of how explicit relation embedding helps the reader reach a correct prediction. Without the relation span, the model predicts George H. W. Bush (vice president at that time) as the answer. With the direct relation span between Reagan and John Hinckley Jr. included, the model gets the correct answer.

Related Work and Discussion
We adopt the open-domain question answering framework (Wang et al., 2018; Chen et al., 2017). Previous work considers improving that base framework itself (Clark and Gardner, 2018; Swayamdipta et al., 2018, inter alia), but retains the assumption of answering individual questions.
Aside from the open-domain setup, much of the recent work on question answering has focused on the sub-problem of reading comprehension, where the gold answer to each question is assumed to exist in a given single paragraph for the model to read (Hermann et al., 2015; Rajpurkar et al., 2016; Seo et al., 2017). Another line of work on question answering is question answering over structured knowledge-bases (Berant et al., 2013; Berant and Liang, 2014; Yao and Van Durme, 2014; Gardner and Krishnamurthy, 2017). Although we focus on the more general open-domain setup, QBLink can be adapted to be usable in the reading-comprehension setup as well as the question answering over knowledge-bases setup.
Several question answering datasets have been proposed (Berant et al., 2013; Joshi et al., 2017; Trischler et al., 2017; Rajpurkar et al., 2018, inter alia). However, all of them were limited to answering individual questions. Saha et al. (2018) study the problem of sequential question answering, and introduce a dataset for the task. However, we differ from them in two aspects: 1) They consider question-answering over structured knowledge-bases. 2) Their dataset construction was overly synthetic: templates were collected by human annotators given knowledge-base predicates. Further, sequences were constructed synthetically as well by grouping individual questions by predicate or subjects.
Both Iyyer et al. (2017) and Talmor and Berant (2018) answer complex questions by decomposing each into a sequence of simple questions. Iyyer et al. (2017) adopt a semantic parsing approach to answer questions over semi-structured tables. They construct a dataset of around 6,000 question sequences by asking humans to rewrite a set of 2,000 complex questions into simple sequences. Talmor and Berant (2018) consider the setup of open-domain question answering over unstructured text, but their dataset is constructed synthetically (with human paraphrasing) by combining simple questions with a few rules.
In parallel to our work, Choi et al. (2018) and Reddy et al. (2018) introduce sequential question answering datasets (QuAC and CoQA) that focus on the reading comprehension setup (i.e., a single text snippet is pre-specified for answering the given questions). QBLink is entirely naturally occurring (all questions and answers were authored independently from any knowledge sources) and is primarily designed to challenge human players.
The idea behind our baseline, improving the reading step by incorporating additional relation description spans, is similar to that of Weissenborn et al. (2017) and Mihaylov and Frank (2018), who integrate background commonsense knowledge into reading-comprehension systems. Both rely on structured knowledge bases to extract information about semantic relations that hold between entities. In contrast, we extract text spans that mention each pair of entities and encode them into vector representations of the relations between entities.

Conclusions and Future Work
We introduce QBLink, a dataset of around 56,000 naturally occurring sequential question-answer pairs. The questions are designed primarily to challenge human players in Quiz Bowl tournaments. We use QBLink to evaluate baselines for sequential open-domain question answering. We show that incorporating sequential information slightly improves question answering accuracy.
In the future, we would like to invest in building better sequential question answering models that push the accuracy beyond the presented baselines. Specifically, we will look at how to better model the interaction between the reader and the relation embedding model and how to improve the relation embedding model itself by adopting ideas from the relation extraction literature (Miwa and Bansal, 2016; Peng et al., 2017; Ammar et al., 2017).