Multi-Relational Question Answering from Narratives: Machine Reading and Reasoning in Simulated Worlds

Question Answering (QA), as a research field, has primarily focused on either knowledge bases (KBs) or free text as a source of knowledge. These two sources have historically shaped the kinds of questions that are asked over these sources, and the methods developed to answer them. In this work, we look towards a practical use-case of QA over user-instructed knowledge that uniquely combines elements of both structured QA over knowledge bases, and unstructured QA over narrative, introducing the task of multi-relational QA over personal narrative. As a first step towards this goal, we make three key contributions: (i) we generate and release TextWorldsQA, a set of five diverse datasets, where each dataset contains dynamic narrative that describes entities and relations in a simulated world, paired with variably compositional questions over that knowledge, (ii) we perform a thorough evaluation and analysis of several state-of-the-art QA models and their variants at this task, and (iii) we release a lightweight Python-based framework we call TextWorlds for easily generating arbitrary additional worlds and narrative, with the goal of allowing the community to create and share a growing collection of diverse worlds as a test-bed for this task.

Introduction
Personal devices that interact with users via natural language conversation are becoming ubiquitous (e.g., Siri, Alexa); however, very little of that conversation today allows the user to teach, and then query, new knowledge. Most of the focus in these personal devices has been on Question Answering (QA) over general world-knowledge (e.g., "who was the president in 1980" or "how many ounces are in a cup"). These devices open a new and exciting possibility of enabling end-users to teach machines in natural language, e.g., by expressing the state of their personal world to their virtual assistant (e.g., via narrative about people and events in that user's life) and enabling the user to ask questions over that personal knowledge (e.g., "which engineers in the QC team were involved in the last meeting with the director?").
This type of question highlights a unique blend of two conventional streams of research in Question Answering (QA): QA over structured sources such as knowledge bases (KBs), and QA over unstructured sources such as free text. This blend is a natural consequence of our problem setting: (i) users may choose to express rich relational knowledge about their world, in turn enabling them to pose complex compositional queries (e.g., "all CS undergrads who took my class last semester"), while at the same time (ii) personal knowledge generally evolves through time and has an open and growing set of relations, making natural language the only practical interface for creating and maintaining that knowledge by non-expert users. In short, the task that we address in this work is: multi-relational question answering from dynamic knowledge expressed via narrative.

1. There is an associate professor named Andy
2. He returned from a sabbatical
3. This professor currently has funding
4. There is a masters level course called G301
5. That course is taught by him
6. That class is part of the mechanical engineering department
7. Roslyn is a student in this course
8. U203 is a undergraduate level course
9. Peggy and that student are TAs for this course
…
Figure 1: Illustration of our task: relational question answering from dynamic knowledge expressed via personal narrative

Although we hypothesize that question answering over personal knowledge of this sort is ubiquitous (e.g., between a professor and their administrative assistant, or even if just in the user's head), such interactions are rarely recorded, presenting a significant practical challenge to collecting a sufficiently large real-world dataset of this type. At the same time, we hypothesize that the technical challenges involved in developing models for relational question answering from narrative would not be fundamentally impacted if addressed via sufficiently rich, but controlled, simulated narratives.
Such simulations also offer the advantage of enabling us to directly experiment with stories and queries of different complexity, potentially offering additional insight into the fundamental challenges of this task.
While our problem setting blends the problems of relational question answering over knowledge bases and question answering over text, our hypothesis is that end-to-end QA models may learn to answer such multisentential relational queries, without relying on an intermediate knowledge base representation. In this work, we conduct an extensive evaluation of a set of state-of-the-art end-to-end QA models on our task and analyze their results.

Related Work
Question answering has been mainly studied in two different settings: KB-based and text-based. KB-based QA mostly focuses on parsing questions to logical forms (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2012; Berant et al., 2013; Kwiatkowski et al., 2013) in order to better retrieve answer candidates from a knowledge base. Text-based QA aims to directly answer questions from the input text. This includes early information retrieval-based methods (Banko et al., 2002; Ahn et al., 2004) and methods that build on structured representations extracted from both the question and the input text (Sachan et al., 2015; Sachan and Xing, 2016; Khot et al., 2017; Khashabi et al., 2018b). Although these structured representations make reasoning more effective, they rely on sophisticated NLP pipelines and suffer from error propagation. More recently, end-to-end neural architectures have been successfully applied to text-based QA, including memory-augmented neural networks (Sukhbaatar et al., 2015; Miller et al., 2016; Kumar et al., 2016) and attention-based neural networks (Hermann et al., 2015; Chen et al., 2016; Kadlec et al., 2016; Dhingra et al., 2017; Xiong et al., 2017; Seo et al., 2017; Chen et al., 2017). In this work, we focus on QA over text (where the text is generated from a supporting KB) and evaluate several state-of-the-art memory-augmented and attention-based neural architectures on our QA task. In addition, we consider a sequence-to-sequence model baseline (Bahdanau et al., 2015), which has been widely used in dialog (Vinyals and Le, 2015; Ghazvininejad et al., 2017) and has recently been applied to generating answer values from Wikidata (Hewlett et al., 2016).
There are numerous datasets available for evaluating the capabilities of QA systems. For example, MCTest (Richardson et al., 2013) contains comprehension questions for fictional stories. The Allen AI Science Challenge (Clark, 2015) contains science questions that can be answered with knowledge from textbooks. RACE (Lai et al., 2017) is an English exam dataset for middle and high school Chinese students. MULTIRC (Khashabi et al., 2018a) is a dataset that focuses on evaluating multi-sentence reasoning skills. These datasets all require humans to carefully design multiple-choice questions and answers, so that certain aspects of the comprehension and reasoning capabilities are properly evaluated. As a result, it is difficult to collect them at scale. Furthermore, as the knowledge required for answering each question is not clearly specified in these datasets, it can be hard to identify the limitations of QA systems and propose improvements.
Synthetic QA tasks (the BABI dataset) have been proposed to better understand the limitations of QA systems. BABI builds on a simulated physical world similar to interactive fiction (Montfort, 2005) with simple objects and relations and includes 20 different reasoning tasks. Various types of end-to-end neural networks (Sukhbaatar et al., 2015; Lee et al., 2015; Peng et al., 2015) have demonstrated promising accuracies on this dataset. However, this performance hardly translates to real-world QA datasets, as BABI uses a small vocabulary (150 words) and short sentences with limited language variation (e.g., nested sentences, coreference). A more sophisticated QA dataset with a supporting KB is WIKIMOVIES (Miller et al., 2016), which contains 100k questions about movies, each of which is answerable using either a KB or a Wikipedia article. However, WIKIMOVIES is highly domain-specific and, similar to BABI, its questions are designed to be in simple forms with little compositionality, which limits the difficulty of the tasks.
Our dataset differs from the above datasets in that (i) it contains five different realistic domains, permitting cross-domain evaluation to test the ability of models to generalize beyond a fixed set of KB relations, (ii) it exhibits rich referring expressions and linguistic variation (with a vocabulary much larger than that of the BABI dataset), and (iii) its questions are designed to be deeply compositional and can cover multiple relations mentioned across multiple sentences.
Other large-scale QA datasets include Cloze-style datasets such as CNN/Daily Mail (Hermann et al., 2015), Children's Book Test (Hill et al., 2015), and Who Did What (Onishi et al., 2016); datasets with answers being spans in the document, such as SQuAD (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2016), and TriviaQA (Joshi et al., 2017); and datasets with human-generated answers, for instance, MS MARCO (Nguyen et al., 2016) and SearchQA (Dunn et al., 2017). One common drawback of these datasets is the difficulty in assessing a system's capability of integrating information across a document context. Kočiskỳ et al. (2017) recently emphasized this issue and proposed NarrativeQA, a dataset of fictional stories with questions that reflect the complexity of narratives: characters, events, and evolving relations. Our dataset contains similar narrative elements, but it is created with a supporting KB and hence it is easier to analyze and interpret results in a controlled setting.

[...] simulated user may introduce new knowledge, update existing knowledge or express a state change (e.g., "Homework 3 is now due on Friday" or "Samantha passed her thesis defense"). Each narrative is interleaved with questions about the current state of the world, and questions range in complexity depending on the amount of knowledge that needs to be integrated to answer them. This allows us to benchmark a range of QA models on their ability to answer questions that require different extents of relational reasoning.
The worlds that we simulate as part of this work are as follows:
1. MEETING WORLD: This world describes situations related to professional meetings, e.g., meetings being set/cancelled, people attending meetings, topics of meetings.
2. HOMEWORK WORLD: This world describes situations from the first-person perspective of a student, e.g., courses taken, assignments in different courses, deadlines of assignments.
3. SOFTWARE ENGINEERING WORLD: This world describes situations from the first-person perspective of a software development manager, e.g., task assignment to different project team members, stages of software development, bug tickets.
4. ACADEMIC DEPARTMENT WORLD: This world describes situations from the first-person perspective of a professor, e.g., teaching assignments, faculty going/returning from sabbaticals, students from different departments taking/dropping courses.
5. SHOPPING WORLD: This world describes situations about a person shopping for various occasions, e.g., adding items to a shopping list, purchasing items at different stores, noting where items are on sale.

Narrative
Each world is represented by a set of entities E and a set of unary, binary or ternary relations R. Formally, a single step in one simulation of a world involves a combination of instantiating new entities and defining new (or mutating existing) relations between entities. Practically, we implement each world as a collection of classes and methods, with each step of the simulation creating or mutating class instances by sampling entities and methods on those entities. By design, these classes and methods are easy to extend, to either enrich existing worlds or create new ones. Each simulation step is then expressed as a natural language statement, which is added to the narrative. In the process of generating a natural language expression, we employ a rich mechanism for generating anaphora, such as "meeting with John about the performance review" and "meeting that I last added", in addition to simple pronoun references. This allows us to generate more natural and flowing narratives. These references are generated and composed automatically by the underlying TEXTWORLDS framework, significantly reducing the effort needed to build new worlds. Furthermore, all generated stories also provide additional annotation that maps all entities to underlying gold-standard KB ids, allowing us to perform experiments that provide models with different degrees of access to the "simulation oracle".
We generate 1,000 narratives within each world, where each narrative consists of 100 sentences, plus up to 300 questions interleaved randomly within the narrative. See Figure 1 for two example narratives. Each story in a given world samples its entities from a large general pool of entity names collected from the web (e.g., people names, university names). Although some entities do overlap between stories, each story in a given world contains a unique flow of events and entities involved in those events. See Table 1 for the data statistics.

Table 1: TEXTWORLDSQA dataset statistics
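The simulation loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the TEXTWORLDS API: the `World` class, its methods, the `office` relation, and the entity pools are all hypothetical, and the sketch emits fixed sentence templates rather than the anaphora mechanism described in the text.

```python
import random
from dataclasses import dataclass, field


@dataclass
class World:
    """Toy world (hypothetical): entities plus (relation, subject, object) facts."""
    entities: set = field(default_factory=set)
    facts: set = field(default_factory=set)
    narrative: list = field(default_factory=list)

    def add_entity(self, name, kind):
        # Instantiating a new entity adds one sentence to the narrative.
        self.entities.add(name)
        self.narrative.append(f"There is a {kind} named {name}.")

    def set_relation(self, relation, subject, obj, template):
        # Mutating a relation replaces any earlier value for the same subject.
        self.facts = {f for f in self.facts
                      if not (f[0] == relation and f[1] == subject)}
        self.facts.add((relation, subject, obj))
        self.narrative.append(template.format(subject=subject, obj=obj))


def simulate(num_steps, seed=0):
    """Run num_steps simulation steps; each step yields exactly one sentence."""
    rng = random.Random(seed)
    people = ["Andy", "Roslyn", "Peggy"]   # illustrative entity pool
    offices = ["GHC122", "GHC310"]         # illustrative values
    world = World()
    for _ in range(num_steps):
        person = rng.choice(people)
        if person not in world.entities:
            world.add_entity(person, "professor")
        else:
            world.set_relation("office", person, rng.choice(offices),
                               "{subject}'s office is now {obj}.")
    return world
```

Calling `simulate(10).narrative` returns ten sentences, while `facts` holds only the current state of the world, which is the state a question asked at that point would be answered against.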

Questions
Formally, questions are queries over the knowledge base in the state defined up to the point when the question is asked in the narrative. In the narrative, the questions are expressed in natural language, employing the same anaphora mechanism used in generating the narrative (e.g., "who is attending the last meeting I added?").
We categorize generated questions into four types, reflecting the number and types of facts required to answer them; questions that require more facts to answer are typically more compositional in nature. Each question in our dataset falls into one of the following four categories.

Single Entity/Single Relation: Answers to these questions are a single entity, e.g., "what is John's email address?", or expressed in lambda calculus notation:

λx.EmailAddress(John, x)

The answers to these questions are found in a single sentence in the narrative, although it is possible that the answer may change through the course of the narrative (e.g., "John's new office is GHC122").

Multi-Entity/Single Relation: Answers to these questions can be multiple entities but involve a single relation, e.g., "Who is enrolled in the Math class?", or expressed in lambda calculus notation:

λx.TakingClass(x, Math)

Unlike the previous category, answers to these questions can be sets of entities.

Multi-Entity/Two Relations: Answers to these questions can be multiple entities and involve two relations, e.g., "Who is enrolled in courses that I am teaching?", or expressed in lambda calculus notation:

λx.∃y.EnrolledInClass(x, y) ∧ CourseTaughtByMe(y)

Multi-Entity/Three Relations: Answers to these questions can be multiple entities and involve three relations, e.g., "Which undergraduates are enrolled in courses that I am teaching?", which composes one further relation with the two-relation query above.

In the data that we generate, answers to questions are always sets of spans in the narrative (this constraint makes it easier to evaluate several existing machine-reading models; the assumption can easily be relaxed in the simulation). In all of our evaluations, we partition our results by the four question categories listed above, which we hypothesize correlate with the difficulty of a question.
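To make the four question categories concrete, the sketch below evaluates one query of each type against a toy knowledge-base state. All facts are invented for illustration, and the `undergrad` relation used for the three-relation query is an assumption: it is one plausible reading of "Which undergraduates are enrolled in courses that I am teaching?", not a predicate taken from the dataset.

```python
# Toy KB state; every relation name and fact below is illustrative.
email = {"John": "john@example.com"}
taking_class = {("Roslyn", "Math"), ("Peggy", "Math"), ("Andy", "G301")}
taught_by_me = {"Math"}
undergrad = {"Roslyn"}  # hypothetical third relation (assumption)


def single_entity_single_relation(person):
    # λx.EmailAddress(person, x): a single entity as the answer
    return {email[person]}


def multi_entity_single_relation(course):
    # λx.TakingClass(x, course): a set of entities, one relation
    return {x for (x, y) in taking_class if y == course}


def multi_entity_two_relations():
    # λx.∃y.EnrolledInClass(x, y) ∧ CourseTaughtByMe(y)
    return {x for (x, y) in taking_class if y in taught_by_me}


def multi_entity_three_relations():
    # The two-relation query with an assumed extra conjunct Undergrad(x)
    return {x for (x, y) in taking_class
            if y in taught_by_me and x in undergrad}
```

Each additional relation adds one more condition that has to be resolved from a different sentence of the narrative, which is why the number of relations is a reasonable proxy for question difficulty.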

Methods
We develop several baselines for our QA task, including a logistic regression model and four different neural network models: Seq2Seq (Bahdanau et al., 2015), MemN2N (Sukhbaatar et al., 2015), BiDAF (Seo et al., 2017), and DrQA (Chen et al., 2017). These models generate answers in different ways, e.g., predicting a single entity, predicting spans of text, or generating answer sequences. Therefore, we implement two experimental settings: ENTITY and RAW. In the ENTITY setting, given a question and a story, we treat all the entity spans in the story as candidate answers, and the prediction task becomes a classification problem. In the RAW setting, a model needs to predict the answer spans. For logistic regression and MemN2N, we adopt the ENTITY setting as they are naturally classification models. This ideally provides an upper bound on the performance when considering answer candidate generation. For all the other models, we can apply the RAW setting.

Logistic Regression
The logistic regression baseline predicts the likelihood of an answer candidate being a true answer.
For each answer candidate e and a given question, we extract the following features: (1) the frequency of e in the story; (2) the number of words within e; (3) unigrams and bigrams within e; (4) each non-stop question word combined with each non-stop word within e; (5) the average minimum distance between each non-stop question word and e in the story; (6) the common words (excluding stop words) between the question and the text surrounding e (within a window of 10 words); (7) the sum of the frequencies of the common words to the left of e, to the right of e, and both. These features are designed to help the model pick the correct answer spans. They have been shown to be effective for answer prediction in previous work (Chen et al., 2016; Rajpurkar et al., 2016).
We associate each answer candidate with a binary label indicating whether it is a true answer. We train a logistic regression classifier to produce a probability score for each answer candidate. During training, we optimize the cross-entropy loss using Adam (Kingma and Ba, 2014) with an initial learning rate of 0.01. We use a batch size of 10,000 and train for 5 epochs. For prediction, we search for the threshold that maximizes F1 on the validation data. Training takes roughly 10 minutes for each domain on a Titan X GPU.
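The threshold search mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the grid of candidate thresholds, the function names, and the input layout (one score dictionary and one gold answer set per validation question) are hypothetical, since the text does not specify how the search is carried out.

```python
def f1_score(predicted, gold):
    """F1 between a predicted and a gold set of answer candidates."""
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_threshold(validation_questions, grid=None):
    """Pick the probability cut-off with the highest mean per-question F1.

    validation_questions: list of (scores, gold) pairs, where scores maps
    each answer candidate to its predicted probability.
    """
    if grid is None:
        grid = [i / 20 for i in range(1, 20)]  # assumed grid: 0.05 .. 0.95
    best_t, best_f1 = None, -1.0
    for t in grid:
        total = 0.0
        for scores, gold in validation_questions:
            predicted = {c for c, p in scores.items() if p >= t}
            total += f1_score(predicted, gold)
        mean_f1 = total / len(validation_questions)
        if mean_f1 > best_f1:  # ties keep the lowest threshold
            best_t, best_f1 = t, mean_f1
    return best_t, best_f1
```

The selected threshold is then applied unchanged to candidate scores on the test set.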

Seq2Seq
The seq2seq model is based on the attention-based sequence-to-sequence model of Bahdanau et al. (2015), who used it to build a neural machine translation system performing at the state of the art. We adapt this model to our domain by including a preprocessing step in which all statements are concatenated with a dedicated token, all previously asked questions are eliminated, and the current question is added at the end of the list of statements. The answers are treated as a sequence of words. We use word embeddings (Zou et al., 2013), as they were shown to improve accuracy. We use 3 connected GRU (Cho et al., 2014) layers, each with a capacity of 256. Our batch size was set to 16. We use gradient descent with an initial learning rate of 0.5 and a decay factor of 0.99, iterating on the data for 50,000 steps (5 epochs). The training process for each domain took approximately 48 hours on a Titan X GPU.
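The preprocessing step for the seq2seq input can be sketched as below. The token string `<sep>`, the function name, and the `(kind, text)` story layout are assumptions made for illustration; the sketch also assumes the question is joined to the statements with the same token, which the description leaves open.

```python
SEP = "<sep>"  # hypothetical name for the dedicated separator token


def build_seq2seq_input(story, question_index):
    """Build the encoder input for the question at story[question_index].

    story: list of (kind, text) pairs in narrative order, where kind is
    "statement" or "question".
    """
    kind, question = story[question_index]
    if kind != "question":
        raise ValueError("question_index must point at a question")
    # Keep every earlier statement and drop all previously asked questions.
    statements = [text for k, text in story[:question_index]
                  if k == "statement"]
    # The current question goes at the end of the list of statements.
    return f" {SEP} ".join(statements + [question])
```

For a story with two statements and an earlier question, the result contains both statements and the current question only.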

MemN2N
The End-To-End Memory Network (MemN2N) is a neural architecture that encodes both long-term and short-term context into a memory and iteratively reads relevant information from the memory (i.e., over multiple hops) to answer a question (Sukhbaatar et al., 2015). It has been shown to be effective for a variety of question answering tasks (Sukhbaatar et al., 2015; Hill et al., 2015).
In this work, we directly apply MemN2N to our task with a small modification. Originally, MemN2N was designed to produce a single answer for a question, so at the prediction layer, it uses softmax to select the best answer from the answer candidates. In order to account for multiple answers for a given question, we modify the prediction layer to apply the logistic function and optimize the cross-entropy loss instead. For training, we use the same parameter settings as a publicly available MemN2N implementation, except that we set the embedding size to 300 instead of 20. We train the model for 100 epochs, which takes about 2 hours for each domain on a Titan X GPU.
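The change to the prediction layer can be illustrated with a small pure-Python sketch: each answer candidate gets an independent logistic score, trained with binary cross-entropy, so several candidates can be selected at once. The function names and the 0.5 decision threshold are assumptions for illustration, not details taken from the model description.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multi_answer_loss(logits, labels):
    """Mean binary cross-entropy over answer candidates.

    logits: one raw score per candidate; labels: 1 for true answers, else 0.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(logits)


def predict(logits, threshold=0.5):
    """Indices of all candidates whose probability reaches the threshold."""
    return [i for i, z in enumerate(logits) if sigmoid(z) >= threshold]
```

With a softmax, the candidate probabilities compete and sum to one; with independent logistic outputs, `predict([2.0, -1.0, 3.0])` selects both the first and the third candidate.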

BiDAF-M
BiDAF (Bidirectional Attention Flow Networks) (Seo et al., 2017) is one of the top-performing models on the span-based question answering dataset SQuAD (Rajpurkar et al., 2016). We reimplement BiDAF with simplified parameterizations and change the prediction layer so that it can predict multiple answer spans.
Specifically, we encode the input story $\{x_1, \ldots, x_T\}$ and a given question $\{q_1, \ldots, q_J\}$ at the character level and the word level, where the character level uses CNNs and the word level uses pre-trained word vectors. The concatenation of the character and word embeddings is passed to a bidirectional LSTM to produce a contextual embedding for each word in the story context and in the question. Then, we apply the same bidirectional attention flow layer to model the interactions between the context and question embeddings, producing question-aware feature vectors for each word in the context, denoted as $G \in \mathbb{R}^{d_g \times T}$. $G$ is then fed into a bidirectional LSTM layer to obtain a feature matrix $M_1 \in \mathbb{R}^{d_1 \times T}$ for predicting the start offset of the answer span, and $M_1$ is then passed into another bidirectional LSTM layer to obtain a feature matrix $M_2 \in \mathbb{R}^{d_2 \times T}$ for predicting the end offset of the answer span. We then compute two probability scores for each word $i$ in the narrative, one for being the start and one for being the end of an answer span, parameterized by trainable weights $w_1$ and $w_2$, respectively. The training objective is simply the sum of cross-entropy losses for predicting the start and end indices.
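As a sketch of the two per-word scores in BiDAF-M, one plausible form is given below. It assumes the output layer of the original BiDAF (Seo et al., 2017), with the softmax over positions replaced by a per-word sigmoid so that several spans can be predicted; the exact inputs concatenated in each term are an assumption rather than something confirmed by the text.

```latex
p^{\text{start}}_i = \mathrm{sigmoid}\!\left(w_1^{\top}\,[G_{:i};\, M_{1,:i}]\right),
\qquad
p^{\text{end}}_i = \mathrm{sigmoid}\!\left(w_2^{\top}\,[G_{:i};\, M_{2,:i}]\right)
```

Here $G_{:i}$, $M_{1,:i}$ and $M_{2,:i}$ denote the $i$-th columns of the corresponding matrices, and $[\cdot\,;\cdot]$ is vector concatenation, so $w_1 \in \mathbb{R}^{d_g + d_1}$ and $w_2 \in \mathbb{R}^{d_g + d_2}$ under this assumption.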
We use 50 1D filters for CNN character embedding, each with a width of 5. The word embedding size is 300 and the hidden dimension for LSTMs is 128. For optimization, we use Adam (Kingma and Ba, 2014) with an initial learning rate of 0.001, and use a minibatch size of 32 for 15 epochs. The training process takes roughly 20 hours for each domain on a Titan X GPU.

DrQA-M
DrQA (Chen et al., 2017) is an open-domain QA system that has demonstrated strong performance on multiple QA datasets. We modify the Document Reader component of DrQA and implement it in a similar framework as BiDAF-M for fair comparisons. First, we employ the same character-level and word-level encoding layers to both the input story and a given question. We then use the concatenation of the character and word embeddings as the final embeddings for words in the story and in the question. We compute the aligned question embedding (Chen et al., 2017) as a feature vector for each word in the story, concatenate it with the story word embedding, and pass it into a bidirectional LSTM to obtain the contextual embeddings $E \in \mathbb{R}^{d \times T}$ for words in the story. Another bidirectional LSTM is used to obtain the contextual embeddings for the question, and self-attention is used to compress them into one single vector $q \in \mathbb{R}^{d}$. The final prediction layer uses a bilinear term to compute scores for predicting the start offset, $p^{\text{start}} = \mathrm{sigmoid}(q^{\top} W_1 E)$, and another bilinear term for predicting the end offset, $p^{\text{end}} = \mathrm{sigmoid}(q^{\top} W_2 E)$, where $W_1$ and $W_2$ are trainable weights. The training loss is the same as in BiDAF-M, and we use the same parameter setting. Training takes roughly 10 hours for each domain on a Titan X GPU.
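The bilinear scoring in the DrQA-M prediction layer can be sketched without any libraries. The function name and the nested-list layout are assumptions for illustration; a real implementation would use batched tensor operations.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bilinear_scores(q, W, E):
    """Compute sigmoid(q^T W E), giving one score per story word.

    q: question vector, a list of length d
    W: trainable weight matrix, d x d nested list
    E: story contextual embeddings, d x T nested list (one column per word)
    """
    d, T = len(q), len(E[0])
    # Row vector q^T W, of length d.
    qW = [sum(q[i] * W[i][j] for i in range(d)) for j in range(d)]
    # Dot product of q^T W with each column of E, squashed to (0, 1).
    return [sigmoid(sum(qW[j] * E[j][t] for j in range(d)))
            for t in range(T)]
```

The same function applied with `W_1` yields the start scores and with `W_2` the end scores; because each word is scored independently, more than one start or end position can receive a high probability.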

Experiments
We use two evaluation settings for measuring performance at this task: within-world and across-world. In the within-world evaluation setting, we test on the same world that the model was trained on. We then compute the precision, recall and F1 for each question and report the macro-average F1 score for questions in each world. In the across-world evaluation setting, the model is trained on four out of the five worlds, and tested on the remaining world. The across-world regime is more challenging, as it requires the model to generalize to unseen relations and vocabulary. We consider the across-world evaluation setting to be the main evaluation criterion for any future models used on this dataset, as it mimics the practical requirement of any QA system used in personal assistants: it has to be able to answer questions on any new domain the user introduces to the system.
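The evaluation protocol can be sketched as follows: per-question F1 over answer sets, macro-averaged over questions, plus the leave-one-world-out splits used in the across-world setting. The function names and the short world identifiers are hypothetical, and treating answers as sets of strings is a simplifying assumption.

```python
WORLDS = ["meeting", "homework", "software", "department", "shopping"]


def question_f1(predicted, gold):
    """F1 between the predicted and gold answer sets of one question."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def macro_f1(examples):
    """Average of per-question F1; examples is a list of (predicted, gold)."""
    return sum(question_f1(p, g) for p, g in examples) / len(examples)


def across_world_splits(worlds=WORLDS):
    """Yield (training worlds, held-out world): train on four, test on one."""
    for held_out in worlds:
        yield [w for w in worlds if w != held_out], held_out
```

Macro-averaging gives every question equal weight, so a question with many answers does not count more than one with a single answer.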

Results
We draw several important observations from our results. First, we observe that more compositional questions (i.e., those that integrate multiple relations) are more challenging for most models, as all models (except Seq2Seq) decrease in performance with the number of relations composed in a question (Figure 5.1). This can be explained in part by the fact that more compositional questions are typically longer, and also require the model to integrate more sources of information in the narrative in order to answer them. One surprising observation from our results is that the performance on questions that ask about a single relation and have only a single answer is lower than on questions that ask about a single relation but can have multiple answers (see detailed results in the Appendix). This is in part because questions that can have multiple answers typically have canonical entities as answers (e.g., a person's name), and these entities generally repeat in the text, making it easier for the model to find the correct answer.
Table 3 reports the overall (macro-average) F1 scores for different baselines. We can see that BiDAF-M and DrQA-M perform surprisingly well in the within-world evaluation even though they do not use any entity span information. In particular, DrQA-M outperforms BiDAF-M, which suggests that modeling question-context interactions using simple bilinear terms has advantages over using more complex bidirectional attention flows. The lower performance of MemN2N suggests that its effectiveness on the BABI dataset does not directly transfer to our dataset. Note that the original MemN2N architecture uses simple bag-of-words and position encoding for sentences. This may work well on datasets with a simple vocabulary; for example, MemN2N performs best in the SOFTWARE world, which has a smaller vocabulary than the other worlds. In general, we believe that better text representations for questions and narratives can lead to improved performance. The Seq2Seq model also did not perform as well. This is due to the inherent difficulty of encoding and generating long sequences. We found that it performs better when training and testing on shorter stories (limited to 30 statements). Interestingly, the logistic regression baseline performs on a par with MemN2N, but there is still a large performance gap to BiDAF-M and DrQA-M, and the gap is greater for questions that compose multiple relations.
In the across-world setting, the performance of all methods decreases dramatically. This suggests the limitations of these methods in generalizing to unseen relations and vocabulary. The span-based models BiDAF-M and DrQA-M have an advantage in this setting as they can learn to answer questions based on the alignment between the question and the narrative. However, their low performance still points to limitations in transferring question answering capabilities.

Conclusion
In this work, we have taken the first steps towards the task of multi-relational question answering expressed through personal narrative. Our hypothesis is that this task will become increasingly important as users begin to teach personal knowledge about their world to the personal assistants embedded in their devices. This task naturally synthesizes two main branches of question answering research: QA over KBs and QA over free text. One of our main contributions is a collection of diverse datasets that feature rich compositional questions over a dynamic knowledge graph expressed through simulated narrative. Another contribution of our work is a thorough set of experiments and analysis of different types of end-to-end architectures for QA at their ability to answer multi-relational questions of varying degrees of compositionality. Our long-term goal is that both the data and the simulation code we release will inspire and motivate the community to look towards the vision of letting end-users teach our personal assistants about the world around us.

Table 5: Test performance (F1 score) at the task of question answering by question type using the across-world evaluation.