Generating Questions for Reading Comprehension using Coherence Relations

In this paper, we propose a technique for generating complex reading comprehension questions from a discourse that are more useful than factual ones derived from assertions. Our system produces a set of general-level questions using coherence relations and a set of well-defined syntactic transformations on the input text. Generated questions evaluate comprehension abilities such as a comprehensive analysis of the text and its structure, correct identification of the author’s intent, a thorough evaluation of stated arguments, and a deduction of the high-level semantic relations that hold between text spans. Experiments performed on the RST-DT corpus allow us to conclude that our system possesses a strong aptitude for generating intricate questions. These questions are capable of effectively assessing a student’s interpretation of the text.


Introduction
The argument for a strong correlation between question difficulty and student perception comes from Bloom's taxonomy (Bloom et al. (1964)). It is a framework that attempts to categorize question difficulty in accordance with educational goals. The framework has undergone several revisions over time and currently has six levels of perception in the cognitive domain: Remembering, Understanding, Applying, Analyzing, Evaluating and Creating (Anderson et al. (2001)). The goal of a Question Generation (QG) system should be to generate meaningful questions that cater to the higher levels of this hierarchy and are therefore adept at gauging comprehension skills.
The scope of several QG tasks has been severely restricted to restructuring declarative sentences into specific level questions. For example, consider the given text and the questions that follow.
Input: The project under construction will raise Las Vegas' supply of rooms by 20%. Clark County will have 18000 new jobs.
Question 1: What will raise Las Vegas' supply of rooms by 20%?
Question 2: Why will Clark County have 18000 new jobs?
From the perspective of Bloom's Taxonomy, questions like Question 1 cater to the 'Remembering' level of the hierarchy and are not apt for evaluation purposes. Alternatively, questions like Question 2 would be associated with the 'Analyzing' level as these would require the student to draw a connection between the events, 'increase in room supply in Las Vegas' and 'creation of 18000 new jobs in Clark County'. Further, such questions would be more relevant in the context of an entire document or paragraph and serve as better reading comprehension questions.
This paper describes a generic framework for generating comprehension questions from short edited texts using coherence relations. It is organized as follows: Section 2 elaborates on previously designed QG systems and outlines their limitations. We also discuss Rhetorical Structure Theory (RST), which lays the linguistic foundations for discourse parsing. In Section 3, we explain our model and describe the syntactic transformations and templates applied to text spans for performing QG. In Section 4, we discuss experiments performed on the annotated RST-DT corpus and measure the quality of questions generated by the system. Proposed evaluation criteria address both the grammaticality and complexity of generated questions. We also compare our system with a baseline to show that it is able to generate complex questions. Finally, in Section 5, we provide our conclusions and suggest potential avenues for future research.

Previous QG systems
Previous research in QG has primarily focused on transforming declarations into interrogative sentences, or on using shallow semantic parsers to create factoid questions. Mitkov and Ha (2003) made use of term extraction and shallow parsing to create questions from simple sentences. Heilman and Smith (2010) suggested a system that over-generates questions from a sentence. First, the sentence is simplified by discarding leading conjunctions, sentence-level modifying phrases, and appositives. It is then transformed into a set of candidate questions by carrying out a sequence of well-defined syntactic and lexical transformations. Finally, these questions are evaluated and ranked using a classifier to identify the most suitable one.
Similar approaches have been suggested over time to generate questions, such as using a recursive algorithm to explore parse trees of sentences in a top-down fashion (Curto et al. (2012)); creating fill-in-the-blank type questions by analyzing parse trees of sentences and thereby identifying answer phrases (Becker et al. (2012)); or using semantics-based templates (Lindberg et al. (2013); Mazidi and Nielsen (2014)). A common drawback associated with these systems is that they create factoid questions from single sentences and focus on grammatical and/or semantic correctness, not question difficulty.
The generation of complex questions from multiple sentences or paragraphs was explored by Mannem et al. (2010). Discourse connectives such as 'because', 'since' and 'as a result' signal explicit coherence and can be used to generate Why-type questions. Araki et al. (2016) created an event-centric information network where each node represents an event and each edge represents an event-event relation. Using this network, multiple choice questions and a corresponding set of distractor choices are generated. Olney et al. (2012) suggested the use of concept maps to create inter-sentential questions where knowledge in a book chapter is represented as a concept map to generate relevant exam questions. Likewise, Papasalouros et al. (2008) and Stasaski and Hearst (2017) created questions utilizing information-rich ontologies.
Of late, several encoder-decoder models have been used in Machine Translation (Cho et al. (2014)) to automatically learn the transformation rules that enable translation from one language to another. Yin et al. (2015) and Du et al. (2017) argue that similar models can be used to automatically translate narrative sentences into interrogative ones.

Rhetorical Structure Theory
In an attempt to study the functional organization of information in a discourse, a framework called Rhetorical Structure Theory (RST) was proposed by Thompson and Mann (1987). The framework describes how short texts written in English are structured by defining a set of coherence relations that can exist between text spans. Typically, relations in RST are characterized by three parameters: the nucleus, the satellite and the rhetorical interaction between the nucleus and the satellite. The nucleus is an action; the satellite either describes this action, provides the circumstance in which this action takes place or is a result of the performed action. Notable exceptions are relations such as Contrast, List, etc. which are multinuclear and do not involve satellites.
In order to describe the complete document, these relations are expressed in the form of a discourse graph, an example of which is shown in Figure 1 (O'Donnell, 2000).
We simplify the task of QG by focusing only on the relations given in Table 1. We have condensed some of the relations defined in the RST manual (Thompson and Mann, 1987) and grouped them into new relation types as shown. A complete definition of these relation types can be found in Carlson et al. (2003).

Table 1: Set of relations used by our system. Here, N represents the Nucleus and S represents the Satellite. Columns: Relation (N,S); Obtained from. Surviving entry in the 'Obtained from' column: Evaluation, Conclusion.
Figure 1: An example of a discourse graph for a text sample from the RST-DT corpus.

Approach

System Description
The text from which questions are to be generated goes through the pipeline shown in Figure 2.
Each module in the pipeline is described in detail in the subsequent subsections.

Data Preparation
Here the discourse graph associated with the document is input to the system, which in turn extracts all relevant nucleus-satellite pairs. Each pair is represented as the tuple: Relation (Nucleus, Satellite).
Prior to applying any syntactic transformations on the text spans, we remove all leading and/or trailing conjunctions, adverbs and infinitive phrases from the text span. Further, if the span begins or ends with transition words or phrases like 'As a result' or 'In addition to', we remove them as well.
The inherent nature of discourse makes it difficult to interpret text spans as coherent pockets of information. To facilitate the task of QG, we have ignored text spans containing one word. Further, in several cases, we observe that the questions make more sense if coreference resolution is performed: this task was performed manually by a pair of human annotators who resolved all coreferents by replacing them with the concepts they were referencing. Two types of coreference resolution are considered: event coreference resolution (where coreferents referring to an event are replaced by the corresponding events) and entity coreference resolution (where coreferents referring to entities are replaced by the corresponding entities). Also, to improve the quality of generated questions, annotators replaced some words by their synonyms (Glover et al. (1981); Desai et al. (2016)).
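The data preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class `RelationTuple`, the functions `clean_span` and `prepare`, and the word lists are hypothetical names introduced here, and the sketch only handles conjunctions and transition phrases (removing adverbs and infinitive phrases would need a part-of-speech tagger, and coreference resolution was done manually in the paper). It also assumes that a pair is dropped when either span is a single word.

```python
from dataclasses import dataclass

# Assumed phrase lists: the text names only 'As a result' and 'In addition to'.
TRANSITIONS = ("as a result", "in addition to")
CONJUNCTIONS = ("and", "but", "or", "so", "because")


@dataclass(frozen=True)
class RelationTuple:
    """One Relation (Nucleus, Satellite) tuple extracted from a discourse graph."""
    relation: str
    nucleus: str
    satellite: str


def clean_span(span: str) -> str:
    """Strip leading/trailing transition phrases and conjunctions from a span."""
    text = span.strip(" ,;")
    changed = True
    while changed:
        changed = False
        lowered = text.lower()
        for phrase in TRANSITIONS + CONJUNCTIONS:
            if lowered.startswith(phrase + " ") or lowered.startswith(phrase + ","):
                text = text[len(phrase):].lstrip(" ,")
                changed = True
                break
            if lowered.endswith(" " + phrase):
                text = text[: -len(phrase)].rstrip(" ,")
                changed = True
                break
    return text


def prepare(pairs):
    """Clean each (relation, nucleus, satellite) triple and skip one-word spans."""
    result = []
    for relation, nucleus, satellite in pairs:
        n, s = clean_span(nucleus), clean_span(satellite)
        if len(n.split()) <= 1 or len(s.split()) <= 1:
            continue  # spans containing a single word are ignored
        result.append(RelationTuple(relation, n, s))
    return result


if __name__ == "__main__":
    pairs = [
        ("Explanation",
         "Then, when it would have been easier to resist them, nothing was done",
         "and my brother was murdered by the mafia three years ago"),
        ("Cause", "As a result, prices rose sharply", "Indeed"),
    ]
    for item in prepare(pairs):
        print(item)
```

Keeping the cleaning step separate from the filtering step makes it easy to swap in a tagger-based cleaner later without touching the tuple representation.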

Text-span Identification
We associate each text span with a Type depending on its syntactic composition. The assignment of Types to the text spans is independent of the coherence relations that hold between them. Table 2 describes these Types with relevant examples.

Syntax transformations
If the text span is of Type 1 or Type 2, we analyze its parse tree and perform a set of simple surface syntax transformations to convert it into a form suitable for QG. We first use a dependency parser to find the principal verb associated with the span, its part-of-speech tag and the noun or noun phrase it is modifying. Then, according to the obtained information, we apply a set of syntactic transformations to alter the text. Figure 3 describes these transformations as a flowchart.
No syntactic transformations are applied on text spans of Type 0 or Type 3. We directly craft questions from text spans that belong to these Types.

Question Generation
Upon applying the transformations described in Figure 3, we obtain a text form suitable for QG. A template is applied to this text to formulate the final question. Table 3 defines these templates. The design of the chosen templates depends on the relation holding between the spans, without considering the semantics or the meaning of the spans. This makes our system generic and thereby scalable to any domain.

Example
As an example, consider the same discourse graph from Figure 1. We show how our system will generate questions for a causal relation that has been isolated in Figure 4. For the given relation, we begin by associating the satellite: "destroying a major part of its installations and equipment" with Type 2. The principal verb 'destroying' is changed to past tense form 'destroyed' and the pronoun 'its' is replaced by the entity it is referencing i.e. 'the offices of El Espectador', to obtain the question stem: 'destroyed a major part of the installations and equipment of the offices of El Espectador'.
We use the template for the cause relation for Type 2 to obtain the question: "What destroyed the installations and equipment of the offices of El Espectador?". Similar examples have also been provided in Table 4.
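The worked example can be mimicked with a small sketch. Everything named here is an assumption made for illustration: `PAST_TENSE` is a one-entry lookup standing in for a verb-morphology library, `TEMPLATES` holds a single hypothetical template whose shape ("What <stem>?") is inferred from the example question, and the coreference-resolved span is supplied by hand, as it was by the annotators. Unlike the question quoted above, this sketch keeps the full stem, including 'a major part of'.

```python
# Assumed stand-in for a verb-form conversion library (one entry only).
PAST_TENSE = {"destroying": "destroyed"}

# Hypothetical template table keyed by (relation, span Type).
TEMPLATES = {("Cause", 2): "What {stem}?"}


def build_stem(span: str, principal_verb: str) -> str:
    """Replace the principal verb of the span with its past-tense form."""
    words = span.split()
    return " ".join(
        PAST_TENSE.get(w, w) if w == principal_verb else w for w in words
    )


def generate_question(relation: str, span_type: int, stem: str) -> str:
    """Fill the template selected by the relation and the span Type."""
    return TEMPLATES[(relation, span_type)].format(stem=stem)


if __name__ == "__main__":
    # Satellite after manual entity coreference resolution.
    resolved = ("destroying a major part of the installations and equipment "
                "of the offices of El Espectador")
    stem = build_stem(resolved, "destroying")
    print(generate_question("Cause", 2, stem))
```

Because the template is selected only by relation and Type, the same lookup works for any domain, which mirrors the domain-independence argument made for the templates.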

Data
For the purpose of experimentation, we used the RST-DT corpus (Carlson et al. (2003)) that contains annotated Wall Street Journal articles. Each article is associated with a discourse graph that describes all the coherence relations that hold between its components. We used these discourse graphs for generating questions. As described in a previous section, we filtered certain relations, and did not consider those relations in which the template is to be applied to text spans containing only one word.

Implementation
Part-of-Speech tagging and Dependency parsing were performed using Stanford's Part-of-Speech tagger (Toutanova et al. (2003)) and Dependency Parser (Nivre et al. (2016); Bird (2006)) respectively. We used the powerful linguistics library provided by NodeBox (Bleser et al. (2002)) to convert between verb forms. We have used a heavily annotated corpus and made several amendments ourselves, by performing coreference resolution and paraphrasing. This is due to the inability of modern discourse parsers to perform these tasks with high accuracy. While advances have been made in discourse parsing (Rutherford and ... Li et al. (2014)), such models make several simplifying assumptions about the input. Likewise, coreference resolution (Bengtson and Roth (2008); Wiseman et al. (2016)) is also an uphill task in discourse parsing.
Table 4 (surviving rows): generated questions evaluated against some of the criteria.
Nature of question: Here, both the question and answer are derived from text spans belonging to different sentences. Thus the score assigned will be 1.
Number of inference steps:
Nucleus: Then, when it would have been easier to resist them, nothing was done
Satellite: and my brother was murdered by the mafia three years ago
Relation: Explanation
Question: Why was the author's brother killed by the mafia three years ago?
The student should be able to correctly resolve the pronoun 'my' to 'the author' and know that 'killed' is a synonym of 'murdered'. Thus two semantic concepts, paraphrase detection and entity co-reference resolution, are tested here.

Evaluation Criteria
To evaluate the quality of generated questions, we used a set of criteria that are defined below. We considered and designed metrics that measure both the correctness and difficulty of the question.
All the metrics use a two-point scale: a score of 1 indicates the question successfully passed the metric, a score of 0 indicates otherwise.
• Grammatical correctness of questions: This metric checks only whether the generated question is syntactically correct. We do not take into account the semantics of the question.
• Semantic correctness of questions: We account for the meaning of the generated question and whether it makes sense to the reader.
We assume that if a question is grammatically incorrect, it is also semantically incorrect.
• Superfluous use of language: Since we are not focusing on shortening sentences or removing redundant data from the text, generated questions may contain information not required by the student to arrive at the answer. Such questions should be refined to make them shorter and sound more fluent or natural.
• Question appropriateness: This metric judges whether the question is posed correctly i.e. we check that the question is not ambiguous and makes complete sense to the reader.
• Nature of coherence relation: Coherence relations are classified into two categories: explicit (the relations that are made apparent through using discourse connectives) and implicit (the relations that require a deep understanding of the text). Questions generated through explicit coherence relations are easier to attempt as compared to the ones generated via implicit coherence relations. We assign a score of 1 to a question generated from an implicit coherence relation and 0 to that generated from an explicit relation.
• Nature of question: We check for the nature of generated question: If both the answer and question are derived from the same sentence, we assign a score of 0, otherwise the score will be 1.
• Number of inference steps (Araki et al. (2016)): To evaluate this metric, we consider three semantic concepts: paraphrase detection, entity co-reference resolution and event co-reference resolution. We consider a score for each concept: 1 if the concept is required and 0 if not. We take the arithmetic mean of these scores to get the average number of inference steps for a question.
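The three complexity-oriented metrics above reduce to simple arithmetic, sketched below. The function names and the use of sentence indices to decide whether question and answer share a sentence are assumptions made for this illustration.

```python
def relation_score(is_explicit: bool) -> int:
    """Nature of coherence relation: 1 for implicit, 0 for explicit."""
    return 0 if is_explicit else 1


def nature_score(question_sentence: int, answer_sentence: int) -> int:
    """Nature of question: 0 if question and answer come from the same sentence."""
    return 0 if question_sentence == answer_sentence else 1


def inference_steps(paraphrase: bool, entity_coref: bool, event_coref: bool) -> float:
    """Arithmetic mean of the three binary semantic-concept scores."""
    scores = [int(paraphrase), int(entity_coref), int(event_coref)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # A question needing paraphrase detection and entity co-reference
    # resolution, but not event co-reference resolution.
    print(inference_steps(True, True, False))
    print(relation_score(is_explicit=False), nature_score(3, 4))
```

Under this reading, a question that requires two of the three semantic concepts receives an inference-step score of 2/3.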

Example
As an example, consider some of the tuples obtained from the RST-DT corpus. Table 4 explains how the generated questions evaluate against some of our criteria.

Results and Analysis
We generated questions for the entire corpus using our system. For the 385 documents it contains, a total of 3472 questions were generated.
Table 5: Statistics for Generated Questions
For evaluating our system (represented as QG), we considered the system developed by Heilman and Smith (2010) as a baseline (represented as MH). We sampled 20 questions for each relation type. Note that we did not consider the last four metrics for comparison purposes as these metrics were designed keeping question complexity in mind: MH never addressed this issue and hence such a comparison would be unfair. Table 6 summarizes the results obtained for our system against each relation type. The process was done by two evaluators who are familiar with the evaluation criteria, and are well versed with the corpus and nature of generated questions. The table reports the average scores, considering the evaluation done by each evaluator.
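The sampling and averaging procedure could be organized as in the sketch below. The helper names, the fixed random seed, and the choice to pool both evaluators' binary scores before taking the mean are assumptions; the text does not specify how the two evaluations were combined.

```python
import random
from collections import defaultdict
from statistics import mean


def sample_per_relation(questions, k=20, seed=0):
    """Sample up to k questions per relation type.

    `questions` is an iterable of (relation, question_text) pairs.
    """
    rng = random.Random(seed)
    by_relation = defaultdict(list)
    for relation, text in questions:
        by_relation[relation].append(text)
    return {
        rel: rng.sample(items, min(k, len(items)))
        for rel, items in by_relation.items()
    }


def average_score(evaluator_a, evaluator_b):
    """Mean of the binary scores given by both evaluators for one metric."""
    return mean(list(evaluator_a) + list(evaluator_b))


if __name__ == "__main__":
    questions = [("Cause", f"question {i}") for i in range(50)]
    questions += [("Explanation", f"question {i}") for i in range(5)]
    sampled = sample_per_relation(questions)
    print({rel: len(qs) for rel, qs in sampled.items()})
    print(average_score([1, 0, 1, 1], [1, 1, 1, 0]))
```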
An analysis of the results reveals that many questions are syntactically and semantically well-formed and our results are comparable to those of MH. QG does outperform MH in several cases; however, these performance gains are incremental. Issues commonly arose due to errors made by the parser and the inability of NodeBox to convert between verb forms. Additionally, in some cases, the templates designed were unable to handle all text span Types, either due to poor design or because the text span did not fit any of the defined Types. For example, some text spans were phrased as questions and some had typographical errors (originally in the text): this led to the generation of unnatural questions. Further, some text spans were arranged such that the main clause appeared after the subordinate clause (for example, the sentence 'If I am hungry, I will eat a cake'): handling such text spans would require us to modify the text such that the subordinate clause follows the main clause (in this example, 'I will eat a cake if I am hungry'). However, to the best of our knowledge, there are no known transformations that allow us to achieve this rearrangement. Table 7 provides some statistics on common error sources that contributed to semantic (and/or grammatical) errors in generated questions.
Table 7: Common error sources. The percentage of incorrect questions is the ratio of incorrect to total questions with semantic/grammatical errors. Columns: Source of Error; percentage of incorrect questions. Surviving row: Other minor errors, 1.0%.
Superfluity of language is of concern, as generated questions often contained redundant information. However, identifying redundant information in a question would require a deep understanding of the semantics of the text spans and of the relation that holds between them. Currently, modern discourse parsers are inept at handling this aspect.
The latter four metrics depend heavily on the corpus, and not the designed system. QG, because of its ability to create inter-sentential questions and handle complex coherence relations, was given a moderate to good score by both evaluators. Depending on the text and its relations, these scores may vary. We expect these scores to increase considerably for a corpus containing many implicit relations between text spans that are displaced far apart in the text.

Conclusions and future work
We used multiple sources of information, namely a cognitive taxonomy and discourse theory, to generate meaningful questions. Our contribution to the task of QG can be summarized as follows:
• As opposed to generating questions from sentences, our system generates questions from entire paragraphs and/or documents.
• Generated questions require the student to write detailed responses that may be as long as a paragraph.
• Designed templates are robust. Unlike previous systems which work on structured inputs such as sentences or events, our system can work with almost any type of input.
• We have considered both explicit coherence relations that are made apparent through discourse connectives (Taboada (2009)), and implicit relations that are difficult to realize.
• Our system generates inter-sentential questions. To the best of our knowledge, this is the first work to be proposed that performs this task for a generic document.
There are several avenues for potential research. We have focused on only a subset of the relations making up the RST-DT corpus. Templates can also be defined for other relations to generate more questions. Further, Reed and Daskalopulu (1998) argue that RST can be complemented by defining more relations or relations specific to a particular domain. We also wish to investigate the effectiveness of encoder-decoder models in obtaining questions from Nucleus-Satellite relation pairs. This might eliminate the need for manually performing coreference resolution and/or paraphrasing.
We also wish to investigate other performance metrics that could allow us to measure question complexity and extensibility. Further, we have not addressed the task of ranking questions according to their difficulty or complexity. We wish to come up with a statistical model that analyzes questions and ranks them according to their complexity or classifies them in accordance with the levels making up the hierarchy of Bloom's taxonomy (Thompson et al. (2008)).