Semi-automatic Generation of Multiple-Choice Tests from Mentions of Semantic Relations

We propose a strategy for the semi-automatic generation of learning material for reading-comprehension tests, guided by semantic relations embedded in expository texts. Our approach combines methods from the areas of information extraction and paraphrasing in order to present a language teacher with a set of candidate multiple-choice questions and answers that can be used for verifying a language learner's reading capabilities. We implemented a web-based prototype showing the feasibility of our approach and carried out a pilot user evaluation that resulted in encouraging feedback but also pointed out aspects of the strategy and prototype implementation which need improvement.


Introduction
Computer-assisted language learning (CALL) opens many new opportunities for language learners and teachers. In this paper, we focus on one frequently used tool in this area: reading-comprehension tests. Such tests are an important means for assessing a learner's current skill level by verifying their understanding of foreign-language texts. Some work in the area of CALL has focused on reducing the teacher's workload in the context of reading-comprehension tests by inventing methods for the automatic scoring of such tests (see Section 5). In contrast, we propose a strategy for the semi-automatic generation of learning material for reading-comprehension tests, guided by semantic relations embedded in expository text. The multiple-choice exercises ask learners to choose, from a list of statements about semantic relations, the one which is actually expressed in the given text. The exercises attempt to test whether the learners understand the text and have enough language knowledge to recognize variant expressions of the same semantic relations.
Our strategy combines technologies from different branches of NLP. A standard information extraction (IE) system is utilized to automatically recognize relevant entities and semantic relations among them in texts. The resulting mentions are used for the creation of (a) paraphrases of the actually mentioned facts and (b) natural-language statements expressing facts not mentioned in the original text, i.e., the sentence generation system takes linguistic patterns filled with entities as input and produces paraphrases as potential answer candidates. The multiple-choice exercises generated in this way are then presented to a language teacher, who has to go through them and can reject a subset of them or replace individual elements. This human-in-the-loop step is necessary because of the noise inherent to current NLP systems. We let the teacher choose the appropriate trade-off between correctness and content coverage of the generated (candidate) questionnaires.
The proposed strategy is implemented in a web-based prototype system, with separate interfaces for teachers (for the preparation of exercises) and learners (for conducting the exercises). This prototype is capable of handling a number of selected semantic relations from the biographic and financial domains, illustrating the applicability of the approach to the frequent class of news articles from the tabloid press and business news. Based on this prototype, we carried out a pilot user study to gather insights on the best directions for future development.
In summary, the contributions of this paper are as follows:
• A strategy for the semi-automatic generation of learning material for fact-centric reading-comprehension tests.
• A way of incorporating the idea of a human-in-the-loop into the data flow of the strategy.
• A web-based prototype implementation of this approach, along with a pilot user study to explore directions for further development.

A workflow for automatic exercise generation
In this section, we present our generic approach towards the generation of candidate multiple-choice questions for given expository texts. Furthermore, we detail the steps necessary for a teacher to compile actual reading-comprehension exercises suitable for presentation to learners, given only our approach's resulting candidates.

Reading-comprehension exercises
The reading-comprehension exercises generated by our approach ask for relational facts mentioned in a text. In order to automatically identify these, we apply a series of processing steps, as depicted in Figure 1. In the first step, the input texts, e.g., news articles, are processed by a standard component for named-entity recognition (e.g., the well-known Stanford Named Entity Recognizer by Finkel et al. (2005)), in order to identify persons, organizations, locations, etc. mentioned in the text, followed by application of a relation extraction system for the identification of facts. The information about mentioned facts and entities is passed on to a further processing step in which a mentioned instance of a semantic relation is transformed into a natural-language statement, paraphrasing the original occurrence of the fact. Furthermore, the information about named-entity occurrences is used to create false statements about relations between the entities. For each fact identified in a text, four choices are provided as potential statements about the text, only one of them stating a fact actually mentioned.
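The overall data flow of these processing steps can be sketched as follows. This is a minimal illustration only; all function and class names are hypothetical placeholders, not the actual components used in the prototype.

```python
# Sketch of the exercise-generation pipeline: NER -> relation extraction ->
# statement generation. All names are illustrative placeholders.
from dataclasses import dataclass


@dataclass
class RelationMention:
    relation: str    # e.g., "marriage"
    args: tuple      # e.g., ("Madonna Louise Ciccone", "Guy Ritchie")
    confidence: float  # score of the extracting pattern


def generate_candidates(text, ner, rel_extractor, realizer, distractors=3):
    """Produce one candidate multiple-choice item per extracted fact."""
    entities = ner(text)                      # persons, organizations, ...
    mentions = rel_extractor(text, entities)  # list of RelationMention
    exercises = []
    for mention in mentions:
        correct = realizer.paraphrase(mention)  # statement of the true fact
        wrong = realizer.false_statements(mention, entities, n=distractors)
        exercises.append({
            "question": "Which one of the following four facts "
                        "can be inferred from the text?",
            "options": [correct] + wrong,
            "answer": correct,
        })
    return exercises
```

Each resulting item bundles one correct statement with three false ones, matching the four-choice format described above.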

Relation Extraction
A key part of our approach is the application of a pattern-based relation extraction (RE) system. Such systems, e.g., NELL (Carlson et al., 2010; Mitchell et al., 2015), PATTY (Nakashole et al., 2012), DARE (Xu et al., 2007), rely on lexico-syntactic patterns that pose restrictions on the surface level or grammatical level of sentences. Their underlying assumption is that whenever a given sentence matches a given pattern (i.e., a sentence template), the sentence expresses the pattern's corresponding semantic relation. This assumption does not always hold, hence the system output usually contains a certain amount of noise, which makes a human-in-the-loop necessary for high-precision applications.
Typically, RE systems associate patterns with a confidence score of some kind, allowing downstream components to trade precision for recall. At this step in our pipeline, we extract all the information the RE system can deliver, associate it with the extracting pattern's score, and pass it on to the next step.
One important aspect to consider is the amount of information such an approach can extract from texts. We believe that pattern-based RE systems provide enough facts for our approach, as in principle any semantic relation between entities, such as kinship relations, can be detected. For example, given the following sentence, RE systems could extract the relation instance marriage(Madonna, Guy Ritchie) and pass it on to the next component:

Example 1: As the skirls of a lone bagpiper gave way to the music of French pianist Katia Labèque and a local organist, the wedding ceremony of Madonna Louise Ciccone, 42, and film director Guy Ritchie, 32, began.
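A surface-level variant of such pattern matching can be illustrated with plain regular expressions. This is a toy sketch under simplifying assumptions: real RE systems such as DARE learn hundreds of patterns from large corpora and typically match on dependency structures rather than surfaces, and the patterns below are invented for illustration.

```python
import re

# Toy surface-level RE patterns for the marriage relation (illustrative only).
MARRIAGE_PATTERNS = [
    r"the wedding ceremony of (?P<a>[A-Z][\w. ]+?), \d+, "
    r"and (?:film director )?(?P<b>[A-Z][\w. ]+?), \d+,",
    r"(?P<a>[A-Z][\w ]+) tied the knot with (?P<b>[A-Z][\w ]+)",
]


def extract_marriage(sentence):
    """Return (arg1, arg2) pairs for all patterns matching the sentence."""
    instances = []
    for pattern in MARRIAGE_PATTERNS:
        for m in re.finditer(pattern, sentence):
            instances.append((m.group("a").strip(), m.group("b").strip()))
    return instances
```

Applied to the sentence of Example 1, this sketch yields the instance marriage(Madonna Louise Ciccone, Guy Ritchie), which is then handed to the answer-generation step.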

Answer Generation
Given the relation instances and arguments identified in the previous step, candidates for questions and answers are automatically generated by filling arguments into sentence templates. These templates are created based on patterns that were used for relation extraction in the previous step, i.e., RE patterns are utilized for two purposes in our approach.
Depending on the specific kind of RE pattern, this involves a few straightforward processing steps: for surface-level RE patterns it involves restoring correct inflections of potentially lemmatized lexical pattern elements; for dependency-grammar-based patterns it additionally includes a step of tree linearization, see, e.g., (Wang and Zhang, 2012).
In the following, we present some example sentence templates used for the generation of multiple-choice tests (items in sans-serif font represent entity placeholders):
• marriage relation: person tied the knot with person. person and person were married.
• parent-child relation: person was raised by parent. parent passed on the family gene to person.
• foundation relation:

A multiple-choice question is generated for every identified relation mention involving two entities, where the questions are rather generic, e.g., "Which one of the following four facts can be inferred from the text?". The respective correct answer is generated by filling the relation instance's arguments into a sentence template associated with the target relation. For the sentence in Example 1, a generated correct answer could be: "Madonna Louise Ciccone and Guy Ritchie were in a wonderful marriage relation".
Wrong answers are generated, on the one hand, by filling the arguments into templates for other target relations. For the parent-child relation this yields "Madonna Louise Ciccone passed on the family gene to Guy Ritchie.". The second way wrong answers are created is by mixing in arguments from other relation instances, e.g., "Madonna Louise Ciccone tied the knot with John Ritchie".
To avoid the generation of answer options which are easy to identify as made up, we only use entities for wrong answers which have at least one relationship with another entity mentioned in the same article. This means that, for the example scenario outlined in Example 1, celebrities who are mentioned only once, e.g., in a list of wedding guests, are not utilized.
As another measure to improve the quality of the wrong answer options, we ensure that the respective entities are mentioned relatively close to one another in the source text. The best case is that they appear in the same sentence from which a relationship is extracted. Consider the following example sentence, from which the instance marriage(Madonna, Ritchie) is extracted:

If Penn was Madonna's temperamental match and boyfriend Carlos Leon, father of Lourdes, her physical ideal, Ritchie, who reportedly calls his new wife 'Madge' in private, is a man who holds his own against his high-powered bride.
Here, the approach would generate the following answer options:
• Madonna and Ritchie had a wedding. (correct)
• Madonna tied the knot with Carlos Leon.
• Carlos Leon passed on the family gene to Ritchie.
• Lourdes was brought up by Penn.
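The two distractor strategies described above (filling the same arguments into templates of a different relation, and mixing in entities from other relation instances in the same article) can be sketched as follows. The template store and function names are hypothetical simplifications of the prototype's RE-pattern-derived templates.

```python
import random

# Hypothetical template store: relation name -> sentence templates with
# numbered argument slots. A simplified stand-in for the RE-pattern-derived
# templates used by the prototype.
TEMPLATES = {
    "marriage":     ["{0} tied the knot with {1}.",
                     "{0} and {1} were married."],
    "parent-child": ["{1} was raised by {0}.",
                     "{0} passed on the family gene to {1}."],
}


def correct_answer(relation, args, rng=random):
    return rng.choice(TEMPLATES[relation]).format(*args)


def wrong_answers(relation, args, other_entities, n=3, rng=random):
    """Strategy (a): same arguments in a template of a different relation.
    Strategy (b): same relation, one argument swapped for another entity
    mentioned in the same article."""
    distractors = []
    for other_rel, templates in TEMPLATES.items():
        if other_rel != relation:
            distractors.append(rng.choice(templates).format(*args))
    for entity in other_entities:
        if entity not in args:
            distractors.append(
                rng.choice(TEMPLATES[relation]).format(args[0], entity))
    rng.shuffle(distractors)
    return distractors[:n]
```

The filter on `other_entities` would, as described above, admit only entities that participate in at least one extracted relation of the same article.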
In order to identify the correct statement, learners need sufficient knowledge of both vocabulary and grammar; they also need to be able to resolve coreference relations between the occurring entities.
Paraphrasing

In order to create both challenging and motivating tests for language learners, the generated statements need to present the user with a large variety of ways to refer to semantic relations, i.e., repetitions should be avoided. We ensure this first of all by employing web-scale RE-pattern sets as a source for the sentence templates, where these sets often contain hundreds of different ways to express a given target relation (see Section 3).
To create even more variation, we also introduce paraphrasing technology into the system, which reorders words in the patterns learned by relation extraction systems and produces a new sentence with the same meaning. For example, a sentence template "wife had a kid from husband" is formed from one of the patterns of the marriage relation. The paraphrasing engine takes this template as input and provides templates with the same words in natural language, e.g., "from husband wife had a kid". Both templates are treated as valid and randomly chosen to create answers.
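A toy reordering paraphraser mirroring the "wife had a kid from husband" example might look as follows. This is an illustrative stand-in only; the actual paraphrasing engine operates on richer linguistic structure, and the preposition list here is an invented assumption.

```python
# Toy reordering paraphraser: fronts a preposition + argument phrase,
# turning "wife had a kid from husband." into "from husband wife had a kid."
PREPOSITIONS = {"from", "with", "to", "by"}  # assumed closed set, for illustration


def front_pp(template):
    """Move the first prepositional phrase to the front, if one is found."""
    words = template.rstrip(".").split()
    for i, word in enumerate(words):
        if word in PREPOSITIONS and i > 0:
            reordered = words[i:] + words[:i]
            return " ".join(reordered) + "."
    return template  # no reorderable phrase found; keep the original
```

Both the original and the reordered template would then be kept as valid variants and chosen at random during answer generation.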

Human supervision
As already noted earlier, employing automatic information-extraction methods has the disadvantage of inevitable noise in the system output. Given that the targeted users of the reading-comprehension exercises are language learners, it is necessary to include a step with human supervision in the data flow of our proposed system. All sentence templates, including those formulated from RE patterns and their variants created by the paraphrasing engine, are verified by teachers. Furthermore, teachers check the extracted relation instances to ensure they are actually mentioned in the text; they then verify the generated answers wrt. grammaticality and adequacy for the context.

Prototype implementation
We have implemented a prototype of our system in order to test the feasibility of our proposed approach and to gather insights on future research directions, by carrying out user studies with it. The system is implemented as a browser-based application.
In principle, the approach can handle arbitrary texts. We tested it on a corpus of 140 English news articles (Krause et al., 2014) and measured the productivity of our approach for automatic question and answer generation. For these first experiments, we used the available gold-standard entity annotation. For the relation-extraction part, we applied the RE patterns of Moro et al. (2013) to automatically extract the relations between the annotated entities in the text. These patterns are based on the dependency-grammar analysis of sentences and were extracted from a large web corpus; hence, they should provide enough variation both for the detection of relation mentions in texts and for the generation of statements about such identified mentions. We used the patterns for three kinship relations in this experiment, namely marriage, parent-child, and siblings. As a means of automatic noise reduction, we work with a combination of training-support-based pattern filters and filters relying on the distribution of relation-relevant word senses in a lexico-semantic resource, as provided by Moro et al. (2013).
The paraphrasing engine of Ai et al. (2014) is used in our system to generate sentence variants for the patterns. Part of this process involves the utilization of the sentence generator by Wang and Zhang (2012), which produces linearizations for the dependency-tree-based patterns.
To reduce a teacher's work in examining the generated exercises, we provide a two-step user interface. In the first step, extracted relation instances for a given text are displayed and require validation by the user, as shown in Figure 2. The teacher can adjust the pattern-filter parameters in order to trade precision for recall, by moving the slider in the UI. Extracted relationships are shown below the text and teachers need to go through each of them, either accepting or declining it.
By choosing different parameter values for the filters, the number of relationships found by the extractor varies. The result of this trade-off for the employed corpus is illustrated in Figure 3. If the teacher tunes the relation-extraction component to its strictest setting, approximately three relation instances per article are found, out of which two are correct. If a user is willing to invest more time into question validation (i.e., the next step), it is possible to get more than twice as many facts, at a lower average accuracy, meaning the teacher needs more time to examine them.
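The mechanism behind the slider can be sketched as a simple threshold filter over the confidence scores attached to each extraction (see Section 2.2). The function and field names below are illustrative assumptions, not the prototype's actual API.

```python
# Sketch of the precision/recall trade-off behind the UI slider: the
# teacher-chosen threshold filters relation instances by the confidence
# score of the extracting pattern. Names are illustrative.
def filter_by_confidence(mentions, threshold):
    """Keep only relation instances whose pattern score reaches the
    teacher-selected threshold: a high threshold yields few but mostly
    correct instances, a low threshold more but noisier ones."""
    return [m for m in mentions if m["confidence"] >= threshold]
```

Moving the slider simply re-runs this filter with a new threshold, so the teacher sees the candidate list grow or shrink immediately.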
In the second step of the teacher sub-workflow, generated questions and answers are presented for verification. When teachers have verified all the questions and answers, they can press the export button to generate reading-comprehension exercises as shown in Figure 5. This is the interface that language learners use to interact with the system, i.e., to access the teacher-approved exercises.
In order to find the correct statements, learners need to firstly understand the semantic relations among the entities expressed in the texts and secondly have sufficient linguistic knowledge to understand the answer candidates, which are paraphrases of the original sentences in the text. The paraphrases are linguistic variants at the word level (e.g., synonyms) or the word-order level (e.g., topicalization). The interface provides feedback to the learners by marking the selected choice with green or red color depending on its correctness. In case a wrong answer is selected, the correct answer is shown to learners in green. If learners need more explanation, they can click the hint button, which highlights the sentence with the relation instance mentioned in the correct answer. Furthermore, the system provides a visualization function which displays a graph with all recognized relations among the entities in the text.

Pilot user study
Figure 5: Example exercise, as generated by the prototype.

Two aspects of the implemented prototype were evaluated in separate tests with human subjects, i.e., the interface for language learners and the interface for teachers. For the tests with learners, our interests were to find out whether:
• . . . the generated multiple-choice exercises fit the learners' expectations, e.g., with respect to user friendliness.
• . . . the questions are of sufficient complexity, i.e., a learner's reading comprehension skills are actually tested.
• . . . the system feedback after a wrong answer does help learners in figuring out the right answer.
For the tests with teachers, our evaluation tries to determine:
• . . . if the prototype provides a user-friendly interface to generate exercises from texts.
• . . . how teachers think about the step-by-step generation of exercises and if the teachers' requirements are met.
• . . . if teachers agree such exercises would help users achieve their language-learning goals.
We set up a field test for users in which we asked them to work with the respective interface in an online version of the prototype and to fill out a provided online questionnaire after the test. The interview questions on usability and acceptability are based on ISO NORM 9241/10, which checks compliance with ergonomic requirements for screen work places, for example self-descriptiveness, controllability, conformity with user expectations, and suitability for individualization. For this pilot study, we had five students act as teachers and language learners and asked them to take the questionnaire.
The following tables list the questions used in the interviews.

Table 1: Questionnaire for learners.
• The interface gives a clear concept of what there is to do. (5-step Likert scale, agreement)
• I can adjust the layout to suit my preferences. (5-step Likert scale, agreement)
• Generally, I feel no challenge in answering the questions. (5-step Likert scale, agreement)
• It is possible to answer the questions without fully understanding the article text, e.g., by concluding the correct answer from certain properties of the text. (5-step Likert scale, agreement)
• I can easily tell apart the correct answer from the wrong ones, without looking at the article text at all. (5-step Likert scale, agreement)
• The "hint" function makes it easier to figure out the correct answer. (5-step Likert scale, agreement)
• What has to be changed? / What did you like? (open)

Table 2: Questionnaire for teachers.
• The interface provides self-explained instructions. (5-step Likert scale, agreement)
• I can easily check the validity of the extracted relationships from the corresponding sentences in the article. (5-step Likert scale, agreement)
• I can easily change answers in the generated questions if I find any of them not proper. (5-step Likert scale, agreement)
• I myself would create similar questions from these articles without this tool. (5-step Likert scale, agreement)
• What are your expectations from such a tool? What would you suggest? (open)
The summarized evaluation results are as follows:
• The language learners find the exercise interface intuitive and suitable for a quick start on the exercises.
• Questions in the exercises are somewhat too easy to answer. Advanced learners are able to infer the correct answer without looking at the article text.
• The "hint" functionality is perceived ambiguously. While all users agree that such a functionality is helpful in principle, only some of the learners think the way it is implemented is helpful.
• Teachers find the multi-step exercise generation confusing; however, after one or two attempts they become familiar with it and can conveniently filter relation instances and modify answers.
• Generally, teachers think this is the kind of exercise they would create based on the given articles.
The results from the learner interviews indicate several problematic aspects of the prototype. Besides a usability issue with insufficient feedback during exercise conduction, an aspect mentioned frequently by the users relates to the complexity of the generated exercises. Since our testers are mainly advanced English learners, the exercises were relatively easy for them to solve. Apart from their English skill level, they also mentioned that answers with incorrect or less plausible gender statements are easy to exclude; for example, an answer like "Guy Ritchie gave birth to Lourdes." is obviously false. We believe that such problems can be fixed by a few relation-specific heuristics, i.e., stricter rules on patterns. Another reason why questions tend to be easy is that the topic of the articles in our test set is celebrity gossip, i.e., an area many people are familiar with; hence, learners could answer questions based only on their prior knowledge, not their understanding of the text.
As for the teacher tool, in the future we will provide clearer instructions so that teachers do not get lost in the process of creating exercises. According to the teachers, although some articles contain a rich amount of relations, they are not a good fit for a reading-comprehension exercise because of other aspects of the text. They also reported that at times none of the suggested answer options was acceptable; to solve this issue, we will add the option to freely edit the provided answer candidates, including the possibility to compose entirely new ones. In sum, the interview feedback from the teachers shows that despite the need for manual supervision, the overall prototype is perceived positively.

Related work
The work presented in this paper is part of a growing body of approaches in computer-assisted language learning (CALL). The methods in this area aim to support (second) language learners through various means, among them methods for error correction (e.g., pronunciation training) or providing them with exercises for practicing existing language skills, while some approaches focus on reducing the workload of language teachers related to preparation and verification of exercises.
An example from the area of text-based CALL is the work of Uitdenbogerd (2014), who presents systems for finding or generating exercise texts of a complexity level appropriate to the learner's current skill level, e.g., by reordering existing text elements wrt. difficulty or by finding texts which use only appropriate vocabulary. Similar work is reported by Sheehan et al. (2014), who classify texts wrt. different metrics (academic vocabulary, syntactic complexity, concreteness, cohesion, among others) in order to identify texts for specific complexity levels.
An area receiving particular focus in the literature is the task of reading comprehension. Typically, language learners are asked to provide a short free-text summary of, e.g., a news article. A teacher then has to manually verify whether the learner was capable of understanding the text and correctly summarized the main content. Some CALL systems support the teacher in this task by automatically scoring the learner's summary wrt. the original article text or compared to a teacher-provided gold-standard summary, see for example (Hahn and Meurers, 2012; Madnani et al., 2013; Horbach et al., 2013; Koleva et al., 2014).
Equally relevant to our work is the approach of Gates (2008), who automatically generated WHquestions for reading-comprehension tests through a transformation of the parse tree of selected sentences from the article text, as well as Riloff and Thelen (2000), who developed a rule-based system for the automatic answering of questions in a reading-comprehension setting.
Our focus is the automatic generation of multiple-choice reading-comprehension exercises. This exercise type is a standard tool for educational tests and has, compared to short-answer summaries, the benefit that, once created, such tests require relatively little work on the teacher's side in order to assess a learner's skill level. At the core of our approach is the application of existing information-extraction approaches, mainly from the sub-area of relation extraction, for the identification of facts in texts which are suitable for checking a learner's understanding of a foreign language. In addition to the work of Moro et al. (2013), which we employed in our prototype implementation, many more relation extraction systems exist that could be utilized in our setting, either from traditional relation extraction (Carlson et al., 2010; Mitchell et al., 2015) or the open-IE paradigm (Fader et al., 2011; Pighin et al., 2014).

Conclusion and outlook
In this paper, we present a semi-automatic approach to the generation of reading-comprehension exercises, which builds on existing strategies from the areas of information extraction and paraphrasing. A user evaluation of a prototype implementation provided some evidence for the feasibility of the approach, although it also showed that the quality, and particularly the difficulty, of the generated questions needs to improve.
For the future, we plan to implement further prototypes which will employ additional relation-extraction and paraphrasing systems, and which will support a broader range of fact types. Furthermore, we want to enlarge the lexical and syntactic variability of the generated answers and would like to reduce the amount of required teacher supervision, in order to make the approach better suited for real-world applications.
Another line of future work could focus on injecting more indirectness into the question-answer generation, i.e., the system should not only ask for facts explicitly referenced in the text but should also check a language learner's inference capabilities, which require a deeper understanding of language than fact finding. A possible way to implement this may be the integration of textual-entailment methods. For example, the system might ask about a particular parent-child relation not directly mentioned in the text, which the system could infer from a mentioned siblings relation and another (different) instance of the parent-child relation. This can help to generate exercises testing for more sophisticated reading-comprehension capabilities of language learners.