Automatic Distractor Generation for Multiple Choice Questions in Standard Tests

To assess the knowledge proficiency of a learner, the multiple choice question is an efficient and widespread form in standard tests. However, composing a multiple choice question, especially constructing its distractors, is quite challenging. The distractors are required to be both incorrect and plausible enough to confuse learners who have not mastered the knowledge. Currently, distractors are written by domain experts, which is both expensive and time-consuming. This motivates automatic distractor generation, which can benefit various standard tests in a wide range of domains. In this paper, we propose a question and answer guided distractor generation (EDGE) framework to automate distractor generation. EDGE consists of three major modules: (1) the Reforming Question Module and the Reforming Passage Module apply gate layers to guarantee the inherent incorrectness of the generated distractors; (2) the Distractor Generator Module applies an attention mechanism to control the level of plausibility. Experimental results on a large-scale public dataset demonstrate that our model significantly outperforms existing models and achieves a new state of the art.


Introduction
Standard tests, such as TOEFL and SAT, are an efficient and essential tool to assess the knowledge proficiency of a learner (Ch and Saha, 2018). According to the testing results, teachers or ITS (Intelligent Tutoring System) services can develop personalized study plans for different students. When organizing a standard test, a vital issue is to select a suitable question form. Among various question forms, the multiple choice question (MCQ) is widely adopted in many notable tests, such as GRE, TOEFL and SAT. MCQs have many advantages, including shorter testing time, more objective scoring, and ease of grading (Ch and Saha, 2018). A typical MCQ consists of a stem and several candidate answers, among which one is correct and the rest are distractors. As shown in Figure 1, in addition to a stem, some tests also include a long reading passage to provide the context of the MCQ.
The quality of an MCQ depends heavily on the quality of its distractors. If the distractors cannot confuse students, the correct answer can be identified easily. As a result, the discrimination of the question degrades, and the test loses its ability to assess.
However, it is challenging to design useful and qualified distractors. Rather than being a trivially wrong answer, a distractor should be plausible enough to confuse learners who have not mastered the knowledge (Liang et al., 2018; Qiu et al., 2019). A good distractor should be grammatically correct given the question and semantically consistent with the passage context of the question (Gao et al., 2018). Meanwhile, the question composers need to enhance the plausibility of the distractor without hurting its inherent incorrectness. Otherwise, the distractor easily becomes an obviously wrong answer, which makes the question sloppy. Hence, the manual preparation of distractors is time-consuming and costly (Ha and Yaneva, 2018). Automatically generating useful distractors is therefore an urgent issue; it can alleviate the workload of question composers and relax the requirements on their experience. It could also help to prepare a large training set to boost machine reading comprehension (MRC) systems (Yang et al., 2017). In this paper, we focus on automatically generating semantic-rich distractors for MCQs in real-world standard tests, such as RACE (Lai et al., 2017), which is collected from the English exams for Chinese students from grades 7 to 12.

Passage: ... Held on a farm, the Glastonbury Festival is the most well-known and popular festival in the UK. It began in 1970 and the first festival was attended by one thousand five hundred people each paying an admission price of £1 ... Since then the Glastonbury Festival has gone from strength to strength - in 2004 one hundred and fifty thousand fans attended, paying £112 each for a ticket to the three-day event. Tickets for the event sold out within three hours. ... Glastonbury is not unique in using live music to raise money to fight global poverty. ...

Existing works have made some attempts at generating short distractors (Stasaski and Hearst, 2017; Guo et al., 2016). These approaches formulate distractor generation as a similar word selection task. They leverage a pre-defined ontology or word embeddings to find words or entities similar to the correct answer and use them as the generated distractors. These word-level or entity-level methods can only generate short distractors and do not apply to the semantic-rich, long distractors of RACE-like MCQs. Recently, generating longer distractors has been explored in a few studies (Zhou et al., 2019; Gao et al., 2018). For example, Gao et al. (2018) propose a sequence-to-sequence based model, which leverages the attention mechanism to automatically generate distractors from the reading passages. However, these methods mainly focus on the relation between the distractor and the passage, and fail to comprehensively model the interactions among the passage, the question and the correct answer, which help to ensure the incorrectness of the generated distractors.
To better generate useful distractors, we propose a novel quEstion and answer guided Distractor GEneration (EDGE) framework. More specifically, given the passage, the question and the correct answer, we first leverage a contextual encoder to generate the semantic representations of all text materials. Then we use the attention mechanism to enrich the context of the question and the correct answer. Next, we break down the distractor's usefulness into two aspects: incorrectness and plausibility. Incorrectness is the inherent attribute of the distractor, while plausibility refers to its ability to confuse the students. We introduce two modules that leverage a gate layer to guarantee the incorrectness: the Reforming Question Module and the Reforming Passage Module. We further leverage an attention-based distractor generator, together with the two reforming modules, to guarantee the plausibility. Finally, in the generation stage, we use beam search to generate several diverse distractors by controlling the distances between them. We conduct experiments on a large-scale public distractor generation dataset prepared from RACE. The experimental results demonstrate the effectiveness of our proposed framework. Moreover, our method achieves a new state-of-the-art result in the distractor generation task.

Related work
In recent years, some research efforts have been devoted to distractor generation. Generally, the related work can be classified into the following three categories.
Feature-based methods. Considerable research efforts (Liang et al., 2018; Sakaguchi et al., 2013; Araki et al., 2016) have been devoted to using manually designed features, such as POS features and statistical features, to generate distractors. However, designing effective features is labor intensive and hard to scale to various domains. In contrast, our work is an end-to-end framework without manually designed features.
Similar word/entity-based methods. Some works (Stasaski and Hearst, 2017; Guo et al., 2016; Kumar et al., 2015; Afzal and Mitkov, 2014) focused on finding answer-relevant ontologies or words as the distractors with the help of WordNet and Word2Vec. For example, Stasaski and Hearst (2017) leveraged an educational Biology ontology to conduct the distractor generation. However, some of these works depend heavily on a well-designed ontology, and they can only generate short distractors, which usually contain a single word or phrase.
NN-based methods. Recently, neural network based, data-driven solutions have emerged. Gao et al. (2018) proposed an end-to-end solution focusing on distractor generation for MCQs in standard English tests. They employed a hierarchical encoder-decoder network as the base model and used a dynamic attention mechanism to generate long distractors. Zhou et al. (2019) further strengthened the interaction between the question and the passage based on the model of Gao et al. (2018).
Framework Description

Problem Definition
In this paper, we focus on automatic distractor generation for MCQs (see Figure 1). Let $P = \{w^p_t\}_{t=1}^{L_p}$ denote the reading passage, which consists of $L_p$ words. Let $Q = \{w^q_t\}_{t=1}^{L_q}$ and $A = \{w^a_t\}_{t=1}^{L_a}$ denote the question and its correct answer, respectively. $L_q$ and $L_a$ denote the lengths of the question and the answer, respectively. Note that the answer may not be a span of the passage $P$.
Problem Definition. Formally, given the reading passage $P$, the question $Q$ and its correct answer $A$ as inputs, an EDGE model $M$ aims to generate a distractor $D = \{w^d_t\}_{t=1}^{L_d}$ about the question, which is defined as finding the best distractor $\bar{D}$ that maximizes the conditional likelihood given $P$, $Q$, and $A$:

$$\bar{D} = \arg\max_{D} \Pr(D \mid P, Q, A)$$

Framework Overview
Inspired by existing question generation works (Duan et al., 2017; Du and Cardie, 2017; Zhou et al., 2017; Kim et al., 2018), we employ a sequence-to-sequence based network to generate the distractors. As shown in Figure 2, our overall framework contains five components. First, we employ the encoding module to extract the contextual semantic representations of all materials. Then, we use the attention mechanism to enrich the semantic representations of the question and its answer. Finally, we design three key components to generate useful distractors.
As mentioned above, the quality of the generated distractors is guaranteed from two aspects:
• Incorrectness: Both the passage and the question contain parts strongly relevant to the answer, which may mislead the decoder into outputting words contained in the answer and thus hurt the inherent incorrectness of the generated distractors. To guarantee the incorrectness, in our proposed framework, we reform the passage and the question by erasing their answer-relevant information before they are fed into the decoder. Based on the gate mechanism, the two reforming modules highlight the distractor-relevant words and constrain the answer-relevant words by measuring the distances between the words and the correct answer.
• Plausibility: To look reasonable, the distractor should first be grammatically and semantically consistent with the question. Otherwise, after reading the question, the students can trivially exclude it. Furthermore, to hinder students from excluding the distractor only by reading the passage, it should also be semantically relevant to the passage. To guarantee the plausibility, in our proposed framework, the distractor generator uses the semantic representation of the reformed question to initialize the generation process and leverages the attention mechanism to obtain the context representation from the reformed passage to guide the output.
We will address each component in detail in the following subsections.

Encoding and Enriching Module
In the encoding module, given a passage $P = \{w^p_t\}_{t=1}^{L_p}$, a question $Q = \{w^q_t\}_{t=1}^{L_q}$ and its answer $A = \{w^a_t\}_{t=1}^{L_a}$, we first convert every word $w$ to its $d$-dimensional vector $e$ via an embedding matrix.
Next, we introduce the attention layer and the fusion layer to enrich the semantic representations of the question and its answer by fusing the passage information. In the enriching module, we adopt the scaled dot product attention mechanism and the fusion kernel used in recent works (Chen et al., 2017; Mou et al., 2016) for better semantic understanding.
where $W_f \in \mathbb{R}^{4d \times d}$ and $b_f \in \mathbb{R}^{d}$ are the parameters to learn. $\circ$ and $-$ denote the element-wise multiplication and subtraction between two matrices, respectively. $M_q \in \mathbb{R}^{L_q \times L_p}$ denotes the attention weight matrix.
For the answer A, we obtain its enriched representation via the same attention and fusion process.
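As a rough illustration of the enriching step described above, the following sketch computes scaled dot-product attention weights between question and passage vectors and then applies a fusion kernel over the concatenation [X; Y; X∘Y; X−Y]. The tanh activation, the helper names (`matmul`, `attn`, `fuse`) and the toy values are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attn(q, p):
    """Scaled dot-product attention weights of shape (L_q, L_p)."""
    d = len(q[0])
    p_t = [list(col) for col in zip(*p)]  # transpose of the passage matrix
    scores = matmul(q, p_t)
    return [softmax([s / math.sqrt(d) for s in row]) for row in scores]

def fuse(x, y, w_f, b_f):
    """Assumed fusion kernel: tanh([x; y; x*y; x-y] W_f + b_f), row by row."""
    fused = []
    for xr, yr in zip(x, y):
        feats = (xr + yr
                 + [a * b for a, b in zip(xr, yr)]
                 + [a - b for a, b in zip(xr, yr)])  # length 4d
        fused.append([
            math.tanh(sum(f * w_f[k][j] for k, f in enumerate(feats)) + b_f[j])
            for j in range(len(b_f))
        ])
    return fused

# Toy example with d = 2, L_q = 2, L_p = 3 (hypothetical values).
d = 2
Q = [[1.0, 0.0], [0.0, 1.0]]
P = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
M_q = attn(Q, P)                       # attention weights, 2 x 3
Q_att = matmul(M_q, P)                 # passage-aware question vectors, 2 x 2
W_f = [[0.1] * d for _ in range(4 * d)]
b_f = [0.0] * d
Q_enriched = fuse(Q, Q_att, W_f, b_f)  # enriched question, 2 x 2
```

The same two functions could be applied to the answer vectors in place of `Q` to obtain the enriched answer.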

Reforming Question Module
As mentioned above, to retain the inherent incorrectness of the distractors, we need to reform the question by constraining its answer-relevant parts. In this module, we first evaluate the semantic distance between the answer and each word of the question. Then, the distances are used as weights to differentiate the useful words from the words strongly relevant to the correct answer. The reforming process is conducted in the following two steps.
Self-Attend Layer. First, we use this layer to obtain the sentence-level representation $v_a \in \mathbb{R}^{1 \times d}$ of the answer.
where $W_a \in \mathbb{R}^{d \times 1}$ is a trainable parameter and $r \in \mathbb{R}^{L_a \times 1}$ denotes the weight vector.
Gate Layer. In this layer, we first use a bilinear layer to measure the distance between each word and the answer. Then, the distance information is used as gate values to reform each word and obtain the reformed question $\tilde{Q}$.
where $W^q_g$ is a trainable bilinear projection matrix and $b^q_g \in \mathbb{R}$ is also a parameter to learn. $\delta_i \in \mathbb{R}$ is the semantic distance between the correct answer and the $i$-th word of the question. $Q_i$ and $\tilde{Q}_i$ denote the original representation and the reformed representation of the $i$-th word in the question, respectively.
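The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the answer is pooled with softmax weights computed from a learned vector, the bilinear score is squashed with a sigmoid to obtain a gate value in (0, 1), and each question word is scaled by its gate. The sigmoid, the function names (`self_attend`, `gate`) and the toy parameters are hypothetical choices, not details confirmed by the paper.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attend(a, w_a):
    """Pool answer word vectors a (L_a x d) into one sentence vector (length d)."""
    scores = [sum(x * w for x, w in zip(row, w_a)) for row in a]
    r = softmax(scores)  # one weight per answer word
    d = len(a[0])
    return [sum(r[i] * a[i][j] for i in range(len(a))) for j in range(d)]

def gate(q, v_a, w_g, b_g):
    """Return per-word gate values and the gated (reformed) question vectors."""
    # v_a W_g is shared by all question words, so compute it once.
    proj = [sum(v_a[k] * w_g[k][j] for k in range(len(v_a)))
            for j in range(len(w_g[0]))]
    deltas = []
    for row in q:
        score = sum(p * x for p, x in zip(proj, row)) + b_g
        deltas.append(1.0 / (1.0 + math.exp(-score)))  # assumed sigmoid gate
    reformed = [[delta * x for x in row] for delta, row in zip(deltas, q)]
    return deltas, reformed

# Toy example (hypothetical values): the first question word matches the answer.
A = [[1.0, 0.0], [1.0, 0.0]]
W_a = [0.5, 0.5]
Q = [[1.0, 0.0], [0.0, 1.0]]
W_g = [[-2.0, 0.0], [0.0, -2.0]]  # negative weights: similar words get low gates
b_g = 0.0

v_a = self_attend(A, W_a)
deltas, Q_reformed = gate(Q, v_a, W_g, b_g)
```

In this toy setting the word that coincides with the answer direction receives the smaller gate value, which mirrors the intent of suppressing answer-relevant words; in a trained model the sign and scale of the bilinear weights would be learned.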

Reforming Passage Module
Likewise, the passage also needs to be reformed to erase the impact of the answer. However, there are some differences between the architectures of the two reforming modules:
• Some parts of the passage may belong to other questions but share some words with the answer. These shared words may obtain low gate values, which further reduces their contribution to the generation. Hence, before reforming the passage, we first fuse the question information into the answer so that the answer only affects the words related to its own question.
• To further strengthen the relationship between the generated distractor and the question, we integrate the question information into the reformed passage to further highlight the question-relevant sentences.
The reformation process consists of the following four steps.
A-Q Attention Layer. We use Attn(·, ·) and Fuse(·, ·) to fuse the question information into the answer. Note that we use the original question rather than the reformed question because the former contains complete answer-relevant information.
As in the Reforming Question Module, we first use SelfAlign(·) to obtain the answer's sentence-level representation $v_a = \mathrm{SelfAlign}(\hat{A}) \in \mathbb{R}^{1 \times d}$. We then use another gate layer to obtain the reformed passage $\dot{P}$.
where $P_i$ and $\dot{P}_i$ denote the original and reformed representations of the $i$-th passage word, respectively.
P-Q Attention Layer. This layer uses the attention mechanism to fuse the question information as mentioned above: $\tilde{P} = \mathrm{Fuse}(\dot{P}, \bar{P})$, where $\bar{P} = \mathrm{Attn}(\dot{P}, Q)\,Q$.
Re-encoding Layer. We re-extract the contextual representation of the reformed passage $\tilde{P}$ with another Bi-LSTM. The final semantic representation of the passage is denoted as $\hat{P} \in \mathbb{R}^{L_p \times d}$.

Question Initializer
As mentioned above, the generated distractor should be grammatically and semantically consistent with the question. Inspired by Gao et al. (2018), we use the question information to initialize the decoding process to enhance the semantic relevance between the distractor and the question. Specifically, we first use a Bi-LSTM (Hochreiter and Schmidhuber, 1997) to re-encode the reformed question $\tilde{Q}$.
where $\overrightarrow{h}^q_i$ is the hidden state of the forward LSTM at time $i$. We then concatenate the last hidden states of the two directions as $h^q \in \mathbb{R}^{d}$, which is then projected to obtain the initial state of the decoder $h_0$.
where $W_p \in \mathbb{R}^{d \times d}$ and $b_p \in \mathbb{R}^{d}$ are learnable parameters.

Distractor Generator
At the decoder side, we adopt an attention-based LSTM layer. Specifically, at the first step, we use $h_0$, obtained from the reformed question, as the initial state and use the mean pooling vector of the reformed passage as the context vector $c_0 \in \mathbb{R}^{d}$. The first word is set to the special token [EOS].
Next, for each decoding step $t$, we use the attention mechanism to attend to the most relevant words in the reading passage to form the context vector.
where $W_h \in \mathbb{R}^{d \times d}$, which projects the hidden state to the passage context, is a parameter to learn. $e_{t-1}$ denotes the embedding of the word at step $t-1$. Moreover, at each step, we concatenate $h_t$ and $c_t$ and use an MLP layer to predict the word probability distribution, where the parameters of the MLP are learnable. $H_V$ denotes the probabilities of all words in the vocabulary, in which the word with the maximum probability is the predicted word at step $t$.
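To make the attention step of the decoder more concrete, here is a small sketch of how a context vector could be formed from the passage representation at one decoding step. It assumes a bilinear score between the projected hidden state and each passage word followed by a softmax; the exact scoring function, the function names (`attention_context`, `mean_pool`) and the toy values are assumptions for illustration only.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(passage):
    """Average the passage word vectors, as used for the initial context c_0."""
    n = len(passage)
    return [sum(col) / n for col in zip(*passage)]

def attention_context(h_t, passage, w_h):
    """Attend over passage words with the projected decoder state h_t W_h."""
    proj = [sum(h_t[k] * w_h[k][j] for k in range(len(h_t)))
            for j in range(len(w_h[0]))]
    scores = [sum(p * x for p, x in zip(proj, word)) for word in passage]
    alpha = softmax(scores)  # attention weight of each passage word
    d = len(passage[0])
    c_t = [sum(alpha[i] * passage[i][j] for i in range(len(passage)))
           for j in range(d)]
    return alpha, c_t

# Toy example (hypothetical values): three passage words, d = 2.
passage = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
W_h = [[1.0, 0.0], [0.0, 1.0]]
c_0 = mean_pool(passage)
alpha, c_t = attention_context([2.0, 0.0], passage, W_h)
```

Because the gate layer has already scaled down answer-relevant passage words, those words would tend to contribute less to `c_t` in such a scheme.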

Training and Inference
We train the model by minimizing the regular cross-entropy loss:

$$\mathcal{L} = -\sum_{D \in \mathcal{D}} \sum_{t=1}^{L_d} \log \Pr(w^d_t \mid P, A, Q, w^d_{<t}; \theta)$$

where $\mathcal{D}$ is the training corpus in which each data sample contains a distractor $D$, a passage $P$, a question $Q$ and an answer $A$. $w^d_t$ is the word at the $t$-th position of the distractor $D$. $\Pr(w^d_t \mid P, A, Q, w^d_{<t}; \theta)$ is the predicted probability of $w^d_t$ and can be calculated by Eq. (2). $\theta$ denotes all trainable parameters in EDGE.
During the inference phase, because an MCQ has several diverse distractors, we use a beam search of width $n$ and obtain $n$ candidate distractors with decreasing likelihood. Following Gao et al. (2018), we use the Jaccard distance to select the final diverse distractors from the beam search results. Specifically, we first select the candidate distractor with the maximum likelihood from the search results as $D^g_1$. The second one, $D^g_2$, should have a Jaccard distance larger than 0.5 to $D^g_1$. Likewise, we select the third one, $D^g_3$, which has a restricted distance to both $D^g_1$ and $D^g_2$.
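The selection procedure described above can be sketched as a greedy filter over the beam search output. The sketch assumes that the Jaccard distance is computed over whitespace-separated tokens and that the same 0.5 threshold applies to every selected distractor; both are assumptions, as are the function names and the example candidates.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over whitespace-separated tokens (assumed tokenization)."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)

def select_diverse(candidates, k=3, min_distance=0.5):
    """Greedily keep candidates (sorted by decreasing likelihood) that are
    farther than min_distance from every distractor already selected."""
    selected = []
    for cand in candidates:
        if all(jaccard_distance(cand, s) > min_distance for s in selected):
            selected.append(cand)
        if len(selected) == k:
            break
    return selected

# Hypothetical beam output, ordered from most to least likely.
beam = [
    "the festival began in 1970",
    "the festival began in 1971",
    "tickets sold out within three hours",
    "tickets sold out within two hours",
    "it raises money to fight poverty",
]
chosen = select_diverse(beam)
```

With these candidates, the second and fourth entries are near-duplicates of earlier selections and are skipped, so `chosen` contains the first, third and fifth entries.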

Experiments
Experiment Setup

Dataset
For a fair comparison, we use the distractor generation dataset released by Gao et al. (2018) as our benchmark. This dataset is constructed based on RACE (Lai et al., 2017), which is collected from English exams and widely used in the MRC field. More details about the dataset construction process can be found in (Gao et al., 2018). The train/validation/test sets contain 96,501/12,089/12,284 examples, respectively.
The left sub-figure of Figure 3 shows the count statistics of this dataset. We can see that most passages are related to more than two questions. A question is usually associated with multiple distractors, which makes beam search necessary in the testing phase. The right sub-figure shows the distributions of the lengths of the passages, questions, answers, and distractors. The distractors and the answers have similar lengths, and the questions are slightly longer than them. Meanwhile, we can see that the median of the distractor lengths is larger than 8. This suggests that the similar word/entity-based methods do not apply to this dataset.

Model Details
We use GloVe.840B.300d (Pennington et al., 2014) as the pre-trained word embeddings (i.e., d = 300), and the word representations are shared across the different components of EDGE. In the encoding module, we choose a Bi-LSTM as the contextual encoder, and the size of the hidden unit is set to 300 (150 for each direction). Please note that the Bi-LSTM encoder is a plug-in module that can easily be replaced by Transformer (Vaswani et al., 2017), BERT (Devlin et al., 2019) or XLNet (Yang et al., 2019). The parameters of the Bi-LSTM are shared among the encoding module and the two reforming modules.
According to the 95th percentile values shown in Figure 3, we set the maximum lengths of passages, questions, answers, and distractors to 500, 17, 15, and 15, respectively.
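As an illustration of how such length limits could be derived and applied, the sketch below computes a nearest-rank 95th percentile over a list of token counts and truncates a token sequence to that cap. The nearest-rank definition, the function names and the toy data are assumptions; the paper does not state which percentile definition was used.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct in (0, 100]."""
    ordered = sorted(values)
    # Ceiling of pct * n / 100 using integer arithmetic, at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]

def truncate(tokens, max_len):
    """Keep at most max_len leading tokens."""
    return tokens[:max_len]

# Toy length statistics (hypothetical): 100 sequences of lengths 1..100.
lengths = list(range(1, 101))
cap = percentile_nearest_rank(lengths, 95)
short = truncate("which of the following is true about the festival".split(), 5)
```

Capping at a high percentile keeps most samples intact while bounding the memory used by the few very long ones.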
The model is trained with a mini-batch size of 64. We use the Nesterov Accelerated Gradient (NAG) optimizer (Nesterov, 1983) with a learning rate of 0.005. The dropout rate is set to 0.1 to reduce overfitting. The beam size n is set to 50.

Baseline Approaches and Metrics
The following models are selected as baselines:
• Basic models: the basic sequence-to-sequence framework and its variants, including (1) SS (Seq2Seq): the basic model that generates a distractor from the passage; (2) SEQ (SS + Enriching Module + Question Initializer): the sequence-to-sequence model with the enriching module and a question-initialized decoder, in which the initial state is set to the output of the question initializer; and (3) SEQA (SEQ + Attention): the sequence-to-sequence model with the same decoder as the distractor generator of EDGE.
• HRED (HieRarchical Encoder-Decoder): the basic framework of Gao et al. (2018), which also contains the question initializer and the attention mechanism.
• HSA (HRED + static attention) (Gao et al., 2018): uses HRED as the basic architecture and leverages two attention strategies to combine the information of the passage, question, and answer.
• CHN (Zhou et al., 2019): extends HSA with a co-attention mechanism to further strengthen the interaction between the passage and the question. This model previously achieved state-of-the-art performance on this task.
All hyper-parameters of EDGE and other baselines are selected on the validation set based on the lowest perplexity and the results are reported on the test set.

Performance Comparison
The experimental results of all models are summarized in Table 1. Since the dataset in (Gao et al., 2018) is slightly different from the public dataset on GitHub, we not only report the results of HSA from its original paper (Gao et al., 2018) but also include the reimplementation results of (Gao et al., 2018) from (Zhou et al., 2019) on the public dataset. There are several observations. Firstly, the proposed model, EDGE, outperforms all baselines significantly in all metrics and achieves new state-of-the-art scores on this distractor generation dataset. Secondly, SEQ and SEQA outperform the basic Seq2Seq, which indicates that both the question information and the passage information are vital to distractor generation. Thirdly, SEQA outperforms HRED, which indicates that the co-attention between the question and the passage in the enriching module can help to generate better distractors. The observation that CHN outperforms HSA also supports the effectiveness of the co-attention mechanism. Finally, the basic Seq2Seq performs far worse than the other models, which indicates that distractor generation is a challenging task that is hard to solve with simple models alone.

Ablation Analysis
Table 2 shows the experimental results of the ablation study. We can see that removing the Reforming Passage Module or the Reforming Question Module leads to suboptimal results. This validates the effectiveness of the two reforming mechanisms. Moreover, we find that the former module is more important for the overall distractor generation model. This is probably because the reformed passage has a higher impact on the decoding process. In particular, in each decoding step, the context information from the passage can provide more clues for generating proper distractors than the context information from the question.
We can also observe that the question initializer brings a performance gain. This verifies the hypothesis in Section 3.6 that the initial decoder state encoded from the question helps to generate distractors that are grammatically and semantically consistent with the question. Removing the encoding or enriching module also results in a performance drop. This indicates that extracting contextual representations is important for the generation task. Moreover, the enriching module can further improve the performance by fusing the passage information into the contextual representation of the question.

Human Evaluation
We conduct a human evaluation to assess the quality of the distractors generated by different models. We use three metrics designed by Zhou et al. (2019): (1) Fluency, which evaluates whether the distractor follows regular English grammar and conforms with human logic and common sense; (2) Coherence, which measures whether the key phrases in the distractors are relevant to the passage and the question; (3) Distracting Ability, which evaluates how likely a generated distractor is to be used by question composers in real examinations. We choose the first 100 samples of the test set and the corresponding distractors generated by three models as the input. We employ five annotators with a good English background (holding at least a bachelor's degree) to score these distractors on three levels (i.e., Good, Fair or Bad) for each of the three metrics; the scores are then projected to the range 0 - 10.
The results of all models, averaged over all generated distractors, are shown in Table 3. We find that our model performs best on all three metrics. This suggests that our model is able to generate plausible and useful distractors. This conclusion also aligns with the experimental results of the automatic metrics in Section 4.3.

Case Study
The design of the two reforming modules and the generator enables convenient interpretation of the generated distractors. Take the MCQ in Figure 4 as an example: the blue highlighted sentence in the paragraph includes the clue to infer the correct answer. EDGE manages to block this clue by assigning it a lower attention weight with the help of the gate layer. In this manner, the clue of the correct answer is prevented from contributing to the generated distractor. Meanwhile, all the sentences related to the three distractors (colored red and orange) obtain higher gate values, which further helps them to achieve higher attention weights (highlighted pink in the left part). In particular, when generating the first distractor, the red sentence has the highest attention score, indicating that it makes the largest contribution to the generation. In summary, the visualization results demonstrate that EDGE provides a good way to interpret the key information behind a generated distractor.

Conclusion and Future Work
In this paper, we propose a novel quEstion and answer guided Distractor GEneration (EDGE) framework to automatically generate distractors for multiple choice questions in standard English tests. In EDGE, we design two modules based on attention and gate strategies to reform the passage and the question, which are then combined to decode the distractor. Experimental results on a large-scale public dataset demonstrate the state-of-the-art performance of EDGE and the effectiveness of the two reforming modules.
In future work, we will explore two potential directions. First, since beam search ignores generation diversity, we will explore how to incorporate the information of previously generated distractors to guide the generation of subsequent distractors. Second, we will work on generating distractors that require multi-sentence/hop reasoning, which can further improve the plausibility.

Figure 1: Two examples of multiple choice questions in RACE. Blue choices are the correct answers.
Figure 2: The EDGE framework.

Figure 3: The statistics of the evaluation dataset (Left). The word count distribution (Right). The maximum values shown are the 95th percentiles.

Figure 4: A sample question with the truth distractors and generated distractors. The left sub-figure shows the attention weights in the generator and the gate values in the reforming passage module when decoding the first generated distractor. To enable comparisons among different sentences, we average the attention and gate values of all words in each sentence. Colored sentences are the clues of four options.

Table 1: The performance comparison results. The best results are highlighted in bold. HSA* denotes the result reported in (Gao et al., 2018) and HSA denotes the result reported in (Zhou et al., 2019).

Table 2: The ablation study results. We average the BLEU-4 and ROUGE-L over all three generated distractors. Higher scores indicate better performance.

Table 3: Results of human evaluation. Higher scores indicate better performance.