Distractor Generation for Multiple Choice Questions Using Learning to Rank

We investigate how machine learning models, specifically ranking models, can be used to select useful distractors for multiple choice questions. Our proposed models can learn to select distractors that resemble those in actual exam questions, which is different from most existing unsupervised ontology-based and similarity-based methods. We empirically study feature-based and neural net (NN) based ranking models with experiments on the recently released SciQ dataset and our MCQL dataset. Experimental results show that feature-based ensemble learning methods (random forest and LambdaMART) outperform both the NN-based method and unsupervised baselines. These two datasets can also be used as benchmarks for distractor generation.


Introduction
Multiple choice questions (MCQs) are widely used to assess students' knowledge and skills. An MCQ consists of three elements: (i) the stem, the question sentence; (ii) the key, the correct answer; and (iii) the distractors, alternative answers used to distract students from the correct answer. Of all the steps in creating good MCQs, finding reasonable distractors is crucial and usually the most time-consuming. We investigate automatic distractor generation (DG), i.e., generating distractors given the stem and the key of a question. We focus on the case where distractors are not limited to single words and can be phrases and sentences.
Rather than generate trivial wrong answers, the goal of DG is to generate plausible false answers, i.e., good distractors. Specifically, a "good" distractor should be at least semantically related to the key (Goodrich, 1977), grammatically correct given the stem, and consistent with the semantic context of the stem. Taking these criteria into consideration, most existing methods for DG are based on various similarity measures. These include WordNet-based metrics (Mitkov and Ha, 2003), embedding-based similarities (Guo et al., 2016; Kumar et al., 2015; Jiang and Lee, 2017), n-gram co-occurrence likelihood (Hill and Simha, 2016), phonetic and morphological similarities (Pino and Eskenazi, 2009), structural similarities in an ontology (Stasaski and Hearst, 2017), a thesaurus (Sumita et al., 2005), context similarity (Pino et al., 2008), context-sensitive inference (Zesch and Melamud, 2014), and syntactic similarity (Chen et al., 2006). Distractors are then selected from a candidate distractor set based on a weighted combination of similarities, where the weights are determined by heuristics.
In contrast to the above-mentioned similarity-based methods, we apply learning-based ranking models to select distractors that resemble those in actual exam MCQs. Specifically, we propose two types of models for DG: feature-based and NN-based models. Our models are able to take existing heuristics as features and learn from these questions a function beyond a simple linear combination. Learning to generate distractors has been previously explored in a few studies. Given a blanked question, Sakaguchi et al. (2013) use a discriminative model to predict distractors and Liang et al. (2017) apply generative adversarial nets. They view DG as a multi-class classification problem and use answers as output labels, while we use them as input. Other related work (Welbl et al., 2017) uses a random forest. However, because only binary classification metrics are reported, the quality of the top generated distractors is not quantitatively evaluated. Here we conduct a more comprehensive study of various learning models and devise ranking evaluation metrics for DG.
Learning a robust model usually requires large-scale training data. However, to the best of our knowledge, there is no benchmark dataset for DG, which makes it difficult to directly compare methods. Prior methods were evaluated on different question sets collected from textbooks (Agarwal and Mannem, 2011), Wikipedia (Liang et al., 2017), ESL corpora (Sakaguchi et al., 2013), etc. We propose to evaluate DG methods with two datasets: the recently released SciQ dataset (Welbl et al., 2017) (13.7K MCQs) and the MCQL dataset (7.1K MCQs) that we constructed. These two datasets can be used as benchmarks for training and testing DG models. Our experimental results show that feature-based ensemble learning methods (random forest and LambdaMART) outperform both the NN-based method and unsupervised baselines for DG.

Learning to Rank for Distractor Generation
We solve DG as the following ranking problem:
Problem. Given a candidate distractor set D and a set of MCQs, where the i-th question has a stem q_i, a key a_i, and a set D_i ⊆ D of distractors associated with q_i and a_i, find a point-wise ranking function r: (q_i, a_i, d) → [0, 1] for d ∈ D, such that distractors in D_i are ranked higher than those in D − D_i.
This problem formulation is similar to "learning to rank" (Liu et al., 2009) in information retrieval. To learn the ranking function, we investigate two types of models: feature-based models and NN-based models.

Feature Description
Given a tuple (q, a, d), a feature-based model first transforms it to a feature vector φ(q, a, d) ∈ R^d with the function φ. We design the following features for DG, resulting in a 26-dimensional feature vector:
• Emb Sim. Embedding similarity between q and d and the similarity between a and d. We use the average GloVe embedding (Pennington et al., 2014) as the sentence embedding. Embeddings have been shown to be effective for finding semantically similar distractors (Kumar et al., 2015; Guo et al., 2016).
• POS Sim. Jaccard similarity between a and d's POS tags. The intuition is that distractors might also be noun phrases if the key is a noun phrase.
• ED. Edit distance between a and d. This measures the spelling similarity and is useful for cases such as selecting "RNA" as a distractor for "DNA".
• Token Sim. Jaccard similarities between q and d's tokens, a and d's tokens, and q and a's tokens. This feature is motivated by the observation that distractors might share tokens with the key.
• Length. a and d's character and token lengths and the difference of lengths. This feature is designed to explore whether distractors and the key are similar in terms of lengths.
• Suffix. The absolute and relative length of a and d's longest common suffix. The key and distractors often have common suffixes. For example, "maltose", "lactose", and "sucrose" could be good distractors for "fructose".
• Freq. Average word frequency in a and d. Word frequency has been used as a proxy for words' difficulty levels (Coniam, 1997). This feature is designed to select distractors with a similar difficulty level as the key.
• Single. Singular/plural consistency of a and d. This checks the consistency of singular vs. plural usage, which helps select distractors that are grammatically correct given the stem.
• Num. Whether numbers appear in a and d. This feature will cover cases where distractors and keys contain numbers, such as "90 degree", "one year", "2018", etc.
• Wiki Sim. If a and d are Wikipedia entities, we calculate their Wiki embedding similarity. The embedding is trained using word2vec (Mikolov et al., 2013) on Wikipedia data with each Wiki entity treated as an individual token. This feature is a complement to Emb Sim where sentence embedding is a simple average of word embeddings.
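A few of the string-level features above (Token Sim, ED, Suffix, and Length) can be sketched in plain Python. This is only an illustration under stated assumptions: whitespace tokenization, normalizing the relative suffix length by the longer of the two strings, and the function and dictionary key names are all hypothetical choices that the feature descriptions do not specify.

```python
def jaccard(xs, ys):
    """Jaccard similarity between two collections, treated as sets."""
    a, b = set(xs), set(ys)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def edit_distance(s, t):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def common_suffix_len(s, t):
    """Length of the longest common suffix of two strings."""
    n = 0
    for cs, ct in zip(reversed(s), reversed(t)):
        if cs != ct:
            break
        n += 1
    return n


def string_features(key, distractor):
    """Some string-level features for a (key, distractor) pair."""
    suffix = common_suffix_len(key, distractor)
    # Assumption: relative suffix length is normalized by the longer string.
    longest = max(len(key), len(distractor), 1)
    return {
        # Assumption: tokens are obtained by whitespace splitting.
        "token_sim_a_d": jaccard(key.split(), distractor.split()),
        "edit_distance": edit_distance(key, distractor),
        "suffix_abs": suffix,
        "suffix_rel": suffix / longest,
        "char_len_diff": abs(len(key) - len(distractor)),
    }
```

For the key "fructose" and the candidate "lactose", the common suffix is "ctose", so the absolute suffix length is 5; "RNA" and "DNA" are one edit apart.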

Classifiers
We study the following three feature-based classifiers: (i) Logistic Regression: an efficient generalized linear classification model; (ii) Random Forest (Breiman, 2001): an effective ensemble classification model; (iii) LambdaMART (Burges, 2010): a gradient boosted tree based learning-to-rank model. To train these models, following the previous notation, we use D_i as positive examples and sample from D − D_i to get negative examples.
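The construction of training examples can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the tuple layout, the fixed seed, and the decision to exclude the key itself from the negative pool are assumptions.

```python
import random


def build_training_examples(mcqs, candidates, num_negatives=None, seed=0):
    """Return (stem, key, distractor, label) tuples for point-wise training.

    mcqs: list of (stem, key, distractors) triples.
    candidates: the candidate distractor set D.
    num_negatives: negatives per question; None means "as many as positives".
    """
    rng = random.Random(seed)
    examples = []
    for stem, key, distractors in mcqs:
        positives = set(distractors)
        # Negatives come from D - D_i. Assumption: the key is excluded too.
        pool = sorted(c for c in candidates if c not in positives and c != key)
        k = len(positives) if num_negatives is None else num_negatives
        negatives = rng.sample(pool, min(k, len(pool)))
        examples += [(stem, key, d, 1) for d in distractors]  # label 1: real
        examples += [(stem, key, d, 0) for d in negatives]    # label 0: sampled
    return examples
```

Each tuple would then be mapped to its feature vector φ(q, a, d) before being passed to a classifier.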

NN-based Models
Based on the recently proposed method IRGAN (Wang et al., 2017), we propose an adversarial training framework for DG. Our framework consists of two components: a generator G and a discriminator D. G is a generative model that aims to capture the conditional probability P(d | q, a) of generating distractors given stems and answers. D is a discriminative model that estimates the probability that a distractor sample comes from the real training data rather than from G.
Assume that the discriminator is based on an arbitrary scoring function f_φ(d, q, a) ∈ R parameterized by φ. The objective for D is then to maximize the log-likelihood of classifying real distractors as positive and distractors generated by G as negative, where the predicted probability is σ(f_φ(d, q, a)) and σ is the sigmoid function. For the generator G, we choose another scoring function f_θ(d, q, a) ∈ R parameterized by θ, evaluate it on every possible distractor d_i given a (q, a) pair, and sample generated distractors from the discrete probability distribution obtained by applying a softmax with temperature hyper-parameter τ to these scores. In practice, since the total number of distractors is large, it is very time-consuming to evaluate f_θ on every possible d_i. Following common practice (Wang et al., 2017; Cai and Wang, 2018), we uniformly sample K candidate distractors for each (q, a) pair and evaluate f_θ on each d_i, ∀i ∈ [1, K]. The objective for G is to "fool" D so that D misclassifies distractors generated by G as positive. The training procedure follows a two-player minimax game, where D and G are alternately optimized towards their own objectives.
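Following the IRGAN formulation (Wang et al., 2017) on which this framework is based, the discriminator objective, the generator's sampling distribution, and the generator objective described above can be written as below. This is a sketch under assumptions rather than a verbatim statement: the symbols P_true and P_θ for the real and generated distractor distributions are introduced here for illustration, and multiplying the score by τ (rather than dividing by it) is an assumption. D maximizes the first quantity and G minimizes the third.

```latex
\begin{align}
\mathcal{L}_D &= \mathbb{E}_{d \sim P_{\text{true}}(d \mid q, a)}
      \big[\log \sigma\big(f_\phi(d, q, a)\big)\big]
   + \mathbb{E}_{d' \sim P_\theta(d' \mid q, a)}
      \big[\log\big(1 - \sigma\big(f_\phi(d', q, a)\big)\big)\big] \\
P_\theta(d_i \mid q, a) &=
   \frac{\exp\big(\tau \cdot f_\theta(d_i, q, a)\big)}
        {\sum_{j} \exp\big(\tau \cdot f_\theta(d_j, q, a)\big)} \\
\mathcal{L}_G &= \mathbb{E}_{d' \sim P_\theta(d' \mid q, a)}
      \big[\log\big(1 - \sigma\big(f_\phi(d', q, a)\big)\big)\big]
\end{align}
```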
The scoring functions f_φ and f_θ can take arbitrary forms. IRGAN utilizes a convolutional neural network based model to obtain sentence embeddings and then calculates the cosine similarities. However, such a method ignores word-level interactions, which are important for the DG task. For example, if the stem asks "which physical unit", good distractors should be units. Therefore, we adopt the Decomposable Attention model (DecompAtt) (Parikh et al., 2016), proposed for Natural Language Inference, to measure the similarities between q and d. We also consider the similarities between a and d. Since they are usually short sequences, we simply use the cosine similarity between summed word embeddings. As such, the scoring function is defined as a linear combination of DecompAtt(d, q) and Cosine(d, a).
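The shape of this scoring function can be sketched as follows. DecompAtt is not implemented here: its output is passed in as a plain number (`stem_score`). The combination weights `w_stem` and `w_key`, the whitespace tokenization, and the dictionary-based embedding lookup are hypothetical placeholders.

```python
import math


def summed_embedding(tokens, embeddings, dim):
    """Sum the word vectors of the tokens; unknown words contribute nothing."""
    total = [0.0] * dim
    for tok in tokens:
        for i, v in enumerate(embeddings.get(tok, ())):
            total[i] += v
    return total


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def score(stem_score, key, distractor, embeddings, dim, w_stem=1.0, w_key=1.0):
    """Linear combination of a stem-distractor score and Cosine(d, a).

    stem_score stands in for DecompAtt(d, q); w_stem and w_key are
    hypothetical weights, not values taken from the paper.
    """
    key_vec = summed_embedding(key.split(), embeddings, dim)
    dis_vec = summed_embedding(distractor.split(), embeddings, dim)
    return w_stem * stem_score + w_key * cosine(key_vec, dis_vec)
```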

Cascaded Learning Framework
To make the ranking process more efficient and effective, we propose a cascaded learning framework, a multistage ensemble learning framework that has been widely used in computer vision (Viola and Jones, 2001). We experiment with 2-stage cascading, where the first stage ranker is a simple model trained with part of the features in Sec. 2.1.1 and the second stage ranker can be any of the aforementioned ranking models. Such cascading has two advantages: (i) the candidate size is significantly reduced by the first stage ranker, which allows the use of more expensive features and complex models in the second stage; (ii) the second stage ranker can learn from more challenging negative examples, since they are top predictions from the previous stage, which can make the learning more effective.
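At prediction time, the 2-stage cascade amounts to scoring all candidates with the cheap ranker, keeping the top K, and reranking only those with the more expensive ranker. A minimal sketch, assuming each ranker is exposed as a function from a candidate to a score (the function and parameter names are hypothetical):

```python
def cascade_rank(candidates, first_stage, second_stage, top_k):
    """Rank candidates with a cheap scorer, then rerank the top_k survivors.

    first_stage, second_stage: callables mapping a candidate to a score,
    where a higher score means a more plausible distractor.
    """
    # Stage 1: score every candidate cheaply and keep only the best top_k.
    survivors = sorted(candidates, key=first_stage, reverse=True)[:top_k]
    # Stage 2: apply the more expensive scorer to the reduced set only.
    return sorted(survivors, key=second_stage, reverse=True)
```

During training, the same first-stage survivors can serve as the pool from which the second stage draws its harder negative examples.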

Datasets
We evaluate the proposed DG models on the following two datasets: (i) SciQ (Welbl et al., 2017), which contains 13.7K MCQs; and (ii) MCQL, the dataset of 7.1K MCQs that we constructed. For SciQ, we follow the original train/valid/test splits. For MCQL, we randomly divide the dataset into train/valid/test with an approximate ratio of 10:1:1. We convert the dataset to lowercase, filter out distractors such as "all of them", "none of them", and "both A and B", and keep questions with at least one distractor. We use all the keys and distractors in the dataset as the candidate distractor set D. Table 1 summarizes the statistics of the two datasets after preprocessing, where |D| is the number of candidate distractors, # MCQs is the total number of MCQs, # Train/Valid/Test is the number of questions in each split of the dataset, and Avg. # Dis is the average number of distractors per question.

Experiment Settings
We use Logistic Regression (LR) as the first stage ranker. For the second stage, we compare LR, Random Forest (RF), LambdaMART (LM), and the proposed NN-based model (NN). Specifically, we set C to 1 for LR, use 500 trees for RF, and use 500 rounds of boosting for LM. For first stage training, the number of negative samples is set equal to the number of distractors, which is 3 for most questions. For second stage training, we sample 100 negative samples. More details can be found in the supplementary material. In addition, we study the following unsupervised baselines that measure similarities between the key and distractors: (i) pointwise mutual information (PMI) based on co-occurrences; (ii) edit distance (ED), which measures the spelling similarity; and (iii) GloVe embedding similarity (Emb Sim). For evaluation, we report recall (R@10), precision (P@1, P@3), mean average precision (MAP@10), normalized discounted cumulative gain (NDCG@10), and mean reciprocal rank (MRR).
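For a single question, the ranking metrics listed above can be computed as in the sketch below, where `ranked` is the model's ordered candidate list and `relevant` is the non-empty set of true distractors. Binary relevance is assumed, and normalizing average precision by min(|relevant|, k) is one common convention adopted here as an assumption; the paper does not spell out its exact definitions. MAP and MRR are the means of the per-question values.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k predictions that are true distractors."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of the true distractors found in the top-k predictions."""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first true distractor, or 0.0 if none is ranked."""
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0


def average_precision_at_k(ranked, relevant, k):
    """Mean of precision values at the ranks of the hits within the top k."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked[:k], start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    # Assumption: normalize by the best achievable number of hits.
    return total / min(len(relevant), k)


def ndcg_at_k(ranked, relevant, k):
    """DCG with binary gains, divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal
```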

Experimental Results
First Stage Ranker. The main goal of the first stage ranker is to reduce the candidate size for the later stage while achieving a relatively high recall. Figure 1 shows the Recall@K for the first stage ranker on the two datasets. The validation set is used for choosing the top K predictions for later stage training. We empirically set K to 2000 for SciQ and 2500 for MCQL to get a recall of about 90%.
Feature Analysis. We conduct a feature analysis to gain more insight into the proposed feature set. Feature importance is calculated by "mean decrease impurity" using RF. It is defined as the total decrease in node impurity, weighted by the probability of reaching that node, averaged over all trees of the ensemble. Table 3 lists the top 10 important features for the SciQ and MCQL datasets. We find that: (i) the embedding similarity between a and d is the most important feature, which shows that embeddings are effective at capturing semantic relations between a and d; (ii) string similarities such as Token Sim, ED, and Suffix are more important in MCQL than in SciQ, which is consistent with the observation that ED has relatively good performance, as seen in Table 2b; (iii) the set of top 10 features is the same for SciQ and MCQL, regardless of order.

Distractor Ranking Results
Effects of Cascaded Learning. Since we choose the top 2000 for SciQ and 2500 for MCQL from the first stage, the ranking candidate size is reduced by 91% for SciQ and 85% for MCQL, which makes the second stage learning more efficient. To study whether cascaded learning is effective, we experiment with RF and LM without 2-stage learning, as shown in the bottom two rows of Table 2. Here we sample 100 negative samples for training the models in order to make a fair comparison with the other methods that use 2-stage learning. We can see that the performance is better when cascaded learning is applied.

Conclusion
We investigated DG as a ranking problem and applied feature-based and NN-based supervised ranking models to the task. Experiments with the SciQ and the MCQL datasets empirically show that ensemble learning models (random forest and LambdaMART) outperform both the NN-based method and unsupervised baselines. The MCQL data is publicly available upon request. The two datasets can be used as benchmarks for further DG research. Future work will be to design a user interface to implement the proposed models to help teachers with DG and collect more user data for model training.

A Training and Implementation Details
Feature-based Models. We use the implementations of scikit-learn (Pedregosa et al., 2011) for logistic regression and random forest experiments.
For LambdaMART experiments, we use the XGBoost library (Chen and Guestrin, 2016). For both the SciQ and MCQL datasets, we train with 500 rounds of boosting, a step size shrinkage of 0.1, a maximum depth of 30, a minimum child weight of 0.1, and a minimum loss reduction of 1.0 for partition. For calculating Wiki Sim features, we use a Wikipedia dump of Oct. 2016. Part of speech tags are calculated with NLTK (Bird and Loper, 2004). The logistic regression used for the first stage ranker is based on the following features: Emb Sim, POS Sim, ED, Token Sim, Length, Suffix, and Freq. Models for the second stage ranker are based on all features described in Sec. 2.1.1.
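Expressed with XGBoost's parameter names, the LambdaMART settings above would look roughly like the fragment below. The mapping from the prose descriptions to parameter names follows the XGBoost parameter documentation; the ranking objective (XGBoost offers `rank:pairwise`, `rank:ndcg`, and `rank:map`) is not stated in the text and is therefore left out rather than guessed.

```python
# Hyperparameters stated in the text, keyed by XGBoost parameter names.
xgb_params = {
    "eta": 0.1,               # step size shrinkage
    "max_depth": 30,          # maximum tree depth
    "min_child_weight": 0.1,  # minimum child weight
    "gamma": 1.0,             # minimum loss reduction for a partition
}
num_boost_round = 500         # rounds of boosting
```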
NN-based Models. Our NN-based models are implemented with TensorFlow (Abadi et al., 2016). When training the generator, we first uniformly select K = 512 candidates and then sample 16 distractors according to Equation 2. The temperature τ is set to 5. Our scoring functions are based on the Decomposable Attention model (Parikh et al., 2016). The word embeddings are initialized using the pre-trained GloVe vectors (Pennington et al., 2014) (840B tokens), and the embedding size is 300. Our model is optimized using the Adam algorithm (Kingma and Ba, 2015) with a learning rate of 1e-4 and a weight decay of 1e-6.
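The generator's sampling step can be sketched in plain Python as follows. This is an illustration only: `score_fn` is a hypothetical stand-in for f_θ on a fixed (q, a) pair, and two details are assumptions not stated in the text, namely that the score is multiplied by τ before the softmax and that the 16 distractors are drawn with replacement.

```python
import math
import random


def sample_distractors(candidates, score_fn, k=512, n=16, tau=5.0, seed=0):
    """Uniformly pick k candidates, then sample n from a tempered softmax.

    candidates: a list of candidate distractors.
    score_fn: plays the role of f_theta for a fixed (q, a) pair.
    """
    rng = random.Random(seed)
    pool = rng.sample(candidates, min(k, len(candidates)))
    # Assumption: scores are multiplied by tau before the softmax.
    logits = [tau * score_fn(d) for d in pool]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in logits]
    # Assumption: sampling is done with replacement.
    return rng.choices(pool, weights=weights, k=n)
```

Because `random.choices` accepts unnormalized weights, the softmax denominator does not need to be computed explicitly.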
Since the sampling process in G is not differentiable, the gradient-descent-based optimization in the original GAN paper (Goodfellow et al., 2014) is not directly applicable. To tackle this problem, we use policy gradient based reinforcement learning as in IRGAN.