Distractor Analysis and Selection for Multiple-Choice Cloze Questions for Second-Language Learners

We consider the problem of automatically suggesting distractors for multiple-choice cloze questions designed for second-language learners. We describe the creation of a dataset including collecting manual annotations for distractor selection. We assess the relationship between the choices of the annotators and features based on distractors and the correct answers, both with and without the surrounding passage context in the cloze questions. Simple features of the distractor and correct answer correlate with the annotations, though we find substantial benefit to additionally using large-scale pretrained models to measure the fit of the distractor in the context. Based on these analyses, we propose and train models to automatically select distractors, and measure the importance of model components quantitatively.


Introduction
Multiple-choice cloze questions (MCQs) are widely used in examinations and exercises for language learners (Liang et al., 2018). The quality of MCQs depends not only on the question and choice of blank, but also on the choice of distractors. Distractors, which may be single words or phrases, are incorrect answers that distract students from the correct one.
According to Pho et al. (2014), distractors tend to be syntactically and semantically homogeneous with respect to the correct answers. Distractor selection may be done manually through expert curation or automatically using simple methods based on similarity and dissimilarity to the correct answer (Pino et al., 2008; Alsubait et al., 2014). Intuitively, optimal distractors should be sufficiently similar to the correct answers to challenge students, but not so similar as to make the question unanswerable (Yeung et al., 2019). However, past work usually lacks direct supervision for training, making it difficult to develop and evaluate automatic methods. To overcome this challenge, Liang et al. (2018) sample distractors as negative samples for the candidate pool in the training process, and Chen et al. (2015) sample questions and use manual annotation for evaluation.
In this paper, we build two datasets of MCQs for second-language learners with distractor selections annotated manually by human experts. Both datasets consist of instances with a sentence, a blank, the correct answer that fills the blank, and a set of candidate distractors. Each candidate distractor has a label indicating whether a human annotator selected it as a distractor for the instance. The first dataset, which we call MCDSENT, contains solely the sentence without any additional context, and the sentences are written such that they are understandable as standalone sentences. The second dataset, MCDPARA, contains sentences drawn from an existing passage and therefore also supplies the passage context.
To analyze the datasets, we design context-free features of the distractor and the correct answer, including length difference, embedding similarities, frequencies, and frequency rank differences. We also explore context-sensitive features, such as probabilities from large-scale pretrained models like BERT (Devlin et al., 2018). In looking at the annotations, we found that distractors tend to be rejected when they are either too easy or too hard (i.e., too good a fit in the context). Consider the examples in Table 1. For the sentence "The large automobile manufacturer has a factory near here.", "beer" is too easy and "corporation" is too good a fit, so both are rejected by annotators. We find that the BERT probabilities capture this tendency; that is, there is a nonlinear relationship between the distractor probability under BERT and the likelihood of annotator selection. We develop and train models for automatic distractor selection that combine simple features with representations from pretrained models like BERT and ELMo (Peters et al., 2018). Our results show that the pretrained models improve performance drastically over the feature-based models, leading to performance rivaling that of humans asked to perform the same task. By analyzing the models, we find that the pretrained models tend to give higher scores to grammatically correct distractors that are similar in terms of morphology and length to the correct answer, while differing sufficiently in semantics so as to avoid unanswerability.

Datasets
We define an instance as a tuple ⟨x, c, d, y⟩ where x is the context, a sentence or paragraph containing a blank; c is the correct answer, the word/phrase that correctly fills the blank; d is the distractor candidate, the distractor word/phrase being considered to fill the blank; and y is the label, a true/false value indicating whether a human annotator selected the distractor candidate.¹ We use the term question to refer to a set of instances with the same values for x and c.
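The instance and question definitions above can be sketched as a small data structure. This is only an illustration: the class name, field names, the `group_into_questions` helper, and the `___` blank marker are assumptions made for the example, not part of the datasets' actual format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    context: str      # x: sentence or paragraph containing a blank
    answer: str       # c: word/phrase that correctly fills the blank
    distractor: str   # d: distractor candidate under consideration
    label: bool       # y: True if an annotator selected the candidate


def group_into_questions(instances):
    """Group instances that share the same context x and correct answer c."""
    questions = defaultdict(list)
    for inst in instances:
        questions[(inst.context, inst.answer)].append(inst)
    return dict(questions)


# Example sentence and rejected candidates taken from the introduction;
# the "___" blank marker is a hypothetical convention.
sentence = "The large automobile ___ has a factory near here."
instances = [
    Instance(sentence, "manufacturer", "beer", False),
    Instance(sentence, "manufacturer", "corporation", False),
]
questions = group_into_questions(instances)
```

Here both instances fall into a single question, keyed by the pair (x, c).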

Data Collection
We build two datasets with different lengths of context. The first, which we call MCDSENT ("Multiple Choice Distractors with SENTence context"), uses only a single sentence of context. The second, MCDPARA ("Multiple Choice Distractors with PARAgraph context"), has longer contexts (roughly one paragraph).
¹ Each instance contains only a single distractor candidate because this matches our annotation collection scenario. Annotators were shown one distractor candidate at a time. The collection of simultaneous annotations of multiple distractor candidates is left to future work.
Our target audience is Japanese business people with TOEIC level 300-800, which translates to pre-intermediate to upper-intermediate level. Therefore, words from two frequency-based word lists, the New General Service List (NGSL; Browne et al., 2013) and the TOEIC Service List (TSL; Browne and Culligan, 2016), were used as a base for selecting words to serve as correct answers in instances. A proprietary procedure was used to create the sentences for both MCDSENT and MCDPARA tasks, and the paragraphs in MCDPARA are excerpted from stories written to highlight the target words chosen as correct answers. The sentences are created following the rules below:
• A sentence must have a particular minimum and maximum number of characters.
• The other words in the sentence should be at an equal or easier NGSL frequency level compared with the correct answer.
• The sentence theme should be business-like.
All the MCDSENT and MCDPARA materials were created in-house by native speakers of English, most of whom hold a degree in Teaching English to Speakers of Other Languages (TESOL).

Distractor Annotation
We now describe the procedure used to propose distractors for each instance and collect annotations regarding their selection.
A software tool with a user interface was created to allow annotators to accept or reject distractor candidates. Distractor candidates are sorted automatically for presentation to annotators in order to favor those most likely to be selected. The distractor candidates are drawn from a proprietary dictionary, and those with the same part-of-speech (POS) as the correct answers (if POS data is available) are preferred. Moreover, the candidates that have greater similarity to the correct answers are preferred, such as being part of the same word learning section in the language learning course and the same NGSL word frequency bucket. There is also preference for candidates that have not yet been selected as distractors for other questions in the same task type and the same course unit. After the headwords are decided through this procedure, a morphological analyzer is used to generate multiple inflected forms for each headword, which are provided to the annotators for annotation. Both the headwords and inflected forms are available when computing features and for use by our models.
Six annotators were involved in the annotation, all of whom are native speakers of English. Out of the six, four hold a degree in TESOL. Selecting distractors involved two-step human selection. An annotator would approve or reject distractor candidates suggested by the tool, and a different annotator, usually more senior, would review their selections. The annotation guidelines for MCDSENT and MCDPARA follow the same criteria. The annotators are asked to select distractors that are grammatically plausible, semantically implausible, and not obviously wrong based on the context. Annotators also must accept a minimum number of distractors depending on the number of times the correct answer appears in the course. Table 1 shows examples from MCDSENT and MCDPARA along with annotations.

Annotator Agreement
Some instances in the datasets have multiple annotations, allowing us to assess annotator agreement. We use the term "sample" to refer to a set of instances with the same x, c, and d. Table 2 shows the number of samples with agreement and disagreement for both datasets. Samples with only one annotation dominate the data. Of the samples with multiple annotations, nearly all show agreement.

Distractor Phrases
While most distractors are words, some are phrases, including 16% in MCDSENT and 13% in MCDPARA. In most cases, the phrases are constructed from a determiner or adverb ("more", "most", etc.) and another word, such as "most pleasant".
[Table 2: Numbers of samples with annotator agreement and disagreement (agree, disagree, total) for MCDSENT and MCDPARA, by number of annotations (# anno.).]
Table 3: Dataset sizes in numbers of questions (a "question" is a set of instances with the same x and c) and instances, broken down by label (y) and data split.

Dataset Preparation
We randomly divided each dataset into train, development, and test sets. We remind the reader that we define a "question" as a set of instances with the same values for the context x and correct answer c, and in splitting the data we ensure that for a given question, all of its instances are placed into the same set. The dataset statistics are shown in Table 3. False labels are much more frequent than true labels, especially for MCDPARA.
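A question-level split of this kind can be sketched as follows. The split fractions, the random seed, and the function name `split_by_question` are assumptions for illustration; the actual proportions are those reported in Table 3.

```python
import random


def split_by_question(instances, dev_frac=0.1, test_frac=0.1, seed=0):
    """Split (x, c, d, y) tuples so that each question stays in one set."""
    # A question is identified by its context x and correct answer c.
    keys = sorted({(x, c) for x, c, _, _ in instances})
    random.Random(seed).shuffle(keys)
    n_dev = int(len(keys) * dev_frac)
    n_test = int(len(keys) * test_frac)
    assignment = {}
    for i, key in enumerate(keys):
        if i < n_dev:
            assignment[key] = "dev"
        elif i < n_dev + n_test:
            assignment[key] = "test"
        else:
            assignment[key] = "train"
    splits = {"train": [], "dev": [], "test": []}
    for inst in instances:
        splits[assignment[(inst[0], inst[1])]].append(inst)
    return splits


# Ten synthetic questions with three candidates each.
data = [(f"s{q}", f"c{q}", f"d{k}", k == 0) for q in range(10) for k in range(3)]
splits = split_by_question(data)
```

Shuffling question keys rather than individual instances is what prevents candidates for the same blank from leaking across train, development, and test sets.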

Features and Analysis
We now analyze the data by designing features and studying their relationships with the annotations.

Features
We now describe our features. The dataset contains both the headwords and inflected forms of both the correct answer c and each distractor candidate d. In defining the features below based on c and d for an instance, we consider separate features for the headword pair and the inflected form pair. For features that require embedding words, we use the 300-dimensional GloVe word embeddings (Pennington et al., 2014) pretrained on the 42 billion token Common Crawl corpus. The GloVe embeddings are provided in decreasing order by frequency, and some features below use the line numbers of words in the GloVe embeddings, which correspond to frequency ranks. Words that are not in the GloVe vocabulary are assigned frequency rank N + 1, where N is the size of the GloVe vocabulary. We use the four features listed below:
• length difference: absolute value of length difference (in characters, including whitespace) between c and d.
• embedding similarity: cosine similarity of the embeddings of c and d. For phrases, we average the embeddings of the words in the phrase.
• distractor frequency: negative log frequency rank of d. For phrases, we take the max rank of the words (i.e., the rarest word is chosen).
• freq. rank difference: feature capturing frequency difference between c and d, i.e., $\log(1 + |r_c - r_d|)$, where $r_w$ is the frequency rank of word w.
Figure 1 shows histograms of feature values for each label. Since the data is unbalanced, the histograms are "label-normalized", i.e., normalized so that the sum of heights for each label is 1. So, we can view each bar as the fraction of that label's instances with feature values in the given range. The annotators favor candidates that have approximately the same length as the correct answers (Fig. 1, plot 1), as the true bars are much higher in the first bin (length difference 0 or 1). Selected distractors have moderate embedding similarity to the correct answers (Fig. 1, plot 2). If cosine similarity is very high or very low, then those distractors are much less likely to be selected. Such distractors are presumably too difficult or too easy, respectively.
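The four context-free features defined in this section can be sketched in a few lines. The tiny embedding table and rank list below are made up and merely stand in for GloVe. Two choices are assumptions not stated in the text: a similarity of 0 is returned when a phrase has no in-vocabulary words, and the max-rank rule for phrases is applied to the correct answer as well as to the distractor.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def phrase_vector(phrase, emb):
    """Average the embeddings of the in-vocabulary words of a phrase."""
    vecs = [emb[w] for w in phrase.split() if w in emb]
    if not vecs:
        return None
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def phrase_rank(phrase, rank):
    """Max (rarest) rank over words; out-of-vocabulary words get N + 1."""
    return max(rank.get(w, len(rank) + 1) for w in phrase.split())


def context_free_features(c, d, emb, rank):
    r_c, r_d = phrase_rank(c, rank), phrase_rank(d, rank)
    v_c, v_d = phrase_vector(c, emb), phrase_vector(d, emb)
    return {
        "length_difference": abs(len(c) - len(d)),
        "embedding_similarity": cosine(v_c, v_d) if v_c and v_d else 0.0,
        "distractor_frequency": -math.log(r_d),
        "freq_rank_difference": math.log(1 + abs(r_c - r_d)),
    }


# Made-up 2-dimensional embeddings and frequency ranks (rank 1 = most frequent).
emb = {"notify": [1.0, 0.0], "consult": [0.6, 0.8], "most": [0.0, 1.0]}
rank = {"most": 1, "notify": 2, "consult": 3}
feats = context_free_features("notify", "consult", emb, rank)
```

In the paper's setup these features would be computed twice per instance, once for the headword pair and once for the inflected form pair.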

Label-Specific Feature Histograms
Selected distractors are moderately frequent (Fig. 1, plot 3). Very frequent and very infrequent distractors are less likely to be selected. Distractors with small frequency rank differences (those on the left of Fig. 1, plot 4) are more likely to be chosen. Large frequency differences tend to be found with very rare distractors, some of which may be erroneously inflected forms.
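The label-normalized histograms discussed in this section can be computed as in the sketch below. The bin edges and feature values are invented for illustration, and the function name is hypothetical; bins are half-open, and values outside the given edges are ignored.

```python
def label_normalized_histogram(values, labels, edges):
    """Per-label histograms whose bar heights sum to 1 for each label."""
    hist = {}
    for label in sorted(set(labels)):
        vals = [v for v, y in zip(values, labels) if y == label]
        counts = [0] * (len(edges) - 1)
        for v in vals:
            # Bin i covers the half-open interval [edges[i], edges[i + 1]).
            for i in range(len(counts)):
                if edges[i] <= v < edges[i + 1]:
                    counts[i] += 1
                    break
        hist[label] = [n / len(vals) for n in counts]
    return hist


# Invented length differences with their T/F labels.
values = [0, 1, 3, 5, 1, 6, 7, 7]
labels = [True, True, True, True, False, False, False, False]
hist = label_normalized_histogram(values, labels, edges=[0, 2, 4, 8])
```

Normalizing within each label makes the two distributions comparable even though false labels greatly outnumber true ones.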
We also computed Spearman correlations between feature values and labels, mapping the T/F labels to 1/0. Beyond what is shown in the feature histograms, we find that a distractor with a rare headword but more common inflected form is more likely to be selected, at least for MCDSENT. The supplementary material contains more detail on these correlations.

Probabilities of Distractors in Context
We use BERT (Devlin et al., 2018) to compute probabilities of distractors and correct answers in the given contexts in MCDSENT. We insert a mask symbol in the blank position and compute the probability of the distractor or correct answer at that position. Figure 2 shows histograms for correct answers and distractors (normalized by label). The correct answers have very high probabilities. The distractor probabilities are more variable and the shapes of the histograms are roughly similar for the true and false labels. Interestingly, however, when the probability is very high or very low, the distractors tend not to be selected. The selected distractors tend to fall in the middle of the probability range. This pattern shows that BERT's distributions capture (at least partially) the nonlinear relationship between goodness of fit and suitability as distractors.
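Once a masked language model has produced scores over its vocabulary for the blank position, the probability of a candidate and its rank in the distribution follow from a softmax. The sketch below assumes the scores are already available as a list of logits; the four-word vocabulary and the logit values are invented, and a real BERT vocabulary consists of tens of thousands of subword units.

```python
import math


def masked_position_features(logits, vocab, word):
    """Softmax the mask-position logits; return (probability, rank) of word."""
    m = max(logits)                       # subtract the max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = vocab.index(word)
    # Rank 1 is the most probable vocabulary item.
    rank = 1 + sum(p > probs[idx] for p in probs)
    return probs[idx], rank


# Made-up logits for the blank in "The large automobile ___ has a factory near here."
vocab = ["manufacturer", "corporation", "organizer", "beer"]
logits = [4.0, 3.0, 1.0, -2.0]
prob, rank = masked_position_features(logits, vocab, "corporation")
```

In this toy distribution "corporation" sits just below the correct answer and "beer" at the bottom, mirroring the "too good a fit" and "too easy" extremes that annotators reject.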

Models
Since the number of distractors selected for each question is not fixed, our datasets can naturally be treated as a binary classification task over individual distractor candidates. We now present models for the task of automatically predicting whether a distractor will be selected by an annotator. We approach the task as defining a predictor that produces a scalar score for a given distractor candidate. This score can be used for ranking distractors for a given question, and can also be turned into a binary classification using a threshold. We define three types of models, described in the subsections below.

Feature-Based Models
Using the features described in Section 3, we build a simple feed-forward neural network classifier that outputs a scalar score for classification. Only inflected forms of words are used for features without contexts, and all features are concatenated and used as the input of the classifier. For features that use BERT, we compute the log-probability of the distractor and the log of its rank in the distribution. For distractors that consist of multiple subword units, we mask each individually to compute the above features for each subword unit, then use the

ELMo-Based Models
We now describe models that are based on ELMo (Peters et al., 2018), which we denote M_ELMo. Since MCDPARA instances contain paragraph context, which usually includes more than one sentence, we denote the model that uses the full context by M_ELMo( ). By contrast, M_ELMo uses only a single sentence context for both MCDSENT and MCDPARA. We denote the correct answer by c, the distractor candidate by d, the word sequence before the blank by w_p, and the word sequence after the blank by w_n, using the notation rev(w_n) to indicate the reverse of the sequence w_n.
We use GloVe (Pennington et al., 2014) to obtain pretrained word embeddings for context words, then use two separate RNNs with gated recurrent units (GRUs; Cho et al., 2014) to output hidden vectors to represent w p and w n . We reverse w n before passing it to its GRU, and we use the last hidden states of the GRUs as part of the classifier input. We also use ELMo to obtain contextualized word embeddings for correct answers and distractors in the given context, and concatenate them to the input. An illustration of this model is presented in Figure 3.
A feed-forward network (FFN) with one ReLU hidden layer is placed on top of these features to produce the score for classification:
$$\mathrm{FFN}(z) = \max(0, zW_1 + b_1)\,W_2 + b_2$$
where z is a row vector representing the inputs shown in Figure 3, and $W_1$, $b_1$, $W_2$, $b_2$ are the parameters of the hidden and output layers. We train the model as a binary classifier by using a logistic sigmoid function on the output of FFN(z) to compute the probability of the true label. We also experiment with the following variations:
• Concatenate the features from Section 3 with z.
• Concatenate the correct answer to the input of the GRUs on both sides (denoted gru+c).
• Concatenate the GloVe embeddings of the correct answers and distractors with z. We combine this with gru+c, denoting the combination all.
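A scorer with one ReLU hidden layer followed by a logistic sigmoid, as described in this section, might look like the following sketch. The parameter names, layer sizes, and weight values are illustrative assumptions; in the actual models the input z is the concatenation of GRU states and contextualized embeddings, and the parameters are learned.

```python
import math


def ffn_score(z, W1, b1, W2, b2):
    """One ReLU hidden layer followed by a linear output unit."""
    hidden = [
        max(0.0, sum(z_i * W1[i][j] for i, z_i in enumerate(z)) + b1[j])
        for j in range(len(b1))
    ]
    return sum(h * w for h, w in zip(hidden, W2)) + b2


def probability_true(score):
    """Logistic sigmoid of the FFN score."""
    return 1.0 / (1.0 + math.exp(-score))


# Illustrative sizes and weights only: 3 inputs, 2 hidden units.
z = [1.0, -2.0, 0.5]
W1 = [[0.5, -1.0], [0.25, 0.5], [1.0, 0.0]]
b1 = [0.0, 0.1]
W2 = [2.0, -1.0]
b2 = -0.5
score = ffn_score(z, W1, b1, W2, b2)
p_true = probability_true(score)
```

The raw score can be used directly to rank candidates within a question, while the sigmoid output supports thresholded classification.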

BERT-Based Models
Our final model type uses a structure similar to M_ELMo but uses BERT in place of ELMo when producing contextualized embeddings. We denote these models by M_BERT and M_BERT( ), depending on the type of context. We also consider the variation of concatenating the features to the input to the classifier, i.e., the first variation described in Section 4.2. We omit the gru+c and all variations here because the BERT-based models are more computationally expensive than those that use ELMo.

Experiments
We now report the results of experiments with training models to select distractor candidates.

Evaluation Metrics
We use precision, recall, and F1 score as evaluation metrics. These require choosing a threshold for the score produced by our predictors. We also report the area under the precision-recall curve (AUPR), which is a single-number summary that does not require choosing a threshold.
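These metrics can be sketched as below. The area under the precision-recall curve is approximated here by average precision, which is one common estimator and an assumption on our part; the text does not state which estimator was used. The scores and labels are invented.

```python
def precision_recall_f1(scores, labels, threshold):
    """Precision, recall, and F1 when predicting True for score >= threshold."""
    pred = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(pred, labels) if p and y)
    fp = sum(1 for p, y in zip(pred, labels) if p and not y)
    fn = sum(1 for p, y in zip(pred, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(1 for y in labels if y)
    tp, area = 0, 0.0
    for k, i in enumerate(order, start=1):
        if labels[i]:
            tp += 1
            area += tp / k   # precision at the rank of each true label
    return area / n_pos if n_pos else 0.0


scores = [0.9, 0.8, 0.4, 0.2]
labels = [True, False, True, False]
p, r, f1 = precision_recall_f1(scores, labels, threshold=0.5)
aupr = average_precision(scores, labels)
```

Because average precision depends only on the ordering of the scores, it evaluates the ranking quality without committing to a threshold.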

Baselines
As the datasets are unbalanced (most distractor candidates are not selected), we report the results of baselines that always return "True" in the "baseline" rows of Tables 5 and 6. MCDSENT has a higher percentage of true labels than MCDPARA.

Estimates of Human Performance
We estimated human performance on the distractor selection task by obtaining annotations from NLP researchers who were not involved in the original data collection effort. We performed three rounds among two annotators, training them with some number of questions per round, showing the annotators the results after each round to let them calibrate their assessments, and then testing them using a final set of 30 questions, each of which has at most 10 distractors. Human performance improved across rounds of training, leading to F1 scores in the range of 45-61% for MCDSENT and 25-34% for MCDPARA (Table 4). Some instances were very easy to reject, typically those that were erroneous word forms resulting from incorrect morphological inflection or those that were extremely similar in meaning to the correct answer. But distractors that were at neither extreme were very difficult to predict, as there is a certain amount of variability in the annotation of such cases. Nonetheless, we believe that the data has sufficient signal to train models to provide a score indicating suitability of candidates to serve as distractors.

Modeling and Training Settings
All models have one hidden layer for the feed-forward classifier. The M_feat classifier has 50 hidden units, and we train it for at most 30 epochs using Adam (Kingma and Ba, 2014) with learning rate 1e−3. We stop training if AUPR keeps decreasing for 5 epochs. Although our primary metric of interest is AUPR, we also report optimal-threshold F1 scores on dev and test, tuning the threshold on the given set (so, on the test sets, the F1 scores we report are oracle F1 scores). The threshold is tuned within the range of 0.1 to 0.9 with step size 0.1.
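The threshold search over 0.1, 0.2, ..., 0.9 can be sketched as follows. The development scores and labels are invented, and breaking ties in favor of the smallest threshold is an assumption of this sketch.

```python
def f1_at(scores, labels, threshold):
    """F1 when predicting True for every score at or above the threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def tune_threshold(scores, labels):
    """Pick the F1-maximizing threshold from 0.1, 0.2, ..., 0.9."""
    grid = [k / 10 for k in range(1, 10)]
    # max() keeps the first (smallest) threshold among ties.
    return max(grid, key=lambda t: f1_at(scores, labels, t))


dev_scores = [0.95, 0.62, 0.58, 0.35, 0.15]
dev_labels = [True, True, False, True, False]
best = tune_threshold(dev_scores, dev_labels)
```

Tuning on the same set that is then scored yields an oracle F1, which is why the test-set F1 numbers described above should be read as upper bounds for a fixed-threshold system.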
For M_ELMo and M_ELMo( ), we use ELMo (Original) for the model, and BERT-large-cased to compute the BERT features from Section 3 (only applies to rows with "features = yes" in the tables). We increase the number of classifier hidden units to 1000 and train for at most 20 epochs, also using Adam with learning rate 1e−3. We stop training if AUPR does not improve for 3 epochs.
For M_BERT and M_BERT( ), we apply the same training settings as for M_ELMo and M_ELMo( ). We compare the BERT-base-cased and BERT-large-cased variants of BERT. When doing so, the BERT features from Section 3 use the same BERT variant as that used for contextualized word embeddings. For all models based on pretrained models, we keep the parameters of the pretrained models fixed. However, we compute a weighted sum of the 3 layers of ELMo, and of all layers of BERT except for the first layer, with the weights learned during training.
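A weighted sum over the layers of a frozen encoder can be sketched as below. Normalizing the trainable weights with a softmax is an assumption of this sketch (the text only states that the weights are trained), and the layer vectors are made up.

```python
import math


def weighted_layer_sum(layers, raw_weights):
    """Combine per-layer vectors with softmax-normalized scalar weights."""
    m = max(raw_weights)
    exps = [math.exp(w - m) for w in raw_weights]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layers[0])
    return [
        sum(w * layer[i] for w, layer in zip(weights, layers))
        for i in range(dim)
    ]


# Three made-up 2-dimensional layer outputs for one token (ELMo has 3 layers).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = weighted_layer_sum(layers, raw_weights=[0.0, 0.0, 0.0])
```

With all raw weights equal, the mix reduces to a plain average of the layers; training would shift the weights toward whichever layers are most useful for the task while the encoder itself stays fixed.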

Results
We present our main results for MCDSENT in Table 5 and for MCDPARA in Table 6.
Feature-based models. The feature-based model, shown as M_feat in the upper parts of the tables, is much better than the trivial baseline. Including the BERT features in M_feat improves performance greatly (10 points in AUPR for MCDSENT), showing the value of using the context effectively with a powerful pretrained model. There is not a large difference between using BERT-base and BERT-large when computing these features.
ELMo-based models. Even without features, M_ELMo outperforms M_feat by a wide margin. Adding features to M_ELMo further improves F1 by 2-5% for MCDSENT and 5-6% for MCDPARA. The F1 score for M_ELMo on MCDSENT is close to human performance, and on MCDPARA the F1 score outperforms humans (see Table 4). For MCDSENT, we also experiment with using the correct answer as input to the context GRUs (gru+c), and additionally concatenating the GloVe embeddings of the correct answers and distractors to the input of the classifier (all). Both changes improve F1 on dev, but on test the results are more mixed.
BERT-based models. For M_BERT, using BERT-base is sufficient to obtain strong results on this task and is also cheaper computationally than BERT-large. Although M_BERT with BERT-base has higher AUPR on dev, its test performance is close to M_ELMo. Adding features improves performance for MCDPARA (3-5% F1), but less than the improvement found for M_ELMo. While M_feat is aided greatly when including BERT features, the features have limited impact on M_BERT, presumably because it already incorporates BERT in its model.
Long-context models. We now discuss results for the models that use the full context in MCDPARA, i.e., M_ELMo( ) and M_BERT( ). On dev, M_ELMo and M_BERT outperform M_ELMo( ) and M_BERT( ) respectively, which suggests that the extra context for MCDPARA is not helpful. However, the test AUPR results are better when using the longer context, suggesting that the extra context may be helpful for generalization. Nonetheless, the overall differences are small, suggesting that either the longer context is not important for this task or that our way of encoding the context is not helpful. The judges in our manual study (Sec. 5.3) rarely found the longer context helpful for the task, pointing toward the former possibility.

Statistical Significance Tests
For better comparison of the models' performance, we apply paired bootstrap resampling (Koehn, 2004). We sample with replacement 1000 times from the original test set, with sample size equal to the corresponding test set size, and compare the F1 scores of two models on each sample. We use the thresholds tuned on the development set for F1 score computations, and assume significance at a p value of 0.05.
Figure 4 shows an example question from MCDSENT, i.e., "The bank will notify its customers of the new policy", and two subsets of its distractors. The first subset consists of the top seven distractors using scores from M_ELMo with features, and the second contains distractors further down in the ranked list. For each model, we normalize its distractor scores with min-max normalization. Overall, model rankings are similar across models, with all distractors in the first set ranked higher than those in the second set. The high-ranking but unselected distractors ("spell", "consult", and "quit") are likely to be reasonable distractors for second-language learners, even though they were not selected by annotators.
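The paired bootstrap test used in this section can be sketched as follows. Resampling individual instances (rather than whole questions) and reporting the fraction of resamples in which the first system fails to win are assumptions of this sketch; the predictions are synthetic.

```python
import random


def f1(pred, gold):
    tp = sum(1 for p, y in zip(pred, gold) if p and y)
    fp = sum(1 for p, y in zip(pred, gold) if p and not y)
    fn = sum(1 for p, y in zip(pred, gold) if not p and y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def paired_bootstrap(pred_a, pred_b, gold, n_resamples=1000, seed=0):
    """Fraction of resamples in which system A does not beat system B on F1."""
    rng = random.Random(seed)
    n = len(gold)
    losses = 0
    for _ in range(n_resamples):
        # Draw the same indices for both systems so the comparison is paired.
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        f1_a = f1([pred_a[i] for i in idx], g)
        f1_b = f1([pred_b[i] for i in idx], g)
        if f1_a <= f1_b:
            losses += 1
    return losses / n_resamples


gold = [True, False] * 50
pred_a = list(gold)             # synthetic system that is always right
pred_b = [False] * len(gold)    # synthetic system that never selects
p_value = paired_bootstrap(pred_a, pred_b, gold)
```

A returned value below 0.05 would correspond to the significance level assumed in the text.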

Examples
In some cases, distractors with similar morphological inflections cluster together in the rankings, which may indicate that the models make use of the grammatical knowledge of the pretrained models.
Based on these heuristic features, Liang et al. (2018) apply neural networks, training the model to predict the answers among a large number of candidates. Yeung et al. (2019) further apply BERT to rank distractors by masking the target word. As we have two manually annotated datasets that have different lengths of contexts, we adopt both word pair features and the context-specific distractor probabilities to build our feature-based models. Moreover, we build both ELMo-based and BERT-based models, combining them with our features and measuring the impact of these choices on performance.

Conclusion
We described two datasets with annotations of distractor selection for multiple-choice cloze questions for second-language learners. We designed features and developed models based on pretrained language models. Our results show that the task is challenging for humans and that the strongest models are able to approach or exceed human performance. The rankings of distractors provided by our models appear reasonable and can reduce a great deal of human burden in distractor selection. Future work will use our models to collect additional training data which can then be refined in a second pass by limited human annotation. Other future work can explore the utility of features derived from pretrained question answering models in scoring distractors.

A.1 Dataset
There are some problematic words in the dataset, such as 'testing, test', 'find s', 'find ed' in MCDSENT/MCDPARA candidate words. There are also some extra spaces (or non-breaking spaces) at the start or end of words. To keep the words the same as what the annotators saw, we only remove leading/trailing white space, and replace non-breaking spaces with ordinary spaces. Comparing how often strings contain spaces before and after tokenization gives the percentages of extra spaces presented in Table 7. The vocabulary size after tokenization is presented in Table 8.
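The whitespace cleanup described above amounts to the following minimal sketch; the function name is hypothetical.

```python
def clean_candidate(word):
    """Replace non-breaking spaces and trim leading/trailing whitespace."""
    return word.replace("\u00a0", " ").strip()


cleaned = [clean_candidate(w) for w in ["\u00a0find s ", "most\u00a0pleasant"]]
```

Internal spaces such as the one in 'find s' are deliberately left alone, so the cleaned strings still match what the annotators saw.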

A.2 Distractor Annotation
The software tool suggested distractor candidates based on the following priority ranking:

Table 9: Position of the candidates, where "sent" denotes sentence and "para" denotes paragraph. "para start" means that the sentence containing the blank is at the beginning of the paragraph.

A.4 Correlations of Features and Annotations
The Spearman correlations for these features are presented in Table 10. The overall correlations are mostly close to zero, so we explore how the relationships vary for different ranges of feature values below. Nonetheless, we can make certain observations about the correlations:
• Length difference has a weak negative correlation with annotations, which implies that the probability of a candidate being selected decreases when the absolute value of word length difference between the candidate and correct answer increases. The same conclusion can be drawn with headword pairs although the correlation is weaker.
• Embedding similarity has a very weak correlation (perhaps even none) with the annotations. However, the correlation for headwords is slightly negative while that for inflected forms is slightly positive, suggesting that annotators tend to select distractors with different lemmas than the correct answer, but similar inflected forms.
• Candidate frequency also has a very weak correlation with annotations (negative for headwords and positive for inflected forms). Since the feature is the negative log frequency rank, a distractor with a rare headword but more common inflected form is more likely to be selected, at least for MCDSENT.
• Frequency rank difference has a weak negative correlation with annotations, and this trend is more pronounced with the inflected form pair. This implies that annotators tend to select distractors in the same frequency range as the correct answers.
Table 10: Spearman correlations with T/F choices, where "head" denotes headword pairs, and "infl" denotes inflected form pairs.
The correlations are not very large in absolute terms; however, we found that there were stronger relationships for particular ranges of these feature values, and we explore this in the next section.
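For reference, a Spearman correlation between a feature and the 1/0 labels can be computed as the Pearson correlation of the rank-transformed values. The sketch below handles ties by average ranks, which matters here because the labels take only two values; the feature values and labels are invented.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


# Invented length differences and T/F labels mapped to 1/0.
length_diff = [0, 1, 2, 5, 7, 3]
selected = [1, 1, 1, 0, 0, 0]
rho = spearman(length_diff, selected)
```

A negative value, as in this toy example, corresponds to the pattern reported for length difference: larger differences go with fewer selections.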

A.5 Label-Specific Feature Histograms
Figure 5 shows histograms of the feature values for each label on headword pairs.

A.6 Results Tuned Based on F1
We report our results tuned based on F1 in Tables 11 and 12.

A.7 Supplement for Analysis
The example for MCDPARA is given below, and two sets of its distractors are shown in Figure 6.
• MCDPARA: A few years have passed since the Great Tohoku Earthquake occurred. It has been extremely costly to rebuild the damaged areas from scratch, with well over $200 billion dollars provided for reconstruction. However, the availability of these funds has been limited. However, a large portion of the money has been kept away from the victims due to a system which favors construction companies....

Figure 6: Ranks for distractor candidates of MCDPARA question "However, the availability of these funds has been limited." along with annotations.