SCDE: Sentence Cloze Dataset with High Quality Distractors From Examinations

We introduce SCDE, a dataset to evaluate the performance of computational models through sentence prediction. SCDE is a human-created sentence cloze dataset, collected from public school English examinations. Our task requires a model to fill multiple blanks in a passage from a shared candidate set with distractors designed by English teachers. Experimental results demonstrate that this task requires the use of non-local, discourse-level context beyond the immediate sentence neighborhood. The blanks must be solved jointly, since each unfilled blank degrades the context available to its neighbors. Furthermore, through ablations, we show that the distractors are of high quality and make the task more challenging. Our experiments show that there is a significant performance gap between advanced models (72%) and humans (87%), encouraging future models to bridge this gap.


Introduction
Cloze questions were first proposed by Taylor (1953) as a readability test, motivated by Gestalt psychology. They became an efficient way of testing reading for public exams, overtaking the dominant paradigm of subjective questions (Fotos, 1991; Jonz, 1991). Cloze datasets (Zweig and Burges, 2011; Hermann et al., 2015; Hill et al., 2015; Paperno et al., 2016; Onishi et al., 2016; Xie et al., 2018) became prevalent as question-answering (QA) benchmarks since they are convenient to generate either automatically or by annotators. These datasets can be split into two clear types: 1. Where the context is a complete text, and there is an explicit question posed as a statement with a cloze gap. The answer is either generated freely or is a span from the context, e.g., the Children's Book Test (CBT) (Hill et al., 2015).
2. Where the context itself comes with cloze gaps. There is no explicit question. The answer is generated freely or chosen from a set of candidates, e.g. CLOTH (Xie et al., 2018).
Herein, we focus on the second category. A common property of these datasets is that their gaps are at the level of words, entities, or short syntactic spans. The entity- and span-based clozes may sometimes be multi-token, but they do not extend beyond a few tokens. None of these datasets, however, have cloze gaps at the level of full sentences. Since many syntactic and semantic cues are present within the same sentence, such gaps are easier to fill than sentence-level clozes, where models must rely on "discourse" cues beyond the sentence itself.
Besides the lack of intra-sentence cues, sentence-level cloze may require comparing candidates of very different lengths. For instance, the example in Table 1 has candidate lengths between 3 and 25 tokens, with a standard deviation of 7.6. A model that only represents words well may not produce comparable sentence-level probabilities for such different lengths. Robust sentence representation models are therefore also required to solve this task.

In this paper, we present SCDE, a dataset of sentence-level cloze questions sourced from public school examinations (data: vgtomahawk.github.io/sced.html; code: https://github.com/shawnkx/SCDE). Each example consists of a passage with multiple sentence-level blanks and a shared set of candidates. Besides the right answer to each blank in the passage, the candidate set also contains candidates which do not answer any blank, i.e., distractors. Both the blank positions and the distractors are carefully authored by the teachers who design the public school examinations. §3.2 explains our data collection. A representative example from SCDE is shown in Table 1.

Another salient aspect of our dataset is that more than 40% of blanks belong to the reasoning category "Inference" (more on this in §3.3 and Table 4), which requires models to compare the plausibility of competing hypotheses given a premise (the previous or following sentence(s), or even a combination of information from the two). Filling these blanks requires the model to reason using commonsense knowledge, factual knowledge, time gaps, etc. Some of these can be thought of as simple entailment, but more generally, many can be seen as requiring abductive reasoning, which is of recent interest to the NLP community (Bhagavatula et al., 2019; Sap et al., 2019a,b).

In summary, our contributions are as follows:
1. We introduce the task of sentence-level cloze completion with multiple sentence blanks and a shared candidate set with distractors.
2. We release SCDE, a sentence-level cloze dataset of ≈ 6k passages and ≈ 30k blanks.
3. We estimate human performance on SCDE and benchmark several models, including state-of-the-art contextual embeddings (Table 5). We find a significant gap of > 15% for future models to close in order to match human performance.
4. Through several ablations described in §5.6, we show that the distractors designed by English teachers are of high quality and make the task more challenging.
5. We show that additional sentence-level cloze questions generated automatically from an external corpus can be used to further improve model performance through data augmentation (see §5.7).

Related Work
ROCStories (Mostafazadeh et al., 2016) is the closest dataset to a sentence-level cloze dataset that we could find. In this task, the first four sentences of a five-sentence story are provided, and the task is to choose the correct ending from a pair of candidate ending sentences. However, there are several key differences between SCDE and ROCStories. Firstly, SCDE has multiple blanks which are not in fixed positions and which require learning cues from bidirectional contexts of varying lengths. Secondly, the endings in ROCStories have been found to contain "annotation artifacts" (Gururangan et al., 2018), which make a large fraction of them predictable independent of context.

In contrast, SCDE is by design resistant to such artifacts: a) given a blank, only some of our candidates are distractors, the rest being answers for other blanks; even if one were to learn a classifier to distinguish distractors without context, the non-distractor candidates would remain unresolvable without context. b) We further check how distinguishable our distractors are from non-distractors without context by training a strong classifier in this setting, as described in §5.6; the classifier obtains a reasonably low F1 score of 0.38.
In Table 2, we summarize the comparison of SCDE with cloze datasets from prior art to show its attractive aspects.
Public school examinations have been used as a data source by many earlier QA works, two prominent examples being the CLEF QA tracks (Penas et al., 2014;Rodrigo et al., 2015) and RACE (Lai et al., 2017).

Sentence Cloze Test with Distractors
In this task, each question consists of a passage S, multiple sentence-level blanks B, and a shared set of candidates C containing distractors D, where D ⊂ C.
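To make the setup concrete, here is a minimal sketch of how one question can be represented; the field names are ours for illustration and do not reflect the released data format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SCDEQuestion:
    """One SCDE example: a passage with sentence-level blanks and a shared candidate set."""
    sentences: List[str]   # passage sentences; blank positions hold the placeholder "_"
    blanks: List[int]      # indices into `sentences` that are blanks (|B|, typically 5)
    candidates: List[str]  # shared candidate set C (typically 7), including distractors D
    answers: List[int]     # gold candidate index for each blank, aligned with `blanks`

example = SCDEQuestion(
    sentences=["He held up a glass of water and asked the class.", "_",
               "The students' answers ranged from 20g to 500g."],
    blanks=[1],
    candidates=["How heavy do you think this glass of water is?",
                "It does not matter on the weight itself."],
    answers=[0],
)
```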
Problem Complexity (see Appendix A for derivations): For our case, given the typical values of |C| = 7 and |B| = 5, the size of the answer space |A| is 2520. Thus, the chance of guessing all blanks correctly at random is only 0.04%. Moreover, there is a 48.2% probability of getting every blank wrong when guessing randomly. Finally, for an answer list chosen uniformly at random, the expected number of distractors in the answer list is 1.4, i.e., on average, roughly one and a half of the chosen answers are distractors.

Data Collection and Statistics
Raw sentence cloze problems are crawled from public websites which curate middle and high school English exams designed by teachers. In total, 14,062 raw passages and 68,515 blank questions are crawled from these websites, and the following steps are used to clean them. Firstly, duplicate passages are removed. Secondly, when the official answers to the problems are images, two OCR toolkits are employed to convert these images to text, and questions for which the two toolkits disagree are discarded. Finally, we remove examples which have 1) answers pointing to non-existent candidates, 2) missing or null candidates, 3) more blanks than candidates, or 4) missing answers.
After cleaning, we obtain the SCDE dataset with 5,959 passages and 29,731 blanks. Detailed statistics are shown in Table 3. We find that candidates vary widely in length and that passages provide long contexts.

In-Depth Analysis & Categorization
In order to evaluate students' mastery of a language, teachers usually design tests in a way that questions cover different aspects of a language.
Reasoning Types: As illustrated with examples in Table 4, we define a four-way categorization of the reasoning that leads to a ground-truth candidate being assigned to a blank. Our reasoning type taxonomy is motivated by the categorization of question types in earlier QA work such as Chen et al. (2016) and Trischler et al. (2017). Strictly speaking, these reasoning types can co-exist, but for simplicity we classify each blank into only one of the four.
• WORDMATCH: the candidate has word overlap, especially of non-stopwords or infrequent phrases, with the context around the blank. The remaining three categories are PARAPHRASING, INFERENCE and SUMMARY; Table 4 gives an example of each.

A sample of 100 passages containing 500 blanks is manually categorized into these four categories. Examples and statistics of the four types are listed in Table 4. More than 40% of blanks need inference to be solved, indicating the high difficulty of our dataset.

Table 4: Reasoning examples with excerpts from the blank context (wrong candidates are marked ×).

WM (18.47%) 1: One day, a teacher was giving a speech to his students. He held up a glass of water and asked the class ____ The students' answers ranged from 20g to 500g.
Candidate: B. How heavy do you think this glass of water is?
× Candidate: D. It does not matter on the weight itself.
Explanation: WordMatch based on "glass of water".

Para. (19.48%) 2: If you want time to have breakfast with your family, save some time the night before by setting out clothes, shoes and bags. ____ That's a quarter-hour more you could be sleeping if you bought a coffee maker with a timer.
Candidate: G. Reconsider the 15 minutes you spend in line at the cafe.
× Candidate: D. And consider setting a second alarm.
× Candidate: F. Stick to your set bedtime and wake-up time, no matter the day.
Explanation: Need to match "15 minutes" with "quarter-hour" and "coffee" with "cafe".

Inf. 3: ____ You can have a good time with your family.
Candidate: F. From May 1st to 7th, we don't need to come to school.
× Candidate: E. All the students can come to their schools.
× Candidate: G. On May 20th, a famous sports star Yao Ming comes to our school.
Explanation: Need to infer that not coming to school → one is at home with family. Simply matching the words "May" or "school" will also match wrong candidates.

Sum. (20.08%) 4: How to Enjoy Life As a Teen? Are high school days equal to the "best years of your life"? Maybe not, but you can learn to make the most of your high school days. ____ Whether it's having a computer, having friends, having a good supply of food, a bed to sleep on, family that loves you, having a decent education or simply being born in this world. Be happy, and life will reward you.
Candidate: C. Learn to appreciate small things.
× Candidate: A. Remember that the point of life is for you to enjoy it.
Explanation: After summarizing the sentences after the blank [which describe a list of "small things"], the answer should be C. A is a strong distractor since both "enjoy" and "life" appear in the context, besides being pertinent to the topic. Indeed, our best-performing BERT-ft model chooses A as the answer.

Context Length
We experiment with giving our models different amounts of context. Through this, we can explore how context length affects model performance.
1. P (N): the immediately previous (next) sentence
2. P+N: the immediately previous and next sentences
3. AP (AN): all previous (next) sentences
4. AP+AN: all previous and all next sentences

AP+AN is the unablated setting, where all passage sentences are available to the model; a sketch of how these settings are constructed is given below.
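The following minimal helper (ours, not from the released code) shows how each setting selects context for a blank at index i; in practice, other blanks inside the selected context would be left as placeholders.

```python
from typing import List

def build_context(sentences: List[str], i: int, setting: str) -> List[str]:
    """Return the context sentences visible to the model for the blank at index i."""
    prev_all, next_all = sentences[:i], sentences[i + 1:]
    if setting == "P":
        return prev_all[-1:]                 # immediate previous sentence
    if setting == "N":
        return next_all[:1]                  # immediate next sentence
    if setting == "P+N":
        return prev_all[-1:] + next_all[:1]  # both immediate neighbors
    if setting == "AP":
        return prev_all                      # all previous sentences
    if setting == "AN":
        return next_all                      # all next sentences
    if setting == "AP+AN":
        return prev_all + next_all           # unablated setting
    raise ValueError(f"unknown setting: {setting}")
```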

PMI
Before exploring deep representational approaches, we would like to know how well symbolic ones perform at this task. Starting with works such as Iyyer et al. (2015) and Arora et al. (2017), it has become convention to first benchmark simple baselines of this kind. PMI merely encodes how likely a word pair is to occur in consecutive sentences. It does not consider internal sentence structure or the relative position of the words in their respective sentences. Intuitively, it can be called a "surface-level" approach. A high performance by PMI would indicate that candidates can be matched to blanks by simple n-gram statistics, without requiring sentence representations, which would make SCDE uninteresting.

We estimate PMI counts (Church and Hanks, 1990) from all consecutive sentence pairs in our training split. Let f denote frequency:

PMI(w_s, w_c) = log [ f(w_s, w_c) / ( f_S(w_s) · f_C(w_c) ) ],

where f(w_s, w_c) counts how often w_s occurs in a sentence whose immediately following sentence contains w_c, f_S(w_s) is the frequency of w_s over S, and f_C(w_c) is the frequency of w_c over C. Note that our PMI definition diverges from typical PMI since it is asymmetric between w_s and w_c. Since S and C are the sets of non-terminating and non-starting sentences respectively, they overlap but are not identical. For a pair of sentences (s, c), we compute the aggregate PMI as:

PMI(s, c) = (1 / (|s| · |c|)) Σ_{w_s ∈ s} Σ_{w_c ∈ c} PMI(w_s, w_c).

This definition can be extended to all n-grams up to a certain n; we denote this by PMI_n. We notice that PMI_n performance saturates after n = 2. Hence, in our experiments, we use PMI_2.
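A small sketch of the unigram case (PMI_1) under the reconstruction above; counts are unsmoothed apart from a small epsilon, and the n-gram extension is omitted.

```python
import math
from collections import Counter
from typing import List, Tuple

def train_pmi(consecutive_pairs: List[Tuple[List[str], List[str]]]):
    """Count word frequencies over preceding sentences (S), following sentences (C),
    and ordered co-occurrences across each consecutive sentence pair."""
    f_s, f_c, f_pair = Counter(), Counter(), Counter()
    for prev_sent, next_sent in consecutive_pairs:
        f_s.update(set(prev_sent))
        f_c.update(set(next_sent))
        f_pair.update({(ws, wc) for ws in set(prev_sent) for wc in set(next_sent)})
    return f_s, f_c, f_pair

def pmi_score(prev_sent: List[str], cand: List[str], f_s, f_c, f_pair,
              eps: float = 1e-12) -> float:
    """Aggregate PMI between a context sentence and a candidate, averaged over word pairs."""
    total, n = 0.0, 0
    for ws in prev_sent:
        for wc in cand:
            if f_s[ws] and f_c[wc]:
                # eps is naive smoothing for pairs never seen in consecutive sentences
                total += math.log((f_pair[(ws, wc)] + eps) / (f_s[ws] * f_c[wc]))
                n += 1
    return total / max(n, 1)
```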

Language Modelling
One intuitive way to solve this task is to generate the blank sentence given the context using advanced pre-trained language models (LMs). Formally, suppose the blank is the i-th sentence s_i, and s_1, ..., s_{i−1}, s_{i+1}, ..., s_n are the context. Our goal is to choose the candidate c_k from the candidate set that maximizes the joint probability p(s_1, ..., s_{i−1}, c_k, s_{i+1}, ..., s_n).
Due to the limited number of passages available to train a robust LM, Transformer-XL (TR.XL) Base (Dai et al., 2019), trained on WikiText-103, is employed for this task. In order to keep decoding time tractable, the context is limited to three sentences before and after the blank.
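For illustration, the scoring step can be sketched as follows with the HuggingFace transformers library, using GPT-2 as a stand-in for the Transformer-XL Base model used in our experiments; the three-sentence context window is assumed to be applied by the caller.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def joint_log_prob(prev_ctx: str, candidate: str, next_ctx: str) -> float:
    """Log-probability of the context window with the candidate filled into the blank."""
    text = " ".join([prev_ctx, candidate, next_ctx]).strip()
    ids = tokenizer(text, return_tensors="pt").input_ids     # [1, T]
    logits = model(ids).logits[:, :-1]                       # predict each token from its prefix
    log_probs = torch.log_softmax(logits, dim=-1)
    token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

def best_candidate(prev_ctx: str, next_ctx: str, candidates):
    """Pick the candidate maximizing the joint probability of the filled-in window."""
    return max(candidates, key=lambda c: joint_log_prob(prev_ctx, c, next_ctx))
```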

Coherence
Coherence models assign a continuous score to a sentence sequence, indicative of its coherence. This score is usually unnormalized and need not be a probability (unlike language models).
We use the local coherence approaches implemented in the COHERE framework (Smith et al., 2016). Roughly, this model builds on the intuition that successive sentences exhibit regularities in their syntactic patterns. Specifically, it uses n-gram patterns over linearized syntactic parses (e.g., S NP VP ...) of consecutive sentences. Once trained, this model can return a "coherence score" for any sentence sequence.
The COHERE model is first trained on all ground-truth passages from our training set, with the ground-truth answers filled into the blanks. At test time, we score each possible answer permutation using the trained COHERE model and pick the highest-scoring one. Note that decoding for COHERE is by definition exhaustive and does not answer the blanks in any particular order.

InferSent
Conneau et al. (2017) use textual inference supervision as a signal to train a shared sentence encoder for premises and hypotheses, which can later be used as a general sentence representation model. We refer to this approach as INFST. The context of a given blank and a candidate are fed to the two encoders of INFST respectively, and a classifier decides whether the candidate fits the blank. The context is truncated to a maximum of 256 tokens. Bi-directional LSTMs with max pooling are employed as the encoders. We follow the training procedure described in Conneau et al. (2017).
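A compact PyTorch sketch of such a pair scorer: two BiLSTM-max encoders and the usual InferSent feature combination [u; v; |u − v|; u * v]. The class names and hidden sizes are illustrative, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class BiLSTMMaxEncoder(nn.Module):
    """Bi-directional LSTM sentence encoder with max pooling (as in Conneau et al., 2017)."""
    def __init__(self, vocab_size: int, emb_dim: int = 300, hid_dim: int = 512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, bidirectional=True, batch_first=True)

    def forward(self, tokens):                 # tokens: [batch, seq]
        out, _ = self.lstm(self.emb(tokens))   # [batch, seq, 2 * hid_dim]
        return out.max(dim=1).values           # max pooling over time

class InfstPairScorer(nn.Module):
    """Scores whether a candidate fits a blank given the blank's context."""
    def __init__(self, vocab_size: int, hid_dim: int = 512):
        super().__init__()
        self.ctx_enc = BiLSTMMaxEncoder(vocab_size, hid_dim=hid_dim)
        self.cand_enc = BiLSTMMaxEncoder(vocab_size, hid_dim=hid_dim)
        self.clf = nn.Sequential(nn.Linear(8 * hid_dim, 512), nn.ReLU(), nn.Linear(512, 2))

    def forward(self, ctx_tokens, cand_tokens):
        u, v = self.ctx_enc(ctx_tokens), self.cand_enc(cand_tokens)
        feats = torch.cat([u, v, (u - v).abs(), u * v], dim=-1)
        return self.clf(feats)                  # logits over {fits, does not fit}
```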

BERT Models
Input Representations: Let c_k denote the k-th candidate. s_{−i} and s_{+i} denote the i-th sentence before and after the blank respectively, and |P| and |N| denote the total number of sentences before and after the current blank. Following the input convention of Devlin et al. (2018), the context (under whichever length setting is used) and c_k are packed into a single sequence pair in the standard [CLS] A [SEP] B [SEP] format. To retain sentence sequentiality, the order between the context and the candidate follows that in the original passage, i.e., preceding context is placed before the candidate and following context after it. Furthermore, for (A)P+(A)N, we create and score one input sample for each of the two context directions during prediction, and the average of these two scores is taken as the final score. The maximum input length is set to 256 tokens in our experiments, and only the context is truncated to meet this requirement.
BERT Next Sentence Prediction (NSP): One of the objectives in the BERT pre-training stage is understanding the relationship between two sentences, which is highly correlated with our task. Therefore, we use pre-trained BERT-Large-uncased with its NSP head to predict the most appropriate candidate for each blank given its context. Specifically, BERT is employed to predict the probability that the context and the candidate are consecutive.
Finetuning BERT: A wide range of NLP tasks have greatly benefited from the pre-trained BERT model. Therefore, we also finetune the pre-trained BERT-Large model on our task as sequence-pair classification. Specifically, for each blank, its correct candidate is labelled 0 and all other (wrong) candidates are labelled 1. Batch size and number of epochs for all models are 32 and 3, respectively. We employ Adam (Kingma and Ba, 2014) as the optimizer with three different learning rates {1e−5, 2e−5, 3e−5}. Model selection is based on validation performance. All BERT finetuning experiments, including the ablation studies, follow this training strategy.
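A minimal sketch of this sequence-pair setup with the HuggingFace transformers library; the helper encode_pair and the example strings are ours, and the label convention (0 = correct, 1 = wrong) follows the description above.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased")
model = BertForSequenceClassification.from_pretrained("bert-large-uncased", num_labels=2)

def encode_pair(context: str, candidate: str, candidate_follows_context: bool):
    """Build [CLS] A [SEP] B [SEP], keeping the passage order of context and candidate
    and truncating only the context side to fit 256 tokens."""
    first, second = (context, candidate) if candidate_follows_context else (candidate, context)
    trunc = "only_first" if candidate_follows_context else "only_second"
    return tokenizer(first, second, truncation=trunc, max_length=256, return_tensors="pt")

# Training pair: the gold candidate of a blank is labelled 0, all other candidates 1.
enc = encode_pair("He held up a glass of water and asked the class.",
                  "How heavy do you think this glass of water is?",
                  candidate_follows_context=True)
loss = model(**enc, labels=torch.tensor([0])).loss   # standard cross-entropy finetuning loss
```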

Decoding Strategy
The decoding strategy decides how exactly we assign a candidate to each blank in the passage. Because the candidates are shared across blanks, we use two strategies: 1) Incremental (INC): blanks are answered one at a time in passage order; each blank takes the highest-scoring candidate, which is then removed from the candidate set. 2) Exhaustive (EXH): every permutation assigning candidates to blanks is scored by aggregating the scores of its constituent blank-candidate pairs. The highest-scoring permutation is the answer. A sketch of EXH decoding is given below.
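The sketch below implements EXH decoding for any blank-candidate scoring function; INC would instead fill blanks greedily one at a time, removing each chosen candidate from the pool. The helper is ours, not from the released code.

```python
from itertools import permutations
from typing import Callable, List, Sequence

def decode_exhaustive(blanks: Sequence[int], candidates: Sequence[str],
                      score: Callable[[int, str], float]) -> List[str]:
    """EXH: assign one distinct candidate per blank so that the sum of the
    individual blank-candidate scores is maximized."""
    best_assignment, best_total = None, float("-inf")
    for assignment in permutations(candidates, len(blanks)):
        total = sum(score(b, c) for b, c in zip(blanks, assignment))
        if total > best_total:
            best_assignment, best_total = list(assignment), total
    return best_assignment
```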

Evaluation Metrics
We design two metrics to evaluate models. Both metrics are reported as percentages.
Blank accuracy (BA): The fraction of blanks answered correctly, averaged over all passages.
Passage Accuracy (PA): PA is 1 iff the model gets all blanks in a passage correct, and 0 otherwise. The average of PA over all passages is reported.
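Both metrics can be computed as in this small sketch (the helper is ours; predictions and gold are lists of per-passage answer index sequences).

```python
from typing import List, Sequence, Tuple

def blank_and_passage_accuracy(predictions: List[Sequence[int]],
                               gold: List[Sequence[int]]) -> Tuple[float, float]:
    """BA: fraction of blanks answered correctly per passage, averaged over passages.
    PA: fraction of passages in which every blank is answered correctly."""
    ba_per_passage, pa_per_passage = [], []
    for pred, answers in zip(predictions, gold):
        n_correct = sum(p == a for p, a in zip(pred, answers))
        ba_per_passage.append(n_correct / len(answers))
        pa_per_passage.append(float(n_correct == len(answers)))
    n = len(gold)
    return 100 * sum(ba_per_passage) / n, 100 * sum(pa_per_passage) / n
```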

Human Performance
Human performance is reported in Table 5. Annotators answer all blanks correctly on 61.0% of questions and achieve a BA of 87%, which we take as the ceiling performance for models to match.

Model Performance
All models are trained with AP+AN context and decoded by EXH. Results are shown in Table 5. Finetuned BERT achieves the best performance among all models, though it still lags behind human performance significantly. Unsupervised models can only solve about one third of all blanks. Surprisingly, PMI_2 and COHERE perform worse than the other unsupervised models. We conjecture that it is difficult for COHERE, using syntactic regularities alone, to distinguish the ground-truth answer for a particular blank from another candidate which is the ground-truth answer for a nearby blank. As noted, PMI_2 suffers from its inability to incorporate larger context. To explore the effects of context length and decoding strategy, models are trained with different context lengths and decoded with both decoding methods. Results are shown in Table 6.

INC vs. EXH: EXH is better than INC for most approaches, indicating that human-created blanks are interdependent and need joint answering.
Context Length: Increasing the context length (e.g., P vs. AP) significantly improves model performance, showing that this task needs discourse-level context to be answered. Furthermore, models with bidirectional context (e.g., P+N) perform better than those with single-direction context (e.g., P), indicating that this task needs global context. Lastly, we observe that PMI-based approaches, which do not explicitly encode sentences, are unable to exploit larger context levels, showing their best performance with P+N.

BERT-ft vs. Human
BERT after finetuning (BERT-ft) performs reasonably well (72%), but there is still a gap compared with human performance (87%). In this section, we analyze the strengths and weaknesses of BERT-ft compared with HUMAN by examining their performance across the reasoning categories on the test set. From Figure 1, inference questions are the most difficult for both HUMAN and BERT-ft, while questions needing WordMatch are relatively easy. Compared with human performance, BERT-ft achieves comparable BA on WordMatch and Paraphrasing problems. However, BERT-ft performs much worse on questions needing Inference and Summary. We also refer to some examples from Table 4.
In Example 4, BERT-ft prefers A, but the answer is C. The reason BERT-ft chooses A may be that "enjoy" and "life" appear in the context, whereas summarizing the sentences after the blank is necessary to reach the correct answer. It is therefore necessary to improve BERT's ability to represent meaning at the sentence level, beyond representing individual words in context.
We also explore how system performance corresponds to human judgements of difficulty. Since evaluators rate the problems on 5 difficulty levels, we report the system BA/PA for each level in Table 8. For BA (blank-level accuracy), we see that, overall, system accuracy decreases as difficulty increases from VeryEasy (0.75) to VeryHard (0.68). However, the decrease is not strictly monotonic (there is a small increase from VeryEasy to Easy, as well as from Moderate to Hard).
We conjecture that this non-monotonicity could be due, at least in part, to our difficulty annotations being at the passage level rather than the blank level: there might be some hard blanks in a passage marked overall "Easy" and, conversely, easy blanks in a passage marked overall "Hard".

Distractor Quality
An attractive aspect of this task is that the distractors are designed by English teachers. We verify distractor quality through the following experiments.
Model Performance w/o Distractors: All distractors in the test set are removed and models are evaluated on this distractor-free test set. Results are shown in Table 7. After removing the distracting candidates, models obtain better scores, showing that models find it hard to exclude distractors during prediction.

Randomly Sampled Distractors

We also replace the teacher-designed distractors in the test set with randomly sampled ones, repeat this replacement several times, and report the averaged score in Table 7. Compared with distractors designed by teachers, models can discern these random distractors more easily.

Annotation artifacts of distractors: Annotation artifacts (Gururangan et al., 2018) occur in many datasets created by human annotators. A potential artifact in our task would be if distractors could be detected without the passage. Therefore, we finetune BERT-Large as a binary classifier whose input is a candidate alone (distractor or correct answer), without any passage. This model obtains an F1 score of only 0.38 on the test set, showing that it is difficult to filter out distractors without context.

Distractor Error (DE)
We define DE as the number of predicted answers per passage which are actually distractors. Through DE, we measure a model's ability to exclude distractors during prediction. Results are shown in Table 9. HUMAN has the lowest DE, and BERT-ft can discern distractors to some extent. However, the DE of PMI_2 is more than 1, meaning that, on average, at least one of its predicted answers per passage is a distractor. In summary, the distractors created by teachers are of high quality and increase task difficulty.
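For concreteness, DE can be computed as in this small sketch (the helper is ours; distractor_sets holds the distractor indices of each passage).

```python
from typing import List, Sequence, Set

def distractor_error(predictions: List[Sequence[int]],
                     distractor_sets: List[Set[int]]) -> float:
    """DE: average number of predicted answers per passage that are actually distractors."""
    counts = [sum(c in distractors for c in pred)
              for pred, distractors in zip(predictions, distractor_sets)]
    return sum(counts) / len(counts)
```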

Automatically Generated Sentence Cloze Questions
To explore automatic generation of examples for the task, we construct sentence cloze questions by randomly choosing five sentences of a passage as blanks. We defer automatic distractor generation to future work, since non-trivial distractor generation is a hard problem in itself. Specifically, we extract all passages from RACE (Lai et al., 2017) (which is also drawn from examinations) and filter out passages with fewer than 10 or more than 30 sentences. While choosing blank positions, we prevent three or more blanks from being consecutive in the generated questions (see the sketch below). In total, 16,706 examples are obtained automatically. We refer to automatically generated questions as Q_A and to those collected from examinations as Q_H. We leverage Q_A in three ways: 1) train models only on Q_A; 2) first train models on Q_A and then finetune on Q_H, i.e., Q_A;Q_H; 3) train models on the concatenation of Q_A and Q_H, i.e., Q_A+Q_H. BERT-Large is finetuned in these three ways and results are shown in Table 10. The model trained only on Q_A has the worst performance, which we attribute to the difficulty of distinguishing distractors without seeing them during training; accordingly, this model has the highest DE. However, models trained on both Q_H and Q_A achieve better performance. We conjecture this is because Q_A helps the model generalize better.
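A sketch of this generation procedure (helper names are ours; the rejection loop enforces the no-three-consecutive-blanks constraint, and no distractors are generated).

```python
import random
from typing import List, Optional, Tuple

def consecutive_runs(indices: List[int]):
    """Yield runs of consecutive indices, e.g. [1, 2, 5] -> [1, 2], [5]."""
    run = [indices[0]]
    for prev, cur in zip(indices, indices[1:]):
        if cur == prev + 1:
            run.append(cur)
        else:
            yield run
            run = [cur]
    yield run

def make_cloze_question(sentences: List[str], n_blanks: int = 5,
                        max_run: int = 2) -> Optional[Tuple[List[str], List[int], List[str]]]:
    """Turn a passage into an automatically generated question (Q_A) by sampling
    blank positions, rejecting draws where three or more blanks are consecutive."""
    if not (10 <= len(sentences) <= 30):
        return None                                   # passage length filter
    while True:
        blanks = sorted(random.sample(range(len(sentences)), n_blanks))
        if max(len(run) for run in consecutive_runs(blanks)) <= max_run:
            break
    candidates = [sentences[i] for i in blanks]       # candidate set without distractors
    passage = ["_" if i in blanks else s for i, s in enumerate(sentences)]
    return passage, blanks, candidates
```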

Conclusion
We introduce SCDE, a sentence cloze dataset with high-quality distractors carefully designed by English teachers. SCDE requires the use of discourse-level context and different reasoning types. More importantly, the high-quality distractors make the task more challenging. Human performance is found to exceed advanced contextual embedding and language models by a significant margin. Through SCDE, we aim to encourage the development of more advanced language understanding models.

A Problem Complexity
With |B| = 5 blanks and |C| = 7 candidates, the size of the answer space |A| is the number of permutations of |C| objects taken |B| at a time, i.e., P(7, 5) = 2520. Therefore, the probability of answering all blanks correctly is 1/2520 ≈ 0.04%.

What are the chances of getting answers partially correct? If we had the same number of candidates as blanks, the number of answer lists with at least one correct answer would be |B|! − D_|B|, where D_|B| is the number of derangements of |B| elements. In the presence of more candidates than blanks, i.e., distractors, this expression becomes more involved to derive. Therefore, we simply enumerate all possible answer lists given the correct answer. With |C| = 7 and |B| = 5, the fraction with at least one correct answer is ζ(|C|, |B|) = 51.8%. In other words, there is a 48.2% probability of being entirely wrong with a randomly chosen set of answers to the blanks in a passage.
What are the chances of getting distractors as predicted answers? For an answer list chosen uniformly at random, the expected number of distractors is E[DE], where DE denotes distractor errors:

E[DE] = Σ_{k=0}^{|D|} k · C(|D|, k) · C(|C| − |D|, |B| − k) · |B|! / P(|C|, |B|),

where C(·, ·) and P(·, ·) denote combinations and permutations respectively. With |D| = 2, |C| = 7 and |B| = 5, the expected number of distractors is 10/7 ≈ 1.429.
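These quantities can also be checked by brute-force enumeration over all P(7, 5) assignments; a quick verification sketch (not part of the released code):

```python
from itertools import permutations

candidates = list(range(7))            # indices 0-4 are gold answers, 5-6 are distractors
gold = (0, 1, 2, 3, 4)                 # correct candidate for each of the 5 blanks

assignments = list(permutations(candidates, 5))
n = len(assignments)                                                   # P(7, 5) = 2520
p_all_correct = sum(a == gold for a in assignments) / n                # 1/2520 ≈ 0.0004
p_all_wrong = sum(all(x != g for x, g in zip(a, gold))
                  for a in assignments) / n                            # ≈ 0.482
expected_distractors = sum(sum(x >= 5 for x in a)
                           for a in assignments) / n                   # 10/7 ≈ 1.429

print(n, p_all_correct, p_all_wrong, expected_distractors)
```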

B Additional Experiment Specifications
Specific BERT Model Used: We use uncased BERT models for all our experiments, as provided by the canonical PyTorch implementation of Wolf et al. (2019). Further experiment specifications are listed in Table 11. Additionally, some complete questions with strong distractors, multi-blank logic and diverse reasoning types are shown in Tables 12, 13 and 14.
Reasoning Examples with Excerpts From the Blank Context

WM (18.47%) 1: One day, a teacher was giving a speech to his students. He held up a glass of water and asked the class ____ The students' answers ranged from 20g to 500g.
Candidate: B. How heavy do you think this glass of water is?
× Candidate: D. It does not matter on the weight itself.
Explanation: Match based on "glass of water".

2: Begin the sleep adjustment for your school schedule as early as possible. ____ But if you feel you will need some extra time to adjust, start earlier.
Candidate: C. Starting a few days early will be enough.
6: The Colosseum in Rome was built during the time of the Roman Empire, in the first century AD. ____ It is a popular tourist attraction today.
Candidate: D. It could seat 50K people, who went to see fights between animals and people.
× Candidate: B. The country used to depend on agriculture.
× Candidate: C. Mountains cover about three-fourths of the country.
Explanation: World knowledge that the Colosseum, or the -eum suffix, relates to a building with seating. Also coreference with the "It" in "It is a popular . . .".

7: American students usually get to school at about 8:30 in the morning. ____ In class, American students can sit in their seats when they answer teachers' questions.
Candidate: B. School starts at 9:00 a.m.
× Candidate: D. Then they take part in different kinds of after-school activities.
Explanation: Requires inference about time: school starts at 9:00, after the students arrive at 8:30.
Sum. (20.08%) 8: Around water, adults should watch children at all times to make sure they are safe. Those who don't know how to swim should wear life jackets. But by themselves they are not enough, so an adult should always be present. If you have to rescue a child from drowning, a few seconds can make a big difference. Make sure you have a friend with you whenever you swim. ____ That person can make sure you get help. Drink a lot of water. The sun's heat and the physical activity may make you sweat more than you realize. By following these simple rules, you can make sure your swim time is safe as well as fun.
Candidate: B. Now get out there, and enjoy the water.
× Candidate: D. Make sure everyone in your family swim well.
Explanation: B is a good conclusion pertinent to the content of the passage.

9: ____ Whenever you are worried, write down the questions that make you worry. And write out all the various steps you could take and then the probable consequences of each step. For example, "What am I worrying about?", "What can I do about it?", "Here is what I'm going to do about it." After carefully weighing all the facts, you can calmly come to a decision.
Candidate: A. Analyze the facts.
× Candidate: C. Decide how much anxiety a thing may be worth.
Explanation: A is a more appropriate option to summarize its succeeding context.

2 Students in my group are from different cities of Britain and their dialects are different too! Some of their accents are quite strong and they also have their own words and expressions.
3 Before I came to England I had thought that fish and chips were eaten every day. That's quite wrong! I get rather annoyed now when I hear all the foolish words about typical English food. I had expected to see "London fog". Do you remember our texts about it? We had no idea that most of this "thick fog" disappeared many years ago when people stopped using coal in their homes. But the idea to speak about weather was very helpful.
4 On the other hand, habits are different. People tell me what is typical British here in London is not always typical in Wales or Scotland.
5 But what is ordinary for all British is that they follow traditions. Probably Britain has more living signs of its past than many other countries. And people have always been proud of having ancient buildings in capitals, big cities and the countryside. I will tell you more about Britain in my other letters. Love from Britain.
The demand for ways to improve memory is higher in students than it is in adults. Students often come across new knowledge in different areas that they need to store for exams.
1 Here are three effective ways to improve your memory as a student.
2 Research shows that learning activities that take more than two hours without a break are less productive when compared to those that take one hour or 30 minutes. Students are likely to remember things they learn over a short period of time. Make sure you take breaks between learning sessions to help improve your memory. Try to relax. Relaxing should be an essential part of your learning process. Scientists have proven that stronger and lasting memories can be achieved when a person relaxes.
3 Deep breathing is one of the most popular relaxation techniques. Establish a quiet environment and find a comfortable position. Then go through a deep breathing process for at least 15 minutes. Train the brain. Students should give their brains a workout in order to improve their memory. At times the brain needs the right stimulation to keep growing and developing. You need to come up with a brain boosting activity that is suitable for you.
4 Write a short story and then try to use seven to nine words to describe it. You can also do games and puzzles to help improve your memory.
5 The techniques discussed above will help you to improve your memory significantly.

Candidates:
A. Distribute learning.
B. Enrich learning activities.
C. Some students suffer with memory problems.
D. Like a muscle, memory can stretch and grow with a workout.
E. For instance, you can prepare a list of items and try to memorize them.
F. You need to use different relaxation techniques in order to improve your memory.
G. In summary, a good memory is an important advantage to any student who wants to improve his or her grades.

Answers: 1→C, 2→A, 3→F, 4→E, 5→G (B and D are distractors)
Discussion: The candidate F could actually go into three possible blanks and fit well into their context: Blanks 1, 3 and 5. This can be seen from the several overlapping phrases/paraphrases F shares with all three, as shown by the three colors (one per concept). However, G (which starts with the phrase "In summary") can only fit into Blank 5. A is also difficult to place in any blank other than Blank 1. Hence, candidate F has to be placed into Blank 3.