Assessing the Helpfulness of Learning Materials with Inference-Based Learner-Like Agent

Many English-as-a-second-language learners have trouble using near-synonym words (e.g., small vs. little; briefly vs. shortly) correctly, and often look for example sentences to learn how two nearly synonymous terms differ. Prior work uses hand-crafted scores to recommend sentences but has difficulty adapting such scores to all near-synonyms, as near-synonyms differ in various ways. We notice that the helpfulness of learning material is reflected in learners' performance. Thus, we propose an inference-based learner-like agent that mimics learner behavior and identifies good learning materials by examining the agent's performance. To enable the agent to behave like a learner, we leverage entailment modeling's capability of inferring answers from the provided materials. Experimental results show that the proposed agent exhibits learner-like behavior and achieves the best performance on both the fill-in-the-blank (FITB) and good example sentence selection tasks. We further conduct a classroom user study with college ESL learners, which shows that the proposed agent can find example sentences that help students learn more easily and efficiently. Compared to other models, the proposed agent improves the scores of more than 17% of students after learning.


Introduction
Many English-as-a-second-language (ESL) learners have trouble using near-synonyms correctly (Liu and Zhong, 2014; Liu, 2013). "Near-synonym" refers to a word whose meaning is similar but not identical to that of another word, for instance, establish and construct. An experience common to many ESL learners is looking for example sentences to learn how two nearly synonymous words differ (Liu, 2013; Liu and Jiang, 2009). To facilitate the learner's learning process, our focus is on finding example sentences that clarify English near-synonyms.

Figure 1: The Learner-Like Agent mimics learners' behavior of performing well when learning from good material and vice versa. We utilize such behavior to find helpful learning materials.
In previous work, researchers developed linguistic search engines, such as Linggle (Boisson et al., 2013) and Netspeak, that allow users to query English words in terms of n-gram frequency. However, these tools only help people investigate differences: learners must form assumptions about the subtleties themselves and verify them with the tools, which cannot point out the differences proactively. Other work attempts to automatically retrieve example sentences for dictionary entries (Kilgarriff et al., 2008); however, finding clarifying examples for near-synonyms is not the goal of such work. In a rare exception, Huang et al. (2017) retrieve useful examples for near-synonyms by defining a clarification score for a given English sentence and using it to recommend sentences. However, the sentence selection process depends on hand-crafted scoring functions that are unlikely to work well for all near-synonym sets. For example, the difference between refuse and reject is grammatical: we say "refuse to" followed by a verb but not "reject to"; such a rule, however, does not apply to delay and postpone, which differ in sentiment, with delay expressing a more negative feeling. Though Huang et al. (2017) propose two different models to handle these two cases respectively, there is no clear way to automatically detect which model should be used for an arbitrary near-synonym set.
In the search for a better solution, we note that ESL learners learn better with useful learning materials, as evidenced by their exam scores, whereas bad materials cause confusion. Such behavior can be used to assess the usefulness of example sentences, as shown in Figure 1. Therefore, we propose a Learner-Like Agent that mimics human learning behavior and thereby gains the ability to select good example sentences. This task concerns the ability to answer questions according to the example sentences given for learning. As such, we transform this research problem into an entailment problem, where the model must decide whether the provided example sentence entails the question. Moreover, to encourage learner-like behavior, we propose perturbing instances for model training by swapping the target confusing word with its near-synonyms. We conduct a lexical choice experiment to show that the proposed entailment modeling can distinguish the differences between near-synonyms. A behavior check experiment illustrates that perturbed instances do encourage learner-like behavior, that is, inferring answers from the provided materials. In addition, we conduct a sentence selection experiment to show that such learner-like behavior can be used to identify helpful materials. Last, we conduct a user study to analyze near-synonym learning effectiveness when the proposed agent selects materials for students.
Our contributions are three-fold. We (i) propose a learner-like agent which perturbs instances to effectively model learner behavior, (ii) use inference-based entailment modeling instead of context modeling to discern nuances between near-synonyms, and (iii) construct the first dataset of helpful example sentences for ESL learners. (Dataset and code are available at https://github.com/joyyyjen/Inference-Based-Learner-Like-Agent.)

Related Work
This task is related to (i) learning material generation, (ii) near-synonym disambiguation, and (iii) natural language inference.

Learning Material Generation. Collecting learning material is one of the hardest tasks for both teachers and students. Researchers have long sought methods to generate high-quality learning material automatically. Sumita et al. (2005) and Sakaguchi et al. (2013) proposed approaches that automatically generate fill-in-the-blank questions to evaluate students' language proficiency. Lin et al. (2007), Susanti et al. (2018), and Liu et al. (2018) worked on generating good distractors for multiple-choice questions. However, only a few works address automatic example sentence collection and generation. Kilgarriff et al. (2008) and Didakowski et al. (2012) proposed sets of criteria for good example sentences, and Tolmachev and Kurohashi (2017) used sentence similarity and quality as features to extract high-quality examples. These works focus only on the quality of a single example sentence, whereas our goal in this paper is to generate an example sentence set that clarifies near-synonyms. The closest existing work is from Huang et al. (2017), who designed a fitness score and a relative closeness score to represent a sentence's ability to clarify near-synonyms. Our work enables models to learn the concept of "usefulness" directly from data, reducing the possible issues of hand-crafted scoring functions.
Near-Synonym Disambiguation. Unlike the language modeling task, which aims to predict the next word given the context, near-synonym disambiguation focuses on differentiating the subtleties of near-synonyms. Edmonds (1997) first introduced a lexical co-occurrence network with second-order co-occurrence for near-synonym disambiguation. Edmonds also suggested the fill-in-the-blank (FITB) task, providing a benchmark for evaluating lexical choice performance on near-synonyms. Islam and Inkpen (2010) used the Google 5-gram dataset to distinguish near-synonyms using language modeling techniques. Wang and Hirst (2010) encoded words as vectors in a latent semantic space and applied a machine learning model to learn the differences. Huang et al. (2017) applied BiLSTM and GMM models to learn the subtle context distributions. Recently, BERT (Devlin et al., 2018) has brought great success to nearly all natural language processing tasks. Though BERT is not designed to differentiate near-synonyms, its powerful learning capability can be used to understand the subtleties of near-synonyms. In this paper, our models are all built on top of the pre-trained BERT model.

Natural Language Inference. Our proposed model directly learns the differences and sentence quality by imitating human reactions to learning material and the behavior of learning from example sentences. The idea of learning from examples is similar to the natural language inference (NLI) task and the recognizing question entailment (RQE) task. There are various NLI datasets that vary in size, construction, genre, and label classes (Bowman et al., 2015; Williams et al., 2018; Khot et al., 2018; Lai et al., 2017). In the NLI task, each instance consists of two natural language texts, a premise and a hypothesis, and a label indicating whether the premise entails the hypothesis. RQE, on the other hand, identifies entailment between two questions in the context of question answering. Abacha and Demner-Fushman (2016) used the following definition of question entailment: "a question A entails a question B if every answer to B is also a complete or partial answer to A." Though NLI and RQE research has seen great success, to the best of our knowledge, we are the first to attempt to apply these two tasks to language learning problems. Poliak et al. (2018)'s recast version of the definite pronoun resolution (DPR) task inspired us to build learner-like agents with entailment modeling. In the original DPR problem, sentences contain two entities and one pronoun, and the task is to link the pronoun to its referent (Rahman and Ng, 2012). In the recast version, the premises are the original sentences, and the hypotheses are the same sentences with the pronoun replaced by its correct (entailed) or incorrect (not-entailed) referent. We believe the proposed entailment modeling helps the model understand the relationship between the given example sentence and the question for the target near-synonym. Thus entailment modeling enables the learner-like agent to mimic human behavior through inference.

Method
In this paper, we use the term learner-like agent to refer to a model that answers questions given examples. The goal of the learner-like agent is to answer fill-in-the-blank questions on near-synonym selection. However, instead of answering the question from its prior knowledge, the agent needs to answer using the information from the given examples. That is, if the given examples provide incorrect information, the agent should come up with the wrong answer. This process simulates the learner behavior illustrated in Figure 1. Since the model is required to infer the answer, we further formulate the task as an entailment modeling problem to endow the model with the capability of inference. In this section, we (i) define the proposed learner-like agent, (ii) describe how to formulate it as an entailment modeling problem, and (iii) introduce perturbed instances to further enhance the agent's learner behavior.

Learner-Like Agent
The overall structure of a learner-like agent is as follows: given six example sentences E (three sentences for each word) and a fill-in-the-blank question Q as an input instance, the model is to answer the question based on the example hints. We adopt BERT (Devlin et al., 2018) and fine-tune the task-specific layer of the proposed learner-like agent on our training data, equipping the learner-like agent with the ability to discern differences between near-synonyms. The input of our model contains the following:

• A question sentence $Q = \{q_1, \ldots, q_n\}$, where $n$ is the length of the sentence, containing a word $w_i$ from the near-synonym pair, where $i \in \{1, 2\}$ denotes word 1 or word 2;

• Six example sentences $E = \{E^1_{w_1}, E^2_{w_1}, E^3_{w_1}, E^1_{w_2}, E^2_{w_2}, E^3_{w_2}\}$, where $E_{w_i}$ denotes a sentence containing $w_i$;

• A [CLS] token for the classification position, and several [SEP] tokens used to label the boundary of the question and the example sentences, following the BERT settings.
The output is the correct word for the input question, namely, $w_1$ or $w_2$. We specifically define $E[w_j]_i$, where $i, j \in \{1, 2\}$, to be the context of $w_i$ filled with $w_j$. The example sentence of case (2) in Table 1 shows a case of $E[w_1]_1$, where the target word $w_1$ is little and the rest of the sentence is called the context $E[\,]_1$. When we change little to small to create case (9), the result is described as $E[w_2]_1$, meaning an example sentence where $w_2$ fills the position of $w_1$ in sentence $E_{w_1}$. This notation also applies to the question input $Q[w_j]_i$.
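To make the input format concrete, the following is a minimal sketch, assuming the HuggingFace transformers library, of how a question and its six example sentences could be packed into one BERT input sequence. The function name build_input and the exact packing details are our illustrative assumptions, not the released implementation.

```python
# A sketch (ours, not the released code) of packing
# [CLS] Q [SEP] E_1 [SEP] ... E_6 [SEP] into one BERT input.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def build_input(question, examples, max_len=256):
    """Pack a question and six example sentences into one BERT input."""
    tokens = [tokenizer.cls_token] + tokenizer.tokenize(question) + [tokenizer.sep_token]
    for example in examples:  # three sentences per near-synonym, six in total
        tokens += tokenizer.tokenize(example) + [tokenizer.sep_token]
    tokens = tokens[:max_len]  # the paper uses a maximum length of 256
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    attention_mask = [1] * len(input_ids)
    padding = max_len - len(input_ids)
    input_ids += [tokenizer.pad_token_id] * padding
    attention_mask += [0] * padding
    return input_ids, attention_mask
```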

Inference-based Entailment Modeling
We apply NLI and RQE tasks in the learner-like agent question design. The goal of the Entailment Modeling Learner-like Agent (EMLA) is to answer entailment questions given example sentences.

Table 1: (9) and (14) are the perturbed instances. The inappropriate examples are used in Section 4 for the behavior check.
We transform the original fill-in-the-blank question into an entailment question where the EMLA answers whether the given example sentence E entails the question sentence Q. If the word usage in the question sentence matches the word usage in the example sentence, the EMLA answers entail, or ¬entail otherwise. The EMLA $M_e$ is described as

$M_e(E^i_k, Q_j) = ans$, (1)

where $ans$, either entail or ¬entail, is the prediction of the inference relationship between one of the six example sentences $E^i_k$, with $k \in \{1, 2, \ldots, 6\}$, and the question $Q_j$. To fill all the context possibilities of $Q[\,]_j$ for the same word in $E_{w_1}$, an example has the following four cases:

$M_e(E[w_1]_1, Q[w_1]_1) = entail$, (2)
$M_e(E[w_1]_1, Q[w_2]_2) = \lnot entail$, (3)
$M_e(E[w_1]_1, Q[w_1]_2) = \lnot entail$, (4)
$M_e(E[w_1]_1, Q[w_2]_1) = \lnot entail$. (5)

From the input and output of these instances (equations 2 to 5), we see that the target word and its context in $Q_j$ for all cases except equation 2 do not follow the example word usage. Examples of the instances are shown in Table 1. Equations 3 and 4 tell us that an example sentence of $w_1$ does not provide any information for the model to infer anything about $w_2$, so both result in ¬entail. The question of equation 5 is incorrect, as shown in Table 1 case (5), so it also leads to ¬entail.
After training the EMLA to understand the relation between example and question, we can convert its {entail, ¬entail} predictions back into the fill-in-the-blank task by examining the model's output probabilities. Given the probability of {entail, ¬entail}, we know which term in the near-synonym pair is more appropriate in the context of $Q[\,]_j$: if the question context and the example context match, the word with the higher entail probability is the answer; if they do not match, the word with the higher ¬entail probability is the answer.
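A minimal sketch of this decision rule follows; the variable names are ours, and a real implementation would aggregate over all six example sentences rather than one probability per word.

```python
def fitb_answer(p_entail_w1, p_entail_w2, contexts_match):
    """Convert EMLA entail probabilities into a fill-in-the-blank answer.

    p_entail_wi: P(entail) when the blank is filled with wi and the question
    is paired with wi's example sentence. contexts_match: whether the
    question context matches the example context.
    """
    if contexts_match:
        # Matching contexts: the word with the higher entail probability wins.
        return "w1" if p_entail_w1 >= p_entail_w2 else "w2"
    # Mismatched contexts: the word with the higher not-entail probability
    # wins; for a two-class model, P(not entail) = 1 - P(entail).
    return "w1" if (1 - p_entail_w1) >= (1 - p_entail_w2) else "w2"
```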

Perturbed Instances
To encourage learner-like behavior, i.e., good examples lead to the correct answer, and vice versa, we propose introducing automatically generated perturbed instances to the training process.
A close look at the input and output of the instances (equations 2 to 5) shows that they consider only correct examples and their corresponding labels. We postulate that wrong word usage yields inappropriate examples; thus we perturb instances by swapping the current confusing word with its near-synonym:

$M_e(E[\lnot w_i]^{w_i}_k, Q_j) = \lnot ans$, (6)

where $\lnot ans$ is $\{entail, \lnot entail\} - ans$ and $E[\lnot w_i]^{w_i}_k$ is the example sentence in which the contexts of $w_1$ and $w_2$ are swapped. The corresponding perturbed instances of equations 2 to 5 are equations 6 to 9, respectively, in which $w_2$'s context becomes $E[\,]_1$. Again, only equation 9, where both the context and the word usage match, is entail. The example instance is shown in Table 1 case (9).
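A minimal sketch of perturbed-instance generation as described above; whitespace tokenization and the label strings are simplifying assumptions of ours.

```python
def perturb(example, target, synonym, label):
    """Swap the target word with its near-synonym and flip the label (¬ans)."""
    tokens = [synonym if tok == target else tok for tok in example.split()]
    flipped = "not_entail" if label == "entail" else "entail"
    return " ".join(tokens), flipped

# Illustrative call (the actual Table 1 sentence differs):
# perturb("There is little hope", "little", "small", "entail")
# -> ("There is small hope", "not_entail")
```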

Experiments
We conducted three experiments: lexical choice, behavior check, and sentence selection. The lexical choice task assesses whether the model differentiates confusing words, the behavior check measures whether the model responds to the quality of learning material as learners do, and sentence selection evaluates the model's ability to explore useful example sentences.

Lexical Choice
Lexical choice evaluates the model's ability to differentiate confusing words. We adopted the fill-in-the-blank (FITB) task, where the model is asked to choose a word from a given near-synonym word pair to fill in the blank.

Baseline
Context modeling is a common practice for near-synonym disambiguation in which the model learns the context of the target word via the FITB task. For this we use a Context Modeling Learner-like Agent (CMLA) as the baseline, built on BERT (Devlin et al., 2018) as a two-class classifier that predicts which of $w_1$ or $w_2$ is more appropriate given a near-synonym word pair. The question for CMLA is a sentence whose target word, i.e., one of the confusing words, is masked; the model is to predict the masked target word. The CMLA $M_c$ is then described as

$M_c(E, Q[\mathrm{MASK}]_i) = ans$,

where $Q[\mathrm{MASK}]_i$ fills the position of $w_i$ with MASK, $ans \in \{w_1, w_2\}$ is the prediction for [MASK] in the question, and $E$ denotes the six example sentences.

$Q[\mathrm{MASK}]_i$ is a question with the context of either $w_1$ or $w_2$. This raises the problem of the model deriving the answer only from $Q_i$:

$M_c(E, Q[\mathrm{MASK}]_1) = w_1$, (12)
$M_c(E, Q[\mathrm{MASK}]_2) = w_2$. (13)

Equations 12 and 13 risk the model selecting $w_i$ given $Q_i$ alone. To encourage learner-like behavior, we incorporate perturbed instances corresponding to equations 12 and 13 into the training process:

$M_c([E[\lnot w_1]^1_1, \ldots, E[\lnot w_1]^3_1, E^1_{w_2}, \ldots, E^3_{w_2}], Q[\mathrm{MASK}]_1) = w_2$, (14)
$M_c([E^1_{w_1}, \ldots, E^3_{w_1}, E[\lnot w_2]^1_2, \ldots, E[\lnot w_2]^3_2], Q[\mathrm{MASK}]_2) = w_1$. (15)

For context modeling, the perturbed instances have the additional benefit that they force the model to make inferences based on the given example sentences, as illustrated in Table 1 case (14).
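A sketch of how a CMLA question could be constructed; the helper mask_question and whitespace tokenization are our assumptions, not the released code.

```python
def mask_question(question, target):
    """Replace the target near-synonym with BERT's [MASK] token."""
    return " ".join("[MASK]" if tok == target else tok for tok in question.split())

# Illustrative call: mask_question("They refuse to leave", "refuse")
# -> "They [MASK] to leave"
```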

Dataset and Settings
We collected a set of near-synonym word pairs from online resources, including BBC, the Oxford Dictionary, and a Wikipedia page about commonly misused English words.
An expert in ESL education manually selected 30 near-synonym word pairs as our experimental material. We collected the data for both training and testing from Wikipedia on January 20, 2020. The words in a confusing word pair are usually used with a specific part of speech; we ensured that the part of speech of the confusing word in each sentence in the pool matched that of the target near-synonym word pair. To construct a balanced dataset, we randomly selected 5,000 sentences for each word; for each word in a near-synonym word pair, 4,000 sentences were used to train the learner-like model and 1,000 for testing.
For comparison, we trained four learner-like agents: EMLA, CMLA, EMLA without perturbed instances, and CMLA without perturbed instances. For the best learning effect, we empirically set the ratio of normal-to-perturbed instances to 2 : 1. The agents were trained using the Adam optimizer with a 30% warm-up ratio and a 5e-5 learning rate. The maximum total input sequence length after tokenization was 256; other settings followed the BERT configuration.
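A sketch of this training configuration with the HuggingFace transformers library; only the hyperparameters (5e-5 learning rate, 30% warm-up, two output classes) come from the text, while the total step count and the AdamW choice (the standard Adam variant for BERT fine-tuning) are our assumptions.

```python
import torch
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

# Two output classes: entail vs. not-entail for EMLA (w1 vs. w2 for CMLA).
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

total_steps = 10_000  # placeholder: depends on dataset size and batch size
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.3 * total_steps),  # 30% warm-up ratio
    num_training_steps=total_steps,
)
```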

Results and Discussion
We compared EMLA and CMLA; Figure 2 shows the model performance on the 30 word pairs. The average accuracy of EMLA and CMLA is 0.90 and 0.86, respectively, while without perturbed instances it is 0.80 and 0.86. On average, EMLA performs the best; when perturbed instances are not included in training, its lexical choice performance drops. We had expected training with perturbed instances to worsen model performance in exchange for learner-like behavior; however, the results show that the perturbed instances enhance the inference ability of EMLA. Also, the CMLA models seem unaffected by perturbed instances (yellow vs. green lines); this could be because CMLA tends to memorize the input context instead of making actual inferences, which in NLI is recognized as a bias (Chien and Kalita, 2020).

Behavior Check
The behavior check evaluates whether the agent learns as learners do; that is, a learner-like agent should perform well on FITB questions when the given learning materials are helpful, and should perform poorly when the materials are not helpful.
In this experiment, all models complete two FITB quizzes. For the first quiz, authentic sentences are provided as appropriate learning materials; for the second quiz, inappropriate learning materials are provided. These materials are considered inappropriate because they are automatically generated from the authentic sentences by replacing the target words with their near-synonyms, resulting in confusing and wrong word usage, as illustrated in Table 1 (see the last two "Inappropriate example" rows). In other words, given inappropriate example sentences, if the model is truly inferring answers from the examples, it should select the other choice for the same quiz question.

Results and Discussion
We recorded the accuracy of every question and combined the results of the 30 near-synonym word pairs from the same model into one graph. As shown in Figure 3, even without perturbed instances, the learning effect of EMLA corresponds to the quality of the learning material. In contrast, CMLA without perturbed instances, as in the lexical choice task, performs no worse when given inappropriate examples.
To determine whether the results of the two fill-in-the-blank quizzes are significantly different when given appropriate and inappropriate examples, we conducted a t-test. Table 2 shows that learner-like behavior is enabled in CMLA with perturbed instances, whereas EMLA learns like learners even without perturbed instances. This result conforms to that shown in Figure 3: the quiz results for both EMLA models can be clearly distinguished, and adding the perturbed instances to EMLA slightly magnifies their difference. However, the CMLA still relies on perturbed instances to learn the difference.
Looking more closely, we present Table 3, in which ∆ is the difference in accuracy between two quizzes. The higher ∆ is, the better the model differentiates confusing words. We measure the correlation between the lexical choice accuracy and ∆ with the Pearson correlation coefficient and obtain a value of 0.87, which demonstrates a strong positive correlation.
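Both statistics can be computed with scipy; the following sketch uses placeholder numbers, whereas the real inputs are the per-word-pair quiz accuracies.

```python
from scipy import stats

# Quiz accuracies with appropriate vs. inappropriate examples (placeholders).
acc_appropriate = [0.92, 0.88, 0.90, 0.85]
acc_inappropriate = [0.35, 0.41, 0.30, 0.44]
t_stat, p_value = stats.ttest_ind(acc_appropriate, acc_inappropriate)

# Correlation between lexical choice accuracy and the quiz gap (delta),
# computed across the 30 word pairs (placeholders shown here).
lexical_accuracy = [0.90, 0.85, 0.95, 0.88]
delta = [0.55, 0.40, 0.60, 0.50]
r, _ = stats.pearsonr(lexical_accuracy, delta)  # the paper reports r = 0.87
```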

Sentence Selection
In the sentence selection experiment, we evaluate the ability of the learner-like agent to select useful example sentences. Our assumption is straightforward: we give the agent a set of example sentences and evaluate its performance on a number of quizzes. If it does well on many quizzes, the example sentences are deemed helpful for learning confusing words.

Baseline
We compared agents with an implementation of Huang et al. (2017)'s Gaussian mixture model (GMM), which learns the distribution and semantics of the context. We set the number of Gaussian mixtures to 10 and trained the GMM with the dataset proposed here. In the testing phase, we retrieved the top three recommended sentences for each word in the confusing word pair and compared this to the expert's choices.
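A sketch of such a GMM baseline using scikit-learn; how the contexts are embedded is not detailed here, so embed is a stand-in of ours.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def embed(sentence):
    """Stand-in for a real context encoder (e.g., averaged word vectors)."""
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    return rng.normal(size=50)

# Contexts of one target word (placeholder corpus).
train_contexts = [f"placeholder context {i}" for i in range(200)]
X = np.stack([embed(s) for s in train_contexts])
gmm = GaussianMixture(n_components=10).fit(X)  # 10 mixtures, as in the text
scores = gmm.score_samples(X)  # higher log-likelihood = more typical context
```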

Evaluation Dataset
To evaluate sentence selection, we employed an ESL teacher as an expert to carefully select the three best example sentences for each word in every confusing word pair, out of ten randomly selected, grammatically and pragmatically correct examples. Specifically, the evaluation dataset had a total of 600 example sentences; for each near-synonym pair, three sentences per word were labeled as helpful example sentences. To select sentences that clearly clarify the semantic difference between near-synonyms, the ESL expert considered suitability, informativeness, diversity, sentence complexity, and lexical complexity during selection. For suitability, the expert considered whether the two words in a confusing word pair were interchangeable in the current sentence. Diversity was considered when constructing the selected pool. The suitability and diversity criteria are drawn from the conclusions of Huang et al. (2017); the other criteria follow Kilgarriff et al. (2008)'s notion of a good example sentence.

Selection Method
For the proposed good example sentence set, we selected the example sentence combination that helps EMLA or CMLA achieve the highest accuracy on the quiz, that is, the example sentence set that leads to the highest learning performance.
Each of the 14,400 ($\binom{10}{3} \times \binom{10}{3}$) candidate example sentence sets, each containing six example sentences, was in turn provided to the models to evaluate its helpfulness. Each example sentence set was used to answer a quiz composed of k questions. Here, k determines the representativeness and consistency of the testing result of each quiz. We used five independent quizzes to find a reliable k by calculating the correlation of their testing results. Finally, we empirically set k to 100, where the lowest correlation among the 30 word pairs was 0.24 and the median was 0.67. That is, each quiz contained 100 questions. When testing example sentence sets, multiple sets could achieve the same highest accuracy on the quiz. We considered them equally good, so the sentences in these sets were all treated as selected. Thus, our method may suggest more example sentences than the gold labels.

Table 4 shows the results of sentence selection. EMLA significantly outperforms CMLA and Huang et al. (2017)'s GMM in sentence selection. The improvement comes from increased recall, indicating that the proposed learner-like agent manages to find helpful example sentences for ESL learners.
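A minimal sketch of the exhaustive search described above; quiz_accuracy stands in for running the agent on the 100-question quiz.

```python
from itertools import combinations

def select_sets(pool_w1, pool_w2, quiz_accuracy):
    """Score all C(10,3) x C(10,3) = 14,400 sets; keep every top-scoring one."""
    best, selected = -1.0, []
    for e1 in combinations(pool_w1, 3):      # 3 of the 10 sentences for w1
        for e2 in combinations(pool_w2, 3):  # 3 of the 10 sentences for w2
            acc = quiz_accuracy(e1 + e2)     # accuracy on the 100-question quiz
            if acc > best:
                best, selected = acc, [e1 + e2]
            elif acc == best:
                selected.append(e1 + e2)     # ties are all treated as selected
    return best, selected
```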

Learner Study
We conducted a user study to examine the effect of learning from example sentences selected by EMLA, CMLA, and a random baseline. For this learner study, we recruited a total of 29 Chinese-speaking college freshmen majoring in English. All the participants were aged between 18 and 19. A proficiency test (Chen and Lin, 2011) was given before the study to identify their English level for further analysis.

Experimental Design and Material
We followed Huang et al. (2017)'s learner study design with some modifications. The whole test consisted of a pre-test and a post-test section, for a total of 80 minutes. Fill-in-the-blank multiple-choice questions were used in both tests to examine students' understanding of near-synonyms. A total of 30 word pairs were used to create 30 question sets, each containing three questions. The question sets, as shown in Figure 4 (B), were presented to the students. The students were asked to finish 15 randomly assigned question sets in the pre-test along with a background questionnaire. During the post-test section, example sentences generated by EMLA, CMLA, or the random baseline were presented in the example panel, as shown in Figure 4 (A). A maximum of three example sentences for each word could be obtained by clicking the readme button, which let us track how many example sentences were used for learning. Note that the students were asked to answer the same question sets in the post-test so that we could measure the improvement they made between the pre-test and the post-test. For each question set, the model used for sentence selection was also randomly assigned, to prevent learners from getting tired of useless example sentences. Different from the sentence selection in Section 4.3, where all the combinations with the highest quiz score are selected, here we picked the three most common example sentences from those combinations to fulfill the experimental design, assuming that the three most common sentences for each word are the best candidates among all the combinations.

Results and Discussion
When learning from example sentences selected by EMLA, 16 students improved; only 12 and 11 students improved when learning from CMLA and the random baseline, respectively, suggesting that EMLA helped more. Figure 5 plots the students' improvement scores against their proficiency scores.

Figure 5: Improvement of the 29 learners' scores with respect to entailment modeling, context modeling, and the random baseline. A total of 16 learners improved when learning from the material generated by entailment modeling.

Table 5: Analysis of the two groups. Above and Below stand for the above-average group and the below-average group, respectively. EMLA helps the above-average group the most. We also find that the above-average group reads significantly fewer sentences than the below-average group. However, the below-average group rates the example sentences as easier (scores range from 1 to 4, with 1 being "too difficult").
To further understand the students' behaviors, we separated the students into two groups by their English proficiency test scores. Students whose test scores were lower than the average were grouped into the below-average group and considered to have lower English proficiency, and vice versa. The above-average group and the below-average group had 12 and 17 students, respectively. The average improvement scores of the two groups are shown in Table 5. We can see that the above-average students benefit more from the example sentences, while the below-average students benefit less or are even confused by them. Again, EMLA helps the above-average students the most. The random baseline yields mixed results, and even the above-average students were negatively affected. This echoes the results of Huang et al. (2017): students can still learn from random example sentences, but more effort is needed to fully understand the near-synonym, and the outcome is unstable. In Figure 5, we can see two outliers in the random baseline: the one who improved greatly is from the below-average group, and the one who worsened greatly is from the above-average group. This evidence shows the uncertainty of the random baseline.
We also investigated the learners' behavior during the post-test and their questionnaire responses regarding example difficulty; the results are shown in Table 5. The above-average students read significantly fewer examples, yet they also rate the examples as more difficult. On the other hand, most of the below-average students read all six examples and rate them as relatively easier. Though many above-average students improved in the post-test, we found that two of them read fewer than three examples and thus performed worse in the post-test. Such cases suggest that reading a fair number of example sentences is required to fully understand a near-synonym.

Conclusion
We introduce the learner-like agent, in particular EMLA, which assesses the helpfulness of learning materials using inference. Entailment modeling, unlike common context-based near-synonym disambiguation, makes inferences about the relationship between the example sentences and the question, similar to human behavior. Context modeling in the learner-like agent relies on additional perturbed examples to mimic human behavior, whereas EMLA already has this ability. The agent can be used to evaluate the helpfulness of learning materials, or, more interestingly, to select the best materials from a large candidate pool. In practice, we use it to select good example sentences, which confirms the usefulness of modeling learner behavior. Using the EMLA learner-like agent, we find more helpful learning material for learners, as demonstrated by the learner study. These results demonstrate the value of modeling learner behavior with an inference approach. In the future, we would like to explore whether the learner-like agent can be extended to materials and data beyond example sentences for near-synonyms.