DT-QDC: A Dataset for Question Comprehension in Online Test

With the transition of education from the traditional classroom to online learning and assessment, it has become more important than ever to assess question difficulty accurately. Since teachers may not be able to follow students' performance and learning behavior closely, a well-defined method for measuring question difficulty is needed to guide learning. In this paper, we explore the concept of question difficulty and present our new Chinese DT-QDC dataset. It is currently the largest question difficulty dataset and the only one in Chinese, with enriched attributes and difficulty labels. Additional attributes such as keywords, chapter, and question type allow models to understand questions more precisely. We propose MTMS-BERT and ORMS-BERT, which improve difficulty judgment from different views. The proposed methods outperform the baselines by 7.79% on F1-score, 15.92% on MAE, and 28.26% on MSE on the new DT-QDC dataset, laying the foundation for the question difficulty comprehension task.


Introduction
Intelligent education systems have been heavily studied and investigated because they can generate great value both academically and commercially, especially during the COVID-19 pandemic. A well-developed educational assistance system helps students keep track of their learning progress and customize personalized learning approaches. It improves the overall learning experience, enhances students' initiative, and leads to better performance in assessments.
One key component of an intelligent education system is assessing students' level of understanding, and asking questions is the most intuitive way to accomplish that. However, most online quiz systems treat questions equally without factoring in differences in difficulty.
Some researchers (Sonkar et al., 2020) have argued that we should not treat all questions equivalently, because questions exhibit significant variations in difficulty and discrimination (Embretson and Reise, 2013). Pardos and Heffernan (2011) introduced problem difficulty modeling into the knowledge tracing task, but they only used the guess and slip parameters associated with each question, without paying attention to the nature of the question itself. For every new question, they must use the corresponding user answer data to estimate its difficulty.
In this paper, we go one step further and investigate methods that perform question difficulty comprehension using various attributes of the question. Question understanding is one of the key components of machine reading comprehension, which is viewed as a sign that machines understand natural language (Nakanishi et al., 2018). Although question comprehension has not been studied as a separate task, it is closely related to question difficulty. Generally speaking, difficulty is an abstract and personalized concept that is hard to quantify and define. To evaluate students online, we can measure the difficulty of a question from the perspective of classification.
By evaluating the difficulty of each question, the performance of downstream applications such as deep knowledge tracing and question answering can be improved. We release a dataset, Chinese Driving Test Question Difficulty Comprehension (DT-QDC), built from a large volume of user records from the Driving License Examination Website. The dataset contains 14,933 questions with 10 attributes. Figure 1 shows two example questions from the dataset, whose attributes include the question explanation, keywords, test information, etc.

Figure 1: Examples from DT-QDC dataset
Some related datasets exist: Clark et al. (2018) contributed a genuine grade-school-level multiple-choice science question dataset, Wasim et al. (2019) released a multi-label biomedical question dataset, and Li and Roth (2002) proposed a free-form question dataset. However, these datasets contain only question text and labels, and their volume is relatively small. In comparison, our dataset is larger and richer in attributes, which makes it valuable for future research on designing, evaluating, and understanding questions.
We propose the Ordinal Regression Multi-Source BERT (ORMS-BERT) model to solve the difficulty comprehension problem. Multi-source BERT (Devlin et al., 2018) text representation and relation modeling enable the model to better understand question difficulty. A novel category encoding technique transforms the multi-class classification task into multiple binary classification tasks. Our model outperforms the baselines by 6.77% on F1-score, 15.92% on MAE, and 28.26% on MSE.
To summarize our contributions: • We give clear definitions of question difficulty, namely absolute difficulty and field difficulty, and we propose the task of question difficulty comprehension.
• We construct the first question difficulty comprehension dataset, DT-QDC, whose difficulty labels are annotated based on statistics from tens of millions of users' answering records.
• We benchmark a variety of neural models on the new DT-QDC dataset and propose ORMS-BERT for the question difficulty comprehension task, which achieves a significant improvement over the baselines.

Related work
To frame the online question comprehension task, we mainly consider the definition of question difficulty, its fine-grained division, and its modeling. Related work falls into the following two areas:

Text difficulty prediction
When learning new knowledge, it is important to select the proper material for each student. Text difficulty prediction systems can help educators find texts that are grade-appropriate for the individual student from abundant text materials. Balyan et al. proposed four machine learning classification approaches (flat, one-vs-one, one-vs-all, and hierarchical) that use natural language processing features to predict human ratings of text difficulty. Ruseti et al. (2018) used recurrent neural networks to predict question depth (from very shallow to very deep) in order to provide feedback on questions generated by students.

Knowledge Tracing
Knowledge Tracing (KT) is the task of modeling and predicting how human beings learn. Several works used Bayesian Knowledge Tracing (BKT) to build temporal models of student learning (de Baker et al., 2008; Yudelson et al., 2013). In particular, Pardos and Heffernan (2011) used guessing and slipping estimates to model problem difficulty. Recent work (Piech et al., 2015) explored deep knowledge tracing, combining Long Short-Term Memory (LSTM) networks with the knowledge tracing task. Sonkar et al. (2020) proposed a question-centric deep knowledge tracing method, which leverages question-level information and incorporates graph Laplacian regularization to smooth predictions under each skill.

Difficulty and Task Definitions
In this paper, we define the Absolute Difficulty of a problem as the unobservable intrinsic difficulty of solving it. Absolute Difficulty is determined by the prior knowledge and the comprehensive ability required to tackle the challenge.
Definition 1. (Absolute Difficulty) Let $Q$ be the set of all questions in a certain field. The absolute difficulty is a mapping from $Q$ to the set of non-negative real numbers, denoted $d_a$.
We give two examples to illustrate the idea: 1). Completing a problem in advanced calculus requires at least an understanding of linear algebra, differentiation, and analysis, whereas solving a set of simultaneous equations requires less prior knowledge.
2). Suppose Question A asks a student to simply write out a formula, whereas Question B presents a situation in which the student must first extract the data and then apply the formula. Question B is more difficult to solve because human reasoning and induction are involved.
We define another concept, Field Difficulty, to account for individual differences. The Field Difficulty of a question is the difficulty a particular problem solver experiences in practice. It depends on the problem solver and can change dynamically over time.
Definition 2. (Field Difficulty) Let $Q$ be the set of all questions in a certain field and $S$ be the set of all characteristics associated with the problem solver. The field difficulty is a mapping from $Q \times S$ to the set of real numbers, denoted $d_p$.
People may have different depths of understanding of the prior knowledge, and as time goes on they become more familiar with the subject, so the difficulty decreases. Ultimately, when a problem solver has full command of all the prior knowledge required to solve a problem, the Field Difficulty approaches the Absolute Difficulty, unless they forget some key knowledge over time, in which case the Field Difficulty rises again.
Task Definition. After collecting a sufficient amount of data, the absolute difficulty of a problem can be estimated and compared with that of other questions. We also map and discretize the Absolute Difficulty into $M$ levels $D = \{1, 2, \dots, M\}$, making it more suitable for comparison and interpretation. To simplify the configuration, $M$ is set to 5 in this paper.
Definition 3. (Prediction of question difficulty) Given a set of questions $\{q_1, q_2, \dots, q_m\}$ in a certain field and a set of problem solvers $G = \{G_1, G_2, \dots, G_n\}$, let the set of discretized Absolute Difficulty levels be $D$ as defined above. We would like to find a mapping $f$ from a question $q_i$ to the set $D$ such that
$$f(q_i) = \mathop{\arg\max}_{k \in D} \, \mathrm{prob}(q_i, G, k),$$
where $\mathrm{prob}(q_i, G, k)$ is the probability that the absolute difficulty of question $q_i$ exhibited by the group $G$ is level $k$.
The absolute difficulty of a question affects its error rate, so we use the error rate over a large number of users as an observation of absolute difficulty.

Data collection
The Driving License Examination Website 1 is an online platform that provides mock questions on which users test their knowledge before the actual driving exam. We constructed our Driving Test Question Difficulty Comprehension (DT-QDC) dataset 2 from the platform's questions and the corresponding user answer records. By analyzing these question-answer pairs, we can infer the true and objective difficulty of the questions. In total, 14,933 questions with 10 attributes were collected from 136 chapters of the Driving License Examination.

Dataset annotation
In our dataset, TrueCount is the number of users who answered the question correctly and FalseCount is the number of users who answered it incorrectly. WrongRate (the error rate) is calculated as
$$\mathrm{WrongRate} = \frac{\mathrm{FalseCount}}{\mathrm{TrueCount} + \mathrm{FalseCount}}.$$
The resulting distribution of difficulty labels in the dataset is shown in Table 1. The Driving License Examination Website also provides difficulty labels for its questions, based on historical error rates. As Table 1 shows, the difficulty labels we obtained from tens of millions of users' answer records are consistent with this human answering behavior.
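As a minimal illustration of how a label can be derived from answer counts, the following Python sketch computes WrongRate and maps it to one of the M = 5 levels; the bin boundaries are hypothetical assumptions, since the exact discretization thresholds are not listed here.

```python
def wrong_rate(true_count: int, false_count: int) -> float:
    """Error rate of a question: the fraction of incorrect answers."""
    total = true_count + false_count
    return false_count / total if total > 0 else 0.0

def difficulty_level(rate: float, bins=(0.2, 0.4, 0.6, 0.8)) -> int:
    """Map an error rate to one of M = 5 ordinal levels.

    The bin edges are illustrative assumptions, not the paper's thresholds.
    """
    for level, edge in enumerate(bins, start=1):
        if rate < edge:
            return level
    return len(bins) + 1

# Example: 8,200 correct vs. 3,100 incorrect answers.
rate = wrong_rate(8200, 3100)                 # ~0.274
print(rate, difficulty_level(rate))           # -> level 2 under the assumed bins
```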
We regard the error rate as an observation of a question's absolute difficulty; it is related not only to the question difficulty but also to many other random factors. Over time, the error rate of some questions fluctuates slightly, but this does not change the difficulty level of the question, which is precisely the purpose of discretizing the difficulty labels. When multiple users answer the same question, they experience different field difficulty according to their knowledge level and comprehensive ability. Since our difficulty labels come from the behavior of tens of millions of users, we can minimize the deviation caused by users' different backgrounds. Therefore, the absolute difficulty of a question can be inferred from this observation. The correlation between error rate and question difficulty is strong evidence for the quality of the difficulty labels we collected; as shown in Figure 2(b), the correlation between the two is evident.
Because TrueCount, FalseCount, and WrongRate are directly related to a question's difficulty label and are not available at inference time, we do not use them in our model; instead we use them to verify the quality of the dataset. By plotting a stacked difficulty distribution of each chapter's questions, as shown in Figure 2(a), we observe that the difficulty distribution differs considerably across chapters. For example, there are almost no questions with difficulty level 1 in chapters 183-199, which indicates that these chapters are generally difficult. The questions in each chapter involve different sets of knowledge points and examination styles, which in turn affect question difficulty. This shows that additional attributes can help the question comprehension task.

Statistics of the Dataset
The DT-QDC dataset contains three types of questions: true/false questions, single-choice questions, and multiple-choice questions. The distribution of difficulty labels differs considerably across question types, as shown in Figure 3(a). Intuitively, it is difficult to judge the difficulty of a question from its text alone. The visualization of the semantic embedding of each question confirms this, as shown in Figure 3(b): we use BERT embeddings and Principal Component Analysis (PCA) to reduce the dimension from 768 to 3.
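The embedding visualization described above can be reproduced with a short sketch like the one below, assuming 768-dimensional sentence-level BERT vectors have already been extracted; random vectors stand in for the real embeddings here.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for precomputed 768-dim BERT sentence embeddings of the questions.
rng = np.random.default_rng(0)
question_vectors = rng.normal(size=(14933, 768))

pca = PCA(n_components=3)
points_3d = pca.fit_transform(question_vectors)   # (14933, 3) coordinates for a Figure 3(b)-style plot
print(points_3d.shape, pca.explained_variance_ratio_)
```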

Question difficulty Comprehension Model
Given an input question $Q = (w_1, w_2, \dots, w_m)$ with options $o = (y_1, y_2, \dots, y_n)$, explanation $e = (z_1, z_2, \dots, z_l)$, keywords $k = (t_1, t_2, \dots, t_h)$, question type $t$, chapter $c$, and answer $a$, our task is to predict its difficulty level $d$. The architecture of our ORMS-BERT model is depicted in Figure 4. The encoder takes the text and the discrete data as inputs. Two separate BERTs (Devlin et al., 2018) encode the question and the options into contextualized representations, and three separate linear layers learn embeddings of the question type, answer, and chapter. Beyond these standard elements, we also use an attention mechanism with average sequence pooling to model question-option and option-option relationships, and we apply an ordinal regression loss to better capture the nature of difficulty.
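The sketch below is a simplified PyTorch rendering of this architecture under our own assumptions about layer sizes and fusion details; the relation-modeling module is shown separately later, and the options are treated as one concatenated sequence. It illustrates the overall data flow rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class ORMSBertSketch(nn.Module):
    """Illustrative sketch of the described architecture, not the authors' exact code."""

    def __init__(self, num_types, num_answers, num_chapters,
                 num_subtasks=4, hidden=768, tag_dim=32):
        super().__init__()
        self.question_bert = BertModel.from_pretrained("bert-base-chinese")
        self.option_bert = BertModel.from_pretrained("bert-base-chinese")
        # Linear layers over one-hot vectors act as embeddings for the discrete attributes.
        self.type_emb = nn.Linear(num_types, tag_dim)
        self.answer_emb = nn.Linear(num_answers, tag_dim)
        self.chapter_emb = nn.Linear(num_chapters, tag_dim)
        # One logit per binary subtask (4 subtasks for 5 ordinal difficulty levels).
        self.classifier = nn.Linear(2 * hidden + 3 * tag_dim, num_subtasks)

    def forward(self, q_ids, q_mask, o_ids, o_mask, type_1h, ans_1h, chap_1h):
        h_q = self.question_bert(q_ids, attention_mask=q_mask).last_hidden_state
        h_o = self.option_bert(o_ids, attention_mask=o_mask).last_hidden_state
        q_vec = h_q.mean(dim=1)   # average sequence pooling over question tokens
        o_vec = h_o.mean(dim=1)   # average sequence pooling over option tokens
        tags = torch.cat([self.type_emb(type_1h),
                          self.answer_emb(ans_1h),
                          self.chapter_emb(chap_1h)], dim=-1)
        fused = torch.cat([q_vec, o_vec, tags], dim=-1)
        return self.classifier(fused)
```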

Text Representations
We use two separate BERTs to encode the question and the options:
$$h_q = \mathrm{BERT}_q(Q), \qquad h_o = \mathrm{BERT}_o(o),$$
where $h_{q_i}$ and $h_{o_i}$ are the hidden states at the $i$-th time step of the question-BERT and options-BERT, respectively. The question and the options are not consecutive statements; their content is often both related and conflicting, so encoding them with different BERTs learns their representations better, although this also increases the number of model parameters.
We refer to this approach as MS-BERT (Multi-Source BERT).

Relation Modeling
Intuitively, how easily the options can be confused with one another affects question difficulty: if the options differ little from each other, it is hard to choose the correct answer among them. To model the option-option relation, we use an attention mechanism to capture the semantic similarity between options, and then apply average sequence pooling to reduce the representation from $n \times d$ to $1 \times d$, where $d$ is the hidden size.
We use the same method to model the relationship between the question and the options, where Q, K, and V represent the question, the options, and the options, respectively.
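A possible implementation of this relation modeling step is sketched below, assuming standard scaled dot-product attention over per-option (or per-token) vectors followed by mean pooling; the exact formulation in the paper may differ.

```python
import torch
import torch.nn.functional as F

def relation_vector(query, keys):
    """Scaled dot-product attention followed by average sequence pooling.

    query: (n_q, d) vectors used as Q (e.g. the options, or the question)
    keys:  (n_o, d) option vectors used as both K and V
    Returns a single (1, d) relation vector.
    """
    d = query.size(-1)
    scores = query @ keys.transpose(0, 1) / d ** 0.5   # (n_q, n_o) similarity scores
    weights = F.softmax(scores, dim=-1)
    attended = weights @ keys                           # (n_q, d) attended representations
    return attended.mean(dim=0, keepdim=True)           # average sequence pooling -> (1, d)

# option-option relation: options attend to each other.
# question-option relation: Q = question vectors, K = V = option vectors.
options = torch.randn(4, 768)
question = torch.randn(12, 768)
opt_opt = relation_vector(options, options)
q_opt = relation_vector(question, options)
print(opt_opt.shape, q_opt.shape)   # torch.Size([1, 768]) twice
```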

Ordinal Regression Loss
Unlike ordinary classification tasks, the categories in the question difficulty classification task are not completely independent but follow a natural order (Diaz and Marathe, 2019). If our model incorrectly predicts a difficulty-level-1 question as level 2, the penalty should be smaller than when it predicts the same question as level 5. Therefore, we transform the multi-class classification problem into multiple binary classification problems, following (Niu et al., 2016), while taking the relationship between categories into account.
Since the multi-class task is converted into multiple binary classification tasks, the difficulty label also needs to be mapped to a corresponding binary code, where each bit of the code corresponds to one binary classification subtask. Our coding design principles are as follows: • The edit distance between the codes of two categories should reflect the natural distance between the categories.
• Each bit of the code is the label of a subtask.
• The variance of the probability of each bit being 1 should be as small as possible, so that the model learns the real task rather than learning which bit of the binary code is more likely to be 1.
The improved encoding rules are defined separately for odd and even $N_c$, where $N_c$ is the total number of categories, $S(k)$ is the binary code of the $k$-th category, and $[n]_{h,N}$ denotes converting the integer $n$ into its $N$-bit base-$h$ code (binary when $h = 2$). Each bit of the code corresponds to one subtask: when the $c$-th bit of the code is 1, the corresponding subtask $T(c)$ holds, where $x$ is the difficulty label of the current sample; the subtasks are likewise defined separately for odd and even $N_c$. For example, when $N_c$ is 5, the number of binary code digits is $N_c - 1 = 4$, and the binary codes of the categories are given in Table 2. In this odd case, with middle category $m = (N_c + 1)/2 = 3$, the middle category is coded as all zeros, a category $k < m$ has bits $k$ through $m - 1$ set to 1, and a category $k > m$ has bits $m$ through $k - 1$ set to 1.
Table 2: Binary codes of the categories when $N_c$ is 5.
Category      1        2        3        4        5
Binary code   1,1,0,0  0,1,0,0  0,0,0,0  0,0,1,0  0,0,1,1

When $N_c$ is 5, the subtasks corresponding to each bit of the binary code are listed in Table 3. We give each subtask the same weight, so the loss function is
$$\mathcal{L} = \frac{1}{N_s}\sum_{i=1}^{N_s} \ell\left(d_i, \hat{d}_i\right),$$
where $N_s$ is the total number of subtasks, $d_i$ is the true label of the $i$-th subtask, $\hat{d}_i$ is the predicted label of the $i$-th subtask, and $\ell$ is the per-subtask binary classification (cross-entropy) loss. We refer to this approach as Ordinal-Regression Multi-Source BERT (ORMS-BERT).
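The following sketch instantiates the encoding and the loss for $N_c = 5$. The encoding function reproduces the codes in Table 2; the subtask rule (bits left of the middle test "label ≤ c", bits right of it test "label ≥ c + 1") is our reading of that table, and binary cross-entropy is assumed as the per-subtask loss.

```python
import torch
import torch.nn.functional as F

def ordinal_code(label: int, num_classes: int = 5) -> torch.Tensor:
    """Binary code consistent with Table 2: the middle class is all zeros;
    bits to its left mark 'label <= c', bits to its right mark 'label >= c + 1'."""
    mid = (num_classes + 1) // 2                 # 3 when num_classes is 5
    code = torch.zeros(num_classes - 1)
    for c in range(1, mid):                      # subtask for bit c: label <= c
        code[c - 1] = float(label <= c)
    for c in range(mid, num_classes):            # subtask for bit c: label >= c + 1
        code[c - 1] = float(label >= c + 1)
    return code

def ordinal_loss(logits: torch.Tensor, labels) -> torch.Tensor:
    """Equal-weight binary cross-entropy over the N_s subtasks (assumed form of the loss)."""
    targets = torch.stack([ordinal_code(int(l)) for l in labels])
    return F.binary_cross_entropy_with_logits(logits, targets)

print(ordinal_code(1))   # tensor([1., 1., 0., 0.]) -- matches Table 2
print(ordinal_code(5))   # tensor([0., 0., 1., 1.])
print(ordinal_loss(torch.randn(2, 4), [1, 5]))
```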

Experiments
We constructed several baselines, including powerful pre-trained models, for comparison with our proposed model, and report their performance on the new DT-QDC dataset.

Compared Models
For all baseline models, we concatenate all text fields, including the question, options, explanation, and keywords, as the first part of the input. We use gensim 3 to train word2vec vectors (Mikolov et al., 2013) to initialize the embedding layers of the non-pretrained models, and then use TextCNN, Bi-LSTM, Bi-GRU, and BERT to encode the text. The discrete data, including question type, answer, and chapter, form the second part of the input, whose embeddings are learned by linear layers. The two parts are concatenated before a final fully connected layer that predicts the difficulty label. Xu et al. (2020) proposed the BERT-QC model, which enumerates multi-label questions as multiple single-label instances to solve the question classification task. We use their code 4 to obtain results on the DT-QDC dataset. Because their model has no additional structure for discrete data, we concatenate all the discrete data at the end of the text, which may be why BERT-QC performs worse than our BERT baseline.
We split the DT-QDC dataset 8:1:1 into train, validation, and test sets. The performance of the models mentioned above on the test set is shown in Table 4; we report weighted Precision, Recall, and F1-score to make the comparison fairer.
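An 8:1:1 split can be realized as in the sketch below; the random seed, stratification, and toy stand-in data are our own assumptions.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real question records and their difficulty labels.
questions = [f"question_{i}" for i in range(100)]
labels = [(i % 5) + 1 for i in range(100)]

# First carve off 20%, then split that half-and-half into val and test (8:1:1 overall).
train_q, rest_q, train_y, rest_y = train_test_split(
    questions, labels, test_size=0.2, random_state=42, stratify=labels)
val_q, test_q, val_y, test_y = train_test_split(
    rest_q, rest_y, test_size=0.5, random_state=42, stratify=rest_y)
print(len(train_q), len(val_q), len(test_q))   # 80 10 10
```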

Experiments Settings
For TextCNN, Bi-LSTM, and Bi-GRU, the batch size is 512 and the learning rate is 0.001, which gave the best performance. For all BERT-based models, the maximum sequence length for texts is 40, the batch size is 36, and the learning rate is 4e-5. For each experiment, 5 runs were conducted and the results were averaged.

Experiment Results
As shown in Table 4, our best model, ORMS-BERT, outperforms the other baselines by large margins. ORMS-BERT achieves the best Precision, Recall, and F1-score, with absolute improvements of 4.51, 2.88, and 3.18, corresponding to 9.42%, 6.08%, and 6.77%, which indicates that our text representation and relation modeling approach learns the differences between difficulty levels better. ORMS-BERT also achieves the best MAE and MSE, with improvements of 0.11 and 0.32, corresponding to 15.92% and 28.68%, which indicates that our ordinal regression loss helps the model learn the relations between categories. With only a multi-class classification loss, MS-BERT's F1-score is 0.34 lower than ORMS-BERT's and its MSE is 0.11 higher. This shows that the loss we define allows the model to learn the nature of difficulty better.

Ablation Study
Removing the question-option relation and option confusion module causes the F1-score to decrease by 1.6 and the MSE to increase by 0.126. Removing all the tags, including question type, answer, and chapter information, causes the F1-score to decrease by 2.73 and the MSE to increase by 0.181. The impact of removing the tags is greater than that of removing the relations, which shows that additional question attributes play an important role in the question comprehension task.

Conclusion
We proposed the task of question difficulty comprehension and constructed a new dataset, DT-QDC, with real-world user answer records and multi-attribute questions to target it. In addition, we provided a strong model, ORMS-BERT, and compared its performance with several baselines.
In the future, we will explore question difficulty comprehension in broader online education settings. We expect to combine the question difficulty comprehension task with question answering and knowledge tracing to model student performance more scientifically, so as to better assess students' knowledge levels and track their learning progress. Our dataset can also be used to explore related tasks such as question answering, question generation, option generation, question normalization, and question rewriting. The dataset is available online, and we hope it will benefit future research in this field.