A Sentence Cloze Dataset for Chinese Machine Reading Comprehension

Owing to continuous efforts by the Chinese NLP community, more and more Chinese machine reading comprehension datasets have become available. To add diversity to this area, in this paper we propose a new task called Sentence Cloze-style Machine Reading Comprehension (SC-MRC). The proposed task aims to fill the right candidate sentences into a passage that has several blanks. We built a Chinese dataset called CMRC 2019 to evaluate the difficulty of the SC-MRC task. Moreover, to make the task more challenging, we also created fake candidates that are similar to the correct ones, which requires the machine to judge their correctness in context. The proposed dataset contains over 100K blanks (questions) within over 10K passages, which originate from Chinese narrative stories. To evaluate the dataset, we implement several baseline systems based on pre-trained models, and the results show that the state-of-the-art model still underperforms humans by a large margin. We release the dataset and baseline systems to further facilitate our community. Resources are available at https://github.com/ymcui/cmrc2019


Introduction
Machine Reading Comprehension (MRC) is the task of comprehending given articles and answering questions based on them, which is an important ability for artificial intelligence. Recent MRC research originated from cloze-style reading comprehension (Hermann et al., 2015; Hill et al., 2015; Cui et al., 2016), which requires filling in a blank with a word or named entity, and follow-up work on these datasets laid the foundations of this research (Kadlec et al., 2016; Cui et al., 2017; Dhingra et al., 2017).
Later, SQuAD (Rajpurkar et al., 2016) extended the answer from a single word to a span; it has become a representative span-extraction dataset, and the many neural network approaches proposed on it (Wang and Jiang, 2016; Xiong et al., 2016; Wang et al., 2017; Hu et al., 2018; Wang et al., 2018; Yu et al., 2018) further accelerated MRC research.
Besides MRC on English text, we have also seen rapid progress in Chinese MRC research. Cui et al. (2016) proposed the first Chinese cloze-style reading comprehension dataset: People Daily & Children's Fairy Tale (PD&CFT). Later, Cui et al. (2018) proposed another dataset for CMRC 2017, gathered from children's reading books and consisting of both cloze and natural questions. He et al. (2018) proposed a large-scale open-domain Chinese reading comprehension dataset (DuReader), which consists of 200K queries annotated from user query logs of a search engine. In span-extraction MRC, Cui et al. (2019b) proposed the CMRC 2018 dataset for Simplified Chinese, and Shao et al. (2018) proposed the DRCD dataset for Traditional Chinese, both similar to the popular SQuAD dataset (Rajpurkar et al., 2016). Zheng et al. (2019) proposed a large-scale Chinese idiom cloze dataset.
To further test machine comprehension ability, in this paper we propose a new task called Sentence Cloze-style Machine Reading Comprehension (SC-MRC). The proposed task preserves the simplicity of cloze-style reading comprehension but requires sentence-level inference to fill the blanks. Figure 1 shows an example of the proposed dataset. We summarize our contributions in three aspects.
• We propose a new task called Sentence Cloze-style Machine Reading Comprehension (SC-MRC), which aims to test the ability of sentence-level inference.
• We release a challenging Chinese dataset for the SC-MRC task, called CMRC 2019, which consists of 100K blanks and fake candidates.
• Experiments on several state-of-the-art Chinese pre-trained models show that they still fall well short of human performance on the proposed dataset, which indicates its potential for future research.
The Proposed Dataset

Task Definition
Generally, the reading comprehension task can be described as a triple ⟨P, Q, A⟩, where P represents the Passage, Q represents the Question, and A represents the Answer. Specifically, for the sentence cloze-style reading comprehension task, we select several sentences in the passage and replace them with special marks (for example, [BLANK]), forming an incomplete passage. The selected sentences form a candidate list, and the machine should fill in the blanks with these candidate sentences. Note that, to increase the difficulty, we can also add sentences that do not belong to any blank in the passage to the candidate list.

Passage Selection
The raw material of the proposed dataset comes from children's books, containing fairy tales and narratives about historical figures, a genre well suited to testing sentence-level inference, since it requires recovering the correct sentence order of the stories. During passage selection, we restrict the passage length (in characters) to the range of 500 to 750. If a passage is too short, it contains only a few blanks; on the contrary, if it is too long, it is harder for the model to process. After passage selection, we obtained 10K passages and split them into three parts to generate the training, development, and test sets.

Cloze Generation
The sentence cloze task does not require human annotation, as it only requires selecting the blanks, and the selected sentences naturally become the answers. The following rules are applied to generate the candidate sentences.
• The first sentence is skipped, which usually contains important topic information.
• Sentences are selected based on comma or period boundaries, and are restricted to a length of 10 to 30 characters. Note that we eliminate the comma or period at the end of each candidate sentence.
• If a part of a long sentence is selected, we do not choose its other parts, avoiding consecutive blanks.

Fake Candidates
To increase the difficulty of this task and better test the ability of machine reading comprehension, we propose to add fake candidates to confuse the system. In this way, the machine should not only predict the correct order of the candidate sentences but also identify the fake candidates that do not belong to any passage blank. A good fake candidate should have the following characteristics.
• Its topic should be the same as that of the passage.
• If there are named entities in a fake candidate, they should also appear in the passage.
• It should NOT be a machine-generated sentence; otherwise, it would be very easy to pick the fake one out.
A natural way to generate fake candidates is human annotation. However, annotating many fake candidates is rather time-consuming. To minimize the cost of human annotation, in this paper we propose a novel approach to generating fake candidates that satisfies the requirements above.
Typically, a complete story is so long that we must truncate it for easy processing by machines. In this context, we can directly pick sentences outside the truncated passage but within the same story. As these sentences are still from the same story, their topic and named entities are consistent with the main passage. Also, each is a part of the original story, and thus a natural sentence rather than a machine-generated one. Using the strategies above, we can generate many fake candidates and mix them with the correct candidates to form the final candidate list.
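This generation procedure can be sketched as follows (a minimal illustration; the function name, the sentence-level truncation, and taking the first leftover sentences are simplifying assumptions, not the authors' exact implementation):

```python
def make_candidates(story_sents, passage_len, n_blanks, n_fake):
    """Truncate a story to its first passage_len sentences; answers come
    from inside the passage, fake candidates from the leftover sentences
    of the same story (so topic and named entities stay consistent)."""
    passage = story_sents[:passage_len]
    leftover = story_sents[passage_len:]
    # Skip the first sentence of the passage, per the cloze-generation rules
    answers = passage[1:1 + n_blanks]
    # Leftover sentences are natural, not machine-generated
    fakes = leftover[:n_fake]
    return answers, fakes
```

In practice, the answer sentences would additionally be filtered by the length and punctuation rules described in the Cloze Generation section.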

Statistics
We name our dataset CMRC 2019, as it was also used in the third evaluation workshop on Chinese machine reading comprehension. The general statistics of the final data are given in Table 1, and comparisons with other Chinese MRC datasets are shown in Table 2. As we can see, the proposed dataset fills the gap left by the absence of sentence-level inferential reading comprehension datasets.

Baseline System
In this paper, we mainly adopt BERT and its variants for our baseline systems.
• Input Sequence: Given a passage p and its n answer options {a_1, a_2, ..., a_n}, we first replace the blanks in p with special tokens [unusedNum] from the BERT vocabulary, where Num ranges from 0 to the number of blanks − 1, to fit the input format of BERT. Then, for each a_i in the answer options, we concatenate a_i and p with the [SEP] token to form the input sequence.
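The input construction can be sketched in plain Python (a hedged illustration; the blank marker string and helper name are assumptions, and a real system would let the BERT tokenizer handle the special tokens):

```python
def build_input(passage, option, num_blanks, blank_mark="[BLANK]"):
    """Replace each blank with a distinct [unusedN] token, then
    concatenate the candidate option and the passage with [SEP]."""
    for i in range(num_blanks):
        # Replace one occurrence at a time so each blank gets its own token
        passage = passage.replace(blank_mark, f"[unused{i}]", 1)
    return f"{option} [SEP] {passage}"

seq = build_input("A [BLANK] B [BLANK] C", "candidate", 2)
# "candidate [SEP] A [unused0] B [unused1] C"
```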

• Main Model
The input sequence of length l is fed into BERT to obtain hidden representations H ∈ R^{l×d}. The dot product of H with trainable parameters w ∈ R^d gives the logits t = H · w, where t ∈ R^l. Finally, the probabilities of the blanks for the current option are calculated by a softmax over the logits with only the blank positions unmasked. The training objective is to minimize the cross-entropy between the predicted probabilities and the ground-truth positions.
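The masked softmax over blank positions can be illustrated with NumPy (a sketch of the computation described above, not the actual model code; shapes follow the notation H ∈ R^{l×d}, w ∈ R^d):

```python
import numpy as np

def blank_probabilities(H, w, blank_positions):
    """t = H . w gives per-position logits; the softmax is taken with
    only blank positions unmasked (all other positions set to -inf)."""
    t = H @ w                                   # logits, shape (l,)
    masked = np.full_like(t, -np.inf)
    masked[blank_positions] = t[blank_positions]
    e = np.exp(masked - masked[blank_positions].max())  # stable softmax
    return e / e.sum()                          # zero outside the blanks
```

Training then minimizes the cross-entropy between this distribution and the one-hot ground-truth blank position for the option.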

• Decoding
The model outputs, for each answer option, the probabilities of the blanks it can fill. These need to be transformed into predictions for the blanks, i.e., which option each blank chooses. The simple method we use is: among all answer options for a passage, take the option that assigns the highest probability to a blank as the prediction for that blank (each option is allowed to be the prediction for multiple blanks).

Evaluation Metrics
We adopt two metrics to evaluate the systems on our datasets, namely Question-level Accuracy (QAC) and Passage-level Accuracy (PAC).

Question-level Accuracy (QAC)
The Question-level Accuracy (QAC) is calculated as the ratio of correctly predicted blanks to the total number of blanks.

Passage-level Accuracy (PAC)
Similar to QAC, the Passage-level Accuracy (PAC) measures how many passages have been answered completely correctly: we count only the passages in which all blanks are correctly predicted.
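Both metrics can be computed in a few lines (a sketch under the assumption that gold and pred are per-passage lists of blank answers; the function name is hypothetical):

```python
def qac_pac(gold, pred):
    """QAC: fraction of all blanks predicted correctly.
    PAC: fraction of passages whose blanks are ALL predicted correctly."""
    total = correct = perfect = 0
    for g, p in zip(gold, pred):
        hits = sum(x == y for x, y in zip(g, p))
        correct += hits
        total += len(g)
        perfect += int(hits == len(g))  # passage counts only if all correct
    return correct / total, perfect / len(gold)
```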

Human Performance
To evaluate human performance, we invited qualified annotators to solve the sentence cloze task on 100 passages each from the development and test sets, resulting in 1,016 and 1,027 blanks, respectively.
We then calculate QAC and PAC to roughly estimate human performance on this dataset.

Experiments
We adopt Chinese BERT-base and BERT-multilingual (https://github.com/google-research/bert), as well as the Chinese BERT-wwm and RoBERTa (Liu et al., 2019) models with whole word masking released by Cui et al. (2019a). All models are trained for 3 epochs on a Tesla V100 GPU, with an initial learning rate of 3e-5, a maximum sequence length of 512, and a batch size of 24. The implementation is based on PyTorch (Paszke et al., 2017) with the Transformers library (Wolf et al., 2019).
The baseline results are shown in Table 3. As we can see, the Chinese BERT-base model gives a QAC of 71.2 and 71.0 on the development and test sets, respectively. However, on the PAC metric, it achieves an accuracy below 10, which suggests there is plenty of room to optimize the sentence cloze procedure by considering not only each single cloze but also the coherence of the whole passage. BERT with whole word masking substantially outperforms the original BERT implementation, and using the large model gives a further significant boost on both QAC and PAC. Comparing the RoBERTa-wwm-ext-large model with human performance, though there is only a gap of 13.3 on QAC, there is a significant gap on PAC, which also suggests that more attention should be paid to the accuracy of the passage as a whole.

Conclusion
In this paper, we proposed a new task called Sentence Cloze-style Machine Reading Comprehension (SC-MRC) and released a Chinese dataset, CMRC 2019, for evaluating sentence-level inference ability. The proposed CMRC 2019 dataset contains both real and fake candidate sentences for filling the clozes, which requires the machine not only to choose the correct sentences but also to distinguish real sentences from fake ones. We built baseline models based on pre-trained language models, and the results show that the state-of-the-art model still underperforms humans, especially on the PAC metric. We hope the release of this dataset brings language diversity to the machine reading comprehension task and accelerates further investigation into solving questions that need comprehensive reasoning over multiple clues.

Table 1 :
Statistics of the CMRC 2019 dataset.

Table 2 :
Comparisons of Chinese MRC datasets. NQ represents natural questions.

Table 3 :
Baseline results. Unless indicated, all models are Chinese pre-trained models. L: large model.