A Span-Extraction Dataset for Chinese Machine Reading Comprehension

Machine Reading Comprehension (MRC) has become enormously popular recently and has attracted a lot of attention. However, existing reading comprehension datasets are mostly in English. In this paper, we introduce a span-extraction dataset for Chinese machine reading comprehension to add language diversity in this area. The dataset is composed of nearly 20,000 real questions annotated on Wikipedia paragraphs by human experts. We also annotated a challenge set containing questions that require comprehensive understanding and multi-sentence inference throughout the context. We present several baseline systems as well as anonymous submissions to demonstrate the difficulty of this dataset. With the release of the dataset, we hosted the Second Evaluation Workshop on Chinese Machine Reading Comprehension (CMRC 2018). We hope the release of the dataset will further accelerate Chinese machine reading comprehension research. Resources are available at: https://github.com/ymcui/cmrc2018


Introduction
Reading and comprehending natural language is key to achieving advanced artificial intelligence. Machine Reading Comprehension (MRC) aims to comprehend the context of given articles and answer questions based on them. Various types of machine reading comprehension datasets have been proposed, such as cloze-style reading comprehension (Hermann et al., 2015; Hill et al., 2015; Cui et al., 2016), span-extraction reading comprehension (Rajpurkar et al., 2016; Trischler et al., 2016), open-domain reading comprehension (Nguyen et al., 2016; He et al., 2017), multiple-choice reading comprehension (Richardson et al., 2013; Lai et al., 2017), etc. Along with the development of reading comprehension datasets, various neural network approaches have been proposed and have made significant advances in this area (Kadlec et al., 2016; Dhingra et al., 2017; Wang and Jiang, 2016; Xiong et al., 2016; Yu et al., 2018).
We have also seen various efforts on the construction of Chinese machine reading comprehension datasets. In cloze-style reading comprehension, Cui et al. (2016) proposed a Chinese cloze-style reading comprehension dataset: People's Daily & Children's Fairy Tale. To add difficulty to the dataset, along with the automatically generated evaluation sets (development and test), they also released a human-annotated evaluation set. Later, Cui et al. (2018) proposed another dataset, gathered from children's reading material. To add more diversity and to enable further investigation of transfer learning, they also provided another evaluation dataset, likewise annotated by human experts, but with queries more natural than the cloze type. That dataset was used in the first evaluation workshop on Chinese machine reading comprehension (CMRC 2017). In open-domain reading comprehension, He et al. (2017) proposed a large-scale open-domain Chinese machine reading comprehension dataset (DuReader), which contains 200k queries annotated from user query logs of a search engine. Shao et al. (2018) proposed a reading comprehension dataset in Traditional Chinese.
Though we have seen that current machine learning approaches have surpassed human performance on the SQuAD dataset (Rajpurkar et al., 2016), we wonder whether these state-of-the-art models could give similar performance on datasets in other languages. To further accelerate machine reading comprehension research, we propose a span-extraction dataset for Chinese machine reading comprehension. Figure 1 shows an example of the proposed dataset. The main contributions of our work can be summarized as follows.

[Figure 1: An example from the proposed dataset. The passage summarizes "The Adventure of the Yellow Face", one of the 56 short Sherlock Holmes stories written by Sir Arthur Conan Doyle and the third tale from The Memoirs of Sherlock Holmes, followed by a question and its answer span.]
• We propose a Chinese span-extraction reading comprehension dataset which contains nearly 20,000 human-annotated questions, adding linguistic diversity to the reading comprehension field.
• To thoroughly test the ability of MRC systems, besides the development and test sets, we also provide a challenge set containing carefully annotated questions that require reasoning over various clues in the passage. BERT-based approaches achieve under 50% F1-score on this set, indicating its difficulty.
• The proposed Chinese RC data could also serve as a resource for cross-lingual research when studied alongside SQuAD and other similar datasets.

Task Definition
Generally, the reading comprehension task can be described as a triple ⟨P, Q, A⟩, where P represents the Passage, Q represents the Question, and A represents the Answer. Specifically, for the span-extraction reading comprehension task, the question is annotated by humans, which is much more natural than in cloze-style MRC datasets (Hill et al., 2015; Cui et al., 2016). The answer A should be a span directly extracted from the passage P. Following most work on SQuAD, the task can be simplified to predicting the start and end pointers in the passage (Wang and Jiang, 2016).
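The start/end-pointer simplification can be sketched as follows. This is an illustrative decoding routine, not the paper's implementation: it assumes the model outputs per-position start and end scores over the passage, and all names below are hypothetical.

```python
def extract_span(passage, start_scores, end_scores, max_answer_len=30):
    """Pick the (start, end) character pair with the highest combined
    score, subject to start <= end and a maximum answer length."""
    best_score, best_span = float("-inf"), (0, 0)
    for i, s in enumerate(start_scores):
        # Only consider end positions after the start, within the limit.
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score = s + end_scores[j]
                best_span = (i, j)
    start, end = best_span
    return passage[start:end + 1]
```

The length cap mirrors the annotation rule that long answers (over 30 characters) are discarded.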

Data Pre-Processing
We downloaded the Chinese portion of the Wikipedia webpage dump on Jan 22, 2018 and used the open-source toolkit Wikipedia Extractor to preprocess the raw files into plain text. We also converted Traditional Chinese characters into Simplified Chinese for normalization purposes using the opencc toolkit.

Human Annotation
The questions in the proposed dataset are completely annotated by human experts, unlike previous works that rely on automatic data generation (Hermann et al., 2015; Hill et al., 2015; Cui et al., 2016). Before annotating, each document is divided into several passages, and each passage is limited to no more than 500 Chinese words, where words are counted using LTP (Che et al., 2010). Then, the annotators were instructed to first evaluate the appropriateness of each passage, because some passages are extremely difficult for the general public to understand. The following rules are applied when discarding passages.
• The passage contains many professional terms that are hard to understand.
• The passage contains many special characters and symbols.
• The passage is written in Classical Chinese, which is substantially different from modern Chinese.
After confirming that a passage is appropriate for annotation, the annotator reads the passage, asks questions based on it, and annotates a primary answer. During question annotation, the following rules are applied.
• No more than five questions for each passage.
• The answer MUST be a span in the passage to meet the task definition.
• Encourage the question diversity, such as who/when/where/why/how, etc.
• Avoid directly copying descriptions from the passage. Use paraphrasing or syntactic transformation to make answering harder.
• Long answers (say over 30 characters) will be discarded.
For the evaluation sets (development, test, and challenge), three answers are available for better evaluation. Besides the primary answer annotated by the question proposer, we also invited two additional annotators to write second and third answers for each question. During this phase, the annotators could not see the primary answer, ensuring that answers were not copied from each other and encouraging diversity in the answers.

Challenge Set
In order to examine how well reading comprehension models can deal with questions that require comprehensive reasoning over various clues in the context, we additionally annotated a small challenge set while keeping the span-extraction style. The annotation was done by three annotators in a similar way to the development and test sets. Figure 1 shows an example from the challenge set. A question should meet the following standards to be qualified for this set.
• If the answer is a single word or short phrase, it cannot be inferable from a single sentence in the passage alone. We encourage annotators to ask questions that require comprehensive reasoning over the passage to increase the difficulty.
• If the answer belongs to a type of named entity or a specific genre (such as a date, color, etc.), it cannot be the only instance of that type in the context; otherwise the machine could easily pick it out by type alone. For example, if only one person name appears in the context, it cannot be used for annotating questions. There should be at least two person names that could mislead the machine.

Statistics
The general statistics of the pre-processed data are given in Table 1. The question type distribution of the development set is given in Figure 2.

Evaluation Metrics
In this paper, we adopt two evaluation metrics following Rajpurkar et al. (2016). However, as the Chinese language is quite different from English, we adapt the original metrics in the following ways. Note that common punctuation and whitespace are ignored for normalization.

Exact Match
Measures the exact match between the prediction and the ground truths: the score is 1 for an exact match and 0 otherwise. This is the same as the metric proposed by Rajpurkar et al. (2016).
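A minimal sketch of the metric, under the normalization described above. The exact punctuation set is an assumption for illustration, not taken from the official evaluation script.

```python
import string

# Illustrative punctuation set: ASCII punctuation plus common
# full-width Chinese punctuation (an assumption, not the official list).
PUNCT = set(string.punctuation) | set("，。！？、；：「」『』（）《》…·")

def normalize(text):
    """Remove punctuation and whitespace before comparison."""
    return "".join(ch for ch in text if ch not in PUNCT and not ch.isspace())

def exact_match(prediction, ground_truths):
    """1 if the normalized prediction equals any normalized ground truth."""
    pred = normalize(prediction)
    return int(any(pred == normalize(gt) for gt in ground_truths))
```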

F1-Score
Measures the character-level fuzzy match between the prediction and the ground truths. Instead of treating the predictions and ground truths as bags of words, we calculate the length of the longest common sequence (LCS) between them and compute the F1-score accordingly. We take the maximum F1 over all of the ground-truth answers for a given question. Note that non-Chinese words are not segmented into characters.
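The LCS-based F1 computation can be sketched as below. This simplified version operates directly on characters and omits the special handling of non-Chinese words and the normalization step described earlier.

```python
def lcs_length(a, b):
    """Dynamic-programming length of the longest common subsequence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ca == cb else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def f1_score(prediction, ground_truths):
    """Maximum LCS-based F1 over all ground-truth answers."""
    best = 0.0
    for gt in ground_truths:
        lcs = lcs_length(prediction, gt)
        if lcs == 0:
            continue  # no overlap: F1 is 0 for this ground truth
        precision = lcs / len(prediction)
        recall = lcs / len(gt)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```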

Estimated Human Performance
We also report the estimated human performance in order to measure the difficulty of the proposed dataset. As illustrated in the previous section, there are three answers for each question in the development, test, and challenge sets. Unlike Rajpurkar et al. (2016), we use a cross-validation method to calculate the performance: we iteratively regard the first, second, and third answer as the human prediction and treat the remaining answers as ground truths, yielding three human prediction scores. Finally, we take the average of the three results as the final estimated human performance on this dataset.
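The cross-validation procedure above can be sketched as follows; the scoring function (e.g. exact match or F1, taking a prediction and a list of ground truths) is passed in, and the function name is illustrative.

```python
def human_performance(answers, score_fn):
    """Average score over leave-one-out rounds: each answer is treated
    in turn as the prediction, scored against the remaining answers."""
    total = 0.0
    for i, prediction in enumerate(answers):
        ground_truths = answers[:i] + answers[i + 1:]
        total += score_fn(prediction, ground_truths)
    return total / len(answers)
```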

Baseline System
Following Devlin et al. (2019), we adopt BERT for our baseline system. Specifically, we slightly modify the run_squad.py script to fit our dataset, while keeping most of the original implementation. For the baseline system, we used an initial learning rate of 3e-5 with a batch size of 32 and trained for two epochs. The maximum lengths of the document and query are set to 512 and 64, respectively.
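With those hyperparameters, a training run could be launched roughly as below. This is a sketch assuming the flag names of the original run_squad.py from google-research/bert; all file paths and the output directory are placeholders, not from the paper.

```shell
# Illustrative invocation; paths below are placeholders.
python run_squad.py \
  --vocab_file=chinese_bert/vocab.txt \
  --bert_config_file=chinese_bert/bert_config.json \
  --init_checkpoint=chinese_bert/bert_model.ckpt \
  --do_train=True \
  --train_file=cmrc2018_train.json \
  --do_predict=True \
  --predict_file=cmrc2018_dev.json \
  --learning_rate=3e-5 \
  --train_batch_size=32 \
  --num_train_epochs=2.0 \
  --max_seq_length=512 \
  --max_query_length=64 \
  --output_dir=output/
```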

Results
The results are shown in Table 2. Besides the baseline systems, we also include the participants' results from the CMRC 2018 evaluation. We released the training and development sets to the public and accepted submissions from participants to evaluate their models on the hidden test and challenge sets, preserving the integrity of the evaluation process following Rajpurkar et al. (2016). As we can see, most participants obtained over 80 F1 on the test set. However, the gap between the EM and F1 metrics is substantially larger than on the SQuAD dataset (where it is usually within 10 points). This suggests that determining the exact span boundary plays a key role in improving system performance on Chinese machine reading comprehension.
Not surprisingly, as shown in the last column of Table 2, though the top-ranked systems obtain decent scores on the development and test sets, they fail to give satisfactory results on the challenge set. In contrast, the estimated human performance on the development, test, and challenge sets is relatively similar, with the challenge set only slightly lower. We also observed that though Z-Reader obtains the best scores on the test set, it fails to give consistent performance on the EM metric of the challenge set. This suggests that current reading comprehension models are not yet capable of handling difficult questions that require comprehensive reasoning over several clues in the passage.
BERT-based approaches show competitive performance against participants' submissions. Though traditional models achieve higher scores on the test set, on the challenge set the BERT-based baselines are consistently higher, demonstrating that the rich representations provided by BERT are beneficial for solving harder questions and generalize well across both easy and hard questions.

Conclusion
In this work, we propose a span-extraction dataset for Chinese machine reading comprehension. The dataset is annotated by human experts with nearly 20,000 questions, as well as a challenge set composed of questions that require reasoning over multiple clues. The evaluation results show that machines can achieve excellent scores on the development and test sets, only about 10 points below the estimated human performance in F1-score. However, on the challenge set, scores decline drastically while human performance remains almost the same as on the non-challenge sets, indicating that designing more sophisticated models to improve performance remains an open challenge. We hope the release of this dataset brings language diversity to the machine reading comprehension task and accelerates further investigation into solving questions that require comprehensive reasoning over multiple clues.

Open Challenge
We would like to invite more researchers to experiment on our CMRC 2018 dataset and evaluate on the hidden test and challenge sets to further test the generalization of their models. You can follow the instructions on our CodaLab worksheet to submit your model via https://bit.ly/2ZdS8Ct