DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications

This paper introduces DuReader, a new large-scale, open-domain Chinese machine reading comprehension (MRC) dataset, designed to address real-world MRC. DuReader has three advantages over previous MRC datasets: (1) data sources: questions and documents are based on Baidu Search and Baidu Zhidao; answers are manually generated. (2) question types: it provides rich annotations for more question types, especially yes-no and opinion questions, which leaves more opportunity for the research community. (3) scale: it contains 200K questions, 420K answers and 1M documents; it is the largest Chinese MRC dataset so far. Experiments show that human performance is well above current state-of-the-art baseline systems, leaving plenty of room for the community to make improvements. To help the community make these improvements, both DuReader and baseline systems have been posted online. We also organize a shared competition to encourage the exploration of more models. Since the release of the task, there have been significant improvements over the baselines.


Introduction
For human beings, reading comprehension is a basic ability for acquiring knowledge. We believe it is one of the crucial abilities a machine must have in order to acquire knowledge by reading the whole web and to answer open-domain questions. Such an ability is considered to be of great value for next-generation search engines and intelligent agent products. However, Machine Reading Comprehension (MRC) is extremely challenging, since it involves several difficult tasks such as comprehension, inference and summarization.
Recently, several MRC datasets have been released, greatly inspiring research in this field. A series of neural network models, such as Match-LSTM (Wang and Jiang, 2017), BiDAF (Seo et al., 2016) and R-net (Wang et al., 2017), have been proposed, achieving promising results on a variety of MRC evaluation tasks.
However, most existing MRC datasets have limitations due to their synthetic data, simplified tasks or constrained domains. Studies on these datasets therefore differ from real-world comprehension tasks. In detail, cloze-style MRC (Hermann et al., 2015; Hill et al., 2015; Cui et al., 2016) simplifies the task into word prediction on synthetic data created by blanking out words. Multiple-choice MRC (Lai et al., 2017) tests comprehension ability via option selection on examination data. Question answering based MRC (Trischler et al., 2017; Rajpurkar et al., 2016; Joshi et al., 2017) usually casts reading comprehension as predicting a span in a news article, a Wikipedia entry or another document for a given question.
Although such simplifications and constraints facilitate data construction and model design, they bring some undesired problems. By analyzing questions that real users submitted to the Baidu search engine, we found that current datasets cover only some types of questions, leaving other types, such as opinion questions and complex description questions, not well studied. Furthermore, recent studies (Chen et al., 2016; Jia and Liang, 2017) have shown that current MRC models can achieve high performance on many of these datasets with limited comprehension or inference.
Therefore, it is necessary to build real-world reading comprehension datasets in the open domain. An English dataset, MS-MARCO (Nguyen et al., 2016), was released under this consideration; its questions and documents were collected from a search engine, and its answers were generated by human annotators. In this paper, we propose DuReader, a new large-scale, human-annotated MRC dataset in Chinese, aiming to tackle real-world MRC problems. Besides the merits that its questions are open-domain and extracted from real application data, DuReader has the following characteristics compared to previous datasets.
1. DuReader provides rich annotations for question types. In particular, DuReader annotates yes-no and opinion questions, which make up a large proportion of real users' questions but have not been well studied before. Answering opinion questions usually requires inference over and summarization of multiple pieces of evidence, which is challenging even for humans.
2. The first release of DuReader contains 200K questions, 1M documents and more than 420K human-summarized answers. To the best of our knowledge, DuReader is the largest Chinese MRC dataset so far.
The comparison of some key properties of DuReader and the existing datasets is shown in Table 1.
We implemented two state-of-the-art MRC models, i.e., Match-LSTM (Wang and Jiang, 2017) and BiDAF (Seo et al., 2016), on DuReader. We find that the performance of these models is far below human performance, which suggests that there is much room for researchers to improve MRC models on the DuReader dataset.

Analysis of Questions in Search Engine
In this section, we analyze the distribution of questions in Baidu Search data. We randomly sample 1,000 questions from one day's search log and manually annotate them from two different views. From the view of answer type, we classify the questions into three kinds: Entity, Description and YesNo. For Entity questions, the answers are expected to be a single entity or a list of entities. For Description questions, the answers are usually multi-sentence summaries. This kind includes how/why questions, questions comparing the functions of two or more objects, questions about the merits/demerits of a product, etc. For YesNo questions, the answers are expected to be affirmative or negative, with supporting evidence.
After a deeper investigation of the questions, we found that, whichever answer type a question belongs to, it can be further classified as Fact or Opinion, depending on whether it is about a fact or an opinion. Table 2 shows some examples.
For each question in the sampled data, we label it from two views: one is the answer type it belongs to, the other is whether it is about fact or opinion. In this way, the questions can be classified into six classes. The distribution of the questions in the sample data is shown in Table 3.
From the distribution, some interesting phenomena are observed:
1. The Entity-Fact questions, also known as factoid questions, which have been widely studied in previous work, account for only 23.4%.
2. Over half of the questions (52.5%) are Description questions. Previous studies mostly focus on Description-Fact questions.
3. YesNo questions account for 15.6%, with one half about fact and the other half about opinion.
4. More than one-third of the questions are Opinion questions, which are seldom addressed in previous research.
To the best of our knowledge, this is the first time an MRC dataset has been analyzed from two different views. Some of the question types have been widely studied in previous work, while others, especially YesNo questions and Opinion questions, deserve more attention from researchers. Hopefully, our dataset can promote further research on them.

DuReader Dataset
In this section, we introduce the data collection and annotation process of DuReader. The DuReader dataset can be considered as a set of quadruples {q, t, D, A}, defined as: (a) the question q; (b) the question type t; (c) the relevant document set $D = \{d_1, d_2, \ldots, d_{|D|}\}$, which contains |D| documents; (d) the reference answer set A, which is generated by human annotators.
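The quadruple can be pictured as a small record type. The following is only an illustrative sketch: the class and field names, and the sample question, are assumptions made here for exposition and do not reflect the released file format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class AnswerType(Enum):
    """First annotation view: the kind of answer a question expects."""
    ENTITY = "Entity"
    DESCRIPTION = "Description"
    YES_NO = "YesNo"


class FactOrOpinion(Enum):
    """Second annotation view: whether the question is about fact or opinion."""
    FACT = "Fact"
    OPINION = "Opinion"


@dataclass
class Document:
    # Hypothetical layout: a title plus the paragraphs of the main content.
    title: str
    paragraphs: List[str]


@dataclass
class Example:
    question: str                      # q
    answer_type: AnswerType            # t, first view
    fact_or_opinion: FactOrOpinion     # t, second view
    documents: List[Document] = field(default_factory=list)  # D
    answers: List[str] = field(default_factory=list)         # A

    def question_class(self) -> str:
        """Name of one of the six classes obtained by crossing the two views."""
        return f"{self.answer_type.value}-{self.fact_or_opinion.value}"


# All six classes from crossing the two views.
ALL_CLASSES = [f"{a.value}-{f.value}" for a in AnswerType for f in FactOrOpinion]

# An invented example, used purely to exercise the types above.
example = Example(
    question="Is green tea good for sleep?",
    answer_type=AnswerType.YES_NO,
    fact_or_opinion=FactOrOpinion.OPINION,
)
print(example.question_class())
```

Keeping the two views as separate fields mirrors the two-pass annotation: either label can be inspected on its own, and the six-way class is derived rather than stored.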

Data Collection
To collect questions for DuReader, we first randomly sampled frequently occurring queries from Baidu search engine query logs. Questions were filtered from the queries using a binary classifier, and the results were then double-checked by human annotators. 200K questions are retained in this release.
We then collected relevant documents for the questions. Two sources were explored, i.e., search results of the Baidu search engine, and Baidu Zhidao. In detail, the 200K questions were divided into two sets. For the first half, we searched each question in the Baidu search engine and kept the top-5 search results, while for the second half, we searched each question in Zhidao's site search and also kept the top-5 results. The reason why we use Zhidao as a source of relevant documents is that the User Generated Content (UGC) nature of Zhidao makes its documents different from random web pages on the Internet. In particular, for opinion questions, there are more answers in Zhidao. For each document, we extracted the title and main contents, which were then word-segmented using the open API of the Baidu AI platform.
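The collection step can be sketched as below. This is a hypothetical illustration only: `search_web` and `search_zhidao` are placeholder callables standing in for the two search services, not real APIs, and the split by list position is an assumption about how the two halves were formed.

```python
from typing import Callable, Dict, List

# A search function maps a question to a ranked list of document identifiers.
SearchFn = Callable[[str], List[str]]


def collect_documents(
    questions: List[str],
    search_web: SearchFn,      # placeholder for general web search
    search_zhidao: SearchFn,   # placeholder for Zhidao site search
    top_k: int = 5,
) -> Dict[str, List[str]]:
    """Send the first half of the questions to one source and the second
    half to the other, keeping the top_k results for each question."""
    half = len(questions) // 2
    collected: Dict[str, List[str]] = {}
    for index, question in enumerate(questions):
        search = search_web if index < half else search_zhidao
        collected[question] = search(question)[:top_k]
    return collected


# Toy stand-ins that return more than five results so truncation is visible.
def fake_web(q: str) -> List[str]:
    return [f"web:{q}:{i}" for i in range(8)]


def fake_zhidao(q: str) -> List[str]:
    return [f"zhidao:{q}:{i}" for i in range(8)]


docs = collect_documents(["q1", "q2", "q3", "q4"], fake_web, fake_zhidao)
```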

Question Type Annotation
According to the two-dimensional question types introduced in Section 2, the annotators were asked to label each question in a two-pass manner. In the first pass, the annotators classified all the questions into three types: Entity, Description and YesNo. In the second pass, the annotators labeled each question as either Fact or Opinion. The distribution of questions of different types in DuReader is shown in Table 4. Note that the distribution of question types in DuReader (Table 4) is different from that in Baidu Search (Table 3). This is mainly because the statistics in Table 4 are type-based, since we keep only one instance in the dataset for identical questions, while the statistics in Table 3 are frequency-based.
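The difference between frequency-based and type-based statistics can be seen on a toy log. The sketch below uses invented questions and labels; the function names are assumptions introduced for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Each log entry is (question text, question type); repeated questions
# appear once per occurrence in the log.
LogEntry = Tuple[str, str]


def frequency_based(log: List[LogEntry]) -> Dict[str, float]:
    """Distribution over types, counting every occurrence in the log."""
    counts = Counter(qtype for _, qtype in log)
    total = sum(counts.values())
    return {qtype: n / total for qtype, n in counts.items()}


def type_based(log: List[LogEntry]) -> Dict[str, float]:
    """Distribution over types after keeping one instance per distinct question."""
    unique = dict(log)  # later duplicates overwrite earlier ones; one entry each
    counts = Counter(unique.values())
    total = sum(counts.values())
    return {qtype: n / total for qtype, n in counts.items()}


# Invented log: one popular Entity question asked three times,
# one Description question asked once.
toy_log = [("q1", "Entity")] * 3 + [("q2", "Description")]

freq = frequency_based(toy_log)
types = type_based(toy_log)
```

On this toy log the Entity share falls from three quarters to one half once duplicates are removed, which is the kind of shift that separates the two tables.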

Answer Annotation
For the answer annotation, we employed crowdsourcing workers to generate answers for each question based on the relevant documents. Specifically, each question and its relevant documents were shown to an annotator, who was asked to generate answers in his or her own words by reading and summarizing the documents. If more than one answer could be found in the relevant documents, the annotator was required to write down all of them. Answers that were very similar to each other were merged into one. The answers were spot-checked to guarantee that their accuracy is high enough. For the Entity questions, the answers include both the entities and the sentences containing them. For the YesNo questions, the answers include the opinion types (Yes, No or Depend) as well as the supporting sentences. (See the last example in Table 9.)

Data Analyzing
Statistics on length. On average, each question has 4.8 words and each answer has 69.6 words. The average length of the documents is 396.0 words, which is about 5 times longer than those in MS-MARCO. The reason is that we kept the full body of each relevant document, whereas MS-MARCO uses only a certain paragraph.
Statistics on answer numbers. Figure 1 shows the statistics on the number of answers. We can see that 1.5% of Baidu Search questions have zero answers, but this number increases to 9.7% for Baidu Zhidao. Meanwhile, Baidu Zhidao has a larger proportion of multi-answer questions than Baidu Search (70.8% vs. 62.2%), which may be explained by UGC answers being more subjective and therefore more diverse.
Difficulty analysis of DuReader. In order to understand how difficult the questions in DuReader are to answer, we calculate the distribution of the minimum edit distance (ED) between the answers generated by humans and the original documents. The larger the ED, the more summarization and modification has been performed by the annotators, which requires more complex methods for modeling the problem. For span-answer datasets, such as SQuAD (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2017) and TriviaQA (Joshi et al., 2017), the ED score should be zero, since the answers are all directly extracted from the document. Figure 2 shows the distribution of ED scores between the answers and documents in DuReader, compared with those in MS-MARCO. We can see that for 77.1% of answers in MS-MARCO the ED is below 3. In contrast, 51.3% of DuReader answers have an ED score over 10, from which it can be inferred that DuReader requires more complex techniques such as text summarization and generation.
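One way to compute such a score is sketched below. It assumes that ED is the smallest edit distance between the answer and any contiguous span of the document; the precise definition is not given in this text, so this reading is an assumption, chosen because it yields zero for extracted spans as stated above. Tokenization and unit edit costs are likewise assumptions.

```python
from typing import List, Sequence


def min_edit_distance_to_span(answer: Sequence[str], document: Sequence[str]) -> int:
    """Smallest Levenshtein distance between `answer` and any contiguous
    span of `document` (approximate substring matching).

    The first DP row is all zeros, so a match may start at any document
    position for free; taking the minimum over the last row lets it end
    at any position as well.
    """
    prev: List[int] = [0] * (len(document) + 1)
    for i, a_tok in enumerate(answer, start=1):
        # Matching i answer tokens against an empty span costs i deletions.
        curr = [i] + [0] * len(document)
        for j, d_tok in enumerate(document, start=1):
            substitution = 0 if a_tok == d_tok else 1
            curr[j] = min(
                prev[j] + 1,                 # drop an answer token
                curr[j - 1] + 1,             # absorb an extra document token
                prev[j - 1] + substitution,  # match or substitute
            )
        prev = curr
    return min(prev)


document_tokens = ["a", "b", "c", "d"]
extracted = min_edit_distance_to_span(["b", "c"], document_tokens)
edited = min_edit_distance_to_span(["b", "x", "c"], document_tokens)
```

An answer copied verbatim from the document scores zero, while each inserted, dropped or changed token adds one, so heavily rewritten answers receive large scores.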

Experiments
In this section, we implement MRC systems with two state-of-the-art models. BLEU (Papineni et al., 2002) and Rouge (Lin, 2004) are used as the basic evaluation metrics. Furthermore, with the rich annotations in our dataset, including the queries and answers of various types, we conduct comprehensive evaluations from different aspects.

Baseline Systems
We implement two typical state-of-the-art models as baseline systems.
Match-LSTM. Match-LSTM is a widely used MRC model and has been well explored in recent studies (Wang and Jiang, 2017). To find an answer in the passage, it goes through the passage sequentially and dynamically aggregates the matching of an attention-weighted question representation to each token of the passage. Finally, an answer pointer layer is used to find an answer span in the passage.
BiDAF. BiDAF is a promising MRC model, and its improved version has achieved the best single-model performance on the SQuAD dataset (Seo et al., 2016). It uses both context-to-question attention and question-to-context attention in order to highlight the important parts in both question and context. After that, the so-called attention flow layer is used to fuse all useful information in order to get a vector representation for each position.
For the setup, we randomly initialize the word embeddings with a dimension of 300 and set the hidden vector size to 150 for all layers. We use the Adam algorithm (Kingma and Ba, 2014) to train both models with an initial learning rate of 0.001 and a batch size of 32. Since every question may have multiple corresponding passages, a simple heuristic strategy is employed to select a representative paragraph from each passage, which improves training and testing efficiency. During training, this paragraph is the one that achieves the highest recall score when compared against the annotated answers. During testing, since the answers are not available, we compute the recall score against the question instead.
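The two attention directions and the fusion step can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: similarity is a plain dot product rather than the learned similarity function of Seo et al. (2016), the inputs are toy vectors rather than encoder outputs, and the function names are introduced here for illustration.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights: List[float], vectors: List[Vector]) -> Vector:
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]


def bidirectional_attention(context: List[Vector], question: List[Vector]) -> List[Vector]:
    """Return one fused vector per context position."""
    # similarity[t][j]: how well context position t matches question word j.
    # Assumption: dot product stands in for the learned similarity.
    similarity = [[dot(c, q) for q in question] for c in context]

    # Context-to-question: each context position attends over question words.
    c2q = [weighted_sum(softmax(row), question) for row in similarity]

    # Question-to-context: weight context positions by their best match
    # to any question word, giving a single summary vector.
    q2c = weighted_sum(softmax([max(row) for row in similarity]), context)

    # Fusion: concatenate [c; c2q; c * c2q; c * q2c] at every position.
    fused = []
    for c, attended in zip(context, c2q):
        fused.append(
            c
            + attended
            + [a * b for a, b in zip(c, attended)]
            + [a * b for a, b in zip(c, q2c)]
        )
    return fused


toy_context = [[1.0, 0.0], [0.0, 1.0]]
toy_question = [[1.0, 0.0]]
fused_vectors = bidirectional_attention(toy_context, toy_question)
```

Each output vector is four times the input dimension, since it keeps the original context vector alongside the attended question vector and their element-wise products.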
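The paragraph selection heuristic can be sketched as follows. The recall definition here (clipped token overlap divided by reference length) and the use of the best score over multiple answers are assumptions; the text only states that the paragraph with the highest recall is chosen.

```python
from collections import Counter
from typing import List, Sequence


def recall(candidate: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of reference tokens that also appear in the candidate,
    counting repeated tokens at most as often as they occur in both."""
    if not reference:
        return 0.0
    overlap = Counter(candidate) & Counter(reference)
    return sum(overlap.values()) / len(reference)


def select_paragraph(paragraphs: List[Sequence[str]],
                     references: List[Sequence[str]]) -> int:
    """Index of the paragraph with the highest recall against the references.

    Training: pass the annotated answers as `references`.
    Testing: answers are unavailable, so pass [question_tokens] instead.
    Raises ValueError if `paragraphs` is empty.
    """
    def best_recall(paragraph: Sequence[str]) -> float:
        return max((recall(paragraph, ref) for ref in references), default=0.0)

    return max(range(len(paragraphs)), key=lambda i: best_recall(paragraphs[i]))


toy_paragraphs = [["a", "b"], ["c", "d", "e"]]
train_choice = select_paragraph(toy_paragraphs, [["c", "d"]])  # against answers
test_choice = select_paragraph(toy_paragraphs, [["a", "x"]])   # against question
```

Because the test-time choice relies on the question alone, it can differ from the paragraph that best covers the answer.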

Results and Analysis
We evaluate the reading comprehension task via character-level BLEU-4 (Papineni et al., 2002) and Rouge-L (Lin, 2004), which are widely used for evaluating the quality of language generation. The experimental results on the test set are shown in Table 5. For comparison, we also evaluate the Selected Paragraph system, which directly selects the paragraph that achieves the highest recall score as the answer. We assess human performance by having a new annotator annotate the test data and treating that annotator's first answer as the prediction.
The results demonstrate that current reading comprehension models achieve a substantial improvement over the Selected Paragraph baseline, which confirms the effectiveness of these models. However, there is still a large performance gap between these models and humans. An interesting finding comes from the comparison between results on Search and Zhidao data. We find that the reading comprehension models get much higher scores on Zhidao data. This shows that it is much harder for the models to comprehend open-domain web articles than to find answers in passages from a question answering community. In contrast, human performance on these two datasets shows little difference, which suggests that human reading skill is more stable across different types of documents.
As described in Section 4.1, the representative paragraph of each passage is selected based on the query during testing. To analyze the effect of this strategy and obtain an upper bound for the baseline models, we re-evaluate our systems on the gold paragraphs, each of which is the paragraph with the highest recall score against the annotated answers. Comparing Table 6 with Table 5, we can see that using gold paragraphs significantly boosts the overall performance. Moreover, directly using the gold paragraph as the answer obtains a very high Rouge-L score, which is expected because each gold paragraph is selected based on recall, which is closely related to Rouge-L. Nevertheless, the baseline models still achieve much better BLEU scores, which means that the models have learned to refine the answers. The experiment shows that paragraph selection is a crucial problem to solve in real applications, while most current MRC datasets assume that the answer is to be found in a given small passage. Thus, DuReader provides the full body text of each evidence document to stimulate research in a real-world setting.
To gain more insight into the characteristics of our dataset, we report the performance across different question types in Table 7. We can see that both the models and humans achieve relatively good performance on Description questions, while YesNo questions seem to be the hardest to model. We believe this is because Description questions are usually answered with long text on the same topic, which is favored by BLEU and Rouge. In contrast, the answers to YesNo questions are relatively short and may be a simple Yes or No in some cases. More interestingly, the answers to some YesNo questions are quite subjective, and some may even be contradictory, depending on the evidence collected from different passages. Therefore, even human annotators cannot reach a high level of agreement on these questions.
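For reference, character-level Rouge-L can be sketched as below. This follows the standard LCS-based F-measure of Lin (2004); the value of `beta` and the single-reference handling are assumptions, since the exact evaluation settings are not stated here, and this is not the official evaluation script.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using one rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    """Character-level Rouge-L F-measure.

    `beta` weights recall against precision; 1.2 is an assumed default,
    not a value taken from the paper.
    """
    cand_chars, ref_chars = list(candidate), list(reference)
    if not cand_chars or not ref_chars:
        return 0.0
    lcs = lcs_length(cand_chars, ref_chars)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_chars)
    recall = lcs / len(ref_chars)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


identical = rouge_l("abc", "abc")
disjoint = rouge_l("abc", "xyz")
partial = rouge_l("abcd", "acd")
```

Working on characters rather than words avoids depending on a particular Chinese word segmenter at evaluation time.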

Opinion-aware Evaluation
Considering the characteristics of YesNo questions, we found that it is not suitable to directly use BLEU or Rouge to evaluate performance on these questions, because these metrics cannot reflect the agreement between answers. For example, two contradictory answers like "You can do it" and "You can't do it" get high agreement scores with these metrics. A natural idea is to formulate this subtask as a classification problem. However, as described in Section 3, multiple different judgments could be made based on the evidence collected from different passages, especially when the question is of the opinion type. In real-world settings, we do not want a model to give an arbitrary Yes or No answer to such questions.
To tackle this, we propose a novel opinion-aware evaluation method that requires the evaluated system not only to output an answer in natural language, but also to give it an opinion label. We also have the annotators provide an opinion label for each answer they generate. In this way, every answer is paired with an opinion label (Yes, No or Depend), so that we can categorize the answers by their labels. Finally, the predicted answers are evaluated via BLEU or Rouge against only the reference answers with the same opinion label. Under this opinion-aware evaluation method, a model that predicts a good answer in natural language and gives it the correct opinion label will get a higher score.
In order to classify the answers into different opinion polarities, we add a classifier. We slightly change the Match-LSTM model, replacing the final pointer network layer with a fully connected layer. This classifier is trained with the gold answers and their corresponding opinion labels. We compare a reading comprehension system equipped with such an opinion classifier against a pure reading comprehension system without it; the results are shown in Table 8. We can see that opinion classification does help under our evaluation method. Also, classifying the answers correctly is much harder for questions of the opinion type than for those of the fact type.
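The evaluation procedure can be sketched as follows. Several details are assumptions made for illustration: the stand-in metric `char_overlap` is a simple character-overlap ratio and is not BLEU or Rouge, the best score over matching references is used, and a prediction whose label matches no reference is scored zero.

```python
from collections import Counter
from typing import Callable, List, Tuple

# (answer text, opinion label), with labels drawn from Yes / No / Depend.
LabeledAnswer = Tuple[str, str]


def char_overlap(candidate: str, reference: str) -> float:
    """Stand-in text metric (NOT BLEU or Rouge): shared characters divided
    by the longer length. Any text-similarity metric can be passed instead."""
    if not candidate or not reference:
        return 0.0
    shared = sum((Counter(candidate) & Counter(reference)).values())
    return shared / max(len(candidate), len(reference))


def opinion_aware_score(
    prediction: LabeledAnswer,
    references: List[LabeledAnswer],
    metric: Callable[[str, str], float] = char_overlap,
) -> float:
    """Score the predicted text only against references sharing its label."""
    predicted_text, predicted_label = prediction
    matching = [text for text, label in references if label == predicted_label]
    if not matching:
        # Assumption: no reference carries this label, so the score is zero.
        return 0.0
    # Assumption: with several matching references, keep the best score.
    return max(metric(predicted_text, ref) for ref in matching)


toy_references = [("You can do it", "Yes"), ("You can't do it", "No")]

plain = char_overlap("You can do it", "You can't do it")
right_label = opinion_aware_score(("You can do it", "Yes"), toy_references)
unseen_label = opinion_aware_score(("You can do it", "Depend"), toy_references)
```

The plain metric rates the two contradictory sentences as highly similar, whereas the label-restricted score only credits a prediction against references that express the same opinion.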

Discussion
As shown in the experiments, the current state-of-the-art models still underperform human beings by a large margin on our dataset. There is considerable room for improvement in several directions. First, the state-of-the-art models formulate reading comprehension as a span selection task. However, as shown in the DuReader dataset, human beings actually summarize answers based on their own comprehension. How to summarize or generate answers deserves more research. Moreover, current methods employ a simple paragraph selection strategy, which results in a large drop in comprehension accuracy compared to performance on gold paragraphs. It is necessary to design novel and efficient whole-document representation models for the real-world MRC problem.
Second, there are some new features in our dataset that have not been extensively studied before, such as yes-no questions and opinion questions requiring multi-document MRC. New methods are needed for opinion recognition, cross-sentence reasoning, and multi-document summarization. Hopefully, DuReader's rich annotations will be useful for the study of these potential directions.
Third, as this is the first release of the dataset, it is far from perfect and leaves much room for improvement. For example, we currently annotate opinion tags only for yes-no questions; we will also annotate opinion tags for description and entity questions. We would like to gather feedback from the research community to improve DuReader continually.
Overall, it is necessary to propose new algorithms and models to tackle real-world reading comprehension problems. We hope that the DuReader dataset will be a good start for facilitating MRC research.

Conclusion and Future Work
We introduce DuReader, a new large-scale, open-domain Chinese dataset for machine reading comprehension. Different from existing Chinese MRC datasets, DuReader contains questions and possible answers from real-world applications, with the aim of promoting MRC research in a real-world setting. In particular, DuReader contains rich annotations of questions, documents and answers. It is the first dataset to annotate questions from two different views, among which yes-no and opinion questions account for a large proportion but have not been well studied yet. For each question, we provide documents from both Baidu Search and Baidu Zhidao, as well as multiple answers labelled with supporting evidence, possible entities and opinions. Hopefully, these annotations will help facilitate MRC research. Preliminary experimental results show that there is a significant gap between the performance of state-of-the-art models and that of humans on this dataset.
In future work, we will steadily update our dataset by enlarging its size and enriching the annotations based on feedback from the community. We expect DuReader will be a valuable resource for the development of MRC technologies and applications.

Figure 1: Distribution of answer numbers per question in DuReader.

Table 1: Comparison of some properties of existing datasets vs. DuReader.

Table 2: Examples for question types from two views.

Table 4: Distribution of question types in DuReader.

Table 5: Performance of typical MRC systems on the DuReader dataset.

Table 6: Model performance with gold paragraph.

Table 7: Performance on various question types.

Table 8: Performance of opinion-aware model on YesNo questions.

Table 9: Examples from DuReader dataset.