A Vietnamese Dataset for Evaluating Machine Reading Comprehension

Vietnamese is the native language of over 97 million people worldwide. However, there are few research studies on machine reading comprehension (MRC) in Vietnamese, the task of understanding a document or text and answering questions related to it. Due to the lack of benchmark datasets for Vietnamese, we present the Vietnamese Question Answering Dataset (UIT-ViQuAD), a new dataset for evaluating MRC models in a low-resource language such as Vietnamese. This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia. In particular, we propose a new process of dataset creation for Vietnamese MRC. Our in-depth analyses illustrate that our dataset requires abilities beyond simple reasoning like word matching and demands complicated reasoning such as single-sentence and multiple-sentence inferences. Besides, we conduct experiments with state-of-the-art MRC methods for English and Chinese as the first experimental models on UIT-ViQuAD, which will serve as baselines for comparison with future models. We also estimate human performance on the dataset and compare it to the experimental results of several powerful machine models. The substantial difference between human performance and the best model performance on the dataset indicates that improvements can be explored on UIT-ViQuAD through future research. Our dataset is freely available to encourage the research community to overcome challenges in Vietnamese MRC.


Introduction
Machine reading comprehension (MRC) is a natural language understanding task that requires computers to understand a text and then answer questions related to it. MRC is an essential core for a range of natural language processing applications such as search engines and intelligent agents (Alexa, Google Assistant, Siri, and Cortana). In order to evaluate MRC models, gold standard resources with question-answer pairs based on documents have to be collected or created by humans. Building a benchmark dataset plays a vital role in evaluating natural language processing models, especially for a low-resource language like Vietnamese.
Vietnamese is a language with few resources for natural language processing. The MRC dataset introduced by Nguyen et al. (2020) consists of 2,783 multiple-choice questions and answers based on a set of 417 Vietnamese texts which are used for evaluating the reading comprehension skills of 1st to 5th graders. However, this dataset is too small to evaluate deep learning models for Vietnamese MRC. Thus, we aim to build a new large dataset for evaluating Vietnamese MRC.
Though the deep learning approach has surpassed human performance on the SQuAD (Rajpurkar et al., 2016) and NewsQA (Trischler et al., 2017) datasets, we wonder if these state-of-the-art models could also achieve similar performance on datasets in different languages. To further the development of MRC, we build a span-extraction MRC dataset for Vietnamese, where answers to questions are always spans from a given text. Figure 1 shows several examples of Vietnamese span-extraction reading comprehension. In this study, we make four main contributions, described as follows.
• We create a benchmark dataset for evaluating Vietnamese MRC: UIT-ViQuAD comprises 23,074 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese Wikipedia articles. The dataset is freely available on our website for research purposes.
• To gain thorough insights into the dataset, we analyze the dataset according to different linguistic aspects including length-based analysis (question length, answer length, and passage length) and type-based analysis (question type, answer type, and reasoning type).
• To provide the first MRC evaluation on UIT-ViQuAD, we conduct experiments with MRC models which are state-of-the-art for English and Chinese. Then, we compare the performance of the machine models and humans in terms of different linguistic aspects. These in-depth analyses provide insights into span-based MRC in Vietnamese.
• Cross-lingual MRC (Cui et al., 2019a) is a new trend in natural language processing. Our proposed MRC dataset for Vietnamese could also be a resource for cross-lingual study along with other similar datasets such as SQuAD, CMRC, and KorQuAD.
The rest of this paper is structured as follows. Section 2 reviews existing datasets. Section 3 introduces the creation process of our dataset. In-depth analyses of our dataset are presented in Section 4. Then Section 5 presents our experiments and analysis results. Finally, Section 6 presents conclusions and directions for future work.

Existing datasets
Because we aim to build a span-based MRC dataset for Vietnamese, a range of recent span-extraction MRC datasets such as SQuAD (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2017), CMRC (Cui et al., 2019b), and KorQuAD (Lim et al., 2019) is reviewed in this section. These datasets are described as follows.
SQuAD is one of the most popular English datasets for span-based MRC. Rajpurkar et al. (2016) proposed SQuAD v1.1, created by crowd-workers on 536 Wikipedia articles with 107,785 question-answer pairs. SQuAD v2.0 (Rajpurkar et al., 2018) added over 50,000 unanswerable questions written adversarially by crowd-workers to resemble the original ones.
NewsQA is another English dataset proposed by Trischler et al. (2017), consisting of 119,633 question-answer pairs generated by crowd-workers on 12,744 news articles from CNN news. This dataset is similar to SQuAD because the answer to each question is a text segment of arbitrary length in the corresponding news article.
CMRC (Cui et al., 2019b) is a span-extraction dataset for Chinese MRC introduced in the Second Evaluation Workshop on Chinese Machine Reading Comprehension 2018, comprising approximately 20,000 human-annotated questions on Wikipedia articles.
KorQuAD (Lim et al., 2019) is a Korean dataset for span-based MRC, consisting of over 70,000 human-generated question-answer pairs on Korean Wikipedia articles.
Until now, there have not been any datasets of Vietnamese Wikipedia texts for span-based MRC research. As mentioned above, such datasets are benchmarks for the MRC task and may be used for organizing challenges which encourage researchers to explore the best processing models. This is our primary motivation for creating a new dataset for Vietnamese MRC.

Dataset creation
In this section, we introduce our proposed process of MRC dataset creation for the Vietnamese language. In particular, we build our UIT-ViQuAD dataset through five phases consisting of worker recruitment, passage collection, question-answer sourcing, validation, and additional answers collection. These phases are described in detail as follows.
Figure 2: The overview process of creating our dataset UIT-ViQuAD.
Phase 1 - Worker recruitment: The quality of a dataset depends on high-quality workers and the process of data creation. In this phase, we recruit workers for creating our dataset according to a rigorous process consisting of the following stages. (1) People apply to become workers for creating question-answer pairs of the dataset; (2) selected people are excellent at general knowledge and have passed our reading comprehension test; (3) official workers are carefully trained on over 500 question-answer pairs and cross-check each other's created data to detect common mistakes that can be avoided when creating data.
Phase 2 - Passage collection: Similar to SQuAD, we use Project Nayuki's Wikipedia's internal PageRanks to obtain a set of the top 5,000 Vietnamese articles, from which we randomly choose 151 articles for dataset creation. Each passage corresponds to a paragraph in an article. Images, figures, and tables are excluded. We also delete passages that are shorter than 300 characters or that contain many special characters and symbols.
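The passage filter described above can be sketched as follows. This is a minimal illustration, not the authors' actual preprocessing: the 300-character minimum comes from the text, while the symbol-ratio cutoff (`MAX_SYMBOL_RATIO`), the set of allowed punctuation, and the function names are assumptions, since the text does not quantify "many special characters".

```python
import unicodedata

MIN_CHARS = 300           # minimum passage length stated in the text
MAX_SYMBOL_RATIO = 0.10   # assumption: the text gives no numeric cutoff
ALLOWED_PUNCT = set(".,;:!?()\"'-")  # assumption: punctuation not counted as "special"


def symbol_ratio(passage: str) -> float:
    """Fraction of non-whitespace characters that are neither letters,
    digits, nor ordinary punctuation."""
    chars = [c for c in passage if not c.isspace()]
    if not chars:
        return 1.0
    symbols = [c for c in chars if not (c.isalnum() or c in ALLOWED_PUNCT)]
    return len(symbols) / len(chars)


def keep_passage(passage: str) -> bool:
    """Keep a passage only if it is long enough and not dominated by symbols."""
    # Normalize to NFC so Vietnamese diacritics are single letters,
    # not base letters plus combining marks.
    passage = unicodedata.normalize("NFC", passage)
    return len(passage) >= MIN_CHARS and symbol_ratio(passage) <= MAX_SYMBOL_RATIO
```

Normalizing to NFC before counting matters for Vietnamese, because decomposed diacritics would otherwise be counted as non-alphanumeric symbols.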
Phase 3 - Question-answer sourcing: Workers comprehend each passage and then create questions and corresponding answers. During question and answer creation, workers follow these rules: (1) Workers are required to create at least three questions per passage. (2) Workers are encouraged to ask questions in their own words. (3) Answers are text spans in the passage that are used to answer the questions. (4) Workers are encouraged to diversify their questions, answers, and reasoning.
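Rules (1) and (3) above are mechanically checkable. The sketch below assumes a SQuAD-style record with a character offset for each answer; the `QAPair` class, its field names, and the helper functions are hypothetical, and the example sentence is a made-up illustration rather than an item from UIT-ViQuAD.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QAPair:
    passage: str
    question: str
    answer_text: str
    answer_start: int  # character offset of the answer inside the passage


def is_valid_span(pair: QAPair) -> bool:
    """Rule (3): the answer must be an exact, non-empty span of the passage."""
    end = pair.answer_start + len(pair.answer_text)
    return (
        pair.answer_start >= 0
        and len(pair.answer_text) > 0
        and pair.passage[pair.answer_start:end] == pair.answer_text
    )


def passages_below_minimum(question_counts: Dict[str, int], minimum: int = 3) -> List[str]:
    """Rule (1): list passage ids that have fewer than `minimum` questions."""
    return [pid for pid, n in question_counts.items() if n < minimum]


# Made-up example: "Hà Nội" starts at offset 0 of the passage.
example = QAPair(
    passage="Hà Nội là thủ đô của Việt Nam.",
    question="Thủ đô của Việt Nam là gì?",
    answer_text="Hà Nội",
    answer_start=0,
)
```

A check of this kind catches the incorrect-boundary answers that the validation phase looks for, although it cannot judge whether the span actually answers the question.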
Phase 4 - Question and answer validation: In this phase, workers check for mistakes in question-answer pairs in two sub-phases: self-checking and cross-checking. The mistakes are classified into five different categories: unclear questions, misspellings, incorrect answers, lack or excess of information in answers, and incorrect-boundary answers. The two sub-phases are described as follows.
• Self-checking: Workers revise their question-answer pairs themselves.
• Cross-checking: Workers cross-check each other's question-answer pairs. If they discover any mistakes in the dataset, they discuss with each other to correct the mistakes.

Phase 5 - Additional answers collection: To evaluate the quality of dataset creation, for the development and test sets, we collect three more answers for each question from different workers in addition to the original answer. During this phase, the workers cannot see each other's answers, and they are encouraged to give diverse answers.

Dataset analysis

Overall statistics
The statistics of the training (Train), development (Dev) and test (Test) sets of our dataset are described in Table 1. The number of questions of UIT-ViQuAD is 23,074. In the table, the numbers of articles and passages, the average lengths of questions and answers, and vocabulary sizes are also presented.

Length-based analysis
We present statistics of our dataset according to three types of length: question length (see Table 2), answer length (see Table 2), and passage length (see Table 3). Questions of 11 to 15 words account for a high proportion of the dataset (45.29%). Most answers are 1 to 10 words long (73.68%), and most passages are 101 to 200 words long (73.13%). These statistics characterize the length distribution of our dataset.

Type-based analysis
In this section, we analyze the Dev set in terms of different types such as question type, reasoning type, and answer type. Because Vietnamese is a subject-verb-object language similar to Chinese (Nguyen et al., 2018), the question types in UIT-ViQuAD follow those of CMRC (Cui et al., 2019b). Thus, we also divide the questions into seven types: Who, What, When, Where, Why, How, and Others. However, question words vary a lot in Vietnamese, so we have workers manually annotate the question types. Figure 3a presents the distribution of the question types in our dataset. What questions account for the largest proportion, 49.97%, which is similar to the proportion in SQuAD (53.60%) (Aniol et al., 2019).
To explore the difficulty of the reasoning required, we conduct human annotation of the reasoning level of each question, shown in Figure 3b. Following Hill et al. (2015) and Nguyen et al. (2020), workers manually annotate the questions into five different types of reasoning in ascending order of difficulty: word matching (WM), paraphrasing (PP), single-sentence reasoning (SSR), multi-sentence reasoning (MSR), and ambiguous/insufficient (AoI). Our dataset is more difficult than SQuAD and NewsQA because the percentage of inference types in our dataset (68.29%) is higher than that in SQuAD (20.5%) and NewsQA (33.90%) (Trischler et al., 2017).
Unlike SQuAD (Rajpurkar et al., 2016) and NewsQA (Trischler et al., 2017), which use automatic tools for annotation, the answer types on the Dev set of UIT-ViQuAD are annotated entirely by workers. Table 4 shows the distribution of the answer types based on various syntactic structures on the Dev set of our dataset. Common noun phrases account for the largest proportion in UIT-ViQuAD, which is similar to the statistics of SQuAD (Rajpurkar et al., 2016) and NewsQA (Trischler et al., 2017). In addition, verb phrases (P2) and other entities (E3) have the second and third highest percentages in our dataset.
Table 4: Statistics of the answer types on the Dev set of the UIT-ViQuAD dataset.

Empirical evaluation
In this section, we conduct experiments with state-of-the-art MRC models to evaluate our dataset. To measure the difficulty of our dataset, we also estimate human performance on the task of Vietnamese MRC. Similar to evaluations on English and Chinese datasets (Rajpurkar et al., 2016; Cui et al., 2019b), we use two evaluation metrics, exact match (EM) and F1-score, to evaluate the performance of MRC models on our dataset.
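For readers unfamiliar with the two metrics, the sketch below follows the usual SQuAD-style definitions: EM checks whether the normalized prediction equals the normalized reference, and F1 is the harmonic mean of token-level precision and recall. The normalization used here (lowercasing, removing ASCII punctuation, whitespace tokenization) is an assumption; the exact normalization and tokenization applied to Vietnamese in the experiments is not specified in this section.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed scheme)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Token-level F1 between a predicted span and a reference span."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under these definitions, a prediction that contains the reference plus extra words scores 0 on EM but still receives partial credit on F1, which is one reason F1 is usually higher than EM.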

Human performance
In order to measure human performance, we hired three other workers to independently answer the questions in the development and test sets. As a result, each question in the development and test sets has four answers, as described in Phase 5 of Section 3. Unlike Rajpurkar et al. (2016) and like Cui et al. (2019b), we use a cross-validation methodology to measure the performance.
In particular, we consider the first answer as the human prediction and treat the remaining answers as ground truths. We obtain four human prediction performances by iteratively regarding the first, second, third, and fourth answer as the human prediction. For each question, we take the maximum performance over all of the ground truth answers. Lastly, we calculate the average of the four results as the final human performance on the dataset.
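The cross-validation procedure above can be sketched as a leave-one-out loop. This is an illustrative reading of the description, assuming every question has the same number of answers and that each available answer is held out in turn; the function name and the `metric` argument (any function scoring a prediction against a single reference, such as EM or F1) are hypothetical.

```python
from typing import Callable, Sequence

Metric = Callable[[str, str], float]


def human_performance(answers_per_question: Sequence[Sequence[str]], metric: Metric) -> float:
    """Average, over held-out positions, of the mean best score against the rest."""
    n_answers = len(answers_per_question[0])
    round_scores = []
    for held_out in range(n_answers):
        total = 0.0
        for answers in answers_per_question:
            prediction = answers[held_out]
            references = [a for i, a in enumerate(answers) if i != held_out]
            # Take the maximum score over the remaining ground-truth answers.
            total += max(metric(prediction, ref) for ref in references)
        round_scores.append(total / len(answers_per_question))
    # Final human performance is the average over all rounds.
    return sum(round_scores) / len(round_scores)


def simple_em(prediction: str, truth: str) -> float:
    """Case-insensitive exact match, used here only to demonstrate the loop."""
    return float(prediction.strip().lower() == truth.strip().lower())
```

Taking the maximum over references means a human answer is credited if it agrees with any one of the other annotators, which keeps legitimate answer variation from being counted as an error.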

Re-implemented methods and baselines
In this paper, we re-implemented the following MRC models on our dataset as described in Section 4.
• DrQA: Chen et al. (2017) introduced a simple but effective neural network-based model for the MRC task. DrQA Reader achieved good performance on multiple MRC datasets (Rajpurkar et al., 2016; Reddy et al., 2019; Labutov et al., 2018). Thus, we re-implement this method on our dataset as the first baseline model against which future models can be compared.
• QANet: QANet was proposed by Yu et al. (2018) and also demonstrated good performance on multiple MRC datasets (Rajpurkar et al., 2016; Dua et al., 2019). This model consists of multiple convolutional layers followed by two components, a self-attention layer and a fully connected layer, for both question and passage encoding, as well as further layers stacked before predicting the final output.
• BERT: BERT was proposed by Devlin et al. (2019). This model is a strong method for pre-training language representations, which achieved state-of-the-art results on many reading comprehension tasks. In this paper, we use mBERT (Devlin et al., 2019), a large-scale pre-trained multilingual language model, to evaluate our Vietnamese MRC task.
• XLM-R: XLM-R was proposed by Conneau et al. (2020) as a strong method for pre-training multilingual language models at scale, which leads to significant performance gains for a wide range of cross-lingual transfer tasks. This model significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including XNLI, MLQA, and NER. In this paper, we evaluate XLM-R Base and XLM-R Large on our dataset.

Experimental settings
We use a single NVIDIA Tesla P100 GPU via Google Colaboratory to train all MRC models on our dataset. For DrQA and QANet, we utilize the pre-trained word embeddings introduced by Xuan et al. (2019), including Word2vec, fastText, ELMo, and BERT Base, and we set the batch size to 32 and the number of epochs to 40 for both models. To evaluate BERT-style models on our dataset, we use the multilingual pre-trained model mBERT (Devlin et al., 2019) and the pre-trained cross-lingual model XLM-R (Conneau et al., 2020) with the baseline configuration provided by HuggingFace. Based on our dataset characteristics, we set the maximum answer length to 300, the maximum question length to 64, and the maximum input sequence length to 384 for all the experiments on mBERT and XLM-R.
Table 5: Human and model performances on the Dev and Test sets of UIT-ViQuAD.
Table 5 presents the performance of our models alongside human performance on the development and test sets of our dataset. In terms of both EM and F1-score, XLM-R Large significantly outperforms the other models but remains well below human performance. On the test set, this model achieves an F1-score of 87.02%. However, its exact match is only 68.98%, which is significantly lower than its F1-score.
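One common way a maximum answer length is applied in span-extraction models is at decoding time: among all start and end positions, the model picks the highest-scoring pair whose span does not exceed the limit. The sketch below is a generic illustration of that idea, not the HuggingFace implementation used in the experiments; the function name is hypothetical, the scores stand in for per-token start and end logits, and the length is measured in tokens as an assumption.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              max_answer_len: int) -> Tuple[int, int]:
    """Return (start, end) token indices maximizing start + end score,
    subject to start <= end and a span of at most max_answer_len tokens."""
    best = (0, 0)
    best_score = float("-inf")
    for i, start in enumerate(start_scores):
        # Only consider end positions inside the allowed window.
        last = min(len(end_scores), i + max_answer_len)
        for j in range(i, last):
            score = start + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best


start = [0.1, 2.0, 0.3, 0.0]
end = [0.0, 0.5, 0.2, 3.0]
print(best_span(start, end, max_answer_len=300))  # limit never binds here
print(best_span(start, end, max_answer_len=2))    # limit forces a shorter span
```

With a generous limit such as 300 the constraint rarely changes the chosen span, while a tight limit can exclude the otherwise best pair, as the second call shows.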

Analysis
To gain more in-depth insights into the performance of the machine models and humans in Vietnamese MRC, we analyze their performance in terms of different linguistic aspects: length-based aspects (question length, answer length, and passage length) and type-based aspects (question type, answer type, and reasoning type).

Effects of length-based aspects
In order to examine how well the MRC models perform on UIT-ViQuAD, we analyze the performance of the machine models and humans by F1-score. Figure 4 shows length-based analyses of the performance of humans and MRC models on the Dev set. In general, the mBERT and XLM-R models outperform the QANet and DrQA models. However, all machine models perform worse than humans across the different types of lengths. For the question-length-based analysis (see Figure 4a), we find that longer questions tend to achieve better results, possibly because these questions contain more information, which makes it easier for MRC models to find answers. On the contrary, longer answers achieve lower performance and are challenging for the MRC models, as shown clearly by the performance of the DrQA and QANet models in Figure 4b. Unlike in the question-length and answer-length analyses, performance fluctuates with passage length: most MRC models work well for short (<100 words) and long (>250 words) passages (see Figure 4c). These length-based analyses can be used to assess the difficulty of Vietnamese machine reading comprehension on our dataset, which can give researchers ideas for curriculum learning in future work.

Effects of type-based aspects
Besides, we examine how MRC models handle the type-based aspects of UIT-ViQuAD. To this end, we analyze the F1-score performance of the machine models and humans on the development set. Figure 5 shows the type-based analyses of the performance of humans and MRC models. No machine model handles question types, answer types, or reasoning types better than humans. Regarding reasoning types, the complex inference types (SSR, MSR, and AoI) obtain lower performance, which is similar to results on SQuAD and NewsQA (Trischler et al., 2017). Similarly, difficult question types (Why and How) obtain low performance. However, the Where question is another question type that is not handled well by machine models. Accordingly, the Location answer type, which is related to the Where question type, also achieves low performance. Although the noun-phrase answer type accounts for the highest proportion of the dataset (22.86%), the machine models do not yet handle it as well as other types because of the diverse and complicated structure of Vietnamese noun phrases (Nguyen et al., 2018).

Effects of the amount of training data
The training data consists of 18,579 question-answer pairs, which is fewer than the amount of data used to train English and Chinese MRC models. To verify whether the small amount of training data contributes to the poor performance of the MRC systems, we conduct experiments with training sets comprising 3,145, 6,471, 9,268, 12,273, 15,145, and 18,579 questions. Figure 6 shows the performance (F1-score) on the Test set of UIT-ViQuAD. Through these experimental analyses, we find that DrQA, QANet, and mBERT obtain better performance as the amount of training data increases, whereas the performance of XLM-R remains stable above 86% regardless of the amount of training data. These observations indicate that the best model (XLM-R Large) is more effective with a small amount of training data than the other three models. In general, increasing the quantity of training data may be required to improve the performance of most neural network-based MRC models.
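An experiment of this kind needs training subsets of the listed sizes. The sketch below builds them as nested prefixes of a single shuffled copy of the training set, so that each larger subset contains every smaller one. Both the nesting and the fixed seed are assumptions made for illustration, since the text does not say how the subsets were drawn, and the function name is hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")

# Training-set sizes reported in the text.
TRAIN_SIZES = [3145, 6471, 9268, 12273, 15145, 18579]


def nested_subsets(examples: Sequence[T], sizes: Sequence[int], seed: int = 0) -> List[List[T]]:
    """Return one subset per size; each is a prefix of the same shuffled list."""
    if max(sizes) > len(examples):
        raise ValueError("requested subset is larger than the training set")
    rng = random.Random(seed)  # fixed seed so the subsets are reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return [shuffled[:n] for n in sorted(sizes)]
```

Using nested subsets keeps the comparison between sizes cleaner, because differences in score then reflect added data rather than a different random sample at each size.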

Conclusion and future work
In this paper, we introduce a new span-extraction dataset for evaluating Vietnamese MRC. UIT-ViQuAD contains over 23,000 questions generated by humans. Our experimental results show that the machine models obtain scores of up to 87 percent on both the development and test sets. However, these scores are lower than the estimated human performance in F1-score. We hope the release of our dataset contributes to language diversity in the MRC task and accelerates further investigation into solving difficult questions that need comprehensive reasoning over multiple clues. According to the analysis results, we may extend this work by exploring models to solve challenging questions involving specific question types (Where, Why, and How), answer types (Location and Noun Phrases), and reasoning types (Single-Sentence Inference, Multiple-Sentence Inference, and Ambiguous or Insufficient). In the future, we plan to increase the quantity and the quality of our dataset so that deep learning and transformer models can achieve better performance. In addition, we would like to open a Vietnamese MRC challenge task for researchers in the field.