FQuAD: French Question Answering Dataset

Recent advances in the field of language modeling have improved state-of-the-art results on many Natural Language Processing tasks. Among them, Reading Comprehension has made significant progress over the past few years. However, most results are reported in English since labeled resources available in other languages, such as French, remain scarce. In the present work, we introduce the French Question Answering Dataset (FQuAD). FQuAD is a French Native Reading Comprehension dataset of questions and answers on a set of Wikipedia articles that consists of 25,000+ samples for the 1.0 version and 60,000+ samples for the 1.1 version. We train a baseline model which achieves an F1 score of 92.2 and an exact match ratio of 82.1 on the test set. In an effort to track the progress of French Question Answering models we propose a leaderboard and we have made the 1.0 version of our dataset freely available at https://illuin-tech.github.io/FQuAD-explorer/.


Introduction
Current progress in language modeling has led to increasingly successful results on various Natural Language Processing (NLP) tasks. This is notably the case for the Reading Comprehension task (Richardson et al., 2013). However, Reading Comprehension datasets are costly and difficult to collect, and most of them are native English datasets. Indeed, datasets such as SQuAD1.1 (Rajpurkar et al., 2016), SQuAD2.0 (Rajpurkar et al., 2018), or CoQA (Reddy et al., 2018) have fostered impressive progress for English Question Answering models over the past few years. The lack of native annotated datasets in languages other than English is one of the main reasons why the development of language-specific Question Answering models is lagging behind, and this is notably the case for French.
In order to fill the gap for the French language, we introduce a French Reading Comprehension dataset similar to SQuAD1.1. The dataset consists of native French question and answer samples annotated by a team of university students. The dataset comes in two versions: FQuAD1.0, containing 25,000+ samples, and FQuAD1.1, containing 60,000+ samples. The 35,000+ additional samples have been annotated with more demanding guidelines in order to make the data more complex and the task harder for models. More specifically, the training, development, and test sets of FQuAD1.0 contain respectively 20,703, 3,188, and 2,189 samples, and those of FQuAD1.1 contain respectively 50,741, 5,668, and 5,594 samples.
In order to evaluate the FQuAD dataset, we perform various experiments by fine-tuning BERT based Question Answering models on both versions of the FQuAD dataset. The experiments involve the fine-tuning of French monolingual model CamemBERT (Martin et al., 2019), and multilingual models mBERT (Pires et al., 2019) and XLM-RoBERTa (Conneau et al., 2019).
We also perform two types of cross-lingual Reading Comprehension experiments. First, we evaluate the performance of the zero-shot cross-lingual transfer learning approach as described in Artetxe et al. (2019) and Lewis et al. (2019) on our newly obtained native French dataset. Second, we evaluate the performance of the translation approach by fine-tuning models on the French translated version of SQuAD1.1. The results of these two experiments help to better understand how the two cross-lingual approaches actually perform on a native dataset.

Related Work
The Reading Comprehension task (RC) (Richardson et al., 2013;Rajpurkar et al., 2016) attempts to solve the Question Answering (QA) problem by finding the text span in one or several documents or paragraphs that answers a given question (Ruder, 2020).
These datasets are similar but each of them introduces its own subtleties. For instance, SQuAD2.0 (Rajpurkar et al., 2018) develops unanswerable adversarial questions. CoQA (Reddy et al., 2018) focuses on Conversation Question Answering in order to measure the ability of algorithms to understand a document and answer series of interconnected questions that appear in a conversation. QuAC (Choi et al., 2018) focuses on Question Answering in Context developed for Information Seeking Dialog (ISD). The benchmark established by Yatskar (2018) offers a qualitative comparison of these datasets. Finally, HotpotQA (Yang et al., 2018) attempts to extend the Reading Comprehension task to more complex reasoning by introducing multi-hop questions where the answer must be found among multiple documents.

Reading Comprehension in other languages
Native Reading Comprehension datasets other than English remain rare. Among them, some initiatives have been carried out in Chinese, Korean and Russian and all of them have been built in a similar way to SQuAD1. As language-specific datasets are costly and challenging to obtain, an alternative consists in developing cross-lingual models that can transfer to a target language without requiring training data in that language (Lewis et al., 2019). It has indeed been shown that these unsupervised multilingual models generalize well in a zero-shot cross-lingual setting (Artetxe et al., 2019). For this reason, cross-lingual Question Answering has recently gained traction and two cross-lingual benchmarks have been released, i.e. XQuAD (Artetxe et al., 2019) and MLQA (Lewis et al., 2019). The XQuAD dataset (Artetxe et al., 2019) is obtained by having professional translators translate 1,190 question and answer pairs from the SQuAD1.1 development set into 10 languages. The MLQA dataset (Lewis et al., 2019) consists of over 12,000 question and answer samples in English and 5,000 samples in 6 other languages such as Arabic, German and Spanish. Note that the two aforementioned datasets do not cover French.
Another alternative consists in translating the training dataset into the target language and fine-tuning a language model on the translated dataset. This is notably the case of Carrino et al. (2019), where the authors develop a specific translation method called Translate Align Retrieve (TAR) to translate the English SQuAD1.1 dataset into Spanish. The resulting Spanish SQuAD1.1 dataset is used to fine-tune a multilingual model that reaches a performance of respectively 68.1/48.3% F1/EM and 77.6/61.8% F1/EM on the MLQA cross-lingual benchmark (Lewis et al., 2019) and XQuAD (Artetxe et al., 2019). Note that a similar approach has been adopted for French and Japanese in Asai et al. (2018).

Language modeling for Reading Comprehension
Increasingly efficient language models have been released recently, such as GPT-2 (Radford et al., 2018), BERT (Devlin et al., 2018), XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019). They have disrupted the Reading Comprehension task and most NLP fields: pre-training a language model on a generic corpus, optionally fine-tuning it on a domain-specific corpus, and then training it on a downstream task is the de facto state-of-the-art approach for optimizing both performance and the volume of annotated data required (Devlin et al., 2018; Liu et al., 2019). For instance, the top performing models on the SQuAD1.1 and SQuAD2.0 leaderboards 1 are essentially transformer-based models. Unfortunately, the aforementioned models are pre-trained on English corpora and their use for French is therefore limited.
Multilingual models pre-trained on large multilingual datasets attempt to alleviate the language specific shortcoming characteristic of the former models such as Lample and Conneau (2019)

Dataset Collection
The collection was conducted in two distinct steps: the first one resulted in FQuAD1.0 with 25,000+ question and answer pairs, and the second one resulted in FQuAD1.1 with 60,000+ question and answer pairs. Apart from that, the collection follows the same standards and guidelines as SQuAD1.1 (Rajpurkar et al., 2016).
1 rajpurkar.github.io/SQuAD-explorer

Paragraphs collection
A set of 1,769 articles is collected from the French Wikipedia page referencing quality articles 2. From this set, a total of 145 articles are randomly sampled to build the FQuAD1.0 dataset. In addition, 181 articles are randomly sampled to extend the dataset to FQuAD1.1, resulting in a total of 326 articles. Articles are then randomly assigned to the training, development, and test sets. The training, development, and test sets for FQuAD1.0 are respectively made up of 117, 18, and 10 articles. For the FQuAD1.1 dataset, they are respectively made up of 271, 30, and 25 articles. Note that the train, development, and test split is performed at the article level, so that no article contributes paragraphs to more than one set, in order to avoid any possible biases.
The paragraphs that are at least 500 characters long are kept for each article, similarly to Rajpurkar et al. (2016). This technique results in 4,951, 768, and 523 paragraphs for respectively the training, development, and test sets of FQuAD1.0. For FQuAD1.1, the number of collected paragraphs for the same sets are respectively 12,123, 1,387, and 1,398.
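The article-level split described above can be illustrated with a minimal sketch. The function name `split_by_article`, the integer article identifiers, and the fixed random seed are assumptions made for illustration; the source only states that articles are randomly assigned to the three sets.

```python
import random


def split_by_article(article_ids, n_dev, n_test, seed=0):
    """Randomly assign whole articles to train/dev/test sets.

    Splitting at the article level means that all paragraphs of an
    article end up in the same set.
    """
    ids = sorted(article_ids)
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    rng.shuffle(ids)
    test = set(ids[:n_test])
    dev = set(ids[n_test:n_test + n_dev])
    train = set(ids[n_test + n_dev:])
    return train, dev, test


# FQuAD1.1 proportions: 326 articles, of which 30 go to dev and 25 to test.
train, dev, test = split_by_article(range(326), n_dev=30, n_test=25)
print(len(train), len(dev), len(test))
```

Paragraphs and question and answer pairs would then inherit the set of the article they come from.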

Question and answer pairs collection
A specific annotation platform was developed to collect the question and answer pairs. The workers are French students who were hired in collaboration with the Junior Enterprise of CentraleSupélec 3. They were paid about 16.5 euros per hour of work. The guidelines for writing question and answer pairs for each paragraph are the same as for SQuAD1.1 (Rajpurkar et al., 2016). First, the paragraph is presented to the student on the platform and the student reads it. Second, the student thinks of a question whose answer is a span of text within the context. Third, the student selects the smallest span in the paragraph which contains the answer. The process is then repeated until 3 to 5 questions are generated and correctly answered. The students were asked to spend on average 1 minute on each question and answer pair. This amounts to an average of 3-5 minutes per annotated paragraph. Additionally, during the annotation process, about 25% of the questions from each annotator were manually reviewed to make sure the questions remained of high quality. Final dataset metrics are shared in table 2.

Additional answers collection
Additional answers are collected to decrease the annotation bias similarly to Rajpurkar et al. (2016). For each question in the development and test sets, two additional answers are collected, resulting in three answers per question for these sets. The crowd-workers were asked to spend on average 30 seconds to answer each question.
For the same question, several answers may be correct: for instance the question Quand fut couronné Napoléon ? would have several possible answers such as mai 1804, en mai 1804, or 1804. As all those answers are admissible, enriching the test set with several annotations for the same question, with different annotators, is a way to decrease annotation bias. The additional answers are useful to get an indication of the human performance on FQuAD.

FQuAD1.0 & FQuAD 1.1
The results for the first annotation process resulting in the FQuAD1.0 dataset are reported in table 1. The number of collected question and answer pairs amounts to 26,108. Various analyses to measure the difficulty of the resulting dataset are performed as described in the next section. A complete annotated paragraph is displayed in figure 2. The first dataset is extended with additional annotation samples to build the FQuAD1.1 dataset reported in table 2. The total number of questions amounts to 62,003. The FQuAD1.1 training, development and test sets are then respectively composed of 271 articles (83%), 30 (9%), and 25 (8%). Following the version 1.0 annotation campaign, we observed that the most difficult questions for the trained models were questions of types Why and How, and questions whose answers involve verbs and adjectives. This is further explained in section E. Therefore, we asked the annotators to come up with more questions of these specific types. The motivation was to obtain more challenging questions in order to understand whether the trained models could improve on them. This constitutes the only difference with the first annotation process. The additional answer collection process remains the same.

Question analysis
The second analysis aims at understanding the question types of the dataset. The present analysis is purely rule-based. Table 4 first shows that the annotation process produced a wide range of question types, and that What (que) represents almost half (47.8%) of the corpus. This large proportion may be explained by this formulation encompassing both the English What and Which, as well as a possible natural bias in the annotators' way of asking questions. Our intuition is that this bias is the same during inference, as it originates from native French structure.

Question-answer differences
The difficulty in finding the answer given a particular question lies in the linguistic variation between the two. This can come in different ways, which are listed in table 9. The categories are taken from Rajpurkar et al. (2016): Synonymy implies key question words are changed to a synonym in the context; World knowledge implies key question words require world knowledge to find the correspondence in the context; Syntactic variation implies a difference in the structure between the question and the answer; Multiple sentence reasoning implies knowledge requirement from multiple sentences in order to answer the question. We randomly sampled 6 questions from each article in the development set and manually labeled them. Note that samples can belong to multiple categories.

Evaluation metrics
The Exact Match (EM) and F1 score are the metrics commonly computed to evaluate the performance of a model. The former measures the percentage of predictions that exactly match any one of the ground truth answers. The latter computes the average overlap between the predicted tokens and the ground truth answer, where the prediction and ground truth are processed as bags of tokens. For questions labeled with multiple answers, the F1 score is the maximum F1 over all the ground truth answers. The evaluation process in Rajpurkar et al. (2016) for both F1 and EM ignores punctuation and the English articles a, an, the. In order to remain consistent with that approach, the French evaluation process ignores the following articles: le, la, les, l', du, des, au, aux, un, une.
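The two metrics can be sketched as follows. This is an illustrative approximation rather than the official evaluation script: the function names, the replacement of punctuation by spaces, and the regular expression used to detach the elided article l' are assumptions.

```python
import re
import string
from collections import Counter

# Articles ignored by the French evaluation; l' is handled separately below.
ARTICLES = {"le", "la", "les", "du", "des", "au", "aux", "un", "une"}


def normalize(text):
    """Lowercase, drop articles and punctuation, return a list of tokens."""
    text = text.lower().replace("’", "'")
    text = re.sub(r"\bl'", " ", text)  # assumption: drop the elided article l'
    text = "".join(" " if ch in string.punctuation else ch for ch in text)
    return [tok for tok in text.split() if tok not in ARTICLES]


def exact_match(prediction, ground_truths):
    """1.0 if the prediction matches any ground truth after normalization."""
    pred = normalize(prediction)
    return float(any(pred == normalize(gt) for gt in ground_truths))


def f1_single(prediction, ground_truth):
    """Bag-of-tokens F1 between one prediction and one ground truth."""
    pred, truth = normalize(prediction), normalize(ground_truth)
    if not pred or not truth:
        return float(pred == truth)
    common = sum((Counter(pred) & Counter(truth)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(truth)
    return 2 * precision * recall / (precision + recall)


def f1_score(prediction, ground_truths):
    """Maximum F1 over all ground truth answers."""
    return max(f1_single(prediction, gt) for gt in ground_truths)


truths = ["mai 1804", "en mai 1804", "1804"]
print(exact_match("En mai 1804.", truths))
print(f1_score("le 18 mai 1804", truths))
```

In the second call, the prediction shares two of its three remaining tokens with the reference mai 1804, which gives a precision of 2/3, a recall of 1, and therefore an F1 of 0.8.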

Human performance
Similarly to SQuAD, human performance is evaluated on the development and test sets in order to assess how well humans agree when answering questions. This score gives a comparison baseline when assessing the performance of a model. To measure the human performance, for each question, two of the three answers are considered as the ground truth, and the third as the prediction. In order not to bias this choice, the three answers are successively considered as the prediction, so that three human scores are calculated. The three runs are then averaged to obtain the final human performance for the F1 Score and Exact Match. For the test set and development set we find a Human Score reaching respectively 91.2% F1 and 75.9% EM, and 91.2% F1 and 78.3% EM. An in-depth analysis is carried out in appendix C to compare FQuAD1.1 to SQuAD1.1 in terms of Human Performance and answer length.
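The leave-one-out procedure described above can be sketched as follows. The function name `human_score`, the toy data, and the simplified case-insensitive matching function are assumptions made for illustration; in practice the EM or F1 metric of the evaluation would be used as the scoring function.

```python
def human_score(answers_per_question, score_fn):
    """Average, over three runs, of the score obtained when each of the
    three answers is in turn treated as the prediction and the other two
    as the ground truth."""
    run_scores = []
    for held_out in range(3):
        total = 0.0
        for answers in answers_per_question:
            prediction = answers[held_out]
            references = answers[:held_out] + answers[held_out + 1:]
            # Score against the best-matching remaining answer.
            total += max(score_fn(prediction, ref) for ref in references)
        run_scores.append(100.0 * total / len(answers_per_question))
    return sum(run_scores) / len(run_scores)


def simple_match(a, b):
    """Simplified exact match (assumption): case-insensitive equality."""
    return float(a.strip().lower() == b.strip().lower())


# Toy data: three answers for each of two questions.
answers = [
    ["1804", "1804", "mai 1804"],
    ["Paris", "à Paris", "Paris"],
]
score = human_score(answers, simple_match)
print(round(score, 2))
```

On this toy data the three runs score 100, 50 and 50, so the averaged human score is about 66.67.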

Experimental set-up
The experimental set-up is kept the same across all the experiments. The number of epochs is set to 3, with a learning rate equal to 3.0 · 10⁻⁵. The learning rate is scheduled according to a warm-up linear scheduler where the percentage ratio for the warm-up is consistently set to 6%. The batch size is kept constant across the training and is equal to 8 for the base models and 4 for the large ones. The optimizer that is being used is AdamW with its default parameters. All the experiments were carried out with the HuggingFace transformers library (Wolf et al., 2019) on a single V100 GPU.
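To make the learning rate schedule concrete, the sketch below computes the learning rate at a given optimization step in plain Python. The source only specifies a linear scheduler with a 6% warm-up and a peak learning rate of 3.0 · 10⁻⁵; the linear decay to zero after the warm-up, the function name, and the example number of steps are assumptions.

```python
def linear_warmup_lr(step, total_steps, peak_lr=3e-5, warmup_ratio=0.06):
    """Learning rate at a given step: linear warm-up to peak_lr over the
    first 6% of steps, then (assumption) linear decay to zero."""
    warmup_steps = round(warmup_ratio * total_steps)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run of 1,000 optimization steps: the warm-up lasts 60 steps.
for step in (0, 30, 60, 530, 1000):
    print(step, linear_warmup_lr(step, total_steps=1000))
```

The total number of steps depends on the number of training features, the batch size, and the number of epochs, so the value of 1,000 used here is purely illustrative.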

Native French Reading Comprehension
The goal of these experiments is twofold. First, we want to evaluate the performance of the French language models CamemBERT BASE and CamemBERT LARGE (Martin et al., 2019) on FQuAD. Second, we want to evaluate the performance of multilingual models using the same setup. For this purpose we fine-tune two multilingual models, i.e. mBERT (Pires et al., 2019) and XLM-RoBERTa (Conneau et al., 2019), to see how they perform on the French dataset. Note that for each experiment, the fine-tuning is performed on the training set of FQuAD1.1 and evaluated on the development and test sets of FQuAD1.1. Additional fine-tuning experiments performed on the training set of FQuAD1.0 are presented in appendix D.
Table 5: Question-answer relationships in 108 randomly selected samples from the FQuAD development set. In bold the elements needed for the corresponding reasoning, in italics the selected answer.

Cross-lingual Reading Comprehension
Cross-lingual Reading comprehension follows mainly two approaches as explained in section 2. First, we perform several experiments with a so called zero-shot learning approach. In other words, we fine-tune multilingual models on the English SQuAD1.1 dataset and we evaluate them on the FQuAD1.1 development set. In addition to that, the opposite approach is also carried out, i.e. finetuned models on FQuAD1.1 are evaluated on the SQuAD1.1 development set.
Second, we fine-tune CamemBERT on the SQuAD1.1 training dataset translated into French.
For this purpose, the SQuAD1.1 training set is translated using NMT (Ott et al., 2018). Note that the translation process makes it difficult to keep all the samples from the original dataset and, for the sake of simplicity, we discard the translated answers that do not align with the start/end positions of the translated paragraphs. The resulting translated dataset SQuAD1.1-fr-train contains about 40,700 question and answer pairs. The finetuned model is then evaluated on the native French FQuAD1.1 development set.
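The filtering step described above can be illustrated with a minimal sketch. The source does not specify how alignment is checked; here a translated answer is assumed to be aligned when it occurs verbatim in the translated paragraph. The function name `keep_aligned` and the field names `context_fr` and `answer_fr` are hypothetical.

```python
def keep_aligned(samples):
    """Keep only samples whose translated answer can be located in the
    translated paragraph, and record its start/end character positions."""
    kept = []
    for sample in samples:
        # Assumption: alignment means an exact substring match.
        start = sample["context_fr"].find(sample["answer_fr"])
        if start == -1:
            continue  # the answer span was lost in translation: discard
        end = start + len(sample["answer_fr"])
        kept.append({**sample, "answer_start": start, "answer_end": end})
    return kept


samples = [
    {"context_fr": "Napoléon fut couronné en 1804 à Paris.",
     "answer_fr": "en 1804"},
    {"context_fr": "Napoléon fut couronné en 1804 à Paris.",
     "answer_fr": "durant l'année 1804"},
]
kept = keep_aligned(samples)
print(len(kept), kept[0]["answer_start"], kept[0]["answer_end"])
```

Discarding unaligned samples in this way trades dataset size for simplicity, which is consistent with the translated training set being smaller than the original one.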

Native French Reading Comprehension
The training experiments on FQuAD1. These experiments therefore show that models fine-tuned on translated data do not perform as well as models fine-tuned on the native dataset. This difference is probably explained by the fact that NMT produces translation inaccuracies that impact the EM score more than the F1 score. When we merge the native and the translated datasets into what we call the Augmented dataset, we do not observe a significant performance improvement. Interestingly, the CamemBERT LARGE model performs slightly worse when fine-tuned on translated samples.
Through our language model benchmark on FQuAD, we have evaluated several monolingual and multilingual models. The CamemBERT BASE and CamemBERT LARGE models reach a very promising baseline and the large model even outperforms the Human Performance consistently across the development and test datasets. For comparable model sizes we find that the monolingual models outperform multilingual models on the Reading Comprehension task. However, we find that multilingual models such as mBERT (Pires et al., 2019) or XLM-R BASE and XLM-R LARGE (Conneau et al., 2019) reach very promising scores. We find that XLM-R LARGE performs consistently better than the monolingual model CamemBERT BASE on both the development and test sets of FQuAD1.1. Let us further highlight that XLM-R LARGE reaches 79% EM on the FQuAD test set, which is better than Human Performance, while the F1 score remains only 2% below it. As such a model is pre-trained on a multilingual corpus, we can hope that it could be used with reasonable performance on other languages.

Translated Reading Comprehension
Fine-tuning CamemBERT BASE on a French translated dataset yields 81.8/67.8% F1/EM on the FQuAD1.1 dev set. By way of comparison, CamemBERT BASE scores 88.1/78.1% F1/EM on the same set when trained with native French data. We find here that there exists an important gap between both approaches. Indeed, models that are fine-tuned on native data outperform models fine-tuned on translated data by about 10 points of Exact Match.
In Carrino et al. (2019), the authors report a performance of 77.6/61.8% F1/EM when mBERT is trained on a Spanish-translated SQuAD1.1 and evaluated on XQuAD (Artetxe et al., 2019). The two approaches differ in terms of evaluation dataset (XQuAD is not a native Spanish dataset) and model (mBERT vs. CamemBERT), and French and Spanish are different languages. However, the two languages are close enough in their construction and structure that comparing these two approaches is relevant to us. Given the level of effort put into the translation process in Carrino et al. (2019), we think that both translation-based approaches, although using very recent language models, reach a performance ceiling with translated data. We also observe that enriching native French training data with the translated samples does not improve the performance on the native evaluation set. Given our experiments, we therefore conclude that there exists a significant gap between the native French and the French translated data in terms of quality, which indicates that approaches based on translated data reach a performance ceiling.

Cross-lingual Reading Comprehension
The zero-shot experiments show that multilingual models can reach strong performance on the Reading Comprehension task in French or English even when the model has not encountered labels in the target language. For example, the XLM-R LARGE model fine-tuned solely on FQuAD1.1 reaches a performance on SQuAD just a few points below the English Human Performance. The same is also observed when fine-tuning solely on SQuAD1.1 and evaluating on the development set of FQuAD1.1. We conclude here, in agreement with Artetxe et al. (2019) and Lewis et al. (2019), that the transfer of models from French to English and vice versa is a relevant approach when no annotated samples are available in the target language.
The experiments also show that the zero-shot performance is better for SQuAD than for FQuAD. This phenomenon can be explained by structural differences between French and English or an increased difficulty of FQuAD compared to SQuAD. It is also possible that the XLM-R language models capture English language specifics better than those of other languages because the dataset used for pre-training these models contains more English data. Further experiments aiming at training multilingual models on both FQuAD1.1 and SQuAD1.1 may improve the results further. This possibility is left for future work.

Conclusion
In the present work, we introduce the French Question Answering Dataset. The contexts are collected from the set of high quality Wikipedia articles. With the help of French college students, 60,000+ questions have been manually annotated. The FQuAD dataset is the result of two different annotation processes. First, FQuAD1.0 is collected to build a 25,000+ questions dataset. Second, the dataset is enriched to reach 60,000+ questions resulting in FQuAD1.1. The development and test sets have both been enriched with additional answers for the evaluation process.
We find that the Human performance for FQuAD1.1 on the test and development sets reaches respectively an F1-score of 91.2% and an Exact Match of 75.9%, and an F1-score of 92.1% and an Exact Match of 78.3%. Furthermore, we find that the Human performance on FQuAD1.1 reaches scores comparable to SQuAD1.1.
Various experiments were carried out to evaluate the performance of monolingual and multilingual language models. Our best model, CamemBERT LARGE, achieves an F1-score and an Exact Match of respectively 92.2% and 82.1%, surpassing the established Human performance in terms of F1-Score and Exact Match. The experiments show that multilingual models reach promising results but monolingual models of comparable sizes perform better.
The FQuAD1.0 training and FQuAD1.1 development sets are made publicly available in order to foster research in the French NLP area. We believe our dataset can boost French research in other NLP fields such as NLU, Information Retrieval or Open Domain Question Answering, to cite a few. The extension of the dataset to adversarial questions similarly to SQuAD2.0 is left for future work.
Table 8 lists some of the available Reading Comprehension datasets along with the number of samples they contain 4. By way of comparison, Table 8 also includes FQuAD. Figure 2 is a screenshot of the annotation interface used to collect FQuAD. Last, figure 2 shows examples of question and answer pairs for a paragraph in FQuAD.

B Additional dataset analysis

B.1 Questions and answers differences
The difficulty in finding the answer given a particular question lies in the linguistic variation between the two. This can come in different ways, which are listed in table 9. The categories are taken from Rajpurkar et al. (2016): Synonymy implies key question words are changed to a synonym in the context; World knowledge implies key question words require world knowledge to find the correspondence in the context; Syntactic variation implies a difference in the structure between the question and the answer; Multiple sentence reasoning implies knowledge requirement from multiple sentences in order to answer the question. We randomly sampled 6 questions from each article in the development set and manually labeled them. Note that samples can belong to multiple categories.
Answer length. To compare the answer lengths for the FQuAD1.1 and SQuAD1.1 datasets, we first remove all punctuation marks as well as the French words le, la, les, l', du, des, au, aux, un, une and the English words a, an, the, respectively. Then answers are split on white spaces to compute the number of tokens for each answer. The results are reported in figure 3. It appears clearly that FQuAD answers are generally longer than SQuAD answers. To highlight this difference, note that the average number of tokens per answer is equal to 2.72 for SQuAD1.1 while it is equal to 4.24 for FQuAD1.1. This indicates that reaching a high Exact Match score on FQuAD is more difficult than on SQuAD.
Human performance as a function of the answer length. To understand whether the answer length can impact the difficulty of the Reading Comprehension task, we group question and answer pairs in FQuAD and SQuAD by the number of tokens in each answer. Figure 4 shows the human performance as a function of the answer length. On the one hand, the Exact Match quickly declines with increasing answer length for both FQuAD and SQuAD. On the other hand, the F1 score is a lot less affected by answer length for both datasets. We conclude from these distributions that the difference in answer lengths between FQuAD and SQuAD may explain part of the difference in human performance regarding the EM metric, while it does not seem to have an impact on human performance regarding the F1 metric. And indeed, human performance regarding the F1 metric is very similar between FQuAD and SQuAD. It is possible that these variations in answer length distributions are due to structural differences between the French and English languages.
The more answers to a question there are, the more likely it is that any other answer is equal to one of the expected answers. As a consequence, the higher number of answers in SQuAD1.1 contributes to the higher human performance compared to FQuAD1.1 regarding the Exact Match metric.

D Additional experiments
Training on FQuAD1.0. As we open source the 1.0 version of the FQuAD dataset, we also reproduce all the native French Reading Comprehension fine-tuning experiments described in section 5.2 with the training set of FQuAD1.0.
Performance analysis. An analysis of the predictions for the best trained model on FQuAD is carried out. We have explored the distribution of answer and question types in section 4 and we now report the performance of the model in terms of F1 score and Exact Match for each category. This analysis aims at understanding how the model performs on the various question and answer types.
Learning curve. The question of how much data is needed to train a question answering model remains relatively unexplored. In our effort of annotating FQuAD1.0 and FQuAD1.1, we have consistently monitored the scores to decide whether the annotation process should be continued or stopped.
Performance analysis. Our best model, CamemBERT LARGE, is used to run the performance analysis on the question and answer types. Tables 13 and 14 present the results sorted by F1 score. The model performs very well on structured data such as Date, Numeric, or Location. Similarly, the model performs well on questions seeking structured information, such as How many, Where, When. The human score for the Person answer type is very high on the EM metric, meaning that these answers are easier to identify exactly, probably because they are generally short. On the other hand, the How and Why questions, which probably expect a long and wordy answer, are among the least well addressed.
Note that the EM score for Verb answers is also quite low. This is probably due either to the variety of forms a verb can take, or to the fact that verbs are often part of long and wordy answers, which are by definition difficult to match exactly. Some prediction examples are available in the appendix. Selected samples are not part of FQuAD, but were sourced from Wikipedia.