CAS: French Corpus with Clinical Cases

Textual corpora are extremely important for various NLP applications as they provide the information necessary for creating, tuning and testing these applications and the corresponding tools. They are also crucial for designing reliable methods and obtaining reproducible results. Yet, in some areas, such as the medical area, confidentiality or ethical reasons make it difficult or even impossible to access textual data representative of those produced in these areas. We propose the CAS corpus, built with clinical cases as they are reported in the published scientific literature in French. We describe this corpus, which currently contains over 397,000 word occurrences, and its existing linguistic and semantic annotations.


Introduction
Textual corpora are extremely important for various NLP applications as they provide the information necessary for creating, tuning and testing these applications and the corresponding tools. Yet, in some areas, due to confidentiality or ethical reasons, it is difficult or even impossible to access representative textual data. The medical and legal areas are such examples: in the legal area, information on lawsuits and trials remains confidential, while in the medical area, medical confidentiality must be respected. In both situations, personal data cannot be used. For several years now, anonymization and de-identification methods and tools have been available and provide competitive and reliable results (Ruch et al., 2000; Sibanda and Uzuner, 2006; Uzuner et al., 2007; Grouin and Zweigenbaum, 2013), reaching up to 90% precision and recall. But even de-identified data may be difficult to access and use freely for research purposes because there is a risk of re-identification of people, and more particularly of patients (Meystre et al., 2014; Grouin et al., 2015), because several medical histories are unique, or for other reasons. Hence, applying de-identification tools to personal data often does not make these data freely available and usable within the research context.
Yet, there is a real need for the development of methods and tools for several applications suited to such restricted areas. For instance, in the medical area, it is important to have suitable tools for information retrieval and extraction, for the recruitment of patients for clinical trials, and for performing several other important tasks such as indexing or the study of temporality and negation (Embi et al., 2005; Hamon and Grabar, 2010; Uzuner et al., 2011; Fletcher et al., 2012; Sun et al., 2013; Campillo-Gimenez et al., 2015; Kang et al., 2017). Another important issue is related to the reliability of tools and to the reproducibility of study results across similar data from different sources. The scientific research and clinical communities are indeed increasingly coming under criticism for the lack of reproducibility in the biomedical area (Chapman et al., 2011; Collins and Tabak, 2014; Cohen et al., 2016), as well as in other areas. A first step towards the reproducibility of results is the availability of freely usable tools and corpora. In our work, we are mainly concerned with building freely available corpora from the medical area.
The purpose of our work is to introduce the CAS corpus with French medical data, containing clinical cases such as those published in scientific literature or used for the education and training of medical students. In what follows, we first present some works on the creation of medical corpora, with particular emphasis on corpora freely available for research (Section 2). We then introduce and describe the CAS corpus in French (Section 3) and its current annotations. We conclude with some directions for future work (Section 4).
Within the medical area, we can distinguish two main types of medical corpora: scientific and clinical. Scientific corpora are drawn from scientific publications and reporting. Such corpora are becoming increasingly available for research thanks to initiatives dedicated to open publication, such as those promoted by the NLM (National Library of Medicine) through the PUBMED portal, which is specifically dedicated to the biomedical area, and by the HAL and ISTEX initiatives, which provide generic portals for accessing scientific publications from various areas, including medicine. Such corpora describe research works, their motivation, methods, results and issues on precise research questions. Other portals may also provide access to scientific literature for specific purposes, like the indexing of reliable literature, as proposed by HON (Boyer et al., 1997), CISMEF (Darmoni et al., 1999), and other similar initiatives (Risk and Dzenowagis, 2001). Thanks to some research works, there are also scientific corpora which provide precise annotations and categorizations. These are mainly built for the purposes of challenges (Kelly et al., 2013; Goeuriot et al., 2014) but may also result from the work of researchers, such as POS-tagged (Tsuruoka et al., 2005) and negation-annotated (Szarvas et al., 2008) corpora. As for clinical corpora, they are related to hospital and clinical events of patients. Such corpora typically describe the medical history of patients and the medical care they are undergoing. It is complicated to obtain free access to this kind of medical data and, for this reason, there are very few clinical corpora freely available for research. In our work, we are mainly interested in clinical corpora: the proposed review of existing work is aimed at clinical corpora which are freely available for research. We present here the main existing clinical corpora.
MIMIC (Medical Information Mart for Intensive Care) corpora, now in their version III, provide the largest available set of structured and unstructured clinical data in English. MIMIC III is a single-center database comprising information relating to patients admitted to critical care units at a large tertiary care hospital. These data include vi-
[Text of Figure 1]
A term female infant was born by vaginal delivery with normal birth weight, body length and APGAR score, from a 42-year-old mother with 13 previous pregnancies resulting in 3 miscarriages and 10 live births. The mother had no history of antenatal medical illness nor of exposure to smoking, drinking and other drugs. At birth, general and systemic examination revealed a round face, single palmar crease, left precordial systolic murmur. Two hours after birth a deterioration of the general condition occurred, with generalized hypotonia, cyanosis, poor feeding. The blood count revealed white blood cell count of 35.6*10 /µL with 20.6*10 /µL, 57.9% monocytes, normal neutrophils, lymphocytes and eosinophils count, hemoglobin levels of 19.1 g/dl and 27*10 /µL platelets count. The acute phase reactants were negative. Because she maintained the altered general condition and the platelets ranged between 17-18 *10 /µL, on the 8th day after birth she was referred to our unit for proper diagnosis and treatment. Physical examination showed a phenotype suggestive for Down syndrome, later confirmed by karyotyping (47, XX + 21). She was lethargic, tachypneic and a systolic heart murmur was observed. The liver was 2 cm below the right costal margin, along with a slight enlargement of the spleen. The laboratory tests on the first day of admission in our unit revealed a white blood count of 15.8*10 /µL, with an abnormal monocyte count (increased absolute and percentile count: 5.66*10 /µL, respectively 35.5%), normal absolute neutrophil count (5.53*10 /µL), a hemoglobin level of 15.9 g/dl and severe thrombocytopenia (15*10 /µL). The biochemical parameters including electrolytes, uric acid, creatinine, bilirubin, liver enzymes were normal. The serum lactate dehydrogenase was raised. The bacterial culture work-up and titers of antibodies against toxoplasmosis, cytomegalovirus, Epstein Barr virus, hepatitis C, HIV were negative. The peripheral blood smear presented atypical cells. The bone marrow aspiration showed hemodiluted aspirate with blast cells. Immunophenotyping revealed 23% blast cells, positive for megakaryocytic markers (CD42b, CD41, CD61), myeloid markers (CD33), progenitor cell markers (CD117, CD34) and T cell marker -CD7 positive. MPO and HLA/DR were negative. The mutational status of AML-ETO, PML-RARα, FLT3 and NPM1 fusion genes came out absent. The positive diagnosis was acute megakaryoblastic leukemia (AMKL). The echocardiography found a patent foramen ovale. The infant underwent chemotherapy according to the Down syndrome-specific AML chemotherapy protocol, consisting in four cycles of treatment: the first two cycles (induction phase) included combinations of cytarabine and liposomal daunorubicin and the last two cycles (consolidation phase): etoposide, cytarabine and mitoxantrone. Our patient acquired clinical and hematological remission without serious adverse events.

The CAS corpus

Content of the corpus
We present the CAS corpus in French. It contains clinical cases such as those published in scientific literature and training material. Cases from these different sources are included in the corpus. Usually, the source data are available as PDF files. Their conversion to text format is automatic, but the output then needs to be fully checked in order to correct potential segmentation errors (removing the paratext specific to a given journal, verifying the conversion of columns, of line and page ends, etc.).
Similarly to clinical documents, the content of clinical cases depends on the clinical situations which are illustrated and on the disorders, but also on the purpose of the presented cases (description of diagnoses, treatments or procedures, expected audience, etc.). Figure 1 presents an example of a clinical case in English. Such data are de-identified by the authors and their publication is done with the written permission of patients. The case reports can be related to any medical situation (diagnosis, treatment, procedure, follow-up...). Patient history, lab results, clinical evolution, treatment, etc. can also be provided to illustrate clinical cases. Finally, these clinical cases are discussed. Hence, such cases may present an extensive description of medical problems. Such publications gather medical information related to clinical discourse (clinical cases) and to scientific discourse (introduction and discussion). Related scientific literature is also provided.
As we can see from Figure 1, the clinical part of publications on clinical cases may be very similar to clinical documents: it describes patients and proposes their diagnosis based on examination, imaging, and biological and genetic information. Besides, numerical values and abbreviations are also present. Misspellings, which are quite frequent in clinical documents, may be absent from publications on clinical cases.

Annotation of the corpus
Currently, the corpus contains linguistic and semantic annotations.
At the linguistic level, the corpus is PoS-tagged and lemmatized with a tool developed in-house and available as a web-service at https:// anonymized_url. Then, several layers of semantic annotation are performed automatically:
• Concept Unique Identifiers (CUI) corresponding to French terms from the UMLS (Lindberg et al., 1993) for single or multi-word terms. For multi-word terms, the annotation uses the IOB (Inside-Outside-Begin) format. For instance, the two-word term vitamine B12 is encoded as follows:
    ...        O
    vitamine   B-C0042845
    B12        I-C0042845
    ...        O
  In the current version of the corpus, when several CUIs compete for the same span, only the longest, and supposedly more precise, CUIs are kept. For instance, carence en vitamine B12 (deficiency in B12 vitamin) (C0042847) will be preferred to vitamine B12 (C0042845);
• Negation. Negation indicates whether a given disorder, procedure or treatment is present or not in the medical history and care of a given patient. For this reason, its annotation and detection are important. We adopt the approach proposed by Fancellu et al. (2016) and adapted for French by Dalloux et al. (2018), based on Machine Learning techniques trained on annotated data. This follows a two-step process: (1) the negation markers are detected with a specifically trained CRF; (2) the scope of each detected marker is found with a neural network (Bi-LSTM with a CRF layer). On the French and English data tested, the detection of negation gives up to 0.98 for the cues and 0.86 for their scope;
• Uncertainty. Uncertainty is also an integral part of medical discourse and should be taken into account for a more precise computing of the status of disorders, procedures and treatments. A set of markers has been built manually.
Since there may be several markers of negation and uncertainty in a sentence, they are numbered with their scopes accordingly.
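The CUI layer described above (IOB tags, with the longest competing term kept) can be illustrated with a minimal sketch. It assumes a greedy longest-match lookup over a toy lexicon; the paper does not state how matching is implemented, so the function name `annotate_cui` and the lexicon structure are hypothetical. The two lexicon entries are the ones quoted in the text.

```python
from typing import Dict, List, Tuple

# Toy lexicon (assumption): lowercased token tuples mapped to CUIs.
# A real lexicon would hold the French terms of the UMLS.
LEXICON: Dict[Tuple[str, ...], str] = {
    ("vitamine", "b12"): "C0042845",
    ("carence", "en", "vitamine", "b12"): "C0042847",
}
MAX_TERM_LEN = max(len(term) for term in LEXICON)


def annotate_cui(tokens: List[str]) -> List[Tuple[str, str]]:
    """Tag tokens with IOB labels, preferring the longest matching term."""
    tagged: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        match_len, cui = 0, None
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(MAX_TERM_LEN, len(tokens) - i), 0, -1):
            key = tuple(tok.lower() for tok in tokens[i:i + n])
            if key in LEXICON:
                match_len, cui = n, LEXICON[key]
                break
        if cui is None:
            tagged.append((tokens[i], "O"))
            i += 1
        else:
            tagged.append((tokens[i], f"B-{cui}"))
            tagged.extend((tok, f"I-{cui}") for tok in tokens[i + 1:i + match_len])
            i += match_len
    return tagged


if __name__ == "__main__":
    for token, tag in annotate_cui(["une", "carence", "en", "vitamine", "B12"]):
        print(token, tag)
```

With this lexicon, the four tokens of carence en vitamine B12 all receive C0042847, and the shorter term vitamine B12 is annotated with C0042845 only when it is not part of the longer term.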
In Table 1, we present an excerpt from the corpus with all the aforementioned linguistic and semantic annotations for the sentence L'adolescent paraît triste et ne parle pas. (The teenager seems to be sad and doesn't speak.)
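To make the numbering of cues concrete, the following sketch tags the example sentence with numbered uncertainty and negation cues. It is only an illustration under strong assumptions: the marker lists are hypothetical, the lookup is purely lexical (whereas the negation cues in the corpus are detected with a trained CRF), scopes are not computed, and the `I-n-y` label for the second part of a two-part negation is an assumption not given in the text.

```python
from typing import List, Tuple

# Hypothetical marker lists, for illustration only.
UNCERTAINTY_MARKERS = {"paraît", "semble", "probablement"}
NEGATION_OPENERS = {"ne", "n'"}        # first part of "ne ... pas"
NEGATION_CLOSERS = {"pas", "jamais"}   # second part of a two-part cue
NEGATION_STANDALONE = {"sans", "aucun", "aucune"}


def tag_cues(tokens: List[str]) -> List[Tuple[str, str]]:
    """Label cue tokens with B-u-x / B-n-y, numbering cues per sentence."""
    tags = ["O"] * len(tokens)
    n_uncertainty = 0
    n_negation = 0
    open_negation = None  # number of a negation cue still waiting for its closer
    for i, token in enumerate(tokens):
        low = token.lower()
        if low in UNCERTAINTY_MARKERS:
            n_uncertainty += 1
            tags[i] = f"B-u-{n_uncertainty}"
        elif low in NEGATION_OPENERS:
            n_negation += 1
            tags[i] = f"B-n-{n_negation}"
            open_negation = n_negation
        elif low in NEGATION_CLOSERS and open_negation is not None:
            # Attach the closer to the cue opened by "ne" (assumed I- label).
            tags[i] = f"I-n-{open_negation}"
            open_negation = None
        elif low in NEGATION_STANDALONE:
            n_negation += 1
            tags[i] = f"B-n-{n_negation}"
    return list(zip(tokens, tags))


if __name__ == "__main__":
    sentence = ["L'", "adolescent", "paraît", "triste", "et", "ne", "parle", "pas", "."]
    for token, tag in tag_cues(sentence):
        print(token, tag)
```

Keeping one counter per cue type means that a sentence with two negations and one uncertainty marker yields B-n-1, B-n-2 and B-u-1, so each scope can later be linked to its own cue.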

Annotation statistics
Overall, the corpus currently contains 20,363 sentences and over 397,000 word occurrences excluding punctuation marks. Table 2 indicates the number of units automatically recognized for each category.
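Counts of this kind can be reproduced on any tokenized corpus with a few lines of code. The sketch below assumes the corpus is available as a list of sentences, each a list of tokens, and treats a token as punctuation when all of its characters belong to a Unicode punctuation category; both the data layout and this definition of punctuation are assumptions, not details given in the paper.

```python
import unicodedata
from typing import List, Tuple


def is_punctuation(token: str) -> bool:
    """True when the token is non-empty and made only of punctuation characters."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def corpus_stats(sentences: List[List[str]]) -> Tuple[int, int]:
    """Return (number of sentences, number of word occurrences without punctuation)."""
    n_words = sum(
        1 for sentence in sentences for token in sentence if not is_punctuation(token)
    )
    return len(sentences), n_words


if __name__ == "__main__":
    toy_corpus = [
        ["L'", "adolescent", "paraît", "triste", "et", "ne", "parle", "pas", "."],
    ]
    print(corpus_stats(toy_corpus))
```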

Conclusion
We presented a new corpus in French which provides medical data close to those produced in the clinical context: descriptions of clinical cases and their discussion. Overall, the corpus currently contains over 397,000 word occurrences excluding punctuation marks. The corpus is currently annotated with several layers of information: linguistic (PoS-tagging, lemmas) and semantic (the UMLS concepts, uncertainty, negation and their scopes). The corpus will be enriched with more published clinical cases. Other annotation layers will be added and their correctness cross-validated by human annotators. The enriched version of the corpus will be described in more detail, for instance with statistics on the age and gender of patients, their diseases, or the sources of publications.
Besides, similar corpora will be built for other languages. For instance, a repository of clinical cases in English is available on the dedicated website Archive of Clinical Cases under a Creative Commons License.
The very purpose of our work is to make these annotated corpora freely available for research.We expect that this may encourage development of

Figure 1: Example of clinical case

Table 1: Example of the annotated sentence from the corpus (B-u-x stands for the beginning of the uncertainty cue or scope number x, B-n-y for the negation cue or scope number y)

Table 2: Statistics on annotations
0.90 F-measure for the cues and 0.80 for the scope.