Clinical Case Reports for NLP

Textual data are useful for accessing expert information. Yet, since texts reflect distinct language uses, specific corpora must be built in order to design suitable NLP tools. In some domains, such as the medical domain, it may be complicated to access representative textual data and their semantic annotations, while there is a real need for efficient tools and methods. Our paper presents a corpus of clinical cases written in French, together with their semantic annotations. We manually annotated a set of 717 files with four general categories (age, gender, outcome, and origin), for a total of 2,835 annotations. The values of age, gender, and outcome are normalized. A subset of 70 files has additionally been manually annotated with 27 categories, for a total of 5,198 annotations.


Introduction
In Natural Language Processing (NLP), texts are useful to access information, especially expert information. Nevertheless, linguistic diversity (type of narratives, common or specialized vocabulary, regular or complex syntactic structures, etc.) requires robust tools to access the information present in those texts. Texts are needed in order to build suitable NLP-based tools, to model linguistic elements (machine learning, word embeddings), or to produce gold standards for evaluating automatic systems. However, for privacy and ethical reasons, documents from specialized domains (e.g., clinical notes or justice decisions) are not easily accessible without authorization.
When such data exist for research, they are generally limited to the English language, such as the MIMIC-III database (Johnson et al., 2016) and derived corpora. For French, the Quaero medical corpus (Névéol et al., 2014) is composed of a limited number of documents (13 documents from the European Medicines Agency, 25 documents from the European Patent Organization) or of very short documents (2,500 Medline titles).
In order to make available documents concerned by privacy issues, de-identification techniques have been widely used to replace nominative data by plausible information (Meystre et al., 2010; Kayaalp, 2017). Despite the recent improvements of these techniques, especially those based on artificial neural networks (Dernoncourt et al., 2017), one cannot ensure that all nominative data have been removed, and humans must further check those documents. Another solution relies on the production of synthetic data (Lohr et al., 2018). Originally, synthetic data were generated and used to train OCR systems for handwriting recognition (Doermann and Yao, 1995). They are now used when original data are missing or to provide more data, despite their artificial character (Eger et al., 2019). Moreover, whether the texts are de-identified or artificially generated, their linguistic specificity will have an impact on the rule-based and statistical NLP approaches subsequently designed on them.
In this paper, we present the semantic annotations we made on a corpus of clinical cases written in French by domain experts. Since this corpus is composed of already published and freely accessible clinical cases, our aim is to make this annotated corpus available for research. To illustrate the usefulness of those annotations, we present a few basic experiments we made.

Corpus and annotation guidelines

Corpus
In the clinical domain, one solution to overcome the privacy and ethical issues raised by electronic health records (EHRs) consists in using clinical case reports. Indeed, it is quite common to find freely available publications from scientific journals which report clinical cases of real de-identified or fake patients. Such clinical cases are usually published and discussed to improve the medical knowledge (Atkinson, 1992) of colleagues and medical students. One may find scientific journals specifically dedicated to case reports, such as the Journal of Medical Case Reports launched in 2006 (Rison et al., 2017). Clinical cases consist of a detailed and hierarchically structured description of the history, signs and symptoms, diseases, tests, treatments, follow-up and outcome of a given patient or of a cohort of patients (Rison, 2013). As pointed out by Lysanets et al. (2017), clinical cases exhibit linguistic particularities which make them a specific genre of medical texts: active voice sentences, past simple tense, personal pronouns, and modal verbs. Beyond this caveat, they represent clinical content that is both available and useful, especially for the NLP community, for which access to EHRs is becoming harder and harder.
We assume that this new orientation to tackle the medical data accessibility problem may become popular in the years to come within the biomedical domain. For instance, Satomura and Amaral (1992) produced, back in the 1990s, an automatic system designed to index clinical cases with ICD-9 codes. These clinical cases, written in English, were extracted from the New England Journal of Medicine and allowed the researchers to develop and test their NLP system. More recently, Gurulingappa et al. (2012) produced a benchmark corpus composed of 3,000 clinical case reports in English, which was then annotated with several categories (drug, dosage, and adverse effects) and with the relationships among them, in order to provide mentions of adverse drug reactions.
The corpus we present in this work is composed of 717 clinical case reports written in French (see Table 1 for general statistics). These cases have been previously published and are freely accessible. Cases from the scientific literature often come with their discussion and keywords. In this work, we only focus on the clinical case description. This set has been manually annotated with general and fine-grained information, which is described in the two following sections. This corpus is part of a larger and still growing corpus, which currently contains over 4,100 clinical cases (Grabar et al., 2018).

Annotations of general information
We considered four general categories of information for the annotation. They are related to demographic data (age and gender) and to medical data (the starting medical problem, or origin, and the outcome). Most of the clinical cases describe the clinical events of one patient. Yet, some clinical cases describe several patients, in which case all relevant information is annotated for each patient. For this reason, the total number of annotations may be higher than the number of clinical cases. For three out of four categories, the values are normalized and taken from finite sets:
• Age ∈ ℕ: numerical value rounded in years; an age written in letters is converted into a numerical value;
• Gender ∈ { feminine, masculine };
• Outcome ∈ { recovery, improvement, stable condition, worsening, death }.
Besides, when several ages are given for the same patient, only the age at the moment of the main clinical event is considered. For the category Origin, the values correspond to text spans describing the initial medical problem. Two scientists with a biomedical computer science background created the annotations independently, and then produced consensus annotations. All spans of text providing the expected information were annotated. For the category Origin, the most inclusive text spans have been chosen.
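As an illustration, the normalized general categories can be encoded as a small validated record. The sketch below is only one possible encoding: the names PatientAnnotation, origin_span and normalize_age, as well as the tiny number-word table, are assumptions made for this example and do not describe the released corpus format.

```python
from dataclasses import dataclass
from typing import Optional

# Finite value sets taken from the annotation guidelines.
GENDERS = {"feminine", "masculine"}
OUTCOMES = {"recovery", "improvement", "stable condition", "worsening", "death"}

# Tiny illustrative lookup for ages written in letters (hypothetical).
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "ten": 10, "seventy-three": 73}


def normalize_age(raw: str) -> int:
    """Return an age as a whole number of years (digits or a known number word)."""
    text = raw.strip().lower()
    if text in NUMBER_WORDS:
        return NUMBER_WORDS[text]
    return round(float(text.replace(",", ".")))


@dataclass(frozen=True)
class PatientAnnotation:
    age: Optional[int]
    gender: Optional[str]
    outcome: Optional[str]
    origin_span: Optional[str]  # free text span, not normalized

    def __post_init__(self):
        # Reject values outside the normalized sets.
        if self.age is not None and self.age < 0:
            raise ValueError("age must be a non-negative number of years")
        if self.gender is not None and self.gender not in GENDERS:
            raise ValueError(f"unknown gender value: {self.gender!r}")
        if self.outcome is not None and self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome value: {self.outcome!r}")
```

A case describing several patients would simply carry one such record per patient, which is why the number of annotations can exceed the number of files.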

Annotations of fine-grained information
The corpus has also been enriched with fine-grained annotations of entities concerning physiology, surgery, diseases, drugs, temporal data, and lab and exam results. The annotations are based on the semantic types from the UMLS (Lindberg et al., 1993), on existing annotation guidelines such as those of the i2b2 NLP Challenges, and on medical entities from our corpus. We provide those annotations as a basis for several NLP tasks such as information extraction or automatic classification based on clinical entities.
In this section, we present the guidelines we defined. For each category, we give a definition and a few examples from the corpus.
Biology: anatomical parts (left lung, thyroid), localization of procedures or diseases (arterial, pulmonary), and biological functions (pregnancy, pulse)

Surgery
These categories are related to surgery:
• Medical speciality, including the types of medical units (oncology, surgical care units)
• Tests, including names of tested elements (radiography, biological check-up, blood pressure)
• Surgical treatments: treatments done by physicians (chemotherapy, resection)
• Surgical approach: access used by the physician (apical access)
• Medical devices used by patients or by physicians (drainage, mask, sensor)

Diseases
We considered four types of disease-related information:
• Pathology: mentions of diseases or diseased condition (acute lymphoblastic leukemia, tumor)
• Signs or symptoms which are not chronic diseases (cough, fever, headache, hypertension)
• Biological organism: bacteria and infectious organisms (escherichia coli, group B streptococcus)
• Nature: indication of quality (qualifying adjectives, grade) for diseases, signs and symptoms (pT2 G1 carcinoma, benign cyst)

Drugs
• Pharmaceutical class or family of drugs (antibiotic, anticoagulant, anti-vitamin K)
• Substance: commercial and generic drug names or generic substance (acetaminophen, ferrous sulphate)
• Concentration of molecules in drugs (10%, 5 mg/ml)
• Mode of administration (intravenous, oral route, by nebulization)
• Dose: composed of value and unit for drug dose (0.5 mg, four doses, one to two pills, three million units) or rates (5 mg/kg). If a dose was changed according to a past condition, the modification is annotated with one of two normalized values (increase, decrease)

Temporal data
• Date: absolute and relative dates (January 2005)
• Moment: moment of a day for drug intake or surgical intervention (at bedtime, the morning) or specific time during the hospital stay (at D1-D2)
• Duration, especially for treatments and diseases (since 10 years, for four weeks)
• Frequency for intakes, diseases, signs and symptoms (once a day, if needed, chronic, every two weeks)

Lab and exam results
This category covers all numerical values from lab results (105/80 mm Hg, 68 bpm) and the results of examinations (e.g., normal for imaging or palpation).

Additional information
Some categories are annotated with additional information.

Linguistic annotations
Following prior work, we added an assertion value chosen among six possible tags:
• Present: default value;
• Absent: element planned but not realized;
• Conditional: element that can occur under certain circumstances;
• Hypothetical: element that may occur in the future;
• Possible: element that may occur;
• Associated to someone else: element concerning family or acquaintances.
Assertions may be used for the annotation of the Pathology, Signs and Symptoms, Tests, and Treatments categories.

Medical information
Linguistic interpretation: For Substances and Weight, if the medication or the weight changes with respect to its previous value, this modification is annotated with one of two normalized values: stop or titration for Substances, and gain or loss for Weight.
Medical interpretation: For lab results (e.g., blood pressure) and physiological data (e.g., temperature), if values can be compared to known ranges (external medical knowledge), three normalized levels are used (high, normal, low) in order to make those values easier to interpret.
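The medical interpretation can be pictured as a comparison of a measured value against a reference interval. The following sketch is illustrative only: the reference ranges and the names REFERENCE_RANGES and interpret are assumptions made for this example, whereas in the corpus the levels were assigned manually by the annotators.

```python
# Illustrative reference ranges (assumed for the example, not taken from the corpus).
REFERENCE_RANGES = {
    "blood_urea_mmol_per_l": (2.5, 7.5),
    "temperature_celsius": (36.1, 37.8),
}


def interpret(measure: str, value: float) -> str:
    """Map a numerical value to one of the three normalized levels."""
    low, high = REFERENCE_RANGES[measure]
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"


if __name__ == "__main__":
    # Two blood urea values expressed in mmol/l.
    print(interpret("blood_urea_mmol_per_l", 10.0))
    print(interpret("blood_urea_mmol_per_l", 6.4))
```

Under these assumed ranges, a blood urea of 10 mmol/l is labelled high and a value of 6.4 mmol/l is labelled normal.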

General information
We computed inter-annotator agreement scores on the normalized values for general information (Age, Gender and Outcome) and on the annotated text spans for Origin. We achieved excellent agreement for Age and Gender (κ=0.939), differences being due to omissions; poor agreement for Outcome (κ=0.369), due to differences of interpretation between close values (e.g., recovery vs. improvement for long-term diseases); and very low agreement for Origin (κ=-0.762), since spans of text were often distinct between annotators. As stated by Grouin et al. (2011), the κ metric is not well suited for text annotations since it relies on a random baseline for which the number of units that may be annotated is hard to define. As a consequence, the classical F-measure is often used as an approximation of inter-annotator agreement. In the following experiments, we present the inter-annotator agreements through Precision, Recall, and F-measure.
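A minimal sketch of both measures is given below, assuming Cohen's κ for two annotators on normalized labels and exact matching of (start, end, label) spans for the F-measure. The function names and the exact-match criterion are assumptions made for this example and are not necessarily the settings used for the scores reported here.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators labelling the same units."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of units")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement estimated from each annotator's label distribution.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


def span_prf(reference, hypothesis):
    """Exact-match precision, recall and F-measure between two sets of spans."""
    ref, hyp = set(reference), set(hypothesis)
    true_positives = len(ref & hyp)
    precision = true_positives / len(hyp) if hyp else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

The F-measure only needs the spans produced by each annotator, which avoids having to define the number of units that could have been annotated.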
Outcome. The outcome value is complex since distinguishing between recovery and improvement may require more knowledge than the information presented in the clinical case. As an example, for a patient presenting arterial hypertension at the consultation, do we consider a "recovery" or an "improvement" when clinicians indicate a complete remission 18 months after the intervention? Can a remission be considered a recovery? Is a period of eighteen months sufficient to make a decision? If no tumor recurrence is observed after fifteen months of follow-up, can we still consider a "recovery", given that a tumor may appear again?
In the end, we made a distinction between cancers or malignant tumors ("improvement") and benign tumors or other diseases ("recovery"). For chronic diseases, we only considered an "improvement".
Fine-grained categories. In Table 2, we indicate the inter-annotator agreement for the main fine-grained categories, computed on a subset of 70 clinical cases that we annotated in duplicate. The fine-grained annotation is being applied to the whole set of 717 clinical cases. We expect that further annotations will yield better inter-annotator agreement.
Physiological information (body measurements and vital signs) is found in few files (less than 10% of the files in the dataset): since these types of information are useful for a limited number of pathologies or signs or symptoms, they occur in few documents.
Table 4 presents the final number of annotations for the four general categories and their distribution over the whole dataset of 717 files. Since a few clinical cases describe several patients (either a cohort of patients or a pathology affecting several patients), the total number of annotations may be higher than the total number of files in the corpus. This has been observed for Gender and Origin.

[...] de 73 ans n'ayant eu qu'un seul enfant par césarienne, mais présentant depuis plusieurs années un prolapsus de stade III totalement négligé par la patiente. Elle est en insuffisance rénale obstructive avec une urée sanguine à 10 mmol/l de sérum. Sur l'urographie intraveineuse, on note une dilatation urétéropyélocalicielle bilatérale très importante. La tension artérielle est de 12/8. La mise en place d'un pessaire améliore très rapidement la situation puisque quatre jours plus tard, l'urée sanguine est à 6,4 mmol/l. La patiente refuse tout geste chirurgical complémentaire et elle est ensuite perdue de vue.
Figure 2: Annotated case report. General information includes the following tags: genre (gender), âge (age), origine (origin), issue (outcome). Other tags are related to fine-grained information. Normalized values appear between square brackets (feminine gender, high or normal values, improvement outcome).

English translation: [...] stage III prolapse totally neglected by the patient. She is in obstructive renal failure with blood urea at 10 mmol/l serum. On the intravenous urography, we notice a very significant bilateral ureteropyelocaliceal dilation. The blood pressure is 12/8. The pessary placement very quickly improves the situation since four days later, the blood urea is 6.4 mmol/l. The patient refuses any additional surgery and is then lost to follow-up.
The case is annotated with general and fine-grained information. Elements in square brackets correspond to normalized tags: feminine ("féminin") for gender, high ("haut") and normal for values, and improvement ("amélioration") for outcome.

Distribution of annotations
Columns two and three of Table 4 indicate that general information is found in all clinical cases. For gender and origin, the number of annotations is higher than the number of clinical cases because several people are described in some cases (gender), and because several origins of consultation may be indicated (namely, several signs or symptoms). From Table 3, one can observe a very imbalanced number of annotations per category. The main categories are: signs or symptoms (15.4%), tests (15.1%), localizations (11.6%), substances (8.4%), and anatomical parts (8.2%). Mentions of signs and symptoms are about three times more frequent than annotations of diseases (5.5%). Small categories are related to specific data (especially body measurements and vital signs) that are indicated in a limited number of cases. This may correspond to a general difference with respect to clinical patient records.

Experiments and analysis
The annotated corpus has been used to produce similar annotations automatically and to evaluate them. Our aim is to verify the adequacy of the annotations for this information extraction task, as well as to provide a baseline for future work. We stress that we do not aim to provide new methods, nor to improve existing systems, but to present a few use cases that can be built on the annotations presented in section 2.

Linguistic analysis
Syntax. Depending on the outcome observed in clinical cases, we studied the distribution of a few verbal tenses based on the POS annotations provided by the TreeTagger system (Schmid, 1994). As presented in Table 5, past perfect is the main tense for the death outcome while present is the main tense for both improvement and stable condition outcomes. We also observe no future tense in case reports ending in death. Table 6 presents the distribution of demonstrative pronouns (PRO:dem) vs. personal pronouns (PRO:per) depending on the outcome. We observe that impersonal linguistic constructions are used more for stable condition outcomes (fewer personal pronouns and more demonstrative pronouns) than for other outcome types, as if the uncertainty of a stable condition (neither improvement nor worsening) prevented an overly personal representation of the case.
Semantics. Table 7 presents the main elements annotated as anatomical parts, pathologies, signs or symptoms, and surgical treatments depending on the gender. The observed differences between medical entities mainly reflect anatomical parts specific to men or women, or distinct prevalences of pathologies. We observe fewer differences in surgical treatments than in other categories.
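The kind of count behind such tables can be sketched as follows, assuming each case has already been POS-tagged and paired with its normalized outcome. The function name tag_distribution and the input layout are assumptions made for this example; the tag strings follow the notation used above and may differ in case from the raw tagger output.

```python
from collections import Counter, defaultdict


def tag_distribution(cases, tags_of_interest):
    """Relative frequency of selected POS tags for each outcome.

    `cases` is an iterable of (outcome, [(token, pos_tag), ...]) pairs.
    Outcomes with no occurrence of any selected tag are left out.
    """
    counts = defaultdict(Counter)
    for outcome, tagged_tokens in cases:
        for _token, tag in tagged_tokens:
            if tag in tags_of_interest:
                counts[outcome][tag] += 1
    distribution = {}
    for outcome, counter in counts.items():
        total = sum(counter.values())
        distribution[outcome] = {tag: counter[tag] / total for tag in tags_of_interest}
    return distribution


if __name__ == "__main__":
    toy_cases = [
        ("death", [("il", "PRO:per"), ("décédé", "VER:pper")]),
        ("improvement", [("elle", "PRO:per"), ("cela", "PRO:dem"), ("va", "VER:pres")]),
    ]
    print(tag_distribution(toy_cases, ["PRO:dem", "PRO:per"]))
```

The same function can be called with tense tags instead of pronoun tags to compare verbal tenses across outcomes.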

Information extraction
The information extraction experiments rely on the Wapiti tool (Lavergne et al., 2010), which implements linear-chain CRFs (Lafferty et al., 2001). We trained a model on the 16 fine-grained categories presented in Table 2, through a 10-fold cross-validation process, using ℓ1 regularization. We used the following features: unigrams and bigrams of tokens, number of characters, typographic case, presence of punctuation and digits, Soundex code value of each token, relative position of the token within the document (beginning, middle, end), POS tags from the TreeTagger system (Schmid, 1994) and syntactic chunks based on those tags, presence of the token in a dictionary of 251k inflected forms for French, and cluster id (120 classes) of each token using the clustering algorithm from Brown et al. (1992) implemented by Liang (2005).
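To make the surface features more concrete, the sketch below computes a subset of them (unigram, bigram, length, typographic case, punctuation, digit, relative position) for one token. It is a simplified illustration: the function names and feature keys are assumptions made for this example, it does not reproduce the actual feature templates given to the CRF tool, and the Soundex, POS, chunk, dictionary and cluster features are left out.

```python
import string


def relative_position(index, length):
    """Coarse position of a token in the document: beginning, middle or end."""
    third = length / 3
    if index < third:
        return "beginning"
    if index < 2 * third:
        return "middle"
    return "end"


def token_case(token):
    """Typographic case of a token."""
    if token.isupper():
        return "upper"
    if token.islower():
        return "lower"
    if token[:1].isupper():
        return "capitalized"
    return "other"


def token_features(tokens, index):
    """Surface features for the token at `index` in a tokenized document."""
    token = tokens[index]
    previous = tokens[index - 1].lower() if index > 0 else "<BOS>"
    return {
        "unigram": token.lower(),
        "bigram": previous + " " + token.lower(),
        "n_chars": len(token),
        "case": token_case(token),
        "has_punct": any(char in string.punctuation for char in token),
        "has_digit": any(char.isdigit() for char in token),
        "position": relative_position(index, len(tokens)),
    }
```

For example, calling token_features on the tokens of "La tension artérielle est de 12/8 ." at index 5 describes "12/8" as a four-character token containing both a digit and punctuation, located at the end of the sequence.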

Discussion
Corpus. One contribution of this work is the availability of an annotated corpus from the medical domain for French. We based our annotation schema both on existing ones (semantic types from the UMLS, i2b2 NLP Challenges) and on types of elements found in our corpus. This annotated corpus will be made available for research purposes and may be of interest for several NLP tasks related to the biomedical domain: information extraction, relationship identification, classification, discourse analysis, temporality, etc.
Human annotations vs. CRF. We observed that the results obtained by the CRF system are in line with the agreement obtained by humans when annotating the corpus. More specifically, while producing the gold standard, humans had to deal with some categories that are harder to process than others. We observe that those categories are generally difficult to retrieve and annotate with the CRF model as well: Concentration (F=0.38 vs. 0.13), Function (F=0.37 vs. 0.35), and Pathology (F=0.36 vs. 0.31). One explanation is the lack of regularity (for the CRF system) and content that is ambiguous w.r.t. content from other categories. Yet, two categories considered as hard for humans yielded better results than expected with the CRF model: Specialty (F=0.25 vs. 0.56) and Dates (F=0.40 vs. 0.67). The differences between human annotators that produce those low scores were mainly due to omissions. Conversely, humans outperformed the CRF model on Frequency (F=0.68 vs. 0.33), Duration (F=0.66 vs. 0.42), and Devices (F=0.46 vs. 0.09). Those categories are composed of distinct elements with low frequencies of use, which are complex to process for a probability-based system but simple for humans.
As future work, we plan to continue the fine-grained annotation of the whole corpus. We also plan to define relationships between the existing entities, in order to provide annotations of relations. Despite the absence of relationship annotations, the corpus can still serve to perform unsupervised experiments. Such results may be used for automatic pre-annotation of relationships, in order to ease the human annotation work.

Conclusion
In this paper, we presented a corpus composed of 717 medical clinical case reports, written in French, with two levels of annotations (general and fine-grained annotations). Our annotation schema is composed of four general categories (age, gender, outcome, origin) for a total of 2,835 annotations, and 27 fine-grained categories dealing with five domains (physiology, surgery, diseases, drugs, temporal) for a total of 5,198 annotations on a subset of 70 files. For certain categories, the annotations are provided in a normalized format (age, gender, outcome), while other categories are associated with additional information based on human judgement, either of a linguistic nature (assertions, change of conditions) or of a medical nature (lab results compared to known ranges). The corpus and its annotations will be made available for research. We expect that the availability of this corpus may boost research on biomedical textual data in French, and provide the domain with more robust and stable tools leading to a better reproducibility of results.