GGPONC: A Corpus of German Medical Text with Rich Metadata Based on Clinical Practice Guidelines

The lack of publicly accessible text corpora is a major obstacle for progress in natural language processing. For medical applications, unfortunately, all language communities other than English are low-resourced. In this work, we present GGPONC (German Guideline Program in Oncology NLP Corpus), a freely distributable German language corpus based on clinical practice guidelines for oncology. This corpus is one of the largest ever built from German medical documents. Unlike clinical documents, clinical guidelines do not contain any patient-related information and can therefore be used without data protection restrictions. Moreover, GGPONC is the first corpus for the German language covering diverse conditions in a large medical subfield and provides a variety of metadata, such as literature references and evidence levels. By applying and evaluating existing medical information extraction pipelines for German text, we are able to compare the use of medical language in GGPONC with that in other corpora, both medical and non-medical.


Introduction
The synthesis of validated experience in the form of Clinical Practice Guidelines (CPGs) serves as a basis for evidence-based decision making in clinical practice. To leverage the knowledge in CPGs for clinical decision support systems, e.g., for integration with electronic health records or automated evaluation of adherence to these guidelines, machine-readable versions of CPGs are necessary. However, CPGs today are disseminated mostly as free-text documents, with few formal elements. Thus, Natural Language Processing (NLP) might be helpful to automatically extract information from these unstructured texts and transform them into a structured, or even machine-executable, format. As CPGs are also specific to their country of origin, they are usually formulated in the respective native language, so NLP technology has to be adapted properly.
* Authors marked by * equally share first authorship.
A major reason for the progress in NLP research in the past years is the public availability of large text corpora (and attached metadata). Yet, for documents originating from a clinical context, the protection of personal information is a major requirement imposed by legal privacy regulations. Some research initiatives, e.g., I2B2 (Uzuner et al., 2011), MIMIC-III (Johnson et al., 2016), the Shared Task of Social Media for Health (Weissenbacher et al., 2019), or CLEF EHEALTH (Goeuriot et al., 2020) make de-identified clinical text document collections available under the conditions of Data Use Agreements (DUA). Besides, databases of biomedical research articles like PUBMED provide an abundance of examples of medical language. However, with only a few exceptions, such easily accessible text corpora are hardly available for German and other non-English languages. Today, there is no viable solution for sharing even de-identified clinical texts in Germany. In effect, not only are large-scale research datasets missing but also pretrained language models for German medical language, such as an equivalent of BioBERT (Lee et al., 2020).
To address (1) the lack of available German medical text resources for NLP research, and (2) the need for machine-readable CPGs, we constructed a corpus based on a set of German CPGs for oncology. The German Guideline Program in Oncology (GGPO) (Follmann, 2020), operated by the Association of the Scientific Medical Societies in Germany, the German Cancer Society and the German Cancer Aid, is in a unique position to enable this research, as their guidelines are also provided via a mobile app (Seufferlein et al., 2019). Hence, this data set is already available in a semi-structured format with rich, formatted metadata, resulting in a much higher data quality than data extracted a posteriori from PDF versions of the guidelines.

Related work
Due to legal data protection measures, German-language clinical text corpora are extremely rare and existing ones almost impossible to (re)use. Typically, accessibility is restricted to research staff only within the lifetime of a project and blocked for the outside world. In Table 1, we list, to the best of our knowledge, all existing German-language clinical research text corpora which have been described in scientific publications up until now. With only a few exceptions, these corpora are small and mostly limited to a specific medical discipline or clinical division. In addition to these purely clinical documents, other document types are also interesting for the NLP community, e.g., CPGs, which are available for a wide range of conditions.
CPGs as a target for automated text analytics have been much less utilized compared to other scientific publications and clinical documents. Most of that work took place in the context of formalizing CPGs as computer-interpretable guidelines (Peleg, 2013). Bouffier and Poibeau (2007) describe an approach to fill in a semi-structured Guideline Elements Model template by segmenting unstructured guidelines using linguistic patterns. An evaluation was run on 18 French guidelines. Serban et al. (2007) describe the extraction and instantiation of linguistic templates for guideline formalization, evaluated on a Dutch guideline for breast cancer treatment. German CPGs were the focus of Becker and Böckmann (2017) who adapted APACHE CTAKES to detect German UMLS concepts and evaluated their approach on a single German breast cancer guideline. Zadrozny et al. (2017) outline a system which identifies contradictions and disagreements in English CPGs.
Some authors have focused on extracting more task-specific information, such as activities (Kaiser et al., 2010), process structures (Wenzina and Kaiser, 2013; Zhu et al., 2013), or negation triggers (Gindl et al., 2008). Taboada et al. (2013) apply a pipeline of open-source tools for parsing CPGs, Named Entity Recognition (NER) tagging and relation extraction in a case study with 171 sentences from CPGs. Most of the aforementioned approaches work with relatively small annotated corpora and with English-language text only. Recently, Fazlic et al. (2019) use LSTMs and fuzzy rules to extract "action takers", "symptoms", "actions" and "purposes" from CPGs, recognize recommendations and predict the grade

Table 1: German-language clinical text corpora.
Corpus / Data | Documents | Sentences | Tokens | Available
FRAMED: clinical reports and medical textbook snippets (Wermter and Hahn, 2004) | - | 6k | 100k |
Reports from five medical domains (Fette et al., 2012) | 544 | - | - |
Radiology reports (Bretschneider et al., 2013) | 174 | 4k | 28k |
Transthoracic echocardiography reports (Toepfer et al., 2015) | 140 | - | - |
Operative reports (surgery) (Lohr and Herms, 2016) | 450 | 22k | 266k |
Discharge summaries from a dermatology department (Kreuzthaler et al., 2016) |

Some larger corpora of CPGs for the English language exist already. Hussain et al. (2009) present the Yale Guideline Recommendation Corpus (YGRC), a sample of 1,275 guideline recommendations extracted from the National Guideline Clearinghouse (NGC). Their work revealed inconsistencies in writing style and reporting of the strength of recommendations. Using a subset of YGRC, El-Rab et al. (2017) present a rule-based approach to detect procedures and drug recommendations. Read et al. (2016) describe the CREST corpus, consisting of 4,029 recommendations from 170 guidelines annotated with their respective recommendation strength and report a total number of 8,138 types within the recommendations. Large corpora of CPGs lend themselves to mining the state-of-the-art knowledge in a medical subfield. For instance, Leung et al. (2015) identify comorbidities by analyzing pairs of co-occurring conditions, using a corpus of 268 NGC guideline summaries. Leung and Dumontier (2016) find drug-disease relations via named entity recognition using a corpus of 377 NGC guideline summaries. The relations are compared to structured drug product labels to assess their overlap.
In summary, our work is most similar to the CREST corpus (Read et al., 2016), in the sense that we provide a corpus based on CPGs consisting of medical text and metadata. However, while the number of recommendations in GGPONC is comparable to CREST, the amount of structured metadata and background text in our corpus is much larger (see Section 4.1). Also, our corpus contains German text, addressing a scenario where available resources are much scarcer (see Table 1). While Becker and Böckmann (2017) also apply NLP to German CPGs, we consider a large superset of the CPGs used in their work and provide access to our data as a preprocessed and analyzed text corpus.

Data Collection
In order to assemble the corpus of German CPGs, we acquired semi-structured JSON versions of the guidelines from the REST API of the Content Management System (CMS) that serves the backend for the mobile app provided by the GGPO. The data was subsequently transformed from JSON to an XML format. We preserved the document structure (chapters and sections), as well as recommendation metadata and literature references. An example of the resulting XML format can be found in Listing 1. The metadata elements are described in Table 2. Literature references are included with an ID number which can be resolved to a citation in the provided literature index file.
The guidelines distinguish between recommendations and background texts; we preserved this distinction in the corpus. In general, the recommendations tend to be concise statements related to a particular clinical question. For evidence-based recommendations, literature references and evidence levels are included. The background texts provide the reasoning behind the recommendations and a summary of the evidence underlying the recommendations, again backed by literature references.
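The JSON-to-XML step described above can be illustrated with a minimal sketch. The JSON field names and the XML element names used here (`chapters`, `segments`, `evidence_level`, `litref`, and so on) are assumptions made for illustration only; the actual CMS schema and the corpus XML format are not reproduced here.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical input: the real CMS schema is not given in the text,
# so all keys below are illustrative assumptions.
GUIDELINE_JSON = """
{
  "id": "example-guideline",
  "chapters": [
    {"title": "Diagnostik",
     "segments": [
       {"type": "recommendation", "text": "Beispieltext der Empfehlung.",
        "evidence_level": "1b", "references": [101, 102]},
       {"type": "background", "text": "Beispieltext des Hintergrunds.",
        "references": [102]}
     ]}
  ]
}
"""


def guideline_to_xml(data):
    """Convert one guideline dict into an XML tree, keeping the chapter
    structure, the recommendation/background distinction, evidence
    levels, and literature reference IDs."""
    root = ET.Element("document", id=data["id"])
    for chapter in data["chapters"]:
        chapter_el = ET.SubElement(root, "chapter", title=chapter["title"])
        for segment in chapter["segments"]:
            # "recommendation" or "background" becomes the element name.
            segment_el = ET.SubElement(chapter_el, segment["type"])
            if "evidence_level" in segment:
                segment_el.set("evidence_level", segment["evidence_level"])
            ET.SubElement(segment_el, "text").text = segment["text"]
            # References are stored as IDs only; they would be resolved
            # against a separate literature index.
            for ref_id in segment.get("references", []):
                ET.SubElement(segment_el, "litref", id=str(ref_id))
    return root


root = guideline_to_xml(json.loads(GUIDELINE_JSON))
xml_string = ET.tostring(root, encoding="unicode")
```

Keeping references as bare IDs mirrors the design described in the text, where citations are resolved through a separate literature index file rather than duplicated in every segment.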

Automated Annotation
Besides the XML version of the corpus, we created plain text versions of all recommendations and background text parts to facilitate processing by existing NLP pipelines. For preprocessing steps such as sentence splitting and tokenization, we used JCORE (Hahn et al., 2016) (i.e., UIMA-based) pipelines and FRAMED (Wermter and Hahn, 2004) models which were developed for German clinical text.
We also employed the JUFIT tool (v1.1) (Hellrich et al., 2015), a filter for UMLS, to create a dictionary of all German words from the UMLS (Bodenreider, 2004) (version 2019AB) in the Semantic Groups ANAT (Anatomical Structure), CHEM (Chemicals & Drugs), DEVI (Devices), DISO (Disorders), LIVB (Living Beings), PHYS (Physiology), and PROC (Procedures) (without advanced JUFIT rules). We have chosen only these seven out of the full set of 15 UMLS Semantic Groups because we used similar categories in the named entity recognition tasks and also wanted to avoid cognitive overloading of the human annotators.
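To make the dictionary-based tagging idea concrete, the following is a minimal sketch of greedy longest-match lookup over tokens. It is not the JCORE or JUFIT implementation; the tiny dictionary and the Semantic Group assigned to each entry are hypothetical examples.

```python
# Hypothetical mini-dictionary: lower-cased token tuples mapped to a
# UMLS Semantic Group. The real dictionary is derived from the UMLS.
DICTIONARY = {
    ("eingeschränkte", "nierenfunktion"): "DISO",
    ("nierenfunktion",): "PHYS",
    ("patienten",): "LIVB",
}
MAX_LEN = max(len(key) for key in DICTIONARY)


def tag(tokens):
    """Return (start, end, group) matches with half-open token spans,
    preferring the longest dictionary entry at each position."""
    lowered = [token.lower() for token in tokens]
    matches = []
    i = 0
    while i < len(lowered):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(MAX_LEN, len(lowered) - i), 0, -1):
            group = DICTIONARY.get(tuple(lowered[i:i + n]))
            if group:
                matches.append((i, i + n, group))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return matches


tokens = "Eine eingeschränkte Nierenfunktion bei Patienten".split()
matches = tag(tokens)
```

With longest-match, "eingeschränkte Nierenfunktion" is tagged once as a two-token instance instead of additionally matching "Nierenfunktion" on its own.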
Finally, we screened TNM expressions, which were extracted using a rule-based approach implemented with the PYTHON library SPACY. This part was originally developed for German pathology reports in the context of the HIGHMED consortium of the Medical Informatics Initiative of Germany. TNM expressions and genes were specifically chosen for their relevance in cancer treatment.
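As a rough illustration of rule-based TNM screening, the sketch below uses a plain regular expression from the Python standard library instead of the SPACY rules mentioned above. The pattern is a simplified assumption covering only a few common components and is not the rule set used for the corpus.

```python
import re

# Simplified, assumed pattern; real TNM notation has more components
# and suffixes than are covered here.
TNM_PATTERN = re.compile(
    r"""
    (?<![A-Za-z0-9])            # not preceded by a letter or digit
    (?:[cpyra]{1,2})?           # optional prefix, e.g. c, p, or yp
    (?:
        T(?:[0-4][a-d]?|is|X)   # primary tumour
      | N(?:[0-3][a-c]?|X)      # regional lymph nodes
      | M(?:[01][a-c]?|X)       # distant metastasis
      | [LV][0-2X]              # lymphatic or venous invasion
      | R[0-2X]                 # residual tumour
    )
    (?![A-Za-z0-9])             # not followed by a letter or digit
    """,
    re.VERBOSE,
)


def find_tnm(text):
    """Return all substrings that look like TNM components."""
    return [match.group(0) for match in TNM_PATTERN.finditer(text)]


clinical = find_tnm("Stadium pT2 N0 M0, V1")
out_of_domain = find_tnm("Die V2 war eine Rakete")
```

The second call shows that a purely surface-based rule also fires on non-medical text, since "V2" is matched regardless of context.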

Corpus Characteristics
In total, 25 CPGs with 8,414 text segments were extracted from the CMS, comprising the first version of GGPONC. In Table 3, we give an overview of the CPGs in terms of the number of tokens and types, as well as the number of literature references. We also report the total number of recommendations and background text segments, since they serve as the units of analysis for our automated annotation pipelines. The CPGs cover a wide range of indications and anatomical locations. They also differ significantly in their extent, e.g., there is much more text for broad topics, such as palliative medicine, or indications with many treatment options, such as lung cancer. Of the approximately 38k literature references in the corpus, around 20k are unique with roughly 9k explicit links to PUBMED. We provide bibliographic details on these references alongside the corpus to facilitate research on the relationships between CPGs and the underlying medical evidence. Table 4 contains the automated named entity extraction results. Their quality and interpretation in comparison to other German (clinical and non-clinical) text corpora will be discussed in the next section.
The whole corpus consists of:
• a single XML file, including the document structure and all mentioned metadata,
• a file for the complete literature index,
• individual plain text versions of the text segments, sentences, and tokens,
• automatically created entity annotations and a subset of manually corrected annotations in standoff format.
As CPGs are subject to a regular update cycle, we are able to automatically redo the data acquisition process in the future in order to provide a historical view on the guideline development.

Comparison with Other German Medical and Non-Medical Corpora
We analyze the characteristics of GGPONC by comparing the entity matches with three German medical text corpora, namely version 1.1 of the JSYNCC corpus (case examples from clinical text books), the Jena part of the 3000PA corpus (1,106 German discharge summaries) [4], and abstracts of German case reports from PUBMED. In addition, we compare the results to out-of-domain corpora, namely German WIKIPEDIA articles on wars (WIKIWARSDE) (Strötgen and Gertz, 2011) and news articles from the KRAUTS corpus (Strötgen et al., 2018). The results are depicted in Table 4. The fraction of stop words is comparable across all medical text corpora, as is the fraction of tokens that map to UMLS concepts. As expected, the guideline recommendations contain more medical terms per token than the background text. Compared to the clinical corpora, the CPGs have more instances of the class Living Beings, as they often describe treatment recommendations for certain populations. Notably, the average sentence length is much greater in the clinical guidelines, and in particular in the background text, pointing at the more scientific style of writing prevalent in the guidelines as compared to clinical narratives. TNM expressions occur much more frequently in GGPONC, which can be attributed to its focus on the oncology domain.
[4] Based on the approval by the local ethics committee (4639-12/15) and the data protection officer of Jena University Hospital, discharge summaries were extracted from the HIS of the Jena University Hospital and further transformed.
Both out-of-domain corpora contain only small amounts of UMLS concepts (apart from the semantic class Living Beings), which indicates a high precision of our entity tagging approach. In Figure 1 we visualize the overlap of unique medical concepts from UMLS found in each of the corpora.
While there is a significant overlap between GGPONC and the clinical corpora, a major fraction of concepts is unique to each corpus. These results suggest that our corpus combined with other clinical text corpora can provide a more comprehensive view on the use of medical language, in general, than each of the corpora alone.
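The overlap analysis can be thought of as set arithmetic over the unique concept identifiers found in each corpus. The sketch below uses made-up placeholder identifiers, not actual results from the corpora.

```python
# Placeholder concept IDs for illustration only; these are not real
# extraction results.
concepts = {
    "GGPONC": {"C0000001", "C0000002", "C0000003", "C0000004"},
    "3000PA": {"C0000002", "C0000003", "C0000005"},
    "JSYNCC": {"C0000003", "C0000006"},
}


def overlap_report(concepts_by_corpus):
    """For each corpus, split its concepts into those shared with at
    least one other corpus and those unique to it."""
    report = {}
    for name, cuis in concepts_by_corpus.items():
        others = set().union(
            *(c for n, c in concepts_by_corpus.items() if n != name)
        )
        report[name] = {"shared": cuis & others, "unique": cuis - others}
    return report


report = overlap_report(concepts)
shared_by_all = set.intersection(*concepts.values())
```

A large "unique" set per corpus corresponds to the observation that combining corpora covers more of the medical vocabulary than any single corpus.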

Evaluation of Annotation Results
The automatic annotations for a subset of the CPGs have been independently reviewed by human experts (four medical students, all of whom had passed their first medical exam, supervised by a medical doctor) using the BRAT annotation tool (Stenetorp et al., 2012). Due to restricted resources for manual annotation work, we decided to evaluate on a subset of 13 (full) guidelines (see Table 3), which amounts to about half of the corpus. The CPGs were chosen such that they cover a diverse range of topics and percentages of token matches, with a rather high match rate of around 6-8% of tokens for HCC as opposed to a lower rate of roughly 5-6% for CLL and psycho-oncology. Additional guidelines were chosen for manual annotation based on project requirements.
Table 5: Pair-wise average F1-score and standard deviation (σ) for instance- and token-based inter-annotator agreement (IAA), precision and recall per entity class. Genes had to be excluded from the IAA analysis due to the large number of false positives. Row for Genes: no IAA values; precision .022; recall .589.

IAA
We calculate the inter-annotator agreement (IAA) using the pair-wise average F-score (Hripcsak and Rothschild, 2005) of instances and tokens. An instance is a single composite annotation unit which consists of one or more tokens, e.g., "eingeschränkte Nierenfunktion" (limited renal function) denotes an instance with two (German) tokens, "eingeschränkte" and "Nierenfunktion".
The agreement subset consists of the 20 text segments with the largest number of automatic annotations from four guidelines (HCC, CLL, Pancreatic cancer, Psycho-oncology) and 20 randomly sampled text segments, resulting in 40 agreement documents with a size of approx. 19k tokens annotated by all annotators. We excluded the gene category from the IAA analysis, due to an apparently large number of false positive pre-annotations. The IAA achieved an average F-score of 0.742 on instances and 0.758 on tokens. Furthermore, we calculated micro-averaged precision and recall values for the automated annotation results, using the complete set of manually reviewed annotations as gold standard. The results are depicted in Table 5.
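The pair-wise average F-score on instances and tokens can be sketched as follows. The toy annotations are invented for illustration, exact span-and-label matching is assumed for instances, and the use of the population standard deviation is an assumption, since the text does not specify which variant was used.

```python
from itertools import combinations
from statistics import mean, pstdev


def f1(a, b):
    """F1 between two annotation sets; symmetric in a and b."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def to_tokens(instances):
    """Expand (start, end, label) instances with half-open token spans
    into a set of (token_index, label) pairs."""
    return {(i, label) for start, end, label in instances
            for i in range(start, end)}


def pairwise_f1(annotations):
    """Mean and standard deviation of F1 over all annotator pairs."""
    scores = [f1(a, b) for a, b in combinations(annotations.values(), 2)]
    return mean(scores), pstdev(scores)


# Toy example for the tokens:
# 0 "eingeschränkte", 1 "Nierenfunktion", 2 "bei", 3 "Patienten"
annotators = {
    "A": {(0, 2, "DISO"), (3, 4, "LIVB")},
    "B": {(1, 2, "DISO"), (3, 4, "LIVB")},  # B omits the adjective
    "C": {(0, 2, "DISO"), (3, 4, "LIVB")},
}

instance_f1, instance_sd = pairwise_f1(annotators)
token_f1, token_sd = pairwise_f1(
    {name: to_tokens(spans) for name, spans in annotators.items()}
)
```

In this toy case the token-based score is higher than the instance-based one, because a boundary disagreement on a multi-token instance still earns partial credit at the token level.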
In another annotation study of diagnoses, symptoms and findings on the Jena part of the 3000PA corpus (Lohr et al., 2020), average F-score values also converged in the range of around 0.7-0.8 for typical clinical entities, e.g., anatomy or disorders in our study in comparison to diagnoses (approx. 0.7) in theirs, also with pre-annotations. The low IAA value for Physiology is similar to the IAA of 0.5 for the symptoms category in that study. The UMLS category Living Beings contains a lot of information similar to personal health information. Its average IAA value of around 0.9 is close to the values reported in an annotation study for the anonymization of German discharge summaries (F-score > 0.95) (Kolditz et al., 2019).

Discussion & Limitations of this Study
While the initial results of the information extraction pipelines we employed are promising, there is much room for improvement. The extraction of genes suffers from a large number of false positives, as there are many common German words (e.g., "gilt", "dar") and three-letter acronyms (e.g., "CLL", "HCC") whose strings are identical to gene names in our large dictionary (around 562k entries). Thus, supplying an improved gene tagger which handles German lexical noise while retaining the advanced capabilities of gene taggers for English texts is a desideratum for future research.
The German UMLS has a number of issues, which severely affect our dictionary-based entity extraction pipelines. First and foremost, its vocabulary size is extremely limited. The English UMLS contains over 6.5M entries and the Spanish one around 750k, whereas there are only around 234k entries in the German version (3.6% coverage of the English version). Recently introduced drugs are missing in the UMLS Chemicals & Drugs category, so a more up-to-date dictionary of drug names is also needed for future work. Moreover, the surface representation of German umlauts is notoriously inconsistent in UMLS, e.g., "ä" is sometimes transcribed as "ae" or even simplified as "a", as in "eingeschraenkte Nierenfunktion", which results in an increased false negative rate. All of these factors contribute to rather low recall values, as shown in Table 5.
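One possible way to reduce false negatives caused by inconsistent umlaut spellings is to fold all variants to a common lookup key before matching. The sketch below is an assumption about how this could be done, not part of the pipelines described here, and it trades some precision for recall because unrelated words may collide after folding.

```python
def fold(term):
    """Map "ä", "ae", and "a" (likewise for ö and ü) to the same key,
    and "ß" to "ss". Apply to dictionary entries and corpus tokens alike."""
    key = term.lower()
    for umlaut, base in (("ä", "a"), ("ö", "o"), ("ü", "u")):
        key = key.replace(umlaut, base).replace(base + "e", base)
    return key.replace("ß", "ss")


variants = ["eingeschränkte", "eingeschraenkte", "eingeschrankte"]
keys = {fold(variant) for variant in variants}

# Coverage figure quoted in the text: about 234k German entries
# relative to about 6.5M English entries.
coverage = 234_000 / 6_500_000
```

Because the folding also shortens genuine "ue", "oe", and "ae" sequences (as in "aktuell"), the folded keys should only be used for lookup, while the original surface forms are kept for the annotation output.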
The accuracy of dictionary matches further decreases due to inconsistent handling of compounds throughout the corpus. For instance, "Pankreaskarzinompatienten" (patients with pancreatic carcinoma) would not be detected as an entity, whereas the hyphenated "Pankreaskarzinom-Patienten" would, yielding two entities (Disorders and Living Beings, respectively). In this case, we would choose to annotate the whole compound as Living Beings to avoid annotation on a subword level; the inconsistency could be addressed using a more finely adapted tokenization algorithm.
While precision and recall of the rule-based TNM extraction approach are high on GGPONC, one has to be careful as certain TNM expressions can cause context-dependent semantic ambiguities. For instance, "V1" and "V2" are valid TNM components referring to venous invasion, but are also detected in the WIKIWARSDE corpus, where they refer to German missiles from World War II.

Conclusion
We presented GGPONC, currently one of the largest corpora composed of German medical texts, assembled from CPGs in oncology and equipped with rich structural information and metadata. We applied information extraction pipelines to extract a variety of named entity classes. Despite the limitations we discussed, the information extracted so far can be of immediate use to enable semantic search functionalities in the guideline app (Seufferlein et al., 2019), precision medicine search engines (Faessler et al., 2020) or in clinical decision support systems (Schapranow et al., 2015).
Our results indicate that GGPONC shares many characteristics with existing clinical text corpora. This can facilitate the development of machine learning-based NLP algorithms for German clinical text. Beam et al. (2020) suggest that combining corpora covering different parts of medical terminology can improve the utility of trained word embeddings. In addition to the German documents discussed in this work, some of the GGPO guidelines have an additional English version, which could be used to construct parallel corpora for research in multilingual medical NLP.
Extending our work to clinical guidelines from other medical specialities besides oncology will be a straightforward way to extend the volume of the corpus, provided that the document structures can be harmonized across medical societies. However, as most CPGs are distributed as PDF documents, extraction of the plain text content from these can result in quality issues not encountered in this work.
The structured metadata of the corpus provide ample opportunities for future research. For instance, the corpus can be used as a resource for evidence-based medicine summarization, as it contains mappings from literature references to recommendation statements and evidence levels. As we plan to create future versions of the corpus based on updated guideline versions, the extracted concepts can also be used to track changes in CPGs, like the emergence of new treatments and other changes in recommended clinical practice. We envision combining information extracted from scientific articles, such as study reports, or clinical trial registers with information from CPGs to automatically detect whether these CPGs might be outdated given changes in the underlying evidence base.
In addition to the existing annotations for a wide selection of UMLS semantic types, we can easily extend the employed pipelines with different dictionaries, e.g., derived from other subsets of the German part of the UMLS, more comprehensive official lists of drug names, or the German version of the International Classification of Diseases.
We make GGPONC available for researchers under the conditions of a Data Use Agreement. For instructions on how to access the corpus and the human annotated data see: https://www.leitlinienprogramm-onkologie.de/projekte/ggponc-english/.