Iterative development of family history annotation guidelines using a synthetic corpus of clinical text

In this article, we describe the development of annotation guidelines for family history information in Norwegian clinical text. We make use of incrementally developed synthetic clinical text describing patients’ family history relating to cases of cardiac disease and present a general methodology which integrates the synthetically produced clinical statements and guideline development. We analyze inter-annotator agreement based on the developed guidelines and present results from experiments aimed at evaluating the validity and applicability of the annotated corpus using machine learning techniques. The resulting annotated corpus contains 477 sentences and 6030 tokens. Both the annotation guidelines and the annotated corpus are made freely available and as such constitute the first publicly available resource of Norwegian clinical text.


Introduction
The limited availability of clinical text corpora constitutes a major challenge for the development of clinical NLP tools. Such text originates in the (electronic) health record (EHR), and access to and use of the EHR is governed by strict data privacy and health service regulations, which usually restrict secondary use and prohibit re-distribution and sharing with the larger NLP community. Notable exceptions include the anonymized health record texts published as part of the i2b2 challenges (Uzuner and Stubbs, 2015) and the CLEF corpus (Roberts et al., 2008b). For languages other than English the situation is even more difficult, and despite notable annotation efforts (Dalianis et al., 2012), the underlying corpora are largely unavailable.
Clinical texts are radically different in form and function from other biomedical texts: they are communicative, conveying information between health service providers, terse (in that the patient is implicit), and very specialized according to the role of the narrative and the profession of the author (Allvin et al., 2010; Røst et al., 2008). In this work, the targeted narrative of family history corresponds to the anamnesis recorded by the cardiologist when interviewing the patient as part of a consultation. However, lacking a corpus of family history statements, we decided to develop a synthetic corpus (Lohr et al., 2018; Boag et al., 2018).
Development of most NLP tools requires manually annotated data and the design of annotation guidelines is crucial for consistent and high quality data suitable for machine learning and classification. Development of annotation guidelines is a time consuming process which in the case of clinical data often also requires access to domain experts (clinicians). The question of how to involve the clinician in the annotation process and make the best use of their domain knowledge is therefore highly relevant.
This article describes the systematic development of annotation guidelines for family history information in Norwegian clinical text. We make use of incrementally developed synthetic clinical text describing patients' family history relating to cases of cardiac diseases. The domain expert is an integral part of this methodology: the expert generates synthetic examples that challenge the guidelines and participates both in the annotation and in the development of the guidelines. In doing so, the domain knowledge of the clinician informs the annotation process systematically. Measures of inter-annotator agreement are actively used to improve the annotation guidelines, as well as to extend the synthetic corpus and the range of annotated concepts.
In the rest of the paper, we describe the methodology for corpus generation and annotation guideline design in more detail and provide an overview of our current state of progress in the family history domain. We analyse inter-annotator agreement based on the developed guidelines and present results from experiments aimed at evaluating the validity and applicability of the purpose-made annotated corpus using machine learning.

Family history in clinical text
A family history is an important part of the medical record. It helps the clinician in identifying risk factors, in diagnosing conditions that have genetic components, and in identifying family members that should be offered genetic counselling or medical follow up. Specific patterns of disease or symptoms in a family suggest modes of inheritance, and could be helpful in the diagnosis of an unrecognised disease or syndrome. For example, if only men in the family are affected, one might expect an X-linked trait, or if approximately half of the offspring in a generation seem to be affected, it would suggest an autosomal dominant disease. In the cases where a pathological mutation has already been identified, the pedigree is used to plan further genetic screening or counselling. Figure 1 shows an example pedigree with a typical autosomal dominant inheritance pattern.
For some diseases, the course of events in the patient's family is important in judging the patient's own risk of serious events. In patients with hereditary hypertrophic cardiomyopathy, the European Society of Cardiology recommends using an online risk calculator to estimate a patient's 5 year risk of sudden cardiac death (SCD). Among the seven factors included in the underlying model is a history of SCD in first degree relatives, a strong contributor to individual risk (Elliott et al., 2014).
Family histories occur as descriptive text in the EHR. Computational reasoning about family history nevertheless has substantial benefits in research, diagnosis and decision support, and many tools have been developed for interactive pedigree input (Welch et al., 2018). The underlying objective of our NLP challenge is to be able to infer the pedigree of a patient from text. However, even checking consistency of family history information represented in OWL proves to be a challenge (Stevens et al., 2014). A potential outcome of our work would be to transform statements about pedigree into tabular formats directly usable in risk calculators and for bioinformatics applications like genome-wide analysis (Hiekkalinna et al., 2005).
Figure 1: An example pedigree chart with a typical autosomal dominant inheritance pattern. Horizontal rows represent generations, lines represent relationships, lines of descent and sibship. Squares are male, circles female, and diamond shape is unknown gender. A symbol with a 'P' inside denotes a pregnancy. Diagonal lines through symbols denote deceased individuals and the text below their age at the time of death (e.g. 'd. 43' means died when 43 years old). Filled symbols represent individuals with manifest disease, symbols with a vertical line are healthy gene carriers who may develop disease later. The small arrow denotes the current patient ("self") and the arrow with the 'P' is the proband or index patient where the genetic analysis of the family started (Bennett et al., 2008).
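The tabular target format mentioned above can be pictured with a minimal sketch. It stores a pedigree as one row per individual and looks up first degree relatives with a recorded SCD. The `Person` fields, the function `first_degree_relatives` and the example family are hypothetical illustrations and are not taken from the corpus or from any risk calculator.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class Person:
    """One row of a hypothetical pedigree table."""
    pid: str
    father: Optional[str] = None
    mother: Optional[str] = None
    scd: bool = False  # sudden cardiac death recorded for this person


def first_degree_relatives(pid: str, pedigree: Dict[str, Person]) -> Set[str]:
    """Return parents, full siblings and children of `pid`."""
    me = pedigree[pid]
    parents = {p for p in (me.father, me.mother) if p is not None}
    relatives = set(parents)
    for other in pedigree.values():
        if other.pid == pid:
            continue
        other_parents = {p for p in (other.father, other.mother) if p is not None}
        if pid in other_parents:
            # Children: anyone listing `pid` as a parent.
            relatives.add(other.pid)
        elif len(parents) == 2 and other_parents == parents:
            # Full siblings: both parents known and shared.
            relatives.add(other.pid)
    return relatives


# Invented example family, for illustration only.
pedigree = {p.pid: p for p in [
    Person("grandfather", scd=True),
    Person("father", father="grandfather", scd=True),
    Person("mother"),
    Person("self", father="father", mother="mother"),
    Person("brother", father="father", mother="mother"),
    Person("daughter", mother="self"),
]}
relatives = first_degree_relatives("self", pedigree)
scd_in_first_degree = any(pedigree[r].scd for r in relatives)
```

A table of this kind makes the family history factor of a risk model a simple lookup, whereas the free-text form requires the entity and relation extraction discussed in the rest of the paper.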

Previous work
There has been some previous work aimed at extracting family history information from clinical text. Bill et al. (2014) annotate 284 sentences from the publicly available MTSamples corpus of synthetically produced English clinical text for information about family members and clinical observations with some additional attributes (vital status, negation and age of death). However, they do not provide any measures of inter-annotator agreement. Polubriaginof et al. (2015) compared the information contained in structured and free-text descriptions of family history information and found that the free-text descriptions were more comprehensive.
In another work, Goryachev et al. (2008) developed a pipeline of rule-based systems to detect family members and diagnosis concepts and then assign the family diagnosis to a specific family member. The authors run standard NLP tools such as sentence splitters and part-of-speech taggers on discharge summary notes. The pipeline system is related to Friedlin and McDonald (2006) in that it only identifies diagnosis concepts that are present in standard medical dictionaries and does not perform relation extraction as performed in this paper.
Relation extraction from clinical reports has been approached with both rule-based systems (Abacha and Zweigenbaum, 2011) and machine learning methods; Roberts et al. (2008a) and Minard et al. (2011) use multi-class SVMs. Our work in this paper is closest to the work of Roberts et al. (2008a), who manually annotated cancer narratives for entities and relations and then trained and tested a one-vs-rest SVM classifier. In this paper, we employ features widely used in general purpose named entity recognition (Hong, 2005; Miwa and Sasaki, 2014) to train the SVM models.

Incremental annotation guideline and synthetic corpus development
One immediate goal of this work is to develop a tool for the extraction of family history information from Norwegian clinical text. Due to the unavailability of real health records describing family histories, we developed a methodology for annotation guideline development which makes use of an incrementally developed synthetic corpus. The textual data contained in the corpus was produced by a clinician who has extensive experience with clinical work and genetic cardiology. The data consists of statements that summarize the family history of a patient and will typically correspond to a small part of a patient journal. The descriptions were made by performing web searches for images of "autosomal dominant pedigree", and pseudo-randomly describing parts of the displayed pedigrees while assigning invented but realistic medical events. No real patient histories are reproduced, but coincidental similarities must be expected. The text does not contain any personal identification information.
The first step in a semantic annotation of text is to decide upon the entities and the relations that are interesting to extract or characterize. Biomedicine employs terminologies and classifications that may be used for annotation (Savova et al., 2010). In our domain of family history, we started with family members and relationships, and largely ignored medical conditions apart from death or known (cardiac) disease in general. The guideline developers consisted of a clinician and three computational linguists and/or computer scientists. We usually maintained two roles: the clinician would produce a set of representative sentences and, along with one of the others, propose an annotation scheme for these. Then, the clinician would annotate while another person not involved in the design of the annotation scheme would make an independent annotation. The results were compared and discrepancies were recorded.
We distinguished, sometimes artificially, between semantic and pragmatic discrepancies. A semantic discrepancy signifies a misunderstanding of the underlying domain and requires amending the ontology, whereas a pragmatic discrepancy uncovers an underspecified or incomplete annotation rule, which can be further specified by adding more examples to the corpus. The drivers and amendments in this quiz-like game are shown in table 1. Figure 2 shows the double loops of corpus production and guideline development. As shown, the family history statements were produced iteratively. In the initial round, the clinician was asked to produce a set of representative statements about SCD-related family history. Example 1 below shows a sentence from the corpus.
(1) Indekspasienten
Following the initial iterations and discussions with the clinician, the need to account for i) relations to groups of family members, ii) temporal statements, and iii) negation emerged. During this iteration the clinician was therefore tasked with generating statements that challenged the current guidelines, whilst still producing representative family statements. Example 2 shows a sentence containing a temporal statement, and example 3 shows another type of temporal statement, describing the age of the family member at the time of diagnosis.
(2) Han har kjent hjertebank de siste fire-fem månedene.
    He has felt heart-palps the last four-five months
    'He has been feeling heart palpitations during the last four-five months'
(3) Broren fifty-years
    'The brother was diagnosed in his fifties'
After arriving at a fairly stable set of guidelines, a large portion of the data set (320 sentences) was doubly annotated. Following this, disagreements were resolved in a round of consolidation between the annotators. The final portion of the data set (91 sentences) was then annotated doubly and the resulting inter-annotator agreement on these data sets is reported here in Section 4.5.
All annotation was performed using the Brat web-based annotation tool (Stenetorp et al., 2012). The data was manually segmented and tokenized prior to annotation.
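Brat stores annotations in a tab-separated standoff format next to the text, which makes the annotated corpus easy to read programmatically. The sketch below parses text-bound entity lines (`T…`) and binary relation lines (`R…`) only, and assumes contiguous spans. The example annotation lines, offsets and the argument order of the relations are invented for illustration and are not taken from the released corpus.

```python
from typing import Dict, List, NamedTuple, Tuple


class Entity(NamedTuple):
    eid: str
    label: str
    start: int
    end: int
    text: str


class Relation(NamedTuple):
    rid: str
    label: str
    arg1: str
    arg2: str


def parse_ann(lines: List[str]) -> Tuple[Dict[str, Entity], List[Relation]]:
    """Parse entity (T) and relation (R) lines of a standoff annotation file."""
    entities: Dict[str, Entity] = {}
    relations: List[Relation] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if line.startswith("T"):
            # "T3<TAB>Family 11 17<TAB>brødre" (contiguous span assumed)
            label, start, end = fields[1].split(" ")
            entities[fields[0]] = Entity(fields[0], label, int(start), int(end), fields[2])
        elif line.startswith("R"):
            # "R1<TAB>Subset Arg1:T4 Arg2:T3"
            label, arg1, arg2 = fields[1].split(" ")
            relations.append(
                Relation(fields[0], label, arg1.split(":", 1)[1], arg2.split(":", 1)[1])
            )
    return entities, relations


# Hypothetical annotation of a sentence quoted in the guidelines.
text = "Hun har to brødre, den ene har mutasjonen"
ann = [
    "T1\tSelf 0 3\tHun",
    "T2\tAmount 8 10\tto",
    "T3\tFamily 11 17\tbrødre",
    "T4\tFamily 19 26\tden ene",
    "T5\tCondition 31 41\tmutasjonen",
    "R1\tSubset Arg1:T4 Arg2:T3",
    "R2\tHolder Arg1:T5 Arg2:T4",
]
entities, relations = parse_ann(ann)
```

Checking that `text[e.start:e.end] == e.text` for every entity is a cheap sanity test when offsets are produced or edited by hand.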

Annotation guidelines
The following section presents an overview of the resulting annotation guidelines. The annotation of the corpus distinguishes semantically relevant clinical entities and shows how these relate to each other in the text via a set of relations. Figure 4 shows a graphical overview of the annotation schema, where rectangles indicate core clinical entities, ovals indicate modifier entities, and all possible relations are indicated by directed arcs.

Clinical entities
Clinical entities are marked with one of the following entity types:
• Family describes various family members (e.g. onkelen 'the uncle', bestefar 'grandfather').
• Self is used only for the patient under consideration (e.g. pasienten 'the patient', hun 'she').
• Index entities designate the property of being the index patient or proband, i.e. the first identified family member with disease (e.g. indekspasienten 'the index patient').
The distinction between conditions and events relates to the temporal extension of the entity described: an event is something that happens and then is over, whereas a condition is a prolonged state of the patient. For instance, the patient has a heart attack (Event), but from this point on she is considered to have heart disease (Condition).
In addition to the main clinical entities described above, the annotation guidelines also distinguish a set of modifier entities that further describe the clinical entities with respect to a number of properties that are relevant for the semantic interpretation of family history information:
• Side entities describe the side of the family and thus modify Family entities (e.g. farssiden 'paternal side').
• Age entities describe the age of a family member (e.g. 40 år gammel '40 years old').
• Negation entities mark lexical items that signal negation, so-called negation cues in the terminology of Morante and Daelemans (2012). These may be negative adverbs, e.g. ikke 'not', aldri 'never', or negative determiners/pronouns, e.g. ingen 'nobody'. Note that in contrast to Morante and Daelemans (2012), we do not annotate morphological negation cues (e.g. im-possible). In this version of the guidelines, we treat negation as encompassing uncertainty. The main reason for this is that uncertainty, just like negation, marks missing information in the family history.
• Amount modifiers are quantifiers that describe numerical properties of clinical entities, e.g. to 'two', mange 'many'.
• Temporal modifiers typically position Condition/Event entities in time, e.g. i sommer 'this summer', for tre år siden 'three years ago'. These are similar to temporal expressions (so-called timexes) in previous temporal annotation schemes (Ferro et al., 2002;Saurí et al., 2006).

Family history relations
In addition to the clinical entities described above, we further annotate a number of relationships between entities in our annotation scheme. Example 4 shows a fully annotated example containing entities and their relations for a sentence from the corpus. The relations are binary undirected relations of the following types:
• Holder relations always hold between a Condition/Event entity on the one hand and its holder, a Family/Self/Index entity, on the other.
• Related_to relations specify relations between family members and always hold between entities of the Family type.
• Subset relations specify relations between family members, where one is a subset of the other, e.g. in statements such as Hun har to brødre, den ene har mutasjonen 'She has two brothers, one of them has the mutation', where den ene 'one of them' would be connected to the Family entity brødre 'brothers' with a Subset-relation.
• Partner relations specify relations between entities of the Family type and are used to identify couples (husbands and wives, civil partnerships) that are able to have offspring. Partners are assumed not to be related by kinship.
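The type restrictions in the list above can be written down as a small lookup table, which is handy for catching schema violations in an annotated file. This is a sketch under assumptions: the constant and function names are hypothetical, and Subset is assumed to hold between two Family entities, since the guidelines describe it as a relation between family members.

```python
from typing import Dict, FrozenSet, Tuple

CLINICAL: FrozenSet[str] = frozenset({"Condition", "Event"})
HOLDERS: FrozenSet[str] = frozenset({"Family", "Self", "Index"})
FAMILY: FrozenSet[str] = frozenset({"Family"})

# Relation label -> admissible entity types for the two arguments.
ALLOWED: Dict[str, Tuple[FrozenSet[str], FrozenSet[str]]] = {
    "Holder": (CLINICAL, HOLDERS),
    "Related_to": (FAMILY, FAMILY),
    "Subset": (FAMILY, FAMILY),   # assumption: both arguments are Family
    "Partner": (FAMILY, FAMILY),
}


def is_valid_relation(label: str, type_a: str, type_b: str) -> bool:
    """Check a relation against the schema, ignoring argument order."""
    if label not in ALLOWED:
        return False
    left, right = ALLOWED[label]
    # The relations are described as undirected, so either order is accepted.
    return (type_a in left and type_b in right) or (type_a in right and type_b in left)
```

For example, `is_valid_relation("Holder", "Self", "Event")` is accepted, while `is_valid_relation("Related_to", "Family", "Self")` is rejected because Related_to is restricted to Family entities.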

Span of annotations
In general, annotation should pick out the minimal span in the text which denotes the entity or property in question. This will most often be a single word (onkel 'uncle', mutasjon 'mutation') but will in some cases also include more than one word (plutselig hjertedød 'sudden cardiac death', voksende hjerte 'growing heart'). Genitive modifiers of an entity, e.g. farens 'father's' in farens søster 'the father's sister' or Søsteren til faren 'the sister of the father', should not be included in the annotation span. Rather, these are annotated as two separate entities related by a Related_to relation. The span of a Family entity usually encompasses only the family term itself (onkel 'uncle', søster 'sister'); however, when the family member is referred to using a pronominal element (hun 'she', den ene 'one (of them)'), this element should be annotated as a Family entity. When both are present (den ene broren 'the one brother'), only the family term is annotated.
Temporal expressions will often be more complex and should include numerical expressions denoting amount (tre 'three', flere 'several'), temporal units such as month/year, as well as expressions denoting temporal ordering or duration (i 'in', siden 'since' as in tre år siden 'three years since', i tre år 'for three years'). Initial iterations of annotation showed that agreement for this category was low due to differences in annotation span. We therefore introduced a replacement rule for temporal annotation: the annotated span is the full constituent that can be replaced by a temporal pronoun corresponding to English then. This means that unlike e.g. Ferro et al. (2002), our temporal annotations will include prepositions (e.g. i tre år 'for three years').

Statistics
The resulting annotated corpus contains 477 sentences and 6030 tokens. In table 2 we present the distribution of the entities and relations in the corpus. We see that Condition and Event entities are fairly equally distributed in the corpus. Temporal modifiers span more than one word in a majority of cases. Whereas Holder-relations are the most common type of relation in the corpus, there are only 14 cases of the Partner relation.

Inter-Annotator Agreement
As described in Section 3, two final rounds of annotation with different second annotators (here dubbed A1 and A2, in addition to the clinician, A0) were used to complete the annotation guidelines. We measured the inter-annotator agreement (IAA) at two levels. At the first level, IAA is based on matches of the entity spans and their labels. At the second level, IAA is based on the relationship matches between the matched spans. Therefore, the relationship agreement measurement is stricter than the entity level agreement measurement. We examine token level agreement, where we treat the clinician's annotations as the gold standard and compute the per token Precision, Recall, and F1-score. We measure the inter-annotator agreement using the micro F1-score. The Precision, Recall, and F1-scores of the agreement are provided in table 3.
Table 3: Each row shows the number of sentences annotated by each annotator. The first and second rows show the Precision, Recall, and F1-score for entities and relations. All the results are in comparison to the texts annotated by the clinician.
We find that the round of consolidation and improvement of the guidelines was useful and improves the IAA scores for both entities and relations. When we compare the annotations of the clinician (A0) and the second additional annotator (A2), we find that there are still a number of remaining discrepancies. Some of these are what we termed semantic discrepancies in Section 3 above, i.e. annotation decisions that require domain knowledge. For instance, in several places A2 annotates clinical conditions that are not marked by the clinician, e.g. marking symptomer 'symptoms' as a Condition. There are also examples where additional distinctions should probably be added to the guidelines, in particular with respect to the annotation of temporal and negation-related information, both of which are complex annotation tasks in their own right. For instance, A2 annotates the phrase under en flytur til Spania 'during a flight to Spain' as Temporal, where A0 does not. With respect to negation, the distinction between negation and uncertainty causes differences in annotation spans, where A0 annotates husker ikke 'does not remember' as NEG, whereas A2 annotates only ikke 'not'.
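The agreement measure described in this section can be sketched as follows, treating one annotator as gold and scoring the other by exact match of span and label, pooled over all sentences (micro-averaging). This is a simplified, entity-level illustration and not the evaluation script used for table 3; the tuple layout, the function name and the toy annotations with their offsets are assumptions.

```python
from typing import Set, Tuple

# (sentence id, start offset, end offset, label)
Annotation = Tuple[int, int, int, str]


def micro_prf(gold: Set[Annotation], other: Set[Annotation]) -> Tuple[float, float, float]:
    """Precision, recall and F1 of `other` against `gold`, pooled over all items."""
    tp = len(gold & other)  # identical span and label in both annotations
    precision = tp / len(other) if other else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data with invented offsets: the two annotators agree on two entities,
# disagree on the span of a negation cue, and the gold annotator marks one
# entity that the other annotator leaves out.
a0 = {(0, 0, 6, "Family"), (0, 7, 18, "Negation"), (0, 19, 29, "Condition"), (1, 0, 3, "Self")}
a2 = {(0, 0, 6, "Family"), (0, 14, 18, "Negation"), (0, 19, 29, "Condition")}
precision, recall, f1 = micro_prf(a0, a2)
```

Under exact matching, a span disagreement such as husker ikke versus ikke counts as both a false positive and a false negative, which is one reason span conventions have a large effect on the reported scores.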

Preliminary experiments
In this section, we perform entity classification and relation extraction experiments to verify the viability of our annotation. We train and test an SVM model on the 477 sentences annotated by the domain expert, using five-fold cross-validation. In all our experiments, we split the sentences into five folds and extract entities and relations. We then treat each of the five folds in turn as the test set and train on the other four folds.
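The cross-validation protocol can be sketched in a few lines. How the folds were actually drawn is not stated, so the interleaved assignment without shuffling below is an assumption, as is the function name `k_fold`.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def k_fold(items: Sequence[T], k: int = 5) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (train, test) pairs; fold i holds every k-th item starting at index i."""
    for i in range(k):
        test = [x for j, x in enumerate(items) if j % k == i]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test


# 477 placeholder sentence ids standing in for the annotated sentences.
sentences = list(range(477))
folds = list(k_fold(sentences, k=5))
test_sizes = [len(test) for _, test in folds]
```

Splitting at the sentence level, as described above, keeps all entities and relations of a sentence in the same fold, so no sentence contributes to both training and testing in a given iteration.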

Entity detection
In this experiment, we trained and tested a linear classifier (SVM model) for entity classification. We treat entity classification as a multi-class classification problem with 11 classes, including the O class that denotes unmarked lexical units. Our model is a linear SVM model that is trained on the following features:
• Lexical: The current word and the words in a context window of size 2.
• Universal POS tags: The tags of the current word and of the words in a context window of size 2.
• Entity tags: The two previous entity tags where the model uses the gold entity tags to train but uses the previous predicted entity tags to predict the current tag.
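The three feature templates listed above can be sketched as a function that maps one token position to a feature dictionary. Several details are assumptions: the window is taken to cover two tokens on each side, sentence boundaries are padded with a `<PAD>` symbol, and the function name, the feature names and the hand-assigned tags in the example are hypothetical.

```python
from typing import Dict, Sequence

PAD = "<PAD>"


def token_features(
    words: Sequence[str],
    pos_tags: Sequence[str],
    prev_tags: Sequence[str],
    i: int,
    window: int = 2,
) -> Dict[str, str]:
    """Features for token i: words and POS tags in a window, plus two previous entity tags.

    `prev_tags` holds the entity tags of the tokens before position i:
    gold tags during training, previously predicted tags at test time.
    """
    feats: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        inside = 0 <= j < len(words)
        feats[f"word[{offset}]"] = words[j] if inside else PAD
        feats[f"pos[{offset}]"] = pos_tags[j] if inside else PAD
    for back in (1, 2):
        j = i - back
        feats[f"tag[-{back}]"] = prev_tags[j] if j >= 0 else PAD
    return feats


# Invented example; POS and entity tags are assigned by hand for illustration.
words = ["Broren", "har", "ikke", "mutasjonen"]
pos_tags = ["NOUN", "VERB", "PART", "NOUN"]
gold_tags = ["Family", "O", "Negation", "Condition"]
feats = token_features(words, pos_tags, gold_tags[:2], i=2)
```

At prediction time the same function would be called left to right, appending each predicted tag to `prev_tags` before moving to the next token, which mirrors the train/predict asymmetry described in the last bullet.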
We also experimented with lowercasing words and with orthographic features such as prefixes and suffixes of length 3, neither of which improved the performance of the SVM model. We evaluate the performance of the SVM model using the weighted F1-score to account for class imbalance. On average, these feature templates yielded 5000 features across the five cross-validation experiments. All the Universal POS tags are obtained through the CoNLL17 Baseline model (Zeman et al., 2017) trained on the publicly available Universal Dependencies Norwegian Bokmål treebank (Øvrelid and Hohle, 2016). [...] score is computed only on parts of the annotated data whereas the SVM models are trained and tested on the whole of the data annotated by A0. The SVM model performs better than the majority class baseline model across all the measures. The SVM model made errors in distinguishing Condition entities from Event entities and Age from Temporal entities. Most of the errors occurred when the SVM model misclassified the rest of the classes as "O".
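To make the evaluation measure concrete, the sketch below computes a support-weighted F1-score and a majority class baseline in plain Python. It follows the usual definition (per-class F1 averaged with weights proportional to the number of gold instances of each class); the function names and the toy label sequence are assumptions and do not reproduce the reported numbers.

```python
from collections import Counter
from typing import List


def weighted_f1(gold: List[str], pred: List[str]) -> float:
    """Per-class F1, averaged with weights equal to each class's gold support."""
    support = Counter(gold)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
        predicted = sum(1 for p in pred if p == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / n
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += n * f1
    return total / len(gold)


def majority_baseline(train_labels: List[str], n: int) -> List[str]:
    """Predict the most frequent training label for every one of n test tokens."""
    label = Counter(train_labels).most_common(1)[0][0]
    return [label] * n


# Toy labels: four unmarked tokens and two entities.
gold = ["O", "O", "O", "Family", "Condition", "O"]
baseline = majority_baseline(gold, len(gold))
baseline_score = weighted_f1(gold, baseline)
```

Because the weights follow the gold class frequencies, a baseline that always predicts the most frequent class still obtains a non-trivial score on data dominated by "O" tokens, which is why the comparison against it is informative.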

Relation extraction
In this subsection, we perform a relation detection and classification experiment. We treat a relation defined between exactly two entities as belonging to one of six relation classes, where five of them are given in table 2 and the sixth is "No_Relation". We train and test an SVM model in a five-fold cross-validation fashion. Apart from entity labels, we experimented with increasingly complex sets of features:
• Lexical: Words belonging to the entities are treated as two separate features.
• POS tags: Universal POS tags of the entities' lexical tokens as separate features.
• Dependency features: The dependency label of an entity word's incoming arc as a feature.
If an entity spans multiple words, we concatenate the per-word features and treat them as a single feature when training and testing the SVM model. The results of the experiments are given in table 5. Our results suggest that word-based features by themselves yield a performance which is close to that of the model with more complex features. Incremental inclusion of POS tags and dependency labels increases the performance of the SVM model, whereas the inclusion of predicted entity labels does not. We also tested whether including the gold standard entity labels would improve the performance of the SVM model, and find that it does, i.e. the quality of the entity labels matters. Finally, the confusion matrix for the best fold is presented in table 6. The SVM model makes most of its errors when it misclassifies one of the five annotated relations as "No_Relation" and vice versa. The classifier errs when distinguishing between "Related_to", "Partner" and "Subset" relations. Finally, the classifier makes errors when distinguishing between the Norwegian indefinite determiner en, which is unmarked, and the quantifier en.
Table 6: Confusion matrix for the SVM model at the task of relation extraction on the best performing fold.
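The construction of relation instances described above can be sketched as follows. The paper does not spell out how candidate pairs are generated, so pairing every two entities within a sentence is an assumption, as are the class and function names, the `_` separator used for concatenation, and the hand-assigned POS tags and dependency labels in the example.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class EntityMention:
    eid: str
    label: str
    words: Tuple[str, ...]
    pos: Tuple[str, ...]
    deprel: Tuple[str, ...]  # label of each word's incoming dependency arc


def pair_features(a: EntityMention, b: EntityMention) -> Dict[str, str]:
    """Entity label, words, POS tags and dependency labels for both entities."""
    feats: Dict[str, str] = {}
    for side, ent in (("e1", a), ("e2", b)):
        feats[f"{side}_label"] = ent.label
        # Multi-word entities: per-word values are concatenated into one feature.
        feats[f"{side}_words"] = "_".join(ent.words)
        feats[f"{side}_pos"] = "_".join(ent.pos)
        feats[f"{side}_deprel"] = "_".join(ent.deprel)
    return feats


def build_instances(
    entities: List[EntityMention],
    gold: Dict[FrozenSet[str], str],
) -> List[Tuple[Dict[str, str], str]]:
    """Pair every two entities of a sentence; unannotated pairs get 'No_Relation'."""
    instances = []
    for a, b in combinations(entities, 2):
        label = gold.get(frozenset((a.eid, b.eid)), "No_Relation")
        instances.append((pair_features(a, b), label))
    return instances


# Hypothetical analysis of 'Hun har to brødre, den ene har mutasjonen'.
mentions = [
    EntityMention("T1", "Family", ("brødre",), ("NOUN",), ("obj",)),
    EntityMention("T2", "Family", ("den", "ene"), ("DET", "DET"), ("det", "nsubj")),
    EntityMention("T3", "Condition", ("mutasjonen",), ("NOUN",), ("obj",)),
]
gold_relations = {
    frozenset(("T1", "T2")): "Subset",
    frozenset(("T2", "T3")): "Holder",
}
instances = build_instances(mentions, gold_relations)
```

Enumerating all pairs produces many "No_Relation" instances relative to annotated relations, which is consistent with the confusion between annotated relations and "No_Relation" noted in the error analysis.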

Discussion
The validity of our study is limited by the use of synthetic data. While the clinician producing the clinical text works in genetic cardiology, and writes similar patient histories in his clinical practice, the synthetic data cannot be expected to be fully representative of real clinical notes from a large patient cohort. The analysis pertaining to the synthetic data should be thought of as an illustration of one iteration of the cycle described in Figure 2, and the objective of the iterative process is a stepwise, guided design of an annotation guideline in a setting where the target text data is unavailable. The same process could be used with a real corpus, where specific new examples would present challenges driving guideline development. The only difference is that in our case, a specialist produced the text instead of finding representative text.
The guideline development workflow itself may also be improved or expanded by storing a representation of the input data (the pedigree) and linking it to the resulting synthetic text description, which would allow further downstream comparison of extraction results to the actual source material.

Conclusion
This article has investigated the development of annotation guidelines for family history information in Norwegian by leveraging synthetically produced clinical text. Inter-annotator agreement scores show that the annotation schema can be applied fairly consistently and that it may also be generalized to unseen text using machine learning. Both the annotation guidelines and the annotated corpus will be made freely available and as such constitute the first freely available resource of Norwegian clinical text.
In the near future, we will apply the annotation schema to real clinical texts. The family history is but a minor part of a patient record, and segmentation as shown in Bill et al. (2014) is needed. Analysis of the annotation disagreements along with the experimental results also highlighted parts of the schema that will need to be further refined, e.g. the analysis of temporality and our treatment of uncertainty. We will develop the method for incremental and systematic annotation guideline development further. The method will be put to the test when we iteratively improve the current guidelines in order to capture real patient pedigree information from the EHR.