Extracting Kinship from Obituary to Enhance Electronic Health Records for Genetic Research

Claims databases and electronic health records (EHR) databases do not usually capture kinship or family relationship information, which is imperative for genetic research. We identify online obituaries as a new data source and propose a special named entity recognition and relation extraction solution to extract names and kinships from online obituaries. Built on 1,809 annotated obituaries and a novel tagging scheme, our joint neural model achieved macro-averaged precision, recall and F measure of 72.69%, 78.54% and 74.93%, and micro-averaged precision, recall and F measure of 95.74%, 98.25% and 96.98%, using the 57 kinships with 10 or more examples in a 10-fold cross-validation experiment. The model performance improved dramatically when trained on the 34 kinships with 50 or more examples. Leveraging additional information mentioned in obituaries, such as age, death date, birth date and residence, we foresee a promising future of supplementing EHR databases with comprehensive and accurate kinship information for genetic research.


Introduction
Kinship or family relationship information is important for genetic research, particularly for understanding trait and disease heritability, predicting individual disease susceptibility, and developing personalized medicine (Chatterjee et al., 2016). Human genetics started by analyzing pedigrees and twins to understand the roles of heredity and environment in the manifestation of physiological traits and diseases. With the rise of genomics, Electronic Health Records (EHRs) and their integration through biobanks, kinship information, if available, can largely augment the latest high-throughput computational technologies such as deep phenotyping from medical records (Robinson, 2012) and phenome-wide association studies (PheWAS; Denny et al., 2010), and accelerate population-based genetic research (Mayer et al., 2014; Polderman et al., 2015). Unfortunately, neither EHR systems nor claims databases capture kinship information systematically.
A few studies have investigated disease heritability based on inferred kinship information. For example, Wang et al. selected 128,989 families comprising 481,657 individuals from a large claims database covering one third of the US population, by selecting policyholders and their dependents (e.g., spouse and children) who were on file for at least 6 years, to estimate the heritability and familial environmental patterns of 149 diseases (Wang et al., 2017). Similarly, Polubriaginof and colleagues performed a multicenter study based on 3,550,598 patients' medical records from three EHR systems in New York City and used emergency contact information to build more than 595,000 pedigrees, in order to compute the heritability of 500 disease phenotypes (Polubriaginof et al., 2018).
However, these studies relied on indirect sources to infer kinship information, and such sources are incomplete and error-prone. First, neither the dependents defined by medical insurance nor the emergency contacts submitted to EHR systems by patients are guaranteed to be biological relatives: such records do not distinguish adopted relationships, or step relationships created through re-marriage, from biological relationships. Second, dependents and emergency contacts represent only a small portion of a person's family relationships. The 2010 Affordable Care Act allows young adults up to age 26 to remain on their parents' health insurance plans; before that, dependent children often "aged out" of their parents' health plans at age 19, or 22 if they were full-time students. Thus, adult children older than those ages cannot be identified from claims data. In addition, if a married couple both work and receive medical insurance through their employers (even the same employer), they are not usually linked on record. Likewise, most clinics and hospitals list emergency contacts as optional (rather than mandatory) information. Most patients provide one or two emergency contacts, rather than their entire family, when filling out the form; the Polubriaginof study (Polubriaginof et al., 2018) collected on average 1.86 emergency contacts per patient.
To address these issues, we propose a new data source (online obituaries) and a special Natural Language Processing (NLP) solution for systematically constructing biological relationships for large, multigenerational families. Obituaries contain rich and high-quality kinship information and are publicly available from the sites of newspapers and funeral services companies. Although obituaries are similar to social media, they are much less studied in biomedicine. One study analyzed obituaries to investigate cancer mortality trends (Tourassi et al., 2016). Another group combined LinkedIn profiles and obituaries to investigate the association between frequent relocation and lung cancer risk (Yoon et al., 2015). In this project, our ultimate goal is to link multiple obituaries by cross-validating name, age, residence and birth/death date information, to build large family trees. For this paper, we aim to investigate whether state-of-the-art NLP methods can automatically extract names and kinships from online obituaries with high accuracy.
Establishing human names and their relations is a Named Entity Recognition (NER) and Relation Extraction (RE) task. The NLP community has been working on both for many years. Usually, NER and RE are treated as two separate and sequential tasks (NER precedes RE). Most information extraction systems in biomedicine, including those mining the biomedical literature to extract adverse drug events and molecular interactions among drugs, genes and proteins, are built as pipelines of modules integrating NER and RE (Miwa et al., 2012; Kang et al., 2014; Yildirim et al., 2014; Sun et al., 2017; Li et al., 2013; Li et al., 2017). However, pipeline models have inherent limitations: (i) errors from NER propagate to RE. (ii) Because the two models complete their tasks independently, pipelines cannot fully exploit the internal connections between NER and RE to improve model performance. For instance, in adverse drug event extraction, the named entity appearing before the relation keyword "induce" (in active voice) would be a drug and the named entity after "induce" would be an adverse event; NER, which is performed first, finds it much harder to benefit from this relation information than RE does. (iii) Pipeline models are computationally redundant and error-prone because they test every pair of named entities for a possible relation, which is unnecessary.
In this work, we propose a joint neural model to simultaneously extract names and kinships from obituaries, which combines a two-layer bidirectional Long Short-Term Memory (bi-LSTM) network (Hochreiter and Schmidhuber, 1997) with a unique tagging scheme. In theory, it surpasses pipeline models by overcoming limitations (i) (Li et al., 2016; Zheng et al., 2017a) and (iii), and by making room for leveraging contextual information and domain knowledge to address limitation (ii). The rest of the paper is organized into four sections. In the Data and Methods section, we describe how we annotated the obituary corpus, together with the special tagging scheme, the bi-directional LSTM model and the evaluation metrics. In the Results section, we present corpus statistics and model performance. After that, we discuss the strengths and limitations of our method, before the final conclusions and future work.

Data and Methods

Corpus preparation
We downloaded obituaries from the websites of three funeral services and one local newspaper in Rochester, Minnesota: (1) http://www.bradshawfuneral.com, (2) http://www.czaplewskifuneralhomes.com, (3) https://mackenfuneralhome.com, and (4) https://www.postbulletin.com. The downloaded obituaries were published from 10/2008 to 09/2018. After removing those shorter than 290 characters, which are unlikely to contain any mention of family relationships, messy ones with irregular HTML formatting or language, and duplicates, we selected 1,809 obituaries for annotation, given our limited resources and the labor-intensive annotation process described in the next subsection.
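The filtering steps above can be sketched in a few lines of Python. This is an illustrative simplification under stated assumptions: the function name is hypothetical, and the real cleaning also removed pages with irregular HTML or language, which is not shown here.

```python
import hashlib

MIN_LENGTH = 290  # obituaries shorter than this rarely mention family relationships

def filter_obituaries(obituaries):
    """Drop short texts and exact duplicates (a simplified sketch of the
    corpus filtering described above, not the released preprocessing code)."""
    seen = set()
    kept = []
    for text in obituaries:
        text = text.strip()
        if len(text) < MIN_LENGTH:
            continue  # too short to contain kinship mentions
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a previously kept obituary
        seen.add(digest)
        kept.append(text)
    return kept
```

In practice near-duplicate detection (e.g., the same obituary posted on two funeral home sites with minor edits) would require fuzzier matching than the exact hash used here.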

Corpus annotation
The success of a machine learning application does not depend solely on the model itself. Most of the time it is determined more by the quality of the data, particularly the gold standard dataset for training and testing the model. The challenge of annotating a natural language corpus is that the ground truth is not always obvious, due to the ambiguity and complexity of human language. A detailed annotation guideline and duplicate annotation by multiple people are often necessary to guarantee annotation consistency and corpus quality. Based on two examples of biomedical corpus annotation (Gurulingappa et al., 2012; Roberts et al., 2009), we designed an iterative annotation workflow and revised our guideline three times. All annotations were done at the document level so that the annotators could leverage context in difficult cases. The open-source tool MAE version 2.2.6 (Kyeongmin, 2016) was used for annotation throughout the entire process.
The corresponding author and three native speakers of English drafted the first version of the annotation guideline. Three computer science students were then trained for annotation in two rounds. In each round, we randomly selected 300 obituaries and asked each student to annotate 200 of them, so that each obituary was annotated by two different annotators. At the end of each training round, we evaluated annotation consistency using inter-annotator agreement (IAA) metrics and improved the annotation guideline. Considering that extracting kinship is in effect an NER+RE task, we adopted precision, recall and F1 score rather than the Kappa coefficient to report IAA, as suggested by Gurulingappa et al. (2012) and Chinchor (1992). After completing the training, the 3 qualified annotators annotated the remaining obituaries with the assistance of a rule-based quality control program we wrote. Table 1 shows that precision, recall and F1 score improved steadily through training round 1, training round 2 and the final annotation. Discrepancies in the final annotation were resolved through group discussions. We thereby ensured that all 1,809 obituaries had high-quality annotations before building the models.
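Span-level IAA of this kind can be computed by treating one annotator as the reference and scoring the other against it. The sketch below is a minimal illustration under assumed names and input format (annotations as sets of (start, end, label) tuples), not our quality control program.

```python
def iaa_prf(annotator_a, annotator_b):
    """Inter-annotator agreement reported as precision/recall/F1,
    treating annotator A as the reference. Annotations are sets of
    (start, end, label) tuples; only exact matches count as agreement."""
    a, b = set(annotator_a), set(annotator_b)
    tp = len(a & b)                       # spans both annotators marked identically
    precision = tp / len(b) if b else 0.0  # of B's spans, how many A also marked
    recall = tp / len(a) if a else 0.0     # of A's spans, how many B also marked
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Because the reference annotator is arbitrary, precision and recall swap when the roles are reversed, while F1 stays the same, which is one reason F1 is a convenient single IAA number.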

The tagging scheme
Conventional NER and RE are usually formulated as triplet tagging (entity_1, relation, entity_2). But our task is not a general NER+RE task. It is simplified by three factors: (1) there is only one type of named entity to detect (human names); (2) all relations share the same first entity (the deceased); and (3) the first entity is mentioned in the metadata or the first sentence of the obituary, and hence does not need to be detected most of the time. Therefore, in this study we proposed a novel tagging scheme inspired by Zheng et al. (2017b), which extracts names and kinships relative to the deceased person in one step, as shown in Figure 1. We used the popular "BIESO" (begin, inside, end, single, and other) scheme to mark the position of words in entities, where "O" refers to cases where a word does not belong to an entity. This way we can identify a named entity by simply applying the rule of S or B + n*I + E, where n ≥ 0. But we added the kinship type into the "BIESO" tags, in order to synchronize the NER and RE annotation. So each tag consists of two parts: the first part indicates the kinship type and the second part indicates the position of a word in an entity. In the illustrative example shown in Figure 1, "Joyce M. Tottingham" is assigned three tags: "sister_B" for the word "Joyce", "sister_I" for the word "M.", and "sister_E" for the word "Tottingham". For the single-word entity "Kim", the assigned tag is "daughter_S". All the remaining words are assigned the tag "O". Because we set the deceased as the default first entity for any kinship, triplets are simplified to duplets, like [sister, Joyce M. Tottingham] and [daughter, Kim] for the sentence in Figure 1. "Tom" was the name of the deceased person (inferred from the context or metadata) and we did not annotate it as a named entity. But we annotated other entity types including age, residence, birth date and death date. We plan to use these additional entity types in future work when we build the family trees and link them to the EHR database.
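Recovering the duplets from a tagged token sequence follows directly from the S or B + n*I + E rule. The sketch below is an illustrative decoder (function name and input format are our assumptions for this example), not the paper's released code.

```python
def decode_duplets(tokens, tags):
    """Recover (kinship, name) duplets from kinship-augmented BIESO tags,
    applying the rule S or B + n*I + E (n >= 0). Each tag is either "O"
    or "<kinship>_<position>", e.g. "sister_B"."""
    duplets, buffer, current = [], [], None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            buffer, current = [], None
            continue
        kinship, pos = tag.rsplit("_", 1)
        if pos == "S":                       # single-word entity
            duplets.append((kinship, token))
            buffer, current = [], None
        elif pos == "B":                     # entity begins
            buffer, current = [token], kinship
        elif pos == "I" and current == kinship:
            buffer.append(token)             # entity continues
        elif pos == "E" and current == kinship:
            duplets.append((kinship, " ".join(buffer + [token])))
            buffer, current = [], None       # entity ends
        else:                                # malformed sequence: reset
            buffer, current = [], None
    return duplets
```

On the Figure 1 example, the tags for "Joyce M. Tottingham" and "Kim" decode to [("sister", "Joyce M. Tottingham"), ("daughter", "Kim")].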

The end-to-end joint neural model
End-to-end neural models have lately demonstrated effectiveness in various NLP tasks, including NER, RE, part-of-speech tagging and semantic role labeling (Hashimoto et al., 2017; Strubell, 2018). In this study, we adopted an end-to-end neural model (see Figure 2) that contained an embedding layer, two bi-LSTM layers, and a softmax output layer. A rule-based result improver layer was added at the end to consolidate the tags generated by the softmax output layer. We also used a dynamically weighted loss function to alleviate the data imbalance issue.
The input sentences were tokenized and each token was converted to a word vector learned with the GloVe method (Pennington et al., 2014) before being fed into the embedding layer. Padding, a common implementation technique, aligned all sentences in a batch to the longest sentence using padding tags for parallel computation. The padding tags did not impact the model performance because their outputs were masked out in the backward layer of the bi-LSTM model. The bi-LSTM architecture consisted of a forward layer and a backward layer, which captured sequential context information in both directions. Both layers consisted of blocks made up of a forget gate, an input gate and an output gate. The forget gate decided how much information from the previous block would be dropped at the current block, considering the current input and the previous hidden representation. The input gate took the output of the forget gate and the previous cell state to update the current cell state. The output gate created a hidden representation for each token based on all the information from the forget gate and input gate. Finally, the outputs of the forward layer and backward layer were concatenated as the final representation. The softmax function served as the classifier, computing the final normalized probability for each tag. Each token was then classified into one of (m*5+1) tags, where m was the total number of kinship types. We tried m=57 and m=34, according to the number of annotated examples in our experiment (See Table 4). In the end, a rule-based result improver was added to make sense of the sequence of the classified tags. For example, if the softmax output layer tagged two neighboring words as "sister_B" and "sister_I" without "sister_E" nearby, the improver would correct the second tag to "sister_E".

The model was trained by minimizing a dynamically weighted cross-entropy loss with L2 regularization:

L = -(1/B) Σ_{i=1}^{B} Σ_{t=1}^{L_i} [ I(ŷ_t) + (1 - I(ŷ_t)) · ω_i(ŷ_t) ] · log p_t(ŷ_t) + λ‖Θ‖₂²   (1)

where B was the batch size, L_i was the length of input sentence x_i, ŷ_t and p_t(ŷ_t) were the true tag and the normalized probability of the predicted tag for word t, and λ was the hyper-parameter for L2 regularization. I(·) was the indicator function determining whether the current tag was "O" (other), formulated as:

I(y) = { 0, if y = "O"; 1, if y ≠ "O" }   (2)

ω_i was the dynamic weight, which assigned each tag ω different weights in different sentences, aiming to alleviate the influence of the overly frequent "O" tag. It was defined as:

ω_i(ω) = w_min + (w_max - w_min) · (1 - n_ω / |x_i|), for each ω ∈ T   (3)

where T was the union of all possible tags, x_i referred to sentence i in a batch of the training set, |x_i| was the total count of all tags in x_i, n_ω was the number of occurrences of a specific tag ω in x_i, and w_max and w_min were the maximal and minimal hyper-parameters for normalization, respectively.
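The correction behavior of the rule-based result improver can be sketched as follows. This minimal version implements only the single rule given in the text (closing an unterminated B/I run so it matches the S or B + n*I + E pattern); the actual improver applied more rules, and the function name is an assumption for illustration.

```python
def improve_tags(tags):
    """Repair tag sequences so every entity matches S or B + n*I + E.
    A run opened by X_B that ends without X_E gets its last tag relabeled
    to X_E; a lone X_B becomes a single-token entity X_S."""
    fixed = list(tags)
    i = 0
    while i < len(fixed):
        if fixed[i].endswith("_B"):
            kinship = fixed[i].rsplit("_", 1)[0]
            j = i + 1
            while j < len(fixed) and fixed[j] == kinship + "_I":
                j += 1                       # consume the inside tags
            if j >= len(fixed) or fixed[j] != kinship + "_E":
                if j - 1 > i:
                    fixed[j - 1] = kinship + "_E"  # close the run on its last tag
                else:
                    fixed[i] = kinship + "_S"      # lone B becomes a single entity
            else:
                j += 1                       # well-formed: include the E tag
            i = j
        else:
            i += 1
    return fixed
```

For the example from the text, ["sister_B", "sister_I", "O"] is corrected to ["sister_B", "sister_E", "O"], while already well-formed sequences pass through unchanged.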

Evaluation metrics
A recognized named entity mention was considered a true positive (TP) if both its boundary and type matched the annotation. An extracted relation was considered a TP if both the entity and the relation type were correctly captured. A recognized entity or relation was considered a false positive (FP) if it did not exactly match the manual annotation in terms of boundaries and relation types. The number of false negative (FN) instances was computed by counting the named entities or relations in the manual annotation that were missed by the model. We performed 10-fold cross-validation in our experiment, where 10% of the annotated data were randomly selected for validation and the remainder for training the model. We evaluated the model performance using macro- and micro-averaged precision, recall and F-measure. A macro-averaged metric treats all classes equally by computing the metric independently for each class and then taking the average. In contrast, a micro-averaged metric aggregates the TP, FP, and FN counts of all classes to compute a single average metric.
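The difference between the two averages can be made concrete with per-class counts. The sketch below is a generic illustration of macro versus micro averaging, not our evaluation script; the input format (a dict mapping each kinship class to its (TP, FP, FN) counts) is an assumption for this example.

```python
def macro_micro_f1(counts):
    """Compute macro- and micro-averaged (precision, recall, F1) from
    per-class (TP, FP, FN) counts."""
    def prf(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    # Macro: compute the metric per class, then average, weighting all classes equally.
    per_class = [prf(*c) for c in counts.values()]
    macro = tuple(sum(m[k] for m in per_class) / len(per_class) for k in range(3))

    # Micro: pool the raw counts across classes, then compute the metric once.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = prf(tp, fp, fn)
    return macro, micro
```

With one frequent, well-predicted class and one rare, poorly predicted class, the micro average sits far above the macro average, mirroring the gap between the macro and micro results reported for this imbalanced kinship task.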

Results

Corpus annotation
Table 2 lists detailed summary statistics of our corpus. There were 1,711 mentions of deceased names in the 1,809 obituaries. Some obituaries mentioned the names of the deceased in the title (metadata) rather than in the main body; in those cases, we directly linked the deceased names in the title with the main body of free text. On average, each obituary contains 16.6 sentences, and the 1,809 obituaries contain 30,035 sentences in total. We extracted and annotated 29,938 names, 27,227 family relations and 8,476 residences for the deceased and their families. We were able to pair up a name with a residence 9,189 times. For the deceased, we also annotated their age, death date, birth date and residence when available.
We noticed two interesting language patterns in obituaries, namely last name distribution and names with parentheses (see Table 3). These patterns might stem from word limits in the era when families paid to publish obituaries in printed newspapers. In total, we annotated 71 kinships (see Table 4). It is worth noting that we kept "married to" and "spouse", and "born to" and "parent", as separate kinship types in our experiment. This is because the syntax, co-occurring words and their order near "married to"/"born to" are subtly different from those near "spouse"/"parent". Keeping them as separate kinship types might help to improve the model performance. We will group them in the next step when we build the family trees, as they are semantically equivalent.
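The parenthesized-name pattern can be illustrated with a small regex sketch. Reading a mention like "Kristen (Matt) Asleson" as "relative (spouse's first name) shared last name" is our assumption for this example, and the helper below is hypothetical, not the parsing we plan for future work.

```python
import re

# One relative with the spouse's first name in parentheses,
# e.g. "Kristen (Matt) Asleson" -> Kristen Asleson married to Matt Asleson.
PAREN_NAME = re.compile(
    r"^(?P<first>[\w.'-]+)\s*\((?P<spouse>[\w.'-]+)\)\s*(?P<last>[\w.'-]+)$"
)

def split_paren_name(mention):
    """Split a parenthesized name mention into (person, spouse) full names,
    or return None if the mention does not match the pattern."""
    m = PAREN_NAME.match(mention.strip())
    if not m:
        return None
    person = f"{m.group('first')} {m.group('last')}"
    spouse = f"{m.group('spouse')} {m.group('last')}"
    return person, spouse
```

A production parser would also need to handle maiden names, multi-word parenthetical content, and cases where the spouse keeps a different last name.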

Model performance
Table 5 shows the final performance of the baseline method (pipeline model) versus our proposed joint neural model for extracting names and kinships from obituaries. The baseline model consists of two one-layer bi-LSTMs: the first bi-LSTM performs NER with the simple BIESO tagging scheme, and its outputs are used as the inputs of the second bi-LSTM for RE. The general architecture is the same as that of the joint model, but the tagging scheme for NER is different, and NER and RE work in a pipelined way. The joint model outperformed the pipeline model by 4.09%, 9.02% and 6.5% in precision, recall and F measure at the macro level using the 57 kinships with 10 or more examples. The joint model outperformed the pipeline model by even larger margins in precision, recall and F measure (4.16%, 14.84% and 9.79%, respectively) at the macro level when considering the 34 kinships with 50 or more examples. The micro-level evaluation metrics showed similar trends with even higher values, reflecting the imbalanced nature of this multi-class classification problem.

Discussions
The proposed joint neural model seemed capable of extracting human names and relations with high performance. For common kinship types with a large number of examples in the training dataset, such as grandchild, child, parent (born to), sister and brother, the model performed well. As shown in Table 6, the model was able to tell that "Kristy" was the wife of the deceased person (the second example of correct classification), but could not figure out that "Jolene Stock" was the wife of the deceased "Craig" (the first example of wrong classification). It seems that the model was confused by the relationships between the deceased, "the boy's mother" and Jolene Stock. For the second example of wrong classification, incorrect punctuation might have led to the error: the period before "Kristen (Matt) Asleson" should have been a comma. The last example in Table 6 was an extremely difficult and rare case. The common kinship keywords indicating a wife were missing, and without properly understanding the semantic meaning of "propose" and "marriage" in the sentence, our model failed to pick up "Cecelia Stevens" as a name.
One limitation of this study was that we built the bi-LSTM model on sentences, and therefore lost context information beyond a sentence. A more sophisticated model would be needed to parse an entire obituary document. Another challenge was that we could not afford to annotate more obituaries, which left 14 kinship types with fewer than 10 examples (e.g., grandmother-in-law, grand uncle, great-great grandson and great-great granddaughter). Our model, or any supervised model, would not perform well on such small training sets.

Conclusions and Future Work
In this work, we built an annotated corpus of more than 30,000 sentences (from 1,809 obituaries written in English) and proposed a two-layer bi-LSTM model to simultaneously extract human names and kinships. Our joint neural model achieved macro-averaged precision, recall and F measure of 72.69%, 78.54% and 74.93%, and micro-averaged precision, recall and F measure of 95.74%, 98.25% and 96.98%, using the 57 kinships with 10 or more examples in a 10-fold cross-validation experiment. The model performance improved dramatically when trained with the 34 kinships with 50 or more examples. We shared our corpus and code on GitHub for the convenience of researchers.
Given such promising results, we will continue to improve our joint model to recognize other entity and relation types, including age, residence, birth date and death date. We will further parse names with parentheses, resolve last name distributions, and leverage existing knowledge to infer the gender of names. Only when we complete these tasks with high quality can we build large family trees and link people to our EHR database. We are cautiously optimistic because almost all residents of Rochester, MN have been patients at Mayo Clinic at some point in their lives, and the population mobility rate in Rochester is far lower than in major metropolitan areas in the U.S. With the massive obituary data freely available on the Internet, our ultimate goal is to accelerate large-scale disease heritability research and clinical genetics research.

Ethics
In this study, we mined only publicly available information from 4 websites, without interacting with, intervening in, or manipulating the websites' environments. The study does not involve "human subject" data and was approved by the Office of Research and Compliance at Mayo Clinic without an IRB requirement.

Figure
Figure 1: A novel tagging scheme for extracting names and kinships from obituaries

Figure 2 :
Figure 2: The neural network architecture for jointly extracting names and kinship types

Funding
Funding for KH, JW, XM, CZ and CL is provided by the National Key Research and Development Program of China (2018YFC0910404), the National Natural Science Foundation of China (61772409) and the consulting research project of the Chinese Academy of Engineering (The Online and Offline Mixed Educational Service System for "The Belt and Road" Training in MOOC China). Funding for MH and LY is provided by the National Center for Advancing Translational Sciences (UL1TR002377) and the National Library of Medicine (5K01LM012102).

Table 1 :
IAA scores in different rounds of annotation with different annotation features ("-" means "without")

Table 5 :
Comparing the performance of the pipeline model versus the joint model. The values in brackets represent the standard deviation across 10-fold cross-validation.

Table 6:
Correctly classified examples and wrongly classified examples. The difficult final example reads in part: "... Stevens by serenading the words from the musical Carousel, 'If I loved you, words wouldn't come in an easy way' - he proposed and on July 6, 1955, they began sixty-one years of marriage."