Improving Clinical Diagnosis Inference through Integration of Structured and Unstructured Knowledge

This paper presents a novel approach to the task of automatically inferring the most probable diagnosis from a given clinical narrative. Structured Knowledge Bases (KBs) can be useful for such complex tasks but are not sufficient on their own. Hence, we integrate a vast amount of unstructured free text with structured KBs. The key innovative ideas include building a concept graph from both structured and unstructured knowledge sources and ranking the diagnosis concepts using enhanced word embedding vectors learned from the integrated sources. Experiments on the TREC CDS and HumanDx datasets showed that our methods improved the results of clinical diagnosis inference.


Introduction and Related Work
Clinical diagnosis inference is the problem of automatically inferring the most probable diagnosis from a given clinical narrative. Many health-related information retrieval tasks can greatly benefit from accurate clinical diagnosis inference. For example, in the recent Text REtrieval Conference (TREC) Clinical Decision Support track (CDS), diagnosis inference from medical narratives has improved the accuracy of retrieving relevant biomedical articles (Roberts et al., 2015; Hasan et al., 2015; Goodwin and Harabagiu, 2016).
On the other hand, unstructured textual resources such as free text from Wikipedia generally contain more information than structured KBs. As supplementary knowledge that mitigates the limitations of structured KBs, unstructured text combined with structured KBs has improved results on related tasks, for example, clinical question answering (Miller et al., 2016). For processing text, word embedding models (e.g. the skip-gram model (Mikolov et al., 2013b; Mikolov et al., 2013a)) can efficiently discover and represent the underlying patterns of unstructured text; they represent words and their relationships as continuous vectors. To improve word embedding models, previous works have also successfully leveraged structured KBs (Bordes et al., 2011; Weston et al., 2013; Wang et al., 2014). Motivated by the power of integrating structured KBs with unstructured free text, we propose a novel approach to clinical diagnosis inference; the novelty lies in the ways the structured KBs are integrated with unstructured text. Experiments showed that our methods improved clinical diagnosis inference from different aspects (Section 5.4). Previous work on diagnosis inference from clinical narratives either formulates the problem as a medical literature retrieval task (Zheng and Wan, 2016; Balaneshin-kordan and Kotov, 2016) or as a multi-class multi-label classification problem in a supervised setting. To the best of our knowledge, there is no prior work on diagnosis inference from clinical narratives conducted in an unsupervised way. Thus, we build such baselines for this task.

Overview of the Approach
Our approach includes four steps in general: 1) extracting source concepts, q, from clinical narratives, 2) iteratively identifying corresponding evidence concepts, a, from KBs and unstructured text, 3) representing both source and evidence concepts in a weighted graph via a regularizer-enhanced skip-gram model, and 4) ranking the relevant evidence concepts (i.e. diagnoses) based on their association with the source concepts, S(q, a) (computed as the weighted dot product of two vectors), to generate the final output. Figure 1 shows the overview using an illustrative example.
Given source concepts as input, we build an edge-weighted graph representing the connections among all the concepts by iteratively retrieving evidence concepts from both KBs and unstructured text. The weights of the edges represent the strengths of the relationships between concepts. Each concept is represented as a word embedding vector. We combine all the source concept vectors into a single vector representing a clinical scenario. Source concepts are differentiated according to the weighting scheme in Section 4.2. Evidence concepts are also represented as vectors and ranked according to their relevance to the source concepts. For each clinical case, we find the most probable diagnoses from the top-ranked evidence concepts.

Knowledge Sources of Evidence Concepts
In this study, we use the UMLS Metathesaurus (Bodenreider, 2004) and Freebase (Bollacker et al., 2008) as the structured KBs. Both KBs provide semantic relation triples in the following format: <concept1, relation, concept2>. We select UMLS relation types that are relevant to the problem of clinical diagnosis inference, including disease-treatment, disease-prevention, disease-finding, sign or symptom, causes, etc. Freebase contains a large number of triples from multiple domains. We select 61,243 triples from Freebase that are classified under medicine relation types. There are 19 such relation types in total; most of them fall under the "medicine.disease" category.
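The domain-based triple selection described above can be sketched as follows. This is a minimal illustration, not the authors' code: the tab-separated file layout and the concrete relation strings are assumptions, though the "medicine.*" relation naming follows Freebase conventions.

```python
# Sketch: selecting medicine-domain triples from a KB dump.
# Triple format <concept1, relation, concept2> follows the paper;
# the tab-separated layout and sample relations are illustrative.

def load_triples(lines):
    """Parse tab-separated <concept1, relation, concept2> triples."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            yield tuple(parts)

def select_medicine_triples(triples):
    """Keep only triples whose relation is in the medicine domain,
    e.g. 'medicine.disease.symptoms' has domain 'medicine'."""
    return [t for t in triples if t[1].split(".")[0] == "medicine"]

raw = [
    "Pulmonary embolism\tmedicine.disease.symptoms\tDyspnea",
    "Paris\tlocation.location.containedby\tFrance",
]
kept = select_medicine_triples(load_triples(raw))
# kept retains only the medicine-domain triple
```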
For unstructured text, we use articles from the Wikipedia and MayoClinic corpora as supplementary knowledge sources. Important clinical concepts mentioned in a Wikipedia/MayoClinic page can serve as critical clues to a clinical diagnosis. For example, in Figure 1, we see that "dyspnea", "shortness of breath", "tachypnea", etc. are related signs and symptoms of the "Pulmonary Embolism" diagnosis. We select 37,245 Wikipedia pages under the clinical diseases and medicine category in this study; most of the page titles represent disease names. In addition, the MayoClinic disease corpus contains 1,117 pages, which include sections on Symptoms, Causes, Risk Factors, Treatments and Drugs, Prevention, etc.

Building Weighted Concept Graph
Both the source and the evidence concepts are represented as nodes in a graph. A clinical case is represented as a set of source concept nodes: q = {q_1, q_2, . . .}. We build a weighted concept graph from the source concepts using Algorithm 1. Two kinds of evidence concept nodes are added to the graph: 1) entities from the KBs (UMLS and Freebase) (steps 7-12 in Algorithm 1), and 2) entities from unstructured text pages (steps 13-18). If there exists a triple <q_i, r, a_j> in the KBs, where r refers to a relation, an edge connects node q_i and node a_j; its weight w_ij is set to 1 if the corresponding triple occurs at least once. Due to the incompleteness of the KBs, there may be missing connections between a potential evidence concept a_j and a source concept q_i. Unstructured knowledge from Wikipedia and MayoClinic can replenish these missing connections. For each page p, the page title represents an evidence concept a_j. We use each source concept q_i as a query and page p as a document, and calculate a query-document similarity to measure the edge weight w_ij between node a_j and node q_i. We only take as evidence concepts the nodes connected to source concepts within a distance of at most 2 (steps 4-6).
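The two edge types above can be sketched as follows. This is a toy illustration, not Algorithm 1 itself: the term-overlap score stands in for the paper's (unspecified) query-document similarity, and all names and inputs are invented for the example.

```python
# Minimal sketch of graph construction: KB triples contribute
# weight-1 edges; for text pages, a toy term-overlap score stands
# in for the query-document similarity used in the paper.
from collections import defaultdict

def build_graph(source_concepts, kb_triples, pages):
    """graph[u][v] = edge weight between concepts u and v."""
    graph = defaultdict(dict)
    for c1, _rel, c2 in kb_triples:
        if c1 in source_concepts:
            # KB edge: weight 1 if the triple occurs at least once
            graph[c1][c2] = 1.0
    for title, text in pages.items():
        words = set(text.lower().split())
        for q in source_concepts:
            # Toy similarity: fraction of query words found in the page
            q_words = q.lower().split()
            overlap = sum(w in words for w in q_words) / len(q_words)
            if overlap > 0:
                # Keep the strongest evidence for this concept pair
                graph[q][title] = max(graph[q].get(title, 0.0), overlap)
    return graph

g = build_graph(
    {"shortness of breath"},
    [("shortness of breath", "symptom_of", "Pulmonary embolism")],
    {"Pulmonary embolism": "Symptoms include shortness of breath and chest pain."},
)
```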

Representing Clinical Case
We combine the source concepts q into a single vector v_q that represents the clinical case narrative. The source concepts extracted from narratives should be differentiated for clinical diagnosis inference: some source concepts are major symptoms for a diagnosis, while others are less critical. These major source concepts should be identified and given higher weight values. We develop two kinds of weighting schemes for the differential expression of the source concepts. The clinical case is represented as v_q = (1/N) Σ_{q_i ∈ q} γ_i v_{q_i}, where N is the total number of source concepts and v_{q_i} is the vector representation of source concept q_i.
(1) A longer concept usually conveys more information (e.g. malar rash vs. rash), so it should be given more weight. We define this weight value as γ_1 = #Words in Concept.
(2) Commonly seen concepts (e.g. fever) usually have more edges connected to them. A common concept is often less important for diagnosis inference, while a unique concept can be critical for inferring a specific diagnosis. We define this weight value for each concept as γ_2 = 1 / #Connected Edges; a higher weight value means the source concept is more unique.
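The weighted combination v_q = (1/N) Σ γ_i v_{q_i} with the two schemes can be sketched as follows. The 3-dimensional embeddings here are toy values; in the paper they come from the trained embedding model.

```python
# Sketch of the clinical-case vector v_q = (1/N) * sum_i gamma_i * v_{q_i}
# with the two weighting schemes from the text. Toy embeddings only.
import numpy as np

def gamma_length(concept):
    # Scheme 1: weight by the number of words in the concept
    return len(concept.split())

def gamma_uniqueness(n_connected_edges):
    # Scheme 2: rarer concepts (fewer edges) get higher weight
    return 1.0 / n_connected_edges

def case_vector(concepts, embeddings, gammas):
    vecs = [g * embeddings[c] for c, g in zip(concepts, gammas)]
    return np.mean(vecs, axis=0)

emb = {"malar rash": np.array([1.0, 0.0, 0.0]),
       "fever": np.array([0.0, 1.0, 0.0])}
concepts = ["malar rash", "fever"]
gammas = [gamma_length(c) for c in concepts]  # [2, 1]
v_q = case_vector(concepts, emb, gammas)
```

Under scheme 1, the two-word concept "malar rash" contributes twice the weight of "fever" to the case vector.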

Inferring Concepts for Diagnosis
Extracting Potential Evidence Concepts: From the source concept nodes q, we find their connected concepts in the graph as evidence concepts. Traversing all edges in the graph is computationally expensive and often unnecessary for finding potential diagnoses, so we use a subgraph instead, following the idea proposed in Bordes et al. (2014): the evidence concepts are defined as all nodes connected to source concepts within a distance of at most 2.
Ranking Evidence Concepts: We rank each evidence concept a according to its matching score S(q, a) against the source concepts. The matching score is the dot product of the embedding representations of the evidence concept a and the source concepts q, scaled by the edge weight w_ij: S(q, a) = w_ij (v_a · v_q), where v_a and v_q are the embedding representations of a and q. The embeddings E ∈ R^{k×N} for concepts are trained using embedding models (Section 4.4), where N is the total number of concepts and k is the predefined dimension of the embedding vectors; each concept in the graph thus has a k-dimensional vector representation in E. For a set of source concepts q and candidate evidence concepts A(q), the top-ranked evidence concept is computed as: â = argmax_{a ∈ A(q)} S(q, a). (1)
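The scoring and argmax of Equation (1) can be sketched as follows; the candidate structure, edge weights, and toy vectors are all illustrative assumptions.

```python
# Sketch of ranking evidence concepts by S(q, a) = w * (v_a . v_q),
# where w is the edge weight linking a to the source concepts.
import numpy as np

def rank_evidence(v_q, candidates):
    """candidates: {concept: (edge_weight, vector)}.
    Returns concepts sorted by descending matching score."""
    scores = {a: w * float(np.dot(v, v_q))
              for a, (w, v) in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)

v_q = np.array([1.0, 0.5, 0.0])          # combined source-concept vector
cands = {
    "Pulmonary embolism": (1.0, np.array([0.9, 0.4, 0.1])),
    "Common cold":        (0.2, np.array([0.1, 0.9, 0.0])),
}
ranked = rank_evidence(v_q, cands)       # ranked[0] is the top diagnosis
```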

Word Embedding Models
We use the skip-gram model as the basic model. Given the current center word w_t, the skip-gram model predicts the surrounding words w_{t−c}, . . . , w_{t−1}, w_{t+1}, . . . , w_{t+c}. We further enhance the skip-gram model by adding a graph regularizer. Given a sequence of training words w_1, w_2, . . . , w_T, the objective function is: (1/T) Σ_{t=1}^{T} [ Σ_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t) − λ Σ_{r=1}^{R} D(v_t, v_r) ], where v_t and v_r are the representation vectors for words w_t and w_r, and λ is a parameter that balances the graph regularizer against the original objective. Suppose word w_t is mentioned as having relations with a set of other words w_r, r ∈ {1, . . . , R}, in the KBs. The graph regularizer λ Σ_{r=1}^{R} D(v_t, v_r) integrates extra knowledge about the semantic relationships among words within the graph structure, where D(v_t, v_r) represents the distance between v_t and v_r. In our experiments, the distance between two concepts is measured using KL-divergence, but D(v_t, v_r) could be calculated with any other distance metric. By minimizing D(v_t, v_r), we expect that if two concepts have a close relation in the KBs, their vector representations will also be close to each other.
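The regularizer term can be sketched as follows. One detail is an assumption: the paper does not specify how embedding vectors are turned into distributions for the KL-divergence, so this sketch softmax-normalizes them, which is one common choice.

```python
# Sketch of the graph-regularizer term lam * sum_r D(v_t, v_r).
# KL-divergence is computed between softmax-normalized embedding
# vectors -- the normalization is an assumption, not from the paper.
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def kl_divergence(v_t, v_r):
    p, q = softmax(v_t), softmax(v_r)
    return float(np.sum(p * np.log(p / q)))

def regularizer(v_t, related_vecs, lam=0.1):
    # Penalizes embeddings that drift away from their KB-related
    # neighbours; subtracted from the skip-gram objective.
    return lam * sum(kl_divergence(v_t, v_r) for v_r in related_vecs)

v_t = np.array([0.2, 0.1, 0.4])
penalty_same = regularizer(v_t, [v_t.copy()])          # identical -> 0
penalty_far = regularizer(v_t, [np.array([2.0, -1.0, 0.5])])  # > 0
```

Minimizing this penalty during training pulls KB-related concepts toward each other in the embedding space, as the section describes.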

Datasets for Clinical Diagnosis Inference
Our first dataset is from the 2015 TREC CDS track (Roberts et al., 2015). It contains 30 topics, where each topic is a medical case narrative that describes a patient scenario, and each case is associated with a ground truth diagnosis. We use MetaMap (https://metamap.nlm.nih.gov/) to extract the source concepts from a narrative and then manually refine them to remove redundancy.
Our second dataset is curated from HumanDx, a project that fosters integrated efforts to map health problems to their possible diagnoses. We curate diagnosis-finding relationships from HumanDx and create a dataset with 459 diagnosis-finding entries. Note that the findings from this dataset are used as the given source concepts for a clinical scenario.

Training Data for Word Embeddings
We curate a biomedical corpus of around 5M sentences from two data sources: PubMed Central articles from the 2015 TREC CDS snapshot and Wikipedia articles under the "Clinical Medicine" category. After sentence splitting, word tokenization, and stop word removal, we train our word embedding models on this corpus. The UMLS Metathesaurus and Freebase are used as the KBs to train the graph regularizer. We use stochastic gradient descent (SGD) to maximize the objective function and set the parameters empirically.
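The preprocessing pipeline (sentence splitting, tokenization, stop word removal) can be sketched as follows; the regex-based splitting and the tiny stop-word list are simplifications for illustration, not the actual tooling used in the paper.

```python
# Minimal sketch of corpus preprocessing before embedding training.
# The stop-word list is a tiny illustrative subset.
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def preprocess(text):
    """Split into sentences, tokenize, and drop stop words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    out = []
    for s in sentences:
        tokens = [w for w in re.findall(r"[a-z]+", s.lower())
                  if w not in STOP_WORDS]
        if tokens:
            out.append(tokens)
    return out

corpus = preprocess("Dyspnea is a symptom of pulmonary embolism. See a doctor.")
```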

Results
We use Mean Reciprocal Rank (MRR) and Average Precision at 5 (P@5) to evaluate our models. MRR is a statistical measure for evaluating a process that generates a list of possible responses to a sample of queries, ordered by probability of correctness. Average P@5 is the precision over the top 5 predicted results, averaged over all topics. Since our dataset has only one correct diagnosis per topic, all results have low Average P@5 scores. Table 1 presents the results of our experiments. We report two baselines: Skip-gram refers to the basic word embedding model, and Skip-gram* refers to the graph-regularized model using KBs. We also show results for the different unstructured knowledge sources and the different weighting schemes. The best scores are obtained by the graph-regularized models that use both unstructured knowledge sources together with the variable weighting schemes (Section 4.2).
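The two metrics can be sketched as follows; the prediction lists and gold labels are invented for the example. Since each topic has exactly one correct diagnosis, per-topic P@5 is at most 1/5.

```python
# Sketch of the two evaluation metrics used in Table 1.

def mrr(ranked_lists, truths):
    """Mean reciprocal rank of the correct diagnosis."""
    total = 0.0
    for ranked, truth in zip(ranked_lists, truths):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(truths)

def avg_p_at_5(ranked_lists, truths):
    """Mean of per-topic precision at rank 5 (one relevant item)."""
    hits = [1 if truth in ranked[:5] else 0
            for ranked, truth in zip(ranked_lists, truths)]
    return sum(hits) / 5 / len(truths)

preds = [["PE", "asthma", "COPD"], ["flu", "cold", "PE"]]
gold = ["PE", "cold"]
print(mrr(preds, gold))        # (1/1 + 1/2) / 2 = 0.75
print(avg_p_at_5(preds, gold)) # (1/5 + 1/5) / 2 = 0.2
```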

Discussion
Unstructured text is a critical supplement: We analyze the source concepts and the corresponding evidence concepts for the CDS topics, and investigate how the unstructured sources supplement the structured KBs.
Source concepts should be differentiated: In clinical narratives, some concepts are more critical than others for clinical diagnosis inference. We developed two weighting schemes to assign higher weight values to the more important concepts. The results in Table 1 show that differentiating the source concepts with different weight values has a large impact on model performance.
Enhanced skip-gram is better: We propose an enhanced skip-gram model that uses a graph regularizer to integrate the semantic relationships among concepts from the KBs. Experimental results show that diagnosis inference is improved by using word embedding representations from the enhanced skip-gram model.

Conclusion
We proposed a novel approach to the task of clinical diagnosis inference from clinical narratives. Our method overcomes the limitations of structured KBs by making use of integrated structured and unstructured knowledge. Experimental results showed that the enhanced skip-gram model with differential expression of source concepts improved performance on two benchmark datasets.