Learning to Infer Entities, Properties and their Relations from Clinical Conversations

Recently we proposed the Span Attribute Tagging (SAT) Model to infer clinical entities (e.g., symptoms) and their properties (e.g., duration). It tackles the challenge of large label space and limited training data using a hierarchical two-stage approach that identifies the span of interest in a tagging step and assigns labels to the span in a classification step. We extend the SAT model to jointly infer not only entities and their properties but also relations between them. Most relation extraction models restrict inferring relations between tokens within a few neighboring sentences, mainly to avoid high computational complexity. In contrast, our proposed Relation-SAT (R-SAT) model is computationally efficient and can infer relations over the entire conversation, spanning an average duration of 10 minutes. We evaluate our model on a corpus of clinical conversations. When the entities are given, the R-SAT outperforms baselines in identifying relations between symptoms and their properties by about 32% (0.82 vs 0.62 F-score) and by about 50% (0.60 vs 0.41 F-score) on medications and their properties. On the more difficult task of jointly inferring entities and relations, the R-SAT model achieves a performance of 0.34 and 0.45 for symptoms and medications respectively, which is significantly better than 0.18 and 0.35 for the baseline model. The contributions of different components of the model are quantified using ablation analysis.


Introduction
The widespread adoption of Electronic Health Records by clinics across the United States has placed a disproportionately heavy burden on clinical providers, causing burnout among them (Wachter and Goldsmith, 2018; Xu, 2018; Arndt et al., 2017). There has been considerable interest, both in academia and industry, in automating aspects of documentation so that providers can spend more time with their patients. One such approach aims to generate clinical notes directly from doctor-patient conversations (Patel et al., 2018; Finley et al., 2018a,b). The success of such an approach hinges on extracting relevant information reliably and accurately from clinical conversations.
In this paper, we investigate the tasks of jointly inferring entities, specifically symptoms (Sx) and medications (Rx), their properties, and the relations between them from clinical conversations. These tasks are defined in Section 2. The key contributions of the work reported here include: (i) a novel model architecture for jointly inferring entities and their relations, whose parameters are learned using the multi-task learning paradigm (Section 4), (ii) a comprehensive empirical evaluation of our model on a corpus of clinical conversations (Section 6), and (iii) an understanding of the model performance using an ablation study and human error analysis (Section 6.7). Since clinical conversations include domain-specific knowledge, we also investigate the benefit of augmenting the input feature representation with knowledge graph embeddings. Finally, we summarize our conclusions and contributions in Section 7.
Symptom types (symType) are drawn from a pre-defined set of 186 categories, curated by a team of practising physicians and scribes, based on how they appear in clinical notes. We deliberately abstained from the more exhaustive symptom labels such as UMLS and ICD codes (Bodenreider, 2004) in favor of this smaller set since our training data is limited.
The properties associated with the symptoms, propType, fall into four categories: symprop/severity, symprop/duration, symprop/location, and symprop/frequency. The propContent denotes the content associated with the property. In the running example, "every morning" is the content associated with the property type symprop/frequency.
Not all the symptoms mentioned in the course of clinical conversations are experienced by the patients. We explicitly infer the status of a symptom as experienced or not. This secondary task extracts the pair: (symType, symStatus).
While symptoms can be categorized into a closed set, the set of medications is very large and continually updated. Moreover, in conversations, we would like to extract indirect references such as "pain medications" as medContent. We define three types of properties: medsprop/dosage, medsprop/duration and medsprop/frequency. In the running example, "twice a day" is the propContent of the type medsprop/frequency associated with the medContent "ibuprofen".

Previous Work
Relation extraction is a long-studied problem in NLP and includes tasks such as ACE (Doddington et al., 2004), SemEval (Hendrickx et al., 2010), the i2b2/VA Task (Uzuner et al., 2011a), and the BioNLP Shared Task (Kim et al., 2013). Many early algorithms, such as DIPRE by Brin (1998) and SNOWBALL by Agichtein and Gravano (2000), relied on regular expressions and rules (Fundel et al., 2007; Peng et al., 2014). Subsequent work exploited syntactic dependencies of the input sentences. Features from the dependency parse tree were used in maximum entropy models (Kambhatla, 2004) and neural network models (Snow et al., 2005). Kernels were defined over tree structures (Zelenko et al., 2003; Culotta and Sorensen, 2004; Qian et al., 2008). More efficient methods were investigated, including shortest dependency path (Bunescu and Mooney, 2005) and sub-sequence kernels (Mooney and Bunescu, 2006). Recent work on deep learning models investigated convolutional neural networks (Liu et al., 2013), graph convolutional neural networks over pruned trees (Zhang et al., 2018), recursive matrix-vector projections (Socher et al., 2012) and Long Short Term Memory (LSTM) networks (Miwa and Bansal, 2016). Other more recent approaches include two-level reinforcement learning models (Takanobu et al., 2019), two layers of attention-based capsule network models, and self-attention with transformers (Verga et al., 2018). In particular, Miwa and Sasaki (2014), Katiyar and Cardie (2016), Zhang et al. (2017), Zheng et al. (2017), Verga et al. (2018) and Takanobu et al. (2019) also seek to jointly learn the entities and the relations among them. A large fraction of past work focused on relations within a single sentence. Dependency-tree-based approaches have been extended across sentences by linking the root nodes of adjacent sentences (Gupta et al., 2019).
Coreference resolution is a similar task which requires finding all mentions of the same entity in the text (Martschat and Strube, 2015;Clark and Manning, 2016;Lee et al., 2017).
In the medical domain, the BioNLP shared task deals with gene interactions and is very different from our domain (Kim et al., 2013). The i2b2/VA challenge is closer to our domain of clinical notes; however, that task is defined on a small corpus of written discharge summaries (Uzuner et al., 2011b). The written domain benefits from cues such as section headings, which are unavailable in clinical conversations. For a wider survey of extracting clinical information from written clinical documents, see Liu et al. (2012).

Figure 1: Variants of the R-SAT model architecture. The entity spans ("some pain", "blood thinner") are identified in a tagging step and pushed into a memory buffer along with their latent representations; in a subsequent step, each property span ("my back", "three months") selects the most related entity from the buffer.

Model
Our application requires performing multiple inferences simultaneously, that of identifying symptoms, medications, their properties and relations between them. For this purpose, we adopt the well-suited multitask learning framework and develop a model architecture, illustrated in Figure 1, that utilizes our limited annotated corpus efficiently.

Input Encoder Layer
Let x be an input sequence.
We compute the contextual representation at the kth step using a bidirectional LSTM, which is fed into a two-layer fully connected feed-forward network. For simplicity, we drop the index k from here on. The final features are represented as $h = \mathrm{MLP}(h \mid \Theta_{FF})$. In our task, we found that the LSTM-based encoder performs better than the transformer-based encoder (Vaswani et al., 2017; Chen et al., 2018).

Extending a Standard Tagging Model
In a typical tagging model, the contextual representation of the encoder h is fed into a conditional random field (CRF) layer to predict the BIO-style tags (Collobert et al., 2011; Huang et al., 2015; Ma and Hovy, 2016; Chiu and Nichols, 2016; Lample et al., 2016; Peters et al., 2017; Yang et al., 2017; Changpinyo et al., 2018). Such a model can be extended to predict relations. For example, in the utterance "I feel some pain in my back", we could set up the tagger to predict the association between the symptom (sym/msk) and its property (symprop/loc) using a cross-product space where "my back" is tagged with sym/msk+symprop/loc, so that the relation prediction problem is reformulated as a standard sequence labeling task. Although this would be a viable option for tasks where the tag set is small (e.g., place, organization, etc.), the cross-product space in our Sx task is unfortunately large (e.g., 186 Sx labels × 3 Sx property types, and 186 Sx labels × 3 Sx status types).

Span Extraction Layer
We propose an alternative formulation that tackles the problem in a hierarchical manner. We first identify the span of interest using generic tags with BIO notation, namely, (sym_B, sym_I) for symptoms and (symprop_B, symprop_I) for their properties, as in Figure 1(a). Likewise, we use (med_B, med_I) for medications and (medsprop_B, medsprop_I) for their properties, as shown in Figure 1(b). This corresponds to highlighting, for example, "some pain" and "my back" as spans of interest.
Given the latent representations $h = (h_1, \cdots, h_N)$ and the target tag sequence $y^e = (y_1, \cdots, y_N)$ (e.g., sym_B, sym_I, O, symprop_B, symprop_I), we use the negative log-likelihood $-\log P(y^e \mid h)$ under the CRF as the loss for identifying spans of interest:

$$S(y, h) = \sum_{i=1}^{N-1} A_{y_i, y_{i+1}} + \sum_{i=1}^{N} P(h_i, y_i)$$

$$-\log P(y^e \mid h) = -S(y^e, h) + \log \sum_{y'} \exp S(y', h)$$

where $S(y, h)$ measures the compatibility between a sequence $y$ and $h$. The first component estimates the accumulated cost of transition between two neighboring tags using a learned transition matrix $A$. $P(h_i, y_i)$ is computed via the inner product $h_i^\top y_i$, where $y_i$ belongs to any sequence of tags $y$ that can be decoded from $h$. During training, $\log P(y^e \mid h)$ is estimated using the forward-backward algorithm, and during inference, the most probable sequence is computed using the Viterbi algorithm.
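To make the CRF objective concrete, the following is a minimal pure-Python sketch of the sequence score S(y, h), the log-partition term computed with a forward recursion, the resulting negative log-likelihood, and Viterbi decoding. It is an illustration under stated assumptions, not the implementation used in this work: the three-tag set, the toy score values and all function names are hypothetical, and the per-token scores are passed in directly rather than computed from encoder states.

```python
import math

TAGS = ["O", "sym_B", "sym_I"]  # generic span tags; list indices are tag ids


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def sequence_score(emissions, transitions, tags):
    """S(y, h): per-token scores plus transition scores between neighbors."""
    token_part = sum(emissions[i][t] for i, t in enumerate(tags))
    transition_part = sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return token_part + transition_part


def log_partition(emissions, transitions):
    """log of the sum over all tag sequences of exp S(y, h) (forward pass)."""
    num_tags = len(transitions)
    alpha = list(emissions[0])
    for emit in emissions[1:]:
        alpha = [
            emit[cur] + logsumexp(
                [alpha[prev] + transitions[prev][cur] for prev in range(num_tags)]
            )
            for cur in range(num_tags)
        ]
    return logsumexp(alpha)


def crf_nll(emissions, transitions, tags):
    """Negative log-likelihood of one tag sequence under the CRF."""
    return log_partition(emissions, transitions) - sequence_score(
        emissions, transitions, tags
    )


def viterbi(emissions, transitions):
    """Highest-scoring tag sequence, found by dynamic programming."""
    num_tags = len(transitions)
    best = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        step_best, step_ptr = [], []
        for cur in range(num_tags):
            prev = max(range(num_tags), key=lambda p: best[p] + transitions[p][cur])
            step_ptr.append(prev)
            step_best.append(best[prev] + transitions[prev][cur] + emit[cur])
        backpointers.append(step_ptr)
        best = step_best
    path = [max(range(num_tags), key=lambda t: best[t])]
    for step_ptr in reversed(backpointers):
        path.append(step_ptr[path[-1]])
    return path[::-1]


# Toy scores for a 4-token utterance (made-up numbers, one row per token).
EMISSIONS = [[0.1, 2.0, -1.0], [0.0, -0.5, 1.5], [1.2, 0.1, 0.0], [0.3, 0.2, 0.1]]
TRANSITIONS = [[0.5, 0.2, -2.0], [0.0, -1.0, 1.0], [0.3, -0.5, 0.4]]

best_path = viterbi(EMISSIONS, TRANSITIONS)
print([TAGS[t] for t in best_path], crf_nll(EMISSIONS, TRANSITIONS, best_path))
```

The forward recursion costs O(N T^2) for N tokens and T tags, which is one reason a small generic tag set is attractive for this step.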

Attribute Tagging Layer
Using the latent representation of the highlighted span, we can predict one or more attributes of the span. In Figure 1(a), we can predict two attributes associated with "some pain": sym/msk as the symptom label and symStatus/experienced as its status. Similarly, in Figure 1(b), the property span "three months" has the predicted property type medsprop/duration. Therefore, by forming semantic abstractions for each highlighted text span, we decompose a single complex tagging task in a large label space into correlated but simpler sub-tasks, which are likely to generalize better when the training data is limited.
Given the spans, either from the inferred or the ground truth sequence $y^*$, a richer representation of the contexts can be used to predict attributes than otherwise possible. A contextual representation is computed from the starting index $i$ and ending index $j$ of each span:

$$h^s_{ij} = \mathrm{Aggregate}(h_k \mid i \le k \le j)$$

where Aggregate(·) is the pooling function, implemented as the mean, sum or attention-weighted sum of the latent states of the input encoder. The kth attribute associated with the span is modeled using $P(y^k_{attr} \mid h^s_{ij})$. For example, while predicting symptom labels $s_x$ and their associated status $s_t$, the target attributes are $y^0_{attr} := y_{sx}$ and $y^1_{attr} := y_{st}$. For predicting medication entities $r_x$ and their properties $p_r$, each span has only one attribute. Since each attribute comes from a pre-defined ontology, the multinomial distribution $P(y^k_{attr} \mid h^s_{ij})$ can be modeled as $\mathrm{Softmax}(h^s_{ij} \mid \Theta_k)$ for each attribute.
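As an illustration of the attribute tagging step, the sketch below mean-pools the encoder states of a span and applies a softmax classifier to the pooled vector. It is a minimal example under stated assumptions: the two-dimensional states, the linear parameterization of the softmax, the inclusive span indices, the placeholder label "sym/other" and all function names are hypothetical choices made for the example.

```python
import math


def aggregate_mean(states):
    """Mean-pool a list of equal-length vectors into a single vector."""
    dim = len(states[0])
    return [sum(vec[d] for vec in states) / len(states) for d in range(dim)]


def softmax(logits):
    """Normalize a list of scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attribute_distribution(encoder_states, start, end, weights, bias):
    """Pool tokens start..end (inclusive) and classify the pooled vector."""
    span_repr = aggregate_mean(encoder_states[start:end + 1])
    logits = [
        sum(w * x for w, x in zip(row, span_repr)) + b
        for row, b in zip(weights, bias)
    ]
    return span_repr, softmax(logits)


# Toy encoder states for the tokens "I feel some pain" (made-up numbers).
H = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [4.0, 0.0]]
LABELS = ["sym/msk", "sym/other"]  # "sym/other" is a placeholder label
W_SX = [[1.0, 0.0], [0.0, 1.0]]   # one weight row per label (hypothetical)
B_SX = [0.0, 0.0]

# The span "some pain" covers token indices 2 and 3.
span_repr, probs = attribute_distribution(H, 2, 3, W_SX, B_SX)
predicted = LABELS[probs.index(max(probs))]
print(span_repr, predicted)
```

A second classifier with its own parameters could be applied to the same pooled vector to predict the status attribute, which is how one span can carry several attributes.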

Memory Buffer Layer
One of the critical components of our model is the memory buffer. Most previous models for joint inference of entities and relations consider all spans of entities and properties. This has a computational complexity of $O(n^4)$ in the length of the input $n$, which makes it infeasible for applications such as ours, where the input could often be 1k words or more. We circumvent this problem using a memory buffer to cache all inferred candidate spans and test their relationship with inferred property spans. Note that, unlike methods that cascade two such stages, our model is trained end-to-end jointly with multi-task learning.
The memory buffer saves different entries for symptom and medication tasks, as illustrated in Figure 1. At each occurrence of a symptom (medication) entity span, we push $m_k = \mathrm{Aggregate}(\{h^s_{ij}, e_s\})$ into the k-th position of the memory buffer. For the symptom task, $e_s$ is the learned word embedding of one of the labels in the closed label set. In the medication case, $e_s$ is the Aggregate of the learned word embeddings of the verbatim sub-sequence corresponding to the medication entity.

Relation Inference Layer
Each inferred property span in the conversation is compared against each entry in the buffer. A property span is represented using $e_p$, the Aggregate of the word embeddings corresponding to the span. The multinomial likelihood over buffer entries is computed using a bilinear weight matrix $W$. The most likely entry $k$ is picked from the memory buffer $M = (m_1, \ldots, m_K)$ by maximizing the likelihood.
Remarks The computational cost of inferring relations between a property span and all the entities in the input is proportional to the memory buffer size. On our corpus, the mean and standard deviation of the buffer size per conversation were 22 and 15, respectively, for the Sx task, and 32 and 23, respectively, for the Rx task. Hence, the set of candidate entities considered is substantially smaller than the set of all potential entities, $O(n^2)$, in the input sequence.
The small size of the memory buffer also has an impact on the rate of learning. In each training step, rather than updating all embeddings, we only update a smaller number of embeddings, those associated with the entries in the memory buffer. This makes the learning fast and efficient.
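The buffer lookup described above can be sketched in a few lines. This is a minimal illustration, assuming that each buffer entry and each property span is already summarized as a fixed-length vector and that the bilinear scores are normalized with a softmax over the buffer; the class name `MemoryBuffer`, the two-dimensional vectors and the identity weight matrix are hypothetical.

```python
import math


def softmax(scores):
    """Normalize a list of scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bilinear_score(entry, weight, prop):
    """Compute entry^T W prop for plain Python lists."""
    return sum(
        entry[i] * weight[i][j] * prop[j]
        for i in range(len(entry))
        for j in range(len(prop))
    )


class MemoryBuffer:
    """Caches one vector per inferred entity span, in order of occurrence."""

    def __init__(self):
        self.entries = []  # list of (text, vector) pairs

    def push(self, text, vector):
        """Store an entity span and return its slot index."""
        self.entries.append((text, vector))
        return len(self.entries) - 1

    def most_related(self, prop_vector, weight):
        """Return the slot with the highest likelihood and all likelihoods."""
        scores = [bilinear_score(vec, weight, prop_vector) for _, vec in self.entries]
        probs = softmax(scores)
        return probs.index(max(probs)), probs


# Toy vectors (made-up numbers); W is set to the identity for readability.
W = [[1.0, 0.0], [0.0, 1.0]]
buffer = MemoryBuffer()
buffer.push("some pain", [1.0, 0.0])
buffer.push("blood thinner", [0.0, 1.0])

slot_back, _ = buffer.most_related([0.9, 0.1], W)    # stands in for "my back"
slot_months, _ = buffer.most_related([0.2, 0.8], W)  # stands in for "three months"
print(buffer.entries[slot_back][0], buffer.entries[slot_months][0])
```

Each property span costs one pass over the buffer, so the work per property grows with the number of cached entities rather than with the number of possible spans in the conversation.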

An End-to-end Learning Paradigm
We train the model end-to-end by minimizing, for each conversation, a loss function defined over $y^e$, $\{y^k_{attr}\}$ and $\{y^j_{pos}\}$, where $y^e$ is the target sequence (sym_B, sym_I, prop_B, prop_I), $\{y^k_{attr}\}$ is the set of attribute labels for each highlighted span, $\{y^j_{pos}\}$ is the list of buffer slot indices and $\alpha$ is a relative weight.
During training, we are simultaneously attempting to detect the location of tags as well as classify the tags. Initially our model for locating the tags is unlikely to be reliable, and so we adopt a curriculum learning paradigm. Specifically, we provide the classification stage the reference location of the tag from the training data with probability p, and the inferred location of the tag with probability 1 − p. We start the joint multi-task training by setting this probability to 1 and decrease it as training progresses (Bengio et al., 2015).
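The curriculum described above can be sketched as a probability schedule plus a per-step coin flip. This is a minimal illustration under stated assumptions: the text only says that the probability starts at 1 and decreases, so the linear decay, the `floor` parameter and the function names are hypothetical.

```python
import random


def reference_probability(step, total_steps, floor=0.0):
    """Decay p linearly from 1.0 at step 0 to `floor` at total_steps."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - (1.0 - floor) * fraction


def spans_for_classifier(reference_spans, inferred_spans, p, rng):
    """Use reference spans with probability p, inferred spans otherwise."""
    return reference_spans if rng.random() < p else inferred_spans


rng = random.Random(0)
reference = [(2, 3)]  # span from the annotations (token indices)
inferred = [(3, 3)]   # span from the current tagger (hypothetical)

for step in (0, 500, 1000):
    p = reference_probability(step, total_steps=1000)
    chosen = spans_for_classifier(reference, inferred, p, rng)
    print(step, p, chosen)
```

Early in training the classifier therefore sees clean spans, and later it is increasingly exposed to the tagger's own errors, which is the condition it faces at inference time.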
Since our model consists of span extraction and attribute tagging layers followed by relation extraction, we refer to our model as the Relational Span-Attribute Tagging Model (R-SAT). One advantage of our model is that the computational complexity of joint inference is $O(n)$, which is linear in the length of the conversation $n$. This is substantially cheaper than previous work on joint relation prediction models, where the computational complexity is $O(n^4)$ (Lee et al., 2017).
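A rough count makes the gap between the two regimes tangible. The sketch below uses the average transcript length of 1,459 words and the mean Sx buffer size of 22 reported elsewhere in this paper; treating every contiguous token range as a candidate span, and bounding the number of tagged property spans by the number of tokens, are simplifying assumptions made only for this illustration.

```python
def num_spans(n):
    """Number of contiguous spans in a sequence of n tokens."""
    return n * (n + 1) // 2


def num_span_pairs(n):
    """Pairs of (entity span, property span) when all spans are candidates."""
    return num_spans(n) ** 2


AVG_WORDS = 1459       # average transcript length in words
MEAN_BUFFER_SX = 22    # mean buffer size per conversation, Sx task

all_spans = num_spans(AVG_WORDS)
all_pairs = num_span_pairs(AVG_WORDS)
# Upper bound for the buffer approach: at most one property span per token,
# each compared against every buffer entry.
buffer_comparisons = AVG_WORDS * MEAN_BUFFER_SX

print(all_spans, all_pairs, buffer_comparisons)
```

The all-pairs count grows as roughly n^4 / 4, while the buffer bound grows linearly in n for a fixed buffer size.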

Knowledge Graph Features
Medical domain knowledge could be helpful in increasing the likelihood of symptoms when related medications are mentioned in a conversation, and vice versa. One such source is a knowledge graph (KG), whose embeddings represent a low-dimensional projection that captures structural and semantic information of its nodes. Previous work has demonstrated that KG embeddings can improve relation extraction in the written domain (Han et al., 2018). We utilize an internally developed KG that contains about 14k medical nodes of 87 different types (e.g., medications, symptoms, treatments, etc.). The nodes are represented by 256-dimensional embedding vectors, which were trained to minimize the word2vec loss function on web documents (Mikolov et al., 2013). A given node may belong to multiple types, and this is encoded as a sum of one-hot vectors. The input word sequences were mapped to KG nodes using an internal tool (Brown, 2013). For words that do not map to KG nodes, we use a learnable UNK vector of the same dimension as the KG embedding. In addition, we also represented linguistic information using part-of-speech (POS) tags as one-hot vectors. The POS tags were inferred from the input sequence using an internal tool with 47 distinct tags (Andor et al., 2016). In our experiments, we find it most effective to concatenate word embeddings with KG entities, and the encoder output with the embedding of POS tags and the KG entity types.
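The two concatenation points can be illustrated with plain vectors. In this sketch the KG embedding size (256), the number of KG types (87) and the number of POS tags (47) come from the text, while the word embedding size, the encoder output size, the use of raw one-hot and multi-hot vectors in place of learned embeddings, and all names are assumptions made for the example.

```python
WORD_DIM = 128      # assumed word embedding size
ENCODER_DIM = 256   # assumed encoder output size
KG_DIM = 256        # KG node embedding size
NUM_KG_TYPES = 87   # number of KG node types
NUM_POS_TAGS = 47   # number of distinct POS tags


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def multi_hot(indices, size):
    """Sum of one-hot vectors, for nodes that belong to several types."""
    vec = [0.0] * size
    for index in indices:
        vec[index] += 1.0
    return vec


def encoder_input(word_vec, kg_vec):
    """Concatenate the word embedding with the KG node embedding."""
    return word_vec + kg_vec


def tagger_input(encoder_out, pos_index, kg_type_indices):
    """Concatenate the encoder output with POS and KG type features."""
    return (
        encoder_out
        + one_hot(pos_index, NUM_POS_TAGS)
        + multi_hot(kg_type_indices, NUM_KG_TYPES)
    )


# Placeholder vectors for one token that maps to a KG node with two types.
word_vec = [0.0] * WORD_DIM
kg_vec = [0.0] * KG_DIM  # a learnable UNK vector would go here for unmapped words
encoder_out = [0.0] * ENCODER_DIM

x_in = encoder_input(word_vec, kg_vec)
x_tag = tagger_input(encoder_out, pos_index=5, kg_type_indices=[3, 10])
print(len(x_in), len(x_tag))
```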

Experiments
We describe our corpus, evaluation metrics, the experimental setup, the evaluations of the proposed model and comparison with different baselines on both the symptom and medication tasks.

Corpus
Given the privacy-sensitive nature of clinical conversations, there aren't any publicly available corpora for this domain. Therefore, we utilize a private corpus consisting of 92K de-identified and manually transcribed audio recordings of clinical conversations, typically about 10 minutes long [IQR: 5-12 minutes] with mostly 2 participants (72.7%). Other participants when present included, for example, nurses and caregivers. The corresponding manual transcripts contained sequences that were on average 208 utterances or 1,459 words in length. We note that due to the casual conversational style of speech, an entity mentioned at the beginning can be related to a property mentioned at the end of the conversation. This makes the problem of modeling relations much harder than previous work on extracting relations.
A subset of about 2,950 clinical conversations, related to primary care, were annotated by professional medical scribes. The ontology for labeling medications consisted of the type of medications (e.g., medications, supplements) and their properties (e.g., dosage, frequency, duration), and that for symptoms consisted of 186 symptom names and their properties. This resulted in 77K and 99K tags for the medication and symptom tasks, respectively. In all, there were 23k relations between medications and their properties, and 16k relations between symptoms and their properties. The conversations were divided into training (1,950), development (500) and test (500) sets.
In the case of medications, about 70% of the labels were about medications and the rest about their properties, of which 51% were dosage or quantity, and 40% were frequency. In the case of symptoms, 41% of the labels were about symptom names, another 41% about status, and the rest about properties, of which 39% were about frequency and 37% about body locations.

Pretraining
Since our labeled data is small, only about 3k, the input encoder of the model was pre-trained over the entire 92k conversations. For pre-training, given a short snippet of conversation, the model was tasked with predicting the next turn, similar to skip-thought (Kiros et al., 2015). Our models were trained using the Adam optimizer (Kingma and Ba, 2015) and the hyperparameters are described in the supplementary material.

Evaluation Metrics
As described in Section 2, our tasks consist of extracting tuples: (symType, propType, propContent) for the symptoms task and (medContent, propType, propContent) for medications. The precision, recall and F1-scores are computed jointly over all three elements, and the content is treated as a list of tokens for evaluation purposes. To allow for partial content matches, we generalize the calculation of precision and recall over $S_{\hat{Y}}$, the set of predictions, and $S_Y$, the set of ground truths, using a per-element match function $I$. We note that, as symType and propType are simply target classes, $I$ reduces to a simple indicator function for them. Under the scenario that the content includes single elements, the entire calculation simplifies to the exact matching-based calculation of precision and recall over the set of predictions and ground truths. For the symptom task, we additionally evaluate the performance of predicting symType and symStatus by performing the exact matching-based calculation.
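One plausible way to give partial credit for content matches is sketched below. It is not the definition used in this paper, whose exact formula is not reproduced here: the token-overlap credit, the rule of taking the best-matching tuple with the same entity and property type, the treatment of the entity as a single atomic label, and all function names are assumptions. The sketch does share the property stated above: with single-token contents it reduces to exact matching.

```python
from collections import Counter


def token_overlap(tokens_a, tokens_b):
    """Size of the multiset intersection of two token lists."""
    return sum((Counter(tokens_a) & Counter(tokens_b)).values())


def credit(item, others):
    """Best fraction of the item's content tokens matched by a compatible tuple."""
    entity, prop_type, tokens = item
    if not tokens:
        return 0.0
    best = 0
    for other_entity, other_type, other_tokens in others:
        if (other_entity, other_type) == (entity, prop_type):
            best = max(best, token_overlap(tokens, other_tokens))
    return best / len(tokens)


def partial_precision_recall(predictions, references):
    """Precision and recall with fractional credit for partial content matches."""
    precision = (
        sum(credit(p, references) for p in predictions) / len(predictions)
        if predictions else 0.0
    )
    recall = (
        sum(credit(r, predictions) for r in references) / len(references)
        if references else 0.0
    )
    return precision, recall


# Hypothetical tuples: the prediction recovers only part of the content.
references = [("sym/msk", "symprop/frequency", ["every", "morning"])]
predictions = [("sym/msk", "symprop/frequency", ["morning"])]

precision, recall = partial_precision_recall(predictions, references)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```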

Baselines
Symptom Task As a baseline for this task, we train an extension of the standard tagging model, described in Section 4.1. The label space for extracting the relations between symptoms and their properties is 186 symptoms × 3 properties, and for extracting symptoms and their status is 186 symptoms × 3 status. Using the BIO-scheme, that adds up to 2,233 labels in total. The baseline consists of a bidirectional LSTM-encoder followed by two feed-forward layers [512, 256] and then a 2,233 dimension softmax layer. The label space is too large to include a CRF layer. The encoder was pre-trained in the same way as described in Section 6.2, the hyperparameters were selected according to Table 2, and the model parameters were trained using cross-entropy loss.
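The size of the baseline label space follows from simple counting, which the short sketch below reproduces. The only assumption is the usual BIO convention of one B and one I tag per label plus a single shared O tag; the variable names are illustrative.

```python
NUM_SYMPTOMS = 186
NUM_PROPERTY_TYPES = 3
NUM_STATUS_TYPES = 3

# Cross-product labels for (symptom, property) and (symptom, status).
relation_labels = NUM_SYMPTOMS * NUM_PROPERTY_TYPES
status_labels = NUM_SYMPTOMS * NUM_STATUS_TYPES
cross_product_labels = relation_labels + status_labels

# BIO scheme: a B and an I tag for every label, plus one O tag.
bio_labels = 2 * cross_product_labels + 1

print(cross_product_labels, bio_labels)
```

By comparison, the hierarchical formulation needs only the generic span tags in the tagging step and defers the 186-way decision to the attribute classifier.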
Medication Task For this task, we adopt a different baseline since the generic medication entity type (e.g., drug name, supplement name) does not provide any useful information, unlike the 186 symptom entity labels (e.g., sym/msk/pain). Instead, we adopt the neural co-reference resolution approach, which is better suited to this task (Lee et al., 2017). The encoder is the same as in the baseline for the symptom task and is pre-trained in the same manner. Since the BIO labels contain only 9 elements in this case, the encoder output is fed into a CRF layer. Each candidate relation is represented by concatenating the latent states of the head tokens of the medication entity and the property. This representation is augmented with an embedding of the token distance and fed to a softmax layer whose binary output encodes whether they are related or not. Note that our R-SAT model does not take advantage of this distance embedding. Table 2 shows the parameters that were selected after evaluating over a range on a development set.

Parameter Tuning
In all experiments, the Aggregate(·) function is implemented as the mean function for its simplicity.

Results & Ablation Analysis
The performance of the proposed R-SAT model was compared with the baseline models, and the results are reported in Table 3.

Symptom Task
The model was trained using multi-task learning for both tasks: (symType, propType, propContent) as well as (symType, symStatus). The performance was evaluated using all the elements of the tuple as described in Section 6.3. The baseline performs better on (symType, symStatus) compared to (symType, propType, propContent), possibly because there are more instances of the former in the training data than the latter. The R-SAT model performs significantly better than the baselines on both tasks.
To understand the contribution of different components of the model, we performed a series of ablation analyses by removing them one at a time. In extracting relations in Sx + Property, the KG embeddings along with POS tags contribute a relative gain of 13%, while the memory buffer brings a relative gain of 8%. Neither of them impacts Sx + Status, and that is expected for the memory buffer since the status is tagged on the same span as the contents of the memory buffer. Multi-task learning brings a relative improvement of 4% on Sx + Property, and this may be because there are fewer instances of this relation in the training data, and jointly learning with Sx + Status helps to learn better representations. Note that we have not checked other orderings for removing model components (e.g., removing multi-tasking earlier or KG later).

Medication Task
In the Rx case, we only have one task (Rx + Property), that is, predicting the relations between medications and their properties, e.g., ([ibuprofen], prop/dosage, [10 mg]). The baseline gives reasonable performance. Ablation analysis reveals that KG and POS features contribute about 4.6% relative improvement, while the contextual span in the memory buffer adds a substantial 43% relative improvement. Since the medications are from an open set, we cannot run experiments without the buffer. Compared to the symptom task, the model performs better on the medication task, and this may be due to lower variability in dosage.

Relation Only Prediction
For teasing apart the strength and weakness of the model, we evaluated its performance when the entities and their properties were given, and the model was only required to decide whether a relation exists or not.
As a baseline, we compare our model with a recently proposed model for document-level joint entity and relation extraction, BRAN, which achieved state-of-the-art performance for chemical-disease relations (Verga et al., 2018). When this model was originally used to test relations between all pairs of entities and properties in the entire conversation, it performed relatively poorly. Using the implementation released by the authors, the performance of BRAN was then optimized by restricting the distance between the pairs and by fine-tuning the threshold. The best results are reported in Table 4. Our proposed R-SAT model, without any such constraints, performs better than BRAN on both tasks by an absolute F1-score gain of about 0.20. Interestingly, the performance of our model on Sx + Property jumps from 0.34 in the joint prediction task to 0.82 in the relation only prediction task. This reveals that the primary weakness of the Sx model is in tagging the entities and the properties accurately. In contrast, the F1-score for Rx + Property is impacted less, and only moves up from 0.45 to 0.6.
The task of inferring whether a relation is present between a medication and its properties is more challenging than in the case of symptoms task. This is not entirely surprising since there is a higher correlation between symptom type and location (e.g., respiratory symptom being associated with nose) and relatively low correlation between dosage and medications (e.g., 400mg could be the dosage for several different medications).

Analysis
To understand the inherent difficulty of extracting symptoms and medications and their properties from clinical conversations, we estimated human performance. A set of 500 conversations was annotated by 3 different scribes. We created a "voted" reference and compared the annotations from each of the 3 scribes against it.
The F1-scores of the scribes were surprisingly low, with 0.51 for Sx + Property and 0.78 for Sx + Status. The model likewise finds extracting relations in Sx + Property more difficult than the Sx + Status task. In summary, the model performance reaches 67% of human performance for Sx + Property and 73% for Sx + Status. The F1-score of the scribes for Rx + Property is similar to that of Sx + Property. In this case, the model achieves about 85% of human performance. The human errors or inconsistencies in Sx and Rx annotations appear to be largely due to missed labels and not due to inconsistent spans for the same tags, or inconsistent tags for the same span.
While the majority of the relations in the reference annotations occurred within the same sentence, approximately 11.1% of relations occurred across 3 or more sentences. This typically occurred when the symptoms or medications were discussed over multiple dialog turns, as illustrated in Table 1. Among the relations correctly identified by the model, 10.6% were also across 3 or more sentences, which is very similar to the prior in the reference and suggests no bias. We notice that in certain cases, the model is able to link a property to an entity that is far away (100+ sentences) when a nearby mention of the same entity was missed by the model. Models that only examine relations in nearby sentences (2-3 sentences) would have missed the relation in such a scenario.
The majority of the errors result from our model missing the property span. Specifically, we see that 35% and 81% of the errors are due to the model not detecting medication and symptom properties, respectively. Examples include "when i really have to" and "every three three months", which are rare mentions in informal language.
Our reference links each property to only one entity. In certain cases, we notice that the model links the entity to an alternative mention or entity that is equally valid (Advil vs pain killer). So, our performance measure underestimates the actual model performance.

Conclusions
We propose a novel model to jointly infer entities and relations. The key components of the model are: a mechanism to highlight the spans of interest, classify them into entities, store the entities of interest in a memory buffer along with the latent representation of the context, and then infer relations between candidate property spans and the entities in the buffer. The components of the model are not tied to any domain. We have demonstrated applications in two different tasks. In the case of symptoms, the entities are categorized into 186 classes, while in the case of medications, the entities are an open set. The model is tailored for tasks where the training data is limited and the label space is large but can be partitioned into subsets. The two-stage processing, where the candidates are stored in a memory buffer, allows us to perform the joint inference at a computational cost of $O(n)$ in the length of the input $n$, compared to methods that explore all spans of entities and properties at a computational cost of $O(n^4)$. The model is trained end-to-end. We evaluate the performance on three related tasks, namely, extracting symptoms and their status, relations between symptoms and their properties, and relations between medications and their properties. Our model outperforms the baselines substantially, by about 32-50%. Through ablation analysis, we observe that the memory buffer and the KG features contribute significantly to this performance gain. By comparing human scribes against a "voted" reference, we see that the task is inherently difficult, and the models achieve about 67-85% of human performance.