Weakly Supervised Medication Regimen Extraction from Medical Conversations

Automated Medication Regimen (MR) extraction from medical conversations can not only improve recall and help patients follow through with their care plan, but also reduce the documentation burden for doctors. In this paper, we focus on extracting spans for frequency, route and change, corresponding to medications discussed in the conversation. We first describe a unique dataset of annotated doctor-patient conversations and then present a weakly supervised model architecture that can perform span extraction using noisy classification data. The model utilizes an attention bottleneck inside a classification model to perform the extraction. We experiment with several variants of attention scoring and projection functions and propose a novel transformer-based attention scoring function (TAScore). The proposed combination of TAScore and Fusedmax projection achieves a 10 point increase in Longest Common Substring F1 compared to the baseline of additive scoring plus softmax projection.


Introduction
Patients forget 40-80% of the medical information provided by healthcare practitioners immediately (Mcguire, 1996) and misconstrue 48% of what they think they remembered (Anderson et al., 1979), and this adversely affects patient adherence. Automatically extracting information from doctor-patient conversations can help patients correctly recall the doctor's instructions and improve compliance with the care plan (Tsulukidze et al., 2014).

[Figure 1: An example excerpt from a doctor-patient conversation transcript. Three medications are mentioned, indicated by superscripts; the extracted attributes (change, route and frequency) for each medication are also shown.
DR: Limiting your alcohol consumption is important, so, and, um, so, you know, I would recommend vitamin D¹ to be taken¹. Have you had Fosamax² before?
PT: I think my mum did.
DR: Okay, Fosamax², you take² one pill² on Monday and one on Thursday².
DR: Do you use much caffeine?
PT: No, none.
DR: Okay, this is³ Actonel³ and it's one tablet³ once a month³.
DR: Do you get a one month or a three months supply in your prescriptions?]

On the other hand, clinicians spend up to 49.2% of their overall
time on EHR and desk work, and only 27.0% of their total time on direct clinical face time with patients (Sinsky et al., 2016). Increased data management work is also correlated with increased doctor burnout (Kumar, 2016). Information extracted from medical conversations can also aid doctors in their documentation work (Rajkomar et al., 2019; Schloss and Konam, 2020), allow them to spend more face time with the patients, and build better relationships.
In this work, we focus on extracting Medication Regimen (MR) information (Du et al., 2019; Selvaraj and Konam, 2019) from doctor-patient conversations. Specifically, we extract three attributes, i.e., frequency, route and change, corresponding to medications discussed in the conversation (Figure 1). Medication Regimen information can help doctors with medication orders and renewals, medication reconciliation, verification of reconciliations for errors, and other medication-centered EHR documentation tasks. It can also improve patient engagement, transparency and compliance with the care plan (Tsulukidze et al., 2014; Grande et al., 2017).
MR attribute information present in a conversation can be obtained as spans in text (Figure 1) or can be categorized into classification labels (Table 2). While the classification labels are easy to obtain at scale in an automated manner - for instance, by pairing conversations with billing codes or medication orders - they can be noisy and can result in a prohibitively large number of classes. Classification labels go through normalization and disambiguation, often resulting in label names which are very different from the phrases used in the conversation. This process leads to a loss of granular information present in the text (see, for example, row 2 in Table 2). Span extraction, on the other hand, alleviates this issue as the outputs are actual spans in the conversation. However, span extraction annotations are relatively hard to come by and are time-consuming to annotate manually. Hence, in this work, we look at the task of MR attribute span extraction from doctor-patient conversations using weak supervision provided by the noisy classification labels.
The main contributions of this work are as follows. We present a way of setting up an MR attribute extraction task from noisy classification data (Section 2). We propose a weakly supervised model architecture which utilizes an attention bottleneck inside a classification model to perform span extraction (Sections 3 and 4). In order to favor sparse and contiguous extractions, we experiment with two variants of attention projection functions (Section 3.1.2), namely, softmax and Fusedmax (Niculae and Blondel, 2017). Further, we propose a novel transformer-based attention scoring function, TAScore (Section 3.1.1). The combination of TAScore and Fusedmax achieves significant improvements in extraction performance over phrase-based (22 LCSF1 points) and additive-softmax attention (10 LCSF1 points) baselines.

Medication Regimen (MR) using Weak Supervision
Medication Regimen (MR) consists of information about a prescribed medication, akin to attributes of an entity. In this work, we specifically focus on the frequency, the route of the medication, and any change in the medication's dosage or frequency, as shown in Figure 1. For example, given the conversation excerpt and the medication "Fosamax" as shown in Figure 1, the model needs to extract the spans "one pill on Monday and one on Thursday", "pill" and "you take" for the attributes frequency, route and change, respectively. The major challenge, however, is to perform the attribute span extraction using noisy classification labels with very few or no span-level labels. The rest of this section describes the dataset used for this task.

Data
The data used in this paper comes from a collection of human transcriptions of 63,000 fully-consented and de-identified doctor-patient conversations. A total of 57,000 conversations were randomly selected to construct the training (and dev) conversation pool and the remaining 6,000 conversations were reserved as the test pool.
The classification dataset: All the conversations are annotated with MR tags by expert human annotators. Each set of MR tags consists of the medication name and its corresponding attributes frequency, route and change, which are normalized free-form instructions in natural language phrases corresponding to each of the three attributes (see Table 1).

The extraction dataset: Since the goal is to extract spans related to MR attributes, we would ideally need a dataset with span annotations to perform this task in a fully supervised manner. However, span annotation is laborious and expensive. Hence, we re-purpose the classification dataset (along with its classification labels) to perform the task of span extraction using weak supervision. We also manually annotate a small fraction of the train, validation and test sets (150, 150 and 500 data-points, respectively) for attribute spans to see the effect of supplying a small number of strongly supervised instances on the performance of the model. In order to have a good representation of all the classes in the test set, we increase the sampling weight of data-points which have rare classes. Hence, our test set is relatively more difficult than a random sample of 500 data-points. All the results are reported on our test set of 500 difficult data-points annotated for attribute spans.
For annotating attribute spans, the annotators were given instructions to mark spans which provide minimally sufficient and natural evidence for the already annotated attribute class, as described below. Sufficiency: Given only the annotated span for a particular attribute, one should be able to predict the correct classification label.² This aims to encourage spans that are sufficient evidence on their own.

² The detailed explanation for each of the classes can be found in

Challenges
Using medical conversations for information extraction is more challenging than using written doctor notes because the spontaneity of conversation gives rise to a variety of speech patterns with disfluencies and interruptions. Moreover, the vocabulary can range from colloquial to medical jargon.
In addition, we also have noise in our classification dataset, its main source being the annotators' use of information outside the grounded text window to produce the free-form tags. This happens in two ways. First, when the free-form MR instructions are written using evidence that was discussed elsewhere in the conversation but is not present in the grounded text window. Second, when the annotator uses their domain knowledge instead of just the information in the grounded text window - for instance, when the route of a medication is not explicitly mentioned, the annotator might use the medication's common route in their free-form instructions. Through manual analysis of the 800 data-points across the train, dev and test sets, we find that 22% of frequency, 36% of route and 15% of change classification labels have this noise.
In this work, our approach to extraction depends on the size of the auxiliary (classification) task's dataset to overcome the above-mentioned challenges.

Background
There have been several successful attempts to use neural attention (Bahdanau et al., 2015) to extract information from text in an unsupervised manner (He et al., 2017; Lin et al., 2016; Yu et al., 2019). Attention scores provide a good proxy for the importance of a particular token in a model. However, when there are multiple layers of attention, or if the encoder is too complex and trainable, the model no longer provides a way to produce reliable and faithful importance scores (Jain and Wallace, 2019).
We argue that, in order to bring in this faithfulness, we need to create an attention bottleneck in our classification + extraction model. The attention bottleneck is achieved by employing an attention function which generates a set of attention weights over the encoded input tokens. The attention bottleneck forces the classifier to only see the portions of input that pass through it, thereby enabling us to trade classification performance for extraction performance and to obtain span extraction with weak supervision from classification labels.
In the rest of this section, we provide general background on neural attention and present its variants employed in this work.This is followed by the presentation of our complete model architecture in the subsequent sections.

Neural Attention
Given a query q ∈ R^m and keys K ∈ R^{l×n}, the attention function α : R^m × R^{l×n} → Δ^l is composed of two functions: a scoring function S : R^m × R^{l×n} → R^l which produces unnormalized importance scores, and a projection function Π : R^l → Δ^l which normalizes these scores by projecting them onto the (l−1)-dimensional probability simplex.⁴

Scoring Function
The purpose of the scoring function is to produce importance scores for each entry in the key K w.r.t. the query q for the task at hand, which in our case is classification. We experiment with two scoring functions: additive and transformer-based.

Additive: This is the same as the scoring function used in Bahdanau et al. (2015), where the scores are produced as follows:

s_i = v^T tanh(W_q q + W_k k_i),  i = 1, …, l

where v ∈ R^m, W_q ∈ R^{m×m} and W_k ∈ R^{m×n} are trainable weights, and k_i is the i-th row of K.

⁴ Throughout this work, l represents the sequence length dimension.
Transformer-based Attention Score (TAScore): While the additive scoring function is simple and easy to train, it suffers from one major drawback in our setting: since we freeze the weights of our embedder and do not use multiple layers of trainable attention (Section 4.4), additive attention can struggle to resolve references, i.e., to find the correct attribute span when there are multiple entities of interest, especially when there are multiple distinct medications (Section 6.4). For this reason, we propose a novel multi-layer transformer-based attention scoring function (TAScore) which can perform this reference resolution while also preserving the attention bottleneck. Figure 2 shows the architecture of TAScore. The query and key vectors are projected to the same space using two separate linear layers, while sinusoidal positional embeddings are added to the key vectors. A special trainable separator vector is added between the query and key vectors, and the entire sequence is passed through a multi-layer transformer (Vaswani et al., 2017). Finally, scalar scores (one corresponding to each vector in the key) are produced from the outputs of the transformer by passing them through a feed-forward layer with dropout.
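The data flow of TAScore can be sketched as follows. This is an illustrative stand-in, not the trained model: it uses random weights in place of learned ones, a single self-attention layer in place of the multi-layer transformer, and assumes a 32-dimensional shared projection space (as in our hyperparameter settings); the function names are ours.

```python
import numpy as np

def sinusoidal_pos_emb(length, dim):
    """Standard sinusoidal positional embeddings (Vaswani et al., 2017)."""
    pos = np.arange(length)[:, None]                # (length, 1)
    i = np.arange(dim // 2)[None, :]                # (1, dim/2)
    angles = pos / (10000 ** (2 * i / dim))         # (length, dim/2)
    emb = np.zeros((length, dim))
    emb[:, 0::2] = np.sin(angles)
    emb[:, 1::2] = np.cos(angles)
    return emb

def self_attention(x):
    """Single-head self-attention with identity Q/K/V projections,
    a minimal stand-in for the multi-layer transformer in TAScore."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def tascore(query, keys, rng):
    """Sketch of TAScore: project query and keys into a shared space, add
    positional embeddings to the keys, insert a separator vector, run the
    sequence through (self-)attention, and map each key position to a
    scalar score. All weights here are random placeholders."""
    m, (l, n) = query.shape[0], keys.shape
    d = 32                                     # shared projection size
    W_q = rng.standard_normal((m, d)) * 0.1    # query linear layer
    W_k = rng.standard_normal((n, d)) * 0.1    # key linear layer
    sep = rng.standard_normal(d) * 0.1         # trainable separator vector
    q_proj = query @ W_q                               # (d,)
    k_proj = keys @ W_k + sinusoidal_pos_emb(l, d)     # (l, d)
    seq = np.vstack([q_proj, sep, k_proj])             # (l + 2, d)
    out = self_attention(seq)                          # (l + 2, d)
    W_ff = rng.standard_normal((d, 1)) * 0.1           # feed-forward scorer
    return (out[2:] @ W_ff).ravel()                    # one score per key token
```

The output has one unnormalized score per key token; the projection function of Section 3.1.2 then turns these scores into an attention distribution.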

Projection Function
A projection function Π : R^l → Δ^l, in the context of attention distributions, normalizes the real-valued importance scores by projecting them onto the (l−1)-dimensional probability simplex Δ^l. Niculae and Blondel (2017) provide a unified view of the projection function as follows:

Π_Ω(s) = argmax_{a ∈ Δ^l} ( a^T s − γ Ω(a) )

Here, a ∈ Δ^l, γ is a hyperparameter and Ω is a regularization penalty which allows us to introduce problem-specific inductive bias into our attention distribution. When Ω is strongly convex, we have a closed-form solution to the projection operation as well as its gradient (Niculae and Blondel, 2017; Blondel et al., 2020). Since we use the attention distribution to perform extraction, we experiment with the following instances of projection functions in this work.

Softmax: Ω(a) = Σ_{i=1}^{l} a_i log a_i. Using the negative entropy as the regularizer results in the usual softmax projection operator.

Fusedmax: Ω(a) = (1/2)||a||² + λ Σ_{i=1}^{l−1} |a_{i+1} − a_i|. Using squared loss with a fused-lasso penalty (Niculae and Blondel, 2017) results in a projection operator which produces sparse as well as contiguous attention weights.⁵ The fusedmax projection operator can be written as Π_Ω(s) = P_{Δ^l}(P_TV(s)), where P_TV is the proximal operator for the 1d Total Variation denoising problem, and P_{Δ^l} is the euclidean projection operator onto the simplex. Both these operators can be computed non-iteratively as described in Condat (2013) and Duchi et al. (2008), respectively. The gradient of the Fusedmax operator can be efficiently computed as described in Niculae and Blondel (2017).⁶

Fusedmax*: We observe that while softmax learns to focus on the right region of text, it tends to assign very low attention weights to some tokens of phrases, resulting in multiple discontinuous spans per attribute, whereas Fusedmax almost always generates contiguous attention weights. However, Fusedmax makes more mistakes in identifying the overall region that contains the target span (Section 6.3). In order to combine the advantages of softmax and Fusedmax, we first train a model using softmax as the projector and then swap the softmax with Fusedmax in the final few epochs. We call this approach Fusedmax*.

⁵ Some example outputs of softmax and fusedmax on random inputs are shown in Appendix A.3.
⁶ The pytorch implementation of fusedmax used in this work is available at https://github.com/dhruvdcoder/sparse-structured-attention.
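The contrast between the dense softmax projection and a sparse projection can be illustrated with a small sketch. Note this is not the full fusedmax: we implement only the euclidean projection onto the simplex P_{Δ^l} (the sort-based algorithm of Duchi et al., 2008); fusedmax would additionally apply the 1d total-variation proximal operator P_TV (Condat, 2013) to the scores first, which we omit here.

```python
import numpy as np

def softmax_projection(s):
    """Softmax: negative-entropy regularizer; the output is always
    strictly dense (every token gets nonzero weight)."""
    e = np.exp(s - s.max())
    return e / e.sum()

def simplex_projection(s):
    """Euclidean projection onto the probability simplex
    (sort-based, non-iterative; Duchi et al., 2008). Produces
    sparse attention weights, unlike softmax."""
    u = np.sort(s)[::-1]                      # sort scores descending
    css = np.cumsum(u)
    ks = np.arange(1, len(s) + 1)
    # largest k such that u_k + (1 - cumsum_k) / k > 0
    k = ks[u + (1 - css) / ks > 0][-1]
    tau = (css[k - 1] - 1) / k                # threshold
    return np.maximum(s - tau, 0.0)
```

On scores like [3, 1, 0.5], the simplex projection zeroes out the weaker tokens entirely, while softmax spreads nonzero mass over all of them; this is the sparsity property that makes attention weights directly usable for extraction.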

Model
Our classification + extraction model uses MR attribute classification labels to extract MR attribute spans. The model can be divided into three phases: identify, classify and extract (Figure 3). The identify phase encodes the input text and medication name and uses the attention bottleneck to produce attention over the text. The classify phase computes context vectors using the attention from the identify phase and classifies the context vectors.
Finally, the extract phase uses the attention from the identify phase to extract spans corresponding to MR attributes.
Each input x consists of a medication m and conversation text t, and each label y consists of classification labels for frequency, route and change, i.e., y = (y^f, y^r, y^c). The number of classes for attribute k is denoted by n^k; as seen from Table 1, n^f = 12, n^r = 10 and n^c = 8. The length of a text excerpt is denoted by l. The extracted span for attribute k ∈ {f, r, c} is denoted by a binary vector e^k of length l, such that e^k_j = 1 if the j-th token is in the extracted span for attribute k.

Identify
As shown in Figure 3, the identify phase finds the most relevant parts of the text w.r.t. each of the three attributes. For this, we first encode the text as well as the given medication using a contextualized token embedder E; in our case, this is 1024-dimensional BERT (Devlin et al., 2019).7 Since BERT uses WordPiece representations (Wu et al., 2016), we average these wordpiece representations to form the word embeddings. In order to supply the speaker information, we concatenate a 2-dimensional fixed-vocabulary speaker embedding to every token embedding in the text to obtain speaker-aware word representations.
We then perform average pooling of the medication representations to get a single vector representation for the medication.8 Finally, with the given medication representation as the query and the speaker-aware token representations as the key, we use three separate attention functions (the attention bottleneck), one for each attribute (no weight sharing), to produce three normalized attention distributions â^f, â^r and â^c over the tokens of the text. The identify phase can be succinctly described as â^k = Π(S^k(q, K)) for k ∈ {f, r, c}, where q is the pooled medication representation and K are the speaker-aware token representations. Each â^k is an element of the probability simplex Δ^l and is used to perform attribute extraction (Section 4.3).

Classify
We obtain the attribute-wise context vectors c^k as the weighted sum of the encoded tokens (K in Figure 3), where the weights are given by the attribute-wise attention distributions â^k. To perform the classification for each attribute, the attribute-wise context vectors are used as input to feed-forward neural networks F^k (one per attribute), as shown below:9

ŷ^k = F^k(c^k), where k ∈ {f, r, c}.
7 The pre-trained weights for BERT are from the HuggingFace library (Wolf et al., 2019).
8 Most medication names are a single word; however, a few medications have names which are up to 4-5 words.
9 Complete set of hyperparameters used is given in Appendix A.2
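The classify phase reduces to a weighted sum followed by a small feed-forward network, which can be sketched as follows. This is an illustrative sketch with random placeholder weights and assumed dimensions (the actual classifiers are 2-layer feed-forward networks with a 512-unit hidden layer, per Appendix A.2); the function name is ours.

```python
import numpy as np

def classify_phase(attn, K, W1, b1, W2, b2):
    """Context vector c^k = sum_j a^k_j K_j (attention-weighted sum of
    encoded tokens), followed by a 2-layer feed-forward classifier F^k.
    One such network exists per attribute; no weights are shared."""
    c = attn @ K                         # (d,) context vector
    h = np.maximum(W1 @ c + b1, 0.0)     # hidden layer with ReLU
    logits = W2 @ h + b2                 # one logit per class
    return logits
```

Because the classifier only ever sees the attention-weighted context vector, the attention bottleneck is preserved: gradients from the classification loss must flow through the attention distribution.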

Extract
The spans are extracted from the attention distribution using a fixed extraction function X : Δ^l → {0, 1}^l, defined as:

e^k_j = 1 if â^k_j > γ^k, and 0 otherwise,

where γ^k is the extraction threshold for attribute k. For the softmax projection function, it is important to tune the attribute-wise extraction thresholds γ^k.
We tune these using extraction performance on the extraction validation set. For the fusedmax projection function, which produces sparse weights, the thresholds need not be tuned, and hence are set to 0.
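The extraction function has no trainable weights and can be sketched directly; the span-grouping helper is our own addition for readability and is not part of the model.

```python
import numpy as np

def extract_spans(attn, gamma):
    """Fixed extraction function X: threshold the attention distribution
    to obtain a binary span-membership vector e (1 = token in span)."""
    return (attn > gamma).astype(int)

def spans_from_mask(mask, tokens):
    """Helper (ours): group selected tokens into contiguous text spans."""
    spans, cur = [], []
    for tok, m in zip(tokens, mask):
        if m:
            cur.append(tok)
        elif cur:
            spans.append(" ".join(cur))
            cur = []
    if cur:
        spans.append(" ".join(cur))
    return spans
```

With fusedmax, gamma can simply be 0 since most weights are exactly zero; with softmax, gamma must be tuned per attribute on the validation set.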

Training
We train the model end-to-end using gradient descent, except for the extract module (Figure 3), which does not have any trainable weights, and the embedder E. Freezing the embedder is vital for performance, since not doing so results in excessive dispersion of token information to other nearby tokens, which leads to poor extractions.
The total loss for the training is divided into two parts as described below.
(1) Classification Loss L_c: In order to perform classification with highly class-imbalanced data (see Table 1), we use weighted cross-entropy:

L_c = − Σ_{k ∈ {f,r,c}} w^k_{y^k} log p^k(y^k | x),

where the class weights w^k are obtained by inverting each class's relative proportion.
(2) Identification Loss L_i: If span labels e are present for some subset A of training examples, we first normalize these into ground-truth attention probabilities a:

a^k_j = e^k_j / Σ_{j'=1}^{l} e^k_{j'}, for k ∈ {f, r, c}.

We then use the KL-divergence between the ground-truth attention probabilities and the ones generated by the model (â) to compute the identification loss L_i = Σ_{k ∈ {f,r,c}} KL(a^k ‖ â^k). Note that L_i is zero for data-points that do not have span labels. Using these two loss functions, the overall loss is L = L_c + λ L_i.
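The two loss terms can be sketched as follows, for a single attribute and a single example. This is a minimal sketch of the formulas above, not our training code; the function names are ours.

```python
import numpy as np

def weighted_ce(logits, label, class_weights):
    """Weighted cross-entropy for one attribute. The class weights are
    the inverse relative class proportions, to handle imbalance."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -class_weights[label] * np.log(p[label])

def identification_loss(gold_mask, attn, eps=1e-12):
    """KL(a || a_hat) between the normalized gold span mask a and the
    model's attention distribution a_hat; zero when no span labels."""
    if gold_mask.sum() == 0:
        return 0.0
    a = gold_mask / gold_mask.sum()
    return float(np.sum(a * (np.log(a + eps) - np.log(attn + eps))))

def total_loss(cls_losses, id_losses, lam):
    """Overall loss L = L_c + lambda * L_i, summed over attributes."""
    return sum(cls_losses) + lam * sum(id_losses)
```

When the model's attention exactly matches the normalized gold mask, the identification loss vanishes and only the classification loss drives training.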

Results and Analysis
Table 3 shows the results obtained by various combinations of attention scoring and projection functions on the task of MR attribute extraction, in terms of the metrics defined in Section 5. It also shows the classification F1 score to emphasize how the attention bottleneck affects classification performance. The first row shows how a simple phrase-based extraction system performs on the task.10

Effect of Span labels
In order to see if having a small number of extraction training data-points (containing explicit span labels) helps the extraction performance, we annotate 150 of the training data-points with span labels (see Section 2 for how we sampled the data-points). As seen from Table 3, even a small number of examples with span labels (≈ 0.3%) helps considerably with the extraction performance for all models. We expect this trend to continue with more training span labels. We leave finding the right balance between annotation effort and extraction performance as a future direction to explore.

Effect of classification labels
In order to quantify the effect of performing the auxiliary task of classification along with the main task of extraction, we train the proposed model in three different settings: Classification Only, Extraction Only, and Classification+Extraction (see Table 4). It is worth noting that the classification performance of the proposed method is also improved by explicit supervision of the extraction portion of the model (row 2 vs 4, Table 4). In order to set a reference for classification performance, we train strong classification-only models, one for each attribute, using pretrained BERT. These BERT classifiers are implemented as described in Devlin et al. (2019), with input consisting of the text and medication name separated by a [SEP] token (row 1). Based on the improvements achieved in classification performance using span annotations, we believe that more span labels can further close the gap between the classification performance of the proposed model and the BERT classifiers. However, this work focuses on extraction performance, hence improving classification performance is left to future work.

Effect of projection function
While softmax with post-hoc threshold tuning achieves consistently higher TF1 than fusedmax (which does not require threshold tuning), the latter achieves better LCSF1. We observe that while the attention function using softmax projection focuses on the correct portion of the text, it drops intermediate words, resulting in multiple discontinuous spans. Fusedmax, on the other hand, almost always produces contiguous spans. Figure 4 further illustrates this point using a test example.
The training trick we call Fusedmax* swaps the softmax projection function with fusedmax during the final few epochs to combine the strengths of both softmax and fusedmax. This achieves high LCSF1 as well as TF1.

Effect of scoring function
Table 5 shows the percent change in extraction F1 when using TAScore instead of additive scoring (everything else being the same). As seen, there is a significant improvement irrespective of the projection function used. The need for TAScore stems from the difficulty of the additive scoring function in resolving references between spans when there are multiple medications present. In order to measure the efficacy of TAScore for this problem, we divide the test set into two subsets: data-points which have multiple distinct medications in their text (MM) and data-points that have only a single medication (SM). As seen from the first two columns for both metrics in Table 5, using TAScore instead of additive scoring yields more improvement on the MM subset than on the SM subset, showing that the transformer scorer does help with resolving references when multiple medications are present in the text.
Figure 5 shows the distribution of Avg. LCSF1 (averaged across all three attributes). It can be seen that a significant number of data-points in the MM subset get an LCSF1 of zero, showing that even though the scorer improves on the MM subset overall, it still gets quite a few of these data-points completely wrong. This shows that there is still room for improvement.

Discussion
In summary, our analysis reveals that Fusedmax/Fusedmax* favors contiguous extraction spans, which is a necessity for our task. Irrespective of the projection function used, the proposed scoring function TAScore improves extraction performance compared to the popular additive scoring function. The proposed model architecture is able to establish a synergy between the classification and span extraction tasks, where one improves the performance of the other. Overall, the proposed combination of TAScore and Fusedmax* achieves a 22 LCSF1 point improvement over the phrase-based baseline and a 10 LCSF1 point improvement over the naive additive plus softmax combination.

Related Work
Existing literature directly related to our work can be bucketed into two categories: related methods and related tasks. Methods: The recent work on generating rationales/explanations for deep neural network based classification models (Lei et al., 2016; Bastings et al., 2020; Paranjape et al., 2020) is closely related to ours in terms of the methods used. Most of these works use binary latent variables to perform extraction as an intermediate step before classification.
Our work is closely related to (Jain et al., 2020; Zhong et al., 2019), who use attention scores to generate rationales for classification models. These works, however, focus on generating faithful and plausible explanations for classification, as opposed to extracting the spans for attributes of an entity, which is the focus of our work. Moreover, our method can be generalized to any number of attributes, while all these methods would require a separate model for each attribute. Tasks: Understanding doctor-patient conversations has started to receive attention recently (Rajkomar et al., 2019; Schloss and Konam, 2020). Selvaraj and Konam (2019) perform MR extraction by framing the problem as a generative question-answering task. This approach is not efficient at inference time, as it requires one forward pass for each attribute. Moreover, unlike a span extraction model, the generative model might produce hallucinated facts. Du et al. (2019) obtain MR attributes as spans in text; however, they use a fully supervised approach which requires a large dataset with span-level labels.

Conclusion and Future work
We provide a framework to perform MR attribute extraction from medical conversations with weak supervision using noisy classification labels. This is done by creating an attention bottleneck in the classification model and performing extraction using the attention weights. After experimenting with several variants of attention scoring and projection functions, we show that our transformer-based attention scoring function (TAScore) combined with Fusedmax* achieves significantly higher extraction performance compared to the other attention variants and a phrase-based baseline.
While our proposed method achieves good performance, there is still room for improvement, especially for text with multiple medications. Data augmentation by swapping or masking medication names is worth exploring. An alternate direction of future work involves improving the naturalness of extracted spans. Auxiliary supervision using a language modeling objective would be a promising approach for this.

A.2 Hyperparameters

TAScore: [...] are set to 0.2. The linear layer for the query has input and output dimensions of 1024 and 32, respectively. Due to the concatenation of the speaker embedding, the linear layer for keys has input and output dimensions of 1026 and 32, respectively. The feedforward layer (which generates scalar scores for each token) on top of the transformer is 2-layered with relu activations and hidden sizes (16, 1).

Classifiers:
The final classifier for each attribute is a 2-layer feedforward network with hidden sizes (512, "number of classes for the attribute") and dropout probability of 0.2.

A.3 Examples: Projection Functions
Figures 6 and 7 show examples of outputs of projection functions softmax and fusedmax on random input scores.

A.4 Phrase based extraction baseline
We implement a phrase-based extraction system to provide a baseline for the extraction task. A lexicon of relevant phrases is created for each class of each attribute, as shown in Table 8. We then look for string matches between these phrases and the text of the data-point. If there are matches, then the longest match is considered as the extraction span for that attribute.
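The baseline reduces to a longest-match lookup, which can be sketched as follows. This is an illustrative sketch of the matching logic described above; the lexicon passed in below uses a small subset of phrases (the full lexicon is in Table 8), and the function name is ours.

```python
def phrase_baseline(text, lexicon):
    """Phrase-based extraction baseline: for each attribute, find all
    lexicon phrases occurring in the text and keep the longest match
    as the extracted span (None if nothing matches)."""
    text_l = text.lower()
    result = {}
    for attribute, classes in lexicon.items():
        matches = [p for phrases in classes.values()
                   for p in phrases if p in text_l]
        result[attribute] = max(matches, key=len) if matches else None
    return result
```

For example, on the text "You take one pill twice a day." with a frequency and route lexicon, the baseline returns "twice a day" for frequency and "pill" for route.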
frequency: Every morning | At Bedtime | Twice a day | Three times a day | Every six hours | Every week | Twice a week | Three times a week | Every month | Other | None
route: Pill | Injection | Topical cream | Nasal spray | Medicated patch | Ophthalmic solution | Inhaler | Oral solution | Other | None
change: Take | Stop | Increase | Decrease | None | Other

Figure 2 :
Figure 2: Architecture of TAScore. q and K are the input query and keys, respectively, and s are the output scores.
(1) The Classification Only setting uses the complete dataset (~45k examples) but only with the classification labels. (2) The Extraction Only setting uses only the 150 training examples that have span labels. (3) Finally, the Classification+Extraction setting uses the ~45k examples with classification labels along with the 150 examples with span labels to train the model. Table 4 (rows 2, 3 and 4) shows the effect of having classification labels and performing extraction and classification jointly using the proposed model. The model structure and the volume of the classification data (~45k examples) make the auxiliary task of classification extremely helpful for the main task of extraction, even in the presence of label noise.

Figure 4 :
Figure 4: Difference in extracted spans for MR attributes between models that use Fusedmax* and Softmax, for the medication Actonel. Blue: change, green: route, yellow: frequency. Refer to Figure 1 for ground-truth annotations.
Figure 6: Sample outputs (right column) of softmax function on random input scores (left column).
Figure 7: Sample outputs (right column) of fusedmax function on random input scores (left column).
frequency:
Every morning: morning | every morning
At Bedtime: everyday before sleeping | everyday after dinner | every night | after dinner | at bedtime | before sleeping
Twice a day: twice a day | 2 times a day | two times a day | 2 times per day | two times per day
Three times a day: 3 times a day | 3 times per day | 3 times every day
Every six hours: every 6 hours | every six hours
Every week: every week | weekly | once a week
Twice a week: twice a week | two times a week | 2 times a week | twice per week | two times per week | 2 times per week
Three times a week: 3 times a week | 3 times per week
Every month: every month | monthly | once a month
route:
Pill: tablet | pill | capsule | mg
Injection: pen | shot | injector | injection | inject
Topical cream: cream | gel | ointment | lotion
Nasal spray: spray | nasal
Medicated patch: patch
Ophthalmic solution: ophthalmic | drops | drop
Oral solution: [...]
Phrases used in the phrase-based baseline. These are also the most frequently occurring phrases in the free-form annotations.

Table 1 :
The normalized labels in the classification data.

Table 3 :
Attribute extraction performance for various combinations of scoring and projection functions.The avg. columns represent the macro average of the corresponding metric across the attributes.

Table 4 :
Effect of performing extraction+classification jointly in our proposed model.While the Extraction Only training only uses the 150 examples which are explicitly annotated with span labels, the Classification only training uses the complete training dataset with classification labels.