Set to Ordered Text: Generating Discharge Instructions from Medical Billing Codes

We present set to ordered text, a natural language generation task applied to automatically generating discharge instructions from admission ICD (International Classification of Diseases) codes. This task differs from other natural language generation tasks in the following ways: (1) The input is a set of identifiable entities (ICD codes) where the relations between individual entities are not explicitly specified. (2) The output text is not a narrative description (e.g. news articles) composed from the input. Rather, inferences are made from the input (symptoms specified in ICD codes) to generate the output (instructions). (3) There is an optimal order in which each sentence (instruction) should appear in the output. Unlike most other tasks, neither the input (ICD codes) nor their corresponding symptoms appear in the output, so the ordering of the output instructions needs to be learned in an unsupervised fashion. Based on clinical intuition, we hypothesize that each instruction in the output is mapped to a subset of ICD codes specified in the input. We propose a neural architecture that jointly models (a) subset selection: choosing relevant subsets from a set of input entities; (b) content ordering: learning the order of instructions; and (c) text generation: representing the instructions corresponding to the selected subsets in natural language. In addition, we penalize redundancy during beam search to improve tractability for long text generation. Our model outperforms baseline models in BLEU scores and human evaluation. We plan to extend this work to other tasks such as recipe generation from ingredients.


Problem Statement
Many healthcare applications exhibit a strong mapping between numerical or categorical information [...]. To develop language generation capabilities for these settings, we define a task where discharge instructions are automatically generated using admission ICD (International Classification of Diseases) codes to potentially streamline clinical workflow. The input is a set of identifiable items (ICD codes) and the output consists of ordered text sequences (instructions), which are inferred from the input (see Figure 1).

Proposed Approach
We hypothesize that each discharge instruction in the output is mapped to a subset of ICD codes specified in the input. Our proposed approach thus models the correlations between individual entities in the input set to choose the most relevant subsets, and learns to generate their corresponding textual outputs in the appropriate order. We also incorporate explicit means for reducing redundancy during decoding. We empirically verify the proposed approach by generating discharge instructions from ICD codes assigned during hospital admissions.

Relation to Other Work
For most natural language generation tasks, the relations between the input entities are specified in one way or another. For text-to-text generation (e.g. news articles), the relations between entities are semantically encoded in sequences of words (Paulus et al., 2017). For graph-to-text generation, relations between the nodes in a graph are characterized through labeled or unlabeled edges (Liu et al., 2019b). For text generation with database inputs, relations are usually specified through the attributes of each data entry (Lebret et al., 2016). In our problem setup, there is no explicit characterization of the relations between the input ICD codes assigned to a patient's visit.
Most natural language generation problems focus on generating descriptions, which often include the entities specified in the input. Examples include summarization or text generation from database records (Cheng and Lapata, 2016; Jhamtani et al., 2018). In our case, both the content and the ordering of the output need to be inferred from the input data.
Our work is closest to the line of research on text expansion, where the generated text is conditioned on a set of entities (Clark et al., 2018; Zhao et al., 2018; Kiddon et al., 2016). Although in this line of work the relations among the entities in the input set are not specified, as in our case, these input entities often appear in the output text, making it more straightforward to model the order in which the input entities appear in the output. In our case, neither the input entity set (ICD codes) nor their corresponding text representations (diagnoses and clinical procedures) occur in the generated output (instructions).

Task Definition
For input set $S$, the generation task can be specified as follows:

$x = F(S, X), \quad o = \mathrm{Gen}(x)$

where $F$ represents the mechanism to choose the next subset $x$ given the input $S$ and the already chosen sequence of subsets $X$, and $\mathrm{Gen}$ generates the output sentence $o$ from the chosen subset $x$.
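The task definition can be read as a simple loop that alternates subset choice and sentence generation. The following is a minimal sketch under stated assumptions: `choose_subset` and `gen` are hypothetical stand-ins for F and Gen, subsets are hard (discrete) rather than the soft distributions the model learns, and the number of instructions is supplied by the caller.

```python
from typing import Callable, FrozenSet, List, Sequence

Subset = FrozenSet[str]


def generate_instructions(
    codes: FrozenSet[str],
    choose_subset: Callable[[FrozenSet[str], Sequence[Subset]], Subset],
    gen: Callable[[Subset], str],
    num_instructions: int,
) -> List[str]:
    """Alternate x = F(S, X) and o = Gen(x) for a fixed number of steps."""
    chosen: List[Subset] = []   # X: subsets chosen so far, in order
    outputs: List[str] = []     # ordered instructions
    for _ in range(num_instructions):
        x = choose_subset(codes, chosen)  # plays the role of F
        outputs.append(gen(x))            # plays the role of Gen
        chosen.append(x)
    return outputs


# Toy stand-ins (assumptions, not the learned model).
def toy_choose(codes: FrozenSet[str], chosen: Sequence[Subset]) -> Subset:
    used = set().union(*chosen)
    remaining = sorted(codes - used)
    return frozenset(remaining[:1])


def toy_gen(subset: Subset) -> str:
    return "instruction for " + ", ".join(sorted(subset))


if __name__ == "__main__":
    print(generate_instructions(frozenset({"A", "B"}), toy_choose, toy_gen, 2))
```

In the proposed architecture both roles are played by neural components, so this loop only illustrates the control flow, not the learned behavior.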

Neural Architecture
The proposed neural architecture is shown in Figure 2. The major components of the network are a lookup table for ICD codes, gates for content and subset selection, one RNN for content ordering and another one for decoding each discharge instruction. The network also possesses two attention layers: one for finding the correlations between input ICD codes and one for attending to the chosen content at each stage of instruction generation. The instruction ordering RNN is initialized with a zero vector. Below we elaborate on the major components of the proposed neural architecture in more detail. Bold symbols in equations represent parameter matrices.

Content and Subset Selection
The content selection gate selects the relevant content from each ICD code during each instruction generation phase. To select the content, the gate takes into account the ICD code embedding, the previous state of the content ordering RNN $H_{t-1}$, and the correlations among ICD codes. A content correlation vector $C_j$ is computed for each ICD code embedding $icd_j$ in the input by the self-attention layer. The content gate value is then computed and applied to $icd_j$ to obtain the selected content $icd'_j$.

The selected content from each input ICD code passes through a subset selection gate. Each subset is selected as a probability distribution over the input set of ICD codes:

$gs_j = \mathrm{softmax}(gs_j) \quad \forall j$

where $gs_j, \forall j$ represents the distribution of ICD codes in the subset chosen at the current time step of the content ordering RNN. The subset selection gate updates the output of the content selection gate as follows:

$icd''_j = gs_j \, icd'_j \quad (1)$

In Figure 2, trapezoids represent the content and subset selection gates. During each stage of content and subset selection, the gates receive information regarding the content and order of already selected subsets from the instruction ordering RNN. The figure also depicts the self-attention layer, which computes $C_j$.
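To make the subset selection step concrete, the sketch below applies a softmax over per-code gate scores and scales each content-selected embedding by its weight, as in Equation 1. It is an illustration only: the raw scores are passed in directly (how the model produces them is not shown here), and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over the gate scores of all ICD codes."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def subset_gate(selected: List[Vector], scores: List[float]) -> List[Vector]:
    """Scale each content-selected embedding by its subset weight gs_j (Equation 1)."""
    gs = softmax(scores)
    return [[g * v for v in vec] for g, vec in zip(gs, selected)]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Mean of the gated embeddings, the input fed to the content ordering RNN."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


if __name__ == "__main__":
    gated = subset_gate([[2.0, 4.0], [6.0, 8.0]], [0.0, 0.0])
    print(gated, mean_vector(gated))
```

With equal scores every code receives the same weight; a code with a much larger score dominates the subset, which is how a soft distribution can approximate choosing a few codes.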

Content Ordering and Instruction Generation
We use a GRU recurrent neural network (Chung et al., 2014) for content ordering. The RNN is initialized with a zero vector before the network activity begins, and it takes the selected content of the ICD codes as input at each time step. At time step $t$,

$H_t = \mathrm{GRU}(I_t, H_{t-1})$

where $I_t$ is the mean vector of the $icd''_j$ computed using Equation 1. The instruction decoder RNN (also a GRU) is initialized with $H_t$ to generate the instruction at the current time step of the content ordering RNN. During each time step of decoding, the decoder RNN attends over the current set of $icd''_j$ to generate the sequence of words $w_t$ in the instruction. Here $h_t$ is the hidden state of the decoder at time step $t$. The probability distribution over the output vocabulary for generating the word $w_t$ is computed at each time step $t$ of the instruction-generating decoder. The portion of Figure 2 marked in blue represents the set of computations detailed in this section.

Beam Search with Redundancy Penalization
In our approach, instructions are decoded one after another, each corresponding to a hidden state of the content ordering RNN, as generating one single long text sequence could lead to intractability. In addition, we include a penalization factor for reducing redundancy in the cost function $C$ of beam search (Equation 4), controlled by a parameter $\lambda$.
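One way to picture a redundancy penalty in beam search is to add a term to a hypothesis cost that grows with its overlap with instructions already generated. The sketch below is an assumption made for illustration, not the cost function of this work: it measures redundancy as the fraction of candidate bigrams already seen and adds it, weighted by `lam`, to the negative log-probability.

```python
from typing import List, Sequence, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: Sequence[str], n: int) -> Set[NGram]:
    """Set of n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def redundancy(candidate: Sequence[str], previous: List[Sequence[str]], n: int = 2) -> float:
    """Fraction of candidate n-grams that already occur in earlier instructions."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    seen: Set[NGram] = set()
    for instruction in previous:
        seen |= ngrams(instruction, n)
    return len(cand & seen) / len(cand)


def penalized_cost(log_prob: float, candidate: Sequence[str],
                   previous: List[Sequence[str]], lam: float) -> float:
    """Hypothetical beam cost: lower is better; repeated content costs more."""
    return -log_prob + lam * redundancy(candidate, previous)


if __name__ == "__main__":
    prev = [["call", "for", "any", "fever"]]
    print(penalized_cost(-1.0, ["call", "for", "any", "fever"], prev, 2.7))
    print(penalized_cost(-1.0, ["follow", "up", "with", "doctor"], prev, 2.7))
```

Under this sketch, two hypotheses with the same model probability are ranked so that the one with novel content is preferred.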

Experiments
In this section, empirical evaluation is conducted to quantify the accuracy of our model regarding text generation, content ordering, correctness of grammar and informativeness.

Corpus: ICD Codes to Discharge Instructions
Our dataset, consisting of admission ICD codes and their corresponding discharge instructions, is derived from MIMIC-III. MIMIC-III is a database containing clinical information regarding patients, admission details, lab tests and medical notes (Johnson et al., 2016). For each patient admitted to the hospital, there is a recorded set of ICD codes assigned for billing purposes to specify the diagnoses and clinical procedures related to the patient's admission. We assigned unique IDs to diagnosis and procedure codes and did not distinguish between them. Patients receive a list of discharge instructions written by clinical staff before they return home. These discharge instructions are embedded in larger documents called discharge reports. We trained statistical models (SVM and logistic regression) to distinguish instruction sentences (e.g. commands) from non-instruction sentences in the discharge reports, using TF-IDF vectors computed from the sentences as features. The sentence classification dataset is split into a training set, a development set and a test set, comprising 2,000, 500 and 500 sentences, respectively. The accuracy of this binary classification is in Table 1. The logistic regression classifier was used to construct the corpus that maps ICD codes to discharge instructions, resulting in 18,900 input-output pairs for training and 900 each for testing and development. Manual verification was performed to ensure data quality for the test and development sets. Following customary data post-processing protocols, named entities such as numerical values and person names were replaced by placeholder tags such as [num] and [person name].
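The placeholder substitution at the end of this pipeline can be sketched as follows. This is a minimal illustration under assumptions: numbers are matched with a regular expression, and person names come from a caller-supplied list (`person_names` is hypothetical; the tool actually used to detect names is not described here).

```python
import re
from typing import Iterable

# Integers or decimals such as "10" or "2.5".
NUM_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def mask_entities(sentence: str, person_names: Iterable[str] = ()) -> str:
    """Replace known person names and numeric values with placeholder tags."""
    # Longest names first so that a full name is masked before its parts.
    for name in sorted(person_names, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        sentence = re.sub(pattern, "[person name]", sentence, flags=re.IGNORECASE)
    return NUM_PATTERN.sub("[num]", sentence)


if __name__ == "__main__":
    print(mask_entities("no lifting more than 10 pounds for 6 weeks"))
    print(mask_entities("follow up with dr smith in 2 weeks", ["smith"]))
```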

Implementation Details
We chose different baseline models and different settings of our model to evaluate the effectiveness of our method. In this section we describe each of the models in detail.

Seq2Seq
In this setting we use a sequence-to-sequence model with an attention mechanism. The set of input ICD codes is arranged as a single sequence in random order, and the model generates the entire set of instructions as a single sequence. The learning algorithm minimizes negative log-likelihood. Beam search decoding with a beam size of 9 is used during testing.

Set2SingleSeq
In this setup, the model is a variant of , where input is a set and content selection is conducted by learning correlations between the input items. This is also a variant of our method, where the output is treated as a single sequence instead of a set of instructions. Each $icd_j$ in the input set of ICD codes is updated to the selected content $icd'_j$ using the content selection gate: a content correlation vector $C_j$ is computed for each $icd_j$, and the content gate value is then computed and applied to select the content. The entire sequence of instructions is decoded using the computed set of $icd'_j$. For this purpose, the decoder is initialized with the mean vector of the $icd'_j$, and it attends over the full set of $icd'_j$ during each time step of decoding. The learning algorithm minimizes negative log-likelihood for optimizing parameters. Beam search decoding with a beam size of 9 is used during testing.

Set2MultipleSeq
This setting represents the method explained in Section 2.2. In this setup, instructions are generated one after another in the learned order. The optimized size for the ICD embedding, the content ordering RNN, and the decoder RNN is 600. Sizes of the network parameter matrices $W_a$, $W_{gc}$, $W_{gs}$, $W_c$ and $W_o$ are adjusted accordingly. The learning algorithm minimizes negative log-likelihood for optimizing parameters. Error is averaged over all instruction decoder time steps for the set of instructions generated. Beam search with a beam size of 9 is used during testing.

Set2MultipleSeq+Opt
This setup is an enhanced version of Set2MultipleSeq (Section 3.2.3) in which redundancy is penalized during beam search decoding (see Section 2.2.3). The value of $\lambda$ in Equation 4 is set to 2.7 to obtain maximum content coverage on the development set.

Evaluation I: Content Generation
Instruction generation is evaluated quantitatively by measuring the n-gram match of the generated content with the ground truth in the test set using the BLEU metric (Papineni et al., 2002). For simplicity's sake, we did not set any stopping criterion for the number of instructions to be generated. Instead, we generated as many instructions as there are in the corresponding test record. The results are shown in Table 2. Set2SingleSeq yields a better score than Seq2Seq, indicating the effectiveness of content selection and self-attention for modeling the correlation between ICD codes. The proposed Set2MultipleSeq approach yields even more improvement, validating that the introduction of subset selection and content ordering helped in defining the content and context of an instruction during generation. This resulted in more accurate generation of instructions one after the other. Penalizing redundancy during decoding (Set2MultipleSeq+Opt) explicitly forces the instruction decoder to generate an instruction with novel content during each stage of the content ordering RNN, thereby reducing errors caused by repeated content.
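For readers unfamiliar with BLEU, the sketch below computes a single-reference, sentence-level score from clipped n-gram precisions and a brevity penalty. It is a simplified illustration without smoothing; the exact BLEU configuration used for Table 2 (corpus-level aggregation, tokenization, maximum n-gram order) is not specified here and is assumed.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: List[str], reference: List[str], max_n: int = 4) -> float:
    """Unsmoothed sentence-level BLEU against one reference."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1.0 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    ref = "no lifting more than [num] pounds for [num] weeks".split()
    print(bleu(ref, ref))
    print(bleu("shower daily and pat incisions dry".split(), ref))
```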

Evaluation II: Content Ordering
For evaluating content ordering, we use a variant of the metric used by Gong et al. (2016). In this scheme, we compare the order of words in the generated sequence of instructions with the ground truth. We take the set of all order bigrams $S_o$ from the generated sequence of instructions, where the first word in each bigram is from a preceding instruction and the second word is from any instruction succeeding it in the sequence. Precision and recall are computed as:

$P = \frac{|S_o \cap S_0|}{|S_o|}, \quad R = \frac{|S_o \cap S_0|}{|S_0|}$

where $S_0$ is the set of order bigrams in the corresponding human-written set of instructions in the test set. Thus the metric scores the ordering better when the right content is generated in the right order. F-measure is computed as the harmonic mean of precision and recall. The results in Table 3 show that there is a considerable improvement in the quality of ordering after introducing the content ordering mechanism in the neural architecture. The better ordering score for Set2MultipleSeq+Opt is expected, as redundant content adversely affects instruction ordering.
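The order-bigram metric described above can be sketched as follows. This is an illustrative implementation under assumptions: instructions are given as token lists, order bigrams are collected as a set of word pairs, and the function names are hypothetical.

```python
from typing import List, Set, Tuple

Instruction = List[str]


def order_bigrams(instructions: List[Instruction]) -> Set[Tuple[str, str]]:
    """Pairs (w1, w2) where w1 is in an earlier instruction and w2 in a later one."""
    pairs: Set[Tuple[str, str]] = set()
    for i, earlier in enumerate(instructions):
        for later in instructions[i + 1:]:
            for w1 in earlier:
                for w2 in later:
                    pairs.add((w1, w2))
    return pairs


def ordering_scores(generated: List[Instruction],
                    reference: List[Instruction]) -> Tuple[float, float, float]:
    """Precision, recall and F-measure over order bigrams."""
    gen = order_bigrams(generated)
    ref = order_bigrams(reference)
    overlap = len(gen & ref)
    precision = overlap / len(gen) if gen else 0.0
    recall = overlap / len(ref) if ref else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = [["shower", "daily"], ["no", "lifting"]]
    print(ordering_scores(reference, reference))
    print(ordering_scores(list(reversed(reference)), reference))
```

Swapping the two instructions leaves the generated words unchanged but reverses every order bigram, so the score drops to zero. This is the sense in which the metric rewards the right content in the right order.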

Evaluation III: Human Analysis
We conduct human evaluations to measure whether the generated instructions are grammatically correct and how informative they are. Four human evaluators, all postgraduate students in linguistics, were recruited, and each was given 30 sets of instructions from the test set.

Grammaticality
For each of the 30 instructions chosen for human evaluation, the evaluators were given a number of choices, each generated by a different model, and asked to choose the option that was the most grammatically correct. For each question, the instructions from the different models were shown to the evaluators in random order to avoid bias. We compared instructions generated by Set2SingleSeq and Set2MultipleSeq. Set2MultipleSeq+Opt is excluded, as it is an optimization for avoiding redundancy without any direct influence on the grammatical quality of the text generated by neural models. Results aggregated across evaluators through majority voting are shown in Table 4. The results show that incorporating neural network components for content selection and ordering helps in defining the context of an instruction and generating the right content in the right form. Set2MultipleSeq generates one instruction at a time, while Set2SingleSeq generates the entire set of instructions as one sequence. Grammaticality is shown to improve with Set2MultipleSeq, where only one (shorter) instruction needs to be decoded at a time.

Informativeness
We also conducted human experiments to evaluate the informativeness of generated instructions. The evaluators were asked to read the reference set of instructions prior to examining the instructions generated by competing methods. They were then asked to choose the model that generated instructions retaining the most information from the reference. The results are shown in Tables 5 and 6, respectively.

Method             %
Set2SingleSeq      30
Set2MultipleSeq    63
Ambiguous          7

The results in Table 5 indicate that incorporating neural components for subset selection and content ordering helps in generating more informative instructions. We observe that conducting content selection at each time step of the content ordering RNN helps in generating a discrete set of instructions (Set2MultipleSeq). Table 6 shows that penalizing redundancy during beam search decoding reduces noise and helps in generating instructions with rich information density. Inter-evaluator agreement for the entire set of human evaluation is reasonably high: Cohen's kappa coefficient is 0.79. Table 7 shows an example of instructions generated by the different approaches we investigated. Though Set2SingleSeq generates relevant instructions, it repeats the same content, misses out on instructions, and is less grammatically correct.
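As a reminder of what the agreement statistic reported above measures, the sketch below computes Cohen's kappa for two annotators' categorical choices. Cohen's kappa is defined for a pair of raters; how agreement was aggregated across the four evaluators is not stated, so applying this function to pairs of evaluators is an assumption.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if both raters labeled independently at their own rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    a = ["Set2MultipleSeq", "Set2MultipleSeq", "Set2SingleSeq", "Set2SingleSeq"]
    b = ["Set2MultipleSeq", "Set2MultipleSeq", "Set2SingleSeq", "Set2MultipleSeq"]
    print(cohens_kappa(a, b))
```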

Qualitative Comparison Across Models
Set2MultipleSeq shows qualitative improvement over Set2SingleSeq, where the repeated content is largely reduced and the grammaticality is improved. However, for Set2MultipleSeq, there is still repeated content in the 3rd and 4th instructions.
Set2MultipleSeq+Opt brings more tractability in the neural model by penalizing redundancy and preventing noisy content generation. In the example shown, Set2MultipleSeq+Opt improves over Set2MultipleSeq, as the content in the 3rd instruction is no longer repeated in the 4th instruction.

Table 7. Instructions generated by the different approaches.

Method: Set2SingleSeq
Generated Content:
shower daily including washing incisions gently with mild soap , no baths or swimming until cleared by surgeon. shower daily and pat incisions dry no lotions , creams or powders on incisions no driving for [num]. shower daily , let water flow over wounds , pat dry with a towel towel , do no

Method: Set2MultipleSeq
Generated Content:
1) shower daily and pat incisions dry no lotions , creams or powders on incisions, no baths or swimming until cleared by surgeon.
2) no lifting greater then [num] pounds for [num] weeks , do not drive or operate heavy machinery while taking any narcotic pain medication such as percocet
3) call for any fever , redness or drainage from wounds or weight gain more than [num] pounds
4) call your doctor for any fever , redness or drainage from wounds

Method: Set2MultipleSeq+Opt
Generated Content:
1) shower daily and pat incisions dry no lotions , creams or powders on incisions, no baths or swimming until cleared by surgeon.
2) no lifting greater then [num] pounds for [num] weeks , do not drive or operate heavy machinery while taking any narcotic pain medication such as percocet
3) call for any fever , redness or drainage from wounds or weight gain more than [num] pounds
4) follow up with your primary care doctor , dr [person name] , in the next week as well

Method: Nurse Written Instructions
Content:
1) shower, no baths, no lotions,creams or powders to incisions
2) no lifting more than [num] pounds for [num] weeks from surgery
3) do not drive or drink alcohol while taking narcotic pain medications
4) call with fever, redness or drainage from incision or weight gain more than [num] pounds in one day or five in one week.

Variability in Groundtruth References
Variability stemming from differences in communication style across clinicians could be one reason why the overall scores for content generation (BLEU scores) and content ordering (F-measure) are on the low end. We observed that for the same set of ICD codes, different clinicians have different writing styles, even when the underlying content of the instructions is the same. In (a), a more detailed way of representing the instruction is presented, while in (b) the same information is represented in a more succinct manner. If a clinician decides to be more detailed in writing the instructions, the clinician might also include specific information such as medication details: "Please be sure to take aspirin and plavix everyday as directed." Such inconsistency in how information such as medication details is specified in the instructions can potentially lead the models to generate noisy content.

Related Prior Work

Natural Language Generation
Here we define natural language generation as the task of generating text from textual input or non-linguistic input (e.g. graphs, data records). Previous work on text generation spans various input types and methods. Initial models used learned rules (Reiter et al., 2003, 2005) and manually engineered templates (Kukich, 1983; McRoy et al., 2000; McKeown, 1985) for constructing the output text. There is also work using automatic means for generating templates (Angeli et al., 2010; Howald et al., 2013).
Such template approaches could be inherently less efficient in modeling semantics when compared to neural networks, especially if there is ample training data. Early neural models for text generation were motivated by machine translation (Srivastava et al., 2014). However, such approaches are less suited to capturing the semantics of structures that are more complex than short sequences and to generating longer sequences (Paulus et al., 2017; Wiseman et al., 2018). Recent advances in text-to-text generation have primarily focused on news document summarization (Cheng and Lapata, 2016; Nallapati et al., 2017; See et al., 2017; Paulus et al., 2017).
A portion of related work focuses on generating text from an input graph (Koncel-Kedziorski et al., 2019; Song et al., 2018, 2017): graphs with labeled edges (e.g., knowledge graphs or abstract meaning representations) are used to generate a direct description of the information characterized in the input graph.
Text generation from data records has been investigated for different datasets such as Wikipedia infoboxes (Lebret et al., 2016; Perez-Beltrachini and Lapata, 2018), weather predictions (Mei et al., 2016) and sports game summaries (Wiseman et al., 2017). A subset of the work on text generation from data records relies on content planning for generating a single sentence, while other researchers generate multi-sentence outputs from structured data inputs (Puduppully et al.; Jhamtani et al., 2018). Puduppully et al. generated basketball game summaries by explicitly learning the order in which information should be mentioned in the output, using an intermediate content plan.

Text Generation in Healthcare
A lot of NLP work in the medical domain has focused on information extraction (see  for a review). Driven by healthcare application demands, there is emerging interest in areas such as automatic ICD code assignment (Scheurwegs et al., 2017; Mullenbach et al., 2018), risk prediction (Ma et al., 2018), and dialogue comprehension (Liu et al., 2019a). One of the earliest works on text generation in the medical domain dates back more than two decades, where interactive systems generated summaries of the patient status for physicians and tailored explanations of clinical information for individual patients (Buchanan et al., 1995). Recent advances in neural modeling have rekindled interest in text generation. Text generation has had more presence in the media and journalism domain, ranging from image captioning (Kinghorn et al., 2018) to news reports on weather (Reiter et al., 2005) and sports (Wiseman et al., 2017). For the medical domain, interest in neural text generation has been growing, with a primary focus on document summarization (Moradi and Ghadiri, 2018). For interested readers, Pauws et al. (2019) provide a recent review of how data-to-text technology can be applied in healthcare settings.

Conclusion
We proposed a neural architecture that learns to generate content in a specific order (discharge instructions for patients) without explicit specifications of the relations between input entities (ICD codes representing diagnoses and procedures) and how the input entities relate to the output. Our approach yields encouragingly better results in comparison with strong baselines. We further improved performance by penalizing redundancy during decoding.

Acknowledgement
We thank Jiewen Wu for pointing us to the MIMIC dataset and sharing his experience with us. We thank Ai Ti Aw, Mien Ho, Pavitra Krishnaswamy, and Zhengyuan Liu for their insightful discussions. We are also grateful for the encouraging and constructive feedback from the anonymous reviewers.
Research efforts were supported by funding for Digital Health from the Science and Engineering Research Council (SERC Project No: A1818g0044), A*STAR, Singapore. In addition, this work was conducted using resources and infrastructure provided by the Human Language Technology unit at I2R.