What’s in a Summary? Laying the Groundwork for Advances in Hospital-Course Summarization

Summarization of clinical narratives is a long-standing research problem. Here, we introduce the task of hospital-course summarization. Given the documentation authored throughout a patient’s hospitalization, generate a paragraph that tells the story of the patient admission. We construct an English, text-to-text dataset of 109,000 hospitalizations (2M source notes) and their corresponding summary proxy: the clinician-authored “Brief Hospital Course” paragraph written as part of a discharge note. Exploratory analyses reveal that the BHC paragraphs are highly abstractive with some long extracted fragments; are concise yet comprehensive; differ in style and content organization from the source notes; exhibit minimal lexical cohesion; and represent silver-standard references. Our analysis identifies multiple implications for modeling this complex, multi-document summarization task.


Introduction
The electronic health record (EHR) contains critical information for clinicians to assess a patient's medical history (e.g., conditions, laboratory tests, procedures, treatments) and healthcare interactions (e.g., primary care and specialist visits, emergency department visits, and hospitalizations). While medications, labs, and diagnoses are documented through structured data elements and flowsheets, clinical notes contain rich narratives describing the patient's medical condition and interventions. A single hospital visit for a patient with a lengthy hospital stay, or complex illness, can consist of hundreds of notes. At the point of care, clinicians already pressed for time, face a steep challenge of making sense of their patient's documentation and synthesizing it either for their own decision making process or to ensure coordination of care (Hall and Walton, 2004;Ash et al., 2004).
Automatic summarization has been proposed to support clinicians in multiple scenarios, from making sense of a patient's longitudinal record over long periods of time and multiple interactions with the healthcare system, to synthesizing a specific visit's documentation. Here, we focus on hospital-course summarization: faithfully and concisely summarizing the EHR documentation for a patient's specific inpatient visit, from admission to discharge. Crucial for continuity of care and patient safety after discharge (Kripalani et al., 2007;Van Walraven et al., 2002), hospital-course summarization also represents an incredibly challenging multi-document summarization task with diverse knowledge requirements. To properly synthesize an admission, one must not only identify relevant problems, but link them to symptoms, procedures, medications, and observations while adhering to temporal, problem-specific constraints.
Our main contributions are as follows: (1) We introduce the task of hospital-course summarization; (2) we collect a dataset of inpatient documentation and corresponding "Brief Hospital Course" paragraphs extracted from discharge notes; and (3) we assess the characteristics of these summary paragraphs as a proxy for target summaries and discuss implications for the design and evaluation of a hospital-course summarization tool.

Related Works
Summarization of clinical data and documentation has been explored in a variety of use cases (Pivovarov and Elhadad, 2015). For longitudinal records, graphical representations of structured EHR data elements (i.e., diagnosis codes, laboratory test measurements, and medications) have been proposed (Powsner and Tufte, 1997;Plaisant et al., 1996). Interactive visualizations of clinical problems' salience, whether extracted from notes (Hirsch et al., 2015) or inferred from clinical documentation (Levy-Fix et al., 2020) have shown promise (Pivovarov et al., 2016;Levy-Fix, 2020).
Most work in this area, however, has focused on clinical documentation of a fine temporal resolution. Traditional text generation techniques have been proposed to synthesize structured data like ICU physiological data streams (Hunter et al., 2008;Goldstein and Shahar, 2016).  use a transformer model to write EHR notes from the prior 24 hours, while Liang et al. (2019) perform disease-specific summarization from individual progress notes. McInerney et al. (2020) develop a distant supervision approach to generate extractive summaries to aid radiologists when interpreting images. Zhang et al. (2018Zhang et al. ( , 2020; MacAvaney et al. (2019); Sotudeh Gharebagh et al. (2020) generate the "Impression" section of the Radiology report from the more detailed "Findings" section. Finally, several recent works aim to generate EHR notes from doctor-patient conversations (Krishna et al., 2020;Joshi et al., 2020;Research, 2020). Recent work on summarizing hospital admissions focuses on extractive methods (Moen et al., 2014(Moen et al., , 2016Liu et al., 2018b;Alsentzer and Kim, 2018).

Hospital-Course Summarization Task
Given the clinical documentation available for a patient hospitalization, our task of interest is to generate a text that synthesizes the hospital course in a faithful and concise fashion. For our analysis, we rely on the "Brief Hospital Course" (BHC), a mandatory section of the discharge note, as a proxy reference. The BHC tells the story of the patient's admission: what was done to the patient during the hospital admission and why, as well as the follow up steps needed to occur post discharge, whenever needed. Nevertheless, it is recognized as a challenging and time consuming task for clinicians to write (Dodd, 2007;UC Irvine Residency, 2020).

Dataset
To carry out our analysis, we construct a large-scale, multi-document summarization dataset, CLINSUM. Materials come from all hospitalizations between 2010 and 2014 at Columbia University Irving Medical Center. datasets. Relatively speaking, CLINSUM is remarkable for having a very high compression ratio despite having long reference summaries. Additionally, it appears highly extractive with respect to fragment density (we qualify this in Section 4.1). Based on advice from clinicians, we rely on the following subset of note types as source documents: "Admission", "Progress", and "Consult" notes. The dataset does not contain any structured data, documentation from past encounters, or other note types (e.g., nursing notes, social work, radiology reports) (Reichert et al., 2010). Please refer to Appendix A for more details and rationale.

Tools for Analysis
Entity Extraction & Linking. We use the Med-CAT toolkit (Kraljevic et al., 2020) to extract medical entity mentions and normalize to concepts from the UMLS (Unified Medical Language System) terminology (Bodenreider, 2004). To exclude less relevant entities, we only keep entities from the Disorders, Chemicals & Drugs, and Procedures semantic groups, or the Lab Results semantic type.
Local Coherence. We examine inter-sentential coherence in two ways. Next-Sentence Prediction (NSP). Since we compare across a few datasets representing different domains, we use domain-specific pre-trained BERT models via HuggingFace (Wolf et al., 2019): "bert-base-cased" for CNN/DM and Arxiv, "monologg/biobert_v1.1_pubmed" for Pubmed, and "emilyalsentzer/Bio_ClinicalBERT" for CLINSUM. Entity-grids. Entity-grids model local coherence by considering the distribution of discourse entities (Barzilay and Lapata, 2005). An entity grid is a 2-D representation of a text whose entries represent the presence or absence of a discourse entity in a sentence. For our analyses, we treat UMLS concepts as entities and train a neural model, similar to Tien Nguyen and Joty (2017); Joty et al. (2018), which learns to rank the entity grid of a text more highly than the same entity grid whose rows (sentences) have been randomly shuffled. Please see Appendix B for more details.
Extractive Summarization Baselines. We rely on a diverse set of sentence extraction methods, whose performance on a held-out portion of CLIN-SUM is reported in Table 2. Oracle models have access to the ground-truth reference and represent upper bounds for extraction. Here, we define the sentence selection criteria for each oracle variant, leaving more in-depth discussion to the subsequent analysis. ORACLE TOP-K: Take sentences with highest R 12 vis-a-vis the reference until a target token count is reached; ORACLE GAIN: Greedily take source sentence with highest relative R 12 gain conditioned on existing summary 1 . Extract sentences until the change in R 12 is negative; OR-ACLE SENT-ALIGN: For each sentence in reference, take source sentence with highest R 12 score; ORACLE RETRIEVAL: For each sentence in reference, take reference sentence from train set with largest BM25 score (Robertson and Walker, 1994); and ORACLE SENT-ALIGN + RETRIEVAL: For each sentence in reference, take sentence with highest R 12 between ORACLE SENT-ALIGN and ORA-CLE RETRIEVAL. We provide two unsupervised methods as well. RANDOM: extracts random sentences until summary reaches target word count (average summary length); LEXRANK: selects the top-k sentences with largest LexRank (Erkan and Radev, 2004) score until target word count is reached. For a supervised baseline, we present CLINNEUSUM: a variant of the Neusum model adapted to the clinical genre (Zhou et al., 2018). CLINNEUSUM is a hierarchical LSTM network trained on ground-truth labels derived from ORA-CLE GAIN, which we detail in Appendix C.

Dataset Analysis & Implications
To motivate future research in multiple, selfcontained directions, we distill task-specific characteristics to a few salient, standalone takeaways. For each takeaway, we provide evidence in the data and/or literature, before proposing implications of findings on model development and evaluation.
4.1 Summaries are mostly abstractive with a few long segments of copy-pasted text tl;dr. CLINSUM summaries appear extractive according to widely used metrics. Yet, there is large variance within summaries. This directly affects the performance of a supervised extractive model, whose selection capability degrades as summary content transitions from copy-paste to abstractive. In turn, we need models which can handle abrupt transitions between extractive and abstractive text.
Background. Clinicians copy forward information from previous notes to save time and ensure that each note includes sufficient evidence for billing and insurance purposes (Wrenn et al., 2010). Copy-paste is both widely used (66-90% of clinicians according to a recent literature review (Tsou et al., 2017)) and widely applied (a recent study concluded that in a typical note, 18% of the text was manually entered; 46%, copied; and 36% imported 2 (Wang et al., 2017)). Please see Appendix D for more information on the issue of copy-paste.
Analysis -extractiveness. CLINSUM appears very extractive: a high coverage (0.83 avg / 0.13 std) and a very high density (13.1 avg / 38.0 std) (See Grusky et al. (2018) for a description of the statistics). However, we find that 64% of the extractive fragments are unigrams, and 25% are bigrams, which indicate a high level of re-writing. The density measure is large because the remaining 11% of extractive fragments are very long. Yet, there is a strong positional bias within summaries for long fragments. Figure 1, groups fragments according to their relative order within each summary. The longest fragments are usually first. Qualitative analysis confirms that the beginning of the BHC is typically copied from a previous note and conveys the "one-liner" (e.g., pt is a 50yo male with history of CHF who presents with edema.) This abrupt shift in extractiveness should affect content selection. In particular, when look-   ing at oracle extractive strategies, we should see clear-cut evidence of (1) 1-2 sentences which are easy to identify as salient (i.e., high lexical overlap with source due to copy-paste), (2) a murkier signal thereafter. To confirm this, we analyze the sentences selected by the ORACLE GAIN method, which builds a summary by iteratively maximizing the R 12 score of the existing summary vis-a-vis the reference.
In Figure 2, two supporting trends emerge.
(1) On average, one sentence accounts for roughly 50% 3 of the overall R 12 score.
(2) Afterwards, the marginal contribution of the next shrinks, as well as the R 12 gap between the best sentence and the minimum / average, according to the oracle.
There should also be evidence of the copy-paste positional bias impacting content selection. Table 3 reveals that the order in which the ORACLE GAIN summary is built-by maximal lexical overlap with the partially built summary-roughly corresponds to the true ordering of the summary. More simply, the summary transitions from extractive to abstractive. Figure 2: We plot average ROUGE score as summaries are greedily built by adding the sentence with the highest relative ROUGE gain vis-a-vis the current summary, until the gain is no longer positive (ORACLE GAIN). We also include the difference between the highest scoring sentence and the average / minimum to demonstrate a weakening sentence selection signal after the top 1-2. Unsurprisingly, a model (CLINNEUSUM) trained on ORACLE GAIN extractions gets progressively worse at mimicking it. Specifically, for each extractive step, there exists a ground-truth ranking of candidate sentences by relative R 12 gain. As the relevance gap between source sentences shrinks (from Figure 2), CLINNEUSUM's predictions deviate further from the oracle rank (Table 4).
Analysis -Redundancy. Even though we prevent all baseline methods from generating duplicate sentences (23% of source sentences have exact match antecedents), there is still a great deal of redundancy in the source notes (i.e., modifications to copy-pasted text). This causes two issues related to content selection. The first is fairly intuitivethat local sentence extraction propagates severe redundancy from the source notes into the summary and, as a result, produces summaries with low lexical coverage. We confirm this by examining the performance between the ORACLE TOP-K and OR-ACLE GAIN, which represent summary-unaware and summary-aware variants of the same selection

Extractive Average Rank of Closest
Step Reference Sentence 1 4.7 2 6.0 3 6.3 4 6.7 5 7.3 > 5 10.1 Table 3: ORACLE GAIN greedily builds summaries by repeatedly selecting the sentence which maximizes the R 12 score of the partially built summary. By linking each extracted sentence to its closest in the reference, we show that this oracle order is very similar to the true ordering of the summary.

Extractive Ground Truth Rank
Step  method. While both extract sentences with the highest R 12 score, ORACLE GAIN outperforms because it incorporates redundancy by considering the relative R 12 gain from an additional sentence. The second side effect is perhaps more surprising, and divergent from findings in summarization literature. For most corpora, repetition is indicative of salience. In fact, methods based on lexical centrality, i.e., TextRank (Mihalcea and Tarau, 2004) and LexRank (Erkan and Radev, 2004), still perform very competitively for most datasets. Yet, for CLINSUM, LexRank barely outperforms a random baseline. Poor performance is not only due to redundance, but also a weak link between lexical centrality and salience. The Pearson correlation coefficient between a sentence's LexRank score and its R 12 overlap with the reference is statistically significant (p = 0) yet weak (r = 0.29).
Qualitative analysis reveals two principal reasons, both related to copy-paste and/or imported data. The first relates to the propagation of frequently repeated text which may not be useful for summaries: administrative (names, dates), imported structured data, etc. The second relates to sentence segmentation. Even though we use a cus- tom sentence splitter, our notes still contain some very long sentences due to imported lists and semistructured text-a well-documented issue in clinical NLP (Leaman et al., 2015). LexRank summaries have a bias toward these long sentences (26.2 tokens versus source average of 10.9), which have a greater chance of containing lexical centroid(s).
To bypass some of these issues, however, one can examine the link between centrality and salience at the more granular level of entities. Figure 3 shows a clear-cut positive correlation between source note mention frequency of UMLS concepts and the probability of being included in the summary.
Implications. Regarding within-summary variation in extractiveness, we argue for a hybrid approach to balance extraction and abstraction. One of the most widely-used hybrid approaches to generation is the Pointer-Generator (PG) model (See et al., 2017), an abstractive method which allows for copying (i.e., extraction) of source tokens. Another research avenue explicitly decouples the two. These extract-then-abstract approaches come in different flavors: sentence-level re-writing (Chen and Bansal, 2018; Bae et al., 2019), multi-sentence fusion (Lebanoff et al., 2019), and two-step disjoint extractive-abstracive steps (Mendes et al., 2019).
While highly effective in many domains, these approaches do not consider systematic differences in extractiveness within a single summary. To incorporate this variance, one could extend the PG model to copy pre-selected long snippets of text. This would mitigate the problem of copy mechanisms learning to copy very long pieces of text (Gehrmann et al., 2018) -undesirable for the highly abstractive segments of CLINSUM. Span-level extraction is not a new idea (Xu et al., 2020), but, to our knowledge, it has not been studied much in otherwise abstractive settings. For instance, Joshi et al.
(2020) explore patient-doctor conversation summarization and add a penalty to the PG network for over-use of the generator, yet this does not account for intra-summary extractiveness variance.
Regarding redundancy, it is clear that, in contrast to some summarization tasks (Kedzie et al., 2018), summary-aware content selection is essential for hospital course summarization. Given so much noise, massive EHR and cite-specific preprocessing is necessary to better understand the signal between lexical centrality and salience.
4.2 Summaries are concise yet comprehensive tl;dr. BHC summaries are packed with medical entities, which are well-distributed across the source notes. As such, relations are often not explicit. Collectively, this difficult task calls for a domain-specific approach to assessing faithfulness.
Analysis -concise We find that summaries are extremely dense with medical entities: 20.9% of summary words are medical UMLS entities, compared to 14.1% in the source notes. On average, summaries contain 26 unique entities whereas the source notes contain 265 -an entity compression ratio of 10 (versus token-level compression of 43).
Analysis -comprehensive. Many summarization corpora exhibit systematic biases regarding where summary content can be found within source document(s) (Dey et al., 2020). On CLINSUM, we examine the distribution of entities along two dimensions: macro considers the differences in entity share across notes, and micro considers the differences within each note (i.e., lead bias).
(1) Macro Ordering. When looking at the source notes one by one, how much additional relevant information (as measured by entities present in the summary) do you get from each new note? We explore three different orderings: (1) FORWARD orders the notes chronologically, (2) BACKWARD the reverse, and (3) GREEDY ORACLE examines notes in order of decreasing entity entity overlap with the target. Given the large variation in number of notes per admission, we normalize by binning notes into deciles. Figure 4 shows that it is necessary to read the entire set of notes despite diminishing marginal returns. One might expect the most recent notes to have the most information, considering present as well as copy-forwarded text. Surprisingly, FORWARD and BACKWARD distributions are very similar. GREEDY ORACLE gets at the level of information concentration. On average, the top 10% of most informative notes cover just over half of the entities found in the summary. We include absolute and percentage counts in Table 5.
(2) Micro Ordering. We plot a normalized histogram of summary entities by relative position within the source documents. Figure 5 reveals a slight lead bias, followed by an uptick toward the end. Clinical notes are organized by section: often starting with the past medical history and present illness, and typically ending with the plan for future care.
All are needed to write a complete BHC.   Implications. The fact that entities are so densely packed in summaries makes models more susceptible to factual errors that misrepresent complex relations. On the CNN/DailyMail dataset, Goel et al. (2021) reveal performance degradation as a function of the number of entities. This is magnified for clinical text, where failure to identify which treatments were tolerated or discontinued, or to differentiate conditions of the patient or family member, could lead to serious treatment errors.
Recently, the summarization community has explored fact-based evaluation. Yet, many of the proposed methods treat global evaluation as the independent sum of very local assessments. In the case of QA-based methods, it is a quiz-like aggregation of individual scores to fairly narrow questions that usually seek to uncover the presence or absence of a single entity or relation. Yet, factoid (Chen et al., 2018), cloze-style (Eyal et al., 2019;Scialom et al., 2019;Deutsch et al., 2020), or mask-conditioned question generation (Durmus et al., 2020) may not be able to directly assess very fine-grained temporal and knowledge-intensive dependencies within a summary. This is a natural byproduct of the fact that many of the factuality assessments were developed for shorter summarization tasks (i.e., headline generation) in the news domain (Cao et al., 2018b;Kryscinski et al., 2019;Maynez et al., 2020). Entailment-based measures to assess faithfulness (Pasunuru and Bansal, 2018;Welleck et al., 2019) can capture complex dependencies yet tend to rely heavily on lexical overlap without deep reasoning (Falke et al., 2019).
Taken together, we argue for the development of fact-based evaluation metrics which encode a deeper knowledge of clinical concepts and their complex semantic and temporal relations 4 .

Summaries have different style and
content organization than source notes tl;dr. Hospital course summarization involves not only massive compression, but a large style and organization transfer. Source notes are written chronologically yet the way clinicians digest the information, and write the discharge summary, is largely problem-oriented. With simple oracle analysis, we argue that retrieve-edit frameworks are well-suited for hospital course generation.
Analysis -Style. Clinical texts contain many, often obscure, abbreviations (Finley et al., 2016;Adams et al., 2020), misspellings, and sentence fragments (Demner-Fushman et al., 2009). Using a publicly available abbreviation inventory (Moon et al., 2014), we find that abbreviations are more common in the BHC. Furthermore, summary sentences are actually longer on average than source sentences (15.8 versus 12.4 words).
Analysis -Organization. Qualitative analysis confirms that most BHCs are written in a problemoriented fashion (Weed, 1968), i.e., organized around a patient's disorders. To more robustly analyze content structure, we compare linked UMLS entities at the semantic group level: DRUGS, DIS-ORDERS, and PROCEDURES (McCray et al., 2001).
In particular, we compare global proportions of semantic groups, transitions between entities, as well as positional proportions within summaries.
(1) Global. Procedures are relatively more prevalent in summaries (31% versus 24%), maybe because of the emphasis on events happening during the hospitalization. In both summary and source notes, DISORDERS are the most prevalent (54% and 46%, respectively). Drugs make up 23% and 22% of entity mentions in summary and source notes, respectively.
(2) Transitions. From both source and summary text, we extract sequences of entities and record adjacent transitions of their semantic groups in a 3 × 3 matrix. Figure 7 indicates that summaries have fewer clusters of semantically similar entities (diagonal of the transition matrix). This transition matrix suggests a problem-oriented approach in which disorders are interleaved with associated medications and lab results.
(3) Positional. Finally, within summaries, we examine the positional relative distribution of semantic groups and connect it to findings from Section 4.1. In Figure 6, we first compute the start index of each clinical entity, normalized by the total length, and then group into ten equally sized bins. The early prevalence of disorders and late prevalence of medications is expected, yet the difference is not dramatic. This suggests an HPI-like statement up front, followed by a problem oriented narrative. If there is a material transfer in style and content, we would expect that summaries constructed  from other summaries in the dataset would have similar or better lexical coverage than summaries constructed from sentences in the source notes. To assess this, we compare two oracle baselines, SENT-ALIGN and RETRIEVAL. For each sentence in the summary, we find its closest corollary either in the source text (SENT-ALIGN) or in other summaries in the dataset (RETRIEVAL). While the retrieval method is at a distinct disadvantage because it does not contain patient-specific information and retrieval is performed with BM25 scores, we find both methods yield similar results ( Table  2). An ensemble of SENT-ALIGN and RETRIEVAL performs better than either alone, suggesting that the two types of sources may be complementary. 82% of this oracle's summary sentences are retrievals. Summaries adapt the style and problemoriented structure of other summaries, but contain patient-specific information from the source notes.
Implications. Hospital-course summaries weave together disorders, medications, and procedures in a problem-oriented fashion. It is clear that substantial re-writing and re-organization of source content is needed. One suitable approach is to use the retrieve-rerank-rewrite (R 3 ) framework proposed by Cao et al. (2018a). To support this notion, more recent work demonstrates that retrieval augmented generation is effective for knowledgeintensive tasks (Lewis et al., 2020b), enhances sys- Figure 8: NSP logit by relative position of the next sentence across summaries for several datasets. An offset of 1 corresponds to the true next sentence. tem interpretability (Guu et al., 2020;Krishna et al., 2020), and can improve LM pre-training (Lewis et al., 2020a) 5 . Also, efforts to bridge the gap between template-based and abstractive generation have been successful in the medical domain for image report generation (Li et al., 2018).
In this light, BHC generation could be truly problem-oriented. The first step would involve selecting salient problems (i.e., disorders) from the source text-a well-defined problem with proven feasibility (Van Vleck and Elhadad, 2010). The second step would involve separately using each problem to retrieve problem-specific sentences from other summaries. These sentences would provide clues to the problem's relevant medications, procedures, and labs. In turn, conceptual overlap could be used to re-rank and select key, problem-specific source sentences. The extracted sentences would provide the patient-specific facts necessary to rewrite the problem-oriented retrieved sentences.

Summaries exhibit low lexical cohesion
tl;dr. Lexical cohesion is sub-optimal for evaluating hospital-course discourse because clinical summaries naturally exhibit frequent, abrupt topic shifts. Also, low correlation exists between lexical overlap and local coherence metrics.
Analysis. Entity-based coherence research posits that "texts about the same discourse entity are perceived to be more coherent than texts fraught with abrupt switches from one topic to the next" (Barzilay and Lapata, 2005). Yet, for CLINSUM summaries, coherence and abrupt topic shifts are not mutually exclusive. An analysis of the entity grids of summaries, presumably coherent, are sparse, with few lexical chains. In fact, over 66% of the entities in the BHC appear only once. Of those with multiple mentions, the percentage which appear in adjacent sentences is only 9.6%. As in Prabhumoye et al. (2020), we also compare coherence with next-sentence prediction (NSP). Figure  8 plots the NSP logit by positional offset, where an offset of 1 corresponds to the next sentence, and -1 to the previous. NSP relies on word overlap and topic continuity (Bommasani and Cardie, 2020), so it makes sense it is lowest for CLINSUM.
To confirm the hypothesis that ROUGE does not adequately capture content structure, we use the pairwise ranking approach to train and evaluate an entity-grid based neural coherence model (Barzilay and Lapata, 2005;Tien Nguyen and Joty, 2017). Table 6 shows ROUGE and coherence metrics sideby-side for ORACLE GAIN, which naively orders sentences according to document timestamp, then within-document position, and ORACLE SENT-ALIGN, which maintains the structure of the original summary. The poor coherence of ORACLE GAIN is obscured by comparable ROUGE scores.

Summary
Acc. R1  Implications. Content organization is critical and should be explicitly evaluated. A wellestablished framework for assessing organization and readability is coherence. A large strand of work on modeling coherent discourse has focused on topical clusters of entities (Azzam et al., 1999;Barzilay and Elhadad, 2002;Barzilay and Lee, 2004;Okazaki et al., 2004). Yet, as shown above, CLINSUM summaries exhibit abrupt topic shifts and contain very few repeated entities. The presence and distribution of lexical (Morris and Hirst, 1991;Barzilay and Elhadad, 1997) or co-referential (Azzam et al., 1999) chains, then, might not be an appropriate proxy for clinical summary coherence. Rather, we motivate the development of problemoriented models of coherence, which are associative in nature, and reflect a deeper knowledge about the relationship between disorders, medications, and procedures. The impetus for task-tailored evaluation metrics is supported by recent meta analyses (Fabbri et al., 2020;Bhandari et al., 2020).

BHC summaries are silver-standard
tl;dr. Discharge summaries and their associated BHC sections are frequently missing critical information or contain excessive or erroneous content.
Modeling efforts should address sample quality.
Analysis. Kripalani et al. (2007) find that discharge summaries often lack important information including diagnostic test results (33-63% missing) treatment or hospital course (7-22%), discharge medications (2-40%), test results pending at discharge (65%), patient/family counseling (90-92%), and follow-up plans (2-43%). The quality of the reporting decreases as the length of the discharge summary increases, likely due to copy-pasted information (van Walraven and Rokosh, 1999). These quality issues occur for a number of reasons: (1) limited EHR search functionality makes it difficult for clinicians to navigate through abundant patient data (Christensen and Grimsmo, 2008); (2) multiple clinicians contribute to incrementally documenting care throughout the patient's stay; (3) despite existing guidance for residents, clinicians receive little to no formal instruction in summarizing patient information ; and (4) clinicians have little time for documenting care.
Implications. Noisy references can harm model performance, yet there is a rich body of literature to show that simple heuristics can identify good references (Bommasani and Cardie, 2020) and/or filter noisy training samples (Rush et al., 2015b;Akama et al., 2020;Matsumaru et al., 2020). Similar strategies may be necessary for hospital-course generation with silver-standard data. Another direction is scalable reference-free evaluations (ShafieiBavani et al., 2018;Hardy et al., 2019;Sellam et al., 2020;Gao et al., 2020;Vasilyev et al., 2020).

Conclusion
Based on a comprehensive analysis of clinical notes, we identify a set of implications for hospitalcourse summarization on future research. For modeling, we motivate (1) the need for dynamic hybrid extraction-abstraction strategies (4.1); (2) retrievalaugmented generation (4.3); and (3) the development of heuristics to assess reference quality (4.5). For evaluation, we argue for (1) methods to assess factuality and discourse which are associative in nature, i.e., incorporate the complex inter-dependence of problems, medications, and labs (4.2, 4.4); and (2) scalable reference-free metrics (4.5).

Ethical Considerations
Dataset creation. Our CLINSUM dataset contains protected health information about patients. We have received IRB approval through our institution to access this data in a HIPAA-certified, secure environment. To protect patient privacy, we cannot release our dataset, but instead describe generalizable insights that we believe can benefit the general summarization community as well as other groups working with EHR data. Intended Use & Failure Modes. The ultimate goal of this work is to produce a summarizer that can generate a summary of a hospital course, and thus support clinicians in this cognitively difficult and time-consuming task. While this work is a preface to designing such a tool, and significant advances will be needed to achieve the robustness required for deployment in a clinical environment, it is important to consider the ramifications of this technology at this stage of development. We can learn from existing clinical summarization deployed (Pivovarov et al., 2016) and other datadriven clinical decision support tools (Chen et al., 2020). As with many NLP datasets, CLINSUM likely contains biases, which may be perpetuated by its use. There are a number of experiments we plan to carry out to identify documentation biases and their impact on summarization according to a number of dimensions such as demographics (e.g., racial and gender), social determinents of health (e.g., homeless individuals), and clinical biases (e.g., patients with rare diseases). Furthermore, deployment of an automatic summarizer may lead to automation bias (Goddard et al., 2012), in which clinicians over rely on the automated system, despite controls measures or verification steps that might be built into a deployed system. Finally, medical practices and EHRs systems constantly change, and this distribution drift can cause models to fail if they are not updated. As the NLP community continues to develop NLP applications in safety-critical domains, we must carefully study how can can build robustness, fairness, and trust into these systems.  Table 7: Basic statistics for single-document (SDS) and multi-document (MDS) summarization datasets. For multi-document summarization (MDS), # Source words are aggregated across documents. Compression ratio is the average ratio of source words to summary words. Extractiveness metrics (coverage and density) come from Grusky et al. (2018) and, for consistency, are calculated using the official code across the validation set for each dataset. Spacy tokenization is performed before extracting fragments. Other corpus statistics are pulled from either the corresponding paper or Table 1 in Sharma et al. (2019). Entries are filled with N/A because the dataset is private (Krishna et al., 2020), or too expensive to generate (Liu et al., 2018a). The Gigaword SDS dataset comes from the annotated Gigaword dataset (Graff et al., 2003;Napoles et al., 2012) . and "Consult notes", which document specialist consultations. The dataset does not contain any structured data, documentation from past encounters, or other note types (e.g., nursing notes, social work, radiology reports) (Reichert et al., 2010). 6 Additionally, we remove all visits without at least one source note and at least one Brief Hospital Course target section and exclude notes with less than 25 characters. For computational and modeling feasibility, we bound the minimum and maximum lengths for the source and target texts. We exclude visits where the source notes are collectively over 20, 000 tokens (< 10% of visits) or are shorter than the Brief Hospital Course. Finally, we exclude visits where the Brief Hospital Course section is less than 25 characters and greater than 500 tokens to remove any incorrectly parsed BHC sections.

B Local Coherence Model Details
The underlying premise of the entity-grid model is that "the distribution of entities in locally coherent texts exhibits certain regularities" (Barzilay and Lapata, 2005). The paper defines entities as coreferent noun phrases, while we use UMLS entities. Additionally, Barzilay and Lapata (2005) add syntactic role information to the grid entries, whereas, without reliable parses, we denote a binary indicator of entity presence. As is common practice, we learn 6 We note that most structured data fields are now automatically imported into input notes, and, similarly, findings of reports are available in notes.
to rank the entity grid of a text more highly than the same entity grid whose rows (sentences) have been randomly shuffled. Inspired by Joty et al. (2018), we first project the entity grid entries onto a shared embedding space whose vocabulary consists of all the UMLS CUIs and a special <empty> token. As in Tien Nguyen and Joty (2017), we then learn features of original and permuted embedded grids by separately applying 1-D convolutions. Finally, scalars produced by the siamese convolutional networks are used for pairwise ranking.

C CLINNEUSUM Details
As in Nallapati et al. (2017) and Zhou et al. (2018), we extract ground truth extraction labels by greedily selecting sentences which maximize the relative ROUGE gain (R 12 ) from adding an additional sentence to an existing summary. We use basic heuristics to scale the model efficiently to a dataset with such a large set of candidate sentences. To avoid inclusion of sentences with spurious, small relevance based on ROUGE, we filter out extractive steps with a weak learning signal -extractive steps for which either the ROUGE improvement of the highest scoring sentence is less than 1%, or the differential between the least and most relevant sentences is less than 2%. Furthermore, based on manual evaluation, we take steps to reduce the size of the candidate sentence set provided to the model sees during training. First, we de-duplicate sentences and remove sentences with no alphabetical letters or a token count less than 3. Then, we randomly re-