Yanjun Gao


2023

Multi-Task Training with In-Domain Language Models for Diagnostic Reasoning
Brihat Sharma | Yanjun Gao | Timothy Miller | Matthew Churpek | Majid Afshar | Dmitriy Dligach
Proceedings of the 5th Clinical Natural Language Processing Workshop

Generative artificial intelligence (AI) is a promising direction for augmenting clinical diagnostic decision support and reducing diagnostic errors, a leading contributor to medical errors. To further the development of clinical AI systems, the Diagnostic Reasoning Benchmark (DR.BENCH) was introduced as a comprehensive generative AI framework comprising six tasks that represent key components of clinical reasoning. We present a comparative analysis of in-domain versus out-of-domain language models, as well as multi-task versus single-task training, with a focus on the problem summarization task in DR.BENCH. We demonstrate that a multi-task, clinically trained language model outperforms its general-domain counterpart by a large margin, establishing a new state of the art with a ROUGE-L score of 28.55. This research underscores the value of domain-specific training for optimizing clinical diagnostic reasoning tasks.

Improving the Transferability of Clinical Note Section Classification Models with BERT and Large Language Model Ensembles
Weipeng Zhou | Majid Afshar | Dmitriy Dligach | Yanjun Gao | Timothy Miller
Proceedings of the 5th Clinical Natural Language Processing Workshop

Text in electronic health records is organized into sections, and classifying those sections into section categories is useful for downstream tasks. In this work, we attempt to improve the transferability of section classification models by combining the dataset-specific knowledge in supervised learning models with the world knowledge inside large language models (LLMs). Surprisingly, we find that zero-shot LLMs outperform supervised BERT-based models applied to out-of-domain data. We also find that their strengths are synergistic, so that a simple ensemble technique leads to additional performance gains.

Overview of the Problem List Summarization (ProbSum) 2023 Shared Task on Summarizing Patients’ Active Diagnoses and Problems from Electronic Health Record Progress Notes
Yanjun Gao | Dmitriy Dligach | Timothy Miller | Majid Afshar
The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks

The BioNLP Workshop 2023 launched a shared task on Problem List Summarization (ProbSum) in January 2023. The aim of this shared task is to attract future research efforts in building NLP models for real-world diagnostic decision support applications, where a system generating relevant and accurate diagnoses will augment the healthcare providers’ decision-making process and improve the quality of care for patients. The goal for participants is to develop models that generate a list of diagnoses and problems using input from the daily care notes collected from the hospitalization of critically ill patients. Eight teams submitted their final systems to the shared task leaderboard. In this paper, we describe the tasks, datasets, evaluation metrics, and baseline systems. Additionally, we summarize the techniques used by the participating teams and the evaluation results of their approaches.

2022

Hierarchical Annotation for Building A Suite of Clinical Natural Language Processing Tasks: Progress Note Understanding
Yanjun Gao | Dmitriy Dligach | Timothy Miller | Samuel Tesch | Ryan Laffin | Matthew M. Churpek | Majid Afshar
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Applying natural language processing methods to electronic health record (EHR) data has attracted rising interest. Existing corpora and annotations focus on modeling textual features and relation prediction. However, there is a paucity of annotated corpora built to model clinical diagnostic thinking, a process involving text understanding, domain knowledge abstraction, and reasoning. In this work, we introduce a hierarchical annotation schema with three stages to address clinical text understanding, clinical reasoning, and summarization. We create an annotated corpus based on a large collection of publicly available daily progress notes, a type of EHR documentation that is time-sensitive, problem-oriented, and well-documented in the format of Subjective, Objective, Assessment and Plan (SOAP). We also define a new suite of tasks, Progress Note Understanding, with three tasks utilizing the three annotation stages. This new suite aims at training and evaluating future NLP models for clinical text understanding, clinical knowledge representation, inference, and summarization.

Proceedings of TextGraphs-16: Graph-based Methods for Natural Language Processing
Dmitry Ustalov | Yanjun Gao | Alexander Panchenko | Marco Valentino | Mokanarangan Thayaparan | Thien Huu Nguyen | Gerald Penn | Arti Ramesh | Abhik Jana
Proceedings of TextGraphs-16: Graph-based Methods for Natural Language Processing

Summarizing Patients’ Problems from Hospital Progress Notes Using Pre-trained Sequence-to-Sequence Models
Yanjun Gao | Dmitriy Dligach | Timothy Miller | Dongfang Xu | Matthew M. Churpek | Majid Afshar
Proceedings of the 29th International Conference on Computational Linguistics

Automatically summarizing patients’ main problems from daily progress notes using natural language processing methods helps to combat information and cognitive overload in hospital settings and potentially assists providers with computerized diagnostic decision support. Problem list summarization requires a model to understand, abstract, and generate clinical documentation. In this work, we propose a new NLP task that aims to generate a list of problems in a patient’s daily care plan using input from the provider’s progress notes during hospitalization. We investigate the performance of T5 and BART, two state-of-the-art seq2seq transformer architectures, in solving this problem. We provide a corpus built on top of publicly available electronic health record progress notes in the Medical Information Mart for Intensive Care (MIMIC)-III. T5 and BART are trained on general domain text, and we experiment with a data augmentation method and a domain adaptation pre-training method to increase exposure to medical vocabulary and knowledge. Evaluation methods include ROUGE, BERTScore, cosine similarity on sentence embeddings, and F-score on medical concepts. Results show that T5 with domain adaptive pre-training achieves significant performance gains compared to a rule-based system and general domain pre-trained language models, indicating a promising direction for tackling the problem summarization task.

2021

ABCD: A Graph Framework to Convert Complex Sentences to a Covering Set of Simple Sentences
Yanjun Gao | Ting-Hao Huang | Rebecca J. Passonneau
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Atomic clauses are fundamental text units for understanding complex sentences. Identifying the atomic sentences within complex sentences is important for applications such as summarization, argument mining, discourse analysis, discourse parsing, and question answering. Previous work mainly relies on rule-based methods dependent on parsing. We propose a new task to decompose each complex sentence into simple sentences derived from the tensed clauses in the source, and a novel problem formulation as a graph edit task. Our neural model learns to Accept, Break, Copy or Drop elements of a graph that combines word adjacency and grammatical dependencies. The full processing pipeline includes modules for graph construction, graph editing, and sentence generation from the output graph. We introduce DeSSE, a new dataset designed to train and evaluate complex sentence decomposition, and MinWiki, a subset of MinWikiSplit. ABCD achieves performance comparable to that of two parsing baselines on MinWiki. On DeSSE, which has a more even balance of complex sentence types, our model achieves higher accuracy on the number of atomic sentences than an encoder-decoder baseline. Results include a detailed error analysis.

Learning Clause Representation from Dependency-Anchor Graph for Connective Prediction
Yanjun Gao | Ting-Hao Huang | Rebecca J. Passonneau
Proceedings of the Fifteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-15)

Semantic representation that supports the choice of an appropriate connective between pairs of clauses inherently addresses discourse coherence, which is important for tasks such as narrative understanding, argumentation, and discourse parsing. We propose a novel clause embedding method that applies graph learning to a data structure we refer to as a dependency-anchor graph. The dependency-anchor graph incorporates two kinds of syntactic information, constituency structure and dependency relations, to highlight the subject and verb phrase relation. This enhances coherence-related aspects of representation. We design a neural model to learn a semantic representation for clauses from graph convolution over latent representations of the subject and verb phrase. We evaluate our method on two new datasets: a subset of a large corpus where the source texts are published novels, and a new dataset collected from students’ essays. The results demonstrate a significant improvement over tree-based models, confirming the importance of emphasizing the subject and verb phrase. The performance gap between the two datasets illustrates the challenges of analyzing students’ written text, and points to a potential evaluation task for coherence modeling and an application for suggesting revisions to students.

2019

Automated Pyramid Summarization Evaluation
Yanjun Gao | Chen Sun | Rebecca J. Passonneau
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Pyramid evaluation was developed to assess the content of paragraph-length summaries of source texts. A pyramid lists the distinct units of content found in several reference summaries, weights content units by how many reference summaries they occur in, and produces three scores based on the weighted content of new summaries. We present an automated method that is more efficient, more transparent, and more complete than previous automated pyramid methods. It is tested on a new dataset of student summaries and on historical NIST data from extractive summarizers.
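The weighting idea described in this abstract can be sketched in a few lines. This is only an illustrative toy, not the paper's method: it assumes content units have already been identified and are given as plain string labels (in practice, identifying them is the hard part), and the function names `build_pyramid`, `raw_score`, and `normalized_score` are hypothetical. It shows a raw weighted sum and one possible normalization, not the three scores the abstract mentions.

```python
from collections import Counter


def build_pyramid(reference_summaries):
    """Weight each content unit by the number of reference summaries containing it.

    Assumption: each reference summary is a list of pre-identified content-unit labels.
    """
    weights = Counter()
    for units in reference_summaries:
        # set() ensures a unit counts at most once per reference summary
        weights.update(set(units))
    return dict(weights)


def raw_score(pyramid, summary_units):
    """Sum the weights of the distinct content units found in a new summary."""
    return sum(pyramid.get(unit, 0) for unit in set(summary_units))


def normalized_score(pyramid, summary_units):
    """Raw score divided by the best raw score attainable with the same number of units.

    Illustrative normalization only; not claimed to match any score in the paper.
    """
    n_units = len(set(summary_units))
    if n_units == 0:
        return 0.0
    best = sum(sorted(pyramid.values(), reverse=True)[:n_units])
    return raw_score(pyramid, summary_units) / best if best else 0.0


# Hypothetical toy data: three reference summaries and one new summary
references = [["a", "b", "c"], ["a", "b"], ["a", "d"]]
pyramid = build_pyramid(references)
new_summary = ["a", "c", "x"]  # "x" does not appear in any reference
print(pyramid, raw_score(pyramid, new_summary), normalized_score(pyramid, new_summary))
```

Units that appear in more reference summaries sit higher in the pyramid, so a new summary is rewarded more for expressing widely shared content than for expressing content found in a single reference.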

Rubric Reliability and Annotation of Content and Argument in Source-Based Argument Essays
Yanjun Gao | Alex Driban | Brennan Xavier McManus | Elena Musi | Patricia Davies | Smaranda Muresan | Rebecca J. Passonneau
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

We present a unique dataset of student source-based argument essays to facilitate research on the relations between content, argumentation skills, and assessment. Two classroom writing assignments were given to college students in a STEM major, accompanied by a carefully designed rubric. The paper presents a reliability study of the rubric, showing it to be highly reliable, and an initial annotation of the content and argumentation of the essays.

2018

Automated Content Analysis: A Case Study of Computer Science Student Summaries
Yanjun Gao | Patricia M. Davies | Rebecca J. Passonneau
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

Technology is transforming learning and teaching in higher education. This paper reports on a project to examine how and why automated content analysis could be used to assess précis writing by university students. We examine the case of one hundred and twenty-two summaries written by computer science freshmen. The texts, which had been hand-scored using a teacher-designed rubric, were autoscored using the natural language processing software PyrEval. Pearson’s correlation coefficient and Spearman rank correlation were used to analyze the relationship between the teacher score and the PyrEval score for each summary. Three content models automatically constructed by PyrEval from different sets of human reference summaries led to consistent correlations, showing that the approach is reliable. We also observed that, in cases where student assessment centers on formative feedback, categorizing the PyrEval scores by examining the average and standard deviations could lead to novel interpretations of their relationships. We suggest that this project has implications for the ways in which automated content analysis could be used to help university students improve their summarization skills.
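The two correlation measures named in this abstract can be illustrated with a small, dependency-free sketch. The scores below are made-up placeholders, not data from the study, and the helper names `pearson`, `average_ranks`, and `spearman` are hypothetical. Spearman rank correlation is computed here as Pearson's coefficient over average ranks, which is the standard definition in the presence of ties.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient for two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson's coefficient applied to average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical placeholder scores (not from the paper)
teacher_scores = [1, 2, 3]
auto_scores = [1, 4, 9]
print(pearson(teacher_scores, auto_scores), spearman(teacher_scores, auto_scores))
```

In this toy case the automated scores rise monotonically but not linearly with the teacher scores, so the rank correlation is perfect while Pearson's coefficient falls slightly below 1, which is one reason to report both.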

PyrEval: An Automated Method for Summary Content Analysis
Yanjun Gao | Andrew Warner | Rebecca Passonneau
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)