Joint Summarization-Entailment Optimization for Consumer Health Question Understanding

Understanding the intent of medical questions asked by patients, or Consumer Health Questions, is an essential skill for medical Conversational AI systems. We propose a novel, simple, data-augmented joint learning approach combining question summarization and Recognizing Question Entailment (RQE) in the medical domain. Our data augmentation approach makes it possible to use just one dataset for joint learning. We show improvements on both tasks across four biomedical datasets in accuracy (+8%), ROUGE-1 (+2.5%) and human evaluation scores. Human evaluation shows that joint learning generates faithful and informative summaries. Finally, we release our code, the two question summarization datasets extracted from a large-scale medical dialogue dataset, as well as our augmented datasets.


Introduction
To answer questions, Conversational AI systems first have to understand the intent of those questions (Chen et al., 2012; Cai et al., 2017). This is particularly important for medical conversational agents (Wu et al., 2020), as Consumer Health Questions (CHQ) are often long and contain peripheral information not needed to answer the question. Approaches to medical question understanding include query relaxation (Ben Abacha and Zweigenbaum, 2015; Lei et al., 2020), question entailment recognition (Ben Abacha and Demner-Fushman, 2016, 2019b; Agrawal et al., 2019) and summarization (Ben Abacha and Demner-Fushman, 2019a).
We approach the problem of medical question understanding using joint learning of medical question pairs in the two tasks of question summarization and Recognizing Question Entailment (RQE). Previous work on combining summarization and entailment uses at least two datasets, one for each task. We start from the observation that, given a pair of questions A and B, where A is the longer question, A entails B if and only if B is a summary of A. Using this observation, we propose a data augmentation scheme to use a single dataset for joint learning, instead of two. Then, we propose a simple, simultaneous joint learning approach with fully shared model parameters. Our code is available at https://github.com/KhalilMrini/Medical-Question-Understanding.
Our findings show that joint learning performs significantly better than single-task training. Our joint learning approach brings about an 8% increase in accuracy in the RQE task compared to single-task training, and shows an average increase of 2.5% in ROUGE-1 F1 scores across three medical question summarization datasets. Additionally, we perform human evaluation and find that our approach generates more informative question summaries. Our results suggest that the RQE objective makes our summaries more similar in style to the CHQ. Finally, we release the two consumer health question summarization datasets we extracted from an existing large-scale medical dialogue dataset, our augmented datasets and our code.

Recognizing Question Entailment (RQE)
The task of RQE was introduced by Ben Abacha and Demner-Fushman (2016) in the context of medical question answering. It is closely related to the task of Recognizing Textual Entailment (RTE) (Dagan et al., 2005, 2013), and to early definitions of question entailment (Groenendijk and Stokhof, 1984; Roberts, 1996). Ben Abacha and Demner-Fushman (2016) define RQE as follows: given a pair of questions A and B, question A entails question B if every answer to B is also a correct answer to A, whether it answers A partially or fully.

Transfer Learning for Medical QA
Language models that use multi-task learning and transfer learning have become ubiquitous in various NLP applications, including BioNLP. BERT (Devlin et al., 2019) has been fine-tuned using biomedical text from PubMed (Beltagy et al., 2019), PMC (Lee et al., 2020), and/or the MIMIC III dataset (Johnson et al., 2016; Huang et al., 2019; Alsentzer et al., 2019). In this paper, we use pre-trained BART models (Lewis et al., 2019).
Transfer learning was a popular approach at the 2019 MEDIQA shared task (Ben Abacha et al., 2019) on medical NLI, RQE and QA. The question answering task involved re-ranking answers, not generating them (Demner-Fushman et al., 2020). For the RQE task, the best-performing model (Zhu et al., 2019) uses transfer learning on NLI and ensemble methods.
In contemporaneous work of ours (Mrini et al., 2021), we participate in the question summarization task of the 2021 MEDIQA shared task (Ben Abacha et al., 2021). We show that transfer learning using medical RQE can improve performance on medical question summarization.
Falke et al. (2019) use textual entailment predictions to detect factual errors in abstractive summaries generated by state-of-the-art models. Pasunuru and Bansal (2018) propose an entailment reward for their abstractive summarizer, where the entailment score is obtained from a pre-trained and frozen natural language inference model. Pasunuru et al. (2017) propose an LSTM encoder-decoder model that incorporates entailment generation and abstractive summarization. They use separate natural language inference and summarization datasets, and train by alternating between the two objectives. Guo et al. (2018) build upon the work of Pasunuru et al. (2017), and add question generation as an auxiliary task. Li et al. (2018) propose an encoder-decoder summarization model, with an entailment-aware encoder with a separate classification module, and an entailment-rewarded decoder. They closely follow the multi-task setting of Pasunuru et al. (2017).

Joint Learning for Consumer Health Question Understanding
We consider the joint learning of medical question summarization and Recognizing Question Entailment (RQE). In both tasks, a question pair includes a first medical question, written in an informal style by a patient and thus called a Consumer Health Question (CHQ). The second medical question is shorter, and often written in a formal style by medical experts: it is a Frequently Asked Question (FAQ). The inspiration for our joint learning scheme stems from the observation that a CHQ entails an FAQ if and only if the FAQ is a summary of the CHQ. Our data-augmented joint learning approach to consumer health question understanding has two main components. First, we use our equivalence observation to propose a scheme for data augmentation. Second, we present our joint learning model architecture and learning objective.

Data Augmentation
Instead of using separate datasets as in previous work, we propose to augment each dataset so that it can be used for joint training, such that we have the same number of summarization and RQE pairs.
For summarization datasets, we create equivalent RQE pairs. For each existing summarization pair, we first choose with equal probability whether the equivalent RQE pair is labeled as entailment or not. If it is an entailment case, we create an RQE pair identical to the summarization pair. If it is not an entailment case, the CHQ of the RQE pair is identical to the CHQ of the summarization pair, and the FAQ of the RQE pair is a different FAQ, randomly selected from the FAQs of the same dataset split.
Inversely, for the RQE dataset, we create equivalent summarization pairs. For each existing RQE pair, we consider two cases. If the RQE pair is labeled as entailment, we create an identical summarization pair. If the RQE pair is labeled as not entailment, we create a summarization pair that is identical to a randomly selected entailment-labeled RQE pair from the same dataset split.
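The two augmentation directions described above can be sketched in a few lines of Python. This is a minimal illustration and not our released implementation: the class and function names (`SummPair`, `RQEPair`, `summarization_to_rqe`, `rqe_to_summarization`) are hypothetical, and the fallback to an entailment label when a split contains no other FAQ is an assumption made only to keep the sketch well defined.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SummPair:
    chq: str
    faq: str


@dataclass(frozen=True)
class RQEPair:
    chq: str
    faq: str
    entails: bool


def summarization_to_rqe(pairs, rng):
    """Create one RQE pair per summarization pair of a single dataset split."""
    faqs = [p.faq for p in pairs]
    rqe_pairs = []
    for p in pairs:
        others = [f for f in faqs if f != p.faq]
        # Entailment vs. not-entailment is chosen with equal probability.
        # Assumption: fall back to entailment if no different FAQ exists.
        if rng.random() < 0.5 or not others:
            rqe_pairs.append(RQEPair(p.chq, p.faq, True))
        else:
            # Same CHQ, but a different FAQ drawn from the same split.
            rqe_pairs.append(RQEPair(p.chq, rng.choice(others), False))
    return rqe_pairs


def rqe_to_summarization(pairs, rng):
    """Create one summarization pair per RQE pair of a single dataset split."""
    positives = [p for p in pairs if p.entails]
    if not positives:
        raise ValueError("need at least one entailment-labeled pair")
    summ_pairs = []
    for p in pairs:
        # Not-entailment pairs are replaced by a random entailment pair.
        source = p if p.entails else rng.choice(positives)
        summ_pairs.append(SummPair(source.chq, source.faq))
    return summ_pairs
```

Both functions return exactly one new pair per input pair, so the augmented dataset holds the same number of summarization and RQE pairs, and every summarization pair it contains is an entailing pair.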

Joint Model
We adopt the architecture of BART Large (Lewis et al., 2019), a model that set a new state of the art in XSum (Narayan et al., 2018) and CNN-Dailymail (Hermann et al., 2015), two popular abstractive summarization benchmark datasets.
BART is an encoder-decoder seq2seq model that can be trained on generation as well as classification tasks, such as RQE. BART trains for abstractive summarization by feeding the source text (CHQ) to the encoder, and the negative log-likelihood loss is computed between the decoder output and the reference summary (FAQ). BART trains for classification by feeding the full input to the encoder; in the case of RQE, the full input is the concatenation of the CHQ and FAQ. An added classification head attached to the last decoder output then generates a prediction. We compute the binary cross-entropy loss based on the classification head's prediction and the RQE label. We show an overview of our joint training in Figure 1.

Figure 1: An example medical question pair. The first question is a Consumer Health Question (CHQ) and the second question is a Frequently Asked Question (FAQ). We use BART (Lewis et al., 2019) to jointly train question summarization (bottom) and RQE (top). We show how BART takes input differently for each task.
CHQ: Hi I have an un-opened prescription of Atorvastatin. How long is the lifespan in an Un-Opened container that has been stored at room temp (roughly 60degrees)? Thanks.
FAQ: For how long can Atorvastatin be stored at room temperature?
[Diagram: a shared encoder and a shared decoder produce an entailment prediction for Recognizing Question Entailment (RQE) and a generated FAQ (generated summary) for question summarization.]
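As a minimal sketch of how the two per-task losses described above can be computed and summed into one training objective, the following pure-Python functions operate on already-computed model outputs. The function names are hypothetical, and two details are assumptions rather than statements about our implementation: the negative log-likelihood is summed (not averaged) over reference tokens, and the classification head is taken to output a single logit.

```python
import math


def summarization_nll(token_log_probs):
    """Negative log-likelihood of the reference FAQ.

    token_log_probs: log p(y_t | y_<t, x) for each reference token y_t.
    Assumption: the loss is summed over tokens, not averaged.
    """
    return -sum(token_log_probs)


def rqe_bce(logit, label):
    """Binary cross-entropy between sigmoid(logit) and a 0/1 label.

    Uses the numerically stable form max(z, 0) - z*l + log(1 + exp(-|z|)).
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def joint_loss(token_log_probs, logit, label):
    """Sum of the summarization and RQE objectives for one training step."""
    return summarization_nll(token_log_probs) + rqe_bce(logit, label)
```

With fully shared parameters, a single backward pass through this sum updates the encoder and decoder for both tasks at once.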
We propose to optimize a single loss function that is the sum of the objectives of both tasks. At each training step, we have a summarization question pair that is used for the negative log-likelihood loss, and an RQE question pair that is used for the Binary Cross-Entropy (BCE) loss. Given a CHQ embedding $x$, the corresponding FAQ embedding $y$, and the entailment label $l_{entail} \in \{0, 1\}$, we optimize the following loss function:

$$\mathcal{L}(x, y, l_{entail}) = \mathcal{L}_{\mathrm{NLL}}(x, y) + \mathcal{L}_{\mathrm{BCE}}(x, y, l_{entail}) \qquad (1)$$

For RQE, we consider two loss alternatives, in which we create summarization pairs that are identical to the RQE pairs, regardless of entailment. In the first alternative, we simply remove the negative log-likelihood loss for pairs labeled as not entailment. In the second alternative, we flip the sign of the negative log-likelihood loss for pairs labeled as not entailment, such that we try to maximize the summarization loss instead of minimizing it.

Datasets
We consider three medical question summarization datasets and one medical RQE dataset, all in English. Table 1 shows dataset statistics.
MeQSum (Ben Abacha and Demner-Fushman, 2019a) is a medical question summarization dataset released by the U.S. National Institutes of Health (NIH). It contains 1,000 consumer health questions summarized into FAQ-style single-sentence questions by medical experts. The authors used the first 500 datapoints as training and the last 500 as testing. We randomly select 100 datapoints from the training set as our dev set.
HealthCareMagic and iCliniq are the two question summarization datasets that we extract from a large-scale medical dialogue dataset. Each datapoint in these two datasets includes first a one-sentence question describing the medical condition of the patient, followed by two long utterances: one from the patient that includes a description of the problem and a question, and then one from the doctor that includes the response. To form medical question summarization datasets, we consider the single-sentence descriptions as summaries of the patient utterances. HealthCareMagic's summaries are more abstractive and are written in a formal style, unlike iCliniq's patient-written summaries. We create an 80/10/10 split for train/dev/test sets.

[Table 2: summarization results. Recoverable entries: (Nallapati et al., 2016): 24.8, 13.8, 24.3; Pointer-Generator Networks (PG) (See et al., 2017): 35 ...]

[...] (Wang et al., 2018), and re-use the same classification head for RQE.

Training Settings
We train for 100 epochs for the MeQSum dataset, and for 10 epochs for all other datasets. We report ROUGE F1 scores for the question summarization datasets, and accuracy for the RQE dataset, as it is a binary classification task with two labels: entailment and not entailment. For the question summarization datasets, the negative log likelihood on the dev set is used to select the best model. For the RQE dataset, the RQE accuracy on the dev set is the metric used to select the best model.
For single-task training, we use binary cross entropy for RQE, and negative log-likelihood for question summarization.
The learning rate for RQE experiments is $10^{-5}$ and for the question summarization experiments, it is $3 \times 10^{-5}$. We use an Adam optimizer where the betas are 0.9 and 0.999 for summarization, and 0.9 and 0.98 for RQE. In all experiments, the Adam epsilon is $10^{-8}$, and the dropout is 0.1.
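For convenience, the hyperparameters listed in this section can be collected in a small configuration object. This is only a sketch that restates the values above: the names `TrainConfig`, `summarization_config` and `RQE_CONFIG` are hypothetical and do not correspond to any particular training framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float
    adam_betas: tuple
    epochs: int
    adam_eps: float = 1e-8  # shared by all experiments
    dropout: float = 0.1    # shared by all experiments


def summarization_config(dataset):
    """Settings for a question summarization dataset."""
    # MeQSum is trained for 100 epochs, all other datasets for 10.
    epochs = 100 if dataset == "MeQSum" else 10
    return TrainConfig(learning_rate=3e-5, adam_betas=(0.9, 0.999), epochs=epochs)


# Settings for the RQE dataset.
RQE_CONFIG = TrainConfig(learning_rate=1e-5, adam_betas=(0.9, 0.98), epochs=10)
```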

Inference
At test time, we evaluate each task completely separately. For RQE, we feed the concatenation of the CHQ and FAQ as input to the model. For question summarization, we only feed the CHQ as input to the model. This way, we ensure that the model never sees the reference FAQs when being evaluated for question summarization.
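The separation between the two evaluation modes can be illustrated with a short sketch. The helper names are hypothetical, the separator string `" </s> "` is an assumption about how the two questions are joined, and `predict` stands in for any function that maps an input text to a 0/1 entailment prediction.

```python
def rqe_input(chq, faq, sep=" </s> "):
    """RQE sees both questions; `sep` is an assumed separator string."""
    return f"{chq}{sep}{faq}"


def summarization_input(chq):
    """Summarization sees the CHQ only, never the reference FAQ."""
    return chq


def rqe_accuracy(examples, predict):
    """examples: (chq, faq, label) triples; predict: text -> 0 or 1."""
    correct = sum(predict(rqe_input(c, f)) == y for c, f, y in examples)
    return correct / len(examples)
```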

Summarization Results
In their introduction of MeQSum, Ben Abacha and Demner-Fushman (2019a) show results with seq2seq models and pointer-generator networks. They additionally propose to augment MeQSum using semantically selected relevant pairs from the Quora Question Pairs dataset (Iyer et al., 2017). We report these baselines as well as our BART baseline results.
We show our summarization results in Table 2. On MeQSum and iCliniq, our joint learning objective achieves increases of between 3 and 8 points across all three metrics, a significant improvement despite MeQSum being extremely low-resource. On the more abstractive and larger HealthCareMagic dataset, there is a decrease of 2 points compared to the BART baseline.
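To make the main metric concrete, the following is a simplified sketch of ROUGE-1 F1 as unigram overlap between a generated and a reference question. It is an illustration only: it assumes lowercasing and whitespace tokenization, omits the stemming and other preprocessing of standard ROUGE toolkits, and will therefore not reproduce the reported scores exactly.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each unigram counts at most as often as in the reference.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, a 6-token candidate whose tokens all appear in a 10-token reference has precision 1.0 and recall 0.6, which gives an F1 of 0.75.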

Human Evaluation
[Table 4: RQE accuracy results on the dev set of our joint loss compared to the two loss alternatives. NLL is Negative Log-Likelihood, the summarization loss.]

Given that ROUGE is notoriously unreliable, we recruit 2 volunteer annotators, and we pick 40 generated summaries from each model in each summarization dataset, resulting in 240 generated summaries (FAQs). We collect 960 evaluations using best-worst scaling. The annotators could also choose to judge both generated FAQs as equal with regard to the given criteria. We show the annotators the generated FAQs in a random order, so that they do not know which model generated which FAQ. We evaluate the generated summaries on 4 criteria:
• Fluency: which generated FAQ is more grammatically correct, and easier to read and to understand?
• Coherence: which generated FAQ is better structured and more organized?
• Informativeness: which generated FAQ captures the most out of the concern of the patient who wrote the CHQ?
• Correctness: which generated FAQ is more factually correct given the CHQ?
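As a sketch of how such pairwise judgments can be aggregated, the function below tallies, for each criterion, the percentage of judgments that prefer the joint model, prefer the baseline, or rate both as equal. The labels `"joint"`, `"baseline"` and `"equal"` and the function name `tally` are hypothetical, and reporting percentages is an assumption about presentation rather than a description of how our scores are computed.

```python
from collections import Counter

CRITERIA = ("fluency", "coherence", "informativeness", "correctness")
CHOICES = ("joint", "baseline", "equal")


def tally(judgments):
    """judgments: iterable of (criterion, choice) tuples.

    Returns {criterion: {choice: percentage of that criterion's judgments}}.
    """
    counts = {c: Counter() for c in CRITERIA}
    for criterion, choice in judgments:
        if criterion not in counts or choice not in CHOICES:
            raise ValueError(f"unexpected judgment: {(criterion, choice)}")
        counts[criterion][choice] += 1
    result = {}
    for criterion, counter in counts.items():
        total = sum(counter.values())
        result[criterion] = {
            choice: (100.0 * counter[choice] / total if total else 0.0)
            for choice in CHOICES
        }
    return result
```

The 960 collected evaluations are consistent with 120 pairs of generated FAQs (40 per dataset) each being judged on 4 criteria by 2 annotators.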
Our human evaluation results are in Table 3. Scores are generally in favor of our approach in MeQSum and HealthCareMagic. There is a large increase in informativeness for HealthCareMagic, and the results for iCliniq show that our approach gives summaries of roughly similar quality to those of the BART baseline. The increase in ROUGE scores on the extractive iCliniq and the decrease on the abstractive HealthCareMagic indicate that our approach's summaries are more faithful to patient writing styles, suggesting a stronger influence from entailment.

RQE Results
We compare the joint loss function of equation 1 with the two loss alternatives in section 3.2. We show the results on the dev set in Table 4. Our joint loss function fares the best, exceeding the alternatives by 5%. The results suggest that optimizing RQE jointly with question summarization does help improve performance on the RQE side as well. The difference with the alternative where we remove NLL for not-entailment pairs shows that optimizing our joint learning objective is more efficient than alternating single-task objectives.
We show our RQE results in Table 5. We see an 8% increase on the test set compared to optimizing only on the RQE objective. Our findings show that joint learning helps both tasks equally.

[RQE accuracy table: ... 54.1%; BART + Data-Augmented Joint Learning: 60.0%]
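One way to write the three compared objectives side by side is given below. This is a sketch under the notation of section 3.2, not a formula taken from our implementation: $\mathcal{L}_{\mathrm{NLL}}$ and $\mathcal{L}_{\mathrm{BCE}}$ denote the summarization and RQE losses, and the subscripts alt-1 and alt-2 are labels introduced here for the two alternatives. In the joint loss, the summarization pair always comes from an entailing pair thanks to data augmentation; in the two alternatives, the summarization pair is identical to the RQE pair.

```latex
\begin{align}
\mathcal{L}_{\text{joint}} &= \mathcal{L}_{\mathrm{NLL}}(x, y) + \mathcal{L}_{\mathrm{BCE}}(x, y, l_{entail}) \\
\mathcal{L}_{\text{alt-1}} &= l_{entail}\,\mathcal{L}_{\mathrm{NLL}}(x, y) + \mathcal{L}_{\mathrm{BCE}}(x, y, l_{entail}) \\
\mathcal{L}_{\text{alt-2}} &= (2\,l_{entail} - 1)\,\mathcal{L}_{\mathrm{NLL}}(x, y) + \mathcal{L}_{\mathrm{BCE}}(x, y, l_{entail})
\end{align}
```

With $l_{entail} = 0$, the first alternative drops the summarization term entirely, while the second enters it with a negative sign, so that minimizing the total loss maximizes the summarization loss for not-entailment pairs.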

Conclusions
We propose a novel data-augmented joint learning approach for the tasks of RQE and question summarization. Our data augmentation method extends a dataset such that it can be used for both tasks. Our results show improvements in both tasks, across three question summarization datasets (+2.5% in ROUGE-1 F1) and one RQE dataset (+8% accuracy). We perform a human evaluation for our generated summaries: we find that our approach generates more informative summaries for formally written FAQs, and summaries that are faithful to patient writing styles in the more extractive iCliniq dataset. Finally, we make our datasets, code and training details publicly available.