On the Summarization of Consumer Health Questions

Question understanding is one of the main challenges in question answering. In real-world applications, users often submit natural language questions that are longer than needed and include peripheral information that increases the complexity of the question, leading to substantially more false positives in answer retrieval. In this paper, we study neural abstractive models for medical question summarization. We introduce the MeQSum corpus of 1,000 summarized consumer health questions. We explore data augmentation methods and evaluate state-of-the-art neural abstractive models on this new task. In particular, we show that semantic augmentation from question datasets improves the overall performance, and that pointer-generator networks outperform sequence-to-sequence attentional models on this task, with a ROUGE-1 score of 44.16%. We also present a detailed error analysis and discuss directions for improvement that are specific to question summarization.


Introduction
Teaching machines how to automatically understand natural language questions to retrieve relevant answers is still a challenging task. Different factors increase the complexity of the task, such as the question length (cf. Figure 1), the lexical heterogeneity when describing the same information need, and the lack of domain-specific training datasets. Improving Question Answering (QA) has been the focus of multiple research efforts in recent years. Several efforts proposed interactive and non-interactive query relaxation techniques to translate the input questions into structured queries covering specific elements of the questions (Yahya et al., 2013; Mottin et al., 2014; Ben Abacha and Zweigenbaum, 2015; Meng et al., 2017). Other efforts focused on (i) identifying question similarity (Nakov et al., 2016, 2017) and question entailment (Ben Abacha and Demner-Fushman, 2019b) in order to retrieve similar or entailed questions that have associated answers, or (ii) paraphrasing the questions and submitting the simplified versions to QA systems (Bordes et al., 2014; Dong et al., 2017).
Question simplification or summarization has been less studied than the summarization of news articles, which has been the focus of neural abstractive methods in recent years (Rush et al., 2015; Nallapati et al., 2016; Chopra et al., 2016; See et al., 2017). In this paper, we tackle the task of consumer health question summarization. Consumer health questions are a natural candidate for this task, as patients and their families tend to provide numerous peripheral details, such as the patient history (Roberts and Demner-Fushman, 2016), that are not always needed to find correct answers. Recent experiments also showed the key role of question summarization in improving the performance of QA systems (Ben Abacha and Demner-Fushman, 2019a).
We present three main contributions: (i) we define Question Summarization as generating a condensed question expressing the minimum information required to find correct answers to the original question, and we create a new corpus (see footnote 1) of 1K consumer health questions and their summaries based on this definition (cf. Figure 1); (ii) we explore data augmentation techniques, including semantic selection from open-domain datasets, and study the behavior of state-of-the-art neural abstractive models on the original and augmented datasets; (iii) we present a detailed error analysis and discuss potential areas of improvement for consumer health question summarization.
We present related work in the following section. The abstractive models and data creation and augmentation methods are presented in section 3. We present the evaluation in section 4 and discuss the results and error analysis in section 5.

Related Work
With the recent developments in neural machine translation and generative models (Bahdanau et al., 2014), text summarization has been focusing on abstractive models for sentence or headline generation and article summarization (Rush et al., 2015; Nallapati et al., 2016; Gehrmann et al., 2018). In particular, Rush et al. (2015) proposed an approach for the abstractive summarization of sentences combining a neural language model with a contextual encoder (Bahdanau et al., 2014). For text summarization, Nallapati et al. (2016) proposed a recurrent and attentional encoder-decoder network that takes into account out-of-vocabulary words with a pointer mechanism. This copy mechanism can combine the advantages of both extractive and abstractive summarization (Gu et al., 2016). See et al. (2017) used a hybrid pointer-generator network combining a sequence-to-sequence (seq2seq) attentional model with a similar pointer network (Vinyals and Le, 2015) and a coverage mechanism (Tu et al., 2016). They achieved the best performance of 39.53% ROUGE-1 on the CNN/DailyMail dataset of 312k news articles. Abstractive summarization models have mainly been trained and evaluated on news articles due to the availability of large scale news datasets. Fewer efforts tackled other subtasks with different inputs, such as summarization of opinions, conversations or emails (Duboué, 2012; Angelidis and Lapata, 2018).

Footnote 1: github.com/abachaa/MeQSum
In this paper we focus on the summarization of consumer health questions. To the best of our knowledge, only Ishigaki et al. (2017) studied the summarization of lengthy questions in the open domain. They created a dataset from a community question answering website by using the question-title pairs as question-summary pairs, and compared extractive and abstractive summarization models. Their results showed that an abstractive model based on an encoder-decoder and a copying mechanism achieves the best performance of 42.2% ROUGE-2.

Methods
We define the question summarization task as generating a condensed question expressing the minimum information required to find correct answers to the original question.

Summarization Models
We study two encoder-decoder-attention architectures that achieved state-of-the-art results on open domain summarization datasets.

Sequence-to-sequence attentional model.
This model is adopted from Nallapati et al. (2016). The encoder consists of a bidirectional LSTM layer fed with input word embeddings trained from scratch for the summarization task. The decoder also consists of a bidirectional LSTM layer. An attentional distribution (Bahdanau et al., 2014) is computed from the encoder's LSTM to build a context vector that is combined with the decoder embeddings to predict the word that is most likely to come next in the sequence.
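As an illustration of how an attentional distribution yields a context vector, the following minimal NumPy sketch implements additive attention in the style of Bahdanau et al. (2014). The function name, the parameter names (W_h, W_s, v, b), and the toy dimensions are assumptions made for this example and are not taken from the evaluated systems.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def additive_attention(enc_states, dec_state, W_h, W_s, v, b):
    """Return (attention distribution, context vector).

    enc_states: (T, d_enc) encoder hidden states, one row per source token
    dec_state:  (d_dec,)   current decoder state
    W_h: (d_att, d_enc), W_s: (d_att, d_dec), v: (d_att,), b: (d_att,)
    All parameter names are hypothetical.
    """
    # One unnormalized score per source position.
    scores = np.tanh(enc_states @ W_h.T + dec_state @ W_s.T + b) @ v
    attn = softmax(scores)
    # Context vector: attention-weighted sum of the encoder states.
    context = attn @ enc_states
    return attn, context

# Toy example with random parameters (illustration only).
rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 5, 8, 6, 4
enc_states = rng.normal(size=(T, d_enc))
dec_state = rng.normal(size=d_dec)
W_h = rng.normal(size=(d_att, d_enc))
W_s = rng.normal(size=(d_att, d_dec))
v = rng.normal(size=d_att)
b = np.zeros(d_att)

attn, context = additive_attention(enc_states, dec_state, W_h, W_s, v, b)
```

In a full model, the context vector would then be combined with the decoder input to score the next word; that step is omitted here.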

Pointer-generator network.
This model is adopted from See et al. (2017). It extends the sequence-to-sequence attentional model with a pointer network (Vinyals and Le, 2015) that has a flexible copying mechanism, allowing the model either to generate the next word or to point to a location in the source text. The decision on whether to generate a new word or to point back to a source location is made by using a probability function as a soft switch. This probability is computed from dense connections to the decoder's input and hidden state and the context vector. This design is particularly suited to deal with words outside of the target vocabulary in production or test environments. We also test the coverage variant of this model, which includes an additional loss term taking into account the diversity of the words that were targeted by the attention layer for a given text (Tu et al., 2016). This variant is intended to deal with the repetitive word generation issue in sequence-to-sequence models.

Example of a medical question and its summary:
Medical Question: Is it healthy to ingest 500 mg of vitamin c a day? Should I be taking more or less?
Summary: How much vitamin C should I take a day?
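The soft switch and the coverage term can be sketched as follows. This is a minimal NumPy illustration that assumes the formulation of See et al. (2017): the final distribution mixes the vocabulary distribution and the attention distribution with a generation probability p_gen, and the coverage loss sums, at each step, the element-wise minimum of the current attention and the accumulated attention. Function names and the toy values are hypothetical; p_gen is given as an input rather than computed from the decoder state.

```python
import numpy as np

def final_distribution(p_gen, vocab_dist, attn, src_ids, extended_size):
    """Mix generation and copying probabilities.

    vocab_dist: (V,) distribution over the fixed vocabulary
    attn:       (T,) attention over the T source tokens
    src_ids:    (T,) id of each source token in the extended vocabulary
                (ids >= V denote out-of-vocabulary source words)
    """
    final = np.zeros(extended_size)
    final[:len(vocab_dist)] = p_gen * vocab_dist
    # Copy mass is added per source position; np.add.at accumulates
    # correctly when the same word occurs several times in the source.
    np.add.at(final, src_ids, (1.0 - p_gen) * attn)
    return final

def coverage_loss(attn_steps):
    """Sum over decoding steps of sum_i min(a_i^t, c_i^t)."""
    coverage = np.zeros_like(attn_steps[0])
    loss = 0.0
    for a in attn_steps:
        loss += np.minimum(a, coverage).sum()
        coverage = coverage + a
    return loss

# Toy example: vocabulary of 4 words, source of 3 tokens, one OOV (id 4).
vocab_dist = np.array([0.1, 0.2, 0.3, 0.4])
attn = np.array([0.5, 0.3, 0.2])
src_ids = np.array([1, 4, 1])
final = final_distribution(0.6, vocab_dist, attn, src_ids, extended_size=5)

# Attending twice to the same position is penalized; distinct positions are not.
repeated = coverage_loss([np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])])
distinct = coverage_loss([np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])])
```

The out-of-vocabulary source word (id 4) receives non-zero probability only through the copy term, which is why this design helps with words outside the target vocabulary.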

Data Creation
We manually constructed a gold standard corpus, MeQSum, of 1,000 consumer health questions and their associated summaries. We selected the questions from a collection distributed by the U.S. National Library of Medicine (Kilicoglu et al., 2018). Three medical experts performed the manual summarization of the 1K questions using the following guidelines: (i) the summary must allow retrieving correct and complete answers to the original question and (ii) the summary cannot be shortened further without failing to comply with the first condition. All the summaries were then double validated by a medical doctor who also gave the following scores: 1 (perfect summary), 0.5 (acceptable), and 0 (incorrect; in this case, the doctor replaced the summary). Based on these scores, the interannotator agreement (IAA) was 96.9%. In method #1, we used 500 pairs for training and 500 pairs for the evaluation of the summarization models.
We augmented the training set incrementally with two different methods. In the first augmentation method (#2) we added a set of 4,655 pairs of clinical questions asked by family doctors and their short versions (Ely et al., 2000). The second (augmented) training set has a total of 5,155 question-summary pairs.
Our third method (#3) relies on the semantic selection of relevant question pairs from the Quora open-domain dataset (Shankar Iyer and Csernai, 2017). The source Quora dataset consists of 149,262 pairs of duplicate questions. We selected a first set of candidate pairs where a question A had at least 2 sentences and its duplicate question B had only one sentence. Sentence segmentation was performed using the Stanford parser. This first selection led to a subset of 11,949 pairs. From this subset, we targeted three main medical categories: Diseases, Treatments, and Tests. We extracted the question pairs that have at least one medical entity from these categories. We used MetaMapLite (Demner-Fushman et al., 2017) to extract these entities by targeting a list of 35 UMLS (Lindberg et al., 1993) semantic types. The final Quora subset constructed by this method contains 2,859 medical pairs. The third (augmented) training set includes the data from the three methods (8,014 training pairs). Table 1 presents example question-summary pairs from each dataset.
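The two-stage selection described above can be sketched as a simple filter. The sketch below is only an illustration under loud assumptions: a regular expression stands in for the Stanford parser's sentence segmentation, a tiny hypothetical keyword set stands in for MetaMapLite and the 35 UMLS semantic types, and a pair is kept if either question contains a medical term (the text does not specify which question is checked).

```python
import re

# Hypothetical stand-in for MetaMapLite entity extraction over
# Diseases, Treatments, and Tests; a real system would not use a word list.
MEDICAL_TERMS = {"diabetes", "insulin", "biopsy"}

def count_sentences(text):
    """Crude sentence count (stand-in for the Stanford parser)."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return len([p for p in parts if p])

def has_medical_entity(text):
    """True if the text contains at least one term from the stand-in list."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return bool(tokens & MEDICAL_TERMS)

def select_pairs(pairs):
    """Keep (A, B) pairs where A has >= 2 sentences, B has exactly 1,
    and at least one of the two questions mentions a medical entity."""
    selected = []
    for question_a, question_b in pairs:
        if count_sentences(question_a) < 2 or count_sentences(question_b) != 1:
            continue
        if has_medical_entity(question_a) or has_medical_entity(question_b):
            selected.append((question_a, question_b))
    return selected

# Invented example pairs (not taken from the Quora dataset).
pairs = [
    ("I was diagnosed with diabetes last year. What diet should I follow?",
     "What diet is good for diabetes?"),
    ("I love hiking. Which trail is best?",
     "Which trail is best for hiking?"),
    ("What is insulin?",
     "What does insulin do?"),
]
selected = select_pairs(pairs)
```

In this toy run, only the first pair passes both the length condition and the medical-entity condition.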

Experiments and Results
In the pointer generator and the seq2seq models, we use hidden state vectors of 256 dimensions and word embedding vectors of 128 dimensions trained from scratch. We set the size of the source and target vocabularies to 50K and the minimum length of the question summaries to 4 tokens. When applied, the coverage mechanism was started from the first iteration. We use the Adagrad optimizer with a learning rate of 0.15 to train the network. At decode-time, we used beam search of size 4 to generate the question summary.
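For reference, the reported settings can be gathered in a small configuration, together with a sketch of a single Adagrad update. The dictionary keys are hypothetical names chosen for this example; the initial accumulator value (zero) and the epsilon constant are assumptions, since the text only reports the optimizer and its learning rate.

```python
import numpy as np

# Values are those reported in the text; key names are hypothetical.
CONFIG = {
    "hidden_dim": 256,        # hidden state size
    "emb_dim": 128,           # word embedding size, trained from scratch
    "vocab_size": 50_000,     # source and target vocabularies
    "min_summary_len": 4,     # minimum summary length in tokens
    "beam_size": 4,           # beam search width at decode time
    "optimizer": "adagrad",
    "learning_rate": 0.15,
}

def adagrad_step(param, grad, accum, lr=CONFIG["learning_rate"], eps=1e-8):
    """One Adagrad update: per-parameter step scaled by the root of the
    accumulated squared gradients. eps is an assumed stabilizer."""
    accum = accum + grad ** 2
    param = param - lr * grad / (np.sqrt(accum) + eps)
    return param, accum

# Toy update on two parameters (accumulator assumed to start at zero).
param = np.array([1.0, -2.0])
accum = np.zeros_like(param)
grad = np.array([2.0, 0.0])
param, accum = adagrad_step(param, grad, accum)
```

Because the accumulator only grows, the effective step size of frequently updated parameters shrinks over training, while rarely updated parameters keep larger steps.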

Results are reported using the ROUGE-1, ROUGE-2, and ROUGE-L measures and presented in Table 2. The pointer generator achieves a ROUGE-1 score of 44.16% when trained on the full training dataset of 8k pairs (Method #3). The coverage mechanism improved the results of the first training set, with a limited number of training pairs (500), but decreased performance on the other training sets. This may be explained by the fact that the systems did not generate frequent repetitions when using the second and third training sets, which suggests that the data augmentation methods provided enough coverage and better training for the generation of relevant summaries from the test data. Figure 2 presents an example of a generated summary.
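To make the n-gram overlap behind ROUGE-1 and ROUGE-2 concrete, here is a simplified sketch. It is not the official ROUGE toolkit: tokenization is plain whitespace splitting, no stemming is applied, and it returns precision, recall, and F1 because the text does not state which variant is reported. The candidate string is an invented, truncated version of the reference summary.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: (precision, recall, F1) from clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

reference = "how much vitamin c should i take a day"
candidate = "how much vitamin c should i take"

p1, r1, f1 = rouge_n(candidate, reference, n=1)
p2, r2, f2 = rouge_n(candidate, reference, n=2)
```

Here every candidate unigram appears in the reference (precision 1.0), while 7 of the 9 reference unigrams are covered, giving an F1 of 0.875.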

Discussion
The best ROUGE-1 score of 44.16% is comparable to the state-of-the-art results in open-domain text summarization. Interestingly, this performance was achieved using a relatively small set of 8K training pairs (2.5% of the size of the CNN/DailyMail dataset). Although this observation can be partially explained by the shorter average length of question summaries when compared to news summaries, a ROUGE-1 score of 44.16% suggests that the trained model reached a relatively efficient local optimum with a useful level of abstraction for consumer health question summarization. This result is especially promising, considering (i) the low-frequency nature of most medical entities and (ii) the fact that the model did not rely on external sources of medical knowledge.
ROUGE (Lin and Hovy, 2003) is based on n-gram co-occurrences and, despite its wide use in summary evaluation, it has some limitations. Metrics specific to question answering, such as POURPRE for the evaluation of answers to definition questions (Lin and Demner-Fushman, 2005), share some of the same limitations and do not capture fluency or semantic correctness of the summary. To study the correlation between ROUGE and human judgment in question summarization, we manually evaluated a subset of 10% of the generated summaries: we randomly selected 50 summaries produced by each PG method (M#1, M#2, and M#3) from the test set. To judge the correctness of the generated summaries, we used three scores: 0 (incorrect summary), 1 (acceptable summary), and 2 (perfect). Table 4 presents the results of the manual evaluation of the summaries. Table 3 presents examples of the generated summaries by each evaluated method. A fair amount of the manually evaluated summaries were extractive, but many were correctly generated, as can be seen in the examples.
We manually evaluated the three PG methods that achieved the best performance. These methods do not include coverage, which aims to deal with the repetitive word generation issue. From our observations, few generated summaries had the repetition issue (e.g. "where can i find information on genetic genetic genetic genetic genetic ..."). All repetitions were generated by the M#1 method, which has the smallest training set (500 pairs); this suggests that having more training instances (5K for M#2 and 8K for M#3) alleviated the repetition problem in question summarization.
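A repetition check of this kind can also be automated. The following sketch flags summaries by the longest run of identical consecutive tokens; it is a hypothetical helper for illustration, not the procedure used in the manual evaluation, and the example string is the truncated repetition quoted above.

```python
def max_consecutive_repeat(summary):
    """Length of the longest run of identical consecutive tokens."""
    tokens = summary.lower().split()
    longest = run = 1 if tokens else 0
    for prev, cur in zip(tokens, tokens[1:]):
        # Extend the current run on a repeat, otherwise start a new run.
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest

repetitive = "where can i find information on genetic genetic genetic genetic genetic"
clean = "how much vitamin c should i take a day"

repetitive_run = max_consecutive_repeat(repetitive)
clean_run = max_consecutive_repeat(clean)
```

A threshold on this run length (for example, flagging runs of 3 or more) would be an arbitrary choice and would need to be validated against manual judgments.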
For a more in-depth analysis, we manually studied the summaries generated by the PG+M#3 method on a random 10% subset of the test data. We identified 4 main types of errors that should be tackled in future efforts:
T1 (Question Focus, i.e., the main entity in the question): The question focus is missing or not correctly identified (e.g. "What are the treatments?").
T2 (Question Type): The question type is not the same (e.g. "what are the treatments for williams syndrome?" instead of "where can I get genetic testing for william's syndrome?").
T3 (Semantic inconsistency): The question type does not apply to the focus category (e.g. "what are the treatments for nulytely?", where nulytely is a drug name).
T4 (Summarization): The summary is either not minimal or not complete (e.g. the original question contains several sub-questions, but the summary contains only one of them).
The examples above are from the results of the method PG+M#3. Table 5 presents the distribution of error types, taking into account multiple error types per summary when they occur. 76% of the errors are related to the question focus and the question type. Interestingly, only 7% of the summaries are semantically inconsistent. These findings suggest that training the networks to take into account the question focus and type is a promising direction for improvement. Such an approach could be achieved either through multitask training or through additional input features, and will be investigated further in our future work.

Conclusion
We studied consumer health question summarization and introduced the MeQSum corpus of 1K consumer health questions and their summaries, which we make available in the scope of this paper. We also explored data augmentation methods and studied the behavior of abstractive models on this task. In future work, we intend to examine multitask approaches combining question summarization and question understanding.