Towards Understanding of Medical Randomized Controlled Trials by Conclusion Generation

Randomized controlled trials (RCTs) represent the paramount evidence of clinical medicine. Using machines to interpret the massive amount of RCTs has the potential to aid clinical decision-making. We propose an RCT conclusion generation task derived from the PubMed 200k RCT sentence classification dataset to examine the effectiveness of sequence-to-sequence models on understanding RCTs. We first build a pointer-generator baseline model for conclusion generation. Then we fine-tune the state-of-the-art GPT-2 language model, which is pre-trained with general domain data, for this new medical domain task. Both automatic and human evaluation show that our GPT-2 fine-tuned models achieve improved quality and correctness in the generated conclusions compared to the baseline pointer-generator model. Further inspection points out the limitations of the current approach and future directions to explore.


Introduction
Randomized controlled trials (RCTs) are the most rigorous method to assess the effectiveness of treatments, such as surgical procedures and drugs, in clinical medicine (Sibbald and Roland, 1998). A typical RCT consists of two randomized groups of patients receiving either the "intervention" (new treatment) or the "control" (conventional treatment). A statistical analysis is then done after the experiments to determine whether the intervention has a significant effect (i.e. actually making patients better or worse). The results from various RCTs contribute to the medical decisions made by physicians every day. However, analyzing these large amounts of data could be overwhelming for clinicians (Davidoff and Miglus, 2011). With the help of machine readers, we can alleviate the burden of providing correct information that contributes to critical clinical decisions.
* These authors contributed equally to this paper.
* The code is available at: https://github.com/MiuLab/RCT-Gen
In this work, we aim to evaluate the capabilities of deep learning models on understanding RCTs by generating the conclusions of RCT abstracts. We achieve this by transforming the PubMed 200k RCT abstract sentence classification dataset (Dernoncourt and Lee, 2017) into an RCT conclusion generation task. Generating a correct and coherent conclusion requires the model to 1) identify the objectives of the trial, 2) understand the result and 3) generate succinct yet comprehensible texts. Therefore, this task can be a preliminary goal toward a more thorough understanding of clinical medicine literature.
To tackle this task, we first build a pointer-generator model (See et al., 2017) as the baseline. This model is widely used in abstractive summarization, which is similar to our conclusion generation task. We then leverage the high-quality text generation capability of the OpenAI GPT-2 (Radford et al., 2019) language model by fine-tuning the general domain GPT-2 model into a medical domain conclusion generator.
Because the correctness of RCT understanding is essential for supporting clinical decisions, and neural summarization models could inaccurately present facts from the source document, we incorporate human evaluation of the correctness and quality of the generated conclusions in addition to the standard ROUGE score (Lin, 2004) for automated summarization scoring. Evaluation results show that the fine-tuned GPT-2 models score higher for both correctness and quality. However, there is still considerable room for improvement in both the diversity and the accuracy of the generated conclusions, which provides guidance for future research directions.

Related Work
This paper focuses on generating RCT conclusions, which is a natural language generation task in the medical domain, and our method exploits state-of-the-art language model representations for understanding the complex medical literature. We briefly describe the related work in both areas below and emphasize the differences between the prior work and ours.

Medical Natural Language Generation
Several medical domain natural language generation tasks have been studied using machine learning models, including generating radiology reports from images (Jing et al., 2018; Vaswani et al., 2017) and summarizing clinical reports (Zhang et al., 2018; Pivovarov and Elhadad, 2015) or research literature (Cohan et al., 2018). Recently, Gulden et al. (2019) studied extractive summarization on RCT descriptions.
Abstractive summarization, in which the model directly generates summaries from the source document, is closely related to our conclusion generation task. Most neural approaches for abstractive summarization were based on sequence-to-sequence recurrent neural networks (RNNs) with attention mechanisms (Devlin et al., 2019). The pointer-generator network (See et al., 2017) combined a copy mechanism that directly copies words from the source document and a coverage mechanism to avoid repetition caused by the RNN-based decoder, achieving good performance by handling unseen information. Devlin et al. (2019) further combined intra-encoder and intra-decoder attention with policy learning by using the ROUGE-L score as the reward and improved the performance in terms of the evaluation metric. Hsu et al. (2018) combined an extractive model that provided attention on the sentence level with the pointer-generator architecture, and Cohan et al. (2018) also worked on abstractive summarization of long documents, including medical papers from the PubMed database, based on the pointer-generator network.
However, our goal of generating conclusions differs from abstractive summarization: summarization aims to shorten the source document while preserving most of the important information, whereas our conclusion generation model gives one or two sentences describing the main outcome of the given trial. Given the superior performance of pointer-generator networks in the above related summarization work, this paper uses the pointer-generator model as the baseline and focuses on RCT conclusion generation instead of abstractive summarization.

Contextualized Representations
Recent advances in contextualized representation models, such as ELMo (Peters et al., 2018), OpenAI GPT (Radford et al., 2018) and BERT (Devlin et al., 2019), achieved remarkable results across different natural language understanding tasks, such as question answering, entailment classification and named entity recognition. At the core of these models was language modeling, with either forward prediction used in GPT, bidirectional prediction used in ELMo, or masked prediction used by BERT. Variants of BERT also improved the performance of bio-medical natural language understanding tasks (Xu et al., 2019; Pugaliya et al., 2019). Peng et al. (2019) further proposed a new benchmark to evaluate the performance of contextualized models in the bio-medical domain.
In particular, the OpenAI GPT-2 model (Radford et al., 2019) has demonstrated rudimentary zero-shot summarization capabilities with only language modeling training. Its forward prediction architecture makes it suitable for autoregressive generation in a sequence-to-sequence task. Most benchmarks on contextualized representations were based on sequence classification tasks such as natural language inference and multiple choice question answering (Wang et al., 2018; Peng et al., 2019). Our work, on the other hand, focuses on exploring GPT-2's capability of generating goal-directed sentences in the medical domain. To our knowledge, this paper is the first attempt to investigate GPT-2 for medical document understanding and interpretation.

Task Formulation
The PubMed 200k RCT dataset was originally constructed for sequential short text classification, with each sentence labeled as "background", "objective", "methods", "results" or "conclusions". We concatenate the "background", "objective" and "results" sections of each RCT paper abstract as the model input, and the goal of the model is to generate the "conclusions". Table 1 illustrates the formulated task, where the generated conclusion needs to contain correct information based on the experiments and should be concise. After preprocessing, the training set contains 189,035 abstracts, and 2,479 conclusions are used for validation. The average source paragraph length is 170.1 words (6.0 sentences), and the average target conclusion length is 41.4 words (1.8 sentences).

Source: (BACKGROUND) Varenicline is believed to work , in part , by reducing craving responses to smoking cues and by reducing general levels of craving ; however , these hypotheses have never been evaluated with craving assessed in the natural environments of treatment-seeking smokers . (OBJECTIVE) Ecological momentary assessment procedures were used to assess the impact of varenicline on cue-specific and general craving in treatment-seeking smokers prior to quitting . (RESULTS) During all phases , smoking cues elicited greater craving than neutral cues ; the magnitude of this effect declined after the first week . General craving declined across each phase of the study . Relative to the placebo condition , varenicline was associated with a greater decline in general craving over the drug manipulation phase . Varenicline did not significantly attenuate cue-specific craving during any phase of the study .
Target (True Negative): Smoking cues delivered in the natural environment elicited strong craving responses in treatment-seeking smokers , but cue-specific craving was not affected by varenicline administered prior to the quit attempt . These findings suggest that the clinical efficacy of varenicline is not mediated by changes in cue-specific craving during the pre-quit period of treatment-seeking smokers .
Pointer-generator baseline model with n = 1 hint word (N/A): smoking cues are associated with a greater craving in general , and may be associated with a greater decline in general craving and
Fine-tuned GPT-2 with n = 0 hint word (False Negative): Varenicline did not reduce general craving in treatment-seeking smokers prior to quitting.
Fine-tuned GPT-2 with n = 1 hint word (True Negative): Smoking cues are associated with greater general craving than neutral cues, and varenicline does not attenuate cue-specific craving.
Table 1: An example of the GPT-2 n = 0 model generating a false negative conclusion (Varenicline did reduce general craving), while the GPT-2 n = 1 model generated a better true negative one. The "(BACKGROUND)", "(OBJECTIVE)" and "(RESULTS)" tags denote the sentence classifications according to the original PubMed RCT dataset and are not included in the actual input of our conclusion generation task.
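The construction of source and target texts described in this section can be sketched as follows. This is a minimal illustration, assuming each abstract is available as a list of (label, sentence) pairs; the function name `build_example` and the toy abstract are hypothetical and are not taken from the released code.

```python
from typing import List, Tuple

# Sections concatenated into the model input, as stated in the task formulation.
INPUT_LABELS = {"background", "objective", "results"}
TARGET_LABEL = "conclusions"


def build_example(sentences: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Split one labeled abstract into a (source, target) pair.

    "methods" sentences are dropped; sentence order is preserved.
    """
    source = " ".join(
        text for label, text in sentences if label.lower() in INPUT_LABELS
    )
    target = " ".join(
        text for label, text in sentences if label.lower() == TARGET_LABEL
    )
    return source, target


# Toy abstract (made up for illustration only).
abstract = [
    ("OBJECTIVE", "To test drug A against placebo ."),
    ("METHODS", "Patients were randomized into two groups ."),
    ("RESULTS", "Drug A lowered blood pressure more than placebo ."),
    ("CONCLUSIONS", "Drug A is effective for lowering blood pressure ."),
]
source, target = build_example(abstract)
```

Here `source` holds the objective and results sentences, and `target` holds the conclusion sentence that the model is asked to generate.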

Models
Language model pre-training has achieved great success across language understanding tasks with different model architectures. Because training language models requires a large amount of text data, and it is relatively difficult to acquire a large number of RCT documents, this work first pre-trains a language model with the transformer architecture (Vaswani et al., 2017) on general texts and then adapts the model to the medical domain by fine-tuning. The language model pre-training from general texts is described below.

Transformer Encoder in GPT-2
We first introduce the transformer encoder (Vaswani et al., 2017) used in the GPT-2 model. The transformer encoder is a stack of $N$ transformer encoder blocks, where the $l$-th block takes a sequence of hidden representations $X^l = \{X^l_1, \cdots, X^l_n\}$ as the input and outputs an encoded sequence $X^{l+1} = \{X^{l+1}_1, \cdots, X^{l+1}_n\}$.
A transformer encoder block consists of a multi-head self-attention layer and a position-wise fully connected feed-forward layer. A residual connection (He et al., 2016) is employed around each of the two layers, followed by layer normalization (Ba et al., 2016). In GPT-2, however, the layer normalization step is moved to the front of the multi-head self-attention layers and the feed-forward layers. An illustration of a GPT-2 transformer encoder block is presented in Figure 1. Each component is briefly described as follows.
Byte-Pair Encoding. GPT-2 uses a special byte pair encoding (BPE) for input and output representations. It can cover essentially all Unicode strings, which is useful in processing medical texts because of their significant out-of-vocabulary problems, such as distinct nomenclature and jargon. This special BPE prevents merging characters from different categories and preserves word-level segmentation properties with a space exception.
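To illustrate why a byte-level vocabulary can cover any Unicode string, the sketch below maps a term to its UTF-8 bytes, which are the base symbols that byte-level BPE merges start from. It shows only this base step, not GPT-2's learned merge table or its character-category rules; the function names and the sample term are illustrative assumptions.

```python
def to_byte_ids(text: str) -> list:
    """Map any Unicode string to base symbol ids in the range 0..255."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids: list) -> str:
    """Invert to_byte_ids; no symbol is ever out of vocabulary."""
    return bytes(ids).decode("utf-8")


# A term with a non-ASCII character, as often seen in medical nomenclature.
term = "β-blocker"
ids = to_byte_ids(term)
restored = from_byte_ids(ids)
```

Because every string reduces to at most 256 distinct base symbols, rare drug names or abbreviations are split into longer symbol sequences rather than being replaced by an unknown token.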
Positional Encoding. Because the transformer model relies on a self-attention mechanism with no recurrence, the model is unaware of the sequential order of inputs. To provide the model with positional information, positional encodings are added to the input token embeddings:
$$X^1_i = \mathrm{embed}_{\mathrm{token}}(w_i) + \mathrm{embed}_{\mathrm{pos}}(i),$$
where $w_i$ denotes the $i$-th input token, and $\mathrm{embed}_{\mathrm{token}}$ and $\mathrm{embed}_{\mathrm{pos}}$ denote a learned token embedding matrix and a learned positional embedding matrix, respectively.
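As a small illustration of how positional information enters the input, the sketch below sums a token embedding row and a position embedding row for each input position. The tables are randomly initialized stand-ins for the learned matrices, and the sizes, seeds and function names are assumptions made for this example.

```python
import random


def make_table(num_rows: int, dim: int, seed: int) -> list:
    """Random stand-in for a learned embedding matrix (num_rows x dim)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_rows)]


def embed_inputs(token_ids: list, embed_token: list, embed_pos: list) -> list:
    """Return one vector per position: token embedding plus position embedding."""
    return [
        [t + p for t, p in zip(embed_token[w], embed_pos[i])]
        for i, w in enumerate(token_ids)
    ]


embed_token = make_table(num_rows=10, dim=4, seed=0)  # toy vocabulary of 10 tokens
embed_pos = make_table(num_rows=8, dim=4, seed=1)     # toy maximum length of 8
X1 = embed_inputs([5, 1, 5], embed_token, embed_pos)
```

Token 5 appears at positions 0 and 2, yet its two input vectors differ because different position rows are added, which is how the model can tell the occurrences apart.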
Multi-Head Self-Attention. An attention function can be described as mapping a query to an output with a set of key-value pairs. The output is a weighted sum of the values. We denote queries, keys and values as $Q$, $K$ and $V$, respectively. Following the original implementation (Vaswani et al., 2017), a scaled dot-product attention is employed as the attention function. Hence, the output can be calculated as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,$$
where $d_k$ denotes the dimension of the key vectors. The idea of multi-head attention is to compute multiple independent attention heads in parallel, and then concatenate the results and project again.
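The scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, can be sketched in plain Python as below. This is a single-head illustration on nested lists, written for readability rather than speed; it is not the implementation used in the experiments, and the toy matrices are made up.

```python
import math


def softmax(row: list) -> list:
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q: list, K: list, V: list) -> list:
    """Scaled dot-product attention for lists of query, key and value vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return outputs


# Two identical keys receive equal weight, so the output averages the values.
out = attention(Q=[[1.0, 0.0]], K=[[1.0, 0.0], [1.0, 0.0]], V=[[2.0, 0.0], [0.0, 2.0]])
```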
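The multi-head self-attention in the $l$-th block can be calculated as
$$\mathrm{MultiHead}(X^l) = \mathrm{Concat}(\mathrm{head}_1, \cdots, \mathrm{head}_h)\,W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(X^l W^Q_i, X^l W^K_i, X^l W^V_i),$$
where $X^l$ denotes the input sequence of the $l$-th block, $h$ denotes the number of heads, and $W^Q_i$, $W^K_i$, $W^V_i$ and $W^O$ are parameter matrices.
Position-Wise Feed-Forward Layer. The second sublayer in a block is a position-wise feed-forward layer, which is applied to each position separately and independently. The output of this layer can be calculated as
$$\mathrm{FFN}(x) = \sigma(xW_1 + b_1)\,W_2 + b_2,$$
where $W_1$ and $W_2$ are parameter matrices, $b_1$ and $b_2$ are parameter biases, and $\sigma$ is the nonlinear activation function (GELU in GPT-2).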

Residual Connection and Layer Normalization
As shown in Figure 1, layer normalization is first applied on the input to the multi-head attention and feed-forward sublayers. The residual connection is then added around the two sublayers. The output of the $l$-th block can be calculated as
$$H^l = \mathrm{MultiHead}(\mathrm{LayerNorm}(X^l)) + X^l, \qquad X^{l+1} = \mathrm{FFN}(\mathrm{LayerNorm}(H^l)) + H^l.$$
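The ordering of operations in this block (normalize first, apply the sublayer, then add the residual) can be sketched as follows. The sublayers are passed in as plain functions so the example stays short; the layer normalization here omits the learned gain and bias, and all names are illustrative assumptions rather than the actual GPT-2 code.

```python
import math


def layer_norm(x: list, eps: float = 1e-5) -> list:
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_ln_block(X: list, multi_head, ffn) -> list:
    """One block: H = MultiHead(LN(X)) + X, then X_next = FFN(LN(H)) + H.

    multi_head maps a whole sequence of vectors to a sequence of vectors;
    ffn maps a single vector to a single vector (position-wise).
    """
    attended = multi_head([layer_norm(x) for x in X])
    H = [[a + x_i for a, x_i in zip(att, x)] for att, x in zip(attended, X)]
    X_next = []
    for h in H:
        f = ffn(layer_norm(h))
        X_next.append([f_i + h_i for f_i, h_i in zip(f, h)])
    return X_next


# With sublayers that output zeros, only the residual path remains.
X = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]
zero_attention = lambda seq: [[0.0] * len(v) for v in seq]
zero_ffn = lambda v: [0.0] * len(v)
out = pre_ln_block(X, zero_attention, zero_ffn)
```

The zero-sublayer case makes the role of the residual connection visible: the block reduces to the identity, so the input passes through unchanged.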

GPT-2 Pre-Training
Generative pre-training (GPT) via a language model objective has been shown to be effective for learning representations that capture syntactic and semantic information without supervision (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019). The GPT model proposed by Radford et al. (2018) employs the transformer encoder with 12 encoder blocks. It is pre-trained on a large generic corpus that covers a wide range of topics. The training objective is to minimize the negative log-likelihood:
$$L(\theta) = -\sum_{t} \log P(w_t \mid w_{<t}; \theta),$$
where $w_t$ denotes the $t$-th word in the sentence, $w_{<t}$ denotes all words prior to $w_t$, and $\theta$ are the parameters of the transformer model. To avoid seeing the future contexts, a masked self-attention is applied to the encoding process. In the masked self-attention, the attention function is modified into
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V,$$
where $M$ is a matrix representing masks. $M_{ij} = -\infty$ indicates that the $j$-th token has no contribution to the output of the $i$-th token, so it is essentially "masked out" when encoding the $i$-th token. Therefore, by setting $M_{ij} = -\infty$ for all $j > i$, we can calculate all outputs simultaneously without looking at future contexts.
The GPT-2 model (Radford et al., 2019) was pre-trained on the WebText dataset, which consists of 40 GB of high-quality text crawled from internet sources. We use the small version (12 layers and 117 M parameters) of the released GPT-2 models.
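The effect of the mask and the form of the training objective can be illustrated with a short sketch. It builds the causal mask with entries of negative infinity for j > i, shows that a masked position receives zero attention weight after the softmax, and computes a negative log-likelihood from per-token probabilities. The function names and the toy numbers are assumptions made for this example.

```python
import math

NEG_INF = float("-inf")


def causal_mask(n: int) -> list:
    """M[i][j] = -inf when j > i (future token), otherwise 0."""
    return [[NEG_INF if j > i else 0.0 for j in range(n)] for i in range(n)]


def masked_softmax(scores: list, mask_row: list) -> list:
    """Softmax over scores + mask; masked entries end up with weight 0."""
    shifted = [s + m for s, m in zip(scores, mask_row)]
    top = max(shifted)  # finite, because position i itself is never masked
    exps = [math.exp(v - top) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(token_probs: list) -> float:
    """-sum_t log P(w_t | w_<t), given the probability assigned to each true token."""
    return -sum(math.log(p) for p in token_probs)


M = causal_mask(3)
weights = masked_softmax([1.0, 1.0, 1.0], M[1])  # token 1 sees tokens 0 and 1 only
nll = negative_log_likelihood([0.5, 0.5])
```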

GPT-2 Fine-Tuning
After the model is pre-trained with a language model objective, it can be fine-tuned on downstream tasks with supervised data. In our task, we adapt GPT-2 to the target domain by fine-tuning on RCT data. Figure 2 illustrates the learning procedure. By fine-tuning on the target data, the GPT-2 model may gain the ability to understand and generate medical texts.
In the fine-tuning stage, we modify the attention masking of the GPT-2 model so that source byte pairs are fully aware of the entire context of the source sentence, while each target byte pair is aware of the entire source sentence plus the generated byte pairs that precede it. That is, for target (conclusion) token pairs $(c_i, c_j)$ with $c_i, c_j \in \{c_1, \cdots, c_m\}$, we set $M_{ij} = -\infty$ for all $j > i$, while for target and source token pairs $(c_i, s_j)$, where $c_i \in \{c_1, \cdots, c_m\}$ and $s_j \in \{s_1, \cdots, s_n\}$, we set $M_{ij} = 0$. For all source token pairs $(s_i, s_j)$ with $s_i, s_j \in \{s_1, \cdots, s_n\}$, we also set $M_{ij} = 0$. This setting is illustrated in Figure 3.
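The masking scheme for fine-tuning can be sketched as a single matrix over the concatenated sequence of n source tokens followed by m target tokens. Source rows see every source token, and target rows see every source token plus the target tokens up to their own position. Blocking source tokens from attending to target tokens is an assumption of this sketch, since the text does not state it explicitly, and the function name is hypothetical.

```python
NEG_INF = float("-inf")


def seq2seq_mask(n_source: int, n_target: int) -> list:
    """Additive attention mask for [s_1..s_n, c_1..c_m].

    Entry [i][j] is 0 when token i may attend to token j, and -inf otherwise.
    """
    total = n_source + n_target
    mask = [[0.0] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            # Only target columns are ever masked, and only when they lie
            # after row i. For source rows this hides every target token
            # (assumption); for target rows it hides later target tokens.
            if j >= n_source and j > i:
                mask[i][j] = NEG_INF
    return mask


mask = seq2seq_mask(n_source=2, n_target=2)
```

For two source and two target tokens, the first target row is `[0, 0, 0, -inf]`: it sees both source tokens and itself, but not the target token that follows.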

Experiments
Here we describe the experimental details of the baseline pointer-generator model and the fine-tuned GPT-2 models.

Experimental Setup
The baseline model is a pointer-generator network (See et al., 2017) with both copy and coverage mechanisms, and is trained with a coverage loss. We adopt the implementation of Zhang et al. (2018). The vocabulary size is about 50,000, with uncased word embeddings pre-trained on the PubMed RCT 200k training set and the abstracts from the PubMed dataset of long documents (Cohan et al., 2018). We concatenate n ∈ {0, 1} hint words following the source sentences, where the hint words are the first n words of the target conclusion. Our pointer-generator model uses beam search with beam size 5 to decode the final output conclusion.
In our GPT-2 models, we conduct conclusion generation using n ∈ {0, 1, 3, 5} hint words. For n = 0, we append "In conclusion , " to the input. We also perform a data ablation study using only the "results" section as the model input. To address the memory constraint on our machines, we only train on examples that are shorter than 500 byte pairs after encoding. Because the GPT-2 model uses BPE for input and output, the generated conclusions are capitalized. Previous work showed that beam search did not help the generation quality of GPT-2 models (Holtzman et al., 2019), so we simply use greedy decoding to generate the conclusions. Our GPT-2 model is fine-tuned with teacher forcing, using the SGD optimizer with a learning rate of 0.001, a momentum of 0.9 and a decay factor of 0.0005. Our model is based on a PyTorch implementation of GPT-2 †.
† https://github.com/huggingface/pytorch-transformers

Table 2: ROUGE scores of the PGNet baseline models and the GPT-2 fine-tuned models on the development set. The GPT-2 res models were trained with the "results" section only. Addition of n > 1 hint words did not show further gains in ROUGE scores.

Automatic Evaluation
Table 2 shows the best validation ROUGE scores of the baselines and our models. Note that the hint words are not considered in the score calculation and the outputs of all models are lower-cased. The GPT-2 fine-tuned model significantly outperforms the pointer-generator (PGNet) baseline on all ROUGE scores, and the best performing model is GPT-2 with n = 1 hint word, demonstrating the effectiveness of our model in generating good conclusions. However, more hint words do not bring additional gains in ROUGE scores, probably because more constraints hinder the GPT-2 model from exploring potentially good conclusions. Moreover, the ablation result shows a significant drop in all ROUGE scores, indicating the importance of including the "background" and "objective" sections.

Table 3 (column headers): System, TP, TN, FP, FN, N/A, Acc.
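The hint-word inputs and the idea behind unigram ROUGE can be sketched as below. `build_input` follows the GPT-2 setup described in the experimental setup (the first n target words as hints, or "In conclusion ," when n = 0). `rouge1_f` is a simplified unigram-overlap F-score on lower-cased, whitespace-split tokens; it is an illustrative stand-in and not the official ROUGE package used for the reported scores. Both function names are hypothetical.

```python
from collections import Counter

N0_PROMPT = "In conclusion ,"


def build_input(source: str, target: str, n_hint: int) -> str:
    """Append n hint words (the first n words of the target) to the source."""
    if n_hint == 0:
        return f"{source} {N0_PROMPT}"
    hint = " ".join(target.split()[:n_hint])
    return f"{source} {hint}"


def rouge1_f(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F-score: clipped unigram overlap, lower-cased."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # per-word minimum of the two counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


prompt_n1 = build_input("A beat B .", "Drug A is better .", n_hint=1)
prompt_n0 = build_input("A beat B .", "Drug A is better .", n_hint=0)
score = rouge1_f("Drug A works", "drug a fails")
```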

Human Evaluation
We recruited 10 medical students with prior training in bio-statistics and epidemiology to annotate and rate the generated conclusions. Our questionnaire contains two types of questions: the annotation question and the rating question.
• Annotation: An annotation question contains a source document and four conclusions, namely the target conclusion written by a human and the conclusions generated by GPT-2 n = 0, GPT-2 n = 1 and PGNet. The raters are asked to classify each conclusion as either true positive, true negative, false positive, false negative or not applicable. We define true / false as whether the conclusion corresponds to the given document, and positive / negative as whether the intervention studied has a statistically significant effect, regardless of the effect being favourable or detrimental to the patients. This is to explicitly examine whether the generated conclusion is precise in terms of RCT content understanding.
• Rating: Rating questions use the same set-up, except that each question uses a 5-point Likert scale for correctness, quality and overall impression. This is to judge the generated conclusions both with regard to and regardless of the source document.
Each rater is given 5 annotation questions and 5 rating questions, with each source document randomly chosen from the validation set. To mitigate bias, we do not tell the raters which conclusions were generated and which were written by a human, and the conclusions are lower-cased and randomly ordered in each question for fair comparison. Table 3 presents the results from the annotation questions, where the numbers of true positive and true negative generations from the GPT-2 fine-tuned models increase compared to the PGNet baseline. The proposed GPT-2 models clearly achieve better performance in terms of accuracy (the ratio of true samples). We also include the performance of human-written conclusions in the last row, which serves as the upper bound of this task. However, there is still a gap between the human-written conclusions and the generated ones.
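The accuracy reported for the annotation questions, described as the ratio of true samples, can be computed from the five label counts as sketched below. Counting "N/A" annotations in the denominator is an assumption of this sketch, and the counts are made-up numbers for illustration, not results from Table 3.

```python
def annotation_accuracy(counts: dict) -> float:
    """Share of annotations labeled true positive or true negative.

    Assumption: every annotation, including "N/A", counts in the denominator.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts.get("TP", 0) + counts.get("TN", 0)) / total


# Hypothetical counts for one system (illustration only).
example_counts = {"TP": 3, "TN": 2, "FP": 1, "FN": 1, "N/A": 3}
accuracy = annotation_accuracy(example_counts)
```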

Table 4 (column headers): System, Correctness, Quality, Overall
In the rating questions, whose results are shown in Table 4, the human-written conclusions obtain a score of nearly 4 out of 5 on all three dimensions. The GPT-2 models have comparable scores in overall impression, both scoring around 3.5 out of 5. The most significant improvement of the GPT-2 generated conclusions is in text quality, with correctness improving to a lesser extent. The correctness of GPT-2 n = 1 is slightly better than that of GPT-2 n = 0 in the annotation questions, yet in the rating questions, GPT-2 n = 0 has a higher average score. In sum, the human evaluation results demonstrate that our models significantly outperform the baseline pointer-generator, and they suggest that the proposed RCT conclusion generation task is not the same as a typical summarization task, so deep text understanding is required for better performance.

Discussion
From the human evaluation results and our empirical inspection, we discover two major problems concerning the quality of the conclusions generated by the GPT-2 models. First, there is some repetition in the generated conclusions, which impairs the quality of the generated text, though it is not as common as in RNN-based models. We suggest additional weighted decoding or coverage mechanisms to avoid such problems. Second, the GPT-2 generated conclusions are significantly shorter than the targets. The average lengths of the conclusions generated by GPT-2 n = 0 and GPT-2 n = 1 are 19.4 and 21.0 words, while that of the human-written conclusions is 41.4 words. This could be caused by the limitation of greedy decoding, but the examples generated by PGNet, which applies beam search, have an average length of only 22.6 words. This suggests investigating additional measures to enrich and lengthen the generated conclusions in future work.
Another important issue is the correctness of the generation model. The GPT-2 models are able to identify simple patterns and generate conclusions with the correct relationship. However, errors occur when the study design becomes more complicated or the outcomes are complex. Therefore, future work should aim at enhancing the language understanding capabilities of generation models. Methods such as pre-training the GPT-2 models with medical domain literature or using external background knowledge might close the gap in correctness. This is crucial for our RCT understanding task and for other tasks that require precise and reliable language generation.
Here we select 3 examples to better illustrate our evaluation methods and the discussed limitations of the current models. The example in Table 5 shows two successful generations from the GPT-2 models. Table 6 shows a false positive example generated by the GPT-2 n = 1 model. On the other hand, a false negative example generated by the GPT-2 n = 0 model can be seen in Table 7. The generated conclusions in Table 7 are also much shorter than the target conclusion written by a human. Other factors that could cause this issue include human authors mentioning information not included in the preceding source document, adding comments on the results and background knowledge, and paraphrasing the same concept in different ways.
Given the above results, this paper opens a new research direction by formulating the RCT conclusion generation task and investigates the potential of language generation models towards better understanding of medical documents.
Source: Proton pump inhibitor ( PPI ) therapy is considered as the first choice for treatment of non-erosive reflux disease ( NERD ) . However , NERD is less sensitive to PPIs than erosive gastroesophageal reflux disease ( GERD ) and the differences between PPIs and H2 receptor antagonists are less evident in NERD than in erosive GERD . Since gastric acid secretion is lower in the Japanese population than in Western populations , we aimed to investigate whether PPI therapy is really necessary for NERD patients in Japan . Both roxatidine and omeprazole significantly improved the heartburn score at 4 and 8 weeks . The clinical response rates did not differ between roxatidine and omeprazole . Both roxatidine and omeprazole significantly relieved not only reflux but also abdominal pain and indigestion . The degrees of improvement did not differ between the two groups .
Target (True Positive): Roxatidine relieved the symptoms of NERD patients with similar effectiveness to omeprazole . Therefore , roxatidine may be a good choice for NERD treatment .
GPT-2 n = 0 (True Positive): Both roxatidine and omeprazole significantly improved the heartburn score at 4 and 8 weeks.
GPT-2 n = 1 (True Positive): Roxatidine and omeprazole are effective in relieving symptoms of NERD in Japanese patients.
Table 5: An example of GPT-2 models generating true positive conclusions.
Source: To evaluate the efficacy of oxcarbazepine ( OXC ) in the treatment of agitation and aggression in patients with Alzheimer 's disease , vascular dementia or both . In total , 103 institutionalized patients at 35 sites were randomized to the trial . After 8 weeks , no statistically significant differences were found between the 2 groups for all outcomes . A trend was observed in favor of the OXC group in the reduction in the scores on the BARS ( p = 0.07 ) .
Target (True Negative): This study found no significant effect of OXC in treatment of agitation and aggression in patients with dementia .
GPT-2 n = 0 (True Negative): OXC was not effective in the treatment of agitation and aggression in patients with Alzheimer's disease, vascular dementia or both.
GPT-2 n = 1 (False Positive): This study suggests that OXC is effective in the treatment of agitation and aggression in patients with Alzheimer's disease.
Table 6: An example of GPT-2 n = 0 model generating a true negative conclusion, while the GPT-2 n = 1 model generated a false positive one.
Source: Atrial fibrillation ( AF ) is the most common complication following coronary artery bypass graft ( CABG ) . The mechanism of AF after CABG is not well defined ; however , it is suggested that endogenous adenosine , released in response to tissue hypoxia , may play a mechanistic role in these arrhythmias . The purpose of this study was to examine whether intravenous theophylline , via adenosine A1 receptor antagonism , would correct or modify new-onset early ( <48 h post CABG ) atrial fibrillation in patients post CABG , and thereby implicate endogenous adenosine as an inciting agent . Thirty patients comprised the study group . In Group A , 8 of the 15 patients ( 53 % ) converted from AF to sinus rhythm within 15 min of theophylline administration . One patient who converted to sinus rhythm 20 min after theophylline administration was accepted as showing a negative response . In the placebo-treated group , no patient converted to sinus rhythm within 15 min ( p <0.007 compared with Group A ) .
Target (True Positive): The mechanism of AF after CABG is not well defined and is probably multifactorial . However , this study demonstrated that antagonism of the adenosine A1 receptor can promptly convert many of these patients back to sinus rhythm , and thereby implicates endogenously released adenosine in a mechanistic role for inciting early ( <48 h ) post-CABG AF .
GPT-2 n = 1 (True Positive): The results of this study suggest that intravenous theophylline, via adenosine A1 receptor antagonism, may correct or modify early AF in patients post CABG.
Table 7: An example of GPT-2 n = 0 model generating a false negative conclusion, while the GPT-2 n = 1 model generated a true positive one.

Conclusion and Future Work
This work introduces the RCT paper conclusion generation task as a stepping stone to the automatic understanding of clinical research literature.
Our results show that the general domain pre-trained GPT-2 language model can be fine-tuned to generate medical domain conclusions. The evaluation results show improvements in both quality and correctness for conclusions generated by the fine-tuned GPT-2 model compared to the pointer-generator summarization model. Further study is needed to enhance the generation quality by reducing repetition errors and increasing the generation length, and to improve the correctness through better language understanding for practical clinical scenarios.
Beyond generating conclusions for RCT papers, generative language models in the medical domain with improved correctness and quality can open up new opportunities for tasks that require profound domain knowledge, for example the automatic generation of systematic review and meta-analysis articles.