Predicting Clinical Trial Results by Implicit Evidence Integration

Clinical trials provide essential guidance for practicing Evidence-Based Medicine, though they often come with enormous costs and risks. To optimize the design of clinical trials, we introduce a novel Clinical Trial Result Prediction (CTRP) task. In the CTRP framework, a model takes a PICO-formatted clinical trial proposal with its background as input and predicts the result, i.e. how the Intervention group compares with the Comparison group in terms of the measured Outcome in the studied Population. Since structured clinical evidence is prohibitively expensive to collect manually, we exploit large-scale unstructured sentences from medical literature that implicitly contain PICOs and results as evidence. Specifically, we pre-train a model to predict the disentangled results from such implicit evidence and fine-tune the model with limited data on the downstream datasets. Experiments on the benchmark Evidence Integration dataset show that the proposed model outperforms the baselines by large margins, e.g., with a 10.7% relative gain over BioBERT in macro-F1. Moreover, the performance improvement is also validated on another dataset composed of clinical trials related to COVID-19.


Introduction
Shall COVID-19 patients be treated with hydroxychloroquine? In the era of Evidence-Based Medicine (EBM, Sackett 1997), medical practice should be guided by well-designed and well-conducted clinical research, such as randomized controlled trials. However, conducting clinical trials is expensive and time-consuming. Furthermore, inappropriately designed studies can be devastating in a pandemic: a high-profile Remdesivir clinical trial failed to achieve statistically significant conclusions (Wang et al., 2020b), partially because it did not attain the predetermined sample size when "competing with" other inappropriately designed trials that are unlikely to succeed or not so urgent to test (e.g.: physical exercises and dietary treatments). Therefore, it is crucial to carefully design and evaluate clinical trials before conducting them.
Proposing new clinical trials requires support from previous evidence in medical literature or practice. For example, the World Health Organization (WHO) has launched a global megatrial, Solidarity (WHO, 2020), to prioritize clinical resources by recommending only the four most promising therapies. The rationale for this recommendation comes from the integration of evidence that these therapies might be effective against coronaviruses or related organisms in laboratory or clinical studies (Peymani et al., 2016; Sheahan et al., 2017; Morra et al., 2018). However, manual integration of evidence is far from satisfactory: one study reports that about 86.2% of clinical trials fail (Wong et al., 2019), and even some of the Solidarity therapies have not obtained the expected results (Mehra et al., 2020).
To assist clinical trial design, we introduce a novel task: Clinical Trial Result Prediction (CTRP), i.e. predicting the results of clinical trials without actually conducting them (§3). Figure 1 shows the architecture of the CTRP task. We define the input to be a clinical trial proposal, which contains free-texts of a Population (e.g.: "COVID-19 patients with severe symptoms"), an Intervention (e.g.: "Active remdesivir (i.v.)"), a Comparator (e.g.: "Placebos matched remdesivir") and an Outcome (e.g.: "Time to clinical improvement"), i.e. a PICO-formatted query (Huang et al., 2006), and the background of the proposed trial. The output is the trial Result, denoting how I (higher, lower, or no difference) compares to C in terms of O for P. One particular challenge of this task is that evidence is entangled with other free-texts in the literature. Prior works have explored explicit methods for evidence integration through a pipeline of retrieval, extraction and inference on structured {P, I, C, O, R} evidence (Wallace et al., 2016; Singh et al., 2017; Jin and Szolovits, 2018; Lee and Sun, 2018; Nye et al., 2018; Lehman et al., 2019; DeYoung et al., 2020; Zhang et al., 2020). However, they are limited in scale, since getting domain-specific supervision for all clinical evidence is prohibitively expensive.
In this work, we propose to implicitly learn from such evidence by pre-training, instead of relying on explicit evidence with purely supervised learning. There are more than 30 million articles in PubMed (https://pubmed.ncbi.nlm.nih.gov/), which stores almost all available medical evidence and is thus an ideal source for learning. We collect 12 million sentences with comparative semantics, a form commonly used to express clinical evidence, from PubMed abstracts and PubMed Central (PMC, https://www.ncbi.nlm.nih.gov/pmc/) articles (§4.1). P, I, C, O, and R are entangled with other free-texts in such sentences, which we denote as implicit evidence. Unlike previous efforts that seek to disentangle all of PICO and R, we only disentangle R out of the implicit evidence using simple heuristics (§4.2). To better learn the ordering function of I/C conditioned on P and O, we also use adversarial examples generated by reversing both the entangled PICO and the R in pre-training (§4.3). Then, we pre-train a transformer encoder (Vaswani et al., 2017) to predict the disentangled R from the implicit evidence, which still contains PICO (§5.1). The model is named EBM-Net to reflect its utility for Evidence-Based Medicine. Finally, we fine-tune the pre-trained EBM-Net on downstream datasets of the CTRP task (§5.2), which are typically small in scale (§6).
Clustering analyses indicate that EBM-Net can effectively learn quantitative comparison results ( §6.4). In addition, the EBM-Net model is further validated on a dataset composed of COVID-19 related clinical trials ( §6.5).
Our contribution is two-fold. First, we propose a novel and meaningful task, CTRP, to predict clinical trial results before conducting them. Second, unlike previous efforts that depend on structured data to understand the totality of clinical evidence, we heuristically collect unstructured textual data, i.e. implicit evidence, and utilize large-scale pre-training to tackle the proposed CTRP task. The datasets and code are publicly available at https://github.com/Alibaba-NLP/EBM-Net.

Related Works
Predicting Clinical Trial Results: Most relevant works use only specific types or sources of information for prediction, e.g.: chemical structures (Gayvert et al., 2016) or drug dosages and routes (Holford et al., 2000, 2010). Gayvert et al. (2016) predict clinical trial results based on chemical properties of the candidate drugs. Clinical trial simulation (Holford et al., 2000, 2010) applies pharmacological models to predict the results of a specific intervention under different procedural factors, such as doses and sampling intervals. Some use closely related report information, e.g.: interim analyses (Broglio et al., 2014) or phase II data for phase III trials (De Ridder, 2005). Our task is (1) more generalizable, since all potential PICO elements can be represented as free-texts and thus modeled in our work; and (2) aimed at evaluating new clinical trial proposals.
Explicit Evidence Integration: This line of work depends on the existence of structured evidence, i.e.: {P, I, C, O, R} (Wallace, 2019). Consequently, collecting such explicit evidence is vital for further analyses, and is also the objective of most relevant works: some seek to find relevant papers through retrieval (Lee and Sun, 2018); many works aim at extracting PICO elements from published literature (Wallace et al., 2016; Singh et al., 2017; Jin and Szolovits, 2018; Nye et al., 2018; Zhang et al., 2020); the evidence inference task extracts R for a given ICO query using the corresponding clinical trial report (Lehman et al., 2019; DeYoung et al., 2020). However, since getting expert annotations is expensive, these works are typically limited in scale, with only thousands of labeled instances, and little work has been done to utilize automatically collected structured data for analyses. In this paper, we adopt an end-to-end approach, using large-scale pre-training to implicitly learn from free-text clinical evidence.

The CTRP Task
The CTRP task is motivated by the need to evaluate clinical trial proposals by predicting their results before actually conducting them, as discussed in §1. Therefore, we formulate the task to take as input exactly the information required for proposing a new clinical trial: free-texts of a background description and a PICO query to be investigated. Formally, we denote the strings of the input background as B and the PICO elements as P, I, C, and O, respectively. The task output is defined as one of the three possible comparison results: higher (↑), no difference (→), or lower (↓) measurement O in intervention group I than in comparison group C for population P. We denote the result as R, i.e.: R ∈ {↑, →, ↓}. Main metrics include accuracy and 3-way macro-averaged F1. We also use 2-way (↑, ↓) macro-averaged F1 to evaluate human expectations (§6.2).
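As a concrete reference, the 3-way and 2-way macro-averaged F1 can be computed with a short, dependency-free sketch; the string label names below are our own placeholders for ↑, ↓, and →:

```python
# Macro-averaged F1 for the CTRP task: F1 is computed per class and averaged,
# so rare classes weigh as much as frequent ones.
LABELS_3WAY = ["up", "down", "nodiff"]  # placeholders for ↑, ↓, →

def f1_per_class(y_true, y_pred, label):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def macro_f1(y_true, y_pred, labels):
    return sum(f1_per_class(y_true, y_pred, l) for l in labels) / len(labels)
```

Passing `LABELS_3WAY` gives the main 3-way metric; passing `["up", "down"]` gives the 2-way variant used to evaluate human expectations.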

Implicit Evidence Integration
In this section, we introduce the Implicit Evidence Integration, which is used to collect pre-training data for comparative language modeling ( §5.1).
Instead of collecting explicit evidence with structured {B, P, I, C, O, R} information, we utilize a simple observation to collect evidence implicitly: clinical evidence is naturally expressed by comparisons, e.g.: "Blood oxygen is higher in the intervention group than in the placebo group". Free-texts of P, I, C, O and R are entangled with other functional words that connect these elements in such comparative sentences, where R is a free-text version of the structured result R (e.g.: R = "higher ... than" translates into R = ↑). We call these sentences entangled implicit evidence and denote them as E ent = {PICOR}. Then, we disentangle R out of E ent by heuristics, getting R and the remaining E dis = {PICO}. We also include adversarial instances generated from the original ones. Several examples are shown in Table 1.
Details of implicit evidence collection, disentanglement, and adversarial data generation are introduced in §4.1, §4.2 and §4.3, respectively.

Collection of Implicit Evidence
We collect implicit evidence from PubMed abstracts and PMC articles, where most clinical evidence is published. PubMed contains more than 30 million abstracts, and PMC has over 6 million full-length articles. Each abstract is chunked into a background/method section and a result/conclusion section: for unstructured abstracts, sentences before the first detected implicit evidence are included in the background/method section; for semi-structured abstracts, where each section is labeled with a section name, the chunking is done by mapping the section name to either background/method or result/conclusion. Sentences in abstract result/conclusion sections and main texts that express comparative semantics (Kennedy, 2004) are collected as implicit evidence. They are identified by a pattern detection heuristic, similar to the keyword method described in Jindal and Liu (2006). The chunked background/method sections serve as the corresponding B for the collected implicit evidence. These sentences are denoted as E ent and contain entangled PICO-R. We have collected 11.8 million such sentences, of which 2.4 million (20.2%), 3.5 million (29.9%), and 5.9 million (49.9%) express inferiority, equality, and superiority, respectively.
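A minimal sketch of such a pattern detection heuristic is shown below; the regular expressions are illustrative stand-ins for the full keyword method, not the actual patterns used in the paper:

```python
import re

# Illustrative comparative patterns mapping a surface form to a coarse
# result direction (the real heuristic covers many more keywords).
PATTERNS = [
    (re.compile(r"\b(higher|greater|larger|better)\b.*\bthan\b", re.I), "up"),
    (re.compile(r"\b(lower|smaller|less|worse)\b.*\bthan\b", re.I), "down"),
    (re.compile(r"\bno (significant )?difference\b.*\bbetween\b", re.I), "nodiff"),
]

def detect_implicit_evidence(sentence):
    """Return the coarse direction if the sentence looks comparative, else None."""
    for pattern, direction in PATTERNS:
        if pattern.search(sentence):
            return direction
    return None
```

Sentences for which `detect_implicit_evidence` returns a direction would be kept as candidate implicit evidence; everything else stays in the background/method pool.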

Disentanglement of Implicit Evidence
To disentangle the free-text result R from the implicit evidence E ent, we mask out the detected morphemes that express comparative semantics (e.g.: "higher than"), as well as other functional tokens that might be exploited by the model to predict the result (e.g.: p-values). This generates the masked-out result R and the remaining part E dis ({PICO}) from E ent ({PICOR}), i.e.: R + E dis = E ent. R is mostly a phrase with a central comparative adjective/adverb (e.g.: "significantly smaller than") and can be directly mapped to R (↓ for the same example).
Nevertheless, R contains richer information than the sole change direction because of the central adjective/adverb. To utilize such information, we map free-texts of R to a finer-grained result label r ∈ C instead of the 3-way direction, where C is the set of fine-grained comparative classes (e.g.: [LOWER] and [NODIFF] in Table 1).
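The disentanglement step can be sketched as follows; the phrase lists, the mask token, and the p-value filter are illustrative assumptions rather than the paper's exact heuristics:

```python
import re

# Sketch of result disentanglement: mask the comparative phrase (and leaky
# tokens such as p-values) out of the entangled evidence E_ent, yielding
# E_dis plus a fine-grained result label. Phrase lists are illustrative.
RESULT_PHRASES = [
    (r"significantly (higher|greater|larger) (than|as compared to)", "[HIGHER]"),
    (r"significantly (lower|smaller|less) (than|as compared to)", "[LOWER]"),
    (r"(higher|greater|larger) than", "[HIGHER]"),
    (r"(lower|smaller|less) than", "[LOWER]"),
    (r"no (significant )?difference between", "[NODIFF]"),
]
P_VALUE = re.compile(r"\(?\s*p\s*[<=>]\s*0?\.\d+\s*\)?", re.I)

def disentangle(e_ent):
    """Split entangled evidence into (E_dis with [MASK], result label), or None."""
    for phrase, label in RESULT_PHRASES:
        m = re.search(phrase, e_ent, re.I)
        if m:
            e_dis = e_ent[:m.start()] + "[MASK]" + e_ent[m.end():]
            e_dis = P_VALUE.sub("", e_dis)  # hide p-values that leak the result
            return e_dis, label
    return None
```

The returned label plays the role of r ∈ C, and the masked sentence plays the role of E dis in pre-training.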

Adversarial Data Generation
We generate adversarial examples from the original ones using a simple rule of ordering: if the result r holds for the comparison I/C conditioned on P and O, the reversed result Rev(r) must hold for the reversed comparison C/I on the same condition. This is similar to generating adversarial examples for the natural language inference task with logic rules (Minervini and Riedel, 2018; Wang et al., 2019).
However, since E dis = {PICO} is only partially disentangled and P, I, C, O are still in their free-text forms, we cannot explicitly reverse I/C to generate such examples. As an alternative, we reverse the entire sentence order while keeping the word order between any two masked phrases in E dis, getting E rev.
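A simplified sketch of this reversal is shown below, assuming one masked result phrase per sentence; the surface normalization (casing, trailing function words) is approximate and the paper's heuristic may differ in detail:

```python
# Swap the text segments around the mask while preserving word order inside
# each segment, and reverse the label accordingly.
REV = {"[HIGHER]": "[LOWER]", "[LOWER]": "[HIGHER]", "[NODIFF]": "[NODIFF]"}

def reverse_evidence(e_dis, mask="[MASK]"):
    """Reverse the segment order around a single mask token."""
    left, right = e_dis.rstrip(". ").split(mask)
    left, right = left.strip(), right.strip()
    return f"{right[:1].upper()}{right[1:]} {mask} {left[:1].lower()}{left[1:]}."

def make_adversarial(e_dis, label):
    return reverse_evidence(e_dis), REV[label]
```

Applied to the E dis example from Table 1, this yields the reversed adversarial instance with its label flipped from [LOWER] to [HIGHER].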

EBM-Net
We introduce the EBM-Net model in this section. Similar to BERT (Devlin et al., 2019), EBM-Net is essentially a transformer encoder (Vaswani et al., 2017) and follows the pre-training and fine-tuning paradigm: we pre-train EBM-Net with Comparative Language Modeling (CLM, §5.1), which is designed to learn the conditional ordering function of I/C, and then fine-tune the pre-trained EBM-Net to solve the CTRP task on downstream datasets (§5.2).

Comparative Language Modeling
We show the CLM architecture in Figure 2. CLM is adapted from the masked language modeling used in BERT (Devlin et al., 2019).

For example, the reversed counterpart of the E dis example in Table 1 is E rev = "Vehicle-treated animals [MASK] levels of viral antigen staining in lung sections of GS-5734-treated animals." The [CLS] hidden state of EBM-Net is used to predict the CLM label r with a linear layer followed by a softmax output unit:

r̂ = softmax(W h_[CLS] + b)
We minimize the cross-entropy between the estimated r̂ and the empirical r distribution.
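The prediction head and its loss can be sketched in plain Python; in the actual model, `h_cls` is the transformer's [CLS] hidden state and `W`, `b` are learned parameters (the tiny dimensions here are purely illustrative):

```python
import math

def clm_head(h_cls, W, b):
    """Linear layer + softmax over fine-grained CLM labels.

    h_cls: [CLS] hidden state (list of floats); W: rows of weights, one per
    label; b: biases. Returns a probability distribution over labels.
    """
    logits = [sum(w_i * h_i for w_i, h_i in zip(row, h_cls)) + b_j
              for row, b_j in zip(W, b)]
    m = max(logits)                       # subtract max for numerical stability
    exp = [math.exp(l - m) for l in logits]
    z = sum(exp)
    return [e / z for e in exp]

def cross_entropy(probs, gold_index):
    """Negative log-likelihood of the gold label."""
    return -math.log(probs[gold_index])
```

During pre-training, the gradient of this cross-entropy with respect to the transformer parameters drives the model to infer the masked comparative result from the surrounding PICO context.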
At the input level, the adversarial examples differ from their original examples only in the word order between E dis and E rev. However, their labels are totally reversed from r to Rev(r). By regularizing the model to learn this conditional ordering function, CLM prevents the pre-trained model from learning unwanted and possibly biased co-occurrences between evidence elements and their results.

CTRP Fine-tuning
In fine-tuning, EBM-Net takes as input the trial background B concatenated with an explicit evidence string E exp composed of the free-text PICO elements (e.g.: [I, O, C] on the Evidence Integration dataset, §6.1). The sequence of PICO elements in E exp can be tuned empirically. EBM-Net learns from scratch another linear layer that maps the predicted CLM label probabilities r̂ to 3-way result label R logits. The final predictions are made by a softmax output unit, and the cross-entropy between the estimated R̂ and the empirical R distribution is minimized in fine-tuning.

Configuration
The transformer weights of EBM-Net (L=12, H=768, A=12, #Params=110M) are initialized with BioBERT (Lee et al., 2020), a variant of BERT that is also pre-trained on PubMed abstracts and PMC articles. The maximum sequence lengths for B, E dis, E rev, and E exp are 256, 128, 128, and 128, respectively. We use the Adam optimizer (Kingma and Ba, 2014) to minimize the cross-entropy losses. EBM-Net is implemented with Hugging Face's Transformers library (Wolf et al., 2019) in PyTorch (Paszke et al., 2019). Pre-training on the 12M instances of implicit evidence takes about 1k Tesla P100 GPU hours.

The Evidence Integration Dataset
The Evidence Integration dataset serves as a benchmark for our task. We collect this dataset by repurposing the evidence inference dataset (Lehman et al., 2019; DeYoung et al., 2020), which is essentially a machine reading comprehension task for extracting the structured result (i.e.: R) of a given structured ICO query from the corresponding clinical trial report article. Since clinical trial reports already contain free-text result descriptions (i.e.: R) of the given ICO, solving the original task does not require the integration of previous clinical evidence. To test such capability for our proposed CTRP task, we remove the result/conclusion part and keep only the background/method part of the input clinical trial report. On average, 34.6% of the tokens in the original abstracts are removed, and the remainder is used as the clinical trial backgrounds.
Specifically, the input of the Evidence Integration dataset includes free texts of the ICO elements I, C and O, which are the same as in the original evidence inference dataset, and their clinical trial backgrounds B. The output is the comparison result R. Following the original dataset split, there are 8,164 instances for training, 1,002 for validation, and 965 for testing.
We also do experiments under the adversarial setting, where adversarial examples generated by reversing both the I/C order and the R label (similar to §5.1) are added. This setting is used to test model robustness under adversarial attack.

Compared Methods
We compare to a variety of methods, ranging from trivial ones like Random and Majority to the state-of-the-art BioBERT model. Two major approaches in open-domain question answering (QA) are tested as well: the knowledge base (KB) approach (MeSH Ontology) and the text/retrieval approach (Retrieval + Evidence Inference), since solving our task also requires reasoning over a large external corpus. Finally, we introduce some ablation settings and the evaluation of human expectations.
Random: we report the expected performance of randomly predicting the result for each instance.
Majority: we report the performance of predicting the majority class (→) for all test instances.
Bag-of-Words + Logistic Regression: we concatenate the TF-IDF weighted bag-of-words vectors of B, P, I, C and O as features and use logistic regression for learning.
MeSH Ontology: Since no external KB is available for our task, we use the training set as an internal alternative: we map the I, C and O of the test instances to terms in the Medical Subject Headings (MeSH) ontology by string matching. MeSH is a controlled and hierarchically-organized vocabulary for describing biomedical topics. Then, we find their nearest labeled instances in the training set, where the distance between instances i and j aggregates TreeDist(m_i^e, m_j^e) over the MeSH terms m_i^e and m_j^e identified in each ICO element e. TreeDist is defined as the number of edges between two nodes on the MeSH tree. The majority label of the nearest training instances is used as the prediction.
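A sketch of this distance computation is shown below; the aggregation choice (per-element minimum, then summed over I, C, O) is our assumption, and TreeDist is computed on dot-separated MeSH tree numbers such as "C01.925.782":

```python
def tree_dist(tn_a, tn_b):
    """Edges between two MeSH tree numbers (dot-separated paths)."""
    a, b = tn_a.split("."), tn_b.split(".")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # path up from a to the deepest common ancestor, then down to b
    return (len(a) - common) + (len(b) - common)

def instance_dist(mesh_i, mesh_j):
    """Hypothetical instance distance: per ICO element, the minimal tree
    distance between any pair of matched MeSH terms, summed over elements."""
    total = 0
    for element in ("I", "C", "O"):
        pairs = [tree_dist(a, b)
                 for a in mesh_i.get(element, [])
                 for b in mesh_j.get(element, [])]
        total += min(pairs) if pairs else 0
    return total
```

The nearest-neighbor prediction then takes the majority label among the training instances with the smallest `instance_dist` to the test instance.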
Retrieval + Evidence Inference: The state-of-the-art method on the evidence inference dataset (DeYoung et al., 2020) is a pipeline based on SciBERT (Beltagy et al., 2019): (1) find the exact evidence sentences in the clinical trial report for the given ICO query, using a scoring function derived from a fine-tuned SciBERT; and (2) predict the result R based on the found evidence sentences and the given ICO query by fine-tuning another SciBERT.
Our task needs an additional retrieval step to find relevant documents that might contain useful results of similar trials, as the input trial background does not contain the result information for the given ICO query. Documents are retrieved from the entire PubMed and PMC using a TF-IDF matching between their indexed MeSH terms and the MeSH terms identified in the ICO queries. We then apply the pipeline described above to the retrieved documents. This baseline is similar to, but more domain-specific than, BERTserini (Yang et al., 2019).
BioBERT: For this setting, we feed BioBERT with similar input to EBM-Net as is described in §5 and fine-tune it to predict the R using its special [CLS] hidden state.
Ablations: We conduct two sets of ablation experiments with EBM-Net: (1) Pre-training level, where we exclude the adversarial examples in pretraining, to analyze the utility of CLM against traditional LM.
(2) Input level, where we exclude different input elements (B, I, C, O) to study their relative importance.
Human Expectations: We define the expected result (R_e) of a clinical trial (e.g.: R_e = ↓ for O = "mortality rate") as the Human Expectation (HE), which is the underlying motivation for conducting the corresponding trial. Generally, R_e ∈ {↑, ↓} since significant results are expected. To make fair comparisons, we use the 2-way macro-averaged F1: F1 (2-way) = (F1(↑) + F1(↓))/2 as a main metric for evaluations of HE. HE performance is an overestimation of human performance: the main biases are due to the shift of the input trial distribution from the targeted proposal stage to the actual report stage, which contains fewer trials with unexpected results.

We use |∆|, the absolute value of the relative accuracy decrease, to measure model robustness under adversarial attacks. The higher the |∆|, the more vulnerable a model is. BioBERT has about twice as much |∆| in the adversarial setting as EBM-Net does (5.1% vs. 2.7%). This suggests that EBM-Net is more robust to adversarial attacks, which is a vital property for healthcare applications. EBM-Net without adversarial pre-training is also less robust than EBM-Net (3.0% vs. 2.7%), but not as vulnerable as BioBERT, indicating that robustness can be learned to some extent by pre-training with the original implicit evidence and further consolidated by the adversarial evidence.
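The robustness metric |∆| is simply the relative accuracy decrease between the clean and adversarial settings; as a one-line sketch (the 60%/57% accuracies below are hypothetical, not the paper's numbers):

```python
def relative_accuracy_decrease(acc_clean, acc_adversarial):
    """|Delta|: absolute relative drop in accuracy under adversarial attack."""
    return abs(acc_clean - acc_adversarial) / acc_clean

# A model dropping from a hypothetical 60% to 57% accuracy has |Delta| = 5%.
```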

Main Results
Unsurprisingly, EBM-Net with full input consistently outperforms all input-level ablations. Among them, O is the most important input element, as the performance decreases dramatically on its ablation. This is expected since O is the standard of comparisons. B is the second most important element, since B contains methodological details of how the clinical trials will be conducted, which is also vital for result prediction. The performance does not decrease as much without I or C, since there is redundant information about them in B.
On the one hand, the accuracy of EBM-Net surpasses that of HE, mainly because the latter is practically a 2-way classifier. On the other hand, HE outperforms EBM-Net in terms of 2-way F1, but is still unsatisfying (68.86%). This suggests that the proposed CTRP task is hard and there is still room for further improvements.

Discussions
We study how different numbers of pre-training and fine-tuning instances influence EBM-Net performance, in comparison to BioBERT. Figure 3 shows the results: (Left) The final performance of EBM-Net improves log-linearly as the pre-training dataset size increases, suggesting that there can be further improvements if more data is collected for pre-training, though the marginal utility might be small. EBM-Net surpasses BioBERT when pre-trained with about 50k to 100k instances of implicit evidence, which is 5 to 10 times as many as the fine-tuning instances. (Right) EBM-Net is more robust in the few-shot learning setting: using only 10% of the training data, EBM-Net outperforms BioBERT fine-tuned with 100% of the training data. From the zero-shot setting to using all the training data, EBM-Net improves by only 26.6% relative F1 (from 47.52% to 60.15%), while BioBERT improves by as much as 60.0% relative F1 (from 32.77% to 54.33%).
We use t-SNE (Maaten and Hinton, 2008) to visualize the test instance representations derived from EBM-Net [CLS] hidden state in Figure 4. It shows that EBM-Net effectively learns the relationships between comparative results: the points cluster into three results (↑, ↓, →). While there is a clear boundary between the ↓ cluster (dashed-blue circle) and the ↑ cluster (dashed-red circle), the boundaries between the → cluster (dashed-black circle) and the other two are relatively vague. It suggests that the learnt manifold follows a quantitatively continuous "↓ -→ -↑" pattern.
Out of the 373 mistakes EBM-Net makes on the test set, significantly fewer (11.8%, p<0.001 by permutation test) predictions are opposite to the ground truth (e.g.: predicting ↑ when the label is ↓), also suggesting that EBM-Net effectively learns the relationships between comparison results. In addition, we notice that a considerable proportion of instances have results that are not predictable without their exact reports. For example, some I and C differ only quantitatively, e.g.: "4% lidocaine" and "2% lidocaine", and modeling such differences is beyond the scope of our task.

Figure 4: t-SNE visualizations of EBM-Net representations of Evidence Integration test set instances. Red, blue, and green points refer to the corresponding R equaling ↑, ↓ and →, respectively.

Validation on COVID-19 Clinical Trials
For analyzing COVID-19 related clinical trials, we further pre-train EBM-Net on the CORD-19 dataset (Wang et al., 2020a), also using the comparative language modeling (§5.1). This yields a COVID-19 specific EBM-Net that is used in this section.
We use leave-one-out validation to evaluate EBM-Net on the 22 completed clinical trials in COVID-evidence, an expert-curated database of available evidence on interventions for COVID-19. Again, EBM-Net outperforms BioBERT by a large margin (59.1% vs. 50.0% accuracy). Expectedly, their 3-way F1 results (45.5% vs. 36.1%) are close to those in the zero-shot learning setting, since not many trials have finished. The accuracy and 2-way F1 of HE are 54.5% and 68.9%, close to those in Table 2. These results further confirm the performance improvement of EBM-Net and the difficulty of the CTRP task.

Conclusions
In this paper, we introduce a novel task, CTRP, to predict clinical trial results without actually doing them. Instead of using structured evidence that is prohibitively expensive to annotate, we heuristically collect 12M unstructured sentences as implicit evidence, and use large-scale CLM pretraining to learn the conditional ordering function required for solving the CTRP task. Our EBM-Net model outperforms other strong baselines on the Evidence Integration dataset and is also validated on COVID-19 clinical trials.