Extracting Semantic Aspects for Structured Representation of Clinical Trial Eligibility Criteria

Eligibility criteria in clinical trials specify the characteristics that a patient must or must not possess in order to be treated according to a standard clinical care guideline. As manual eligibility determination is time-consuming, there is a pressing need to automatically structure eligibility criteria into semantic categories or aspects. Existing methods rely on hand-crafted rules and feature-based statistical machine learning to induce semantic aspects. To deal with the paucity of aspect-annotated clinical trial data, we propose a novel weakly-supervised co-training based method that exploits a large pool of unlabeled criteria sentences to augment the limited supervised training data, and consequently enhance performance. Experiments on 0.2M criteria sentences show that the proposed approach outperforms competitive supervised baselines by 12% in terms of micro-averaged F1 score across all aspects. Deeper analysis shows that domain-specific information boosts performance by a significant margin.


Introduction
Clinical trials (CTs) are research studies aimed at evaluating a medical, surgical, or behavioral intervention (Embi et al., 2008; Shivade et al., 2015). Through such trials, researchers aim to find out whether a new treatment, such as a new drug, diet, or medical device, is more effective than existing treatments for a particular ailment. From an organization's perspective, successful completion of a trial depends on enrolling a significant sample of patients within a limited time period.

* The first two authors contributed equally.

[Figure: example eligibility criteria sentences, with aspect categories color-coded as Health Status; Lab Test; Demography; Life Style; Treatment Status. Examples: "Total bilirubin less than or equal to 1.5 mg/dl, except in patients with history of anaemia." "Have had their ileostomy or colostomy for at least 3 months." "Subjects must be between the age of 18-65 yr old and must not intake alcohol."]

However, recruiting enough eligible patients to participate in a trial can be a bottleneck; if suitable patients are not found, trials may be cancelled or significantly delayed. In this setting, a patient queries sites like clinicaltrials.gov to retrieve suitable trials. Because the task involves repeated reading of the patient's Electronic Health Record (EHR) and the criteria of multiple trials, it is not only labor-intensive and time-consuming but also prone to human error. In addition, eligibility criteria often use complex language structures and medical jargon, expressed in either semi-structured or unstructured form.
Previous work (Koopman and Zuccon, 2016) formulated the problem of retrieving a relevant document collection based on a patient query. In contrast, we demonstrate an approach in which the primary eligibility aspects are identified first, for subsequent screening of patients under an inclusion or exclusion strategy; this is the first step towards matching patients with relevant trials.
In this paper, we propose an effective method that automatically identifies and segregates clinical trial eligibility criteria into five semantic aspects. Criteria texts speak volumes about multiple aspects of a patient, including demographic information, health status, treatment history, laboratory test reports, and life-style. However, there is a dearth of annotated criteria. Since prior neural clinical entity recognition models rely on the presence of large annotated corpora, and given the high cost of manually tagging semantic aspects and the limited availability of labeled datasets (Najafabadi et al., 2015), it is difficult to train a deep neural network effectively for such a task. We attempt to combat this difficulty by proposing a novel semi-supervised method based on deep co-training (Blum and Mitchell, 1998), which can harness a large pool of unlabeled clinical trial criteria that are far more economical to collect. To the best of our knowledge, we are the first to introduce such a co-training-based method and demonstrate its effectiveness for aspect categorization of clinical trials, in comparison to stand-alone sequence labeling in isolation. The end product of our experiments is a clinical trial register that contains details of the different aspects across conditions and interventions.

Problem Formulation
Given an eligibility criteria sentence in the form of a word sequence x = (x_1, ..., x_n), where n is the maximum length of the sequence, the task is to predict an output sequence y = (y_1, ..., y_n) in which each y_i is encoded using a standard sequence-labeling encoding scheme. Each y_i takes one of the following five aspects: Health Status, Lab Test, Demography, Life Style, or Treatment Status.
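To make the encoding concrete, here is a minimal sketch of a BIO-tagged criteria sentence and span recovery. The BIO scheme and the exact tag strings are illustrative assumptions for the "standard sequence labeling encoding scheme", not necessarily the paper's exact labels:

```python
# Hypothetical sketch: a criteria sentence with BIO-style aspect labels.
sentence = ["Subjects", "must", "be", "between", "18", "and", "65", "years", "old"]
labels = ["O", "O", "O", "O",
          "B-Demography", "I-Demography", "I-Demography", "I-Demography", "I-Demography"]

def spans_from_bio(tokens, tags):
    """Recover (aspect, phrase) spans from a BIO-labeled sequence."""
    spans, current, aspect = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((aspect, " ".join(current)))
            aspect, current = tag[2:], [tok]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                spans.append((aspect, " ".join(current)))
            current, aspect = [], None
    if current:
        spans.append((aspect, " ".join(current)))
    return spans
```

Recovering spans from the labels above yields the demography phrase "18 and 65 years old".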

Data Annotation
To induce semantic categorization of aspects in the eligibility criteria, we generate a small pool of annotated data. In the initial phase, we manually examine some of the most frequently used n-gram patterns, such as history of, upper limit of normal, treated by, Allergy to, as specified in (Luo et al., 2011). During pre-processing, we filter out the most frequently occurring n-grams (n=2, 3, 4, 5) present in the criteria. Secondly, the criteria sentences are tagged with the CliNER tagger (Boag et al., 2015) to extract diseases and drugs. Further details of the data are provided in the supplementary material 1 . After these two steps, false positives are removed under manual supervision by four independent domain-expert annotators, who provide annotations for each of the different categories.
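The first weak-annotation step above can be sketched as follows. The seed pattern-to-aspect mapping below is an illustrative assumption based on the example patterns in the text, not the full lexicon of Luo et al. (2011):

```python
from collections import Counter

# Hypothetical seed lexicon mapping n-gram patterns to aspects (assumed for illustration).
SEED_PATTERNS = {
    "history of": "Health Status",
    "upper limit of normal": "Lab Test",
    "treated by": "Treatment Status",
    "allergy to": "Health Status",
}

def frequent_ngrams(sentences, n_values=(2, 3, 4, 5), top_k=10):
    """Collect the most frequent n-grams (n = 2..5) over the criteria sentences."""
    counts = Counter()
    for sent in sentences:
        toks = sent.lower().split()
        for n in n_values:
            for i in range(len(toks) - n + 1):
                counts[" ".join(toks[i:i + n])] += 1
    return [g for g, _ in counts.most_common(top_k)]

def weak_label(sentence):
    """Assign candidate aspects via seed-pattern matching; false positives
    are later removed by the manual annotators."""
    s = sentence.lower()
    return sorted({aspect for pat, aspect in SEED_PATTERNS.items() if pat in s})
```

This produces noisy candidate labels that the four domain-expert annotators then verify.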
The mean Cohen's Kappa (McHugh, 2012) was 0.82, which indicates good inter-annotator agreement; statistics of the annotated data are reported in Table 1. While manually inspecting the co-occurrence statistics of different aspects within the same criteria sentence of the manually annotated dataset, we observe that around 30% of the eligibility criteria contain more than one aspect; of these, 65% contain the health, life-style, and demography aspects, while the remaining 35% contain demography and treatment. To facilitate further research, we will also release samples of the annotated corpus.
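For reference, Cohen's Kappa as used above can be computed from two annotators' label sequences via the standard formula; the helper below is our own illustration, not the paper's code:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e the chance agreement from each annotator's marginals."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)
```

A kappa of 0.82, as reported, sits in the range conventionally interpreted as strong agreement.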

Methodology
In this work, we experiment with two different methods of aspect extraction. The first is the traditional supervised setup: a BiLSTM-CRF / Bi-GRU-CRF over an input representation, optimized using categorical cross-entropy loss (Zhang and Sabuncu, 2018). The second is the co-training (Blum and Mitchell, 1998) method for extracting the semantic aspects, outlined in Algorithm 1. The latter method uses two conditionally independent feature views of the same dataset:

1. Domain-independent: contextual pre-trained language model embeddings such as BERT (Devlin et al., 2019) (E1) (or word2vec (Mikolov et al., 2013) embeddings trained on the GoogleNews corpus (E2)), followed by a BiLSTM-CRF (C_1) (Huang et al., 2015) feature extractor.

2. Domain-specific: domain-specific Bio-BERT embeddings, followed by a Bi-GRU-CRF (C_2) feature extractor.
At each step of co-training, the classifiers C_1 and C_2 are trained on their respective views of the training sets, V_1 and V_2, minimizing the loss function. Each instance from the unlabeled pool (U) is scored as follows. First, the current classifier decodes the output label distribution for each word in the unlabeled instance; for each word, we choose the label with maximum probability. The score of the sample is the product of the probabilities of the chosen labels over all words in the sequence, normalized by the total number of words in the sentence. If this confidence score exceeds a pre-defined threshold τ, the sample, together with the output labels generated by the classifier, is added to the training set of the other classifier. This is how weak labels are generated for each sequence. Through this interchange of training data, the two classifiers learn from each other's mistakes and work in synergy.
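The confidence scoring and exchange step can be sketched as below. We read "product of the probabilities ... normalized by the total number of words" as a length-normalized product (computed in log space for stability); this interpretation, and all names in the sketch, are our assumptions:

```python
import math

def confidence_score(word_label_probs):
    """Length-normalized product of per-word max label probabilities.
    `word_label_probs` is a list of {label: prob} dicts, one per word."""
    log_prod = sum(math.log(max(dist.values())) for dist in word_label_probs)
    return math.exp(log_prod / len(word_label_probs))

def select_for_exchange(unlabeled_scored, tau=0.5):
    """Keep unlabeled samples whose confidence exceeds the threshold tau;
    these, with their predicted labels, move to the other classifier's training set."""
    return [(sent, labels) for sent, labels, score in unlabeled_scored if score > tau]
```

With this normalization the score is a geometric mean of the per-word confidences, so long sentences are not penalized relative to short ones.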

Experimental Details
We implement the model using PyTorch 0.3.0. The two classifiers considered for co-training are C_1: Bi-LSTM-CRF and C_2: Bi-GRU-CRF. For both the supervised and co-training methods, the training data is divided according to a 70-30% train-validation split.

Hyper-parameters for the two independent views: We run two experiments based on co-training, one using contextual embeddings (C-CTr) and the other using context-independent embeddings (NC-CTr). The hyper-parameter settings for the two views required by the co-training method are as follows. View 1: For the first view (V_1), we use a Bi-LSTM-CRF (Huang et al., 2015) with domain-independent word embeddings. We experiment with both a) (NC-CTr) word2vec embeddings trained on the GoogleNews corpus with dimension 300, and b) (C-CTr) pre-trained bert-base (12 layers, 12 attention heads, and 110 million parameters).

Results and Analysis
In this section, we provide a detailed analysis of the results and findings observed during experimentation.

Comparison with the baselines: The results of the baseline methods are enumerated in Table 2. We report results based on the exact match of each aspect type using F1-score. Following (Luo et al., 2011), we implement their method (Baseline-1) on our dataset with UMLS (Bodenreider, 2004) feature representations and bag-of-words (BoW) features, and report results for the various aspects. Although (Luo et al., 2011) assumes each criteria sentence belongs to a single aspect, we also perform ablations of Baseline-1 without UMLS features (Baseline-1(1)) and without BoW (Baseline-1(2)). We observe that the UMLS feature representation boosts performance due to the inclusion of domain-specific information. Our work also finds resonance with (Chalapathy et al., 2016), whose corpus uses multiple annotations. Since their code is publicly available, we experiment with their stand-alone Bi-LSTM-CRF approach as Baseline-2 and report results for each of the first three annotated aspects.
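The exact-match F1 used for Table 2 is a standard span-level metric; a minimal sketch of our reading of it (the implementation is our illustration, not the paper's evaluation script):

```python
def exact_match_f1(gold_spans, pred_spans):
    """Span-level F1 for one aspect. Each argument is a set of
    (sentence_id, start, end) spans for that aspect; a prediction
    counts as correct only if it matches a gold span exactly."""
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Micro-averaging pools the true positives and span counts across all aspects before computing F1, while macro-averaging takes the mean of the per-aspect F1 scores.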

Feature ablation on model architecture:
For a fair comparison, we experiment with different ablations of the feature extractor and the type of input representation (in the supervised setup), and present macro-averaged F1 scores in Table 3. We observe that the Bi-LSTM-CRF with domain-specific Bio-BERT input representations outperforms the other ablations.
Impact of using co-training: As is evident from Table 4, when the two independent views use contextualized embeddings (C-CTr), the model outperforms the supervised baselines.

Sensitivity of co-training parameters: In Figure 1, the macro-F1 score (across all aspects) of the co-trained model is evaluated against the co-training threshold. Threshold values are varied from 0 to 1 at intervals of 0.1, with the optimum observed at 0.5. The sensitivity of the other co-training parameters is shown in Figure 2.
Effect of unlabelled data size: Moreover, the results remain fairly constant as the unlabeled data size varies (enumerated in Table 4), which demonstrates the robustness of our approach. The contextualized representations, when augmented with a fair amount of semi-automatically annotated samples, outperform the supervised baseline setup.

Conclusion
In this paper, we have proposed a semi-supervised co-training method to tackle the scarcity of annotated data for semantic clinical aspect extraction. The method augments a limited pool of annotated data with a large number of unlabeled clinical eligibility criteria, outperforming purely supervised approaches. To the best of our knowledge, we are the first to provide an effective semi-supervised approach for detecting semantic aspects in clinical eligibility criteria, which opens a promising direction for further research on automatically linking patient Electronic Health Records (EHRs) to clinical eligibility criteria. As future work, we aim to build an end-to-end automatic matching system for patient-to-trial eligibility with low-cost data annotation.