Learning from Unlabelled Data for Clinical Semantic Textual Similarity

Domain pretraining followed by task fine-tuning has become the standard paradigm for NLP tasks, but requires in-domain labelled data for task fine-tuning. To overcome this, we propose to utilise in-domain unlabelled data by assigning pseudo labels with a general-domain model. We evaluate the approach on two clinical STS datasets, and achieve r = 0.80 on N2C2-STS. Further investigation reveals that when the distribution of the unlabelled sentence pairs is closer to the test data, we obtain better performance. By leveraging a large general-purpose STS dataset and small-scale in-domain training data, we obtain further improvements to r = 0.90, a new SOTA.


Introduction
Semantic textual similarity (STS) measures the degree of semantic equivalence between two text snippets, expressed as a graded numerical value, with applications including question answering (Yadav et al., 2020), duplicate detection (Poerner and Schütze, 2019), and entity linking (Zhou et al., 2020).
Modern pretrained language models have achieved impressive results for general STS (Devlin et al., 2019). However, in low-resource domains without in-domain labelled data, results are generally lower (Wang et al., 2020b). In the clinical domain in particular, annotation requires medical experts (Wang et al., 2018; Romanov and Shivade, 2018), meaning that labelled datasets are generally small, hampering clinical STS.
We address the question of how to apply pretrained language models to such domain-specific tasks where there is little or no labelled data, focusing specifically on the task of clinical STS.
Employing a general STS model typically yields poor results over technical domains due to covariate shift. To bridge this gap, a standard approach is to pretrain the LM on in-domain text, such as ClinicalBERT (Alsentzer et al., 2019) using MIMIC-III (Johnson et al., 2016). However, existing research has tended to estimate effectiveness under the fine-tuning setting, rather than via zero-shot inference (Peng et al., 2019; Wang et al., 2020b).
In this paper, we first evaluate domain pretraining approaches for clinical STS, with no labelled data. Based on the assumption that general STS models trained on large-scale STS datasets will perform reasonably well on clinical sentence pairs (Section 4), we then experiment with learning from the pseudo-labelled data (Section 5).
Experimental results show both domain pretraining and pseudo-labelled data fine-tuning improve clinical STS, and the combination of the two achieves the best performance of r = 0.80 on N2C2-STS (Section 6.3). Further analysis shows that the score distribution and volume of pseudo-labelled pairs influence the performance of fine-tuning. We also find that training for more iterations leads to minor improvements.
The paper makes three major contributions: (1) we propose a simple pseudo-training method, and show it to perform well on clinical STS; (2) we evaluate several existing approaches to clinical STS in a zero-shot setting, and benchmark against our method; and (3) we achieve state-of-the-art results of r = 0.90 for N2C2-STS.

Related Work
The general approach to domain-specific task modelling is: (1) pretrain a language model (LM) on a large volume of open-domain text (Devlin et al., 2019; Liu et al., 2019); and (2) fine-tune on domain-specific text and task-specific labelled data (Gururangan et al., 2020; Peng et al., 2019). For this approach, however, domain-specific labelled data is required, an assumption that we seek to relax.
For STS, in the absence of labelled data, the simplest approach is to calculate the cosine similarity between the [CLS] vectors of the two sentences, or between their averaged last-layer token embeddings, but this tends to perform poorly, often worse than averaged GloVe (Pennington et al., 2014) embeddings. SBERT (Reimers and Gurevych, 2019) instead uses a Siamese structure based on BERT to learn sentence representations, fine-tuning the model over general NLI data and then continuing to fine-tune on general STS data (STS-B) (Cer et al., 2017). In this work, we experiment with this approach specifically in the clinical context.
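For concreteness, the following is a minimal sketch of these zero-shot baselines, scoring a pair by the cosine similarity of BERT [CLS] vectors or mean-pooled token embeddings. The checkpoint and helper names are illustrative assumptions, not code from any of the cited systems:

```python
# Zero-shot STS baselines: cosine similarity between [CLS] vectors or
# mean-pooled last-layer embeddings of two sentences (a sketch, assuming
# the Hugging Face transformers API and a generic BERT checkpoint).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(sentence: str, pooling: str = "cls") -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    if pooling == "cls":
        return hidden[0, 0]  # the [CLS] vector
    mask = inputs["attention_mask"][0].unsqueeze(-1)  # ignore padding
    return (hidden[0] * mask).sum(0) / mask.sum()     # mean over real tokens

def similarity(s1: str, s2: str, pooling: str = "cls") -> float:
    return torch.cosine_similarity(embed(s1, pooling), embed(s2, pooling), dim=0).item()
```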

Datasets and Tasks
We select two available clinical STS benchmark datasets for evaluation: MedSTS (Wang et al., 2018) and N2C2-STS (Wang et al., 2020a). The latter annotates 412 instances as a new test bed, and updates the training partition by labelling an extra 574 instances and merging the former training and test cases (see Table 1). Given a sentence pair (S1, S2), our aim is to predict a score close to the gold label: a numerical value ranging from 0 to 5, where 0 denotes completely dissimilar semantics and 5 denotes complete equivalence in meaning.
For example, S1: Discussed goals, risks, alternatives, advanced directives, and the necessity of other members of the surgical team participating in the procedure with the patient. S2: Discussed risks, goals, alternatives, advance directives, and the necessity of other members of the healthcare team participating in the procedure with the patient and his mother. Label: 4, as the two sentences are mostly equivalent, differing only in unimportant details ("surgical team" vs. "healthcare team", and the mention of the patient's mother).
Pearson's correlation (r) and Spearman's correlation (ρ) between the predicted and gold standard scores are used as evaluation metrics.
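Concretely, scoring reduces to two standard library calls; the values below are toy placeholders for illustration:

```python
# Pearson's r and Spearman's rho between predicted and gold scores.
from scipy.stats import pearsonr, spearmanr

gold = [4.0, 2.5, 0.0, 5.0, 3.5]  # toy gold scores on the 0-5 scale
pred = [3.8, 2.9, 0.4, 4.7, 3.1]  # toy model predictions

r, _ = pearsonr(gold, pred)
rho, _ = spearmanr(gold, pred)
print(f"r = {r:.3f}, rho = {rho:.3f}")
```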

Observations
In modern NLP, large amounts of high-quality training data are a key element in building successful systems (Aharoni and Goldberg, 2020). This is also the case with STS, where additional training data has been shown to improve accuracy (Wang et al., 2020b). However, domain shifts inevitably lead to performance drops (Gururangan et al., 2020). Therefore, we ask: RQ1: Can large-scale general-domain labelled STS data be transferred to train clinical STS models? RQ2: How does low-quality training data impact clinical STS performance, compared to high-quality labelled data or no labelled data?
Effect of Larger General STS Corpus. We source general-domain labelled data from: (1) the SemEval-STS shared tasks 2012-2017 (Agirre et al., 2012, 2013, 2014, 2015, 2016; Cer et al., 2017); and (2) SICK-R (Marelli et al., 2014). This results in a total of 28,518 labelled sentence pairs, which we refer to as "STS-G". We adopt a BERT encoder connected to a linear regression layer, where the CLS-vector is used to represent the sentence pair (CLS-BERT), and fine-tune a general-domain STS model using STS-G. We compare this with a model trained only on STS-B, and evaluate both models on the STS-B dev set (same setup as Section 6.1).
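A minimal sketch of the CLS-BERT regressor follows, assuming the Hugging Face transformers API: the two sentences are packed into a single input, and a linear head over the pooled [CLS] representation is trained with an MSE loss against the gold score. This is an illustrative reconstruction, not our exact training code:

```python
# CLS-BERT sketch: BERT encodes the sentence pair jointly; a linear
# regression head over the pooled [CLS] vector predicts the STS score.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression"
)

batch = tokenizer(
    ["The patient denies chest pain."],   # S1 (toy example)
    ["Patient reports no chest pain."],   # S2
    return_tensors="pt", padding=True, truncation=True,
)
labels = torch.tensor([4.5])  # gold similarity score

out = model(**batch, labels=labels)  # num_labels=1 -> MSE regression loss
out.loss.backward()
predicted_score = out.logits.squeeze()
```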
For clinical STS, we employ a hierarchical convolution (HConv) model based on BERT (updating the parameters of the last four layers only), where the model is first fine-tuned on STS-B, then on N2C2-STS augmented by back-translation (Wang et al., 2020b). The model architecture and hyperparameter settings are the same as in the original paper; we merely replace STS-B with STS-G, and observe that more training data improves clinical STS.
As shown in Table 2, the extra training data in STS-G yields an increase in r of up to .028 in the case of HConvBERT (Wang et al., 2020b), resulting in a new SOTA of r = .902.
Discussion. Though general-domain data lacks clinical information, the model clearly benefits from the extra out-of-domain training data (answering RQ1). This inspires us to rethink the clinical STS task as a combination of domain-specific text understanding and domain-invariant task learning, leading to the question: can the two aspects be learned separately? That is, can task learning take place via large volumes of general-domain labelled data, and domain-specific characteristics be learned from silver-standard labelled domain data, such as low-quality clinical sentence pairs labelled by a general STS model?

Method
Next, we investigate the use of pseudo-labelled clinical data based on the general STS model.

Pseudo-Labelled Sentence Pairs
Gururangan et al. (2020) show that the closer the distribution of the pretraining text is to the task data, the better the resulting performance. Based on this, we propose a distribution-centric strategy for generating and selecting sentence pairs.
Generation. We use two data sources to generate unlabelled sentence pairs: MIMIC-III clinical notes, and N2C2-STS training data (ignoring labels). We sample 10,000 discharge summaries from MIMIC-III, which we segment into 27 parts based on section subtitles. Of these, we select the five sections we consider most related to the N2C2-STS task: diagnosis, medications, history of present illness, follow-up instructions, and physical exam. After sentence segmentation using spaCy (Honnibal and Montani, 2017), we randomly sample sentence pairs from within each section partition.
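A sketch of this generation step, under stated assumptions: the notes are already split by section subtitle, and `section_texts` is a hypothetical mapping from section name to text:

```python
# Pair generation: sentence-split each targeted section with spaCy, then
# randomly pair sentences within the same section partition.
import random
import spacy

nlp = spacy.load("en_core_web_sm")  # used only for sentence segmentation
SECTIONS = ["diagnosis", "medications", "history of present illness",
            "follow-up instructions", "physical exam"]

def generate_pairs(section_texts, pairs_per_section=1000, seed=13):
    """section_texts: dict mapping a section name to a list of section strings."""
    rng = random.Random(seed)
    pairs = []
    for name in SECTIONS:
        sents = [s.text.strip() for text in section_texts.get(name, [])
                 for s in nlp(text).sents if s.text.strip()]
        for _ in range(min(pairs_per_section, len(sents) // 2)):
            s1, s2 = rng.sample(sents, 2)  # two distinct sentences per pair
            pairs.append((s1, s2))
    return pairs
```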
Labelling and Sampling. We take the CLS-BERT model trained on STS-G, and generate a score for every sentence pair. To balance the data, we group the pairs into five equal-width bands based on score: [0, 1], (1, 2], (2, 3], (3, 4], and (4, 5]. We use all pairs whose assigned score is above 3.0, and sample N pairs from each of the other three intervals.
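The band-based selection can be sketched as follows, where `score_pair` stands in for CLS-BERT inference and is an assumed helper:

```python
# Pseudo-labelling and sampling: keep every pair scored above 3.0, and
# sample N pairs from each of the three lower score bands.
import random

def sample_pseudo_labelled(pairs, score_pair, n_per_low_band, seed=13):
    rng = random.Random(seed)
    bands = {b: [] for b in range(5)}  # bands [0,1], (1,2], ..., (4,5]
    for s1, s2 in pairs:
        score = score_pair(s1, s2)            # pseudo label in [0, 5]
        band = min(max(int(score), 0), 4)     # clamp: 5.0 joins the top band
        bands[band].append((s1, s2, score))
    kept = bands[3] + bands[4]                # all pairs scoring above 3.0
    for b in (0, 1, 2):                       # sample N from each low band
        kept += rng.sample(bands[b], min(n_per_low_band, len(bands[b])))
    return kept
```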

Experiments
We first evaluate existing approaches for clinical STS in the zero-shot setting, and compare with our method. Then we analyse the impact of the volume of sampled instances and data distribution on the fine-tuning quality. We experiment with the number of iterations in Section 6.5.

Experimental Setup
We evaluate over MedSTS and N2C2-STS. As gathering naturally occurring pairs of sentences with different degrees of semantic similarity is very challenging (Wang et al., 2018), only 84 instances in (4.0, 5.0] are obtained from a group of 100k unlabelled sentence pairs (see Table 3). To increase the number of instances with high similarity, another group of 500k unlabelled sentence pairs is generated from discharge summaries. Limiting to cases above 3.0: (1) "STS-PS" (Pseudo-labelled Small; 5,423 pairs) is sampled from the 100k group with N = 1500; and (2) "STS-PL" (Pseudo-labelled Large; 16,501 pairs) is sampled from the 500k group with N = 4015. Unless otherwise indicated, pseudo labelling is based on CLS-BERT-base fine-tuned on STS-G (see Section 4). All models are trained with a batch size of 16, a learning rate of 2e-5, and 3 epochs, using a linear scheduler with a warmup proportion of 0.1. For all CLS-BERT models, we update all 12 layers; for HConvBERT, we update the last 4 layers only.
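The optimisation setup can be reproduced roughly as below; the freezing helper reflects our layer-update policy, but its attribute paths assume a standard Hugging Face BERT model and are illustrative:

```python
# Optimiser and schedule matching the stated hyperparameters: batch size 16,
# learning rate 2e-5, 3 epochs, linear decay with a 0.1 warmup proportion.
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

EPOCHS, LR, WARMUP = 3, 2e-5, 0.1

def build_optimizer(model, steps_per_epoch):
    total_steps = EPOCHS * steps_per_epoch
    optimizer = AdamW((p for p in model.parameters() if p.requires_grad), lr=LR)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(WARMUP * total_steps),
        num_training_steps=total_steps,
    )
    return optimizer, scheduler

def freeze_all_but_last_n(model, n=4):
    # For HConvBERT we update only the last n encoder layers; CLS-BERT
    # updates all 12. Attribute path assumes a standard BERT model.
    for layer in model.bert.encoder.layer[:-n]:
        for p in layer.parameters():
            p.requires_grad = False
```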

Results
We perform experiments over three models (SBERT, CLS-BERT, and HConvBERT), two pretraining configurations (general and clinical), and four training datasets (general gold-labelled STS-B and STS-G, clinical pseudo-labelled STS-PL and STS-PS).
Results are presented in Tables 4 and 5 for N2C2-STS and MedSTS, respectively. Here, the model subscripts "base" and "clinical" correspond to the two pretraining configurations, general and clinical. The "Data" column indicates the corpus used for fine-tuning, and A+B means that the model is first fine-tuned on A and then fine-tuned on B. The model using general ("base") pretraining and fine-tuning only on STS-B or STS-G is referred to as the "general STS model". Both pretraining on in-domain text ("clinical") and fine-tuning on pseudo-labelled data (+STS-PS/STS-PL) improve performance over the general STS model, with fine-tuning on pseudo-labelled data generally performing better than domain pretraining, in addition to being computationally cheaper.
It may be argued that the performance improvement is gained simply as a result of using an enlarged dataset for fine-tuning, rather than from learning domain characteristics from clinical pseudo-labelled data. However, for both datasets, and under CLS-BERT-base and HConvBERT-base, comparing results using: (1) STS-B, with 5,749 pairs; (2) STS-B + STS-PS, with 11,172 pairs (5,749 + 5,423); and (3) STS-G, with 28,518 pairs, we find that both (2) and (3) achieve higher r and ρ than (1), suggesting that enlarging the fine-tuning data is beneficial. At the same time, (2) always performs much better than (3), even though (3) is larger and contains more gold labels; this indicates the gains are mainly attributable to learned domain characteristics rather than merely increased data. Moreover, based on the results for CLS-BERT-base and HConvBERT-base using STS-PL and STS-PS, it would appear that the amount and score distribution of the pseudo-labelled data influence fine-tuning performance, which we investigate further in Section 6.4.

Combination of Domain Pretraining (DP) and Fine-tuning
We adapt CLS-BERT-clinical fine-tuned on STS-G to predict scores for the 500k pairs, generating STS-DP (6,306 pairs) after sampling, as shown in Table 3. Continuing to fine-tune the same model on STS-DP boosts performance from r = .788 and ρ = .768 to r = .803 and ρ = .788.

Impact of Data Distribution and Amount
In this section, we investigate how the data source, score distribution (the percentage of instances in each of the five score intervals), and volume of sampled instances influence fine-tuning performance. Based on CLS-BERT-base fine-tuned on STS-G, we continue fine-tuning over five different groups of data: (1) the N2C2-STS training data without gold-standard labels, where the score distribution of pseudo labels is 0.04, 0.15, 0.25, 0.35, 0.21; (2) data sampled from STS-PL with the same volume and score distribution as (1); (3) data uniformly sampled from STS-PL, with 330 pairs in each score interval; (4) data proportionally sampled from STS-PL at a ratio of 1/10 for each score interval; and (5) the full STS-PL. Comparing Experiments 2, 3 and 4 in Table 6, which share the same data source and size (1.6k) and differ only in score distribution, we observe only minor performance differences. Experiments 1 and 2 rely on different sources; Experiment 1, which shares its source with the test data, performs much better than Experiment 2, so an aligned data source is the optimal scenario. Looking at Experiments 4 and 5, where the difference is the amount of sampled data, it is clear that more instances bring further improvements. But could performance be improved consistently by increasing the pseudo-labelled data?
To answer this question, we proportionally sample from STS-PL at ratios of 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, and 1.0, and also sample from the 500k unlabelled sentence pairs with N = 5000, 6000, 7000, 7500, 8000, resulting in 12 subsets ranging in size from 1,648 to 28,456, for fine-tuning based on CLS-BERT-base fine-tuned on STS-G. As shown in Figure 1, from 0 to 16,501 instances, both r and ρ gradually increase, and then fluctuate around 0.77 and 0.76, respectively. This reveals a trade-off between increasing the number of pseudo-labelled fine-tuning instances and error propagation due to cumulative noise.
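The proportional variant of the sampling can be sketched as follows, assuming pairs already grouped by score band as in the earlier sampling sketch:

```python
# Proportional sampling: take the same fraction of pairs from every band.
import random

def proportional_sample(bands, ratio, seed=13):
    rng = random.Random(seed)
    subset = []
    for band_pairs in bands.values():
        k = int(len(band_pairs) * ratio)
        subset += rng.sample(band_pairs, k)
    return subset
```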

Impact of Number of Iterations
Based on CLS-BERT-base fine-tuned on STS-G, we investigate the impact of multiple iterations of fine-tuning in Table 7, as introduced in Section 5.2. The performance boost from additional iterations is modest: increasing the number of iterations from 2 to 3, accuracy does not improve further, which is consistent with the findings in Figure 1.
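A minimal sketch of one plausible reading of this iterative procedure (the details of Section 5.2 are abridged here; `fine_tune` and `score_pair_with` are hypothetical helpers):

```python
# Iterative pseudo-training: after each round of fine-tuning, the improved
# model re-labels the unlabelled pairs and training continues on the
# refreshed pseudo labels, reusing the sampling helper sketched earlier.
def iterative_pseudo_training(model, unlabelled_pairs, n_per_low_band, iterations=3):
    for _ in range(iterations):
        labelled = sample_pseudo_labelled(
            unlabelled_pairs,
            score_pair=lambda s1, s2: score_pair_with(model, s1, s2),
            n_per_low_band=n_per_low_band,
        )
        model = fine_tune(model, labelled)
    return model
```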

Conclusion
In this paper, we have proposed a simple method of pseudo-labelling in-domain data and iterative training to improve clinical STS. Evaluation over two clinical STS datasets demonstrates the effectiveness of the approach, and domain pretraining is shown to achieve further improvements. Further investigation indicates that keeping the distribution of pseudo-labelled instances close to that of the in-domain data improves performance. We also observed modest improvements from additional training iterations. Our work provides an alternative approach to employing domain-specific unlabelled data to support clinical STS. As future work, we plan to explore the application of our method to other model structures such as SBERT.