Evaluating the Utility of Model Configurations and Data Augmentation on Clinical Semantic Textual Similarity

Yuxia Wang, Fei Liu, Karin Verspoor, Timothy Baldwin


Abstract
In this paper, we apply pre-trained language models to the Semantic Textual Similarity (STS) task, with a specific focus on the clinical domain. In the low-resource setting of clinical STS, these large models tend to be impractical and prone to overfitting. Building on BERT, we study the impact of a number of model design choices, namely different fine-tuning and pooling strategies. We observe that the impact of domain-specific fine-tuning on clinical STS is much less than that in the general domain, likely due to the concept richness of the domain. Based on this, we propose two data augmentation techniques. Experimental results on N2C2-STS demonstrate substantial improvements, validating the utility of the proposed methods.
Anthology ID:
2020.bionlp-1.11
Volume:
Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing
Month:
July
Year:
2020
Address:
Online
Editors:
Dina Demner-Fushman, Kevin Bretonnel Cohen, Sophia Ananiadou, Junichi Tsujii
Venue:
BioNLP
SIG:
SIGBIOMED
Publisher:
Association for Computational Linguistics
Pages:
105–111
URL:
https://aclanthology.org/2020.bionlp-1.11
DOI:
10.18653/v1/2020.bionlp-1.11
Cite (ACL):
Yuxia Wang, Fei Liu, Karin Verspoor, and Timothy Baldwin. 2020. Evaluating the Utility of Model Configurations and Data Augmentation on Clinical Semantic Textual Similarity. In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, pages 105–111, Online. Association for Computational Linguistics.
Cite (Informal):
Evaluating the Utility of Model Configurations and Data Augmentation on Clinical Semantic Textual Similarity (Wang et al., BioNLP 2020)
PDF:
https://aclanthology.org/2020.bionlp-1.11.pdf
Video:
http://slideslive.com/38929642