Detection of Text Reuse in French Medical Corpora

Eva D’hondt, Cyril Grouin, Aurélie Névéol, Efstathios Stamatatos, Pierre Zweigenbaum

Abstract

Electronic Health Records (EHRs) are increasingly available in modern health care institutions, either through the direct creation of electronic documents in hospitals’ health information systems or through the digitization of historical paper records. Each EHR creation method creates a need for sophisticated text reuse detection tools to prepare EHR collections for efficient secondary use relying on Natural Language Processing methods. Herein, we address the detection of two types of text reuse in French EHRs: 1) updated versions of the same document and 2) document duplicates that still bear surface differences due to OCR or de-identification processing. We present a robust text reuse detection method to automatically identify redundant document pairs in two French EHR corpora; it achieves overall macro F-measures of 0.68 and 0.60, respectively, and correctly identifies all redundant document pairs of interest.

Anthology ID: W16-5112
Volume: Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)
Month: December
Year: 2016
Address: Osaka, Japan
Editors: Sophia Ananiadou, Riza Batista-Navarro, Kevin Bretonnel Cohen, Dina Demner-Fushman, Paul Thompson
Venue: WS
Publisher: The COLING 2016 Organizing Committee
Pages: 108–114
URL: https://aclanthology.org/W16-5112
Cite (ACL): Eva D’hondt, Cyril Grouin, Aurélie Névéol, Efstathios Stamatatos, and Pierre Zweigenbaum. 2016. Detection of Text Reuse in French Medical Corpora. In Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016), pages 108–114, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal): Detection of Text Reuse in French Medical Corpora (D’hondt et al., 2016)
PDF: https://aclanthology.org/W16-5112.pdf
