A Treebank for the Healthcare Domain

Nganthoibi Oinam, Diwakar Mishra, Pinal Patel, Narayan Choudhary, Hitesh Desai


Abstract
This paper presents a treebank for the healthcare domain developed at ezDI. The treebank is created from a wide array of clinical health record documents across hospitals. The data has been de-identified and annotated for constituent syntactic structure. The treebank contains a total of 52,053 sentences that have been sampled for subdomains as well as linguistic variations. The paper outlines the sampling process followed to ensure a better domain representation in the corpus, the annotation process and challenges, and corpus statistics. The Penn Treebank tagset and guidelines were largely followed, but many syntactic contexts warranted adaptation of the guidelines. The treebank was used to re-train the Berkeley parser and the Stanford parser. These parsers were also trained with the GENIA treebank for comparative quality assessment. Our treebank yielded greater accuracy on both parsers. The Berkeley parser performed better on our treebank, with an average F1 measure of 91 across five folds. This was a significant jump from the out-of-the-box F1 score of 70 on the Berkeley parser’s default grammar.
Anthology ID:
W18-4916
Volume:
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)
Month:
August
Year:
2018
Address:
Santa Fe, New Mexico, USA
Editors:
Agata Savary, Carlos Ramisch, Jena D. Hwang, Nathan Schneider, Melanie Andresen, Sameer Pradhan, Miriam R. L. Petruck
Venues:
LAW | MWE
SIGs:
SIGANN | SIGLEX
Publisher:
Association for Computational Linguistics
Pages:
144–155
URL:
https://aclanthology.org/W18-4916
Cite (ACL):
Nganthoibi Oinam, Diwakar Mishra, Pinal Patel, Narayan Choudhary, and Hitesh Desai. 2018. A Treebank for the Healthcare Domain. In Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), pages 144–155, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal):
A Treebank for the Healthcare Domain (Oinam et al., LAW-MWE 2018)
PDF:
https://aclanthology.org/W18-4916.pdf
Data
Penn Treebank