Marginal Likelihood Training of BiLSTM-CRF for Biomedical Named Entity Recognition from Disjoint Label Sets

Nathan Greenberg; Trapit Bansal; Patrick Verga; Andrew McCallum

doi:10.18653/v1/D18-1306

Abstract

Extracting typed entity mentions from text is a fundamental component of language understanding and reasoning. While there exist substantial labeled text datasets for multiple subsets of biomedical entity types (such as genes and proteins, or chemicals and diseases), it is rare to find large labeled datasets containing labels for all desired entity types together. This paper presents a method for training a single CRF extractor from multiple datasets with disjoint or partially overlapping sets of entity types. Our approach employs marginal likelihood training to insist on labels that are present in the data, while filling in “missing labels”. This allows us to leverage all the available data within a single model. In experimental results on the BioCreative V CDR (chemicals/diseases), BioCreative VI ChemProt (chemicals/proteins) and MedMentions (19 entity types) datasets, we show that joint training on multiple datasets improves NER F1 over training in isolation, and our methods achieve state-of-the-art results.
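The idea of marginal likelihood training can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a linear-chain CRF over a simplified tag set (no BIO prefixes), uses plain Python lists for scores, and all names (`constrained_log_z`, `marginal_log_likelihood`, `allowed`) are hypothetical. The rule shown for unannotated tokens (a token tagged `O` in a chemicals-only dataset may be `O` or any entity type that dataset does not annotate) is one plausible reading of "filling in missing labels", stated here as an assumption.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))); returns -inf if nothing is finite."""
    finite = [v for v in values if v != -math.inf]
    if not finite:
        return -math.inf
    m = max(finite)
    return m + math.log(sum(math.exp(v - m) for v in finite))

def constrained_log_z(emissions, transitions, allowed):
    """Forward algorithm restricted to the labels in allowed[t] at position t."""
    n_labels = len(transitions)
    # alpha[y]: log-score of all permitted prefixes that end in label y
    alpha = [emissions[0][y] if y in allowed[0] else -math.inf
             for y in range(n_labels)]
    for t in range(1, len(emissions)):
        alpha = [
            emissions[t][y]
            + logsumexp([alpha[p] + transitions[p][y] for p in range(n_labels)])
            if y in allowed[t] else -math.inf
            for y in range(n_labels)
        ]
    return logsumexp(alpha)

def marginal_log_likelihood(emissions, transitions, allowed):
    """log of the total probability of all label sequences consistent with `allowed`."""
    n_labels = len(transitions)
    unconstrained = [set(range(n_labels))] * len(emissions)
    return (constrained_log_z(emissions, transitions, allowed)
            - constrained_log_z(emissions, transitions, unconstrained))

# Hypothetical two-token sentence from a dataset that annotates only chemicals.
O, CHEM, DIS = 0, 1, 2
emissions = [[0.5, 2.0, 0.1], [1.0, 0.2, 0.8]]                  # made-up scores
transitions = [[0.0, -0.5, 0.3], [0.2, 0.1, -0.4], [0.0, 0.6, 0.1]]
allowed = [{CHEM}, {O, DIS}]   # token 1 is annotated; token 2 may hide a disease
mll = marginal_log_likelihood(emissions, transitions, allowed)
```

When every `allowed[t]` holds exactly one label, this reduces to the ordinary CRF log-likelihood, so fully annotated and partially annotated sentences can share one objective.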

Anthology ID: D18-1306
Volume: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month: October-November
Year: 2018
Address: Brussels, Belgium
Editors: Ellen Riloff, David Chiang, Julia Hockenmaier, Jun’ichi Tsujii
Venue: EMNLP
SIG: SIGDAT
Publisher: Association for Computational Linguistics
Pages: 2824–2829
URL: https://aclanthology.org/D18-1306
DOI: 10.18653/v1/D18-1306
Cite (ACL): Nathan Greenberg, Trapit Bansal, Patrick Verga, and Andrew McCallum. 2018. Marginal Likelihood Training of BiLSTM-CRF for Biomedical Named Entity Recognition from Disjoint Label Sets. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2824–2829, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal): Marginal Likelihood Training of BiLSTM-CRF for Biomedical Named Entity Recognition from Disjoint Label Sets (Greenberg et al., EMNLP 2018)
PDF: https://aclanthology.org/D18-1306.pdf
Attachment: D18-1306.Attachment.pdf
Video: https://aclanthology.org/D18-1306.mp4
