Use of unsupervised word classes for entity recognition: Application to the detection of disorders in clinical reports

Maria Evangelia Chatzimina, Cyril Grouin, Pierre Zweigenbaum


Abstract
Unsupervised word classes induced from unannotated text corpora are increasingly used to help tasks addressed by supervised classification, such as standard named entity detection. This paper studies the contribution of unsupervised word classes to a medical entity detection task with two specific objectives: How do unsupervised word classes compare to available knowledge-based semantic classes? Does syntactic information help produce unsupervised word classes with better properties? We design and test two syntax-based methods to produce word classes: one applies the Brown clustering algorithm to syntactic dependencies, the other collects latent categories created by a PCFG-LA parser. When added to non-semantic features, knowledge-based semantic classes gain 7.28 points of F-measure. In the same context, basic unsupervised word classes gain 4.16pt, reaching 60% of the contribution of knowledge-based semantic classes and outperforming Wikipedia, and adding PCFG-LA unsupervised word classes gains one more point, for a total of 5.11pt, reaching 70%. Unsupervised word classes could therefore provide a useful semantic back-off in domains where no knowledge-based semantic classes are available. The combination of both knowledge-based and basic unsupervised classes gains 8.33pt. Unsupervised classes are therefore still useful even when rich knowledge-based classes exist.
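The relative contributions quoted in the abstract can be checked with a short calculation. The sketch below is only an illustration: the dictionary keys and function names are hypothetical labels chosen here, and the only data used are the four F-measure gains stated in the abstract. Each gain is expressed as a percentage of the knowledge-based gain (7.28pt).

```python
# F-measure gains (in points) over non-semantic features, as quoted in the abstract.
# The key names are illustrative labels, not identifiers from the paper.
GAINS = {
    "knowledge_based": 7.28,
    "basic_unsupervised": 4.16,
    "with_pcfg_la": 5.11,
    "knowledge_plus_basic": 8.33,
}


def relative_contribution(gain: float, reference: float) -> float:
    """Return `gain` as a percentage of `reference`."""
    return 100.0 * gain / reference


def summarize(gains=GAINS, reference_key="knowledge_based"):
    """Express every gain as a percentage of the reference gain, rounded to 0.1."""
    ref = gains[reference_key]
    return {
        name: round(relative_contribution(value, ref), 1)
        for name, value in gains.items()
    }


if __name__ == "__main__":
    for name, pct in summarize().items():
        print(f"{name}: {pct}% of the knowledge-based gain")
```

Under these figures, the basic unsupervised classes come out at about 57% of the knowledge-based gain (which the abstract rounds to 60%), the PCFG-LA variant at about 70%, and the combination at about 114%.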
Anthology ID:
L14-1336
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3264–3271
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/389_Paper.pdf
Cite (ACL):
Maria Evangelia Chatzimina, Cyril Grouin, and Pierre Zweigenbaum. 2014. Use of unsupervised word classes for entity recognition: Application to the detection of disorders in clinical reports. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3264–3271, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
Use of unsupervised word classes for entity recognition: Application to the detection of disorders in clinical reports (Chatzimina et al., LREC 2014)
PDF:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/389_Paper.pdf