Towards Robust Named Entity Recognition for Historic German

Stefan Schweter, Johannes Baiter


Abstract
In this paper we study the influence of language model pre-training on named entity recognition for Historic German. We achieve new state-of-the-art results using carefully chosen training data for language models. For a low-resource task like named entity recognition for Historic German, language model pre-training can be a strong competitor to CRF-only methods, and we show that it can be more effective than transfer learning with labeled datasets. Furthermore, we introduce a new language model pre-training objective, synthetic masked language model pre-training (SMLM), that enables transfer from one domain (contemporary texts) to another (historical texts) using only a shared (character) vocabulary. Results show that SMLM achieves comparable results for historic named entity recognition even when the models are trained only on contemporary texts. Our pre-trained character-based language models improve upon classical CRF-based methods and previous work on Bi-LSTMs, boosting F1 scores by up to 6%.
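The abstract describes SMLM as corrupting input over a character vocabulary shared between the source and target domain, with the model trained to recover the original characters. The sketch below is a hypothetical illustration of such a character-level masking step, not the paper's actual implementation; the function name, masking rate, and toy vocabulary are our own assumptions.

```python
import random

def mask_characters(text, vocab, mask_rate=0.15, seed=0):
    """Illustrative character-level corruption: replace a random fraction
    of characters with others drawn from a shared character vocabulary.
    Returns the corrupted text plus the masked positions and their
    original characters (the prediction targets for the language model)."""
    rng = random.Random(seed)
    chars = list(text)
    targets = {}
    for i, original in enumerate(chars):
        if rng.random() < mask_rate:
            targets[i] = original           # original char is the label
            chars[i] = rng.choice(vocab)    # corrupt with an in-vocab char
    return "".join(chars), targets

# Toy shared vocabulary covering both contemporary and historical German text
vocab = sorted(set("abcdefghijklmnopqrstuvwxyz äöüß"))
corrupted, targets = mask_characters("die zeitung von gestern", vocab)
```

Because only the character vocabulary is shared, a model pre-trained this way on contemporary text can, in principle, be applied to historical text without any labeled historical data.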
Anthology ID:
W19-4312
Volume:
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)
Month:
August
Year:
2019
Address:
Florence, Italy
Editors:
Isabelle Augenstein, Spandana Gella, Sebastian Ruder, Katharina Kann, Burcu Can, Johannes Welbl, Alexis Conneau, Xiang Ren, Marek Rei
Venue:
RepL4NLP
SIG:
SIGREP
Publisher:
Association for Computational Linguistics
Pages:
96–103
URL:
https://aclanthology.org/W19-4312
DOI:
10.18653/v1/W19-4312
Cite (ACL):
Stefan Schweter and Johannes Baiter. 2019. Towards Robust Named Entity Recognition for Historic German. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 96–103, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Towards Robust Named Entity Recognition for Historic German (Schweter & Baiter, RepL4NLP 2019)
PDF:
https://aclanthology.org/W19-4312.pdf
Code:
stefan-it/historic-ner
Data:
Europeana Newspapers