De-identification of Privacy-related Entities in Job Postings

Kristian Nørgaard Jensen, Mike Zhang, Barbara Plank


Abstract
De-identification is the task of detecting privacy-related entities in text, such as person names, emails and contact data. It has been well studied within the medical domain. The need for de-identification technology is increasing, as privacy-preserving data handling is in high demand in many domains. In this paper, we focus on job postings. We present JobStack, a new corpus for de-identification of personal data in job vacancies on Stack Overflow. We introduce baselines, comparing Long Short-Term Memory (LSTM) and Transformer models. To improve these baselines, we experiment with BERT representations and with distantly related auxiliary data via multi-task learning. Our results show that auxiliary data helps to improve de-identification performance. While BERT representations improve performance, surprisingly, “vanilla” BERT turned out to be more effective than BERT trained on Stack Overflow-related data.
Anthology ID:
2021.nodalida-main.21
Volume:
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)
Month:
31 May–2 June
Year:
2021
Address:
Reykjavik, Iceland (Online)
Editors:
Simon Dobnik, Lilja Øvrelid
Venue:
NoDaLiDa
Publisher:
Linköping University Electronic Press, Sweden
Pages:
210–221
URL:
https://aclanthology.org/2021.nodalida-main.21
Cite (ACL):
Kristian Nørgaard Jensen, Mike Zhang, and Barbara Plank. 2021. De-identification of Privacy-related Entities in Job Postings. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 210–221, Reykjavik, Iceland (Online). Linköping University Electronic Press, Sweden.
Cite (Informal):
De-identification of Privacy-related Entities in Job Postings (Jensen et al., NoDaLiDa 2021)
PDF:
https://aclanthology.org/2021.nodalida-main.21.pdf
Code
kris927b/JobStack
Data
JobStack