Enhancing BERT for Lexical Normalization

Benjamin Muller, Benoit Sagot, Djamé Seddah


Abstract
Language model-based pre-trained representations have become ubiquitous in natural language processing. They have been shown to significantly improve the performance of neural models on a great variety of tasks. However, it remains unclear how useful those general models can be in handling non-canonical text. In this article, focusing on User Generated Content (UGC), we study the ability of BERT to perform lexical normalization. Our contribution is simple: by framing lexical normalization as a token prediction task, by enhancing its architecture, and by carefully fine-tuning it, we show that BERT can be a competitive lexical normalization model without the need for any UGC resources aside from 3,000 training sentences. To the best of our knowledge, this is the first work on adapting and analyzing the ability of this model to handle noisy UGC data.
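The abstract frames lexical normalization as a token-level prediction task on top of a pre-trained BERT encoder. Below is a minimal, hypothetical sketch of that general framing (not the authors' code or architecture), assuming the HuggingFace `transformers` library and a toy normalization label vocabulary `norm_vocab` built from training data.

```python
# Sketch: lexical normalization as token-level prediction over a BERT encoder.
# Assumptions: HuggingFace `transformers`; `norm_vocab` is a hypothetical
# label set where "<keep>" means the input token is already canonical.

import torch
from torch import nn
from transformers import BertTokenizerFast, BertModel

norm_vocab = {"<keep>": 0, "you": 1, "are": 2, "though": 3}  # toy label set

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(encoder.config.hidden_size, len(norm_vocab))

# Noisy UGC input; each token should be mapped to a normalization label.
tokens = ["u", "r", "right", "tho"]
enc = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**enc).last_hidden_state   # (1, seq_len, hidden_size)
    logits = classifier(hidden)                 # (1, seq_len, |norm_vocab|)
    predictions = logits.argmax(dim=-1)

# During fine-tuning, a cross-entropy loss over these logits against gold
# normalization labels would update both the classifier and the BERT encoder.
```

In practice, fine-tuning would also need to align sub-word pieces with the original tokens and handle an open normalization vocabulary; this sketch only illustrates the token-prediction framing described in the abstract.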
Anthology ID:
D19-5539
Volume:
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)
Month:
November
Year:
2019
Address:
Hong Kong, China
Editors:
Wei Xu, Alan Ritter, Tim Baldwin, Afshin Rahimi
Venue:
WNUT
Publisher:
Association for Computational Linguistics
Pages:
297–306
URL:
https://aclanthology.org/D19-5539
DOI:
10.18653/v1/D19-5539
Cite (ACL):
Benjamin Muller, Benoit Sagot, and Djamé Seddah. 2019. Enhancing BERT for Lexical Normalization. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 297–306, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal):
Enhancing BERT for Lexical Normalization (Muller et al., WNUT 2019)
PDF:
https://aclanthology.org/D19-5539.pdf