Can Wikipedia Categories Improve Masked Language Model Pretraining?

Diksha Meghwal, Katharina Kann, Iacer Calixto, Stanislaw Jastrzebski


Abstract
Pretrained language models have obtained impressive results on a large set of natural language understanding tasks. However, training these models is computationally expensive and requires huge amounts of data. Thus, it would be desirable to automatically detect groups of more or less important examples. Here, we investigate whether we can leverage a commonly overlooked source of information, Wikipedia categories as listed in DBpedia, to identify useful or harmful data points during pretraining. We define an experimental setup in which we analyze correlations between language model perplexity on specific clusters and downstream NLP task performance during pretraining. Our experiments show that Wikipedia categories are not a good indicator of the importance of specific sentences for pretraining.
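To make the described setup concrete, below is a minimal, purely illustrative sketch of the kind of analysis the abstract mentions: correlating per-cluster masked language model perplexity with downstream task scores across pretraining checkpoints. The cluster names, loss values, downstream scores, and choice of Spearman correlation are all placeholder assumptions for illustration, not details taken from the paper.

```python
# Hypothetical illustration (not the authors' code): correlate per-cluster
# masked-LM perplexity with downstream scores across pretraining checkpoints.
import math
from scipy.stats import spearmanr

# Toy inputs: mean masked-LM loss per Wikipedia-category cluster at each of
# four checkpoints, plus the downstream task score at the same checkpoints.
cluster_losses = {
    "Scientists": [3.2, 2.9, 2.7, 2.6],   # mean MLM loss at checkpoints 1..4
    "Villages":   [3.8, 3.6, 3.5, 3.5],
}
downstream_scores = [0.71, 0.75, 0.78, 0.79]  # e.g. task accuracy per checkpoint

for cluster, losses in cluster_losses.items():
    perplexities = [math.exp(loss) for loss in losses]  # perplexity = exp(mean loss)
    rho, p = spearmanr(perplexities, downstream_scores)
    print(f"{cluster}: Spearman rho={rho:.2f} (p={p:.2f})")
```

A weak or inconsistent correlation across clusters would point toward the paper's negative finding, i.e. that category-level perplexity is not a reliable signal of how important those sentences are for pretraining.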
Anthology ID: 2020.winlp-1.19
Volume: Proceedings of the Fourth Widening Natural Language Processing Workshop
Month: July
Year: 2020
Address: Seattle, USA
Editors: Rossana Cunha, Samira Shaikh, Erika Varis, Ryan Georgi, Alicia Tsai, Antonios Anastasopoulos, Khyathi Raghavi Chandu
Venue: WiNLP
Publisher: Association for Computational Linguistics
Pages: 78
URL: https://aclanthology.org/2020.winlp-1.19
DOI: 10.18653/v1/2020.winlp-1.19
Cite (ACL): Diksha Meghwal, Katharina Kann, Iacer Calixto, and Stanislaw Jastrzebski. 2020. Can Wikipedia Categories Improve Masked Language Model Pretraining?. In Proceedings of the Fourth Widening Natural Language Processing Workshop, page 78, Seattle, USA. Association for Computational Linguistics.
Cite (Informal): Can Wikipedia Categories Improve Masked Language Model Pretraining? (Meghwal et al., WiNLP 2020)
Video: http://slideslive.com/38929556