Multi-pretraining for Large-scale Text Classification

Kang-Min Kim, Bumsu Hyeon, Yeachan Kim, Jun-Hyung Park, SangKeun Lee


Abstract
Deep neural network-based pretraining methods have achieved impressive results in many natural language processing tasks, including text classification. However, their applicability to large-scale text classification with numerous categories (e.g., several thousand) remains under-studied, as the training data in this setting is scarce and skewed across categories. In addition, existing pretraining methods usually incur excessive computation and memory overheads. In this paper, we develop a novel multi-pretraining framework for large-scale text classification that combines self-supervised and weakly supervised pretraining. As the self-supervised pretraining, we introduce a new out-of-context word detection task on unlabeled data, which captures the topic-consistency of words used in sentences and is shown to be useful for text classification. As the weakly supervised pretraining, we obtain labels for text classification automatically from an existing approach. Experimental results clearly show that both pretraining approaches are effective for the large-scale text classification task. The proposed scheme yields improvements of as much as 3.8% in macro-averaged F1-score over strong pretraining methods, while being computationally efficient.
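The abstract describes the self-supervised pretraining task as detecting out-of-context words, i.e., words that break the topic-consistency of a sentence. Below is a minimal, hypothetical Python sketch of how training instances for such a task could be constructed; the corruption strategy (swapping in words sampled from an unrelated document) and the swap ratio are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical construction of out-of-context word detection data.
# Assumption: each token is labeled 0 (in-context) or 1 (out-of-context),
# and corruption is done by borrowing words from a topically unrelated document.
import random


def make_ooc_example(sentence_tokens, other_doc_tokens, swap_ratio=0.15, rng=random):
    """Replace a fraction of tokens with words drawn from an unrelated document."""
    tokens = list(sentence_tokens)
    labels = [0] * len(tokens)                      # 0 = in-context
    n_swaps = max(1, int(len(tokens) * swap_ratio))
    for i in rng.sample(range(len(tokens)), n_swaps):
        tokens[i] = rng.choice(other_doc_tokens)    # topic-inconsistent word
        labels[i] = 1                               # 1 = out-of-context
    return tokens, labels


# A model would then be pretrained to predict `labels` from `tokens`,
# encouraging it to capture the topic-consistency of words in a sentence.
sent = "the team scored a late goal to win the match".split()
other = "the central bank raised interest rates again today".split()
tokens, labels = make_ooc_example(sent, other)
print(list(zip(tokens, labels)))
```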
Anthology ID:
2020.findings-emnlp.185
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2020
Month:
November
Year:
2020
Address:
Online
Editors:
Trevor Cohn, Yulan He, Yang Liu
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2041–2050
URL:
https://aclanthology.org/2020.findings-emnlp.185
DOI:
10.18653/v1/2020.findings-emnlp.185
Cite (ACL):
Kang-Min Kim, Bumsu Hyeon, Yeachan Kim, Jun-Hyung Park, and SangKeun Lee. 2020. Multi-pretraining for Large-scale Text Classification. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2041–2050, Online. Association for Computational Linguistics.
Cite (Informal):
Multi-pretraining for Large-scale Text Classification (Kim et al., Findings 2020)
PDF:
https://aclanthology.org/2020.findings-emnlp.185.pdf
Data
Billion Word Benchmark
One Billion Word Benchmark