Multiclass Text Classification on Unbalanced, Sparse and Noisy Data

Tillmann Dönicke, Matthias Damaschk, Florian Lux


Abstract
This paper discusses methods to improve the performance of text classification on data that is difficult to classify due to a large number of unbalanced classes with noisy examples. A variety of features are tested in combination with three neural-network-based methods of increasing complexity. The classifiers are applied to a songtext–artist dataset which is large, unbalanced and noisy. We conclude that substantial improvement can be obtained by removing unbalancedness and sparsity from the data. This solves the classification task only unsatisfactorily; however, with contemporary methods, it is a practical step towards fairly satisfactory results.
Anthology ID:
W19-6207
Volume:
Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing
Month:
September
Year:
2019
Address:
Turku, Finland
Editors:
Joakim Nivre, Leon Derczynski, Filip Ginter, Bjørn Lindi, Stephan Oepen, Anders Søgaard, Jörg Tiedemann
Venue:
NoDaLiDa
Publisher:
Linköping University Electronic Press
Pages:
58–65
URL:
https://aclanthology.org/W19-6207
Cite (ACL):
Tillmann Dönicke, Matthias Damaschk, and Florian Lux. 2019. Multiclass Text Classification on Unbalanced, Sparse and Noisy Data. In Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing, pages 58–65, Turku, Finland. Linköping University Electronic Press.
Cite (Informal):
Multiclass Text Classification on Unbalanced, Sparse and Noisy Data (Dönicke et al., NoDaLiDa 2019)
PDF:
https://aclanthology.org/W19-6207.pdf