Cross-Domain Detection of Abusive Language Online

Mladen Karan, Jan Šnajder


Abstract
We investigate to what extent models trained to detect general abusive language generalize across datasets labeled with different types of abusive language. To this end, we compare the cross-domain performance of simple classification models on nine different datasets, finding that the models fail to generalize to out-of-domain datasets and that having at least some in-domain data is important. We also show that frustratingly simple domain adaptation (Daumé III, 2007) in most cases improves the results over in-domain training, especially when used to augment a smaller dataset with a larger one.
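As a rough illustration of the domain adaptation technique the abstract refers to, the sketch below shows the feature-augmentation idea from Daumé III (2007): each feature vector is copied into a shared block and a domain-specific block, with zeros in the other domain's block. The function name `augment`, the domain labels, and the toy vectors are assumptions made for this example; this is not the authors' code, and the paper's actual features and classifier are not shown here.

```python
def augment(features, domain):
    """Map a feature vector to <shared, source-only, target-only> blocks.

    `features` is a plain list of floats; `domain` is "source" or "target".
    Both names are hypothetical and chosen only for this illustration.
    """
    zeros = [0.0] * len(features)
    if domain == "source":
        # Source examples fill the shared and source blocks.
        return features + features + zeros
    if domain == "target":
        # Target examples fill the shared and target blocks.
        return features + zeros + features
    raise ValueError("domain must be 'source' or 'target'")


# Toy vectors standing in for text features from two datasets.
source_example = augment([1.0, 2.0], "source")
target_example = augment([3.0, 4.0], "target")
```

A single classifier trained on the augmented vectors from both datasets can then learn weights in the shared block for cues common to both domains and weights in the domain-specific blocks for cues that behave differently in each.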
Anthology ID: W18-5117
Volume: Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)
Month: October
Year: 2018
Address: Brussels, Belgium
Editors: Darja Fišer, Ruihong Huang, Vinodkumar Prabhakaran, Rob Voigt, Zeerak Waseem, Jacqueline Wernimont
Venue: ALW
Publisher: Association for Computational Linguistics
Pages: 132–137
URL: https://aclanthology.org/W18-5117
DOI: 10.18653/v1/W18-5117
Cite (ACL): Mladen Karan and Jan Šnajder. 2018. Cross-Domain Detection of Abusive Language Online. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 132–137, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal): Cross-Domain Detection of Abusive Language Online (Karan & Šnajder, ALW 2018)
PDF: https://aclanthology.org/W18-5117.pdf