Combining Character and Word Embeddings for the Detection of Offensive Language in Arabic

Abdullah I. Alharbi, Mark Lee


Abstract
Twitter and other social media platforms offer users the chance to share their ideas via short posts. While the easy exchange of ideas has value, these microblogs can be leveraged by people who want to spread hatred, and such individuals can share negative views about an individual, race, or group with millions of people at the click of a button. There is thus an urgent need to establish a method that can automatically identify hate speech and offensive language. To contribute to this development, a shared task on detecting offensive language in Arabic was held during the OSACT4 workshop. A key challenge was the distinctive language used on social media, which gives rise to the out-of-vocabulary (OOV) problem. The use of different Arabic dialects exacerbates this problem. To deal with the issues associated with OOV words, we generated a character-level embeddings model, which was trained on a large, carefully collected dataset. Embeddings at this level can help resolve the OOV problem because they learn vectors for character n-grams, or parts of words. The proposed systems were ranked 7th and 8th for Subtasks A and B, respectively.
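The idea that character n-grams let a model assign a vector to an unseen word can be illustrated with a minimal sketch. It is not the authors' system: the boundary markers, the n-gram range of 3 to 5, the vector size, and the hash-seeded random vectors (standing in for trained n-gram embeddings) are all assumptions made for illustration, and the helper names `char_ngrams`, `ngram_vector`, `word_vector`, and `shared_ngrams` are hypothetical.

```python
import hashlib
import random


def char_ngrams(word, n_min=3, n_max=5):
    """Return the character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]


def ngram_vector(ngram, dim=8):
    """Deterministic pseudo-random vector; a stand-in for a trained n-gram embedding."""
    digest = hashlib.md5(ngram.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def word_vector(word, dim=8):
    """Average the n-gram vectors, so any word (seen or unseen) gets a vector."""
    grams = char_ngrams(word)
    if not grams:
        return [0.0] * dim
    vectors = [ngram_vector(g, dim) for g in grams]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def shared_ngrams(word_a, word_b):
    """N-grams that two spellings have in common."""
    return set(char_ngrams(word_a)) & set(char_ngrams(word_b))


# A standard spelling and an elongated social-media spelling of the same word.
standard, elongated = "جميل", "جمييييل"
print(sorted(shared_ngrams(standard, elongated)))
print(word_vector(elongated))
```

Because the elongated spelling shares several n-grams with the standard one, a model built on n-gram vectors can still place it near related words even if the exact token never appeared in training, which a purely word-level lookup table cannot do.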
Anthology ID:
2020.osact-1.15
Volume:
Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Hend Al-Khalifa, Walid Magdy, Kareem Darwish, Tamer Elsayed, Hamdy Mubarak
Venue:
OSACT
Publisher:
European Language Resource Association
Pages:
91–96
Language:
English
URL:
https://aclanthology.org/2020.osact-1.15
Cite (ACL):
Abdullah I. Alharbi and Mark Lee. 2020. Combining Character and Word Embeddings for the Detection of Offensive Language in Arabic. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, pages 91–96, Marseille, France. European Language Resource Association.
Cite (Informal):
Combining Character and Word Embeddings for the Detection of Offensive Language in Arabic (Alharbi & Lee, OSACT 2020)
PDF:
https://aclanthology.org/2020.osact-1.15.pdf