A Dataset of Hindi-English Code-Mixed Social Media Text for Hate Speech Detection

Hate speech detection in social media texts is an important Natural Language Processing task with several crucial applications, such as sentiment analysis, investigating cyberbullying and examining socio-political controversies. While relevant research has been done independently on code-mixed social media texts and on hate speech detection, our work is the first attempt at detecting hate speech in Hindi-English code-mixed social media text. In this paper, we analyze the problem of hate speech detection in code-mixed texts and present a Hindi-English code-mixed dataset consisting of tweets posted on Twitter. The tweets are annotated with the language at the word level and with the class they belong to (Hate Speech or Normal Speech). We also propose a supervised classification system for detecting hate speech in the text using various character-level, word-level, and lexicon-based features.


Introduction
With the recent surge in the amount of user-generated social media data, there is tremendous scope for automated text analysis in the domain of computational linguistics.
The popularity of opinion-rich online resources like review forums and microblogging sites has encouraged users across the world to express and convey their thoughts in real time. This often results in users posting offensive and abusive content online using hateful speech, which may be directed towards an individual or a community to show dissent. Detecting hate speech is thus important for lawmakers and social media platforms seeking to discourage the occurrence of wrongful activities. Previous research related to this task has mainly focused on monolingual texts (Malmasi and Zampieri, 2017; Schmidt and Wiegand, 2017; Davidson et al., 2017) due to their large-scale availability.
* These authors contributed equally to this work.
However, in multilingual societies like India, the use of code-mixed languages (among which Hindi-English is the most prominent) is quite common for conveying opinions online. Code-Mixing (CM) is the natural phenomenon of embedding linguistic units such as phrases, words or morphemes of one language into an utterance of another (Myers-Scotton, 1993; Gysels, 1992; Duran, 1994; Muysken, 2000). Following are some instances of Hindi-English code-mixed texts, with the Hindi words transliterated into Roman script, along with their English translations.
T1: "Mujhe apne manager se nafrat hai, I want to kill that guy."
Translation: "I hate my manager, I want to kill that guy."
T2: "Aaj ka day humesha yaad rahega humein because India won the World Cup! :D"
Translation: "We'll forever remember this day because India won the World Cup! :D"
T3: "Jisne bhi Nirbhaya ka rape kiya should be bloody hanged till death."
Translation: "Whoever raped Nirbhaya, should be bloody hanged till death."
It can be observed that T1 and T3 contain hate speech, while T2 is an instance of normal speech. To the best of our knowledge, there are currently no online code-mixed resources available for detecting hate speech. We believe that our initial efforts in constructing a Hindi-English code-mixed dataset for hate speech detection will prove valuable for linguists working in this domain.
The structure of the paper is as follows. In Section 2, we review related research in the areas of code-mixing and hate speech detection. In Section 3, we describe the corpus creation and annotation scheme. In Section 4, we present our system architecture, which includes the pre-processing steps and classification features. In Section 5, we present the results of experiments conducted using various character-level, word-level and lexicon features. In the last section, we conclude the paper and outline future work, followed by the references.

Background and Related Work
Earlier work analyzed data from Facebook posts generated by Hindi-English bilingual users and found that a significant amount of code-mixing was present in the posts. Another study created a POS-tag-annotated Hindi-English code-mixed corpus and reported the challenges and problems posed by Hindi-English code-mixed text; its authors also performed experiments on language identification, transliteration, normalization and POS tagging of the dataset. Sharma et al. (2016) addressed the problem of shallow parsing of Hindi-English code-mixed social media text and developed a system that can identify the language of the words, normalize them to their standard forms, assign their POS tags and segment them into chunks. Barman et al. (2014) addressed the problem of language identification on Bengali-Hindi-English Facebook comments. They annotated a corpus and achieved an accuracy of 95.76% using statistical models with monolingual dictionaries. Raghavi et al. (2015) developed a question classification system for Hindi-English code-mixed language using word-level resources. Shared tasks have also been organized on classifying code-mixed cross-script questions and on information retrieval over Hindi-English code-mixed tweets, where the task was to retrieve the top k tweets from a corpus for a given query consisting of Hindi-English terms, with the Hindi terms written in Roman transliterated form (Banerjee et al., 2016). Gupta et al. (2014) addressed the problem of Mixed-Script IR (MSIR). They proposed a solution for mixed-script term matching and spelling variation in which the terms across the scripts are modelled jointly in a deep-learning architecture and can be compared in a low-dimensional abstract space. They also carried out an empirical analysis of the proposed method and evaluated it in an ad-hoc retrieval setting for mixed-script IR, where it achieved significantly better results (12% increase in MRR and 29% increase in MAP) than other state-of-the-art baselines. Joshi et al. (2016) and Ghosh et al. (2017) performed sentiment identification in code-mixed social media text.
Malmasi and Zampieri (2017) examined methods to detect hate speech in social media. They presented a supervised classification system which uses character n-grams, word n-grams and word skip-grams. They achieved an accuracy of 78% on a dataset of English tweets annotated with three labels: hate speech (HATE), offensive language but no hate speech (OFFENSIVE), and no offensive content (OK). Del Vigna et al. (2017) addressed the problem of hate speech detection for the Italian language. They built their annotated corpus using comments retrieved from the Facebook public pages of Italian newspapers, politicians, artists, and groups.

Corpus Creation and Annotation
We constructed the Hindi-English code-mixed corpus from tweets posted online in the last five years. Tweets were scraped from Twitter using the Twitter Python API 1 which uses the advanced search option of Twitter. We mined the tweets by selecting certain hashtags and keywords related to politics, public protests, riots, etc., which have a high propensity for the presence of hate speech. We retrieved 112,718 tweets from Twitter in JSON format, which includes information such as timestamp, URL, text, user, re-tweets, replies, full name, id and likes. Extensive processing was carried out to remove all the noisy tweets. Furthermore, all tweets written in either pure English or pure Hindi were removed. After manual filtering, a dataset of 4575 code-mixed tweets remained.

Annotation
Annotation of the corpus was carried out as follows:
Language at Word Level: For each word, a tag was assigned to its source language. Three kinds of tags, namely 'eng', 'hin' and 'other', were assigned to the words by bilingual speakers. The 'eng' tag was assigned to words which are present in the English vocabulary, such as "School", "Death", etc. The 'hin' tag was assigned to words which are present in the Hindi vocabulary, such as "nafrat" (hatred) and "marna" (dying). The tag 'other' was given to symbols, emoticons, punctuation marks, named entities, acronyms, and URLs.
Hate Speech or Normal Speech: An instance of annotation is illustrated in Figure 1. Each tweet is enclosed within <tweet></tweet> tags. The first line in every annotation consists of the tweet id. Language tags are added before every token of the tweet, enclosed within <word></word> tags. Each tweet is annotated with one of the two tags (Hate Speech or Normal Speech). Hate speech was identified in 1661 tweets; the remaining 2914 code-mixed tweets in the dataset comprise normal speech. The annotated dataset, along with the classification system, is made available online 2 .
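To make the two annotation layers concrete, the following minimal sketch represents one annotated tweet in memory. It does not reproduce the released file format shown in Figure 1: the class name AnnotatedTweet, the label strings "hate" and "normal", the tweet id "0", and the per-token tags assigned to example T1 are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

LANG_TAGS = {"eng", "hin", "other"}   # word-level tags described in the text
LABELS = {"hate", "normal"}           # hypothetical strings for the two classes


@dataclass
class AnnotatedTweet:
    tweet_id: str
    tokens: List[Tuple[str, str]]     # (language tag, token) pairs
    label: str

    def __post_init__(self) -> None:
        # Reject tags or labels outside the annotation scheme.
        for tag, token in self.tokens:
            if tag not in LANG_TAGS:
                raise ValueError(f"unknown language tag {tag!r} for {token!r}")
        if self.label not in LABELS:
            raise ValueError(f"unknown label {self.label!r}")

    def is_code_mixed(self) -> bool:
        # A tweet counts as code-mixed if it has both Hindi and English tokens.
        tags = {tag for tag, _ in self.tokens}
        return {"eng", "hin"} <= tags


# Example T1 with illustrative (assumed) word-level tags.
t1 = AnnotatedTweet(
    tweet_id="0",
    tokens=[
        ("hin", "Mujhe"), ("hin", "apne"), ("eng", "manager"), ("hin", "se"),
        ("hin", "nafrat"), ("hin", "hai"), ("other", ","), ("eng", "I"),
        ("eng", "want"), ("eng", "to"), ("eng", "kill"), ("eng", "that"),
        ("eng", "guy"), ("other", "."),
    ],
    label="hate",
)
```

The is_code_mixed check mirrors the filtering criterion under which tweets written in pure English or pure Hindi were removed from the corpus.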

Inter Annotator Agreement
Annotation of the dataset for the presence of hate speech was carried out by two human annotators with a linguistic background and proficiency in both Hindi and English. A sample annotation set consisting of 50 tweets (25 hate speech and 25 non-hate speech), selected randomly from across the corpus, was provided to both annotators as a reference baseline for differentiating between hate speech and non-hate speech text. In order to validate the quality of annotation, we calculated the inter-annotator agreement (IAA) for hate speech annotation between the two annotation sets of 4575 code-mixed tweets using Cohen's Kappa coefficient. The Kappa score is 0.982, which indicates a very high level of agreement between the annotators and supports the quality of the annotation and the presented schema.
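For reference, Cohen's Kappa compares the observed agreement p_o between two annotators with the agreement p_e expected by chance, as kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two label sequences. The function name and the ten toy labels are illustrative assumptions and are not drawn from the corpus.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    p_e = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy example: the annotators disagree on one of ten tweets.
annotator_1 = ["H", "H", "N", "N", "N", "H", "N", "N", "H", "N"]
annotator_2 = ["H", "H", "N", "N", "N", "H", "N", "N", "N", "N"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case p_o = 0.9 and p_e = 0.54, so the Kappa is about 0.78, noticeably lower than the raw agreement because part of the agreement is attributable to chance.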

System Architecture
In this section, we present our machine learning model, which is trained and tested on the code-mixed dataset described in the previous sections.

Pre-processing of the code-mixed tweets
The following steps were performed to pre-process the data prior to feature extraction.
1. Removal of URLs: All the links and URLs in the tweets are stored and replaced with "URL", as these do not contribute towards any kind of sentiment in the text.
[...] each punctuation mark, since we use them as one of the features in classification.
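The URL replacement step can be sketched as follows. The regular expression and the function name replace_urls are assumptions made for illustration; the exact pattern used in our implementation is not specified here.

```python
import re
from typing import List, Tuple

# Assumed pattern: anything starting with http://, https:// or www. up to
# the next whitespace character is treated as a URL.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def replace_urls(text: str) -> Tuple[str, List[str]]:
    """Return the text with every URL replaced by "URL", plus the stored URLs."""
    urls = URL_PATTERN.findall(text)
    cleaned = URL_PATTERN.sub("URL", text)
    return cleaned, urls


cleaned, urls = replace_urls("India won the World Cup! https://t.co/abc123 :D")
```

Returning the removed URLs alongside the cleaned text keeps them available, matching the description that links are stored before being replaced.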

Feature Identification and Extraction
In our work, we have used the following features to train our supervised machine learning model.
1. Character N-Grams (C): Character n-grams are language independent and have proven to be very efficient for classifying text. They are also useful when the text suffers from misspellings (Cavnar and Trenkle, 1994; Huffman, 1995; Lodhi et al., 2002). Groups of characters can help in capturing semantic meaning, especially in code-mixed language, where words are used informally and vary significantly from standard Hindi and English words. We use character n-grams as one of the features, where n varies from 1 to 3.
2. Word N-Grams (W): Bag-of-words features have been widely used to capture emotion in a text (Purver and Battersby, 2012) and in detecting hate speech (Warner and Hirschberg, 2012). Thus we use word n-grams, where n varies from 1 to 3, as a feature to train our classification models.
3. Punctuations (P): Punctuation marks can also be useful for hate speech detection. Users often use exclamation marks when they want to express strong feelings. Multiple question marks in the text can denote anger and dissent. The use of an exclamation mark in conjunction with a question mark indicates annoyance. We count the occurrences of each punctuation mark in a sentence and use the counts as a feature.
4. Negation Words (N): A list of negation words was taken from Christopher Potts' sentiment tutorial 3 . We count the number of negations in a tweet and use the count as a feature.
5. Hate Lexicon: [...] (Mohammad, 2012). We identified 177 Hindi and English hate words from the dataset and took them as a feature for classification.
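The features listed above can be sketched for a single tweet as follows. This is an illustration under stated assumptions, not our released implementation: the tiny NEGATION_WORDS and HATE_LEXICON sets are hypothetical stand-ins for the actual negation list and the 177-word hate lexicon, the key prefixes "C:", "W:", "P:" and the keys "N" and "L" are naming choices made here, and counting lexicon hits (rather than using a binary indicator) is also an assumption.

```python
import string
from collections import Counter
from typing import Dict, List

NEGATION_WORDS = {"not", "no", "never", "nahi"}   # hypothetical stand-in list
HATE_LEXICON = {"nafrat", "kill"}                 # hypothetical stand-in lexicon


def char_ngrams(text: str, n_min: int = 1, n_max: int = 3) -> Counter:
    """Count all character n-grams with n_min <= n <= n_max."""
    counts: Counter = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def word_ngrams(words: List[str], n_min: int = 1, n_max: int = 3) -> Counter:
    """Count all word n-grams with n_min <= n <= n_max."""
    counts: Counter = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


def extract_features(text: str) -> Dict[str, int]:
    """Build one sparse feature dictionary for a tweet."""
    lowered = text.lower()
    # Whitespace tokenization, with surrounding punctuation stripped.
    words = [w for w in (t.strip(string.punctuation) for t in lowered.split()) if w]
    features: Dict[str, int] = {}
    for gram, count in char_ngrams(lowered).items():
        features["C:" + gram] = count
    for gram, count in word_ngrams(words).items():
        features["W:" + gram] = count
    for mark, count in Counter(ch for ch in text if ch in string.punctuation).items():
        features["P:" + mark] = count
    features["N"] = sum(w in NEGATION_WORDS for w in words)
    features["L"] = sum(w in HATE_LEXICON for w in words)
    return features


feats = extract_features("I want to kill that guy!!")
```

A dictionary of this form can be turned into a fixed-length vector by any standard dictionary vectorizer before feature selection and classification.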

Results
We performed experiments with two different classifiers, namely a Support Vector Machine (SVM) with a radial basis function kernel and a Random Forest classifier. Since the feature vectors formed are very large, we applied the chi-square feature selection algorithm, which reduces the size of our feature vector to 1200 4 . For training our classifiers, we used Scikit-learn (Pedregosa et al., 2011). In all the experiments, we carried out 10-fold cross-validation. Table 1 and Table 2 report the accuracy obtained with each feature, along with the accuracy when all features are used, for the Support Vector Machine and the Random Forest classifier respectively. The Support Vector Machine performs better than the Random Forest classifier and achieves the highest accuracy of 71.7% when all features are used. Character n-grams proved to be the most effective feature for the SVM, while word n-grams yielded the highest accuracy for the Random Forest classifier.
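The experimental setup can be approximated with the Scikit-learn sketch below. It is a simplified illustration rather than our exact configuration: only the character and word n-gram features are included, the twenty synthetic tweets are invented toy data, k is set to 50 because the toy vocabulary is far smaller than 1200 features, and placing feature selection inside the cross-validated pipeline is a design choice assumed here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline, make_union
from sklearn.svm import SVC


def build_pipeline(classifier, k=1200):
    """N-gram counts -> chi-square selection of k features -> classifier."""
    features = make_union(
        CountVectorizer(analyzer="char", ngram_range=(1, 3)),
        CountVectorizer(analyzer="word", ngram_range=(1, 3)),
    )
    return make_pipeline(features, SelectKBest(chi2, k=k), classifier)


# Synthetic toy data (not from the corpus): 10 tweets per class.
hate = [f"mujhe user{i} se nafrat hai, I want to kill that guy" for i in range(10)]
normal = [f"aaj ka day {i} humesha yaad rahega because India won" for i in range(10)]
texts = hate + normal
labels = [1] * len(hate) + [0] * len(normal)

classifiers = {
    "SVM": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
}
results = {}
for name, clf in classifiers.items():
    # 10-fold cross-validation; selection is refit on each training fold.
    scores = cross_val_score(build_pipeline(clf, k=50), texts, labels, cv=10)
    results[name] = scores.mean()
```

Keeping the chi-square selection inside the pipeline means the selected features are chosen from the training folds only, so the held-out fold does not influence feature selection.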

Conclusion and Future Work
In this paper, we present an annotated corpus of Hindi-English code-mixed text, consisting of tweet ids and the corresponding annotations. We also present the supervised system used for detecting hate speech in the code-mixed text. The corpus consists of 4575 code-mixed tweets, each annotated as hate speech or normal speech. The words in the tweets are also annotated with their source language. The features used in our classification system are character n-grams, word n-grams, punctuation marks, negation words and a hate lexicon. The best accuracy of 71.7% is achieved when all the features are incorporated in the feature vector, using SVM as the classifier. As part of future work, the corpus can be annotated with part-of-speech tags at the word level, which may yield better results. Moreover, the annotations and experiments described in this paper can also be carried out in future for code-mixed texts containing more than two languages from multilingual societies.