INGEOTEC at SemEval-2019 Task 5 and Task 6: A Genetic Programming Approach for Text Classification

This paper describes our participation in the HatEval and OffensEval challenges for the English and Spanish languages. We used several approaches: B4MSA, FastText, and EvoMSA. The best results were achieved with EvoMSA, a multilingual and domain-independent architecture that combines the predictions of different knowledge sources to solve text classification problems.


Introduction
Social media platforms, like Twitter and Facebook, are spaces where people interact with others and express themselves. While these platforms encourage free speech, other issues can emerge, such as the use of offensive language that mocks or insults individuals or groups of people. Thus, detecting offenses and misbehavior expressed in text is useful for measuring people's feelings and for warning them about possible attacks on others, such as abusive language, hate speech, cyberbullying, and trolling, among others (Waseem et al., 2017).
In order to tackle these text classification problems, SemEval-2019 proposed two tasks: Task 5, HatEval, on multilingual detection of hate speech against immigrants and women in Twitter (Basile et al., 2019), and Task 6, OffensEval, on identification and categorization of offensive language in social media (Zampieri et al., 2019b). In this paper, we present the results of our participation in these two tasks.
The HatEval challenge consists of detecting hate speech against two targets, immigrants and women, in Twitter for the Spanish and English languages. There are two subtasks. Subtask A is a binary classification where systems have to predict whether a tweet with a given target (immigrants or women) is hateful or not hateful. Subtask B is about aggressive behavior and target classification: systems are asked to classify hateful tweets as aggressive or not aggressive, and to identify the target harassed (individual or group).
On the other hand, the OffensEval challenge consists of determining whether a given message has offensive content. It is divided into three subtasks. Subtask A is dedicated to identifying offensive language, i.e., determining whether a message is offensive or not offensive. Subtask B is about categorizing offense types; that is, whether a tweet contains an insult or threat to someone, or contains non-targeted profanity and swearing. Finally, subtask C focuses on identifying the target, i.e., whether the offensive post is about an individual, a group, or others.
Both HatEval and OffensEval are tasks related to abusive language. Waseem et al. (2017) describe tasks on this theme; the authors focus their analysis on two primary factors that could guide the modeling of systems: i) whether the language is directed towards a specific individual or entity, or towards a generalized group; ii) whether the abusive content is explicit or implicit.
Similarly, Schmidt and Wiegand (2017) present a collection of works on hate speech detection, highlighting the commonly used features: surface-level features, such as bag of words (n-grams) and character-level n-grams to attenuate the spelling variation issue in informal text, frequency of URL mentions, punctuation, token lengths, and capitalization, among others; word generalization, such as topic identification (LDA) and word embeddings (Mikolov et al., 2013); outcomes from sentiment analysis classifiers (for example, samples predicted as negative polarity) as auxiliary evidence of hate in multi-step approaches; lexical resources containing specific negative words (slurs, insults, etc.); linguistic aspects such as parts of speech and syntactic information; and knowledge-based information such as ontologies and taxonomies (ConceptNet, WordNet, etc.).
For both tasks, we use the same approach for the final runs. Our approach takes into account several of the features mentioned above. For example, the effects of character-level n-grams are broadly studied for related tasks in (Tellez et al., 2017b). In particular, text modeling is a crucial factor in our approach; therefore, we used the approach presented in (Tellez et al., 2018), which selects the best configuration for the datasets concerned. We also use knowledge external to the given training set to support the classification task; in this sense, our approach, named EvoMSA (§2.1), is a stacking system based on genetic programming, and particularly on the use of semantic genetic operators, that focuses on sentiment analysis and, in general, on text classification.

System Description
We used EvoMSA, our framework based on genetic programming, to address the HatEval and OffensEval tasks. EvoMSA is composed of a stack of B4MSA classifiers that produce predictions, and EvoDAG combines these predictions into the final one.

EvoMSA
EvoMSA (https://github.com/INGEOTEC/EvoMSA) (Graff et al., 2018a,b) is a generic sentiment analysis system based on B4MSA and EvoDAG. It is a two-phase architecture for solving classification tasks, see Figure 1. EvoMSA improves the performance of a global classifier by combining the predictions of a set of classifiers with different models on the same text to be classified. Roughly speaking, in the first stage, a set of B4MSA classifiers (see Sec. 2.1.1) is trained on several views of the same datasets, namely the datasets provided by SemEval. This stage creates a space of decision functions with mixtures of values coming from different views of knowledge: a B4MSA model trained with the training set of the competition (used as a generic classifier); a lexicon-based model that only counts affective words, positive and negative, based on several lexicons (Liu, 2017; Albornoz et al., 2012; Sidorov et al., 2013; Perez-Rosas et al., 2012); an emoji-based space (the sixty-four most probable emoticons for the message) (Graff et al., 2018b); and the output of FastText (Grave et al., 2018) (word embeddings of dimension 100) trained with the training set. Finally, the input of EvoDAG is the concatenation of all the predicted decision functions, and EvoDAG produces the final prediction. The following subsections describe the internal parts of EvoMSA. The precise configuration of our benchmarked system is described in Sec. 4.

B4MSA
B4MSA focuses on multilingual sentiment analysis. For complete details of the model see (Tellez et al., 2017a,b). The core idea behind B4MSA is to tackle the sentiment analysis problem as a model selection problem, using a different view of the underlying combinatorial problem; i.e., B4MSA combines a number of different text tokenizers, text transformations, and weighting methods, and internally uses an SVM with a linear kernel to classify. Also, B4MSA takes advantage of several domain-specific particularities, like emojis and emoticons, and explicitly handles negation statements expressed in texts. Nonetheless, EvoMSA avoids the sophisticated use of B4MSA by fixing the model for each language, in favor of performing an optimization process at the level of the decision functions of several models. Table 1 shows the text transformation parameters used in our system for the English and Spanish languages.
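The first stage of EvoMSA can be pictured as several views that each map a text to a few decision values, which are then concatenated into a single input vector for the second stage. The following is a minimal sketch of that concatenation only. The toy lexicons and the names `lexicon_view`, `length_view`, and `stack_views` are illustrative assumptions and not part of the EvoMSA API; `length_view` merely stands in for the decision function of a trained classifier.

```python
from typing import Callable, List, Sequence

# Toy lexicons (assumption); the paper relies on published affective lexicons.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}

# A view maps a text to a list of decision values.
View = Callable[[str], List[float]]


def lexicon_view(text: str) -> List[float]:
    """Count positive and negative words, in the spirit of a lexicon-based model."""
    tokens = text.lower().split()
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return [float(positives), float(negatives)]


def length_view(text: str) -> List[float]:
    """Placeholder for a trained classifier's decision function (assumption)."""
    return [float(len(text.split()))]


def stack_views(text: str, views: Sequence[View]) -> List[float]:
    """Concatenate the outputs of all views into one second-stage input vector."""
    stacked: List[float] = []
    for view in views:
        stacked.extend(view(text))
    return stacked


if __name__ == "__main__":
    print(stack_views("I love good food but hate bad service",
                      [lexicon_view, length_view]))
```

In the real system the second stage is EvoDAG, which learns how to combine such a vector into the final prediction; the sketch stops at building the vector.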

EvoDAG
EvoDAG (Graff et al., 2016) is a Genetic Programming system specifically tailored to tackle classification and regression problems on very high-dimensional vector spaces and large datasets. In particular, EvoDAG uses the principles of Darwinian evolution to create models represented as a directed acyclic graph (DAG). An EvoDAG model has three distinct types of nodes: the input nodes, which, as expected, receive the independent variables; the output node, which corresponds to the label; and the inner nodes, which are different numerical functions such as sum, product, sin, cos, max, and min, among others. Due to lack of space, we refer the reader to (Graff et al., 2016), where EvoDAG is broadly described.
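To make the model representation concrete, the sketch below evaluates a small, hand-written DAG whose inner nodes apply the numerical functions listed above. It illustrates only how such a graph produces an output from its inputs; it does not perform any evolutionary search, and the node encoding (`Node`, `evaluate_dag`) is a hypothetical simplification rather than EvoDAG's internal representation.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Inner-node functions named in the text; each receives a list of numbers.
FUNCTIONS: Dict[str, Callable[[List[float]], float]] = {
    "sum": lambda args: sum(args),
    "product": lambda args: math.prod(args),
    "sin": lambda args: math.sin(args[0]),
    "cos": lambda args: math.cos(args[0]),
    "max": lambda args: max(args),
    "min": lambda args: min(args),
}

# A node is (function name, indices of the nodes it reads from).
Node = Tuple[str, Sequence[int]]


def evaluate_dag(inputs: Sequence[float], inner_nodes: Sequence[Node]) -> float:
    """Evaluate nodes in topological order; the last node acts as the output."""
    values: List[float] = list(inputs)  # input nodes occupy the first positions
    for name, parents in inner_nodes:
        values.append(FUNCTIONS[name]([values[i] for i in parents]))
    return values[-1]


if __name__ == "__main__":
    # Nodes 0..2 are inputs; node 3 = x0 + x1; node 4 = max(x2, node 3);
    # node 5 = node 3 * node 4 is the output.
    dag = [("sum", [0, 1]), ("max", [2, 3]), ("product", [3, 4])]
    print(evaluate_dag([1.0, 2.0, 3.0], dag))
```

Because every node only reads from nodes with smaller indices, the list order is already a valid topological order and a single pass suffices.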

Experimental Settings
As mentioned, to determine the best configuration of parameters for text modeling, B4MSA integrates a hyper-parameter optimization phase that tunes the performance of the classifier on the training data. The text modeling parameters for B4MSA were fixed for the whole process, as shown in Table 1 for the English and Spanish languages. A text transformation feature can be a binary (yes/no) or ternary (group/delete/none) option. Tokenizers denote how texts must be split after applying each text transformation. Tokenizers generate text chunks in a range of lengths, and all generated tokens are part of the text representation. B4MSA allows selecting tokenizers based on n-words, q-grams, and skip-grams, in any combination. We call n-words the popular word n-grams; in particular, we allow any combination of unigrams, bigrams, and trigrams. Also, the configuration space allows selecting any combination of character q-grams, for q = 1 to 9. Finally, we allow skip-grams such as (3, 1) and (2, 2), that is, three words separated by one word (gap), and two words separated by two gaps.
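The three tokenizer families can be sketched in a few lines. This is an illustrative reading of the description above, not B4MSA's implementation: the function names are hypothetical, words are obtained by whitespace splitting, and q-grams are taken over the raw text including spaces, all of which are assumptions.

```python
from typing import Iterable, List, Tuple


def n_words(tokens: List[str], n: int) -> List[str]:
    """Word n-grams: n consecutive words joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def q_grams(text: str, q: int) -> List[str]:
    """Character q-grams over the raw text (spaces included, by assumption)."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def skip_grams(tokens: List[str], size: int, gap: int) -> List[str]:
    """`size` words, each separated from the next by `gap` skipped words."""
    step = gap + 1
    span = (size - 1) * step + 1  # number of positions one skip-gram covers
    return [" ".join(tokens[i:i + span:step])
            for i in range(len(tokens) - span + 1)]


def tokenize(text: str,
             word_ns: Iterable[int],
             char_qs: Iterable[int],
             skips: Iterable[Tuple[int, int]]) -> List[str]:
    """Union of all selected tokenizers; every token joins the representation."""
    tokens = text.split()
    output: List[str] = []
    for n in word_ns:
        output.extend(n_words(tokens, n))
    for q in char_qs:
        output.extend(q_grams(text, q))
    for size, gap in skips:
        output.extend(skip_grams(tokens, size, gap))
    return output


if __name__ == "__main__":
    print(tokenize("a b c d e", word_ns=[2], char_qs=[], skips=[(3, 1), (2, 2)]))
```

With the text "a b c d e", the (3, 1) skip-gram yields "a c e" and the (2, 2) skip-gram yields "a d" and "b e", matching the definitions given above.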
We use two baselines, B4MSA and the FastText classifier (Bojanowski et al., 2016), for both contests. FastText represents sentences with a weighted bag of words, and each word is represented as a bag of character n-grams to create text vectors based on word embeddings. Our custom FastText automatically searches for the best parameters; e.g., for OffensEval it selected window size = 9, learning rate = 0.01, epochs = 10, size of word vectors = 10, and minimum and maximum lengths of character n-grams of 2 and 5, respectively, along with some preprocessing steps such as grouping numbers and reducing duplicated characters.
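As a rough illustration of the preprocessing steps and the character n-gram range mentioned above, the sketch below groups numbers, shortens repeated characters, and lists the character n-grams of a word for lengths 2 to 5. The placeholder token `_num`, the limit of two repeated characters, and all function names are assumptions made for this example; FastText itself also adds word-boundary markers, which the sketch omits.

```python
import re
from typing import List


def group_numbers(text: str, placeholder: str = "_num") -> str:
    """Replace every run of digits with one placeholder token (assumed name)."""
    return re.sub(r"\d+", placeholder, text)


def reduce_duplicates(text: str, max_run: int = 2) -> str:
    """Shorten runs of a repeated character to at most `max_run` copies."""
    pattern = r"(.)\1{" + str(max_run) + r",}"  # a run longer than max_run
    return re.sub(pattern, lambda match: match.group(1) * max_run, text)


def char_ngrams(word: str, min_n: int = 2, max_n: int = 5) -> List[str]:
    """All character n-grams of the word for min_n <= n <= max_n."""
    return [word[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(word) - n + 1)]


if __name__ == "__main__":
    cleaned = reduce_duplicates(group_numbers("sooooo goood 100"))
    print(cleaned)
    print(char_ngrams("hate"))
```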

Datasets
SemEval contests provide datasets to train systems for each task. Table 2 presents the data distribution for HatEval, and Table 3 shows the OffensEval data distribution. In Task A, class OFF denotes tweets that contain offenses or insults, while class NOT denotes tweets with no offensive content. Messages labeled as TIN contain an insult or threat to an entity; UNT denotes the opposite. The group (GRP), individual (IND), and others (OTH) classes indicate the target of the offensive messages. The OffensEval collection is described in detail in Zampieri et al. (2019a).

Results
We present the results of our approaches for the HatEval contest in Table 4 and Table 5. We performed our experiments on the development dataset provided by HatEval. Table 4 shows the results of task A, deciding whether a given tweet is hateful or not hateful, for the English and Spanish languages. In the case of task A, the macro-F1 score is used to measure the performance. Table 5 shows the results of task B, classifying tweets as aggressive or not aggressive and identifying the target harassed.
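For reference, macro-F1 is commonly defined as the unweighted mean of the per-class F1 scores. The notation below, with C the set of classes and P_c and R_c the precision and recall of class c, is introduced here for illustration and assumes this usual definition.

```latex
F_{1,c} = \frac{2\, P_c\, R_c}{P_c + R_c},
\qquad
\text{macro-}F_1 = \frac{1}{|C|} \sum_{c \in C} F_{1,c}
```

Since every class contributes equally regardless of its size, a system that ignores a minority class is penalized more under this measure than under accuracy.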
In the case of OffensEval, Table 6 shows the results for the three proposed tasks: offensive language identification (Task A), categorization of offense types (Task B), and offense target identification (Task C). We present three system configurations for both tasks. B4MSA uses only the training data provided by the contest as the knowledge base to classify texts; i.e., B4MSA is our baseline, and its outcome is also an additional input for our more sophisticated classifier (EvoMSA). FastText generates word embeddings from the provided dataset. We do not use pre-trained vectors; using them did not provide any significant improvement in this case, but increased the complexity of the models and the processing pipeline. EvoMSA (Graff et al., 2018a) combines, using EvoDAG, the output of different text models such as B4MSA, a lexicon-based model, an emoji-space model, and FastText.
As the results tables show, EvoMSA is systematically better than our other systems; under these circumstances, we decided to use EvoMSA first in the evaluation phase. Following the rules of HatEval, only the last run would be valid; therefore, we used EvoMSA for that run. In the case of OffensEval, up to three predictions were allowed on the test dataset, but only the best one was compared with other systems. Table 6 shows the performance of our three systems on the gold standard; EvoMSA stays ahead in all tasks, also outperforming the baselines from the contest. The table also shows the performance of two baselines, "All NOT" and "ALL OFF", which correspond to labeling all tweets as NOT or OFF, respectively; similarly, the rest of the tasks have baselines for the "All TIN", "All UNT", "All GRP", "ALL IND", and "ALL OTH" labeling strategies.
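The constant-label baselines are simple to reproduce in principle: every tweet receives the same label and the prediction is scored like any other run. The sketch below does this for an "All NOT" strategy on a tiny made-up gold standard; the labels in the example are invented for illustration and are not contest data, the function names are hypothetical, and scoring a class with no true positives as 0 is a convention assumed here.

```python
from typing import List, Sequence


def f1_for_class(gold: Sequence[str], predicted: Sequence[str], label: str) -> float:
    """F1 of one class; returns 0.0 when the class has no true positives."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(g == label and p == label for g, p in pairs)
    false_pos = sum(g != label and p == label for g, p in pairs)
    false_neg = sum(g == label and p != label for g, p in pairs)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: Sequence[str], predicted: Sequence[str],
             labels: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(gold, predicted, label) for label in labels) / len(labels)


def constant_baseline(label: str, size: int) -> List[str]:
    """Predict the same label for every message."""
    return [label] * size


if __name__ == "__main__":
    toy_gold = ["NOT", "NOT", "NOT", "OFF"]  # invented labels, not contest data
    all_not = constant_baseline("NOT", len(toy_gold))
    print(macro_f1(toy_gold, all_not, ["NOT", "OFF"]))
```

A constant baseline can never score above zero on the classes it does not predict, which is why it serves as a floor for the macro-averaged measure.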

Conclusions
In this paper, we presented our solution for HatEval and OffensEval, two campaigns of SemEval 2019. We show the competitiveness of our ap-