INGEOTEC at SemEval-2018 Task 1: EvoMSA and μTC for Sentiment Analysis

This paper describes our participation in the Affective Tweets task, in the emotion intensity and sentiment intensity subtasks for English, Spanish, and Arabic. We used two approaches, μTC and EvoMSA. The first is a generic text categorization and regression system; the second is a two-stage architecture for Sentiment Analysis. Both approaches are multilingual and domain independent.


Introduction
Sentiment Analysis is a research area concerned with the computational analysis of people's feelings or beliefs expressed in texts, such as emotions, opinions, attitudes, and appraisals (Liu and Zhang, 2012). People communicate not only the emotion or sentiment they are feeling, but also its intensity, that is, the degree of emotion or sentiment. In this context, SemEval is one of the forums that conducts evaluations on semantics at different levels; for instance, it proposes tasks such as sentiment analysis, intensity of emotion or sentiment (affective tweets), and irony detection, among others (SemEval, 2017).
In this work, we present the results of our participation in the Affective Tweets task for four of the five subtasks in English, Spanish, and Arabic, and for all emotions available: anger, fear, joy, and sadness.
The subtasks are: A) emotion intensity regression (EI-REG): given a tweet and an emotion, determine the intensity of the emotion that best represents the mental state of the tweeter, as a real-valued score between 0 and 1; B) emotion intensity ordinal classification (EI-OC): given a tweet and an emotion (anger, fear, joy, or sadness), classify the tweet into one of four ordinal classes of intensity of that emotion that best represents the mental state of the tweeter; C) sentiment intensity regression (V-REG): given a tweet, determine the intensity of sentiment, as a real-valued score between 0 (most negative) and 1 (most positive); D) sentiment ordinal classification (V-OC): given a tweet, classify it into one of seven ordinal classes, corresponding to several levels of positive and negative sentiment intensity.
* Corresponding author: sabino.miranda@infotec.mx
In this context, one crucial step is the procedure used to transform the data (i.e., tweets) into the inputs (vectors) of the supervised learning techniques used. Typically, Natural Language Processing (NLP) approaches for data representation use word n-grams, linguistic information such as dependency relations, syntactic information, lexical units (e.g., lemmas, stems), affective lexicons, error correction, etc. However, selecting the best configuration of those characteristics can be a cumbersome task, often disregarded in favor of some well-known competitive setups. Tellez et al. (2017b) study the dependency between performance and the proper selection of the text model. This selection can be seen as a combinatorial optimization problem where the objective is to maximize the performance metric of the classifier being used; this approach is implemented by µTC (Tellez et al., 2018). Due to its combinatorial nature, and the kind of parameters that compose the configuration space, the resulting classifiers are multilingual and domain independent. However, because the resulting models depend tightly on the training set, it is necessary to provide additional information about the particular task to avoid overfitting. In this sense, the use of multiple knowledge sources is essential, and combining them simply and effectively is the idea behind EvoMSA. EvoMSA (§2.2) is a stacking system based on genetic programming, and particularly on the use of semantic genetic operators, that focuses on sentiment analysis. The core of our contribution is to use both µTC and EvoMSA to learn from different annotated collections and then use that diverse knowledge to tackle the SemEval 2018 Task 1 challenge.
Looking at the systems that obtained the best results in previous SemEval editions, it can be concluded that it is necessary to include more datasets; see, for instance, the BB_twtr system (Cliche, 2017) for the Sentiment Analysis in Twitter task, which uses more datasets besides the one given in the competition. Here, we decided to follow a similar approach by including additional, publicly available human-annotated datasets for English, Spanish, and Arabic to build robust models.

System Description
As mentioned, we use two systems to tackle the Affective Tweets task: µTC and EvoMSA. On the one hand, µTC is used mainly for the Arabic language because, in our experiments, it obtained the best performance in almost all subtasks in this language, both for regression and classification. On the other hand, EvoMSA is used for English and Spanish, and for the ordinal sentiment classification (valence) task in Arabic. In the following paragraphs, we describe these approaches.

µTC
µTC (https://github.com/INGEOTEC/microTC) is a minimalistic, general-purpose system able to tackle text classification and regression tasks independently of domain and language. For complete details of the model, see (Tellez et al., 2018). Essentially, µTC creates text classifiers (or text regressors) by searching for the best models in a given configuration space. A configuration consists of instructions to enable several preprocessing functions, a combination of tokenizers among the power set of several possible ones (character q-grams, word n-grams, skip-grams, etc.), and a weighting scheme (application of frequency filters and the use of TF, TFIDF, or several distributional schemes). µTC seeks the best configurations by optimizing a score that is evaluated through a classifier or a regressor; currently, it uses SVM for both tasks. Table 1 details the text transformations used in our solution for detecting the Anger emotion in Arabic. This set of text transformations was selected among millions of possible configurations through the combinatorial optimization process implemented in µTC. In ordinal classification tasks, the model is selected using the training dataset provided for each emotion, where applicable.
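To make the idea of model selection as combinatorial optimization concrete, the following is a minimal sketch rather than µTC's actual implementation. It assumes a random sampling step followed by hill climbing over single-parameter changes; the reduced configuration space, the parameter names, and the placeholder `toy_score` are all hypothetical. In a real setting the score would be the cross-validated performance of a classifier or regressor trained with the candidate configuration.

```python
import random

# Hypothetical, heavily reduced configuration space; the names and options
# are illustrative and are not the actual parameter names of the system.
SPACE = {
    "lowercase": [True, False],
    "numbers": ["group", "delete", "none"],
    "n_words": [(1,), (1, 2), (1, 2, 3)],
    "q_grams": [(), (3,), (2, 3, 4)],
    "weighting": ["tf", "tfidf"],
}


def random_config(rng):
    """Draw one configuration uniformly at random."""
    return {key: rng.choice(options) for key, options in SPACE.items()}


def neighbors(config):
    """Yield every configuration that differs in exactly one parameter."""
    for key, options in SPACE.items():
        for option in options:
            if option != config[key]:
                yield {**config, key: option}


def search(score, n_samples=8, seed=0):
    """Random sampling followed by hill climbing on the given score."""
    rng = random.Random(seed)
    best = max((random_config(rng) for _ in range(n_samples)), key=score)
    best_score = score(best)
    improved = True
    while improved:
        improved = False
        for candidate in neighbors(best):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
                improved = True
                break  # restart the neighborhood scan from the new best
    return best, best_score


# Placeholder score (assumption): agreement with an arbitrary target.
TARGET = {
    "lowercase": True,
    "numbers": "delete",
    "n_words": (1, 2),
    "q_grams": (2, 3, 4),
    "weighting": "tfidf",
}


def toy_score(config):
    return sum(config[key] == TARGET[key] for key in SPACE)


best, best_score = search(toy_score)
```

Because the placeholder score is separable across parameters, hill climbing reaches the target here; with a real performance metric the search would only be expected to find a good local optimum.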

EvoMSA
EvoMSA is a Sentiment Analysis system based on B4MSA and EvoDAG. It is a two-phase architecture for solving classification or regression tasks; see Figure 1. EvoMSA improves the performance of a global classifier by combining the predictions of a set of classifiers, built with different models, on the same text. Roughly speaking, in the first stage, a set of B4MSA classifiers (see Sec. 2.2.1) is trained with two kinds of datasets: the datasets provided by SemEval, and large datasets annotated by humans for sentiment analysis in English and Spanish (Mozetič et al., 2016), called HA datasets. Each HA dataset is split into small balanced datasets, each of which feeds a B4MSA classifier that produces three real-valued outputs, one for each sentiment (negative, neutral, and positive). In the case of the SemEval datasets, for instance for EI-OC, the classifier produces one of four ordinal classes of intensity of emotion (0, 1, 2, 3). This yields a space of decision functions that mixes values coming from different views of knowledge. Finally, EvoDAG's inputs are the concatenation of all the decision functions predicted by each B4MSA system, and EvoDAG produces the final value or prediction. The following subsections describe the internal parts of EvoMSA. The precise configuration of our benchmarked system is described in Sec. 4.

B4MSA
B4MSA is related to µTC, but this framework is mainly focused on multilingual sentiment analysis. For complete details of the model, see (Tellez et al., 2017a,b).
The core idea behind B4MSA is similar to that of µTC, i.e., it tackles the sentiment analysis problem as a model selection problem, yet using a different view of the underlying combinatorial problem. Also, in contrast to µTC, B4MSA takes advantage of several domain-specific particularities, such as emojis and emoticons, and explicitly handles negation statements expressed in texts. Nonetheless, EvoMSA avoids the sophisticated use of B4MSA by fixing the model for each language, in favor of performing an optimization process at the level of the decision functions of several models. Table 1 shows the text transformation parameters used in our system for English and Spanish.
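The two-stage idea described above can be sketched as follows. This is only an illustration of stacking: the first-stage functions `polarity_model` and `intensity_model` and the fixed `second_stage` rule are hypothetical stand-ins for trained B4MSA classifiers and an evolved EvoDAG model, respectively, and none of them reflect the real libraries' APIs.

```python
def stack_features(text, first_stage):
    """Concatenate the decision-function values of every first-stage model."""
    features = []
    for model in first_stage:
        features.extend(model(text))
    return features


def predict(text, first_stage, second_stage):
    """Feed the concatenated first-stage outputs to the second-stage model."""
    return second_stage(stack_features(text, first_stage))


# Hypothetical stand-in for a classifier trained on an HA dataset:
# returns three values (negative, neutral, positive).
def polarity_model(text):
    words = text.lower().split()
    positive = sum(word in {"happy", "great"} for word in words)
    negative = sum(word in {"angry", "awful"} for word in words)
    neutral = len(words) - positive - negative
    return [float(negative), float(neutral), float(positive)]


# Hypothetical stand-in for a classifier trained on task data:
# returns a single intensity-like value.
def intensity_model(text):
    return [float(text.count("!"))]


# Hypothetical stand-in for the evolved second-stage model: a fixed function
# of the stacked inputs, with no knowledge of where each input came from.
def second_stage(x):
    negative, _neutral, positive, exclamations = x
    return max(negative, exclamations) - positive


features = stack_features("so angry !!", [polarity_model, intensity_model])
prediction = predict("so angry !!", [polarity_model, intensity_model], second_stage)
```

The point of the design is that the second stage only sees a flat vector of numbers, so additional first-stage models trained on other collections can be appended without changing the rest of the pipeline.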

EvoDAG
EvoDAG (Graff et al., 2016, 2017), available at https://github.com/mgraffg/EvoDAG, is a Genetic Programming system specifically tailored to tackle classification and regression problems on very high-dimensional vector spaces and large datasets. In particular, EvoDAG uses the principles of Darwinian evolution to create models represented as a directed acyclic graph (DAG). An EvoDAG model has three distinct types of nodes: the input nodes, which, as expected, receive the independent variables; the output node, which corresponds to the label; and the inner nodes, which are different numerical functions such as sum, product, sin, cos, max, and min, among others. Due to lack of space, we refer the reader to (Graff et al., 2016), where EvoDAG is broadly described. In fact, in this research, we followed the steps explained there. To give an idea of the type of models being evolved, Figure 2 depicts a model evolved for the Arabic polarity classification task at the global message level. As can be seen, the model is represented using a DAG where the direction of the edges indicates the dependency, e.g., cos depends on X 3 , i.e., the cosine function is applied to X 3 . As mentioned above, there are three types of nodes: the input nodes are colored in red, the inner nodes are blue (the intensity is related to the distance to the height; the darker, the closer), and the green node is the output node. As mentioned previously, EvoDAG uses as inputs the decision functions of B4MSA (https://github.com/INGEOTEC/b4msa); thus, the first three inputs (i.e., X 0 , X 1 , and X 2 ) correspond to the decision function values of the negative, neutral, and positive polarity of the B4MSA model trained with the SemEval Arabic dataset, and the latter two (i.e., X 3 and X 4 ) correspond to the decision function values of two B4MSA systems, each trained with a different two-class dataset.
It is important to mention that EvoDAG does not have information regarding whether input X i comes from a particular polarity decision function; consequently, from EvoDAG's point of view, all inputs are equivalent.
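As an illustration of how a model represented as a DAG of numerical functions can be evaluated, the sketch below stores the nodes in topological order and computes each node from previously computed ones. The encoding (a list of `(operation, references)` pairs) and the example graph are assumptions made for this sketch; it does not reproduce EvoDAG's internal representation and omits any coefficients the real system may learn for each node.

```python
import math

# Inner-node functions named in the text; each receives a list of arguments.
FUNCTIONS = {
    "sum": lambda args: sum(args),
    "product": lambda args: math.prod(args),
    "sin": lambda args: math.sin(args[0]),
    "cos": lambda args: math.cos(args[0]),
    "max": lambda args: max(args),
    "min": lambda args: min(args),
}


def evaluate(dag, x):
    """Evaluate a DAG given in topological order; the last node is the output.

    Each node is either ("input", i), which reads x[i], or
    (function_name, [indices of earlier nodes]).
    """
    values = []
    for operation, refs in dag:
        if operation == "input":
            values.append(x[refs])
        else:
            values.append(FUNCTIONS[operation]([values[r] for r in refs]))
    return values[-1]


# Hypothetical example graph (not the model shown in Figure 2).
dag = [
    ("input", 0),      # node 0 reads X0
    ("input", 3),      # node 1 reads X3
    ("cos", [1]),      # node 2 = cos(X3)
    ("max", [0, 2]),   # node 3 = max(X0, cos(X3))
    ("sum", [3, 1]),   # output = node 3 + X3
]

output = evaluate(dag, [0.5, 0.0, 0.0, 0.0, 0.0])
```

Note that the evaluator treats every input index identically, which mirrors the remark that all inputs are equivalent from the model's point of view.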

Experimental Settings
As mentioned, to determine the best configuration of parameters for text modeling, µTC and B4MSA integrate a hyper-parameter optimization phase aimed at maximizing the performance of the classifier on the training data. For English and Spanish, the text modeling parameters for B4MSA were fixed for the whole process, for both classification and regression tasks, as shown in Table 1. In the case of the Arabic language, the parameters were computed by the optimization phase; an example is shown in Table 1. A text transformation feature can be a binary (yes/no) or ternary (group/delete/none) option. Tokenizers denote how texts must be split after all text transformations have been applied. Tokenizers generate text chunks in a range of lengths, and all generated tokens are part of the text representation. Both B4MSA and µTC allow selecting tokenizers based on n-words, q-grams, and skip-grams, in any combination. We use the term n-words for the well-known word n-grams; in particular, we allow any combination of unigrams, bigrams, and trigrams. Also, the configuration space allows selecting any combination of character q-grams (or just q-grams) for q = 1 to 9. Finally, we allow (2, 1) and (3, 1) skip-grams (two words separated by one word, and three words separated by a gap). Table 1 shows the final configurations for English and Spanish and an example for one emotion for Arabic. For example, numbers are deleted in Arabic, but they are grouped in English and Spanish. In the case of English, texts are split into unigrams, bigrams, and character q-grams of sizes 2, 3, and 4.
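The tokenizers described above can be sketched in a few lines. This is a simplified illustration, not the code of B4MSA or µTC: it assumes whitespace word splitting, character q-grams taken over the raw string, and an (n, k) skip-gram read as n words with k words skipped between consecutive ones; all function names are hypothetical.

```python
def word_ngrams(words, n):
    """Contiguous word n-grams (called n-words in the text)."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_qgrams(text, q):
    """Character q-grams over the whole string, spaces included."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def skip_grams(words, n, skip):
    """n words, with `skip` words skipped between consecutive ones."""
    step = skip + 1
    span = (n - 1) * step + 1
    return [" ".join(words[i:i + span:step])
            for i in range(len(words) - span + 1)]


def tokenize(text, n_words=(1, 2), q_grams=(2, 3, 4), skips=()):
    """Union of all tokens produced by the selected tokenizers."""
    words = text.split()
    tokens = []
    for n in n_words:
        tokens.extend(word_ngrams(words, n))
    for q in q_grams:
        tokens.extend(char_qgrams(text, q))
    for n, skip in skips:
        tokens.extend(skip_grams(words, n, skip))
    return tokens


tokens = tokenize("good day", n_words=(1, 2), q_grams=(2,))
```

The default arguments of `tokenize` mirror the English setup mentioned in the text (unigrams, bigrams, and q-grams of sizes 2, 3, and 4); every token produced by any selected tokenizer becomes part of the same bag used to build the vector representation.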

Datasets
SemEval provides datasets to train systems for each subtask. For instance, for the emotion Anger in English, in the emotion intensity ordinal classification (OC) subtask, the training data is distributed over four classes (class 0 = 445, class 1 = 322, class 2 = 507, class 3 = 427). The Arabic datasets for each emotion have around 800 samples each; for English, the sizes are between 1500 and 2200 samples; and for Spanish, between 1000 and 1150 samples. For more details on the data distribution and how the datasets were built, we refer the reader to the task description. In addition to the SemEval data, we use extra datasets annotated by humans: around 73 thousand tweets for English, 223 thousand for Spanish (Mozetič et al., 2016), and two thousand for Arabic (NRC, 2017). Table 2 shows the distribution of classes for these datasets. Those datasets are mainly used for sentiment analysis; however, we use this extra information to improve the final decision in the approach we implemented (EvoMSA).
[Table 2: distribution of classes for the additional datasets; English and Spanish data from (Mozetič et al., 2016), and the Arabic data from (NRC, 2017).]

Results
We present the results of our approaches in Table 3 and Table 4. All experiments were tested on the development dataset provided by SemEval. For the OC tasks, we use the macro-F1 score to measure performance, and for the Reg tasks, we use the Pearson correlation coefficient. Table 3 shows the results of emotion intensity for the ordinal classification (OC) and regression (Reg) tasks, grouped by emotion and language. Table 4 shows the results of the sentiment ordinal classification task (V-OC) and the sentiment intensity regression task (V-Reg), grouped by emotion and language. We present three system configurations in Table 3 and Table 4. The EvoMSA configuration uses only the training datasets provided by SemEval, and it is used as a regression or classification system. In addition to the SemEval data, EvoMSA-HA uses extra information coming from the sentiment analysis domain, and this information improves the performance, as can be seen. Finally, µTC uses only the training data provided by the contest as the knowledge base to compute the final class or real value. As can be seen in Table 3, the best results are obtained by the EvoMSA-HA configuration for both the OC and Reg tasks in English and Spanish. For the Arabic language, µTC performs well on the OC and Reg tasks. According to the results we obtained, we decided to use the following configuration for the evaluation phase: EvoMSA-HA is used for the OC, Reg, V-OC, and V-Reg tasks for English and Spanish, and also for the OC (Fear and Joy) and V-OC tasks for Arabic; µTC is used for Arabic in the OC (Anger and Sadness), Reg, and V-Reg tasks. In the table, the performance of our system configurations on the gold standard is labeled with subscripts, which stand for the rank in the general evaluation. For example, for Spanish in the OC task, we were ranked in position 4 for Anger, position 2 for Fear, position 3 for Joy, and position 2 for Sadness.
[Table: results for ordinal classification (OC) and regression (Reg), in terms of macro-F1 (OC) and Pearson correlation coefficient (Reg).]
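For reference, the two performance measures named above can be computed as in the following sketch. It uses the standard definitions (macro-F1 as the unweighted mean of per-class F1 over the labels present in either list, and the sample Pearson correlation coefficient); the function names and the example labels are hypothetical and are not taken from the official evaluation script.

```python
import math


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


def pearson(x, y):
    """Sample Pearson correlation coefficient between two sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Illustrative labels and scores only; these are not results from the paper.
f1_example = macro_f1([0, 0, 1, 1], [0, 1, 1, 1])
r_example = pearson([0.1, 0.4, 0.9], [0.2, 0.8, 1.8])
```

Macro-F1 gives every intensity class the same weight regardless of its frequency, whereas the Pearson coefficient only measures how well predicted scores vary linearly with the gold scores. `pearson` is undefined (division by zero) when either sequence is constant.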

Conclusions
In this paper, we presented our solution for the Affective Tweets task, which combines two approaches: EvoMSA and µTC. Both systems are designed to be multilingual and as language and domain independent as possible. For the training step, we use extra human-annotated datasets that are not tied to any specific emotion but are related to sentiment analysis. Our solution performs well in Spanish and Arabic; however, there is room for further improvement in the English tasks by incorporating other sorts of knowledge, such as semantic information (word embeddings), into the EvoMSA architecture.