EiTAKA at SemEval-2018 Task 1: An Ensemble of N-Channels ConvNet and XGboost Regressors for Emotion Analysis of Tweets

This paper describes our system used in Task 1: Affect in Tweets. We combine two different approaches. The first one, called N-Channels ConvNet, is a deep learning approach, whereas the second one is an XGBoost regressor based on a set of embedding- and lexicon-based features. Our system was evaluated on the testing sets of the tasks and outperformed all other approaches on the Arabic versions of the valence intensity regression task and the valence ordinal classification task.


Introduction
Sentiment Analysis is the task of automatically identifying the valence or polarity of a piece of text, which can be a user review, a document, an SMS message, a tweet, etc. The term sentiment analysis also refers to determining the attitude towards a particular target or topic. The attitude can be the polarity (positive or negative) or an emotional or affectual attitude such as joy, anger or sadness.
Most researchers in sentiment analysis have focused on developing systems to determine the polarity of a given text. This involves designing classifiers based on a set of examples with a manually annotated sentiment polarity. Systems that automatically determine the intensity (i.e. the degree or amount) of the emotions communicated in a text have a wide range of applications in commerce, public health, social welfare, etc. Nevertheless, most of the work has focused on categorical classification (whether a given piece of text communicates anger, joy, sadness, etc.), which can be attributed to the lack of suitable annotated data (Mohammad and Bravo-Márquez, 2017).
In Task 1: Affect in Tweets, the organizers provide an array of tasks where systems have to automatically determine the intensity of emotions (anger, fear, joy, and sadness) and the intensity of the sentiment (aka valence) of the tweeters from their tweets. They provide annotated datasets for each task with English, Arabic, and Spanish tweets. We define the tasks below.
EI-reg (an emotion intensity regression task): Given a tweet and an emotion E, determine the intensity of E that best represents the mental state of the tweeter with a real-valued score between 0 (least E) and 1 (most E).
EI-oc (an emotion intensity ordinal classification task): Given a tweet and an emotion E, classify the tweet into one of four ordinal classes of intensity of E that best represents the mental state of the tweeter.
V-reg (a sentiment intensity regression task): Given a tweet, determine the intensity of sentiment or valence V that best represents the mental state of the tweeter with a real-valued score between 0 (most negative) and 1 (most positive).
V-oc (a sentiment analysis, ordinal classification, task): Given a tweet, classify it into one of seven ordinal classes, corresponding to various levels of positive and negative sentiment intensity, that best represents the mental state of the tweeter.
We propose one system to solve the intensity regression tasks (i.e. EI-reg and V-reg) and use it as a feature extractor to train Decision Trees that solve the ordinal classification tasks (i.e. EI-oc and V-oc). We developed two versions of the proposed system, for the English and the Arabic tweets.
Our system is an ensemble of two different approaches. The first one, called N-Channels ConvNet, is a deep learning approach, whereas the second one is an XGBoost regressor based on a set of embedding- and lexicon-based features.
The rest of the paper is organized as follows: Section 2 presents the tools and the resources that are used. Section 3 describes the proposed system. In Section 4 we report the experimental results, whereas in Section 5 the conclusions and the future work are presented.

Resources
This section explains the tools and the resources that have been used in our system.

Sentiment Lexicons
We used the following lexicons for the English version of our system: AFINN (Nielsen, 2011), General Inquirer (Stone et al., 1968), Bing-Liu opinion lexicon (HL) (Hu and Liu, 2004), MPQA (Choi and Wiebe, 2014), NRC hashtag sentiment lexicon (Mohammad et al., 2013), NRC emotion lexicon (EmoLex), NRC affect intensity lexicon, NRC hashtag emotion lexicon and Vader lexicon. More details about each lexicon, such as how it was created, the polarity score for each term, and the statistical distribution of the lexicon, can be found in (Jabreel and Moreno, 2016).
For the Arabic version we used the following lexicons: Arabic Hashtag lexicon, Dialectal Arabic Hashtag lexicon, Arabic Bing-Liu lexicon, Arabic Sentiment140 lexicon and the Arabic translation of the NRC emotion lexicon. The first two were created manually, whereas the rest were translated to Arabic from the English versions using Google Translator.

Embeddings
Word embeddings are an approach for distributional semantics which represents words as vectors of real numbers. Such representation has useful clustering properties, since the words that are semantically and syntactically related are represented by similar vectors (Mikolov et al., 2013). For example, the words "coffee" and "tea" will be very close in the created space.
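This clustering property can be illustrated with cosine similarity between word vectors. The vectors below are hypothetical toy examples, not taken from any real model:

```python
# Cosine similarity between word vectors; semantically related words such
# as "coffee" and "tea" should score much higher than unrelated pairs.
# The 4-dimensional vectors here are made-up toy values.
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

coffee = np.array([0.9, 0.1, 0.8, 0.2])
tea    = np.array([0.8, 0.2, 0.9, 0.1])
car    = np.array([-0.5, 0.9, -0.3, 0.7])

assert cosine_similarity(coffee, tea) > cosine_similarity(coffee, car)
```

With real 300-dimensional embeddings the same comparison is typically done against the whole vocabulary to retrieve nearest neighbours.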
We used two publicly available pre-trained embedding models in the English version of our system. The first one was used in (Rouvier and Favre, 2016). It was trained using word2vec (skip-gram model) on an unannotated corpus of 20 million English tweets containing at least one emoticon. The second one was provided by (Baziotis et al., 2017). It was trained with the GloVe algorithm on a big dataset of 330M English Twitter messages gathered from 12/2012 to 07/2016, with a vocabulary of 660K words.
Additionally, we trained two embedding models on 60M English tweets (30M containing positive emoticons and 30M containing negative ones). The first one was trained by applying the word2vec skip-gram model with a window size of 5, filtering out the words that occur less than 4 times. The dimensionality of the vectors was set to 300. The second one was trained using fastText [CITE], also with 300-dimensional vectors.
Similarly, we used two publicly available pre-trained embedding models in the Arabic version of our system and trained two more. The first one is the model Arabic-SKIP-G300, provided by (Zahran et al., 2015). Arabic-SKIP-G300 was trained on a large corpus of Arabic text collected from different sources such as Arabic Wikipedia, the Arabic Gigaword Corpus, the KSUcorpus (King Saud University Corpus), the Microsoft crawled Arabic Corpus, etc. It contains 300-dimensional vectors for 6M words and phrases. The second one is Twitter-SG-AraVec (Soliman et al., 2017), which was trained using the word2vec skip-gram algorithm on 66M Arabic tweets and 1B tokens. The dimensionality of the vectors was set to 300.
Our own embedding models were trained on the distant supervision corpus (about 16M Arabic tweets) provided by the organizers, of which we were able to retrieve about 12M tweets. Similarly to the English case, we trained the two Arabic embedding models with word2vec skip-gram and fastText.

System Description
This section explains the proposed system, whose architecture is shown in Figure 1. First, we preprocess the tweets (Subsection 3.1). Afterwards, we pass them to the N-Channels ConvNet and the XGBoost regressors (Subsections 3.2 and 3.3). Finally, we ensemble the outputs of the two systems to get the final result, as described in Subsection 3.4. The proposed system is also used as a feature extractor to train an ordinal Decision Tree classifier, as described in Subsection 3.5.

Preprocessing
Some standard pre-processing methods were applied to the tweets:
• Normalization: Each English tweet was converted to lowercase. URLs and usernames were omitted. Non-Arabic letters were removed from each tweet in the Arabic-language sets. Words with repeated letters (i.e. elongated words) were corrected.
• Tokenization and POS tagging: All English-language tweets were tokenized and tagged using Ark Tweet NLP (Gimpel et al., 2011), while all Arabic-language tweets were tokenized and tagged using the Stanford Tagger (Green and Manning, 2010).
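The normalization step above might look like the following sketch; the authors' exact rules and regular expressions are not given, so these patterns are assumptions:

```python
# A minimal sketch of the English normalization step: lowercasing,
# dropping URLs and usernames, and correcting elongated words.
import re

def normalize_english(tweet: str) -> str:
    tweet = tweet.lower()                                # lowercase
    tweet = re.sub(r"https?://\S+|www\.\S+", "", tweet)  # drop URLs
    tweet = re.sub(r"@\w+", "", tweet)                   # drop usernames
    tweet = re.sub(r"(.)\1{2,}", r"\1\1", tweet)         # "soooo" -> "soo"
    return tweet.strip()

print(normalize_english("@user I LOVE this soooo much http://t.co/abc"))
# -> "i love this soo much"
```

Collapsing elongated letters to two occurrences (rather than one) preserves emphasis cues while still normalizing the vocabulary.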

N-Channels ConvNet
Convolutional Neural Networks (ConvNets) have achieved remarkable results in computer vision and speech recognition tasks in recent years. The next subsection explains the architecture of our proposed ConvNet.

Architecture
The N-Channels ConvNet model architecture, shown in the bottom box of Figure 1, is inspired by Inception-Net (Szegedy et al., 2016) and the CNN proposed by (Kim, 2014). It is composed of multiple channels followed by a logistic regressor. Figure 2 shows the channel architecture. The input to each channel is a sequence of words w_1, w_2, ..., w_n, where n is the number of words. We pass the input through an embedding layer to map each word w_i into a real-valued vector. Each channel has its own embedding layer, which is initialized with a specific pre-trained embedding model. We use five channels: the four pre-trained embedding models described in Subsection 2.2 and a character-based one. The result of the embedding layer is an n × d_c matrix, where d_c is the vector dimension. This matrix is passed to a projection, or pre-activation, layer. Afterwards, we feed the projected matrix to three Conv1D layers, each with a different kernel size (1, 2, and 3) and 200 filters; more details on this Conv1D architecture can be found in (Kim, 2014). We pass the output of each Conv1D layer through a global max-pooling layer, which produces a vector of dimensionality 200. The three vectors are then concatenated, yielding a vector of dimensionality 600 that represents the tweet (i.e. the input sequence of words). Finally, the outputs of all channels are concatenated with a lexicon-based vector (see next section) and fed to a single sigmoid neuron, which gives the intensity of the emotion/valence.
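A single channel can be sketched as follows, assuming a Keras-style implementation (the paper does not specify the framework) and toy values for the vocabulary and sequence length:

```python
# Sketch of one ConvNet channel: embedding -> projection -> three
# parallel Conv1D layers (kernel sizes 1, 2, 3; 200 filters each) ->
# global max-pooling -> concatenation into a 600-dim tweet vector.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

n, d_c, vocab = 50, 300, 10000  # sequence length, embedding dim, vocab (toy sizes)

inp = layers.Input(shape=(n,))
x = layers.Embedding(vocab, d_c)(inp)        # would be init from a pre-trained model
x = layers.Dense(d_c, activation="relu")(x)  # projection / pre-activation layer
pooled = [layers.GlobalMaxPooling1D()(layers.Conv1D(200, k, activation="relu")(x))
          for k in (1, 2, 3)]
channel_out = layers.Concatenate()(pooled)   # 600-dim representation

model = tf.keras.Model(inp, channel_out)
assert model.output_shape == (None, 600)
```

In the full system, five such channels run in parallel and their outputs are concatenated with the lexicon-based vector before the final sigmoid neuron.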

Training
The proposed model was trained by minimizing the mean squared error between the real and predicted intensities. The optimization was done by applying back-propagation through the layers via mini-batch gradient descent. The training parameters were the following: a batch size of 32, 100 epochs, and the Adam optimization method with a learning rate of 0.001, β1 = 0.9, β2 = 0.999 and ε = 10^-9. To prevent over-fitting, we used the dropout and early stopping methods.
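The training configuration above can be sketched with Keras (an assumption, since the paper does not name its framework); a trivial stand-in model is used so the snippet is self-contained:

```python
# MSE loss, Adam with the paper's hyper-parameters, and early stopping.
import numpy as np
import tensorflow as tf

inp = tf.keras.Input(shape=(10,))
out = tf.keras.layers.Dense(1, activation="sigmoid")(inp)  # stand-in model
model = tf.keras.Model(inp, out)

model.compile(loss="mean_squared_error",
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.001,
                                                 beta_1=0.9, beta_2=0.999,
                                                 epsilon=1e-9))
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                              restore_best_weights=True)

X, y = np.random.rand(64, 10), np.random.rand(64)
history = model.fit(X, y, batch_size=32, epochs=5,   # the paper used 100 epochs
                    validation_split=0.2, callbacks=[early_stop], verbose=0)
assert "val_loss" in history.history
```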

XGBoost Regressor
XGBoost (Chen and Guestrin, 2016) has become a widely used and very popular tool among data scientists in industry, as it shows great performance on large-scale problems. It is a highly flexible and versatile tool that can handle most regression, classification and ranking problems, as well as user-defined objective functions.
We trained an XGBoost regressor to give the intensity of the emotion/valence based on the two types of features explained in the next subsection.

Features
Each tweet is represented with a vector obtained by concatenating the following two feature vectors.
Lexicon Features: For each lexicon, we used the sum of the scores provided by the lexicon for each word in the tweet. Let L denote the set of lexicons and f_i^l(w) the score of the word w based on the feature i in the lexicon l (note that some lexicons have only one feature, like the sentiment score, whereas others have multiple features, like the anger emotion score, the positive score, etc.). Then, the vector of features that represents a given tweet T for a lexicon l ∈ L can be obtained as follows:

V_T^l = [ Σ_{w ∈ T} f_i^l(w) ]_{i ∈ F_l}

Here, F_l denotes the set of features in lexicon l.
Embedding Features: We used the sum pooling function to obtain the tweet representation in the embedding space. More formally, let us consider an embedding matrix E ∈ R^{d×|V|} and a tweet T = w_1, w_2, ..., w_n, where d is the dimension size, |V| is the size of the vocabulary (i.e. the number of words in the embedding model), w_i is the i-th word in the tweet and n is the number of words. First, each word w_i is substituted by the corresponding vector v_j in the matrix E, where j is the index of the word w_i in the vocabulary. This step yields the matrix W ∈ R^{d×n}. The vector V_{T,E} that represents the tweet T is computed by aggregating the matrix W, taking the summation over its columns. The sum pooling function is an element-wise function that converts texts of various lengths into a fixed-length vector, allowing it to capture information from the entire text.
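The two feature vectors can be sketched as follows; the toy lexicon and embedding table below are made-up stand-ins for the real resources of Section 2:

```python
# Lexicon features (per-feature score sums) and embedding features
# (sum pooling over word vectors), concatenated into one tweet vector.
import numpy as np

lexicon = {"happy": {"positive": 0.8, "joy": 0.9},
           "sad":   {"positive": -0.7, "joy": 0.0}}
features = ["positive", "joy"]                    # F_l for this toy lexicon
emb = {"happy": np.array([0.5, 0.1]),
       "sad":   np.array([-0.4, 0.3])}
d = 2                                             # toy embedding dimension

def lexicon_features(tweet):
    # one value per lexicon feature: sum of scores over the tweet's words
    return np.array([sum(lexicon.get(w, {}).get(f, 0.0) for w in tweet)
                     for f in features])

def embedding_features(tweet):
    # sum pooling over word vectors -> fixed-length vector
    return np.sum([emb.get(w, np.zeros(d)) for w in tweet], axis=0)

tweet = ["happy", "happy", "sad"]
vec = np.concatenate([lexicon_features(tweet), embedding_features(tweet)])
# lexicon part: [0.8 + 0.8 - 0.7, 0.9 + 0.9 + 0.0] = [0.9, 1.8]
```

Out-of-vocabulary words simply contribute zero to both parts of the vector.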

Training
The XGBoost regressor has some parameters that need to be tuned; Table 1 lists the values that were used.
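The tuning step can be sketched as a grid search over the regressor's parameters. scikit-learn's GradientBoostingRegressor is used as a stand-in here so the snippet only needs scikit-learn; in the actual system xgboost.XGBRegressor would take its place with the same fit/predict interface, and the grid values below are illustrative, not the paper's:

```python
# Grid search over boosting parameters on toy intensity data in [0, 1].
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.rand(100, 5)                       # stand-in feature vectors
y = X[:, 0] * 0.7 + rng.rand(100) * 0.1    # toy intensity targets

grid = {"n_estimators": [50, 100],
        "max_depth": [2, 3],
        "learning_rate": [0.05, 0.1]}
search = GridSearchCV(GradientBoostingRegressor(random_state=0), grid, cv=3)
search.fit(X, y)
preds = search.predict(X)
assert preds.shape == (100,)
```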

Ensemble
We combined the results of the two systems described above with the intention of improving the performance and increasing the generalizability of the final system. We used the weighted average method to achieve that. Let r_1 and r_2 respectively denote the outputs of the XGBoost regressor and the N-Channels ConvNet. The final output r was obtained as follows:

r = α r_1 + (1 − α) r_2

Table 2 shows the value of α for each individual model. All these values were obtained by grid search on the development set.
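The weighted average and the grid search over α can be sketched as follows, with toy arrays standing in for the two models' development-set predictions:

```python
# Weighted-average ensemble with alpha tuned by grid search on dev data.
import numpy as np

def ensemble(r1, r2, alpha):
    return alpha * r1 + (1 - alpha) * r2

r1 = np.array([0.2, 0.8, 0.5])     # XGBoost regressor outputs (toy)
r2 = np.array([0.3, 0.7, 0.6])     # N-Channels ConvNet outputs (toy)
gold = np.array([0.25, 0.75, 0.55])

# pick the alpha that minimizes dev-set MSE
alphas = np.linspace(0, 1, 11)
best = min(alphas, key=lambda a: np.mean((ensemble(r1, r2, a) - gold) ** 2))
```

In this toy setup the gold scores are the midpoints of the two outputs, so the search selects α = 0.5.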

Decision Tree for Ordinal Classification Tasks
To solve the ordinal classification tasks, we simply used the proposed model as a feature extractor and trained a Decision Tree. The idea is to use the predicted emotion/valence intensity as the input feature and let the Decision Tree learn the rules that map it to the ordinal classes.
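This thresholding behaviour can be sketched with scikit-learn; the intensities and ordinal labels below are toy values, whereas in the real system the intensity feature comes from the proposed regression model:

```python
# A Decision Tree on a single intensity feature learns threshold rules
# that map real-valued intensity scores to ordinal classes.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

intensity = np.array([[0.05], [0.15], [0.35], [0.45], [0.65], [0.75], [0.9]])
ordinal_class = np.array([0, 0, 1, 1, 2, 2, 3])   # four ordinal intensity classes

tree = DecisionTreeClassifier(max_depth=3).fit(intensity, ordinal_class)
print(tree.predict([[0.4], [0.8]]))   # classes chosen by the learned thresholds
```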

Results
We trained and validated our models on the training and validation sets provided by the organizers; more details about the data and the evaluation metrics can be found in the task description. Tables 3 and 4 show the results of our two systems and of their combination (the ensemble model) on the emotion and valence intensity regression tasks, along with the baseline results. The evaluation metrics are the Pearson correlation over all samples and over the subset of the test set that includes only the tweets with an intensity score greater than or equal to 0.5. The values in the tables show the superiority of the N-Channels ConvNet over the XGBoost regressor. For instance, the results of the English version of the emotion intensity task show that the N-Channels ConvNet outperforms the XGBoost regressor by 5.9% with respect to the macro-avg measure. The performance of the N-Channels ConvNet is very close to that of the ensemble model: the improvement is only 1.2%, and in the Arabic version it is very small (0.3%). The results of the Pearson correlation on the samples whose intensity score is greater than or equal to 0.5 show that our system can also be used as a classifier. This conclusion is confirmed by the results of the ordinal classification tasks, shown in Tables 5 and 6.
As described in Subsection 3.5, our approach to the ordinal classification tasks was to use the intensity score as the input feature to train a Decision Tree. During the inference phase we used our system to produce the intensity score of the new (unseen) samples (i.e. we used it as a feature extractor). Thus, the performance in this phase relies heavily on the performance of the proposed system. This is clearly shown in the results reported in Tables 5 and 6. For example, our system gives very good results in the valence intensity regression task for both the English and Arabic versions (the Pearson correlation is 0.828 for both), and this positively affects the performance of our system in the valence ordinal classification tasks (the Pearson correlation is about 0.80 for both).

Conclusion
We have presented an ensemble model of two different approaches. The first one, called N-Channels ConvNet, is a deep learning approach, whereas the second one is an XGBoost regressor based on a set of embedding- and lexicon-based features. The ensemble technique helped to improve the performance of the final model in all subtasks. We have observed that the N-Channels ConvNet alone gives a performance very close to that of the ensemble model. This observation is in line with the remarkable results that deep learning models, and especially ConvNets, have achieved in many fields such as computer vision, speech recognition and natural language processing. Distant supervision is a transfer learning approach that aims to train a model on a large amount of semi-labeled data and use it as a pre-trained model for training another model on a small amount of fully-labeled data. This approach has been shown to be very efficient, so the authors are considering the possibility of using this technique to improve the proposed system.