AffecThor at SemEval-2018 Task 1: A cross-linguistic approach to sentiment intensity quantification in tweets

In this paper we describe our submission to SemEval-2018 Task 1: Affects in Tweets. The model which we present is an ensemble of various neural architectures and gradient boosted trees, and employs three different types of vectorial tweet representations. Furthermore, our system is language-independent and ranked first in 5 out of the 12 subtasks in which we participated, while achieving competitive results in the remaining ones. Comparatively remarkable performance is observed on both the Arabic and Spanish languages.


Introduction
The Affects in Tweets shared task (Mohammad et al., 2018) is the second iteration of a task which offers a new approach to Sentiment Analysis: one that concerns itself with emotion and sentiment intensity, rather than simple categorical classification. The shared task is divided into a set of subtasks, where the aim is to predict the intensity of a predetermined emotion (fear, anger, sadness, joy) or of the sentiment (valence) of a given set of tweets. Such predictions are either formulated as a regression problem where the output is a continuous-valued score in the interval (0, 1), or as ordinal classification into a given number of classes representing intensity. Additionally, each one of the subtasks targets a particular language: English, Arabic or Spanish.
In total, we participated in 12 different subtasks. Our system achieved the best performance on the test set out of all participants in 5 of those, ranked second in 3 others, and performed competitively in the rest. Moreover, our system can arguably be considered the best overall performing system for both Arabic and Spanish. It should be noted, however, that the shared task includes traditional emotion classification subtasks in which we did not participate.
The system described in this paper builds upon a survey of some of the best performing systems from previous related shared tasks (Mohammad and Bravo-Marquez, 2017; Rosenthal et al., 2017). In particular, we draw inspiration from the systems described in (John and Vechtomova, 2017), which makes use of gradient boosted trees for regression; (Goel et al., 2017), which employs an ensemble of various neural models; and (Baziotis et al., 2017), which features Long Short-Term Memory (LSTM) networks with an attention mechanism. Our work contributes to the aforementioned approaches by further developing a variety of neural architectures, using transfer learning via pretrained sentence encoders, testing methods of ensembling neural and non-neural models, and gauging the performance and stability of a regressor across languages.
The rest of this paper describes the pipeline of the system used for our submission, which is an ensemble of neural and non-neural models.

Data and features
The provided training and development data is comprised of tweets, an emotion or sentiment, and labels describing the intensity of the emotion or sentiment. We refer readers interested in an exhaustive description of the data to (Mohammad et al., 2018; Mohammad and Kiritchenko, 2018). In this work, we convert each tweet into a combination of three types of vector representations: character and word-level vectors for Arabic and Spanish; and character, word, and sentence-level vectors for English. This section describes the procedure that allows us to obtain these varied representations, which are later employed by our classification and regression models.

Preprocessing
The syntactic and orthographic form of tweets often differs substantially from text belonging to other domains (John and Vechtomova, 2017). As such, pre-processing procedures are as important as the architecture of any given model.
In pre-processing our data, we first replace all same-character sequences of length 3 or more with only 2 occurrences. We also replace all user mentions with a unique common token, and all control characters with whitespace. Emojis are surrounded with spaces, which ensures that no two emojis appear as consecutive characters. Finally, all text is lowercased. In the case of Spanish text, we further remove the characters ¿ and ¡, and replace accented characters with their unaccented versions, as well as ñ with n. In the case of Arabic text, we remove quotation marks as well.
Following the cleaning process, we tokenize the resulting text by applying the twokenize tool (Krieger and Ahn, 2010), as provided in the CMU Tweet NLP software (Owoputi et al., 2013), which is, by design, able to cope with the noise that appears in social media. Once the tokenization is completed, we filter all stopwords, using the stopword lists available from https://www.ranks.nl/stopwords.
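The cleaning steps described in this section can be illustrated with a minimal Python sketch. This is not our released code: the mention token `@user`, the emoji character ranges, the set of quotation marks, and the final whitespace collapsing are all assumptions made for illustration. Tokenization and stopword filtering are not shown.

```python
import re
import unicodedata

MENTION_TOKEN = "@user"  # assumed placeholder for the common mention token
QUOTES = "\"'“”«»"       # assumed set of quotation marks removed for Arabic
# Assumed approximation of the emoji ranges; real coverage is broader.
EMOJI = re.compile("([\U0001F300-\U0001FAFF\u2600-\u27BF])")


def clean_tweet(text, lang="en"):
    """Apply the cleaning steps in the order given in the text."""
    # 1. Squeeze runs of 3 or more identical characters down to 2.
    text = re.sub(r"(.)\1{2,}", r"\1\1", text, flags=re.DOTALL)
    # 2. Replace user mentions with one common token.
    text = re.sub(r"@\w+", MENTION_TOKEN, text)
    # 3. Replace control characters (category Cc) with spaces.
    text = "".join(" " if unicodedata.category(c) == "Cc" else c for c in text)
    # 4. Surround emojis with spaces so that no two are adjacent.
    text = EMOJI.sub(r" \1 ", text)
    # 5. Lowercase everything.
    text = text.lower()
    if lang == "es":
        text = text.replace("¿", "").replace("¡", "")
        # Decompose accented characters and drop the combining marks,
        # which also maps ñ to n.
        text = unicodedata.normalize("NFD", text)
        text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    elif lang == "ar":
        text = "".join(c for c in text if c not in QUOTES)
    # Assumption: collapse the extra spaces introduced above.
    return re.sub(r" +", " ", text).strip()
```

For instance, under these assumptions `clean_tweet("¡Qué niño más feliz!", lang="es")` yields `que nino mas feliz!`.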

Lexicons
Lexicons are one of the resources which we employ in order to compute features. In short, a lexicon is a collection of words that are associated with a value for an arbitrary number of affective categories. In our case, given a tweet, we produce several features per lexicon by aggregating the values of the matching words in each category: numerical values are added, and nominal values are counted. We provide an overview of the lexicons used per language below, with the number of features contributed by each individual lexicon in parentheses. In the case of English, the following lexicons and extracted values jointly produce a feature vector of dimension 43:
• MPQA lexicon (2): Number of positive and negative words (Wilson et al., 2005).
In the case of Arabic, we employ the same first 6 lexicons which we listed for English, but with the content words automatically translated (Salameh et al., 2015). However, we extract 4 scores from the MPQA lexicon (on the affective categories positive, negative, neutral and both), and a single combined score from the Bing Liu and Emoticons lexicons. Furthermore, we employ 3 lexicons generated by distant supervision techniques on Arabic tweets (Mohammad et al., 2016), listed below, in order to obtain a feature vector of dimension 26:
• Arabic Emoticon Lexicon (2): Number of positive and negative words.
Finally, the following lexicons are used in Spanish to produce a feature vector of dimension 14. In contrast to the Arabic language, the majority of the lexicons listed here are manually annotated or semi-automatically generated from Spanish data:
• Emoticons (1): Combination of positive and negative aggregated scores for emoticons (Nielsen, 2011).
• Sentwords (3): Aggregated score for each category in an automatically translated version of the lexicon described in (Beth Warriner et al., 2013).
Note that the lexicons are not directly used on tweet data, but rather that lexical features are extracted after applying the same data cleaning and tokenization process which we described for the training data to each one of the lexicons listed.
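The aggregation of lexicon values into features can be sketched as follows. This is an illustrative sketch only: the two helper functions and the toy lexicons are hypothetical, and the actual features were extracted with existing tooling rather than with this code.

```python
def count_nominal(tokens, lexicon, categories):
    """Count matching words per nominal category (e.g. positive/negative).

    `lexicon` maps a word to a single category label (hypothetical format).
    """
    counts = {c: 0 for c in categories}
    for token in tokens:
        label = lexicon.get(token)
        if label in counts:
            counts[label] += 1
    return [counts[c] for c in categories]


def sum_numeric(tokens, lexicon, categories):
    """Add up the numeric scores of matching words in each category.

    `lexicon` maps a word to a dict of category -> score (hypothetical format).
    """
    totals = {c: 0.0 for c in categories}
    for token in tokens:
        for category, score in lexicon.get(token, {}).items():
            if category in totals:
                totals[category] += score
    return [totals[c] for c in categories]


# Toy lexicons, invented for illustration only.
polarity = {"happy": "positive", "great": "positive", "sad": "negative"}
intensity = {"happy": {"joy": 0.9}, "sad": {"sadness": 0.8, "joy": 0.0}}

tokens = ["happy", "sad", "great", "day"]
features = (count_nominal(tokens, polarity, ["positive", "negative"])
            + sum_numeric(tokens, intensity, ["joy", "sadness"]))
```

Concatenating the per-lexicon outputs in a fixed order gives one lexical feature vector per tweet.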

Word embeddings
Word embeddings are another popular choice for feature extraction. We employ pre-trained word embeddings for English and train our own embeddings on separate Arabic and Spanish tweet data that we manually collected. All sets of embeddings comprise 400 dimensions and are detailed below for each language:
• English: Word2vec skip-gram embeddings, trained on the Edinburgh Twitter Corpus (Petrović et al., 2010).

Manually-crafted representations
In the Arabic and Spanish subtasks, some model components in our ensemble use a combination of the two types of representations described so far (lexical features and word embeddings) as an input feature vector. To obtain this, we average the embeddings corresponding to each word in a given tweet up to a maximum of 25 words, and append the computed lexical features to the result. These features are extracted using the filters provided in the Affective Tweets package (Mohammad and Bravo-Marquez, 2017) available for WEKA (Hall et al., 2009). In this paper, we will refer to this combined representation as AvgLexRep.
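The construction of AvgLexRep can be sketched in a few lines of Python. The sketch is hypothetical in its details: the function name, the handling of out-of-vocabulary words (skipped), and the zero vector for tweets without any known word are assumptions, and the actual features were produced with the WEKA filters mentioned above.

```python
def avg_lex_rep(tokens, embeddings, lexical_features, dim=400, max_words=25):
    """Average the word embeddings of a tweet and append lexical features.

    `embeddings` maps a word to a list of `dim` floats (assumed format).
    Only the first `max_words` tokens are considered; unknown words are
    skipped, which is an assumption of this sketch.
    """
    vectors = [embeddings[t] for t in tokens[:max_words] if t in embeddings]
    if vectors:
        averaged = [sum(column) / len(vectors) for column in zip(*vectors)]
    else:
        averaged = [0.0] * dim  # assumed fallback for tweets with no known word
    return averaged + list(lexical_features)
```

With 400-dimensional embeddings and, for example, the 14 Spanish lexical features, the resulting vector has 414 dimensions.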

Learned representations
Engineering a representation of the data (such as the one described in Section 2.4) that can support effective machine learning is a complex task, requiring human ingenuity and domain-specific knowledge. Representation learning techniques (Bengio et al., 2003) enable machine learning algorithms to automatically extract and organize discriminative features, thereby mapping raw data into forms that make it easier to extract useful information. Some model components in our ensemble employ this kind of representation, which we obtain using 2 different methods:
• Encoding a tweet using the BiLSTM-max pooling encoder of Conneau et al. (2017), which is pre-trained on the Stanford Natural Language Inference dataset (Bowman et al., 2015) and produces representations that perform well on a wide variety of NLP tasks. This approach in particular employs GloVe word embeddings (Pennington et al., 2014) as input and produces a vector containing 4096 dimensions, to which we will refer with the name InferRep. However, note that we only produce this feature vector for the English language subtasks, as the pre-training dataset is only available for English.
• Encoding a tweet using one or a combination of three neural architectures which use skip-gram word embeddings (Mikolov et al., 2013) as input and are trained on the shared task's training data for regression subtasks.

System architecture
While RegRep is produced as part of end-to-end trainable regression and classification models, AvgLexRep and InferRep are generated independently. Thus, AvgLexRep and InferRep are fed separately into these models after being generated. The pipeline of our ensemble is represented schematically in Figure 2.

Neural models
We implement three varieties of neural network architectures which are commonly used in text classification tasks, using Keras (Chollet et al., 2015) with a TensorFlow backend. In all of them, our objective function is Mean Squared Error (MSE), and dropout (Srivastava et al., 2014) is used for regularization at various levels. These architectures are listed below:
• Convolutional Neural Network (CNN) with max pooling.

Regression
For AvgLexRep and InferRep, which are not part of an end-to-end trainable model, we perform regression using either a feed-forward Deep Neural Network (DNN) or Gradient Boosted Trees (GBT).
The depth of the feed-forward network is determined constructively, starting with one layer and adding layers which are half the size of the previous one until performance on cross-validation stops improving.
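The constructive procedure for choosing the depth can be sketched as below. The function `grow_network` and its `evaluate` callback are hypothetical: `evaluate` stands for training a network with the given layer sizes and returning its cross-validation score (higher is better), which is not shown here.

```python
def grow_network(first_layer_size, evaluate, min_size=1):
    """Add layers of half the previous size while the score improves.

    `evaluate(sizes)` is a hypothetical callback returning the
    cross-validation score of a network with those layer sizes.
    """
    sizes = [first_layer_size]
    best_score = evaluate(sizes)
    while sizes[-1] // 2 >= min_size:
        candidate = sizes + [sizes[-1] // 2]
        score = evaluate(candidate)
        if score <= best_score:
            break  # performance stopped improving: keep the previous depth
        sizes, best_score = candidate, score
    return sizes, best_score
```

For example, if a toy `evaluate` returned 0.60, 0.65 and 0.64 for depths 1, 2 and 3, starting from 256 units would select the layer sizes `[256, 128]`.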

Model selection for regression
We perform model selection using 5-fold cross-validation on the training data from the shared task. In each subtask that involves regression, the possible models are ranked according to their individual performance and ensembled through simple averaging. The ensemble itself is built constructively based on the ordering defined by the ranking, starting from a single component and adding components in order whenever the average performance on cross-validation improves.
Ensembling has long been shown to be an effective method of variance reduction for complex models (Perrone, 1993), and we indeed find in our experiments that averaging predictions leads to results better than those of any individual model.
Furthermore, we also find predictions obtained via simple averaging to be more accurate (on cross-validation) than those obtained by feeding the outputs from all model components into a sigmoid layer. Although such a finding might appear counter-intuitive, it can perhaps be explained by the fact that the training dataset is relatively small, and therefore ensembling via a non-linear function of the outputs can potentially lead to overfitting.
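The constructive ensembling procedure described in this section can be sketched as follows. This is a simplified illustration rather than our implementation: it assumes that each model's cross-validation predictions are already available as a list aligned with the gold scores, and the names `pearson`, `average` and `build_ensemble` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average(predictions):
    """Simple (unweighted) average of several prediction lists."""
    return [sum(values) / len(values) for values in zip(*predictions)]


def build_ensemble(cv_predictions, gold):
    """Greedily add models, in ranked order, while the averaged score improves.

    `cv_predictions` maps a model name to its cross-validation predictions.
    """
    ranked = sorted(cv_predictions,
                    key=lambda name: pearson(cv_predictions[name], gold),
                    reverse=True)
    chosen = [ranked[0]]
    best = pearson(cv_predictions[ranked[0]], gold)
    for name in ranked[1:]:
        candidate = [cv_predictions[m] for m in chosen + [name]]
        score = pearson(average(candidate), gold)
        if score > best:
            chosen.append(name)
            best = score
    return chosen, best
```

Because the combination is a plain average, the ensemble has no parameters of its own to fit, which is consistent with the robustness on small training sets noted above.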

Ordinal classification
Our system for each ordinal classification subtask makes use of the ensemble model which we build for the corresponding regression subtask in the same language, and model selection is performed using the same procedure described in Section 3.3. However, instead of averaging the predictions, the best models' predictions are concatenated and fed as features to an ordinal meta-classifier (Antoniuk et al., 2013).

Hyperparameter tuning
Hyper-parameter optimization is carried out using 5-fold cross-validation. At first, a reasonable range is determined manually, and then grid search is performed within that range. For Gradient Boosted Trees, the hyper-parameters optimized are maximum tree depth, number of estimators, and maximum leaf nodes. For neural models, the hyper-parameters optimized are batch size, number of epochs, size of the layers or filters, and whether or not dropout is used at different levels. The dropout rate is set to 0.2 by default. Furthermore, we use a fixed random seed to enable replicability.
Table 1 displays the scores (both 5-fold cross-validation and test scores) of the individual models and the ensemble model for the Emotion Intensity English regression subtasks. The ensemble model in this case is always built from the best three models. Table 2 shows the results obtained using 5-fold cross-validation on the combined training and development data and the official test set results for each subtask. All scores are reported as the Pearson correlation coefficient between our system's predictions and the provided gold labels (i.e., human judgments).
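The grid search with 5-fold cross-validation used for hyper-parameter tuning can be sketched generically as below. The sketch is an assumption-laden illustration: `fit_and_score` is a hypothetical callback that trains a model with the given hyper-parameters on the training indices and returns its score on the held-out indices, and the interleaved fold assignment is chosen only for brevity.

```python
from itertools import product


def k_fold_indices(n_samples, k=5):
    """Yield (train, held_out) index lists for k interleaved folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


def grid_search(grid, n_samples, fit_and_score, k=5):
    """Return the hyper-parameters with the best mean k-fold score.

    `grid` maps a hyper-parameter name to the list of values to try.
    """
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [fit_and_score(params, train, held_out)
                  for train, held_out in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

For Gradient Boosted Trees, for instance, `grid` would hold the candidate values for maximum tree depth, number of estimators and maximum leaf nodes within the manually chosen range.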

Analysis
It can be observed in Table 2 that the test and cross-validation scores are similar, meaning that cross-validation provided an accurate estimate of the generalization error and that our system's overfitting of the different combined training and development sets is minimal. In fact, for the English valence subtasks, the Arabic Emotion Intensity regression subtask and all Spanish subtasks except the ones involving anger as the target emotion, the test scores are higher than or equal to the cross-validation scores. This indicates both that our system generalizes appropriately and that the test sets are not substantially different from the training sets. Overall performance is higher for English, likely due to the availability of better quality lexicons and word embeddings. Nonetheless, it is interesting to note that on average, cross-validation provided an optimistic estimate of the generalization error for English and a pessimistic one for Spanish and Arabic.
Furthermore, as shown in Table 1 for various English regression subtasks, the ensemble outperforms all individual models on both cross-validation and the test set. This points towards the success of our ensembling method in reducing the variance of individual models. We omit similar results for other subtasks because the trend displayed by those is comparable.
Finally, it is interesting to note that the models using InferRep (DNN and GBT), which rely on tweet representations produced through transfer learning from Natural Language Inference, outperformed the models using the task-specific RegRep (CNN, Bi-LSTM and CHAR-LSTM) for all emotions except Sadness.

Conclusion and future work
In this paper we have described AffecThor, the system which we submitted to the SemEval-2018 Affects in Tweets shared task. AffecThor uses three different types of learned and manually-crafted representations and is an ensemble of neural and non-neural models. It is the best performing system on 5 out of 12 subtasks, and the second best performing in 3 others. Furthermore, it is arguably the best overall performer for Spanish and Arabic.
Our work explored two methods of ensembling regressors: simple averaging, and using a non-linear (sigmoid) layer on top of the different sub-models as part of an end-to-end trainable neural model. We found that simple averaging is more robust. However, we believe that ensembling using a linear combination (weighted averaging) where the weights are learned could lead to better results, as is shown in (Perrone, 1993; Hashem and Schmeiser, 1993).
Finally, the availability of fine-grained labeled data across emotions and languages opens up the possibility of investigating multi-task and multilingual learning objectives. In the future, we would like to extend this work in that direction.

Figure 1: Graphical visualization of various feature vectors used in our ensemble model. These are, from left to right: character embedding, word embedding, InferRep and AvgLexRep representations.

Figure 2: Diagram of our system which describes how the different models used are ensembled. The outputs of each component in the ensemble are averaged into a single score.
The representations trained on the shared task's training data for regression subtasks correspond to the CNN, Bi-LSTM and CHAR-LSTM models described in Section 3. Representations produced by the CHAR-LSTM model are of dimension 612, and the ones obtained via the Bi-LSTM model are of dimension 512. Representations produced by the CNN model have different dimensionality depending on the number and size of filters used. We will collectively refer to such representations with the name RegRep.

Table 2: Pearson correlation using cross-validation (CV) on the training data and official results of the shared task (Test) obtained with our system, for each one of the Emotion Intensity (EI), Valence (V), regression (reg) and ordinal classification (oc) subtasks.