UG18 at SemEval-2018 Task 1: Generating Additional Training Data for Predicting Emotion Intensity in Spanish

The present study describes our submission to SemEval 2018 Task 1: Affect in Tweets. Our Spanish-only approach aimed to demonstrate that it is beneficial to automatically generate additional training data by (i) translating training data from other languages and (ii) applying a semi-supervised learning method. We find strong support for both approaches, with those models outperforming our regular models in all subtasks. However, creating a stepwise ensemble of different models as opposed to simply averaging did not result in an increase in performance. We placed second (EI-Reg), second (EI-Oc), fourth (V-Reg) and fifth (V-Oc) in the four Spanish subtasks we participated in.


Introduction
Understanding the emotions expressed in a text or message is of high relevance nowadays. Companies are interested in this to get an understanding of the sentiment of their current customers regarding their products and the sentiment of their potential customers to attract new ones. Moreover, changes in a product or a company may also affect the sentiment of a customer. However, the intensity of an emotion is crucial in determining the urgency and importance of that sentiment. If someone is only slightly happy about a product, is a customer willing to buy it again? Conversely, if someone is very angry about customer service, his or her complaint might be given priority over somewhat milder complaints.  present four tasks 1 in which systems have to automatically determine the intensity of emotions (EI) or the intensity of the sentiment (Valence) of tweets in the languages English, Arabic, and Spanish. The goal is to either predict a continuous regression (reg) value or 1 We did not participate in subtask 5 (E-c).
to do ordinal classification (oc) based on a number of predefined categories. The EI tasks have separate training sets for four different emotions: anger, fear, joy and sadness. Due to the large number of subtasks and the fact that this language does not have many resources readily available, we only focus on the Spanish subtasks. Our work makes the following contributions: • We show that automatically translating English lexicons and English training data boosts performance; • We show that employing semi-supervised learning is beneficial; • We show that the stepwise creation of an ensemble model is not necessarily better method than simply averaging predictions.
Our submissions ranked second (EI-Reg), second (EI-Oc), fourth (V-Reg) and fifth (V-Oc), demonstrating that the proposed method is accurate in automatically determining the intensity of emotions and sentiment of Spanish tweets. This paper will first focus on the datasets, the data generation procedure, and the techniques and tools used. Then we present the results in detail, after which we perform a small error analysis on the largest mistakes our model made. We conclude with some possible ideas for future work.

Data
For each task, the training data that was made available by the organizers is used, which is a selection of tweets with for each tweet a label describing the intensity of the emotion or sentiment . Links and usernames were replaced by the general tokens URL and @username, after which the tweets Task NRC-HSL S-140 SenStr AFINN EMOTICONS Bing Liu MPQA NRC-10-exp NRC-HEAL NEGATION were tokenized by using TweetTokenizer. All text was lowercased. In a post-processing step, it was ensured that each emoji is tokenized as a single token.

Word Embeddings
To be able to train word embeddings, Spanish tweets were scraped between November 8, 2017 and January 12, 2018. We chose to create our own embeddings instead of using pre-trained embeddings, because this way the embeddings would resemble the provided data set: both are based on Twitter data. Added to this set was the Affect in Tweets Distant Supervision Corpus (DISC) made available by the organizers  and a set of 4.1 million tweets from 2015, obtained from Toral et al. (2015). After removing duplicate tweets and tweets with fewer than ten tokens, this resulted in a set of 58.7 million tweets, containing 1.1 billion tokens. The tweets were preprocessed using the method described in Section 2.1. The word embeddings were created using word2vec in the gensim library (Řehůřek and Sojka, 2010), using CBOW, a window size of 40 and a minimum count of 5. 2 The feature vectors for each tweet were then created by using the AffectiveTweets WEKA package (Mohammad and Bravo-Marquez, 2017).

Translating Lexicons
Most lexical resources for sentiment analysis are in English.
To still be able to benefit from these sources, the lexicons in the Affec-tiveTweets package were translated to Spanish, using the machine translation platform Apertium (Forcada et al., 2011).
All lexicons from the AffectiveTweets package were translated, except for SentiStrength. Instead 2 Embeddings available at www.let.rug.nl/rikvannoord/embeddings/spanish/ of translating this lexicon, the English version was replaced by the Spanish variant made available by Bravo-Marquez et al. (2013).
For each subtask, the optimal combination of lexicons was determined. This was done by first calculating the benefits of adding each lexicon individually, after which only beneficial lexicons were added until the score did not increase anymore (e.g. after adding the best four lexicons the fifth one did not help anymore, so only four were added). The tests were performed using a default SVM model, with the set of word embeddings described in the previous section. Each subtask thus uses a different set of lexicons (see Table 1 for an overview of the lexicons used in our final ensemble). For each subtask, this resulted in a (modest) increase on the development set, between 0.01 and 0.05.

Translating Data
The training set provided by  is not very large, so it was interesting to find a way to augment the training set. A possible method is to simply translate the datasets into other languages, leaving the labels intact. Since the present study focuses on Spanish tweets, all tweets from the English datasets were translated into Spanish. This new set of "Spanish" data was then added to our original training set. Again, the machine translation platform Apertium (Forcada et al., 2011) was used for the translation of the datasets.

Algorithms Used
Three types of models were used in our system, a feed-forward neural network, an LSTM network and an SVM regressor. The neural nets were inspired by the work of Prayas (Goel et al., 2017) in the previous shared task. Different regression algorithms (e.g. AdaBoost, XGBoost) were also tried due to the success of SeerNet (Duppada and Hiray, 2017), but our study was not able to reproduce their results for Spanish.
For both the LSTM network and the feedforward network, a parameter search was done for the number of layers, the number of nodes and dropout used. This was done for each subtask, i.e. different tasks can have a different number of layers. All models were implemented using Keras (Chollet et al., 2015). After the best parameter settings were found, the results of 10 system runs to produce our predictions were averaged (note that this is different from averaging our different type of models in Section 2.7). For the SVM (implemented in scikit-learn (Pedregosa et al., 2011)), the RBF kernel was used and a parameter search was conducted for epsilon. Detailed parameter settings for each subtask are shown in Table 2. Each parameter search was performed using 10fold cross validation, as to not overfit on the development set.

Semi-supervised Learning
One of the aims of this study was to see if using semi-supervised learning is beneficial for emotion intensity tasks. For this purpose, the DISC  corpus was used. This corpus was created by querying certain emotionrelated words, which makes it very suitable as a semi-supervised corpus. However, the specific emotion the tweet belonged to was not made public. Therefore, a method was applied to automatically assign the tweets to an emotion by comparing our scraped tweets to this new data set.
First, in an attempt to obtain the query-terms, we selected the 100 words which occurred most frequently in the DISC corpus, in comparison with their frequencies in our own scraped tweets corpus. Words that were clearly not indicators of emotion were removed. The rest was annotated per emotion or removed if it was unclear to which emotion the word belonged. This allowed us to create silver datasets per emotion, assigning tweets to an emotion if an annotated emotion-word occurred in the tweet.
Our semi-supervised approach is quite straightforward: first a model is trained on the training set and then this model is used to predict the labels of the silver data. This silver data is then simply added to our training set, after which the model is retrained. However, an extra step is applied to en-sure that the silver data is of reasonable quality. Instead of training a single model initially, ten different models were trained which predict the labels of the silver instances. If the highest and lowest prediction do not differ more than a certain threshold the silver instance is maintained, otherwise it is discarded.
This results in two parameters that could be optimized: the threshold and the number of silver instances that would be added. This method can be applied to both the LSTM and feed-forward networks that were used. An overview of the characteristics of our data set with the final parameter settings is shown in Table 3. Usually, only a small subset of data was added to our training set, meaning that most of the silver data is not used in the experiments. Note that since only the emotions were annotated, this method is only applicable to the EI tasks. 3

Ensembling
To boost performance, the SVM, LSTM, and feedforward models were combined into an ensemble. For both the LSTM and feed-forward approach, three different models were trained. The first model was trained on the training data (regular), the second model was trained on both the training and translated training data (translated) and the third one was trained on both the training data and the semi-supervised data (silver). Due to the nature of the SVM algorithm, semi-supervised learning does not help, so only the regular and translated model were trained in this case. This results in 8 different models per subtask. Note that for the valence tasks no silver training data was obtained, meaning that for those tasks the semisupervised models could not be used.
Per task, the LSTM and feed-forward model's predictions were averaged over 10 prediction runs. Subsequently, the predictions of all individual models were combined into an average. Finally, models were removed from the ensemble in a stepwise manner if the removal increased the average score. This was done based on their original scores, i.e. starting out by trying to remove the worst individual model and working our way up to the best model. We only consider it an increase in score if the difference is larger than 0.002 (i.e. the difference between 0.716 and 0.718). If at some point the score does not increase and we   are therefore unable to remove a model, the process is stopped and our best ensemble of models has been found. This process uses the scores on the development set of different combinations of models. Note that this means that the ensembles for different subtasks can contain different sets of models. The final model selections can be found in Table 4. Table 5 shows the results on the development set of all individuals models, distinguishing the three types of training: regular (r), translated (t) and semi-supervised (s). In Tables 4 and 5, the letter behind each model (e.g. SVM-r, LSTM-r) corresponds to the type of training used. Comparing the regular and translated columns for the three algorithms, it shows that in 22 out of 30 cases, using translated instances as extra training data resulted in an improvement. For the semisupervised learning approach, an improvement is found in 15 out of 16 cases. Moreover, our best individual model for each subtask (bolded scores in Table 5) is always either a translated or semisupervised model. Table 5 also shows that, in general, our feed-forward network obtained the best results, having the highest F-score for 8 out of 10 subtasks.

Results and Discussion
However, Table 6 shows that these scores can still be improved by averaging or ensembling the individual models. On the dev set, averaging our 8 individual models results in a better score for 8 out of 10 subtasks, while creating an ensemble beats all of the individual models as well as the average for each subtask. On the test set, however, only a small increase in score (if any) is found for stepwise ensembling, compared to averaging. Even though the results do not get worse, we can-Task SVM-r SVM-t LSTM-r LSTM-t LSTM-s FF-r FF-t FF-s   not conclude that stepwise ensembling is a better method than simply averaging.
Our official scores (column Ens Test in Table 6) have placed us second (EI-Reg, EI-Oc), fourth (V-Reg) and fifth (V-Oc) on the SemEval AIT-2018 leaderboard. However, it is evident that the results obtained on the test set are not always in line with those achieved on the development set. Especially on the anger subtask for both EI-Reg and EI-Oc, the scores are considerably lower on the test set in comparison with the results on the development set. Therefore, a small error analysis was performed on the instances where our final model made the largest errors.

Error Analysis
Due to some large differences between our results on the dev and test set of this task, we performed a small error analysis in order to see what caused these differences. For EI-Reg-anger, the gold labels were compared to our own predictions, and we manually checked 50 instances for which our system made the largest errors.
Some examples that were indicative of the shortcomings of our system are shown in Table 7. First of all, our system did not take into account capitalization. The implications of this are shown in the first sentence, where capitalization intensi-  fies the emotion used in the sentence. In the second sentence, the name Imperator Furiosa is not understood. Since our texts were lowercased, our system was unable to capture the named entity and thought the sentence was about an angry emperor instead. In the third sentence, our system fails to capture that when you are so angry that it makes you laugh, it results in a reduced intensity of the angriness. Finally, in the fourth sentence, it is the figurative language me infla la vena (it inflates my vein) that the system is not able to understand.
The first two error-categories might be solved by including smart features regarding capitalization and named entity recognition. However, the last two categories are problems of natural language understanding and will be very difficult to fix.

Conclusion
To conclude, the present study described our submission for the Semeval 2018 Shared Task on Affect in Tweets. We participated in four Spanish subtasks and our submissions ranked second, second, fourth and fifth place. Our study aimed to investigate whether the automatic generation of additional training data through translation and semi-supervised learning, as well as the creation of stepwise ensembles, increase the performance of our Spanish-language models. Strong support was found for the translation and semi-supervised learning approaches; our best models for all subtasks use either one of these approaches. These results suggest that both of these additional data resources are beneficial when determining emotion intensity (for Spanish). However, the creation of a stepwise ensemble from the best models did not result in better performance compared to simply averaging the models. In addition, some signs of overfitting on the dev set were found. In fu-ture work, we would like to apply the methods (translation and semi-supervised learning) used on Spanish on other low-resource languages and potentially also on other tasks.