ELiRF-UPV at SemEval-2018 Tasks 1 and 3: Affect and Irony Detection in Tweets

This paper describes the participation of the ELiRF-UPV team in Tasks 1 and 3 of SemEval-2018. We present a deep learning based system that combines Convolutional Neural Networks and Long Short-Term Memory networks. This system has been used, with slight modifications, for the two tasks addressed, both for English and Spanish. Finally, the results obtained in the competition are reported and discussed.


Introduction
The study of figurative language and affective information expressed in texts is of great interest in sentiment analysis applications because they can change the polarity of a message. The objective of Tasks 1 and 3 of SemEval-2018 is the study of these phenomena on Twitter.
Task 1 (Mohammad et al., 2018) is related to Affect in Tweets. Systems have to automatically determine the intensity of emotions and the intensity of sentiment, or valence, of tweeters from their tweets. The task is divided into five subtasks: emotion intensity regression (EI-Reg), emotion intensity ordinal classification (EI-Oc), sentiment intensity regression (V-Reg), sentiment analysis ordinal classification (V-Oc), and emotion classification (E-C).
Task 3 (Van Hee et al., 2018) addresses the problem of irony detection in English tweets. It consists of two subtasks. The first subtask is a two-class (or binary) classification task where the system has to predict whether a tweet is ironic or not. The second subtask is a multiclass classification task where the system has to predict one out of four labels: i) verbal irony realized through a polarity contrast, ii) verbal irony without such a polarity contrast (i.e., other verbal irony), iii) descriptions of situational irony, and iv) non-irony.
This paper describes the main characteristics of the system developed by the ELiRF-UPV team for Tasks 1 and 3. We addressed all subtasks of Task 1, both for English and Spanish, and all subtasks of Task 3.

Data Preprocessing
In this work we have taken different aspects into account when preprocessing the tweets. First, we removed accents and converted all the text to lowercase. In general, emoticons, web links, hashtags, numbers, and user mentions were substituted by generic tokens; for instance, "#hashtag" → "hashtag", the slightly-smiling-face emoji → "Slightly Smiling Face", etc. After that, we used TweetMotif (Krieger and Ahn, 2010) as tweet tokenizer, adapting it to work with Spanish tweets.
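The normalization steps above can be sketched as follows; this is a minimal illustration, and the generic token names (url, user, number) as well as the regular expressions are our own assumptions, not the exact rules used by the system.

```python
import re
import unicodedata

def strip_accents(text):
    # Remove accents and convert to lowercase, as in the first step.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    return text.lower()

def preprocess(tweet):
    # Substitute links, mentions, and numbers by generic tokens,
    # and keep only the text of hashtags ("#hashtag" -> "hashtag").
    tweet = strip_accents(tweet)
    tweet = re.sub(r"https?://\S+", " url ", tweet)
    tweet = re.sub(r"@\w+", " user ", tweet)
    tweet = re.sub(r"#(\w+)", r" \1 ", tweet)
    tweet = re.sub(r"\d+", " number ", tweet)
    return re.sub(r"\s+", " ", tweet).strip()
```

Tokenization with TweetMotif would then be applied to the normalized string.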
On the other hand, for Spanish, we used the following polarity/emotion lexicons: ElHPolar (Saralegi and San Vicente, 2013), ISOL (Molina-González et al., 2013), and MLSenticon (Cruz et al., 2014). In addition, we pre-trained Word2Vec embeddings on 87 million Spanish tweets collected by our team by means of a Twitter crawler, using a skip-gram architecture with 300 dimensions per word, negative sampling with 5 negative samples, and a 5-term context on both the left and the right.
Through a tuning process on the development sets with a fixed system architecture, we selected the best lexicons for each task. For English, all the lexicons stated above were used for both tasks. For Spanish, only the ElHPolar and ISOL lexicons were used.

System Description
In this section, we briefly describe the general characteristics of the system developed for Task 1 and Task 3 at SemEval 2018. This description includes the input representation and the system architecture.

Input representation
Regarding the representation used, in those subtasks where the input is only a tweet (V-Reg, V-Oc, and E-C in Task 1, and both subtasks in Task 3), each tweet is represented as a matrix M ∈ R^{n×d}, where n is the maximum number of words per tweet and d is the embedding dimensionality. To include the information from the polarity lexicons, the embedding vector of each word x is concatenated with the vector of polarities/emotions for that word, v(x). In this way, the representation matrix of a tweet finally becomes M ∈ R^{n×(d+|v|)}.
For the EI-Reg subtask, where an emotion p is provided in addition to the tweet, we add the representation of the word p as the last row of the matrix M. Moreover, we concatenate one extra column to the word representations to indicate whether each word belongs to the tweet (0) or to the emotion (1).
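The construction of the matrix M with concatenated lexicon features can be sketched as below; the function and its fallback to zero vectors for out-of-vocabulary words are illustrative assumptions, not the actual implementation.

```python
import numpy as np

def tweet_matrix(words, embeddings, lexicon, n, d, v_dim):
    # Build M in R^{n x (d + |v|)}: one row per word, each row the
    # concatenation of the word embedding and its lexicon vector v(x).
    M = np.zeros((n, d + v_dim))
    for i, w in enumerate(words[:n]):
        emb = embeddings.get(w, np.zeros(d))  # assumption: zeros for OOV
        v = lexicon.get(w, np.zeros(v_dim))   # assumption: zeros if absent
        M[i] = np.concatenate([emb, v])
    return M
```

Shorter tweets are zero-padded up to n rows, so every tweet yields a matrix of the same shape.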

System architecture
We propose a general architecture for all subtasks. This architecture is based on a two-layer Convolutional Neural Network (CNN) (Fukushima, 1980) combined with a final Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997), as in (González et al., 2017). We use the representation of the tweet in terms of the matrix M defined above as input to the system. Finally, a fully connected layer computes the outputs of the system; the activation function of this layer depends on the subtask. Figure 1 shows the general architecture of the system, where d_0 is the dimensionality of the representation of each word (size of the embedding), f_i is the number of filters in convolutional layer i, s_i is the height of each filter in layer i, L is the dimensionality of the output state of the LSTM network, and C is the number of outputs for a specific task.
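A Keras sketch of this general architecture is given below. All concrete values (n, d_0, the number of filters, L, C) and the softmax output are illustrative assumptions; as described later, the actual parameters were tuned per subtask and the output activation depends on the subtask.

```python
from tensorflow.keras import layers, models

# Illustrative values, not the tuned per-subtask parameters.
n, d0 = 50, 310          # max words per tweet, word representation size
f1, f2, s = 128, 128, 3  # filters per conv layer, filter height s_i = 3
L, C = 256, 4            # LSTM state size, number of outputs

model = models.Sequential([
    layers.Input(shape=(n, d0)),
    layers.Conv1D(f1, s, padding="same"),
    layers.BatchNormalization(),        # Batch Normalization between conv layers
    layers.Activation("relu"),
    layers.Conv1D(f2, s, padding="same", activation="relu"),
    layers.LSTM(L),
    layers.Dropout(0.2),                # Dropout after the LSTM, p = 0.2
    layers.Dense(C, activation="softmax"),  # activation is subtask dependent
])
model.compile(optimizer="rmsprop", loss="mse")
```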
Although the architecture was the same for all subtasks, the parameters are subtask dependent and were experimentally set by means of a tuning phase on the development sets. The values studied for the parameters of the convolutional network are f_i ∈ [64, 256] and s_i = 3. The number of neurons in the last layer depends on the subtask. We also tested a simplified version of the architecture without the convolutional network, using only the LSTM network with L = 256.
Moreover, we use Batch Normalization (Ioffe and Szegedy, 2015) between all convolutional layers, Dropout (Srivastava et al., 2014) after the LSTM with p = 0.2, ReLU activation functions (Nair and Hinton, 2010), and RMSProp (Tieleman and Hinton) as the optimization algorithm. Regarding the loss function, we used Mean Squared Error (MSE) for the regression subtasks. However, for subtask E-C and both subtasks of Task 3, we used adaptations of the evaluation metrics (Jaccard index, F1 for binary classification, and macro-averaged F1) as loss functions. In future work we will define and study this kind of loss function in more detail. In addition, we also tested Categorical Cross Entropy (CCE) to extend the comparison.
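One possible way to adapt the Jaccard index into a differentiable loss for the multi-label E-C subtask is a "soft" version computed on predicted probabilities instead of hard labels. This is our own sketch of such an adaptation, not necessarily the exact formulation used by the system.

```python
import numpy as np

def soft_jaccard_loss(y_true, y_pred, eps=1e-7):
    # Soft intersection and union over the label dimension:
    # both reduce to the usual Jaccard index when y_pred is binary.
    intersection = np.sum(y_true * y_pred, axis=-1)
    union = np.sum(y_true + y_pred - y_true * y_pred, axis=-1)
    # 1 - Jaccard, averaged over the batch, so that lower is better.
    return 1.0 - np.mean(intersection / (union + eps))
```

Because the evaluation metric and the loss now agree, minimizing the loss directly optimizes the competition score.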
The strategy used in the ordinal classification subtasks of Task 1 (EI-Oc and V-Oc) consisted in discretizing the outputs of the equivalent regression subtasks (EI-Reg and V-Reg). The discretization process is as follows: let C be the set of classes of an ordinal classification subtask and v_x ∈ R the score assigned to sample x by a regression model. We compute |C| + 1 thresholds {th_0, ..., th_|C|}, with th_i ∈ R, th_i < th_{i+1}, th_0 = 0, and th_|C| = 1, by searching for the minimum output of each class on the regression training sets. Sample x is assigned to the class i such that th_i < v_x ≤ th_{i+1}.
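The discretization described above can be sketched as follows; the function names and the handling of the boundary case v_x = 0 are our own assumptions.

```python
def fit_thresholds(scores, labels, num_classes):
    # th_0 = 0, th_|C| = 1; in between, the minimum regression output
    # observed for each class on the training set.
    th = [0.0]
    for c in range(1, num_classes):
        th.append(min(s for s, y in zip(scores, labels) if y == c))
    th.append(1.0)
    return th  # |C| + 1 thresholds

def discretize(v, th):
    # Assign v to the class i such that th_i < v <= th_{i+1}.
    for i in range(len(th) - 1):
        if th[i] < v <= th[i + 1]:
            return i
    return 0  # assumption: v == 0 falls into the lowest class
```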

Experimental Results
We performed a tuning process on the development sets in order to select the best model for each task: we tested different ways of preprocessing the tweets, fit the parameters of the models, and evaluated some external lexicons. Next, we summarize the best results obtained in the tuning process, considering some combinations of the tested models and configurations. Table 3 shows the results of the tuning process for three of the subtasks of Task 1. For the two remaining subtasks (EI-Oc and V-Oc) we did not learn specific models; in these cases we used the best models obtained for EI-Reg and V-Reg, respectively.
As can be seen, the LSTM achieved the best results for subtask EI-Reg; the rest of the subtasks performed better when we combined CNN and LSTM models. In addition, using the evaluation metric as loss function improved the results (see the differences between CNN-LSTM + Lexicons (Jaccard) and CNN-LSTM + Lexicons (CCE)). Table 3 shows the results of the tuning process for the two subtasks of Task 3. We can observe the same behavior as in Task 1: the best results are obtained using a combination of CNN and LSTM models, and considering the evaluation metric as loss function improves the results. Once our best system for each subtask had been chosen on the development set, we tested it on the official test set and compared it with the best results obtained by other participants. These results are shown in Table 5 for Task 1 and in Table 5 for Task 3.

Conclusions and Future Work
We presented a deep learning based system that combines CNN and LSTM networks for Tasks 1 and 3 of SemEval-2018. This system was used, with slight modifications, for the two tasks addressed.
We want to highlight the improvements obtained when the evaluation measures were adapted as loss functions. In addition, we also incorporated information extracted from different lexical resources into the models.
As future work, we will continue to study different loss functions and the incorporation of new lexical resources, as well as carry out a detailed study of the results obtained.