LyS at SemEval-2016 Task 4: Exploiting Neural Activation Values for Twitter Sentiment Classification and Quantification

In this paper we describe our deep learning approach for solving both two-, three- and five-class tweet polarity classification, and two- and five-class quantification. We first trained a convolutional neural network using pretrained Twitter word embeddings, so that we could extract the activation values from its hidden layers once some input had been fed to the network. These values were then used as features for a support vector machine in both the classification and quantification subtasks, together with additional linguistic information in the former scenario. The results obtained for the classification subtasks show that this approach performs better than a single convolutional network, and for the quantification part it also yields good results. Official rankings place us 2nd (practically tied with 1st) for the binary classification task, 2nd for binary quantification and 4th (practically tied with 3rd) for the five-class polarity classification challenge.


Introduction
Opinion mining has become an important mechanism to monitor what people are saying about a variety of topics (Cambria et al., 2013). As an example, Thelwall et al. (2011) use SentiStrength to monitor popular events on Twitter (e.g. the Oscars, the SuperBowl), showing how these resonate among the public. In a similar line, Vilares et al. (2015) also use Twitter to measure the level of popularity of the main Spanish political leaders, proving that the results obtained by their systems are similar to the ones obtained by traditional polls.
* This research is supported by the Ministerio de Economía y Competitividad (FFI2014-51978-C2). David Vilares is funded by the Ministerio de Educación, Cultura y Deporte (FPU13/01180). Yerai Doval is funded by the Ministerio de Economía y Competitividad (BES-2015-073768). Carlos Gómez-Rodríguez is funded by an Oportunius program grant (Xunta de Galicia).
Opinion mining on Twitter actually involves two different challenges: (1) analyzing the characteristics of individual tweets and (2) quantifying a given set of tweets so that useful statistics can be extracted. This paper describes the models we built to address these challenges, using SemEval 2016 Task 4 as the evaluation framework.

Sentiment Analysis in Twitter
The SemEval organization proposed two different types of challenges in its 2016 edition: (1) classification into two, three and five classes and (2) quantification into two and five categories. A detailed explanation of the task can be found in the description paper (Nakov et al., 2016). For all subtasks, three official splits are provided: training, development and development test sets. In this paper, we use the training and development sets for training, and the development test set for evaluation.

Convolutional Neural Network
As a starting point, we train a deep neural network (DNN), in particular a convolutional neural network (CNN), following a configuration similar to the one used by Severyn and Moschitti (2015). Figure 1 illustrates the topology of the CNN from which we extract the hidden activation values.

Embeddings layer
Let $w$ be a token of a vocabulary $V$. A word embedding is a distributed representation of that token as a low-dimensional vector $v \in \mathbb{R}^n$. In that way, it is possible to create a matrix of embeddings, $E \in \mathbb{R}^{|V| \times n}$, to act as the input layer to the CNN. In particular, we rely on a collection of Twitter word embeddings pretrained with GloVe (Pennington et al., 2014) with $|V| \approx 10^6$ and $n = 100$.
Thus, given a tweet $t = [w_1, w_2, \ldots, w_{|t|}]$, after running our input layer we obtain a matrix $T \in \mathbb{R}^{|t| \times n}$ that serves as the input to the convolutional layer. Since tweets have variable length, $|t|$ is set to 100, padding with zeros if the tweet is shorter and taking the first 100 words if it is longer. We realized after the evaluation that this value might not be the best option for short texts such as tweets, and we plan to optimize this parameter empirically. To avoid overfitting, we first apply dropout (Srivastava et al., 2014), which randomly sets to zero the activation values of $x$% of the neurons in a given layer (in this paper, $x = 50$).
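The padding and truncation step described above can be sketched as follows. This is only an illustrative sketch, not the authors' code: the function name `tweet_to_matrix`, the toy embedding dictionary, and the choice to map out-of-vocabulary tokens to a zero vector are all assumptions made for the example.

```python
from typing import Dict, List

MAX_LEN = 100  # |t| in the text
EMB_DIM = 100  # n in the text


def tweet_to_matrix(
    tokens: List[str],
    embeddings: Dict[str, List[float]],
    max_len: int = MAX_LEN,
    dim: int = EMB_DIM,
) -> List[List[float]]:
    """Build the |t| x n input matrix T for one tweet.

    Tweets longer than max_len are truncated to their first max_len tokens;
    shorter tweets are padded with zero rows. Unknown tokens get a zero
    vector (an assumption of this sketch, not stated in the paper).
    """
    zero = [0.0] * dim
    rows = [list(embeddings.get(tok, zero)) for tok in tokens[:max_len]]
    rows.extend([0.0] * dim for _ in range(max_len - len(rows)))
    return rows
```

With `max_len=4` and `dim=3`, a two-token tweet yields a 4 x 3 matrix whose last two rows are all zeros.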

Convolutional Layer
A convolutional layer exploits local correlations in the input data. In the case of text as input, this translates into extracting correlations between groups of word or character n-grams in a sentence. To do so, each hidden unit of the CNN will only respond (activate) to a specific continuous slice of the input text. This is implemented with Keras (http://keras.io) using convolutional operations with m convolutional filters of width f separately applied to the input, obtaining m representations of this input usually known as feature maps.
Formally, let $T \in \mathbb{R}^{|t| \times n}$ be the embedding matrix for the tweet $t$ and let $F \in \mathbb{R}^{f \times n}$ be a filter. The output of a wide convolution with $m$ filters is a matrix $C \in \mathbb{R}^{m \times (|t|+f-1)}$, where each row is a feature map $c \in \mathbb{R}^{|t|+f-1}$ whose $i$-th component is defined as:
$$c_i = \sum_{j,k} \left( T_{[i-f+1:i,:]} \otimes F \right)_{jk}$$
where $\otimes$ is the element-wise multiplication, $1 \le i \le |t|+f-1$, and $j$ and $k$ index the rows and columns of the matrix $T_{[i-f+1:i,:]} \otimes F \in \mathbb{R}^{f \times n}$. The non-valid rows of $T$ (those $T_{(i,:)}$ whose index $i$ falls outside $1, \ldots, |t|$) are set to zero. Following Severyn and Moschitti (2015), in this paper we chose $f = 5$ and $m = 300$. We also rely on $\mathrm{ReLU}(x) = \max(0, x)$ as the non-linear activation function. To avoid overfitting we incorporate an L2 regularization of 0.0001. After that, a max pooling layer selects $\max_i \mathrm{ReLU}(c_i)$ for each feature map.
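To make the wide convolution and the pooling step concrete, the following is a minimal sketch for a single filter. It is an illustration under stated assumptions rather than the Keras implementation used in the paper: the function names are hypothetical, indices are 0-based, and bias terms and L2 regularization are left out.

```python
from typing import List

Matrix = List[List[float]]


def wide_convolution(T: Matrix, F: Matrix) -> List[float]:
    """Feature map of length |t| + f - 1 for one filter F of shape f x n.

    Each output sums the element-wise product of F with a window of f
    consecutive rows of T; rows outside T are treated as zeros.
    """
    t_len, f, n = len(T), len(F), len(F[0])
    zero_row = [0.0] * n
    feature_map = []
    for i in range(t_len + f - 1):        # 0-based output position
        total = 0.0
        for j in range(f):                # row inside the window
            row_index = i - f + 1 + j     # row of T aligned with F[j]
            row = T[row_index] if 0 <= row_index < t_len else zero_row
            total += sum(x * w for x, w in zip(row, F[j]))
        feature_map.append(total)
    return feature_map


def relu_max_pool(feature_map: List[float]) -> float:
    """Apply ReLU to every position and keep the maximum value."""
    return max(max(0.0, c) for c in feature_map)
```

Running `wide_convolution` once per filter and pooling each feature map gives one value per filter, i.e. a vector of length m for the next layer.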

Output layer
The output of the pooling layer is then passed to a fully connected layer ($\mathbb{R}^{100}$). We again add dropout (50%) and a ReLU as the activation function. Finally, an additional fully connected layer reduces the dimensionality of the input to fit the output (number of classes), and as the final step we apply a softmax function to make the final prediction.
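The forward pass through the two fully connected layers and the softmax can be sketched as below. This is a simplified illustration, not the trained model: the weight layout (one weight vector per output unit), the function names, and the omission of dropout (which is only active during training) are assumptions of the example.

```python
import math
from typing import List

Vector = List[float]


def dense(x: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Fully connected layer; weights[k] holds the weights of output unit k."""
    return [sum(xi * wi for xi, wi in zip(x, w)) + b for w, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(pooled: Vector, w1, b1, w2, b2) -> Vector:
    """Pooled features -> hidden layer with ReLU -> class probabilities."""
    hidden = relu(dense(pooled, w1, b1))
    return softmax(dense(hidden, w2, b2))
```

The vector returned by `relu(dense(...))` inside `predict` corresponds to the last hidden layer, whose activation values are the kind of features reused later by the SVM.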

Current limitations
Obtaining an accurate deep neural network can be a very slow process. Hyper-parameter engineering is often needed, but training even a single DNN with a given set of hyper-parameters can be painfully slow without enough computational resources. Additionally, distant supervision is recommended to pretrain the network (Go et al., 2009; Severyn and Moschitti, 2015). These two issues act as limitations that we could not overcome at the moment. We did try pretraining, but so far we have not achieved improvements over the CNN without pretraining. A preliminary analysis suggests that: (1) we need more tweets to exploit distant supervision, (2) fine hyper-parameter engineering needs to be explored to ensure that the fine-tuning on the labeled data does not completely overwrite what the network has already learned, and (3) it is easy to collect tweets for analysis into 2 classes, but downloading non-noisy tweets for analysis into 3 and 5 classes is more challenging.
In the following section we show how to exploit the hidden activation values of our deep learning model as part of a supervised system (Poria et al., 2015), when pretraining and fast hyper-parameter engineering are not feasible options.

Classification
Let $S = \{s_1, \ldots, s_n\}$ be a set of tweets and let $L = \{l_1, \ldots, l_n\}$ be a set of labels. The classification subtask can be defined as designing a hypothesis function $h_\Theta : S \to L$, where $\Theta$ denotes a set of features representing the texts. We build functions to solve classification into five (strong positive (P+), positive (P), neutral (NEU), negative (N) and strong negative (N+)), three (P, NEU and N) and two (P and N) classes. We rely on a support vector machine (SVM), in particular on a LibLinear (Fan et al., 2008) implementation with L2-regularization, to train our supervised model. 3 As features, we started by testing some of those from our last SemEval system (Vilares et al., 2014), using the total occurrence as the weighting factor. Information gain (IG) is used in all cases: before training our classifier we run an information gain algorithm to remove all irrelevant features, i.e. those where IG=0:
• Words (W): Each single word is considered as a feature to feed the supervised classifier.
• Psychometric properties (P): Features extracted from psychological properties coming from LIWC (Pennebaker et al., 2001) that relate terms with psychometric properties (e.g. anger or anxiety) or topics (e.g. family or religion).
• Hidden activation values: We take the hidden activation values of the last hidden layer.
• Features extracted from sentiment dictionaries: We extract the total, maximum, minimum and last sentiment score of a tweet from the Sentiment140 (Mohammad et al., 2013), Hu and Liu (2004) and Taboada et al. (2011) subjective lexica.
3 We used Weka (Hall et al., 2009) to build the models.
Table 1 shows the experimental results for classification into two classes obtained using the SVM with different feature sets and the CNN. The neural network outperforms most of the SVM approaches. Only when we combine a number of linguistic features with the hidden activation values and weight the classes do we obtain an improvement over the CNN. We believe that by applying fine hyper-parameter tuning on the CNN we will be able to further improve these results. Similar conclusions can be drawn from the classification into three classes, whose results are shown in Table 2. Finally, Table 3 details the results for the five-category classification subtask. In this case, the neural network does not perform as well as in the previous scenarios.
With respect to SVM-specific parameter optimization, cost parameter (C) and class weights (w):
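As an illustration of the sentiment-dictionary features listed in this section (total, maximum, minimum and last sentiment score of a tweet), a minimal sketch could look as follows. The toy lexicon, the function name, the reading of "last" as the score of the last token found in the lexicon, and the all-zero fallback for tweets without lexicon words are assumptions made for the example, not details given in the paper.

```python
from typing import Dict, List


def lexicon_features(tokens: List[str], lexicon: Dict[str, float]) -> Dict[str, float]:
    """Total, maximum, minimum and last sentiment score of a tokenized tweet.

    Only tokens present in the lexicon contribute a score. If no token is
    found, every feature is set to 0.0 (an assumption of this sketch).
    """
    scores = [lexicon[tok] for tok in tokens if tok in lexicon]
    if not scores:
        return {"total": 0.0, "max": 0.0, "min": 0.0, "last": 0.0}
    return {
        "total": sum(scores),
        "max": max(scores),
        "min": min(scores),
        "last": scores[-1],
    }


# Hypothetical toy lexicon, for illustration only.
toy_lexicon = {"great": 2.0, "bad": -1.5, "ok": 0.5}
features = lexicon_features(["great", "movie", "bad", "ending", "ok"], toy_lexicon)
```

One such feature dictionary would be computed per lexicon and concatenated with the remaining feature sets before training the SVM.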

Quantification
For this task we are not interested in predicting the class of each individual instance of the dataset, as in classification tasks, but the relative frequency of each class in whole groups of instances; that is, the class distribution. In this context, models trained using loss functions well suited for classification are not necessarily good enough for quantification, since the loss function to optimize changes along with the aim of the task (Barranquero et al., 2015).
The simplest approach to tackle this problem would be Classify and Count (CC) (Forman, 2008), which estimates the class frequencies by counting the positive results of a classifier for each class over the total number of input instances. Nevertheless, more specialized methods exist, such as the use of an SVM learning algorithm paired with a nonlinear loss function such as the Kullback-Leibler Divergence (KLD) (Esuli and Sebastiani, 2015), which we have used in this work thanks to the tool SVM-perf (Joachims, 2005) patched to work with KLD. The different feature sets tested for our quantification system were automatically obtained as the activation values from different layers of the convolutional network used in the classification subtasks of this workshop. The SVM model was trained with a linear kernel and no regularization bias, optimizing the KLD over the entire training dataset.
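The Classify and Count baseline and the KLD measure mentioned above can be sketched in a few lines. This is a hedged illustration: the function names are hypothetical, and flooring the estimated prevalence at a small `eps` to avoid division by zero is an assumption of the sketch, not necessarily the smoothing used in the official evaluation.

```python
import math
from collections import Counter
from typing import Dict, List


def classify_and_count(predicted: List[str], classes: List[str]) -> Dict[str, float]:
    """Estimate class prevalences as the share of instances predicted per class."""
    counts = Counter(predicted)
    n = len(predicted)
    return {c: counts[c] / n for c in classes}


def kld(true_dist: Dict[str, float], est_dist: Dict[str, float], eps: float = 1e-9) -> float:
    """Kullback-Leibler divergence between true and estimated prevalences.

    Classes with true prevalence 0 contribute nothing; estimated prevalences
    are floored at eps (an assumption of this sketch).
    """
    total = 0.0
    for c, p in true_dist.items():
        if p > 0:
            total += p * math.log(p / max(est_dist.get(c, 0.0), eps))
    return total


estimated = classify_and_count(["P", "P", "N", "P"], ["P", "N"])
error = kld({"P": 0.5, "N": 0.5}, estimated)
```

A KLD of zero means the estimated distribution matches the true one exactly; larger values indicate a worse prevalence estimate.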
Finally, as our system deals specifically with binary quantification, we took a one-vs-all approach and trained multiple models to generalize the quantification process for n classes rather than just two classes. The results obtained later by these models were normalized so that relative frequencies sum up to one.
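The normalization of the one-vs-all outputs can be illustrated with the short sketch below. The function name and the uniform fallback used when all raw estimates are zero are assumptions made for the example.

```python
from typing import Dict


def normalize_prevalences(raw: Dict[str, float]) -> Dict[str, float]:
    """Rescale per-class one-vs-all estimates so that they sum to one.

    If every raw estimate is zero, fall back to a uniform distribution
    (an assumption of this sketch).
    """
    total = sum(raw.values())
    if total <= 0:
        return {c: 1.0 / len(raw) for c in raw}
    return {c: v / total for c, v in raw.items()}


# Each value would come from a separate binary (one-vs-all) quantifier.
normalized = normalize_prevalences({"P": 0.6, "NEU": 0.3, "N": 0.3})
```

Here the raw estimates add up to 1.2, so after rescaling the three classes receive approximately 0.5, 0.25 and 0.25.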

Experimental results
Results obtained using neural activation values chosen from particular layers of our convolutional network as features for the SVM can be found in Table 4. As our baseline, we performed a CC on the results obtained from the best classifiers from the classification subtasks. Table 5 shows the scores and rankings of our systems for each subtask, according to the official metrics used for each challenge. A detailed report of the results for all participants can be found in Nakov et al. (2016) and on the official website. 5

Conclusions
We have described our approach to tackle the classification and quantification challenges proposed at Task 4 of SemEval 2016: Sentiment Analysis in Twitter. Official rankings place us in top positions for binary classification and quantification and also for the 5-class polarity classification challenge.
5 http://alt.qcri.org/semeval2016/task4/index.php?id=results
We first trained a convolutional neural network to address the classification challenge. Additionally, we used its hidden activation values as features to train support vector machines, both for classification and quantification tasks. In light of the results obtained, we can state that our convolutional network seems to be a good feature extractor for both of these tasks.
As future work, we plan to exploit new distributed representations of the input to improve the performance of our current model. For the quantification task, we plan to extend our experiments to other machine learning architectures, such as quantification trees (Milli et al., 2013) and different types of neural networks, and to further explore the feature domain using both handcrafted features and other continuous representation methods such as doc2vec (Le and Mikolov, 2014).