Exploring Convolutional Neural Networks for Sentiment Analysis of Spanish tweets

Spanish is the third-most used language on the internet, after English and Chinese, with a total of 7.7% (more than 277 million of users) and a huge internet growth of more than 1,400%. However, most work on sentiment analysis has been focused on English. This paper describes a deep learning system for Spanish sentiment analysis. To the best of our knowledge, this is the first work that explores the use of a convolutional neural network to polarity classification of Spanish tweets.


Introduction
Knowing the opinion of customers or users has become a priority for companies and organizations in order to improve the quality of their services and products. With the ongoing explosion of social media, it affords a significant opportunity to poll the opinion of many internet users by processing their comments. However, it should be noted that sentiment analysis, which can be defined as the automatic analysis of opinion in texts (Pang and Lee, 2008), is a challenging task because even different people often assign different polarities to a given text. Moreover, sentiment analysis can involve several Natural Language Processing (NLP) tasks such as negation or subjectivity detection, which have not been fully resolved by the NLP research community to date. On Twitter, the task is even more difficult, because the texts are small (only 140 characters) and are characterized by a informal style language utilized by users, many grammatical errors and spelling mistakes, slang and vulgar vocabulary, and plenty of abbreviations.
The shortage of training and testing data is one of the main bottlenecks for most NLP tasks in general, and for sentiment analysis in particular. This drawback has been partially overcome thanks to the organization of shared tasks such as Sentiment Analysis in Twitter Task at SemEval 2013-2015 (Nakov et al., 2013;Rosenthal et al., 2014;Rosenthal et al., 2015;Nakov et al., 2016), which provided annotated corpora of tweets for sentiment analysis. Most research efforts in this task have focused on English texts, much less attention has been given to other languages (Abdul-Mageed et al., 2011;Kapukaranov and Nakov, 2015). However, Spanish is the third language most used on the internet, with a total of 7.7% (more than 277 million of users) and a huge internet growth of more than 1,400%.
Since its introduction in 2013, the workshop on Sentiment Analysis at SEPLN (Villena-Román et al., 2013;Villena-Román et al., 2015b;Villena-Román et al., 2015a;García Cumbreras et al., 2016) has had as main goal to promote the development of methods and resources for sentiment analysis of tweets written in Spanish. In this task, the participating systems have to determine the global polarity of each tweet in the test dataset. A detailed description of the task can be found in the overview paper of TASS 2016 (García Cumbreras et al., 2016).
As said above, sentiment analysis of tweets is a very challenging task because of their small size, and thereby, they usually contain very scarce contextual information. This shortage of contextual information can be supplied by exploiting knowledge from large collections of unlabelled texts. Our approach uses a convolutional neural network (CNN), which uses word embeddings as its only input. Word embeddings can be very useful for the sentiment analysis task because they are able to represent syntactic and semantic information of words (Collobert et al., 2011;Socher et al., 2013b). We do not only experiment with randomly initialized word vectors, but also explore the use of pre-trained word embeddings. We also perform a detailed exploration of the hyper-parameters of the CNN and their effect on the results.
The paper is organized as follows. In Section 2, we discuss some related work. Section 3 describes our approach. The experimental results are presented and discussed in Section 4. We conclude in Section 5 with a summary of our findings and some directions for future work.

Related Work
For the past two decades, a remarkable amount of NLP research has been dedicated to the sentiment analysis task. Most of early works were focused on customer reviews of products and services, while more recently researches usually carry out the task on data from social media such as tweets and user comments. Polarity classification can be performed at three different levels: document, sentence and entity level, being the first one the one most addressed until now. However, it is increasingly demanded to know the opinion of users on specific topics (entities).
Both unsupervised and supervised approaches have been used to tackle this problem, using standard features such as unigram/bigrams, word counts or binary presence features, word position, POS tags and sentiment features from polarity lexicons such as SentiWordNet (Baccianella et al., 2010), AFFIN (Hansen et al., 2011) or iSOL (Molina-González et al., 2013). Additionally, attention has recently been directed to more complex linguistic processes such as negation and speculation detection (Pang and Lee, 2008;Cruz et al., 2015).
Only a few works have explored the use of neural networks for sentiment analysis. Socher and colleagues (2011) proposed a recursive model that is able to capture the recursive nature of sentences and learn auto-encoders for multi-word phrases. Later, they proposed a matrix-vector recursive neural network model to learn compositional vector representations for phrases and sentences of any length (Socher et al., 2012). A feed-forward neural network was designed by dos Santos and Gatti (2014) to learn relevant features from characters, words and sentences for the sentiment analysis task.
In the four editions of the TASS workshop, most systems have been based on the use of popular supervised machine learning classifiers (most of them used SVM) and very extensive feature sets, which included lexical and morphosyntactic features (such as tokens, lemmas, n-grams, PoS tags) and sentiment features from the polarity lexicons (such as ElhPolar (Saralegi and San Vicente, 2013), iSOL (Molina-González et al., 2013) or AFFIN (Hansen et al., 2011)) to represent the information of each tweet. Only one of the systems (Vilares et al., 2015) proposed an approach based on deep learning, in particular, a neural network Long Short-Term Memory (LSTM) with a logistic function at the output layer. The evaluation of the task showed that this deep learning approach did not overcome the classical classifiers such as SVM. In TASS, there are two different evaluations: one based on 6 different polarity labels (P+, P, NEU, N, N+, NONE) and another based on just 4 labels (P, N, NEU, NONE). The state-of-art result for the task with 4 polarity levels is around 0.70 of accuracy. As expected, the best accuracy is lower (around 0.67) for the task with 6 polarity levels. A more in-depth analysis of the results and the different participating systems can be found in (Villena-Román et al., 2013;Villena-Román et al., 2015b;Villena-Román et al., 2015a;García Cumbreras et al., 2016).
To the best of our knowledge, convolutional networks have not been applied to the sentiment analysis of Spanish tweets yet. Several works (dos Santos and Gatti, 2014; Severyn and Moschitti, 2015) have shown that they can be a valuable approach for English tweets, and thereby, the same could be expected also for Spanish. One of the main advantages of this architecture is that it does not require syntactic information from sentences. It should be noted that many tweets are grammatically incorrect, and thereby, those methods not based on syntactic information, could give better results.

The General Corpus
The General corpus was created for the TASS competition. It consists of 68,000 Spanish tweets, which were collected from November 2011 to March 2012, and covers a variety of topics such as economy, communication, politics, mass media and culture. The corpus was divided into training and test sets with a 10%-90% ratio. As said above, each tweet in the training set is classified with its polarity, which can take some of the following values: strong positive (P+), positive (P), neutral (NEU), negative (N), strong negative (N+) and one additional for tweets without polarity (NONE). The annotation process of the training set was semi-automatic using a baseline machine learning model whose annotations were manually reviewed later by human experts. Although the test set has not been released with their gold annotations, the evaluation platform of the TASS competition is still open 1 for registered users. This platform allows participants to submit new runs and then obtain their scores.

A baseline approach
To start with, we developed a baseline based on the most common approach for polarity classification at document level: the use of very popular supervised machine learning algorithms such as SVM and logistic regression. Our goal is to compare this baseline with the CNN model.
Instead of using bag-of-words (BoW) to represent tweets, we exploited word embedding representation. A word embedding is a function to map words to low dimensional vectors, which are learned from a large collection of texts. At present, Neural Network is one of the most used learning techniques for generating word embeddings (Mikolov et al., 2013). The essential assumption of this model is that semantically close words will have similar vectors (in terms of cosine similarity). Word embeddings can help to capture semantic and syntactic relationships of the corresponding words. While the well-known BoW model involves a very large number of features (as many as the number of non-stopwords words with at least a minimum number of occurrences in the training data), the word embedding representation allows a significant reduction in the feature set size (in our case, from millions to just 300). The dimensionality reduction is a desirable goal, because it helps in avoiding over-fitting and leads to a reduction of the training and classification times, without any performance loss. Moreover, word embeddings have shown promising results in NLP tasks, such as named entity recognition (Segura-Bedmar et al., 2015), relation extraction (Alam et al., 2016), sentiment analysis (Socher et al., 2013b) or parsing (Socher et al., 2013a).
As a preprocessing step, tweets must be cleaned. First, all links, urls and usernames (these last ones can be easily recognized because their first character is always the symbol @) were removed. Then, the hashtags were transformed to words by removing its first character (that is, the symbol #). Taking advantage of regular expressions, the emoticons were detected and classified in order to count the number of positive and negative emoticons in each tweet and then were removed from the text. Table 1 shows the list of positive and negative emoticons, which have been taken from Wikipedia 2 . The tweets were converted to lower-case. Moreover, the misspelled accented letters were replaced by their correct ones (for instanceá with a). We also treated elongations (that is, the repetition of a character) by removing the repetition of a character after its second occurrence (for example, hoooolaaaa would be translated to hola). We also took into account laughs (for instance jajaja) which turned out to be challenging because of the diverse ways they are expressed (i.e. expressions like "ja", "jaja", jajajaja, "jiji" or jejeje and even misspelled ones like jajjajaaj). We addressed this using regular expressions to standardize the different forms (i.e. jajjjaaj to jajaja) and then replaced them with the Spanish translation of laugh: "risa". Finally we removed all non-letters characters and all stopwords present in tweets 3 .

Orientation Emoticons
Positive  Once the tweets were preprocessed, they were tokenized using the NLTK toolkit (a Python pack-age for NLP). To represent the tweets, we used Cardellino's pre-trained model (Cardellino, 2016). This model is available for research community and was built from several Spanish collection texts such as Spanish Wikipedia (2015), the OPUS corpora (Tiedemann and Nygaard, 2004) or the Ancora corpus (Taulé et al., 2008), among others. It contains nearly 1.5 billion words (Cardellino, 2016). The dimension of its vectors is 300. Then, for each token, we searched its vector in the word embedding model. It should be noted that this model was trained on a collection of texts from different resources such as Spanish Wikipedia, WikiSource and Wikibooks, and none of them contains tweets. Therefore, it is possible that the main characteristics of the social media texts (such as informal style language, grammatical errors and spelling mistakes, slang and vulgar vocabulary, abbreviations, etc) are not correctly represented in this model. Indeed, we found that there was a significant number of words from the tweets (almost a 6%) that were not found in this word embedding model. We performed a review of a small sample of these words, showing that most of them were mainly hashtags.
In our baseline approach, a tweet of n tokens (T = w 1 , w 2 , ..., w n ) is represented as the centroid of the word vectors w i of its tokens, as shown in the following equation: where N is the vocabulary size, that is, the total number of distinct words, while T F (w j , t) refers to the number of occurrences of the j-th vocabulary word in the tweet T.
In addition to using the centroid, we completed the feature set with the following additional features: • posWords: number of positive words present in the tweet.
• negWords: number of negative words present in the tweet.
• posEmo: number of positive emoticons present in the tweet.
• negEmo: number of negative emoticons present in the tweet.
For the posWords and negWords features we used the iSOL lexicon (Molina-González et al., 2013), a list composed by 2,509 positive words and 5,626 negative words. As described before, for the emoticons we used the listed in Table 1, but also added to the positive ones the number of laughs detected; and also, we included the number of recommendations present in the form of a "Follow Friday" hashtag (#FF), due to its ease of detection and its positive bias.
We also applied a set of emoticon's rules as a pre-classification stage, similar to Chikersal et al. (2015), in which we determined a first stage polarity for each tweet as follows: • If posEmo is greater than zero and negEmo is equal to zero, the tweet is marked as "P".
• If negEmo is greater than zero and posEmo is equal to zero, the tweet is marked as "N".
• If both posEmo and negEmo are greater than zero, the tweet is marked as "NEU".
• If both posEmo and negEmo are equal to zero, the tweet is marked as "NONE".
Then, the classification of tweets was performed using scikit-learn, a Python module for machine learning. This package provides many algorithms such as Random Forest, Support Vector Machine (SVM) and so on. One of its main advantages is that it is supported by extensive documentation. Moreover, it is robust, fast and easy to use. Initially, we performed experiments using three different classifiers: Random Forests, Support Vector Machines and Logistic Regression because these classifiers often achieved the best results for text classification and sentiment analysis (García Cumbreras et al., 2016).
After the classification, we made three tests: i) applying no rule, ii) honoring the polarity defined by the rule, which means, we keep the predefined polarity if the tweet was marked as "P" or "N", otherwise we take the value estimated by the classifier, and iii) a mixed approach where we give each polarity a value (N+: -2; N: -1; NEU,NONE: 0; P: 1; P+: 2) and performed an arithmetic sum of both the predefined and estimated polarity if and only if they are not equal; with that for instance, if the classifier marked a tweet as "N:-1" and the rules marked it as "P:1" the tweet will be classified as "NEU:0". In order to choose the best-performing classifiers, we used 10-fold cross-validation because there was no development dataset and this strategy has become the standard method in practical terms. Our experiments showed that, although the results were similar, the best settings were: • SVM+MIX: Support Vector Machine and applying the mixed rules approach.
• LR+MIX: Logistic Regression and applying the mixed rules approach.

CNN Model
In this study, the CNN model proposed for sentiment analysis is based on the model for sentence classification described in Kim (2014). The model has been implemented using Google's Tensorflow toolkit. Based on the fact that single level architectures seem to provide similar performance than larger networks (Kim, 2014), we decided to design our network with only a single convolutional layer, followed by a max-pooling layer and a softmax classifier as final layer. In this way, we were able to reduce the large amount of time needed to train a CNN on a large corpus as our training set (with almost 7,000 tweets). Each tweet was represented by a matrix that concatenates the word embeddings of its words. This matrix is the input of the network. In our experiments, we learned our word embeddings from scratch, but also we tried with two different pre-trained word2vec models: Cardellino's model, which was described above, and a pretrained model from tweets. To train this second model of word embeddings, we used the corpus provided by the TASS organizers (68,000 tweets) as well as a very extensive collection of 8,774,487 tweets, which we collected during 2014. To do this, we used the word2vec tool, which implements the continuous bag-of-words and skip-gram architectures for computing vector representations of words (Mikolov et al., 2013). We used the continuous bag of words model with a context window of size 8. The size of the pre-trained model from tweets is 347,970 words. We randomly initialized those words that were not in the pre-trained model.
In the next layer, convolutions were performed over the word embeddings using multiple filter sizes. A filter is a sliding window of a given number of words. That is, different size windows are treated at a time. Then, a max pooling layer extracted the most important feature (in our case, the maximum value) to reduce the computational complexity. In order to avoid over-fitting, dropout regularization was also used (Srivastava et al., 2014). This process randomly drops some units and their connections from the network during training. The final layer is a softmax prediction.
We used 10-fold cross-validation for parameter tuning. Many different combinations of hyperparameters of the neural network can be defined. Summarizing, we experimented with several variants of the model: • CNN-rand: all words are randomly initialized and then learned during training.
• CNN-wiki: a model initialized with the word vectors from Cardellino's pre-trained model. The word embeddings as well as the other parameters are fine-tuned for training.
• CNN-tweets: the network is initialized with the pre-trained model from tweets. As the previous model, both word embeddings and the other parameters are learned during the training.

Results
We adopt the same metrics used in the TASS shared task, which are the accuracy and the macroaveraged version of the precision, recall and F1. One of our main goals is to study the effect of the word embeddings on the performance of our CNN model. Table 2 compares the models based on the type of word-embeddings used as input of the network: CNN-rand, CNN-wiki and CNN-Twitter. The other parameters of the model were set as follows: dimension of word embeddings = 300, number of filters = 128, size of filters = 3,4,5 dropout=0.5, λ = 0, batch size=64 and number of epochs=200. When the classification only considers 4 levels (POS, NEU, NEG, NONE), the best results are provided by the pretrained model built from tweets, despite being much smaller (with less than 350,000 words) than Cardellino's pre-trained model (with 1.5 million of words). Even the word vectors from scratch (CNN-rand) provided slightly higher performance than the CNN model trained with Cardellino's pretrained model. This may be due to the language style of tweets is completely different to the usual   Table 3: Results of the baseline systems and the best CNN model one of text collections (for example Wikipedia) that make up Cardellino's corpus. As expected, the results are lower when the classification is performed using 6 polarity levels because there are less examples in the training set for the classes: P, P+ and N, N+. The pre-trained model from tweets still provides the best F1, but the best accuracy is provided by Cardellino's pre-trained model. The worst results are obtained when the word vectors are randomly initialized.
Once we evaluated the impact of word embeddings on the performance of the CNN model, we decided to focus on the rest of its parameters. The dropout regularization is parameterized by the dropout probability ∈ [0, 1]. Our experiments using 10-fold cross validation revealed that the best performance is achieved when it is set to 0.5. As expected, several experiments showed that the larger the number of training epochs used, the more accurate the model. The final number of training epochs was set to 200. The experiments also show that the learning rate is the parameter with a largest impact in the prediction performance. The best results were obtained when this parameter was set to 3. The size of filter also seems to have a significant effect on results. Thus, the model achieved better results when our setting also considered smaller size such as 2. This may be due to tweets are small texts whose average number of words is 12 (Li et al., 2011). The best results were obtained with the filter-size parameter equals to "2,3,4,5". In order to determine the best model, we performed a comprehensive series of experiments, showing that the best parameter setting is as follows: dimension of word embeddings = 300, number of filters = 300, size of filters = 2,3,4,5 dropout=0.5, λ = 3, batch size=500 and   number of epochs=200. Table 3 shows the results of our baselines (using SVM or logistic regression trained with the feature set and the rules described above). It also presents the results of our best CNN model. We can observe that our CNN model provides slightly better results in terms of F1 than SVM and logistic regression. However, in terms of accuracy, these popular algorithms overcome CNN model. Meanwhile, SVM and Logistic regression provide results that are extremely similar. It is worth mentioning that the logistic regression's performance was observably faster than the other two models. Although our CNN model worked acceptably, its performance is still far from the top ranking systems of TASS (see Table4).
Tables 5 and 6 show the scores for each polarity using the best CNN model. As expected, the lower results are obtained for those classes with less instances in the training dataset.

Conclusion and future work
This paper explores the use of a convolutional neural network in order to extract relevant features without the necessity of handcrafted features (such as stems, named entities, PoS tags, syntactic information, etc). Our paper shows that a convolutional neural network architecture can be an effective method for sentiment analysis of Spanish tweets. Although our performance is lower than top ranking systems of the TASS workshop, our CNN model provides promising results.
As future work, we also plan to study if the inclusion of external features (such as sentiment features from polarity lexicons) to the final layer (softmax layer) achieves to improve the results. Because the General Corpus of the TASS task covers very different domains, we will perform leaveone-domain-out cross validation in order to asses the generality of our CNN model. We also plan to explore how balanced corpora and bigger datasets could help to increase the performance classification of minority classes in the training dataset. Moreover, we would like to explore other deep learning that could effectively deal with the polarity classification of Spanish tweets.