BB_twtr at SemEval-2017 Task 4: Twitter Sentiment Analysis with CNNs and LSTMs

In this paper we describe our attempt at producing a state-of-the-art Twitter sentiment classifier using Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTMs). Our system leverages a large amount of unlabeled data to pre-train word embeddings. We then use a subset of the unlabeled data to fine-tune the embeddings using distant supervision. The final CNNs and LSTMs are trained on the SemEval-2017 Twitter dataset, where the embeddings are fine-tuned again. To boost performance we ensemble several CNNs and LSTMs together. Our approach achieved first rank on all five English subtasks amongst 40 teams.


Introduction
Determining the sentiment polarity of tweets has become a landmark homework exercise in natural language processing (NLP) and data science classes. This is perhaps because the task is easy to understand and it is also easy to get good results with very simple methods (e.g. counting positive and negative words). The practical applications of this task are wide, from monitoring popular events (e.g. Presidential debates, Oscars, etc.) to extracting trading signals by monitoring tweets about public companies.
These applications often benefit greatly from the best possible accuracy, which is why the SemEval-2017 Twitter competition promotes research in this area. The competition is divided into five subtasks which involve standard classification, ordinal classification and distributional estimation. For a more detailed description see (Rosenthal et al., 2017).
In the last few years, deep learning techniques have significantly outperformed traditional methods in several NLP tasks (Chen and Manning, 2014; Bahdanau et al., 2014), and sentiment analysis is no exception to this trend (Rojas-Barahona, 2016). In fact, previous iterations of the SemEval Twitter sentiment analysis competition have already established their power over other approaches (Nakov et al., 2016; Severyn and Moschitti, 2015; Deriu et al., 2016). Two of the most popular deep learning techniques for sentiment analysis are CNNs and LSTMs. Consequently, in an effort to build a state-of-the-art Twitter sentiment classifier, we explore both models and build a system which combines both. This paper is organized as follows. In sec. 2 we describe the architecture of the CNN and the LSTM used in our system. In sec. 3 we expand on the three training phases used in our system. In sec. 4 we discuss the various tricks that were used to fine-tune the system for each individual subtask. Finally, in sec. 5 we present the performance of the system and in sec. 6 we outline our main conclusions.

System description

CNN
Let us now describe the architecture of the CNN we worked with. Its architecture is almost identical to the CNN of Kim (2014). A smaller version of our model is illustrated in Fig. 1. The input of the network is the tweets, which are tokenized into words. Each word is mapped to a word vector representation, i.e. a word embedding, such that an entire tweet can be mapped to a matrix of size $s \times d$, where $s$ is the number of words in the tweet and $d$ is the dimension of the embedding space (we chose $d = 200$). We follow the zero-padding strategy of Kim (2014) such that all tweets have the same matrix dimension $X \in \mathbb{R}^{s' \times d}$, where we chose $s' = 80$. We then apply several convolution operations of various sizes to this matrix. A single convolution involves a filtering matrix $w \in \mathbb{R}^{h \times d}$, where $h$ is the size of the convolution, meaning the number of words it spans. The convolution operation is defined as

$$c_i = f\Big(\sum_{j,k} w_{j,k} \, (X_{[i:i+h-1]})_{j,k} + b\Big),$$

where $b \in \mathbb{R}$ is a bias term and $f(x)$ is a nonlinear function, which we chose to be the ReLU function. The output $c \in \mathbb{R}^{s'-h+1}$ is therefore a concatenation of the convolution operator over all possible windows of words in the tweet. Note that because of the zero-padding strategy we use, we are effectively applying wide convolutions (Kalchbrenner et al., 2014). We can use multiple filtering matrices to learn different features, and additionally we can use multiple convolution sizes to focus on smaller or larger regions of the tweets. In practice, we used three filter sizes (either [1, 2, 3], [3, 4, 5] or [5, 6, 7] depending on the model) and a total of 200 filtering matrices for each filter size.
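As a rough illustration of the convolution described above, the following NumPy sketch applies a single filter to one zero-padded tweet matrix. The function name `convolve_tweet`, the random toy data, and the choice to place the zero padding at the end of the tweet are assumptions made for this example, not details taken from our implementation.

```python
import numpy as np

def convolve_tweet(X, w, b):
    """Apply one filter w (h x d) to a padded tweet matrix X (s' x d)."""
    s_pad, _ = X.shape
    h = w.shape[0]
    c = np.empty(s_pad - h + 1)           # one output per window of h words
    for i in range(s_pad - h + 1):
        window = X[i:i + h]               # rows i .. i+h-1 of the tweet matrix
        c[i] = max(0.0, float(np.sum(w * window) + b))  # ReLU non-linearity
    return c

rng = np.random.default_rng(0)
s_pad, d, h = 80, 200, 3
X = np.zeros((s_pad, d))
X[:12] = rng.normal(size=(12, d))         # toy 12-word tweet; padding at the end is assumed
w = rng.normal(size=(h, d)) * 0.01
c = convolve_tweet(X, w, b=0.1)
print(c.shape)  # (78,)
```

With s' = 80 and h = 3 the output has s' - h + 1 = 78 entries, one per window of three consecutive words.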
We then apply a max-pooling operation to each convolution, $c_{\max} = \max(c)$. The max-pooling operation extracts the most important feature for each convolution, independently of where in the tweet this feature is located. In other words, the CNN's structure effectively extracts the most important n-grams in the embedding space, which is why we believe these systems are good at sentence classification. The max-pooling operation also allows us to combine all the $c_{\max}$ of each filter into one vector $\mathbf{c}_{\max} \in \mathbb{R}^m$, where $m$ is the total number of filters (in our case $m = 3 \times 200 = 600$). This vector then goes through a small fully connected hidden layer of size 30, which is then in turn passed through a softmax layer to give the final classification probabilities. To reduce overfitting, we add a dropout layer (Srivastava et al., 2014) after the max-pooling layer and after the fully connected hidden layer, with a dropout probability of 50% during training.

Figure 2: Architecture of a smaller version of the bi-directional LSTM used. Picture is inspired by Figure 1 of (Zhang and Wallace, 2015).

LSTM
Let us now describe the architecture of the LSTM system we worked with. A smaller version of our model is illustrated in Fig. 2. Its main building blocks are two LSTM units. LSTMs are part of the recurrent neural network (RNN) family, which are neural networks constructed to deal with sequential data by sharing their internal weights across the sequence. For each element in the sequence, that is, for each word in the tweet, the RNN uses the current word embedding and its previous hidden state to compute the next hidden state. In its simplest version, the hidden state $h_t \in \mathbb{R}^m$ (where $m$ is the dimension of the RNN, which we pick to be $m = 200$) at time $t$ is computed by

$$h_t = f(W_h x_t + U_h h_{t-1} + b_h),$$

where $x_t$ is the current word embedding, $W_h \in \mathbb{R}^{m \times d}$ and $U_h \in \mathbb{R}^{m \times m}$ are weight matrices, $b_h \in \mathbb{R}^m$ is a bias term and $f(x)$ is a non-linear function, usually chosen to be tanh. The initial hidden state is chosen to be a vector of zeros. Unfortunately, this simple RNN suffers from the exploding and vanishing gradient problem during the backpropagation training stage (Hochreiter, 1998). LSTMs solve this problem by having a more complex internal structure which allows them to remember information for either long or short terms (Hochreiter and Schmidhuber, 1997). The hidden state of an LSTM unit is computed by (Zaremba et al., 2014)

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f), \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tanh(W_c x_t + U_c h_{t-1} + b_c), \\
h_t &= o_t \circ \tanh(c_t),
\end{aligned}
$$

where $i_t$ is called the input gate, $f_t$ is the forget gate, $o_t$ is the output gate, $c_t$ is the cell state, $h_t$ is the regular hidden state, $\sigma$ is the sigmoid function, and $\circ$ is the Hadamard product.
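To make the recurrence concrete, here is a minimal NumPy sketch of one update of a standard LSTM cell with input, forget and output gates, applied word by word to a toy tweet. The parameter layout (a dictionary keyed by gate name), the random initialization and the helper names are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM update; params maps a gate name to its (W, U, b) triple."""
    def affine(name):
        W, U, b = params[name]
        return W @ x_t + U @ h_prev + b
    i_t = sigmoid(affine("i"))                        # input gate
    f_t = sigmoid(affine("f"))                        # forget gate
    o_t = sigmoid(affine("o"))                        # output gate
    c_t = f_t * c_prev + i_t * np.tanh(affine("c"))   # new cell state
    h_t = o_t * np.tanh(c_t)                          # new hidden state
    return h_t, c_t

d, m = 200, 200                                       # embedding and hidden sizes
rng = np.random.default_rng(0)
params = {g: (rng.normal(scale=0.1, size=(m, d)),
              rng.normal(scale=0.1, size=(m, m)),
              np.zeros(m)) for g in "ifoc"}
h, c = np.zeros(m), np.zeros(m)                       # zero initial states
for x_t in rng.normal(size=(5, d)):                   # toy 5-word tweet
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)  # (200,) (200,)
```

The same weights are reused at every position in the sequence, which is the weight sharing referred to above.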
One drawback of the LSTM is that it does not sufficiently take into account information from later words, because the sentence is read in only one direction: forward. To solve this problem, we use what is known as a bidirectional LSTM, which is two LSTMs whose outputs are stacked together. One LSTM reads the sentence forward, and the other LSTM reads it backward. We concatenate the hidden states of each LSTM after they have processed their respective final word. This gives a vector of dimension 2m = 400, which is fed to a fully connected hidden layer of size 30, and then passed through a softmax layer to give the final classification probabilities. Here again we use dropout to reduce overfitting; we add a dropout layer before and after the LSTMs, and after the fully connected hidden layer, with a dropout probability of 50% during training.
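The bidirectional read and the concatenation of the two final hidden states can be sketched as follows. To keep the example short, a plain tanh recurrence stands in for the LSTM cell; that substitution, the function names and the random weights are all assumptions of this sketch.

```python
import numpy as np

def run_rnn(X, W, U, b):
    """Final hidden state of a tanh RNN after reading the rows of X in order."""
    h = np.zeros(U.shape[0])
    for x_t in X:
        h = np.tanh(W @ x_t + U @ h + b)
    return h

def bidirectional_features(X, fwd, bwd):
    h_fwd = run_rnn(X, *fwd)          # reads first word -> last word
    h_bwd = run_rnn(X[::-1], *bwd)    # reads last word -> first word
    return np.concatenate([h_fwd, h_bwd])

d, m = 200, 200
rng = np.random.default_rng(1)

def make_params():
    # Hypothetical random weights; each direction has its own parameters.
    return (rng.normal(scale=0.05, size=(m, d)),
            rng.normal(scale=0.05, size=(m, m)),
            np.zeros(m))

fwd, bwd = make_params(), make_params()
X = rng.normal(size=(7, d))           # toy 7-word tweet
features = bidirectional_features(X, fwd, bwd)
print(features.shape)  # (400,)
```

The resulting vector has dimension 2m = 400 and would be the input to the fully connected layer.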

Training
To train those models we had access to 49,693 human-labeled tweets for subtask A, 30,849 tweets for subtasks (C, E) and 18,948 tweets for subtasks (B, D). In addition to this human-labeled data, we collected 100 million unique unlabeled English tweets using the Twitter streaming API. From this unlabeled dataset, we extracted a distant dataset of 5 million positive tweets and 5 million negative tweets. To extract this distant dataset we used the strategy of Go et al. (2009), that is, we simply associate positive tweets with the presence of positive emoticons (e.g. ":)") and vice versa for negative tweets. Those three datasets (unlabeled, distant and labeled) were used separately in the three training stages which we now present. Note that our training strategy is very similar to the one used in (Severyn and Moschitti, 2015; Deriu et al., 2016).
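The emoticon-based labeling can be illustrated with a few lines of Python. The emoticon lists below and the decision to skip tweets that contain both kinds of emoticon are assumptions of this sketch, not a description of the exact rules we applied.

```python
POSITIVE = (":)", ":-)", ":D", "=)")   # illustrative list, assumed
NEGATIVE = (":(", ":-(")               # illustrative list, assumed

def distant_label(tweet):
    """Return a noisy sentiment label, or None if the tweet is unusable."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:             # no emoticon, or conflicting emoticons
        return None
    return "positive" if has_pos else "negative"

for t in ["great game :)", "missed my bus :(", "just landed"]:
    print(t, "->", distant_label(t))
```

Tweets that receive `None` would simply be left out of the distant dataset.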

Pre-processing
Before feeding the tweets to any training stage, they are pre-processed using the following procedure:
• URLs are replaced by the <url> token.
• Any letter repeated more than 2 times in a row is replaced by 2 repetitions of that letter (for example, "sooooo" is replaced by "soo").
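The two rules listed above can be written as a pair of regular expressions. This is a minimal sketch; the URL pattern in particular is an assumption, since what counts as a URL is not specified here.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")    # assumed URL pattern
REPEAT_RE = re.compile(r"([^\W\d_])\1{2,}")      # a letter repeated 3+ times

def preprocess(tweet):
    tweet = URL_RE.sub("<url>", tweet)           # rule 1: replace URLs
    return REPEAT_RE.sub(r"\1\1", tweet)         # rule 2: keep 2 repetitions

print(preprocess("sooooo goood http://t.co/abc"))  # soo good <url>
```

URLs are replaced first so that runs of letters inside them (for instance "www") are not altered by the second rule.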

Unsupervised training
We start by using the 100 million unlabeled tweets to pre-train the word embeddings which will later be used in the CNN and LSTM. To do so, we experimented with three unsupervised learning algorithms: Google's Word2vec (Mikolov et al., 2013a,b), Facebook's FastText (Bojanowski et al., 2016) and Stanford's GloVe (Pennington et al., 2014). Word2vec learns word vector representations by attempting to predict context words around an input word. FastText is very similar to Word2vec but it also uses subword information in the prediction model. GloVe, on the other hand, is a model based on global word-word co-occurrence statistics. For all three algorithms we used the code provided by the authors with their default settings.

Distant training
The embeddings learned in the unsupervised phase contain very little information about the sentiment polarity of the words, since the context of a positive word (e.g. "good") tends to be very similar to the context of a negative word (e.g. "bad").
To add polarity information to the embeddings, we follow the unsupervised training with a fine-tuning of the embeddings via a distant training phase. To do so, we use the CNN described in sec. 2 and initialize the embeddings with the ones learned in the unsupervised phase. We then use the distant dataset to train the CNN to classify noisy positive tweets vs. noisy negative tweets. The first epoch of the training is done with the embeddings frozen in order to avoid large changes in the embeddings. We then unfreeze the embeddings and train for 6 more epochs. After this training stage, words with very different sentiment polarity (e.g. "good" vs. "bad") are far apart in the embedding space.

Supervised training
The final training stage uses the human-labeled data provided by SemEval-2017. We initialize the embeddings in the CNN and LSTM models with the fine-tuned embeddings of the distant training phase, and freeze them for the first ∼5 epochs. We then train for another ∼5 epochs with unfrozen embeddings and a learning rate reduced by a factor of 10. We pick the cross-entropy as the loss function, and we weight it by the inverse frequency of the true classes to counteract the imbalanced dataset. The loss is minimized using the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.001. The models were implemented in TensorFlow and experiments were run on a GeForce GTX Titan X GPU.
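The class-weighted loss can be sketched in plain Python as below. The normalization of the weights (n divided by the number of classes times the class count) is one common convention and is an assumption here; the text only fixes that the weights are proportional to the inverse class frequency.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n / (K * count), i.e. proportional to 1 / frequency."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """probs: one dict per example mapping class -> predicted probability."""
    total = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    return total / len(labels)

labels = ["neutral"] * 5 + ["positive"] * 4 + ["negative"] * 1
weights = inverse_frequency_weights(labels)
uniform = [{"neutral": 1 / 3, "positive": 1 / 3, "negative": 1 / 3}] * len(labels)
loss = weighted_cross_entropy(uniform, labels, weights)
print(weights, loss)
```

With this normalization the rare negative class receives the largest weight, and a uniform prediction still yields a loss of log 3, as in the unweighted case.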

Subtask specific tricks
The models described in sec. 2 and the training method described in sec. 3 are used in the same way for all five subtasks, with a few special exceptions which we now address. Clearly, the output dimension differs depending on the subtask: for subtask A the output dimension is 3, for subtasks B and D it is 2, and for subtasks C and E it is 5. Furthermore, for the quantification subtasks (D and E), we use the probability average approach of Bella et al. (2010) to convert the output probabilities into sentiment distributions.
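As an illustration of the quantification step, the sketch below averages the per-tweet class probabilities to estimate a sentiment distribution for a topic, and contrasts this with simply counting hard predictions. The toy probabilities and function names are made up for the example.

```python
def probability_average(prob_rows):
    """Estimate the class distribution as the mean of per-tweet probabilities."""
    n, k = len(prob_rows), len(prob_rows[0])
    return [sum(row[j] for row in prob_rows) / n for j in range(k)]

def classify_and_count(prob_rows):
    """Baseline for comparison: fraction of tweets whose argmax is each class."""
    n, k = len(prob_rows), len(prob_rows[0])
    hard = [max(range(k), key=lambda j: row[j]) for row in prob_rows]
    return [hard.count(j) / n for j in range(k)]

# Toy binary example: columns are [positive, negative].
probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.7, 0.3]]
print(probability_average(probs))   # roughly [0.6, 0.4]
print(classify_and_count(probs))    # [0.75, 0.25]
```

Averaging keeps the information carried by uncertain predictions, which hard counting discards.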
Finally, for subtasks that have a topic associated with the tweet (B, C, D and E), we add two special steps which we noticed improve the accuracy during the cross-validation phase. First, if any of the words in the topic is not explicitly mentioned in the tweet, we add those missing words at the end of the tweet in the pre-processing phase. Second, we concatenate to the regular word embeddings another embedding space of dimension 5 which has only 2 possible vectors. One of these 2 vectors indicates that the current word is part of the topic, while the other vector indicates that the current word is not part of the topic.
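The two topic-related steps can be sketched as a small pre-processing function that appends missing topic words and produces a 0/1 flag per token. The flag would then select one of the two 5-dimensional vectors that is concatenated to the word embedding. Case-insensitive matching and the function name are assumptions of this example.

```python
def add_topic_features(tweet_tokens, topic_tokens):
    """Append missing topic words and flag which tokens belong to the topic."""
    tokens = list(tweet_tokens)
    present = {t.lower() for t in tokens}
    # Step 1: add topic words that the tweet does not mention explicitly.
    tokens += [t for t in topic_tokens if t.lower() not in present]
    # Step 2: 1 if the token is part of the topic, 0 otherwise.
    topic = {t.lower() for t in topic_tokens}
    flags = [1 if t.lower() in topic else 0 for t in tokens]
    return tokens, flags

tokens, flags = add_topic_features(["love", "the", "new", "iphone"],
                                   ["apple", "iphone"])
print(tokens)  # ['love', 'the', 'new', 'iphone', 'apple']
print(flags)   # [0, 0, 0, 1, 1]
```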

Results
Let us now discuss the results obtained from this system. In order to assess the performance of each model and their variations, we first show their scores on the historical Twitter test sets of 2013, 2014, 2015 and 2016 without using any of those sets in the training dataset, just as was required for the 2016 edition of this competition. For brevity, we only focus on subtask A since it tends to be the most popular one. Moreover, in order to be consistent with historical editions of this competition, we use the average $F_1$ score of the positive and negative classes as the metric of interest. This is different from the macro-average recall which is used in the 2017 edition, but this should not affect the conclusions of this analysis significantly since we found that the two metrics were highly correlated. The results are summarized in Table 1. This table is not meant to be an exhaustive list of all the experiments performed, but it does illustrate the relative performances of the most important variations on the models explored here.
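For reference, the two metrics mentioned above can be computed as follows. This is a plain-Python sketch with hypothetical label names and toy predictions; it is not the official scorer.

```python
def f1_pn_and_macro_recall(gold, pred, classes=("positive", "neutral", "negative")):
    """Return (average F1 of positive and negative classes, macro-average recall)."""
    f1, recall = {}, {}
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        recall[c] = rec
    f1_pn = (f1["positive"] + f1["negative"]) / 2   # neutral F1 is ignored
    macro_recall = sum(recall.values()) / len(classes)
    return f1_pn, macro_recall

gold = ["positive", "positive", "neutral", "negative", "negative", "neutral"]
pred = ["positive", "neutral", "neutral", "negative", "positive", "neutral"]
f1_pn, macro_recall = f1_pn_and_macro_recall(gold, pred)
print(round(f1_pn, 4), round(macro_recall, 4))  # 0.5833 0.6667
```

Note that the neutral class still influences the first metric indirectly, through the precision and recall of the other two classes.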
We can see from Table 1 that the GloVe unsupervised algorithm gives a lower score than both FastText and Word2vec. It is for this reason that we did not include the GloVe variation in the ensemble model. We also note that the absence of class weights or the absence of a distant training stage lowers the scores significantly, which demonstrates that these are sound additions. Except for these three variations, the other models have similar scores. However, the ensemble model effectively outperforms all the other individual models. Indeed, while these individual models give similar scores, their outputs are sufficiently uncorrelated that ensembling them gives the score a small boost. To get a sense of how correlated these models are with each other, we can compute the Pearson correlation coefficient between the output probabilities of any pair of models.
The scores cited from Nakov et al. (2016) do not come from a single system or from a single team; they are the best previous scores obtained for each test set over the years.
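The pairwise correlation between model outputs can be computed as sketched below. The example also shows one simple way to ensemble, namely averaging the class probabilities of the models. That averaging rule is an assumption of this sketch, and the probabilities are toy values.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ensemble(model_probs):
    """model_probs[k][i] is the probability vector of model k for tweet i."""
    n_models = len(model_probs)
    return [[sum(col) / n_models for col in zip(*rows)]
            for rows in zip(*model_probs)]

cnn = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]     # toy outputs for two tweets
lstm = [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]
flat = lambda rows: [p for row in rows for p in row]
print(pearson(flat(cnn), flat(lstm)))        # correlation of the two models
print(average_ensemble([cnn, lstm]))         # averaged class probabilities
```

A correlation well below 1 between two models of similar accuracy suggests that they make different errors, which is when averaging tends to help.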
For the predictions on the 2017 test set, the system is retrained on all available training data, which includes the test data of previous years. The results of our system on the 2017 test set are shown in Table 3. Our system achieved the best scores on all five English subtasks. For subtask A, there is actually a tie between our submission and that of another team (DataStories), but note that with respect to the other metrics (accuracy and the $F_1^{PN}$ score) our submission ranks higher.

Conclusion
In this paper we presented the system we used to compete in the SemEval-2017 Twitter sentiment analysis competition. Our goal was to experiment with deep learning models along with modern training strategies in an effort to build the best possible sentiment classifier for tweets. The final model we used was an ensemble of 10 CNNs and 10 LSTMs with different hyper-parameters and different pre-training strategies. We participated in all of the English subtasks, and obtained first rank in all of them.
For future work, it would be interesting to explore systems that combine a CNN and an LSTM more organically than through an ensemble model, perhaps a model similar to the one of Stojanovski et al. (2016). It would also be interesting to analyze how the performance of the models depends on the amount of unlabeled and distant data.

Note on the scores (Rosenthal et al., 2017): for subtasks A and B, higher is better, while for subtasks C, D and E, lower is better.