SwissCheese at SemEval-2016 Task 4: Sentiment Classification Using an Ensemble of Convolutional Neural Networks with Distant Supervision

In this paper, we propose a classifier for predicting message-level sentiments of English micro-blog messages from Twitter. Our method builds upon the convolutional sentence embedding approach proposed by Severyn and Moschitti (2015a; 2015b). We leverage large amounts of data with distant supervision to train an ensemble of 2-layer convolutional neural networks whose predictions are combined using a random forest classifier. Our approach was evaluated on the datasets of the SemEval-2016 competition (Task 4), outperforming all other approaches for the Message Polarity Classification task.


Introduction
Sentiment analysis is a fundamental problem aiming to give a machine the ability to understand the emotions and opinions expressed in a written text. This is an extremely challenging task due to the complexity of human language, which makes use of rhetorical devices such as sarcasm or irony. Deep neural networks have shown great promise at capturing salient features for these complex tasks (Mikolov et al., 2013b; Severyn and Moschitti, 2015a). Particularly successful for sentiment classification were Convolutional Neural Networks (CNN) (Kim, 2014; Kalchbrenner et al., 2014; Severyn and Moschitti, 2015a; Severyn and Moschitti, 2015b; Johnson and Zhang, 2015), on which our work builds.
These networks typically have a large number of parameters and are especially effective when trained on large amounts of data. In this work, we use a distant supervision approach to leverage large amounts of data in order to train a 2-layer CNN, extending the 1-layer architecture proposed by Severyn and Moschitti (2015a). More specifically, we train a neural network using the following three-phase procedure: i) creation of word embeddings for the initialization of the first layer; ii) a distant-supervised phase, where the network weights and word embeddings are trained to capture aspects related to sentiment; and iii) a supervised phase, where the network is trained on the provided supervised training data. We also combine the predictions of several neural networks using a random forest meta-classifier. The proposed approach was evaluated on the datasets of the SemEval-2016 competition, Task 4 (Nakov et al., 2016), for which it reaches state-of-the-art results.
* These authors contributed equally to this work.

Convolutional Neural Networks
We combine the outputs of two 2-layer CNNs that have similar architectures but differ in the choice of certain parameters (such as the number of convolutional filters). The two networks were also initialized with different word embeddings and used slightly different training data for the distant-supervised phase. The common architecture of the two CNNs is shown in Figure 1 and described in detail below.
Sentence model. Each word is associated with a d-dimensional vector representation. A sentence (or tweet) is represented by the concatenation of the representations of its n constituent words. This yields a matrix $X \in \mathbb{R}^{d \times n}$, which is used as input to the convolutional neural network.
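As a minimal sketch of the sentence model, the snippet below stacks word vectors column by column into a d × n matrix. The helper name `sentence_matrix`, the toy embeddings, and the use of a zero vector for out-of-vocabulary words are illustrative assumptions, not details taken from the system.

```python
def sentence_matrix(tokens, embeddings, unknown):
    """Build X as d rows and n columns, one column per token.

    `embeddings` maps a word to its d-dimensional vector; `unknown` is the
    vector used for out-of-vocabulary words (an assumption of this sketch).
    """
    columns = [embeddings.get(token, unknown) for token in tokens]
    d = len(unknown)
    # Transpose the list of word vectors so that words become columns.
    return [[column[k] for column in columns] for k in range(d)]


# Toy vocabulary with d = 2 (hypothetical values).
embeddings = {"good": [1.0, 0.0], "day": [0.0, 1.0]}
X = sentence_matrix(["good", "day", "xyz"], embeddings, unknown=[0.0, 0.0])
```

Here `X` has d = 2 rows and n = 3 columns, the last column being the vector assigned to the unknown word.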
Convolutional layer. In this layer, a set of m filters is applied to a sliding window of length h over each sentence. Let $X_{[i:i+h]}$ denote the concatenation of word vectors $x_i$ to $x_{i+h}$. A feature $c_i$ is generated for a given filter $F$ by:
$$c_i := \sum_{k,j} (X_{[i:i+h]})_{k,j} \cdot F_{k,j}$$
Concatenating the features $c_i$ over all window positions in a sentence produces a feature vector $c \in \mathbb{R}^{n-h+1}$. The vectors $c$ are then aggregated over all m filters into a feature map matrix $C \in \mathbb{R}^{m \times (n-h+1)}$. The filters are learned during the training phase of the neural network using a procedure detailed in the next section.
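The convolution step can be sketched in plain Python as follows. The sketch assumes the usual form of the feature computation, namely the sum of element-wise products between a filter and a window of h consecutive word vectors; the function names and toy values are hypothetical.

```python
def convolve_sentence(X, F):
    """Slide one filter F (d x h) over the sentence matrix X (d x n).

    Returns the feature vector c of length n - h + 1. Each c_i is the sum of
    element-wise products between F and the h columns of X starting at i
    (assumed standard form of the feature computation).
    """
    d, n = len(X), len(X[0])
    h = len(F[0])
    c = []
    for i in range(n - h + 1):
        total = 0.0
        for k in range(d):
            for j in range(h):
                total += X[k][i + j] * F[k][j]
        c.append(total)
    return c


def feature_map(X, filters):
    """Stack the feature vectors of all m filters into C (m x (n - h + 1))."""
    return [convolve_sentence(X, F) for F in filters]


# Toy example: d = 2, n = 4 words, window h = 2, m = 2 filters.
X = [[1.0, 2.0, 3.0, 4.0],
     [0.0, 1.0, 0.0, 1.0]]
filters = [
    [[1.0, 0.0], [0.0, 0.0]],  # keeps the first dimension of the first word in the window
    [[1.0, 1.0], [1.0, 1.0]],  # sums every entry of the window
]
C = feature_map(X, filters)
```

With n = 4 and h = 2, each row of `C` has n − h + 1 = 3 entries, one per window position.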
Max pooling. The output of the convolutional layer is passed through a non-linear activation function before entering a pooling layer. The latter aggregates vector elements by taking the maximum over a fixed set of non-overlapping intervals. The resulting pooled feature map matrix has the form $C_{pooled} \in \mathbb{R}^{m \times \frac{n-h+1}{s}}$, where s is the length of each interval. In the case of overlapping intervals with a stride value $s_t$, the pooled feature map matrix has the form $C_{pooled} \in \mathbb{R}^{m \times \frac{n-h+1-s}{s_t}}$. Depending on whether the borders are included or not, the result of the fraction is rounded up or down, respectively.
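The pooling step can be illustrated with a short sketch that applies the ReLU activation and then takes the maximum over windows of a given width and stride. It assumes that only windows lying entirely inside the feature map are used (borders excluded), which gives floor((L − width) / stride) + 1 pooled values for a row of length L; this counting convention is an assumption of the sketch and may differ by one from the expression in the text.

```python
def relu(value):
    """Rectified linear activation."""
    return max(0.0, value)


def max_pool(C, width, stride):
    """Apply ReLU, then max-pool each row of C.

    Windows have length `width` and start every `stride` positions; only
    windows that fit entirely inside the row are used (borders excluded).
    """
    pooled = []
    for row in C:
        activated = [relu(v) for v in row]
        starts = range(0, len(activated) - width + 1, stride)
        pooled.append([max(activated[s:s + width]) for s in starts])
    return pooled


C = [[-1.0, 2.0, 0.5, 3.0, -2.0, 1.0],
     [0.0, -3.0, -1.0, 4.0, 2.0, 2.5]]
non_overlapping = max_pool(C, width=2, stride=2)  # interval length s = 2
overlapping = max_pool(C, width=3, stride=2)      # width 3 moved with stride 2
```

Setting the stride equal to the width recovers the non-overlapping case; a smaller stride makes neighbouring intervals share elements.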
Hidden layer. A fully connected hidden layer computes the transformation $\alpha(W x + b)$, where $W \in \mathbb{R}^{m \times m}$ is the weight matrix, $b \in \mathbb{R}^m$ the bias, and $\alpha$ the rectified linear (relu) activation function (Nair and Hinton, 2010). The output vector of this layer, $x \in \mathbb{R}^m$, corresponds to the sentence embedding of each tweet.
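Softmax. Finally, the outputs of the final pooling layer $x \in \mathbb{R}^m$ are fully connected to a soft-max regression layer, which returns the class $\hat{y} \in [1, K]$ with the largest probability, i.e.,
$$\hat{y} := \arg\max_j P(y = j \mid x, w, a) = \arg\max_j \frac{e^{x^\top w_j + a_j}}{\sum_{k=1}^{K} e^{x^\top w_k + a_k}}$$
where $w_j$ denotes the weight vector of class j and $a_j$ the bias of class j.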
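A minimal sketch of the last two layers is given below: a fully connected ReLU layer followed by a soft-max layer that returns the most probable class. It assumes the standard soft-max form with per-class scores x·w_j + a_j; the weights are toy values and the function names are hypothetical.

```python
import math


def hidden_layer(W, b, x):
    """Compute relu(W x + b) for a weight matrix W, bias b and input x."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + bias)
            for row, bias in zip(W, b)]


def softmax_probs(x, weights, biases):
    """Class probabilities from the scores x . w_j + a_j."""
    scores = [sum(w * v for w, v in zip(w_j, x)) + a_j
              for w_j, a_j in zip(weights, biases)]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x, weights, biases):
    """Return the class in 1..K with the largest probability."""
    probs = softmax_probs(x, weights, biases)
    return max(range(len(probs)), key=probs.__getitem__) + 1


# Toy values (hypothetical): m = 2 hidden units, K = 3 classes.
x = hidden_layer([[1.0, 0.0], [0.0, 1.0]], [0.0, -3.0], [1.0, 2.0])
weights = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
biases = [0.0, 0.0, 0.0]
probs = softmax_probs(x, weights, biases)
y_hat = predict(x, weights, biases)
```

The three entries of `probs` play the role of the real-valued outputs per sentiment class, and `y_hat` is the predicted class.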
Network parameters. Training the neural network consists in learning the set of parameters $\Theta = \{X, F_1, b_1, F_2, b_2, W, a\}$, where $X$ denotes the word embeddings; $F_i, b_i$ ($i \in \{1, 2\}$) the filter weights and biases of the first and second convolutional layers; $W$ the concatenation of the weights $w_j$ for every output class in the soft-max layer; and $a$ the bias of the soft-max layer.

Ensemble of classifiers
We combine the results of the two 2-layer CNNs described in Section 2.1 with the intent of increasing the generalizability of the final classifier. This is achieved by relying on two systems trained using different procedures as well as different word embeddings. The network parameters of the two CNNs are summarized in Table 1. The preprocessing and training phases of the two systems are described below.

System I
Preprocessing and word embeddings. The word embeddings are initialized using word2vec (Mikolov et al., 2013a) and then trained using an unlabelled corpus of 200M tweets. We apply a skip-gram model with a window size of 5 and filter out words that occur fewer than 5 times (Severyn and Moschitti, 2015b). The dimensionality of the vector representation is set to d = 52.
Training. During a first distant-supervised phase, we use emoticons to infer the polarity of a balanced set of 90M tweets (Read, 2005; Go et al., 2009). The resulting dataset contains 45M tweets for each of the positive and negative classes. The neural network is trained on these 90M tweets for one epoch, before training for 10 to 15 epochs on the labelled data provided by SemEval-2016. The word embeddings $X \in \mathbb{R}^{d \times n}$ are updated during both the distant and the supervised training phases, as back-propagation is applied through the entire network.
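The emoticon-based labelling used for distant supervision can be sketched as follows. The emoticon lists are illustrative assumptions (the exact lists are not given here), and tweets containing both positive and negative emoticons are discarded in this sketch.

```python
# Hypothetical emoticon lists; the lists actually used are not specified.
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(", "=(")


def distant_label(tweet):
    """Infer a noisy polarity label from emoticons, or None if ambiguous."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:  # no emoticon at all, or conflicting emoticons
        return None
    return "positive" if has_pos else "negative"


def build_balanced_set(tweets):
    """Keep the same number of positive and negative tweets."""
    pos, neg = [], []
    for tweet in tweets:
        label = distant_label(tweet)
        if label == "positive":
            pos.append(tweet)
        elif label == "negative":
            neg.append(tweet)
    size = min(len(pos), len(neg))
    return ([(t, "positive") for t in pos[:size]]
            + [(t, "negative") for t in neg[:size]])


tweets = ["great day :)", "so tired :(", "meh", "mixed :) :(", "love it :D"]
balanced = build_balanced_set(tweets)
```

Truncating the larger class to the size of the smaller one is only one way to balance the set; it is chosen here for brevity.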

System II
Preprocessing and word embeddings. A corpus of 90M tweets (30M containing positive emoticons, 30M containing negative ones and 30M containing none) is employed to create embedding vectors of d = 50 dimensions using GloVe (Pennington et al., 2014). Words that appear fewer than 5 times are discarded. Additionally, special flags ∈ {0, 1} are assigned to some words by appending a flag vector to their word embeddings. Four different flags can be set, marking (i) words that belong to hashtags, (ii) words that have been elongated (e.g. 'hellooo', which is mapped to the same vector as 'hello'), (iii) words in which all characters are capitalized, and (iv) punctuation marks that are repeated more than three times (e.g. '!!!!' and '!!!' being mapped to the same vector).
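A rough sketch of the four binary flags is shown below. The detection rules (regular expressions, collapsing a run of three or more identical letters to one, truncating long punctuation runs to three marks) are assumptions chosen to reproduce the examples in the text; they are heuristics and not the exact rules of System II.

```python
import re


def word_flags(token):
    """Return (normalized_token, [hashtag, elongated, all_caps, repeated_punct])."""
    hashtag = token.startswith("#") and len(token) > 1
    core = token[1:] if hashtag else token
    all_caps = core.isalpha() and core.isupper() and len(core) > 1
    core = core.lower()
    is_punct = bool(core) and not any(ch.isalnum() for ch in core)
    # A punctuation mark repeated more than three times (four or more in a row).
    repeated_punct = is_punct and re.search(r"(.)\1{3,}", core) is not None
    # A letter repeated three or more times in a row marks an elongated word.
    elongated = (not is_punct) and re.search(r"([a-z])\1{2,}", core) is not None
    if repeated_punct:
        core = re.sub(r"(.)\1{3,}", r"\1\1\1", core)   # '!!!!' -> '!!!'
    elif elongated:
        core = re.sub(r"([a-z])\1{2,}", r"\1", core)   # 'hellooo' -> 'hello'
    flags = [int(hashtag), int(elongated), int(all_caps), int(repeated_punct)]
    return core, flags


def flagged_embedding(vector, flags):
    """Append the flag vector to a word embedding."""
    return list(vector) + list(flags)


word, flags = word_flags("hellooo")
extended = flagged_embedding([0.1, 0.2], flags)
```

The normalized token is what would be looked up in the embedding table, so that 'hellooo' and 'hello' share one vector while the flag records the elongation. Collapsing a run to a single letter is a simplification that would mis-normalize words with a genuine double letter at the elongated position.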
Training. In the distant-supervised phase, the network is trained for one epoch on a set of 60M tweets containing an equal number of samples with positive and negative emoticons. Similarly to System I, this pre-trained network is further refined by supervised training for about 15 epochs on the SemEval-2016 data. To reduce overfitting, we apply L2 regularization to the cost function (negative log-likelihood) by adding a penalty of the form $\lambda \lVert \theta \rVert_2^2$, with regularization strength $\lambda$, where $\theta \in \Theta$ are the network parameters of each layer.
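The regularized cost can be written out as a short sketch: the negative log-likelihood of the gold classes plus the squared L2 norm of the parameters scaled by λ. The value of λ and the toy numbers below are hypothetical, and the parameters of each layer are represented as flat lists for simplicity.

```python
import math


def negative_log_likelihood(probs, gold):
    """Sum of -log p(gold class) over all examples."""
    return -sum(math.log(p[y]) for p, y in zip(probs, gold))


def l2_penalty(params, lam):
    """lam times the squared L2 norm of all parameters, summed over layers."""
    return lam * sum(v * v for layer in params for v in layer)


def regularized_cost(probs, gold, params, lam):
    """Cost minimized during training: NLL plus the L2 penalty."""
    return negative_log_likelihood(probs, gold) + l2_penalty(params, lam)


# Toy values: one example with two classes, two "layers" of parameters.
probs = [[0.5, 0.5]]
gold = [0]
params = [[1.0, 2.0], [3.0]]
cost = regularized_cost(probs, gold, params, lam=0.1)  # lam is a hypothetical value
```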

Optimization
The network parameters are learned using AdaDelta (Zeiler, 2012), which adapts the learning rate for each dimension using only first-order information. We used the hyper-parameters ε = 1e−6 and ρ = 0.95, as suggested by Zeiler (2012).
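For illustration, the AdaDelta update rule of Zeiler (2012) can be sketched as below, with ρ = 0.95 and ε = 1e−6 as defaults. The class name and the toy objective f(x) = x² are assumptions of this sketch; the actual system relies on Theano rather than on this code.

```python
import math


class AdaDelta:
    """Per-dimension AdaDelta update with decay rho and constant eps."""

    def __init__(self, size, rho=0.95, eps=1e-6):
        self.rho = rho
        self.eps = eps
        self.avg_sq_grad = [0.0] * size    # running average of squared gradients
        self.avg_sq_update = [0.0] * size  # running average of squared updates

    def step(self, params, grads):
        """Return the updated parameters for the given gradients."""
        updated = []
        for i, (x, g) in enumerate(zip(params, grads)):
            self.avg_sq_grad[i] = self.rho * self.avg_sq_grad[i] + (1 - self.rho) * g * g
            delta = -(math.sqrt(self.avg_sq_update[i] + self.eps)
                      / math.sqrt(self.avg_sq_grad[i] + self.eps)) * g
            self.avg_sq_update[i] = self.rho * self.avg_sq_update[i] + (1 - self.rho) * delta * delta
            updated.append(x + delta)
        return updated


# Toy usage: minimize f(x) = x^2, whose gradient is 2x.
optimizer = AdaDelta(size=1)
params = [1.0]
for _ in range(50):
    params = optimizer.step(params, [2.0 * params[0]])
```

No global learning rate appears in the update: the step size of each dimension is the ratio of the two running root-mean-square values.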

Meta-Classifier
Each aforementioned system outputs three real values $\hat{y}$ corresponding to the three sentiment classes. In addition, it outputs the categorical value of the predicted sentiment class. The meta-classifier uses these values (the real-valued class outputs and the categorical predicted class of Systems I and II) as input features. We trained a random forest using the Weka library (Hall et al., 2009) on the training data. We selected the number of trees (300), the maximum tree depth (2) and the number of features used per random selection (18) so as to obtain the best overall performance on the previous years' test sets.
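The input of the meta-classifier can be sketched as a simple feature vector built from the subsystem outputs. The sketch below only assembles the features (three real values plus the predicted class per subsystem); encoding the class as an integer index and the class order are assumptions, and the random forest itself is not reproduced here.

```python
CLASSES = ("negative", "neutral", "positive")  # assumed class order


def system_features(scores):
    """Three real-valued class outputs followed by the predicted class index."""
    predicted = max(range(len(scores)), key=scores.__getitem__)
    return list(scores) + [predicted]


def meta_features(all_scores):
    """Concatenate the features of every subsystem into one vector."""
    features = []
    for scores in all_scores:
        features.extend(system_features(scores))
    return features


# Toy outputs of two subsystems (hypothetical values).
features = meta_features([[0.1, 0.2, 0.7], [0.5, 0.3, 0.2]])
```

Each subsystem contributes four features, so the vector grows linearly with the number of subsystems fed into the random forest.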

Computing Resources
The core routines are written in Python, making heavy use of the mathematical routines in Theano (Bergstra et al., 2010), which exploits GPU acceleration. For further performance improvement, we used the CuDNN library (Chetlur and Woolley, 2014). The framework requires approximately 10 hours for the distant-supervised phase and only 20-30 minutes for the supervised phase.
Experiments were conducted on g2.2xlarge instances of Amazon Web Services (AWS), which offer a GRID K520 GPU with 3072 CUDA cores and 8 GB of GDDR5 RAM.

| Parameter | System I | System II |
|---|---|---|
| Number of convolutional filters | m = 200 | m = 300 |
| Filter window size h | h_1 = 6, h_2 = 3 | h_1 = 6, h_2 = 4 |
| Size of first max-pooling interval | width = 6, striding = 2 | width = 3, striding = 3 |
| Activation function α | relu | relu |

Table 1: Summary of the parameters used in Systems I and II.

Data
The training and development datasets used in our experiments were provided by the SemEval-2016 competition. A fraction of the tweets (10-15%) from the period 2013-2015 was no longer available on Twitter, which makes the results of this year's competition not directly comparable to those of previous years. For testing, new tweets were made available in addition to last year's data (tweets, SMS, LiveJournal). An overview of the data available for download is given in Table 2.
Data preparation. Before extracting features, the tweets were preprocessed using the following procedure:
• URLs and usernames were substituted by a replacement token.
• The text was lowercased.
• The NLTK Twitter tokenizer was employed in System I and a customized version of the CMU ARK Twitter Part-of-Speech Tagger (Gimpel et al., 2011) in System II.
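The preprocessing steps listed above can be sketched as follows. The replacement tokens `<url>` and `<user>`, the regular expressions, and the whitespace split standing in for the NLTK and CMU ARK tokenizers are all assumptions made to keep the example self-contained.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")


def preprocess(tweet):
    """Replace URLs and usernames, lowercase the text, and tokenize.

    The whitespace split is a stand-in for a Twitter-aware tokenizer.
    """
    text = URL_RE.sub("<url>", tweet)
    text = USER_RE.sub("<user>", text)
    text = text.lower()
    return text.split()


tokens = preprocess("@Bob Check THIS out http://t.co/abc NOW")
```

Replacing URLs and usernames before lowercasing keeps the patterns simple and leaves the replacement tokens unchanged by the lowercasing step.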

Results
The F1-score was computed by the competition organizers as the evaluation measure. The presented system was ranked 1st out of 34 participants, with an F1-score of 63.30 on the Twitter-2016 test set. See (Nakov et al., 2016) for further details. Table 3 summarizes the results of the individual subsystems, as well as of the final system, on each test set. For each test set, the best score is marked in bold face. For the Twitter2016 and Twitter2015 test sets, we marked the best-performing subsystem in italics. For System I (S1), we observed that the F1-scores measured on the different test sets showed large deviations during the supervised phase. Hence, to improve robustness, we considered six different models of S1 (S1a, ..., S1f), varying the number of epochs between 12 and 25 during the supervised phase, with stopping determined by the validation score on different sets. For System II (S2), the number of epochs (12) was chosen as the one achieving the highest score on the DevTest2016 set.
The final system (FS), which uses the meta-classifier trained on the outputs of systems S1a-f and S2, achieves the highest score on the 2016 test set, with the competition-winning F1-score of 63.30%. This improves on the score of the best-performing subsystem (S1b) by 0.57 points. For the 2015 test set, FS shows an improvement of 0.35 points with respect to the score of the best subsystem (S1f).
Table 3: Overall results of the proposed subsystems (columns: S1a, S1b, S1c, S1d, S1e, S1f, S2, FS). S1: System(s) I; S2: System II; FS: Final system, using the meta-classifier. Best (second-best) results are highlighted in bold (underlined) face.
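As an illustration of the evaluation measure, the sketch below computes the F1-score averaged over the positive and negative classes, which is the measure used for message polarity classification in SemEval. The function names and the toy labels are hypothetical; the official scores were computed by the organizers' own scorer.

```python
def f1_for_class(gold, pred, label):
    """F1-score of a single class from gold and predicted labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_pos_neg(gold, pred):
    """Average of the F1-scores of the positive and negative classes."""
    return (f1_for_class(gold, pred, "positive")
            + f1_for_class(gold, pred, "negative")) / 2


gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "neutral", "neutral", "negative"]
score = f1_pos_neg(gold, pred)
```

The neutral class does not get its own term in the average, but neutral tweets still affect the score through the precision and recall of the other two classes.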

Conclusion
We described a deep learning framework to predict the sentiment polarity of short texts, such as tweets. The proposed approach is based on an ensemble of Convolutional Neural Networks and relies on a large amount of data for the distant-supervised phase. The final random forest classifier achieved state-of-the-art performance, ranking 1st in the SemEval-2016 competition for the task of Message Polarity Classification.