TopicThunder at SemEval-2017 Task 4: Sentiment Classification Using a Convolutional Neural Network with Distant Supervision

In this paper, we propose a classifier for predicting the topic-specific sentiment of English Twitter messages. Our method is based on a 2-layer CNN. In a distant-supervised phase, we leverage a large amount of weakly-labelled training data. Our system was evaluated on the data provided by the SemEval-2017 competition for the Topic-Based Message Polarity Classification subtask, where it ranked 4th.


Introduction
The goal of sentiment analysis is to teach machines to understand the emotions and opinions expressed in text. Numerous challenges make this task difficult, for instance the complexity of natural language, which makes use of intricate concepts such as irony, sarcasm, and metaphor. Often we are not interested in the overall sentiment of a tweet but rather in the sentiment the tweet expresses towards a topic of interest. Subtask B of SemEval-2017 (Rosenthal et al., 2017) consists of predicting the sentiment of a tweet towards a given topic as either positive or negative. Convolutional neural networks (CNNs) have been shown to be very effective at sentiment analysis, as they learn features themselves instead of relying on manually crafted ones (Kalchbrenner et al., 2014; Kim, 2014). Our work is based on the system proposed by Deriu et al. (2016), which in turn is based on Severyn and Moschitti (2015). CNNs typically have a large number of parameters and need sufficient data to be trained effectively. In this work, we leverage a large amount of data to train a multi-layer CNN. The training follows a 3-phase procedure (see Figure 1): (i) creation of word embeddings based on an unsupervised corpus of 200M English tweets, (ii) a distant-supervised phase using an additional 100M tweets, where the labels are inferred from weak sentiment indicators (e.g. smileys), and (iii) a supervised phase, where the pre-trained network is trained on the provided data. We evaluated the approach on the SemEval-2017 datasets for subtask B, where it ranked 4th out of 23 submissions.

Model
Our system is based on the 2-layer CNN proposed by Deriu et al. (2016), as depicted in Figure 2 and described in detail below. We refer to one convolutional layer followed by one pooling layer as a layer; thus, a 2-layer network consists of two consecutive pairs of convolutional-pooling layers.
Word embeddings. The word embeddings are initialized using word2vec (Mikolov et al., 2013) and trained on an unlabelled corpus of 200M tweets. We apply a skip-gram model with window size 5 and filter out words that occur fewer than 15 times. The dimensionality of the vector representation is set to d = 52. The resulting vocabulary is stored as a mapping V : t → i, which maps a token to an index; unknown tokens are mapped to a default index. The word embeddings are stored in a matrix E ∈ R^{v×d}, where v is the size of the vocabulary and the i-th row is the embedding of the token with index V(t) = i.
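The vocabulary mapping V : t → i can be sketched as follows; the `<unk>` token and the index-0 default are illustrative assumptions, not the paper's exact implementation.

```python
from collections import Counter

UNK_INDEX = 0  # default index for unknown tokens (assumed convention)

def build_vocab(tokens, min_count=15):
    """Map each token occurring at least min_count times to a unique index."""
    counts = Counter(tokens)
    vocab = {"<unk>": UNK_INDEX}
    for tok, c in counts.items():
        if c >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def lookup(vocab, token):
    # tokens outside the vocabulary fall back to the default index
    return vocab.get(token, UNK_INDEX)
```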
Preprocessing. The preprocessing consists of three steps: (i) lower-casing the tweet, (ii) replacing URLs and usernames with a special token, and (iii) tokenizing the tweet using the NLTK TweetTokenizer¹.
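A minimal sketch of these three steps; the placeholder tokens and the whitespace tokenizer (standing in for NLTK's TweetTokenizer) are illustrative assumptions.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # matches URLs
USER_RE = re.compile(r"@\w+")          # matches @usernames

def preprocess(tweet):
    tweet = tweet.lower()               # (i) lower-case the tweet
    tweet = URL_RE.sub("<url>", tweet)  # (ii) replace URLs ...
    tweet = USER_RE.sub("<user>", tweet)  # ... and usernames with special tokens
    return tweet.split()                # (iii) tokenize (simplified)
```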
Input layer. Each tweet is converted to a sequence of indices by applying the aforementioned preprocessing. The tokens are mapped to their respective indices in the vocabulary V.
Sentence model. Each word is associated with a d-dimensional vector representation. A tweet is represented by the concatenation of the representations of its n constituent words. This yields a matrix X ∈ R^{d×n}, which is used as input to the convolutional neural network.
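A sketch of building the sentence matrix X ∈ R^{d×n}, assuming NumPy is available; the tiny embedding table in the test is an illustrative assumption.

```python
import numpy as np

def sentence_matrix(E, indices):
    """Stack the d-dimensional embedding of each of the n tokens as a column.

    E: (v, d) embedding matrix; indices: list of n token indices.
    Returns X with shape (d, n).
    """
    return E[indices].T
```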
Convolutional layer. In this layer, a set of m filters is applied to a sliding window of length h over each sentence. Let X_{[i:i+h]} denote the concatenation of the word vectors x_i to x_{i+h}. A feature c_i is generated for a given filter F by:

c_i := Σ_{a,b} (X_{[i:i+h]} ⊙ F)_{a,b}

where ⊙ denotes the element-wise product. The concatenation of all features over a sentence produces a feature vector c ∈ R^{n−h+1}. The vectors c are then aggregated over all m filters into a feature map matrix C ∈ R^{m×(n−h+1)}. The filters are learned during the training phase of the neural network.

¹ http://www.nltk.org/api/nltk.tokenize.html
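The feature computation for a single filter can be sketched as follows (assuming NumPy; the filter values in the test are illustrative):

```python
import numpy as np

def convolve(X, F):
    """Compute c_i = sum of the element-wise product of F and window X[:, i:i+h].

    X: (d, n) sentence matrix; F: (d, h) filter.
    Returns the feature vector c of length n - h + 1.
    """
    d, n = X.shape
    h = F.shape[1]
    return np.array([np.sum(X[:, i:i + h] * F) for i in range(n - h + 1)])
```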
Max pooling. The output of the convolutional layer is passed through a non-linear activation function before entering a pooling layer. The latter aggregates vector elements by taking the maximum over a fixed set of non-overlapping intervals. The resulting pooled feature map matrix has the form C_pooled ∈ R^{m×((n−h+1)/s)}, where s is the length of each interval. In the case of overlapping intervals with a stride value s_t, the pooled feature map matrix has the form C_pooled ∈ R^{m×((n−h+1−s)/s_t + 1)}. Depending on whether the borders are included or not, the result of the fraction is rounded up or down, respectively.
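Pooling over one feature vector can be sketched as below (borders excluded, i.e. the fraction is rounded down); the input values are illustrative.

```python
def max_pool(values, s, s_t=None):
    """Max over intervals of length s with stride s_t (s_t = s: non-overlapping)."""
    s_t = s if s_t is None else s_t
    out = []
    i = 0
    while i + s <= len(values):  # drop incomplete border intervals (floor)
        out.append(max(values[i:i + s]))
        i += s_t
    return out
```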
Hidden layer. A fully connected hidden layer computes the transformation α(W·x + b), where W ∈ R^{m×m} is the weight matrix, b ∈ R^m the bias, and α the rectified linear unit (relu) activation function (Nair and Hinton, 2010). The output vector of this layer, x ∈ R^m, corresponds to the sentence embedding of each tweet. The probability of class j is computed as P(y = j | x, w_j, a_j) ∝ exp(x^T w_j + a_j), where w_j denotes the weight vector of class j and a_j the bias of class j.
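A rough sketch of the hidden-layer transformation and the softmax normalization (assuming NumPy; the weights in the test are illustrative, not trained parameters):

```python
import numpy as np

def relu(z):
    # rectified linear activation: max(z, 0) element-wise
    return np.maximum(z, 0.0)

def hidden(W, x, b):
    # fully connected hidden layer alpha(Wx + b)
    return relu(W @ x + b)

def softmax(scores):
    # subtract the max for numerical stability before exponentiating
    e = np.exp(scores - np.max(scores))
    return e / e.sum()
```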

Network parameters. Training the neural network consists in learning the set of parameters θ = (E, F_1, b_1, F_2, b_2, W, a), where E is the embedding matrix, with each row containing the d-dimensional embedding vector of a specific word; F_i, b_i (i ∈ {1, 2}) are the filter weights and biases of the first and second convolutional layers; W is the concatenation of the weight vectors w_j for every output class in the softmax layer; and a is the bias of the softmax layer.
Training. We pre-train the word embeddings on an unsupervised corpus of 200M tweets. During the distant-supervised phase, we use emoticons to infer the polarity of an additional set of 100M tweets (Read, 2005; Go et al., 2009). The resulting dataset contains 79M positive and 21M negative tweets. The neural network is trained on these 100M tweets for one epoch. Then, the network is trained in a supervised phase on the labelled data provided by SemEval-2017.
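The distant-supervision labelling can be sketched as below; the emoticon lists are illustrative assumptions, not the exact indicators used in the paper.

```python
# Emoticons act as weak sentiment indicators (Read, 2005; Go et al., 2009).
POSITIVE = {":)", ":-)", ":D"}
NEGATIVE = {":(", ":-("}

def weak_label(tweet):
    """Infer a weak polarity label from emoticons, or None if none occur."""
    tokens = tweet.split()
    if any(t in POSITIVE for t in tokens):
        return "positive"
    if any(t in NEGATIVE for t in tokens):
        return "negative"
    return None  # tweet is excluded from the distant-supervised corpus
```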
To avoid over-fitting we employ early stopping: training is stopped if the score on the validation set does not increase for 50 epochs. The word embeddings E ∈ R^{v×d} are updated during both the distant-supervised and the supervised training phases, as back-propagation is applied through the entire network.
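The early-stopping criterion above can be sketched as a simple patience check (the function name and list-based interface are illustrative):

```python
def should_stop(val_scores, patience=50):
    """Stop when the best validation score is older than `patience` epochs."""
    if len(val_scores) <= patience:
        return False
    best = max(val_scores)
    # if the best score did not occur within the last `patience` epochs, stop
    return best not in val_scores[-patience:]
```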
The network parameters are learned using AdaDelta (Zeiler, 2012), which adapts the learning rate for each dimension using only first-order information. We used the hyper-parameters ε = 1e−6 and ρ = 0.95, as suggested by Zeiler (2012).
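For illustration, the AdaDelta update rule for a single scalar parameter can be sketched as follows, with ρ = 0.95 and ε = 1e−6 as above; this is a minimal sketch, not the framework implementation used in the paper.

```python
import math

def adadelta_step(grad, state, rho=0.95, eps=1e-6):
    """One AdaDelta update (Zeiler, 2012) for a scalar parameter.

    state = [E_g2, E_dx2]: running averages of squared gradients and updates.
    Returns the parameter update delta (to be added to the parameter).
    """
    state[0] = rho * state[0] + (1 - rho) * grad ** 2           # accumulate gradient
    delta = -math.sqrt(state[1] + eps) / math.sqrt(state[0] + eps) * grad
    state[1] = rho * state[1] + (1 - rho) * delta ** 2          # accumulate update
    return delta
```

As a usage example, repeatedly applying the update to minimize f(x) = x² drives x towards 0 without any hand-tuned learning rate.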
Computing resources. The system is implemented in Python using the Keras framework (Chollet, 2015) with the Theano backend (Theano Development Team, 2016), which accelerates the computations on a GPU. We used an NVIDIA TITAN X GPU to conduct our experiments. To generate the word embeddings we used the Gensim framework (Řehůřek and Sojka, 2010), which implements the word2vec model in Python. Generating the word embeddings took about 2 days; the distant-supervised phase takes approximately 5 hours and the supervised phase up to 10 minutes.

Data
We evaluate our system on the datasets provided by SemEval-2017 for subtask B (see Table 1).

Setup
We use the following hyper-parameters for the 2-layer CNN: in both layers we use 200 filters and a filter length of 6, and in the first layer we use a pooling length of 4 with stride 2. To optimize our results, we varied the batch sizes, introduced class weights to lessen the impact of the unbalanced training set, and experimented with different schemes for shuffling the training set (see below). The class weight of class i is given by n / (c · n_i), where n is the total number of samples in the training set, c is the number of classes (here 2), and n_i is the number of samples belonging to class i.
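The class-weight formula above can be sketched directly (the dictionary interface is an illustrative choice):

```python
def class_weights(counts):
    """weight_i = n / (c * n_i) for each class i.

    counts: mapping from class label to number of samples n_i.
    """
    n = sum(counts.values())  # total number of training samples
    c = len(counts)           # number of classes
    return {cls: n / (c * n_i) for cls, n_i in counts.items()}
```

Minority classes thus receive weights above 1, counteracting the class imbalance during training.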

Results
The following results are computed on Test 2016. We use the macro-averaged recall ρ as the scoring metric for our experiments, as it is more robust to class imbalance (Rosenthal et al., 2017; Esuli and Sebastiani, 2015). The results show that using class weights increases the score by approximately 2 points, from 0.8225 to 0.8403. Figure 3 shows the score as a function of the batch size. Higher batch sizes tend to give higher scores; however, this effect subsides for batch sizes greater than 1200. Figure 4 shows the results when different schemes for shuffling the training set are applied. When we do not shuffle the training set, the macro-averaged recall drops by 4-5 points. We compare two shuffling schemes: (i) epoch-based shuffling, where the training set is shuffled entirely before each epoch, and (ii) batch-based shuffling, where the training set is split into batches and each batch is shuffled separately. The results in Figure 4 show only a small difference of 1 point between the two strategies. Table 2 shows the results on the Test 2017 set, which is the official test set of the competition.
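The macro-averaged recall metric can be sketched as below: the recall of each class is computed separately and then averaged, so each class contributes equally regardless of its size.

```python
def macro_recall(y_true, y_pred):
    """Average of per-class recall over all classes present in y_true."""
    classes = set(y_true)
    recalls = []
    for cls in classes:
        idx = [i for i, y in enumerate(y_true) if y == cls]      # gold members
        correct = sum(1 for i in idx if y_pred[i] == cls)        # correctly recalled
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)
```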
The system achieves a macro-averaged recall of 0.846, thus ranking in 4th place. The BB_twtr system, ranked 1st, outperforms our approach by 4 points, whereas the DataStories and Tweester systems outperform our system by only 1 point. We compare our system to the systems that ranked 1st, 2nd, and 3rd.

Conclusion
We described a deep learning approach to topic-specific sentiment analysis on Twitter. The approach is based on a 2-layer convolutional neural network trained on a large amount of data. Furthermore, we experimented with different parameters to fine-tune our system. The system was evaluated on the SemEval-2017 Topic-Based Message Polarity Classification task, ranking in 4th place.

Figure 1: Overview of the 3-phase training procedure.
Softmax. Finally, the output of the final pooling layer, x ∈ R^m, is fully connected to a softmax regression layer, which returns the class ŷ ∈ [1, K] with the largest probability, i.e., ŷ := arg max_j P(y = j | x, w, a).

Figure 2: The architecture of the CNNs used in our approach.
The training set is composed of Train 2016, Dev 2016, and DevTest 2016, whereas we use Test 2016 as the validation set. The training and validation sets cover 100 topics each. Since our corpus of 200M tweets for the word embeddings consists primarily of tweets from 2013, some of the topics are not covered by the vocabulary. Thus, we downloaded an additional 20M tweets using the topic names as search keys. This reduced the percentage of unknown words from 14% to 13%.

Figure 3: Macro-averaged recall depending on the batch size.

Figure 4: Results for the different schemes of shuffling the training set.

Table 1: Overview of datasets and number of tweets (or sentences) provided in SemEval-2016.

Table 2: Official results on Test 2017.