PotTS at SemEval-2016 Task 4: Sentiment Analysis of Twitter Using Character-level Convolutional Neural Networks.

This paper presents an alternative approach to polarity and intensity classification of sentiments in microblogs. In contrast to previous work, which relied either on carefully designed hand-crafted feature sets or on automatically derived neural embeddings of words, our method harnesses character embeddings as its main input units. We obtain task-specific vector representations of characters by training a deep multi-layer convolutional neural network on the labeled dataset provided to the participants of the SemEval-2016 Shared Task 4 (Sentiment Analysis in Twitter; Nakov et al., 2016b) and subsequently evaluate our classifiers on subtasks B (two-way polarity classification) and C (joint five-way prediction of polarity and intensity) of this competition. Our first system, which uses three parallel sets of convolutions followed by four non-linear layers, ranks 16th in the former track, while our second network, which consists of a single convolutional filter set followed by a highway layer and three non-linearities with linear mappings in between, attains 10th place on subtask C.¹


Introduction
Sentiment analysis (SA) - a field of knowledge which deals with the analysis of people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards particular entities mentioned in discourse (Liu, 2012) - is commonly considered to be one of the most challenging and competitive, but at the same time most necessary, areas of research in modern computational linguistics.¹

¹ The source code of our implementation is freely available online at https://github.com/WladimirSidorenko/SemEval-2016/
Unfortunately, despite numerous notable advances in recent years (e.g., Breck et al., 2007; Yessenalina and Cardie, 2011; Socher et al., 2011), many of the challenges in the opinion mining field, such as domain adaptation or analysis of noisy texts, still pose considerable difficulties to researchers. In this respect, rapidly evaluating and comparing different approaches to solving these problems in a controlled environment - like the one provided for the SemEval task (Nakov et al., 2016b) - is of crucial importance for finding the best possible way of mastering them.
We also pursue this goal in the present paper by investigating whether one of the newest machine learning trends -the use of deep neural networks (DNNs) with small receptive fields -would be a viable solution for improving state-of-the-art results in sentiment analysis of Twitter.
After a brief summary of related work in Section 2, we present the architectures of our networks and describe the training procedure we used for them in Section 3. Since we applied two different DNN topologies to subtasks B and C, we make a crosscomparison of both systems and evaluate the role of the preprocessing steps in the next-to-last section. Finally, in Section 5, we draw conclusions from our experiments and make further suggestions for future research.

Related Work
Since its presumably first official mention by Nasukawa and Yi in 2003 (cf. Liu, 2012), sentiment analysis has constantly attracted the attention of researchers. Though earlier works on opinion mining were primarily concerned with the analysis of narratives (Wiebe, 1994) or newspaper articles (Wiebe et al., 2003), the explosive emergence of social media (SM) services in the mid-2000s has brought about a dramatic focus change in this field.
A particularly important role in this regard was played by Twitter - a popular microblogging service first introduced by Jack Dorsey in 2006 (Dorsey, 2006). The sudden availability of huge amounts of data combined with the presence of all possible social and national groups on this stream rapidly gave rise to a plethora of scientific studies. Notable examples of these were the works conducted by Go et al. (2009) and Pak and Paroubek (2010), who obtained their corpora using distant supervision and subsequently trained several classifiers on these data; Kouloumpis et al. (2011), who trained an AdaBoost system on the Edinburgh Twitter corpus²; and Agarwal et al. (2011), who proposed tree-kernel methods for doing message-level sentiment analysis of tweets.
Eventually, with the introduction of the SemEval corpus (Nakov et al., 2013), a great deal of automatic systems and resources have appeared on the scene. Though most of these systems typically rely on traditional supervised classification methods, such as SVM (Mohammad et al., 2013;Becker et al., 2013) or logistic regression (Hamdan et al., 2015;Plotnikova et al., 2015), in recent years, the deep learning (DL) tsunami (Manning, 2015) has also started hitting the shores of this "battlefield".
In this paper we investigate whether one of the newest lines of research in DL - the use of character-level deep neural networks (charDNNs) - would be a promising way of addressing the sentiment analysis task on Twitter as well.
Introduced by Sutskever et al. (2011), char-DNNs have already proven their effectiveness for a variety of NLP applications, including part-of-speech tagging (dos Santos and Zadrozny, 2014), named-entity recognition (dos Santos and Guimarães, 2015), and general language modeling (Kim et al., 2015; Józefowicz et al., 2016). We hypothesized that the reduced feature sparsity of this approach, its lower susceptibility to informal spellings, and the shift of the main discriminative classification power from input units to transformation layers would make it suitable for doing opinion mining on Twitter as well.

² http://demeter.inf.ed.ac.uk

Method
To test our conjectures, we downloaded the training and development data provided by the organizers of the SemEval-2016 Task 4 (Sentiment Analysis in Twitter; Nakov et al., 2016b). Since this content changes dynamically over time, we were only able to retrieve a total of 5,178 messages for subtasks B and D (two-way polarity classification) and 7,335 microblogs for subtasks C and E (joint five-way prediction of polarity and intensity).
We deliberately refrained from any heavyweight NLP preprocessing of these data in order to check whether the applied DL method alone would suffice to get acceptable results. To facilitate the training and reduce the variance of the learned weights, though, we applied a shallow normalization of the input by lower-casing the messages' strings and filtering out stop words before passing the tweets to the classifiers.
As stop words we considered all auxiliary verbs (e.g., be, have, do) and auxiliary parts of speech (prepositions, articles, particles, and conjunctions), with a few exceptions: we kept negations and words that could potentially invert the polarity of opinions, e.g., without, but, and however. Furthermore, we also removed hyperlinks, digits, retweets, @-mentions, common temporal expressions, and mentions of tweets' topics, since all of these elements were a priori guaranteed to be objective. An example of such a preprocessed microblog is provided below:

EXAMPLE 3.1.
Original: Going to MetLife tomorrow but not to see the boys is a weird feeling
Normalized: but not see boys weird feeling
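The shallow normalization described above can be sketched as follows; note that the stop-word and exception lists here are illustrative stand-ins, not the exact lists used in our system:

```python
import re

# Illustrative stop-word and exception lists (the actual lists cover all
# auxiliary verbs and auxiliary parts of speech; these are stand-ins).
STOP_WORDS = {"going", "to", "the", "is", "a", "be", "have", "do"}
KEEP = {"not", "no", "without", "but", "however"}

def normalize(tweet: str) -> str:
    # Lower-case the message string.
    text = tweet.lower()
    # Strip hyperlinks, retweet markers, @-mentions, and digits.
    text = re.sub(r"https?://\S+|\brt\b|@\w+|\d+", " ", text)
    # Filter out stop words while keeping negations and polarity shifters.
    tokens = [t for t in text.split() if t in KEEP or t not in STOP_WORDS]
    return " ".join(tokens)
```

With these placeholder lists, a message such as "RT @user check http://t.co/x 2016 but not without hope" is reduced to "check but not without hope".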

Adversarial Convolutional Networks (Subtasks B and D)
We then defined a multi-layer deep convolutional network for subtasks B and D as follows: At the initial step, we map the input characters to their appropriate embeddings, obtaining an input matrix E ∈ R n×m , where n stands for the length of the input instance, and m denotes the dimensionality of the embedding space (specifically, we use m = 32).
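In code, this embedding lookup can be sketched as follows; the alphabet and the random initialization are illustrative, since in the actual system the embeddings are trained jointly with the network:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 32  # embedding dimensionality, as in the paper
alphabet = sorted(set("abcdefghijklmnopqrstuvwxyz "))
char2idx = {c: i for i, c in enumerate(alphabet)}

# Embedding matrix: one m-dimensional vector per character (random here;
# in the real system these vectors are learned during training).
emb = rng.standard_normal((len(alphabet), m))

def embed(text: str) -> np.ndarray:
    """Map a string to its input matrix E of shape (n, m)."""
    idx = [char2idx[c] for c in text if c in char2idx]
    return emb[idx]

E = embed("but not see boys weird feeling")  # E has shape (n, 32)
```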
Next, three sets of convolutional filters - positive (+), negative (−), and shifter (×) convolutions - are applied to the input matrix E. Each of these sets in turn consists of three subsets: one subset with 4 filters of width 3, another comprising 8 filters of width 4, and, finally, a third having 12 filters of width 5. Each filter F of a subset forms a matrix in R^{w×m}, with the number of rows w corresponding to the filter width and the number of columns m being equal to the embedding dimensionality as above. A subset of filters S^p_w for p ∈ {+, −, ×} is then naturally represented as a tensor in R^{c×w×m}, where c is the number of filters with the given width w.
We apply the usual convolution operation with max-pooling over time for each filter, getting an output vector v_{S^p_w} ∈ R^c for each subset. All output vectors v_{S^p_*} of the same set are then concatenated. The results of the three sets are subsequently joined using the following equation:

v_conv = σ(v_{S^+} − v_{S^−}) ⊙ tanh(v_{S^×}),

where v_{S^+}, v_{S^−}, and v_{S^×} mean the output vectors for the positive, negative, and shifter sets respectively, and ⊙ denotes the Hadamard product.
The motivation behind this choice of unification function is that we first want to obtain the difference between the positive and negative predictions (thus v_{S^+} − v_{S^−}), then map this difference to the range [0, 1] (hence the sigmoid), and finally either invert or dampen the result depending on the output of the shifter layer, whose values are guaranteed to lie in the range [−1, 1] thanks to tanh. Since we simultaneously apply competing convolutions to the same input, we call this layer "adversarial", as all of its components have different opinions regarding the final outcome.
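A minimal sketch of one convolution subset with max-pooling over time, followed by the adversarial combination; shapes follow the definitions above, and the filter weights are random placeholders:

```python
import numpy as np

def conv_maxpool(E: np.ndarray, F: np.ndarray) -> np.ndarray:
    """Convolve a filter subset F (c, w, m) over the input E (n, m)
    and max-pool over time, returning a vector of shape (c,)."""
    c, w, m = F.shape
    n = E.shape[0]
    out = np.empty((n - w + 1, c))
    for t in range(n - w + 1):
        # Each filter is dotted with a w x m window of the input.
        out[t] = np.tensordot(F, E[t:t + w], axes=([1, 2], [0, 1]))
    return out.max(axis=0)  # max-pooling over time

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adversarial_join(v_pos, v_neg, v_shift):
    # sigma(v+ - v-) squashes the positive/negative difference into [0, 1];
    # tanh(vx), lying in [-1, 1], can invert or dampen it elementwise.
    return sigmoid(v_pos - v_neg) * np.tanh(v_shift)

rng = np.random.default_rng(0)
E = rng.standard_normal((10, 4))     # toy input: n = 10 characters, m = 4
F = rng.standard_normal((3, 2, 4))   # toy subset: c = 3 filters of width 2
v = conv_maxpool(E, F)
```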
After obtaining v_conv, we consecutively apply three non-linear transformations (linear rectification, hyperbolic tangent, and sigmoid function) with linear mappings in between:

v_sig = σ(M_sig · tanh(M_tanh · relu(M_relu · v_conv + b_relu) + b_tanh) + b_sig).

In this equation, M_relu, M_tanh, and M_sig ∈ R^{24×24} stand for the linear transformation matrices, and b_relu, b_tanh, b_sig ∈ R^{24} represent the usual bias terms. With this combination, we hope to first prune unreliable input signals by using a hard rectified linear unit (Jarrett et al., 2009) and then gain more discriminative power by successively applying tanh and σ, thus funneling the input to increasingly smaller ranges: [−1, 1] in the case of tanh, and [0, 1] in the case of the sigmoid.
At the last stage, after applying a binomial dropout mask with p = 0.5 to the v_sig vector (Srivastava et al., 2014), we compute the final prediction as:

ŷ = σ( Σ_{j=1}^{2} (M_pred^T · v_sig + b_pred)_j ),    (1)

where M_pred ∈ R^{24×2} and b_pred ∈ R^2 stand for the transformation matrix and bias term respectively, and the summation runs over the two elements of the resulting R^2 vector.
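The whole transformation from v_conv to the final prediction can be sketched as follows; the weights here are randomly initialized placeholders, whereas in the trained system they are learned:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 24  # dimensionality of v_conv in the adversarial network

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Placeholder parameters (learned in the real system).
M_relu, M_tanh, M_sig = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
b_relu, b_tanh, b_sig = (np.zeros(d) for _ in range(3))
M_pred = rng.standard_normal((d, 2)) * 0.1
b_pred = np.zeros(2)

def forward_tail(v_conv: np.ndarray, train: bool = False) -> float:
    # Three stacked non-linearities with linear mappings in between.
    v_relu = np.maximum(0.0, M_relu @ v_conv + b_relu)
    v_tanh = np.tanh(M_tanh @ v_relu + b_tanh)
    v_sig = sigmoid(M_sig @ v_tanh + b_sig)
    if train:
        # Binomial dropout mask with p = 0.5 (training only).
        v_sig = v_sig * rng.binomial(1, 0.5, size=d)
    # Sigmoid of the sum of the two-dimensional output.
    return float(sigmoid(np.sum(M_pred.T @ v_sig + b_pred)))

y_hat = forward_tail(rng.standard_normal(d))
```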
To train our classifier, we define the cost function as the usual cross-entropy:

J = − Σ_i [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ],

where y_i denotes the gold category of the i-th training instance and ŷ_i stands for its predicted score, and optimize this function using RMSProp (Tieleman and Hinton, 2012).

Highway Convolutional Networks (Subtasks C and E)
A slightly different model was used for subtasks C and E: In contrast to the previous two-way classification network, we use only one set of convolutions, with 4 filters of width 3, 16 filters of width 4, and 24 filters of width 5, so that the dimensionality of the resulting v_conv vector is 44 instead of 24.

[Figure: (a) Adversarial network used for subtasks B and D. (b) Highway network used for subtasks C and E.]
After computing and max-pooling the convolutions as before, we pass the output convolution vector through a highway layer (Srivastava et al., 2015) in addition to using relu, i.e.:

t = σ(M_T · v_conv + b_T),
v_hw = t ⊙ relu(M_H · v_conv + b_H) + (1 − t) ⊙ v_conv.

The rest of the network is organized the same way as in the previous model, up to the final layer. Since this task involves multi-class classification, instead of computing the sigmoid of the sum as in Equation 1, we obtain a softmax vector v_σ ∈ R^5 and consider the argmax of this vector as the predicted class:

ŷ = argmax_{j ∈ {1,…,5}} v_σ[j].

The corresponding cost function is appropriately defined as:

J = − Σ_i log v_σ[y_i] + λ_2 Σ_p ||p||²_2 + λ_3 Σ_i (ŷ_i − y_i)²,

where v_σ[y_i] means the probability value for the gold class in the v_σ vector, λ_2 and λ_3 are constants (we use λ_2 = 1e−5 and λ_3 = 3e−4), the p's denote the training parameters of the model, and (ŷ_i − y_i)² stands for the squared difference between the numerical values of the predicted and gold classes.
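The highway transformation can be sketched as follows; the gate and transform parameter names (W_T, W_H, and the biases) are hypothetical placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway(v, W_H, b_H, W_T, b_T):
    """Highway layer (Srivastava et al., 2015) with a relu transform."""
    t = sigmoid(W_T @ v + b_T)           # transform gate in (0, 1)
    h = np.maximum(0.0, W_H @ v + b_H)   # relu transform of the input
    return t * h + (1.0 - t) * v         # gated mix of transform and carry

rng = np.random.default_rng(2)
d = 44  # dimensionality of v_conv in the highway network
v = rng.standard_normal(d)
out = highway(v, rng.standard_normal((d, d)) * 0.1, np.zeros(d),
              rng.standard_normal((d, d)) * 0.1, np.zeros(d))
```

When the gate t is driven towards zero, the layer simply carries its input through unchanged, which makes the extra depth easy to optimize.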
In this task, we opted for L2 regularization instead of dropout, since we found that it worked slightly better on the development set, although the differences between the two methods were small and the derivative computation with dropout was significantly faster.

Initialization and Training
Because initialization has a crucial impact on the results of deep learning approaches (Sutskever et al., 2011), we did not rely on purely random weights but instead used the uniform He method (He et al., 2015) to initially set the embeddings, convolutional filters, and bias terms. The inter-layer transformations were set to orthogonal matrices to ensure their full rank.
Additionally, to ensure that each preceding network stage was maximally prepared and provided the best possible output to its successors, after adding each new intermediate layer, we temporarily short-circuited it to the final output node(s) and pre-trained this abridged network for 5 epochs, removing the short-circuit connections afterwards. The final training then took 50 epochs, with each epoch lasting for 35 iterations over the provided training data.
Since our models appeared to be very susceptible to imbalanced classes, we subsampled the training data by getting min(1.1 * n min , n c ) samples for each distinct gold category c, where n min is the number of instances of the rarest class in the corpus, and n c denotes the number of training examples belonging to the c-th class. This subset was resampled anew for each subsequent training epoch.
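The class-balancing subsampling step can be sketched as:

```python
import random
from collections import defaultdict

def subsample(data, ratio=1.1, rng=random.Random(4)):
    """Draw min(ratio * n_min, n_c) instances for each gold class c,
    where n_min is the size of the rarest class in the corpus."""
    by_class = defaultdict(list)
    for x, y in data:
        by_class[y].append((x, y))
    n_min = min(len(items) for items in by_class.values())
    quota = int(ratio * n_min)
    sample = []
    for items in by_class.values():
        # Never draw more instances than the class actually has.
        sample.extend(rng.sample(items, min(quota, len(items))))
    return sample
```

Calling this function anew before every epoch yields a freshly resampled, roughly balanced training subset each time.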
Finally, to steer our networks towards recognizing correct features, we randomly added additional training instances from two established sentiment lexica: Subjectivity Clues (Wilson et al., 2005) and the NRC Hashtag Affirmative Context Sentiment Lexicon (Kiritchenko et al., 2014). To that end, we drew n binary random numbers for each polarity class in the corpus from a binomial distribution B(n, 0.1), where n stands for the total size of the generated training set, and added a uniformly chosen term from either lexicon whenever the sampled value was equal to one. In the same way, we randomly (with the probability B(m, 0.15), where m means the number of matches) replaced occurrences of lexicon terms in the training tweets with other uniformly drawn lexicon items.

Evaluation
To train our final model, we used both training and development data provided by the organizers, setting aside 15 percent of the samples drawn in each epoch for evaluation and using the remaining 85 percent for optimizing the networks' weights.
We obtained the final classifier by choosing the network state that produced the best task-specific score on the set-aside part of the corpus during training. For this purpose, in each training iteration, we estimated the macro-averaged recall ρ_PN on the evaluation set for subtask B:

ρ_PN = (ρ_Pos + ρ_Neg) / 2,

and computed the macro-averaged mean absolute error measure MAE^M (cf. Nakov et al., 2016a) to select a model for track C:

MAE^M = (1/|C|) Σ_{c ∈ C} (1/|T_c|) Σ_{i ∈ T_c} |ŷ_i − y_i|,

where C is the set of gold classes and T_c denotes the set of evaluation instances with gold class c.

The resulting models were then used in both the classification and quantification subtasks of the SemEval competition, i.e., we used the adversarial network with the maximum ρ_PN score observed during training to generate the output for tracks B and D, and applied the highway classifier with the minimum achieved MAE^M rate to get predictions for subtasks C and E. The scores of the final evaluation on the official test set are shown in Table 1.

Since many of our parameter and design choices were made empirically by analyzing the systems' errors at each development step, we decided to recheck whether these decisions were still optimal for the final configuration. To that end, we re-evaluated the effects of the preprocessing steps by temporarily switching off lower-casing and stop-word filtering, and also estimated the impact of the network structure by applying the model architecture used for subtask B to the five-way prediction task and, vice versa, using the highway network for the binary classification. The output layers, cost functions, and regularization of the two approaches were swapped accordingly in these experiments.
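The two model-selection measures ρ_PN and MAE^M used above can be sketched as follows; the label encodings are illustrative:

```python
def rho_pn(gold, pred):
    """Macro-averaged recall over the positive and negative classes."""
    recalls = []
    for c in ("positive", "negative"):
        hits = [g == p for g, p in zip(gold, pred) if g == c]
        recalls.append(sum(hits) / len(hits))
    return sum(recalls) / 2.0

def mae_m(gold, pred, classes=(-2, -1, 0, 1, 2)):
    """Macro-averaged mean absolute error over the intensity classes,
    here encoded as integers from -2 (very negative) to 2 (very positive)."""
    errors = []
    for c in classes:
        diffs = [abs(g - p) for g, p in zip(gold, pred) if g == c]
        if diffs:  # skip classes absent from the gold data
            errors.append(sum(diffs) / len(diffs))
    return sum(errors) / len(errors)
```

Because both measures average over classes rather than instances, they are robust to the heavy class imbalance of the Twitter data.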
Because re-running the complete training from scratch was relatively expensive (taking eight to ten hours on our machine), we reduced the number of training epochs by a factor of five, but tested each configuration three times in order to average out the random factors in the He initialization. The arithmetic mean and standard deviation (computed with denominator N − 1 = 2) of these three outcomes for each setting are also provided in the table.
As can be seen from the results, running fewer training epochs does not notably harm the final prediction quality for the binary task. On the contrary, it might even lead to some improvements for the adversarial network. We explain this effect by the fact that the model selected during the shorter training had a lower score on the evaluation set than the network state chosen during 50 epochs. Nevertheless, despite its worse evaluation results, this first configuration generalized better to the test data than the second system, which apparently overfitted the set-aside part of the corpus. Furthermore, we can also observe a mixed effect of the normalization on the two tasks: while keeping stop words and preserving character case deteriorates the results for the binary classification, abandoning any preprocessing steps turns out to be a more favorable solution when doing five-way prediction. The reasons for this different behavior are presumably twofold: a) the character case by itself might serve as a good indicator of sentiment intensity but be rather irrelevant to expressing its polarity, and b) the training instances might have become scarce as the number of possible gold classes in the corpus increased.
Finally, one can also see that the highway network performs slightly better on both subtasks (two- and five-way) than its adversarial counterpart when used with shorter training. In this case, we assume that the swapping of the regularization and cost functions has obscured the distinctions of the two networks at their initial layers, since, in our earlier experiments, we did observe better results for the two-way classification with the adversarial structure.

Discussion and Conclusion
Unfortunately, despite our seemingly sound theoretical assumptions set forth at the beginning, relying on character embeddings as input did not work out in practice in the end. Our adversarial system was ranked only fourth to last on subtask B, and the highway network attained the second-to-last place in track C. However, knowing this outcome in advance was not possible without trying out these approaches first.
In order to make a retrospective error analysis, we computed the correlation coefficients between the character n-grams occurring in the training data and their gold classes, also comparing these figures with the corresponding numbers obtained on the test set. The results of this comparison are shown in Table 2. As can be seen from the table, the most reliable classification traits that could have been learned during the training are very specific to their respective topics - in particular, Trump and Turkey appear to be very negatively biased terms. This effect becomes even more evident as the length of the character n-grams increases. The reason why we did not prefilter these substrings in the preprocessing was that the respective topics of these messages were specified as donald trump and erdogan, whereas we only removed exact topic matches from the tweets.
Due to this evident topic susceptibility, as a possible way to improve our results, we could imagine the inclusion of more training data. Applying ensemble approaches, as was done by the top-scoring systems this year, could also be a promising direction to go. We would, however, advise the reader against further experimenting with network architectures (at least when training on the original SemEval dataset only), since both the recursive (RNTN; Socher et al., 2012) and recurrent (LSTM; Hochreiter and Schmidhuber, 1997) variants of neural classifiers were found to perform worse in our experiments than the feed-forward structure we described.