Amobee at SemEval-2017 Task 4: Deep Learning System for Sentiment Detection on Twitter

This paper describes the Amobee sentiment analysis system, adapted to compete in SemEval 2017 task 4. The system consists of two parts: supervised training of RNN models based on a Twitter sentiment treebank, and the use of feedforward NN, Naive Bayes and logistic regression classifiers to produce predictions for the different sub-tasks. The system reached 3rd place on the 5-label classification task (sub-task C).


Introduction
Sentiment detection is the process of determining whether a text has a positive or negative attitude toward a given entity (topic) or in general. Detecting sentiment on Twitter, a social network where users interact via short 140-character messages, exchanging information and opinions, is becoming ubiquitous. Sentiment in Twitter messages (tweets) can capture the popularity level of political figures, ideas, brands, products and people. Tweets and other social media texts are challenging to analyze as they are inherently different; use of slang, misspelling, sarcasm, emojis and co-mentioning of other messages pose unique difficulties. Combined with the vast amount of Twitter data (mostly public), these make sentiment detection on Twitter a focal point for data science research.
SemEval is a yearly event in which teams compete in natural language processing tasks. Task 4 is concerned with sentiment analysis in Twitter; it contains five sub-tasks which include classification of tweets according to 2, 3 or 5 labels and quantification of sentiment distribution regarding topics mentioned in tweets; for a complete description of task 4 see Rosenthal et al. (2017).
* These authors contributed equally to this work.
This paper describes our system and participation in all sub-tasks of SemEval 2017 task 4. Our system consists of two parts: a recursive neural network trained on a private Twitter dataset, followed by a task-specific combination of model stacking and logistic regression classifiers.
The paper is organized as follows: section 2 describes the training of RNN models, data being used and model selection; section 3 describes the extraction of semantic features; section 4 describes the task-specific workflows and scores. We review and summarize in section 5. Finally, section 6 describes our future plans, mainly the development of an LSTM algorithm.

RNN Models
The first part of our system consisted of training recursive-neural-tensor-network (RNTN) models (Socher et al., 2013).

Data
Our training data for this part was created by taking a random sample from Twitter and having it manually annotated on a 5-label basis to produce fully sentiment-labeled parse trees, much like the Stanford sentiment treebank. The sample contains twenty thousand tweets with sentiment distribution as follows:

Preprocessing
First we build a custom dictionary by means of crawling Wikipedia and extracting lists of brands,
1. Standard tokenization of the sentences, using the Stanford CoreNLP tools.
2. Word-replacement step using the Wiki dictionary with representative keywords.
5. Regex: removing duplicate punctuation marks, replacing URLs with a keyword, removing camel casing.
6. Parsing: parts-of-speech and constituency parsing using a shift-reduce parser, which was selected for its speed over accuracy.
7. NER: using an entity recognition annotator, replacing numbers, dates and locations with representative keywords.
8. Wiki: second step of word-replacement using our custom wiki dictionary.

Training
We used the Stanford CoreNLP sentiment annotator, introduced by Socher et al. (2013). Words are initialized either randomly as d-dimensional vectors, or given externally as word vectors. We used four versions of the training data: with and without lemmatization, and with and without pre-trained word representations (Pennington et al., 2014).

Tweet Aggregation
Twitter messages can comprise several sentences, with different and sometimes contrary sentiments. However, the trained models predict sentiment on individual sentences. We aggregated the sentiment for each tweet by taking a linear combination of the individual sentences comprising the tweet, with weights that have a power dependency on f, l and pol, where α, β, γ are numerical factors to be found and f, l, pol are the fraction of known words, the length of the sentence and its polarity, respectively. The polarity is defined in terms of vn, n, p, vp, the probabilities assigned by the RNTN to the very-negative, negative, positive and very-positive labels for each sentence. We then optimized the parameters α, β, γ with respect to the true labels.
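The aggregation step can be sketched as follows. This is a minimal illustration and not the system's actual code: the weight form w = f^α · l^β · pol^γ is our reading of "power dependency", and the `polarity` function is a hypothetical stand-in, since the original polarity definition is not reproduced in the text. All function names are illustrative.

```python
import numpy as np

def polarity(probs):
    # Hypothetical stand-in for the paper's polarity definition.
    # probs = [vn, n, neu, p, vp]; extreme labels are counted double.
    vn, n, _, p, vp = probs
    return abs(2 * vp + p - n - 2 * vn)

def aggregate_tweet(sentence_probs, known_frac, lengths,
                    alpha, beta, gamma, eps=1e-6):
    """Weighted average of per-sentence 5-label distributions.

    Assumed weight form: w = f**alpha * l**beta * pol**gamma,
    normalized so the weights sum to one.
    """
    probs = np.asarray(sentence_probs, dtype=float)
    f = np.asarray(known_frac, dtype=float)
    l = np.asarray(lengths, dtype=float)
    # eps keeps the weights strictly positive when polarity is zero.
    pol = np.array([polarity(p) for p in probs]) + eps
    w = (f ** alpha) * (l ** beta) * (pol ** gamma)
    w = w / w.sum()
    return w @ probs

# Two sentences of one tweet: a positive one and a mildly negative one.
sentences = [[0.05, 0.10, 0.20, 0.40, 0.25],
             [0.20, 0.30, 0.40, 0.05, 0.05]]
tweet = aggregate_tweet(sentences, known_frac=[1.0, 0.8], lengths=[5, 10],
                        alpha=1.0, beta=1.0, gamma=1.0)
```

Because the weights are normalized, the aggregated vector remains a probability distribution; α, β, γ would then be fitted against the true tweet labels, for example by a grid search.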

Model Selection
After training dozens of models, we chose to combine only the best ones using stacking, namely combining the models' outputs using a supervised learning algorithm. For this purpose, we used the Scikit-learn (Pedregosa et al., 2011) recursive feature elimination (RFE) algorithm to find both the optimal number of models and the actual models, thus choosing the best five models. The models chosen include a representative from each type of the data we used and they were:
• Training data without lemmatization step, with randomly initialized word-vectors of size 27.
• Training data with lemmatization step, with pre-trained word-vectors of size 25.
• 3 sets of training data with lemmatization step, with randomly initialized word-vectors of sizes 24, 26.
The outputs of the five models are concatenated and used as input for the various tasks, as described in section 4.1.
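The selection step above can be sketched with scikit-learn's `RFE` class. The data here is synthetic, and reducing each candidate model to a single score column is our simplifying assumption; how the original system mapped model outputs to features is not stated.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_tweets, n_models = 400, 8

# Synthetic binary labels and one score column per candidate model.
# By construction only the first three "models" carry signal.
y = rng.integers(0, 2, size=n_tweets)
signal = np.zeros(n_models)
signal[:3] = 2.0
X = rng.normal(size=(n_tweets, n_models)) + np.outer(y - 0.5, signal)

# RFE repeatedly fits the estimator and drops the weakest feature
# until the requested number of features remains.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)
chosen_models = np.flatnonzero(selector.support_)
```

To let the data decide how many models to keep, scikit-learn also provides `RFECV`, which picks the number of features by cross-validation.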

Features Extraction
In addition to the RNN trained models, our system includes a feature extraction step; we defined a set of lexical and semantic features to be extracted from the original tweets:
• In-subject, In-object: whether the entity of interest is in the subject or object.
• Containing positive/negative adjectives that describe the entity of interest.
• Containing negation, quotations or perfect progressive forms.
For this purpose, we used the Stanford deterministic coreference resolution system (Lee et al., 2011; Recasens et al., 2013).

Experiments
The experiments were developed using the Scikit-learn machine learning library and the Keras deep learning library with a TensorFlow backend (Abadi et al., 2016). Results for all sub-tasks are summarized in table 1.

General Workflow
For each tweet, we first ran the RNN models and got a 5-category probability distribution from each of the trained models, thus a 25-dimensional vector. Then we extracted sentence features and concatenated them with the RNN vector. We then trained a Feedforward NN which outputs a 5-label probability distribution for each tweet. That was the starting point for each of the tasks; we refer to this process as the pipeline.
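The shape of this pipeline can be illustrated as follows. The data is synthetic, the number of sentence features (6) is arbitrary, and scikit-learn's `MLPClassifier` is used as a stand-in for the Keras feedforward network, whose architecture is not specified in the text.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_tweets, n_models, n_labels, n_feats = 300, 5, 5, 6

# Synthetic stand-ins for the five models' 5-label distributions per tweet.
model_probs = rng.dirichlet(np.ones(n_labels), size=(n_tweets, n_models))
rnn_vector = model_probs.reshape(n_tweets, n_models * n_labels)  # 25-dim

# Synthetic binary sentence features (e.g. negation, quotation flags).
sentence_feats = rng.integers(0, 2, size=(n_tweets, n_feats)).astype(float)

# Concatenate the RNN vector with the sentence features.
X = np.hstack([rnn_vector, sentence_feats])
y = rng.integers(0, n_labels, size=n_tweets)  # synthetic 5-label targets

# Stand-in feedforward network producing a 5-label distribution per tweet.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=200, random_state=0)
clf.fit(X, y)
tweet_probs = clf.predict_proba(X)
```

Each row of `tweet_probs` is the 5-label distribution that the task-specific steps take as their starting point.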

Task A
The goal of this task is to classify tweet sentiment into three classes (negative, neutral, positive), where the measured metric is a macro-averaged recall.
We used the SemEval 2017 task A data in the following way: using SemEval 2016 TEST as our TEST, and partitioning the rest into TRAIN and DEV datasets. The test dataset went through the previously mentioned pipeline, getting a 5-label probability distribution.
We anticipated that the sentiment distribution of the test data would be similar to that of the training data, as they may be drawn from the same distribution. Therefore we used re-sampling of the training dataset to obtain a skewed dataset such that a logistic regression would predict similar sentiment distributions for both the train and test datasets. Finally we trained a logistic regression on the new dataset and used it on the task A test set. We obtained a macro-averaged recall score of ρ = 0.575 and accuracy of Acc = 0.587.
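A minimal sketch of the two ingredients of this step: re-sampling a training set toward a chosen label distribution, and computing macro-averaged recall. The helper `resample_to_distribution` and the target proportions are hypothetical; the actual re-sampling scheme is not specified in the text.

```python
import numpy as np
from sklearn.metrics import recall_score

def resample_to_distribution(y, target, n_samples, rng):
    """Return indices into y whose labels follow `target` (label -> share)."""
    idx = []
    for label, share in target.items():
        pool = np.flatnonzero(y == label)
        # Sample with replacement so small classes can be over-sampled.
        idx.append(rng.choice(pool, size=int(round(share * n_samples)),
                              replace=True))
    return np.concatenate(idx)

rng = np.random.default_rng(0)
# Labels: 0 = negative, 1 = neutral, 2 = positive.
y_train = np.array([0] * 100 + [1] * 500 + [2] * 400)
target = {0: 0.4, 1: 0.4, 2: 0.2}  # illustrative target proportions
idx = resample_to_distribution(y_train, target, 1000, rng)
counts = np.bincount(y_train[idx])

# Macro-averaged recall: the unweighted mean of the per-class recalls.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
rho = recall_score(y_true, y_pred, average="macro")
```

In the toy evaluation the per-class recalls are 1/2, 1 and 1/2, so `rho` equals 2/3. Macro-averaging gives every class equal weight regardless of its frequency.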
Apparently, our assumption about distribution similarity was misguided as one can observe in the next table.

Tasks B, D
The goals of these tasks are to classify tweet sentiment regarding a given entity as either positive or negative (task B) and to estimate the sentiment distribution for each entity (task D). The measured metrics are macro-averaged recall and Kullback-Leibler divergence (KLD), respectively.
We started with the training data passing our pipeline. We calculated the mean distribution for each entity on the training and testing datasets. We trained a logistic regression from a 5-label to a binary distribution and predicted a positive probability for each entity in the test set. This was used as a prior distribution for each entity, modeled as a Beta distribution. We then trained a logistic regression where the input is a concatenation of the 5 labels with the positive component of the probability distribution of the entity's sentiment and the output is a binary prediction for each tweet. We then chose the label using the mean positive probability as a threshold. These predictions were submitted as task B. We obtained a macro-averaged recall score of ρ = 0.822 and accuracy of Acc = 0.802.
Next, we took the mean of the predictions for each entity as the likelihood, modeled as a Binomial distribution, thus getting a Beta posterior distribution for each entity. These were submitted as task D. We obtained a score of KLD = 0.149.
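The Beta prior and Binomial likelihood combine through the standard conjugate update: a Beta(a, b) prior with k positives out of n observations gives a Beta(a + k, b + n − k) posterior. The sketch below illustrates this; the `strength` pseudo-count used to turn a predicted positive probability into a Beta prior is our assumption, as the text does not say how the prior was parameterized.

```python
def beta_prior_from_mean(mean, strength):
    # Hypothetical parameterization: strength = a + b acts as a pseudo-count.
    return mean * strength, (1.0 - mean) * strength

def beta_binomial_update(a, b, positives, total):
    # Conjugate update of a Beta(a, b) prior with Binomial observations.
    return a + positives, b + (total - positives)

def beta_mean(a, b):
    return a / (a + b)

# Prior: entity predicted 70% positive, worth 10 pseudo-tweets.
a0, b0 = beta_prior_from_mean(0.7, 10.0)
# Likelihood: 12 of the entity's 20 tweets predicted positive.
a1, b1 = beta_binomial_update(a0, b0, positives=12, total=20)
posterior_positive = beta_mean(a1, b1)
```

Here the posterior mean is 19/30, between the prior mean (0.7) and the observed rate (0.6); entities with few tweets stay closer to the prior.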

Tasks C, E
The goals of these tasks are to classify tweet sentiment regarding a given entity into five classes (very negative, negative, neutral, positive, very positive) (task C) and to estimate the sentiment distribution over five classes for each entity (task E). The measured metrics are macro-averaged MAE and earth mover's distance (EMD), respectively.
We first calculated the mean sentiment for each entity. We then used bootstrapping to generate a sample for each entity. Then we trained a logistic regression model which predicts a 5-label distribution for each entity. We modified the initial 5-label probability distribution for each tweet using a formula in which $t_0$, $c_0$ are the current tweet and label, $p_{\text{entity-LR}}$ is the sentiment prediction of the logistic regression model for an entity, $T$ is the set of all tweets and $C = \{vn, n, neu, p, vp\}$ is the set of labels. We trained a logistic regression on the new distribution and the predictions were submitted as task C. We obtained a macro-averaged MAE score of $MAE^M = 0.599$. Next, we defined a loss function over the predicted probabilities obtained after the previous logistic regression step. Finally we predicted a label for each tweet according to the lowest loss, and calculated the mean sentiment for each entity. These were submitted as task E. We obtained a score of EMD = 0.345.
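For reference, the two evaluation metrics can be sketched as below. These follow the usual definitions for ordinal labels (macro-averaged MAE as the mean over classes of the per-class mean absolute error, and EMD as the summed absolute difference of cumulative distributions); the function names are ours and the official scorer may differ in detail.

```python
import numpy as np

def macro_mae(y_true, y_pred):
    """Mean over true classes of the mean absolute error within each class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = [np.abs(y_pred[y_true == c] - c).mean()
                 for c in np.unique(y_true)]
    return float(np.mean(per_class))

def emd_ordinal(p_true, p_pred):
    """Earth mover's distance between two distributions over ordered classes."""
    diff = np.cumsum(p_pred) - np.cumsum(p_true)
    # The last cumulative difference is always zero, so it is skipped.
    return float(np.abs(diff[:-1]).sum())

# Labels on the scale -2 (very negative) .. 2 (very positive).
mae = macro_mae([-2, -2, 0, 1, 2], [-2, -1, 0, 1, 0])
emd = emd_ordinal([0.1, 0.2, 0.4, 0.2, 0.1], [0.2, 0.2, 0.2, 0.2, 0.2])
```

In this toy example the per-class errors are 0.5, 0, 0 and 2, giving a macro-averaged MAE of 0.625, and the EMD is 0.4. Lower is better for both.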

Review and Conclusions
In this paper we described our system of sentiment analysis adapted to participate in SemEval task 4. The highest ranking we reached was third place on the 5-label classification task. Given that we scored lower on classification with 2 and 3 labels, and that we used a similar workflow for tasks A, B and C, we speculate that the relative success is due to our sentiment treebank being labeled on a 5-label basis. This can also explain the relatively superior results in quantification of 5 categories as opposed to quantification of 2 categories. Overall, we have had some unique advantages and disadvantages in this competition. On the one hand, we enjoyed an additional twenty thousand tweets, where every node of the parse tree was labeled for its sentiment, and also had the manpower to manually prune our dictionaries, as well as the opportunity to get feedback from our clients. On the other hand, we did not use any user information and/or metadata from Twitter, nor did we use the SemEval data for training the RNTN models. In addition, we did not ensemble our models with any commercially or freely available pre-trained sentiment analysis packages.

Future Work
We have several plans to improve our algorithm and to use new data. First, we plan to extract more semantic features such as verb and adverb classes and use them in neural network models as additional input. Verb classification was used to improve sentiment detection (Chesley et al., 2006); we plan to label verbs according to whether their sentiment changes as we change the tense, form and active/passive voice. Adverbs were also used to determine sentiment (Benamara et al., 2007); we plan to classify adverbs into sentiment families such as intensifiers ("very"), diminishers ("slightly"), positive ("delightfully") and negative ("shamefully").
Secondly, we can use additional data from Twitter regarding either the users or the entities-of-interest.
Finally, we plan to implement a long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997) which trains on a sentence together with all the syntax and semantic features extracted from it. There is some work in the field of semantic modeling using LSTM, e.g. Palangi et al. (2014, 2016). Our plan is to use an LSTM module to extend the RNTN model of Socher et al. (2013) by adding the additional semantic data of each phrase and a reference to the entity-of-interest. An illustration of the computational graph for the proposed model is presented in figure 1. The inputs/outputs are: $V$ is a word vector representation of dimension $d$, $D$ encodes the parts-of-speech (POS) tagging, syntactic category and an additional bit indicating whether the entity-of-interest is present in the expression, all encoded in a 7-dimensional vector, $C$ is a control channel of dimension $d$, $O$ is an output layer of dimension $d + 7$ and $H$ is a sentiment vector of dimension $s$.
Figure 1: LSTM module; round purple nodes are element-wise operations, turquoise rectangles are neural network layers, orange rhombus is a dim-reducing matrix, splitting line is duplication, merging lines is concatenation.
In the definition of the module functions, $W_{out} \in \mathbb{R}^{s \times d}$ is a matrix to be learnt, $\odot$ denotes the Hadamard (element-wise) product and $[\cdot, \cdot]$ denotes concatenation. The functions $L_i$ are the six NN computations; in their definition, $(v_i, s_i, e_i)$ are the $d$-dimensional word embedding, the 6-bit encoding of the syntactic category and an indication bit of the entity-of-interest for the $i$-th phrase, respectively, $S_{ij}$ encodes the inputs of a left descendant $i$ and a right descendant $j$ in a parse tree and $k \in \{1, \ldots, 6\}$. Define $D = 2d + 14$; then $T^{[1:d]} \in \mathbb{R}^{D \times D \times d}$ is a tensor defining bilinear forms, $I_{I,J}$ with $I, J \in \{0, 1\}$ are indication functions for having the entity-of-interest on the left and/or right child and $W_{I,J} \in \mathbb{R}^{d \times D}$ are matrices to be learnt.
The algorithm processes each tweet according to its parse tree, starting at the leaves and going up, combining words into expressions; this differs from other LSTM algorithms in that the parsing data is used explicitly.
As an example, figure 2 presents the simple sentence "Amobee is awesome" with its parse tree. The leaves are given by $d$-dimensional word vectors together with their POS tagging, syntactic categories (if defined for the leaf) and an entity indicator bit. The computation takes place in the inner nodes; "is" and "awesome" are combined in a node marked by "VP", which is the phrase category. In our terminology, "is" and "awesome" are the $i$, $j$ nodes, respectively, for the "VP" node calculation. We define $C_{t-1}$ as the cell's state for the left child, in this case the "is" node. Left and right are concatenated as input $V_t$, and the metadata $D_t$ is from the right child while $D_{t-1}$ is the metadata from the left child. The second calculation takes place at the root "S"; the input $V_t$ is now a concatenation of the "Amobee" word vector, the input $O_{t-1}$ holds the $O_t$ output of the previous step in node "VP"; the cell state $C_{t-1}$ comes from the "Amobee" node.
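To make the bottom-up processing concrete, the sketch below applies a plain RNTN-style composition in the spirit of Socher et al. (2013), with a bilinear tensor term plus a linear term, to the "Amobee is awesome" tree. It is not the proposed LSTM module: the gating, the control channel and the entity-indexed matrices $W_{I,J}$ are omitted, and the concatenation order, the toy dimension `d = 4`, the random parameters and the all-zero metadata for the "VP" node are illustrative assumptions.

```python
import numpy as np

d, meta = 4, 7              # toy embedding size; 7 metadata dims per node
D = 2 * d + 2 * meta        # = 2d + 14, as in the text

rng = np.random.default_rng(0)
T = rng.normal(scale=0.1, size=(D, D, d))  # tensor of d bilinear forms
W = rng.normal(scale=0.1, size=(d, D))     # linear term

def compose(left, right):
    """Combine two (vector, metadata) children into a parent vector."""
    x = np.concatenate([left[0], left[1], right[0], right[1]])  # shape (D,)
    # One bilinear form x^T T[:, :, k] x per output dimension k.
    bilinear = np.einsum("i,ijk,j->k", x, T, x)
    return np.tanh(bilinear + W @ x)

def random_leaf():
    # Random word vector and random 7-bit metadata, for illustration only.
    return rng.normal(size=d), rng.integers(0, 2, size=meta).astype(float)

amobee, is_, awesome = random_leaf(), random_leaf(), random_leaf()

# Inner node "VP" combines "is" and "awesome"; the root "S" then
# combines "Amobee" with the "VP" result.
vp = (compose(is_, awesome), np.zeros(meta))
root = compose(amobee, vp)
```

The traversal order mirrors the example: leaves first, then "VP", then the root "S", so that each inner node only depends on its already-computed children.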