Assessing State-of-the-Art Sentiment Models on State-of-the-Art Sentiment Datasets

There has been a good amount of progress in sentiment analysis over the past 10 years, including the proposal of new methods and the creation of benchmark datasets. In some papers, however, there is a tendency to compare models only on one or two datasets, either because of time constraints or because the model is tailored to a specific task. Accordingly, it is hard to understand how well a certain model generalizes across different tasks and datasets. In this paper, we address this situation by comparing several models on six different benchmarks, which belong to different domains and additionally have different levels of granularity (binary, 3-class, 4-class and 5-class). We show that Bi-LSTMs perform well across datasets and that both LSTMs and Bi-LSTMs are particularly good at fine-grained sentiment tasks (i.e., with more than two classes). Incorporating sentiment information into word embeddings during training gives good results for datasets that are lexically similar to the training data. With our experiments, we contribute to a better understanding of the performance of different model architectures on different datasets. Consequently, we report novel state-of-the-art results on the SenTube datasets.


Introduction
The task of analyzing private states expressed by an author in text, such as sentiment, emotion or affect, can give us access to a wealth of hidden information to analyze product reviews (Liu et al., 2005), political views (Speriosu et al., 2011), or to identify potentially dangerous activity on the Internet (Forsyth and Martell, 2007). The first approaches in this field of research depended on the use of words at a symbolic level (unigrams, bigrams, bag-of-words features), where generalizing to new words was difficult (Pang et al., 2002; Riloff and Wiebe, 2003).
Current state-of-the-art methods rely on features extracted in an unsupervised manner, mainly through one of the existing pre-trained word embedding approaches (Collobert et al., 2011; Mikolov et al., 2013; Pennington et al., 2014). These approaches represent words as some function of their contexts, enabling machine learning algorithms to generalize over tokens that have similar representations, arguably giving them an advantage over previous symbolic approaches.
In order to evaluate state-of-the-art models (both symbolic and embedding-based), different datasets are used. However, it is not clear that a model that performs well on one dataset will transfer well to other datasets with different properties. The work we describe in this paper aims at discovering whether certain models generally perform better or whether certain models are better adapted to certain kinds of datasets. Ultimately, the goal of this paper is to support the choice of a method for novel domains and datasets, based on properties of the task at hand.
Our main contribution is, therefore, a comparison of seven approaches to sentiment analysis on six benchmark datasets 1. We show that
• bidirectional LSTMs perform well across datasets,
• both LSTMs and bidirectional LSTMs are particularly good at fine-grained sentiment tasks,
• and embeddings trained jointly for semantics and sentiment perform well on datasets that are similar to the training data.

Related Work
This section discusses three approaches to sentiment analysis and then describes in detail benchmark datasets which will be used in the experiments.

Approaches
To analyze the performance of state-of-the-art methods across datasets, we experiment with three approaches to sentiment analysis: (1) updating pre-trained word embeddings using a neural classifier and labeled data, (2) updating pre-trained word embeddings using a semantic lexicon, and (3) training word embeddings to jointly maximize a language model score and a sentiment score. Sections 2.1.1 to 2.1.3 discuss these three approaches. We focus on sentiment-related methods; however, where appropriate, we also discuss general approaches which can be adapted to this use case in a straightforward manner.

Retrofitting to Semantic Lexicons
There have been several proposals to improve the quality of word embeddings using semantic lexicons. Yu and Dredze (2014) propose several methods which combine the CBOW architecture (Mikolov et al., 2013) and a second objective function which attempts to maximize the relations found within some semantic lexicon. They use both the Paraphrase Database (Ganitkevitch et al., 2013) and WordNet (Fellbaum, 1999) and test their models on language modeling and semantic similarity tasks. They report that their method leads to an improvement on both tasks. Kiela et al. (2015) aim to improve embeddings by augmenting the context of a given word while training a skip-gram model (Mikolov et al., 2013). They sample extra context words, taken either from a thesaurus or association data, and incorporate these into the context of the word for each update. The evaluation is both intrinsic, on word similarity and relatedness tasks, and extrinsic, on TOEFL synonym and document classification tasks. The augmentation strategy improves the word vectors on all tasks. Faruqui et al. (2015) propose a method to refine word vectors by using relational information from semantic lexicons (we refer to this method in this paper as RETROFIT). They require a vocabulary $V = \{w_1, \ldots, w_n\}$, its word embedding matrix $\hat{Q} = \{\hat{q}_1, \ldots, \hat{q}_n\}$, where each $\hat{q}_i$ is the vector for word $w_i$, and an ontology $\Omega$, which they represent as an undirected graph $(V, E)$ with one vertex for each word type and edges $(w_i, w_j) \in E \subseteq V \times V$. They attempt to learn the matrix $Q = \{q_1, \ldots, q_n\}$ such that $q_i$ is similar to both $\hat{q}_i$ and $q_j$ for all $j$ with $(i, j) \in E$. Therefore, the objective function to minimize is

$$\Psi(Q) = \sum_{i=1}^{n} \Big[ \alpha_i \lVert q_i - \hat{q}_i \rVert^2 + \sum_{(i,j) \in E} \beta_{ij} \lVert q_i - q_j \rVert^2 \Big],$$

where $\alpha$ and $\beta$ control the relative strengths of associations.
They use the XL version of the Paraphrase Database (PPDB-XL; Ganitkevitch et al., 2013), a dataset of paraphrases, as the semantic lexicon to improve the original vectors. This dataset includes 8 million lexical paraphrases collected from bilingual corpora, where words in language A are considered paraphrases if they are consistently translated to the same word in language B. They then test on the Stanford Sentiment Treebank (Socher et al., 2013). They train an L2-regularized logistic regression classifier on the average of the word embeddings for a text and find improvements after retrofitting.
All of the above approaches show improvements over previous word embedding approaches (Mnih and Teh, 2012; Yu and Dredze, 2014; Xu et al., 2014) on this dataset.

Joint Training
Maas et al. (2011) were the first to jointly train semantic and sentiment word vectors. In order to capture semantic similarities, they propose a probabilistic model using a continuous mixture model over words, similar to Latent Dirichlet Allocation (LDA, Blei et al., 2003). To capture sentiment information, they include a sentiment term which uses logistic regression to predict the sentiment of a document. The full objective function is a combination of the semantic and sentiment objectives. They test their model on several sentiment and subjectivity benchmarks. Their results indicate that including the sentiment information during training actually leads to decreased performance. Tang et al. (2014) take the joint training approach and simultaneously incorporate syntactic 2 and sentiment information into their word embeddings (we refer to this method as JOINT). They extend the word embedding approach of Collobert et al. (2011), who use a neural network to predict whether an n-gram is a true n-gram or a "corrupted" version. They use the hinge loss

$$loss_{cw}(t, t^r) = \max(0, 1 - f^{cw}(t) + f^{cw}(t^r))$$

and backpropagate the error to the corresponding word embeddings. Here, $t$ is the original n-gram, $t^r$ is the corrupted n-gram and $f^{cw}$ is the language model score. Tang et al. (2014) add a sentiment hinge loss to the Collobert and Weston model, as

$$loss_{s}(t, t^r) = \max(0, 1 - \delta_s(t) f^{s}_{1}(t) + \delta_s(t) f^{s}_{1}(t^r)),$$

where $f^{s}_{1}$ is the predicted negative score and $\delta_s(t)$ is an indicator function that reflects the sentiment of a sentence. $\delta_s(t)$ is 1 if the true sentiment is positive and −1 if it is negative. They then use a weighted sum of both scores to create their sentiment embeddings:

$$loss_{combined}(t, t^r) = \alpha \cdot loss_{cw}(t, t^r) + (1 - \alpha) \cdot loss_{s}(t, t^r).$$

This requires sentiment-annotated data for training both the syntactic and sentiment losses, which they acquire by collecting tweets associated with certain emoticons. In this way, they are able to simultaneously incorporate sentiment and semantic information relevant to their task. They test their approach on the SemEval 2013 Twitter dataset (Nakov et al., 2013), changing the task from three-class to binary classification, and find that they outperform other approaches.
Overall, the joint approach shows promise for tasks with a large amount of distantly-labeled data.
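As an illustration of how the two hinge losses interact, the following minimal Python sketch computes the language model loss, the sentiment loss, and their weighted sum for a single pair of a true and a corrupted n-gram. It assumes a margin of 1 and a mixing weight called `alpha`; the function names and the example scores are hypothetical and are not taken from the implementation of Tang et al. (2014).

```python
def hinge_language_model(score_true, score_corrupt):
    """Ranking hinge loss: the true n-gram should outscore the
    corrupted n-gram by a margin of at least 1 (assumed margin)."""
    return max(0.0, 1.0 - score_true + score_corrupt)


def hinge_sentiment(delta, sent_true, sent_corrupt):
    """Sentiment hinge loss. `delta` is +1 if the gold sentiment is
    positive and -1 if it is negative."""
    return max(0.0, 1.0 - delta * sent_true + delta * sent_corrupt)


def combined_loss(alpha, score_true, score_corrupt,
                  delta, sent_true, sent_corrupt):
    """Weighted sum of both losses; `alpha` is an assumed mixing weight
    in [0, 1]."""
    lm = hinge_language_model(score_true, score_corrupt)
    sent = hinge_sentiment(delta, sent_true, sent_corrupt)
    return alpha * lm + (1.0 - alpha) * sent


if __name__ == "__main__":
    # Hypothetical scores for one (true, corrupted) n-gram pair.
    print(combined_loss(0.5, 0.2, 0.5, 1, 0.0, 0.0))
```

With `alpha` close to 1 the embeddings are driven mostly by the language model objective, and with `alpha` close to 0 mostly by the sentiment objective.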

Supervised Training
The most common approach to sentiment analysis is to use pre-trained word embeddings in combination with a supervised classifier. In this framework, the word embedding algorithm acts as a feature extractor for classification.
Recurrent neural networks (RNNs), such as the LONG SHORT-TERM MEMORY network (LSTM) (Hochreiter and Schmidhuber, 1997) or GATED RECURRENT UNITS (GRUs) (Chung et al., 2014), are a variant of a feed-forward network which includes a memory state capable of learning long-distance dependencies. In various forms, they have proven useful for text classification tasks (Tai et al., 2015; Tang et al., 2016). Socher et al. (2013) and Tai et al. (2015) use GloVe vectors (Pennington et al., 2014) in combination with recurrent neural networks and train on the Stanford Sentiment Treebank (Socher et al., 2013). Since this dataset is annotated for sentiment at each node of a parse tree, they train and test on these annotated phrases.
2 [...] assumptions that the distributional representation encodes information directly pertaining to syntax.
Both Socher et al. (2013) and Tai et al. (2015) also propose various RNNs which are able to take better advantage of the labeled nodes and which achieve better results than standard RNNs. However, these models require annotated parse trees, which are not necessarily available for other datasets.
CONVOLUTIONAL NEURAL NETWORKS (CNN) have proven effective for text classification (dos Santos and Gatti, 2014; Kim, 2014; Flekova and Gurevych, 2016). Kim (2014) uses skip-gram vectors (Mikolov et al., 2013) as input to a variety of Convolutional Neural Networks and tests on seven datasets, including the Stanford Sentiment Treebank (Socher et al., 2013). The best performing setup across datasets is a single-layer CNN which updates the original skip-gram vectors during training.
Overall, these approaches currently achieve state-of-the-art results on many datasets, but they have not been compared to retrofitting or joint training approaches.

Datasets
We choose to evaluate the approaches presented in Section 2.1 on a number of different datasets from different domains, which also have differing levels of granularity of class labels. The Stanford Sentiment Treebank and SemEval 2013 shared-task dataset have already been used as benchmarks for some of the approaches mentioned in Section 2.1. Table 1 shows which approaches have been tested on which datasets and Table 2 gives an overview of the statistics for each dataset.

Stanford Sentiment
The Stanford Sentiment Treebank (SST-fine) (Socher et al., 2013) is a dataset of movie reviews which was annotated for 5 levels of sentiment: strong negative, negative, neutral, positive, and strong positive. It is annotated both at the clause level, where each node in a binary tree is given a sentiment score, and at the sentence level. We use the standard split of 8544/1102/2210 for training, validation and testing. In order to compare with Faruqui et al. (2015), we also adapt the dataset to the task of binary sentiment analysis, where strong negative and negative are mapped to one label, strong positive and positive are mapped to another label, and the neutral examples are dropped. This leads to a slightly different split of 6920/872/1821 (we refer to this dataset as SST-binary).
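The mapping from the five-class to the binary setting can be sketched as follows. This is a minimal illustration, assuming that examples are stored as (text, label) pairs with the label names used above; the function names and the example sentences are hypothetical.

```python
NEGATIVE = ("strong negative", "negative")
POSITIVE = ("strong positive", "positive")


def to_binary(label):
    """Map a five-class label to a binary label; return None for neutral."""
    if label in NEGATIVE:
        return "negative"
    if label in POSITIVE:
        return "positive"
    return None


def binarize(examples):
    """Keep only non-neutral examples and replace their labels."""
    converted = []
    for text, label in examples:
        binary = to_binary(label)
        if binary is not None:
            converted.append((text, binary))
    return converted


if __name__ == "__main__":
    data = [("a gem", "strong positive"),
            ("it exists", "neutral"),
            ("dull", "negative")]
    print(binarize(data))
```

Because neutral examples are dropped, the binary splits are smaller than the five-class splits.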

OpeNER
The OpeNER dataset (Agerri et al., 2013) is a dataset of hotel reviews in which each review is annotated for opinions. An opinion includes sentiment holders, targets, and phrases, of which only the sentiment phrase is obligatory. Additionally, sentiment phrases are annotated for four levels of sentiment: strong negative, negative, positive and strong positive. We use a split of 2780/186/734 examples.

SenTube Datasets
The SenTube datasets (Uryupina et al., 2014) are texts that are taken from YouTube comments regarding automobiles and tablets. These comments are normally directed towards a commercial or a video that contains information about the product. We take only those comments that have some polarity towards the target product in the video. For the automobile dataset (SenTube-A), this gives a 3381/225/903 training, validation, and test split. For the tablets dataset (SenTube-T) the splits are 4997/333/1334. These are annotated for positive, negative, and neutral sentiment.

SemEval 2013
The SemEval 2013 Twitter dataset (SemEval) (Nakov et al., 2013) is a dataset that contains tweets collected for the 2013 SemEval shared task B. Each tweet was annotated for three levels of sentiment: positive, negative, or neutral. There were originally 9684/1654/3813 tweets annotated, but when we downloaded the dataset, we were only able to download 6021/890/2376 due to many of the tweets no longer being available.

Experimental Setup
We compare seven approaches, five of which fall into the categories mentioned in Section 2, as well as two baselines. The models and parameters are described in Section 3.1. We test these models on the benchmark datasets mentioned in Section 2.2.

Baselines
We compare our models against two baselines. First, we train an L2-regularized logistic regression on a bag-of-words representation (BOW) of the training examples, where each example is represented as a vector of size n, with n = |V | and V the vocabulary. This is a standard baseline for text classification.
Our second baseline is an L2-regularized logistic regression classifier trained on the average of the word vectors in the training example (AVE). We train word embeddings using the skip-gram with negative sampling algorithm (Mikolov et al., 2013) on a 2016 Wikipedia dump, using 50-, 100-, 200-, and 600-dimensional vectors, a window size of 10, 5 negative samples, and we set the subsampling parameter to $10^{-4}$. Additionally, we use the publicly available 300-dimensional GoogleNews vectors 3 in order to compare to previous work.
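The feature extraction step of the AVE baseline can be sketched as follows. This is a minimal illustration that assumes the embeddings are available as a dictionary from tokens to lists of floats and that out-of-vocabulary tokens are simply skipped (the treatment of unknown words in this baseline is an assumption here); the resulting vector would then be passed to the logistic regression classifier.

```python
def average_vector(tokens, embeddings, dim):
    """Average the embeddings of all known tokens in an example.

    Assumption: tokens without an embedding are skipped, and an example
    without any known token is represented by the zero vector.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


if __name__ == "__main__":
    # Hypothetical 2-dimensional embeddings.
    toy = {"good": [1.0, 3.0], "movie": [3.0, 5.0]}
    print(average_vector(["good", "movie", "unseen"], toy, 2))
```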

Retrofitting
We apply the approach by Faruqui et al. (2015) and make use of the code 4 released in combination with the PPDB-XL lexicon, as this gave the best results for sentiment analysis in their experiments. We train for 10 iterations. Following the authors' setup, for testing we train an L2-regularized logistic regression classifier on the average word embeddings for a phrase (RETROFIT).
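To make the retrofitting procedure more concrete, the following pure-Python sketch performs the iterative update that minimizes the objective of Faruqui et al. (2015) described in Section 2.1. It is not the released code: it assumes $\alpha_i = 1$ and $\beta_{ij} = 1/\mathrm{degree}(i)$, represents the lexicon as a dictionary from words to neighbour lists, and uses hypothetical names throughout.

```python
def retrofit(vectors, lexicon, iterations=10):
    """Pull each vector towards its lexicon neighbours while keeping it
    close to its original value.

    vectors: dict mapping word -> list of floats (original embeddings)
    lexicon: dict mapping word -> list of neighbouring words
    Assumptions: alpha_i = 1 and beta_ij = 1 / degree(i).
    """
    new_vectors = {word: list(vec) for word, vec in vectors.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            known = [n for n in neighbours if n in new_vectors]
            if word not in vectors or not known:
                continue  # nothing to retrofit for this word
            alpha = 1.0
            beta = 1.0 / len(known)
            updated = []
            for d in range(len(vectors[word])):
                neighbour_sum = sum(beta * new_vectors[n][d] for n in known)
                numerator = alpha * vectors[word][d] + neighbour_sum
                updated.append(numerator / (alpha + beta * len(known)))
            new_vectors[word] = updated  # in-place update within an iteration
    return new_vectors


if __name__ == "__main__":
    toy_vectors = {"good": [1.0, 0.0], "great": [0.0, 1.0], "table": [5.0, 5.0]}
    toy_lexicon = {"good": ["great"], "great": ["good"]}
    print(retrofit(toy_vectors, toy_lexicon, iterations=10))
```

In this toy example the two paraphrases move towards each other, while the word without lexicon entries keeps its original vector.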

Joint Training
For the joint method, we use the 50-dimensional sentiment embeddings provided by Tang et al. (2014). Additionally, we create 100-, 200-, and 300-dimensional embeddings using the code that is publicly available 5 . We use the same hyperparameters as Tang et al. (2014): five million positive and negative tweets crawled using hashtags as proxies for sentiment, a 20-dimensional hidden layer, and a window size of three. Following the authors' setup, we concatenate the maximum, minimum and average vectors of the word embeddings for each phrase. We then train a linear SVM on these representations (JOINT).
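The phrase representation used for JOINT can be sketched as follows. This is a minimal illustration, assuming that the embeddings are given as a dictionary from tokens to lists of floats and that unknown tokens are skipped (an assumption made here); the concatenated vector is three times as long as a single embedding and would serve as input to the linear SVM.

```python
def phrase_features(tokens, embeddings, dim):
    """Concatenate the element-wise maximum, minimum and average of the
    embeddings of a phrase.

    Assumption: tokens without an embedding are skipped, and a phrase
    without any known token is represented by zeros.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * (3 * dim)
    maxima = [max(v[d] for v in vectors) for d in range(dim)]
    minima = [min(v[d] for v in vectors) for d in range(dim)]
    averages = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return maxima + minima + averages


if __name__ == "__main__":
    # Hypothetical 2-dimensional sentiment embeddings.
    toy = {"so": [1.0, -2.0], "cool": [3.0, 4.0]}
    print(phrase_features(["so", "cool"], toy, 2))
```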

Supervised Training
We implement a standard LSTM which has an embedding layer that maps the input to a 50-, 100-, 200-, 300-, or 600-dimensional vector, depending on the embeddings used to initialize the layer. These vectors then pass to an LSTM layer. We feed the final hidden state to a standard fully-connected 50-dimensional dense layer and then to a softmax layer, which gives us a probability distribution over our classes. As a regularizer, we use a dropout (Srivastava et al., 2014) of 0.5 before the LSTM layer.
The BIDIRECTIONAL LSTM (BILSTM) has the same architecture as the normal LSTM, but includes an additional layer which runs from the end of the text to the front. This approach has led to state-of-the-art results for POS-tagging (Plank et al., 2016), dependency parsing (Kiperwasser and Goldberg, 2016) and text classification (Zhou et al., 2016), among others. We use the same parameters as the LSTM, but concatenate the two hidden layers before passing them to the dense layer 6.
5 http://ir.hit.edu.cn/~dytang
We also train a simple CNN with one convolutional layer on top of pre-trained word embeddings. The first layer is an embedding layer that maps the input of length n (padded when needed) to an n × R dimensional matrix, where R is the dimensionality of the word embeddings. The embedding matrix is then convolved with filter sizes of 2, 3, and 4, followed by a pooling layer of length 2. This is then fed to a fully connected dense layer with ReLU activations (Nair and Hinton, 2010) and finally to the softmax layer. We again use dropout (0.5), this time before and after the convolutional layers.
For all neural models, we initialize our word representations with the skip-gram algorithm with negative sampling (Mikolov et al., 2013). For the 300-dimensional vectors, we use the publicly available GoogleNews vectors. For the other dimensions (50, 100, 200, 600), we create skip-gram vectors with a window size of 10, 5 negative samples and run 5 iterations. For out-of-vocabulary words, we use vectors initialized randomly between -0.25 and 0.25 to approximate the variance of the pre-trained vectors. We train our models using ADAM (Kingma and Ba, 2014) and a minibatch size of 32 and tune the hidden layer dimension and number of training epochs on the validation set.

Results
Table 3 shows the results for the seven models across all datasets, as well as the macro-averaged results. We visualize them in Figure 3. We performed approximate randomization tests (Yeh, 2000) using the sigf package (Padó, 2006) with 10,000 iterations to determine the statistical significance of differences between models. Since the reported accuracies for the neural models are the means over five runs, we cannot use this technique in a straightforward manner. Therefore, we perform the approximate randomization tests between the runs 7 and consider the models statistically different if a majority (at least 3) of the runs are statistically different (p < 0.01, which corresponds to p < 0.05 with Bonferroni correction for 5 hypotheses). The results of statistical testing are summarized in Figure 2.
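The testing procedure can be sketched as follows. This is not the sigf package but a minimal paired approximate randomization test on the predictions of two models, assuming the difference in the number of correct predictions as the test statistic; the function names, the fixed random seed, and the way runs are paired are assumptions made for illustration.

```python
import random


def correct_count(predictions, gold):
    """Number of predictions that match the gold labels."""
    return sum(1 for p, g in zip(predictions, gold) if p == g)


def approximate_randomization(preds_a, preds_b, gold, iterations=10000, seed=0):
    """Estimate a p-value for the difference between two systems by
    randomly swapping their predictions on each example."""
    rng = random.Random(seed)
    observed = abs(correct_count(preds_a, gold) - correct_count(preds_b, gold))
    at_least_as_extreme = 0
    for _ in range(iterations):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(preds_a, preds_b):
            if rng.random() < 0.5:
                a, b = b, a  # swap the two systems' outputs for this example
            shuffled_a.append(a)
            shuffled_b.append(b)
        difference = abs(correct_count(shuffled_a, gold)
                         - correct_count(shuffled_b, gold))
        if difference >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (iterations + 1)


def models_differ(p_values, alpha=0.05, hypotheses=5):
    """Majority vote over runs with a Bonferroni-corrected threshold
    (0.05 / 5 = 0.01 for five runs)."""
    threshold = alpha / hypotheses
    significant = sum(1 for p in p_values if p < threshold)
    return significant > len(p_values) / 2


if __name__ == "__main__":
    gold = [1] * 200
    always_right = [1] * 200
    always_wrong = [0] * 200
    print(approximate_randomization(always_right, always_wrong, gold, iterations=1000))
```

With five runs per model, `models_differ` returns True only if at least three of the five pairwise tests fall below the corrected threshold.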
BOW continues to be a strong baseline: though it never provides the best result on a dataset, it gives better results than AVE on OpeNER, SenTube-T, and SemEval. Surprisingly, it also performs better than JOINT on the same datasets except for SenTube-T. Similarly, it outperforms RETROFIT on SenTube-T and SemEval.
RETROFIT performs better than CNN on SST-fine and better than JOINT on SST-fine, SST-binary, and OpeNER. It also improves on the results of AVE on all datasets except SenTube-A and SemEval.
Although JOINT does not perform well across datasets and, in fact, does not surpass the baselines on some datasets, it does lead to good results on SemEval and to state-of-the-art results on SenTube-A and SenTube-T.
Like RETROFIT, CNN does not outperform any of the other methods on any dataset. It does not beat the baseline on SST-fine, SenTube-A, and SenTube-T; however, it outperforms the AVE baseline on SST-binary and OpeNER.
The best models are LSTM and BILSTM. The best overall model is BILSTM, which outperforms the other models on half of the tasks (SST-fine, OpeNER, and SemEval) and consistently beats the baseline. This is in line with other research (Plank et al., 2016; Kiperwasser and Goldberg, 2016; Zhou et al., 2016), which suggests that this model is very robust across tasks as well as datasets. The differences in performance between LSTM and BILSTM, however, are only significant (p < 0.01) on the SemEval dataset.
We also see that the difference in performance between the two LSTM models and the others is larger on datasets with fine-grained labels (BILSTM 45.6 and LSTM 45.3 vs. an average of 40 for all others on the SST-fine and BILSTM 83 and LSTM 83.1 vs. an average of 76.5 on OpeNER). These differences between the LSTM models and other models are statistically significant, except for the difference between BILSTM and CNN at 50 dimensions on the OpeNER dataset.
Our analysis of different dimensionalities as input for the classification models reveals that, typically, the higher dimensional vectors (300 or 600) outperform lower dimensions. The only exceptions are JOINT on SenTube-T and SemEval, LSTM on SenTube-A, and AVE on all datasets except OpeNER.

Discussion
While approaches that average the word embeddings for a sentence are comparable to state-of-the-art results (Iyyer et al., 2015), AVE and RETROFIT do not perform particularly well. This is likely due to the fact that logistic regression lacks the nonlinearities which Iyyer et al. (2015) found helped, especially at deeper layers. Averaging all of the embeddings for longer phrases also seems to lead to representations that do not contain enough information for the classifier.
We also experimented with using large sentiment lexicons as the semantic lexicon for retrofitting, but found that this hurt the representation more than it helped. We believe this is because there are not enough kinds of relationships to exploit the graph structure and by trying to collapse all words towards either a positive or negative center, too much information is lost.
Table 3: Accuracy on the test sets. For all neural models we perform 5 runs and show the mean and standard deviation. The best results for each dataset are given in bold and results that have been previously reported are highlighted. All results derive from our reimplementation of the methods. We describe significance values in the text and appendix. Footnotes refer to the work where a method was previously tested on a specific dataset, although not necessarily with the same results: [1] Tai et al. (2015) [2] Kim (2014) [3] Faruqui et al. (2015) [4] Lambert (2015) [5] Uryupina et al. (2014) [6] Tang et al. (2014).
We expected that JOINT would perform well on SemEval, given that it was designed for this task, but it was surprising that it performed so well on the SenTube datasets. It might be due to the fact that comments for these three datasets are comparably informal and make use of emoticons and Internet jargon. We performed a short analysis of datasets (shown in Table 4) [...] speech and found that, indeed, the frequency of emoticons in the SemEval and SenTube datasets diverges significantly from the other datasets. The fact that JOINT is distantly trained on similar data gives it an advantage over other models on these datasets. This leads us to believe that this approach would transfer well to novel sentiment analysis tasks with similar properties. The fact that CNN performs much better on OpeNER may be due to the smaller size of the phrases (an average of 4.28 vs. 20+ for other datasets); however, further analyses to prove this are needed.
The good results that both LSTM models achieved on the more fine-grained sentiment datasets (SST-fine and OpeNER) seem to indicate that LSTMs are able to learn dependencies that help to differentiate strong and weak versions of sentiment better than other models. This is supported by the confusion matrices shown in Figure 1. This makes them natural candidates for fine-grained sentiment analysis tasks.
LSTM performs better than BILSTM on two datasets, but these differences are not statistically significant.
The effect of the dimensionality of the input for the classification models suggests that larger dimensionalities tend to perform better. This seems particularly true for RETROFIT, which continues gaining performance even at 600 dimensions. Most other approaches perform slightly better at 600 dimensions, but AVE consistently performs worse at 600 than at 300.
[Figure: confusion matrices with axis labels Strong Negative, Negative, Neutral, Positive]
Table 4: χ² statistics comparing the frequency of the following emoticons over the different datasets: :), :(, :-), :-(, :D, =). The difference in frequency of emoticons between the SemEval and SenTube datasets is not significant (p > 0.05), while for SST and OpeNER it is (p < 0.05).
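A comparison of emoticon frequencies of the kind reported in Table 4 can be sketched with a χ² test of homogeneity. The following pure-Python example computes the test statistic and the degrees of freedom for a table with one row per dataset and one column per emoticon. The counts are invented for illustration and are not taken from the datasets; the critical value of 11.07 is the standard 5% threshold for five degrees of freedom.

```python
def chi_square(table):
    """Chi-square statistic and degrees of freedom for a contingency
    table given as a list of rows (datasets) of counts (emoticons)."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * column_totals[j] / total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(column_totals) - 1)
    return statistic, degrees_of_freedom


# Critical value of the chi-square distribution for p = 0.05 and 5 degrees
# of freedom (two datasets, six emoticons).
CRITICAL_VALUE_DF5 = 11.07

if __name__ == "__main__":
    # Hypothetical counts of :), :(, :-), :-(, :D, =) in two datasets.
    counts = [[120, 40, 15, 5, 60, 10],
              [3, 1, 0, 1, 2, 0]]
    statistic, dof = chi_square(counts)
    print(statistic, dof, statistic > CRITICAL_VALUE_DF5)
```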

Conclusions
The goal of this paper has been to discover which models perform better across different datasets. We compared state-of-the-art models (both symbolic and embedding-based) on six benchmark datasets with different characteristics and showed that Bi-LSTMs perform well across datasets and that both LSTMs and Bi-LSTMs are particularly good at fine-grained sentiment tasks. Additionally, incorporating sentiment information into word embeddings during training gives good results for datasets that are lexically similar to the training data. Finally, we reported a new state of the art on the SenTube datasets.
Figure 2: Results of the statistical analysis described in Section 3 for the best performing dimension of embeddings, where applicable. Datasets where there is a statistical difference (above diagonal) and number of datasets where a model on the Y axis is statistically better than a model on the X axis (below diagonal).
Figure 3: Maximum accuracy scores in percent for each model on the datasets. LSTM and BILSTM outperform other models on tasks with more than two labels (SST-fine, OpeNER, and SemEval). BOW performs well against more powerful models. JOINT performs well on social media (SenTube-A, SenTube-T, and SemEval), but poorly on other tasks.