SentiME++ at SemEval-2017 Task 4: Stacking State-of-the-Art Classifiers to Enhance Sentiment Classification

In this paper, we describe the participation of the SentiME++ system in the SemEval-2017 Task 4A "Sentiment Analysis in Twitter", which aims to classify English tweets as positive, neutral or negative in sentiment. SentiME++ is an ensemble approach to sentiment analysis that leverages stacked generalization to automatically combine the predictions of five state-of-the-art sentiment classifiers. SentiME++ officially achieved a 61.30% F1-score, ranking 12th out of 38 participants.


Introduction
The SemEval-2017 Task 4 (Rosenthal et al., 2017) focuses on the classification of tweets into positive, neutral and negative sentiment classes. In 2015, the Webis system (Hagen et al., 2015) showed the effectiveness of ensemble methods for sentiment classification by winning the SemEval-2015 Task 10 "polarity detection" challenge through the combination of four classifiers that had participated in previous editions of SemEval. In 2016, we combined the original public release of the Webis system with the Stanford Sentiment System (Socher et al., 2013) using bagging, creating the SentiME system (Sygkounas et al., 2016b,a), which won the ESWC2016 Semantic Sentiment Analysis challenge. In bagging, the predictions of classifiers trained on different bootstrap samples (bags) are simply averaged to obtain a final prediction. In this paper, we propose SentiME++, an enhanced version of the SentiME system that combines the predictions of the base classifiers through stacked generalization. In Section 2, we detail our approach of stacking a meta-learner on top of five state-of-the-art sentiment classifiers to combine their predictions. In Section 3, we describe the experimental setup of our participation in SemEval, and we report the results we obtained in Section 4. Finally, we conclude the paper in Section 5.

Preliminaries
SentiME++ is based on the predictions of five state-of-the-art sentiment classifiers: NRC-Canada, winner of SemEval 2013, trains a linear-kernel SVM classifier on a set of linguistic and semantic features to extract sentiment from tweets (Mohammad et al., 2013); GU-MLT-LT, 2nd ranked at SemEval 2013, uses a linear classifier trained by stochastic gradient descent with hinge loss and elastic net regularization on a set of linguistic and semantic features (Günther and Furrer, 2013); KLUE, 5th ranked at SemEval 2013, feeds a simple bag-of-words model into popular machine learning classifiers such as Naive Bayes, Linear SVM and Maximum Entropy (Proisl et al., 2013); TeamX, winner of SemEval 2014, uses a variety of pre-processors and features fed into a supervised machine learning algorithm based on Logistic Regression (Miura et al., 2014); the Stanford Sentiment System, one of the subsystems of the Stanford CoreNLP toolkit, contains the Stanford Tree Parser, a machine-learning model that parses the input text into Stanford Tree format, and the Stanford Sentiment Classifier, which takes Stanford Trees as input and outputs the classification results. The output of the Stanford Sentiment System belongs to one of five classes (very positive, positive, neutral, negative, very negative), which differs from the three classes defined in SemEval. In a previous work (Sygkounas et al., 2016b), we tested different configurations for mapping the Stanford Sentiment System classification to the three classes of the SemEval competition and decided to use the following strategy: very positive and positive are mapped to positive, neutral is mapped to neutral, and negative and very negative are mapped to negative. The Stanford Sentiment System is used as an off-the-shelf classifier and is not trained with SemEval data.

Bootstrap samples
The first step in the SentiME++ approach consists in separately training the first four classifiers, using uniform random sampling with replacement (bootstrap sampling) to generate, from the initial training set T, a different training set T_i for each of the four sub-classifiers. In Section 3, we report the results of the experiments that we conducted to determine the optimal size of the samples T_i. Note that these samples are also called 'bags'. At this point, the SentiME system combines the predictions of the models trained on these bags using a simple average, while SentiME++ uses stacked generalization, as described in the next section.
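Bootstrap sampling with replacement can be sketched in a few lines of Python; the function name `bootstrap_sample` and the toy tweet list below are illustrative stand-ins, not part of the actual SentiME++ code base:

```python
import random

def bootstrap_sample(training_set, s):
    """Draw a bag of n = s * |T| items by uniform random sampling with replacement."""
    n = int(s * len(training_set))
    return [random.choice(training_set) for _ in range(n)]

# Four bags, one per trainable base classifier; a size factor s > 1
# makes each bag larger than T, so duplicate tweets are guaranteed.
T = ["tweet_%d" % i for i in range(1000)]
bags = [bootstrap_sample(T, 1.5) for _ in range(4)]
```

With s = 1.5 (the value retained in the experiments below), each bag contains 1500 samples drawn from 1000 distinct tweets, so repetitions necessarily occur.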

Stacking
Stacked Generalization (or simply stacking) (Wolpert, 1992) is based on the idea of creating an ensemble of base classifiers and then combining them by means of a supervised classifier, also called a 'meta-learner'. Stacking typically leverages the complementarity among the base classifiers to obtain a better global performance than any of the individual models. The base classifiers are trained separately and, for each input, output their prediction. The meta-learner, which is 'stacked' on top of the base classifiers, is trained on the base classifiers' predictions and aims to correct their prediction errors. SentiME++ separately trains four models, uses the Stanford Sentiment System without training, and uses these five outputs as a feature vector for a stacked supervised learner (Fig. 1). In detail, the SentiME++ approach can be divided into a training and a testing phase. Training phase: (1) generate four bootstrap samples T_i by sampling n tweets from the original training set T, where n = s * |T| and s is a parameter that has to be fixed experimentally; (2) train separately the NRC-Canada, GU-MLT-LT, KLUE and TeamX classifiers on the samples T_i and store the trained models; (3) use the four trained models and the Stanford Sentiment System to predict the sentiment of each tweet t ∈ T, producing a training set T_stack for the stacking layer; (4) train the meta-learner on T_stack. Testing phase: (1) use the four trained models and the Stanford Sentiment System to predict the sentiment of each tweet t ∈ T_test, producing a test set for the stacking layer; (2) test the trained meta-learner on this set. Note that the described approach is slightly different from the standard procedure of Stacked Generalization described in (Wolpert, 1992), which is normally based not on bootstrap samples but rather on disjoint splits of the training set.
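The stacking layer can be sketched with scikit-learn as follows. This is a minimal illustration, not the actual SentiME++ implementation: `stack_features` and the toy predictions are hypothetical stand-ins for the real base-classifier outputs, and labels are one-hot encoded as detailed in Section 3.

```python
import numpy as np
from sklearn.svm import SVC

CLASSES = ["negative", "neutral", "positive"]

def one_hot(label):
    """Encode a class label as a 3-dimensional binary indicator vector."""
    v = [0, 0, 0]
    v[CLASSES.index(label)] = 1
    return v

def stack_features(base_predictions):
    """Concatenate the one-hot-encoded labels of the five base classifiers
    into a single 15-dimensional feature vector per tweet."""
    return np.array([[bit for label in row for bit in one_hot(label)]
                     for row in base_predictions])

# Toy data: each row holds the labels output by the five base classifiers
# (four trained models plus the Stanford system) for one tweet.
train_preds = [["positive"] * 5,
               ["negative"] * 5,
               ["neutral", "neutral", "positive", "neutral", "neutral"]]
y_train = ["positive", "negative", "neutral"]

# RBF-kernel SVM meta-learner, with the (C, gamma) values reported in Section 3.
meta = SVC(kernel="rbf", C=0.174, gamma=0.028)
meta.fit(stack_features(train_preds), y_train)
print(meta.predict(stack_features([["positive"] * 5])))
```

The meta-learner thus sees only the base classifiers' votes, not the tweets themselves, and learns which combinations of votes tend to be wrong.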
This variation is mainly due to our desire to build SentiME++ as an incremental enhancement of the existing SentiME system, without disrupting its base training mechanism. The meta-learner used by default in SentiME++ is a Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel (Scholkopf et al., 1997). Different choices are possible, but Support Vector Machines are well-studied methods in machine learning, able to be trained efficiently and to limit over-fitting. This method depends on two hyper-parameters, i.e. parameters that are not automatically learnt and that constitute parameters of the algorithm itself: the regularization constant C and the parameter γ of the radial basis function. In order to optimize the performance of the stacking layer, we have chosen these parameters using a grid-search cross-validation approach (Hsu et al., 2003). The process works as follows: (1) define a range for the hyper-parameters, C ∈ [C_1 ... C_m] and γ ∈ [γ_1 ... γ_n]; (2) train the model with all possible pairs (C_i, γ_j); (3) compute scores with k-fold cross validation for each pair (C_i, γ_j); (4) select the best pair (C_i, γ_j) according to the k-fold cross validation score. SentiME is implemented in Java, and the stacking process that characterizes SentiME++ is performed by a Python script working on top of the results obtained by the SentiME system. The source code is available on GitHub. The system uses a variety of lexicons (Table 1).

Figure 1: Illustration of the SentiME++ approach: bootstrap samples (bags) are generated to train four state-of-the-art sentiment classifiers, the Stanford System is used without training, and their predictions are used as a feature vector for a meta-learner.

Experimental Setup
In this section, we describe the experimental setup of the SentiME++ system for the participation to the SemEval2017 Task4A challenge.

Bootstrap samples size
One of the parameters of the SentiME++ model is the size of the bootstrap samples T_i. We experimented with different sampling sizes, ranging from 33% to 175% of the size of the initial training set T. In order to determine an optimal size, we tested the SentiME bagging approach, which simply averages the predictions of the base classifiers, on the SemEval2013-test-B dataset, training the models with different random extractions from the SemEval2013-train+dev-B dataset. The experiment was repeated three times to mitigate the randomness due to the random extractions, and we observed that a 150% size led to the best performance on the SemEval2013-test-B dataset (Sygkounas et al., 2016a). Note that this implies that there are duplicates among the training examples.

Encoding categorical features
In order to use the predicted sentiment classes as features for a meta-learner in the stacking layer, it is necessary to specify an encoding scheme that allows the system to interpret the class values 'Positive', 'Neutral' and 'Negative'. These values could simply be mapped to the integers 0, 1, 2, but the meta-learner, expecting continuous or binary inputs, would interpret them as an ordered sequence of real values. To avoid this, we use a one-hot encoding scheme, i.e. m categorical values are turned into an m-dimensional binary vector where only one element at a time is active. In this specific case, the encoding that we have used is: 'positive' = [0, 0, 1], 'neutral' = [0, 1, 0], 'negative' = [1, 0, 0].
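The mapping above can be written down directly as a lookup table; the dictionary-based sketch below is illustrative (the actual SentiME++ script may implement the encoding differently):

```python
# One-hot encoding of the three sentiment classes, as described above:
# each label becomes a 3-dimensional binary indicator vector rather than
# an ordered integer 0, 1, 2.
ENCODING = {
    "positive": [0, 0, 1],
    "neutral":  [0, 1, 0],
    "negative": [1, 0, 0],
}

def encode(labels):
    """Map a list of predicted class labels to a flat binary feature vector."""
    return [bit for label in labels for bit in ENCODING[label]]

print(encode(["positive", "negative"]))  # → [0, 0, 1, 1, 0, 0]
```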

Hyper-parameters optimization
In order to optimize the performance of the SVM meta-learner, we performed the grid-search cross validation described in Section 2 on the SemEval2013-train+dev-B dataset using 10 folds. The experiment was performed using as ranges an array of 30 logarithmically spaced values for γ from 10^-9 to 10^3 and for C from 10^-2 to 10^10. The best (C, γ) pair, i.e. the pair producing the best prediction score, which was used for the participation in the challenge, is (C, γ) = (0.174, 0.028). The implementation of the SVM classifier and of the grid-search cross validation procedure was carried out using the Python library scikit-learn.
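The grid search can be reproduced with scikit-learn's `GridSearchCV`. In the sketch below, the synthetic data is a stand-in for the actual stacked feature vectors, and a coarser 5-value grid replaces the 30-value sweep purely to keep the example fast:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in for the stacked one-hot predictions of the five base classifiers
# (15 binary features per tweet, three target classes).
X, y = make_classification(n_samples=200, n_features=15, n_informative=5,
                           n_classes=3, random_state=0)

param_grid = {
    # The experiments above sweep 30 log-spaced values per parameter
    # (C in [1e-2, 1e10], gamma in [1e-9, 1e3]); 5 are used here for speed.
    "C": np.logspace(-2, 10, 5),
    "gamma": np.logspace(-9, 3, 5),
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X, y)
print(search.best_params_)
```

`GridSearchCV` fits one model per (C, γ) pair and fold, scores each pair by its mean 10-fold accuracy, and exposes the winning pair via `best_params_`.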
In order to compare the performance of these different trained models, we chose the SemEval2016-test dataset as a test set, as it is the largest in size (33k tweets) and the most recent of the SemEval test sets. The results obtained from this experiment are reported in Table 2. The submitted run was the best performing on the development set, but, a posteriori, we can observe that Run 3 performs better on the test set. We also observe a significant performance drop from the development to the test set. We believe that this might be due to the marked difference in the category distributions of the tweets in the two datasets (see Tab. 4 in (Rosenthal et al., 2017)). The best SentiME++ run at SemEval2017 Task 4 Sub-Task A would rank 8th out of 38 participants.

Conclusion
In this paper, we have presented SentiME++, a sentiment classifier that combines the predictions of five state-of-the-art systems through stacking. SentiME++ officially achieved a 61.30% F1-score, ranking 12th out of 38 participants. We have shown how stacking can improve the combination of the classifiers with respect to bagging, which was implemented in the previous version of SentiME, by evaluating both on the SemEval2017 challenge datasets. We have described an experimental procedure to determine an appropriate size of the bootstrap samples and to optimize the hyper-parameters of the meta-learner. In general, we provide further evidence of the power of ensemble approaches applied to sentiment analysis. As future work, we plan to improve the bootstrap sampling process by taking into account the class distributions of the tweets, to determine the bag sizes directly using SentiME++, to include more base classifiers, and to experiment with different meta-learners.