NNEMBs at SemEval-2017 Task 4: Neural Twitter Sentiment Classification: a Simple Ensemble Method with Different Embeddings

Neural Twitter sentiment classification has recently become one of the state-of-the-art approaches, and it requires less feature engineering than traditional methods. In this paper, we propose a simple and effective ensemble method to further boost the performance of neural models. We collect several word embedding sets that are either publicly released (and often learned on different corpora) or constructed by running Skip-gram on a released large-scale corpus. We assume that different word embeddings cover different words and encode different semantic knowledge, so using them together can improve the generalization and performance of neural models. In SemEval 2017, our method ranks 1st in accuracy and 5th in AverageR. Additional comparisons demonstrate the superiority of our model over models based on a single word embedding set. We release our code for reproducibility.


Introduction
Twitter sentiment classification, which aims to classify a tweet into one of three sentiment categories (negative, neutral, and positive), has attracted a lot of attention (Dong et al., 2015; Nakov et al., 2016; Rosenthal et al., 2017). Tweet text has several characteristics that make sentiment decisions hard for machines: it is written in informal language, its hashtags and emoticons indicate sentiment, and it is sometimes sarcastic. With the release of annotated datasets, more researchers use Twitter sentiment classification as a testbed to evaluate their proposed models. Our code is available at https://github.com/zwjyyc/NNEMBs.
Traditional methods (Mohammad et al., 2013) for Twitter sentiment classification use a variety of hand-crafted features, including surface-form features, semantic features, and sentiment lexicons. The performance of these methods often depends on the quality of the feature engineering, and building a state-of-the-art system is difficult for novices. Moreover, these features use a one-hot representation, which cannot capture the semantic relatedness of different features and leads to feature sparsity. To address this, Tang et al. (2014) induced sentiment-specific, low-dimensional, real-valued embedding features for Twitter classification, which encode both the semantics and the sentiment of words. In their experiments, the embedding features and the hand-crafted features obtain similar results and complement each other in the system. With the development of neural networks in natural language processing, neural sentiment classification (Severyn and Moschitti, 2015; Deriu et al., 2016) has recently attracted a lot of attention and become the state of the art. These methods first learn word embeddings from a large-scale Twitter corpus, then tune neural networks on tweets with distant labels, and finally fine-tune the models on the annotated datasets.
Learning word embeddings from in-domain data is an effective way to boost model performance (Mikolov et al., 2013; Yin et al., 2016). However, collecting a large-scale Twitter corpus is often time-consuming. In this paper, we use different word embedding sets to boost the performance of our neural networks. These consist only of publicly released word embedding sets and one set that we derive from the released large-scale Yelp dataset with Skip-gram (Mikolov et al., 2013). We propose a simple and effective ensemble method that takes the different word embedding sets as input to train neural networks and predicts the labels of test tweets by merging the outputs of all neural models. Our ensemble method shows its effectiveness in SemEval 2017, even though most of the word embedding sets we use are not learned from a Twitter corpus. A possible explanation is that different embedding sets have different vocabularies and encode different parts of sentiment knowledge. Moreover, we conduct additional experiments to analyze our model.

Method
In this section, we describe the details of our method, which is illustrated in Figure 1. We feed different word embedding sets into neural networks and train these networks separately. When predicting the labels of tweets in the testing set, we sum the label probabilities of all neural networks to make the final decision.
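The summation rule described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the function name `ensemble_predict`, the label order, and the probability values are assumptions made for the example.

```python
from typing import List, Sequence

# Label order is an assumption for this sketch.
LABELS = ("negative", "neutral", "positive")


def ensemble_predict(per_model_probs: Sequence[Sequence[float]],
                     labels: Sequence[str] = LABELS) -> str:
    """Sum the label probabilities of all models and return the arg-max label."""
    totals: List[float] = [0.0] * len(labels)
    for probs in per_model_probs:
        if len(probs) != len(labels):
            raise ValueError("each model must output one probability per label")
        for k, p in enumerate(probs):
            totals[k] += p
    best = max(range(len(labels)), key=totals.__getitem__)
    return labels[best]


# Hypothetical outputs of four models (one per embedding set) for one tweet.
example_probs = [
    [0.50, 0.30, 0.20],
    [0.20, 0.45, 0.35],
    [0.25, 0.40, 0.35],
    [0.30, 0.35, 0.35],
]
print(ensemble_predict(example_probs))
```

In this made-up example the first model alone would choose "negative", but the summed probabilities favor "neutral", which is the kind of correction the ensemble is meant to provide.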

Neural Network
There are many choices of neural networks (e.g., LSTM, RNN and GRU) for our method; here we use RCNN (Lei et al., 2016). RCNN has non-consecutive convolution and adaptive gated decay, which aim to capture longer-range, non-consecutive patterns in a weighted manner.
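Given a sequence of words denoted as $\{x_i\}_{i=1}^{l}$, the corresponding word embeddings $\{\mathbf{x}_i\}_{i=1}^{l}$ are derived using the embedding matrix $E$. RCNN then obtains the corresponding hidden vectors $\{h_i\}_{i=1}^{l}$ using the convolution operation and gating mechanism. After obtaining the hidden vectors, RCNN applies a pooling operation to get a fixed-size vector representation, which is fed into a softmax layer to make the prediction. The n-gram convolution operation and gated decay, as defined by Lei et al. (2016), are:
$$
\begin{aligned}
\lambda_t &= \sigma\left(W^{\lambda} \mathbf{x}_t + U^{\lambda} h_{t-1} + b^{\lambda}\right) \\
c_t^{(1)} &= \lambda_t \odot c_{t-1}^{(1)} + (1 - \lambda_t) \odot \left(W_1 \mathbf{x}_t\right) \\
c_t^{(i)} &= \lambda_t \odot c_{t-1}^{(i)} + (1 - \lambda_t) \odot \left(c_{t-1}^{(i-1)} + W_i \mathbf{x}_t\right), \quad 2 \le i \le n \\
h_t &= \tanh\left(c_t^{(n)} + b\right)
\end{aligned}
$$
where $W^{\lambda}$, $U^{\lambda}$, $b^{\lambda}$, $b$ and $W_{*}$ are learnable parameters, $\sigma$ is the sigmoid function, which rescales values into $(0, 1)$, $\odot$ is the element-wise product, $\lambda_t$ is the gating value that determines how much information from $\mathbf{x}_t$ and from previous patterns is added to the hidden vector, and $c_t^{(i)}$ is the vector of accumulated previous patterns that end with $\mathbf{x}_t$ and include $i$ consecutive tokens. When $\lambda_t = 0$, the convolution becomes a standard n-gram convolution.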
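As an illustration of the gated convolution, the following sketch runs one RCNN layer for bigram patterns (n = 2) in plain Python. It assumes the recurrence of Lei et al. (2016): a sigmoid gate computed from the current embedding and the previous hidden vector, two accumulators for 1-gram and 2-gram patterns, and a tanh output. All function names are hypothetical, and mean pooling is an assumption, since the text does not say which pooling operation is used.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def rcnn_layer(xs: List[Vector], W1: Matrix, W2: Matrix, W_lam: Matrix,
               U_lam: Matrix, b_lam: Vector, b: Vector) -> List[Vector]:
    """Run a bigram (n = 2) gated convolution over the embeddings xs."""
    dim = len(b)
    c1 = [0.0] * dim  # accumulated 1-gram patterns
    c2 = [0.0] * dim  # accumulated 2-gram patterns
    h = [0.0] * dim   # previous hidden vector
    hidden: List[Vector] = []
    for x in xs:
        wx, uh = matvec(W_lam, x), matvec(U_lam, h)
        lam = [sigmoid(a + u + bl) for a, u, bl in zip(wx, uh, b_lam)]
        w1x, w2x = matvec(W1, x), matvec(W2, x)
        # c2 depends on the previous c1, so it is updated first.
        c2 = [l * c2k + (1.0 - l) * (c1k + w2k)
              for l, c2k, c1k, w2k in zip(lam, c2, c1, w2x)]
        c1 = [l * c1k + (1.0 - l) * w1k for l, c1k, w1k in zip(lam, c1, w1x)]
        h = [math.tanh(ck + bk) for ck, bk in zip(c2, b)]
        hidden.append(h)
    return hidden


def mean_pool(hidden: List[Vector]) -> Vector:
    """Average the hidden vectors over time (pooling choice is an assumption)."""
    return [sum(col) / len(hidden) for col in zip(*hidden)]


# Toy input: three tokens with 2-dimensional embeddings, hidden size 2.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W1 = [[0.5, -0.2], [0.1, 0.3]]
W2 = [[-0.4, 0.2], [0.3, 0.1]]
W_lam = [[0.1, 0.0], [0.0, 0.1]]
U_lam = [[0.2, 0.0], [0.0, 0.2]]
states = rcnn_layer(xs, W1, W2, W_lam, U_lam, [0.0, 0.0], [0.0, 0.0])
print(mean_pool(states))
```

A deeper model would feed `states` into a second call of `rcnn_layer` with its own parameters before pooling.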
We can also build a deep RCNN by stacking several convolution layers on top of the hidden vectors produced by the bottom convolution layer. Here we consider an RCNN with $d$ convolution layers, which outputs $\{h_i^{d}\}_{i=1}^{l}$. A final pooling operation over these hidden vectors yields the text representation $r$, which is fed into a softmax layer. The softmax layer outputs the probability distribution over the $|Y|$ categories given this representation, which takes the standard softmax form:
$$
P(y = j \mid r) = \frac{\exp\left(w_j^{\top} r + \beta_j\right)}{\sum_{k=1}^{|Y|} \exp\left(w_k^{\top} r + \beta_k\right)}
$$
where $w_j$ and $\beta_j$ denote the weight vector and bias of the softmax layer for category $j$. The cross-entropy objective function is used to optimize the RCNN model.
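The softmax output and the cross-entropy objective can be sketched as follows. This is a minimal illustration that assumes the per-category scores (a linear function of the text representation r) have already been computed; the function names and score values are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn per-category scores into a probability distribution."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: Sequence[float], gold_index: int) -> float:
    """Negative log-probability assigned to the gold category."""
    return -math.log(probs[gold_index])


# Hypothetical scores for (negative, neutral, positive).
example_scores = [0.2, 1.5, -0.3]
example_distribution = softmax(example_scores)
print(example_distribution, cross_entropy(example_distribution, 1))
```

Averaging this loss over the training tweets gives the objective that an optimizer such as Adam would minimize.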

Prediction
We learn different RCNN models with different embedding sets as input. Formally, we have $s$ embedding sets, denoted as $\{E_1, E_2, \cdots, E_s\}$, which we feed into $s$ RCNN models that are trained separately. We predict the sentiment label of a test tweet $x$ from these trained RCNN models by summing their label probabilities:
$$
\hat{y} = \arg\max_{y \in Y} \sum_{k=1}^{s} P_k\left(y \mid x; E_k\right)
$$
where $\hat{y}$ is the predicted label and $P_k(y \mid x; E_k)$ is the probability that the RCNN model trained with $E_k$ assigns to label $y$.
Table 1: Statistics of the embedding sets. R means the embedding set is publicly released and S means the embedding set is self-constructed. GloVe (Pennington et al., 2014) and Word2Vec (Mikolov et al., 2013) are the most popular embedding algorithms. Scale is the number of tokens in the corpus; M and B refer to million and billion respectively. The embedding set word2vecY is trained by Word2Vec with default settings, and the Yelp reviews are available at https://www.yelp.com/dataset_challenge.

Datasets and Settings
We use 4 embedding sets, which are described in Table 1. We also crawl and merge all annotated datasets of previous SemEvals and split them into training, development, and testing sets with a ratio of 8:1:1; these are shown in Table 2 together with the testing set of SemEval 2017. The table shows that the category ratio (negative:neutral:positive) of the SemEval 2017 testing set differs considerably from that of the previous SemEval datasets.
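An 8:1:1 split of the merged data can be sketched as below. The text does not say whether the data are shuffled or how rounding is handled, so the seeded shuffle and the integer arithmetic here are assumptions, and `split_8_1_1` is a hypothetical helper name.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_8_1_1(items: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle items and split them into train/dev/test with ratio 8:1:1."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded shuffle (an assumption)
    n = len(shuffled)
    n_train = n * 8 // 10
    n_dev = n // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # remainder goes to the test split
    return train, dev, test


# Placeholder tweet ids standing in for the merged SemEval data.
train, dev, test = split_8_1_1(range(100))
print(len(train), len(dev), len(test))
```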
For the model settings, all RCNN models have the same configuration but use different word embedding sets. We set the dimension of the hidden vectors to 250 and the depth d to 2. To avoid over-fitting, we use dropout and regularization as follows: (1) the regularization parameter is set to 1e-5; (2) the dropout rate is set to 0.3 and is applied to the final text representation. All parameters are learned with the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001. Note that all word embedding sets are kept fixed during training. All models are tuned on the development set.
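For reference, the settings listed above can be gathered into one configuration object. This is only a sketch: the class and field names are hypothetical and are not taken from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RCNNSettings:
    """Hyperparameters shared by all RCNN models (field names are hypothetical)."""
    hidden_dim: int = 250                 # dimension of hidden vectors
    depth: int = 2                        # number of convolution layers d
    regularization_weight: float = 1e-5   # regularization parameter
    dropout_rate: float = 0.3             # applied to the final text representation
    optimizer: str = "adam"
    learning_rate: float = 0.001
    train_embeddings: bool = False        # embedding sets stay fixed during training


settings = RCNNSettings()
print(settings)
```

Because every model shares the same configuration, only the embedding set passed to each model would differ between ensemble members.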

Results and Analysis
In this section, we first report the results on datasets of previous SemEvals, which are described in Table 3. Then, we report the performances of our method on SemEval 2017 in Table 4.
From Table 3, we observe that gloveT performs worst even though it is trained on an in-domain Twitter dataset, and word2vecY performs best even though it is derived from Yelp reviews. As far as we know, the Yelp data is constructed with careful filtering and is of high quality. Thus, we can conclude that the quality of the corpus is as important as its size and domain in Twitter sentiment classification. Additionally, word2vecGN outperforms the others in recall of the negative category, word2vecY performs best in recall of the neutral category, and gloveT is best in recall of the positive category. Different embedding sets thus exhibit different characteristics. Finally, the ensemble method obtains a significant improvement of 4%.
In Table 4, we compare our method with the best and median systems in SemEval 2017 and report the results of the individual embedding sets. Our method outperforms the other systems in accuracy but performs worse in R Average, especially in R Negative. Compared with the median system, our method improves both accuracy and R Average by about 5%. Unlike in Table 3, [...] worse among these embedding sets, while gloveT obtains the best performance. Additionally, we observe that gloveT performs best in both R Negative and R Positive, and word2vecY performs best in R Neutral. Compared with the embedding baselines, our ensemble method obtains improvements of 2.7% and 1.5% in accuracy and R Average respectively, which demonstrates the effectiveness of the proposed method.
Table 4: Results on SemEval 2017. The median system is the system ranked 19th among 38 teams.
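The two measures discussed in this section, accuracy and average recall (the unweighted mean of the per-category recalls), can be computed as in the sketch below. The function names and the toy labels are assumptions for illustration and do not reproduce any result from the tables.

```python
from typing import Dict, Sequence

LABELS = ("negative", "neutral", "positive")


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of tweets whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def recall_per_class(gold: Sequence[str], pred: Sequence[str],
                     labels: Sequence[str] = LABELS) -> Dict[str, float]:
    """Recall of each category: correct predictions divided by gold count."""
    recalls = {}
    for label in labels:
        support = sum(1 for g in gold if g == label)
        hits = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        recalls[label] = hits / support if support else 0.0
    return recalls


def average_recall(gold: Sequence[str], pred: Sequence[str],
                   labels: Sequence[str] = LABELS) -> float:
    """Unweighted mean of the per-category recalls."""
    recalls = recall_per_class(gold, pred, labels)
    return sum(recalls.values()) / len(labels)


# Toy labels, not taken from any SemEval dataset.
gold = ["negative", "negative", "neutral", "neutral", "neutral", "positive"]
pred = ["negative", "neutral", "neutral", "neutral", "positive", "positive"]
print(accuracy(gold, pred), average_recall(gold, pred))
```

Because average recall weights every category equally, a system can lead in accuracy while trailing in average recall when it misses many tweets of a rare category.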

Error Analysis
In this section, we analyze the incorrect predictions of our system in SemEval 2017. We summarize four kinds of errors.
The first kind is that some decisions need domain knowledge, which our method can only learn from the labeled datasets. Examples are as follows:
Messi's 100 international goals for Barcelona #fcblive https://t.co/fMkglvusL1 [via @thereisagenius]. Predicted label: neutral, gold label: positive
#Trudeau gives your cash to #Terrorist #Hamas-influenced group -#UNRWA -@CandiceMalcolm https://t.co/5i5o2qwRWl Predicted label: neutral, gold label: negative
Messis 9 goals in CL are more than 20 of the 32 teams in the competition have scored in total, and hes tied with five other sides #fcblive Predicted label: neutral, gold label: positive
The second kind involves emoticons in tweets: most of the word embedding sets do not include emoticon embeddings, yet emoticons almost always carry sentiment. Examples are shown in Figure 2.
The third kind is that the sentiment is not consistent within a tweet. For example, when the first half is positive and the second half is negative, our system predicts 'positive' or 'negative', whereas the gold label is neutral:
@jimmyfallon 1. Emily 2. Michel 3. Kirk 4. TJ. Love the quirky ones and Emily coz she's such a BIATCH! #gilmoregirlstop4. Predicted label: positive, gold label: neutral
The fourth kind is sarcasm, such as:
#Hamas leader: #Trump may be a #Jew https://t.co/jGFZTvj2pF. Predicted label: positive, gold label: negative

Conclusion
We propose a simple and effective ensemble method to boost neural Twitter sentiment classification. By using different embedding sets, the system can cover more words and encode more sentiment information. The results on the datasets of previous SemEvals and on SemEval 2017 show the effectiveness of our method. Moreover, we conduct an error analysis to identify the main challenges for our method. We release our code for reproducibility.