NLPZZX at SemEval-2018 Task 1: Using Ensemble Method for Emotion and Sentiment Intensity Determination

In this paper, we present a system that competed in SemEval-2018 Task 1: “Affect in Tweets”. Our system uses a simple yet effective ensemble method that combines several neural network components. We participate in two subtasks for English tweets: EI-reg and V-reg. For each subtask, we examine different combinations of neural components. For EI-reg, our system achieves a Pearson Correlation Coefficient of 0.727 (all instances) and 0.555 (0.5-1). For V-reg, the corresponding scores are 0.835 and 0.670.


Introduction
Sentiment analysis is a research area in the field of natural language processing. It aims to detect the sentiment expressed by the author of some form of textual data, and many deep learning approaches have been applied to it successfully (Cambria, 2016). The goal of SemEval-2018 Task 1 "Affect in Tweets" is to automatically determine the intensity of emotion and the intensity of sentiment of tweeters from their tweets (Mohammad et al., 2018). The tweets come in three languages: English, Arabic and Spanish. We participate in two subtasks for English tweets: EI-reg and V-reg. For EI-reg, the English tweets are separated into four emotions: anger, fear, joy and sadness. Each emotion has its own train, dev and test datasets. This subtask determines the intensity of the given emotion, a real-valued score between 0 and 1 that best represents the mental state of the tweeter. Instances with higher scores correspond to a greater degree of the emotion than instances with lower scores. For V-reg, the English tweets are divided into train, dev and test datasets. This subtask determines the intensity of sentiment, or valence, that best represents the mental state of the tweeter, again as a real-valued score between 0 and 1. Instances with higher scores correspond to a greater degree of positive sentiment than instances with lower scores. Both subtasks are regression tasks.
For these two subtasks, we adopt separate ensemble methods built from existing neural network components (Brueckner and Schulter, 2014; Kim, 2014; Li and Qian, 2016; Yang et al., 2017) (see Figure 1). We combine a BiLSTM-CNN component, a BiLSTM-Attention component and a Deep BiLSTM-Attention component, each with different embeddings, in a simple ensemble. In both subtasks, our final model is simply the average of the scores produced by the single components we select. Each emotion and valence uses a different selection of components, so the two subtasks involve several distinct ensembles. Experimental results show that the proposed ensemble methods are simple yet effective.
The remainder of the paper is structured as follows. Section 2 details the proposed ensemble method. Section 3 presents the experimental results of the proposed methods. Finally, Section 4 draws conclusions.

Methodology
We propose a simple ensemble method over different neural network components. This section describes the implementation details of these components: the preprocessing of raw tweets, the lexicon features and embedding resources they use, their architecture, and the best parameters of each single component. The best parameters are those that maximize the Pearson Correlation Coefficient between the predicted and real values.

Data Preprocessing
In general, tweets are not always syntactically well-structured, and the language used does not always strictly adhere to grammatical rules (Barbosa and Feng, 2010), so we preprocess raw tweets before feature extraction. First, we remove the # symbol from hashtags while retaining the word itself, and remove stop words using nltk.corpus. Then the tweets are converted to lowercase. Finally, we tokenize the tweets with TweetTokenizer from NLTK (http://www.nltk.org/).
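As a rough illustration of these preprocessing steps, the following Python sketch applies them in the order described. It is not the authors' code: `STOP_WORDS` is a tiny hypothetical stand-in for the NLTK stop-word list, and `TOKEN_PATTERN` is a simplified regular expression standing in for TweetTokenizer, so that the example runs without external resources.

```python
import re

# Hypothetical stand-in for the NLTK English stop-word list (assumption).
STOP_WORDS = {"a", "an", "the", "is", "am", "are", "to", "of", "and", "i"}

# Simplified stand-in for TweetTokenizer (assumption):
# a mention, a word (optionally with an apostrophe), or a single symbol.
TOKEN_PATTERN = re.compile(r"@\w+|\w+(?:'\w+)?|[^\w\s]")


def preprocess(tweet):
    """Strip '#', drop stop words, lowercase, then tokenize."""
    text = tweet.replace("#", "")  # keep the hashtag word itself
    kept = [w for w in text.split() if w.lower() not in STOP_WORDS]
    text = " ".join(kept).lower()
    return TOKEN_PATTERN.findall(text)


tokens = preprocess("I am SO #happy today!")
```

In a real pipeline the two stand-ins would be replaced by the NLTK resources named in the text.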

Feature Extraction
Each tweet is represented as a concatenation of two kinds of feature vectors: lexicon features and word embeddings. In our system, each tweet is split into words, every word is represented as a (d + m)-dimensional vector, and thus each tweet is represented as an l × (d + m) matrix, where d is the dimension of the word embedding, m is the dimension of the lexicon features, and l is the tweet length, which we assume to be the same for every tweet. We utilize a variety of resources for feature extraction as follows:
1. AFINN: Calculating positive and negative sentiment scores from the lexicon (Nielsen, 2011).
8. NRC Hashtag Sentiment Lexicon: Association of words with positive (negative) sentiment generated automatically from tweets with sentiment-word hashtags (Mohammad et al., 2013).
9. Emoji: A manually classified dictionary in which each emoji has a corresponding polarity value.
10. SentiWordNet: SentiWordNet is a lexical resource explicitly devised for supporting sentiment classification and opinion mining applications (Baccianella et al., 2010). It assigns each WordNet entry weights that indicate how strongly the entry belongs to the positive and negative categories.
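The l × (d + m) tweet representation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `tweet_matrix`, the toy `embeddings` and `lexicon` dictionaries, zero vectors for unknown words, and zero-padding or truncation to length l are all hypothetical choices.

```python
def tweet_matrix(tokens, embeddings, lexicon, d, m, l):
    """Return an l x (d + m) matrix (list of rows) for one tokenized tweet."""
    rows = []
    for token in tokens[:l]:  # truncate tweets longer than l (assumption)
        emb = embeddings.get(token, [0.0] * d)  # zero vector if unknown (assumption)
        lex = lexicon.get(token, [0.0] * m)
        rows.append(list(emb) + list(lex))  # concatenate embedding and lexicon features
    while len(rows) < l:  # zero-pad shorter tweets (assumption)
        rows.append([0.0] * (d + m))
    return rows


# Toy resources with d = 3 embedding dimensions and m = 2 lexicon features.
embeddings = {"happy": [0.2, 0.7, 0.1], "day": [0.5, 0.1, 0.3]}
lexicon = {"happy": [1.0, 0.0]}
matrix = tweet_matrix(["happy", "day"], embeddings, lexicon, d=3, m=2, l=4)
```

Each row of `matrix` corresponds to one word position, so every tweet yields a matrix of the same shape regardless of its original length.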

Bidirectional LSTM with CNN
The BiLSTM with CNN model first transforms tweets into text matrices. The BiLSTM is applied to these matrices to build new text matrices, and the CNN is then applied to the output of the BiLSTM to obtain text vectors for predicting emotional intensity. Because BiLSTM with CNN achieves rather good results on emotion analysis (He et al., 2017), we choose it for our task.
Model Architecture: Embedding vectors are fed into a BiLSTM network followed by a CNN layer. The CNN layer consists of a one-dimensional convolutional layer and a pooling layer; the convolution uses 256 filters, a window size of 3, and the ReLU activation function. Both the input and the output of the convolutional layer are 3D tensors. The output of the CNN layer is flattened after the max-pooling operation. After the Flatten layer, two dense layers are stacked, with ReLU and sigmoid activations respectively. Dropout (Srivastava et al., 2014) is applied between the two dense layers to avoid potential overfitting. We select ReLU to mitigate the vanishing gradient problem and to speed up computation. Since the task is a regression problem, the final dense projection uses a sigmoid activation to produce an intensity value between 0 and 1.
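To make the convolution, max-pooling and flatten steps concrete, here is a small pure-Python sketch of the computation on a single sequence. It is only an illustration: the real system uses 256 filters inside Keras, whereas this example uses one hand-written filter, and the pool size of 2 is an assumption because the text does not state it.

```python
def conv1d_relu(seq, filters, biases):
    """1D convolution (no padding) followed by ReLU.

    seq:     list of time steps, each a list of channel values
    filters: list of filters, each a list of `window` weight vectors
    biases:  one bias per filter
    """
    window = len(filters[0])
    out = []
    for t in range(len(seq) - window + 1):
        row = []
        for filt, bias in zip(filters, biases):
            s = bias
            for k in range(window):
                s += sum(w * x for w, x in zip(filt[k], seq[t + k]))
            row.append(max(0.0, s))  # ReLU
        out.append(row)
    return out


def max_pool_flatten(feature_map, pool_size=2):
    """Non-overlapping max pooling over time, then flatten to one vector."""
    pooled = []
    for start in range(0, len(feature_map) - pool_size + 1, pool_size):
        block = feature_map[start:start + pool_size]
        pooled.append([max(column) for column in zip(*block)])
    return [value for row in pooled for value in row]


# One input channel, one filter with window size 3 (as in the text).
sequence = [[1.0], [3.0], [2.0], [5.0], [4.0]]
filters = [[[-1.0], [0.0], [1.0]]]
feature_map = conv1d_relu(sequence, filters, [0.0])
flat = max_pool_flatten(feature_map)
```

The flattened vector `flat` plays the role of the input to the two dense layers described above.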
Model Training: The network parameters are learned by minimizing the mean squared error (MSE) between the real and predicted values of emotion intensity or valence intensity. We optimize this loss function with Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions based on adaptive estimates of lower-order moments (Kingma and Ba, 2014). Batch size and the number of training epochs may differ across emotions and valence. To avoid overfitting, we use dropout in this model. We tune these three parameters (batch size, training epochs and dropout) for each system. In addition, we try various optimization algorithms with the same param-

Bidirectional LSTM with Attention
Bidirectional LSTM with Attention achieves good results on SemEval-2017 Task 4 "Sentiment Analysis in Twitter" (Baziotis et al., 2017), so we adopt the Bidirectional LSTM with Attention model and the Deep Bidirectional LSTM with Attention model for our tasks.
Model Architecture: In the Bidirectional LSTM with Attention model, embedding vectors are fed into a BiLSTM network followed by an attention layer (Yang et al., 2017). Not all words contribute equally to the expression of sentiment in a tweet, so we use an attention layer to weight the importance of each word in the tweet. After the attention layer, the remaining layers are the same as in the Bidirectional LSTM with CNN model. The deep version differs only in that it uses two BiLSTM layers before the attention layer.
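The idea of weighting words by importance can be sketched as follows. This is a simplified form of attention pooling, not necessarily the exact layer used in the system: the scoring function (a dot product between a context vector and the tanh of each hidden state) and the names `attention_pool` and `context` are assumptions made for illustration.

```python
import math


def attention_pool(hidden_states, context):
    """Return (weights, pooled): softmax word weights and the weighted sum of states."""
    # Score each time step against the (normally learned) context vector.
    scores = [
        sum(c * math.tanh(h) for c, h in zip(context, state))
        for state in hidden_states
    ]
    # Numerically stable softmax over time steps.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the hidden states gives one vector per tweet.
    dim = len(hidden_states[0])
    pooled = [
        sum(w * state[j] for w, state in zip(weights, hidden_states))
        for j in range(dim)
    ]
    return weights, pooled


# Two time steps with 2-dimensional hidden states (toy values).
states = [[0.0, 0.0], [2.0, 2.0]]
weights, pooled = attention_pool(states, context=[1.0, 1.0])
```

Words whose hidden states align with the context vector receive larger weights and therefore contribute more to the pooled tweet vector.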
Model Training: We learn the network parameters in the same way as for the BiLSTM with CNN model. In EI-reg, for every emotion we use the same batch size, training epochs and dropout to train the Deep BiLSTM Attention model with different pre-trained word embeddings; in V-reg, the batch size, training epochs and dropout differ across the Deep BiLSTM Attention models with different pre-trained word embeddings. Dropout is also used in these models.
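The training objective shared by these models, a sigmoid output compared against gold intensities with mean squared error, can be illustrated with a few lines of Python. The values in `logits` and `gold` are made up for the example, and the optimizer itself (Adam) is not reproduced here.

```python
import math


def sigmoid(z):
    """Squash a real-valued activation into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def mse(y_true, y_pred):
    """Mean squared error between gold and predicted intensities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


logits = [-2.0, 0.0, 1.5]  # hypothetical pre-activation outputs of the last dense layer
predictions = [sigmoid(z) for z in logits]
gold = [0.1, 0.5, 0.9]  # hypothetical gold intensity scores
loss = mse(gold, predictions)
```

Because the sigmoid bounds every prediction between 0 and 1, the predictions live on the same scale as the gold intensity scores.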
The best parameters of EI-reg for these models are given in Table 2.3.1 and V-reg's best parameters are given in Table 2.3.1.

Ensemble Methods
Ensembling is a widely used strategy that combines multiple single components to improve overall performance. Many ensemble methods have been proposed, such as Voting, Blending, Bagging and Boosting. In this system, due to time constraints, we choose a simple average of the scores provided by the different components, since each single component can itself predict emotional intensity or valence intensity. It can be defined as
$$\hat{y} = \frac{1}{n} \sum_{i=1}^{n} \hat{y}_i$$
where n is the number of neural components and $\hat{y}_i$ is the score predicted by the i-th component.
For experiments, we use five datasets from the two subtasks. These datasets, "EI-reg-En-anger (anger)", "EI-reg-En-joy (joy)", "EI-reg-En-fear (fear)", "EI-reg-En-sadness (sadness)" and "2018-Valence-reg-En (valence)", are downloaded from SemEval-2018 Task 1 "Affect in Tweets". In the EI-reg datasets, each tweet consists of the id, the tweet, the emotion of the tweet and the emotion intensity; in the V-reg dataset, each tweet consists of the id, the tweet, the sentiment of the tweet and the sentiment intensity. All datasets are divided into a train set, a dev set and a test set. The gold labels of the test sets were released only after the evaluation period. Statistics of the datasets are shown in Table 3.
To measure the performance of the selected methods, two submetrics of the Pearson Correlation Coefficient (PCC) are used. PCC (all instances) is the Pearson correlation computed over all tweets in the test data. PCC (0.5-1) is the Pearson correlation computed over the subset of test data that includes only tweets with an intensity score greater than or equal to 0.5. Both values lie between -1 and 1, and for both metrics a larger value indicates a better prediction.
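The two submetrics can be computed as in the sketch below. It assumes that the 0.5 threshold is applied to the gold intensity scores, and it returns 0.0 when the correlation is undefined (fewer than two points or zero variance); both are conventions chosen for this example rather than details stated in the text.

```python
import math


def pearson(x, y):
    """Pearson correlation; 0.0 when undefined (convention for this sketch)."""
    n = len(x)
    if n < 2:
        return 0.0
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return cov / (norm_x * norm_y)


def pcc_scores(gold, pred):
    """Return (PCC over all instances, PCC over instances with gold >= 0.5)."""
    upper = [(g, p) for g, p in zip(gold, pred) if g >= 0.5]
    pcc_all = pearson(gold, pred)
    pcc_upper = pearson([g for g, _ in upper], [p for _, p in upper])
    return pcc_all, pcc_upper


gold = [0.1, 0.4, 0.6, 0.9]  # toy gold intensities
pred = [0.2, 0.3, 0.7, 0.8]  # toy system predictions
pcc_all, pcc_upper = pcc_scores(gold, pred)
```

The second metric focuses on how well a system ranks the more intense tweets, which is why it is typically lower than the first on real data.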
For each dataset, we use the dev set to select our ensemble. First, we run the six components on every dev set. Then we combine the results of different components; different combinations of components lead to different results on the dev set. Finally, we select the combination that scores highest on the dev set for testing.
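The selection procedure can be sketched as an exhaustive search over component subsets, scoring the averaged predictions of each subset on the dev set. Whether the authors searched every subset is not stated, so the exhaustive search is an assumption, as are the component names `A`, `B`, `C` and the toy predictions.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation; 0.0 when undefined (convention for this sketch)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm = math.sqrt(sum((a - mean_x) ** 2 for a in x)) * math.sqrt(
        sum((b - mean_y) ** 2 for b in y)
    )
    return cov / norm if norm else 0.0


def average(prediction_lists):
    """Simple ensemble: per-tweet mean of the component scores."""
    n = len(prediction_lists)
    return [sum(values) / n for values in zip(*prediction_lists)]


def select_ensemble(dev_preds, dev_gold):
    """Return the component subset whose averaged scores correlate best with gold."""
    best_names, best_score = None, -2.0
    names = sorted(dev_preds)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            ensemble = average([dev_preds[name] for name in combo])
            score = pearson(ensemble, dev_gold)
            if score > best_score:
                best_names, best_score = combo, score
    return best_names, best_score


dev_gold = [0.1, 0.5, 0.9]  # toy gold intensities
dev_preds = {  # toy dev-set predictions of three hypothetical components
    "A": [0.2, 0.5, 0.8],
    "B": [0.3, 0.2, 0.9],
    "C": [0.6, 0.1, 0.5],
}
best_names, best_score = select_ensemble(dev_preds, dev_gold)
```

With six components there are only 63 non-empty subsets, so an exhaustive search of this kind is cheap once the dev-set predictions are cached.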
Our system is implemented in Keras with a TensorFlow backend. We report PCC (all instances) and PCC (0.5-1) for each emotion and for valence on the test data in Tables 3 and 3. For simplicity, we write WT, GN, GL and GT for the word vectors word2vectwitter-model, GoogleNews-vectors-negative300, glove.840B.300d and glove.twitter.27B.200d, respectively. We compare the results of our single components, the official baseline and our ensemble system. Each emotion and valence uses a different ensemble; the symbol '-' means that the component is not used in the ensemble for that emotion or valence. For example, on the anger dataset we ensemble only three components: BiLSTM Attention+GT, Deep BiLSTM Attention+WT and Deep BiLSTM Attention+GN. We do not use all six components because ensembling does not always help: the same component can have a good effect on one dataset and a bad effect on another. In the official results for EI-reg, our average PCC reaches 0.727 on all instances and 0.555 on 0.5-1 (both ranked 10 out of 48 participants). For V-reg, the result is 0.835 on all instances (ranked 7 out of 38) and 0.670 on 0.5-1 (ranked 6 out of 38). The average baseline results are 0.520 and 0.396 for EI-reg, and 0.585 and 0.449 for V-reg. These results show that the ensemble approach achieves a substantial improvement in performance across all emotions and valence, and achieves the best performance for anger.

Conclusions and Future Works
We have proposed a simple yet effective ensemble method that integrates various neural components to perform sentiment and emotion analysis of tweets. Experimental results show that our method is effective at predicting emotion intensity and sentiment intensity. Some other useful findings can be drawn from the experimental results: a) the best ensemble differs for each emotion; b) lexicon features and word embeddings are important for emotion and sentiment analysis; c) ensembling is not always effective. We also tried data augmentation to address the insufficient training data, but it did not work well. As for future work, although our ensemble method achieves good results, we would like to examine a multi-task deep learning approach to these tasks, which would predict the intensities of the different emotions at the same time and improve the generalization of the prediction model.