Yuan at SemEval-2018 Task 1: Tweets Emotion Intensity Prediction using Ensemble Recurrent Neural Network

We apply LSTM and BiLSTM models to emotion intensity prediction. We participated only in the third subtask of Task 1: Affect in Tweets. Our system ranked 6th among all teams.


Introduction
Sentiment analysis (SA) is the field that deals with the analysis of people's opinions, sentiments, evaluations, appraisals, attitudes and emotions towards particular entities (Liu, 2012). EmoInt (Mohammad and Bravo-Marquez, 2017) is a shared task hosted by WASSA 2017 that aims to predict emotion intensity in tweets. SemEval-2018 Task 1 subtask 3 (Mohammad et al., 2018) is similar to EmoInt; however, its goal is to detect valence, or sentiment intensity, with scores given as floating-point values between 0 and 1, representing low and high intensities of the emotion being expressed, respectively. In this subtask we do not know in advance whether the sentiment of a tweet is positive or negative, whereas in EmoInt the polarity can be inferred from which of the four datasets a tweet belongs to: anger, fear, joy or sadness. The task is challenging and remains an active area of research. The difficulties include the extensive use of hashtags, slang, abbreviations and emoticons. In addition, tweets are usually typed on mobile devices such as phones, laptops or iPads, which can result in a substantial number of typos.
Existing methods for modeling emotion intensity rely heavily on manually constructed lexicons, which contain intensity weights for each available word (Mohammad and Bravo-Marquez, 2017a; Neviarouskaya et al., 2007). The intensity of a whole tweet can then be deduced by combining the scores of individual words, which is simple but ignores word order and the compositionality of language. Building such lexicons is also a labour-intensive procedure. From these models we can learn how to combine the feature extraction and classification or regression stages, given a sufficient amount of training data.
Several deep learning methods have been applied to the same problem, including deep neural architectures for emotion intensity prediction in tweets (Goel et al., 2017) and character- and word-level recurrent neural models for tweet emotion intensity detection (Lakomkin et al., 2017).
In our work, we first clean the tweets, then build lexical features and search for the combination of features that yields the final vector representation of a tweet. Next, we train a neural network regression model to predict the tweet's intensity score. In addition, we tune the parameters of our models and ensemble them to obtain the best-performing results.

Data cleaning
We train our system on the dataset provided by the task organizers, which contains 1181 labeled training tweets and 449 labeled development tweets. The test set consists of 17874 unlabeled tweets; the gold labels were released only after the evaluation period. Before training a model or predicting on the test set, we first clean the tweets, which is an essential step. We use the following preprocessing steps.
(1) Hashtags are crucial markers for determining sentiment. The "#" symbol is removed and the word itself is retained. For example, the hashtag "#the_best_one" becomes "the best one".
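The hashtag step above can be sketched as follows. This is a minimal illustration, not the authors' cleaning code: the function name `clean_hashtags` and the regular expression are assumptions, and underscores inside a hashtag are treated as word separators to match the "#the_best_one" example.

```python
import re

# Matches '#' followed by the hashtag body (letters, digits, underscores).
HASHTAG_RE = re.compile(r"#(\w+)")


def clean_hashtags(text: str) -> str:
    """Drop the '#' marker and split underscore-joined hashtag words."""
    def _expand(match):
        # Keep the hashtag text itself, turning underscores into spaces.
        return match.group(1).replace("_", " ")

    return HASHTAG_RE.sub(_expand, text)


print(clean_hashtags("Today was #the_best_one"))
```

Hashtags without underscores (for example "#happy") simply lose the "#" symbol and are otherwise left unchanged.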

Feature Extraction
To extract features from tweets as completely as possible, we consider two kinds of features: annotated lexicons and pre-trained word embeddings.

Annotated Lexicon
To extract lexicon features, we follow the procedure of the baseline system provided in the WASSA Emotion Intensity Task. The knowledge sources used include the MPQA subjectivity lexicon (Wilson et al., 2005), among others. Two more features are calculated from the emoticons (obtained from AFINN (Nielsen, 2011)) and the negations present in the text. Among the lexicons of the baseline system, we use the following:
• Emoji Valence (EV): This is a hand-classified lexicon of Unicode emojis, rated on a scale of -5 (negative) to 5 (positive).
• SentiWordNet (SWN): Calculates positive and negative sentiment scores using SentiWordNet, which is an opinion mining resource available through NLTK.
• Depeche Mood (DM) (Staiano and Guerini, 2014): This is a lexicon of about 37,000 unigrams annotated with real-valued scores for the emotional states afraid, amused, angry, annoyed, don't care, happy, inspired and sad.
• Emoticon Sentiment Lexicon: Note that this is a sentiment lexicon drawn from emoticons, and is not an emotion lexicon.
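As a rough illustration of the two extra features mentioned above (emoticons and negations), the sketch below computes an emoticon score and a negation count for a tokenized tweet. The scores in `EMOTICON_SCORES` and the word list in `NEGATION_WORDS` are hypothetical placeholders, not values taken from AFINN or from the baseline system, and summing the scores is an assumed aggregation.

```python
# Hypothetical emoticon scores; the real values come from AFINN.
EMOTICON_SCORES = {":)": 2, ":(": -2, ":D": 3}

# Hypothetical negation word list, for illustration only.
NEGATION_WORDS = {"not", "no", "never"}


def emoticon_and_negation_features(tokens):
    """Return [summed emoticon score, negation count] for a token list."""
    emoticon_score = sum(EMOTICON_SCORES.get(tok, 0) for tok in tokens)
    negation_count = sum(1 for tok in tokens if tok.lower() in NEGATION_WORDS)
    return [float(emoticon_score), float(negation_count)]


print(emoticon_and_negation_features(["I", "am", "not", "sad", ":)"]))
```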

Word Embedding
The text can be converted into word embeddings, which represent each word of the text as a d-dimensional vector (Mikolov et al., 2013). Since we deal with tweets, we use GloVe word embeddings trained on 2 billion tweets from Twitter (Pennington et al., 2014); vectors of 100, 200 and 300 dimensions are provided as part of the pre-trained model. For this work, we use the 300-dimensional vectors trained on 42B tokens. We also tried GoogleNews-vectors-negative300 in our experiments, but the results were not as good as with the GloVe word embeddings.

Model Training
After feature extraction and word embedding, each word in a tweet is represented as a high-dimensional vector of dimension d + l, where d is the dimension of the GloVe word embedding (300) and l is the length of the additional lexical feature vector. Once the tweets are represented in this way, we train the models. Since the task requires computing a real-valued emotion intensity score for the tweets in the test set, we explore several regression methods. Our system is implemented in Keras, and the best single model we selected is a BiLSTM that stacks two BiLSTM layers after the embedding layer, together with dropout. The main parameters of the model are: dropout probabilities of 0.25 and 0.5, respectively; 512 and 256 units in the two BiLSTM layers, respectively; and 256 units in the fully connected layer. The complete model structure is shown in Figure 1.

Figure 1: A two layer bidirectional LSTM model.
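The word representation described above can be sketched as a simple concatenation, assuming the embedding and the lexical features are joined end to end for each word. The toy sizes (3 and 2 instead of 300 and l), the dictionaries and the zero-vector fallback for unknown words are all assumptions made for illustration.

```python
D = 3  # toy embedding size (300 in the paper)
L = 2  # toy length of the lexical feature vector

# Hypothetical lookup tables standing in for GloVe and the lexicons.
EMBEDDINGS = {"happy": [0.1, 0.2, 0.3]}
LEXICON_FEATURES = {"happy": [1.0, 0.0]}


def word_vector(word):
    """Concatenate the embedding and the lexical features of one word."""
    emb = EMBEDDINGS.get(word, [0.0] * D)        # zeros for unknown words (assumed)
    lex = LEXICON_FEATURES.get(word, [0.0] * L)  # zeros when no lexicon entry exists
    return emb + lex                             # list concatenation: length D + L


def tweet_matrix(tokens):
    """Represent a tweet as one (D + L)-dimensional row per token."""
    return [word_vector(tok) for tok in tokens]


matrix = tweet_matrix(["so", "happy"])
print(len(matrix), len(matrix[0]))
```

A sequence model such as a BiLSTM would then consume one such row per time step.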

System tuning
Because the models are trained with Keras, only a few parameters need to be changed: we tune the choice of loss function, the dropout probability and the dimension of the BiLSTM layers. For the feature combination, we use all the annotated lexicons mentioned in section 3.1 in order to control the variables; we do not consider the impact of different lexicon combinations on the results, which may be addressed in future work. All tuning is done on the development set, and we record the results each time a model is finished. Ensembling is a widely used method for improving the performance of the overall system by combining the predictions of several models. Our system ensembles ten identical BiLSTM models and averages their results; the ensemble performs better than a single model. In other words, every BiLSTM in the ensemble has the same weight.
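Equal-weight ensembling of this kind amounts to averaging the predictions of the individual models tweet by tweet. The sketch below assumes each model's predictions are already available as a list of scores; the function name `ensemble_average` and the toy numbers are illustrative only.

```python
def ensemble_average(predictions):
    """Average per-tweet scores over models, giving each model equal weight.

    predictions: list of per-model score lists, all of the same length.
    """
    n_models = len(predictions)
    # zip(*predictions) groups the scores that different models gave to the same tweet.
    return [sum(scores) / n_models for scores in zip(*predictions)]


# Toy predictions from three models for two tweets.
model_outputs = [
    [0.25, 0.75],
    [0.50, 0.50],
    [0.75, 0.25],
]
print(ensemble_average(model_outputs))
```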

Experiment and results
All experiments were implemented with the Keras deep learning library using the Theano backend with CUDA enabled, and were run on a computer with an Intel Core(TM) i3 @3.4GHz CPU, 16GB of RAM and a GeForce GTX 1060 GPU. After testing many neural network models, we obtained the best results with LSTM and BiLSTM models. Table 1 shows the results of a single-layer LSTM under different loss functions and word embeddings: the MAE loss function gives the best result with the GloVe word embeddings. Table 2 shows the results of a single-layer BiLSTM, and of ensembles of ten such models, under different loss functions and word embeddings: the MAPE loss function gives the best result with the GloVe word embeddings. Table 3 shows the corresponding results for the two-layer BiLSTM, where the MAPE loss function again gives the best result with the GloVe word embeddings. In all three tables, performance with the GloVe word embeddings is generally better than with the word2vec embeddings.
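For reference, the two loss functions named above can be written out in plain Python using their standard definitions. This is an illustrative sketch rather than the Keras implementation; the small `eps` guard against division by zero is an assumption of this sketch.

```python
def mae(y_true, y_pred):
    """Mean absolute error between gold and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred, eps=1e-7):
    """Mean absolute percentage error, expressed in percent.

    eps avoids division by zero when a gold score is 0 (assumed guard).
    """
    total = sum(abs(t - p) / max(abs(t), eps) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


gold = [0.5, 1.0]
pred = [0.25, 0.5]
print(mae(gold, pred), mape(gold, pred))
```

Because MAPE divides each error by the gold score, the same absolute error counts more for low-intensity tweets than for high-intensity ones, whereas MAE treats all tweets alike.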
The systems in this subtask are evaluated using the Pearson correlation coefficient, which measures the linear correlation between two variables, and a secondary metric, the Pearson correlation on the subset of the test set that includes only tweets with an intensity score greater than or equal to 0.5. We present the results of the system submitted to the competition leaderboard in Table 4. The score of our system is 0.836 (Pearson) and 0.667 (Pearson gold in 0.5-1). Note that the model used on the test set is the best model on the development set, i.e., the one in Table 3.
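The two evaluation metrics described above can be sketched in plain Python as follows. This is not the official evaluation script; the function names and the toy scores are illustrative assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def pearson_gold_subset(gold, pred, threshold=0.5):
    """Pearson correlation restricted to tweets whose gold score is >= threshold."""
    pairs = [(g, p) for g, p in zip(gold, pred) if g >= threshold]
    gold_subset, pred_subset = zip(*pairs)
    return pearson(gold_subset, pred_subset)


gold = [0.1, 0.5, 0.7, 0.9]
pred = [0.3, 0.4, 0.6, 0.8]
print(pearson(gold, pred), pearson_gold_subset(gold, pred))
```

Both functions assume that the scores are not all identical, since a constant sequence has zero variance and the coefficient is then undefined.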

Conclusions
In this paper, we propose a deep learning framework to predict emotion intensity in tweets. The proposed system is based on a two-layer BiLSTM whose last layer performs a linear regression, so that the output is an intensity score, i.e., a continuous emotional value. Before training the model, we extract features and represent the tweets with word embeddings. Both the single model and the ensemble model are described in detail with a view to making our experiments replicable. The optimal parameters are reported along with our method of bringing the approaches together. Our submitted system beats the baseline system by about 25.1% on the test set. Our source code is available at https://github.com/ynuwm/SemEval-2018