DataSEARCH at IEST 2018: Multiple Word Embedding based Models for Implicit Emotion Classification of Tweets with Deep Learning

This paper describes an approach to solve implicit emotion classification with the use of pre-trained word embedding models to train multiple neural networks. The system described in this paper is composed of a sequential combination of Long Short-Term Memory and Convolutional Neural Network for feature extraction and Feedforward Neural Network for classification. In this paper, we successfully show that features extracted using multiple pre-trained embeddings can be used to improve the overall performance of the system with Emoji being one of the significant features. The evaluations show that our approach outperforms the baseline system by more than 8% without using any external corpus or lexicon. This approach is ranked 8th in Implicit Emotion Shared Task (IEST) at WASSA-2018.


Introduction
Emotion classification is a major area of interest within the field of Sentiment Analysis (SA). Social media is a rich source of emotional content, since people readily publish their views there. Twitter is one such platform; it enables users to publish micro-blogs known as tweets. Although each tweet is limited in its number of characters, tweets can be very significant when viewed as a group. Every day, on average, around 500 million tweets are posted on Twitter. This has attracted much interest from both academia and industry in studying the opinions expressed in tweets.
Tweets can generally be considered to contain textual content. However, tweet text is usually informal, containing many casual forms and emoji, which brings challenges for research. Implicit emotions pose a major challenge for emotion identification in tweets. This is due to the informal nature of tweets and the lack of methods to properly model such sentences. Here the term "implicit emotion" is defined as emotion conveyed in the text without directly stating the words that denote the emotion.
Implicit emotions affect opinion analysis tasks such as emotion identification and emotional intensity prediction. However, existing techniques for modeling implicit emotions in tweets do not perform sufficiently well. Therefore, this study contributes to research by exploring methods for properly modeling a tweet. The Implicit Emotion Shared Task (IEST) (Klinger et al., 2018), hosted by WASSA-2018, poses a similar task: finding the emotion expressed in a tweet, out of six basic emotions, without the use of the word denoting the emotion. This paper presents our approach to solving this problem. We were ranked 8th in the competition for this task.
Artificial Neural Networks (ANN) have been shown to perform better than conventional machine learning algorithms and have been used in a variety of Natural Language Processing tasks (Young et al., 2017). One of the primary objectives of using neural networks is to model the non-linear relationships in data, which are frequently observed in textual content. A number of studies have confirmed the effectiveness of neural networks as feature extractors, rather than as the final classifier, for opinion mining. A variety of neural network classifiers have been applied to similar tasks such as emotion identification, polarity classification, and other text classification tasks. Feedforward Neural Networks (FNN), Convolutional Neural Networks (CNN) (Kim, 2014) and Long Short-Term Memory (LSTM) networks (Tran and Cheng, 2018; Socher et al., 2013) are commonly used in recent related work. Furthermore, researchers have studied more complex forms of neural networks that combine CNN and LSTM in different ways.
The rest of the paper is organized as follows: Section 2 provides a brief description of the dataset, Section 3 describes the system architecture, and Section 4 reports the results and analysis of our system. Finally, we conclude our work in Section 5 along with a discussion of further improvements.

Dataset
The dataset is labeled based on the emotion word present in the tweet before replacing that emotion word in the text with a placeholder. The dataset is labeled for six basic emotions: Anger, Sad, Joy, Fear, Disgust and Surprise. The complete details of the dataset can be found in the task description paper (Klinger et al., 2018).

System Description
The system consists of three components: the preprocessor, the feature extractor and the classifier. In this study we considered that an effective classifier trained on the training dataset could also be used as a feature extractor. This section is subdivided to describe each of these components separately.

Preprocessing
The tweets contained in the dataset are already preprocessed to an extent: URLs were replaced with "http://url.removed", mentions with "@USERNAME" and new lines with "[NEWLINE]". Additionally, we performed the following preprocessing on the dataset: changing the target term "[#TRIGGERWORD#]" to " trigger " and "[NEWLINE]" to " newline ". These changes were made to correct the tokenization. We used the TweetTokenizer available in the Python NLTK library (https://www.nltk.org/api/nltk.tokenize.html) for tokenization. In addition to the NLTK tokenizer, we evaluated our system using a dictionary-based tokenizer.
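The placeholder rewriting and tokenization described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the example tweet are hypothetical, whitespace collapsing is an added assumption, and a plain whitespace split is used as a fallback when NLTK is not installed.

```python
import re

try:
    # TweetTokenizer is regex-based and needs no extra NLTK data downloads.
    from nltk.tokenize import TweetTokenizer
    _tokenize = TweetTokenizer().tokenize
except ImportError:
    # Fallback (assumption) so the sketch still runs without NLTK.
    _tokenize = str.split

# Placeholder rewrites described in the text; the padding spaces keep the
# replacement word separated from neighbouring characters.
PLACEHOLDERS = {
    "[#TRIGGERWORD#]": " trigger ",
    "[NEWLINE]": " newline ",
}


def normalize_placeholders(text):
    """Replace dataset placeholders with plain words (hypothetical helper)."""
    for marker, replacement in PLACEHOLDERS.items():
        text = text.replace(marker, replacement)
    # Collapse the extra whitespace introduced by the padded replacements.
    return re.sub(r"\s+", " ", text).strip()


def preprocess(text):
    """Normalize placeholders, then tokenize."""
    return _tokenize(normalize_placeholders(text))


example = "I feel [#TRIGGERWORD#] when @USERNAME is late[NEWLINE]again http://url.removed"
tokens = preprocess(example)
```

Rewriting the bracketed markers as ordinary words presumably keeps the tokenizer from splitting them into punctuation fragments, so each placeholder survives as a single token.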

Feature Extraction
A number of techniques have been developed to extract features for the classifier, some of which are trained on the dataset in order to create features explicitly. The most basic feature unit is the word. We used words to obtain word vectors from multiple word embedding models trained on different corpora. Although our best performing system was based on word embeddings, we developed and evaluated other features as well. In this section we describe all the features that we tried.
Word Vectors: Table 1 summarizes all of the word embedding models we used in our implementation. It lists each word embedding technique, the dataset it was trained on and its specific features. Additionally, it provides an identifier which we use to refer to that word embedding in the following sections. Tweets can be represented as word vectors using the word2vec approach (Mikolov et al., 2013). GW2V was obtained by training word2vec on part of the Google News dataset. Similarly, Godin et al. (2015) have provided a word2vec model trained on a Twitter dataset (TW2V). Furthermore, fastText (Joulin et al., 2016) models trained on the UMBC webbase corpus and the statmt.org news dataset are available with and without subword information (WSFT and WFT) (Mikolov et al., 2018). The GloVe (Pennington et al., 2014) embedding (TGv) was trained on a Twitter corpus containing two billion tweets. Eisner et al. (2016) have released emoji2vec (E2V), a pre-trained embedding model for all Unicode emoji. E2V is intended to be used as an extension to GW2V.
Transfer Features: Features generated by training a neural classifier on the training dataset and taking the output of its last hidden layer (the layer before the output layer).
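To illustrate what using E2V as an extension to GW2V means for the word-vector lookup, here is a minimal sketch with toy dictionaries standing in for the pre-trained models. The vectors, the three-dimensional size and the zero vector for unknown tokens are all assumptions made for illustration; the real models are much higher-dimensional.

```python
# Toy stand-ins for pre-trained embedding models (values are made up).
gw2v = {"so": [0.1, 0.3, -0.2], "happy": [0.7, 0.1, 0.4]}
e2v = {"😂": [0.9, -0.5, 0.2]}

DIM = 3
UNKNOWN = [0.0] * DIM  # assumption: out-of-vocabulary tokens map to zeros


def lookup(token, *models):
    """Return the vector from the first model that contains the token."""
    for model in models:
        if token in model:
            return model[token]
    return UNKNOWN


def embed(tokens, *models):
    """Map a token sequence to a sequence of word vectors."""
    return [lookup(tok, *models) for tok in tokens]


tokens = ["so", "happy", "😂", "zzz"]
with_emoji = embed(tokens, gw2v, e2v)   # GW2V extended with E2V
without_emoji = embed(tokens, gw2v)     # GW2V alone
```

With GW2V alone the emoji token falls back to the unknown vector, whereas the extended lookup gives it a vector of its own, which is the extra signal that the emoji embedding contributes.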

Classifiers
The trial data provided in the competition is reasonably large for evaluating the model performance. As described in Section 3.2, different combinations of feature extractors were used. Following the feature extraction process, extracted features were used to train various neural networks.

LSTM-CNN
Two of the commonly used techniques to model text documents are Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks. Rather than developing separate CNN and LSTM networks, the proposed system uses a combination of the two. Figure 1 illustrates the proposed LSTM-CNN architecture. The hyper-parameters selected for this network are tabulated in Table 2. The network parameters are learned by minimizing the categorical cross-entropy between the actual and predicted categories. Optimization is per-

Section          Parameter      Value
Hidden Layer 1   Num. of Units  50
                 Activation     ReLU
Hidden Layer 2   Num. of Units  25
                 Activation     ReLU

(Bengio et al., 2003). Furthermore, Tang et al. (2014) used a deep neural network to learn sentiment-specific word embeddings.
The proposed FNN architecture is shown in Figure 2, and the related hyper-parameters used in the final system are provided in Table 3.
The training parameters of the FNN are similar to those of the LSTM-CNN model. Dropout layers with a dropout rate of 0.5 were used after each hidden layer during training. The features used to train the FNN are transferred from the dense layer of LSTM-CNN models trained with different embedding models. Several feature vectors obtained from the LSTM-CNN models are concatenated and provided as input to the FNN. The final system used features from LSTM-CNN models trained with the embeddings TW2V, GW2V + E2V and WFT.
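The feature concatenation and the FNN forward pass can be sketched in plain Python as below. This is an illustrative sketch only: the per-model feature size, the random weights and the softmax output over six emotions are assumptions, and the hidden-layer sizes (50 and 25 units, ReLU) are taken from the hyper-parameter table on the assumption that it describes the FNN. Dropout is active only during training, so it does not appear in this inference-time pass.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Random weights and zero biases (untrained, for illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def concat(*feature_vectors):
    """Vector concatenation of the transferred feature vectors."""
    out = []
    for fv in feature_vectors:
        out.extend(fv)
    return out


rng = random.Random(0)
FEATURE_DIM = 8  # hypothetical size of each LSTM-CNN dense-layer output

# Stand-ins for features transferred from three LSTM-CNN models.
f_tw2v = [rng.random() for _ in range(FEATURE_DIM)]
f_e2v = [rng.random() for _ in range(FEATURE_DIM)]
f_wft = [rng.random() for _ in range(FEATURE_DIM)]

x = concat(f_tw2v, f_e2v, f_wft)
hidden1 = init_layer(rng, len(x), 50)
hidden2 = init_layer(rng, 50, 25)
output = init_layer(rng, 25, 6)  # six emotion classes


def predict(features):
    h = relu(dense(features, *hidden1))
    h = relu(dense(h, *hidden2))
    return softmax(dense(h, *output))


probs = predict(x)
```

Because the FNN sees only the concatenated vector, adding or removing an embedding model changes nothing but the input width of the first hidden layer.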

Optimization
The hyper-parameters of the neural networks should be optimized to gain better performance. They were selected based on the results on the trial set and were optimized both manually and with the Tree-structured Parzen Estimator (TPE) (Bergstra et al., 2011). However, due to limited processing power and time, we were not able to perform a comprehensive analysis of different hyper-parameter variations.
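A TPE search of this kind can be sketched with Hyperopt as follows. The objective below is a hypothetical stand-in for "train the network and return a loss on the trial set", and the search space (dropout rate and number of units) is an assumption chosen for illustration, not the space the authors searched.

```python
def trial_set_loss(params):
    """Hypothetical objective standing in for '1 - F1 on the trial set'.

    A real objective would train a network with these hyper-parameters
    and evaluate it on the trial set; this toy version is minimal at
    dropout = 0.5 and units = 50.
    """
    return (params["dropout"] - 0.5) ** 2 + abs(params["units"] - 50) / 100.0


def tpe_search(max_evals=50):
    """Run a TPE search over an illustrative two-parameter space."""
    # Imported here so the objective can be inspected without Hyperopt.
    from hyperopt import Trials, fmin, hp, tpe

    space = {
        "dropout": hp.uniform("dropout", 0.1, 0.8),
        "units": hp.quniform("units", 10, 100, 5),
    }
    trials = Trials()
    best = fmin(fn=trial_set_loss, space=space, algo=tpe.suggest,
                max_evals=max_evals, trials=trials)
    return best, trials
```

Calling `tpe_search()` returns the best sampled values together with the full trial history. Each evaluation corresponds to one complete training run in practice, so `max_evals` is the knob that limited compute constrains.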

Implementation Details
The system is implemented in Python, with Keras (Chollet et al., 2015), using TensorFlow (Abadi et al.) as the backend, and Scikit-learn (Pedregosa et al., 2011) as the most used external libraries. Hyper-parameter optimization is performed with the Hyperopt library (Bergstra et al., 2013). Any hyper-parameter not mentioned in Section 3 is left at its default value in the respective library. Furthermore, we have made our source code and trained models available online.

Evaluation and Discussion
The first set of analyses examined the impact of LSTM-CNN models trained with different word embedding models. The results of the LSTM-CNN analysis are set out in Table 4. For the trial-set evaluation, the model is trained on the training dataset and evaluated on the trial set. For the test-set evaluation, the training data comprised both the training data and the trial data.
It is apparent from Table 4 that the model performed similarly on the trial and test datasets, achieving similar or better $F_1$ scores and similar variations from one feature to another. We observe the best performance when using word2vec trained on Twitter. This could be because it contains in-domain vocabulary. What stands out in the table is the improvement in the results of $M_{GW2V}$ with the inclusion of emoji2vec. This suggests that emoji provide substantial support for finding emotion in implicit contexts. Furthermore, we observe that $M_{WFT}$ performs better than $M_{WSFT}$, which suggests that the sub-word information provided by the embedding is not important in creating the model. Another noteworthy observation is that all the models in Table 4 outperform the baseline model in both the trial and test cases, which supports the effectiveness of the proposed model for the implicit emotion prediction task.
In the next part of the analysis, we used an FNN trained on features extracted from the LSTM-CNN models. Table 5 provides the evaluation results of these models on the test set. '++' represents the vector concatenation operation, and $f(M)$ denotes a function that extracts the learned features from the last dense layer of model $M$ for a given input text. The evaluations are performed using the three best performing LSTM-CNN models: $M_{TW2V}$, $M_{WFT}$ and $M_{E2V}$. We omitted $M_{GW2V}$ from this analysis since the word vectors used to train $M_{GW2V}$ are already contained in $M_{E2V}$.
The results in Table 4 can be compared with those in Table 5, which shows that the performance (precision, recall and $F_1$) of the combined models is better than that of the individual model variants. Closer inspection of Table 5 shows that the best models are obtained when features from $M_{TW2V}$ and $M_{E2V}$ are used together. The overall best performance is obtained when features from $M_{TW2V}$, $M_{E2V}$ and $M_{WFT}$ are concatenated.
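For readers who want to reproduce the reported metrics, the sketch below computes precision, recall and F1 per class and averages them over classes. Macro-averaging is an assumption here, since the text does not state which averaging scheme the tables use, and the gold and predicted labels are made-up toy data.

```python
def per_class_scores(gold, pred, label):
    """Precision, recall and F1 for a single emotion label."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def macro_average(gold, pred):
    """Unweighted mean of the per-class scores (assumed averaging scheme)."""
    labels = sorted(set(gold))
    scores = [per_class_scores(gold, pred, lab) for lab in labels]
    n = len(labels)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Toy labels for illustration only.
gold = ["joy", "joy", "fear", "anger"]
pred = ["joy", "fear", "fear", "anger"]
precision, recall, f1 = macro_average(gold, pred)
```

Macro-averaging weights every emotion equally regardless of how many tweets carry it, so a model cannot score well by favouring the most frequent class.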

Conclusion
This study set out to propose a system for implicit emotion classification with state-of-the-art neural network classifiers. Additionally, we investigated the effectiveness of combining different pre-trained embeddings for implicit emotion classification of tweets. In this study, an LSTM and a CNN are combined sequentially and trained with different pre-trained word embeddings, and are then used as a feature generator for a secondary feedforward neural network classifier that makes the final classification. The results of this study indicate that the system performs well in implicit emotion identification and beats the baseline system by about 8% on the test set.
Furthermore, the experiments support the idea that features extracted from several pre-trained word embedding models can be effectively combined to improve the overall classification performance. The most obvious finding to emerge from this study is that in-domain word embeddings and emoji embeddings contribute to improving the performance of implicit emotion classification. The generalisability of these results is subject to certain limitations. For instance, this research does not focus on fine-tuning the model architectures for different word embeddings. Although this gives a common ground for comparing word embeddings on this task, it does not establish the individual capability of each embedding. Further research will have to be conducted to determine the best configurations for individual word embeddings and feature combinations in order to improve the overall performance of the system.