UWB at IEST 2018: Emotion Prediction in Tweets with Bidirectional Long Short-Term Memory Neural Network

This paper describes our system created for the WASSA 2018 Implicit Emotion Shared Task. The goal of this task is to predict the emotion of a given tweet from which a certain emotion word has been removed. The removed word can be sad, happy, disgusted, surprised, angry, afraid or a synonym of one of them. Our proposed system is based on deep-learning methods. We use a Bidirectional Long Short-Term Memory (BiLSTM) network with word embeddings as an input. The pre-trained DeepMoji model and pre-trained emoji2vec emoji embeddings are also used as additional inputs. Our system achieves a 0.657 macro F1 score and our rank is 13th out of 30.


Introduction
Emotions, especially on social media and social networks, as an immediate response to a specific object or situation, are a significant part of communication between people. Even for a human, it is sometimes challenging to describe or recognize an emotion without direct contact with the subject (e.g. because of idioms or sarcasm). One of the most important ways to express an emotion in text is with an emoji. Emojis are small ideograms depicting objects, people and scenes (Barbieri et al., 2018). Emojis try to capture the facial expression of the subject, which is decisive for emotion detection.
This paper describes our system created for the WASSA 2018 Implicit Emotion Shared Task (Klinger et al., 2018). The goal of this task is to predict the emotion of a given tweet, from which a certain emotion word is removed, for example: It is [#TARGETWORD#] when you feel like you are invisible to others.
The removed word can be sad, happy, disgusted, surprised, angry, afraid or a synonym of one of them. The possible emotions are Sadness, Joy, Disgust, Surprise, Anger, and Fear. The [#TARGETWORD#] token in the example indicates the position of the removed word in the given tweet.

Related Work
As mentioned before, emojis are an important part of expressing emotions. Barbieri et al. (2017) investigated the relationship between words and emojis. They also proposed an approach to predict the most probable emoji associated with a tweet. The mentioned approach uses Bidirectional Long Short-Term Memory networks (BiLSTM) (Graves and Schmidhuber, 2005).
Pre-trained word embeddings (word representations) such as (Mikolov et al., 2013; Pennington et al., 2014) are currently a standard part of most state-of-the-art solutions for key NLP tasks. Tang et al. (2014) propose a method that learns sentiment-specific word embeddings, which improve performance when combined with other existing feature sets.
There are also some previously submitted systems in similar SemEval shared tasks that use deep learning models. Cliche (2017) uses a CNN and an LSTM for Sentiment Analysis in SemEval-2017 Task 4 (Rosenthal et al., 2017). Another approach, a deep LSTM with an attention mechanism, is used by Baziotis et al. (2017) for the same task. Most of the best performing submitted systems (Baziotis et al., 2018; Gee and Wang, 2018; Park et al., 2018) in SemEval-2018 Task 1: Affect in Tweets (Mohammad et al., 2018) also use deep learning models with LSTM or BiLSTM neural networks.

Overview
Our approach is based on an artificial neural network that combines word embeddings and emoji-based features as input. We use the Weka machine learning workbench (Hall et al., 2009) for preprocessing. Our submitted model combines a BiLSTM layer for the word embeddings input and dense layers for the other inputs (emoji2vec (Eisner et al., 2016) and DeepMoji (Felbo et al., 2017) features, see Section 2.2), connected to one dense layer; see Figure 2 for the model architecture. The outputs of these three layers are concatenated and then the dropout (Srivastava et al., 2014) technique is applied. After the concatenation, a next dense layer is employed. The output of this dense layer is then passed to a fully-connected softmax layer, whose output is a probability distribution over all six possible classes.
We trained several modified versions of our submitted model and evaluated these models on the development data. The model with the highest macro F1 score on the development data was then trained again on the training data extended by the development data. This model was used for the test data predictions. All models were implemented using Keras (Chollet et al., 2015) with the TensorFlow backend (Abadi et al., 2015).

Tweets Preprocessing
Tweets often contain slang expressions, misspelled words, emoticons or abbreviations, and it is necessary to perform some preprocessing steps before training and making predictions. We use a similar approach to Přibáň et al. (2018).
First, we remove the [#TARGETWORD#] token that represents the position of the removed emotion word, and every tweet is tokenized using the TweetNLP twokenizer (Gimpel et al., 2011). Then the following steps are applied to the tokens:

1. Tokens are converted to lowercase.

2. Sequences of letters occurring more than two times in a row are replaced with two occurrences (e.g. huuuungry is reduced to huungry, looooove to loove).

3. The # character is removed from hashtags (tokens starting with #).

4. Common sequences of words and emojis are separated by a space (e.g. the token "nice:D:D" is split into the three tokens "nice", ":D" and ":D").

5. The characters & and - in tokens are replaced with a space.
The Weka machine learning workbench is used to perform the mentioned steps. After tokenization and the mentioned preprocessing, the tweet is padded to 50 tokens. Tweets longer than 50 tokens are shortened, while padding tokens are added to shorter tweets.
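As a rough illustration of these steps, the following Python sketch applies the token-level rules and the padding described above; it is a minimal re-implementation for illustration only (the actual system uses the TweetNLP twokenizer and Weka), and step 4, the splitting of word-emoticon sequences, is omitted for brevity.

```python
import re

PAD = "<pad>"          # assumed padding token, not named in the paper
MAX_LEN = 50

def preprocess(tokens):
    out = []
    for tok in tokens:
        if tok == "[#TARGETWORD#]":                     # drop the target-word placeholder
            continue
        tok = tok.lower()                               # step 1: lowercase
        tok = re.sub(r"([a-z])\1{2,}", r"\1\1", tok)    # step 2: huuuungry -> huungry
        tok = tok.lstrip("#")                           # step 3: strip the hashtag character
        tok = tok.replace("&", " ").replace("-", " ")   # step 5: & and - become spaces
        out.extend(tok.split())
    return out

def pad(tokens):
    # pad or truncate every tweet to a fixed length of 50 tokens
    return (tokens + [PAD] * MAX_LEN)[:MAX_LEN]

print(pad(preprocess(["It", "is", "[#TARGETWORD#]", "when", "you", "feel", "#invisible", "huuuungry"])))
```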

Features
We use three types of input features: word embeddings, emoji embeddings and an emotional representation of the sentence. Word embeddings are representations of words, usually expressed as pre-trained dense real-valued vectors (Mikolov et al., 2013; Pennington et al., 2014) with a fixed dimension. We use pre-trained Ultradense Word Embeddings (Rothe et al., 2016) that were trained on a Twitter-domain corpus. The number of dimensions of these embeddings is 400. Pre-trained emoji2vec (Eisner et al., 2016) emoji embeddings (300-dimensional) are used as another input to our model. We average the vectors of all emojis in a tweet and the resulting averaged vector is used as an input. The mentioned emoji2vec embeddings contain vectors for all Unicode emojis, which were learned from their descriptions in the Unicode emoji standard; see (Eisner et al., 2016) for details. The emoji2vec embeddings can be used only for some tweets because not every tweet contains an emoji, but we supposed that using emoji2vec would lead to an overall performance improvement.
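A minimal sketch of the averaged emoji2vec feature is given below, assuming the published pre-trained vectors are available in word2vec binary format and loaded with gensim; the file name and the loader are assumptions, not part of the paper.

```python
import numpy as np
from gensim.models import KeyedVectors

# assumed path to the pre-trained emoji2vec vectors in word2vec binary format
e2v = KeyedVectors.load_word2vec_format("emoji2vec.bin", binary=True)

def emoji_feature(tokens, dim=300):
    # keep only tokens that emoji2vec knows (i.e. emojis)
    vecs = [e2v[t] for t in tokens if t in e2v]
    if not vecs:
        return np.zeros(dim)          # tweets without emojis get a zero vector
    return np.mean(vecs, axis=0)      # average all emoji vectors into one 300-d input
```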
We also use DeepMoji (Felbo et al., 2017) as an emotional sentence representation. The DeepMoji model is able to predict the emoji associated with a given sentence, and thus the model also has an understanding of the emotional content of that sentence. The model was trained on a dataset of 1.2 billion tweets. As an input for our model, we use the 2304-dimensional vector from the attention layer of the pre-trained DeepMoji model.
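The following sketch shows how such a 2304-dimensional feature vector could be extracted, modelled on the encoding example shipped with the public DeepMoji Keras implementation; the module names, constants and file paths are our assumptions and may differ between versions of that code base.

```python
import json
from deepmoji.sentence_tokenizer import SentenceTokenizer
from deepmoji.model_def import deepmoji_feature_encoding
from deepmoji.global_variables import PRETRAINED_PATH, VOCAB_PATH

maxlen = 50
with open(VOCAB_PATH, "r") as f:
    vocabulary = json.load(f)

st = SentenceTokenizer(vocabulary, maxlen)
encoder = deepmoji_feature_encoding(maxlen, PRETRAINED_PATH)

tokenized, _, _ = st.tokenize_sentences(["It is when you feel like you are invisible to others."])
deepmoji_vectors = encoder.predict(tokenized)   # shape: (n_tweets, 2304)
```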

Recurrent Neural Network
The Recurrent Neural Network (RNN) extends the classic (feed-forward) neural network and is intended for sequential data. The current hidden state h_t of the RNN depends on the previous hidden state h_{t-1} (see Figure 1). An RNN takes the input sequence x_1, x_2, ..., x_T and for each element, at time step t, computes a new hidden state h_t from the input x_t and the previous hidden state h_{t-1}. The new hidden state h_t is computed by the hidden layer function H.
In the simplest case, the hidden layer function H is defined as

h_t = H(W_xh x_t + W_hh h_{t-1} + b_h),

where the W terms correspond to weight matrices (e.g. W_xh is the input-hidden weight matrix) and b_h is the hidden bias vector. The concrete implementation of the H function depends on the type of RNN unit used (Graves et al., 2013), for example the Long Short-Term Memory (LSTM) unit (Hochreiter and Schmidhuber, 1997) or the Gated Recurrent Unit (GRU) (Cho et al., 2014).
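To make the recurrence concrete, the NumPy sketch below unrolls a plain RNN over one padded tweet. The dimensions (400-d inputs, 600 hidden units, 50 time steps) only mirror numbers used elsewhere in the paper, and tanh is one common choice of H, not necessarily the one used in our model.

```python
import numpy as np

d_in, d_h, T = 400, 600, 50
W_xh = np.random.randn(d_h, d_in) * 0.01   # input-to-hidden weight matrix
W_hh = np.random.randn(d_h, d_h) * 0.01    # hidden-to-hidden weight matrix
b_h = np.zeros(d_h)                        # hidden bias vector

x = np.random.randn(T, d_in)               # one tweet as a sequence of word vectors
h = np.zeros(d_h)                          # initial hidden state
for t in range(T):
    # h_t is computed from the current input x_t and the previous hidden state h_{t-1}
    h = np.tanh(W_xh @ x[t] + W_hh @ h + b_h)
```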
In our case, the input x_t denotes the word embedding vector of each word in the tweet and T is the length of the tweet. Every tweet is also padded to the length T. As mentioned, the new hidden state h_t depends on the previous hidden state, and hence the word order is also taken into account by the RNN.

Long Short-Term Memory
The Long Short-Term Memory (Hochreiter and Schmidhuber, 1997) allows learning (remembering) long-term dependencies from the input sequence. The LSTM unit consists of a cell state (cell activation vector) and input, forget and output gates. These gates control how the cell state is updated. The H function of the LSTM unit is defined as

i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)
f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)
o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)
c_t = f_t * c_{t-1} + i_t * tanh(W_xc x_t + W_hc h_{t-1} + b_c)
h_t = o_t * tanh(c_t),

where the W terms correspond to weight matrices, the b terms are bias vectors, i, f, o are the input, forget and output gates, c denotes the cell state (activation vector), σ is the sigmoid function and the * character denotes element-wise multiplication.
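A short NumPy sketch of a single LSTM step following these (non-peephole) equations is given below; the dimensions are illustrative only and the stacked weight layout is our own simplification.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    z = np.concatenate([x_t, h_prev])                     # stacked input [x_t; h_{t-1}]
    i = sigmoid(W["i"] @ z + b["i"])                      # input gate
    f = sigmoid(W["f"] @ z + b["f"])                      # forget gate
    o = sigmoid(W["o"] @ z + b["o"])                      # output gate
    c = f * c_prev + i * np.tanh(W["c"] @ z + b["c"])     # updated cell state
    h = o * np.tanh(c)                                    # new hidden state
    return h, c

# illustrative dimensions only
d_in, d_h = 400, 600
W = {k: np.random.randn(d_h, d_in + d_h) * 0.01 for k in "ifoc"}
b = {k: np.zeros(d_h) for k in "ifoc"}
h, c = lstm_step(np.random.randn(d_in), np.zeros(d_h), np.zeros(d_h), W, b)
```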
It is a common practice to use a Bidirectional LSTM (BiLSTM) (Graves and Schmidhuber, 2005). The BiLSTM consists of two LSTMs: one LSTM processes the input sequence from the first element x_1 to x_T and produces the output vector →h_t; the second LSTM processes the input sequence in reverse order, i.e. from the last element x_T to x_1, and produces the output vector ←h_t. Both output vectors have dimension D. The final output vector h_t of the BiLSTM, with dimension 2D, is then created by concatenating the two vectors →h_t and ←h_t.

Dropout (Srivastava et al., 2014) is a technique for improving neural networks by reducing overfitting. The dropout technique randomly drops units (hidden and visible) during training and thus prevents the co-adaptation of neurons on the training data.

Model Description
The proposed model has three inputs; Figure 2 shows the model architecture. The first input (word embeddings) represents the tweet as a sequence of t = 50 tokens. We use the Ultradense Word Embeddings (Rothe et al., 2016) to obtain a vector of dimension d = 400 for each token of the tweet. The whole tweet is then represented as a matrix M ∈ R^{t×d}. The vectors are obtained only for the 50,000 most frequent words in the training dataset. If a tweet word is not present in the vocabulary of the 50,000 most frequent words, a randomly initialized vector of the same dimension is used. The word embeddings input is followed by a BiLSTM layer with 1200 units, i.e. each of the two LSTMs has 600 units. We also apply dropout to the recurrent connections in the BiLSTM layer.
The emoji embeddings input is based on emoji2vec (Eisner et al., 2016). For each emoji in the tweet, a 300-dimensional vector is produced by the pre-trained model. All emoji vectors for the tweet are then averaged into a single vector. If the tweet does not contain any emoji, a zero vector is used. The resulting averaged emoji embeddings vector E ∈ R^300 is used as an input to a dense layer with 300 units.
Our last input uses the pre-trained DeepMoji (Felbo et al., 2017) model as an emotional sentence representation. The DeepMoji model generates for each tweet a vector D ∈ R^2304 which represents the emotional content of the tweet. The emotional sentence representation input is followed by a dense layer with 2304 units.
All three output vectors of the BiLSTM and the two dense layers are concatenated into one vector C ∈ R^3804 that is passed to a next dense layer with 400 units. We also apply dropout after the concatenation. The output of the last dense layer is passed to a final fully-connected softmax layer, whose output is a probability distribution over all six possible classes. The class with the highest probability is predicted as the final output of our model.
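The following Keras functional-API sketch mirrors this three-input architecture with the layer sizes stated above. It is a simplified reconstruction, not our exact training code: the embedding matrix is a random placeholder standing in for the Ultradense vectors, and the additional dropout on the dense layers is omitted for brevity.

```python
import numpy as np
from keras.layers import Input, Embedding, Bidirectional, LSTM, Dense, Dropout, concatenate
from keras.models import Model

# placeholder for the 400-d Ultradense vectors of the 50,000 most frequent words
embedding_matrix = np.random.uniform(-0.1, 0.1, (50000, 400))

words_in = Input(shape=(50,), name="words")                        # 50 token ids per tweet
embedded = Embedding(50000, 400, weights=[embedding_matrix],
                     input_length=50)(words_in)
words_out = Bidirectional(LSTM(600, activation="relu",
                               recurrent_dropout=0.2))(embedded)   # 2 x 600 = 1200 units

emoji_in = Input(shape=(300,), name="emoji2vec")                   # averaged emoji vector E
emoji_out = Dense(300, activation="relu")(emoji_in)

deepmoji_in = Input(shape=(2304,), name="deepmoji")                # DeepMoji sentence vector D
deepmoji_out = Dense(2304, activation="relu")(deepmoji_in)

merged = concatenate([words_out, emoji_out, deepmoji_out])         # 3804-d vector C
merged = Dropout(0.2)(merged)
merged = Dense(400, activation="relu")(merged)
probs = Dense(6, activation="softmax")(merged)                     # six emotion classes

model = Model(inputs=[words_in, emoji_in, deepmoji_in], outputs=probs)
```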

Model Training & Hyper-Parameters
We trained our model using mini-batches of size 1024 for 5 epochs, and we used the Adam (Kingma and Ba, 2014) optimizer with a learning rate of 0.001; the other parameters of the Adam optimizer follow those provided in the cited paper. As the activation function in the BiLSTM and in the dense layers, we used the Rectified Linear Unit (ReLU). A dropout of 0.2 is used for the recurrent connections in the BiLSTM layer and in all dense layers. We trained the model on the provided training dataset and evaluated the trained model on the development dataset. We experimented with different settings of the hyper-parameters (learning rate, mini-batch size, etc.), but the mentioned settings proved to be the best on the development data. These hyper-parameter settings were also used for the final submission.
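A hedged training sketch matching these hyper-parameters is shown below; the categorical cross-entropy loss is an assumption (the paper does not state the loss), and X_* and y_* stand for the pre-computed feature arrays and one-hot labels.

```python
from keras.optimizers import Adam

model.compile(optimizer=Adam(lr=0.001),            # other Adam parameters left at their defaults
              loss="categorical_crossentropy",     # assumed loss for the softmax output
              metrics=["accuracy"])

model.fit([X_words, X_emoji, X_deepmoji], y_train,             # assumed pre-computed inputs
          batch_size=1024, epochs=5,
          validation_data=([X_words_dev, X_emoji_dev, X_deepmoji_dev], y_dev))
```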

Experiments & Results
All presented experiments were evaluated on the provided development and test datasets. Table 1 shows the results for the different model settings.
We performed an ablation study to see which features are the most beneficial (see Table 2). The numbers represent the performance change when the given feature is removed; the lowest number therefore denotes the most beneficial feature.
We also modified our model and experimented with an attention mechanism (Rocktäschel et al., 2015; Raffel and Ellis, 2015). The attention mechanism was added on top of our BiLSTM layer (see Table 1 for the results obtained by the modified model).
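For illustration, a simplified feed-forward attention layer in the spirit of Raffel and Ellis (2015) is sketched below; it is our own minimal Keras implementation for this description, not the exact layer used in the modified models, and it expects the BiLSTM to be called with return_sequences=True.

```python
from keras import backend as K
from keras.layers import Layer

class FeedForwardAttention(Layer):
    """Simplified feed-forward attention over BiLSTM time steps."""

    def build(self, input_shape):
        # a single learnable scoring vector over the hidden dimension
        self.w = self.add_weight(name="att_w",
                                 shape=(int(input_shape[-1]), 1),
                                 initializer="glorot_uniform",
                                 trainable=True)
        super(FeedForwardAttention, self).build(input_shape)

    def call(self, h):                                   # h: (batch, time, hidden)
        scores = K.tanh(K.dot(h, self.w))                # (batch, time, 1)
        weights = K.exp(scores)
        weights = weights / K.sum(weights, axis=1, keepdims=True)   # softmax over time steps
        return K.sum(h * weights, axis=1)                # weighted sum -> (batch, hidden)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[-1])
```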
Our results for the WASSA 2018 Implicit Emotion Shared Task are shown in Table 3 along with the results of some other teams. Table 4 contains the confusion matrix obtained from the submitted predictions and Table 5 shows the results for the individual emotions.

Discussion
The ablation study (see Table 2) and the results in Table 1 show that the word embeddings are the most beneficial feature. The emoji2vec features also increase the performance of our model on the test dataset, but the contribution is insignificant and not as important as that of the word embeddings. So our assumption that the emoji2vec feature would lead to a more significant overall performance improvement on the test dataset was not correct. It would be more beneficial to use a simpler model without the emoji2vec feature. The modified models with the attention mechanism did not improve the performance. A possible explanation is the missing emotion word in the classified tweet. The missing word probably carries the most information about the emotion. If the word were present in the classified tweet, the attention mechanism would pay most attention to it and would thus improve the performance.

Our model performs best for the joy and fear emotions (see Table 5). On the other hand, we obtained the worst results for the anger emotion. Our model produces the most false positive predictions for the anger emotion (a tweet is classified as anger but the true emotion is different). From the confusion matrix (Table 4), we can see that it is difficult for our model to distinguish especially between disgust and sadness, disgust and anger, fear and anger, surprise and anger, and between surprise and fear. Table 1 shows that there are no important differences between the development dataset and test dataset results. So our decision to select the second best model (evaluated on the development dataset) for the final submission, based on the results for the development dataset, was suitable.

Conclusion
In this paper, we described our UWB deep-learning system created for the WASSA 2018 Implicit Emotion Shared Task. Our system uses a Bidirectional Long Short-Term Memory (BiLSTM) network with word embeddings as an input. The pre-trained DeepMoji model and pre-trained emoji2vec emoji embeddings are also used as additional inputs. The proposed system performs best for the joy emotion. Our system achieves a 0.657 macro F1 score and our rank is 13th out of 30.
We performed an ablation study and showed that the most beneficial features are the word embeddings. The emotional sentence representation (DeepMoji feature) and the averaged emoji vectors (emoji2vec feature) did not improve the performance of our model much.
In future work, we would like to try another approach employing a Twitter-specific language model to predict probabilities for each emotion class for the missing target emotion word in the provided data. These probabilities could be used as input features to our model.