YZU-NLP at EmoInt-2017: Determining Emotion Intensity Using a Bi-directional LSTM-CNN Model

The EmoInt-2017 task aims to determine a continuous numerical value representing the intensity with which an emotion is expressed in a tweet. Compared to classification tasks that identify 1 among n emotions for a tweet, the present task can provide more fine-grained (real-valued) sentiment analysis. This paper presents a system that uses a bi-directional LSTM-CNN model to complete the competition task. By combining a bi-directional LSTM and a CNN, the prediction process considers both the global information in a tweet and its important local information. The proposed method ranked sixth among twenty-one teams in terms of the Pearson correlation coefficient.


Introduction
Categorical and dimensional representations are two major approaches to representing emotional states (Calvo and Kim, 2013; Gunes and Schuller, 2013). The categorical approach represents emotional states using several discrete classes such as positive and negative (binary) or Ekman's (1992) six basic emotions (anger, happiness, fear, sadness, disgust, and surprise), which have been successfully adopted in various sentiment applications (Pang and Lee, 2008; Liu, 2012; Feldman, 2013). Based on this representation, application tasks focus on classification (i.e., identifying 1 among n emotions for a given text). The dimensional approach provides a more fine-grained (real-valued) sentiment analysis. Knowing the intensity or degree to which an emotion is expressed in text is useful for more intelligent sentiment applications (Thelwall et al., 2012; Paltoglou et al., 2013; Malandrakis et al., 2013; Kiritchenko and Mohammad, 2016; Wang et al., 2016a; 2016b). The EmoInt-2017 task (Mohammad and Bravo-Marquez, 2017b) seeks to automatically determine a continuous numerical value representing the intensity or degree to which an emotion is expressed in a tweet. That is, given a tweet and an emotion X, determine the intensity of emotion X felt by the speaker, ranging from 0 (feeling the least amount of emotion X) to 1 (feeling the maximum amount of emotion X). The proposed system uses word embeddings (Mikolov et al., 2013a; 2013b) and a bi-directional LSTM-CNN model to complete the competition task.
Word embeddings can capture both semantic and syntactic information of words and provide a low-dimensional, continuous vector representation for them. A convolutional neural network (CNN) (Kim, 2014; Kalchbrenner et al., 2014) is effective at extracting local features from a text, but it does not consider the global information of that text. Long short-term memory (LSTM) (Tai et al., 2015) can capture long-distance dependencies by modeling a text sequentially, word by word. The proposed bi-directional LSTM-CNN model combines LSTM and CNN to model texts, encoding the global information captured by the LSTM in the most salient features extracted by the CNN.
We first use word vectors to transform tweets into text matrices. The bi-directional LSTM is applied to these matrices to build new text matrices. CNN is applied to the output of the bi-directional LSTM to obtain text vectors for emotion intensity prediction. LSTM, CNN and their combination are described in detail in the following section. Figure 1 shows the overall framework of the proposed Bi-directional LSTM-CNN model. For a given sentence, the system's input is a sentence matrix composed of the word vectors of all words and punctuation in the sentence. The sentence matrix is further transformed into a new sentence matrix by the Bi-directional LSTM model. The new sentence matrix is then sequentially passed through a convolutional layer and a max pooling layer for feature extraction. The extracted features are then passed through a dense layer to build a sentence vector for emotion intensity prediction.
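As a minimal sketch of the first step above (turning a tokenized tweet into a sentence matrix), the following Python snippet stacks one word vector per token. The function name `sentence_matrix`, the toy 3-dimensional embeddings, the fixed length `max_len`, and the use of zero vectors for padding and unknown tokens are all illustrative assumptions, not details taken from the system description.

```python
import numpy as np

def sentence_matrix(tokens, embeddings, dim, max_len):
    """Stack word vectors row by row, padding or truncating to max_len rows."""
    matrix = np.zeros((max_len, dim), dtype=np.float32)
    for row, token in enumerate(tokens[:max_len]):
        vector = embeddings.get(token)
        # Assumption: tokens without an embedding keep the all-zero row.
        if vector is not None:
            matrix[row] = vector
    return matrix

# Toy embeddings for illustration only; a real system would load
# pre-trained word vectors instead.
toy_embeddings = {
    "so": np.array([0.1, 0.0, 0.2]),
    "happy": np.array([0.9, 0.3, 0.1]),
    "!": np.array([0.0, 0.5, 0.5]),
}

tokens = ["so", "happy", "today", "!"]  # words and punctuation
m = sentence_matrix(tokens, toy_embeddings, dim=3, max_len=6)
print(m.shape)
```

Each row of the resulting matrix corresponds to one token position, so the matrix can be consumed row by row by a recurrent layer.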

Long Short-Term Memory (LSTM)
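The LSTM (Hochreiter et al., 1997) uses a gating mechanism to track the state of sequences. There are three gates and a memory cell in the LSTM architecture. In the standard formulation, the LSTM transition functions are defined as follows:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

where the weight matrices $W_*$, $U_*$ and the bias vectors $b_*$ are the parameters to be learned. $h_t$ is the hidden state produced at time step $t$. The input vector $x_t$ and the previous hidden state $h_{t-1}$ are the inputs at time step $t$.
$\sigma(\cdot)$ and $\tanh(\cdot)$ are the logistic sigmoid and hyperbolic tangent functions, $\odot$ is the element-wise multiplication operator, and $i_t$, $f_t$, $o_t$, whose values are in (0, 1), are respectively called the input, forget and output gates. $c_t$ is the internal memory cell and $\tilde{c}_t$ is the candidate content for it. $i_t$ controls how much new information will be stored in the current memory cell, $f_t$ controls how much information from the old memory cell will be maintained, and $o_t$ controls how much information will be output as the hidden state in the current time step.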
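The gate updates described above can be sketched as a single time step in NumPy. This is an illustration of the standard LSTM cell rather than the system's actual implementation: the parameter names `W`, `U`, `b`, the candidate cell `c_tilde`, the helper `init_params`, and the toy dimensions are assumed notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM transition; params maps a gate name to a (W, U, b) triple."""
    def affine(name):
        W, U, b = params[name]
        return W @ x_t + U @ h_prev + b

    i_t = sigmoid(affine("i"))          # input gate: how much new content to store
    f_t = sigmoid(affine("f"))          # forget gate: how much old memory to keep
    o_t = sigmoid(affine("o"))          # output gate: how much memory to expose
    c_tilde = np.tanh(affine("c"))      # candidate memory content
    c_t = f_t * c_prev + i_t * c_tilde  # element-wise update of the memory cell
    h_t = o_t * np.tanh(c_t)            # hidden state for this time step
    return h_t, c_t

def init_params(input_dim, hidden_dim, rng):
    """Hypothetical small random initialisation, for demonstration only."""
    return {
        name: (
            rng.normal(scale=0.1, size=(hidden_dim, input_dim)),
            rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)),
            np.zeros(hidden_dim),
        )
        for name in ("i", "f", "o", "c")
    }

rng = np.random.default_rng(0)
params = init_params(input_dim=4, hidden_dim=3, rng=rng)
h, c = lstm_step(rng.normal(size=4), np.zeros(3), np.zeros(3), params)
print(h.shape, c.shape)
```

Because the gates lie in (0, 1) and the candidate content lies in (-1, 1), the hidden state after one step from a zero state is bounded in magnitude by 1.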
LSTM is theoretically powerful in language modelling due to its capability of representing a sentence or text with sequence order information. The last hidden state of the LSTM layer can be regarded as the text representation containing the contextual information of the text. However, LSTM is a biased model, in which the words at the tail of a text are more dominant than the words at the head. Thus, prediction performance could be reduced when it is used to capture the emotion intensity of a whole text, since the key components could appear anywhere in the text.
To avoid this problem, we keep the hidden states of all time steps and use the hidden state at each time step in place of the corresponding original word vector. This yields a new text matrix.
In addition, we replace the LSTM layer with a bi-directional LSTM layer consisting of two LSTMs running in parallel: one on the input sequence and the other on the reverse of the input sequence. At each time step, the hidden state of the bi-directional LSTM is the concatenation of the forward and backward hidden states. The hidden state can thus capture both past and future information.
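A minimal NumPy sketch of this construction is given below: one LSTM reads the sequence forward, another reads it reversed, all hidden states are kept, and the two are concatenated per time step to form the new text matrix. The stacked weight layout (input, forget, output and candidate blocks in one matrix), the helper names and the toy sizes are assumptions made for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_lstm(X, W, U, b):
    """Run an LSTM over the rows of X and return the hidden state of every step.

    W, U and b stack the input, forget, output and candidate blocks (4*H rows).
    """
    H = U.shape[1]
    h, c = np.zeros(H), np.zeros(H)
    states = []
    for x_t in X:
        z = W @ x_t + U @ h + b
        i, f, o = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[2 * H:3 * H])
        c = f * c + i * np.tanh(z[3 * H:])
        h = o * np.tanh(c)
        states.append(h)
    return np.stack(states)  # shape (T, H): one hidden state per time step

def bilstm(X, fwd, bwd):
    """Concatenate forward and backward hidden states at each time step."""
    forward = run_lstm(X, *fwd)
    # Run on the reversed sequence, then flip back to the original order.
    backward = run_lstm(X[::-1], *bwd)[::-1]
    return np.concatenate([forward, backward], axis=1)  # shape (T, 2H)

def random_weights(D, H, rng):
    """Hypothetical random weights, for demonstration only."""
    return (
        rng.normal(scale=0.1, size=(4 * H, D)),
        rng.normal(scale=0.1, size=(4 * H, H)),
        np.zeros(4 * H),
    )

rng = np.random.default_rng(0)
T, D, H = 5, 4, 3
X = rng.normal(size=(T, D))
fwd, bwd = random_weights(D, H, rng), random_weights(D, H, rng)
new_matrix = bilstm(X, fwd, bwd)
print(new_matrix.shape)
```

Row t of `new_matrix` replaces the word vector at position t: its first half summarizes the words up to t, and its second half summarizes the words from t to the end.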

Convolutional Neural Network (CNN)
The CNN architecture consists of a convolutional layer and a max pooling layer. The convolutional layer's input is the bi-directional LSTM layer's output which is a new text matrix. Once the new text matrix sequentially passes through the convolutional layer, the local n-gram features can be extracted.
The max-pooling layer subsamples the output of the convolutional layer. Pooling is conducted by maintaining the max value of the result of each filter. The max-pooling layer can reduce the dimension of the extracted feature vector and extract the local dependency to maintain the most important information for prediction.
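The convolution and pooling steps can be illustrated with a short NumPy sketch. Each filter spans n consecutive rows of the text matrix (an n-gram window), and max pooling keeps the largest response of each filter over all positions. The ReLU activation, the filter count, the window size and the function names are assumptions; the text does not fix them here.

```python
import numpy as np

def conv1d_over_time(M, filters, bias):
    """Convolve filters over a text matrix.

    M: (T, D) text matrix; filters: (K, n, D); bias: (K,).
    Returns feature maps of shape (T - n + 1, K).
    """
    K, n, D = filters.shape
    T = M.shape[0]
    out = np.empty((T - n + 1, K))
    for t in range(T - n + 1):
        window = M[t:t + n]  # n consecutive rows, i.e. one n-gram
        out[t] = np.tensordot(filters, window, axes=([1, 2], [0, 1])) + bias
    return np.maximum(out, 0.0)  # ReLU, an assumed activation

def max_over_time(feature_maps):
    """Keep the maximum response of each filter across all positions."""
    return feature_maps.max(axis=0)

rng = np.random.default_rng(0)
M = rng.normal(size=(7, 6))           # e.g. 7 time steps, 6 features each
filters = rng.normal(size=(8, 3, 6))  # 8 filters over 3-row windows
bias = np.zeros(8)

feature_maps = conv1d_over_time(M, filters, bias)
pooled = max_over_time(feature_maps)
print(feature_maps.shape, pooled.shape)
```

Pooling reduces the variable-length feature maps to one value per filter, so the pooled vector has a fixed size regardless of tweet length.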
The obtained vectors are then fed to a dense layer to build a text representation. Since emotion intensity is a continuous value, a linear decoder layer uses a linear regression to transform the text representation into a real value.
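The final two layers can be sketched as follows: a dense layer maps the pooled features to a text representation, and a linear output unit maps that representation to a single real value. The tanh activation of the dense layer, the layer sizes and the name `predict_intensity` are assumptions for illustration.

```python
import numpy as np

def predict_intensity(features, W_dense, b_dense, w_out, b_out):
    """Dense layer followed by a linear (regression) output unit."""
    text_vector = np.tanh(W_dense @ features + b_dense)  # activation is assumed
    # Linear decoder: no squashing function, so the output is an unbounded real.
    return float(w_out @ text_vector + b_out)

rng = np.random.default_rng(0)
features = rng.normal(size=8)               # e.g. pooled CNN features
W_dense = rng.normal(scale=0.1, size=(4, 8))
b_dense = np.zeros(4)
w_out = rng.normal(scale=0.1, size=4)
b_out = 0.5

score = predict_intensity(features, W_dense, b_dense, w_out, b_out)
print(score)
```

Since the output is linear, nothing in this sketch constrains the prediction to lie between 0 and 1; any clipping would be an additional design choice.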

Experiments and Evaluation
This section evaluates the performance of the proposed bi-directional LSTM-CNN model by submitting the results to the EmoInt-2017 task.
Dataset. The statistics of the official dataset (Mohammad and Bravo-Marquez, 2017a) used in this competition are summarized in Table 1. Each tweet was rated with a real value (emotion intensity) in the range of 0 to 1. Training, development and test datasets are provided for four emotions: joy, sadness, fear, and anger. We trained four models, one per emotion, using the respective training sets without the development sets. The anger, joy and fear models used the architecture of the proposed bi-directional LSTM-CNN model. To improve results, the sadness model used a CNN architecture that excludes the bi-directional LSTM layer shown in Fig. 1. Other hyper-parameters are presented in Table 2.
Evaluation metrics. The EmoInt-2017 task published the results for all participants using the Pearson and Spearman correlation coefficients.
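For reference, both coefficients can be computed with the standard library alone, as in the sketch below. Spearman correlation is computed as the Pearson correlation of the ranks, with tied values sharing their average rank. The gold and predicted scores are made-up numbers for illustration, not task data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up gold and predicted intensities.
gold = [0.10, 0.35, 0.50, 0.80, 0.95]
pred = [0.20, 0.30, 0.55, 0.60, 0.90]
print(pearson(gold, pred), spearman(gold, pred))
```

In this toy example the predictions preserve the gold ordering, so the Spearman coefficient is 1 up to rounding even though the Pearson coefficient is lower.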

Results.
A total of twenty-one teams participated in the task. Table 3 shows the results of the proposed bi-directional LSTM-CNN model. Table 4 shows the results over the subset of the test data with a gold emotion intensity score greater than or equal to 0.5. Table 5 shows the experimental results for CNN, LSTM and their combinations, obtained after the release of the test set ratings. LSTM used the last hidden state as the text vector, which led to worse performance than CNN and BiLSTM-CNN. In addition, BiLSTM-CNN performed slightly better than CNN and performed well on the subset with higher emotion intensity scores (>= 0.5).
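The subset evaluation mentioned above (gold intensity of at least 0.5) amounts to filtering the gold/prediction pairs before computing the correlation. A minimal sketch follows; the function name `pearson_on_high_intensity` and the scores are hypothetical and not taken from the task data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def pearson_on_high_intensity(gold, pred, threshold=0.5):
    """Correlation restricted to tweets whose gold score is >= threshold.

    Returns the coefficient and the number of tweets kept.
    """
    pairs = [(g, p) for g, p in zip(gold, pred) if g >= threshold]
    kept_gold, kept_pred = zip(*pairs)
    return pearson(kept_gold, kept_pred), len(pairs)

# Made-up gold and predicted intensities.
gold = [0.10, 0.35, 0.50, 0.80, 0.95]
pred = [0.20, 0.30, 0.55, 0.60, 0.90]
r_high, n_high = pearson_on_high_intensity(gold, pred)
print(r_high, n_high)
```

Note that the filter is applied to the gold scores only, so a prediction below 0.5 still counts if its gold score is 0.5 or higher.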

Conclusions
This study presents a deep learning approach to determine the emotion intensity of tweets.

Table 4: Results of the proposed BiLSTM-CNN model over a subset of the test data with a gold emotion intensity score greater than or equal to 0.5.