DMCB at SemEval-2018 Task 1: Transfer Learning of Sentiment Classification Using Group LSTM for Emotion Intensity prediction

This paper describes a system that participated in SemEval-2018 Task 1, “Affect in Tweets”, and predicts emotion intensities. We use a Group LSTM with an attention model and transfer learning, with sentiment classification data (SemEval 2017 Task 4a) as the source data. The transfer model consists of a source-domain model and a target-domain model. Additionally, we try a new dropout that is applied to the LSTMs in the Group LSTM. Our system ranked 8th in subtask 1a (emotion intensity regression). We also report results for different architectures in the source, target and transfer models.


Introduction
Sentiment analysis is one of the best-known Natural Language Processing (NLP) tasks. In this study, we predict the intensities of anger, joy, fear and sadness expressed in tweets, where intensity values range from 0 to 1. This task was run as SemEval-2018 Task 1 (Mohammad et al., 2018). In previous studies, neural networks with word embeddings and affective lexicons were widely used (Goel et al., 2017; He et al., 2017). Many studies also employed support vector regression (Duppada and Hiray, 2017; Akhtar et al., 2017).
Transfer learning was recently proposed as an effective approach to achieving higher performance when data is not abundant. Using a deep-learning model pre-trained on an abundant data set has become popular and shows good results in various tasks (Donahue et al., 2014; Conneau et al., 2017). It is especially effective in medical imaging tasks, where data is scarce (Tajbakhsh et al., 2016). Just as humans learn new things better with their past knowledge, neural networks can be trained on a target domain by transferring knowledge from a source domain.
We build a transfer model that can be divided into a source model and a target model. The source model is based on Baziotis et al. (2017), whose model uses an LSTM with attention. In place of the LSTM, however, we introduce Group LSTM (GLSTM) (Kuchaiev and Ginsburg, 2017) with a new dropout. We then build the target model with an LSTM.
In the results section, we compare LSTM and GLSTM in the source model and report results for various pre-trained word embeddings with the target model. Finally, we discuss the results of the transfer model, which combines the source and target models.

Data and Label
For transfer learning, we use the source data provided by SemEval 2017 Task 4a (Rosenthal et al., 2017). The source-domain task is to classify sentences as positive, negative or neutral. The training data consists of 44,613 sentences (10% are used as a development set), and the test data consists of 12,284 sentences for evaluating the source model. For transfer learning in this study, all training and test data are used as training data.
For the target domain, the training data consists of about 2,000 sentences for each emotion. Although the main task is regression, we recast it as distribution prediction (Tai et al., 2015) and thus treat it as a classification problem. Intensity scores y are converted to labels t satisfying:
$$t_i = \begin{cases} y' - \lfloor y' \rfloor, & i = \lfloor y' \rfloor + 2 \\ \lfloor y' \rfloor - y' + 1, & i = \lfloor y' \rfloor + 1 \\ 0, & \text{otherwise} \end{cases}$$
where i = [1, 2, 3, 4, 5] and y' = 4y. The size of the final output is 5. For example, if an intensity score y is 0.7, the label t is [0, 0, 0.2, 0.8, 0]. Given r = [0, 0.25, 0.5, 0.75, 1], the score y can be recovered as the dot product of t and r (0.7 = 0.2*0.5 + 0.8*0.75).
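The conversion between intensity scores and label distributions can be sketched as follows. This is a minimal illustration, not the authors' code; the function names are hypothetical, and clamping the lower index so that y = 1 maps onto the last anchor is an assumption made here for the boundary case.

```python
import math

# Anchor intensities r from the text; position k (0-based) holds r[k].
R = [0.0, 0.25, 0.5, 0.75, 1.0]


def score_to_distribution(y):
    """Map an intensity y in [0, 1] to a 5-way distribution over the anchors."""
    scaled = 4.0 * y  # y' = 4y, so y' lies in [0, 4]
    # 0-based index of the lower anchor; clamped so that y = 1 stays in range
    # (an assumption for the boundary case).
    low = min(int(math.floor(scaled)), 3)
    frac = scaled - low
    t = [0.0] * 5
    t[low] = 1.0 - frac      # mass on the lower anchor
    t[low + 1] = frac        # mass on the upper anchor
    return t


def distribution_to_score(t):
    """Recover the intensity as the dot product of t and r."""
    return sum(ti * ri for ti, ri in zip(t, R))


label = score_to_distribution(0.7)
recovered = distribution_to_score(label)
```

At prediction time the same dot product can be applied to the softmax output of the model to obtain a scalar intensity.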

Text preprocessing
To normalize words and remove noise in sentences, we use the ekphrasis library (Baziotis et al., 2017), which provides a social tokenizer, spell correction, word segmentation and various other preprocessing steps. We normalize times and numbers, and omit URLs, email addresses and user tags. Annotations are added to hashtags and to emphasized and repeated words. We annotate hashtags as a group because they often appear together (see Table 1). Lastly, emoticons are replaced with words that describe them.

Word embedding
We try five pre-trained word embeddings to choose the best one for the target model. Two are trained with GloVe (Pennington et al., 2014) on different data sets: one is trained on very large Common Crawl data, and the other is trained on tweets (Baziotis et al., 2017). The other embeddings are fastText (Bojanowski et al., 2016), word2vec (Mikolov et al., 2013) and LexVec (Salle et al., 2016). LexVec is a mixture of GloVe and word2vec. All have 300 dimensions. Among them, GloVe trained on tweets is used for the source and transfer models. Emoji can be good features, but most emoji ideograms are not in the embedding vocabulary. Hence, we convert each emoji to a phrase with the Python 'emoji' library. For example, 😄 is decoded to "Smiling Face with Open Mouth and Smiling Eyes". Because this phrase is quite long, the embedding vector of an emoji is set to the mean of the vectors of its decoded words. In this way, we reduce out-of-vocabulary tokens and prevent the sentence from lengthening.
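The emoji averaging step can be illustrated with a small sketch. Two assumptions are made here: the standard-library `unicodedata.name` stands in for the 'emoji' library used in the paper, and `toy_embeddings` is a hypothetical 2-dimensional vocabulary, not real GloVe vectors. Skipping decoded words that are missing from the vocabulary is also an assumption.

```python
import unicodedata


def emoji_to_words(ch):
    """Decode one emoji character into lowercase words from its Unicode name."""
    return unicodedata.name(ch).lower().split()


def mean_vector(words, embeddings):
    """Average the vectors of the in-vocabulary words; None if none are known."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]


# Hypothetical toy vocabulary for illustration only.
toy_embeddings = {
    "smiling": [1.0, 0.0],
    "face": [0.0, 1.0],
    "eyes": [1.0, 1.0],
}

words = emoji_to_words("\U0001F604")  # the emoji named in the text
emoji_vector = mean_vector(words, toy_embeddings)
```

The emoji then occupies a single position in the sentence, represented by one vector, instead of expanding into an eight-word phrase.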

LSTM and GLSTM
Recurrent Neural Networks (RNNs) work well for sequence modeling tasks such as language because they handle inputs of arbitrary length (Tai et al., 2015). However, RNNs are difficult to optimize because of the vanishing gradient problem. To address this, LSTM introduced a cell state and gates that act as bridges to control the flow of error (Hochreiter and Schmidhuber, 1997).
GLSTM is simply a group of several LSTMs whose outputs are concatenated. The idea is that an LSTM can be divided into several sub-LSTMs (Kuchaiev and Ginsburg, 2017). This model has some advantages over the original LSTM. The number of parameters is reduced while the feature size is preserved. Also, because the computation of each sub-LSTM is independent, it can be parallelized and computation time is reduced.
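The parameter reduction can be checked with a short calculation. The sketch below counts weights for a standard LSTM layer (four gates, each with input weights, recurrent weights and a bias) and assumes that every sub-LSTM receives the full input vector; that wiring is an assumption, and a formulation that also splits the input across groups would reduce the count further. The sizes 300 and 200 are illustrative values.

```python
def lstm_params(input_size, hidden_size):
    """Parameters of one LSTM layer: 4 gates x (input + recurrent weights + bias)."""
    return 4 * hidden_size * (input_size + hidden_size + 1)


def glstm_params(input_size, hidden_size, groups):
    """Parameters of a group of sub-LSTMs whose outputs concatenate to hidden_size.

    Assumes each sub-LSTM sees the whole input (an assumption of this sketch).
    """
    sub_hidden = hidden_size // groups
    return groups * lstm_params(input_size, sub_hidden)


single = lstm_params(300, 200)       # one LSTM with 200 features
grouped = glstm_params(300, 200, 5)  # five sub-LSTMs with 40 features each
```

Both layers emit 200 features per direction, but the recurrent weight matrices of the grouped version are block-diagonal in effect, which is where the saving comes from.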

Dropout
To avoid overfitting and improve generalization, we use three types of dropout. The first is standard dropout between layers (Srivastava et al., 2014). If the layer output is sequential, the dropout mask is shared along the sequence axis. The second is dropout inside LSTM cells. In each LSTM cell, the same dropout mask is applied to the hidden values that come from the previous cell (Zaremba et al., 2014). Applying a different dropout mask at each cell can corrupt memory and information. With the same mask, however, the LSTM drops nodes consistently, so the model can forget or memorize information stably. The third is dropout between sub-LSTMs. To improve generalization further, we drop several LSTMs in the GLSTM. For example, if a GLSTM consists of five sub-LSTMs, we drop two of them and use only the remaining three.
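A minimal sketch of the third dropout type is given below. It zeroes the outputs of randomly chosen sub-LSTMs before concatenation during training. The function name is hypothetical, and whether the kept outputs are rescaled is not stated in the text, so no rescaling is applied here.

```python
import random


def drop_sub_lstms(sub_outputs, num_drop, rng):
    """Zero the outputs of `num_drop` randomly chosen sub-LSTMs, then concatenate.

    sub_outputs: list of per-sub-LSTM output vectors for one time step.
    Returns the concatenated vector and the set of dropped indices.
    """
    dropped = set(rng.sample(range(len(sub_outputs)), num_drop))
    concatenated = []
    for idx, out in enumerate(sub_outputs):
        if idx in dropped:
            concatenated.extend([0.0] * len(out))  # whole sub-LSTM is silenced
        else:
            concatenated.extend(out)
    return concatenated, dropped


rng = random.Random(0)
sub_outputs = [[1.0, 1.0] for _ in range(5)]  # five sub-LSTMs, 2 features each
concatenated, dropped = drop_sub_lstms(sub_outputs, 2, rng)
```

To stay consistent with the shared-mask dropout described above, the same set of dropped sub-LSTMs would presumably be kept for every time step of a sequence; that choice is also an assumption.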

Model structure

Source model
For the source model, GloVe trained on tweets provides the input vectors of the embedding layer. After the embedding layer, two GLSTM layers are stacked. Each GLSTM is made of 5 LSTMs with a feature size of 40. Additionally, we concatenate forward and backward GLSTMs to make the layer bidirectional, so the hidden size of each recurrent layer is 400 (= 5 × 40 × 2).
Next is an attention layer, which calculates the importance of each time step. Attention mechanisms show good performance on sequential tasks such as machine translation (Bahdanau et al., 2014) and sentiment analysis (Baziotis et al., 2017). Attention helps the model concentrate on positions related to emotion. An attention value $a_t$ is calculated for each time step $t$, with $\sum_t a_t = 1$. Each attention value is multiplied by the hidden state at its time step, and the weighted states are summed. After the attention layer, the output is a non-sequential representation vector. It is fed to a fully connected softmax layer of size 3, which serves as the final classification layer.
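The attention pooling step can be sketched as follows. The scoring function is an assumption: a single learned vector `w` and bias `b` with a tanh nonlinearity followed by a softmax, which is a common choice for this kind of layer. The text itself only fixes that the weights sum to 1 and that the weighted hidden states are added up. All values below are illustrative.

```python
import math


def attention_pool(hidden_states, w, b):
    """Return attention weights and the weighted sum of hidden states.

    Scores use tanh(w . h_t + b) followed by a softmax (an assumed form).
    """
    scores = [math.tanh(sum(wi * hi for wi, hi in zip(w, h)) + b)
              for h in hidden_states]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax, so the weights sum to 1
    dim = len(hidden_states[0])
    pooled = [sum(a * h[k] for a, h in zip(weights, hidden_states))
              for k in range(dim)]
    return weights, pooled


# Three time steps with 2-dimensional hidden states (toy values).
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, pooled = attention_pool(hidden_states, w=[0.5, -0.5], b=0.0)
```

The pooled vector has the same size as one hidden state regardless of sequence length, which is why the layers after attention are non-sequential.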

Target model
Unlike the source model, the target model uses a standard bi-LSTM with a feature size of 100. Attention and output layers are then stacked on top. The size of the output layer is 5.
For transfer learning, the outputs of several layers of the source model are used as additional features. The LSTM layer of the target model takes as input the concatenation of the embedding layer output and the output of the first LSTM layer of the source model. Similarly, after the attention layer, the outputs of the attention layer and the final layer of the source model are concatenated with the target representation and fed into the final layer.
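The feature wiring of the transfer model can be traced with a small sketch. The sizes are derived from the figures stated in the text (300-dimensional embeddings, a 400-unit bidirectional GLSTM layer, a bi-LSTM with 100 features per direction, a 3-way source output). The resulting widths of 700 and 603 are a reading of that description rather than numbers reported in the paper, and the helper name is hypothetical.

```python
def concat_features(*vectors):
    """Concatenate feature vectors in order."""
    out = []
    for v in vectors:
        out.extend(v)
    return out


EMB = 300       # word embedding size
SRC_L1 = 400    # first bidirectional GLSTM layer of the source model (5 x 40 x 2)
TGT_ATT = 200   # target attention output, assuming bi-LSTM 100 x 2
SRC_ATT = 400   # source attention output, same width as its recurrent layer
SRC_OUT = 3     # source softmax output (positive, negative, neutral)

# Per-time-step input to the target LSTM.
lstm_input = concat_features([0.0] * EMB, [0.0] * SRC_L1)

# Input to the target final layer after attention.
final_input = concat_features([0.0] * TGT_ATT, [0.0] * SRC_ATT, [0.0] * SRC_OUT)
```

Placeholder zero vectors are used only to make the dimensions explicit.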

Regularization
At the embedding layer, Gaussian noise with sigma = 0.2 is applied. It makes the models more robust by preventing overfitting to specific features of words. Dropout with probability p = 0.3 is used between all layers except before the final layer, where p = 0.5 is applied. Additionally, LSTM dropout with p = 0.3 is applied to every LSTM layer. The dropout probability for the GLSTM in the source model is 0.3. We also use L2 regularization, which prevents weights from growing large by adding a weight penalty to the loss. We set its coefficient to 0.001 for the source model and 0.0001 for the target model.

Training
For the source and target models, categorical cross-entropy is used as the loss function. To update weights, we apply the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001. When training the transfer model, since we want to preserve the target model weight parameters and update them only slightly, we scale the gradient flowing in backpropagation from the source model to the target model by a factor of 0.05 (see the large arrows in Figure 1). Because the final model has many parameters, we apply this constraint to prevent overfitting.
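Gradient scaling at the boundary between two models can be illustrated with a scalar toy example. The sketch below reflects one reading of the sentence above: the forward pass is unchanged, and the gradient that crosses the connection between the two models is multiplied by 0.05 during backpropagation. The squared-error loss, the scalar weights and the function name are all hypothetical simplifications.

```python
def transfer_gradients(x, y, w_source, w_target, scale=0.05):
    """Manual backward pass for o = w_target * (w_source * x) with a scaled link.

    Loss is 0.5 * (o - y)^2 (a toy choice, not the paper's cross-entropy).
    """
    s = w_source * x          # feature produced on one side of the connection
    o = w_target * s          # output produced on the other side
    err = o - y               # d(loss)/d(o)
    grad_w_target = err * s   # unaffected by the scaling
    grad_s = err * w_target * scale  # gradient is damped as it crosses the link
    grad_w_source = grad_s * x
    return grad_w_target, grad_w_source


g_target, g_source = transfer_gradients(x=2.0, y=1.0, w_source=0.5, w_target=3.0)
g_target_full, g_source_full = transfer_gradients(
    x=2.0, y=1.0, w_source=0.5, w_target=3.0, scale=1.0)
```

With the factor set to 1.0 the example reduces to ordinary backpropagation, so the damped side receives 5% of its usual update signal.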

Results and discussion

GLSTM
Figure 2 shows the results of GLSTM and standard LSTM in the source model for sentiment classification (SemEval 2017 Task 4a). We tried various feature sizes. The number of sub-LSTMs in GLSTM is fixed at 5, and the feature size of each sub-LSTM is varied. As the feature size increases, the performance of GLSTM increases. The performance of LSTM also improves gradually with larger feature sizes, but it starts to decrease rapidly beyond 100. Thus, we infer that GLSTM with dropout is more robust to overfitting than LSTM at larger feature sizes. Based on this result, we use GLSTM for the source model.

Various Embedding
We tested five different word embeddings with the target model to choose the best one. To compare the embeddings fairly, the embedding layer was not trained (static). Note that we did not use transfer learning in this experiment. Table 2 shows the Pearson correlation between the given emotion intensities and the intensities predicted by the models on the development set. Tweet GloVe had the best score and Common GloVe the second best. Hence, we decided to do transfer learning with Tweet GloVe and Common GloVe.
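For reference, the Pearson correlation used as the evaluation metric can be computed as in the sketch below. The `gold` and `pred` values are made-up illustrative numbers, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical gold and predicted intensities for four tweets.
gold = [0.1, 0.4, 0.7, 0.9]
pred = [0.2, 0.35, 0.6, 0.95]
r_value = pearson(gold, pred)
```

Because the metric is invariant to the scale and offset of the predictions, it rewards getting the relative ordering and spacing of intensities right rather than their absolute values.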

Transfer
Our main task results are given in Table 3. There are four models. Tweet GloVe and Common GloVe were selected based on the embedding comparison above, and we tried two approaches: training the embedding layer or not (non-static or static) (Kim, 2014). Static Tweet GloVe showed the best performance as a single model, and its score is almost the same as that of the non-static variant. For Common GloVe, however, the non-static method scored higher than the static one. In addition, the ensemble that averages all single models performed better than any single model. We also found that, compared to the dev-set scores without transfer learning (Table 2), performance improved significantly when transfer learning was used (Table 3).

Conclusion
This paper described the system submitted to SemEval-2018 Task 1, Affect in Tweets, and analyzed various models. Various embedding vectors were tried, and we chose static Tweet GloVe. The main method is an LSTM with attention combined with transfer learning that uses sentiment classification as the source domain. In future work, we will perform transfer learning with other labeled data sets such as SNLI or SST. Tasks such as tagging or tree parsing could also be used as source tasks for transfer learning.