deepSA2018 at SemEval-2018 Task 1: Multi-task Learning of Different Label for Affect in Tweets

This paper describes our system for subtask V-oc of SemEval-2018 Task 1: Affect in Tweets. We use a multi-task learning method to learn a shared representation, and then learn the features for each task. The proposed multi-task learning approach contains five classification models, which are trained sequentially to learn different features for different classification tasks. In addition to the data released for SemEval-2018, we use datasets from previous SemEvals during system construction. Our Pearson correlation score is 0.638 on the official SemEval-2018 Task 1 test set.


Introduction
In recent years, researchers have begun to study how to build computational systems that process and understand human language. Today, people share their thoughts on social networks such as Facebook, Line, and Twitter. If the messages in the textual content of social networks can be extracted and summarized automatically, it is possible to learn what people are interested in or concerned with, and to use such information to predict future market trends.
Here we continue our previous work on Task 4 of SemEval-2017: Sentiment Analysis in Twitter (Rosenthal et al., 2017). SemEval-2017 subtask 4A is similar to Task 1 of SemEval-2018: Affect in Tweets (Mohammad et al., 2018). These are challenging tasks, as the messages on Twitter, called tweets, are short and informal. Furthermore, in addition to noisy or incomplete text, the emotional content of a tweet can be ambiguous and subjective. Affect in Tweets is an expanded version of the WASSA-2017 shared task (Mohammad and Bravo-Marquez, 2017).
The best system in WASSA-2017 is an ensemble of three sets of approaches: a feed-forward neural network, multi-task deep learning, and sequence modeling using CNNs and LSTMs (Goel et al., 2017). Its authors use the idea of multi-task learning to explore the notion of generalized or shared learning across different emotions. In this paper, we extend this idea with different labeling methods. The rest of this paper is organized as follows. In Section 2, we introduce our system. In Section 3, we describe the details of training and the experimental settings. In Section 4, we present the evaluation results along with our comments.

Baseline System
Using RNNs has become a common technique for various NLP tasks. There are many unit types for RNN-based models, such as the simple RNN, gated recurrent units (GRU) (Chung et al., 2014), and long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997). For the baseline, we use the LSTM unit for its ability to capture long-range dependencies. Figure 1 shows the architecture of our baseline system, which contains an input layer, an embedding layer, Bi-LSTM layers, and an output layer. At the input layer, the words of a tweet are pre-processed and treated as a sequence of words w_1, w_2, ..., w_n. Each word is represented by a one-hot vector, and the size of the input layer is equal to the size of the word list.
At the embedding layer, each word is converted to a word vector. We use pre-trained word vectors, which are stored in a matrix; words are mapped to word vectors by this word embedding matrix. A word not in the word embedding matrix is represented by a zero vector.
A Bi-LSTM layer contains h units. We use a bidirectional (Schuster and Paliwal, 1997) structure to gather contextual information from both directions at each position. The hidden state at each word, from the first word to the penultimate word of a tweet, is connected to the hidden state of the next word. The state values of the two directions are combined by elementwise summation. Only the Bi-LSTM states of the last word are connected to the output layer. Finally, the network output is converted to probabilities by a softmax function.
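The output step above can be sketched in plain Python (a minimal illustration, not the actual implementation): the forward and backward state vectors are summed elementwise, and the result is turned into class probabilities with a softmax.

```python
import math

def combine_states(forward, backward):
    # combine the two directional state vectors by elementwise sum
    return [f + b for f, b in zip(forward, backward)]

def softmax(logits):
    # subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# toy 3-dimensional forward/backward states for the last word
combined = combine_states([0.5, 1.0, -0.2], [0.1, -0.3, 0.4])
probs = softmax(combined)  # probabilities over output classes, summing to 1
```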

Multi-task Learning
Multi-task learning has been used with success in many applications of machine learning, from natural language processing (Collobert and Weston, 2008) to speech recognition (Deng et al., 2013). By sharing representations across related tasks, a model tends to generalize better on the original task (Ruder, 2017). In this work, different labels for the same data are exploited in multi-task learning. Figure 2 shows our multi-task learning framework. The overall system is divided into five models. The Three-class model is trained first, and its trained parameters are used to initialize the parameters of the other models. We then train the Negative, Neutral, and Positive class models, and their trained parameters are used to initialize the parameters of the Seven-class model. The final output is obtained from the Seven-class model. An attention mechanism (Luong et al., 2015; Wang et al., 2016) is incorporated in this model.
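The sequential training scheme can be sketched as follows. This is a hypothetical illustration: `train` stands in for a full training loop, and the parameter dictionaries stand in for real network weights.

```python
def train(init_params, task):
    # hypothetical stand-in for a real training loop: copy the initial
    # (shared) parameters and attach a task-specific output head
    params = dict(init_params)
    params["head"] = task
    return params

shared = {"embedding": "pretrained", "bilstm": "random-init"}

# 1) train the Three-class model first
three_class = train(shared, "three-class")

# 2) initialise the Negative/Neutral/Positive models from its parameters
binary = {t: train(three_class, t) for t in ("negative", "neutral", "positive")}

# 3) initialise the Seven-class model from the binary models' parameters;
#    its output is the final system output
seven_class = train(binary["positive"], "seven-class")
```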

Data
We use the dataset provided for the SemEval-2018 shared task (Mohammad et al., 2018), which includes a new dataset and the datasets provided for SemEval-2017 (Rosenthal et al., 2017). Table 1 summarizes the statistics of these datasets.

Different Labeling
The SemEval-2017 dataset consists of three-class data, which differs from the new SemEval-2018 dataset. In order to exploit the SemEval-2017 dataset, we modify its data labels: in the baseline system, we change the labels to ±1, ±2, or ±3. Adding a large amount of data leads to a class imbalance problem, so we apply two data balancing methods. Method 1 randomly adds data to the positive and negative classes so that they have the same size. Method 2 randomly adds data to all classes so that each has 3,000 tweets. Table 1 shows the numbers of data points after these different labeling methods.
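Method 2 above can be sketched as random oversampling up to a fixed target size. This is a hypothetical illustration (with a tiny target of 3 instead of 3,000); the actual sampling in the system may differ.

```python
import random

def balance_to_size(data_by_class, target):
    # Method 2: randomly duplicate examples in each class until every
    # class has `target` examples (classes already at or above the
    # target are left unchanged)
    balanced = {}
    for label, examples in data_by_class.items():
        extra = [random.choice(examples)
                 for _ in range(max(0, target - len(examples)))]
        balanced[label] = examples + extra
    return balanced

data = {"-3": ["t1", "t2"], "0": ["t3"], "+3": ["t4", "t5", "t6", "t7"]}
balanced = balance_to_size(data, 3)
```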

Pre-processing
We begin with basic pre-processing methods (Yang et al., 2017), e.g., splitting a tweet into words, replacing URLs and user mentions with the normalization patterns <URL> and <USER>, and converting uppercase letters to lowercase. As tweets are informal and complex, this basic pre-processing is too simple to convey enough important information. Tweets often contain emoticons and hashtags, which can be instrumental to sentiment analysis. Thus, we use the ekphrasis text processing tool (Baziotis et al., 2017) to improve text normalization, including sentiment-aware tokenization, spell correction, word normalization, word segmentation (for splitting hashtags), and word annotation.
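The basic pre-processing can be sketched with simple regular expressions. This is a minimal illustration of the normalization patterns only; the ekphrasis tool performs much richer processing.

```python
import re

def basic_preprocess(tweet):
    # lowercase first so the <URL>/<USER> placeholders keep their case
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+", "<URL>", tweet)  # normalize URLs
    tweet = re.sub(r"@\w+", "<USER>", tweet)         # normalize user mentions
    return tweet.split()                             # split into words
```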

Early Stopping
The early stopping method is used to prevent overfitting: training stops when the loss on a development set ceases to decrease for a few epochs. We randomly take 20% of the SemEval-2018 training data as the development set for early stopping and use the remaining 80% as the training set.
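The early-stopping rule can be sketched as follows, assuming a hypothetical patience of 3 epochs (the paper does not state the exact patience used).

```python
def early_stopping(dev_losses, patience=3):
    # stop once the development loss has not improved for `patience`
    # consecutive epochs; return the best epoch and its loss
    best_loss, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(dev_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_epoch, best_loss
```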

Settings
The maximum length of any tweet in the datasets used is n = 99. The embedding is based on a publicly available set of word vectors learned from 400 million tweets for the ACL WNUT 2015 shared task (Baldwin et al., 2015).

Baseline System
First, we compare the different labeling methods in the baseline system to decide how to use the train-17 dataset, using the basic pre-processing for text normalization. The results are shown in Table 2. The calculation of the Pearson correlation coefficient (Pcc.) involves the mean value of the data, which is often close to zero. From the results, labels farther from zero yield a higher Pcc. Therefore, we use the labeling to ±3 method in the multi-task learning system.

Multi-task Learning System
Table 3 shows the results of multi-task learning. The final row is the official SemEval-2018 test set result; the others are development set results. Here * means using the ekphrasis tool for pre-processing and s-m means some-emotion. With basic pre-processing for text normalization, the multi-task learning system is better than the baseline system. When the basic pre-processing method is replaced by the ekphrasis tool, the performance improves further. Finally, we submitted the results from our best system on the unseen test set to SemEval-2018, eventually obtaining a Pcc. of 0.638. We note that this is significantly lower than the 0.691 obtained on the development data.
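For reference, the Pearson correlation coefficient used as the evaluation metric can be computed in plain Python as:

```python
import math

def pearson(x, y):
    # Pearson correlation between predictions x and gold labels y
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```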

Conclusion
The proposed method improves performance on SemEval-2018 over baseline systems without multi-task learning. The external dataset significantly improves the Pcc. performance, but not the Acc. performance. A possible reason is that all labels of the external dataset are marked as ±3, resulting in a data imbalance problem. In the future, we will use skewness-robust weights to address this problem and use more resources, such as sentiment lexicons, to improve the system.