THU_NGN at SemEval-2018 Task 2: Residual CNN-LSTM Network with Attention for English Emoji Prediction

Emojis are widely used by social media and social network users when posting their messages, so it is important to study the relationships between messages and emojis. To this end, SemEval-2018 Task 2 proposes an interesting and challenging task: predicting which emojis are evoked by text-based tweets. We propose a residual CNN-LSTM with attention (RCLA) model for this task. Our model combines CNN and LSTM layers to capture both local and long-range contextual information for tweet representation. In addition, an attention mechanism is used to select important components, and residual connections are applied to the CNN layers to facilitate the training of the network. We also incorporate additional features such as POS tags and sentiment features extracted from lexicons. Our model achieved a 30.25% macro-averaged F-score in the first subtask (emoji prediction in English), ranking 7th out of 48 participants.


Introduction
Emojis are widely used in social media and social network messages such as tweets. They are frequently combined with plain text to visually complement the meaning of a message and convey various opinions and emotions (Novak et al., 2015; Barbieri et al., 2017). Social media platforms such as Twitter have accumulated a large number of emoji-incorporated messages. Analyzing the relationships between textual messages and emojis has many potential applications, such as emoji recommendation, automatic emoji-enriched message generation, and more accurate sentiment analysis of social media messages (Barbieri et al., 2017).
However, research on the relationships between textual messages and emojis is limited. Existing studies on emojis mainly focus on analyzing their semantics, usage, or sentiment (Aoki and Uchida, 2011; Barbieri et al., 2016a,b,c; Ljubešić and Fišer, 2016; Novak et al., 2015). For example, Barbieri et al. (2016b) explored the meaning and usage of emojis across different languages. Wijeratne et al. (2017) proposed to utilize emoji sense definitions to improve the performance of an emoji embedding model. However, these approaches cannot reveal the interplay between plain text and emojis. In order to fill this gap, Barbieri et al. (2017) proposed a novel task: predicting which emojis are evoked by text-based tweets. For example, given the tweet "Love my coworkers ! @user", a system is required to predict which emoji is associated with it.
As an extension, SemEval-2018 Task 2 aims to predict emojis for English and Spanish tweets (Barbieri et al., 2018). Given a plain tweet message without emojis, systems are required to predict which emoji is evoked by this message. We propose a residual CNN-LSTM with attention (RCLA) model for this task. Our model combines LSTM and multi-level CNN layers to capture both long-range and local information for learning tweet representations. In addition, an attention mechanism (Yang et al., 2016) is incorporated into our approach to select important components, and residual connections are applied to the CNN layers to facilitate the training of the network. We also incorporate additional features such as POS tags and sentiment features extracted from sentiment lexicons. Our model achieved a 30.25% macro-averaged F-score on the test data of the first subtask (emoji prediction in English) and ranked 7th out of 48 participants.

Residual CNN-LSTM with Attention
The framework of our residual CNN-LSTM with attention (RCLA) model is illustrated in Figure 1. Next, we introduce each layer of the model from bottom to top.

The first layer is the embedding layer, which converts a sentence from a sequence of words into a sequence of dense vectors. It uses an embedding lookup table whose parameters are initialized from pretrained word embeddings and fine-tuned during training. POS tags have proven useful for many natural language processing tasks such as dimensional sentiment analysis (Wu et al., 2017). Motivated by these studies, we also incorporate POS tags as additional features and combine them with the word embeddings to form the final word features, which serve as the input of the next layer. We use the Ark-Tweet-NLP tool to obtain the POS tags of tweets.
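As a minimal sketch, the per-token input could be built by concatenating the pretrained word embedding with a one-hot POS vector. The concatenation scheme and the helper names here are illustrative assumptions, not necessarily the exact implementation:

```python
import numpy as np

def word_features(tokens, pos_tags, emb_table, pos_vocab):
    """Concatenate each word's dense embedding with a one-hot POS vector."""
    feats = []
    for tok, tag in zip(tokens, pos_tags):
        emb = emb_table[tok]                      # pretrained dense vector
        pos = np.zeros(len(pos_vocab))
        pos[pos_vocab.index(tag)] = 1.0           # one-hot POS feature
        feats.append(np.concatenate([emb, pos]))
    return np.stack(feats)                        # shape: (T, d_emb + |POS|)
```

The resulting (T, d_emb + |POS|) matrix is then fed to the Bi-LSTM layer, one row per time step.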
The second layer is a bidirectional long short-term memory (Bi-LSTM) layer, which captures long-range contextual information from tweets. At time step i, it generates a hidden state h_i that contains both previous and future context information. Since different words and phrases have different importance for emoji prediction, we add an attention layer after the Bi-LSTM layer to help our model focus on important words and contexts. The input of the attention layer is the hidden state vector h_i at each time step. The attention weight \alpha_i for this time step is computed as:

\alpha_i = \frac{\exp(\tanh(w^{\top} h_i + b))}{\sum_{j} \exp(\tanh(w^{\top} h_j + b))},

where w and b are the parameters of the attention layer. The output of the attention layer at the i-th time step is formulated as:

o_i = \alpha_i h_i.

The third layer is a 3-layer convolutional neural network (CNN) that captures local context information. Each CNN layer has multiple kernels with different window sizes. In addition, we apply residual connections to the CNN layers as shown in Figure 1, which have proven effective in facilitating the training of deep neural networks. Max pooling is applied to the output of the last CNN layer to obtain the hidden representation of tweets.

Tweets with specific emojis usually convey strong sentiment information, so sentiment information is helpful for emoji prediction. We therefore incorporate sentiment features into our model to enhance its performance. These features are extracted with the AffectiveTweets package (Mohammad and Bravo-Marquez, 2017) in Weka, using two filters: TweetToLexiconFeatureVector (Bravo-Marquez et al., 2014) and TweetToSentiStrengthFeatureVector (Thelwall et al., 2012). The sentiment features are combined with the hidden tweet representations generated by the neural networks to form the final feature representation of tweets. Finally, a softmax layer is used to predict the emoji label.
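The attention step can be sketched in a few lines of numpy. This assumes a scalar scoring function tanh(w^T h_i + b) followed by a softmax over time steps; the exact scoring function used in the paper may differ slightly:

```python
import numpy as np

def attention(H, w, b):
    """H: (T, d) Bi-LSTM hidden states; returns weights and weighted states."""
    e = np.tanh(H @ w + b)                  # scalar score per time step
    alpha = np.exp(e) / np.exp(e).sum()     # softmax attention weights
    return alpha, alpha[:, None] * H        # o_i = alpha_i * h_i, per step
```

Note that, unlike attention variants that pool the sequence into a single vector, the weighted states here are kept per time step, since the CNN layers that follow still need a sequence as input.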
The tweets with different emojis in the training set are highly imbalanced: the ratio of the most frequent emoji is higher than 20%, while the ratio of some infrequent emojis is only 2.4%. Motivated by the cost-sensitive cross-entropy method (Santos-Rodríguez et al., 2009), the objective function of our model is defined as:

L = -\sum_{i=1}^{N} w_{y_i} \log \hat{y}_i,

where N is the number of tweets, y_i is the emoji label of the i-th tweet, \hat{y}_i is the prediction score, and w_{y_i} is the loss weight of emoji label y_i, defined as w_j = N / (C \cdot N_j), where C is the number of emoji labels and N_j is the number of tweets with emoji label j. Thus, infrequent emojis have relatively larger loss weights.
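A small sketch of this weighting scheme, assuming the weight form w_j = N / (C * N_j) described above (variable names are illustrative):

```python
import math

def class_weights(counts):
    """w_j = N / (C * N_j): infrequent emoji labels get larger loss weights."""
    N, C = sum(counts), len(counts)
    return [N / (C * n) for n in counts]

def weighted_cross_entropy(probs, labels, weights):
    """probs[i][c] is the softmax score of class c for tweet i."""
    return -sum(weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
```

With two labels occurring 80 and 20 times, the weights are 0.625 and 2.5, so misclassifying the rare label costs four times as much.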

Dataset and Experimental Settings
The dataset for this task is collected from Twitter and covers 20 emojis in total. 489,277 tweets are used for model training, and the trial and test sets each contain 50,000 tweets. We used the pre-trained word embeddings provided by Barbieri et al. (2016b), which were trained on 20 million geo-localized tweets and have a dimension of 300. These word embeddings were fine-tuned during model training.
The hyperparameters of our model were selected via cross-validation on the trial set. More specifically, the dimension of the Bi-LSTM hidden states is 300, and the window sizes of the CNN filters are 2, 3, and 4. The number of CNN filters is 200 and the number of sentiment features is 45. The dimension of the dense layer is 300, and the dropout rate is 0.2 for each layer. The batch size is 500, and the maximal number of training epochs is set to 100. We use RMSProp as the optimizer for network training. Performance is evaluated by macro-averaged F-score.
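The evaluation metric can be sketched as follows. Macro-averaged F-score computes the F1 score per class and averages with equal weight, so infrequent emojis count as much as frequent ones:

```python
def macro_f1(y_true, y_pred, labels):
    """Average per-class F1 scores with equal weight for every class."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

This is why the cost-sensitive loss above matters: a model that ignores rare emojis is penalized heavily under this metric.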

Performance Evaluation
The performance of our model on the test set is shown in Table 1. In addition, the performance of our approach on some infrequent emojis is also satisfactory. For example, the F-score on the emoji associated with Christmas is high, probably because specific words such as "Christmas" are frequently associated with it, making it relatively easy to predict. The visualization of the confusion matrix of our model is shown in Figure 2. From this figure, we find two pairs of emojis that are difficult for our model to discriminate. The emojis in the first pair are often confused with each other, probably because they are often used to express similar meanings and feelings; for example, both can be used in tweets that convey happy emotion. The emojis in the second pair look quite similar in appearance, and discriminating between them is quite difficult.
In order to further validate the effectiveness of our model, we compare it with several baseline methods: 1) LSTM, using a Bi-LSTM for tweet representation; 2) CNN, a 3-layer CNN without residual connections; 3) CNN-LSTM (denoted CL), combining LSTM and CNN; 4) Residual CNN-LSTM (denoted RCL), CNN-LSTM with residual connections; and 5) Residual CNN-LSTM with attention (denoted RCLA). The results are shown in Table 2. According to Table 2, the combination of LSTM and CNN (CL) usually outperforms a single CNN or LSTM, indicating that combining CNN and LSTM to capture both local and long-range context information is beneficial for tweet emoji prediction. In addition, comparing RCL with CL shows that residual connections improve the performance of the CL model, confirming that they facilitate the training of the network. Besides, the attention mechanism also significantly improves performance, validating that attention helps capture the contexts that are important for emoji prediction.

Influence of Additional Features
The influence of the POS tags and sentiment features is illustrated in Table 3. The results show that both feature types help improve the performance of tweet emoji prediction. POS tags contain useful information for predicting emojis, since important emoji clues such as hashtags, emoticons, and sentiment words usually have specific POS tags; thus, incorporating POS tag features is beneficial. The sentiment features extracted from sentiment lexicons are also useful, because they can identify both formal and informal sentiment signals such as hashtags and emoticons, and these signals usually have strong associations with specific emojis.

Conclusion
In this paper, we introduce our residual CNN-LSTM with attention (RCLA) model for SemEval-2018 Task 2, i.e., emoji prediction for tweets. Our model combines CNN and LSTM layers to capture both local and long-range context information for tweet representation, and incorporates an attention layer to select important information. In addition, we applied residual connections to the CNN layers to facilitate training, and incorporated additional features such as POS tags and sentiment features to further improve performance. The experimental results validate the effectiveness of our model on emoji prediction for English tweets.