NTUA-SLP at SemEval-2018 Task 2: Predicting Emojis using RNNs with Context-aware Attention

In this paper we present a deep-learning model that competed in SemEval-2018 Task 2, "Multilingual Emoji Prediction". We participated in Subtask A, in which the goal is to predict the most likely associated emoji for English tweets. The proposed architecture relies on a Long Short-Term Memory network, augmented with an attention mechanism that conditions the weight of each word on a "context vector", which is taken as the aggregation of a tweet's meaning. Moreover, we initialize the embedding layer of our model with word2vec word embeddings, pretrained on a dataset of 550 million English tweets. Finally, our model does not rely on hand-crafted features or lexicons and is trained end-to-end with back-propagation. We ranked 2nd out of 48 teams.


Introduction
Emojis play an important role in textual communication, as they function as a substitute for the nonverbal cues that are taken for granted in face-to-face communication, thus allowing users to convey emotions by means other than words. Despite their wide appeal in text, they have not received much attention until recently. Earlier works mostly consider their semantics (Aoki and Uchida, 2011; Espinosa-Anke et al., 2016; Barbieri et al., 2016b,a; Ljubešić and Fišer, 2016; Eisner et al., 2016), and only recently has their role in social media been explored (Cappallo et al., 2018). In SemEval-2018 Task 2: "Multilingual Emoji Prediction" (Barbieri et al., 2018), given a tweet, we are asked to predict its most likely associated emoji. In this work, we present a near-state-of-the-art approach for predicting emojis in tweets, which outperforms the previously best-reported results. For this purpose, we employ an LSTM network augmented with a context-aware self-attention mechanism, producing a feature representation used for classification. Moreover, the attention mechanism makes our model's behavior more interpretable, since we can examine the distribution of the attention weights for a given tweet. To this end, we provide visualizations of the distributions of the attention weights.

Overview
Figure 3 provides a high-level overview of our approach, which consists of three main steps: (1) the text preprocessing step, which is common to both the unlabeled data and the task's dataset; (2) the word embeddings pretraining step, where we train custom word embeddings on a big collection of unlabeled Twitter messages; and (3) the model training step, where we train the deep learning model.

Task definition. In Subtask A, given an English tweet, we are called to predict the most likely associated emoji, from the 20 most frequent emojis in English tweets. The training dataset consists of 500k tweets, retrieved from October 2015 to February 2017 and geolocalized in the United States.

Preprocessing. We preprocess the tweets with the ekphrasis tool. The steps included in ekphrasis are: Twitter-specific tokenization, spell correction, word normalization, word segmentation (for splitting hashtags) and word annotation.
Tokenization. Tokenization is the first fundamental preprocessing step and, since it is the basis for the other steps, it directly affects the quality of the features learned by the network. Tokenization on Twitter is challenging, since there is large variation in the vocabulary and the expressions used. Certain expressions are better kept as one token (e.g. antiamerican), while others should be split into separate tokens. Ekphrasis recognizes Twitter markup, emoticons, emojis, dates (e.g. 07/11/2011, April 23rd), times (e.g. 4:30pm, 11:00 am), currencies (e.g. $10, 25mil, 50e), acronyms, censored words (e.g. s**t), words with emphasis (e.g. *very*) and more, using an extensive list of regular expressions.
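The kind of Twitter-aware tokenization described above can be sketched as an ordered list of regular expressions, tried left to right at each position. The patterns below are illustrative stand-ins, not ekphrasis's actual rules:

```python
import re

# Illustrative Twitter-aware tokenizer. Pattern order matters: more specific
# patterns (URLs, times, censored words) must come before the generic WORD rule.
PATTERNS = [
    ("URL", r"https?://\S+"),
    ("USER", r"@\w+"),
    ("HASHTAG", r"#\w+"),
    ("TIME", r"\d{1,2}:\d{2}\s?(?:am|pm)?"),
    ("CURRENCY", r"\$\d+(?:\.\d+)?"),
    ("CENSORED", r"\w+\*+\w+"),      # e.g. s**t
    ("EMPHASIS", r"\*\w+\*"),        # e.g. *very*
    ("WORD", r"\w+"),
    ("PUNCT", r"[^\w\s]"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in PATTERNS), re.IGNORECASE)

def tokenize(text):
    """Return the tokens matched left-to-right by the master regex."""
    return [m.group() for m in MASTER.finditer(text)]
```

For example, `tokenize("@user check https://t.co/x at 4:30pm")` keeps the handle, URL and time expression each as a single token.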
Normalization. After tokenization we apply a series of modifications to the extracted tokens, such as spell correction, word normalization and segmentation. Specifically, for word normalization we lowercase words and normalize URLs, emails, numbers, dates, times and user handles (@user). This helps reduce the vocabulary size without losing information. For spell correction (Jurafsky and James, 2000) and word segmentation (Segaran and Hammerbacher, 2009) we use the Viterbi algorithm. The prior probabilities are initialized using uni/bi-gram word statistics from the unlabeled dataset. Table 1 shows an example text snippet and the resulting preprocessed tokens.
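The unigram-based word segmentation step (used, e.g., for splitting hashtags) can be sketched as a Viterbi search over split points. The toy counts below stand in for the word statistics gathered from the unlabeled dataset:

```python
import math

# Toy unigram counts; in the actual system these come from uni/bi-gram
# statistics over 550M unlabeled tweets.
COUNTS = {"small": 100, "world": 90, "hello": 80, "so": 60, "me": 50}
TOTAL = sum(COUNTS.values())

def word_prob(w):
    # Unseen candidates get a smoothed, length-penalized probability.
    return COUNTS.get(w, 0.01 / 10 ** len(w)) / TOTAL

def segment(text):
    """Viterbi over split points: best[i] holds the best segmentation of text[:i]."""
    best = [(0.0, [])]  # (log-probability, tokens)
    for i in range(1, len(text) + 1):
        candidates = [
            (best[j][0] + math.log(word_prob(text[j:i])), best[j][1] + [text[j:i]])
            for j in range(max(0, i - 20), i)  # cap candidate word length at 20
        ]
        best.append(max(candidates))
    return best[-1][1]
```

With these counts, `segment("smallworld")` recovers the split `["small", "world"]`, since the product of the two unigram probabilities dominates any segmentation involving unseen strings.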

Recurrent Neural Networks
We model the Twitter messages using Recurrent Neural Networks (RNNs). RNNs process their inputs sequentially, performing the same operation, $h_t = f_W(x_t, h_{t-1})$, on every element of the sequence, where $h_t$ is the hidden state at time step $t$ and $W$ are the network weights. Since the hidden state at each time step depends on the previous hidden states, the order of the elements (words) is important. This process also enables RNNs to handle inputs of variable length. RNNs are difficult to train (Pascanu et al., 2013), because gradients may grow or decay exponentially over long sequences (Bengio et al., 1994; Hochreiter et al., 2001). A way to overcome these problems is to use more sophisticated variants of regular RNNs, like Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) or Gated Recurrent Units (GRU; Cho et al., 2014), which ensure better gradient flow through the network.

Self-Attention Mechanism. RNNs update their hidden state $h_i$ as they process a sequence, and the final hidden state holds a summary of the information in the sequence. In order to amplify the contribution of important words to the final representation, a self-attention mechanism can be used (Fig. 4). In regular RNNs, we use the final state $h_N$ as the representation $r$ of the input sequence. With an attention mechanism, however, we compute $r$ as the convex combination of all the $h_i$, with weights $a_i$ that signify the importance of each hidden state. Formally:

$r = \sum_{i=1}^{N} a_i h_i, \qquad \sum_{i=1}^{N} a_i = 1, \quad a_i \ge 0$
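The convex combination above can be sketched in a few lines; this is a generic illustration in which the attention scores are assumed to be given (in the model they are produced by a learned layer):

```python
import math

def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attended_representation(hidden_states, scores):
    """r = sum_i a_i * h_i: a convex combination of the hidden states,
    with weights a = softmax(scores), so that the a_i sum to 1."""
    a = softmax(scores)
    dim = len(hidden_states[0])
    r = [sum(a[i] * h[d] for i, h in enumerate(hidden_states)) for d in range(dim)]
    return r, a
```

With equal scores the representation reduces to the plain average of the hidden states; larger scores pull `r` toward the corresponding $h_i$.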

Model Description
We use a word-level BiLSTM architecture to model semantic information in tweets, and we propose an attention mechanism that conditions the weight of each $h_i$ on a "context vector", taken as the aggregation of the tweet's meaning.

Embedding Layer. The input to the network is a Twitter message, treated as a sequence of words. We use an embedding layer to project the words $w_1, w_2, ..., w_N$ to a low-dimensional vector space $\mathbb{R}^W$, where $W$ is the size of the embedding layer and $N$ the number of words in a tweet. We initialize the weights of the embedding layer with our pretrained word embeddings.

BiLSTM Layer. An LSTM takes as input the words of a tweet and produces the word annotations $h_1, h_2, ..., h_N$, where $h_i$ is the hidden state of the LSTM at time step $i$, summarizing all the information of the sentence up to $w_i$. We use a bidirectional LSTM (BiLSTM) in order to get word annotations that summarize the information from both directions. A BiLSTM consists of a forward LSTM $\overrightarrow{f}$ that reads the sentence from $w_1$ to $w_N$ and a backward LSTM $\overleftarrow{f}$ that reads the sentence from $w_N$ to $w_1$. We obtain the final annotation for each word by concatenating the annotations from both directions, $h_i = \overrightarrow{h_i} \, \| \, \overleftarrow{h_i}$, where $\|$ denotes the concatenation operation and $L$ the size of each LSTM.

Context-aware Self-Attention Layer. Even though the hidden state $h_i$ of the LSTM captures the local context up to word $i$, in order to better estimate the importance of each word given the context of the whole tweet, we condition the hidden state on a context vector. The context vector is taken as the average of the hidden states: $c = \frac{1}{N} \sum_{i=1}^{N} h_i$ (Eq. 1). The final representation $r$ is again taken as the convex combination of the hidden states.
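A minimal sketch of the context-aware attention layer follows. The dot-product scoring used here is a simplification: it stands in for the model's learned attention layer, but it reproduces the key idea of scoring each hidden state against the context vector:

```python
import math

def context_attention(hidden_states):
    """Context-aware attention sketch.
    c = (1/N) * sum_i h_i is the context vector; each h_i is scored against c
    (dot product here, a stand-in for the learned scoring function), the scores
    are softmax-normalized into weights a_i, and r = sum_i a_i * h_i."""
    n, dim = len(hidden_states), len(hidden_states[0])
    c = [sum(h[d] for h in hidden_states) / n for d in range(dim)]   # context
    scores = [sum(h[d] * c[d] for d in range(dim)) for h in hidden_states]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    a = [e / total for e in exps]                                    # weights
    r = [sum(a[i] * hidden_states[i][d] for i in range(n)) for d in range(dim)]
    return r, a
```

Hidden states closely aligned with the tweet's overall meaning (the context vector) receive larger weights in the final representation.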
Output Layer. We use the representation $r$ as a feature vector for classification and feed it to a fully-connected softmax layer, which outputs a probability distribution over all classes, as described in Eq. 4: $p_c = \text{softmax}(W r + b)$, where $W$ and $b$ are the layer's weights and biases.
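The output layer amounts to a single affine transformation followed by a softmax; a plain-Python sketch:

```python
import math

def softmax_output(r, weights, biases):
    """p = softmax(W r + b): one logit per class from the representation r."""
    logits = [sum(w * x for w, x in zip(row, r)) + b
              for row, b in zip(weights, biases)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With zero weights and biases the layer outputs the uniform distribution over the 20 emoji classes; training shapes `weights` and `biases` so that probability mass concentrates on the correct class.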

Regularization
In both models we add Gaussian noise to the embedding layer, which can be interpreted as a random data augmentation technique that makes the models more robust to overfitting. In addition, we use dropout (Srivastava et al., 2014) and we stop training after the validation loss has stopped decreasing (early stopping).
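Both regularization techniques can be sketched as follows; the `patience` parameter and the epoch loop are illustrative details not specified in the text:

```python
import random

def add_gaussian_noise(vector, sigma=0.05, rng=random.Random(0)):
    """Perturb an embedding vector with N(0, sigma^2) noise (training time only)."""
    return [x + rng.gauss(0.0, sigma) for x in vector]

def train_with_early_stopping(run_epoch, patience=5, max_epochs=100):
    """Stop once the validation loss has not improved for `patience`
    consecutive epochs; `run_epoch` trains one epoch and returns the
    validation loss. Returns the best validation loss seen."""
    best, wait = float("inf"), 0
    for epoch in range(max_epochs):
        val_loss = run_epoch(epoch)
        if val_loss < best:
            best, wait = val_loss, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best
```

In practice one would also checkpoint the model weights at each new best, and restore them when stopping.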

Experimental Setup
Class Weights. In order to deal with class imbalance, we apply class weights to the loss function of our models, penalizing the misclassification of underrepresented classes more heavily. We weight each class by its inverse frequency in the training set.
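The inverse-frequency weighting can be sketched as:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by the inverse of its frequency in the training set,
    so that rare classes contribute more to the loss."""
    counts = Counter(labels)
    n = len(labels)
    return {c: n / counts[c] for c in counts}
```

A class covering a quarter of the training set gets weight 4, one covering half gets weight 2, and so on; these weights then scale the per-example loss terms.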
Training. We use the Adam algorithm (Kingma and Ba, 2014) for optimizing our networks, with mini-batches of size 32, and we clip the norm of the gradients (Pascanu et al., 2013) at 1, as an extra safety measure against exploding gradients. For developing our models we used PyTorch (Paszke et al., 2017) and Scikit-learn (Pedregosa et al., 2011).
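Gradient norm clipping, shown here for a flat gradient vector, rescales the gradients whenever their L2 norm exceeds the threshold:

```python
import math

def clip_by_norm(grads, max_norm=1.0):
    """If the L2 norm of the gradient vector exceeds max_norm, rescale the
    whole vector so its norm equals max_norm; otherwise leave it unchanged.
    Rescaling preserves the gradient direction, only shrinking its magnitude."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return grads
```

This is the same operation PyTorch performs (over all parameters jointly) when clipping gradients during training.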
Hyper-parameters. In order to find good hyper-parameter values in a relatively short time (compared to grid or random search), we adopt the Bayesian optimization (Bergstra et al., 2013) approach, performing a time-efficient search in the space of all hyper-parameter values. The size of the embedding layer is 300, and that of the LSTM layers is 300 (600 for the BiLSTM). We add Gaussian noise with σ = 0.05 and dropout of 0.1 at the embedding layer, and dropout of 0.3 at the LSTM layer.
Results. The dataset for Task 2 was introduced in prior work, in which the authors propose a character-level model with pretrained word vectors that achieves an F1 score of 34%. As shown in Table 2, we ranked 2nd out of 49 teams with an F1 score of 35.361% (the official evaluation metric), while team TueOslo achieved the first position with an F1 score of 35.991%. It should be noted that only the first 2 teams managed to surpass this baseline model.
In Table 3 we compare the proposed Context-Attention LSTM (CA-LSTM) model against 2 baselines: (1) a Bag-of-Words (BOW) model with TF-IDF weighting and (2) a Neural Bag-of-Words (N-BOW) model, where we retrieve the word2vec representations of the words in a tweet and compute the tweet representation as the centroid of the constituent word2vec representations. Both BOW and N-BOW features are then fed to a linear SVM classifier with tuned C = 0.6. The CA-LSTM results in Table 3 are computed by averaging the results of 10 runs to account for model variability. Table 3 shows that the BOW model outperforms N-BOW by a large margin, which may indicate that there exist words that are very correlated with specific classes, whose occurrence alone can determine the classification result. Finally, we observe that CA-LSTM significantly outperforms both baselines.

Fig. 6 shows the confusion matrix of our model over the 20 emojis. Observe that our model is more likely to misclassify a rare class as an instance of one of the 4 most frequent classes, even after the inclusion of class weights in the loss function (Section 4.1). Furthermore, we observe that heart or face emojis, which are more ambiguous, are easily confused with each other. However, as expected, this is not the case for emojis like the US flag or the Christmas tree, as they are tied to specific expressions.

Attention Visualization. The attention mechanism not only improves the performance of the model, but also makes it interpretable. Using the attention scores assigned to each word annotation, we can investigate the behavior of the model. Figure 7 shows how the attention mechanism focuses on each word in order to estimate the most suitable emoji label.
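The N-BOW baseline's centroid representation can be sketched as follows (toy 2-dimensional embeddings stand in for the 300-dimensional word2vec vectors):

```python
def nbow_centroid(tokens, embeddings):
    """Neural Bag-of-Words: the tweet representation is the centroid (mean)
    of the word2vec vectors of its in-vocabulary tokens. Out-of-vocabulary
    tokens are skipped; a tweet with no known tokens yields None."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
```

The resulting fixed-size vector is what gets fed to the linear SVM; note that, unlike the attention model, averaging weighs every word equally regardless of its importance.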

Conclusion
In this paper, we present a deep learning system based on a word-level BiLSTM architecture, augmented with contextual attention, for SemEval-2018 Task 2: "Multilingual Emoji Prediction" (Barbieri et al., 2018). Our system achieved excellent results, reaching the 2nd place in the competition and outperforming the previously reported state of the art. The performance of our model could be further boosted by utilizing transfer learning from larger, weakly annotated datasets. Moreover, the joint training of word- and character-level models could be tested for further performance improvements. Finally, we make both our pretrained word embeddings and the source code of our models available to the community, in order to make our results easily reproducible and to facilitate further experimentation in the field.