NTUA-SLP at SemEval-2018 Task 3: Tracking Ironic Tweets using Ensembles of Word and Character Level Attentive RNNs

In this paper we present two deep-learning systems that competed at SemEval-2018 Task 3 "Irony detection in English tweets". We design and ensemble two independent models, based on recurrent neural networks (Bi-LSTM), which operate at the word and character level, in order to capture both the semantic and syntactic information in tweets. Our models are augmented with a self-attention mechanism, in order to identify the most informative words. The embedding layer of our word-level model is initialized with word2vec word embeddings, pretrained on a collection of 550 million English tweets. We did not utilize any handcrafted features, lexicons or external datasets as prior information; our models are trained end-to-end using backpropagation on constrained data. Furthermore, we provide visualizations of tweets with annotations for the salient tokens of the attention layer that can help interpret the inner workings of the proposed models. We ranked 2nd out of 42 teams in Subtask A and 2nd out of 31 teams in Subtask B. However, post-task-completion enhancements of our models achieve state-of-the-art results, ranking 1st in both subtasks.


Introduction
Irony is a form of figurative language, considered as "saying the opposite of what you mean", where the opposition of literal and intended meanings is very clear (Barbieri and Saggion, 2014; Liebrecht et al., 2013). Traditional approaches in NLP (Tsur et al., 2010; Barbieri and Saggion, 2014; Karoui et al., 2015; Farías et al., 2016) model irony based on pattern-based features, such as the contrast between high- and low-frequency words, the punctuation used by the author, the level of ambiguity of the words and the contrast between the sentiments. Also, Joshi et al. (2016) recently added word embedding statistics to the feature space and further boosted the performance in irony detection.

Figure 1: The color intensity of each word / character corresponds to its weight (importance), as given by the self-attention mechanism (Section 2.6).
Modeling irony, especially in Twitter, is a challenging task, since in ironic comments the literal meaning can be misleading; irony is expressed in "secondary" meaning and fine nuances that are hard to model explicitly in machine learning algorithms. Tracking irony in social media poses the additional challenge of dealing with special language, social media markers and abbreviations. Despite the accuracy achieved in this task by handcrafted features, a laborious feature-engineering process and domain-specific knowledge are required; this type of prior knowledge must be continuously updated and investigated for each new domain. Moreover, the difficulty in parsing tweets (Gimpel et al., 2011) for feature extraction makes their precise semantic representation, which is key to determining their intended gist, much harder to obtain.
In recent years, the successful utilization of deep learning architectures in NLP has led to alternative approaches for tracking irony in Twitter (Joshi et al., 2017; Ghosh and Veale, 2017). Ghosh and Veale (2016) proposed a Convolutional Neural Network (CNN) followed by a Long Short-Term Memory (LSTM) architecture, outperforming the state of the art. Dhingra et al. (2016) utilized deep learning to represent tweets as sequences of characters, instead of words, and showed that such representations reveal information about the irony concealed in tweets.
In this work, we propose the combination of word- and character-level representations in order to exploit both the semantic and syntactic information of each tweet for successfully predicting irony. For this purpose, we employ a deep LSTM architecture which models words and characters separately. We predict whether a tweet is ironic or not, as well as the type of irony of the ironic ones, by ensembling the two separate models (late fusion). Furthermore, we add an attention layer to both models, to better weigh the contribution of each word and character towards irony prediction, as well as to better interpret the descriptive power of our models. Attention weighting also helps with the problem of supervised learning in deep architectures. The suggested models were trained only on constrained data, meaning that we did not utilize any external dataset for further tuning of the network weights.
This paper describes the two deep-learning models we submitted to SemEval-2018 Task 3 "Irony detection in English tweets" (Van Hee et al., 2018). It is structured as follows: Section 2 presents an overview of the proposed models, Section 3 describes the models for tracking irony in detail, Section 4 presents the experimental setup along with the respective results, and finally, Section 5 discusses the performance of the proposed models.

Overview
Fig. 2 provides a high-level overview of our approach, which consists of three main steps: (1) the pre-training of word embeddings, where we train our own word embeddings on a large collection of unlabeled Twitter messages, (2) the independent training of our models, word- and char-level, and (3) the ensembling, where we combine the predictions of each model.

Task definitions
The goal of Subtask A is tracking irony in tweets as a binary classification problem (ironic vs. non-ironic). In Subtask B, we are also asked to determine the type of irony, with three different classes of irony on top of the non-ironic one (four-class classification). The types of irony are: (1) Verbal irony by means of a polarity contrast, which includes messages whose polarity (positive, negative) is inverted between the literal and the intended evaluation, such as "I really love this year's summer; weeks and weeks of awful weather", where the literal evaluation ("I really love this year's summer") is positive, while the intended one, which is implied in the context ("weeks and weeks of awful weather"), is negative.
(2) Other verbal irony, which refers to instances that show no polarity contrast but are nevertheless ironic, such as "Yeah keeping cricket clean, that's what he wants #Sarcasm", and (3) situational irony, which is present in messages in which a present situation fails to meet some expectations, such as "Event technology session is having Internet problems. #irony #HSC2024", in which the expectation that a technology session should provide Internet connection is not met.

Data
Unlabeled Dataset. We collected a dataset of 550 million archived English Twitter messages, from Apr. 2014 to Jun. 2017. This dataset is used for (1) calculating word statistics needed in our text preprocessing pipeline (Section 2.4) and (2) training word2vec word embeddings (Section 2.3).

Word Embeddings
Word embeddings are dense vector representations of words (Collobert and Weston, 2008), capturing their semantic and syntactic information. We leverage our unlabeled dataset to train Twitter-specific word embeddings. We use the word2vec algorithm with the skip-gram model, negative sampling of 5 and minimum word count of 20, utilizing Gensim's (Řehůřek and Sojka, 2010) implementation. The resulting vocabulary contains 800,000 words. The pre-trained word embeddings are used for initializing the first layer (embedding layer) of our neural networks.

Preprocessing
We utilized the ekphrasis (Baziotis et al., 2017) tool as a tweet preprocessor. The preprocessing steps included in ekphrasis are: Twitter-specific tokenization, spell correction, word normalization, word segmentation (for splitting hashtags) and word annotation.

Tokenization. Tokenization is the first fundamental preprocessing step and, since it is the basis for the other steps, it immediately affects the quality of the features learned by the network. Tokenization in Twitter is especially challenging, since there is large variation in the vocabulary and the expressions used. Part of the challenge is also deciding whether to process an entire expression (e.g. anti-american) or its respective tokens. Ekphrasis overcomes this challenge by recognizing the Twitter markup, emoticons, emojis, expressions like dates (e.g. 07/11/2011, April 23rd), times (e.g. 4:30pm, 11:00 am), currencies (e.g. $10, 25mil, 50e), acronyms, censored words (e.g. s**t) and words with emphasis (e.g. *very*).

Normalization. After tokenization, we apply a series of modifications to the extracted tokens, such as spell correction, word normalization and segmentation. We also decide which tokens to omit, normalize and surround or replace with special tags (e.g. URLs, emails and @user). For the tasks of spell correction (Jurafsky and James, 2000) and word segmentation (Segaran and Hammerbacher, 2009) we use the Viterbi algorithm. The prior probabilities are initialized using uni/bigram word statistics from the unlabeled dataset.
The benefits of the above procedure are the reduction of the vocabulary size, without removing any words, and the preservation of information that is usually lost during tokenization. Table 1 shows an example text snippet and the resulting preprocessed tokens.
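As a rough, self-contained illustration of the annotation scheme described above (not the actual ekphrasis implementation, which additionally performs spell correction and Viterbi-based hashtag segmentation), a simplified normalizer might look like:

```python
# Minimal sketch of tag-based tweet normalization: URLs and @-mentions are
# replaced with tags, hashtags are surrounded by tags, all-caps words and
# repeated punctuation are annotated. Hashtag segmentation (e.g. splitting
# "#twinpeaks" into "twin peaks") is intentionally omitted here.
import re

def normalize(text):
    tokens = []
    for tok in text.split():
        if re.match(r"https?://", tok):
            tokens.append("<url>")
        elif tok.startswith("@"):
            tokens.append("<user>")
        elif tok.startswith("#"):
            tokens += ["<hashtag>", tok[1:].lower(), "</hashtag>"]
        elif set(tok) <= set("!?") and len(tok) > 1:
            tokens += [tok[0], "<repeated>"]          # e.g. "!!!" -> "! <repeated>"
        elif tok.isupper() and len(tok) > 1:
            tokens += [tok.lower(), "<allcaps>"]      # e.g. "WAIT" -> "wait <allcaps>"
        else:
            tokens.append(tok.lower())
    return tokens
```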

Recurrent Neural Networks
We model the Twitter messages using Recurrent Neural Networks (RNN). RNNs process their inputs sequentially, performing the same operation, h_t = f_W(x_t, h_{t-1}), on every element in a sequence, where h_t is the hidden state at time step t, and W the network weights. Since the hidden state at each time step depends on the previous hidden states, the order of the elements (words) is important. This process also enables RNNs to handle inputs of variable length.
RNNs are difficult to train (Pascanu et al., 2013), because gradients may grow or decay exponentially over long sequences (Bengio et al., 1994; Hochreiter et al., 2001). A way to overcome these problems is to use more sophisticated variants of regular RNNs, like Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) or Gated Recurrent Units (GRU), which introduce a gating mechanism to ensure proper gradient flow through the network. In this work, we use LSTMs.
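The recurrence h_t = f_W(x_t, h_{t-1}) can be made concrete with a PyTorch LSTM; the sizes below are illustrative, not the paper's hyper-parameters.

```python
# Sketch of the sequential RNN update using an LSTM: the network is applied
# step by step over the sequence, and the final hidden state summarizes it.
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(1, 5, 8)       # one sequence of 5 elements (e.g. words)
out, (h_n, c_n) = lstm(x)      # out holds the hidden state h_t at every step
# h_n is the final hidden state h_N, i.e. the summary of the whole sequence
```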

Self-Attention Mechanism
RNNs update their hidden state h_i as they process a sequence, and the final hidden state holds a summary of the information in the sequence. In order to amplify the contribution of important words in the final representation, a self-attention mechanism can be used (Fig. 3). In regular RNNs, we use as representation r of the input sequence its final state h_N. Using an attention mechanism, we instead compute r as the convex combination of all the h_i. The weights a_i are learned by the network and their magnitude signifies the importance of each hidden state in the final representation. Formally:

r = Σ_{i=1}^{N} a_i h_i,  where  Σ_{i=1}^{N} a_i = 1  and  a_i > 0.

Figure 3: Comparison between the regular RNN and the RNN with attention.

Table 1: Example of the preprocessing pipeline.
original:  The *new* season of #TwinPeaks is coming on May 21, 2017. CANT WAIT \o/ !!! #tvseries #davidlynch :D
processed: the new <emphasis> season of <hashtag> twin peaks </hashtag> is coming on <date> . cant <allcaps> wait <allcaps> <happy> ! <repeated> <hashtag> tv series </hashtag> <hashtag> david lynch </hashtag> <laugh>

Models Description
We have designed two independent deep-learning models, each capturing different aspects of the tweet. The first model operates at the word level, capturing the semantic information of the tweet, and the second at the character level, capturing its syntactic information. Both models share the same architecture; the only difference is in their embedding layers. We present both models in a unified manner.

Embedding Layer
Character-level. The input to the network is a Twitter message, treated as a sequence of characters. We use a character embedding layer to project the characters c_1, c_2, ..., c_N to a low-dimensional vector space R^C, where C is the size of the embedding layer and N the number of characters in a tweet. We randomly initialize the weights of the embedding layer and learn the character embeddings from scratch.

Word-level. The input to the network is a Twitter message, treated as a sequence of words. We use a word embedding layer to project the words w_1, w_2, ..., w_N to a low-dimensional vector space R^W, where W is the size of the embedding layer and N the number of words in a tweet. We initialize the weights of the embedding layer with our pretrained word embeddings.

BiLSTM Layers
An LSTM takes as input the words (characters) of a tweet and produces the word (character) annotations h_1, h_2, ..., h_N, where h_i is the hidden state of the LSTM at time step i, summarizing all the information of the sentence up to w_i (c_i). We use a bidirectional LSTM (BiLSTM) in order to get word (character) annotations that summarize the information from both directions. A bidirectional LSTM consists of a forward LSTM f→ that reads the sentence from w_1 to w_N and a backward LSTM f← that reads the sentence from w_N to w_1. We obtain the final annotation for a given word w_i (character c_i) by concatenating the annotations from both directions:

h_i = h_i→ ∥ h_i←,  h_i ∈ R^{2L},

where ∥ denotes the concatenation operation and L the size of each LSTM. We stack two layers of BiLSTMs in order to learn more high-level (abstract) features.
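A minimal sketch of this encoder (an embedding layer followed by a two-layer BiLSTM) in PyTorch, with illustrative sizes; the word-level variant is shown, and the character-level model differs only in that its embedding layer is randomly initialized rather than pretrained.

```python
# Sketch of the embedding + stacked BiLSTM encoder: each annotation h_i
# concatenates the forward and backward hidden states, so it has size 2L.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, emb_dim, L = 1000, 50, 32   # illustrative sizes

embedding = nn.Embedding(vocab_size, emb_dim)  # initialized with word2vec in practice
bilstm = nn.LSTM(emb_dim, L, num_layers=2,
                 bidirectional=True, batch_first=True)

ids = torch.randint(0, vocab_size, (4, 20))    # batch of 4 tweets, 20 tokens each
h = bilstm(embedding(ids))[0]                  # annotations h_1, ..., h_N
```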

Attention Layer
Not all words contribute equally to the meaning that is expressed in a message. We use an attention mechanism to find the relative contribution (importance) of each word. The attention mechanism assigns a weight a_i to each word annotation h_i and computes the fixed representation r of the whole input message as the weighted sum of all the word annotations:

e_i = tanh(W_h h_i + b_h),                    (1)
a_i = exp(e_i) / Σ_{t=1}^{N} exp(e_t),        (2)
r = Σ_{i=1}^{N} a_i h_i,                      (3)

where W_h and b_h are the attention layer's weights.
Character-level Interpretation. In the case of the character-level model, the attention mechanism operates in the same way as in the word-level model. However, we can interpret the weight given to each character annotation h_i by the attention mechanism as the importance of the information surrounding the given character.
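The attention layer can be sketched as a small PyTorch module, assuming the common formulation of a learned projection (weights W_h, b_h), a tanh non-linearity and a softmax over the time axis; sizes are illustrative.

```python
# Sketch of a self-attention layer: scores e_i = tanh(W_h h_i + b_h) are
# normalized with a softmax over the sequence, and the representation r is
# the resulting convex combination of the annotations h_i.
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)   # W_h, b_h

    def forward(self, h):                       # h: (batch, N, hidden)
        e = torch.tanh(self.proj(h))            # (batch, N, 1)
        a = torch.softmax(e, dim=1)             # weights a_i, sum to 1 over N
        r = (a * h).sum(dim=1)                  # weighted sum of annotations
        return r, a.squeeze(-1)

torch.manual_seed(0)
attn = SelfAttention(64)
r, a = attn(torch.randn(4, 20, 64))             # batch of 4, 20 annotations
```

The returned weights `a` are exactly what the visualizations in Section 5 color by.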

Output Layer
We use the representation r as a feature vector for classification and feed it to a fully-connected softmax layer, which outputs a probability distribution over all classes, as described in Eq. 4:

p_c = softmax(W r + b),   (4)

where W and b are the layer's weights and biases.

Regularization
In order to prevent overfitting of both models, we add Gaussian noise to the embedding layer, which can be interpreted as a random data augmentation technique that makes the models more robust to overfitting. In addition, we use dropout (Srivastava et al., 2014) and early stopping.
Finally, we do not fine-tune the embedding layer of the word-level model. Words occurring in the training set would be moved in the embedding space, and the classifier would correlate certain regions (in embedding space) with certain meanings or types of irony. However, words in the test set but not in the training set would remain at their initial positions, which may no longer reflect their "true" meaning, leading to misclassifications.
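The Gaussian-noise regularization can be sketched as a small PyTorch module that perturbs its input only in training mode; the noise level sigma used here is an assumption, not a value from the paper.

```python
# Sketch of Gaussian-noise regularization on the embedding output: during
# training, zero-mean noise is added (a simple random data augmentation);
# at evaluation time the input passes through unchanged.
import torch
import torch.nn as nn

class GaussianNoise(nn.Module):
    def __init__(self, sigma=0.1):
        super().__init__()
        self.sigma = sigma

    def forward(self, x):
        if self.training:
            return x + self.sigma * torch.randn_like(x)
        return x

torch.manual_seed(0)
noise = GaussianNoise(sigma=0.2)   # sigma is an assumed value
x = torch.zeros(2, 5)
noise.train()
noisy = noise(x)                   # perturbed embeddings
noise.eval()
clean = noise(x)                   # identity at test time
```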

Ensemble
A key factor in good ensembles is to utilize diverse classifiers. To this end, we combine the predictions of our word- and character-level models. We employed two ensemble schemes, namely unweighted average and majority voting.

Unweighted Average (UA). In this approach, the final prediction is estimated from the unweighted average of the posterior probabilities of all the different models. Formally, the final prediction p for an instance is estimated by:

p = argmax_{c ∈ {1,...,C}} (1/M) Σ_{i=1}^{M} p_i,

where C is the number of classes, M is the number of different models, c ∈ {1, ..., C} denotes one class and p_i is the probability vector calculated by model i ∈ {1, ..., M} using the softmax function.

Majority Voting (MV). The majority voting approach counts the votes of all the different models and chooses the class with the most votes. Compared to unweighted averaging, MV is less affected by single-network decisions. However, this scheme does not consider any information derived from the minority models. Formally, for a task with C classes and M different models, the prediction for a specific instance is estimated as follows:

v_c = Σ_{i=1}^{M} F_i(c),
p = argmax_{c ∈ {1,...,C}} v_c,

where v_c denotes the votes for class c from all the different models, F_i is the decision of the i-th model, which is either 1 or 0 with respect to whether the model has classified the instance in class c or not, and p is the final prediction.
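The two ensembling schemes can be sketched in NumPy as follows, with `probs[m]` holding the softmax outputs of model m for a batch of instances:

```python
# Sketch of the two ensemble schemes: unweighted averaging of posteriors,
# and majority voting over per-model argmax decisions.
import numpy as np

def unweighted_average(probs):
    """Average the posterior probabilities of all models, then take argmax."""
    return np.mean(probs, axis=0).argmax(axis=-1)

def majority_voting(probs):
    """Each model casts one vote (its argmax); the most-voted class wins."""
    votes = np.argmax(probs, axis=-1)                       # (M, n_instances)
    n_classes = probs.shape[-1]
    counts = np.stack([(votes == c).sum(axis=0)             # votes v_c per class
                       for c in range(n_classes)])
    return counts.argmax(axis=0)

# Two models (M=2), one instance, three classes (C=3):
probs = np.array([[[0.6, 0.3, 0.1]],
                  [[0.2, 0.5, 0.3]]])
```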

Experimental Setup
Class Weights. In order to deal with the class imbalance in Subtask B, we apply class weights to the loss function of our models, penalizing the misclassification of underrepresented classes more heavily. We weight each class by its inverse frequency in the training set.
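Inverse-frequency class weighting can be sketched as follows; the exact scaling the paper used is not stated, so this is one common formulation:

```python
# Sketch of inverse-frequency class weights: a class appearing in a fraction
# f of the training set gets weight 1/f, so rarer classes weigh more.
from collections import Counter

def class_weights(labels):
    counts = Counter(labels)
    n = len(labels)
    return {c: n / counts[c] for c in counts}   # 1 / relative frequency

w = class_weights([0, 0, 0, 1])   # class 1 is three times rarer than class 0
```

In PyTorch, such weights can be passed as a tensor to the `weight` argument of `nn.CrossEntropyLoss`.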
Training. We use the Adam algorithm (Kingma and Ba, 2014) to optimize our networks, with mini-batches of size 32, and we clip the norm of the gradients (Pascanu et al., 2013) at 1, as an extra safety measure against exploding gradients. For developing our models, we used PyTorch (Paszke et al., 2017) and Scikit-learn (Pedregosa et al., 2011).
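One optimization step with this setup might look like the following sketch; the linear model is a stand-in for the actual networks.

```python
# Sketch of one training step: Adam optimizer, mini-batch of 32, and
# gradient norm clipping at 1 before the parameter update.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)                       # stand-in for the BiLSTM model
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))  # one mini-batch
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```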
Hyper-parameters. In order to find good hyper-parameter values in a relatively short time (compared to grid or random search), we adopt the Bayesian optimization (Bergstra et al., 2013) approach, performing a "smart" search in the high-dimensional space of all possible values. Table 2 shows the selected hyper-parameters.

Results and Discussion
Our official ranking is 2/43 in Subtask A and 2/29 in Subtask B, as shown in Tables 3 and 4. Based on these rankings, the performance of the suggested model is competitive in both the binary and the multi-class classification problem. Besides its overall good performance, it also presents a stable behavior when moving from two to four classes.

Additional experimentation following the official submission significantly improved the performance of our models. The results of this experimentation, tested on the same dataset, are shown in Tables 5 and 6. The first baseline is a Bag of Words (BOW) model with TF-IDF weighting. The second baseline is a Neural Bag of Words (N-BOW) model, where we retrieve the word2vec representations of the words in a tweet and compute the tweet representation as the centroid of the constituent word2vec representations. Both BOW and N-BOW features are then fed to a linear SVM classifier, with tuned C = 0.6.

The best performance that we achieve, as shown in Tables 5 and 6, is 0.7856 and 0.5358 for Subtasks A and B respectively. In Subtask A, the BOW and N-BOW models perform similarly with respect to the F1 metric, and the word-level LSTM is the most competitive individual model. However, the best performance is achieved when the character- and the word-level LSTM models are combined via the unweighted average ensembling method, showing that the two suggested models indeed contain different types of information related to irony in tweets. Similar observations hold for Subtask B, except that the character-level model in this case performs worse than the baseline models and contributes less to the final results.

Attention Visualizations
Our models' behavior can be interpreted by visualizing the distribution of the attention weights assigned to the words (characters) of a tweet. The weights signify the contribution of each word (character) to the model's final classification decision. In Fig. 5, we present examples of the weights assigned by the word-level model to ironic tweets. The salient keywords that capture the essence of irony, or even polarity transitions (e.g. irony by clash), are correctly identified by the model. Moreover, in Fig. 6 we compare the behavior of the word- and character-level models on the same tweets. In the first example, the character-level model assigns larger weights to the most discriminative words, whereas the weights assigned by the word-level model seem uniform and insufficient in spotting the polarity transition. However, in the second example, the character-level model does not attribute any weight to the words with positive polarity (e.g. "fun"), in contrast to the word-level model. Based on these observations, the two models indeed behave diversely, and consequently both contribute to the final outcome (see Section 3.6).

Conclusion
In this paper we present an ensemble of two different deep-learning models: a word- and a character-level deep LSTM, for capturing the semantic and syntactic information of tweets, respectively. We demonstrated that combining the predictions of the two models yields competitive results in both subtasks for irony prediction. Moreover, we showed that both types of information (semantic and syntactic) contribute to the final results, with the word-level model individually achieving more accurate irony predictions. Also, the best way of combining the outcomes of the separate models is by conducting unweighted averaging over the respective posteriors. Finally, the proposed model successfully predicts irony in tweets without exploiting any external information derived from hand-crafted features or lexicons. The performance reported in this paper could be further boosted by utilizing transfer learning methods from larger datasets. Moreover, the joint training of word- and character-level models could be tested for further improvement of the results. Finally, we make the source code of our models and our pretrained word embeddings available to the community, in order to make our results easily reproducible and to facilitate further experimentation.