Reasoning with Sarcasm by Reading In-between

Sarcasm is a sophisticated speech act which commonly manifests on social communities such as Twitter and Reddit. The prevalence of sarcasm on the social web is highly disruptive to opinion mining systems due to not only its tendency of polarity flipping but also usage of figurative language. Sarcasm commonly manifests with a contrastive theme either between positive-negative sentiments or between literal-figurative scenarios. In this paper, we revisit the notion of modeling contrast in order to reason with sarcasm. More specifically, we propose an attention-based neural model that looks in-between instead of across, enabling it to explicitly model contrast and incongruity. We conduct extensive experiments on six benchmark datasets from Twitter, Reddit and the Internet Argument Corpus. Our proposed model not only achieves state-of-the-art performance on all datasets but also enjoys improved interpretability.


Introduction
Sarcasm, commonly defined as 'An ironical taunt used to express contempt', is a challenging NLP problem due to its highly figurative nature. The usage of sarcasm on the social web is prevalent and can be frequently observed in reviews, microblogs (tweets) and online forums. As such, the battle against sarcasm is also regularly cited as one of the key challenges in sentiment analysis and opinion mining applications (Pang et al., 2008). Hence, it is both imperative and intuitive that effective sarcasm detectors can bring about numerous benefits to opinion mining applications.
Sarcasm is often associated with several linguistic phenomena such as (1) an explicit contrast between sentiments or (2) disparity between the conveyed emotion and the author's situation (context). Prior work has considered sarcasm to be a contrast between a positive and negative sentiment (Riloff et al., 2013). Consider the following examples:
1. I absolutely love to be ignored!
2. Yay!!! The best thing to wake up to is my neighbor's drilling.
3. Perfect movie for people who can't fall asleep.
Given the examples, we make a crucial observation: sarcasm relies heavily on the semantic relationships (and contrast) between individual words and phrases in a sentence. For instance, the relationships between the word pairs {love, ignored}, {best, drilling} and {movie, asleep} (in the examples above) richly characterize the nature of the sarcasm conveyed, i.e., word pairs tend to be contradictory and, more often than not, express a juxtaposition of positive and negative terms. This concept is also explored in (Joshi et al., 2015), in which the authors refer to this phenomenon as 'incongruity'. Hence, it would be useful to capture the relationships between selected word pairs in a sentence, i.e., looking in-between.
State-of-the-art sarcasm detection systems mainly rely on deep and sequential neural networks (Ghosh and Veale, 2016; Zhang et al., 2016). In these works, compositional encoders such as gated recurrent units (GRU) or long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) are often employed, with the input document being parsed one word at a time. This has several shortcomings for the sarcasm detection task. Firstly, there is no explicit interaction between word pairs, which hampers the ability of these models to explicitly model contrast, incongruity or juxtaposition of situations. Secondly, it is difficult to capture long-range dependencies. As a result, contrastive situations (or sentiments), which are commonplace in sarcastic language, may be hard to detect with simple sequential models.
To overcome the weaknesses of standard sequential models such as recurrent neural networks, our work is based on the intuition that modeling intra-sentence relationships can not only improve classification performance but also pave the way for more explainable neural sarcasm detection methods. In other words, our key intuition manifests itself in the form of an attention-based neural network. While the key idea of most neural attention mechanisms is to focus on relevant words and sub-phrases, they merely look across and do not explicitly capture word-word relationships. Hence, they suffer from the same shortcomings as sequential models.
In this paper, our aim is to combine the effectiveness of state-of-the-art recurrent models while harnessing the intuition of looking in-between. We propose a multi-dimensional intra-attention recurrent network that models intricate similarities between each word pair in the sentence. In other words, our novel deep learning model aims to capture 'contrast' (Riloff et al., 2013) and 'incongruity' (Joshi et al., 2015) within end-to-end neural networks. Our model can be thought of as self-targeted co-attention (Xiong et al., 2016), which allows it to capture not only word-word relationships but also long-range dependencies. Finally, we show that our model produces interpretable attention maps which aid in the explainability of model outputs. To the best of our knowledge, our model is the first attention model that can produce explainable results in the sarcasm detection task.
Briefly, the prime contributions of this work can be summarized as follows:
• We propose a new state-of-the-art method for sarcasm detection. Our proposed model, the Multi-dimensional Intra-Attention Recurrent Network (MIARN), is strongly based on the intuition of compositional learning by leveraging intra-sentence relationships. To the best of our knowledge, none of the existing state-of-the-art models exploits intra-sentence relationships; they rely solely on sequential composition.
• We conduct extensive experiments on multiple benchmarks from Twitter, Reddit and the Internet Argument Corpus. Our proposed MIARN achieves highly competitive performance on all benchmarks, outperforming existing state-of-the-art models such as GRNN (Zhang et al., 2016) and CNN-LSTM-DNN (Ghosh and Veale, 2016).

Related Work
Sarcasm is a complex linguistic phenomenon that has long fascinated both linguists and NLP researchers. After all, a better computational understanding of this complicated speech act could potentially bring about numerous benefits to existing opinion mining applications. Across the rich history of research on sarcasm, several theories such as the Situational Disparity Theory (Wilson, 2006) and the Negation Theory (Giora, 1995) have emerged. In these theories, a common theme is a motif that is strongly grounded in contrast, whether in sentiment, intention, situation or context. (Riloff et al., 2013) carries this premise forward, presenting an algorithm strongly based on the intuition that sarcasm arises from a juxtaposition of positive and negative situations.

Sarcasm Detection
Naturally, many works in this area have treated the sarcasm detection task as a standard text classification problem. A comprehensive overview can be found in (Joshi et al., 2017). Feature engineering approaches were highly popular, exploiting a wide range of features such as syntactic patterns (Tsur et al., 2010), sentiment lexicons (González-Ibánez et al., 2011), n-grams (Reyes et al., 2013), word frequency (Barbieri et al., 2014), word shape and pointedness features (Ptáček et al., 2014), readability and flips (Rajadesingan et al., 2015), etc. Notably, a reasonable number of works propose features based on similarity and contrast. (Hernández-Farías et al., 2015) measured the WordNet-based semantic similarity between words. (Joshi et al., 2015) proposed a framework based on explicit and implicit incongruity, utilizing features based on positive-negative patterns. (Joshi et al., 2016) proposed similarity features based on word embeddings.

Deep Learning for Sarcasm Detection
Deep learning based methods have recently garnered considerable interest in many areas of NLP research. In our problem domain, (Zhang et al., 2016) proposed a recurrent-based model with a gated pooling mechanism for sarcasm detection on Twitter. (Ghosh and Veale, 2016) proposed a convolutional long-short-term memory network (CNN-LSTM-DNN) that achieves state-of-the-art performance. While our work focuses on document-only sarcasm detection, several notable works have proposed models that exploit personality information (Ghosh and Veale, 2017) and user context (Amir et al., 2016). Novel methods for sarcasm detection such as gaze / cognitive features (Mishra et al., 2016, 2017) have also been explored. (Peled and Reichart, 2017) proposed a novel framework based on neural machine translation to convert a sequence from sarcastic to non-sarcastic. (Felbo et al., 2017) proposed a layer-wise training scheme that utilizes emoji-based distant supervision for sentiment analysis and sarcasm detection tasks.

Attention Models for NLP
In the context of NLP, the key idea of neural attention is to soft select a sequence of words based on their relative importance to the task at hand. Early innovations in attentional paradigms mainly involve neural machine translation (Luong et al., 2015) for aligning sequence pairs. Attention is also commonplace in many NLP applications such as sentiment classification (Chen et al., 2016; Yang et al., 2016), aspect-level sentiment analysis (Tay et al., 2018, 2017b; Chen et al., 2017) and entailment classification (Rocktäschel et al., 2015). Co-attention / Bi-Attention (Xiong et al., 2016; Seo et al., 2016) is a form of pairwise attention mechanism that was proposed to model query-document pairs. Intra-attention can be interpreted as a self-targeted co-attention and has shown promising results in many recent works (Vaswani et al., 2017; Parikh et al., 2016; Tay et al., 2017a; Shen et al., 2017).
The key idea is to model a sequence against itself, learning to attend while capturing long-term dependencies and word-word level interactions. To the best of our knowledge, our work is not only the first to apply intra-attention to sarcasm detection but also the first attention model for sarcasm detection.

Our Proposed Approach
In this section, we describe our proposed model. Figure 1 illustrates our overall model architecture.

Input Encoding Layer
Our model accepts a sequence of one-hot encoded vectors as an input. Each one-hot encoded vector corresponds to a single word in the vocabulary. In the input encoding layer, each one-hot vector is converted into a low-dimensional vector representation (word embedding). The word embeddings are parameterized by an embedding layer $W \in \mathbb{R}^{n \times |V|}$, where $|V|$ is the vocabulary size and $n$ is the embedding dimension. As such, the output of this layer is a sequence of word embeddings, i.e., $\{w_1, w_2, \cdots, w_\ell\}$, where $\ell$ is a predefined maximum sequence length.
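The lookup performed by the input encoding layer can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the toy vocabulary, the 3-dimensional vectors, the `<pad>` and `<unk>` entries, and the helper name `embed` are all assumptions made for the example. The table is stored with one row per word, i.e., as the transpose of $W$.

```python
# Toy embedding table with one row per vocabulary entry (illustrative values).
VOCAB = {"<pad>": 0, "<unk>": 1, "love": 2, "ignored": 3}
EMBEDDINGS = [
    [0.0, 0.0, 0.0],    # <pad>
    [0.1, 0.1, 0.1],    # <unk>
    [0.9, 0.2, 0.0],    # love
    [-0.7, 0.1, 0.3],   # ignored
]


def embed(tokens, max_len):
    """Map tokens to embedding vectors, truncating or padding to max_len."""
    ids = [VOCAB.get(tok, VOCAB["<unk>"]) for tok in tokens[:max_len]]
    ids += [VOCAB["<pad>"]] * (max_len - len(ids))
    # Multiplying a one-hot vector by the embedding matrix is the same as
    # selecting the entry that belongs to that word.
    return [EMBEDDINGS[i] for i in ids]


sequence = embed(["love", "being", "ignored"], max_len=4)
```

Here `sequence` holds four vectors: the unseen word "being" falls back to the `<unk>` vector and the last position is padding.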

Multi-dimensional Intra-Attention
In this section, we describe our multi-dimensional intra-attention mechanism for sarcasm detection. We begin with the standard single-dimensional intra-attention; the multi-dimensional adaptation is introduced later in this section. The key idea behind this layer is to look in-between, i.e., to model the semantics between each word in the input sequence. We first model the relationship of each word pair in the input sequence. A simple way to achieve this is to use a linear transformation layer to project the concatenation of each word embedding pair into a scalar score as follows:
$$s_{ij} = W_a([w_i; w_j]) + b_a \tag{1}$$
where $W_a \in \mathbb{R}^{2n \times 1}$ and $b_a \in \mathbb{R}$ are the parameters of this layer, $[\cdot\,;\cdot]$ is the vector concatenation operator and $s_{ij}$ is a scalar representing the affinity score between the word pair $(w_i, w_j)$. Collecting the scores of all pairs yields a matrix $s$ of $\ell \times \ell$ dimensions. In order to learn the attention vector $a$, we apply a row-wise max-pooling operator on matrix $s$:
$$a = \mathrm{softmax}(\max_{\mathrm{row}} s) \tag{2}$$
where $a \in \mathbb{R}^{\ell}$ is a vector representing the learned intra-attention weights. Then, the vector $a$ is employed to learn a weighted representation of $\{w_1, w_2, \cdots, w_\ell\}$ as follows:
$$v_a = \sum_{i=1}^{\ell} w_i a_i \tag{3}$$
where $v_a \in \mathbb{R}^n$ is the intra-attentive representation of the input sequence. While other choices of pooling operators may also be employed (e.g., mean-pooling instead of max-pooling), the choice of max-pooling is empirically motivated. Intuitively, this attention layer learns to pay attention based on a word's largest contribution to all words in the sequence. Since our objective is to highlight words that might contribute to the contrastive theories of sarcasm, a more discriminative pooling operator is desirable. Notably, we also mask the values of $s$ where $i = j$, so that the relationship score of a word with respect to itself does not influence the overall attention weights.
Furthermore, our network can be considered an 'inner' adaptation of neural attention, modeling intra-sentence relationships between the raw word representations instead of representations that have been compositionally manipulated. This allows word-to-word similarity to be modeled 'as it is' and not be influenced by composition. For example, when using the outputs of a compositional encoder (e.g., LSTM), matching the words at adjacent positions might not be meaningful since their representations would be relatively similar in terms of semantic composition. For relatively short documents (such as tweets), it is also intuitive that attention over such outputs typically focuses on the last hidden representation.
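The single-dimensional mechanism described above (pairwise scoring of concatenated embeddings, masking of self-pairs, row-wise max-pooling and a weighted sum) can be sketched in plain Python. This is an illustrative sketch rather than the authors' code: the function names and the toy numbers are assumptions, and the pooled scores are normalised with a softmax, as is conventional for attention weights.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def intra_attention(words, w_a, b_a):
    """Single-dimensional intra-attention over a list of n-dim word vectors.

    w_a has length 2n because it scores the concatenation [w_i; w_j];
    b_a is a scalar bias. Needs at least two words, since self-pairs are masked.
    Returns the attention weights and the pooled representation.
    """
    length, dim = len(words), len(words[0])
    # Affinity score for every ordered word pair; the diagonal is masked out.
    scores = [
        [
            -math.inf if i == j
            else sum(p * x for p, x in zip(w_a, words[i] + words[j])) + b_a
            for j in range(length)
        ]
        for i in range(length)
    ]
    pooled = [max(row) for row in scores]  # row-wise max-pooling
    a = softmax(pooled)                    # attention weights, sum to 1
    # Weighted sum of the raw word embeddings.
    v_a = [sum(a[i] * words[i][d] for i in range(length)) for d in range(dim)]
    return a, v_a


words = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
a, v_a = intra_attention(words, w_a=[0.5, -0.2, 0.3, 0.8], b_a=0.1)
```

With these toy values the first word has the largest pooled score, so it receives the largest attention weight.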
Intuitively, the relationship between two words is often not straightforward. Words are complex and often hold more than one meaning (or word sense). As such, it might be beneficial to model multiple views between two words. This can be modeled by representing the word pair interaction with a vector instead of a scalar. As such, we propose a multi-dimensional adaptation of the intra-attention mechanism. The key idea here is that each word pair is projected down to a low-dimensional vector before we compute the affinity score, which allows it to capture not only one view (one scalar) but also multiple views. A modification to Equation (1) constitutes our Multi-Dimensional Intra-Attention variant:
$$s_{ij} = W_p(\mathrm{ReLU}(W_q([w_i; w_j]) + b_q)) + b_p$$
where $W_q \in \mathbb{R}^{2n \times k}$, $b_q \in \mathbb{R}^{k}$, $W_p \in \mathbb{R}^{k \times 1}$ and $b_p \in \mathbb{R}$ are the parameters of this layer, and $k$ is the size of the low-dimensional pair representation. The final intra-attentive representation is then learned with Equation (2) and Equation (3).
Figure 1: High level overview of our proposed MIARN architecture. MIARN learns two representations, one based on intra-sentence relationships (intra-attentive) and another based on sequential composition (LSTM). Both views are used for prediction.
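As a rough sketch of the multi-dimensional score for a single word pair: the pair is first projected to a small vector with k entries, and that vector is then reduced to one scalar. The code below is illustrative only. It places a ReLU between the two projections, and the parameter names, the row-wise storage of the first projection and the toy values are choices made for this sketch.

```python
def multi_dim_score(w_i, w_j, W_q, b_q, W_p, b_p):
    """Affinity score for one word pair via a k-dimensional hidden vector.

    W_q is stored as k rows of length 2n, b_q has length k,
    W_p has length k and b_p is a scalar.
    """
    pair = w_i + w_j  # concatenation [w_i; w_j], length 2n
    # Project the pair to k dimensions and apply a ReLU.
    hidden = [
        max(0.0, sum(p * x for p, x in zip(row, pair)) + b)
        for row, b in zip(W_q, b_q)
    ]
    # Collapse the k views into a single scalar score.
    return sum(p * h for p, h in zip(W_p, hidden)) + b_p


score = multi_dim_score(
    w_i=[1.0, 2.0],
    w_j=[3.0, 0.5],
    W_q=[[1.0, 0.0, 0.0, -1.0], [0.0, 1.0, 1.0, 0.0]],  # k = 2 views
    b_q=[0.0, 0.0],
    W_p=[0.5, -1.0],
    b_p=0.2,
)
```

The resulting scalar takes the place of the single-dimensional score; the max-pooling and weighted sum of Equations (2) and (3) are unchanged.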

Long Short-Term Memory Encoder
While we could simply use the learned intra-attentive representation for prediction, it does not encode compositional information and may miss important compositional phrases such as 'not happy'. Clearly, our intra-attention mechanism simply considers word-by-word interactions and does not model the input document sequentially. As such, it is beneficial to use a separate compositional encoder for this purpose, i.e., for learning compositional representations. To this end, we employ the standard Long Short-Term Memory (LSTM) encoder. The output of an LSTM encoder at each time-step can be briefly defined as:
$$h_i = \mathrm{LSTM}(w, i), \quad \forall i \in [1, \ldots, \ell]$$
where $\ell$ represents the maximum length of the sequence and $h_i \in \mathbb{R}^d$ is the hidden output of the LSTM encoder at time-step $i$. $d$ is the size of the hidden units of the LSTM encoder. LSTM encoders are parameterized by gating mechanisms learned via nonlinear transformations. Since LSTMs are commonplace in standard NLP applications, we omit the technical details for the sake of brevity. Finally, to obtain a compositional representation of the input document, we use $v_c = h_\ell$, which is the last hidden output of the LSTM encoder. Note that the inputs to the LSTM encoder are the word embeddings right after the input encoding layer and not the output of the intra-attention layer. We found that applying an LSTM on the intra-attentively scaled representations does not yield any benefits.

Prediction Layer
The inputs to the final prediction layer are two representations, namely (1) the intra-attentive representation ($v_a \in \mathbb{R}^n$) and (2) the compositional representation ($v_c \in \mathbb{R}^d$). This layer learns a joint representation of these two views using a nonlinear projection layer:
$$v = \mathrm{ReLU}(W_z([v_a; v_c]) + b_z)$$
where $W_z \in \mathbb{R}^{(d+n) \times d}$ and $b_z \in \mathbb{R}^d$. Finally, we pass $v$ into a Softmax classification layer:
$$\hat{y} = \mathrm{Softmax}(W_f v + b_f)$$
where $W_f \in \mathbb{R}^{d \times 2}$ and $b_f \in \mathbb{R}^2$ are the parameters of this layer. $\hat{y} \in \mathbb{R}^2$ is the output layer of our proposed model.
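The prediction layer can be sketched as two small dense layers: one that fuses the intra-attentive and compositional vectors, and one that maps the fused vector to two class probabilities. This is a hedged illustration in plain Python. It uses a ReLU as the nonlinearity, and the function names, the row-wise weight layout and the toy values are not taken from the authors' code.

```python
import math


def dense(rows, bias, x):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(rows, bias)]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def predict(v_a, v_c, W_z, b_z, W_f, b_f):
    """Fuse the two views and return probabilities for the two classes."""
    joint = v_a + v_c                                   # [v_a; v_c], length n + d
    v = [max(0.0, z) for z in dense(W_z, b_z, joint)]   # nonlinear projection
    return softmax(dense(W_f, b_f, v))                  # two-way softmax


probs = predict(
    v_a=[0.5, -0.5],
    v_c=[1.0, 0.0],
    W_z=[[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]],
    b_z=[0.0, 0.0],
    W_f=[[1.0, 0.0], [-1.0, 0.0]],
    b_f=[0.0, 0.0],
)
```

In this toy setting the fused vector is `[1.5, 0.0]`, so the two logits are `1.5` and `-1.5`.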

Optimization and Learning
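Our network is trained end-to-end, optimizing the standard binary cross-entropy loss function:
$$J = -\sum_{i} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right] + \lambda R$$
where $J$ is the cost function, $y_i$ is the ground-truth label of the $i$-th training example, $\hat{y}_i$ is the corresponding output of the network, $R = \|\theta\|_{L2}$ is the L2 regularization and $\lambda$ is the weight of the regularizer.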
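A minimal sketch of this objective, assuming the network output is read as the predicted probability of the sarcastic class and the regulariser is the (unsquared) L2 norm of the parameters as written above. The function name and the clipping constant are illustrative; many frameworks use the squared norm instead.

```python
import math


def bce_with_l2(y_true, y_prob, params, lam, eps=1e-12):
    """Summed binary cross-entropy plus lam times the L2 norm of the parameters."""
    ce = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        ce -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    l2_norm = math.sqrt(sum(t * t for t in params))
    return ce + lam * l2_norm


loss = bce_with_l2(y_true=[1, 0], y_prob=[0.5, 0.5], params=[3.0, 4.0], lam=0.1)
```

For two maximally uncertain predictions the cross-entropy term equals `2 * log(2)`, and the regulariser contributes `0.1 * 5.0`.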

Empirical Evaluation
In this section, we describe our experimental setup and results. Our experiments were designed to answer the following research questions (RQs).
• RQ1 - Does our proposed approach outperform existing state-of-the-art models?
• RQ2 - What are the impacts of some of the architectural choices of our model? How much does intra-attention contribute to the model performance? Is the Multi-Dimensional adaptation better than the Single-Dimensional adaptation?
• RQ3 - What can we interpret from the intra-attention layers? Does this align with our hypothesis about looking in-between and modeling contrast?

Datasets
We conduct our experiments on six publicly available benchmark datasets which span across three well-known sources.
• Tweets - Twitter is a microblogging platform which allows users to post statuses of less than 140 characters. We use two collections for sarcasm detection on tweets. More specifically, we use the datasets obtained from (1) (Ptáček et al., 2014), in which tweets are labeled via hashtag-based semi-supervised learning, i.e., tweets with hashtags such as #not, #sarcasm and #irony are marked as sarcastic, and (2) (Riloff et al., 2013), in which tweets are hand annotated and manually checked for sarcasm. For both datasets, we retrieve the tweets via the Twitter API using the provided tweet IDs.
• Reddit - Reddit is a highly popular social forum and community. Similar to Tweets, sarcastic posts are obtained via the tag '/s', which is added by the authors themselves. We use two Reddit datasets which are obtained from the subreddits /r/movies and /r/technology respectively. Both datasets are subsets of (Khodak et al., 2017).
• Debates - We use two datasets from the Internet Argument Corpus (IAC) (Lukin and Walker, 2017) which have been hand annotated for sarcasm. This source, unlike the first two, is mainly concerned with long text and provides a diverse comparison to the other datasets. The IAC corpus was designed for research on political debates on online forums. We use the V1 and V2 versions of the sarcasm corpus, which are denoted as IAC-V1 and IAC-V2 respectively.
The statistics of the datasets used in our experiments are reported in Table 1.
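The marker-based labeling used for the hashtag-labeled Tweets collection and for the Reddit datasets can be illustrated with a short sketch. The helper names are hypothetical, whitespace tokenization is a simplification, and removing the marker from the text after labeling is an assumption of this example rather than something stated above.

```python
SARCASM_TAGS = {"#sarcasm", "#irony", "#not"}


def label_tweet(text):
    """Return (text without marker hashtags, 1 if any marker was present else 0)."""
    tokens = text.split()
    is_sarcastic = any(tok.lower() in SARCASM_TAGS for tok in tokens)
    cleaned = " ".join(tok for tok in tokens if tok.lower() not in SARCASM_TAGS)
    return cleaned, int(is_sarcastic)


def label_reddit(text):
    """Return (text without a trailing '/s' tag, 1 if the tag was present else 0)."""
    stripped = text.rstrip()
    if stripped == "/s" or stripped.endswith(" /s"):
        return stripped[:-2].rstrip(), 1
    return stripped, 0


tweet_example = label_tweet("I love Mondays #sarcasm")
reddit_example = label_reddit("Great update, nothing broke /s")
```

Both examples come back with label 1 and with the marker removed from the text.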

Compared Methods
We compare our proposed model with the following algorithms.
• NBOW is a simple neural bag-of-words baseline that sums all the word embeddings and passes the summed vector into a simple logistic regression layer.
• CNN is a vanilla Convolutional Neural Network with max-pooling. CNNs are considered as compositional encoders that capture n-gram features by parameterized sliding windows. The filter width is 3 and number of filters f = 100.
• LSTM is a vanilla Long Short-Term Memory Network. The size of the LSTM cell is set to d = 100.
• ATT-LSTM (Attention-based LSTM) is an LSTM model with a neural attention mechanism applied to all the LSTM hidden outputs. We use a similar adaptation to (Yang et al., 2016), albeit only at the document-level.
• GRNN (Gated Recurrent Neural Network) is a Bidirectional Gated Recurrent Unit (GRU) model that was proposed for sarcasm detection by (Zhang et al., 2016). GRNN uses a gated pooling mechanism to aggregate the hidden representations from a standard BiGRU model. Since we only compare on document-level sarcasm detection, we do not use the variant of GRNN that exploits user context.
• CNN-LSTM-DNN (Convolutional LSTM + Deep Neural Network), proposed by (Ghosh and Veale, 2016), is the state-of-the-art model for sarcasm detection. This model is a combination of a CNN, LSTM and Deep Neural Network via stacking. It stacks two layers of 1D convolution with 2 LSTM layers. The output passes through a deep neural network (DNN) for prediction.
Both CNN-LSTM-DNN (Ghosh and Veale, 2016) and GRNN (Zhang et al., 2016) are state-of-the-art models for document-level sarcasm detection and have outperformed numerous neural and non-neural baselines. In particular, both works have clearly surpassed feature-based models (Support Vector Machines, etc.); as such, we omit those comparisons for the sake of brevity and focus on comparisons with recent neural models instead. Moreover, since our work focuses only on document-level sarcasm detection, we do not compare against models that use external information such as user profiles, context, personality information (Ghosh and Veale, 2017) or emoji-based distant supervision (Felbo et al., 2017). For our model, we report results on both multi-dimensional and single-dimensional intra-attention. The two models are named MIARN and SIARN respectively.
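To make the simplest baseline in the list above concrete, the NBOW model (sum the word embeddings, then apply logistic regression) can be sketched in a few lines. The function name and toy values are illustrative assumptions, not the evaluated implementation.

```python
import math


def nbow_predict(word_vectors, weights, bias):
    """Sum the word embeddings and apply a logistic regression layer."""
    dim = len(weights)
    summed = [sum(vec[d] for vec in word_vectors) for d in range(dim)]
    logit = sum(w * x for w, x in zip(weights, summed)) + bias
    return 1.0 / (1.0 + math.exp(-logit))  # probability of the sarcastic class


p_sarcastic = nbow_predict(
    word_vectors=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    weights=[0.5, -0.5],
    bias=0.0,
)
```

Because the sum discards word order, reordering `word_vectors` leaves the prediction unchanged, which is exactly the limitation the sequential and intra-attentive models address.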

Implementation Details and Metrics
We adopt the standard evaluation metrics for the sarcasm detection task, i.e., macro-averaged F1 and accuracy score. Additionally, we also report precision and recall scores. All deep learning models are implemented using TensorFlow (Abadi et al., 2015) and optimized on an NVIDIA GTX1070 GPU.
Text is preprocessed with NLTK's Tweet tokenizer. Words that appear only once in the entire corpus are replaced with the UNK token. Documents are truncated at 40, 20 and 80 tokens for the Twitter, Reddit and Debates datasets respectively. Mentions of other users in the Twitter datasets are replaced by '@USER'. Documents with URLs (i.e., containing 'http') are removed from the corpus. Documents with fewer than 5 tokens are also removed.
The optimizer used is RMSProp with an initial learning rate of 0.001. The L2 regularization is set to $10^{-8}$. We initialize the word embedding layer with GloVe (Pennington et al., 2014). We use the GloVe model trained on 2B Tweets for the Tweets and Reddit datasets. The GloVe model trained on Common Crawl is used for the Debates corpus. The size of the word embeddings is fixed at d = 100 and the embeddings are fine-tuned during training. In all experiments, we use a development set to select the best hyperparameters. Each model is trained for a total of 30 epochs and the model is saved each time the performance on the development set is topped. The batch size is tuned amongst {128, 256, 512} for all datasets. The only exception is the Tweets dataset from (Riloff et al., 2013), for which a batch size of 16 is used in view of the much smaller dataset size. For fair comparison, all models have the same hidden representation size, which is set to 100 for both recurrent and convolutional based models (i.e., number of filters). For MIARN, the size of the intra-attention hidden representation is tuned amongst {4, 8, 10, 20}.
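The preprocessing rules listed above can be sketched as a small pipeline over already-tokenized documents. This is an illustration under stated assumptions: the order of the steps, the rule that any token starting with '@' counts as a mention, and the names `MAX_LEN` and `preprocess` are choices made for this example, and tokenization itself is left out.

```python
from collections import Counter

# Truncation lengths per source, as reported in the text.
MAX_LEN = {"twitter": 40, "reddit": 20, "debates": 80}


def preprocess(docs, max_len, min_len=5):
    """Filter, normalise and truncate a list of token lists."""
    kept = []
    for tokens in docs:
        if any("http" in tok for tok in tokens):  # drop documents with URLs
            continue
        if len(tokens) < min_len:                 # drop very short documents
            continue
        # Replace user mentions with a single placeholder token.
        tokens = [
            "@USER" if tok.startswith("@") and len(tok) > 1 else tok
            for tok in tokens
        ]
        kept.append(tokens)
    # Words seen only once in the remaining corpus become UNK.
    counts = Counter(tok for tokens in kept for tok in tokens)
    return [
        [tok if counts[tok] > 1 else "UNK" for tok in tokens][:max_len]
        for tokens in kept
    ]


docs = [
    ["@bob", "i", "love", "being", "ignored", "!"],
    ["i", "love", "this", "link", "http://t.co/x", "!"],
    ["so", "fun"],
    ["@amy", "i", "love", "mondays", "so", "much", "!"],
]
cleaned = preprocess(docs, max_len=6)
```

Only the first and last documents survive the filters; the second contains a URL and the third has fewer than five tokens.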

Experimental Results
Table 2, Table 3 and Table 4 report a performance comparison of all benchmarked models on the Tweets, Reddit and Debates datasets respectively. We observe that our proposed SIARN and MIARN models achieve the best results across all six datasets. The relative improvement differs across domains and datasets. On the Tweets dataset from (Ptáček et al., 2014), MIARN achieves an improvement of ≈ 2% to 2.2% in terms of F1 and accuracy score when compared against the best baseline. On the other Tweets dataset from (Riloff et al., 2013), the performance gain of our proposed model is larger, i.e., a 3% to 5% improvement on average over most baselines. Our proposed SIARN and MIARN models achieve very competitive performance on the Reddit datasets, with an average improvement margin of ≈ 2% over the best baselines. Notably, the baselines we compare against are extremely competitive state-of-the-art neural network models. This further reinforces the effectiveness of our proposed approach. Additionally, the performance improvement on Debates (long text) is significantly larger than on short text (i.e., Twitter and Reddit). For example, MIARN outperforms GRNN and CNN-LSTM-DNN by ≈ 8% to 10% on both IAC-V1 and IAC-V2. On this note, we can safely put RQ1 to rest.
Overall, the performance of MIARN is often marginally better than that of SIARN (with some exceptions, e.g., the Tweets dataset from (Riloff et al., 2013)). We believe that this can be attributed to the fact that more complex word-word relationships can be learned by using multi-dimensional values instead of single-dimensional scalars. The gain brought by our additional intra-attentive representations can be further observed by comparing against the vanilla LSTM model, since removing the intra-attention network reverts our model to the standard LSTM. The performance improvements are encouraging, reaching almost 10% in terms of F1 and accuracy. On datasets with short text, the performance improvement is often a modest ≈ 2% to 3% (RQ2). Notably, our proposed models also perform much better on long text, which can be attributed to the intra-attentive representations explicitly modeling long-range dependencies. Intuitively, long-range dependencies are problematic for models that only capture sequential dependencies (e.g., word by word).
Finally, the relative performance of the competitor methods is as expected. NBOW performs the worst, since it is just a naive bag-of-words model without any compositional or sequential information. On short text, LSTMs are overall better than CNNs. However, this trend is reversed on long text (i.e., Debates), since the LSTM model may be overburdened by overly long sequences. On short text, we also found that attention (or the gated pooling mechanism from GRNN) did not yield any significant improvement over the vanilla LSTM model; a qualitative explanation of why this is so is deferred to the next section. However, attention helps for long text (such as debates), resulting in Attention LSTMs becoming the strongest baseline on the Debates datasets. In contrast, our proposed intra-attentive model is effective on both short text and long text, outperforming Attention LSTMs consistently on all datasets.

In-depth Model Analysis
In this section, we present an in-depth analysis of our proposed model. We aim not only to showcase the interpretability of our model but also to explain how representations are formed. More specifically, we test our model (trained on the Tweets dataset by (Ptáček et al., 2014)) on two examples. We extract the attention maps of three models, namely MIARN, Attention LSTM (ATT-LSTM) and a variant that applies the attention mechanism directly on the word embeddings without using an LSTM encoder (ATT-RAW).
In the first example (true label), we notice that the attention maps of MIARN focus on the words 'love' and 'ignored'. This is in concert with our intuition about modeling contrast and incongruity. On the other hand, both ATT-LSTM and ATT-RAW learn very different attention maps. As for ATT-LSTM, the attention weight is focused completely on the last representation, the token '!!'. Additionally, we observed that this is true for many examples in the Tweets and Reddit datasets. We believe that this is the reason why standard neural attention does not help: what the attention mechanism learns is to select the last representation (i.e., it behaves like the vanilla LSTM). Without the LSTM encoder, the attention weights focus on 'love' but not 'ignored'. This fails to capture any concept of contrast or incongruity.
Next, we consider the false labeled example. This time, the attention maps of MIARN are not as distinct as before. However, they focus on sentiment-bearing words, composing the words 'ignored sucks' to form the majority of the intra-attentive representation. Passing the vector made up of 'ignored sucks' allows the subsequent layers to recognize that there is no contrasting situation or sentiment. As before, ATT-LSTM focuses on the last word, 'time', which is not interpretable at all. On the other hand, ATT-RAW focuses on relatively non-meaningful words such as 'big'.
Overall, we analyzed two cases (positive and negative labels) and found that MIARN produces very explainable attention maps. In general, we found that MIARN is able to identify contrast and incongruity in sentences, allowing our model to better detect sarcasm. This is facilitated by modeling intra-sentence relationships. Notably, the standard vanilla attention is not explainable or interpretable.

Conclusion
Based on the intuition of intra-sentence similarity (i.e., looking in-between), we proposed a new neural network architecture for sarcasm detection. Our network incorporates a multi-dimensional intra-attention component that learns an intra-attentive representation of the sentence, enabling it to detect contrastive sentiment, situations and incongruity. Extensive experiments over six public benchmarks confirm the empirical effectiveness of our proposed model. Our proposed MIARN model outperforms strong state-of-the-art baselines such as GRNN and CNN-LSTM-DNN. Analysis of the intra-attention scores shows that our model learns highly interpretable attention weights, paving the way for more explainable neural sarcasm detection methods.