Fracking Sarcasm using Neural Network



Introduction
Figurative language, such as metaphor, irony and sarcasm, is a ubiquitous aspect of human communication from ancient religious texts to modern microtexts. Although sarcasm is a well-studied phenomenon in cognitive science and linguistics (Gibbs and Clark, 1992; gib, 2007; Kreuz and Glucksberg, 1989; Utsumi, 2000), its detection is still in its infancy as a computational task. Detection is difficult because literal meaning is discounted and secondary or extended meanings are instead intentionally profiled. In social contexts, one's ability to detect sarcasm relies heavily on social cues such as sentiment, belief, and the speaker's intention. Sarcasm is mocking and often involves harsh delivery to achieve savage putdowns, even though it can also be crafted more gently as the accretion of politeness and the abatement of hostility around a criticism (Brown and Levinson, 1978; Dews and Winner, 1995). Moreover, sarcasm often couches criticism within a humorous atmosphere (Dews and Winner, 1999). Riloff et al. (2013) addressed one common form of sarcasm as the juxtaposition of a positive sentiment attached to a negative situation, or vice versa. Other work has modeled sarcasm via a composition of linguistic elements, such as specific surface features about a product, frequent words, and punctuation marks. González-Ibánez et al. (2011) view sarcasm as a combination of lexical and pragmatic factors such as emoticons and profile references in social media. Most research approaches toward the automatic detection of sarcasm are text-based and consider sarcasm to be a function of contrasting conditions or lexical clues. Such approaches extract definitive lexical cues as features, where the linguistic scale of features is stretched from words to phrases to provide richer contexts for analysis.
Lexical feature cues may yield good results, yet without a precise semantic representation of a sentence, which is key for determining the intended gist of a sentence, robust automatic sarcasm detection will remain a difficult challenge to realize. Accurate semantic modelling of context becomes obligatory for automatic sarcasm detection if social cues and extended meaning are to be grasped.
Encouraging an immediate and very social use of language, social media platforms such as Twitter (https://twitter.com) are rich sources of texts for Natural Language Processing (NLP). Social micro-texts are dense in figurative language, and are useful for figurative analysis because of their topicality, ease of access, and the use of self-annotation via hashtags. In Twitter, language is distorted, often plumbing the depths of bad language (Eisenstein, 2013). Yet due to the presence of grammatical errors liberally mixed with social media markers (hashtags, emoticons, profiles), abbreviations, and code switching, these micro-texts are harder to parse, and parsing is the most commonly used method to obtain a semantic representation of a sentence. The accuracy of state-of-the-art constituency parsers over tweets can be significantly lower than that for normal texts, so social media researchers still largely rely on surface-level features. With the recent move to artificial neural networks (ANNs) in NLP, ANNs provide an alternative basis for semantic modelling. In this paper, we perform semantic modelling of sentences using neural networks for the task of sarcasm detection. The paper is organized as follows. Section 2 surveys related work, Section 3 outlines methods of data collection and data processing, Section 4 describes the recursive SVM model, Section 5 describes the neural network model, Sections 6 and 7 outline our experimental setup and experimental analysis respectively, and Section 8 presents a simple sarcastic Twitter bot. Finally, Section 9 concludes with a short discussion of future work.

Related work
Semantic modelling of sentence meaning is a well-researched topic in NLP. Due to 'bad language' in Twitter and a noticeable drop in accuracy for state-of-the-art constituency parsers on tweets, the semantic modelling of tweets has captured the attention of researchers. To build a semantic representation of a sentence in various NLP tasks such as sentiment analysis, researchers have used syntactic structure to compose a total representation as a function of the word-vector representation of a sentence's parts. Nakagawa et al. (2010) describe a Tree-CRF classifier which uses a data-driven dependency parser, maltparser, to obtain a parse tree for a sentence, and whose composition function uses the head-modifier relations of the parse tree. Mitchell and Lapata (2008, 2010) defined the composition function of a sentence by algebraic operations over word meaning vectors to obtain sentence meaning vectors. Guevara (2010) and Malakasiotis (2011) formulated their composition function using a set of specific syntactic relations or specific word categories (Baroni and Zamparelli, 2010). Socher et al. (2011) proposed a structured recursive neural network based on the convolutional operation, while Kalchbrenner et al. (2014) proposed a convolutional neural network (CNN) with dynamic k-max pooling, considering max pooling as a function of input length.
For sarcasm detection, due to the complexity of the task and the somewhat poorer accuracy of state-of-the-art constituency parsers on tweets, researchers have considered surface-level lexical and syntactic cues as legitimate features. Kreuz and Caucci (2007) explored the role of lexical indicators, such as interjections (e.g., "gee" or "gosh"), punctuation symbols (e.g., '?'), intensifiers, and other linguistic markers of, e.g., non-veridicality and hyperbole, in recognizing sarcasm in narratives. Tsur noted the occurrence of "yay!" or "great!" as a recurring aspect of sarcastic patterns in Amazon product reviews. Davidov examined the effectiveness of social media indicators such as hashtags to identify sarcasm. Lukin and Walker (2013) proposed a potential bootstrapping method for sarcasm classification in social dialogue to expand lexical N-gram cues related to sarcasm (e.g., "oh really", "no way") as well as lexico-syntactic patterns. Riloff et al. (2013) and Liebrecht et al. (2013) applied N-gram features to a classifier for English and Dutch tweets and observed that some topics recur frequently in sarcastic tweets, such as schools, dentists, church life, public transport and the weather.
In this paper, we investigate the usefulness of neural-network-based semantic modelling for sarcasm detection. We propose a neural network model for semantic modelling in tweets that combines Deep Neural Networks (DNNs) with time-convolution and Long Short-Term Memory (LSTM). The proposed model is compared to a recursive Support Vector Machine (SVM) model based on constituency parse trees.

Dataset
Twitter allows users to summarize their intention via hashtags. Using a user's self-declaration of sarcasm, #sarcasm, as a retrieval cue, we have crawled the Twittersphere. Since this simple heuristic misses those uses of sarcasm that lack an explicit mention of #sarcasm, we used an LSA-based approach to extend the list of indicative hashtags (e.g., to include #sarcastic, #yeahright, etc.). We also harvested tweets from user profiles with a strong bias toward sincerity or (for professional wits) sarcasm. To build our sarcastic data set we aggregated all tweets containing one or more positive markers of sarcasm, but removed such markers from the tweets, while tweets which did not contain any positive markers of sarcasm were considered non-sarcastic. The training dataset of 39K tweets is roughly balanced, containing 18K sarcastic and 21K non-sarcastic tweets. As a test set, we have created a dataset of 2000 tweets annotated by an internal team of researchers. For purposes of comparison, we also used two different publicly available sarcasm datasets.
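The labeling heuristic described above can be sketched as follows. This is a minimal illustration, not the authors' crawler: the marker set is a hypothetical stand-in for the LSA-expanded hashtag list, and the function name `label_tweet` is our own.

```python
import re

# Hypothetical marker list; the paper expands "#sarcasm" with an LSA-based method.
SARCASM_MARKERS = {"#sarcasm", "#sarcastic", "#yeahright"}

HASHTAG_RE = re.compile(r"#\w+")


def label_tweet(text):
    """Return (cleaned_text, label): label is 1 if any marker hashtag occurs, else 0."""
    tags = {t.lower() for t in HASHTAG_RE.findall(text)}
    label = int(bool(tags & SARCASM_MARKERS))
    # Remove only the marker hashtags so the label does not leak into the input.
    cleaned = HASHTAG_RE.sub(
        lambda m: "" if m.group(0).lower() in SARCASM_MARKERS else m.group(0),
        text,
    )
    # Collapse the whitespace left behind by removed markers.
    return " ".join(cleaned.split()), label
```

Other hashtags are left in place here because they are handled separately by the hashtag splitter.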
Social media contains many interesting elements such as hashtags, profile references and emoticons. Due to the size limitation of tweets, users exploit these elements to provide contextual information. To tightly focus our research question, we did not include sarcasm from the larger conversational context and thus dropped all profile information from the input text. As users often use multi-word hashtags to add an additional sarcastic dimension to a tweet, we used a hashtag splitter to split these phrasal tags and appended their words to the text.
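The paper does not say how its hashtag splitter works. One simple possibility, shown here purely as an assumption, is dictionary-based segmentation that prefers the fewest known words. The vocabulary and both function names are hypothetical.

```python
def split_hashtag(tag, vocab):
    """Split a hashtag body into known words, preferring fewer words.

    Falls back to the unsplit body when no full segmentation exists.
    """
    body = tag.lstrip("#").lower()
    n = len(body)
    # best[i] is the shortest word list covering body[:i], or None if none exists.
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            word = body[j:i]
            if best[j] is not None and word in vocab:
                candidate = best[j] + [word]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n] if best[n] is not None else [body]


def expand_hashtags(text, vocab):
    """Drop hashtags from the running text and append their words at the end."""
    words, extra = [], []
    for token in text.split():
        if token.startswith("#") and len(token) > 1:
            extra.extend(split_hashtag(token, vocab))
        else:
            words.append(token)
    return " ".join(words + extra)
```

For example, with a vocabulary containing "so", "much" and "fun", the tweet "Mondays are great #somuchfun" becomes "Mondays are great so much fun".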
For the recursive-SVM, we used the Stanford constituency parser 3 for parsing tweets. In order to ex-

Recursive-SVM Data Processing
Constituency parse trees offer a syntactic model of a sentence which can form a strong basis for semantic modelling. In order to use the Stanford constituency parser here, the tweets were first pre-processed by removing social media markers such as profile references, retweets and hashtags. As a tweet may contain multiple sentences, each tweet is split into sentences using the Stanford sentence splitter; the sentences are parsed separately and then stitched back together with a sentence tag (S). Hashtags are dense annotations offered by users of their own texts, and their scope generally applies to the entire content of a tweet. Thus we restored the hashtags to the parse tree by attaching them to the root node of the parse tree of the tweet with a tag (HT). Consider the following tweet as an example:
I love when people start rumors about me. #not
The hashtag #not is attached to the root of the parse tree using the part-of-speech tag (HT) (Figure 1).
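The stitching step can be sketched on bracketed parse strings. The sketch assumes each sentence has already been parsed into a bracketed tree. The function name is hypothetical, and the tree used in the usage note is a hand-written, abbreviated bracketing rather than real parser output.

```python
def stitch_tweet_tree(sentence_trees, hashtags):
    """Join per-sentence bracketed parses under one (S ...) root.

    Each hashtag is attached as an (HT ...) child of that root, so that its
    scope covers the whole tweet.
    """
    children = list(sentence_trees)
    children += ["(HT {})".format(tag) for tag in hashtags]
    return "(S {})".format(" ".join(children))
```

Called with the single sentence tree `(S (NP I) (VP love rumors))` and the hashtag `#not`, this returns `(S (S (NP I) (VP love rumors)) (HT #not))`.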

Recursive SVM
We now define a recursive-SVM model. Consider a subjective sentence $S$ containing $n$ phrases with $m$ words in total. $w_l$, $b_l$ and $pos_l$ denote the surface form, root form and part-of-speech respectively of the $l$-th word of $S$, while $n_i$ denotes the $i$-th node and $p_i$, $h_i$, and $o_i$ denote the phrase, head node and offensive word-marker of the $i$-th node respectively. The $0$-th node is the root node, while $s_i$ and $sa_i$ denote the predicted values of sentiment polarity and sarcastic polarity of the constituency subtree whose root is the $i$-th node ($s_i \in \{+1, 0\}$, $sa_i \in \{+1, 0\}$). Table 1 shows the training vectors ($x_i \in \mathbb{R}^n$, $i = 0, \dots, n$), where $y_i \in \{1, 0\}$ is the label for the $i$-th node. As the number of parameters is larger than the number of instances, dual-based solvers offer the best fit for this problem. Through grid search, the optimum penalty value ($C$) was determined and set to 1000 and 2000 for sentiment and sarcasm detection respectively. The stopping tolerance value was set to -0.0001. Among the loss functions tried, L2-regularized L1-loss and L2-loss yielded the best results.
Table 1: recursive SVM features
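For reference, the two losses named above correspond to the following standard linear SVM objectives. This is the textbook formulation, not one taken from the paper. It assumes the node labels $y_i \in \{1, 0\}$ are mapped to $\tilde{y}_i \in \{+1, -1\}$ and that no separate bias term is used.

```latex
% L2-regularized L1-loss (hinge loss), primal form
\min_{w} \; \frac{1}{2} w^{\top} w + C \sum_{i} \max\left(0,\; 1 - \tilde{y}_i\, w^{\top} x_i\right)

% L2-regularized L2-loss (squared hinge loss), primal form
\min_{w} \; \frac{1}{2} w^{\top} w + C \sum_{i} \max\left(0,\; 1 - \tilde{y}_i\, w^{\top} x_i\right)^{2}

% Dual of the L1-loss problem, with Q_{ij} = \tilde{y}_i \tilde{y}_j\, x_i^{\top} x_j
\min_{\alpha} \; \frac{1}{2} \alpha^{\top} Q\, \alpha - \mathbf{1}^{\top} \alpha
\quad \text{subject to} \quad 0 \le \alpha_i \le C
```

The dual has one variable per training instance rather than one per feature, which is why dual solvers are attractive when features outnumber instances.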

Neural network
Semantic modelling of sentence meaning using neural networks has been a target of attention in the social media community. Neural network architectures such as CNNs, DNNs, RNNs, and Recursive Neural Networks (RecNNs) have shown excellent capabilities for modelling complex word composition in a sentence. A sarcastic text can be considered elementally as a sequence of text signals or word combinations. An RNN is well suited to modelling temporal text signals, as it includes a temporal memory component that allows the model to store temporal contextual information directly in the model. It can aggregate the entire sequence into a temporal context that is free of explicit size constraints. Among the many implementations of RNNs, LSTMs are easy to train and do not suffer from vanishing or exploding gradients while performing back propagation through time. An LSTM has the capability to remember long-distance temporal dependencies. Moreover, as it performs temporal text modelling over input features, higher-level modelling can distinguish factors of linguistic variation within the input. CNNs can also capture temporal text sequences through convolutional filters. CNNs reduce frequency variation, and convolutional filters connect a subset of the feature space that is shared across the entire input (Chan and Lane, 2015). Dos Santos et al. (2015) have shown that CNNs can directly capture temporal text patterns for shorter texts, yet in longer texts, where temporal text patterns may span 15 to 20 words, CNNs must rely on higher-level fully connected layers to model long-distance dependencies, as the maximum convolutional filter width for a text is 5 (Figure 2).
Another major limitation of CNNs is the fixed convolutional filter width, which is not suitable for different lengths of temporal text patterns and cannot always resolve dependencies properly. Obtaining the optimal filter size is expensive and corpus-dependent, while an LSTM operates without a fixed context window size. An LSTM's performance can be improved by providing better features. Following the proposal of Vincent et al. (2008), it can be beneficial to exploit a CNN's ability to reduce frequency variation and map input features into composite robust features, and to use these as the input to an LSTM network. DNNs are appropriate for mapping features into a more separable space. A fully connected DNN, added on top of an LSTM network, can provide better classification by mapping between output and hidden variables, transforming features into an output space. In the following section we define our proposed network in detail.

Input layer
Consider a tweet as input containing $n$ words. The tweet is converted into a vector $s \in \mathbb{R}^{1 \times n}$ by replacing each word with its dictionary index. To resolve different lengths of input, the tweet vector is padded and the tweet is converted into a vector $s \in \mathbb{R}^{1 \times l}$, where $l$ is the maximum length of tweets in the input corpus. The input vector is fed to the embedding layer, which converts each word into a distributional vector of dimension $D$. Thus the input tweet is converted to a matrix $s \in \mathbb{R}^{l \times D}$.
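The indexing and padding step can be sketched as below. Reserving index 0 for padding and padding on the right are assumptions on our part, as the paper does not specify either. Both function names are hypothetical.

```python
def build_vocab(tweets):
    """Map each word to a positive index; 0 is reserved for padding (an assumption)."""
    vocab = {}
    for tweet in tweets:
        for word in tweet.split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def encode(tweets, vocab):
    """Convert tweets to index vectors, right-padded to the longest tweet length l."""
    rows = [[vocab[word] for word in tweet.split()] for tweet in tweets]
    max_len = max(len(row) for row in rows)
    return [row + [0] * (max_len - len(row)) for row in rows]
```

Each padded row of length l is then passed to the embedding layer, which looks up a D-dimensional vector per index to give the l × D input matrix.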

Convolutional network
The aim of the convolutional network is to reduce frequency variation through convolutional filters and to extract discriminating word sequences as a composite feature map for the LSTM layer. The convolution operation maps the input matrix $s \in \mathbb{R}^{l \times D}$ into $c \in \mathbb{R}^{|s|-m+1}$ using a convolutional filter $k \in \mathbb{R}^{D \times m}$. Each component is computed as follows:

$$c_i = \sum_{d=1}^{D} \sum_{j=1}^{m} \left( s_{[i:i+m-1]} \odot k \right)_{d,j}$$

Here $s_{[i:i+m-1]}$ is the slice of $m$ consecutive word vectors starting at position $i$, arranged as a $D \times m$ matrix. The convolution filter, which has the same dimension $D$ as the input matrix, slides along the column dimension of the input matrix, performing an element-wise product between a column slice of $s$ and the filter matrix $k$. The products are summed to produce a vector component $c_i$, and the components together form a feature map $c \in \mathbb{R}^{1 \times (|s|-m+1)}$. $f$ filters create a feature map $C \in \mathbb{R}^{f \times (|s|-m+1)}$. We chose sigmoid for the non-linearity. Initially we passed the output of the convolutional network through a max-pooling layer with pooling sizes 2 and 3. Later, we discarded the max-pooling layer and fed the LSTM network with all of the composite features to judge sarcasm, which improved the performance of the model.
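The sliding-filter computation can be illustrated in plain Python. This sketch assumes the narrow form of the convolution, with l − m + 1 outputs for a tweet of l words and a filter of width m. It stores the filter as m rows of dimension D (the transpose of the D × m layout in the text) and omits any bias term.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv_feature_map(s, k):
    """Narrow 1-D convolution over a tweet for a single filter.

    s: list of l word vectors, each of dimension D.
    k: filter stored as m rows of dimension D (one row per window position).
    Returns l - m + 1 sigmoid-activated components.
    """
    l, m = len(s), len(k)
    feature_map = []
    for i in range(l - m + 1):
        total = 0.0
        # Element-wise product between the window s[i:i+m] and the filter, summed.
        for j in range(m):
            for d in range(len(k[j])):
                total += s[i + j][d] * k[j][d]
        feature_map.append(sigmoid(total))
    return feature_map
```

Running f such filters and stacking their outputs yields the f-row feature map that is fed to the LSTM layer.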

LSTM
RNNs have demonstrated the power of semantic modelling quite efficiently by incorporating feedback cycles in the network architecture. RNNs include a temporal memory component, which allows the model to store temporal contextual information directly in the model. At each time step, the network considers the current input $x_t$ and the hidden state $h_{t-1}$. However, an RNN is unable to capture long-term dependencies if the gap between two time steps becomes too large. Hochreiter and Schmidhuber (1997) introduced the LSTM, which is able to capture long-term dependencies by defining each memory cell with a set of gates in $\mathbb{R}^d$, where $d$ is the memory dimension of the hidden state of the LSTM, and which does not suffer from vanishing or exploding gradients while performing back propagation through time. An LSTM contains three gates, which are functions of $x_t$ and $h_{t-1}$: the input gate $i_t$, the forget gate $f_t$, and the output gate $o_t$. The gates jointly decide on the memory update mechanism. Equations (3) and (2) denote the amount of information to be discarded from and stored in memory, respectively. Equation (5) denotes the output of the cell $c_t$.
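The numbered equations referred to above are not reproduced in this text. For orientation, a commonly used formulation of the LSTM update is given below. The weight names $W$, $U$ and $b$ are our notation, and the ordering need not match the paper's own equation numbers.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate memory)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(memory cell update)} \\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(hidden state)}
\end{aligned}
```

In this form the forget gate $f_t$ scales how much of the previous memory $c_{t-1}$ is kept, and the input gate $i_t$ scales how much of the new candidate is written.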

Deep Neural Network Layer
The output of the LSTM layer is passed to a fully connected DNN layer, which produces a higher-order feature set based on the LSTM output that is easily separable for the desired number of classes. Finally, a softmax layer is added on top of the DNN layer. The network is trained by minimizing the binary cross-entropy error. For parameter optimization, we have used ADAM (Kingma and Ba, 2014) with the learning rate set to 0.001.

Experiment
To evaluate both models, we have tested rigorously with different experimental setups. For the recursive SVM, we employed different combinations of the features listed in Table 1. In the neural network model, we opted for a word embedding dimension of 256. We tested our model with different settings of the hyperparameters for the CNN (number of filters, filter size), the LSTM (hidden memory dimension, dropout ratio), and the DNN (number of hidden memory units (HMU)). Initially we passed the output of the CNN via a max-pooling layer, with max-pooling sizes 2 and 3, to the LSTM, but later we dropped the max-pooling layer, which improved the performance by 2%.
In our experiment, apart from the combination of CNN, LSTM, and DNN, we observed the performance of each of the neural networks individually. The CNN network is investigated by varying the number of filters and the filter widths, set to 64, 128, 256 and 2, 3 respectively. For the LSTM network, the number of memory units is varied from 64 to 256. Sigmoid is chosen as the activation function for both networks. We used Gaussian initialization scaled by the fan-in and the fan-out for the embedding layer, and Gaussian initialization scaled by the fan-in for the CNN, the LSTM, and the DNN layers as the initial probability distribution. The code was implemented using the keras library.

Experimental Analysis
In the neural network, success depends on apt input and the selection of hyperparameters. As we observed that the inclusion of hashtag information in the recursive-SVM method gained a better F-score, we retained the same input structure for the neural network. Apart from the difficulties in training a neural network, enormous training time is another obstacle. We observed that, compared to a stacked LSTM network, the CNN-LSTM network converges faster, as the CNN reduces frequency variation and produces a better composite representation of the input to the LSTM network. Sarcasm detection is considered a complex task, as very subtle contextual information often triggers the sarcastic notion. Accordingly, we noticed that with the inclusion of a dropout layer on top of the CNN layer, our model suffered a decrease in performance. In the testing dataset, we observed an interesting example:
I don't know about you man but I love the history homework.
With the dropout layer, the model identified the above example as non-sarcastic, yet without the dropout layer, our model labeled it as sarcastic. This indicates that the word "man", which functions as an intensifier of sarcasm in this context, was dropped out from the output of the CNN layer. We also observed that incrementing the filter width of the CNN layer boosted the performance of our model by a small margin. To obtain the apt network size, we have also trained with bigger network sizes and larger filter widths, but no improvement has been observed. We evaluated our system with two publicly available datasets (including that of Riloff et al., 2013). The results are shown in Table 3. We observed that our model performed with a better F-score than both of the systems, but it has a lower precision value than SASI.

Twitter Bot
In NLP research, building a carefully crafted corpus has always played a crucial role. In recent research, Twitter has been used as an excellent source for various NLP tasks due to its topicality and availability. When sharing datasets, due to copyright and privacy concerns, researchers are forced to share only tweet identifiers along with annotations instead of the actual text of each tweet. As a tweet is a perishable commodity and may be deleted, archived or otherwise made inaccessible over time by its original creator, resources are lost in the course of time. Following our idea of retweeting via a dedicated account (@onlinesarcasm) to keep tweets from perishing without copyright infringement, we have retweeted only detected sarcastic tweets. Regarding the quality assurance of the automated retweets, we observed that conflict between human annotation and the output of the model is negligible for those tweets predicted with a softmax class probability higher than 0.75.
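The confidence filter for automated retweets can be sketched as follows. Only the 0.75 threshold comes from the text. The function names, the two-class logit layout and the index of the sarcastic class are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def should_retweet(logits, sarcastic_index=1, threshold=0.75):
    """Retweet only when the sarcastic class probability exceeds the threshold."""
    return softmax(logits)[sarcastic_index] > threshold
```

With scores `[0.0, 2.0]` the sarcastic probability is about 0.88, so the tweet would be retweeted. With `[0.0, 0.5]` it is about 0.62 and the tweet would be skipped.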

Conclusion & Future work
Sarcasm is a complex phenomenon that is linguistically and semantically rich. By exploiting the semantic modelling power of the neural network, our model has outperformed existing sarcasm detection systems with an F-score of .92. Even though our model performs very well in sarcasm detection, it still lacks the ability to differentiate sarcasm across similar concepts. As an example, our model classified "I Just Love Mondays!" correctly as sarcasm, but it failed to classify "Thank God It's Monday!" as sarcasm, even though both are similar at the conceptual level. Feeding the model with word2vec to find similar concepts may not be beneficial, as not every similar concept employs sarcasm. For example, "Thank God It's Friday!" is non-sarcastic in nature. In future work, selective use of word2vec can be exploited to improve the model. Also performing a trend analysis from our twitter bot