Sarcastic or Not: Word Embeddings to Predict the Literal or Sarcastic Meaning of Words

Sarcasm is generally characterized as a figure of speech that involves the substitution of a literal by a figurative meaning, which is usually the opposite of the original literal meaning. We re-frame the sarcasm detection task as a type of word sense disambiguation problem, where the sense of a word is either literal or sarcastic. We call this the Literal/Sarcastic Sense Disambiguation (LSSD) task. We address two issues: 1) how to collect a set of target words that can have either literal or sarcastic meanings depending on context; and 2) given an utterance and a target word, how to automatically detect whether the target word is used in the literal or the sarcastic sense. For the latter, we investigate several distributional semantics methods and show that a Support Vector Machines (SVM) classifier with a modified kernel using word embeddings achieves a 7-10% F1 improvement over a strong lexical baseline.


Introduction
Recognizing sarcasm is important for understanding people's actual sentiments and beliefs. For example, failing to recognize the sarcasm in the message "I love that I have to go back to the emergency room" will lead a sentiment and opinion analysis system to infer that the author's sentiment toward the event of "going to the emergency room" is positive. Current approaches have framed the sarcasm detection task as predicting whether a full utterance is sarcastic or not (Davidov et al., 2010; González-Ibáñez et al., 2011; Riloff et al., 2013; Liebrecht et al., 2013; Maynard and Greenwood, 2014).
We propose a re-framing of sarcasm detection as a type of word sense disambiguation problem: given an utterance and a target word, identify whether the sense of the target word is literal or sarcastic. We call this the Literal/Sarcastic Sense Disambiguation (LSSD) task. In the above utterance, the word "love" is used in a sarcastic, non-literal sense (the author's intended meaning being most likely the opposite of the original literal meaning, i.e., a negative sentiment such as "hate").
Two key challenges need to be addressed: 1) how to collect a set of target words that can have a literal or a sarcastic sense, depending on context; and 2) given an utterance containing a target word, how to determine whether the target word is used in its literal sense (e.g., "I love to take a nice stroll in the park every morning") or in a sarcastic sense (e.g., "I love going to the dentist").
To address the first challenge, we need to identify a set of words from sarcastic utterances which have a figurative/sarcastic sense (e.g., "love" in the utterance "I love going to the dentist"). We propose a crowdsourcing task where Turkers on the Amazon Mechanical Turk (MTurk) platform are given sarcastic utterances (tweets labeled with the #sarcasm or #sarcastic hashtags) and are asked to re-phrase those messages so that they convey the author's intended meaning ("I love going to the dentist" can be rephrased as "I hate going to the dentist" or "I don't like going to the dentist"). Given this parallel dataset, we use unsupervised alignment techniques to identify semantically opposite words (e.g., "love" ↔ "hate", "brilliant" ↔ "stupid", "never" ↔ "always"). The words from these pairs that appear in the original sarcastic utterances are then considered our collection of target words (e.g., "love", "brilliant", "never") that can have both a sarcastic and a literal sense depending on the context (Section 2).
To address the second challenge, we compare several distributional semantics methods generally used in word sense disambiguation tasks (Section 3). We show that using word embeddings in a modified SVM kernel achieves the best results (Section 4). To collect training and test datasets for each target word, we use Twitter messages that contain those words. For the sarcastic sense (S), we use tweets that contain the target word and are labeled with the #sarcasm or #sarcastic hashtags. For the literal sense (L), we collect tweets that contain the target word and are not labeled with those hashtags. Table 1 shows examples of two target words ("great" and "proud") and their sarcastic sense (S) and literal sense (L). In addition, for the literal sense, we also consider a special case where the tweets are labeled with either positive or negative hashtags (e.g., #happy, #sad), as proposed by Gonzalez et al. (2011). We denote these sentiment tweets as L_sent (Table 1). Gonzalez et al. (2011) argue that it is harder to distinguish sarcastic from non-sarcastic messages when the non-sarcastic messages contain sentiment. Our results support this argument (97% F1 for the best result on S vs. L, compared to 84% F1 for the best result on S vs. L_sent; Section 4).

Collection of Target Words
To collect a set of target words that can have either literal or sarcastic meaning depending on context, we propose a two step approach: 1) a crowdsourcing task to collect a parallel dataset of sarcastic utterances and their re-phrasings that convey the authors' intended meaning; and 2) unsupervised alignment techniques to detect semantically opposite words/phrases.
Crowdsourcing Task. Given a sarcastic message (SM), Turkers were asked to re-phrase the message so that the new message is likely to express the author's intended meaning (IM). From such examples, we can see that aligning the sarcastic message (SM) to the Turkers' re-phrasings of the author's intended meaning (IM 1, IM 2, IM 3) allows us to detect that "happy" can be aligned to "don't like", "upset", and "unhappy". Based on this alignment, "happy" is considered a target word for the LSSD task.
We used 1,000 sarcastic messages collected from Twitter using the #sarcasm and #sarcastic hashtags. The Turkers were provided with detailed instructions for the task, including a definition of sarcasm, the task description, and multiple examples. In addition, for messages that contain one or more sentences and where sarcasm is related to only a part of the message, the Turkers were instructed to consider the entire message in their rephrasing. This emphasis was added to avoid high asymmetry in length between the original sarcastic message and the rephrasing of the intended meaning. For each original sarcastic message (SM), we asked five Turkers to do the rephrasing task. Each HIT contained one sarcastic message, and Turkers were paid 5 cents per HIT. To ensure a high quality level, only qualified workers were allowed to perform the task (i.e., workers with more than a 90% approval rate and at least 500 approved HITs). In this way, we obtained a dataset of 5,000 SM-IM pairs.
Unsupervised Techniques to Detect Semantically Opposite Words/Phrases. We use two methods for unsupervised alignment. First, we use the co-training algorithm for paraphrase detection developed by Barzilay and McKeown (2001). This algorithm is used for two specific reasons. First, our dataset is similar in nature to the parallel monolingual dataset used in Barzilay and McKeown (2001), and thus lexical and contextual information from tweets can be used to extract the candidate target words for LSSD. For instance, we can align the [SM] and [IM 3] (from the above examples), where, except for the words happy and unhappy, the majority of the words in the two messages are anchor words, and thus happy and unhappy can be extracted as paraphrases via co-training. To model contextual information, such as part-of-speech tags, for the co-training algorithm, we used Tweet NLP (Gimpel et al., 2011). Second, Bannard and Callison-Burch (2005) noticed that the co-training method proposed by Barzilay and McKeown (2001) requires identical bounding substrings and has a bias towards single words when extracting paraphrases. This apparent limitation, however, is advantageous to us because we are specifically interested in extracting target words. Co-training resulted in 367 extracted pairs of paraphrases.
We also considered a statistical machine translation (SMT) alignment method, IBM Model 4 with HMM alignment, implemented in Giza++ (Och and Ney, 2000). We used the Moses software (Koehn et al., 2007) to extract lexical translations by aligning the dataset of 5,000 SM-IM pairs. From the set of 367 paraphrases extracted using Barzilay and McKeown (2001)'s approach, we selected only those paraphrases whose lexical translation scores φ (obtained after running Moses) are ≥ 0.8. After filtering via translation scores and manual inspection, we obtained a set of 80 semantically opposite paraphrases. Given this set of semantically opposite words, the words that appear in the sarcastic messages were considered our target words for LSSD (70 target words after lemmatization). They range from verbs, such as "love" and "like", to adjectives, such as "brilliant" and "genius", and adverbs, such as "really".
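The score-based filtering step can be sketched as follows. This is a toy version: the three-column "source target score" line format is an assumption modeled on Moses lexical translation tables, and the helper name and example scores are illustrative.

```python
def filter_opposite_pairs(lex_lines, threshold=0.8):
    """Keep candidate word pairs whose lexical translation score phi
    meets the threshold, dropping identity pairs and malformed lines.

    Each line is assumed to hold "source target score", whitespace-
    separated, as in a Moses lexical translation table.
    """
    pairs = []
    for line in lex_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip malformed lines
        src, tgt, score = parts[0], parts[1], float(parts[2])
        if src != tgt and score >= threshold:
            pairs.append((src, tgt))
    return pairs

table = [
    "love hate 0.91",
    "love love 0.95",       # identity pair, not semantically opposite
    "great terrible 0.64",  # below the phi >= 0.8 cutoff
    "brilliant stupid 0.83",
]
print(filter_opposite_pairs(table))  # [('love', 'hate'), ('brilliant', 'stupid')]
```

The surviving pairs would then be inspected manually, as described above, before the target words are extracted.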

Literal/Sarcastic Sense Disambiguation
Our Literal/Sarcastic Sense Disambiguation (LSSD) task is formulated as follows: given a candidate utterance (i.e., a tweet) that contains a target word t, identify whether the sense of t is sarcastic (S) or literal (L). To solve this problem, we need training and test data for each target word, consisting of utterances where the target word is used in either the literal or the sarcastic sense.

Data Collection
To collect training and test datasets for each target word, we use Twitter messages that contain those words. For the sarcastic sense (S), we use tweets that contain the target word and are labeled with the #sarcasm or #sarcastic hashtags. For the literal sense (L), we collect tweets that contain the target word and are not labeled with those hashtags. In addition, for the literal sense we also consider a special case, where the tweets are labeled with either positive or negative sentiment hashtags (e.g., #happy, #sad). Thus, we consider two LSSD tasks, S vs. L and S vs. L_sent, and aim to collect a balanced dataset for each target word.
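A minimal sketch of this hashtag-based labeling scheme follows; the helper name is ours, the sentiment-hashtag set is an illustrative subset, and the paper's full pipeline additionally handles tokenization and lemmatization.

```python
SARCASM_TAGS = {"#sarcasm", "#sarcastic"}
SENTIMENT_TAGS = {"#happy", "#sad"}  # illustrative subset of sentiment hashtags

def label_tweet(tweet, target):
    """Assign a tweet containing `target` to the S, L_sent, or L pool."""
    tokens = tweet.lower().split()
    if target not in tokens:
        return None  # tweet does not mention the target word
    if SARCASM_TAGS & set(tokens):
        return "S"
    if SENTIMENT_TAGS & set(tokens):
        return "L_sent"
    return "L"

print(label_tweet("I love going to the dentist #sarcasm", "love"))  # S
print(label_tweet("I love my new phone #happy", "love"))            # L_sent
print(label_tweet("I love a stroll in the park", "love"))           # L
```

Note that S takes precedence over L_sent, so a tweet carrying both a sarcasm hashtag and a sentiment hashtag lands in the sarcastic pool.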
For the 70 target words (see Section 2), we collected a total of 2,542,249 tweets via the Twitter API. We considered a setup where 80% of the data is used for training, 10% for development, and 10% for testing. We empirically set the minimum number of training instances for each sense of a target word to 400, without any upper restriction. This left 37 target words for the LSSD experiments. Table 2 shows all the target words and their corresponding number of training instances for each sense (S and L/L_sent). The size of the training data ranges from 26,802 instances for the target word "love" to 427 for "mature". As we will see in the results section, however, the size of the training data is not always the key factor in the LSSD task, especially for the methods that use word embeddings.

Learning Approaches
We consider two classical approaches used in word sense disambiguation tasks: 1) distributional approaches, where each sense of a target word is represented as a context vector derived from the training data; and 2) classification approaches (S vs. L; S vs. L_sent) for each target word.

Distributional Approaches
The Distributional Hypothesis in linguistics is derived from the semantic theory of language usage: words that are used and occur in the same contexts tend to have similar meanings (Harris, 1954). Distributional semantic models (DSMs) use vectors that represent the contexts (e.g., co-occurring words) in which target words appear in a corpus as proxies for meaning representations. Geometric techniques such as cosine similarity are then applied to these vectors to measure the similarity in meaning of the corresponding words.
DSMs are a natural fit for our LSSD task. For each target word t we build two context vectors that represent the two senses of t, using the training data: one for the sarcastic sense S, built from the sarcastic training data for t (v_s), and one for the literal sense L, built from the literal training data for t (v_l). Given a test message u containing a target word t, we first represent the target word as a vector v_u using all the context words inside u. To predict whether t is used in a literal or sarcastic sense in u, we simply apply a geometric measure (e.g., cosine similarity) between v_u and the two sense vectors v_s and v_l, choosing the sense with the maximum score.
To create the two sense vectors v_s and v_l for each target word t, we use the positive pointwise mutual information (PPMI) model (Church and Hanks, 1990). Based on t's context words c_k in a window of 10 words, we computed PPMI separately for the sarcastic and literal senses using t's training data. The context window size used in DSMs is generally between 5 and 10; we used a window of 10 words since tweets often include meaningful words/tokens at the end (e.g., interjections such as "yay" and "ohh"; upper-case words such as "GREAT"; novel hashtags such as "#notreally" and "#lolol"; emoticons such as ":("). We sorted the context words by their PPMI scores and, for each target word t, selected a maximum of 1,000 context words per sense to approximate the two senses (i.e., the vectors v_s and v_l each consist of at most 1,000 words). Table 3 shows some target words and their corresponding context words. (In the remainder of this section we mention only L, and not L_sent, for clarity and brevity.) To predict whether t is used in a literal or sarcastic sense in the test message u, we apply the cosine similarity between v_u (the vector representation of the target word t in u) and the two sense vectors v_s and v_l of t, choosing the sense with the maximum score. All vector elements are given by the tf-idf values of the corresponding words. This approach, denoted the "PPMI baseline", is the baseline for our DSM experiments.
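The PPMI sense vectors and the cosine-based prediction can be sketched as below. This is a deliberately simplified toy version with hypothetical helper names: it uses whitespace tokenization, binary weights for the test vector rather than tf-idf, and tiny example corpora.

```python
import math
from collections import Counter

def ppmi_vector(tweets, target, window=10, top_k=1000):
    """Build a PPMI-weighted context vector for `target` from one sense's tweets."""
    cooc, ctx_counts, total, n_target = Counter(), Counter(), 0, 0
    for tweet in tweets:
        toks = tweet.lower().split()
        total += len(toks)
        ctx_counts.update(toks)
        for i, tok in enumerate(toks):
            if tok == target:
                n_target += 1
                for c in toks[max(0, i - window): i] + toks[i + 1: i + 1 + window]:
                    cooc[c] += 1
    vec = {}
    for c, n in cooc.items():
        pmi = math.log((n * total) / (n_target * ctx_counts[c]))
        if pmi > 0:  # positive PMI only
            vec[c] = pmi
    return dict(Counter(vec).most_common(top_k))  # keep top-k context words

def cosine(u, v):
    num = sum(u[w] * v[w] for w in u if w in v)
    den = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def predict_sense(test_tweet, target, v_s, v_l):
    """Compare the test context against both sense vectors; pick the max."""
    v_u = {w: 1.0 for w in test_tweet.lower().split() if w != target}
    return "S" if cosine(v_u, v_s) > cosine(v_u, v_l) else "L"

sarcastic = ["i love working on sunday", "love being sick today"]
literal = ["i love my family", "love this christmas peace"]
v_s = ppmi_vector(sarcastic, "love")
v_l = ppmi_vector(literal, "love")
print(predict_sense("love being at work on sunday", "love", v_s, v_l))  # S
print(predict_sense("love my family time", "love", v_s, v_l))           # L
```

Even in this toy setting, the sarcastic and literal context vectors separate cleanly because the two senses co-occur with disjoint vocabulary.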
Context Vectors with Word Embeddings: The above method's context vectors v_s and v_l of each target word t contain the co-occurring words selected by their PPMI values. We enhance this representation by representing each word in the context vector by its word embedding. We experiment with three methods of obtaining word embeddings: Weighted Textual Matrix Factorization (WTMF) (Guo and Diab, 2012b); word2vec, which implements the skip-gram and continuous bag-of-words (CBOW) models of Mikolov et al. (2013a); and GloVe (Pennington et al., 2014), a log-bilinear regression model based on global word-word co-occurrence counts in the training corpora.
After removing the tweets used as test sets, we build the three word embedding models in an unsupervised fashion on the remaining 2,482,763 tweets from our original data collection (Section 3.1). In each of the three models, each word w is represented by a d-dimensional vector of real numbers, with d=100 for all embedding algorithms in our experiments. It is common to use 100 or 300 dimensions for embedding vectors, with larger dimensions for larger datasets. Since our dataset is smaller than those used in other applications of word embeddings (e.g., Pennington et al. (2014) used two billion tweets to create word embeddings), we opted for 100-dimensional vectors. Short descriptions of the three word embedding models follow:
• Weighted Textual Matrix Factorization (WTMF): Low-dimensional vectors have been used in WSD tasks, since they are computationally efficient and provide better generalization than surface words. WTMF (Guo and Diab, 2012b) is a dimension reduction method designed specifically for short texts, and it has been successfully applied to WSD tasks (Guo and Diab, 2012a). WTMF models unobserved words, thus providing more robust embeddings for short texts such as tweets.
• word2vec Representation: We use both the skip-gram model and the continuous bag-of-words (CBOW) model (Mikolov et al., 2013a; Mikolov et al., 2013c) as implemented in the word2vec module of the gensim Python library. Given a window of n words around a word w, the skip-gram model predicts the neighboring words given the current word. In contrast, the CBOW model predicts the current word w given the neighboring words in the window. We used a context window of 10 words.
• GloVe Representation: GloVe (Pennington et al., 2014) is a word embedding model based on a weighted least-squares objective trained on global word-word co-occurrence counts, rather than the local context windows used by word2vec.
Here, the prediction step is similar to the baseline: to predict whether the target word t in the test message u is used in a literal or sarcastic sense, we use a similarity measure between v_u (the vector representation of t in u) and the two sense vectors v_s and v_l of t, choosing the sense with the maximum score. The difference from the baseline is twofold. First, all vector elements are word embeddings (i.e., 100-dimensional vectors). Second, we use the maximum-valued matrix-element (MVME) algorithm introduced by Islam and Inkpen (2008), which has been shown to be particularly useful for computing the similarity of short texts. We modify this algorithm to use word embeddings (MVME_we). The idea behind the MVME algorithm is that it finds a one-to-one "word alignment" between two utterances (i.e., sentences) based on pairwise word similarity. Only the aligned words contribute to the overall similarity score.
Algorithm 1 presents the pseudocode of our modified algorithm for word embeddings, MVME_we. Let the total similarity between v_s and v_u be Sim. For each context word c_k from v_s and each word w_j from v_u, we compute a matrix M whose element M_jk denotes the cosine similarity between the embedded vectors of c_k and w_j [lines 5-13]. Next, we select the matrix cell with the highest similarity value in M (max) and add it to the Sim score [lines 16-17]. Let r_m and c_m be the row and the column of the cell containing max (the maximum-valued matrix element), respectively. We then remove all the matrix elements of the r_m-th row and the c_m-th column from M [line 20]. We repeat this procedure until we have traversed all the rows and columns of M or max = 0 [line 21].
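The greedy alignment of Algorithm 1 can be sketched in Python as follows, under the assumption that word vectors are given as plain lists of floats; the function names are ours.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def mvme_similarity(vecs_a, vecs_b):
    """Greedy one-to-one alignment: repeatedly take the highest-valued
    cell of the pairwise cosine matrix, add it to the total similarity,
    and delete its row and column, stopping when the matrix is
    exhausted or max = 0."""
    M = [[cosine(a, b) for b in vecs_b] for a in vecs_a]
    rows, cols = set(range(len(vecs_a))), set(range(len(vecs_b)))
    sim = 0.0
    while rows and cols:
        r_m, c_m = max(((r, c) for r in rows for c in cols),
                       key=lambda rc: M[rc[0]][rc[1]])
        if M[r_m][c_m] <= 0:
            break  # max = 0: no useful alignments left
        sim += M[r_m][c_m]
        rows.discard(r_m)
        cols.discard(c_m)
    return sim

# Toy 2-d "embeddings" for two word lists.
a = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0, 0.0], [1.0, 1.0]]
print(mvme_similarity(a, b))  # 1 + cos((0,1),(1,1)) ≈ 1.7071
```

Only the aligned pairs contribute: the first pick consumes the exact match, and each remaining word can be matched at most once.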

Classification Approaches
The second approach for our LSSD task is to treat it as a binary classification task to identify the sarcastic or literal sense of a target word t. We have two classification tasks, S vs. L and S vs. L_sent, for each of the 37 target words. We use the libSVM toolkit (Chang and Lin, 2011). The development data is used for tuning parameters.
SVM Baseline: The SVM baseline for the LSSD tasks uses n-gram and lexicon-based binary-valued features that are commonly used in existing state-of-the-art sarcasm detection approaches (González-Ibáñez et al., 2011; Tchokni et al., 2014). They are derived from i) bag-of-words (BoW) representations, ii) the LIWC dictionary (Pennebaker et al., 2001), and iii) a list of interjections (e.g., "ah", "oh", "yeah"), punctuation marks (e.g., "!", "?"), and emoticons collected from Wikipedia. The CMU Tweet Tokenizer (http://www.ark.cs.cmu.edu/TweetNLP/) is employed for tokenization. We keep unigrams unchanged when all their characters are upper-case (e.g., "NEVER" in "A shooting in Oakland? That NEVER happens! #sarcasm"), but otherwise words are converted to lower case. We also map all numbers to a generic number token "22". To avoid any bias during the experiments, we removed from the tweets the target words as well as any hashtag used to determine the sense of the tweet (e.g., #sarcasm, #sarcastic, #happy, #sad).
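This normalization step can be sketched with our own helper below; the sense-hashtag set and the exact number pattern are assumptions, and the paper's pipeline uses the CMU Tweet Tokenizer rather than whitespace splitting.

```python
import re

SENSE_TAGS = {"#sarcasm", "#sarcastic", "#happy", "#sad"}

def preprocess(tweet, target):
    """Normalize a tweet for the n-gram features: keep all-caps tokens,
    lowercase the rest, map numbers to the generic token "22", and drop
    the target word and the sense-bearing hashtags."""
    out = []
    for tok in tweet.split():
        if tok.lower() in SENSE_TAGS or tok.lower() == target:
            continue  # avoid bias from the label-bearing tokens
        if re.fullmatch(r"\d+([.,]\d+)*", tok):
            out.append("22")   # generic number token
        elif tok.isupper() and len(tok) > 1:
            out.append(tok)    # preserve emphasis like "NEVER"
        else:
            out.append(tok.lower())
    return out

print(preprocess("That NEVER happens! I waited 45 minutes #sarcasm", "love"))
# ['that', 'NEVER', 'happens!', 'i', 'waited', '22', 'minutes']
```

The resulting token list would then feed the BoW, LIWC, and interjection/punctuation feature extractors.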
SVM with MVME_we Kernel: We propose a new kernel, kernel_we, to compute the semantic similarity between two tweets u_r and u_s using the MVME_we method introduced for the DSM approach and the three types of word embeddings (WTMF, word2vec, and GloVe). The similarity measure in the kernel follows the MVME_we procedure described in Algorithm 1, but instead of measuring the similarity between the sense vectors of t (v_s, v_l) and the vector representation of t in the test message (v_u), we now measure the similarity between two tweets u_r and u_s. For each k-th word w_k in u_r and l-th word w_l in u_s, we compute the cosine similarity between the embedded vectors of the words and fill a similarity matrix M. We select the matrix cell with the highest similarity, add this score to the total similarity Sim, remove the corresponding row and column from M, and repeat the procedure (as in Algorithm 1). The MVME_we algorithm thus carefully chooses the best candidate word w_l in u_s for each word w_k in u_r, since w_l is the most similar word to w_k, and continues in the same way for all the remaining words in u_r and u_s. The final Sim is used as the kernel similarity between u_r and u_s. We plug this kernel kernel_we into libSVM, and during evaluation we run supervised LSSD classification for each target word t separately.
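Since both libSVM and scikit-learn accept precomputed kernels, the plumbing can be sketched as follows. Here scikit-learn's SVC stands in for libSVM, the 2-dimensional "embeddings" are toy values, and a compact greedy alignment score stands in for the full MVME_we; all names and vectors are illustrative.

```python
import math
import numpy as np
from sklearn.svm import SVC

# Toy 2-d "embeddings"; in the paper each word has a 100-d vector.
EMB = {
    "love": [0.9, 0.1], "dentist": [-0.5, -0.8],
    "going": [0.1, 0.3], "stroll": [0.3, 0.7], "park": [0.4, 0.8],
}

def cos(u, v):
    n = sum(a * b for a, b in zip(u, v))
    d = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return n / d if d else 0.0

def tweet_sim(t1, t2):
    """Greedy one-to-one word alignment score between two tweets."""
    w1 = [EMB[w] for w in t1 if w in EMB]
    w2 = [EMB[w] for w in t2 if w in EMB]
    M = [[cos(a, b) for b in w2] for a in w1]
    rows, cols, sim = set(range(len(w1))), set(range(len(w2))), 0.0
    while rows and cols:
        r, c = max(((r, c) for r in rows for c in cols),
                   key=lambda rc: M[rc[0]][rc[1]])
        if M[r][c] <= 0:
            break
        sim += M[r][c]
        rows.discard(r)
        cols.discard(c)
    return sim

tweets = [["love", "going", "dentist"], ["love", "dentist"],
          ["love", "stroll", "park"], ["love", "park"]]
y = [1, 1, 0, 0]  # 1 = sarcastic sense, 0 = literal sense

# Gram matrix of pairwise tweet similarities, fed to the SVM directly.
gram = np.array([[tweet_sim(a, b) for b in tweets] for a in tweets])
clf = SVC(kernel="precomputed").fit(gram, y)

test = [["love", "dentist", "going"]]
k_test = np.array([[tweet_sim(t, b) for b in tweets] for t in test])
print(clf.predict(k_test))
```

At prediction time the kernel matrix has one row per test tweet and one column per training tweet, which is the shape SVC expects for a precomputed kernel.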

Results and Discussions
Tables 4 and 5 show the results for the LSSD experiments using distributional approaches and classification-based approaches, respectively. For brevity, we report only the average Precision (P), Recall (R), and F1 scores with their standard deviation (SD) (given by '±'), and the targets with maximum/minimum F1 scores. w2v_sg and w2v_cbow denote the skip-gram and CBOW models implemented in word2vec, respectively. Table 4 presents the results of the distributional approaches (Section 3.2.1). We observe that the word embedding methods outperform the PPMI baseline on both the S vs. L and S vs. L_sent disambiguation tasks. Also, the average P/R/F1 scores for S vs. L are much higher than for S vs. L_sent. Since all tweets with the L_sent sense were collected using sentiment hashtags (González-Ibáñez et al., 2011), they may be lexically more similar to the S tweets than the L tweets are, making S vs. L_sent the harder task. In Table 4 we also observe that the average F1 scores of WTMF, w2v_sg, w2v_cbow, and GloVe are comparable, between 84% and 86%, with w2v_sg and w2v_cbow achieving slightly higher F1. Table 5 presents the results of the classification approaches: the SVM baseline (SVM_bl) and SVM using kernel_we with word embeddings (kernel_WTMF, kernel_GloVe, kernel_w2v_sg, and kernel_w2v_cbow). The classification approaches give better performance than the distributional approaches.
SVM_bl is around 7-8% higher than PPMI_bl and comparable with the word embedding methods used in the distributional approaches (Table 4). In addition, our new SVM kernel method using word embeddings shows significantly better results than SVM_bl (and the distributional approaches). For instance, for the S vs. L task, the average F1 is 96-97%, more than 10% higher than SVM_bl. Similarly, for the S vs. L_sent task, the F1 scores of the kernel using word2vec embeddings are in the range of 83-84%, compared to 77% for SVM_bl, an absolute increase of 7%. As stated earlier, the MVME algorithm aligns similar word pairs in its inputs, which works well for short texts (i.e., tweets). Thus, the MVME algorithm combined with word embeddings in kernel_we results in very high F1. Among the word embedding models, the word2vec models give marginally better results than GloVe and WTMF, and GloVe marginally outperforms WTMF. As in Table 4, the average F1 scores for the S vs. L task are higher than for S vs. L_sent.
In terms of the best and worst performing targets, SVM_bl prefers targets with more training data (e.g., "yeah" and "love" vs. "sweet" and "attractive"; see Table 2). In contrast, the word embedding models achieve very high F1 for "joy" and "mature", two targets with comparatively few training instances, using both distributional and classification approaches (Tables 4 and 5). This can be explained by the fact that for words such as "joy", "mature", "cute", and "brilliant", the contexts of the literal and sarcastic senses are quite different, and DSMs and word embeddings are able to capture the difference. For example, in Table 3, negative sentiment words such as "sick", "working", and "snow" are context words for the targets "joy" and "love", whereas positive sentiment words such as "blessed", "family", "christmas", and "peace" are context words for the L or L_sent senses. Overall, out of 37 targets, only 5 ("mature", "joy", "cute", "love", and "yeah") achieved the maximum F1 scores across the various experimental settings (Tables 4 and 5), whereas targets such as "interested", "genius", and "attractive" achieved low F1 scores.
In terms of variance, the SVM results show low SD (0-4%). For the distributional approaches, the SD is slightly higher (5-8%) in several cases.

Related Work
Two lines of research are directly relevant to our work: sarcasm detection in Twitter and the application of distributional semantics, such as word embedding techniques, to various NLP tasks. In contrast to current research on sarcasm and irony detection (Davidov et al., 2010; Riloff et al., 2013; Liebrecht et al., 2013; Maynard and Greenwood, 2014), we have introduced a reframing of this task as a type of word sense disambiguation problem, where the sense of a word is sarcastic or literal. Our SVM baseline uses the lexical features proposed in previous research on sarcasm detection (e.g., the LIWC lexicon, interjections, pragmatic features) (Liebrecht et al., 2013; González-Ibáñez et al., 2011; Reyes et al., 2013). Our analysis of target words where the sarcastic sense is the opposite of the literal sense is related to the idea of "positive sentiment toward a negative situation" proposed by Riloff et al. (2013) and recently studied by Joshi et al. (2015). In our approach, we chose distributional semantic methods that learn contextual information about targets effectively from a large corpus containing both literal and sarcastic uses of words, and we show that word embeddings are highly accurate in predicting the sarcastic or literal sense of a word (Tables 4 and 5). This approach has the potential to capture more nuanced cases of sarcasm, beyond "positive sentiment towards a negative situation" (e.g., one of our target words was "shocked", which is negative). However, our current framing is still inherently limited to cases where sarcasm is characterized as a figure of speech in which the author means the opposite of what she says, due to our approach of selecting the target words.
Low-dimensional text representations, such as WTMF, have been successful in WSD research and in computing similarity between short texts (Guo and Diab, 2012a; Guo and Diab, 2012b). word2vec and GloVe representations have provided state-of-the-art results on various word similarity and analogy detection tasks (Mikolov et al., 2013c; Mikolov et al., 2013b; Pennington et al., 2014). Word embedding based models are also used for other NLP tasks such as dependency parsing, semantic role labeling, POS tagging, NER, and question answering (Bansal et al., 2014; Collobert et al., 2011; Weston et al., 2015); our work on LSSD is a novel application of word embeddings.

Conclusion and Future Work
We proposed a reframing of the sarcasm detection task as a type of word sense disambiguation problem, where the sense of a word is its sarcastic or literal sense. Using a crowdsourcing experiment and unsupervised methods for detecting semantically opposite phrases, we collected a set of target words to be used in the LSSD task. We compared several distributional semantics methods and showed that using word embeddings in a modified SVM kernel achieves the best results (an increase of 10% F1 and 8% F1 for the S vs. L and S vs. L_sent disambiguation tasks, respectively, over an SVM baseline). While the SVM baseline preferred larger amounts of training data (its best performance was achieved on the target words with a higher number of training examples), the methods using word embeddings seem to perform well on target words where there is an inherent difference between the contextual sarcastic and literal uses of the word, even when the training data is smaller.
We want to investigate further the nature and size of training data useful for the LSSD task.
For example, to test the effect of a larger training dataset, we utilized pre-trained word vectors from GloVe (trained on 2 billion tweets, with 100 dimensions). For S vs. L disambiguation, the average F1 was 88.9%, which is 7% lower than the result using GloVe trained on our own, much smaller set of tweets collected for the LSSD task. This suggests that the training data used to create the pre-trained GloVe embeddings probably does not contain enough sarcastic tweets.
Regarding the size of the training data, recall that the unsupervised alignment approach had extracted 70 target words (Section 2), although we have used 37 target words as we did not have enough training data for the remaining targets. Thus, we plan to collect more training data for these targets as well as more target words (especially for the S vs. L sent task). In addition, we plan to improve our unsupervised methods for detecting semantically opposite meaning (e.g., using the IM-IM dataset in addition to the SM-IM dataset).
One common criticism of research that uses hashtags as gold labels is that the training utterances could be noisy. In other words, tweets might be sarcastic but not carry the #sarcasm or #sarcastic hashtags. We performed a small manual validation on a dataset of 180 tweets from the L_sent class using 3 annotators, asking them to judge whether each tweet is sarcastic or not. When requiring agreement among all 3 annotators, none of the tweets were judged sarcastic; with agreement from only 2 annotators, 1 tweet out of 180 was judged sarcastic. In the future, we plan to perform additional experiments to study the issue of noisy data. We hope that the release of our datasets will stimulate other studies related to the sarcasm detection problem, including addressing the issue of noisy data.
We also plan to study the effect of hyperparameters in designing the DSMs. Recently, Levy et al. (2015) argued that parameter settings have a large impact on the success of word embedding models. We want to follow their experiments to study whether parameter tuning in PMI-based disambiguation can improve its performance. (The pre-trained GloVe vectors mentioned above were downloaded from http://nlp.stanford.edu/projects/glove/.)