Sparse, Contextually Informed Models for Irony Detection: Exploiting User Communities, Entities and Sentiment

Automatically detecting verbal irony (roughly, sarcasm) in online content is important for many practical applications (e.g., sentiment detection), but it is difficult. Previous approaches have relied predominantly on signal gleaned from word counts and grammatical cues. But such approaches fail to exploit the context in which comments are embedded. We thus propose a novel strategy for verbal irony classification that exploits contextual features, specifically by combining noun phrases and sentiment extracted from comments with the forum type (e.g., conservative or liberal) to which they were posted. We show that this approach improves verbal irony classification performance. Furthermore, because this method generates a very large feature space (and we expect predictive contextual features to be strong but few), we propose a mixed regularization strategy that places a sparsity-inducing $\ell_1$ penalty on the contextual feature weights on top of the $\ell_2$ penalty applied to all model coefficients. This increases model sparsity and reduces the variance of model performance.


Introduction and Motivation
Automated verbal irony detection is a challenging problem. But recognizing when an author has intended a statement ironically is practically important for many text classification tasks (e.g., sentiment detection).
Previous models for irony detection (Lukin and Walker, 2013; Riloff et al., 2013) have relied predominantly on features intrinsic to the texts to be classified. By contrast, here we propose exploiting contextualizing information, which is often available for web-based classification tasks. More specifically, we exploit signal gleaned from the conversational threads to which comments belong. Our approach capitalizes on the intuition that members of different user communities are likely to be sarcastic about different things. As a proxy for user community, we leverage knowledge of the specific forums to which comments were posted. For example, one may surmise that the statement 'I really am proud of Obama' is likely to have been intended ironically if it was posted to a forum frequented by political conservatives. But if this same utterance were posted to a liberal-leaning forum, it is more likely to have been intended in earnest. This sort of information is often directly or indirectly available on social media, but previous models have not capitalized on it. This is problematic; recent work has shown that humans require such contextualizing information to infer ironic intent (Wallace et al., 2014).
Figure 1: A reddit comment illustrating contextualizing features that we propose leveraging to improve classification. Here the highlighted entities (external to the comment text itself) provide contextual signals indicating that the shown comment was intended ironically. As we shall see, Obamacare is in general a strong indicator of irony when present in posts to the conservative subreddit, but less so in posts to the progressive subreddit.
As a concrete example, we consider the task of identifying verbal irony in comments posted to reddit (http://www.reddit.com), a social news website. Users post content (e.g., links to news stories) to reddit, which is then voted on by the community. Users may also discuss this content on the website; these are the comments that we will work with here. Reddit comprises many subreddits, which are user communities centered around specific topics of interest. In this work we consider comments posted to two pairs of polarized user communities, or subreddits: (1) progressive and conservative subreddits (comprising individuals on the left and right of the US political spectrum, respectively), and (2) atheism and Christianity subreddits.
Our aim is to develop a model that can recognize verbal irony in comments posted to such forums, e.g., automatically discern that the user who posted the comment shown in Figure 1 intended his or her comment ironically. To this end, we propose a strategy that capitalizes on available contextualizing information, such as interactions between the user community (subreddit) that comments were posted to, extracted entities (here we use noun phrases, or NNPs) and inferred sentiment.
The contributions of this work are summarized as follows.
• We demonstrate that contextual information, such as inferred user-community (in this case, the subreddit) can be crossed with extracted entities and sentiment to improve detection of verbal irony. This improves performance over baseline models (including those that exploit inferred sentiment, but not context).
• We introduce a novel composite regularization strategy that applies a sparsifying $\ell_1$ penalty to the contextual/sentiment/entity feature weights in addition to the standard squared $\ell_2$ penalty to all feature weights. This induces more compact, interpretable models that exhibit lower variance.
While discerning ironic comments on reddit is our immediate task, the proposed approach is generally applicable to a wide range of subjective, web-based text classification tasks. Indeed, this approach would be useful for any scenario in which we expect different groups of individuals producing content to tend to discuss different entities in a way that correlates with the target categorization. The key is in identifying an available proxy for user groupings (here we rely on the subreddits to which a comment was posted). Such information is often available (or can be derived) for comments posted to different media on the web: for example, on Twitter we know whom a user follows, and on YouTube we know the channels to which videos belong.

Exploiting context

Communities and sentiment
As discussed above, a shortcoming of existing models for detecting sarcasm/verbal irony on the web is their failure to capitalize on contextualizing information. But such information is critical to discerning irony. A large body of work on the use and interpretation of verbal irony supports this supposition (Grice, 1975; Clark and Gerrig, 1984; Wallace, 2013; Wallace et al., 2014). Individuals will be more likely, in general, to use sarcasm when discussing specific entities. Which entities will depend in part on the community to which the individual belongs. As a proxy for user community, here we leverage the subreddits to which comments were posted. Sentiment may also play an important role. In general, verbal irony is almost always used to convey negative views via ostensibly positive utterances (Sperber and Wilson, 1981). And recent work (Riloff et al., 2013) has exploited features based on sentiment to improve irony detection.
To summarize: when assuming an ironic voice we expect that individuals will convey ostensibly positive sentiment about entities, and that these entities will depend on the type of individual in question. We propose capitalizing on such information by introducing features that encode subreddits, sentiment and noun phrases (NNPs), as we describe next.

Features
We leverage the feature sets enumerated in Table 1. Subreddits are observed variables. Noun phrase (NNP) extraction and sentiment inference are performed automatically via state-of-the-art NLP tools. In particular, we use the Stanford Sentiment Analysis tool (Socher et al., 2013) to infer sentiment. To extract NNPs we use the Stanford Part of Speech tagger (Toutanova et al., 2003). We then introduce 'bag-of-NNP' features and features that indicate whether the sentiment inferred for a given sentence was positive or not.

Feature | Description
Sentiment | The inferred sentiment (negative/neutral or positive) for a given comment.
Subreddit | The subreddit (e.g., progressive or conservative; atheism or Christianity) to which a comment was posted.
NNP | Noun phrases (e.g., proper nouns) extracted from comment texts.
NNP+ | Noun phrases extracted from comment texts and the thread to which they belong (for example, 'Obamacare' from the title in Figure 1).
Table 1: Feature types that we exploit. We view the (observed) subreddit as a proxy for user type. We combine this with sentiment and extracted noun phrases (NNPs) to improve classifier performance.

Additionally, we introduce 'interaction' features that capture combinations of these. For example, one such feature indicates whether a given sentence mentions Obamacare (which will be one of many NNPs automatically extracted) and was posted in the conservative subreddit. This is an example of a two-way interaction. We also experiment with three-way interactions, crossing sentiment with NNPs and subreddits. An example is a feature that indicates whether a sentence was inferred to be positive, mentions Obamacare (NNP), and was part of a comment made in the conservative subreddit. Finally, we experiment with adding NNPs extracted from the comment thread in addition to the comment text.
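The two-way and three-way interactions described above can be sketched as follows. This is a minimal illustration only: the function name `interaction_features`, the feature-key format, and the choice to emit an indicator only for positive sentiment are assumptions made for the example, not details taken from our implementation.

```python
def interaction_features(nnps, subreddit, positive):
    """Build sparse binary features for one sentence (illustrative sketch).

    nnps:      iterable of extracted noun-phrase strings
    subreddit: name of the forum the comment was posted to
    positive:  True if the inferred sentiment is positive
    The key format ("NNP=...|sub=...") is a hypothetical naming scheme.
    """
    feats = {}
    if positive:
        feats["sent=pos"] = 1.0                      # sentiment indicator
    for nnp in set(nnps):
        feats[f"NNP={nnp}"] = 1.0                    # bag-of-NNP feature
        feats[f"NNP={nnp}|sub={subreddit}"] = 1.0    # two-way interaction
        if positive:
            # three-way interaction: NNP x subreddit x positive sentiment
            feats[f"NNP={nnp}|sub={subreddit}|sent=pos"] = 1.0
    return feats


example = interaction_features(["obamacare"], "conservative", positive=True)
```

Representing features as a dictionary keeps the vector sparse, which matters because the cross-product of entities and subreddits is large while any single sentence activates only a handful of entries.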
These are rich features that capture signal not directly available from the sentences themselves. Features that encode subreddits crossed with extracted NNPs, in particular, offer a chance to explicitly account for differences in how the ironic device is used by individuals in different communities. However, this has the downside of introducing a large number of irrelevant terms into the model: we expect, a priori, that many entities will not correlate with the use of verbal irony. We would therefore expect this strategy to exhibit high variance in terms of predictive performance, and we later confirm this empirically. Ideally, a model would perform feature selection during parameter estimation, thus dropping irrelevant interaction terms. We next introduce a composite $\ell_1$/$\ell_2$ regularization strategy toward this end.

Enforcing sparsity

Preliminaries
In this work we consider linear models with binary outputs ($y \in \{-1, +1\}$). We will assume we have access to a training dataset comprising $n$ instances, $x = \{x_1, ..., x_n\}$, and associated labels $y = \{y_1, ..., y_n\}$. We then aim to find a weight vector $w$ that optimizes the following objective:
$$\arg\min_{w} \sum_{i=1}^{n} L(w \cdot x_i, y_i) + \alpha R(w) \qquad (1)$$
where $L$ is a loss function, $R(w)$ is a regularization term and $\alpha$ is a parameter expressing the relative emphasis placed on achieving minimum empirical loss versus producing a simple model (i.e., a weight vector with small weights). Typically one searches for a good $\alpha$ using the available training data. For $L$, we will use the log-loss in this work, though other loss functions may be used in its place.

Sparsity via Regularization
Concerning $R$, one popular regularization function is the squared $\ell_2$ norm:
$$R(w) = \|w\|_2^2 = \sum_{j} w_j^2$$
This is the norm used in the standard Support Vector Machine (SVM) formulation, for example, and has been shown empirically to work well for text classification (Joachims, 1998). An alternative is to use the $\ell_1$ norm:
$$R(w) = \|w\|_1 = \sum_{j} |w_j|$$
which has the advantage of inducing sparse models: i.e., using the $\ell_1$ norm as a penalty tends to drive feature weights to 0. Returning to the present task of detecting verbal irony in comments, it seems reasonable to assume that there will be a relatively small set of entities that correlate with sarcasm. But because we are introducing 'interaction' features that enumerate the cross-product of subreddits and entities (and, in some cases, sentiment), we have a large feature space. This space includes features that correspond to NNPs extracted from, and sentiment inferred for, the sentence itself: we will denote the indices for these by $I$. Other interaction features correspond to entities extracted from the threads associated with comments: we denote the corresponding set of indices by $T$. We expect only a fraction of the features comprising both $I$ and $T$ to have non-zero weights (i.e., to signal ironic intent).
This scenario is prone to high variance, and hence calls for stronger regularization.
But in general replacing the squared $\ell_2$ norm with an $\ell_1$ penalty (over all weights) hampers classification performance (indeed, as we later report, this strategy performs very poorly here). Therefore, in our scenario we would like to place a sparsifying $\ell_1$ regularizer over the contextual (interaction) features while still leveraging the squared $\ell_2$-norm penalty for the standard bag-of-words (BoW) features. We thus propose the following composite penalty:
$$\sum_{j} w_j^2 + \sum_{j \in I} |w_j| + \sum_{j \in T} |w_j|$$
The idea is that this will drive many of the weights associated with the contextual features to zero, which is desirable in light of the intuition that a relatively small number of entities will likely indicate sarcasm. At the same time, this composite penalty applies only the squared $\ell_2$ norm to the standard BoW features, given the comparatively strong predictive performance realized with this strategy.
Putting this together, we modify the original objective (Equation 1) as follows:
$$\arg\min_{w} \sum_{i=1}^{n} L(w \cdot x_i, y_i) + \alpha_0 \sum_{j} w_j^2 + \alpha_1 \sum_{j \in I} |w_j| + \alpha_2 \sum_{j \in T} |w_j|$$
where we have placed separate $\alpha$ scalars on the respective penalty terms. Note that this is similar to the elastic net (Zou and Hastie, 2005) joint regularization and variable selection strategy. The distinction here is that we only apply the $\ell_1$ penalty to (i.e., perform feature selection for) the subset of 'interaction' feature weights, in contrast to the elastic net, which imposes the composite penalty on all feature weights. One can view this as using the regularizer to encourage a sparsity pattern specific to the task at hand.
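To make the composite penalty concrete, the following is a minimal NumPy sketch of the regularizer and of one stochastic update under the log-loss. The names `composite_penalty`, `sgd_step`, `sent_idx` and `thread_idx`, and the use of three separate weights `a0`, `a1`, `a2`, are assumptions made for illustration. The $\ell_1$ part is handled here with a plain soft-thresholding (proximal) step, chosen for brevity; it is not necessarily the update rule used in our experiments.

```python
import numpy as np


def composite_penalty(w, sent_idx, thread_idx, a0, a1, a2):
    """a0 * sum_j w_j^2 + a1 * sum_{j in I} |w_j| + a2 * sum_{j in T} |w_j|."""
    return (a0 * np.sum(w ** 2)
            + a1 * np.sum(np.abs(w[sent_idx]))
            + a2 * np.sum(np.abs(w[thread_idx])))


def log_loss(w, X, y):
    """Sum over instances of log(1 + exp(-y_i * w.x_i)), with y_i in {-1, +1}."""
    return np.sum(np.logaddexp(0.0, -y * (X @ w)))


def sgd_step(w, x, y, lr, sent_idx, thread_idx, a0, a1, a2):
    """One update on a single instance (x, y); returns a new weight vector."""
    margin = y * np.dot(w, x)
    # d/dw log(1 + exp(-margin)) = -y * x / (1 + exp(margin)), computed stably.
    grad = -y * x * np.exp(-np.logaddexp(0.0, margin)) + 2.0 * a0 * w
    w = w - lr * grad
    # Soft-threshold only the interaction weights (index sets I and T);
    # bag-of-words weights receive the squared l2 shrinkage alone.
    for idx, a in ((sent_idx, a1), (thread_idx, a2)):
        w[idx] = np.sign(w[idx]) * np.maximum(np.abs(w[idx]) - lr * a, 0.0)
    return w
```

Because the thresholding touches only the indices in `sent_idx` and `thread_idx`, small interaction weights are set exactly to zero while the remaining weights are merely shrunk, which is the sparsity pattern the composite penalty is meant to encourage.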

Inference
We fit this model via Stochastic Gradient Descent (SGD). During each update, we impose both the squared $\ell_2$ and $\ell_1$ penalties; the latter is applied only to the contextual/interaction features in $I$ and $T$. For the $\ell_1$ penalty, we adopt the cumulative truncated gradient method proposed by Tsuruoka et al. (2009).

Experimental Setup

Datasets
For our development dataset, we used a subset of the reddit irony corpus (Wallace et al., 2014) comprising annotated comments from the progressive and conservative subreddits. We also report results from experiments performed using a separate, held-out portion of this data, which we did not use during model refinement. Furthermore, we later present results on comments from the atheism and Christianity subreddits (we did not use this data during model development, either). The development dataset includes 1,825 annotated comments (876 and 949 from the progressive and conservative subreddits, respectively). These comprise 5,625 sentences in total, each of which was independently labeled by three annotators as having been intended ironically or not. For additional details on the annotation process, see Wallace et al. (2014). For simplicity, we consider a sentence to be 'ironic' ($y = 1$) when at least two of the three annotators designated it as such, and 'unironic' ($y = -1$) otherwise. Using this criterion, 286 (5%) of the labeled sentences are labeled 'ironic'.
The test portion of the political dataset comprises 996 annotated comments (409 progressive and 587 conservative comments), totalling 2,884 sentences. Using the same criterion as above (at least two of the three annotators labeling a given sentence as 'ironic'), we have 154 'ironic' sentences (again about 5%).
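The labeling rule applied to both portions of the data can be written down directly. The sketch below assumes each sentence comes with exactly three binary annotator votes; the function name `sentence_label` is hypothetical.

```python
def sentence_label(votes):
    """Return +1 ('ironic') if at least two of three annotators voted ironic,
    and -1 ('unironic') otherwise."""
    if len(votes) != 3:
        raise ValueError("expected exactly three annotator votes")
    return 1 if sum(bool(v) for v in votes) >= 2 else -1
```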

Experimental Details
We recorded results from 500 independently performed experiments on random train (80%)/test (20%) splits of the data. These splits were performed at the comment (rather than sentence) level, so as not to test on sentences belonging to comments encountered in the training set. We measured performance, however, at the sentence level (often only a single sentence in a given comment will have been labeled as 'ironic').
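Splitting at the comment level, while evaluating at the sentence level, can be sketched as below. The tuple layout `(comment_id, sentence, label)` and the function name `split_by_comment` are assumptions for illustration; the point is only that all sentences of a comment land on the same side of the split.

```python
import random


def split_by_comment(sentences, test_frac=0.2, seed=0):
    """Split sentence records into train/test sets by comment.

    sentences: list of (comment_id, sentence_text, label) tuples.
    Every sentence from a given comment ends up in exactly one set.
    """
    rng = random.Random(seed)
    comment_ids = sorted({cid for cid, _, _ in sentences})
    rng.shuffle(comment_ids)
    n_test = max(1, round(test_frac * len(comment_ids)))
    test_ids = set(comment_ids[:n_test])
    train = [s for s in sentences if s[0] not in test_ids]
    test = [s for s in sentences if s[0] in test_ids]
    return train, test
```

Repeating this with different seeds yields independent random splits of the kind used in the experiments.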
Our baseline approach is a standard squared $\ell_2$-regularized log-loss linear model (fit via SGD) that leverages uni- and bi-grams and features indicating grammatical cues, such as exclamation points and emoticons. We also experiment with a model that includes inferred sentiment indicators, but not context. We performed standard English stopwording, and we used Term Frequency Inverse-Document Frequency (TF-IDF) feature weighting. For the gradient descent procedure, we used a decaying learning rate (specifically, $1/t$, where $t$ is the update count). We performed a coarse grid search to find values for $\alpha$ that maximize $F_1$ on the training datasets. We took five full passes over the training data before terminating descent.
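The grid search above selects $\alpha$ by $F_1$, which combines precision and recall. A minimal sketch of these standard metrics for labels in $\{-1, +1\}$ follows; returning 0.0 when a denominator is zero is a convention assumed here, and the function name is hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive ('ironic') class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```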
We report paired recalls and precisions, as observed on each random train/test split of the data. The former is defined as $TP/(TP+FN)$ and the latter as $TP/(TP+FP)$, where $TP$ denotes the true positive count, $FN$ the number of false negatives and $FP$ the false positive count. We report these separately, rather than collapsing them into $F_1$, because it is not clear that one would value recall and precision equally for irony detection, and because this allows us to tease out how the models differ in performance. Notably, for example, sentiment and context features both improve recall, but the latter does so without harming precision.
Table 2 summarizes the performance of the different approaches over 500 independently performed train/test splits of the political development corpus. For reference, a random chance strategy (which predicts 'ironic' with probability equal to the observed prevalence) achieves a median recall of 0.048 and a median precision of 0.047. Figure 2 shows histograms of the observed absolute differences between the baseline linear classifier and the proposed augmentations. Adding the proposed features (which capitalize on sentiment and NNP-mentions on specific subreddits) increases absolute median recall by 3.4 percentage points (a relative gain of ∼12%). And this is achieved without sacrificing precision (in contrast to exploiting only sentiment). Furthermore, as we can see in Figures 2 and 3, the proposed regularization strategy shrinks the variance of the classifier.
This variance reduction is achieved through greater model sparsity (see Figure 4), which also improves interpretability. We note that leveraging only an $\ell_1$ regularization penalty (with the full feature set) results in very poor performance (median recall and precision of 0.05 and 0.09, respectively). Similarly, the elastic-net strategy (Zou and Hastie, 2005), in which we do not specify which features to apply the $\ell_1$ penalty to, here achieves a median recall of 0.11 and a median precision of 0.07.
Figure 2: Results from 500 independent train/test splits of the development subset of our political data. Shown are histograms with smoothed kernel density estimates of differences in recall and precision between the baseline bag-of-words based approach and each feature space/method (one per row). The solid black line at 0 indicates no difference; solid and dotted blue lines demarcate means and medians, respectively. Features are as in Table 1. The × symbol denotes interactions; + indicates addition. The proposed contextual features substantially improve recall, with little to no loss in precision. Moreover, in general, the $\ell_1\ell_2$ regularization approach reduces variance. (We note that in constructing histograms we have excluded a handful of points, never more than 1%, where the difference exceeded 0.15.)
Table 2: Summary results over 500 random train/test splits of the development dataset. The top row reports mean and median baseline (BoW) recall and precision and lower and upper (25th and 75th) percentiles. We report pairwise differences w.r.t. this baseline in terms of recall and precision for each strategy. Exploiting NNP features and subreddits improves recall with little to no cost in precision. Capitalizing on sentiment alone improves recall but at a greater cost in precision. The proposed $\ell_1\ell_2$ regularization strategy achieves comparable performance with fewer features, and shrinks the variance over different train/test splits (as can be seen in Figure 2).
Figure 4: ... Figure 3) using standard $\ell_2$-norm (left) and the proposed $\ell_1\ell_2$-norm (right) regularization approaches on the atheism/Christianity data over 500 independent train/test splits. The composite norm achieves much greater sparsity, resulting in lower variance. This sparsity also (arguably) provides greater interpretability; one can inspect contextual features with non-zero weights.
Table 3: Results on the atheism and Christianity subreddits. In general sentiment does not help on this dataset (see row 1). But the NNP and subreddit features again consistently improve recall without hurting precision. And, as above, $\ell_1\ell_2$ regularization shrinks variance (see Figures 2 and 3).

Results on the Held-out (Test) Corpus
Table 4 reports results on the held-out political test dataset, achieved after training the models on the entirety of the development corpus. To account for the variance inherent to inference via SGD, we performed 100 runs of the SGD procedure and report median results from these runs. These results mostly agree with those reported for the development corpus: the proposed strategy improves median recall on the held-out corpus by nearly 4.0 percentage points, at a median cost of about 1 point in precision. By contrast, sentiment alone provides a 2% absolute improvement in recall at the expense of more than 2 points in precision.
Table 4: Results on the held-out political dataset, using the entire development corpus as a training set. Abbreviations are as described in the caption for Figure 2. Due to the variance inherent to the stochastic gradient descent procedure, we repeat the experiment 100 times and report the median performance and standard deviations (of different SGD runs). Results are consistent with those reported for the development corpus.

Results on the religion dataset
To assess the general applicability of the proposed approach, we also evaluate the method on comments from a separate pair of polarized communities: atheism and Christianity, as described in Section 4.1. This dataset was not used during model development. We follow the experimental setup described in Section 4.2.
In this case, capitalizing on the NNP × subreddit features produces a mean 2.3% absolute gain in recall (median: 2.4%) over the baseline approach, with a (very) slight gain in precision. The $\ell_1\ell_2$ approach achieves a lower expected gain in recall (median: 1.5%), but again shrinks the variance w.r.t. model performance (see Figure 3). Moreover, as we show in Figure 4, this is achieved with a much more compact (sparser) model. We note that for the religion data, inferred sentiment features do not seem to improve performance, in contrast to the results on the political subreddits. At present, we are not sure why this is the case.
These results demonstrate that introducing features that encode entities and user communities (NNPs × subreddit) improves recall for irony detection in comments addressing relatively diverse topics (politics and religion).

Predictive features
We report the interaction features that are the best predictors of verbal irony in the respective subreddits (for both polar community pairs). Specifically, we estimated the weights for every interaction feature using the entire training dataset, and repeated this process 100 times to account for variation due to the SGD procedure. Table 5 displays the top 10 NNP × subreddit features for the political subreddits, with respect to the mean magnitude of the weights associated with them. We report these means and the standard deviations calculated across the 100 runs. This table implies, for example, that mentions of 'freedom' and 'kenya' indicate irony in the progressive subreddit, while mentions of 'obamacare' and 'president' (for example) in the conservative subreddit tend to imply irony. Table 6 reports analogous results for the religion subreddits. Here we can see, e.g., that 'god' is a good predictor of irony in the atheism subreddit, and 'professor' is in the Christianity subreddit.
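The ranking procedure just described can be sketched as follows. The function name `top_features` is hypothetical, and "mean magnitude" is interpreted here as the mean of absolute weights across runs, which is an assumption; the mean and standard deviation of the signed weights are returned alongside each feature name.

```python
import numpy as np


def top_features(weight_runs, names, k=10):
    """Rank features by mean absolute weight across repeated SGD runs.

    weight_runs: array-like of shape (n_runs, n_features)
    names:       list of n_features feature names
    Returns a list of (name, mean_weight, std_weight) for the top k features.
    """
    W = np.asarray(weight_runs, dtype=float)
    mean_w = W.mean(axis=0)
    std_w = W.std(axis=0)
    order = np.argsort(-np.abs(W).mean(axis=0))[:k]
    return [(names[j], float(mean_w[j]), float(std_w[j])) for j in order]
```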
We also report the top ranking 'three-way' interaction features that cross NNPs extracted from sentences with subreddits and the inferred sentiment for the political corpus (Table 7). This would imply, e.g., that if a sentence in the progressive subreddit conveys an ostensibly positive sentiment about 'Ollie' (a conservative political commentator), then this sentence is likely to have been intended ironically. Some of these may seem counter-intuitive, such as ostensibly positive sentiment regarding 'Cruz' (as in the conservative senator Ted Cruz) in the conservative subreddit. On inspection of the comments, it would seem Ted Cruz does not find general support even in this community. Example comments include: "Stay classy Ted Cruz" and "Great idea on the talkathon Cruz". The 'mr' and 'king' terms are almost exclusively references to Obama in the conservative subreddit. In any case, because these are three-way interaction terms, they are all relatively rare: therefore we would caution against over-interpretation here.

Related Work
The task of automated irony detection has recently received a great deal of attention from the NLP and ML communities (Tepperman et al., 2006; Carvalho et al., 2009; Burfoot and Baldwin, 2009; González-Ibáñez et al., 2011; Filatova, 2012; Reyes et al., 2012; Lukin and Walker, 2013; Riloff et al., 2013). This work has mostly focussed on exploiting token-based indicators of verbal irony. For example, it is clear that gratuitous punctuation (e.g. "oh really??!!!") signals irony (Carvalho et al., 2009). Others have proposed a semi-supervised approach in which they look for sentence templates indicative of irony. Elsewhere, Riloff et al. (2013) proposed a method that exploits apparently contrasting sentiment in the same utterance to detect irony. While innovative, these approaches still rely on features intrinsic to comments; i.e., they do not attempt to capitalize on contextualizing features external to the comment text. This means that there will necessarily be certain (subtle) ironies that escape detection by such approaches. For example, without any additional information about the speaker, it would be impossible to deduce whether the comment "Obamacare is a great program" is intended sarcastically.
Other related recent work has shown the promise of sparse models, both for prediction and interpretation (Eisenstein et al., 2011a; Eisenstein et al., 2011b; Yogatama and Smith, 2014a). Yogatama and Smith (2014a; 2014b), e.g., have leveraged the group lasso approach to impose 'structured' sparsity on feature weights. Our work here may similarly be viewed as assuming a specific sparsity pattern (specifically that feature weights for 'interaction features' will be sparse) and expressing this via regularization.

Conclusions and Future Directions
We have shown that we can leverage contextualizing information to improve identification of verbal irony in online comments. This is in contrast to previous models, which have relied predominantly on features that are intrinsic to the texts to be classified. We exploited features that indicate user communities crossed with sentiment and extracted noun phrases. This led to consistently improved recall with little to no cost in precision. We also proposed a novel composite regularization strategy that imposes a sparsifying $\ell_1$ penalty on the interaction features, as we expect most of these to be irrelevant. This reduced performance variance.
Future work will include expanding the corpus and experimenting with datasets outside of the political domain. We also plan to evaluate this strategy on data from different online sources, e.g., Twitter or YouTube.