Augmenting Data for Sarcasm Detection with Unlabeled Conversation Context

We present a novel data augmentation technique, CRA (Contextual Response Augmentation), which utilizes conversational context to generate meaningful samples for training. We also mitigate the issues arising from unbalanced context lengths by changing the input/output format of the model so that it can handle varying context lengths effectively. Our model, trained with the proposed data augmentation technique, won the sarcasm detection task of FigLang2020, achieving the best performance on both the Reddit and Twitter datasets.


Introduction
The performance of many NLP systems largely depends on their ability to understand figurative language such as irony, sarcasm, and metaphor (Pozzi et al., 2016). The results of the Sentiment Analysis task held at SemEval-2014 (Martínez-Cámara et al., 2014), for example, show that marked performance drops occur when figurative language is involved in the task. This work aims, in particular, to design a model that identifies sarcasm in conversational context. More specifically, the goal is to determine whether a response is sarcastic or not, given the immediate context (i.e. only the previous dialogue turn) and/or the full dialogue thread (if available). To evaluate our model, we participated in the FigLang2020 sarcasm challenge and won the competition, ranking 1st out of 35 teams on the Twitter dataset and 1st out of 34 teams on the Reddit dataset.
We summarize the technical contributions that won the challenge as follows: 1. We propose a new data augmentation technique that successfully leverages the structural patterns of the conversational dataset. Our technique, called CRA (Contextual Response Augmentation), utilizes the conversational context of the unlabeled dataset to generate new training samples.
2. The context lengths (i.e. the number of previous dialogue turns) are highly variable across the dataset. To cope with this imbalance, we propose a context ensemble method that exploits multiple context lengths to train the model. The proposed format is easily applicable to any Transformer (Vaswani et al., 2017) encoder without changing the model architecture.
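As a concrete illustration of this input format, here is a minimal sketch (not the authors' code; `build_input` and the turn strings are hypothetical) of packing the k most recent context turns and the response into a single BERT-style input sequence:

```python
# Sketch: pack a response and its most recent context turns into one
# BERT-style input string. [CLS]/[SEP] follow BERT conventions; the
# function name and window default are illustrative assumptions.

def build_input(context_turns, response, window=3):
    """Keep only the `window` most recent context turns, then join
    context and response with [SEP] separators."""
    recent = context_turns[-window:] if window > 0 else []
    parts = recent + [response]
    return "[CLS] " + " [SEP] ".join(parts) + " [SEP]"

# Only the two most recent turns (c3, c4) are kept before the response:
example = build_input(["c1", "c2", "c3", "c4"], "r1", window=2)
```

Because the context is truncated on the string side rather than inside the network, the same encoder handles any window size without architectural changes.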
3. We propose an architecture in which the Transformer encoder is stacked with BiLSTM (Schuster and Paliwal, 1997) and NeXtVLAD (Lin et al., 2018). We observe that NeXtVLAD, a differentiable pooling layer, proves more effective than simple non-parametric mean/max pooling methods.
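To make the pooling idea concrete, below is a simplified NetVLAD-style pooling layer in PyTorch. This is a sketch under our own simplifications, not the exact NeXtVLAD of Lin et al. (2018), which additionally uses feature grouping and an attention gate; the class name and cluster count are our own:

```python
import torch
import torch.nn as nn

class SimpleVLADPooling(nn.Module):
    """Simplified NetVLAD-style pooling (illustrative sketch): each
    timestep is soft-assigned to K learned clusters, and the residuals
    to the cluster centroids are aggregated into a fixed-size vector."""

    def __init__(self, dim, clusters):
        super().__init__()
        self.assign = nn.Linear(dim, clusters)            # soft-assignment logits
        self.centroids = nn.Parameter(torch.randn(clusters, dim))

    def forward(self, x):                                 # x: (B, T, D)
        a = torch.softmax(self.assign(x), dim=-1)         # (B, T, K)
        res = x.unsqueeze(2) - self.centroids             # (B, T, K, D) residuals
        vlad = (a.unsqueeze(-1) * res).sum(dim=1)         # (B, K, D)
        return vlad.flatten(1)                            # (B, K*D)
```

In the paper's stack, such a layer would sit on top of the BiLSTM outputs in place of mean/max pooling; because the assignment is learned, it can emphasize the timesteps most indicative of sarcasm.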

Approach
The task of our interest is, given a response (r1) and its previous conversational context (c1, c2, ..., cn), to predict whether the response r1 is sarcastic or not (see the example in Figure 2). We discuss our model below.

Figure 2: Illustration of the context ensemble method for sarcasm detection. We train multiple models with different context window sizes, and ensemble them for inference.

The Model
Figure 1 describes the architecture of our best-performing model. The model broadly consists of two parts: the transformer (BERT) (Devlin et al., 2018) and pooling layers, which are decomposed into BiLSTM (Schuster and Paliwal, 1997) and NeXtVLAD (Lin et al., 2018), an improved version of NetVLAD (Arandjelovic et al., 2016). Reportedly, NetVLAD is a CNN-based model that is highly effective and more resistant to over-fitting than usual temporal models such as LSTM or GRU (Lin et al., 2018). The implementation of these models is as follows: • BERT (large-cased): 24-layer, 1024-hidden, 16-heads.

Figure 3: Reranking the best responses conditioned on a given context, we obtain various pseudo responses that are useful for training.

Training Details
We use the cross-entropy loss on the last softmax layer of the model. The training batch size is 4 for all experiments. We adopt the cyclic learning rate (Smith, 2017), where the initial learning rate is 1e-6 and the momentum parameters are (0.825, 0.725). Dataset Splitting. We further split the provided training set into training and validation sets, as in Table 1. We use the validation set for early stopping and for validating model performance during the training phase.
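For illustration, a toy triangular cyclic learning-rate schedule in the spirit of Smith (2017) can be sketched as follows; the `max_lr` and `step_size` values here are our own illustrative choices, not the paper's settings:

```python
# Sketch of the "triangular" cyclic learning-rate policy (Smith, 2017).
# max_lr and step_size are illustrative assumptions, not the paper's values.

def cyclic_lr(step, base_lr=1e-6, max_lr=1e-4, step_size=2000):
    """Linearly ramp the LR from base_lr up to max_lr and back down,
    repeating every 2 * step_size steps."""
    cycle_pos = step % (2 * step_size)
    # fraction of the way to the peak, folded at the top of the triangle
    x = 1.0 - abs(cycle_pos / step_size - 1.0)
    return base_lr + (max_lr - base_lr) * x
```

In practice one would use a framework scheduler (e.g. PyTorch's built-in cyclic LR support) rather than hand-rolling this, but the function shows the shape of the schedule.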
Context Ensemble. Figure 2 depicts the idea of the context ensemble method to cope with the highly variable context lengths in the dataset. Instead of using the training data only in its original form (Figure 2(a)), we treat multiple context window sizes as separate data, which naturally balances out the proportion of short and long contexts (Figure 2(b)).
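The expansion of Figure 2(b) can be sketched as follows; `expand_with_windows` and the particular window sizes are hypothetical names and choices for illustration:

```python
# Sketch: turn one dialogue thread into several training examples, each
# with a different context window size (None = keep the full context).
# Function name and default window sizes are illustrative assumptions.

def expand_with_windows(context_turns, response, label, windows=(1, 3, None)):
    """Create one training example per context window size."""
    examples = []
    for w in windows:
        ctx = list(context_turns) if w is None else list(context_turns[-w:])
        examples.append({"context": ctx, "response": response, "label": label})
    return examples
```

Models trained on the differently windowed copies can then be ensembled at inference time, as the figure describes.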

Data Augmentation
Van Hee et al. (2018) and Ilić et al. (2018) observed that, in the case of Twitter, adding more data from the same domain did not help much in detecting sarcasm and irony. However, this does not mean that data augmentation fails to improve sarcasm detection. We use two techniques to augment the training data, depending on whether the data are labeled or not. In particular, our data augmentation method, named Contextual Response Augmentation (CRA), can take advantage of unlabeled dialogue threads, which are abundant and cheaply collectible. Figure 3 illustrates the overview of our CRA method, whose details are presented in Section 2.3.2.
Augmentation with Labeled Data
Our idea is to take the context sequence [c1, c2, ..., cn] as a new data point and label it as "NOT SARCASM". As shown in Figure 2, without the response [r1], the sequence could not be labeled as "SARCASM". We hypothesize that these newly generated negative samples help the model better focus on the relationship between the response [r1] and its contexts [c1, c2, ..., cn]. We also balance out the number of negative samples by creating positive samples via back-translation (Bérard et al., 2019; Zheng et al., 2019), which simply translates a sentence into another language and back to the original language to obtain possibly rephrased data points. For back-translation, we used three languages: French, Spanish, and Dutch.
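A minimal sketch of these two labeled-data augmentations is shown below; all names are our own, and `back_translate` is a stub standing in for a real machine-translation round trip (e.g. English → French → English via an MT model):

```python
# Sketch of the labeled-data augmentation: (1) relabel the bare context
# as a NOT_SARCASM negative, (2) back-translate sarcastic responses to
# create extra positives. back_translate is a stub for a real MT round trip.

def context_only_negative(sample):
    """Drop the response and relabel the remaining context as NOT_SARCASM."""
    return {"context": list(sample["context"]), "response": None,
            "label": "NOT_SARCASM"}

def back_translate(text, pivot="fr"):
    # Stub: a real implementation would translate text to `pivot` and back.
    return text

def augment_labeled(samples):
    out = []
    for s in samples:
        out.append(s)                                # original sample
        out.append(context_only_negative(s))         # new negative sample
        if s["label"] == "SARCASM":                  # rebalance positives
            out.append(dict(s, response=back_translate(s["response"])))
    return out
```

The key design point is that both augmentations reuse existing threads, so no new annotation is needed.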

Augmentation with Unlabeled Data
We also generate additional training samples using the unlabeled data [c1, c2, ..., cn, r1]. This approach is highly useful since a huge amount of unlabeled dialogue threads can be collected at little cost. As shown in Figure 3, the procedure for unlabeled augmentation is as follows: 1. We encode each response in the labeled training set using a BERT model trained on natural language inference tasks (Reimers and Gurevych, 2019).
2. For each unlabeled thread, we retrieve the top k labeled responses whose embeddings are most similar to that of the thread's response.
3. We rank the top k candidates according to the next sentence prediction (NSP) confidence of BERT. That is, we input [c1, c2, ..., cn, sep, r_t,i] to BERT and compute the NSP confidence of r_t,i for all i ∈ {1, ..., k}. We then select the most confident response r*_t with its label l*_t and create a new data point [c1, c2, ..., cn, r*_t, l*_t]. Table 2 shows some samples generated by this technique. The quality of the generated data undoubtedly depends on the degree of contextual conformity and the similarity between the initial responses. We find, however, that adding more data improves the quality of the augmented data, as the label-transfer noise becomes attenuated. In summary, besides the standard datasets shown in Table 7, we also train on the augmented samples generated above.
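The retrieval and reranking steps above can be sketched as follows; `top_k_similar` and `select_pseudo_label` are our own illustrative names, and `nsp_confidence` stands in for BERT's next-sentence-prediction head (the real system would call the trained model):

```python
import numpy as np

# Sketch of the CRA pipeline's retrieval + reranking. The embedding matrix
# and the nsp_confidence callable stand in for the sentence encoder and
# BERT NSP head used in the paper; names are illustrative assumptions.

def top_k_similar(query_vec, response_vecs, k):
    """Return indices of the k responses most cosine-similar to the query."""
    sims = response_vecs @ query_vec / (
        np.linalg.norm(response_vecs, axis=1) * np.linalg.norm(query_vec))
    return np.argsort(-sims)[:k]

def select_pseudo_label(context, candidates, nsp_confidence):
    """Pick the (response, label) pair whose response scores highest under
    the NSP head for this context; its label is transferred to the new
    pseudo data point."""
    return max(candidates, key=lambda rl: nsp_confidence(context, rl[0]))
```

The NSP reranking acts as a consistency filter: a retrieved response only contributes its label if BERT judges it a plausible continuation of the unlabeled context.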

Experiments
We first report the quantitative results from the official evaluation server of the FigLang2020 sarcasm challenge as of the challenge deadline (i.e. April 16, 2020, 11:59 p.m. UTC). Table 4 summarizes the results of the competition, where our method, named miroblog, shows significantly better performance than the other participants on both the Twitter and Reddit datasets. We report Precision (P), Recall (R), and F1 scores as the official metrics.

Further Analysis
We perform further empirical analysis to demonstrate the effectiveness of the proposed ideas. We compare different configurations of pooling layers, context ensemble, and data augmentation.
Pooling Layers. Table 3 compares the sarcasm detection performance of NeXtVLAD against other pooling methods. When coupled with BiLSTM, NeXtVLAD achieves better performance than both max and mean pooling.
Context Ensemble. Table 6 compares different context ensemble methods. We use the baseline (Transformer + BiLSTM + max pooling) and train it without augmenting the training set. The F1 scores rank in the order of (a) ensemble with maximum context > (b) ensemble with three contexts > (c) no context. The performance gap with and without the context ensemble implies that balancing out the samples in terms of context length is important. On the other hand, the performance gap between (a) and (b) is only 0.006, indicating that using more than the three most recent conversational contexts is scarcely helpful.
Data Augmentation. Table 5 compares the sarcasm detection results with and without data augmentation. Augmentation with labeled data increases the F1 score from 0.854 to 0.861, and augmentation with unlabeled data further improves it from 0.861 to 0.897. These results demonstrate that both augmentation techniques help performance.

Error Analysis
In order to better understand when our data augmentation (DA) methods are effective, we analyze examples of the following three cases, according to whether the proposed labeled and unlabeled DA is applied or not: (i) the prediction is wrong without DA but correct with DA, (ii) the prediction is correct without DA but wrong with DA, and (iii) the prediction is wrong both with and without DA. In other words, (i) is the case where DA helps, (ii) is the case where DA hurts, and (iii) is the case where DA fails to improve. Table 8 shows examples of these three cases. (i) The initial steps of CRA involve finding similar training samples from the labeled database. Thus, after applying CRA, samples containing specific hashtags, e.g. #NotReally and #Relax, are included in the training set. We observe that these tags tend to occur in samples labeled "SARCASM", and thus CRA helps the model learn the correlation between the hashtags and the labels. (ii) The augmented response (r2) contains the phrase "cult leader", as in the original response (r1). The corresponding label, however, is "SARCASM". When the newly added samples do not match the context, or when their labels are incorrect, CRA degrades the prediction. (iii) The third case arises mostly when the situation is subtle and requires external knowledge beyond the given context.
In order for the model to correctly classify the response as "SARCASM", it needs to understand the hashtag #VPD (Video Per Day). What #VPD means is not clear from the context, and without such knowledge, the model may still make incorrect predictions.

Conclusion
We proposed a new data augmentation technique, CRA (Contextual Response Augmentation), which utilizes the conversational context of unlabeled data to generate meaningful training samples. We demonstrated that the method significantly boosts sarcasm detection performance. Employing both labeled and unlabeled augmentation enabled our system to achieve the best F1 scores and win the FigLang2020 sarcasm challenge on both the Twitter and Reddit datasets.