A Modular Architecture for Unsupervised Sarcasm Generation

In this paper, we propose a novel framework for sarcasm generation; the system takes a literal negative opinion as input and translates it into a sarcastic version. Our framework does not require any paired data for training. Sarcasm emanates from context incongruity, which becomes apparent as the sentence unfolds. Our framework introduces incongruity into the literal input version through modules that: (a) filter factual content from the input opinion, (b) retrieve incongruous phrases related to the filtered facts and (c) synthesize sarcastic text from the filtered facts and the incongruous phrases. The framework employs reinforced neural sequence-to-sequence learning and information retrieval and is trained using only unlabeled non-sarcastic and sarcastic opinions. Since no labeled dataset exists for such a task, for evaluation, we manually prepare a benchmark dataset containing literal opinions and their sarcastic paraphrases. Qualitative and quantitative performance analyses on the data reveal our system’s superiority over baselines built using known unsupervised statistical and neural machine translation and style transfer techniques.


Introduction
Sarcasm is an intensive, ironic construct that is intended to express contempt or ridicule. It is often linked with intelligence, creativity, and wit, and therefore empowering machines to generate sarcasm is in line with the key goals of Strong AI. From the perspective of Natural Language Generation (NLG), sarcasm generation remains an important problem and can prove useful in downstream applications such as conversation systems, recommenders, and online content generators. For instance, in a conversational setting, a more natural and intriguing form of conversation between humans and machines could happen if machines can intermittently generate sarcastic responses, like their human counterparts.
Over the years, a lot of research and development efforts have gone into the problem of detecting sarcasm in text, which aims to classify whether a given text contains sarcasm or not (Joshi et al. (2017b) provide an overview). However, systems for generation of sarcasm have been elusive. This is probably due to the fact that in sarcasm generation both selection of contents for sarcastic opinion generation and surface realization of contents in natural language form are highly nuanced.
In the broader area of style transformation of texts, most of the existing works have focused narrowly on transformations at lexical and syntax levels, i.e., text simplification (Siddharthan, 2014), text formalization (Jain et al., 2018), sentiment style transfer (Shen et al., 2017; Xu et al., 2018), sentiment flipping and understanding humor (West and Horvitz, 2019). However, very little work has been done (Piwek, 2003; Hovy, 1987) on incorporating pragmatics into generation tasks such as sarcasm. Sarcasm generation offers a rich playground to study this challenge and push the state of the art in text transformation. Moreover, being a pragmatic task, sarcasm construction offers diverse ways to convey the same intent, based on cultural, social and demographic backgrounds. Hence, a supervised treatment of sarcasm generation using paired labeled data (such as parallel sentences) will be highly restrictive. This further motivates the need for exploring unsupervised approaches such as the one we propose in this paper.
We make the first attempt towards automatic sarcasm generation where the generation is conditioned on a literal input sentence. For example, the literal opinion "I hate it when my bus is late." should be transformed into "Absolutely love waiting for the bus". As sarcasm conveys a negative sentiment, our system expects a negative sentiment opinion as input. Out of various possible theories proposed to explain the phenomenon of sarcasm construction (Joshi et al., 2017b), our framework relies on the theory of context incongruity (Campbell and Katz, 2012). Context incongruity is prevalent in textual sarcasm (Riloff et al., 2013; Joshi et al., 2015b). The theory presents sarcasm as a contrast between positive sentiment context (e.g., absolutely loved it) and negative situational context (e.g., my bus is late).
In our framework, translation of literal sentences to sarcastic ones happens in four stages viz., (1) Sentiment Neutralization, during which sentiment-bearing words and phrases are filtered from the input, (2) Positive Sentiment Induction, where the neutralized input is translated into phrases conveying a strong positive sentiment, (3) Negative Situation Retrieval, during which a negative situation related to the input is retrieved, and (4) Sarcasm Synthesis, where appropriate sarcastic constructs are formed from the positive sentiment and negative situation phrases gathered in the first three stages.
Training and development of these modules require only three unpaired corpora of positive, negative, and sarcastic opinions. For evaluating the system, we manually prepare a small benchmark dataset which contains a set of literal opinions and their corresponding sarcastic paraphrases. Quantitative evaluation of our system is done using popular translation-evaluation metrics and document similarity metrics. For qualitative evaluation, we consider the human judgment of sarcastic intensity, fluency, and adequacy of the generated sentences. As baselines, we consider some of our simplistic model variants and existing systems for unsupervised machine translation and style transfer. Our overall observation is that our system often generates sarcasm of better quality than the baselines. The code, data, and resources are available at https://github.com/TarunTater/sarcasm_generation.

Challenges in Sarcasm Generation
Generation of sarcasm, unlike other language generation tasks, is highly nuanced. If we reconsider the example in the introductory section, the output sentence is sarcastic as it presents an unusual situation where the opinion holder has liked the rather boring act of waiting for a bus. The unusualness (and hence, the sarcasm) arises from two implicitly opposing (incongruous) contexts: love and waiting for the bus. Such a form of sarcasm, based on the context incongruity theory (Campbell and Katz, 2012), is more common in text than other forms such as propositional, embedded or illocutionary sarcasm (Camp, 2012). For any textual sarcasm generator, figuring out contextually incongruous phrases will be as difficult as generating a fluent sentence. Moreover, most of the existing language generators are known to work on large-scale literal/non-sarcastic texts (e.g., language models trained on Wikipedia articles), and are agnostic of the possible collocations of contextually incongruous phrases (Joshi et al., 2017a). We try to overcome these challenges through our modular system design, discussed as follows.

System Architecture
The overall system architecture is presented in Figure 1. For development of the modules, three corpora are needed: (a) a corpus of positive sentiment sentences (P), (b) a corpus of negative sentiment sentences (N), and (c) a corpus of sarcastic sentences (S). The framework transforms literal texts into sarcastic ones in four stages as given below:

Sentiment Neutralization
The neutralization module is designed to filter sentiment information out from the input text. For example, the input hate when my bus is late should be filtered to produce a neutral statement like bus is late. Neutralization is performed by a neural sentiment classification module. Each word in the input sentence of length N ($x = \{x_1, x_2, \ldots, x_N\}$) (in one-hot representation, padded wherever necessary) is transformed into a K-dimensional embedding. The embeddings are then passed through a layer of recurrent units such as Long Short Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997). The output of the LSTM layers is then sent to a self-attention layer before passing through a softmax-based classifier. The classifier is trained with supervision from positive/negative sentiment labels using corpora P and N.
During testing, for a given input of length N, the self-attention vector $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_N)$ is first extracted (details are skipped for brevity; see Xu et al. (2018)). We then invert and discretize the self-attention vector as follows:

$$\hat{\alpha}_i = \begin{cases} 0, & \text{if } \alpha_i > \mu + 2\sigma \\ 1, & \text{otherwise} \end{cases} \qquad (1)$$

where $\alpha_i$ is the attention weight for the $i$-th word, and $\mu$ and $\sigma$ are the mean and standard deviation of the attention vector. For each word in the input, if the discretized attention weight is 0, it is filtered out.
The motivation behind such a design is that if the classifier is trained well, the attention weights will represent the contribution of each word to the overall sentiment decision (Xu et al., 2018). Sentiment-bearing words will receive higher attention whereas neutral words will have lower attention weights. It is worth noting that neutralization can be done in several other ways, such as filtering based on a sentiment lexicon. However, such operations would require additional resources such as a sentiment dictionary and sense disambiguation tools, whereas the neural classification based filtering needs only binary sentiment-labeled data. Also, for computing word-level sentiment contributions, recent techniques (such as gradient-based methods (Sundararajan et al., 2017)) can be used. For simplicity, we use attention-based filtering.
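The filtering rule described in this section can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not the implementation used in our system: the attention weights are made up rather than produced by a trained classifier, the function name `neutralize` is hypothetical, and the threshold is taken as the mean plus `k` standard deviations, with `k` exposed as an assumed, tunable parameter.

```python
from statistics import mean, pstdev


def neutralize(tokens, attention, k=2.0):
    """Drop tokens whose attention weight exceeds mean + k * std.

    `k` is an assumed, tunable multiplier. In a real system the weights
    would come from the self-attention layer of a sentiment classifier.
    """
    if len(tokens) != len(attention):
        raise ValueError("tokens and attention must have the same length")
    mu = mean(attention)
    sigma = pstdev(attention)  # population standard deviation
    threshold = mu + k * sigma
    # Discretized weight: 0 (filter out) above the threshold, 1 (keep) otherwise.
    keep_mask = [0 if a > threshold else 1 for a in attention]
    return [t for t, keep in zip(tokens, keep_mask) if keep == 1]


tokens = ["i", "hate", "it", "when", "my", "bus", "is", "late"]
# Made-up attention weights: the sentiment word receives most of the mass.
attention = [0.04, 0.72, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04]
neutral = neutralize(tokens, attention)
print(neutral)
```

With these illustrative weights only the word hate crosses the threshold, so the remaining words are passed on to the next stage. A smaller `k` filters more aggressively.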

Positive Sentiment Induction
Once the neutralized output is extracted, it is passed to the positive sentiment induction module which transforms it into a positive sentiment sentence. For this, we use a traditional sequence to sequence pipeline (Bahdanau et al., 2014) with attention and copy mechanisms (Gulcehre et al., 2016). The input is a set of words coming from the neutralization module. These are transformed into embeddings and are then encoded with the help of LSTM layers. The decoder attends over the encoded output and produces the output tokens based on the attended vector and the previously generated tokens. This is a standard technique, typically used in neural machine translation.
As the output from the system is expected to be positive in sentiment, for training the framework, we use only a set of positive sentences from P. Each sentence in the data is filtered using the neutralization module. The filtered version and the original positive sentence are used as a (source, target) pair.
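The construction of training pairs for this module can be sketched as follows. This is an illustrative sketch only: `make_induction_pairs` is a hypothetical helper, and `toy_neutralizer` is a lexicon-based stand-in (with a made-up word list) for the attention-based neutralization module, used here only to keep the example self-contained.

```python
# Toy stand-in for the neutralization module (assumption, for illustration).
TOY_SENTIMENT_WORDS = {"absolutely", "love", "loved", "great", "amazing"}


def toy_neutralizer(tokens):
    """Remove words found in a small, made-up sentiment word list."""
    return [t for t in tokens if t not in TOY_SENTIMENT_WORDS]


def make_induction_pairs(positive_sentences, neutralizer):
    """Build (source, target) pairs: neutralized words -> original sentence."""
    pairs = []
    for sentence in positive_sentences:
        target = sentence.lower().split()
        source = neutralizer(target)
        if source:  # skip sentences that become empty after filtering
            pairs.append((source, target))
    return pairs


pairs = make_induction_pairs(
    ["Absolutely loved the new phone", "Great food at this place"],
    toy_neutralizer,
)
for source, target in pairs:
    print(source, "->", target)
```

Because the source side is derived from the target side, no manually paired data is needed; the sequence-to-sequence model learns to re-insert positive sentiment words around neutral content.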

Negative Situation Retrieval
Negative situations present in sarcastic opinions are typically extrinsic and are only loosely related to the semantics of their literal versions. Hence, a sequence-to-sequence module analogous to the one in Section 3.2 may not be very useful. Moreover, for sarcasm generation, for a certain topic, it is safe to assume that there can be a finite set of negative situations. From this set, appropriate situation phrases can be "retrieved" depending on the given input. Thus, finding appropriate negative situations boils down to two sub-problems: (a) preparing a finite set of negative situations, and (b) setting up the negative situation retrieval process. We discuss each of these two steps below.

Table 1: Example negative situations extracted using the bootstrapping technique (Riloff et al., 2013)
5-grams: getting up for school facts, getting yelled at by people, trying to schedule my classes, feeling like every single person, walking to class in pouring, making people who already hate, working on my last day, spending countless hours at doctors, getting overdraft statements in mail
4-grams: talking about world politics, stuck in a generation, sitting in class wondering, canceled at short notice, distancing myself from certain, wipe my own tears
3-grams: born not breathing, paid to sleep, scared those faces, taking a shower, starting your monday, accused of everything, worrying about someone, fight jealousy arguments, license to trill, awarded literature prize
2-grams: scratching itchy, looking chair, getting hiv, shot first, collecting death, lost respect
1-gram: canceled, sleeping, trying, buying, stapling

Building Negative Situation Gazetteer
This is a one-time process and is done using an unsupervised bootstrapping technique similar to that of Riloff et al. (2013). For each sentence in the sarcasm corpus S, a candidate negative situation phrase is extracted. A candidate negative situation phrase is a word n-gram (n ≤ 5) that follows a positive sentiment phrase in a sarcastic sentence (the word "love" is used as the seed positive sentiment phrase to begin the bootstrapping procedure). After the candidates for a positive phrase are obtained, their part-of-speech tags are extracted with the help of a POS tagger. Only n-grams matching specific POS patterns are then retained. This ensures that the phrases extracted are mostly verb phrases, noun phrases, and to-infinitive verb phrases that describe situations. In our setting we use 30 predefined POS n-gram patterns following Riloff et al. (2013).
Once the candidate negative situation phrases are extracted, they are filtered based on a scoring function $p(ns_i)$ (Eq. 2), where $ns_i$ is the $i$-th negative situation extracted for a certain positive phrase. The scoring function returns a real value indicating the exclusiveness of the negative situation with respect to the sarcastic sentences. If the score exceeds a threshold (i.e., p > 0.5), the candidate phrase is added to the gazetteer. Once all the possible negative situation phrases are extracted, each phrase is used to extract more positive sentiment phrases in a similar manner. This process of positive phrase and negative situation extraction is repeated until no new phrases are found. Table 1 shows some example negative situation phrases extracted from our dataset.
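One round of candidate extraction and filtering can be sketched in Python as below. This is a simplified sketch under several assumptions: the corpora are tiny made-up examples, POS-pattern filtering is omitted, all function names are hypothetical, and the exclusiveness score is assumed to be the fraction of sentences containing the phrase that are sarcastic, which is one plausible reading of the scoring function rather than its exact definition.

```python
def contains(tokens, phrase_tokens):
    """True if `phrase_tokens` occurs as a contiguous n-gram in `tokens`."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens
               for i in range(len(tokens) - n + 1))


def candidates_after(tokens, positive_phrase, max_n=5):
    """Collect n-grams (n <= max_n) that directly follow the positive phrase."""
    p = positive_phrase.split()
    found = []
    for i in range(len(tokens) - len(p) + 1):
        if tokens[i:i + len(p)] == p:
            start = i + len(p)
            for n in range(1, max_n + 1):
                if start + n <= len(tokens):
                    found.append(" ".join(tokens[start:start + n]))
    return found


def exclusiveness(phrase, sarcastic, literal):
    """Assumed score: share of sentences containing the phrase that are sarcastic."""
    p = phrase.split()
    in_s = sum(contains(s, p) for s in sarcastic)
    in_l = sum(contains(s, p) for s in literal)
    total = in_s + in_l
    return in_s / total if total else 0.0


# Made-up toy corpora (already lowercased and tokenized by whitespace).
sarcastic = [s.split() for s in [
    "i love waiting for the bus",
    "love waiting for the bus in the rain",
    "i love the bus",
]]
literal = [s.split() for s in [
    "the bus was on time",
    "waiting is boring",
    "the bus is here",
    "i took the bus",
    "the bus stop is near",
]]

gazetteer = sorted({
    cand
    for sent in sarcastic
    for cand in candidates_after(sent, "love")
    if exclusiveness(cand, sarcastic, literal) > 0.5
})
print(gazetteer)
```

In this toy run, phrases such as the bus are rejected because they occur more often in literal sentences than in sarcastic ones, while phrases beginning with waiting survive. A full bootstrapping loop would then use the accepted situations to discover new positive phrases and repeat until nothing new is found.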

Retrieval Process
The idea is to find negative situations relevant to the input sentence. We implement an information retrieval system based on PyLucene. All the negative situations from the gazetteer (Sec. 3.3.1) are first indexed. The input sentence is considered as the query for which the most relevant negative situation is retrieved from the indexed list. The factors involved in PyLucene's retrieval algorithm include tf-idf, number of matching terms in the query sentence and the retrieved sentence, and importance measure of a term according to the total number of terms in the search.
Once the positive sentiment and negative situation phrases are generated for the input sentence, they undergo a post-processing step in which stopwords and redundant words are removed; the remaining words are given as input to the sarcasm synthesis module.
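The retrieval step can be approximated with a small tf-idf ranker. The sketch below is a pure-Python stand-in and does not use PyLucene or reproduce its exact scoring; the function names, the smoothed idf formula, and the square-root term-frequency and length normalization are assumptions chosen to resemble classic tf-idf ranking. The indexed phrases are taken from Table 1.

```python
import math
from collections import Counter


def build_index(phrases):
    """Tokenize each gazetteer phrase and compute smoothed idf weights."""
    docs = [Counter(p.lower().split()) for p in phrases]
    doc_freq = Counter(term for doc in docs for term in doc)
    n_docs = len(docs)
    idf = {t: 1.0 + math.log((n_docs + 1) / (df + 1))
           for t, df in doc_freq.items()}
    return docs, idf


def retrieve(query, phrases, docs, idf):
    """Return the phrase with the highest tf-idf score, or None if no term matches."""
    query_terms = set(query.lower().split())
    best_phrase, best_score = None, 0.0
    for phrase, doc in zip(phrases, docs):
        length_norm = 1.0 / math.sqrt(sum(doc.values()))  # favor shorter phrases
        score = length_norm * sum(
            math.sqrt(doc[t]) * idf[t] for t in query_terms if t in doc
        )
        if score > best_score:
            best_phrase, best_score = phrase, score
    return best_phrase


phrases = ["walking to class in pouring", "sitting in class wondering", "taking a shower"]
docs, idf = build_index(phrases)
best = retrieve("my class is boring", phrases, docs, idf)
print(best)
```

Both phrases containing class match the query, and the length normalization makes the shorter one rank first. A query that shares no term with the gazetteer returns `None` in this sketch.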

Sarcasm Synthesis
The sarcasm synthesis module is a sequence-to-sequence network that expects a set of keywords related to positive sentiment and negative situation phrases. For training this module, the sarcasm corpus S is used. To prepare the input, we implement a keyword extraction technique based on POS tagging. Sentences in S are POS-tagged, stopwords are removed, and then only nouns, verbs, adjectives and adverbs are retained based on the POS tags. This way, the input keywords seen during training are similar to the keywords expected at test time.
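The keyword extraction used to prepare training inputs can be sketched as follows. To keep the example self-contained, it assumes the sentence has already been tagged with universal POS tags (the tags shown are hand-assigned, not tagger output), and it uses a tiny made-up stopword list; `extract_keywords` is a hypothetical helper name.

```python
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}
STOPWORDS = {"i", "it", "is", "my", "the", "a", "for", "when"}  # toy list


def extract_keywords(tagged_tokens):
    """Keep non-stopword nouns, verbs, adjectives and adverbs, in order."""
    return [
        word
        for word, tag in tagged_tokens
        if word.lower() not in STOPWORDS and tag in CONTENT_TAGS
    ]


# Hand-tagged sarcastic sentence (assumed tags, for illustration only).
tagged = [
    ("absolutely", "ADV"), ("love", "VERB"), ("waiting", "VERB"),
    ("for", "ADP"), ("the", "DET"), ("bus", "NOUN"),
]
keywords = extract_keywords(tagged)
training_pair = (keywords, " ".join(word for word, _ in tagged))
print(training_pair)
```

Each sarcastic sentence thus yields one (keywords, sentence) training pair, and word order is preserved so that the input still carries some sequential information.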
The module follows an encode-attend-decode style architecture like the positive sentiment induction module, but with a different learning objective. Keywords in the input (in one-hot representation, padded wherever necessary) are transformed into a sequence of embeddings and then encoded by layers of LSTMs, which produce a hidden representation for each input word. The decoder, consisting of a single layer of LSTMs stacked on top of a decoder embedding layer, attends over the encoded hidden representations and generates target words. In general, for T training instances of keywords and sarcastic texts, $\{(x_i, y_i)\}_{i=1}^{T}$, the training objective is to maximize the likelihood of a target $y_i$ given $x_i$, which is equivalent to minimizing the cross entropy between the target distribution and the predicted output distribution. For training the neural network, the cross-entropy loss is backpropagated; in other words, the gradient of the cross-entropy loss is used to update the model parameters. The gradient is given by:

$$\nabla_\theta L(\theta) = -\nabla_\theta \sum_{i=1}^{T} \log P_{M_\theta}(y_i \mid x_i) \qquad (3)$$

where $L$ is the loss function and $M_\theta$ is the translation system with parameter $\theta$; $\hat{y}_i$ denotes the predicted sentence. In our setting, where the input is not a proper sentence, the problem with the above objective is that it does not strongly force the decoder to learn and produce sarcastic output. We speculate that minimizing the token-level cross-entropy loss in Eq. 3 may help produce an output that is grammatically correct but not sarcastic enough. For instance, the decoder may incur insignificant cross-entropy loss after generating a sentence like absolutely loved it, as this sentence has considerable overlap with the reference sarcastic text that provides supervision. One possible way to tackle such problems is to employ a sarcasm scorer that can determine the sarcasm content in the generated output, and to use the scores given by the sarcasm scorer for better training of the generator.
However, the sarcasm scorer may be external to the sequence-to-sequence pipeline, and the scoring function may not be differentiable with respect to the model $M_\theta$. For this reason, we apply reinforcement learning, which considers the sarcasm content score as a form of reward and uses it to fine-tune the learning process. For learning, the policy gradient theorem (Williams, 1992) is used. The system is trained under a modified learning objective, i.e., to maximize the expected reward score for a set of produced candidate sentences. The generator, operating with a policy $P_{M_\theta}(\hat{y}_i \mid x_i)$ and producing an output $\hat{y}_i$ with an expected reward score computed using a scorer, will thus have the following gradient:

$$\nabla_\theta RL(\theta) = \nabla_\theta \, \mathbb{E}_{\hat{y}_i \sim P_{M_\theta}(\hat{y}_i \mid x_i)}\big[R(\hat{y}_i)\big] = \mathbb{E}_{\hat{y}_i \sim P_{M_\theta}(\hat{y}_i \mid x_i)}\big[R(\hat{y}_i)\, \nabla_\theta \log P_{M_\theta}(\hat{y}_i \mid x_i)\big] \qquad (4)$$

where $RL$ is the modified learning objective which has to be maximized and $R$ is a reward function that is computed using an external scorer. In practice, the expected reward is computed by (a) sampling candidate outputs from the policy $P_{M_\theta}$, (b) computing the reward score for each candidate and (c) averaging the rewards so obtained. In typical RL settings, the learner is initialized to a random policy distribution. However, in our case, since some supervision is already available in the form of target sarcastic sentences, we pretrain the model with the loss minimization objective given in Eq. 3 and then fine-tune the model based on the policy gradient scheme following Eq. 4. Thus, the learner gets initialized with a better policy distribution.
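The sample-score-average procedure behind the policy gradient update can be illustrated on a toy problem. The sketch below is not our sequence-to-sequence generator: it assumes a softmax policy over three fixed candidate sentences, uses made-up reward values in place of sarcasm-classifier confidences, and applies the score-function (REINFORCE) estimator with plain gradient ascent. All names and hyper-parameters are illustrative assumptions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_step(logits, rewards, rng, n_samples=100, lr=0.5):
    """One gradient-ascent step on the expected reward, estimated by sampling."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        # (a) sample a candidate from the current policy
        a = rng.choices(range(len(logits)), weights=probs)[0]
        # (b) weight the gradient of log pi(a) by the candidate's reward;
        #     for a softmax policy, d log pi(a) / d logit_j = 1[j == a] - pi_j
        for j in range(len(logits)):
            indicator = 1.0 if j == a else 0.0
            grad[j] += rewards[a] * (indicator - probs[j])
    # (c) average over the samples and move the logits uphill
    return [l + lr * g / n_samples for l, g in zip(logits, grad)]


candidates = [
    "absolutely loved it",
    "absolutely love waiting for the bus",
    "bus is late",
]
rewards = [0.1, 0.9, 0.2]  # made-up scorer confidences, one per candidate

rng = random.Random(0)
logits = [0.0, 0.0, 0.0]  # the toy starts uniform; the real model is pretrained
for _ in range(200):
    logits = reinforce_step(logits, rewards, rng)

probs = softmax(logits)
best = candidates[probs.index(max(probs))]
print(best, [round(p, 3) for p in probs])
```

The reward never needs to be differentiated: it only scales the gradient of the log-probability of each sampled candidate, which is why an external, non-differentiable scorer can be used. Over the updates, probability mass shifts toward the candidate with the highest reward.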
For reward calculation, we consider the confidence score of a sarcasm classifier (probability of being sarcastic) trained using S as positive examples and P and N taken as negative examples. For our setting, the classifier is analogous to the classifier used for neutralization. The classifier is based on embedding, LSTM and softmax layers.
Since the input to the system is a list of words, it may seem that the sarcasm synthesis module may not require sequence to sequence learning, and a much simpler approach like bag-of-words to sequence generation could have been used. However, note that the input to the generator is obtained after dropping words during neutralization and later appending the negative situation phrase. The sequentiality is, thus, not completely lost. This makes sequence to sequence model an intuitive choice.
We now explain our experimental setup.

Experiment Setup

Datasets
As stated earlier, our system does not rely on any paired data for training. It requires three corpora of positive sentences, negative sentences, and sarcastic sentences collected independently.
For the positive and negative sentiment corpora P and N, we considered short sentences/snippets from well-known sources: (a) Stanford Sentiment Treebank Dataset, (b) Amazon Product Reviews, (c) Yelp Reviews, and (d) Sentiment 140 dataset (see Kotzias et al. (2015) for sources). The above datasets primarily contain tweets and short snippets. Tweets are normalized by removing hashtags and usernames, and by performing spell checking and lexical normalization using NLTK (Loper and Bird, 2002). We then filtered out sentences with more than 30 words. Approximately 50,000 sentences from each category are retained. Then, based on the vocabulary overlap with our sarcasm corpus S, 47,827 sentences are finally retained from each category (the total number of instances is 95,654).
For the unlabelled sarcasm corpus S, we relied on popular datasets used for sarcasm detection tasks such as the ones by Ghosh and Veale (2016), Riloff et al. (2013), and the Reddit Sarcasm Corpus. Sentences are denoised, spell corrected and normalized. Average sentence length is kept as 30 words. A total of 306,141 sentences are thus collected. A common vocabulary of size 20,000 is extracted (based on frequency) for all the modules from the three corpora. Each corpus is divided into a train-valid-test split of 80%-10%-10%.
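The corpus preparation steps (length filtering, frequency-based vocabulary extraction, and the 80%-10%-10% split) can be sketched as below. This is a minimal illustration with hypothetical function names and a synthetic corpus; it assumes whitespace tokenization and a seeded random shuffle before splitting, neither of which is specified above.

```python
import random
from collections import Counter


def filter_by_length(sentences, max_words=30):
    """Keep sentences with at most `max_words` whitespace-separated words."""
    return [s for s in sentences if len(s.split()) <= max_words]


def build_vocab(sentences, size=20000):
    """Return the `size` most frequent words across all sentences."""
    counts = Counter(word for s in sentences for word in s.split())
    return [word for word, _ in counts.most_common(size)]


def split_corpus(sentences, seed=0):
    """Shuffle and split into train/valid/test with an 80/10/10 ratio."""
    items = list(sentences)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_valid = len(items) // 10
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]
    return train, valid, test


corpus = [f"sentence number {i}" for i in range(10)] + ["word " * 31]
corpus = filter_by_length(corpus)       # drops the 31-word sentence
vocab = build_vocab(corpus, size=2)     # two most frequent words
train, valid, test = split_corpus(corpus)
print(len(train), len(valid), len(test), vocab)
```

Sharing one vocabulary across the three corpora keeps the word indices consistent between modules, so the output of one stage can be fed directly to the next.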

Benchmark Dataset for Evaluation
Since no dataset containing paired examples of literal and sarcastic utterances is available, we created a small test set for evaluating our system. From the test split of the sarcasm corpus S, 250 sentences on diverse topics are selected and manually translated into literal versions by two linguists. Of these, only 203 sentences were retained by the linguists, who mutually decided whether each sentence was sarcastic enough to keep in the test dataset.

Model Configuration
For the neutralization module, the embedding dimension is set to 128, and two layers of LSTMs with a hidden dimension of 200 are used. The classifier trains for 10 epochs with a batch size of 32, and achieves a validation accuracy of 96% and a training accuracy of 98%.
For the positive sentiment induction module, the embedding dimensions for both encoder and decoder are set to 500. Both the encoder and decoder have only one layer of LSTM, with a hidden dimension of 500. The module is built on top of the OpenNMT (Klein et al., 2017) framework. Training runs for 100,000 iterations and the batch size is set to 64. The positive sentiment induction module, at the end of the training, produces a bigram BLEU (Papineni et al., 2002) score of 62.25%. For bootstrapping negative situations and other purposes, the POS tagger from spaCy is used. The Lucene-IR framework is set up to retrieve negative situations.
The model configuration and training parameters for the sarcasm synthesizer are the same as for the positive sentiment induction module. For the RL scheme, for each instance, the expected reward is computed over 100 candidate samples. At the end of the training, the bigram BLEU score on the validation set turns out to be 59.3%. For reward computation, we use a classifier similar to the one used for neutralization. The embedding size for this classifier is 300 and it uses two layers of unidirectional LSTMs with a hidden dimension of 300. It trains with a batch size of 64 and produces a validation accuracy of 78.3%. The probability estimates given by the classifier for any input text are taken as reward scores. For optimization, the cross-entropy loss criterion is used.

Evaluation Criteria
The absence of automatic evaluation metrics capable of capturing subtleties of sarcasm makes it difficult to evaluate sarcasm generators. For evaluation, we still use the popular translation and summarization evaluation metrics METEOR (Banerjee and Lavie, 2005) and ROUGE (Lin, 2004). Additionally, to check the semantic relatedness between the input and output, we use the Skip-thought sentence similarity metric. Note that using BLEU (Papineni et al., 2002) will be futile here as direct n-gram overlaps between the predicted and gold-standard sentences are not expected to be high for such a task. We still include it as an evaluation metric for completeness.
We employ an additional metric to judge the percentage of length increment (abbreviated as WL) to see if the length of the output is generally more than that of the input (for the reference text it is 67%). The notion behind this metric is that sarcasm typically requires more context than its literal version, which requires more words to be present on the target side.
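A possible computation of the WL measure is sketched below. The exact aggregation is an assumption: here the percentage change in word count is computed per input-output pair and then averaged, with whitespace tokenization; `length_increment` is a hypothetical name and the sentence pairs are made up.

```python
def length_increment(inputs, outputs):
    """Average percentage increase in word count from input to output."""
    changes = []
    for src, out in zip(inputs, outputs):
        n_src = len(src.split())
        n_out = len(out.split())
        if n_src == 0:
            continue  # skip empty inputs to avoid division by zero
        changes.append(100.0 * (n_out - n_src) / n_src)
    return sum(changes) / len(changes) if changes else 0.0


inputs = ["bus is late", "i hate mondays"]
outputs = [
    "absolutely love waiting for the bus",
    "i just love starting my monday",
]
wl = length_increment(inputs, outputs)
print(wl)
```

A positive value indicates that outputs are longer than inputs on average; a system that merely copies its input scores 0 on this measure.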

Human Judgement based Evaluation
We also consider human judgment scores indicating whether the generated output is non-sarcastic/sarcastic (0/1 labels), how fluent it is (on a scale of 1-5, 1 being the lowest), and to what extent it is related to the input (on a scale of 1-5). The relatedness measure is important as the objective of the task is to produce a sarcastic version of the input text without altering the semantics much. For human evaluation, we consider only 30 sentences randomly picked from the benchmark (test) dataset. Since judging sarcasm is difficult, we restricted ourselves to two annotators who had a good understanding of the language and of socio-cultural diversities.

Systems for Comparison
For comparison, we consider the following systems:
1. SarcasmBot: This is an open-sourced sarcasm generation chatbot released by Joshi et al. (2015a). The system generates a sarcastic response to an input utterance.
2. UNMT: This system is based on the Unsupervised Neural Machine Translation technique by Artetxe et al. (2017), which can be extended to any translation task. In our setting, the source and target sides are literal and sarcastic utterances, i.e., the direction of translation is non-sarcastic to sarcastic.
3. Monoses: This system is based on the unsupervised statistical machine translation technique by Artetxe et al. (2018).
4. ST: This is based on the cross alignment technique proposed by Shen et al. (2017), used for the task of sentiment translation.
5. FLIP: This is based on heuristics for sentiment reversal. For this, the input sentence is first dependency-parsed. The root verb is determined along with its tense and aspect with the help of its part-of-speech tags. The sentiment of the root verb is determined using a sentiment lexicon. If the verb has a non-zero positive or negative sentiment score, its antonym is found using WordNet. The appropriate tense and aspect form of the antonym is then obtained. The modified antonym replaces the original root verb. Similarly, we replace adjectives and adverbs with words carrying opposite sentiment.
For training the above systems (except FLIP), we used S on one side and a larger version of the combined P and N, containing 558,235 sentences, on the other side, curated from the same sources as mentioned earlier. Apart from these systems, we also tested some of our model variants, which are presumably inferior and can be considered as baselines. These are termed as:
1. SG NORMAL: a system with only the sarcasm synthesizer module, which takes the input directly (after removing stopwords from the input).
2. SG RL: same as SG NORMAL, but also applies reinforcement learning.
3. ALL NORMAL: the complete system, with the sarcasm synthesizer trained without the reinforcement learning strategy.
4. ALL RL: the complete system with reinforcement learning.

Results and Analysis
Tables 2 and 3 present evaluation results. While it was expected that the automatic metrics may not be able to capture the subtleties of sarcasm, the WL measure indicates that a carefully designed modular approach like ours often generates longer sentences with more context. This is also corroborated by the human evaluation, where annotators have judged that the outputs generated by our system are more sarcastic than those of the comparison systems. SarcasmBot, being a heuristic-driven sarcasm generator, produces sarcastic responses that are not related to the input topic. Moreover, it ends up generating only 20 different responses for our entire test dataset, making its output redundant and unrelated to the input. Other existing systems such as UNMT and Monoses converge to autoencoding and end up replicating the input as output. FLIP performs transformations at the lexical level, and hence achieves better fluency but fails to induce sarcasm in most of the cases.

Table 3: Human judgment scores for various systems

Table 4 presents example generations from different systems. It is quite interesting to note that, due to the RL, the model tends to produce longer sentences and brings in additional context necessary for sarcasm. The fluency is, however, compromised. A close inspection of the outputs from each module suggests that the overall error committed by the system is due to the accumulation of different types of errors, mainly (a) error during neutralization due to inappropriate assignment of weights to the words in the input, (b) dropping of words and/or insertion of spurious words during positive sentiment induction, and (c) error in scoring the sarcasm content in the RL setting. These can be addressed through better hyper-parameter tuning, gathering more training data for training the individual modules (especially the sarcasm synthe-

Related Work
As stated earlier, not many systems for sarcasm generation exist today. The closest work to ours is the one by Joshi et al. (2015a), which employs a heuristic-driven approach for generating a sarcastic response to an input utterance. Since the output of the system is a response, the system is not suitable for translating a literal input text into a sarcastic version. Unlike sarcasm generation, sarcasm detection has been a well-known problem with several available solutions. For this problem, traditional supervised and deep neural network based solutions have been proposed. The supervised approaches rely on: (a) unigrams and pragmatic features (González-Ibánez et al., 2011; Barbieri et al., 2014; Joshi et al., 2015b), (b) stylistic patterns (Davidov et al., 2010) and patterns related to situational disparity (Riloff et al., 2013), and (c) cognitive features extracted from gaze patterns (Mishra et al., 2016, 2017). Recent systems are based on variants of deep neural networks built on top of embeddings. Deep neural network based solutions for sarcasm detection include Ghosh and Veale (2016), who use a combination of RNNs and CNNs for sarcasm detection, and Tay et al. (2018), who propose a variant of CNN for extracting features related to context incongruity.
A few works exist in the domains of irony, pun and humour generation and are summarized by Wallace (2015), Ritchie (2005) and Strapparava et al. (2011), respectively. However, most of these are heuristic-driven and, hence, may not be easily scaled to new domains and languages. From the perspective of language style transfer, Shen et al. (2017) propose an unsupervised scheme to learn a latent content distribution across different text corpora and use it for sentiment style transfer. Xu et al. (2018) introduce an unsupervised sentiment translation technique through sentiment neutralization and reinforced sequence generation. A style transfer technique based on unsupervised MT, inspired by Artetxe et al. (2017), has also been proposed. Artetxe et al. (2018) have recently proposed an unsupervised statistical machine translation scheme. We adopt some of these modules for the task of sarcasm generation. As far as we know, our proposed model is the first of its kind for end-to-end neural sarcasm generation.

Conclusion and Future Work
We proposed a first-of-its-kind approach for textual sarcasm generation from literal opinionated texts. We designed a modular framework for extracting facts from the input, generating incongruous positive and negative situational phrases related to the facts, and finally generating sarcastic variations. For evaluation, we prepared a benchmark dataset containing literal opinions and their sarcastic versions. Through qualitative and quantitative analysis of the system's performance on the benchmark dataset, we observed that our system often generates better sarcastic sentences than some of our trivial model variants and unsupervised systems used for machine translation and sentiment style transfer. In the future, we would like to extend this framework for cross-lingual and cross-cultural sarcasm and irony generation.