Analyzing Political Parody in Social Media

Parody is a figurative device used to imitate an entity for comedic or critical purposes, and is a widespread phenomenon in social media through many popular parody accounts. In this paper, we present the first computational study of parody. We introduce a new publicly available data set of tweets from real politicians and their corresponding parody accounts. We run a battery of supervised machine learning models for automatically detecting parody tweets, with an emphasis on robustness: we test on tweets from accounts unseen in training, across genders and across countries. Our results show that political parody tweets can be predicted with an accuracy of up to 90%. Finally, we identify the markers of parody through a linguistic analysis. Beyond research in linguistics and political communication, accurately and automatically detecting parody is important for improving fact checking for journalists, and for downstream analytics such as sentiment analysis, where parodical utterances should be filtered out.


Introduction
Parody is a figurative device used to imitate and ridicule a particular target (Rose, 1993) and has been studied in linguistics as a figurative trope distinct from irony and satire (Kreuz and Roberts, 1993; Rossen-Knill and Henry, 1997). Traditional forms of parody include editorial cartoons, sketches, or articles pretending to have been authored by the parodied person. A new form of parody recently emerged in social media, and on Twitter in particular, through accounts that impersonate public figures. Highfield (2016) defines parody accounts as acting as 'a known, real person, for obviously comedic purposes. There should be no risk of mistaking their tweets for their subject's actual views; these accounts play with stereotypes of these figures or juxtapose their public image with a very different, behind-closed-doors persona.'
A very popular type of parody is political parody which plays an important role in public speech by offering irreverent interpretations of political personas (Hariman, 2008). Table 1 shows examples of very popular (over 50k followers) and active (thousands of tweets sent) political parody accounts on Twitter. Sample tweets show how the style and topic of parody tweets are similar to those from the real accounts, which may pose issues to automatic classification.
While closely related figurative devices such as irony and sarcasm have been extensively studied in computational linguistics (Wallace, 2015; Joshi et al., 2017), parody is yet to be explored using computational methods. In this paper, we aim to bridge this gap and conduct, for the first time, a systematic study of political parody as a figurative device in social media. To this end, we make the following contributions: 1. A novel classification task where we seek to automatically classify real and parody tweets. For this task, we create a new large-scale publicly available data set containing a total of 131,666 English tweets from 184 parody accounts and corresponding real accounts of politicians from the US, UK and other countries (Section 3); 2. Experiments with feature- and neural-based machine learning models for parody detection, which achieve high predictive accuracy of up to 89.7% F1. These are focused on the robustness of classification, with test data from (a) users, (b) genders, and (c) locations unseen in training (Section 5); 3. A linguistic analysis of the markers of parody tweets and of the model errors (Section 6). We argue that understanding the expression and use of parody in natural language and automatically identifying it are important to applications in computational social science and beyond. Parody tweets can often be misinterpreted as facts, even though Twitter only allows parody accounts if they are explicitly marked as parody and the poster does not have the intention to mislead. For example, the Speaker of the US House of Representatives, Nancy Pelosi, falsely cited a Michael Flynn parody tweet, and many users were fooled by a Donald Trump parody tweet about 'Dow Joans'.
Thus, accurate parody classification methods can be useful in downstream NLP applications such as automatic fact checking (Vlachos and Riedel, 2014) and rumour verification (Karmakharm et al., 2019), sentiment analysis (Pang et al., 2008) or nowcasting voting intention (Tumasjan et al., 2010; Lampos et al., 2013; Tsakalidis et al., 2018).
Beyond NLP, parody detection can be used in: (i) political communication, to study and understand the effects of political parody in the public speech on a large scale (Hariman, 2008;Highfield, 2016); (ii) linguistics, to identify characteristics of figurative language (Rose, 1993;Kreuz and Roberts, 1993;Rossen-Knill and Henry, 1997); (iii) network science, to identify the adoption and diffusion mechanisms of parody (Vosoughi et al., 2018).

Related Work

Parody in Linguistics
Parody is an artistic form and literary genre that dates back to Aristophanes in ancient Greece, who parodied argumentation styles in Frogs. Verbal parody has been studied in linguistics as a figurative trope distinct from irony and satire (Kreuz and Roberts, 1993; Rossen-Knill and Henry, 1997), and researchers have long debated its definition and its theoretical distinctions from other types of humor (Grice et al., 1975; Sperber, 1984; Wilson, 2006; Dynel, 2014). In general, verbal parody involves a highly situated, intentional, and conventional speech act (Rossen-Knill and Henry, 1997) composed of both a negative evaluation and a form of pretense or echoic mention (Sperber, 1984; Wilson, 2006; Dynel, 2014) through which an entity is mimicked or imitated with the goal of criticizing it to a comedic effect. Thus, imitative composition for an amusing purpose is an inherent characteristic of parody (Franke, 1971). The parodist intentionally re-presents the object of the parody and flaunts this re-presentation (Rossen-Knill and Henry, 1997).
Parody on Social Media Parody is considered an integral part of Twitter (Vis, 2013), and previous studies of parody in social media focused on analysing how these accounts contribute to topical discussions (Highfield, 2016) and on the relationship between identity, impersonation and authenticity (Page, 2014). Public relations studies showed that parody accounts can affect organisations during crises and may become a threat to their reputation (Wan et al., 2015).
Satire Most closely related to parody, satire has been tangentially studied in NLP as one of several prediction targets in the context of identifying disinformation (McHardy et al., 2019; de Morais et al., 2019). Rashkin et al. (2017) compare the language of real news with that of satire, hoaxes, and propaganda to identify linguistic features of unreliable text, and demonstrate how stylistic characteristics can help to decide a text's veracity. The study of parody is therefore relevant to this topic, as satire and parody are classified by some as a type of disinformation with 'no intention to cause harm but has potential to fool' (Wardle and Derakhshan, 2018).

Irony and Sarcasm
There is a rich body of work in NLP on identifying irony and sarcasm as a classification task (Wallace, 2015; Joshi et al., 2017). Van Hee et al. (2018) organized two open shared tasks: the first aims to automatically classify tweets as ironic or not, and the second targets identifying the type of irony expressed in tweets. However, irony is usually defined as 'a trope whose actual meaning differs from what is literally enunciated' (Van Hee et al., 2018), following the Gricean belief that the hallmark of irony is to communicate the opposite of the literal meaning (Wilson, 2006), violating the first maxim of Quality (Grice et al., 1975). In this sense, irony is treated in NLP in a similar way as sarcasm (González-Ibáñez et al., 2011; Khattri et al., 2015; Joshi et al., 2017). In addition to the words in the utterance, using the user and pragmatic context is known to be informative for irony or sarcasm detection (Bamman and Smith, 2015; Wallace, 2015). For instance, Oprea and Magdy (2019) make use of user embeddings for textual sarcasm detection. In the design of our data splits, we aim to limit the contribution of these aspects to the results.

Table 1 (excerpt):
Real @BorisJohnson MP: 'Our NHS will never be on the table for any trade negotiations. We're investing more than ever before - and when we leave the EU, we will introduce an Australian style, points-based immigration system so the NHS can plan for the future.'
Parody @BorisJohnson MP: 'People seem to be ignoring the many advantages of selling off the NHS, like the fact that hospitals will be far more spacious once poor people can't afford to use them.'
Relation to other NLP Tasks The pretense aspect of parody relates our task to a few other NLP tasks. In authorship attribution, the goal is to predict the author of a given text (Stamatatos, 2009; Juola et al., 2008; Koppel et al., 2009). However, there is no intent for the authors to imitate the style of others, and most differences between authors lie in the topics they write about, which we aim to limit by focusing on political parody. Further, in our setups, no tweets from the same author appear in both training and testing, to limit the impact of terms specific to a particular person. Pastiche detection (Dinu et al., 2012) aims to distinguish between an original text and a text written by someone imitating the style of the original author with the goal of impersonation. Most similar in experimental setup to our task, Preoţiuc-Pietro and Devlin Marier (2019) aim to distinguish between tweets published from the same account by different types of users: politicians or their staff. While both pastiches and staff writers aim to present similar content in a similar style to the original authors, those texts lack the humorous component specific to parody.

Task & Data
We define parody detection in social media as a binary classification task performed at the level of a social media post. Given a post T, defined as a sequence of tokens T = {t_1, ..., t_n}, the aim is to label T either as parody or genuine. Note that one could also use social network information, but this is beyond the paper's scope, as we focus on parody only as a linguistic device. We create a new publicly available data set to study this task, as no other data set is available. We perform our analysis on a set of users from the same domain (politics) to limit variations caused by topic. We first identify real and parody accounts of politicians on Twitter posting in English from the United States of America (US), the United Kingdom (UK), and other accounts posting in English from the rest of the world. We opted to use Twitter because it is arguably the most popular platform for politicians to interact with the public or with other politicians (Parmelee and Bichard, 2011). For example, 67% of prospective parliamentary candidates for the 2019 UK general election had an active Twitter account. Twitter also allows users to maintain parody accounts, subject to adding explicit markers in both the user bio and handle, such as 'parody' or 'fake'. Finally, we label tweets as parody or real, depending on the type of account they were posted from. We highlight that we are not using the user description or handle name in prediction, as this would make the task trivial.

Collecting Real and Parody Politician Accounts
We first query the public Twitter API using the following terms: {parody, #parody, parody account, fake, #fake, fake account, not real} to retrieve candidate parody accounts according to Twitter's policy.
From that set, we exclude any accounts matching 'fan' or 'commentary' in their bio or account name, since these are unlikely to post parodical content. We also exclude private and deactivated accounts, as well as accounts with a majority of non-English tweets. After collecting this initial set of parody candidates, the authors of the paper manually inspected up to the first ten original tweets from each candidate to identify whether the account is a parody or not, following the definition of a public-figure parody account from Highfield (2016) (see Section 1), further filtering out non-parody accounts. When multiple parody accounts existed for the same person, we kept a single one. Finally, for each remaining account, the authors manually identified the corresponding real politician account, to collect pairs of real and parody accounts.
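The keyword-based candidate filtering described above can be sketched as follows; the account fields ('bio', 'name') and the exact marker lists are illustrative assumptions, not the authors' implementation:

```python
import re

# Markers from Twitter's parody policy query; the exclusion terms drop
# likely fan/commentary accounts. Field names are hypothetical.
PARODY_RE = re.compile(r"\b(parody|fake|not real)\b", re.IGNORECASE)
EXCLUDE_RE = re.compile(r"\b(fan|commentary)\b", re.IGNORECASE)

def is_parody_candidate(account):
    """Keep an account if its bio matches a parody marker and neither
    its bio nor its name matches the fan/commentary exclusion terms."""
    if EXCLUDE_RE.search(account["bio"] + " " + account["name"]):
        return False
    return bool(PARODY_RE.search(account["bio"]))
```

In the pipeline above, the surviving candidates were then manually verified against the Highfield (2016) definition.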
Following the process above, we identified parody accounts of 103 unique people, 81 of whom have a corresponding real account. The authors also identified the binary gender and location (country) of the accounts using publicly available records. This resulted in 21.6% female accounts (for comparison, the percentages of women parliamentarians as of 2017 were 19% in the US, 30% in the UK, and 28.8% on average across the OECD). The majority of the politicians are located in the US (44.5%), followed by the UK (26.7%), while 28.8% are from the rest of the world (e.g. Germany, Canada, India, Russia).

Collecting Real and Parody Tweets
We collect all of the available original tweets, excluding retweets and quoted tweets, from all the parody and real politician accounts. We further balance the number of tweets in each real-parody account pair so that our experiments and linguistic analysis are not driven by a few prolific users or by imbalances in the tweet ratio for a specific pair. We keep a maximum ratio of ±20% between the real and parody tweets per pair by keeping all tweets from the less prolific account and randomly down-sampling from the more prolific one. Subsequently, for the parody accounts with no corresponding real account, we sample a number of tweets equal to the median number of tweets across the real accounts. Finally, we label tweets as parody or real, depending on the type of account they come from. In total, the data set contains 131,666 tweets: 65,710 real and 65,956 parody.
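A minimal sketch of the per-pair balancing step, assuming tweets are held as plain lists (the function name and the encoding of the ±20% cap as a ratio of 1.2 are ours):

```python
import random

def balance_pair(real_tweets, parody_tweets, max_ratio=1.2, seed=0):
    """Keep all tweets of the less prolific account and randomly
    down-sample the more prolific one, so the larger side is at most
    max_ratio (i.e. +20%) times the smaller side."""
    rng = random.Random(seed)
    cap = int(min(len(real_tweets), len(parody_tweets)) * max_ratio)
    if len(real_tweets) > cap:
        real_tweets = rng.sample(real_tweets, cap)
    if len(parody_tweets) > cap:
        parody_tweets = rng.sample(parody_tweets, cap)
    return real_tweets, parody_tweets
```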

Data Splits
To test that automatically predicting political parody is robust and generalizes to held-out situations not included in the training data, we create the following three data splits for running experiments:

Person Split We first split the data by adding all tweets from each real-parody account pair to a single split, either train, development or test.
To obtain a fairly balanced data set, without pairs of accounts with a large number of tweets dominating any split, we compute the mean number of real and parody tweets for each account pair and stratify the pairs so that these means are proportionally distributed across the train, development, and test sets (see Table 2).

Gender Split We also split the data by the gender of the politicians into training, development and test, obtaining two versions of the data: (i) one with female accounts in train/dev and male in test; and (ii) one with male accounts in train/dev and female in test (see Table 3).
Location Split Finally, we split the data based on the location of the politicians. We group the accounts into three groups of locations: US, UK and the rest of the world (RoW). We obtain three different splits, where each group in turn makes up the test set and the other two groups make up the train and development sets (see Table 4).
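The person-split stratification can be sketched as a greedy assignment of whole account pairs to splits, placing the largest pairs first into the split furthest below its target share; the 70/10/20 target fractions are illustrative, not necessarily the paper's exact proportions:

```python
def person_split(pair_sizes, fractions=(0.7, 0.1, 0.2)):
    """Assign each real-parody account pair (all of its tweets) to
    exactly one of train/dev/test, keeping tweet volumes roughly
    proportional to the target fractions."""
    total = sum(pair_sizes.values())
    targets = [f * total for f in fractions]
    loads = [0.0, 0.0, 0.0]
    assignment = {}
    # Place larger pairs first so they cannot overflow small splits.
    for pair, size in sorted(pair_sizes.items(), key=lambda kv: -kv[1]):
        # Pick the split with the largest remaining capacity.
        idx = max(range(3), key=lambda i: targets[i] - loads[i])
        loads[idx] += size
        assignment[pair] = ("train", "dev", "test")[idx]
    return assignment
```

Because a pair is always assigned as a whole, no tweets from the same person (real or parody) can appear in both training and testing.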

Text Preprocessing
We preprocess the text by lower-casing, replacing all URLs and anonymizing all mentions of usernames with placeholder tokens. We preserve emoticons and punctuation marks, and replace tokens that appear in fewer than five tweets with a special 'unknown' token. We tokenize text using DLATK (Schwartz et al., 2017), a Twitter-aware tokenizer.
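The preprocessing steps can be sketched as below; a plain whitespace tokenizer stands in for DLATK, and the placeholder strings are our own choices:

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")

def normalize(tweet):
    """Lower-case, then replace URLs and @-mentions with placeholders.
    A simple whitespace split stands in for the DLATK tokenizer."""
    tweet = URL_RE.sub("<url>", tweet.lower())
    tweet = USER_RE.sub("<user>", tweet)
    return tweet.split()

def replace_rare(tokenized_tweets, min_tweets=5):
    """Replace tokens appearing in fewer than min_tweets tweets with an
    <unk> placeholder, as described above."""
    df = Counter(tok for toks in tokenized_tweets for tok in set(toks))
    return [[t if df[t] >= min_tweets else "<unk>" for t in toks]
            for toks in tokenized_tweets]
```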

Predictive Models
We experiment with a series of approaches to classifying parody tweets, ranging from linear models to neural network architectures and pre-trained contextual embedding models. Hyperparameter selection is included in the Appendix.

Linear Baselines
LR-BOW As a first baseline, we use a logistic regression with standard bag-of-words (LR-BOW) representation of the tweets.
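A minimal sketch of the LR-BOW baseline with scikit-learn; the toy tweets and labels (0 = real, 1 = parody) are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data invented for illustration.
train_texts = [
    "our nhs will never be on the table",               # real
    "proud to open the new hospital today",             # real
    "honestly i am the greatest president ever dude",   # parody
    "i'm selling off the nhs lol",                      # parody
]
train_labels = [0, 0, 1, 1]

# Bag-of-words counts feeding a logistic regression classifier.
lr_bow = make_pipeline(CountVectorizer(), LogisticRegression())
lr_bow.fit(train_texts, train_labels)
```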

LR-BOW+POS
We extend LR-BOW using syntactic information from Part-Of-Speech (POS) tags. We first tag all tweets in our data using the NLTK tagger and then we extract bag-of-words features where each unigram consists of a token with its associated POS tag.
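Given already-tagged tokens (running the NLTK tagger itself requires its model files), the token_POS unigram construction can be sketched as follows; the tags below are written by hand for illustration:

```python
def pos_bow_features(tagged_tweet):
    """Build the token_POS unigram counts used by LR-BOW+POS from a
    list of (token, tag) pairs, e.g. as produced by nltk.pos_tag."""
    counts = {}
    for token, tag in tagged_tweet:
        key = "%s_%s" % (token.lower(), tag)
        counts[key] = counts.get(key, 0) + 1
    return counts
```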

BiLSTM-Att
The first neural model is a bidirectional Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997) with a self-attention mechanism (BiLSTM-Att; Zhou et al. (2016)). Tokens t_i in a given tweet T = {t_1, ..., t_n} are mapped to embeddings and passed through a bidirectional LSTM. A single tweet representation h is then computed as the attention-weighted sum of the resulting contextualized vector representations, h = sum_i a_i h_i, where a_i is the self-attention score at timestep i. The tweet representation h is subsequently passed to the output layer, which uses a sigmoid activation function.
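The attention pooling step, h = sum_i a_i h_i, can be sketched in NumPy; here random vectors stand in for the BiLSTM's contextualized states, and the scoring vector is a stand-in for the learned attention parameters:

```python
import numpy as np

def attention_pool(H, w):
    """Self-attention pooling: score each timestep's hidden state h_i
    against a vector w, softmax the scores into weights a_i, and return
    the weighted sum h = sum_i a_i * h_i."""
    scores = H @ w                   # one score per timestep, shape (n,)
    scores = scores - scores.max()   # for numerical stability
    a = np.exp(scores) / np.exp(scores).sum()
    return a @ H, a                  # pooled vector (d,), weights (n,)

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))   # 5 timesteps, 8-dim states (stand-ins)
w = rng.normal(size=8)
pooled, weights = attention_pool(H, w)
```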

ULMFit
The Universal Language Model Fine-tuning (ULMFit) is a method for efficient transfer learning (Howard and Ruder, 2018). The key intuition is to train a text encoder on a language modelling task (i.e. predicting the next token in a sequence), where data is abundant, then fine-tune it on a target task where data is more limited. During fine-tuning, ULMFit uses gradual layer unfreezing to avoid catastrophic forgetting. We use AWD-LSTM (Merity et al., 2018) as the base text encoder, pre-trained on the Wikitext 103 data set, and fine-tune it on our parody classification task. For this purpose, after the AWD-LSTM layers, we add a fully-connected layer with a ReLU activation function, followed by an output layer with a sigmoid activation function. Before each of these two additional layers, we perform batch normalization.

BERT and RoBERTa
Bidirectional Encoder Representations from Transformers (BERT) is a language model based on transformer networks (Vaswani et al., 2017), pre-trained on large corpora. The model uses multiple multi-head attention layers to learn bidirectional embeddings for input tokens. It is trained for masked language modelling, where a fraction of the input tokens in a given sequence are masked and the task is to predict each masked word given its context. BERT operates on wordpieces, which are passed through an embedding layer and summed together with positional and segment embeddings; the former introduce positional information to the attention layers, while the latter encode which segment a token belongs to. Similar to ULMFit, we fine-tune the BERT-base model for predicting parody tweets by adding an output dense layer for binary classification, fed with the representation of the 'classification' token.
We further experiment with RoBERTa (Liu et al., 2019), an extension of BERT trained on more data and with different hyperparameters. RoBERTa has been shown to improve performance on various benchmarks compared to the original BERT (Liu et al., 2019).

XLNet
XLNet is another pre-trained neural language model based on transformer networks (Yang et al., 2019). XLNet is similar to BERT in its structure, but is trained on a permuted (instead of masked) language modelling task: during training, the words of a sentence are permuted and the model predicts a word given the shuffled context. We also adapt XLNet for predicting parody, in the same way as BERT and ULMFit.

Model Hyperparameters
We optimize all model parameters on the development set for each data split (see Section 3).

BiLSTM-Att
We use 200-dimensional GloVe embeddings (Pennington et al., 2014) pre-trained on Twitter data. The maximum sequence length is set to 50 covering 95% of the tweets in the training set. The LSTM size is h = 300 where h ∈ {50, 100, 300} with dropout d = 0.5 where d ∈ {.2, .5}. We use Adam (Kingma and Ba, 2014) with default learning rate, minimizing the binary cross-entropy using a batch size of 64 over 10 epochs with early stopping.
ULMFit We first update only the AWD-LSTM weights with a learning rate l = 2e-3 for one epoch where l ∈ {1e-3, 2e-3, 4e-3} for language modeling. Then, we update both the AWD-LSTM and embedding weights for one more epoch, using a learning rate of l = 2e-5 where l ∈ {1e-4, 2e-5, 5e-5}. The size of the intermediate fully-connected layer (after AWD-LSTM and before the output) is set by default to 50. Both in the intermediate and output layers we use default dropout of 0.08 and 0.1 respectively from Howard and Ruder (2018).
BERT and RoBERTa For BERT, we use the base model (12 layers and 110M total parameters) trained on lowercase English. We fine-tune it for 1 epoch with a learning rate l = 5e-5, where l ∈ {2e-5, 3e-5, 5e-5} following the fine-tuning recommendations of the original BERT authors, with a batch size of 128. For RoBERTa, we use the same fine-tuning parameters as for BERT.

Results
This section contains the experimental results obtained on all three data splits proposed in Section 3. We evaluate our methods (Section 4) using several metrics: accuracy, precision, recall, macro F1 score, and area under the ROC curve (AUC). We run each model three times with different random seeds and report the average and standard deviation. Table 5 presents the results for the parody prediction models with the data split by person. We observe that the architectures using pre-trained text encoders (i.e. ULMFit, BERT, RoBERTa and XLNet) outperform both neural (BiLSTM-Att) and feature-based (LR-BOW and LR-BOW+POS) models by a large margin across metrics, with the transformer architectures (BERT, RoBERTa and XLNet) performing best. The highest scoring model, RoBERTa, classifies tweets (parody and real) with an accuracy of 90%, which is more than 8% higher than the best non-transformer model (ULMFit). RoBERTa also outperforms the logistic regression baselines (LR-BOW and LR-BOW+POS) by more than 16 points in accuracy and 13 points in F1 score. Furthermore, it is the only model to score higher than 90 in precision. Table 6 shows the F1 scores obtained when training on the gender splits, i.e. training on male and testing on female accounts and vice versa. We first observe that models trained on the male set are in general more accurate than models trained on the female set, with the sole exception of ULMFit. This is probably due to the fact that the data set is imbalanced towards men, as shown in Table 3 (see also Section 3). We also do not observe a dramatic performance drop compared to the mixed-gender model on the person split (see Table 5). Again, RoBERTa is the most accurate model on both splits, obtaining an F1 score of 87.11 and 84.87 on the male and female data respectively.
The transformer-based architectures are again the best performing models overall, but the difference between them and the feature-based methods is smaller than it was on the person split.

Discussion
Through experiments over three different data splits, we show that all models predict parody tweets consistently above chance, even when tested on people unseen in training. In general, we observe that the pre-trained contextual embedding models perform best, on average around 10 F1 points better than the linear methods. Among these, RoBERTa outperforms the other methods by a small but consistent margin, in line with past research (Liu et al., 2019). Further, we see that the predictions are robust to location- or gender-specific differences, as the performance on held-out locations and genders is close to that obtained when splitting by person, with a maximum drop of less than 5 F1 points, partly explained by training on less data (e.g. female users). This highlights that our models capture information beyond topics or features specific to any person, gender or location, and can potentially identify stylistic differences between parody and real tweets.

Analysis
We finally perform an analysis based on our novel data set to uncover the peculiarities of political parody and understand the limits of the predictive models.

Linguistic Feature Analysis
We first analyse the linguistic features specific to real and parody tweets. For this purpose, we use the method introduced by Schwartz et al. (2013) and used in several other analyses of user traits (Preoţiuc-Pietro et al., 2017) and speech acts. We rank the feature sets described in Section 4 using univariate Pearson correlation (note that for the analysis we use POS tags instead of POS n-grams).
Features are normalized to sum to one for each tweet. Then, for each feature, we compute the correlation between its distribution across posts and the label of the post (parody or not). Table 8 presents the top unigram and part-of-speech features correlated with real and parody tweets. We first note that the top features related to either parody or genuine tweets are function words or otherwise related to style, as opposed to topic. This reinforces that the make-up of the data set and its categories is not driven by topic choice, and that parody detection is mostly a stylistic distinction. The only exception is a few hashtags related to parody accounts (e.g. #imwithme), but on closer inspection all of these come from tweets of a single parody account and are thus not useful for prediction in any setup, as tweets containing them appear only in either the train or the test set. The top features for both categories of tweets are pronouns ('our' for genuine tweets, 'i' for parody tweets). In general, we observe that parody tweets are much more personal, including first-person pronouns and possessives ('me', 'my', 'i', "i'm", PRP) and second-person pronouns ('you'). This indicates that parodies are more personal and direct, which is also supported by the greater use of @-mentions and quotation marks. The real politician tweets are more impersonal, and the use of 'our' indicates a desire to include the reader in the conversation.
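The ranking procedure described above (per-tweet unit normalization followed by univariate Pearson correlation with the binary label) can be sketched in NumPy; the toy count matrix is invented for illustration:

```python
import numpy as np

def feature_label_correlations(X, y):
    """Normalize each row (tweet) of the count matrix X to sum to one,
    then compute the Pearson correlation of every feature column with
    the binary parody label y."""
    X = np.asarray(X, dtype=float)
    X = X / X.sum(axis=1, keepdims=True)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)          # center features
    yc = y - y.mean()                # center labels
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return num / den

# Toy example: feature 0 (e.g. 'i') occurs mostly in parody tweets (y=1).
X = [[3, 1], [2, 1], [0, 3], [1, 4]]
y = [1, 1, 0, 0]
r = feature_label_correlations(X, y)
```

Sorting features by r then yields the rankings reported in Table 8.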
The real politician tweets include more stopwords (e.g. prepositions, conjunctions, determiners), which indicates that these tweets are more well formed. Conversely, the parody tweets include more contractions ("don't", "i'm"), hinting at a less formal style ('dude'). Politicians frequently use their account to promote events they participate in or that are relevant to their day-to-day schedule, as hinted by several prepositions ('at', 'on') and words ('meeting', 'today') (Preoţiuc-Pietro and Devlin Marier, 2019). For example, this is a tweet of the U.S. Senator from Connecticut, Chris Murphy: 'Rudy Giuliani is in Ukraine today, meeting with Ukranian leaders on behalf of the President of the United States, representing the President's re-election campaign. [...]' Through part-of-speech patterns, we observe that parody accounts are more likely to use verbs in the present tense (VBZ, VBP). This hints that parody tweets explicitly try to mimic direct quotes from the parodied politician, in the first person and using present tense verbs, while actual politician tweets are more impersonal. Adverbs (RB) are used predominantly in parodies, and a common sequence in parody tweets is an adverb followed by a verb (RB VB), which can be used to emphasize actions or relevant events. For example, the following is a tweet from a parody account (@Queen_Europe) of Angela Merkel: 'I mean, the Brexit Express literally appears to be going backwards but OK <url>'

Error Analysis
Finally, we perform an error analysis to examine the behavior of our best performing model (RoBERTa) and to identify potential limitations of current approaches. The first example is a tweet by the former US president Barack Obama which was classified as parody while it is in fact a real tweet. This parody tweet, even though it is more opinionated, is similar in style to a slogan or campaign speech and is therefore misclassified. Lastly, the following is a tweet from former President Obama that was misclassified as parody: 'It's the #GimmeFive challenge, presidential style. <url>' The reason is that some politicians, such as Barack Obama, often write in an informal manner, and this may cause the models to misclassify such tweets.

Conclusion
We presented the first study of parody using methods from computational linguistics and machine learning; parody is a linguistic phenomenon related to, but distinct from, irony and sarcasm. Focusing on political parody in social media, we introduced a freely available large-scale data set containing a total of 131,666 English tweets from 184 real and corresponding parody accounts. We defined parody prediction as a new binary classification task at the tweet level and evaluated a battery of feature-based and neural models, achieving high predictive accuracy of up to 89.7% F1 on tweets from people unseen in training.
In the future, we plan to study in more depth the stylistic and figurative devices used in parody, extend the data set beyond the political case study, and explore human behavior around parody, including how it is detected and diffused through social media.