How Effectively Can Machines Defend Against Machine-Generated Fake News? An Empirical Study

We empirically study the effectiveness of machine-generated fake news detectors by probing each model's sensitivity to different synthetic perturbations at test time. Current machine-generated fake news detectors rely on provenance to determine the veracity of news. Our experiments find that the success of these detectors may be limited: they are largely insensitive to semantic perturbations yet very sensitive to syntactic perturbations. We open-source our code and believe it can serve as a useful diagnostic tool for evaluating models aimed at fighting machine-generated fake news.


Introduction
The advancement of language models (LMs) in text generation has raised concerns over the misuse of LMs to generate fake news, misleading reviews, rumors, and propaganda (Vosoughi et al., 2018; Solaiman et al., 2019; Varshney et al., 2019, 2020). Fact-checking is one approach, which studies the veracity of news using external evidence (Popat et al., 2018; Nie et al., 2018). However, it remains a challenging task, since the performance of current automatic fact-checking models is not satisfactory (Thorne et al., 2018). Rashkin et al. (2017) studied automated fact-checking by examining the role of stylistic bias in verifying the truthfulness of an article. Another approach, which has recently gained traction to combat the mass-scale production of fake news, is detecting stylistic differences between human-written and machine-generated news (https://openai.com/blog/gpt-2-1-5b-release/). Later, Grover (Zellers et al., 2019), a transformer-based model (Vaswani et al., 2017) trained on news corpora, was proposed to detect machine-generated fake news.

The detection of machine-generated fake news purely based on stylistic biases can be hard because: (1) legitimate human-written articles can easily be corrupted at scale by machines; (2) an attacker can overlay the distributional features of human-written text onto machine-generated text to fool style-based classifiers, and vice versa; (3) legitimate text can be generated with an LM, and the current machine-generated fake news detectors rely on a similar distribution for the generation of legitimate and fake news (Schuster et al., 2019). However, to the best of our knowledge, there has not been a systematic empirical evaluation to validate these claims. We devise six different perturbations to study the behavior of the models. In this study we do not cover (3), since generative models for applications like summarization (See et al., 2017; Nallapati et al., 2016) and text completion (Vaswani et al., 2017) can be directly tested for veracity on the detector models. From our experiments, we see that the models are insensitive to the semantic perturbations considered in this work, and moreover are extremely sensitive to grammatical perturbations that do not change the semantics.

Related Work
Universal Attacks in NLP: Ribeiro et al. (2018) debugged models using semantics-preserving perturbations that forced changes in predictions for downstream tasks such as sentiment analysis, visual QA, and machine comprehension. Behjati et al. (2019) crafted data-independent adversarial sequences that can fool a text classifier when added to any input sample. Alternatively, Wallace et al. (2019) study triggers in the form of one or a few words to analyze models and dataset biases for LMs and text classification.

Machine-generated text detection: Bakhtin et al. (2019) study the generalization ability of models trained to distinguish real text from machine-generated text. Gehrmann et al. (2019) show statistical distributional differences between human-written and machine-generated text and provide a tool that helps readers detect machine-generated text. Zellers et al. (2019) proposed a defense against machine-generated fake news, Grover, by building a linear classifier on top of the last hidden state of their controlled generator model trained on a large news corpus. Automatic fact-checking is another approach that is being studied actively with synthetic (Thorne et al., 2018) and real (Augenstein et al., 2019) datasets. In parallel work, Schuster et al. (2019) discuss the limitations of style-based approaches for machine-generated fake news detection. They devised two benchmarks: (1) text completion using an LM and (2) negating the meaning of human-written articles while maintaining the distribution of human-written text as learned by the model. Our work is orthogonal to these efforts in the following ways: (1) we study the (in)sensitivity of multiple models to semantic perturbations while keeping the distribution of human-written text intact; (2) we study the sensitivity of models to semantics-preserving syntactic perturbations, with the goal of overlaying the distribution of human-written text onto machine-generated text; (3) our experiments can be used as a diagnostic evaluation tool for future machine-generated fake news detectors; and (4) our semantic perturbations can be used to evaluate and study fact-checking based solutions as well.

Methodology
We measure the performance of the models as their accuracy under the perturbations introduced in this work. All models are trained without any perturbations, and their behavior is studied only at test time. We consider the following models in our experiments: the Grover-Mega discriminator, the GPT2 output detector, and Fakebox, and we use the RealNews dataset.
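As a minimal sketch of this protocol (the `predict` and `perturb` interfaces below are hypothetical stand-ins, since each detector exposes its own API):

```python
# Sketch of the test-time evaluation loop: no detector is retrained;
# we only measure accuracy on perturbed inputs. `predict` and `perturb`
# are hypothetical stand-ins for a detector's API and a perturbation.
from typing import Callable, List

def accuracy_under_perturbation(
    predict: Callable[[str], str],          # article text -> "real" or "fake"
    articles: List[str],                    # articles classified correctly when unperturbed
    perturb: Callable[[str, float], str],   # (article, level) -> perturbed article
    level: float,                           # e.g. 0.25 for the 25% perturbation level
    expected: str,                          # label an ideal model should now predict
) -> float:
    """Fraction of perturbed articles for which the detector outputs `expected`."""
    correct = sum(predict(perturb(a, level)) == expected for a in articles)
    return correct / len(articles)
```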

Types of Perturbations
We devise perturbations across real news (human-written) articles to test model behavior. The perturbations in this work fall into two broad categories: semantic and syntactic. Semantic perturbations operate at the sentence level, while syntactic perturbations operate at the word level. At the N% perturbation level, a semantic perturbation changes the semantics of N% of the sentences, whereas a syntactic perturbation modifies N% of the words.
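A minimal sketch of how a perturbation level is applied; the helper name and the naive ". "-based sentence splitter are our simplifications, not the released code:

```python
# Sketch: apply `transform` to a random N% of units. Semantic
# perturbations use unit="sentence"; syntactic ones use unit="word".
import random
from typing import Callable

def perturb_at_level(text: str, level: float, unit: str,
                     transform: Callable[[str], str]) -> str:
    parts = text.split(". ") if unit == "sentence" else text.split()
    k = max(1, round(level * len(parts)))        # number of units to modify
    for i in random.sample(range(len(parts)), k):
        parts[i] = transform(parts[i])
    return (". " if unit == "sentence" else " ").join(parts)
```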

Semantic perturbations
The semantic perturbations are intended to turn real news (human-written text) into fake news. Our aim is to understand to what extent the content and factuality of the text influence model decisions.
Psychology studies show that people try to diverge as little as possible from the truth while lying (Mazar et al., 2008). Hence, we study perturbations at various levels to understand the sensitivity of the models. Understandably, the models have difficulty spotting minor perturbations (except for article shuffling), and their performance improves as we add more noise to the real articles. An ideal model would flip its decision under every semantic perturbation introduced in this work. Below is a brief description of each type of semantic perturbation we consider.

varying sentiment: We change the polarity of sentences within an article by replacing positive or comforting words with negative words and vice versa, thereby changing the overall sentiment of the article. To reverse the polarity of a sentence, we replace one randomly chosen word with its antonym obtained via NLTK.

source-target exchange: The source and target entities in a sentence are interchanged, for sentences that do not have coordinating or correlative conjunctions.

article shuffling: We perturb a real article by randomly adding N% of sentences from a fake article, where N is the perturbation level. We also remove an equal number of sentences from the real article to maintain the total article length. The fake article chosen for shuffling shares no entities with the title of the real article.

entity replacement: We replace entities with other, irrelevant entities of the same type. The irrelevant entities are picked from fake articles and do not appear in the real articles.

alter numerical facts: Numerical facts in a real article are distorted. A numerical figure (in digits or words) below one hundred thousand is scaled up to a random number between 1 million and 1 trillion, and vice versa.

We use spaCy (https://spacy.io/) to identify entities and entity types in our experiments.
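As an illustration of the varying sentiment perturbation, here is a minimal sketch that looks antonyms up in WordNet through NLTK; the exact lookup used in our experiments may differ:

```python
# Sketch of the varying-sentiment perturbation: flip the polarity of a
# sentence by swapping one random word for a WordNet antonym (assumes
# NLTK with the WordNet corpus installed).
import random
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

def antonym_of(word: str):
    """Return one WordNet antonym of `word`, or None if none exists."""
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            if lemma.antonyms():
                return lemma.antonyms()[0].name().replace("_", " ")
    return None

def flip_polarity(sentence: str) -> str:
    """Replace one randomly chosen word that has an antonym."""
    words = sentence.split()
    swappable = [i for i, w in enumerate(words) if antonym_of(w.lower())]
    if swappable:
        i = random.choice(swappable)
        words[i] = antonym_of(words[i].lower())
    return " ".join(words)
```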

Table 1 presents examples of articles subjected to the perturbations in this work.

Table 1: Examples of real articles and their perturbed (fake) counterparts for each perturbation type.

varying sentiment
Real: Google said last year it spent more than $100 million on Content ID. They say the automatic filters are blunt - Some consumers worry that the new rules would bring an end. The EU denies this.
Fake: Google said first year it spent more than $100 million on Content ID. They say the manual filters are blunt - No consumers worry that the new rules would bring an end. The EU admit this.

source-target exchange
Real: Lokuhettige had 14 days to respond to the new charges, the ICC added. Sri Lanka Cricket has been thrown into turmoil as the ICC continues to investigate corruption allegations in the island nation.
Fake: ICC had 14 days to respond to the new charges, Lokuhettige added. ICC has been thrown into turmoil as the Sri Lanka Cricket continues to investigate corruption allegations in the island nation.

article shuffling
Real: Rose feels not enough action is being taken and the disparity in the punishment highlights its ineffectiveness. "Obviously, it is a bit sad (to feel like this) but when the countries only get fined what I'd probably spend on a night out in London, what do you expect" he added. "You see my manager get banned for two games for just being -"
Fake: Rose feels not enough action is being taken and the disparity in the punishment highlights its ineffectiveness. This would pave the way for Daenerys Targaryen to bring the Wall down. "You see my manager get banned for two games for just being -"

entity replacement
Real: A New Jersey bus driver's incredible note to the parents of two children who reached out to another student with a disability went viral.
Fake: Pribbernow, New Kilgore bus driver's incredible note to the parents of two children who reached out to another student with a disability went viral.

syntactic perturbation
Real: There is no way to fully understand what is going on in crypto world. I am not even sure anyone could even if you tried to. I can tell you that recent surge in BitCoin is an opportunity to buy long term real assets
Fake: There's no way to fully understand what's going on in the crypto world - I am not sure anyone could even if you tried to. I can tell you that the recent surge in BitCoin is an opportunity to buy long-term real assets.
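The entity replacement perturbation can be sketched with spaCy's named entity recognizer as below; the replacement pool (entities harvested from fake articles) is passed in as a plain dictionary here, which is our simplification:

```python
# Sketch of the entity-replacement perturbation using spaCy NER.
# `pool` maps an entity type to irrelevant entities of the same type
# (harvested from fake articles); its construction is omitted here.
import random
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this small English model is installed

def replace_entities(text: str, pool: dict) -> str:
    """Swap every recognized entity for a random same-type entity from `pool`."""
    doc = nlp(text)
    pieces, last = [], 0
    for ent in doc.ents:
        if ent.label_ in pool:                      # e.g. "PERSON", "ORG", "GPE"
            pieces.append(text[last:ent.start_char])
            pieces.append(random.choice(pool[ent.label_]))
            last = ent.end_char
    pieces.append(text[last:])
    return "".join(pieces)
```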
Syntactic perturbations

Ippolito et al. (2019) recently studied the influence of excerpt length on the classification of machine-generated and human-written text. In the training dataset of Grover, we observe that machine-generated articles are shorter but have longer sentences than human-written articles. We perturb these features by: (i) breaking longer sentences, (ii) removing definite articles if they appear among the most repeated words in an article, (iii) applying semantics-preserving rules, for example converting that's → that is (Ribeiro et al., 2018), and (iv) reformatting paragraphs of machine-generated text. These perturbations preserve the semantics of the articles, hence an ideal model should not flip its decision.
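A minimal sketch of rules (i) and (iii); the rewrite rules shown are an illustrative subset, not the full set used in our experiments:

```python
# Sketch of two semantics-preserving syntactic perturbations:
# (i) breaking long sentences and (iii) rule-based rewrites such as
# "that's" -> "that is". The rule list is an illustrative subset.
import re

REWRITE_RULES = {
    r"\bthat's\b": "that is",
    r"\bit's\b": "it is",
    r"\bdon't\b": "do not",
}

def apply_rewrite_rules(text: str) -> str:
    """Expand contractions with simple regex rules (case-sensitive here)."""
    for pattern, expansion in REWRITE_RULES.items():
        text = re.sub(pattern, expansion, text)
    return text

def break_long_sentences(text: str, max_words: int = 25) -> str:
    """Naively split any sentence longer than `max_words` at its first comma."""
    out = []
    for sentence in text.split(". "):
        if len(sentence.split()) > max_words and "," in sentence:
            left, _, right = sentence.partition(",")
            out.append(left.strip())
            out.append(right.strip().capitalize())
        else:
            out.append(sentence)
    return ". ".join(out)
```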
Table 2 summarizes the performance of the models when subjected to the different types of perturbations at test time.

Table 2: Accuracy of the detectors under each perturbation type and level. Cell values are the mean and standard deviation across 5 runs. We choose real articles for devising semantic perturbations and fake articles for devising syntactic perturbations. For semantic perturbations, model performance increases with the level of perturbation; for syntactic perturbations, the models perform worse as more attributes of human-written text are introduced. We mark 'NA' for article shuffling at the 100% perturbation level, since 100% shuffling yields a fully machine-generated article, which the detectors already classify correctly.

For our experiments, we pick 2K samples for every perturbation type (real articles for semantic and fake articles for syntactic perturbations) from the RealNews dataset, each classified correctly (100% accuracy) by all of the models before any perturbation is introduced. We start at the 25% perturbation level because very small perturbation levels may not change the overall semantics of an article; notably, Grover identifies its own generated text even at the 1% perturbation level (18% accuracy on the article shuffling perturbation). On manual examination, we found that on average 5% of the real articles did not change semantics when perturbed for varying sentiment and source-target exchange. Fakebox performance is not reported due to its very low accuracy on all the perturbations introduced in this work; since the details of the Fakebox tool are not publicly available, we omit it from further analysis.
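A sketch of the sample-selection step described above; the detector interfaces are hypothetical:

```python
# Sketch of the sample-selection step: keep only articles that every
# detector classifies correctly before perturbation, so any error after
# perturbation is attributable to the perturbation itself. `detectors`
# maps a name to a hypothetical predict(text) -> "real"/"fake" function.
from typing import Callable, Dict, List, Tuple

def select_agreed_samples(
    samples: List[Tuple[str, str]],               # (article text, gold label)
    detectors: Dict[str, Callable[[str], str]],
    n: int = 2000,
) -> List[Tuple[str, str]]:
    kept = []
    for text, label in samples:
        if all(predict(text) == label for predict in detectors.values()):
            kept.append((text, label))
            if len(kept) == n:
                break
    return kept
```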

Results and Analysis
The code is publicly available at https://github.com/meghu2791/evaluateNeuralFakenewsDetectors. From our experiments, we make the following observations:

• All the machine-generated fake news detectors considered in this work are vulnerable to semantic perturbations, even at extreme perturbation levels, indicating that the factuality of an article does not inform the models' decisions.

• Grover performs well on article shuffling, indicating that it learns the sentence structure of its own generated text quite well.

• The machine-generated fake news detectors are also vulnerable to semantics-preserving syntactic perturbations, indicating that they could be learning sentence structure. Another explanation is dataset bias, since the machine-generated text in the training data has longer sentences, more punctuation, and more definite articles than the human-written text.

• Grover fails to detect sentiment changes in articles, indicating that it is insensitive to polarity between entities. From manual examination, we found that 5% of the real articles perturbed by varying sentiment contain uncommon phrases, which would have aided the GPT2 detector. For example, "Police say Aranda told them he would go to the mall" → "Police say Aranda told them he stay in place to the mall".

• Current machine-generated fake news detectors rely on previously seen data, without external resources, for classification. This could explain the performance drop of GPT2 on varying sentiment at the 100% perturbation level, since there is then no inconsistent polarity towards entities, unlike at the 50% or 75% levels.

• Transformers are insensitive to perturbations like word-level shuffling and possibly learn a bag-of-words-like distribution (Sankar et al., 2019). GPT2's failure to identify source-target exchange is consistent with this observation. The marginal gains of Grover probably indicate a better understanding of linguistic cues in sentences.

• The better performance of GPT2 on entity replacement could be due to the non-existence of the replacement entities in articles labeled real in GPT2's training dataset.

Conclusion
With the advances in language modeling for text generation, the detection of fake news becomes challenging. We find that the success of style-based classifiers is limited when real articles are perturbed, even under extreme modifications. We believe our experiments motivate the integration of multiple dimensions into machine-generated fake news detectors, such as examining source credibility, fact-checking via external resources, improving model robustness through adversarial training, and commonsense reasoning. By open-sourcing our code, we believe our methodology of studying vulnerabilities in fake news detectors can aid the creation of more robust models in the future.