Automatic Detection of Machine Generated Text: A Critical Survey

Text generative models (TGMs) excel in producing text that matches the style of human language reasonably well. Such TGMs can be misused by adversaries, e.g., by automatically generating fake news and fake product reviews that can look authentic and fool humans. Detectors that can distinguish text generated by TGM from human written text play a vital role in mitigating such misuse of TGMs. Recently, there has been a flurry of works from both natural language processing (NLP) and machine learning (ML) communities to build accurate detectors for English. Despite the importance of this problem, there is currently no work that surveys this fast-growing literature and introduces newcomers to important research challenges. In this work, we fill this void by providing a critical survey and review of this literature to facilitate a comprehensive understanding of this problem. We conduct an in-depth error analysis of the state-of-the-art detector and discuss research directions to guide future work in this exciting area.


Introduction
Current state-of-the-art text generative models (TGMs) excel in producing text that approaches the style of human language, especially in terms of grammaticality, fluency, coherency, and usage of real world knowledge (Zellers et al., 2019; Keskar et al., 2019; Bakhtin et al., 2020; Brown et al., 2020). TGMs are useful in a wide variety of applications, including story generation (Fan et al., 2018), conversational response generation (Zhang et al., 2020), code auto-completion (Solaiman et al., 2019), and radiology report generation (Liu et al., 2019a). However, TGMs can also be misused for fake news generation (Zellers et al., 2019; Brown et al., 2020; Uchendu et al., 2020), fake product reviews generation (Adelani et al., 2020), and spamming/phishing (Weiss, 2019). Thus, it is important to build tools that can minimize the threats posed by the misuse of TGMs.
The commonly used approach to combat the threats posed by the misuse of TGMs is to formulate the problem of distinguishing text generated by TGMs from human written text as a classification task. The classifier, henceforth called the detector, can be used to automatically remove machine generated text from online platforms such as social media, e-commerce, email clients, and government forums, when the intention of the TGM generated text is abuse. An ideal detector should be: (i) accurate, that is, it achieves good accuracy with a suitable trade-off between false positives and false negatives depending on the online platform (e.g., email client, social media) on which the TGM is applied (Solaiman et al., 2019); (ii) data-efficient, that is, it needs as few examples as possible from the TGM used by the attacker (Zellers et al., 2019); (iii) generalizable, that is, it detects text generated under different modeling choices of the TGM used by the attacker, such as model architecture, TGM training data, TGM conditioning prompt length, model size, and text decoding method (Solaiman et al., 2019; Bakhtin et al., 2020; Uchendu et al., 2020); (iv) interpretable, that is, detector decisions need to be understandable to humans (Gehrmann et al., 2019); and (v) robust, that is, the detector can handle adversarial examples (Wolff, 2020). Given the importance of this problem, there has been a flurry of research recently from both NLP and ML communities on building useful detectors. However, there is currently no work that provides a literature review of existing detection works and highlights important research challenges.
In this paper, we present a critical literature review of the existing detection research for English to aid understanding of this important area. We organize the survey to guide the reader seamlessly through a number of important aspects, as follows: First, we establish the background for the detection task, which includes TGMs, decoding methods for text generation, and social impacts of TGMs (§2). Second, we present various aspects of large-scale TGMs such as model architecture, training cost, and controllability (§3). Third, we present and discuss the various existing detectors in terms of their underlying methods (§4). Fourth, we provide a linguistically and computationally motivated analysis of key issues of the state-of-the-art detector (§5). Fifth, we discuss interesting future research directions that can help in building useful detectors (§6). Our main contributions are three-fold:
• We provide the first survey on the important, burgeoning area of detection of machine generated text from human written text.
• We develop an error analysis of the current state-of-the-art detector, guided and illustrated by machine generated texts, to shed light on the limitations of existing detection work.
• Motivated by our analysis and existing challenges, we propose a rich and diverse set of research directions to guide future work in this exciting area.

Background
Here, we provide the background for the problem of detecting machine generated text from human written text. Specifically, we introduce key concepts in training a TGM, generating text from a TGM, and social implications of using TGMs in practice. Existing detection datasets are discussed in the Appendix.

Training TGM
A TGM is typically a neural language model (NLM) trained to model the probability of a token given the previous tokens in a text sequence, i.e., $p_\theta(x_t \mid x_1, \ldots, x_i, \ldots, x_{t-1})$, with tokens coming from a vocabulary, $x_i \in V$. If $x = (x_1, \ldots, x_{|x|})$ represents the text sequence, $p_\theta$ typically takes the form $p_\theta(x) = \prod_{t=1}^{|x|} p_\theta(x_t \mid x_1, \ldots, x_{t-1})$. If $p_*(x)$ denotes the reference distribution and $D$ denotes a finite set of text sequences from $p_*$, the TGM estimates parameters $\theta$ by minimizing the following objective function:
$$L(p_\theta, D) = -\sum_{j=1}^{|D|} \sum_{t=1}^{|x^{(j)}|} \log p_\theta\big(x_t^{(j)} \mid x_1^{(j)}, \ldots, x_{t-1}^{(j)}\big) \qquad (1)$$
Notice that a TGM can be a non-neural model (e.g., n-gram LM) and can be based on a nontraditional LM objective (e.g., masked language modeling (Devlin et al., 2019; Song et al., 2019)). In this survey, we focus primarily on TGMs for English that are neural and based on the traditional LM objective, as they are successful in generating coherent paragraphs of English text.
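As a minimal sketch of the language modeling objective described above, the following Python snippet computes the summed negative log-likelihood of a small dataset under a toy model. The bigram table, the `<s>` start symbol, and the function names are illustrative assumptions; an actual TGM conditions on the full prefix rather than only on the previous token.

```python
import math

# Hypothetical toy model: p(x_t | x_{t-1}) stored as a nested dict.
# A real TGM conditions on the whole prefix x_1..x_{t-1}; a bigram table is
# used here only to keep the example self-contained.
BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sleeps": 1.0},
    "dog": {"sleeps": 1.0},
}

def sequence_log_prob(tokens, model=BIGRAM, start="<s>"):
    """log p(x) as a sum of per-token conditional log probabilities."""
    log_p, prev = 0.0, start
    for tok in tokens:
        log_p += math.log(model[prev][tok])
        prev = tok
    return log_p

def nll_objective(dataset, model=BIGRAM):
    """Negative log-likelihood summed over all sequences in the dataset D."""
    return -sum(sequence_log_prob(x, model) for x in dataset)

D = [["the", "cat", "sleeps"], ["a", "dog", "sleeps"]]
loss = nll_objective(D)
```

Training would adjust the model parameters to lower `loss`; here the table is fixed, so the snippet only evaluates the objective.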

Generating text from TGM
Given a sub-sequence (prefix), $x_{1:k} \sim p_*$, the task of generating text from a TGM is to use $p_\theta$ to conditionally decode a continuation, $\hat{x}_{k+1:N} \sim p_\theta(\cdot \mid x_{1:k})$, such that the resulting completion $(x_1, \ldots, x_k, \hat{x}_{k+1}, \ldots, \hat{x}_N)$ resembles a sample from $p_*$ (Welleck et al., 2020). In a news article generation task, the prefix can be the headline and the continuation can be the body of the news article. In a story generation task, the prefix can be the beginning of a story and the continuation can be the rest of the story. Since computing the optimal continuation $\hat{x}_{k+1:N}$ is intractable, with an exhaustive search over all continuations having time complexity $O(|V|^{N-k})$, approximate deterministic or stochastic decoding methods are used to generate continuations.
Deterministic methods: In deterministic methods, the continuation is fully determined by the TGM parameters and prefix. The two most commonly used deterministic decoding methods are greedy search and beam search. Greedy search works by selecting the highest probability token at each time step, $\hat{x}_t = \arg\max_{x_t} p_\theta(x_t \mid x_1, \ldots, x_{t-1})$, with time complexity of $O((N-k)|V|)$. On the other hand, beam search maintains a fixed-size ($b$) set of partially decoded sequences, called hypotheses. At each time step, beam search creates new hypotheses by appending each token in the vocabulary to each existing hypothesis, scores the resulting sequences using $p_\theta$, and keeps the $b$ highest-scoring ones, with time complexity of $O((N-k)b|V|)$. In practice, these deterministic decoding methods depend highly on the underlying model probabilities and suffer from producing degenerate continuations, i.e., generic text often with repetitive tokens (Holtzman et al., 2020). Recently, Welleck et al., (2020) show that the degeneracy issues with beam search can be alleviated by training a TGM with the original TGM objective (Eq. (1)) augmented with an unlikelihood objective that assigns lower probabilities to unlikely generations.
Stochastic methods: Stochastic decoding methods work by sampling from a model-dependent distribution at each time step, $x_t \sim q(x_t \mid x_1, \ldots, x_{t-1}, p_\theta)$. In unrestricted sampling (also known as pure sampling), the chance of sampling a low-confidence token from the unreliable tail of the distribution is very high, leading to text that can be unrelated to the prefix. To reduce the chance of sampling a low-confidence token, sampling is limited to a subset of the vocabulary $W \subset V$ at each time step. Let $Z = \sum_{x \in W} p_\theta(x \mid x_1, \ldots, x_{t-1})$. If $x_t \in W$, $q(x_t \mid x_1, \ldots, x_{t-1}, p_\theta)$ is set to $p_\theta(x_t \mid x_1, \ldots, x_{t-1})/Z$; otherwise it is set to 0. The two most effective stochastic decoding methods are top-k sampling (Fan et al., 2018) and top-p (or nucleus) sampling (Holtzman et al., 2020).
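Returning to the deterministic methods described above, greedy search and beam search can be sketched as follows. This is an illustrative sketch only: `toy_next_token_probs` is a hypothetical stand-in for the model's next-token distribution over a two-token vocabulary, and the function names are assumptions rather than any library's API.

```python
import math

# Hypothetical next-token model: maps a prefix (tuple of tokens) to a
# distribution over a tiny vocabulary, standing in for the TGM.
def toy_next_token_probs(prefix):
    if not prefix:
        return {"a": 0.6, "b": 0.4}
    if prefix[-1] == "a":
        return {"a": 0.55, "b": 0.45}
    return {"a": 0.1, "b": 0.9}

def greedy_search(next_probs, prefix, steps):
    """Append the single most probable token at every step."""
    seq = tuple(prefix)
    for _ in range(steps):
        probs = next_probs(seq)
        seq += (max(probs, key=probs.get),)
    return seq

def beam_search(next_probs, prefix, steps, beam_size):
    """Keep the `beam_size` best hypotheses by cumulative log probability."""
    beams = [(0.0, tuple(prefix))]  # (cumulative log-prob, sequence)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            # Extend every hypothesis with every token in the vocabulary.
            for tok, p in next_probs(seq).items():
                candidates.append((score + math.log(p), seq + (tok,)))
        # Retain only the highest-scoring hypotheses for the next step.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beams[0][1]

greedy = greedy_search(toy_next_token_probs, (), steps=2)
beam = beam_search(toy_next_token_probs, (), steps=2, beam_size=2)
```

In this toy setting, greedy search commits to the locally best first token and ends with a sequence of probability 0.33, whereas beam search with two hypotheses recovers a sequence of probability 0.36.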
The top-k sampler limits sampling to the $k$ most-probable tokens, that is, $W$ is the size-$k$ subset of $V$ that maximizes $\sum_{x \in W} p_\theta(x \mid x_1, \ldots, x_{t-1})$. The top-k sampler uses a constant value of $k$, which can be sub-optimal in different contexts, that is, generated text is limited to a subset of the natural language distribution. For example, generic contexts (e.g., predicting a noun) might require a larger value of $k$, while other contexts (e.g., predicting prepositions) might require a smaller value of $k$ so that only useful candidate tokens are considered. The nucleus sampler overcomes the burden of considering only a fixed number of tokens by limiting sampling to the smallest set of tokens with total mass above a threshold $p \in [0, 1]$, i.e., $W$ is the smallest subset with $\sum_{x \in W} p_\theta(x \mid x_1, \ldots, x_{t-1}) \geq p$. Thus, the number of candidate tokens considered varies dynamically depending on the context, and the resulting text is reasonably natural with fewer repetitions. Recently, Massarelli et al., (2020) show that top-k and top-p samplers tend to generate more nonfactual sentences, as verified against Wikipedia.
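The two truncation rules above can be sketched in a few lines of Python. The next-token distribution and all function names below are hypothetical, chosen for illustration; both functions restrict sampling to a subset of the vocabulary and renormalize by the retained mass.

```python
import random

def truncate_top_k(probs, k):
    """Keep the k most probable tokens and renormalize by their total mass."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    z = sum(p for _, p in kept)
    return {tok: p / z for tok, p in kept}

def truncate_top_p(probs, p_threshold):
    """Keep the smallest set of most probable tokens with mass >= p_threshold."""
    kept, mass = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, p))
        mass += p
        if mass >= p_threshold:
            break
    return {tok: p / mass for tok, p in kept}

def sample(dist, rng):
    """Draw one token from a renormalized distribution."""
    tokens, weights = zip(*dist.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution, for illustration only.
next_token = {"the": 0.5, "a": 0.25, "cat": 0.125, "dog": 0.0625, "zebra": 0.0625}
top_k_dist = truncate_top_k(next_token, k=2)
top_p_dist = truncate_top_p(next_token, p_threshold=0.8)
token = sample(top_p_dist, random.Random(0))
```

With `k=2` the candidate set always has two tokens, while the nucleus rule keeps three tokens here and would keep more or fewer for flatter or sharper distributions.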

Social impacts of TGMs
Bias: Unsurprisingly, a TGM can capture and amplify the societal biases (over-generalized beliefs about a particular group of people, e.g., Group X are bad drivers) present in the training data (Sun et al., 2019; Nadeem et al., 2020). Solaiman et al., (2019) and Brown et al., (2020) show that TGMs reflect gender bias (e.g., favoring males over females), racial bias (e.g., favoring white over black people), and religious bias (e.g., favoring Christians over Muslims). Although TGMs can be used as a tool to study how patterns in the training data can translate to these unintended biases in the model outputs (Solaiman et al., 2019), the biases can cause harm to the people in relevant groups in many ways (Crawford, 2017).
Beneficial usage: TGMs are used to create task-specific systems, such as question answering, reading comprehension, natural language inference, and machine translation (Brown et al., 2020). TGMs can also be used to generate text that approximately matches the style of human language, which benefits applications such as story generation (Fan et al., 2018), conversational response generation (Zhang et al., 2020), code auto-completion (TabNine, 2020), and radiology report generation (Liu et al., 2019a).
Malicious usage: TGMs can be misused by (even low-skilled) adversaries for malicious purposes, such as fake news generation (Zellers et al., 2019; Brown et al., 2020; Uchendu et al., 2020), fake product reviews generation (Adelani et al., 2020), and spamming/phishing (Weiss, 2019). Humans can spot fake news articles (Brown et al., 2020), fake product reviews (Adelani et al., 2020), and fake comments (Weiss, 2019) generated by TGM only at chance level. To combat the threats posed by such adversaries, accurate models that can distinguish text generated by TGM from human written text need to be built.
Such a model can have benevolent uses such as moderating content in vulnerable platforms including social media, email clients, government websites, and e-commerce websites.
Table 1: Summary of the characteristics of TGMs that can act as threat models. The last column corresponds to the threats discussed in the original paper.

Text generative models
In this section, we will discuss various aspects of large-scale TGMs. These TGMs act as threat models since they can be misused by a low-skilled adversary, e.g., by generating fake news and fake product reviews. Table 1 displays the summary of key characteristics of these TGMs along with the threats they pose (according to the original papers).

Model architecture, training data, training cost
Model architecture: The model architecture underlying all the state-of-the-art TGMs is the transformer (Vaswani et al., 2017). Compared to recurrent neural networks (RNNs) (Elman, 1990), the transformer model does not have a bias to recent tokens and can learn long-range dependency information. Text generated by transformer-based TGMs such as GPT-2 tends to be grammatically correct and coherent, and to make use of world knowledge.
Training data: TGMs such as GPT-2, CTRL (Keskar et al., 2019), and GPT-3 (Brown et al., 2020) have billions of parameters. They are generally trained using the language modeling objective on large amounts of raw text from a diverse set of sources (like Wikipedia, Reddit, and news sources). As an exception, GROVER (Zellers et al., 2019) is trained on millions of news articles only. Such trained TGMs can also be fine-tuned on a domain-specific corpus for the LM task to generate text that matches the respective domain reasonably. For example, Adelani et al., (2020) fine-tune the GPT-2 model on the specific domain of product reviews to generate fake reviews that mimic the style of a human review.
Training cost: Training TGMs with billions of parameters on millions of documents requires a huge computational budget (Zellers et al., 2019), high energy cost (Strubell et al., 2019), and long training time (Brown et al., 2020). Unfortunately, it is not yet a standard practice to report financial (vs. energy vs. computational) budget in every research publication. This makes it hard for us to perform TGM training feasibility studies. One exception is the work done by Zellers et al., (2019), where they explicitly mention that their proposed TGM model, GROVER, took two weeks of training with a cost of $25K (including the cost of data collection). We note that even though this may be an expensive budget, it is by no means outside the reach of even low-resource organizations, let alone nation states. The implication is that various entities of variable sizes and resource capabilities can practically deploy models for spreading disinformation using TGMs.

Controllable generation
Controllable TGMs possess the ability to control aspects of the generation, such as the topic and sentiment of the article. GPT-2 and GPT-3 (Brown et al., 2020) assume the prefix to be any natural language text, which might be too coarse for controlling the generation in an explicit fashion.
Researchers have devised two ways to design a controllable TGM, which we now introduce.
Training with control tokens: The first way is to leverage meta-information about the article such as its author, date of creation, source domain and prepend this information as additional token(s) to the input sequence, before training the TGM. These tokens act as additional context for the article, allowing the TGM to learn the relation between the meta-information and the original article. Once trained, the TGM model can be controlled by prompting with the meta-information of users' interest. The first controllable TGM proposed is the GROVER model, which can generate a news article given the meta-information of the news article (such as headline, author, and date). The GROVER model can create trustworthy fake news that is harder for humans to identify than human written fake news and can thus pose a significant threat. Similar to the GROVER model, the CTRL model provides explicit control of particular aspects of the generated text by exploiting naturally occurring control codes (e.g., the URL for a news article) to condition the text (e.g., news article body). These control codes govern style (e.g., sports vs. politics, FOX sports vs. CNN sports), content (e.g., Wikipedia vs. books), and task-specific behavior (e.g., question answering vs. machine translation).
Control using attribute classifier: The second and the most recent way to design a controllable TGM is to combine a pretrained TGM like GPT-2 with one or more attribute classifiers (e.g., sentiment classifier) that guide text generation (Dathathri et al., 2020). The attribute models measure the extent to which the desired attribute is encoded in a piece of text. At each timestep, GPT-2 updates its latent representations based on gradients from the attribute model for the text generated so far so as to increase the likelihood of the generated text having the desired attribute. The updated latents are used to compute a new next token distribution from which a token to be generated is sampled.
The interesting property of this method is that the TGM need not be retrained (unlike the work of Adelani et al., (2020), which requires retraining the GPT-2 model), thereby avoiding the significant cost of retraining.

Detectors
In this section, we discuss various detectors for identifying machine generated text from human written text. To aid understanding of the literature, we organize the detectors according to the underlying methods on which they are based.

Classifiers trained from scratch
Bag-of-words classifier: Some detectors employ classical machine learning methods such as logistic regression to train a model from scratch to discriminate between text generated by TGM and human written text. Solaiman et al., (2019) use a simple baseline that represents a document as a tf-idf vector (unigrams and bigrams) and feeds it to a logistic regression model to distinguish WebText articles (online web pages) from text generated using GPT-2 models. They study different sizes of GPT-2 models that vary in terms of number of parameters (117M, 345M, 762M, 1542M) and different sampling techniques (pure sampling, top-k sampling, and top-p sampling). They observe that generations from the larger GPT-2 models are more difficult to detect than those of the smaller models, which indicates that the larger the TGM, the closer the style of the generated text is to that of human written text. Top-k samples are easier to detect, while nucleus samples are harder to detect. This result stems from the fact that the top-k sampler typically over-generates common words, leaving statistical anomalies that are easily spotted by the detector. Additionally, Solaiman et al., (2019) fine-tune the GPT-2 model on Amazon product reviews and show that the text generated by the fine-tuned GPT-2 model is harder to detect, as fine-tuned domain-specific TGMs are more human-like than a general-purpose TGM (i.e., the original GPT-2 model).
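To make the bag-of-words baseline concrete, the sketch below builds tf-idf vectors over unigrams and bigrams and trains a logistic regression detector with plain gradient descent, using only the Python standard library. It is an illustration under stated assumptions, not the setup of Solaiman et al., (2019): the four-document corpus is made up, and the smoothing, learning rate, and epoch count are arbitrary choices.

```python
import math
from collections import Counter

def ngrams(text):
    """Unigrams and bigrams of a whitespace-tokenized, lowercased document."""
    toks = text.lower().split()
    return toks + [" ".join(pair) for pair in zip(toks, toks[1:])]

def fit_idf(docs):
    """Smoothed inverse document frequency for every n-gram seen in training."""
    df = Counter(g for d in docs for g in set(ngrams(d)))
    n = len(docs)
    return {g: math.log((1 + n) / (1 + c)) + 1.0 for g, c in df.items()}

def tfidf(doc, idf):
    """L2-normalized sparse tf-idf vector; unseen n-grams are ignored."""
    tf = Counter(g for g in ngrams(doc) if g in idf)
    vec = {g: c * idf[g] for g, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {g: v / norm for g, v in vec.items()}

def predict_proba(vec, weights, bias):
    """Probability that the document is machine generated (label 1)."""
    z = bias + sum(weights.get(g, 0.0) * v for g, v in vec.items())
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(vectors, labels, epochs=200, lr=1.0):
    """Stochastic gradient descent on the logistic loss."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            err = predict_proba(vec, weights, bias) - y
            bias -= lr * err
            for g, v in vec.items():
                weights[g] = weights.get(g, 0.0) - lr * err * v
    return weights, bias

# Tiny made-up corpus (labels: 1 = machine generated, 0 = human written).
train_docs = [
    "the the product is very very good good",
    "it is good it is good it is good",
    "arrived late but the blender works fine",
    "my kids broke it within a week sadly",
]
train_labels = [1, 1, 0, 0]

idf = fit_idf(train_docs)
weights, bias = train_logreg([tfidf(d, idf) for d in train_docs], train_labels)
train_preds = [int(predict_proba(tfidf(d, idf), weights, bias) > 0.5) for d in train_docs]
```

A realistic detector would be trained on thousands of documents per class and evaluated on held-out text; the toy corpus only shows how the pieces fit together.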
Detecting machine configuration: Tay et al., (2020) study the extent to which different modeling choices (decoding method, TGM model size, prompt length) leave artifacts (detectable signatures that arise from modeling choices) in the generated text. They propose the task of identifying the TGM modeling choice given the text generated by TGM. They show that a classifier can be trained to predict the modeling choice well beyond the chance level, which suggests that text generated by TGM may be more sensitive to TGM modeling choices than previously thought. They also find that the proposed detection task of identifying text generated by different TGM modeling choices is less hard than the task of identifying text generated by TGM from human written text along with different TGM modeling choices. They show that word order does not matter much, as a bag-of-words detector performs very similarly to detectors based on a complex encoder (e.g., transformer). This result is consistent with the recent work done by Uchendu et al., (2020), which shows that simple models (traditional ML models trained on psychological features and simple neural network architectures) perform well in three settings: (i) classify if two given articles are generated by the same TGM; (ii) classify if a given article is written by a human or a TGM (the original detection problem); (iii) identify the TGM that generated a given article (similar to Tay et al., (2020)). For the original detection problem, the authors find that, among several TGMs, the text generated by the GPT-2 model is hard to detect (see Appendix for the list of studied TGMs).

Zero-shot classifier
In the zero-shot classification setting, a pretrained TGM (for example, GPT-2, GROVER) is employed to detect generations from itself or similar models. The detector does not require supervised detection examples for further training (i.e., fine-tuning).
Total log probability: Solaiman et al., (2019) present a baseline that uses the TGM to evaluate the total log probability of a text and thresholds this probability to make the prediction. For instance, text is predicted as machine generated if the overall likelihood of the text according to the GPT-2 model is closer to the mean likelihood over all machine generated texts than to the mean likelihood of human written texts. However, they find that this classifier performs poorly compared to the previously discussed logistic regression based classifier (§4.1).
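The decision rule described above can be sketched as follows. The snippet is a hedged illustration: a fixed unigram table is a hypothetical stand-in for the pretrained TGM used as scorer, and the reference texts are made up.

```python
import math

# Hypothetical stand-in for a pretrained TGM: a fixed unigram distribution.
# In the setting described above, scores would come from GPT-2 or GROVER.
UNIGRAM = {"the": 0.4, "movie": 0.2, "was": 0.2, "good": 0.15, "sublime": 0.05}

def total_log_prob(tokens, model=UNIGRAM):
    """Total log probability of a token sequence under the scoring model."""
    return sum(math.log(model[t]) for t in tokens)

def mean(values):
    return sum(values) / len(values)

def fit_reference_means(machine_texts, human_texts):
    """Mean total log probability of known machine and human texts."""
    return (mean([total_log_prob(t) for t in machine_texts]),
            mean([total_log_prob(t) for t in human_texts]))

def is_machine_generated(tokens, machine_mean, human_mean):
    """Predict 'machine' if the score is closer to the machine mean."""
    score = total_log_prob(tokens)
    return abs(score - machine_mean) < abs(score - human_mean)

machine_refs = [["the", "movie", "was", "good"], ["the", "the", "movie", "was"]]
human_refs = [["the", "movie", "was", "sublime"], ["sublime", "movie", "was", "good"]]
machine_mean, human_mean = fit_reference_means(machine_refs, human_refs)

pred_likely = is_machine_generated(["the", "was", "the", "good"], machine_mean, human_mean)
pred_rare = is_machine_generated(["sublime", "sublime", "movie", "good"], machine_mean, human_mean)
```

Because total log probability grows in magnitude with length, a practical variant would compare texts of similar length or normalize per token; that refinement is left out of this sketch.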

Giant Language model Test Room (GLTR) tool: The GLTR tool (Gehrmann et al., 2019) proposes a suite of baseline statistical methods that can highlight the distributional differences between text generated by the GPT-2 model and human written text. Specifically, GLTR enables the study of a piece of text by visualizing per-token model probability, per-token rank in the predicted next token distribution, and entropy of the predicted next token distribution. Based on these visualizations, the tool clearly shows that TGMs over-generate from a limited subset of the true distribution of natural language. Indeed, rare word usage in text generated by the GPT-2 model is markedly lower than in human written text. The tool lets humans (including non-experts) study a piece of text, but might be less effective in the future once TGMs start generating text that lacks statistical anomalies.
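The three per-token statistics named above can be computed from a next-token distribution as in the sketch below. The distributions are hypothetical placeholders for the predictions of a scoring model, and the function name is an assumption; the snippet does not reproduce the GLTR implementation.

```python
import math

def token_statistics(next_token_dist, observed_token):
    """Probability, rank, and entropy (in bits) for one position of a text."""
    ranked = sorted(next_token_dist.items(), key=lambda kv: kv[1], reverse=True)
    # Rank 1 is the most probable token; ties keep their original order.
    rank = 1 + [tok for tok, _ in ranked].index(observed_token)
    entropy = -sum(p * math.log2(p) for p in next_token_dist.values() if p > 0)
    return {"prob": next_token_dist[observed_token], "rank": rank, "entropy": entropy}

# Hypothetical per-position predictions paired with the token actually observed.
positions = [
    ({"the": 0.5, "a": 0.25, "one": 0.125, "zebra": 0.125}, "the"),
    ({"cat": 0.25, "dog": 0.25, "film": 0.25, "ocelot": 0.25}, "ocelot"),
]
stats = [token_statistics(dist, tok) for dist, tok in positions]
fraction_top1 = sum(s["rank"] == 1 for s in stats) / len(stats)
```

A text in which most observed tokens have rank 1 or a high probability shows the over-generation pattern discussed above, whereas human written text tends to contain more low-rank tokens.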

Fine-tuning NLM
In this setup, a pretrained language model (e.g., BERT, RoBERTa (Liu et al., 2019b)) is fine-tuned to detect text generated from itself or similar models. Unlike the zero-shot classification setup, the detector does require supervised detection examples for further training. Zellers et al., (2019) find that a detector obtained by fine-tuning GROVER is the most accurate at spotting text generated by GROVER, and thereby conclude that the best models for generating neural disinformation are also the best at detecting their own generations. This result suggests the need to make generators such as GROVER and GPT-2 publicly available. Nevertheless, the authors do not test whether the same pattern holds for BERT, that is, whether a BERT detector also excels in detecting text written by a BERT generator, given that the two possess similar inductive biases. Uchendu et al., (2020) show that the off-the-shelf GROVER detector does not perform well in detecting text generated by TGMs other than the original GROVER model. The most interesting finding of this work is that fine-tuning using the RoBERTa model achieves higher accuracy than fine-tuning a GPT-2 model with equivalent capacity. This result might be due to the superior quality of the bidirectional representations inherent in the masked language modeling objective employed by the RoBERTa language model compared to the GPT-2 language model, which is limited to learning only unidirectional (left to right) representations. This finding contradicts that of the GROVER work (Zellers et al., 2019), where the authors conclude that the best model for detecting neural disinformation from a TGM is the TGM itself. Recently, Fagni et al., (2020) show that the RoBERTa detector establishes the state-of-the-art performance in distinguishing machine generated tweets from human written tweets, outperforming both traditional ML models (e.g., bag-of-words) and complex neural network models (e.g., RNN, CNN) by a large margin.
This interesting result indicates that the RoBERTa detector can generalize to publication sources unseen during its pretraining such as Twitter.
The RoBERTa detector also outperforms existing detectors in spotting news articles generated by several TGMs (Uchendu et al., 2020) and product reviews generated by the GPT-2 model fine-tuned on Amazon product reviews (Adelani et al., 2020).

Human-machine collaboration
Apart from building a statistical model to detect online disinformation, one can build a system that can leverage human visual interpretation skills and common sense knowledge.
Differences between human and machine detectors: Prior work studies the differences in the ability of humans and automated detectors to identify text generated by TGM. The authors observe: (i) human raters are good at noticing contradictions or semantic errors (e.g., incoherence) in text generated by TGM, which the automatic detectors are weak at, due to a lack of deep semantic understanding; (ii) automatic detectors are good when text generated by TGM contains an over-representation of high-likelihood words (a caveat of top-k sampling as discussed in §2.2), whereas human raters are not. Overall, automatic detectors are significantly better than human raters, but generalize poorly to text generated by unseen decoding methods.
Supporting untrained humans: As seen before, the GLTR tool (Gehrmann et al., 2019) can aid humans by visualizing the properties of text such as unexpected and out-of-context words. The main advantage of GLTR is that it helps untrained humans detect synthetic text more accurately (accuracy improves from 54% to 72%). However, while GLTR flags machine generated text easily, it is hard to be confident that a given text is not machine generated. This result suggests the need for human-machine collaboration to solve the detection task (Solaiman et al., 2019).
Real or Fake Text (RoFT) tool: The RoFT tool (Dugan et al., 2020) focuses on evaluating human detection of text generated by TGM by asking humans to detect the sentence boundary at which the text transitions from human written text to machine generated text. The main assumption is that the TGM successfully fools the human if the guess from the human is far from the true sentence boundary. Current TGMs can fool humans by one or two sentences. The core advantages of the RoFT tool include its engaging annotation interface, its collection of users' free-form explanations for their guesses, and its potential to scale to different textual domains as well as different TGM modeling choices. The main limitation of the tool is that the text shown to the humans can be rife with human generated sentences, and hence does not reflect an organic generation from a TGM.
In this section, we discuss open issues in the state-of-the-art detector based on the RoBERTa model, which has been shown to excel in detecting text generated by TGM based on news articles, product reviews, tweets, and web pages (see §4.3). 4 We focus on the task of detecting text generated by the GPT-2 model from human written Amazon product reviews, a challenging task given the shortness of reviews. We employ the RoBERTa detector on the publicly available dataset, containing generations from the GPT-2 model (1542M parameters) based on pure, top-k and top-p sampling along with human written reviews (see Appendix for dataset details). In Figure 1, we plot the accuracy of the detector w.r.t. number of training examples per class, averaged over ten random initializations to control for initialization effects. We observe that the RoBERTa detector needs several thousands of examples to reach high accuracy. Specifically, it has an impractical requirement of 200K, 15K and 50K training examples for performing at 90% accuracy on identifying pure, top-k and top-p examples respectively. 5 Given that creation of large datasets for the detection task is hard (Zellers et al., 2019), it is important to investigate whether the data-efficiency of the RoBERTa detector can be significantly improved. We manually inspect 100 randomly picked false positives (machine generated product review incorrectly predicted as human written product review) of the RoBERTa detector trained on 15K examples each from top-p generations and from human written reviews. 6 Below, we list down the error categories that we have identified and provide at least one example for each error category. Fluency: Among the false positive reviews, we find 73 reviews to be very fluent and can confuse even humans (1).
(1) I loved this film. I can't really explain why, but when I first saw it it struck me as bizarre, almost oddball, but I quickly got over that and remembered that I love oddball films. This was an early 80's film. A great film to see on a gloomy rainy evening. This film is suspenseful and full of weirdness. Add this to your collection. Shortness: Out of these 73 identified fluent reviews, 27 reviews are very short, with a median of 24 words. We give two examples below: (2) love it. best sweeper.
My favorite combo. Always works and usually cools my system to boot. So glad I got these instead of other brands. Factuality: We find 10 false positive reviews to contain factual errors. 4 Concurrent with our work, Zhong et al., (2020) propose a detector that leverages factual and coherence structure underlying the text, which outperforms the RoBERTa detector in spotting machine generated text based on news articles and web pages. We also acknowledge that detectors fine-tuned on the state-of-the-art NLMs such as T5 (Raffel et al., 2020), ELECTRA (Clark et al., 2020) might most likely outperform the RoBERTa detector in general. 5 Given that attackers can create synthetic text at scale using TGMs, 90% detection accuracy might not be a high accuracy. 6 As seen in §2.2 and §4, top-p sampling produces good quality text that reasonably matches the style of human writing and is also harder to detect for humans. We leave the study of false negatives for future. Our annotation of 100 false positives can be accessed at: https://github.com/UBC-NLP/coling2020_machine_generated_text.

(4)
That movie got the stars and represents the best of this collection but there's better made Creature Movies as well including a 1960's remake of 'Dracula' with Kirk Douglas and Harrison Ford. (5) Just love Ben Affleck! He won't be missed in another very good movie. Worth watching especially if you like Ben! Review (4) on product 'Universal Studios Classic Monster Collection' contains the incorrect fact that Harrison Ford acted in 'Dracula' movie, and another review (5) on 'Runaway Jury' movie contains the incorrect fact that Ben Affleck acted in that movie. Spurious entities: In 4 false positive reviews, we find that the review contains novel entities unrelated to the domain of the product. For example, review (6) on 'Junkfood' musical product contains novel entity, 'grisberg', which is not associated with music domain.
(6) another classic by grisberg, i love stevie she was one of the greatest r&b singers I know darwin halstead ment her so be a big fan please do yo self a favor and buy this dvd, its nice and it absolutly amazing this woman has a very yorfelt approch to r&b music
Contradiction: We find one review (7) that contains a contradiction: the subject (husband) is described as not being a big fan of a product but also as loving the same product.
(7) My husband likes his coffee black so he loves flavored coffee but is not a big fan of flavored coffee. ...
Repetition: In two false positive reviews, facts are repeated.
(8) Great movie, although took a while to see at first it held my interest and kept me interested, plus i thought it was extremly good. also it was very good.
Common sense reasoning: We find one false positive review that describes an improbable event, that is, one that violates common sense reasoning.
(9) ... I received both amazon Prime and a Walmart's for delivery and they both came on time. I love it and highly recommend it!
Review (9), on a specific audio player product, mentions that the user received the same product from two e-commerce companies simultaneously, which is an improbable event.
Typos and grammatical errors: There are 7 false positive reviews that contain typographical and grammatical errors, such as (10) and (11). We note that such errors (especially spelling errors) are not unusual in online reviews, including those written by humans.
(10) Once they are on they aren't wrinkled or lose they shape.
(11) Had to unplug thing to get the hard drive to work. Would rather have don batteries in the olden days..
Incoherence: There are 3 false positive reviews that seem incoherent. The movie review (12) switches the focus of the discourse between actors (Sophia and Duchovny) and story line in an incoherent fashion, which violates the theory of centering in discourse analysis (Grosz et al., 1995; Gehrmann et al., 2019).
(12) ... Sophia Loren plays 'Marion' a 'showgirl' that is picked on by the establishment for her wild style. ... Duchovny's character is also 'On the line' in the business world. ... The storyline is so intriguing and unpredictable. ... Sophia Loren's acting is just awesome and her wardrobe is just perfect! If you love sex and nud**y, you will be greatly pleased.

Future Research Directions
In this section, we discuss a set of future research directions, which can help in building useful detectors.

Leveraging auxiliary signals
Existing detectors do not exploit auxiliary signals about the textual source. For example, the RoBERTa detector studied in §5 ignores the auxiliary signals about the review (e.g., helpfulness) and the product (e.g., description). Such auxiliary signals can be complementary to linguistic signals from the textual source for the detection task (Hovy, 2016; Solaiman et al., 2019). Given the rapidly evolving research on building intelligent TGMs that narrow the gap between the machine and human distributions of natural language text, auxiliary signals could play a crucial role in mitigating the threats posed by TGMs.

Assessing veracity of the text
Existing detectors assume that whether a text is fake is determined by the source (e.g., a TGM) that generated it. This assumption does not hold in two practical scenarios: (i) real text auto-generated by a process similar to that of fake text, and (ii) adversaries creating fake text by modifying articles originating from legitimate human sources. Schuster et al. (2020) show that existing detectors perform poorly in these two scenarios as they rely too much on distributional features, which cannot help in distinguishing texts from similar sources. Hence, we call for more research on detectors that assess the veracity of machine generated text by consulting external sources, like knowledge bases (Thorne and Vlachos, 2018) and diffusion networks (Vosoughi et al., 2018), instead of relying only on the source.

Building generalizable detectors
Existing detectors exhibit poor cross-domain accuracy, that is, they are not generalizable to different publication formats (Wikipedia, books, news sources) (Bakhtin et al., 2019). Beyond publication formats and topics (e.g., politics, sports), the detector should also transfer to unseen TGM settings such as model architecture, different decoding methods (e.g., top-k, top-p), model size, different prefix lengths, and training data (Bakhtin et al., 2020;Uchendu et al., 2020).
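One way to quantify the cross-domain gap described above is a leave-one-domain-out protocol: train on all domains but one and test on the held-out one. The sketch below only builds the splits; the function name `leave_one_domain_out` and the `(text, label, domain)` tuple layout are illustrative assumptions, not a protocol prescribed by the works cited here. The same grouping key could hold a TGM setting (e.g., decoding method or model size) instead of a publication format.

```python
from collections import defaultdict


def leave_one_domain_out(examples):
    """Yield (held_out_domain, train, test) splits.

    `examples` is a list of (text, label, domain) tuples; this layout
    is an assumption made for illustration.
    """
    by_domain = defaultdict(list)
    for example in examples:
        by_domain[example[2]].append(example)

    for held_out in sorted(by_domain):
        # Train on every domain except the held-out one.
        train = [
            example
            for domain, group in by_domain.items()
            if domain != held_out
            for example in group
        ]
        test = list(by_domain[held_out])
        yield held_out, train, test


# Toy data with three publication formats.
toy = [
    ("text a", "human", "news"),
    ("text b", "machine", "news"),
    ("text c", "human", "books"),
    ("text d", "machine", "wikipedia"),
]
splits = list(leave_one_domain_out(toy))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

Reporting accuracy per held-out domain, rather than one pooled number, makes it visible which formats or TGM settings the detector fails to transfer to.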

Building interpretable detectors
We discussed the importance of human raters pairing up with automatic detectors in §4.4. A viable way to enable this collaboration is to make the decisions of the automatic detector interpretable (as in GLTR), so that human raters can logically group the model decisions (e.g., as contradictions) and "accept", "modify", or "reject" them. This calls for more research on building detectors that can provide explanations for their decisions that are understandable to humans.

Building detectors robust to adversarial attacks
Existing detectors are brittle, i.e., the detector decisions can vary significantly for even small changes in the text input. For example, Wolff (2020) shows that the RoBERTa detector can be attacked using simple schemes such as replacing characters with homoglyphs and misspelling some words. These two attacks reduce the detector's recall on TGM-generated text from 97.44% to 0.26% and 22.68%, respectively. Therefore, it is important to study various adversarial attacks, ranging from simple attacks (e.g., misspellings) to advanced attacks (e.g., universal attacks (Wallace et al., 2019)), and to create adversarial examples with the aim of characterizing the vulnerabilities of the detector as well as making the detector robust against various attacks.
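To make the homoglyph attack concrete, the following is a minimal sketch, not the implementation used by Wolff (2020). The small Latin-to-Cyrillic table `HOMOGLYPHS`, the substitution `rate`, and the helper names are all assumptions chosen for illustration. The attacked string looks nearly identical to a human reader but consists of different code points, which is why a detector's tokenizer sees a very different input. The `recall` helper shows the metric quoted above: the fraction of machine generated texts that the detector still flags.

```python
import random

# Hypothetical, deliberately tiny table: Latin letter -> look-alike Cyrillic letter.
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small letter a
    "c": "\u0441",  # Cyrillic small letter es
    "e": "\u0435",  # Cyrillic small letter ie
    "o": "\u043e",  # Cyrillic small letter o
    "p": "\u0440",  # Cyrillic small letter er
    "x": "\u0445",  # Cyrillic small letter ha
}


def homoglyph_attack(text, rate=0.5, seed=0):
    """Replace each mappable character with its homoglyph with probability `rate`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in HOMOGLYPHS and rng.random() < rate:
            out.append(HOMOGLYPHS[ch])
        else:
            out.append(ch)
    return "".join(out)


def recall(labels, predictions, positive="machine"):
    """Fraction of positive (machine generated) examples predicted as positive."""
    true_pos = sum(
        1 for y, p in zip(labels, predictions) if y == positive and p == positive
    )
    actual_pos = sum(1 for y in labels if y == positive)
    return true_pos / actual_pos if actual_pos else 0.0


original = "a nice product at a good price"
attacked = homoglyph_attack(original, rate=1.0)
print(attacked)              # visually similar to the original
print(original == attacked)  # False: the underlying code points differ
```

In an evaluation, one would run the detector on both the original and the attacked machine generated texts and compare `recall` on the two sets; the detector itself is left out here because its interface is not specified in this survey.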

Conclusion
Detectors able to tease apart machine generated text from human written text can play a vital role in mitigating misuse of TGMs such as in automatic creation of fake news and fake product reviews. Our categorization of existing detectors and related issues into classifiers trained from scratch, zero-shot classifiers, fine-tuning NLMs, and human-machine collaboration can help readers contextualize each detector w.r.t. the fast-growing literature. We also hope that our computationally and linguistically motivated error analysis of the state-of-the-art detector can bring readers up to speed on many existing challenges in building useful detectors. Our rich and diverse set of research directions also has the potential to guide future work in this exciting area.
RealNews vs. GROVER: The Generating aRticles by Only Viewing mEtadata Records (GROVER) model (Zellers et al., 2019) is trained on RealNews, a collection of news articles from Common Crawl. The authors of the GROVER model provide a subset of news articles (not part of the training set of the GROVER model) and news articles generated by the GROVER model with top-p sampling.
Tweets vs. Misc.: Social media platforms like Twitter have several bot user accounts, whose entire timeline is composed of tweets produced by models such as Markov chain, RNN, LSTM (Hochreiter and Schmidhuber, 1997), GPT-2, and several miscellaneous (unknown) models. Fagni et al. (2020) provide a collection of tweets from manually identified bot accounts and a collection of tweets from the humans imitated by the bot accounts. This tweet dataset is challenging as the tweets are extremely short (medians of 14 and 16 words for human and machine tweets, respectively). Unlike other datasets, this tweet dataset contains real machine generated texts posted on Twitter, which lets us directly measure the real world utility of the detector. Since these machine generated tweets encompass generations from different TGMs such as Markov chain, LSTM, GPT-2, and miscellaneous models, this tweet dataset lets us study the generalizability of the detector with respect to the TGM that produced the text.
(Dathathri et al., 2020), and FAIR (Ng et al., 2019). Similar to the tweets dataset, this news dataset lets us study the generalizability of the detector with respect to the TGM that produced the text.