Neural Deepfake Detection with Factual Structure of Text

Deepfake detection, the task of automatically discriminating machine-generated text from human-written text, is increasingly critical with recent advances in natural language generative models. Existing approaches to deepfake detection typically represent documents with coarse-grained representations. However, they struggle to capture the factual structure of documents, which our statistical analysis shows to be a discriminative factor between machine-generated and human-written text. To address this, we propose a graph-based model that utilizes the factual structure of a document for deepfake detection of text. Our approach represents the factual structure of a given document as an entity graph, which is further utilized to learn sentence representations with a graph neural network. Sentence representations are then composed into a document representation for making predictions, where consistency relations between neighboring sentences are sequentially modeled. Results of experiments on two public deepfake datasets show that our approach significantly improves strong base models built with RoBERTa. Model analysis further indicates that our model can distinguish the difference in factual structure between machine-generated and human-written text.


Introduction
Nowadays, unprecedented amounts of online misinformation (e.g., fake news and rumors) spread through the internet, which may misinform people about essential social events (Faris et al., 2017; Thorne et al., 2018; Goodrich et al., 2019; Kryściński et al., 2019). Recent advances in neural generative models, such as GPT-2 (Radford et al., 2019), make the situation even more severe, as their ability to generate fluent and coherent text may enable adversaries to produce fake news. In this work, we study deepfake detection of text, to automatically discriminate machine-generated text from human-written text.

Figure 1: An example of machine-generated fake news. We can observe that the factual structure of entities extracted by named entity recognition is inconsistent.
Previous work on deepfake detection of text is dominated by neural document classification models (Bakhtin et al., 2019; Zellers et al., 2019; Wang et al., 2019; Vijayaraghavan et al., 2020). These models typically tackle the problem with coarse-grained document-level evidence such as dense vectors learned by a neural encoder or traditional features (e.g., TF-IDF, word counts). However, such coarse-grained models struggle to capture the fine-grained factual structure of the text. We define the factual structure as a graph containing the entities mentioned in the text and the semantically relevant relations among them. As shown in the motivating example in Figure 1, even though machine-generated text seems coherent, its factual structure is inconsistent. Our statistical analysis further reveals the difference in factual structure between human-written and machine-generated text (detailed in Section 3). Thus, modeling factual structures is essential for detecting machine-generated text.
Based on the aforementioned analysis, we propose FAST, a graph-based reasoning approach utilizing the FActual Structure of Text for deepfake detection. Given a document, we represent its factual structure as a graph, whose nodes are automatically extracted by named entity recognition. Node representations are calculated not only from the internal factual structure of the document via a graph convolutional network, but also from external knowledge in the form of entity representations pre-trained on Wikipedia. These node representations are used to produce sentence representations which, together with the coherence of continuous sentences, are further composed into a document representation for making the final prediction.
We conduct experiments on a news-style dataset and a webtext-style dataset, with negative instances generated by GROVER (Zellers et al., 2019) and GPT-2 (Radford et al., 2019) respectively. Experiments show that our method significantly outperforms strong transformer-based baselines on both datasets. Model analysis further indicates that our model can distinguish the difference in factual structure between machine-generated and human-written text. The contributions are summarized as follows:
• We propose a graph-based approach, which models the fine-grained factual structure of a document for deepfake detection of text.
• We statistically show that machine-generated text differs from human-written text in terms of the factual structures, and injecting factual structures boosts detection accuracy.
• Results of experiments on news-style and webtext-style datasets verify that our approach achieves improved accuracy compared to strong transformer-based pre-trained models.

Task Definition
We study the task of deepfake detection of text in this paper. The task is to discriminate machine-generated text from human-written text, which can be viewed as a binary classification problem. We conduct our experiments on two datasets with different styles: a news-style dataset with fake text generated by GROVER (Zellers et al., 2019) and a large-scale webtext-style dataset with fake text generated by GPT-2 (Radford et al., 2019). The news-style dataset consists of 25,000 labeled documents, and the webtext-style dataset consists of 520,000 labeled documents. Given a document, systems are required to reason about its content and assess whether it is "human-written" or "machine-generated".

Factual Consistency Verification
In this part, we conduct a statistical analysis to reveal the difference in factual structure between human-written and machine-generated text. Specifically, we study the difference from a consistency perspective and analyze entity-level and sentence-level consistency. Through data observation, we find that human-written text tends to repeatedly mention the same entity in continuous sentences, while machine-written continuous sentences are more likely to mention irrelevant entities. Therefore, we define the entity consistency count (ECC) of a document as the number of entities that are repeatedly mentioned in the next w sentences, where w is the sentence window size. The sentence consistency count (SCC) of a document is defined as the number of sentences that mention the same entities as the next w sentences. For instance, if the entities mentioned in three continuous sentences are "A and B; A; B" and w = 2, then ECC = 2 because the two entities A and B are repeatedly mentioned in the next 2 sentences, and SCC = 1 because only the first sentence has entities mentioned in the next 2 sentences. For the statistical analysis, we use all 5,000 pairs of human-written and machine-generated documents from the news-style dataset, where each pair of documents shares the same metadata (e.g., title). We plot the kernel density distribution of these two types of consistency count with sentence window size w = {1, 2}.
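The two counts follow directly from per-sentence entity lists. The following is a minimal sketch of the definitions above; matching entities by exact string equality is a simplifying assumption.

```python
def entity_consistency_count(sent_entities, w):
    """ECC: number of distinct entities that are mentioned again within
    the next w sentences of some sentence that mentions them."""
    repeated = set()
    for i, ents in enumerate(sent_entities):
        future = {e for s in sent_entities[i + 1:i + 1 + w] for e in s}
        repeated |= set(ents) & future
    return len(repeated)


def sentence_consistency_count(sent_entities, w):
    """SCC: number of sentences that share at least one entity with
    the next w sentences."""
    count = 0
    for i, ents in enumerate(sent_entities):
        future = {e for s in sent_entities[i + 1:i + 1 + w] for e in s}
        if set(ents) & future:
            count += 1
    return count
```

For the "A and B; A; B" example with w = 2, these functions give ECC = 2 and SCC = 1, matching the worked example in the text.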
As shown in Figure 2, human-written documents are more likely to have higher entity-level and sentence-level consistency count. This analysis indicates that human-written and machine-generated text are different in the factual structure, thus modeling consistency of factual structures is essential in discriminating them.

Methodology
In this section, we present our graph-based reasoning approach, which models the factual structure of a document and uses it to guide the reasoning process toward the final prediction. Figure 3 gives a high-level overview of our approach. With a document given as the input, our system begins by calculating contextual word representations with RoBERTa (§4.1). Then, we build a graph to capture the internal factual structure of the whole document (§4.2). With the constructed graph, we initialize node representations utilizing internal and external factual knowledge, and propagate and aggregate information with a graph neural network to learn graph-enhanced sentence representations (§4.3). Then, to model the consistency relations of continuous sentences and compose a document representation for making the final prediction, we employ a sequential model with the help of coherence scores from a pre-trained next sentence prediction (NSP) model (§4.4).

Figure 2: Statistical analysis of entity-level and sentence-level consistency. The orange and blue curves indicate the kernel density estimation for human-written and machine-generated documents respectively. The x-axis indicates the value of the consistency count and the y-axis the probability density.

Word Representation
In this part, we present how to calculate contextual word representations with a transformer-based model. In practice, we employ RoBERTa (Liu et al., 2019).
Taking a document x as the input, we employ RoBERTa to learn contextual semantic representations for its words¹. The RoBERTa encoder B maps the document x of length |x| into a sequence of hidden vectors:

h(x) = B(x) = (h(x)_1, h(x)_2, ..., h(x)_{|x|}),

where each h(x)_i indicates the contextual representation of the i-th word.

Graph Construction
In this part, we present how to construct a graph that reveals the internal factual structure of a document. In practice, we observe that selecting entities, the core participants of events, as arguments for constructing the graph introduces less noise into the representation of the factual structure. Therefore, we employ a named entity recognition (NER) model to parse the entities mentioned in each sentence. Specifically, taking a document as the input, we construct a graph in the following steps.
• We parse each sentence into a set of entities with an off-the-shelf NER toolkit built by AllenNLP², which is an implementation of Peters et al. (2017). Each entity is regarded as a node in the graph.
• We establish links between inner-sentence and inter-sentence entity node pairs to capture the structural relevance. We add inner-sentence edges to entity pairs in the same sentence for they are naturally relevant to each other. Moreover, we add inter-sentence edges to literally similar inter-sentence entity pairs for they are likely to be the same entity.
After this process, the graph reveals the fine-grained factual structure of the document.
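The two steps above can be sketched as follows. This is not the paper's implementation: nodes are identified by (sentence index, mention string), and "literal similarity" between inter-sentence mentions is approximated by case-insensitive substring matching, both of which are assumptions made for illustration.

```python
from itertools import combinations


def build_factual_graph(sent_entities):
    """Build an entity graph from per-sentence entity mention lists.
    Nodes: (sentence_index, mention) pairs. Edges: all entity pairs within
    one sentence (inner-sentence), plus inter-sentence pairs whose surface
    forms literally match (a simplifying stand-in for 'literal similarity')."""
    nodes = [(i, m) for i, ments in enumerate(sent_entities) for m in ments]
    idx = {n: k for k, n in enumerate(nodes)}
    edges = set()
    # inner-sentence edges: entity pairs in the same sentence
    for i, ments in enumerate(sent_entities):
        for a, b in combinations(ments, 2):
            edges.add((idx[(i, a)], idx[(i, b)]))
    # inter-sentence edges: literally similar mentions across sentences
    for (i, a), (j, b) in combinations(nodes, 2):
        if i != j and (a.lower() in b.lower() or b.lower() in a.lower()):
            edges.add((idx[(i, a)], idx[(j, b)]))
    return nodes, sorted(edges)
```

For example, three sentences mentioning "Jupiter, Saturn; Jupiter; Saturn" yield one inner-sentence edge and two inter-sentence edges linking the repeated mentions.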

Graph Neural Network
In this part, we introduce how to initialize node representations and exploit the factual structure: a multi-layer graph convolutional network (GCN) propagates and aggregates information over the graph and finally produces sentence representations.

Node Representation Initialization
We initialize node representations with contextual word representations learnt by RoBERTa and external entity representations pre-trained on Wikipedia.
Contextual Representation Since each entity node is naturally a span of words mentioned in the document, we calculate the contextual representation of each node from the contextual word representations h(x). Supposing an entity e consists of n words, its contextual representation ε_B is calculated as:

ε_B = ReLU(W_B · (1/n) Σ_{i=1}^{n} h(x)_{e_i}),

where W_B is a weight matrix, e_i is the absolute position in the document of the i-th word in the span of entity e, and ReLU is an activation function.

Figure 3: An overview of our approach. Taking a document as the input, we first calculate contextual word representations via RoBERTa (§4.1) and represent the factual structure as a graph (§4.2). After that, we employ a graph neural network to learn sentence representations (§4.3). Then, sentence representations are composed into a document representation, considering the coherence of continuous sentences, before making the final prediction (§4.4).
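A minimal numeric sketch of this span-to-node pooling, assuming mean pooling over the span's word vectors followed by the projection and ReLU described above:

```python
import numpy as np


def contextual_node_repr(h, span, W_B):
    """eps_B = ReLU(W_B . mean(h(x)_{e_1..e_n})): average the contextual
    vectors of the entity's word span, project with W_B, apply ReLU.
    h: (|x|, d) word representations; span: absolute positions e_i."""
    pooled = h[span].mean(axis=0)        # average the n span vectors
    return np.maximum(0.0, W_B @ pooled)  # ReLU(W_B pooled)
```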

Wikipedia-based Entity Representation
To model external factual knowledge about entities in a knowledge base, we further represent entity e with a projected wikipedia2vec entity representation (Yamada et al., 2018), which embeds words and entities from Wikipedia pages in a common space. The Wikipedia-based entity representation ε_w is:

ε_w = W_w · v_e,

where v_e is the wikipedia2vec representation of entity e and W_w is a weight matrix.
The initial representation H e ∈ R d of entity node e is the concatenation of contextual representation ε B and Wikipedia-based entity representation ε w , with dimension d.
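The initialization above can be sketched as a projection plus concatenation. The zero-vector fallback for entities without a Wikipedia entry is an assumption made for illustration, not a detail stated in the paper:

```python
import numpy as np


def init_node_repr(eps_B, v_e, W_w):
    """H_e = [eps_B ; W_w v_e]: concatenate the contextual span
    representation with the projected wikipedia2vec entity vector.
    v_e may be None for entities without a Wikipedia page; a zero
    vector is one simple fallback (an assumption)."""
    if v_e is None:
        v_e = np.zeros(W_w.shape[1])
    eps_w = W_w @ v_e                       # projected external knowledge
    return np.concatenate([eps_B, eps_w])   # dimension d = |eps_B| + |eps_w|
```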

Multi-layer GCN
In order to propagate and aggregate information through multi-hop neighbouring nodes, we employ a multi-layer Graph Convolutional Network (GCN) (Kipf and Welling, 2016).
Formally, we denote the constructed graph as G and the representations of all nodes as H ∈ R^{N×d}, where N denotes the number of nodes. Each row H_e ∈ R^d of H is the representation of node e. We denote the adjacency matrix of graph G as A and its degree matrix as D, and calculate the normalized adjacency matrix Ã = D^{-1/2} A D^{-1/2}. The multi-layer GCN is then described as follows:

H^{(i)} = σ(Ã H^{(i-1)} W_i),

where H^{(i)}_e denotes the representation of node e calculated by the i-th GCN layer, W_i is the weight matrix of layer i, and σ is an activation function. In particular, H^{(0)} is the matrix of initialized node representations.
Finally, through m layers of GCN, we obtain the graph-enhanced node representations based on the structure of the factual graph.
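A compact sketch of the m-layer propagation, following the normalization described above with ReLU as the activation (the paper does not pin down σ, so ReLU is an assumption):

```python
import numpy as np


def gcn_forward(A, H, weights):
    """Multi-layer GCN over the factual graph:
    A_hat = D^{-1/2} A D^{-1/2}, then H^{(i)} = ReLU(A_hat H^{(i-1)} W_i).
    Assumes every node has nonzero degree; in practice self-loops are
    often added to A to guarantee this."""
    d = A.sum(axis=1)
    A_hat = A / np.sqrt(np.outer(d, d))  # entry (i,j): A_ij / sqrt(d_i d_j)
    for W in weights:                    # one weight matrix per GCN layer
        H = np.maximum(0.0, A_hat @ H @ W)
    return H
```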

Sentence Representation
Following the principle of compositionality, a global representation should be composed from partial representations. We therefore calculate sentence-level representations from the graph-enhanced node representations. Supposing sentence i has N_i corresponding entities, we calculate the representation y_i of sentence i as follows:

y_i = σ(W_s · (1/N_i) Σ_{j=1}^{N_i} H_{(i,j)} + b_s),

where σ is an activation function, W_s is a weight matrix, b_s is a bias vector, and H_{(i,j)} indicates the representation of the j-th node in sentence i. The composition can also be implemented in other ways, which we leave to future work.
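A minimal sketch of this composition, assuming mean pooling over the sentence's node vectors and ReLU as σ (the pooling choice is one of the compositional alternatives the text alludes to):

```python
import numpy as np


def sentence_repr(node_reprs, W_s, b_s):
    """y_i = ReLU(W_s . mean(H_(i,1..N_i)) + b_s): pool the sentence's
    graph-enhanced node vectors, then project with a bias and ReLU."""
    pooled = np.mean(node_reprs, axis=0)       # average over N_i nodes
    return np.maximum(0.0, W_s @ pooled + b_s)
```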

Aggregation to Document Representation
In this part, we present how to compose a document representation for the final prediction utilizing graph-enhanced sentence representations and coherence score calculated by a pre-trained next sentence prediction (NSP) model.
Coherence Tracking LSTM With the graph-enhanced sentence representations given as the input, the factual consistency of continuous sentences is modeled by a sequential model. Specifically, we employ an LSTM to track the consistency relations and produce a representation y_i for each sentence i.

Next Sentence Prediction Model In order to further model the contextual coherence of neighbouring sentence pairs as additional information, we pre-train an NSP model to calculate a contextual coherence score for each neighbouring sentence pair. We employ RoBERTa (Liu et al., 2019) as the backbone, which receives pairs of sentences as the input and assesses whether the second sentence is a subsequent sentence of the first. Further training details are explained in Appendix A. The outputs S are described as follows:
S = (S_{(0,1)}, S_{(1,2)}, ..., S_{(s-1,s)}),

where s+1 is the number of sentences in document x and each S_{(i-1,i)} is the positive probability score for the sentence pair (i-1, i), which indicates how likely it is that sentence i is a subsequent sentence of sentence i-1.
Prediction with NSP Score We generate a document-level representation by composing sentence representations before making the final prediction. To achieve this, we take the NSP scores as weights and calculate the weighted sum of the representations of sentence pairs, under the assumption that sentence pairs with a higher contextual coherence score should carry more importance in the final prediction. The final document representation D is calculated as follows:

D = Σ_{i=1}^{s} S_{(i-1,i)} · [y_{i-1}; y_i],

where [y_{i-1}; y_i] denotes the representation of the sentence pair (i-1, i).
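The NSP-weighted aggregation can be sketched as below. For simplicity this sketch feeds the sentence vectors directly (omitting the coherence tracking LSTM), and composing each pair by concatenation is an assumption:

```python
import numpy as np


def document_repr(Y, S):
    """Weighted sum of neighbouring sentence-pair representations,
    using NSP coherence scores S_(i-1,i) as weights. Each pair is
    composed by concatenating [y_{i-1}; y_i] (an assumption)."""
    pairs = [np.concatenate([Y[i - 1], Y[i]]) for i in range(1, len(Y))]
    return sum(s * p for s, p in zip(S, pairs))  # D = sum_i S_(i-1,i) pair_i
```

Pairs with higher coherence scores thus contribute proportionally more to the document representation D.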
Finally, we make the prediction by feeding the combination of D and the last hidden vector h([CLS]) from RoBERTa through a classification layer. The goal of this operation is to retain the complete contextual semantic meaning of the whole document, because some linguistic clues are left out during graph construction.

Experiment Settings
In this paper, we evaluate our system on the following two datasets:
• News-style GROVER-generated dataset, provided by Zellers et al. (2019). The human-written instances are collected from RealNews, and the machine-generated instances are generated by GROVER-Mega, a large state-of-the-art transformer-based generative model developed for neural fake news. We largely follow the experimental settings described by Zellers et al. (2019) and adopt two evaluation metrics: paired accuracy and unpaired accuracy. In the paired setting, the system is given human-written news and machine-generated news with the same metadata, and needs to assign a higher machine probability to the machine-generated news than to the human-written one. In the unpaired setting, the system is provided with a single news document and states whether the document is human-written or machine-generated.
• Webtext-style GPT2-generated dataset provided by OpenAI 3 . The human-written instances are collected from WebText. Machinegenerated instances are generated by GPT-2 XL-1542M (Radford et al., 2019), a powerful transformer-based generative model trained on a corpus collected from popular webpages. For this dataset, we adopt binary classification accuracy as the evaluation metric.
For both datasets, we set nucleus sampling with p = 0.96 as the sampling strategy of the generator, which leads to higher-quality generated text (Zellers et al., 2019; Ippolito et al., 2019). The statistics of the two datasets are shown in Table 1.
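For readers unfamiliar with the sampling strategy, a minimal sketch of nucleus (top-p) sampling over a next-token distribution: keep the smallest set of tokens whose cumulative probability reaches p, renormalize, and sample from that set.

```python
import numpy as np


def nucleus_sample(probs, p=0.96, rng=None):
    """Top-p (nucleus) sampling from a probability vector: restrict
    sampling to the smallest high-probability 'nucleus' covering p."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]       # tokens by descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1  # smallest nucleus covering mass p
    nucleus = order[:cutoff]
    renorm = probs[nucleus] / probs[nucleus].sum()
    return rng.choice(nucleus, p=renorm)
```

Lower p truncates the distribution more aggressively; with p close to 1, sampling approaches the full distribution, which is part of why p = 0.96 output is harder to detect.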

Dataset         Train     Valid    Test (Unpaired)   Test (Paired)
News-style      10,000    3,000    8,000             8,000
Webtext-style   500,000   10,000   10,000            -

Table 1: Statistics of the two datasets.

Furthermore, we adopt RoBERTa-Base (Liu et al., 2019) as the direct baseline for our experiments, because RoBERTa achieves state-of-the-art performance on several benchmark NLP tasks. The hyper-parameters and training details of our model are described in Appendix B.

Model Comparison
Baseline Settings We compare our system with transformer-based baselines for deepfake detection, including three powerful transformer-based pre-trained models: BERT (Devlin et al., 2018), XLNet (Yang et al., 2019), and RoBERTa (Liu et al., 2019). For the news-style dataset, we further compare our model with GPT-2 (Radford et al., 2019) and GROVER (Zellers et al., 2019). The GROVER-based discriminator is a fine-tuned version of the generator GROVER, which has three model sizes: GROVER-Base (124 million parameters), GROVER-Large (335 million parameters), and GROVER-Mega (1.5 billion parameters). Our model is not comparable with GROVER-Mega for the following reasons. Firstly, GROVER-Mega is the fake news generator itself, so as a discriminator it has a strong inductive bias (e.g., matching data distribution and sampling strategy) (Zellers et al., 2019). Secondly, GROVER-Mega has a much larger model size (1.5 billion parameters) than our model.
For the webtext-style dataset, we compare with the baselines we trained with the same hyper-parameters. We do not compare with GPT-2 because it is the generator of the machine-generated text.

Results and Analysis

In Table 2, we compare our model with the baselines on the test set of the news-style dataset, with negative instances generated by GROVER-Mega. As shown in the table, our model significantly outperforms our direct baseline RoBERTa, with 4.2% improvement in unpaired accuracy and 4.3% improvement in paired accuracy. Our model also significantly outperforms GROVER-Large and other strong transformer-based baselines (i.e., GPT-2, BERT, XLNet). In Table 3, we compare our model with the baselines on the development set and the test set of the webtext-style dataset. Our model significantly outperforms the strongest transformer-based baseline, RoBERTa, by 2.64% on the development set and 3.07% on the test set of the webtext-style GPT2-generated dataset. These observations indicate that modeling fine-grained factual structures empowers our system to discriminate between human-written text and machine-generated text.

Ablation Study
Moreover, we conduct ablation studies to evaluate the impact of each component, comparing the direct baseline RoBERTa-Base with four variants of our full model.
• RoBERTa-Base is our direct baseline without considering any structural information.
• FAST (GCN) calculates a global document representation by averaging node representations after representation learning by the GCN.
• FAST (GCN w/o wiki) is the same as FAST (GCN), except that node representations exclude the wikipedia2vec entity representations.
• FAST (GCN + LSTM) takes the final hidden state from coherence tracking LSTM ( § 4.4) as the final document-level representation.
• FAST (GCN + LSTM + NSP) is the full model introduced in this paper.
As shown in Figure 4, adding the GCN improves performance on the development sets of the news-style and webtext-style datasets. This verifies that incorporating fine-grained structural information is beneficial for detecting generated text. Eliminating the Wikipedia-based entity representation from FAST (GCN) hurts performance, which indicates that incorporating external knowledge is also beneficial. Moreover, incorporating the coherence tracking LSTM brings further improvement on both datasets, which indicates that modeling the consistency of the factual structure of continuous sentences is better than simply using the global structural information of the document, as in FAST (GCN). Lastly, the results also show that incorporating the semantic coherence score of the pre-trained NSP model is beneficial for discriminating generated text.

Case Study
As shown in Figure 5, we conduct a case study on an example showing human-written news and machine-generated news with the same metadata (i.e., title). The veracity of both documents is correctly predicted by our model. Given a document, our system constructs a factual graph and makes the correct prediction by reasoning over the constructed graph. We can observe that although the continuous sentences in the machine-generated news look coherent, their factual structure is not consistent, as they describe events about irrelevant entities. In contrast, the human-written news has a more consistent factual structure. Without utilizing factual structure information, RoBERTa fails to discriminate between these two articles. This observation reflects that our model can distinguish the difference in factual consistency between machine-generated and human-written text.

Error Analysis
To explore further directions for future studies, we randomly select 200 instances and manually summarize representative error types.
The primary type of error is caused by failing to extract the core entities of sentences: the quality of a constructed graph is limited by the performance of the NER model. This limitation leaves further exploratory space for the extraction of internal factual structure. The second type of error is caused by the model's weakness in mathematical calculation. For instance, a document describes that "a smaller $5 million one-off was seized in 2016 and the National Bank of Antigua and Barbuda reclaimed $30 million stolen in the 2015 heist last year. $100 million, it was a massive amount. But now we are talking of $50 million, this is extremely conservative...". Humans can easily observe that the mentioned numbers are highly inconsistent in the generated text, but a machine struggles to discern that. This error type calls for the development of machines' mathematical calculation abilities. The third error type is caused by failing to model commonsense knowledge. For example, a famous generated document mentions "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. ... These four-horned, silver-white unicorns were previously unknown to science.". Although the text looks coherent, it is still problematic in terms of the commonsense knowledge that a unicorn has only one horn. This leaves space for further research on exploring commonsense knowledge in deepfake detection.

Related Work
Figure 5: A case study of our approach. Continuous words in orange indicate an entity node extracted by our system, each green solid box indicates a sub-graph corresponding to a sentence, and a blue dashed line indicates an edge between semantically relevant entity pairs. Numbers in orange and blue indicate probability for the human-written document and the machine-generated document respectively. (Example title: "Sky Watch: No need to blow your mind when wrapping your brain around celestial distances".)

Recently, fake news detection has attracted growing interest due to the unprecedented amount of fake content propagating through the internet (Vosoughi et al., 2018). The spread of fake news raises public concern (Cooke, 2018), as it may influence essential public events such as political elections (Allcott and Gentzkow, 2017). Online reviews can also be generated by machines, and
can even be as fluent as human-written text (Adelani et al., 2020). The situation becomes even more serious now that recently developed large pre-trained language models (Radford et al., 2019; Zellers et al., 2019) are capable of generating coherent, fluent and human-like text. Two influential works are GPT-2 (Radford et al., 2019) and GROVER (Zellers et al., 2019). The former is an open-sourced, large-scale unsupervised language model trained on web text, while the latter is trained particularly for news. In this work, we study the problem of discriminating machine-generated from human-written text, and evaluate on datasets produced by both GPT-2 and GROVER.
Advances in generative models have promoted the development of detection methods. Previous studies in the field of deepfake detection of generated text are dominated by deep-learning-based document classification models and studies of the discriminating features of generated text. GROVER (Zellers et al., 2019) detects generated text with a fine-tuned model of the generative model itself. Ippolito et al. (2019) fine-tune the BERT model for discrimination and explore how sampling strategies and text excerpt length affect detection. GLTR (Gehrmann et al., 2019) develops a statistical method that computes per-token likelihoods and visualizes histograms over them to assist deepfake detection. Badaskar et al. (2008) and Pérez-Rosas et al. (2017) study language distributional features including n-gram frequencies, text coherence and syntax features. Vijayaraghavan et al. (2020) study the effectiveness of different numeric representations (e.g., TF-IDF and Word2Vec) and different neural networks (e.g., ANNs, LSTMs) for detection. Bakhtin et al. (2019) tackle the problem as a ranking task and study the cross-architecture and cross-corpus generalization of their scoring functions. Schuster et al. (2019) show that simple provenance-based detection methods are insufficient for solving the problem and call for the development of fact-checking systems. However, existing approaches struggle to capture the fine-grained factual structures among continuous sentences, which in our observation is essential for discriminating human-written from machine-generated text. Our approach takes a step towards modeling fine-grained factual structures for deepfake detection of text.

Conclusion
In this paper, we present FAST, a graph-based reasoning approach utilizing fine-grained factual knowledge for DeepFake detection of text. We represent the factual structure of a document as a graph, which is utilized to learn graph-enhanced sentence representations. Sentence representations are further composed through document-level aggregation for the final prediction, where the consistency and coherence of continuous sentences are sequentially modeled. We evaluate our system on a news-style dataset and a webtext-style dataset, whose fake instances are generated by GROVER and GPT-2 respectively. Experiments show that components of our approach bring improvements and our full model significantly outperforms transformer-based baselines on both datasets. Model analysis further suggests that our model can distinguish the difference in the factual structure of machine-generated and human-written text.