How Robust are Fact Checking Systems on Colloquial Claims?

Knowledge is now starting to power neural dialogue agents. At the same time, the risk of misinformation and disinformation from dialogue agents also rises. Verifying the veracity of information from formal sources is widely studied in computational fact checking. In this work, we ask: how robust are fact checking systems on claims in colloquial style? We aim to open up new discussions at the intersection of fact verification and dialogue safety. To investigate how fact checking systems behave on colloquial claims, we transfer the style of claims from FEVER (Thorne et al., 2018) into colloquialism. We find that existing fact checking systems that perform well on claims in formal style degrade significantly on colloquial claims with the same semantics. In particular, we show that document retrieval is the weakest spot in the pipeline, vulnerable even to filler words such as "yeah" and "you know". The document recall of the WikiAPI retriever (Hanselowski et al., 2018), which is 90.0% on FEVER, drops to 72.2% on the colloquial claims. We compare the characteristics of colloquial claims to those of claims in formal style, and demonstrate the challenging issues they raise.


Introduction
Recently, knowledge has started to power neural dialogue agents (Moghe et al., 2018; Zhou et al., 2018b; Ghazvininejad et al., 2018; Qin et al., 2019; Gopalakrishnan et al., 2019), equipping them with Wikipedia (Dinan et al., 2019b), news (Gopalakrishnan et al., 2019), domain-specific knowledge bases (Eric and Manning, 2017), and commonsense (Zhou et al., 2018a; Young et al., 2018). However, the use of knowledge inevitably puts dialogue agents in new jeopardy. For example, a recent workshop on safety for conversational AI (Dinan et al., 2020b) introduced an example of such risk: Bickmore et al. (2018) asked participants to query conversational agents for advice in situations where medical information is needed. Then, an internist and a pharmacist judged the actions that the participants would take based on the advice. The assessments revealed that agents often deliver incorrect medical information that may cause lethal consequences.
A bigger threat may be the abuse of dialogue agents to deliberately distribute disinformation. What would happen if knowledge-powered agents were tweaked to massively generate false claims in online communities? The impact of such fake news can be critical, as it spreads quickly through social media (Shu et al., 2017). The shutdown of the chatbot Tay due to malicious attempts shows the imminent danger of abuse (Wolf et al., 2017).
Verifying the integrity of a given piece of information has been studied in the field of computational fact checking. Thorne et al. (2018) introduce an annotated dataset FEVER for fact checking based on Wikipedia. Augenstein et al. (2019) collect claims on fact checking websites and release the MultiFC dataset. Jiang et al. (2020) collect a dataset requiring many-hop evidence extraction from Wikipedia. Wadden et al. (2020) collect a dataset of scientific claims to be verified.
Most claims of existing datasets are taken from formal texts, such as news, academic papers, and Wikipedia. These claims tend to be concise and structured: "Beautiful was number two on the Billboard Hot 100 in 2003". In contrast, claims or information that we encounter in dialogues are more unstructured and informal: "The song Beautiful is great! It even reached number two on the Hot 100 in 2003, you know?". To improve the applicability of fact checking systems, they must also be robust when verifying claims in dialogues.
Unfortunately, threats regarding misinformation and disinformation from dialogue agents remain understudied. Research on dialogue safety has mainly focused on making dialogue agents robust to adversarial attacks (Dinan et al., 2019a) and preventing dialogue agents from generating offensive or biased responses (Henderson et al., 2018; Sap et al., 2019). In this work, we aim to investigate how fact checking systems behave when verifying claims in dialogue style, rather than claims from news outlets, scientific articles, or Wikipedia. Colloquial claims differ in several aspects from claims from formal sources: (i) they tend to include filler words, casual comments, or personal feelings which do not require verification; and (ii) since claims in colloquial language are less precise than formal claims, correctly using the context within a claim becomes important for disambiguation. We demonstrate that these features make it difficult for existing fact checking systems to verify colloquial claims. We use English datasets for the investigation in this work. The major contributions of this work can be outlined as follows: (1) We open up new discussions at the intersection of fact verification and dialogue safety: how to verify claims in colloquial language, in contrast to previous works that solely focus on claims in formal style (e.g., news, academic papers, Wikipedia).
(2) For this study, we curate colloquial claims by transferring the style of claims in the existing fact checking dataset FEVER (Thorne et al., 2018). For the style transfer, we finetune a pretrained dialogue model on a knowledge-grounded dialogue dataset and apply additional filtering to ensure the quality of the output.
(3) We show that existing fact checking systems that perform well on claims in formal style significantly degenerate on colloquial ones with the same semantics. We analyze the performance drop and show that document retrieval is the weakest spot in the system.
(4) We identify the challenging characteristics of colloquial claims: (i) they often involve expressions that are not verifiable (e.g., filler words or personal feelings), and (ii) they contain ambiguity that necessitates a better understanding of the context. We release the code and the curated colloquial claims dataset.
FEVER

FEVER (Thorne et al., 2018) is a fact checking benchmark dataset based on Wikipedia. Its fact checking pipeline has become a standard followed by many (Hanselowski et al., 2018; Nie et al., 2019; Zhou et al., 2019; Liu et al., 2020; Zhong et al., 2020; Jiang et al., 2020). The pipeline comprises three stages: document retrieval, evidence selection, and claim verification. For a given claim to be verified, the system first retrieves the related documents from the pool. Next, among the returned documents, the system selects the most suitable sentences as evidence. Finally, based on the evidence sentences, the system classifies the claim's veracity into three classes: SUPPORTED, REFUTED (contradicted by the evidence), and NOTENOUGHINFO (cannot be determined from the evidence). An example from FEVER is shown in Table 1.

Wizard of Wikipedia
The Wizard of Wikipedia (WoW) (Dinan et al., 2019b) may be the closest dialogue dataset to existing fact checking datasets. It is a knowledge-based open-domain dialogue dataset involving two speakers discussing a given topic. An example is presented in Table 1. One speaker (referred to as the apprentice) is eager to learn about the topic, while the other speaker (the wizard) delivers knowledge-grounded responses based on both the dialogue context and Wikipedia documents about the topic. In this dataset, the gold "knowledge sentence" from Wikipedia is provided for each wizard response. Hence, we can regard the gold knowledge sentence as the evidence for the wizard's response.
However, WoW only provides pairs of (knowledge sentence, grounded response); hence those responses are all SUPPORTED by Wikipedia. There are no REFUTED or NOTENOUGHINFO responses in the dataset. Such a limitation makes it difficult to directly adopt WoW as a fact checking dialogue dataset. Nonetheless, its knowledge-grounding property makes it a useful resource for training dialogue models to generate colloquial utterances grounded on claims.

Transferring to Colloquial Claims
Our goal is to curate colloquial claims by transferring the style of each claim sentence in the FEVER dataset into colloquial style. We first finetune a dialogue model on the WoW dataset so that it learns to transfer knowledge sentences from Wikipedia into conversational utterances (Section 3.1). We then apply the finetuned model to transfer each claim in FEVER (sourced from Wikipedia) into colloquial style, and perform a filtering process to ensure the integrity of this style transfer (Section 3.2). Figure 1 overviews the whole style transfer pipeline.

Finetuning a Dialogue Model
We first finetune BART-large (Lewis et al., 2020) to generate the wizard's response given only the corresponding knowledge sentence from WoW, without the dialogue context. Taking the example in Table 1, when the knowledge sentence is "Hershey's headquarters are in Hershey, Pennsylvania", BART is finetuned to generate the wizard's response "I love Hershey too! Do you know that Hershey's HQ is actually in Hershey?". We exclude the dialogue context during finetuning in order to force the dialogue model to focus exclusively on the knowledge content. The finetuned BART shows a low perplexity of 10.51 on WoW's validation set, indicating that BART can generate information-grounded utterances when given knowledge sentences.
Then, we apply the finetuned BART to transfer each claim in FEVER to a colloquial one. Our expectation is that, since claims in FEVER are also based on Wikipedia and similar to knowledge sentences in WoW in many aspects, the finetuned model may be able to produce utterances that preserve the semantics of the FEVER claims. However, naively using the generated claims as-is raises several issues, including (i) copy-and-paste, (ii) pronoun overwriting, (iii) semantic discrepancy, and (iv) lack of colloquialism. We carefully mitigate these issues through a filtering pipeline.

Oversampling and Filtering
We first oversample n = 468 colloquial candidates Q_i = {q_{i,j}}, j = 1, ..., n, per claim c_i in FEVER, using BART through Nucleus Sampling (Holtzman et al., 2020) (p = 0.95).
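In practice, Nucleus Sampling is usually invoked through the generation utilities of the dialogue model (e.g., `generate(do_sample=True, top_p=0.95, num_return_sequences=n)` in the Transformers library). As a minimal sketch of the truncation rule itself: keep only the smallest head of the next-token distribution whose cumulative probability reaches p, renormalize, then sample.

```python
import random

def top_p_filter(probs, p=0.95):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p (Nucleus Sampling), then renormalize."""
    # Sort token indices by descending probability.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample_top_p(probs, p=0.95, rng=random):
    """Draw one token index from the truncated, renormalized head."""
    filtered = top_p_filter(probs, p)
    r, acc = rng.random(), 0.0
    for i, q in filtered.items():
        acc += q
        if r <= acc:
            return i
    return next(reversed(filtered))  # guard against rounding
```

Sampling from this truncated head repeatedly yields the n diverse candidates per claim; low-probability tail tokens are never selected.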
Preventing Copy-Paste. We observe that the dialogue model sometimes simply copies the input claim as output. Since copy-pasted candidates are not colloquial, we remove the ones whose F1 scores with respect to the original claim are higher than 0.9.
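The copy-paste filter reduces to a unigram-F1 test against the original claim. A minimal sketch, assuming whitespace tokenization (the exact tokenizer is not specified above):

```python
from collections import Counter

def token_f1(reference, candidate):
    """Unigram F1 between two whitespace-tokenized sentences."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def drop_copies(claim, candidates, threshold=0.9):
    """Discard candidates that nearly copy the original claim verbatim."""
    return [c for c in candidates if token_f1(claim, c) <= threshold]
```

A verbatim copy scores F1 = 1.0 and is dropped, while a reworded colloquial candidate with added filler tokens falls below the 0.9 threshold and survives.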
Preserving Named Entities. Utterances in dialogues tend to refer to entities with pronouns rather than their original names. As a result, we observe that the dialogue model also converts entities in claims to pronouns. For example, given the input claim "Tetris has sold millions of physical copies", BART outputs "Yeah it's fun even today, no wonder it sold millions of physical copies". Since there is no previous context for claims in FEVER, it is not possible to recognize that the pronoun "it" refers to "Tetris".
In order to preserve the entities, we leverage the named entity recognition (NER) module from Stanza (Qi et al., 2020), which achieves an F1-score of 88.8 on the OntoNotes (Weischedel et al., 2013) test set. We extract a set of named entities E_{c_i} from claim c_i and compare it with the named entity set E_{q_{i,j}} of each q_{i,j} in Q_i. We remove candidates with fewer than two matching named entities. For claims with a single named entity, we remove candidates having no named entities.
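The entity sets would come from Stanza's NER pipeline (`stanza.Pipeline("en", processors="tokenize,ner")`, reading `doc.ents`); the survival rule itself reduces to a small predicate. The behavior for claims with no entities at all is an assumption here, since the text above does not cover that case:

```python
def passes_ner_filter(claim_entities, candidate_entities):
    """Entity-preservation rule: a candidate must share at least two
    named entities with the original claim; when the claim has only a
    single entity, the candidate merely needs to contain some entity."""
    claim_ents = set(claim_entities)
    cand_ents = set(candidate_entities)
    if not claim_ents:
        return True  # assumption: nothing to preserve, keep candidate
    if len(claim_ents) == 1:
        return len(cand_ents) >= 1
    return len(claim_ents & cand_ents) >= 2
```

Under this rule, the "Tetris" example above is filtered out: the candidate replaced the only entity with "it", leaving no named entity at all.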
Preserving Semantic Equivalence. It is well known that neural dialogue models lack consistency (Li et al., 2016) and can hallucinate irrelevant content (Roller et al., 2020). As a result, there can be semantic differences between the original FEVER claim and the generated one.
To preserve the original semantics, we leverage natural language inference (NLI), the task of determining whether a hypothesis sentence can be inferred from a given premise sentence. The hypothesis sentence is classified into three categories: ENTAILMENT (true), CONTRADICTION (false), and NEUTRAL (undetermined). A sound colloquial claim should be entailed by the original claim, and it also must not contradict the original. Suppose the original claim is "Apple Inc. designed and manufactured iPhone 4" and the generated claim is "I heard Apple is also famous for designing the iMac computer". This claim is removed because "designing iMac" cannot be inferred from the fact "Apple manufactured iPhone 4". We conduct bidirectional NLI between the original claim and the generated one using RoBERTa (Liu et al., 2019) trained on MNLI (Williams et al., 2018). The RoBERTa model shows 90.59% accuracy on the MNLI validation set. For each candidate q_{i,j}, we conduct NLI(c_i, q_{i,j}) and NLI(q_{i,j}, c_i) with the original claim c_i. We only preserve candidates where the former results in ENTAILMENT and the latter does not result in CONTRADICTION.
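The bidirectional decision reduces to a small keep-rule once the two NLI labels are available; in practice the labels would come from a RoBERTa model finetuned on MNLI (e.g., `roberta-large-mnli` run once per direction). As a sketch:

```python
def keep_candidate(nli_orig_to_cand, nli_cand_to_orig):
    """Bidirectional NLI rule: the original claim must entail the
    candidate (premise = original, hypothesis = candidate), and the
    candidate must not contradict the original in the reverse direction."""
    return (nli_orig_to_cand == "ENTAILMENT"
            and nli_cand_to_orig != "CONTRADICTION")
```

Note the asymmetry: a NEUTRAL reverse label is tolerated, since colloquial candidates add unverifiable chatter (fillers, feelings) that the original claim cannot entail back.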
Ensuring Colloquialism. Although the candidates are generated by a dialogue model, they may still resemble the style of the original claims rather than colloquial style. To ensure colloquialism, we select the top-k candidate claims which are most difficult to discriminate from responses in Wizard of Wikipedia (WoW) (Dinan et al., 2019b), through the iterative adversarial filtering method AFLITE (Sakaguchi et al., 2020; Bras et al., 2020). We first embed the candidates with RoBERTa and train an ensemble of binary linear classifiers to determine whether each candidate is from WoW or from our colloquial claims. After each iteration, we eliminate candidates that are easily classified as colloquial claims. We continue iterating until k candidates remain in each Q_i; we set k = 3. Since only candidates that are hard to discriminate from WoW responses survive, they resemble the style of dialogue utterances. We defer the detailed adversarial filtering algorithm to the Appendix.
Filtering Statistics. Table 2 shows the average survival rate of candidates after each filtering step. We observe that the NER and NLI filters effectively remove large numbers of candidates. On average, 29 out of the 468 candidates survive the NLI filtering stage. Then, adversarial filtering selects k candidates among the remainder. Figure 2 shows the recall of our colloquial claims by the binary classifiers used in AFLITE. As only candidate claims indistinguishable from WoW responses survive, the recall drops after each iteration. We also compare the qualitative traits of candidates before and after filtering in Section 4.2.

Manual Quality Check on Test set
Finally, we manually check all SUPPORTED and REFUTED instances in the test set of our Colloquial Claims dataset. Three human annotators choose the most suitable claim from each colloquial claim set (|Q_i| ≤ k) for the given label and evidence. If there is no suitable claim in the set, we recover the set before the top-k selection. As a last resort, we let annotators rewrite the colloquial claim when no eligible candidate exists. The proportion that requires manual rewriting is less than 1% of the 5,615 instances.

Quantitative Comparison
We first discuss the characteristics of our Colloquial Claims through quantitative analysis, compared to FEVER (Thorne et al., 2018) and Wizard of Wikipedia (WoW) (Dinan et al., 2019b). Diverse Claims. Table 3 provides basic statistics of our Colloquial Claims. In FEVER, only a single claim exists per evidence set, whereas our Colloquial Claims provides up to three claims. As a result, our dataset contains more instances than FEVER.
Due to the wordy nature of colloquial language, our transferred claims are longer and more diverse in length than those in FEVER. Figure 3 plots the density of the claim sentence lengths for FEVER and our dataset.
Colloquial Style. The claims in our Colloquial Claims have similar styles to the utterances in dialogues. Following prior work, we gauge the style of sentences by measuring their perplexity with a pretrained DialoGPT (Zhang et al., 2019). The perplexity of a sentence becomes high if its style is far from dialogue. Table 5 also compares the top-20 most frequent tokens in the claims from FEVER and our dataset. The most frequent tokens in FEVER's claims are mostly fact-related words, such as "american", "released", and "born". On the other hand, the claims in our Colloquial Claims also contain tokens that frequently appear in conversations, such as "know", "actually", "like", and "oh".
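Perplexity here is the exponentiated average negative log-likelihood that the language model assigns to a sentence's tokens; the per-token log-probabilities would come from DialoGPT in practice. A minimal sketch of the computation:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood of a
    sentence's tokens under the language model. A dialogue-style
    sentence gets high token probabilities from DialoGPT, hence a
    low perplexity; formal claims score higher."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)
```

For example, a sentence whose tokens each receive probability 0.5 has perplexity exactly 2, regardless of length.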

Qualitative Comparisons
We conduct a human evaluation via Amazon Mechanical Turk to investigate the effectiveness of our filtering pipeline. We randomly sample 100 data instances from our Colloquial Claims and compare the surviving and removed candidates. Each instance is rated by three unique human annotators.
To evaluate the overall quality of our generated claims, we ask human judges to rate humanness on a 4-point scale: "Do you think this sentence is from a bot or a human?". We compare them with responses from WoW and claims from FEVER on humanness.
We also conduct NLI on the claims from our Colloquial Claims and FEVER to evaluate the label mappings. Annotators are instructed to classify claims into three veracity labels given the gold evidence: SUPPORTED, REFUTED, and NOTENOUGHINFO. Table 6 summarizes the averaged humanness and human NLI scores. Since the responses in WoW come from real dialogues, they have the highest humanness score. Interestingly, our generated claims are rated as more human than the human-written claims in FEVER. We suspect this is due to the colloquialism of our generated claims.
The surviving claims have more accurate label mappings with the evidence compared to the removed candidates. This is thanks to the bidirectional NLI filter, which removes candidate claims that are semantically different from the original claims. Table 7 shows some examples comparing our generated claims to the original FEVER claims.

Experiments and Analysis
We conduct experiments on our curated colloquial claims to see how they impact existing fact checking systems.

Experimental Setting
Datasets. FEVER (Thorne et al., 2018) defines a three-step fact checking pipeline: document retrieval, evidence selection, and claim verification. Based on the selected evidence, claims are classified into three classes of veracity: SUPPORTED, REFUTED, and NOTENOUGHINFO. Colloquial Claims is our generated dataset based on FEVER, with claims in colloquial style.
Metrics. FEVER fact checking uses two performance scores: label accuracy and FEVER-score. Label accuracy is the claim verification performance of the fact checking system. The FEVER-score is a more comprehensive evaluation of the whole pipeline. Following the FEVER challenge (https://fever.ai/2018/task.html), a claim verification is evaluated as correct if the system retrieves at least one complete set of ground-truth evidence sentences and also classifies the claim correctly. For the evidence sentences, we evaluate the first five sentences retrieved by the system. We also report the recall of retrieved documents and selected evidence sentences.
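The per-instance FEVER-score rule described above can be sketched as follows; this is an illustrative reimplementation of the rule, not the official challenge scorer, and the function name and input shapes are assumptions:

```python
def fever_score_instance(pred_label, gold_label, retrieved, gold_evidence_sets):
    """Score a single instance: 1 iff the predicted label is correct
    and, for verifiable claims, the first five retrieved evidence
    sentences contain at least one complete gold evidence set."""
    if pred_label != gold_label:
        return 0
    if gold_label == "NOTENOUGHINFO":
        return 1  # no evidence requirement for unverifiable claims
    top5 = set(retrieved[:5])
    # At least one full gold set must be a subset of the top-5 evidence.
    return int(any(set(es) <= top5 for es in gold_evidence_sets))
```

Averaging this over the test set gives the FEVER-score, which is why a retrieval failure caps the score even when claim verification is correct.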

Fact-Checking Baselines
We run experiments on six combinations of fact checking system components, one per pipeline step. For each evaluation, we finetune the system on the respective dataset.
Document Retrieval. We test three types of approaches: (1) an oracle, (2) WikiAPI (Hanselowski et al., 2018), and (3) Dense Passage Retrieval (DPR) (Karpukhin et al., 2020).

Evidence Selection. WikiAPI and DPR both use BERT (Devlin et al., 2019) to encode sentences and sort them out from the retrieved documents.
Claim Verification. We test two approaches: (1) BERT and (2) CorefBERT (Ye et al., 2020), one of the best performing methods on FEVER. CorefBERT pretrains BERT to better capture coreference information in text. We also apply the kernel graph attention network (KGAT) (Liu et al., 2020) on top of BERT and CorefBERT for fine-grained attention over evidence graphs. More details can be found in the Appendix.

Experimental Results

Table 8 compares the performance of fact checking systems on FEVER and our Colloquial Claims. Both label accuracy and FEVER-score significantly decrease for all systems on our Colloquial Claims, compared to FEVER. The WikiAPI+BERT+KGAT(CorefBERT) system performs on par with the best performing models on FEVER, with a label accuracy of 73.8%. However, it degenerates on the colloquial dataset, with a label accuracy of 60.9%. Recall that our Colloquial Claims shares the same document pool, annotated evidence sentences, and similar semantics with the claims from FEVER. Thus, it is the difference in the claims' style that makes the fact checking systems fatally degenerate.

The WikiAPI retriever, used in many fact checking systems (Hanselowski et al., 2018; Chernyavskiy and Ilvovsky, 2019; Stammbach and Neumann, 2019; Zhou et al., 2019; Liu et al., 2020), shows superior performance to DPR on the FEVER dataset, with a document recall of 90.0%. On Colloquial Claims, however, it crashes down to 72.2%. Meanwhile, DPR shows more robust document retrieval on Colloquial Claims than WikiAPI.
Apart from document retrieval and evidence selection, we also observe a performance decrease in the systems with evidence oracles. This indicates that claim verification itself is also more difficult on Colloquial Claims.

Challenges in Colloquial Claims
We analyze the causes of the degeneration in document retrieval and claim verification in relation to colloquial traits. We compare three document retrieval methods along with the oracle: WikiAPI, DrQA (Chen et al., 2017), and Dense Passage Retrieval (DPR). DrQA is another variant of term-matching methods, based on TF-IDF. Table 9 shows the titles of the ten most frequently retrieved documents for each retriever.
Filler Words Unnecessary for Fact Checking. In colloquial language, claims are not always composed of factual remarks requiring verification. Filler words (e.g., "I see", "yeah, like") are also frequently mixed into the utterances, as shown in Table 5. Hence, our Colloquial Claims requires systems to separate the parts that affect veracity from the ones that do not. However, Table 9 shows that word-matching retrieval systems, such as WikiAPI and DrQA, are vulnerable to those insignificant parts. They naively retrieve filler-word-related documents very frequently.
Minding the Context. Considering the context inside the sentence is essential for verifying colloquial claims. Lexical variation and polysemy are common in colloquial language. Such variation and ambiguity are tolerable because a common context flows through the utterance. For example, in the colloquial claim "Niko Coster-Waldau is also the host of the show. He was with Fox at one point.", it is easy to see that the word "Fox" stands for "Fox Broadcasting Company" based on the context. However, it is well known that simple term-matching methods cannot capture such context (Karpukhin et al., 2020). Thus, we observe that systems instead simply retrieve the document "fox". Table 9 shows other examples of context-less retrieval: the documents "Yes (band)", "There There (novel)", and "Yea (football club)" are naively retrieved by the systems due to simple filler words in the colloquial claims.
Overcoming the Colloquial Traits. Methods based on TF-IDF or word matching are good at recognizing core keywords, but suffer at capturing the rich semantics of context. On the other hand, DPR, a similarity search method based on dense embeddings, shows promising results. The results in Table 9 illustrate that DPR is able to ignore context-irrelevant entities and focus more on fact-related entities. Compared to the other retrieval methods, the ten most frequently retrieved documents from DPR do not contain any filler-word-related titles. Since filler words are irrelevant to the veracity of colloquial claims, DPR learns their insignificance. Therefore, dense representations can be important for making fact checking systems robust on claims in dialogues.

Related Work
Fact Checking and Verification. The need for claim verification has led to annotated fact checking datasets (Thorne et al., 2018; Baly et al., 2018; Augenstein et al., 2019; Jiang et al., 2020; Wadden et al., 2020; Chen et al., 2020). Recent works deploy adversarial attacks against fact checking systems (Thorne et al., 2019a,b; Niewinski et al., 2019; Atanasova et al., 2020b) and attempt to improve the systems through generation (Atanasova et al., 2020a; Goyal and Durrett, 2020; Fan et al., 2020). Existing works tend to focus on verifying news or Wikipedia. However, verifying facts is not limited to such formal texts. Compared to previous works, we focus on verifying claims in the dialogue domain, which more closely resembles daily-life situations.
A special case of fact verification is rumour detection. Its goal is to determine the veracity of rumours from social media (Li et al., 2019). A rumour is classified based on the reactions in chained messages (Gorrell et al., 2019). The procedure and characteristics of rumour detection are quite different from the fact checking pipeline (Gorrell et al., 2019). In our task, we verify claims based on factuality derived from the related documents, rather than on the stances of the comments.
Safety in Open-domain Dialogue. Recently, much work has studied the safety issues of machine dialogue agents in several aspects; for instance, Wulczyn et al. (2017) study the detection of personal attacks in online comments. Previous works cover a wide range of dialogue safety, yet the risks of disinformation and misinformation remain understudied. In this work, we extend dialogue safety to cover the verification of responses containing false information.

Conclusion
This work aimed to open up new discussions at the intersection of fact checking and dialogue safety. In order to study how existing fact checking systems behave on claims in dialogues, we curated colloquial claims by transferring the style of claims in FEVER (Thorne et al., 2018) into colloquialism. We leveraged BART (Lewis et al., 2020) and Wizard of Wikipedia (WoW) (Dinan et al., 2019b): we finetuned BART to generate the wizard's responses from knowledge sentences in WoW, and then fed it FEVER claims to generate claim-grounded utterances. We oversampled candidate claims and applied filters to ensure quality. We showed that existing fact checking systems that perform well on FEVER degenerate on colloquial claims. We found that the document retriever is the weakest spot in the system, being vulnerable even to filler words. We compared the characteristic differences between claims in formal style and those in colloquialism. An important future direction will be building a dialogue dataset for fact checking.

A Implementation Details of AFLITE
We use the adversarial filtering method AFLITE (Sakaguchi et al., 2020; Bras et al., 2020) to select the top-k candidate claims which are most difficult to discriminate from responses in Wizard of Wikipedia (WoW) (Dinan et al., 2019b). The algorithm takes as input the original WoW and Colloquial Claims, and returns each filtered dataset. AFLITE comprises two phases: (i) a precomputing phase and (ii) a filtering phase.
In the precomputing phase, we randomly sample 10% of the instances from WoW and Colloquial Claims to finetune RoBERTa-large. We then use the finetuned RoBERTa to precompute embeddings for the rest of the instances as input for the filtering phase. We discard the samples used for finetuning from the final dataset.
In the filtering phase, we use an ensemble of linear classifiers to iteratively discard easily distinguishable instances. At each iteration, we train 32 linear classifiers on different random partitions of the data and collect their predictions on the rest of the instances. For each instance, we compute its score as the ratio of correct predictions over the total number of predictions, and remove the top-n instances whose scores are above a threshold of 0.75. We remove the top-1000 instances from the entire WoW, and the top-2 instances from each candidate set in Colloquial Claims. We repeat this process until fewer than 3 instances remain in each candidate set or all scores in the candidate set are below the threshold.
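The scoring-and-removal step of one filtering iteration can be sketched as follows. The `predictions` layout is an illustrative assumption: in the actual pipeline, the 32 linear classifiers trained on random partitions produce these per-instance correctness indicators.

```python
def aflite_step(predictions, threshold=0.75, top_n=2):
    """One AFLITE filtering iteration: score each instance by the
    fraction of ensemble classifiers that labeled it correctly, then
    drop the top_n most predictable instances above the threshold.

    predictions: dict mapping instance id -> list of booleans, one per
    classifier, True when that classifier predicted correctly."""
    scores = {i: sum(p) / len(p) for i, p in predictions.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Only instances strictly above the threshold are candidates to drop.
    drop = [i for i in ranked if scores[i] > threshold][:top_n]
    return [i for i in ranked if i not in drop], drop
```

Repeating this step shrinks each candidate set toward the instances that the ensemble cannot reliably distinguish from WoW responses.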

B Other Implementation Details
We finetune BART-large (Lewis et al., 2020) using the Transformers library (Wolf et al., 2019). We use RoBERTa trained on MNLI to implement bidirectional NLI, and the named entity recognition module from Stanza (Qi et al., 2020) to extract named entities from the generated claims and the claims in FEVER. We use the official code from the authors to implement KGAT and the BERT evidence selector (Liu et al., 2020), and CorefBERT (Ye et al., 2020).