DeSePtion: Dual Sequence Prediction and Adversarial Examples for Improved Fact-Checking

The increased focus on misinformation has spurred development of data and systems for detecting the veracity of a claim as well as retrieving authoritative evidence. The Fact Extraction and VERification (FEVER) dataset provides such a resource for evaluating end-to-end fact-checking, requiring retrieval of evidence from Wikipedia to validate a veracity prediction. We show that current systems for FEVER are vulnerable to three categories of realistic challenges for fact-checking – multiple propositions, temporal reasoning, and ambiguity and lexical variation – and introduce a resource with these types of claims. Then we present a system designed to be resilient to these “attacks” using multiple pointer networks for document selection and jointly modeling a sequence of evidence sentences and veracity relation predictions. We find that in handling these attacks we obtain state-of-the-art results on FEVER, largely due to improved evidence retrieval.


Introduction
The growing presence of biased, one-sided, and often altered discourse is posing a challenge to our media platforms, from newswire to social media (Vosoughi et al., 2018). To overcome this challenge, fact-checking has emerged as a necessary part of journalism, where experts examine "check-worthy" claims (Hassan et al., 2017) published by others for their "shades" of truth (e.g., FactCheck.org or PolitiFact). However, this process is time-consuming, and thus building computational models for automatic fact-checking has become an active area of research (Graves, 2018). Advances were  (Mihaylova et al., 2019), and LIAR(+) datasets with claims from PolitiFact (Wang, 2017; Alhindi et al., 2018).
The FEVER 1.0 shared task dataset has enabled the development of end-to-end fact-checking systems, requiring document retrieval and evidence sentence extraction to corroborate a veracity relation prediction (supports, refutes, not enough info). An example is given in Figure 1. Since the claims in FEVER 1.0 were manually written using information from Wikipedia, the dataset may lack linguistic challenges that occur in verifying naturally occurring check-worthy claims, such as temporal reasoning or lexical generalization/specification. Thorne and Vlachos (2019) designed a second shared task (FEVER 2.0) for participants to create adversarial claims ("attacks") to break state-of-the-art systems and then develop systems to resolve those attacks.
We present a novel dataset of adversarial examples for fact extraction and verification in three challenging categories: 1) multiple propositions (claims that require multi-hop document or sentence retrieval); 2) temporal reasoning (date comparisons, ordering of events); and 3) named entity ambiguity and lexical variation (Section 4). We show that state-of-the-art systems are vulnerable to adversarial attacks from this dataset (Section 6). In addition, we take steps toward addressing these vulnerabilities, presenting a system for end-to-end fact-checking that brings two novel contributions using pointer networks: 1) a document ranking model; and 2) a joint model for evidence sentence selection and veracity relation prediction framed as a sequence labeling task (Section 5).
Our new system achieves state-of-the-art results for FEVER and we present an evaluation of our models including ablation studies (Section 6). Data and code will be released to the community.

Related Work
Approaches for predicting the veracity of naturally occurring claims have focused on statements fact-checked by journalists or organizations such as PolitiFact.org (Vlachos and Riedel, 2014; Alhindi et al., 2018), news articles (Pomerleau and Rao, 2017), or answers in community forums (Mihaylova et al., 2018, 2019). However, those datasets are not suited for end-to-end fact-checking as they provide sources and evidence while FEVER requires retrieval. Initial work on FEVER focused on a pipeline approach of retrieving documents, selecting sentences, and then using an entailment module (Malon, 2018; Hanselowski et al., 2018; Tokala et al., 2019); the winning entry for the FEVER 1.0 shared task (Nie et al., 2019a) used three homogeneous neural models. Other work has jointly learned either evidence extraction and question answering (Nishida et al., 2019) or sentence selection and relation prediction (Yin and Roth, 2018; Hidey and Diab, 2018); unlike these approaches, we use the same sequential evidence prediction architecture for both document and sentence selection, jointly predicting a sequence of labels in the latter step. More recently, Zhou et al. (2019) proposed a graph-based framework for multi-hop retrieval, whereas we model evidence sequentially.
Language-based adversarial attacks have often involved transformations of the input such as phrase insertion to distract question answering systems (Jia and Liang, 2017) or to force a model to always make the same prediction (Wallace et al., 2019). Other research has resulted in adversarial methods for paraphrasing with universal replacement rules (Ribeiro et al., 2018) or lexical substitution (Alzantot et al., 2018; Ren et al., 2019). While our strategies include insertion and replacement, we focus specifically on challenges in fact-checking. The task of natural language inference (Bowman et al., 2015; Williams et al., 2018) provides similar challenges: examples for numerical reasoning and lexical inference have been shown to be difficult (Glockner et al., 2018; Nie et al., 2019b) and improved models on these types are likely to be useful for fact-checking. Finally, Thorne and Vlachos (2019) provided a baseline for the FEVER 2.0 shared task with entailment-based perturbations. Other participants generated adversarial claims using implicative phrases such as "not clear" (Kim and Allan, 2019) or GPT-2 (Niewinski et al., 2019). In comparison, we present a diverse set of attacks motivated by realistic, challenging categories and further develop models to address those attacks.

Problem Formulation and Datasets
We address the end-to-end fact-checking problem in the context of FEVER, a task where a system is required to verify a claim by providing evidence from Wikipedia. To be successful, a system needs to predict both the correct veracity relation, which is one of supported (S), refuted (R), or not enough information (NEI), and the correct set of evidence sentences (not applicable for NEI). The FEVER 1.0 dataset was created by extracting sentences from popular Wikipedia pages and mutating them with paraphrases or other edit operations to create a claim. Then, each claim was labeled and paired with evidence or the empty set for NEI. Overall, there are 185,445 claims, of which 90,367 are S, 40,107 are R, and 45,971 are NEI. Thorne and Vlachos (2019) introduced an adversarial setup for the FEVER 2.0 shared task: participants submitted claims to break existing systems and a system designed to withstand such attacks. The organizers provided a baseline of 1000 adversarial examples with negation and entailment-preserving/-altering transformations and this set was combined with examples from participants to form the FEVER 2.0 dataset.

Adversarial Dataset for Fact-checking
While the FEVER dataset is a valuable resource, our goal is to evaluate complex adversarial claims which resemble check-worthy claims found in news articles, speeches, debates, and online discussions. We thus propose three types of attacks based on analysis of FEVER 1.0 (FV1) or prior literature: those using multiple propositions, requiring temporal and numerical reasoning, and involving lexical variation. For the multi-propositional type, Graves (2018) notes that professional fact-checking organizations need to synthesize evidence from multiple sources; automated systems struggle with claims such as "Lesotho is the smallest country in Africa." In FV1-dev, 83.18% of S and R claims require only a single piece of evidence and 89% require only a single Wikipedia page. Furthermore, our previous work on FEVER 1.0 found that our model can fully retrieve 86% of evidence sentences from Wikipedia when only a single sentence is required, but the number drops to 17% when 2 sentences are required and 3% when 3 or more sentences are required (Hidey and Diab, 2018).
For the second type, check-worthy claims are often numerical (Francis, 2016) and temporal reasoning is especially challenging (Mirza and Tonelli, 2016). Rashkin et al. (2017) and Jiang and Wilson (2018) showed that numbers and comparatives are indicative of truthful statements in news, but the presence of a date alone does not indicate its veracity. In FV1-dev, only 17.81% of the claims contain dates and 0.22% contain time information. To understand how current systems perform on these types of claims, we evaluated three state-of-the-art systems from FEVER 1.0 (Hanselowski et al., 2018; Yoneda et al., 2018; Nie et al., 2019a), and examined the predictions where the systems disagreed. We found that in characterizing these predictions according to the named entities present in the claims, the most frequent types were numerical and temporal (such as percent, money, quantity, and date).
Finally, adversarial attacks for lexical variation, where words may be inserted, replaced, or changed with some other edit operation, have been shown to be effective for similar tasks such as natural language inference (Nie et al., 2019b) and question answering (Jia and Liang, 2017), so we include these types of attacks as well. For the fact-checking task, models must match words and entities across claim and evidence to make a veracity prediction. As claims often contain ambiguous entities or lexical features indicative of credibility (Nakashole and Mitchell, 2014), we desire models resilient to minor changes in entities (Hanselowski et al., 2018) and words (Alzantot et al., 2018).
We thus create an adversarial dataset of 1000 examples, with 417 multi-propositional, 313 temporal and 270 lexically variational. Representative examples are provided in Appendix A.

Multiple Propositions
Check-worthy claims often consist of multiple propositions (Graves, 2018). In the FEVER task, checking these claims may require retrieving evidence sequentially after resolving entities and events, understanding discourse connectives, and evaluating each proposition.
Consider the claim "Janet Leigh was from New York and was an author." The Wikipedia page [Janet Leigh] contains evidence that she was an author, but makes no mention of New York. We generate new claims of the CONJUNCTION type automatically by mining claims from FV1-dev and extracting entities from the subject position. We then combine two claims by replacing the subject in one sentence with a discourse connective such as "and." The new label is S if both original claims are S, R if at least one claim is R, and NEI otherwise.
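The labeling rule for CONJUNCTION claims can be illustrated with a minimal sketch. The function name `combine_labels` is hypothetical and not taken from the released code; it only encodes the rule stated above (S if both source claims are S, R if at least one is R, NEI otherwise).

```python
VALID_LABELS = {"S", "R", "NEI"}


def combine_labels(first: str, second: str) -> str:
    """Return the label of a conjunction built from two labeled claims."""
    labels = {first, second}
    if not labels <= VALID_LABELS:
        raise ValueError(f"unknown label in {labels}")
    # A single refuted conjunct refutes the whole conjunction.
    if "R" in labels:
        return "R"
    # Both conjuncts must be supported for the conjunction to be supported.
    if labels == {"S"}:
        return "S"
    # Otherwise at least one conjunct lacks evidence.
    return "NEI"
```

Under this rule, the Janet Leigh example (one supported proposition and one with no evidence) would be labeled NEI.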
While CONJUNCTION claims provide a way to evaluate multiple propositions about a single entity, these claims only require evidence from a single page; hence we create new examples requiring reasoning over multiple pages. To create MULTI-HOP examples, we select claims from FV1-dev whose evidence obtained from a single page P contains at least one other entity having a valid page Q. We then modify the claim by appending information about the entity which can be verified from Q. For example, given the claim "The Nice Guys is a 2016 action comedy film." we make a multi-hop claim by obtaining the page [Shane Black] (the director) and appending the phrase "directed by a Danish screenwriter known for the film Lethal Weapon." While multi-hop retrieval provides a way to evaluate the S and R cases, composition of multiple propositions may also be necessary for NEI, as the relation of the claim and evidence may be changed by more general/specific phrases. We thus add ADDITIONAL UNVERIFIABLE PROPOSITIONS that change the gold label to NEI. We selected claims from FV1-dev and added propositions which have no evidence in Wikipedia (e.g. for the claim "Duff McKagan is an American citizen," we can add the reduced relative clause "born in Seattle").

Temporal Reasoning
Many check-worthy claims contain dates or time periods and to verify them requires models that handle temporal reasoning (Thorne and Vlachos, 2017).
In order to evaluate the ability of current systems to handle temporal reasoning we modify claims from FV1-dev. More specifically, using claims with the phrase "in <date>" we automatically generate seven modified claims using simple DATE MANIPULATION heuristics: arithmetic (e.g., "in 2001" → "4 years before 2005"), range ("in 2001" → "before 2008"), and verbalization ("in 2001" → "in the first decade of the 21st century").
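The three heuristic families can be sketched as follows. This is an illustration only: the function names, the fixed offsets (4 and 7 years), and the convention that a year such as 2000 opens the 21st century are assumptions made for the example, and the sketch produces three variants rather than the seven generated per claim in the dataset.

```python
import re

# Matches the "in <date>" phrase for four-digit years only (an assumption).
DATE_PATTERN = re.compile(r"\bin (\d{4})\b")
DECADES = ["first", "second", "third", "fourth", "fifth",
           "sixth", "seventh", "eighth", "ninth", "tenth"]


def ordinal(n: int) -> str:
    """Format an integer with its English ordinal suffix, e.g. 21 -> '21st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def arithmetic(year: int, offset: int = 4) -> str:
    return f"{offset} years before {year + offset}"


def date_range(year: int, offset: int = 7) -> str:
    return f"before {year + offset}"


def verbalize(year: int) -> str:
    century = year // 100 + 1            # popular convention: 2000 -> 21st
    decade = DECADES[(year % 100) // 10]
    return f"in the {decade} decade of the {ordinal(century)} century"


def manipulate(claim: str) -> list:
    """Return one rewritten claim per heuristic, or [] if no date phrase."""
    match = DATE_PATTERN.search(claim)
    if match is None:
        return []
    year = int(match.group(1))
    phrases = [arithmetic(year), date_range(year), verbalize(year)]
    return [claim[:match.start()] + p + claim[match.end():] for p in phrases]
```

All three rewrites preserve the truth value of the original date phrase, so the original label carries over; choosing offsets that contradict the date would instead produce R claims.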
We also create examples requiring MULTI-HOP TEMPORAL REASONING, where the system must evaluate an event in relation to another. Consider the S claim "The first governor of the Indiana Territory lived long enough to see it become a state." A system must resolve entity references (Indiana Territory and its first governor, William Henry Harrison) and compare dates of events (the admittance of Indiana in 1816 and death of Harrison in 1841). While multi-hop retrieval may resolve references, the model must understand the meaning of "lived long enough to see" and evaluate the comparative statement. To create claims of this type, we mine Wikipedia by selecting a page X and extracting sentences with the pattern "is/was/named the A of Y" (e.g., A is "first governor"), where Y links to another page. Then we manually create temporal claims by examining dates on X and Y and describing the relation between the entities and events.
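The sentence-mining pattern can be approximated with a regular expression. This is a rough sketch under stated assumptions: it works on plain text, so it treats any capitalized phrase as the entity Y, whereas the procedure above relies on Y being a hyperlink to another page. The names `ROLE_PATTERN` and `extract_role` are hypothetical.

```python
import re
from typing import Optional, Tuple

# "is/was/named the A of Y": A is a lowercase role phrase, Y a capitalized
# entity name (a stand-in for a linked Wikipedia page).
ROLE_PATTERN = re.compile(
    r"\b(?:is|was|named) the (?P<role>[a-z][\w ]*?) of (?:the )?"
    r"(?P<entity>[A-Z]\w*(?: [A-Z]\w*)*)"
)


def extract_role(sentence: str) -> Optional[Tuple[str, str]]:
    """Return (A, Y) for the first match of the pattern, or None."""
    match = ROLE_PATTERN.search(sentence)
    if match is None:
        return None
    return match.group("role"), match.group("entity")
```

For the Harrison example, a sentence from page X yields the pair ("first governor", "Indiana Territory"); the temporal claim itself is still written manually from the dates on both pages.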

Named Entity Ambiguity and Lexical Variation
As fact-checking systems are sensitive to lexical choice (Nakashole and Mitchell, 2014;Rashkin et al., 2017), we consider how variations in entities and words may affect veracity relation prediction.
ENTITY DISAMBIGUATION has been shown to be important for retrieving the correct page for an entity among multiple candidates (Hanselowski et al., 2018). To create examples that contain ambiguous entities we selected claims from FV1-dev where at least one Wikipedia disambiguation page was returned by the Wikipedia Python API (https://pypi.org/project/wikipedia/). We then created a new claim using one of the documents returned from the disambiguation list. For example, the claim "Patrick Stewart is someone who does acting for a living." returns a disambiguation page, which in turn gives a list of pages such as [Patrick Stewart] and [Patrick Maxwell Stewart].
Finally, as previous work has shown that neural models are vulnerable to LEXICAL SUBSTITUTION (Alzantot et al., 2018), we apply their genetic algorithm approach to replace words via counter-fitted embeddings. We make a claim adversarial to a model fine-tuned on claims and gold evidence by replacing words with synonyms, hypernyms, or hyponyms, e.g. created → established, leader → chief. We manually remove ungrammatical claims or incorrect relations.

Methods
Verifying check-worthy claims such as those in Section 4 requires a system to 1) make sequential decisions to handle multiple propositions, 2) support temporal reasoning, and 3) handle ambiguity and complex lexical relations. To address the first requirement we make use of a pointer network (Vinyals et al., 2015) in two novel ways: i) to rerank candidate documents and ii) to jointly predict a sequence of evidence sentences and veracity relations in order to compose evidence (Figure 3). To address the second we add a post-processing step for simple temporal reasoning. To address the third we use rich, contextualized representations. Specifically, we fine-tune BERT (Devlin et al., 2019) as this model has shown excellent performance on related tasks and was pre-trained on Wikipedia. Our full pipeline is presented in Figure 2. We first identify an initial candidate set of documents (1a) by combining the top M pages from TF-IDF retrieval with pages from the approach of Chakrabarty et al. (2018), which provides results from Google search and predicted named entities and noun phrases. Then, we perform document ranking by selecting the top D < M pages with a pointer network (1b). Next, an N-long sequence of evidence sentences (2) and veracity relation labels (3) are predicted jointly by another pointer network.
Prior to training, we fine-tune BERT for document and sentence ranking on claim/title and claim/sentence pairs, respectively. Each claim and evidence pair in the FEVER 1.0 dataset has both the title of the Wikipedia article and at least one sentence associated with the evidence, so we can train on each of these pairs directly. For the claim "Michelle Obama's husband was born in Kenya", shown in Figure 3, we obtain representations by pairing this claim with evidence sentences such as "Obama was born in Hawaii" and article titles such as [Barack Obama].
The core component of our approach is the pointer network, as seen in Figure 3. Unlike our previous work (Hidey and Diab, 2018), we use the pointer network to re-rank candidate documents and jointly predict a sequence of evidence sentences and relations. Given a candidate set of evidence (as either document titles or sentences) and a respective fine-tuned BERT model, we extract features for every claim $c$ and evidence $e_p$ pair by summing the [CLS] embedding for the top 4 layers (as recommended by Devlin et al. (2019)):
$$m_p = \mathrm{BERT}(c, e_p) \quad (1)$$
Next, to select the top $k$ evidence, we use a pointer network over the evidence for claim $c$ to extract evidence recurrently by computing the extraction probability $P(p_t \mid p_0 \cdots p_{t-1})$ for evidence $e_p$ at time $t < k$. At time $t$, we update the hidden state $z_t$ of the pointer network decoder. Then we compute the weighted average $h_t^q$ of the entire evidence set using $q$ hops over the evidence (Vinyals et al., 2016; Sukhbaatar et al., 2015) (Equation 2). We concatenate $m_p$ and $h_t^q$ and use a multi-layer perceptron (MLP) to predict $p_t$. The loss is then the negative log-likelihood of the gold evidence sequence:
$$L(\theta_{ptr}) = -\frac{1}{k} \sum_{t=0}^{k-1} \log P_{\theta_{ptr}}(p_t \mid p_{0:t-1}) \quad (3)$$
We train on gold evidence and perform inference with beam search for both document ranking (Section 5.1) and joint sentence selection and relation prediction (Section 5.2).
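The feature extraction step described above (summing the [CLS] embedding over the top 4 layers) can be sketched in plain Python. The nested-list layout `hidden_states[layer][token]`, with token 0 as [CLS], is an assumption made for illustration; an actual implementation would operate on the tensors returned by the BERT encoder.

```python
from typing import List, Sequence

Vector = List[float]


def cls_feature(hidden_states: Sequence[Sequence[Vector]],
                top: int = 4) -> Vector:
    """Sum the [CLS] vector (token 0) over the last `top` encoder layers.

    hidden_states[layer][token] is the embedding of `token` at `layer`,
    ordered from the lowest layer to the highest.
    """
    if len(hidden_states) < top:
        raise ValueError("fewer layers than requested")
    cls_vectors = [layer[0] for layer in hidden_states[-top:]]
    # Element-wise sum across the selected layers.
    return [sum(values) for values in zip(*cls_vectors)]
```

The resulting vector plays the role of the claim/evidence representation fed to the pointer network, one vector per candidate title or sentence.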

Document Ranking
In order to obtain representations as input to the pointer network for document ranking, we leverage the fact that Wikipedia articles all have a title (e.g. [Barack Obama]), and fine-tune BERT on title and claim pairs, in lieu of examining the entire document text (which due to its length is not suitable for BERT). Because the title often overlaps lexically with the claim (e.g. [Michelle Obama]), we can train the model to locate the title in the claim. Furthermore, the words in the title co-occur with words in the article (e.g. Barack and Michelle), which the pre-trained BERT language model may be attuned to. We thus fine-tune a classifier on a dataset created from title and claim pairs (where positive examples are titles of gold evidence pages and negative are randomly sampled from our candidate set), obtaining 90.0% accuracy. Given the fine-tuned model, we extract features using Equation 1 where $e_p$ is a title, and use Equation 3 to learn to predict a sequence of titles as in Figure 3.
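The construction of title/claim training pairs can be sketched as below. The function name, the one-negative-per-positive ratio, and the fixed seed are assumptions for illustration; the text above only specifies that positives are gold page titles and negatives are sampled at random from the candidate set.

```python
import random
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str, int]  # (claim, title, label)


def build_title_pairs(claim: str,
                      gold_titles: Set[str],
                      candidate_titles: Iterable[str],
                      negatives_per_positive: int = 1,
                      seed: int = 0) -> List[Pair]:
    """Pair a claim with gold titles (label 1) and sampled negatives (label 0)."""
    rng = random.Random(seed)
    positives = [(claim, title, 1) for title in sorted(gold_titles)]
    # Negatives come only from candidates that are not gold pages.
    pool = sorted(set(candidate_titles) - set(gold_titles))
    n_neg = min(len(pool), negatives_per_positive * len(positives))
    negatives = [(claim, title, 0) for title in rng.sample(pool, n_neg)]
    return positives + negatives
```

Sampling negatives from the retrieved candidates, rather than from all of Wikipedia, exposes the classifier to the lexically similar distractors it has to separate at ranking time.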

Joint Sentence Selection and Relation Prediction
The sentence selection and relation prediction tasks are closely linked, as predicting the correct evidence is necessary for predicting S or R and the representation should reflect the interaction between a claim and an evidence set. Conversely, if a claim and an evidence set are unrelated, the model should predict NEI. We thus jointly model this interaction by sharing the parameters of the pointer network: the hidden state of the decoder is used for both tasks and the models differ only by a final MLP.

Sentence Selection
Similar to our document selection fine-tuning approach, we fine-tune a classifier on claim and evidence sentence pairs to obtain BERT embeddings. However, instead of training a binary classifier for the presence of valid evidence we train directly on veracity relation prediction, which is better suited for the end task. We create a dataset by pairing each claim with its set of gold evidence sentences. As gold evidence is not available for NEI relations, we sample sentences from our candidate documents to maintain a balanced dataset. We then fine-tune a BERT classifier on relation prediction, obtaining 93% accuracy. Given the fine-tuned model, we extract features using Equation 1 where $e_p$ is a sentence, and use Equation 3 to learn to predict a sequence of sentences.

Relation Prediction
In order to closely link relation prediction with evidence prediction, we reframe the task as a sequence labeling task. In other words, rather than make a single prediction given all evidence sentences, we make one prediction at every timestep during decoding to model the relation between the claim and all evidence retrieved to that point. This approach provides three benefits: it allows the model to better handle noise (when an incorrect evidence sentence is predicted), to handle multi-hop inference (to model the occurrence of switching from NEI to S/R), and to effectively provide more training data (for $k = 5$ timesteps we have 5 times as many relation labels). For the claim in Figure 3, the initial label sequence is NEI and R because the first evidence sentence by itself (the fact that Barack Obama was born in Hawaii) would not refute the claim. Furthermore, for $k = 5$, the remaining sequence would be R, R, R, as additional evidence (guaranteed to be non-contradictory in FEVER) would not change the prediction. On the other hand, given a claim that requires only a single piece of evidence, such as that in Figure 1, the sequence would be R, R, R, R, R if the correct evidence sentence was selected at the first timestep, NEI, R, R, R, R if the correct evidence sentence was selected at the second timestep, and so forth. We augment the evidence sentence selection described previously to use the hidden state of the pointer network after $q$ hops (Equation 2) and an MLP to also predict a label $l_t$ at that time step, closely linking evidence and label prediction (Equation 4). As with evidence prediction (Equation 3), when the gold label sequence is available, the loss term is:
$$L(\theta_{rel\_seq}) = -\frac{1}{k} \sum_{t=0}^{k-1} \log P_{\theta_{rel\_seq}}(l_t) \quad (5)$$
When training, at the current timestep we use both the gold evidence, i.e. "teacher forcing" (Williams and Zipser, 1989), and the model prediction from the previous step, so that we have training data for NEI.
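The per-timestep label sequences described above can be derived mechanically from a sequence of selected sentences. The sketch below is a simplification under stated assumptions: it uses a single gold evidence set per claim (FEVER can list several alternative sets), and the function name and sentence identifiers are hypothetical.

```python
from typing import List, Sequence, Set


def relation_label_sequence(selected: Sequence[str],
                            gold_evidence: Set[str],
                            gold_label: str,
                            k: int = 5) -> List[str]:
    """Label each timestep NEI until all gold evidence has been retrieved."""
    labels = []
    seen = set()
    for sentence_id in selected[:k]:
        seen.add(sentence_id)
        # The gold relation holds only once the evidence set is complete.
        complete = gold_evidence <= seen
        labels.append(gold_label if complete else "NEI")
    return labels
```

For a two-hop refuted claim whose gold sentences are selected at the first two timesteps, this yields NEI, R, R, R, R, matching the Figure 3 example; later noisy selections do not change the label because evidence in FEVER is non-contradictory.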
Combining Equations 3 and 5, our loss is:
$$L(\theta) = \lambda L(\theta_{ptr}) + L(\theta_{rel\_seq}) \quad (6)$$
Finally, to predict a relation at inference, we ensemble the sequence of predicted labels by averaging the probabilities over every time step.

Post-processing for Simple Temporal Reasoning
As neural models are unreliable for handling numerical statements, we introduce a rule-based step to extract and reason about dates. We use the Open Information Extraction system of Stanovsky et al. (2018) to extract tuples. For example, given the claim "The Latvian Soviet Socialist Republic was a republic of the Soviet Union 3 years after 2009," the system would identify ARG0 as preceding the verb was and ARG1 following. After identifying tuples in claims and predicted sentences, we discard those lacking dates (e.g. ARG0). Given more than one candidate sentence, we select the one ranked higher by the pointer network. Once we have both the claim and evidence date-tuple we apply one of three rules to resolve the relation prediction based on the corresponding temporal phrase. We either evaluate whether the evidence date is between two dates in the claim (e.g. between/during/in), we add/subtract x years from the date in the claim and compare to the evidence date (e.g. x years/days before/after), or compare the claim date directly to the evidence date (e.g. before/after/in). For the date expression "3 years after 2009," we compare the year 2012 to the date in the retrieved evidence (1991, the year the USSR dissolved) and label the claim as R.
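The three date rules can be illustrated with a small rule-based sketch. It is an approximation under stated assumptions: it handles only four-digit years (not days), it takes the temporal phrase and the evidence year as already extracted (the Open IE step is omitted), and the regular expressions and function name are hypothetical.

```python
import re
from typing import Optional

BETWEEN = re.compile(r"between (\d{4}) and (\d{4})")
OFFSET = re.compile(r"(\d+) years? (before|after) (\d{4})")
DIRECT = re.compile(r"\b(before|after|in) (\d{4})")


def date_relation(temporal_phrase: str, evidence_year: int) -> Optional[str]:
    """Return 'S' or 'R' for a claim date phrase, or None if no rule applies."""
    match = BETWEEN.search(temporal_phrase)
    if match:  # Rule 1: evidence date falls within a claimed range.
        low, high = sorted(int(g) for g in match.groups())
        return "S" if low <= evidence_year <= high else "R"
    match = OFFSET.search(temporal_phrase)
    if match:  # Rule 2: add/subtract x years, then compare.
        offset, direction, year = int(match.group(1)), match.group(2), int(match.group(3))
        target = year + offset if direction == "after" else year - offset
        return "S" if evidence_year == target else "R"
    match = DIRECT.search(temporal_phrase)
    if match:  # Rule 3: compare the claim date directly.
        relation, year = match.group(1), int(match.group(2))
        if relation == "before":
            return "S" if evidence_year < year else "R"
        if relation == "after":
            return "S" if evidence_year > year else "R"
        return "S" if evidence_year == year else "R"
    return None
```

The offset rule is checked before the direct comparison because a phrase such as "3 years after 2009" also contains "after 2009"; when no rule fires, the neural prediction would be kept.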

Experiments and Results
We evaluate our dataset and system as part of the FEVER 2.0 shared task in order to validate the vulnerabilities introduced by our adversarial claims (Section 4) and the solutions proposed by our system (Section 5). We train our system on FV1-train and evaluate on FV1/FV2-dev/test (Section 3). We report accuracy (percentage of correct labels) and recall (whether the gold evidence is contained in selected evidence at k = 5). We also report the FEVER score, the percentage of correct evidence sentences (for S and R) that also have correct labels, and potency, the inverse FEVER score (subtracted from one) for evaluating adversarial claims.
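The evaluation metrics can be sketched as follows, assuming the usual shared-task convention that an S or R prediction counts only if the label is correct and at least one complete gold evidence set appears among the top k = 5 predicted sentences. The `Example` container and function names are hypothetical, and potency is computed here as the average of one minus the FEVER score over the attacked systems.

```python
from typing import List, NamedTuple, Sequence, Set


class Example(NamedTuple):
    gold_label: str
    pred_label: str
    gold_evidence_sets: List[Set[str]]
    pred_evidence: List[str]


def is_strictly_correct(ex: Example, k: int = 5) -> bool:
    """Correct label and, for S/R, one full gold evidence set in the top k."""
    if ex.pred_label != ex.gold_label:
        return False
    if ex.gold_label == "NEI":
        return True  # no evidence is required for NEI
    retrieved = set(ex.pred_evidence[:k])
    return any(gold <= retrieved for gold in ex.gold_evidence_sets)


def fever_score(examples: Sequence[Example], k: int = 5) -> float:
    return sum(is_strictly_correct(ex, k) for ex in examples) / len(examples)


def label_accuracy(examples: Sequence[Example]) -> float:
    return sum(ex.pred_label == ex.gold_label for ex in examples) / len(examples)


def potency(fever_scores: Sequence[float]) -> float:
    """Average of (1 - FEVER score) over the systems being attacked."""
    return sum(1.0 - score for score in fever_scores) / len(fever_scores)
```

A claim set is therefore more potent the lower the FEVER scores it induces across systems.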
Our Baseline-RL: For baseline experiments, to compare different loss functions, we use the approach of Chakrabarty et al. (2018) for document selection and ranking, the reinforcement learning (RL) method of Chen and Bansal (2018) for sentence selection, and BERT (Devlin et al., 2019) for relation prediction. The RL approach using a pointer network is detailed by Chen and Bansal (2018) for extractive summarization, with the only difference that we use our fine-tuned BERT on claim/gold sentence pairs to represent each evidence sentence in the pointer network (as with our full system) and use the FEVER score as a reward. The reward is obtained by selecting sentences with the pointer network and then predicting the relation using an MLP (updated during training) and the concatenation of all claim/predicted sentence representations with their maximum/minimum pooling.
Hyper-parameters and settings for all experiments are detailed in Appendix B.

Adversarial Dataset Evaluation
We present the performance of our adversarial claims, obtained by submitting to the shared task server. We compare our claims to those of other participants in the FEVER 2.0 shared task (Table 2) and break them down by attack type (Table 3). Potency was macro-averaged across different fact-checking systems (Thorne and Vlachos, 2019), correctness of labels was verified by shared task annotators, and adjusted potency was calculated by the organizers as the potency of correct examples. Compared to other participants (Table 2), we presented a larger set of claims (501 in dev and 499 in test). We rank second in adjusted potency, but we provided a more diverse set than those created by the organizers or other participants. The organizers (Thorne and Vlachos, 2019) created adversarial claims using simple pattern-matching and replacement, e.g. quantifiers and negation. Niewinski et al. (2019) trained a GPT-2-based model on the FEVER data and manually filtered disfluent claims. Kim and Allan (2019) considered a variety of approaches, the majority of which required understanding area comparisons between different regions or understanding implications (e.g. that "not clear" implies NEI). While GPT-2 is effective, our approach is controllable and targeted at real-world challenges. Finally, Table 3

Evaluation against State-of-the-art
In Tables 4 and 5 we compare Our System (Section 5) to recent work from teams that submitted to the shared task server for FEVER 1.0 and 2.0, respectively, including the results of Our Baseline-RL system in Table 5. Our dual pointer network approach obtains state-of-the-art results on the FEVER 1.0 blind test set (

System Component Ablation
To better understand the improved performance of our system, we present two ablation studies in Tables 6 and 7 on FV1 and FV2 dev, respectively. Table 6 presents the effect of using different objective functions for sentence selection and relation prediction, compared to joint sentence selection and relation prediction in our full model. We compare Our System to Our Baseline-RL system as well as another baseline (Ptr). The Ptr system is the same as Our Baseline-RL, except the pointer network and MLP are not jointly trained with RL but independently using gold evidence and predicted evidence and relations, respectively. Finally, the Oracle upper bound presents the maximum possible recall after our document ranking stage, compared to 94.4% for Chakrabarty et al. (2018), and relation accuracy (given the MLP trained on 5 sentences guaranteed to contain gold evidence). We find that by incorporating the relation sequence loss, we improve the evidence recall significantly relative to the oracle upper bound, reducing the relative error by 50% while also obtaining improvements on relation prediction, even over a strong RL baseline. Overall, the best model is able to retrieve 95.9% of the possible gold sentences after the document selection stage, suggesting that further improvements are more likely to come from document selection.
Table 7 evaluates the impact of the document pointer network and rule-based date handling on FV2-dev, as the impact of multi-hop reasoning and temporal relations is less visible on FV1-dev. We again compare Our Baseline-RL system to Our System and find an even larger 7.16 point improvement in FEVER score. We find that ablating the date post-processing (-dateProc) and both the date post-processing and document ranking components (-dateProc,-docRank) reduces the FEVER score by 1.45 and 3.5 points, respectively, with the latter largely resulting from a 5 point decrease in recall.

Ablation for Attack Types
While Table 3 presents the macro-average of all systems by attack type, we compare the performance of Our Baseline-RL and Our System in Table 8. Our System is significantly better on all metrics (p < 0.001 by the approximate randomization test). Our System improves on evidence recall for multi-hop claims (indicating that a multi-hop document retrieval step may help) and those with ambiguous entities or words (using a model to re-rank may remove false matches with high lexical similarity). For example, the claim "Honeymoon is a major-label record by Elizabeth Woolridge Grant." requires multi-hop reasoning over entities. Our System correctly retrieves the pages [Lana Del Rey] and [Honeymoon (Lana Del Rey album)], but Our Baseline-RL is misled by the incorrect page [Honeymoon]. However, while recall increases on multi-hop claims compared to the baseline, accuracy decreases, suggesting the model may be learning a bias of the claim or label distribution instead of relations between claims and evidence.
We also obtain large improvements on date manipulation examples (here a rule-based approach is better than our neural one); in contrast, multi-hop temporal reasoning leaves room for improvement. For instance, for the claim "The MVP of the 1976 Canada Cup tournament was born before the tournament was first held," our full system correctly retrieves [Bobby Orr] and [1976 Canada Cup] (unlike the RL baseline). However, a further inference step is needed beyond our current capabilities: reasoning that Orr's birth year (1948) is before the first year of the tournament (1976).
Finally, we enhance performance on multi-propositional claims formed as conjunctions or with additional unverifiable information (indicating that relation sequence prediction helps). Claims (non-verifiable phrase in brackets) such as "Taran Killam is a [stage] actor." and "Home for the Holidays stars an actress [born in Georgia]." are incorrectly predicted by the baseline even though correct evidence is retrieved.

Conclusion
We showed weaknesses in approaches to fact-checking via novel adversarial claims. We took steps towards realistic fact-checking with targeted improvements to multi-hop reasoning (by a document pointer network and a pointer network for sequential joint sentence selection and relation prediction).
There are many unaddressed vulnerabilities that are relevant for fact-checking. The Facebook bAbI tasks (Weston et al., 2016) include other types of reasoning (e.g. positional or size-based). The DROP dataset (Dua et al., 2019) requires mathematical operations for question answering such as addition or counting. Propositions with causal relations (Hidey and McKeown, 2016), which are event-based rather than attribute-based as in FEVER, are also challenging. Finally, many verifiable claims are non-experiential (Park and Cardie, 2014), e.g. personal testimonies, which would require predicting whether a reported event was actually possible.
Our system could also be improved in many ways. Future work in multi-hop reasoning could represent the relation between consecutive pieces of evidence and future work in temporal reasoning could incorporate numerical operations with BERT (Andor et al., 2019). One limitation of our system is the pipeline nature, which may require addressing each type of attack individually as adversaries adjust their techniques. An end-to-end approach or a query reformulation step (re-writing claims to be similar to FEVER) might make the model more resilient as new attacks are introduced.

B Hyper-parameters and Experimental Settings
We select M = 30 Wikipedia articles using TF-IDF when combining with our other candidate document selection methods and select D = 5 after document ranking. We select N = 5 sentences during sentence selection, consistent with the shared task evaluation.

B.1 BERT Language Model Fine-Tuning
We use version 0.5.0 of the Huggingface library (https://github.com/huggingface/pytorch-pretrained-BERT) to fine-tune the "BERT-base" model using the default settings. We lowercase all tokens and use the default BERT tokenizer. As recommended by Devlin et al. (2019), we select hyper-parameters by grid search over 16 and 32 for batch size, 2e-5, 3e-5, and 5e-5 for learning rate, and 3 and 4 for the number of epochs.
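The grid described above contains 2 × 3 × 2 = 12 configurations, which can be enumerated as in the sketch below. The `evaluate` callable is a hypothetical stand-in for fine-tuning BERT with a given configuration and returning validation accuracy; it is not part of any library.

```python
from itertools import product
from typing import Callable, Dict, List

GRID = {
    "batch_size": [16, 32],
    "learning_rate": [2e-5, 3e-5, 5e-5],
    "epochs": [3, 4],
}


def grid_configurations(grid: Dict[str, list]) -> List[dict]:
    """Enumerate every combination of hyper-parameter values."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


def select_best(grid: Dict[str, list],
                evaluate: Callable[[dict], float]) -> dict:
    """Return the configuration with the highest validation score."""
    return max(grid_configurations(grid), key=evaluate)
```

In practice `evaluate` would train on the fine-tuning split and score on the held-out validation pairs.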

Document Ranking

Sentence Selection
Our dataset of sentence and claim pairs (also obtained from FV1-train) consists of 54,431 S relations, 54,592 R relations, and 54,501 NEI relations in training, with approximately 10% set aside for validation (6,139 S relations, 5,984 R relations, and 6,050 NEI relations). We again select hyper-parameters consistent with the recommended best practice.

B.2 Pointer Network
We train both the document ranking and sentence selection pointer networks on FV1-train with the same hyper-parameters using Adagrad (Duchi et al., 2011) with a learning rate of 0.01, a batch size of 16, and a maximum of 140 epochs with early stopping on FV1-dev. The dimension of the pointer network LSTM hidden state is set to 200 with q = 3 hops over the memory. We use a beam width of 5 during inference. The MLP used to predict relations has a hidden layer dimensionality of 200 and we set λ = 1.

B.3 Reinforcement Learning
To make the sentence extractor an RL agent, we can formulate a Markov Decision Process (MDP): at each extraction step $t$, given a claim $c$, the agent observes the current state and samples an action from Equation 3 to extract a document sentence $s$, predict the relation label $l$, and receive a reward $r(t+1) = \mathrm{FEVER}(c, s, l)$. We train using REINFORCE, adapted with an Actor-Critic to minimize variance (detailed by Chen and Bansal (2018)). As RL often requires pre-training, we combine the pointer network loss from Equation 3 with the RL loss $L(\theta_{rl})$ and the relation prediction loss $L(\theta_{rel})$:
$$L(\theta) = \lambda_1 L(\theta_{ptr}) + \lambda_2 L(\theta_{rl}) + L(\theta_{rel}) \quad (7)$$
We set both $\lambda_1 = 1$ and $\lambda_2 = 1$.