FEVER Breaker’s Run of Team NbAuzDrLqg

We describe our submission for the Breaker phase of the second Fact Extraction and VERification (FEVER) shared task. Our adversarial data can be explained from two perspectives. First, we aimed at testing a model's ability to retrieve evidence when appropriate query terms cannot easily be generated from the claim. Second, we test a model's ability to precisely understand the implications of texts, which we expect to be rare in the FEVER 1.0 dataset. Overall, we proposed six types of adversarial attacks. The evaluation of the submitted systems showed that they were able to get both the evidence and the label correct on only 20% of our data. We also present an analysis of our adversarial runs during the data development process.


Introduction
The Fact Extraction and VERification (FEVER) workshop focuses on developing fact-checking systems that can address "fake news" and misinformation problems. In the FEVER shared task, the goal is to develop a system that can verify a given claim by retrieving evidence from Wikipedia documents and classifying the claim as Supports, Refutes, or NotEnoughInfo. While the systems in the shared task of the first FEVER workshop (FEVER 1.0) showed impressive performance, it was questionable whether they are robust against adversarial claims that differ from the test data of the original dataset (Thorne and Vlachos, 2019).
The second workshop on Fact Extraction and VERification runs a shared task that investigates the robustness of these systems. The shared task follows a Build it, Break it, Fix it setting. In the first phase, participants (Builders) develop fact-checking systems, as was done in the previous year's shared task. In the second phase, participants (Breakers) have access to the systems and attack them by generating claims that are challenging for the Builders. In the third phase, the Fixers revise the systems to be robust against the Breakers' claims.
We participated in the second phase (Breakers' Run) of the competition. We submitted 203 instances over seven types of attacks. For six of the seven attack types (all except SubsetNum), the claims were manually written; the claims for SubsetNum were generated from a template.
Our submission achieved a raw potency of 79.66%, but a low correct rate of 64.71% and an adjusted potency of 51.54%. Annotation marked 25.7% of our data as having an incorrect label and 22.8% as ungrammatical, with an 8.9% overlap between the two. While the ungrammatical cases appeared evenly across all attack types, the incorrect-label cases were concentrated in the NotClear attack.
We consider two types of challenges for fact-checking systems: a retrieval challenge and a language-understanding challenge.
The results of FEVER 1.0 showed that most of the evidence sentences can be found among the candidate sentences retrieved by using the terms of the claim as a query (Yoneda et al., 2018; Hanselowski et al., 2018a).
Three of our attacks focus on the retrieval challenge. The claims of the EntityLess attack contain few entities that can be used to retrieve evidence documents. The claims of the EntityLinking attack use a different name from the one in the evidence sentence, so the system needs to link the two names via another article that lists alternative names for the entity. The claims of the SubsetNum attack require three sentences as evidence, where two of the evidence documents can be found from the terms of the claim, but the third cannot.
The remaining three attacks focus on precise understanding of the text. We considered cases where a relevant article mentions the claim, but another sentence in the article states that the claim is not true (Controversy) or not clear (NotClear). If the system blindly picks the most relevant sentences, it can miss such clarifying information. The claims of the FiniteSet attack cover cases where an expression implies that no events of a particular type can happen other than the ones mentioned.
In section 2, we explain our motivations for the attack types. In section 3, we explain how we generated the six types of attacks. In section 4, we discuss the shared task results. In addition to the actual submission results, section 5 discusses our analysis during the adversarial attack development phase.

Design motivations
The claims of the original FEVER dataset were made from randomly chosen sentences (Thorne et al., 2018). We expect that many sentences share similar semantic patterns, while only a few sentences have patterns that differ from the majority. Randomly sampling sentences therefore yields many claims that can be handled by similar fact-checking strategies, making it unlikely that the dataset contains challenging, exceptional claims that are less trivial to fact-check. Here is an example of an exceptional claim: given a sentence, if the claim is entailed by the sentence, it is usually safe to conclude Supports. However, in a few cases the following sentence denies what is written in the previous one. Our attack types Controversy and NotClear test such cases.
In the relation extraction domain, the ability to disambiguate a polysemous entity mention, or to infer that two orthographically different mentions refer to the same entity, has been considered a serious challenge (Rao et al., 2013). We refer to this challenge as entity linking and suggest that entity linking should be tested more intensely for the fact-checking task. In FEVER 1.0, many systems relied solely on neural networks to handle entity linking. For entity names that are mentioned often in the corpus, word embeddings may be trained well enough to handle this. We expect that neural networks may fail on rarely mentioned surface names, and that the original FEVER dataset does not contain many such cases.

Table 1: Label statistics for our submission. NE stands for NotEnoughInfo.

Attack          Supports  Refutes  NE   Total
EntityLess             1        7    2     10
EntityLinking          8        1    0      9
SubsetNum             50       50    0    100
Controversy            0       10    0     10
NotClear               0        0   34     34
FiniteSet              4        6    0     10
NE                     0        0   30     30
Total                 63       74   66    203
Claim generation for each type of attack
Our submission includes six types of adversarial cases and one additional type that contains only NotEnoughInfo claims, so that all three labels have a similar number of claims. Examples for the six attacks are listed in Table 2 and Table 3.

EntityLess
This attack contains cases in which the evidence articles cannot easily be found using the words in the claim. The claims contain only common terms such as 'university', 'alumni', and 'U.S.'. In the example in Table 2, the evidence is in the 'Harvard University' article, while the key term 'Harvard' does not appear in the claim. We expect that the system would wrongly answer NotEnoughInfo.

EntityLinking
This case tests the ability to identify different surface names of the same entity. The collection contains sentences that introduce multiple names of an entity. We selected one such sentence that we expected to be not too popular and used it as the first evidence. For the second evidence, we searched for a sentence that mentions the entity and replaced the entity's name with another of its names. We expect that the system would wrongly answer NotEnoughInfo.

SubsetNum
This case is generated based on a simple logic: if region A is part of region B and B is smaller than region C, then A is smaller than C. In the example, the second and third evidence sentences can be retrieved directly from the claim, but the first cannot. The claims were automatically generated. We extracted the information using predefined templates: we first extracted a list of entities that refer to regions, then extracted subset relations, and finally parsed the area of each entity. We expect that the system would wrongly answer NotEnoughInfo.

Table 4: Acceptability judgments.
- OK : The claim is grammatical and the label is supported by the evidence.
- GR : The claim is ungrammatical.
- UN : The claim is grammatical but the label is incorrect.
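The template-based generation can be sketched as follows. The region names, part-of relation, and area values below are illustrative placeholders, not the actual data parsed from Wikipedia.

```python
# Sketch of the template-based SubsetNum generation (hypothetical data).
# The actual claims were built from regions, subset relations, and areas
# parsed from Wikipedia with predefined templates.

AREAS_KM2 = {                 # illustrative area values
    "Brooklyn": 183.0,
    "New York City": 778.0,
    "Rhode Island": 4001.0,
}
PART_OF = {"Brooklyn": "New York City"}  # region -> containing region


def make_claims(a: str, c: str):
    """If A is part of B and B is smaller than C, then A is smaller than C.
    Returns (claim, label) pairs derived from that transitive logic."""
    b = PART_OF.get(a)
    if b is None or b == c or not (AREAS_KM2[b] < AREAS_KM2[c]):
        return []
    return [
        (f"{a} is smaller than {c}.", "SUPPORTS"),
        (f"{c} is smaller than {a}.", "REFUTES"),  # the converse is refuted
    ]
```

A claim such as "Brooklyn is smaller than Rhode Island." then requires three evidence sentences (the part-of relation and the two area statements), of which only the two area statements share terms with the claim.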

Controversy
This case tests whether the system can distinguish mentions that are not actually true. Two evidence sentences are required: one sentence states some information, and the following sentence says that the previous statement is not true. All claims for these cases are Refutes. We expect that the system would wrongly answer Supports.

NotClear
Wikipedia has sentences that say "It is not clear ..." (Table 2). We wrote claims about facts that are stated to be unclear, which we consider to imply NotEnoughInfo. Because the label is NotEnoughInfo, we did not include any evidence.
The annotators rejected most of these claims (85%), annotating them as not being NotEnoughInfo (including the one in the table). It is not clear whether they accepted the sentences containing 'not clear' as evidence, or found evidence in other documents. We expected the system to wrongly answer Supports or Refutes.

FiniteSet
A sentence such as "A is the ninth and last to do B" implies that there are only nine possible events of type B. Moreover, if another event of type B is claimed to have happened at a time later than A, the claim cannot be true. In many cases the keyword 'last' alone is enough to restrict the timeline. Both Supports and Refutes cases are generated. We expect that the system would wrongly answer NotEnoughInfo.
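The constraint encoded by such sentences can be sketched as follows; the event type and cut-off date below are illustrative assumptions, not drawn from our actual claims.

```python
# Sketch of the FiniteSet logic: "A is the ninth and last to do B"
# implies no event of type B can occur after A's date.
from datetime import date

# Illustrative: date of the 'last' event of each closed event type.
LAST_EVENT_DATE = {"walked on the Moon": date(1972, 12, 14)}


def label_new_event_claim(event_type: str, claimed: date) -> str:
    """Label a claim that a new event of `event_type` happened on `claimed`."""
    last = LAST_EVENT_DATE.get(event_type)
    if last is None or claimed <= last:
        return "NOT ENOUGH INFO"  # could be one of the already-known events
    return "REFUTES"              # later than the 'last' event: impossible
```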

NE
Our adversarial claims are mostly Supports or Refutes. To give each label a similar number of claims, we added claims whose label is NotEnoughInfo. These claims are not particularly adversarial compared to the others.

Task Evaluation
The Breakers' runs were evaluated by the following metrics, based on the official evaluation metric, which is roughly the fraction of instances for which a system gets both the evidence and the label correct.
Our submission resulted in a raw potency of 79.66%, an accepted rate of 64.71%, and an adjusted potency of 51.54%.
A raw potency of 79.66% implies that the systems answered only about 20% of our instances correctly. Considering that 15% of our data belonged to the NE category, which was not actually adversarial, the systems failed almost completely on our adversarial data.
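These numbers are mutually consistent if, as we assume here, the adjusted potency is the raw potency discounted by the accepted rate:

```python
# Quick consistency check of the reported metrics (assumed relation:
# adjusted potency = raw potency * accepted rate).
raw_potency = 0.7966    # fraction of predictions our instances broke
accepted_rate = 0.6471  # fraction of our instances accepted by annotators

# Systems were correct on roughly 1 - raw potency of the instances.
system_correct = 1.0 - raw_potency             # ~0.2034, i.e. about 20%

adjusted_potency = raw_potency * accepted_rate  # ~0.5155 vs reported 51.54%
```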
During the shared task, we tested each type of attack on the running docker images of the shared task test server. For the final Fixer phase, the accepted instances from all Breakers' runs were collected and provided to the Fixers so that the systems could be revised or re-trained on the adversarial data. There was only one Fixer system (CUNLP); it showed a FEVER score of 32.92% before the fix and 68.80% after. Note that these scores for the Fixer system are over all Breakers' submissions, not only ours.
We were not provided with the systems' performance on our runs alone, but we can still speculate about the potency of adversarial instances in this shared task. Because the adversarial runs were rather limited in their diversity, the Fixer was able to overcome these challenges, either manually or through the machine learning models' ability to adapt to new types of data.

Development Analysis
Here, we show a few adversarial instances that we generated during the development process. Note that some of these claims (2, 4) are of different categories from those introduced in section 3, because they were not included in the final submission. We evaluated these claims on the provided sandbox interface, which runs the previously submitted systems: UCL (Yoneda et al., 2018), Athens (Hanselowski et al., 2018b), UCL-MR (Yoneda et al., 2018), Papelo (Malon, 2018), GPLSI, Columbia, and the baseline system (Thorne et al., 2018).
The claims and evidence are listed in Tables 5 and 6. Claim 1 in Table 5 requires the fact-checking system to collect and combine many pieces of evidence: it has to check whether there are presidents who were born in America and precede Barack Obama's term. Claim 2 is an example of the previously explained SubsetNum attack. Claim 3 could be challenging because it does not contain any good keywords; it also requires systems to compare numbers. Claim 4 likewise requires comparing numbers; we expected systems to make mistakes because the evidence sentences contain numerous occurrences of "largest". Claim 5 has related documents that could mistakenly be taken as evidence supporting the claim: the article "Moon landing conspiracy theories" contains a sentence saying "12 Apollo astronauts did not actually walk on the Moon". Because this sentence is very similar to the claim in terms of term matching, it might be retrieved as evidence and confuse the system. Table 7 shows the results for each system, focusing mainly on whether the systems get the classification labels correct. The systems rarely select the evidence that we submitted; however, as there are many alternative evidence sentences for these claims, we cannot conclude that this is a total failure.