TwoWingOS: A Two-Wing Optimization Strategy for Evidential Claim Verification

Determining whether a given claim is supported by evidence is a fundamental NLP problem that is best modeled as Textual Entailment. However, given a large collection of text, finding evidence that could support or refute a given claim is a challenge in itself, amplified by the fact that different evidence might be needed to support or refute a claim. Nevertheless, most prior work decouples evidence finding from determining the truth value of the claim given the evidence. We propose to consider these two aspects jointly. We develop TwoWingOS (two-wing optimization strategy), a system that, while identifying appropriate evidence for a claim, also determines whether or not the claim is supported by the evidence. Given the claim, TwoWingOS attempts to identify a subset of the evidence candidates; given the predicted evidence, it then attempts to determine the truth value of the corresponding claim entailment problem. We treat this as a pair of coupled optimization problems and train a joint model for them. TwoWingOS offers two advantages: (i) unlike pipeline systems, it facilitates a flexible-size evidence set, and (ii) joint training improves both the claim entailment and the evidence identification. Experiments on a benchmark dataset show state-of-the-art performance.


Introduction
A claim, e.g., "Marilyn Monroe worked with Warner Brothers", is an assertive sentence that may be true or false. While the task of claim verification will not tell us the absolute truth of this claim, it is expected to determine whether the claim is supported by evidence in a given text collection. Specifically, given a claim and a text corpus, evidential claim verification, demonstrated in Figure 1, aims at identifying text snippets in the corpus that act as evidence that supports or refutes the claim. This problem has broad applications. For example, knowledge bases (KB), such as Freebase (Bollacker et al., 2008) and YAGO (Suchanek et al., 2007), can be augmented with a new relational statement such as "(Afghanistan, is source of, Kushan Dynasty)". This needs to be first verified by a claim verification process and supported by evidence (Roth et al., 2009; Chaganty et al., 2017). More broadly, claim verification is a key component in any technical solution addressing recent concerns about the trustworthiness of online content (Vydiswaran et al., 2011; Pasternack and Roth, 2013; Hovy et al., 2013). In both scenarios, we care about whether or not a claim holds, and seek reliable evidence in support of this decision.
Figure 1: Illustration of the evidential claim verification task. For a claim, we determine its truth value by evidence identified from a text corpus.
Footnote 1: cogcomp.org/page/publication_view/847
Evidential claim verification requires that we address three challenges. First, we must locate text snippets in the given corpus that can potentially be used to determine the truth value of the given claim. This differs from the conventional textual entailment (TE) problem (Dagan et al., 2013), as here we first look for the premises given a hypothesis. Clearly, the evidence one seeks depends on the claim, as well as on the eventual entailment decision: the same claim would require different evidence to support it than to refute it. This motivates us to develop an approach that can transfer knowledge from claim verification to evidence identification. Second, the evidence for a claim might require aggregating information from multiple sentences and even multiple documents (cf. #3 in Table 4). Therefore, a set, rather than a collection of independent text snippets, should be chosen to act as evidence. Finally, unlike in TE, given a set of evidence sentences as a premise, the truth value of the claim should depend on all of the evidence, rather than on a single sentence in it.
The discussion above suggests that claim verification and evidence identification are tightly coupled. The claim should influence the identification of appropriate evidence, and "trusted evidence boosts the claim's veracity" (Vydiswaran et al., 2011). Consequently, we propose TWOWINGOS, a two-wing optimization strategy, to support this process. As shown in Figure 2, we consider a set of sentences $S$ as the candidate evidence space, a claim $x$, and a decision space $Y$ for the claim verification. In the optimal condition, a one-hot vector over $Y$ indicates which decision to make towards the claim, and a binary vector over $S$ indicates a subset of sentences $S_e$ (in blue in Figure 2) to act as evidence.
Prior work mostly approached this problem as a pipeline procedure: first, given a claim $x$, determine $S_e$ by some similarity matching; then, conduct textual entailment over $(S_e, x)$ pairs. Our framework, TWOWINGOS, optimizes the two subtasks jointly, so that claim verification and evidence identification can enhance each other. TWOWINGOS is a generic framework that uses a shared representation of the claim to co-train evidence identification and claim verification.
TWOWINGOS is tested on the FEVER benchmark (Thorne et al., 2018), showing ≈30% F1 improvement for evidence identification, and ≈23% accuracy increase in claim verification. Our analysis shows that (i) entity mentions in claims provide a strong clue for retrieving relevant passages; (ii) composition of evidence clues across sentences helps claim verification; and (iii) the joint training scheme provides significant benefits over a pipeline architecture.

Related Work
Most work focuses on dataset construction while lacking advanced models to handle the problem. Vlachos and Riedel (2014) propose and define the "fact checking" problem, without a concrete solution. Ferreira and Vlachos (2016) release the dataset "Emergent" for rumor debunking. Each claim is accompanied by an article headline as evidence, and a three-way logistic regression model is then used over some rule-based features; there is no need to search for evidence. Wang (2017) releases a larger dataset for fake news detection, and proposes a hybrid neural network that integrates the statement and the speaker's metadata for classification. However, the presentation of evidence is ignored. Kobayashi et al. (2017) release a dataset similar to that of Thorne et al. (2018), but they do not consider the evaluation of evidence reasoning.
Some work mainly pays attention to determining whether the claim is true or false, either assuming that evidence facts are provided or neglecting the presentation of evidence entirely, e.g., Angeli and Manning (2014): given a database of true facts as premises, they predict by natural logic inference whether an unseen fact is true and should belong to the database. Open-domain question answering (QA) against a text corpus (Yin et al., 2016; Chen et al., 2017; Wang et al., 2018) can also be treated as a claim verification problem, if we treat (question, correct answer) as a claim. However, little work has studied how well a QA system can identify all the answer evidence.
Only a few works have considered improving the evidence presentation in claim verification problems. Roth et al. (2009) introduce the task of Entailed Relation Recognition: given a set of short paragraphs and a relational fact in the triple form of (argument 1, relation, argument 2), find the paragraphs that can entail this fact. They first use Expanded Lexical Retrieval to rank and keep the top-k paragraphs as candidates, then build a TE classifier over each (candidate, statement) pair. The work most directly related to ours is by Thorne et al. (2018). Given claims and a set of Wikipages, Thorne et al. (2018) use a retrieval model based on TF-IDF to locate top-5 sentences in top-5 pages as evidence, then utilize a neural entailment model to classify (evidence, claim) pairs.
In contrast, our work tries to optimize the claim verification as well as the evidence identification in a joint training scheme, which is more than just supporting or refuting the claims.
Figure 2 illustrates the two-wing optimization problem addressed in this work: given a collection of evidence candidates $S = \{s_1, s_2, \cdots, s_i, \cdots, s_m\}$, a claim $x$ and a decision set $Y = \{y_1, \cdots, y_n\}$, the model TWOWINGOS predicts a binary vector $p$ over $S$ and a one-hot vector $o$ over $Y$ against the ground truth, a binary vector $q$ and a one-hot vector $z$, respectively. A binary vector over $S$ means that a subset of sentences ($S_e$) acts as evidence, and the one-hot vector indicates a single decision ($y_i$) to be made towards the claim $x$ given the evidence $S_e$. Next, we use two separate subsections to elaborate the process of evidence identification (i.e., optimize $p$ to $q$) and the claim verification (i.e., optimize $o$ to $z$).

Evidence identification
A simple approach to identifying evidence is to detect the top-k sentences that are lexically similar to the claim, as some pipeline systems (Roth et al., 2009; Thorne et al., 2018) do. However, a claim-unaware fixed k is suboptimal: it either adds noise or misses key supporting factors, and consequently limits the performance.
In this work, we approach the evidence by modeling sentences S={s 1 , · · · , s i , · · · , s m } with the claim x as context in a supervised learning scheme. For each s i , the problem turns out to be learning a probability: how likely s i can entail the claim conditioned on other candidates as context, as shown by the blue items in Figure 2.
To start, a piece of text $t$ ($t \in S \cup \{x\}$) is represented as a sequence of $l$ hidden states, forming a feature map $T \in \mathbb{R}^{d \times l}$, where $d$ is the dimensionality of hidden states. We first stack a vanilla CNN (convolution & max-pooling) (LeCun et al., 1998) over $T$ to get a representation for $t$. As a result, each evidence candidate $s_i$ has a representation $\mathbf{s}_i$, and the claim $x$ has a representation $\mathbf{x}$. To get a probability for each $s_i$, we first need to build its claim-aware representation $r_i$.
Coarse-grained representation. We directly concatenate the representations of $s_i$ and $x$, generated by the vanilla CNN, to form $r_i$ (Equation 1). This coarse-grained approach makes use of merely the sentence-level representations while neglecting more fine-grained interactions between the sentences and the claim.
Fine-grained representation. Instead of directly employing the sentence-level representations, here we explore claim-aware representations for each word in sentence $s_i$, then compose them as the sentence representation $r_i$, inspired by the Attentive Convolution (Yin and Schütze, 2017). For each word $s_i^j$ in $s_i$, we first calculate its matching score towards each word $x^z$ in $x$, by dot product over their hidden states. Then the representation of the claim, as the context for the word $s_i^j$, is formed from these matching scores as $c_i^j$ (Equation 2). Now, word $s_i^j$ has left context $s_i^{j-1}$ and right context $s_i^{j+1}$ in $s_i$, and the claim-aware context $c_i^j$ from $x$. A convolution encoder generates its claim-aware representation $\mathbf{i}_i^j$ (Equation 3). To compose those claim-aware word representations as the representation for sentence $s_i$, we use max-pooling over $\{\mathbf{i}_i^j\}$ along $j$, generating $\mathbf{i}_i$.
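The claim-aware context of a word can be illustrated with a minimal sketch. The text above only states that matching scores are dot products over hidden states; normalizing those scores with a softmax and taking the weighted average of the claim's hidden states is an assumption here, and the function names (`claim_context`, `max_pool`) are hypothetical. The convolution encoder itself is omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def claim_context(word_state, claim_states):
    """Context vector for one evidence word: claim hidden states averaged
    with weights derived from dot-product matching scores.
    ASSUMPTION: softmax normalization of the matching scores."""
    weights = softmax([dot(word_state, x_z) for x_z in claim_states])
    dim = len(word_state)
    return [sum(w * x_z[k] for w, x_z in zip(weights, claim_states))
            for k in range(dim)]

def max_pool(vectors):
    """Element-wise max over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]
```

In this sketch, a claim word whose hidden state aligns strongly with the evidence word dominates the context vector, which is the intended effect of attending over the claim.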
We use the term $f_{int}(s_i, x)$ to denote this whole process, so that $\mathbf{i}_i = f_{int}(s_i, x)$ (Equation 4). At this point, we obtain the fine-grained representation $r_i$ for evidence candidate $s_i$ (Equation 5).
Loss function. With a claim-aware representation $r_i$, the sentence $s_i$ subsequently gets a probability of acting as evidence, $\alpha_i \in (0, 1)$, via a non-linear sigmoid function: $\alpha_i = \mathrm{sigmoid}(v \cdot r_i^{\top})$ (Equation 6), where the parameter vector $v$ has the same dimensionality as $r_i$.
In the end, all evidence candidates in $S$ have a ground-truth binary vector $q$ and the predicted probability vector $\alpha$; the loss $l_{ev}$ ("ev": evidence) is then implemented as a binary cross-entropy: $l_{ev} = -\sum_{i=1}^{m} \left( q_i \log \alpha_i + (1 - q_i) \log (1 - \alpha_i) \right)$ (Equation 7). As the output of this evidence identification module, we binarize the probability vector $\alpha$ into the binary vector $p$ (Equation 8).
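The scoring, loss and binarization steps of the evidence identification module can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the binarization threshold of 0.5 is an assumption (the rule is not visible in this text), the loss is summed rather than averaged, and all function names are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def evidence_probs(v, reps):
    """alpha_i = sigmoid(v . r_i) for every claim-aware representation r_i."""
    return [sigmoid(sum(a * b for a, b in zip(v, r))) for r in reps]

def bce_loss(q, alpha, eps=1e-12):
    """Binary cross-entropy between gold vector q and probabilities alpha.
    ASSUMPTION: summed over candidates; eps only guards against log(0)."""
    return -sum(qi * math.log(ai + eps) + (1 - qi) * math.log(1 - ai + eps)
                for qi, ai in zip(q, alpha))

def binarize(alpha, threshold=0.5):
    """Turn probabilities into the binary evidence vector p.
    ASSUMPTION: a fixed 0.5 threshold."""
    return [1 if a > threshold else 0 for a in alpha]
```

Because each candidate is thresholded independently, the number of selected sentences varies per claim, which is how a flexible-size evidence set arises in contrast to a fixed top-k.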

Claim verification
As shown in Figure 2, to figure out an entailment decision y i for the claim x, the evidence S e possibly consists of more than one sentence. Furthermore, those evidence sentences are not necessarily in textual order nor from the same passage. So, we need a mechanism that enables each evidence or even each word inside to be aware of the content from other evidence sentences. Similar to the aforementioned approach to evidence identification, we come up with three methods, with different representation granularity, to learn a representation for (S e , x), i.e., the input for claim verification, shown in Figure 3.
Coarse-grained representation. In this case, we treat $S_e$ as a whole, constructing its representation $\mathbf{e}$ as a weighted sum of the representations of all sentences in $S_e$, where the weight $\alpha_i$, from Equation 6, is the probability of $s_i$ being evidence. Then the $(S_e, x)$ pair gets a coarse-grained concatenated representation: $[\mathbf{e}, \mathbf{x}]$. It models neither the interactions within the evidence nor the interactions between the evidence and the claim. Based on our experience with the evidence identification module, the representation of a sentence is better learned by composing context-aware word-level representations. Next, we introduce how to learn a fine-grained representation for the $(S_e, x)$ pair.
Single-channel fine-grained representation. By "single-channel," we mean each sentence s i is aware of the claim x as its single context.
For a single pair $(s_i, x)$, we utilize the function $f_{int}()$ in Equation 4 to build the fine-grained representations for both $s_i$ and $x$, obtaining $\mathbf{i}_i$ and $\mathbf{x}_i$, respectively. For $(S_e, x)$, we compose all the $\{\mathbf{i}_i\}$ and all the $\{\mathbf{x}_i\}$ along $i$ via a weighted max-pooling. This weighted max-pooling ensures that the sentences with higher probabilities of being evidence have a higher chance to present their features. As a result, $(S_e, x)$ gets a concatenated representation: $[\mathbf{e}, \mathbf{x}]$.
Two-channel fine-grained representation. By "two-channel," we mean that each evidence sentence $s_i$ is aware of two kinds of context, one from the claim $x$, the other from the remaining evidence.
Our first step is to accumulate evidence clues within $S_e$. To start, we concatenate all sentences in $S_e$ as an artificial long sentence $\hat{S}$ consisting of hidden states $\{\hat{\mathbf{s}}\}$. Similar to Equation 2, for each word $s_i^j$ in sentence $s_i$, we accumulate all of its related clues ($c_i^j$) from $\hat{S}$. Then we update $\mathbf{s}_i^j$, the representation of word $s_i^j$, by element-wise addition of $\mathbf{s}_i^j$ and $c_i^j$. This step enables the word $s_i^j$ to "see" all related clues from $S_e$. Adding $\mathbf{s}_i^j$ and $c_i^j$ is motivated by a simple observation: assume the claim "Lily lives in the biggest city in Canada", where one sentence contains a clue "· · · Lily lives in Toronto · · · " and another sentence contains a clue "· · · Toronto is Canada's largest city· · · ". The simplest yet effective approach to aggregating the two clues is to sum up their representation vectors (Blacoe and Lapata, 2012) (we do not concatenate them, as those clues have no consistent textual order across different $s_i^j$). After updating the representation of each word in $s_i$, we perform the aforementioned "single-channel fine-grained representation" between the updated $s_i$ and the claim $x$, generating $[\mathbf{e}, \mathbf{x}]$.
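The three ways of composing evidence described in this subsection can be contrasted in a small sketch. The exact weighting formulas are not visible in this text, so scaling each sentence by $\alpha_i \cdot p_i$ (probability times the binary selection flag) is an assumption, as are the hypothetical function names.

```python
def weighted_sum(reps, alpha, p):
    """Coarse-grained: sum sentence vectors, each scaled by alpha_i * p_i.
    ASSUMPTION: unselected sentences (p_i = 0) contribute nothing."""
    dim = len(reps[0])
    return [sum(a * keep * r[k] for r, a, keep in zip(reps, alpha, p))
            for k in range(dim)]

def weighted_max_pool(reps, alpha, p):
    """Single-channel: scale each vector by alpha_i * p_i, then take the
    element-wise max, so confident sentences are more likely to win.
    Note that with negative features a zeroed-out sentence could still win."""
    scaled = [[a * keep * value for value in r]
              for r, a, keep in zip(reps, alpha, p)]
    return [max(column) for column in zip(*scaled)]

def add_clues(word_state, clue):
    """Two-channel: update a word's hidden state by element-wise addition
    of the clues accumulated from the other evidence sentences."""
    return [s + c for s, c in zip(word_state, clue)]
```

The sum treats the evidence set as a bag of sentences, while max-pooling lets a single confident sentence supply each feature; the clue addition is what lets "Lily lives in Toronto" and "Toronto is Canada's largest city" meet in one word representation.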
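Loss function. For the claim verification input $(S_e, x)$, we forward its representation $[\mathbf{e}, \mathbf{x}]$ to a classification layer that yields the prediction $o$ over $Y$, with parameters $W \in \mathbb{R}^{n \times 2d}$ and $b \in \mathbb{R}^{n}$. The loss $l_{cv}$ ("cv": claim verification) is implemented as the negative log-likelihood of the ground truth, where $z$ is the ground truth one-hot label vector for the claim $x$ on the space $Y$.
Table 1: Data statistics of FEVER.
        #SUPPORTED   #REFUTED   #NEI
train   80,035       29,775     35,639
dev     3,333        3,333      3,333
test    3,333        3,333      3,333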

Joint optimization
Given the loss $l_{ev}$ in evidence identification and the loss $l_{cv}$ in claim verification, the overall training loss combines both terms. To ensure that we jointly train the two coupled subtasks with intensive knowledge communication, instead of simply putting two pipeline neural networks together, TWOWINGOS has the following configurations:
• Both subsystems share the same set of word embeddings as parameters; the vanilla CNNs for learning sentence and claim representations share parameters as well.
• The output binary vector $p$ of the evidence identification module is forwarded to the claim verification module, as shown in Equations 8-10.
• Though the representation of a claim's decision $y_i$ is not put explicitly into the evidence identification module, the claim's representation $\mathbf{x}$ is fine-tuned by $y_i$. Since the claims are shared by the two modules, the evidence candidates can thus be adjusted by the decision $y_i$.
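How the two losses fit together can be sketched as follows. The combination rule is not visible in this text, so the unweighted sum below is an assumption; the softmax classification layer is likewise an assumption consistent with a negative log-likelihood loss, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll_loss(o, z):
    """Negative log-likelihood of the gold label.
    o: predicted distribution over Y; z: one-hot gold vector."""
    return -math.log(sum(oi * zi for oi, zi in zip(o, z)))

def joint_loss(l_ev, l_cv):
    """ASSUMPTION: the overall loss is the plain sum of both losses."""
    return l_ev + l_cv
```

With a single scalar objective, gradients from the claim verification loss reach the shared embeddings and CNN parameters that the evidence module also uses, which is the knowledge transfer the configurations above aim for.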

Setup
Dataset. In this work, we use FEVER (Thorne et al., 2018). The claims in FEVER were generated from the introductory parts of about 50K Wikipedia pages. To increase the claim complexity so that claims would not be trivially verified, annotators adopt two routes: (i) Providing additional knowledge: annotators can explore a dictionary of terms that were (hyper-)linked, along with their pages; (ii) Mutating claims in six ways: negation, paraphrasing, substitution of a relation/entity with a similar/dissimilar one, and making the claims more general/specific. The resulting claims have 9.4 tokens on average. Apart from claims, FEVER also provides a Wikipedia corpus with a size of about 5.4 million. Each claim is labeled as SUPPORTED, REFUTED or NOTENOUGHINFO (NEI). In addition, evidence sentences, from any wiki page, are required to be provided for SUPPORTED and REFUTED. Table 1 lists the data statistics. Figure 4 shows the distributions of sentence sizes and page sizes in FEVER's evidence set. We can see that roughly 28% of the evidence covers more than one sentence, and approximately 16.3% of the evidence covers more than one wiki page.
[Figure 4: distribution of #sentence and #page in evidence; x-axis from 1 to 10 and >10.]
This task has three evaluations: (i) NOSCOREEV: accuracy of claim verification, neglecting the validity of evidence; (ii) SCOREEV: accuracy of claim verification with a requirement that the predicted evidence fully covers the gold evidence for SUPPORTED and REFUTED; (iii) F1: between the predicted evidence sentences and the ones chosen by annotators. We use the officially released evaluation scorer.
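The three measures can be illustrated with a simplified sketch. This is not the official scorer: it assumes a single gold evidence set per claim, represents an evidence sentence as a (page, sentence index) pair, and uses hypothetical dictionary keys `label` and `evidence`.

```python
def no_score_ev(preds, golds):
    """Label accuracy, ignoring evidence."""
    correct = sum(p["label"] == g["label"] for p, g in zip(preds, golds))
    return correct / len(golds)

def score_ev(preds, golds):
    """Label accuracy where SUPPORTED/REFUTED predictions also need the
    predicted evidence to fully cover the gold evidence."""
    hits = 0
    for p, g in zip(preds, golds):
        if p["label"] != g["label"]:
            continue
        if g["label"] == "NEI" or set(g["evidence"]) <= set(p["evidence"]):
            hits += 1
    return hits / len(golds)

def evidence_f1(pred_ev, gold_ev):
    """F1 between predicted and gold evidence sentences for one claim."""
    pred, gold = set(pred_ev), set(gold_ev)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Note that `score_ev` only checks coverage, not precision, so predicting every candidate as evidence would make it equal to `no_score_ev`; the F1 measure is what penalizes such over-prediction.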
Wiki page retrieval. For each claim, we search in the given dictionary of wiki pages in the form of {title: sentence list}, and keep the top-5 ranked pages for fair comparison with Thorne et al. (2018). Algorithm 1 briefly shows the steps of wiki page retrieval. To speed up the search, we first build an inverted index from words to titles; then, for each claim, we only search in the titles that cover at least one claim word.
Algorithm 1: Algorithm description of wiki page retrieval for FEVER claims.
Input: A claim, wiki = {title: page vocab}
Output: A ranked list of top-k wiki titles
Generate entity mentions from the claim;
for each title do
    if claim.vocab ∩ title.vocab is empty then
        discard this title
    else
        title_score = the max recall value of title.vocab in the claim and in the entity mentions of the claim;
        if title_score = 1.0 then
            title.score = title_score
        else
            page_score = recall of the claim in the page vocab;
            title.score = title_score + page_score
        end
    end
end
Sort titles by title.score in descending order
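Algorithm 1 can be sketched in Python as follows. Several details are assumptions made for illustration: entity mentions are passed in as token lists (the mention detector is not specified here), titles are tokenized by lowercasing and splitting on whitespace, "recall of A in B" is read as the fraction of A's words that occur in B, and the inverted-index speed-up is omitted.

```python
def recall(words, reference):
    """Fraction of `words` that also occur in `reference`."""
    words = set(words)
    return len(words & set(reference)) / len(words) if words else 0.0

def retrieve_pages(claim_tokens, mentions, wiki, k=5):
    """Rank wiki titles for a claim.
    claim_tokens: lowercased claim words.
    mentions: entity mentions of the claim, each a list of lowercased words
              (ASSUMPTION: produced by some external mention detector).
    wiki: dict mapping title -> set of page vocabulary words."""
    claim_vocab = set(claim_tokens)
    scored = []
    for title, page_vocab in wiki.items():
        title_vocab = set(title.lower().split())
        if not claim_vocab & title_vocab:
            continue  # discard titles sharing no word with the claim
        title_score = max([recall(title_vocab, claim_vocab)] +
                          [recall(title_vocab, m) for m in mentions])
        if title_score == 1.0:
            score = title_score
        else:
            # As in the listing, the sum may exceed 1.0.
            score = title_score + recall(claim_vocab, page_vocab)
        scored.append((score, title))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:k]]
```

For example, with a claim about Telemundo, a page titled "telemundo" matches the claim fully and receives score 1.0, while a page titled "nbc network" that shares only "network" with the claim is scored by its partial title match plus the claim's recall in the page body.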
All sentences of the top-5 retrieved wiki pages are kept as evidence candidates for claims in train, dev and test. It is worth mentioning that this page retrieval step is a reasonable preprocessing step that controls the complexity of evidence search in real-world settings, such as the large search space (5.4 million) in this work.
Baselines. In this work, we first consider the two systems reported by Thorne et al. (2018): (i) MLP: A multi-layer perceptron with one hidden layer, based on TF-IDF cosine similarity between the claim and the evidence (all evidence sentences are concatenated as a longer text piece) (Riedel et al., 2017); (ii) Decomp-Att (Parikh et al., 2016): A decomposable attention model that develops attention mechanisms to decompose the problem into subproblems to solve in parallel. Note that both systems first employed an IR system to keep top-5 relevant sentences from the retrieved top-5 wiki pages as static evidence for claims.
Table 2: Wikipage retrieval evaluation on dev. "rate": the proportion of claims (e.g., x%) whose gold passages are fully retrieved (for "SUPPORT" and "REFUTE" only); "acc ceiling": $\frac{x\% \cdot (\#S + \#R) + \#N}{\#S + \#R + \#N}$, the upper bound of accuracy over the three classes when the coverage is x%.
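The "acc ceiling" formula from the Table 2 caption can be worked through numerically. The sketch below plugs in the dev class sizes from Table 1 (3,333 per class) and the two coverage rates reported for the retrievers (89.63% and 55.30%); the function name `acc_ceiling` is hypothetical.

```python
def acc_ceiling(rate, n_supported, n_refuted, n_nei):
    """Upper bound on three-class accuracy when only a fraction `rate` of
    SUPPORTED/REFUTED claims have their gold passages fully retrieved.
    NEI claims need no evidence, so all of them remain answerable."""
    covered = rate * (n_supported + n_refuted)
    return (covered + n_nei) / (n_supported + n_refuted + n_nei)

for rate in (0.8963, 0.5530):
    ceiling = acc_ceiling(rate, 3333, 3333, 3333)
    print(f"rate={rate:.2%} -> acc ceiling={ceiling:.2%}")
```

Under these inputs the ceilings come out at roughly 93.09% and 70.20%, which shows how much page retrieval coverage alone limits downstream claim verification.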
We further consider the following variants of our own system TWOWINGOS:
• Coarse-coarse: Both evidence identification and claim verification adopt coarse-grained representations.
To further study our system, we test this "coarse-coarse" variant in three setups: (i) "pipeline": train the two modules independently, then forward the predicted evidence to the entailment module for claims; (ii) "diff-CNN": joint training with separate CNN parameters to learn sentence/claim representations; (iii) "share-CNN": joint training with shared CNN parameters.
The following variants are in joint training.
• Fine&sentence-wise: Given the evidence with multiple sentences, a natural baseline is to do entailment reasoning for each (sentence, claim) pair, then compose. We do entailment reasoning between each predicted evidence sentence and the claim, generating a probability distribution over the label space $Y$. Then we sum up all the distribution vectors element-wise, as an ensemble system, to predict the label;
• Four combinations of different grained representation learning: "coarse&fine(single)", "coarse&fine(two)", "fine&coarse" and "fine&fine(two)".
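The sentence-wise ensemble in the variant list above can be sketched in a few lines. The label order and the fallback for an empty evidence set are assumptions made for illustration, and `ensemble_label` is a hypothetical name.

```python
LABELS = ("SUPPORTED", "REFUTED", "NEI")

def ensemble_label(distributions, labels=LABELS, fallback="NEI"):
    """Sum per-sentence label distributions element-wise and return the
    label with the largest total.
    ASSUMPTION: with no predicted evidence, fall back to NEI."""
    if not distributions:
        return fallback
    totals = [sum(column) for column in zip(*distributions)]
    best = max(range(len(totals)), key=totals.__getitem__)
    return labels[best]

# Two of three sentences lean SUPPORTED, but the summed mass favors REFUTED.
print(ensemble_label([[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.5, 0.4, 0.1]]))
```

Summing distributions differs from majority voting: one sentence that refutes the claim with high confidence can outweigh several weakly supportive ones, as the example call shows.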
"Single" and "two" refer to the single/two-channel cases respectively.

Results
Performance of passage retrieval. Table 2 compares our wikipage retriever with the one in (Thorne et al., 2018), which used a document retriever from DrQA (Chen et al., 2017).
Our document retrieval module surpasses the competitor by a big margin in terms of the coverage of gold passages: 89.63% vs. 55.30% (k = 5 in all experiments). Its strength should be attributed to: (i) entity mention detection in the claims; (ii) as wiki titles are entities, we have a bi-channel way to match the claim with the wiki page: one with the title, the other with the page body, as shown in Algorithm 1.

Performance on FEVER
Table 3 lists the performances of baselines and the TWOWINGOS variants on FEVER (dev&test). From the dev block, we observe that:
• TWOWINGOS (from "share-CNN") surpasses prior systems by large margins. Overall, fine-grained schemes in each subtask contribute more than the coarse-grained counterparts;
• In the three setups of coarse-coarse ("pipeline", "diff-CNN" and "share-CNN"), "pipeline" gets better scores than (Thorne et al., 2018) in terms of evidence identification. "Share-CNN" has comparable F1 to "diff-CNN" while gaining a lot on NOSCOREEV (72.32 vs. 39.22) and SCOREEV (50.12 vs. 21.04). This clearly shows that the claim verification gains much knowledge transferred from the evidence identification module. Both "diff-CNN" and "share-CNN" perform better than "pipeline" (except for the slight inferiority at SCOREEV: 21.04 vs. 22.26).
• Two-channel fine-grained representations prove more effective than the single-channel counterpart in claim verification (NOSCOREEV: 78.77 vs. 75.65, SCOREEV: 53.64 vs. 52.65). As we expected, evidence sentences should collaborate in inferring the truth value of the claims. The two-channel setup enables an evidence candidate to be aware of other candidates as well as the claim.
• In the last three rows of dev, there is no clear difference among their evidence identification scores. Recall that "sent-wise" is essentially an ensemble system over each (sentence, claim) entailment result. "Coarse-grained", instead, first sums up all sentence representations, then performs (Σ(sentence), claim) reasoning. We can also treat this "sum up" as an ensemble. Their comparison shows that these two kinds of tricks do not make much difference.
Figure 5: Performance vs. #sentence in evidence. Our system has robust precisions. The overall performance NOSCOREEV is not influenced by the decreasing recall; this verifies the fact that the truth value of most claims can be determined by a single identified evidence sentence.
In both dev and test blocks, we can observe that our evidence identification module consistently obtains balanced recall and precision. In contrast, the pipeline system by Thorne et al. (2018) has much higher recall than precision (45.89 vs. 10.79). It is worth mentioning that the SCOREEV metric is highly influenced by the recall value, since SCOREEV is computed on the claim instances whose evidences are fully retrieved, regardless of the precision. So, ideally, a system can set all sentences as evidence, so that SCOREEV can be promoted to be equal to NOSCOREEV. Our system is more reliable in this perspective.
Performance vs. #sent. in evidence. Figure 5 shows the results of the five evaluation measures against different sizes of gold evidence sentences in test set. We observe that: (i) Our system has robust precisions across #sentence; however, the recall decreases. This is not that surprising, since the more ground-truth sentences in evidence, the harder it is to retrieve all of them; (ii) Due to the decrease in recall, the SCOREEV also gets influenced for bigger #sentence. Interestingly, high precision and worse recall in evidence with more sentences still make consistently strong overall performance, i.e., NOSCOREEV. This should be due to the fact that the majority (83.18% (Thorne et al., 2018)) of claims can be correctly entailed by a single ground truth sentence, even if any remaining ground truth sentences are unavailable.
(i.e., (Telemundo, 0) and (Telemundo, 4)) correctly; however, it falsely predicts the claim label. (Telemundo, 0): Telemundo is an American Spanish-language terrestrial television · · · . We can easily find that the keyword "Spanish-language" should refute the claim. However, both "Spanish-language" in this evidence and "English-language" in the claim are unknown tokens with randomly initialized embeddings. This hints that more careful data preprocessing may be helpful. In addition, to refute the claim, another clue comes from the combination of (Telemundo, 4) and (Hispanic and Latino Americans, 0). (Telemundo, 4): "The channel · · · aimed at Hispanic and Latino American audiences"; (Hispanic and Latino Americans, 0): "Hispanic Americans and Latino Americans · · · are descendants of people from countries of Latin America and Spain.". Our system only retrieved (Telemundo, 4). This clue is hard to grasp, as it requires some background knowledge: people from Latin America and Spain are usually not treated as English-speaking.
In case #2, our system fails to identify any evidence. This is due to the failure of our passage retrieval module: it detects the entity mentions "Home", "Holidays" and "American", and the top-5 retrieved passages are "Home", "Home for the Holidays", "American Home", "American" and "Home for the Holidays (song)", which unfortunately cover none of the four ground truth passages. Interestingly, (i) given the falsely retrieved passages, our system predicts "no sentence is valid evidence" (denoted as ∅ in Table 4); (ii) given the empty evidence, our system predicts NOTENOUGHINFO for this claim. Both make sense.
In case #3, a successful classification of the claim requires information aggregation over the three gold evidence sentences: (Weekly Idol, 0): "Weekly Idol is a South Korean variety show · · · "; (Weekly Idol, 1): "The show is hosted by comedian Jeong Hyeong-don and rapper Defconn."; (Defconn, 0): "Defconn (born Yoo Dae-joon; January 6, 1977) is a · · · ". Successfully retrieving the three sentences as a whole set of evidence is challenging in evidence identification. Additionally, this example relies on the recognition and matching of numbers (1983 vs. 1977), which is beyond the expressivity of word embeddings, and is expected to be handled more easily by rules.

Summary
In this work, we build TWOWINGOS, a two-wing optimization framework to address the claim verification problem by presenting precise evidence. Differing from a pipeline system, TWOWINGOS ensures that the evidence identification module and the claim verification module are trained jointly, in an end-to-end scheme. Experiments show the superiority of TWOWINGOS on the FEVER benchmark.