Hierarchical Evidence Set Modeling for Automated Fact Extraction and Verification

Automated fact extraction and verification is a challenging task that involves finding relevant evidence sentences from a reliable corpus to verify the truthfulness of a claim. Existing models either (i) concatenate all the evidence sentences, leading to the inclusion of redundant and noisy information; or (ii) process each claim-evidence sentence pair separately and aggregate all of them later, missing the early combination of related sentences for more accurate claim verification. Unlike the prior works, in this paper, we propose Hierarchical Evidence Set Modeling (HESM), a framework that extracts evidence sets (each of which may contain multiple evidence sentences) and verifies a claim as supported, refuted, or not enough info by encoding and attending to the claim and evidence sets at different levels of hierarchy. Our experimental results show that HESM outperforms 7 state-of-the-art methods for fact extraction and claim verification. Our source code is available at https://github.com/ShyamSubramanian/HESM.


Introduction
A study by Gabielkov et al. (2016) revealed that 60% of people on social media share news after reading just the title, without reading the actual content. Unfortunately, the rise of social media has further accelerated the communication and propagation of unverified information. To address the problem, our work focuses on the automated fact extraction and verification task, which requires retrieving the evidence related to a claim as well as verifying the claim based on that evidence. The task is challenging since it requires semantic understanding and reasoning to learn the subtleties that distinguish evidence that supports a claim from evidence that refutes it. The task's difficulty is further amplified for claims that require aggregating information from multiple evidence sentences in different documents.
Previous works in fact verification either operate by combining all the evidence sentences (Nie et al., 2019) or operate at the individual evidence sentence level and aggregate the results later (Yoneda et al., 2018; Hanselowski et al., 2018a). Combining all the sentences together may mix redundant, noisy, and irrelevant information with the relevant information, which makes claim verification more complicated in terms of identifying and learning the context of only the relevant sentences. On the other hand, processing each evidence sentence separately delays the combination of relevant sentences that belong to the same evidence set, for claims that require aggregating information from multiple sentences. It also makes claim verification harder since it summarizes information without complete context. Figure 1 depicts an example of an ideal verification system, which extracts evidence sets, processes them individually, and then aggregates them later. In the example, four evidence sentences are retrieved. Sentences that are relevant and hyperlinked are combined to form evidence sets (called Evidence Set [1] and Evidence Set [2] in the figure). Each evidence set verifies the claim individually, and then they are aggregated for the final verification.
Like Figure 1, our proposed framework also retrieves and combines evidence sentences into evidence sets in an iterative fashion. Then, it processes each evidence set individually to form a representation of the evidence set using word-level attention. Then, it combines information from all the evidence set representations using contextual and non-contextual aggregation methods, which use evidence set-level attention. The word-level attention, along with the evidence set-level attention, forms a hierarchical attention mechanism. Finally, our framework learns to verify the claim at different levels of hierarchy (i.e., at each evidence set-level and the aggregated evidence level).

Figure 1: An example of a claim, evidence sets, and verdict. The arrows represent the hierarchy of the fact extraction and verification process. The second sentence in each evidence set is retrieved from a document hyperlinked from the first sentence.
Our main contributions are as follows:
• We propose Hierarchical Evidence Set Modeling, which consists of a document retriever, a multi-hop evidence retriever, and a claim verification component.
• Our multi-hop evidence retriever retrieves evidence sentences and combines them into evidence sets. Our claim verification component conducts hierarchical verification based on each evidence set individually and then based on all the evidence sets.
• Our experimental results show that our model outperforms 7 state-of-the-art baselines in both evidence retrieval and claim verification.

Related Work
Several works exist in fact verification based on different forms of claim and evidence. Thorne and Vlachos (2017) and Vlachos and Riedel (2015) verify numerical claims using subject-predicate-object triples from a knowledge graph as evidence. Nakashole and Mitchell (2014) and Bast et al. (2017) verify subject-predicate-object triple based claims. Another line of work verifies textual claims based on evidence in a tabular format. Fact verification is studied in different natural language settings, namely Recognizing Textual Entailment (Dagan et al., 2005), Natural Language Inference (Bowman et al., 2015), and claim verification (Thorne et al., 2018a). A differently motivated but closely related problem is fact checking in journalism, also known as fake news detection (Ferreira and Vlachos, 2016; Wang, 2017). In this work, we focus on claim verification using the FEVER dataset (Thorne et al., 2018a) with textual claims and evidence.
Previous works on the fact extraction and claim verification task follow a three-stage pipeline that includes document retrieval, evidence sentence retrieval, and claim verification. Most previous works reuse the document retrieval component of top-performing systems (Hanselowski et al., 2018b; Yoneda et al., 2018; Nie et al., 2019) in the FEVER Shared Task 1.0 challenge (Thorne et al., 2018b).
The evidence sentence retrieval component in almost all previous works retrieves all the evidence sentences in a single iteration (Yoneda et al., 2018; Hanselowski et al., 2018b; Nie et al., 2019; Chen et al., 2017; Soleimani et al., 2020; Liu et al., 2020). Stammbach and Neumann (2019) use a multi-hop retrieval strategy with two iterations to retrieve evidence sentences that are conditioned on the retrieval of other evidence sentences. Then, they choose all the top-most relevant evidence sentences with the highest relevance scores and combine them. Our work follows a similar strategy, but differs from the prior work by combining only evidence sentences that belong to the same evidence set, and then processing each evidence set separately.
In the claim verification component, Nie et al. (2019), Yoneda et al. (2018), and Hanselowski et al. (2018b) use a modified ESIM model (Chen et al., 2017) for verification. Recent works (Soleimani et al., 2020; Zhou et al., 2019; Stammbach and Neumann, 2019) use the BERT model (Devlin et al., 2019) for claim verification. A few other works (Zhou et al., 2019; Liu et al., 2020) use graph-based models for fine-grained semantic reasoning. Different from the previous works, our model operates on claim-evidence set pairs instead of claim-evidence sentence pairs. Our model benefits from encoding, attending, and evaluating at different levels of hierarchy, as well as from both contextual and non-contextual aggregations of the evidence sets.

Problem Definition
Given a set of m textual documents and a claim c i , the problem is to find a set of evidence sentences Ê i = {s 1 , s 2 , ..., s |Ê i | } and classify the claim c i as ŷ i ∈ {S, R, NEI} (i.e., SUPPORTED, REFUTED, or NOT ENOUGH INFO). For a successful verification of the claim c i , there are two conditions: (1) Ê i should match at least one evidence set in the ground truth evidence sets E i , and (2) ŷ i should match the ground truth entailment label y i .
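The per-claim success criterion above can be sketched in a few lines. This is an illustrative check, not the official FEVER scorer; the function and label names are our own, and evidence is assumed to be represented as sets of sentence identifiers.

```python
def claim_verified(pred_evidence, pred_label, gold_evidence_sets, gold_label):
    """A claim counts as correct only if (1) the predicted evidence covers
    at least one full ground-truth evidence set and (2) the predicted label
    matches the gold label."""
    if pred_label != gold_label:
        return False
    # NOT ENOUGH INFO claims carry no evidence requirement.
    if gold_label == "NEI":
        return True
    return any(gold_set <= set(pred_evidence) for gold_set in gold_evidence_sets)

# Example: the predicted evidence covers the first gold set and the label matches.
ok = claim_verified(["s1", "s2", "s4"], "S", [{"s1", "s2"}, {"s3"}], "S")  # True
```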

Hierarchical Evidence Set Modeling
Our Hierarchical Evidence Set Modeling (HESM) framework consists of three components, namely a Document Retriever, a Multi-hop Evidence Retriever, and Claim Verification. Figure 2 shows an overview of our framework. The document retriever component retrieves the top K 1 documents that are relevant to the claim. The multi-hop retriever component retrieves the top K 2 relevant evidence sets from the K 1 retrieved documents in an iterative fashion. The claim verification component classifies the claim as SUPPORTS, REFUTES, or NOT ENOUGH INFO based on the retrieved evidence sets. Following prior works, we reuse the document retriever component from Nie et al. (2019), which works well in terms of relevant document retrieval. We mainly focus on and propose novel multi-hop evidence retriever and claim verification components.

Document Retriever
Document retrieval is the task of selecting documents related to a given claim. First, documents are selected by an exact match between their titles and a span of text of the claim. In particular, the CoreNLP toolkit (Manning et al., 2014) is used for retrieving text spans from the claim. To obtain more relevant documents, the same procedure is applied again after eliminating articles such as 'a', 'an', or 'the' from the claim, and once again after singularizing each word in the claim. For documents whose titles are ambiguous (e.g., "Savages (band)" and "Savages (2012 film)"), a semantic understanding strategy based on the Neural Semantic Matching Network (NSMN) (Nie et al., 2019) is used to rank the documents by comparing the first line of each document with the claim. Finally, only the top K 1 ranked documents are selected.

Multi-hop Evidence Retriever
According to statistics of the FEVER dataset (Thorne et al., 2018a), 16.82% of claims require multiple evidence sentences to verify their truthfulness, and 12.5% of claims have evidence sentences located across multiple documents. Based on this, we propose a multi-hop evidence retriever, an iterative retrieval mechanism with N iterations (hops). Our analysis of the FEVER dataset shows that almost all the evidence sentences are at most two hops away from a claim, and thus can be retrieved in two iterations. Hence, for this work, we set N to 2. We retrieve a maximum of K 2 evidence sets for each claim, and each evidence set contains a maximum of M s evidence sentences. With the recent success of Transformer-based (Vaswani et al., 2017) pre-trained models in NLP, we incorporate the ALBERT model (Lan et al., 2020) as a part of our multi-hop evidence retriever. ALBERT is a lightweight BERT-based model that is pre-trained on a large-scale English corpus for learning language representations.
In the first iteration, given a claim c i , each sentence j in the documents selected by the document retriever is concatenated with the claim c i as [[CLS]; c i ; [SEP ]; j] and fed as input to the ALBERT model. The representation of the [CLS] token is pooled and fed to a linear layer classifier to produce the two scores m + and m − for selecting and discarding the sentence, respectively. In Transformer-based models, the [CLS] token is considered a representation of the whole input. Then, a selection probability p(x = 1|c i , j) is calculated as a softmax normalization over the two scores. Only the top K 2 sentences with the highest m + scores and a probability score greater than a threshold th evi1 are selected.
In the second iteration, each of the K 2 evidence sentences from the first iteration is considered as an evidence set. In the FEVER dataset, for claims requiring multiple sentences for verification, most of the sentences missed in the first iteration of retrieval are found in documents hyperlinked from the sentences retrieved in the first iteration. Therefore, in the second iteration, the claim c i , each of the K 2 evidence sentences j, and each sentence k from the documents hyperlinked from sentence j are concatenated as [[CLS]; c i ; [SEP ]; j; k] and fed as input to the ALBERT model. Similar to the first iteration, two scores m + and m − , and a selection probability p(x = 1|c i , j, k) are obtained. Finally, for each evidence sentence j, a maximum of (M s − 1) sentences with the highest m + scores and a probability score greater than a threshold th evi2 are selected and added to the corresponding evidence set.
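The two-iteration procedure above can be sketched as follows. This is a hedged sketch, not our actual implementation: the ALBERT classifier is replaced by a pluggable `score` function returning a selection probability, and the pairing of a first-hop sentence with a hyperlinked candidate is simplified to string concatenation.

```python
def retrieve_evidence_sets(claim, candidates, hyperlinked, score,
                           K2=3, Ms=3, th_evi1=0.5, th_evi2=0.8):
    """Two-hop evidence set retrieval. `hyperlinked` maps a sentence to the
    sentences of its hyperlinked documents; `score` stands in for ALBERT."""
    # Iteration 1: score each candidate sentence paired with the claim,
    # keep the top K2 above the threshold th_evi1.
    scored = [(s, score(claim, s)) for s in candidates]
    first_hop = sorted([(s, p) for s, p in scored if p > th_evi1],
                       key=lambda x: x[1], reverse=True)[:K2]

    # Iteration 2: each first-hop sentence seeds an evidence set, extended
    # with up to Ms-1 hyperlinked sentences above the threshold th_evi2.
    evidence_sets = []
    for s, _ in first_hop:
        linked = [(k, score(claim, s + " " + k)) for k in hyperlinked.get(s, [])]
        extra = sorted([(k, p) for k, p in linked if p > th_evi2],
                       key=lambda x: x[1], reverse=True)[:Ms - 1]
        evidence_sets.append([s] + [k for k, _ in extra])
    return evidence_sets

# Demo with a deterministic stub scorer (illustrative probabilities only).
probs = {"a": 0.9, "b": 0.4, "c": 0.7, "a k1": 0.95, "a k2": 0.1, "c k3": 0.85}
stub_score = lambda claim, text: probs.get(text, 0.0)
sets = retrieve_evidence_sets("claim", ["a", "b", "c"],
                              {"a": ["k1", "k2"], "c": ["k3"]}, stub_score)
```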

Claim Verification
Claim verification is a three-way classification task to label the claim as SUPPORTED, REFUTED, or NOT ENOUGH INFO, based on the extracted evidence. Inspired by the Hierarchical Attention Network (Yang et al., 2016), we propose a neural network that combines evidence sets hierarchically. While Yang et al. (2016) use word-level and sentence-level attention to hierarchically combine words into sentences and sentences into a document, in this task, we use word-level and evidence set-level attention to hierarchically combine words and sentences into evidence sets, and evidence sets into an aggregated evidence representation. Different from Yang et al. (2016), we propose two ways of aggregating evidence sets. Also, we train each evidence set to be able to verify the claim individually. The model consists of two parts: (1) an Evidence Set Modeling Block that contains a word-level encoder and attention layers to model each evidence set based on its words and sentences; and (2) a Hierarchical Aggregator that contains an evidence set-level encoder and attention layers to combine multiple evidence sets.

Evidence Set Modeling Block
The Evidence Set Modeling Block in Figure 3 takes a claim c i and each evidence set e j as input and returns: (1) a sequence output u 1 , u 2 , ..., u T , the representation of each token in the sequence; (2) a pooled output p j , which can be considered a joint representation of the claim and the evidence set; (3) a summarized vector s j , which is also a joint representation of the claim and the evidence set, obtained using word-level attention; and (4) the logits l j from classifying the claim as SUPPORTS, REFUTES, or NOT ENOUGH INFO based on the evidence set e j .
Word Encoder. We use the ALBERT model as the word-level encoder. Let J be the number of evidence sets retrieved for the claim c i . First, all the sentences in an evidence set j are concatenated to form the evidence set sequence e j , where j ∈ [1, J]. Then, the claim c i and the evidence set sequence e j are concatenated as [[CLS]; c i ; [SEP ]; e j ; [SEP ]] to form the input sequence x j . The word embeddings X j ∈ R T ×d of the input sequence x j are obtained from the ALBERT embedding layer, where T denotes the number of tokens in the input sequence x j and d is the size of the word embedding. Then, the ALBERT model processes the input X j and produces a sequence output u 1 , u 2 , ..., u T , denoted by U j ∈ R T ×d , which consists of the representation of each token t in x j . The ALBERT model also contains a pooling layer that returns the vector representation p j of the [CLS] token, which is considered to be a representation of the whole sequence in Transformer-based models.
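As a toy illustration of the input construction above, the following sketch builds the [[CLS]; c i ; [SEP ]; e j ; [SEP ]] sequence, with a whitespace tokenizer standing in for the ALBERT SentencePiece tokenizer that would actually be used.

```python
def build_input(claim, evidence_set):
    """Concatenate an evidence set's sentences into e_j and wrap the claim
    and e_j with [CLS]/[SEP] markers (whitespace tokenization for brevity)."""
    e_j = " ".join(evidence_set)  # all sentences of the set, concatenated
    return ["[CLS]"] + claim.split() + ["[SEP]"] + e_j.split() + ["[SEP]"]

tokens = build_input("a b", ["c d", "e"])
# tokens == ["[CLS]", "a", "b", "[SEP]", "c", "d", "e", "[SEP]"]
```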
Attention Sum Block. Before describing word-level attention, we first describe the Attention Sum block, which is used in the word-level attention. The Attention Sum block in Figure 4 computes a weighted sum of all the value token vectors v 1 to v R , where the weights are calculated using attention between the input token vectors q 1 to q R and a trainable weight vector u q that is randomly initialized. Each vector q r is passed through a linear layer to get a hidden representation f r for each token r ∈ [1, R]. The hidden representation f r is then subjected to a dot product with the vector u q to form a scalar c r , which is the attention score for each q r . Then, softmax is computed over all the attention scores c 1 to c R to get an attention weight a r for each token r. Finally, the value token vectors v r are subjected to a weighted sum, with the attention probabilities from the softmax operation as weights, and the block returns the summarized vector s. The attention weights denote the importance of each token in the value vector sequence. The Attention Sum block is used in the following Word Attention and Hierarchical Aggregation components.
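The Attention Sum block can be sketched in plain Python as follows, with lists standing in for tensors and the linear layer written out as a weight matrix W and bias b (all variable names are illustrative; in the model, W, b, and u_q would be trainable).

```python
import math

def attention_sum(Q, V, W, b, u_q):
    """Q: R x d input vectors, V: R x d' value vectors.
    Returns s, the attention-weighted sum of the value vectors."""
    # Hidden representation f_r = W q_r + b for each input vector q_r.
    F = [[sum(W[i][k] * q[k] for k in range(len(q))) + b[i]
          for i in range(len(W))] for q in Q]
    # Attention score c_r = f_r . u_q.
    scores = [sum(f[i] * u_q[i] for i in range(len(u_q))) for f in F]
    # Softmax over the scores gives the attention weights a_r.
    m = max(scores)
    exps = [math.exp(c - m) for c in scores]
    Z = sum(exps)
    a = [e / Z for e in exps]
    # Summarized vector s = sum_r a_r * v_r.
    return [sum(a[r] * V[r][j] for r in range(len(V))) for j in range(len(V[0]))]

# With equal scores the weights are uniform, so s is the mean of the values.
s = attention_sum([[0.0, 0.0], [0.0, 0.0]],        # Q
                  [[2.0, 0.0], [4.0, 2.0]],        # V
                  [[1.0, 0.0], [0.0, 1.0]],        # W (identity)
                  [0.0, 0.0], [1.0, 0.0])          # b, u_q
```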
Word Attention. In the word-level attention component, the sequence output u t , where t ∈ [1, T ], of the evidence set j obtained from the Word Encoder is passed (as both the input q r and value v r vectors) through the Attention Sum block to obtain a summarized vector representation s j (denoted as s in the Attention Sum block), based on the importance of each word. s j is used in the Hierarchical Aggregator. The pooled output vector p j containing the representation of the [CLS] token from the Word Encoder is passed through a linear layer to obtain a three-way classification score l j (over the SUPPORTS, REFUTES, and NOT ENOUGH INFO classes) of the claim c i based on the evidence set e j . This classifier verifies the claim based on the evidence set.

Hierarchical Aggregation Modeling
The hierarchical aggregation component in Figure 5 takes the output of the Evidence Set Modeling block for all J evidence sets as input and produces the three-way classification score for the claim based on all the evidence sets. It consists of two types of aggregation, namely contextual and non-contextual aggregation. Both components compute evidence set-level attention to combine all the evidence sets, forming a hierarchy.
Non-contextual Evidence Set Aggregation. Non-contextual aggregation combines the logits l 1 , ..., l J of all the evidence sets to produce the aggregated verification logits l nc . The motivation behind non-contextual aggregation is that the majority of claims need only a single evidence sentence/evidence set for verification. Therefore, we aggregate the logits instead of contextually combining the evidence sets, which keeps the model from mixing context across evidence sets and from being distracted by sentences containing unnecessary information. The pooled outputs p 1 , ..., p J and the classification logits l 1 , ..., l J of all the evidence sets, from the Evidence Set Modeling block, are passed through the Attention Sum block to compute the aggregated representation of all the evidence sets. Here, the sequence of vectors p 1 , p 2 , ..., p J forms the input vectors of the Attention Sum block, and the logits l 1 , l 2 , ..., l J form the value vectors. Thus, it aggregates the logits of all evidence sets based on the importance of each evidence set:
l nc = ATTN SUM(p 1 , ..., p J ; l 1 , ..., l J )
Contextual Evidence Set Aggregation. Contextual aggregation combines the representation s j of each evidence set j with one another to produce the claim verification logits l c .
The motivation behind using contextual aggregation is that, even though we combine evidence sentences into evidence sets through the multi-hop retriever, our extracted evidence sets might not be completely accurate for some claims (i.e., some evidence sentences that belong to the same ground truth evidence set might be distributed across our extracted multiple evidence sets). Therefore, we combine the evidence sets contextually to overcome the limitation. Let S ∈ R J×d denote the summarized representations s 1 , s 2 , ..., s J of all the evidence sets [1, J]. S is passed through a Transformer encoder, in order to obtain contextual representations m 1 , m 2 , ..., m J denoted by M ∈ R J×d . Here, the Transformer encoder layer ensures that the context from one evidence set is combined with other evidence sets. Then, the evidence set representations m j , where j ∈ [1, J], from the encoder are passed (as both the input q r and value v r vectors) through the Attention Sum block to obtain an aggregated vector representation k of all the evidence sets. Finally, the vector representation k is fed into a linear layer classifier to obtain the three way classification logits l c of the claim.
Aggregated Logits. The aggregated logits are computed as a weighted combination of the scores from the contextual and non-contextual aggregations, where β 1 and β 2 are trainable weights that denote the importance of each aggregation:
l = β 1 l c + β 2 l nc    (13)

Training Loss and Inference
The three-way classification logits l j from the Evidence Set Modeling block for each evidence set j are subjected to a cross-entropy loss. The losses from all evidence sets are averaged to get an aggregated loss L esm . The aggregated classification logits l from the Hierarchical Aggregation Modeling block are subjected to a cross-entropy loss L ham . The final loss is the sum of L esm and L ham . During inference, the aggregated logits l from the Hierarchical Aggregation Modeling block are used as the final three-way classification score of the claim verification. The label with the maximum score is selected as the final classification label.
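A minimal sketch of this objective and the inference rule, with softmax cross-entropy written out explicitly in plain Python (the helper names are ours, not from the paper's code):

```python
import math

def cross_entropy(logits, gold):
    """Softmax cross-entropy, computed via the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[gold]

def hesm_loss(per_set_logits, aggregated_logits, gold):
    """L_esm: mean cross-entropy over the per-evidence-set logits l_j;
    L_ham: cross-entropy on the aggregated logits l. Total = L_esm + L_ham."""
    L_esm = sum(cross_entropy(l, gold) for l in per_set_logits) / len(per_set_logits)
    L_ham = cross_entropy(aggregated_logits, gold)
    return L_esm + L_ham

def predict(aggregated_logits, labels=("S", "R", "NEI")):
    """Inference uses only the aggregated logits: argmax over the 3 classes."""
    return labels[max(range(len(aggregated_logits)),
                      key=aggregated_logits.__getitem__)]
```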

Experiment Setting
In this section, we describe the dataset, evaluation metrics, baselines, and implementation details in our experiments.
Dataset. We evaluate our framework HESM on the FEVER dataset, a large-scale fact verification dataset (Thorne et al., 2018a). The dataset consists of 185,445 claims with human-annotated evidence sentences from 5,416,537 documents. Each claim is labeled as SUPPORTS, REFUTES, or NOT ENOUGH INFO. The dataset consists of training, development, and test sets, as shown in Table 1. The training and development sets, along with their ground truth evidence and labels, are publicly available, but the ground truth evidence and labels of the test set are not. Instead, once a model's extracted evidence sets/sentences and predicted labels for the test set are submitted to the online evaluation system 1 , its performance is measured and displayed there. In this work, we train and tune our hyper-parameters on the training and development sets, respectively.
Baselines. We compare our model against 7 state-of-the-art baselines, including the top-performing models from FEVER Shared Task 1.0 (Nie et al., 2019; Hanselowski et al., 2018a; Yoneda et al., 2018), BERT-based models (Soleimani et al., 2020; Stammbach and Neumann, 2019; Zhou et al., 2019), and a graph-based model (Liu et al., 2020). Although we compare ours against all of them, the BERT-based models are our major baselines since we use ALBERT, which is a lightweight BERT-based model. The detailed description of the baselines is presented in the Appendix.
Evaluation Metrics. The official evaluation metrics of the FEVER dataset are Label Accuracy (LA) and FEVER score. Label Accuracy is the three-way classification accuracy for the labels SUPPORTS, REFUTES, and NOT ENOUGH INFO, regardless of the retrieved evidence. The FEVER score considers a claim to be correctly classified only if the retrieved evidence set matches at least one of the ground truth evidence sets along with the correct label. Between the two metrics, the FEVER score is considered the more important evaluation metric because it considers both correct evidence retrieval and correct label prediction.
For evidence retrieval performance evaluation, recall and OFEVER are reported, since these two scores matter for the claim verification process. Note that OFEVER is the oracle FEVER score, calculated assuming that the claim verification component has 100% accuracy. As formulated by Thorne et al. (2018a), a maximum of 5 evidence sentences are extracted to calculate evidence retrieval performance. For our model's evaluation, we assign the score of the evidence sentences retrieved in the first iteration to their corresponding evidence sets. Then, we sort the evidence sets by their assigned scores and select at most 5 sentences from the evidence sets in that sorted order.
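The evaluation-time selection described above can be sketched as follows (names are illustrative; each evidence set is assumed to carry the first-iteration score of its seed sentence):

```python
def top5_sentences(evidence_sets, set_scores, cap=5):
    """Sort evidence sets by their assigned scores, then take sentences in
    that order until the 5-sentence evaluation cap is reached."""
    ranked = sorted(zip(evidence_sets, set_scores),
                    key=lambda x: x[1], reverse=True)
    out = []
    for ev_set, _ in ranked:
        for sent in ev_set:
            if len(out) < cap and sent not in out:
                out.append(sent)
    return out

picked = top5_sentences([["a", "b"], ["c", "d", "e"], ["f"]], [0.5, 0.9, 0.7])
# picked == ["c", "d", "e", "f", "a"]
```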
Implementation, Training, and Hyperparameter Tuning. We set the number of retrieved documents K 1 = 10, the number of iterations N = 2, the maximum number of sentences retrieved in the first iteration per claim K 2 = 3, the threshold probability th evi1 = 0.5, the maximum number of sentences in each evidence set M s = 3, and the threshold probability th evi2 = 0.8. Other detailed information is described in the Appendix.

Experimental Results and Analysis
Experiments are conducted to evaluate the performance of evidence retrieval, claim verification, and aggregation approaches. In addition, we conduct an ablation study. Only the claim verification experiment is conducted on the test set, since each baseline's officially evaluated results are reported in the FEVER leaderboard. In the other experiments and analysis, we use the development set, since the test set does not contain the ground truth evidence sets/sentences and claim class labels.

Table 3: Claim verification results on the FEVER test set.
Model                                   | LA(%) | FEVER(%)
UKP Athene (Hanselowski et al., 2018b)  | 65.46 | 61.58
UCL MRG (Yoneda et al., 2018)           | 67.62 | 62.52
UNC NLP (Nie et al., 2019)              | 68.21 | 64.21
BERT Pair (Zhou et al., 2019)           | 69.75 | 65.18
BERT Concat (Zhou et al., 2019)         | 71.01 | 65.64
BERT (Base) (Soleimani et al., 2020)    | 70.67 | 68.50
GEAR (BERT Base) (Zhou et al., 2019)    | 71.60 | 67.10
KGAT (BERT Base) (Liu et al., 2020)     | 72    |

Multi-hop evidence retrieval
As shown in Table 2, we compare the performance of our model with two baselines, UNC NLP (Nie et al., 2019) and the BERT-based model of Stammbach and Neumann (2019). UNC NLP uses an ESIM-based (Chen et al., 2017) model, and Stammbach and Neumann (2019) use a BERT-based model. Since most other previous works use either an ESIM-based or a BERT-based model for evidence retrieval, we compare with these two representative baselines (i.e., the results of the other 5 baselines in evidence retrieval would be similar to one of them). Our HESM with ALBERT Base outperforms the baselines, achieving 0.905 recall and a 93.70% OFEVER score. We can also notice that the multi-hop evidence retrieval approaches (ours and Stammbach and Neumann (2019)) perform better than UNC NLP, which conducts a single iteration.

Claim Verification
In claim verification, our model outperforms all the baselines, achieving 74.64% label accuracy (LA) and a 71.48% FEVER score. In particular, our model performs much better than the top-performing models from FEVER Shared Task 1.0 (i.e., UKP Athene, UCL MRG, and UNC NLP). Compared with baselines using BERT Base, our HESM with ALBERT Base performs better. Likewise, compared with baselines using large language models, our model with ALBERT Large still performs better. This experimental result confirms that our model with ALBERT Large improves the FEVER score by 1.1% over the best baseline, KGAT with RoBERTa Large, indicating our model's capability of producing more correct label predictions and evidence extraction. The reason we chose ALBERT over BERT is that ALBERT consumes much less memory and is expected to have comparable performance to its BERT counterpart. Since the other models/baselines use BERT instead of ALBERT, for a fair comparison, we also include a result of our HESM model with BERT Base. Its performance is similar to that of HESM with ALBERT Base, which confirms that our framework matters more than the specific language model used.

Aggregation Analysis
We compare our hierarchical aggregation with different baseline aggregation methods. Table 4 shows the results of the aggregation analysis on the development set. Top-1 aggregation uses just the top-1 relevant evidence set to verify the claim. Logical aggregation classifies the claim as SUPPORTS or REFUTES if at least one of the evidence sets has the label SUPPORTS or REFUTES, respectively. In case both labels appear in the evidence sets, the label from the top-scoring evidence set is used to break the tie. If neither label appears in any of the evidence sets, we predict the claim as NOT ENOUGH INFO. MLP aggregation uses an MLP layer to aggregate the class label probabilities of all the evidence sets to get a final verification label. Concat aggregation concatenates all the sentences in all evidence sets into a single string to verify the claim. Attention-based aggregation is similar to the aggregation technique used in Hanselowski et al. (2018b), using attention between the claim and each evidence set to get the importance of each evidence set, and then combining them using max and mean pooling. Finally, our HESM model aggregates evidence sets using hierarchical aggregation. From the results, we can see that our HESM model outperforms all other aggregation methods.
Table 5 shows the label accuracy and FEVER score of our model after removing different components, including the evidence set-level loss L esm , and the contextual and non-contextual aggregations. All of the proposed components positively contribute to the performance of our framework.
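As an illustration, the Logical aggregation baseline above can be sketched as follows; breaking ties by the top-scoring set among those labeled SUPPORTS or REFUTES is our reading of the description, and the names are ours.

```python
def logical_aggregate(labels, scores):
    """Per-evidence-set labels vote: any S or R wins over NEI; an S/R tie is
    broken by the top-scoring set among those labeled S or R."""
    if "S" in labels and "R" in labels:
        sr = [i for i, lab in enumerate(labels) if lab in ("S", "R")]
        return labels[max(sr, key=lambda i: scores[i])]
    if "S" in labels:
        return "S"
    if "R" in labels:
        return "R"
    return "NEI"

verdict = logical_aggregate(["NEI", "S"], [0.9, 0.2])  # "S"
```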

Contextual and Non-contextual Aggregations
In this section, we study the performance of contextual and non-contextual aggregations in different aspects in the development set.
Label-wise performance. Figure 6 shows the performance of contextual and non-contextual aggregations with respect to the class labels. We use the logits l c and l nc to calculate the performance of contextual and non-contextual aggregations, respectively. In both label accuracy and FEVER score, contextual aggregation performs better at correctly verifying a claim when the relevant evidence either supports or refutes the claim, whereas non-contextual aggregation performs better for claims labeled NOT ENOUGH INFO.
Evidence-wise performance. Figure 7 shows the performance of contextual and non-contextual aggregations with respect to claims requiring a different number of evidence sentences for verification. Overall refers to all the claims; Single refers to claims requiring only a single evidence sentence for verification; Any refers to claims for which more than one ground truth evidence set exists, where some sets contain a single evidence sentence and some sets contain multiple evidence sentences; and Multi refers to claims that can be verified only with multiple sentences. Non-contextual aggregation performs better than contextual aggregation on claims requiring only a Single evidence sentence, whereas contextual aggregation performs better on claims requiring Any and Multi evidence sentences. The results make sense because contextual aggregation combines the context of multiple evidence sets, while non-contextual aggregation usually selects one of the evidence sets based on the attention mechanism.
Attention analysis. In Table 6 we show the weights β 1 and β 2 of the final model and also the evidence-set level attention accuracy. The attention weights can be seen as the importance of each aggregation. The attention weights show that both the aggregations are equally important (0.48 vs. 0.52).
The attention accuracy denotes the accuracy of the evidence set-level attention from the Attention Sum block in eq. (9) and eq. (12) of the non-contextual and contextual aggregations, respectively. It evaluates whether the retrieved evidence set from the multi-hop retriever that matches one of the ground truth evidence sets has the highest attention weight of all the retrieved evidence sets. In cases where the evidence sentences from the ground truth evidence set are distributed across multiple evidence sets retrieved by the multi-hop retriever, the attention is considered accurate if all the matching evidence sets have higher attention weights than the non-matching evidence sets. Here, we consider only the claims for which the retrieved evidence sentences match at least one ground truth evidence set. In other words, we omit the claims with the NOT ENOUGH INFO label and also the 6.3% of claims for which the multi-hop retriever cannot retrieve evidence sentences matching at least one ground truth evidence set, as shown in Table 2. The high attention accuracy for both contextual and non-contextual aggregation shows that our evidence set-level attention is highly capable of attending to the correct evidence sets.

Conclusion
In this paper, we have proposed the HESM framework for automated fact extraction and verification. HESM initially operates at the evidence set level and then combines information from all the evidence sets using hierarchical aggregation to verify the claim. Our experiments confirm that our hierarchical evidence set modeling outperforms 7 state-of-the-art baselines, producing more accurate claim verification. Our aggregation and ablation studies show that our hierarchical aggregation works better than many baseline aggregation methods. Our analysis of contextual and non-contextual aggregations illustrates that the two aggregations perform different roles and positively contribute to different aspects of fact verification.