Sentence-Level Evidence Embedding for Claim Verification with Hierarchical Attention Networks

Claim verification is the task of determining the veracity of a given claim, which is critical to many downstream applications. It is cumbersome and inefficient for human fact-checkers to find consistent pieces of evidence from which a solid verdict on the claim can be inferred. In this paper, we propose a novel end-to-end hierarchical attention network that learns to represent coherent evidence as well as its semantic relatedness to the claim. Our model consists of three main components: 1) a coherence-based attention layer that embeds coherent evidence considering the claim and sentences from relevant articles; 2) an entailment-based attention layer that, on top of the first attention, attends to sentences that can semantically infer the claim; and 3) an output layer that predicts the verdict based on the embedded evidence. Experimental results on three public benchmark datasets show that our proposed model outperforms a set of state-of-the-art baselines.


Introduction
The increasing popularity of social media has drastically changed how our daily news is produced, disseminated and consumed (the latest Pew Research statistics show that 68% of American adults at least occasionally get news on social media; see http://www.pewinternet.org/2018/03/01/social-media-use-in-2018/). Without systematic moderation, a large volume of information based on false or unverified claims (e.g., fake news, rumours, propaganda, etc.) can proliferate online. Such misinformation poses unprecedented challenges to information credibility, which traditionally relies on fact-checkers to manually assess whether specific claims are true or not.
Despite the increased demand, the effectiveness and efficiency of human fact-checking are handicapped by the volume and fast pace of the noteworthy claims being produced on a daily basis. Therefore, there is an urgent need to automate the process and ease the human burden of assessing the veracity of claims (Thorne and Vlachos, 2018).
Not surprisingly, various methods for automatic claim verification have been proposed using machine learning. Typically, given the claims, models are learned from auxiliary relevant sources such as news articles or social media responses to capture words and linguistic units that might indicate viewpoint or language style towards the claim (Rashkin et al., 2017; Popat et al., 2017; Dungs et al., 2018). However, the factuality of a claim is independent of people's beliefs and subjective language use, and human perception is unconsciously prone to misinformation due to common cognitive biases such as naive realism (Reed et al., 2013) and confirmation bias (Nickerson, 1998).
A recent trend is that researchers are trying to establish more objective tasks and evidence-based verification solutions, which focus on the use of evidence obtained from more reliable sources, e.g., encyclopedia articles, verified news, etc., as an important distinguishing factor (Thorne and Vlachos, 2018). Ferreira and Vlachos (2016) use news headlines as evidence to predict whether each headline is for, against or observing a claim. In the Fake News Challenge, the body text of an article is used as evidence to detect the stances relative to the claim made in the headline. Thorne et al. (2018a) formulate the Fact Extraction and VERification (FEVER) task, which requires extracting evidence from Wikipedia and synthesizing information from multiple documents to verify the claim. Popat et al. (2018) propose DeClarE, an evidence-aware neural attention model that aggregates salient words from source news articles as the main evidence to obtain a claim-specific representation based on the attention score of each token.
Inspired by the FEVER task (Thorne et al., 2018a) and DeClarE (Popat et al., 2018), we propose an approach to claim verification that uses representation learning to embed sentence-level evidence based on coherence modeling and natural language inference (NLI). The example in Table 1 illustrates our general idea: given a claim "The test of a 5G cellular network is the cause of unexplained bird deaths occurring in a park in The Hague, Netherlands" and its relevant articles, we try to embed into the claim-specific representation those evidential sentences (e.g., $s_1$-$s_4$) that are not only topically coherent among themselves considering the claim, but could also semantically infer the claim based on textual entailment relations such as entail, contradict, and neutral.
It is hypothesized that sentence-level evidence can convey more complete and deeper semantics, thus providing stronger NLI capacity between claim and evidence, which would result in a better claim-specific representation for a more accurate fact-checking decision.
In this work, we propose an end-to-end hierarchical attention network for sentence-level evidence embedding that aims to attend to important sentences (i.e., evidence) by considering their topical coherence and semantic inference strength. Different from DeClarE (Popat et al., 2018), our model can determine the verdict of a claim more reasonably with evidential sentences embedded into the learned claim representation. Meanwhile, with the help of attention, crucial evidence can be highlighted and referred to for better interpretability of the verdict. Our model is also advantageous over pipeline methods such as the Neural Semantic Matching Network (NSMN) (Nie et al., 2019), which topped the FEVER shared task (Thorne et al., 2018b), because our model can be trained to address evidence representation learning directly rather than rank and select sentences semantically similar to the claim. Our contributions are summarized as follows:
• We propose a novel claim verification framework based on hierarchical attention neural networks to learn sentence-level evidence embeddings to obtain a claim-specific representation.
• We use a co-attention mechanism to model sentence coherence and integrate the coherence- and entailment-based attentions into our proposed hierarchical attention framework for better evidence embedding.
• We experimentally confirm that our method is much more effective than several state-of-the-art claim verification models using three public benchmark datasets collected from snopes.com, politifact.com and Wikipedia.

Related Work
The literature on fact-checking and credibility assessment has been reviewed by several comprehensive surveys (Shu et al., 2017;Zubiaga et al., 2018;Kumar and Shah, 2018;Sharma et al., 2019). We only briefly review prior works closely related to ours. Many studies on claim verification extracted veracity-indicative features that can reflect stances and writing styles from relevant texts such as news articles, microblog posts, etc. and used the traditional supervised models to learn the parameters (Castillo et al., 2011;Qazvinian et al., 2011;Rubin et al., 2016;Ferreira and Vlachos, 2016;Rashkin et al., 2017). Deep learning models such as recurrent neural networks (RNN) (Ma et al., 2016), convolutional neural networks (CNN) (Wang, 2017) and recursive neural networks (Ma et al., 2018) were also exploited to learn the feature representations.
More recently, semantic matching methods were proposed to retrieve evidence from relatively trustworthy sources such as checked news and Wikipedia articles. Popat et al. (2018) attempted to debunk false claims by learning claim representations from relevant articles using an attention mechanism to focus on words that are closely related to the claim. Following NLI (Bowman et al., 2015), which is a task of classifying the relationship between a pair of sentences, composed by a premise and a hypothesis, as Entails, Contradicts or Neutral, Thorne et al. (2018a) formulated claim verification as a task that aims to classify claims into Supported, Refuted or Not Enough Info (NEI). They released a large dataset containing mutated claims based on relevant Wikipedia articles and developed a basic pipeline with document retrieval, sentence selection, and NLI modules. Similar pipelines were developed by most of the participating teams (Nie et al., 2019;Padia et al., 2018;Alhindi et al., 2018;Hanselowski et al., 2018) in FEVER shared task (Thorne et al., 2018b). Apart from the document retrieval function, our model is end-to-end and aims to learn sentence-level evidence with a hierarchical attention framework.
Attention is in general used to attend on the most important part of texts, and has been successfully applied in machine translation (Luong et al.), question answering (Xiong et al., 2016) and parsing (Dozat and Manning, 2016), and is adopted in our model for attending on important sentences as evidence. Our work is also related to coherence modeling. Different from traditional coherence studies focusing on discourse coherence among sentences that are widely applied in text generation (Park and Kim, 2015; Kiddon et al., 2016) and summarization (Logeswaran et al., 2018), we try to capture evidential sentences topically coherent not only among themselves but also with respect to the target claim.

Problem Statement
We define a claim verification dataset as $\{C\}$, where each instance $C = (y, c, S)$ is a tuple representing a given claim $c$ that is associated with a ground-truth label $y$ and a set of $n$ sentences $S = \{s_i\}_{i=1}^{n}$ from the relevant documents of the claim. We assume the relevant documents are retrieved from text collections containing a variable number of sentences, and we disregard the order of the sentences and which documents they come from. Our task is to classify an instance into a class defined by the specific dataset, such as veracity class labels, e.g., True/False, or NLI-style class labels, e.g., Supported/Refuted/NEI.
Our approach exploits and integrates two core semantic relations: 1) coherence of the sentences given the claim; 2) entailment relation between the claim and each sentence, which are described more specifically below.
Coherence Evaluation: According to the coherence theory of truth, the truth of any (true) proposition consists in its coherence with some specified set of propositions (Young, 2018). In order to focus on the useful evidence in a set of relevant sentences $S$, we propose a coherence-based attention component that cross-checks whether any sentence $s_i \in S$ coheres well with the claim and with the other sentences in $S$ in terms of topical consistency.
Textual Entailment: Entailment is used to measure whether a piece of evidence semantically infers a given claim. We propose an entailment-based attention component that can be pre-trained to capture entailment relations (Dagan et al., 2010; Bowman et al., 2015) based on sentence pairs labeled with NLI-specific classes: entails, contradicts and neutral. This pre-trained component is then trained end-to-end together with the entire claim verification framework to attend to the salient sentences for inferring the claim.

End-to-End Claim Verification Model
In this section, we introduce our end-to-end hierarchical attention network for claim verification, which consists of two attention layers, i.e., coherence-based attention and entailment-based attention, for learning evidence embeddings. Figure 1 gives an overview of our framework, which is described in detail in the following subsections.

Sentence Representation
Given a word sequence $T = (w_1 \ldots w_t \ldots w_{|T|})$, which could be either a claim or a sentence, each $w_t \in \mathbb{R}^{d}$ is a $d$-dimensional vector that can be initialized with pre-trained word embeddings. We map each $w_t$ into a fixed-size hidden vector using a standard GRU (Cho et al., 2014). We then obtain the sentence-level representation for a claim $c$ and each sentence $s_i \in S$ using two GRU-based RNNs:
$$h_c = h_{|c|} = \mathrm{GRU}(w_{|c|}, h_{|c|-1}, \theta_c), \qquad h_{s_i} = h_{|s_i|} = \mathrm{GRU}(w_{|s_i|}, h_{|s_i|-1}, \theta_S) \tag{1}$$
where $|\cdot|$ denotes the number of words, $w_{|c|}$ is the last word of $c$, $w_{|s_i|}$ is the last word of $s_i$, $\theta_c$ contains the claim encoder parameters, $\theta_S$ contains the sentence encoder parameters, and $h_c, h_{s_i} \in \mathbb{R}^{1 \times l}$ are $l$-dimensional vectors.
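To make the encoding step concrete, the following is a minimal plain-Python sketch of a GRU that reads a word sequence and returns its last hidden state as the sentence vector. It is an illustration under assumptions, not the implementation used in the experiments: the parameter dictionary keys (`Wz`, `Uz`, `bz`, and so on) are hypothetical names, and the interpolation convention between the previous state and the candidate state differs across GRU formulations.

```python
import math


def _affine(W, x, U, h, b):
    """Compute W x + U h + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + b_k
        for w_row, u_row, b_k in zip(W, U, b)
    ]


def _sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x, h, p):
    """One GRU step: update gate z, reset gate r, candidate state, interpolation."""
    z = [_sigmoid(v) for v in _affine(p["Wz"], x, p["Uz"], h, p["bz"])]
    r = [_sigmoid(v) for v in _affine(p["Wr"], x, p["Ur"], h, p["br"])]
    reset_h = [r_k * h_k for r_k, h_k in zip(r, h)]
    cand = [math.tanh(v) for v in _affine(p["Wh"], x, p["Uh"], reset_h, p["bh"])]
    # Assumed convention: keep a fraction z of the old state, (1 - z) of the candidate.
    return [z_k * h_k + (1.0 - z_k) * c_k for z_k, h_k, c_k in zip(z, h, cand)]


def encode(words, p, l):
    """Return the last hidden state (length l) for a list of word vectors."""
    h = [0.0] * l
    for x in words:
        h = gru_step(x, h, p)
    return h
```

In this sketch, a claim and a sentence would each be passed to `encode` with their own parameter dictionary, mirroring the two separate encoders described above.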

Coherence-based Evidence Attention
Our assumption is that sentences used as evidence should be topically coherent given a claim. For example, for the claim in Table 1, which is about the connection between a 5G test and bird deaths in a park in The Hague, the sentences $s_1$-$s_4$ are topically coherent because they specifically address the event's details, while $s_5$-$s_7$ are marginal: $s_6$ and $s_7$ diverge from the focus, and $s_5$ is too general a statement even though it might imply a possibility. Our model cross-checks all the sentences to capture the coherence among them using an attention mechanism. We consider the relation from two perspectives: 1) global coherence measures the consistency of each sentence with the entire set as a whole; and 2) local coherence measures the consistency of each sentence with another sentence. For each $s_i$, we use a biaffine attention (Dozat and Manning, 2016), which naturally fits our problem, to get the attention weights:
$$a_i = h_{s_i} W_c H_S^{\top} + (H_S u^{\top})^{\top}, \qquad \alpha_i = \mathrm{softmax}(a_i) \tag{2}$$
where $H_S = [h_{s_1}; \ldots; h_{s_n}] \in \mathbb{R}^{n \times l}$ is the matrix representing all sentences, and $W_c \in \mathbb{R}^{l \times l}$ and $u \in \mathbb{R}^{1 \times l}$ contain the weights of the biaffine transformation. The term $H_S u^{\top} \in \mathbb{R}^{n \times 1}$ denotes the global coherence, where each element is a prior probability of a sentence $s_j$ being coherent with any sentence in $S$; the term $h_{s_i} W_c H_S^{\top} \in \mathbb{R}^{1 \times n}$ denotes the local coherence between $s_i$ and each $s_j$; and $\alpha_i$ is an $n$-dimensional weight vector for $s_i$, where each element $\alpha_{ij}$ for $j \in [1, \ldots, n]$ denotes the coherence attention weight between $s_i$ and $s_j$.
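For illustration, the biaffine scoring described above can be sketched in plain Python. This is a minimal sketch, not the implementation used in the experiments; it assumes that the score of sentence $s_j$ for query sentence $s_i$ is a bilinear (local) term plus a per-sentence (global) term, normalized with a softmax over $j$. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def biaffine_coherence(H, W, u):
    """H: n sentence vectors of length l; W: l x l matrix; u: vector of length l.

    Returns an n x n list of lists whose row i holds the attention weights
    of sentence i over all sentences.
    """
    # Global term: one prior score per sentence s_j, shared by every query s_i.
    global_term = [dot(h_j, u) for h_j in H]
    weights = []
    for h_i in H:
        # Row vector h_i W, so that (h_i W) . h_j is the local (bilinear) term.
        h_i_W = [
            sum(h_i[m] * W[m][k] for m in range(len(h_i)))
            for k in range(len(W[0]))
        ]
        scores = [dot(h_i_W, h_j) + g_j for h_j, g_j in zip(H, global_term)]
        weights.append(softmax(scores))
    return weights
```

Each row of the returned matrix sums to one, so it can be used directly to form a weighted sum over the sentence vectors.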

Extension of Coherence Attention
The coherence attention in Eq. 2 ignores the claim information. To prevent off-topic coherence that deviates from the claim's focus, we propose to assess each sentence's coherence by jointly considering the claim and all sentences, which shares a similar intuition with the co-attention method in question answering (Xiong et al., 2016). Unlike question-answer co-attention, which focuses on the mutual selection of salient words in the question and documents, we focus on sentence-level attention, for which we have multiple sentences but only one claim. So, we only need a claim-guided sentence attention. We use a gating unit to endow the model with the capacity to decide how much information it should accept from the claim. The new attention weight of $s_i$ is computed by:
$$\bar{h}_{s_i} = g_{c \to s_i} \odot h_{s_i}, \qquad \bar{a}_i = \bar{h}_{s_i} W_c \bar{H}_S^{\top} + (\bar{H}_S u^{\top})^{\top}, \qquad \bar{\alpha}_i = \mathrm{softmax}(\bar{a}_i) \tag{3}$$
where $g_{c \to s_i} = \sigma(W_g \cdot h_{s_i} + U_g \cdot h_c)$ is the gate function with trainable parameters $W_g$ and $U_g$, $\bar{H}_S = [\bar{h}_{s_1}; \ldots; \bar{h}_{s_n}]$ denotes the stacked output of the gating unit, and the other settings are the same as in the biaffine coherence attention (see Eq. 2).
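The gating unit can be sketched as follows. This is an illustrative sketch only: the gate formula follows the definition of the gate function given above, while applying the gate to the sentence vector by elementwise multiplication is an assumption made here for illustration. The gated vectors would then be fed to the same biaffine scoring as in Eq. 2.

```python
import math


def _sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def claim_gate(h_s, h_c, W_g, U_g):
    """Gate sigma(W_g h_s + U_g h_c); W_g and U_g are l x l lists of lists."""
    pre = [
        sum(w * s for w, s in zip(w_row, h_s))
        + sum(u * c for u, c in zip(u_row, h_c))
        for w_row, u_row in zip(W_g, U_g)
    ]
    return [_sigmoid(v) for v in pre]


def gated_sentence(h_s, h_c, W_g, U_g):
    """Assumed gating: scale each dimension of the sentence vector by the gate."""
    gate = claim_gate(h_s, h_c, W_g, U_g)
    return [g_k * s_k for g_k, s_k in zip(gate, h_s)]
```

With all-zero gate parameters every gate value is 0.5, so each sentence vector is simply halved; trained parameters let the claim open or close individual dimensions.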
Based on the attention weights, each sentence can be represented as the weighted sum of all sentences, capturing its overall coherence:
$$\tilde{h}_{s_i} = \sum_{j=1}^{n} \alpha_{ij} \, h_{s_j} \tag{4}$$
where $\alpha_{ij}$ is the attention weight between $s_i$ and $s_j$ obtained from Eq. 2 ($\alpha_i$) or Eq. 3 ($\bar{\alpha}_i$). Finally, we concatenate the coherence-based sentence embedding $\tilde{h}_{s_i}$ with the original embedding $h_{s_i}$ to obtain a richer sentence representation:
$$\hat{h}_{s_i} = \tanh(W_{co} \cdot [\tilde{h}_{s_i}, h_{s_i}] + b_{co}) \tag{5}$$
where $W_{co}$ and $b_{co}$ are parameters for transforming the concatenation into an $l$-dimensional vector.
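As a small worked illustration of the weighted sum, the sketch below combines one row of attention weights with the sentence vectors and concatenates the result with the original vector. It deliberately omits the learned transformation back to $l$ dimensions, and the function names are hypothetical.

```python
def coherence_embedding(alpha_i, H):
    """Weighted sum of all sentence vectors in H using one row of attention weights."""
    l = len(H[0])
    return [sum(a * h_j[k] for a, h_j in zip(alpha_i, H)) for k in range(l)]


def concat_with_original(h_coherent, h_original):
    """Concatenate the coherence-based vector with the original sentence vector."""
    return list(h_coherent) + list(h_original)
```

For example, with two sentence vectors `[2.0, 0.0]` and `[0.0, 2.0]` and equal weights of 0.5, the weighted sum is `[1.0, 1.0]`, and the concatenation with the first sentence has length $2l = 4$.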

Entailment-based Evidence Attention
We further enhance the sentence representation by capturing the entailment relations between the sentences and the claim based on the NLI method (Bowman et al., 2015), which strengthens the semantic inference capacity of our model. Given $c$ and $s_i$, we represent each such pair by integrating three matching functions between $h_c$ and $\hat{h}_{s_i}$: 1) concatenation $[h_c, \hat{h}_{s_i}]$; 2) element-wise product $h_c \odot \hat{h}_{s_i}$; and 3) absolute element-wise difference $|h_c - \hat{h}_{s_i}|$. A similar matching scheme has commonly been used to train NLI models (Conneau et al., 2017; Mou et al., 2016; Liu et al., 2016). We then perform a transformation to obtain the joint representation $h_{cs_i}$ as follows:
$$h_{cs_i} = \tanh(W_e \cdot [h_c, \hat{h}_{s_i}, h_c \odot \hat{h}_{s_i}, |h_c - \hat{h}_{s_i}|]) \tag{6}$$
where $W_e$ contains the trainable weights for transforming the long concatenation into an $l$-dimensional vector. We omit the bias to avoid notational clutter.
To capture entailment-based evidence, we again apply attention over the original sentences, guided by the joint representation $h_{cs_i}$, which is obtained on top of the coherence attention. This yields:
$$b_i = \tanh(V_e \cdot h_{cs_i} + b_e), \qquad \beta_i = \frac{\exp(b_i)}{\sum_{j=1}^{n} \exp(b_j)}, \qquad h_{cS} = \sum_{i=1}^{n} \beta_i \, h_{s_i} \tag{7}$$
where $V_e$ and $b_e$ are parameters turning $h_{cs_i}$ into an entailment score $b_i$, and $\beta_i$ is the entailment-based attention weight of $s_i$, which is used to produce the final representation $h_{cS}$ of an entire instance. Note that the hierarchy of our attention structure is conveyed by the query part $h_{cs_i}$, and we apply the weight $\beta_i$ to the original representation $h_{s_i}$ rather than $\tilde{h}_{s_i}$ (Eq. 4) or $\hat{h}_{s_i}$ (Eq. 5). This is empirically better in our trials, since the latter two may contain more redundant information due to the sum over the entire set when computing $\tilde{h}_{s_i}$.
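The second attention layer can be sketched as follows. The sketch assumes that the joint claim-sentence vectors have already been computed, that each entailment score is a tanh-squashed linear function of the joint vector (which keeps scores in the [-1, 1] range), and that the weights are obtained by a softmax over sentences. Names are hypothetical and this is not the implementation used in the experiments.

```python
import math


def entailment_attention(joint, H, v_e, b_e):
    """joint: n joint claim-sentence vectors; H: n original sentence vectors;
    v_e: weight vector matching the joint vectors; b_e: scalar bias.

    Returns (beta, h_cS): attention weights and the weighted sum of H.
    """
    scores = [
        math.tanh(sum(v * x for v, x in zip(v_e, j_i)) + b_e) for j_i in joint
    ]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    beta = [e / total for e in exps]
    l = len(H[0])
    # The weights are applied to the ORIGINAL sentence vectors, as stated above.
    h_cS = [sum(b * h_i[k] for b, h_i in zip(beta, H)) for k in range(l)]
    return beta, h_cS
```

Because the weights `beta` are returned alongside the instance vector, the highest-weighted sentences can be shown to a human fact-checker as the evidence behind a verdict.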

The Overall Model
The attention vector $h_{cS}$ is the high-level representation of the claim with the embedded evidence based on the hierarchical attention method. We use a fully connected output layer to output the probability distribution over the veracity classes:
$$\hat{y} = \mathrm{softmax}(V_o \cdot h_{cS} + b_o) \tag{8}$$
where $V_o$ and $b_o$ are the weights and bias of the output layer. Note that Eq. 8 assumes that $h_{cS}$ alone can determine the veracity as true or false without direct reference to the claim again. This may be suitable for news data, as the salient news sentences often comment straightforwardly on the claim's veracity. However, some claim verification tasks such as FEVER (Thorne et al., 2018a) are specifically defined to classify whether factual evidence from a source like Wikipedia, which rarely remarks on the veracity of the mutated claim, can infer the claim as being supported, refuted or NEI. In such a case, we replace $h_{cS}$ in Eq. 8 with the richer representation $\hat{h}_{cS} = [h_c, h_{cS}, h_c \odot h_{cS}, |h_c - h_{cS}|]$ to facilitate the inference from the evidence to the claim in accordance with the NLI style of the task definition. Interestingly, such treatment does not work for veracity classification of news claims (see Section 5.2), which may be because the veracity features of news claims have already been embedded into $h_{cS}$, and the richer representation $\hat{h}_{cS}$ involving the claim could introduce unnecessary noise to a non-NLI type of task unlike FEVER.
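The two output-layer variants can be contrasted with a short sketch. It assumes a softmax over a linear layer, and the `nli_style` flag is a hypothetical switch introduced here to select between the evidence-only input and the richer claim-aware input; it is not a name taken from our implementation.

```python
import math


def output_input(h_c, h_cS, nli_style):
    """Build the output-layer input: h_cS alone, or the four-way matching vector."""
    if not nli_style:
        return list(h_cS)
    product = [a * b for a, b in zip(h_c, h_cS)]
    abs_diff = [abs(a - b) for a, b in zip(h_c, h_cS)]
    return list(h_c) + list(h_cS) + product + abs_diff


def predict(x, V_o, b_o):
    """Softmax over the linear scores V_o x + b_o (one row of V_o per class)."""
    logits = [sum(v * xi for v, xi in zip(row, x)) + b for row, b in zip(V_o, b_o)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Note that the NLI-style input is four times as long as the evidence-only input, so the output weight matrix must be sized accordingly.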
Before training our model end-to-end, we pre-train the coherence- and entailment-related parameters to avoid relying solely on the potentially limited supervision from the task-specific labels.

Pre-training Coherence Model
Without ground truth for learning the coherence model, we use a pair-wise training strategy to optimize a large-margin objective. For each claim $c$, we randomly choose another "negative" claim $c'$. Then we construct a tuple $(s, X^{+}, X^{-})$, where $X^{+} = (c, S)$ and $X^{-} = (c', S')$ are tuples consisting of different claims and their relevant article sentences, and $s \in S$ is a sentence selected randomly. Generally, $(s, X^{+})$ should exhibit higher topical coherence than $(s, X^{-})$, since the former reports the same claim $c$. We seek parameters that assign a higher score to $(s, X^{+})$ than to $(s, X^{-})$ by minimizing a margin-based ranking loss (Eq. 9) over the two scores $r(s, X^{+})$ and $r(s, X^{-})$. Here $r(\cdot, \cdot)$ is the ranking function (Eq. 10) that turns the coherence-based sentence embedding into a ranking score: it applies an added ranking output layer with weights $W_c$ and bias $b_c$ to $\mathrm{cohAtt}(\cdot, \cdot)$, which is shorthand for Eq. 4. This ranking output layer is not part of our end-to-end model. The pre-trained model is used to initialize all the parameters needed for computing Eq. 4.
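The pair-wise strategy can be illustrated with a short sketch. It shows one common form of a margin-based ranking (hinge) loss together with the sampling of a training tuple. The margin value of 1.0, the function names and the data layout (a list of `(claim, sentences)` pairs) are assumptions made for illustration and are not taken from our implementation.

```python
import random


def sample_tuple(instances, idx, rng):
    """instances: list of (claim, sentences) pairs; idx: index of the positive claim.

    Returns (s, X_pos, X_neg), where s is a random sentence of the positive
    claim and X_neg belongs to a different, randomly chosen claim.
    """
    _, sentences = instances[idx]
    neg_idx = rng.choice([j for j in range(len(instances)) if j != idx])
    s = rng.choice(sentences)
    return s, instances[idx], instances[neg_idx]


def margin_ranking_loss(pos_score, neg_score, margin=1.0):
    """Hinge loss: zero once the positive score beats the negative by the margin."""
    return max(0.0, margin - pos_score + neg_score)
```

The loss is zero whenever the positive pair already outscores the negative pair by at least the margin, so only violating tuples contribute gradients.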

Pre-training Entailment Model
We use the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015) to pre-train the parameters of the entailment-based attention model. Specifically, we train a model for Recognizing Textual Entailment (RTE) as follows:
$$\bar{y} = \mathrm{softmax}(V_e \cdot h_{RTE} + b_e) \tag{11}$$
where $\bar{y}$ is the entailment class label, i.e., entails, contradicts, or neutral; $h_{RTE}$ has the same form as Eq. 6, with the input claim-sentence pair replaced by a premise-hypothesis pair from the SNLI corpus (each element is encoded by a GRU sentence encoder); and $V_e$ and $b_e$ are the weights and bias of the RTE output layer, which is not part of our end-to-end model. The pre-trained model is used to initialize the parameters $W_e$ in Eq. 6. For pre-training, we minimize the squared loss between the distributions of the predicted and the ground-truth entailment classes.

Overall Training
After pre-training, all the model parameters are trained end-to-end by minimizing the squared error between the class probability distribution of the prediction and that of the ground truth over the claims. Parameters are updated through backpropagation (Collobert et al., 2011) with AdaGrad (Duchi et al., 2011) to speed up convergence. The training process ends when the model converges or the maximum number of epochs is reached. We represent input words using pre-trained GloVe Wikipedia 6B word embeddings (Pennington et al., 2014). We set $d$ to 300 for word vectors and $l$ to 100 for hidden units, and no parameter depends on $n$, which varies across claims.
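The training objective and the update rule can be sketched as follows. The sketch shows the squared error between a predicted class distribution and a one-hot ground truth, and a standard AdaGrad step; the learning rate and epsilon defaults are illustrative assumptions, not the values used in our experiments.

```python
import math


def squared_error(pred, target):
    """Sum of squared differences between two class distributions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def adagrad_step(params, grads, cache, lr=0.01, eps=1e-8):
    """In-place AdaGrad update: each parameter gets its own effective step size,
    scaled by the square root of its accumulated squared gradients."""
    for i, g in enumerate(grads):
        cache[i] += g * g
        params[i] -= lr * g / (math.sqrt(cache[i]) + eps)
    return params, cache
```

Because the accumulated cache only grows, frequently updated parameters take progressively smaller steps, which is the property that makes AdaGrad convenient for sparse inputs such as word embeddings.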

Datasets and Evaluation Metrics
We use three public fact-checking datasets for evaluation: 1) Snopes and 2) PolitiFact, released by Popat et al. (2018), containing 4,341 and 3,568 news claims, respectively, along with relevant articles collected from various web sources; 3) FEVER, released by Thorne et al. (2018a), which consists of 185,445 claims accompanied by human-annotated relevant Wikipedia articles and evidence-bearing sentences, and many claims in FEVER are human altered by mutating the original claims from Wikipedia.
Each Snopes claim was labeled as true or false, while each PolitiFact claim was originally assigned one of six veracity labels: true, mostly true, half true, mostly false, false, and pants on fire. Unlike Popat et al. (2018), who convert all the classes into true or false, we merge mostly true, half true and mostly false into mixed, and treat false and pants on fire as false. Thus, we have a more practical classification on PolitiFact, i.e., true, false and mixed. We use micro-/macro-averaged F1 and class-specific precision, recall and F-measure as evaluation metrics. We hold out 10% of the claims for tuning the hyperparameters, and conduct 5-fold cross-validation on the rest of the claims.
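The label merging and the evaluation metrics described above can be sketched as follows. The dictionary reproduces the three-way mapping stated in the text; the function names are hypothetical, and the metric code is a plain re-implementation for illustration rather than the evaluation script used in our experiments.

```python
# Three-way PolitiFact mapping as described in the text.
POLITIFACT_3WAY = {
    "true": "true",
    "mostly true": "mixed",
    "half true": "mixed",
    "mostly false": "mixed",
    "false": "false",
    "pants on fire": "false",
}


def _prf(tp, fp, fn):
    """Precision, recall and F1 from counts; undefined ratios are set to 0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def evaluate(gold, pred):
    """Return (per-class (P, R, F1), micro-averaged F1, macro-averaged F1)."""
    labels = sorted(set(gold) | set(pred))
    per_class, tp_all, fp_all, fn_all = {}, 0, 0, 0
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        per_class[c] = _prf(tp, fp, fn)
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    macro_f1 = sum(f1 for _, _, f1 in per_class.values()) / len(labels)
    micro_f1 = _prf(tp_all, fp_all, fn_all)[2]
    return per_class, micro_f1, macro_f1
```

Macro-averaging weights every class equally, which matters here because the merged mixed class is much larger than the other two after the mapping.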
On the FEVER dataset, each claim, which is classified as Supported, Refuted or NEI, can be verified with its ground-truth label and a set of human-annotated evidential sentences extracted from its relevant Wikipedia pages. This task is similar to predicting the entailment relation by aggregating the sentences to infer the NLI-style label of the target claim, instead of directly predicting the claim's veracity as true or false. The FEVER shared task used label accuracy, $F_1$ score of evidential sentence selection, and FEVER score as evaluation metrics (Thorne et al., 2018b).
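For readers unfamiliar with the FEVER score, the following sketch illustrates how it differs from plain label accuracy: a prediction counts only if the label is correct and, for claims that are not NEI, the predicted sentences cover at least one complete gold evidence set. The dictionary field names are hypothetical and the cap of five predicted sentences follows the shared task setting; this is a simplified illustration, not the official scorer.

```python
def fever_score(instances, max_evidence=5):
    """instances: list of dicts with keys gold_label, pred_label,
    gold_evidence_sets (list of sets of sentence ids) and
    pred_evidence (ranked list of sentence ids)."""
    strict_correct = 0
    for inst in instances:
        if inst["pred_label"] != inst["gold_label"]:
            continue
        if inst["gold_label"] == "NEI":
            # No evidence is required for claims without enough information.
            strict_correct += 1
            continue
        predicted = set(inst["pred_evidence"][:max_evidence])
        # At least one complete gold evidence set must be fully retrieved.
        if any(gold <= predicted for gold in inst["gold_evidence_sets"]):
            strict_correct += 1
    return strict_correct / len(instances)
```

Because a gold evidence set must be retrieved in full, a system can have high label accuracy and high evidence precision while still obtaining a lower FEVER score.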

Experiments on Veracity-based Datasets
We compare our model with several state-of-the-art baseline methods described below. 1) SVM: a linear SVM model for fake news detection using a set of linguistic features (e.g., bag-of-words, n-grams, etc.) handcrafted from relevant sentences (Thorne and Vlachos, 2018); [...] HAN-nli: our model using the richer representation $\hat{h}_{cS}$ for the output layer (see Section 4.4). We implement our models and DeClarE with Theano, and use the original code of the other baselines. As DeClarE is not yet open-source, we consulted its developers for our implementation.

Results of Comparison
As shown in Table 2, CNN and LSTM, which use only the content of claims without considering external information, are comparable with SVM, which uses handcrafted features based on relevant article sentences. Among all the baselines, DeClarE performs the best because it not only learns to capture complex features effectively via the neural model, but also strengthens the learned features by attending to the salient words that are important for predicting the correct label.
Our model can capture more accurate sentence-level evidence, which conveys the semantics more completely and deeply. The superiority is clear: HAN-na, which considers sentences as evidence without using attention, is already better than all the baselines except DeClarE, implying the importance of sentence-level information. HAN-ba and HAN, which use attention to embed sentence-level evidence, consistently outperform DeClarE, which is based on word-level attention, by a large margin.
HAN consistently outperforms HAN-ba on both datasets. This suggests that the co-attention, which considers the claim when capturing sentence coherence, is more effective at representing accurate evidence. HAN-nli, however, fails to work and is even worse than DeClarE, which confirms our conjecture that veracity classification on news data differs from an NLI type of task such as FEVER (see Section 5.3): news reports often remark openly on the claim's veracity, and involving the claim in the output layer may interfere with the decision.

Ablation Study
To evaluate the impact of each component, we perform ablation tests based on the no-attention model HAN-na plus some component(s), which can be one or a combination of the following attentions: 1) ba and 2) ca correspond to the coherence-based biaffine attention (Eq. 2) and co-attention (Eq. 3), respectively; 3) ea: entailment-based attention (Eq. 7). As shown in Table 3, HAN-na plus each component alone improves the model, indicating their effectiveness for embedding sentence-level evidence. Furthermore, +ca consistently outperforms +ba, reaffirming the advantage of co-attention; +ea makes similar improvements over HAN-na as +ba and +ca do, suggesting that both types of attention are comparably helpful. Combining them hierarchically makes further improvements, especially in the case of +ca+ea, implying that the two attention mechanisms are complementary.
We also examine the impact of pre-training on HAN in comparison with its performance without pre-training, namely HAN-. In Figure 2, we observe that the pre-training does not have much impact when we use the entire training set, but it clearly improves the model when only using certain proportions of the training data. This indicates that the fine-tuned coherence and entailment models are generally helpful for claim verification, especially when the sampled set is not sufficiently large for fully training the model.

Discussion
Regarding the gap between the published performance of DeClarE (Plain+Attn) (Popat et al., 2018), which is 0.79 on the Snopes dataset, and that of our implementation, which is 0.759, we conjecture that DeClarE used an undisclosed strategy for balancing the training datasets that we could not easily replicate, while we trained all the systems in Table 2 on the original unbalanced dataset. We leave this for further investigation once the DeClarE source code becomes available. On the PolitiFact dataset, since we adopt a three-way classification, our results are not directly comparable with the original DeClarE performance, which is based on two classes.

Case Study
Table 4 illustrates some top sentences embedded with a claim from the Snopes dataset which is correctly detected as fake. We can see that 1) the top sentences have high topical overlap with both the claim and each other; 2) the highly ranked sentences play a major role in deciding the verdict, as they remark on the claim's veracity directly; 3) the lower-ranked sentences seem less important since they either repeat the claim or are very subjective. Providing such readable pieces of evidence to human fact-checkers for verifying the claim can be helpful.

Experiments on FEVER Dataset
We compare the following systems on the public Dev set of the FEVER dataset: 1) Fever-base: the FEVER baseline (Thorne et al., 2018a), a pipeline for claim verification with three stages: document retrieval, sentence selection and textual entailment. 2) NSMN: the pipeline-based system named UNC-NLP that topped the FEVER shared task (Thorne et al., 2018b), which was later reported as using Neural Semantic Matching Networks (Nie et al., 2019). 3) HAN-nli: our full model trained on the FEVER task dataset. Note that, similar to DeClarE, our model assumes that the set of articles about each claim has been retrieved, while the FEVER task requires users to search for relevant Wikipedia pages in the first place. On FEVER, our method is thus not truly end-to-end. We utilize the document retrieval module of NSMN (Nie et al., 2019) to obtain the relevant Wikipedia pages. 4) HAN-nli*: for a fairer comparison with NSMN, which used the ground-truth sentences in the training set to train its sentence selector, we fine-tune HAN-nli by minimizing the squared error between the entailment attention score $b_i$ (see Eq. 7) and the -1/+1 value indicating whether $s_i$ is selected as a piece of evidence in the ground truth. 5) HAN*: the original HAN using Eq. 8 in the output layer, fine-tuned in the same way as HAN-nli*.
Table 5 shows that HAN-nli* is much better than the two baselines in terms of label accuracy and evidence $F_1$ score. There are two reasons: 1) apart from the retrieval module, our model optimizes all the parameters end-to-end, while the two pipeline systems may suffer from error propagation; and 2) our evidence embedding method considers more complex facets such as topical coherence and semantic entailment, while NSMN focuses only on similarity matching between the claim and each sentence. HAN-nli is already a decent model given its much better performance than Fever-base. This confirms the advantage of our evidence embedding method on the FEVER task.
NSMN achieves a higher FEVER score and evidence recall than our method. The reason is straightforward: the FEVER score favors recalling the annotated evidential sentences, while one of the limitations of the FEVER dataset is that the ground-truth sentences provided by human annotators are often incomplete (Thorne et al., 2018a,b). Our approach is not limited to selecting top-k sentences and may embed as many diverse sentences into the evidence as the model requires. Compared to NSMN, which aims to recall the top evidence sentences in FEVER's ground truth, our model achieves much higher accuracy, evidence precision and $F_1$. HAN* is ineffective, confirming that in the FEVER task the claim content is needed in the output layer for the NLI to take effect, since the evidence from Wikipedia typically does not contain direct remarks on the veracity of a claim.

Discussion
The pipeline-based system NSMN demonstrates superior evidence retrieval performance in terms of FEVER score. We emphasize that the essential objective of our model is not evidence retrieval and ranking. Instead of ranking sentences into the top-k positions, we pay more attention to claim verification accuracy by embedding and aggregating the useful sentences as evidence, as explained above. However, this discrepancy inspires us to investigate in the future an end-to-end approach that jointly models evidence retrieval and claim verification in a unified framework based on our sentence-level attention mechanism.
Finally, thanks to one of our reviewers, we learned about another two-stage model named TwoWingOS (Yin and Roth, 2018), which achieves a comparable FEVER score but slightly higher accuracy than ours on the FEVER task. TwoWingOS applies a two-wing optimization approach to jointly optimize sentence selection and veracity classification. Its higher performance might be explained by two factors: 1) its input word embeddings are fine-tuned based on the context of the evidence and claim, while ours are fixed during training; and 2) the document retrieval module of TwoWingOS has demonstrated higher effectiveness than that of NSMN (see rate (recall) and acc ceiling (OFEVER) in Table 2 of Yin and Roth (2018) and of Nie et al. (2019) for details).

Conclusions and Future Work
We propose a novel neural end-to-end framework for claim verification that learns to embed sentence-level evidence with a hierarchical attention mechanism. Our model strengthens the evidence representations by attending to the sentences that are not only topically coherent but can also semantically infer the target claim. The results on three public benchmark datasets confirm the advantages of our method. For future work, beyond what we have mentioned, we plan to examine our model on different information sources. We will also try to incorporate relevant metadata into it, e.g., author profile, website credibility, etc.