An End-to-End Multi-task Learning Model for Fact Checking

With the huge amount of information generated every day on the web, fact checking is an important and challenging task that can help people assess the authenticity of claims and provides evidence selected from a knowledge source such as Wikipedia. Here we decompose this problem into two parts: an entity linking task (retrieving relevant Wikipedia pages) and recognizing textual entailment between the claim and the selected pages. In this paper, we present an end-to-end multi-task learning with bi-direction attention (EMBA) model that classifies the claim as “supports”, “refutes” or “not enough info” with respect to the retrieved pages and detects evidence sentences at the same time. We conduct experiments on the paper test dataset and the shared task test dataset of FEVER (Fact Extraction and VERification), a new public dataset for verification against textual sources. Experimental results show that our method achieves performance comparable to the baseline system.


Introduction
When news came from newspapers and television, thoroughly investigated and written by professional journalists, most of it was well founded and trustworthy. However, with the popularity of the internet, 2.5 quintillion bytes of data are created each day at our current pace. Everyone online is a producer as well as a recipient of this emerging information, some of which is incorrect, fabricated or even spread with malicious intent. Most of the time it is difficult to establish the truth of emerging news without professional background and sufficient investigation. Fact checking, which first emerged and received much attention in the journalism industry, mainly for verifying the speeches of public figures, is also important for other domains, e.g. correcting common-sense errors, rumor detection and content review.
With the increasing demand for automatic claim verification, several datasets for fact checking have been produced in recent years. Vlachos and Riedel (2014) were the first to release a public fake news detection and fact-checking dataset, drawn from two fact-checking websites: the fact-checking blog of Channel 4 and the Truth-O-Meter from PolitiFact. This dataset includes only 221 statements. Similarly, the LIAR dataset was collected from PolitiFact via its API and contains 12.8K manually labeled short statements, which makes machine learning based methods applicable. Neither dataset includes the original justification and evidence, as these were not machine-readable. However, verifying a claim based on the claim itself, without referring to any evidence source, is neither reasonable nor convincing.
In 2015, Silverman launched the Emergent Project, a real-time rumor tracker, as part of a research project with the Tow Center for Digital Journalism at Columbia University. Ferreira and Vlachos (2016) first proposed using the data from the Emergent Project as the Emergent dataset for rumor debunking, which contains 300 rumored claims and 2,595 associated news articles. In 2017, the Fake News Challenge (Pomerleau and Rao, 2017) consisted of 50K labeled claim-article pairs similarly derived from the Emergent Project. These two datasets, both stemming from the Emergent Project, simplify the fact checking task to detecting the relationship within claim-article pairs. However, in the more common situation, we face many claims online without associated articles that could help verify them.
The Fact Extraction and VERification (FEVER) dataset (Thorne et al., 2018) consists of 185,445 claims manually verified against the introductory sections of Wikipedia pages and classified as SUPPORTED, REFUTED or NOTENOUGHINFO. For the first two classes, the dataset provides the combinations of sentences that form the evidence necessary to support or refute the claim. This dataset is more difficult than existing fact-checking datasets: to achieve a high FEVER score, a fact-checking system must classify the claim correctly and also retrieve, from more than 5 million Wikipedia pages, the sentences that together form correct evidence for the judgement.
The baseline method for this task comprises three components: document retrieval, sentence-level evidence selection and textual entailment. For the first two retrieval components, the baseline uses the document retrieval component of DrQA (Chen et al., 2017), which relies only on unigram and bigram TF-IDF with vector similarity and does not model the semantics of the claim and pages. As a result, we find that it extracts many Wikipedia pages that are unrelated to the entities described in the claims. In addition, similarity-based methods tend to extract supporting evidence rather than refuting evidence. For the recognizing textual entailment (RTE) module, on the one hand, the preceding retrieval results limit the performance of the RTE model. On the other hand, the selected sentences concatenated as evidence may confuse the RTE model because they can contain contradictory information.
In this paper, we introduce an end-to-end multi-task learning with bi-direction attention (EMBA) model for the FEVER task. We use the multi-task framework to jointly extract evidence and verify the claim, because these two sub-tasks can be accomplished at the same time. For example, after selecting the relevant pages, a reader scans them carefully to find supporting or refuting evidence. If some is found, the claim can be labeled as SUPPORTS or REFUTES immediately. If not, the claim is classified as NOTENOUGHINFO once the pages have been read completely. Our model is trained on claim-pages pairs using an attention mechanism in both directions, claim-to-pages and pages-to-claim, which provide complementary information to each other. We obtain a claim-aware sentence representation to predict the correct evidence position and a pages-aware claim representation to detect the relationship between the claim and the pages.

Related Work
Natural Language Inference (NLI), or Recognizing Textual Entailment (RTE), detects the relationship between premise-hypothesis pairs as "entailment", "contradiction" or "not related". With the renaissance of neural networks (Krizhevsky et al., 2012; Mikolov et al., 2010; Graves, 2012) and the attention mechanism (Xu et al., 2015; Luong et al., 2015; Bahdanau et al., 2014), the popular framework for RTE is "matching-aggregation" (Parikh et al., 2016). Under this framework, the words of the two sentences are first aligned, and the alignment results are then aggregated with the original vectors into a new representation vector to make the final decision. The attention mechanism enables this framework to capture more interactive features between the two sentences. Compared to the FEVER task, RTE provides the sentence to verify against, instead of requiring it to be retrieved from a knowledge source. Other related tasks are question answering (QA) and machine reading comprehension (MRC), for which approaches have recently been extended to handle large-scale resources such as Wikipedia (Chen et al., 2017). Similar to the MRC task, which needs to identify the answer span in a passage, the FEVER task requires detecting the evidence sentences in Wikipedia pages. However, an MRC model tends to identify the answer span based on similarity and reasoning between the question and the passage, while a similarity-based method is more likely to ignore refuting evidence in the pages. For example, a claim stating "Manchester by the Sea is distributed globally" can be refuted by retrieving "It began a limited release on November 18, 2016" as evidence.

Model
The FEVER dataset is derived from Wikipedia pages. We therefore assume that each claim contains at least one Wikipedia entity and that the evidence can be retrieved from the pages of these entities. Thus, we decompose the FEVER task into two components: (1) entity linking, which detects Wikipedia entities in the claim, where we use the pages of the identified entities as the source pages; and (2) an end-to-end multi-task learning with bi-direction attention (EMBA) model (Figure 1), which classifies the claim as "supports", "refutes" or "not enough info" with respect to the retrieved pages and selects sentences as evidence at the same time.

Entity Linking
S-MART is a Wikipedia entity linking tool for short and noisy text. For each claim, we use S-MART to retrieve the top 5 entities from Wikipedia. The pages of these entities are concatenated to form the source pages, which are then passed on for sentence selection. For a given claim, S-MART first retrieves all possible Wikipedia entities by surface matching, and then ranks them using a statistical model trained on the frequency counts with which each surface form occurs with each entity.
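The two stages described above (surface matching, then ranking by how often a surface form occurs with an entity) can be illustrated with a toy sketch. This is not the S-MART implementation: the `SURFACE_COUNTS` table, the `link_entities` helper and the n-gram matching rule are hypothetical, and the real tool ranks candidates with a trained statistical model rather than raw relative frequencies.

```python
from collections import defaultdict

# Hypothetical counts: how often each surface form links to each entity.
SURFACE_COUNTS = {
    "manchester": {"Manchester": 900, "Manchester_United_F.C.": 300},
    "manchester by the sea": {
        "Manchester_by_the_Sea_(film)": 120,
        "Manchester-by-the-Sea,_Massachusetts": 80,
    },
}


def link_entities(claim, counts=SURFACE_COUNTS, top_k=5, max_ngram=4):
    """Return up to top_k candidate entities for a claim, best first."""
    tokens = claim.lower().split()
    scores = defaultdict(float)
    # Stage 1: surface matching over all word n-grams of the claim.
    for n in range(1, max_ngram + 1):
        for i in range(len(tokens) - n + 1):
            surface = " ".join(tokens[i:i + n])
            entity_counts = counts.get(surface)
            if not entity_counts:
                continue
            total = sum(entity_counts.values())
            # Stage 2: score each entity by its relative frequency for this
            # surface form, keeping the best score seen for the entity.
            for entity, count in entity_counts.items():
                scores[entity] = max(scores[entity], count / total)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [entity for entity, _ in ranked[:top_k]]


candidates = link_entities("Manchester by the Sea is distributed globally")
```

In this toy setting the ambiguous mention "Manchester" also produces candidates, which is why keeping the top 5 entities rather than only the best one helps the desired page survive to the next stage.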

Sentence Extraction and Claim Verification
We now identify the correct sentences as evidence from the relevant pages and, at the same time, classify the claim as "supports", "refutes" or "not enough info" with respect to the retrieved pages. Inspired by the recent success of the attention mechanism in NLI and MRC (Seo et al., 2016; Tan et al., 2017), we propose an end-to-end multi-task learning with bi-direction attention (EMBA) model, which exploits pages-to-claim attention to verify the claim and claim-to-pages attention to predict the evidence sentence position. Our model consists of the following layers.
Embedding layer: This layer represents each word as a fixed-size vector with two components: a word embedding and a character-level embedding. For the word embedding, pre-trained GloVe word vectors (Pennington et al., 2014) provide the fixed-size embedding of each word. For the character embedding, following Kim (2014), the characters of each word are embedded into fixed-size vectors and then fed into a Convolutional Neural Network (CNN). The character and word embedding vectors are concatenated and passed to a Highway Network (Srivastava et al., 2015). The output of this layer is two sequences of word vectors, one for the claim and one for the pages.
Context embedding layer: The purpose of this layer is to incorporate contextual information into the representation of each word of the claim and the pages. We apply a bi-directional LSTM (BiLSTM) on top of the embeddings provided by the previous layer to encode a contextual embedding for each word.
Attention matching layer: In this layer, we compute attention in two directions: from pages to claim and from claim to pages. To obtain these attentions, we first calculate a shared similarity matrix between the contextual embedding of each claim word $h^c_i$ and of each pages word $h^p_j$. Here $\alpha_{ij}$ represents the attention weight on the $i$-th claim word by the $j$-th pages word, $w$ is a trainable weight vector, $\circ$ is elementwise multiplication, $[;]$ is vector concatenation across rows, and implicit multiplication is matrix multiplication.
Claim-to-pages attention: Claim-to-pages attention represents which claim words are most relevant to each word of the pages. To obtain the attended pages vector, we take $\alpha_{ij}$ as the weight of $h^p_j$ and compute a weighted sum over all contextual embeddings of the pages. Finally, we match each contextual embedding with its corresponding attention vector to obtain the claim-aware representation of each word of the pages.
Pages-to-claim attention: Pages-to-claim attention represents which pages words are most relevant to each claim word. Analogously to claim-to-pages attention, we compute the attended claim vector and the pages-aware representation of each claim word.
Aggregation layer: The input to the aggregation layer is two sequences of matching vectors: the claim-aware pages word representations and the pages-aware claim word representations. The goal of this layer is to capture the interaction among the pages words conditioned on the claim, as well as among the claim words conditioned on the pages words. This differs from the contextual embedding layer, which captures the interaction among context information independently of the matching information.
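The shared similarity matrix and the two attention directions can be sketched in a few lines of plain Python. The symbols defined above ($w$, $\circ$, $[;]$) suggest a BiDAF-style similarity in the spirit of Seo et al. (2016), so the sketch assumes $\alpha_{ij} = w^\top [h^c_i; h^p_j; h^c_i \circ h^p_j]$ followed by softmax normalisation. Both the exact form and the normalisation are assumptions rather than the model's confirmed equations, and all function names are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def similarity(h_c, h_p, w):
    """Assumed form: alpha[i][j] = w . [h_c[i]; h_p[j]; h_c[i] * h_p[j]]."""
    alpha = []
    for c in h_c:
        row = []
        for p in h_p:
            feats = c + p + [a * b for a, b in zip(c, p)]  # concatenation
            row.append(sum(wk * fk for wk, fk in zip(w, feats)))
        alpha.append(row)
    return alpha


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(wt * v[k] for wt, v in zip(weights, vectors)) for k in range(dim)]


def bidirectional_attention(h_c, h_p, w):
    alpha = similarity(h_c, h_p, w)
    # For each claim word i: normalise over pages words, pool pages vectors.
    attended_pages = [weighted_sum(softmax(row), h_p) for row in alpha]
    # For each pages word j: normalise over claim words, pool claim vectors.
    cols = [[alpha[i][j] for i in range(len(h_c))] for j in range(len(h_p))]
    attended_claim = [weighted_sum(softmax(col), h_c) for col in cols]
    return alpha, attended_pages, attended_claim


h_c = [[1.0, 0.0], [0.0, 1.0]]              # 2 claim words, d = 2
h_p = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 pages words
w = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]          # toy weights: product term only
alpha, attended_pages, attended_claim = bidirectional_attention(h_c, h_p, w)
```

Because both directions reuse the same matrix `alpha`, the only difference between them is the axis over which the weights are normalised.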
Sentence selection layer: The FEVER task requires the model to retrieve sentences of the pages as evidence to verify the claim. The sentence representation $s_t$ is obtained by concatenating the vectors from the last time step of the output sequences of the previous layer's BiLSTM. From these representations we calculate the probability distribution of the evidence position over the whole pages. For this sub-task, the objective is to minimize the negative log probability of the true evidence indices, where $y_t \in \{0, 1\}$ denotes a label: $y_t = 1$ means the $t$-th sentence is a correct evidence sentence, and $y_t = 0$ otherwise.
Claim verification layer: The input of this layer is the pages-aware claim representation produced by the matching layer, and the output is a 3-way classification predicting whether the claim is SUPPORTED, REFUTED or NOTENOUGHINFO by the pages. We use multiple convolution layers, with an output size of 3 for classification. The objective function is computed over all claims, where $k$ is the number of claims and $y_i \in \{0, 1, 2\}$ denotes the label of the $i$-th claim, meaning SUPPORTED, REFUTED and NOTENOUGHINFO by the pages respectively.
Training: The model is trained by minimizing a joint objective function that combines the two losses, where $\alpha$ is the hyper-parameter that weights them.
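A minimal sketch of how the two objectives might be combined is given below. It assumes a softmax over sentence scores for evidence selection, a softmax cross-entropy for the 3-way claim label, and the combination $L = L_s + \alpha L_c$. The combination form and the names `sentence_selection_loss`, `claim_verification_loss` and `joint_loss` are assumptions made for illustration only.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sentence_selection_loss(sentence_scores, evidence_labels):
    """Negative log probability of the true evidence sentences.

    evidence_labels[t] is 1 if sentence t is evidence, else 0.
    """
    probs = softmax(sentence_scores)
    return -sum(y * math.log(p) for y, p in zip(evidence_labels, probs))


def claim_verification_loss(class_logits, label):
    """Cross-entropy; label 0/1/2 = SUPPORTED/REFUTED/NOTENOUGHINFO."""
    probs = softmax(class_logits)
    return -math.log(probs[label])


def joint_loss(sentence_scores, evidence_labels, class_logits, label, alpha=0.5):
    # Assumed combination: L = L_s + alpha * L_c.
    l_s = sentence_selection_loss(sentence_scores, evidence_labels)
    l_c = claim_verification_loss(class_logits, label)
    return l_s + alpha * l_c


# Uniform scores: L_s = log 2 (two sentences), L_c = log 3 (three classes).
loss = joint_loss([0.0, 0.0], [1, 0], [0.0, 0.0, 0.0], label=0, alpha=0.5)
```

With a fixed `alpha`, the relative influence of the two sub-tasks on the shared layers is decided before training starts, which is the limitation the conclusion returns to.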

Experiments
In this section, we evaluate our model on the FEVER paper test dataset and the shared task test dataset.

Model Details
The model architecture used for this task is depicted in Figure 1. The nonlinearity function f = tanh is employed. We use 100 1D filters for the CNN character embedding, each with a width of 5. The hidden state size (d) of the model is 100. We use the Adam optimizer (Kingma and Ba, 2014), with a minibatch size of 32 and an initial learning rate of 0.001. A dropout rate of 0.2 is used for the CNN, all LSTM layers, and the linear transformation. The parameters are initialized by the techniques described in (Glorot, 2010). The max value used for max-norm regularization is 5. The $L_c$ loss weight is set to α = 0.5.
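For reference, the settings above can be gathered into a single configuration object. The dictionary below is only a convenience sketch: the values are those reported in this section, while the key names are hypothetical and do not come from any released code.

```python
# Hyper-parameters reported in this section; key names are illustrative.
EMBA_CONFIG = {
    "activation": "tanh",
    "char_cnn_filters": 100,      # number of 1D filters
    "char_cnn_filter_width": 5,
    "hidden_size": 100,           # d
    "optimizer": "adam",
    "batch_size": 32,
    "learning_rate": 0.001,       # initial learning rate
    "dropout": 0.2,               # CNN, LSTM layers, linear transformation
    "max_norm": 5,                # max-norm regularization bound
    "claim_loss_weight": 0.5,     # alpha, the L_c loss weight
}
```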

Experimental Results
We use the official evaluation script to compute the evidence F1 score, label accuracy and FEVER score. As shown in Table 1, our method on the FEVER paper test dataset achieves performance comparable to that of the baseline method on the FEVER paper dev dataset. This result shows that jointly verifying a claim and retrieving evidence can be as good as a pipelined model. Our results on the FEVER shared task test dataset are shown in Table 2. In addition, we present the confusion matrix of the claim classification results on the FEVER paper test dataset in Table 3. Our model is not good at identifying the unrelated relationship between a claim and the retrieved pages. Our model's sentence selection performance is recorded in
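To make the FEVER score concrete, the sketch below gives a simplified reading of the metric: a prediction counts only if the label is correct and, for SUPPORTS and REFUTES, at least one complete gold evidence set is contained in the predicted evidence. It is not the official evaluation script. The helper names, the cap of five predicted sentences and the sentence indices in the toy data are assumptions for illustration.

```python
NEI = "NOT ENOUGH INFO"


def is_strictly_correct(pred_label, pred_evidence, gold_label,
                        gold_evidence_sets, max_evidence=5):
    """True if the label matches and one full gold evidence set is retrieved."""
    if pred_label != gold_label:
        return False
    if gold_label == NEI:
        return True  # no evidence is required for this class
    predicted = set(pred_evidence[:max_evidence])
    return any(set(group) <= predicted for group in gold_evidence_sets)


def fever_score(examples):
    """Fraction of examples that are strictly correct."""
    correct = sum(is_strictly_correct(*ex) for ex in examples)
    return correct / len(examples)


# Evidence items are (page, sentence_index) pairs; indices are made up.
gold = [[("Deepika_Padukone", 3), ("Yeh_Jawaani_Hai_Deewani", 0)]]
examples = [
    # Correct label and both evidence sentences retrieved.
    ("SUPPORTS", [("Deepika_Padukone", 3), ("Yeh_Jawaani_Hai_Deewani", 0)],
     "SUPPORTS", gold),
    # Correct label but only half of the evidence set: not counted.
    ("SUPPORTS", [("Deepika_Padukone", 3)], "SUPPORTS", gold),
    # NOT ENOUGH INFO needs only the correct label.
    (NEI, [], NEI, []),
]
score = fever_score(examples)
```

The second example shows why label accuracy can exceed the FEVER score: the label is right, but the evidence is incomplete.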

Error Analysis
We investigate the predictions on the paper test dataset and identify several causes of error, as follows.

Document retrieval
We use an entity linking tool to retrieve the relevant Wikipedia pages. Some entity mentions in claims are linked incorrectly, so we cannot obtain the desired pages containing the correct evidence sentences. The S-MART tool returned the correct entities for 70% of the claims in the paper test dataset. A better entity retrieval method should be investigated for the FEVER task.

Pages length
After document retrieval, the relevant pages are concatenated and passed through the EMBA model. However, in order to train and predict efficiently, the length of the concatenated pages is limited to 800 tokens. Consequently, if there are many relevant pages and the evidence sentence lies near the end of the concatenated pages, the correct sentence is cut off.
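The effect of the 800-token limit can be reproduced with a small sketch. The `truncate_pages` helper and the toy page sizes are hypothetical. The sketch only shows how an evidence sentence placed late in the concatenation is lost.

```python
def truncate_pages(pages, max_tokens=800):
    """Concatenate tokenised pages and keep the first max_tokens tokens.

    pages: list of pages, each a list of sentences, each a list of tokens.
    Returns the kept tokens and the (page, sentence) indices kept in full.
    """
    kept_tokens, surviving = [], []
    for page_idx, page in enumerate(pages):
        for sent_idx, sentence in enumerate(page):
            room = max_tokens - len(kept_tokens)
            if room <= 0:
                return kept_tokens, surviving
            kept_tokens.extend(sentence[:room])
            if len(sentence) <= room:
                surviving.append((page_idx, sent_idx))
    return kept_tokens, surviving


pages = [
    [["w"] * 300, ["w"] * 300],  # page 0: two 300-token sentences
    [["w"] * 150, ["w"] * 100],  # page 1: evidence is the last sentence
]
tokens, surviving = truncate_pages(pages, max_tokens=800)
evidence_kept = (1, 1) in surviving
```

Here the last sentence of the second page is only partially kept, so it can no longer be selected as evidence regardless of how well the model scores it.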

Evidence composition
Some claims require composing evidence from multiple pages. Furthermore, the selection of the second page relies on the correct retrieval of the first page and sentence. For example, the claim "Deepika Padukone has been in at least one Indian films" can be supported by the combination of "She starred roles in Yeh Jawaani Hai Deewani" and "Yeh Jawaani Hai Deewani is an Indian film", from the "Deepika Padukone" and "Yeh Jawaani Hai Deewani" Wikipedia pages respectively. The second page cannot be found if the first sentence is not selected correctly. 18% of the claims in the training dataset belong to this situation.

Conclusion
We propose a novel end-to-end multi-task learning with bi-direction attention (EMBA) model that detects evidence sentences and, at the same time, classifies the claim as "supports", "refutes" or "not enough info" with respect to the retrieved pages. EMBA uses an attention mechanism in both directions to capture interactive features between the claim and the retrieved pages. The model obtains a claim-aware sentence representation to predict the correct evidence position and a pages-aware claim representation to detect the relationship between the claim and the pages. Experimental results on the FEVER paper test dataset show that our approach achieves performance comparable to the baseline method. Several promising directions are worth investigating in the future. For instance, in the sentence selection layer, the model only predicts whether a sentence is evidence. A further step is to directly predict whether a sentence is "supporting", "refuting" or "not related to" the claim. In addition, the hyper-parameter α of the joint loss function is currently fixed. A well-chosen value may let the two tasks reinforce each other, so that joint training outperforms training each task separately. We can try to learn this parameter during model training.