Simple Compounded-Label Training for Fact Extraction and Verification

Automatic fact checking is an important task motivated by the need to detect and prevent the spread of misinformation across the web. The recently released FEVER challenge provides a benchmark task that assesses systems' capability for both the retrieval of required evidence and the identification of authentic claims. Previous approaches share a similar pipeline training paradigm that decomposes the task into three subtasks, with each component built and trained separately. Although achieving acceptable scores, these methods complicate practical application development due to unnecessary complexity and expensive computation. In this paper, we explore the potential of simplifying the system design and reducing training computation by proposing a joint training setup in which a single sequence matching model is trained with compounded labels that give supervision for both the sentence selection and claim verification subtasks, eliminating the duplicate computation that occurs when models are designed and trained separately. Empirical results on FEVER indicate that our method: (1) outperforms the typical multi-task learning approach, and (2) achieves results comparable to top performing systems with a much simpler training setup and less training computation (in terms of the amount of data consumed and the number of model parameters), facilitating future work on the automatic fact checking task and its practical usage.


Introduction
The increasing concern with the spread of misinformation has motivated research regarding automatic fact checking datasets and systems (Pomerleau and Rao, 2017; Hanselowski et al., 2018a; Bast et al., 2017; Pérez-Rosas et al., 2018; Zhou et al., 2019; Vlachos and Riedel, 2014; Wang, 2017; Shu et al., 2019a,b). The Fact Extraction and VERification (FEVER) dataset (Thorne et al., 2018a) is the most recent large-scale dataset that enables the development of data-driven neural approaches to the automatic fact checking task. Additionally, the FEVER Shared Task (Thorne et al., 2018b) introduced a benchmark, the first of its kind, that is capable of evaluating both evidence retrieval and claim verification. Our code will be publicly available on our webpage.
Several top-ranked approaches on FEVER (Nie et al., 2019a; Yoneda et al., 2018; Hanselowski et al., 2018b) decompose the task into 3 subtasks: document retrieval, sentence selection, and claim verification, and follow a similar pipeline training setup in which sub-components are developed and trained sequentially. Although it achieves higher scores on benchmarks, pipeline training is time-consuming and impedes fast application development, since downstream training relies on data provided by a fully-converged upstream component. The impossibility of parallelization also causes data inefficiency: processing the same input sentence for both sentence selection and claim verification requires twice the computation, whereas humans can learn sentence selection and claim verification jointly.
In this work, we simplify the training procedure and increase training efficiency for sentence selection and claim verification by merging redundant components and computation that exist when training the two tasks separately. We propose a joint training setup in which sentence selection and claim verification are tackled by a single neural sequence matching model. This model is trained with a compounded label space in which for a given claim, an input sentence that is labeled as "NON-SELECT" for sentence selection module training will also be labeled as "NOTENOUGHINFO" for claim verification module training. Similarly, input evidence that is labeled as "SUPPORTS" or "REFUTES" for claim verification module training will also be labeled as "SELECT" for sentence selection module training.
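The label compounding just described amounts to a simple mapping from a single-task label to a pair of labels, one per subtask. The sketch below is illustrative (the function name and the use of `None` for "no extra supervision" are ours, not taken from the released code):

```python
def compound_label(task, label):
    """Map a single-task label to a (verification, selection) label pair.

    Per the compounding rules: SUPPORTS/REFUTES evidence must have been
    selected, so it also carries a SELECT label; a NON-SELECT sentence
    carries no verdict information, so it doubles as NOTENOUGHINFO.
    `None` marks the absence of cross-task supervision.
    """
    if task == "verification":
        selection = "SELECT" if label in ("SUPPORTS", "REFUTES") else None
        return label, selection
    if task == "selection":
        verification = "NOTENOUGHINFO" if label == "NON-SELECT" else None
        return verification, label
    raise ValueError(f"unknown task: {task}")
```

In the actual training objective these cross-task labels are soft (weighted by λ1 and λ2, as described later), rather than the hard labels shown here.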
To validate our new setup, we compare with the previous pipeline setup and a multi-task learning setup which trains the two tasks alternately. Fig. 1 illustrates differences among these three setups.
Results indicate that our method: (1) outperforms the multi-task learning setup, and (2) yields comparable results to a top performing pipeline-trained system while consuming less than half the number of data points, reducing the parameter size by one-third, and converging to a functional state much faster than the pipeline-trained system. We argue that the aforementioned design simplification and training acceleration are valuable, especially during time-sensitive application development.

Previous FEVER Systems
Many of the top performing FEVER 1.0 systems, all achieving greater than 60% FEVER score on the respective leaderboard (Nie et al., 2019a;Yoneda et al., 2018;Hanselowski et al., 2018b), share the same pipeline training schema in which document retrieval, sentence selection, and claim verification are all trained separately.
While Nie et al. (2019a) proposed formalizing sentence selection and claim verification as similar problems, the two components are still trained separately, in contrast with our setup. Additionally, Yin and Roth (2018) proposed a hierarchical neural model to tackle both sentence selection and claim verification at the same time, but did not obtain the computational savings of our setup.

Information Retrieval
Neural networks have been successfully applied to information retrieval tasks in Natural Language Processing (Huang et al., 2013;Guo et al., 2016;Mitra et al., 2017;Dehghani et al., 2017;Qi et al., 2019;Nie et al., 2019b) with a focus on relevant retrieval. Information retrieval is generally a relevance-matching task whereas claim verification is a more semantics-intensive task. We consider using a single semantics-focused model to conduct both sentence retrieval and claim verification.

Natural Language Inference
Natural Language Inference (NLI) requires a system to classify the logical relationship between two sentences in which one is the premise and one is the hypothesis. This classifier decides whether the relationship is entailment, contradiction, or neutral. Several large-scale datasets have been created for this purpose, including the Stanford Natural Language Inference Corpus (Bowman et al., 2015) and the Multi-Genre Natural Language Inference Corpus (Williams et al., 2018). This task can be formalized as a semantic sequence matching task, which bears resemblance to both the sentence retrieval and claim verification tasks.

Multi-Task Learning
Multi-task learning (MTL) (Caruana, 1997) has been successfully used to merge Natural Language Processing tasks (Luong et al., 2016;Hashimoto et al., 2017;Dong et al., 2015) for improved performance. Parameter sharing, in particular sharing of certain structures such as label spaces, has been used widely in several NLP tasks for this purpose (Liu et al., 2017;Søgaard and Goldberg, 2016). Zhao et al. (2018) used a multi-task learning setup for FEVER that shared certain layers between sentence selection and claim verification modules. Augenstein et al. (2018) used shared label spaces in MTL for sequence classification. Following this work, Augenstein et al. (2019) used shared label spaces for automatic fact checking. However, the labels involved in this work were limited to claim verification labels only, and did not incorporate sentence selection as we do in this paper.

Fake News Detection
In addition to the FEVER shared task, other recent work in fake news detection has focused on several aspects of data collection and statement verification. Shu et al. (2019b) looked into the role of social context in fake news detection. Additionally, Shu et al. (2019a) also explored creating explainable fake news detection.

Sequence Matching Model
Sentence selection and claim verification can be easily structured as the same sequence matching problem, in which the input is a pair of textual sequences and the output is a semantic relationship label for the pair. Nie et al. (2019a) applied the same architecture, the neural semantic matching network (NSMN), to the two tasks and showed it was effective on both. Thus, we use the same NSMN model with a modified output layer in our experiments.

Figure 1: Different training setups. In the pipeline setup, sentence selection and claim verification models are trained separately. In the multi-task setup, the two tasks are treated separately, but use a single model. In the compounded-label training setup, the training is simplified to a single task by mixing the data of the two tasks and allowing controlled supervision between the two tasks. S, R, NEI, SL, and NSL represent "SUPPORTS", "REFUTES", "NOTENOUGHINFO", "SELECT", and "NON-SELECT", respectively.

Neural Semantic Matching Network (NSMN)
For convenience, we give a description similar to the original paper (Nie et al., 2019a) about the model below.
Encoding Layer: the two input sequences are first encoded as
$$\bar{U} = \mathrm{BiLSTM}(U), \qquad \bar{H} = \mathrm{BiLSTM}(H),$$
where $U \in \mathbb{R}^{d_0 \times n}$ and $H \in \mathbb{R}^{d_0 \times m}$ are the two input sequences, $d_0$ and $d_1$ are the input and output dimensions, and $n$ and $m$ are the lengths of the two sequences.
Alignment Layer: the alignment matrix is computed as
$$A = \bar{U}^{\top}\bar{H} \in \mathbb{R}^{n \times m},$$
where the element $A_{[i,j]}$ indicates the alignment score between the $i$-th token in $U$ and the $j$-th token in $H$. Aligned sequences are computed as
$$\tilde{U} = \bar{H}\,\mathrm{Softmax_{col}}(A^{\top}), \qquad \tilde{H} = \bar{U}\,\mathrm{Softmax_{col}}(A),$$
where $\mathrm{Softmax_{col}}$ is a column-wise softmax, $\tilde{U}$ is the representation aligned from $\bar{H}$ to $\bar{U}$, and vice versa for $\tilde{H}$. The aligned and encoded representations are combined as
$$S = f([\bar{U}; \tilde{U}; \bar{U} - \tilde{U}; \bar{U} \circ \tilde{U}]), \qquad T = f([\bar{H}; \tilde{H}; \bar{H} - \tilde{H}; \bar{H} \circ \tilde{H}]), \quad (7)$$
where $f$ is one fully-connected layer with a rectifier as the activation function and $\circ$ denotes element-wise multiplication.
Matching Layer: the combined representations are matched by a second BiLSTM:
$$P = \mathrm{BiLSTM}([S; U^{*}]), \qquad Q = \mathrm{BiLSTM}([T; H^{*}]),$$
where $U^{*}$ and $H^{*}$ are sub-channels of the inputs $U$ and $H$ without GloVe, provided to the matching layer via a shortcut connection.
Output Layer: the matched sequences are max-pooled over time and projected to the final output:
$$p = \mathrm{Max_{row}}(P), \qquad q = \mathrm{Max_{row}}(Q), \qquad m = h([p; q; |p - q|; p \circ q]),$$
where the function $h$ denotes two fully-connected layers with a rectifier applied to the output of the first layer.
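The alignment step described above can be sketched in numpy (a minimal illustration under the stated shapes; the BiLSTM encoders, the combination function f, and the matching/output layers are omitted):

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def align(U_bar, H_bar):
    """Alignment layer of an NSMN-style matcher.

    U_bar: (d, n), H_bar: (d, m) encoded sequences.
    Returns U_tilde (d, n) and H_tilde (d, m): each aligned token is a
    softmax-weighted average of the tokens of the other sequence.
    """
    A = U_bar.T @ H_bar                      # (n, m) token-level alignment scores
    U_tilde = H_bar @ softmax(A, axis=1).T   # attend over H for every U token
    H_tilde = U_bar @ softmax(A, axis=0)     # attend over U for every H token
    return U_tilde, H_tilde
```

Note that `softmax(A, axis=1).T` is exactly the column-wise softmax of the transposed alignment matrix in the equations above.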

Compounded-Label Output Layer
We propose the following compounded-label output layer for simpler, more efficient training. Given the input pair $x_i$, the NSMN model produces
$$m = \mathrm{NSMN}(x_i) \in \mathbb{R}^{4},$$
where the first three elements of the output vector correspond to claim verification and the last element to sentence selection. Then, the probabilities are calculated as
$$y_{cv} = \mathrm{Softmax}(m_{[0:3]}), \qquad y_{ss} = \mathrm{Sigmoid}(m_{3}),$$
where $m_{[0:3]}$ denotes the first three elements of $m$ and $y_{cv} \in \mathbb{R}^{3}$ denotes the probability of predicting the relation between the input and the claim as "SUPPORTS", "REFUTES", or "NOTENOUGHINFO", while $m_{3}$ denotes the fourth element of $m$ and $y_{ss} \in \mathbb{R}$ indicates the probability of choosing the input as evidence for the claim. This allows us to transfer the model's outputs to predictions in a compact way.
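The split of the 4-dimensional output into the two subtask predictions can be sketched as follows (a numpy illustration of the output head described above; the function name is ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def split_outputs(m):
    """Turn the 4-dim NSMN output into the two subtask predictions.

    m[0:3] -> softmax over SUPPORTS / REFUTES / NOTENOUGHINFO
    m[3]   -> sigmoid probability of selecting the sentence as evidence
    """
    e = np.exp(m[0:3] - m[0:3].max())
    y_cv = e / e.sum()          # claim-verification distribution, in R^3
    y_ss = sigmoid(m[3])        # sentence-selection probability, scalar
    return y_cv, y_ss
```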

Compounded-Label Training
In order to simplify the training procedure and increase data efficiency, we introduce compounded-label training. Consider the model output vector
$$\hat{y}_i = [y_{cv}; y_{ss}; 1 - y_{ss}] \in \mathbb{R}^{5},$$
the concatenation of $y_{cv}$ and $[y_{ss}, 1 - y_{ss}]$. To optimize the model, we use the cross-entropy objective function
$$J = -\sum_{i} y_i^{\top} \log \hat{y}_i.$$
In a typical classification setup, the ground-truth label embedding $y_i$ is a one-hot column vector chosen from an identity matrix whose dimension equals the total number of categories. However, our compounded-label embeddings are the columns of the matrix below, with some supervision provided in the zero-area of the one-hot embeddings:
$$Y = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & \lambda_2 \\ \lambda_1 & \lambda_1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}$$
The first 3 columns are the label embeddings for "SUPPORTS", "REFUTES", and "NOTENOUGHINFO" in verification, and the last 2 columns are the label embeddings for "SELECT" and "NON-SELECT" in sentence selection, respectively. Thus, for a given claim, "SUPPORTS" and "REFUTES" evidence also gives supervision as positive examples to sentence selection, weighted by $\lambda_1$, and "NON-SELECT" sentences also give supervision as "NOTENOUGHINFO" evidence to claim verification, weighted by $\lambda_2$.
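Under the column layout just described, the label matrix and the soft-label cross-entropy can be written down directly (a numpy sketch; `label_matrix` and `cross_entropy` are illustrative names, not from the released code):

```python
import numpy as np

def label_matrix(lam1, lam2):
    """Compounded-label embedding matrix (one label per column).

    Columns: SUPPORTS, REFUTES, NOTENOUGHINFO, SELECT, NON-SELECT.
    Rows correspond to the 5-dim model output [y_cv (3 dims); y_ss; 1 - y_ss].
    Off-one-hot entries carry the cross-task supervision.
    """
    Y = np.eye(5)
    Y[3, 0] = lam1   # SUPPORTS evidence also supervises SELECT
    Y[3, 1] = lam1   # REFUTES evidence also supervises SELECT
    Y[2, 4] = lam2   # NON-SELECT sentences also supervise NOTENOUGHINFO
    return Y

def cross_entropy(y_true, y_hat, eps=1e-12):
    """Cross-entropy between a (possibly soft) target column and a prediction."""
    return float(-(y_true * np.log(y_hat + eps)).sum())
```

A training example labeled, say, "SUPPORTS" then simply uses column 0 of `label_matrix(lam1, lam2)` as its target in the objective above.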

Experimental Setup
We focused on comparing the following five NSMN training setups for sentence selection and claim verification. We obtain upstream document retrieval data using the method in Nie et al. (2019a). Training details are in the appendix.

Table 2: Final performance, evidence recall, model size, and data consumption (until convergence) for all 5 setups. We measure data consumption as the amount of data the model used for parameter updating; e.g., 10K updates with batch size 32 consume 320K data points. 'D.M.' = direct mixing, 'C.L.' = compounded-label, 'MTL.' = multi-task learning, 'Rdc-Pip.' = pipeline with reduced size, 'Pip.' = pipeline (Nie et al., 2019a).
Pipeline: We train separate sentence selection and claim verification models as in Nie et al. (2019a).

Multi-task Learning:
We follow the neural multi-task learning setup called alternate training (Dong et al., 2015; Luong et al., 2016; Hashimoto et al., 2017), where each batch contains examples from a single task only. We build a single NSMN model for both selection and verification and alternately optimize the two tasks.
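The alternate-training schedule can be sketched as a generator that interleaves task-exclusive batches (illustrative; the function name is ours):

```python
def alternate_batches(selection_batches, verification_batches):
    """Alternate-training schedule: each yielded batch comes from a single
    task only, and the two tasks take turns updating the shared model."""
    for sel, ver in zip(selection_batches, verification_batches):
        yield ("selection", sel)
        yield ("verification", ver)
```

Direct mixing, by contrast, places examples of both tasks in the same batch, so no such scheduling is needed.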
Direct Mixing: We simply blend the input examples of the two tasks into the same batch, providing additional simplicity over our multi-task learning setup in which batches need to be task-exclusive.
Compounded-Label Training: We also blend the inputs of the two tasks but, in contrast to direct mixing, we use the compounded-label embedding described in Sec. 3 for optimization and downsample the input examples to reduce training time.
Reduced Pipeline: This is the same pipeline setup as described above, except that we reduce the model sizes for both sentence selection and verification such that the total model size equals that of all other setups, which use only a single joint model. This experiment gives a fair comparison between the setups by canceling out the parameter-size variance. Table 1 shows a comparison of the first four setups.


Results and Analysis
FEVER Score Performance: We observe from Table 2 that compounded-label training outperforms both the multi-task learning and direct mixing setups. (The improvements of compounded-label over the first three entries in Table 2 are significant with p < 10^-5, while the improvement of the full pipeline over compounded-label is significant with p < 0.05; statistical significance was computed with a bootstrap test using 100K iterations (Noreen, 1989; Efron and Tibshirani, 1994).) We speculate that the performance gap arises because, in the multi-task and direct mixing setups, the same model is trained with separate and different supervision for the two tasks, resulting in oscillation and making it difficult to reach a better global minimum. In the compounded-label setup, by contrast, training the model on one task always gives subtly-controlled supervision on the other task. This not only applies natural regularization on the targeted task itself, but also pushes the model towards a better state for both tasks. Next, we also show that the compounded-label setup achieves a higher FEVER score than the reduced-pipeline setup (3rd row in Table 2), indicating its ability to model the two tasks jointly in a more compact and parameter-efficient way. Although the full pipeline setup gives a slightly higher FEVER score, the compounded-label setup has the advantage of reducing parameter size by one-third, requiring less than half the training computation, and improving training efficiency (elaborated on in the following subsection). Finally, we also compare recall scores, since recall is most related to the FEVER score, as validated by Nie et al. (2019a).

Efficiency: In Fig. 2, we show the training efficiency of the different approaches by tracking performance against the number of data points consumed. Parameter update settings are equal across all experiments and thus give an accurate depiction of the speedup, independent of batch size, etc. For fair comparison, there is no FEVER score for the first 22 × 320K data points in the pipeline setup, since these data points are consumed in the separate upstream sentence selection training. The compounded-label training setup exhibits a more stable training curve than the other setups during initial training, and reaches a 60%+ FEVER score after seeing only 1,280K data points. This indicates that the compounded-label setting allows the model to quickly reach a stable and functional state, which is valuable for online learning on streaming data, where the model is trained with real-time human feedback. On the contrary, the performance of the multi-task learning and direct mixing setups fluctuates at a low level during the initial training stages, which shows that optimization oscillation makes training difficult in these setups.

Blind Test Results: In Table 3 we compare the two setups on the blind test set. Compounded-label training achieved a 61.65% FEVER score and 66.21% label score (LA), while the pipeline setup got 62.69% and 66.20% for FEVER score and LA, respectively. Since the upper bound is dependent on document retrieval quality, we report the upper bound of these scores as 92.42%, following Nie et al. (2019a).
Our method was able to yield results comparable to the pipeline model on FEVER score, and even higher results on label score, with a simpler design, faster convergence, and only two-thirds the number of parameters.
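The bootstrap significance test cited above can be sketched as follows (a minimal pure-Python illustration of the Noreen-style paired bootstrap; the paper uses 100K iterations over per-claim FEVER-score indicators, and the function name is ours):

```python
import random

def paired_bootstrap(scores_a, scores_b, iters=10000, seed=0):
    """Approximate p-value for 'system A beats system B'.

    scores_a/scores_b: per-example 0/1 scores (same examples, same order).
    Resample example indices with replacement and count how often A's
    resampled total exceeds B's.
    """
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return 1.0 - wins / iters   # fraction of resamples where A fails to win
```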

Conclusion
We present a simple compounded-label setup for jointly training sentence selection and claim verification. This setup provides higher training efficiency and lower parameter size while still achieving comparable results to the pipeline approach.