Neural Multi-Task Learning for Stance Prediction

We present a multi-task learning model that leverages large amounts of textual information from existing datasets to improve stance prediction. In particular, we utilize multiple NLP tasks under both unsupervised and supervised settings for the target stance prediction task. Our model obtains state-of-the-art performance on a public benchmark dataset, Fake News Challenge, outperforming current approaches by a wide margin.


Introduction
For journalists and news agencies, fact checking is the task of assessing the veracity of information and claims. Due to the large volume of claims, automating this process is of great interest to the journalism and NLP communities. A main component of automated fact checking is stance detection, which aims to automatically determine the perspective (stance) of given documents with respect to given claims as agree, disagree, discuss, or unrelated.
Previous work (Riedel et al., 2017; Hanselowski et al., 2018; Baird et al., 2017; Chopra et al., 2017; Xu et al., 2018) presented various neural models for stance prediction. One of the challenges for these models is the limited size of human-labeled data, which can adversely affect the resulting performance for this task. To overcome this limitation, we propose to supplement data from other similar Natural Language Processing (NLP) tasks. However, this is not a straightforward process due to differences between NLP tasks and data sources. We address this problem using an effective multi-task learning approach which shows sizable improvement for the task of stance prediction on the Fake News Challenge benchmark dataset. The contributions of this work are as follows:
• To the best of our knowledge, we are the first to apply multi-task learning to the problem of stance prediction across different NLP tasks and data sources.
• We present an effective multi-task learning model, and investigate the effectiveness of different NLP tasks for stance prediction.
• Our model outperforms the state-of-the-art baselines on a publicly available benchmark dataset by a substantial margin.

Multi-task Learning Framework
We propose a multi-task learning framework which utilizes the commonalities and differences across existing NLP datasets and tasks to improve stance prediction performance. More specifically, we use both unsupervised and supervised pretraining on multiple tasks, and then fine-tune the resulting model on our target stance prediction task.

Model Architecture
The architecture of our model is shown in Figure 1. We use a transformer encoder (Vaswani et al., 2017) that is shared across different tasks to encode the inputs before feeding the contextualized embeddings into task-specific output layers. In what follows, we explain the different components of our model.

Input Representation
The input sequence x = {x_1, ..., x_l} of length l is either a single sentence or multiple texts packed together. The input is first converted to word piece sequences (Wu et al., 2016) and, in the case of multiple texts, a special token [SEP] is inserted between the tokenized sequences. Another special token [CLS] is inserted at the beginning of the sequence, which corresponds to the representation of the entire sequence.

Transformer Encoder
We use a bidirectional Transformer encoder that takes x as input and produces contextual embedding vectors C ∈ R^{d×l} via multiple layers of self-attention (Devlin et al., 2019).
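To make the input format concrete, the following is a minimal sketch of the packing step as described in the text. The function names `toy_tokenize` and `pack_inputs` are hypothetical, and the whitespace tokenizer is only a stand-in for the word piece tokenizer, which this sketch does not reproduce.

```python
from typing import List

CLS, SEP = "[CLS]", "[SEP]"


def toy_tokenize(text: str) -> List[str]:
    """Stand-in for word piece tokenization (assumption): lowercase, split on whitespace."""
    return text.lower().split()


def pack_inputs(*texts: str) -> List[str]:
    """Pack one or more texts into a single token sequence.

    [CLS] is placed at the beginning, and [SEP] is placed between
    consecutive tokenized texts, following the description in the text.
    """
    tokens = [CLS]
    for i, text in enumerate(texts):
        if i > 0:
            tokens.append(SEP)
        tokens.extend(toy_tokenize(text))
    return tokens


# A single sentence and a claim-document pair (illustrative strings only).
single = pack_inputs("The claim is false")
pair = pack_inputs("The claim is false", "Officials denied the report")
```

The position of [CLS] is fixed at index 0, which is why the first column of the encoder output can later serve as the sequence-level representation.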

Task-specific Output Layers
For single-sentence classification tasks, we take the vector from the first column in C, corresponding to the special token [CLS], as the semantic representation of the input sentence x. We then feed this vector through a linear layer followed by softmax to obtain the prediction probabilities. For pairwise classification tasks, we use the answer module from the stochastic answer network (SAN) (Liu et al., 2018) as the output classifier. It performs K-step reasoning over the two pieces of text with bi-linear attention and a recurrent mechanism, producing output predictions at each step and iteratively refining its predictions. At training time, some predictions are randomly discarded (stochastic dropout) before averaging, and during inference all output probabilities are utilized.
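The averaging over the K step-level predictions can be sketched as follows. This is an illustration under stated assumptions, not the SAN implementation: the function name `average_step_predictions` is hypothetical, the per-step distributions are passed in directly rather than computed by attention and a recurrent cell, and the fallback that keeps one step when every step is dropped is an assumption of this sketch.

```python
import random
from typing import List, Optional


def average_step_predictions(
    step_probs: List[List[float]],
    drop_rate: float = 0.1,
    training: bool = False,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Average K per-step class distributions into one prediction.

    Training: each step's prediction is discarded with probability
    `drop_rate` before averaging. Inference: all K steps are averaged.
    """
    if training:
        rng = rng or random.Random()
        kept = [p for p in step_probs if rng.random() >= drop_rate]
        if not kept:
            # Assumption: never average over an empty set of steps.
            kept = [rng.choice(step_probs)]
    else:
        kept = step_probs
    num_classes = len(step_probs[0])
    return [sum(p[c] for p in kept) / len(kept) for c in range(num_classes)]


# K = 5 illustrative step predictions over the four stance classes.
steps = [
    [0.4, 0.3, 0.2, 0.1],
    [0.5, 0.2, 0.2, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.7, 0.1, 0.1, 0.1],
    [0.8, 0.1, 0.05, 0.05],
]
inference_probs = average_step_predictions(steps)
training_probs = average_step_predictions(steps, training=True, rng=random.Random(0))
```

Because every kept step is a probability distribution, the average is also a probability distribution regardless of which steps are dropped.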

Unsupervised Pre-training
To utilize large amounts of text data, we use the BERT model which pre-trains the transformer encoder parameters with two unsupervised learning tasks: masked language modeling, for which the model has to predict a randomly masked out word in the sequence, and next sentence prediction, where two sentences are packed and fed into the encoder and the embedding corresponding to the [CLS] token is used to predict whether they are adjacent sentences (Devlin et al., 2019).

Multi-task Supervised Pre-training
In addition to learning contextual representations under an unsupervised setting with large data, we investigate whether existing NLP tasks that are conceptually similar to stance prediction can improve performance. We introduce four types of such tasks for pre-training:
Textual Entailment: Given two sentences, a premise and a hypothesis, the model determines whether the hypothesis is an entailment, contradiction, or neutral with respect to the premise. Since stance prediction could be cast as a textual entailment task, we investigate whether adding this task benefits our model.
Paraphrase Detection: Given a pair of sentences, the model should predict whether they are semantically equivalent. This task is considered because we may be able to benefit from detecting document sentences that are equivalent to claims.
Question Answering: Question answering is similar to the stance prediction task in that the model has to make a prediction given a question and a passage containing several sentences.
Sentiment Analysis: Fake claims or articles may exhibit stronger sentiment, so we explore whether pre-training on this task is beneficial.

Training Procedure and Details
There are two stages in our training procedure: multi-task supervised pre-training, and fine-tuning on stance prediction. Before the training stages, the transformer encoder is initialized with pretrained parameters to take advantage of knowledge learned from unlabeled data.
During multi-task pre-training, we randomly pick an ordering of the tasks at the start of each epoch, and train on 10% of each task's training data in that order. This process is repeated 10 times in each epoch so that every training example is used once per epoch. The shared encoder is learned over all tasks while each task-specific output layer is learned only for its corresponding task.
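The schedule for one pre-training epoch can be sketched as below. The generator name `multitask_epoch` is hypothetical, the task names and data are placeholders, and shuffling the examples within each task before slicing is an assumption of this sketch; the text only specifies the random task ordering and the 10% slices.

```python
import random
from typing import Dict, Iterator, List, Optional, Sequence, Tuple


def multitask_epoch(
    task_data: Dict[str, Sequence],
    num_chunks: int = 10,
    rng: Optional[random.Random] = None,
) -> Iterator[Tuple[str, List]]:
    """Yield (task_name, examples) slices for one multi-task epoch.

    A random task ordering is drawn once per epoch. Each pass over that
    ordering yields the next 1/num_chunks slice of every task, so after
    num_chunks passes every example has been yielded exactly once.
    """
    rng = rng or random.Random()
    order = list(task_data)
    rng.shuffle(order)

    # Assumption: examples are shuffled within each task before slicing.
    shuffled = {}
    for name, examples in task_data.items():
        examples = list(examples)
        rng.shuffle(examples)
        shuffled[name] = examples

    for chunk in range(num_chunks):
        for name in order:
            examples = shuffled[name]
            start = chunk * len(examples) // num_chunks
            end = (chunk + 1) * len(examples) // num_chunks
            if start < end:
                yield name, examples[start:end]


# Placeholder data: integers stand in for training examples.
toy_tasks = {"snli": list(range(25)), "qqp": list(range(100, 140))}
schedule = list(multitask_epoch(toy_tasks, rng=random.Random(0)))
```

In a training loop, each yielded slice would be split into mini-batches and used to update the shared encoder together with the output layer of the named task only.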
For fine-tuning, the task-specific output layers for pre-training are discarded, and a randomly initialized output layer is added for stance prediction. Then the entire model is fine-tuned over the training set for stance prediction.
For both multi-task pre-training and fine-tuning, we train with cross-entropy loss at each output layer. We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 3e-5, β_1 = 0.9, β_2 = 0.999, and a mini-batch size of 16 for 10 epochs. For the SAN answer module we set K = 5 and use a stochastic dropout rate of 0.1.

Data
The BERT model was pre-trained on the BooksCorpus (Zhu et al., 2015) and English Wikipedia. For multi-task pre-training, we use the following datasets:
SNLI: Stanford Natural Language Inference is the standard entailment classification task that contains 549K training sentence pairs after removing examples with no gold labels (Bowman et al., 2015). The relation labels are entailment, contradiction, and neutral.
MNLI: Multi-genre Natural Language Inference is a large-scale entailment classification task from a diverse set of sources with the same relation classes as SNLI (Williams et al., 2018). We use its training set that contains 393K pairs of sentences.
RTE: Recognizing Textual Entailment is a binary entailment task with 2.5K training examples (Wang et al., 2019).
QQP: Quora Question Pairs is a QA dataset for binary classification where the goal is to predict whether two questions are semantically equivalent. We use its 364K training examples for pre-training.
MRPC: Microsoft Research Paraphrase Corpus consists of automatically extracted sentence pairs from news sources, with human annotations for whether the pairs are semantically equivalent (Dolan and Brockett, 2005). The training set used for pre-training contains 3.7K sentence pairs.
QNLI: Question Natural Language Inference (Wang et al., 2019) is a QA dataset which is derived from the Stanford Question Answering Dataset (Rajpurkar et al., 2016) and used for binary classification. For a given question-sentence pair, the task is to predict whether the sentence contains the answer to the question. QNLI contains 108K training pairs.
SST-2: Stanford Sentiment Treebank is used for binary classification for sentences extracted from movie reviews (Socher et al., 2013). We use the GLUE version that contains 67K training sentences (Wang et al., 2019).
IMDB: The Large Movie Review Dataset contains 50K movie reviews which are categorized as either positive or negative in terms of sentiment orientation (Maas et al., 2011).
For fine-tuning on stance prediction, we use the dataset provided by the Fake News Challenge Stage 1 (FNC-1), consisting of a total of 75K claim-document pairs collected from a variety of sources such as rumor sites and social media. The claim-document relation classes are: agree, disagree, discuss, and unrelated. The FNC-1 dataset has an imbalanced distribution over stance labels, with particularly little data for the agree (7.3%) and disagree (1.7%) classes.

Evaluation Metrics
For evaluation, the standard measures of accuracy and macro-F1 are used. Additionally, as per previous work, weighted accuracy is also reported, which is a two-level scoring scheme that gives 0.25 weight to correctly predicting examples as related vs. unrelated, and an additional 0.75 weight to correctly classifying related examples as agree, disagree, or discuss.
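The two-level weighted accuracy can be sketched as follows. The function name `weighted_accuracy` is hypothetical, and dividing by the maximum achievable score (1.0 per related example, 0.25 per unrelated example) is an assumption of this sketch about how the raw score is normalized.

```python
from typing import Sequence

RELATED = {"agree", "disagree", "discuss"}


def weighted_accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Two-level score: 0.25 for a correct related/unrelated decision,
    plus 0.75 for the exact stance when the gold label is related.

    The total is divided by the best achievable score (assumption).
    """
    score = 0.0
    best = 0.0
    for g, p in zip(gold, pred):
        g_related = g in RELATED
        p_related = p in RELATED
        if g_related == p_related:
            score += 0.25  # level 1: related vs. unrelated is correct
        if g_related and g == p:
            score += 0.75  # level 2: exact stance among related classes
        best += 1.0 if g_related else 0.25
    return score / best if best else 0.0


# Illustrative labels only, not taken from FNC-1.
gold = ["agree", "disagree", "discuss", "unrelated"]
pred = ["agree", "discuss", "unrelated", "unrelated"]
example_score = weighted_accuracy(gold, pred)
```

In this example the first pair earns 1.0, the second earns 0.25 (related, but the wrong stance), the third earns nothing, and the fourth earns 0.25, so the raw score is 1.5 out of a maximum of 3.25.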

Baselines
We compare our model with existing state-of-the-art stance prediction models including the top-ranked models from FNC-1 and neural models:

Gradient Boosting: This baseline (https://github.com/FakeNewsChallenge/fnc-1-baseline) uses a gradient-boosting classifier with hand-crafted features including n-gram features, and indicator features for polarity and refutation.
TALOS (Baird et al., 2017): An ensemble of gradient-boosted decision trees and a convolutional neural network.
UCL (Riedel et al., 2017): A Multi-Layer Perceptron (MLP) with Bag-of-Words and similarity features extracted from claims and documents.
Memory Network: A feature-light end-to-end memory network that attends over convolutional and recurrent encoders.
Adversarial Domain Adaptation (Xu et al., 2018): This baseline uses a domain classifier with gradient reversal on top of a convolutional network and TF-IDF features to perform adversarial domain adaptation from another fact-checking dataset (Thorne et al., 2018) to FNC.

Results and Discussion
The performance of the existing models is shown in rows 1-5 of Table 1, and that of our models (MTransSAN) in rows 8-21. All variants of MTransSAN consistently outperform existing models on all three metrics by a considerable margin. In particular, our best MTransSAN (row 14) achieves 6.0 and 14.4 points of absolute improvement in terms of weighted accuracy and macro-F1, respectively, over existing state-of-the-art results.
We also compare MTransSAN with a model that has the same architecture but no pre-training on the NLP tasks (TransSAN), shown in row 7, and with another version of that model that uses a linear layer instead of the SAN answer module (TransLinear), shown in row 6. Using the SAN answer module improves over a linear layer for all three metrics, and most MTransSAN models outperform the TransSAN model. Our best MTransSAN model exceeds TransSAN by 3.1 and 6.5 points in weighted accuracy and macro-F1, respectively, supporting the effectiveness of model pre-training with NLU tasks. Note that even the TransLinear model outperforms previous state-of-the-art models by a wide margin, suggesting that a neural model pre-trained on large amounts of unlabeled data and fine-tuned on stance prediction is superior to models that require hand-crafted features.
Additionally, we conduct experiments where we use different combinations of language understanding tasks for pre-training. We pre-train with single tasks, multiple tasks with the same task type, and joint learning across multiple task types. For textual entailment (rows 8-11), we see that pre-training on SNLI gives the best improvement, and that pre-training across all three entailment tasks does not improve over training on SNLI alone. However, for paraphrase detection (rows 12-14) the combination of QQP and MRPC gives the best results across all MTransSAN models. This suggests that paraphrase detection might be the most useful task type among the NLP tasks in terms of boosting stance prediction performance. Question answering and sentiment analysis (rows 15-18), on the other hand, give smaller performance improvements than paraphrase detection. Models trained on joint tasks (rows 19-21) do not outperform our best model either.
Overall, we find that utilizing the BERT model results in large improvements compared to the baselines, which is not unexpected given the success of BERT. We also show that our multi-task learning approach gives even further improvements upon BERT by a wide margin.

Related Work
Stance Prediction. This task is an important component for fact checking and veracity inference. To address stance prediction, Riedel et al. (2017) used a Multi-Layer Perceptron (MLP) with bag-of-words and similarity features extracted from input documents and claims, and Hanselowski et al. (2018) presented a deep MLP trained using a rich feature representation based on unigrams, non-negative matrix factorization, and latent semantic indexing. Baird et al. (2017) presented an ensemble of gradient-boosted decision trees and a deep convolutional neural network, while Chopra et al. (2017) proposed a model based on a bi-directional LSTM and an attention mechanism. While these works utilized rich hand-crafted features, Mohtarami et al. (2019) proposed strong end-to-end feature-light memory networks for stance prediction in mono- and cross-lingual settings. Recently, Xu et al. (2018) presented a state-of-the-art model based on adversarial domain adaptation with more labeled data, but they limited their model to only using data from the same stance prediction task. In this work, we remove this limitation and use labeled data from other tasks that are similar to stance prediction through multi-task learning.
Multi-task and Transfer Learning. Multi-task and transfer learning have been long-studied problems in machine learning and NLP (Caruana, 1997; Collobert and Weston, 2008; Pan and Yang, 2010). More recently, numerous methods for unsupervised pre-training of deep contextualized models for transfer learning have been proposed (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019), and Conneau et al. (2017) and McCann et al. (2017) presented supervised pre-training methods for NLI and translation. Recent work on multi-task learning has focused on designing effective neural architectures (Hashimoto et al., 2017; Søgaard and Goldberg, 2016; Sanh et al., 2018; Ruder et al., 2017). Combining these two lines of work, Clark et al. (2019) explored fine-tuning the contextualized models with multiple natural language understanding tasks. In this work, we depart from previous work by specifically studying the effects of multi-task fine-tuning for the stance prediction task with pre-trained models.

Conclusion and Future Work
We present an effective multi-task learning model that transfers knowledge from existing NLP tasks to improve stance prediction. Our model outperforms state-of-the-art systems by 6.0 and 14.4 points in weighted accuracy and macro-F1, respectively, on the FNC-1 benchmark dataset. In future work, we plan to further investigate our model to more specifically identify and illustrate its source of improvement, improve our transfer learning approach for better fine-tuning, and investigate the utility of our model in other fact-checking subproblems such as evidence extraction.