Towards Automatic Fake News Detection: Cross-Level Stance Detection in News Articles

In this paper, we propose to adapt the four-stage pipeline proposed by Zubiaga et al. (2018) for the Rumor Verification task to the problem of Fake News Detection. We show that the recently released FNC-1 corpus covers two of its steps, namely the Tracking and the Stance Detection tasks. We identify asymmetry in length in the input as a key characteristic of the latter step when adapted to the framework of Fake News Detection, and propose to handle it as a specific type of Cross-Level Stance Detection. Inspired by theories from the field of Journalism Studies, we implement and test two architectures that successfully model the internal structure of an article and its interactions with a claim.


Introduction
The rise of social media platforms, which allow for real-time posting of news with little (or no) editorial review at the source, is responsible for an unprecedented growth in the amount of information available to the public. While this constitutes an invaluable source of free information, it also facilitates the spread of misinformation. In particular, the literature distinguishes between rumors, i.e., pieces of information which are unverified at the time of posting and can therefore turn out to be true or false, and fake news (or hoaxes), i.e., false stories which are deliberately made up with the intent to mislead readers and spread disinformation.
Both Rumor Verification (RV) and Fake News Detection (FND) constitute very difficult tasks even for trained professionals. Therefore, approaching them in an end-to-end fashion has generally been avoided. Both tasks, however, can be easily split into a number of sub-steps. For instance, Zubiaga et al. (2018) proposed a model for RV which consists of four stages: a rumor detection stage, where potentially rumorous posts are identified, followed by a tracking stage, where posts concerning the identified rumor are collected; after determining the orientation expressed in each post with respect to the rumor (stance detection), the final truth value of the rumor is obtained by aggregating those single stance judgments (veracity classification). As shown in Figure 1, this pipeline can be naturally adapted to FND.
In recent years, several efforts have been made by the research community toward the automation of some of these stages, in order to provide effective tools to enhance the performance of human journalists in rumor and fake news debunking (Thorne and Vlachos, 2018). Concerning FND, Pomerleau and Rao (2017) recently released a dataset for the Stance Detection step in the framework of the Fake News Challenge 1 (FNC-1). The core of the corpus is a collection of articles discussing 566 claims, 300 of which come from the EMERGENT dataset (Ferreira and Vlachos, 2016). Each article is summarized in a headline and labeled as agreeing with (AGR), disagreeing with (DSG) or discussing (DSC) the claim. Additionally, unrelated (UNR) samples were created by pairing headlines with random articles. The goal of the challenge was to classify the pairs constituted by a headline and an article as AGR, DSG, DSC or UNR.
Following the pipeline discussed above, it is clear that the FNC-1 actually covers two of the four steps, namely: (1) the tracking step, consisting in filtering out the irrelevant UNR samples; (2) the actual stance detection step, consisting in the classification of a related headline/article pair as AGR, DSG or DSC.

Figure 1: The first row describes the corresponding step, whereas the second row shows the outputs of each step for both the RV and the fake news detection (FND) tasks. The red rectangle indicates the steps covered by the FNC-1 corpus. Figure adapted from Zubiaga et al. (2018).
Note that the amount of semantic understanding needed for the second task is much higher than for the first. In fact, even humans struggle in classifying the related samples, as empirically demonstrated by Hanselowski et al. (2018): the inter-annotator agreement of five human judges drops from a Fleiss' κ of .686 to .218 after filtering out the UNR samples. For this reason, we concentrate on the stance detection step, and we make the following contributions: 1. We identify asymmetry in length between headlines and articles as a key characteristic of the FNC-1 corpus: on average, an article contains more than 30 times the number of words contained in its associated headline. This is peculiar with respect to most of the commonly used datasets for stance detection (Mohammad et al., 2017) and requires the development of architectures specifically tailored to this considerable asymmetry. Following the terminology introduced by Jurgens et al. (2014) for Semantic Similarity, we propose to handle the problem as a Cross-Level Stance Detection task. To our knowledge, this is the first time that this task is investigated in isolation.
2. Inspired by theoretical principles in the field of Journalism Studies, we propose two simple neural architectures to model the argumentative structure of an article, and its complex interplay with a headline. We demonstrate that our systems can beat a strong feature-based baseline, based on one of the FNC-1 winning architectures, and that they can successfully model the internal structure of a news article and its relations with a claim, leveraging only word embeddings as input.
Related Work

Stance Detection
Stance Detection (SD) has been defined as the task of determining the attitude expressed in a short piece of text with respect to a target, usually expressed with one or a few words (such as Feminism or Climate Change, Mohammad et al. (2016)). In fact, most of the available corpora for SD consider very short samples, such as tweets. SD has become very popular in recent years, resulting in a large number of publications (Mohammad et al., 2017). To our knowledge, however, no one has explicitly considered the problem of stance detection given two inputs which are considerably asymmetric in length, that is, a long and structured document and a target expressed as a complete sentence rather than as a concept. For this reason, we propose to call the task introduced in the FNC-1 challenge Cross-Level Stance Detection. This is in line with the definition of Cross-Level Semantic Similarity, which measures the degree to which the meaning of a larger (in terms of length) linguistic item is captured by a smaller item (Jurgens et al., 2014).
After reporting on the systems participating in the FNC-1, which released the first SD dataset collecting long documents, we briefly mention some of the most relevant works on SD using Twitter data.
Fake News Challenge. With more than 50 participating groups, the FNC-1 drew high interest from both the research community and industry. Due to the high number of UNR samples, which constituted almost three quarters of the training set, most of the groups proposed architectures which could perform well on this specific class, that is, in the tracking step of the FND pipeline. The second (Hanselowski et al., 2017) and third (Riedel et al., 2017) classified teams proposed multi-layer perceptron (MLP)-based systems. The best performing system (Baird et al., 2017) is an ensemble of a convolutional neural network (CNN) and a gradient-boosted decision tree. All models, with the exception of the CNN, take as input a number of hand-engineered features. Recently, Hanselowski et al. (2018) enriched the feature set used in Hanselowski et al. (2017) and added a stacked BiLSTM layer to their model, resulting in a modest gain in performance.

Figure 2: Article from the FNC-1 test set (sample no. 998), analyzed following the inverted pyramid principles (Scanlan, 1999), with sentences labeled as LEAD, BODY or TAIL: 1. "An astonishing image appears to show a giant crab, nearly 50 feet across, lurking in the harbor at Whitstable, Kent, and while some assert that it is a playful hoax, others believe they have found evidence of a genuine aquatic monster." 2. [...] "The giant animal is shaped like an edible crab, a species commonly found in British waters, but which only grows to be ten inches across, on average." 3. [...] Notice that single sentences may express a different stance with respect to a claim, while others can be irrelevant, as shown in the leftmost column.
All models described above performed very well on the UNR classification (with F1 usually above .98 for this class), while achieving considerably worse results on the related samples (Hanselowski et al., 2018).
Rumor Stance Detection on Tweets. The most commonly used datasets for rumor stance detection, the RumorEval and the PHEME (Zubiaga et al., 2016b) corpora, collect tweets. State-of-the-art results on the PHEME corpus have been obtained by Aker et al. (2017), who used a very rich set of problem-specific features. Their model beat the previous state-of-the-art system by Zubiaga et al. (2016a), who modeled the tree-structured Twitter conversations using an LSTM, taking as input one conversation branch at a time.

Journalism Studies: News-writing Prose
Each genre develops its own peculiar narrative forms, which allow for the most effective transmission of a message. In modern news-writing prose, especially in Anglo-Saxon journalism, the inverted pyramid style is widely adopted (Scanlan, 1999).
The key element of this highly standardized template is that the most newsworthy facts (the so-called 5W) are presented at the very beginning of the article, the lead, with the remaining information following, in order of importance, in the body of the article: in this section, we can find less essential elements such as quotes, interviews and background or explanatory information; any additional input, such as related stories, images and credits, is put in the very last paragraphs, the tail (Scanlan, 1999). Usually, no more than one or two ideas are expressed in the same paragraph (Sun Technical Publications, 2003). These characteristic elements are clearly visible in Figure 2. This style is particularly suited for rapidly evolving breaking news events, where a journalist can update an article by attaching a new paragraph with the latest updates at the beginning of it. Moreover, putting the most newsworthy facts at the beginning of an article allows impatient readers to quickly decide on their level of interest in the report.
After manual analysis of excerpts of the FNC-1 corpus, we concluded that most articles were actually written following the inverted pyramid principles.

Encoding the article
Based on the elements of Journalism Studies discussed above, we propose a simple architecture based on bidirectional conditional encoding (Augenstein et al., 2016) to encode an article split into $n$ sentences. Each sentence $S_i$ is first converted into its embedding representation $E_{S_i} \in \mathbb{R}^{e \times s_i}$, where $e$ is the embedding size and $s_i$ is the length of the $i$-th tokenized sentence. Then, we encode the article using $\text{BiLSTM}_A$, a bidirectional LSTM which reads the article sentence by sentence in backward order, initializing the first states of its forward and backward components with the last states it produced after processing the previous sentence (Figure 3).
Notice that we process the article from the bottom to the top, as we assume the most salient information to be concentrated in the lead. By considering an article as an ensemble of sentences which are separately encoded, each conditioned on the preceding ones, we can model the relationship of each sentence with respect to the others and, at the same time, reduce the number of parameters.

Figure 4: Dotted arrows represent conditional encoding, and networks with the same color share their weights. The system reads an article from the last sentence to the first, processing each sentence twice: first conditioning on the headline, then conditioning on the previous sentence. Due to lack of space, only the first two sentences of the article are represented.
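The conditioning scheme can be sketched as follows. This is a minimal numpy illustration, not the actual implementation: a plain tanh RNN cell stands in for the BiLSTM, and all function names and dimensions are hypothetical.

```python
import numpy as np

def rnn_encode(x, h0, Wx, Wh, b):
    """Run a simple tanh RNN over one sentence x (tokens x emb), from state h0."""
    h = h0
    states = []
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ Wx + h @ Wh + b)
        states.append(h)
    return np.stack(states), h  # all hidden states, last state

def encode_article(sentences, params, hidden):
    """Encode sentences in backward order (tail -> lead), initializing each
    sentence's state with the last state of the previously read sentence."""
    Wx, Wh, b = params
    h = np.zeros(hidden)                 # initial state for the last sentence
    encoded = []
    for sent in reversed(sentences):     # bottom-to-top, as in the paper
        states, h = rnn_encode(sent, h, Wx, Wh, b)
        encoded.append(states)
    return list(reversed(encoded))       # restore original sentence order

rng = np.random.default_rng(0)
e, hdim = 4, 8
params = (rng.normal(size=(e, hdim)) * 0.1,
          rng.normal(size=(hdim, hdim)) * 0.1,
          np.zeros(hdim))
article = [rng.normal(size=(5, e)), rng.normal(size=(3, e))]  # 2 toy sentences
enc = encode_article(article, params, hdim)
print(len(enc), enc[0].shape)  # 2 (5, 8)
```

Sharing one set of weights across sentences, as in the sketch, is what keeps the parameter count low while still letting information flow from each sentence to the next.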

Encoding the relationship between the headline and the article
After encoding the article, we model its relationship with the headline. As shown in Figure 2, we expect single sentences to express a potentially different stance with respect to the headline, while some sentences, especially in the body and the tail, can be irrelevant. For this reason, we separately evaluate the relationship of each sentence, conditioned on the previous sentence(s), with the headline. In this paper, we consider two approaches: Double-conditional Encoding. As a first method, we model the relationship between the headline and the article using conditional encoding.
First, the headline is encoded using a bidirectional LSTM. Then, we separately process each sentence of the article with $\text{BiLSTM}_H$, a BiLSTM conditioned on the last states of the BiLSTM which processed the headline. We finally stack $\text{BiLSTM}_A$ on top of $\text{BiLSTM}_H$. In this way, we obtain a matrix $H_{S_i} \in \mathbb{R}^{l \times s_i}$ for each sentence $S_i$.
Following Wang et al. (2018), we notate this as:

$$H_{S_i} = \text{BiLSTM}_A\big(\text{BiLSTM}_H(E_{S_i})\big) \quad (2)$$

This process is shown in Figure 4. In this way, we read each sentence $S_i$, which is encoded in a headline-specific manner, conditioning on the previous sentence(s). Clearly, it would also have been possible to obtain a hidden representation for each sentence by first conditioning on the previous sentence(s), and then on the headline. Preliminary experiments, however, showed worse results for this variant, suggesting that having the conditioning on the previous sentence(s) nearer to the decoder is beneficial for the cross-level stance detection task.
Co-matching Attention. We also explored the use of attention in order to connect the headline $H_H \in \mathbb{R}^{l \times c}$, encoded using a BiLSTM layer, with the article's sentences $H_{S_1}, \ldots, H_{S_n}$, embedded as explained in Subsection 3.1. Inspired by the architecture proposed by Wang et al. (2018) for multi-choice reading comprehension, we obtain a matrix $\tilde{H}_{S_i}$, attentively read with respect to the headline, for each sentence at position $i \in \{1, \ldots, n\}$ as follows: we first obtain an aggregated representation $\bar{H}_{S_i} \in \mathbb{R}^{l \times s_i}$ of the headline with respect to the $i$-th sentence (Eq. 4), obtained by dot product of $H_H$ with the attention weights $A_i \in \mathbb{R}^{c \times s_i}$ (Eq. 3); then, we obtain the co-matching states of each sentence with the headline using Eq. 5:

$$A_i = \text{softmax}\big((W H_H + b)^\top H_{S_i}\big) \quad (3)$$
$$\bar{H}_{S_i} = H_H A_i \quad (4)$$
$$\tilde{H}_{S_i} = \text{ReLU}\big(W_c\, [\bar{H}_{S_i} \ominus H_{S_i};\, \bar{H}_{S_i} \otimes H_{S_i}] + b_c\big) \quad (5)$$

where $W \in \mathbb{R}^{l \times l}$, $b \in \mathbb{R}^{l}$, $W_c \in \mathbb{R}^{l \times 2l}$ and $b_c \in \mathbb{R}^{l}$ are the parameters to learn. As in Wang et al. (2018), we use element-wise subtraction $\ominus$ and multiplication $\otimes$ to build matching representations of the headline.
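The attention and co-matching steps for a single headline/sentence pair can be sketched in numpy as follows. This is an illustrative reimplementation of the co-matching idea of Wang et al. (2018), not the paper's code; all names and dimensions are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_match(H_H, H_S, W, b, Wc, bc):
    """Co-match a headline encoding H_H (l x c) with one sentence encoding
    H_S (l x s): attend over headline positions, aggregate, then compare."""
    A = softmax((W @ H_H + b[:, None]).T @ H_S, axis=0)      # (c x s) weights
    H_bar = H_H @ A                                          # (l x s) aggregated headline
    M = np.concatenate([H_bar - H_S, H_bar * H_S], axis=0)   # (2l x s) sub/mul matching
    return relu(Wc @ M + bc[:, None])                        # (l x s) co-matching states

rng = np.random.default_rng(1)
l, c, s = 6, 4, 9
H_H, H_S = rng.normal(size=(l, c)), rng.normal(size=(l, s))
out = co_match(H_H, H_S,
               rng.normal(size=(l, l)), np.zeros(l),
               rng.normal(size=(l, 2 * l)), np.zeros(l))
print(out.shape)  # (6, 9)
```

Note how the output keeps one co-matching state per sentence token, so the later self-attention step can still soft-select the most relevant positions.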
Self-attention. After encoding the relationship between the headline and the article, we employ a self-attention mechanism similar to that of Yang et al. (2016) in order to soft-select the most relevant elements of the encoded sentence. Given the sequence of vectors $\{h_1, \ldots, h_{s_i}\}$ in $H_{S_i}$, obtained with the double-conditional encoding or the co-matching attention approaches described above, the final vector representation of the $i$-th sentence $S_i$ is obtained as follows:

$$u_{it} = \tanh(W_w h_t + b_w) \quad (6)$$
$$\alpha_t = \frac{\exp(u_{it}^\top u_w)}{\sum_{t'} \exp(u_{it'}^\top u_w)} \quad (7)$$
$$s_i = \sum_t \alpha_t h_t \quad (8)$$

where the hidden representation of the word at position $t$, $u_{it}$, is obtained through a one-layer MLP (Eq. 6). The normalized attention weight $\alpha_t$ is then obtained through a softmax operation (Eq. 7). Finally, $s_i$ is computed as a weighted sum of all hidden states $h_t$ with the weights $\alpha_t$ (Eq. 8).
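This word-level attention pooling is compact enough to sketch directly; the following is an illustrative numpy version of the Yang et al. (2016) mechanism, with hypothetical names and toy dimensions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sentence_vector(H, Ww, bw, uw):
    """Collapse per-token states H (tokens x l) into one sentence vector
    via word-level attention: MLP, softmax scores, weighted sum."""
    U = np.tanh(H @ Ww + bw)   # one-layer MLP over each hidden state
    alpha = softmax(U @ uw)    # normalized attention weights over tokens
    return alpha @ H           # weighted sum of the hidden states

rng = np.random.default_rng(2)
tokens, l = 7, 6
H = rng.normal(size=(tokens, l))
s_i = sentence_vector(H, rng.normal(size=(l, l)), np.zeros(l), rng.normal(size=l))
print(s_i.shape)  # (6,)
```

The learned context vector (`uw` above) is what lets the model score each token's relevance without any extra supervision.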

Decoding
Following the inverted pyramid principles, according to which the most relevant information is concentrated at the beginning of the article, we aggregate the sentence vector representations $\{s_1, \ldots, s_n\}$ using a backward LSTM. The final prediction $\hat{y}$ is then obtained with a softmax operation over the tagset.
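The decoding stage can be sketched as follows, again with a plain tanh RNN standing in for the LSTM and all names being illustrative: reading the sentence vectors backward places the lead closest to the classifier.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode(sent_vecs, Wx, Wh, b, Wo, bo):
    """Aggregate sentence vectors with a backward RNN (lead is read last,
    so its information sits closest to the classifier), then predict."""
    h = np.zeros(Wh.shape[0])
    for s in reversed(sent_vecs):   # tail -> lead
        h = np.tanh(s @ Wx + h @ Wh + b)
    return softmax(h @ Wo + bo)     # distribution over {AGR, DSG, DSC}

rng = np.random.default_rng(3)
d, hdim, tags = 6, 5, 3
vecs = [rng.normal(size=d) for _ in range(4)]
y = decode(vecs, rng.normal(size=(d, hdim)), rng.normal(size=(hdim, hdim)),
           np.zeros(hdim), rng.normal(size=(hdim, tags)), np.zeros(tags))
print(y.shape)  # (3,)
```
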

Data and Preprocessing
We downloaded the FNC-1 corpus from the challenge website. As we wanted to concentrate on the cross-level stance detection sub-task, we only considered related (AGR, DSG and DSC) samples, discarding the noisy UNR samples, which would constitute the output of the tracking step. The distribution of related samples is also very unbalanced, with the DSC class constituting more than half of the subset and the DSG samples accounting for only 7.5% of the related samples, as shown in Table 1. As discussed in the Introduction, the cross-level stance detection task is characterized by an asymmetry in length in the input: on average, headlines are 12.40 tokens long, while articles span from 4 up to 4788 tokens, with an average length reported in Table 2.

Baseline
As a baseline, we implemented the Athena model proposed by Hanselowski et al. (2017), which scored second in the FNC-1. We did not use the first-ranked system, as it is an ensemble model, nor the modification to Athena proposed in Hanselowski et al. (2018), as the new feature set and the BiLSTM layer did not significantly improve the performance of the original model. The model consists of a 7-layer MLP, with a varying number of units, followed by a softmax. The input is presented as a large matrix of concatenated features, some of which separately encode the headline or the body:
• Presence of refuting and polarity words.
• Tf-idf-weighted BoW unigram features, considering a vocabulary of 5000 entries.
while others jointly consider the headline/body pair:
• Word overlap between headline and article.
• Count of headline tokens and n-grams which appear in the article.
• Cosine similarity of the embeddings of nouns and verbs of the headline and the article.
Moreover, they use topic models based on non-negative matrix factorization, latent Dirichlet allocation, and latent semantic indexing. This results in a final set of 10 features. In this way, the asymmetry in length is resolved by compressing both the headline and the article into two fixed-size vectors of the same size.
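Two of the joint headline/body features above, word overlap and n-gram hits, can be sketched with stdlib Python; the helper names and the example strings are purely illustrative, not taken from the baseline's code.

```python
def tokenize(text):
    # Naive whitespace tokenizer, sufficient for illustration.
    return text.lower().split()

def overlap(headline, article):
    """Jaccard-style word overlap between headline and article tokens."""
    h, a = set(tokenize(headline)), set(tokenize(article))
    return len(h & a) / len(h | a)

def ngram_hits(headline, article, n=2):
    """How many headline n-grams reappear verbatim in the article."""
    def ngrams(toks, n):
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    return len(ngrams(tokenize(headline), n) & ngrams(tokenize(article), n))

h = "giant crab spotted in Whitstable harbor"
a = "A giant crab was reportedly spotted in the harbor at Whitstable ..."
print(overlap(h, a), ngram_hits(h, a))  # 0.5 2
```

Features like these flatten all positional information away, which is exactly the limitation the sentence-level encoders in Section 3 are designed to avoid.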
The same hyperparameters as in Hanselowski et al. (2017) were used.

(Hyper-)Parameters
The high-level structure of the models has been implemented with Keras, while single layers have been written in TensorFlow. The (hyper-)parameters used for training, useful for replicating the experiments, are reported in Table 3. Concerning vocabulary creation, we included only words occurring more than 7 times. The embedding matrix has been initialized with word2vec embeddings, which performed better than other sets of pre-trained embeddings in preliminary experiments. This can be partially explained by the fact that the word2vec embeddings are trained on part of the Google News corpus, thus on the same domain as the FNC-1 dataset. OOV words have been zero-initialized. In order to avoid overfitting, we did not update word vectors during training.
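The vocabulary and embedding-matrix setup described above can be sketched as follows; the function and variable names are illustrative, and a tiny toy "pretrained" dictionary stands in for the real word2vec vectors.

```python
import numpy as np
from collections import Counter

def build_embeddings(corpus_tokens, pretrained, dim, min_count=7):
    """Build a vocabulary from tokens seen more than min_count times and an
    embedding matrix initialized from pretrained vectors; OOV rows stay zero."""
    counts = Counter(corpus_tokens)
    vocab = {"<PAD>": 0}
    for tok, c in counts.items():
        if c > min_count:
            vocab[tok] = len(vocab)
    E = np.zeros((len(vocab), dim))   # zero init also covers OOV words
    for tok, idx in vocab.items():
        if tok in pretrained:
            E[idx] = pretrained[tok]
    return vocab, E                   # E would then be kept frozen in training

tokens = ["crab"] * 10 + ["giant"] * 8 + ["rare"] * 2
pretrained = {"crab": np.ones(3)}     # toy stand-in for word2vec
vocab, E = build_embeddings(tokens, pretrained, dim=3)
print(sorted(vocab))  # ['<PAD>', 'crab', 'giant']
```

Here "rare" (seen only twice) is dropped by the frequency cutoff, and "giant", which has no pretrained vector, keeps its zero row.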

Evaluation Metrics
As we are not considering the UNR samples, the FNC-1 score would not constitute a good metric, as it distinguishes between related and unrelated samples for scoring. Following Zubiaga et al. (2018) and Hanselowski et al. (2018), we use macro-averaged precision, recall and F1 measure, which are less affected by the high class unbalance (Table 1). We also consider the accuracy with respect to the single AGR, DSG and DSC classes.

Figure 5: Performance of the co-matching encoder in terms of macro-averaged F1 score on the test set, considering the first n sentences of an article. Blue and red violins represent backward and forward encoding of the considered sentences, respectively.
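For completeness, macro-averaging can be sketched with stdlib Python: each class contributes equally to the average, regardless of how frequent it is, which is why it suits the skewed AGR/DSG/DSC distribution. The function name and the toy label lists are illustrative.

```python
def macro_prf(gold, pred, labels=("AGR", "DSG", "DSC")):
    """Macro-averaged precision, recall and F1 over the given labels."""
    ps, rs, fs = [], [], []
    for lab in labels:
        tp = sum(g == p == lab for g, p in zip(gold, pred))
        fp = sum(p == lab and g != lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f1)
    n = len(labels)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

gold = ["DSC", "DSC", "DSC", "AGR", "DSG"]
pred = ["DSC", "DSC", "AGR", "AGR", "DSC"]
p, r, f = macro_prf(gold, pred)
print(round(f, 4))  # 0.4444
```

In this toy example the single missed DSG instance drags the macro-F1 down sharply, exactly the sensitivity to rare classes that motivates the metric.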

Results and Discussion
As shown in Table 4, both encoders described in Section 3 outperformed the baseline (line 1), despite having a considerably smaller number of parameters. In particular, the feature-based model obtained a relatively good performance in classifying the very infrequent DSG labels, probably thanks to its large number of hand-engineered features. However, it shows some difficulties in discriminating between AGR and DSC samples. This is probably a consequence of the system flattening the entire article into a fixed-size vector: this inevitably causes the system to lose the subtle nuances in the argumentative structure of the news story which allow for distinguishing between AGR and DSC samples, and to favor the most common DSC class. On the contrary, our architectures approach the asymmetry in length in the input by carefully encoding the articles as a hierarchical sequence of sentences, and by separately modeling their relative positions with respect to the headline. In this way, they are able to successfully discriminate between AGR and DSC samples. In general, the encoder based on co-matching attention performed clearly better than the architecture based on double-conditional encoding (lines 2 and 10), reaching a higher performance on all metrics except the classification of the DSC samples.

Modeling the Inverted Pyramid
In order to test our assumption that the great majority of the articles in the FNC-1 corpus were written following the inverted pyramid style, we took the co-matching attention model, which performed better than the double-conditionally encoded architecture, and progressively reduced the number of considered sentences. Moreover, we modified the article encoder (Subsection 3.1) to process the input sequence in forward instead of backward order. For each of these 13 settings, we ran 10 simulations.
As the violin plots in Figure 5 show (blue violins), considering a reduced number of sentences does not correlate with an overly large drop in performance until fewer than four sentences are taken. Below this threshold, the ability of the system to correctly classify the stance of the article is compromised. This can be explained by the inverted pyramid theory: as long as we consider enough sentences to include the lead and part of the body of the article, the system can rely on a sufficient number of elements to discriminate its stance. On the contrary, if we consider only the very first sentences, the system can get confused, being exposed to only a portion of the (sometimes opposing) opinions expressed in the article. Interestingly, our system seems to be quite robust to the noisy sentences which may be included when considering a higher number of sentences.
The assumption that most of the articles in the FNC-1 corpus are written following the inverted pyramid principles is further confirmed by the fact that, above the threshold of four considered sentences, simulations using forward encoding consistently perform worse than those using backward encoding (the red violins in Figure 5). Reasonably, below this threshold, we do not observe a considerable difference in performance between backward and forward models.

Using additional Input Channels
To investigate the impact of features other than word embeddings, we consider two further input channels:
• Named Entities (NE). NEs were obtained using the Stanford NE Recognizer (Finkel et al., 2005), resulting in a tagset of 13 labels.
• Characters. Each input word was split into characters. Only characters occurring more than 100 times in the training set were considered, obtaining a final vocabulary of 149 characters. As in Lample et al. (2016), we concatenate the output of a BiLSTM run over the character sequence.
The output of each input channel is concatenated with the word embedding, and passed to the article encoder described in Section 3.1. Hyperparameters used for experiments are reported in Table 3.

Anonymizing the input
After manual analysis of the predictions, we suspected that some models could have picked up correlations between certain Named Entities and a specific stance in the training set. Some of these correlations are well known and can be useful in veracity detection (Wang, 2017). In this paper, however, we wanted to train a model for stance detection based only on its language understanding, without relying on such possibly accidental correlations.
In order to prevent the systems from relying on chance correlations, which would not generalize to the test set, we modified the input sequences by substituting all input tokens labeled as PERSON, ORGANIZATION or LOCATION by the Stanford Named Entity Recognizer with the corresponding NE tags (<PERSON>, <ORGANIZATION>, <LOCATION>).
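The substitution itself is a simple token-level rewrite; the following sketch assumes NE tags have already been produced (in the paper, by the Stanford NE Recognizer), and all names in it are illustrative.

```python
def anonymize(tokens, ne_tags, keep=("PERSON", "ORGANIZATION", "LOCATION")):
    """Replace every token tagged with one of the kept NE types by its tag,
    so a model cannot latch onto entity/stance correlations."""
    return [f"<{tag}>" if tag in keep else tok
            for tok, tag in zip(tokens, ne_tags)]

tokens = ["Obama", "visited", "Kent", "yesterday"]
tags = ["PERSON", "O", "LOCATION", "O"]
print(anonymize(tokens, tags))  # ['<PERSON>', 'visited', '<LOCATION>', 'yesterday']
```

Because all entities of a type collapse onto one symbol, any stance signal tied to a specific person or place disappears, while the syntactic slot the entity occupied is preserved.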

Results
Results of experiments concatenating the previously mentioned features to the word embedding input of both architectures are reported in Table 4 (even lines). In general, using NE embeddings alone with word embeddings was not beneficial for either model. Considering the architecture based on double-conditional encoding, using both character and NE features actually led to (sometimes small) improvements in almost all considered evaluation metrics. Moving to the architecture using co-matching attention, adding character or NE embeddings, even in combination, caused a considerable drop in all evaluation metrics, apart from some single-label accuracies (such as on the AGR class).
As shown in Table 4 (odd lines), anonymizing the input was always useful for the architecture using double-conditional encoding, resulting in a consistently higher macro-averaged F1 score. Considering the architecture based on co-matching attention, however, anonymizing the input was beneficial only for the configurations leveraging NE tags (either with word embeddings only, or in combination with character embeddings), which were also the ones showing the highest drop in performance with respect to the model using only word embeddings.
The best performance according to macro-averaged precision, recall and F1 score is obtained by the co-matching attention model leveraging only word embeddings. The high performance of this model is mainly due to its ability to discriminate the very infrequent DSG class.

Conclusions
We proposed two simple architectures for Cross-Level Stance Detection, which were carefully designed to model the internal structure of a news article and its relations with a claim. Results show that our "journalistically"-motivated approach can beat a strong feature-based baseline, without relying on any language-specific resources other than word embeddings. This indicates that an interdisciplinary dialogue between Natural Language Processing and Journalism Studies can be very fruitful for fighting Fake News.
In future work, we aim to put together the different stages of the FND pipeline. Following the work of Kochkina et al. (2018) for RV, it would be interesting to compare a sequential approach, which solves each step of the pipeline in isolation, with a joint multi-task system. The generalizability of the models trained for the FND pipeline to other domains could be tested with the recently released ARC corpus (Hanselowski et al., 2018), which has similar statistical characteristics to the FNC-1 corpus.