Do Sentence Interactions Matter? Leveraging Sentence Level Representations for Fake News Classification

The rising growth of fake news and misleading information through online media outlets demands an automatic method for detecting such news articles. Of the few works which differentiate between trusted and other types of news articles (satire, propaganda, hoax), none model sentence interactions within a document. We observe an interesting pattern in the way sentences interact with each other across different kinds of news articles. To capture this kind of information for long news articles, we propose a graph neural network-based model which does away with the need for feature engineering for fine-grained fake news classification. Through experiments, we show that our proposed method beats strong neural baselines and achieves state-of-the-art accuracy on existing datasets. Moreover, we establish the generalizability of our model by evaluating its performance in out-of-domain scenarios. Code is available at https://github.com/MysteryVaibhav/fake_news_semantics


Introduction
In today's day and age of social media, there are ample opportunities for fake news production, dissemination and consumption. Rashkin et al. (2017) break down fake news into three categories: hoax, propaganda and satire. A hoax article typically tries to convince the reader of a cooked-up story, while propaganda articles usually mislead the reader into believing a false political or social agenda. Burfoot and Baldwin (2009) define a satirical article as one which deliberately exposes real-world individuals, organisations and events to ridicule.
Previous works (Rubin et al., 2016; Rashkin et al., 2017) rely on various linguistic and hand-crafted semantic features for differentiating between news articles. However, none of them try to model the interaction of sentences within the document. We observed a pattern in the way sentences cluster in different kinds of news articles. Specifically, satirical articles had a more coherent story, and thus all the sentences in the document seemed similar to each other. On the other hand, the trusted news articles were also coherent, but the similarity between sentences from different parts of the document was not that strong, as depicted in Figure 1. We believe that the reason for this behaviour is the presence of factual jumps across sections in a trusted document.
In this work, we propose a graph neural network-based model to classify news articles while capturing the interaction of sentences across the document. We present a series of experiments on the News Corpus with Varying Reliability dataset (Rashkin et al., 2017) and the Satirical Legitimate News dataset (Rubin et al., 2016). Our results demonstrate that the proposed model achieves state-of-the-art performance on these datasets and provides interesting insights. Experiments performed in out-of-domain settings establish the generalizability of our proposed method.

Related Work
Satire, according to Simpson (2003), is complicated because it occupies more than one place in the framework for humor proposed by Ziv (1988): it clearly has an aggressive and social function, and often expresses an intellectual aspect as well. Rubin et al. (2016) define news satire as a genre of satire that mimics the format and style of journalistic reporting. Datasets created for the task of distinguishing satirical news articles from trusted ones are often constructed by collecting documents from different online sources (Rubin et al., 2016). McHardy et al. (2019) hypothesized that this encourages the models to learn characteristics of different publication sources rather than characteristics of satire. In this work, we show that our proposed model generalizes to articles from unseen publication sources. Rashkin et al. (2017) extend Rubin et al. (2016)'s work by offering a quantitative study of linguistic differences found in articles of different types of fake news such as hoax, propaganda and satire. They also proposed predictive models for graded deception across multiple domains. Rashkin et al. (2017) found that neural methods didn't perform well for this task and proposed to use a Max-Entropy classifier. We show that our proposed neural network based on graph convolutional layers can outperform this model. Recent works by Yang et al. (2017) and De Sarkar et al. (2018) show that sophisticated neural models can be used for satirical news detection. To the best of our knowledge, none of the previous works represent individual documents as graphs, where the nodes represent the sentences, for performing classification using a graph neural network.

Dataset and Baseline
We use SLN: Satirical and Legitimate News Database (Rubin et al., 2016), RPN: Random Political News Dataset (Horne and Adali, 2017) and LUN: Labeled Unreliable News Dataset (Rashkin et al., 2017) for our experiments. Table 1 shows the statistics. Since all of the previous methods on the aforementioned datasets are non-neural, we implement the following neural baselines:
• CNN: In this model, we apply a 1-d CNN (Convolutional Neural Network) layer (Kim, 2014) with filter size 3 over the word embeddings of the sentences within a document. This is followed by a max-pooling layer to get a single document vector, which is passed to a fully connected projection layer to get the logits over output classes.
• LSTM: In this model, we encode the document using an LSTM (Long Short-Term Memory) layer (Hochreiter and Schmidhuber, 1997). We use the hidden state at the last time step as the document vector, which is passed to a fully connected projection layer to get the logits over output classes.
• BERT: In this model, we extract the sentence vector (representation corresponding to the [CLS] token) using BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019) for each sentence in the document. We then apply an LSTM layer on the sentence embeddings, followed by a projection layer to make the prediction for each document.
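As a rough illustration of the CNN baseline's data flow (width-3 convolution over word embeddings, max-pooling, projection), the following NumPy sketch may help. The function name, the number of filters, and the ReLU nonlinearity are assumptions made for illustration; the text does not specify them, and this is not the authors' implementation.

```python
import numpy as np

def cnn_document_logits(word_embeddings, filters, bias, proj_w, proj_b):
    """Forward pass of a width-3 1-d CNN document classifier (illustrative).

    word_embeddings: (num_words, emb_dim), with num_words >= filter width
    filters:         (width, emb_dim, num_filters)
    bias:            (num_filters,)
    proj_w, proj_b:  (num_filters, num_classes), (num_classes,)
    """
    width = filters.shape[0]
    num_words = word_embeddings.shape[0]
    # Slide a window of `width` consecutive words over the document.
    windows = np.stack([word_embeddings[i:i + width]
                        for i in range(num_words - width + 1)])
    # Contract the (width, emb_dim) axes: one activation per window and filter.
    feature_maps = np.tensordot(windows, filters, axes=([1, 2], [0, 1])) + bias
    feature_maps = np.maximum(feature_maps, 0.0)  # ReLU (assumed nonlinearity)
    doc_vector = feature_maps.max(axis=0)         # max-pool over positions
    return doc_vector @ proj_w + proj_b           # logits over output classes
```

The max-pool collapses a document of any length into one fixed-size vector, which is what lets a single projection layer produce the class logits.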

Proposed Model
Capturing sentence interactions in long documents is not feasible using a recurrent network because of the vanishing gradient problem (Pascanu et al., 2013). Thus, we propose a novel way of encoding documents as described in the next subsection.
Figure 2 shows the overall framework of our graph based neural network.

Input Representation
Each document in the corpus is represented as a graph. The nodes of the graph represent the sentences of a document, while the edges represent the semantic similarity between a pair of sentences. Representing a document as a fully connected graph allows the model to directly capture the interaction of each sentence with every other sentence in the document. We initialize the edge scores using BERT (Devlin et al., 2019) fine-tuned on the semantic textual similarity task for computing the semantic similarity (SS) between two sentences. Refer to the Supplementary Material for more details regarding the SS model. Note that this representation drops the sentence order information but is better able to capture the interaction between far-off sentences within a document.
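The graph construction can be sketched as follows. Note the assumptions: cosine similarity between sentence vectors stands in for the fine-tuned BERT SS model (which is described only in the Supplementary Material), the function names are hypothetical, and zeroing the diagonal of the similarity matrix is a choice made here rather than something stated in the text.

```python
import numpy as np

def fully_connected_adjacency(num_sentences):
    """Edge matrix with 1 for every sentence pair and 0 on the diagonal."""
    return np.ones((num_sentences, num_sentences)) - np.eye(num_sentences)

def similarity_adjacency(sentence_vectors):
    """Edge scores from pairwise cosine similarity.

    Cosine similarity is only a placeholder for the SS model's score.
    sentence_vectors: (num_sentences, dim)
    """
    norms = np.linalg.norm(sentence_vectors, axis=1, keepdims=True)
    unit = sentence_vectors / np.clip(norms, 1e-12, None)
    scores = unit @ unit.T
    np.fill_diagonal(scores, 0.0)  # no self-edges (assumption)
    return scores
```

Because the matrix is indexed only by sentence pairs, two sentences that are far apart in the article are exactly as "close" in the graph as adjacent ones, which is the trade-off against sentence order noted above.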

Graph based Neural Networks
We reformulate the fake news classification problem as a graph classification task, where a graph represents a document. Each graph is written G = (E, S), where E is the adjacency matrix and S is the sentence feature matrix. We randomly initialize the word embeddings and use the last hidden state of an LSTM layer as the sentence embedding, as shown in Figure 2. We experiment with two kinds of graph neural networks.

Graph Convolution Network (GCN)
The graph convolutional network (Kipf and Welling, 2017) is a spectral convolutional operation denoted by $f(Z^{l}, E \mid W^{l})$. Here, $Z^{l}$ is the output feature corresponding to the nodes after the $l$-th convolution, and $W^{l}$ is the parameter associated with the $l$-th layer. We set $Z^{0} = S$. Based on the above operation, we can define arbitrarily deep networks. For our experiments, we use a single layer unless stated otherwise. By default, the adjacency matrix (E) is fully connected, i.e. all the elements are 1 except the diagonal elements, which are all set to 0. In our GCN + SS model, we instead set E using the semantic similarity model. For the GCN + Attn model, we add a self-attention layer (Vaswani et al., 2017) after the GCN layer and before the pooling layer.
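A single graph convolution followed by max-pooling and projection might look like the sketch below. It assumes the layer has the common form σ(Ê Z W), with Ê a symmetrically degree-normalized E in the spirit of Kipf and Welling (2017); the exact normalization used by the authors is not recoverable from this text, so treat that step, and all function names, as assumptions.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def normalize_adjacency(E):
    """Symmetric normalization D^{-1/2} E D^{-1/2} (assumed, not from the paper)."""
    degree = np.abs(E).sum(axis=1)
    inv_sqrt = np.where(degree > 0, 1.0 / np.sqrt(np.maximum(degree, 1e-12)), 0.0)
    return E * inv_sqrt[:, None] * inv_sqrt[None, :]

def gcn_layer(Z, E, W):
    """One convolution: aggregate neighbor features, transform, apply LeakyReLU."""
    return leaky_relu(normalize_adjacency(E) @ Z @ W)

def classify_document(S, E, W, proj_w, proj_b):
    """Single-layer GCN, max-pool over sentence nodes, then project to logits."""
    Z1 = gcn_layer(S, E, W)        # Z^0 = S
    doc_vector = Z1.max(axis=0)    # one vector per document
    return doc_vector @ proj_w + proj_b
```

Swapping the fully connected E for similarity-based edge scores changes only the argument passed as `E`, which mirrors the difference between the default GCN and the GCN + SS variant.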

Graph Attention Network (GAT)
Veličković et al. (2018) introduced graph attention networks to address various shortcomings of GCNs. Most importantly, they enable nodes to attend over their neighborhoods' features without depending on the graph structure upfront. The key idea is to compute the hidden representations of each node in the graph by attending over its neighbors, following a self-attention (Vaswani et al., 2017) strategy. By default, there is one attention head in the GAT model. For our GAT + 2 Attn Heads model, we use two attention heads and concatenate the node embeddings obtained from the different heads before passing them to the pooling layer.
For a fully connected graph, the GAT model allows every node to attend to every other node and learn the edge weights. Thus, initializing the edge weights using the SS model is redundant, as they are being learned. Mathematical details are provided in the Supplementary Material.
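The attention mechanism can be illustrated with the standard GAT formulation of Veličković et al. (2018), in which the score for a pair of nodes is LeakyReLU(aᵀ[Wh_i ‖ Wh_j]) and is softmax-normalized over neighbors. The sketch below assumes a fully connected graph in which each node also attends to itself; the authors' exact equations are in their Supplementary Material and may differ. Function and parameter names are hypothetical.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_head(H, W, a):
    """One attention head over a fully connected graph.

    H: (n, d_in) node features, W: (d_in, d_out), a: (2 * d_out,)
    Returns the new node embeddings and the (n, n) attention map.
    """
    Wh = H @ W
    d_out = Wh.shape[1]
    # a^T [Wh_i || Wh_j] splits into a term for node i plus a term for node j.
    src = Wh @ a[:d_out]
    dst = Wh @ a[d_out:]
    scores = leaky_relu(src[:, None] + dst[None, :])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)     # softmax over neighbors
    return leaky_relu(alpha @ Wh), alpha

def gat_two_heads(H, params):
    """Concatenate node embeddings from several heads; params is [(W, a), ...]."""
    return np.concatenate([gat_head(H, W, a)[0] for W, a in params], axis=1)
```

The returned `alpha` matrix is the kind of per-document attention map that can be inspected qualitatively: row i shows how strongly sentence i attends to every sentence in the article.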

Hyperparameters
We use a randomly initialized embedding matrix with 100 dimensions. We use a single-layer LSTM to encode the sentences prior to the graph neural networks. All the hidden dimensions used in our networks are set to 100. The node embedding dimension is 32. For GCN and GAT, we set σ as LeakyReLU with slope 0.2. We train the models for a maximum of 10 epochs and use the Adam optimizer with learning rate 0.001. For all the models, we use max-pooling, followed by a fully connected projection layer with output nodes equal to the number of classes for classification.
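For reference, the settings listed above could be gathered in one configuration object, as in this sketch. The class and field names are hypothetical; only the values come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hyperparameters:
    """Values as stated in the text; field names are illustrative only."""
    word_embedding_dim: int = 100   # randomly initialized embeddings
    lstm_layers: int = 1            # sentence encoder
    hidden_dim: int = 100           # all hidden dimensions
    node_embedding_dim: int = 32
    leaky_relu_slope: float = 0.2   # sigma for GCN and GAT
    max_epochs: int = 10
    optimizer: str = "adam"
    learning_rate: float = 0.001
    pooling: str = "max"

config = Hyperparameters()
```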

Experimental Setting
We conduct experiments across various settings and datasets. We report macro-averaged scores in

Results
Table 2 shows the quantitative results for the two-way classification between satirical and trusted news articles. Our proposed GAT method with 2 attention heads outperforms SoTA. The semantic
To further understand the working of our proposed model, we closely inspect the attention maps generated by the GAT model for satirical and trusted news articles from the SLN dataset. From Figure 3, we can see that the attention map generated for the trusted news article focuses on only two specific sentences, whereas the attention weights are much more evenly distributed in the case of a satirical article. Interestingly, the highlighted sentences in the trusted news article were the starting sentences of two different paragraphs in the article, indicating the presence of similar sentence clusters within a document. This opens a new avenue for future research on understanding the differences between different kinds of text articles.

Conclusion
This paper introduces a novel way of encoding articles for fake news classification. Representing documents as graphs is motivated by the fact that sentences interact differently with each other across different kinds of articles. Recurrent networks are unable to maintain long-term dependencies in large documents, whereas a fully connected graph captures the interaction between sentences at unit distance. The quantitative results show the effectiveness of our proposed model, and the qualitative results validate our hypothesis about the difference in sentence interaction across different articles. Further, we show that our proposed model generalizes to unseen datasets.

Figure 1: t-SNE visualization (Van Der Maaten, 2014) of sentence embeddings obtained using BERT (Devlin et al., 2019) for two kinds of news articles from SLN. A point denotes a sentence and the number indicates which paragraph it belonged to in the article.

Figure 2: Proposed semantic graph neural network based model for fake news classification.

Table 1: Statistics about different dataset sources. GN refers to Gigaword News.

Table 3: 4-way classification results for different models. We only report F1-score following the SoTA paper.

Table 3. All of our proposed methods outperform SoTA on both the in-domain and out-of-domain test sets.