Stance Detection in Fake News: A Combined Feature Representation

With the uncontrolled growth of fake news and rumors on the Web, different approaches have been proposed to address the problem. In this paper, we present an approach that combines lexical, word embedding, and n-gram features to detect stance in fake news. Our approach has been tested on the Fake News Challenge (FNC-1) dataset. Given a news title-article pair, the FNC-1 task aims at determining the relevance of the article to the title. Using a simple feature representation, our approach achieves 59.6% Macro F1, which is within 0.013 of the state-of-the-art result. Furthermore, we investigate the importance of different lexicons in detecting the classification labels.


Introduction
Recently, many phenomena have appeared and spread on the Internet, driven by the massive propagation of information and the growth of social networks. Among them are fake news, rumors, and misinformation. Detecting these phenomena is crucial, since in many situations they expose people to danger. Journalists have made several efforts to address these problems by presenting proof of validity to the audience. Unfortunately, these manual attempts take much time and effort and cannot keep up with the rapid spread of fake news. Hence, the problem needs to be addressed automatically. Fake news has recently gained considerable attention from the natural language processing (NLP) research community, and many approaches have been proposed. These approaches investigated fake news from network and textual perspectives (Shu et al., 2017). Some of the textual approaches handled the phenomenon from a validity aspect, labeling a claim as "False", "True", or "Half-True". Others tackled it from a stance perspective, similar to stance detection work on Twitter (Mohammad et al., 2016; Taulé et al., 2017; Lai et al., 2018), which tries to determine whether a tweet is in favor of a given target entity (person, organization, etc.), against it, or neither. In fake news, the pair of tweet and target entity is replaced with a claim and an article, and a different set of stances is used (agree, disagree, discuss, and unrelated).
Several shared tasks have been proposed: Fake News Challenge (FNC-1) (Rao and Pomerleau, 2017), RumorEval (Derczynski et al., 2017), CheckThat (Barrón-Cedeño et al., 2018), and Fact Extraction and Verification (FEVER). In FNC-1, the organizers proposed approaching the task from a stance perspective; the goal is to predict how other articles orient to a specific fact, similarly to RumorEval (task-A). In both RumorEval (task-B) and CheckThat (task-B), by contrast, a rumor or claim is submitted and the objective is to validate its truthfulness (true, half-true, or false). In the first task of CheckThat (task-A), participants were asked to detect claims that are worth checking (i.e., that may contain facts), as a preliminary step to task-B. Finally, the purpose of the FEVER shared task is to evaluate the ability of a system to verify a factual claim using evidence from Wikipedia, where each retrieved piece of evidence (in case there are many) should be labeled as "Supported", "Refuted", or "NotEnoughInfo" (if there is not sufficient evidence to either support or refute the claim). Fake news and rumor detection have received more attention in the literature than the detection of check-worthy claims. Work on the latter has been oriented towards inferring check-worthy claims using linguistic and stylistic aspects (Ghanem et al., 2018c; Hassan et al., 2015).

Related Work
From an NLP perspective, many approaches have employed statistical (Magdy and Wanas, 2010), linguistic (Markowitz and Hancock, 2014; Volkova et al., 2017), and stylistic (Potthast et al., 2017) features. Other approaches incorporated different combinations of features, such as word or character n-gram overlap scores, bag-of-words (BOW), word embeddings, and latent semantic analysis features (Riedel et al., 2017; Hanselowski et al., 2017; Karadzhov et al., 2018). In some cases, authors used external features and retrieved evidence from the Web. For example, Ghanem et al. (2018b) utilized both the Google and Bing search engines to investigate the factuality of political claims. Similarly, Mihaylov et al. (2015) retrieved evidence from Google and online blogs to validate sentences in question answering forums. Other approaches utilized deep learning architectures to validate fake news. Baird et al. (2017) combined a Convolutional Neural Network with a Gradient Boost classifier to predict the stance on FNC; their approach achieved the highest accuracy in the task results. Using a different deep learning architecture, Hanselowski et al. (2018) used a Long Short-Term Memory (LSTM) network combined with other features such as bag-of-characters (BOC), BOW, and topic model features based on non-negative matrix factorization, Latent Dirichlet Allocation, and Latent Semantic Indexing. They achieved state-of-the-art results (60.9% Macro F1) on the FNC-1 dataset. The approaches proposed for fake news detection and for rumor detection differ slightly, since the two phenomena were studied in different environments. Fake news datasets were generally collected from formal sources (political debates or Web news articles), whereas Twitter was the source of the rumor datasets. Therefore, the approaches proposed for rumors focused more on the propagation of tweets (e.g., retweet ratio (Enayet and El-Beltagy, 2017)) and on the writing style of the tweets (Kochkina et al., 2017).

Task
Given a pair of text fragments (title and article) obtained from news, the goal of the task is to estimate the relative perspective (stance) of these two fragments with respect to a specific topic; in other words, to predict the stance of an article towards its title. For each input pair, there are 4 stance labels: Agree, Disagree, Discuss, and Unrelated. The label is "Agree" if the article supports the title; "Disagree" if it refutes it; "Discuss" if the article discusses the title without taking a stance in favor of or against it; and "Unrelated" if the article describes a different topic than the title. The task's dataset is highly imbalanced (see next section). Therefore, the organizers introduced a weighted accuracy score for the evaluation. Their score assigns 25% of the final score to predicting the Unrelated class and 75% to the other classes. Later, Hanselowski et al. (2018) presented an in-depth analysis of the FNC-1 experimental setup. They showed that this accuracy metric is not appropriate because it fails to take the imbalanced class distribution into account: models that perform well on the majority class and poorly on the minority classes are favored. Therefore, they proposed using the Macro F1 metric for this task. Accordingly, in this paper we report the experimental results using the Macro F1 measure.
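To make the contrast between the two metrics concrete, the following minimal Python sketch computes Macro F1 and a weighted score in the spirit of the FNC-1 metric. The exact scoring rule in fnc1_relative_score (0.25 for the related/unrelated decision, an additional 0.75 for the exact label of a related pair, normalized by the best achievable score) is our assumption about how the weighting is applied, and the toy label counts only approximate the class ratios of the dataset.

```python
LABELS = ["agree", "disagree", "discuss", "unrelated"]
RELATED = {"agree", "disagree", "discuss"}


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


def fnc1_relative_score(gold, pred):
    """Weighted accuracy (assumed rule, see lead-in), relative to a perfect run."""
    def raw_score(g_seq, p_seq):
        total = 0.0
        for g, p in zip(g_seq, p_seq):
            if g == "unrelated" and p == "unrelated":
                total += 0.25
            if g in RELATED and p in RELATED:
                total += 0.25          # related/unrelated decision is right
                if g == p:
                    total += 0.75      # exact stance label is right
        return total
    return raw_score(gold, pred) / raw_score(gold, gold)


# Toy data roughly following the class ratios of the dataset (not real data).
gold = ["unrelated"] * 73 + ["discuss"] * 18 + ["agree"] * 7 + ["disagree"] * 2
majority = ["unrelated"] * len(gold)

print(round(macro_f1(gold, majority), 3))
print(round(fnc1_relative_score(gold, majority), 3))
```

On this toy data, a predictor that always answers "unrelated" scores noticeably higher on the weighted accuracy than on Macro F1, which illustrates why a metric that averages over classes is the stricter choice for an imbalanced dataset.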

Corpus
The presented dataset was built using 300 different topics. The training part consists of 49,972 tuples in the form (title, article, label), while the test part consists of 25,413 tuples. The label distribution in the dataset is 73.13% Unrelated, 17.82% Discuss, 7.36% Agree, and 1.68% Disagree; the dataset is clearly heavily skewed towards the Unrelated label. Title length ranges between 8 and 40 words, whereas article length ranges between 600 and 7000 words (Bhatt et al., 2018). These numbers show how challenging it is to predict the stance between two fragments that differ so greatly in length.

Tough-to-beat Baseline
The organizers presented a tough baseline based on a Gradient Boost decision tree classifier. In contrast to other shared tasks, their baseline employed fairly sophisticated features: n-gram co-occurrence between the titles and articles, using both character and word n-grams of multiple lengths, along with other hand-crafted features such as word overlap between the title and the article and the presence of highly polarized words from a lexicon (e.g., fake, hoax). Their baseline achieved an FNC-1 score of 75% and a Macro F1 of 45.4%.
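The kind of hand-crafted features described above can be sketched in a few lines of Python. This is a minimal illustration, not the organizers' implementation: the tokenizer, the Jaccard form of the word overlap, and the function names are our assumptions, and the polarized word list contains only the two examples named in the text.

```python
import re

# Only the two example words mentioned in the text; the real lexicon is larger.
POLARIZED_WORDS = ["fake", "hoax"]


def tokens(text):
    """Lowercase alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_overlap(title, article):
    """Share of distinct words that title and article have in common."""
    t, a = set(tokens(title)), set(tokens(article))
    union = t | a
    return len(t & a) / len(union) if union else 0.0


def ngrams(sequence, n):
    """All contiguous n-grams of a sequence, as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def word_ngram_hits(title, article, n):
    """Number of title word n-grams that also occur in the article."""
    article_grams = set(ngrams(tokens(article), n))
    return sum(1 for gram in ngrams(tokens(title), n) if gram in article_grams)


def polarized_flags(text):
    """One binary indicator per polarized word."""
    present = set(tokens(text))
    return [int(word in present) for word in POLARIZED_WORDS]


title = "Report of giant crab is a hoax"
article = "Experts say the giant crab photo is a hoax"
features = [word_overlap(title, article),
            word_ngram_hits(title, article, 2)] + polarized_flags(title)
print(features)
```

Character n-gram hits can be computed the same way by passing the raw string instead of the token list to ngrams.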

Approach and Results
Previous work on the FNC dataset showed that the best results are not obtained with a pure deep learning architecture and that simple BOW representations perform well. In our approach, we combine n-grams, word embeddings, and cue words to detect the stance of the title with respect to its article.

Preprocessing
Before building the feature representation, we perform a set of text preprocessing steps. In some articles we found links, hashtags, and user mentions (e.g., @USER), so we remove them to make the text less biased. Similarly, we remove non-English and special characters.
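A minimal sketch of such a cleaning step is given below. The regular expressions, and in particular the set of characters that is kept, are our assumptions; the text does not specify which special characters are removed.

```python
import re

URL_RE = re.compile(r"(?:https?://|www\.)\S+")     # links
TAG_RE = re.compile(r"[#@]\w+")                    # hashtags and user mentions
# Assumption: keep ASCII letters, digits, whitespace and basic punctuation.
OTHER_RE = re.compile(r"[^A-Za-z0-9\s.,;:!?'\"-]")
SPACE_RE = re.compile(r"\s+")


def preprocess(text):
    """Remove links, hashtags, mentions and non-English/special characters."""
    text = URL_RE.sub(" ", text)   # links first, so '#' inside URLs is handled
    text = TAG_RE.sub(" ", text)
    text = OTHER_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


print(preprocess("Officials deny the claim @USER #hoax http://example.com/a?b=1 ✓"))
```

Links are removed before hashtags and mentions so that a URL fragment such as "#section" is dropped together with its URL instead of leaving a partial address behind.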

Features
In our approach, we combine simple feature representations to model the title-article tuples:
• Cue words: We employ a set of cue word categories that were used previously by Bahuleyan and Vechtomova (2017) to identify the stance of Twitter users towards rumor tweets. As Table 1 shows, the cue word categories are Belief, Denial, Doubt, Report, Knowledge, Negation, and Fake. The Fake cue list combines some words from the FNC-1 baseline polarized word list with words from the original list. The provided set of cue words is quite small; therefore, we use Google News word2vec to expand it. For each word, we retrieve the 5 most similar words. For example, for the word "misinform", we retrieved "mislead", "misinforming", "disinform", "misinformation", and "demonize" as the most similar words.
• Google News word2vec embedding: For each title-article tuple, we measure the cosine similarity between the embeddings of the two texts. We also use the full 300-dimensional embedding vector of both the title and the article. A sentence embedding is obtained by averaging the embeddings of its words. Previously, Ghanem et al. (2018a) showed that using only the main sentence components (verbs, nouns, and adjectives) rather than all sentence components improved the accuracy of a plagiarism detection approach. Therefore, we build these embedding vectors from the main sentence components. In addition, we retain the cue words described in the previous point.
• FNC-1 features: We use the same baseline feature set (see Section 3.3).
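The embedding features and the cue word expansion described in the list above can be sketched as follows. This is an illustrative sketch under several assumptions: a tiny hand-made two-dimensional vector table stands in for the 300-dimensional Google News word2vec model, the part-of-speech filtering to main sentence components is omitted, and all function names are hypothetical.

```python
import math


def average_embedding(words, vectors, dim):
    """Average the vectors of the known words; zero vector if none is known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_cues(cues, vectors, topn=5):
    """Add to each cue word its topn nearest neighbours in embedding space."""
    expanded = set(cues)
    for cue in cues:
        if cue not in vectors:
            continue
        neighbours = sorted(
            (w for w in vectors if w != cue),
            key=lambda w: cosine_similarity(vectors[cue], vectors[w]),
            reverse=True,
        )
        expanded.update(neighbours[:topn])
    return expanded


def pair_features(title_words, article_words, vectors, dim):
    """Cosine similarity followed by the title and article embeddings."""
    title_vec = average_embedding(title_words, vectors, dim)
    article_vec = average_embedding(article_words, vectors, dim)
    return [cosine_similarity(title_vec, article_vec)] + title_vec + article_vec


# Toy 2-dimensional vectors (made up for illustration only).
toy = {"hoax": [1.0, 0.0], "fake": [0.9, 0.1],
       "crab": [0.0, 1.0], "giant": [0.1, 0.9]}
print(expand_cues({"hoax"}, toy, topn=1))
print(pair_features(["hoax"], ["giant", "crab"], toy, dim=2))
```

With real 300-dimensional vectors, pair_features yields 601 values per tuple (one similarity plus two embeddings). With a pretrained model loaded in gensim, the neighbour lookup in expand_cues would typically be replaced by the library's own most_similar query.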

Experiments
In our experiments, we tested Support Vector Machines (SVM) (with both linear and RBF kernels), Gradient Boost, Random Forest, and Naive Bayes classifiers, but the Neural Network (NN) showed better results. Our NN architecture consists of two hidden layers with the rectified linear unit (ReLU) activation function as the non-linearity, and a Softmax activation function for the output layer. We also employed the Adam weight optimizer, with a batch size of 200. Table 2 shows the results of our approach and those of the FNC-1 participants. We investigated the score of each of our feature sets independently. The word2vec embeddings feature set achieved a Macro F1 of 0.488, while the cue words achieved 0.25. The extension of the cue words improved the final result by 2.5%.

Table 2: Macro F1 of our approach and of the FNC-1 participants.

Systems                                   Macro F1
Majority vote                             0.210
FNC-1 baseline                            0.454
Talos (Baird et al., 2017)                0.582
UCLMR (Riedel et al., 2017)               0.583
Athene (Hanselowski et al., 2017)         0.604
stackLSTM (Hanselowski et al., 2018)      0.609
Our approach                              0.596
  Cue words                               0.250
  Word2vec embeddings                     0.488

The tuples of the "Unrelated" class were created artificially by assigning articles from different documents. This abnormal distribution can affect the result of the cue words feature when we test it independently: since we extract the cue words feature from the article part only (without the titles), and the same article can appear with different class labels, the classification process can be biased. As mentioned previously, the state-of-the-art result was obtained by an approach that combined an LSTM with other features (see Section 2). Our approach achieved a Macro F1 score of 0.596, which is very close to the best result.
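The shape of the network described in this section (two ReLU hidden layers followed by a Softmax output over the four stance labels) can be illustrated with a small forward pass in plain Python. The layer sizes, the input dimension, and the weight initialization are hypothetical, since the text does not state them, and training with Adam in batches of 200 is not shown.

```python
import math
import random


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, bias):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    """Random weights scaled by sqrt(2 / n_in); zero biases (an assumption)."""
    scale = math.sqrt(2.0 / n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(features, layers):
    """ReLU after every layer except the last, which is followed by softmax."""
    hidden = features
    for weights, bias in layers[:-1]:
        hidden = relu(dense(hidden, weights, bias))
    weights, bias = layers[-1]
    return softmax(dense(hidden, weights, bias))


rng = random.Random(0)
n_features, hidden_1, hidden_2, n_classes = 10, 8, 8, 4   # hypothetical sizes
layers = [init_layer(n_features, hidden_1, rng),
          init_layer(hidden_1, hidden_2, rng),
          init_layer(hidden_2, n_classes, rng)]
probabilities = forward([0.1] * n_features, layers)
print(len(probabilities), round(sum(probabilities), 6))
```

In practice one would train such a model with an existing library; for instance, scikit-learn's MLPClassifier accepts hidden_layer_sizes, activation="relu", solver="adam", and batch_size=200, although we do not know which toolkit was used for the reported results.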
The combination of the cue word categories with the other features improved the overall result, and each category had an impact on the classification process. In Figure 1, we show the importance of each category using Information Gain. We extract it using the Gradient Boost classifier, as it achieves the highest result compared to the other decision tree-based classifiers. The figure shows that Report is the category with the highest importance in the classification process, that the Negation and Belief categories have lower importance, and that the Denial and Knowledge categories have the lowest importance. Surprisingly, both the Fake and Doubt categories have a lower importance than the first three. Our intuition was that the Fake category would have the highest importance in discriminating the classes, since this category contains words that may not appear in "Agree" records, appear profusely in the "Disagree" class (where the title is fake and the article proves that), and appear moderately often in the "Discuss" class. Similarly, we expected words of the Doubt category to appear frequently in both the "Discuss" and "Disagree" classes, since such words are normally used when an article discusses a specific idea or refutes it. To better understand our Information Gain results, we conducted another experiment to infer the importance of each category with respect to each classification class.
To do so, we use the coefficients of an SVM classifier (linear kernel) to extract the most important categories for each classification class. In our initial experiments, the SVM produced a result similar to that of the NN (58% Macro F1); given this good performance, and since we could not extract feature importance from the NN, we used the SVM in this experiment. Once the SVM fits the data and creates a hyperplane that uses support vectors to maximize the distance between the classes, the importance of the features can be extracted from the absolute size of the coefficients (vector coordinates). In Table 3, we show the categories ordered by importance. We can notice that for the "Agree" class, the categories that are used when there is a disagreement (Denial, Fake, Negation) generally tend to be less important than the other categories. In contrast, for the "Disagree" class, the disagreement categories generally appear higher in the order than they do for the "Agree" class. For the "Discuss" class, where articles do not show a clear stance in favor of or against the title, we can notice an overlap at the top of the order between the categories that are important for the "Disagree" and "Agree" classes. Finally, since the articles in the "Unrelated" class were created artificially by assigning articles from different titles, as mentioned previously, the order of the categories for this class is not meaningful.
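For readers unfamiliar with Information Gain, the quantity can be illustrated with the textbook entropy-based definition for a single discrete feature. The sketch below is not the procedure used to produce Figure 1, where the importances are taken from the Gradient Boost classifier; the feature values and labels are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(group) / total * entropy(group)
                      for group in groups.values())
    return entropy(labels) - conditional


# Made-up example: does an article contain a word from some cue category?
labels = ["agree", "agree", "disagree", "disagree"]
informative = [0, 0, 1, 1]     # separates the two classes perfectly
uninformative = [0, 1, 0, 1]   # tells us nothing about the class
print(information_gain(informative, labels))
print(information_gain(uninformative, labels))
```

A category with high Information Gain is one whose presence or absence sharply reduces the uncertainty about the stance label, which is the sense in which Report is described as the most important category.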
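The ranking step described above, ordering the cue word categories by the absolute size of their linear SVM coefficients for each class, can be sketched as follows. The coefficient values are invented purely to make the example runnable and are not results from our experiments; the function name is hypothetical.

```python
CATEGORIES = ["Belief", "Denial", "Doubt", "Report",
              "Knowledge", "Negation", "Fake"]


def rank_categories(coefficients, categories=CATEGORIES):
    """Order categories per class by decreasing absolute coefficient."""
    ranking = {}
    for label, weights in coefficients.items():
        ordered = sorted(zip(categories, weights),
                         key=lambda pair: abs(pair[1]),
                         reverse=True)
        ranking[label] = [category for category, _ in ordered]
    return ranking


# Invented coefficients (one weight per category), for illustration only.
example_coefficients = {
    "class_a": [0.5, -0.1, 0.2, 0.9, 0.3, -0.05, 0.02],
    "class_b": [-0.2, 0.7, 0.1, 0.4, -0.05, 0.6, -0.8],
}
for label, order in rank_categories(example_coefficients).items():
    print(label, order)
```

The sign of a coefficient is ignored because only the magnitude reflects how strongly a category moves the decision. With scikit-learn's LinearSVC, for example, the per-class weights would be the rows of the fitted model's coef_ attribute.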

Conclusion and Future Work
Fake news is still an open research topic. Further contributions are required, especially to deal automatically with the massive growth of information on the Web. Our work approached stance detection in fake news using a simple model based on a combination of n-grams, word embeddings, and a lexical representation of cue words. These lexical cue words have been employed previously in the literature in rumor stance detection approaches. Although we used a simple feature set, we achieved results similar to the state of the art. This work is an initial step towards a further investigation of features to improve stance detection in fake news. As future work, we plan to focus on summarizing the articles in the dataset. As mentioned in Section 3.2, the difference in length between the titles and the articles is large. Therefore, summarizing the articles may be a worthwhile way to improve the comparison between the two text fragments.