DFKI-DKT at SemEval-2017 Task 8: Rumour Detection and Classification using Cascading Heuristics

We describe our submissions for SemEval-2017 Task 8, Determining Rumour Veracity and Support for Rumours. The Digital Curation Technologies (DKT) team at the German Research Center for Artificial Intelligence (DFKI) participated in two subtasks: Subtask A (determining the stance of a message) and Subtask B (determining veracity of a message, closed variant). In both cases, our implementation consisted of a Multivariate Logistic Regression (Maximum Entropy) classifier coupled with hand-written patterns and rules (heuristics) applied in a post-process cascading fashion. We provide a detailed analysis of the system performance and report on variants of our systems that were not part of the official submission.


Introduction
In today's digital age, the social, political and economic relevance of online media and online content keeps growing. Accordingly, the task of analysing and determining the veracity of online content is receiving a growing amount of attention from the NLP community. The ability to detect automatically whether a piece of news is fake is a very timely language technology application (Zubiaga and Ji, 2014). Through these shared tasks, we intend to address which linguistic and contextual features characterise a rumour.
SemEval-2017 Task 8 (Derczynski et al., 2017) provided all participants with a dataset consisting of tweets in response to breaking news stories. It contains conversations responding to rumourous tweets. These tweets have been annotated for support, deny, query or comment (SDQC). The competition consisted of two subtasks:
• Subtask A: Determining whether response tweets support, deny, query or comment (SDQC) on rumours (source tweet)
• Subtask B: Given a tweet, determine whether the statement is true or false (i. e., a rumour). This subtask featured two variants: closed (determining veracity from the tweet alone) and open (determining veracity from additional context).
We participated in the closed task.
Our approach to both subtasks involved extracting relevant features from the provided data and training a classifier, followed by a set of heuristics implemented in a cascading decision tree style (Minguillon, 2002). These rules, applied as a post-processing step, help induce a better mapping from classification results to rumour categorisation and veracity detection because they take into account specific features characterising a particular class.
In this paper we seek to answer two questions using Rumour Detection and Classification as a case study:
• Which features comprise the set of post-process rules?
• What is the optimal technique to implement these heuristics (cascading order)?
This paper is structured as follows. Section 2 gives a bird's eye overview of our systems submitted for evaluation. Section 3 describes the various rumour detection and classification models as well as experimental setups (not part of the official submission). Section 4 displays the results and analyses them. Section 5 contains a discussion of the task in general followed by an explanation of some design decisions.

DFKI-DKT's Submission Overview
Our submissions can be categorised as hybrid systems since they consist of both machine learning and rule-based (heuristics) modules. The first step was to extract contextual features (tweet text) and metadata features (Twitter user account properties and message properties) from the provided test data. We then trained a Maximum Entropy classifier (Malouf, 2002) followed by a set of heuristics (if-then clauses) implemented in a cascading decision tree style (Minguillon, 2002), see Figure 1.

Data and Tools
We did not use any external data or resources. All models were trained on the provided Twitter dataset. Table 1 gives an overview of the size of the data for subtasks A and B. We implemented feature vector-based text classification models using the Mallet Machine Learning Toolkit (McCallum, 2002) in Java. The heuristics were implemented in the form of an experimentally determined sequence of if-then decision rules written in Python. Evaluation was performed using the scoring scripts provided by the task organisers.

Preprocessing
We employed the standard tokenisation scripts while extracting the feature vectors for training a classifier. We did not implement any other preprocessing step. In fact, we found that cleaning the tweets degraded classification performance. We believe that certain as-is characteristics of the text (uppercase, spelling errors, emoticons, etc.) help to distinguish the four categories (SDQC).

Subtask A Heuristics
The classifier was trained on four classes (SDQC). This was followed by a post-processing module of decision rules based on linguistic patterns and Twitter metadata. The heuristics were as follows:
• If a tweet begins with a wh-word (where, when, how, what, why, which) and/or ends with a question mark, then classify it as query
• If a tweet contains a negation, then classify it as denial
• If a tweet is a retweet, then classify it as support
• If more than 70% of the text is all uppercase, then classify it as comment
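The Subtask A rules listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the submitted code: the function names, the negation cue list, the `is_retweet` flag, the reading of "70% of the text" as a share of alphabetic characters, and the rule order are all assumptions (the order is described only as experimentally determined). In this sketch the first matching rule overrides the classifier label.

```python
import re

WH_WORDS = ("where", "when", "how", "what", "why", "which")
# Assumed negation cues; the list actually used is not given in the paper.
NEGATIONS = {"not", "no", "never", "cannot"}


def starts_with_wh_word(text):
    """True if the first alphabetic run of the tweet is a wh-word."""
    match = re.match(r"[a-z]+", text.strip().lower())
    return bool(match) and match.group(0) in WH_WORDS


def has_negation(text):
    """True if any token is a negation cue or ends in the clitic n't."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(t in NEGATIONS or t.endswith("n't") for t in tokens)


def uppercase_ratio(text):
    """Share of alphabetic characters that are uppercase (assumed reading)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)


def apply_stance_rules(text, classifier_label, is_retweet=False):
    """Cascade of if-then rules; falls back to the classifier label."""
    stripped = text.strip()
    if starts_with_wh_word(stripped) or stripped.endswith("?"):
        return "query"
    if has_negation(stripped):
        return "deny"
    if is_retweet:
        return "support"
    if uppercase_ratio(stripped) > 0.7:
        return "comment"
    return classifier_label
```

Because the first match wins here, reordering the `if` statements changes the output for tweets that trigger several rules, which is why the cascading order matters.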

Subtask B (closed) Heuristics
The classifier was trained on two classes (true, false). This was followed by a post-processing module of decision rules based on linguistic patterns and Twitter metadata. The heuristics were as follows:
• If a tweet begins with a wh-word (where, when, how, what, why, which) and/or ends with a question mark, then classify it as false
• If the tweet has been retweeted x number of times, then classify it as true
• If more than 70% of the text is all uppercase, then classify it as false
• If the tweet contains more than three @usernames and hashtags, then classify it as false
• If the author of the tweet has more than 10,000 followers, then classify it as true
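A comparable sketch for the Subtask B rules is given below. Again, this is an illustration under assumptions rather than the submitted code: the function and parameter names are hypothetical, the retweet threshold x is not stated in the text and is therefore a required argument (`retweet_threshold`), "retweeted x number of times" is read as "at least x", and the rules are applied in the listed order with the first match winning.

```python
import re

WH_WORDS = ("where", "when", "how", "what", "why", "which")


def uppercase_ratio(text):
    """Share of alphabetic characters that are uppercase (assumed reading)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)


def apply_veracity_rules(text, classifier_label, retweet_count,
                         follower_count, retweet_threshold):
    """Cascade of if-then rules; falls back to the classifier label.

    retweet_threshold stands for the unspecified value x in the rule list.
    """
    stripped = text.strip()
    first_word = re.match(r"[a-z]+", stripped.lower())
    if (first_word and first_word.group(0) in WH_WORDS) or stripped.endswith("?"):
        return "false"
    if retweet_count >= retweet_threshold:
        return "true"
    if uppercase_ratio(stripped) > 0.7:
        return "false"
    # Count @usernames and hashtags together, as in the rule text.
    if len(re.findall(r"[@#]\w+", stripped)) > 3:
        return "false"
    if follower_count > 10000:
        return "true"
    return classifier_label
```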

Models and Experiments
In this section, we describe the details of the features used in our models as well as the different experimental settings.

Models
We trained three different classifiers, followed by applying the heuristics model described in Sections 2.3 and 2.4:
• Maximum Entropy classification (MaxEnt) (Malouf, 2002), also known as Multivariate Logistic Regression.
• Naive Bayes classification (Frank and Bouckaert, 2006) assumes independence of the features while counting.
• Winnow classification (Winnow2) (Littlestone, 1988) is similar to the perceptron model but uses a multiplicative weight update scheme rather than an additive method.
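To make the contrast with the perceptron concrete, the following is a minimal textbook-style sketch of a binary Winnow2 learner over Boolean features. It is not Mallet's implementation (which handles multiple classes), and the function names, the default promotion factor `alpha=2.0`, and the default threshold of half the number of features are assumptions chosen for illustration.

```python
def winnow2_train(examples, n_features, alpha=2.0, threshold=None, epochs=10):
    """Train a binary Winnow2 model.

    examples: list of (active_feature_indices, label) with label in {0, 1}.
    """
    if threshold is None:
        threshold = n_features / 2  # assumed default
    weights = [1.0] * n_features
    for _ in range(epochs):
        mistakes = 0
        for active, label in examples:
            score = sum(weights[i] for i in active)
            predicted = 1 if score > threshold else 0
            if predicted == label:
                continue
            mistakes += 1
            # Multiplicative update: promote on a missed positive,
            # demote on a false positive. A perceptron would instead
            # add or subtract a fixed amount here.
            factor = alpha if label == 1 else 1.0 / alpha
            for i in active:
                weights[i] *= factor
        if mistakes == 0:
            break
    return weights, threshold


def winnow2_predict(weights, threshold, active):
    """Predict 1 if the summed weights of active features exceed the threshold."""
    return 1 if sum(weights[i] for i in active) > threshold else 0
```

Only the weights of features that are active in a misclassified example change, so irrelevant features that keep firing on negative examples are driven towards zero quickly.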
While we submitted only the MaxEnt model due to time constraints, we also include the results and analysis of the performance of the Naive Bayes and Winnow classifiers. We also computed an ensemble classifier, i. e., a voting-based combination of the three models' results using the following algorithm:
• Count the number of votes (MaxEnt, Naive Bayes, Winnow) for each of the categories (four for Subtask A, two for Subtask B)
• Select the category with the maximum number of votes
• If there is a tie, select the result of MaxEnt classifier
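The voting scheme above is small enough to sketch directly. The function name and argument names below are hypothetical; the logic follows the three steps as described.

```python
from collections import Counter


def ensemble_vote(maxent_label, naive_bayes_label, winnow_label):
    """Majority vote over three classifier outputs, MaxEnt breaks ties."""
    votes = Counter([maxent_label, naive_bayes_label, winnow_label])
    ranked = votes.most_common()
    top_count = ranked[0][1]
    winners = [label for label, count in ranked if count == top_count]
    if len(winners) == 1:
        return winners[0]
    # Tie: fall back to the MaxEnt prediction.
    return maxent_label
```

With three voters, a tie can only occur when all three classifiers disagree, which is possible for the four-class Subtask A but not for the two-class Subtask B.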

Useful Features
For subtask A (determining the category of a message), we compiled a list of distinctive features characteristic of each of the stances: support, query, deny, comment. We conducted an investigation into linguistic and context-specific patterns that may distinguish one stance from the other. For example, query messages almost always have a wh-word and a question mark. Table 2 gives a snapshot of the frequency of the patterns on the training data in each of the SDQC categories.
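A per-category pattern count of this kind can be computed with a few lines of Python. The sketch below is an assumption about how such an analysis might be set up, not the procedure actually used: the pattern names, the two example patterns, and the input format (a list of `(text, label)` pairs) are hypothetical.

```python
import re
from collections import Counter, defaultdict

WH_PATTERN = re.compile(r"(?:where|when|how|what|why|which)\b")


def starts_with_wh_word(text):
    return bool(WH_PATTERN.match(text.strip().lower()))


def ends_with_question_mark(text):
    return text.strip().endswith("?")


# Hypothetical pattern inventory; further patterns can be added here.
PATTERNS = {
    "starts_with_wh_word": starts_with_wh_word,
    "ends_with_question_mark": ends_with_question_mark,
}


def pattern_frequencies(labelled_tweets):
    """Count, per category, how many tweets match each pattern."""
    counts = defaultdict(Counter)
    for text, label in labelled_tweets:
        for name, matches in PATTERNS.items():
            if matches(text):
                counts[label][name] += 1
    return counts
```

Patterns whose counts are concentrated in a single category are natural candidates for post-processing rules.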

Experimental Setup
The features used in the classification algorithms consisted of a vector of the words (Twitter text). When we attempted to incorporate some of the features described above in the classification algorithm, the performance deteriorated. This led us to implement a post-process heuristic module and subject the results of the classification to a second, rule-based step.
Table 3 shows the results of our experiments. We submitted the MaxEnt results. However, the ensemble method (combination of all three models) shows a much better performance. Figure 2 shows the number of tweets we classified correctly in each category (blue bar). Our systems performed best at predicting "comment" and "query" in subtask A and "false" in subtask B. The poor performance on "support" in subtask A and "true" in subtask B can be attributed to our post-process framework, i. e., our rules are not sufficiently discriminative. A work-around is to label all tweets as "support" and then apply the if-then rules.

Discussion
In this section, we briefly touch upon a few observations from our experiments. First, the actual Twitter text should not be cleaned in any way, i. e., errors, misspellings, acronyms etc. contained in the text help in the task. Second, rule-based heuristics derived from a statistical analysis of the characteristics of the training data, applied as a post-processing step, help to improve the classification performance on the test data.

Conclusion
We implemented hybrid systems, i. e., combinations of statistical (classifier) and rule-based (heuristics) modules. It can be observed that textual features and metadata benefit both tasks. In terms of future work, we plan to implement a better cascading model, i. e., to assign probabilities to the heuristics.