UPV-28-UNITO at SemEval-2019 Task 7: Exploiting Post’s Nesting and Syntax Information for Rumor Stance Classification

In this paper we describe the UPV-28-UNITO system submitted to the RumorEval 2019 shared task. The approach we applied to both subtasks of the contest combines classical machine learning algorithms with word embeddings, and is based on diverse groups of features: stylistic, lexical, emotional, sentiment, meta-structural and Twitter-based. We moreover introduce a novel set of features that takes advantage of the syntactic information in texts.


Introduction
The problem of rumor detection has lately been attracting considerable attention, also in light of the very fast diffusion of information that characterizes social media platforms. In particular, rumors are facilitated by large user communities, where even expert journalists are unable to keep up with the huge volume of online generated information and to decide whether a news item is a hoax (Procter et al., 2013; Webb et al., 2016; Zubiaga et al., 2018).
Rumour stance classification is the task of classifying the type of contribution to a rumour expressed by the different posts of a same thread (Qazvinian et al., 2011) according to a set of given categories: supporting, denying, querying or simply commenting on the rumour. For instance, on Twitter, once a tweet that introduces a rumour is detected (the "source tweet"), all the tweets having a reply relationship with it (i.e. being part of the same thread) are collected to be classified.
Our participation in this task mainly focuses on the investigation of linguistic features of social media language that can be used as cues for detecting rumors 1 .

Related work
The RumorEval 2019 shared task involves two tasks: Task A (rumour stance classification) and Task B (verification).
Stance Detection (SD) consists in automatically determining whether the author of a text is in favour of, against, or neutral towards a given target, i.e. a statement, event, person or organization; this setting is generally indicated as TARGET-SPECIFIC STANCE CLASSIFICATION (Mohammad et al., 2016).
Another, more general-purpose type of stance classification is the OPEN STANCE CLASSIFICATION task, usually indicated with the acronym SDQC, referring to the four categories used to indicate the attitude of a message with respect to the rumour: Support (S), Deny (D), Query (Q) and Comment (C) (Aker et al., 2017). Target-specific stance classification is especially suitable for analyses of a specific product or political actor, since the target is given as already extracted, e.g. from conversational cues. In this regard several shared tasks have been organized in recent years: see for instance SemEval-2016 Task 6 (Mohammad et al., 2017), considering six commonly known targets in the United States, and StanceCat at IberEval-2017, on stance and gender detection in tweets on the matter of the independence of Catalonia (Taulé et al., 2017). On the other hand, open stance classification (i.e. the task addressed in this paper) is more suitable for classifying emerging news or novel contexts, such as online media or streaming news analysis.
Provided that attitudes around a claim can act as proxies for its veracity, and not only for its controversiality, it is reasonable to apply SDQC techniques to rumour analysis tasks. A first shared task applying SDQC to rumor detection was organized at SemEval-2017, i.e. RumorEval 2017. Furthermore, several research works have analyzed the open issue of the impact of rumors in social media (Resnick et al., 2014; Zubiaga et al., 2015, 2018), for instance by exploiting linguistic features. Approaches of this kind may also be found in works dealing with the problem of Fake News Detection (Ciampaglia et al., 2015; Hanselowski et al., 2018).
In this context, a rumor is defined as a "circulating story of questionable veracity, which is apparently credible but hard to verify, and produces sufficient scepticism and/or anxiety so as to motivate finding out the actual truth" (Zubiaga et al., 2015).
Concerning veracity identification, increasingly advanced systems and annotation schemas have been developed to support the analysis of rumour veracity and misinformation in text (Qazvinian et al., 2011;Kumar and Geethakumari, 2014;Zhang et al., 2015).

Description of the task
The RumorEval task is articulated in the following sub-tasks: Task A (open stance classification, SDQC) is a multi-class classification problem that determines whether a message is a "support", "deny", "query" or "comment" with respect to the original post; Task B (verification) is a binary classification problem that predicts the veracity of a given rumour as "true" or "false", together with a confidence value in the range 0-1.

Training and Test Data
The RumourEval 2019 corpus contains a total of 8,529 English posts, namely 6,702 from Twitter and 1,827 from Reddit.
The portion of data from Twitter has been built by combining the RumorEval 2017 training and development datasets, and includes 5,568 tweets: 325 source tweets (grouped into eight overall topics, such as the Charlie Hebdo attack, the Ottawa shooting and the Germanwings crash) and 5,243 discussion tweets collected in their threads. The dataset from Reddit, newly released this year, is composed of 1,134 posts: 40 source posts and 1,094 posts collected in their threads. All data have been split into training and test sets with a proportion of approximately 80%−20% (see Table 1).

UPV-28-UNITO Submission
The approach and the feature selection we applied are the same for both tasks, and are based on a set of manual features described in Section 4.1. We moreover built another set of features (i.e. second-level features) extracted by combining the manual features with features based on word embeddings (see Section 4.2 for a detailed description). For task B we used the same features as in task A, modeling their distribution with respect to each thread. Then, in both tasks, we fed the features to a classical machine learning classifier.

Manual Features
For enhancing the selection of features, we investigated the impact of diverse groups of them: emotional, sentiment, lexical, stylistic, meta-structural and Twitter-based. Furthermore, we introduced a novel set of syntax-based features.
Emotional Features - We exploited several emotional resources in order to build features for our system. We used three lexica: (a) EmoSenticNet, a lexicon that assigns six WordNet Affect emotion labels to SenticNet concepts (Poria et al., 2013); (b) the NRC Emotion Lexicon, a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive) (Mohammad and Turney, 2010); and (c) SentiSense, an easily scalable concept-based affective lexicon for Sentiment Analysis (De Albornoz et al., 2012). We also exploited two tools: (d) Empath, a tool that can generate and validate new lexical categories on demand from a small set of seed terms (Fast et al., 2016); and (e) LIWC, a text analysis dictionary that counts words in psychologically meaningful categories (Pennebaker et al., 2001).
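As an illustration, lexicon-based emotional features of this kind can be computed as normalized per-category counts. The following is a minimal sketch; the toy lexicon below is purely hypothetical and merely stands in for resources such as the NRC Emotion Lexicon.

```python
from collections import Counter

# Toy word-to-emotion lexicon (illustrative only; the actual system uses
# EmoSenticNet, the NRC Emotion Lexicon, SentiSense, Empath and LIWC).
TOY_EMOTION_LEXICON = {
    "attack": {"fear", "anger"},
    "hoax": {"disgust"},
    "hope": {"joy", "anticipation"},
}
EMOTIONS = ["anger", "fear", "anticipation", "trust",
            "surprise", "sadness", "joy", "disgust"]

def emotion_features(tokens):
    """Return one count per emotion category, normalized by post length."""
    counts = Counter()
    for tok in tokens:
        for emo in TOY_EMOTION_LEXICON.get(tok.lower(), ()):
            counts[emo] += 1
    n = max(len(tokens), 1)
    return [counts[e] / n for e in EMOTIONS]

feats = emotion_features("The attack was a hoax".split())
```

Each post is thus mapped to a fixed-length vector, one dimension per emotion category, which can be concatenated with the other feature groups.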
Sentiment Features - Our sentiment features were modeled exploiting sentiment resources such as: (a) SentiStrength, a sentiment strength detection program which uses a lexical approach that exploits a list of sentiment-related terms (Thelwall et al., 2010); (b) AFINN, a list of English words rated for valence with an integer between minus five (negative) and plus five (positive) (Nielsen, 2011); and (c) SentiWordNet, a lexical resource in which each WordNet synset is associated with three numerical scores, describing how objective, positive, and negative the terms contained in the synset are (Esuli and Sebastiani, 2007). In addition to the above-listed features, commonly exploited in Sentiment Analysis tasks, in this work we introduce two novel sets of features: (1) problem-specific features (considering the fact that the dataset is composed of Twitter data and Reddit data) and (2) syntactic features.
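An AFINN-style valence feature can be sketched as a simple sum of integer word scores in [-5, +5]; the lexicon entries below are hypothetical examples, not taken from the real AFINN list.

```python
# Hypothetical excerpt of an AFINN-style valence lexicon.
TOY_AFINN = {"good": 3, "bad": -3, "catastrophe": -4, "true": 2}

def afinn_score(tokens):
    """Sum the integer valences of the tokens found in the lexicon."""
    return sum(TOY_AFINN.get(t.lower(), 0) for t in tokens)

score = afinn_score("This is a bad catastrophe".split())  # -3 + -4 = -7
```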
Meta-structural Features - Since training and test data come from both Twitter and Reddit, we explored meta-structural features suitable for data from both platforms.
Syntactic Features - In our system some features have also been modeled by referring to the syntactic information in texts. After having parsed 5 the dataset into the Universal Dependency 6 format, thus obtaining a set of syntactic "dependency relations" (deprels), we were able to exploit: (a) the ratio of negation dependencies compared to all the other relations; (b) the Bag of Relations (BoR all), considering all the deprels attached to all the tokens; (c) the Bag of Relations (BoR list), considering all the deprels attached to the tokens belonging to a selected list of words (from the lists already made explicit in the paragraph "Lexical Features" in Section 4.1); and finally (d) the Bag of Relations (BoR verbs), considering all the deprels attached to all the verbs, thus fully exploiting morpho-syntactic knowledge.
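A minimal sketch of these syntactic features, assuming each post has been parsed into (token, upos, deprel) triples; the parse below is a hand-written toy example, not real parser output, and the exact deprel used to mark negation depends on the UD version adopted.

```python
from collections import Counter

# Hypothetical Universal Dependencies parse of a short post:
# (token, universal POS tag, dependency relation).
parsed_post = [
    ("This", "PRON", "nsubj"),
    ("is",   "AUX",  "cop"),
    ("not",  "PART", "advmod"),  # negation; some UD versions use "neg" instead
    ("true", "ADJ",  "root"),
]

def negation_ratio(parse, neg_deprels=frozenset({"advmod", "neg"}),
                   neg_lemmas=frozenset({"not", "no", "never"})):
    """(a) Ratio of negation dependencies over all relations."""
    neg = sum(1 for tok, _, rel in parse
              if rel in neg_deprels and tok.lower() in neg_lemmas)
    return neg / max(len(parse), 1)

def bag_of_relations(parse, upos_filter=None):
    """(b) BoR_all when upos_filter is None; (d) BoR_verbs with
    upos_filter={'VERB', 'AUX'}; (c) BoR_list would instead filter
    on a selected word list."""
    return Counter(rel for _, upos, rel in parse
                   if upos_filter is None or upos in upos_filter)
```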

Second-level Features
For the second-level features, we employed (a) the cosine similarity of each instance with respect to its parents and (b) information on the tree structure of a thread, exploiting its "nesting" and its depth from the source tweet.
Similarity with Parents - For this feature, we used the cosine similarity to measure the similarity between each post and its parents. The parents of a reply are (A) the direct upper-level post and (B) the source post of the thread (see Figure 1). We computed the cosine similarity with A and B by using both the final vector of manual features and the average word embedding vectors of the posts; the average word embedding vector of a post is obtained by averaging the embeddings of the post's words 7 .
SDQC Depth-based Clusters - We built level-based stance clusters from the posts. For each stance class (SDQC), we extracted all the posts belonging to that class and computed the average value of their feature vectors (as one unique cluster). Since there are four main stances, this process yields four main clusters. For the feature extraction, we measured the cosine similarity of each post with respect to these four clusters. As for the previous feature, we built these clusters by using both the manual feature vectors and the word embedding vectors of the posts, so each stance cluster is represented in two ways. In these four main clusters, we did not consider the nesting of the posts in the thread.
6 The de facto standard for the representation of syntactic knowledge in the NLP community: https://universaldependencies.org/
7 We used the pre-trained Google News word embeddings in our system: https://code.google.com/archive/p/word2vec/
We also built the same clusters considering the nesting of the posts in the thread, instead of averaging all the posts that correspond to a stance. We split the nesting of the threads into five groups: posts with depth one, two, three, four, and five or larger. For each of these levels, we extracted four depth-based SDQC clusters. For instance, if a post occurs at depth two, we measured the cosine similarity between this post and 1) the four main SDQC clusters 8 and 2) the four depth-based SDQC clusters of depth two.
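The cosine-similarity features above can be sketched as follows; the random toy vectors merely stand in for the pre-trained Google News embeddings and for the real cluster centroids.

```python
import numpy as np

# Toy embedding table (random vectors standing in for word2vec embeddings).
rng = np.random.default_rng(0)
toy_embeddings = {w: rng.normal(size=8)
                  for w in ["the", "attack", "is", "fake", "real"]}

def avg_embedding(tokens, dim=8):
    """Average word embedding vector of a post."""
    vecs = [toy_embeddings[t] for t in tokens if t in toy_embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    """Cosine similarity, with a 0.0 fallback for zero vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

reply = avg_embedding("the attack is fake".split())
source = avg_embedding("the attack is real".split())
# One feature per comparison: reply vs. source post, reply vs. direct parent,
# and reply vs. each (main or depth-based) SDQC cluster centroid.
sim_to_source = cosine(reply, source)
```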
Concerning task B, we modeled the distribution of the features used for task A. For each thread we did the following: 1. we counted how many posts in the thread correspond to each of the stances; 2. we extracted the averaged feature vector for the posts of each stance in the thread; 3. we extracted the standard deviation of the feature vectors for the posts of each stance in the thread.
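The three steps above can be sketched as follows; the two-dimensional feature vectors are toy values used only for illustration.

```python
import numpy as np

# A toy thread: each post is a (predicted stance, feature vector) pair.
thread = [
    ("support", np.array([0.1, 0.9])),
    ("support", np.array([0.3, 0.7])),
    ("deny",    np.array([0.8, 0.2])),
]

def thread_features(posts, stances=("support", "deny", "query", "comment"),
                    dim=2):
    """Per stance: 1. post count, 2. mean feature vector, 3. std deviation."""
    feats = []
    for s in stances:
        vecs = [v for lab, v in posts if lab == s]
        feats.append(float(len(vecs)))           # 1. count
        if vecs:
            feats.extend(np.mean(vecs, axis=0))  # 2. mean
            feats.extend(np.std(vecs, axis=0))   # 3. standard deviation
        else:
            feats.extend([0.0] * (2 * dim))      # stance absent in thread
    return feats

vec = thread_features(thread)
```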

Experiments
We tested different machine learning classifiers on each task, performing 10-fold cross-validation. The results showed that Logistic Regression (LR) produces the highest scores. For tuning the classifier, we used grid search. The parameters of the LR are: C = 61.5, penalty = L2; since the dataset is not balanced, we used different weights for the classes: COMMENT = 0.10, DENY = 0.35, SUPPORT = 0.20 and QUERY = 0.35. We conducted an ablation test on the features employed in task A in order to investigate their importance in the classification process. Table 2 presents the ablation test results as well as the system performance using 10-fold cross-validation. Since the organizers allowed two submissions for the final evaluation, on both tasks we used all the features (set A) for the first submission and set M for the second submission. In Table 3 we present the final scores achieved on both tasks.
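With scikit-learn, the classifier configuration reported above can be sketched as follows; X and y are random placeholders for the real feature matrix and SDQC labels, so the cross-validation scores are meaningless and only the setup is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the real feature vectors and labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = rng.choice(["comment", "deny", "support", "query"], size=200)

# Logistic Regression with the paper's hyperparameters and class weights
# (found via grid search on the unbalanced training data).
clf = LogisticRegression(
    C=61.5,
    penalty="l2",
    class_weight={"comment": 0.10, "deny": 0.35,
                  "support": 0.20, "query": 0.35},
    max_iter=1000,
)
scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation
```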

Error Analysis
A manual error analysis allowed us to see which categories and posts turned out to be the most difficult for our system. We found that SUPPORT was misclassified 114 times, DENY 92 times, QUERY 44 times, and COMMENT 57 times. Therefore, SUPPORT seems to be the hardest category to classify correctly. Table 4 reports the detailed confusion matrix of predicted vs. gold labels and shows that most of the errors involve the categories SUPPORT (in the gold dataset) and COMMENT (in our runs), while hardly any error involves the most contrasting classes (e.g. SUPPORT vs. DENY). It can moreover be observed that several semantically empty messages in the test set have been annotated with some class, while our system marks them as COMMENT, i.e. it selects the most frequent class when a clear indication of the content is lacking.
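Per-class error counts of this kind can be derived directly from paired gold and predicted label sequences; the short sequences below are toy examples, not the actual system output.

```python
from collections import Counter

# Toy gold and predicted SDQC labels for five posts.
gold = ["support", "support", "deny", "comment", "query"]
pred = ["comment", "support", "comment", "comment", "query"]

# Confusion counts: (gold label, predicted label) -> frequency.
confusion = Counter(zip(gold, pred))

# Misclassification count per gold class (off-diagonal row sums).
misclassified = {g: sum(c for (gg, p), c in confusion.items()
                        if gg == g and p != gg)
                 for g in set(gold)}
```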

Conclusion
In this paper we presented an overview of the UPV-28-UNITO participation in SemEval 2019 Task 7 - Determining Rumour Veracity and Support for Rumours. We submitted two different runs for rumor stance classification (Task A) and veracity classification (Task B) on English messages retrieved from both Twitter and Reddit. Our approach was based on emotional, sentiment, lexical, stylistic, meta-structural and Twitter-based features. Furthermore, we introduced two novel sets of features, i.e. syntactic and depth-based features, which proved to be successful for rumor stance classification: our system ranked 5th (out of 26) in Task A and, according to the RMSE score, 6th in Task B for veracity classification. Since these two groups of features produced an interesting contribution to the score for Task A, but were fairly neutral in Task B, we will follow this trail and inquire further into these aspects in future work.