Argument Mining on Twitter: Arguments, Facts and Sources

Social media collect and spread across the Web personal opinions, facts, fake news and all kinds of information users may be interested in. Applying argument mining methods to such heterogeneous data sources is a challenging open research issue, in particular considering the peculiarities of the language used to write textual messages on social media. In addition, new issues emerge when dealing with arguments posted on such platforms, such as the need to distinguish between personal opinions and actual facts, and to detect the source disseminating information about such facts to allow for provenance verification. In this paper, we apply supervised classification to identify arguments on Twitter, and we present two new tasks for argument mining, namely facts recognition and source identification. We study the feasibility of the approaches proposed to address these tasks on a set of tweets related to the Grexit and Brexit news topics.


Introduction
Argument mining aims at automatically extracting natural language arguments and their relations from a variety of textual corpora, with the final goal of providing machine-processable structured data for computational models of arguments and reasoning engines (Peldszus and Stede, 2013; Lippi and Torroni, 2016). Several approaches have been proposed so far to tackle the two main tasks identified in the field: i) arguments extraction, i.e., detecting arguments within the input natural language texts and then detecting their boundaries, and ii) relations prediction, i.e., predicting the relations holding between the arguments identified in the first task 1 . Social media platforms like Twitter 2 and newspaper blogs allow users to post their own viewpoints on a certain topic, or to disseminate news read in newspapers. Since these texts are short, lack standard spelling and follow specific conventions (e.g., hashtags, emoticons), they represent an open challenge for standard argument mining approaches (Snajder, 2017). The nature and peculiarity of social media data also raise the need to define new tasks in the argument mining domain (Addawood and Bashir, 2016; Llewellyn et al., 2014).
In this paper, we tackle the first standard task in argument mining, addressing the research question: how to mine arguments from Twitter? Going a step further, we also address the following subquestions that arise in the context of social media: i) how to distinguish factual arguments from opinions? ii) how to automatically detect the source of factual arguments? To answer these questions, we extend and annotate a dataset of tweets extracted from the streams about the Grexit and the Brexit news. To address the first task of argument detection, we apply supervised classification to separate argument-tweets from non-argumentative ones. Considering only argument-tweets, in the second step we again apply a supervised classifier to distinguish tweets reporting factual information from those containing opinions only. Finally, for all those arguments recognized as factual in the previous step, we detect the source of such information (e.g., the CNN), relying on the type of the Named Entities recognized in the tweets. The last two steps represent new tasks in the argument mining research field, of particular importance in social media applications.
1 We refer the reader interested in more details on argument mining to (Peldszus and Stede, 2013; Lippi and Torroni, 2016) as survey papers, and to the proceedings of the Argument Mining workshop series (https://argmining2017.wordpress.com/).
2 www.twitter.com

Mining arguments on Twitter
In this section, we describe the approaches we have developed to address the following tasks on social media data: i) Argument detection, ii) Factual vs opinion classification, and iii) Source identification. Our experimental setting, whose goal is to investigate the tasks' feasibility on such peculiar data, considers a dataset of tweets related to the political debates on whether or not Great Britain and Greece had to leave the European Union (i.e. the #Brexit and #Grexit threads on Twitter).

Experimental setting
Dataset. 3 The only available resource of annotated tweets for argument mining is DART (Bosc et al., 2016a). The topics contained in this resource are highly heterogeneous (i.e. the letter to Iran written by 47 U.S. senators; the referendum for or against Greece leaving the EU; the release of Apple iWatch; the airing of the 4th episode of the 5th season of the TV series Game of Thrones). Considering that tweets discussing a political topic generally have a more developed argumentative structure than tweets commenting on a product release, we selected for our experiments the subset of the DART dataset on the thread #Grexit (987 tweets). Then, following the same methodology described in (Bosc et al., 2016a), we extended this dataset by collecting 900 tweets from the thread on #Brexit. From the original thread, we filtered out retweets, accounts with a bot probability >0.5 (Davis et al., 2016), and almost identical tweets (Jaccard distance, empirically evaluated threshold). Given that tweets in DART are already annotated for task 1 (argument/non-argument, see Section 2.2), two annotators carried out the same task on the newly extracted data. Moreover, the same annotators annotated both datasets (Grexit/Brexit) for the other two tasks of our experiments, i.e. i) given the argument tweets, annotation of tweets as either containing factual information or opinions (see Section 2.3), and ii) given factual argument tweets, annotation of their source when explicitly cited (see Section 2.4). Tables 1, 2 and 3 contain statistical information on the datasets.
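The near-duplicate filtering step mentioned above can be illustrated with a minimal sketch. It assumes a simple word-level tokenisation and a hypothetical threshold (`max_distance=0.2`); the paper only states that the threshold was evaluated empirically, so neither choice should be read as the authors' actual setting.

```python
import re


def tokens(text):
    """Lowercase a tweet and return its set of word-like tokens (hashtags and mentions kept)."""
    return set(re.findall(r"[#@]?\w+", text.lower()))


def jaccard_distance(a, b):
    """Jaccard distance: 1 - |A & B| / |A | B| over the two token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def drop_near_duplicates(tweets, max_distance=0.2):
    """Keep a tweet only if it is farther than max_distance from every tweet kept so far.

    max_distance is a hypothetical value chosen for illustration.
    """
    kept = []
    for tweet in tweets:
        if all(jaccard_distance(tweet, k) > max_distance for k in kept):
            kept.append(tweet)
    return kept
```

The greedy pass is quadratic in the number of tweets, which is acceptable for threads of a few thousand tweets such as those considered here.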
Inter-annotator agreement (IAA) (Carletta, 1996) between the two annotators was calculated for the three annotation tasks, resulting in κ=0.767 on the first task (calculated on 100 tweets), κ=0.727 on the second task (on 80 tweets), and Dice=0.84 (Dice, 1945) 4 on the third task (on the whole dataset). More specifically, to compute IAA, we sampled the data as follows: for the first task, we randomly selected 10% of the tweets of the Grexit dataset (our training set); for task 2, we again randomly selected 10% of the tweets annotated as argument in the previous annotation step; for task 3, given the small size of the dataset, both annotators annotated the whole corpus.
Classification algorithms. We tested Logistic Regression (LR) and Random Forest (RF) classification algorithms, relying on the scikit-learn tool suite 5 . For the learning methods, we used an exhaustive Grid Search over a set of predefined hyper-parameters to find the best performing ones (the goal of our work is not to optimize the classification performance but to provide a preliminary investigation of new tasks in argument mining over Twitter data). We extract argument-level features from the dataset of tweets (following (Wang and Cardie, 2014)), which we group into the following categories:
• Lexical (L): unigram, bigram, WordNet verb synsets;
• Twitter-specific (T): punctuation, emoticons;
• Syntactic/Semantic (S): we have two versions of dependency relations as features, one being the original form, the other generalizing a word to its POS tag in turn. We also use the syntactic tree of the tweets as a feature. We apply the Stanford parser (Manning et al., 2014) to obtain parse trees and dependency relations;
• Sentiment (SE): we extract the sentiment from the tweets with the Alchemy API 6 , the sentiment analysis feature of IBM's Semantic Text Analysis API. It returns a polarity label (positive, negative or neutral) and a polarity score between -1 (totally negative) and 1 (totally positive).
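As a rough illustration of the Lexical and Twitter-specific feature groups listed above, the following sketch extracts unigram/bigram counts and a few surface counts from a tweet. The feature names, the tokenisation and the emoticon pattern are assumptions made for this example; WordNet synsets, syntactic and sentiment features are left out because they depend on external tools.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[#@]?\w+")
# Hypothetical, deliberately small pattern for Western-style emoticons such as :) ;-( =D
# Note: it would also match the ":/" inside "http://", so URLs should be stripped first.
EMOTICON_RE = re.compile(r"[:;=8][\-o\*']?[\)\]\(\[dDpP/\\|]")


def lexical_features(text):
    """Unigram and bigram counts over lowercased tokens."""
    toks = TOKEN_RE.findall(text.lower())
    feats = Counter("uni=" + t for t in toks)
    feats.update("bi=" + a + "_" + b for a, b in zip(toks, toks[1:]))
    return feats


def twitter_features(text):
    """Counts of punctuation marks and emoticons."""
    return {
        "n_exclamation": text.count("!"),
        "n_question": text.count("?"),
        "n_emoticons": len(EMOTICON_RE.findall(text)),
    }
```

Both functions return plain mappings from feature name to count, which could then be vectorised with any dictionary-based vectoriser.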
As baselines we consider both LR and RF algorithms with a set of basic features (i.e., lexical).
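For readers unfamiliar with the agreement measures reported earlier in this section, the sketch below computes Cohen's kappa over two label sequences and the Dice coefficient over two sets. How exactly the Dice score was applied to the source annotations is not specified in the text, so treating each annotator's output as a set of source strings is an assumption.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two annotators' marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


def dice(items_a, items_b):
    """Dice coefficient: 2 * |A & B| / (|A| + |B|)."""
    a, b = set(items_a), set(items_b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))
```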

Task 1: Argument detection
The task consists in classifying a tweet as being an argument or not. We consider as arguments all those text snippets providing a portion of a standard argument structure, i.e., opinions under the form of claims, facts mirroring the data in the Toulmin model of argument (Toulmin, 2003), or persuasive claims, following the definition of argument tweet provided in (Bosc et al., 2016a,b). Our dataset contains 746 argument tweets and 241 non-argument tweets for Grexit (that we use as training set), and 713 argument tweets and 187 non-argument tweets for Brexit (the test set). Below we report an example of argument tweet (a), and of a non-argument tweet (b).
(a) Junker asks "who does he think I am". I suspect elected PM Tsipras thinks Junker is an unelected Eurocrat. #justsaying #democracy #grexit
(b) #USAvJPN #independenceday #JustinBieberBestIdol Macri #ConEsteFrioYo happy 4th of july #Grefenderum Wireless Festival
We cast the argument detection task as a binary classification task, and we apply the supervised algorithms described in Section 2.1. Table 4 reports the results obtained with the different configurations, while Table 5 reports the results obtained by the best configuration, i.e., LR + All features, for each category.
6 https://www.ibm.com/watson/alchemy-api.html
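The binary classification setup described above can be sketched with scikit-learn as follows. This is only an approximation of the lexical baseline (unigrams and bigrams with Logistic Regression): the toy tweets are made up for illustration and are not taken from DART, and the hyper-parameter grid and cross-validation settings are hypothetical, since the text does not list the values that were searched.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy tweets, for illustration only (NOT taken from the DART dataset).
train_texts = [
    "Leaving the euro will hurt pensions because prices will rise",
    "The government should resign, the referendum was a mistake",
    "Banks stay closed so a deal is needed now",
    "Austerity failed, debt relief is the only option",
    "#grexit #greece #euro",
    "good morning everyone #grexit",
    "lol #greferendum",
    "watching tv tonight #grexit",
]
train_labels = ["arg"] * 4 + ["non-arg"] * 4
test_texts = ["Leaving the EU will hurt trade because tariffs rise", "#brexit lol"]

# Lexical features (unigrams + bigrams) followed by Logistic Regression.
pipeline = Pipeline([
    ("lexical", CountVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: only the regularisation strength C is searched here.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(train_texts, train_labels)

# Train on one topic, predict on another, mirroring the Grexit/Brexit split.
predictions = search.predict(test_texts)
```

Swapping `LogisticRegression` for `RandomForestClassifier` (with a grid over its own parameters) would give the corresponding RF configuration.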

Most of the misclassified tweets are either ironic, e.g.:
If #Greece had a euro for every time someone mentioned #Grexit and #Greferendum they would probably have enough for a bailout. #GreekCrisis
which was wrongly classified as argument, or contain reported speech, e.g.:
Jeremy Warner: Unintentionally, the Greeks have done themselves a favour. Soon, they will be out of the euro http://t.co/YmqXi36lGj #Grexit
which was wrongly classified as non-argument. Our results are comparable to those reported in (Bosc et al., 2016b) (they trained a supervised classifier on the tweets of all topics in the DART dataset except the iWatch, which was used as test set). The better performance obtained in our setting is most likely due to better feature selection, and to the fact that in our case the topics in the training and test sets are more homogeneous.

Task 2: Factual vs opinion classification
This task consists in classifying argument-tweets as containing factual information or being opinion-based (Park et al., 2015). Our interest focuses in particular on factual argument-tweets, as we are then interested in the automated identification of their sources. This would make it possible to rank factual tweet-arguments depending on the reliability or expertise of their source for subsequent tasks such as fact checking. Given the huge amount of work in the literature devoted to opinion extraction, we do not carry out any further analysis of opinion-based arguments here, and refer the interested reader to (Liu, 2012).
An argument is annotated as factual if it contains a piece of information which can be proved to be true (see example (a) below), or if it contains "reported speech" (see example (b) below). All the other argument tweets are considered as "opinion" (see example (c) below).
To address the task of factual vs opinion arguments classification, we apply the supervised classification algorithms described in Section 2.1. Tweets from the Grexit dataset are used as training set, and those from the Brexit dataset as test set. Table 6 reports the obtained results, while Table 7 reports the results obtained by the best configuration, i.e. LR + All features, for each category.
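Per-category results of the kind referred to above are typically reported as precision, recall and F1 for each class. The sketch below shows one way to compute them from gold and predicted labels; the label names "fact" and "opinion" are assumptions used for illustration, and the exact metrics reported in the tables cannot be confirmed from the text.

```python
def per_class_scores(gold, predicted):
    """Return {label: (precision, recall, f1)} for every label seen in gold or predicted."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores
```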
Most of the misclassified tweets contain reported opinions/reported speech and are wrongly classified by the algorithm as opinion. Such behaviour could be expected given that sentiment features play a major role in these cases, e.g.,
[...]
that was wrongly classified as fact.

Task 3: Source identification
Since factual arguments (as defined above) are generally reported by news agencies and individuals, the third task we address, which can be of value in the context of social media, is the recognition of the information source that disseminates the news reported in a tweet (when explicitly mentioned). For instance, in:
The Guardian: Greek crisis: European leaders scramble for response to referendum no vote. http://t.co/cUNiyLGfg3
the source of information is The Guardian newspaper. Such annotation is useful to rank factual tweet-arguments depending on the reliability or expertise of their source, for example in news summarization or fact-checking applications.
Our dataset contains 79 factual argument tweets where the source is explicitly cited for Grexit (training set), and 40 factual argument tweets where the source is explicitly cited for Brexit (test set). Given the small size of the available annotated dataset, to address this task we implemented a simple string matching algorithm that relies on a gazetteer containing a set of Twitter usernames and hashtags extracted from the training data, and a list of very common news agencies (e.g. BBC, CNN, CNBC). If no matches are found, the algorithm extracts the NEs from the tweets through the system of Nooralahzadeh et al. (2016), and applies the following two heuristics: i) if a NE is of type dbo:Organisation or dbo:Person, it considers that NE as the source; ii) it searches the abstract of the DBpedia 7 page linked to that NE for the words "news", "newspaper" or "magazine" (if found, the entity is considered as the source). In the example above, the following NEs have been detected in the tweet: "The Guardian" (linked to the DBpedia resource http://dbpedia.org/page/The_Guardian) and "Greek crisis" (linked to http://dbpedia.org/page/Greek_government-debt_crisis). Applying these heuristics, the first NE is considered as the source. Table 8 reports the obtained results. As baseline, we use a method that considers all the NEs detected in the tweet as sources.
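The two-stage procedure described above (gazetteer lookup, then entity-based heuristics) can be sketched as follows. Several things here are assumptions: the gazetteer contains only the three agencies named in the text, entity linking is not performed (entities are passed in as hand-written records with hypothetical `surface`, `types` and `abstract` fields), and the way the two heuristics are combined (checked in order for each entity, first hit wins) is one possible reading of the description.

```python
import re

# Only the agencies named in the text; the real gazetteer also holds usernames and hashtags.
GAZETTEER = {"bbc", "cnn", "cnbc"}
SOURCE_TYPES = {"dbo:Organisation", "dbo:Person"}
NEWS_WORDS = ("news", "newspaper", "magazine")


def match_gazetteer(tweet, gazetteer=GAZETTEER):
    """Return the first token found in the gazetteer (single-word entries only), else None."""
    for token in re.findall(r"[#@]?\w+", tweet):
        if token.lower().lstrip("#@") in gazetteer:
            return token
    return None


def source_from_entities(entities):
    """Apply heuristics i) and ii) to each linked entity in order of appearance."""
    for ent in entities:
        # i) the entity type marks it as an organisation or a person
        if SOURCE_TYPES & set(ent.get("types", ())):
            return ent["surface"]
        # ii) the linked abstract mentions a news-related word
        abstract = ent.get("abstract", "").lower()
        if any(re.search(r"\b" + word + r"\b", abstract) for word in NEWS_WORDS):
            return ent["surface"]
    return None


def identify_source(tweet, entities):
    """Gazetteer match first; fall back to the entity heuristics if nothing matches."""
    return match_gazetteer(tweet) or source_from_entities(entities)
```

In a real setting the `entities` list would come from the named entity linking system, with types and abstracts retrieved from DBpedia.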
Table 8 (surviving row, labels lost): 0.69 0.64 0.67
Most of the errors of the algorithm are due to information sources not recognized as NEs (in particular, when the source is a Twitter user), or NEs that are linked to the wrong DBpedia page. However, in order to draw more interesting conclusions on the most suitable methods to address this task, we would need to increase the size of the dataset.

Discussion and Future work
This paper investigated argument mining tasks on Twitter data. The main contribution is twofold: first, we propose one of the very few approaches to argument mining on Twitter, and second, we propose and evaluate two new tasks for argument mining, i.e., facts recognition and source identification. These tasks are particularly relevant when applied to social media data, in line with the popular open challenges of fact-checking and source verification, to which these results contribute.
The issue of argument detection on Twitter has already been addressed in the literature. Bosc et al. (2016a,b) address a binary classification task (argument-tweet vs. non-argument) as the first step of their pipeline. Goudas et al. (2015) experiment with machine learning techniques over a dataset in Greek extracted from social media. They first detect argumentative sentences, and then identify premises and claims. However, none of them is interested in distinguishing facts from opinions or in identifying the arguments' sources. An argumentation-based approach is applied to Twitter data to extract opinions in (Grosse et al., 2015), with the aim of detecting conflicting elements in an opinion tree to avoid potentially inconsistent information. Both the goal and the adopted methodology are different from ours.
As this is work in progress, several open issues have to be considered as future research. Among them, we are currently extending the dataset of annotated tweets both in terms of annotated tweets per topic, and in terms of addressed topics (e.g. Brexit after the referendum, Trump), in order to have more instances of facts and sources. On this extended dataset, we plan to run experiments using the three modules of the system as a pipeline.
Moreover, we plan to extend our pipeline by also considering the links provided in the tweets to verify their sources: for instance, if a tweet claims to report information from the CNN but the link actually redirects to an advertisement website, the source cannot be assumed to be the CNN.
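A minimal sketch of the link check envisaged above is given below. It is not part of the system described in this paper: the mapping from source names to domains is hypothetical, and the URL is assumed to have already been resolved from its shortened form (e.g. a t.co link), since the example performs no network access.

```python
from urllib.parse import urlparse

# Hypothetical mapping from a claimed source to the domains it publishes on.
KNOWN_DOMAINS = {
    "cnn": {"cnn.com"},
    "the guardian": {"theguardian.com"},
}


def host_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def link_supports_source(claimed_source, resolved_url, known_domains=KNOWN_DOMAINS):
    """True if the resolved link points to a domain (or subdomain) of the claimed source."""
    expected = known_domains.get(claimed_source.lower(), set())
    host = host_of(resolved_url)
    return any(host == d or host.endswith("." + d) for d in expected)
```

A tweet whose link fails this check would not necessarily be false, but its claimed source could not be confirmed from the link alone.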