Yes, we can! Mining Arguments in 50 Years of US Presidential Campaign Debates

Political debates offer a rare opportunity for citizens to compare the candidates’ positions on the most controversial topics of the campaign. They thus represent a natural application scenario for Argument Mining. As existing research lacks a solid empirical investigation of the typology of argument components in political debates, we fill this gap by proposing an Argument Mining approach to political debates. We address this task empirically by annotating 39 political debates from the last 50 years of US presidential campaigns, creating a new corpus of 29k argument components labeled as premises and claims. We then propose two tasks: (1) identifying the argumentative components in such debates, and (2) classifying them as premises or claims. We show that feature-rich SVM learners and Neural Network architectures outperform standard baselines in Argument Mining over such complex data. We release the new corpus USElecDeb60To16 and the accompanying software under free licenses to the research community.


Introduction
Political debates are public confrontations in which election candidates are asked to compare their positions on topics such as unemployment, taxes, and foreign policy. During presidential elections in the US, it is customary for the main candidates of the two largest parties, i.e., the Democratic and the Republican Parties, to engage in a debate around the most controversial issues of the time. Such debates are considered a de facto part of the election process, and in some cases they have nearly decided the outcome of the election (Coleman et al., 2015).
Given the importance of these debates and their inherently argumentative nature, they represent a natural playground for Argument(ation) Mining (AM) methods (Peldszus and Stede, 2013; Lippi and Torroni, 2016b). AM deals with analyzing argumentation in various domains, such as legal cases (Mochales and Moens, 2011), persuasive essays (Stab and Gurevych, 2017), clinical trials (Mayer et al., 2018) and scientific articles (Teufel et al., 2009). The ability to identify argumentative components and predict their relations in such texts opens the door to cutting-edge tasks like fallacy detection, fact-checking, and counter-argumentation generation.
Despite the plethora of existing approaches and annotated corpora for AM, very few of them tackle the issue of mining argumentative structures from political debates (Lippi and Torroni, 2016a; Menini et al., 2018; Duthie and Budzynska, 2018; Visser et al., 2019). To the best of our knowledge, none of them addresses the identification of argument components (i.e., premises and claims) on a large corpus of political debates. This paper fills this gap by (1) performing a large-scale annotation study over 50 years of US presidential campaigns, from 1960 (Nixon vs. Kennedy) to 2016 (Trump vs. Clinton), resulting in 29k annotated argument components, and (2) experimenting with feature-rich SVM learners and neural architectures that outperform standard baselines in Argument Mining. Finally, to ensure full reproducibility of our experiments, we provide all data and source code under free licenses.
The paper is organized as follows. Section 2 discusses related work and compares it to the proposed approach. In Section 3, we present the corpus of political debates we built, along with some examples from the annotation guidelines. Section 4 describes the experimental setting, reports the obtained results, and discusses the main sources of error.
Background and related work

AM is "the general task of analyzing discourse on the pragmatics level and applying a certain argumentation theory to model and automatically analyze the data at hand" (Habernal and Gurevych, 2017). Two tasks are crucial in AM: (i) argument component detection in the input text: this step may be further split into the detection of argument components (i.e., claims and premises) and of their textual boundaries. Different methods have been tested, such as Support Vector Machines (SVMs) (e.g., Lippi and Torroni, 2016c), Naïve Bayes classifiers (Duthie et al., 2016), Logistic Regression (Levy et al., 2014) and Neural Networks (Stab and Gurevych, 2017); and (ii) prediction of the relations holding between the argumentative components (i.e., attacks and supports). Relations can be predicted between arguments (Cabrio and Villata, 2013) or between argument components (Stab and Gurevych, 2017).
Regarding political debates, Menini et al. (2018) predict relations on speeches of the Nixon-Kennedy campaign, considering only annotations of the relations among arguments. Lippi and Torroni (2016a) focus on the 2015 UK election debates to study the impact of vocal features from the speeches on the identification of claims, and built a small corpus of political debates annotated with premises and claims. Duthie and Budzynska (2018) proposed the ethos mining task, aiming at detecting ethotic arguments and the relations among politicians and parties in the UK Parliament; sentences are annotated as being ethotic arguments or not. Basave and He (2016) studied the use of semantic frames for modelling argumentation in speakers' discourse, investigating the impact of argumentation as an influence rank indicator for politicians on 20 debates for the Republican primary election. Finally, Visser et al. (2019) present a dataset composed of the transcripts of televised political debates leading up to the 2016 presidential election in the US, with the addition of reactions from the social media platform Reddit; the corpus is annotated following Inference Anchoring Theory, not with argument components. Contrary to past works, we create a large annotated dataset including 39 political debates, and we present a successful attempt at argument component detection on such a big corpus of political debates.

Dataset creation
The USElecDeb60To16 v.01 dataset was collected from the website of the Commission on Presidential Debates, which provides transcripts of the debates broadcast on TV and held among the leading candidates for the presidential and vice-presidential nominations in the US. USElecDeb60To16 includes the debates from Kennedy and Nixon in 1960 up to those between Clinton and Trump in 2016. Table 1 provides some statistics on the dataset in terms of the number of turns in the conversations, and of sentences and words in the transcripts. The unique properties of this dataset are its size (see Table 1), its peculiar nature of containing reciprocal discussions (mainly between Democrats and Republicans), and its timeline structure. The motivation for creating a new corpus is twofold: i) to the best of our knowledge, no other big corpus of political debates annotated at the argument component level for Argument Mining exists, and ii) we ensure the reproducibility of the annotation by writing guidelines, inspired by (Rinott et al., 2015; Lippi and Torroni, 2016a), with precise rules for identifying and segmenting argument components (i.e., claims and premises) in political debates. In the following, we detail the annotation of the argument components through examples from the USElecDeb60To16 dataset.
Claims. As the ultimate goal of an argument, a claim in the context of political debates can be a policy advocated by a party or a candidate, which needs to be justified in order to be accepted by the audience. In Example 1, Bush defends the decisions taken by his administration by claiming that his policy has been effective. Claims may also provide judgments about the other candidates or parties (Example 2). Taking a stance towards a controversial subject, or an opinion towards a specific issue, is also considered a claim (e.g., "I've opposed the death penalty during all of my life"). The presence of discourse indicators (e.g., "in my opinion", "I believe") is generally a useful hint for finding claims that state opinions and judgments.
Premises. Premises are assertions made by the debaters to support their claims (i.e., reasons or justifications). A type of premise commonly used by candidates is the reference to past experience: more experienced candidates exploit this technique to assert that their claims are more relevant than their opponents' (Example 4).

Three expert annotators defined the annotation guidelines, and three other annotators then carried out the annotation task relying on such guidelines. Each transcript was independently annotated by at least two annotators. Of the sentences annotated with at least one component, 86% were tagged with only one argument component, while the remaining 14% contain more than one component (7% both claims and premises). Only 0.6% of the dataset contains cross-sentence annotations (i.e., annotations which are not bound within one sentence). 19 debates were independently annotated by three annotators to measure the inter-annotator agreement (IAA). The observed agreement percentage and the IAA at sentence level (following (Stab and Gurevych, 2014)) are, respectively, 83% and κ = 0.57 (moderate agreement) for argumentative vs. non-argumentative sentences, and 63% and κ = 0.40 (fair agreement) for the argument components. Such annotation tasks are very difficult on political debates: in many examples, the choice between a premise and a claim is hard to make. In Example 7, the sentence "the way Senator [...]" is used as a premise for the previous claim, but observed out of context it could be identified as a claim. This explains the IAA on the argument component annotation. To release a consistent dataset, in the reconciliation phase we computed the IAA of the annotators against two further expert annotators with a background in computational linguistics on a sample of 6 debates.
In case of disagreement among the first three annotators, the released dataset includes the annotation provided by the annotator who proved to be most consistently in line with the expert annotators (i.e., the one with the higher IAA). After the reconciliation phase, the USElecDeb60To16 dataset contains 29521 annotated argument components (16087 claims and 13434 premises). Notice that the number of claims is higher than the number of premises, because in political speeches the candidates mostly make arguments without providing premises for their claims. Moreover, the candidates use longer sentences (more words) to express their premises than their claims.
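The κ values reported above are Cohen's kappa, which corrects the observed agreement p_o for the agreement p_e expected by chance from the two annotators' label distributions. A minimal sketch of the computation (the toy label sequences are invented for illustration, not taken from the corpus):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[lab] * freq_b.get(lab, 0) for lab in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Toy example: argumentative vs. non-argumentative sentence labels.
a = ["Arg", "Arg", "None", "Arg", "None", "Arg"]
b = ["Arg", "None", "None", "Arg", "None", "Arg"]
print(round(cohens_kappa(a, b), 3))  # → 0.667
```

Note how κ stays well below the raw agreement (5/6 ≈ 0.83 here) whenever one label dominates, which is exactly why the paper reports both figures.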
For our experiments, we split the dataset into train (13894 components), validation (6577 components) and test (9050 components) sets, keeping the same component distribution as in the original dataset.
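Keeping the component distribution constant across splits amounts to a stratified split: group the components by label, then slice each group by the desired split fractions. A minimal sketch (the fractions and toy data are illustrative, not the exact proportions of USElecDeb60To16):

```python
from collections import defaultdict

def stratified_split(items, labels, fractions=(0.5, 0.2, 0.3)):
    """Split items into one part per fraction, preserving label proportions."""
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)
    splits = [[] for _ in fractions]
    for group in by_label.values():
        start = 0
        for i, frac in enumerate(fractions):
            end = start + round(frac * len(group))
            splits[i].extend(group[start:end])
            start = end
        splits[-1].extend(group[start:])  # rounding leftovers go to the last split
    return splits

# Toy corpus: 60 "claim" components and 40 "premise" components.
items = list(range(100))
labels = ["claim"] * 60 + ["premise"] * 40
train, val, test = stratified_split(items, labels)
print(len(train), len(val), len(test))  # → 50 20 30
```

Each of the three parts then contains the same 60/40 claim-to-premise ratio as the full toy corpus.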

Experimental setting
We address the argument component detection task as two subsequent classification steps: argumentative sentence detection (Task 1) and argumentative component identification (Task 2). We address both classification tasks at the sentence level (i.e., we label a sentence according to the longest component annotated in that sentence).
Methods For Task 1, we trained both a linear-kernel SVM with stochastic gradient descent learning using bag-of-words features only, and an SVM classifier with RBF kernel (Python scikit-learn v0.20.1, penalty parameter=10) using the features listed below, to distinguish argumentative sentences (i.e., sentences which contain at least one argument component) from non-argumentative ones. For comparison, we also tested a Neural Network structured with two bidirectional LSTM layers (Hochreiter and Schmidhuber, 1997), using word embeddings from fastText (Joulin et al., 2016; Mikolov et al., 2018) as the weights of the embedding layer; the output layer determines the class Argumentative/Non-Argumentative for the input sentence. A feedforward Neural Network was also trained using the same sentence-based features as the SVM classifier; this network consists of two hidden layers with 64 and 32 neurons, respectively.
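The linear-kernel SGD baseline can be sketched with scikit-learn; the toy sentences and labels below are invented, and any hyperparameters beyond those stated in the text (e.g., the vectorizer settings and random seed) are assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Toy training data; real training uses the annotated debate sentences.
sentences = [
    "I've opposed the death penalty during all of my life",
    "Good evening and welcome to tonight's debate",
    "Our policy has reduced unemployment across the country",
    "Thank you very much, Senator",
]
labels = ["Arg", "None", "Arg", "None"]

# Bag-of-words features fed to a linear SVM trained with SGD
# (hinge loss gives the linear-SVM objective).
clf = make_pipeline(CountVectorizer(),
                    SGDClassifier(loss="hinge", random_state=0))
clf.fit(sentences, labels)

print(clf.predict(["So what should we do about taxes?"]))
```

The RBF-kernel variant simply swaps the estimator for `SVC(kernel="rbf", C=10)` over the richer feature set described below.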
For the component classification step (Task 2), we applied the same classifiers as for Task 1 (SVM and LSTM). For both tasks, we also implemented the majority baseline for argument component classification used by Stab and Gurevych (2017).
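The majority baseline simply predicts the most frequent training label for every test item; a minimal sketch (toy labels for illustration):

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(label == majority for label in test_labels) / len(test_labels)

# Toy example: "Claim" dominates the training labels,
# so the baseline predicts "Claim" everywhere.
train = ["Claim"] * 6 + ["Premise"] * 4
test = ["Claim", "Premise", "Claim", "Premise"]
print(majority_baseline_accuracy(train, test))  # → 0.5
```

Because claims outnumber premises in the corpus, this baseline amounts to labeling every component a claim.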
We considered the following features: tf-idf of each word; n-grams (bigrams and trigrams); POS of adverbs and adjectives (used by debaters to stress the correctness of their premises); verb tenses and modal verbs (they often affect the certainty of the assertions, and hence can hint at facts vs. non-facts when discerning between argument components); syntactic features (constituency parse trees, depth of the parse tree); discourse connectives (and their position); NER (debaters often mention party members, former presidents, organizations, and dates or numbers such as statistics as examples to strengthen the premises for their claims); and semantic features (sentiment polarity of the argument component and of its covering sentence (Menini et al., 2018)).
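Since the tf-idf features turn out to be among the most useful (see the ablation discussion below), it may help to spell out the weighting; a minimal sketch with naive whitespace tokenization (a real feature extractor would tokenize and normalize more carefully):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document tf-idf: term frequency times log inverse document frequency."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()  # number of documents in which each term occurs
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: (tf[t] / len(tokens)) * math.log(n / df[t]) for t in tf})
    return weights

docs = ["taxes must go down", "taxes fund our schools", "we believe in schools"]
w = tf_idf(docs)
# "taxes" occurs in 2 of the 3 documents, so its idf is log(3/2);
# "must" occurs in only one, so it receives the larger weight log(3).
```

Terms frequent in one sentence but rare across the corpus thus get high weights, which is what lets the classifier latch onto topic- and stance-bearing vocabulary.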
Evaluation Tables 2 and 3 present the results obtained on detecting argumentative sentences (Task 1) and on classifying argumentative components (Task 2), respectively. Results obtained with the linear-kernel SVM significantly outperform the majority baseline in both tasks. Enriching the feature set increased classification performance by 9% on Task 1 using the RBF-kernel SVM, but only by 2.2% on Task 2. Running ablation tests for feature analysis, we noticed that the lexical features (tf-idf and n-gram features) contribute most strongly to the performance increase in both tasks. NER features, selected on the assumption that they would improve the detection of premises, since candidates tend to use named entities to provide examples, proved effective in Task 1 only. Sentiment and discourse indicator features were not effective in either classification task. The results obtained by the LSTM with word embeddings as features are comparable to those of the SVM using all features in both tasks, showing the efficiency of neural classifiers on AM tasks with lower-dimensional input data. Given the complexity of the task, we computed the human upper bound as the average F-score of the annotation agreement between the annotators and the gold standard: 0.87 and 0.75 on argumentative vs. non-argumentative sentences, and 0.74 and 0.65 on claims and premises, respectively.
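The per-class F-scores behind this human upper bound combine precision and recall against the gold standard; a minimal sketch (toy gold/predicted labels for illustration):

```python
def f1(gold, pred, positive):
    """F1 of `pred` against `gold` for one class of interest."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: one annotator's labels scored against the gold standard.
gold = ["Claim", "Claim", "Premise", "Claim", "Premise"]
pred = ["Claim", "Premise", "Premise", "Claim", "Claim"]
print(round(f1(gold, pred, "Claim"), 3))  # → 0.667
```

Averaging such scores per class over the annotators yields upper-bound figures of the kind reported above.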
Error Analysis Argumentative sentences are rarely misclassified, which results in high recall on argumentative sentence identification. Some patterns can be identified among the misclassified non-argumentative sentences. One such pattern appears in very short non-argumentative sentences which contain an argument indicator such as "so": for instance, the sentence "So what should we do?" is classified as argumentative, although in context it is a non-argumentative sentence. Since indicators for claims are more numerous in these debates, this misclassification mostly occurs when a claim indicator is uttered by a candidate in a non-argumentative manner.
In other cases, candidates make final remarks phrased with a structure similar to that of argumentative sentences, for example: "I think when you make that decision, it might be well if you would ask yourself, are you better off than you were four years ago?".
Misclassification between claims and premises, instead, is primarily due to the fact that component classification depends heavily on the structure of the surrounding argument, which our sentence-level classifiers do not observe.

Conclusion
We investigated the detection of argument components in US presidential campaign debates by: i) providing a manually annotated resource of 29k argument components, and ii) evaluating feature-rich SVM learners and Neural Networks on such data (achieving ∼90% of human performance). We highlighted the strengths (e.g., satisfactory performance across oratory styles, time periods and topics) and weaknesses (e.g., no clause-level detection of argument boundaries; the context of the whole debate is not considered) of our approach.
For future work, we plan to i) automatically predict the relations between argument components in the USElecDeb60To16 dataset, and ii) propose fallacy detection as a new task, so that common fallacies in political argumentation (Zurloni and Anolli, 2010) can be automatically identified, in line with the work of Habernal et al. (2018).