What works and what does not: Classifier and feature analysis for argument mining

This paper offers a comparative analysis of the performance of different supervised machine learning methods and feature sets on argument mining tasks. Specifically, we address the tasks of extracting argumentative segments from texts and predicting the structure between those segments. Eight classifiers and different combinations of six feature types reported in previous work are evaluated. The results indicate that the best-performing features overall are the structural ones. Although the performance of the classifiers varies depending on the feature combinations and the corpora used for training and testing, Random Forest seems to be among the best-performing classifiers. These results provide a basis for further development of argument mining techniques and can guide the integration of argument mining into applications such as argument-based search.


Introduction
Argument mining refers to the automatic extraction of arguments from natural texts. An argument consists of a claim (also referred to as the conclusion of the argument) and several pieces of evidence called premises that support or reject the claim (Lippi and Torroni, 2016).
In terms of methods, previous studies rely on supervised machine learning. Among the classification approaches applied, Support Vector Machines, Naïve Bayes and Logistic Regression are the most common. Different feature types have also been investigated for the different steps of the argument mining task. The most prominent feature types are structural, lexical, syntactic, indicator and contextual features, as summarized by Stab and Gurevych (2014b).
Given this variety of work on argument mining, the time is ripe for an extensive comparative analysis of the performance of different machine learning techniques on different argument mining tasks using different data sets. Such an analysis should serve as a basis for further development of argument mining techniques and also inform those who want to integrate argument mining components into other applications.
In this paper we offer such a comparative analysis of machine learning methods and features with respect to two argument mining tasks: (1) identifying argumentative segments in text, i.e. the classification of textual units (usually sentences) into claims, premises or none and (2) the prediction of argument structure, i.e. connecting claims and premises. We re-implement a rich set of features reported by related work and evaluate eight different classification systems. We perform our investigation on two different well-known corpora: (1) the persuasive essays corpus reported by Stab and Gurevych (2016) and (2) the Wikipedia claim and premise data reported by .

Data
We investigate the feature and classifier performances on two corpora. The first corpus consists of over 400 persuasive essays where arguments are annotated as claim, premise or major claim (Stab and Gurevych, 2016). For our purposes we consider each major claim as a claim to keep the argumentation model as simple as possible and ensure comparability between data sets. The second corpus consists of over 300 Wikipedia articles in which arguments are annotated as either Context Dependent Claim (CDC) or Context Dependent Evidence (CDE) in the context of a given topic .

Features
We evaluate several feature types proposed in previous work (Stab and Gurevych, 2014b). Structural features consider statistics about tokens and punctuation. Lexical features capture information on unigram frequency as well as salient verbs and adverbs. Syntactic features incorporate occurrences of frequent POS sequences. Indicators are drawn from a list of argumentative keywords. Contextual features take into account the structural and lexical features of the surrounding sentences. In terms of data preprocessing, we performed lemmatization before the feature extraction step but did not remove stopwords, as they are relevant for determining arguments. For instance, stopwords such as "because" and "therefore" are good indicators of argumentative text.
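To make the notion of structural features more concrete, the following minimal sketch computes a few token and punctuation statistics for a single sentence. The feature names, the whitespace tokenization and the question-mark flag are illustrative assumptions and not the exact feature definitions used in our experiments.

```python
import string


def structural_features(sentence: str) -> dict:
    """Toy structural statistics for one sentence (feature names are hypothetical)."""
    tokens = sentence.split()  # whitespace tokenization is an assumption
    punctuation = [ch for ch in sentence if ch in string.punctuation]
    return {
        "num_tokens": len(tokens),
        "num_punctuation": len(punctuation),
        "ends_with_question_mark": int(sentence.rstrip().endswith("?")),
    }


features = structural_features("We should ban cars, because they pollute.")
print(features)
```

Such counts are cheap to compute and do not depend on the vocabulary of a particular corpus.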
Each feature set is scaled to a range between 0 and 1 and normalized by tf-idf. Furthermore, we investigated word embeddings as an additional feature type, using the pre-trained Google News vectors, which consist of 3 million 300-dimensional English word vectors.
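The scaling step can be illustrated with a plain min-max transformation of one feature column. This is a sketch under the assumption that scaling is applied per feature; the function name and the handling of constant columns are hypothetical, and the tf-idf normalization is not shown.

```python
def min_max_scale(column):
    """Map the values of one feature column linearly onto the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to 0 (an assumption).
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


token_counts = [5, 10, 20]  # hypothetical raw feature values
scaled = min_max_scale(token_counts)
print(scaled)
```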

Detection of Argumentative Sentences
The first classification task involves identifying argumentative sentences in natural texts. This is treated as a three-class classification task, where sentences are classified as claim, premise or none. The gold standard data contains texts annotated as either premise or claim. To obtain the non-argumentative sentences, which are needed as negative examples when training a classifier, we include sentences for which there are no annotations.
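The construction of the three-class data can be sketched as follows. The representation of annotations as a mapping from sentence index to label is an assumption made for illustration only.

```python
def label_sentences(sentences, annotations):
    """Pair each sentence with its gold label; unannotated sentences become 'none'.

    annotations: dict mapping a sentence index to 'claim' or 'premise'
    (a hypothetical representation of the gold standard).
    """
    return [(text, annotations.get(index, "none"))
            for index, text in enumerate(sentences)]


sentences = [
    "Cars should be banned from city centres.",
    "They are a major source of air pollution.",
    "I live in a small town.",
]
labelled = label_sentences(sentences, {0: "claim", 1: "premise"})
print(labelled)
```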

Prediction of Argumentative Structures
The second classification task aims to identify the relationship between claims and premises. This task is treated as a binary classification task: a claim and a premise can be in a linked or an unlinked relation. All annotated pairs of claims and premises are taken as linked examples. To determine the unlinked examples, we take a subset of both the annotated premises and the annotated claims and calculate the cross product of these two sets. The selection of negative pairs is a randomized process in which a single argument may be selected repeatedly, but a complete pair may not.
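The sampling of unlinked pairs can be sketched as below. The function name, the fixed seed and the explicit exclusion of annotated (linked) pairs from the cross product are assumptions made for this illustration.

```python
import itertools
import random


def sample_unlinked_pairs(claims, premises, linked, k, seed=0):
    """Draw k distinct (claim, premise) pairs that are not annotated as linked."""
    linked = set(linked)
    # Cross product of the two sets, minus the annotated pairs (an assumption).
    candidates = [pair for pair in itertools.product(claims, premises)
                  if pair not in linked]
    rng = random.Random(seed)
    # Sampling without replacement: a single claim or premise may recur,
    # but no complete pair is selected twice.
    return rng.sample(candidates, min(k, len(candidates)))


claims = ["c1", "c2"]
premises = ["p1", "p2", "p3"]
linked = [("c1", "p1"), ("c2", "p2"), ("c2", "p3")]
negatives = sample_unlinked_pairs(claims, premises, linked, k=2)
print(negatives)
```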

Classifiers
We investigate eight classifiers, some of which have been used in previous studies (LinearSVC, Logistic Regression, Random Forest and Multinomial Naïve Bayes (MNB)) and some of which we apply for the first time to the above tasks (Nearest Neighbor, AdaBoosted Decision Tree (AdaBoost), Gaussian Naïve Bayes (GNB) and Convolutional Neural Networks (CNNs)). Each classifier, except the CNN, has been trained and tested on each possible combination of the six feature types.
For the task of argumentative sentence detection, the best overall result on the persuasive essays is an F1-score of 81%, achieved by the LinearSVC classifier with all six feature sets combined. The structural features achieve the best results among the single feature types. Similar results have also been reported by Stab and Gurevych (2014a) for a smaller corpus of 90 persuasive essays. In the leave-one-out setting, too, removing the structural features leads to the largest loss in performance. Lexical features are the next most useful feature type for separating argumentative from non-argumentative sentences. Syntactic features are found to be the least useful for this task: the performance of classifiers based on these features alone is low, and removing them from a set of features does not lead to a substantial reduction in performance.
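The evaluation over feature combinations and the leave-one-out setting can be enumerated as in the following sketch. The identifier names are hypothetical; the six feature types are those described in the Features section.

```python
from itertools import combinations

FEATURE_TYPES = ["structural", "lexical", "syntactic",
                 "indicators", "contextual", "word_embeddings"]


def feature_combinations(feature_types):
    """Return every non-empty combination of the given feature types."""
    combos = []
    for size in range(1, len(feature_types) + 1):
        combos.extend(combinations(feature_types, size))
    return combos


def leave_one_out(feature_types):
    """Map each feature type to the combination of all remaining types."""
    return {left_out: tuple(f for f in feature_types if f != left_out)
            for left_out in feature_types}


all_combos = feature_combinations(FEATURE_TYPES)
ablations = leave_one_out(FEATURE_TYPES)
print(len(all_combos))  # 2**6 - 1 non-empty subsets
```

With six feature types this yields 63 combinations per classifier, of which six are single feature types and six are leave-one-out settings.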
For the task of predicting the argument structure, the best overall result (66%) is achieved by the AdaBoost classifier using all features except word embeddings. Table 1 indicates that the structural features are again the best-performing single feature set, achieving an F1-score of 65% in combination with Logistic Regression and LinearSVC. This single structural feature set even outperforms the combined feature sets (excluding the "ALL without Word Embeddings" setting), which suggests that including the other feature types, in particular word embeddings, adds only noise. The other feature types all perform substantially worse than the structural feature type, and their overall performance is similar.
Given the strong performance of the structural features, we computed significance tests between the best result of this feature type and the best result of each other single feature type. The results of the significance tests are shown in the first two rows (after the table heading) of Table 3.

Results on Wikipedia Data
For the Wikipedia corpus we extracted 2858 premise and claim examples and 1200 non-argumentative examples for the sentence detection task. We randomly selected the 1200 non-argumentative examples from sentences that were not annotated. We admit that these negative examples can still contain argumentative sentences, because the Wikipedia corpus contains only topic-dependent claims and premises; any claim or premise not related to the topic was not annotated. For the structure prediction task we obtained 1232 positive examples of support relations between premises and claims and 1200 negative examples of non-supporting relations. The negative relational instances are those with wrong pairings. The results for the Wikipedia corpus are shown in Table 2.
Table 2 reveals that for argumentative sentence detection the structural features again achieve the best results among the single feature types and lead to the largest loss in performance when removed from the set of all features. The best-scoring classifier is Random Forest, which achieves an F1-score of 94% based on structural features alone. The best overall result, an F1-score of 96%, is achieved by the Random Forest classifier when combining the five feature sets without word embeddings. As in the persuasive essay corpus, the arguments in the Wikipedia corpus are best identified using structural features. The lexical feature type achieves the next best results in both the single and the leave-one-out feature settings. Syntactic features do not have a substantial influence on separating argumentative from non-argumentative sentences, which was also observed for the persuasive essay corpus. Overall, the scores for Wikipedia are substantially higher than those obtained for the essay corpus.
For the structure prediction task on the Wikipedia corpus, Table 2 indicates that the structural features are the best single feature type, achieving an F1-score of 56% with the Nearest Neighbor classifier. The performance of the syntactic features is the lowest, while the lexical and word embedding feature types perform in general comparably to the structural features.
The results are shown as X/Y, where X refers to the score for the task of detecting argumentative sentences and Y refers to the score for predicting argumentative structure.
Table 3: Significance according to Student's t-test between the structural features and the others for the essay corpus (first 2 rows) and the Wikipedia corpus (last 2 rows). When conducting multiple analyses on the same dependent variable, the chance of obtaining a significant result by pure chance increases. To correct for this we applied a Bonferroni correction. Results are reported after this correction. In the cells, Y means significant and N means not significant.
The best results are achieved when the word embedding, lexical, indicator and structural feature types are combined, leading to an F1-score of 60% with the Logistic Regression classifier. As for the essay corpus, we computed significance tests between the structural feature set and the other single feature sets. The results are shown in the last two rows of Table 3.
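The Bonferroni correction mentioned in the caption of Table 3 divides the significance level by the number of comparisons. The sketch below illustrates this; the significance level of 0.05 and the p-values are hypothetical and are not taken from our experiments.

```python
def bonferroni_decisions(p_values, alpha=0.05):
    """Mark each comparison as significant ('Y') or not ('N') after correction."""
    num_tests = len(p_values)
    threshold = alpha / num_tests  # corrected per-comparison significance level
    return ["Y" if p < threshold else "N" for p in p_values]


# Hypothetical p-values for the structural features against five other feature types.
example_p_values = [0.001, 0.02, 0.04, 0.2, 0.009]
decisions = bonferroni_decisions(example_p_values)
print(decisions)
```

Note that two of the hypothetical p-values (0.02 and 0.04) would count as significant at the uncorrected level but not after the correction.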

Results with CNN
Finally, for both the detection of argumentative pieces of text and structure prediction, we adopted the Convolutional Neural Network (CNN) architecture described by Kim (2014), who applied it to the task of sentiment analysis. Apart from changing the inputs from sentiment sentences to argumentative pieces of text, we kept the original architecture as well as all training settings described by Kim (2014). Table 4 shows the results of the CNN classifier for both corpora. The CNN performs well in argumentative sentence detection, achieving an F1-score of 74% for the persuasive essay corpus and 75% for the Wikipedia data. In terms of structure prediction it reaches an F1-score of 73% for the persuasive essay corpus and 52% for the Wikipedia corpus.

Conclusion
In this paper we presented a comparative analysis of supervised classification methods for two argument mining tasks. Specifically, we investigated six feature types proposed in previous work in combination with eight classifiers, some of which have been used before and some of which were new to these tasks. We addressed two argument mining tasks: (1) the detection of argumentative pieces of text and (2) the prediction of the structure between claims and premises. We performed our analysis on two different corpora: persuasive essays and Wikipedia articles. The most robust result in our analysis was the contribution of the structural features. For both corpora and both tasks, these features were consistently the most relevant ones. Likewise, syntactic features were not useful in any of the experimental settings. Classifier performance varied across features and corpora, and no single classifier consistently outperformed the others. However, the Random Forest classifier showed the best results on the Wikipedia corpus and results comparable to the best ones on the essay corpus. In future work we plan to expand our investigation by including other corpora as well as Recurrent Neural Networks. For the final version of the paper we also plan to include an extensive error analysis, which we omit here due to space limitations.