What is the Essence of a Claim? Cross-Domain Claim Identification

Argument mining has become a popular research area in NLP. It typically includes the identification of argumentative components, e.g. claims, as the central component of an argument. We perform a qualitative analysis across six different datasets and show that these appear to conceptualize claims quite differently. To learn about the consequences of such different conceptualizations of claim for practical applications, we carried out extensive experiments using state-of-the-art feature-rich and deep learning systems, to identify claims in a cross-domain fashion. While the divergent conceptualization of claims in different datasets is indeed harmful to cross-domain classification, we show that there are shared properties on the lexical level as well as system configurations that can help to overcome these gaps.


Introduction
The key component of an argument is the claim. This simple observation has not changed much since the early works on argumentation by Aristotle more than two thousand years ago, although argumentation scholars provide us with a plethora of often clashing theories and models (van Eemeren et al., 2014). Despite the lack of a precise definition in contemporary argumentation theory, Toulmin's influential work on argumentation in the 1950s introduced a claim as an 'assertion that deserves our attention' (Toulmin, 2003, p. 11); recent works describe a claim as 'a statement that is in dispute and that we are trying to support with reasons' (Govier, 2010).
Argument mining, a computational counterpart of manual argumentation analysis, is a recent and growing sub-field of NLP (Peldszus and Stede, 2013a). 'Mining' arguments usually involves several steps, such as separating argumentative from non-argumentative text units, parsing argument structures, and recognizing argument components such as claims, the main focus of this article. Claim identification itself is an important prerequisite for applications such as fact checking (Vlachos and Riedel, 2014), politics and legal affairs (Surdeanu et al., 2010), and science (Park and Blake, 2012).
Although claims can be identified with a promising level of accuracy in typical argumentative discourse such as persuasive essays (Stab and Gurevych, 2014; Eger et al., 2017), less homogeneous resources, for instance online discourse, pose challenges to current systems (Habernal and Gurevych, 2017). Furthermore, existing argument mining approaches are often limited to a single, specific domain like legal documents (Mochales-Palau and Moens, 2009), microtexts (Peldszus and Stede, 2015), Wikipedia articles (Levy et al., 2014; Rinott et al., 2015) or student essays (Stab and Gurevych, 2017). The problem of generalizing systems or features and their robustness across heterogeneous datasets thus remains fairly unexplored.
This situation motivated us to perform a detailed analysis of the concept of claims (as a key component of an argument) in existing argument mining datasets from different domains. We first review and qualitatively analyze six existing publicly available datasets for argument mining (§3), showing that the conceptualizations of claims in these datasets differ considerably. In a next step, we analyze the influence of these differences on cross-domain claim identification. We propose several computational models for claim identification, including systems using linguistically motivated features (§4.1) and recent deep neural networks (§4.2), and rigorously evaluate them on and across all datasets (§5). Finally, in order to better understand the factors influencing the performance in a cross-domain scenario, we perform an extensive quantitative analysis of the results (§6).
Our analysis reveals that despite obvious differences in conceptualizations of claims across datasets, there are some shared properties on the lexical level which can be useful for claim identification in heterogeneous or unknown domains. Furthermore, we found that the choice of the source (training) domain is crucial when the target domain is unknown. We release our experimental framework to help other researchers build upon our findings.

Related Work
Existing approaches to argument mining can be roughly categorized into (a) multi-document approaches, which recognize claims and evidence across several documents, and (b) discourse-level approaches, which address the argumentative structure within a single document. Multi-document approaches have been proposed, e.g., by Levy et al. (2014) and Rinott et al. (2015) for mining claims and corresponding evidence for a predefined topic over multiple Wikipedia articles. Nevertheless, to date most approaches and datasets deal with single-document argumentative discourse. This paper takes the discourse-level perspective, as we aim to assess multiple datasets from different authors and compare their notion of 'claims'. Mochales-Palau and Moens (2009) experiment at the discourse level using a feature-rich SVM and a hand-crafted context-free grammar in order to recognize claims and premises in legal decisions. Their best results for claims achieve 74.1% F1 using domain-dependent key phrases, token counts, location features, information about verbs, and the tense of the sentence. Peldszus and Stede (2015) present an approach based on a minimum spanning tree algorithm and model the global structure of arguments considering argumentative relations, the stance and the function of argument components. Their approach yields 86.9% F1 for recognizing claims in English 'microtexts'. Habernal and Gurevych (2017) cast argument component identification as BIO sequence labeling and jointly model separation of argumentative from non-argumentative text units and identification of argument component boundaries together with their types. They achieved 25.1% Macro-F1 with a combination of topic, sentiment, semantic, discourse and embedding features using a structural SVM. Stab and Gurevych (2014) identified claims and other argument components in student essays. They experimented with several classifiers and achieved the best performance of 53.8% F1 using an SVM with structural, lexical, syntactic, indicator and contextual features.
Although the above-mentioned approaches achieve promising results in particular domains, their ability to generalize over heterogeneous text types and domains remains an open question. Rosenthal and McKeown (2012) set out to explore this direction by conducting cross-domain experiments for detecting claims in blog articles from LiveJournal and discussions taken from Wikipedia. However, they focused on relatively similar datasets that both stem from the social media domain and, in addition, annotated the datasets themselves, leading to an identical conceptualization of the notion of claim. Although Al-Khatib et al. (2016) also deal with cross-domain experiments, they address a different task, namely the identification of argumentative sentences. Further, their goals are different: they want to improve argumentation mining via distant supervision rather than detect differences in the notions of a claim.
Domain adaptation techniques (Daume III, 2007) try to address the frequently observed drop in classifier performance entailed by a dissimilarity of training and test data distributions. Since techniques such as learning generalized cross-domain representations in an unsupervised manner (Blitzer et al., 2006; Pan et al., 2010; Glorot et al., 2011; Yang and Eisenstein, 2015) have been criticized for targeting specific source and target domains, it has alternatively been proposed to learn universal representations from general domains in order to render a learner robust across all possible domain shifts (Müller and Schütze, 2015; Schnabel and Schütze, 2013). Our approach is in a similar vein. However, rather than trying to improve classifier performance for a specific source-target domain pair, we want to detect differences between these pairs. Furthermore, we are looking for universal feature sets or classifiers that perform generally well for claim identification across varying source and target domains.

Claim Identification in Computational Argumentation
We briefly describe six English datasets used in our empirical study; they all capture claims on the discourse level. Table 1 summarizes the dataset statistics relevant to claim identification.

Datasets
The AraucariaDB corpus (Reed et al., 2008) includes various genres (VG) such as newspaper editorials, parliamentary records, or judicial summaries. The annotation scheme structures arguments as trees and distinguishes between claims and premises at the clause level. Although the reliability of the annotations is unknown, the corpus has been extensively used in argument mining (Moens et al., 2007; Feng and Hirst, 2011; Rooney et al., 2012). The corpus from Habernal and Gurevych (2017) includes user-generated web discourse (WD) such as blog posts or user comments, annotated with claims and premises as well as backings, rebuttals and refutations (α_U 0.48), inspired by Toulmin's model of argument (Toulmin, 2003).
The persuasive essay (PE) corpus (Stab and Gurevych, 2017) includes 402 student essays. The scheme comprises major claims, claims and premises at the clause level (α_U 0.77). The corpus has been extensively used in the argument mining community (Persing and Ng, 2015; Lippi and Torroni, 2015; Nguyen and Litman, 2016).
Biran and Rambow (2011a) annotated claims and premises in online comments (OC) from blog threads of LiveJournal (κ 0.69). In a subsequent work, Biran and Rambow (2011b) applied their annotation scheme to documents from Wikipedia talk pages (WTP) and annotated 118 threads. For our experiments, we consider each user comment in both corpora as a document, which yields 2,805 documents in the OC corpus and 1,985 documents in the WTP corpus.
Peldszus and Stede (2016) created a corpus of German microtexts (MT) of controlled linguistic and rhetoric complexity. Each document includes a single argument and does not exceed five argument components. The scheme models the argument structure and distinguishes between premises and claims, among other properties (such as proponent/opponent or normal/example). In the first annotation study, 26 untrained annotators annotated 23 microtexts in a classroom experiment (κ 0.38) (Peldszus and Stede, 2013b). In a subsequent work, the corpus was largely extended by expert annotators (κ 0.83). Recently, they translated the corpus to English, resulting in the first parallel corpus in computational argumentation; our experiments rely on the English version.

Qualitative Analysis of Claims
In order to investigate how claim annotations are tackled in the chosen corpora, one co-author of this paper manually analyzed 50 randomly sampled claims from each corpus. The characteristics taken into account are drawn from argumentation theory (Schiappa and Nordin, 2013) and include among other things the claim type, signaling words and discourse markers.
Biran and Rambow (2011b) do not back up their claim annotations by any common argumentation theory but rather state that claims are utterances which convey subjective information, anticipate the question 'why are you telling me that?', and need to be supported by justifications. Using this rather loose definition, a claim might be any subjective statement that is justified by the author. Detailed examination of the LiveJournal corpus (OC) revealed that sentences with claims are extremely noisy. Their content ranges from a single word ("Bastard.") to emotional expressions of personal regret ("::hugs:: i am so sorry hon ..") to general Web-chat nonsense ("W-wow... that's a wicked awesome picture... looks like something from Pirates of the Caribbean...gone Victorian ...lolz.") or posts without any clear argumentative purpose ("what i did with it was make this recipe for a sort of casserole/stratta (i made this up, here is the recipe) [...] butter, 4 eggs, salt, pepper, sauted onions and cabbage..add as much as you want bake for 1 hour at 350 it was seriously delicious!"). The Wikipedia Talk Page corpus (WTP) contains claims typical of Wikipedia quality discussions ("That is why this article has NPOV issues."), and policy claims (Schiappa and Nordin, 2013) are present as well ("I think the gallery should be got rid of altogether."). However, a small number of nonsensical claims remains ("A dot.").
Analysis of the MT dataset revealed that about half of claim sentences contain the modal verb 'should', clearly indicating policy claims ("The death penalty should be abandoned everywhere."). Such statements also very explicitly express the stance on the controversial topic of interest. In a similar vein, claims in persuasive students' essays (PE) heavily rely on phrases signaling beliefs ("In my opinion, although using machines have many benefits, we cannot ignore its negative effects.") or argumentative discourse connectors whose usage is recommended in textbooks on essay writing ("Thus, it is not need for young people to possess this ability."). Most claims are value/policy claims written in the present tense.
The mixture of genres in the AraucariaDB corpus (VG) is reflected in the variety of claims. While some are simple statements starting with a discourse marker ("Therefore, 10% of the students in my logic class are left-handed."), there are many legal-specific claims requiring expert knowledge ("In considering the intention of Parliament when passing the 1985 Act, or perhaps more properly the intention of the draftsman in settling its terms, there are [...]"), reported and direct speech claims ("Eight-month-old Kyle Mutch's tragic death was not an accident and he suffered injuries consistent with a punch or a kick, a court heard yesterday."), and several nonsensical claims ("RE: Does the Priest Scandal Reveal the Beast?") which undercut the consistency of this dataset.
The web-discourse (WD) claims take a clear stance on the relevant controversy ("I regard single sex education as bad."), yet are sometimes anaphoric ("My view on the subject is no."). Discourse markers are seldom used. Habernal and Gurevych (2017) investigated hedging in claims and found that it varies with the topic being discussed (from 10% up to 35% of claims are hedged). Sarcasm and rhetorical questions are also common ("In 2013, is single-sex education really the way to go?").
These observations make clear that annotating claims (the central part of all arguments, as suggested by the majority of argumentation scholars) can be approached very differently when it comes to actual empirical, data-driven operationalization. While some traits are shared, such as that claims usually need some support to make up a 'full' argument (e.g., premises, evidence, or justifications), the exact definition of a claim can be arbitrary, depending on the domain, register, or task.

Methodology
Given the results from the qualitative analysis, we want to investigate whether the different conceptualizations of claims can be assessed empirically and, if so, how they could be dealt with in practice. Put simply, the task we are trying to solve in the following is: given a sentence, classify whether or not it contains a claim. We opted to model the claim identification task at the sentence level, as this is the only way to make all datasets compatible with each other. Different datasets model claim boundaries differently, e.g. MT includes discourse markers within the same sentence, whereas they are excluded in PE.
All six datasets described in the previous section have been preprocessed by first segmenting documents into sentences using Stanford CoreNLP (Manning et al., 2014) and then annotating every sentence as claim if one or more tokens within the sentence were labeled as claim (or major claim in PE). Analogously, each sentence is annotated as non-claim if none of its tokens were labeled as claim (or major claim). Although our basic units of interest are sentences, we keep the content of the entire document to be able to retrieve information about the context of (non-)claims. We are not interested in optimizing the properties of a certain learner for this task, but rather want to compare the influence of different types of lexical, syntactical, and other kinds of information across datasets. Thus, we used a limited set of learners for our task: a) a standard L2-regularized logistic regression approach with manually defined feature sets, which is a simple yet robust and established technique for many text classification problems (Plank et al., 2014; He et al., 2015; Zhang et al., 2016a; Ferreira and Vlachos, 2016); and b) several deep learning approaches, using state-of-the-art neural network architectures.
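The sentence-level labeling rule described above can be sketched as follows. This is a minimal illustration, not the released framework: the label strings `Claim` and `MajorClaim` and both function names are assumptions, and sentence segmentation (done with Stanford CoreNLP in the experiments) is taken as given.

```python
from typing import List, Sequence

# Assumed label names; "MajorClaim" would only occur in the PE corpus.
CLAIM_LABELS = {"Claim", "MajorClaim"}


def label_sentence(token_labels: Sequence[str]) -> str:
    """Return 'claim' if at least one token carries a claim label."""
    if any(label in CLAIM_LABELS for label in token_labels):
        return "claim"
    return "non-claim"


def label_document(sentences: List[Sequence[str]]) -> List[str]:
    """Label every sentence of an already segmented document."""
    return [label_sentence(labels) for labels in sentences]
```

A sentence in which a claim covers only a single clause is therefore still counted as a claim sentence, which is what makes clause-level and sentence-level corpora comparable.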
The in-domain experiments were carried out in a 10-fold cross-validation setup with fixed splits into training and test data. As for the cross-domain experiments, we train on the entire data of the source domain and test on the entire data of the target domain. In the domain adaptation terminology, this corresponds to an unsupervised setting.
To address class imbalance in our datasets (see Table 1), we downsample the negative class (non-claim) both in-domain and cross-domain, so that the positive and negative classes occur in an approximately 1:1 ratio in the training data. Since this means that we discard a lot of useful information (many negative instances), we repeat this procedure 20 times, in each case randomly discarding instances of the negative class such that the required ratio is obtained. At test time, we use the majority prediction of this ensemble of 20 trained models. With the exception of very few cases, this led to consistent performance improvements across all experiments. The systems are described in more detail in the following subsections. Additionally, we report the results of two baselines. The majority baseline labels all sentences as non-claims (the predominant class in all datasets); the random baseline labels sentences as claims with 0.5 probability.
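The downsampling ensemble described above could look roughly like the sketch below. It assumes a generic `train_fn` that turns a list of labeled sentences into a callable model; that interface, the function names, and the tie-breaking rule are illustrative assumptions rather than details taken from the experiments.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, int]      # (sentence, label): 1 = claim, 0 = non-claim
Model = Callable[[str], int]   # a trained model maps a sentence to a label


def downsample(data: Sequence[Example], rng: random.Random) -> List[Example]:
    """Keep all claims plus an equally sized random subset of non-claims."""
    pos = [ex for ex in data if ex[1] == 1]
    neg = [ex for ex in data if ex[1] == 0]
    return pos + rng.sample(neg, min(len(pos), len(neg)))


def train_ensemble(data: Sequence[Example],
                   train_fn: Callable[[List[Example]], Model],
                   n_models: int = 20,
                   seed: int = 0) -> List[Model]:
    """Train one model per independently downsampled training set."""
    rng = random.Random(seed)
    return [train_fn(downsample(data, rng)) for _ in range(n_models)]


def predict_majority(models: Sequence[Model], sentence: str) -> int:
    """Majority vote; ties go to non-claim (an assumption of this sketch)."""
    votes = Counter(model(sentence) for model in models)
    return 1 if votes[1] > votes[0] else 0
```

Because every model sees all claims but a different random slice of the non-claims, the ensemble as a whole still uses most of the negative instances that a single downsampled run would discard.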

Linguistically Motivated Features
For the logistic regression-based experiments (LR) we employed the following feature groups. Structure Features capture the position, the length and the punctuation of a sentence. Lexical Features are lowercased unigrams. Syntax Features account for grammatical information at the sentence level. We include information about the part-of-speech and parse tree for each sentence.
Discourse Features encode information extracted with the help of the Penn Discourse Treebank (PDTB)-style end-to-end discourse parser presented by Lin et al. (2014). Embedding Features represent each sentence as a summation of its word embeddings (Guo et al., 2014). We further experimented with sentiment features (Habernal and Gurevych, 2015; Anand et al., 2011) and dictionary features (Misra et al., 2015; Rosenthal and McKeown, 2015), but these delivered very poor results and are not reported in this article. The full set of features and their parameters are described in the supplementary material to this article. We experiment with the full feature set, individual feature groups, and feature ablation (all features except for one group).
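As an illustration of the embedding feature group, the following sketch sums the word vectors of a tokenized sentence. The function name, the toy lookup table, and the choice to skip out-of-vocabulary tokens are assumptions made for the example; the actual feature parameters are those given in the supplementary material.

```python
from typing import Dict, List


def sentence_embedding(tokens: List[str],
                       vectors: Dict[str, List[float]],
                       dim: int) -> List[float]:
    """Represent a sentence as the element-wise sum of its word vectors."""
    total = [0.0] * dim
    for token in tokens:
        vec = vectors.get(token)
        if vec is None:
            continue  # out-of-vocabulary tokens contribute nothing (assumption)
        for i, value in enumerate(vec):
            total[i] += value
    return total
```

A summation keeps the feature dimensionality fixed regardless of sentence length, at the cost of making longer sentences produce larger vector norms.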

Deep Learning Approaches
As alternatives to our feature-based systems, we consider three deep learning approaches. The first is the Convolutional Neural Net of Kim (2014), which has been shown to perform excellently on many diverse classification tasks such as sentiment analysis and question classification and is still a strong competitor among neural techniques focusing on sentence classification (Komninos and Manandhar, 2016; Zhang et al., 2016b,c). We consider two variants of Kim's CNN, one in which the word vectors are initialized with pre-trained GoogleNews word embeddings (CNN:w2vec) and one in which the vectors are randomly initialized and updated during training (CNN:rand). Our second model is an LSTM (long short-term memory) neural net for sentence classification (LSTM) and our third model is a bidirectional LSTM (BiLSTM).
For all neural network classifiers, we use default hyperparameters concerning hidden dimensionalities (for the two LSTM models), number of filters (for the convolutional neural net), and others. We train each of the three neural networks for 15 iterations and choose in each case the learned model that performs best on a held-out development set of roughly 10% of the training data as the model to apply to unseen test data. This corresponds to an early stopping regularization scheme.
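The model selection scheme just described can be sketched independently of any neural network library. The two callables, `train_one_epoch` and `dev_score`, are hypothetical placeholders for one training pass that returns a model snapshot and for its evaluation on the held-out development set.

```python
from typing import Callable, Tuple, TypeVar

M = TypeVar("M")


def select_best_epoch(train_one_epoch: Callable[[int], M],
                      dev_score: Callable[[M], float],
                      n_epochs: int = 15) -> Tuple[int, M, float]:
    """Train for n_epochs and return the snapshot that scores best on dev data."""
    if n_epochs < 1:
        raise ValueError("n_epochs must be at least 1")
    best_epoch, best_model = 1, train_one_epoch(1)
    best_score = dev_score(best_model)
    for epoch in range(2, n_epochs + 1):
        model = train_one_epoch(epoch)
        score = dev_score(model)
        if score > best_score:  # strict comparison keeps the earliest best epoch
            best_epoch, best_model, best_score = epoch, model, score
    return best_epoch, best_model, best_score
```

Only the selected snapshot is then applied to the unseen test data, so the development set acts as the regularizer.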

Results
In the following, we summarize the results of the various learners described above. Obtaining all results required heavy computation, e.g. the cross-validation experiments for feature-based systems took 56 days of computing. We intentionally do not list the results of previous work on those datasets. The scores are not comparable since we strictly work at the sentence level (rather than e.g. the clause level) and applied downsampling to the training data. All reported significance tests were conducted using the two-tailed Wilcoxon Signed-Rank Test for matched pairs, i.e. paired F1 scores from two compared systems (Japkowicz and Shah, 2014).

In-Domain Experiments
The performance of the learners is quite divergent across datasets, with Macro-F1 scores (described as Fscore M in Sokolova and Lapalme (2009)) ranging from 60% (WTP) to 80% (MT), with an average of 67% (see Table 2). On all datasets, our best systems clearly outperform both baselines. In isolation, lexical, embedding, and syntax features are most helpful, whereas structural features did not help in most cases. Discourse features only contribute significantly on MT. When looking at the performance of the feature-based approaches, the most striking finding is the importance of lexical (in our setup, unigram) information.
The average performances of LR −syntax and CNN:rand are virtually identical, both for Macro-F1 and Claim-F1, with a slight advantage for the feature-based approach, but their difference is not statistically significant (p ≤ 0.05). Altogether, these two systems exhibit significantly better average performances than all other models surveyed here, both those relying on and those not relying on hand-crafted features (p ≤ 0.05). The absence or the different nature of inter-annotator agreement measures for all datasets prevents us from searching for correlations between agreement and performance. But we observed that the systems yield better results on PE and MT, both datasets with good inter-annotator agreement (α_U = 0.77 for PE and κ = 0.83 for MT).
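For readers who want to recompute the two measures, here is a minimal sketch. It assumes that Macro-F1 follows the Fscore M definition of Sokolova and Lapalme (2009), i.e. the harmonic mean of macro-averaged precision and macro-averaged recall, and that Claim-F1 is the ordinary F1 of the claim class; the function names and the 0/1 label encoding are illustrative.

```python
from typing import Sequence, Tuple


def precision_recall(gold: Sequence[int], pred: Sequence[int],
                     label: int) -> Tuple[float, float]:
    """Precision and recall of one class; 0.0 when undefined."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def claim_f1(gold: Sequence[int], pred: Sequence[int]) -> float:
    """F1 of the claim class (label 1)."""
    return f1(*precision_recall(gold, pred, 1))


def macro_f1(gold: Sequence[int], pred: Sequence[int],
             labels: Sequence[int] = (0, 1)) -> float:
    """F1 computed from macro-averaged precision and recall."""
    pairs = [precision_recall(gold, pred, label) for label in labels]
    macro_p = sum(p for p, _ in pairs) / len(pairs)
    macro_r = sum(r for _, r in pairs) / len(pairs)
    return f1(macro_p, macro_r)
```

Under this definition the majority baseline, which never predicts a claim, obtains a Claim-F1 of zero, which is why both measures are reported side by side.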

Cross-Domain Experiments
For all six datasets, training on different sources resulted in a performance drop. Table 3 lists the results of the best feature-based (LR All features) and deep learning (CNN:rand) systems, as well as single feature groups (averages over all source domains; results for individual source domains can be found in the supplementary material to this article). We note the biggest performance drops on the datasets which performed best in the in-domain setting (MT and PE). For the lowest scoring datasets, OC and WTP, the differences are only marginal when trained on a suitable dataset (VG and OC, respectively). The best feature-based approach outperforms the best deep learning approach in most scenarios. In particular, as opposed to the in-domain experiments, the difference of the Claim-F1 measure between the feature-based approaches and the deep learning approaches is striking. In the feature-based approaches, on average, a combination of all features yields the best results for both Macro-F1 and Claim-F1. When comparing single feature groups, lexical ones do the best job.
Looking at the best overall system (LR with all features), the average test results when training on different source datasets range from 54% Macro-F1 and 23% Claim-F1 (both for MT as source) to 58% Macro-F1 (VG) and 34% Claim-F1 (OC). Depending on the goal that should be achieved, training on VG (highest average Macro-F1) or OC (highest average Claim-F1) seems to be the best choice when the domain of test data is unknown (we analyze this finding in more depth in §6). MT clearly gives the best results as target domain, followed by PE and VG.
We also performed experiments with mixed sources; the results are shown in Table 4. We did this in a leave-one-domain-out fashion: we trained on all but one dataset and tested on the remaining one. In this scenario, the neural network systems seem to benefit from the increased amount of training data and thus gave the best results. Overall, the mixed-sources approach works better than many of the single-source cross-domain systems, although the differences were not found to be significant, and it is about as good as training on suitable single sources (see above).
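The two cross-domain setups differ only in how source and target domains are paired, which the following sketch makes explicit. The dataset abbreviations are those used in this article; the function names are illustrative.

```python
from typing import Iterator, List, Tuple

DOMAINS = ["VG", "WD", "PE", "OC", "WTP", "MT"]


def single_source_pairs(domains: List[str]) -> List[Tuple[str, str]]:
    """All (source, target) pairs with distinct source and target."""
    return [(s, t) for s in domains for t in domains if s != t]


def leave_one_domain_out(domains: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (mixed sources, target): train on all domains but the target."""
    for target in domains:
        yield [d for d in domains if d != target], target
```

With six datasets, the single-source setup yields 30 train/test pairs per system, while the mixed-sources setup yields six.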

Further Analysis and Discussion
To better understand which factors influence cross-domain performance of the systems we tested, we considered the following variables as potential determinants of outcome: similarity between source and target domain, the source domain itself, training data size, and the ratio between claims and non-claims.
We calculated the Spearman correlation of the top-500 lemmas between the datasets in each direction; see the results in Table 5. The most similar domains are OC (source s) and WTP (target t), which come from the same authors. OC (s) and WD (t) as well as OC (s) and VG (t) are also highly correlated. For a statistical test of potential correlations between cross-domain performances and the introduced variables, we regress the cross-domain results (Table 3) on Table 5 (denoted T4 in the regression), on the number of claims #C (directly related to training data size in our experiments, as an effect of downsampling), and on the ratio of claims to non-claims R. More precisely, given source/training data and target data pairs (s, t) in Table 3, we estimate a linear regression model whose dependent variable y_st denotes the Macro-F1 score when training on s and testing on t. In the regression, we also include binary dummy variables 1_σ = 1_{s,σ} for each domain σ, whose value is 1 if s = σ (and 0 otherwise). These help us identify "good" source domains. Overall, we had 15 different systems (see the upper 15 rows in Table 2) and therefore 15 different regression models.
The coefficient α for Table 5 is not statistically significantly different from zero in any case. Ultimately, this means that it is difficult to predict cross-domain performance from lexical similarity of the datasets. This is in contrast to, e.g., POS tagging, where lexical similarity has been reported to predict cross-domain performance very well (Van Asch and Daelemans, 2010). The coefficient for training data size β is statistically significantly different from zero in three out of 15 cases. In particular, it is significantly positive in two (CNN:rand, CNN:w2vec) out of four cases for the neural networks. This indicates that the neural networks would have particularly benefited from more training data, which is confirmed by the improved performance of the neural networks in the mixed sources experiments (cf. §5.2). The ratio of claims to non-claims in t is among the best predictors for the variables considered here (coefficient γ is significant in three out of 15 cases, but consistently positive). This is probably due to our decision to balance training data (downsampling non-claims) to keep the assessment of claim identification realistic for real-world applications, where the class ratio of t is unknown. Our systems are thus inherently biased towards a higher claim ratio.
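For orientation, the regression discussed in this section can be written out as below. This is a reconstruction from the variables and coefficients named in the text, not the formula as originally typeset: the symbol δ_σ for the dummy coefficients and the error term ε_st are names introduced here, no separate intercept is shown because a full set of source dummies would absorb it, and whether #C enters linearly or in transformed form cannot be recovered from the text.

```latex
y_{st} \;=\; \alpha \cdot \mathrm{T4}_{st}
        \;+\; \beta \cdot \#C_{s}
        \;+\; \gamma \cdot R_{t}
        \;+\; \sum_{\sigma} \delta_{\sigma} \cdot \mathbf{1}_{s,\sigma}
        \;+\; \varepsilon_{st}
```

Here T4_st is the lexical similarity between source s and target t, #C_s the number of claims in the source data, and R_t the ratio of claims to non-claims in the target data. One such model is fitted per system, with one observation for every source/target pair.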
Finally, the dummy variables for OC and VG are significantly positive in three cases and consistently positive overall. Their average coefficients are 2.31 and 1.90, respectively, while the average coefficients for all other source datasets are negative, and not significant in most cases. Thus, even when controlling for all other factors such as training data size and the different claim ratios of target domains, OC and VG are the best source domains for cross-domain claim classification in our experiments. OC and VG are particularly good training sources for the detection of claims (as opposed to non-claims), the minority class in all datasets, as indicated by the average Claim-F1 scores in Table 3.
One finding that was confirmed both in-domain and cross-domain was the importance of lexical features as compared to other feature groups. As mere lexical similarity between domains does not explain performance (cf. coefficient α above), this finding indicates that the learners relied on a few, but important, lexical clues. To go into more depth, we carried out an error analysis on the CNN:rand cross-domain results. We used OC, VG and PE as source domains, and MT and WTP as target domains. By examining examples in which a model trained on OC and VG made correct predictions as opposed to a model trained on PE, we quickly noticed that lexical indicators indeed played a crucial role. In particular, the occurrence of the word "should" (and to a lesser degree: "would", "article", "one") is helpful for the detection of claims across various datasets. In MT, a simple baseline labeling every sentence containing "should" as claim achieves 76.1 Macro-F1 (just slightly below the best in-domain system on this dataset). In the other datasets, this phenomenon is far less dominant, but still observable. We conclude that a few rather simple rules (learned by models trained on OC and VG, but not by potentially more complex models trained on PE) make a big difference in the cross-domain setting.
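The "should" baseline mentioned above amounts to a one-line rule. The sketch below is an illustrative reimplementation, not the code used for the reported score; the function name and the case-insensitive whole-word matching are assumptions.

```python
import re

# Whole-word match, so "shoulder" and an untokenized "shouldn't" do not count.
SHOULD = re.compile(r"\bshould\b", flags=re.IGNORECASE)


def should_baseline(sentence: str) -> int:
    """Label a sentence as claim (1) if it contains the word 'should'."""
    return 1 if SHOULD.search(sentence) else 0
```

Applied to the MT example quoted earlier ("The death penalty should be abandoned everywhere."), the rule predicts a claim, whereas a sentence such as "A dot." is labeled as a non-claim.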

Conclusion
In a rigorous empirical assessment of different machine learning systems, we compared how six datasets model claims as the fundamental component of an argument. The varying performance of the tested in-domain systems reflects different notions of claims also observed in a qualitative study of claims across the domains. Our results reveal that the best in-domain system is not necessarily the best system in environments where the target domain is unknown. In particular, we found that mixing source domains and training on two rather noisy datasets (OC and VG) gave the best results in the cross-domain setup. The reason for this seems to be a few important lexical indicators (like the word "should") which are learned more easily under these circumstances. In summary, for the six datasets we analyzed here, our analysis shows that the essence of a claim is not much more than a few lexical clues.
From this, we conclude that future work should address the problem of vague conceptualization of claims as central components of arguments. A more consistent notion of claims, which also holds across domains, would potentially benefit not just cross-domain claim identification, but also higher-level applications relying on argumentation mining (Wachsmuth et al., 2017). To further overcome the problem of domain dependence, multi-task learning is a framework that could be explored (Søgaard and Goldberg, 2016) for different conceptualizations of claims.