Unsupervised corpus-wide claim detection

Automatic claim detection is a fundamental argument mining task that aims to automatically mine claims regarding a topic of consideration. Previous works on mining argumentative content have assumed that a set of relevant documents is given in advance. Here, we present a first corpus-wide claim detection framework that can be directly applied to massive corpora. Using simple and intuitive empirical observations, we derive a claim sentence query by which we are able to directly retrieve sentences in which the prior probability of including topic-relevant claims is greatly enhanced. Next, we employ simple heuristics to rank the sentences, leading to an unsupervised corpus-wide claim detection system, with precision that outperforms previously reported results on the task of claim detection given relevant documents and labeled data.


Introduction
Decision making typically relies on the quality of the arguments being presented and the process by which they are resolved. A common component in all argument models (e.g., (Toulmin, 1958)) is the claim, namely the assertion the argument aims to prove. Given a topic of interest, suggesting a diverse set of persuasive claims is a demanding cognitive goal. The corresponding task of automatic claim detection was first introduced in , and is considered a fundamental task in the emerging field of argument mining (Lippi and Torroni, 2016). To illustrate some of the subtleties involved, Table 1 lists examples of sentences related to the topic of whether we should end affirmative action.
S1 Opponents claim that affirmative action has undesirable side-effects and that it fails to achieve its goals.
S2 The European Court of Justice held that this form of positive discrimination is unlawful.
S3 Clearly, qualifications should be the only determining factor when competing for a job.
S4 In 1961, John F. Kennedy became the first to utilize the term affirmative action in its contemporary sense.
* First two authors contributed equally.
Previous works on claim detection have assumed the availability of a relatively small set of articles enriched with relevant claims . Similarly, other argument-mining works have focused on the analysis of a small set of argumentative essays (Stab and Gurevych, 2014). This paradigm has two limitations. First, it relies on a manual or automatic (Roitman et al., 2016) process to retrieve the relevant set of articles, which is non-trivial and prone to errors. Second, when considering large corpora, relevant claims may be spread across a much wider and more diverse set of articles than those considered by earlier works. Here, we present a first corpus-wide claim detection framework that can be directly applied to massive corpora, with no need to specify a small set of documents in advance.
We exploit the empirical observation that relevant claims are typically (i) semantically related to the topic; and (ii) reside within sentences with identifiable structural properties. Thus, we aim to pinpoint single sentences within the corpus that satisfy both criteria.
Semantic relatedness can be manifested via a rich set of linguistic mechanisms. For example, in Table 1, S1 mentions the main concept (MC) of the topic (i.e., affirmative action) explicitly; S2 mentions the MC using a different surface form, 'positive discrimination'; while S3 contains a valid claim without explicitly mentioning the MC. Here, we propose using a mention detection tool (Ferragina and Scaiella, 2010), which maps surface forms to Wikipedia titles (a.k.a. Wikification), to focus the mining process on sentences in which the MC is detected. Thus, we keep the potential to detect sentences in which different surface forms are used to express the MC. Moreover, using a Wikification tool can help prevent drift in the meaning of the topic. For example, consider the topic Marriage is outdated, for which the MC is Marriage. Had we searched the corpus for all sentences with the word Marriage, we would have found many sentences that mention the term Same sex marriage, which tends to appear more often in argumentative content within the corpus. The risk in this case is that the claim detection system drifts towards this related but quite different topic. By using a Wikification tool, and assuming it works reasonably well, we avoid this problem: searching for sentences with the concept Marriage will not return sentences in which the Wikification tool found the concept Same sex marriage.
However, as mentioned, semantic relatedness is not enough; e.g., S4 mentions the MC explicitly, but does not include a claim. To further distinguish such sentences from those containing claims, we observe that the token 'that' is often used as a precursor to a claim; as in S1, S2 and in the sentence "we observe that the token 'that' is often used as a precursor to a claim." The usage of 'that' as a feature was first suggested in . Thus, we use the presence of 'that' as an initial weak label, and further identify unigrams enriched in the suffixes of sentences containing 'that' followed by the MC, compared to sentences containing the MC without a preceding 'that'. This yields a Claim Lexicon (CL), from which we derive a Claim Sentence Query (CSQ) composed of the following ordered triplet: that → MC → CL, i.e., the token 'that', the MC as identified by a Wikification tool, and a unigram from the CL, in that order.
We demonstrate empirically over Wikipedia that, for sentences satisfying this query, the prior probability of including a relevant claim is enhanced compared to the background distribution. Furthermore, by applying simple unsupervised heuristics to sort the retrieved sentences, we obtain precision results outperforming , while using no labeled data, and tackling the presumably more challenging goal of corpus-wide claim detection. Our results demonstrate the practical value of the proposed approach, in particular for topics that are well covered in the examined corpus.

Related Work
Context-dependent claim detection (i.e., the detection of claims that support or contest a given topic) was first suggested by . Next, Lippi and Torroni (2015) proposed the context-independent claim detection task, in which one attempts to detect claims without having the topic as input. Thus, if the texts contain claims for multiple topics, all should be detected. Both works used the data in  for training and testing their models.  have first described 'that' as an indicator for sentences containing claims. Other works have identified additional indicators of claims, such as discourse markers, and have used them within a rule-based, rather than a supervised, framework (Eckle-Kohler et al., 2015; Ong et al., 2014; Somasundaran and Wiebe, 2009; Schneider and Wyner, 2012).
Our use of the word 'that' as an initial weak label is closely related to the idea of distant supervision (Mintz et al., 2009). In the context of argument mining, Al-Khatib et al. (2016) also used noisy labels to train a classifier, albeit for a different task. They exploited the manually curated idebate.org resource to define labeled data, admittedly noisy, that were used to train an argument mining classification scheme. In contrast, our approach requires no data curation and relies on a simple linguistic observation of the typical role of 'that' in argumentative text. Our use of the token 'that' as a weak label to identify a relevant lexicon is also reminiscent of the classical work by Hearst (1992), who suggested using lexico-syntactic patterns to identify various lexical relations. However, to the best of our knowledge, the present work is the first to use such a paradigm in the context of argument mining.

System Description

Sentence Level Index
Corpus-wide claim detection requires a run-time efficient approach. Thus, although the context surrounding a sentence may hint at whether it contains a claim, we focus solely on single sentences and the information they contain. Correspondingly, we built an inverted index of sentences for the Wikipedia May 2015 dump, covering ~4.9M articles. After text cleaning and sentence splitting using OpenNlp, we obtained a sentence-level index that contains ~83M sentences. We then used TagMe (Ferragina and Scaiella, 2010) to Wikify each sentence, limiting the context used by TagMe for disambiguation to the examined sentence.
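To make the idea of a sentence-level index concrete, the following is a minimal in-memory sketch in Python. It is not the indexing setup used in this work; the class name `SentenceIndex`, its methods, and the assumption that each sentence arrives with the set of Wikipedia titles assigned by a Wikifier are all hypothetical and serve illustration only.

```python
from collections import defaultdict


class SentenceIndex:
    """Toy sentence-level inverted index (hypothetical; for illustration only)."""

    def __init__(self):
        self.sentences = []                  # sentence id -> raw text
        self.by_concept = defaultdict(set)   # Wikipedia title -> sentence ids

    def add(self, text, concepts):
        """Store one sentence together with the concepts detected in it."""
        sid = len(self.sentences)
        self.sentences.append(text)
        for concept in concepts:
            self.by_concept[concept].add(sid)
        return sid

    def query_concept(self, concept):
        """Return the ids of all sentences in which the concept was detected."""
        return sorted(self.by_concept.get(concept, ()))


index = SentenceIndex()
index.add("Opponents claim that affirmative action has undesirable side-effects.",
          {"Affirmative action"})
index.add("Boxing is a combat sport.", {"Boxing"})
index.add("This form of positive discrimination is unlawful.",
          {"Affirmative action"})
matches = index.query_concept("Affirmative action")
```

Because the lookup key is the Wikipedia title rather than the surface string, the third sentence is retrieved even though it never contains the words 'affirmative action'.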

Topics
We started with a manually curated list of 431 debate topics that are often used in debate-related sites like idebate.org. We limit our attention to debate topics that focus on a single concept, denoted here as the MC, which is further identified by a corresponding Wikipedia page, e.g., Affirmative Action, Doping in Sport, Boxing, etc. In addition, we focus on topics that are well covered in Wikipedia, which we formally define as topics for which the query q1 = MC has at least 1,000 matches. This criterion is satisfied in 212/431 topics, of which we randomly selected 100 as a development set (termed dev-set henceforth) and 50 topics as a test set, used solely for evaluation. The complete list of topics is given in the Supplementary Material (SM).

Claim Sentence Query (CSQ)
For the 100 dev-set topics we obtained a total of ~1.86M sentences that match the query q1, hence are assumed to be semantically related to their respective topic. We refer to this set of sentences as the q1-set. Using 'that' as a weak label, we divide the q1-set into two classes, the sentences that contain the token 'that' before the MC and the sentences that do not, denoted $c_1$ and $c_2$, respectively. The class $c_1$ consists of ~183K sentences, hence we define the estimated prior probability of a sentence from the q1-set to be included in $c_1$ as $P(c_1) = 0.0986$.
Based on these classes, we are interested in constructing a lexicon of claim-related words that will enable designing a query with a relatively high prior for detecting claim-containing sentences. We start with standard pre-processing including tokenization, stop-word removal, lowercasing, POS-tagging using OpenNlp, and removal of tokens mentioned in < 10 sentences in the q1-set. Preliminary analysis, described in detail in the SM, suggested that we should focus on the suffixes of the sentences in $c_1$, where the suffix is defined as the part of the sentence that follows the MC. Note that in our setting the claim is expected to occur after the token 'that', with the MC usually being the subject; hence the suffix as defined above seems like a natural candidate to search for words characteristic of claims. Formally, for a word $w$ we define $n_1$ as the number of sentences in $c_1$ that contain $w$ in the sentence suffix; $n_2$ as the number of sentences in $c_2$ that contain $w$; and $P_{\mathrm{suff}}(c_1 \mid w) = n_1 / (n_1 + n_2)$. Finally, we define the Claim Lexicon (CL) as the set of words which satisfy $P_{\mathrm{suff}}(c_1 \mid w) > P(c_1)$, namely the set of words that are characteristic of the suffixes of sentences in the class $c_1$. Put differently, these are the words that, when they appear in the sentence suffix, make the sentence more likely to be in $c_1$ than expected by the prior.
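The lexicon criterion above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in this work: the function name `build_claim_lexicon` and the input format (pre-tokenized, pre-processed token lists) are hypothetical, the prior is estimated from the sizes of the two input classes, and the removal of tokens that occur in fewer than 10 sentences is omitted for brevity.

```python
from collections import Counter


def build_claim_lexicon(c1_suffixes, c2_sentences):
    """Return (prior, lexicon) for the criterion P_suff(c1 | w) > P(c1).

    c1_suffixes:  token lists, each the part following the MC of a c1 sentence.
    c2_sentences: token lists, each a full sentence from c2.
    (Input format is an assumption made for this sketch.)
    """
    # Prior probability that a sentence from the q1-set belongs to c1.
    prior = len(c1_suffixes) / (len(c1_suffixes) + len(c2_sentences))

    # n1[w]: number of c1 sentences whose suffix contains w.
    n1 = Counter(w for toks in c1_suffixes for w in set(toks))
    # n2[w]: number of c2 sentences that contain w.
    n2 = Counter(w for toks in c2_sentences for w in set(toks))

    lexicon = {}
    for w, count in n1.items():  # words with n1 = 0 can never exceed the prior
        p_suff = count / (count + n2[w])
        if p_suff > prior:
            lexicon[w] = p_suff
    return prior, lexicon


c1 = [["should", "be", "banned"], ["is", "harmful"]]
c2 = [["is", "a", "sport"], ["is", "popular"], ["is", "old"], ["is", "regulated"],
      ["is", "televised"], ["was", "founded"], ["has", "rules"], ["began", "early"]]
prior, lexicon = build_claim_lexicon(c1, c2)
```

In this toy data the prior is 2/10 = 0.2. The word 'should' appears only in a $c_1$ suffix and is kept, whereas 'is' appears in one $c_1$ suffix and five $c_2$ sentences (1/6 < 0.2) and is dropped.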
A desirable feature of the CL is that it contains words which are indicative of claims in the general sense, i.e., in the context of many different topics. Since the resulting lexicon included some topic-specific words, mostly nouns, we applied a straightforward cleansing step, removing all nouns, as well as numbers, single-character tokens, and country-specific terms from the CL, ending up with a lexicon consisting of 586 words, listed in the SM.
We then use the CL to construct the claim sentence query (CSQ): that → MC → CL, where CL denotes any word from the CL. We assessed the prior probability of containing a claim for sentences matching different queries by randomly selecting at most 3 sentences that match the query per dev-set topic, and having the resulting sentences annotated by 5 human annotators. We find that, as expected, the prior associated with the query that → MC is higher than the background prior of sentences matching q1 = MC, 4.8% vs. 2.4%, respectively. Using the CSQ further enhances the prior to 9.8%, a factor of 4 compared to the background. Table 2 summarizes the prior and number of matches per query.
Table 2: Summary of query evaluation. The "Prior" column shows the percentage of claim sentences estimated by the annotation experiment. The "#Matches" column shows the median number of query matches across the dev-set topics.
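As an illustration of the ordered triplet that → MC → CL, the following Python sketch checks whether a single tokenized sentence matches the CSQ. The function name `matches_csq` and the mention format, a list of (start, end, title) token spans standing in for Wikifier output, are assumptions made for this example and do not reflect the output format of any particular tool.

```python
def matches_csq(tokens, mentions, mc, claim_lexicon):
    """Check the ordered pattern: 'that' -> MC mention -> claim-lexicon word.

    tokens:        lowercased tokens of one sentence.
    mentions:      (start, end, title) token spans, end exclusive; a hypothetical
                   stand-in for the concepts a Wikifier detected in the sentence.
    mc:            Wikipedia title of the main concept.
    claim_lexicon: set of claim-lexicon words.
    """
    if "that" not in tokens:
        return False
    that_idx = tokens.index("that")  # the first 'that' is enough for an ordered match

    # End positions of MC mentions that start after the first 'that'.
    mc_ends = [end for start, end, title in mentions
               if title == mc and start > that_idx]
    if not mc_ends:
        return False

    # Any lexicon word in the suffix following the earliest such MC mention.
    suffix = tokens[min(mc_ends):]
    return any(tok in claim_lexicon for tok in suffix)


cl = {"undesirable", "should"}
s1 = ["opponents", "claim", "that", "affirmative", "action", "has",
      "undesirable", "side-effects"]
s4 = ["kennedy", "became", "the", "first", "to", "utilize", "the", "term",
      "affirmative", "action"]
s1_match = matches_csq(s1, [(3, 5, "Affirmative action")], "Affirmative action", cl)
s4_match = matches_csq(s4, [(8, 10, "Affirmative action")], "Affirmative action", cl)
```

Taking the first 'that' and the earliest MC mention after it is sufficient here: if any ordered triplet exists in the sentence, this greedy choice also finds one.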

From CSQ to Claim Detection
Based on the sentences that match the CSQ, we are now ready to define a system that performs corpus-wide claim detection by adding sentence re-ranking, boundary detection, and simple filters. Naturally, we would like to present higher-confidence predictions first. Remaining within the unsupervised framework, we rank the sentences by the average of two simple scores:
(i) w2v: The CSQ only aims to ensure that the MC is present in the examined sentence. Hence, it seems reasonable to assume that considering the semantic similarity of the entire candidate claim to the topic will improve the ranking. Thus, we compare the word2vec representation (Mikolov et al., 2013) of each word in the sentence part following the first 'that' to each word in the MC to find the best cosine-similarity match, and average the obtained scores.
(ii) slop: The number of tokens between 'that' and the first match to the CL. This assumes that the closer the elements appear in the sentence, the higher the probability that it contains a claim.
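The two ranking scores can be sketched as follows. This is a hedged illustration, not the scoring code of the system: the toy embedding dictionary `emb` stands in for a trained word2vec model, out-of-vocabulary words are simply skipped, and, since the text does not specify how the slop count is put on a common scale with the similarity score before averaging, the mapping 1 / (1 + slop) is an assumption introduced here. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def w2v_score(tokens, mc_tokens, emb):
    """Average, over words after the first 'that', of the best similarity to an MC word."""
    start = tokens.index("that") + 1  # sentences are assumed to contain 'that'
    best = []
    for w in tokens[start:]:
        if w not in emb:
            continue  # assumption: out-of-vocabulary words are ignored
        sims = [cosine(emb[w], emb[m]) for m in mc_tokens if m in emb]
        if sims:
            best.append(max(sims))
    return sum(best) / len(best) if best else 0.0


def slop(tokens, mc_end, claim_lexicon):
    """Number of tokens between the first 'that' and the first CL word after the MC."""
    that_idx = tokens.index("that")
    for i in range(mc_end, len(tokens)):
        if tokens[i] in claim_lexicon:
            return i - that_idx - 1
    return None  # no lexicon word found; such a sentence does not match the CSQ


def rank_score(tokens, mc_tokens, mc_end, claim_lexicon, emb):
    """Average of the two scores; 1 / (1 + slop) is an assumed normalization."""
    s = slop(tokens, mc_end, claim_lexicon)
    closeness = 1.0 / (1.0 + s) if s is not None else 0.0
    return (w2v_score(tokens, mc_tokens, emb) + closeness) / 2.0


emb = {"boxing": [1.0, 0.0], "banned": [1.0, 0.0], "should": [0.0, 1.0]}
sent = ["they", "argue", "that", "boxing", "should", "be", "banned"]
score = rank_score(sent, ["boxing"], 4, {"should"}, emb)
```

Candidate sentences for a topic would then be sorted by this score in descending order; any other monotone mapping of the slop count could be substituted for the one assumed above.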
To perform claim detection, the claim itself should be extracted from the surrounding sentence. From the way the CSQ is constructed, it follows that the claim is expected to start right after the 'that'. The end of the claim is harder to predict. An approach to boundary detection was described in , but here we employ a simple heuristic, which does not require labeled data, namely ending the claim at the sentence end. Finally, sentences containing location/person named-entities after the 'that' are filtered out.

Results
To evaluate the performance of the proposed system, we applied crowd labeling to the predicted claims for all 150 topics in the dev- and test-set. For each topic we labeled the top 50 predictions, or all predictions if there were fewer. A prediction was considered correct if the majority of the annotators marked it as a claim. The average pairwise Kappa agreement on the dev-set was 0.38, which is similar to the Kappa of 0.39 reported in this context by . Table 3 depicts the obtained results. Using our approach, which requires no labeling and is applied over the entire Wikipedia corpus, we obtain results that outperform those reached using a supervised approach over a manually pre-selected set of articles (see 'Levy' row), though we note that we consider a different set of topics because of the restrictions we impose on the topic structure (section 3.2). In addition, the test-set results are better than the dev-set results, suggesting that the system is able to generalize to entirely new topics.
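The evaluation protocol, majority vote per prediction followed by precision over the top-ranked predictions, can be sketched as below. The function names and the vote format (one list of 0/1 annotator labels per prediction, in ranked order) are hypothetical. When a topic has fewer than K predictions, this sketch divides by the number of labeled predictions, which is an assumption, as the text does not state the denominator used in that case.

```python
def is_claim(votes):
    """A prediction is correct if a strict majority of annotators marked it as a claim."""
    return sum(votes) > len(votes) / 2


def precision_at_k(ranked_votes, k=50):
    """Precision over the top-k predictions of one topic (all of them if fewer than k)."""
    top = ranked_votes[:k]
    if not top:
        return 0.0
    return sum(is_claim(v) for v in top) / len(top)


def mean_precision_at_k(votes_per_topic, k=50):
    """Average the per-topic precision over all topics (macro average)."""
    scores = [precision_at_k(v, k) for v in votes_per_topic.values()]
    return sum(scores) / len(scores) if scores else 0.0


topic_votes = {
    "Boxing": [[1, 1, 0], [0, 0, 1], [1, 1, 1]],
    "Marriage": [[0, 0, 0], [1, 1, 0]],
}
p_boxing = precision_at_k(topic_votes["Boxing"], k=50)
p_mean = mean_precision_at_k(topic_votes, k=50)
```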
When considering only topics for which more than K sentences match the CSQ, the precision increases considerably. For example, for topics that have at least 50 sentences matching the CSQ, P@50 is 24% and 34% in the dev- and test-set, respectively. Thus, for topics well covered in the corpus, the precision of the system is even more promising.
The precision results in Table 3 are not directly comparable to "classical" argumentation mining tasks, e.g., (Stab and Gurevych, 2014), since our task involves detecting claims over a full corpus in which the ratio of positive cases is much lower (2.4% of sentences containing the MC).

Limitations
In this work, we only considered topics that focus on a single concept which has a corresponding Wikipedia page. Expanding the proposed framework to more complex queries, covering more than a single concept, merits further investigation. Yet, even without such an expansion, we note that controversial topics are often characterized by a corresponding Wikipedia page.
Our approach targets claims in which the MC is identified by a Wikification tool. While this allows mining claims in which the MC is expressed via different surface forms, Wikification errors also propagate to our performance. Thus, improvements in available Wikification tools are expected to improve the results of the approach. In addition, claims that do not explicitly refer to the MC are beyond the reach of the proposed system, limiting its recall. Expanding the CSQ with concepts related to the MC may mitigate this issue.
Finally, we focused on sentences matching the pattern that → MC. Exploring the same methodology for additional patterns characterizing claim-containing sentences is left for future work.

Discussion
We present a simple unsupervised framework for corpus-wide claim detection, which relies on features that are quick to compute. Exploiting the token 'that' as a weak signal, or as distant supervision (Mintz et al., 2009) for claim-containing sentences, we obtain results that outperform a supervised claim detection system applied to a limited set of documents . Extending this approach to other computational argumentation tasks like evidence detection (Rinott et al., 2015) is a natural direction for future work. Notably, the system precision is clearly superior to the precision of the initial 'that' label, indicating the existence of characteristics of claim-containing sentences which may further enhance the signal embodied in this label. Thus, we hypothesize that supervised learning based on labeling the predictions of the unsupervised system can further improve the system results, e.g., by obtaining better ranking schemes and/or stronger methods to determine claim boundaries.
Finally, we demonstrated our approach over the Wikipedia corpus. We speculate that the proposed approach holds even greater potential for mining larger and more argumentative corpora such as newspaper aggregates; in particular, when considering controversial topics that are widely discussed in the media, for which it is natural to expect that relevant claims are mentioned across a very large set of typically short articles.