Feasible Annotation Scheme for Capturing Policy Argument Reasoning using Argument Templates

Most of the existing works on argument mining cast the problem of argumentative structure identification as classification tasks (e.g. attack-support relations, stance, explicit premise/claim). This paper goes a step further by addressing the task of automatically identifying reasoning patterns of arguments using predefined templates, which is called argument template (AT) instantiation. The contributions of this work are three-fold. First, we develop a simple, yet expressive set of easily annotatable ATs that can represent a majority of writer’s reasoning for texts with diverse policy topics while maintaining the computational feasibility of the task. Second, we create a small, but highly reliable annotated corpus of instantiated ATs on top of reliably annotated support and attack relations and conduct an annotation study. Third, we formulate the task of AT instantiation as structured prediction constrained by a feasible set of templates. Our evaluation demonstrates that we can annotate ATs with a reasonably high inter-annotator agreement, and the use of template-constrained inference is useful for instantiating ATs with only partial reasoning comprehension clues.


Introduction
Recognizing argumentative structures in unstructured texts is an important task for many natural language processing (NLP) applications. Argument mining is an emerging field in the NLP community that addresses argumentative structure identification. It involves a wide variety of subtasks such as explicit premise and claim identification/classification (Rinott et al., 2015; Stab and Gurevych, 2014), stance classification (Hasan and Ng, 2014; Persing and Ng, 2016), and argumentative relation detection (Cocarascu and Toni, 2017; Niculae et al., 2017; Peldszus and Stede, 2015b; Stab and Gurevych, 2017). These tasks have been useful for applications such as essay scoring and document summarization (Stab and Gurevych, 2017). This paper addresses a feasible annotation scheme for the task of reasoning pattern identification in argumentative texts. Consider the following argument consisting of two argumentative segments S 1 and S 2 regarding the policy topic Should German universities charge tuition fees?:
(1) S 1 : German universities should not charge tuition fees. S 2 : Every German citizen has a right to education.
In this work, we adopt Walton et al. (2008)'s argumentation schemes (ASs), one prominent theory used for identifying reasoning patterns in everyday arguments. Using Walton et al. (2008)'s Argument from Negative Consequences scheme, the reasoning of Example 1 can be explained as follows:
• Premise: If action x is brought about, bad consequences y will occur.
• Conclusion: x should not be brought about.
where both x and y are slot-fillers and x="charge tuition fees" and y="a right to education will be violated". Each AS identifies a scheme (from 65 total schemes) and appropriate slot-fillers. Instantiations of such reasoning patterns for an argument have several advantages. First, identifying such reasoning will be useful for a range of argumentation mining applications, such as aggregating multiple arguments for producing a logic-based abstractive summary. Second, we believe that it will contribute towards automatically assessing the quality of the logical structure of a given argument, where identifying specific arguments can signify higher quality, especially for tasks such as essay scoring (Song et al., 2014;Wachsmuth et al., 2016). Third, it will be useful for generating support or attacks in application contexts where a human and machine are cooperatively engaged in a debate (for decision support or education). Furthermore, understanding the reasoning in an argumentative text can contribute towards determining implicit ARs not indicated with an explicit discourse marker.
Towards automatically identifying the underlying reasoning of argumentative texts, Reed (2006) created Araucaria, a corpus consisting of argumentative texts annotated with Walton et al. (2008)'s ASs. Feng and Hirst (2011) used Araucaria for creating a computational model for identifying the type of argumentation scheme.
Although Araucaria is a well-known corpus in the argumentation mining community, it suffers from complex annotation guidelines, which make the annotation task difficult. A follow-up study (Musi et al., 2016) reports that annotating a simplified taxonomy of the Argumentum Model of Topics argumentation schemes (Rigotti, 2006; Palmieri, 2014) results in an inter-annotator agreement of Fleiss' κ = 0.31 ("fair agreement"), even though the annotators were trained and only a subset (8 types) of schemes was annotated. In this work, we assume the following: (i) annotating multiple types of ASs is difficult, and (ii) the reliability of annotating reasoning patterns for a single AS with implicit slot-fillers is low, because when slot-fillers are not explicitly written in the original text, annotators must generate them manually as natural language sentences; this allows for a wide variety of possible, arbitrary candidates for each scheme (e.g. y="a right to education is violated" in Example 1), making the annotation costly and difficult. To construct a highly reliable corpus for the task of automatic reasoning identification in argumentative texts, it is crucial to have an annotation scheme that covers as wide a range of arguments as possible and simultaneously offers a simple way to specify implicit slot-fillers without manually creating natural language sentences.
This paper makes three important contributions towards automatically capturing a writer's reasoning in argumentative texts. First, we compose a simple, yet expressive set of easily annotatable templates (argument templates or ATs) that allow a writer's reasoning to be represented without the need for manual generation of natural language sentences when slot-fillers are implicit. Specifically, we propose a template/slot-filler based approach for instantiating reasoning patterns that capture the underlying reasoning between two argumentative segments in an argumentative relation (AR) using two types of causal labels (i.e. PROMOTE and SUPPRESS). Our annotation study demonstrates that we can annotate ATs with a reasonably high inter-annotator agreement (Cohen's κ=0.80) and that ATs can represent a majority (74.6%) of writers' reasoning in a small essay corpus with multiple, diverse policy topics. Second, using ATs, we augment an existing, reliable corpus of argumentative texts (Peldszus and Stede, 2015a) with writers' reasoning and create a small, but useful corpus on top of pre-labeled argumentative relations. Third, towards creating a fully-automated argument template instantiation model, we create a preliminary computational model for instantiating ATs. We formulate the task of AT instantiation as structured prediction constrained by a feasible set of ATs. We hypothesize that the introduction of such constraints enables us to instantiate ATs with only partial reasoning comprehension clues. Our evaluation shows that template-constrained inference is indeed useful for instantiating ATs with only partial reasoning comprehension clues.

A Corpus of Instantiated Argument Templates
The key requirements for automatically capturing an argument's reasoning are four-fold: (i) capture a writer's implicit reasoning as much as possible, (ii) be machine-friendly, (iii) be useful for downstream applications, and (iv) keep human annotation simple. Towards this goal, as mentioned in Section 1, Reed (2006) created Araucaria, a corpus consisting of argumentative texts annotated with Walton et al. (2008)'s ASs. However, the annotation scheme requires annotators to manually generate natural language sentences for implicit slot-fillers (i.e. (ii) and (iv) are not considered).
To address this issue, we propose a method that allows annotators to avoid manual generation of natural language sentences when a slot-filler is implicit. Given two argumentative statements with a known AR, our task is to identify the reasoning between them by (i) selecting a template from a predefined template set (argument templates (ATs)), where each template encodes a causal label, and (ii) instantiating the template via slot-filling, where the slot is linked with a relevant, arbitrary phrase in the input text. Figure 1 exemplifies our proposed approach, using the support relation from S 2 to S 1 in Example 1. The first step is to identify an AT: "S 1 , the target segment of the relation (i.e. S t ), states that x should not be brought about (i.e. bad), because S 2 , the source segment of the relation (i.e. S s ), states that x is bad because when x happens, y, a good entity/event, will be suppressed.". The second step is to instantiate the template by filling in the slots x, y with a phrase from the text: x ="charge tuition fees" and y ="a right to education". By encoding causal labels, annotators are no longer required to manually construct implicit slot-fillers (e.g. y="a right to education will be violated" in Section 1).
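The two steps above (template selection, then slot-filling) can be sketched as a small data structure. This is only an illustrative sketch, not the annotation tool or format used in this work: the class names, the field names, and the `fill_slots` helper are hypothetical, and the check that a slot-filler is a literal substring of the input is our simplifying assumption. The template shown is the one described for Example 1, which is later referred to as AT-S3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArgumentTemplate:
    """A predefined template; all field names are illustrative assumptions."""
    name: str
    relation: str   # "SUPPORT" or "ATTACK"
    value_x: str    # value judgment of x: "GOOD" or "BAD"
    causality: str  # causal label: "PROMOTE" or "SUPPRESS"
    value_y: str    # value judgment of y: "GOOD" or "BAD"


@dataclass(frozen=True)
class Instantiation:
    template: ArgumentTemplate
    x: str
    y: str


def fill_slots(template, target_segment, source_segment, x, y):
    """Step 2: link each slot to a phrase that literally occurs in the input.

    Because the causal label lives in the template, the slot-fillers never
    need to be newly written sentences.
    """
    text = target_segment + " " + source_segment
    for phrase in (x, y):
        if phrase not in text:
            raise ValueError(f"slot-filler not found in the input text: {phrase!r}")
    return Instantiation(template, x, y)


# Step 1: template selection for Example 1 (x is BAD because it SUPPRESSES y, a GOOD thing).
AT_S3 = ArgumentTemplate("AT-S3", "SUPPORT", "BAD", "SUPPRESS", "GOOD")

S1 = "German universities should not charge tuition fees."
S2 = "Every German citizen has a right to education."

example_1 = fill_slots(AT_S3, S1, S2, x="charge tuition fees", y="a right to education")
```

Note that a manually generated filler such as "a right to education will be violated" would be rejected by this sketch, since that phrase does not occur in either segment.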

The key insight about template design from previous work (Musi et al., 2016) is that if we annotate reasoning with fine-grained reasoning types, the annotation becomes more difficult. In this work, we hypothesize that patterns for representing argumentation are not uniformly distributed but highly skewed, and create an inventory of major ATs, annotating only typical instances of reasoning with them. We label instances where a template cannot be instantiated as "OTHER". In fact, as we report in Section 2.3, the variety of reasoning underlying ARs in the corpus we use can be largely captured by only a small number of predefined templates. Although the expressibility of a slot-filler will be reduced by embedding causal labels into our templates, the feasibility of the computational task will be increased. In the future, we plan to capture the causal information lost by annotating other factors of the causality such as severity, truthfulness, likelihood, etc.

Dataset
We create our set of ATs using the arg-microtexts corpus (Peldszus and Stede, 2015a), a corpus of manually composed arguments, due to the high reliability of its annotated relations among 3 annotators (Fleiss' κ = 0.83). The corpus contains 112 argumentative texts, each consisting of roughly five segments composed of a policy topic question, a main claim, and several premises. Each text comprises a policy argument whose topic concerns whether one should or should not do something. Additionally, each argumentative segment was annotated with its stance (i.e. opponent or proponent) towards the topic question. 357 ARs between segments have been manually annotated as either SUPPORT (i.e. a segment supports the acceptability of another argumentative segment), ATTACK (i.e. a segment attacks the acceptability of another argumentative segment), or UNDERCUT (i.e. a segment attacks another AR) relations, which make up 62.7% (224/357), 23.5% (84/357) and 13.7% (49/357) of the relations, respectively.
In total, we used 89 texts, covering 23 diverse policy topics (e.g. fines for dog dirt, waste separation, etc.). We divided the corpus into two sets, a development set and a test set. We used the development set to induce the ATs described in Section 2.2 and conduct several trial annotations.
[Figure 2: ATTACK templates and SUPPORT templates, each defined over the segments S t and S s .]

Argument Templates
We build our inventory of ATs based on Walton et al. (2008)'s argumentation schemes and analyze the development set for identifying the types of argumentation schemes. As the arg-microtexts corpus consists of policy arguments, we find that the most commonly used argumentation schemes from the corpus include the Argument from Positive (Negative) Consequences schemes, hereby referred to as the Argument from Consequences (AC) scheme. The scheme is as follows:
• Premise: If x is brought about, good (bad) consequences y will occur.
• Conclusion: x should (not) be brought about.
We create ATs for a SUPPORT relation by considering the relation between the premise and conclusion (e.g. S s and S t in Figure 1, respectively).
To represent ATTACK relations with argumentation schemes, we assume that a premise supports the opposite conclusion.
(2) S t : German universities should not charge tuition fees. S s : However, tuition fees could promote better education quality.
For instance, in Example 2, an ATTACK relation exists from S s to S t . The premise, S s , is in support of the opposite conclusion (i.e. "German universities should charge tuition fees"). We represent this phenomenon using the ATTACK templates shown in Figure 2.

AC-inspired templates
As shown in Figure 2, we first create four ATs for a SUPPORT relation (AT-S1 to AT-S4). An example is as follows:
AT-S1: S t , the target segment, implies/states that x, an entity/event, is GOOD and should be brought about. S s , the source segment, implies/states that x is GOOD, because when x exists/happens (or existed/happened), y, a GOOD entity/event, will be (or was) PROMOTED (or NOT SUPPRESSED).
In Example 1, the reasoning is instantiated by AT-S3, with x="charge tuition fees", a BAD thing, and y="a right to education", a GOOD thing. The terms GOOD and BAD refer to the value judgment (VJ) a writer has towards a template slot. This differs from the original stance in the arg-microtexts corpus, which considers the stance of the whole argumentative segment towards the topic. PROMOTE and SUPPRESS refer to the causality between slot-fillers x and y, where PROMOTE refers to the activation of something (e.g. smoking leads to cancer) and SUPPRESS refers to the inactivation (e.g. smoking destroys lives) (Hashimoto et al., 2012). To reduce the complexity of the annotation study, we do not consider the modality of causality.
For an ATTACK relation, we create four ATs (AT-A1 to AT-A4), as illustrated in Figure 2.
AT-A1: S t implies/states that x is GOOD and should be brought about, but S s implies/states that x is BAD because when x exists/happens (or happened), y, a GOOD entity/event, will be (or was) SUPPRESSED (or NOT PROMOTED).
In Example 2, the reasoning is instantiated by AT-A3, with x="charge tuition fees", a BAD thing, and y="better education quality", a GOOD thing.
[Figure 3: Argument templates for non-AC reasoning (panels include AT-UP1 and AT-SP2).]

Additional templates
We create a few ATs to capture minor, non-AC reasoning for each relation, including UNDERCUT relations. In total, we create four additional types of ATs: presupposition, argument from analogy, proposition, and quantifier. We create four templates (not shown) for an UNDERCUT relation. Because an UNDERCUT attacks a relation rather than a segment, we treat the target S t as a link, denoted as R t . An example is as follows:
AT-U1: R t supports the goodness of x, but S s implies/states that x is BAD because when x happens (or happened), y, a GOOD thing, will be (or was) SUPPRESSED (or NOT PROMOTED).
Figure 3 shows analogous and propositional templates for SUPPORT (AT-SA1 and AT-SA2) and ATTACK (AT-AA1 and AT-AA2) relations. An analogy template can be read as follows (e.g. AT-AA1):
AT-AA1: S t states that x is GOOD, but S s states that x is BAD because y is BAD and is analogous to x.
For the UNDERCUT relation, our analysis revealed that a quantifier in a relation could be attacked. Thus, we create the template AT-UQ1 for UNDERCUT, represented as: AT-UQ1: R 1 assumes a quantifier q, but S s disagrees with it.
(3) R 1 S x : Intelligence services must urgently be regulated more tightly by parliament; R 1 S y : this should be clear to everyone after the disclosures of Edward Snowden. S s : Granted, those concern primarily the British and American intelligence services,
In Example 3, R 1 , a SUPPORT(S x , S y ) relation, assumes that all intelligence services should be regulated more tightly; however, S s states that only two services are concerned.
To capture arguments in which the underlying assumptions of one segment are supported or attacked by another, we introduce the templates AT-SP1, AT-AP1, and AT-UP1 for SUPPORT, ATTACK, and UNDERCUT, respectively. The template can be interpreted as follows (e.g. AT-AP1):
AT-AP1: S t assumes a presupposition p, but S s disagrees with it.
(4) S t : For dog dirt left on the pavement dog owners should by all means pay a bit more. S s : Indeed, it's not the fault of the animals
In Example 4, S t presupposes that dog dirt is the fault of the animals, but S s disagrees. Thus, template AT-AP1 would be selected. We also create templates for propositional explanations, represented in templates AT-SP2 and AT-AP2. The templates can be interpreted as follows (e.g. AT-SP2):
AT-SP2: S t states a proposition p, and S s restates it.

Annotation Study
For testing the feasibility of our templates, we observe two metrics using the test set: (i) inter-annotator agreement and (ii) template coverage. For our inter-annotator agreement study, we asked two fluent English speakers with knowledge of ASs to explain each AR with an argument template and to fill in the template's slots using the annotation tool brat (Stenetorp et al., 2012). To study the coverage of relations which can be represented with an AT, we asked the annotators to mark a relation as the special pattern "OTHER" when no AT can be instantiated for a given relation. The annotators were given the original, segmented argumentative text, its ARs (i.e. SUPPORT, ATTACK, and UNDERCUT relations), and the predefined list of ATs. As a training phase, both annotators were asked to annotate the development set and to discuss their disagreements with each other. Next, the annotators were instructed to individually annotate all 270 relations in the test set. As we were aware that an annotation may admit two or more compatible instantiations, one being more salient than the others, we wanted to regard all semantically compatible templates as correct. For example, consider the following text from the annotation:
S t : The death penalty should be abandoned. S s : Innocent people are convicted.
Both annotators agreed that an AT from Figure 2 was appropriate and that slot x was "death penalty". However, one annotator chose AT-S3 with y = "Innocent people", a GOOD entity, and the other annotator chose AT-S4 with y = "Innocent people are convicted", a BAD event. The annotators agreed with each other's annotation because PROMOTE(death penalty, Innocent people are convicted) and SUPPRESS(death penalty, Innocent people) are semantically compatible.
Therefore, when analyzing the inter-annotator agreement, we categorized each pair of template instantiations as "agreeable" if one of the following conditions was met: (i) the ATs selected by both annotators are exactly the same and the phrases associated with the template slots are exactly the same or overlap, or (ii) if (i) was not met, each of the annotators agreed with the other's annotation. Under condition (i) alone, 46.3% (125/270) of the relations were categorized as "agreeable". Under (i) and (ii) together, 85.9% (232/270) of the relations were categorized as "agreeable". The Cohen's Kappa (κ) score is 0.80, indicating good agreement. The difference between the two figures reflects the variety of semantically compatible instantiations for a given pair of argumentative segments. It also indicates the importance of conducting a large-scale annotation, since a single pair may have two or more semantically compatible instantiations.
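Condition (i) is mechanical and can be sketched in a few lines, whereas condition (ii) is a human judgment and cannot be computed. The sketch below is only illustrative: representing a slot-filler as a pair of character offsets and the names `Annotation`, `spans_overlap`, and `agreeable_by_condition_i` are our assumptions, and the template labels "T-a" and "T-b" are placeholders rather than templates from the inventory.

```python
from typing import NamedTuple, Tuple


class Annotation(NamedTuple):
    template: str               # name of the selected AT
    x_span: Tuple[int, int]     # (start, end) character offsets of x, end exclusive
    y_span: Tuple[int, int]     # (start, end) character offsets of y, end exclusive


def spans_overlap(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    """True if two non-empty spans share at least one character (identical spans included)."""
    return a[0] < b[1] and b[0] < a[1]


def agreeable_by_condition_i(a: Annotation, b: Annotation) -> bool:
    """Condition (i): same template, and both slot-fillers are identical or overlapping."""
    return (
        a.template == b.template
        and spans_overlap(a.x_span, b.x_span)
        and spans_overlap(a.y_span, b.y_span)
    )


# Offsets for the death penalty example: x = "death penalty" in S t,
# y = "Innocent people" versus "Innocent people are convicted" in S s.
first = Annotation("T-a", x_span=(4, 17), y_span=(0, 15))
second = Annotation("T-b", x_span=(4, 17), y_span=(0, 29))
same_template = Annotation("T-a", x_span=(4, 17), y_span=(0, 29))
```

Here `first` and `second` fail condition (i) because the templates differ, even though their spans overlap, so such a pair can only become "agreeable" through the mutual acceptance of condition (ii); `first` and `same_template` satisfy condition (i) directly.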
The coverage of relations representable with an AT for the test set is 74.6% (173/232). Although our set of ATs is small, we cover a majority of patterns on a test set consisting of multiple, diverse topics. Our results support our hypothesis that ATs are not uniformly distributed but highly skewed.

Overview
The full-fledged task of automatically instantiating ATs for two argumentative segments is computationally challenging due to the large number of arbitrary slot-fillers x and y for an AT. As a first step towards full-fledged parsing, and given the small size of our corpus, we simplify this challenge in our current task setting by (i) limiting AT instantiations to ATTACK and SUPPORT relations instantiated with an AC template (i.e. the 8 templates in Figure 2) due to the low frequency of other ATs (e.g. undercut, presupposition, etc.) and (ii) assuming slot-fillers x and y have already been identified. In our future work, we will relax these conditions by testing against arbitrary slot-filler pairs and reasoning which may not be instantiated using ATs.
Let us formally define the simplified task of AT instantiation. Our input is two argumentative segments S t , S s and slot-fillers x in S t and y in S s . Our output is an appropriate AT representing the writer's reasoning behind S t and S s in terms of slot-fillers x, y. To represent an AT instantiation, we use the notation ⟨r, v x , c, v y ⟩, where r ∈ {SUPPORT, ATTACK}, v x , v y ∈ {GOOD, BAD} and c ∈ {PROMOTE, SUPPRESS} represent an argumentative relation, the VJs of slot-fillers x and y, and the type of causality from x to y, respectively (e.g. ⟨SUPPORT, BAD, PROMOTE, BAD⟩ for AT-S4). We refer to r, v x , c, v y as AT ingredients.
The core idea of the proposed method is as follows. Observing the AT development set, we found that contextual clues are typically available for only some AT ingredients, not all of them. Thus, we hypothesize that AT ingredients with no explicit clue can be inferred by combining the knowledge of ATs with the ingredients identified by explicit clues. In Example 1, for instance, if we already know that (i) the value judgment v x of "charge tuition fees" is BAD, (ii) the value judgment v y of "a right to education" is GOOD, and (iii) the argumentative relation r is SUPPORT, then we can uniquely identify that the causality is SUPPRESS.
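The inference in this example can be sketched as a lookup in the set of feasible templates. The sketch rests on an assumption that should be stated plainly: Figure 2 is not reproduced here, so the feasible set below is our reconstruction from the AC scheme, in which x is GOOD exactly when it promotes a GOOD y or suppresses a BAD y, v x is read as the judgment expressed by the target segment, and an ATTACK source implies the opposite judgment. This reconstruction yields 8 templates and agrees with the two templates spelled out in the text (AT-S1 and AT-S4), but the function names and the construction itself are ours.

```python
from itertools import product

RELATIONS = ("SUPPORT", "ATTACK")
VALUES = ("GOOD", "BAD")
CAUSALITIES = ("PROMOTE", "SUPPRESS")


def implied_value_of_x(causality, value_y):
    """Value of x implied by the source segment under the AC scheme.

    Promoting something GOOD or suppressing something BAD makes x GOOD;
    the other two combinations make x BAD.
    """
    is_good = (causality == "PROMOTE") == (value_y == "GOOD")
    return "GOOD" if is_good else "BAD"


def is_feasible(r, vx, c, vy):
    """Assumed feasibility rule: SUPPORT agrees with the implied value, ATTACK opposes it."""
    implied = implied_value_of_x(c, vy)
    return vx == implied if r == "SUPPORT" else vx != implied


# The feasible set T of <r, v_x, c, v_y> tuples (8 of the 16 possible combinations).
FEASIBLE = [t for t in product(RELATIONS, VALUES, CAUSALITIES, VALUES) if is_feasible(*t)]


def complete(r=None, vx=None, c=None, vy=None):
    """Return every feasible template consistent with the ingredients that are known."""
    known = (r, vx, c, vy)
    return [t for t in FEASIBLE if all(k is None or k == v for k, v in zip(known, t))]


# Example 1: relation and both value judgments are known, causality is not.
candidates = complete(r="SUPPORT", vx="BAD", vy="GOOD")
```

Under these assumptions `candidates` contains a single template, ⟨SUPPORT, BAD, SUPPRESS, GOOD⟩, which is the unique completion described above.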

Models for AT ingredients
We create three models m arg , m val , and m cau for identifying an AR, VJ, and causality, each of which returns a confidence score of their decision. As this is the first attempt at automating the instantiation of ATs, we use simple models for identifying AT ingredients rather than developing sophisticated models. This makes the framework transparent and analysis simple while allowing us to examine the effectiveness of template constraints.
Value Judgment (m val ) We train a Support Vector Machine (SVM)-based binary classifier (Cortes and Vapnik, 1995) to identify the VJ of the given slot-fillers x, y (i.e. GOOD or BAD). From observation of the AT dev set, we found the following features useful for VJ identification: (i) auxiliary verbs (e.g. should, must, ought) and (ii) negated auxiliary verbs (e.g. should not, must not). We parse each segment using spaCy (Honnibal and Johnson, 2015). We also found that adjectives, both inside and outside a slot-filler, are useful. For example, consider the following text: "Yes, it is annoying and cumbersome to separate your trash x ". The keywords annoying and cumbersome explicitly indicate that the VJ of the slot-filler x (i.e. to separate your trash) is BAD. Simultaneously, we discovered that slot-fillers themselves contain clues indicating the VJ (e.g. Innocent in "Innocent people"). Thus, we introduce two additional features: (iii) the average sentiment of the adjectives outside the slot-filler and (iv) the average sentiment of the adjectives inside the slot-filler. We use an existing sentiment lexicon (Warriner et al., 2013) to extract the sentiment polarity of each adjective.
Causal Relations (m cau ) We develop a simple rule-based classifier for identifying causal relations between the given slot-fillers x and y. We use a predefined list of causal phrases (e.g. causes, will lead to, etc. for PROMOTE, and destroy, kill, etc. for SUPPRESS) composed from Reisert et al. (2015). We use the AT development set to expand the phrase list with any PROMOTE or SUPPRESS phrases not in the list. Given the source S s and target S t segments, we use the following rules: If a PROMOTE phrase appears after x in S t , then predict PROMOTE with a confidence score of 1.0, namely m cau (PROMOTE) = 1.0, m cau (SUPPRESS) = 0.0. The same rule is applied to a SUPPRESS phrase. Else if a PROMOTE phrase appears before y in S s , then predict PROMOTE with a confidence score of 1.0. The same rule is applied to a SUPPRESS phrase. Otherwise (i.e. there are no PROMOTE or SUPPRESS phrases), we predict PROMOTE, the majority relation (66%) in the AT development set. Since we are less confident in this prediction than in those of the other ingredients when there is no contextual clue for the causality, we set the confidence scores to m cau (PROMOTE) = ε, m cau (SUPPRESS) = 0.1ε, where ε is a number less than all confidence scores given by the AR and VJ models.
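The rules for m cau can be sketched as follows. This is a minimal sketch under several assumptions rather than the classifier used in the experiments: the phrase lists contain only the four examples quoted above (the full list is not given here), matching is plain lowercase substring search, PROMOTE is checked before SUPPRESS when both phrase types occur (the text does not specify this case), and the value of `EPSILON` is a placeholder.

```python
EPSILON = 1e-3  # placeholder; assumed to be below every AR and VJ confidence score

# Only the example phrases quoted in the text; the real list is larger.
PROMOTE_PHRASES = ("causes", "will lead to")
SUPPRESS_PHRASES = ("destroy", "kill")


def _phrase_after(segment, slot, phrases):
    """True if any phrase occurs after the slot-filler inside the segment."""
    idx = segment.find(slot)
    if idx < 0:
        return False
    tail = segment[idx + len(slot):]
    return any(p in tail for p in phrases)


def _phrase_before(segment, slot, phrases):
    """True if any phrase occurs before the slot-filler inside the segment."""
    idx = segment.find(slot)
    if idx < 0:
        return False
    head = segment[:idx]
    return any(p in head for p in phrases)


def causality_scores(target, source, x, y, epsilon=EPSILON):
    """Return confidence scores for PROMOTE and SUPPRESS following the rules above."""
    target, source, x, y = (s.lower() for s in (target, source, x, y))
    promote = {"PROMOTE": 1.0, "SUPPRESS": 0.0}
    suppress = {"PROMOTE": 0.0, "SUPPRESS": 1.0}
    # Rule 1: a causal phrase after x in the target segment.
    if _phrase_after(target, x, PROMOTE_PHRASES):
        return promote
    if _phrase_after(target, x, SUPPRESS_PHRASES):
        return suppress
    # Rule 2: a causal phrase before y in the source segment.
    if _phrase_before(source, y, PROMOTE_PHRASES):
        return promote
    if _phrase_before(source, y, SUPPRESS_PHRASES):
        return suppress
    # Fallback: majority class with deliberately low confidence.
    return {"PROMOTE": epsilon, "SUPPRESS": 0.1 * epsilon}


implicit = causality_scores(
    "German universities should not charge tuition fees.",
    "Every German citizen has a right to education.",
    x="charge tuition fees",
    y="a right to education",
)
```

For Example 1 no causal phrase is found, so `implicit` falls back to the low-confidence PROMOTE prediction; this is the situation in which the other ingredients and the template constraint are expected to decide the causality.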
Argumentative Relations (m arg ) We replicate a simple classification model (Peldszus and Stede, 2015b) for identifying the argumentative relation between given segments S s and S t (as either SUPPORT or ATTACK). The classifier is based on logistic regression and uses surface features such as lemma, part-of-speech tags, and segment length from the source and target segments.

Putting all things together
To instantiate an AT, we use a standard linear model constrained by ATs as follows:
$$\operatorname*{arg\,max}_{\langle r, v_x, c, v_y \rangle \in T} \mathbf{w} \cdot \Phi(r, v_x, c, v_y)$$
where w is a weight vector, Φ is a feature function of an AT instantiation ⟨r, v x , c, v y ⟩ and T represents the SUPPORT and ATTACK templates from Figure 2. The feature function Φ(r, v x , c, v y ) returns an 8-dimensional feature vector characterizing an AT instantiation as follows: {m arg (SUPPORT), m arg (ATTACK), m val (x, GOOD), m val (x, BAD), m cau (PROMOTE), m cau (SUPPRESS), m val (y, GOOD), m val (y, BAD)}. We use the confidence values of each AT ingredient calculated by the separate models described in Section 3.2, and set the dimensions that do not correspond to the instantiation to 0. For instance, given an AT instantiation ⟨SUPPORT, BAD, PROMOTE, BAD⟩, we create the following feature vector: {m arg (SUPPORT), 0, 0, m val (x, BAD), m cau (PROMOTE), 0, 0, m val (y, BAD)}. We learn w on training data by using an averaged structured perceptron (Collins, 2002). We call this a template-constrained inference model, or TCI. To see the effectiveness of the constraint, we consider the model without the constraint ⟨r, v x , c, v y ⟩ ∈ T , which we call the non-constrained inference model, or NI. If the NI model's output does not match an AT, we output ⟨SUPPORT, GOOD, PROMOTE, GOOD⟩ (AT-S1), the majority AT in the dev set.
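The difference between TCI and NI can be illustrated with a small sketch. Several parts are assumptions and not the trained system: the weight vector is uniform instead of being learned with the averaged structured perceptron, the confidence values are invented for illustration, and the feasible set T is reconstructed from the AC scheme (with v x read as the target segment's judgment) because Figure 2 is not reproduced here. The names `phi`, `score`, `tci`, and `ni` are ours.

```python
from itertools import product

RELATIONS = ("SUPPORT", "ATTACK")
VALUES = ("GOOD", "BAD")
CAUSALITIES = ("PROMOTE", "SUPPRESS")
ALL = list(product(RELATIONS, VALUES, CAUSALITIES, VALUES))


def is_feasible(r, vx, c, vy):
    """Assumed reconstruction of T from the Argument from Consequences scheme."""
    implied_good = (c == "PROMOTE") == (vy == "GOOD")
    target_good = vx == "GOOD"
    return target_good == implied_good if r == "SUPPORT" else target_good != implied_good


T = [t for t in ALL if is_feasible(*t)]
AT_S1 = ("SUPPORT", "GOOD", "PROMOTE", "GOOD")  # majority AT, used as the NI fallback


def phi(inst, conf):
    """8-dimensional feature vector: the confidence of each chosen ingredient, 0 elsewhere."""
    r, vx, c, vy = inst
    return [
        conf["arg"]["SUPPORT"] if r == "SUPPORT" else 0.0,
        conf["arg"]["ATTACK"] if r == "ATTACK" else 0.0,
        conf["val_x"]["GOOD"] if vx == "GOOD" else 0.0,
        conf["val_x"]["BAD"] if vx == "BAD" else 0.0,
        conf["cau"]["PROMOTE"] if c == "PROMOTE" else 0.0,
        conf["cau"]["SUPPRESS"] if c == "SUPPRESS" else 0.0,
        conf["val_y"]["GOOD"] if vy == "GOOD" else 0.0,
        conf["val_y"]["BAD"] if vy == "BAD" else 0.0,
    ]


def score(w, inst, conf):
    return sum(wi * fi for wi, fi in zip(w, phi(inst, conf)))


def tci(w, conf):
    """Template-constrained inference: arg max over feasible templates only."""
    return max(T, key=lambda inst: score(w, inst, conf))


def ni(w, conf):
    """Non-constrained inference: arg max over all tuples, falling back to AT-S1."""
    best = max(ALL, key=lambda inst: score(w, inst, conf))
    return best if best in T else AT_S1


# Invented confidences resembling Example 1: clear VJ and AR clues, no causal clue.
conf = {
    "arg": {"SUPPORT": 0.8, "ATTACK": 0.2},
    "val_x": {"GOOD": 0.1, "BAD": 0.9},
    "cau": {"PROMOTE": 0.001, "SUPPRESS": 0.0001},
    "val_y": {"GOOD": 0.9, "BAD": 0.1},
}
w = [1.0] * 8  # uniform weights as a stand-in for the learned weight vector

tci_prediction = tci(w, conf)
ni_prediction = ni(w, conf)
```

With these numbers the unconstrained arg max is ⟨SUPPORT, BAD, PROMOTE, GOOD⟩, which is not a template, so NI falls back to AT-S1, whereas TCI returns ⟨SUPPORT, BAD, SUPPRESS, GOOD⟩: the weak PROMOTE default is overridden by the confident VJ and AR predictions together with the template constraint.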
The advantage of TCI is that if a model of each ingredient is not confident about its prediction and the most-likely AT is invalid, the wrong prediction can be fixed by combining the knowledge of ATs and other confident AT ingredient predictions. The NI model entirely depends on the independent decision of each ingredient model, regardless of whether the predictions are confident or not, which is compensated by TCI.

Setting
In Section 2, the annotators were given an argumentative relation and instructed to instantiate an AT. Towards fully automating the task of AT instantiation, we also test our system when no argumentative relation is given. Therefore, we consider two settings: (i) predict an AT with the gold-standard argumentative relation (G) and (ii) with no gold-standard relation (N). Thus, we examine four models: NI-G, NI-N, TCI-G, and TCI-N. For all models for AT instantiation, we conduct a 5x10-fold cross validation using 231 unique SUPPORT and ATTACK AC instantiations collected from the annotations on the 69 texts (270 relations) from our test set. In each fold, we create a validation set consisting of one-fifth of the training data. We then oversample the training data. We employ early stopping with a patience of 2, measuring performance by the accuracy of predictions on the validation set.
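The data-splitting part of this protocol can be sketched as below. The text leaves several details open, so the sketch makes its own choices and should be read with that in mind: the validation set is carved out before oversampling, oversampling is random duplication until every class present in the training split matches the largest class, and the random seed is arbitrary. The function names are ours, and early stopping is not covered.

```python
import random
from collections import defaultdict


def oversample(examples, label_of, rng):
    """Randomly duplicate minority-class examples until all classes are equally frequent."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[label_of(ex)].append(ex)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


def repeated_kfold(examples, label_of, repeats=5, folds=10, seed=0):
    """Yield (train, validation, test) splits for a repeats x folds cross validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        for k in range(folds):
            test = shuffled[k::folds]
            rest = [ex for i, ex in enumerate(shuffled) if i % folds != k]
            n_val = len(rest) // 5  # one-fifth of the training data
            validation, train = rest[:n_val], rest[n_val:]
            yield oversample(train, label_of, rng), validation, test


# Toy stand-in for the 231 instantiations: (id, template label) pairs with skewed labels.
toy = [(i, "AT-S1" if i % 4 else "AT-S3") for i in range(231)]
splits = list(repeated_kfold(toy, label_of=lambda ex: ex[1]))
```

Keeping the validation and test splits free of duplicated examples is a deliberate choice in this sketch, so that oversampling affects only what the model is trained on.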

Results and discussion
The results (F 1 score) for the m arg , m val , and m cau subtask models are 0.59, 0.65, and 0.42, respectively. The results indicate that the rule-based causality classifier has the lowest performance. We attribute this to the lack of explicit contextual clues indicating the causality between slot-fillers. Through a subjective analysis, we found that roughly 88% of causal relations are implicit in the AT test set, thus PROMOTE is mainly predicted. Table 1 shows the results of AT instantiation. The low performance of the majority and random baselines indicates that the AT instantiation task is not simple. The proposed models (NI, TCI) clearly outperform these baseline models. The TCI model consistently outperforms the NI model in both settings G and N. This indicates that template constraints are useful for instantiating ATs.
To further test our hypothesis that AT ingredients without an explicit contextual clue (i.e. implicit) can be inferred with a template constraint, we manually analyzed all 231 of the testing instances and labeled whether or not an explicit contextual clue exists for VJ and causality. We then compared the accuracies of each ingredient on implicit problem instances for NI-G and TCI-G. Table 2 shows our results, which indicate that our model is able to infer ingredients with no explicit contextual clue more reliably with the introduction of a template constraint, especially in the case of causality.
The following shows an AT without an explicit contextual clue for causality that was predicted correctly using TCI-G: "S t : Nevertheless, everybody should contribute to the funding of the public broadcasters x in equal measure, S s : for we need general and independent media y .", where explicit clues (i.e. should contribute to and we need) indicate the VJ of x, y, both GOOD, but the causality between x and y is implicit. Combining this with the SUPPORT relation, the template constraints indicate that AT-S1 is the only possibility.

Related Work
ATs Reed (2006) annotated the Araucaria corpus with Walton et al. (2008)'s argumentation schemes (ASs), and subsequent work (Feng and Hirst, 2011) created a machine learning model to classify an argument into five sets of schemes. However, Reed (2006) does not report the inter-annotator agreement. Lawrence and Reed (2016) created a model for instantiating ASs with a natural language representation, whereas we instantiate using templates and slot-fillers. Green (2015) conducted work on identifying new ASs used in biomedical articles.
Several argumentative corpora have been created for argumentation mining tasks such as argument component identification, argument component classification, and structure identification (Rinott et al., 2015; Stab and Gurevych, 2014). Earlier work on discourse structure analysis includes discourse theories such as Rhetorical Structure Theory (Mann and Thompson, 1987). The Penn Discourse TreeBank, the largest manually annotated corpus for discourse relations, targeted both implicit and explicit relation detection for either adjacent sentences or clauses (Prasad et al., 2008). However, these studies do not aim to capture the implicit reasoning behind arguments.
AT ingredients Although we adopted a simple approach for AT ingredient identification for our first attempt (see Section 3.2), many sophisticated approaches have been proposed. Shallow discourse analysis of ARs has been extensively studied (Cocarascu and Toni, 2017;Niculae et al., 2017;Peldszus and Stede, 2015a,b). VJ identification is similar to targeted sentiment analysis (Mitchell et al., 2013;Dong et al., 2014). Somasundaran and Wiebe (2010) developed an annotation method for targeted sentiment. However, we aim to expand the annotation to other types of arguments, and their work only considers the task setting of stance classification. Finally, causal relation identification between an entity pair in a sentence has been studied (Zhang and Wang, 2015). In the future, we will incorporate these sophisticated techniques into our model.

Conclusion and future work
In this work, we propose a feasible annotation scheme for capturing a writer's reasoning in argumentative texts. We first developed a small list of predefined templates (ATs) for capturing the reasoning of ARs, where each template encodes a causal label that enables annotators to avoid manual generation of natural language slot-fillers, and conducted a corpus study. Our results indicate that ATs are highly skewed, and even with a small set of ATs, we can capture a majority of reasoning (74.6%) for multiple, diverse policy topics. We believe that the design decision to leave a wide variety of long-tailed, minor classes of reasoning as "OTHER" helps keep the AT instantiation simple. Furthermore, the annotation reaches good inter-annotator agreement (Cohen's κ=0.80). The annotated corpus is made publicly available. We then created several preliminary models for automatically instantiating ATs. We discovered that template-constrained inference helps towards instantiating ATs with implicit ingredients necessary for understanding the reasoning behind an argument.
In the future, we will extend our work by conducting a large-scale annotation of ATs using methods such as crowdsourcing, and we will experiment with full-fledged parsing via recent neural models for capturing argumentative component features (Eger et al., 2017;Schulz et al., 2018;Ajjour et al., 2017). We plan to use other available argumentative corpora for conducting our experiments. We will also work towards expanding our templates and integrating them into the argument reasoning task proposed in SemEval2018 (Habernal et al., 2017). Finally, we plan to capture the causal information lost by annotating other factors of the causality such as severity, truthfulness, likelihood, to name a few.