A Corpus for Argumentative Writing Support in German

In this paper, we present a novel annotation approach to capture claims and premises of arguments and their relations in student-written persuasive peer reviews on business models in German. We propose an annotation scheme based on annotation guidelines that models claims and premises as well as support and attack relations, capturing the structure of argumentative discourse in student-written peer reviews. We conduct an annotation study with three annotators on 50 persuasive peer reviews to evaluate our annotation scheme. The obtained inter-rater agreement of α = 0.57 for argument components and α = 0.49 for argumentative relations indicates that the proposed annotation scheme successfully guides annotators to moderate agreement. Finally, we present our freely available corpus of 1,000 persuasive student-written peer reviews on business models and our annotation guidelines to encourage future research on the design and development of argumentative writing support systems for students.


Introduction
In today's world most information is readily available. Consequently, the mere reproduction of information is losing importance. This results in a shift of job profiles towards interdisciplinary, ambiguous and creative tasks (vom Brocke et al., 2018). Therefore, educational institutions need to evolve their curricula with respect to the composition of skills and knowledge conveyed. In particular, teaching higher order thinking skills to students, such as critical thinking, collaboration or problem-solving, has become more important (Fadel et al., 2015). This has already been recognized by the Organization for Economic Co-operation and Development (OECD), which included these skills as a major element of their Learning Framework 2030 (OECD, 2018). One of these skills is arguing in a structured, reflective and well-formed way. Argumentation is not only an essential part of our daily communication and thinking, but it also contributes significantly to the competencies of communication, collaboration and problem-solving (Kuhn, 1992). Since the studies of Aristotle, the ability to form convincing arguments has been recognized as the foundation for persuading an audience of novel ideas, and it plays a major role in strategic decision-making and analyzing different standpoints. To develop skills such as argumentation, it is of great importance for the individual student to receive continuous feedback throughout their learning journey, also called formative feedback (Black and Wiliam, 2009;Hattie and Timperley, 2007). However, this is naturally hindered by traditional large-scale lectures and by the growing field of distance learning scenarios such as massive open online courses (MOOCs) (Seaman et al., 2018).
In fact, educational institutions, such as universities, face the challenge of providing individual formative feedback effectively (Fortes and Tchantchane, 2010), since every student would need a personal tutor to have an optimal learning environment for learning how to argue (Vygotsky, 1980). One possible path for providing individual feedback is to leverage recent developments in Computational Linguistics in the form of computer-assisted writing which enables the development of writing support systems that provide tailored feedback about textual documents (Song et al., 2014;Stab and Gurevych, 2014a). Argumentation Mining (AM), a research field in Computational Linguistics, aims at automatically identifying arguments in unstructured texts (Lippi and Torroni, 2015). An argument is a set of statements made up of three parts: a claim, a set of evidence or premises (e.g., facts) and an inference from the evidence to the claim (Toulmin, 1984). Claim and premise represent the argument components. The claim is the central component of an argument, representing an arguable text unit, while the premises are propositions that either support or attack the claim, underpinning its plausibility. Support and attack are argumentative relations that model the discourse structure of arguments. In the identification of these argumentation structures, two main tasks can be distinguished: 1) argument component classification, the classification of argumentative text into claims and premises, and 2) argument relation classification, the identification of support and attack relationships between pairs of argument components.
To design, train and test AM algorithms, the availability of high-quality labeled corpora is crucial. Therefore, numerous prior works have dealt with creating annotated data sets. However, they are all limited to a particular genre, ranging from well-structured legal (Palau and Moens, 2009;Ashley and Walker, 2013) and scientific documents (Kirschner et al., 2015;Houngbo and Mercer, 2014), to rather ambiguous, vague and less formal social web content (Wachsmuth et al., 2014;Aharoni et al., 2014;Cabrio and Villata, 2014;Habernal and Gurevych, 2015a). Corpora that are applicable for the design and development of argumentative writing support systems are scarce (Stab and Gurevych, 2017a;Wambsganss and Rietsche, 2020;Wambsganss et al., 2020b). The only collection from the education domain that is annotated for argumentative discourse structures was presented in Stab and Gurevych (2014a). It is composed of 90 persuasive essays written by students in English and was later extended to include 402 essays (Stab and Gurevych, 2017a). However, these corpora are 1) annotated in English only and 2) not derived from a specific learning scenario, which limits their effective use in an argumentative writing support system. Consequently, there is a lack of linguistic corpora for training models that provide students with adaptive feedback about the quality of their argumentation in common scenarios in large-scale lectures or the growing field of MOOCs (Seaman et al., 2018), such as peer reviews where students provide each other with argumentative feedback on a specific task, e.g., on a previously developed business model (Rietsche and Söllner, 2019).
Creating gold standards and test collections requires a formal representation model as well as corresponding annotation guidelines. In this paper, we introduce an argumentation annotation scheme for persuasive student-generated peer reviews extracted from a common learning scenario. Moreover, we present a corpus of 1,000 student-written peer reviews that are annotated for argumentation components and their relations. Our contribution is threefold: 1) we derive an annotation scheme for a new data domain for AM based on argumentation theory and previous work on annotation schemes for persuasive student essays (Stab and Gurevych, 2017a;Stab and Gurevych, 2014a), 2) we present an annotation study based on 50 persuasive peer reviews and three annotators to show that the annotation of student peer reviews is reliably possible, 3) we present our final and freely available corpus of 1,000 student peer reviews collected in our lecture about business innovation in German language. We therefore hope to encourage future research on student-generated argumentative texts and on writing support systems to train argumentation skills of students based on AM.

Argumentation Mining
AM is a research field in Computational Linguistics that is gaining momentum in many areas, including the legal domain (Mochales Palau and Ieven, 2009), newswire articles (Reed et al., 2008a;Deng and Wiebe, 2015;Sardianos et al., 2015), user-generated web content (Wachsmuth et al., 2014;Habernal and Gurevych, 2015b;Abbott et al., 2016), and online debates (Cabrio and Villata, 2014). AM aims at automatically identifying arguments in unstructured textual documents based on the classification of argumentative and non-argumentative text units and the extraction of argument components and their relations. Recently, researchers have shown increasing interest in intelligent writing assistance based on AM (Song et al., 2014;Stab and Gurevych, 2014a;Stab and Gurevych, 2014b), since it enables argumentative writing support systems that provide tailored feedback about arguments in student-generated texts. However, the effectiveness of using this technology in a certain learning scenario for educational purposes has rarely been assessed (Stab and Gurevych, 2017b;Lippi and Torroni, 2015), as argumentation corpora from student-generated texts in the field of education are rather uncommon (Lawrence and Reed, 2019).

Argument Annotated Corpora and Annotation Schemes
As Lawrence and Reed (2019) state, "one of the challenges faced by current approaches to argument mining is the lack of large quantities of appropriately annotated arguments to serve as training and test data." Since the availability of labeled corpora is crucial for designing, training and evaluating AM algorithms, numerous prior works have dealt with creating annotated data sets, such as the Araucaria corpus (Reed et al., 2008b), the European Court of Human Rights (ECHR) corpus of Mochales and Moens (2008), the Debatepedia corpus (Cabrio and Villata, 2012), the ChangeMyView corpus (Egawa et al., 2019) or the persuasive essays corpus of Stab and Gurevych (2014a) with 90 essays and the corpus of Stab and Gurevych (2017a) with 402 persuasive student essays. These corpora have been widely used for various AM tasks, such as the identification of argument components (Rooney et al., 2012), corpus wide AM (Ein-Dor et al., 2019) or end-to-end AM (Persing and Ng, 2016).
Creating gold standards and test collections requires a formal representation model as well as corresponding annotation guidelines. While a number of well-defined models exist in the field of AM (Freeman, 2001;Perelman, 1971;Pollock, 1995;Walton and Macagno, 2015;Walton, 1996;Wambsganss et al., 2020a), there is no general argumentation annotation scheme across all domains and genres of texts. Instead, the proposed representations differ in granularity, expression power and categorization (Lawrence and Reed, 2019). Therefore, conducting annotation studies with several annotators when introducing new annotation schemes is crucial for the quality of argumentation corpora.

Argument Annotated Corpora in Education
With the exception of the corpora proposed in Stab and Gurevych (2014a) and Stab and Gurevych (2017a), previously presented argument annotated datasets are not easily applicable for the development of argumentative writing support systems for students in a real-world case, since they are 1) not extracted from an educational learning scenario in which the annotation could be used for training a model that provides students with feedback on their texts, and 2) often not annotated at the level of discourse (Stab and Gurevych, 2017a;Lawrence and Reed, 2019), which is necessary, for example, to give students feedback on insufficiently supported claims. Stab and Gurevych (2014a) identified the lack of linguistic corpora in the domain of student-written texts for designing and developing argumentative writing support systems that provide adaptive feedback by leveraging AM techniques. Therefore, they introduced an annotation scheme for annotating argument components and their relationships in persuasive student essays. Afterwards, several studies built on their corpus, for example by annotating the persuasiveness of a subset of the essays or by training a persuasiveness scoring model on them. However, the transfer of argumentation corpora to other educational domains (e.g., common learning scenarios such as peer reviews) and other languages falls short in current literature.

Essay Scoring Corpora
Besides annotated datasets of argumentative student-written texts, several corpora exist in the field of automatic essay scoring. The goal of this task is to automatically rate textual documents in the form of holistic scores based on their content and form (Horbach et al., 2017). Most corpora are built on student-written content, e.g., the Cambridge Learner Corpus (CLC) (Yannakoudakis et al., 2011) with 1,244 English essays, the Swedish high school corpus with 1,702 essays (Östling et al., 2013), or the TOEFL11 corpus with 1,100 English essays written by language students (Blanchard et al., 2013). However, the corpora are usually annotated with a holistic score on the level of full documents only, e.g., low, medium, high in the TOEFL11 corpus, while specific argumentation structures are commonly ignored. In fact, the persuasiveness of essays is, if considered at all, usually only regarded as one sub-variable of the overall document (e.g., in the annotations of Östling et al. (2013) and Yannakoudakis et al. (2011)).
Consequently, these corpora are of limited use for the development of argumentative writing support systems. The argumentation quality is often only annotated as a qualitative score on a 1 to 3 range (e.g., Horbach et al. (2017)) and is therefore not sufficient to train a sophisticated supervised machine learning model for argumentative writing assistants. Corpora from the field of automatic essay scoring usually neither distinguish between different types of argument components (e.g., claims or premises) nor are they built on a rich argumentation annotation scheme. Nguyen and Litman (2018) demonstrated the value of AM for automated persuasive essay scoring by evaluating different AM features for improving essay scores. However, essay scoring corpora mostly do not focus on the annotation of argumentation relations and therefore do not qualify as a foundation for sophisticated models that give feedback on argumentative discourse through writing support systems (e.g., feedback on unsupported claims). We aim to address this literature gap by presenting and evaluating an annotation scheme as well as an annotated argumentation corpus built on student-written texts with the objective of enabling researchers to develop novel argumentative writing support systems for students.

Corpus Construction
We compiled a corpus of 1,000 student-generated peer reviews in which students provide each other with feedback on previously developed business models. Peer reviews are a modern learning scenario in large-scale lectures, enabling students to reflect on their content, receive individual feedback from peers and thus deepen their understanding of the content (Rietsche and Söllner, 2019). Moreover, they are easy to set up in traditional large-scale learning scenarios or in the growing field of distance-learning scenarios such as MOOCs (Seaman et al., 2018). Peer reviews can thus be used to train skills such as argumentation. However, few corpora suitable for providing argumentation feedback are available that A) contain annotated persuasive student peer reviews, B) are large enough to use trained models in a real-world scenario and C) follow an annotation guideline that guides the annotators towards an adequate agreement. We therefore propose a new annotation scheme to model argument components as well as argumentation relations that reflects the argumentative discourse structure in persuasive peer reviews. We based our annotation scheme on the model of Toulmin (1984) and the studies of Stab and Gurevych (2014a) and Stab and Gurevych (2017a).
To build a reliable corpus, we followed a four-step methodology, illustrated in Figure 1: 1) we examined scientific literature and theory on how to model argumentation discourse structures in texts from different domains, 2) we randomly sampled 50 student-generated peer reviews and, on the basis of our findings from literature and theory, developed a set of annotation guidelines consisting of rules and limitations on how to annotate argumentation discourse structures, 3) we applied, evaluated and improved our guidelines with three native speakers in three consecutive workshops to resolve annotation ambiguities, 4) we applied the final annotation scheme based on our 15-page guidelines to a corpus of 1,000 student-generated peer reviews.

Data Source
We gathered a corpus of 1,000 student-generated peer reviews written in German. The data was collected in one of our mandatory business innovation lectures in a master program at our university. In this lecture, around 200 students develop and present a new business model for which they receive three peer reviews each. Here, a fellow student from the same course elaborates on the strengths and weaknesses of the business model and gives persuasive recommendations on what could be improved. We sampled a random subset of 1,000 of these reviews out of around 7,000 documents collected between 2014 and 2018.

Annotation Scheme
Our objective is to model the argumentation discourse structures of student-generated peer reviews by annotating argumentation components and their relations. Most of the peer reviews in our corpus follow a similar structure. They describe several strengths or weaknesses of the business model under consideration, backing them up by examples, statistics, intuitions or citations. These strengths and weaknesses can also be formulated to make a certain recommendation for improvements of the business model. An argumentation component, e.g., a strength, weakness or suggestion is only regarded as a "claim" if it contains a certain standpoint, which can also represent a complete sentence. Our basic annotation scheme is illustrated in Figure 2.

Argument Components
For argumentation components, we follow established models in argumentation theory which provide detailed definitions of argument components (e.g., Toulmin, 1984;Walton et al., 2008;Weinberger and Fischer, 2006;Perelman, 1971;Pollock, 1995;Walton, 1996;Freeman, 2001;Walton and Macagno, 2015;Van Eemeren and Grootendorst, 2016). These theories generally agree that a basic argument consists of multiple components and that it includes a claim that is supported or attacked by at least one premise. Also in the domain of student-written peer reviews, we found that a claim is the central component of an argument. It is a controversial statement (e.g., claiming a strength or weakness of a business model, see examples below) that is either true or false and should not be accepted by the receiver of the feedback without additional support or backing. The premise supports the validity of the claim (e.g., by providing a statistic, quote or a value-based intuition). It is a reason given by the author to persuade the reader of her claim. Below are two examples of claims from our corpus (1. being a strength, 2. being a weakness) and their supporting premises.
1. "The value proposition is very well done. [claim (strength)] It is short and concise and the advantages or benefits are well highlighted. [premise (example)] As a customer I would like to try the product after reading it." [premise (intuition)]
2. "Unfortunately, the value proposition canvas is very poorly filled in. [claim (weakness)] The points are far too little elaborated and far too little described or explained. The customer jobs should be described much more precisely." [premise (example)]

Argumentative Relations
The basic argumentation structure in our corpus of student-generated peer reviews consists of several claims, each independently supported by one or more premises. Since in our data domain weaknesses and strengths of a business model are discussed, the texts do generally not present a major claim as is the case for example in English essays annotated by Stab and Gurevych (2014a) and Stab and Gurevych (2017a). Instead, the documents we deal with consist of a set of claims supported by one or more premises. However, premises not only support a statement, but may also attack a claim, e.g., when used as a stylistic device or to illustrate uncertainty in the argumentation. Hence, more complicated constellations of claims and premises are possible, in which a claim is supported by several different premises or by a chain of premises, in which each premise is in turn supported by another premise. In the same way, a claim can be supported by one premise and attacked by another, or supported by a premise which is attacked by another premise. Nevertheless, the simplest form consists of a claim supported by a single premise. To provide an overview, we illustrated three basic examples of annotated relations in our corpus (see Figure 3). Some statistics on the occurrences of different patterns in our dataset can be found in Table 6.
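The relation patterns described above can be made concrete with a small data structure. The following is a minimal sketch under our own assumptions, not part of the corpus tooling: the class names, field names and component ids (C1, P1, ...) are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    cid: str   # hypothetical identifier, e.g. "C1" or "P1"
    kind: str  # "claim" or "premise"


@dataclass(frozen=True)
class Relation:
    source: str  # id of the premise that supports or attacks
    target: str  # id of the claim or premise being supported or attacked
    kind: str    # "support" or "attack"


def unsupported_claims(components, relations):
    """Return the ids of claims that are not supported by any premise."""
    supported = {r.target for r in relations if r.kind == "support"}
    return [c.cid for c in components
            if c.kind == "claim" and c.cid not in supported]


def chain_depth(cid, relations):
    """Length of the longest chain of supporting premises below a component.

    Assumes that the support relations contain no cycles.
    """
    children = [r.source for r in relations
                if r.kind == "support" and r.target == cid]
    if not children:
        return 0
    return 1 + max(chain_depth(child, relations) for child in children)


# C1 is supported by P1, which in turn is supported by P2 (a reasoning chain);
# P3 attacks C1; C2 has no supporting premise.
components = [Component("C1", "claim"), Component("C2", "claim"),
              Component("P1", "premise"), Component("P2", "premise"),
              Component("P3", "premise")]
relations = [Relation("P1", "C1", "support"),
             Relation("P2", "P1", "support"),
             Relation("P3", "C1", "attack")]

print(unsupported_claims(components, relations))
print(chain_depth("C1", relations))
```

A representation of this kind would allow a writing support system to flag claims without any supporting premise, which is the type of feedback motivating the annotation of relations.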

Annotation Process
Three native German speakers annotated the peer reviews independently from each other for claims and premises as well as their argumentative relationships in terms of support and attack, according to the annotation guidelines we specified. Inspired by Stab and Gurevych (2017a), our guidelines consisted of 15 pages, including definitions and rules for what is an argument, which annotation scheme is to be used and how argument components and argumentative relations are to be judged. Several private training sessions and three team workshops were performed to resolve disagreements among the annotators and to reach a common understanding of the annotation guidelines. We used the brat rapid annotation tool, since it provides a graphical interface for marking up text units and linking their relations (Stenetorp et al., 2012). After the first 50 reviews were annotated by all three annotators, we calculated the inter-annotator agreement (IAA) scores. As we obtained satisfying results, we proceeded with a single annotator who marked up the remaining 950 documents. Following Stab and Gurevych (2014a), we annotated argument components in terms of claims and premises. The specific component types, e.g., weakness, strength for claims or intuition, statistic for premise were not annotated in this study. However, we believe this would be a useful addition in future work. In Figure 4 we display an example of an entire peer review with the corresponding annotations.
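For readers who want to process annotations of this kind, the sketch below reads the standoff format written by brat, in which text-bound annotations (ids starting with T) carry a label and character offsets and binary relations (ids starting with R) link two annotations. It is a minimal illustration, not the processing pipeline used in the study: the label names Claim, Premise and Support and the sample text are assumptions, and only contiguous spans are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_brat(ann_text):
    """Parse text-bound (T) and relation (R) lines of brat standoff output.

    Only contiguous spans are supported; a discontinuous span raises
    ValueError. Other annotation types are skipped.
    """
    spans, relations = {}, []
    for line in ann_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        ann_id = fields[0]
        if ann_id.startswith("T"):
            label, start, end = fields[1].split()
            spans[ann_id] = Span(ann_id, label, int(start), int(end), fields[2])
        elif ann_id.startswith("R"):
            label, arg1, arg2 = fields[1].split()
            # Arguments look like "Arg1:T2"; keep only the referenced id.
            relations.append((label, arg1.split(":", 1)[1], arg2.split(":", 1)[1]))
    return spans, relations


# Hypothetical document text and annotation file content.
TEXT = "The value proposition is very well done. It is short and concise."
ANN = (
    "T1\tClaim 0 40\tThe value proposition is very well done.\n"
    "T2\tPremise 41 65\tIt is short and concise.\n"
    "R1\tSupport Arg1:T2 Arg2:T1\n"
)

spans, relations = parse_brat(ANN)
print(spans["T1"].label, relations)
```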

Corpus Analysis
We analyzed the results of the annotation process in order to examine (1) the reliability of the corpus and (2) the major disagreements in argument component and relation annotations between the annotators. In addition, we calculated some statistics of the final corpus.

Inter-Annotator Agreement
To evaluate the reliability of the argument component and argumentative relation annotations, we followed the approach of Stab and Gurevych (2014a).

Argument Components
With regard to the argument components, two strategies were used. Since there were no predefined markables, annotators not only had to identify the type of argument component, but also its boundaries.
In order to assess the latter, we use Krippendorff's α_U (Krippendorff, 2004), which allows for assessing the reliability of an annotated corpus, considering the differences in the markable boundaries. To evaluate the annotators' agreement in terms of the selected category of an argument component for a given sentence, we calculate percentage agreement and two chance-corrected measures, multi-π (Fleiss, 1971) and Krippendorff's α (Krippendorff, 1980). We decided to operate at sentence level, since only 20.56% of the sentences in the corpus contain annotations of different argument components. Thus, evaluating the reliability at sentence level served as a good approximation of the IAA. At the level of individual sentences, 33.45% contain a claim, 35.64% a premise and 55.52% a non-annotation. 20.56% of the sentences contain several annotations, with a combination of a premise and a non-argumentative span (32.21%) and a combination of a claim and a non-argumentative span (35.78%) representing the majority of those cases. Only 4.08% contain both a premise and a claim. At the token level, the following class distribution is achieved: 43.54% claim, 45.42% premise and 11.04% are not annotated.
Table 1 displays the resulting IAA scores. We obtained an IAA of 78.65% for the claims and 76.63% for the premises. The corresponding multi-π scores are 55.47% and 51.06%. Regarding Krippendorff's α, a score of 55.49% and 51.08% is obtained, indicating a moderate agreement for both categories. With a score of 44.04% and 47.76%, the unitized α of both the claim and premise annotations is slightly smaller compared to the sentence-level agreement. Thus, the boundaries of argument components are less precisely identified in comparison to the classification into argument types. Yet the scores still suggest that there is a moderate level of agreement between the annotators. With a score of α_U = 35.90%, the boundaries of non-argumentative units are less reliably detected. In contrast, the agreement scores targeting the component types are considerably higher for the non-argumentative spans as compared to the claims and premises (85.72%, 64.10%, 64.12%), indicating a substantial agreement between the annotators. Hence, we conclude that the annotation of the argument components in student-generated peer reviews is reliably possible.
Figure 4: Fully annotated example of a peer review according to our annotation scheme and guidelines. The left column represents the claims (C1-C6), while the premises are listed on the right (P1-P7). N signifies a non-argumentative text unit.
Table 1: Inter-rater agreement of argument component annotations.
Table 2: Inter-rater agreement of argumentative relation annotations.
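To make the sentence-level measures concrete, the following sketch computes percentage agreement, multi-π and Krippendorff's α for nominal labels. It is an illustration under stated assumptions, not the evaluation code of the study: every item is assumed to be labeled by the same number of annotators with no missing values, the toy data are invented, and the unitized α_U is not covered.

```python
from collections import Counter


def observed_agreement(items):
    """Mean share of agreeing annotator pairs per item (percentage agreement)."""
    n = len(items[0])  # annotators per item, assumed constant
    per_item = []
    for labels in items:
        counts = Counter(labels)
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item.append(agreeing_pairs / (n * (n - 1)))
    return sum(per_item) / len(per_item)


def multi_pi(items):
    """Chance-corrected agreement for several annotators (Fleiss, 1971)."""
    totals = Counter(label for labels in items for label in labels)
    n_labels = sum(totals.values())
    expected = sum((c / n_labels) ** 2 for c in totals.values())
    return (observed_agreement(items) - expected) / (1 - expected)


def krippendorff_alpha_nominal(items):
    """Krippendorff's alpha for nominal labels without missing values."""
    totals = Counter(label for labels in items for label in labels)
    n_labels = sum(totals.values())
    expected = (sum(c * (c - 1) for c in totals.values())
                / (n_labels * (n_labels - 1)))
    return 1 - (1 - observed_agreement(items)) / (1 - expected)


# Invented toy data: four sentences, three annotators, binary decision
# "sentence contains a claim" vs. "sentence contains no claim".
items = [
    ("claim", "claim", "claim"),
    ("claim", "claim", "none"),
    ("none", "none", "none"),
    ("none", "claim", "none"),
]

print(observed_agreement(items), multi_pi(items), krippendorff_alpha_nominal(items))
```

The two chance-corrected measures differ only in how expected agreement is estimated, which is why they converge as the number of labeled items grows.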

Argumentative Relations
To evaluate the reliability of the argumentative relations, we used the set of all pairs between argument components that were possible during the annotation task according to our annotation scheme, i.e. all pairs between a claim and a premise and between two premises. In total, the markables include 4,792 pairs of which 7.41% are annotated as support relations and 0.35% as attack relations. 92.24% of the possible pairs were not identified by an annotator. Since the number of attack relations is so small, we decided to focus on the support relations, distinguishing only between the two types support and nonsupport. We obtained an IAA of 94.13% for both support and non-support relations. The corresponding multi-π and Krippendorff's α scores both amount to 49.03% (see Table 2). Therefore, we conclude that argumentative relations, too, can be reliably annotated in student-generated argumentative peer reviews.
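The construction of the markables can be illustrated as follows. This is a hedged sketch rather than the procedure used in the study: we assume here that pairs are ordered, with a premise as the source and a claim or another premise as the target, and the component ids and labels are hypothetical.

```python
from collections import Counter
from itertools import permutations


def candidate_pairs(components):
    """All ordered (source, target) pairs whose source is a premise.

    `components` maps a component id to "claim" or "premise". The target
    may be a claim or another premise, so claim-claim pairs are excluded.
    """
    return [(src, tgt) for src, tgt in permutations(components, 2)
            if components[src] == "premise"]


def label_shares(pairs, annotated):
    """Share of pairs per label; pairs without an annotation count as "none"."""
    counts = Counter(annotated.get(pair, "none") for pair in pairs)
    return {label: n / len(pairs) for label, n in counts.items()}


# Hypothetical review with two claims and three premises.
components = {"C1": "claim", "C2": "claim",
              "P1": "premise", "P2": "premise", "P3": "premise"}
annotated = {("P1", "C1"): "support",
             ("P2", "P1"): "support",
             ("P3", "C1"): "attack"}

pairs = candidate_pairs(components)
print(len(pairs), label_shares(pairs, annotated))
```

Because the number of candidate pairs grows quadratically with the number of components while each premise typically relates to a single target, most pairs carry no relation, which matches the strongly skewed distribution reported above.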

Confusion Probability Matrices
To analyze the disagreement between the three annotators, we created confusion probability matrices (CPM) (Cinková et al., 2012) for argument components and argumentative relations. A CPM contains the conditional probabilities that an annotator assigns a certain category (column) given that another annotator has chosen the category in the row for a specific item. In contrast to traditional confusion matrices, a CPM also allows for the evaluation of confusions if more than two annotators are involved in an annotation study (Stab and Gurevych, 2014a).
While there is a broad agreement between the annotators in distinguishing non-argumentative discourse units from argument components, the major disagreement is between claims and premises (Table 5). This result is in accordance with the findings in Stab and Gurevych (2014a). This could be expected since a claim can also serve as premise for another claim, and it is difficult to distinguish these two concepts in the presence of reasoning chains. For instance, examples (1-3) establish a reasoning chain in which (1) is supported by (2) and (2) is supported by (3):
(1) "The client would have to wait twice as long as usual for his parcel."
(2) "LivePrice orders the products from the provider that offers them at the best price and packs them."
(3) "In the Business Model Canvas it is described that LivePrice packs the products itself and ships them."
Considering the structure of this argumentation, (1) can be classified as a claim. However, if (1) is omitted, (2) becomes a claim that is supported by (3). Thus, the distinction between claims and premises depends not only on the context and the intention of the author but also on the structure of a specific argument (Stab and Gurevych, 2014a).
The CPM for argumentative relations (see Table 6) reveals that there is a rather high confusion between support relations and pairs for which no relation was annotated. This result is again in line with the findings of Stab and Gurevych (2014a). In sum, according to our error analysis, the annotation of argumentative relations yields less reliable results than that of argument components.
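A CPM as defined above can be computed by counting, for every item, all ordered pairs of distinct annotators and normalizing each row. The sketch below follows that definition; it is our own minimal illustration with invented toy labels, not the implementation used for the reported matrices.

```python
from collections import defaultdict
from itertools import permutations


def confusion_probability_matrix(items):
    """Estimate P(column label | row label) over ordered annotator pairs.

    `items` is a list of label tuples, one label per annotator. Each row of
    the returned nested dict sums to 1.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for labels in items:
        # permutations works on positions, so two annotators who chose the
        # same label still form two ordered pairs.
        for row_label, col_label in permutations(labels, 2):
            counts[row_label][col_label] += 1
    cpm = {}
    for row_label, cols in counts.items():
        row_total = sum(cols.values())
        cpm[row_label] = {col: n / row_total for col, n in cols.items()}
    return cpm


# Invented toy data: four sentences labeled by three annotators.
items = [
    ("claim", "claim", "premise"),
    ("premise", "premise", "premise"),
    ("none", "none", "none"),
    ("claim", "claim", "claim"),
]

cpm = confusion_probability_matrix(items)
print(cpm["claim"])
```

Unlike a conventional confusion matrix, this construction needs no designated reference annotator, which is what makes it applicable to studies with more than two annotators.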

Manual Analysis
We manually analyzed the most common differences in the annotated text spans between the annotators on a random sample of 10 documents that are composed of 292 sentences. Our findings are as follows: while some annotators included clausal conjunctions, such as "because", "since" or "as", in their annotations of premises, others did not mark them up. Moreover, in some cases phrases with verbs of reported speech or cognition, such as "I think that" or "I believe that", were included in the annotations of argument components, whereas in other cases this phrasal type was ignored in the annotations. The same holds for introductory phrases at the beginning of sentences or clauses, e.g., "however", "moreover", "for example". In addition, some annotators included punctuation marks denoting the end of sentences or clauses in their annotated spans, while others did not. Furthermore, we encountered a larger number of cases where annotators mistakenly missed the first or last character(s) of a word starting or ending, respectively, the annotated span. Often, the annotations were not even consistent within a single annotator.
Tables 3, 4, 5 and 6 present some statistics of the final corpus. It consists of 1,000 student-written peer reviews in German that are composed of 20,125 sentences with 246,980 tokens in total. Hence, on average, each document has 20 sentences and 247 tokens. A total of 7,996 claims (31.64%) and 8,479 premises (33.55%) were annotated. 8,796 textual spans (34.81%) were not identified as an argument component ("None"). On top of that, 8,291 support (97.03%) and 254 attack (2.97%) relationships were marked up by the annotators. The small percentage of attack relations can be explained by the domain of our annotated texts. The nature of peer reviews is to provide feedback about a certain topic by highlighting strengths and weaknesses and suggesting improvements. Students usually used premises to support their claims; in fact, only about 3% of the claims were further elaborated by discussing them more controversially through attacking premises.
Table 6 presents the distribution of support relationships in our corpus. At 46.30%, the largest share of claims is not supported by any premise. 39.07% of the claims are backed up by exactly one premise, while 10.93% are supported by two premises. Patterns with more than two supporting premises rarely occur in our dataset. The majority of premises is not backed up by another premise (75.07%). However, in 21.08% of the cases, there is one supporting premise. In that way, more complex reasoning chains are established.
Table 6: Distribution of patterns of support relationships in the created corpus. The columns denote the percentage of claims and premises, respectively, that are supported by the specified number of premises.
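The relative frequencies and per-document averages can be re-derived from the absolute counts reported in this section. The short sketch below does so; the variable and function names are our own and serve only as a worked arithmetic check.

```python
# Absolute counts as reported for the final corpus.
DOCUMENTS = 1_000
SENTENCES = 20_125
TOKENS = 246_980
COMPONENTS = {"claim": 7_996, "premise": 8_479, "none": 8_796}
RELATIONS = {"support": 8_291, "attack": 254}


def shares(counts):
    """Percentage of each category relative to the sum of all categories."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


component_shares = shares(COMPONENTS)
relation_shares = shares(RELATIONS)
sentences_per_document = SENTENCES / DOCUMENTS
tokens_per_document = TOKENS / DOCUMENTS

print(component_shares)
print(relation_shares)
print(round(sentences_per_document), round(tokens_per_document))
```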

Application of the Corpus
After constructing and analysing our corpus, we leveraged the novel data to train a machine learning model. Our objective was to embed a classification algorithm in the backend of a user-centered adaptive writing support system to provide students with formative argumentation feedback in the writing process. Therefore, we trained and tuned a model with different text features (Stab and Gurevych, 2014a;Fromm et al., 2019). We performed a multiclass classification on the sentence level to detect the argument components and their relations. For argument component classification, we found that a Support Vector Machine (SVM) achieved the best results, with an accuracy of 65.4% on the test set. Regarding the argumentative relation classification, a binary classification task, an SVM again achieved the best results on our corpus, obtaining an accuracy of 72.1% on the test set. A detailed description of the features and the embedding of our corpus in a user-centered adaptive writing support tool, as well as its effect on students' argumentation writing skills can be found in Wambsganss et al. (2020b).

Conclusion
We introduce an argumentation annotation scheme as well as an annotated corpus of persuasive student-written peer reviews extracted from a real-world learning scenario, which can be leveraged to provide students with feedback on the quality of their argumentation (e.g., by highlighting insufficiently backed claims) through a writing support system. Regarding the educational domain, previously presented argument annotation schemes and argumentation corpora fall short in several respects: they are not derived from a modern learning scenario, they do not follow a systematic methodology based on detailed inter-rater agreement studies or they do not include annotations of argumentative relations. To overcome these limitations, we present a corpus of 1,000 student-written peer reviews that are annotated for argument components and their relations. Our contribution is threefold: 1) we derive an annotation scheme for a new data domain for AM based on argumentation theory and previous work on annotation schemes for persuasive student essays (Stab and Gurevych, 2014b;Stab and Gurevych, 2017a), 2) we present an annotation study based on 50 persuasive peer reviews that were annotated by three native German speakers, demonstrating that the annotation of student-generated peer reviews is reliably possible, and 3) we present our final and freely available corpus of 1,000 student peer reviews in German collected in our lecture about business innovation. We hope to encourage fellow researchers to leverage our annotation scheme and argumentation corpus to design and develop writing support systems for students in large-scale learning scenarios.