Transferring Knowledge from Discourse to Arguments: A Case Study with Scientific Abstracts

In this work we propose to leverage resources available with discourse-level annotations to facilitate the identification of argumentative components and relations in scientific texts, which has been recognized as a particularly challenging task. In particular, we implement and evaluate a transfer learning approach in which contextualized representations learned from discourse parsing tasks are used as input of argument mining models. As a pilot application, we explore the feasibility of using automatically identified argumentative components and relations to predict the acceptance of papers in computer science venues. In order to conduct our experiments, we propose an annotation scheme for argumentative units and relations and use it to enrich an existing corpus with an argumentation layer.


Introduction
The growing number of scientific publications and the shortening of research-publication cycles (Bornmann and Mutz, 2015) pose a challenge to authors, reviewers and editors. The development of automatic systems to support the quality assessment of scientific texts can facilitate the work of editors and referees of scientific publications and, at the same time, help researchers obtain feedback that can improve the communication of their results.
The quality assessment of scientific texts has many dimensions, and each one involves a different level of difficulty. While the relevance of the problem at stake and the novelty of the solutions proposed by the authors are of great significance when weighing the ultimate contributions of the work, aspects such as the argumentative structure of the text are key when analyzing its effectiveness with respect to its communication objectives (Walton and Walton, 1989). A fine-grained assessment of the contributions made in research articles requires identifying the main claims made by the authors and determining whether the evidence provided to support them is strong enough. In other terms, it requires determining whether both the structure and the contents of the arguments proposed by the authors can persuade a potential reader of the validity of their contributions.
In addition to being useful for facilitating the assessment of some quality aspects of a text, the automatic identification of argumentative units and their relations (a set of related tasks known as argument mining) is a relevant problem in itself in the context of knowledge mining (Mochales and Moens, 2011). Being able to extract not only what is being stated by the authors of a text but also the reasons they provide to support it can be useful in multiple applications, ranging from a fine-grained analysis of opinions to the generation of abstractive summaries of texts. As an example of a potential application for argument mining, Lippi and Torroni (2016) suggest the possibility of developing an argumentative ranking component in a search engine so that it retrieves documents based on claims and evidence on a given topic extracted automatically from texts.
The tasks involved in the extraction of arguments from text (including the identification of argumentative sentences, the detection of argument component boundaries and the prediction of argument structures) are related to other text mining tasks (including sequence labeling, text segmentation, entity recognition and relation extraction), which are in general tackled by means of supervised learning methods (Lippi and Torroni, 2016). The lack of annotated data with argumentative information, however, presents a challenge when trying to apply these well-known approaches to argument mining (Stab and Gurevych, 2017). This is so, in part, due to the inherent difficulty of unambiguously identifying argumentative elements in texts, which is reflected in the low levels of inter-annotator agreement generally reached for this task (Habernal et al., 2014). If this is true in several knowledge domains, it poses an even more difficult problem in the case of scientific texts due to their inherent argumentative complexity (Kirschner et al., 2015; Green, 2015). We propose to address this challenge by leveraging data annotated with discourse relations, as previous works suggest potential benefits in linking discourse analysis and argument mining tasks (Stab et al., 2014; Cabrio et al., 2013; Biran and Rambow, 2011; Green, 2015).

Contributions
• We propose to tackle the limitations posed by the lack of annotated data for argument mining in the scientific domain by leveraging existing Rhetorical Structure Theory (RST) (Mann et al., 1992) annotations in a corpus of computational linguistics abstracts (SciDTB) (Yang and Li, 2018). In order to do so:
  1. We propose and test an annotation scheme that we use to conduct a pilot annotation experiment in which we enrich a subset of the SciDTB corpus with an additional layer of argumentative structures.
  2. We explore the potential of a transfer learning approach to improve the performance of an argument mining model trained with a small volume of data annotated with the proposed scheme.
• We report preliminary results on the prediction of acceptance or rejection of scientific papers in computer science conferences based on the automatic identification of argumentative components and relations in their abstracts.
In this work we adopt a pragmatic perspective in relation to exploring the predictive potential of the argumentative structure of an abstract for the acceptance or rejection of the corresponding manuscript in a peer review process. We do not intend to imply that the ultimate quality of the papers, or even the abstracts, could be determined solely by considering this limited information.
The rest of the paper is organized as follows: in Section 2 we describe previous work, focusing, in particular, on works aimed at identifying arguments in scientific texts. In Section 3 we describe the dataset used in our experiments and our proposed annotation scheme for fine-grained scientific argument mining. In Section 4 we describe our transfer learning experiments, their experimental settings and results and, in Section 5, we do the same with the experiments aimed at predicting the acceptance or rejection of papers in conferences. Finally, in Section 6, we summarize our main contributions and propose additional research avenues as follow-up to the current work.

Related work
This work is informed by previous research in the areas of argument mining, argumentation quality assessment and the relationship between discourse and argumentative structures and, from the methodological perspective, by transfer learning approaches. Due to space restrictions, we cannot cover in detail all the relevant background work. We refer the reader to Lippi and Torroni (2016) for a thorough summary of argument mining initiatives in various domains and with different approaches. Wachsmuth et al. (2017) provide a comprehensive survey of quality assessment approaches in the context of computational argumentation and categorize them in relation to how they address logical, rhetorical and dialectical dimensions of argumentation. Pan and Yang (2010) provide an in-depth review of current trends in transfer learning, including inductive, transductive and unsupervised approaches. Furthermore, they classify the different approaches based on what is transferred: instances, feature representations, parameters or relational knowledge. A more direct antecedent to our work is the research conducted by Peldszus and Stede (2016, 2015a), who annotated 112 argumentatively rich texts using RST and argumentation schemes in order to study the relationship between discourse and argumentation structures. The texts were generated in an experiment in which several participants wrote short texts of controlled linguistic and rhetorical complexity discussing a controversial issue from a pre-defined list. Based on this corpus, the authors conducted experiments in order to derive argumentative components and relations from RST trees, comparing three approaches: a transformation model, an aligner based on sub-graph matching and an evidence graph model (Peldszus and Stede, 2015b).
Our work is one of the few that deal with argument mining in scientific texts which, as mentioned in Section 1, is considered a particularly challenging domain (Kirschner et al., 2015; Green, 2015). Stab et al. (2014) and Kirschner et al. (2015) carried out annotation studies with scientific articles in educational research with binary argumentative and discourse relations (support, attack, detail, and sequence). In order to calculate the agreement between the annotators that participated in the process they developed a novel graph-based agreement measure, which can identify different annotations with similar meaning, thus obtaining higher agreement than with standard measures. The evaluation of argument annotations is still an open issue. Stab et al. (2014) suggest that it might be interesting to explore, for this task, evaluation schemes that are able to deal with multiple correct annotations such as those used in text summarization. Lauscher et al. (2018b) analyze the information shared by the rhetorical and argumentative structures of scientific documents. In order to do this, they add an argumentation layer to the DrInventor Scientific Corpus (Fisas et al., 2016), which includes 40 computer graphics papers annotated with four layers including citation contexts, rhetorical role of sentences, subjective information and summarization relevance. The enriched corpus is used to train new models for the automatic identification of claims and evidence, which are made available as a web service (Lauscher et al., 2018a). Some of the first initiatives aimed at the automatic identification of rhetorical and argumentative components in scientific texts include the Argumentative Zoning (AZ) model (Teufel et al., 1999, 2009) and the CoreSC scheme (Liakata et al., 2012).
While AZ considers annotations for knowledge claims made by the authors of scientific articles, CoreSC associates research components to the parts of the texts describing them, thus obtaining a readable representation of the research process described by the paper. Both of them are sentence-based schemes that are focused on the identification of the components and do not consider the relations between them. Feltrim et al. (2006) adapted the AZ model for the automatic annotation of scientific abstracts in Portuguese (AZPort). The AZPort model was integrated as a module of SciPo (http://www.nilc.icmc.usp.br/scipo/), a web-based tool aimed at supporting novice writers of academic texts: given an abstract, the system classifies its sentences by means of AZPort and, based on a set of rules for well-formed rhetorical structures, it provides feedback for potential improvements (e.g., re-ordering the elements of the text or adding missing content). More recently, Vargas-Campos and Alva-Manchego (2016) adapted the AZPort model to Spanish (AZEsp), which was also integrated into a computer-assisted writing tool for computer science dissertations in Spanish (Sci-Esp).

Annotated data
In order to explore the possibility of leveraging discourse information for the identification of argumentative components and relations we add a new annotation layer to the Discourse Dependency TreeBank for Scientific Abstracts (SciDTB) (Yang and Li, 2018). SciDTB contains 798 abstracts from the ACL Anthology (Radev et al., 2013) annotated with elementary discourse units (EDUs) and relations from the RST Framework. Polynary discourse relations in RST are binarized in SciDTB following a criterion similar to the "right-heavy" transformation used in other works that represent discourse structures as dependency trees (Morey et al., 2017; Li et al., 2014).
We consider a subset of the SciDTB corpus consisting of 60 abstracts from the Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) and transform them into a format suitable for the GraPAT graph annotation tool (Sonntag and Stede, 2014), which had been previously tailored to the specificities of our proposed annotation scheme, described in Section 3.1.
The corpus enriched with the argumentation level contains a total of 327 sentences, 8012 tokens, 862 discourse units and 352 argumentative units linked by 292 argumentative relations.

Annotation scheme
Several argumentation mining works (Lippi and Torroni, 2016) use claims and premises as basic argumentative units. In the case of scientific discourse, however, claims are frequently not stated explicitly in an argumentative writing style but are instead left implicit (Hyland, 1998). The description of the problem addressed in the paper, for instance, usually conveys implicit claims in relation to the relevance of the problem at stake and/or the adequacy of the proposed approach. We introduce a fine-grained annotation scheme aimed at capturing information that accounts for the specificities of the scientific discourse, including the type of evidence that is offered to support a statement (e.g., background information, experimental data or interpretation of results). This can provide relevant information, for instance, to assess the argumentative strength of a text. The types of units proposed in our scheme were chosen so that they can be mapped, even if with a different level of granularity, to concepts in CoreSC (Liakata et al., 2010) and AZ categories, which would enable additional research on the potential of using existing annotated corpora for argument mining tasks. In contrast with CoreSC and AZ, we consider EDUs as the minimal spans that can be annotated. Argumentative units can, in turn, cover multiple sentences.
The proposed units include:
• proposal (problem or approach)
• assertion (conclusion or known fact)
• result (interpretation of data)
• observation (data)
• means (implementation)
• description (definitions/other information)
In line with (Kirschner et al., 2015), we adopt in our annotation scheme the classic support and attack argumentative relations and the two discourse relations detail and sequence. Figure 1 shows a subset of the argumentative components and relations annotated in an abstract from (Zhang and Wang, 2014) (http://aclweb.org/anthology/D14-1033), including a proposal and two supporting units: an assertion and a result. Figure 2 shows the original discourse units and relations as annotated in SciDTB.
In the subset of SciDTB annotated for our experiments, the types of argumentative units are distributed as follows: 31% of the units are of type proposal, 25% assertion, 21% result, 18% means, 3% observation, and 2% description. In turn, the relations are distributed as follows: 45% of type detail, 42% support, 9% additional, and 4% sequence. No attack relations were identified in the set of currently annotated texts. When considering the distance of the units to their parent unit in the argumentation tree, we observe that the majority (57%) are linked to a unit that occurs right before or after it in the text, while 19% are linked to a unit with a distance of 1 unit in-between, 12% to a unit with a distance of 2 units, 6% to a unit with a distance of 3, and 6% to a unit with a distance of 4 or more.

Transfer learning experiment
The first set of experiments, described in this section, is aimed at exploring the potential of applying a transfer learning method to improve the performance of argument mining models trained with a small corpus of 60 abstracts by leveraging the discourse annotations available in the full SciDTB corpus.

Tasks
We define the following set of argument mining tasks:
• AFu (argumentative function): Identify the boundaries and argumentative functions of the components. In the example in Fig. 1, this implies identifying the boundaries of the three nodes and the two support relations that link them.
• ATy (argumentative unit): Identify the boundaries and types of the components. In the example, the proposal, assertion and result units.
• APa (argumentative attachment): Identify the boundaries of the components and the relative position of the parent argumentative unit. For instance, the assertion unit in Fig. 1 is attached to the proposal unit with a relative distance of one unit in the forward direction (as the assertion occurs right before the proposal in the text). The result unit, in turn, is attached to the proposal with a distance of four units in the backward direction (the units that occur between these two nodes are omitted in the figure).
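To make the three task definitions concrete, the following minimal sketch derives the AFu, ATy and APa labels for the three units of the Fig. 1 example. The `Unit` class, its field names, the "root" label and the signed-offset convention (positive when the parent occurs later in the text) are illustrative assumptions rather than the data format used in our experiments, and the unit positions are chosen only to reproduce the distances described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Unit:
    index: int                # position of the unit in the text (0-based)
    unit_type: str            # e.g. "proposal", "assertion", "result"
    function: Optional[str]   # relation to the parent, None for the root
    parent: Optional[int]     # index of the parent unit, None for the root


def task_labels(unit: Unit) -> dict:
    """Return the label of one unit under each of the three tasks."""
    if unit.parent is None:
        attachment = "root"  # assumed label for the unit without a parent
    else:
        # Positive offset: the parent occurs later in the text (forward);
        # negative offset: the parent occurs earlier (backward).
        attachment = f"{unit.parent - unit.index:+d}"
    return {
        "AFu": unit.function or "root",
        "ATy": unit.unit_type,
        "APa": attachment,
    }


# Hypothetical positions: the assertion immediately precedes the proposal,
# and the result follows the proposal at a distance of four units.
units = [
    Unit(index=0, unit_type="assertion", function="support", parent=1),
    Unit(index=1, unit_type="proposal", function=None, parent=None),
    Unit(index=5, unit_type="result", function="support", parent=1),
]

labels = [task_labels(u) for u in units]
```

Under this convention the same unit boundaries are shared by the three tasks, and only the label attached to each unit changes.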

Experimental setups
We train each of the tasks described in Section 4.1 separately and compare the results with those obtained by an inductive transfer learning method in which we use encoders trained with the RST annotations available in the SciDTB corpus. These encoders are then used to produce contextualized representations of the input tokens that are fed to the argument mining learning processes. The discourse parsing tasks considered to train the specialized encoders are:
• DFu (discourse function): Identify the boundaries and discourse roles of the EDUs (attribution, evaluation, progression, etc.).
• DPa (discourse attachment): Identify the boundaries of the EDUs and the relative position of the parent units in the RST tree.
The discourse tasks (DFu and DPa) are trained with the 738 abstracts left in the SciDTB corpus when excluding the 60 abstracts annotated with arguments. This is done in order to avoid introducing a bias that would not reflect the results obtained when no discourse annotations are available.
All the argument mining models (AFu, ATy, APa) are trained and evaluated in a 10-fold cross-validation setting.
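As an illustration of this setting, the sketch below partitions 60 abstract identifiers into 10 folds. It assumes that the split is made at the level of whole abstracts and uses a fixed random seed; both choices, as well as the function name `k_fold_splits`, are assumptions made for the example.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) lists for k-fold cross-validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every k-th item starting at offset i, so fold sizes
    # differ by at most one.
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


abstract_ids = list(range(60))  # placeholder identifiers for the 60 abstracts
splits = list(k_fold_splits(abstract_ids))
```

With 60 abstracts, each fold trains on 54 abstracts and evaluates on the remaining 6.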
In all cases the models are generated by means of bi-directional long short-term memory (BiLSTM) networks, as this type of architecture has proven to perform reasonably well in argument mining tasks across different classification scenarios (Eger et al., 2017). In order to simplify the experiments and the interpretation of their results we use the same architecture for all tasks: two layers of 100 recurrent units, Adam optimizer, naive dropout probability of 0.25 and a conditional random fields (CRF) classifier as the last layer of the network. We use, for the BiLSTMs, the implementation made available by the Ubiquitous Knowledge Processing Lab of the Technische Universität Darmstadt (Reimers and Gurevych, 2017). As our intention is to compare the different approaches and not necessarily to obtain the best possible models for these tasks, no hyperparameter optimization is done in these experiments and, in all of the cases, the networks are trained for 100 epochs.
All of the tasks are modeled as sequence labeling problems in which the tokens are tagged using the beginning-inside-outside (BIO) tagging scheme. The tokens are encoded as the concatenation of 300-dimensional dependency-based word embeddings (DEmb) (k) (Komninos and Manandhar, 2016) and 1024-dimensional contextualized word embeddings (ELMo) (e) (Peters et al., 2018). In these experiments we use the 5.5 billion-token version of ELMo trained with Wikipedia and monolingual news from WMT 2008-2012. For the experiments with the RST encoders we include the 200-dimensional embeddings obtained from the concatenation of the backward and forward hidden states of the top layers of the DFu or DPa models (RSTEnc) (f and p, respectively). Table 1 summarizes the sets of embeddings used in these experiments and their dimensions.
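The BIO encoding can be sketched as follows. The helper `to_bio`, the example sentence and its span annotation are hypothetical and serve only to show how labelled spans become one tag per token.

```python
def to_bio(n_tokens, spans):
    """Convert labelled token spans into BIO tags.

    spans: list of (start, end, label) with `end` exclusive; spans are
    assumed not to overlap.
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Invented example text, not taken from the corpus.
tokens = ["We", "propose", "a", "parser", ".", "It", "improves", "accuracy", "."]
# Hypothetical ATy annotation: one proposal followed by one result.
spans = [(0, 5, "proposal"), (5, 9, "result")]
tags = to_bio(len(tokens), spans)
```

The first token of each unit receives a `B-` tag and the remaining tokens an `I-` tag, so that unit boundaries and labels can be recovered from the tag sequence alone.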
Each argument mining task is paired with one discourse parsing task for the transfer learning experiments. While AFu and ATy are paired with DFu, APa is paired with DPa. This means that the input for the AFu and ATy tasks is obtained as the concatenation of the vectors [k, e, f], while in the case of APa the input is [k, e, p].
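The resulting input sizes follow directly from the dimensions given above, as the short calculation below shows. The dictionary and function names are illustrative assumptions; only the numbers come from the description of the embeddings and of the network architecture.

```python
# Dimensions taken from the description above.
DIMS = {"k": 300, "e": 1024, "f": 200, "p": 200}

# Discourse encoder paired with each argument mining task.
PAIRING = {"AFu": "f", "ATy": "f", "APa": "p"}


def input_dim(task, use_rst_encoder=True):
    """Size of the concatenated token representation for a task."""
    parts = ["k", "e"]
    if use_rst_encoder:
        parts.append(PAIRING[task])
    return sum(DIMS[name] for name in parts)


# A BiLSTM layer with 100 units per direction yields 2 * 100 values per
# token, which matches the 200-dimensional RSTEnc vectors.
rst_enc_dim = 2 * 100
```

Each token is therefore represented by 1324 values in the DEmb+ELMo setting and by 1524 values when the RST encoder output is appended.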

Abbreviation  Notation  Dimensions
DEmb          k         300
ELMo          e         1024
RSTEnc        f, p      200

Table 1: Sets of embeddings and their dimensions

Results
We adopt the CoNLL criteria for named-entity recognition to evaluate the performances obtained in the identification of argumentative components and relations: a true positive is considered when both the boundary and the type of the entity match. Table 2 shows the average F1-measures obtained for each of the settings considering the epochs 10 to 100 (the epochs before the 10th are not significant, as the models have not yet had enough time to learn). The argument mining models trained with the representations produced by the RST encoders (DEmb+ELMo+RSTEnc) yield better performances, with gains of 0.03, 0.04 and 0.02 F1 points for AFu, ATy and APa, respectively, over the models trained solely with the dependency-based and ELMo (https://allennlp.org/elmo) embeddings (DEmb+ELMo).
[Table 2: Average F1-measures in epochs 10-100; columns: Setting, AFu, ATy, APa]
In order to determine whether the better performance of the RST encoders is due to the knowledge conveyed by the task-specific representations we conducted an additional experiment in which we concatenated 200-dimensional GloVe embeddings (Pennington et al., 2014) (g), obtaining 1524-dimensional embeddings [k, e, g] used as input of each of the argument mining models. In this case, the results obtained are mixed with respect to the DEmb+ELMo models: on average, for the epochs 10 to 100, performance increases by 0.02 F1 points for ATy, decreases by 0.01 F1 points for AFu and does not change for APa. The models with the GloVe embeddings (DEmb+ELMo+GloVe) are, therefore, on average 0.04, 0.02 and 0.02 F1 points worse for AFu, ATy and APa, respectively, than the models that include the embeddings obtained by means of the RST encoders.
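The exact-match criterion described above can be sketched as follows: a predicted unit counts as a true positive only if its boundaries and its label both coincide with a gold unit. This is a simplified illustration and not the evaluation script used in our experiments; in particular, `I-` tags that do not continue an open span are simply ignored here, which is an assumption of the sketch.

```python
def extract_spans(tags):
    """Return the set of (start, end, label) spans encoded by BIO tags."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes last span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def span_f1(gold_tags, pred_tags):
    """F1 over spans that match exactly in boundaries and label."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical tag sequences: the second predicted unit has the right
# label but the wrong right boundary, so it is not a true positive.
gold = ["B-proposal", "I-proposal", "B-result", "I-result", "O"]
pred = ["B-proposal", "I-proposal", "B-result", "O", "O"]
score = span_f1(gold, pred)
```

In this example one of the two predicted units matches a gold unit, so precision, recall and F1 are all 0.5.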
Figures 3, 4 and 5 show the trend lines of the F1-measures obtained with the different models for the epochs 10 to 100 for the AFu, ATy and APa tasks, respectively. The graphs show that the models with information from the RST encoders not only learn the argument mining tasks better but also do so in less time than the other settings.
These results support our initial hypothesis that transferring discourse knowledge by means of representations learned in discourse parsing tasks can contribute to improving the performance of argument mining models trained with a rather small number of instances.

Acceptance prediction experiment
As a pilot application we explore the possibility of predicting the acceptance/rejection of papers in computer science conferences based on the annotations generated by the best argument mining models of the experiments described in Section 4.
Quality assessment metrics that consider elements such as clarity and simplicity, lack of redundancy and comprehensiveness of scientific reporting have been developed for abstracts in other domains, in particular in life sciences (Timmer et al., 2003). These instruments were used in studies that show that abstracts with higher formal quality scores, as measured by human experts, are more frequently accepted for presentations in conferences (Timmer et al., 2001). We do not believe that these results can be directly extrapolated to the quality assessment of scientific abstracts in computer science, an area in which full manuscripts are most frequently considered for review and where abstracts have less fixed structures. Furthermore, clearer links between the formal quality of scientific reporting and the overall quality of research in computer science still need to be established. Considering all these limitations, we were interested in exploring whether the automatically identified argumentative structure of the abstracts could reflect some quality aspects of the full manuscripts and if this, in turn, could contribute to predicting their acceptance in conferences in a specific research area in the field of computer science.

Dataset
As training set for the acceptance prediction experiment we use 117 abstracts of manuscripts submitted to the Compact Deep Neural Network Representation with Industrial Applications (CDNNRIA) and the Interpretability and Robustness for Audio, Speech and Language (IRASL) workshops held in the context of the Thirty-second Conference on Neural Information Processing Systems (NIPS 2018). As test set we use 30 abstracts of manuscripts submitted to the Sixth International Conference on Learning Representations (ICLR 2018). All of the abstracts were collected from the OpenReview website (Soergel et al., 2013). The distribution of accepted/rejected papers in the training and test sets is shown in Table 3.

Set    Conference  Accepted  Rejected
Train  CDNNRIA     35        23
Train  IRASL       30        29
Train  Total       65        52
Test   ICLR        15        15

Table 3: Accepted/rejected papers in training and test sets

Experimental setup
The CDNNRIA, IRASL and ICLR abstracts are used as input to the AFu, ATy and APa models described in Section 4, obtaining sequences of argumentative units, types and parent attachments. These sequences are then used as features to train and evaluate a binary classifier aimed at predicting the acceptance or rejection of the corresponding papers.
Considering that we are dealing with a small set of features with a reduced number of potential values for each one, we use a decision tree algorithm for our pilot classification experiment. In addition to the training and evaluation speed of the algorithm, we consider that the higher interpretability of the results, obtained by examining the decision points, can also contribute to assessing to what degree the different elements of the predicted argumentative structure are used in the classification. We use Weka's implementation of the C4.5 algorithm (Quinlan, 1993) (J48) with default parameters with the exception of the confidence factor used for pruning the tree, which was selected evaluating the different models obtained against a random split of 20% of the test set used for validation (weka.classifiers.trees.J48 -C 0.6 -M 2). As the training set is not perfectly balanced, we pre-process the data with Weka's ClassBalancer algorithm, which assigns weights to each instance so that each class has the same total weight.
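The two pre-processing ideas above can be sketched in a few lines. The sketch assumes that the label sequences are truncated or padded to a fixed length before being used as features, and it re-implements the reweighting principle described above instead of calling Weka; the function names and the padding value are assumptions. The class counts are taken from the training rows of Table 3.

```python
from collections import Counter


def pad(sequence, length, pad_value="none"):
    """Truncate or right-pad a label sequence to a fixed length."""
    return list(sequence[:length]) + [pad_value] * max(0, length - len(sequence))


def feature_vector(functions, types, attachments, length):
    """Concatenate the three padded label sequences into one feature list."""
    return pad(functions, length) + pad(types, length) + pad(attachments, length)


def class_balancing_weights(labels):
    """Per-class instance weight so that all classes have equal total weight.

    The sum of all instance weights stays equal to the number of instances.
    """
    counts = Counter(labels)
    per_class_total = len(labels) / len(counts)
    return {cls: per_class_total / n for cls, n in counts.items()}


# Training set: CDNNRIA and IRASL rows of Table 3.
labels = ["accepted"] * (35 + 30) + ["rejected"] * (23 + 29)
weights = class_balancing_weights(labels)

# Hypothetical abstract with a single predicted unit, padded to length 2.
example = feature_vector(["support"], ["assertion"], ["+1"], length=2)
```

With these counts, each accepted instance receives a weight of 0.9 and each rejected instance a weight of 1.125, so both classes contribute a total weight of 58.5.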

Results
The classifier trained with the argumentative units and relations extracted from the CDNNRIA/IRASL abstracts has a performance of 0.67 F1-score when evaluated with the test set obtained from processing the ICLR abstracts, 0.17 F1 points above a random binary classification in a balanced set. As expected, the main decision points in the tree correspond, broadly, to those attributes that are also ranked higher when measuring their contribution to reducing the entropy with respect to the class. Observing these features, we can see that the most relevant decision elements are the parent attachment of the first argumentative unit, the argumentative functions of the first two units and the argumentative type of the first unit. Also relevant are the features that mark the end of the sequences of argumentative types and functions for the majority of the instances. This means that the number of identified units also has a relevant role in the predictions. However, the number of units by itself is not a good predictor of the abstract's class. In fact, executing the same experiment but replacing the non-padding values for function, type and attachment with fixed values we obtain an F1-measure of 0.59 due, in particular, to a higher number of false negatives (accepted papers classified as rejected).

Conclusions
In this work we explored the potential of leveraging existing discourse-annotated corpora to improve the performance of fine-grained argument mining models trained with a limited number of examples. In order to test our hypothesis, we proposed an annotation scheme and used it to enrich, with a new layer of argumentative structures, a subset of a corpus previously annotated with discourse-level units and relations. Promising results are obtained by implementing an inductive transfer learning method in which contextualized representations obtained by means of encoders trained with discourse parsing tasks are used as input of argument mining models.
As a potential application of the annotations produced by the argument mining models, we implemented a simple classifier aimed at predicting the potential acceptance/rejection of computer science papers according to the argumentative structure of their abstracts. The results of these preliminary experiments are auspicious and motivate us to continue working in this area. As a first step in this direction, we plan to extend the coverage of the argumentative layer of annotations to the full SciDTB corpus. We expect this to become a valuable resource in argument mining research in scientific texts which, as mentioned, has been identified as a particularly challenging domain.
The obtained results open up several paths for additional research, including the implementation of other transfer learning approaches, such as multitask learning settings (we conducted preliminary experiments in this area with mixed results, so we plan to continue investigating this approach in order to clarify its true potential), as well as other neural architectures, including attention-based architectures, which have proven to achieve good results in argument mining tasks (Stab et al., 2018). As mentioned in Section 3.1, we are also interested in exploring the possibility of leveraging other existing tools and resources to facilitate the automatic identification of argumentative structures and relations, such as corpora annotated with different schemes, including variants of CoreSC and AZ. We also intend to expand our acceptance prediction experiments using the PeerRead dataset (Kang et al., 2018) (https://github.com/allenai/PeerRead), which has a greater coverage than the NIPS and ICLR subsets that we used in our experiments. This dataset contains, in addition to the acceptance/rejection decisions, scores for different aspects of the papers, including substance and clarity, among others, which would allow us to explore in more depth whether the argumentative structure of the abstracts and, potentially, other sections relates to more specific quality aspects of the manuscripts.