Investigating the Role of Argumentation in the Rhetorical Analysis of Scientific Publications with Neural Multi-Task Learning Models

Exponential growth in the number of scientific publications yields the need for effective automatic analysis of rhetorical aspects of scientific writing. Acknowledging the argumentative nature of scientific text, in this work we investigate the link between the argumentative structure of scientific publications and rhetorical aspects such as discourse categories or citation contexts. To this end, we (1) augment a corpus of scientific publications annotated with four layers of rhetoric annotations with argumentation annotations and (2) investigate neural multi-task learning architectures combining argument extraction with a set of rhetorical classification tasks. By coupling rhetorical classifiers with the extraction of argumentative components in a joint multi-task learning setting, we obtain significant performance gains for different rhetorical analysis tasks.


Introduction
Scientific publications, as "tools of persuasion" in research (Gilbert, 1977), are carefully composed documents written to convince the reader of the validity and merit of the researchers' work. As such, they are inherently argumentative and often adhere to well-trodden rhetorical patterns and argumentation schemes of the respective research field. The accelerated growth of scientific literature (Bornmann and Mutz, 2015) makes exploration and analysis of relevant publications increasingly difficult. This yields the need for automatic analyses of these documents, including their argumentative and rhetorical structure.
To allow for the holistic analysis of scientific publications with respect to the interactions between different rhetorical aspects of scientific text, Fisas et al. (2016) created a corpus of scientific publications with manual annotations of several high-level rhetorical aspects of scientific writing (e.g., sentence-level discourse roles), but without annotations of the argumentative structure of the publications. Despite (1) scientific texts being inherently argumentative (Gilbert, 1977), (2) the existence of theoretical argumentative frameworks (Toulmin, 2003; Kirschner et al., 2015), and (3) a wide range of argument extraction models in other domains (e.g., debates or essays; see Palau and Moens (2009); Habernal and Gurevych (2017), inter alia), there is still very little work on automatic argumentation mining from scientific literature. Consequently, there has been no work analyzing associations between argumentation and other rhetorical constructs in scientific writing, although such dependencies exist. Consider the following example: "In general, our OMR preserves the high frequency content of the motion quite well [claim], since inverse rate control is directed by Jacobian values [data]." Here, the authors make a claim about their approach (the segment marked [claim]) and support it with a technical fact about the method (the segment marked [data]). At the same time, regarding other rhetorical constructs, this sentence states the subjective aspect of advantage (of the proposed method), belongs to the discourse category of outcome (of the authors' work), and may be considered relevant for the (extractive) summary of the publication. We argue that these rhetorical dimensions are interconnected and that fine-grained argumentation underpins other rhetorical layers in scientific text.
For example, sentences stating an advantage of a method are likely to be argumentative and may contain claims that should be included in the summary.
Assuming that argumentation guides rhetorics in scientific text, we investigate neural multi-task learning (MTL) models which couple argument extraction with several other rhetorical analysis tasks. To this end, we augment the existing corpus of scientific publications (Fisas et al., 2016), containing several layers of rhetorical annotations, with an additional layer of argumentative components and relations. We then explore two neural MTL architectures based on shared recurrent encoders, intra-sentence attention, and private task-specific classifiers, and couple the neural architectures with a joint MTL objective with uncertainty-based weighting of task-specific losses (Kendall et al., 2018). We validate our approach by showing that it outperforms traditional machine learning models in single-task settings. We finally show that coupling rhetorical analysis tasks with argument extraction using MTL models significantly improves the results for the rhetorical analysis tasks.
Contributions. We create the first corpus of scientific publications in English annotated with fine-grained argumentative structures and carry out the first study on dependencies between different rhetorical dimensions in scientific writing. Using MTL models, we show that argumentation informs other rhetorical analysis tasks. Finally, in the context of MTL research, our results indicate that dynamic uncertainty-based loss weighting (Kendall et al., 2018) is beneficial for high-level natural language processing tasks.

Related Work
We provide an overview of (1) studies analyzing rhetorical aspects in scientific publications and (2) a large body of work on argumentation mining.

Rhetorical Analysis of Scientific Texts
Previous work has analyzed a number of rhetorical aspects of scientific publications. Teufel et al. (1999, 2009) analyzed the discourse structure of scientific publications. They annotated sentences with discourse categories named argumentative zones. Liakata et al. (2010) proposed a more general discourse scheme dubbed core scientific concepts and in subsequent work (Liakata et al., 2012) trained a conditional random fields (CRF) model to assign discourse labels to text spans. Several authors focused on tasks relating to citations: extraction of citation context (e.g., Abu-Jbara et al., 2013; Jha et al., 2017), classification of citation polarity (e.g., Athar, 2011) and purpose (e.g., Teufel et al., 2006; Jochim and Schütze, 2012), and the automatic detection of referenced parts of the cited publication (Jaidka et al., 2017). Both discourse and citation information have been exploited for summarizing scientific publications (Cohan and Goharian, 2015; Teufel and Moens, 2002; Abu-Jbara and Radev, 2011; Chen and Zhuge, 2014; Lauscher et al., 2017a). Intuitively, citation contexts may contain information relevant to the summary. Similarly, summaries commonly contain sentences with diversified discourse properties. Fisas et al. (2016) provided different layers of rhetorical annotations on the same corpus of scientific text. Their Dr. Inventor Corpus is annotated with a combination of existing discourse annotation schemes (Teufel et al., 2009; Liakata et al., 2010) and citation-based annotations. Despite the argumentative nature of scientific texts, the Dr. Inventor Corpus contains no annotations of argumentative components such as claims. Several computational studies followed, addressing the rhetorical tasks corresponding to the layers of the Dr. Inventor Corpus (Saggion, 2015, 2016; Accuosto et al., 2017), but none of them investigated dependencies between different tasks.
The work of Kirschner et al. (2015) is the closest to ours, since they also annotated scientific publications with fine-grained argumentation. However, their corpus is in German and contains no annotations of other rhetorical dimensions. Moreover, their corpus is significantly smaller than the Dr. Inventor Corpus (Fisas et al., 2016). In contrast, we augment the Dr. Inventor Corpus with an argumentation layer, effectively allowing for combinations of argumentation extraction and other rhetorical analysis tasks in MTL settings.

Argumentation Mining
Argumentation mining (AM) refers to extracting (and ideally understanding) arguments from natural language text (Lippi and Torroni, 2015, 2016) and includes tasks like argument detection (Palau and Moens, 2009), argument component identification (Daxenberger et al., 2017), and argument relation classification (Boltužić and Šnajder, 2014). In their pioneering work on automatic AM, Palau and Moens (2009) discriminated argumentative from non-argumentative sentences and proposed a rule-based approach for extracting argumentative structures in documents. Habernal and Gurevych (2016, 2017) extracted argumentative components from online discussions. They framed argumentative component extraction as a sequence labeling task and applied structured SVMs as a learning model.
Recent work started exploiting dependencies between AM tasks using global optimization (Peldszus and Stede, 2015; Persing and Ng, 2016; Stab et al., 2014) and MTL models (Eger et al., 2017; Niculae et al., 2017). Peldszus and Stede (2015) used decoding based on minimum spanning trees to jointly predict argumentative segments and their types as well as argumentative relations, generating an argumentation graph from text. Persing and Ng (2016) and Stab and Gurevych (2017) jointly optimized the predictions of local argument component and relation classifiers using integer linear programming. In contrast to these efforts, which combine several AM subtasks or formalisms with joint optimization and MTL models, in this work we examine the dependencies between argumentative components and other rhetorical aspects of scientific writing.

Data Annotation
We first briefly describe the Dr. Inventor Corpus (Fisas et al., 2016), which we augment with argumentative annotations. We then explain in more detail our argumentation annotation scheme and the annotation process.

Dr. Inventor Corpus
We chose the Dr. Inventor Corpus (Fisas et al., 2015, 2016) for two reasons. First, with 10,789 sentences, it is one of the largest corpora of scientific text manually labeled with rhetorical information. Second, it contains four different layers of rhetorical annotations: (1) a discourse layer, specifying discourse roles of sentences, (2) a citation context layer, specifying the textual context of citations, (3) a layer with subjective aspect categories assigned to sentences, and (4) a summarization relevance layer, indicating how relevant sentences are for the summary. An overview of the labels for all annotation layers, with the distribution of instances across labels, is shown in Table 1. For more details on the original Dr. Inventor Corpus we refer the reader to Fisas et al. (2015, 2016).

Argumentation Annotation Scheme
We considered several existing argumentation frameworks (e.g., Anscombre and Ducrot, 1983; Walton et al., 2008; Dung, 1995, inter alia) and selected Toulmin's model (Toulmin, 2003) as a starting point for our study. We chose Toulmin's model because (1) it is well established in philosophy as well as in computer science (e.g., Freeman, 1991; Bench-Capon, 1998; Verheij, 2009, inter alia) and (2) it takes different types of argumentative components and the relations between them into account, which is useful for fine-grained argumentative analyses.
To test the applicability of the framework for our purposes, we first carried out a small preliminary annotation round with two expert annotators and adjusted the annotation scheme according to their observations.
Argumentative components. We devised an adapted version of the Toulmin model, 1 containing the following argumentative components:
• Background claim: An argumentative statement related to the work of other authors, state-of-the-art methods, or common practices; e.g., "The range of breathtaking realistic 3D models is only limited by the creativity of artists and resolution of devices."
• Own claim: An argumentative statement about the authors' own work, covered by the publication itself; e.g., "Using our method, character authors may use any tool they like to author characters."
• Data: A fact that the authors state as evidence that either supports or contradicts a claim; e.g., "SSD is widely adopted in games, virtual reality, and other realtime applications due to its ease of implementation and low cost of computing."
Argumentative components are annotated as arbitrary spans of text (in terms of length, annotated components ranged from a single token to multiple sentences). Annotators were instructed to annotate the shortest possible span of text that completely captures the argumentative component. Thus, we do not bind arguments to sentences, i.e., we allow for fine-grained argumentative components.
Argumentative relations. Authors connect argumentative components in order to form convincing reasoning chains. To allow for the detection of long argumentation chains, we also annotated relations between argumentative components. Following proposals from previous work (Dung, 1995; Bench-Capon, 1998), we distinguish between three relation types:
• Supports: indicates that a claim component is supported by a data component or another claim. The (assumed) validity of the supporting component (data or claim) contributes to the validity of the supported claim.
• Contradicts: indicates that the validity of a claim decreases with the validity of another argumentative component. If an argumentative component is assumed to be true, the claim it contradicts is assumed to be false, and vice versa.
• Same claim: connects different mentions of what is essentially the same claim. It is common to repeat important claims (e.g., the central claim) of the work several times in the publication (claim coreference).
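To make the scheme concrete, the components and relations above can be sketched as simple data structures. The class names, field names, and the chain-length helper below are illustrative assumptions, not the corpus' actual serialization format:

```python
from dataclasses import dataclass

@dataclass
class Component:
    id: str
    type: str   # "own_claim", "background_claim", or "data"
    start: int  # character offset where the span starts
    end: int    # character offset where the span ends (exclusive)

@dataclass
class Relation:
    source: str  # id of the supporting/contradicting/coreferent component
    target: str  # id of the component at the other end
    type: str    # "supports", "contradicts", or "same_claim"

components = [
    Component("c1", "own_claim", 0, 40),
    Component("c2", "own_claim", 60, 95),
    Component("d1", "data", 100, 130),
]
relations = [
    Relation("d1", "c2", "supports"),  # data supports a claim...
    Relation("c2", "c1", "supports"),  # ...which in turn supports another claim
]

def chain_length(comp_id, rels):
    """Length of the longest support chain ending in comp_id."""
    incoming = [r for r in rels if r.target == comp_id and r.type == "supports"]
    if not incoming:
        return 0
    return 1 + max(chain_length(r.source, rels) for r in incoming)

print(chain_length("c1", relations))  # claim supported via a two-step chain → 2
```

The toy example mirrors the observation above that claims may be supported by other claims, with data only at the start of the chain.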
Further details about the annotation scheme can be found in the annotation guidelines we provided to our annotators. 2

Annotation Procedure and Results
Annotation process. We hired four annotators for the task, one of whom we considered to be an expert annotator 3 and executed the process in two phases. In the first phase, we calibrated the annotators for the task in five iterations, on five publications from the Dr. Inventor Corpus. After all annotators labeled one of the five documents, we met with them, discussed the disagreements, identified erroneous annotations, and, when required, revised the annotation guidelines. At the end of the calibration phase, the annotators re-annotated the five calibration publications and resolved the remaining disagreements by consensus.
In Figure 1 we show the IAA for both component identification and relation classification, in terms of averaged pairwise F1 score, 4 after each of the five calibration iterations. It can be seen that the discussions in the calibration phase helped the annotators reach a common understanding of the task. However, we note that when considering argumentative relations in addition to the components, the agreement decreases. Apart from the increased complexity compared to component identification alone, this is due to the high ambiguity of argumentative structures, which is one of the main challenges in argument mining, as suggested by Stab et al. (2014). Moreover, disagreements in argumentative component identification propagate and cause disagreements in relation annotations, since relation annotations match only when the agreement criterion for the components at both ends is met. Interestingly, the average agreement of our expert annotator with non-expert annotators was similar to the average agreement between non-expert annotators. This is encouraging, because it suggests that annotating argumentative structures in scientific text does not require expert knowledge of the domain. In the second phase, we evenly split the remaining 35 documents of the Dr. Inventor Corpus among the four annotators, without any overlaps.

Figure 1: IAA evolution over calibration phases (blue for argumentative components; green for relations). We report both strict (annotated components match in span and type; relations match in type and components at both ends match strictly) and relaxed agreement scores (components match in type and overlap in span; relations match in type and their components at both ends match according to the relaxed criterion).
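The strict and relaxed component-matching criteria behind the pairwise F1 agreement can be sketched as follows. This is an illustrative reimplementation under our reading of the criteria, not the authors' actual evaluation script:

```python
def f1_agreement(ann_a, ann_b, relaxed=False):
    """Pairwise F1 between two annotators' component sets.

    ann_a, ann_b: lists of (start, end, type) span annotations.
    Strict: spans must be identical; relaxed: spans must merely overlap.
    """
    def matches(x, y):
        if x[2] != y[2]:                       # types must always agree
            return False
        if relaxed:                             # relaxed: spans overlap
            return x[0] < y[1] and y[0] < x[1]
        return x[0] == y[0] and x[1] == y[1]    # strict: identical spans

    if not ann_a or not ann_b:
        return 0.0
    prec = sum(any(matches(a, b) for b in ann_b) for a in ann_a) / len(ann_a)
    rec = sum(any(matches(b, a) for a in ann_a) for b in ann_b) / len(ann_b)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

a = [(0, 10, "own_claim"), (20, 35, "data")]
b = [(0, 12, "own_claim"), (20, 35, "data")]
print(f1_agreement(a, b, relaxed=False))  # only the data span matches → 0.5
print(f1_agreement(a, b, relaxed=True))   # both spans overlap in type+span → 1.0
```

The gap between the two scores in the toy example illustrates why the relaxed criterion yields systematically higher agreement than the strict one.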
The augmented corpus. We make the Dr. Inventor Corpus augmented with argumentation annotations (together with the annotation guidelines) publicly available. 5 The final corpus contains 12,289 annotations of argumentative components and 6,530 relation annotations. We show the distributions of labels in Table 2. Own claims are the most frequent component type, reflecting that authors primarily emphasize the contributions of their own work. There are two main reasons for having a smaller number of data components compared to claims. On the one hand, there are longer argumentative chains in which claims are supported by other claims (i.e., only the first claim is supported by the data component). On the other hand, there is also a non-negligible amount of standalone (i.e., unsupported and unchallenged) claims, implied also by there being fewer annotated relations than claims.
To obtain an initial insight into the interrelations between the different rhetorical aspects of scientific writing, we conduct an information-theoretic analysis and assess the amount of information shared among the annotation layers by computing the normalized mutual information (Strehl and Ghosh, 2003). Normalized mutual information is a variant of mutual information which has been shown to correlate with the gains that can be obtained in multi-task learning settings (Bjerva, 2017). The results are shown in Table 3. The strongest link is observed between argument components and discourse roles, followed by argument components and citation contexts.
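As a sketch of this analysis, normalized mutual information between two parallel label sequences can be computed from scratch as below. The labels in the toy example are hypothetical and do not reflect actual corpus statistics:

```python
from collections import Counter
from math import log

def normalized_mutual_info(x, y):
    """NMI(X;Y) = I(X;Y) / sqrt(H(X) * H(Y)), the normalization of
    Strehl and Ghosh (2003), for two parallel label sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # Mutual information: sum over joint outcomes of p(a,b) * log(p(a,b)/(p(a)p(b)))
    mi = sum(c / n * log((c / n) / (px[a] / n * py[b] / n))
             for (a, b), c in pxy.items())
    hx = -sum(c / n * log(c / n) for c in px.values())  # entropy H(X)
    hy = -sum(c / n * log(c / n) for c in py.values())  # entropy H(Y)
    return mi / (hx * hy) ** 0.5 if hx and hy else 0.0

# Hypothetical sentence-level labels from two annotation layers:
arg = ["claim", "claim", "none", "none", "claim", "none"]
disc = ["outcome", "outcome", "background", "background", "outcome", "background"]
print(normalized_mutual_info(arg, disc))  # perfectly aligned layers → ≈ 1.0
```

Layers that are perfectly predictive of each other score close to 1, independent layers score 0, which is what makes NMI a useful proxy for expected MTL gains.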

Multi-task Learning for Rhetorical Analysis of Scientific Writing
We next exploit the augmented corpus to investigate the dependencies between argumentation and other rhetorical dimensions. To this end, we adopt neural MTL as a methodological framework.

Tasks
The following are the rhetorical analysis and argument extraction tasks we investigate.
Argumentative Component Identification (ACI). The task is to extract and classify argumentative components. We frame ACI as a token-level sequence labeling task: given a sequence of tokens x = (x_1, ..., x_n) of length n, the task is to assign a sequence of tags y_aci = (y_1, ..., y_n), y_i ∈ Y_aci. The tagset Y_aci contains seven token-level tags, obtained by combining the standard B-I-O annotation scheme with the three types of argumentative components: Own claim, Background claim, and Data.
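The derivation of the seven-tag ACI tagset from span annotations can be sketched as follows; the token offsets and label names are illustrative, not the corpus' exact format:

```python
def spans_to_bio(tokens, spans):
    """Convert span annotations to B-I-O tags.

    tokens: list of (start, end) character offsets per token.
    spans:  list of (start, end, type) argumentative components.
    """
    tags = ["O"] * len(tokens)
    for s_start, s_end, s_type in spans:
        # Tokens fully covered by the component span
        inside = [i for i, (t0, t1) in enumerate(tokens)
                  if t0 >= s_start and t1 <= s_end]
        for k, i in enumerate(inside):
            tags[i] = ("B-" if k == 0 else "I-") + s_type
    return tags

tokens = [(0, 3), (4, 9), (10, 15), (16, 20)]
spans = [(4, 15, "own_claim")]
print(spans_to_bio(tokens, spans))
# → ['O', 'B-own_claim', 'I-own_claim', 'O']
```

With three component types, the B- and I- variants plus O give exactly the seven tags in Y_aci.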
Discourse Role Classification (DRC). The multi-class classification task in which each sentence needs to be assigned one out of the set of discourse roles Y drc = {Background, Unspecified, Challenge, FutureWork, Approach, Outcome}.
Citation Context Identification (CCI). The task is to identify the span of the publication text that introduces or explains a reference. It is also a token-level sequence labeling task: a sequence of tags y_cci = (y_1, ..., y_n) with y_i ∈ Y_cci = {B_CC, I_CC, O} is assigned to a sequence of tokens x = (x_1, ..., x_n).
Subjective Aspect Classification (SAC). Another sentence-level classification task in which each sentence is assigned one of the subjective aspect labels, Y sac = {None, Limitation, Advantage, Disadvantage-Advantage, Disadvantage, Common Practice, Novelty, Advantage-Disadvantage}.

Summary Relevance Classification (SRC).
The task is to predict the relevance of a sentence for the (extractive) summary of the publication. Each sentence needs to be assigned one of the labels Y src = {Very relevant, Relevant, May appear, Should not appear, Totally irrelevant}.
ACI and CCI are token-level sequence labeling tasks. The remaining three tasks can be cast as either (1) plain sentence classification tasks or (2) sentence-level sequence labeling tasks (assuming that there are regularities in sequences of sentence-level labels that can be captured). We propose one MTL architecture for each of the two possibilities.

Multi-Task Learning Models
We propose two different MTL architectures for the rhetorical and argumentative analysis of scientific publications. The Simple model treats sentence-level tasks (DRC, SAC, and SRC) as plain classification tasks (i.e., the prediction for each sentence ignores the content and labels of other, neighboring sentences). The Hierarchical model addresses sentence-level tasks as sequence labeling tasks. This model can be seen as a hierarchical sequence labeling model, in which a sentence-level recurrent network is stacked on top of the token-level sequence labeling network. Both architectures are illustrated in Figure 2.
Token-level Predictions. Given a sentence s_i = (x_i1, ..., x_in) out of a sequence of sentences d = (s_1, ..., s_m), we first retrieve the pre-trained embedding vector for each token x_ij. We then obtain context-aware token representations h_ij by applying a bidirectional recurrent network with long short-term memory cells (Bi-LSTM; Hochreiter and Schmidhuber, 1997) over the sequence of token embeddings. This token-level Bi-LSTM encoder is shared between the tasks combined by the MTL models. Next, we define a separate classifier for each of the token-level (TL) tasks (i.e., ACI and CCI) and feed the contextualized token representations h_ij to these classifiers. Each of the classifiers is defined as a feed-forward network with a single hidden layer. The label probability distribution is obtained by applying the softmax function on its output:
ŷ_ij^t = softmax(W_t^⊤ h_ij + b_t),

where W_t ∈ R^(2K×|Y_t|) and b_t ∈ R^(|Y_t|) are the task-specific classification parameters for task t, with K being the size of the LSTM state and |Y_t| the number of discrete labels of task t.
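A minimal numpy sketch of the task-specific output layer illustrates the shapes involved (for brevity it shows only the final projection, omitting the single hidden layer mentioned above; all values are random placeholders, not trained parameters):

```python
import numpy as np

K, n_labels = 128, 7          # LSTM state size; |Y_aci| = 7 B-I-O tags
rng = np.random.default_rng(0)
W_t = rng.normal(scale=0.01, size=(2 * K, n_labels))  # W_t ∈ R^(2K × |Y_t|)
b_t = np.zeros(n_labels)                              # b_t ∈ R^(|Y_t|)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

h_ij = rng.normal(size=(2 * K,))      # a contextualized token representation
probs = softmax(h_ij @ W_t + b_t)     # label distribution for one token
print(probs.shape, round(probs.sum(), 6))  # (7,) 1.0
```

The 2K input dimension comes from concatenating the forward and backward LSTM states.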
Sentence-level Predictions. We learn to aggregate a sentence representation s_i from the contextualized vectors of its tokens, h_ij (produced by the token-level Bi-LSTM), using the intra-sentence attention mechanism (Yang et al., 2016):

s_i = Σ_j α_ij · h_ij,

with the attention weights α_i computed dynamically as:

α_i = softmax(U_i · u_att),

where u_att is the trainable attention head vector and U_i is a matrix with non-linearly transformed token representations (h_ij) as rows:

[U_i]_j = tanh(W_att · h_ij + b_att).

In the Simple architecture, sentence representations s_i are fed directly to the sentence-level task-specific classifiers, which are also feed-forward networks with a single hidden layer:

ŷ_i^t = softmax(W_t · s_i + b_t).

Within the Hierarchical architecture, sentence representations are first contextualized with representations of other sentences via the sentence-level Bi-LSTM layer (denoted with the function Bi-LSTM_S) and then forwarded to the classifier:

ŷ_i^t = softmax(W_t · Bi-LSTM_S(s_i) + b_t).

Joint optimization and loss functions. All of the tasks we consider are framed as multi-class classification tasks. Thus, we simply specify all task-specific losses to be L2-regularized cross-entropy errors. Let y_to be the one-hot ground truth label vector for the prediction instance o 6 of the task t, and let ŷ_to be the predicted probability distribution over the task labels for the same instance. With Y_t as the set of labels for task t, the task-specific loss L_t is computed as follows:

L_t = − Σ_o Σ_{l ∈ Y_t} y_to[l] · ln ŷ_to[l] + λ Σ_{θ ∈ Ω_t} ||θ||_2,

where Ω_t is the set of the model's parameters relevant for the task t 7 and λ is the regularization factor. We train the MTL model jointly on different tasks by defining and minimizing the joint loss function L that combines the task-specific losses L_t. Instead of using constant weights, we opt for dynamic weighting of the task-specific losses during training, based on the homoscedastic uncertainty of tasks, as proposed by Kendall et al. (2018):

L = Σ_t ( 1/(2σ_t²) · L_t + ln σ_t ),

where σ_t is the variance of the task-specific loss over training instances, used to quantify the uncertainty of task t. Kendall et al. (2018) show that better MTL results can be obtained by dynamically assigning less weight to the more uncertain tasks, as opposed to constant task weights throughout the whole training process.
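The uncertainty-based weighting can be sketched as below. Note the exact form of the Kendall et al. (2018) objective varies slightly between task types; the 1/(2σ²) weighting and log-σ regularizer here are an illustrative variant, and the loss values are made-up placeholders:

```python
import numpy as np

def joint_loss(task_losses, log_sigma2):
    """Combine per-task losses with homoscedastic uncertainty weighting.

    task_losses: array of per-task losses L_t.
    log_sigma2:  array of learned log-variances log(σ_t²), one per task.
    Returns sum_t [ L_t / (2σ_t²) + log σ_t ]; the log term penalizes
    trivially inflating σ_t to down-weight every task.
    """
    weights = 1.0 / (2.0 * np.exp(log_sigma2))        # 1 / (2σ²)
    return float(np.sum(weights * task_losses + 0.5 * log_sigma2))

losses = np.array([1.2, 0.4])      # e.g., ACI and DRC cross-entropy losses
log_s2 = np.array([0.0, 0.0])      # σ² = 1 for both tasks at initialization
print(joint_loss(losses, log_s2))  # → 0.8 (= (1.2 + 0.4) / 2)
```

In training, log_sigma2 would be a trainable parameter vector, so the model itself learns how much to down-weight the noisier task.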

Evaluation
We run two sets of experiments. First, we evaluate the performance of the Simple and the Hierarchical neural models on individual tasks (i.e., in singletask learning (STL) scenarios). We then evaluate the impact of the argumentative signal on other dimensions of rhetorical analysis by combining them in joint MTL settings.

Experimental Setup.
We randomly split the corpus at the document level into a train portion (roughly 70%; 28 documents containing 6,697 sentences) and a test portion (roughly 30%; 12 documents with 2,874 sentences). We used roughly 20% of the train portion as the validation set for model selection.
Model configuration and training. We ran an initial grid search on the validation set over the hyperparameters learning rate ν ∈ {10^-4, 10^-5}, L2 regularization factor λ ∈ {0.001, 0.0001}, and LSTM state size K ∈ {64, 128, 256} and found the configuration ν = 10^-4, λ = 0.001, and K = 128 to be optimal for the vast majority of the STL and MTL models. In all experiments, we represent tokens with pre-trained 300-dimensional GloVe embeddings (Pennington et al., 2014) 8 and optimize the model parameters using the Adam algorithm (Kingma and Ba, 2015). We initialize all model parameters using Xavier initialization (Glorot and Bengio, 2010), train the models in batches of N = 16 sentences, and apply early stopping based on validation set performance.
Baselines. As a type of "sanity check", we first compare the performance of the two neural architectures against traditional supervised machine learning algorithms on each of the tasks separately.
For the token-level sequence labeling tasks (ACI and CCI) we use Hidden Markov Models (HMM) and Conditional Random Fields (CRF; Lafferty et al., 2001). In the MTL experiments, we consider as baselines the respective task performances from the single-task experiments and from MTL with a joint loss function with equal weighting of the task losses.
Single-Task Experiments. We first report the model performances for individual tasks in STL settings. Results for token-level tasks are shown in Table 4, whereas Table 5 shows the results for the sentence-level tasks. For SRC, the Hierarchical model outperforms the Simple model, which suggests that a Relevant sentence tends to co-occur with other Relevant and May appear sentences (and an Irrelevant sentence with other Irrelevant and Should not appear sentences). The fact that we observe no gains from the additional sentence-level Bi-LSTM encoder for the DRC and SAC tasks suggests that the content of a sentence informs its discourse role and subjective aspect much more strongly than the neighboring sentences do. In other words, DRC and SAC seem to be more localized classification tasks than SRC.
Multi-Task Learning Results. Our core research question relates to the effect that recognizing fine-grained argumentative components has on other rhetorical analysis tasks. This is why, in our central set of experiments, we evaluate MTL models with homoscedastic uncertainty weighting which combine ACI (as an auxiliary task) with each of the four other tasks. In each multi-task learning model, the token-level Bi-LSTM encoder is shared between the two tasks. For the sentence-level tasks (DRC, SAC, SRC), we evaluate both the Simple and the Hierarchical architecture. In Table 6 we show the performance of the MTL models on the rhetorical analysis tasks (these can be compared to the respective single-task model performances from Tables 4 and 5). When coupled in MTL settings with argumentative component identification (ACI) using the joint loss formulation of Kendall et al. (2018), the results significantly 10 improve for all rhetorical analysis tasks and models (except for SAC with the Hierarchical model), in comparison with the respective single-task models. However, the performance for argumentative component identification does not improve in MTL. In other words, the extraction of fine-grained argumentative components seems to inform higher-level rhetorical analysis tasks, but not vice versa. This indeed supports the hypothesis that argumentation guides scientific writing and influences the rhetorical structure of publications. Furthermore, our results support the findings of Schulz et al. (2018), who show that, contrary to the initial results of Alonso and Plank (2017), MTL can yield performance gains for higher-level semantic tasks.

Conclusion
Acknowledging the argumentative nature of scientific text, we investigated the role of argumentation in the rhetorical analysis of scientific publications. We first extended an existing corpus annotated with four different layers of rhetorical information with annotations of argumentative components and relations, creating the largest argumentation-labeled corpus of scientific text in English. We explored intuitive neural architectures with recurrent encoders for argument extraction and rhetorical analysis tasks and showed significant improvements over traditional machine learning models. We then coupled argument extraction with different rhetorical analysis tasks in MTL models with dynamic loss weighting and demonstrated that the argumentative signal has a positive impact on high-level rhetorical analysis tasks.
Admittedly, the corpus we used in this work is limited to the domain of computer graphics. Nonetheless, we believe that our findings relating to the argumentative nature of scientific text and links between argumentation and other rhetorical aspects generalize to other domains too. This is also supported by the comparable agreement observed between expert and non-expert annotators.
In future work, we would like to extend the collection of scientific text to other fields. Next, we intend to explore a wider range of MTL models, especially those involving more than two tasks. Having annotated argumentative relations, we will work on models for their automated identification in scientific publications.