Argumentation Mining on Essays at Multiple Scales

Argumentation mining on essays is a challenging new task in natural language processing that aims to identify the types and locations of argumentation components. Recent research mainly models the task as a sequence tagging problem and deals with all argumentation components at the word level. However, this task is not scale-independent. Some types of argumentation components, which serve as the core opinions of essays or paragraphs, are at the essay level or paragraph level. Sequence tagging reasons over local context words and fails to mine these components effectively. To this end, we propose a multi-scale argumentation mining model, in which different types of argumentation components are mined at their corresponding levels. In addition, an effective coarse-to-fine argumentation fusion mechanism is proposed to further improve performance. We conduct a series of experiments on the Persuasive Essays dataset (PE 2.0). Experimental results indicate that our model outperforms existing models on mining all types of argumentation components.


Introduction
Argumentation mining (AM) is a challenging task in natural language processing. Recent research mainly involves independent sentences (Bar-Haim et al., 2017; Niven and Kao, 2019; Reimers et al., 2019) as well as essays (Levy et al., 2014; Habernal and Gurevych, 2017; Chernodub et al., 2019; Petasis, 2019). In this paper, we focus on argumentation mining on essays, which aims to identify the types and locations of argumentation components in essay text. Typically, there are three argumentation types, namely major claims (MC), claims (C) and premises (P).
Previous research (Levy et al., 2014) takes sentences as the smallest argumentative unit and handles the task in a coarse way. It first splits the essay into sentences and adopts a sentence classification model to select the sentences likely to contain argumentation components, and then identifies the exact boundaries of argumentation components within those sentences. These pipeline approaches fail to conduct effective argumentation mining, since they ignore the argumentation structure of the essay and only handle the task at the sentence level. Recent research (Chernodub et al., 2019) focuses on end-to-end neural models. It boils the task down to a sequence tagging problem and handles it at the word level instead of the sentence level. Typically, a neural network is employed as the encoder for text representation, and a Conditional Random Field (CRF) is employed as the decoder to make the final prediction. This word-level sequence tagging method can simultaneously identify the types and locations of all argumentation components.
However, as shown in Figure 1, different types of argumentation components are at different levels:
• Major claims serve the whole essay as its core opinions. They may be stated directly at the beginning of the essay or summarized at the end. They are at the essay level.
[Figure 1: An example persuasive essay on international tourism, annotated with major claims (MC), claims (C) and premises (P).]
• Claims serve specific paragraphs as their core statements. They can appear anywhere in a paragraph: proposed at the beginning, summarized at the end, or given in the middle. They are at the paragraph level.
• Premises serve as evidence to support major claims and claims. They can be logical statements, survey results, typical examples, public opinions, expert suggestions, etc. They are at the word level.
Moreover, the sequence tagging method uses a classical CRF model to capture dependencies in a word-by-word way. It is thus well suited to integrating local word-level information, but unsuitable for inference over long-distance text at the essay level or paragraph level. To this end, we argue that different types of argumentation components should be handled at different levels.
In this paper, we propose a multi-scale argumentation mining model. To mine major claims, we design an essay-level argumentation extraction submodule based on a multi-span extraction strategy. To mine claims, we design a paragraph-level argumentation extraction submodule based on a randomized extraction strategy. To mine premises, we follow the word-level sequence tagging method. Finally, a coarse-to-fine argumentation fusion mechanism is proposed to further improve the performance.
We carry out a series of experiments on the Persuasive Essays dataset (PE 2.0). The experimental results indicate that our model significantly improves over state-of-the-art models, with absolute improvements of 8.92% on overall performance, 14.89% on mining major claims, and 11.05% on mining claims. Moreover, we compare (i) multi-span extraction against randomized extraction and (ii) argumentation extraction against argumentation tagging, which validates the effectiveness of processing different types of argumentation components at their corresponding levels.
The organization of this paper is as follows. We first explain our multi-scale argumentation mining model in detail in Section 2. In Section 3, we introduce our experiments. The detailed experimental results are presented and analyzed in Section 4. In Section 5, we give a brief overview of related work on argumentation mining on essays. Finally, we draw our conclusions in Section 6.

Multi-scale Argumentation Mining Model
An overview of our multi-scale argumentation mining model is shown in Figure 2. For major claims, we design an essay-level argumentation extraction submodule based on a multi-span extraction strategy in Section 2.1, where the whole essay is taken as the input of a BERT encoder and a pointer network scores each word, and thereby all candidate spans. Using these scores and a set of reasonable rules, we rank and filter the candidate spans to select result spans. For claims, we design a paragraph-level argumentation extraction submodule based on a randomized extraction strategy in Section 2.2, where each paragraph is separately taken as the input of the BERT encoder to mine result spans, and the result spans of all paragraphs are gathered as the result spans of the corresponding essay. For premises, we design a word-level argumentation tagging submodule in Section 2.3, where the whole essay is taken as the input of the BERT encoder and a CRF decoder produces the tag sequence with the highest sequence score. Finally, the coarse-to-fine argumentation fusion mechanism in Section 2.4 produces the final results, since the result spans of different argumentation types may overlap.

Essay-level Argumentation Extraction for Major Claim
Major claims are at the essay level. For each essay, let $E = \{w_1, \cdots, w_{l_{es}}\}$ denote the essay, where $l_{es}$ is its length. To mine major claims, the input sequence is $X = \{[\mathrm{CLS}], w_1, \cdots, w_{l_{es}}, [\mathrm{SEP}]\}$, which is encoded with the BERT encoder (Devlin et al., 2019) to obtain contextualized embeddings $H = \mathrm{BERT}(X)$. Through its multi-head self-attention mechanism, BERT can attend to and more heavily weight the informative words in the essay, which allows the model to capture essay-level context with multi-layer Transformers.
Then, inspired by pointer networks (Vinyals et al., 2015), for each word $w_i$ in the essay, its embedding $H_i$ is passed through a linear layer to obtain $\mathrm{score}^s_i$ and $\mathrm{score}^e_i$, the scores for $w_i$ to be the start and the end of a major claim span, respectively. The cross-entropy losses of the start positions and the end positions are then calculated, and their sum is employed as the final loss:
$$\mathcal{L} = \mathrm{CE}(\mathrm{score}^s, y^s) + \mathrm{CE}(\mathrm{score}^e, y^e),$$
where $y^s_i$ is the start label of $w_i$ (1 for a gold start word and 0 otherwise) and $y^e_i$ is the end label. Moreover, as shown in Figure 1, an essay may contain more than one major claim span. In fact, each essay has at least one major claim span, and usually two, one stated directly at the beginning and another summarized at the end. Hence, we adopt a multi-span extraction strategy during training, where all major claims in an essay are admitted, so the start label $y^s$ and the end label $y^e$ may be multi-hot vectors. At prediction time, all candidate spans are ranked according to their probability. The probability of a span starting at $w_i$ and ending at $w_j$ is defined as
$$P_{i,j} = p^s_i \cdot p^e_j, \quad p^s = \mathrm{softmax}(\mathrm{score}^s), \quad p^e = \mathrm{softmax}(\mathrm{score}^e).$$
We then apply a set of common-sense rules to filter apparently wrong and overlapping candidate spans; the rules are explained in detail in Appendix 1. Finally, we keep the top $K$ spans as the result spans of each essay.
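To make the essay-level extraction step concrete, the following is a minimal sketch of the span scoring and top-$K$ selection described above, assuming `hidden` holds per-word BERT embeddings and `w_s`, `w_e` are the learned start/end scoring vectors of the linear layer; all names are illustrative, and the greedy overlap filter only stands in for the rule set of Appendix 1.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def extract_spans(hidden, w_s, w_e, top_k=2, max_span_len=60):
    """Score every word as a span start/end and rank candidate spans."""
    score_s = hidden @ w_s          # start score per word, shape (seq_len,)
    score_e = hidden @ w_e          # end score per word, shape (seq_len,)
    p_s, p_e = softmax(score_s), softmax(score_e)

    # Span probability is the product of its start and end probabilities.
    candidates = []
    for i in range(len(p_s)):
        for j in range(i, min(i + max_span_len, len(p_e))):
            candidates.append((p_s[i] * p_e[j], i, j))
    candidates.sort(reverse=True)

    # Greedy removal of overlapping spans, standing in for Appendix 1 rules.
    selected = []
    for prob, i, j in candidates:
        if all(j < si or i > sj for _, si, sj in selected):
            selected.append((prob, i, j))
        if len(selected) == top_k:
            break
    return selected

# Toy usage: 10 words with 8-dimensional embeddings.
rng = np.random.default_rng(0)
spans = extract_spans(rng.normal(size=(10, 8)), rng.normal(size=8), rng.normal(size=8))
print(spans)
```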

Paragraph-Level Argumentation Extraction for Claim
Claims are at the paragraph level. For each essay, we first mine claims from each paragraph separately and then gather the results for the subsequent argumentation fusion on the essay. Specifically, for each paragraph, let $P = \{w_1, \cdots, w_{l_{pa}}\}$ denote the paragraph, where $l_{pa}$ is its length. The input sequence is $X = \{[\mathrm{CLS}], w_1, \cdots, w_{l_{pa}}, [\mathrm{SEP}]\}$, which is also encoded with the BERT encoder to obtain contextualized embeddings $H = \mathrm{BERT}(X)$. Then, as in the major claim submodule in Section 2.1, the start and end scores of each word are computed from its embedding through a linear layer, and the sum of the cross-entropy losses of the start and end positions is adopted as the final loss. Besides, as shown in Figure 1, a paragraph may contain one claim span or none, and only rarely does a paragraph contain more than one claim span. Taking this into account, we adopt a randomized extraction strategy: if a paragraph contains more than one claim span, then in each training epoch only one span, chosen at random, is admitted and the others are ignored. Thus the start label $y^s$ and the end label $y^e$ are one-hot vectors for paragraphs with at least one claim span and all-zero vectors for paragraphs without any claim span. Similarly, at prediction time, all candidate spans are ranked according to the span probability defined above, the filtering rules in Appendix 1 are applied to remove apparently wrong and overlapping candidate spans, and the top $k$ spans are kept as the result spans of each paragraph and gathered as the result spans of the corresponding essay.
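The randomized extraction strategy only changes how training labels are built. A minimal sketch, with an illustrative label layout rather than the authors' exact implementation, is shown below: when a paragraph has several gold claim spans, each epoch keeps exactly one, chosen at random, and paragraphs without claims get all-zero labels.

```python
import random

def build_claim_labels(seq_len, gold_spans, rng=random):
    """gold_spans is a list of (start, end) word indices, possibly empty."""
    y_start = [0] * seq_len
    y_end = [0] * seq_len
    if gold_spans:
        start, end = rng.choice(gold_spans)   # re-drawn every training epoch
        y_start[start] = 1
        y_end[end] = 1
    return y_start, y_end

# Example: a 12-word paragraph with two annotated claim spans.
print(build_claim_labels(12, [(0, 3), (7, 11)]))
```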

Word-Level Argumentation Tagging for Premise
Premises are at the word level. We adopt word-level argumentation tagging with a BERT-CRF sequence tagging model to mine premises. For each essay, let $E = \{w_1, \cdots, w_{l_{es}}\}$ denote the essay. The input sequence is $X = \{[\mathrm{CLS}], w_1, \cdots, w_{l_{es}}, [\mathrm{SEP}]\}$, which is also encoded with the BERT encoder to obtain contextualized embeddings $H = \mathrm{BERT}(X)$. Then the embedding of each word is passed through a linear layer to produce tag scores $\mathrm{score}^j_i$ ($j \in \{1, 2, \cdots, k\}$), where $k$ is the number of tag types and $\mathrm{score}^j_i$ is the score of word $w_i$ being assigned tag $j$. We adopt the same tag configuration as Chernodub et al. (2019), which combines BIO labels with the argumentation types (e.g., B-MC, I-MC, B-C, I-C).
We also adopt a Conditional Random Field (CRF) decoder (Lample et al., 2016). Specifically, for a predicted tag sequence $t$, where $t_i$ is the predicted tag of word $w_i$, the sequence score is
$$s(t) = \sum_i \mathrm{score}^{t_i}_i + \sum_i A_{t_{i-1}, t_i},$$
where $A$ is a learned one-step tag transition matrix. The final loss is the negative log-likelihood of the gold tag sequence:
$$\mathcal{L} = -\sum_{t \in T} y_t \log \frac{\exp(s(t))}{\sum_{t' \in T} \exp(s(t'))},$$
where $y_t$ is the tag sequence label (1 for the ground-truth tag sequence and 0 otherwise) and $T$ is the set of all possible tag sequences. At prediction time, the Viterbi algorithm is used for decoding to obtain the tag sequence with the highest sequence score, which is taken as the submodule prediction.
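The Viterbi step can be sketched as follows: a minimal dynamic program over the per-word tag scores and the transition matrix $A$, in the spirit of Lample et al. (2016). The arrays below are toy stand-ins, not the trained model.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """emissions: (seq_len, num_tags); transitions: (num_tags, num_tags)."""
    seq_len, num_tags = emissions.shape
    score = emissions[0].copy()                 # best score ending in each tag
    backpointers = []
    for t in range(1, seq_len):
        # total[i, j] = score[i] + A[i, j] + emission[t, j] for previous tag i
        total = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(total.argmax(axis=0))
        score = total.max(axis=0)
    best_tag = int(score.argmax())
    path = [best_tag]
    for bp in reversed(backpointers):
        best_tag = int(bp[best_tag])
        path.append(best_tag)
    return list(reversed(path))

# Toy usage with 5 words and a 7-tag scheme (B/I for MC, C, P, plus O).
rng = np.random.default_rng(0)
tags = viterbi_decode(rng.normal(size=(5, 7)), rng.normal(size=(7, 7)))
print(tags)
```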

Coarse-to-fine Argumentation Fusion
As described above, we obtain result spans of the different argumentation types at their corresponding levels. However, the result spans of different argumentation types might overlap. Hence we propose a coarse-to-fine method to fuse them.
Specifically, let $\mathrm{priority}_x$ denote the priority of argumentation type $x$, where $x \in \{MC, C, P\}$. Following the coarse-to-fine principle, we set the highest priority for major claims, a lower one for claims, and the lowest for premises:
$$\mathrm{priority}_{MC} > \mathrm{priority}_{C} > \mathrm{priority}_{P}.$$
For each essay, we keep three sets containing the result spans of major claims, claims and premises, respectively. If a result span from one set overlaps with a result span from another set according to Algorithm 1 in Appendix 1, we keep the span from the set with the higher priority and remove the other span from its set. In this way, no two sets share overlapping spans and the fusion procedure is complete.
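A minimal sketch of this fusion follows: spans of a lower-priority type are dropped whenever they overlap a span of a higher-priority type. The plain interval check here stands in for the overlap test of Algorithm 1, which we do not reproduce.

```python
PRIORITY = ["MC", "C", "P"]  # highest to lowest priority

def overlaps(a, b):
    return not (a[1] < b[0] or b[1] < a[0])

def fuse(spans_by_type):
    """spans_by_type maps 'MC'/'C'/'P' to lists of (start, end) word indices."""
    fused = {t: list(spans_by_type.get(t, [])) for t in PRIORITY}
    for hi_idx, hi_type in enumerate(PRIORITY):
        for lo_type in PRIORITY[hi_idx + 1:]:
            fused[lo_type] = [
                span for span in fused[lo_type]
                if not any(overlaps(span, kept) for kept in fused[hi_type])
            ]
    return fused

# Example: a claim overlapping a major claim is removed; the premise survives.
print(fuse({"MC": [(0, 10)], "C": [(5, 12), (20, 25)], "P": [(30, 40)]}))
```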

Experiments
In this section, we first introduce the dataset we use and our experiment setup. We then introduce the evaluation metrics. Finally, we list the baselines adopted for comparison.

Dataset
The PE 2.0 dataset (2017), which is based on the PE 1.0 dataset (2014), is one of the most classical and widely used datasets for argumentation mining on essays. PE 2.0 annotates three kinds of argumentation components, namely major claims (MC), claims (C) and premises (P). Many previous studies (Persing and Ng, 2016; Chernodub et al.,

Experiment Setup
We implement our model with TensorFlow 1.14.0 and conduct our experiments on a computation node with an NVIDIA RTX 2080 GPU. The pre-trained uncased BERT-base model is adopted as the encoder. We use the BERTAdam optimizer with an initial learning rate of 5e-6 and a batch size of 4 to avoid out-of-memory problems, since BERT is very memory-intensive. We also perform a hyperparameter search over dropout probabilities {0.1, 0.2, 0.3}. In each case, we train for 20 epochs and choose the model parameters with the best performance on the development set.
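The setup above amounts to a small grid search over the dropout probability with fixed optimizer settings. The sketch below only illustrates that loop; `train_fn` and `eval_fn` are hypothetical placeholders, not APIs from the actual code base.

```python
# Minimal sketch of the training configuration and dropout search, assuming
# hypothetical train/eval callables supplied by the experiment harness.
CONFIG = {
    "learning_rate": 5e-6,   # BERTAdam initial learning rate
    "batch_size": 4,         # small batch to fit a single RTX 2080
    "epochs": 20,
}

def tune_dropout(train_fn, eval_fn, dropouts=(0.1, 0.2, 0.3)):
    """Train once per dropout value and keep the best dev-set checkpoint."""
    best = None
    for p in dropouts:
        model = train_fn(dropout=p, **CONFIG)   # hypothetical training routine
        dev_f1 = eval_fn(model)                 # hypothetical dev evaluation
        if best is None or dev_f1 > best[0]:
            best = (dev_f1, p, model)
    return best
```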

Evaluation Metrics
To accurately evaluate the performance of our model on mining all types of argumentation components, we employ the following span-based evaluation metrics. For a specific argumentation type, a predicted span is regarded as correct only if it exactly matches a ground-truth span of the essay. We calculate the mean precision P, mean recall R and mean F1 score F per essay on the test set. Furthermore, we employ the macro F score
$$F_{macro} = \frac{1}{n} \sum_{i=1}^{n} F_i,$$
as the overall evaluation metric, where $n$ is the number of essays in the test set and $F_i$ is the F1 score of the $i$-th essay. Besides, following previous research (Chernodub et al., 2019), we also report the micro F score of Persing and Ng (2016). The detailed definition of this metric is given in Appendix 2.
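The following is a minimal sketch of our reading of these span-based metrics: exact-match precision, recall and F1 per essay, then the macro F score as their average over essays. Function names are illustrative; this is not the official evaluation script.

```python
def prf(pred_spans, gold_spans):
    """Exact-match precision, recall and F1 for one essay and one type."""
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def macro_f(per_essay_preds, per_essay_golds):
    """Average the per-essay F1 scores over the test set."""
    fs = [prf(p, g)[2] for p, g in zip(per_essay_preds, per_essay_golds)]
    return sum(fs) / len(fs)

# Example with two essays; spans are (start, end) word offsets.
preds = [[(0, 5), (10, 14)], [(3, 7)]]
golds = [[(0, 5), (10, 15)], [(3, 7), (20, 24)]]
print(macro_f(preds, golds))
```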

Submodule Performance
Experimental results on mining the different argumentation types before fusion are summarized in Table 3. Our essay-level argumentation extraction submodule for major claims shows the best performance, with the highest F1 score as well as the highest precision and recall on mining major claims. As we have pointed out, major claims are at the essay level, so BERT-CRF with the essay as input performs best among the sequence tagging models. However, it still reasons word by word through the CRF. Compared with the CRF, the pointer network in our submodule can capture long-distance context on essays, so the submodule significantly outperforms the word-level sequence tagging models. Besides, our paragraph-level argumentation extraction submodule for claims obtains the best performance, with the highest F1 score on mining claims and near-best precision and recall. Claims are at the paragraph level, and BERT-CRF with the paragraph as input performs best among the sequence tagging baselines. Compared with it, our submodule uses a pointer network to reason over paragraphs and thus shows clear advantages on mining claims. However, its F1 score of 53.54%, though the highest among all models, is relatively low compared with the other argumentation types. This might be because the submodule ignores information from the other paragraphs of the same essay; as the ablation studies in Section 4.3 show, this is a challenging trade-off. Moreover, our word-level argumentation tagging submodule for premises has the best performance, with the highest F1 score as well as the highest precision on mining premises. This indicates that the pre-trained language model BERT is powerful and transfers well to this task.

We also verify the effectiveness of our coarse-to-fine argumentation fusion mechanism in Table 4. For major claims, the performance remains the same after fusion, since major claims have the highest priority and no such span is removed. For claims, the performance is clearly improved, with a higher F1 score resulting from a significant increase in precision and a relatively slight decrease in recall. For premises, the performance also improves slightly. The overall performance likewise improves after fusion, with absolute increases of 1.18% in the micro F score and 1.20% in the macro F score. All these improvements indicate that our coarse-to-fine argumentation fusion mechanism is effective.

Multi-span Extraction or Randomized Extraction
We mine major claims and claims at different levels with different extraction strategies. The results are summarized in Table 5. For major claims, under the same extraction strategy, extraction on essays significantly outperforms extraction on paragraphs. The situation is exactly the opposite for claims: under the same extraction strategy, extraction on paragraphs performs better. This shows that different types of argumentation components should be handled at their corresponding levels.
Moreover, regardless of argumentation type, randomized extraction outperforms multi-span extraction on paragraphs, while multi-span extraction is better on essays. This may indicate that the multi-span strategy suits essay-level extraction, whereas the randomized strategy suits paragraph-level extraction. In fact, an essay usually contains more than one claim span, for which multi-span extraction is more appropriate, while in most cases a paragraph has at most one claim span or none, for which randomized extraction is more appropriate.
The situation is similar for major claims. Therefore, the results confirm the effectiveness of the strategies we choose for the different types of argumentation components.

Argumentation Extraction or Argumentation Tagging
We also try to mine premises with the argumentation extraction method. The results are compared in Table 6. Our word-level argumentation tagging submodule for premises obtains the best performance, with the highest F1 score as well as the highest precision and recall. This indicates that premises are at the word level and that argumentation tagging is more appropriate than argumentation extraction for mining them.

Related Work

Early work modeled AM on essays as a sentence-level feature-based classification task, where each sentence is classified with a set of linguistic features. Later work first proposed a sequence tagging model to distinguish argumentation components from non-argumentation components and employed a joint ILP (Integer Linear Programming) model to identify the types of argumentation components; however, it reported performance on the individual subtasks without an overall score. Potash et al. (2017) used a pointer network to identify the types of argumentation components under the assumption that all argumentation components have already been identified, i.e., that the exact boundaries of all argumentation components are already available. Subsequent work proposed an end-to-end sequence tagging model, which first employs compound labels of BIO and argumentation types and simultaneously identifies the types and exact locations of the different argumentation components. Chernodub et al. (2019) built an application interface called TARGER, a BiLSTM-CNN-CRF sequence tagging model, for convenient argumentation mining on essays. Besides, recent research (Petasis, 2019; Spliethover et al., 2019) also aims to distinguish argumentation components from non-argumentation components via text segmentation based on sequence tagging models. Other work (Peldszus et al., 2016; Skeppstedt et al., 2018) focuses on the arg-microtext corpus, which contains 112 independent short texts, each of which can be considered one paragraph and contains about five argumentation components on average.

Conclusion
We propose a multi-scale argumentation mining model for argumentation mining on essays. Our model mines different types of argumentation components at their corresponding levels, and a coarse-to-fine argumentation fusion mechanism further improves the results. The experimental results on the PE 2.0 dataset indicate that our model achieves state-of-the-art performance, with significant improvements on mining major claims and claims. The results reveal the importance of handling different argumentation types at different levels. In future work, we will try to mine the different argumentation types with multi-task learning.
A.3 Error Analysis

The results are displayed in Figure A.1. Different argumentation types show diverse error modes. For all types, None-out is the dominant error, which may be because argumentation components are far fewer than non-argumentation ones. For major claims, None-out, Type-in and None-in errors are serious; it may be somewhat difficult for the model to distinguish major claims from non-argumentation components. Claims suffer from severe None-out, Type-in and Type-out errors, which may indicate that the model tends to mistake claims for non-argumentation components and to confuse claims with other argumentation types. As for premises, None-out, Boundary and Type-out errors dominate: the model may struggle to identify the exact boundaries of premises and to distinguish premises from non-argumentation components, and it also tends to mistake premises for other argumentation types.

A.4 Word-based Sequence Tagging Results
Word-based sequence tagging results of the different models are compared in Table A.1. Among these models, BERT-CRF with essays as input shows the best word-based performance on all tag types. However, even for this model, the F1 scores for major claims and claims are still low: the F1 scores of B-MC and I-MC are both below 70%, and those of B-C and I-C are both below 60%. Moreover, for major claims, the minimum of the F1 scores of B-MC and I-MC can be regarded as an upper bound on the corresponding span-based F1 score, since a span counts as correct only if every one of its words is tagged correctly; the situation is similar for claims. That is to say, for these sequence tagging models, the span-based F1 scores for major claims and claims cannot exceed 64.58% and 58.51%, respectively. Therefore, sequence tagging models show very limited performance on mining major claims and claims.

A.5 Machine Reading Comprehension Framework
Inspired by machine reading comprehension approaches, we try to handle the task under the Machine Reading Comprehension (MRC) framework to further improve the performance on mining major claims and claims. As shown in Figure 1 of our paper, the title of an essay is a condensed summary of the essay, which explicitly points out the topic and may even directly propose the core opinion. Hence, we adopt the essay title as the query and guidance information, and use new MRC-style inputs for the submodules in Section 2.1 and Section 2.2. Specifically, to mine major claims, for each essay, let $T = \{w^T_1, \cdots, w^T_{l_t}\}$ denote the title and $E = \{w^E_1, \cdots, w^E_{l_{es}}\}$ denote the essay. We concatenate the title and the essay text as the MRC input $X = \{[\mathrm{CLS}], w^T_1, \cdots, w^T_{l_t}, [\mathrm{SEP}], w^E_1, \cdots, w^E_{l_{es}}, [\mathrm{SEP}]\}$ and encode the concatenation with the BERT encoder. Similarly, to mine claims, for each paragraph, the title $T$ and the paragraph $P = \{w^P_1, \cdots, w^P_{l_{pa}}\}$ are concatenated in the same way and encoded with the BERT encoder. The subsequent argumentation extraction remains the same. The results are compared in Table A.2. The MRC framework with the essay title as query leads to worse performance. In fact, essay titles are diverse: a title can be a statement, e.g., "International tourism is now more common than ever before", a question, e.g., "Can technology alone solve the world's environmental problems?", or a phrase, e.g., "Living and studying overseas". It may be quite difficult for the model to understand the role of the essay title as a query, so the title acts as a distraction rather than guidance for argumentation mining. Hence, the MRC framework with the essay title as query fails to improve performance.
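The title-as-query input amounts to a standard BERT sentence-pair encoding. As a minimal illustration only: our implementation uses TensorFlow 1.14 and the original BERT code, but the same concatenation can be sketched with the Hugging Face tokenizer, which produces the [CLS] title [SEP] text [SEP] layout with segment ids 0/1.

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

title = "International tourism is now more common than ever before"
essay = "The last decade has seen an increasing number of tourists ..."

# Sentence-pair encoding: [CLS] title tokens [SEP] essay tokens [SEP].
encoded = tokenizer(title, essay, truncation=True, max_length=512)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"])[:12])
```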