Discourse Self-Attention for Discourse Element Identification in Argumentative Student Essays

This paper proposes to adapt self-attention to the discourse level for modeling discourse elements in argumentative student essays. Specifically, we focus on two issues. First, we propose structural sentence positional encodings to explicitly represent sentence positions. Second, we propose to use inter-sentence attentions to capture sentence interactions and enhance sentence representations. We conduct experiments on two datasets: a Chinese dataset and an English dataset. We find that (i) sentence positional encodings lead to a large improvement in identifying discourse elements; (ii) a structural relative positional encoding of sentences proves most effective; (iii) inter-sentence attention vectors are useful as a kind of sentence representation for identifying discourse elements.


Introduction
Discourse describes how a document is organized. This paper focuses on the task of discourse element identification (DEI) in argumentative student essays. Discourse elements represent the function and contribution of every discourse unit to the discourse. Burstein et al. (2003) formulate discourse elements as five categories: introduction, thesis, main idea, supporting and conclusion, while argument components such as major claim, claim and premise are used as discourse elements in argumentation structure parsing for persuasive essays (Stab and Gurevych, 2014). DEI can benefit automated essay scoring in many aspects: modeling organization, inferring topics and opinions, or providing features for scoring systems (Attali and Burstein, 2006; Burstein et al., 2001; Persing et al., 2010; Song et al., 2020).
Despite its importance, DEI is challenging. First, the ambiguity of sentences makes it difficult for learning models to distinguish some discourse elements. For example, the thesis is defined as expressing the central claim of the author, and the main ideas support the thesis from specific aspects. However, it is hard to distinguish them based on content and style alone.
Second, the discourse element of a specific sentence depends on context. As a result, considering individual sentences alone makes it difficult to identify discourse elements; the relations and relatedness among multiple sentences should be explored.
Third, the data imbalance problem is serious, e.g., the number of elaboration sentences can be 10 times the number of thesis sentences. The minority discourse elements (such as thesis, main ideas or major claim) are harder to recall, although they play important roles in many scenarios, e.g., evaluating the organization of essays (Attali and Burstein, 2006).
In this paper, we propose a method to explicitly model sentence positions and relations to improve discourse element identification in argumentative student essays. Our idea is partially motivated by the self-attention mechanism (Vaswani et al., 2017). Self-attention is usually applied to capture dependencies between words; we aim to apply the self-attention mechanism to describe relations between sentences.
On one hand, position information is important for DEI because it gives clues on discourse elements beyond content and style, as authors usually follow conventions to organize content. Position is one of the most useful feature classes in feature-based DEI (Burstein et al., 2003; Stab and Gurevych, 2014). Previous neural network models usually cast DEI as a classification or sequence labeling task and do not explicitly model position information. Motivated by the positional encoding of words, we propose a simple structural positional encoding strategy for a sentence by considering its relative position in the essay, the relative position of its paragraph in the essay, and its relative position within its paragraph.
On the other hand, relatedness among sentences may also indicate properties of discourse elements. For example, thesis sentences should relate closely to the whole essay; main ideas are usually located in similar positions and have high relatedness. Relatedness between discourse elements has been shown to be an important indicator of essay coherence (Higgins et al., 2004). We compute inter-sentence attention vectors to represent either element-wise or content-wise relations to other sentences, which bring in information beyond individual sentences and enhance sentence representations without external resources.
Experiments show that the proposed approach achieves considerable improvements over feature-based and neural network based baselines on a Chinese dataset and obtains competitive results compared with the state-of-the-art method on an English dataset. The structural positional encodings of sentences prove effective, achieving clear overall improvements, and the inter-sentence attention vectors enhance sentence representations, helping identify discourse elements as well.
Related Work

Discourse Element Identification
DEI can be seen as a subtask of discourse structure analysis. It aims to identify discourse elements, determine their functions and establish relationships among them in an argumentative text.
The solutions to these tasks usually adopt similar machine learning methods but use domain related features. The methods could be roughly classified into the following categories.
Sequence labeling based methods exploit contextual information for DEI with conditional random fields (Hirohata et al., 2008; Song et al., 2015) or recurrent neural networks.
Establishing relations between sentences is often viewed as a classification task as well (Stab and Gurevych, 2014). Parsing based methods are also adopted to build more complex structures with techniques like ILP or RST-style parsing (Peldszus and Stede, 2015).
Feature engineering. Some common features are shared across these tasks, including syntactic, lexical, semantic and discourse relation features. There are also domain-related features to further boost performance. Mochales and Moens (2011) designed special features for argumentation mining in legal texts. Nguyen and Litman (2015) identified claims based on domain words. Lippi and Torroni (2015) modeled syntactic structures for content-independent claim detection based on tree kernels.
Our work is mostly related to DEI in argumentative student essays (Burstein et al., 2003;Stab and Gurevych, 2014), which is useful for qualifying essay organization (Persing et al., 2010), argumentation (Persing and Ng, 2016;Wachsmuth et al., 2016) and general writing (Burstein et al., 2003;Ong et al., 2014;Song et al., 2014). The major feature classes proposed by Burstein et al. (2003) and Stab and Gurevych (2014) are used to build a baseline. The features include: position, cue words, lexical features (main verbs, adverbs and connectives) and structural features (such as number of clauses). Some of these features are based on manually collected lexicons.
Deep learning methods have achieved great success in many NLP tasks. Neural argumentation mining models based on sequence tagging or dependency parsing exploit inter-sentence relations but need sophisticated language processing. CNNs and LSTMs have also been exploited to classify sentences and identify claims from different domains; such models mainly depend on the content of components but do not sufficiently model positions or exploit inter-sentence relatedness.

Attention Mechanism for Discourse Representation
The attention mechanism was first introduced by Bahdanau et al. (2015) in the encoder-decoder framework. Attention can learn important regions within a context and has been widely adopted in deep learning. Liu and Lapata (2018) proposed a structured attention mechanism to derive a tree over a text, akin to an RST discourse tree; Ferracane et al. (2019), however, found multiple negative results when evaluating the model. The attention mechanism has also been applied to RST parsing and its applications (Li et al., 2016; Ji and Smith, 2017; Huber and Carenini, 2019), but it is mostly used for capturing local semantic interactions.

Self-Attention Mechanism
Vaswani et al. (2017) proposed the self-attention mechanism and achieved state-of-the-art results in many NLP tasks. Since then, self-attention has drawn increasing interest due to its flexibility in modeling long-range interactions.
Self-attention ignores word order in a sentence. As a result, position representations have been developed to cooperate with self-attention. In addition to the sinusoidal position representation proposed by Vaswani et al. (2017), there are other variations that bias the selection of attentive regions (Shen et al., 2018; Shaw et al., 2018; Yang et al., 2019). In NLP, self-attention is mostly applied to sequential structures such as sequences of words. Mihaylov and Frank (2019) proposed a discourse-aware self-attention encoder for reading comprehension on narrative texts, where event chains, discourse relations and coreference relations are used to connect sentences. Self-attention can also be extended to two dimensions for image processing (Parmar et al., 2018) and to lattice inputs (Sperber et al., 2019).

Baseline
We use a Hierarchical BiLSTM (HBiLSTM), similar to that of Yang et al. (2016), as the base model to learn sentence- and discourse-level representations.
The task is to assign discourse element labels y = (y_1, ..., y_n) to sentences (x_1, ..., x_n) in a text, where x_i, 1 ≤ i ≤ n, is a sentence, i.e., a sequence of words, and y_i ∈ Y, where Y is a set of pre-defined discourse elements.

Sentence Representation Layer
A sequence of words x = {w_1, ..., w_N} is modeled with an RNN encoder and converted into a sequence of hidden states H = {h_1, ..., h_N}. The hidden state at the i-th step is

h_i = f(e(w_i), h_{i-1}), (1)

where f is an RNN unit, e(w_i) ∈ R^d is the embedding of a word, and h_{i-1} is the hidden state of the previous step. The whole sequence can be represented as a fixed-length vector c = φ(H), where φ(·) is a function to summarize hidden states. In this work, Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) is used as the RNN unit and the sequence is encoded bidirectionally, so that a hidden state is the concatenation of the corresponding hidden states from both directions. The summarization function φ(·) can be based on the attention mechanism.
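The recurrence and summarization above can be sketched with a minimal NumPy implementation, using a vanilla tanh RNN cell in place of the LSTM and mean-pooling as a stand-in for φ(·); all dimensions and weights below are illustrative, not the paper's configuration.

```python
import numpy as np

def rnn_encode(embeddings, W_x, W_h, b):
    """Run a simple tanh RNN over word embeddings; return all hidden states.

    embeddings: (N, d) word embeddings e(w_1), ..., e(w_N)
    W_x: (d, h), W_h: (h, h), b: (h,)
    """
    h = np.zeros(W_h.shape[0])
    states = []
    for e in embeddings:
        # h_i = f(e(w_i), h_{i-1}) with f a tanh RNN cell
        h = np.tanh(e @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states)

def summarize(H):
    # phi(H): mean-pool hidden states into a fixed-length sentence vector
    return H.mean(axis=0)

rng = np.random.default_rng(0)
N, d, h = 5, 8, 4            # toy sizes: 5 words, embedding dim 8, hidden dim 4
E = rng.normal(size=(N, d))
H = rnn_encode(E, rng.normal(size=(d, h)), rng.normal(size=(h, h)), rng.normal(size=(h,)))
c = summarize(H)             # fixed-length sentence representation
```

A bidirectional LSTM would run two such recurrences (left-to-right and right-to-left) and concatenate the corresponding states.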

Discourse Representation Layer
In the discourse representation layer, we feed sentence representations C = (c_1, ..., c_n) ∈ R^{d×n} to a BiLSTM and use a nonlinear layer to map semantic representations to discourse element representations:

D = tanh(BiLSTM(C)). (2)

Inference Layer
Finally, we use a linear layer and a softmax layer to predict the discourse element of every sentence:

Y = softmax(W_o D + b_o), (3)

where W_o and b_o are parameters and Y ∈ R^{|Y|×n} refers to the probabilities of every sentence over discourse element categories. The baseline mainly exploits interactions between adjacent sentences, but long-distance interactions and sentence positions are not explicitly considered, although they may also be important for determining the function of sentences in argumentative discourse.
Discourse Self-Attention (DiSA)

The sentences in an essay are converted to sentence embeddings C through the BiLSTM encoder introduced in Section 3.1, which are used as the input of our proposed Discourse Self-Attention (DiSA) model. DiSA explicitly represents sentence positions, which are integrated with the content representations of sentences to obtain element representations. DiSA also has an inter-sentence attention module that computes both element-wise and content-wise attention vectors of sentences to capture sentence interactions. The attention vectors and element representations are concatenated and fed to a linear layer and a softmax layer for prediction.

Sentence Positional Encodings (SPE)
Discourse elements in argumentative essays are sensitive to their positions. For example, introduction mostly comes before thesis or main ideas and main ideas may occur more often at the beginnings or endings of paragraphs. Figure 2 shows an essay with 7 sentences and 4 paragraphs as an example. We consider three types of sentence positions for positional encoding.
• Global position: The index of a sentence is used to describe its position where we assume sentences in an essay form a sequence.
• Paragraph position: An essay has multiple paragraphs. The position of the paragraph that contains the sentence is also important.
• Local position: The position of the sentence in its paragraph is informative as well.
We adopt a relative positional encoding approach and compute relative positions for the above three position types. For example, the relative global position of the i-th (i ≥ 1) sentence in an essay E is

pos_global(i) = i / |E|, (4)

where |E| is the number of sentences in the essay.
To integrate with sentence representations, we expand pos_global(i) to a vector of the same dimension d as the distributed sentence representations by duplicating its value along every dimension, denoted pos_global(i) ∈ R^d. The relative paragraph position representation pos_para(i) and the relative local position representation pos_local(i) are computed in the same way.
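The three relative positions and their expansion to d dimensions can be sketched as follows (a minimal NumPy illustration; the essay layout and the dimension d here are toy values, not from the paper's experiments):

```python
import numpy as np

def relative_spe(sent_idx, para_idx, idx_in_para, n_sents, n_paras, para_len, d):
    """Relative structural positional encodings for one sentence.

    Each scalar relative position is duplicated across all d dimensions
    so it can be added to a d-dimensional sentence representation.
    """
    pos_global = np.full(d, sent_idx / n_sents)      # position in the essay
    pos_para   = np.full(d, para_idx / n_paras)      # position of its paragraph
    pos_local  = np.full(d, idx_in_para / para_len)  # position within its paragraph
    return pos_global, pos_para, pos_local

# 5th sentence (1-based) of a 7-sentence, 4-paragraph essay,
# sitting 1st in its 2-sentence third paragraph; d = 4 for display.
g, p, l = relative_spe(5, 3, 1, n_sents=7, n_paras=4, para_len=2, d=4)
```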
The three positional encodings are combined by a weighted sum,

pos(i) = β_1 pos_global(i) + β_2 pos_para(i) + β_3 pos_local(i), (5)

where {β_t} are parameters to be learned in training. The element representation of the i-th sentence is

e_i = tanh(BiLSTM(C_i + pos(i))). (6)

Inter-Sentence Attention (ISA)
Self-attention relates elements at different positions by computing attention between every pair of elements. An attention function maps a query and a set of key-value pairs to an output. The queries Q, keys K and values V are vectors. We define Q, K ∈ R^{d_k×n}, where d_k is the dimension. The attention is computed as

α = softmax(Q^T K / √d_k). (7)

The output is computed as a weighted sum of the values, i.e., αV. Here, we are interested in the attention vectors rather than the weighted output, because an attention vector reflects the relatedness of a given sentence to every other sentence. We propose inter-sentence attention (ISA) by applying self-attention to the sentence semantic representations C and the discourse element representations E = {e_i}.
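The attention-vector computation can be sketched as follows (a minimal NumPy version; the projection matrices and sizes are illustrative, and only the element-wise case is shown):

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def inter_sentence_attention(X, W_Q, W_K):
    """Attention vectors between sentences.

    X: (d, n) sentence representations (columns are sentences).
    Returns alpha: (n, n); row i is sentence i's attention vector,
    i.e. its relatedness to every sentence in the essay.
    """
    Q = W_Q @ X                   # (d_k, n)
    K = W_K @ X                   # (d_k, n)
    d_k = Q.shape[0]
    return softmax(Q.T @ K / np.sqrt(d_k), axis=-1)

rng = np.random.default_rng(1)
d, d_k, n = 6, 4, 7               # toy sizes: a 7-sentence essay
E = rng.normal(size=(d, n))       # element representations
alpha_e = inter_sentence_attention(E, rng.normal(size=(d_k, d)), rng.normal(size=(d_k, d)))
```

Applying the same function to the content representations C (with separate projections) would give the content-wise attention α_c.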
We use E to get Q and K, i.e., Q = EW_Q and K = EW_K, which yields the element-wise attention α_e; the content-wise attention α_c is computed from C in the same way. We maintain relatedness information by max-pooling α_e and α_c in local bins. These bins have sizes proportional to the number of an essay's sentences so that the number of bins is fixed regardless of the essay length. We set the number of bins to 1, 2, 4 and 8, respectively. The resulting representations can be seen as descriptions of the relatedness of a sentence to different zones of its essay. These representations are concatenated so that the dimension of the pooled attention vectors α_c, α_e is 1+2+4+8=15. Finally, the prediction is made by feeding the concatenation of α_c, α_e and E to the linear and softmax layers.
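The binned max-pooling can be sketched as follows (a minimal NumPy illustration; the rounding used to set zone boundaries is our own choice for the sketch, not necessarily the authors'):

```python
import numpy as np

def binned_maxpool(attn_row, bin_counts=(1, 2, 4, 8)):
    """Max-pool one sentence's attention vector into fixed-size bins.

    attn_row: (n,) attention of one sentence over all n sentences.
    For each bin count, the essay is split into that many zones and the
    maximum attention within each zone is kept, giving 1+2+4+8 = 15 values
    regardless of the essay length n.
    """
    n = len(attn_row)
    pooled = []
    for k in bin_counts:
        # zone boundaries proportional to essay length
        edges = np.linspace(0, n, k + 1).round().astype(int)
        for lo, hi in zip(edges[:-1], edges[1:]):
            pooled.append(attn_row[lo:hi].max() if hi > lo else 0.0)
    return np.array(pooled)

row = np.array([0.05, 0.1, 0.4, 0.05, 0.2, 0.1, 0.1])  # toy 7-sentence attention
v = binned_maxpool(row)
```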

The Chinese Dataset
The construction of the Chinese dataset mainly follows the definition and taxonomy of discourse elements proposed by Burstein et al. (2003). Specifically, we consider the following discourse elements:
• Introduction The role of the introduction is to introduce background or attract readers' attention before making claims.
• Thesis The thesis expresses the central claim of the author with respect to the essay's topic.
• Main Idea The main ideas establish foundational ideas or aspects that are related to the thesis.
• Evidence The evidence elements provide examples or other evidence used to support the main ideas and thesis.
• Elaboration The elaboration elements further explain main ideas or provide reasons, but contain no examples or other evidence.
• Conclusion The conclusion extends the central argument, summarizes the full text, and echoes the thesis of the essay.
• Other Other elements refer to the ones that do not match the above classes.
The dataset has 1,230 argumentative essays written by high school students, covering diverse topics. The essays were collected from the website LeleKetang.1 We asked two annotators from a liberal arts college to assign discourse elements to sentences from these essays according to a manual. The annotators discussed to reach a consensus and refined the manual over several rounds. We use one annotator's annotation as the gold answer and the other's as the prediction, and compute F1 scores to measure the agreement, as shown in Figure 3. Table 1 shows the basic statistics of the dataset. The distribution of discourse elements is imbalanced: elaboration and evidence sentences far outnumber thesis and main idea sentences, and the other type accounts for a very small percentage of the dataset. The test set is 10% of the whole dataset.

The English Dataset
We also use the English student essay dataset released by . This dataset marks argument components, i.e., major claim, claim, and premise, at clause level. Table 2 shows an example sentence. The consecutive words in bold form three components, corresponding to claim, major claim and premise, respectively.
Because our model works at the sentence level, we convert the original annotations to sentence level. First, an essay is split into sentences with NLTK. Then, if a sentence contains only one argument component, we annotate the sentence with the type of that component; if a sentence contains more than one argument component, we further separate it into multiple sentences so that each sentence contains only one component. Each new sentence begins at the end of the previous component and ends at the end of the component it contains. As shown in Table 2, three sentences s_1, s_2 and s_3 are generated from the original example sentence. If a sentence does not contain any argument component, its label is other. Table 3 shows the basic statistics of the converted dataset.
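The conversion procedure can be sketched as follows (a hypothetical helper: the span representation, offsets and names are ours for illustration, not the released preprocessing code):

```python
def split_by_components(sentence, components):
    """Split a sentence so each piece contains at most one component.

    components: (start, end, label) character offsets into the sentence.
    Each new piece starts at the end of the previous component and ends
    at the end of the component it contains; a trailing remainder with
    no component is labeled 'other'.
    """
    pieces, prev_end = [], 0
    for start, end, label in sorted(components):
        pieces.append((sentence[prev_end:end], label))
        prev_end = end
    if prev_end < len(sentence):
        pieces.append((sentence[prev_end:], "other"))
    return pieces

# Toy example with made-up spans (not the Table 2 sentence):
sent = "Art is vital, but housing matters more."
comps = [(0, 12, "claim"), (18, 38, "premise")]
pieces = split_by_components(sent, comps)
```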

Experimental Settings
The maximum sentence length is set to 40 words; sentences are padded or truncated to this length. The Tencent pre-trained word embeddings (Song et al., 2018), with dimension 200, were used for experiments on the Chinese dataset. The BERT tokenizer and embeddings were used for experiments on the English dataset. The dimension of all BiLSTM hidden layers is 256 on the Chinese dataset and 128 on the English dataset, as is the dimension d_k. The dimension of the attention vectors is 15. The optimizer is stochastic gradient descent (SGD) with a learning rate of 0.1. The best models were selected for all settings based on results on the validation data, which is 10% of the training data.
We use accuracy (Acc.) and macro-F1 as evaluation metrics.
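Both metrics can be computed from scratch as a quick reference (the labels below are toy data, not from the datasets):

```python
def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    labels = sorted(set(gold) | set(pred))
    f1s = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    # macro-F1 averages per-class F1, so minority classes weigh equally
    return sum(f1s) / len(f1s)

gold = ["thesis", "elab", "elab", "elab", "evidence"]
pred = ["thesis", "elab", "elab", "evidence", "evidence"]
```

Macro-F1 is the more informative metric here because it is not dominated by the majority classes (elaboration, evidence).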

Comparisons
We compare with the following systems.
• Feature-based. We adapt features from previous feature-based methods (Burstein et al., 2003;Stab and Gurevych, 2014;Song et al., 2015) to build a feature-based CRF model.
• HBiLSTM. The baseline described in Section 3 uses two BiLSTM layers to encode word sequences and sentences.
• BERT. We fine-tune BERT on the training data to train a sentence classifier, because the lengths of many Chinese essays exceed the length constraint of BERT and it is expensive to train BERT-like models at the discourse level. Table 4 shows the performance of the baselines and DiSA. We can see that HBiLSTM performs even worse than the feature-based approach. HBiLSTM has a low macro-F1 score, indicating that it has difficulty identifying particular discourse elements. The two end-to-end models do not consider position information or interactions among sentences. The performance of BERT is worse than that of HBiLSTM, which verifies that sequence modeling is more appropriate than single-sentence classification for this task. DiSA achieves the best performance on all metrics, with a large improvement over the baselines.

System Comparisons
Figure 3 further illustrates system performance on identifying specific discourse elements. The human performance is also measured by considering one annotator's annotation as the answer, and the other one's as the prediction.
The discourse elements that HBiLSTM fails to accurately identify are thesis and main idea. Despite their importance for understanding a text, they are far less frequent than other discourse elements, which poses obstacles for data-driven approaches.
The feature-based method performs better than HBiLSTM on identifying thesis and main idea, but it relies heavily on feature engineering such as manually collected discourse markers and cue words. It does not perform well on identifying evidence due to the difficulty of designing related features.
DiSA is also an end-to-end model, like HBiLSTM, but performs much better. We discuss the impacts of positional encoding and inter-sentence attention in Sections 6.3.2 and 6.3.3.
Compared with the feature-based method, DiSA has comparable performance on identifying thesis but has superior results on identifying main idea (9% higher in F1 score) and evidence (21% higher in F1 score).

Analysis of Positional Encodings
This part investigates the effect of sentence positional encodings. We compare our relative sentence positional encoding (relativeSPE) with two other encoding strategies previously used for word sequences. Sinusoidal denotes the sinusoidal positional encoding used in the Transformer (Vaswani et al., 2017). PosEmbedding uses a distributed vector to represent an absolute position; the position embeddings are learned during training. Each of the three strategies is applied to model the global position, local position and paragraph position, which are then combined according to Equation 5. Table 5 lists the results of using different SPEs and modeling different positions. RelativeSPE performs best, with improvements of 2-3% macro-F1 over Sinusoidal and PosEmbedding. Without SPE, the metrics drop by at least 6.2% compared with using any SPE strategy, and by 8.6% compared with relativeSPE. If we explicitly add only pos_global, the results even decrease, perhaps because recurrent neural networks such as LSTMs naturally capture sequential positional information. However, encoding the paragraph position (pos_para) and the local position (pos_local) largely improves the performance. This indicates that proper structural positional encodings can exploit richer discourse structures than sequential structures. Table 6 shows the effects of removing inter-sentence attention (ISA) components from DiSA. We can see that both ElemSA and ContSA contribute, and ElemSA seems to have a larger effect on the macro-F1 score. Removing ISA, the accuracy and the macro-F1 score decrease by 1.8% and 2.2%, respectively.

Analysis of Inter-Sentence Attention
Recall that ISA uses attention vectors as representations rather than the final output αV of the self-attention module. Table 6 also lists the performance when αV replaces the attention vectors. The result is not good, indicating that the semantic relations among sentences are more important for DEI than the specific meaning of sentences. We further analyze ISA's impact on specific discourse elements. As shown in Table 7, ISA affects the identification of the minority discourse element thesis most. It also benefits identifying evidence, which is not a minority discourse element. Thesis sentences often relate to sentences in different essay zones, while evidence sentences mainly provide facts or examples and thus often relate to the local context in content. ISA helps capture such patterns. The performance on other types also increases to different degrees.
In any case, ISA provides a way to build useful representations by exploiting relations between sentences in the same text without any extra resources. Table 8 and Table 9 show the main experimental results on the English dataset.

Results on the English Dataset
The second column of Table 8 shows the results on distinguishing the four component types (i.e., major claim, claim, premise, other). DiSA outperforms the baselines by a large margin on both accuracy and macro-F1. Again, removing SPE leads to a large performance decrease. Prior work conducted argument component classification experiments (classifying a component into major claim, claim and premise) by assuming that argument components have been correctly distinguished from other parts. To compare with those results, during training, the other type is removed from the label set and only the losses over non-other sentences are accumulated.
The third column of Table 8 shows the comparison to the best previously reported results. DiSA does not perform competitively based on the distributed representation only, because the baseline uses strong hand-crafted features, especially the component position features, which rely on correct argument component information. Thus we build a feature vector by incorporating the indicator features and a component position feature (the number of preceding and following components in the paragraph), out of the 8 categories of features introduced in prior work. The vector is concatenated with the distributed representation. This combination obtains improvements, outperforms the Single-Best results, and achieves performance close to Joint-Best, which considers argumentative relation identification as an auxiliary task. We also attempted to apply the same strategy to the Chinese task, but the improvement is negligible. The reason may be that indicator phrases are used much less in Chinese essays than in English essays; the English dataset relies heavily on phrases signaling beliefs and on argumentative discourse connectives. Table 9 shows the macro-F1 scores of DiSA on identifying specific argument components. Without the ISA module, the identification of major claims and claims declines by 3% and 1.4% absolute F1 score, respectively. This is consistent with the experimental results on the Chinese dataset. Overall, the effectiveness of SPE and ISA is verified on both the Chinese and the English datasets.

Conclusion
We presented DiSA, a method to identify discourse elements in argumentative student essays by explicitly modeling structural positions and inter-sentence relations. The structural positional encoding considers the relative positions of a sentence and its paragraph. Moreover, we use inter-sentence attention vectors to capture sentence relations in content and function. Experiments on a Chinese dataset and an English dataset show that (i) although simple, the positional encoding largely improves performance, indicating that modeling structural positions is feasible and important for analyzing the role of sentences; (ii) discourse elements can be better identified with the help of inter-sentence attention vectors, especially the minority elements and those with distinct relation patterns to other sentences. In the future, we plan to evaluate DiSA on other discourse analysis tasks.