Framework for the Analysis of Simplified Texts Taking Discourse into Account: the Basque Causal Relations as Case Study

Text simplification is crucial for some readers to understand the content of a text. Analyzing simplified texts can help to understand the mechanisms hidden in the process of simplification. In this paper we present a research framework to analyze the impact of simplification operations on discourse. To that end, we used the Corpus of Basque Simplified Texts (CBST) and studied the strategies followed in the simplification of causal relations and their effects at discourse level. From the analysis of this sample we conclude that discourse has not always been taken into account, which may lead to a lack of coherence in the simplified text.


Introduction and Related Work
Text Simplification is a research line that has been important in the educational community (Simensen, 1987; Young, 1999; Crossley et al., 2007), and it is also becoming important in the Natural Language Processing (NLP) community. Therefore, multidisciplinary researchers are working on different ways to perform text simplification by automatic or semi-automatic means. This task is known as Automatic or Automated Text Simplification (ATS), and its development has been explained in depth in the literature (Saggion, 2017).
In this work, we describe a framework to analyze simplified texts that takes discourse structure into account, following Rhetorical Structure Theory (RST) (Mann and Thompson, 1988), and we answer the following research questions:
− How can we describe the impact of simplification operations on discourse?
− How do simplification operations affect the rhetorical structures of the original texts?
This type of study needs annotated corpora, which are expensive but necessary. In the literature we can find corpora available for English (Petersen and Ostendorf, 2007; Xu et al., 2015; Pellow and Eskenazi, 2014), Danish (Klerke and Søgaard, 2012), German (Klaper et al., 2013), Brazilian Portuguese (Caseli et al., 2009), Spanish (Bott and Saggion, 2011), Italian (Brunato et al., 2015) and Basque. In the case of the last three corpora, simplification operations have been annotated and general annotation schemes derived. Besides, from the simplification perspective, Gonzalez-Dios et al. (2016) analyzed in the Basque corpus whether conditional, concessive, purpose, temporal and relative clauses 2 have been simplified or not and, if so, which macro-operations had been performed.
From the discourse perspective, Crossley et al. (2007) analyzed the cohesion of 105 texts taken from seven textbooks aimed at beginning learners of English as a second language with Coh-Metrix (Graesser et al., 2004). They focused on the following seven sets: i) causal cohesion, ii) connectives and logical operators, iii) coreference measures, iv) density of major parts of speech measures, v) polysemy and hypernymy measures, vi) syntactic complexity, and vii) word information and frequency measures. They found out among others that original

2 These clauses are the five most predictive features of the readability assessment system for Basque (Gonzalez-Dios et al., 2014).

Table 1: The original sentence Bernoulli 80 and its two simplified versions.
Original (translation): 'So, the form of the wings, though it is not the main motive of the flying, is very important, because it affects a lot the surrounding air flow.'
Structural: Beraz, hegoaren formak garrantzi handia du; izan ere, hegoaren formak inguruan duen airearen jarioan asko eragiten du. Hegoaren forma, ordea, ez da hegan egitearen lehen arrazoia.
Structural (translation): 'So, the form of the wings is very important; indeed, the form of the wings affects a lot the surrounding air flow. The form of the wings is not, however, the main motive of the flying.'
Intuitive (translation): 'So, the form of the wings, though it is not the main motive of the flying, is very important; indeed, it affects a lot the surrounding air flow.'

In the analysis of intuitively simplified texts, Crossley et al. (2012) found that advanced level texts exhibited less causal cohesion than beginning level texts.
To our knowledge, there is no joint framework to analyze simplified texts taking both simplification operations and discourse into account. That is why the aim of this paper is to propose a framework to measure how simplification operations affect relational discourse structure. In this study, we focus on the forms used to express causality because reducing causal discourse relations is crucial for people with language disorders. For example, Kong et al. (2017) stated that the discourse of speakers with aphasia tended to be less coherent and to miss essential information content. This can be measured because speakers with aphasia reduce some RST relations in their speech, such as ELABORATION and causal relations.
This paper is structured as follows: in Section 2 we present the resources needed to perform the analysis; in Section 3 we describe the framework for the analysis; in Section 4 we present the results of the quantitative analysis of the causal relations; and in Section 5 we conclude and outline future work.

Resources
In order to perform this study, we have used the Corpus of Basque Simplified Texts (CBST). This corpus is a collection of texts from the science popularisation domain, divided into 227 sentences. Each original sentence in the corpus has a structurally simplified and an intuitively simplified counterpart. In this corpus, the operations performed in order to simplify the sentences have been annotated following an annotation scheme composed of the following eight macro-operations: i) delete, ii) merge, iii) split, iv) transformation, v) insert, vi) reordering, vii) no operation and viii) other. These macro-operations involve many operations. In Table 1 we show the original sentence identified as Bernoulli 80 and its two simplified versions.
To create the cause subcorpus, we semi-automatically extracted the causal clauses as done by Gonzalez-Dios et al. (2016) and then, following the proposal of Iruskieta et al. (2016), we extracted the sentences containing causal discourse markers and causal lexical signals. The main figures of this sample are presented in Table 2. The number of causal structures found in the original sentences of the CBST is shown according to their type in Table 3: i) syntactically marked causal signals (Syntactic), ii) causal signals made explicit by discourse markers (DMs), and iii) causal relations signaled with nouns and verbs (Lexical).

Type | Simp. | RST | Joint
Syntactic | 17 | 3 | 3
DMs | 16 | 3 | 3
Lexical | 32 | 3 | 3

The additional resources used in this analysis are 1) a study of the frequencies and positions of the adverbial clauses (Gonzalez-Dios et al., 2015) in order to see the frequencies of the syntactic relations; 2) the corpus Zernola (Gonzalez-Dios et al., 2014) to see if the syntactic relations are also used in simple texts; and 3) a lemma frequency list (Gonzalez-Dios, 2016) to see the frequencies of the discourse markers and lexical signals.

Framework for the Analysis of Simplified Texts
In this section, we present the framework and the annotation required to perform the analysis of simplified texts taking discourse into account.

Simplification Annotation and Analysis
Following Gonzalez-Dios et al. (2016), we propose to annotate whether the target clauses, in our case the causal relations, have been treated or not (binary tagging) and, if so, which operations have been performed on each structure. In addition, in this study we add complementary descriptions such as clause length, syntactic depth (depth of the syntactic tree), surrounding phenomena and frequency information. These are the questions we propose:
a) Simplification treatment and macro-operations:
− Have the syntactic, DM and lexical signals been treated or not? In the case of the syntactic signals, we also analyze whether they have been treated according to the causal type defined by Euskaltzaindia (Euskaltzaindia, 2011): i) pure causal -(e)lako 'because', ii) causal explicative bait- 'since' and iii) pseudo-causal -(e)nez 'as'.
− When simplification is performed, which macro-operations have been performed? For each macro-operation, which exact operations? In the case of lexical signals, which operations according to the PoS (verbs or nouns)?
b) Length and depth:
− Are the sentences that have been split longer than the average length of the original clauses?
− Are the sentences that have been split inside another subordinate clause?
c) Frequencies:
− In the case of the syntactic signals, are they also frequent in other corpora? For this analysis, frequencies from other corpora are needed.
− When performing transformations, have the syntactic, DM and lexical signals been substituted with a more frequent equivalent?
d) Ordering:
− In the case of the syntactic signals, do the reordering operations suit the word order found in other corpora or the canonical RST relation order?
− Do they suit canonical or stylistic word or sentence orders?
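The treatment and length questions above can be computed directly from the annotations. The following is a minimal Python sketch under stated assumptions, not the implementation used in this study: the record type `CausalSignal`, its field names and the toy values are hypothetical, and only the macro-operation labels are taken from the annotation scheme.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class CausalSignal:
    """One annotated causal structure (hypothetical record layout)."""
    signal_type: str      # "syntactic", "DM" or "lexical"
    clause_length: int    # length of the original clause in words
    macro_operations: list = field(default_factory=list)  # e.g. ["split", "insert"]

    @property
    def treated(self) -> bool:
        # Binary tagging: treated if any macro-operation other than
        # "no operation" was annotated.
        return any(op != "no operation" for op in self.macro_operations)


def treatment_rate(signals, signal_type):
    """Percentage of the signals of one type that were treated."""
    subset = [s for s in signals if s.signal_type == signal_type]
    if not subset:
        return 0.0
    return 100.0 * sum(s.treated for s in subset) / len(subset)


def split_above_average(signals):
    """Return (split clauses at or above the average length, all split clauses)."""
    average = mean(s.clause_length for s in signals)
    split = [s for s in signals if "split" in s.macro_operations]
    return sum(s.clause_length >= average for s in split), len(split)


# Invented toy data, not taken from the CBST.
toy = [
    CausalSignal("syntactic", 9, ["split", "insert"]),
    CausalSignal("syntactic", 5, ["no operation"]),
    CausalSignal("DM", 7, ["transformation"]),
    CausalSignal("lexical", 3, []),
]

print(treatment_rate(toy, "syntactic"))
print(split_above_average(toy))
```

Keeping the macro-operations as a plain list per structure makes it easy to add the remaining questions (depth, frequency, ordering) as further functions over the same records.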

Discourse Annotation (RST) and Analysis
In the discourse analysis, we want to know whether the relations found in the original texts have been kept, modified or deleted in the simplified texts. To that end, we follow this procedure:
− Segmentation: automatic fine-grained discourse segmentation with EusEduSeg (Iruskieta and Zapirain, 2015), manually corrected following Iruskieta (2014). Output format: RS3.
− Rhetorical structure annotation: manual annotation with RSTTool (O'Donnell, 2000) following a modular and incremental annotation method (Pardo, 2005). Output format: RS3.
− Description of whether the nucleus-satellite order of the relations and the relation names were maintained or changed, using the Rhetorical DataBase (RhetDB) (Pardo, 2005).
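As an illustration of how annotated RS3 files could be inspected outside RhetDB, the sketch below counts relation names in one file with the Python standard library. It assumes the usual RS3 layout, in which `segment` and `group` elements carry a `relname` attribute and `span` is a structural link rather than a rhetorical relation; the sample document and the function name `count_relations` are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_relations(rs3_text):
    """Count relation names over all <segment> and <group> nodes."""
    root = ET.fromstring(rs3_text)
    counts = Counter()
    for node in root.iter():
        if node.tag in ("segment", "group"):
            relation = node.get("relname")
            # "span" only links a nucleus to its parent span; skip it.
            if relation and relation != "span":
                counts[relation] += 1
    return counts


# Invented sample in the assumed RS3 layout, not a CBST annotation.
SAMPLE = """<rst>
  <header>
    <relations>
      <rel name="cause" type="rst"/>
      <rel name="concession" type="rst"/>
    </relations>
  </header>
  <body>
    <segment id="1" parent="4" relname="span">nucleus text</segment>
    <segment id="2" parent="1" relname="concession">satellite text</segment>
    <segment id="3" parent="1" relname="cause">satellite text</segment>
    <group id="4" type="span"/>
  </body>
</rst>"""

print(count_relations(SAMPLE))
```

Running the same function over the original and the simplified version of a sentence gives the per-relation counts needed for the comparison described next.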
In order to describe the simplification operations at rhetorical structure level, we propose the following questions:
a) Rhetorical relations:
− What kind of rhetorical relations were deleted from the original sentences in the intuitive corpus-set and in the structural corpus-set?
− Which relations have been added for text simplification?
b) Ordering:
− Has the nucleus-satellite order been maintained in rhetorical relations? 4
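The two groups of questions above amount to a multiset comparison of relations plus a tally of nuclearity orders. A minimal sketch follows, assuming each annotated relation is reduced to a hypothetical `(relation name, order)` pair with order `"NS"` or `"SN"`; the toy lists are invented and do not reproduce any tree from the corpus.

```python
from collections import Counter


def relation_diff(original, simplified):
    """Return (deleted, added) relation counts between two versions."""
    orig = Counter(name for name, _ in original)
    simp = Counter(name for name, _ in simplified)
    # Counter subtraction keeps only positive counts.
    return dict(orig - simp), dict(simp - orig)


def order_share(relations):
    """Share of SN and NS orders among the given relations."""
    counts = Counter(order for _, order in relations)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {order: counts[order] / total for order in ("SN", "NS")}


# Invented toy annotation of one sentence pair.
original_toy = [("CAUSE", "NS"), ("CONCESSION", "SN"), ("ELABORATION", "NS")]
simplified_toy = [("CAUSE", "NS"), ("CONCESSION", "NS")]

deleted, added = relation_diff(original_toy, simplified_toy)
print(deleted, added)
print(order_share(original_toy), order_share(simplified_toy))
```

Counting names rather than matching individual tree nodes is a simplification: it detects that a relation type disappeared or appeared, but not where in the tree it was attached.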

Joint Annotation and Analysis
In order to join both analyses, and based on the previous annotation, we propose to analyze the influence of simplification operations on discourse by looking at the elementary discourse units (EDU), the central subconstituent (CSC) 5 and the rhetorical relations (RR). Specifically, we look at the impact that the simplification operations performed have on discourse. So, for each relation we write a description like the following one for the structurally simplified sentence presented in Table 1: i) an insert (hegoaren formak 'the shapes of the wings') has been performed in the clausal proposition; ii) two split and three insert operations (izan ere, Hegoaren forma 'due to the shape of the wings' and ordea 'however') have been performed in the surrounding phenomena.
Regarding rhetorical structure, we rely on the simplification annotation and on RST trees like the ones presented in Figure 1, where the rhetorical structure tree (RS-tree) of the original text is shown above and the RS-tree of the structurally simplified text is shown below. There are three main changes in Figure 1: i) there is one span missing (4 above and 3 below), ii) the CAUSE relation is attached directly to the most important EDU of the RS-tree (to the central subconstituent), and iii) the CONCESSION relation has a new order (SN above and NS below) and is attached to a bigger text span (EDU 1−2 below) 6.
In order to quantify and summarize that, these are the questions we propose:
a) Treatment in simplification:
− Has it been treated or not?
b) Elementary discourse unit (EDU):

This way we see how the simplification operations affect discourse.

4 This is important, as Mann and Thompson (1987) state: "if a natural text is rewritten to convert the instances of non-canonical span order to canonical order, it seldom reduces text quality and often improves it".
5 The CSC is the salient EDU of a text span.
6 Other changes were made in the signaling of the relations: in the CAUSE relation, the causal subordinator -lako 'since' was changed into the explicative connector izan ere 'since'.

Results of the Quantitative Analysis
In this section, we present the results and analysis of the causal relations (our sample) according to the framework presented in Section 3.

Results of Simplification Analysis
Treatment and macro-operations: In Table 4 we present the results in relation to the treatment in both simplification approaches. As we can see: i) more syntactic signals have been treated in the intuitive approach; ii) results for the lexical signals are similar; and iii) discourse markers do not seem to be treated in any case. Focusing on the different types of causal syntactic signals (Table 5), we see that there is a tendency to treat the pure causal -(e)lako 'because' in the structural approach, while the explicative bait- 'since' is treated in the intuitive approach.

Type | Structural | Intuitive
Pure -(e)lako | 55.56 (4/9) | 33.33 (3/9)
Explicative bait- | 40.00 (2/5) | 100.00 (5/5)
Pseudo -(e)nez | 33.33 (1/3) | 100.00 (3/3)

Looking at the macro-operations (Table 6), we see that, in our sample, while the syntactic signals undergo split and transformation operations, the discourse markers

Comparing the approaches, it is noticeable that more split operations are performed in the structural approach and more transformations in the intuitive one. Specifically, the transformations performed on syntactic signals are: i) transforming a subordinate clause into a main clause, ii) reformulations (more than one operation, and paraphrases) and iii) changing the syntactic signal.
Regarding discourse markers, the transformation that has been performed is the substitution of a discourse marker for a more frequent one. The other macro-operations are delete and reordering.
In the case of the lexical signals, the operations performed vary according to the PoS. In Table 7 we present figures about the number of operations performed in nouns and verbs.
To summarize the analysis of the operations, we see that some macro-operations are restricted by the relation type and its PoS. That is, no split is applied to any of the causal DMs or noun causal signals. For example, in the causal clause of the sentence presented in Table 1, an insert has been performed in the structural approach; in the intuitive approach, a split, a transformation (subordinate to main clause) and an insert have been performed.
Length and depth: The average length of the causal clauses in our original sample is 7 words. In the intuitive approach, the split operations have been carried out in all the clauses with 7 or more words, but this only happens in 2 out of the 5 split operations carried out in the structural approach. In relation to depth, two of the split operations in the structural approach were performed in subordinate clauses inside subordinate clauses, e.g. a relative clause inside a noun clause.
Frequencies: Regarding the syntactic structures contained in the CBST, we have checked whether they are also frequent structures in the BDT corpus and in the Zernola corpus. As we can see, they are all frequent structures in both corpora (Table 8).
In Table 9 we present some transformation operations involving substitutions. Our analysis leads us to propose some preliminary conclusions: syntactic signals and DMs are not always substituted with more frequent equivalents, but sometimes with less ambiguous ones. As we see here, more frequent forms do not always mean simplicity.
Ordering: In relation to the reordering operations, we have analyzed whether the movements carried out in the simplified sentences at syntactic level suit the canonical word order or the order of clauses found in EPEC. In our sample no reordering was performed at that level. However, we did find an interesting reordering in the intuitive approach: a stylistic reordering took place in the signals in order to avoid the rear-burden 9.
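The frequency check described under Frequencies can be expressed as a lookup in a lemma frequency list. This is only a sketch: the function name `substitution_direction`, the dictionary layout and the counts are assumptions, and placeholder lemmas are used so that no real corpus frequencies are implied.

```python
def substitution_direction(source, target, lemma_frequency):
    """Classify a substitution by comparing the corpus frequency of two lemmas."""
    source_count = lemma_frequency.get(source)
    target_count = lemma_frequency.get(target)
    if source_count is None or target_count is None:
        return "unknown"  # at least one lemma is not in the list
    if target_count > source_count:
        return "more frequent"
    if target_count < source_count:
        return "less frequent"
    return "equally frequent"


# Invented counts for placeholder lemmas.
toy_frequencies = {"lemma_a": 120, "lemma_b": 450, "lemma_c": 30}

print(substitution_direction("lemma_a", "lemma_b", toy_frequencies))
print(substitution_direction("lemma_a", "lemma_c", toy_frequencies))
```

A "less frequent" result is not necessarily an error in the simplification: as noted above, a less frequent form may have been chosen because it is less ambiguous, which a frequency list alone cannot capture.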

Results of Discourse Analysis
In Table 10, we present the results obtained with the Rhetorical Database in the different corpus-sets regarding simplification approaches and rhetorical relations. The table gives the number (K) of all the relations and the differences (diff.) for each corpus-set: i) relations of the original texts (source text) in the first two columns, ii) relations of the intuitively simplified texts in the following two, and iii) relations of the structurally simplified texts in the last two. We can observe different simplification strategies in

9 "(...) "rear burden" (...) [is] the effect that occurs when some key elements for correct processing of the message (e.g. the verb) are pushed towards the end of the sentence, thus delaying and making more difficult the comprehension of the message by the receiver." (Maia-Larretxea, 2015, 68).

Using RhetDB, we extracted the nuclearity type (SN: satellite first and nucleus after; NS: nucleus first and satellite after) of all the hypotactic relations, together with their frequencies, and present them in Table 11.
Regarding Table 11, we see that the frequency of the causal relations (CAUSE, RESULT and PURPOSE) is higher in the original subcorpus, 0.411 (0.117 for SN and 0.294 for NS), than in the intuitive approach, 0.318 (SN: 0.09 and NS: 0.227), and in the structural approach, 0.3 (SN: 0.00 and NS: 0.3). This shows that there are fewer causal relations in the simplified datasets, as also found by Graesser et al. (2004) and Crossley et al. (2012), and that the NS order is preferred in the causal subgroup when any causal relation is maintained.
Table 9: Transformation operations involving substitutions.
Type | Transformation | Explanation
Syntactic | bait- -> -(e)lako | causal explicative substituted with a pure causal (less frequent)
DMs | horrez gain 'moreover' -> gainera 'in addition' | substituted with a more frequent
DMs | bada 'so', 'then', 'well' -> hala ere 'however' | substituted with a less frequent, but less ambiguous
Signals | eragile 'originator', 'promoter' -> arrazoi 'reason', 'cause', 'motive' | substituted with a more frequent near synonym

Another interesting observation is that the NS ordering has increased in the structural approach, whereas in the intuitive approach the SN ordering has increased (and, therefore, the NS ordering has decreased). The latter change moves the important message to the end of the structure, which makes it more difficult to keep in memory all the information needed to understand the sentence, above all in long sentences.
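The causal frequencies quoted from Table 11 can be recombined as a quick arithmetic check. The sketch below only re-adds the SN and NS components given in the text; the dictionary layout and variable names are assumptions. Note that the rounded intuitive components sum to 0.317 rather than the reported 0.318, presumably because the components themselves were rounded.

```python
# SN and NS frequencies of the causal subgroup, as quoted in the text.
causal_frequency = {
    "original": {"SN": 0.117, "NS": 0.294},
    "intuitive": {"SN": 0.09, "NS": 0.227},
    "structural": {"SN": 0.0, "NS": 0.3},
}

# Total causal frequency per corpus-set (SN + NS).
totals = {
    name: round(freq["SN"] + freq["NS"], 3)
    for name, freq in causal_frequency.items()
}

# Share of the NS order within the causal subgroup.
ns_share = {
    name: round(freq["NS"] / (freq["SN"] + freq["NS"]), 2)
    for name, freq in causal_frequency.items()
}

print(totals)
print(ns_share)
```

In all three corpus-sets the NS share is well above one half, which is the sense in which the NS order is preferred in the causal subgroup.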

Joint Analysis
The results of the joint analysis of our sample are presented in Table 12. To underline these results, we summarize the most important differences in Table 13. We observe that the simplification operations performed in the intuitive (Int.) and structural (Str.) approaches are similar when simplifying (Simpl.), maintaining or changing the EDUs (Changes in EDUs), performing changes in the CSC and maintaining the RRs. But there is a great difference when they establish a new rhetorical relation (see Table 13), because there are only 3 changed relations (in bold) in common: RESULT > CAUSE, CIRCUMSTANCE > CONDITION and +CONCESSION.

Concluding remarks
Table 13: Results of the joint analysis. The sign − marks something missing; for example, '−info' means that there is less information. The sign > means that the item on its left was changed into the item on its right.

As a conclusion of this joint analysis, we think that the rhetorical relations of the original texts were not always taken into account when simplifying them (although most of them were maintained). So, we want to propose for future simplification guidelines that not only lexis and syntax but also discourse should be taken into account. That is, if there is a significant discourse relation in the original text, it should be kept in the simplified text when it helps comprehension and deleted when it leads to confusion. Moreover, attention to discourse should not be limited to relations but should extend to the overall relational discourse structure: when simplifying text manually, the CSC and the same-unit constructions should also be treated carefully.
For automatic text simplification systems, the detection of the CSC should also be an important step, above all in cases where the main piece of information should be highlighted. The difficult task of detecting same-unit constructions could also be interesting, so that they can be removed as far as possible.

Conclusion and Future Work
In this paper, we present a framework for the analysis of simplified texts taking discourse into account. In the simplification analysis, we propose to analyze the treatment and its macro-operations, the length and depth, the frequencies and the reordering; in the discourse analysis, we propose to segment, annotate and describe the rhetorical relations; and, in the joint analysis, we propose to examine the impact of simplification operations on the elementary discourse units, central subconstituents and rhetorical relations. Preliminary results show that this framework is useful to describe simplified texts and that discourse is not always taken into account when simplifying texts in our datasets, with the risk of creating incoherent simplified texts. We have seen, for example, that some macro-operations such as split cannot be applied to all the relations and that being more frequent does not imply simplicity, as is often taken for granted.
Currently, we are searching for more simplified texts in Basque to get more data and asking more people to simplify them, in order to get rid of the possible bias caused by the people who simplified the texts. Moreover, we are annotating more rhetorical relations in the Corpus of Basque Simplified Texts (CBST) to understand and describe all the simplification mechanisms. In the near future, we also want to perform this analysis with entire texts and not only sentences.