Design and Annotation of the First Italian Corpus for Text Simplification

In this paper, we present the design and construction of the first Italian corpus for automatic and semi-automatic text simplification. In line with current approaches, we propose a new annotation scheme specifically conceived to identify the typology of changes an original sentence undergoes when it is manually simplified. The scheme has been applied to two aligned Italian corpora, containing original texts with their corresponding simplified versions, selected as representative of two different manual simplification strategies and addressing different target reader populations. Each corpus was annotated with the operations foreseen in the annotation scheme, covering different levels of linguistic description. Annotation results were analysed with the final aim of capturing peculiarities and differences of the simplification strategies pursued in the two corpora.


Introduction and Background
Automatic Text Simplification (ATS) has received growing attention over the last few years due to its implications for both machine- and human-oriented tasks. ATS has been employed as a preprocessing step to improve the efficiency of e.g. parsing, machine translation and information extraction. Recently, ATS has been used in educational scenarios and assistive technologies, e.g. for the adaptation of texts to particular readers, such as children, L2 learners (Petersen and Ostendorf, 2007), people with low literacy skills (Aluísio et al., 2008), cognitive disabilities (Bott and Saggion, 2014) or language impairments, e.g. aphasia (Carroll et al., 1998) or deafness (Inui et al., 2003).
The purpose of ATS, within both perspectives, is to reduce lexical and syntactic complexity while preserving the original meaning of the text. To this aim, three main approaches have been followed. The more traditional one relies on hand-crafted rules (Chandrasekar et al., 1996; Siddharthan, 2002; Siddharthan, 2010; Siddharthan, 2011), which typically cover specific phenomena that are symptoms of linguistic complexity, especially at the syntactic level (e.g. passives, relative clauses, appositions). More recently, the availability of larger parallel corpora, i.e. sentence-aligned corpora consisting of both the original and the simplified version of the same text (e.g. English and Simple English Wikipedia, in short EW and SEW), has allowed a consistent use of machine learning techniques for automatically acquiring simplification rules. This is the approach followed by e.g. Woodsend and Lapata (2011), who based their ATS system on a quasi-synchronous grammar, Zhu et al. (2010), who adapted a Statistical Machine Translation (SMT) algorithm to implement simplification operations on the parse tree, and Narayan and Gardent (2014), who similarly adopted SMT techniques but combined them with a deep semantic representation of the sentence. Both hand-written and automatically acquired rules have advantages and shortcomings. While the former can potentially account for the maximum linguistic information, they are extremely costly to develop and tend to cover only a few lexical and syntactic constructs; on the other hand, data-driven approaches require the least linguistic knowledge, but they are not feasible without a large quantity of aligned data. Hybrid approaches seem to offer a good alternative: as shown by Siddharthan and Angrosh (2014), a system that combines automatically harvested lexical rules with hand-crafted syntactic rules outperformed the state of the art. Moreover, all these systems exploit the EW/SEW dataset as a training corpus.
Such resources are lacking for languages other than English, making it hardly possible to approach ATS as a pure machine learning task. For some of these languages, parallel monolingual corpora have been annotated with simplification rules corresponding to the transformations to be performed on a complex sentence: this is the approach followed by Brouwers et al. (2014) for French, Bott and Saggion (2014) for Spanish, and Caseli et al. (2009) for Brazilian Portuguese. A different approach is pursued by Specia (2010) for Brazilian Portuguese, who applied phrase-based statistical machine translation to a parallel corpus. For Basque, Aranzabe et al. (2013) used the output of a readability assessment system to detect complex sentences, which are then simplified by a large set of hand-crafted rules.
ATS approaches typically rely on the output of a syntactic parser, yet the main cause of errors for an ATS system is erroneous parses, even when state-of-the-art parsers are used (Siddharthan, 2011; Drndarević et al., 2013; Brouwers et al., 2014; Siddharthan and Angrosh, 2014). In particular, this concerns relative clause attachment and clause boundary identification (Siddharthan and Angrosh, 2014). According to Drndarević et al. (2013), one third of ATS errors depend on previous parsing errors, and Brouwers et al. (2014) report that 89% of text simplification (TS) errors are due to preprocessing errors.
ATS is largely underinvestigated for Italian. The only exception is Barlacchi and Tonelli (2013), who devised a rule-based architecture focusing on a limited set of linguistic structures; no previous study has addressed ATS for Italian using parallel corpora.

Our Contribution
We present the first Italian resource for automatic and semi-automatic text simplification. We collected and hand-aligned two monolingual corpora representative of two different strategies of manual simplification and addressing different target readers. The corpora were annotated with a set of rules designed to capture simplification operations at diverse levels of linguistic description. There are several motivations underlying the proposed approach. As a universal native speaker of simplified language does not exist (Siddharthan, 2014), ATS systems are typically specialized with respect to a specific target user. Hence, we introduce a new annotation scheme able to handle different simplification strategies, at the level of both method and target users. This is the starting point for the development of a flexible automatic or semi-automatic TS system. The proposed resource can be used to train a supervised classifier carrying out a semi-automatic TS task. In the semi-automatic scenario, the system will identify the areas of linguistic complexity within a sentence and suggest to the authors the most appropriate simplification rule for the intended audience and domain. Such a classifier, which uses the information extracted from the syntactic tree as only one of the features exploited to predict the rules to be applied, is expected to be more robust to syntactic parsing errors than TS systems based on hand-crafted or automatically acquired rules that rely heavily on parse-tree transformations. To give an idea of how wrong parses can affect a TS system, consider that the accuracy of the state-of-the-art dependency parser for Italian is 87.89% in terms of Labeled Attachment Score, corresponding to 293 erroneously parsed sentences out of a total of 376, i.e. 78% of the test sentences contain at least one parsing error.1
Moreover, it should be noted that in a TS scenario parsers are typically applied to domains different from those they were trained or developed on (i.e. an out-of-domain scenario), and it is widely acknowledged that state-of-the-art statistical parsers show a dramatic drop in accuracy when tested out of domain (Gildea, 2001).
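To make the relation between token-level and sentence-level parsing errors concrete, the following sketch computes a Labeled Attachment Score together with the share of sentences containing at least one parsing error. The function name and the toy data are illustrative, not part of the evaluation actually used for the Italian parser figures above:

```python
def las_and_sentence_error_rate(gold, pred):
    """gold/pred: lists of sentences; each sentence is a list of
    (head, label) tuples, one per token."""
    correct = total = bad_sentences = 0
    for g_sent, p_sent in zip(gold, pred):
        errors = sum(1 for g, p in zip(g_sent, p_sent) if g != p)
        correct += len(g_sent) - errors
        total += len(g_sent)
        bad_sentences += errors > 0  # one bad token taints the sentence
    return correct / total, bad_sentences / len(gold)

# Toy example: two sentences, one containing a single attachment error.
gold = [[(2, "det"), (0, "root")], [(2, "nsubj"), (0, "root"), (2, "obj")]]
pred = [[(2, "det"), (0, "root")], [(3, "nsubj"), (0, "root"), (2, "obj")]]
las, sent_err = las_and_sentence_error_rate(gold, pred)
print(las, sent_err)  # 0.8 0.5
```

Even a high token-level LAS can thus coexist with a very high proportion of sentences that a rule-based TS system would receive in partially wrong form.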
In this paper, we also carry out a comparative analysis of different TS strategies addressing different target users: this was made possible by the internal composition of the developed resource, which allowed us to investigate the effects of simplification rules on the linguistic peculiarities of the abridged texts with respect to their original versions.

Corpora
The annotated resource2 presented here is made up of two sub-corpora that can be considered representative of two different TS strategies: the "structural" and the "intuitive" strategy, following the definition of Allen (2009), who addressed TS in the context of L2 learning. The former relies on predefined graded lists (covering both the word and the structural level) or on traditional readability formulas. The latter depends on the author's teaching experience and personal judgments about the comprehension ability of learners. Despite some differences, this classification can be applied to our purposes.
The first sub-corpus (Terence) contains 32 short novels for children and their manually simplified versions.3 The simplification was carried out in a cumulative fashion with the aim of improving the comprehension of the original text at three different levels: global coherence, local cohesion and lexicon/syntax. To align the corpus, we selected the last two levels of simplification (i.e. local cohesion and lexicon/syntax), which were considered as the original and the simplified version respectively. This was motivated by the need to tackle only those textual simplification aspects with a counterpart at the morpho-syntactic and syntactic level. We hand-aligned the resulting 1,036 original sentences to the 1,060 simplified ones. The results (Table 1) provide some insights into the typology of human editing operations. In 90% of the cases a 1:1 alignment is reported; 39 original sentences (3.75%) have a 1:2 correspondence, indicating that a split occurred; 2 original sentences underwent a three-fold split (0.19%), i.e. they correspond to three sentences in the simplified version; 15 pairs of original sentences were merged into a single one (2.88%). Finally, the percentage of unaligned sentences is 1%.
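Alignment-type distributions like those just reported can be derived mechanically once each alignment unit is recorded as a pair of cardinalities. A minimal sketch, with toy data standing in for the actual Terence alignments:

```python
from collections import Counter

def alignment_stats(links):
    """links: list of (n_original, n_simplified) cardinalities,
    one per alignment unit, e.g. (1, 2) for a two-way split,
    (2, 1) for a merge of two original sentences."""
    counts = Counter(links)
    total = len(links)
    return {link: round(100 * n / total, 2) for link, n in counts.items()}

# Toy data: eight 1:1 links, one split (1:2), one merge (2:1).
links = [(1, 1)] * 8 + [(1, 2), (2, 1)]
print(alignment_stats(links))  # {(1, 1): 80.0, (1, 2): 10.0, (2, 1): 10.0}
```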
The second sub-corpus (Teacher) is composed of 24 pairs of original/simplified texts, collected from specialized educational websites providing free resources for teachers. They cover different textual genres, such as literature (e.g. extracts from famous Italian novels) and high-school handbooks on diverse subjects (e.g. history, geography), and they address different targets. Unlike Terence, the simplification was performed independently by a teacher, with the aim of adapting the text to the needs of the audience, typically L2 students with at least a B2 level in Italian. Thus, Teacher can be considered an instance of "intuitive" simplification: while the target is usually the same (i.e. L2 learners), each text was produced by a different author, and the interventions made on the text span different linguistic levels without any predefined distinction or hierarchy. On the contrary, Terence exemplifies a "structural" simplification, since: i) it was produced by a pool of experts; ii) it addressed a well-defined target; iii) it was consistent with a predefined guideline tackling the simplification along three separate textual dimensions. This can also explain the higher percentage of texts that were perfectly aligned at sentence level in Terence (92.1%, see Table 1) with respect to Teacher (68.32%).
To compare the two simplification strategies with respect to the effect of the simplification process, we evaluated the two corpora with the readability index existing for Italian, i.e. READ-IT (Dell'Orletta et al., 2011). For both corpora, we calculated Spearman's correlation between the scores obtained by different READ-IT models (i.e. using different types of linguistic features) on the original and on the simplified version. As reported in Table 2, the two simplified corpora are significantly correlated with all READ-IT models. In particular, Teacher is especially correlated with the model using a combination of raw text and lexical features (the READ-IT lexical model in Table 2). This possibly follows from the "intuitive" simplification process of Teacher, which mostly concerns lexical substitution operations.
Readability index   Terence   Teacher
READ-IT global      0.77*     0.47
READ-IT base        0.80*     0.50
READ-IT lexical     0.65*     0.72*
READ-IT syntax      0.54*     0.46

Table 2: Spearman's correlation between different READ-IT models and the simplified corpora. Significant correlations (p < 0.05) are bolded; those with p < 0.001 are also marked with *.
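The coefficients in Table 2 are plain Spearman correlations, i.e. Pearson's r computed on the ranks of the two score vectors. A self-contained sketch (equivalent in spirit to library routines such as `scipy.stats.spearmanr`, with average ranks for ties); the toy score vectors are illustrative:

```python
def _ranks(xs):
    # Average ranks, 1-based; tied values share the mean of their positions.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Perfectly monotone score pairs correlate at 1 (up to float rounding).
print(spearman([5.2, 7.1, 9.8], [1.0, 2.5, 3.3]))
```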
The two corpora were annotated by two undergraduate students in computational linguistics, who received preliminary training lessons on the simplification rules covered by the annotation tagset. Each student annotated a different corpus and all their annotations were verified by a trained linguist.

Simplification Annotation Scheme
We defined an annotation scheme covering six macro-categories: split, merge, reordering, insert, delete and transformation. Following Bott and Saggion (2014), we used a two-level structure, i.e. for some categories more specific subclasses have been introduced. Table 3 shows the tagset of the annotation scheme. In the following examples, extracted from the annotated corpus, we bolded the text span marked in the original sentence by each rule-tag and highlighted in italics the corresponding text span in the simplified version.4

Split: this is the most investigated operation in ATS, for both human- and machine-oriented applications. Typically, a split affects coordinate clauses (introduced by coordinating conjunctions, colons or semicolons), subordinate clauses (e.g. non-restrictive relative clauses), appositive and adverbial phrases. Nevertheless, we do not expect each such sentence to undergo a split, as the human expert may prefer not to detach two clauses, for instance when a subordinate clause provides the background information necessary to understand the matrix clause.

Merge: this is the reverse of split, i.e. the operation by which two (or more) original sentences are joined into a single simplified sentence. This transformation is less likely to be adopted, as it creates semantically denser sentences, which are more difficult to process (Kintsch and Keenan, 1973). Yet, to some extent (see the alignment results), this is a choice the expert can make, and it can be interesting to verify whether the sentences susceptible to being merged display any regular pattern of linguistic features that can be automatically captured.

Insert: the process of simplification may even result in a longer sentence, because of the insertion of words or phrases that provide supportive information to the original sentence.
Although the cognitive literature suggests reducing the inference load of a text, especially for less skilled or low-knowledge readers (Ozuru et al., 2009), it is difficult to predict what an author will actually add to the original sentence to make it clearer. It can happen that the sentence is elliptical, i.e. syntactically compressed, and the difficulty lies in the ability to retrieve the missing arguments, which are then made explicit as a result of the simplification. Our annotation scheme introduces two more specific tags to mark insertions: one for verbs and one for subjects. The latter signals the transformation of a covert subject into a lexical noun phrase.5

Delete: dropping redundant information is also a strategy for simplifying a text. As for the insert tag, deletion is largely unpredictable, although we can expect simplified sentences to contain fewer adjunct phrases (e.g. adverbs or adjectives). Such occurrences have been marked with the underspecified delete rule; two more restricted tags, delete verb and delete subj, have been introduced to signal, respectively, the deletion of a verb and of an overt subject (made implicit and recoverable through verb agreement morphology).

Transformation: this label covers six typologies of transformations that a sentence may undergo to become more comprehensible for the intended reader. Such modifications can affect the sentence at the lexical, morpho-syntactic and syntactic level, also giving rise to overlapping phenomena. Our annotation scheme is intended to cover the following phenomena.

-Lexical substitution (word level): a single word is replaced by another word (or more than one), usually a more common synonym or a less specific term. Given the relevance of lexical changes in TS, which is also confirmed by our results, previous works have proposed feasible ways to automate lexical simplification, e.g.
by relying on electronic resources, such as WordNet, or on word frequency lists (Drndarevic et al., 2012). However, synonym or hypernym replacements do not cover all the editing options, since we observed that an author might also restate the meaning of a complex word with a multi-word paraphrase.

-Verbal voice: this signals the transformation of a passive sentence into an active one, or vice versa. In both corpora very few examples of the latter were found; this result was expected, since passive sentences represent an instance of non-canonical order: they are acquired later by typically developing children (Maratsos, 1974; Bever, 1970) (for Italian, Cipriani et al., 1993; Ciccarelli, 1998) and have been reported as problematic for atypical populations, e.g. deaf children (Volpato, 2010). Yet, the "passivization" rule may still be productive in other textual typologies, where the author of the simplification may prefer not only to keep, but even to insert, a passive, in order to avoid syntactic constructs that are more unusual in Italian (such as impersonal sentences). This is also in line with the observations of Bott and Saggion (2014).

-Verbal features: Italian is a language with a rich inflectional paradigm, and changes affecting verbal features (mood, tense) have proven useful in discriminating between easy- and difficult-to-read texts in a readability assessment task (Dell'Orletta et al., 2011). Poor comprehenders find it difficult to properly master verb inflectional morphology; the same holds for other categories of atypical readers, e.g. dyslexics (Fiorin, 2009), but also for L2 learners (Sorace, 1993); thus, the simplification, according to the intended target, will probably alter the distribution of verbal features.

Simplification Rules and Linguistic Features
The analysis of the frequency distribution of each rule within the two annotated corpora (Table 3) allows us to capture similarities and variations across corpora representing two different TS strategies and addressed to diverse categories of readers. The majority of rules are similarly distributed across the two corpora, showing that a number of simplification choices are shared by a team of experts and by independent teachers. This is an interesting finding, as it might suggest the existence of an "independent" simplification process shared by approaches targeting different audiences and based on different simplification methods. Exceptions are represented by some rules involving verbs (i.e. transformation of verbal features and insert verb) and by anaphoric replacements. Concerning the latter, it should be noted that the Terence original version adopted here inherits previous sentence transformations covering, among others, anaphoric replacements. The different distribution of rules involving verbs might reflect both the different simplification choices related to the structural and intuitive simplification strategies and the different textual genres included in Teacher and Terence. For a more in-depth analysis of the impact and significance of each simplification rule, we focused on the most frequently applied rules and chose a set of features that are typically involved in automatic readability assessment and also express language-specific peculiarities. For each linguistic feature, we calculated Spearman's correlation between the feature values extracted from the original text and from the simplified version with respect to the selected rules.

Linguistic Features
The set of linguistic features spans different levels of linguistic analysis and is broadly classifiable into four main classes: raw text, lexical, morpho-syntactic and syntactic features, shortly described below. They were extracted from the corpora automatically tagged by the part-of-speech tagger described in Dell'Orletta (2009) and dependency-parsed by the DeSR parser (Attardi, 2006).

Raw text features (Features [1-2] in Table 4) are typically used within traditional readability metrics and include sentence length (average number of words per sentence) and word length (average number of characters per word).
Feature [3] refers to the percentage of unique words (types) in the sentence belonging to the Basic Italian Vocabulary (BIV) by De Mauro (2000). The BIV includes a list of 7,000 words highly familiar to Italian native speakers.
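Feature [3] reduces to a type-level set intersection; a minimal sketch, with a toy word set standing in for the 7,000-entry BIV list:

```python
def biv_coverage(tokens, basic_vocabulary):
    """Percentage of unique word types found in the basic vocabulary."""
    types = {t.lower() for t in tokens}
    if not types:
        return 0.0
    return 100 * len(types & basic_vocabulary) / len(types)

biv = {"il", "gatto", "dorme", "casa"}        # toy stand-in for the BIV
tokens = ["Il", "gatto", "soriano", "dorme"]  # one type out of vocabulary
print(biv_coverage(tokens, biv))  # 75.0
```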
The set of morpho-syntactic features [4-19] ranges from the probability distribution of part-of-speech types to the lexical density of the text, calculated as the ratio of content words (verbs, nouns, adjectives and adverbs) to the total number of lexical tokens in a text. It also includes verbal mood and tense distributions, a language-specific feature related to the rich verbal morphology of Italian.
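Lexical density, as defined above, is the ratio of content-word tokens to all tokens. A sketch over (token, POS) pairs, using an illustrative simplified tagset rather than the tagger's actual one:

```python
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}  # content-word tags (illustrative)

def lexical_density(tagged_tokens):
    """tagged_tokens: list of (token, pos) pairs for one text."""
    content = sum(1 for _, pos in tagged_tokens if pos in CONTENT_POS)
    return content / len(tagged_tokens)

sent = [("Il", "DET"), ("gatto", "NOUN"), ("dorme", "VERB"), ("qui", "ADV")]
print(lexical_density(sent))  # 0.75
```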
The set of syntactic features [20-35] captures different aspects of the syntactic structure, such as:

-parse tree depth features, going from the depth of the whole parse tree [26], calculated as the longest path from the root of the dependency tree to some leaf, to a more specific feature referring to the average depth of embedded complement 'chains' [23] governed by a nominal head and including either prepositional complements or nominal and adjectival modifiers;

-verbal predicate features, going from the arity of verbs [27], meant as the number of instantiated dependency links sharing the same verbal head (covering both arguments and modifiers), to the distribution of verbal roots with explicit subject [28] with respect to all sentence roots occurring in a text, and the relative ordering of subject and object with respect to the verbal head.

Table 4 illustrates the correlations between the linguistic features and the most frequently applied simplification rules. It can be noted that all the rules are strongly correlated with the linguistic features. This reveals that these rules have a great impact on the linguistic structure of the simplified text. It also shows the effectiveness of such features in capturing simplification operations at varying degrees of linguistic description. Interestingly, if we examine the significance values more in depth, we can observe a distinction between the two corpora. Terence reports a higher number of stronger correlations (i.e. p < 0.001) than Teacher. These results seem to provide evidence for the existence of different simplification strategies, which vary according to the person (i.e. expert vs. non-expert), textual genre and intended target. Specifically, the teachers prefer a more vocabulary-oriented simplification approach, as testified by a) the highest significant correlations being reported by the rules dealing with lexical replacements (i.e.
LexSub word and LexSub phrase) and b) the fact that the majority of significant correlations above 0.5 affect linguistic features [1] to [19], i.e. features not dealing with syntactic structure. This might suggest that, independently of the simplification rule adopted, the resulting sentence did not undergo a strong modification of its grammatical structure. This is not the case for the "structural" simplification, in which all the rules significantly correlate with both lexical/morpho-syntactic features (set [1-19]) and syntactic features (set [20-35]). On the other hand, the correlation results reported for the Delete, LexSub word and LexSub phrase rules reveal the existence of a common approach to simplification: in the two corpora these rules correlate with mostly the same linguistic features.
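Two of the syntactic features discussed above, parse tree depth [26] and verb arity [27], can be read directly off the dependency heads. A minimal sketch, assuming CoNLL-style encoding (1-based head indices, 0 marking the root); this is an illustration, not the actual feature extractor used with DeSR:

```python
from collections import Counter

def tree_depth(heads):
    """Longest path from the root to any token; heads[i] is the
    1-based head of token i+1, with 0 marking the root."""
    def depth(i):  # i is a 1-based token index
        return 0 if heads[i - 1] == 0 else 1 + depth(heads[i - 1])
    return max(depth(i) for i in range(1, len(heads) + 1))

def verb_arity(heads, pos_tags, verb_tag="VERB"):
    """Average number of dependents (arguments and modifiers)
    per verbal head."""
    verbs = [i + 1 for i, p in enumerate(pos_tags) if p == verb_tag]
    if not verbs:
        return 0.0
    dependents = Counter(heads)  # how many tokens attach to each head
    return sum(dependents[v] for v in verbs) / len(verbs)

# "Il gatto dorme": Il -> gatto -> dorme (root)
heads = [2, 3, 0]
pos = ["DET", "NOUN", "VERB"]
print(tree_depth(heads))       # 2
print(verb_arity(heads, pos))  # 1.0
```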

Correlation
As for the evaluation of the overall significance of each rule, we observe that a large number of correlations at ≥ 0.6 occur especially when Split and LexSub word were applied. Both these simplification operations are expected to greatly redefine the structure of the sentence: a split, for example, not only correlates with sentence length, but also reduces prepositional chains [23]. A split might be triggered by long noun phrases with a deverbal noun; to simplify them, the author may have chosen to turn them into an autonomous sentence, also adding a verb (see the high correlation between [23] and InsertVerb).

Conclusion
We have presented the first Italian corpus for text simplification. This annotated resource is composed of two monolingual parallel corpora, representing two different strategies of simplification: "structural" and "intuitive". We have defined an annotation scheme able to capture manual simplifications at different levels of linguistic structure as well as to handle the different strategies of simplification. We have carried out an in-depth analysis of the impact of each simplification rule with respect to a set of linguistic features related to text complexity. This study has highlighted the existence of an "independent" simplification process shared by the two considered simplification approaches targeting different audiences. We are currently using this finding in the development of a semi-automatic supervised TS system, trained on the two corpora, able to handle these shared simplification phenomena. Current developments are also devoted to refining the annotation scheme, also by testing its suitability for other corpora.

Table 4: Spearman's correlation between the most frequent rules and a subset of linguistic features. Significant correlations (p < 0.05) are bolded; those with p < 0.001 are also marked with *. For each column, the left value refers to Terence, the right value to Teacher.
K. Woodsend and M. Lapata. 2011. Learning to simplify sentences with quasi-synchronous grammar and integer programming. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 409-420.

Z. Zhu, D. Bernhard, and I. Gurevych. 2010. A monolingual tree-based translation model for sentence simplification. In Proceedings of the 23rd International Conference on Computational Linguistics, 1353-1361.