OneStopEnglish corpus: A new corpus for automatic readability assessment and text simplification

This paper describes the collection and compilation of the OneStopEnglish corpus of texts written at three reading levels, and demonstrates its usefulness through two applications: automatic readability assessment and automatic text simplification. The corpus consists of 189 texts, each in three versions (567 in total). The corpus is now freely available under a CC BY-SA 4.0 license, and we hope that it will foster further research on readability assessment and text simplification.


Introduction
Automatic Readability Assessment (ARA), the task of assessing the reading difficulty of a text, is a well-studied problem in computational linguistics (cf. Collins-Thompson, 2014). A related problem is Automatic Text Simplification (ATS; cf. Siddharthan, 2014), which aims to generate simplified texts from complex versions. While most of the research on these problems has focused on feature engineering and modeling, there is very little reported work on the creation of open-access corpora that support this research.
Corpora used in ARA were primarily derived from textbooks or news articles written for different target audiences. In most cases, the texts at different levels in these corpora are not comparable versions of each other. This makes it hard to develop fine-grained readability models that can identify which parts of a text are difficult compared to others, instead of assigning a single score to the whole text. Corpora of parallel texts simplified for different target reading levels can solve this problem and support better ARA models. ATS systems, on the other hand, need parallel corpora by default, and have primarily relied on parallel sentence pairs from Wikipedia-Simple Wikipedia for training and evaluating simplification models. While the availability and suitability of this corpus is definitely a positive aspect, the lack of additional corpora makes it difficult to evaluate the generalizability of simplification approaches.
Against this background, we created a corpus aligned at the text and sentence level, across three reading levels (beginner, intermediate, advanced), targeting English as Second Language (ESL) learners. To our knowledge, this is the first such freely available corpus in any language for readability assessment research. While a sentence-aligned corpus from the same source was discussed in previous research, the current corpus is larger and cleaner. In addition to describing the corpus, we demonstrate its usefulness for automatic readability classification and text simplification. The corpus is freely available. Its creation and relevance are described in the sections that follow: Section 2 describes other relevant corpus creation projects. Section 3 describes our corpus creation. Section 4 describes some preliminary experiments with readability assessment and text simplification using this corpus. Section 5 concludes the paper with pointers to future work.


Related Work
Washburne and Vogel (1926) and Vogel and Washburne (1928) can be considered among the earliest works on corpus creation for readability research; they collected a corpus of 700 books annotated by children in terms of reading difficulty. While there were other such efforts over the past century, corpora from those early projects are not available for current use. Contemporary approaches to readability assessment typically rely on compiling large corpora from the Web. The WeeklyReader magazine was used as a source for graded news texts in past ARA research (Petersen, 2007; Feng, 2010). Petersen and Ostendorf (2009) described a corpus of articles from Encyclopedia Britannica, where each article had a comparable "Elementary Version"; this corpus, however, is not freely available as far as we know. The WeeBit corpus, which combined WeeklyReader with BBC Bitesize, was used in several ARA approaches in the past few years.
Vajjala and Meurers (2013) described a large corpus of age-specific TV program transcripts from the BBC, and Napoles and Dredze (2010) used a corpus of Wikipedia-Simple Wikipedia articles. Hancke et al. (2012), Dell'Orletta et al. (2011), and Gonzalez-Dios et al. (2014) describe similar web-based corpus compilation efforts for German, Italian, and Basque, respectively.

Textbooks from school curricula were also used as training corpora for readability assessment models in the past (e.g., Heilman et al. (2008) for English, Berendes et al. (2017) for German, Islam et al. (2012) for Bangla). In all these cases, the grade level of a text was decided based on the target reader group (according to the website/textbook), which was determined by either publishers or authors. Another way of creating such corpora is through human annotation. The DeLite corpus (Vor der Brück et al., 2008) for German legal texts and the resources of van Oosten and Hoste (2011) and Clercq et al. (2014) for Dutch texts are crowd-annotated, whereas the Common Core Standards corpus described in Nelson et al. (2012) is annotated by experts according to the Common Core guidelines on text complexity. Corpora created with such human annotations are expensive to obtain and hence are generally smaller in size. Therefore, such corpora may not be sufficient to build new models, although they can serve as good evaluation datasets.
A primary concern with all these corpora is that the articles at different reading levels are not comparable versions of each other (except for Encyclopedia Britannica). The only other publicly and/or freely accessible readability corpus that potentially has parallel and comparable texts at multiple reading levels is the Newsela corpus, a corpus of manually simplified news texts. While that corpus is available for research under some license restrictions, it also addresses a different target audience: young L1 English learners. Against this background, we release an openly accessible corpus of texts with text- and sentence-level mapping across three reading levels, targeting L2 learners of English.
In terms of sentence-aligned corpora for text simplification, different versions of aligned Wikipedia-Simple Wikipedia sentences have been used in NLP research (Zhu et al., 2010; Coster and Kauchak, 2011; Hwang et al., 2015). Various supervised and unsupervised approaches have been proposed to construct such corpora (Bott and Saggion, 2011; Klerke and Søgaard, 2012; Klaper et al., 2013; Brunato et al., 2016). Our corpus adds a new resource for the English text simplification task.

Corpus
Our corpus was compiled from onestopenglish.com over the period 2013-2016. onestopenglish.com is an English language learning resources website run by Macmillan Education, with over 700,000 users across 100 countries. One of the features of the website is a weekly news lessons section, which contains articles sourced from The Guardian newspaper and rewritten by teachers to suit three levels of adult ESL learners (elementary, intermediate, and advanced). That is, content from the same original article is rewritten in three versions, to suit three reading levels. The advanced version is close to the original article, although not with exactly the same content. Texts from this source were previously used for training sentence-level readability models (e.g., Ambati et al., 2016; Howcroft and Demberg, 2017), for corpus analyses of the characteristics of simplified text (Allen, 2009), and in user studies about the relationship between text complexity and reading comprehension (Crossley et al., 2014), although the corpus was not publicly available in the past.

Original articles from the website consisted of PDF files containing the article text, some pre/post test questions, and other additional material, so the first step in the corpus creation process involved removing the irrelevant content. We first explored off-the-shelf PDF-to-text converters; while they worked, they did not always produce clean text, sometimes missing entire pages of content. We acquired permission from both Onestopenglish.com and The Guardian to release this plain-text version of the corpus.

We performed some preliminary corpus analysis of the three reading levels in terms of some common features used in the readability literature. Table 3 shows a summary of these results, using traditional features such as Flesch-Kincaid Grade Level (FKGL) (Kincaid et al., 1975), type-token ratio (TTR), and occurrences of different types of phrases, as given by the Stanford Parser (Chen and Manning, 2014). In general, all feature values decrease from ADV to ELE, which is expected if we assume all these features to be indicative of the reading level of a text.
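To make these two measures concrete, the sketch below computes FKGL and TTR for a text. It is a minimal illustration, not the analysis pipeline used for the corpus: the syllable counter is a crude vowel-group heuristic of our own, so exact FKGL values will differ from lexicon-based implementations, and the two example sentences are invented.

```python
import re

def count_syllables(word):
    # crude heuristic: count vowel groups; real tools use pronunciation lexicons
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text):
    # Flesch-Kincaid Grade Level (Kincaid et al., 1975):
    # 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

def ttr(text):
    # type-token ratio: distinct word types divided by total tokens
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    return len(set(words)) / len(words)

# invented example pair in the spirit of an ADV vs. ELE rewrite
adv = "The government subsequently implemented comprehensive legislation addressing environmental degradation."
ele = "The government then made new laws. The laws protect nature."
print(fkgl(adv) > fkgl(ele), ttr(adv) > ttr(ele))  # both higher for the advanced version
```

Both measures decrease from the advanced to the elementary version, matching the trend reported in Table 3.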

Experiments
We demonstrate the usefulness of this corpus for two applications: readability assessment and text simplification.

Readability Assessment
We modeled this as a classification problem, using both generic text classification features such as word n-grams and features typically used in readability classification research. The generic text classification features include:

1. Word n-grams: uni-, bi-, and trigram features
2. POS n-grams: bi- and trigrams of POS tags from the Stanford tagger (Toutanova et al., 2003)
3. Character n-grams: 2-5 character n-grams, considering word boundaries
4. Syntactic production rules: phrase structure production rules from the Stanford parser
5. Dependency relations: dependency relation triplets of the form (relation, head, word) from the Stanford dependency parser (Chen and Manning, 2014)

All n-gram features and grammar rules/relations that occurred at least 5 times in the entire corpus were retained for the final feature set. All these features were extracted using the LightSide text mining workbench (Mayfield and Rosé, 2013). Table 4 shows the classification results with these features, using a Sequential Minimal Optimization (SMO) classifier with a linear kernel (against a random baseline of 33%, as all classes are represented equally in the data).
Features                     Accuracy
Word n-grams                 61.38%
POS n-grams                  67.37%
Character n-grams            77.25%
Syntactic production rules   54.67%
Dependency relations         27.16%

Table 4: Classification accuracy with generic text classification features.

Character n-grams are the best performing group of generic features, achieving 77% accuracy. Data-driven features that rely on deeper linguistic representations perform poorly compared to these simple features; in particular, dependency relations perform worse than the random baseline. Since we are working with parallel texts, there is a lot of word-level overlap across reading levels, and hence it is not entirely surprising that word n-grams do not do well. Despite this, character n-grams do well. We speculate that they capture sub-word information about simplified text, such as the usage of certain suffixes or prefixes, which has to be explored further in future work.
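The following sketch illustrates what word-boundary-aware character n-gram extraction with a frequency cutoff looks like. The actual features were extracted with LightSide; this stdlib-only version only mimics the idea, and the `_` padding convention for word boundaries is our own assumption, not LightSide's exact scheme.

```python
from collections import Counter

def char_ngrams(text, nmin=2, nmax=5):
    # extract 2-5 character n-grams within word boundaries;
    # '_' padding marks word starts and ends
    feats = []
    for word in text.lower().split():
        padded = f"_{word}_"
        for n in range(nmin, nmax + 1):
            for i in range(len(padded) - n + 1):
                feats.append(padded[i:i + n])
    return feats

def build_vocab(docs, min_count=5):
    # keep only n-grams occurring at least min_count times in the corpus,
    # mirroring the frequency cutoff described above
    counts = Counter(g for doc in docs for g in char_ngrams(doc))
    return {g for g, c in counts.items() if c >= min_count}
```

A document would then be represented by the counts of its n-grams that survive the cutoff, and those vectors fed to a linear classifier.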
In addition to the generic features, we also trained classifiers with features that are typically used in ARA research:

1. Traditional features and formulae that have been used in ARA models in the past
2. Lexical variation, type-token ratio, and POS tag ratio based features
3. Features based on psycholinguistic databases
4. Features based on constituent parse trees
5. Discourse features, including:
• overlap measures among sentences in a document, as used in Coh-Metrix (Graesser et al., 2014)
• usage of different kinds of connectives, obtained from the discourse connectives tagger (Pitler and Nenkova, 2009)
• coreference chains in the text, from Stanford CoreNLP

The highest classification accuracy is achieved when all the features are put together, as shown in Table 5. However, this is less than a 1% improvement over character n-grams alone. Character n-grams as features for readability assessment were not explored in the past, and these results lead us to explore them further in future work. In terms of comparison with existing work on ARA, the highest reported accuracies are close to 90% on the WeeBit dataset. However, considering that we are comparing texts on the same topic, differing primarily in style rather than content, this is perhaps a more difficult dataset to model than other existing readability datasets.
Since we now have a corpus with parallel versions of sentences and paragraphs at different reading levels, one idea to explore further is to model readability assessment as a sentence- and paragraph-level pair-wise ranking problem, and then use those "local" readability assessments to infer "global" text-level readability (e.g., Chapter 5.5, Vajjala (2015)). Previous research (Ma et al., 2012) also showed that pair-wise ranking resulted in better readability models than classification. A combination of both approaches would be an interesting dimension to explore in future work.
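One simple way to set up such pair-wise ranking is to train a linear model on feature-difference vectors of aligned pairs, so that the learned weights order any two texts by difficulty. This is a minimal sketch under our own formulation (a plain perceptron on toy feature vectors), not the method of the cited works:

```python
def to_pairwise(pairs):
    # each pair: (features_of_harder_text, features_of_simpler_text);
    # emit difference vectors labelled +1 and -1 for a balanced training set
    data = []
    for hard, simple in pairs:
        diff = [h - s for h, s in zip(hard, simple)]
        data.append((diff, 1))
        data.append(([-d for d in diff], -1))
    return data

def train_perceptron(data, epochs=20):
    # standard perceptron updates on misclassified difference vectors
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def harder(w, a, b):
    # True if text with features a is predicted harder than text with features b
    return sum(wi * (ai - bi) for wi, ai, bi in zip(w, a, b)) > 0

# toy aligned pairs: (ADV-level features, ELE-level features)
pairs = [([10.0, 5.0], [4.0, 2.0]), ([8.0, 6.0], [3.0, 1.0])]
w = train_perceptron(to_pairwise(pairs))
```

Sentence- or paragraph-level decisions from such a ranker could then be aggregated into a text-level readability judgment.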

Text Simplification
Automatic Text Simplification (ATS) has commonly been modeled as a Phrase-Based Machine Translation (PBMT) problem in the literature. To demonstrate the usefulness of this corpus for ATS experiments, we used the ADV-ELE sentence-aligned version of the OSE corpus and treated simplification as a phrase-based machine translation problem. We split the dataset of 2,166 sentence pairs into 1,000 sentence pairs for training, 500 for development, and the remaining 666 pairs for testing. We did not explore a neural model, given the size of the dataset. We used Moses (Hoang et al., 2007) to train the model, and evaluated its performance on the test data in terms of various evaluation metrics used in MT research, comparing machine-generated output against the human simplifications.
This model resulted in a BLEU (Papineni et al., 2001) score of 54.45 and a METEOR (Denkowski and Lavie, 2014) score of 46. While these scores are not interpretable by themselves, general guidelines by Lavie (2011) suggest that BLEU and METEOR scores above 50 indicate understandable translations. Comparing with existing ATS results, Zhang and Lapata (2017) trained a neural MT model with 300K sentence pairs as training data and reported a much higher BLEU score of 88.85. The results on the current dataset (with 1,000 training sentence pairs and PBMT) cannot be directly compared with this result, especially considering the difference in dataset size. However, previous research showed that a high BLEU score on one corpus did not generalize when the test set came from another source (Chapter 6 in Vajjala, 2015). While our dataset may not be sufficient to build robust text simplification models on its own, it can be used to test the generalizability of state-of-the-art text simplification approaches, or combined with a larger dataset when training a simplification model.
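For readers unfamiliar with the metric, sentence-level BLEU can be sketched as below. This is an illustrative simplification: real evaluations (e.g., Moses' scripts) compute corpus-level BLEU by aggregating counts over the whole test set, and the whitespace tokenization here is a naive assumption.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    # geometric mean of clipped n-gram precisions (n=1..4), times a brevity
    # penalty that penalizes candidates shorter than the reference
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_counts = ngram_counts(cand, n)
        r_counts = ngram_counts(ref, n)
        overlap = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        if overlap == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        log_prec += math.log(overlap / sum(c_counts.values())) / max_n
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_prec)
```

A perfect match scores 1.0 (100 on the 0-100 scale used above), and any missing n-gram overlap lowers the score.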

Conclusion
In this paper, we described the creation of a new corpus for readability assessment and text simplification research, and demonstrated its usefulness for both tasks. The corpus is released with this paper, and we hope it will foster further research into readability assessment and text simplification systems aimed at ESL learners.
Beyond researchers interested in computational modeling, this corpus is also useful for other groups, such as: a) researchers interested in conducting user studies about the relationship between text simplification and reader comprehension, or between expert-annotated readability labels and target readers' comprehension of texts, and b) researchers interested in corpus studies with simplified and unsimplified texts, which can give insights into creating both manually and automatically simplified texts (e.g., Allen, 2009).

References

David Allen. 2009. A study of the role of relative clauses in the simplification of news texts for learners of English. System, 37(4).