Insights from Russian second language readability classification: complexity-dependent training requirements, and feature evaluation of multiple categories

I investigate Russian second language readability assessment using a machine-learning approach with a range of lexical, morphological, syntactic, and discourse features. Tested on a new collection of Russian L2 readability corpora, the model achieves an F-score of 0.671 and an adjacent accuracy of 0.919 on a 6-level classification task. Information gain and feature subset evaluation show that morphological features are collectively the most informative. Learning curves for binary classifiers reveal that fewer training data are needed to distinguish between beginning reading levels than between intermediate reading levels.


Introduction
Reading is one of the core skills in both first and second language learning, and it is arguably the most important means of accessing information in the modern world. Modern second language pedagogy typically includes reading as a major component of foreign language instruction. There has been debate regarding the use of authentic materials versus contrived materials, where authentic materials are defined as "A stretch of real language, produced by a real speaker or writer for a real audience and designed to convey a real message of some sort" (Morrow, 1977, p. 13). Many empirical studies have demonstrated advantages to using authentic materials, including increased linguistic, pragmatic, and discourse competence (Gilmore, 2007, citations in §3). However, Gilmore (2007) notes that "Finding appropriate authentic texts and designing tasks for them can, in itself, be an extremely time-consuming process." An appropriate text should arguably be interesting, linguistically relevant, authentic, recent, and at the appropriate reading level.
Tools to automatically identify a given text's complexity would help remove one of the most time-consuming steps of text selection, allowing teachers to focus on pedagogical aspects of text selection. Furthermore, these tools would also make it possible for learners to find appropriate texts for themselves.
A thorough conceptual and historical overview of readability research can be found in Vajjala (2015, §2.2). The last decade has seen a rise in research on readability classification, primarily focused on English, but also including French, German, Italian, Portuguese, and Swedish (Roll et al., 2007; Vor der Brück et al., 2008; Aluisio et al., 2010; Francois and Watrin, 2011; Dell'Orletta et al., 2011; Hancke et al., 2012; Pilán et al., 2015). Broadly speaking, these languages have limited morphology in comparison with Russian, which has relatively rich morphology among major world languages. It is therefore not surprising that morphology has received little attention in studies of automatic readability classification. One important exception is Hancke et al. (2012), which examines lexical, syntactic, and morphological features with a two-level corpus of German magazine articles. In their study, morphological features are collectively the most predictive category of features. Furthermore, when combining feature categories in groups of two or three, the highest performing combinations included the morphology category. If morphological features figure so prominently in German readability classification, then there is good reason to expect that they will be similarly informative for Russian second-language readability classification.
This article explores to what extent textual features based on morphological analysis can lead to successful readability classification of Russian texts for language learning. In Section 2, I give an overview of previous research on readability, including some work on Russian. The corpora collected for use in this study are described in Section 3. The features extracted for machine learning are outlined in Section 4. Results are discussed in Sections 5 and 6, and conclusions and outlook for future research are presented in Section 7.

Background
The history of empirical readability assessment began as early as 1880 (DuBay, 2006), with methods as simple as counting sentence length by hand. Today, research on readability is dominated by machine-learning approaches that automatically extract complex features based on surface wordforms, part-of-speech analysis, syntactic parses, and models of lexical difficulty. In this section, I give an abbreviated history of the various approaches to readability assessment, including the kinds of textual features that have received attention. Although some proprietary solutions are relevant here, I focus primarily on work that has resulted in publicly available knowledge and resources.

History of evaluating text complexity
The earliest approaches to readability analysis consisted of developing readability formulas, which combined a small number of easily countable features, such as average sentence length and average word length (Kincaid et al., 1975; Coleman and Liau, 1975). Although formulas for computing readability have been criticized for being overly simplistic, they were quickly adopted and remain in widespread use today. An early extension of these simple 'counting' formulas was to additionally rely on lists of words deemed "easy", based on either their frequency or polling of young learners (Dale and Chall, 1948; Chall and Dale, 1995; Stenner, 1996). A higher proportion of words belonging to these lists resulted in lower readability measures, and vice versa.
With the recent growth of natural language processing techniques, it has become possible to extract information about the lexical and/or syntactic structure of a text, and automatically train readability models using machine-learning techniques. Some of the earliest attempts at this built unigram language models based on American textbooks, and estimated a text's reading level by testing how well it was described by each unigram model (Si and Callan, 2001; Collins-Thompson and Callan, 2004). This approach was extended in the REAP project to include a number of grammatical features as well (Heilman et al., 2007; Heilman et al., 2008a; Heilman et al., 2008b).
Over time, readability researchers have increasingly taken inspiration from various subfields of linguistics to identify features for modeling readability, including syntax (Schwarm and Ostendorf, 2005; Petersen and Ostendorf, 2009), discourse (Feng, 2010), textual coherence (Graesser et al., 2004; Crossley et al., 2007a; Crossley et al., 2007b; Crossley et al., 2008), and second language acquisition. The present study expands this enterprise by examining second language readability for Russian.

Automatic readability assessment of Russian texts
The history of readability assessment of Russian texts follows a trajectory very similar to the work reviewed above. Early work was based on developing formulas from simple countable features (Mikk, 1974; Oborneva, 2005; Oborneva, 2006a; Oborneva, 2006b; Mizernov and Graščenko, 2015). Some researchers have tried to be more objective about defining readability, by obtaining data from expert raters, or from other experimental means, and then performing statistical analysis, such as linear regression or correlation, to identify important factors of text complexity (Sharoff et al., 2008; Petrova and Okladnikova, 2009; Okladnikova, 2010; Špakovskij, 2003; Špakovskij, 2008; Ivanov, 2013; Kotlyarov, 2015), such as lexical properties, morphological categories, typographic layout, and syntactic complexity.
To my knowledge, only one study has previously examined readability in the context of Russian second-language pedagogical texts. Karpov et al. (2014) performed a series of experiments using several different kinds of machine-learning models to automatically classify Russian text complexity, as well as single-sentence complexity. They collected a small corpus of texts (described in Section 3 below), with texts at 4 of the CEFR levels: A1, A2, B1, and C2. They extracted 25 features from these texts, including document length, sentence length, word length, lexicon difficulty, and presence of each part of speech. No morphological features were included, despite the fact that morphology is the most challenging feature of Russian grammar for most language learners. Using Classification Tree, SVM, and Logistic Regression models for binary classification (A1-C2, A2-C2, and B1-C2), they report achieving accuracy close to 100%. It should be noted that no results were reported for the more customary stepwise binary combinations, such as A1-A2, A2-B1, and B1-C2, which are more difficult, and more useful, distinctions. In a four-way classification task, they state that their results were lower, but they only provide precision, recall, and accuracy metrics for the B1 readability level, which were as high as 99%. Irregularities in reporting make it difficult to draw firm conclusions from their work, especially because their corpora covered only four out of six CEFR levels with no more than 60 data points per level.

Corpora
The corpora in this study all use the same scale for rating L2 readability, the Common European Framework of Reference for Languages (CEFR). The six common reference levels of CEFR can be divided into three broad levels, Basic user (A), Independent user (B), and Proficient user (C), each of which is subdivided into two levels. This yields the following six levels in ascending order: A1, A2, B1, B2, C1, and C2. For all corpora, reading levels were assigned by the original author or publisher, so there is no guarantee that the reading levels between corpora align well.
Two subcorpora were used by Karpov et al. (2014). The CIE corpus includes texts created by teachers for learners of Russian. These texts are taken from a collection of materials kept in an open repository at http://texts.cie.ru. The second subcorpus used by Karpov et al. (2014) consists of 50 original news articles for native readers, rated at level C2.
The LingQ corpus (LQ) is a corpus of texts from http://www.lingq.com, a commercial language-learning website that includes lessons uploaded by member enthusiasts, with 3481 texts. Reading levels were determined by the member who uploaded each lesson.
The Red Kalinka (RK) corpus is a collection of 99 texts taken from 13 books in the "Russian books with audio" series available at http://www.redkalinka.com. These books include stories, dialogues, texts about Russian culture, and business dialogues.
The TORFL corpus comes from the Test of Russian as a Foreign Language, a set of standardized tests administered by the Russian Ministry of Education and Science. It is a collection of 168 texts that I extracted from official practice tests for the TORFL.
The Zlatoust corpus (Zlat) comes from a series of readers for language learners at the lower CEFR levels, with 746 documents.
The Combined corpus is a combination of the corpora described above. The distribution of documents per level is given in Table 1. Note that some corpora do not have texts at every reading level.

Table 1: Documents per reading level in each corpus.

Corpus   All    A1   A2   B1    B2    C1   C2
CIE      145    28   57   60    -     -    -
news     50     -    -    -     -     -    50
LQ       3481   323  653  716   832   609  348
RK       99     40   18   17    18    6    -
TORFL    168    31   36   36    26    28   11
Zlat.    746    -    66   553   127   -    -
Comb.    4689   422  830  1382  1003  643  409

Table 2 shows the median document length (in words) per level in each of the corpora. The overall median document size is 268 words. Within each corpus, median document length tends to increase with reading level. The overall distribution of document length is shown in Figure 1, where the x-axis is all documents ranked by document length and the y-axis is document length. The shortest document contains 7 words, and the longest document contains over 9000 words.

Features

In the following sections, I give an overview of the features used in this study, both the rationale for their inclusion and details regarding their operationalization and implementation. I combine features used in previous research with some novel features based on morphological analysis. I divide features into the following categories: lexical, morphological, syntactic, and discourse.
LEXV The lexical variability category contains features that are intended to measure the variety of lexemes found in a document. One of the most basic measures of lexical variability is the type-token ratio, which is the number of unique wordforms divided by the number of tokens in a text. Because the type-token ratio is dependent on document length, I included more robust metrics that have been proposed, such as the corrected type-token ratio and the Uber Index (log²T / log(N/T)). For all of these metrics, a higher score signifies a higher concentration of unique tokens, which is associated with more difficult reading levels.
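These variability metrics are straightforward to compute from a token list. The following is a minimal sketch for the plain type-token ratio and the Uber Index; the study's actual feature inventory is richer than this:

```python
import math

def lexical_variability(tokens):
    """Type-token ratio and Uber Index for a list of tokens."""
    n = len(tokens)        # token count N
    t = len(set(tokens))   # type count T (unique wordforms)
    ttr = t / n            # plain type-token ratio
    # Uber Index: (log T)^2 / log(N/T); undefined when every token is unique
    uber = (math.log(t) ** 2) / math.log(n / t) if n > t else float("inf")
    return ttr, uber

ttr, uber = lexical_variability("the cat saw the other cat".split())
```

Higher values of either metric indicate a denser concentration of unique wordforms, which the study associates with higher reading levels.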
LEXC Lexical complexity includes multiple concepts. One is the degree to which individual words can be parsed into component morphemes. This is a reflection of the derivational or agglutinative structure of words. Another measure of lexical complexity is word length, which reflects the difficulty of chunking and storing words in short-term memory. Depending on the particulars of a given language or the development level of a given learner, lexical complexity can either inhibit or enhance comprehension. For example, the word neftepererabatyvajuščij (zavod) 'oil-refining (factory)' is overwhelming for a beginning learner, but an advanced learner who has never seen this word can easily deduce its meaning by recognizing its component morphemes: nefte-pere-rabat-yvaj-uščij 'oil-re-work-IPFV-ing'.
Word length features were computed on the basis of characters, syllables, and morphemes. For each of these three, both an average and a maximum were computed. In addition, all six of these features were computed both for all words and for content words only. The features for word length in morphemes were computed on the basis of Tixonov's Morpho-orthographic dictionary (Tixonov, 2002), which contains parses for about 100 000 words. Words not found in the dictionary were ignored. In addition to average and maximum word lengths, I also followed Karpov et al. (2014) in calculating word length bands, such as the proportion of words with five or more characters. These bands are calculated for 5-13 characters (9 features) and 3-6 syllables (4 features). All 13 of these features were calculated both for all words and for content words only.
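As an illustration, the character-based lengths and bands can be sketched as below. Whether the bands count words with at least k characters or exactly k characters is my assumption; I use "at least k" here, matching the "five or more characters" example above:

```python
def char_length_features(words):
    """Average and maximum word length in characters, plus band features:
    the proportion of words with at least k characters, for k = 5..13."""
    lengths = [len(w) for w in words]
    avg_len = sum(lengths) / len(lengths)
    max_len = max(lengths)
    bands = {k: sum(1 for n in lengths if n >= k) / len(lengths)
             for k in range(5, 14)}
    return avg_len, max_len, bands

# Transliterated toy input; the real features operate on Cyrillic tokens.
avg_len, max_len, bands = char_length_features(["kot", "moloko", "pererabotka"])
```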
LEXF Lexical familiarity features were computed to attempt to capture the degree to which the words of a text are familiar to readers of various levels. These features model the development of learners' vocabulary from level to level. Unlike the features for lexical variability and lexical complexity, which are primarily based on surface structure, the features for lexical familiarity rely on predefined frequency lists or lexicons.
The first set of lexical familiarity features is derived from the official "Lexical Minimum" lists for the TORFL examinations. The lexical minimum lists are compiled for the four lowest levels (A1, A2, B1, and B2), where each list contains the words that should be mastered for the tests at each level. These lists can be seen as prescriptive vocabulary for language learners. Following Karpov et al. (2014), I computed features for the proportion of words above a given reading level.
The second set of lexical familiarity features is taken from the Kelly Project (Kilgarriff et al., 2014), which is a "corpus-based vocabulary list" for language learners. These lists are based primarily on word frequency, with manual adjustments made by professional teachers. Just like the features based on the Lexical Minimum, I computed the proportion of words over each of the six CEFR levels.
The third set of lexical familiarity features is based on raw frequency and frequency rank for both lemma frequency and token frequency. For each of the four kinds of frequency data, I computed the average, median, minimum, and standard deviation. (The following parts of speech were considered content words: adjectives, adverbs, nouns, and verbs. Lemma frequency data were taken from Ljaševskaja and Šarov (2009), available digitally at http://dict.ruslang.ru/freq.php, which is based on data from the Russian National Corpus; token frequency data were taken directly from the Russian National Corpus webpage at http://ruscorpora.ru/corpora-freq.html.)

Morphological features (MORPH)
Morphological features are primarily based on morphosyntactic values, as output by an automatic morphological analyzer. The first three sets of features reflect simple counts of whether a morphosyntactic tag is present or what proportion of tokens receive each morphosyntactic tag. The first set of features expresses whether a given morphosyntactic tag is present in the document. A second set of features expresses the ratio of tokens with each morphosyntactic tag, normalized by token count. A third set of features, the value-feature ratio (VFR), was calculated as the number of tokens that express a morphosyntactic value (e.g. past), normalized by the number of tokens that express the corresponding morphosyntactic feature (e.g. tense).
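The value-feature ratio can be sketched as follows; the dictionary-of-readings format here is my simplification, not the actual output format of the morphological analyzer used in the study:

```python
def value_feature_ratio(readings, feature, value):
    """Tokens expressing a morphosyntactic value (e.g. 'past'), normalized
    by tokens expressing the corresponding feature at all (e.g. 'tense')."""
    with_feature = [r for r in readings if feature in r]
    if not with_feature:
        return 0.0
    return sum(1 for r in with_feature if r[feature] == value) / len(with_feature)

# Toy analyses: one reading (a feature -> value map) per token.
readings = [
    {"pos": "verb", "tense": "past"},
    {"pos": "verb", "tense": "pres"},
    {"pos": "noun", "case": "nom"},
]
vfr = value_feature_ratio(readings, "tense", "past")  # 1 of 2 tensed tokens
```

Unlike the plain tag ratio, which is normalized by total token count, the VFR asks how a grammatical category is distributed among only the tokens that express that category at all.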
In the early stages of learning Russian, learners do not have knowledge of all six cases, so I hypothesized that texts intended for the lowest reading level might be distinguished by a limited number of attested cases. Similarly, two subcases in Russian, partitive genitive and second locative, are generally rare, but are overrepresented in texts written for beginners who are being introduced to these subcases. Two features were computed to capture these intuitions: the number of cases and the number of subcases attested in the document.
Following Nikin et al. (2007), Krioni et al. (2008), and Filippova (2010), I calculated a feature to measure the proportion of abstract words. This was done by using a regular expression to test lemmas for the presence of a number of abstract derivational suffixes. This feature is normalized by the number of tokens in the document.

Sentence length-based features (SENT)
The SENT category consists of features that include in their computation some form of sentence length: words per sentence, syllables per sentence, letters per sentence, coordinating conjunctions per sentence, and subordinating conjunctions per sentence. In addition, I also compute the type frequency of morphosyntactic readings per sentence. This category also includes the traditional readability formulas: Russian Flesch Reading Ease (Oborneva, 2006a), Flesch Reading Ease, Flesch-Kincaid Grade Level, and the Coleman-Liau Index.
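For reference, the classic Flesch Reading Ease combines average sentence length (ASL, words per sentence) and average word length (ASW, syllables per word). Oborneva's Russian adaptation is usually cited with rescaled coefficients of 1.3 and 60.1; those exact values are an assumption on my part, not taken from this paper:

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables,
                        asl_coef=1.015, asw_coef=84.6):
    """Flesch Reading Ease; higher scores mean easier text. The defaults
    are the classic English coefficients; pass asl_coef=1.3, asw_coef=60.1
    for the commonly cited Russian rescaling (an assumption here)."""
    asl = n_words / n_sentences   # average sentence length in words
    asw = n_syllables / n_words   # average syllables per word
    return 206.835 - asl_coef * asl - asw_coef * asw
```

The other formulas in this category (Flesch-Kincaid Grade Level, Coleman-Liau Index) follow the same pattern of a linear combination of simple per-sentence and per-word counts.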

Syntactic features (SYNT)
Syntactic features for this study were primarily based on the output of the hunpos trigram part-of-speech tagger and the maltparser syntactic dependency parser, both trained on the SynTagRus treebank. Using maltoptimizer, I found that the best-performing algorithm was Nivre Eager, which achieved a labeled attachment score of 81.29% with cross-validation on SynTagRus.
Researchers of automatic readability classification and closely related tasks have used a number of syntactic dependency features which I also implement here (Yannakoudakis et al., 2011;Dell'Orletta et al., 2011;Vor der Brück and Hartrumpf, 2007;Vor der Brück et al., 2008). These include features based on dependency lengths (the number of tokens intervening between a dependent and its head), as well as the number of dependents belonging to particular parts of speech, in particular nouns and verbs. In addition, I also include features based on dependency tree depth (the path length from root to leaves).
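Given a dependency parse encoded as an array of head indices, the two tree measures can be computed directly. This toy encoding (0-based token indices, -1 for the root) is mine for illustration, not maltparser's output format:

```python
def dependency_features(heads):
    """heads[i] is the index of token i's head, or -1 for the root.
    Returns the maximum dependency length (tokens intervening between a
    dependent and its head) and the tree depth (longest root-to-leaf path)."""
    max_dep_len = max((abs(i - h) - 1 for i, h in enumerate(heads) if h >= 0),
                      default=0)

    def depth(i):
        # Walk from token i up to the root, counting arcs traversed.
        d = 0
        while heads[i] >= 0:
            i, d = heads[i], d + 1
        return d

    return max_dep_len, max(depth(i) for i in range(len(heads)))

# "the old dog slept": 'the' and 'old' depend on 'dog', 'dog' on 'slept'.
dep_len, tree_depth = dependency_features([2, 2, 3, -1])
```

Longer dependencies and deeper trees both increase the memory load on the reader, which is why these measures correlate with text difficulty.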

Discourse/content features (DISC)
The discourse/content features (DISC) are intended to capture the broader difficulty of understanding the text as a whole, rather than the difficulty of processing the linguistic structure of particular words or sentences. One set of features is based on definitions (Krioni et al., 2008): words and phrases that are used to introduce or define new terms in a text. Using regular expressions, I calculate definitions per token and definitions per sentence.
Another set of features is adapted from the work of Brown et al. (2007), who show that logical propositional density, a fundamental measurement in the study of discourse comprehension, can be accurately measured purely on the basis of part-of-speech counts. One other feature is based on the intuition that reading dialogic texts is generally easier than reading prose. This feature is computed as the number of dialog symbols per token (in Russian, various dashes and the colon are used to mark turns in a dialog).

Summary of features
As outlined in the preceding sections, this study makes use of 179 features. Many of the features are inspired by previous research of readability, both for Russian and for other languages. The distribution of these features across categories is shown in Table 3.

Results
The machine-learning and evaluation for this study were performed using the weka data mining software (Hall et al., 2009). Based on preliminary tests, the Random Forest model was selected as the classifier algorithm for the study. (Other classifiers that consistently performed well were NNge, nearest-neighbor with non-nested generalized exemplars; FT, Functional Trees; MultilayerPerceptron; and SMO, sequential minimal optimization for support vector machines.) All results reported below are achieved using the Random Forest algorithm with default parameters. Unless otherwise specified, evaluation was performed using ten-fold cross-validation. Results are given in Table 4. Precision is a measure of how many of the documents predicted to be at a given readability level are actually at that level (true positives divided by true and false positives).
Recall measures how many of the documents at a given readability level are predicted correctly (true positives divided by true positives and false negatives). The two metrics are calculated for each reading level and weighted averages are reported for the classifier as a whole. The F-score is the harmonic mean of precision and recall. Adjacent accuracy is the same as weighted recall, except that it considers predictions that are off by one category as correct. For example, a B2 document is counted as being correctly classified if the classifier predicts B1, B2, or C1. The baseline performance achieved by predicting the mode reading level (B1), using weka's ZeroR classifier, is precision 0.097 and recall 0.312 (F-score 0.149). The OneR classifier, which is based on only the most informative feature (corrected type-token ratio), achieves precision 0.487 and recall 0.497 (F-score 0.471). The Random Forest classifier, trained on the full Combined corpus with all 179 features, achieves precision 0.69 and recall 0.677 (F-score 0.671), with adjacent accuracy 0.919.
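Adjacent accuracy is simple to compute from the gold and predicted level labels; a sketch:

```python
def adjacent_accuracy(gold, pred,
                      levels=("A1", "A2", "B1", "B2", "C1", "C2")):
    """Proportion of predictions within one level of the gold label, so a
    B2 document counts as correct if the classifier says B1, B2, or C1."""
    idx = {lev: i for i, lev in enumerate(levels)}
    hits = sum(1 for g, p in zip(gold, pred) if abs(idx[g] - idx[p]) <= 1)
    return hits / len(gold)

# Third prediction is off by two levels, so only 2 of 3 count as correct.
acc = adjacent_accuracy(["A1", "B1", "C2"], ["A2", "B1", "B2"])
```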

A confusion matrix is given in Table 5, which shows the predictions of the Random Forest classifier. The rows represent the actual reading level as specified in the gold standard, whereas the columns represent the reading level predicted by the classifier. Correct classifications appear along the diagonal. Table 5 shows that the majority of misclassifications are only off by one level, and indeed the adjacent accuracy is 0.919, which means that less than 10% of the documents are more than one level away from the gold standard.

Binary classifiers
Evaluation was performed with binary classifiers, in which the datasets contain only two adjacent readability levels. Since the Combined corpus has six levels, there are five binary classifier pairs: A1-A2, A2-B1, B1-B2, B2-C1, and C1-C2. The results of the cross-validation evaluation of these classifiers are given in Table 6. Red Kalinka and LQsupp (the second largest subcorpus of LingQ), which were judged to be the most reliable subcorpora, were also examined individually.
As expected, because the binary classifiers are more specialized, with less noise in the data and fewer levels to choose between, their accuracy is much higher.
One potentially interesting difference between binary classifiers at different levels is their learning curves, or in other words, the amount of training data needed to approach optimal results. I hypothesized that the binary classifiers at lower levels would need less data, because texts for beginners have limited possibilities for how they can vary without increasing complexity. Texts at higher reading levels, however, can vary in many different ways. To adapt Tolstoy's famous opening line to Anna Karenina, "All [simple texts] are similar to each other, but each [complex text] is [complex] in its own way." If this is true, then binary classifiers at higher reading levels should require more data to reach the upper limit of their classifying accuracy. This prediction was tested by controlling the number of documents used in the training data for each binary classifier, while tracking the F-score on cross-validation. Results of this experiment are given in Figure 2. The results of this experiment support the hypothesized difference between binary classifier levels, albeit with some exceptions. The A1-A2 classifier rises quickly, and begins to level off after seeing about 40 documents. The A2-B1 classifier rises more gradually, and levels off after seeing about 55 documents. The B1-B2 classifier rises even more slowly, and does not level off within the scope of this figure.
Up to this point, the data confirm my hypothesis that lower levels require less training data. However, the B2-C1 and C1-C2 classifiers buck this trend, with learning curves that outperform the simplest binary classifier with very little training data. One possible explanation for this is that the increasing complexity of CEFR levels is not linear, meaning that the leap from A1 to A2 is much smaller than the leap from C1 to C2. The increasing rate of change is explicitly formalized in the official standards for the TORFL tests. For example, the number of words that a learner should know has the following progression: 750, 1300, 2300, 10 000, 12 000 (7 000 active), 20 000 (8 000 active). This means that distinguishing B2-C1 and C1-C2 should be easier because the distance between their respective levels is an order of magnitude larger than the distance between the respective levels of A1-A2 and A2-B1. Furthermore, development of grammar should be more or less complete by level B2, so that the number of features that distinguish C1 from C2 should be smaller than in lower levels, where grammar development is a limiting factor.

Feature evaluation
As summarized in Section 4.5, this study makes use of 179 features, divided into 7 categories: DISC, LEXC, LEXF, LEXV, MORPH, SENT, and SYNT. Many of the features used in this study are taken from previous research of related topics, and some features are proposed for the first time here. Previous researchers of Russian readability have not included morphological features, so the results of these features are of particular interest here.
In this section, I explore the extent to which the selected corpora can support the relevance and impact of these features in Russian second language readability classification. One rough test for the value of each category of features is to run cross-validation with models trained on only one category of features. In Table 7, I report the results of this experiment using the Combined corpus. The results in Table 7 show that MORPH has the highest F-score of any single category, with an F-score just 0.053 below a model trained on all 179 features. True comparisons between categories are problematic because the number of features per category varies significantly.
In order to evaluate the usefulness of each feature as a member of a feature set, I used the correlation-based feature subset selection algorithm (CfsSubsetEval) (Hall, 1999), which selects the most predictive subset of features by minimizing redundant information, based on feature correlation.
Out of 179 features, the CfsSubsetEval algorithm selected 32. Many of the features selected for the optimal subset are also among the top 30 most informative features according to information gain. However, the morphological features, of which only 7 appeared in the information-gain top 30, account for 14 of the 32 selected features, which indicates that although these features are individually less informative, the information they contribute is unique.
A classifier trained on only these 32 features with the Combined corpus achieved precision 0.674 and recall 0.665 (F-score 0.659), which is only 0.01 worse than the model trained on all 179 features.

Conclusions and Outlook
This article has presented new research in automatic classification of Russian texts according to second language readability. This technology is intended to support learning activities that enhance student engagement through online authentic materials (Erbaggio et al., 2010). I collected a new corpus of Russian language-learning texts classified according to CEFR proficiency levels. The corpus comes from a broad spectrum of sources, which resulted in a richer and more robust dataset, while also complicating comparisons between subsets of the data.
Classifier performance A six-level Random Forest classifier achieves an F-score of 0.671, with adjacent accuracy of 0.919. Binary classifiers with only two adjacent reading levels achieve F-scores between 0.806 and 0.892. This is the first large-scale study of this task with Russian data, and although these results are promising, there is still room for improvement, both in corpus quality and modeling features.
In Section 5.1, I showed that binary classifiers at the lowest and highest reading levels required less training data to approach their upper limit. Beginning with the lowest levels, each successive binary classifier learned more slowly than the last until the B2-C1 level. I interpret this as evidence that simple texts are all similar, but complex texts can be complex in many different ways.
Features Among the most informative individual features used in this study are type-token ratios, as well as various measures of maximum syntactic dependency length and maximum tree depth. However, as a category, the morphological features are most informative. When features with overlapping information are removed using correlation-based feature selection, the resulting set includes 14 MORPH features, 8 SYNT features, 4 LEXV features, 3 LEXF features, 2 LEXC features, and 1 DISC feature. Models trained on only one category of features also show the importance of morphology in this task, with the MORPH category achieving a higher F-score than any other individual category.
Although the feature set used in this study had fairly broad coverage, there are still a number of possible features that could likely improve classifier performance further. Other researchers have seen good results using features based on semantic ambiguity, derived from wordnets. Implementing such features would be possible with the new and growing resources from the Yet Another RussNet project. Another category of features that is absent from this study is language modeling, including the possibility of calculating information-theoretic metrics, such as surprisal, based on those models.
The syntactic features used in this study could be expanded to capture more nuanced features of the dependency structure. For instance, the currently implemented syntactic features completely ignore the kinds of syntactic relations between words. In addition, some theoretical work in dependency syntax, such as catenae (Osborne et al., 2012) and dependency locality (Gibson, 2000), may serve as the basis for other potential syntactic features.
Applications One of the most promising applications of the technology discussed in this article is a grammar-aware search engine or similar information retrieval framework that can assist both teachers and students in identifying texts at the appropriate reading level. Such systems have been discussed in the literature (Ott, 2009), and similar tools can be created for Russian language learning.

Acknowledgments

I am indebted to Detmar Meurers and Laura Janda for insightful feedback at various stages of this project. I am grateful to Nikolay Karpov for openly sharing his research source files. I am also thankful to the CLEAR research group at UiT and three anonymous reviewers for feedback on an earlier version of this paper. Any remaining errors or shortcomings are my own.