Spelling Correction for Morphologically Rich Language: a Case Study of Russian

We present an algorithm for automatic correction of spelling errors on the sentence level, which uses a noisy channel model and feature-based reranking of hypotheses. Our system is designed for Russian and clearly outperforms the winner of the SpellRuEval-2016 competition. We show that the size of the language model has the greatest influence on spelling correction quality. We also experiment with different types of features and show that morphological and semantic information further improves the accuracy of spellchecking.

The task of automatic spelling correction has applications in different areas, including correction of search queries and spellchecking in browsers and text editors. It attracted intensive attention in the early era of modern NLP. Many researchers addressed both the problem of effective candidate generation (Kernighan et al., 1990; Brill and Moore, 2000) and that of adequate ranking (Golding and Roth, 1999; Whitelaw et al., 2009). Recently, the focus has moved to the close but separate areas of text normalization (Han et al., 2013) and grammar error correction (Ng et al., 2014), though the task of spellchecking is far from being perfectly solved. Most early works were conducted for English, for which NLP tasks are usually easier than for other languages due to the simplicity of its morphology and its strict word order. There were also studies for Arabic (the papers of the QALB-2014 Shared Task (Ng et al., 2014)) and Chinese (Wu et al., 2013), but for most languages the problem is still open. In the context of Slavic languages, there were just a few works, including  for Russian, Richter et al. (2012) for Czech and Hladek et al. (2013) for Slovak.
However, spelling correction has become topical again due to the intensive growth of social media. Indeed, corpora of Web texts, including blogs, microblogs and forums, have become the main sources for corpus studies. Most of these corpora are very large, so they are collected and processed automatically with only limited manual correction. Hence, most texts in such corpora contain various types of spelling variation, from mere typos and orthographic errors to dialectal and sociolinguistic peculiarities. Moreover, orthographic errors are unavoidable: the more social media texts we have, the higher the fraction of those whose authors are not well educated and therefore tend to make mistakes. This increases the percentage of out-of-vocabulary words in text, which affects the quality of any further NLP task, from lemmatization to any kind of parsing or information extraction. Summarizing, it is desirable to detect and correct at least the clear-cut misspellings in Web texts with high precision.
Unfortunately, there have been very few studies dealing with spellchecking for real-world Web texts, e.g. from LiveJournal or Facebook. Most authors investigated spelling correction in a rather restricted fashion: they focused on selecting a correct word from a small pre-defined confusion set (e.g., adopt/adapt), skipping the problems of detecting misprints and generating the set of possible corrections. Often researchers did not deal with real-world errors at all, instead randomly introducing typos into every word with some probability. Therefore, spelling correction has no "intelligent baseline" algorithm such as trigram HMM models for morphological parsing or CBOW vectors for distributional similarity. One of the goals of our work is to propose such a baseline. The principal feature of our approach is that it works with entire sentences, not on the level of separate words.
A serious problem for research on spellchecking is the lack of publicly available datasets for spelling correction in different languages. Fortunately, such a corpus was recently created for Russian during the SpellRuEval-2016 competition. Russian is rather complex for NLP tasks because of its developed nominal and verbal morphology and its free word order. It is therefore well suited for extensive testing of spelling correction algorithms, and our results should be applicable to any other language with similar properties.
We propose a reranking algorithm for automatic spelling correction and evaluate it on the SpellRuEval-2016 dataset. The paper is organized as follows: Section 1 summarizes previous work on automatic spelling correction, focusing on context-sensitive approaches; Section 2 presents our algorithm; Section 3 describes the test data; Section 4 analyzes the performance of our system under different settings; and we conclude in Section 5.

Previous Work
Here we give a brief review of the literature on spellchecking, especially work dealing with context-sensitive error correction.
• Weighted variants of error distances were considered in Kernighan et al. (1990) and Brill and Moore (2000).
• Toutanova and Moore (2002) added a pronunciation model for spelling correction; phonetic features were also exploited by Schaback and Li (2007).
• Noisy channel models of error correction based on n-grams appear in Mays et al. (1991) and Brill and Moore (2000). Other context-sensitive approaches include Golding and Roth (1999) and Hirst and Budanitsky (2005).
• Different sources of information were integrated by means of a final classifier in Flor (2012), who mainly uses semantic features, and in Schaback and Li (2007), who utilize syntactic, phonetic and semantic information. A feature-based approach was also pursued by Xiong et al. (2014).
Since our method is also based on reranking, we compare it with the works of the last group. First, we work with sentences and consider each word as a potential typo, while Schaback and Li (2007) and Flor (2012) try to correct isolated words using context features. To be applied to real-world texts, their algorithms must be preceded by a preliminary error detection stage, which is not necessary in our approach. This makes our model more robust, since error detection is a nontrivial task for social media texts due to the high number of slang terms, proper names (including colloquial ones) etc. In its architecture our model more closely resembles Xiong et al. (2014); however, the set of features used differs significantly, reflecting the difference between Chinese and Russian. As far as we know, our model is one of the first HMM-based systems used for spelling correction of a morphologically rich language.
There are also very few works dealing with spelling correction of Russian texts: Panina et al. (2013) use a feature-based approach to correct search queries. Works for other Slavic languages include Richter et al. (2012) for Czech, who used a feature-based method to correct errors in words given their context, and Hladek et al. (2013), who performed unsupervised error correction for Slovak. The present work is part of ongoing research started by . The algorithm of the latter is also based on reranking; however, it did not use morphological and semantic features. Actually, the effectiveness of these features was in question, and one of the objectives of this work was to test their applicability in the case of morphologically rich languages. We answer this question positively.

Algorithm Description
Our system performs context-sensitive spelling error correction. The workcycle is divided into three main steps: candidate generation, n-best list extraction and feature-based reranking of hypotheses. Candidates are generated for every word in the sentence, since in real-world applications it is not known which words are mistyped. Pairs of consecutive words are also processed to deal with space insertion. There are four types of candidates:

1. Dictionary words close to the source word in Levenshtein distance.

2. Words having the same phonetic code according to the METAPHONE-style algorithm of .

3. Dictionary words or word pairs obtained by space/hyphen insertion/deletion. We also wrote several rules for candidate generation encoding frequent error patterns, for example the informal writing of *-цца instead of the verbal endings -ться or -тся (*нравицца → нравится).

4. Words from a manually written correction list including common colloquial writings.
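The Levenshtein and rule-based candidate types can be sketched as follows. This is a toy illustration in Python; the dictionary, the alphabet and the single rewrite rule are hypothetical stand-ins for the system's actual resources:

```python
# Toy dictionary; the real system uses a full morphological lexicon.
DICTIONARY = {"нравится", "нравиться", "кофе", "мне"}

def edit1_candidates(word, dictionary):
    """Dictionary words at Levenshtein distance at most 1 from the input."""
    letters = "абвгдежзийклмнопрстуфхцчшщъыьэюя"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    edits = set()
    for left, right in splits:
        if right:
            edits.add(left + right[1:])                 # deletion
            for c in letters:
                edits.add(left + c + right[1:])         # substitution
        for c in letters:
            edits.add(left + c + right)                 # insertion
        if len(right) > 1:                              # transposition
            edits.add(left + right[1] + right[0] + right[2:])
    return edits & dictionary

def rule_candidates(word):
    """Hand-written rewrite rules for frequent informal patterns."""
    out = set()
    if word.endswith("цца"):
        out.add(word[:-3] + "тся")    # *нравицца -> нравится
        out.add(word[:-3] + "ться")   # infinitive variant
    return out
```

In the full system, candidate sets of all four types are merged before scoring.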
Not all candidate words receive the same score. We calculated the frequencies of different errors on the SpellRuEval development set and set the probabilities of the different error types (Levenshtein correction, phonetic correction, space insertion/deletion etc.) proportional to their frequencies. This constitutes the basic error model P(t|s) for transforming the hidden word s into the observed word t.

We construct hypotheses for the whole sentence by choosing one word from each candidate set and extract the n best candidate sentences using beam search. To score the sentences we use the noisy channel model

p(s | t) ∝ p(s) · ∏_i p(t_i | s_i),

where p(t_i|s_i) is the probability of transforming the i-th aligned group of the hidden correct sentence into the i-th group of the observed sentence, and p(s) is a trigram language model probability. Actually, this is a hidden Markov model (HMM), with word bigrams being the states and candidate words being the output symbols. Since our error model does not take into account the weights of different edits and other helpful linguistic clues, we rerank the hypotheses using features. Our feature set includes the following:

• Length of the sentence and the scores of the original error and language models.
• Weighted edit distance between the source and the correction. The model was learned on the SpellRuEval development set using the algorithm of Brill and Moore (2000).
• The total number of out-of-vocabulary, long, short and capitalized words and the number of corrections in each of these categories.
• The number of words that can be split into two dictionary words by space insertion and the actual number of such corrections.
• The number of word pairs that can be merged into a single word by space deletion or hyphen insertion and the actual number of such corrections (hyphen errors are very common in informal writing).
• Morphological and semantic features (see extensive description in Section 4).
We also tried to implement more fine-grained features for hyphen and space insertion/deletion. For example, we counted the occurrences of the word по in the sentence and the number of words having по as a prefix, as well as the number of hyphen insertions in such words/word pairs, to reflect the common error pattern по-русски "in Russian" → по русски or порусски. However, most such features turned out to be noisy in our experiments and were excluded from the final feature set. In total, our model includes 31 basic features, 9 morphological features, 6 semantic features and 1 morphosemantic feature: the unigram model score for the lemmatized sentence.
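The n-best extraction step, i.e. beam search under the noisy channel score, can be sketched as follows. This is a minimal illustration: the candidate sets and the channel and language model scores below are toy stand-ins, not the trained models:

```python
def beam_search(candidate_sets, channel_logp, bigram_logp, beam_width=5):
    """Extract the n-best correction hypotheses for a sentence.

    candidate_sets: for each observed token, a list of candidate corrections;
    channel_logp(cand, obs): log P(obs | cand), the error model;
    bigram_logp(prev, cur): log P(cur | prev), the language model.
    Returns up to `beam_width` (hypothesis, log-score) pairs, best first.
    """
    beams = [([], 0.0)]  # (partial hypothesis, accumulated log-score)
    for obs, cands in candidate_sets:
        grown = []
        for hyp, score in beams:
            prev = hyp[-1] if hyp else "<s>"
            for cand in cands:
                s = score + channel_logp(cand, obs) + bigram_logp(prev, cand)
                grown.append((hyp + [cand], s))
        grown.sort(key=lambda x: x[1], reverse=True)
        beams = grown[:beam_width]
    return beams

# Toy illustration with hypothetical scores: the LM prefers "мне нравится",
# while the channel slightly penalizes changing the observed word.
cands = [("мне", ["мне"]),
         ("нравиться", ["нравиться", "нравится"]),
         ("кофе", ["кофе"])]
channel = lambda c, o: 0.0 if c == o else -0.5
lm = lambda p, c: 0.0 if (p, c) == ("мне", "нравится") else -1.0
best = beam_search(cands, channel, lm)[0][0]
```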
For every candidate sentence we obtain a feature vector with up to 47 dimensions. These vectors are ranked using a linear model returning the vector u_i with the highest scalar product ⟨w, u_i⟩. The weight vector w is learned using the method of Joachims (2006): in the training phase we generate candidate sentences for each sentence of the training set; if u_0 is the feature vector of the correct hypothesis and u_1, ..., u_m are those of the others, then the vectors u_0 − u_1, ..., u_0 − u_m are assigned to the positive class and the opposite vectors to the negative class. Afterwards the weights can be learned by any linear classifier. We also experimented with the perceptron method of learning, but the results were significantly worse.
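A minimal sketch of this learning scheme, using plain Python lists as feature vectors; the classifier itself is left abstract, since any linear learner can be plugged in:

```python
def pairwise_examples(correct_vec, other_vecs):
    """Joachims-style transformation: differences between the correct
    hypothesis and each competitor become positive examples, and their
    opposites negative ones."""
    pos, neg = [], []
    for v in other_vecs:
        diff = [a - b for a, b in zip(correct_vec, v)]
        pos.append(diff)
        neg.append([-d for d in diff])
    return pos, neg

def rank(weights, vectors):
    """Return the index of the hypothesis with the highest score <w, u_i>."""
    scores = [sum(w * x for w, x in zip(weights, v)) for v in vectors]
    return max(range(len(vectors)), key=scores.__getitem__)
```

At test time, `rank` is applied to the feature vectors of the n-best hypotheses with the learned weight vector.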

Test Data
We used the development and test sets of the SpellRuEval contest. The development set consisted of 2001 sentences and the test set of 2009 sentences, taken from the LiveJournal segment of the GICR corpus (Piperski et al., 2013). We refer the reader to the contest organizers' paper for the full description of the dataset and just give an example (in translation): "The film is very cool, I think, about real senses."

Among the test sentences, 799 of 2007 were already correct.
The development set was used to train the reranker and to test the hand-written rules of candidate generation. We built a trigram language model with Kneser-Ney smoothing using the KenLM toolkit (Heafield, 2011). It was trained on a subset of the GICR corpus containing 25 mln words; this subset had no intersection with the development and test sets. We also selected a 5 mln word subset of this corpus to obtain cooccurrence counts and to investigate the dependence of correction quality on language model size.
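Our system queries the trained KenLM model directly; purely as a self-contained illustration of trigram scoring, the toy model below uses stupid backoff instead of Kneser-Ney smoothing, so it is a stand-in rather than the model we actually trained:

```python
import math
from collections import Counter

class BackoffTrigramLM:
    """Toy trigram model with stupid backoff; an illustrative stand-in
    for the KenLM Kneser-Ney model used in our experiments."""

    def __init__(self, sentences, alpha=0.4):
        self.alpha = alpha
        self.uni, self.bi, self.tri = Counter(), Counter(), Counter()
        self.total = 0
        for sent in sentences:
            toks = ["<s>", "<s>"] + sent + ["</s>"]
            for i, w in enumerate(toks):
                self.uni[w] += 1
                self.total += 1
                if i >= 1:
                    self.bi[(toks[i - 1], w)] += 1
                if i >= 2:
                    self.tri[(toks[i - 2], toks[i - 1], w)] += 1

    def logp(self, u, v, w):
        # Back off from trigram to bigram to an add-one unigram estimate.
        if self.tri[(u, v, w)]:
            return math.log(self.tri[(u, v, w)] / self.bi[(u, v)])
        if self.bi[(v, w)]:
            return math.log(self.alpha * self.bi[(v, w)] / self.uni[v])
        return math.log(self.alpha ** 2 * (self.uni[w] + 1)
                        / (self.total + len(self.uni)))

    def score(self, sent):
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        return sum(self.logp(toks[i - 2], toks[i - 1], toks[i])
                   for i in range(2, len(toks)))
```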
The trigram model for morphological tags was trained on a subset of the Golden Standard of the GICR corpus; the size of the training data was 10000 sentences. Instead of full tags we used POS labels and selected grammemes: gender, number and case for nouns; gender, number, case, shortness and comparison degree for adjectives; mood for verbs; and case for prepositions. Participles were treated as adjectives, and pronouns as nouns or adjectives depending on their syntactic role. We used the ABBYY Compreno dictionary containing about 3.7 mln word forms. For the final reranking we used logistic regression (though a linear SVM showed almost the same results); the implementation was taken from the scikit-learn package (Pedregosa et al., 2011).

Comparison of Different Models
As our first experiment we compare 4 sets of features: WORD-LEVEL, including the 31 features specified in Section 2; MORPHO, which also includes the morphological model score; SEM, extending WORD-LEVEL with semantic features; and MORPHOSEM, using both morphological and semantic information. For all 4 settings we ran two experiments with different language models (trained on 5 mln and on 25 mln words, respectively). The morphological score is the negative log-probability of the sequence of morphological tags assigned to the words in the proposed correction. We selected the most probable sequence, considering all tags in the dictionary as equally probable. For out-of-vocabulary words, the tags and their probabilities were guessed using a simple suffix classifier.
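Such a suffix classifier can be as simple as a longest-suffix lookup; the sketch below is a hypothetical stand-in, with an illustrative suffix table and tag names rather than statistics estimated from the tagged corpus:

```python
def guess_tags(word, suffix_table, max_suffix=4):
    """Guess a tag distribution for an out-of-vocabulary word from its
    longest known suffix; falls back to a dummy tag if nothing matches."""
    for k in range(min(max_suffix, len(word)), 0, -1):
        probs = suffix_table.get(word[-k:])
        if probs:
            return probs
    return {"UNKN": 1.0}

# Hypothetical suffix statistics; a real table would be estimated from
# the tagged training data.
SUFFIX_TABLE = {
    "ться": {"V+Inf": 0.95, "NOUN": 0.05},
    "ется": {"V+Ind+3+Sg": 0.9, "NOUN": 0.1},
}
```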
Semantic scores were calculated from cooccurrence statistics as follows: first, all the lemmas of nouns, adjectives, verbs and adverbs appearing at least 100 times in our training data were selected. Then for every pair of such lemmas we calculated the number of times its members appeared in the same sentence and kept all the pairs occurring at least 20 times. The set of pairs was pruned further: we kept w_2 as a potential pair of w_1 only if its probability to appear in sentences containing w_1 was at least 3 times higher than its unconditional probability. From these statistics we extracted the following features (w_2 is said to be a matching pair for w_1 if their pair is listed in the set of cooccurrence counts; a lemma l_1 is frequent if it has at least one matching pair):
1. The number of words in the sentence whose lemma has a matching pair with some other word in the sentence.
2. Average number of matching lemmas for frequent lemmas in the sentence.
3. Maximal and average probabilities p(l_2|l_1) for a lemma l_2 in the sentence to appear together with l_1, averaged over all l_1 in the sentence.
4. The number of frequent lemmas and whether the sentence contains at least one frequent lemma.
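The collection of the pruned cooccurrence table described above can be sketched as follows; the thresholds are parameters that default to the values from our setup, and the input is assumed to be already lemmatized and POS-filtered:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_pairs(lemmatized_sentences, min_lemma=100,
                       min_pair=20, ratio=3.0):
    """Build the table of matching lemma pairs: frequent lemmas whose
    conditional probability of cooccurrence exceeds `ratio` times their
    unconditional probability."""
    lemma_count = Counter()   # token frequency of each lemma
    sent_count = Counter()    # number of sentences containing each lemma
    pair_count = Counter()    # number of sentences containing both lemmas
    n_sents = len(lemmatized_sentences)
    for sent in lemmatized_sentences:
        for l in sent:
            lemma_count[l] += 1
        lemmas = set(sent)
        for l in lemmas:
            sent_count[l] += 1
        for a, b in combinations(sorted(lemmas), 2):
            pair_count[(a, b)] += 1
    pairs = {}
    for (a, b), c in pair_count.items():
        if lemma_count[a] < min_lemma or lemma_count[b] < min_lemma \
                or c < min_pair:
            continue
        # keep b as a matching pair of a only if P(b | a) >= ratio * P(b)
        if c / sent_count[a] >= ratio * sent_count[b] / n_sents:
            pairs.setdefault(a, set()).add(b)
        if c / sent_count[b] >= ratio * sent_count[a] / n_sents:
            pairs.setdefault(b, set()).add(a)
    return pairs
```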
We compare our algorithm against that of Sorokin and Shavrina (2016), the top-ranking system of the SpellRuEval competition (the BASELINE method). The results of our experiments are given in Table 1. Each row contains two subrows, for the smaller and the larger language model. The following metrics are reported; they were calculated using the evaluation script of SpellRuEval-2016 (for details, refer to the contest organizers' paper).

1. Precision (the proportion of properly corrected tokens among all corrected tokens).

2. Recall (the fraction of misspelled tokens which were properly corrected).

3. F1-measure (the harmonic mean of precision and recall).

4. Accuracy (the percentage of correct output sentences).

5. The mean reciprocal rank (MRR) of correct output sentences and the number of times they appear in the list of hypotheses (Coverage); only the top 5 variants are taken into account.
Let T, F, W, M denote, respectively, the number of exact corrections, the number of detected typos where the correction was wrong, the number of "false alarms" (when a correctly spelled word was treated as a typo), and the number of missed typos. In this notation, precision equals T/(T+F+W) and recall is T/(T+F+M). Therefore making an incorrect correction is worse than making no correction: both decrease recall, but the former also affects precision. Hence we think that the percentage of correctly predicted sentences is a more adequate performance measure. It is also the objective maximized by the learning algorithm.
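These definitions can be stated directly in code; the helper below and its names are ours, not part of the official evaluation script:

```python
def spellcheck_metrics(T, F, W, M):
    """Precision, recall and F1 from the counts defined above:
    T exact corrections, F wrong corrections of real typos,
    W false alarms, M missed typos."""
    precision = T / (T + F + W)
    recall = T / (T + F + M)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Comparing a run with one wrong correction against a run with one extra missed typo shows that recall is the same while precision drops only in the former, which is the asymmetry discussed above.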
We give a detailed analysis of the results in the next section. The preliminary conclusions are the following:

1. The size of the language model is the most significant factor affecting the algorithm's performance.
2. Using the score of the morphological model leads to a significant improvement, reducing the error rate by 8% in terms of F1-measure (84.24% instead of 82.87%) and by 5.9% in terms of sentence accuracy (78.34% instead of 76.99%).

3. Using semantic features further improves performance.
4. The impact of complex features is more significant in the case of the smaller language model. This is expected: the less data you have, the more complex an algorithm you need to achieve the same level of performance.

Further Results and Discussion
Our results are convincing enough to show that morphological and semantic features are useful for spelling correction. However, they are still far from perfect, so we should ask what further improvements can be achieved along this way. First, let us illustrate how the morphological model helps to select a correct hypothesis. Consider the sentence к *сожаления, придётся постараться, which should be corrected to к сожалению, придётся постараться ("it's a pity, (I) have to make an effort"). The lexeme сожаление ("pity") is erroneously written in its Sg+Gen form сожаления instead of the Sg+Dat form сожалению. However, the preposition к requires a dative after it. On the level of morphological tags we have an erroneous sequence Prep+Dat Noun+Neut+Sg+Gen and a correct sequence Prep+Dat Noun+Neut+Sg+Dat. Since a dative preposition never has a genitive immediately to its right, the former sequence has much lower probability and is penalized by the ranker. Certainly, it already has a lower probability under the language model, but this is not sufficient to make a correction, since the word being corrected is a dictionary word. Indeed, most dictionary words in a sentence are spelled correctly, which means that the number of corrections in dictionary words should be a negative feature; therefore additional evidence is required to overcome this negative gain. Also, the morphological model is less sparse than the lexical one, so it leaves less probability to unseen events, which makes the cost of an unlikely sequence much higher. However, not all incorrect sequences of morphological tags can be rejected by a trigram model alone, especially with a restricted set of tags like ours. For example, in Russian each preposition restricts the possible cases of its dependent noun.
Most prepositions select only one case: for example, из "from" allows only the genitive after it; other prepositions like за "behind" can govern the accusative and instrumental cases, but rule out the other 4 main cases. Nouns and adjectives in noun groups agree in case, number and gender; a verb agrees with its subject (usually a noun or pronoun) in number and, in the past tense, in gender. All these dependencies are unbounded, which means that an arbitrary number of words can separate two elements of the same phrase. However, the emerging constraints may be used to determine that, for example, a verb in a particular position cannot be finite, and hence to reject or penalize the corresponding hypothesis of the spellchecker. This observation seems promising, since confusion of the 3rd person and infinitive forms of a verb is a common orthographic mistake (мне нравится кофе "I like coffee" → *мне нравиться кофе, where нравиться is the infinitive form).
Therefore we added 4 groups of features, 2 features in each group, which contain the following counts:

1. The total number of prepositions and the number of prepositions which do not have a noun to the right which agrees with them.
2. The total number of adjectives and the number of adjectives which do not have a noun to the right which agrees with them.
3. The total number of infinitives and the number of infinitives which do not have a head (a predicative or a transitive verb).
4. The total number of indicative verbs and the number of verbs that do not have a subject which agrees with them.
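The first feature group can be sketched as follows; the tag representation here is a deliberately simplified stand-in for the actual morphological analyses:

```python
def preposition_agreement_features(tagged):
    """Count all prepositions and those with no noun to the right in a
    case the preposition governs (feature group 1 above).
    `tagged` is a list of (POS, set-of-grammemes) pairs: a toy
    representation, not the system's actual data structures."""
    total, disagreeing = 0, 0
    for i, (pos, gram) in enumerate(tagged):
        if pos != "PREP":
            continue
        total += 1
        governed = gram & {"Gen", "Dat", "Acc", "Ins", "Loc"}
        if not any(p == "NOUN" and (g & governed)
                   for p, g in tagged[i + 1:]):
            disagreeing += 1
    return total, disagreeing

# к *сожаления: a dative preposition followed only by a genitive noun
bad = [("PREP", {"Dat"}), ("NOUN", {"Neut", "Sg", "Gen"})]
good = [("PREP", {"Dat"}), ("NOUN", {"Neut", "Sg", "Dat"})]
```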
We hoped that these features would help improve our system's performance further, but this was not the case: encoding this additional information degraded the quality, possibly due to overfitting. Moreover, we observed that careful encoding of these features is hardly possible due to the high morphological complexity of Russian. For example, nouns usually follow their attributes, but may also precede them (лицо, красное от мороза "the face, red from frost"); the subject is often implied but omitted in the surface form; or there is no subject at all, as in impersonal sentences (холодает "it is getting colder", Pres+Sg+3).
Adverbs are often homonymous with grammatically correct prepositional phrases (вправду "indeed" vs. в правду "into the truth"), which may force the algorithm to oversegment them in order to increase the number of prepositions that agree with their nouns, etc. Summarizing, designing more complex morphological features requires additional research, probably in the framework of constraint grammars. That is a necessary step, since among the 559 test sentences which were not properly corrected, about 30 had an error in the verb form.
Even using only one morphological feature is not straightforward. Our reported results stand for the case when the WORD-LEVEL model was trained first and its score was used as a feature at the second step of classification, together with the morphological model score; otherwise the error reduction is about half as large. The same happens with semantic features: trying to determine their weights together with the word-level features, we obtain no gain at all. This implies that new features should be added hierarchically: in our best model, semantic features are added after learning the weight of the morphological model.
During error analysis we found that about one third of the algorithm's errors can be classified as "semantic", which means that the incorrect sentence cannot be rejected by morphological or statistical features since both variants are rare and belong to the same grammatical category. Often these are so-called "real-word errors", where the erroneous word is also in the dictionary. However, it is not trivial to design a formal semantic score that favours one variant and refutes the other. Consider, for example, the mistyped sentence География его выступлений *достегает Китая и Индии "The geography of his performances *lashes China and India". Here the word *достегает "(it) lashes" must be replaced by достигает "(it) reaches". A correction in a dictionary word is penalized, so there must be a substantial gain in the language or semantic model score to compensate this penalty. But the verb достигать "to reach" does not frequently cooccur with the other lexemes in the sentence, such as география "geography" and выступление "performance". The score of the language model is substantially higher for the correct variant, but it is not sufficient to compensate for the correction in a dictionary word. In this particular case an additional preprocessing phase could be helpful: we might not have the exact phrase достигает Китая "reaches China" in our corpus, but we certainly have other constructions of the form "достигает <name of country>". We do not yet have a ready implementation of this approach, but using a class-based or factored language model together with some semantic classification seems a promising direction for further investigation.
Actually, morphological and semantic features are instruments to remedy the weaknesses of the n-gram language model, which is not powerful enough to discriminate between probable and improbable sentences. Using more adequate language models might make such fine-tuning of features unnecessary. Promising candidates to replace n-gram models are neural language models (Mikolov et al., 2010), since they solve exactly the problem of choosing the optimal word in a given context, which is the main problem of spellchecking. We leave this question for future research.

Generalization of Results
Since the lack of publicly available datasets is one of the obstacles in spellchecking research, it is reasonable to ask to what extent our results depend on the size of the dataset and on the source language. Table 2 shows the dependence between the size of the development set used to tune the reranker weights and the quality of correction. We observed that even for a development set of 200 sentences (which can be collected and annotated manually) the results are acceptable, though performance increases when we use more data. All results are averaged over 10 independent runs. Note that the gain from using more complex features increases with the size of the development data, which means that their weights are not tuned properly on smaller datasets.
Another question is whether our approach can be adapted to other languages. The architecture of the model is language-independent. Moreover, the linguistically motivated features we design are also not specific to any language, since they use only cooccurrence counts. Candidate search and some of the word-level features encode language-specific information, but they reflect the nature of spelling errors in Russian rather than the Russian word structure. Actually, a linguist can add any word-level feature; for example, instead of hyphen errors we may look for diacritic errors if the language uses diacritics, as Czech does. Our reranking model can also incorporate arbitrary sentence-level features reflecting morphological or lexical constraints. This makes our architecture promising for designing spellcheckers for other languages, not only for Russian.

Conclusions and Future Work
We have developed a language-independent model for spelling correction and applied it to Russian. Our algorithm outperforms the previous best system. Another merit is its flexibility, which allows incorporating arbitrary word-level and sentence-level features. Experimenting with features of different types, we observe that the main factor in spelling corrector performance is the quality of the language model; however, morphological and semantic information is also helpful. The direction of future work is three-fold. The first step is to augment traditional language models with neural ones and check whether this allows dealing better with long-distance dependencies, which might be helpful in choosing the correct candidate. The second is to apply our model to other languages with complex morphology and check whether the same features are beneficial as in the case of Russian. The third is to reimplement our model using finite-state tools, since its main components (candidate search and ranking) are essentially finite-state operations.