UMMU@QALB-2015 Shared Task: Character and Word level SMT pipeline for Automatic Error Correction of Arabic Text

In this paper we present the LIUM (Laboratoire d'Informatique de l'Université du Maine) and CMU-Q (Carnegie Mellon University in Qatar) joint submission to the Arabic shared task on automatic spelling error correction. Our best system is a sequential combination of two statistical machine translation (SMT) systems trained on top of the MADAMIRA output. The first is a character-based system, used to produce a first correction at the character level. Characters are then glued together to form the input to the second system, which works at the word level. This sequential combination achieves an F1 score of 69.42, better than the best F1 score reported on the 2014 test set (67.91). The UMMU best submission to the QALB-15 shared task is ranked first out of 10 submissions on the L2 test condition and second out of 12 submissions on the L1 test set.


Introduction
Errors such as incorrect spelling, word choice, or grammar limit the effectiveness of NLP models: language errors are problematic when provided as input to NLP systems, which are often not robust enough to handle unexpected variations. The difficulty of spelling errors is language-dependent: the more complex the orthography, morphology, or syntax of a language, the more likely it is to have errors in aspects requiring complex human/machine processing. For morphologically rich languages such as Arabic, spelling errors are very frequent, even among native speakers (Shaalan, 2005). This is because Modern Standard Arabic (MSA), the unifying language of formal text, is not the native language of any Arab. Arabic word morphology is agglutinative: particles and pronouns are written as part of a word (Habash, 2010). This adds an additional challenge for the writer (native or non-native) and could be a principal source of spelling mistakes.
In this paper, we describe an approach performing a sequential combination of two statistical machine translation systems for automatic spelling error correction for Arabic. Our system learns models of correction by training on paired examples of errors and their corrections. The training, tuning, and test data are provided by the shared task organizers. Compared to the first edition of this shared task, this year's version proposes two sub-tasks tackling two text genres: (1) a news corpus (news articles extracted from Aljazeera); (2) a corpus of sentences written by learners of Arabic as a second language. Both corpora are extracted from the QALB corpus. We tested our system and showed that it performs well on both corpora.
The remainder of this paper is organized as follows. First, we review the main previous efforts on automatic spelling correction in Section 3. We then give an overview of the various spelling mistakes made while writing Arabic text in Section 4. In Section 5, we detail our error correction system. In Section 6, we present the results obtained for the different experiments we conducted using the shared task 2015 dev set. Before concluding, Section 7 details the UMMU official results on the QALB-15 test set.

QALB Shared Task Description
The goal of the QALB shared task is the development of an automatic system for Arabic error correction. The QALB-2015 task is the extension of the first QALB shared task, which took place last year. QALB-2014 addressed errors in comments written on Aljazeera articles by native Arabic speakers. This year's competition includes two tracks and, in addition to errors produced by native speakers, also includes correction of texts written by learners of Arabic as a foreign language (L2). The native track includes the Alj-train-2014, Alj-dev-2014, and Alj-test-2014 texts from QALB-2014. The L2 track includes L2-train-2015 and L2-dev-2015. This data was released for the development of the systems. The systems were scored on the blind test sets Alj-test-2015 and L2-test-2015.

Related Work
Automatic error detection and correction includes automatic spelling checking, grammar checking, and post-editing. Numerous approaches (both supervised and unsupervised) have been explored to improve the fluency of text and reduce the percentage of out-of-vocabulary words using NLP tools, resources, and heuristics, e.g., morphological analyzers, language models, and edit-distance measures (Kukich, 1992; Oflazer, 1996; Zribi and Ben Ahmed, 2003; Shaalan et al., 2003; Haddad and Yaseen, 2007; Hassan et al., 2008; Habash, 2008; Shaalan et al., 2010). There has been a lot of work on error correction for English (e.g., Golding and Roth, 1999).
For Arabic, this issue has been studied in various directions. Shaalan et al. (2003) presented work on the specification and classification of spelling errors in Arabic. Later, Haddad and Yaseen (2007) presented a hybrid approach using morphological features and rules to fine-tune the word recognition and non-word correction method. To build an Arabic spelling checker, a dictionary of 9 million fully inflected Arabic words was developed semi-automatically using a morphological transducer and a large corpus; an error model was then created by analyzing error types and building an edit-distance ranker, and the level of noise in different data sources was analyzed to select the optimal training subset. Alkanhal et al. (2012) presented a stochastic approach for spelling correction of Arabic text. They used a context-based system to automatically correct misspelled words: first, a list of possible alternatives is generated for each misspelled word using the Damerau-Levenshtein edit distance; then the right alternative is selected stochastically using a lattice search and an n-gram method. A noisy channel model trained on word-based unigrams has also been used to detect and correct spelling errors. Dahlmeier and Ng (2012) built specialized decoders for English grammatical error correction.
More recently, Pasha et al. (2014) created MADAMIRA, a system for morphological analysis and disambiguation of Arabic; this system can be used to improve the accuracy of spelling checking, especially for Hamza spelling correction. A statistical machine translation model to train an error correction system was recently presented by Jeblee et al. (2014). In contrast to their approach, our system combines MT models at two levels: a character level, then a word level.

Spelling errors in Arabic
Three types of Arabic word misspellings are defined in the literature: typographic, cognitive, and phonetic errors (Haddad and Yaseen, 2007). Typographic errors, which correspond to single-word misspellings, represent 80% of all misspelling errors in Arabic (Ben Hamadou, 1994). Based on this study, the most common typographic editing errors that can be found in an Arabic text are the following: Substitution: approximately 41.5% of errors are substitution errors. For example, in the case of "he played", the letter /E/ is mistakenly substituted by /g/, which results in an incorrect word.
Deletion: approximately 23% of single errors are deletion errors. For example, in "he opens", the letter /t/ is missing, leading to an erroneous word.
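Such single-edit typographic errors are typically handled by enumerating candidate corrections within one Damerau-Levenshtein edit, as in the stochastic approach of Alkanhal et al. (2012) discussed above. A minimal sketch follows; it uses a Latin alphabet for readability, whereas an Arabic system would enumerate over the Arabic character set:

```python
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    # All strings within one Damerau-Levenshtein edit of `word`:
    # deletions, transpositions, substitutions, and insertions.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    substitutes = [L + c + R[1:] for L, R in splits if R for c in alphabet]
    inserts = [L + c + R for L, R in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)
```

Given such a candidate set, a language model or lattice search can then pick the most plausible alternative in context.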

SMT system for error correction
We formulate the error correction task as a translation problem where the source part is the text to be corrected and the target part is the correct text.
Let us assume that we want to correct an erroneous sentence e into a correct sentence c. The goal of the system is to find the correction c* defined as:

c* = argmax_c p(c|e) = argmax_c p(e|c) p(c)

where p(e|c) is estimated by a translation model and p(c) by the target-side language model. Finding the argmax is the task of the decoder: it searches for the best hypothesis in the space of possible corrections c. The translation system is trained using the well-known MOSES toolkit (Koehn et al., 2007). The system is built using the data produced for the QALB shared task and described in Table 1, as follows: first, we generate the correct sentences of the QALB training corpus (using a modified version of the m2scorer script), then the translation and reordering models are trained. The language model (LM) is trained on the correct side of the QALB data and a selected part of the Arabic Gigaword corpus.
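As a concrete toy illustration of this noisy-channel objective, the sketch below scores each candidate correction c by log p(e|c) + log p(c) and returns the argmax. The probabilities are made-up numbers for illustration only; the actual search is of course performed by the Moses decoder over phrase tables and reordering models:

```python
import math

def correct(e, candidates, channel_logprob, lm_logprob):
    # Score each candidate correction c by log p(e|c) + log p(c)
    # and return the argmax, mirroring the noisy-channel objective.
    return max(candidates, key=lambda c: channel_logprob(e, c) + lm_logprob(c))

# Toy models with made-up probabilities, for illustration only:
channel = {("teh", "the"): math.log(0.6), ("teh", "ten"): math.log(0.3)}
lm = {"the": math.log(0.05), "ten": math.log(0.01)}

best = correct("teh", ["the", "ten"],
               lambda e, c: channel.get((e, c), float("-inf")),
               lambda c: lm.get(c, float("-inf")))
# best == "the": both the channel model and the LM favor "the"
```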
In order to select the most appropriate amount of monolingual data, we employ data selection techniques based on a cross-entropy criterion, using XenC (Rousseau, 2013). The selected data is determined in such a way that the corresponding LM minimizes the perplexity calculated on the development set. The selected part of each monolingual corpus is used to train an interpolated n-gram back-off target LM with the SRILM toolkit (Stolcke et al., 2011) and Kneser-Ney smoothing.
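The cross-entropy-based selection can be sketched as follows: rank each pool sentence by the length-normalized difference between its cross-entropy under an in-domain LM and under a pool LM, and keep the lowest-scoring fraction. This is a deliberately simplified unigram version of what a tool like XenC does; the smoothing and function names are our own illustrative choices:

```python
import math
from collections import Counter

def unigram_logprob(sent, counts, total, vocab):
    # Add-one smoothed unigram log-probability of a sentence.
    return sum(math.log((counts[w] + 1) / (total + vocab)) for w in sent.split())

def select(pool, in_domain, keep_ratio=0.5):
    # Moore-Lewis-style selection: rank each pool sentence by the
    # length-normalized cross-entropy difference H_in(s) - H_pool(s)
    # and keep the lowest-scoring (most in-domain-like) fraction.
    def lm(corpus):
        c = Counter(w for s in corpus for w in s.split())
        return c, sum(c.values()), len(c) + 1  # counts, tokens, vocab (+1 for OOV)
    c_in, t_in, v_in = lm(in_domain)
    c_p, t_p, v_p = lm(pool)
    def score(s):
        n = max(len(s.split()), 1)
        return (unigram_logprob(s, c_p, t_p, v_p)
                - unigram_logprob(s, c_in, t_in, v_in)) / n
    ranked = sorted(pool, key=score)
    return ranked[:max(1, int(len(pool) * keep_ratio))]
```

Sentences lexically close to the in-domain sample are ranked first, so a generic corpus such as Gigaword can be filtered down to its most relevant part.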
In this work, we propose to use two SMT systems trained with different translation units (words and characters), as described previously. This is motivated by our intuition that each system will target a different pattern of errors, so that their combination may outperform each single system's performance.
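A character-level system is typically trained by splitting each word into space-separated characters with an explicit word-boundary token, and its output is glued back into words before any word-level processing. A minimal sketch of this conversion (the boundary token `<s>` is our illustrative choice, not necessarily the one used in our actual setup):

```python
def to_chars(sentence, boundary="<s>"):
    # Split each word into space-separated characters and mark the
    # original word boundaries with an explicit token.
    return f" {boundary} ".join(" ".join(w) for w in sentence.split())

def glue(char_sentence, boundary="<s>"):
    # Invert to_chars: remove intra-word spaces and restore word boundaries.
    return " ".join(w.replace(" ", "") for w in char_sentence.split(f" {boundary} "))
```

For example, `to_chars("ktb alwld")` yields `"k t b <s> a l w l d"`, and `glue` recovers the original sentence, so the two systems can be chained losslessly.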

Experiments and results
We train four models, depending on the training unit used and the nature of the source side (with or without pre-processing). Each system is evaluated independently, and the best systems are combined.

Data description
All our models are built using the training, development, and testing data provided by the shared task organizers and described in Table 1.

SMT on raw data
We present here the results obtained with the word-level and character-level systems trained on raw, non-processed data, shown in Tables 3 and 4. It is interesting to note that our character-level system performs better than the word-level one, both on the dev set (66.12 vs. 61.31) and on the test set. This could be explained by the fact that the character-level system takes advantage of its finer granularity.
Table 4: Character-level SMT error correction.

SMT on MADAMIRA pre-processed data
The results obtained using MADAMIRA correction candidates (see Table 2) make it a good starting point for improving our SMT correction systems. We therefore used MADAMIRA as a pre-processing step for the SMT training data, i.e., we re-trained our systems on the MADAMIRA pre-processed data. Results for the character-level and word-level systems are presented in Table 5 and Table 6.
Table 6: Character-level SMT error correction with MADAMIRA pre-processing.
As expected, this combination yields better results (F-score of 40.12 on the L2-dev-2015 data set vs. 38.76 when using only the character-level system). It is not surprising that the character-level system gives better results than the word-level one when trained on the MADAMIRA pre-processed data (66.12 vs. 63.71 on Alj-dev-2014).

Sequential combination
Although the character-level system outperforms the word-level one, we still want to benefit from the higher modeling level of the word-based system. For this we propose two combination setups: (i) top-down sequential combination and (ii) bottom-up sequential combination. Both combinations are performed using data pre-processed with MADAMIRA.

Top-down combination
In this setup, we first use the word-based SMT system, then re-translate its outputs using the character-level system. The results obtained are given in Table 7. This combination yields better results than using the character-level system only (see Table 4).
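The top-down pipeline can be sketched as a simple composition: correct at the word level, convert the output into the character-level representation, correct again, and glue the characters back into words. The stand-ins below are toy dictionaries and lambdas, not the actual Moses models:

```python
def sequential_correct(sentence, first, second, to_units, from_units):
    # Top-down combination: run the first (word-level) system, convert
    # its output into the second system's translation unit, run the
    # second (character-level) system, and convert back to words.
    return from_units(second(to_units(first(sentence))))

# Toy stand-ins for the two Moses systems and the unit conversion:
word_fix = lambda s: " ".join({"teh": "the"}.get(w, w) for w in s.split())
to_chars = lambda s: " <s> ".join(" ".join(w) for w in s.split())
glue = lambda s: " ".join(w.replace(" ", "") for w in s.split(" <s> "))
char_fix = lambda s: s  # identity placeholder for the character-level model

out = sequential_correct("teh cat", word_fix, char_fix, to_chars, glue)
# out == "the cat"
```

The bottom-up variant simply swaps the order of the two systems and the direction of the unit conversion.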

Bottom-up combination
The bottom-up combination consists in using the word-level system to re-translate the output of the character-level system.

UMMU@QALB-2015 Results
In this section, we present the official results of our system on the 2015 QALB test set. We submitted two outputs, UMMU-1 and UMMU-2. UMMU-1 is the output of our best system on the dev data (see Table 8 for UMMU-1 dev results) and UMMU-2 is the output of the character-level SMT system without combination (see Table 6 for UMMU-2 dev results). The official results of the UMMU primary and secondary submissions are presented in Tables 9 and 10, respectively. According to the results presented in Table 9, our system is ranked first in the L2 subtask and second in the L1 subtask. Table 10 gives the results of the UMMU-2 submission. Compared to our UMMU-1 results, we note that our character-level system has higher precision and lower recall in both subtasks. These findings show that our word-level system, when applied to the character-level outputs, improves the recall but decreases the precision. Thus, a better combination of our systems may improve the final F1 score by avoiding the precision drop.

Conclusion and Future Work
We described our submission to the Arabic shared task on automatic spelling error correction. Our system is a sequential combination of two statistical machine translation (SMT) systems. First, a character-based SMT system is used to perform low-level correction. The character outputs of this system are then glued together and used as the input to the higher-level system, which works at the word level. This sequential combination achieves an F1 score of 71.10 on L1-test-2015 and 41.20 on L2-test-2015, which ranks us 2nd in the L1 subtask and 1st in the L2 subtask. Our submission is a three-stage system that benefits from MADAMIRA pre-processing, a low-level character-based SMT system, and a higher-level word-based SMT system. We showed the complementarity of the three stages, and that our F1 score improved at each step. In future work, we would like to investigate adding an additional layer that uses a neural network language model to estimate probabilities in a continuous space and generalize better to unseen events.