QCMUQ@QALB-2015 Shared Task: Combining Character level MT and Error-tolerant Finite-State Recognition for Arabic Spelling Correction

We describe the CMU-Q and QCRI joint effort in building a spelling correction system for Arabic in the QALB 2015 Shared Task. Our system is based on a hybrid pipeline that combines rule-based linguistic techniques with statistical methods using language modeling and machine translation, as well as an error-tolerant finite-state automata method. We trained and tested our spelling corrector using the dataset provided by the shared task organizers. Our system outperforms the baseline and yields better correction quality, with an F-score of 68.12 on the L1-test-2015 test set and 38.90 on L2-test-2015. This ranks us 2nd in the L2 subtask and 5th in the L1 subtask.


1 Introduction
With the increased use of computers for processing various languages comes the need to correct errors introduced at different stages. The topic of text correction has therefore seen a lot of interest in the past several years (Haddad and Yaseen, 2007; Rozovskaya et al., 2013), and numerous approaches have been explored to correct spelling errors in text using NLP tools and resources (Kukich, 1992; Oflazer, 1996). Spelling correction for Arabic remains understudied in comparison to English, although a small amount of research has been done previously (Shaalan et al., 2003; Hassan et al., 2008). The reasons for this are the complexity of the Arabic language and the unavailability of language resources. For example, the Arabic spell checker in Microsoft Word gives incorrect suggestions even for simple errors. The first shared task on automatic Arabic text correction was established recently. Its goal is to develop and evaluate spelling correction systems for Arabic trained on naturally occurring errors in text written by humans or machines. As in the first edition, participants are asked to implement a system that takes as input Modern Standard Arabic (MSA) text containing various spelling errors and automatically corrects it. In this year's edition, participants test their systems on two text genres: (i) a news corpus (mainly newswire extracted from Aljazeera); and (ii) a corpus of sentences written by learners of Arabic as a Second Language (ASL). Texts produced by ASL learners generally contain many spelling errors; the main difficulty these learners face is that Arabic vocabulary and grammar rules differ from those of their native language.
In this paper, we describe our Arabic spelling correction system. It is based on a hybrid pipeline that combines rule-based techniques with statistical methods using language modeling and machine translation, as well as an error-tolerant finite-state automata method. We trained and tested our spelling corrector using the dataset provided by the shared task organizers (Rozovskaya et al., 2015). Our systems outperform the baseline and achieve better correction quality, with an F-score of 68.42% on the 2014 test set and 44.02% on the L2 dev set.
2 Data Resources

QALB: We trained and evaluated our system using the data provided for the shared task and the M2 scorer (Dahlmeier and Ng, 2012). These datasets are extracted from the QALB corpus of human-edited Arabic text produced by native speakers, non-native speakers, and machines.

Monolingual Arabic corpus: Additionally, we used the GigaWord Arabic corpus and the News Commentary corpus, as used in a state-of-the-art English-to-Arabic machine translation system (Sajjad et al., 2013b), to build different language models (character-level and word-level LMs). The complete corpus consists of 32 million sentences and approximately 1,700 million tokens. Due to computational limitations, we trained our language model on only 60% of the data, randomly selected from the whole corpus.

3 Our Approach
Our automatic spelling corrector consists of a hybrid pipeline that combines five different and complementary approaches: (i) a morphology-based corrector; (ii) a rule-based corrector; (iii) a phrase-based SMT (statistical machine translation) corrector; (iv) a character-based SMT corrector; and (v) an error-tolerant finite-state automata approach.

1 Part of the statistics reported in Table 1 is taken from Diab et al. (2014).
2 The list is freely available at: http://sourceforge.net/projects/arabic-wordlist/
Our system design is motivated by the diversity of the errors contained in our train and dev datasets (see Table 1). It was very challenging to design one system to handle all of the errors, so we propose several expert systems, each tackling a different kind of spelling error. For example, we built a character-level machine translation system to handle cases of space insertion and deletion affecting non-clitics, since clitic splits are specifically treated by the rule-based module. To cover some remaining character-level spelling mistakes, we use a finite-state automata (FSA) approach. Our systems run on top of each other, gradually correcting the Arabic text in steps.
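The cascaded design can be sketched as a simple composition of text-to-text functions, each module refining the output of the previous one. The toy modules below are illustrative stand-ins, not the actual Morph/Rules/CBMT/EFST implementations:

```python
def run_pipeline(text, modules):
    """Apply each correction module in order; every module refines
    the output of the previous one."""
    for module in modules:
        text = module(text)
    return text

# Toy stand-ins for the real correctors.
def fix_common_typo(text):
    return text.replace("teh", "the")

def normalize_spaces(text):
    return " ".join(text.split())

corrected = run_pipeline("teh  quick  fox", [fix_common_typo, normalize_spaces])
```

Because each stage only sees the previous stage's output, module ordering matters; the configurations compared in Section 4 differ mainly in this ordering.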

3.1 MADAMIRA Corrections (Morph)
MADAMIRA (Pasha et al., 2014) is a tool originally designed for morphological analysis and disambiguation of MSA and dialectal Arabic texts. MADAMIRA employs different features to select a proper analysis for each word in context, and performs Alif and Ya spelling correction for the phenomena associated with these letters. The task organizers provided the shared task data preprocessed with MADAMIRA, including all of the features generated by the tool for every word.
Similar to Jeblee et al. (2014), we take the corrections proposed by MADAMIRA and apply them to the data. We show in Section 4 that while the correction candidates proposed by MADAMIRA are not necessarily correct, this module performs at very high precision.

Table 2: Preparing the training, tuning, and test corpora for character-level alignment (the original/source/target Arabic examples and their character-level versions are not reproduced here; the English gloss of the example is "which I have seen in Youtube is that").

3.2 Rule-based Corrector (Rules)
The MADAMIRA corrector described above does not handle splits and merges. We therefore additionally use a rule-based corrector. The rules were created through analysis of samples of the 2014 training data. We also apply a set of rules to reattach clitics that may have been split apart from the base word; after examining the train dataset, we found that 95% of word-merging cases involve attachment of the conjunction /w/ ('and'). Furthermore, we removed duplications and elongations by merging a sequence of two or more of the same character into a single instance.
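The duplication/elongation cleanup can be sketched with a regular expression. The tatweel (kashida) handling is our addition for illustration; the text above only mentions repeated characters:

```python
import re

TATWEEL = "\u0640"  # Arabic kashida, used purely for visual elongation

def normalize_elongation(word):
    """Merge runs of two or more identical characters into one,
    after stripping kashida elongation characters (our assumption)."""
    word = word.replace(TATWEEL, "")
    return re.sub(r"(.)\1+", r"\1", word)
```

Note that this rule is deliberately aggressive: it also collapses legitimate doubled letters, which is why it is applied as one step in the pipeline rather than as a standalone corrector.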

3.3 Statistical Machine Translation Models
An SMT system translates sentences from one language into another. An alignment step learns a mapping from source words to target words; a phrase-based model is subsequently learned from the word alignments. The phrase-based model, along with other decoding features such as language and reordering models, is used to decode the test sentences. We use the SMT framework for spelling correction: erroneous sentences act as the source and their corrections act as the target in the training data.

Phrase-based error correction system (PBMT):
The available training data from the shared task consists of parallel sentences, from which we build a phrase-based machine translation system. Since the system learns at the phrase level, we hope to identify and correct various errors, especially ones not captured by MADAMIRA.
Character-based error correction system (CBMT): There has been a lot of work on using character-based models for Arabic transliteration into English (Durrani et al., 2014c) and for conversion between Arabic dialects and MSA (Sajjad et al., 2013a; Durrani et al., 2014a). The conversion of Arabic dialects to MSA at the character level can be seen as a spelling correction task in which small character-level changes convert a dialectal word into an MSA word. We likewise formulate our correction problem as character-level machine translation, where the pre-processed incorrect Arabic text is the source and the correct Arabic text provided by the shared task organizers is the target.
The goal is to learn correspondences between errors and their corrections. All the training data is used to train our phrase-based model, but we treat sentences as sequences of characters instead of words, as shown in Table 2. Our intuition behind using such a model is that it may capture and correct: (i) split errors, occurring due to the deletion of a space between two words; (ii) merge errors, occurring due to the insertion of a space between two words by mistake; and (iii) common spelling mistakes (hamzas, yas, etc.).
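The character-level reformatting illustrated in Table 2 amounts to rewriting each sentence as a space-separated character stream with an explicit word-boundary token, so that the MT system can insert or delete word boundaries like any other character. The boundary symbol "_" below is our choice for illustration, not necessarily the marker used in the actual system:

```python
def to_char_sequence(sentence, boundary="_"):
    """Rewrite a word sequence as a character sequence for the
    character-level MT system; spaces become an explicit token."""
    return " ".join(boundary if ch == " " else ch for ch in sentence)

def from_char_sequence(char_seq, boundary="_"):
    """Invert the transformation after decoding."""
    return "".join(" " if tok == boundary else tok
                   for tok in char_seq.split(" "))
```

Under this encoding, a split error (missing space) is fixed by the decoder emitting an extra boundary token, and a merge error (spurious space) by dropping one.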
We used the Moses toolkit (Koehn et al., 2007) to create word- and character-level models built on the best pre-processed data (mainly the feat14 tokens extracted using MADAMIRA, described in 3.1). We use the standard settings of MGIZA (Gao and Vogel, 2008) and grow-diag-final as the symmetrization heuristic (Och and Ney, 2003) in Moses to get the character-to-character alignments. We build 5-gram word and character language models using KenLM (Heafield, 2011).

3.4 Error-tolerant FST (EFST)
We adapted the error-tolerant recognition approach developed by Oflazer (1996). It was originally designed for the analysis of the agglutinative morphology of Turkish words and was used as a dictionary-based spelling correction module. The error-tolerant finite-state recognizer identifies strings that deviate mildly from a regular set of strings recognized by the underlying FSA. For example, suppose we have a recognizer for the regular set over {a, b} described by the regular expression (aba + bab)*, and we want to recognize slightly corrupted inputs: abaaaba may be matched to abaaba (correcting a spurious a), babbb may be matched to babbab (correcting a deletion), and ababba may be matched either to abaaba (correcting a b to an a) or to ababab (correcting the reversal of the last two symbols). This method is well suited to handling transposition errors resulting from swapping two letters, or typing errors involving neighboring letters on the keyboard. We use the Foma library (Hulden, 2009) to build the finite-state transducer, with the Arabic Wordlist as a dictionary. For each word, our system checks whether the word is recognized by the finite-state transducer and generates a list of correction candidates for the non-recognized ones. The candidates are words with an edit distance below a certain threshold. We score the candidates with a language model and take the best-scoring one as the correction for each word.
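A minimal sketch of the error-tolerant lookup, assuming a plain Python set in place of the Foma transducer and a toy unigram score in place of the language model:

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def correct_word(word, lexicon, lm_score, threshold=2):
    """Return the word itself if recognized; otherwise the best
    LM-scored candidate within the edit-distance threshold."""
    if word in lexicon:
        return word
    candidates = [w for w in lexicon if edit_distance(word, w) <= threshold]
    return max(candidates, key=lm_score, default=word)
```

On the (aba + bab)* example above, both abaaba and ababab lie within distance 2 of the corrupted input ababba, so the language-model score breaks the tie, mirroring how the full system ranks FSA candidates.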

4 Evaluation and Results
We experimented with different configurations to reach an optimal setting when combining the different modules. We evaluated our system for precision, recall, and F-measure (F1) against the dev-set reference and the 2014 test set, for various system configurations on the L2 dev and 2014 test sets. We achieved our best F-measure with the following configuration: the CBMT system applied after the clitic re-attachment rules, with the output then passed through the EFST. Using this combination we are able to correct 66.79% of the errors on the 2014 test set with a precision of 70.14%. Our system also outperforms the baseline on the L2 data, with an F-measure of 44.02% (compared to F1=20.28% when using only the Morph module).

5 Official Results
We present here the official results of our system (Morph+CBMT+Rules+EFST) on the 2015 QALB test set (Rozovskaya et al., 2015). The official results of our QCMUQ system are presented in Table 4. These results rank us 2nd in the L2 subtask and 5th in the L1 subtask.

6 Conclusion and Future Work
We described our system for automatic Arabic text correction. Our system combines rule-based methods with statistical techniques based on the SMT framework and LM-based scoring; we additionally used finite-state automata for corrections. Our best system outperforms the baseline, with an F-score of 68.12 on the L1-test-2015 test set and 38.90 on L2-test-2015. In the future, we want to focus on correcting punctuation errors to produce a more accurate system, and we plan to experiment with different combination methods similar to those used for combining MT outputs.