Arib@QALB-2015 Shared Task: A Hybrid Cascade Model for Arabic Spelling Error Detection and Correction

In this paper we present the Arib system for Arabic spelling error detection and correction, developed for the second Shared Task on Automatic Arabic Error Correction. Our system contains several components that address various types of spelling errors and applies a combination of rule-based, statistical, and lexicon-based approaches in a cascade fashion. We also employ two core models, namely a probabilistic model and a distance-based model. Our results on the development and test sets indicate that using the correction components in a cascaded way yields the best results. The overall recall of our system is 0.51, with a precision of 0.67 and an F1 score of 0.58.


Introduction
In last year's shared task on Automatic Arabic Error Correction at the Arabic NLP Workshop (the QALB-2014 shared task), a diverse set of approaches was presented, including pipeline, hybrid, and cascade designs. These approaches used different techniques such as supervised learning, rule-based and/or lexicon-based correction, and statistical language modeling. Furthermore, the systems presented used several external resources, including Arabic Gigaword, the AraComLex dictionary, Arabic Wikipedia, and Aljazeera articles, to name but a few. The QALB-2015 shared task is an extension of the first QALB-2014 shared task [1]. QALB-2014 handled errors in comments written by native Arabic speakers on Aljazeera articles [2]. This year's competition includes two subtasks: in addition to errors made by native Arabic speakers, it also covers the correction of texts written by new learners of Arabic [3]. The native-speaker subtask includes the Alj-train-2014, Alj-dev-2014, and Alj-test-2014 texts from QALB-2014. The L2 subtask includes L2-train-2015 and L2-dev-2015. This data was released for the development of the systems. To build on the previous efforts, we present in this paper the design and implementation of the Arib system, which addresses the problem of Arabic spelling error detection and correction. The name Arib [أريب] is an Arabic word that means a person who is bright, skilled, intelligent, and insightful. Arib employs a hybrid cascade model with distance-based and probability-based techniques that reuse a large-scale dataset compiled from different external resources. This paper is organized as follows: section 2 presents related work, section 3 shows how we compiled the necessary language resources for our system, section 4 highlights the main components of our proposed system, section 5 presents our experiments on the system, section 6 reports the obtained results, and section 7 concludes the paper with final remarks and future directions.

Related Work
The task of Arabic spelling error detection and correction generally addresses edit, add, split, merge, punctuation, orthographical, dialectal, and other error types. Depending on the techniques used, systems designed for this task draw on language resources such as textual corpora and dictionaries. One of the earliest studies on Arabic spelling detection and correction is the work conducted by Al-Fedaghi and Amin [4]. Their system detects four error types (edit, add, split, and merge) and reduces words to their original roots in order to identify spelling errors. The dictionaries used in this system are arranged according to Arabic word roots. The work presented in [5] describes a system that uses an Arabic morphological analyzer, a lexicon, and heuristics to detect five types of errors: reading, hearing, touch-type, morphological, and editing errors. A similar system that uses the Arabic Web Dictionary (AWD) is presented in [6]. That system uses dictionary lookup, morphological analysis, and regular expressions to detect the four error types as well as punctuation errors. Other dictionaries used for the Arabic spelling error detection and correction task include Ayaspell [7] and AraComLex [8] [9].
Arabic language corpora have also been used for spelling error detection and correction. A corpus supports the task by providing a resource for training machine-learning-based spellchecking systems. Popular corpora used in Arabic spelling error detection and correction systems include the QALB corpus [10], the Muaidi corpus [11], and the Arabic Gigaword. The QALB corpus is a large Arabic corpus of manually corrected sentences and is considered a "spelling-error corpus" for Arabic. Systems that used this corpus for error detection and correction include [12], [13], [14], [7], [15], [8], and [16]. The Muaidi corpus has been used in the work presented in [17]. It is a personally built corpus containing a set of 101,987 word types. The Arabic Gigaword corpus is a large corpus of Arabic text from Arabic news sources, developed by the Linguistic Data Consortium. The work described in [9] uses the Gigaword corpus to support the task of spelling error detection and correction.
Techniques and tools reported in the literature for supporting the Arabic spelling error detection and correction task include morphological analysis [12] [14], N-gram scores [17] [8], conditional random fields [14] [8], and Naïve Bayes [15]. Like the systems described in the literature, Arib uses language resources such as dictionaries and corpora and applies several different techniques to support the task of Arabic spelling error detection and correction.

Language Resources
An important component of any spelling error detection and correction system is a large-scale dictionary that covers most Arabic words, so that misspelled words can be detected. To build this dictionary, we reverse-engineered the QALB corpus by replacing the wrong words in the annotated text with the correct words from the final text. We also used several other corpora, namely the KSU corpus of classical Arabic [19], Open Source Arabic Corpora (OSAC) [20], the Al-Sulaiti Corpus [21], and the KACST Arabic Corpus [22]. These corpora were compiled into one complete corpus, and we then used the KHAWAS tool (KACST Arabic Corpora Processing Tool) [23] to extract the words with their frequencies. This final step produced a very large dictionary that was later used in our system (see Fig. 1).
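KHAWAS itself is not reproduced here. As a rough illustration of the kind of word-frequency table that the extraction step yields, the following is a minimal Python sketch. The function name `build_frequency_dictionary`, the regular expression used for tokenization, and the three-sentence toy corpus are all assumptions made for this example and are not taken from the paper or from KHAWAS.

```python
import re
from collections import Counter

# Runs of characters from the basic Arabic letter range (hamza through yaa).
# This tokenization is an assumption of the sketch, not what KHAWAS applies.
ARABIC_WORD = re.compile("[\u0621-\u064A]+")

def build_frequency_dictionary(texts):
    """Count how often each Arabic word form occurs across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(ARABIC_WORD.findall(text))
    return counts

# Toy corpus standing in for the merged corpora.
corpus = ["تراب الوطن", "تراث الوطن", "تراب"]
freq = build_frequency_dictionary(corpus)
```

A table of this shape (word form mapped to count) is what the probabilistic and distance-based correctors described later would consume.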

Our Approach
The design of Arib is based on a hybrid cascade approach to spelling error detection and correction. By cascade we mean that the original Arabic text passes through several components before a final result is returned. Each component participates in identifying spelling errors and recommending a correction. The final result is a compiled collection of all spelling errors identified and the suggested corrections. Our system covers a range of spelling errors: edit, add, split, merge, punctuation, phonological, and common mistakes. The general architecture and major components of the Arib system are shown in Fig. 2.
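The cascade idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_cascade`, `squeeze_repeats`, and `add_final_period` are hypothetical names, and the two placeholder stages merely stand in for the real components described in the following subsections.

```python
import re
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[str], str]]

def run_cascade(text: str, stages: List[Stage]):
    """Pass text through each stage in order and log every stage that changed it."""
    log = []
    for name, correct in stages:
        updated = correct(text)
        if updated != text:
            log.append((name, text, updated))
        text = updated
    return text, log

# Placeholder stages (hypothetical), not the correctors used in Arib.
def squeeze_repeats(text: str) -> str:
    # Collapse any run of three or more identical characters to one.
    return re.sub(r"(.)\1{2,}", r"\1", text)

def add_final_period(text: str) -> str:
    # Append a period only when no terminal mark is present.
    return text if text.endswith((".", "!", "?", "؟")) else text + "."

stages = [("rules", squeeze_repeats), ("punctuation", add_final_period)]
corrected, log = run_cascade("soooo good", stages)
```

Because each stage receives the output of the previous one, the order of the stages matters; the log gives the compiled list of changes per component.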

MADAMIRA Corrector
MADAMIRA [24] is a system developed for morphological analysis and disambiguation of Arabic text. Since the organizers of the shared task provided the data pre-processed with MADAMIRA, we used the features generated by MADAMIRA to support spelling error detection and correction. The output of MADAMIRA includes an analysis and correction of spelling mistakes involving Alef (أ) and word-final Yaa (ى). Spelling errors of this type can be detected and corrected easily and accurately using this component.

Rule-Based Corrector
In this component, knowledge of common spelling error patterns is represented as rules that can be applied to provide a correction. All rules are applied to the misspelled word to generate possible corrections. These rules were created through analysis of samples from the QALB Shared Task Dataset and with input from an Arabic language expert who summarized the common misspellings of new learners of Arabic.
• Separate all numbers from words.
• Fix speech-effect characters.
• Remove extra characters by eliminating sequences of three or more of the same character.
• Insert a space after every word ending with the Ta-Marbouta character (ة) (p) if it is attached to the following word.
• Insert a space after the prepositions "ElY, ALY" (على، إلى) (On, For) if they are attached to the following word.
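Several of the rules above can be approximated with regular expressions. The following Python sketch is an illustration only: the function name `apply_rules`, the character ranges, and the choice to collapse a repeated run to a single letter are assumptions, and the speech-effect rule is omitted because the paper does not spell it out.

```python
import re

ARABIC_LETTER = "[\u0621-\u064A]"  # basic Arabic letter range (assumption)
LETTER = r"[^\W\d_]"               # any Unicode letter

def apply_rules(text: str) -> str:
    # Separate numbers from words, in both directions.
    text = re.sub(rf"(\d)({LETTER})", r"\1 \2", text)
    text = re.sub(rf"({LETTER})(\d)", r"\1 \2", text)
    # Collapse three or more identical letters to one (digits are left alone,
    # so a number such as 1000 is not damaged).
    text = re.sub(rf"({LETTER})\1{{2,}}", r"\1", text)
    # Insert a space after Ta-Marbouta when another Arabic letter follows it.
    text = re.sub(rf"ة(?={ARABIC_LETTER})", "ة ", text)
    # Insert a space after the prepositions when a word is attached to them.
    text = re.sub(rf"\b(على|إلى)(?={ARABIC_LETTER})", r"\1 ", text)
    return text

examples = {
    "كتاب3": apply_rules("كتاب3"),
    "5كتب": apply_rules("5كتب"),
    "راااائع": apply_rules("راااائع"),
    "مدرسةالحي": apply_rules("مدرسةالحي"),
    "علىالطاولة": apply_rules("علىالطاولة"),
}
```

The rules run in a fixed order, so a later rule sees the output of the earlier ones; a production rule set would likely need exceptions that this sketch does not model.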

Probabilistic-Based Spelling Correction
This component scans the text for spelling errors using Bayesian probability and is based on Peter Norvig's spell-checking algorithm [25], [26]. It is a probabilistic technique: it computes the probability that a given word is the correction for a misspelled word. The component uses our customized dictionary, with the word frequencies extracted by KHAWAS, to enumerate all possible corrections for the misspelled word. From these candidates, we choose the word with the highest probability as the correction. For example, the misspelled non-word "تزاب" "tzAb" could be corrected to "تراب" "trAb" (Soil) or "تراث" "trAv" (Heritage); this component suggests the correction based on the probabilities.
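A compact Python sketch in the style of Norvig's corrector is given below. It is not Arib's implementation: the alphabet string, the helper names, and the frequency counts are made up for illustration. Following Norvig's published design, known words at edit distance 1 are preferred over those at distance 2, and word frequency (a stand-in for the prior P(c)) decides among candidates at the same distance; whether Arib keeps this exact ordering is not stated in the paper.

```python
from collections import Counter

# Arabic letters tried for replacements and insertions (assumed inventory).
ARABIC_ALPHABET = "ءآأؤإئابةتثجحخدذرزسشصضطظعغفقكلمنهوىي"

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in ARABIC_ALPHABET]
    inserts = [L + c + R for L, R in splits for c in ARABIC_ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    """All strings two edits away from word."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

def known(words, freq):
    """Keep only the candidates that appear in the frequency dictionary."""
    return {w for w in words if w in freq}

def correct(word, freq):
    """Return the most frequent known candidate, preferring smaller edit distance."""
    candidates = (known([word], freq) or known(edits1(word), freq)
                  or known(edits2(word), freq) or {word})
    return max(candidates, key=lambda w: freq.get(w, 0))

# Hypothetical counts, for illustration only.
freq = Counter({"تراب": 120, "تراث": 300})
near = correct("تزاب", freq)  # only one known word is a single edit away
tie = correct("تراج", freq)   # both known words are one edit away
```

In this toy run, "تزاب" resolves to "تراب" because that word is one edit away while "تراث" is two edits away; for "تراج", both words are one edit away, so the higher count selects "تراث".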

Levenshtein-Distance-based Spelling Correction
This component implements the Symmetric Delete Spelling Correction algorithm (FAROO), a robust algorithm for error detection and correction based on the Damerau-Levenshtein edit distance [27]. A dictionary entry is selected as the correction based on its edit distance to the misspelled word. The algorithm works by generating, from each dictionary word, variants within an edit distance of at most 2 using deletions only, and adding these variants to the dictionary together with the original word. At lookup time, variants within an edit distance of at most 2 are generated from the input word in the same way and searched for in the dictionary.
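The symmetric delete idea can be sketched in Python as follows. This is a simplified illustration, not the FAROO code: the helper names and the toy dictionary are assumptions. Matching on shared delete variants can surface candidates that are farther away than the limit, so the sketch verifies each candidate with an optimal string alignment distance (a restricted form of Damerau-Levenshtein).

```python
from collections import defaultdict
from itertools import combinations

def deletes(word, max_distance=2):
    """The word itself plus every string obtained by deleting up to max_distance characters."""
    results = {word}
    for k in range(1, min(max_distance, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - k):
            results.add("".join(word[i] for i in keep))
    return results

def build_index(dictionary, max_distance=2):
    """Map every delete variant to the dictionary terms that produce it."""
    index = defaultdict(set)
    for term in dictionary:
        for variant in deletes(term, max_distance):
            index[variant].add(term)
    return index

def osa_distance(a, b):
    """Optimal string alignment distance (edits plus adjacent transpositions)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

def lookup(word, index, max_distance=2):
    """Return (distance, term) pairs within max_distance, closest first."""
    candidates = set()
    for variant in deletes(word, max_distance):
        candidates |= index.get(variant, set())
    scored = [(osa_distance(word, term), term) for term in candidates]
    return sorted(pair for pair in scored if pair[0] <= max_distance)

# Toy dictionary, for illustration only.
index = build_index(["تراب", "تراث", "كتاب", "مدرسة"])
suggestions = lookup("تزاب", index)
```

The design trades memory for speed: the index stores many delete variants per dictionary word, but a lookup only needs to generate deletes of the input word rather than all insertions, replacements, and transpositions over the alphabet.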

Open Source Arabic autocorrect (Ghalatawi)
Ghalatawi [28] is an open-source Arabic spelling error detection and correction system that is available online. The system discovers common spelling errors using dictionary lookup and regular expressions. It is written in Python and has been integrated as a stage in our cascade.

Punctuation Recovery
This component runs a set of rules against the input to detect missing periods, semicolons, and commas in a given Arabic text. Rules on punctuation are extracted from Arabic language resources and modeled within this component. Previous work has noted that it is always better to keep the existing punctuation marks in the text [15], so we keep the current punctuation marks (period, comma, question mark, exclamation mark, colon, semicolon, parentheses, and quotation mark) and attempt only to insert the missing marks. The output of this component is the final output of the system.

System Experiments
As previously mentioned, Arib consists of several components designed to tackle different types of errors. For the second shared task, we submitted three versions of the system, which we refer to as Arib-1, Arib-2, and Arib-3. Table 1 lists the components of our system and shows which of them are incorporated in each version.

Results and Discussion
To evaluate the performance of our system, we used the M2 Scorer [29], the official scorer of the shared task. Results from the evaluation show that Arib performs well as each component is added to the system.
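As a quick consistency check on the overall scores stated in the abstract (precision 0.67, recall 0.51), the F1 score is the harmonic mean of precision and recall. The short Python sketch below only recomputes that final figure; it does not reproduce the edit-level matching performed by the M2 Scorer, and the function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(0.67, 0.51)
```

Rounded to two decimals, the result agrees with the reported F1 score of 0.58.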

Conclusion and Further Research
In this paper, we described a hybrid cascade approach to Arabic spelling error detection and correction, developed for participation in the second shared task on Automatic Arabic Error Correction. Our approach combines rule-based linguistic techniques with probabilistic and distance-based spelling correction techniques. We experimented with different configurations of the developed components, and the results of these experiments are encouraging.
Future work involves further enhancements to the system, including developing more intelligent techniques to correct split and merge errors. We also plan to use more advanced techniques in the punctuation corrector, including machine learning and semantic text analysis.