TECHLIMED@QALB-Shared Task 2015: a hybrid Arabic Error Correction System

This paper reports on the participation of Techlimed in the Second Shared Task on Automatic Arabic Error Correction organized by the Arabic Natural Language Processing Workshop. This year's competition includes two tracks and, in addition to errors produced by native speakers (L1), also includes the correction of texts written by learners of Arabic as a foreign language (L2). Techlimed participated in the L1 track, for which we developed two systems. The first one is based on the spell checker Hunspell with specific dictionaries. The second one is a hybrid system based on rules, morphological analysis and statistical machine translation. Our results on the test set show that the hybrid system outperforms the lexicon-driven approach, with a precision of 71.2%, a recall of 64.94% and an F-measure of 67.93%.


Introduction
Spell checking is an important task in Natural Language Processing (NLP). It is used in a wide range of applications such as word processing tools, machine translation, information retrieval and optical character recognition. Automatic error correction tools for Arabic underperform in comparison with those for other languages such as English or French. The lack of appropriate resources (e.g. publicly available corpora and tools) and the complexity of the Arabic language can explain this difference.
Arabic is a challenging language for any NLP tool for several reasons. First, it has a rich and complex morphology: it is a highly agglutinative, inflectional and derivational language that uses clitics (proclitics and enclitics), and it is characterised by a great number of synonyms. Second, short vowels are missing in written texts although they are mandatory from a grammatical point of view and are needed to disambiguate between several possible readings of a word. Third, Arabic has many varieties. Modern Standard Arabic is the variety of the news and formal speech. Classical Arabic refers to religious and classical texts. Dialectal Arabic has no standard rules for orthography and its spelling is based on pronunciation; the same word can therefore be written with many different surface forms depending on the dialectal origin of the writer. Another very popular way of writing Arabic on the Internet and on social media such as Facebook or Twitter is "Arabizi", a Latinized form of Arabic written with Latin letters and digits (Aboelezz 2009).
For our participation in this second QALB Shared Task, we tried to improve the systems we developed for the first edition (Mostefa 2014). The first approach is a lexicon-driven spell checker using Hunspell (Hunspell 2007). The second approach is a hybrid system based on correction rules, morphological analysis and statistical machine translation.
The paper is organized as follows: section 2 gives an overview of the automatic error correction evaluation task and the resources provided by the organizers; section 3 describes the systems we developed for the evaluation; and section 4 discusses the results and draws some conclusions.

Task description and language resources
The QALB-2015 shared task (Rozovskaya 2015) is an extension of the first QALB shared task, which took place in 2014. QALB-2014 addressed errors in comments written on Aljazeera articles by native Arabic speakers. This year's competition includes two tracks and, in addition to errors produced by native speakers, also includes the correction of texts written by learners of Arabic as a foreign language (L2) (Zaghouani 2015). The evaluation relies on the Levenshtein edit distance (Levenshtein 1966) and the implementation of the M2 scorer (Dahlmeier 2012). Precision, Recall and F-measure are then calculated for each sentence.
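To make the edit distance cited above (Levenshtein 1966) concrete, here is a minimal sketch of the standard dynamic-programming computation. It is only an illustration of the distance itself and does not reproduce the M2 scorer; the function name `levenshtein` is ours and not taken from the shared task tools.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions and substitutions
    needed to turn `source` into `target`.

    Works on any two sequences: strings (character level) or
    lists of tokens (word level).
    """
    # previous[j] holds the distance between the first i-1 items of
    # source and the first j items of target.
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]  # distance from source[:i] to the empty target
        for j, t in enumerate(target, start=1):
            cost = 0 if s == t else 1
            current.append(min(
                previous[j] + 1,         # deletion from source
                current[j - 1] + 1,      # insertion into source
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]
```

Applied to the Buckwalter strings `<stqbl` and `Astqbl`, the distance is 1 (a single substitution of the first letter). The same function can be called on token lists to count word-level edits.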

System description
Rule-based system
For the rule-based system, we used the spell checker Hunspell (Hunspell 2007) with different dictionaries and affix files. Hunspell uses two files to define the spell checking of a language. The first is a dictionary file that contains a stem list of the language. The second is an affix file that maps the lemmas to their affixes. Affixes in Hunspell are divided into prefixes and suffixes; infixes are only included in the stems and are spell checked in terms of proximity in lexical morphemes. The dictionary and affix files in Hunspell are similar to the ones depicted in Table 1 and Table 2. The dictionary contains the minimal words, which are mapped to the affix rules. The affix file contains mainly prefix and suffix rules that apply to the words of the dictionary. For instance, the prefixation rule /Tb/ in Table 2 creates the word form ولدن (wldn), while the suffixation rule /cc/ creates ولدني (wldny).
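The interplay between dictionary entries and affix rules can be sketched in a few lines of Python. This is a simplified illustration, not Hunspell's actual algorithm: it ignores stripping and condition fields, and the flags (`Pa`, `Sa`), the stem `ktb` and the affixes are hypothetical examples in Buckwalter transliteration, not entries from the dictionaries used in this work.

```python
from itertools import product


def expand(stem, flags, prefix_rules, suffix_rules):
    """Return every surface form licensed for `stem`.

    `flags` is the set of rule names attached to the stem in the
    dictionary; `prefix_rules` and `suffix_rules` map a rule name to
    the affix it adds (a toy stand-in for an affix file).
    """
    prefixes = [affix for name, affix in prefix_rules.items() if name in flags]
    suffixes = [affix for name, affix in suffix_rules.items() if name in flags]

    forms = {stem}
    forms.update(p + stem for p in prefixes)
    forms.update(stem + s for s in suffixes)
    # Cross product: a prefix and a suffix applied to the same stem.
    forms.update(p + stem + s for p, s in product(prefixes, suffixes))
    return forms


# Hypothetical toy data: one stem carrying one prefix and one suffix rule.
toy_prefix_rules = {"Pa": "w"}
toy_suffix_rules = {"Sa": "ny"}
toy_forms = expand("ktb", {"Pa", "Sa"}, toy_prefix_rules, toy_suffix_rules)
```

A word is then accepted by the toy checker if it belongs to the expanded set of some stem; anything else is flagged as a potential error.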
For the evaluation, we used Hunspell with a modified version of the Hunspell Arabic dictionary and affix files, version 3.2 (Ayaspell 2008). We obtained a precision of 56.64% and a recall of 19.78%, for an F-measure of 29.32%, on the development set. These results seem low, but Hunspell does not correct punctuation errors, which account for around 30% of the errors in the data.
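The scores reported above are consistent with the balanced F-measure, i.e. the harmonic mean of precision and recall. The sketch below assumes that definition (beta = 1), which the text does not state explicitly; the function name `f_measure` is ours.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; with beta = 1 this is the harmonic mean of
    precision and recall. Inputs and output share the same scale
    (here, percentages)."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hunspell system on the development set.
hunspell_f = round(f_measure(56.64, 19.78), 2)
# Hybrid system on the test set.
hybrid_f = round(f_measure(71.2, 64.94), 2)
```

With the figures from this paper, `hunspell_f` evaluates to 29.32 and `hybrid_f` to 67.93, matching the reported F-measures.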

Hybrid system based on SMT
For the second approach, we combined a Statistical Machine Translation (SMT) system with the morphological output of MADAMIRA (Pasha 2014) and some automatic rules to correct the text. We built three different SMT systems based on the Moses toolkit (Koehn 2007), each with a different input for training the phrase-based translation model. For the first system (Tech-1), we used the output of the MADAMIRA morphological analyzer and the corrected texts to train a MADAMIRA/correct translation model. We took the text from the Alj-train-2014 data and applied the corrections to build a parallel MADAMIRA/correct text corpus of 20,428 sentences, on which we trained a phrase-based translation model. The Alj-dev-2014 data was used to tune the translation model with Moses. The second system (Tech-2) is the same as the first one, except that we added Alj-dev-2014 to the training data and used Alj-test-2014 as development data for tuning the translation model. The third system (Tech-3) uses the original erroneous text instead of the MADAMIRA output to build a parallel error/correct text corpus, on which we trained a phrase-based model. As for Tech-1, the Alj-dev-2014 data was used to tune the translation model with Moses.
For the word alignment, we used GIZA++ (Och 2003). For the language model, we used newspaper corpora that are publicly available or were collected by Techlimed. The sources are the Open Source Arabic Corpora (Saad 2010) (20M words), the Adjir corpus (Abdelali 2005) (147M words) and other corpora we collected from various online newspapers, for a total of 300M words. The language model was created with the IRSTLM toolkit (Federico 2008).
Table 3 System component description
For each system, we then applied the following rules:
• Convert eastern Arabic digits (۰۱۲۳٤٥٦۷۸۹) into western Arabic digits (0 1 2 3 4 5 6 7 8 9).
• Separate numbers from words.
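The two rules listed above can be sketched as follows. The paper does not describe how they are implemented, so this is only one plausible reading: both the Arabic-Indic (U+0660 to U+0669) and the Extended Arabic-Indic (U+06F0 to U+06F9) digit blocks are mapped to ASCII digits, and "separating numbers from words" is taken to mean inserting a space wherever a digit touches a letter. All function names are ours.

```python
import re

# Eastern digits from both Unicode blocks, each block in order 0..9.
_EASTERN_DIGITS = (
    "".join(chr(0x0660 + i) for i in range(10))    # Arabic-Indic
    + "".join(chr(0x06F0 + i) for i in range(10))  # Extended Arabic-Indic
)
_DIGIT_TABLE = str.maketrans(_EASTERN_DIGITS, "0123456789" * 2)

# Zero-width boundaries between a digit and a letter, in either order.
# [^\W\d_] matches any Unicode letter (a word character that is neither
# a digit nor an underscore).
_DIGIT_THEN_LETTER = re.compile(r"(?<=\d)(?=[^\W\d_])")
_LETTER_THEN_DIGIT = re.compile(r"(?<=[^\W\d_])(?=\d)")


def normalize_digits(text):
    """Rule 1: convert eastern Arabic digits into western digits."""
    return text.translate(_DIGIT_TABLE)


def separate_numbers(text):
    """Rule 2: insert a space between a number and an adjacent word."""
    text = _DIGIT_THEN_LETTER.sub(" ", text)
    return _LETTER_THEN_DIGIT.sub(" ", text)


def apply_rules(text):
    """Apply both rules in the order in which they are listed."""
    return separate_numbers(normalize_digits(text))
```

For example, `apply_rules("abc123def")` returns `"abc 123 def"`, and an Arabic word immediately followed by the digits ٢٠١٥ comes out as the same word, a space, and `2015`.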
The results obtained on the development data (Alj-test-2014) and on the evaluation set (Alj-test-2015) are given in Table 4 and Table 5.
Table 5 Results on the evaluation data (Alj-test-2015)
The best system, Tech-2, combines the MADAMIRA correction with the SMT system trained on 21k sentences and with the correction rules.
Table 6 Performance of Tech-2 on the evaluation data (Alj-test-2015) by component.

Error analysis and discussion
Several difficulties arise when developing automatic correction with a spell checker. They are due not only to the complex morphological system of the Arabic language, but also to reasons related to the capabilities of the spell checker itself. The following list shows the types of problems we encountered (the Buckwalter transliteration (Buckwalter 2002) is given for each Arabic word example).
• Problems related to the pronunciation similarity between Hamza and Alif in some words, such as إستعجال/إستقبل (<stEjAl/<stqbl), which are wrong versions of استعجال/استقبل (AstEjAl/Astqbl) respectively.
• Similar-form problems leading to wrong word substitutions (i.e. incorrect substitution of one word for another): for example, words with similar forms such as أن (>n) and إن (<n) are confused, and ان (An), which does not exist in Arabic, is frequently used.
• Deverbal nouns ending in ة/ـة: the spell checker does not respect the Arabic forms of deverbal nouns, called Masdar in the Arabic grammatical tradition. As a result, it is unable to correct words in which ه/ـه is wrongly used in word-final position instead of ة/ـة (e.g. إبادة (<bAdp), which has the deverbal form /?ifâlat/ إفالة (<fAlp), is written إباده/اباده (<bAdh/AbAdh)).
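For readers unfamiliar with the Buckwalter transliteration used in the examples above, the following sketch maps Arabic letters and common diacritics to their Buckwalter symbols and back. It covers the core scheme only (the dagger alif and alif wasla symbols are left out), and the function names are ours. The table is built from Unicode code point ranges so that the code does not depend on how Arabic text is rendered.

```python
# Arabic code points in Unicode order:
#   U+0621..U+063A  hamza .. ghain
#   U+0640..U+064A  tatweel .. yeh
#   U+064B..U+0652  tanween, short vowels, shadda, sukun
_ARABIC = (
    [chr(c) for c in range(0x0621, 0x063B)]
    + [chr(c) for c in range(0x0640, 0x0653)]
)
# Buckwalter symbols for the same code points, in the same order.
_BUCKWALTER = "'|>&<}AbptvjHxd*rzs$SDTZEg" + "_fqklmnhwYy" + "FNKaui~o"
assert len(_ARABIC) == len(_BUCKWALTER)

_TO_BW = dict(zip(_ARABIC, _BUCKWALTER))
_FROM_BW = dict(zip(_BUCKWALTER, _ARABIC))


def to_buckwalter(text):
    """Transliterate Arabic script; unknown characters pass through."""
    return "".join(_TO_BW.get(ch, ch) for ch in text)


def from_buckwalter(text):
    """Inverse mapping, from Buckwalter symbols to Arabic script."""
    return "".join(_FROM_BW.get(ch, ch) for ch in text)
```

For instance, `to_buckwalter("إبادة")` returns `<bAdp`, the form quoted in the last item of the list.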

Conclusion
This paper has reported on the participation of Techlimed in the 2015 QALB Shared Task on Automatic Arabic Error Correction. We developed two approaches, one based on Hunspell and the other based on a hybrid SMT system. The best results were obtained with the hybrid SMT system, which was able to deal with punctuation mark corrections. We also tested a hybrid system combining Hunspell and the SMT system, but it did not give better results than the SMT system alone. In future work, we plan to include the DiiNAR lexical database (Abbès 2004) and a large dialectal corpus to improve the results.