QCRI@QALB-2015 Shared Task: Correction of Arabic Text for Native and Non-Native Speakers’ Errors

This paper describes the error correction model that we used for the QALB-2015 Automatic Correction of Arabic Text shared task. We employed a case-specific correction approach that handles specific error types, such as dialectal word substitution and word splits and merges, with the aid of a language model. We also applied corrections specific to second language learners that handle erroneous preposition selection, definiteness, and gender-number agreement.


Introduction
In this paper, we provide a system description for our submissions to the Arabic error correction shared task (QALB-2015 Shared Task on Automatic Correction of Arabic Text) as part of the Arabic NLP workshop. The QALB-2015 shared task is an extension of the first QALB shared task, which addressed errors in comments written on Aljazeera articles by native Arabic speakers. The current competition includes two tracks and, in addition to errors produced by native speakers, also includes the correction of texts written by learners of Arabic as a foreign language (L2) (Zaghouani et al., 2015). The native track includes the Alj-train-2014, Alj-dev-2014, and Alj-test-2014 texts from QALB-2014. The L2 track includes L2-train-2015 and L2-dev-2015. This data was released for the development of the systems, which were then scored on the blind test sets Alj-test-2015 and L2-test-2015. We submitted runs for the automatic correction of text produced by native speakers (L1) and non-native speakers (L2). For both L1 and L2, we employed a case-specific approach, aided by a language model (LM), that handles specific error types such as dialectal word substitutions and word splits. We also constructed a list of corrections that we observed in the QALB-2014 data set and in the QALB-2015 training set, and we used these corrections to generate alternative corrections for words. When dealing with L2 text, we noticed specific patterns of mistakes, mainly related to gender-number agreement, phonetic spellings, and definiteness. As for punctuation recovery, we opted only to place periods at the ends of sentences and to correct reversed question marks; we opted not to invest further in punctuation recovery given the mixed results we obtained in the QALB-2014 shared task (Mubarak and Darwish, 2014).

QALB Corpus Error Analysis
The QALB corpus used for the task contains over two million words of manually corrected Arabic text. The corpus is composed of text produced by native speakers as well as non-native speakers (Habash et al., 2013). While annotating the corpus, the annotators detailed the various error types that were encountered and addressed, mainly for L1; additional proposed corrections for L2 errors were summarized without details. Understanding the error types sheds light on how they manifest and helps correct them properly. We inspected the training and development sets and noticed a number of potential issues that can be summarized as follows:
1. Syntax errors due to first-language influence: L2 learners may carry over rules from their native languages, resulting in syntactic and morphological errors, such as:
(a) Definiteness: In Arabic syntax, the possessive (idafa) construct between two words mostly requires that the first word be indefinite while the second is definite, as in "ktAb Altlmy*" ("the book of the student"; all Arabic examples are given in Buckwalter transliteration). Note that the first Arabic word does not carry the definite article "Al" while the second does. Erroneous application, or omission, of the definite article was common. For example, a student may write "ktAb tlmy*" or "AlktAb Altlmy*".
(b) Gender-number agreement: Gender-number agreement is another common error type.
The inflectional morphology of Arabic embeds gender-number markers in verbs, as in ">Ejbtny Almdynp" ("I liked the city"), which a learner may write as ">Ejbny Almdynp", without the feminine marker. Another case is the use of feminine singular adjectives with masculine plural inanimate nouns, as in "mdn EZymp" ("great cities"), which a learner may write as "mdn EZymwn" or "mdn EZymAt".
(c) Prepositions: Mixing up prepositions is another typical challenge for L2 learners, as preposition choice requires a good understanding of the spatio-temporal aspects of the language. Thus, L2 learners tend to confuse prepositions, writing, for example, "wSlt fy Almdynp" ("I arrived in the city") instead of "wSlt <lY Almdynp" ("I arrived to the city").
2. Spelling errors: Grasping sounds is another challenging issue, particularly given:
(a) Letters that sound the same but are written differently, such as "t" and "p", may lead to erroneous spellings like "mbArAt" ("game") instead of "mbArAp". Other such letter pairs are "S" and "s", and "T" and "t".
(b) Letters that have similar shapes but a different number of dots on or below them. We noticed that L2 learners often confuse letters such as "j", "H", and "x", and "S" and "D". This may lead to errors such as "Sbb AlxAdv" instead of "sbb AlHAdv" ("the reason for the accident").

Word Error Correction
In this section, we describe our case-specific error correction system, which handles specific error types with the aid of a language model (LM) built from an Aljazeera corpus. We built a word bigram LM from a set of 234,638 Aljazeera articles spanning 10 years. Mubarak et al. (2010) reported that spelling mistakes in Aljazeera articles are infrequent. We used this language model in all subsequent steps.
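As a rough illustration of the kind of bigram LM scoring described above, the sketch below builds counts from tokenized sentences and scores a token sequence with add-one smoothing. The class name, smoothing choice, and tokenization are our assumptions, not details from the paper.

```python
import math
from collections import Counter

class BigramLM:
    """Minimal word-bigram LM sketch (add-one smoothing is our assumption)."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for tokens in sentences:
            padded = ["<s>"] + tokens + ["</s>"]
            self.unigrams.update(padded)
            self.bigrams.update(zip(padded, padded[1:]))
        self.vocab = len(self.unigrams)

    def logprob(self, prev, word):
        # Add-one smoothing keeps unseen bigrams from zeroing out a candidate.
        num = self.bigrams[(prev, word)] + 1
        den = self.unigrams[prev] + self.vocab
        return math.log(num / den)

    def score(self, tokens):
        # Sum of bigram log-probabilities over the padded sequence.
        padded = ["<s>"] + tokens + ["</s>"]
        return sum(self.logprob(p, w) for p, w in zip(padded, padded[1:]))
```

In the correction steps that follow, such a scorer is only used comparatively: candidate rewrites of a sentence are ranked against each other, so the absolute probabilities matter less than their order.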
We attempted to address specific types of errors, including dialectal words, word normalization errors, and words that were erroneously split or merged. Before applying any correction, we always consulted the LM. We handled the following cases in order (L2-specific corrections are noted):
• Switching from English punctuation marks to Arabic ones, namely changing "?" → "؟" and "," → "،".
• Handling common dialectal words and common word-level mistakes. To do so, we extracted all the errors and their corrections from QALB-2014 (train, dev, and test) and the training split of the QALB-2015 data set. In all, we extracted 221,460 errors from this corpus. If an error had exactly one observed correction and the correction appeared at least twice, we used it as a deterministic correction. For example, the word "AlAHdAv" ("the events") was found 86 times in this corpus, and in all cases it was corrected to "Al>HdAv". There were 10,255 such corrections. Further, we manually revised words for which a specific correction was made in 60% or more of the cases (2,076 words) to extract a list of valid alternatives for each word. For example, the word "AlAmwr" appeared 157 times and was corrected to "Al>mwr" in 99% of the cases. We ignored the remaining observed corrections. An example dialectal word is "Ally" ("this" or "that"), which could map to "Al*y", "Alty", or "Al*yn". An example of a common mistake is ">n$A Allh" ("God willing"), which is corrected to ">n $A' Allh". When performing correction, given a word appearing in our list, we either replaced it deterministically if it had one correction, or we consulted our LM to pick among the different alternatives. When dealing with L2 data, we added 297 more deterministic corrections (e.g., "wvm" was always corrected to "vm").
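The lookup step above can be sketched as follows: single-candidate entries are applied deterministically, while multi-candidate entries (e.g., dialectal "Ally") are disambiguated by whichever alternative scores best in context. The function and table names are ours; `lm_score` stands in for any sentence scorer such as a bigram LM.

```python
# Toy slice of the correction table; the real system extracted
# 10,255 deterministic entries plus manually revised alternatives.
CORRECTIONS = {
    "AlAHdAv": ["Al>HdAv"],              # always corrected the same way
    "Ally": ["Al*y", "Alty", "Al*yn"],   # dialectal; context decides
}

def correct_tokens(tokens, lm_score):
    """Apply list-based corrections, using lm_score to break ties."""
    out = list(tokens)
    for i, word in enumerate(tokens):
        candidates = CORRECTIONS.get(word)
        if not candidates:
            continue
        if len(candidates) == 1:
            # Deterministic correction: a single observed replacement.
            out[i] = candidates[0]
        else:
            # Several valid alternatives: score each in sentence context.
            out[i] = max(candidates,
                         key=lambda c: lm_score(out[:i] + [c] + out[i + 1:]))
    return out
```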
• Handling errors pertaining to the different forms of alef, alef maqsoura and ya, and ta marbouta and ha, as described in Table 1 and Table 2. We used an approach similar to the one suggested by Moussa et al. (2012), and we also added the following cases, namely attempting to replace "&" with "&w" or "}w", and "}" with "y'" or vice versa (e.g., "mr&s" → "mr&ws", "qAry'" → "qAr}"). To generate the alternatives for words, we normalized all the unique words in the Aljazeera corpus and constructed a reverse look-up table keyed on the normalized form, with a list of seen alternatives that could have generated it. The look-up table contained 905k normalized word entries with corresponding denormalized forms. When correcting, a word is normalized and looked up in the table to retrieve possible alternatives, and we used the LM to pick the best alternative in context. Table 2 shows examples from the look-up table of normalized words and their alternative corrections.
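A minimal sketch of the reverse look-up table idea: fold alef variants, alef maqsoura/ya, and ta marbouta/ha into canonical letters, then map each normalized form back to the spellings actually seen in the corpus. The exact normalization mapping used by the authors is in their Table 1; the folding rules below are a common subset and the function names are ours.

```python
from collections import defaultdict

# Fold common confusable letters (Buckwalter): alef variants to "A",
# alef maqsoura to "y", ta marbouta to "h". An assumption, not the
# paper's full table.
NORM = str.maketrans({"<": "A", ">": "A", "|": "A", "{": "A",
                      "Y": "y", "p": "h"})

def normalize(word):
    return word.translate(NORM)

def build_table(vocabulary):
    """Reverse look-up: normalized form -> set of seen spellings."""
    table = defaultdict(set)
    for word in vocabulary:
        table[normalize(word)].add(word)
    return table

def denorm_candidates(word, table):
    # All corpus spellings that share this word's normalized form;
    # a bigram LM would then pick the best one in context.
    return sorted(table.get(normalize(word), {word}))
```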
• Removing repeated letters. People often repeat letters, particularly long vowels, for emphasis, as in ">xyyyyrAAA" (meaning "at last"). We corrected for elongation in a manner similar to previous work. When a long vowel is repeated, we replaced it with either the vowel alone (e.g., ">xyrA", "finally") or the vowel with one repetition (e.g., "sEwdyyn", "Saudis") and scored the alternatives using the LM. We expanded this to consonants as well (e.g., "bkvyrrrr" → "bkvyr"). If a repeated alef appeared at the beginning of a word, we attempted to replace it with alef lam ("Al").
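The elongation step above can be sketched with two regular-expression substitutions: any run of three or more identical letters yields one candidate with the letter once and one with it twice, and the LM would choose between them. The three-letter threshold is our assumption.

```python
import re

def dedup_candidates(word):
    """Candidates for an elongated word: each long run collapsed
    to one letter, or to two (e.g., for words like "sEwdyyn")."""
    if not re.search(r"(.)\1\1", word):
        return [word]                       # no elongation found
    once = re.sub(r"(.)\1{2,}", r"\1", word)
    twice = re.sub(r"(.)\1{2,}", r"\1\1", word)
    return [once, twice]
```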
• Handling merges and splits. Words are often concatenated erroneously. Thus, we attempted to split all words that were at least 5 letters long after letters that do not change their shapes when connected to the following letter, namely the different alef forms, "d", "*", "r", "z", "w", "p", and "Y" (e.g., "yArbnA" → "yA rbnA"). If the bigram was observed in the LM and the LM score was higher (in context) than for the concatenated form, the word was split. Conversely, some words were erroneously split in the middle. We attempted to merge every two words in sequence, and if the LM score was higher (in context) after the merge, the two words were merged (e.g., "AntSAr At" → "AntSArAt").
• Correcting out-of-vocabulary (OOV) words. For words that were not observed in the LM, we attempted replacing phonetically or visually similar letters and deleting/replacing letters that appear in dialectal words, as shown in Table 3. Generated suggestions are scored in context using the LM. Many of these errors are common in the L2 data set.
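The split half of the merge/split step can be sketched as below: candidate split points fall only after the non-connecting Buckwalter letters, and a split is kept only if the LM prefers it in context (the merge direction is symmetric and omitted for brevity). Function names and the `lm_score` interface are ours.

```python
# Buckwalter letters that keep their shape before a following letter
# (alef forms, d, *, r, z, w, p, Y), so a split after them is plausible.
SPLIT_AFTER = set("A<>|{d*rzwpY")

def split_candidates(word):
    """All (left, right) splits after a non-connecting letter,
    for words of at least 5 letters."""
    if len(word) < 5:
        return []
    return [(word[:i + 1], word[i + 1:])
            for i in range(len(word) - 1)
            if word[i] in SPLIT_AFTER]

def maybe_split(tokens, i, lm_score):
    """Split tokens[i] only if some split scores higher in context."""
    best, best_score = tokens, lm_score(tokens)
    for left, right in split_candidates(tokens[i]):
        cand = tokens[:i] + [left, right] + tokens[i + 1:]
        s = lm_score(cand)
        if s > best_score:
            best, best_score = cand, s
    return best
```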
• For L2 data only, as mentioned earlier, we observed errors pertaining to definiteness and gender-number agreement. We generated possible corrections as follows: for words that start with the definite article, we scored the word with and without the definite article; we did the same for words ending with ta marbouta ("p"). We also added other alternatives for each word by adding the definite article and/or the ta marbouta (for words missing one, the other, or both). In all cases, we used the LM to select the most probable alternative in context.
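The L2-specific alternative generation above amounts to toggling the definite article "Al" and the ta marbouta "p" on a word. A naive sketch (stripping "Al"/"p" as plain prefixes/suffixes, which over-generates for words that merely begin with those letters; the LM is expected to discard bad variants):

```python
def l2_alternatives(word):
    """Variants of a Buckwalter word with/without "Al" and "p".
    Naive affix stripping is our simplification, not the paper's rule."""
    stem = word[2:] if word.startswith("Al") else word
    stem = stem[:-1] if stem.endswith("p") else stem
    # All four combinations of definite article and feminine marker.
    return sorted({stem, "Al" + stem, stem + "p", "Al" + stem + "p"})
```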
2. QCRI-2-L1: case-based correction for the L1 test file, additionally adding alternatives for possible errors in the definite article "Al" and the feminine marker "p", as described in Section 3.
3. QCRI-1-L2: case-based correction for the L2 test file, with handling of the definiteness and feminine markers. Table 4 and Table 5 report the officially submitted results on the development set and test set, respectively, and Table 6 reports the results of the new system on the development and test sets of the QALB-2014 shared task.

Conclusion
In this paper, we presented an automatic approach for correcting Arabic text based on handling specific error types. We handled common dialectal words, some dialectal morphological features, letter normalization errors (e.g., alef, ta marbouta), and word splitting and merging. For the L2 corpus, we also corrected letters that L2 learners often confuse because of similarity in shape or sound, and we attempted to correct errors pertaining to definiteness and gender-number agreement.
For punctuation recovery, we opted to put periods at the ends of sentences. Preliminary experiments explored fuzzy matching using a character-based model (Sajjad et al., 2012; Durrani et al., 2014; Darwish et al., 2014). We intend to incorporate this development, among others, in our ongoing research. The fuzzy match algorithm would correct cases like "Al>bEA'" → "Al<rbEA'" and "ystxdnwnhA" → "ystxdmwnhA".
L2 learners present new spelling error types. Such errors may not be typical spelling errors, as they may produce valid words that are erroneous in context; a methodology to detect such cases would therefore be of great help. We also plan to handle more grammar errors, covering cases such as numbers, case endings, gender-number agreement, irregular (broken) plurals, and tanween errors.