GWU-HASP-2015@QALB-2015 Shared Task: Priming Spelling Candidates with Probability

In this paper, we describe our system HASP-2015 (Hybrid Arabic Spelling and Punctuation Corrector), in which we introduce significant improvements over our previous version, HASP-2014, and with which we participated in the QALB-2015 Second Shared Task on Arabic Error Correction. Our system utilizes probabilistic information on errors and their possible corrections in the training data and combines it with an open-source reference dictionary (or word list) for detecting errors and for generating and filtering candidates. We enhance our system further by allowing it to generate candidates for common semantic and grammatical errors. Finally, an n-gram language model is used for selecting the best candidates. We use a CRF (Conditional Random Fields) classifier for correcting punctuation errors in a two-pass process in which the system first learns punctuation placement and then learns to identify punctuation types.


Introduction
In this paper, we describe our system for Arabic spelling error detection and correction, HASP-2015 (Hybrid Arabic Spelling and Punctuation Corrector).
We introduce significant improvements to our previous version, HASP-2014 (Attia et al., 2014). We participate with HASP-2015 in the QALB-2015 Second Shared Task on Arabic Error Correction.
The problem of Arabic spelling error correction has been investigated in a number of papers (Haddad and Yaseen, 2007; Alfaifi and Atwell, 2012; Hassan et al., 2008; Attia et al., 2012; Alkanhal et al., 2012). Significant contributions were also introduced in the 2014 Shared Task on Arabic Error Correction, including Nawar and Ragheb (2014), Jeblee et al. (2014), and Mubarak and Darwish (2014).
The QALB-2015 shared task is an extension of the first QALB shared task, which took place in 2014. QALB-2014 addressed errors in comments written on Aljazeera articles by native Arabic speakers. This year's competition includes two tracks and, in addition to errors produced by native speakers, also covers the correction of texts written by learners of Arabic as a foreign language (L2). The native track includes the Alj-train-2014, Alj-dev-2014, and Alj-test-2014 texts from QALB-2014. The L2 track includes L2-train-2015 and L2-dev-2015. These data were released for the development of the systems. The systems are scored on the blind test sets Alj-test-2015 and L2-test-2015. Our system is ranked third on the Alj track and fourth on the L2 track.
The shared task data deals with "errors" in the general sense, which comprise: a) punctuation errors; b) non-word errors; c) real-word spelling errors; d) grammatical errors (related to case, number and gender); and e) affective variations such as elongation (kashida) and speech effects such as character multiplication for emphasis. Our previous system, HASP-2014, handles only errors of types (a), (b), and (e). We extend our system in HASP-2015 to cover errors of types (c) and (d) as well.

Our Methodology
Our system uses a pipeline of four components: 1) regular-expression normalization for deterministic errors; 2) a discriminative classifier for punctuation errors; 3) spelling error detection and handling; and 4) post-processing for fixing common system errors.
For punctuation errors, we use a classifier in a two-pass process in which the system first learns punctuation placement and then learns to identify punctuation types. The reason for this staging is that learning six punctuation types at once could be problematic for the classifier. We hypothesize that separating placement from identification simplifies the task: in the first step the classifier makes a binary decision on whether or not to insert a punctuation mark, and in the second step it predicts the type of that mark.
In HASP-2014, we relied only on a reference dictionary (or word list) for detecting errors and generating candidates. The candidates were generated according to the edit distance between the erroneous word and possible candidates.
In HASP-2015, we generate probabilistic information from the training data on errors and their possible corrections and utilize this information in detecting errors and generating candidates. The reference dictionary is relegated to a back-off role, used when no probabilistic information is available in the training data. Our system is able to detect and generate candidates for common semantic and grammatical errors. Candidates and their probabilistic scores are passed to an n-gram language model for selecting the best candidates. Our system is explained in detail in the next section.
For organizational purposes, we divide errors into two types: a) nonverbal errors, which include affective variations, punctuation, word merges and word splits; and b) verbal errors, which include non-word errors, real-word errors, grammatical errors, and dialectal words/expressions. In other words, verbal errors are related to the alphabetical buildup of words, and nonverbal errors go beyond this alphabetical buildup.

Nonverbal Errors
Nonverbal errors include affective variations, punctuation errors, word merges and word splits.

Affective Variations
There are many instances in the shared task's data that can be treated using simple and straightforward conversion via regular expression replace rules. We estimate that these instances cover 10% of the non-punctuation errors in the development set. In HASP, we use deterministic heuristic rules to normalize the text, including the removal of speech effects, such as الرجاااال AlrjAAAAl 'men', which is converted to الرجال AlrjAl; the removal of decorative kashida, e.g. دمــاء dm__A' 'blood'; and the conversion of Hindi digits (٠١٢٣٤٥٦٧٨٩) into Arabic digits [0-9].
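The three normalization rules described above can be sketched as follows. This is a minimal illustration, not the HASP rule set: the function name `normalize`, the three-repetition threshold for speech effects, and the choice to collapse a run to a single letter are all assumptions made for the example.

```python
import re

# Map Arabic-Indic (U+0660..U+0669) and Extended Arabic-Indic (U+06F0..U+06F9)
# digits to ASCII digits.
_DIGITS = {0x0660 + i: str(i) for i in range(10)}
_DIGITS.update({0x06F0 + i: str(i) for i in range(10)})

KASHIDA = "\u0640"  # tatweel, the decorative elongation character

# Assumption: an Arabic letter written three or more times in a row is a
# speech effect and collapses to a single letter.
_REPEATED_LETTER = re.compile(r"([\u0621-\u064A])\1{2,}")


def normalize(text: str) -> str:
    """Apply kashida removal, speech-effect collapsing and digit conversion."""
    text = text.replace(KASHIDA, "")           # remove decorative kashida
    text = _REPEATED_LETTER.sub(r"\1", text)   # collapse elongated letters
    return text.translate(_DIGITS)             # Hindi digits -> [0-9]
```

Collapsing a run to one letter is a simplification: a word whose correct spelling contains a doubled letter inside an elongated run would need a dictionary or language-model check to recover.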

Punctuation Errors
Punctuation errors constitute 40% of the errors in the QALB Arabic data. In HASP-2015, we continue to handle the six basic punctuation marks: comma, colon, semi-colon, exclamation mark, question mark, and period.
For classification, we use a Conditional Random Fields (CRF) classifier, CRF++ (Lafferty et al., 2001), with a window size of 5. The features we use are extracted from the 'column' file in the QALB shared task data, which includes preprocessing with the MADAMIRA morphological disambiguator (Pasha et al., 2014). In HASP-2015, we split the task of the classifier into two subtasks: placement and identification.
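To make the notion of a context window concrete, the sketch below collects the tokens around a position into a feature dictionary. It assumes that "window size 5" means the current token plus two tokens on each side; the function name, the padding symbol and the feature keys are hypothetical, and the actual MADAMIRA-derived feature columns are not reproduced here.

```python
def window_features(tokens, i, half_width=2, pad="<PAD>"):
    """Return the tokens in a window centred on position i as a feature dict.

    Assumption: a window of size 5 is read as offsets -2..+2 around token i.
    Positions outside the sentence are filled with a padding symbol.
    """
    feats = {}
    for offset in range(-half_width, half_width + 1):
        j = i + offset
        feats[f"w[{offset}]"] = tokens[j] if 0 <= j < len(tokens) else pad
    return feats
```

In a CRF toolkit the same idea is normally expressed declaratively through feature templates rather than in code; the function only shows which positions such a window would cover.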

Pass II: Identification
This stage uses the same set of features as the placement stage, in addition to the output of that stage, to determine the type of punctuation mark to be placed. The predicted class is one of the following seven: colon_after, comma_after, exclmark_after, period_after, qmark_after, semicolon_after, and NA.
This two-pass process shows significant improvement over the baseline for the Alj and L2 data, as illustrated in Tables 1 and 2.
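The data preparation behind the two passes can be sketched as below, with the classifier itself left out. The binary label name `PUNC` and both function names are assumptions made for the illustration; only the seven identification classes come from the text.

```python
NA = "NA"


def placement_labels(gold_labels):
    """Pass I targets: collapse every punctuation type into one binary label.

    'PUNC' is a hypothetical label name meaning "some mark follows this token".
    """
    return ["PUNC" if label != NA else NA for label in gold_labels]


def add_placement_feature(feature_rows, placement_pred):
    """Pass II input: the pass I features plus the pass I output as a feature."""
    return [dict(row, placement=pred)
            for row, pred in zip(feature_rows, placement_pred)]
```

For example, the gold sequence `["NA", "comma_after", "NA", "period_after"]` yields the pass I targets `["NA", "PUNC", "NA", "PUNC"]`, while pass II is trained on the original seven-way labels with the extra `placement` feature.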

Word Merges
Merged words occur when the space(s) between two or more words are deleted, such as هذاالنظام h*AAlnZAm 'this system', which should be هذا النظام h*A AlnZAm. These errors constitute 3.67% and 3.48% of the error types in the shared task's development and training data, respectively. We use Attia et al.'s (2012) algorithm for dealing with merged words, $l - 3$, where $l$ is the word length.
Moreover, we found that common merge errors and their corrections can conveniently be learned from the training data, leading to significant improvement, as shown in the final results.
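One way to read the expression l − 3 is as the number of split points in a merged word of length l when each resulting part must be at least two letters long. The sketch below enumerates those split points and keeps the ones where both halves are dictionary words. The minimum part length of two, the function names and the toy lexicon are assumptions for illustration, not a restatement of Attia et al.'s (2012) algorithm.

```python
def split_points(word, min_len=2):
    """Positions where the word can be cut so both parts have >= min_len letters.

    With min_len=2 a word of length l has l - 3 such positions.
    """
    return list(range(min_len, len(word) - min_len + 1))


def split_candidates(word, lexicon, min_len=2):
    """Return (left, right) pairs whose two halves are both in the lexicon."""
    pairs = []
    for k in split_points(word, min_len):
        left, right = word[:k], word[k:]
        if left in lexicon and right in lexicon:
            pairs.append((left, right))
    return pairs
```

Using Buckwalter transliteration, `split_candidates("h*AAlnZAm", {"h*A", "AlnZAm"})` tries six split points and returns the single pair `("h*A", "AlnZAm")`.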

Word Splits
Besides the problem of merged words, there is also the problem of split words, where one or more spaces are inserted within a word, such as صم ام Sm Am 'valve' (the correct form is صمام SmAm). This error type constitutes 6% of the errors found in the shared task's training and development sets. We found that the vast majority of instances of this type of error involve the clitic conjunction waw "and", which should be represented as a word prefix. Therefore, we opted to handle this problem in a partial and shallow manner, using deterministic rules that reattach the separated conjunction morpheme waw و w "and" to the succeeding word.
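The waw reattachment rule can be sketched with a single regular expression. This is an assumed formulation for illustration: it treats a waw as detached only when it stands alone as a token, and the function name is hypothetical.

```python
import re

WAW = "\u0648"  # the conjunction waw

# A waw that stands alone as a token (not preceded by a non-space character)
# and is followed by whitespace and then another token.
_DETACHED_WAW = re.compile(rf"(?<!\S){WAW}\s+(?=\S)")


def reattach_waw(text: str) -> str:
    """Attach a free-standing waw to the word that follows it."""
    return _DETACHED_WAW.sub(WAW, text)
```

Because of the lookbehind, a word that merely ends in waw is left untouched; only a waw that forms a token of its own is joined to the next word.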

Verbal Errors
Verbal errors include non-word errors, real-word errors, grammatical errors, and dialectal words/expressions.

Error Detection
Methods for detecting spelling errors have usually varied according to the type of error. A non-word spelling error is typically defined as follows (adapted from Brill and Moore, 2000): given an alphabet $\Sigma$ and a reference dictionary $D$ consisting of strings in $\Sigma^*$, a word $w$ is a spelling error if $w \in \Sigma^*$ and $w \notin D$. For real-word errors, a reference dictionary will not help, as both the error and the correction are valid words in isolation. Instead, a language model, for example, is used to estimate the likelihood of words in a given context, and words that fall below a certain threshold are considered possible errors. POS bigrams and trigrams have also been used for that purpose (Kukich, 1992). We employ a single algorithm to detect all types of spelling errors, whether non-word, semantic, grammatical or dialectal. Our algorithm for error detection is to find words in the training data where $n \cdot P(c \mid w) > P(w' \mid w)$, where $w$ is a spelling error, $c$ is the correction, $n$ is a threshold, and $w'$ is the reference word (the word left unchanged), which is itself considered as a candidate. That is, the probability of $c$ given $w$, multiplied by $n$, is greater than the probability of $w'$ given $w$. In our system, we set the threshold $n = 2$, which effectively means that a semantic error is only considered when the probability of the correction is more than half the probability of the reference word. The threshold estimation is an empirical question determined by the robustness of the language model and the quantity of noise in the training data.
In HASP-2015, the reference dictionary is not totally discarded, but used as a back-off resource to cover instances not included in the training data. We use AraComLex Extended, an open-source reference dictionary (or word list) of 9.2M full-form words (Attia et al., 2012), as our back-off reference dictionary.
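The detection rule and its dictionary back-off can be sketched as follows. The sketch assumes the training data has been reduced to aligned (source word, gold word) pairs, that the probability of a correction given a word is estimated by relative frequency, and that a word is flagged when n times the probability of its best correction exceeds the probability of leaving it unchanged. The function names and the pair representation are hypothetical.

```python
from collections import Counter, defaultdict


def correction_table(aligned_pairs):
    """Estimate P(gold | source) by relative frequency from aligned pairs."""
    counts = defaultdict(Counter)
    for source, gold in aligned_pairs:
        counts[source][gold] += 1
    return {
        source: {gold: c / sum(golds.values()) for gold, c in golds.items()}
        for source, golds in counts.items()
    }


def is_error(word, table, lexicon, n=2.0):
    """Flag a word when n * P(best correction | word) > P(unchanged | word)."""
    probs = table.get(word)
    if probs is None:
        # Back-off: no training evidence, so consult the reference word list.
        return word not in lexicon
    keep = probs.get(word, 0.0)
    best = max((p for c, p in probs.items() if c != word), default=0.0)
    return n * best > keep
```

With n = 2, a word that is corrected in 4 of its 10 training occurrences is flagged (2 × 0.4 > 0.6), whereas one corrected in only 1 of 10 is not (2 × 0.1 < 0.9).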

Candidate Generation
Correcting spelling errors is ideally treated as a probabilistic problem formulated as (Kernighan, 1990; Norvig, 2009; Brill and Moore, 2000):

$$\hat{c} = \operatorname*{argmax}_{c} P(w \mid c)\, P(c)$$

Here $P(c)$ is the probability that $c$ is the correct word (the language model), $P(w \mid c)$ is the probability that $w$ is typed when $c$ is intended (the error model or noisy channel model), and $\operatorname{argmax}_{c}$ is the scoring mechanism that computes the correction $c$ that maximizes the probability.
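A toy instance of this noisy channel formulation is given below: the chosen correction is the candidate that maximizes the product of the error model and the language model. The dictionaries stand in for real models, and every probability shown in the usage note is invented for illustration.

```python
def best_correction(word, candidates, error_model, language_model):
    """Return the candidate c maximizing P(word | c) * P(c).

    error_model maps (word, c) to P(word | c); language_model maps c to P(c).
    Missing entries are treated as probability zero.
    """
    return max(
        candidates,
        key=lambda c: error_model.get((word, c), 0.0) * language_model.get(c, 0.0),
    )
```

With an error model of `{("w", "a"): 0.6, ("w", "b"): 0.4}` and a language model of `{"a": 0.001, "b": 0.01}`, the function selects `"b"`: although `"a"` has the higher channel probability, the language model outweighs it (0.4 × 0.01 > 0.6 × 0.001).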
In HASP-2014, we ranked candidates according to their edit distance score using the finite state compiler, foma (Hulden, 2009), but in HASP-2015, we rank candidates according to their probability, $P(c \mid w)$, as derived from the training data, and we pass candidates along with their probability scores to the language model. Again, the edit distance candidates and their ranking are used when no probability information is available from the training data. The following are some illustrative examples of the statistical information extracted from the training data for the various error types.
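The ranking with back-off described above can be sketched as follows. Note the substitution: the system uses the finite-state compiler foma for edit-distance candidates, whereas this sketch uses a brute-force enumeration of strings one edit away, purely for illustration. The uniform back-off scores, the function names and the table layout (word mapped to a dictionary of corrections and probabilities) are assumptions.

```python
def edits1(word, alphabet):
    """All strings one edit away: deletion, transposition, substitution, insertion."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + ch + b[1:] for a, b in splits if b for ch in alphabet]
    inserts = [a + ch + b for a, b in splits for ch in alphabet]
    return set(deletes + transposes + replaces + inserts)


def candidates(word, table, lexicon, alphabet):
    """Return [(candidate, score)] ranked best first."""
    if word in table:
        # Learned corrections, ranked by their training-data probability.
        return sorted(table[word].items(), key=lambda kv: kv[1], reverse=True)
    # Back-off: dictionary words one edit away, scored uniformly (assumption).
    found = sorted(edits1(word, alphabet) & lexicon)
    return [(c, 1.0 / len(found)) for c in found]
```

A word seen in training is ranked purely by its learned probabilities; an unseen word falls back to dictionary neighbours, and an empty list means no candidate was found.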

Error Correction and Final Results
For error correction, namely selecting the best solution among the list of candidates, we use an n-gram language model (LM), as implemented in the SRILM package (Stolcke et al., 2011). We use the 'disambig' tool for selecting candidates from a map file where erroneous words are provided with a list of possible corrections. We also use the 'ngram' utility in post-processing for deciding on whether a split-word solution has a better probability than a single word solution.
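The map file handed to 'disambig' can be produced with a few lines of code. The sketch assumes the layout in which each line holds a source word followed by alternating candidate and probability fields; this reflects our understanding of the SRILM map format and should be checked against the SRILM documentation. The function name and the four-decimal formatting are hypothetical.

```python
def map_line(word, scored_candidates):
    """Format one map entry: the source word, then candidate/probability pairs.

    Assumption: fields are whitespace-separated, so a candidate must not
    itself contain a space.
    """
    parts = [word]
    for cand, prob in scored_candidates:
        parts += [cand, f"{prob:.4f}"]
    return " ".join(parts)
```

For example, `map_line("Aly", [("<lY", 0.7), ("Aly", 0.3)])` gives the line `Aly <lY 0.7000 Aly 0.3000`. Because fields are separated by whitespace, split-word solutions cannot be listed directly in this form, which is consistent with deciding between split-word and single-word solutions in a separate step.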
Our trigram language model is trained on the Arabic Gigaword Corpus, 5th edition (Parker et al., 2011), and a corpus crawled from Al-Jazeera.
For the LM disambiguation, we use the '-fb' option (forward-backward tracking), and we provide candidates with probability scores collected from the QALB training data. Used in tandem, forward-backward tracking and the probability scores yield better results than the default values. We evaluate the performance of our system against the gold standard using the Max-Match (M²) method for evaluating grammatical error correction by Dahlmeier and Ng (2012).
Our best F-score is obtained by priming candidates from the training data, adding the Al-Jazeera corpus to Gigaword 5, and using the two-pass CRF punctuation prediction. Tables 3 and 4 show the results on the Alj and L2 development sets, respectively. For the baseline, we use the older version of our system (HASP-2014), and the results show significant improvement in performance. The two biggest gains in performance, as shown in Table 3, came from experiments 2 and 3, when candidates and their probabilities were extracted from the training data and used to supplement candidates generated from the reference dictionary using edit distance. Experiment 3, i.e. using real-word candidates, allowed our system to handle semantic and grammatical errors, a domain which was beyond the scope of the previous version. Dialectal errors were included in Experiment 2, which deals with non-word candidates. Note that the system could benefit from a larger training set, should one become available in the future.
The slight improvements gained by experiments 4 through 7 indicate the dimensions along which future improvements might be achieved. These dimensions include better ways of handling merge errors, post-processing for correcting system-specific errors, better handling of punctuation errors, and better selection of data for training the language model.
It is also to be noted that the gold data suffers from instances of inconsistency. For example, لابد lAbd "must" is split as two words, لا بد lA bd, in 64% of the cases, while مازال mAzAl "still" is split in 32% of the cases.
Moreover, while conducting error analysis, we found many errors in the manual annotation of the gold development data. For example, اللذي All*y "who" is incorrectly corrected as الذى Al*Y, while the correct correction is الذي Al*y, and many more errors are not detected at all in the gold data, such as انكم Ankm "you" and الملتحدة AlmltHdp for المتحدة AlmtHdp "united". In total, we automatically found over 200 errors in the gold development data, although manual checking showed that some of these instances were incorrectly reported. Nevertheless, we assume that more investigation of the consistency and accuracy of the gold data can lead to better performance and better evaluation of the systems participating in the shared task.

Conclusion
We have described our system HASP for the automatic correction of spelling and punctuation mistakes in Arabic. To our knowledge, this is the first system to handle punctuation errors. We utilize and improve on an open-source full-form dictionary, introduce a better algorithm for handling merged word errors, tune the LM parameters, and combine the various components, leading to cumulative improvements in the results.