The Second QALB Shared Task on Automatic Text Correction for Arabic

We present a summary of QALB-2015, the second shared task on automatic text correction of Arabic. The shared task extends QALB-2014, which focused on correcting errors in Arabic texts produced by native speakers of Arabic. This year's competition, in addition to native data, includes texts produced by learners of Arabic as a foreign language. The report includes an overview of the QALB corpus, the dataset used for training and evaluation; an overview of the participating systems; the results of the competition; and an analysis of the results and systems.


Introduction
The task of text correction has recently been attracting a lot of attention in the Natural Language Processing (NLP) community, but most of the effort in this area has concentrated on English, especially on errors made by learners of English as a Second Language. Four competitions devoted to error correction for non-native English writers took place recently: two editions of HOO (Dale and Kilgarriff, 2011; Dale et al., 2012) and two of CoNLL (Ng et al., 2013; Ng et al., 2014). Shared tasks of this kind are extremely important, as they bring together researchers, promote the development of relevant techniques, and disseminate key resources, such as benchmark data sets.
In the area of Arabic text correction, there has been a significant body of work as well (Shaalan et al., 2003; Hassan et al., 2008). However, due to the lack of a common benchmark data set, it has been difficult to measure progress on this task. The QALB shared task on automatic text correction of Arabic, organized within the framework of the Qatar Arabic Language Bank (QALB) project, 1 is the first effort aimed at constructing a benchmark data set that allows for the development and evaluation of automatic correction systems for Arabic.
In this paper, we present a summary of the second edition of the QALB competition. The first edition, QALB-2014, took place in conjunction with the Arabic NLP workshop at EMNLP-2014 and focused on errors found in online commentaries produced by native speakers of Arabic. QALB-2014 attracted a lot of attention and resulted in nine submitted systems with a variety of approaches, including rule-based frameworks, machine-learning classifiers, and statistical machine translation methods. This year's competition extends the first edition by adding a track that focuses on errors found in essays written by learners of Arabic.
Eight teams participated in the competition this year, including several participants from last year who submitted improved systems for the native track. The non-native (L2) track also allowed the participants to determine to what extent their approaches needed to be modified to handle a new set of errors. Overall, QALB-2015 generated a diverse set of approaches to automatic text correction of Arabic.
The rest of the paper is organized as follows. In Section 2, we present the shared task framework. This is followed by an overview of the QALB corpus (Section 3). Section 4 describes the shared task data, and Section 5 presents the approaches adopted by the participating teams. Section 6 discusses the results of the competition. Section 7 concludes the paper.

Task Description
The QALB-2015 shared task extends QALB-2014, the first shared task on Arabic text correction, which was created as a forum for competition and collaboration on automatic error correction in Modern Standard Arabic and took place in conjunction with the Arabic NLP workshop at EMNLP-2014.
QALB-2014 addressed errors in online user comments written in response to Aljazeera articles by native Arabic speakers. This year's competition includes two tracks: native and non-native. In addition to the Aljazeera commentaries written by native speakers, it also includes texts produced by learners of Arabic as a foreign language (L2).
Both the native and the non-native data are written in Modern Standard Arabic and are part of the QALB corpus (see Section 3), a manually corrected collection of Arabic texts. The Aljazeera section of the corpus is described in prior work. The L2 data is extracted from two learner corpora of Arabic: the Arabic Learners Written Corpus (ALWC) (Farwaneh and Tamimi, 2012) and the Arabic Learner Corpus (ALC) (Alfaifi and Atwell, 2012). For details about the L2 data, we refer the reader to Zaghouani et al. (2015a).
The shared task participants were provided with training and development data to build their systems, but were also free to make use of additional resources, including corpora, linguistic resources, and software, as long as these were publicly available.
For evaluation, we adopted the standard framework developed for similar error correction competitions in English, which we also used last year: system outputs are compared against the gold annotations using Precision, Recall, and F1. Systems are ranked based on the F1 scores obtained on the test sets.
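As an illustration of this metric (a simplified sketch, not the official M2 scorer, which additionally supports alternative gold annotations), precision, recall, and F1 can be computed over sets of edits; the (start, end, correction) edit representation is assumed here purely for exposition:

```python
# Simplified scoring sketch. The official evaluation uses the M2
# scorer; the (start, end, correction) edit representation below
# is an assumption made only for this illustration.

def score(system_edits, gold_edits):
    """Return (precision, recall, F1) for one sentence pair."""
    sys_set, gold_set = set(system_edits), set(gold_edits)
    tp = len(sys_set & gold_set)                  # correctly proposed edits
    p = tp / len(sys_set) if sys_set else 1.0     # precision
    r = tp / len(gold_set) if gold_set else 1.0   # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0    # harmonic mean
    return p, r, f1

# The system proposes two edits; the gold annotation has three;
# exactly one of the proposed edits matches the gold.
p, r, f1 = score(
    [(0, 1, "w"), (3, 3, ",")],
    [(0, 1, "w"), (2, 3, "ybdw"), (5, 5, ".")],
)
```

Ranking by F1 rewards systems that balance proposing many corrections (recall) against proposing only reliable ones (precision).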

The QALB Corpus
The QALB corpus was created as part of the QALB project. One of the goals of the QALB project is to develop a large, manually corrected corpus for a variety of Arabic texts, including texts produced by native and non-native writers, as well as machine translation output. Within the framework of this project, comprehensive annotation guidelines and a specialized web-based annotation interface have been developed (Obeid et al., 2013; Zaghouani et al., 2015a).
The texts are manually annotated for errors by native Arabic speakers. The annotation begins with an initial automatic pre-processing step. Next, the files are processed with the morphological analysis and disambiguation system MADAMIRA (Pasha et al., 2014), which corrects a common class of spelling errors. The files are then assigned to a team of trained human annotators, who are instructed to correct all errors in the input.
The errors include spelling, punctuation, word choice, morphology, syntax, and dialectal usage. However, it should be stressed that the error classification was only used for guiding the annotation process; the annotators were not instructed to mark the type of error but only needed to specify an appropriate correction.
Once the annotation was complete, the corrections were automatically grouped into the following seven action categories, based on the action required to correct the error: {Edit, Add, Merge, Split, Delete, Move, Other}. 2 Table 1 presents a sample Arabic news comment along with its manually corrected form, its romanized transliteration, 3 and the English translation. The errors in the original and the corrected forms are underlined and co-indexed. Table 2 presents a subset of the errors for the example shown in Table 1 along with the error types and annotation actions. The Appendix at the end of the paper lists all annotation actions for that example. 4

Essays written by L2 speakers differ from the native texts both in genre and in the types of mistakes. For this reason, the general QALB L1 annotation guidelines were extended by adding new rules describing the error correction procedure for texts produced by L2 speakers (Zaghouani et al., 2015a). Because the genres are different, the writing styles exhibit different distributions of words, phrases, and structures. Further, while native texts mostly contain orthographic and punctuation mistakes, non-native writings also exhibit lexical choice errors, missing and extraneous words (e.g., articles, prepositions), and mistakes in word order.
[Table 1 near here: a sample comment in its original and manually corrected forms, shown in romanized transliteration, with the English translation: "You cannot imagine the extent of my happiness when I read these wonderful and respectful analyses because I am a young man and I wish from God to perform Umrah passing through the Al-Aqsa Mosque; and it seemed that this was elusive, so that when anyone heard the wish, he would say that you can wish that your great grandchildren may achieve it because your wish is impossible."]

Both for the native and the L2 data, we ensured that sentences from the same comment or essay belonged to the same set, i.e., training, development, or test. Furthermore, Aljazeera comments belonging to the same article were included in only one of the shared task subsets (i.e., training, development, or test). The commentaries were also split by annotation time.
Similar to QALB-2014, the data was made available to the participants in three versions:
(1) plain text, one document per line; (2) text with annotations specifying the errors and the corresponding corrections; and (3) feature files specifying morphological information obtained by running MADAMIRA (Pasha et al., 2014), a tool for morphological analysis and contextual disambiguation of Modern Standard Arabic. Using the output of MADAMIRA, we generated thirty-three features for each word. The features specify various properties: the part-of-speech (POS), lemma, aspect, person, gender, number, and so on. Among its features, MADAMIRA generates normalization forms and, as a result, corrects a large subset of a special class of spelling mistakes: those in words containing the letters Alif and final Ya.
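To illustrate this error class, a correction can be heuristically flagged as an Alif/Ya fix when the original and corrected words become identical after the common Arabic normalization of Alif variants and Alif-Maqsura/Ya. The sketch below illustrates the phenomenon under that assumption; it is not MADAMIRA's actual procedure:

```python
# Hedged sketch of the Alif/Ya error class: two words that differ
# only in Hamzated-Alif forms or in Alif-Maqsura vs. Ya. This
# mirrors the common Arabic normalization convention and is NOT
# MADAMIRA's implementation.

ALIF_VARIANTS = "أإآٱ"  # Hamza/Madda/Wasla Alif forms

def normalize(word):
    """Map Alif variants to bare Alif and Alif-Maqsura to Ya."""
    out = []
    for ch in word:
        if ch in ALIF_VARIANTS:
            out.append("ا")      # bare Alif
        elif ch == "ى":          # Alif-Maqsura
            out.append("ي")      # Ya
        else:
            out.append(ch)
    return "".join(out)

def is_alif_ya_error(original, corrected):
    """True if the correction only fixes Alif/Ya spelling."""
    return original != corrected and normalize(original) == normalize(corrected)

# e.g. الاقصى (missing Hamza on the Alif) vs. الأقصى (corrected)
```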
Alif and final Ya are the source of the most common types of spelling errors in Arabic, involving Hamzated Alifs and Alif-Maqsura/Ya confusion (Habash, 2010; El Kholy and Habash, 2012). We refer to these errors as Alif/Ya errors (see also Section 6). Several participants this year and in QALB-2014 used MADAMIRA predictions as part of their systems. We show the performance of the MADAMIRA baseline in Section 6.

Table 4 presents statistics on the shared task data for the native and non-native tracks separately. Table 5 shows the distribution of annotations by action type. The majority of corrections (over 50%) belong to the type Edit. This is followed by mistakes that require the insertion of a missing word or punctuation mark (about a third of all errors). With respect to the differences between the Aljazeera and L2 data, note that the L2 data has a higher percentage of corrections of type Edit but fewer additions of missing words. This could be explained by the fact that a large percentage of Aljazeera errors (over 40%) involve missing punctuation. In addition to this difference, there are almost twice as many deletions and five times more moves in the L2 data, which could be due to grammatical errors that are not typical of native speakers.

Table 6: Participating teams.
CUFE (Nawar, 2015): Cairo University (Egypt)
GWU (Attia et al., 2015): George Washington University (USA)
QCMUQ: Carnegie Mellon University in Qatar (Qatar) and Qatar Computing Research Institute (Qatar)
QCRI (Mubarak et al., 2015): Qatar Computing Research Institute (Qatar)
SAHSOH (Zaghouani et al., 2015b): Bouira University (Algeria) and Carnegie Mellon University in Qatar (Qatar)
TECH (Mostefa et al., 2015): Techlimed.com (France)
UMMU (Bougares and Bouamor, 2015): Laboratoire d'Informatique de l'Université du Maine (France) and Carnegie Mellon University in Qatar (Qatar)

[Table 7: Approaches adopted by the participating teams.]

Participants and Approaches
Eight teams participated in the shared task. Table 6 presents the list of participating teams and their institutions. Each team was allowed to submit up to three outputs. Overall, we received 12 outputs for the native track and 10 outputs for the non-native track (one of the teams, TECH, did not participate in the non-native track). The submitted systems included a diverse set of approaches, incorporating rule-based frameworks, statistical machine translation, and machine-learning models, as well as hybrid systems. The top-scoring teams employed hybrid methods that combine a variety of techniques. For example, the CUFE system extracted rules from the morphological analyzer and learned their probabilities from the training data, while the UMMU system combined statistical machine translation with MADAMIRA corrections. Table 7 summarizes the approaches adopted by each team.

Results
In this section, we present the results of the competition. As in QALB-2014, we adopted the standard Precision (P), Recall (R), and F1 metrics, which were also used in recent shared tasks on grammatical error correction in English: the HOO competitions (Dale and Kilgarriff, 2011; Dale et al., 2012) and CoNLL (Ng et al., 2013). The results are computed using the M2 scorer (Dahlmeier and Ng, 2012), which was also used in the CoNLL shared tasks.
Tables 8 and 9 present the official results of the evaluation on the test sets for the Aljazeera data and the L2 data, respectively. In both tables, the results are sorted according to the F1 scores obtained by the systems; Column 1 shows the system rank according to the F1 score, and MADAMIRA refers to the baseline of applying the corrections proposed by MADAMIRA.

The range of the scores is quite wide: from 53 to 72 F1 on the native data and from 25 to 41 on the non-native data. Observe that the performance on the non-native data is substantially lower for all of the teams. This is to be expected, as non-native writers exhibit a variety of errors: spelling, grammar, and word choice. In contrast, the native data contains many punctuation and spelling mistakes that can be handled by MADAMIRA and are much easier to address (see also the analysis below). In fact, we used MADAMIRA as a baseline system (last row in the tables). As the results show, MADAMIRA provides quite a competitive baseline, especially on the native data, but all of the teams managed to beat this baseline, in many cases by a large margin. This suggests that even though MADAMIRA is a sophisticated system, it cannot handle all of the errors, and the participating teams developed approaches that are complementary to it.
It is interesting to compare these results to those obtained in similar shared tasks on English as a Second Language (ESL) writing. While the performance on native MSA data in Table 8 is better than on ESL, the performance on L2 writing is quite similar. For instance, the highest score in the HOO-2011 shared task (Dale and Kilgarriff, 2011), which addressed all errors, was 21.1 (Rozovskaya et al., 2011); the highest performance in the CoNLL-2013 shared task, which used the same evaluation metric, was 31.20 (Rozovskaya et al., 2013). 5

In addition to providing the official rankings, we also analyze system performance on different types of mistakes by automatically assigning errors to one of the following categories: punctuation errors; errors involving Alif and Ya; and all other errors. Punctuation errors account for 39% of all errors in the Aljazeera data. 6 Tables 10 and 11 show the performance of the teams in three settings: with punctuation errors removed; with Alif/Ya errors removed; and with both punctuation and Alif/Ya errors removed. In general, both for the native and the non-native data, performance drops when the Alif/Ya errors are removed, which indicates that these errors may be easier. When the punctuation errors are removed, the performance on the native data improves slightly but goes down a little on the non-native data. Overall, it can be concluded that the punctuation mistakes do not significantly affect the performance and are of the same difficulty level as the remaining errors.
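These per-category settings amount to filtering one class of edits out of both the system output and the gold annotation before re-running the scorer. A minimal sketch, with hypothetical category labels attached to each edit:

```python
# Illustrative filtering for the per-category evaluation settings.
# An edit is assumed to be a (span, correction, category) triple,
# and the category labels ("punct", "alif_ya", "other") are
# hypothetical names used only for this sketch.

def remove_categories(edits, excluded):
    """Drop all edits whose category is in `excluded`."""
    return [e for e in edits if e[2] not in excluded]

gold = [
    ((0, 1), "AlÂqSý", "alif_ya"),   # Hamza/Alif-Maqsura fix
    ((3, 3), ",", "punct"),          # missing comma
    ((4, 5), "ybdw", "other"),       # other word-level edit
]

no_punct = remove_categories(gold, {"punct"})
no_punct_no_ay = remove_categories(gold, {"punct", "alif_ya"})
```

Scoring the filtered system and gold edit sets with the same metric then yields the per-setting numbers.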
[Table 11: L2-test-2015: Results on the test set in different settings: with punctuation errors removed from evaluation; with normalization errors removed; and with both punctuation and normalization errors removed. Only the best output from each team is shown.]

Finally, the majority of the teams participated last year and built on the findings of the previous round. Overall, the participants were able to make progress and improve their systems since last year. Although a direct comparison is not possible, since the test sets are not the same and the test data from last year was used for development, we observe that four teams scored more than 70 F1 points on the native data this year, while last year the best result, obtained by the CLMB system, was 67.91 points. We refer the reader to the system description papers for more detail on how the respective systems have been improved.

Conclusion
This paper presented a report on QALB-2015, the second shared task on text correction of Arabic. QALB-2015 extended QALB-2014, which took place last year and focused on correcting texts written by native Arabic speakers. This year, we added a second track on non-native data. We received 12 system submissions from eight teams. We are pleased with the extent of participation, the quality of the results, and the diversity of approaches.
Many participants continued from last year, improving and extending their systems. We are motivated to conduct new research competitions in the near future.

Acknowledgments
We would like to thank the organizing committee of ACL 2015 and its Arabic NLP workshop and also the shared task participants for their ideas and support. We thank Al Jazeera News (and especially, Khalid Judia) for providing the user comments portion of the QALB corpus. We also thank the QALB project annotators: Hoda Fathy, Dhoha Abid, Mariem Fekih, Anissa Jrad, Hoda Ibrahim, Noor Alzeer, Samah Lakhal, Jihene Wefi, Elsherif Mahmoud and Hossam El-Husseini. This publication was made possible by grant NPRP-4-1058-1-168 from the Qatar National Research Fund (a member of the Qatar Foundation). The statements made herein are solely the responsibility of the authors.

Appendix A: Sample annotation file
The sequence of manual corrections for the example in Table 1 is shown below.