SHAKKIL: An Automatic Diacritization System for Modern Standard Arabic Texts

This paper sheds light on SHAKKIL, a system that diacritizes Arabic texts automatically. In this system, the diacritization problem is handled on two levels: a morphological processing level and a syntactic processing level. The adopted morphological disambiguation algorithm depends on four layers: a uni-morphological form layer, a rule-based morphological disambiguation layer, a statistical-based disambiguation layer and an Out Of Vocabulary (OOV) layer. The adopted syntactic disambiguation algorithm detects the case ending diacritics using a rule-based approach that simulates the shallow parsing technique. This is achieved using an annotated corpus for extracting the Arabic linguistic rules, building the language models and testing the system output. The system demonstrates the interaction between the rule-based and statistical approaches, where the rules can help the statistics in detecting the right diacritization and vice versa. At this point, the morphological Word Error Rate (WER) is 4.56%, the morphological Diacritic Error Rate (DER) is 1.88% and the syntactic WER is 9.36%. The best WER is 14.78%, compared to the best published results: 11.68% (Abandah et al., 2015), 12.90% (Rashwan et al., 2015) and 13.70% (Metwally, Rashwan, & Atiya, 2016).


Introduction
Modern Standard Arabic (MSA) is currently the sixth most widely spoken language in the world, with an estimated 422 million native speakers. It is usually written without diacritics, which makes Arabic text processing difficult. In addition, this often leads to considerable ambiguity, since several words with different diacritic patterns may appear identical in a diacritic-less setting. In fact, a text without diacritics may bring difficulties even for Arabic readers. It is also problematic for Arabic processing applications, where the lack of diacritics adds another layer of ambiguity when processing the input data (Shaalan et al., 2009).
Diacritics restoration is the problem of inserting diacritics into a text where they are missing. Predicting the correct diacritization of Arabic words clarifies their meaning and leads to better understanding of the text, which in turn is very useful in several real-life applications such as Information Retrieval (IR), Machine Translation (MT), Text-to-Speech (TTS), Part-Of-Speech (POS) tagging and others.
For full diacritization of an Arabic word, two basic components are needed: 1) a morphology-dependent component that selects the best internal diacritized form among candidates with the same spelling; e.g., the word "Elm" has different diacritized forms: "Eilom" "science", "Ealam" "flag", "Eal˜ama" "taught" and "Ealima" "knew". 2) a syntax-dependent component that detects the best syntactic case of the word within a given sentence, i.e. its role in the parsing tree of that sentence. For example, "darasotu Eiloma Alr˜iyADiy˜Ati" "I studied Mathematics" implies that the syntactic diacritic of the target word, which is an "object" in the parsing tree, is "Fatha", while "yufiydu Eilomu Alr˜iyADiy˜Ati jamiyEa AloEuluwmi" "Mathematics benefits all sciences" implies that the syntactic diacritic of the target word, which is a "subject" in the parsing tree, is "Damma" (Rashwan et al., 2009).

Related Work
Diacritic restoration has been receiving increasing attention and has been the focus of several studies. Different methods such as rule-based, example-based, hierarchical, morphological and contextual-based as well as methods with Hidden Markov Models (HMM) and weighted finite state machines have been applied for the diacritization of Arabic text.
In addition, some software companies have developed commercial products for the automatic diacritization of Arabic: Sakhr Arabic Automatic Diacritizer, Xerox's Arabic Morphological Processor and RDI's Automatic Arabic Phonetic Transcriptor (Diacritizer/Vowelizer). Moreover, there are other freely available online systems: Meshkal Arabic Diacritizer, Harakat Arabic Diacritizer, Al-Du'aly, Farasa and Google Tashkeel, which is no longer available.
It has been noticed that most of the previous systems use different statistical approaches in their diacritization systems, and few trials use the rule-based approach. The difference between the current proposed system and the others is the interaction between the rule-based approach and the statistical approach using different machine learning techniques, where the rule-based component can help the statistical-based component in detecting the right diacritization and vice versa. In addition, the current system extracts and implements some syntactic rules for case ending restoration where, to our knowledge, none of the previous systems makes use of syntax, with the exception of (Shahrour et al., 2015), (Chennoufi and Mazroui, 2016b) and (Alansary, 2016a), who have integrated syntactic rules.

System Data Sets
The data used in the current system was selected from a Morphologically Annotated Gold Standard Arabic Resource (MASAR) for MSA (Alansary, 2016b). The texts were selected from different sources (newspapers, net articles and books) and cover more than one genre. Each word is tagged with features, namely Lemma, Gloss, Prefixes, Stem, Tag, Suffixes, Gender, Number, Definiteness, Root, Stem Pattern, Case Ending, Named Entity and finally Vocalization. In the current proposed system, the data is subdivided into three data sets: a) "Training Data", about 450,000 words; b) "Development Data", about 55,000 words; and c) "Testing Data", about 55,000 words.

Extracting Arabic Linguistic Rules
It must be noted that extracting Arabic linguistic rules is not an easy task, since these rules must be represented in a generalized format that simulates the concerned component of the language. The rules also need to be constrained in a certain order to avoid overlapping among them. In this stage, a number of rules related to morphology, definiteness and case ending are extracted from the training data set in a formalized and generalized format and implemented in the system. Concerning the morphological rules, we extract rules that help in detecting the POS of a certain word depending on the preceding or succeeding POS tags or word forms. In addition to this kind of rule, other rules have been extracted to detect the whole morphological analysis, and hence the full internal diacritization, depending on the preceding or succeeding POS tags or word forms. These rules keep only the solution that is correct for the word's context and consequently eliminate all wrong solutions; they may yield only one morphological/diacritized form or more than one correct morphological analysis after eliminating the wrong solutions. The extracted morphological rules fall into 11 categories: Prepositional Phrases, Subjunctive Particles, Jussive Particles, Accusative Particles, Interrogative Particles, Pronouns, Verb Particles, the Amendment Particle "l'kin", Time and Place Adverbs, Verbs and Some Word Forms. In each category, a number of sub-rules are extracted and implemented.

Ex. (1)
fahuwa yaraY >an˜a >afoDal AlomuqotaraHAt huwa AlomuqotraH AlofiydorAliy
He believes that the best of the proposals is the federal proposal

The realization of nominal case in Arabic is complicated by its orthography, which uses optional diacritics to indicate short vowel case morphemes, and by its morphology, which does not always distinguish between all cases. Additionally, case realization in Arabic interacts heavily with the realization of definiteness, leading to different realizations depending on whether the nominal is indefinite, i.e., receiving nunation, definite through the determiner Al+, or definite through being the governor of an EDAFAH possessive construction (Habash et al., 2007b). In addition, case ending realization in Arabic interacts in some cases with other information: 1. Word Patterns: the diptote word patterns refer to a category of nouns and adjectives that have a special case ending when they are indefinite, since they do not take tanween. It must be noticed that when these words are definite, they take the regular case ending diacritics. 2. Verb Transitivity: the transitivity of a verb sometimes helps in detecting the subject (which receives the nominative case ending) and the object (which receives the subjunctive case ending). 3. Feminine Plural Word Forms: in Arabic, an object ending with "At/NSUFF" receives the case ending of the genitive case instead of the subjunctive case (Fashwan, 2016). In order to detect the case ending diacritics, a prior step is performed where some Arabic linguistic rules have been extracted and implemented to detect the definiteness of each word depending on its context or its selected morphological analysis. In addition, the stem pattern of each stem has been detected depending on its root, stem and lemma. Moreover, the transitivity of each verb has been detected depending on its lemma.
After that, some Arabic linguistic rules have been extracted and implemented to detect the case ending depending on a window of -/+3 words around the focus word, taking into consideration the context, the selected morphological analysis, the definiteness feature, the stem pattern and the verb's transitivity. The extracted case ending rules fall into 4 categories: rules for detecting the case ending (MOOD) of imperfect verbs, rules for detecting the case ending of noun phrases, rules for detecting the case ending of adverb phrases and rules for detecting the case ending of adjectival phrases. One of these rules states that if there is a noun phrase preceded by ">am˜A" "as for/concerning" or "lawolA" "if not", then the noun must be in the nominative case, taking into consideration the definiteness feature: if the noun is "INDEF" then the noun must receive nunation (Tanween Damma), but if the noun is "DEF" or "EDAFAH" then the noun must receive Damma, as Rule (2) states. In Ex. (2), the first condition of Rule (2) is applied and the case ending diacritic of the word "SadiyqN" "friend" is Tanween Damma "N". In Ex. (3), the second condition of Rule (2) is applied and the case ending diacritic of the word "Sadiyqu" "friend" is Damma "u".
Ex. (2)
lawolA SadiyqN liy >axobaraniy biAlo>amori
Except for a friend who told me about this matter
Ex. (3)
>am˜A Sadiyqu Tufuwlatiy falahu min˜iy kul˜u AloHub˜i
As for my childhood friend, all the love is from me
Table 1 shows the number of the extracted linguistic rules used for both the morphological and syntactic processing levels. For more details about these extracted rules, see (Fashwan, 2016).
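A rule of this kind can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the token dictionary and its field names are hypothetical, and diacritics are written in ASCII Buckwalter transliteration ("N" = Tanween Damma, "u" = Damma, "~" = shadda).

```python
# Hypothetical sketch of Rule (2): a noun preceded by ">am~A"
# ("as for") or "lawolA" ("if not") receives the nominative case;
# "N" (tanween damma) if indefinite, "u" (damma) if DEF or EDAFAH.

TRIGGERS = {">am~A", "lawolA"}

def apply_rule_2(tokens):
    """tokens: list of dicts with 'form', 'pos' and 'definiteness'.
    Fills in 'case_ending' where the rule fires; other tokens are
    left untouched."""
    for i, tok in enumerate(tokens):
        if tok["pos"] != "NOUN" or i == 0:
            continue
        if tokens[i - 1]["form"] in TRIGGERS:
            if tok["definiteness"] == "INDEF":
                tok["case_ending"] = "N"   # tanween damma
            else:                          # DEF or EDAFAH
                tok["case_ending"] = "u"   # damma
    return tokens

# Ex. (2): "lawolA SadiyqN liy ..." -> indefinite noun gets tanween damma
sent = [
    {"form": "lawolA", "pos": "PART", "definiteness": "NA"},
    {"form": "Sadiyq", "pos": "NOUN", "definiteness": "INDEF"},
]
print(apply_rule_2(sent)[1]["case_ending"])  # -> N
```

The real rules additionally consult the -/+3-word window and the selected morphological analysis before firing.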

SHAKKIL Diacritization System
In this system, the diacritization problem is handled on two levels: the morphological processing level (for detecting the internal diacritics) and the syntactic processing level (for detecting the case ending diacritics).
The morphological processing level depends on four layers. The first three layers are similar to BASMA's (Alansary, 2015). The first layer is a direct match between the words that have only one morphological analysis and their diacritized forms, the second disambiguates the input using contextual morphological rules, and the third disambiguates the input statistically using machine learning techniques. However, the three layers in this algorithm are applied sequentially over the whole input, unlike BASMA, which applies the layers word by word. In each of these layers, a step towards the morphological diacritization of the input text is performed, as figure 1 shows. Moreover, this algorithm makes use of the relations between the words and their contexts, whether the preceding or the succeeding words, while BASMA depends only on the morphological disambiguation of the preceding words. In addition to these three layers, a fourth layer is used: the Out Of Vocabulary (OOV) layer. The adopted syntactic algorithm is a rule-based approach that detects the main constituents of the morphological analysis output and applies the suitable syntactic rules to detect the case ending.

Uni-Morphological Form Layer
This layer is only concerned with words that have only one diacritized form as well as one POS tag. For example, the word "AliHotilAl" "occupation" has only one diacritized form with one POS tag. In this layer, the adopted morphological disambiguation algorithm does not disambiguate the words according to their order in the text; rather, it begins by directly matching all the words that have a unique morphological/diacritized form and assigning this analysis to each such word. This layer is considered the key for disambiguating other words in the subsequent layers, whether they are preceding or succeeding words. It may help in: 1) disambiguating the other words in the rule-based layer if they are governed by a rule that provides one morphological/diacritized form, 2) disambiguating the other words in the statistical-based layer, or 3) disambiguating the other words with the help of both the rule-based layer and the statistical-based layer if they are governed by a rule that provides more than one morphological/diacritized form.
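The pass described above can be sketched as follows (a minimal illustration, assuming a hypothetical analyzer lookup that maps an undiacritized word to its candidate analyses):

```python
# Sketch of the uni-morphological form layer: any word whose analyzer
# output contains exactly one (diacritized form, POS) pair is resolved
# immediately; ambiguous words are deferred to the later layers.

def uni_morphological_pass(words, analyses):
    """analyses maps an undiacritized word to its list of
    (diacritized_form, pos) candidates."""
    resolved, pending = {}, []
    for i, w in enumerate(words):
        candidates = analyses.get(w, [])
        if len(candidates) == 1:
            resolved[i] = candidates[0]   # unambiguous: assign directly
        else:
            pending.append(i)             # defer to rules / statistics
    return resolved, pending

analyses = {
    "AlAHtlAl": [("AliHotilAl", "NOUN")],           # one analysis
    "Elm": [("Eilom", "NOUN"), ("Ealam", "NOUN"),   # four analyses
            ("Eal~ama", "VERB"), ("Ealima", "VERB")],
}
resolved, pending = uni_morphological_pass(["AlAHtlAl", "Elm"], analyses)
print(resolved)  # {0: ('AliHotilAl', 'NOUN')}
print(pending)   # [1]
```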

Rule Based Morphological Disambiguation Layer
The main goal of implementing the morphological rules is to help in eliminating the wrong solutions and simplifying the search while selecting the best POS or a complete morphological analysis of the non-disambiguated words, and hence detecting the internal diacritization. The extracted and implemented rules in this layer are of two types: 1. Rules that always provide only one morphological/diacritized form for some non-disambiguated words, without the need to use the statistical-based layer or the text word order. Most of the rules of this type are concerned with the relation between some non-disambiguated word forms and the preceding or succeeding words. 2. Rules that may provide one or more than one solution, depending on the variation of the word form's solutions. If the rule provides only one analysis, this analysis is assigned to the word directly and there is no need to apply the statistical-based layer. However, if the applied rule provides more than one solution (morphological/diacritized form), the statistical-based layer is applied in order to get the best solution among those provided by the rule.
In the case of having more than one solution for a certain word after applying the rules, following the text word order is a must. The system depends on a window of -/+3 analyzed words around this word to obtain the best solution through the statistical-based layer, as sub-section 4.1.3 shows. The words disambiguated in this layer by rules alone may help the statistical-based layer in disambiguating the succeeding word, if it is not analyzed yet. In addition, the words disambiguated through the statistical-based layer may be governed by a rule that helps in disambiguating the succeeding word if it is not analyzed yet.
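The interplay between the two rule types and the statistical layer can be sketched as follows. The data structures and the toy rule are hypothetical; the paper's actual rules are linguistically richer.

```python
# Illustrative sketch of the rule-based layer: each rule filters a
# word's candidate analyses; if exactly one candidate survives, the
# word is resolved here, otherwise the narrowed set is handed on to
# the statistical-based layer.

def only_nouns_after_prep(i, words, candidates):
    """Toy rule: after the preposition "fiy" ("in"), keep only
    nominal readings."""
    if i > 0 and words[i - 1] == "fiy":
        kept = [c for c in candidates if c[1] == "NOUN"]
        return kept or candidates      # never empty the candidate set
    return candidates

def rule_based_pass(words, analyses, rules, pending):
    resolved, still_pending = {}, []
    for i in pending:
        candidates = analyses[words[i]]
        for rule in rules:
            candidates = rule(i, words, candidates)
        if len(candidates) == 1:
            resolved[i] = candidates[0]      # rule decides alone
        else:
            analyses[words[i]] = candidates  # narrowed set for statistics
            still_pending.append(i)
    return resolved, still_pending

analyses = {"Elm": [("Eilom", "NOUN"), ("Ealima", "VERB")]}
resolved, pending = rule_based_pass(
    ["fiy", "Elm"], analyses, [only_nouns_after_prep], [1])
print(resolved)  # {1: ('Eilom', 'NOUN')}
```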

Statistical Based Morphological Disambiguation Layer
As mentioned before, three processes are used to reach the best morphological/diacritized form of the words to be analyzed statistically. It must be noted that not all of these processes are used in every case while disambiguating a word. The first gets the best POS score among the analysis solutions of the word to be analyzed, in relation to the preceding and succeeding (if found) POS, depending on some smoothing techniques. The second gets the best stem, tag or lemma score in relation to the word form itself. It is used in two cases: 1) when the POS of the word's analyses are the same, or 2) when the POS model detects the best POS and this POS has more than one lemma or diacritized form. The third gets the best suffix score among the analysis solutions of the word to be analyzed. It is used when the word to be disambiguated has more than one suffix. The top-scoring solution of the word is then selected.
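The first process can be illustrated with a deliberately simplified POS-context scorer. This is not the paper's exact model: it uses a bigram over the preceding POS only and add-one smoothing for brevity, whereas the system itself scores a -/+3-word window with Witten-Bell and Kneser-Ney smoothing.

```python
from collections import Counter

# Toy POS-context scorer: rank candidate POS tags by
# P(pos | prev_pos) estimated from training counts with
# add-one smoothing (illustration only).

def train_bigram(pos_sequences):
    bigrams, unigrams = Counter(), Counter()
    for seq in pos_sequences:
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1
    return bigrams, unigrams

def score(prev_pos, pos, bigrams, unigrams, vocab_size):
    # add-one smoothed conditional probability
    return (bigrams[(prev_pos, pos)] + 1) / (unigrams[prev_pos] + vocab_size)

train = [["PREP", "NOUN", "ADJ"], ["PREP", "NOUN", "VERB"]]
bi, uni = train_bigram(train)
candidates = ["NOUN", "VERB"]
best = max(candidates, key=lambda p: score("PREP", p, bi, uni, 4))
print(best)  # NOUN
```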

Dealing with OOV Words
For predicting the OOV words, a prior step is taken: preprocessing the stems of the training data. The stems of the training data are used to build a list of 4,307 unique diacritized patterns with their templates and frequencies. The patterns are prepared by converting the consonants in the stem to a placeholder while keeping the vowels, the hamazat (">", "<", "&", "}", etc.) and the weak letters "wAy". In addition, the POS of each pattern is taken into consideration, as figure 2 shows. The POS helps, in some cases, in limiting the scope of the search for the matched pattern; for example, if the OOV word has been detected as having "Al" at its beginning, the system should search for the detected pattern among the patterns of nouns or adjectives.
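The pattern-preparation step can be sketched as follows (Buckwalter transliteration; the placeholder character "C" and the exact character sets are illustrative assumptions):

```python
# Sketch of the OOV pattern step: consonants in a diacritized stem are
# replaced by a placeholder while short vowels, hamza forms and the
# weak letters w/A/y are kept, yielding a diacritized pattern.

VOWELS = set("aiuoN~FK")          # short vowels, sukun, shadda, tanween
KEEP = set("wAy") | set("><&}'")  # weak letters and hamza forms

def stem_to_pattern(stem):
    out = []
    for ch in stem:
        if ch in VOWELS or ch in KEEP:
            out.append(ch)
        else:
            out.append("C")        # consonant placeholder
    return "".join(out)

print(stem_to_pattern("Eilom"))   # CiCoC
print(stem_to_pattern("kAtib"))   # CACiC
```

An OOV stem is then matched against the stored patterns (with their frequencies and POS tags) to borrow the diacritics of the most likely template.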
While detecting the input text's analysis solutions, each word is checked by the system to determine whether it has analysis solutions from BAMA or is OOV. When the word form is found to be OOV, the system switches to the OOV module. In this module, the system tries to get all the word's possible morphological constituents (a combination of prefixes, stem and suffixes). Then, it uses the list of detected stems and gets their counterpart diacritized patterns.
The selected pattern is used to retrieve the suitable diacritics for the stem. Moreover, the system chooses the POS tag of the diacritized pattern and assigns it to the diacritized stem, where each selected solution is added to the text's solutions. While working in the morphological disambiguation processing level, if the OOV word has more than one matched POS tag, the system detects the best one depending on steps one and two of sub-section (4.1.3).
It must be noted that there are no out of vocabulary (OOV) words in the MASAR data, since they were analyzed manually, as if analyzed by BAMA, and then added to BAMA's dictionaries so that they would be analyzed correctly the next time they occur.

Syntactic Processing Level
The task of the syntactic processing level is to predict the syntactic case of a sequence of morphologically diacritized words given their POS tags, definiteness, stem pattern and/or transitivity, and hence to assign the suitable case ending diacritics. The rules have some limitations in setting the case ending diacritic, since they are restricted to a window of -/+3 words around the focus word.
Before diacritizing a word syntactically, its POS tag is checked first. Using the POS tag of the word, it is decided how the syntactic diacritization of this word should be handled. As mentioned before, the extracted rules in this level simulate one of the language processing approaches that computes a basic analysis of sentence structure rather than attempting full syntactic analysis: shallow syntactic parsing. It is an analysis of a sentence which identifies the constituents (noun groups, verb groups, prepositional groups, adjectival groups, adverbial groups), but does not specify their internal structure, nor their role in the main sentence.
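The dispatch described above can be sketched as follows. This is a hypothetical skeleton: the family names and the tag set are illustrative, and the real rules inspect definiteness, stem pattern and transitivity inside the window.

```python
# Hypothetical sketch of the shallow-parsing dispatch: the word's POS
# tag selects which family of case-ending rules is tried, over a
# window of -/+3 neighbouring words.

RULE_FAMILIES = {
    "IV":   "imperfect-verb mood rules",
    "NOUN": "noun-phrase case rules",
    "ADV":  "adverb-phrase case rules",
    "ADJ":  "adjectival-phrase case rules",
}

def select_rule_family(pos):
    # None means no case-ending rule applies to this POS
    return RULE_FAMILIES.get(pos)

def context_window(tokens, i, size=3):
    """Return up to `size` tokens before and after position i."""
    return tokens[max(0, i - size):i], tokens[i + 1:i + 1 + size]

toks = ["w", "qAl", "Al+wzyr", ">n", "Al+TAlb", "njH", "fy"]
print(select_rule_family("NOUN"))  # noun-phrase case rules
print(context_window(toks, 4))     # up to 3 words before and after
```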
The extracted rules for detecting the imperfect verb case ending and the case ending of nouns, adjectives, adverbs and some proper nouns (sub-section 3.1) have been implemented in the current proposed system, taking into consideration the phrases in which each of the previous categories occurs.

Evaluation
A blind data set that was not part of the development of the system is then used to evaluate the system against the gold annotated data. Two error rates are calculated: the diacritic error rate (DER), which indicates how many letters have been restored with incorrect diacritics, and the word error rate (WER), which indicates how many words have at least one diacritic error. Table 2 shows the total error rate of the system as a whole, the WER and DER for each layer in the morphological processing level, and the percentage of words that have been detected with the help of each layer using different machine learning techniques.
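The two metrics as defined above can be computed straightforwardly (a minimal sketch; the input representation as per-letter diacritic lists is an assumption):

```python
# WER/DER as defined above: DER counts letters whose restored diacritic
# differs from the gold diacritic; WER counts words with at least one
# such letter. Input: aligned (gold, predicted) pairs, each word given
# as a list of per-letter diacritics.

def wer_der(pairs):
    word_err = letter_err = letters = 0
    for gold, pred in pairs:
        wrong = sum(g != p for g, p in zip(gold, pred))
        letters += len(gold)
        letter_err += wrong
        word_err += wrong > 0
    return word_err / len(pairs), letter_err / letters

pairs = [(["a", "i"], ["a", "u"]),   # one wrong letter -> word error
         (["u", "o"], ["u", "o"])]   # fully correct word
wer, der = wer_der(pairs)
print(wer, der)  # 0.5 0.25
```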
The comparison between the different smoothing techniques shows that the results of Witten-Bell discounting and Kneser-Ney smoothing are close; however, the Kneser-Ney smoothing technique is the best at detecting the best morphological analysis (internal diacritics) and case ending diacritics in the current proposed system. Consequently, the Kneser-Ney smoothing technique is used when comparing the results of the current system with other state-of-the-art systems.
Table 2: System Evaluation Results.

Error Analysis
When checking the results, we find that the first two layers of the proposed system have the lowest WER and DER, followed by the third layer; the statistical-based layer alone gives the highest WER and DER. In what follows, some error analysis for each layer is reviewed:
- In the uni-morphological layer, it has been found that the error rate results from some possessive nouns or present verbs that have an affixed possessive. In such words, the case ending diacritic is assigned within the word, not at its end. For example, the word "yumArishA" "he + practice/pursue/exercise" has three moods with three different case diacritics ("u", "a", "o") within the word form. When the system fails to detect the suitable case ending diacritic according to the context, the blind testing process counts this wrong diacritic as a morphological (internal) error, not a syntactic error. If the testing process were done for the internal diacritics and the syntactic case separately, the results would be expected to improve.
- In the rule-based layer, errors happen for several reasons. The first is when a word governed by a rule causes the succeeding or preceding word to be diacritized wrongly. In Ex. (4), the word ">n" is diacritized wrongly as ">an˜a" instead of ">ano" in this context. This leads to diacritizing the word "Zal˜a" as "Zil˜a", affected by the rule stating that only nouns occur after ">an˜a". The same problem that appeared in the uni-morphological form layer also appears in this layer. In addition, when the applied rule gives more than one available solution and the statistical-based layer is used to select the best one, the statistical-based layer may choose a wrong solution.

Ex. (4)
baEoda >an˜a Zil˜a quruwnAF EadiydapF
After that shadow many centuries
- In the statistical-based layer, although it can predict the correct diacritized form in most cases, even when the same word appears in the same sentence with two different POS tags and diacritizations, it cannot, in other cases, predict the suitable diacritized form. In Ex. (5), the same word "TAlb" has been diacritized with two different diacritizations according to the context, where the first occurrence is a verb while the second is a noun.

Ex. (5)
TAlaba waziyr Alt˜arobiyap waAlt˜aEoliym biAl{ihotimAm bikul TAlib EalaY Had sawA'
The Minister of Education demanded attention to every student alike
In some other cases, the statistical-based layer can predict the correct POS but fails to detect the best lemma that helps to differentiate among the different diacritizations of the word form. In Ex. (6), the system fails to detect the correct diacritized form of the word "AlmhAfZp", where it should be "AlmuHAfiZap" "the + governess".
Ex. (6) wa qAmato AlomuHAfaZap bi<iSodAr qarAr bitaEoyiyn nA}ib lahA
The governorate issued a decision appointing a deputy for her
Moreover, the statistical-based layer may fail to detect both the correct POS and the correct diacritized form of the word to be analyzed. In Ex. (7), the input word "Ely" is wrongly diacritized with a wrong POS, where the system renders it as "EalaY/PREP" while in this sentence it should be "Ealiy˜/NOUN PROP". This wrong POS and diacritization leads to a wrong case ending for both this word and the succeeding one.
Ex. (7) EalaY TifolK SagiyrK yabolug mino AloEumori Ea$arapa >aEowAmK
A small child at the age of ten years
The problem of detecting proper nouns is similar to the previous problem: there are many words in Arabic that may be used as proper nouns as well as nouns or adjectives. Predicting these words wrongly leads to a wrong case ending for both the word and the succeeding one.
- In the syntactic processing level, assigning a wrong POS, transitivity or definiteness to a word may lead to a wrong syntactic case. In Ex. (8), the word "hadaf" "goal/target" is assigned "EDAFAH" definiteness as a result of considering "waHiyd" "alone" a "NOUN PROP".

Comparing the System with Other State of the Art Systems
In order to have an objective evaluation of the system, the same testing data (LDC's Arabic Treebank) that was used in the other systems was used to compare the results. It is a part of Arabic Treebank part 3 (ATB3) from the "An-Nahar" Lebanese newspaper. It consists of 91 articles (about 52,000 words) covering the period from October 15, 2002 to December 15, 2002. In the testing process, 51.63% of the words are diacritized in the first layer, 5.56% of the words are diacritized by the rule-based layer only, 8.26% of the words are diacritized by both the rule-based and statistical-based layers, 32.99% of the words are diacritized by the statistical-based layer only, and finally 1.56% of the words are diacritized in the OOV layer. In the OOV layer, the system could predict the words with a WER of 11.2% and a DER of 6.7%. The comparison indicates that (Abandah et al., 2015), (Rashwan et al., 2015) and (Habash et al., 2009) outperform the current system's results. However, the results are still close to (Metwally et al., 2016).

Conclusion
In this work, we depend on Arabic morphological rules as well as different machine learning techniques for detecting the morphological (internal) diacritics. In addition, we adopt a rule-based approach that depends on POS morphological sequences, a definiteness classifier, stem patterns and verb transitivity for detecting the case ending diacritics. The proposed system is evaluated in comparison with the best state-of-the-art systems. The best WER of the morphological diacritization achieved by the system is 4.81% and the best syntactic WER achieved is 9.97%, compared to the best published results. Since this work is in progress, these results are expected to be enhanced by extracting more Arabic linguistic rules (morphological and syntactic), adding more semantic features, using different machine learning techniques in the morphological and syntactic processing levels, and working on larger amounts of data. For enhancing the OOV results, more patterns with more features need to be handled.