Improving Arabic Diacritization through Syntactic Analysis

We present an approach to Arabic automatic diacritization that integrates syntactic analysis with morphological tagging by improving the prediction of case and state features. Our best system increases the accuracy of word diacritization by 2.5% absolute on all words, and 5.2% absolute on nominals, over a state-of-the-art baseline. Similar increases are shown on the full morphological analysis choice.


Introduction
Modern Standard Arabic (MSA) orthography generally omits diacritical marks, which encode lexical as well as syntactic (case) information. The task of Arabic automatic diacritization is to restore these missing diacritics automatically. Improved diacritization has important implications for downstream Arabic natural language processing, e.g., speech recognition (Ananthakrishnan et al., 2005; Biadsy et al., 2009), speech synthesis (Elshafei et al., 2002), and machine translation (Diab et al., 2007; Zbib et al., 2010).
Previous efforts on diacritization utilized morphological tagging techniques to disambiguate word forms. Habash et al. (2007a) observe that while such techniques work relatively well on lexical diacritics (located on word stems), they perform much worse on syntactic case diacritics (typically word-final). They suggest that syntactic analysis may help with automatic diacritization, but stop short of testing the idea, and instead demonstrate that complex linguistic features and rules are needed to model complex Arabic case using gold syntactic analyses. In this paper, we develop an approach for improving the quality of automatic Arabic diacritization through the use of automatic syntactic analysis. Our approach combines handwritten rules for case assignment and agreement with machine learning of case and state adjustment on the output of a state-of-the-art morphological tagger. Our best system increases the accuracy of word diacritization by 2.5% absolute overall, and 5.2% absolute on nominals, over a state-of-the-art baseline.

Linguistic Background
Arabic automatic processing, and specifically diacritization is hard for a number of reasons.
First, Arabic words are morphologically rich. The morphological analyzer we use represents Arabic words with 15 features (Pasha et al., 2014). 1 We focus on case and state in this paper. In our data set, case has five values: nominative (n), accusative (a), genitive (g), undefined (u) and not applicable (na). Cases n, a and g are usually expressed with an overt morpheme. Case u is used to mark words without an overt morphemic expression of case (e.g., invariable nouns such as šakwaý 'complaint'), or those not assigned a case in the manual annotations. Most of the missing assignments are for foreign proper nouns, which often do not receive case markers; however, this is not done consistently in the training data we use. Case na is used for non-nominals. State is a nominal feature that has four values: definite (d), indefinite (i), construct (c) and not applicable (na). State generally reflects definiteness in nominals (d vs. i) and whether a nominal is the head of a genitive construction (aka Idafa) (c). State na is used for non-nominals. 2 For the most part, case and state are realized as a single word-final morpheme, e.g., the suffix Aã in kitAbAã 3 is a morpheme indicating that the word is (cas:a, stt:i).
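The case and state inventories above can be made concrete with a small sketch. This is an illustrative representation, not MADAMIRA's actual data format; the helper `make_analysis` and its field names are our own.

```python
# Case and state value inventories as described in the text.
CASE_VALUES = {"n", "a", "g", "u", "na"}   # nominative, accusative, genitive, undefined, n/a
STATE_VALUES = {"d", "i", "c", "na"}       # definite, indefinite, construct, n/a

def make_analysis(diac, cas, stt, **other):
    """Build a minimal feature bundle; rejects out-of-inventory values."""
    assert cas in CASE_VALUES and stt in STATE_VALUES
    return {"diac": diac, "cas": cas, "stt": stt, **other}

# The running example: kitAbAã 'a book', accusative indefinite.
kitaban = make_analysis("kitAbAã", cas="a", stt="i", pos="noun")
print(kitaban["cas"], kitaban["stt"])  # a i
```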
Second, undiacritized Arabic words are highly ambiguous: in our training data, words have an average of 12.8 analyses each, most of which are associated with different diacritizations. Some diacritization differences reflect different lemmas, while others are due to morpho-syntactic differences. For example, the undiacritized version of the word used above (ktAbA) has two other diacritizations and analyses: kut∼AbAã (cas:a stt:i) 'writers' (different lemma) and kitAbA (cas:n stt:c num:d) 'two books of [...]' (different features).
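The 12.8-analyses-per-word figure is an out-of-context ambiguity average; it can be computed as sketched below. The toy analyzer entry reproduces the ktAbA example from the text; the interface is hypothetical.

```python
# Toy analyzer: maps an undiacritized word to its possible analyses.
TOY_ANALYZER = {
    "ktAbA": [
        {"diac": "kitAbAã", "cas": "a", "stt": "i"},   # 'a book'
        {"diac": "kut~AbAã", "cas": "a", "stt": "i"},  # 'writers' (different lemma)
        {"diac": "kitAbA", "cas": "n", "stt": "c"},    # 'two books of [...]'
    ],
}

def avg_ambiguity(words, analyzer):
    """Average number of analyses per word token."""
    counts = [len(analyzer.get(w, [])) for w in words]
    return sum(counts) / len(counts)

print(avg_ambiguity(["ktAbA"], TOY_ANALYZER))  # 3.0
```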
Third, Arabic has complex case/state assignment and agreement patterns that interact with the sentence's syntax. For example, a noun may get its case by being subject of a verb and its state by being the head of an idafa construction; while adjectives modifying this noun agree with it in its case, their state is determined by the state of the last element in the idafa chain the noun heads.
For more information on Arabic orthography, morphology, and syntax, see Habash (2010).
Most of the previous approaches cited above utilize different sequence modeling techniques that use varying degrees of knowledge, from shallow letter and word forms to deeper morphological information; none, to our knowledge, makes use of syntax. Habash and Rambow (2007) approach diacritization as part of the morphological disambiguation problem: they select the optimal full morphological tag for Arabic in context and use it to select from a list of possible analyses produced by a morphological analyzer. They use independent taggers for all features, and language models for lemmas and diacritized surface forms. Their work is part of the state-of-the-art Arabic morphological tagger MADAMIRA (Pasha et al., 2014). Our paper is most closely related to Habash and Rambow (2007) and Pasha et al. (2014). We extend their work using additional syntactic features to improve morphological disambiguation accuracy. We demonstrate improvements in terms of both full morphological analysis choice (lemmatization, tokenization, all features) and word diacritization.
Most recently, Abandah et al. (2015) presented a recurrent neural network approach to diacritize full sentences with impressive results. We do not compare to their effort here but we note that they use an order of magnitude more diacritized data than we do, and they focus on diacritization only as opposed to full morphological analysis.
In related work on modeling Arabic case and syntax, Habash et al. (2007a) compared rule-based and machine learning approaches to capture the complexity of Arabic case assignment and agreement. They demonstrated their results on gold syntactic analyses, showing that given good syntactic representations, case prediction can be done with a high degree of accuracy. Alkuhlani et al. (2013) later extended this work to cover all morphological features, including state. Additionally, Marton et al. (2013) demonstrated that in the context of syntactic dependency parsing, case is the best feature to use in gold settings and the worst feature to use in predicted settings. In this paper we use automatic (i.e., not gold) syntactic features to improve case prediction, which improves morphological analysis and word diacritization.

Approach
Motivation We are motivated by an error analysis we conducted on 1,200 words of the MADAMIRA system output. We found a large number of surprising, syntactically impossible case errors, such as genitive nouns following verbs or construct nouns followed by non-genitives. We attribute these errors to MADAMIRA's contextual models, which are limited to a small window of neighboring words and do not model syntax; this leads to much worse performance on case and diacritization than on lemmas and POS (almost a 10% absolute drop, from 96% to 86%).
Proposed Solution Our approach is to provide better prediction of case and state using models with access to additional information, in particular syntactic analysis and rules. The predicted case and state values are then used to re-tag the MADAMIRA output by selecting the best match among its ranked morphological analyses. We limit our retagging to nominals. Since what we are learning to predict is how to correct MADAMIRA's baseline choice (as opposed to a generative model of case-state), we also re-apply the model to its own output, primarily to fix propagated agreement errors, in a manner similar to Habash et al. (2007a)'s agreement classifier. 4
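The retagging step can be sketched as a selection over the tagger's ranked list. The interface below is an assumption for illustration, not MADAMIRA's real output format: given externally predicted case and state, take the highest-ranked analysis that matches both, and keep the baseline top choice if none does.

```python
def retag(ranked_analyses, pred_cas, pred_stt):
    """Pick the highest-ranked analysis matching the predicted case/state."""
    for analysis in ranked_analyses:          # list is already ranked
        if analysis["cas"] == pred_cas and analysis["stt"] == pred_stt:
            return analysis
    return ranked_analyses[0]                 # no match: keep the baseline choice

ranked = [
    {"diac": "kitAbu", "cas": "n", "stt": "d"},
    {"diac": "kitAbi", "cas": "g", "stt": "d"},
]
print(retag(ranked, "g", "d")["diac"])  # kitAbi
print(retag(ranked, "a", "i")["diac"])  # kitAbu (fallback)
```

In the actual system, this selection is restricted to nominals and to analyses with the same clitic signature as the baseline top analysis.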

Experimental Results
Next, we present our experimental results, comparing five case-state prediction techniques against our state-of-the-art baseline system, which has itself previously been compared to a number of other approaches.

Experimental Setup
Data We used the Penn Arabic Treebank (PATB, parts 1, 2 and 3) (Maamouri et al., 2004; Maamouri et al., 2006; Maamouri et al., 2009), split into Train, Dev and (blind) Test following the recommendations of Diab et al. (2013), which were also used in the baseline system. The morphological feature representations we use are derived from the PATB analyses following the approach used in the MADA and later MADAMIRA systems (Habash and Rambow, 2005; Habash and Rambow, 2007; Pasha et al., 2014). We further divide Dev into two parts with an equal number of sentences: DevTrain (30K words) for training our case-state classifiers, and DevTest (33K words) for development testing. The Test set has 63K words. 5

Evaluation Metrics We report our accuracy in terms of two metrics: a. Diac, the percentage of correctly fully diacritized words; and b. All, the percentage of words for which a full morphological analysis (lemma, POS, all inflectional and clitic features, and diacritization) is correctly predicted. We report results on all words (All Words) as well as on nominals 6 with no u case in the gold (henceforth, Nominals). We do not report on case and state prediction accuracy, nor on character-level diacritization.
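The two word-level metrics can be sketched directly. Field names in the toy data are illustrative; All requires the entire analysis to match, while Diac only compares the diacritized form.

```python
def diac_accuracy(pred, gold):
    """Fraction of words whose full diacritized form is correct."""
    hits = sum(p["diac"] == g["diac"] for p, g in zip(pred, gold))
    return hits / len(gold)

def all_accuracy(pred, gold):
    """Fraction of words whose entire analysis (every feature) is correct."""
    hits = sum(p == g for p, g in zip(pred, gold))
    return hits / len(gold)

gold = [{"diac": "kitAbu", "cas": "n"}, {"diac": "kitAbi", "cas": "g"}]
pred = [{"diac": "kitAbu", "cas": "a"}, {"diac": "kitAbi", "cas": "g"}]
print(diac_accuracy(pred, gold))  # 1.0  (both surface forms correct)
print(all_accuracy(pred, gold))   # 0.5  (first word has a wrong case)
```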

Morphological Analysis, Baseline and Topline
For our baseline, we use the morphological analysis and disambiguation tool MADAMIRA (Pasha et al., 2014), which produces a contextually ranked list of analyses for each word. We computed an oracular topline using the PATB gold case and state values in the retagging process.

4 […]ments. We do not report more on this due to space limitations. Because of the need to have the same number of parse tree tokens for reapplication in the models using syntax, we constrain the retagging to maintain the same clitic signature (number of clitics) as in the MADAMIRA baseline top analysis.

5 To address the concern that we are using more "training" data than our baseline, we compared the performance of MADAMIRA's release (baseline) to a version trained on Train + DevTrain. The small increase in training data made no significant difference from the baseline system in terms of our metrics.

6 Nominals consist of nouns (including noun_quant and noun_num), adjectives, proper nouns, adverbs, and pronouns.

Syntactic Analysis For syntactic features, we trained an Arabic dependency parser using MaltParser (Nivre et al., 2007) on the Columbia Arabic Treebank (CATiB) version of the PATB. The Train data followed the same splits mentioned above. The Nivre "eager" algorithm was used in all experiments. The CATiB dependency tree has six simple POS tags and eight relations. The PATB tokenization as well as the CATiB POS tags were produced by the baseline system MADAMIRA and used as input to the parser. We also used the well-performing yet simple CATiBex expansion of the CATiB tags as implemented in the publicly available parsing pipeline of Marton et al. (2013). Our parser's performance on the PATB Dev set is comparable to Marton et al. (2013): 84.2% labeled attachment, 86.6% unlabeled attachment, and 93.6% label accuracy.
Machine Learning Technique Given the small size of DevTrain, we opted to train unlexicalized models that we expect to capture morphosyntactic abstractions. We tried a number of machine learning techniques and settled on using the J48 Decision tree classifier with its default settings in WEKA (Hall et al., 2009;Quinlan, 1993) for all of the classification experiments in this paper.

Case and State Classification Techniques
We detail and report on five case-state classification techniques we experimented with. All results on the DevTest are presented in Table 1.

Morphology Rules
We created a simple manual word-morphology-based classifier that handled the most salient case errors seen in our pilot study (Section 4) and whose corrections have high precision. The scope of the rules is limited to word bigrams and includes three conditions: (i) post-verbal genitive nouns are changed to the first non-genitive analysis available from MADAMIRA; (ii) post-construct-state non-genitive nouns become genitive; and (iii) adjectives agreeing with the nouns they follow in gender, number, and definiteness, but not in case, take the nouns' case. The collective improvement of all the rules, applied in the order presented above, adds up to 0.6% absolute on All Word Diac, and 1.2% absolute on Nominal Diac.
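The three bigram rules can be sketched as follows. This is a simplified illustration under assumed field names; each word carries its baseline top analysis plus the tagger's ranked alternatives (`alts`).

```python
def apply_morph_rules(words):
    """Apply the three bigram case rules, in order, to a word sequence."""
    for i in range(1, len(words)):
        prev, cur = words[i - 1], words[i]
        # (i) post-verbal genitive noun: back off to the first
        #     non-genitive alternative, if any
        if prev["pos"] == "verb" and cur["pos"] == "noun" and cur["cas"] == "g":
            alt = next((a for a in cur.get("alts", []) if a["cas"] != "g"), None)
            if alt is not None:
                cur["cas"] = alt["cas"]
        # (ii) non-genitive noun after a construct-state word becomes genitive
        elif prev["stt"] == "c" and cur["pos"] == "noun" and cur["cas"] != "g":
            cur["cas"] = "g"
        # (iii) adjective agreeing in gender, number, and definiteness
        #       but not case takes the preceding noun's case
        elif (cur["pos"] == "adj" and prev["pos"] == "noun"
              and cur.get("gen") == prev.get("gen")
              and cur.get("num") == prev.get("num")
              and cur["stt"] == prev["stt"]
              and cur["cas"] != prev["cas"]):
            cur["cas"] = prev["cas"]
    return words

# A post-verbal noun wrongly tagged genitive is corrected to nominative.
sent = [
    {"pos": "verb", "cas": "na", "stt": "na"},
    {"pos": "noun", "cas": "g", "stt": "i",
     "alts": [{"cas": "g"}, {"cas": "n"}]},
]
print(apply_morph_rules(sent)[1]["cas"])  # n
```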

Morphology-based Classifier
We trained a classifier to predict a correction of the baseline case and state of a word using the DevTrain data set. For features, we included all the non-lexical morphological features (all features mentioned in footnote 1 except for the lemma). Clitic feature values were binarized to indicate whether a clitic is present. We used Nil values for all out-of-vocabulary words in MADAMIRA. BOS and EOS placeholders were used for sentence boundaries. We excluded all DevTrain words whose predicted POS switches from nominal to non-nominal or vice versa, but kept them as part of other words' context. This minimizes noise and sparsity in the training data, especially given its small size and the rarity of such examples. We experimented with adding features from neighboring words within a window of +/-2 words. The best performing setup used a window of +/-1. This classifier gains around 0.5% absolute over the simple word-morphology rules.

Syntax Rules
Building on Habash et al. (2007a)'s simple set of rules for determining case on gold dependency trees, we re-implemented these rules to work with our different dependency representation and extended them to include state assignment in a manner similar to Alkuhlani et al. (2013). 7 These rules improve over the baseline by 4.5% absolute in Nominal Diac accuracy, but produce no gains in the All Words setting. An investigation of the error patterns reveals the main reason to be the rules' inability to predict the problematic u case.

7 For nominal case assignment, our rules are: (i) the default case is a; (ii) children of the root are n; (iii) objects of prepositions and idafa children are assigned g; (iv) predicates not headed by verbs are assigned n; and (v) subjects and topics not headed (or grand-headed) by the class of words called Ân∼a and her sisters are assigned n. After case assignment, we apply two case agreement rules: (i) for all nominals modifying case-assigned nominals (i.e., with tree relation mod), copy the case of the modified nominal; and (ii) for all nominals conjoined to case-assigned nominals (i.e., with tree chain nominal-conjunction-nominal), copy the case of the heading nominal. For state, we assign d to words with the definite article proclitic, and c to words heading an idafa construction. All other nominals are assigned state i.
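A subset of the case-assignment and agreement rules can be sketched over a toy dependency tree. This shows only the default, root, prepositional/idafa, and mod-agreement rules; relation names and the node representation are illustrative, not the actual CATiB labels or our implementation.

```python
def assign_case(nodes):
    """nodes: dicts with 'pos', 'head' (parent index, -1 for root), 'rel'."""
    for n in nodes:
        if n["pos"] != "nominal":
            n["cas"] = "na"                      # non-nominals carry no case
            continue
        if n["head"] < 0:
            n["cas"] = "n"                       # children of the root are n
        elif n["rel"] in ("obj_prep", "idafa"):
            n["cas"] = "g"                       # prep objects, idafa children
        else:
            n["cas"] = "a"                       # default case
    # agreement: a modifying nominal copies the case of what it modifies
    for n in nodes:
        if n["pos"] == "nominal" and n["rel"] == "mod":
            n["cas"] = nodes[n["head"]]["cas"]
    return nodes

tree = [
    {"pos": "nominal", "head": -1, "rel": "root"},   # head noun
    {"pos": "nominal", "head": 0, "rel": "idafa"},   # idafa child
    {"pos": "nominal", "head": 1, "rel": "mod"},     # adjective on the child
]
print([n["cas"] for n in assign_case(tree)])  # ['n', 'g', 'g']
```

Note how the adjective ends up genitive by agreement with the idafa child, not by a rule of its own, mirroring the agreement behavior described in the text.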

Syntax-based Classifier
We trained a classifier to predict a correction of the case and state of a word on a parsed version of the DevTrain data set. Since the syntactic parse separates most clitics in the PATB tokenization, we align the tree tokens with the word morphology before extracting classifier features. The features we used are the token's morphological features (the same as those used for words in the morphology-based classifier), the parent's and grandparent's features, the relations between token and parent and between parent and grandparent, and the features of the neighboring +/-n tokens. The tokenized clitics are only used as context features (tree and surface). We tested all combinations of n values of 0, 1, 2, and 3, with and without parent and grandparent features. The best performing setting included the parent and grandparent features as well as a window of +/-1 tokens. The syntax-based classifier improves over the morphology-based classifier but trails behind the syntax rules in terms of Nominal Diac. Its All Word Diac accuracy is the highest so far.
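The feature extraction just described can be sketched as below. Field and prefix names are assumptions for illustration; the point is the combination of the token's own features with parent, grandparent, relation, and +/-1 window features.

```python
def extract_features(tokens, i):
    """Feature vector for token i: self, parent, grandparent, +/-1 neighbors."""
    def feats(tok, prefix):
        return {} if tok is None else {prefix + k: v for k, v in tok["feats"].items()}

    tok = tokens[i]
    parent = tokens[tok["head"]] if tok["head"] >= 0 else None
    grand = (tokens[parent["head"]]
             if parent is not None and parent["head"] >= 0 else None)

    fv = {}
    fv.update(feats(tok, "self_"))
    fv.update(feats(parent, "par_"))
    fv.update(feats(grand, "gpar_"))
    fv["rel"] = tok["rel"]                              # token-parent relation
    fv["par_rel"] = parent["rel"] if parent is not None else "NIL"
    fv.update(feats(tokens[i - 1] if i > 0 else None, "prev_"))
    fv.update(feats(tokens[i + 1] if i + 1 < len(tokens) else None, "next_"))
    return fv

toks = [
    {"feats": {"pos": "verb"}, "head": -1, "rel": "root"},
    {"feats": {"pos": "noun"}, "head": 0, "rel": "sbj"},
]
fv = extract_features(toks, 1)
print(fv["self_pos"], fv["par_pos"], fv["rel"])  # noun verb sbj
```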

Combination of Syntax Rules and Classifier
We combine the last two approaches to exploit their complementary performance by including the syntax rule predictions as features in the syntax-based classifier. The resulting system is our best performer achieving DevTest accuracy improvements of 2.3% absolute (All Word Diac), and 4.5% absolute (Nominal Diac).
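The combination is mechanically simple: the rules' case and state predictions become two additional features in the classifier's vector. Feature names below are illustrative.

```python
def combine(syntax_feats, rule_cas, rule_stt):
    """Append the syntax rules' predictions to the classifier features."""
    combined = dict(syntax_feats)
    combined["rule_cas"] = rule_cas    # syntax rules' case prediction
    combined["rule_stt"] = rule_stt    # syntax rules' state prediction
    return combined

fv = combine({"self_pos": "noun", "rel": "sbj"}, "n", "c")
print(fv["rule_cas"], fv["rule_stt"])  # n c
```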

Blind Test Results
The results of applying the best approach and settings on our blind Test set are presented in Table 2. While the baseline on Test is slightly higher than on DevTest, the performance on all metrics is comparable. The increase in Diac accuracy is 2.5% absolute on All Words and 5.2% absolute on Nominals. The corresponding relative reductions in error with respect to the oracle toplines are 30% and 34%. Similar increases are shown on the full morphological analysis choice.

Figure 1: Examples of corrections from our best performing system.
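The relative error reduction figures measure the gain over the baseline as a fraction of the headroom up to the oracle topline. The baseline and topline values below are placeholders chosen for illustration, not the paper's reported numbers; only the 2.5-point gain comes from the text.

```python
def rel_error_reduction(baseline, system, topline):
    """Gain over baseline as a fraction of the baseline-to-topline headroom."""
    return (system - baseline) / (topline - baseline)

# e.g., with an assumed All Words baseline of 87.7 and topline of 96.0,
# a 2.5-point absolute gain covers ~30% of the available headroom:
print(round(rel_error_reduction(87.7, 90.2, 96.0), 2))  # 0.3
```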

Error Analysis and Examples
We manually investigated the first 100 errors in the DevTest output of our best system. In about a quarter of the cases (23%), all of which are proper nouns, the gold reference was not fully diacritized, making evaluation impossible. In an additional 7%, typographical errors in the input, including missing sentence breaks, led to bad parses, which are the likely cause of error. The rest of the errors are system failures: 31% are connected with syntactic tree errors (although a third of these are agreement-propagated errors over correct trees); 28% are due to other morphological analysis issues (half are out-of-vocabulary and 4% are no-analysis cases); and 11% are other case selection errors unrelated to the above-mentioned issues. Figure 1 shows four examples from the DevTest where our best system's analysis of the underlined words differs from the baseline. In examples (a), (b), and (c), our best system's analysis matched the gold; the dependency relation also matched the gold and was the likely cause of correction. In example (d), our best system incorrectly changed the correct baseline analysis in agreement with the wrong dependency relation provided by the parser, which is the likely cause of error.

Conclusion and Future Work
We have demonstrated the value of using automatic syntactic analysis as part of the task of automatic diacritization and morphological tagging of Arabic. Our best solution is a hybrid approach that combines statistical parsing and manual syntactic rules as part of a machine learning model for correcting case and state features.
In the future, we plan to investigate the development of joint morphological disambiguation and syntactic parsing models. We will also work on improving the quality of Arabic parsing which is behind many of the errors according to our error analysis. Other possible directions include using more sophisticated machine learning techniques and richer lexical features. We also plan to host a demo and make our system available through the website of the Computational Approaches to Modeling Language (CAMeL) Lab: www.camel-lab.com.