Complementary Strategies for Low Resourced Morphological Modeling

Morphologically rich languages are challenging for natural language processing tasks due to data sparsity. This can be addressed either by introducing out-of-context morphological knowledge, or by developing machine learning architectures that specifically target data sparsity and/or morphological information. We find these approaches to complement each other in a morphological paradigm modeling task in Modern Standard Arabic, which, in addition to being morphologically complex, features ubiquitous ambiguity, exacerbating sparsity with noise. Given a small number of out-of-context rules describing closed class morphology, we combine them with word embeddings leveraging subword strings and noise reduction techniques. The combination outperforms both approaches individually by about 20% absolute. While morphological resources already exist for Modern Standard Arabic, our results inform how comparable resources might be constructed for non-standard dialects or any morphologically rich, low resourced language, given scarcity of time and funding.


Introduction
Morphologically rich languages pose many challenges for natural language processing tasks. This often takes the shape of data sparsity, as the increase in the number of possible inflections for any given core concept leads to a lower average word frequency of individual (i.e., unique) word types. Hence, models have fewer chances to learn about types based on their in-context behavior. One common, albeit time consuming, response to this challenge is to introduce out-of-context morphological knowledge, hand crafting rules to relate forms inflected from the same lemma. The other common response is to adopt machine learning architectures specifically targeting data sparsity and/or morphological information. We find these two responses to be complementary in a paradigm modeling task for Modern Standard Arabic (MSA).
MSA is characterized by morphological richness and extreme orthographic ambiguity, compounding the issue of data sparsity with noise (Habash, 2010). Despite its challenges, MSA is relatively well resourced, with many solutions for morphological analysis and disambiguation leveraging large amounts of annotated data, hand crafted rules, and/or sophisticated neural architectures (Khoja and Garside, 1999; Habash and Rambow, 2006; Smrž, 2007; Graff et al., 2009; Pasha et al., 2014; Abdelali et al., 2016; Inoue et al., 2017; Zalmout and Habash, 2017). Such resources and techniques, however, are not available or not viable for the many under resourced and often mutually unintelligible dialects of Arabic (DA), which are similarly morphologically rich and highly ambiguous (Chiang et al., 2006; Erdmann et al., 2017). Many recent efforts seek to develop morphological resources for DA, but most are under developed or specific to one dialect (Eskander et al., 2013; Jarrar et al., 2014; Al-Shargi et al., 2016; Eskander et al., 2016a; Khalifa et al., 2016, 2017; Zribi et al., 2017; Khalifa et al., 2018).
This work does not aim to develop a full morphological analysis and disambiguation resource, but to inform how one might be most efficiently developed for any DA variety or similarly low resourced language, given scarcity of time and funding. For such a resource to be practical and easily extendable to new DA varieties, it must take as input the natural, highly ambiguous orthography. Thus, we do not rely on constructed phonological representations to clarify ambiguities, as is common practice when modeling morphology for its own sake (Cotterell et al., 2016, 2017). To inform how such a resource should be developed, we evaluate minimally rule based and unsupervised techniques for clustering words that belong to the same paradigm in MSA. We primarily use pre-existing MSA resources only for evaluation, constraining resource availability to emulate DA settings during training, as we lack the resources to evaluate our techniques in DA. Our best system combines a minimal set of rules describing closed class morphology with word embeddings that leverage subword strings and noise reduction strategies. The former, despite being cheaper and easier to produce than other rule-based systems, provides valuable out-of-context morphological knowledge, which the latter complements by modeling the in-context behavior of words and morphemes. Combining the techniques outperforms either individually by about 20% absolute.

Morphology and Ambiguity
Arabic morphology is structurally and functionally complex. Structurally, paradigms are relatively large. Component cells convey morphosyntactic properties at a much finer granularity than English. Functionally, many morphological processes are non-concatenative, or templatic. Arabic roots are lists of de-lexicalized radicals, which must be mapped onto a template to derive a word. The derived word will then exhibit some predictable semantic and morpho-syntactic relationship to the root, based on its template. For example, the root r d d, 1 having to do with responding, could take a singular nominal template where geminates are collapsed, becoming rd, 'response', or a so-called broken plural template, separating the geminate with a long vowel to become rdwd, 'responses'. Arabic orthography complicates the issue further, as diacritics marking short vowels, gemination, and case endings are typically not written. In addition to causing frequent lexical ambiguity among forms that are pronounced differently, this also causes templatic processes to appear to be concatenative or completely disappear. For example, deriving 'to cool' brd (fully diacritized, bar∼ad) from 'coldness' brd (fully diacritized, bar.d) involves doubling the second root letter and adding a short vowel before the third, yet these templatic changes usually disappear in the orthography.
Most templatic processes are derivational, deriving new core meanings with separate paradigms from a shared root. Inflectional processes generally concatenate affixes to a shared stem to realize different cells in the same paradigm. Broken plurals however, like rdwd, are a notable exception, resulting from a templatic inflectional process. Approximately 55% of all plurals are broken (Alkuhlani and Habash, 2011).
Arabic is further characterized by frequent cliticization of prepositions, conjunctions, and object pronouns. Thus, a single syntactic word can take many cliticized forms, potentially becoming homonymous with inflections of unrelated lemmas or distinct cells in the same paradigm. The brd, 'response'-'coldness' ambiguity exemplifies this. The 'response' meaning interprets b as a cliticized preposition meaning 'with', while the 'coldness' meaning interprets b as the first root radical. To investigate how these morphological traits affect our ability to model paradigms, we define the following morphological structures.
Paradigm All words that share a certain lemma comprise a paradigm, e.g., in Figure 1, the paradigm of verbal lemma rad∼, 'to respond', contains the four words connected to it by a solid line. Ambiguity within the paradigm is referred to as syncretism, and is very common in Arabic. For example, the present tense second person masculine singular form is syncretic with the third person feminine singular in verbs, as shown by trd, 'you[masc.sing]/she respond(s)'. Additionally, orthography normalizes short vowel distinctions between past tense second person masculine, second person feminine, and first person forms (and sometimes third person feminine), thus causing radadta, radadti, and radadtu, respectively, to be orthographically syncretic. Cliticized forms can also cause unique syncretisms, e.g., brdnA has two possible interpretations from the same lemma bar∼ad, 'to cool'. If the final suffix nA is interpreted as a past tense verbal exponent, it means 'we cooled', whereas if it is interpreted as a cliticized personal pronoun, it becomes 'he/it cooled us'.
Subparadigm At or below the paradigm level, subparadigms are comprised of all words that share the same lemma ambiguity. Lemma ambiguity refers to the set of all lemmas a word could have been derived from out of context. Hence, brd and brdnA form a subparadigm, being the only words in Figure 1 which can all be derived exclusively from the lemmas rad∼, 'response', bard, 'coldness', and bar∼ad, 'to cool'.

Figure 1: A clan of two families with two paradigms each, connected by both derivational and coincidental ambiguities. Line dotting style is only used to visually distinguish paradigm membership.
Family At or above the paradigm level, a family is comprised of all paradigms which can be linked via derivational ambiguity, such that all paradigms are derived from the same root. Thus, all forms mapping to the two paradigms which in turn map to the root b r d, relating to cold, constitute a single family. The subparadigm of brd and brdnA links the two component paradigms via derivational ambiguity. 2

Clan At or above the family level, a clan is comprised of all families which can be linked by coincidental ambiguity. Thus, the subparadigm of brd and brdnA, whose derivational ambiguity joins the paradigms of the b r d family, also connects that family to the unrelated r d d family via coincidental ambiguity. This is caused by the multiple possible analyses of b as either a cliticized preposition or a root letter.

Experiments
In this section, we describe the data, design, and models used in our experiments.

2 The linguistic concept of derivational family differs from ours in that it does not require any ambiguous forms to be shared by derivationally related paradigms. However, identifying such derivational families automatically is non-trivial. Even if the shared root can be identified, it can be difficult to determine whether the root is mono- or polysemous, e.g., š ς r could refer to hair, poetry, or feeling. Regardless, our definition of family better serves our investigation into the effects of ambiguity.

Data
To train word embedding models, we use a corpus of 500,000 Arabic sentences (13 million words) randomly selected from the corpus used in Almahairi et al. (2016). This makes our findings more generalizable to DA, as many dialects have similar amounts of available data. We clean our corpus via standard preprocessing 3 and analyze each word out of context with SAMA (Graff et al., 2009) to get the set of possible fully diacritized lemmas from which it could be derived. 4 To build an evaluation set, we sum the frequencies of all types within each paradigm and bucket paradigms based on frequency. We randomly select evaluation paradigms such that all 10 buckets contribute at least 10 paradigms each. For all selected paradigms, any paradigms from the same clan are also selected, allowing us to assume that the paradigms included in the evaluation set are independent of those that are not included. Paradigms with only a single type are discarded, as these are not interesting for analysis. Our resulting EVAL set contains 1,036 words from 91 paradigms and a great deal of ambiguity at all levels of abstraction (see Table 1). Because we prohibit paradigms from entering EVAL without the rest of their clan, EVAL also exhibits the desirable property of reflecting a generally realistic distribution of ambiguity: 36% of its vocabulary is lemma ambiguous, as compared to 39% for the entire corpus.

Approach and Evaluation Metric
We build single and multi prototype representations of the entire vocabulary, then examine how well they reflect the paradigms in EVAL. Each representation can be thought of as a tree where each word is a leaf at depth 0, i.e., W1, W2, and W3 in Figure 2. Descending down the tree, words are clustered with other words' branches at subsequent depths until the clustering algorithm finishes or the root is reached, where all words in the vocabulary are clustered together. All trees use some model of word similarity to guide clustering. In multi prototype representations, a word's leaf prototype at depth 0 can be copied and grafted onto other words' branches at non-zero depths before those branches are clustered to its own. Such is the case of W2, which is copied at depth 1 of W3's branch before W3's branch connects to W2's. This enables partially overlapping paradigms to be modeled, like those in Figure 2.

We evaluate the trees via average maximum F-score. For each word in EVAL, we descend from its leaf, at each depth calculating an F-score for the overlap between the words that have been clustered to the leaf's branch so far and the leaf word's known paradigm mates, i.e., the set of words sharing at least one lemma with the leaf. Thus, paradigms are soft clusters in our representation, in that, for each word in a paradigm, its set of proposed paradigm mates need not be consistent with any of its proposed paradigm mates' sets of proposed paradigm mates. We then take the best F-score for each leaf word in EVAL, regardless of the depth level at which it was achieved, and average these maximum F-scores. This reflects how cohesively paradigms are represented in the tree. 5 Additionally, we report the average depth at which templatic and concatenatively related paradigm mates are added.
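The evaluation metric described above can be sketched as follows. This is a minimal illustration with hypothetical data structures (the paper does not specify an implementation): `branch_at_depth[w][d]` holds the set of words clustered to word `w`'s branch by depth `d`, and `paradigm_mates[w]` is the gold set of words sharing at least one lemma with `w`.

```python
def f_score(proposed, gold):
    """Harmonic mean of precision and recall between two word sets."""
    if not proposed or not gold:
        return 0.0
    tp = len(proposed & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(proposed), tp / len(gold)
    return 2 * p * r / (p + r)

def avg_max_f(branch_at_depth, paradigm_mates, eval_words):
    """Average, over EVAL words, of the best F-score found at any depth."""
    scores = []
    for w in eval_words:
        # Take the maximum F-score over all depths of w's branch.
        best = max(f_score(branch, paradigm_mates[w])
                   for branch in branch_at_depth[w])
        scores.append(best)
    return sum(scores) / len(scores)
```

Because the maximum is taken per leaf before averaging, the metric reflects the best depth cutoff for each word individually, i.e., the potential of the tree rather than the performance of any fixed cutoff.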
Because we evaluate via average maximum F-score, this metric represents the potential performance of any given model. Future work will address predicting the depth level where the maximum F-score is achieved for a given leaf word via rule-based and/or empirical techniques that have proven successful for related tasks (Narasimhan et al., 2015; Soricut and Och, 2015; Cao and Rei, 2016; Bergmanis and Goldwater, 2017; Sakakini et al., 2017).

Word Similarity Models
We use the following word similarity models for clustering words in single and multi prototype tree representations.
LEVENSHTEIN The LEVENSHTEIN baseline uses only orthographic edit distances to form a multi prototype tree. At each depth level i, the branch will include every word which has an edit distance of i when compared to the leaf. Transitivity does not hold in this model, as words x and y could be in each other's depth 1 branch, but the fact that z is in y's depth 1 branch does not imply its inclusion in x's depth 1 branch. If the edit distance between x and z is greater than 1, copies, or additional prototypes, must be made of x and y. Because morphology involves complicated processes that cannot be explained merely via orthographic similarity, we predict this model will perform poorly. Still, this baseline is useful to ensure that other models are learning something from words' in-context behavior or out-of-context morphological knowledge beyond what can be superficially induced from edit distances.

DELEX We use a de-lexicalized (DELEX) morphological analyzer to predict morphological relatedness. The analyzer covers all MSA closed-class affixes and clitics and their allowed combinations in open class parts-of-speech (POS); however, there is no information about stems and lemmas in the model. 6 The affixes and clitics and their compatibility rules were extracted from SAMA (Graff et al., 2009). They are relatively cheap to create for any DA variety or other language. The independent, expensive component of SAMA is the information regarding stems and lemmas, which we used to form our evaluation set. We are inspired by Darwish (2002), who demonstrated the creation of an Arabic shallow analyzer in one day. Our approach can be easily extended to DA, at least in a manner similar to Salloum and Habash (2014).
To determine if two MSA words are possibly in the same paradigm, we do the following: (1) we use the analyzer to identify all potential stems with corresponding POS for each word (these stems are simply the leftover string after removing any prefixal and suffixal strings which match a prefix-suffix combination deemed compatible by SAMA); (2) each stem is deterministically converted into an orthographic root as per Eskander et al. (2013) by removing Hamzas (the set of letters representing the glottal stop phoneme, i.e., ', Â, Ǎ, Ā, ŵ, ŷ), long vowels (A, y, w, ý), and reducing geminate consonants (e.g., rdd → rd); (3) two words are determined to be possibly from the same paradigm if there exists a possible orthographic root-POS analysis shared by both words.

DELEX builds a multi prototype tree with a maximum depth of 1. For each leaf word, it uses the above algorithm to identify all words in the vocabulary which can possibly share a paradigm with the leaf word, and grafts them into the branch. Hence, a word can belong to more than one hypothesized paradigm. Because DELEX has access to valuable morphological knowledge, we predict it will be a competitive baseline. Furthermore, it should produce nearly perfect recall, only missing rare exceptional forms, e.g., broken plurals that introduce new consonants such as brAmj, 'programs', the plural of brnAmj, 'program'. We expect its precision to be weak because it lacks lexical or stem-pattern information, leading to rampant clustering of derivationally related and unrelated forms. For example, a word like jAŷz, 'prize' (true root j w z) receives the orthographic root j z (long vowel, Hamza letter, and suffix are dropped), which clusters it with unrelated forms such as jz', 'part' (true root j z '), and jz, 'shearing' (true root j z z).

6 The system includes 15 basic prefixes/proclitics (A, Al, b, f, fy, k, l, lA, mA, n, s, t, w, …).
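Steps (2) and (3) above lend themselves to a short sketch. The character inventories below follow the transliterations in the text and are assumptions about the released analyzer's exact behavior; the input is taken to be Buckwalter-style transliterated stems paired with POS tags from step (1).

```python
import re

HAMZAS = set("'ÂǍĀŵŷ")        # letters representing the glottal stop
LONG_VOWELS = set("Aywý")      # long-vowel letters

def orthographic_root(stem):
    """Drop Hamzas and long vowels, then reduce geminate consonants."""
    kept = "".join(c for c in stem if c not in HAMZAS | LONG_VOWELS)
    return re.sub(r"(.)\1+", r"\1", kept)   # e.g., rdd -> rd

def maybe_same_paradigm(analyses1, analyses2):
    """True if the two words share a possible orthographic root-POS pair.

    Each argument is a set of (stem, pos) analyses from step (1)."""
    roots1 = {(orthographic_root(s), pos) for s, pos in analyses1}
    roots2 = {(orthographic_root(s), pos) for s, pos in analyses2}
    return bool(roots1 & roots2)
```

For instance, `orthographic_root("jAŷz")` yields the ambiguous root `jz` from the text's example, since the long vowel and Hamza letter are dropped.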

Word Embedding Models (W2V, FT, and FT+)
We use different word embedding models to build single prototype representations of the vocabulary via binary hierarchical clustering (Müllner et al., 2013). In order to analyze the effects of data sparsity, we do not impose a minimum word frequency count, but learn vectors for the entire vocabulary. At depth 0, we consider each leaf word to be its own branch. Descending down the tree, we iteratively join the closest two branches based on Ward distance (Ward Jr, 1963). Joined branches are represented by the centroid of their component words' vectors (though, as in other models, we do not include the leaf word as a match when calculating average maximum F-score). We continue iterating until only a single root remains containing the entire vocabulary.
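The tree construction above can be sketched with SciPy's agglomerative clustering under Ward linkage. The random vectors below are stand-ins for the learned word embeddings, and the toy vocabulary is illustrative; the paper's exact clustering code is not specified.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

vocab = ["rd", "rdwd", "brd", "brdnA"]                 # toy vocabulary
vectors = np.random.RandomState(0).rand(len(vocab), 200)

# Each row of Z records one merge: the two joined branches, their Ward
# distance, and the size of the new cluster. Iterating over Z reproduces
# the bottom-up construction from leaves (depth 0) to the single root.
Z = linkage(vectors, method="ward")

for step, (a, b, dist, size) in enumerate(Z):
    print(f"step {step}: merge branches {int(a)} and {int(b)}, "
          f"new cluster size {int(size)}")
```

For a vocabulary of n words this produces n-1 merges, with the final merge containing the entire vocabulary, mirroring the iteration to a single root described above.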
These trees are single prototype because the input embeddings only provide one vector for each word, regardless of whether or not it is ambiguous in any way. While this is a limitation for these models, 7 existing multi prototype word embeddings generally model sense ambiguity, which is easier to capture (though harder to evaluate) given the unsupervised settings in which embeddings are typically trained (Reisinger and Mooney, 2010;Huang et al., 2012;Chen et al., 2014;Bartunov et al., 2016). Adapting multi prototype embed-dings to model lemma ambiguities is non-trivial, especially without lots of supervision. We leave this for future work.
Because trees built from word embeddings are all constructed via the same binary clustering algorithm, the depths at which templatic and concatenatively inflected paradigm mates are joined in Table 2 are comparable vertically across W2V, FT, and FT+ as well as horizontally. However, the multi prototype trees are shorter and fatter, such that the templatic and concatenative average join depths are only comparable horizontally with each other, i.e., within the same model.

W2V
We train the Gensim implementation of WORD2VEC (Mikolov et al., 2013a; Řehůřek and Sojka, 2010) using the SkipGram algorithm with 200 dimensions and a context window of 5 tokens on either side of the target word. As this model does not have access to any subword information and is specifically designed for semantics, not morphology, we predict that it will not perform well in our evaluation.

FT We train a FASTTEXT (Bojanowski et al., 2016) implementation with the same parameters as W2V, except that a word's vector is the sum of its SkipGram vector and those of all its component character n-grams between length 2 and 6. Since short vowels are not written, many Arabic affixes are only one character. Because FASTTEXT bookends words with start/end symbols in its internal representation, outermost single-letter affixes are functionally two characters. By inducing knowledge of such affixes, these character n-gram parameters outperform the language agnostic range of 3 to 6 proposed by Bojanowski et al. (2016).
With the ability to model how subword strings behave in context, FT should outperform both LEVENSHTEIN and W2V, though without access to scholar-seeded knowledge of morphological structures, it is difficult to predict how FT will compare with DELEX. Errors may arise from clustering words based on affixes indicative of syntactic behavior instead of the stem, which indicates paradigm membership. Also, if a word is infrequent and contains no semantically distinct subword string with higher frequency, its embedding will be noisy. Frequency and noise also interact with the hubbiness, or crowdedness, of the embedding region: rural regions require less precision in the vectors to cluster well, whereas there is little room for noise in crowded urban regions, where many similar but morphologically unrelated words could interfere.

FT+ We build another FT model by concatenating the vectors learned from two variant FT models, one with the normal window size of 5 and one with a narrow window size of 1. Both are trained on a preprocessed corpus where phrases have been probabilistically identified in potentially unique distributions over multiple copies of each sentence, as described in . 8 This technique attempts to better model syntactic cues, which are better encoded with narrow context windows (Pennington et al., 2014; Trask et al., 2015; Goldberg, 2016; Tu et al., 2017), while avoiding treating non-compositional phrases as compositional, and also learning from multiple, potentially complementary phrase-chunkings of every sentence. By combining these sources of information, FT+ is designed to learn more meaningful vectors without requiring additional data. We predict it will uniformly outperform FT by reducing noise in the handling of sparse forms like infrequent inflections, a hallmark of morphologically rich languages.

FT+&DELEX We make unique copies of each leaf word's branch, extending all the way to the root in the single prototype FT+ tree.
Then, for each leaf word, at every depth of its branch copy, we use DELEX to prune any words which could not share an orthographic root with the leaf word. Pruning is local to that branch copy, and does not affect the branch copies of paradigm mates which had originally been proposed by FT+ before making branch copies. This makes FT+&DELEX a multi prototype model. After pruning, the F-score is recalculated for each depth of each leaf word's branch and a new average maximum F-score is reported. Because FT+ encodes information regarding the in-context behavior of words, it is quite complementary to the out-of-context morphological knowledge supplied by DELEX. We predict this model will outperform all others.

Table 2: Scores for clustering words with their paradigm mates in tree representations built from different models of word similarity. Scores are calculated as described in Section 3.2, with precision and recall extracted from the depth that maximizes F and then averaged over all words in EVAL. Join depths refer to the average depth at which templatic or concatenatively related paradigm mates are added to the branch.
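The branch-copy pruning behind FT+&DELEX can be sketched as follows. This is a minimal illustration: `delex_compatible` is an assumed stand-in for the orthographic root-POS compatibility check described in Section 3.3, and the branch representation is hypothetical.

```python
def prune_branch(leaf, branch_at_depth, delex_compatible):
    """Return the leaf's branch copy with DELEX-incompatible words removed.

    branch_at_depth: list of word sets, one per depth of the FT+ tree.
    delex_compatible(w1, w2): True if the two words could share an
    orthographic root-POS analysis. Pruning is local to this branch copy;
    other leaves' branch copies are untouched, making the result multi
    prototype.
    """
    return [{w for w in depth_words
             if w == leaf or delex_compatible(leaf, w)}
            for depth_words in branch_at_depth]
```

After pruning, the average maximum F-score is simply recomputed over the pruned branches, so FT+ supplies the candidate ordering and DELEX vetoes morphologically impossible mates.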

Results and Discussion
The results in Table 2 provide strong evidence in support of our hypotheses. The only model performing worse than the LEVENSHTEIN edit distance baseline is W2V, which only understands the in-context, semantic behavior of words. By being able to learn morphological knowledge from in-context behavior of subword strings, FT greatly improves over both W2V and LEVENSHTEIN, demonstrating that it learns far more than can be inferred from out-of-context subword strings, i.e., edit distance, or in-context distributional semantic knowledge without any morphology, i.e., W2V. As predicted, FT+ improves uniformly over FT in all categories, presumably by reducing noise in the vectors of infrequent inflections. Interestingly, with no access to subword information, W2V performs equally poorly on both templatic and concatenatively related paradigm mates, whereas FT and FT+ greatly improve on concatenative mates, but not templatic ones. This is likely because FT and FT+ can identify patterns in subword strings, but not in non-adjacent characters. DELEX's strong baseline performance demonstrates that simple, out-of-context, de-lexicalized knowledge of morphology is sufficient to outperform the best word embedding model that only learns from words' in-context behaviors. However, given the complementarity between DELEX's knowledge and the information FT+ can learn, it is not surprising that the combination of these techniques, FT+&DELEX, far outperforms either system individually.

Specific Examples
We discuss a number of examples that illustrate the variety in the behavior and complementarity of the rule-based DELEX, embedding-based FT+, and combined FT+&DELEX models.

For each example, we specify the strength of the maximum F-score for the three models as: 9 strength(DELEX) + strength(FT+) → strength(FT+&DELEX), e.g., LOW+MID→HI denotes poor DELEX and mediocre FT+ performance on a word, yielding high performance in the combined model.
• jAŷz , 'prize' (LOW+HI→HI) This word has high orthographic root ambiguity since its second morphological root radical is a Hamza. This results in matching words with unrelated true roots like jz', 'part' and jz, 'shearing' under DELEX. It also has high root fertility, in that different paradigms can come from the same true root, like jAŷz, 'permissible', further challenging DELEX. FT+ does relatively better, capturing the word's other inflections, even the broken plural jwAŷz, as their in-context behavior is similar to jAŷz . Interesting recall errors by FT+ include semantically and orthographically similar fAŷz , 'winner[fem.sing]'. The combination yields a perfect F-score.
• yhrςwn, 'they rush' (HI+LOW→HI) This word has an unambiguous orthographic root with no root fertility, resulting in a perfect F-score for DELEX. However, FT+ misses several inflections such as nhrς, 'we rush', and whrςt, 'and I/you/she rushed'. FT+ also makes many semantically and/or syntactically similar precision errors: ysrςwn, 'they hurry', ySArςwn, 'they wrestle', and yqrςwn, 'they ring (a bell)'. The combination leads to a perfect F-score.
• dynAmyky, 'dynamic' (HI+HI→HI) This word has an unambiguous orthographic root based on a foreign borrowing and relatively unique semantics and subword strings. Thus, it achieves a perfect F-score in all three models.

9 The strength designation HI is used for F-scores above 75%, LOW for scores below 25%, and MID for the rest.

• AntšrwA, 'they spread out' (MID+MID→HI) This word has high orthographic root ambiguity (and, incidentally, fertility) due to the presence of n and t, which could belong to a root, template, or prefix. This leads to a 63% F-score under DELEX with many precision errors: AntšArh, 'his spreading out', wntšAwr, 'we discuss', and ntšArk, 'we collaborate'. FT+ scores only 47%, proposing semantically related but morphologically unrelated or only derivationally related forms, e.g., mntšr, 'spread out' (adjective), and tmrkzwA, 'they centralized' (antonym). This semantic knowledge, however, complements DELEX's knowledge, such that the combination is almost perfect (98%).
• kf', 'efficient' (LOW+LOW→LOW) While 17% of words are LOW in DELEX and 28% in FT+, only 4% are LOW in FT+&DELEX. This word exemplifies that 4%, occupying the gap between DELEX's knowledge and FT+'s. It has an extremely ambiguous orthographic root due to the true root containing a Hamza and the first letter being interpretable as a proclitic or root radical. Thus, DELEX achieves a 2% F-score. FT+ is only slightly better (5%). It is likely that this word's low frequency is the main contributor to its noisy embedding, as it appears only once in our corpus. The combination F-score is thus only 11%.

Related Work
This work builds on several others addressing word embeddings and computational morphology.
Word Embeddings Word embeddings are trained by predicting either a target word given its context (Continuous Bag of Words) or elements of the context given a target (SkipGram) in unannotated corpora (Mikolov et al., 2013a), with the learned vectors modeling how words relate to each other. Embeddings have been adapted to incorporate word order (Trask et al., 2015) or subword information (Bojanowski et al., 2016) to motivate the learned vectors to specifically capture syntactic, morphological, or other similarities.
Word embeddings are generally single prototype models, in that they learn one vector for each word, which can be problematic for ambiguous forms (Reisinger and Mooney, 2010; Huang et al., 2012; Chen et al., 2014). Bartunov et al. (2016) propose a multi prototype model that learns distinct vectors for distinct meanings of types based on variation in the contexts within which they appear. Gyllensten and Sahlgren (2015) argue that single prototype embeddings actually can model ambiguity because the defining characteristics of a word's different meanings typically manifest in different dimensions of the highly dimensional vector space. They find that ambiguous words' relative nearest neighbors in a relative neighborhood graph often correlate with distinct meanings. Such works, however, deal with sense ambiguity, or abstract semantic distinctions between different usages of a word with potentially the same morphosyntactic properties and core meaning. Evaluation usually requires linking to large semantic databases which, for Arabic, are still underdeveloped (Black et al., 2006; Badaro et al., 2014; Elrazzaz et al., 2017).
Computational Morphology This field of study includes rule-based, machine learning, and hybrid approaches to modeling morphology. The traditional approach is to hand write rules to identify the morphological properties of words (Beesley, 1998; Khoja and Garside, 1999; Habash and Rambow, 2006; Smrž, 2007; Graff et al., 2009; Habash, 2010). These can be used for out-of-context analysis, which SAMA (Graff et al., 2009) performs for MSA, or they can be combined with machine learning approaches that leverage information from the context in which a word appears. MADAMIRA (Pasha et al., 2014), for example, is trained on an annotated corpus to disambiguate SAMA's analyses based on the surrounding sentence.
Other systems use machine learning without rules. They can train on annotated data, like Faruqui et al. (2016), who learn morpho-syntactic lexica from a small seed, or they can learn without supervision, like Luo et al. (2017), who induce "morphological forests" of derivationally related words by predicting suffixes and prefixes based on the vocabulary alone. Some approaches seek to be language independent. MORFESSOR (Creutz and Lagus, 2005), for instance, segments words based on unannotated text. However, it deterministically produces context-irrelevant segmentations, causing error propagation in languages like Arabic that are characterized by high lexical ambiguity (Saleh and Habash, 2009; Pasha et al., 2014). A few systems have incorporated word embeddings to perform segmentation (Narasimhan et al., 2015; Soricut and Och, 2015; Cao and Rei, 2016), with some attempting to model and analyze relations between underlying morphemes as well (Bergmanis and Goldwater, 2017; Sakakini et al., 2017), though none of these distinguish between inflectional and derivational morphology. Eskander et al. (2016b) propose another segmentation system using Adaptor Grammars for six typologically distinct languages. Snyder and Barzilay (2010) actually use multiple languages simultaneously, finding the parallels between them useful for disambiguation in morphological and syntactic tasks.
Our work is closely related to Avraham and Goldberg (2017), who train embeddings on a Hebrew corpus with disambiguated morphosyntactic information appended to each token. Similarly, Cotterell and Schütze (2015) "guide" German word embeddings with morphological annotation, and Gieske (2017) uses morphological information encoded in word embeddings to inflect German verbs. For Arabic, Rasooli et al. (2014) induce paradigmatic knowledge from raw text to produce unseen inflections, and Eskander et al. (2013) identify orthographic roots and use them to extract features for paradigm completion given annotated data. While we adopt the concept of approximating the linguistic root with an orthographic root, we do not use annotated data where the stem has already been determined, as in Eskander et al. (2013). Thus, we generate all possible orthographic roots for a given word instead of just one, as discussed in Section 3.3. Sakakini et al. (2017) provide an alternative unsupervised technique for extracting roots in Semitic languages; however, we chose to adopt the orthographic root concept instead for several reasons. Firstly, despite performing comparably with other empirical techniques, Sakakini et al. (2017)'s root extractor is not extremely accurate. While our implementation generates potentially multiple orthographic roots with imperfect precision, its near perfect recall is useful for pruning without propagating error. A major reason why we find DELEX and FT+ to complement one another is the independence of the orthographic root extraction rules and the distributional statistics leveraged by word embeddings. Sakakini et al. (2017)'s root extractor, however, depends on embeddings to identify roots. Furthermore, their root extractor cannot be used to generate multi prototype models, as it only produces one root per word. Finally, despite orthographic roots' dependence on hand written rules, we show that these rules are very few, such that adapting Sakakini et al. (2017)'s root extractor to a new language or dialect would not necessarily require any less effort than writing new rules.

Conclusion and Future Work
In this work, we demonstrated that out-of-context, rule-based knowledge of morphological structure, even in minimal supply, greatly complements what word embeddings can learn about morphology from words' in-context behaviors. We discussed how Arabic's morphological richness and many forms of ambiguity interact with different word similarity models' ability to represent morphological structure in a paradigm clustering task. Our work quantifies the value of leveraging subword information when learning embeddings and the further value of noise reduction techniques targeting the sparsity caused by complex morphology. Our best performing model uses out-of-context rules to prune unlikely paradigm mates suggested by our best embedding model, achieving an F-score of 71.5% averaged over our evaluation vocabulary. Our results inform how one would most cost effectively construct morphological resources for DA or similarly under resourced, morphologically complex languages.
Our future work will target templatic morphological processes which still challenge our best model, requiring knowledge of patterns realized over non-adjacent characters. We will also address errors due to ambiguity, either by adapting multi prototype embedding models to capture morphological ambiguity, including knowledge of paradigm structure in our de-lexicalized rules, or by using disambiguated lemma frequencies to model ambiguity probabilistically. In applying this work to DA, we will additionally need to address the issue of noisy, unstandardized spelling. We will also investigate different knowledge transfer techniques to leverage the many resources available for MSA.