A Little Linguistics Goes a Long Way: Unsupervised Segmentation with Limited Language Specific Guidance

We present de-lexical segmentation, a linguistically motivated alternative to greedy or other unsupervised methods, requiring only minimal language specific input. Our technique involves creating a small grammar of closed-class affixes which can be written in a few hours. The grammar over generates analyses for word forms attested in a raw corpus which are disambiguated based on features of the linguistic base proposed for each form. Extending the grammar to cover orthographic, morpho-syntactic or lexical variation is simple, making it an ideal solution for challenging corpora with noisy, dialect-inconsistent, or otherwise non-standard content. In two evaluations, we consistently outperform competitive unsupervised baselines and approach the performance of state-of-the-art supervised models trained on large amounts of data, providing evidence for the value of linguistic input during preprocessing.


Introduction
Non-standard domains, dialectal variation, and unstandardized spelling make segmentation challenging, though morphologically rich languages require good segmentation to enable downstream applications from syntactic parsing to machine translation (MT). For domains lacking sufficient annotated data to train segmenters, one must resort to language specific greedy techniques or language agnostic unsupervised techniques. Greedy techniques use maximum matching to identify base words, leveraging large dictionaries (Guo, 1997). Yet such dictionaries are often unavailable or too expensive for low resource languages. Language agnostic unsupervised options like MORFESSOR (Creutz and Lagus, 2005) and byte pair encoding (BPE) (Sennrich et al., 2016) assume no resources beyond raw text but can yield lower performance on downstream tasks (Vania and Lopez, 2017; Kann et al., 2018). They also suffer from typological biases and favor intended applications at the expense of others.
To this end, we present De-lexical Segmentation (DESEG), a slightly more expensive but powerful alternative to language agnostic morphological segmentation, realizing most of the benefits of supervised segmentation at far lower cost. DESEG requires language specific input in the form of a small grammar describing the combinatorics of closed-class affixes. We demonstrate that such a grammar can be constructed easily and rapidly for a new language or dialect. Hence, DESEG addresses the scenario in which there is no supervised segmenter available for a given language or dialect (or no segmenter trained on a domain with sufficient lexical overlap with the target domain in its training data), but the user does have linguistic knowledge of the target language/dialect.
The user-provided grammar is employed in conjunction with a large, raw corpus. The grammar over generates analyses for all words therein, allowing for maximal recall not only of the possible affix combinations, but also variant spellings and dialectal idiosyncrasies. The preferred analysis is disambiguated based on the fertility with which its proposed base attaches to different affixes in analyses of other words throughout the corpus. This follows from the logic that valid bases are more likely to productively combine with more exponents (Bertram et al., 2000). By leveraging language specific resources but learning to disambiguate empirically without supervision, we mitigate much of the sparsity inherent in processing non-standard domains.
Using a corpus of several Arabic dialects exhibiting rich and complex morphology, unstandardized spelling, and variation bordering on mutual unintelligibility, we evaluate DESEG intrinsically on language modeling (LM) and extrinsically on MT. DESEG consistently outperforms MORFESSOR and BPE while only costing a few hours of grammar-building labor; and in some environments it outperforms state-of-the-art supervised Arabic tokenizers MADAMIRA (Pasha et al., 2014) and FARASA (Abdelali et al., 2016). The success of such a simple model is strong evidence for the value of linguistic input during preprocessing. DESEG is publicly available at github.com/CAMeL-Lab/deSeg.

Related Work
Many morphologically rich languages lack crucial preprocessing resources like morphological analyzers or segmenters. Even well resourced languages often lack such resources for non-standard dialects and domains. There have been many approaches to address this problem, varying along a number of dimensions: the degree of language independence or specificity, the required amount of machine learning supervision, the degree of depth and richness of the morphological representations.
Standard Arabic models. Modern Standard Arabic (MSA) morphological analysis, disambiguation and tokenization have been the focus of a large number of efforts. Khoja and Garside (1999) was one of the earliest published efforts on automatic shallow and deterministic segmentation for MSA. Darwish (2002) used limited resources and greedy techniques to automatically learn rules and statistics to build a shallow morphological analyzer. There are many MSA morphological analyzers with rich representations and good coverage that required very intensive efforts to create (Beesley, 1998; Buckwalter, 2004; Attia, 2006, 2007; Smrž, 2007; Boudchiche et al., 2017). Buckwalter (2004) is perhaps the most commonly used among them, as it contributed the representations for the Penn Arabic treebank (PATB) (Maamouri and Bies, 2004). The PATB has been the most used resource for supervised morphological disambiguation (Diab et al., 2004; Habash and Rambow, 2005; Pasha et al., 2014; AlGahtani and McNaught, 2015; Zalmout and Habash, 2017). Some efforts have used other annotated resources and/or large unannotated data sets (Lee et al., 2003; Abdelali et al., 2016; Freihat et al., 2018). More closely related to this paper,  demonstrated that delexicalized information provides a cheap means of inducing morphological knowledge and thereby predicting lexical information in MSA. They employ a de-lexicalized grammar which is similar to ours, but they do not handle dialectal variants or spelling variation. They also do not use the grammar for segmentation, but for pruning word embedding clusters in order to predict the paradigm membership of forms encountered in raw text.
Dialectal Arabic models. Work on dialectal Arabic morphology and tokenization is relatively newer than work on MSA. Some of the earlier efforts worked on rule-based approaches to model dialectal morphology directly (Habash and Rambow, 2006; Habash et al., 2012), or exploited existing MSA resources (Salloum and Habash, 2014). Later, a number of annotation efforts led to the creation of dialectal annotated corpora of varying sizes following the style of the PATB (Maamouri et al., 2014; Jarrar et al., 2016; Al-Shargi et al., 2016; Alshargi et al., 2019). The created annotations supported models for dialectal Arabic analysis, disambiguation and tokenization building on the same successful approaches in MSA (Eskander et al., 2016a; Habash et al., 2013; Pasha et al., 2014; Zalmout and Habash, 2019). More closely related to this paper, Eldesouki et al. (2017) used a de-lexicalized analysis strategy for four colloquial varieties of Arabic, though they also use minimal training data and extract features from an open class lexicon to learn either an SVM or bi-LSTM-CRF disambiguation model. They further show that domain adaptation from existing MSA training data is beneficial. Also, Samih et al. (2017) applied a related model to segmentation, allowing different Arabic dialects to inform one another, thus avoiding the need to perform dialect identification during preprocessing. We compare our model to MADAMIRA (Pasha et al., 2014) and FARASA (Abdelali et al., 2016), which represent the fully supervised state of the art for segmenting Arabic in the standard domain, but have limited support for multiple colloquial variants of the language.
Finally, we note that, linguistically, our work is inspired by Bertram et al. (2000), who find that prolific stems with large derivational families are accessed more quickly. Their work suggests that stem fertility, or the productivity with which a stem can combine with different affixes, is cognitively relevant to morphological organization.

De-lexical Segmentation for Arabic
In this section, we introduce a case study on segmenting a multi-dialect Arabic corpus and explain the linguistic challenges it presents for popular approaches to segmentation. Furthermore, we discuss the construction of DESEG's grammar and its disambiguation algorithm.

Arabic and its Dialects
Arabic is highly diglossic (Ferguson, 1959), with the relatively consistent high register of Modern Standard Arabic being learned in schools across the Arab World. Meanwhile, the often mutually unintelligible low register variants, collectively known as dialectal Arabic (DA), are spoken colloquially. The phonological, morpho-syntactic, and lexical variation within the Arabic sprachbund is comparable to that among Romance languages (Chiang et al., 2006; Rouchdy, 2013; Erdmann et al., 2017), leading to problematic noise in multidialect corpora. Furthermore, lack of spelling conventions in DA exacerbates data sparsity, as does a rich morphology featuring templatic phenomena and robust cliticization, making it challenging to train quality segmenters even with much supervised data.

Data
To demonstrate how our model handles such challenging phenomena, we apply it to the CORPUS6 subset of the MADAR-BTEC (Takezawa et al., 2002) corpus of Arabic dialects. This consists of 12,000 sentences in the travel domain (9,000 for training) parallel between English, MSA, and the DA varieties spoken in Beirut, Cairo, Doha, Rabat, and Tunis. This comprises a representative sample of the breadth of intra-DA variation.
In addition to CORPUS6, we also use large amounts of raw monolingual data to train our segmenter and the unsupervised baselines. To avoid introducing even more noise, we restrict our monolingual datasets as much as possible to similar domains. For DA, we use the four subsets of Almeman and Lee (2013)'s web crawl of forums, comments and blogs, consisting of over 10 million words for each subset's dialect region. It is worth noting however, that the granularity of their dialect regions is coarser than the granularity of CORPUS6. Hence, their Maghrebi dialect corresponds to two dialects in CORPUS6, Tunis and Rabat, while the remaining three dialect regions have rather obvious one-to-one correspondences with CORPUS6, i.e., Egyptian to Cairo, Levantine to Beirut, and Gulf to Doha. For MSA, which rarely occurs consistently (i.e., outside of brief instances of code-mixing) in such casual domains, we used the TED corpus (Cettolo and Girardi, 2012) for our monolingual data set, finding a compromise between domain relevance and corpus size. It contains about 2.5 million words.
Obviously, CORPUS6 is small relative to other MT corpora, but this is exactly why it is a meaningful evaluation corpus. Larger parallel corpora are often only available for better resourced languages/domains where fully supervised segmenters are also more likely to be available, negating the need to build one's own segmenter. Furthermore, as parallel data becomes less sparse, tokenization necessarily has less of an effect since models can memorize and effectively use longer sequences. With that said, CORPUS6 is commissioned, and in future work we would like to also test DESEG's performance on natural corpora.

De-lexical Analysis
The DESEG grammar provides all possible de-lexical analyses of words by assuming any n-gram of some minimum length can be an open class base, provided the remaining characters comprise a supported affix pattern. Hence, a simple grammar which only supports words without affixes or with a single suffix, +s, would return two analyses for wugs: wugs and wug +s, and one for foo: foo.
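The wugs example can be reproduced with a few lines of code. The following is a minimal sketch rather than the released DESEG analyzer: the function name analyze, the suffix list, and the min_base parameter are hypothetical and chosen only to mirror the toy grammar just described.

```python
def analyze(word, suffixes=("", "s"), min_base=2):
    """Return every (base, suffix) split licensed by the toy grammar.

    The empty suffix stands for the unaffixed cell, so every word that is
    long enough is always a candidate base on its own.
    """
    analyses = []
    for suffix in suffixes:
        if not word.endswith(suffix):
            continue
        base = word[:len(word) - len(suffix)]
        # Any remaining character sequence of sufficient length may be a base.
        if len(base) >= min_base:
            analyses.append((base, suffix))
    return analyses


print(analyze("wugs"))  # two analyses: wugs, and wug +s
print(analyze("foo"))   # one analysis: foo
```

Because no open class lexicon is consulted, the analyzer cannot tell which of the two analyses of wugs is right; that decision is deferred to the disambiguation step.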
To build such a grammar for an Arabic dialect, we target clitic affixation, as this phenomenon is nontemplatic with minimal fusional edits, making it easier to model with a smaller grammar, yet it accounts for a great deal of sparsity, as Arabic clitics are as productive as regular inflectional exponents.
We use our grammar to build a de-lexicalized morphological analyzer for all DA dialects targeting the D3 segmentation scheme (Habash, 2010), which separates all clitics and only clitics from the base forms to which they attach. We chose D3 as Sadat and Habash (2006) demonstrate it to be the most effective scheme for low resource Arabic MT. While Arabic exhibits many other nonconcatenative, templatic phenomena which complicate segmentation and tokenization, clitics are always concatenated to the outsides of base forms after the templatic pattern has been applied and are thus easier to separate. Occasionally, fusional processes can alter phonemes/graphemes on either side of base-clitic or clitic-clitic boundaries, but no templatic process is ever invoked to alter the internal structure of bases by affixing any clitic.
We follow Khalifa et al. (2017)'s approach to extending paradigms with possible clitic combinations, though we do not require any stem lexical information. Hence, we cheaply enable the grammar to over generate, accommodating more spelling variants and removing the need to construct an open class lexicon. Instead, we simply provide meta paradigms for abstractions over base forms with the same combinatorics. Each cell in a meta paradigm represents a unique exponent, or possible mapping of clitics to positions surrounding the abstract base, such that the inflected form would be valid for any real base represented by that meta paradigm. Considering verbal affixation in English, walk and talk would be two real bases taking the same meta paradigm with four cells, represented by exponents _+ing, _+s, _, and _+ed. Thus, any two bases exhibiting distinct exponent signatures will belong to distinct meta paradigms. In Arabic, by contrast, paradigms are enumeratively and integratively more complex than the TALK/WALK meta paradigm (Ackerman and Malouf, 2013). Table 1 exemplifies Arabic's enumerative complexity, as verbs, for instance, depending on dialect, can take some 20 affixes according to (A), realizing various combinations of aspect, person, gender, and number.
Having taken an affix, the verb can participate in myriad possible additional combinations with clitics in (B) and (C) as dictated by the bottom two rules in the CFG in (D). Arabic is thus integratively complex in that rich exponents can comprise many interacting morphemes whose meanings are often affected by each other's presence. Furthermore, fusional processes acting on such complex forms result in frequent allomorphy. Allomorphy is mostly limited to internal, non-clitic morphemes, which enables us to greatly reduce sparsity without propagating error by focusing on clitics. Hence, we can represent all verbs with a single meta paradigm which is large, but can be described in two CFG rules. In practice then, each of the 20 possible affixes in (A) will correspond to distinct abstract bases, though this eliminates the need to specify 20 distinct meta paradigms for single lexemes. We target relating these abstract bases to each other via nonconcatenative modeling in future work.
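To make the notion of an exponent signature concrete, the English walk/talk illustration can be sketched as follows. This is only an illustration under simplifying assumptions: the four exponents are plain suffixes, the names EXPONENTS and signatures are hypothetical, and the word list is invented for the example.

```python
from collections import defaultdict

# The four cells of the English verbal meta paradigm: _, _+s, _+ed, _+ing.
EXPONENTS = ("", "s", "ed", "ing")


def signatures(forms, bases):
    """Group bases by the set of exponents with which they are attested."""
    groups = defaultdict(list)
    for base in bases:
        sig = frozenset(e for e in EXPONENTS if base + e in forms)
        groups[sig].append(base)
    return dict(groups)


forms = {"walk", "walks", "walked", "walking",
         "talk", "talks", "talked", "talking", "sheep"}
groups = signatures(forms, ["walk", "talk", "sheep"])
for sig, members in groups.items():
    print(sorted(sig), members)
```

Here walk and talk share one signature and therefore one meta paradigm, while sheep, attested only in its bare form, falls into a different one.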
In terms of the effort required to create the grammar, there are a total of 98 unique affixes for all dialects. We include the non-clitic affixes in Table 1 (A) in this count as they are used to restrict the set of possible meta paradigms. Of these, 45% appear in at least two dialects and 33% appear in all dialects. The total number of affix-dialect pairs is 288. On average, 88% of each dialect's affixes are shared by at least one other dialect and 45% by all dialects. The average dialect specific list contains 58 affixes and adding a second dialect requires an additional 16. Adding a third, fourth, and fifth dialect requires 10, 8, and 7 additional affixes on average, respectively. Thus, building a single dialect grammar is cheap and adding dialects is even cheaper. Our final grammar contains five meta paradigms, one for each of the basic Arabic parts of speech: verbs (PV, IV, and CV), nominals, and particles. These are compiled into an analyzer like that of Buckwalter (2004).
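The reported figures can be cross-checked with simple arithmetic. The snippet below is a sanity check, not part of DESEG; it assumes that the grammar covers five dialects, which is inferred from the mention of adding up to a fifth dialect, and that small discrepancies are due to rounding of the reported averages.

```python
# Figures reported in the text.
affix_dialect_pairs = 288
num_dialects = 5                   # assumption: five dialect grammars
increments = [58, 16, 10, 8, 7]    # first dialect, then each added dialect
shared_with_other = 0.88           # share of a dialect's affixes found elsewhere

avg_list_size = affix_dialect_pairs / num_dialects
cumulative = sum(increments)
unique_per_dialect = (1 - shared_with_other) * avg_list_size

print(avg_list_size)       # compare with the reported average of 58
print(cumulative)          # compare with the reported 98 unique affixes
print(unique_per_dialect)  # compare with the 7 affixes added by a fifth dialect
```

Under these assumptions the numbers agree to within rounding: the average list size rounds to 58, the increments sum to one more than the 98 unique affixes, and the affixes unique to a dialect roughly match the cost of adding the last dialect.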

Unsupervised Disambiguation
DESEG supports two simple, fast models for disambiguating the grammar's analyses. The first, DESEG_g, greedily selects the maximum match analysis, or that with the smallest base after matching affixes. The second, DESEG_f, selects the analysis with the most fertile base. The fertility of each candidate base is calculated in the raw corpus by counting the possible combinations of adjacent affixes with which it appears over all analyses for all words in which it is proposed as a base.
For example, consider the three-word toy corpus in Table 2.
byqwlhA, correctly segmented as b+ yqwl +hA, PROG+ say.3MS +it, 'he is saying it', has six possible analyses, each with a different candidate base. Two candidate bases, yqwl and byqwl, are also candidate bases for another word, byqwl 'he's not saying', but only yqwl exhibits multiple unique adjoining affix sets. In byqwlhA, it takes the circumfix b | hA, while in byqwl, it takes the prefix b. The fertility of base yqwl suggests it is more likely to be a productive stem in the language, whereas the lack of fertility for the base byqwl suggests it is not systematically utilized in the language as a base might be expected to be used, and that it is more likely a simple coincidence that enables the over permissive grammar to allow such a candidate.
The final word in the vocabulary, ybqwly, correctly segmented as ybqw +l +y, remain.3MP +to +me 'they remain for me', is challenging because no other inflection of the lexeme is attested. Yet, by maximum matching on the affixes, we choose the correct analysis (ybqw plus the complex suffix of prepositional l followed by object y), as the proposed base ybqw is shorter than the other candidate base which is produced by erroneously assuming a nominal meta paradigm. The nominal analysis re-analyzes y as the first person possessive enclitic and crucially extends the base with l, as l is not a viable nominal enclitic. Thus, choosing the shortest base can help to eliminate coincidentally feasible analyses.
Each model, DESEG_f and DESEG_g, breaks ties using the other. Thus, DESEG_f would correctly segment the entire toy corpus, as the correct analyses in byqwlhA and byqwl feature the uniquely most fertile candidate bases, and while there is a fertility tie for ybqwly, backing off to the candidate segmentation with the smallest base length correctly selects the segmentation with ybqw as the base. DESEG_g correctly segments byqwl and ybqwly, but incorrectly predicts that the stem-final l in byqwlhA is actually the same enclitic preposition present in ybqwly and thus over segments.
In the event of ties after considering both fertility and base length, both models back off again to the analysis with the base that most frequently occurs as a full word in the raw corpus. Prioritizing this frequency above either fertility or base length minimization always hurt performance, even though it proved quite useful as a feature for Narasimhan et al. (2015). We attribute this seeming discrepancy to the interaction of Arabic's rich morphology with the noise of unstandardized DA data. Many gold bases cannot actually appear as stand-alone words due to the fusional morphology, and various writing conventions greatly affect the frequency with which bases that can manifest as stand-alone words actually do.
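The behavior of both disambiguators on the three-word toy vocabulary can be sketched in code. This is a minimal sketch under loudly stated assumptions, not the released implementation: the affix lists are a hypothetical toy grammar, meta paradigms are ignored, the null affix set is assumed not to count toward fertility (which matches the description of byqwl as infertile), and the final back-off to full-word frequency is omitted.

```python
from collections import defaultdict

# Hypothetical toy affix inventory, not the actual DESEG grammar.
PREFIXES = ("", "b")
SUFFIXES = ("", "y", "hA", "ly", "lhA")
MIN_BASE = 2


def analyze(word):
    """Over-generate every (prefix, base, suffix) split the toy grammar allows."""
    out = []
    for p in PREFIXES:
        for s in SUFFIXES:
            if word.startswith(p) and word.endswith(s):
                base = word[len(p):len(word) - len(s)]
                if len(base) >= MIN_BASE:
                    out.append((p, base, s))
    return out


def fertility_table(vocab):
    """Map each candidate base to the distinct non-empty affix sets it takes."""
    table = defaultdict(set)
    for word in vocab:
        for p, base, s in analyze(word):
            if p or s:  # assumption: the bare form adds no fertility
                table[base].add((p, s))
    return table


def segment(word, table, mode="f"):
    """Mode 'f' ranks by fertility then short base; mode 'g' does the reverse."""
    def key(analysis):
        base = analysis[1]
        fertility, shortness = len(table.get(base, ())), -len(base)
        return (fertility, shortness) if mode == "f" else (shortness, fertility)
    return max(analyze(word), key=key)


vocab = ["byqwlhA", "byqwl", "ybqwly"]
table = fertility_table(vocab)
for w in vocab:
    print(w, segment(w, table, "f"), segment(w, table, "g"))
```

With this toy grammar, byqwlhA receives six analyses, the fertility-first model selects yqwl and ybqw as bases, and the greedy model over segments byqwlhA by selecting the shorter base yqw, in line with the discussion above.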

Evaluation
We compare DESEG to several alternative segmentation models. We use the CORPUS6 dev set to pick the optimal minimum base length on an intrinsic LM perplexity evaluation, and then perform an extrinsic MT evaluation on the test set.

Models
We evaluate the following models:
PLAIN This baseline segments only punctuation.
MADAMIRA Egyptian and MSA versions are available for MADAMIRA, which disambiguates a rule-based morphological analyzer's output with an SVM trained on morphologically annotated data. We use the Egyptian version as it is pretrained on a superset of the MSA data to capture code switching. Thus, performance does not significantly drop when testing on MSA, and performance is significantly greater when testing on DA varieties, even those far outside of Egypt, due to many shared intra-DA linguistic traits not present in MSA (Khalifa et al., 2017). MADAMIRA is a tokenizer in that it not only segments but also mitigates data sparsity due to allomorphy by recovering the canonical underlying morpheme for each segment. We run MADAMIRA in D3 tokenization mode, facilitating comparison with DESEG.
FARASA Similar to MADAMIRA, FARASA is a pre-trained, SVM-based system leveraging gold annotations and external dictionaries. Together, FARASA and MADAMIRA represent the state of the art for a number of morphological tasks in Arabic. FARASA differs from MADAMIRA in that only one version is publicly available, it segments only, not attempting to tokenize, and the segmentation scheme is linguistically ad hoc, tending to be slightly more granular than D3.
BPE Byte pair encoding uses an algorithm originally designed for file compression to perform unsupervised segmentation. BPE was originally proposed to reduce vocabulary size to make neural MT tractable (Sennrich et al., 2016), as the algorithm's simplicity enables easy application to any language. It separates all characters in the corpus, then performs a pre-determined number of join operations, merging all instances of specified bigrams. Joins are determined such that the resulting corpus will contain as few tokens as possible given the number of join operations allowed. Thus, while the algorithm is unsupervised and easy to apply to any language, it is linguistically naive, assuming that morphological organization is driven solely by enumerative efficiency concerns. Likely for this reason, BPE has not been demonstrated to be particularly useful for applications beyond neural MT (Kann et al., 2018).
MORFESSOR The de facto publicly available unsupervised segmentation system is MORFESSOR. Like BPE, MORFESSOR trains in an unsupervised fashion on large amounts of data and is easily run on any language. Efficient encoding of morphology is also at the center of MORFESSOR's objective function, though it considers not only how compact the corpus can be represented, but also how compact the grammar describing morpheme combinatorics can be represented. Stem morphemes are distinguished from affixal morphemes as the model seeks to limit the number of unique signatures (the sets of unique affixes which can occur with a given stem) that result from the learned segmentation scheme. While MORFESSOR performs well on a number of unsupervised segmentation tasks, it is known to have typological biases toward the languages for which it was originally developed (Kirschenbaum, 2015).
DESEG Our model, described in Section 3, finds a compromise between the convenience of language agnostic unsupervised systems and the performance of systems leveraging language specific resources. DESEG can be run with a minimum base length of either 2 or 3 characters and a priority of base fertility maximization (f) over greedy base length minimization, or vice versa (g). Minimum base length and priority are represented as subscripts in all relevant tables.
Table 3 shows the LM results for tokenizing  where all trainable segmenters are trained on all of the raw data pooled together instead of training dialect specific tokenizers on relevant subsections of the 40+ million word corpus. To enable pooled DESEG grammars, each dialect's grammar is merged into one highly permissive, over generating pan-Arabic grammar. In the unpooled training scenario, perplexity rankings were consistent with those displayed here. Our model greatly reduces both perplexity and out of vocabulary over all competitive models, though we also exhibit a tendency to over segment.
Our best DESEG variants use a minimum base length of two, which is logical because while Arabic features mainly triradical roots, gemination causes many base forms to reduce to only two graphemes. In the intrinsic evaluation, it is difficult to tell whether the preference for greed (DESEG_g2) or fertility (DESEG_f2) is better. Our success is likely due to the fact that we alone cover all the dialects, yet that coverage was achieved in a fraction of the time spent constructing the annotated data upon which state-of-the-art systems rely to cover just a single dialect.
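For readers unfamiliar with the BPE baseline listed among the models above, its merge procedure can be sketched briefly. The sketch uses the common greedy formulation, in which each join operation merges the currently most frequent adjacent symbol pair; the function name learn_bpe and the toy word counts are hypothetical, and ties between equally frequent pairs are broken arbitrarily by insertion order.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn up to num_merges joins from a {word: frequency} dictionary."""
    # Start with every word fully separated into characters.
    vocab = {tuple(w): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab


merges, vocab = learn_bpe({"abab": 1, "ab": 2}, num_merges=1)
print(merges, vocab)
```

Nothing in this procedure refers to affixes or bases, which is the sense in which the approach is linguistically naive compared with a grammar of closed-class affixes.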

Extrinsic Machine Translation Evaluation
We conduct MT experiments translating Arabic dialects to English in three environments. Pooled-pooled trains segmenters (only trainable segmenters) on the monolingual corpus with all dialects pooled and the MT system on all the dialects pooled. Individual-individual trains six segmenters on relevant subsections of the monolingual data and six MT systems on the relevant partitions of CORPUS6. Individual-pooled trains individual segmenters but one pan-Arabic MT system, which is reasonable to reduce the over generation of the morphological model but leverage shared information during MT. Neural MT has been used with dialects (Hassan et al., 2017), but given the extreme scarcity of in-domain data, statistical MT (Koehn et al., 2007) is the better choice (Farajian et al., 2017) for comparing quality of segmentation in our setting. DESEG consistently outperforms unsupervised alternatives BPE and MORFESSOR in Table 4 while approaching and even beating state-of-the-art systems FARASA and MADAMIRA in the individual-pooled environment. The fertility-based model DESEG_f2 outperforms its greedy counterpart, supporting the argument that base fertility plays a meaningful role in morphological organization.

Error Analysis
We performed a quantitative error analysis on 100 sentences randomly selected from CORPUS6 for each variety, creating a gold segmentation set. In Table 5, accuracy is computed given the two modes of training DESEG_f2 (i.e., pooled or individual), and compared with the PLAIN input baseline. Average segmentation accuracy over all varieties correlates with the extrinsic evaluation for both modes of training DESEG_f2. In both modes, the best performance is on MSA and the worst is on Rabat then Tunis.
In individual mode, the poor performance of Rabat and Tunis is expected as we could not obtain sufficiently large monolingual data sets that distinguish these two quite linguistically distinct North African varieties. Thus, we were forced to train both grammars' disambiguators on the same data, propagating error whenever a form occurred in the Rabat dialect not analyzable by the Tunis grammar or vice versa. As for pooled mode, careful inspection revealed an exceptional amount of inconsistent spellings in the Tunis and Rabat partitions of CORPUS6 that were not anticipated when constructing the grammar. The definite article proclitic + Al+ for example, frequently appears as its own word, reduced to just l, or deleted altogether when preceded by another proclitic, especially when the l assimilates phonologically to the following phoneme. In MSA, by contrast, the definite article is always attached to the following noun, the l is never deleted, and the A can only be deleted following the prepositional proclitic + l+, 'for'. It is not surprising then that MSA performs the best in both modes as there is only negligible inconsistency in MSA spelling, meaning that the grammar need not anticipate an unbounded set of spelling alternatives exacerbating over generation and putting more stress on the disambiguator.
The best DA performance is achieved on Beirut for the pooled mode and Doha for the individual mode. Beirut is the least verbose of all dialects in unsegmented space, and also exhibits the lowest ratio of unsegmented tokens to gold segmented tokens, meaning that it rewards over segmenting, which we know DESEG_f2 is biased toward given its secondary preference for short bases. As for the high performance on Doha, it is worth noting that Doha is also the highest performing dialect on all MT experiments, even recording higher BLEU scores than MSA. It is thus likely that the Doha partition of CORPUS6 is simply more internally consistent than the others, not just in terms of spelling, but also lexical choices and syntactic structure. This could be idiosyncratic to CORPUS6 more than it is characteristic of the Doha dialect, though an independent test corpus would be needed to investigate this further.
While the extrinsic MT results vouch for the effectiveness of pooled grammars when training data cannot be separated by dialect, the pooled training mode consistently fails to outperform PLAIN on the harsh evaluation metric of segmentation accuracy. On average, the pooled mode is 15% less accurate than individual (which does consistently improve over PLAIN), demonstrating that reducing the grammar's capacity to over generate by determining the dialect before segmenting greatly facilitates disambiguation. Indeed, there is a 94% correlation between the verbosity reduction and accuracy increase going from the pooled to individual mode, indicating that the pooled model is over segmenting as more options for mistakenly identifying segmentable clitics become available across different dialects. This is especially problematic for words like the noun frd, 'individual', which contain highly fertile, analyzable bases within their true base. That is, frd can also be analyzed out of context as a conjunction followed by a verb + f+ rd 'so he responded', where the verbal base is highly fertile, especially since it is identical to the nominal rd, 'response' and thus can participate in a large number of clitic combinations as licensed by three feasible meta paradigms (verbal PV, verbal CV, or nominal). Furthermore, the increased uncertainty caused by greater over generation of the analyzer in pooled mode gives the base length minimization back off more influence. Base length minimization as a disambiguation strategy will always over segment by definition if the analyzer permits it. Thus, low frequency or unknown words like the proper name Awnw, 'Ono', are frequently over segmented, as occurs in all dialects except Doha and MSA, where the leading or trailing sequences of graphemes happen to not be confusable with any viable clitics according to the grammar.
Considering context will be crucial to improving the model's handling of such cases in future work, as the Cairene sentence hy dy mdAm Awnw?, 'Is this Madame Ono?' provides a blatant clue in the title 'Madame', that Awnw is a name and need not be segmented.
Similarly, the Beiruti sentence, ... AzA btryd 'Please exchange for me this ten dollar [bill] for a single five...' indicates that a noun should follow the numerical modifier xms, 'five', not the proclitic conjunction + f+, 'so'.

Table 5: Segmentation accuracy of DESEG trained on Pooled versus Indiv(idual) dialects/grammars and evaluated on CORPUS6 against the PLAIN input baseline. Seg(mentation) verbosity is the ratio of segmented tokens over gold segmented tokens while accuracy and error reduction (ER) are reported as percentages.
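The quantities reported in Table 5 can be computed along the following lines. This is a hedged sketch: it assumes whitespace-delimited tokens, and it assumes that error reduction is the standard relative reduction in error with respect to the PLAIN baseline, which the text does not spell out. The function names are hypothetical.

```python
def verbosity(predicted_sentences, gold_sentences):
    """Ratio of predicted tokens to gold tokens over parallel sentence lists."""
    predicted = sum(len(s.split()) for s in predicted_sentences)
    gold = sum(len(s.split()) for s in gold_sentences)
    return predicted / gold


def error_reduction(acc_system, acc_baseline):
    """Relative error reduction in percent; both accuracies are percentages."""
    return 100.0 * (acc_system - acc_baseline) / (100.0 - acc_baseline)


gold = ["b+ yqwl +hA"]
over_segmented = ["b+ yqw +l +hA"]
print(verbosity(over_segmented, gold))  # above 1 signals over segmentation
print(error_reduction(90.0, 80.0))
```

A verbosity above 1 indicates more predicted tokens than gold tokens, which is the over segmentation pattern discussed for the pooled mode.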

Conclusion and Future Work
We present an effective unsupervised means of introducing linguistic information for segmentation that greatly improves performance over other unsupervised systems as evaluated both intrinsically and extrinsically. We target robust handling of rich morphological phenomena and noisy corpora, achieving performance on a multi-dialect Arabic corpus comparable to state-of-the-art supervised systems. The success of our simple system is strong evidence for the value of linguistic input during preprocessing.
In the future, we plan to evaluate our models on natural (uncommissioned) dialectal corpora. We also plan to enhance our de-lexicalized models with non-concatenative components, and we intend to develop models that consider context.