Farasa: A Fast and Furious Segmenter for Arabic

In this paper, we present Farasa, a fast and accurate Arabic segmenter. Our approach is based on SVM-rank using linear kernels. We measure the performance of the segmenter in terms of accuracy and efficiency in two NLP tasks, namely Machine Translation (MT) and Information Retrieval (IR). Farasa outperforms or is on par with the state-of-the-art Arabic segmenters (Stanford and MADAMIRA), while being more than one order of magnitude faster.


Introduction
Word segmentation/tokenization is one of the most important pre-processing steps for many NLP tasks, particularly for a morphologically rich language such as Arabic. Arabic word segmentation involves breaking words into their constituent prefix(es), stem, and suffix(es). For example, the word "wktAbnA" (gloss: "and our book") is composed of the prefix "w" (and), the stem "ktAb" (book), and the possessive pronoun "nA" (our). The task of the tokenizer is to segment the word into "w+ ktAb +nA". Segmentation has been shown to have a significant impact on NLP applications such as MT and IR.
We introduce a new segmenter, Farasa ("insight" in Arabic), an SVM-based segmenter that uses a variety of features and lexicons to rank possible segmentations of a word. The features include: likelihoods of stems, prefixes, suffixes, and their combinations; presence in lexicons containing valid stems or named entities; and underlying stem templates.
We carried out extensive tests comparing Farasa with two state-of-the-art segmenters, MADAMIRA (Pasha et al., 2014) and the Stanford Arabic segmenter (Monroe et al., 2014), on two standard NLP tasks, namely MT and IR. The comparisons were done in terms of accuracy and efficiency. We trained Arabic↔English Statistical Machine Translation (SMT) systems using each of the three segmenters. Farasa performs clearly better than Stanford's segmenter and is on par with MADAMIRA in terms of BLEU (Papineni et al., 2002). On the IR task, Farasa outperforms both with statistically significant improvements. Moreover, we observed Farasa to be at least an order of magnitude faster than both. Farasa also performs slightly better than the two in an intrinsic evaluation. Farasa has been made freely available at http://alt.qcri.org/tools/farasa/.

Farasa
Features: In this section we introduce the features and lexicons that we used for segmentation. For any given word (out of context), all possible character-level segmentations are found, and those leading to a sequence of the form $\mathit{prefix}_1 + \dots + \mathit{prefix}_n + \mathit{stem} + \mathit{suffix}_1 + \dots + \mathit{suffix}_m$ are retained, where $\mathit{prefix}_1, \dots, \mathit{prefix}_n$ are valid prefixes, $\mathit{suffix}_1, \dots, \mathit{suffix}_m$ are valid suffixes, and the prefix and suffix sequences are legal. Our valid prefixes are: f, w, l, b, k, Al, and s. Our valid suffixes are: A, p, t, k, n, w, y, At, An, wn, wA, yn, kmA, km, kn, h, hA, hmA, hm, hn, nA, tmA, tm, and tn. Using these prefixes and suffixes, we generated a list of valid prefix and suffix sequences. For example, a sequence in which a coordinating conjunction (w or f) precedes a preposition (b, l, k), which in turn precedes a determiner (Al), is legal, as in the word fbAlktAb (gloss: "and in the book"), which is segmented to f+b+Al+ktAb. Conversely, a determiner is not allowed to precede any other prefix. We used the following features:
-Leading Prefixes: conditional probability that a leading character sequence is a prefix.
-Trailing Suffixes: conditional probability that a trailing character sequence is a suffix.
-LM Prob (Stem): unigram probability of the stem based on a language model that we trained on a corpus containing over 12 years' worth of articles from Aljazeera.net (2000 to 2011). The corpus is composed of 114,758 articles containing 94 million words.
-LM Prob: unigram probability of stem with first suffix.
-Suffix|Prefix: probability of suffix given prefix.
-Stem Template: whether a valid stem template can be obtained from the stem. Stem templates are patterns that transform an Arabic root into a stem. For example, applying the template CCAC to the root "ktb" produces the stem "ktAb" (meaning: book). To find stem templates, we used the module described in Darwish et al. (2014).
-Stem Lexicon: whether the stem appears in a lexicon of automatically generated stems. This can help identify valid stems. This list is generated by placing roots into stem templates to generate a stem, which is retained if it appears in the aforementioned Aljazeera corpus.
-Gazetteer Lexicon: whether a stem that has no trailing suffixes appears in a gazetteer of person and location names. The gazetteer was extracted from Arabic Wikipedia in the manner described by Darwish et al. (2012), and we retained only word unigrams.
-Function Words: whether the stem is a function word such as "ElY" (on) and "mn" (from).
-AraComLex: whether the stem appears in the AraComLex Arabic lexicon, which contains 31,753 stems of which 24,976 are nouns and 6,777 are verbs (Attia et al., 2011).
-Buckwalter Lexicon: whether the stem appears in the Buckwalter lexicon as extracted from the AraMorph package (Buckwalter, 2002).
-Length Difference: difference in length from the average stem length.
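The candidate generation step that precedes feature extraction can be sketched as follows. This is a minimal illustration, not the Farasa implementation: the prefix and suffix inventories are those listed above, but the slot ordering in `PREFIX_SLOT` (in particular placing "s" in the same slot as the prepositions) and the absence of any legality filter on suffix sequences are assumptions made here for brevity. All function names are hypothetical.

```python
# Valid affixes as listed in the text (Buckwalter transliteration).
PREFIXES = ["f", "w", "l", "b", "k", "Al", "s"]
SUFFIXES = ["A", "p", "t", "k", "n", "w", "y", "At", "An", "wn", "wA", "yn",
            "kmA", "km", "kn", "h", "hA", "hmA", "hm", "hn", "nA", "tmA",
            "tm", "tn"]

# ASSUMPTION: prefixes must appear in strictly increasing slot order
# (conjunction < preposition < determiner). Putting "s" in slot 1 is a
# simplification of ours, not something stated in the paper.
PREFIX_SLOT = {"w": 0, "f": 0, "b": 1, "l": 1, "k": 1, "s": 1, "Al": 2}


def prefix_sequences(word):
    """Yield (prefix_list, remainder) for every legal leading prefix sequence."""
    def rec(rest, seq, last_slot):
        yield seq, rest
        for p in PREFIXES:
            if rest.startswith(p) and PREFIX_SLOT[p] > last_slot:
                yield from rec(rest[len(p):], seq + [p], PREFIX_SLOT[p])
    yield from rec(word, [], -1)


def suffix_sequences(rest):
    """Yield (stem, suffix_list) for every way of peeling valid suffixes off the end."""
    def rec(stem, seq):
        yield stem, seq
        for s in SUFFIXES:
            if stem.endswith(s):
                yield from rec(stem[:-len(s)], [s] + seq)
    yield from rec(rest, [])


def candidate_segmentations(word):
    """Return all candidates of the form prefixes + stem + suffixes with a non-empty stem."""
    out = set()
    for prefs, rest in prefix_sequences(word):
        for stem, sufs in suffix_sequences(rest):
            if stem:
                out.add("+".join(prefs + [stem] + sufs))
    return sorted(out)


if __name__ == "__main__":
    for seg in candidate_segmentations("wktAbnA"):
        print(seg)
```

Each candidate returned by such a generator would then be turned into a feature vector using the features listed above; the unsegmented word is always among the candidates.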
Learning: We constructed feature vectors for each possible segmentation and marked the correct segmentation for each word. We then used SVM-Rank (Joachims, 2006) to learn feature weights. We used a linear kernel with a trade-off factor between training errors and margin (C) equal to 100, a value chosen based on offline experiments on a dev set. At test time, all possible segmentations with valid prefix-suffix combinations are generated and scored using the classifier. We built two variants of Farasa. In the first, Farasa Base, the classifier is used to segment all words directly. It also uses a small lookup list of concatenated stop-words in which the letter "n" is dropped, such as "EmA" ("En+mA") and "mmA" ("mn+mA"). In the second, Farasa Lookup, segmentations seen during training are cached, and classification is applied only to words that were unseen during training. The cache includes words that have only one segmentation in the training data, and words that appear 5 or more times with one segmentation accounting for more than 70% of the occurrences.
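The caching rule of Farasa Lookup and the fallback to the linear ranker can be sketched as below. This is an illustrative sketch under stated assumptions rather than the actual implementation: the function names, the dictionary-based feature representation, and the `candidates` and `featurize` callables are all hypothetical. The only elements taken from the text are the two cache conditions (a single observed segmentation, or at least 5 occurrences with one segmentation covering more than 70% of them) and the fact that a linear-kernel ranker scores a candidate by a weighted sum of its features.

```python
from collections import Counter, defaultdict


def build_lookup_cache(training_pairs, min_count=5, min_share=0.7):
    """Build the word -> segmentation cache from (word, segmentation) training pairs."""
    counts = defaultdict(Counter)
    for word, seg in training_pairs:
        counts[word][seg] += 1

    cache = {}
    for word, segs in counts.items():
        best_seg, best_n = segs.most_common(1)[0]
        total = sum(segs.values())
        unambiguous = len(segs) == 1
        dominant = total >= min_count and best_n / total > min_share
        if unambiguous or dominant:
            cache[word] = best_seg
    return cache


def linear_score(weights, features):
    """Score of a candidate under a linear model: dot product of weights and features."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def segment(word, cache, candidates, featurize, weights):
    """Return the cached segmentation if available, else the top-ranked candidate.

    `candidates(word)` and `featurize(word, seg)` are hypothetical callables
    standing in for candidate generation and feature extraction.
    """
    if word in cache:
        return cache[word]
    return max(candidates(word),
               key=lambda seg: linear_score(weights, featurize(word, seg)))
```

Because cached words bypass candidate generation and scoring entirely, frequent words cost a single dictionary lookup.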
Training and Testing: For training, we used parts 1 (version 4.1), 2 (version 3.1), and 3 (version 2) of ... where both were configured to segment all possible affixes. We did not compare to Stanford, because it only segments based on the ATB segmentation scheme. Farasa Lookup performs slightly better than MADAMIRA. From analyzing the errors in Farasa, we found that most of the errors were due either to foreign named entities such as "lynks" (meaning: Linux) and "bAlysky" (meaning: Palisky), or to long words with more than four segments such as "wlmfAj}thmA" ("w+l+mfAj}+t+hmA") (meaning: "and to surprise both of them"). Adding larger gazetteers of foreign names might help reduce the first kind of error. For the second kind, the classifier generates the correct segmentation, but it often receives a slightly lower score than the incorrect segmentation. Adding more features might help correct such errors.

Machine Translation
Setup: We trained Statistical Machine Translation (SMT) systems for Arabic↔English to compare Farasa with Stanford and MADAMIRA. The comparison was done in terms of BLEU (Papineni et al., 2002) and processing times. We used a concatenation of IWSLT TED talks (Cettolo et al., 2014) ... (202K sentences) to train phrase-based systems.
Systems: We used Moses (Koehn et al., 2007), a state-of-the-art toolkit, with the settings described in Durrani et al. (2014a): these include a maximum sentence length of 80, Fast-Aligner for word alignments (Dyer et al., 2013), an interpolated Kneser-Ney smoothed 5-gram language model with KenLM (Heafield, 2011) used at runtime, MBR decoding (Kumar and Byrne, 2004), and Cube Pruning (Huang and Chiang, 2007) using a stack size of 1,000 during tuning and 5,000 during testing. We tuned with k-best batch MIRA (Cherry and Foster, 2012). Among other features, we used a lexicalized reordering model (Galley and Manning, 2008), a 5-gram Operation Sequence Model (Durrani et al., 2011), Class-based Models (Durrani et al., 2014b), and other default parameters. We used an unsupervised transliteration model (Durrani et al., 2014c) to transliterate the OOV words. We used the standard tune and test sets provided by the IWSLT shared task to evaluate the systems.
In each experiment, we simply changed the segmentation pipeline to try a different segmenter. We used the ATB scheme for MADAMIRA, which has previously been shown to outperform its alternatives (S2 and D3) (Sajjad et al., 2013).
Results: Table 2 compares the Arabic-to-English SMT systems using the three segmentation tools. Farasa performs better than Stanford's Arabic segmenter, giving an improvement of +0.25 BLEU, but slightly worse than MADAMIRA (-0.10). The differences are not statistically significant. For efficiency, Farasa is faster than Stanford and MADAMIRA by a factor of 5 and 50 respectively. The run-time of MADAMIRA makes it cumbersome to run on bigger corpora like the multiUN (UN) (Eisele and Chen, 2010), which contains roughly 4M sentences. This factor becomes even more daunting when training a segmented target-side language model for an English-to-Arabic system. Table 3 shows results for the English-to-Arabic system. In this case, Stanford performs significantly worse than the others. MADAMIRA performs slightly better than Farasa. However, as before, Farasa is multiple orders of magnitude faster.
Table 3: English-to-Arabic Machine Translation, BLEU scores and Time (in seconds)

Information Retrieval
Setup: We also used an extrinsic IR evaluation to determine the quality of stemming compared to MADAMIRA and the Stanford segmenter. We performed experiments on the TREC 2001/2002 cross-language track collection, which contains 383,872 Arabic newswire articles (59.6 million words) and 75 topics with their relevance judgments (Oard and Gey, 2002). This is presently the best available large Arabic information retrieval test collection. We used Mean Average Precision (MAP) and precision at 10 (P@10) as the measures of goodness for this retrieval task. Going down a retrieved ranked list from the top, Average Precision (AP) is the average of the precision values computed at every relevant document found; MAP is the mean of AP over all topics. P@10 is the precision computed over only the top 10 results of the ranked list. We used SOLR (ver. 5.6) to perform all experimentation. SOLR uses a tf-idf ranking model. We used a paired 2-tailed t-test with a p-value less than 0.05 to ascertain statistical significance. For all experimental setups, we performed letter normalization, where we conflated: variants of "alef", "ta marbouta" and "ha", "alef maqsoura" and "ya", and the different forms of "hamza".
Results: Table 4 summarizes the retrieval results for using words without stemming and for using MADAMIRA, Stanford, and Farasa for stemming. The table also indicates statistical significance and reports the processing time that each of the segmenters took to process the entire document collection. As can be seen from the results, Farasa significantly outperformed unstemmed words, MADAMIRA, and Stanford. Farasa was an order of magnitude faster than Stanford and two orders of magnitude faster than MADAMIRA.
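The two retrieval measures used above can be made concrete with a short sketch. The function names and the toy document IDs are hypothetical. One detail is an assumption on our part: AP is normalised by the total number of relevant documents for the topic, so relevant documents that are never retrieved contribute zero, which is the usual convention in TREC-style evaluation.

```python
def average_precision(ranked, relevant):
    """AP: mean of the precision values at the rank of each relevant document.

    ASSUMPTION: normalised by the total number of relevant documents, so
    relevant documents missing from `ranked` count as zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """MAP: mean of AP over all topics; `runs` is a list of (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def precision_at_k(ranked, relevant, k=10):
    """P@k: fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


if __name__ == "__main__":
    ranked = ["d1", "d2", "d3", "d4"]   # toy ranked list for one topic
    relevant = {"d1", "d3"}             # toy relevance judgments
    print(average_precision(ranked, relevant))
    print(precision_at_k(ranked, relevant, k=2))
```

AP rewards placing relevant documents early anywhere in the list, whereas P@10 only looks at the first page of results, which is why the two measures are reported side by side.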

Analysis
The major advantage of using Farasa is speed, without loss in accuracy. This mainly results from the optimizations described earlier in Section 2, which include caching and limiting the context used for building the feature vectors. The Stanford segmenter uses a third-order (i.e., 4-gram) Markov CRF model (Green and DeNero, 2012) to predict the correct segmentation. MADAMIRA, on the other hand, bases its segmentation on the output of a morphological analyzer, which provides a list of possible analyses (independent of context) for each word. Both the text and the analyses are passed to a feature modeling component, which applies SVM and language models to derive predictions for the word segmentation (Pasha et al., 2014). This hierarchy could explain the slowness of MADAMIRA compared with the other tokenizers.

Conclusion
In this paper we introduced Farasa, a new Arabic segmenter, which uses SVM for ranking. We compared our segmenter with the state-of-the-art segmenters MADAMIRA and Stanford on standard MT and IR tasks and demonstrated Farasa to be significantly better (in terms of accuracy) than both on the IR tasks and on par with MADAMIRA on the MT tasks. We found Farasa to be orders of magnitude faster than both. Farasa has been made available for use and will be added to Moses for Arabic tokenization.