Joint Persian Word Segmentation Correction and Zero-Width Non-Joiner Recognition Using BERT

Words are properly segmented in the Persian writing system; in practice, however, these writing rules are often neglected, resulting in single words being written disjointedly and multiple words written without any white spaces between them. This paper addresses the problems of word segmentation and zero-width non-joiner (ZWNJ) recognition in Persian, which we approach jointly as a sequence labeling problem. We achieved a macro-averaged F1-score of 92.40% on a carefully collected corpus of 500 sentences with a high level of difficulty.


Introduction
People who have worked with real-world data in Persian natural language processing (NLP) are familiar with the problematic properties of the Persian writing system, the most important of which concern word segmentation and the ZWNJ, both consequences of a phenomenon called "orthographic ligature". In Persian, the white space character (U+0020) is used to segment words in a sentence, and the ZWNJ (U+200C) is used between free morphemes in compound words (or combining forms), and also between one or several free morphemes and one or several bound morphemes, to stop joiner letters from connecting into a ligature (Moghaddam, 2007). Although the regulatory body of the Persian language has set rules for the use of white space and ZWNJ (Academy of Persian Language and Literature, 2015), only a few people follow them when writing formal Persian, let alone informal language.
Some common errors regarding the use of white space and ZWNJ are as follows: (1) not segmenting words where the last letter of the first word is a non-joiner character (e.g., vaāmār[e] marg|omir behtar az xeyli|hāst "and the mortality rate is better than many"); (2) using ZWNJ instead of white space, which can be divided into (a) excessive use of ZWNJ (e.g., for separating first and last names) and (b) hitting the ZWNJ key by mistake, as the space key and the ZWNJ key are adjacent on many cell phone keyboards; (3) using space instead of ZWNJ (e.g., mi konam "I do"); (4) using neither of them (e.g., mikonam "I do"); and many other spontaneous errors.
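As a small illustration of error types (3) and (4), the following Python sketch builds the three spellings of mi|konam "I do" in Persian script. The code points are written as escapes so that the otherwise invisible separator is explicit; the variable names and the helper strip_separators are ours and serve only as an illustration.

```python
SPACE = "\u0020"  # ordinary white space
ZWNJ = "\u200c"   # zero-width non-joiner

# "mi" and "konam" in Persian script, written with escapes.
MI = "\u0645\u06cc"
KONAM = "\u06a9\u0646\u0645"

correct = MI + ZWNJ + KONAM      # standard form: ZWNJ between prefix and stem
with_space = MI + SPACE + KONAM  # error type (3): space instead of ZWNJ
joined = MI + KONAM              # error type (4): neither separator


def strip_separators(text: str) -> str:
    """Remove every white space and ZWNJ character from text."""
    return text.replace(SPACE, "").replace(ZWNJ, "")


variants = [correct, with_space, joined]
```

The three variants differ only in the separator, so they collapse to the same string once white space and ZWNJ are stripped.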
In comparison to European languages, such as English and German, word segmentation in Asian languages like Japanese, Chinese, and Thai (Aroonmanakun, 2002; Tesprasit et al., 2003) is more complicated, because space is not specifically used as an orthographic word boundary delimiter (Khan et al., 2018b). For instance, in Vietnamese, space is used not only to separate words but also to separate the syllables that make up words (a syllable can be a meaningful word on its own or a part of a multi-syllable word) (Huyen et al., 2008). The problem in Persian is not similar to that of the abovementioned languages, but is very close to that of Urdu (Durrani and Hussain, 2010; Lehal, 2010; Rashid and Latif, 2012; Bin Zia et al., 2018; Khan et al., 2018a), as both are Indo-Iranian languages and their writing systems are derived from the Arabic script. In Persian, words are properly segmented in theory, but these rules are not always followed in practice. Such characteristics of the writing system cause difficulties in text processing, e.g., in sequence labeling. According to previous research, word segmentation can improve the results of other NLP tasks such as information retrieval (Foo and Li, 2004), machine translation (Xu et al., 2004), information extraction (Peng and Dredze, 2016), and dependency parsing (Nguyen, 2018).
In this paper, we address the problems of word segmentation and ZWNJ correction in Persian using sequence labeling models and achieve results that are quite promising and could pave the way for an effective solution for real-world situations. To verify this, we gathered a corpus of 500 sentences with a high degree of difficulty regarding the correctness of word segmentation and ZWNJs, and evaluated the models' performance on it. After reviewing the related work in §2, we discuss the data used for training, validating, and testing the models in §3, the methodology and the experimental settings in §4, and the results in §5. We then conclude the paper and suggest some ideas for future work.

Related Work
In Persian, few studies have been conducted on preprocessing toolkits, some of which also include modules for space and ZWNJ correction. As an early work, Shamsfard et al. (2009) designed a Persian text preprocessing tool called STeP-1 to tokenize texts, check spellings, and analyze words morphologically. To design the tokenizer, they use dictionary-based and rule-based methods to recognize the correct places of space and ZWNJ, and report a performance of 86.6%. As another work, Sarabi et al. (2013) design a toolkit for Persian text processing at four levels: lexical, morphological, syntactic, and semantic. Space and ZWNJ positions are checked at the first level using dictionary-based and rule-based methods, achieving a precision of 95%. Additionally, Hazm, an open-source preprocessing library, performs ZWNJ correction using regular expressions and a list of valid stems. Parsivar (Mohtaj et al., 2018), the most powerful Persian preprocessing tool, applies rule-based and statistical methods to determine the correct places of ZWNJ and space, respectively. They trained a naive Bayes model on the 10 million word Bijankhan corpus (Bijankhan et al., 2011) with the IOB tagging scheme to find word boundaries, achieving an F1-score of 96.5%. As for the latest work, Panahandeh and Ghanbari (2019) use an N-gram language model and a rule-based method to correct space and ZWNJ between compound words, respectively. They report an F1-score of 81.94% for space correction.

Data
The 10 million word Bijankhan corpus (Bijankhan et al., 2011) was used in this research. It is a cleanly tokenized corpus with fine-grained part-of-speech tags (which were not used here) and contains 10,437,194 words or 38,971,131 characters. We chose this corpus because (a) its word segmentation is clean and its approach is practical, (b) the size of the corpus is enormous (∼39M characters), and (c) the corpus covers many different topics, including news articles, literary prose and poems, informal dialogues, etc. All white spaces within a single token were converted to ZWNJs, and the characters in the corpus were normalized as well. Then, the corpus was split into three parts: the first 10% for testing, the second 10% for validation, and the rest for training. We also collected a test corpus of 500 sentences from Twitter, news broadcasting websites, and discussion forums to have real-world data with real white space and ZWNJ errors. This corpus was deliberately designed to be difficult, with numerous extreme cases (as well as easy and completely correct ones). It contains 16,574 words or 93,355 characters.
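The two preprocessing steps above can be sketched in Python as follows. This is a minimal illustration and not the exact procedure used: the function names are ours, and we assume here that the split is made over a list of sentences, which the description above does not specify.

```python
ZWNJ = "\u200c"


def join_token(token: str) -> str:
    """Replace white space inside a single token with ZWNJ."""
    return ZWNJ.join(token.split())


def split_corpus(sentences):
    """Split in corpus order: first 10% test, next 10% validation, rest training.

    Assumption: the unit of splitting is the sentence.
    """
    cut = len(sentences) // 10
    test = sentences[:cut]
    valid = sentences[cut:2 * cut]
    train = sentences[2 * cut:]
    return train, valid, test
```

For example, join_token("mi konam") returns the token with a ZWNJ in place of the space, and a list of 100 sentences yields 80 training, 10 validation, and 10 test sentences.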

Methodology
We treat the word segmentation correction and the ZWNJ recognition tasks as one problem and train a single model to perform these two tasks jointly. We approached the task as a sequence labeling problem, i.e., mapping each character to a tag space of size 3. The tag is 0 when there is no white space or ZWNJ, 1 when there is a white space, and 2 when there is a ZWNJ after the corresponding character. Two types of models were trained for this purpose:
1. CRF: a conditional random field (CRF) model (Lafferty et al., 2001) implemented using sklearn-crfsuite (Korobov, 2015; Okazaki, 2007). The input features were the focus character, the 5 previous and 5 following characters, and four character-based Boolean features indicating whether the focus character is the first character of the sentence, whether it is the last, whether it is a joiner, and whether it is a digit. All the white space and ZWNJ characters were stripped from the input texts. The L1 and L2 regularization coefficients were set to 0.1 and the maximum number of iterations to 100.

2. BERT: the main bidirectional encoder representations from Transformers (BERT) model (Devlin et al., 2018), plus a fully-connected network mapping to the tag space. The learning rate was set to 2e-5 and the batch size to 10. Adam (Kingma and Ba, 2014) was used for optimizing the weights with cross-entropy as the loss function. As for the pre-trained weights, the multilingual cased model was used. We followed the recommended settings for sequence labeling, which is to calculate the loss only on the first part of each tokenized word. The implementation was done using the PyTorch (Paszke et al., 2019) and HuggingFace's Transformers (Wolf et al., 2019) libraries.
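The tagging scheme and the character-window features described for the CRF can be sketched in Python as follows. This is an illustrative reading of the description, not the original implementation: the function names, the feature names, and the set NON_JOINERS (the Arabic-script letters that do not connect to the following letter) are our assumptions, and treating every other character as a joiner is a simplification.

```python
SPACE, ZWNJ = " ", "\u200c"
TAG_OF = {SPACE: 1, ZWNJ: 2}
CHAR_OF = {1: SPACE, 2: ZWNJ}
# Assumed set of non-joiner letters: alef, alef madda, dal, zal, re, ze, zhe, vav.
NON_JOINERS = frozenset("\u0627\u0622\u062f\u0630\u0631\u0632\u0698\u0648")


def encode(text):
    """Strip separators and return (chars, tags).

    tags[i] is 0, 1 or 2: nothing, white space or ZWNJ after chars[i].
    """
    chars, tags = [], []
    for ch in text:
        if ch in TAG_OF:
            if chars:                  # ignore separators before the first character
                tags[-1] = TAG_OF[ch]  # in a run of separators the last one wins
        else:
            chars.append(ch)
            tags.append(0)
    return chars, tags


def decode(chars, tags):
    """Rebuild the text from characters and their tags."""
    return "".join(ch + CHAR_OF.get(tag, "") for ch, tag in zip(chars, tags))


def char_features(chars, i, window=5, non_joiners=NON_JOINERS):
    """Feature dictionary for the focus character chars[i]."""
    feats = {
        "is_first": i == 0,
        "is_last": i == len(chars) - 1,
        "is_joiner": chars[i] not in non_joiners,
        "is_digit": chars[i].isdigit(),
    }
    # The focus character (offset 0) plus up to `window` neighbours on each side.
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(chars):
            feats[f"char[{offset}]"] = chars[j]
    return feats
```

With the transliterated input "mi|konam ast" (where | stands for ZWNJ), encode returns the ten characters with the tags 0 2 0 0 0 0 1 0 0 0, and decode restores the original string.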
The input texts were fed into the model in two different settings:
(a) All the white space and ZWNJ characters were stripped from the input texts, similar to the CRF model above. The model has to figure out the position of white space and ZWNJ characters in sentences from scratch. We refer to this model as BERT_a.
(b) The white space and ZWNJ characters were kept, but some noise was introduced to the input data in the following manner for each sentence in the Bijankhan corpus, where l is the length of the sentence (in characters):
i. r_1 × l of the ZWNJs were changed to white space characters, where r_1 is a uniform random variable on (0, 0.15).
ii. r_2 × l of the white spaces after non-joiner characters were removed, where r_2 is a uniform random variable on (0, 0.2).
iii. r_3 × l of the remaining characters were changed in the following manner, where r_3 is a uniform random variable on (0, 0.05):
A. replace(c, rand(null, ZWNJ)), where the randomly chosen character c was a white space;
B. replace(c, rand(null, space)), where the randomly chosen character c was a ZWNJ;
C. replace(c, concat(c, rand(ZWNJ, space))), where the randomly chosen character c was followed by neither a white space nor a ZWNJ; otherwise, the following white space or ZWNJ was removed.
The noise ranges were chosen based on our observation of real-world errors; hence this scenario is closer to real-world situations. The corresponding output tags of the model for the input white space and ZWNJ characters were masked and ignored in calculating the loss and performance measures. We refer to this model as BERT_b.
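The noise procedure of setting (b) can be approximated by the following Python sketch. It is a simplified reading of steps i to iii and not the original code: the sentence is represented as characters plus the separator following each character, the counts are truncated to integers and capped by the number of candidate positions, and step iii is applied to positions rather than to raw characters. All names, including NON_JOINERS, are our assumptions.

```python
import random

SPACE, ZWNJ = " ", "\u200c"
# Assumed set of non-joiner letters: alef, alef madda, dal, zal, re, ze, zhe, vav.
NON_JOINERS = frozenset("\u0627\u0622\u062f\u0630\u0631\u0632\u0698\u0648")


def split_separators(text):
    """Return (chars, seps); seps[i] is "", SPACE or ZWNJ after chars[i]."""
    chars, seps = [], []
    for ch in text:
        if ch in (SPACE, ZWNJ):
            if chars:
                seps[-1] = ch
        else:
            chars.append(ch)
            seps.append("")
    return chars, seps


def add_noise(text, rng, non_joiners=NON_JOINERS):
    """Corrupt only the white space and ZWNJ characters of one sentence."""
    chars, seps = split_separators(text)
    length = len(text)  # l: sentence length in characters

    def pick(candidates, max_rate):
        # Draw r uniformly, then choose about r * l of the candidate positions.
        k = int(rng.uniform(0.0, max_rate) * length)
        return rng.sample(candidates, min(k, len(candidates)))

    # i. change some ZWNJs to white space
    for i in pick([i for i, s in enumerate(seps) if s == ZWNJ], 0.15):
        seps[i] = SPACE
    # ii. remove some white spaces that follow a non-joiner character
    for i in pick([i for i, s in enumerate(seps)
                   if s == SPACE and chars[i] in non_joiners], 0.20):
        seps[i] = ""
    # iii. random edits: switch the separator to one of the two other options
    for i in pick(list(range(len(chars) - 1)), 0.05):
        seps[i] = rng.choice([s for s in ("", SPACE, ZWNJ) if s != seps[i]])
    return "".join(c + s for c, s in zip(chars, seps))
```

Because only the separators are modified, stripping white space and ZWNJ from the noisy sentence always gives back the same character sequence as the clean sentence, which is what allows the clean separators to serve as target tags.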
We also experimented with the Parsivar tool, introduced in §2, to rectify white spaces and ZWNJs using its Normalizer (with the statistical space correction argument set to True) and SpellCheck modules. As the latter module might add or remove characters other than white space and ZWNJ, the differences between the two strings were found using Python's difflib library, and placeholder characters were added where needed to make the strings the same length and make their comparison possible. The performance measures in this work are the precision, recall, and F1-score for each class, and the macro-averaged F1-score over all classes.
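The alignment step and the performance measures can be sketched in Python as follows. This is a minimal illustration under our own assumptions: the placeholder character PAD and the function names are hypothetical, and the padding strategy is one possible way of equalizing the two strings with difflib, not necessarily the one used in our experiments.

```python
from difflib import SequenceMatcher

PAD = "\u25a1"  # hypothetical placeholder, assumed not to occur in the texts


def align(a, b, pad=PAD):
    """Pad a and b to equal length so that their matching blocks line up."""
    out_a, out_b = [], []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for _, i1, i2, j1, j2 in matcher.get_opcodes():
        width = max(i2 - i1, j2 - j1)
        out_a.append(a[i1:i2].ljust(width, pad))
        out_b.append(b[j1:j2].ljust(width, pad))
    return "".join(out_a), "".join(out_b)


def macro_f1(gold, pred, classes=(0, 1, 2)):
    """Return ({class: (precision, recall, f1)}, macro-averaged F1)."""
    scores = {}
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[c] = (precision, recall, f1)
    macro = sum(f1 for _, _, f1 in scores.values()) / len(classes)
    return scores, macro
```

The macro average gives the three classes equal weight, so the comparatively rare ZWNJ class counts as much as the frequent 0 class.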

Results and Analysis
Table 1 shows the results of the abovementioned methods on the test set of the Bijankhan corpus and on the 500-sentence corpus. The BERT models outperform the other methods by a large margin. On the 500-sentence corpus, BERT_b ranks first, with an F1-score 1.33% higher than BERT_a, 28.65% higher than the CRF, 26.73% higher than Parsivar, and 9.55% higher than the baseline. The baseline simply measures the correctness of word segmentation and ZWNJs in the uncorrected 500-sentence corpus (i.e., the naturally occurring errors in the real-world data) when compared to its corrected counterpart. The results also show that Parsivar and the CRF model do not improve the correctness of word segmentation and ZWNJs; instead, they add more errors to the data.

Table 1: The F1-score for each class and the macro-averaged F1-score over all classes for each of the methods on the Bijankhan test set and the 500-sentence corpus. The best result in each column is in bold.

The errors in BERT_a and BERT_b are more or less similar and can be categorized into the following groups: (a) ZWNJ before an intrusive y between a vowel and an ezafe (e.g., xāney[e] "[the] house of"), which is simply because this writing style is not used in the Bijankhan corpus (and is also fixable, as words with ezafe are labeled as GEN in the Bijankhan corpus); (b) underscores and hash characters in hashtags, which is again because the model has not seen them in the training data; (c) out-of-vocabulary words, such as tu'iter "twitter" and dāmāš "damash (a soccer team name)"; (d) digits occasionally sticking to the word before or after them; (e) informal words, such as the clitic e "is", which is the short form of ast "is"; (f) words or phrases with different syntactic roles, such as xaridāri|šode "bought (adjective)" and xaridārišode "bought (verb)"; (g) typos and uncommon spellings. The confusion matrix also reveals that ZWNJ is mostly mistaken for the 0 class, and not for the 1 class, which results in joiner characters connecting into a ligature and makes the text difficult to read. There are also some cases that cannot strictly be counted as errors, but rather as different writing styles. Unfortunately, our performance measure does not account for these more or less common cases. Some examples are pul|dār "rich", doruq|gu "liar", and šaffāf|sāzi "clarification". All in all, adding more training samples with the abovementioned features would most probably resolve many of the mentioned error groups.

Conclusion and Future Work
In this paper, we experimented with different methods to tackle the word segmentation correction and zero-width non-joiner recognition problems in Persian. The results on our collected data show BERT outperforming the other methods, and the error analysis indicates that it would be relatively easy to increase the performance and pave the way for a practical and effective preprocessing tool. Future work could focus on collecting more informal Persian data, more diverse topics, and more modern registers of the language, to further improve the results of this work. There are also some techniques in the work mentioned in §2, e.g., multi-task learning (say, for the model to learn the difference between joiner and non-joiner characters as an auxiliary task), that could be used in Persian as well. Covering different writing styles in the training data would also be helpful, e.g., changing some of the ending e[ye] clitics in the Bijankhan corpus, when the word is tagged as GEN, to ey[e], as discussed in §5.