KLPT – Kurdish Language Processing Toolkit

Despite recent advances in applying language-independent approaches to various natural language processing tasks thanks to artificial intelligence, some language-specific tools are still essential to process a language in a viable manner. Kurdish is a less-resourced language with remarkable diversity in dialects and scripts, and it lacks basic language processing tools. To address this issue, we introduce a language processing toolkit that handles this diversity efficiently. Our toolkit is composed of fundamental components such as text preprocessing, stemming, tokenization, lemmatization and transliteration, and can be further extended by future developers. The project is publicly available.


Introduction
Language technology is an increasingly important field in our information era, and it depends on our knowledge of human language and on computational methods to process it. Unlike the latter, which undergo constant progress as new methods and more efficient techniques are invented, the processability of human languages does not evolve at the same pace. This is particularly the case for languages with scarce resources and limited grammars, also known as less-resourced languages.
Various natural language processing (NLP) tasks are of pipeline architecture; that is, to address a specific task, a few other language processing tasks may be initially required (Manning et al., 2014). With the current advances in the open-source movements, more researchers and industrial developers are encouraged to share their knowledge in an open-source manner, accessible under certain conditions (Ljungberg, 2000). Therefore, the development of underlying tasks in NLP for a specific language will potentially pave the way for further contributions to the field, by either improving the current tools or further progress in new tasks. For instance, tokenization as a fundamental task is widely required in many other applications such as part-of-speech tagging, machine translation and syntactic analysis. Once addressed, future researchers can build upon it for more advanced tasks or eventually improve it.
Despite a plethora of performant tools and specific frameworks for NLP, such as NLTK (Loper and Bird, 2002), Stanza (Qi et al., 2020), Teanga (Ziad et al., 2018) and spaCy, progress with respect to less-resourced languages is often hindered not only by the lack of basic tools and resources but also by the limited accessibility of previous studies under an open-source licence. This is particularly the case for Kurdish, a less-resourced Indo-European language that is the focus of the current paper. As an example, although the tasks of spell-checking and stemming for Kurdish have been addressed by many previous studies (Jaf and Ramsay, 2014; Salavati and Ahmadi, 2018; Mustafa and Rashid, 2018; Saeed et al., 2018a; Hawezi et al., 2019), to mention but a few, none of them provides an implementation of their tool under any licence.
On the other hand, some previous studies use specific frameworks that are difficult to integrate and inter-operate with. For instance,  and  describe their efforts in developing a large-scale morphological lexicon and a part-of-speech tagger for Kurdish within the Alexina framework under the LGPL-LR licence. Despite the valuable impact of this study in the field, for example in (Cotterell et al., 2017) and (Gökırmak and Tyers, 2017), the tool does not seem to be widely used in subsequent projects.
As such, projects such as (Jaf and Ramsay, 2014) and (Ahmadi and Hassani, 2020a) tackle the very same topic from scratch. Language-specific toolkits have been previously designed for various languages, such as IceNLP for Icelandic (Loftsson and Rögnvaldsson, 2007), VnCoreNLP for Vietnamese (Vu et al., 2018), FudanNLP for Chinese (Qiu et al., 2013), PSI-Toolkit for Polish (Graliński et al., 2013) and ParsiPardaz for Persian (Sarabi et al., 2013). In the same vein, we present KLPT, the Kurdish language processing toolkit. Our aim is to facilitate the basic language processing tasks for Kurdish in an organized and methodical way, in view of the increasing importance of open-source and inter-operable tools for building more efficient systems and advancing the field. The toolkit is developed in Python, is composed of core modules and can be extended by future developers.

Kurdish Language
Kurdish belongs to the Northwestern branch of the Iranian languages within the Indo-European language family and is spoken by 20-30 million speakers in the Kurdish regions of Turkey, Iraq, Iran and Syria, as well as among the Kurdish diaspora around the world (Ahmadi et al., 2019). The division of Kurdish into Northern Kurdish (or Kurmanji), Central Kurdish (or Sorani), Southern Kurdish and Laki, respectively with kmr, ckb, sdh and lki ISO 639-3 language codes, has been widely studied previously (Edmonds, 2013). Based on the structural differences between these, some scholars believe that they are distinct languages and therefore refer to them as Kurdish languages (Kreyenbroek, 2005). On the other hand, it is also commonly believed by both scholars and Kurdish people that they are in fact different dialects of the Kurdish language (Haig and Matras, 2002; Matras, 2017). In this study, we adopt the latter view and refer to them as Kurdish dialects. It is worth mentioning that despite the linguistic similarities of Zazaki, also known as Dimlî, and Gorani to Kurdish, and the popular belief that they are dialects of Kurdish, studies show that they belong to the Zaza-Gorani language family, which is independent from the Kurdish language (Paul, 1998; Jugel, 2014; Ahmadi, 2020c).
Kurdish has been historically written in various scripts, namely Cyrillic, Armenian, Latin and Arabic, among which the latter two are still widely in use. Efforts to standardize the Kurdish alphabets and orthographies have not been adopted by all Kurdish speakers in all regions (Tavadze, 2019; Haig and Matras, 2002; Aydogan, 2012). As such, the Kurmanji dialect is mostly written in the Latin-based script while Sorani, Southern Kurdish and Laki are mostly written in the Arabic-based script. This not only hinders communication among readers and speakers, but also creates further challenges in processing the language (Esmaili, 2012; Ahmadi, 2019). Table 1 provides the Latin-based and Arabic-based Kurdish alphabets used for all the dialects.
Kurdish is a highly inflectional language, particularly due to a high number of affixes and clitics (Ahmadi and Hassani, 2020b). Regarding nouns, although Sorani does not have gender or grammatical cases, it has a full article marking system for definite, indefinite and demonstrative in singular and plural forms (Jugel, 2014). On the other hand, Kurmanji has fewer article markers for feminine and masculine genders (Thackston, 2006 (Traida, 2007). For instance, siław 'hi' (n), pîroz 'holy' (adj) and heł (verbal particle denoting 'up') with the single-word verb kirdin can respectively form the compound verbs siław kirdin "to greet", pîroz kirdin "to congratulate" and heł kirdin "to turn on". The stringing characteristic of the Arabic-based script of Kurdish further adds to this morphological complexity in such a way that several word forms may be concatenated together (Ahmadi, 2020b). Regarding syntax, Kurdish has a subject-object-verb word order and is a null-subject (or pro-drop) language. The presence of grammatical markers for nominative and oblique cases varies within dialects and subdialects. For instance, in the Sorani subdialects of Sulaymaniyah and Erbil, respectively categorized as Southern Sorani and Northern Sorani by (Matras, 2017), the oblique case is marked differently. Another particularity of the Kurdish language is its morphosyntactic alignment in the past tense of transitive verbs. In such tenses, an ergative-absolutive alignment occurs where the subject of intransitive verbs behaves like the patient of the transitive verb in the past (Haig, 1998; Karimi, 2014). Unlike Kurmanji, which uses oblique cases for this purpose, Sorani only uses different pronominal markers to specify ergativity; it is therefore called split-ergative (Esmaili and Salavati, 2013). Outside the past tenses, a nominative-accusative alignment is observed.
Because they are not equally documented and used, Kurdish dialects differ in the language resources available to them. In comparison to Sorani and Kurmanji, which are widely used by the media and press, Southern Kurdish and Laki are under-documented and lack basic language resources such as electronic dictionaries and corpora (Fattah, 2000; Ahmadi et al., 2019; Ahmadi, 2020c).

Current State of Kurdish Language Processing
To assess the current state of Kurdish in natural language processing and computational linguistics, we reviewed the scientific publications that directly address an issue in those fields. A total of 53 publications were collected from widely-used academic databases and search engines such as Google Scholar, and then classified by the sub-fields they discuss, which are illustrated in Figure 1. The Kurdish dialects are not evenly discussed in the previous studies, with Sorani making up a predominant proportion of almost 90%. Although a smaller proportion represents the Kurmanji dialect, no publication is found with respect to the processing of the Southern Kurdish or Laki dialects. Regarding the research focus of the previous works, a range of NLP sub-fields has been addressed, particularly text mining, morphological and syntactic analysis, and the creation of lexical resources. We exceptionally included optical character recognition as it is of importance for converting printed material to electronic forms (Ahmadi et al., 2019). The full list of the surveyed papers can be found in Appendix A.2.
More importantly, we analyze previous publications from the following two perspectives:
• Open-source: Does the paper provide the discussed resource or tool under an open-source license? To this end, we verified the content of the papers and also checked the Web, particularly major hosting platforms for distributed version control systems such as GitHub, GitLab and Bitbucket.

KLPT Architecture
KLPT is implemented in Python and is composed of four core modules with specific tasks. Although we were inspired by the functionality of relevant NLP toolkits, particularly NLTK and spaCy, no external library is used in this toolkit. Regarding the toolkit design, we followed the rules of scientific software development suggested by (Prlić and Procter, 2012) along with common practices in the Python programming language. Figure 3 provides the structure of the toolkit. In order to facilitate the integration of variations specific to dialects and scripts and, more importantly, to avoid hard-coding, required files are provided in the data folder. For instance, the data required for the preprocess module is imported from preprocess.json. In addition, third-party programs can be provided in bin. The test and docs folders respectively contain test cases and project documentation. For the latter, we use the Sphinx documentation generator (https://www.sphinx-doc.org).
It is worth noting that each module within the klpt package has been previously studied and evaluated separately. Our goal is to introduce the functionality of the modules within the toolkit in this section.

Preprocess
Many keyboard layouts are specifically designed for Kurdish, and they assign different character encodings to visually similar graphemes. Together with the usage of non-Kurdish keyboards, such as Arabic, Turkish and Persian keyboards, this diversity creates inconsistencies across Kurdish texts. For instance, the grapheme (î/y) can be represented as (U+064A), (U+0649), (U+FEF2), (U+FEF1) and (U+06CC), among which only the last should be used in the Arabic-based script of Kurdish. Moreover, various writing conventions are used for each dialect and script. For instance, in Kurmanji, when dates are affixed with a morpheme, the suffix may be separated by ', by - or without any marker, as in 2020'an, 2020-an and 2020an.
To remedy such issues in an automatic and structured manner, the preprocess module provides two main functions: normalize(), which normalizes encoding abnormalities by unifying characters in such a way that only one specific encoding is used for each grapheme, and standardize(), which applies orthographic conventions to the text. For example, when hêvî 'hope' is suffixed with the vowel a (Izafa, meaning 'of'), a semi-vowel y appears between the two vowels and the result is usually written as hêviya or hêvîya 'hope of'. As the latter form is considered less ambiguous, this function converts the first form accordingly. Although defining a universal orthography for Kurdish is out of the scope of our project, we believe that writing conventions and orthographies should be addressed to some extent. Therefore, in this initial version, we follow the writing conventions proposed by (Aydogan, 2012) for Kurmanji and (Hashemi, 2016) for Sorani.
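The character unification performed by normalize() can be illustrated with a minimal sketch that uses only the Python standard library. It is not KLPT's implementation: the function name normalize_yeh and the translation table are hypothetical, and only the code points listed above for the grapheme (î/y) are covered.

```python
# Hypothetical sketch: map every listed variant of the grapheme (î/y)
# to the single code point U+06CC used in the Arabic-based script of Kurdish.
YEH_VARIANTS = "\u064a\u0649\ufef2\ufef1"
KURDISH_YEH = "\u06cc"

# str.maketrans accepts a dict of single characters to replacement strings.
NORMALIZE_TABLE = str.maketrans({variant: KURDISH_YEH for variant in YEH_VARIANTS})


def normalize_yeh(text: str) -> str:
    """Return text with all known yeh variants replaced by U+06CC."""
    return text.translate(NORMALIZE_TABLE)


if __name__ == "__main__":
    mixed = "\u064a\u0649\ufef2\ufef1\u06cc"
    print([hex(ord(ch)) for ch in normalize_yeh(mixed)])
```

A translation table keeps the mapping declarative, which is in the spirit of storing such mappings in a data file rather than hard-coding them in the logic.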
In addition to these two functions, unify_numeral() is provided to convert numerals, namely in Farsi (۰۱۲۳۴۵۶۷۸۹)  All these three functions are then invoked within the preprocess() function, which normalizes, standardizes and unifies the text according to the given arguments. The general procedure followed in this module can be summarized as string replacement. For this purpose, we define regular expressions for each dialect and script. The regular expressions along with the character mappings are provided in preprocess.json in such an order that the intended normalization and standardization are carried out correctly. Although this module is not explicitly invoked within other modules, except in the transliterate module, we recommend that users pass the output of the preprocessing module as the input to the other modules.
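The idea of ordered, data-driven string replacement can be sketched as follows. This is an illustration under stated assumptions, not the content of preprocess.json: the JSON layout, the function names and the single date-suffix rule are hypothetical, the choice of Latin digits as the target numerals is an assumption, and the unmarked form 2020an is picked as the target only for the sake of the example.

```python
import json
import re

# Hypothetical rule file, inlined as a string so that the sketch is self-contained.
RULES = json.loads(r"""
{
  "numerals": {
    "from": "\u06f0\u06f1\u06f2\u06f3\u06f4\u06f5\u06f6\u06f7\u06f8\u06f9",
    "to": "0123456789"
  },
  "standardize": [
    ["([0-9]+)['-]an\\b", "\\1an"]
  ]
}
""")

NUMERAL_TABLE = str.maketrans(RULES["numerals"]["from"], RULES["numerals"]["to"])
STANDARDIZE_RULES = [(re.compile(pattern), repl) for pattern, repl in RULES["standardize"]]


def unify_numerals(text: str) -> str:
    """Map Farsi digits to Latin digits (assumed target)."""
    return text.translate(NUMERAL_TABLE)


def standardize(text: str) -> str:
    """Apply the regular expressions in the order given in the rule file."""
    for pattern, repl in STANDARDIZE_RULES:
        text = pattern.sub(repl, text)
    return text


def preprocess(text: str) -> str:
    # Order matters: the date rule matches [0-9] only, so numerals are unified first.
    return standardize(unify_numerals(text))


if __name__ == "__main__":
    for sample in ("\u06f2\u06f0\u06f2\u06f0'an", "2020-an", "2020an"):
        print(preprocess(sample))
```

Because the date rule only matches Latin digits, running standardize() before unify_numerals() would leave the first sample unchanged, which shows why the order of the rules has to be fixed in the data file.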

Transliterate
Given the diversity of the alphabets used in Kurdish, transliteration is a necessity to facilitate communication between speakers and is also beneficial to various NLP tasks, such as named-entity recognition and machine translation. Although Kurdish orthographies are phonemic, i.e. each grapheme is supposed to represent a single phoneme, transliterating characters between the alphabets is more challenging than it appears. This is particularly due to (U+0648) and (U+06CC) in the Arabic-based alphabet which can be respectively mapped to 'u/w' and 'î/y'. For instance, in and is transliterated as bîwir 'axe' and kurt 'short', respectively. Moreover, there is no grapheme in the Arabic-based script for the vowel i, also known as Bizroke "the little furtive", which creates further challenges in the morphological analysis of the language (Ahmadi, 2019).
In this module, we focus on transliterating between the Arabic-based and Latin-based scripts of Kurdish using the WERGOR transliterator (Ahmadi, 2019), available at https://github.com/sinaahmadi/wergor. This tool uses a rule-based approach based on the phonological and syllabic characteristics of Kurdish for distinguishing the double-usage characters, i.e. (U+0648) and (U+06CC), and predicting the placement of i. Although the algorithm efficiently transliterates double-usage characters, its detection of i has been evaluated at a low accuracy of 39%.
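To make the double-usage problem concrete, the following toy heuristic treats (U+0648) and (U+06CC) as vowels after a consonant and as consonants elsewhere. It is a deliberately simplified sketch and not WERGOR's rule set: the function name, the tiny letter table and the single rule are hypothetical, and the Arabic-script spellings of the two example words are assumed from standard Sorani orthography.

```python
# Hypothetical toy table covering only the letters of the two examples.
CONSONANTS = {
    "\u06a9": "k",  # kaf
    "\u0631": "r",  # re
    "\u062a": "t",  # te
    "\u0628": "b",  # be
}
# Double-usage characters: (vowel reading, consonant reading).
DOUBLE_USAGE = {
    "\u0648": ("u", "w"),
    "\u06cc": ("î", "y"),
}
LATIN_VOWELS = set("aeêiîouû")


def transliterate_toy(word: str) -> str:
    """Arabic-based to Latin-based script with one naive context rule."""
    out = []
    for ch in word:
        if ch in DOUBLE_USAGE:
            vowel, consonant = DOUBLE_USAGE[ch]
            after_consonant = bool(out) and out[-1] not in LATIN_VOWELS
            out.append(vowel if after_consonant else consonant)
        else:
            out.append(CONSONANTS.get(ch, ch))
    return "".join(out)


if __name__ == "__main__":
    print(transliterate_toy("\u06a9\u0648\u0631\u062a"))  # assumed spelling of kurt
    print(transliterate_toy("\u0628\u06cc\u0648\u0631"))  # assumed spelling of bîwir
```

The first word comes out as kurt, while the second comes out as bîwr rather than bîwir: the unwritten vowel i is not recovered by such a rule, which is precisely the part of the task reported above as the hardest.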

Stem
Although the task of stemming has been previously addressed in the literature, no viable open-source solution was available for Kurdish. Therefore, we developed morphological rules containing combinations of Kurdish morphemes in Sorani and Kurmanji, and also an annotated lexicon containing lemmas with specific flags such as part-of-speech tags and stems. The morphological rules and the lexicons are then used to develop a morphological analyzer and spell-checker in HUNSPELL (Ooms, 2017) for Kurdish, where they are respectively known as affixes (.aff) and dictionary (.dic). Thanks to the wide usage of HUNSPELL in open-source text editors such as Apache OpenOffice, our development will also be beneficial for general purposes such as spell-checking in text editors. More importantly, we integrate HUNSPELL in KLPT for this module using a wrapper program.
The Stem module comes with two classes: Stem and Spellcheck. Although these two classes focus on two different tasks, they are provided in the same module as they are both based on the same implementation in HUNSPELL. Given a word, the Stem class provides four main functions: stem() for retrieving the stem of a word-form, e.g. kirdin/kirin (do.INF) → kir; lemmatize() for lemmatization, e.g. kirdbûm (do.1SG.PST.PFV) → kirdin; analyze() for morphological analysis, which returns a dictionary containing the flags according to HUNSPELL such as part-of-speech, terminal suffixes and inflectional suffixes; and finally, suffix_suggest(), which returns all the possible suffixes that can appear with a given lexeme. In addition to these, generate(), which generates a word-form given morphemes, will also be added to the module.
On the other hand, the Spellcheck class provides check_spelling() and correct_spelling() which are respectively used for spell checking (Boolean output) and spell correction. For instance, given (xwardûmate), check_spelling() detects that it is incorrectly written and correct_spelling() provides a few suggestions, among which is (xwardûmane) "(we) have eaten". The performance of the tool is further described in (Ahmadi, 2020d,a).
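The Boolean check and the suggestion step can be mimicked with a small standard-library sketch. It does not use HUNSPELL and does not reproduce KLPT's implementation: the class name ToySpellcheck, the three-entry lexicon and the similarity cutoff are hypothetical, and suggestions come from difflib's string similarity rather than from affix rules.

```python
import difflib


class ToySpellcheck:
    """Hypothetical lexicon-based stand-in for a spell checker."""

    def __init__(self, lexicon):
        self.lexicon = sorted(set(lexicon))

    def check_spelling(self, word: str) -> bool:
        """Return True if the word-form is in the lexicon."""
        return word in self.lexicon

    def correct_spelling(self, word: str, n: int = 3):
        """Return up to n lexicon entries that are similar to the word."""
        return difflib.get_close_matches(word, self.lexicon, n=n, cutoff=0.8)


if __name__ == "__main__":
    # Word-forms taken from the examples in the text (Latin transliteration).
    checker = ToySpellcheck(["xwardûmane", "kirdin", "kirdbûm"])
    print(checker.check_spelling("xwardûmate"))
    print(checker.correct_spelling("xwardûmate"))
```

A real affix-based checker can accept inflected forms that are not listed explicitly, which a plain lexicon lookup like this one cannot do.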

Tokenize
Although both the Arabic-based and Latin-based alphabets use spaces to delimit word boundaries, not all words correspond to a token in Kurdish. This is particularly due to the complex morphology, e.g. article marking suffixes, and the writing traditions. In the Arabic-based alphabet, there is a tendency to concatenate clitics, affixes and words together, which results in many tokens being written as one single word-form without any space, as in (hîwaşyane) "(it) is also their hope", which is composed of four tokens: the noun hîwa, the endoclitic =ş, the pronominal enclitic -yan and the present copula e. The Latin-based script, particularly when used for writing Kurmanji, respects word boundaries in a better way. For instance, the same phrase is written as "hêvîya wan jî ew e".
In this module, we use the tokenization approach proposed by (Ahmadi, 2020b). This approach uses an annotated lexicon with a morphological analyzer to tokenize words in Sorani and Kurmanji. Given the wide usage of compound forms in word formation in Kurdish, a lexicon is also provided for multi-word expressions (MWEs) and their possible forms, with and without space. That way, the inconsistencies in writing compound words are tackled efficiently. In addition to word_tokenize() and mwe_tokenize(), which are respectively provided for the tokenization of words and MWEs, sent_tokenize() is a third function which tokenizes a given text into sentences based on punctuation marks. It is worth mentioning that words and MWEs are respectively separated by and by default which can be customized by the user.
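Two of these steps, sentence splitting on punctuation and lexicon-based MWE detection, can be sketched with the standard library alone. This is a simplified illustration rather than the approach of (Ahmadi, 2020b): the function bodies, the three-entry MWE lexicon (built from the compound verbs mentioned earlier) and the "_" separator are hypothetical, and no morphological analysis is involved.

```python
import re

# Hypothetical MWE lexicon, stored as tuples of space-separated parts.
MWE_LEXICON = {("siław", "kirdin"), ("pîroz", "kirdin"), ("heł", "kirdin")}
MAX_MWE_LEN = max(len(entry) for entry in MWE_LEXICON)


def sent_tokenize(text: str):
    """Split after ., !, ? or the Arabic question mark when followed by whitespace."""
    parts = re.split(r"(?<=[.!?\u061f])\s+", text.strip())
    return [part for part in parts if part]


def mwe_tokenize(words, sep: str = "_"):
    """Greedily join the longest known MWE starting at each position."""
    tokens, i = [], 0
    while i < len(words):
        for size in range(min(MAX_MWE_LEN, len(words) - i), 1, -1):
            candidate = tuple(words[i:i + size])
            if candidate in MWE_LEXICON:
                tokens.append(sep.join(candidate))
                i += size
                break
        else:
            # No MWE starts here: keep the word as a single token.
            tokens.append(words[i])
            i += 1
    return tokens


if __name__ == "__main__":
    print(sent_tokenize("hêvîya wan jî ew e. siław kirdin!"))
    print(mwe_tokenize(["pîroz", "kirdin", "heł", "kirdin", "hîwa"]))
```

Handling the forms written without a space, which the MWE lexicon described above also covers, would require listing those concatenated variants as additional entries.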

Configuration
Given the possible combinations of scripts and dialects in the input data, verifying the various configurations of each class can be complex. Therefore, we provide the configuration module, which is used internally within the modules when an object of a class is initialized. This way, the class constructors validate their arguments by invoking this module, and error handling is carried out only in the Configuration class.
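The design of centralizing validation in one class can be sketched as follows. The attribute names, the accepted values and the ToyTokenizer class are assumptions made for illustration; they are not taken from KLPT's source code.

```python
class Configuration:
    """Hypothetical central place for validating dialect and script arguments."""

    DIALECTS = ("Sorani", "Kurmanji")
    SCRIPTS = ("Arabic", "Latin")

    def __init__(self, dialect: str, script: str):
        if dialect not in self.DIALECTS:
            raise ValueError(f"Unknown dialect {dialect!r}; expected one of {self.DIALECTS}")
        if script not in self.SCRIPTS:
            raise ValueError(f"Unknown script {script!r}; expected one of {self.SCRIPTS}")
        self.dialect = dialect
        self.script = script


class ToyTokenizer:
    """Example module class that delegates argument checks to Configuration."""

    def __init__(self, dialect: str, script: str):
        # All error handling lives in Configuration, not in each module.
        self.config = Configuration(dialect, script)


if __name__ == "__main__":
    tokenizer = ToyTokenizer("Kurmanji", "Latin")
    print(tokenizer.config.dialect, tokenizer.config.script)
    try:
        ToyTokenizer("Kurmanji", "Cyrillic")
    except ValueError as error:
        print(error)
```

With this layout, supporting a new dialect or script means updating one class instead of every constructor.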
For further clarification on the interaction of the individual modules within the KLPT package, Figure A.5 shows its package and class diagrams in the Unified Modeling Language (UML).

Usages
In this section, we describe the basic usage of the application programming interface (API) of the KLPT package. The package is available on the Python Package Index (PyPI) for Python 3.5 and later, and can be installed as follows:

pip install klpt
The installation of the package includes the data files, i.e. the data folder, and also installs the requirements. Once the package is installed, each module can be imported and used as described above. Figure 4 provides an example of how to work with the various modules of the package.
As future work, we would like to extend the current version to include syntactic and semantic parsing for Sorani and Kurmanji. Given the scarcity of resources regarding computational linguistics and natural language processing for Kurdish, we believe that the KLPT package will create a new field of interest for Kurdish linguists as well. Therefore, we aim to create educational content to introduce the field to the non-expert public too.

Acknowledgments
The author would like to thank his two colleagues, Dr. Kyumars Sheykh Esmaili and Dr. Hossein Hassani who respectively initiated the Kurdish Language Processing Project and Kurdish-BLARK. Despite the lack of financial support of Kurdish-related projects, these initiatives have made huge contributions thanks to volunteer researchers. Similarly, the constructive comments of the three anonymous reviewers were very useful and are much appreciated.