iNLTK: Natural Language Toolkit for Indic Languages

We present iNLTK, an open-source NLP library consisting of pre-trained language models and out-of-the-box support for Data Augmentation, Textual Similarity, Sentence Embeddings, Word Embeddings, Tokenization and Text Generation in 13 Indic Languages. By using pre-trained models from iNLTK for text classification on publicly available datasets, we significantly outperform previously reported results. On these datasets, we also show that by using pre-trained models and data augmentation from iNLTK, we can achieve more than 95% of the previous best performance by using less than 10% of the training data. iNLTK is already being widely used by the community and has 40,000+ downloads, 600+ stars and 100+ forks on GitHub. The library is available at https://github.com/goru001/inltk.


Introduction
Deep learning offers a way to harness large amounts of computation and data with little engineering by hand (LeCun et al., 2015). With distributed representations, deep models have become the new state-of-the-art methods for NLP problems. Pre-trained language models (Devlin et al., 2019) can model syntactic and semantic relations between words and reduce feature engineering. These pre-trained models are useful for initialization and/or transfer learning for NLP tasks. Pre-trained models are typically learned using unsupervised approaches from large, diverse monolingual corpora (Kunchukuttan et al., 2020). While we have seen exciting progress across many tasks in natural language processing over the last few years, most such results have been achieved in English and a small set of other high-resource languages (Ruder, 2020).
Indic languages, widely spoken by more than a billion speakers, lack pre-trained deep language models, trained on a large corpus, which can provide a head start for downstream tasks using transfer learning. Availability of such models is critical to building systems that can achieve good results in "low-resource" settings, where labeled data is scarce and computation is expensive, which is the biggest challenge for working on NLP in Indic languages. Additionally, there is a lack of support for Indic languages in NLP libraries like spacy 1 and nltk 2, creating a barrier to entry for working with Indic languages.
iNLTK, an open-source natural language toolkit for Indic languages, is designed to address these problems and to significantly lower the barriers to doing NLP in Indic languages by

• sharing pre-trained deep language models, which can then be fine-tuned and used for downstream tasks like text classification,

• providing out-of-the-box support for Data Augmentation, Textual Similarity, Sentence Embeddings, Word Embeddings, Tokenization and Text Generation built on top of these pre-trained language models, lowering the barrier for doing applied research and building products in Indic languages.

The iNLTK library supports 13 Indic languages, including English, as shown in Table 2. The GitHub repository 3 for the library contains the source code, links to download pre-trained models, datasets and API documentation 4 . It includes reference implementations for reproducing the text-classification results shown in Section 2.4, which can also be easily adapted to new data. The library has a permissive MIT License and is easy to download and install via pip or by cloning the GitHub repository.


iNLTK Pre-trained Language Models

We trained ULMFiT (Howard and Ruder, 2018) and TransformerXL (Dai et al., 2019) language models for 13 Indic languages. All the language models (LMs) except English were trained from scratch using PyTorch (Paszke et al., 2017) and Fastai 5 ; pre-trained LMs for English were borrowed directly from Fastai. The pre-trained LMs were then evaluated on the downstream task of text classification on public datasets. This section describes the training of the language models and their evaluation.

Dataset preparation
We obtained monolingual corpora for each of the languages from Wikipedia for training LMs from scratch. We used the wiki extractor 6 tool and BeautifulSoup 7 to extract text from Wikipedia. The Wikipedia articles were then cleaned and split into train and validation sets.

Tokenization
We create a subword vocabulary for each of the languages by training a SentencePiece 8 tokenization model on the Wikipedia articles dataset, using the unigram segmentation algorithm (Kudo and Richardson, 2018). An important property of SentencePiece tokenization, necessary for us to obtain a valid subword-based language model, is its reversibility. We do not use subword regularization, as the available training dataset is large enough to avoid overfitting. Table 3 shows the subword vocabulary size of the tokenization model for each of the languages.
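The reversibility property mentioned above means that detokenization exactly recovers the original text, because whitespace is encoded as an explicit marker character rather than discarded. The toy tokenizer below is only a sketch of this property, not the real SentencePiece unigram model: it chops text into fixed-size pieces instead of learning pieces from data, but it round-trips text the same way.

```python
# Sketch of SentencePiece-style reversible tokenization: spaces are
# replaced by an explicit marker ("▁"), so joining the pieces and
# mapping the marker back to a space recovers the input exactly.
# A real unigram model learns variable-length pieces from a corpus;
# here we just use fixed-size chunks to keep the sketch tiny.

MARKER = "\u2581"  # ▁, the SentencePiece whitespace marker

def tokenize(text, piece_len=3):
    """Encode whitespace as MARKER, then split into fixed-size chunks."""
    s = MARKER + text.replace(" ", MARKER)
    return [s[i:i + piece_len] for i in range(0, len(s), piece_len)]

def detokenize(pieces):
    """Invert tokenize: concatenate pieces, turn markers back into spaces."""
    return "".join(pieces).replace(MARKER, " ").lstrip(" ")

text = "namaste duniya"
assert detokenize(tokenize(text)) == text  # lossless round trip
```

Reversibility matters for a language model because generated subword sequences must map back to clean surface text without guessing where spaces belong.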

Language Model Training
Our models are based on the Fastai implementations of ULMFiT and TransformerXL. Table 5 shows the perplexity of the language models on the validation set. TransformerXL consistently performs better for all languages.

Text Classification Evaluation
We evaluated the pre-trained ULMFiT language models on the downstream task of text classification using the following publicly available datasets:

iNLTK API
iNLTK is designed to be simple for practitioners in order to lower the barrier to doing applied research and building products in Indic languages. This section discusses the various NLP tasks for which iNLTK provides out-of-the-box support, under a unified API.

Data Augmentation helps improve the performance of NLP models (Duboue and Chu-Carroll, 2006; Marton et al., 2009). It is even more important in "low-resource" settings, where labeled data is scarce. iNLTK provides augmentations 14 of a sentence while preserving its semantics, following a two-step process. First, it generates candidate paraphrases by replacing tokens of the original sentence with tokens that have the closest embeddings in the embedding layer of the pre-trained language model. It then chooses the top paraphrases that are most similar to the original sentence, where similarity between sentences is calculated as the cosine similarity of their sentence embeddings, obtained from the pre-trained language model's encoder.
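The two-step augmentation process can be sketched with toy embeddings. Everything below is a stand-in: the hand-written embedding table replaces the LM's embedding layer, and mean-pooling replaces the encoder; iNLTK's actual models are learned, not hand-written.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy word embeddings standing in for the pre-trained LM's embedding layer.
emb = {
    "good":  np.array([1.0, 0.1]),
    "great": np.array([0.9, 0.2]),
    "bad":   np.array([-1.0, 0.1]),
    "movie": np.array([0.0, 1.0]),
    "film":  np.array([0.1, 0.9]),
}

def nearest_token(token):
    """Step 1: candidate replacement = token with the closest embedding."""
    others = [t for t in emb if t != token]
    return max(others, key=lambda t: cosine(emb[token], emb[t]))

def sentence_embedding(tokens):
    """Stand-in for the LM encoder: mean of token embeddings."""
    return np.mean([emb[t] for t in tokens], axis=0)

def augment(tokens, top_n=1):
    """Step 2: rank candidate paraphrases by cosine similarity of their
    sentence embedding to the original sentence's embedding."""
    original = sentence_embedding(tokens)
    candidates = []
    for i, tok in enumerate(tokens):
        cand = tokens[:i] + [nearest_token(tok)] + tokens[i + 1:]
        candidates.append((cosine(original, sentence_embedding(cand)), cand))
    candidates.sort(reverse=True)
    return [c for _, c in candidates[:top_n]]

print(augment(["good", "movie"]))  # a one-token paraphrase, e.g. swapping a synonym
```

Because ranking uses sentence-level similarity rather than only word-level neighbors, candidates that flip the sentence's meaning (e.g. substituting "bad") score poorly and are filtered out.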
To evaluate the effectiveness of using data augmentation from iNLTK in low-resource settings, we prepare 15 reduced train sets of publicly available text-classification datasets by picking the first N examples from the full train set 16, where N is the size of the reduced train set, and compare the accuracy of the classifier trained with vs without data augmentation. Table 7 shows the reduced dataset statistics and a comparison of the results obtained on the full and reduced datasets using iNLTK. Using data augmentation from iNLTK gives a significant increase in accuracy on the Hindi, Bengali, Malayalam and Tamil datasets, and minor improvements on the Gujarati and Marathi datasets. Additionally, Table 7 compares the previous best results on these datasets, obtained using INLP embeddings (Kunchukuttan et al., 2020), with the results obtained using iNLTK pre-trained models and iNLTK's data augmentation utility. On average, with iNLTK we are able to achieve more than 95% of the previous accuracy using less than 10% of the training data 17.

14 https://inltk.readthedocs.io/en/latest/api_docs.html#get-similar-sentences
15 Notebooks to prepare the reduced datasets are accessible from the GitHub repository of the library
16 Labels in the publicly available full train sets were not grouped together, but randomly shuffled
Semantic Textual Similarity (STS) assesses the degree to which the underlying semantics of two segments of text are equivalent to each other (Agirre et al., 2016). To evaluate semantic textual similarity, iNLTK compares 18 the sentence embeddings of the two segments of text, obtained from the pre-trained language model's encoder, using a comparison function. Cosine similarity between the sentence embeddings is used as the default comparison function.
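The default comparison function is plain cosine similarity over the two sentence-embedding vectors; a dependency-free sketch (treating the embeddings as given lists of floats):

```python
import math

def cosine_similarity(u, v):
    """Default STS comparison function: cosine of the angle between
    two sentence-embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Identical embeddings score 1.0; orthogonal embeddings score 0.0.
print(cosine_similarity([1.0, 2.0], [1.0, 2.0]))  # → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # → 0.0
```

Any other comparison function with the same signature (two vectors in, a score out) could be substituted; cosine is a natural default because it ignores embedding magnitude and measures only direction.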
Distributed representations are the cornerstone of modern NLP and have led to significant advances in many NLP tasks. iNLTK provides utilities to obtain distributed representations for words 19, and for sentences and documents 20, obtained from the embedding layer and the encoder output of the pre-trained language models, respectively.
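The split between the two sources of representations can be sketched as follows. The vocabulary, random embedding matrix, and identity "encoder" here are all hypothetical stand-ins for the pre-trained LM; the point is only the mechanism: word vectors are rows of the embedding layer, while sentence and document vectors pool the encoder's per-token outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for a pre-trained LM: a subword vocabulary
# and its embedding matrix (vocab_size x dim).
vocab = {"main": 0, "ghar": 1, "ja": 2, "raha": 3, "hoon": 4}
embedding_matrix = rng.normal(size=(len(vocab), 8))

def word_embedding(token):
    """Word representation: a row of the embedding layer."""
    return embedding_matrix[vocab[token]]

def encoder(token_ids):
    """Toy encoder: returns the embeddings unchanged; a real LM
    encoder would return contextualized hidden states."""
    return embedding_matrix[token_ids]

def sentence_embedding(tokens):
    """Sentence/document representation: pool the encoder outputs
    (mean-pooling here) into a single fixed-size vector."""
    ids = np.array([vocab[t] for t in tokens])
    return encoder(ids).mean(axis=0)

vec = sentence_embedding(["main", "ghar", "ja", "raha", "hoon"])
print(vec.shape)  # → (8,)
```

Both kinds of vectors live in the same dimensionality here only for simplicity; in general the encoder's hidden size need not match the embedding size.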
Additionally, iNLTK provides utilities to generate text 21 given a prompt using the pre-trained language models, tokenize 22 text using the SentencePiece tokenization models described in Section 2.2, identify 23 which of the supported Indic languages a given text is in, and remove tokens of a foreign language 24 from a given text.
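One building block behind language identification for Indic text is that most of the supported languages use distinct Unicode script blocks. The sketch below is a simplified, hypothetical heuristic, not iNLTK's actual implementation: it maps code-point ranges to scripts and votes per character, which distinguishes e.g. Devanagari text from Bengali or Tamil text but cannot separate languages sharing a script (Hindi vs Marathi).

```python
# Unicode code-point ranges for a few Indic scripts (per the Unicode
# standard). Each supported language writes mostly within one block.
SCRIPT_RANGES = {
    "devanagari": (0x0900, 0x097F),  # Hindi, Marathi, Sanskrit, Nepali
    "bengali":    (0x0980, 0x09FF),
    "gurmukhi":   (0x0A00, 0x0A7F),  # Punjabi
    "gujarati":   (0x0A80, 0x0AFF),
    "tamil":      (0x0B80, 0x0BFF),
    "malayalam":  (0x0D00, 0x0D7F),
}

def identify_script(text):
    """Return the script whose code-point range matches the most
    characters in text (a majority vote over characters)."""
    counts = {name: 0 for name in SCRIPT_RANGES}
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[name] += 1
    return max(counts, key=counts.get)

print(identify_script("नमस्ते दुनिया"))  # → devanagari
```

A script vote like this is also the natural first pass for the remove-foreign-tokens utility: tokens whose characters fall outside the target language's script block are candidates for removal.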

Related Work
The NLP and ML communities have a strong culture of building open-source tools. There are many easy-to-use, user-facing libraries for general-purpose NLP, like NLTK (Loper and Bird, 2002), Stanford CoreNLP (Manning et al., 2014), Spacy (Honnibal and Montani, 2017), AllenNLP (Gardner et al., 2018), Flair (Akbik et al., 2019), Stanza (Qi et al., 2020) and Huggingface Transformers (Wolf et al., 2019). But most of these libraries have limited or no support for Indic languages, creating a barrier to entry for working with Indic languages. Additionally, while word embeddings have been trained for many Indic languages, these languages still lack the richer pre-trained representations of deep language models (Kunchukuttan et al., 2020). iNLTK tries to solve these problems by providing pre-trained language models and out-of-the-box support for a variety of NLP tasks in 13 Indic languages.

Conclusion and Future Work
iNLTK provides pre-trained language models and supports Data Augmentation, Textual Similarity, Sentence Embeddings, Word Embeddings, Tokenization and Text Generation in 13 Indic languages. Using pre-trained models from iNLTK, our results significantly outperform other methods on text-classification benchmarks. These pre-trained models can be used as-is for a variety of NLP tasks, or can be fine-tuned on domain-specific datasets. iNLTK is being widely 25 used 26 and appreciated 27 by the community 28.

We are working on expanding the languages supported by iNLTK to include other Indic languages like Telugu and Maithili, as well as code-mixed languages like Hinglish (Hindi and English), Manglish (Malayalam and English) and Tanglish (Tamil and English), and on expanding the supported model architectures to include BERT. Additionally, we want to mitigate any unwarranted biases which might exist in the pre-trained language models (Lu et al., 2019) because of the training data, and which might propagate into downstream systems using these models. While these tasks are work in progress, we hope this library will accelerate NLP research and development in Indic languages.

21 https://inltk.readthedocs.io/en/latest/api_docs.html#predict-next-n-words
22 https://inltk.readthedocs.io/en/latest/api_docs.html#tokenize
23 https://inltk.readthedocs.io/en/latest/api_docs.html#identify-language
24 https://inltk.readthedocs.io/en/latest/api_docs.html#remove-foreign-languages
25 https://github.com/goru001/inltk/network/members
26 https://pepy.tech/project/inltk