IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages

In this paper, we introduce NLP resources for 11 major Indian languages from two major language families. These resources include: (a) large-scale sentence-level monolingual corpora, (b) pre-trained word embeddings, (c) pre-trained language models, and (d) multiple NLU evaluation datasets (the IndicGLUE benchmark). The monolingual corpora contain a total of 8.8 billion tokens across all 11 languages and Indian English, primarily sourced from news crawls. The word embeddings are based on FastText, making them suitable for handling the morphological complexity of Indian languages. The pre-trained language models are based on the compact ALBERT model. Lastly, we compile the IndicGLUE benchmark for Indian language NLU. To this end, we create datasets for the following tasks: Article Genre Classification, Headline Prediction, Wikipedia Section-Title Prediction, Cloze-style Multiple-choice QA, Winograd NLI and COPA. We also include publicly available datasets for some Indic languages for tasks like Named Entity Recognition, Cross-lingual Sentence Retrieval, Paraphrase Detection, etc. Our embeddings are competitive with or better than existing pre-trained embeddings on multiple tasks. We hope that the availability of these resources will accelerate Indic NLP research, which has the potential to impact more than a billion people. It can also help the community in evaluating advances in NLP over a more diverse pool of languages. The data and models are available at https://indicnlp.ai4bharat.org.


Introduction
Distributional representations are the cornerstone of modern NLP and have led to significant advances in many NLP tasks like text classification, NER, sentiment analysis, MT, QA, NLI, etc.
In particular, word embeddings (Mikolov et al., 2013b), contextualized word embeddings (Peters et al., 2018), and language models (Devlin et al., 2019) can model syntactic/semantic relations between words and reduce feature engineering. These pre-trained models are useful for initialization and/or transfer learning for NLP tasks. They are also useful for learning multilingual embeddings, which enable cross-lingual transfer. Pre-trained models are typically learned from large, diverse monolingual corpora. The quality of embeddings is impacted by the size of the monolingual corpora (Mikolov et al., 2013a; Bojanowski et al., 2017), a resource not widely available for many major languages.
Indic languages, widely spoken by more than a billion speakers, lack large, publicly available monolingual corpora. They include 8 of the top 20 most spoken languages and around 30 languages with more than a million speakers each. There is also a growing population of users consuming Indian language content (print, digital, government and businesses). Further, Indic languages are very diverse, spanning 4 major language families. The Indo-Aryan and Dravidian languages are spoken by 96% of the population in India. The other families are diverse, but their speaker populations are relatively small. Almost all Indian languages have SOV word order and are morphologically rich. The language families have also interacted over a long period of time, leading to significant convergence in linguistic features; hence, the Indian subcontinent is referred to as a linguistic area (Emeneau, 1956). Indic languages are thus of great interest and importance for NLP research.
Unfortunately, progress on Indic NLP has been constrained by the unavailability of large-scale monolingual corpora and evaluation benchmarks. The former allows the development of pre-trained language models and deep contextualised word embeddings, which have become drivers of modern NLP. The latter allows systematic evaluation across a wide variety of tasks to check the efficacy of new models. With the hope of accelerating Indic NLP research, we address the creation of (i) large, general-domain monolingual corpora for multiple Indian languages, (ii) word embeddings and multilingual language models trained on these corpora, and (iii) an evaluation benchmark comprising various NLU tasks.
Our monolingual corpora, collectively referred to as IndicCorp, contain a total of 8.8 billion tokens across 11 major Indian languages and English. The articles in IndicCorp are primarily sourced from news crawls. Using IndicCorp, we first train and evaluate word embeddings for each of the 11 languages. Given the morphological richness of Indian languages, we train FastText word embeddings, which are known to be more effective for such languages. To evaluate these embeddings we curate a benchmark comprising word similarity and analogy tasks (Akhtar et al., 2017; Grave et al., 2018), text classification tasks, sentence classification tasks (Akhtar et al., 2016; Mukku and Mamidi, 2017), and bilingual lexicon induction tasks. On most tasks, the word embeddings trained on IndicCorp outperform similar embeddings trained on existing corpora for Indian languages.
Next, we train multilingual language models for these 11 languages using the ALBERT model (Lan et al., 2020). We chose ALBERT as the base model as it is very compact and hence easier to use in downstream tasks. To evaluate these pre-trained language models, we create an NLU benchmark comprising the following tasks: article genre classification, headline prediction, named entity recognition, Wikipedia section-title prediction, cloze-style multiple-choice QA, natural language inference, paraphrase detection, sentiment analysis, discourse mode classification, and cross-lingual sentence retrieval. We collectively refer to this benchmark as IndicGLUE; it is a collection of (i) existing Indian language datasets for some tasks, (ii) manual translations of some English datasets into Indian languages done as a part of this work, and (iii) new datasets created semi-automatically for all major Indian languages as a part of this work. These new datasets were created using external metadata (such as website/Wikipedia structure), resulting in more complex NLU tasks. Across all these tasks, we show that our embeddings are competitive with or better than existing pre-trained multilingual embeddings such as mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020). We hope that these embeddings and evaluation benchmarks will not only be useful in driving NLP research on Indic languages, but will also help in evaluating advances in NLP over a more diverse set of languages.
In summary, this paper introduces IndicNLPSuite, containing the following resources for Indic NLP, which will be made publicly available:
• IndicCorp: Large sentence-level monolingual corpora for 11 languages from two language families (Indo-Aryan and Dravidian) and Indian English, with an average 9-fold increase in size over OSCAR.
• IndicFT: FastText-based word embeddings for 11 Indian languages trained on IndicCorp.
• IndicBERT: A compact multilingual ALBERT-based language model trained on IndicCorp.
• IndicGLUE: An evaluation benchmark containing a variety of NLU tasks.

Related Work
Text Corpora. Few organized sources of monolingual corpora exist for most Indian languages. The EMILLE/CIIL corpus (McEnery et al., 2000) was an early effort to build corpora for South Asian languages, spanning 14 languages with a total of 92 million words. Wikipedia for Indian languages is small (the largest one, Hindi, has just 40 million words). The Leipzig corpus (Goldhahn et al., 2012) contains small collections of up to 1 million sentences from news and web crawls (average 300K sentences). In addition, there are some language-specific corpora for Hindi and Urdu (Jawaid et al., 2014). In particular, the HindMonoCorp is one of the few larger Indian language collections (787M tokens).
The CommonCrawl project crawls webpages in many languages by sampling various websites. Our analysis of a processed crawl for the years 2013-2016 (Buck et al., 2014) revealed that most Indian languages, with the exception of Hindi, Tamil and Malayalam, have few good sentences (≥10 words), on the order of only around 50 million words. The OSCAR project (Ortiz Suárez et al., 2019), a recent processing of CommonCrawl, also contains much less data for most Indian languages than our crawls. The CC-Net (Wenzek et al., 2019) and C4 (Raffel et al., 2019) projects also provide tools to process CommonCrawl, but the extracted corpora are not distributed and require a large amount of processing power to recreate. Our monolingual corpus is about 4 times larger than the corresponding OSCAR corpus and two times larger than the corresponding CC-100 corpus (Conneau et al., 2020).

Word Embeddings. Word embeddings have been trained for many Indian languages using limited corpora. The Polyglot (Al-Rfou et al., 2013) and FastText (Bojanowski et al., 2017) projects provide embeddings trained on Wikipedia; FastText also provides embeddings trained on Wikipedia + CommonCrawl corpora. We show that on most evaluation tasks IndicFT outperforms the existing FastText-based embeddings.

Pretrained Transformers. Pre-trained transformers serve as general language understanding models that can be used in a wide variety of downstream NLP tasks (Radford et al., 2019). Several transformer-based language models such as GPT (Radford, 2018), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2020) have been proposed. All these models require large amounts of monolingual corpora for training. For Indic languages, two such multilingual models are available: XLM-R (Conneau et al., 2020) and multilingual BERT (Devlin et al., 2019). However, they are trained across ~100 languages and on smaller Indic language corpora.

NLU Benchmarks.
Benchmarks such as GLUE (Wang et al., 2018), SuperGLUE (Wang et al., 2019), CLUE (Chinese) (Xu et al., 2020), and FLUE (French) (Le et al., 2020) are important for tracking the efficacy of NLP models across languages. Such a benchmark is missing for Indic languages, and the goal of this work is to fill this void. Datasets are available for some tasks in a few languages. The following are some of the prominent publicly available datasets: word similarity (Akhtar et al., 2017), word analogy (Grave et al., 2018), text classification, sentiment analysis (Akhtar et al., 2016; Mukku and Mamidi, 2017), paraphrase detection (Anand Kumar et al., 2016), QA (Clark et al., 2020), discourse mode classification (Dhanwal et al., 2020), etc. We also create datasets for some tasks, most of which span all major Indian languages. We bundle together the existing datasets and our newly created datasets to create the IndicGLUE benchmark.

IndicCorp: Indian Language Corpora
In this section, we describe the creation of our monolingual corpora.

Data sources. Our goal was to collect corpora that reflect contemporary use of Indic languages and cover a wide range of topics. Hence, we focus primarily on crawling news articles, magazines and blog posts. We source our data from popular Indian language news websites. We discover most of our sources through online newspaper directories (e.g., w3newspaper) and automated web searches using hand-picked terms in various languages. We analyzed whether we could augment our crawls with data from other smaller sources like the Leipzig corpus (Goldhahn et al., 2012), WMT NewsCrawl, WMT CommonCrawl (Buck et al., 2014), HindEnCorp (Hindi), etc. Among these, we chose to augment our dataset with only the CommonCrawl data from the OSCAR corpus (Ortiz Suárez et al., 2019).

Article Extraction. For many news websites, we used BoilerPipe (Kohlschütter et al., 2010), a tool that automatically extracts the main article content from structured pages without any site-specific customization. This approach works well for most Indian language news websites. In the remaining cases, we wrote custom extractors for each website using BeautifulSoup, a Python library for parsing HTML/XML documents. After content extraction, we applied filters on content length, script, etc., to select good-quality articles.

Text Processing. First, we canonicalize the representation of Indic language text in order to handle multiple Unicode representations of certain characters. Next, we split each article into sentences and tokenize the sentences. These steps take into account Indic punctuation and sentence delimiters. Heuristics avoid splitting sentences at initials (P. G. Wodehouse) and common Indian titles (Shri., equivalent to Mr. in English), which are followed by a period. We use the Indic NLP Library (Kunchukuttan, 2020) for processing.
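The sentence-splitting heuristic can be sketched as follows. This is an illustrative, simplified version, not the Indic NLP Library's actual implementation, and the list of non-terminal titles is a small assumed sample:

```python
import re

# Common titles that end with a period but do not end a sentence.
# Illustrative sample only; the real pipeline uses a larger list.
NON_TERMINAL = {"Shri.", "Smt.", "Dr.", "Mr.", "Mrs."}

def split_sentences(text):
    """Split text on the danda (।) and on ./?/!, skipping initials and titles."""
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        ends = tok.endswith(("।", ".", "?", "!"))
        # Single capital letter followed by a period, e.g. "P." in "P. G. Wodehouse"
        is_initial = re.fullmatch(r"[A-Z]\.", tok) is not None
        if ends and tok not in NON_TERMINAL and not is_initial:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

For example, `split_sentences("P. G. Wodehouse wrote many books. He was funny.")` keeps the initials inside the first sentence instead of splitting after each period.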
The final corpus for a language is created by combining our crawls with the OSCAR corpus, then de-duplicating and shuffling sentences. We used the MurmurHash algorithm (the mmh3 Python library, with a 128-bit unsigned hash) for de-duplication. Due to copyright reasons, we only release the final shuffled corpus described below.

Dataset Statistics. Table 1 shows statistics of the de-duplicated monolingual datasets for each language. Hindi and Indian English are the largest collections, while Odia and Assamese are the smallest. All other languages contain between 500 and 1000 million tokens. OSCAR is an important contributor to our corpus, accounting for nearly 23% of our corpus by number of sentences; the rest of the data originated from our crawls. As evident from the last column of Table 1, for 8 languages the number of tokens in our corpus is at least 7 times that in OSCAR; for the remaining 3 languages it is twice that of OSCAR.
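The hash-based de-duplication step can be sketched as follows. The released pipeline hashes with MurmurHash via the mmh3 library; the sketch below substitutes the standard library's blake2b (also 128-bit here) purely to stay dependency-free, and comparing hashes rather than full strings keeps memory bounded at scale:

```python
import hashlib

def deduplicate(sentences):
    """Keep the first occurrence of each sentence, comparing 128-bit hashes.

    Note: the paper's pipeline uses MurmurHash (mmh3); blake2b is used
    here only so the sketch has no third-party dependency.
    """
    seen = set()
    unique = []
    for s in sentences:
        h = hashlib.blake2b(s.encode("utf-8"), digest_size=16).digest()
        if h not in seen:
            seen.add(h)
            unique.append(s)
    return unique
```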

IndicGLUE: Multilingual NLU Benchmark
We now introduce IndicGLUE, the Indic General Language Understanding Evaluation Benchmark, which is a collection of various NLP tasks as described below. The goal is to provide an evaluation benchmark for natural language understanding capabilities of NLP models on diverse tasks and multiple Indian languages. As discussed earlier, very few public NLP datasets are available for all Indian languages. Hence, we adopted a two-pronged approach to construct this benchmark. One, we use existing datasets that address some tasks. However, such datasets are available for just 4-5 Indian languages. We also manually translated some English datasets into a few Indian languages. We summarize statistics of these datasets in Appendix A. Two, we create new datasets that span all major Indian languages. These datasets are curated semi-automatically using external metadata like website/Wikipedia structure and are designed to present reasonably complex NLU tasks. Table 2 summarizes the sizes of the respective datasets.
Further details (such as the min, max, average number of words per training instance) can be found in Appendix C. Standard train and test splits for all datasets are publicly available on the website for reproducibility. For publicly available datasets, we used the original split if provided.
News Category Classification. The task is to predict the genre/topic of a given news article or news headline. We create news article category datasets from IndicCorp for 9 languages. The categories are determined from URL components. We chose generic categories which are likely to be consistent across websites (e.g., entertainment, sports, business, lifestyle, technology, politics, crime). See Appendix B for details.
Headline Prediction. The task is to predict the correct headline for a news article from a given list of four candidate headlines (3 incorrect, 1 correct). We generate the dataset from our news article crawls, which contain articles and their headlines. We ensure that the three incorrect candidates are not completely unrelated to the given article: while choosing incorrect candidates, we considered only those articles that had a sizeable overlap of entities with the original article.
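The entity-overlap filter for distractor headlines can be sketched as follows. The overlap threshold and ranking are illustrative assumptions, not the paper's exact parameters:

```python
def pick_negatives(article_entities, candidate_pool, k=3, min_overlap=2):
    """Return up to k distractor headlines whose source articles share at
    least `min_overlap` entities with the given article, so distractors
    are related but wrong. Thresholds are illustrative only.

    candidate_pool: iterable of (headline, entity_set) pairs.
    """
    scored = []
    for headline, entities in candidate_pool:
        overlap = len(set(article_entities) & set(entities))
        if overlap >= min_overlap:
            scored.append((overlap, headline))
    scored.sort(reverse=True)          # most-related distractors first
    return [h for _, h in scored[:k]]
```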
Wikipedia Section-Title Prediction. The task is to predict the correct title for a Wikipedia section from a given list of four candidate titles (3 incorrect, 1 correct). We use the open-source tool WikiExtractor to extract sections and their titles from Wikipedia. To make the task challenging, we choose the 3 incorrect candidates for a given section only from the titles of other sections in the same article.
Cloze-style Multiple-choice QA. Given a text with an entity randomly masked, the task is to predict the masked entity from a list of 4 candidate entities (3 incorrect, 1 correct). The text is obtained from Wikipedia articles and the entities in the text are identified using Wikidata. We choose the 3 incorrect candidates from among the other entities identified in the article.

Named Entity Recognition. We use the WikiAnn dataset (Pan et al., 2017), which contains NER data for 282 languages. This dataset is created from Wikipedia by utilizing cross-language links to propagate English named entity labels to other languages. We consider the following coarse-grained labels in this dataset: Person (PER), Organisation (ORG) and Location (LOC).

Cross-lingual Sentence Retrieval. Given a sentence in English, the task is to retrieve its translation from a set of candidate sentences in an Indian language. We use the CVIT-Mann Ki Baat dataset (Siripragada et al., 2020) for this task.

Winograd NLI (WNLI). The WNLI task (Levesque et al., 2011) is part of the GLUE benchmark. Each example in the dataset consists of a pair of sentences, where the second sentence is constructed from the first by replacing an ambiguous pronoun with a possible referent within the sentence. The task is to predict whether the second sentence is entailed by the first. We manually translated this dataset into 3 Indic languages (hi, mr, gu) with the help of skilled bilingual speakers. The annotators were paid 3 cents per word, and the translations were then verified by an expert bilingual speaker.

COPA. In the Choice of Plausible Alternatives (COPA) task (Roemmele et al., 2011), each example consists of a premise and two alternatives; the task is to select the alternative that is more plausibly the cause (or effect) of the situation described by the premise. As with WNLI, we translated the dataset into 3 Indic languages (hi, mr, gu).
Paraphrase Detection. We use the Amrita paraphrase dataset comprising 4 Indic languages (hi, pa, ta, ml) (Anand Kumar et al., 2016). We evaluate on two subtasks. Subtask 1: given a pair of sentences from the newspaper domain, classify them as paraphrases (P) or not paraphrases (NP). Subtask 2: given two sentences from the newspaper domain, identify whether they are completely equivalent (E), roughly equivalent (RE) or not equivalent (NE). This task is similar to Subtask 1, the main difference being the use of three classes instead of two.

Discourse Mode Classification.
Given a sentence, the task is to classify it into one of the following discourse categories: argumentative, descriptive, dialogic, informative, narrative. We use the MIDAS Hindi Discourse Analysis dataset (Dhanwal et al., 2020) for this task.

IndicFT: Word Embeddings
We train FastText word embeddings for each language using IndicCorp and evaluate their quality on: (a) word similarity, (b) word analogy, (c) text classification, and (d) bilingual lexicon induction tasks. We compare our embeddings (referred to as IndicFT) with two pre-trained embeddings released by the FastText project, trained on Wikipedia (FT-W) (Bojanowski et al., 2017) and on Wikipedia + CommonCrawl (FT-WC) (Grave et al., 2018) respectively.

Training Details
We train 300-dimensional word embeddings for each language on IndicCorp using FastText (Bojanowski et al., 2017). Since Indian languages are morphologically rich, we chose FastText, which is capable of integrating subword information by using character n-gram embeddings during training. We train skipgram models for 10 epochs with a window size of 5, minimum token count of 5 and 10 negative examples sampled for each instance. We chose these hyper-parameters based on suggestions by Grave et al. (2018). Based on previously published results, we expect FastText to be better than word-level algorithms like word2vec (Mikolov et al., 2013b) and GloVe (Pennington et al., 2014) for morphologically rich languages.
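FastText's handling of morphology comes from representing each word by its character n-grams (plus the word itself), padded with boundary markers. A minimal sketch of that n-gram extraction, using FastText's default range of 3 to 6 characters:

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams as used by FastText subword modeling.

    The word is padded with boundary markers '<' and '>', all n-grams
    with minn <= n <= maxn are extracted, and the full padded word is
    also kept as its own feature. A word's vector is then the sum of
    the embeddings of these units.
    """
    padded = "<" + word + ">"
    ngrams = set()
    for n in range(minn, maxn + 1):
        for i in range(len(padded) - n + 1):
            ngrams.add(padded[i:i + n])
    ngrams.add(padded)  # the whole word is a feature too
    return ngrams
```

Because morphological variants of a word share most of their n-grams, their vectors stay close even when some variants are rare in the corpus, which is why this approach suits morphologically rich Indic languages.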

Word Similarity & Analogy Evaluation
We perform an intrinsic evaluation of the word embeddings using the IIIT-Hyderabad word similarity dataset (Akhtar et al., 2017), covering 7 Indian languages with 100-200 word pairs per language, and the Facebook Hindi word analogy dataset (Grave et al., 2018). Table 3 shows the evaluation results. On average, IndicFT embeddings outperform the baseline embeddings.

Text Classification Evaluation
We evaluate the embeddings on the text classification and sentiment analysis datasets described earlier.

Results. On nearly all datasets and languages, IndicFT embeddings outperform the baseline embeddings (see Tables 4 and 5).

Bilingual Lexicon Induction
We train bilingual word embeddings from English to Indian languages and vice versa using GeoMM (Jawanpuria et al., 2019), a state-of-the-art supervised method for learning bilingual embeddings. We evaluate the bilingual embeddings on the BLI task, using bilingual dictionaries from the MUSE project and an en-te dictionary created in-house. During inference, we search among the 200k most frequent target language words using the CSLS distance metric (Conneau et al., 2018). Table 6 shows the results. The quality of multilingual embeddings depends on the quality of the underlying monolingual embeddings, and IndicFT bilingual embeddings significantly outperform the baseline bilingual embeddings for most languages.
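The CSLS metric (Conneau et al., 2018) mitigates the hubness problem in nearest-neighbour retrieval by penalizing vectors that are close to many candidates. A small pure-Python sketch (toy vectors, small k for illustration; real inference runs over the 200k-word vocabulary):

```python
import math

def cos(u, v):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def csls(x, y, src_space, tgt_space, k=2):
    """CSLS(x, y) = 2*cos(x, y) - r_T(x) - r_S(y), where r_T(x) is the
    mean cosine of x's k nearest neighbours in the target space (and
    r_S(y) likewise in the source space). Hub vectors get large r terms
    and are therefore penalized."""
    r_x = sum(sorted((cos(x, t) for t in tgt_space), reverse=True)[:k]) / k
    r_y = sum(sorted((cos(y, s) for s in src_space), reverse=True)[:k]) / k
    return 2 * cos(x, y) - r_x - r_y
```

During BLI inference, each source word is paired with the target word maximizing this score.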

IndicBERT: Multilingual NLU Model
In this section, we introduce IndicBERT which is trained on IndicCorp and evaluated on IndicGLUE.
We specifically chose ALBERT as the base model as it has fewer parameters making it easier to distribute and use in downstream applications. Further, similar to mBERT, we chose to train a single model for all Indian languages with a hope of utilizing the relatedness amongst Indian languages.
In particular, such joint training may be beneficial for some of the under-represented languages (e.g., Odia and Assamese).

Pre-training
Using IndicCorp, we first train a SentencePiece tokenizer (Kudo and Richardson, 2018) to tokenize the sentences in each language. We use this tokenized corpus to train a multilingual ALBERT model with the standard masked language model (MLM) objective. Note that we did not use the Sentence Order Prediction objective from the original ALBERT work. Similar to the mBERT and XLM-R models, we perform exponentially smoothed weighting of the data across languages to give better representation to low-resource languages. We choose a vocabulary size of 200k to accommodate the different scripts and large vocabularies of Indic languages. We train our models on a single TPU v3 provided by the TensorFlow Research Cloud. We train both the base and large versions of ALBERT. To account for memory constraints, we use a smaller maximum sequence length of 128; in addition, for the large model, we use a smaller batch size of 2048. For creating each batch, we first randomly select a language and then randomly select sentences from that language. Apart from sequence length and batch size, we use the default hyperparameter values from Lan et al. (2020). We train each model for a total of 400k steps; it took 6 days to train the base model and 9 days to train the large model. In the remaining discussion, we refer to our models as IndicBERT base and IndicBERT large. We compare our models with two of the best performing multilingual models: mBERT (Pires et al., 2019) and the XLM-R base model (Conneau et al., 2020). Note that our model is much smaller than these models, while being trained on larger Indic language corpora (see Table 14 in Appendix C.5 for details).
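Exponentially smoothed weighting can be sketched as follows: each language's sampling probability is proportional to its token count raised to an exponent alpha < 1, which boosts low-resource languages relative to their raw corpus share. The value alpha = 0.7 below is the one used by mBERT/XLM and is an assumption here, as the text does not state the exact exponent:

```python
def sampling_probs(token_counts, alpha=0.7):
    """Exponentially smoothed language-sampling probabilities: p_i ∝ n_i^alpha.

    alpha < 1 flattens the distribution, upweighting low-resource
    languages. alpha = 0.7 is assumed (the mBERT/XLM value); the paper
    does not specify its exponent.
    """
    weighted = {lang: count ** alpha for lang, count in token_counts.items()}
    total = sum(weighted.values())
    return {lang: w / total for lang, w in weighted.items()}
```

For instance, with a 100:1 token ratio between a high- and low-resource language, smoothing raises the low-resource language's sampling share well above its 1% raw share.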

Fine-tuning
After pre-training, we fine-tune IndicBERT on each of the tasks in IndicGLUE using the respective training sets. The fine-tuning is done independently for each task and each language (i.e., we have a task-specific model for each language). We describe the fine-tuning procedure for each task below.

Headline Prediction, Wikipedia Section-Title Prediction. For headline prediction, we feed the article and a candidate headline to the model with a SEP token in between. A classification head on top assigns a score between 0 and 1 to the headline. We use cross entropy loss with the target label 1 for the correct candidate and 0 for the incorrect candidates. During prediction, we choose the candidate headline with the highest score. Section-title prediction uses the same procedure, with Wikipedia sections and section titles in place of news articles and headlines.

Named Entity Recognition. Each sentence is fed as a single sequence to the model. For every token, a softmax layer at the output computes a probability distribution over the NER classes. We fine-tune the model using multi-class cross entropy loss.

Cloze-style Multiple-choice QA. We feed the masked text segment as input to the model, and a softmax layer at the output predicts a probability distribution over the given candidates. We fine-tune the model using cross entropy loss with the target label 1 for the correct candidate and 0 for the incorrect candidates.

Cross-lingual Sentence Retrieval. No fine-tuning is required for this task. We compute the representation of every sentence by mean-pooling the outputs of the last hidden layer and then use cosine distance to compute similarity between sentences (Libovický et al., 2019). Additionally, we center the sentence vectors within each language to remove language-specific bias in the vectors (Reimers and Gurevych, 2019).
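The retrieval pipeline (mean-pooling, per-language centering, cosine-based nearest neighbour) can be sketched in a few lines; real usage would of course pool actual last-layer hidden states from the model rather than these toy vectors:

```python
import math

def mean_pool(token_vectors):
    """Sentence vector = mean of the last-layer token vectors."""
    dim, n = len(token_vectors[0]), len(token_vectors)
    return [sum(v[d] for v in token_vectors) / n for d in range(dim)]

def center(vectors):
    """Subtract the per-language mean vector to remove language-specific bias."""
    dim, n = len(vectors[0]), len(vectors)
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    return [[v[d] - mean[d] for d in range(dim)] for v in vectors]

def retrieve(query_vec, candidate_vecs):
    """Index of the candidate with the highest cosine similarity to the query."""
    def cos(u, v):
        num = sum(a * b for a, b in zip(u, v))
        den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return num / den if den else 0.0
    return max(range(len(candidate_vecs)), key=lambda i: cos(query_vec, candidate_vecs[i]))
```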

Winograd NLI, COPA, Paraphrase Detection.
We input the sentence pair into the model as segment A and segment B. The [CLS] representation from the last layer is fed into an output layer for classification into one of the categories.

News Category Classification, Discourse Mode Classification, Sentiment Analysis. We feed the representation of the [CLS] token from the last layer to a linear classifier with a softmax layer to predict a probability distribution over the categories. We fine-tune the model using multi-class cross entropy loss.

Evaluation
We summarize the main observations from our results as reported in Tables 7-10.

Comparison with mBERT and XLM-R. On most tasks, IndicBERT models outperform XLM-R and mBERT. Specifically, IndicBERT models are competitive on the Wikipedia Section-Title prediction task.

A Publicly Available Datasets

• For Hindi, we combined the primary sentiment analysis dataset with the smaller IIT-Bombay and iNLTK datasets.
• The IIT-Patna Movie and Product review datasets have 4 classes, namely positive, negative, neutral and conflict. We ignored the conflict class.
• In the Telugu-ACTSA corpus, we evaluated only on the news line dataset (named telugu_sentiment_fasttext.txt) and ignored all the other domain datasets as they have very few data points.

B IndicGLUE News Category Dataset
The IndicGLUE news category dataset is a collection of articles labeled with news categories. We used this dataset in the evaluation of word embeddings and language models. Table 12 provides the statistics of the dataset.

C IndicGLUE Datasets
We provide some additional statistics for the IndicGLUE datasets in the tables below.

C.5 Model Details
Table 14 compares our models with existing pretrained models.