How low is too low? A monolingual take on lemmatisation in Indian languages

Lemmatisation aims to reduce the data sparsity problem by relating the inflected forms of a word to its dictionary form. Most prior work on machine learning based lemmatisation has focused on high resource languages, where data sets (word forms) are readily available. For languages with no available linguistic work, especially on morphology, or languages where the computational realisation of linguistic rules is complex and cumbersome, machine learning based lemmatisers are the way to go. In this paper, we devote our attention to lemmatisation for low resource, morphologically rich scheduled Indian languages using neural methods. Here, low resource means that only a small number of word forms is available. We perform tests to analyse the variance in monolingual models' performance as the corpus size and contextual morphological tag data used for training are varied. We show that monolingual approaches with data augmentation can give competitive accuracy even in the low resource setting, which augurs well for NLP in low resource settings.


Introduction
Natural Language Processing (NLP) has seen remarkable growth in all its sub-areas, such as machine translation, summarization, and question answering. For all these tasks, though, morphemes remain the most basic unit of information (Otter et al., 2020). Morpheme identification (lemma and affixes) can assist these large, useful applications by mitigating the data sparsity problem.
Good lemmatisers are invaluable tools for handling the large vocabularies of morphologically rich languages and thereby boosting performance in downstream tasks, but techniques are limited by resource availability. This point is especially relevant for Indian languages: as many as 197 Indian languages appear in UNESCO's Atlas of the World's Languages in Danger (2010). Even among the 22 scheduled languages of India, there is a wide disparity in resource availability, for example between Konkani and Kashmiri (Rajan et al., 2020; Islam et al., 2018).

* These authors contributed equally to this work.
Techniques like the Porter stemmer are indeed quick solutions, but they suit only languages with alphabetic scripts, like English, and not those with abugida scripts, like Bengali (Ali et al., 2017), or abjad scripts, like Urdu (Kansal et al., 2012). Moreover, creating stemmers requires a different language-specific stemming algorithm for each language. This requirement of language-specific measures stands in the way of scaling up the enterprise of creating stemmers for the thousands of languages that exist in the world. One might think of machine learning for stemming, for example, training a neural network on stems and word forms; but almost none of the 22 scheduled Indian languages, which are just a subset of the numerous languages spoken and written in India, have resources sufficient for training deep models (Bhattacharyya et al., 2019). For a majority of Indian languages, the absence of dictionaries compounds the problem.
Most of the current approaches for morphological analysis use the idea of cross-lingual transfer learning from a higher resource language to the low resource language of interest (McCarthy et al., 2019). We show that even monolingual models can consistently perform with high accuracy with as few as 500 samples, without cross-lingual training of neural models and without structured information such as dictionaries. We further demonstrate good performance in the extremely low resource setting with as few as 100 training samples, and show competitive performance against cross-lingual models in the same setting. In Zeman et al. (2018), lemmatisation was performed for small treebanks by exploiting the common annotation standard across all languages, and the same task was implicit in Nivre et al. (2017). Recently, there has been a shift to extremely low resource settings with the SIGMORPHON 2019 shared task (McCarthy et al., 2019) focusing on cross-lingual learning. However, that task focuses on the reverse direction: given a lemma and a set of morphological features, generate the target inflected form.

Models
A two-step attention process (Anastasopoulos and Neubig, 2019), similar to that used for the SIGMORPHON 2019 morphological inflection task (McCarthy et al., 2019), has been adapted for our setup. It consists of four components: an encoder for the morphological tags, an encoder for the character sequence, attention, and a decoder.
The inputs to the model are inflected words and morphological tags, and we use single-layer bidirectional LSTMs with self-attention and no positional embeddings as encoders. At each time step during decoding, two context vectors are created via two different attention matrices over the encoder outputs for the inflected word and the morphological tags.
At the decoder, we use a two-step process: first, we create a tag-informed state by attending over the tags using the decoder output from the previous time step. Second, we use this state to attend over the source characters to produce the decoder state vector for that time step, which is used to produce the output character for that time step via a fully connected layer followed by a softmax.
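The two-step process can be sketched as follows. This is a schematic, dimension-agnostic illustration in plain Python, not the released implementation: the function names are ours, and plain dot-product attention stands in for the learned attention matrices.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, keys, values):
    """Dot-product attention: weight the value vectors by the
    softmax-normalised similarity of each key to the query."""
    weights = softmax([dot(query, k) for k in keys])
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

def two_step_decoder_state(prev_state, tag_encodings, char_encodings):
    # Step 1: tag-informed state, attending over the tag encodings
    # with the previous decoder state as the query.
    tag_context = attend(prev_state, tag_encodings, tag_encodings)
    # Step 2: use the tag-informed state to attend over the source
    # character encodings, yielding this time step's decoder state
    # (which would then feed a fully connected layer + softmax).
    return attend(tag_context, char_encodings, char_encodings)
```

In the actual model the queries and keys pass through learned projections; the sketch only shows the order of the two attention steps.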
We also add a structural bias to the attention model that encourages a Markov assumption over alignments; that is, if the i-th source character is aligned to the j-th target character, then alignments from the i-th or (i+1)-th source character to the (j+1)-th target character are preferred.
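As a simplified illustration of this preference (Cohn et al. (2016) realise the bias through learned features inside the attention scorer; the additive bonus below is our own reduction of the idea), one can boost the raw attention scores at the positions that continue a monotone alignment:

```python
def biased_scores(raw_scores, prev_align, bonus=1.0):
    """Add a bonus to attention scores at source positions that extend a
    monotone alignment: if the previous target character aligned to source
    position `prev_align`, staying there or advancing by one is preferred."""
    return [s + (bonus if j in (prev_align, prev_align + 1) else 0.0)
            for j, s in enumerate(raw_scores)]
```

After this adjustment the scores would be normalised by a softmax as usual, so the bonus shifts probability mass toward near-diagonal alignments without forbidding jumps.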
We refer the reader to Anastasopoulos and Neubig (2019) for more details and explanations about the two-step attention process and Cohn et al. (2016) for more details regarding structural bias.

Data
From the SIGMORPHON 2019 shared task, we collect language data from the multilingual morphological inflection task for Bengali, Hindi, Kannada, Sanskrit, Telugu, and Urdu. Of these, Telugu is the only one without a large data set of inflected word forms. We use the same categorization of high and low resource languages as SIGMORPHON. Each training sample is a triplet (inflected word, lemma, tag), where the tag is the set of morphological features for the inflected word. A detailed description of the data sets used for training is provided in Table 1. We create the smaller data sets from the high-resource data sets using the sampling method, based on probability distributions, described in Cotterell et al. (2018). When training on the smaller data sets, we use the augmentation method of Cotterell et al. (2016). This augmentation method substitutes the stem of a word with a random sequence of characters of the same length.
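The augmentation step can be sketched roughly as follows. Note that Cotterell et al. (2016) identify the stem via character alignment; this illustrative version (function and variable names are ours) approximates the stem with the longest common prefix of the inflected form and the lemma.

```python
import random

def hallucinate(inflected, lemma, alphabet, rng=random):
    """Replace the shared stem of an (inflected word, lemma) pair with a
    random character sequence of the same length, keeping affixes intact.
    The stem is approximated here by the longest common prefix."""
    stem_len = 0
    while (stem_len < min(len(inflected), len(lemma))
           and inflected[stem_len] == lemma[stem_len]):
        stem_len += 1
    fake_stem = "".join(rng.choice(alphabet) for _ in range(stem_len))
    return fake_stem + inflected[stem_len:], fake_stem + lemma[stem_len:]
```

Because the hallucinated pair preserves the affix pattern, the model can learn the inflection rule from many synthetic stems even when real word forms are scarce.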
We also annotate data sets with tag information to create multiple data sets for analysing the effects of data set size and the importance of tag information on the accuracy of the models.

Training
The model runs in two phases:

Main Phase
The training tuple (X, Y, T), i.e., inflected word, lemma, and morphological tag set, is fed into the system, and the model learns the distribution over the data. A cool-down period is used during training to improve the accuracy of the model. We also employ early stopping with a higher threshold than the cool-down period, so that training stops when no further progress is possible.
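The stopping logic can be sketched as follows, assuming development-set accuracy is tracked per epoch (the function and its exact form are our illustration, not the released implementation):

```python
def should_stop(dev_accuracies, cooldown, patience):
    """Given dev accuracy per epoch, return (cool_down, stop).
    The learning rate cools down after `cooldown` epochs without
    improvement; training halts after `patience` epochs without one
    (with patience set higher than cooldown, as in our setup)."""
    best_epoch = max(range(len(dev_accuracies)),
                     key=dev_accuracies.__getitem__)
    stale = len(dev_accuracies) - 1 - best_epoch
    return stale >= cooldown, stale >= patience
```

Keeping the early-stopping threshold above the cool-down threshold gives the model a chance to recover at the reduced learning rate before training is abandoned.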
Hyperparameters for our models are discussed in appendix A.1. We also release all our code online for reproducibility and further research (https://github.com/krsrv/lemmatisation).

Variation with number of training word-pairs
We create three models for each training set size, trained with (1) no morphological features, (2) basic PoS tag data, and (3) all morphological features. We report accuracies over complete string matching for our experiments. Figure 1 shows the graphs of accuracy versus data set size. When the complete set of morphological features is included in training, most languages achieve extremely high accuracy (at least 95%, except for Kannada), even when data set sizes are as small as 1000. When the data set size is 500, accuracies drop to the range 80-90% but are still competitive with rule-based systems (Prathibha and Padma, 2015). However, performance drops drastically when the data set size is reduced to 100. Performance on the augmented data sets shows a marked increase in accuracy over the unaugmented 100 training samples, but is still below that of models trained on 500 samples.
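Complete string matching means a prediction counts as correct only when it equals the gold lemma exactly; a minimal sketch of the metric:

```python
def exact_match_accuracy(predicted, gold):
    """Accuracy by complete string matching: a predicted lemma is correct
    only if it is character-for-character identical to the gold lemma."""
    assert len(predicted) == len(gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```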
Telugu is not included in Figure 1 due to the lack of training samples. We train only one model over the available 61 samples (augmented to 10,000). The model achieves an accuracy of 80% on the SIGMORPHON Task 1 test set for Telugu.

Variation with morphological information
Comparing Figures 1(a) and 1(b), we see that tag data does not provide substantial additional information to the model when the data set size exceeds 2000, barring the case of Sanskrit. At 500, there is a spike in accuracy for Sanskrit, which is probably explained by the fact that Sanskrit is a morphologically and semantically systematic language with very few ambiguities (evident from its linguistic and grammar text, the Aṣṭādhyāyī of Pāṇini), and is thus the language most responsive to augmentation with tag data. Below 4000, the morphological tag data substantially improves accuracy. Sanskrit and Kannada both show worse results than the other languages, which is likely due to the complex inflection patterns in both languages.
The gains from including tag information are better visualised in Table 2. A negative value in the table indicates that the model's performance decreases in the absence of tag data. In general, full-tag informed models perform best, followed by basic PoS tag informed models and finally models without tag information.
The table also shows that the importance of tag data increases considerably as the training set size decreases. However, an anomaly occurs with 100 training samples, where the absence of tag information improves performance. A possible explanation is that the number of training samples is too low for the model to learn what to focus on effectively. This anomaly disappears when we augment the data before training the model.
Note that achieving 100% accuracy on lemmatization without any tag information is not possible at any data set size. Some words can have multiple lemmas and require context for disambiguation: की (kii) can map to either करना (karana) or का (kaa), depending on whether it is used as a verb or a postposition.

Comparison with cross-lingual models
We also train cross-lingual models using the same method as for monolingual training, incorporating the training procedure described by Artetxe et al. (2020) (the hyperparameters are listed in appendix A.2). We simulate a low resource language by choosing 100 samples at random and using all the other languages as high resource languages. Macro-averaged accuracy for a simulated low resource language shows that monolingual models give accuracies comparable to cross-lingual models, with the exception of Hindi. Performance on Sanskrit and Urdu, especially Urdu, is better when the monolingual models are used.
The complete list of accuracies for the cross-lingual models is given in Table 3. The macro-averaged difference between the cross-lingual and monolingual models is -2 in the cross-lingual models' favor.

Conclusion
In this paper, we have given a methodology for the lemmatization of low resource languages (i.e., languages for which only a small number of word forms is available). For most languages, a monolingual model trained on approximately 1000 training samples gives competitive accuracy, while training on 500 samples gives results on par with rule-based linguistic systems. In extremely low resource settings as well, monolingual models perform well with the help of data augmentation. Even in these scenarios, monolingual models can give results competitive with cross-lingual models, a finding supported by research on other tasks such as morphological inflection (Anastasopoulos and Neubig, 2019).
Additionally, in the low resource setting, additional features are an important source of information. Even PoS tags benefit the training process.

Areas of improvement
The model currently does not exploit any available linguistic knowledge to improve its performance. Incorporating morphological rules or using bilingual knowledge to create transfer models could yield accuracy gains (Gebreselassie et al., 2020; Faruqui et al., 2015). Moreover, transformers have been shown to improve performance on character-level tasks and would be an applicable method here (Wu et al., 2020). Another potential area of improvement is the use of different data hallucination techniques, such as that of Shcherbakov et al. (2016), which uses phonetics instead of relying on characters for predictions.

Ethical Considerations
The work in this paper can be useful for extending the reach of language understanding to ethnic and local languages. This can consequently bring these low-resource language domains within the umbrella of widespread NLP applications on edge computing devices. By focusing on low-resource domains, we learn how lightweight models fare in these settings, potentially leading to reductions in model sizes, training time, compute costs, etc., which is a significant step towards containing energy and carbon costs.
Such developments also spur the progress of languages and the civilisations associated with them by bringing them into the fold of advanced technology, thereby promoting a more equitable distribution of technology and quality of life across the globe.