Analysing cross-lingual transfer in lemmatisation for Indian languages

Lemmatisation aims to reduce the sparse data problem by relating the inflected forms of a word to its dictionary form. However, most prior work on this topic has focused on high resource languages. In this paper, we evaluate cross-lingual approaches for low resource languages, especially in the context of morphologically rich Indian languages. We test our model on six languages from two different families and develop linguistic insights into each model's performance.


Introduction
NLP has seen sharp growth across various frontiers and tasks; for example, today's systems are often required to generate text or to summarise documents. However, the morpheme remains the most basic unit of information for most of these tasks (Otter et al., 2020). Most of the research in these areas has focused on improving the state of the art for high resource languages, whereas research on low resource languages has been slow to start. For Indian languages, this is a major issue: only some of the 22 scheduled Indian languages, themselves a subset of the numerous languages spoken and written in India, have enough resources for training a deep learning model. For the remaining languages, the potential for improvement in performance is substantial.
Most of the current approaches for morphological analysis use cross-lingual transfer learning from a higher resource language to a low resource language (McCarthy et al., 2019). However, the high resource language for transfer learning is still chosen in an ad hoc manner, the most common criterion being phylogenetic distance within the language family (Cotterell and Heigold, 2017; Johnson et al., 2017). It has been shown, though, that languages from the same family do not necessarily share the same linguistic properties (Ahmad et al., 2019).
In this paper, we use different cross-lingual training methodologies and analyse the performance of the resulting source-target language pairs based on different linguistic factors.


Model
Our model is a seq2seq model with separate encoders for the source characters and the morphological tags, following the two-step attention architecture of Anastasopoulos and Neubig (2019). The tags are encoded without positional embeddings, since the tag embeddings should be order-invariant. At each timestep, on the decoder side, two context vectors are created via two different attention matrices over the encoder outputs for the characters and the tags (Luong et al., 2015).
The decoder then computes the output in a two-step process. It first creates a tag-informed state by attending over the tags using the decoder output from the previous timestep. It then computes the state vector by attending over the source characters using this tag-informed state. From the updated state, the output character for that timestep is produced: the state is passed through a fully connected layer followed by a softmax over the character vocabulary.
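A minimal sketch of one decoder timestep under this two-step scheme is shown below. It assumes precomputed encoder outputs for characters and tags, and combines states and context vectors by simple addition for brevity; this is an illustration of the idea, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def attend(query, keys):
    # Dot-product attention (Luong et al., 2015) over encoder states.
    scores = keys @ query             # (seq_len,)
    weights = F.softmax(scores, dim=0)
    return weights @ keys             # context vector, (hidden,)

def decoder_step(prev_state, char_enc, tag_enc, out_layer):
    """One two-step attention decoder step (sketch).

    prev_state: decoder hidden state from the previous timestep, (hidden,)
    char_enc:   encoder outputs for the source characters, (src_len, hidden)
    tag_enc:    encoder outputs for the morphological tags, (num_tags, hidden)
    out_layer:  fully connected layer projecting to the character vocabulary
    """
    # Step 1: tag-informed state, attending over tags with the previous state.
    tag_context = attend(prev_state, tag_enc)
    tag_informed = prev_state + tag_context   # simple combination (assumption)
    # Step 2: attend over source characters using the tag-informed state.
    char_context = attend(tag_informed, char_enc)
    state = tag_informed + char_context       # updated state (assumption)
    # Project and softmax to obtain the output character distribution.
    logits = out_layer(state)
    return state, F.log_softmax(logits, dim=-1)
```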
We also add a structural bias to the attention model that encourages a Markov assumption over alignments: if the i-th source character is aligned to the j-th target character, then alignments from the i-th or (i+1)-th source character to the (j+1)-th target character are preferred.
We refer the reader to Cohn et al. (2016) for more details regarding the structural bias and to Anastasopoulos and Neubig (2019) for more details and explanations about the two-step attention process.
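As a rough illustration of the Markov bias, the sketch below adds a fixed bonus to the attention logits at the previously attended source position and the position immediately after it. The additive form and the `markov_bonus` constant are simplifications for illustration, not the exact parameterisation of Cohn et al. (2016).

```python
import torch

def markov_biased_scores(scores, prev_align, markov_bonus=1.0):
    """Bias raw attention scores toward monotone alignments (sketch).

    scores:     raw attention logits over source positions, (src_len,)
    prev_align: index of the source character attended at the previous
                target timestep
    Positions prev_align and prev_align + 1 receive a bonus, so alignments
    prefer to stay in place or step one character forward.
    """
    biased = scores.clone()
    biased[prev_align] += markov_bonus
    if prev_align + 1 < scores.size(0):
        biased[prev_align + 1] += markov_bonus
    return biased
```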

Data
We collect data for Bengali, Hindi, Kannada, Sanskrit, Telugu, and Urdu from the cross-lingual morphological inflection task of the SIGMORPHON 2019 shared task. Of these, Telugu is the only language that does not have a large dataset. We use the same classification as the shared task for annotating a language as high or low resource.
A detailed description of the dataset that we use for training is provided in Table 1.

Table 1 (caption): Number of inflected-word lemma pairs available for each language. The Total column shows the original number of samples, and the High and Low columns show the curated training dataset sizes in the high and low resource settings respectively. During training, we augment the dataset to 10,000 samples in the low resource setting. LTR: left-to-right, RTL: right-to-left.

We use the alignment method from Cotterell et al. (2016) to generate additional artificial data to augment the low resource datasets. The method relies on substituting multiple possible stems in a word with random sequences of characters of the same length.
For each language, the training data is augmented so that the total training set size, including the original training data, is 10,000 samples.
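A minimal sketch of this augmentation idea is given below. It replaces a hypothesised stem, shared by the inflected form and the lemma, with a random character sequence of the same length in both strings; the stem detection step (longest common substring) is a simplification standing in for the character alignments of Cotterell et al. (2016).

```python
import random
from difflib import SequenceMatcher

def augment(inflected, lemma, alphabet):
    """Replace a shared stem with a random same-length sequence (sketch)."""
    # Treat the longest substring shared by the inflected form and the
    # lemma as the stem (a simplification of the alignment method).
    match = SequenceMatcher(None, inflected, lemma).find_longest_match(
        0, len(inflected), 0, len(lemma))
    if match.size == 0:
        return inflected, lemma  # nothing to substitute
    fake_stem = "".join(random.choice(alphabet) for _ in range(match.size))
    new_inflected = (inflected[:match.a] + fake_stem
                     + inflected[match.a + match.size:])
    new_lemma = lemma[:match.b] + fake_stem + lemma[match.b + match.size:]
    return new_inflected, new_lemma
```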

Cross-lingual training
For the remainder of this section, let L1 be the source (high resource) language and L2 be the target (low resource) language. We use a modified transfer learning method adapted from Artetxe et al. (2020) that transfers learning from a model trained on L1 to the language L2, selecting models based on results on a validation set (see Appendix B for more details).
The entire seq2seq model is broken up into modules: the encoder, decoder, and attention layers (referred to as the EDA module for the remainder of this section) are shared between the source and target languages, while the embedding layers and the dense output layers are separate for each language. Training then proceeds in four phases; the trainable modules in each phase are given in parentheses.

Phase 1, copying phase for L1 (EDA + L1 embeddings + L1 dense output): the model learns to copy characters. This phase is stopped when the copying accuracy reaches 80%. Attention heat maps after this phase show that the attention model has adapted to the structural biases and has learnt monotonicity.

Phase 2, copying phase for L2 (EDA + L2 embeddings + L2 dense output): by learning to copy L2 accurately, we expect the embedding layer to learn proper representations of the characters in L2. This phase is stopped when the copying accuracy crosses 85%.

Phase 3, training phase for L1 (EDA + L1 embeddings + L1 dense output): the L1 embedding weights are frozen and the model is trained on lemmatisation for the high resource language, where it is expected to learn the process of lemmatisation itself.

Phase 4, training phase for L2 (EDA + L2 embeddings + L2 dense output): we fine-tune the model on lemmatisation for L2. The model converges more quickly in this phase than in Phase 3, although the time to convergence varies across language pairs. At the end of each phase, the model with the lowest validation loss is used to initialise the next phase. A total of 25 cross-lingual models are created; since sufficient resources were not available for Telugu, models with Telugu as L1 could not be created.
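The schedule can be summarised with the following sketch; the `train` helper and the module names are hypothetical placeholders for the actual training loop, with the stopping accuracies taken from the phase descriptions above.

```python
def set_trainable(module, flag):
    """Freeze or unfreeze all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = flag

def four_phase_schedule(eda, emb, out, train):
    """Run the four cross-lingual training phases (sketch).

    eda:   shared encoder-decoder-attention (EDA) module
    emb:   dict of per-language embedding layers, e.g. {"L1": ..., "L2": ...}
    out:   dict of per-language dense output layers
    train: hypothetical helper that trains the given modules on a task
           until the stopping criterion is met
    """
    # Phase 1: L1 copying, stopped when copy accuracy reaches 80%.
    train(task="copy", lang="L1", modules=[eda, emb["L1"], out["L1"]],
          stop=lambda acc: acc >= 0.80)
    # Phase 2: L2 copying, stopped when copy accuracy crosses 85%.
    train(task="copy", lang="L2", modules=[eda, emb["L2"], out["L2"]],
          stop=lambda acc: acc >= 0.85)
    # Phase 3: lemmatisation on L1 with the L1 embeddings frozen.
    set_trainable(emb["L1"], False)
    train(task="lemmatise", lang="L1", modules=[eda, out["L1"]])
    # Phase 4: fine-tune lemmatisation on L2; the checkpoint with the
    # lowest validation loss is carried forward at each phase boundary.
    train(task="lemmatise", lang="L2", modules=[eda, emb["L2"], out["L2"]])
```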
All the hyperparameters used are listed in Appendix A. We release all our code online for reproducibility and further research.


Results
Table 2 lists the accuracy of our architecture and of the hard monotonic attention model for different language pairs in both the cross-lingual and the monolingual setting. The hard monotonic attention model in the cross-lingual setting was adapted from Task 1 of the SIGMORPHON 2019 shared task (McCarthy et al., 2019).

Right to Left scripts
We see that both models achieve very low accuracy for Urdu in the extremely low resource setting. Urdu is not effective as a source language in cross-lingual training either: the accuracy values for the target languages lie within the corresponding standard deviation ranges.
To identify the possible source of the low accuracy, we created models with reversed letter order for Urdu, Hindi, and Bengali. The accuracies do not change much for either the cross-lingual models (with Urdu, Hindi, and Bengali as target languages) or the monolingual low resource models. Therefore, a right-to-left writing system is not the primary cause of the low accuracy.
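Constructing these control models only requires reversing the character order of every inflected-form and lemma string in the dataset, as in the following sketch:

```python
def reverse_pair(inflected, lemma):
    # Reverse the character order of both strings for the
    # reversed-script control experiment.
    return inflected[::-1], lemma[::-1]
```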
Instead, we hypothesise that an abjad script is more difficult to learn in a low resource setting, because an abjad omits most vowels, which the model must then infer rather than read explicitly. On running the models on Arabic, which also uses an abjad, we obtain single-digit accuracy in all cases, which supports our claim.

Effect of source languages
Anastasopoulos et al. (2019) suspect that low variance in performance across source languages could be due to different scripts. We confirm this hypothesis through our results here. There are five different scripts distributed among the six languages in our dataset, and for each target language and each model, the deviation in performance across source languages is very small.
For example, Bengali has a standard deviation of only around 1.09 under both architectures, whereas the standard deviation for Sanskrit jumps to 6.30 for the hard attention model. The latter is due to the spike in performance when Hindi, a language very closely related to Sanskrit that uses the same script, is used as the source language.

Performance gain over monolingual models
For a fixed target language, either almost all cross-lingual models perform better than the monolingual model or almost all of them perform worse; that is, the performance of a few cross-lingual models can be generalised to all other source languages for that target language. This is consistent with the observation made in Section 4.2. Note that we always compare the gain or loss of a cross-lingual model against the monolingual model of the same architecture: for instance, Urdu consistently fares worse in our cross-lingual model, while it performs consistently better in the hard monotonic attention cross-lingual model.
Therefore, we claim that in extremely low resource settings, performance gains over monolingual models can be expected either from all source languages or only from languages closely related to the target language. The same result is observed for the morphological inflection task (Anastasopoulos and Neubig, 2019).

Related work
Lemmatisation has been studied extensively (Zeman et al., 2018; Nivre et al., 2017), but on datasets at least an order of magnitude larger than the ones we work with. Recently, there has been a shift to extremely low resource settings with the SIGMORPHON 2019 shared task (McCarthy et al., 2019), which focuses on cross-lingual learning. However, that task targets the reverse direction: given a lemma and a bundle of morphological features, generate the inflected form. To our knowledge, we are the first to study lemmatisation in such a low resource framework.

Conclusion
Scripts that require vowel inference, such as abjads, can be difficult for models to learn in extremely low resource scenarios. For other scripts, it is difficult to predict whether cross-lingual models will fare better than monolingual ones. In general, for a given low resource target language, the performance obtained with one source language is a good predictor of the gain or loss to expect from other source languages.

A Hyperparameters
All our models were trained on a single 12 GB Nvidia GeForce GTX Titan X GPU. We use the Adam optimiser with the default parameters except for the learning rate. The training time for each model was between 1 and 3 hours.
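In PyTorch terms, this optimiser setup corresponds to something like the following sketch; the learning rate value shown is a placeholder, not the exact value used for the experiments.

```python
import torch

# `model` is the seq2seq lemmatiser defined elsewhere.
# Adam with default parameters; only the learning rate is overridden.
# lr=1e-3 is a placeholder value, not the paper's exact setting.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```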