Morphology-Aware Meta-Embeddings for Tamil

In this work, we explore generating morphologically enhanced word embeddings for Tamil, a highly agglutinative South Indian language with rich morphology that remains low-resource with regard to NLP tasks. We present the first word analogy dataset for Tamil, consisting of 4499 hand-curated word tetrads across 10 semantic and 13 morphological relation types. Combining a rules-based morphological segmenter with meta-embedding techniques, we train meta-embeddings that outperform existing baselines by 16% on our analogy task and appear to mitigate a previously observed trade-off between semantic and morphological accuracy.


Introduction
Continuous-space word embedding methods such as word2vec (Mikolov et al., 2013) have proven very useful for a wide range of NLP tasks. However, representations that treat each word holistically face inherent limitations when applied to morphologically rich languages, and methods have accordingly been designed to incorporate subword information (Cotterell and Schütze, 2015; Luong et al., 2013). Among these, the fastText embeddings remain among the best-known (Bojanowski et al., 2017; Grave et al., 2018), using character n-grams to approximate word-internal structure.
In this work, we focus on producing morphology-aware embeddings for Tamil, a Dravidian language with over 68 million speakers across India, Sri Lanka, Malaysia, and Singapore (Wikipedia, 2020). Despite its large speaker base, Tamil remains a low-resource language for NLP, and standard resources for evaluating word embeddings, such as the word analogy task (Mikolov et al., 2013), are almost entirely lacking. To facilitate our work, we therefore present a novel, human-curated analogy dataset consisting of 4499 analogy tetrads.
With regard to morphology, Tamil is highly agglutinative, encoding grammatical features such as gender, number, and case in single words comprising large sequences of compounded morphemes. Approaches such as character n-grams that generically incorporate subword information may be too coarse when working with Tamil, due to short morpheme lengths paired with high similarity between morphemes and sandhi across morpheme boundaries. The high frequency of 'false morphemes' or character sequences resembling morphemes in non-productive situations compounds this further. In our work, we attempt to tailor embeddings to Tamil morphology with the incorporation of a rules-based morphological segmenter.
We present three primary contributions: (1) We present the first-ever human-generated analogy dataset for Tamil, capturing both semantic and morphological analogies.
(2) We construct a set of novel word embeddings for Tamil that incorporates morphological segmentation and outperforms existing baselines trained on the same corpus.
(3) Finally, we show that meta-embedding methods used in conjunction with linear dimension reduction can mitigate previously observed trade-offs between capturing semantic and morphological/syntactic information in embeddings (Avraham and Goldberg, 2017;Qiu et al., 2014).
Our dataset, embeddings, and experiments are all publicly available with documentation at our GitHub repository.


Related Work
Morphological word embeddings. Approaches to incorporating morphology into embeddings have been shaped by the need for accurate morphological annotation. Luong et al. (2013) used a recursive neural network to combine learned representations of individual morphemes, relying on the toolkit Morfessor (Creutz and Lagus, 2007) for unsupervised morphological segmentation. Cotterell and Schütze (2015) utilized a hand-annotated, morphologically labelled corpus to train embeddings to predict morphological tags and thereby encode morphology.
Morphology for low-resource languages. There has been some recent work focusing on morphological incorporation for low-resource agglutinative languages such as Turkish and Uyghur (Pan et al., 2020). Kumar et al. (2017) focused entirely on Dravidian languages, creating a corpus partially annotated for morphological segmentation and POS tagging.
Meta-embeddings. It has been observed that various methods of combining word embeddings into meta-embeddings can combine the strengths of individual embeddings to improve performance. Such methods include concatenation (Yin and Schütze, 2016), averaging or summing (Coates and Bollegala, 2018), and constructing a vector of complex numbers (Wittek et al., 2013).
Dimension reduction for word representations. Yin and Schütze (2016) also observed that PCA could reduce the dimension of meta-embeddings without significantly hurting performance. In a similar vein, Mu and Viswanath (2018) found that removing the top few principal components improved performance. Raunak et al. (2019) found that composing these two methods improved dimension reduction, often producing embeddings that even outperformed the original embeddings.
Tamil word embeddings. Tamil word embeddings remain a relatively under-explored space in the literature. The well-known fastText embeddings contain Tamil embeddings in both iterations (Bojanowski et al., 2017;Grave et al., 2018), and represent the state-of-the-art. Kumar et al. (2020) produced a range of embeddings using conventional methods on corpora they produced for 14 Indian languages including Tamil.
Word analogy datasets. The state-of-the-art word analogy dataset in English remains the Google analogy test set developed by Mikolov et al. (2013). Similar datasets have been produced for Spanish (Cardellino, 2016), Russian and Arabic (Abdou et al., 2018), and Chinese (Jin and Wu, 2012), among others. Hindi is the only South Asian language with a high-quality word analogy dataset to date (Grave et al., 2018), which incorporates particular forms of culturally linked analogies such as kinship terms. In our work, we attempt to similarly capture language-specific analogies for Tamil.

Atomization
Given that we consider methods that decompose a word into either character n-grams or morphemes, we treat both as special cases of decomposing a word into constituent 'atoms', a process we call 'atomization'. An atomizer takes in a word w and outputs a sequence S(w) of atoms. We detail the two main atomizers used in generating our models here: (1) The first, henceforth called character 5-grams, follows the original fastText papers (Bojanowski et al., 2017; Grave et al., 2018), in which a word's atoms are its character 5-grams, as well as the entire word itself.
(2) In our second method, which we designate morphemes + stem (1-3)-grams, we modify a pre-existing Tamil stemmer 1 into a rules-based morphological segmenter using sbl2py 2. The segmenter maps each word to its stem and its sequence of morphemes. A word's atoms are then its stem, its constituent morphemes, and the character (1-3)-grams of the stem. We examine the segmenter's behavior closely in section 3.4.
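The two atomizers can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `segment` callable stands in for the sbl2py-converted rules-based segmenter, and the boundary-marker convention follows fastText.

```python
# Sketch of the two atomizers. `segment` (word -> (stem, [morphemes]))
# is a hypothetical stand-in for the rules-based morphological segmenter.

def char_ngrams(word, n):
    """Character n-grams of `word`, with fastText-style '<' '>' markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def atomize_char5(word):
    """Atomizer (1): character 5-grams plus the whole word itself."""
    return char_ngrams(word, 5) + [f"<{word}>"]

def atomize_morpho(word, segment):
    """Atomizer (2): stem, constituent morphemes, and (1-3)-grams of the stem."""
    stem, morphemes = segment(word)
    grams = [stem[i:i + n] for n in (1, 2, 3) for i in range(len(stem) - n + 1)]
    return [stem] + morphemes + grams
```

For example, with a toy segmenter mapping marattai to (maram, [ai]), atomizer (2) would emit the stem maram, the accusative suffix ai, and all 1- to 3-grams of maram.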

Training
Our setup slightly extends that of Bojanowski et al. (2017), allowing atoms produced by any atomization method to fill the role played by character n-grams in the original paper. The model's trainable parameters are the embeddings z_a for the individual atoms and the output vectors v'_w. Following Bojanowski et al. (2017), we sum the atom vectors to obtain the input vector for the word: v_w = Σ_{a ∈ S(w)} z_a.

1 https://github.com/rdamodharan/tamil-stemmer
2 https://github.com/torfsen/sbl2py

Figure 1: A visualization of a single word's embedding with morphemes + stem (1-3)-grams atomization. The segmenter breaks the word down into morphemes, which together with (1-3)-grams of the stem are our final atoms. The sum of the atom embeddings (which are updated throughout training) is the overall embedding.
The relationship between atom embeddings and the overall word embeddings during training is visualized in Figure 1. Each word has its own output vector that does not depend on its atoms. Using a large text corpus (in this case Wikipedia), we train these embeddings with the skip-gram objective and negative sampling applied to the input and output vectors v_w, v'_w, as in Mikolov et al. (2013).
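The forward pass described above can be sketched in a few lines: a word's input vector is the sum of its atom embeddings, and each (word, context) pair is scored with the standard negative-sampling logistic loss. All names and dimensions here are illustrative, not the paper's code.

```python
import numpy as np

def word_input_vector(atoms, z):
    """Sum the embeddings of a word's atoms (z: dict atom -> vector)."""
    return sum(z[a] for a in atoms)

def neg_sampling_loss(v_w, v_ctx, v_negs):
    """Skip-gram negative-sampling loss for one (word, context) pair:
    -log sigma(v_w . v'_ctx) - sum_n log sigma(-v_w . v'_n)."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    loss = -np.log(sigmoid(v_w @ v_ctx))
    for v_n in v_negs:
        loss -= np.log(sigmoid(-(v_w @ v_n)))
    return loss

# Toy usage with random 8-dimensional atom embeddings.
rng = np.random.default_rng(0)
z = {a: rng.normal(size=8) for a in ["<mara", "maram", "ai"]}
v_w = word_input_vector(["<mara", "maram", "ai"], z)
loss = neg_sampling_loss(v_w, rng.normal(size=8), [rng.normal(size=8)])
```

In training, the gradient of this loss flows back into every atom embedding z_a of the word, which is how shared morphemes tie related word forms together.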

Constructing meta-embeddings
Our key primitive for constructing meta-embeddings is a merging operation (Algorithm 1) that takes two separate sets of d-dimensional word embeddings as input and outputs another set of d-dimensional embeddings. It does this by concatenating the two sets of embeddings, then applying PCA to reduce the result to the desired dimensionality. This procedure is visualized in Figure 2.
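A minimal sketch of this merging primitive, with PCA implemented via an SVD of the centered, concatenated matrix (the paper's Algorithm 1 may differ in details such as whitening or library choice):

```python
import numpy as np

def merge(E1, E2, d=None):
    """Merge two aligned (vocab, d) embedding matrices: concatenate along
    the feature axis, then PCA back down to d dimensions."""
    if d is None:
        d = E1.shape[1]
    X = np.concatenate([E1, E2], axis=1)   # (vocab, 2d)
    X = X - X.mean(axis=0)                 # center before PCA
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:d].T                    # project onto top-d components

# Toy usage: two random 4-dimensional embedding sets over 100 words.
rng = np.random.default_rng(0)
E1, E2 = rng.normal(size=(100, 4)), rng.normal(size=(100, 4))
M = merge(E1, E2)                          # still (100, 4)
```

Because the output has the same dimensionality as each input, the operation can be nested, e.g. merging two already-merged embedding sets.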
Our final embeddings are defined as follows. We train one set of embeddings with the character 5-grams atomization; we henceforth refer to this model as "fastText" and call its input and output embedding matrices FT_i and FT_o respectively. We then train another set of embeddings with the morphemes + stem (1-3)-grams atomization, which we refer to as "MorphoSeg", and label its input and output embedding matrices MS_i and MS_o. Our final embeddings are the columns of the matrix Merge(Merge(FT_i, FT_o), Merge(MS_i, MS_o)).

Analysis of rules-based segmenter
Here we discuss the strengths and weaknesses of the segmenter (introduced in section 3.1) as a core part of our methodology.
Strengths. We find that the segmenter performs well, correctly identifying a wide range of morphemes. In particular, it almost always correctly breaks down inflectional increments across morpheme boundaries, for instance marattai 'tree (ACC)' −→ maram 'tree' + ai (accusative suffix). Additionally, long agglutinative compounds are often broken up correctly, for instance ezudappaṭukiṟadu −→ ezuda + paṭu + kiṟa + du.
Weaknesses. However, we also find several distinct failure modes. The segmenter undersegments some morphemes (e.g. it cannot separate multiple stems in one word), oversegments 'false' morphemes (in words containing homophones of morphemes, e.g. paccai 'green', which happens to end in ai, the accusative suffix), elides certain morphemes, and struggles with irregular forms. While a thorough analysis of the segmenter is beyond the scope of this paper, we anticipate that a better segmenter could substantially improve our gains on morphological tasks. More details are provided in our code.

Dataset
One of the primary contributions of this work is our novel Tamil analogy dataset, the first available for the language. The dataset is built from a hand-crafted set of 426 word pairs produced by the authors, split across 10 semantic and 13 morphological relation types (see Appendix A). Analogy tetrads are then generated combinatorially by combining pairs from the same class.
Given that Tamil is a low-resource language, automated construction of an analogy dataset is relatively infeasible: lexicons rarely list fully inflected or morphologically complex forms, and using a segmenter would subject the dataset to limitations similar to those discussed in section 3.4. The decision to produce a human-curated analogy dataset was thus motivated by the desire for a gold-standard analogy task resource.
As a result of Tamil's morphological richness, even semantic relations often contain pairs with similar morphology. We therefore clustered word pairs within each relation type that share identical morphology into labelled sub-classes. Analogies produced from morphologically identical pairs, along with analogies from morphological relations, were sorted into the Subword category, while semantic tetrads produced from morphologically non-identical pairs were placed in the Non-subword category. Table 3 shows 4 examples of word tetrads for the semantic and morphological categories respectively, with two word pairs given per relation. The full dataset contains 4499 analogy tetrads: 3487 across 19 relation types in the Subword category and 1012 across 10 relation types in the Non-subword category. A complete list of relation types with examples (and numerical distributions across the full dataset, development, and test sets) can be found in Appendix A.
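The combinatorial construction can be sketched as follows. This is a hypothetical simplification: within one relation type, every unordered pair of word pairs yields one analogy tetrad a : b :: c : d, which would then be routed to the Subword or Non-subword category depending on the morphology sub-classes of its two pairs.

```python
from itertools import combinations

def make_tetrads(pairs):
    """Generate all analogy tetrads (a, b, c, d) from word pairs (a, b)
    sharing a single relation type."""
    return [(a, b, c, d) for (a, b), (c, d) in combinations(pairs, 2)]

# Toy usage: 3 pairs of one relation type yield C(3, 2) = 3 tetrads.
capitals = [("paris", "france"), ("tokyo", "japan"), ("delhi", "india")]
tetrads = make_tetrads(capitals)
```

In general, n pairs in a relation type yield n(n-1)/2 tetrads, which is how 426 pairs expand into 4499 analogies.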
We note that our segmenter does not correctly identify all morphological relations; MorphoSeg may therefore not improve overall performance even in the Subword category. However, we will see in Section 6 that it significantly improves performance on certain morphological relations, and that our final meta-embedding absorbs these strengths to substantially improve overall performance (across relations and categories).

Training corpus and analogy task
We extracted a corpus from the Tamil Wikipedia dump of April 20, 2020 (comprising 133,732 articles) and shuffled the sentences to obtain our final training corpus. A copy of the dump is available at our GitHub repository. We note that while Wikipedia may not capture the full range of Tamil inflectional morphology (being dominated by present/past tense and third-person conjugations), it captures rich derivational morphology, providing a good source of productive morphological diversity for our embeddings.
Our evaluation measured performance on a word analogy task: 'guessing' a missing word in each tetrad, computed with the gensim most_similar function (Řehůřek and Sojka, 2010). Correctness on each analogy tetrad was measured by top-k accuracy for each k ∈ {1, 5, 10}. From this, top-k accuracies were computed for each relation type; these were averaged within the Subword and Non-subword categories, and overall model performance was measured by averaging the two figures. For brevity, we present only top-10 results here, but we observe qualitatively similar behavior for top-1 and top-5 accuracy; details are provided in Appendix C. We used a 75/25 dev/test split.

(Table note: in Tamil verbal and noun forms, even morphologically identical forms can vary in the way they append to a verbal/noun root (as in rows 7 and 8), and multiple morphemes often exist for a given meaning.)
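The 'guessing' step can be sketched as follows: for a tetrad (a, b, c, d) we query the vector b − a + c and count a hit if d appears among the k nearest words by cosine similarity, mirroring gensim's most_similar with positive=[b, c], negative=[a] (which likewise excludes the query words). The embedding matrix and vocabulary here are toy stand-ins.

```python
import numpy as np

def top_k_hit(E, vocab, a, b, c, d, k=10):
    """True if d is among the k nearest neighbors of b - a + c by cosine
    similarity over unit-normalized embeddings, excluding a, b, c."""
    idx = {w: i for i, w in enumerate(vocab)}
    X = E / np.linalg.norm(E, axis=1, keepdims=True)
    q = X[idx[b]] - X[idx[a]] + X[idx[c]]
    sims = X @ (q / np.linalg.norm(q))
    for w in (a, b, c):                    # exclude the query words
        sims[idx[w]] = -np.inf
    return idx[d] in np.argsort(-sims)[:k]

# Toy embedding space in which the classic analogy holds.
vocab = ["man", "king", "woman", "queen", "rock"]
E = np.array([[1.0, 0, 0], [1, 0, 1], [0, 1, 0], [0, 1, 1], [0, 0, -1]])
hit = top_k_hit(E, vocab, "man", "king", "woman", "queen", k=1)
```

Per-relation top-k accuracy is then simply the fraction of that relation's tetrads for which this check succeeds.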

Implementation details
We used our training corpus to produce 300-dimensional embeddings. Our training code is based on Tzu-Ray Su's PyTorch implementation of word2vec 6. Further hyperparameters are provided in Appendix B.
In evaluating our models, we were unable to use the full set of 4499 analogy tetrads, as many contained out-of-vocabulary (OOV) words due to the limitations of the Wikipedia training corpus. As our models cannot handle OOV tokens, we filtered the dataset for applicable tetrads. After filtering, 1576 analogies (45.2%) across 13 relation types remained in the Subword category, and 794 analogies (78.5%) across 9 relation types in the Non-subword category.
We also trained a standard word2vec skip-gram model (Mikolov et al., 2013) as a baseline.

Results and Analysis
Results on the test set are shown in Figure 4. Our model was the strongest among those evaluated in both the Subword and Non-subword categories. First, we examine the individual sets of word embeddings in section 6.1 and observe that they differ substantially in their success modes. In particular, we show that incorporating the morphological segmenter appears to significantly boost performance on certain morphological relations. Second, we analyze our meta-embeddings in section 6.2. Counterintuitively, we find that meta-embeddings in fact improve when reduced to the same dimension as our original embeddings, seemingly combining the strengths of the different representations.

6 https://github.com/ray1007/pytorch-word2vec

Comparing individual models
'fastText, input' was the strongest individual model in both categories by a substantial margin. However, on some relations in the dataset it fell short and other individual models performed better. As expected, 'MorphoSeg, input' was very effective on morphological relations that our segmenter correctly identified. More interestingly, both 'MorphoSeg, output' and 'fastText, output' outperformed 'fastText, input' across the kinship relations in the Non-subword category. We hypothesize that kinship relations contain word pairs that rarely share subword information, so output vectors were more successful as they do not explicitly use subword-based atomization. Results on some such relations are shown in Figure 5. This illustrates that the four individual embeddings had complementary success modes, suggesting the applicability of meta-embedding methods. Furthermore, the complementary strengths of models across relations appeared to fall along the lines of previously observed semantic-morphological trade-offs (Avraham and Goldberg, 2017), which warrants further investigation.

Improvements from meta-embeddings
The results highlight that both concatenation and PCA were highly effective in increasing performance. Each of Concat(MS i , MS o ) and Concat(FT i , FT o ) performed at least as well as each of their constituent individual models in both categories. Moreover, the PCA step (between Concat and Merge) consistently improved upon the Concat models by around 5% in both categories.
Examining the results of our final meta-embedding on each relation, as in Figure 5, reveals that it drew on the complementary success modes of the individual models, thus mitigating the semantic-morphological trade-offs between them. On most relations, the meta-embedding performed at least as well as the best individual model, if not substantially better.

Conclusion and Future Work
In this paper, we investigated directly incorporating morphology into Tamil word embeddings using a morphological segmenter, following Bojanowski et al. (2017) in computing representations for subword units. To evaluate performance, we constructed a word analogy dataset for Tamil consisting of 13 types of morphological relations and 10 types of semantic relations. We combined individual models to obtain more versatile meta-embeddings that appear to overcome previously observed trade-offs.
It remains for future work to investigate the performance of our techniques on OOV words, and the improvements better morphological segmentation might bring. Evaluating our embeddings on other tasks for Indian languages, such as Akhtar et al. (2017)'s Tamil word similarity dataset, remains an important direction, as does studying the importance of incorporating morphology for downstream tasks such as POS tagging and NMT (Kumar et al., 2020). Exploring the applicability of our pipeline of morphological segmentation and meta-embeddings to other morphology-rich languages is another avenue for future work.

A Dataset Details
In this section of the Appendix, we provide details of our word analogy dataset and its construction. As mentioned in the main text, the dataset comprises 4499 word tetrads split across 10 semantic and 13 morphological categories.

A.1 Relation types
We first provide examples of each relation type we generated words for, with illustrative examples given in Tamil script, Roman transliteration, and English translation. Tables 1 and 2 show the semantic and morphological word pair categories respectively.

A.2 Distribution of pairs across relation types
In Tables 3 and 4, we show the numerical distribution of pairs across categories in the full dataset and in the dataset used for evaluation after filtering of OOV tokens. We attempted to attain as even a spread as possible over analogy categories (and to cover a broad range of the language's morphology).

A.3 Distribution of tetrads across relation types
In Tables 5 and 6, we show the numerical distributions of analogy tetrads over our distinct relation types. We also provide the final dev/test split of our data to show that the relative distributions over relation types were largely maintained, and give the breakdown of analogies across relation types in the original, unfiltered dataset. We briefly review the process by which analogy tetrads are constructed. Within our 10 semantic and 13 morphological relation types, we assign pairs with similar morphology to a class. Word pairs are then combined combinatorially to produce tetrads: tetrads built from pairs of the same class and the same relation type, that is, pairs sharing identical or highly similar morphology, are assigned to the Subword category, and tetrads consisting of two divergent pairs are assigned to the Non-subword category. The idea is to differentiate analogies that can be solved using subword information from those that cannot; we notate this in our results, and capture this distinction across our relation types in Tables 5 and 6.

B Implementation Details

B.1 Atomization
There is a subtle difference between the n-grams taken of the entire word in the character 5-grams atomization and the n-grams taken of the stem in the morphemes + stem (1-3)-grams atomization. Tamil is written in an abugida script, in which vowel values attached to consonants are expressed as diacritics. The original fastText paper, and the character 5-grams method we implemented, separate diacritics from their base consonants, since the diacritics are assigned distinct Unicode characters. For the morphemes + stem (1-3)-grams method, however, we tried taking n-grams both with and without separating diacritics from consonants, and ultimately chose not to separate them. We used a smaller n-gram window of 1-3 to account for this.
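The distinction above can be made concrete: Tamil vowel signs occupy their own Unicode code points, so naive character n-grams split them from their consonants, whereas grouping combining marks with the preceding character keeps them together. A minimal illustration (the `grouped` helper is our own, not part of any library):

```python
import unicodedata

word = "கா"             # TAMIL LETTER KA + TAMIL VOWEL SIGN AA ("kā")
# Two code points, though rendered as a single glyph.

def char_ngrams(s, n):
    """Plain code-point n-grams, which separate vowel signs from consonants."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def grouped(s):
    """Merge each combining mark (Unicode category M*) with the character
    before it, keeping consonant + vowel sign together."""
    out = []
    for ch in s:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out
```

Here char_ngrams(word, 1) yields the consonant and the vowel sign as separate atoms, while grouped(word) keeps them as one unit, which is the behavior we ultimately chose for stem n-grams.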

B.2 Training Hyperparameters
We tabulate all hyperparameters in Table 7. These were mostly unchanged from the defaults used in Tzu-Ray Su's original GitHub repository. The only change was that we used 5 negative samples, following the original fastText setup (Bojanowski et al., 2017).

B.3 Alternative Dimension Reduction Methods
We briefly note that we tried incorporating the dimension reduction method proposed by Raunak et al. (2019), which combines removing the top few principal components of the embeddings with PCA. We found this less effective than simple PCA for our embeddings. We hypothesize that this is because our embeddings did not have disproportionately high top singular values, in contrast with the observations made by Raunak et al. (2019) for the embeddings they considered.
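For reference, a sketch of the kind of post-processing pipeline we tried: subtract the mean and project out the top few principal components (Mu and Viswanath, 2018), reduce dimension with PCA, then repeat the removal step (Raunak et al., 2019). Parameter names and the choice of k are illustrative, not the exact configuration we used.

```python
import numpy as np

def remove_top_components(X, k):
    """Center X and project out its top-k principal directions."""
    X = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X - (X @ Vt[:k].T) @ Vt[:k]

def raunak_reduce(X, d, k=2):
    """Post-process -> PCA to d dims -> post-process again."""
    X = remove_top_components(X, k)
    X = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    X = X @ Vt[:d].T
    return remove_top_components(X, k)

# Toy usage: reduce 20-dimensional embeddings for 50 words to 10 dims.
X = np.random.default_rng(0).normal(size=(50, 20))
Y = raunak_reduce(X, d=10)
```

The method helps when a few dominant directions carry little lexical information; in our case the top singular values were not disproportionately large, which is consistent with it underperforming plain PCA.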

C Detailed Results
This section expands on the results presented in the body of this paper in two ways: we show top-k results for each k ∈ {1, 5, 10}, and we show these results for each individual relation in our dataset. We note that we attempted to compare our models against other existing baselines trained on slightly different corpora such as those released by Kumar et al. (2020). However, due to the different corpora, these models had additional OOV tokens in our analogy dataset that we would have had to remove to evaluate them together with our models. Running methods such as ours alongside many standard baselines on a fixed corpus and comparing the resulting models is an important area for future work.

C.1 Results in subword categories
Results for top-1, top-5, and top-10 accuracies are shown in Tables 8, 9, and 10 respectively. Categories are numbered according to the convention established in Tables 1 and 2. The TripleMerged model (our final set of meta-embeddings) generally outperforms all other models in these categories by margins of ≈ 10%, although this is not uniformly true across all categories. This is to be expected, since it is the only meta-embedding incorporating both the FT_i and MS_i embeddings, the two individual embeddings that incorporate subword information. This explanation is also supported by the strong performance of these two individual embeddings in the Subword categories.

C.2 Results in non-subword categories
Results for top-1, top-5, and top-10 accuracies are shown in Tables 11, 12, and 13 respectively. The strongest models here are TripleMerged, FT_concat, and FT_merged. While TripleMerged generally outperforms FT_concat, FT_merged is in general slightly better than TripleMerged in these categories. This is again to be expected, since the FT_i and FT_o models are the best-performing individual embeddings in the Non-subword categories. Still, it is remarkable that TripleMerged is only slightly worse in general than FT_merged, given that it resulted from merging FT_merged with MS_merged, which was significantly weaker in the Non-subword categories.

C.3 Overall results
Overall results averaged across categories are shown in Table 14. TripleMerged exhibits the strongest overall performance, compensating for its slight weakness in the Non-subword category with significant improvements over all other models in the Subword category.

Table 14: Average top-1, top-5, and top-10 accuracies for all models in the Subword and Non-subword categories, as well as overall accuracies (taken to be the average of the Subword and Non-subword scores).