CMU-01 at the SIGMORPHON 2019 Shared Task on Crosslinguality and Context in Morphology

This paper presents the submission by the CMU-01 team to Task 2 of the SIGMORPHON 2019 shared task, Morphological Analysis and Lemmatization in Context. The task requires us to produce the lemma and morpho-syntactic description of each token in a sequence, for 107 treebanks. We approach this task with a hierarchical neural conditional random field (CRF) model which predicts each coarse-grained feature (e.g., POS, Case) independently. However, most treebanks are under-resourced, making it challenging to train deep neural models for them. Hence, we propose a multi-lingual transfer training regime in which we transfer from multiple related languages that share similar typology.


Introduction
Morphological analysis (Hajic and Hladká, 1998; Oflazer and Kuruöz, 1994) is the task of predicting morpho-syntactic properties along with the lemma of each token in a sequence, with several downstream applications including machine translation (Vylomova et al., 2017), named entity recognition (Güngör et al., 2018) and semantic role labeling (Strubell et al., 2018). Advances in deep learning have enabled significant progress for the task of morphological tagging (Müller and Schuetze, 2015; Heigold et al., 2017) and lemmatization (Malaviya et al., 2019) under large amounts of annotated data. However, most languages are under-resourced and often exhibit diverse linguistic phenomena, making it challenging to generalize existing state-of-the-art models to all languages.
In order to tackle the issue of data scarcity, recent approaches have coupled deep learning with cross-lingual transfer learning (Malaviya et al., 2018; Cotterell and Heigold, 2017; Kondratyuk, 2019) and have shown promising results. Previous works (e.g., Cotterell and Heigold, 2017) combine the set of morphological properties into a single monolithic tag and employ multi-sequence classification. This runs the risk of data sparsity and an exploding output space for morphologically rich languages. Malaviya et al. (2018) instead predict each coarse-grained feature, such as part-of-speech (POS) or Case, separately by modeling dependencies between these features and also between the labels across the sequence using a factorial conditional random field (CRF). However, this results in a large number of factors, leading to a slower training time (over 24h).
To address the issues of both data sparsity and having a tractable computation time, we propose a hierarchical neural model which predicts each coarse-grained feature independently, but without modeling the pairwise interactions within them. This results in a time-efficient computation (5-6h) and substantially outperforms the baselines. To more explicitly incorporate syntactic knowledge, we embed POS information in an encoder which is shared with all feature decoders. To address the issue of data scarcity, we present two multilingual transfer approaches where we train on a group of typologically related languages and find that language groups with shallower time-depths (i.e., the period of time during which languages diverged to become independent) tend to benefit the most from transfer. We focus on the task of contextual morphological analysis and use the provided baseline model for the task of lemmatization (Malaviya et al., 2019).
This paper makes the following contributions:
1. We present a hierarchical neural model for contextual morphological analysis with a shared encoder and independent decoders for each coarse-grained feature. This provides us with the flexibility to produce any combination of features.
2. We analyze the dependencies among different morphological features to inform model choices, and find that adding POS information to the encoder significantly improves prediction accuracy by reducing errors across features, particularly Gender errors.
3. We evaluate our proposed approach on 107 treebanks and achieve +14.76 (accuracy) average improvement over the shared task baseline (McCarthy et al., 2019) for morphological analysis.

Contextual Morphological Analysis
In this section, we formally define the task (§2.1) and describe our proposed approach (§2.2).

Task Formulation
Formally, we define the task of contextual morphological analysis as a sequence tagging problem.
Given a sequence of tokens $x = x_1, x_2, \dots, x_n$, the task is to predict the morphological tagset $y = y_1, y_2, \dots, y_n$, where the target label $y_i$ for a token $x_i$ consists of fine-grained morpho-syntactic traits such as {N;PL;NOM;FEM}.

Our Method
In line with Malaviya et al. (2018), we formulate morphological analysis as a feature-wise sequence prediction task, where we predict the fine-grained labels (e.g., N, NOM, ...) for the corresponding coarse-grained features F = {POS, Case, ...}, as shown in Figure 1. However, we only model the transition dependencies between the labels of a feature. This is done for two reasons: 1) as per the analysis of Malaviya et al. (2018), the removal of pairwise dependencies led to an average change of only -0.93 in the F1 score, and we further observe in our experiments that our formulation performs better even without explicitly modeling pairwise dependencies; 2) the factorial CRF model is computationally expensive to train with pairwise dependencies, since loopy belief propagation is used for inference.
Therefore, we propose a feature-wise hierarchical neural CRF tagger (Lample et al., 2016;Ma and Hovy, 2016;Yang et al., 2016) with independent predictions for each coarse-grained feature for a given time-step, without explicitly modeling the pairwise dependencies.

Hierarchical Neural CRF model
The hierarchical neural CRF model comprises two major components: an encoder, which combines character and word-level features into a continuous representation, and a multi-class multi-label decoder. Given an unlabeled sequence x, the encoder computes the context-sensitive hidden representation for each token $x_i$. These representations are shared across |F| independent linear-chain CRFs for inference. We refer to this model as MDCRF.
Decoder: Our decoder comprises |F| independent feature-wise CRFs, where F = {POS, Case, Gender, ...} is the set of coarse-grained features observed in the training dataset. Each feature f in F has its own CRF objective, defined through an energy function specific to that feature. During inference, the predictions from the feature-wise decoders are concatenated to output the complete morphological analysis of the sequence x.
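To make the decoding step concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that each feature-wise CRF exposes per-token emission scores and a label-to-label transition matrix; the names `viterbi`, `decode_all`, the `"_"` null label and the toy scores are all hypothetical. Each feature is decoded independently with Viterbi, and the non-null labels are then joined per token.

```python
from typing import Dict, List, Sequence

NULL = "_"  # assumed placeholder for "feature not present on this token"


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[float]]) -> List[int]:
    """Best label-index path for one linear-chain CRF.

    emissions[i][k]   : score of label k at token i
    transitions[j][k] : score of moving from label j to label k
    """
    num_labels = len(emissions[0])
    score = list(emissions[0])
    backptr: List[List[int]] = []
    for i in range(1, len(emissions)):
        new_score, ptrs = [], []
        for k in range(num_labels):
            best_j = max(range(num_labels),
                         key=lambda j: score[j] + transitions[j][k])
            new_score.append(score[best_j] + transitions[best_j][k] + emissions[i][k])
            ptrs.append(best_j)
        score = new_score
        backptr.append(ptrs)
    # Backtrace from the best final label.
    best = max(range(num_labels), key=lambda k: score[k])
    path = [best]
    for ptrs in reversed(backptr):
        best = ptrs[best]
        path.append(best)
    return path[::-1]


def decode_all(crfs: Dict[str, dict]) -> List[str]:
    """Decode every feature independently, then join non-null labels per token."""
    per_feature = {}
    for feat, crf in crfs.items():
        path = viterbi(crf["emissions"], crf["transitions"])
        per_feature[feat] = [crf["labels"][k] for k in path]
    n = len(next(iter(per_feature.values())))
    return [";".join(per_feature[f][i] for f in crfs if per_feature[f][i] != NULL)
            for i in range(n)]


# Toy two-token example with made-up scores.
toy_crfs = {
    "POS": {"labels": ["N", "V"],
            "emissions": [[2.0, 0.5], [0.1, 1.5]],
            "transitions": [[0.0, 0.5], [0.2, 0.0]]},
    "Number": {"labels": ["SG", "PL", NULL],
               "emissions": [[0.2, 1.8, 0.1], [1.0, 0.3, 0.9]],
               "transitions": [[0.0] * 3 for _ in range(3)]},
}
analysis = decode_all(toy_crfs)
```

Because the decoders share nothing at inference time except the encoder output, the cost grows linearly in |F|, which is the trade-off the model makes against a factorial CRF.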
Encoder: We adopt a standard hierarchical sequence encoder which is shared among all the |F| feature-wise decoders. It consists of a character-level bi-LSTM that computes hidden representations for each token in the sequence. These subword representations help in capturing information about morphological inflections. To further enforce this signal, we add a layer of self-attention (Vaswani et al., 2017) on top of the character-level bi-LSTM. Self-attention provides each character with a context from all the characters in the token. A bi-LSTM modeling layer is added on top of the self-attention layer, which produces a token-level representation. These representations are then concatenated with a word embedding vector and fed to another bi-LSTM to produce context-sensitive token representations, which are then fed to all the |F| CRFs for inference.

Adding Linguistic Knowledge
Part-of-speech (POS) is perhaps the most important coarse-grained feature. Not only is every token annotated for POS, but most other features depend on it. For instance, verbs do not have Case, and nouns do not have Tense. In order to leverage these linguistic constraints, we incorporate POS information for each token into our shared encoder. We refer to this variant of the model as MDCRF+POS, as shown in Figure 1.
Figure 2: Polyglot model being used for the token "de" in Turkish, denoted by language vector <tr>.
Since POS tags are not available as input, we first run a separate hierarchical neural CRF tagger for POS alone and use the model predictions as input to the MDCRF+POS. For each token, we encode its predicted POS tag into a continuous representation and concatenate it with the character and word-level token representations. Finally, these concatenated representations are fed to the word-level bi-LSTM and inference is performed using |F|-1 decoders, excluding the POS decoder. Going forward, we use this model architecture for all our experiments unless otherwise noted.

Multi-lingual Transfer
So far, we have described our model architecture for a monolingual setting. However, the performance of neural models is highly dependent on the availability of large amounts of annotated data, making it challenging to generalize to low-resource languages. Cross-lingual transfer learning attempts to alleviate this challenge by transferring knowledge from high-resource languages. Prior work (Cotterell and Heigold, 2017; Malaviya et al., 2018; Buys and Botha, 2016) has shown the benefits of cross-lingual transfer for morphological tagging. Malaviya et al. (2018) restrict themselves to transferring from one language, whereas Cotterell and Heigold (2017) show that multi-source transfer performs better than single-source. Inspired by this, we experiment with two approaches for multi-lingual transfer learning.

MULTI-SOURCE:
In this method, we augment the training data from related languages with the target language data. Similar to Cotterell and Heigold (2017), we perform a hard clustering of languages based on the typological and orthographic similarity of the source languages with the target language. For instance, we construct a language cluster Indo-Aryan, which comprises all the languages in the dataset that belong to the Indo-Aryan language family, namely Hindi, Marathi and Sanskrit. For some larger language families such as Germanic and Slavic, we construct language clusters from a subset of languages. For instance, the North-Germanic language cluster comprises treebanks from German, Norwegian, Swedish and Danish. Some languages, such as Urdu and Tamil, are the only representatives of their respective language families in the dataset. For these languages, we create a cluster with the next closest language with respect to typology or orthography. For Urdu, we add Hindi because of typological similarity. For other such isolates, we add Turkish because of its extensive agglutination. A total of 24 language clusters were defined based on the literature and with help from a linguist; the details can be seen in Appendix Section §B.
Given a language cluster, all the training data from each language within it is first concatenated together. Then, for each language we concatenate the language embedding vector with the token representation in the encoder by adding the language id <LANG ID> at the beginning and end of each sequence. Given a sequence x, the encoder produces a contextualized hidden representation $h_i$ for each token $x_i$ from the concatenation of $e_i$, $c_i$, $p_i$ and $l_i$, where $e_i$ is the word embedding vector, $c_i$ is the character-level representation, $p_i$ is the POS embedding and $l_i$ is the language embedding vector. This is done to help the model disambiguate languages, as the same token often has different morpho-syntactic descriptions across languages. For example, the token " " is a part of both the Hindi and Marathi vocabularies. In Hindi it denotes a CONJ, whereas in Marathi it is a pronoun with the following description: 3;MASC;PRO;NOM;SG.
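The data preparation for MULTI-SOURCE can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names, the `"_"` tag given to the language markers and the placeholder token `"w"` are hypothetical. Only the idea of wrapping each sequence in a language-id token and concatenating all data in a cluster comes from the text.

```python
from typing import Dict, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, morphological tagset) pairs


def add_lang_id(sentence: Sentence, lang: str, null_tag: str = "_") -> Sentence:
    """Wrap a sentence with a <lang> marker at the beginning and at the end.

    The tag assigned to the marker (null_tag) is an assumption of this sketch.
    """
    marker = (f"<{lang}>", null_tag)
    return [marker] + list(sentence) + [marker]


def build_cluster_data(cluster: Dict[str, List[Sentence]]) -> List[Sentence]:
    """Concatenate the training data of every language in a cluster."""
    merged: List[Sentence] = []
    for lang, sentences in cluster.items():
        merged.extend(add_lang_id(s, lang) for s in sentences)
    return merged


# The same surface token "w" (a placeholder) carries different analyses in the
# two languages, which is the ambiguity the language id is meant to resolve.
indo_aryan = {
    "hi": [[("w", "CONJ")]],
    "mr": [[("w", "3;MASC;PRO;NOM;SG")]],
}
merged = build_cluster_data(indo_aryan)
```

With the markers in place, a shared encoder can look up a language embedding from the marker and condition the analysis of every token in the sequence on it.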
POLYGLOT: Languages are often related to multiple languages along different dimensions. For instance, Swedish is lexically similar to German, but it is morpho-syntactically closer to English. To enable a model to utilize these relationships, we feed explicit typological information to the encoder, drawing inspiration from the polyglot model proposed by Tsvetkov et al. (2016). In this multilingual model, we first concatenate all the training data from the source languages, similar to the MULTI-SOURCE setting, and compute $h_i$ for each token. The context vector $h_i$ is then factored by the typology feature vector $t_l$ to integrate these manually defined features, where $W_l$ and $b_l$ are language-specific parameters which project the typology vector into a low-dimensional space. Finally, the resulting global-context language matrix $g_i^l$ is vectorized into a column vector and fed to the decoder, as shown in Figure 2.
Tsvetkov et al. (2016) derive their typology vectors from the URIEL database (Littell et al., 2017). We consider a subset of these typology features which are most relevant to the task of morpho-syntactic analysis and obtain 18 Syntax-WALS features. However, we observed that for most language clusters, these typology feature values were not discriminating within a cluster, which defeats the purpose of using POLYGLOT for disambiguating languages across typological dimensions. Therefore, we construct a custom typology vector for each language cluster based on global statistics of the training data.
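One form of the factoring step that is consistent with the description above is sketched below. The tanh non-linearity, the outer product and the symbols $f_l$ and $\tilde{h}_i$ are assumptions made for illustration, in the spirit of the polyglot model of Tsvetkov et al. (2016); they are not confirmed by the text.

```latex
\begin{align}
f_l &= \tanh\left(W_l\, t_l + b_l\right)
  && \text{project the typology vector } t_l \text{ into a low-dimensional space} \\
g_i^l &= h_i\, f_l^{\top}
  && \text{factor the context vector } h_i \text{ by the projected typology} \\
\tilde{h}_i &= \operatorname{vec}\left(g_i^l\right)
  && \text{vectorize into a column vector fed to the decoder}
\end{align}
```

Under these assumptions, a 20-dimensional $f_l$ and a $d$-dimensional $h_i$ give a decoder input of size $20d$, which is one reason such a model would be slower to train than MULTI-SOURCE alone.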
For every coarse-grained feature, this constructed vector contains the proportion of words in the training data that are annotated with that feature. We also experiment with calculating these proportions separately for the words of each POS label (N, V, ...). Given the importance of POS, we also include the number of fine-grained POS labels that the most frequent coarse-grained features (Gender, Number, Person, Case) can take. This results in bi-gram features such as N-FEM, N-NOM, N-SG. We remove features which do not occur within a given cluster to avoid sparse features. Table 1 shows a portion of the example vector constructed for the Indo-Aryan cluster. From the table we can see that some features, such as ADJ-Gender-FEM and V-Person-1, are present in all three languages within the cluster, whereas other features, such as ADJ-Gender-NEUT, are absent from Hindi because Hindi only has two genders, MASC and FEM.
Training Regime: For both multi-lingual transfer methods, we train one model per language cluster and fine-tune this model for each individual language, which saves time and compute compared with training 107 individual models from scratch. Furthermore, since a language cluster can have multiple high-resource languages, we take min(5000, #training data-points) for each language to have a tractable training time. We up-sample the low-resource languages to match the number of training data-points of the high-resource languages.
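The two data-driven steps above, the per-feature proportions that make up the constructed typology vector and the cap-then-up-sample scheme, can be sketched as follows. This is a hedged illustration: the function names, the `"_"` null value, random selection for the cap and sampling with replacement for up-sampling are all assumptions, since the text does not specify them.

```python
import random
from typing import Dict, List


def feature_proportions(tokens: List[Dict[str, str]],
                        features: List[str],
                        null: str = "_") -> List[float]:
    """Proportion of tokens annotated with each coarse-grained feature."""
    total = len(tokens)
    return [sum(1 for t in tokens if t.get(f, null) != null) / total
            for f in features]


def balance_cluster(data: Dict[str, List[str]],
                    cap: int = 5000,
                    seed: int = 0) -> Dict[str, List[str]]:
    """Cap each language at `cap` sentences, then up-sample the smaller ones."""
    rng = random.Random(seed)
    # Step 1: take min(cap, number of sentences) per language.
    capped = {lang: (rng.sample(sents, cap) if len(sents) > cap else list(sents))
              for lang, sents in data.items()}
    # Step 2: up-sample (with replacement) to the size of the largest language.
    target = max(len(sents) for sents in capped.values())
    return {lang: sents + [rng.choice(sents) for _ in range(target - len(sents))]
            for lang, sents in capped.items()}


tokens = [
    {"POS": "N", "Gender": "FEM"},
    {"POS": "V"},
    {"POS": "N", "Gender": "MASC", "Case": "NOM"},
    {"POS": "ADJ"},
]
proportions = feature_proportions(tokens, ["Case", "Gender"])

# Sentence counts are illustrative stand-ins for a high-resource language and
# two low-resource languages in one cluster.
cluster = {
    "hi": [f"hi-{i}" for i in range(13000)],
    "mr": [f"mr-{i}" for i in range(373)],
    "sa": [f"sa-{i}" for i in range(184)],
}
balanced = balance_cluster(cluster)
```

After balancing, every language in the toy cluster contributes the same number of sentences, so the low-resource languages are not drowned out during cluster-level training.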

Contextual Lemmatization
We use the neural model from Malaviya et al. (2019) for contextual lemmatization. This is a neural sequence-to-sequence model with hard attention, which takes both the inflected form and morphological tag set for a token as input and produces a lemma, both at the character level. The decoder uses the concatenation of the previous character and the tag set to produce the next character in the lemma. The lemmatization model is jointly trained with an LSTM-based tagger using jackknifing to reduce exposure bias in training: Malaviya et al. (2019) report significantly lower lemmatization results when training with gold tags and using predicted tags only at test time. We use their tagger for training and our contextual morphological analysis models' predicted tags at evaluation time. This model served as the baseline lemmatizer for Task 2; we refer readers to the shared task paper for model details (McCarthy et al., 2019).

Experiments
We conduct the following experiments: We compare our multi-lingual transfer approach with the baselines Malaviya et al. (2018) and Cotterell and Heigold (2017). Malaviya et al. (2018) consider a feature-wise model which predicts fine-grained labels for corresponding coarse categories {POS, Case, ...}. Since morpho-syntactic properties are often correlated, they model these inter-dependencies using a factorial CRF and define two inter-dependencies: 1) a pairwise dependency, which models correlations between the morpho-syntactic properties within a token, and 2) a transition dependency, which models label correlations across all tokens in a sequence. Although this formulation provides the flexibility to produce any combination of tagsets, the model is computationally expensive to train since the factors model dependencies between all labels of all coarse-grained features, leading to >20k factors.
Data processing: We use the train/dev/test split provided in the shared task (McCarthy et al., 2018). Since we model feature-wise prediction for each coarse-grained feature, our model requires the provided data to be annotated for coarse-grained features. Therefore, we construct a feature-label dictionary based on the UM documentation to map the individual fine-grained traits, which are in the UM schema, to their respective coarse-grained categories. This transforms the tagset {N;PL;NOM;FEM} into {POS=N;Number=PL;Case=NOM;Gender=FEM}. We note that a token usually has only a subset of the coarse-grained categories; we therefore extend the morphological tagset for each token by adding the remaining features observed in the training set and assigning them a special value " " which denotes null.
Hyper-parameters: We use a hidden size of 200 for each direction of the LSTM with a dropout of 0.5. For the character-level bi-LSTM we use a hidden size of 25. We use 100-dimensional word and language embeddings and 64-dimensional POS embeddings, all randomly initialized. SGD was used as the optimizer with a learning rate of 0.015. The models were trained until convergence. For POLYGLOT, we project the constructed typology vector into a 20-dimensional hidden space.
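The tagset conversion described under data processing can be illustrated with a short sketch. The dictionary below is a tiny hand-written subset used only for illustration, not the authors' full feature-label dictionary, and the `"_"` null symbol and the function name `to_feature_dict` are assumptions.

```python
from typing import Dict, Iterable

# Illustrative subset of a fine-grained label -> coarse-grained feature map.
LABEL_TO_FEATURE = {
    "N": "POS", "V": "POS", "ADJ": "POS",
    "SG": "Number", "PL": "Number",
    "NOM": "Case", "ACC": "Case",
    "FEM": "Gender", "MASC": "Gender",
}
NULL = "_"  # assumed symbol for the special null value


def to_feature_dict(tagset: str, features: Iterable[str]) -> Dict[str, str]:
    """Map a UM tagset such as 'N;PL;NOM;FEM' to feature=label pairs.

    Every feature observed in training gets an entry; features absent from
    the tagset are filled with the null value.
    """
    out = {f: NULL for f in features}
    for label in tagset.split(";"):
        out[LABEL_TO_FEATURE[label]] = label
    return out


observed_features = ["POS", "Number", "Case", "Gender"]
noun = to_feature_dict("N;PL;NOM;FEM", observed_features)
verb = to_feature_dict("V;SG", observed_features)
```

Filling the missing features with a null label gives every token a target for every decoder, which is what lets the |F| CRFs be trained on the same sequences.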

Results and Discussion
Table 2 shows the comparison results of our proposed approach with the baselines (Malaviya et al., 2018; Cotterell and Heigold, 2017) using cross-lingual transfer. Here MDCRF+POS refers to our model architecture and MULTI-SOURCE refers to our multi-lingual transfer approach. Malaviya et al. (2018) and Cotterell and Heigold (2017) test their approach on UD v2.1 (Nivre et al., 2017) under two settings: tgt size = 100 and tgt size = 1000, where tgt size denotes the number of target language data-points used during training. Malaviya et al. (2018) transfer from one related high-resource language. We use the same experimental resources and, for a fair comparison, we do not fine-tune on the target language. Of the four language pairs tested by Malaviya et al. (2018), we choose RU/BG and FI/HU for comparison, where BG and HU are the target languages and RU and FI are the respective transfer languages, since these languages are morphologically challenging. We see that under both settings our approach outperforms the baselines by a significant margin for both language pairs.
Next, we compare our multi-lingual transfer approaches MULTI-SOURCE and MULTI-SOURCE + POLYGLOT in order to decide the model for our final submission. We conduct experiments on three low-resource languages: Marathi (mr-ufal), Sanskrit (sa-ufal) and Belarusian (be-hse), all of which have < 400 training data-points. The italicized text denotes the treebank used in the experiments. For mr-ufal and sa-ufal, we transfer from a related high-resource language, Hindi (hi-hdtb). For be-hse, we transfer from two related languages, Russian (ru-gsd) and Ukrainian (uk-iu). From Table 3, we see that the performance of the two models is comparable. Therefore, for our final submission we use only MULTI-SOURCE, which is much faster to train than MULTI-SOURCE + POLYGLOT. We discuss their comparative performance in greater detail in Section §5.1.
Finally, we compare our approach with the shared task baseline. Tables 5 and 6 in the Appendix show our results for all 107 treebanks. We observe that our system achieves an average improvement of +14.70 (accuracy) and +4.63 (F1) over the provided baseline (McCarthy et al., 2019). We note that for the shared task submission, we did not use self-attention over the character-level representations. Therefore, we additionally show the results after adding self-attention. We observe that the addition gives an average improvement of +0.60 (accuracy) and +0.30 (F1) over our previous best submission.

Analysis
Here we analyze the different components of our model in an effort to understand what it is learning.
Why does adding POS help? As discussed earlier (§2), we explicitly add the POS feature in the form of embeddings into the shared encoder. To evaluate the contribution of POS alone, we conduct monolingual experiments without concatenating the POS embeddings with the token-level representations. Table 4 outlines the ablation results for three treebanks with varying training size. We observe that our monolingual model MDCRF significantly outperforms the baseline (McCarthy et al., 2019) by +13.72 accuracy and +3.82 F1 (avg). On adding POS, we further gain +3.56 accuracy and +0.71 F1 over MDCRF across the three treebanks. We note that this improvement is more pronounced for the low-medium resource languages of Marathi (+6.12 accuracy) and Ukrainian (+3.57 accuracy). To understand where the addition of POS helps, we analyse the number of errors made for each coarse-grained feature. For Marathi, POS helped the most in reducing Gender errors (Figure 3). For some word forms, the gender may be inferred from the inflectional form alone, but for others this information may be insufficient, e.g. "किंमत" (price.N.FEM.SG.ACC) in Marathi, which does not have the traditional female suffix " ". We observe that this behavior corresponds to POS: verbs and adjectives are more predictable from surface forms alone than nouns. The addition of POS information in the encoder helps the model learn to weigh different encoded information more heavily when assigning gender to different parts of speech. For Ukrainian and Sanskrit, POS information also helped reduce errors in Case and Number. More details can be found in Appendix Section §C.
Tkachenko and Sirts (2018) also model dependence on POS with a POS-dependent context vector in the decoder. However, they observe no significant improvement; we hypothesize that incorporating POS information into the shared encoder instead provides the model with a stronger signal.
What is the model learning? One of the major advantages of our model's use of self-attention is that it enables us to provide insights into what the model has learned. As seen in Figure 4, we found evidence of the model learning language-specific inflectional properties. Both Marathi and Belarusian display morphological inflections predominantly in the form of suffixes, and the attention maps for both languages demonstrate the same. For the Marathi example, the last three characters denote the ergative case and we can see that the attention weights are concentrated on these three characters. Similarly, for the Belarusian example, the last two characters denote the genitive case with plural number and are the focus of the attention. For Indonesian, inflections can also be found as circumfixes, where the affix is attached at both the beginning and end of the token. For instance, both the ke- and -an affixes are attached to form nouns, and we can see from Figure 4 that the attention is focused on both the prefix and the suffix. Interestingly, for Indonesian the model seems to have also discovered the stem camat, as evidenced by the attention pattern.

Does time-depth matter for transfer learning?
As discussed earlier, we train one model per language cluster for multi-lingual transfer learning. We compare different clusters to see if the time-depth of the languages within a cluster affects the extent of transfer. Time-depth is the period of time that has elapsed since all languages in the group were a single language (in other words, the time since divergence). We consider the following three clusters: Hindi-Marathi-Sanskrit (Indo-Aryan), Russian-Ukrainian-Belarusian (Slavic) and Arabic-Hebrew-Amharic-Akkadian (Semitic). These three clusters were chosen because the languages in them became separate languages at varying time-depths. For instance, in the Semitic cluster the languages diverged roughly 5000 years ago, whereas for the Slavic cluster the time-depth is <1000 years. Therefore, we expect transfer to help more for languages where the time-depth is more recent. In Figure 5, we compare the MULTI-SOURCE model with our best monolingual model MDCRF+POS and we see that transfer helps most for the Slavic cluster, by +2.9 accuracy. For the Indo-Aryan cluster it helps by +0.32 accuracy, and for the Semitic cluster we observe a slight negative effect with transfer (-0.0176 accuracy). This supports our hypothesis that time-depth does affect the extent of transfer learning, with language clusters having lower time-depths benefiting the most.
One particular advantage that the Slavic cluster has over both the Indo-Aryan and Semitic clusters is the similarity of script. Russian, Belarusian, and Ukrainian use variants of the same script; Hindi, Sanskrit, and Marathi do as well, but the Semitic languages all use different scripts. This can also be attributed to the shallower time-depths of the Slavic and Indo-Aryan clusters. As suggested by the anonymous reviewers, we add Czech and Polish to the Slavic cluster to see to what extent differing scripts confuse the model. Czech and Polish use a different script from Russian, Belarusian, and Ukrainian. We observe that the MULTI-SOURCE model, as before, achieves similar improvements over the monolingual models for Belarusian (+8.17 accuracy) and Ukrainian (+1.2 accuracy). However, a slight decrease is observed for Russian (-0.45 accuracy). This suggests that the MULTI-SOURCE model is robust to script differences and benefits the low-resource languages by learning from typologically similar languages, more so for language clusters with shallow time-depths.
Why did POLYGLOT not help further? We hypothesize that one reason why POLYGLOT did not help over MULTI-SOURCE is that the language embedding vector probably learns the same typological information that the typology vector encodes. Hence, the typology vector does not seem to add any new information. As evidence, we look at the transition weights learned in both models; as shown in Figure 8, the transition weights learned for the Case feature are very similar for MULTI-SOURCE and MULTI-SOURCE + POLYGLOT. In the future, we plan to explore the contextual parameter generation method (Platanios et al., 2018) for leveraging the typology vectors to inform the decoders during inference.

Error Analysis
In this section, we analyze the major error categories for the MULTI-SOURCE model for the Indo-Aryan cluster. We observe that the Gender, Case, Number and Person features account for the largest number of errors (65% for Marathi, 49% for Sanskrit). One reason for this is the non-overlapping output label space across the languages within a cluster. For instance, in the Indo-Aryan cluster, Hindi is a high-resource language (> 13k training sentences), with Marathi (373) and Sanskrit (184) being the low-resource languages. We observe that the label spaces for Case, Gender and Number overlap the least among the three languages. Marathi and Sanskrit have three genders (NEUT, FEM, MASC), whereas Hindi only has FEM and MASC. Furthermore, only two Hindi Case labels (ACC, NOM) overlap with Marathi and Sanskrit, because in Hindi the labels often have alternatives such as ACC/ERG and ACC/DAT. These differences in the output space negatively affect the transfer. For the Slavic cluster, we observe that almost all the feature labels overlap for the languages therein, which is probably another reason why we see a gain of +6.89 for Belarusian in Figure 5 and only a +0.32 increase for Marathi.
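The label-space mismatch discussed above can be quantified with a simple set-overlap measure. The sketch below uses Jaccard similarity, which is a choice made here for illustration and not a measure reported in the paper; the Gender label sets are the ones stated in the text, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection over size of the union of two label sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_overlap(labels: Dict[str, Set[str]]) -> Dict[Tuple[str, str], float]:
    """Jaccard overlap for every pair of languages (pairs in sorted order)."""
    return {(x, y): jaccard(labels[x], labels[y])
            for x, y in combinations(sorted(labels), 2)}


# Gender label spaces as described for the Indo-Aryan cluster.
gender_labels = {
    "Hindi": {"FEM", "MASC"},
    "Marathi": {"NEUT", "FEM", "MASC"},
    "Sanskrit": {"NEUT", "FEM", "MASC"},
}
overlap = pairwise_overlap(gender_labels)
```

On this toy input, Hindi shares two of three Gender labels with each of the other languages, while Marathi and Sanskrit overlap fully, which mirrors the observation that the high-resource source language cannot supply evidence for NEUT.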
We also note that for some languages, such as Belarusian and Russian, the POS errors increased by 25.3% and 4.4% respectively for the MDCRF+POS model. This suggests that decoupling the POS feature from the other feature decoders harmed the model. In the future, we plan to improve the MDCRF+POS model by jointly training the POS decoder with the other feature decoders, which would use the latent representation of POS in an end-to-end fashion.

Conclusion and Future Work
We implement a hierarchical neural model with independent decoders for each coarse-grained morphological feature and show that incorporating POS information in the shared encoder helps improve prediction for other features. Furthermore, our multi-lingual transfer methods not only help improve results for related languages but also eliminate the need to train individual models for each dataset from scratch. In the future, we plan to explore the use of pre-trained multi-lingual representations such as BERT (Devlin et al., 2019) in our encoder.

B Language Clusters
We train one model per language cluster for the multi-lingual transfer learning. Each language cluster was constructed based on the typological and orthographic similarity of the languages therein. Tables 5 and 6 show details of the language clusters. Figure 7 shows the clusters graphically by relative size per dataset.

C Analysis
In order to understand where the addition of POS helps, we plot the number of errors per coarse-grained feature for three languages in Figure 8. For Sanskrit and Ukrainian we see that POS generally helps reduce errors, predominantly for the features Case, Gender and Number. For Belarusian, we did not observe a clear trend, since the POS accuracy actually decreased.
Figure 8: Number of errors per coarse-grained feature for models comparing the addition of POS to the encoder; panel (c) shows Sanskrit (sa-ufal). The rows at the bottom denote the total number of predictions per feature for both models.

Figure 3: Number of errors per coarse-grained feature for Marathi comparing the addition of POS to the encoder. The rows at the bottom denote the total number of predictions per feature for both models.

Figure 4: Character-level attention maps for three typologically different languages. Marathi and Belarusian display morphological inflections predominantly as suffixes. Indonesian displays inflections in the form of prefixes, suffixes and circumfixes, where the affix is found at both the beginning and end of a token.

Figure 5: Absolute gain of multi-lingual transfer over monolingual models. Blue denotes the Indo-Aryan cluster, pink the Slavic, and yellow the Semitic.
Figure 1: Hierarchical neural model for contextual morphological analysis with independent CRF decoders for each coarse-grained feature F. For the model MDCRF+POS, POS embeddings are concatenated to the word and char-level representations as depicted above. This model has |F|-1 decoders since the POS tagger is run separately as a prior step. MDCRF refers to the above model without POS embeddings, having all |F| decoders.

Table 1: Example of manually constructed typology features for the Indo-Aryan cluster.
The comparison with prior baselines is carried out under the same experimental settings. Next, we compare our approach with the shared task baseline (McCarthy et al., 2019). Finally, we analyze the contributions of different components of our proposed method.
Baselines: Cotterell and Heigold (2017) formulate this task as a sequence prediction problem with the output space being the set of all possible tagsets seen in the training data. Specifically, they construct a neural network based multi-class classifier where each tagset, e.g. {N;PL;NOM;FEM}, forms a class. Since the output space is restricted to the tagsets seen in the training data, this method cannot generalize to unseen tagsets. Furthermore, for morphologically rich languages such as Russian or Turkish, the output space of tagsets is huge, leading to sparse training data. McCarthy et al. (2019) follow a similar approach. To overcome these drawbacks, Malaviya et al. (2018) consider a feature-wise model.

Table 2: Comparing our model for bilingual transfer with previous baselines.

Table 6: Comprehensive results.
Figure 7: Language family clusters, by number of sentences per dataset.