Cross-lingual Character-Level Neural Morphological Tagging

Even for common NLP tasks, sufficient supervision is not available in many languages – morphological tagging is no exception. In the work presented here, we explore a transfer learning scheme, whereby we train character-level recurrent neural taggers to predict morphological tags for high-resource languages and low-resource languages together. Learning joint character representations among multiple related languages successfully enables knowledge transfer from the high-resource languages to the low-resource ones.


Introduction
State-of-the-art morphological taggers require thousands of annotated sentences to train. For the majority of the world's languages, however, sufficient, large-scale annotation is not available and obtaining it would often be infeasible. Accordingly, an important road forward in low-resource NLP is the development of methods that allow for the training of high-quality tools from smaller amounts of data. In this work, we focus on transfer learning: we train a recurrent neural tagger for a low-resource language jointly with a tagger for a related high-resource language. Forcing the models to share character-level features among the languages allows large gains in accuracy when tagging the low-resource languages, while maintaining (or even improving) accuracy on the high-resource language.
Recurrent neural networks constitute the state of the art for a myriad of tasks in NLP, e.g., multilingual part-of-speech tagging (Plank et al., 2016), syntactic parsing (Dyer et al., 2015; Zeman et al., 2017), morphological paradigm completion (Cotterell et al., 2016, 2017) and language modeling (Sundermeyer et al., 2012; Melis et al., 2017); recently, such models have also improved morphological tagging (Heigold et al., 2016, 2017). In addition to increased performance over classical approaches, neural networks also offer a second advantage: they admit a clean paradigm for multi-task learning. If the learned representations for all of the tasks are embedded jointly into a shared representation space, the various tasks reap benefits from each other and often performance improves for all (Collobert et al., 2011b). We exploit this idea for language-to-language transfer to develop an approach for cross-lingual morphological tagging.
We experiment on 18 languages taken from four different language families. Using the Universal Dependencies treebanks, we emulate a low-resource setting for our experiments, e.g., we attempt to train a morphological tagger for Catalan using primarily data from a related language like Spanish. Our results demonstrate the successful transfer of morphological knowledge from the high-resource languages to the low-resource languages without relying on an externally acquired bilingual lexicon or bitext. We consider both the single- and multi-source transfer case and explore how similar two languages must be in order to enable high-quality transfer of morphological taggers.1

Morphological Tagging
Many languages in the world exhibit rich inflectional morphology: the form of individual words mutates to reflect the syntactic function. For example, the Spanish verb soñar will appear as sueño in the first person present singular, but soñáis in the second person present plural, depending on the bundle of syntacto-semantic attributes associated with the given form (in a sentential context). For concreteness, we list a more complete table of Spanish verbal inflections in Table 1. Note that some languages, e.g., the Northeastern Caucasian language Archi, display a veritable cornucopia of potential forms, with the size of the verbal paradigm exceeding 10,000 (Kibrik, 1998).
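To make the notion of a paradigm cell concrete, the mapping from attribute bundles to surface forms can be sketched as a lookup table (a toy fragment, not the full paradigm of Table 1):

```python
# A tiny fragment of the present-indicative paradigm of Spanish "soñar".
# Keys are (person, number) bundles; values are the inflected surface forms.
SONAR_PRESENT = {
    ("1", "SG"): "sueño",
    ("2", "PL"): "soñáis",
}

def inflect(person: str, number: str) -> str:
    """Look up the surface form realizing a person/number bundle."""
    return SONAR_PRESENT[(person, number)]

print(inflect("1", "SG"))  # sueño
print(inflect("2", "PL"))  # soñáis
```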
Standard NLP annotation, e.g., the scheme in Sylak-Glassman et al. (2015), marks forms in terms of universal key-attribute pairs, e.g., the first person present singular is represented as [pos=V, per=1, num=SG, tns=PRES]. This bundle of key-attribute pairs is typically termed a morphological tag and we may view the goal of morphological tagging as labeling each word in its sentential context with the appropriate tag (Oflazer and Kuruöz, 1994; Hajič and Hladká, 1998). As the part of speech (POS) is a component of the tag, we may view morphological tagging as a strict generalization of POS tagging, where we have significantly refined the set of available tags. All of the experiments in this paper make use of the universal morphological tag set available in the Universal Dependencies (UD) (Nivre et al., 2016). As an example, we have provided a Russian sentence with its UD tagging in Figure 1.

1 While we only experiment with languages in the same family, we show that closer languages within that family are better candidates for transfer. We remark that future work should consider the viability of more distant language pairs.
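A tag of this kind is naturally represented as a set of key-attribute pairs; viewing POS tagging as a coarsening of morphological tagging then amounts to reading off the pos key. A schematic sketch:

```python
def parse_tag(tag: str) -> dict:
    """Parse a bracketed tag like [pos=V,per=1,num=SG,tns=PRES] into a dict."""
    pairs = tag.strip("[]").split(",")
    return dict(pair.split("=") for pair in pairs)

tag = parse_tag("[pos=V,per=1,num=SG,tns=PRES]")
print(tag["pos"])  # V -- the POS is just one component of the full bundle
```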
Transferring Morphology. The transfer of morphology is arguably more dependent on the relatedness of the languages in question than other linguistic annotation tasks in NLP, such as POS tagging and named entity recognition (NER). For example, POS lends itself nicely to a universal annotation scheme (Petrov et al., 2012) and traditional NER is limited to a small number of cross-linguistically compliant categories, e.g., PERSON and PLACE. Even universal dependency arc labels employ cross-lingual labels (Nivre et al., 2016). Morphology, on the other hand, typically requires more fine-grained annotation, e.g., grammatical case and tense. It is often the case that one language will mark a syntacto-semantic distinction in the word form that another does not mark at all. For example, the Hungarian noun overtly marks 17 grammatical cases and Slavic verbs typically distinguish two aspects through morphology, while English marks none of these distinctions. If the source language does not overtly mark in the word form a grammatical category of the target language, it is nigh impossible to expect a successful transfer. For this reason, much of our work focuses on transfer between related languages, specifically exploring how close two languages must be for a successful transfer. Note that the language-specific nature of morphology does not contradict the universality of the annotation; each language may mark a different subset of categories, i.e., use a different set of the universal key-attribute pairs, but nevertheless there is a single, universal set from which the key-attribute pairs are drawn. See Newmeyer (2007) for a linguistic treatment of cross-lingual annotation.
Notation. We will discuss morphological tagging in terms of the following notation. We consider two (related) languages: a high-resource source language ℓ_s and a low-resource target language ℓ_t. Each of these languages has its own (potentially overlapping) set of morphological tags, denoted T_s and T_t, respectively. We work with the union of both sets, T = T_s ∪ T_t. An individual tag m = [k_1=v_1, …, k_M=v_M] ∈ T is comprised of universal keys and attributes, i.e., the pairs (k_i, v_i) are completely language-agnostic. In the case where a language does not mark a distinction, e.g., case on English nouns, the corresponding keys are excluded from the tag. Typically, |T| is large (see Table 3). We denote the set of training sentences for the high-resource source language as D_s and the set of training sentences for the low-resource target language as D_t. In the experimental section, we will also consider a multi-source setting where we have multiple high-resource languages, but, for ease of explication, we stick to the single-source case in the development of the model.
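The union T = T_s ∪ T_t over which the model predicts can be illustrated directly (the tag inventories below are invented for illustration):

```python
# Hypothetical tag inventories for a source and a target language; the model's
# output space is the union T = T_s ∪ T_t of the observed tags of both.
T_s = {"[pos=V,per=1,num=SG]", "[pos=N,num=SG]", "[pos=N,num=PL]"}
T_t = {"[pos=N,num=SG]", "[pos=N,num=PL,cas=GEN]"}

T = T_s | T_t
print(len(T))  # 4: the shared tag [pos=N,num=SG] is counted once
```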

Character-Level Neural Transfer
Our formulation of transfer learning builds on work in multi-task learning (Caruana, 1997; Collobert et al., 2011b). We treat each individual language as a task and train a joint model for all the tasks together. We first discuss the current state of the art in morphological tagging: a character-level recurrent neural network. After that, we explore three augmentations to the architecture that allow for the transfer learning scenario. All of our proposals force the representation of the characters for both the source and the target language to share the same representation space, but involve different mechanisms by which the model may learn language-specific features.

Figure 2: We depict four subarchitectures used in the models we develop in this work, one of which is the language-specific Bi-LSTM embedder. Combining (a) with the character representations in (c) gives the vanilla morphological tagging architecture of Heigold et al. (2017). Combining (a) with (d) yields the language-universal softmax architecture, and combining (b) and (c) yields our joint model for language identification and tagging.

Character-Level Neural Networks
Character-level neural networks currently constitute the state of the art in morphological tagging (Heigold et al., 2017). We draw on previous work in defining a conditional distribution over taggings t for a sentence w of length |w| = N as

  p_θ(t | w) = ∏_{n=1}^{N} p_θ(t_n | w),   (1)

which may be seen as a 0th-order conditional random field (CRF) (Lafferty et al., 2001) with parameter vector θ.2 Importantly, this factorization of the distribution p_θ(t | w) also allows for efficient exact decoding and marginal inference in O(N) time, but at the cost of not admitting any explicit interactions in the output structure, i.e., between adjacent tags.3 We parameterize the distribution over tags at each time step as

  p_θ(t_n | w) = softmax(W e_n + b),   (2)

where W ∈ R^{|T|×D} is a weight matrix, b ∈ R^{|T|} is a bias vector, and the positional representations e_n ∈ R^D are taken from a concatenation of the output of two long short-term memory recurrent neural networks (LSTMs) (Hochreiter and Schmidhuber, 1997), folded forward and backward, respectively, over a sequence of input vectors.4 The integer D is the dimensionality and a tunable hyperparameter. This constitutes a bidirectional LSTM (Graves and Schmidhuber, 2005). We define the positional representation vector as

  e_n = [LSTM(⟨v_1, …, v_n⟩); LSTM(⟨v_N, …, v_n⟩)],   (3)–(4)

where each v_n is, itself, a word representation. Note that the function LSTM returns the last hidden state representation of the network. This architecture is the context bidirectional recurrent neural network of Plank et al. (2016). Finally, we derive each word representation v_n from a character-level bidirectional LSTM embedder. Namely, we define each word representation as the concatenation

  v_n = [LSTM(⟨c^n_1, …, c^n_{M_n}⟩); LSTM(⟨c^n_{M_n}, …, c^n_1⟩)].   (5)

In other words, we run a bidirectional LSTM over the character stream. This bidirectional LSTM is the sequence bidirectional recurrent neural network of Plank et al. (2016). Note that the concatenation of the sequence of character symbols ⟨c^n_1, …, c^n_{M_n}⟩ results in the word w_n. Each of the M_n characters c^n_k is a member of an alphabet Σ, the union of the sets of characters in the languages considered. We direct the reader to Heigold et al. (2017) for a more in-depth discussion of this and various additional architectures for the computation of v_n; the architecture we have presented in Equation (5) is competitive with the best-performing setting in Heigold et al.'s (2017) study.
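Because the model factorizes over positions, decoding reduces to an independent argmax at each time step, which is what makes the O(N) runtime possible. A pure-Python sketch, with made-up scores standing in for the LSTM outputs:

```python
import math

def softmax(scores):
    """Normalize a list of real-valued scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Made-up per-position scores over a toy tagset; in the model these would be
# the affine transform of the bidirectional LSTM state at each position.
TAGS = ["[pos=N,num=SG]", "[pos=V,per=3,num=SG]"]
scores_per_position = [[2.0, 0.5], [0.1, 1.9]]  # one row per word

# 0th-order decoding: an independent argmax at every position, O(N) overall.
probs = [softmax(s) for s in scores_per_position]
decoded = [TAGS[max(range(len(TAGS)), key=lambda i: p[i])] for p in probs]
print(decoded)  # ['[pos=N,num=SG]', '[pos=V,per=3,num=SG]']

# The probability of a full tagging is the product of per-position factors.
sentence_prob = math.prod(max(p) for p in probs)
```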

Cross-Lingual Morphological Transfer as Multi-Task Learning

Cross-lingual morphological tagging may be formulated as a multi-task learning problem. We seek to learn a set of shared character representations for taggers in both languages together through optimization of a joint loss function that combines the high-resource tagger and the low-resource one.
The first loss function we consider is the sum of the negative log-likelihoods of the source and target training data:

  L_multi(θ) = − ∑_{(t,w) ∈ D_s} log p_θ(t | w) − ∑_{(t,w) ∈ D_t} log p_θ(t | w).   (6)

Crucially, our cross-lingual objective forces both taggers to share part of the parameter vector θ, which allows it to represent morphological regularities between the two languages in a common representation space and, thus, enables transfer of knowledge. This is no different from monolingual multi-task settings, e.g., jointly training a chunker and a tagger for the transfer of syntactic information (Collobert et al., 2011b). We point out that, in contrast to our approach, almost all multi-task transfer learning, e.g., for dependency parsing (Guo et al., 2016), has shared word-level representations rather than character-level representations. See §6 for a more complete discussion.
To allow the model to learn language-specific behavior, we condition the tagger on the identity of the language ℓ, factoring the tagging distribution as

  p_θ(t | w, ℓ) = ∏_{n=1}^{N} p_θ(t_n | w, ℓ).   (7)

We consider two parameterizations of this distribution p_θ(t_n | w, ℓ). First, we modify the initial character-level LSTM representation such that it also encodes the identity of the language. Second, we modify the softmax layer, creating a language-specific softmax.
Language-Universal Softmax. Our first architecture has one softmax, as in Equation (2), over all morphological tags in T (shared among all the languages). To allow the architecture to encode morphological features specific to one language, e.g., the third person present plural ending in Spanish is -an, but -ão in Portuguese, we modify the creation of the character-level representations. Specifically, we augment the character alphabet Σ with a distinguished symbol that indicates the language: id_ℓ. We then pre- and postpend this symbol to the character stream for every word before feeding the characters into the bidirectional LSTM. Thus, we arrive at the new language-specific word representations:

  v_n = [LSTM(⟨id_ℓ, c^n_1, …, c^n_{M_n}, id_ℓ⟩); LSTM(⟨id_ℓ, c^n_{M_n}, …, c^n_1, id_ℓ⟩)].   (8)

This model creates a language-specific representation v_n, but the individual representations for a given character are shared among the languages jointly trained on. The remainder of the architecture is held constant.
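The augmentation itself is simple: the language symbol is treated as one more character and wrapped around the stream fed to the embedder. A sketch (the symbol spelling <es> is illustrative, not the paper's):

```python
def with_language_id(word: str, lang: str) -> list:
    """Pre- and post-pend a distinguished language symbol to the character stream."""
    id_symbol = f"<{lang}>"
    return [id_symbol] + list(word) + [id_symbol]

print(with_language_id("sueño", "es"))
# ['<es>', 's', 'u', 'e', 'ñ', 'o', '<es>']
```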
Language-Specific Softmax. Next, inspired by the architecture of Heigold et al. (2013), we consider a language-specific softmax layer, i.e., we define a new output layer for every language:

  p_θ(t_n | w, ℓ) = softmax(W_ℓ e_n + b_ℓ),   (9)

where W_ℓ ∈ R^{|T|×D} and b_ℓ ∈ R^{|T|} are now language-specific. In this architecture, the representations e_n are the same for all languages; the model has to learn language-specific behavior exclusively through the output softmax of the tagging LSTM.
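A minimal sketch of this parameterization, with toy dimensions and made-up weights (not the paper's implementation), shows how a shared representation e_n is mapped through per-language output parameters:

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def linear(W, b, e):
    """Compute W @ e + b with a list-of-lists weight matrix."""
    return [sum(w * x for w, x in zip(row, e)) + bi for row, bi in zip(W, b)]

# One (W_l, b_l) pair per language over a toy 2-tag, 2-dimensional setup;
# the positional representation e_n is shared across languages.
params = {
    "es": ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.5]),
    "pt": ([[0.5, 0.5], [1.0, -1.0]], [0.1, 0.0]),
}

def tag_distribution(e_n, lang):
    W, b = params[lang]
    return softmax(linear(W, b, e_n))

e_n = [0.3, -0.2]                   # shared encoder state for one position
p_es = tag_distribution(e_n, "es")  # Spanish-specific output layer
p_pt = tag_distribution(e_n, "pt")  # Portuguese-specific output layer
```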

Joint Morphological Tagging and Language Identification. The third model we exhibit is a joint architecture for tagging and language identification. We consider the following loss function:

  L_joint(θ) = − ∑_{(t,ℓ,w)} log p_θ(t, ℓ | w),   (10)

where we factor the joint distribution as

  p_θ(t, ℓ | w) = p_θ(t | w, ℓ) · p_θ(ℓ | w).   (11)

Just as before, we define p_θ(t | w, ℓ) as in Equation (7), and we define

  p_θ(ℓ | w) = softmax(U tanh(V e)),   (12)

which is a multi-layer perceptron with a binary softmax (over the two languages) as an output layer; here e ∈ R^D is a sentence-level representation, and we have added the additional parameters V ∈ R^{2×D} and U ∈ R^{2×2}. In the case of multi-source transfer, this becomes a softmax over the set of languages.
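The factorization can be illustrated with placeholder numbers (the probabilities below are invented; in the model they come from the tagger and the language-identification MLP, respectively):

```python
# Placeholder component distributions for one sentence of unknown language.
p_lang = {"es": 0.7, "pt": 0.3}                # p(l | w)
p_tags_given_lang = {"es": 0.20, "pt": 0.35}   # p(t | w, l) for one candidate tagging t

# The joint probability factorizes as p(t, l | w) = p(t | w, l) * p(l | w),
# so tagging a sentence of unknown language maximizes over languages too.
joint = {l: p_tags_given_lang[l] * p_lang[l] for l in p_lang}
best_lang = max(joint, key=joint.get)
print(best_lang, round(joint[best_lang], 2))  # es 0.14
```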
Comparative Discussion. The first two architectures discussed in §3.2 represent two possibilities for a multi-task objective, where we condition on the language of the sentence. The first integrates this knowledge at a lower level and the second at a higher level. The third architecture discussed in §3.2 takes a different tack: rather than conditioning on the language, it predicts it. The joint model offers one interesting advantage over the two other proposed architectures. Namely, it allows us to perform a morphological analysis on a sentence whose language is unknown. This effectively obviates an early step in the NLP pipeline, where language identification is performed, and is useful in conditions where the language to be tagged may not be known a priori, e.g., when tagging social media data. While there are certainly more complex architectures one could engineer for the task, we believe we have found a relatively diverse sampling, enabling an interesting experimental comparison. Indeed, it is an important empirical question which architectures are most appropriate for transfer learning. Since transfer learning affords the opportunity to reduce the sample complexity of the "data-hungry" neural networks that currently dominate NLP research, finding a good solution for cross-lingual transfer in state-of-the-art neural models will likely be a boon for low-resource NLP in general.

Experiments
Empirically, we ask three questions of our architectures. i) How well can we transfer morphological tagging models from high-resource languages to low-resource languages in each architecture? (Does one of the three outperform the others?) ii) How much annotated data in the low-resource language do we need? iii) How closely related do the languages need to be to get good transfer?

Datasets
We use the morphological tagging datasets provided by the Universal Dependencies (UD) treebanks (the concatenation of the 4th and 6th columns of the file format) (Nivre et al., 2016). We list the size of the training, development, and test splits of the UD treebanks we used in Table 2. Also, we list the number of unique morphological tags in each language in Table 3, which serves as an approximate measure of the morphological complexity each language exhibits. Crucially, the data are annotated in a cross-linguistically consistent manner, such that words in the different languages that have the same syntacto-semantic function have the same bundle of tags (see §2 for a discussion). Potentially, further gains would be possible by using a more universal scheme, e.g., the UNIMORPH scheme.
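Concretely, each token line in the UD (CoNLL-U) format is tab-separated, with UPOS in the 4th column and FEATS in the 6th; a minimal sketch of reading a morphological tag off such a line (the combined tag string format here is our own, for illustration):

```python
def morph_tag(conllu_line: str) -> str:
    """Extract UPOS (column 4) and FEATS (column 6) from a CoNLL-U token line."""
    cols = conllu_line.split("\t")
    upos, feats = cols[3], cols[5]  # 1-indexed columns 4 and 6
    return f"pos={upos}|{feats}"

line = "1\tsueño\tsoñar\tVERB\t_\tNumber=Sing|Person=1|Tense=Pres\t0\troot\t_\t_"
print(morph_tag(line))  # pos=VERB|Number=Sing|Person=1|Tense=Pres
```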

Baselines
We consider two baselines in our work. First, we consider the MARMOT tagger (Müller et al., 2013), which is currently the best-performing non-neural model. The source code for MARMOT is freely available online,5 which allows us to perform fully controlled experiments with this model. Second, we consider the alignment-based projection approach of Buys and Botha (2016).6 We discuss each of the two baselines in turn.

Table 4: Results under our joint model (panel (d) shows results for the Uralic languages). The tables highlight that the best source languages are often genetically and typologically closest. Also, we see that multi-source often helps, albeit more often in the |D_t| = 100 case.

Higher-Order CRF Tagger
The MARMOT tagger is the leading non-neural approach to morphological tagging. This baseline is important since non-neural, feature-based approaches have been found empirically to be more data-efficient, in the sense that their learning curves tend to be steeper. Thus, in the low-resource setting we would be remiss not to consider a feature-based approach. Note that this is not a transfer approach: it only uses the low-resource data.

Alignment-based Projection
The projection approach of Buys and Botha (2016) provides an alternative method for transfer learning. The idea is to construct pseudo-annotations for bitext given an alignment (Och and Ney, 2003). Then, one trains a standard tagger using the projected annotations. The specific tagger employed is the WSABIE model of Weston et al. (2011), which, like our approach, is a 0th-order discriminative neural model. In contrast to ours, however, their network is shallow. We compare the two methods in more detail in §6.

6 We compare against the numbers in the original paper, as well as additional numbers provided to us by the first author in a personal communication. The numbers will not be, strictly speaking, comparable. However, we hope they provide insight into the relative performance of the tagger.
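The projection step itself can be sketched as copying each source token's tag to its aligned target position (alignment pairs and tags below are invented for illustration):

```python
def project_tags(source_tags, alignment, target_len):
    """Copy source tags to aligned target positions; unaligned tokens get None."""
    projected = [None] * target_len
    for src_i, tgt_j in alignment:
        projected[tgt_j] = source_tags[src_i]
    return projected

src_tags = ["pos=DET", "pos=N|num=SG", "pos=V|per=3"]
alignment = [(0, 0), (1, 1), (2, 3)]  # (source index, target index) pairs
print(project_tags(src_tags, alignment, target_len=4))
# ['pos=DET', 'pos=N|num=SG', None, 'pos=V|per=3']
```

The gaps left by unaligned tokens are one source of the noise that projection-based training must cope with.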

Architecture Study
Additionally, we perform a thorough study of the neural transfer learner, considering all three architectures. A primary goal of our experiments is to determine which of our three proposed neural transfer techniques is superior. Even though our experiments focus on morphological tagging, these architectures are more general in that they may be easily applied to other tasks, e.g., parsing or machine translation. We additionally explore the viability of multi-source transfer, i.e., the case where we have multiple source languages. All of our architectures generalize to the multi-source case without any complications.

Experimental Details
We train our models under the following conditions.

Evaluation Metrics. We evaluate using average per-token accuracy, as is standard for both POS tagging and morphological tagging, and per-feature F1 as employed in Buys and Botha (2016). The per-feature F1 calculates a key-specific F1^k for each key k in the target language's tags by asking whether the key-attribute pair k=v is in the predicted tag. Then, the key-specific F1^k values are averaged equally. Note that F1 is a more flexible metric, as it gives partial credit for getting some of the attributes in the bundle correct, whereas accuracy does not.

Table 5: Comparison of our approach to various baselines for low-resource tagging under token-level accuracy. We compare on only those languages in Buys and Botha (2016). Note that tag-level accuracy was not reported in the original B&B paper, but was acquired through personal communication with the first author. All architectures presented in this work are used in their multi-source setting. The B&B and MARMOT models are single-source.
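One way to compute such a per-feature F1, following our reading of the metric's description (the original implementation may differ in details such as how unmarked keys are counted):

```python
def per_feature_f1(gold_tags, pred_tags):
    """Macro-averaged F1 over morphological keys.

    Each tag is a dict of key -> attribute. For every key, a prediction counts
    as a true positive when the predicted attribute matches the gold one; the
    key-specific F1 scores are then averaged with equal weight.
    """
    keys = {k for tag in gold_tags for k in tag}
    f1s = []
    for k in keys:
        tp = sum(1 for g, p in zip(gold_tags, pred_tags)
                 if k in g and p.get(k) == g[k])
        pred_k = sum(1 for p in pred_tags if k in p)
        gold_k = sum(1 for g in gold_tags if k in g)
        prec = tp / pred_k if pred_k else 0.0
        rec = tp / gold_k if gold_k else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

gold = [{"pos": "N", "num": "SG"}, {"pos": "V", "num": "PL"}]
pred = [{"pos": "N", "num": "PL"}, {"pos": "V", "num": "PL"}]
print(per_feature_f1(gold, pred))  # pos fully correct, num half correct: 0.75
```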
Hyperparameters. Our networks are four layers deep (two LSTM layers for the character embedder, i.e., to compute v_n, and two LSTM layers for the tagger, i.e., to compute e_n) and we use a representation size of 128 for the character input vectors and hidden layers of 256 nodes in all other cases. All networks are trained with the stochastic gradient method RMSProp (Tieleman and Hinton, 2012), with a fixed initial learning rate and a learning rate decay that is adjusted per language according to the amount of training data. The batch size is always 16. Furthermore, we use dropout (Srivastava et al., 2014) with the dropout probability set to 0.2. We used Torch 7 (Collobert et al., 2011a) to configure the computation graphs implementing the network architectures.

Results and Discussion
We report our results in two tables. First, we report a detailed cross-lingual evaluation in Table 4. Second, we report a comparison against two baselines in Table 5 (accuracy) and Table 6 (F1). We see two general trends in the data. First, we find that genetically closer languages yield better source languages. Second, we find that the multi-softmax architecture is the best in terms of transfer ability, as evinced by the results in Table 4. We find a wider gap between our model and the baselines under accuracy than under F1. We attribute this to the fact that F1 is a softer metric in that it assigns credit to partially correct guesses.
Source Language. As discussed in §2, the transfer of morphology is language-dependent. This intuition is borne out in the results of our study (see Table 4). We see that within the closer grouping of the Western Romance languages, i.e., Catalan, French, Italian, Portuguese, and Spanish, it is easier to transfer than with Romanian, an Eastern Romance language. Within the Western grouping, we see that the close pairs, e.g., Spanish and Portuguese, are amenable to transfer. We find a similar pattern in the other language families, e.g., Russian is the best source language for Ukrainian, Czech is the best source language for Slovak, and Finnish is the best source language for Estonian.
Multi-Source Transfer. In many cases, we find that multiple sources noticeably improve the results over the single-source case. For instance, when we have multiple Romance languages as source languages, we see gains of up to 2%. We also see gains in the North Germanic languages when using multiple source languages. From a linguistic point of view, this is logical, as different source languages may be similar to the target language along different dimensions, e.g., when transferring among the Slavic languages, we note that Russian retains the complex nominal case system of Serbian, but South Slavic Bulgarian is lexically more similar.
Performance Against the Two Baselines. As shown in Table 5 and Table 6, our model outperforms the projection tagger of Buys and Botha (2016), even though our approach does not utilize bitext, large-scale alignment, or monolingual corpora; rather, all transfer between languages happens through the forced sharing of character-level features.7 Our model does, however, require annotation of a small number of sentences in the target language for training. We note, however, that this does not necessitate a large number of human annotation hours (Garrette and Baldridge, 2013).
Reducing Sample Complexity. Another interesting point about our model, best evinced in Figure 3, is that the feature-based CRF approach seems to be a better choice in the low-resource setting, i.e., the neural model has greater sample complexity. However, in the multi-task scenario, we find that the neural tagger's learning curve is even steeper. In other words, if we have to train a tagger on very little data, we are better off using a neural multi-task approach than a feature-based approach; preliminary attempts to develop a multi-task version of MARMOT failed (see Figure 3).

Related Work
We divide the discussion of related work topically into three parts for ease of intellectual digestion.

7 We would like to highlight some issues of comparability with the results in Buys and Botha (2016). Strictly speaking, the results are not comparable and our improvement over their method should be taken with a grain of salt. As the source code is not publicly available and was developed in industry, we resorted to the numbers in their published work and additional numbers obtained through direct communication with the authors. First, we used a slightly newer version of UD to incorporate more languages: we used v2 whereas they used v1.2. There are minor differences in the morphological tagset used between these versions. Also, in the |D_t| = 1000 setting, we are training on significantly more data than the models in Buys and Botha (2016). A much fairer comparison is to our models with |D_t| = 100. Also, we compare to their method using their standard (non) setup. This is fair insofar as we evaluate in the same manner, but it disadvantages their approach, which cannot predict tags that are not in the source language.
Most cross-lingual work in NLP, whether focusing on morphology or otherwise, has concentrated on indirect supervision rather than transfer learning. The goal in such a scenario is to provide noisy labels for training the tagger in the low-resource language through annotations projected over aligned bitext with a high-resource language. This method of projection was first introduced by Yarowsky and Ngai (2001) for the projection of POS annotation. While follow-up work (Fossum and Abney, 2005; Das and Petrov, 2011; Täckström et al., 2012) has continually demonstrated the efficacy of projecting simple part-of-speech annotations, Buys and Botha (2016) were the first to show the use of bitext-based projection for the training of a morphological tagger for low-resource languages.
As we also discuss the training of a morphological tagger, our work is most closely related to Buys and Botha (2016) in terms of the task itself. We contrast the approaches. The main difference is that our approach is not projection-based and, thus, does not require bitext or the construction of a bilingual lexicon for projection. Rather, our method jointly learns multiple taggers and forces them to share features, a true transfer learning scenario. In contrast to projection-based methods, our procedure always requires a minimal amount of annotated data in the low-resource target language. In practice, however, this distinction is non-critical, as projection-based methods without a small amount of seed target-language data perform poorly (Buys and Botha, 2016).

Character-level NLP.
Our work also follows a recent trend in NLP, whereby traditional word-level neural representations are being replaced by character-level representations for a myriad of tasks, e.g., POS tagging (dos Santos and Zadrozny, 2014), parsing (Ballesteros et al., 2015), language modeling (Ling et al., 2015), and sentiment analysis (Zhang et al., 2015), as well as the tagger of Heigold et al. (2017), whose work we build upon. Our work is also related to recent work on character-level morphological generation using neural architectures (Faruqui et al., 2016; Rastogi et al., 2016).

Neural Cross-lingual Transfer in NLP.
In terms of methodology, however, our proposal bears similarity to recent work in speech and machine translation; we discuss each in turn. In speech recognition, Heigold et al. (2013) train a cross-lingual neural acoustic model on five Romance languages. The architecture bears similarity to our multi-language softmax approach. Dependency parsing benefits from cross-lingual learning in a similar fashion (Guo et al., 2015, 2016).

Table 6: Comparison of our approach to various baselines for low-resource tagging under F1, to allow for a more complete comparison to the model of Buys and Botha (2016). All architectures presented in this work are used in their multi-source setting. The B&B and MARMOT models are single-source. We only compare on those languages used in B&B.
In neural machine translation (Sutskever et al., 2014; Bahdanau et al., 2015), recent work (Firat et al., 2016; Zoph and Knight, 2016; Johnson et al., 2016) has explored the possibility of jointly training translation models for a wide variety of languages. Our work addresses a different task, but the undergirding philosophical motivation is similar, i.e., attacking low-resource NLP through multi-task transfer learning. Kann et al. (2017) offer a similar method for cross-lingual transfer in morphological inflection generation.

Conclusion
We have presented three character-level recurrent neural network architectures for multi-task cross-lingual transfer of morphological taggers. We provided an empirical evaluation of the technique on 18 languages from four different language families, showing the widespread applicability of the method. We found that the transfer of morphological taggers is an eminently viable endeavor among related languages and, in general, the closer the languages, the easier the transfer of morphology becomes. Our technique outperforms two strong baselines proposed in previous work. Moreover, we define standard low-resource training splits in UD for future research in low-resource morphological tagging. Future work should focus on extending the neural morphological tagger to a joint lemmatizer (Müller et al., 2015) and evaluating its functionality in the low-resource setting.

Figure 3: A learning curve for Spanish and Catalan comparing our monolingual model, our joint model, and two MARMOT models. The first MARMOT model is identical to those trained in the rest of the paper, and the second shows a multi-task approach, which failed, so no further experimentation was performed with this model.

Figure 1: Example of a morphologically tagged sentence in Russian using the annotation scheme provided in the UD dataset.

Table 1: Partial inflection table for the Spanish verb soñar.

Table 2: Number of tokens in each of the train, development, and test splits (organized by language family).

Table 3: Number of unique morphological tags for each of the experimental languages (organized by family).