Exploring Cross-Lingual Transfer of Morphological Knowledge In Sequence-to-Sequence Models

Multi-task training is an effective method to mitigate the data sparsity problem. It has recently been applied to cross-lingual transfer learning for paradigm completion (the task of producing inflected forms of lemmata) with sequence-to-sequence networks. However, it remains unclear how the model transfers knowledge across languages, and whether and which information is shared. To investigate this, we propose a set of data-dependent experiments using an existing encoder-decoder recurrent neural network for the task. Our results show that the performance gains indeed surpass a pure regularization effect and that knowledge about language and morphology can be transferred.


Introduction
Neural sequence-to-sequence models define the state of the art for paradigm completion (Cotterell et al., 2016, 2017; Kann and Schütze, 2016), the task of generating inflected forms of a lemma's paradigm, e.g., filling the empty fields in Table 1 using one of the non-empty fields.
However, those models are in general very data-hungry and do not perform well in low-resource settings. Therefore, Kann et al. (2017) propose to leverage morphological knowledge from a high-resource language (source language) to improve paradigm completion in a closely related language with insufficient resources (target language). This is achieved by a form of multi-task learning: they train an encoder-decoder model simultaneously on training examples for both languages. While more closely related languages seem to help more than distant ones, the mechanisms by which this transfer works remain largely obscure. Several possibilities exist: (i) learning of target-tag-specific word transformations from the high-resource language (trans); (ii) training of the character language model of the decoder (LM); (iii) learning a bias to copy a large part of the input (copy), since members of the same paradigm mostly share the same stem; (iv) a general regularization effect obtained by multi-task training (reg).
In this work, we intend to shed light on how cross-lingual transfer learning for paradigm completion with an encoder-decoder model works, focusing especially on the role of the character and tag embeddings. In particular, we aim to answer the following questions: (i) What does the neural model learn from the tags of a high-resource language for the tags of a low-resource language? (ii) Is sharing an alphabet important for the transfer? (iii) How much of the transfer learning can be reduced to a regularization effect achieved by multi-task learning?

Transfer Learning for Paradigm Completion
In this section, we describe cross-lingual transfer learning for morphology and the model used for it.
Cross-lingual transfer. Transfer learning for paradigm completion is much more language-specific than most semantic natural language processing tasks, like entity typing or machine translation. An extreme example is the infeasible task of transferring morphological knowledge from Chinese to Portuguese, as Chinese does not make use of inflection at all. Even between two morphologically rich languages, transfer is difficult if they are unrelated, since inflections often mark dissimilar subcategories and word forms do not share similarities. However, Kann et al. (2017) show that transferring morphological knowledge from Spanish to Portuguese, two languages with similar morphology and 89% lexical similarity, works well and, more surprisingly, that even supposedly very different languages like Arabic and Spanish can benefit from each other. They make this possible by training an encoder-decoder model and appending a special tag (i.e., embedding) for each language to the input of the system, similar to Johnson et al. (2016). It is currently unclear, though, what the nature of this transfer is, which motivates our more detailed exploration.
Model description. The model that Kann et al. (2017) use, and that we explore in more detail here, is an encoder-decoder recurrent neural network (RNN) with attention (Bahdanau et al., 2015). It is trained to maximize the following log-likelihood:

$$\mathcal{L}(\theta) = \sum_{(k, w_s) \in D_s} \log p_\theta(f_k[w_s] \mid t_k, w_s) + \sum_{(k, w_t) \in D_t} \log p_\theta(f_k[w_t] \mid t_k, w_t)$$

We denote the source training examples as D_s and the target training examples as D_t. w_s represents a lemma in a high-resource source language s and w_t represents a lemma in a low-resource target language t. k represents a given slot in the paradigm and f_k[w] is the inflected form of w corresponding to the morphological tag t_k. The parameters θ of the model are tied for both the high-resource language and the low-resource language to enable transfer learning.

Figure 1: Overview of an encoder-decoder RNN, mapping the Spanish lemma soñar to the target form sueña. The thickness of the arrows towards the circled plus symbol corresponds to each attention weight. All tags in the input are omitted.
In detail, a bidirectional gated RNN is used to encode the input sequence, which consists of a language tag, morphological tags and characters of the input language. The decoder generates the output sequence from the characters of the same language, and consists of a unidirectional RNN with an attention mechanism over the encoder hidden states. Notably, the elements of the input and the output are represented by embeddings living in separate spaces.
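To make the input format concrete, the following minimal Python sketch assembles an encoder input from a language tag, morphological tags and the characters of the lemma, and a decoder output from the characters of the inflected form. The symbol spellings (`LANG=...`, `TAG=...`) and the tag inventory are assumptions chosen for illustration, not the format used in the original system.

```python
from typing import List


def build_input(lang: str, tags: List[str], lemma: str) -> List[str]:
    """Encoder input: one language tag, then morphological tags, then characters."""
    lang_symbol = f"LANG={lang}"              # hypothetical spelling of the language tag
    tag_symbols = [f"TAG={t}" for t in tags]  # hypothetical spelling of morphological tags
    return [lang_symbol] + tag_symbols + list(lemma)


def build_output(form: str) -> List[str]:
    """Decoder output: the characters of the inflected form."""
    return list(form)


# Illustrative example using the lemma and form from Figure 1; the tags are assumed.
example_in = build_input("es", ["V", "IND", "PRS", "3", "SG"], "soñar")
example_out = build_output("sueña")
```

Keeping tags and characters as distinct symbol types in a single sequence is what allows a single encoder to consume both.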
Hyperparameters. Encoder and decoder RNNs have 100 hidden units and we use 300-dimensional embeddings.
We train using ADADELTA (Zeiler, 2012) with minibatch size 20. All models for all experiments are trained for a maximum of 150 epochs. The best model is applied at test time.

Exploration of Transfer Learning
In order to answer the questions raised in the introduction, we conduct the following experiments.

Data
We use the Romance and Arabic language data from Kann et al. (2017). In particular, each training file contains 12,000 high-resource examples mixed with 50 or 200 fixed Spanish instances.

Table 3: Transfer mechanisms expected to be affected (X) by each experiment.

          trans   LM   copy   reg
l-ciph    X       X
t-ciph    X
l-emb     X       X    X
t-emb     X

Experiments
Letter cipher (l-ciph). Let C = C_low ∪ C_high be the union of the sets of all characters in the alphabets of the low-resource language and the high-resource language, respectively. We define a bijective cipher function f_ciph : C → C, mapping each character to a different character, chosen at random. Then, we apply this function to the elements of the input and output words in the high-resource language and train the model on this modified data. The low-resource samples in train, dev and test remain unchanged.
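A letter cipher of this kind can be sketched as follows. The construction is an assumption: it shuffles the joint alphabet and maps each character to its successor in the shuffled order, which yields a bijection without fixed points but does not sample uniformly from all such mappings. The function names and the Portuguese example words are illustrative only.

```python
import random
from typing import Dict, Iterable


def make_letter_cipher(alphabet: Iterable[str], seed: int = 0) -> Dict[str, str]:
    """Random bijection on the alphabet in which no character maps to itself."""
    order = sorted(set(alphabet))
    if len(order) < 2:
        raise ValueError("need at least two characters to avoid fixed points")
    random.Random(seed).shuffle(order)
    # Each character maps to its successor in the shuffled cyclic order.
    return {ch: order[(i + 1) % len(order)] for i, ch in enumerate(order)}


def encipher(word: str, cipher: Dict[str, str]) -> str:
    """Apply the cipher character by character."""
    return "".join(cipher[ch] for ch in word)


# The cipher is defined on the joint alphabet but applied to high-resource words only.
low_words = ["soñar", "sueña"]     # low-resource side, left unchanged
high_words = ["sonhar", "sonha"]   # illustrative high-resource words
cipher = make_letter_cipher("".join(low_words + high_words))
enciphered = [encipher(w, cipher) for w in high_words]
```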
We expect this to have the following effects: (i) languages do not share affixes anymore; (ii) as we use the same embeddings for the changed and unchanged characters, the model might learn wrong affixes for tags; (iii) an incorrect character language model could be learned; and (iv) a general bias to copy should remain unchanged.

Tag cipher (t-ciph). We further consider the union of the sets of all morphological tags existing in the low- and high-resource languages: T = T_low ∪ T_high. We define a bijective cipher function f_ciph : T → T. We then apply this function to all tags in the high-resource language input and train a new model. The low-resource examples in train, dev and test are not changed.
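At the dataset level, the tag cipher can be sketched as below. The example layout `(language, tags, lemma, form)`, the function names and the tag inventory are assumptions for illustration; the cipher here is a plain random permutation of the joint tag set.

```python
import random
from typing import Dict, Iterable, List, Tuple

Example = Tuple[str, List[str], str, str]  # (language, tags, lemma, form); assumed layout


def make_tag_cipher(tags: Iterable[str], seed: int = 0) -> Dict[str, str]:
    """Random permutation of the joint tag set T."""
    source = sorted(set(tags))
    target = source[:]
    random.Random(seed).shuffle(target)
    return dict(zip(source, target))


def apply_tag_cipher(data: List[Example], cipher: Dict[str, str],
                     high_lang: str) -> List[Example]:
    """Replace tags in high-resource examples only; other examples are copied as-is."""
    result = []
    for lang, tags, lemma, form in data:
        if lang == high_lang:
            tags = [cipher[t] for t in tags]
        result.append((lang, tags, lemma, form))
    return result


data = [
    ("pt", ["V", "PRS", "3", "SG"], "sonhar", "sonha"),  # high-resource (illustrative)
    ("es", ["V", "PRS", "3", "SG"], "soñar", "sueña"),   # low-resource, untouched
]
tag_cipher = make_tag_cipher(t for _, tags, _, _ in data for t in tags)
ciphered = apply_tag_cipher(data, tag_cipher, high_lang="pt")
```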
We expect this to: (i) disturb the learning of correspondences between target tags and output characters; (ii) not influence anything else.

Language-dependent letter embeddings (l-emb). We now use different embeddings for the characters of the two languages. This corresponds to a setting where the source and target languages do not share the same vocabulary. This should result in: (i) making it impossible for the model to learn which affixes have to be produced for which tag, possibly resulting in benefits for more distant languages and worse performance for extremely close languages; and (ii) transfer of the decoder's character language model becoming impossible.
Language-dependent tag embedding (t-emb). Additionally, we also experiment with different embeddings for the morphological tags in different languages.
We expect the following to happen: (i) the model can learn a character language model in the output, which might be good for related and bad for more distant languages; (ii) it should not be possible for the model to learn a correspondence between tags and characters in the output sequence; and (iii) the model cannot get information about tags in the low-resource language from the high-resource language's examples.
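The text does not state how the language-dependent embeddings are implemented. One data-level possibility, sketched here under that assumption, is to make the affected symbols language-specific before vocabulary lookup, so that the same character or tag receives a separate embedding row per language. All names and the key format are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Symbol = Tuple[str, str]  # (kind, value) with kind "tag" or "char"; assumed representation


def vocab_key(symbol: Symbol, lang: str, split_kinds: Set[str]) -> str:
    """Vocabulary key; kinds in split_kinds get a language suffix."""
    kind, value = symbol
    if kind in split_kinds:
        return f"{kind}:{value}@{lang}"
    return f"{kind}:{value}"


def index_sequence(seq: List[Symbol], lang: str, split_kinds: Set[str],
                   vocab: Dict[str, int]) -> List[int]:
    """Map symbols to embedding indices, growing the vocabulary on first sight."""
    return [vocab.setdefault(vocab_key(s, lang, split_kinds), len(vocab)) for s in seq]


seq = [("tag", "V"), ("char", "a")]

# l-emb: characters are language-dependent, tags are shared.
l_emb_vocab: Dict[str, int] = {}
es_ids = index_sequence(seq, "es", {"char"}, l_emb_vocab)
pt_ids = index_sequence(seq, "pt", {"char"}, l_emb_vocab)

# t-emb: tags are language-dependent, characters are shared.
t_emb_vocab: Dict[str, int] = {}
es_ids_t = index_sequence(seq, "es", {"tag"}, t_emb_vocab)
pt_ids_t = index_sequence(seq, "pt", {"tag"}, t_emb_vocab)
```

With an empty `split_kinds` set, the same function reproduces the shared-embedding setup.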
We additionally perform two final experiments:

Language-dependent letter embeddings with separation symbol (l-emb-sep). This is the same as l-emb, but we introduce a new separation symbol SEP between the tags and the characters, solving the problem that it is not clear where the tags end and the word starts. We expect equal or better performance than for l-emb.

Language-dependent tag embedding with separation symbol (t-emb-sep). This is equivalent to t-emb, but we again insert a new separation symbol SEP between the tags and the input word's characters. We expect equal or better performance than for t-emb.
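The effect of the separation symbol on the input sequence can be illustrated with a short sketch. The spelling `<SEP>` and the tag symbols are assumptions; only the position of the symbol, between the last tag and the first character, is taken from the description above.

```python
from typing import List

SEP = "<SEP>"  # hypothetical spelling of the separation symbol


def build_input_with_sep(lang_tag: str, morph_tags: List[str], word: str) -> List[str]:
    """Language tag, morphological tags, SEP, then the characters of the word."""
    return [lang_tag] + morph_tags + [SEP] + list(word)


sequence = build_input_with_sep("LANG=es", ["TAG=V", "TAG=SG"], "soñar")
```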

Intuition
In Table 3 we display an overview of which of the working mechanisms of cross-lingual transfer learning we expect to be affected by which changes to the high-resource training data. Depending on the relationship between the source and the target language, e.g., whether they use the same affixes to express the same morphosyntactic properties, we anticipate stronger or weaker effects. The regularization effect should not be influenced by our changes to the data.

Results and Analysis
For the low-resource training set of size 50, the models with the original setup and without transfer perform best and worst, respectively. However, for low-resource training size 200, t-emb-sep performs best in most cases, and the model without transfer still performs worst. The order of the accuracies averaged over all languages can be seen in Figure 2: original > t-emb-sep > t-emb > t-ciph > l-emb-sep > l-emb > l-ciph for 50 and t-emb-sep > l-emb-sep > l-emb > t-emb > original > t-ciph > l-ciph for 200 low-resource examples. The detailed results for each language can be found in Table 4. First, this shows clearly that the character embeddings are more important for the task than the tag embeddings. Second, l-emb (resp. t-emb) and l-ciph (resp. t-ciph) correspond to a setting with no additional information vs. a setting with potentially wrong information. Generally higher accuracies for separate embedding spaces indicate that the model can learn incorrect information via transfer. Thus, the choice of the source language seems to be very important. The differences in performance between original and l-emb represent the influence of shared vs. separate embedding spaces, i.e., vocabularies in the case of the letters. Sharing a vocabulary strongly influences the final accuracy, and more positively for 50 low-resource examples. We can explain this with the model learning to copy: it has no intrinsic way of knowing which input character equals which output character in the vocabulary unless it has seen it at least once. However, for 200 Spanish examples, we can expect all characters to appear in the Spanish training data, such that the character language model and tag-output correspondence become more important.
This explains the unexpected result that l-emb performs best for Arabic (200) and Portuguese (200): both source languages potentially confuse the language model; for Portuguese, we attribute this to a large overlap of lemmata between the two languages, with Portuguese often inflecting in a different way (Kann et al., 2017). Further, the differences in performance between original and t-emb show that the model indeed learns information from the tags, presumably which output sequence is more likely to appear with which tag.
The l-emb-sep and t-emb-sep results show that a separation symbol clearly improves the model's performance.

Related Work
Transfer learning with encoder-decoder networks. Encoder-decoder RNNs were introduced by Cho et al. (2014) and Sutskever et al. (2014) and extended with an attention mechanism by Bahdanau et al. (2015). Lately, much work has been done on multi-task learning and transfer learning with encoder-decoder RNNs. Luong et al. (2015) investigated multi-task setups for sequence-to-sequence learning, combining multiple encoders and decoders. In contrast, in our experiments, we use only one encoder and one decoder. There exists much work on multi-task learning with encoder-decoder RNNs for machine translation (Johnson et al., 2016; Dong et al., 2015; Firat et al., 2016; Ha et al., 2016). Alonso and Plank (2016) explored multi-task learning empirically, analyzing when it improves performance. Here, we focus on how transfer via multi-task learning works.
Paradigm completion. SIGMORPHON hosted two shared tasks on paradigm completion (Cotterell et al., 2016, 2017) in order to encourage the development of systems for the task. One approach is to treat it as a string transduction problem by applying an alignment model with a semi-Markov model (Durrett and DeNero, 2013; Nicolai et al., 2015). Recently, neural sequence-to-sequence models have also been widely used (Faruqui et al., 2016; Kann and Schütze, 2016; Aharoni and Goldberg, 2017; Zhou and Neubig, 2017). All of the above-mentioned approaches were designed for a single language.

Conclusion
We conducted a set of experiments to explore the mechanisms behind cross-lingual transfer learning for morphological reinflection. Our findings indicate that knowledge about a language's typical character sequences and outputs for certain morphological tags can be transferred. In particular, this means that the effect cannot be reduced to regularization alone.