Multi-Team: A Multi-attention, Multi-decoder Approach to Morphological Analysis.

This paper describes our submission to SIGMORPHON 2019 Task 2: Morphological analysis and lemmatization in context. Our model is a multi-task sequence to sequence neural network, which jointly learns morphological tagging and lemmatization. On the encoding side, we exploit character-level as well as contextual information. We introduce a multi-attention decoder to selectively focus on different parts of character and word sequences. To further improve the model, we train on multiple datasets simultaneously and use external embeddings for initialization. Our final model reaches an average morphological tagging F1 score of 94.54 and a lemma accuracy of 93.91 on the test data, ranking respectively 3rd and 6th out of 13 teams in the SIGMORPHON 2019 shared task.


Introduction
This paper presents our model for the SIGMORPHON 2019 Task 2 on morphological analysis and lemmatization in context (McCarthy et al., 2019). The task is to generate a lemma and a sequence of morphological tags, called morphosyntactic descriptions (MSD), for each word in a given sentence. This task is important because it can improve several downstream NLP applications such as grammatical error correction (Ng et al., 2014), machine translation (Conforti et al., 2018) and multilingual parsing. Table 1 shows the lemma and morphological tags of: Johnny likes cats.
The first sub-task, lemmatization, is to transform an inflected word form into its lemma, which is its base form (or dictionary form), as in the example of likes to like. The second sub-task, morphological tagging, is to predict the morphological properties of words as a sequence of tags, including a part-of-speech tag. These morphological tags specify the inflections encoded in word forms. In the example sentence, the word likes is annotated with the morphological tag set {V,SG,3,IND,PRS}. Both tasks depend on context. For example, walking is annotated with the lemma walking and the tag set {N,SG} in the sentence The beach is within walking distance, whereas it is annotated with the lemma walk and the tag set {V.PTCP;PRS;V} in I was walking.
These two tasks have a clear relation; in most languages the categories found in the morphological tags indicate how the lemma of the word was inflected to the word-form. In other words, syntactic inflections have a strong correlation with the morphological properties of the words.
Our approach to both tasks consists of an encoder and two separate decoders within a multi-task architecture based on a sequence-to-sequence network. The shared encoder reads words and sentences to learn character-level and word-level representations. The decoders then separately generate lemmas and morphological tags from these representations using multiple attention mechanisms. Our contributions are threefold:
• We introduce the use of multiple attention mechanisms that selectively focus on character and word sequences in the sentence context.
• We evaluate the effect of a variety of types of external embeddings for lemmatization and morphological tagging.
• We evaluate the effect of combining annotated datasets from related languages for both tasks using dataset embeddings.

Related work
Our system is based on three main approaches which are heavily studied in the existing literature: sequence-to-sequence learning, multi-task learning and multi-lingual learning. Recent work on computational morphology showed that neural sequence-to-sequence (seq2seq) models (Sutskever et al., 2014; Bahdanau et al., 2014) have yielded new state-of-the-art performance on various tasks including morphological reinflection and lemmatization (Cotterell et al., 2016, 2017, 2018). Building on this, Dayanık et al. (2018) utilize different levels of representations such as character-level, word-level and sentence-level in the encoder of their seq2seq architecture based on previous work (Heigold et al., 2017).
Multi-task learning approaches for jointly learning related tasks have been successfully employed on syntactic and semantic tasks (Plank et al., 2016). In the context of morphological analysis, this has been used by Kementchedjhieva et al. (2018), who jointly learn morphosyntactic tags and inflections for a word in a given context, and use a shared encoder within a multi-task architecture consisting of multiple decoders, similar to our model.
Multi-lingual learning approaches, which benefit from joint learning over multiple languages, have also been studied on various tasks with different architectures. Ammar et al. (2016) use a language embedding that encodes information about the language, word-order properties and typological properties for dependency parsing. In multilingual neural machine translation, Johnson et al. (2017) use a special token to indicate the target language. In this work, our model follows the treebank embedding approach, which was introduced to combine several treebanks for a single language or closely related languages.
Most similar to our model, Kondratyuk et al. (2018) use a joint decoder approach for morphological tagging and lemmatization. However, our model differs from theirs in substantial ways. Our model employs an encoder-decoder architecture which utilizes different levels of attention components with a multi-lingual/multi-dataset signal. Moreover, our model solves the tagging problem as a sequential prediction task instead of multilayer classification, so that we can use the same architecture for both lemmatization and tagging, which are described in Sections 3.2 and 3.3.

System Description
Our model is inspired by the architecture of Dayanık et al. (2018). We employ an encoder-decoder model over the character and word sequences. Following Dayanık et al. (2018), the encoder in our model consists of two parts. First, a word encoder, which runs on the character level, is used to generate embeddings for each word (Section 3.1.1). Second, a context encoder takes these word embeddings as input and runs on the sentence level (Section 3.1.4). We also experiment with two methods to complement the word-level embeddings (Sections 3.1.2 and 3.1.3).
The representations generated by the encoder at the different levels are then passed into the decoders. Unlike Dayanık et al. (2018), who use one decoder for both the lemmas and the morphological tags, we use two different decoders in a multi-task architecture. The tag decoder produces a set of morphological tags by using word representations and a joint attention mechanism in which one attention focuses on words and the other focuses on characters (Section 3.2). The lemma decoder produces a lemma by using the same information complemented with the output embeddings of the tag decoder (Section 3.3).
Multi-task Learning The decoders work jointly in a multi-task fashion and share all internal representations of the encoder. The whole network is trained by backpropagating the sum of the losses of the decoders without any weighting:
$$\mathcal{L} = \mathcal{L}_{tag} + \mathcal{L}_{lemma}$$
where the morphological tag loss $\mathcal{L}_{tag}$ and the lemma loss $\mathcal{L}_{lemma}$ are separately computed as the negative log likelihood loss over their softmax outputs.
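The unweighted objective can be illustrated with a minimal sketch. The function names and the toy probabilities below are hypothetical; the sketch only assumes that each decoder emits one softmax distribution per output step and that each loss is the negative log likelihood of the gold symbols.

```python
import math

def nll(step_probs, gold_ids):
    """Negative log likelihood of the gold symbols under per-step softmax outputs."""
    return -sum(math.log(probs[g]) for probs, g in zip(step_probs, gold_ids))

def joint_loss(tag_probs, gold_tags, lemma_probs, gold_lemma):
    """Unweighted sum of both decoder losses: L = L_tag + L_lemma."""
    return nll(tag_probs, gold_tags) + nll(lemma_probs, gold_lemma)

# Toy softmax outputs (hypothetical values): two tag steps, three lemma steps.
tag_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
lemma_probs = [[0.9, 0.1], [0.6, 0.4], [0.5, 0.5]]
loss = joint_loss(tag_probs, [0, 1], lemma_probs, [0, 0, 1])
```

Because the two terms are simply added, gradients from both decoders flow into the shared encoder with equal weight.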
Notation Given a sentence $S = w_1, ..., w_n$ and $w_i = c_1, ..., c_m$, where $w$ denotes words and $c$ denotes characters, our model processes $S$ and $w_i$ in the encoders and jointly produces a set of morphological tags $t_i = t_{i,1}, ..., t_{i,\gamma}$ and a lemma $l_i = l_{i,1}, ..., l_{i,\phi}$, which is a sequence of characters.

Encoder
In the following subsections, we explain the different parts of the encoder. An overview of the encoder architecture is shown in Figure 1.
[Figure 1: Overview of the encoder architecture.]

Word Encoder
We use a bidirectional GRU layer (Cho et al., 2014) to encode character sequences in the word encoder. We first pass each character of a word $w_i$ through an embedding layer that maps it to a fixed-dimensional character embedding. The bi-GRU processes the character embeddings in both directions and produces the hidden states $h^c_{i,1}, ..., h^c_{i,m}$. The resulting word embedding $e^c_i$ is computed by concatenating the final states of the forward and backward GRUs for the given word:
$$e^c_i = [\overrightarrow{h}^c_{i,m} ; \overleftarrow{h}^c_{i,1}]$$

Word-Surface Embeddings
In addition to the character-level word embeddings, we use surface-level word embeddings, which are either learned in a standalone embedding layer or taken from pre-trained external embeddings. Word-surface embeddings are denoted by $e^w_i$. To handle unknown words, we use word dropout to overcome the sparsity issue. Following Kiperwasser and Goldberg (2016), we replace a word with the unknown token with a probability that is inversely proportional to the frequency of the word, so that the representation of the unknown token is learned from infrequent words and their contexts.
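A minimal sketch of frequency-based word dropout is given here for illustration. It assumes the replacement probability takes the form $\alpha / (\#(w) + \alpha)$ used by Kiperwasser and Goldberg (2016); the value of `alpha`, the `<unk>` symbol and the function names are assumptions and are not taken from our implementation.

```python
import random
from collections import Counter

UNK = "<unk>"  # hypothetical symbol for the unknown token

def unk_probability(count, alpha=0.25):
    """Replacement probability that shrinks as the word becomes more frequent."""
    return alpha / (count + alpha)

def word_dropout(sentence, counts, alpha=0.25, rng=random):
    """Replace each training word by UNK with probability alpha / (count + alpha)."""
    return [UNK if rng.random() < unk_probability(counts[w], alpha) else w
            for w in sentence]

# Toy training counts and one noisy copy of a training sentence.
counts = Counter("the cat saw the dog and the bird".split())
rng = random.Random(0)
original = ["the", "cat", "saw", "the", "bird"]
noisy = word_dropout(original, counts, rng=rng)
```

Rare words are replaced more often than frequent ones, so the unknown-token embedding is trained mostly in the contexts of infrequent words.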

Dataset Embeddings
In order to train our model on multiple datasets at once, we use a dataset embedding $e^d_a$ for each dataset $a$. These embeddings enable us to combine multiple datasets without losing their monolingual and heterogeneous characteristics. The strategy that we use to pick and combine datasets is described in Section 4.2.

Context Encoder
In order to encode sentence-level contextual information, we use another bidirectional GRU layer. For a given sentence, we first concatenate the output of the word encoder $e^c_i$, the word-surface embedding $e^w_i$ and the dataset embedding $e^d_a$ for each word in the sentence. The resulting embedding sequence $e^{in}_1, ..., e^{in}_n$ is then passed into the bi-GRU. The output of the bi-GRU is a sequence of embeddings $e^s_1, ..., e^s_n$, each representing a word in the sentence:
$$e^s_{1:n} = \text{bi-GRU}(e^{in}_{1:n}) \quad (4)$$
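The construction of the context encoder input can be sketched as follows. The dimensions, the dataset names and the helper `build_context_inputs` are hypothetical; the sketch only shows that the same dataset vector is appended to every word of a sentence drawn from that dataset.

```python
def build_context_inputs(char_embs, surface_embs, dataset_emb):
    """Concatenate e^c_i, e^w_i and the shared dataset vector e^d_a for every word."""
    return [c + w + dataset_emb for c, w in zip(char_embs, surface_embs)]

# Hypothetical toy dimensions: 4 (character-level), 3 (surface), 2 (dataset).
dataset_table = {"treebank_a": [1.0, 0.0], "treebank_b": [0.0, 1.0]}
char_embs = [[0.1] * 4, [0.2] * 4, [0.3] * 4]      # one vector per word
surface_embs = [[0.5] * 3, [0.6] * 3, [0.7] * 3]   # one vector per word
inputs = build_context_inputs(char_embs, surface_embs, dataset_table["treebank_a"])
```

Each resulting vector would then be fed to the sentence-level bi-GRU, which is omitted from the sketch.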

Tag Decoder
As shown in Figure 2, we use a 2-layer stacked bidirectional GRU as the tag decoder to generate the morphological tags $t_i = t_{i,1}, ..., t_{i,\gamma}$ for the target word $w_i$ in a given sentence. In order to utilize both character-level and contextual representations during decoding, we initialize the first layer of the decoder with the context-level word embedding $e^s_i$ and the second layer of the decoder with the character-level word embedding $e^c_i$, after passing them through a ReLU layer. The decoder outputs the morphological tags through a softmax layer based on the final hidden states $h_t$, which are computed by the joint attention mechanism described in the following section.

Joint Context and Character Attention
We employ two different attention mechanisms to allow the decoder to focus on multiple parts of the sentence and the target word at the same time. We use the attention mechanism introduced by Bahdanau et al. (2014) for the context attention layer.
In the context attention, the alignment vector $a^s_t$, which consists of a weight for each word in the sentence, is computed from the previous hidden state $h_{t-1}$ at the top layer of the stacked bi-GRU and the context-level embeddings $e^s$ of the words, using the concat score function described in Luong et al. (2015). The sentence-level context vector $c^s_t$, which is calculated as a weighted average over the word embeddings, is then passed into a simple concatenation layer $W^s_c$ to produce the new hidden state $h_t$ through the stacked bi-GRU:
$$h_t = \text{bi-GRU}(\ldots, h_{t-1}) \quad (10)$$
Together with the context attention, we also employ a character-level attention model to take into account the entire output of the word encoder. For the character attention, we use the global attention mechanism with the general score function for the alignment vectors (Luong et al., 2015). The source-side character-level context vector $c^c_t$ is computed as a weighted average of the outputs of the word encoder, each denoted by $h^c_{i,j}$. The resulting output state $h_t$ of the tag decoder is then generated by concatenating the current hidden state $h_t$ at the top of the stacked bi-GRU and the context vector $c^c_t$ in a concatenation layer with a tanh activation.
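The character attention step can be sketched in plain Python as follows. This is a minimal illustration, assuming the general score $h_t^\top W_a h_j$ and the tanh concatenation layer of Luong et al. (2015); the weight matrices `W_a` and `W_c`, their toy values and all function names are hypothetical, and the stacked bi-GRU that produces the decoder state is not modelled.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def general_attention(h_t, enc_states, W_a):
    """Weights and context vector with the 'general' score h_t^T W_a h_j."""
    scores = [sum(a * b for a, b in zip(h_t, matvec(W_a, h_j))) for h_j in enc_states]
    weights = softmax(scores)
    dim = len(enc_states[0])
    context = [sum(w * h_j[k] for w, h_j in zip(weights, enc_states)) for k in range(dim)]
    return weights, context

def attentional_state(h_t, context, W_c):
    """Concatenation layer with tanh activation: tanh(W_c [context; h_t])."""
    return [math.tanh(v) for v in matvec(W_c, context + h_t)]

# Toy values: a decoder state and three character-level encoder states.
h_t = [1.0, 0.0]
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_a = [[1.0, 0.0], [0.0, 1.0]]
W_c = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
weights, context = general_attention(h_t, enc_states, W_a)
state = attentional_state(h_t, context, W_c)
```

The context attention over words follows the same pattern with a different score function (concat instead of general) and with the context-level embeddings in place of the character-level states.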

Lemma Decoder
The lemma decoder generates a lemma $l_i = l_{i,1}, ..., l_{i,\phi}$ for a target word $w_i$. Similar to the tag decoder, we use a 2-layer stacked bi-GRU as the lemma decoder. The initial states of the decoder layers are taken from the word encoder output $e^c_i$ and the context encoder output $e^s_i$ through a ReLU layer, as in the tag decoder. The output of the lemma decoder $l_{i,t}$ is conditioned on the current state of the decoder $h_t$, the character attention $c^c_t$ and the morphological tags $t_{i,1:\gamma}$ of the target word. The probabilities of the output lemma characters are then predicted through a softmax layer. In order to exploit morphological features during lemmatization, we give the morphological tags $t_{i,1:\gamma}$, which are predicted by the tag decoder, as part of the input to the lemma decoder. Independent of their order, the entire set of tags is encoded by a simple feed-forward layer as described in the baseline model (Malaviya et al., 2019), and the resulting vector is concatenated with the input embeddings for each target word. The last part of the lemma decoder is the attention network, which is the same character-level attention model as in the tag decoder. The character attention mechanism allows the lemma decoder to compute an attention vector $c^c_t$ based on the output states of the word encoder. The attention vector is then passed into a concatenation layer to generate the output state $h_t$ of the decoder for each lemma character $l_{i,t}$.
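One possible way to encode a tag set independently of tag order is sketched below. It assumes a multi-hot indicator vector followed by a single dense layer with a tanh activation; the tag inventory, the weights and the activation are hypothetical and may differ from the encoding used in the baseline model.

```python
import math

TAG_VOCAB = ["V", "N", "SG", "PL", "3", "IND", "PRS"]  # hypothetical tag inventory

def multi_hot(tags, vocab=TAG_VOCAB):
    """Order-independent indicator vector for a set of tags."""
    tag_set = set(tags)
    return [1.0 if t in tag_set else 0.0 for t in vocab]

def feed_forward(x, W, b):
    """Single dense layer with tanh activation (the activation is an assumption)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

# Toy weights: a 3-dimensional tag-set encoding.
W = [[0.1 * (i + j) for j in range(len(TAG_VOCAB))] for i in range(3)]
b = [0.0, 0.0, 0.0]
enc_a = feed_forward(multi_hot(["V", "SG", "3", "IND", "PRS"]), W, b)
enc_b = feed_forward(multi_hot(["PRS", "IND", "3", "SG", "V"]), W, b)
```

Both orderings of the predicted tags yield the same vector, which would then be concatenated with the input embeddings of the lemma decoder.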

Setup
In this section, we give the details of our experimental setup. The hyperparameters used in our experiments are shown in Table 2. These hyperparameters were tuned on the datasets described in Section 5.1. For training, we used Adam (Kingma and Ba, 2014) and applied an early stopping strategy with a minimum of 100 epochs. We stop training if there is no improvement on the development set for 4 consecutive epochs (patience).
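The stopping rule can be sketched as follows, under one possible reading of it: the patience criterion may only end training once the minimum number of epochs has been reached. The function name and the toy score sequence are hypothetical.

```python
def train_with_early_stopping(dev_scores, min_epochs=100, patience=4):
    """Return the 1-based epoch at which training stops.

    dev_scores holds the development-set score after each epoch (higher is better).
    """
    best = float("-inf")
    epochs_without_gain = 0
    for epoch, score in enumerate(dev_scores, start=1):
        if score > best:
            best, epochs_without_gain = score, 0
        else:
            epochs_without_gain += 1
        # Stop only after the minimum number of epochs has been completed.
        if epoch >= min_epochs and epochs_without_gain >= patience:
            return epoch
    return len(dev_scores)

# Toy run: three improving epochs followed by a plateau.
scores = [0.1, 0.2, 0.3] + [0.3] * 10
stop_short = train_with_early_stopping(scores, min_epochs=5, patience=4)
stop_long = train_with_early_stopping(scores, min_epochs=10, patience=4)
```

In the toy run the plateau starts at epoch 4, so the first call stops once four epochs without improvement have passed, while the second call keeps training until its minimum number of epochs is reached.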

External Embeddings
Because of time constraints and the large number of languages in the dataset, we used out-of-the-box embeddings. We compared the performance of three well-known pre-trained embedding repositories that use different training methods. We use two word-level embedding types (Polyglot and FastText) and one contextualized embedding type (ELMo). All of these embeddings have been trained using the default settings for the embedding type, hence their dimensions are substantially different (Polyglot: 64, FastText: 300, ELMo: 1,024). We decided not to transform these, as their default dimensions are tuned towards their training algorithm and we want to provide a fair comparison for all out-of-the-box settings.

Dataset Embeddings
For the dataset embeddings, we only consider combining pairs of datasets for efficiency reasons. To ensure that we match datasets which are informative for each other, we use word overlap (excluding numerals and punctuation). As this method is expected to be most beneficial for small datasets, we searched for the datasets which are closest (i.e., have the largest word overlap) to the 50 smallest datasets. The final pairs of datasets can be found in Appendix A.
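The pairing procedure can be illustrated with a small sketch. The normalization of the overlap (the fraction of the small dataset's word types found in the candidate), the lowercasing, the ASCII-only punctuation filter and the dataset names are all assumptions made for the example.

```python
import string

def vocabulary(tokens):
    """Word types of a dataset, skipping numerals and punctuation."""
    vocab = set()
    for tok in tokens:
        if any(ch.isdigit() for ch in tok):
            continue  # numeral
        if all(ch in string.punctuation for ch in tok):
            continue  # punctuation (ASCII only in this sketch)
        vocab.add(tok.lower())
    return vocab

def overlap(small, other):
    """Fraction of the small dataset's word types that also occur in the other one."""
    return len(small & other) / len(small) if small else 0.0

def closest_dataset(target, candidates):
    """Name of the candidate dataset with the largest word overlap with the target."""
    return max(candidates, key=lambda name: overlap(target, candidates[name]))

# Toy data with hypothetical dataset names.
small = vocabulary("de kat zit op de mat , 3 keer .".split())
candidates = {
    "nl_other": vocabulary("de hond zit op de bank .".split()),
    "en_other": vocabulary("the cat sat on the mat .".split()),
}
partner = closest_dataset(small, candidates)
```

Here the small dataset is paired with the candidate that shares half of its word types, while the other candidate shares only one.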

Experiments
In this section, we describe the data used in our experiments and evaluate the effectiveness of our external embeddings setup and the dataset embeddings in a variety of settings. In all experiments, we use +E and -E to indicate the model with and without external embeddings, and +D and -D for dataset embeddings.

Data
The test data of SIGMORPHON 2019 Task 2 consists of a collection of datasets released in the Universal Dependencies project, which are automatically converted to the UniMorph schema (McCarthy et al., 2018). In total, we evaluate our model on 107 datasets, covering 66 languages.
After empirically examining the trade-off between data size and training time, we decided to limit each dataset to its first 250,000 tokens for all experiments. This sped up the training considerably, with almost no loss in performance.
For the tuning of our model, we selected a subset of datasets from the main benchmark. More specifically, we aimed for diversity in language family, size, and morphological richness (here proxied by the average number of morphological tags per word). To ensure that we do not overfit on a specific dataset or annotation, we selected two datasets for each of these languages. The selected datasets are shown in Table 3.
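The size limit can be sketched as below. The sketch assumes that the cut is made at a sentence boundary, so that no sentence is split; the text above only specifies the token budget, and the function name is hypothetical.

```python
def truncate_dataset(sentences, max_tokens=250_000):
    """Keep leading sentences until adding another would exceed max_tokens."""
    kept, total = [], 0
    for sent in sentences:
        if total + len(sent) > max_tokens:
            break
        kept.append(sent)
        total += len(sent)
    return kept

# Toy corpus of three sentences with 3, 4 and 5 tokens and a budget of 8 tokens.
corpus = [["a"] * 3, ["b"] * 4, ["c"] * 5]
limited = truncate_dataset(corpus, max_tokens=8)
```

With the toy budget, the first two sentences are kept and the third is dropped.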

Baseline
The baseline consists of two separate parts: a morphological tagger and a lemmatizer. The morphological tagger, which predicts a set of morphological features (as one tag) for each word, is a biLSTM model with character-level layers. The k-best predicted morphological tags are then used as extra information to improve the lemmatization. The lemmatizer, which is based on Wu et al. (2018), uses a hard attention mechanism within an encoder-decoder model. Unlike the previous models, the morphological tags are explicitly given to the lemmatizer to indicate the morpho-syntactic features of words. The lemmatizer combines the given morphological tags with a character encoding to predict the lemma.

External Embeddings
In Figure 4, we plot the average performance of our model when the different types of embeddings are used to initialize the word-surface embeddings (detailed results are in Appendix B). The results show that a performance boost of approximately 2.5% can be obtained for lemmatization and 5% for morphological tagging. The ELMo embeddings perform especially well, and result in an improvement of 3.77 and 6.35 percentage points. The Polyglot embeddings perform surprisingly well, considering that they only have an embedding size of 64. In addition to the reported settings, we also experimented with concatenating the vectors from all types of external embeddings. However, our empirical results showed that this performed worse than using any of the embedding types on its own.
Because not all types of embeddings are available for all languages, we use fallback options for the test data. We choose embeddings in the following order: ELMo, Polyglot, FastText. After this selection, three languages still have no embeddings (Akkadian, Coptic and Naija); we omitted the datasets in these languages from the external embedding experiments.
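The fallback order amounts to a first-match lookup, sketched below. The availability sets passed to the function are hypothetical; only the order ELMo, Polyglot, FastText is taken from the text.

```python
FALLBACK_ORDER = ("ELMo", "Polyglot", "FastText")

def pick_embedding(available):
    """Return the first embedding type in the fallback order that is available."""
    for name in FALLBACK_ORDER:
        if name in available:
            return name
    return None  # no external embeddings for this language

# Hypothetical availability for three languages.
choice_full = pick_embedding({"ELMo", "FastText"})
choice_partial = pick_embedding({"FastText", "Polyglot"})
choice_none = pick_embedding(set())
```

A language without any of the three embedding types yields `None`, which corresponds to the datasets that are left out of the external embedding experiments.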

Dataset Embeddings
To test whether the dataset embeddings are necessary, we compare them with a naive approach to combining datasets: simply training on the concatenation of both datasets. The average results on the 4 small datasets and the 4 large datasets given in Table 3 are compared separately in Figure 5. In both the small and the large setting, using dataset embeddings improves the performance on both morphological tagging and lemmatization. However, the effect of dataset embeddings is larger on small datasets, especially in the morphological tagging task. For the detailed results on our tuning datasets, we refer to Appendix C.

Results
In this section, we compare our final results for two settings with the baseline. In general, we compare two setups: the use of external data (external embeddings, +E) and a constrained setup (-E), which only uses the training data. For the dataset embeddings, we could only run experiments on the 50 smallest datasets because of time limitations, so for the development data, we only report results for these datasets. For the test data, we used dataset embeddings for the datasets on which they proved beneficial on the development data. Our average results are shown in Table 7. For the results for all four settings per dataset, we refer to Appendix D; there we see that the best setting is generally to use dataset embeddings when available.
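For reference, the two reported metrics can be sketched as follows. The sketch assumes exact-match lemma accuracy and an F1 score that is micro-averaged over individual tags; the official evaluation script of the shared task may differ in detail, and the toy predictions are hypothetical.

```python
def lemma_accuracy(gold, pred):
    """Share of words whose predicted lemma matches the gold lemma exactly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def tag_f1(gold_sets, pred_sets):
    """Micro-averaged F1 over individual tags, pooled across all words."""
    tp = sum(len(g & p) for g, p in zip(gold_sets, pred_sets))
    n_pred = sum(len(p) for p in pred_sets)
    n_gold = sum(len(g) for g in gold_sets)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)

# Toy example: one wrong tag on the first word, one wrong lemma on the second.
gold_tags = [{"V", "SG", "3", "IND", "PRS"}, {"N", "PL"}]
pred_tags = [{"V", "SG", "3", "IND", "PST"}, {"N", "PL"}]
f1 = tag_f1(gold_tags, pred_tags)
acc = lemma_accuracy(["like", "cat"], ["like", "cats"])
```

Under this definition, a partially correct tag set still receives credit for its correct tags, whereas a lemma is either fully correct or counted as an error.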

Morphological Tagging
External embeddings prove more beneficial for morphological tagging, whereas the dataset embeddings are particularly beneficial for lemmatization; combining them leads to the best scores for both tasks. Furthermore, our model outperforms the baseline by a large margin. This is because, while the baseline has a separate component for morphological tagging, our model learns both tasks jointly. This approach implicitly enables the decoder to access lemma information for morphological tagging. In addition, we use a multi-attention strategy that combines word-level and character-level attention, which improves the tagging performance.

Lemmatization
In contrast to the results on the development data, the baseline outperforms our model on the test data (Table 7). Especially on small datasets which are not paired with another dataset, such as UD Akkadian-PISANDUB, the baseline performs better by a large margin.
There are two main reasons for this performance difference. First, the baseline uses hard attention to model the alignment distribution explicitly, whereas our model uses soft attention for both tasks. The results show that a hard attention mechanism performs better on lemmatization, confirming the findings of Wu et al. (2018). Integrating a lemma decoder with hard attention and a morphological tag decoder with soft attention could be explored in future studies. Second, as explained above, we optimize for both tasks jointly without any weighting. Although this is more elegant, as only one model is trained, it might not lead to optimal performance.

Conclusion
In this paper, we presented our model for the SIGMORPHON 2019 Task 2 on morphological analysis and lemmatization. We use an encoder-decoder model with a multi-task learning approach. A shared encoder runs on the character and sentence level, and two separate decoders jointly learn to generate the morphological tags and the lemma for each word.
Our system achieved an average morphological tagging F1 score of 94.54 and an average lemma accuracy of 93.91 on the test data. The experimental analysis showed that:
• Employing a multi-task architecture with multiple levels of attention improved morphological tagging over the baseline strategy.
• Using the pre-trained embeddings substantially improved our scores for both tasks.
• Applying a multi-lingual/dataset strategy by learning special embeddings also improved our scores, specifically for small datasets. On 50 datasets (Table 7), the multi-dataset strategy improved the performance of our model substantially, by 2.95 (accuracy) on lemmatization and 1.81 (F1) on morphological tagging.
• Furthermore, these improvements are highly complementary: using dataset embeddings simultaneously with external embeddings leads to superior performance.
The code to re-run all experiments can be found at: https://bitbucket.org/ahmetustunn/morphology_in_context