THOMAS: The Hegemonic OSU Morphological Analyzer using Seq2seq

This paper describes the OSU submission to the SIGMORPHON 2019 shared task, Crosslinguality and Context in Morphology. Our system addresses the contextual morphological analysis subtask of Task 2, which is to produce the morphosyntactic description (MSD) of each fully inflected word within a given sentence. We frame this as a sequence generation task and employ a neural encoder-decoder (seq2seq) architecture to generate the sequence of MSD tags given the encoded representation of each token. Follow-up analyses reveal that our system most significantly improves performance on morphologically complex languages whose inflected word forms typically have longer MSD tag sequences. In addition, our system seems to capture the structured correlation between MSD tags, such as that between the “verb” tag and TAM-related tags.


Introduction
For many natural language processing (NLP) applications such as parsing and machine translation, correctly analyzing the part-of-speech and fine-grained morphological information (e.g. tense, mood, and aspect) of a given string of words is crucial for satisfactory performance. This task depends on the system's ability to learn reliable representations of the sequence on two distinct levels: one at the character level, which is indicative of the morphosyntactic values of the word, and the other at the word level, which is informative of subsequent words that are likely to appear in the sequence. In addition, the system needs to have representational flexibility in order to be used in a cross-linguistic setting, as languages with typologically distinct morphological systems (e.g. isolating, agglutinative, and fusional) have different methods of realizing morphological information.

* First authors. Ordering determined by dice roll.

Table 1: An example input sentence and its MSD tags.

  Input:     They buy and sell books .
  MSD tags:  N;NOM;PL | V;SG;1;PRS | CONJ | V;PL;3;PRS | N;PL | PUNCT

Task 2 of the SIGMORPHON 2019 Shared Task, Morphological Analysis and Lemmatization in Context (McCarthy et al., 2019), provides an appropriate setting to examine the applicability of morphological analyzers on typologically distinct languages. As mentioned on the shared task webpage,[1] the goal of the contextual morphological analysis subtask of Task 2 is to produce the morphosyntactic description (MSD) of each word within a given sentence (i.e. "context"; see Table 1 for an example).[2] The system's performance is evaluated on a total of 107 treebanks from the UniMorph dataset (McCarthy et al., 2018), which covers more than 70 languages. Again, this requires the system to generalize across typologically different languages without being biased towards a particular morphological system.
In this paper, we present our approach of treating contextual morphological analysis as the generation of the correct sequence of MSD tag dimensions. To address the task, we take a similar approach to the shared task baseline system (Malaviya et al., 2019) in encoding each word in the sequence with a representation learned by a character-level recurrent neural network (RNN). Starting from the baseline system, which treats each possible combination of MSD tag dimensions separately and chooses the most likely one, we first demonstrate that modifying the system to make multiple independent binary decisions over each possible tag dimension results in higher performance. Furthermore, we present an encoder-decoder (seq2seq) model that decodes the representation of each input word into a sequence of MSD tag dimensions. The use of the seq2seq model further improves performance, especially in terms of exact match accuracy for tokens that have long sequences of MSD tag dimensions. Our best-performing model outperforms the official baseline by 14.25 points on exact match accuracy and by 4.6 points on micro-averaged F1.

[1] https://sigmorphon.github.io/sharedtasks/2019/task2/
[2] For the other subtask of contextual lemmatization, the goal of which is to return the correct lemmata of the fully inflected forms, we generated the predictions using the pretrained shared task baseline lemmatizer (Malaviya et al., 2019). As the baseline system conducts lemmatization by conditioning on predicted MSD tags, we provided the system with the predictions from our seq2seq model as input.

Figure 1: The encoder based on a bidirectional LSTM, shared by the baseline, binary relevance, and seq2seq models. (a) The decoder of the binary relevance model, which makes independent binary decisions for each possible tag dimension. (b) The GRU decoder of the seq2seq model, which predicts the next tag dimension given the encoder representation and the prediction at the previous timestep.

Model Description
Baseline model

The baseline model takes as input each sentence in the training data and uses a bidirectional LSTM (Long Short-Term Memory; Hochreiter and Schmidhuber, 1997) to learn a representation for each word by attending to its individual characters. The learned representation is then fed into a fully connected linear layer, which maps the representation of the word to the space of every observed combination of MSD tag dimensions. The network is updated based on the cross-entropy loss between the model's prediction and the correct combination of MSD tag dimensions.
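A minimal PyTorch sketch of this architecture (class and variable names are ours, not the shared task code; the dimensions follow the hyperparameters reported later in the paper):

```python
import torch
import torch.nn as nn

class BaselineTagger(nn.Module):
    """Character-level biLSTM encoder + softmax over all observed
    MSD tag combinations, in the spirit of the shared-task baseline."""

    def __init__(self, n_chars, n_combinations, char_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim)
        # bidirectional, so the word representation is 2 * hidden_dim = 256
        self.lstm = nn.LSTM(char_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, n_combinations)

    def forward(self, char_ids):
        # char_ids: (batch, max_word_len) character indices of each word
        _, (h, _) = self.lstm(self.embed(char_ids))
        # concatenate the final forward and backward hidden states
        word_repr = torch.cat([h[0], h[1]], dim=-1)
        return self.out(word_repr)  # logits over whole tag combinations

model = BaselineTagger(n_chars=60, n_combinations=178)  # 178 combinations, as in English
logits = model(torch.randint(0, 60, (16, 12)))
# cross-entropy against the index of the correct combination
loss = nn.functional.cross_entropy(logits, torch.randint(0, 178, (16,)))
```

Because the output layer ranges over whole combinations, every distinct bundle of tag dimensions observed in training gets its own class, which is what the next section identifies as the main bottleneck.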
Binary relevance model

An obvious limitation of the above baseline approach is that the number of observed combinations of MSD tag dimensions is typically large for most languages, especially for agglutinative and fusional languages, whose words contain relatively more morphological information than those of other languages (see Table 2). In addition, treating each combination separately prevents the model from generalizing to other instances of the same MSD tag dimension that simply appear in a different combination. We hypothesize that this would most unfavorably impact system performance on such morphologically rich languages. The binary relevance model therefore replaces the single decision over whole combinations with multiple independent binary decisions, one for each possible tag dimension.

Table 2: Descriptive statistics of each training set.

        Sents   Tokens   Tags   Combinations
  en    13297   204857     36            178
  es    14144   439925     40            419
  hi    13317   281948     43           1508
  ru     4024    79989     47           1385
  tr     4508    46417     55           1896
  zh     3997    98734     21             39

Encoder-decoder (seq2seq) model

Nonetheless, given that particular MSD tag dimensions tend to co-occur within the same word (e.g. the "verb" tag dimension frequently co-occurs with tense- or aspect-related tag dimensions), the independence assumption between individual tag dimensions made in the binary relevance model may be too strong to capture this inherent structure. To account for the potential dependence between predicted tag dimensions, we feed the encoded representation of each word as the initial hidden state of a GRU (Gated Recurrent Unit; Cho et al., 2014) decoder, which is then trained to predict one tag dimension at each decoding timestep.
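Concretely, the binary relevance decoder described above differs from the baseline only in its output layer and loss: one sigmoid per tag dimension trained with binary cross-entropy. A minimal sketch (names are ours; the 0.5 decision threshold is our assumption, not stated in the shared-task code):

```python
import torch
import torch.nn as nn

# The character-level biLSTM encoder is unchanged; only the head differs.
n_tags, word_dim = 47, 256              # e.g. Russian has 47 distinct tag dimensions
head = nn.Linear(word_dim, n_tags)      # one logit per tag dimension

word_repr = torch.randn(16, word_dim)   # stand-in for the encoder's word representations
logits = head(word_repr)

# multi-label target: 1 for every tag dimension in the word's gold MSD, else 0
targets = torch.zeros(16, n_tags)
targets[0, [3, 10, 21]] = 1.0           # e.g. a word whose MSD has three tag dimensions
loss = nn.functional.binary_cross_entropy_with_logits(logits, targets)

# at prediction time, emit every tag dimension whose probability crosses the threshold
predicted = torch.sigmoid(logits) > 0.5  # (16, 47) boolean mask
```

Because the targets factor over individual tag dimensions, a tag dimension seen in one combination during training contributes evidence for the same dimension in unseen combinations.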
The use of such a seq2seq model is also partly motivated by its state-of-the-art performance in various NLP tasks such as machine translation (Bahdanau et al., 2015; Luong et al., 2015), document classification (Nam et al., 2017; Yang et al., 2018), morphological reinflection (Kann and Schütze, 2016; Kann et al., 2017), and morphological analysis like the current shared task (Tkachenko and Sirts, 2018).

Our seq2seq model strongly outperforms the official baseline, scoring 14.25 and 4.6 points higher on average across 107 datasets on exact match accuracy and micro-averaged F1 respectively. For an in-depth analysis of each model, we focus on six languages and compare the performance of our two models (binary relevance and seq2seq) to that of the baseline model.
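The GRU decoder described in the previous section might be sketched as follows (an illustrative implementation under our naming assumptions, not the actual shared-task code; dimensions follow the hyperparameters reported in the experimental design):

```python
import torch
import torch.nn as nn

class MSDDecoder(nn.Module):
    """GRU decoder that emits one MSD tag dimension per timestep,
    conditioning on the encoder representation (as its initial hidden
    state) and on its own previous prediction (greedy decoding)."""

    def __init__(self, n_tags, word_dim=256, tag_dim=64):
        super().__init__()
        self.tag_embed = nn.Embedding(n_tags, tag_dim)
        self.gru = nn.GRUCell(tag_dim, word_dim)
        self.out = nn.Linear(word_dim, n_tags)

    def forward(self, word_repr, bos_id=0, max_len=8):
        h = word_repr  # (batch, word_dim): the biLSTM output initializes the hidden state
        prev = torch.full((word_repr.size(0),), bos_id, dtype=torch.long)
        step_logits = []
        for _ in range(max_len):
            h = self.gru(self.tag_embed(prev), h)
            logits = self.out(h)
            step_logits.append(logits)
            prev = logits.argmax(dim=-1)  # feed the greedy prediction back in
        return torch.stack(step_logits, dim=1)  # (batch, max_len, n_tags)

decoder = MSDDecoder(n_tags=55)  # e.g. Turkish has 55 distinct tag dimensions
logits = decoder(torch.randn(4, 256))
```

Feeding each prediction back into the next step is what lets the decoder exploit co-occurrence structure, e.g. raising the probability of TAM-related tags after emitting V.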

Experimental Design
Training data

Following the shared task guidelines, six different treebanks from the UniMorph dataset (McCarthy et al., 2018) provided the data for training and evaluating the models. The six treebanks (English-EWT, Spanish-AnCora, Hindi-HDTB, Russian-GSD, Turkish-IMST, and Chinese-GSD) cover a wide spectrum of morphological typology, making them suitable for assessing the generalizability of each morphological analysis system. The descriptive statistics of each training set are outlined in Table 2.
Training and evaluation procedure

For the binary relevance model, most of the hyperparameters followed the default settings of the baseline system code;[4] characters were embedded into 128-dimension representations, and the character-level biLSTM was trained to output a 256-dimension representation. Adam (Kingma and Ba, 2015) was used as the optimizer, using the default settings of the PyTorch deep learning library (Paszke et al., 2017). The model was trained for five epochs using batches of size 16, with early stopping.[5] The same hyperparameters were used to train the encoder portion of the seq2seq model. As for the GRU decoder, the maximum output length was fixed to the maximum sequence length seen during training. Following prior work (Yang et al., 2018), the order of the output tags was fixed to be in decreasing order of frequency of occurrence in the training set. Decoding took place in a greedy manner: only the highest-scoring hypothesis at the previous timestep was further pursued. The model was trained without any teacher forcing, as preliminary results showed that a teacher forcing ratio of 0.5 resulted in a decrease in model performance.
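The fixed frequency-based tag ordering can be derived once from the training set and applied to every target sequence. A sketch (function names are ours):

```python
from collections import Counter

def frequency_order(training_sequences):
    """Map each tag dimension to its rank by decreasing training-set
    frequency, following Yang et al. (2018)."""
    counts = Counter(tag for seq in training_sequences for tag in seq)
    return {tag: rank for rank, (tag, _) in enumerate(counts.most_common())}

def reorder(seq, order):
    """Sort a word's tag dimensions into the fixed global order;
    tags unseen in training sort last."""
    return sorted(seq, key=lambda t: order.get(t, len(order)))

# toy training data; real MSD sequences come from the UniMorph treebanks
train = [["V", "PRS", "3", "SG"], ["V", "PST", "SG"], ["N", "SG"]]
order = frequency_order(train)
print(reorder(["PRS", "3", "V"], order))  # -> ['V', 'PRS', '3']
```

Fixing the target order this way turns the unordered set of tag dimensions into a well-defined sequence, which is a prerequisite for training the decoder with a per-timestep loss.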
After training was complete, each model's accuracy was evaluated on the held-out test portions of the six treebanks used for training. As per the shared task guidelines, the exact match accuracy and micro-averaged F1 scores were calculated for each trained model.
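The two evaluation measures can be illustrated as follows (a simplified sketch of the metrics, not the official shared-task scorer):

```python
def exact_match_accuracy(preds, golds):
    """Fraction of tokens whose predicted set of MSD tag dimensions
    matches the gold annotation exactly."""
    return sum(set(p) == set(g) for p, g in zip(preds, golds)) / len(golds)

def micro_f1(preds, golds):
    """Micro-averaged F1 over individual tag dimensions, pooled across
    all tokens, so partially correct predictions still earn credit."""
    tp = fp = fn = 0
    for p, g in zip(preds, golds):
        p, g = set(p), set(g)
        tp += len(p & g)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

preds = [["V", "SG"], ["N", "PL"]]
golds = [["V", "PL"], ["N", "PL"]]
print(exact_match_accuracy(preds, golds))  # 0.5: only the second token matches exactly
print(micro_f1(preds, golds))              # 0.75: the first token still earns partial credit
```

The contrast between the two numbers above is the crux of the later discussion: exact match rewards getting the whole tag set right, while micro-F1 rewards partially correct sets.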
[5] The default settings of the code were used to train the baseline models. The only changes to the default settings when training the binary relevance model were in the number of training epochs (default 10) and the batch size (not implemented in the baseline code, therefore default size 1).

Results and Discussion
As can be seen in Table 3, having the model make independent binary decisions for each possible MSD tag dimension (i.e. the binary relevance model) significantly increases performance. This is most likely the result of narrowing down the output space, thereby allowing the model to generalize over instances of the same tag dimension that appear in different combinations. In addition, using a neural decoder to generate a sequence of tag dimensions further improves performance in terms of exact match accuracy, which is sensitive to predicting the correct number of tag dimensions. This corroborates the results of Tkachenko and Sirts (2018), who found that their sequence generation model outperformed other neural classifiers in terms of accuracy on most languages. The increase in performance is especially salient in Russian and Turkish, which typically have more tag dimensions per word than other languages. An analysis of the distribution of predicted tag dimensions (Table 4) shows that the seq2seq model predicts significantly fewer "invalid" combinations that are not attested in the gold test set,[6] indicating that the seq2seq model is more capable of capturing the structured dependence than the binary relevance model.

Lengths of tag sequences
To further examine where the seq2seq model makes significant improvements, the exact match accuracy and micro-averaged F1 scores were calculated according to the number of MSD tag dimensions in the test portion of the dataset. In Figure 3, the scores are presented for English, Russian, and Turkish.[7] Additionally, we compared the number of tag dimensions predicted by each model to that of the gold annotation in order to investigate whether there was a tendency for the models to over- or under-predict the correct number of tag dimensions (Table 5).

Table 5: Number of English test tokens for which each model predicted fewer (P < G), the same number of (P = G), or more (P > G) tag dimensions than the gold annotation, by gold number of tag dimensions.

              Bin. Rel.               Seq2seq
  en    P < G   P = G  P > G    P < G   P = G  P > G
  0         -    3199     14        -    3201     12
  1         0    7819    252        0    7872    199
  2       386    9296    164      165    9605     76
  3        61    1017     44       58    1037     27
  4       125    1995     20      150    1986      4
  5         3     384      2        1     386      2
  6        27     704      6       20     716      1

Although there is no clear pattern as to sequences of what length (i.e. short or long) the seq2seq model helps the most, it is clear from the scores that the seq2seq model is better able to reproduce longer sequences of tag dimensions than the binary relevance model. Furthermore, while both models predict the correct number of tag dimensions for the vast majority of test examples, the seq2seq model makes more accurate predictions across sequences of nearly all lengths. There is also a general tendency for the two models to under-predict rather than over-predict distinct tag dimensions, with the exception of the seq2seq model on Russian examples with four tag dimensions or fewer.

[7] There was only one token each with two or three tag dimensions in the test portion of the Turkish dataset (and none in the development portion). As such, the scores for tokens with two or three tag dimensions were omitted in the figure.
Dependence between tag dimensions

We hypothesize that the neural decoder of the seq2seq model helped it correctly predict tag dimensions that are low in frequency but often co-occur with a more frequent tag dimension. Such highly dependent examples can be found in the verbal paradigm of a language, where tag dimensions that indicate a particular tense, aspect, and mood (TAM; e.g. present, progressive, indicative) always co-occur with the verb (V) tag dimension. We expect that the prediction of the higher-frequency V tag dimension during decoding would have helped the model accurately predict these specific TAM-related tag dimensions. As a case study testing this hypothesis, we compared the performance of the two models on TAM-related tokens present in the Turkish test set. The results in Table 6 reveal that the seq2seq model generally outperforms the binary relevance model, indicating that the seq2seq model captures the dependence between the V tag dimension and TAM-related tag dimensions.

While the above analyses clearly demonstrate that the seq2seq model learns the structure behind MSD tag dimensions and thus predicts more linguistically plausible sequences than the binary relevance model, the binary relevance model slightly outperforms the seq2seq model in terms of micro-averaged F1 score. We conjecture that this is due to the nature of the decoder employed in the seq2seq model. Because the decoder conditions on its prediction at the previous timestep, once it predicts an erroneous tag dimension, it is likely to continue to deviate from the correct sequence. This results in predictions that do not have many tag dimensions in common with the gold annotation. On the other hand, as the binary relevance model is optimized to predict each individual tag dimension independently, it is more likely to generate "partially correct" sequences that are penalized less severely by the F1 score.
Representative errors from the seq2seq model on the Russian test set presented in Table 7 demonstrate this tendency; in general, the prediction of an incorrect tag dimension results in predictions that have little overlap with the gold annotation.
In order to alleviate such decoding errors of the seq2seq model, a beam search could be conducted to pursue multiple hypotheses simultaneously. This could help the model recover from an initial erroneous prediction, albeit at the cost of computational efficiency. Furthermore, to explicitly incorporate the underlying structure between MSD tag dimensions, the binary relevance model could be extended to a multiclass multilabel classifier, which selects one tag among those that are in complementary distribution for each morphological category (e.g. part-of-speech, case, number), as in Tkachenko and Sirts (2018). Finally, a more rigorous search for the optimal hyperparameters (e.g. hidden state sizes, training epochs, learning rate) of each model could further enhance their performance. We leave these directions to future work.
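As an illustration of the proposed remedy, a generic beam search over tag sequences might look like this (the `step_fn` interface is our assumption, not the shared-task decoder's API):

```python
import math

def beam_search(step_fn, beam_size=4, max_len=10, end="<END>"):
    """Keep the `beam_size` highest-scoring partial sequences at each
    timestep instead of only the single greedy hypothesis."""
    beams = [(0.0, [])]  # (cumulative log-probability, tag sequence so far)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq and seq[-1] == end:
                candidates.append((score, seq))  # finished hypothesis carries over
                continue
            for tag, logp in step_fn(seq):
                candidates.append((score + logp, seq + [tag]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if all(seq and seq[-1] == end for _, seq in beams):
            break
    return beams[0][1]

# Toy decoder: the locally best first tag ("A") leads to a poor continuation,
# while the locally worse first tag ("B") yields the better full sequence.
def step_fn(seq):
    if not seq:
        return [("A", math.log(0.6)), ("B", math.log(0.4))]
    if seq[-1] == "A":
        return [("<END>", math.log(0.01))]
    return [("<END>", math.log(0.9))]

print(beam_search(step_fn, beam_size=1))  # ['A', '<END>']: greedy commits early
print(beam_search(step_fn, beam_size=2))  # ['B', '<END>']: the beam recovers
```

This is exactly the failure mode discussed above: a wider beam lets a hypothesis that starts with a lower-probability tag survive long enough for its better continuation to win.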

Conclusion
In this paper, we present our approach to the SIGMORPHON 2019 contextual morphological analysis shared task. Expanding on the baseline model, which chooses the most likely combination from all those present in the training data, we demonstrate that having the model make independent binary decisions over each tag dimension alleviates data sparsity and improves model performance. Furthermore, based on the linguistic insight that certain tag dimensions often co-occur, we employ a neural decoder to turn contextual morphological analysis into a sequence generation task and capture this dependence. This again improves model performance in terms of exact match accuracy, especially for morphologically rich languages that generally have more MSD tag dimensions per token. A follow-up case study of Turkish verbal inflections demonstrates that the seq2seq model captures the correlation between the more frequent V tag dimension and the less frequent TAM-related tag dimensions.