CUNI–Malta system at SIGMORPHON 2019 Shared Task on Morphological Analysis and Lemmatization in context: Operation-based word formation

This paper presents the submission of the Charles University-University of Malta team to the SIGMORPHON 2019 Shared Task on Morphological Analysis and Lemmatization in context. We present a lemmatization model based on previous work on neural transducers (Makarov and Clematide, 2018b; Aharoni and Goldberg, 2016). The key difference is that our model transforms the whole word form at every step, instead of consuming it character by character. We propose a merging strategy inspired by Byte-Pair Encoding that reduces the space of valid operations by merging frequent adjacent operations. The resulting operations encode not only the actions to be performed but also the relative position in the word token and how characters need to be transformed. Our morphological tagger is a vanilla biLSTM tagger that operates over operation representations, encoding operations and words in a hierarchical manner. Even though performance is below the baseline on the official metrics, experiments show that our models capture important associations between interpretable operation labels and fine-grained morpho-syntactic labels.


Introduction
Tasks related to morphological analysis have traditionally been formulated as string transduction problems tackled by weighted finite-state transducers (Mohri, 2004; Eisner, 2002). More recently, however, the problem has been tackled with neural architectures featuring sequence-to-sequence models (Kann and Schütze, 2016) and neural transducers (Aharoni and Goldberg, 2016; Makarov and Clematide, 2018b,a).
In this paper we describe our submission to the SIGMORPHON 2019 Shared Task on morphological analysis and lemmatization in context (McCarthy et al., 2019). We focus on an operation-based word formation process using a neural transducer which consumes more than one character at a time. Our main motivation for this approach stems from the observation that neural transducers normally consume one character at a time, using context-enriched representations of characters. 1 In language modelling, character-based RNNs have difficulty capturing long-range dependencies between characters, especially dependencies between words that are separated by several tokens; this can be a crucial piece of information for morphological analysis in context. This type of approach has already been extended effectively to neural machine translation by Sennrich et al. (2016), who employ simple character n-gram models and a segmentation based on the byte pair encoding (BPE) compression algorithm.

Related Work
In the last few years, efforts on the analysis of endangered, low-resourced languages and the development of basic language tools for them (Rios, 2016; Pereira-Noriega et al., 2017; Cardenas and Zeman, 2018) have once more brought attention to the need for less language-dependent models that are not unreasonably data hungry.
On the other hand, more recent efforts have proposed combined strategies to bring together the transducer paradigm and neural architectures (Rastogi et al., 2016; Aharoni and Goldberg, 2016; Lin et al., 2019). For example, the neural transducer proposed by Aharoni and Goldberg (2016) is a sequence-to-sequence architecture that decodes one character at a time while attending to the input character under a hard-monotonic constraint. However, their method relies on an alignment of the input and output strings at the character level computed outside the pipeline. Subsequent work by Makarov and Clematide (2018b) proposed a transition-based architecture instead, although still operating under the same conditions, i.e. consuming one character at a time and relying on pre-alignment. More recently, however, Makarov and Clematide (2018a) proposed to learn alignment lattices jointly with the transduction mechanism under an imitation learning framework, hence eliminating the need for single, noisy alignments.
In this work, we propose a neural architecture that encodes more expressive, interpretable transducer operations. We relax the condition of consuming one character at a time, and derive operations meant to be applied at the word level instead. These operations are obtained by merging initial character-level operations using the BPE algorithm (Gage, 1994).

Task Description
The SIGMORPHON 2019 Shared Task (McCarthy et al., 2019) features three main tasks: (i) cross-lingual transfer for inflection generation, (ii) morphological analysis and lemmatization in context, and (iii) an open challenge over past editions of the shared tasks.
We participated in Task 2, in which a complete sentence of word forms is presented and lemmas and feature bundles (morpho-syntactic description labels) are to be predicted for each token. This task features an outstandingly diverse pool of 66 languages from a total of 107 treebanks. Data (forms, lemmas, and feature bundles) are obtained from Universal Dependencies v2.3 treebanks (Nivre et al., 2018). However, the feature bundles are translated into the UniMorph tagset (Kirov et al., 2018) using the mapping strategy proposed by McCarthy et al. (2018).

Problem Formulation
Let w ∈ V and z ∈ V_L be a word type and its corresponding lemma, and let A be a set of string transformation actions. We define the function T : V × A^m → V_L that receives as input a word form w and a sequence of string transformations a = a_0, ..., a_i, ..., a_m. T applies the transformations iteratively, one at a time, and returns the resulting string. The objective is to obtain an action sequence a such that the form w is transformed into its lemma z, i.e. T(w, a) = z.
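A minimal sketch of the transduction function T, assuming a simplified action encoding as (operation, position, segment) tuples (the helper name apply_actions and the particular action sequence for visto are ours, for illustration only):

```python
def apply_actions(form, actions):
    """Apply a sequence of (op, pos, segment) string transformations.

    Illustrative sketch of T: each action edits the current string at an
    explicit position; the tuple encoding is a stand-in for the paper's
    operation-position-segment labels.
    """
    s = form
    for op, pos, seg in actions:
        if op == "DEL":          # delete len(seg) characters at pos
            s = s[:pos] + s[pos + len(seg):]
        elif op == "INS":        # insert segment at pos
            s = s[:pos] + seg + s[pos:]
        elif op == "SUBST":      # overwrite len(seg) characters at pos
            s = s[:pos] + seg + s[pos + len(seg):]
    return s

# Spanish: visto -> ver (one possible action sequence)
lemma = apply_actions("visto", [("DEL", 4, "o"), ("DEL", 3, "t"),
                                ("SUBST", 1, "e"), ("SUBST", 2, "r")])
```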

String transformations at the word level
We encode every string transformation (henceforth, action) a_i ∈ A as operation-position-segment. The additional information encoded, namely the position and the segment (characters) involved, allows actions to operate at the word level and act upon a segment of characters instead of a single character. This is a key difference between A and the action sets of most previously proposed neural transducers (Aharoni and Goldberg, 2017; Makarov and Clematide, 2018b,c), which encode only the operation to perform and consume one character at a time.

Obtaining gold action sequences
We now discuss how to deterministically populate A. We start off with operations that act upon one character at a time. We derive these operations with the Damerau-Levenshtein (DL) distance algorithm, which adds the transposition operation to the traditional operation set of the edit distance algorithm. However, the set A of actions of the form operation-position-segment directly derived by this algorithm is too large and sparse to be learned effectively, especially because of the position component.
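A sketch of how character-level operations with explicit positions can be derived from an edit-distance alignment; for brevity this uses plain Levenshtein distance, omitting the transposition operation that Damerau-Levenshtein adds (the helper name edit_script is ours):

```python
def edit_script(src, tgt):
    """Character-level (op, position, char) operations turning src into tgt,
    recovered by backtracing a standard Levenshtein DP table.  Simplified
    stand-in for the paper's derivation: transposition is omitted.
    """
    n, m = len(src), len(tgt)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / copy
    # Backtrace to recover one optimal operation sequence
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (src[i - 1] != tgt[j - 1]):
            if src[i - 1] != tgt[j - 1]:
                ops.append(("SUBST", i - 1, tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("DEL", i - 1, src[i - 1]))
            i -= 1
        else:
            ops.append(("INS", i, tgt[j - 1]))
            j -= 1
    return list(reversed(ops))
```

Positions here index the source string, which illustrates why the raw position component makes the action set sparse: the same morphological change yields different positions in words of different lengths.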
Hence, we simplify A by merging the k most frequent operations performed at adjacent positions using Byte-Pair Encoding (BPE) (Gage, 1994). Furthermore, we replace the position component of actions performed at the beginning of a token with a dedicated label indicating that it is a prefixing action; analogously, a dedicated label indicates a suffixing action. Table 1 presents a description of the licensed values of each component, including the operation set considered.
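The merging step can be sketched as follows, assuming actions are represented as plain strings (the helper name learn_merges and the "+" join convention are ours, not the paper's label scheme):

```python
from collections import Counter

def learn_merges(action_seqs, k):
    """BPE-style merging over action sequences: repeatedly fuse the most
    frequent pair of adjacent actions into a single compound action.
    Illustrative sketch; actions are plain strings such as "DEL-4-o".
    """
    seqs = [list(s) for s in action_seqs]
    merges = []
    for _ in range(k):
        pairs = Counter()
        for s in seqs:
            for a, b in zip(s, s[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        merged = "+".join(best)
        new_seqs = []
        for s in seqs:
            out, i = [], 0
            while i < len(s):
                if i + 1 < len(s) and (s[i], s[i + 1]) == best:
                    out.append(merged)   # replace the pair with its merge
                    i += 2
                else:
                    out.append(s[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return merges, seqs
```

As with BPE over characters, the learned merge list can then be replayed on held-out action sequences.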
Finally, actions are sorted so that prefix actions are performed first, followed by inner-word actions (those at explicit positions i), and lastly, suffix actions. In addition, prefix and suffix actions are ordered so that T processes the word form from the outside in. Consider the example presented in Table 2, a sequence of suffix actions: the form visto (Spanish for 'seen', past participle) is transformed into the lemma ver ('to see'), with all actions operating at the right border of the current token.

System Description
In this section we describe the models presented for Task 2 on morphological tagging and lemmatization in context. We tackle the tasks of lemmatization and analysis with two separate, pipelined models, as follows.

Table 1: Description of components encoded in action labels. Σ: set of characters observed in the training data.

Table 2: Example of step-by-step transformation from form visto (Spanish for 'seen', past participle) to lemma ver ('to see'). The bottom row presents the final token representation as the initial form followed by the action sequence.

Lemmatization Model
We posit the task of lemmatization as a language modelling problem over action sequences. Let w = w_0, ..., w_i, ..., w_n be a sequence of word tokens, z = z_0, ..., z_i, ..., z_n the lemma sequence associated with w, and a^i = a^i_0, ..., a^i_j, ..., a^i_m the action sequence such that T(w_i, a^i) = z_i. We encode a^i using an RNN with an LSTM cell (Hochreiter and Schmidhuber, 1997), as follows:

h^i_j = LSTM(e^i_j, h^i_{j-1}),

where e^i_j is the embedding of action a^i_j. Then, the probability of action a^i_j is defined as

P(a^i_j | a^i_{1:j-1}) = softmax(W g(h^i_{j-1}) + b),    (1)

where g is the ReLU activation function, and W and b are network parameters. As a way to introduce the original word form into the encoded sequence, we prepend w_i to a^i. Hence, the probability of the first action is determined by

h^i_0 = LSTM(e^i_0, h^{i-1}_m),

where h^{i-1}_m is the last state of the encoded action sequence of the previous word w_{i-1}, and e^i_0 is the embedding of word w_i. The network is then optimized by minimizing the negative log-likelihood of the action sequences:

L(θ) = − Σ_{w∈W} Σ_i Σ_j log P(a^i_j | a^i_{1:j-1}),

where W is the set of all sentences in the training set and θ represents the parameters of the network. Figure 1 presents a representation of the lemmatizer model architecture. Note that a^i_m is the special action label STOP. During decoding, we construct the lemma z_i by running T over the predicted action sequence of w_i.
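A minimal PyTorch sketch of the action-sequence language model; the class name, layer dimensions, and omission of the word-embedding prepending and cross-word state carry-over are our simplifications, not the submitted configuration:

```python
import torch
import torch.nn as nn

class ActionLM(nn.Module):
    """Sketch of the lemmatizer: an LSTM language model over action ids.

    Each step embeds an action, feeds it through the LSTM, and scores the
    next action with a ReLU + linear layer, mirroring Eq. 1.
    """
    def __init__(self, n_actions, emb_dim=64, hid_dim=128):
        super().__init__()
        self.emb = nn.Embedding(n_actions, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.g = nn.ReLU()
        self.out = nn.Linear(hid_dim, n_actions)

    def forward(self, actions, state=None):
        # actions: (batch, seq) action ids; in the full model the word
        # embedding would be prepended and `state` carried over from the
        # previous word's action sequence.
        h, state = self.lstm(self.emb(actions), state)
        logits = self.out(self.g(h))          # (batch, seq, n_actions)
        return logits, state

lm = ActionLM(n_actions=100)
actions = torch.randint(0, 100, (2, 5))
logits, _ = lm(actions)
# Negative log-likelihood of the next action at every position
loss = nn.CrossEntropyLoss()(logits[:, :-1].reshape(-1, 100),
                             actions[:, 1:].reshape(-1))
```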

Morphological Tagging Model
Let F^i = {f^i_0, ..., f^i_k, ..., f^i_K} be the morpho-syntactic description (MSD) label associated with word form w_i, defined as the concatenation of all individual features f_k, such as N or PL. We tackle the task of morphological tagging as a sequence labeling problem over aggregated representations of word forms.
We start off by encoding the action sequence using a bidirectional LSTM (Graves et al., 2013) in order to obtain a word-level representation x_i = [f_m; b_0], where f_m is the last forward state and b_0 is the first backward state. We use action embeddings trained by the lemmatizer and freeze them during training.
Then, the sequence x_0, ..., x_n is encoded by a word-level biLSTM: u_i = biLSTM(x_i, u_{i-1}). The probability of feature label F_i is then given by

P(F_i | x_{1:i}) = softmax(W g(u_i) + b),    (2)

where g(x) is a ReLU activation function, and W and b are network parameters. The network is optimized using the cross-entropy loss.
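The hierarchical structure above can be sketched in PyTorch; dimensions, the class name, and the unbatched (one sentence at a time) interface are our illustrative choices:

```python
import torch
import torch.nn as nn

class HierTagger(nn.Module):
    """Sketch of the tagger: an action-level biLSTM builds one vector per
    token from its action sequence (x_i = [f_m; b_0]); a word-level biLSTM
    then contextualizes the sentence before scoring MSD labels (Eq. 2).
    """
    def __init__(self, n_actions, n_labels, emb=64, hid=128):
        super().__init__()
        self.emb = nn.Embedding(n_actions, emb)   # frozen in the paper
        self.char_rnn = nn.LSTM(emb, hid, batch_first=True, bidirectional=True)
        self.word_rnn = nn.LSTM(2 * hid, hid, batch_first=True, bidirectional=True)
        self.g = nn.ReLU()
        self.out = nn.Linear(2 * hid, n_labels)

    def forward(self, actions):
        # actions: (sent_len, max_actions) action ids, one row per token
        h, _ = self.char_rnn(self.emb(actions))
        half = h.size(-1) // 2
        # concatenate last forward and first backward states: x_i = [f_m; b_0]
        x = torch.cat([h[:, -1, :half], h[:, 0, half:]], dim=-1)
        u, _ = self.word_rnn(x.unsqueeze(0))      # (1, sent_len, 2*hid)
        return self.out(self.g(u)).squeeze(0)     # (sent_len, n_labels)

tagger = HierTagger(n_actions=100, n_labels=50)
scores = tagger(torch.randint(0, 100, (7, 4)))   # 7 tokens, 4 actions each
```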

Experimental Setup
We follow a two-step approach to morphological analysis: we first obtain the action sequence using the lemmatizer model, and then obtain the feature label sequence over these action representations. All models were implemented and trained using PyTorch 1.0.0. 2

Action sequence preprocessing
We lowercase forms and lemmas before running the DL-distance algorithm. Following the BPE training procedure described by Sennrich et al. (2016), we obtain the list of merged operations from the action sequences derived from the training data. We limit the number of merges to 50. Then, these merges are applied to action sequences on the development and test data.

Training and optimization details
Both the lemmatizer and analyzer models were trained using Adam (Kingma and Ba, 2017), regularized using dropout (Srivastava et al., 2014), and employing an early stopping strategy. We tune the hyper-parameters of both models over the development set of Spanish (es_ancora) 3 and then use the optimal configuration to train on all treebanks except kpv_ikdp, kpv_lattice, and sa_ufal. Preliminary experiments showed that these treebanks needed a smaller analyzer model to perform well. In this case, we choose kpv_ikdp as our reference to obtain an optimal hyper-parameter configuration.
In each case, hyper-parameters were optimized over 30 iterations of random search guided by a Tree-structured Parzen Estimator (TPE). 4 Table 3 presents the hyper-parameters for the lemmatizer, analyzer, and the small version of the analyzer.
We follow a greedy approach to decoding the action sequences of lemmas. We also experimented with beam search, but the improvements were not significant. Furthermore, we implement heuristics to prune a predicted sequence of actions. In addition to halting decoding when a PAD or STOP action is found, we halt if the action is not valid given the current string. For example, the action DEL-5-o cannot be applied to the string who for the simple reason that the string is not long enough and, hence, the action is not valid.

Table 3: Hyper-parameters of all models proposed. Lem = Lemmatizer; Anlz = Analyzer.
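The validity heuristic can be sketched as follows, again assuming the simplified (operation, position, segment) tuple encoding (the helper name is_valid is ours):

```python
def is_valid(action, s):
    """Heuristic check that a predicted action is applicable to the current
    string, used to halt decoding early.  Illustrative sketch.
    """
    op, pos, seg = action
    if op in ("PAD", "STOP"):
        return False                      # halting symbols end decoding
    if op == "INS":
        return 0 <= pos <= len(s)         # insertion only needs a valid slot
    # DEL / SUBST must match the characters actually present at pos;
    # e.g. DEL-5-o fails on "who" because position 5 is past the end.
    return s[pos:pos + len(seg)] == seg
```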

Baseline model
We consider the baseline neural model provided by the organizers of the shared task. The architecture, proposed by Malaviya et al. (2019), performs lemmatization and morphological tagging jointly. The morphological tagging module of the model employs an LSTM-based tagger (Heigold et al., 2017), whilst the lemmatizer module employs a sequence-to-sequence architecture with hard attention mechanism (Xu et al., 2015).

Co-occurrence of actions and morphological features
We further investigate the co-occurrence of action labels with individual morphological features. Given the word form w_i, its associated morphological tag F_i = {f^i_0, ..., f^i_k, ..., f^i_K}, and action sequence a^i = a_0, ..., a_j, ..., a_m, we define the joint probability distribution between individual features and action labels as

P(f^i_k, a^i_j) = P(f^i_k | x_{1:i}) · P(a^i_j | a^i_{1:j-1}).    (3)

We consider P(F_i | x_{1:i}) = P(f^i_k | x_{1:i}), ∀ f^i_k ∈ F_i. Note that P(F_i | x_{1:i}) and P(a^i_j | a^i_{1:j-1}) are the probabilities obtained by the tagger and lemmatizer in equations 2 and 1, respectively.

Lemmatization and Morphological Tagging

Table 4 presents results on all metrics for the top 5 and bottom 5 scored treebanks according to MSD-F1 on the official test evaluation. Results for the development set are presented as averages over 10 runs, with the standard deviation in parentheses.
In lemmatization, our model underperforms the baseline on most treebanks, with an error increase ranging from 0.27% to 35.14% in lemma accuracy. However, we improve over the baseline on the following languages: Tagalog (tl_trg), Chinese (zh_gsd, zh_cfl), Cantonese (yue_hk), and Amharic (am_att).
We hypothesize that the relatively poor performance in lemmatization stems from the input representation, i.e. the action sequences. The combination of in-token position information (positions i) and segment characters produces an action set A that is too fine-grained and sparse, even after the BPE merging of adjacent actions.
In morphological tagging, we observe an error increase ranging from 0.31% to 7.34% in MSD-F1. The exceptions were Russian (ru_gsd) and Finnish (fi_tdt), for which we obtain an error decrease of 34.88% and 46.71% in MSD accuracy, 5 respectively. Figure 2 shows the distribution of individual morphological features over action labels, as defined in Eq. 3, for Czech (cs_pdt). Every row represents how likely a fine-grained feature label is to co-occur with an action performed during lemmatization of a token. On the left, we show co-occurrence distributions of gold actions and gold feature labels; on the right, co-occurrence distributions of predicted actions and predicted feature labels. For ease of visualization, we only plot the 20 most frequent action labels and the 30 most frequent features in the development set. We observe that the lemmatizer and tagger succeed in fitting the gold distribution. This is to be expected, since the distribution in Eq. 3 depends on P(F_i | x_{1:i}) and P(a_j | a_{1:j-1}), which are directly optimized by our models. We obtain similar plots for Spanish, English, Turkish, German, and Arabic.
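An empirical (count-based) variant of the feature-action co-occurrence statistic, the kind of quantity visualized in such heatmaps, can be sketched as follows (the helper name cooccurrence and the count-based normalization are our simplifications of Eq. 3, which uses model probabilities):

```python
from collections import defaultdict

def cooccurrence(pairs):
    """Empirical joint distribution over (feature, action) pairs.

    `pairs` is an iterable of (feature_set, action_sequence) per token;
    every feature co-occurs with every action of its token, and counts are
    normalized globally so the values sum to 1.
    """
    counts = defaultdict(float)
    total = 0
    for feats, actions in pairs:
        for f in feats:
            for a in actions:
                counts[(f, a)] += 1
                total += 1
    return {fa: c / total for fa, c in counts.items()}

# Toy example with action labels in the style discussed below
dist = cooccurrence([({"N", "PL"}, ["del-A-y"]),
                     ({"V"}, ["del-A-ne"])])
```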

Actions and Morphological Features
This analysis also sheds light on which actions and morphological features the model learns to associate. For example, the action del-A-y is strongly associated with the features PL, N, and MASC, in accordance with the suffix y being a plural marker. Another notable example is that of the prefix ne, which negates a verb: we observe that the action del-A-ne is strongly associated with the feature V. We also observe ubiquitous features such as POS (positive polarity), which suggests an annotation preference for positive polarity unless the bound negation morpheme ne is observed.

Fixed gold action sequences
Obtaining gold action sequences as a previous, independent step presents a drawback, as pointed out by Makarov and Clematide (2018a): the optimal action sequence obtained for a certain word-lemma pair might not be unique. Hence, if the lemmatizer predicts an alternative valid action sequence, the loss function will still penalize it during training. Given that we consider only one optimal sequence per word-lemma pair, our model cannot take advantage of all the possible valid alternative gold sequences.

Monotonic correspondence assumption
Previous work on neural transducers for morphology tasks (Aharoni and Goldberg, 2017; Makarov and Clematide, 2018b,a) relies on the assumption that an almost monotonic alignment between input and output characters exists. This assumption also implies that both words and lemmas are presented in the same writing system (the same-script condition) if no off-the-shelf character mapper is used. Our action sequencer relies on the same-script condition in order not to produce overly long sequences and, in turn, our lemmatizer relies on it to learn meaningful sequences.
However, upon inspection, we identify a couple of treebanks that violate this condition. In the first, Arabic-PUD (ar_pud), lemmas are romanized, i.e. presented in Latin rather than Arabic script. In the second, Akkadian-PISANDUB (akk_pisandub), different writing systems (ideographic vs. syllabic) are encoded in the forms but are not preserved in the lemmas. This encoding includes extra symbols such as hyphens and square brackets, as well as capitalization of continuous segments. This kind of mismatch between word forms and lemmas forces our lemmatizer to learn action sequences that transform one character at a time, leading to poor performance given our architecture (16.75% and 14.36% lemmata accuracy for ar_pud and akk_pisandub, respectively).

Lemmatizer biased to copy word forms
Languages with little to no morphology, such as Chinese or Vietnamese, will bias a transducer into copying the whole input to the output, as pointed out by Makarov and Clematide (2018b). Our proposed lemmatizer exhibits the same kind of bias, obtaining up to 99.53% lemmata accuracy and a Levenshtein distance of 0.0 on the Chinese-CFL test set, and 100% and 0.0 on the development set. Other languages also benefit from this bias, as can be observed in Figure 3. We note that, on average, the lemmatizer predicts no more than 3 actions before halting.

Conclusions
We presented our submission to the SIGMORPHON 2019 Shared Task on Morphological Analysis and Lemmatization in context. We presented a lemmatization strategy based on word formation operations derived from extended edit-distance operations that operate at the word level instead of at the character level. These operations are merged using a BPE-inspired algorithm in order to encode segment (prefix, suffix) information in addition to the action to perform. Most notably, the proposed models are capable of associating the derived interpretable operations with morpho-syntactic feature labels. We find that the proposed architectures underperform the shared task baseline for most treebanks, showing plenty of room for improvement in this regard.