Neural Lemmatization of Multiword Expressions

This article focuses on the lemmatization of multiword expressions (MWEs). We propose a deep encoder-decoder architecture that generates, for every MWE word, its corresponding part in the lemma, based on the internal context of the MWE. The encoder relies on recurrent networks based on (1) the character sequence of the individual words to capture their morphological properties, and (2) the word sequence of the MWE to capture lexical and syntactic properties. The decoder in charge of generating the corresponding part of the lemma for each word of the MWE is based on a classical character-level attention-based recurrent model. Our model is evaluated on Italian, French, Polish and Portuguese and performs well except for Polish.


Introduction
Lemmatization consists in finding the canonical form of an inflected form occurring in a text. Usually, the lemma is the base form that can be found in a dictionary. In this paper, we are interested in the lemmatization of multiword expressions (MWEs), which has received little attention in the past. MWEs are combinations of several words that show some idiosyncrasy (Gross, 1986; Sag et al., 2002; Baldwin and Kim, 2010; Constant et al., 2017). They display the linguistic properties of a lexical unit and are present in lexicons just as simple words are. Such a task may be of interest, for instance, for the identification of concepts and entities in morphologically-rich languages. The main difficulty of the task resides in the variable morphological, lexical and syntactic properties of MWEs, which lead to many different lemmatization rules on top of simple-word lemmatization knowledge, as illustrated by the 27 hand-crafted rules used by the rule-based multiword lemmatizer for Polish described in Marcińczuk (2017). For example, in French, the nominal MWE cartes bleues (cards.noun.fem.pl blue.adj.fem.pl), meaning credit cards, is lemmatized as carte bleue (card.noun.fem.sg blue.adj.fem.sg), where the adjective bleue (blue) agrees in number (sg) and gender (fem) with the noun carte (card). A single-word lemmatization would not preserve the gender agreement in this example: the feminine adjective bleues would be lemmatized as the masculine bleu.
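To make the word-to-word intuition concrete, the following toy sketch (not the paper's system) lemmatizes each word of a nominal MWE while preserving agreement with the head noun. The tiny lexicon, the agreement rule and all function names are invented for illustration only.

```python
# Toy word-to-word MWE lemmatization with head-noun agreement.
# Lexicon entries and the naive agreement rule are invented for this example.

LEXICON = {
    # form: (single-word lemma, POS, gender, number)
    "cartes": ("carte", "NOUN", "fem", "pl"),
    "bleues": ("bleu", "ADJ", "fem", "pl"),
}

def inflect_adj(lemma, gender):
    """Naive French adjective agreement: add -e for the feminine."""
    return lemma + "e" if gender == "fem" and not lemma.endswith("e") else lemma

def lemmatize_mwe(words):
    """Map every word of the MWE to its part of the MWE lemma."""
    # gender of the head noun, used to keep agreement in the lemma
    head_gender = next(g for w in words
                       for (_, pos, g, _) in [LEXICON[w]] if pos == "NOUN")
    parts = []
    for w in words:
        lemma, pos, gender, number = LEXICON[w]
        if pos == "ADJ":
            # agree with the head noun instead of using the bare masculine lemma
            lemma = inflect_adj(lemma, head_gender)
        parts.append(lemma)  # nouns: singular citation form
    return " ".join(parts)

print(lemmatize_mwe(["cartes", "bleues"]))  # carte bleue
```

A single-word lemmatizer applied word by word would instead output carte bleu, losing the gender agreement described above.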
In this paper, we propose a deep encoder-decoder architecture that generates, for every MWE word, its corresponding part in the lemma, based on the internal context of the MWE. The encoder relies on recurrent networks based on (1) the character sequence of the individual words to capture their morphological properties, and (2) the word sequence of the MWE to capture lexical and syntactic properties. The decoder in charge of generating the corresponding part of the lemma for each word of the MWE is based on a classical character-level attention-based recurrent model. One research question is whether the system is able to encode the complex linguistic properties needed to generate an accurate MWE lemma. As a preliminary stage, we evaluated our architecture on five suffix-based inflectional languages, with a special focus on French and Polish.
Contrary to the lemmatization of simple words (Bergmanis and Goldwater, 2018), our task is not a disambiguation task, as for a given MWE form there is one possible lemma in all cases but some very rare exceptions. This means that the lemma of a known MWE is simply its associated lemma in the training data. The interest of a neural system is thus limited to the case of unknown MWEs. One research question is whether the system is able to generalize well on unknown MWEs.

Figure 1: Neural architecture. For simplification, we do not show the hidden and softmax layers of the attention decoder. We use ReLU as the activation of the hidden layer. TAG_k stands for the embedding of the predicted POS tag of the word w_k, possibly concatenated with the embedding of the gold MWE-level POS tag.
To the best of our knowledge, this is the first attempt to implement a language-independent MWE lemmatizer based entirely on neural networks. Previous work used rule-based methods and/or statistical classification methods (Piskorski et al., 2007; Radziszewski, 2013; Stankovic et al., 2016; Marcińczuk, 2017).
The article is organized as follows. First, we describe our model and our dataset. Then we present and discuss experimental results, before reviewing related work.

Model
Our lemmatization model is based on a deep encoder-decoder architecture, as shown in Figure 1. The input MWE is a sequence w_1 w_2 ... w_n of n words. It is given without any external context, as there is no disambiguation to perform (cf. Section 1). Every word w_k is decomposed into a sequence w_k1 w_k2 ... w_kn_k of n_k characters that is passed to a Gated Recurrent Unit (GRU), which outputs a character-based word embedding C(w_k) corresponding to the output of the last GRU cell. The whole MWE sequence C(w_1) C(w_2) ... C(w_n) is then passed to a GRU in order to capture the internal context of the MWE. For every word w_k, a decoder generates its corresponding part l_k = l_k1 l_k2 ... l_kp_k in the MWE lemma l. It is based on a character-based conditional GRU augmented with an attention mechanism (Bahdanau et al., 2014). Every w_k is encoded as a vector which is the concatenation of the following features: its context-free character-based embedding C(w_k), its left context h_{k-1} in the MWE (h_{k-1} being the output of the GRU at time step k - 1), a tag embedding TAG_k, its position k and the MWE length n. TAG_k is the embedding of the predicted POS tag of w_k, sometimes concatenated with the embedding of the gold MWE POS tag.
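The encoder's data flow can be sketched as follows. This is an illustration of the tensor shapes only, with random untrained weights and assumed small dimensions (the paper uses a hidden size of 192); it is not the authors' implementation.

```python
# Sketch of the encoder: a character-level GRU produces C(w_k), a word-level
# GRU over C(w_1)..C(w_n) produces the left-context states h_k, and each
# decoder input concatenates [C(w_k); h_{k-1}; TAG_k; position k; length n].
# All dimensions and weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
H = 8  # hidden size (the paper uses 192)

def gru_cell(params, x, h):
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = 1 / (1 + np.exp(-(Wz @ x + Uz @ h)))  # update gate
    r = 1 / (1 + np.exp(-(Wr @ x + Ur @ h)))  # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
    return (1 - z) * h + z * h_tilde

def make_params(in_dim, hid):
    # alternate input-to-hidden (hid x in_dim) and hidden-to-hidden (hid x hid)
    return [rng.normal(size=(hid, in_dim)) if i % 2 == 0
            else rng.normal(size=(hid, hid)) for i in range(6)]

char_params = make_params(4, H)  # character embeddings of size 4
word_params = make_params(H, H)

def encode_word(char_embs):
    """C(w_k): last hidden state of the character-level GRU."""
    h = np.zeros(H)
    for c in char_embs:
        h = gru_cell(char_params, c, h)
    return h

# an MWE of 2 words, with 6 and 5 characters respectively
words = [[rng.normal(size=4) for _ in range(n)] for n in (6, 5)]
C = [encode_word(w) for w in words]

h = np.zeros(H)  # h_0
dec_inputs = []
tag_emb = rng.normal(size=3)  # embedding of the predicted POS tag
for k, c_wk in enumerate(C, start=1):
    # decoder input for word k: [C(w_k); h_{k-1}; TAG_k; position; length]
    dec_inputs.append(np.concatenate([c_wk, h, tag_emb, [k, len(C)]]))
    h = gru_cell(word_params, c_wk, h)

print([v.shape for v in dec_inputs])  # [(21,), (21,)]
```

In the real model, each of these per-word vectors conditions an attention-based character-level decoder that emits the corresponding lemma part.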
Our model has some limitations. First, the input form and the produced base form must have the same number of words. Secondly, the sequential nature of the model and the one-to-one correspondence are not well suited to lemmatization that modifies the word order. For instance, the lemmatization of the verbal expression decision [was] made in the passive form involves word reordering.

Dataset

Token-based data are derived from annotated corpora and are intended to be used to evaluate our approach on a real MWE distribution. Type-based data are derived from different morphosyntactic dictionaries and are intended to be used to evaluate the coverage and robustness of our approach. They are divided into train/dev/test splits. Table 1 displays the dataset sources and statistics. The French and Polish data are by far the largest datasets and include both token- and type-based resources. The Italian and Portuguese data are smaller and only type-based. They are derived from the freely available dictionaries in the Unitex platform (Paumier et al., 2009). We constructed our dataset by applying some automatic preprocessing to resolve tokenization and lemma discrepancies between the different sources, and to filter out MWEs whose number of words is not equal to the number of words of the lemma, since our approach is based on a word-to-word process (1.6% of the MWEs are thus removed in French). Datasets and code can be found at the following URL: https://git.atilf.fr/parseme-fr/deep-lexical-analysis. Note that the French Treebank data are distributed upon request because of license specificities. For token-based datasets, we used the official splits used in  and Seddah et al. (2013) for French, and in Marcińczuk (2017) for Polish. For dictionary-based resources, we applied a random split, taking care to keep all entries with the same lemma in the same split.
For every language, we constructed a single training set composed of the train parts of the different resources used. We also augmented our training sets with gold pairs (simple-word form, simple-word lemma) to account for simple-word lemmatization knowledge in the MWE lemmatization process. This information comes from the same sources as the MWEs.
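The preprocessing steps above can be sketched as follows: filter out entries whose form and lemma differ in word count, split type-based entries so that all entries sharing a lemma land in the same split, and append simple-word pairs to the training data. Field layout, split ratios and the sample entries are assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch of the dataset preprocessing described above.
import random

def filter_word_to_word(entries):
    """Keep only (form, lemma) pairs with matching word counts."""
    return [(f, l) for f, l in entries if len(f.split()) == len(l.split())]

def split_by_lemma(entries, seed=0):
    """Random train/dev/test split keeping same-lemma entries together."""
    lemmas = sorted({l for _, l in entries})
    random.Random(seed).shuffle(lemmas)
    n = len(lemmas)
    cut1, cut2 = int(0.8 * n), int(0.9 * n)  # assumed 80/10/10 ratios
    train_l, dev_l = set(lemmas[:cut1]), set(lemmas[cut1:cut2])
    split = {"train": [], "dev": [], "test": []}
    for form, lemma in entries:
        key = "train" if lemma in train_l else "dev" if lemma in dev_l else "test"
        split[key].append((form, lemma))
    return split

entries = [("cartes bleues", "carte bleue"),
           ("carte bleue", "carte bleue"),
           ("pommes de terre", "pomme de terre"),
           ("decision was made", "make decision")]  # dropped: 3 vs 2 words
kept = filter_word_to_word(entries)
data = split_by_lemma(kept)
# augment the training set with simple-word (form, lemma) pairs
data["train"] += [("bleues", "bleu"), ("cartes", "carte")]
```

Splitting by lemma rather than by entry prevents inflected variants of the same MWE from leaking between train and test.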

Experiments
Experimental setup. We manually tuned the hyperparameters of our system on the dev sections.
Our final results on test sections were obtained using the best hyperparameter setting for the dev sections (hidden layer size: 192, character embedding size: 32, tag embedding size: 8, learning rate: 0.005, dropout: 0.25). We used UDPipe (Straka and Straková, 2017) to predict word POS tags for all languages. We also included predicted morphological features for Polish.
Evaluation metrics. We evaluated our system by using two metrics: MWE-based accuracy and word-based accuracy. MWE-based accuracy, also used for tuning, accounts for the proportion of MWEs that have been correctly lemmatized. Word-based accuracy indicates the total proportion of words that have been given the correct corresponding lemma part.
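The two metrics translate directly into code: MWE-based accuracy counts a hit only when the whole predicted lemma matches, while word-based accuracy scores each word's lemma part independently. A minimal sketch:

```python
# MWE-based vs word-based accuracy, as defined above.
def mwe_accuracy(gold, pred):
    """Fraction of MWEs whose full lemma is predicted exactly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def word_accuracy(gold, pred):
    """Fraction of words assigned the correct lemma part."""
    pairs = [(gw, pw) for g, p in zip(gold, pred)
             for gw, pw in zip(g.split(), p.split())]
    return sum(gw == pw for gw, pw in pairs) / len(pairs)

gold = ["carte bleue", "pomme de terre"]
pred = ["carte bleu", "pomme de terre"]  # one wrong word in the first MWE
print(mwe_accuracy(gold, pred))   # 0.5
print(word_accuracy(gold, pred))  # 0.8
```

A single wrong word thus costs a whole MWE under the first metric but only one word under the second, which is why word-based accuracy is systematically the higher of the two.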
Results. Table 2 displays our final results on the dev and test sets of our five languages. First, it shows that our system generalizes well on unknown MWEs (columns unk.). For type-based data, scores on unknown MWEs are comparable to or slightly better than those for all MWEs. For token-based data, the MWE-based accuracy loss is reasonable, ranging from almost 0 points for French verbal expressions (ST data) to 13 points for Polish MWEs. Our system performs well on French. On similar languages (BR, IT, PT), results are lower, but rather good given the limited size of the training sets. The system shows disappointing results for Polish, especially on the dictionary data. On the token-based dataset, results are very far from those obtained by the rule-based system of Marcińczuk (2017), which achieves around 98% accuracy using 27 rules and dictionary information. Polish being a morphologically rich language, the encoding of morphological constraints deserves more investigation. The system also shows lower scores for verbal expressions in French, which exhibit much morphological and syntactic variation.
We also evaluated our system on the lemmatization of simple words, as it would be convenient to have a single system performing lemmatization of both simple words and MWEs. However, it did not show satisfying results: we obtained a score of 73% on the FTB corpus, against 99% when the system is trained on simple words only.

Ablation study. Table 3 displays the results on the dev section of the French data excluding the ST data. The GRU component appears crucial to capture morphosyntactic constraints (8-10 point gain). The use of simple-word lemmatization knowledge also has a significant impact (7-8 point gain). Word POS tags are mainly beneficial for the dictionary evaluation (4-point gain). We also evaluated the impact of adding the gold MWE POS tag, which is mainly beneficial in a dictionary evaluation setting (4-point gain).

Comparison with baselines. We compared our system with two baselines, both using UDPipe (Straka and Straková, 2017). The first one consists in training UDPipe in a special way. More precisely, it is trained on the sequences of simple words of the train corpora, plus the MWE word sequences of the training data set. In order to give UDPipe cues about the MWE internal structure, we provide MWE words in the train set with IOB-like tags indicating their relative positions in the MWE, in addition to their POS tags/MWE tag. For instance, the French MWE cartes bleues (lit. cards blue, tr. credit cards) would be annotated in the following way (with POS tags): cartes/carte/B-NOUN bleues/bleue/I-ADJ. The second baseline simply consists in lemmatizing each word of the MWE separately, with UDPipe already trained with the basic UD model. The output MWE lemma is the concatenation of the predicted lemmas of all MWE words. Table 3 shows that this baseline is not competitive with the UDPipe adaptation baseline. Table 4 compares the performances of our system with the best baseline on the dev datasets for French and Polish.
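The IOB-style annotation fed to the first baseline can be sketched as follows; the helper function is hypothetical and only reproduces the annotation format shown above.

```python
# Hypothetical sketch of the IOB-style annotation given to the UDPipe baseline:
# each MWE word gets its single-word lemma, a B-/I- prefix marking its
# relative position in the MWE, and its POS tag.
def iob_annotate(words, lemmas, tags):
    out = []
    for i, (w, l, t) in enumerate(zip(words, lemmas, tags)):
        prefix = "B" if i == 0 else "I"  # B on the first word, I elsewhere
        out.append(f"{w}/{l}/{prefix}-{t}")
    return " ".join(out)

print(iob_annotate(["cartes", "bleues"], ["carte", "bleue"], ["NOUN", "ADJ"]))
# cartes/carte/B-NOUN bleues/bleue/I-ADJ
```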
The baseline consistently shows lower scores for Polish and French: the best baseline ranges from a 0.4-to-2.5-point loss for French to a more than 10-point loss for Polish.

Results by MWE subclasses. Table 5 compares results for different lemmatization cases for French and Polish on dev data: the MWE lemma corresponds to (1) the MWE form, (2) the concatenation of the word lemmas, (3) other cases. In French, our system performs rather well on the second case. In Polish, the system performs better on the first case. It is worth noticing that our system performs very well on MWEs that belong to both cases (1) and (2), especially for French. There is a significant gap in performance with the other cases for both languages. Note that the proportion of MWEs belonging to the other cases is much greater in Polish than in French. This might partially explain why the system performs so poorly on Polish data.

Related Work

Previous works proposed approaches based on statistical classification, like predicting edit tree operations transforming word forms into lemmata (Chrupała and van Genabith, 2008; Müller et al., 2015) or predicting lemmatization rules consisting in removing and then adding suffixes and prefixes (Straka and Straková, 2017). Using the deep learning paradigm, Schnober et al. (2016) and Bergmanis and Goldwater (2018) proposed attention-based encoder-decoder lemmatization models.

Conclusion
In this paper, we presented a novel architecture for MWE lemmatization relying on a word-to-word process based on a deep encoder-decoder neural network. It uses both the morphological information of the individual words and their internal context in the MWE. Evaluations on five languages showed that the proposed system generalizes well on unknown MWEs, though results are disappointing for a language with very rich morphology like Polish, and for verbal expressions. This requires further, more detailed investigation. Another line of research for future work would consist in integrating transformers into our system and in evaluating it on more languages.