The Impact of Word Representations on Sequential Neural MWE Identification

Recent initiatives such as the PARSEME shared task allowed the rapid development of MWE identification systems. Many of those are based on recent NLP advances, using neural sequence models that take continuous word representations as input. We study two related questions in neural MWE identification: (a) the use of lemmas and/or surface forms as input features, and (b) the use of word-based or character-based embeddings to represent them. Our experiments on Basque, French, and Polish show that character-based representations yield systematically better results than word-based ones. In some cases, character-based representations of surface forms can be used as a proxy for lemmas, depending on the morphological complexity of the language.


Introduction
MWE identification consists in finding multiword expressions (MWEs) in running text (Constant et al., 2017). For many years, MWE identification was considered unrealistic, with most MWE research focusing on out-of-context MWE discovery (Ramisch et al., 2013). Indeed, the availability of MWE-annotated corpora was limited to some treebanks with partial annotations, often a by-product of syntax trees (Constant et al., 2013). This prevented the widespread development and evaluation of MWE identification systems, as compared to other tasks such as POS tagging and named entity recognition.
This landscape has changed drastically in the last few years, thanks to shared tasks such as DiMSUM (Schneider et al., 2016) and PARSEME 1.0 and 1.1 (Savary et al., 2017), and to the release of open corpora annotated for MWEs in ∼20 languages. These initiatives provide a unified framework for MWE identification, including training/test corpus splits, evaluation metrics, benchmark results, and analysis tools. As a consequence, it is now possible to study some classical text processing problems and their impact on MWE identification systems. One of these problems is the relation between a language's morphology, lemmatisation, input feature representations, out-of-vocabulary (OOV) words, and the performance of the system. For instance, an MWE identification system based on (inflected) surface forms will likely encounter more OOV words than a system based on lemmas, especially for morphologically rich languages in which a single lemma may correspond to dozens of surface forms (Seddah et al., 2013). This problem is particularly relevant for verbal MWEs, which present high morphological and syntactic variability.
Our goal is to study the impact of word representations on verbal MWE (VMWE) identification, comparing lemmas, surface forms, traditional word embeddings and subword representations. We compare the performance of an off-the-shelf MWE identification system based on neural sequence tagging (Zampieri et al., 2018) using lemmas and surface forms as input features, encoded in the form of classical pre-initialised word2vec embeddings (Mikolov et al., 2013) or, alternatively, using new-generation FastText embeddings built from character n-grams (Bojanowski et al., 2017). Our main hypothesis is that the latter can model morphological variability, offering an alternative to lemmatisation. We carry out experiments on three languages with varying morphological complexity: French, Polish and Basque.

Related Work
Parsing-based methods and sequence tagging are among the popular models for MWE identification (Constant et al., 2017). Parsing-based methods take the (recursive) structure of language into account, trying to identify MWEs as a by-product of parsing (Constant et al., 2013), or jointly (Constant and Nivre, 2016). Sequence tagging models, on the other hand, consider only linear context, using models such as CRFs (Vincze et al., 2011; Shigeto et al., 2013; Riedl and Biemann, 2016) and averaged perceptron (Schneider et al., 2014) combined with some variant of begin-inside-outside (BIO) encoding (Ramshaw and Marcus, 1995).
Recurrent neural networks can be used for sequence tagging, being able to handle continuous word representations and unlimited context. The first neural identification system was MUMULS, submitted to the PARSEME shared task 1.0 (Klyueva et al., 2017). Although it did not obtain the best results, MUMULS influenced the development of more advanced models (Gharbieh et al., 2017), which ultimately led to the popularisation of the approach. As a consequence, and inspired by the success of neural models in NLP, nine out of the 17 systems submitted to the PARSEME shared task 1.1 used neural networks. Recently, improvements have been proposed, e.g. to deal with discontinuous MWEs (Rohanian et al., 2019).
Previous work studied the impact of external lexicons (Riedl and Biemann, 2016) and of several feature sets (Maldonado et al., 2017) on CRFs for MWE identification. Character-based embeddings have been shown to be useful for predicting MWE compositionality out of context (Hakimi Parizi and Cook, 2018). In other tasks such as named entity recognition, character convolution layers have been successfully applied (Ma and Hovy, 2016). The use of pre-trained vs. randomly initialised embeddings has been analysed in some PARSEME shared task papers (Ehren et al., 2018; Zampieri et al., 2018). The closest works to ours are the Veyn (Zampieri et al., 2018) and SHOMA (Taslimipoor and Rohanian, 2018) systems, submitted to the PARSEME shared task 1.1. Veyn is used as our off-the-shelf base system, so most of its architecture is identical to ours. Similarly to us, SHOMA employs FastText embeddings, a recurrent layer and a CRF output layer. To our knowledge, however, this is the first study to compare input representations for neural MWE identification.

Experimental Setup

Corpora
The PARSEME shared task 1.1 released freely available VMWE-annotated corpora in 20 languages. [1] Each language's corpus is split into training, development and test parts. To choose our target languages, we analysed the PARSEME corpora and selected three languages with varying morphological richness: Basque (EU), French (FR) and Polish (PL), shown in Table 1. [2] The FR training corpus has more than 420K tokens, whereas the PL and EU training corpora have around 220K and 117K tokens, respectively. EU contains fewer annotated VMWE occurrences than both FR and PL. The average length of annotated VMWE occurrences is similar in the three languages (2.02/2.29/2.13 in EU/FR/PL). The proportion of discontinuous VMWEs is highest in FR (42.12%), whereas in Polish (29.76%) and in Basque (19.28%) they are less frequent.
These languages do not have the same morphological richness, as measured by the average number of surface forms per lemma in the vocabulary ('Morph' column). For instance, the EU training corpus (2.32) has a higher morphological richness than PL (2.21) and FR (1.33). The rate of OOVs, that is, of words that appear in the dev or test corpus vocabularies but not in the training corpus, is higher for surface forms than for lemmas, with a potential negative impact on VMWE identification systems based on surface forms only. As expected, the OOV rate for surface forms is lowest in FR (20-26%), which also has the lowest morphological richness, and highest for EU (43%). These differences are less visible for lemmas, which abstract away from morphology. [3] An interesting figure is the OOV rate focusing on verbs only. [4] Here, PL presents more OOV verb forms (42-44%) than EU (32%), but again this difference disappears for lemmas. This is relevant because our experimental setup implies that it is difficult for a system to predict a VMWE without a reliable representation for a verb, learned from the training data.
[1] http://hdl.handle.net/11372/LRT-2842
[2] Other languages have similar characteristics but were not selected due to the size of the corpora or to incomplete information (e.g. Turkish has missing surface forms for some verbs, preventing us from training a system based on surface forms only).
[3] The official PARSEME French test corpus presents 11,632 missing lemmas. We have lemmatised it using UDPipe (http://ufal.mff.cuni.cz/udpipe) with default parameters, trained on the PARSEME shared task training corpus, to remain in the "closed track" conditions.
[4] For EU, we consider the POS tags VERB, ADI and ADT according to the conversion
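The two corpus statistics discussed above can be illustrated with a minimal sketch. It assumes that morphological richness is the mean number of distinct surface forms per lemma and that the OOV rate is computed over vocabulary types; the function names and the toy data are hypothetical and are not taken from the PARSEME tools.

```python
from collections import defaultdict


def morphological_richness(tokens):
    """Average number of distinct surface forms per lemma.

    `tokens` is an iterable of (surface_form, lemma) pairs.
    """
    forms_by_lemma = defaultdict(set)
    for form, lemma in tokens:
        forms_by_lemma[lemma].add(form)
    total_forms = sum(len(forms) for forms in forms_by_lemma.values())
    return total_forms / len(forms_by_lemma)


def oov_rate(train_items, eval_items):
    """Share of the evaluation vocabulary (types) absent from training."""
    train_vocab = set(train_items)
    eval_vocab = set(eval_items)
    return len(eval_vocab - train_vocab) / len(eval_vocab)


# Toy data invented for illustration only (assumption, not corpus data).
train = [("fait", "faire"), ("faire", "faire"), ("référence", "référence")]
test = [("faisait", "faire"), ("référence", "référence")]

richness = morphological_richness(train)
form_oov = oov_rate([f for f, _ in train], [f for f, _ in test])
lemma_oov = oov_rate([l for _, l in train], [l for _, l in test])
```

On this toy input the unseen form faisait counts as OOV, whereas its lemma faire is known from training, which is the effect that motivates comparing forms and lemmas.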

MWE Identification System
We use our in-house MWE identification system Veyn (Zampieri et al., 2018), based on sequence tagging using recurrent neural networks. [5] The system takes as input the concatenation of the embeddings of the words' features (e.g. lemmas and POS). It uses a conditional random field (CRF) output layer to predict valid label sequences, with VMWEs encoded using the 'BIOG+cat' format. Each token is tagged 'B' if it is at the beginning of a VMWE, 'I' if it is inside a VMWE, 'O' if it does not belong to a VMWE, and 'G' if it does not belong to a VMWE but is in the gap between two words that are part of a VMWE. The tags 'B' and 'I' are concatenated with the VMWE categories (VID, LVC.full, etc.) present in the corpus. The system is trained on the shared task training corpora, so that the results are comparable with the systems submitted to the closed track. [6] We use the dev corpus as validation data, training for 25 epochs with 3 epochs of patience for early stopping. We configure it to use two layers of bidirectional gated recurrent units (GRU) of dimension 128, with all other parameters taking the default values suggested in the Veyn documentation.
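As an illustration of the 'BIOG+cat' encoding, the following sketch tags the tokens of one sentence given its VMWE annotations. It is not Veyn's code: the function name is hypothetical, the tag and category are joined without a separator as an assumption, and overlapping or nested VMWEs are ignored.

```python
def encode_biog(n_tokens, vmwes):
    """Return one BIOG+cat tag per token.

    `vmwes` is a list of (category, sorted_token_indices) pairs.
    Overlapping or nested VMWEs are not handled in this sketch.
    """
    tags = ["O"] * n_tokens
    for category, indices in vmwes:
        first, last = indices[0], indices[-1]
        members = set(indices)
        for i in range(first, last + 1):
            if i == first:
                tags[i] = "B" + category   # first token of the VMWE
            elif i in members:
                tags[i] = "I" + category   # later token of the VMWE
            else:
                tags[i] = "G"              # gap token inside the VMWE span
    return tags


# "He made an important decision": light-verb construction made ... decision.
sentence = ["He", "made", "an", "important", "decision"]
tags = encode_biog(len(sentence), [("LVC.full", [1, 4])])
```

Here the discontinuous expression yields the tags O, BLVC.full, G, G, ILVC.full, so the gap tokens are distinguished from tokens outside any VMWE.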

Word Representations
We use two types of word embeddings to represent input surface forms and lemmas: word2vec and FastText. Word2vec is a prediction-based distributional model in which a word representation is obtained from a neural network trying to predict a word from its context or vice-versa (Mikolov et al., 2013). FastText is an adaptation which also takes into account character n-grams, being able to build vectors for OOVs from their character n-grams (Bojanowski et al., 2017). For each representation, we used the gensim library to train 256-dimensional vectors for both forms and lemmas on the training corpus of the shared task for 10 epochs. Furthermore, all embeddings use the CBOW algorithm with the same hyper-parameter values of 5 for the window size (left/right context of words) and 1 for min-count (minimum number of occurrences of words). For FastText, we set the size of character n-grams to 1 to combine the whole word's embedding with the embeddings of its characters. We did not use contextual representations, like BERT, ELMo or Flair (Devlin et al., 2018; Peters et al., 2018; Akbik et al., 2018), because they have to be pre-trained on large corpora and we wanted to have an experimental setup compatible with the closed track of the PARSEME shared task.
[5] https://github.com/zamp13/Veyn
[6] http://multiword.sourceforge.net/sharedtaskresults2018
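To make the subword mechanism concrete, the sketch below builds a vector for an unseen form from its character n-grams with n = 1. It is a simplified illustration, not the FastText or gensim implementation: boundary symbols and n-gram hashing are omitted, the composition is a plain average, and all names and the two-dimensional toy vectors are hypothetical.

```python
def char_ngrams(word, n=1):
    """Character n-grams of `word` (boundary symbols omitted in this sketch)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def compose_vector(word, word_vectors, ngram_vectors, dim, n=1):
    """Average the n-gram vectors with the whole-word vector, if one exists."""
    parts = [ngram_vectors[g] for g in char_ngrams(word, n) if g in ngram_vectors]
    if word in word_vectors:
        parts.append(word_vectors[word])
    if not parts:
        return [0.0] * dim  # nothing known about this word
    return [sum(vec[k] for vec in parts) / len(parts) for k in range(dim)]


# Toy 2-dimensional vectors, invented for illustration only.
word_vectors = {"faire": [1.0, 0.0]}
ngram_vectors = {
    "f": [1.0, 0.0], "a": [0.0, 1.0], "i": [1.0, 1.0],
    "r": [0.0, 0.0], "e": [1.0, 0.0], "t": [0.0, 0.0],
}

# "fait" has no whole-word vector, but all of its characters are known.
oov_vector = compose_vector("fait", word_vectors, ngram_vectors, dim=2)
```

Because fait shares the characters f, a and i with faire, its composed vector is not arbitrary, whereas a purely word-based model has no trained vector for an unseen form.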

Evaluation Measures
We adopt the metrics proposed in the PARSEME shared tasks (Savary et al., 2017). The MWE-based measure (F-MWE) is the F1 score for fully predicted VMWEs, whereas the token-based measure (F-TOK) is the F1 score for tokens belonging to a VMWE.

Table 2: MWE-based F-measure (F-MWE) and token-based F-measure (F-TOK) of the models on the test corpus, using word2vec and FastText word representations for different feature sets: lemmas, surface forms, and both.
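The difference between the two measures can be shown with a simplified sketch. It assumes each VMWE is represented as a set of token positions and computes the token-based score over the union of those positions; the official PARSEME evaluation script is more elaborate, and the function names below are hypothetical.

```python
def f1(true_positives, n_predicted, n_gold):
    """Harmonic mean of precision and recall, 0.0 when undefined."""
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f_mwe(gold, predicted):
    """A predicted VMWE counts only if all of its tokens match a gold VMWE."""
    return f1(len(gold & predicted), len(predicted), len(gold))


def f_tok(gold, predicted):
    """Each token belonging to some VMWE is scored individually."""
    gold_tokens = set().union(*gold)
    pred_tokens = set().union(*predicted)
    return f1(len(gold_tokens & pred_tokens), len(pred_tokens), len(gold_tokens))


# Toy annotations: the second VMWE is only partially predicted.
gold = {frozenset({1, 4}), frozenset({7, 8})}
predicted = {frozenset({1, 4}), frozenset({7})}

mwe_score = f_mwe(gold, predicted)
tok_score = f_tok(gold, predicted)
```

In this example the partial prediction is rewarded by the token-based score (6/7) but not by the MWE-based score (0.5), which is why the two measures can diverge when a system outputs single-token predictions.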

Results
We train Veyn using UPOS tags as input features, combined with word2vec and FastText embeddings for lemmas, surface forms, or both. Performance is reported on the PARSEME test corpus for Basque (EU), French (FR) and Polish (PL). On the one hand, we compare FastText and word2vec representations; on the other hand, we compare various input feature sets (Table 2).

Impact of Word Vector Representation
[...] się w stanie, the system with FastText makes no prediction, whereas with word2vec the prediction is będzie się w stanie, where the reflexive clitic się is tagged as a single-token inherently reflexive verb and w stanie is predicted as a verbal idiom. More single-token predictions increase F-TOK recall but decrease F-MWE precision. Further investigation is needed to understand this phenomenon, which could be compensated for by simple post-processing, e.g. grouping single-token predictions with adjacent ones. We hypothesise that the system with subword representations is able to take morphological inflection into account. For example, the French expression faire référence 'to make reference' is seen in this form in the training corpus, but the test corpus contains a different inflection of the verb, fait référence 'makes reference'. For this example, the system finds the expression with FastText representations, but fails to find it with word2vec representations when both surface form and lemma are used as input.

Impact of Input Pre-processing
For Basque, which has a high morphological richness, the model with the richest information provides the best results. Performance is maximised with the form-lemma model, providing an F-MWE score of 69.24, while the form model yields 66.52 and the lemma model 62.86, suggesting that relevant information for VMWE identification is lost in lemmatisation. For Polish, similar results are obtained in terms of F-TOK, while F-MWE is maximised for the lemma configuration with FastText. This is also a consequence of the phenomenon described in the previous subsection, where single-token expressions are predicted for Polish. The lemma configuration is less affected by this phenomenon (F-TOK is lower) and thus full-expression identification is more effective (higher F-MWE of 61.49). Results on French corroborate this trend: although French has simpler morphology, lemmas are still important to obtain the best results. As opposed to morphologically rich languages like Basque, the combination of lemmas and forms for French does not yield as much improvement. Performances in terms of F-TOK are equivalent for lemma and form-lemma and are slightly better in terms of F-MWE. For the three languages under consideration, our best models would have ranked in the top 3 in the closed track of the official shared task results.

Conclusions and Future Work
We have studied the impact of word representations on VMWE identification for Basque, French and Polish, comparing lemmas and surface forms as input features and comparing traditional word embeddings (word2vec) and subword representations (FastText). Regarding the latter, subword representations proved to be effective for our task. For the former, we have highlighted that the use of lemmas always has a positive impact. For languages with high morphological richness, the combination of lemmas and forms has an even higher impact, especially for Basque. Considering the high out-of-vocabulary (OOV) rate, including for verbs, we intend to improve OOV handling in the future. The use of recent embeddings such as BERT, ELMo and Flair, trained on large external corpora, could help with OOVs.