Cross-lingual Transfer Learning and Multitask Learning for Capturing Multiword Expressions

Recent developments in deep learning have prompted a surge of interest in the application of multitask and transfer learning to NLP problems. In this study, we explore, for the first time, the application of transfer learning (TRL) and multitask learning (MTL) to the identification of Multiword Expressions (MWEs). For MTL, we exploit the shared syntactic information between MWE and dependency parsing models to jointly train a single model on both tasks, predicting two types of labels: MWE and dependency parse. Our neural MTL architecture utilises the supervision of dependency parsing in lower layers and predicts MWE tags in upper layers. In the TRL scenario, we overcome the scarcity of data by learning a model on a larger MWE dataset and transferring the knowledge to a resource-poor setting in another language. In both scenarios, the resulting models achieved higher performance than standard neural approaches.


Introduction
Multiword Expressions (MWEs) are combinations of two or more lexical components that form non/semi-compositional meaning units. Due to their idiosyncratic behaviour, MWEs have been studied using various statistical and machine learning approaches including supervised classification (Diab and Bhutada, 2009), tagging (Schneider et al., 2014), and unsupervised prediction (Fazly et al., 2009). Studies have focused on both their syntactic (Constant and Nivre, 2016) and semantic (Van de Cruys and Moirón, 2007) features.
Recently, the PARSEME project provided an extensive multilingual dataset of verbal MWEs. Datasets of certain languages in this resource are rich, with a large number of tagged sequences, while others are considerably smaller. Several notable systems have been proposed to train sequence labelling models on this dataset, including neural (Taslimipoor and Rohanian, 2018) and non-neural systems (Moreau et al., 2018). MWE prediction for some of these languages has proved to be more challenging for several reasons, including scarcity of data, a higher percentage of unseen MWE instances in the test set, and the prevalence of discontinuous or variable MWEs.
In this paper, we focus on one of the languages for which results were collectively low (interestingly, English) and explore two neural approaches in order to address the shortcomings of current neural models and enhance learning. The two approaches, multitask learning and transfer learning, have two different motivations.
Syntactic and semantic idiosyncrasies in MWEs call for special treatment, with models that take them into account from different perspectives. Syntactic and semantic information are commonly fed to the models as input features. However, we consider an alternative way to exploit this information. Specifically, in a supervised setting, we add dependency syntax information as auxiliary supervision. Therefore we perform multitask learning between MWE and dependency parse tags.
Syntactic dependency information has been previously proven to be successful in identifying MWEs (Constant and Nivre, 2016). However, neural processing methodologies are yet to be deeply explored for MWE modelling (Constant et al., 2017). In multitask learning we have several different prediction tasks over the same input. The idea is that the process of learning features for one task can be helpful for another.
In order to deal with data scarcity in the English dataset, in another setting we train our model on a language with a larger dataset and transfer the learned knowledge to predict MWE tags in English.
In this study we build upon recent neural network systems that have proved to be successful in representing syntactic and semantic features of text, and design novel multitask and transfer learning architectures for MWE identification. The contributions of this work are: 1) we propose a neural model that improves MWE identification by jointly learning MWE and dependency parse labels; 2) we show that MWE identification models, when multitasked with dependency parsing, outperform models that naively add dependency parse information as additional features; and 3) we propose, to the best of our knowledge for the first time, a cross-lingual transfer learning method for processing MWEs, thus contributing to the study of low-resource languages.

Related Work
Constant and Nivre (2016) proposed joint syntactic and lexical analysis in which the syntactic dimension of their structure is represented by a dependency tree, and the lexical dimension is represented by a forest of trees. The two dimensions share token-level representations. They use a transition-based system that jointly learns both lexical and syntactic analysis resulting in an improvement for the task of MWE identification.
The idea of multitask learning (MTL) in neural networks was popularised by the work of Collobert et al. (2011). They improved the performance of chunking by jointly learning it with POS tagging. Søgaard and Goldberg (2016) developed the idea further by showing that supervising different tasks at different layers is beneficial. Specifically, in their work, for an input sequence w_1:n, each task t is associated with an RNN layer l_t, and the task-specific classifier is defined as y^t_i = f^t(v^{l_t}_i), where v^{l_t}_i is the output representation of the RNN at layer l_t for word i and f^t is the tagger/classification function. This way, different tasks may be supervised at different RNN layers (i.e. there are layers shared by several tasks, and layers that are specific to some tasks). We use this idea here by having specific layers for the final MWE prediction which are not shared with the auxiliary parsing task.
Using an LSTM-based model, Bingel and Søgaard (2017) performed a study to find beneficial tasks for the purpose of MTL in a sequence labelling scenario. In their work, the MWE model benefited from most auxiliary tasks, such as chunking, CCG parsing, and super-sense tagging. A similar finding is reported in Changpinyo et al. (2018), where the performance of an MWE tagger was consistently improved when jointly trained with any of 10 different auxiliary tasks in various MTL settings.
Transfer learning (TRL) has seen a flurry of interest with the advent of pre-trained language models, transformers, and contextualised embeddings (Howard and Ruder, 2018; Peters et al., 2018; Devlin et al., 2018). Transfer learning is particularly helpful where data scarcity is an issue and a related task with more data can be used to alleviate it. Liu et al. (2018) is an example of the use of task-aware language models to enhance sequence labelling, using an LSTM-CRF architecture powered by a language model. A related scenario in TRL is when the task remains the same but models are designed to transfer knowledge across languages. In NLP, cross-lingual transfer learning has been extensively explored in the context of representation learning, where monolingual spaces are mapped into a common embedding space through methods like retrofitting (Faruqui et al., 2015), matrix factorisation (Vyas and Carpuat, 2016), or similar. Outside representation learning, there have been many attempts to use TRL in NLP tasks. For sequence labelling, one line of work trained POS tagging models cross-lingually without access to parallel resources, using two LSTM components of which one is shared between the languages and the other is private (language-specific). Yang et al. (2017) is a notable example of cross-lingual transfer learning under low-resource settings, where sequence labelling models were trained to transfer knowledge between English, Spanish, and Dutch for POS tagging, chunking, and Named Entity Recognition (NER) through the use of shared and private parameters. In that work, three different architectures were explored for cross-domain, cross-application, and cross-lingual transfer. The core of their proposed models is similar to Lample et al. (2016), with minor differences including the use of GRUs instead of LSTMs and a training objective based on the max-margin principle.

Methodology
The core of our model is a neural architecture that incorporates CNN and LSTM layers which are commonly employed in sequence tagging models. 1 We adapt the architecture to the two scenarios of multitask and transfer learning. The details of the layers and input representations for these models are further explained in Section 4 and depicted in Figure 1.

Multitask Learning
In the multitask learning scenario, the models are required to simultaneously predict MWE tags, dependency parse arcs, and dependency parse labels. A sample of all three-fold labels that the model should predict for a sentence is depicted in Figure 2. In order to learn the main output, the MWE tag, the model computes loss values for two auxiliary outputs, Dep arc and Dep tag, and adds them to the main output loss. Similar to the idea of Søgaard and Goldberg (2016), we introduce the supervision of dependency parsing in lower layers and aim to boost the performance of the final MWE tagging layer. To this end, the parallel CNNs and the first BiLSTM layer are shared between the two tasks. On top of this, two layers with independent auxiliary losses are applied to predict dependency tags. Parallel to this, we add a single BiLSTM before the main output layer for predicting MWE tags (Figure 1). In this study, we simply add the main loss to the two auxiliary losses (all computed using categorical cross-entropy).

Figure 2: Annotation of one sample sentence containing one VPC and a verbal idiom in the English data for the PARSEME shared task, edition 1.1.

Transfer Learning
In transfer learning, of which domain adaptation is a common instance, information from a source task is retained to enhance learning on a related target task. In this study, we use TRL in a multilingual scenario. Since our target language is low-resource, the aim is to benefit from the richer data of another language.
To this end, a model which is trained on the domain of one language is transferred to the domain of another, target language. The two languages share the same sets of POS and dependency parse tags; therefore, one-hot encoded POS and dependency inputs are shared between the source and the transferred models. When loading pre-trained contextualised embeddings as inputs, the sentences of each language have their own sets of weights. On the other hand, we also have a setting in which our model starts with a trainable embedding layer. In this case, the vocabularies of both languages are combined and indexed together, so that word forms common to the two languages, such as proper nouns, receive the same indices.
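For illustration, the joint indexing described above can be sketched as follows (a minimal sketch; the function name and the special tokens are our own, not taken from the implementation):

```python
def build_shared_vocab(sentences_l1, sentences_l2):
    """Combine and index the vocabularies of both languages together,
    so that word forms shared by the two languages (e.g. proper nouns)
    receive the same index in the embedding layer."""
    vocab = {"<pad>": 0, "<unk>": 1}  # hypothetical special tokens
    for sentence in list(sentences_l1) + list(sentences_l2):
        for token in sentence:
            vocab.setdefault(token, len(vocab))
    return vocab
```

A trainable embedding layer indexed by this shared vocabulary lets the transferred model reuse the embeddings learned on the source language for any overlapping word forms.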
In this study, we first train the model on the German data, and then transfer the weights to an identical model which is re-trained on English for fewer iterations.

Experiments
We experiment with the multilingual dataset from the PARSEME project, which was made available for the shared task on identification of verbal MWEs. Verbal MWEs in the dataset include idioms, verb particle constructions, and light verb constructions, among others. MWE tags in the dataset are similar to IOB labels, since there is a distinction between the beginning and the other components of an MWE. We target the data for English, which is surprisingly small in this dataset (with 3,471 training and 3,965 test sequences), and try to use MTL and TRL to improve MWE identification.
The inputs to our system are combinations of ELMo embeddings, trained on our data using the implementation provided by Che et al. (2018), and one-hot encoded POS tags. In cases where we add dependency parse information as inputs, dependency arcs and labels are represented as follows. To represent arcs, we use an adjacency matrix for each sentence: each token is assigned a row in which all cells are zero except the one corresponding to the token's head in the dependency tree. Dependency labels, in contrast, are simply one-hot encoded.
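As a concrete sketch of these representations (using numpy; the helper name and the inclusion of an explicit root column are our assumptions, since the exact matrix layout is not spelled out):

```python
import numpy as np

def dependency_inputs(heads, dep_labels, label_set):
    """Build the per-sentence dependency representations described above.

    heads      : 1-based head index for each token (0 = root), CoNLL-U style.
    dep_labels : dependency relation for each token.
    label_set  : ordered list of all relation names.
    """
    n = len(heads)
    # Adjacency matrix: each token's row is all zeros except the cell
    # corresponding to its head in the dependency tree.
    arcs = np.zeros((n, n + 1), dtype=np.float32)  # column 0 stands for the root
    for i, head in enumerate(heads):
        arcs[i, head] = 1.0
    # Dependency labels are simply one-hot encoded.
    labels = np.zeros((n, len(label_set)), dtype=np.float32)
    for i, rel in enumerate(dep_labels):
        labels[i, label_set.index(rel)] = 1.0
    return arcs, labels
```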
We set hyperparameters based on the ones used in a similar architecture proposed by Taslimipoor and Rohanian (2018), which was implemented for a single-task and monolingual setting. The CNN layers have 200 neurons each, one with filter size 2 and the other with filter size 3, both with ReLU activation. Both BiLSTM layers have 300 neurons, dropout of 0.5, and recurrent dropout of 0.2. We use the Adam optimizer for all settings. Figure 1 shows the whole architecture for MTL. The model architecture for the standard setting and TRL is the same, excluding the auxiliary components.
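Putting these pieces together, the MTL variant can be sketched roughly as below. This is a hedged Keras sketch, not the authors' code: the layer sizes follow the hyperparameters above, but the exact wiring of the auxiliary heads and the shape of the arc-prediction output are our reading of the description.

```python
from tensorflow.keras import layers, Model

def build_mtl_model(seq_len, emb_dim, n_dep_tags, n_mwe_tags):
    """Sketch of the MTL tagger: shared CNNs + BiLSTM, auxiliary
    dependency outputs, and a task-specific BiLSTM for MWE tags."""
    inp = layers.Input(shape=(seq_len, emb_dim))  # ELMo (+ POS) features
    # Parallel CNNs with filter sizes 2 and 3, shared by both tasks.
    c2 = layers.Conv1D(200, 2, padding="same", activation="relu")(inp)
    c3 = layers.Conv1D(200, 3, padding="same", activation="relu")(inp)
    shared = layers.Concatenate()([c2, c3])
    # First BiLSTM layer, also shared between the two tasks.
    shared = layers.Bidirectional(
        layers.LSTM(300, return_sequences=True,
                    dropout=0.5, recurrent_dropout=0.2))(shared)
    # Auxiliary supervision: dependency arcs and labels predicted
    # from the shared lower layers.
    dep_arc = layers.Dense(seq_len + 1, activation="softmax", name="dep_arc")(shared)
    dep_tag = layers.Dense(n_dep_tags, activation="softmax", name="dep_tag")(shared)
    # Task-specific BiLSTM before the main MWE output layer.
    mwe = layers.Bidirectional(
        layers.LSTM(300, return_sequences=True,
                    dropout=0.5, recurrent_dropout=0.2))(shared)
    mwe = layers.Dense(n_mwe_tags, activation="softmax", name="mwe")(mwe)
    model = Model(inp, [mwe, dep_arc, dep_tag])
    # The main loss is simply added to the two auxiliary losses.
    model.compile(optimizer="adam",
                  loss={"mwe": "categorical_crossentropy",
                        "dep_arc": "categorical_crossentropy",
                        "dep_tag": "categorical_crossentropy"},
                  loss_weights={"mwe": 1.0, "dep_arc": 1.0, "dep_tag": 1.0})
    return model
```

For the standard (STL) and TRL settings, the same network minus the two auxiliary heads would apply.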

Evaluation
In the MTL setting, we compare the case in which the model is trained only on MWE tags (single-task, STL) with the case in which it is jointly trained to predict MWE and dependency parsing tags (MTL). We also compare the results of joint prediction with the case in which dependency information is directly fed as additional input. In the TRL setting, we first train our model on the German data, which has 6,734 training sequences. 2 We finally compare the results from TRL with all other results.
We evaluate the models using F1-score in two settings: 1) strict matching (MWE-based), in which all components of an MWE are considered as a unit that should be correctly classified; and 2) fuzzy matching (token-based), in which any correctly predicted token is counted.
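To make the two measures concrete, here is a simplified scorer over gold and predicted MWEs, each MWE given as a set of token positions. This is a rough approximation for illustration, not the official shared-task evaluation script:

```python
def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def evaluate(gold_mwes, pred_mwes):
    """gold_mwes / pred_mwes: per-sentence lists of MWEs,
    each MWE a frozenset of token positions."""
    mwe_tp = mwe_fp = mwe_fn = 0
    tok_tp = tok_fp = tok_fn = 0
    for gold, pred in zip(gold_mwes, pred_mwes):
        gold_set, pred_set = set(gold), set(pred)
        # Strict (MWE-based): an MWE counts only if all its components match.
        mwe_tp += len(gold_set & pred_set)
        mwe_fp += len(pred_set - gold_set)
        mwe_fn += len(gold_set - pred_set)
        # Fuzzy (token-based): any correctly predicted token counts.
        gold_toks = set().union(*gold) if gold else set()
        pred_toks = set().union(*pred) if pred else set()
        tok_tp += len(gold_toks & pred_toks)
        tok_fp += len(pred_toks - gold_toks)
        tok_fn += len(gold_toks - pred_toks)
    return f1(mwe_tp, mwe_fp, mwe_fn), f1(tok_tp, tok_fp, tok_fn)
```

For a discontinuous two-token MWE of which only one token is predicted, the strict score counts a miss while the fuzzy score still gives partial credit, which is why the two measures can diverge sharply on data with many discontinuous MWEs.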

Results
The results are reported in Table 1. We report the average F1-score over five separate runs along with standard deviation. The first two rows show the baseline results when we use the neural model in the standard setting. For the second row, we use dependency parsing tags as well as ELMo and POS tags for the input to the system.
In the third and fourth rows (MTL), we observe that the results improve when dependency parse information is predicted as auxiliary output. In particular, we observe these improvements when adding the dependency loss outputs one layer before the outermost BiLSTM. We also see that adding POS to the input is not necessarily effective in the MTL setting (i.e. according to the third row, the MTL setting without POS performs better). Our best MTL system outperforms the systems that participated in the open track of the PARSEME shared task for the English data. However, it performs slightly worse than the neural system proposed by Rohanian et al. (2019), which deals with discontinuous MWEs using graph convolutional networks and an attention mechanism.
The models are trained on Google Colab with a Tesla K80 GPU (2496 CUDA cores, compute capability 3.7, 12 GB of GDDR5 VRAM). While the MTL model might seem complicated, it adds little to the training time: it takes, on average, 45 minutes to train the MTL model compared to 43 minutes for STL, both for 100 epochs.
The performance of TRL is only slightly better than STL and lower than MTL. This is not surprising: the ELMo vectors used as input to all the models are pre-trained on a huge amount of data and already bring substantial knowledge to the low-resource setting. To probe TRL further, we hypothesise a scenario in which we do not have access to large amounts of data and avoid using ELMo as input. We perform a preliminary experiment with a randomly initialised embedding layer as the first component of the network, trained jointly with the other layers. We report the results of this experiment in Table 2. Since the model does not use any extensive external data, we refer to this setting as closed STL. Here the benefits of transferring the model cross-lingually are more clearly visible. Further investigation is needed to discover the limits of this approach (e.g. through the application of different language models and experimentation with other architectures of the same kind).

The Effect of Learning Rate in TRL
When transferring from the source to the target domain, the model is prone to overfitting on the new data, losing the potentially beneficial information from the high-resource model. This problem is sometimes referred to as catastrophic forgetting. One way to mitigate this issue is to control the hyperparameters of the source and target language, especially by setting the learning rate so that domain adaptation occurs incrementally. Ongoing research explores various regularisation and ensemble methods to preserve and transfer knowledge between tasks (Chronopoulou et al., 2019; Lee et al., 2017; Rusu et al., 2016). These methods, however, introduce varying degrees of computational complexity.
Even though the sensitivity of TRL to the learning rate is widely acknowledged in the literature, previous work is inconclusive as to which learning rate schedule achieves the best result. Bowman et al. (2015) lower the starting learning rates after transfer in order to preserve pre-transfer information early in training. Kocmi and Bojar (2018), however, found that, in TRL between language pairs in neural machine translation, changing hyperparameters from the parent to the child model harmed performance. Mou et al. (2016) set the best hyperparameters for the source task during the validation phase and transferred them to the target domain. They acknowledged that the hyperparameters can potentially become biased towards the source domain. Their conclusion was that hyperparameters are best transferred from the epoch range in which performance peaks in the source domain.
In this work we refrained from altering the learning rate, since, consistent with some of the previous work, we noticed a sharp decline in performance when changing this value.

Conclusions and Future Work
In this work we explored two neural architectures to improve identification of MWEs through learning of related linguistic tasks. 3 We experimented with cross-lingual transfer learning between two Germanic languages, and in a separate scenario, we designed and tested a multitask learning approach to tag MWEs while concurrently training on dependency arcs and labels as auxiliary tasks. Our results show that the models prove promising and outperform the standard baseline. In future we plan to study these techniques in more detail, and make extensive comparisons between them in order to understand to what extent and under what circumstances they help MWE identification.