Hypernymy Detection for Low-Resource Languages via Meta Learning

Hypernymy detection, also known as lexical entailment, is a fundamental sub-task of many natural language understanding tasks. Previous explorations mostly focus on monolingual hypernymy detection in high-resource languages, e.g., English, but few investigate low-resource scenarios. This paper addresses the problem of low-resource hypernymy detection by leveraging high-resource languages. We extensively compare three joint training paradigms and, for the first time, propose applying meta learning to relieve the low-resource issue. Experiments demonstrate the superiority of our method among the three settings: it substantially improves the performance on extremely low-resource languages by preventing over-fitting on small datasets.


Introduction
Hypernymy is a fundamental asymmetric lexicosemantic relation. It expresses the is-a relationship between concepts and is widely used to build taxonomies (Miller, 1995) or large-scale knowledge bases (Wu et al., 2012; Seitner et al., 2016). Lexicosemantic patterns (e.g., X such as Y) are generally employed to harvest benchmark datasets or resources from large English corpora due to their high precision (Hearst, 1992). However, Hearst-like patterns for English cannot be easily transferred to other languages such as Chinese. Creating high-quality hypernymy benchmarks for other languages requires much more human annotation effort, and hypernymy detection in those languages remains a low-resource task. In this paper, we focus on the question: how can we make full use of hypernymy pairs from high-resource languages such as English for other low-resource languages, e.g., Japanese and Thai?

* Work done when C. Yu and J. Han were with Tencent AI Lab.
To investigate this question, we first assume a strong feasibility of semantic relation transfer across languages, which is in line with existing findings on human cognition. Youn et al. (2016) uncovered a universal conceptual structure of human lexical semantics across cross-lingual dictionaries and revealed the language-independent nature of the semantic similarity of concepts. Wang et al. (2019) studied cross-lingual training by simply merging high-resource language pairs with low-resource ones, which is prone to over-fitting to the low-resource ones. Based on the above findings and the datasets in Wang et al. (2019), we study three training paradigms for combining training data from multiple languages, i.e., cross-lingual training, multilingual training, and meta learning.
To the best of our knowledge, meta-learning algorithms have not previously been applied to hypernymy detection. We propose applying meta learning to low-resource hypernymy detection and perform extensive comparisons with multilingual training. Meta-learning algorithms aim at learning language-independent models that can then be fine-tuned on multiple languages with minimal training instances. In our experiments, we further explore the following two questions:
• Considering the language-agnostic nature of lexical semantics, would multilingual training improve performance by providing additional regularization?
• Regarding the effectiveness of meta learning in low-resource scenarios (Dou et al., 2019), can we leverage meta learning to help multilingual training?
The results for question 1 are surprising. No obvious improvement is observed from either bilingual cross-lingual training or multilingual training. The performance even drops on extremely low-resource languages, as the models easily over-fit the low-resource language datasets. Meta-learning algorithms, on the other hand, significantly alleviate these cases by learning a good model initialization for all languages. In the end, meta learning achieves the best performance among the three training paradigms, which answers the main questions of this work.

Training Settings
In this section, we first introduce the base supervised model for hypernymy detection, and then illustrate three joint training paradigms.

Base Model
As discussed in Section 1, pattern-based models are highly language-dependent and cannot generalize to arbitrary languages. We resort to supervised distributional models as base models, which take the distributional representations of terms as input features to train hypernymy relation classifiers (Roller et al., 2014; Yu et al., 2015; Rei et al., 2018). Fortunately, pre-trained distributional vectors (e.g., fastText word embeddings (Bojanowski et al., 2017)) are widely available for most languages.
Formally, given a pair of terms (x, y) in one language, we denote the corresponding word vectors by x and y. The hypernymy detection models learn a classifier f θ to make binary prediction, where the input features could be the concatenation, difference, or other complex combinations of x and y.
To keep the base model simple and effective, we directly concatenate the two vectors and train a two-layer MLP, i.e., f_θ(x ⊕ y) = MLP(x ⊕ y). Note that the performance of this base model is comparable to that reported in Wang et al. (2019), even without feature extractors and self-training.
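The base classifier described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the ReLU activation, sigmoid output, and random parameter initialization are our assumptions; only the input concatenation x ⊕ y, the 300-d fastText vectors, and the 400-unit hidden layer come from the paper.

```python
import numpy as np

def mlp_forward(x, y, W1, b1, W2, b2):
    """Score a candidate hypernymy pair from its word vectors.

    x, y : 300-d word vectors of the candidate hyponym/hypernym.
    W1, b1, W2, b2 : parameters of the two-layer MLP (hidden size 400).
    Returns the estimated probability that (x, y) is a hypernymy pair.
    """
    h = np.concatenate([x, y])           # input feature: x ⊕ y (600-d)
    h = np.maximum(0.0, W1 @ h + b1)     # hidden layer (ReLU is our choice)
    logit = W2 @ h + b2                  # scalar logit
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid -> P(hypernymy)

# toy usage with random parameters (dimensions follow the paper:
# 300-d embeddings, 400 hidden units)
rng = np.random.default_rng(0)
x, y = rng.normal(size=300), rng.normal(size=300)
W1, b1 = rng.normal(scale=0.01, size=(400, 600)), np.zeros(400)
W2, b2 = rng.normal(scale=0.01, size=400), 0.0
p = mlp_forward(x, y, W1, b1, W2, b2)
```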

Joint Models
Cross-lingual Training. Following the setting of Wang et al. (2019), cross-lingual hypernymy detection aims to predict hypernymy for low-resource language pairs by combining large training data from high-resource languages. Specifically, in our case, English is the only high-resource language. Therefore, we train a joint model on the mixture of our large English dataset and the small dataset of another language such as Japanese. Due to the different representation spaces of languages, word translation techniques are required to transfer knowledge and align the feature spaces across languages. We adopt the
technique of Conneau et al. (2017) to learn a mapping matrix W_{l-en} that projects the word embedding space of language l onto that of English. The input feature to the classifier f_θ for language l is then (W_{l-en}x, W_{l-en}y). The quality of the translation matrix W_{l-en} strongly affects transfer performance, so we carefully choose the best mapping according to the evaluation on bilingual word translation benchmarks. Detailed results are omitted due to limited space.

Multilingual Training. Instead of training on a pair of languages, multilingual training combines all available pairs in any language. Glavaš and Vulić (2018) have shown that semantic relation classification tasks benefit from the additional regularization resulting from multilingual training. We investigate whether multilingual training for low-resource hypernymy detection can learn a model with better generalization ability across all languages. Due to the language-independent structure of semantic relations, the interaction among datasets of all languages imports more external knowledge than cross-lingual training. However, the limited number of training instances for low-resource languages may make the model over-fit easily and hurt generalization. We analyze this question thoroughly in the following experiments.

Meta Learning. Inspired by low-resource machine translation (Gu et al., 2018) and general language representation learning (Dou et al., 2019), we propose applying meta-learning algorithms to hypernymy detection. We first learn language-independent models based on multiple high-resource languages and then adapt them to low-resource language pairs. Here we adopt the most representative model-agnostic meta-learning (MAML) algorithm (Finn et al., 2017). Formally, given the base model f_θ with parameters θ, we denote training on each language l as task T_l.
For each task (language) T_l, we sample a batch of data as the support set T_l(S) and another batch as the query set T_l(Q). During the meta-training stage, we randomly sample L tasks {T_1, T_2, ..., T_L}, and then update the model parameters by k gradient steps for each task T_l:

θ'_l = θ − α ∇_θ L_{T_l(S)}(f_θ).

Here L_{T_l(S)} is the loss function for task T_l on its support set and α is the inner learning rate. The overall objective function for meta learning is min_θ Σ_{l=1}^{L} L_{T_l(Q)}(f_{θ'_l}). Hence the model parameters are updated by:

θ ← θ − β ∇_θ Σ_{l=1}^{L} L_{T_l(Q)}(f_{θ'_l}),

where β is the learning rate for meta learning. The overall meta-learning procedure is formulated in Algorithm 1. After n steps of meta-learning iterations, we use several small batches of data from each language to fine-tune the model parameters θ.

Compared with multilingual training in Section 2.2, meta-learning algorithms have the same input but different learning procedures, i.e., parameter updating strategies. Instead of simply merging all the high-resource and low-resource datasets to learn a joint model, meta learning learns a good initialization for all languages that can then be adapted to one specific language. An obvious advantage of such a universal initialization is that it avoids the case where the model favors high-resource languages, as can happen in multilingual training (Dou et al., 2019).
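The update rules above can be sketched with a first-order approximation of MAML (the full algorithm also differentiates through the inner gradient steps). The toy tasks below, in which each "language" reduces to estimating the mean of its own data, are a hypothetical stand-in for the real hypernymy classifiers; the hyperparameters α = 0.001, k = 5, β = 0.5 follow the paper's setup.

```python
import numpy as np

def inner_update(theta, task, alpha=0.001, k=5):
    """Adapt theta with k SGD steps on the task's support set."""
    for _ in range(k):
        theta = theta - alpha * task["grad"](theta, task["support"])
    return theta

def maml_step(theta, tasks, alpha=0.001, beta=0.5):
    """One meta step (first-order MAML approximation): adapt on each
    task's support set, then update theta using the query-set
    gradients taken at the adapted parameters."""
    meta_grad = 0.0
    for task in tasks:
        theta_l = inner_update(theta, task, alpha)
        meta_grad += task["grad"](theta_l, task["query"])
    return theta - beta * meta_grad / len(tasks)

# toy "languages": each task estimates the mean of its own data,
# with loss 0.5 * (theta - x)^2 and gradient theta - mean(x)
rng = np.random.default_rng(0)
def make_task(center):
    return {"support": center + rng.normal(scale=0.1, size=8),
            "query": center + rng.normal(scale=0.1, size=8),
            "grad": lambda th, data: th - data.mean()}

tasks = [make_task(1.0), make_task(-1.0)]
theta = 5.0
for _ in range(20):
    theta = maml_step(theta, tasks)
# theta moves toward an initialization that adapts well to both tasks
```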

Experimental Setup
We conduct experiments on the hypernymy detection datasets of several languages from Wang et al. (2019). The English dataset combines five commonly-used benchmarks; we refer to Wang et al. (2019) for the description of data construction. We further categorize the seven low-resource datasets into moderately low-resource ones (FR, ZH, FI, IT) and extremely low-resource ones (TH, JA, EL) according to relative dataset sizes. The statistics of all datasets are shown in Table 1.

For all three joint training paradigms, we randomly split the non-English language datasets into 20% for training, 20% for development, and 60% for testing, following Wang et al. (2019). For English we also hold out a 20% development set for model selection. Word embeddings for each language are taken from pre-trained fastText word vectors with dimension 300. We report the averaged accuracy of 5-fold cross-validation for low-resource languages. For the three joint models, we uniformly run 5,000 steps and select the best model for each language based on its development set. The hidden layer size of the base model is set to 400. We use vanilla SGD to optimize the meta learner with batch size 32 and learning rate β = 0.5. We set the number of sampled tasks L in each step to 8, the number of update steps k to 5, and the inner learning rate α to 0.001. Our code is available at https://github.com/ccclyu/metaHypernymy.
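A minimal sketch of the 20%/20%/60% split protocol described above; the function name and seed handling are our own, not from the released code.

```python
import random

def split_low_resource(pairs, seed=0):
    """Split a low-resource language's labeled pairs into
    20% train / 20% dev / 60% test, per the paper's protocol."""
    rng = random.Random(seed)
    pairs = pairs[:]          # avoid mutating the caller's list
    rng.shuffle(pairs)
    n = len(pairs)
    n_train, n_dev = int(0.2 * n), int(0.2 * n)
    return (pairs[:n_train],
            pairs[n_train:n_train + n_dev],
            pairs[n_train + n_dev:])

train, dev, test = split_low_resource(list(range(100)))
```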

Experimental Results
In Table 2, we demonstrate the main results of all training paradigms. Empirically, we answer the two questions raised in Section 1.

Do simple joint multilingual models work? In the first row, we report the performance of the base monolingual model on all seven low-resource languages, denoted by "Mono". On top of it, cross-lingual training (or bilingual training, denoted by "Cross") obtains marginal improvements for moderately low-resource languages. However, the performance drops dramatically for the two extremely low-resource languages, i.e., JA from 0.740 to 0.711 and EL from 0.702 to 0.684. We note that data sparsity leads to over-fitting and thus poor generalization. Similar observations can be drawn from multilingual training ("Multi" for short). In summary, for extremely low-resource datasets, more effective and advanced joint training is needed.
Is meta learning better than multilingual training? As discussed in Section 2.2, simple multilingual training and meta learning have the same input. But our experiments indicate that even the model initialized by meta learning (not fine-tuned, denoted by "zeroMeta" in Table 2) achieves superior performance. For example, on Thai, the accuracy jumps from 0.657 to 0.702 without fine-tuning. After fine-tuning with several batches of data, meta learning (denoted by "Finetune") achieves the best performance for all low-resource languages.

To fully understand the difference between the two training paradigms, we use the same batch size and run the two joint training models for 5,000 steps. Figure 1 shows the loss curve on the development set for each low-resource language as well as English. We make two major observations: 1) Both joint training paradigms fit English, the high-resource dataset, well, but multilingual training converges quickly and then over-fits severely on the extremely low-resource datasets (indicated by bold lines in Figure 1a), which results in dropping performance. Meta learning, instead, shows a relatively stable descending loss. For EL (the purple bold line in Figure 1b), though the loss first increases, it finally decreases and reaches a lower level. 2) The converged dev losses of meta learning reach lower values and have lower variance across languages. This demonstrates that meta learning learns a language-independent model/initialization that is helpful for fine-tuning rather than over-fitting to particular languages.

Discussion
Our experiments rely on good word representations and bilingual lexicon induction methods. However, their quality impacts the results considerably, which we briefly discuss below.

Transferability of Word Vector Space. One limitation of the training paradigms in our work is non-isomorphic embedding spaces, which are largely caused by the intrinsic properties of dissimilar languages. The projection matrix W_{l-en} is learned in an unsupervised manner based on the strong assumption that the embedding spaces of the two languages are isometric, i.e., similar in terms of structure (Vulić et al., 2020). However, this assumption does not always hold when generalizing to more low-resource languages. In practice, it would be necessary to carefully quantify the isomorphism between two word vector spaces and adopt approaches that relax the isomorphism assumption (Patra et al., 2019).
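As a concrete illustration of the isometry assumption, the supervised (Procrustes) refinement step used in such mapping methods admits a closed-form orthogonal solution via SVD. This sketch is ours; it is not the full unsupervised procedure of Conneau et al. (2017), which first induces a seed dictionary adversarially before applying this refinement.

```python
import numpy as np

def procrustes_map(X, Y):
    """Closed-form orthogonal mapping W minimizing ||W X - Y||_F,
    given matched embedding matrices X (source language) and Y
    (English), each of shape (dim, n_pairs). The solution is
    W = U V^T from the SVD of Y X^T, and it is exact only under
    the isometry assumption discussed above."""
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt

# sanity check: recover a known orthogonal map from noiseless data
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 50))                  # 5-d "embeddings", 50 pairs
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # a random orthogonal map
W = procrustes_map(X, Q @ X)
```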

Contextualized Word Representation (CWR).
Replacing static word vectors with CWRs such as ELMo (Peters et al., 2018) or BERT (Devlin et al., 2019) has achieved dominant performance on almost every NLP task. Ethayarajh (2019) shows that principal component embeddings of CWRs from the lower layers of BERT outperform GloVe and fastText on many static embedding benchmarks such as word similarity and analogy. However, it remains unclear how to make full use of CWRs for lexical semantic tasks. We are also interested in whether zero-shot multilingual CWR pre-training such as Multilingual BERT (Pires et al., 2019) would benefit this task. Another promising direction is to elicit lexical knowledge from large pre-trained language models (Bosselut et al., 2019; Petroni et al., 2019). We leave these directions for future work.

Related Work
Cross-Lingual Hypernymy Detection. Wang et al. (2019) first studied hypernymy detection in multilingual joint settings. Other similar tasks intend to predict whether a pair of words from two different languages exhibits a hypernymy relationship (Vyas and Carpuat, 2016; Upadhyay et al., 2018) or to what extent the relationship holds. In this work, we focus on the former task.

Meta Learning. Also known as learning to learn, meta learning aims at developing models that can learn new tasks, or adapt to them, with only a few training examples. It has recently attracted much attention due to simple yet effective models such as MAML (Finn et al., 2017) and Reptile (Nichol et al., 2018).
There are emerging investigations of applying meta learning to NLP tasks such as machine translation (Gu et al., 2018), semantic parsing (Huang et al., 2018), personalized dialogue systems (Madotto et al., 2019), relation classification (Obamuyide and Vlachos, 2019), and code-switched speech recognition (Winata et al., 2020). Our work is inspired by Dou et al. (2019), which compares multi-task learning and meta learning for general language representations.

Conclusion
Transferring lexical knowledge across languages is important, especially in low-resource cases. In this paper, we investigate three joint training paradigms for detecting hypernymy in low-resource languages. We show that simple multilingual training is not helpful for all tasks, and that meta learning significantly improves performance. Our study demonstrates the feasibility and effectiveness of combining high- and low-resource data to jointly train hypernymy detection models.