Cross-lingual Lexical Sememe Prediction

Sememes are defined as the minimum semantic units of human languages. As important knowledge sources, sememe-based linguistic knowledge bases have been widely used in many NLP tasks. However, most languages still do not have sememe-based linguistic knowledge bases. Thus we present the task of cross-lingual lexical sememe prediction, which aims to automatically predict sememes for words in other languages. We propose a novel framework that models the correlations between sememes and multi-lingual words in a low-dimensional semantic space for sememe prediction. Experimental results on real-world datasets show that our proposed model achieves consistent and significant improvements compared to baseline methods in cross-lingual sememe prediction. The code and data of this paper are available at https://github.com/thunlp/CL-SP.


Introduction
Words are regarded as the smallest meaningful units of speech or writing that can stand by themselves in human languages, but they are not the smallest indivisible units of meaning. That is, the meaning of a word can be represented as a set of semantic components. For example, "Man = human + male + adult" and "Boy = human + male + child". In linguistics, the minimum semantic unit of meaning is named a sememe (Bloomfield, 1926). It is believed that the semantic meanings of concepts such as words can be composed from a limited, closed set of sememes, and that sememes can help us comprehend human languages better.
Unfortunately, the lexical sememes of words are not explicit in most human languages. Hence, people construct sememe-based linguistic knowledge bases (KBs) by manually annotating every word with a pre-defined closed set of sememes. HowNet (Dong and Dong, 2003) is one of the most well-known sememe-based linguistic KBs. Different from WordNet (Miller, 1995), which focuses on the relations between senses, it annotates each word with one or more relevant sememes. As illustrated in Fig. 1, the word apple has two senses in HowNet, apple (fruit) and apple (brand). The sense apple (fruit) has one sememe fruit, and the sense apple (brand) has five sememes: computer, PatternValue, able, bring and SpecificBrand. There are about 2,000 sememes and over 100 thousand labeled Chinese and English words in HowNet. HowNet has been widely used in various NLP applications such as word similarity computation (Liu and Li, 2002), word sense disambiguation (Zhang et al., 2005), question classification (Sun et al., 2007) and sentiment classification (Dang and Zhang, 2010). However, most languages do not have such sememe-based linguistic KBs, which prevents us from understanding and utilizing human languages to a greater extent. Therefore, it is important to build sememe-based linguistic KBs for various languages. Manual construction of sememe-based linguistic KBs requires the efforts of many linguistic experts, which is time-consuming and labor-intensive. For example, the construction of HowNet took many Chinese linguistic experts more than 10 years.
To address the issue of the high labor cost of manual annotation, we propose a new task, cross-lingual lexical sememe prediction (CLSP), which aims to automatically predict lexical sememes for words in other languages and thereby assist linguistic experts in annotation. There are two critical challenges for CLSP: (1) There is no consistent one-to-one match between words in different languages. For example, the English word "beautiful" can refer to either of the Chinese words "美丽" and "漂亮". Hence, we cannot simply translate HowNet into another language, and how to recognize the semantic meaning of a word in another language becomes a critical problem. (2) Since there is a gap between the semantic meanings of words and sememes, we need to build semantic representations for words and sememes to capture the semantic relatedness between them.
To tackle these challenges, in this paper, we propose a novel model for CLSP, which aims to transfer a sememe-based linguistic KB from a source language to a target language. Our model contains three modules: (1) monolingual word embedding learning, which is intended for learning semantic representations of words for the source and target languages respectively; (2) cross-lingual word embedding alignment, which aims to bridge the gap between the semantic representations of words in the two languages; (3) sememe-based word embedding learning, whose objective is to incorporate sememe information into word representations. For simplicity, we do not consider the hierarchy information in HowNet in this paper.
In experiments, we take Chinese as the source language and English as the target language to show the effectiveness of our model. Experimental results show that our proposed model can effectively predict lexical sememes for words with different frequencies in the target language. Our model also achieves consistent improvements on two auxiliary experiments, bilingual lexicon induction and monolingual word similarity computation, by jointly learning the representations of sememes and of words in the source and target languages.

Related Work
Since HowNet was published (Dong and Dong, 2003), it has attracted wide attention from researchers. Most related works focus on applying HowNet to specific NLP tasks (Liu and Li, 2002; Zhang et al., 2005; Sun et al., 2007; Dang and Zhang, 2010; Fu et al., 2013; Niu et al., 2017; Zeng et al., 2018; Gu et al., 2018). To the best of our knowledge, only Xie et al. (2017) and Jin et al. (2018) conduct studies on augmenting HowNet by recommending sememes for new words. However, both works aim to recommend sememes for monolingual words and are not applicable to the cross-lingual setting. Accordingly, our work is the first effort to automatically perform cross-lingual sememe prediction to enrich sememe-based linguistic KBs.
Our model builds on word representation learning (WRL). Recent years have witnessed great advances in WRL: models like Skip-gram, CBOW (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014) are immensely popular and achieve remarkable performance in many NLP tasks. However, most WRL methods learn only the distributional information of words from large corpora, while the valuable information contained in semantic lexicons is disregarded. Therefore, some works try to inject the semantic information of KBs into WRL (Faruqui et al., 2015; Mrkšić et al., 2016; Bollegala et al., 2016). Nevertheless, these works all target word-based KBs such as WordNet; few works pay attention to how to incorporate the knowledge from sememe-based linguistic KBs.
For our cross-lingual sememe prediction task, bilingual WRL methods based on parallel data are unsuitable because most language pairs have no large parallel corpora. Unsupervised methods are not appropriate either, as they generally struggle to learn high-quality bilingual word embeddings. Therefore, we choose the seed lexicon method in our model, and further introduce a matching mechanism inspired by Zhang et al. (2017) to enhance its performance.

Methodology
In this section, we introduce our novel model for CLSP. Here we define the language with sememe annotations as the source language and the language without sememe annotations as the target language. The main idea of our model is to jointly learn word embeddings of the source and target languages in a unified semantic space, and then predict sememes for words in the target language according to the words with similar semantic meanings in the source language.
Our method consists of three parts: monolingual word representation learning, cross-lingual word embedding alignment and sememe-based word representation learning. Hence, we define the objective function of our method as the sum of three corresponding terms:

$$L = L_{mono} + L_{cross} + L_{sememe}. \quad (1)$$

Here, the monolingual term $L_{mono}$ is designed for learning monolingual word embeddings from non-parallel corpora for the source and target languages respectively. The cross-lingual term $L_{cross}$ aims to align cross-lingual word embeddings in a unified semantic space. And $L_{sememe}$ draws sememe information into word representation learning and conduces to better word embeddings for sememe prediction. In the following subsections, we introduce the three parts in detail.

Monolingual Word Representation
Monolingual word representation is responsible for capturing regularities in the monolingual corpora of the source and target languages. Since the two corpora are non-parallel, $L_{mono}$ comprises two monolingual sub-models that are independent of each other:

$$L_{mono} = L_{mono}^{S} + L_{mono}^{T}, \quad (2)$$

where the superscripts $S$ and $T$ denote the source and target languages respectively. As a common practice, we choose the well-established Skip-gram model to obtain monolingual word embeddings. The Skip-gram model maximizes the predictive probability of context words conditioned on the centered word. Formally, taking the source side as an example, given a training word sequence $\{w_1^S, \cdots, w_n^S\}$, the Skip-gram model intends to minimize:

$$L_{mono}^{S} = -\sum_{c=K+1}^{n-K} \sum_{\substack{-K \le k \le K \\ k \ne 0}} \log P(w_{c+k}^S \mid w_c^S), \quad (3)$$

where $K$ is the size of the sliding window. $P(w_{c+k}^S \mid w_c^S)$ stands for the predictive probability of one of the context words conditioned on the centered word $w_c^S$, formalized by the following softmax function:

$$P(w_{c+k}^S \mid w_c^S) = \frac{\exp(\mathbf{w}_{c+k}^S \cdot \mathbf{w}_c^S)}{\sum_{w_s^S \in V^S} \exp(\mathbf{w}_s^S \cdot \mathbf{w}_c^S)}, \quad (4)$$

in which $V^S$ indicates the word vocabulary of the source language. $L_{mono}^{T}$ can be formulated similarly.
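As a concrete illustration, the Skip-gram softmax above can be sketched in a few lines of NumPy. The vocabulary size, embedding dimension and random embeddings here are toy values for illustration, not the paper's settings.

```python
import numpy as np

# Toy sketch of the Skip-gram softmax P(w_{c+k} | w_c): the probability of a
# context word given the centered word, computed from dot products of word
# embeddings normalized over the whole vocabulary.
rng = np.random.default_rng(0)
V, d = 10, 4                       # toy vocabulary size and embedding dimension
E = rng.normal(size=(V, d))        # one embedding per word (single role, for brevity)

def skipgram_prob(center: int, context: int) -> float:
    """exp(w_context . w_center) normalized over the vocabulary."""
    scores = E @ E[center]
    scores -= scores.max()         # subtract max for numerical stability
    p = np.exp(scores)
    p /= p.sum()
    return float(p[context])

probs = [skipgram_prob(0, w) for w in range(V)]
```

In practice Skip-gram avoids evaluating this full softmax by using negative sampling, as noted in the training section.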

Cross-lingual Word Embedding Alignment
Cross-lingual word embedding alignment aims to build a unified semantic space for the words in the source and target languages. Inspired by Zhang et al. (2017), we align the cross-lingual word embeddings with the signals of a seed lexicon and self-matching. Formally, $L_{cross}$ is composed of two terms, alignment by seed lexicon $L_{seed}$ and alignment by matching $L_{match}$:

$$L_{cross} = \lambda_s L_{seed} + \lambda_m L_{match}, \quad (5)$$

where $\lambda_s$ and $\lambda_m$ are hyperparameters controlling the relative weightings of the two terms.

Alignment by Seed Lexicon
The seed lexicon term $L_{seed}$ encourages the word embeddings of translation pairs in a seed lexicon $D$ to be close, which can be achieved via an $L_2$ regularizer:

$$L_{seed} = \sum_{(w_s^S, w_t^T) \in D} \|\mathbf{w}_s^S - \mathbf{w}_t^T\|^2, \quad (6)$$

in which $w_s^S$ and $w_t^T$ indicate the source-language and target-language words of a pair in the seed lexicon respectively.
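A minimal sketch of this term, using toy embeddings and an index-based lexicon (an illustration, not the paper's implementation):

```python
import numpy as np

def seed_lexicon_loss(E_src, E_tgt, lexicon):
    """Sum of squared distances between the embeddings of each
    translation pair (s, t) in the seed lexicon D."""
    return sum(float(np.sum((E_src[s] - E_tgt[t]) ** 2)) for s, t in lexicon)

# Toy example: the pair (1, 1) differs by one unit in the second dimension,
# so the total loss is exactly 1.0.
E_src = np.array([[1.0, 0.0], [0.0, 1.0]])
E_tgt = np.array([[1.0, 0.0], [0.0, 2.0]])
loss = seed_lexicon_loss(E_src, E_tgt, [(0, 0), (1, 1)])
```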

Alignment by Matching Mechanism
The matching process is founded on the assumption that each target word should be matched to a single source word or a special empty word, and vice versa. The goal of the matching process is to find the matched source (target) word for each target (source) word and maximize the matching probabilities of all the matched word pairs. The loss of this part can be formulated as:

$$L_{match} = L_{match}^{T2S} + L_{match}^{S2T}, \quad (7)$$

where $L_{match}^{T2S}$ is the term for target-to-source matching and $L_{match}^{S2T}$ is the term for source-to-target matching.
Next, we give a detailed explanation of target-to-source matching; source-to-target matching is defined in the same way. We first introduce a latent variable $m_t \in \{0, 1, \cdots, |V^S|\}$ ($t = 1, 2, \cdots, |V^T|$) for each target word $w_t^T$, where $|V^S|$ and $|V^T|$ indicate the vocabulary sizes of the source and target languages respectively. Here, $m_t$ specifies the index of the source word that $w_t^T$ matches with, and $m_t = 0$ signifies that the empty word is matched. Then we have $\mathbf{m} = \{m_1, m_2, \cdots, m_{|V^T|}\}$, and can formalize the target-to-source matching term:

$$L_{match}^{T2S} = -\log P(C^T \mid C^S) = -\log \sum_{\mathbf{m}} P(C^T, \mathbf{m} \mid C^S), \quad (8)$$

where $C^T$ and $C^S$ denote the target and source corpora respectively. Here, we simply assume that the matching processes of target words are independent of each other. Therefore, we have:

$$P(C^T, \mathbf{m} \mid C^S) = \prod_{t=1}^{|V^T|} P(w_t^T \mid w_{m_t}^S)^{c(w_t^T)}, \quad (9)$$

where $w_{m_t}^S$ is the source word that $w_t^T$ matches with, and $c(w_t^T)$ is the number of times $w_t^T$ occurs in the target corpus.

Sememe-based Word Representation
Sememe-based word representation is intended to improve word embeddings for sememe prediction by introducing the information of the sememe-based linguistic KB of the source language. In this section, we present two methods of sememe-based word representation.

Word Relation-based Approach
A simple and intuitive method is to encourage words with similar sememe annotations to have similar word embeddings, which we name the word relation-based approach. To begin with, we construct a synonym list from the sememe-based linguistic KB of the source language, where we regard words sharing a certain number of sememes as synonyms. Next, we force synonyms to have closer word embeddings.
Formally, we let $\mathbf{w}_i^S$ be the original word embedding of $w_i^S$ and $\hat{\mathbf{w}}_i^S$ be its adjusted word embedding, and let $Syn(w_i^S)$ denote the synonym set of the word $w_i^S$. Then the loss function is:

$$L_{sememe} = \sum_{i=1}^{|V^S|} \Big[ \alpha \|\hat{\mathbf{w}}_i^S - \mathbf{w}_i^S\|^2 + \beta \sum_{w_j^S \in Syn(w_i^S)} \|\hat{\mathbf{w}}_i^S - \hat{\mathbf{w}}_j^S\|^2 \Big], \quad (10)$$

where $\alpha$ and $\beta$ control the relative strengths of the two terms.
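A loss of this form admits a simple closed-form coordinate update, analogous to retrofitting (Faruqui et al., 2015): setting the gradient with respect to one adjusted embedding to zero gives a weighted average of its original embedding and its synonyms' current embeddings. A sketch under toy settings:

```python
import numpy as np

def retrofit(E, synonyms, alpha=1.0, beta=1.0, iters=10):
    """Iteratively pull each word toward its synonyms while keeping it
    close to its original embedding. `synonyms` maps a word index to a
    list of synonym indices; words without an entry are left untouched."""
    E_hat = E.copy()
    for _ in range(iters):
        for i, syn in synonyms.items():
            if not syn:
                continue
            # Closed-form minimizer of the loss w.r.t. E_hat[i] alone.
            E_hat[i] = (alpha * E[i] + beta * E_hat[syn].sum(axis=0)) \
                       / (alpha + beta * len(syn))
    return E_hat

# Words 0 and 1 are synonyms; word 2 has no synonyms and stays put.
E = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
E_hat = retrofit(E, {0: [1], 1: [0]})
```

After a few iterations the two synonyms end up closer than their original embeddings were, while still anchored near their originals.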
It should be noted that the idea of forcing similar words to have close word embeddings is similar to the state-of-the-art retrofitting approach (Faruqui et al., 2015). However, the retrofitting approach cannot be applied here directly because sememe-based linguistic KBs such as HowNet do not directly provide the synonym list it requires.

Sememe Embedding-based Approach
Simple and effective as the word relation-based approach is, it cannot make full use of the information in sememe-based linguistic KBs because it disregards the complicated relations between sememes and words, as well as the relations between different sememes. To address this limitation, we propose the sememe embedding-based approach, which learns sememe and word embeddings jointly.
In this approach, we represent sememes with distributed vectors as well and place them in the same semantic space as words. Similar to SPSE (Xie et al., 2017), which learns sememe embeddings by decomposing the word-sememe matrix and the sememe-sememe matrix, our method utilizes sememe embeddings as regularizers to learn better word embeddings. Different from SPSE, we do not use pre-trained word embeddings; instead, we learn word embeddings and sememe embeddings simultaneously.
More specifically, from HowNet we can extract a source-side word-sememe matrix $M^S$, with $M_{sj}^S = 1$ indicating that the word $w_s^S$ is annotated with the sememe $x_j$, and $M_{sj}^S = 0$ otherwise. Hence, by factorizing $M^S$, we can define the loss function as:

$$L_{sememe} = \sum_{w_s^S \in V^S} \sum_{x_j \in X} \big( \mathbf{w}_s^S \cdot \mathbf{x}_j + b_s + b_j' - M_{sj}^S \big)^2, \quad (11)$$

where $b_s$ and $b_j'$ are the biases of $w_s^S$ and $x_j$, and $X$ denotes the sememe set.
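The factorization objective can be sketched directly (toy matrices; a real implementation would minimize this loss by SGD, as described in the training section):

```python
import numpy as np

def sememe_factorization_loss(W, X, b_w, b_x, M):
    """Squared reconstruction error of the word-sememe indicator matrix M:
    each entry M[s, j] is approximated by w_s . x_j + b_s + b'_j."""
    pred = W @ X.T + b_w[:, None] + b_x[None, :]
    return float(((pred - M) ** 2).sum())

# Toy example that reconstructs M exactly, so the loss is zero.
W = np.array([[1.0, 0.0], [0.0, 1.0]])   # word embeddings
X = np.array([[1.0, 0.0]])               # a single sememe embedding
M = np.array([[1.0], [0.0]])             # word 0 has the sememe, word 1 does not
loss = sememe_factorization_loss(W, X, np.zeros(2), np.zeros(1), M)
```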
In this approach, we obtain word and sememe embeddings in a unified semantic space. The sememe embeddings bear all the information about the relationships between words and sememes, and they inject the information into word embeddings. Therefore, the word embeddings are expected to be more suitable for sememe prediction.

Training and Prediction
Training

When training monolingual word embeddings, we use negative sampling following Mikolov et al. (2013a). In the optimization of the sememe part, we adopt the iterative updating method of Faruqui et al. (2015) for the word relation-based approach and stochastic gradient descent (SGD) for the sememe embedding-based approach. For the optimization of the seed lexicon term of the cross-lingual part, we also apply SGD.
Nevertheless, due to the existence of the latent variable, optimization of the matching process in the cross-lingual part poses a challenge. We settle on the Viterbi EM algorithm to address this problem. Next, we again take the target-to-source side as an example and give a detailed description of the training process using the Viterbi EM algorithm.
The Viterbi EM algorithm alternates between a Viterbi E step and a subsequent M step. The Viterbi E step aims to find the most probable matched word pairs given the current parameters. Given the independence assumption, we can seek the match for each word individually. As for the parametrization of the matching probability, there are various choices; for computational simplicity, we select cosine similarity:

$$P(w_t^T \mid w_{m_t}^S) = \begin{cases} \cos(\mathbf{w}_{m_t}^S, \mathbf{w}_t^T), & m_t \ne 0, \\ \epsilon, & m_t = 0, \end{cases} \quad (12)$$

where $\epsilon$ is a hyperparameter indicating the probability of matching the empty word. Therefore, the Viterbi E step computes the matching by:

$$m_t = \arg\max_{m_t \in \{0, 1, \cdots, |V^S|\}} P(w_t^T \mid w_{m_t}^S). \quad (13)$$

From this, we can see that $\epsilon$ serves as a threshold to keep out unreliable matched pairs. The Viterbi M step performs maximization as if the latent variable had been observed in the Viterbi E step. Thus, we can treat the matched pairs as correct translations and use an $L_2$ regularizer as well. Consequently, the M step minimizes:

$$\sum_{w^S \in V^S} \sum_{w^T \in V^T} M(w^S, w^T) \, \|\mathbf{w}^S - \mathbf{w}^T\|^2, \quad (14)$$

where $M(w^S, w^T)$ indicates whether $w^S$ and $w^T$ are matched with each other in the Viterbi E step of either direction.
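The Viterbi E step amounts to a nearest-neighbor search under cosine similarity with the ε cut-off. A sketch with toy embeddings (index -1 stands in for the empty word here):

```python
import numpy as np

def viterbi_e_step(E_src, E_tgt, eps=0.5):
    """For each target word, pick the source word with maximal cosine
    similarity; match the empty word (returned as -1) whenever even the
    best candidate scores below eps."""
    S = E_src / np.linalg.norm(E_src, axis=1, keepdims=True)
    T = E_tgt / np.linalg.norm(E_tgt, axis=1, keepdims=True)
    sims = T @ S.T                   # all target-source cosine similarities
    best = sims.argmax(axis=1)
    return np.where(sims.max(axis=1) > eps, best, -1)

# Target word 0 aligns perfectly with source word 0; target word 1's best
# similarity (0.8) falls below the threshold, so it matches the empty word.
E_src = np.array([[1.0, 0.0], [0.0, 1.0]])
E_tgt = np.array([[1.0, 0.0], [0.6, 0.8]])
matches = viterbi_e_step(E_src, E_tgt, eps=0.9)
```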

Prediction
Since we assume that words with similar sememe annotations are similar and that similar words should have similar sememes, which resembles collaborative filtering in personalized recommendation, we can recommend sememes for target words according to their most similar source words. Formally, we define the score function $P(x_j \mid w_t^T)$ of the sememe $x_j$ given a target word $w_t^T$ as:

$$P(x_j \mid w_t^T) = \sum_{w_s^S \in V^S} \cos(\mathbf{w}_s^S, \mathbf{w}_t^T) \cdot M_{sj}^S \cdot c^{r_s}, \quad (15)$$

where $r_s$ is the descending rank of the word similarity $\cos(\mathbf{w}_s^S, \mathbf{w}_t^T)$ for the source word $w_s^S$, and $c \in (0, 1)$ is a hyperparameter. Thus, $c^{r_s}$ is a declining confidence factor that suppresses the noise from irrelevant source words and concentrates on the most similar source words when predicting sememes for target words.
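The score function can be sketched as follows (toy data; ranks are zero-based here, which only rescales the scores by a constant factor):

```python
import numpy as np

def predict_sememes(t_vec, E_src, M, c=0.8, top_n=100):
    """Score sememes for one target word: each similar source word votes
    for its annotated sememes (its row of M), weighted by cosine
    similarity and discounted by c ** rank."""
    S = E_src / np.linalg.norm(E_src, axis=1, keepdims=True)
    sims = S @ (t_vec / np.linalg.norm(t_vec))
    scores = np.zeros(M.shape[1])
    for rank, s in enumerate(np.argsort(-sims)[:top_n]):
        scores += sims[s] * (c ** rank) * M[s]
    return scores

# Two source words, three sememes; the target word is closest to source
# word 0, so that word's sememe should receive the highest score.
E_src = np.array([[1.0, 0.0], [0.0, 1.0]])
M = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # word-sememe annotations
scores = predict_sememes(np.array([1.0, 0.1]), E_src, M)
```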

Experiments
In this section, we first introduce the dataset used in the experiments and then describe the experimental settings of both the baseline method and our model. Next, we present the experimental results of different methods on the task of cross-lingual lexical sememe prediction, followed by detailed analysis and case studies. We then investigate the effect of word frequency on cross-lingual sememe prediction results. Finally, we perform further quantitative analysis on two sub-tasks, bilingual lexicon induction and word similarity computation.

Dataset
We use the sememe annotations in HowNet for sememe prediction. HowNet annotates sememes for 118,346 Chinese words and 104,025 English words; the total number of sememes is 1,983. Since some sememes appear only a few times in HowNet and are thus expected to be unimportant, we filter out those low-frequency sememes. Specifically, the frequency threshold is 5, and the final number of distinct sememes used in our experiments is 1,400.
In our experiments, Chinese is the source language and English is the target language. To learn Chinese and English monolingual word embeddings, we extract about 2.0 GB of text from Sogou-T and Wikipedia respectively, and we use THULAC (Li and Sun, 2009) for Chinese word segmentation.
As for the seed lexicon, we build it in a similar way to Zhang et al. (2017). First, we employ the Google Translation API to translate the source-side (Chinese) vocabulary. Then the translations in the target language (English) are queried again in the reverse direction, translating back to the source language (Chinese). We only keep the translation pairs whose back-translated words match the original source words.
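The round-trip filter can be sketched as below. `fwd` and `bwd` stand in for the forward and backward translation calls (backed by toy dictionaries here, whereas the paper queries the Google Translation API):

```python
def build_seed_lexicon(src_vocab, fwd, bwd):
    """Keep only translation pairs that survive back-translation: a source
    word w is paired with fwd(w) iff translating the result back returns
    w itself."""
    lexicon = []
    for w in src_vocab:
        t = fwd(w)
        if t is not None and bwd(t) == w:
            lexicon.append((w, t))
    return lexicon

# Toy stand-ins for the two translation directions.
fwd = {"a": "x", "b": "y"}.get          # source -> target
bwd = {"x": "a", "y": "z"}.get          # target -> source ("y" fails round-trip)
lexicon = build_seed_lexicon(["a", "b", "c"], fwd, bwd)
```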
In the task of bilingual lexicon induction, we opt for the Chinese-English Translation Lexicon Version 3.0 as the gold standard. In the task of word similarity computation, we choose the WordSim-240 and WordSim-297 (Jin and Wu, 2012) datasets for Chinese, and the WordSim-353 (Finkelstein et al., 2002) and SimLex-999 (Hill et al., 2015) datasets for English, to evaluate the performance of our model. These datasets contain word pairs along with human-assigned similarity scores. The word vectors are evaluated by ranking the word pairs according to their cosine similarities and measuring Spearman's rank correlation coefficient with the human ratings.
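This evaluation protocol is standard; for tie-free data, Spearman's rho reduces to the Pearson correlation of the rank vectors, e.g.:

```python
import numpy as np

def spearman(a, b):
    """Spearman's rank correlation as the Pearson correlation of ranks
    (sufficient here: the toy inputs contain no ties)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

# Model cosine similarities vs. human ratings for four word pairs; the two
# rankings agree exactly, so rho is 1.0.
rho = spearman([0.9, 0.1, 0.5, 0.3], [10.0, 1.0, 7.0, 3.0])
```

A library routine such as SciPy's `spearmanr` additionally handles tied ranks.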

Experimental Settings
We empirically set the dimension of word and sememe embeddings to 200, and all embeddings are randomly initialized. In monolingual word embedding learning, we follow the optimal parameter settings in Mikolov et al. (2013a): we set the window size $K$ to 5, the down-sampling rate for high-frequency words to $10^{-5}$, the learning rate to 0.025 and the number of negative samples to 5. In cross-lingual word embedding alignment, the seed lexicon term weight $\lambda_s$ is 0.01, and the matching term weight $\lambda_m$ is 1,000. In sememe-based word representation, the number of shared sememes for synonyms in the word relation-based approach is 2. In training the matching process, we set $\epsilon$ to 0.5 empirically. When predicting sememes for words in the target language, we only consider the 100 most similar source words for each target word, and the attenuation parameter $c$ is 0.8. The testing set for cross-lingual lexical sememe prediction contains 2,000 randomly selected English words from the vocabulary.

Cross-lingual Lexical Sememe Prediction
We evaluate our model by recommending sememes for English words. In HowNet, many words have multiple sememes, so sememe prediction can be regarded as a multi-label classification task. We use mean average precision (MAP) and F1 score to evaluate the sememe prediction results.
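MAP averages, over all test words, the per-word average precision: the precision at each rank where a gold sememe appears. A minimal sketch (the sememe scores below are illustrative):

```python
import numpy as np

def average_precision(scores, gold):
    """AP for one word: `scores` ranks every candidate sememe from most to
    least confident, and `gold` is the set of sememes annotated in HowNet."""
    hits, ap = 0, 0.0
    for rank, j in enumerate(np.argsort(-np.asarray(scores)), start=1):
        if j in gold:
            hits += 1
            ap += hits / rank           # precision at this recall point
    return ap / max(len(gold), 1)

ap_perfect = average_precision([0.9, 0.8, 0.1], {0, 1})  # gold sememes on top
ap_last = average_precision([0.9, 0.8, 0.1], {2})        # gold sememe ranked last
```

MAP is then the mean of these AP values across the 2,000 test words.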
We compare our model that incorporates sememe information with the word relation-based approach (named CLSP-WR) and our model that jointly trains word and sememe embeddings (named CLSP-SE) against a baseline method, BiLex (Zhang et al., 2017), a bilingual WRL model without sememe information. For BiLex, we use its trained bilingual word embeddings to predict sememes for the words in the target language with our sememe prediction approach. From the results, we observe that: (1) Our two models perform much better than BiLex in all the seed lexicon size settings. This indicates that incorporating sememe information into word embeddings can effectively improve the performance of predicting sememes for target words. The reason is that both of our models make words with similar sememe annotations have similar embeddings, and as a result, we can recommend better sememes for target words according to their related source words.
(2) The CLSP-SE model achieves better results than the CLSP-WR model. The reason is that by representing sememes in a latent semantic space, the CLSP-SE model can further capture the relatedness between sememes, as well as that between words and sememes, which is helpful for modeling the representations of words with similar sememes.

Case Study
In the case study, we conduct qualitative analysis to explain the effectiveness of our models with detailed cases. We show two examples of cross-lingual sememe prediction, in which we predict sememes for handcuffs and canoeist. Fig. 2 shows the embeddings of the five closest Chinese and English words to handcuffs and canoeist; the vector of each word is projected down to two dimensions using t-SNE (Maaten and Hinton, 2008). (The largest seed lexicon size is 6,000 because that is the maximum number of translation word pairs that we can obtain from the bilingual corpora.) Table 2 lists the top-5 sememes we predict for the two words; the sememes annotated for each word in HowNet are in boldface. In the table, we also exhibit the annotated sememes of the five closest Chinese words.
In the first example, our model finds the best Chinese translation of handcuffs, 手铐 "handcuffs", whose sememe annotations are exactly the same as those of handcuffs. In addition, the second closest Chinese word, 镣铐 "shackles", is a synonym of 手铐 "handcuffs" and also has the same sememe annotations. Therefore, our model predicts all the correct sememes successfully. From the prediction results of this example, we notice that our model can accurately predict general sememes like 用具 "tool" and 人 "human", which are supposed to be difficult to predict.
In the second example, an accurate Chinese counterpart of canoeist does not exist, but our model still hits all three annotated sememes within the top-5 predicted sememes. By observing the most similar Chinese words, we find that although these words do not have the same meaning as canoeist, they are related to it in different aspects. For example, 短跑 "sprint" and canoeist are both in the sports domain, so they share the sememes 锻炼 "exercise" and 体育 "sport". 名将 "sports star" refers to a sports star and can provide the sememe 人 "human" in sememe prediction. Furthermore, it is noteworthy that our model predicts 船 "ship" due to the nearest Chinese words 独木舟 "canoe" and 皮艇 "kayak", whereas 船 "ship" is not annotated for canoeist in HowNet. It is obvious that 船 "ship" is an appropriate sememe for canoeist. Since HowNet is manually annotated by experts, misannotated words inevitably exist, which in some cases leads to our models being underestimated.

Effect of Word Frequency
To explore how the frequencies of target words affect cross-lingual sememe prediction results, we split the testing set into four subsets according to word frequency and then calculate the sememe prediction MAP and F1 score for each subset. The results are shown in Table 3. From the table, we observe that: (1) The more frequently a target word appears in the corpus, the better its predicted sememes are. This is because high-frequency words normally have better word embeddings, which are crucial to sememe prediction.
(2) Our models evidently perform better than BiLex across word frequencies, especially for low-frequency words. This indicates that by considering the external information of HowNet, our models are more robust and can competently handle sparse scenarios.

Further Quantitative Analysis
In this section, we conduct two typical auxiliary experiments to further analyze the superiority of our models quantitatively.

Bilingual Lexicon Induction
Our models learn bilingual word embeddings in one unified semantic space. Here we use top-1 and top-5 translation average precision (P@1 and P@5) to evaluate the bilingual lexicon induction performance of our models and BiLex. The seed lexicon size again varies in {1000, 2000, 4000, 6000}. The results are shown in Table 4. From this table, we observe that our models, especially the CLSP-SE model, enhance the performance of word translation compared to BiLex regardless of the seed lexicon size. This indicates that our models can align bilingual word embeddings better.

Monolingual Word Similarity Computation

Table 5 shows the results of monolingual word similarity computation on four datasets, with seed lexicon size 6,000. From the table, we find that: (1) Our models perform better than BiLex on both Chinese word similarity datasets, which signifies that incorporating sememe information helps learn better monolingual embeddings; (2) The CLSP-WR model does not enhance English word similarity results but the CLSP-SE model does. This is because CLSP-WR only post-processes Chinese word embeddings and keeps English word embeddings unchanged, while CLSP-SE undertakes bilingual alignment and sememe information incorporation together, which allows the English word embeddings to improve along with the Chinese word embeddings.

Conclusion and Future Work
In this paper, we introduce a new task of cross-lingual sememe prediction. This task is important because the construction of sememe-based linguistic knowledge bases in various languages is beneficial to better understanding these languages. We propose a simple and effective model for this task, comprising monolingual word representation learning, cross-lingual word representation alignment and sememe-based word representation learning. Experimental results on real-world datasets show that our model achieves consistent and significant improvements compared to the baseline method in cross-lingual sememe prediction.
In the future, we will explore the following research directions: (1) In this paper, for simplicity, we ignore the rich hierarchy information in HowNet and also the fact that a word may have multiple senses. We will extend our models to consider the structural information of sememes and the multiple senses of words. (2) Our framework for cross-lingual lexical sememe prediction can be transferred to other cross-lingual tasks. We will explore the effectiveness of our model in tasks such as cross-lingual information retrieval.