BERT for Monolingual and Cross-Lingual Reverse Dictionary

Reverse dictionary is the task to find the proper target word given the word description. In this paper, we tried to incorporate BERT into this task. However, since BERT is based on the byte-pair-encoding (BPE) subword encoding, it is nontrivial to make BERT generate a word given the description. We propose a simple but effective method to make BERT generate the target word for this specific task. Besides, the cross-lingual reverse dictionary is the task to find the proper target word described in another language. Previous models have to keep two different word embeddings and learn to align these embeddings. Nevertheless, by using the Multilingual BERT (mBERT), we can efficiently conduct the cross-lingual reverse dictionary with one subword embedding, and the alignment between languages is not necessary. More importantly, mBERT can achieve remarkable cross-lingual reverse dictionary performance even without the parallel corpus, which means it can conduct the cross-lingual reverse dictionary with only corresponding monolingual data. Code is publicly available at https://github.com/yhcc/BertForRD.git.


Introduction
Reverse dictionary (Bilac et al., 2004; Hill et al., 2016) is the task of finding the proper target word given a description of that word. Fig. 1 shows an example of the monolingual and the cross-lingual reverse dictionary. A reverse dictionary is a useful tool that helps writers, translators, and new language learners find a proper word when they encounter the tip-of-the-tongue problem (Brown and McNeill, 1966). Moreover, the reverse dictionary can be used for educational evaluation. For example, teachers can ask students to describe a word, and a correct description should make the reverse dictionary model recall the word.
The core of the reverse dictionary task is to match a word and its description semantically. Early methods (Bilac et al., 2004; Shaw et al., 2013) first extracted handcrafted features and then used similarity-based approaches to find the target word. However, since these methods are mainly based on the surface form of words, they cannot capture semantic meaning, which results in poor performance on human-written search queries. Recent methods usually adopt neural networks to encode the description and the candidate words into the same semantic embedding space and return the word that is closest to the description (Hill et al., 2016; Zhang et al., 2019).
Although current neural methods can extract semantic representations of the descriptions and words, they face three challenging issues: (1) The first issue is data sparsity. It is hard to learn good embeddings for low-frequency words; (2) The second issue is polysemy. Previous methods usually use static word embeddings (Mikolov et al., 2013; Pennington et al., 2014), so they struggle to find the target word when it is polysemous. Pilehvar (2019) used different word senses to represent a word. Nonetheless, gathering senses for all words is not easy; (3) The third issue is the alignment of cross-lingual word embeddings in the cross-lingual reverse dictionary scenario (Hill et al., 2016).
In this paper, we leverage the pre-trained masked language model BERT (Devlin et al., 2019) to tackle the above issues. Firstly, since BERT tokenizes words into subwords with byte-pair-encoding (BPE) (Sennrich et al., 2016b), the subwords shared between low-frequency and high-frequency words can alleviate the data sparsity problem. Secondly, BERT outputs a contextualized representation for a word, which greatly relieves the polysemy problem. Thirdly, mBERT is well suited to the cross-lingual reverse dictionary. Because mBERT shares some subwords between different languages, there is no need to align different languages explicitly. Therefore, we formulate the reverse dictionary task in the masked language model framework and use BERT to deal with the task in monolingual and cross-lingual scenarios. Besides, our proposed framework can also tackle the cross-lingual reverse dictionary task without a parallel (aligned) corpus.
Our contributions can be summarized as follows:
1. We propose a simple but effective solution to incorporate BERT into the reverse dictionary task. In this method, the target word is predicted according to masked language model predictions. With BERT, we achieve significant improvement on the monolingual reverse dictionary task.
2. By leveraging the Multilingual BERT (mBERT), we extend our method to the cross-lingual reverse dictionary task. mBERT not only avoids the explicit alignment between different language embeddings, but also achieves good performance.
3. We propose the unaligned cross-lingual reverse dictionary scenario and achieve encouraging performance with only monolingual reverse dictionary data. As far as we know, this is the first time the unaligned cross-lingual reverse dictionary has been studied.

Related Work
The reverse dictionary task has been investigated in several previous studies. Bilac et al. (2004) proposed using information retrieval techniques to solve this task: they first built a database from available dictionaries. When a query came in, the system would find the closest definition in the database and return the corresponding word. Different similarity metrics can be used to calculate the distance. Shaw et al. (2013) enhanced the retrieval system with WordNet (Miller, 1995). Hill et al. (2016) were the first to apply RNNs to the reverse dictionary task, making the model free of handcrafted features. After the definition is encoded into a dense vector, this vector is used to find its nearest neighbor word. This model formulation has been adopted in several papers (Pilehvar, 2019; Zhang et al., 2019; Morinaga and Yamaguchi, 2018; Hedderich et al., 2019), which differ in the resources they use. Kartsaklis et al. (2018) and Thorat and Choudhari (2016) used WordNet to form graphs to tackle the reverse dictionary task.
The construction of the bilingual reverse dictionary has been studied in (Gollins and Sanderson, 2001; Lam and Kalita, 2013). Lam and Kalita (2013) relied on the availability of lexical resources, such as WordNet, to build a bilingual reverse dictionary. Other work built several bilingual reverse dictionaries based on Wiktionary, but the quality of this kind of online data cannot be guaranteed. Building a bilingual reverse dictionary is not an easy task, and it is even harder for low-resource languages. Besides the low-quality problem, the vast number of language pairs is also a big obstacle, since N languages form $N^2$ pairs. With the unaligned cross-lingual reverse dictionary, however, we can not only exploit high-quality monolingual dictionaries, but also avoid preparing $N^2$ language pairs.
Unsupervised machine translation is closely related to the unaligned cross-lingual reverse dictionary (Lample et al., 2018a; Conneau and Lample, 2019; Sennrich et al., 2016a). However, the unaligned cross-lingual reverse dictionary task differs from unsupervised machine translation in at least two aspects. Firstly, the target of the cross-lingual reverse dictionary is a word, whereas the target of machine translation is a sentence. Secondly, in theory, the translated sentence and the original sentence should contain the same information. In the cross-lingual reverse dictionary task, by contrast, the target word might carry more senses when it is polysemous, and a description can correspond to several similar terms. Polysemy also makes it hard to solve this task with unsupervised word alignment (Lample et al., 2018b).
Last but not least, the pre-trained language model BERT has been extensively exploited in the Natural Language Processing (NLP) community since its introduction (Devlin et al., 2019; Conneau and Lample, 2019). Owing to its ability to extract contextualized information, BERT has been successfully used to substantially enhance various tasks, such as aspect-based sentiment analysis, summarization (Zhong et al., 2019), named entity recognition (Yan et al., 2019; Li et al., 2020), and Chinese dependency parsing. However, most works use BERT as an encoder, and fewer works use BERT for generation (Wang and Cho, 2019; Conneau and Lample, 2019). Wang and Cho (2019) showed that BERT is a Markov random field language model, so sentences can be sampled from BERT. Conneau and Lample (2019) used pre-trained BERT to initialize an unsupervised machine translation model and achieved good performance. Different from these works, although a word might contain several subwords, we use a simple but effective method to make BERT generate the word ranking list with only one forward pass.

Methodology
The reverse dictionary task is to find the target word $w$ given its definition $d = [w_1, w_2, \ldots, w_n]$, where $d$ and $w$ can be in the same language or in different languages. In this section, we first introduce BERT, then present the method we used to incorporate BERT into the reverse dictionary task.

BERT
BERT is a pre-trained model proposed in (Devlin et al., 2019). BERT contains several Transformer encoder layers, each of which can be formulated as follows:

$$\hat{h}^{l} = \mathrm{LN}(h^{l-1} + \mathrm{MHAtt}(h^{l-1})), \qquad (1)$$
$$h^{l} = \mathrm{LN}(\hat{h}^{l} + \mathrm{FFN}(\hat{h}^{l})), \qquad (2)$$

where $h^0$ is the BERT input, which for each token is the sum of its token embedding, position embedding, and segment embedding; LN is the layer normalization layer; MHAtt is the multi-head self-attention; FFN contains three layers: a linear projection layer, an activation layer, and another linear projection layer; and $l$ is the depth of the layer. The total number of layers in BERT is 12 or 24.

Figure 2: The model structure for the monolingual and cross-lingual reverse dictionary. The "[MASK]" in the input is the placeholder where BERT needs to predict. Placeholders are concatenated with the word definition before being sent into BERT. Postprocessing is required to convert the predictions for the "[MASK]"s into the word ranking list.
Two tasks were used to pre-train BERT. The first is masked language modeling: some tokens are replaced with the "[MASK]" symbol, and BERT has to recover the masked tokens from the outputs of the last layer. The second is next sentence prediction: for two consecutive sentences, the second sentence is replaced with another sentence 50% of the time, and BERT has to decide whether the input sequence is continuous based on the output vector of the "[CLS]" token. Another notable fact about BERT is that, instead of directly using words, it uses BPE subwords (Sennrich et al., 2016b) to represent tokens. Therefore, one word may be split into several tokens. Next, we show how we make BERT generate the word ranking list.

BERT for Monolingual Reverse Dictionary
The model structure is shown in Fig. 2. The input consists of k "[MASK]" placeholders concatenated with the definition d and ends with "[SEP]". We want BERT to recover the target word w from the k "[MASK]" tokens based on the definition d. We first use BERT to predict the masks as in its pre-training task. It can be formulated as

$$S_{subword} = \mathrm{MLM}(H^{L}_{k}), \qquad (3)$$

where $H^{L}_{k} \in \mathbb{R}^{k \times d_{model}}$ is the hidden states for the k masked tokens in the last layer, MLM is the pre-trained masked language model, $S_{subword} \in \mathbb{R}^{k \times |V|}$ is the subword score distribution for the k positions, and $|V|$ is the number of subword tokens. Although we could make BERT directly predict words by using a word embedding, this would suffer from at least two problems: first, it cannot take advantage of subwords shared between words, such as prefixes and suffixes; second, predicting whole words is inconsistent with the pre-training tasks.
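As a minimal sketch of the placeholder construction described above: the exact input template is not recoverable from the text, so the layout used here ("[CLS]", then k "[MASK]" tokens, then "[SEP]", the definition subwords, and a final "[SEP]") is an assumption, and `build_input` and `mask_positions` are hypothetical helper names.

```python
from typing import List


def build_input(definition_subwords: List[str], k: int) -> List[str]:
    """Concatenate k [MASK] placeholders with the definition subwords.

    The ordering [CLS] [MASK]*k [SEP] definition [SEP] is an assumed layout.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return (["[CLS]"] + ["[MASK]"] * k + ["[SEP]"]
            + list(definition_subwords) + ["[SEP]"])


def mask_positions(tokens: List[str]) -> List[int]:
    """Indices whose last-layer hidden states would be fed to the MLM head."""
    return [i for i, token in enumerate(tokens) if token == "[MASK]"]


tokens = build_input(["a", "room", "for", "cooking"], k=3)
positions = mask_positions(tokens)
print(tokens)
print(positions)
```

The hidden states at `positions` would play the role of the k masked-token states that the masked language model head scores.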
After obtaining $S_{subword}$, we need to convert the subword scores back to word scores. However, there are $|V|^k$ possible subword combinations, which makes it intractable to represent words by enumerating subword combinations. Another option is to generate subwords one by one (Wang and Cho, 2019; Conneau and Lample, 2019), but this is not suitable here, since the task needs to return a ranking list of words, while generation can only offer a limited number of answers. Nevertheless, for this specific task, the number of possible target words is fixed, since the number of unique words in one language's dictionary is limited. Hence, instead of scoring every subword combination, we only need to consider the subword sequences that form a valid word.
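To give a sense of scale, the following worked figure assumes a subword vocabulary of 30,522 entries (the size of the uncased English BERT vocabulary) and k = 5; both values are used here only for illustration.

```latex
|V|^{k} = 30522^{5} \approx 2.6 \times 10^{22}
```

By contrast, a dictionary word list with, say, $10^5$ entries (a hypothetical figure) needs only $10^5$ word scores per definition, which is why restricting the computation to valid words makes ranking tractable.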
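Specifically, for a given language, we first list all its valid words and find the subword sequence for each word. For a word $w$ with the subword sequence $[b_1, \ldots, b_k]$, its score is calculated by

$$S_{word} = \sum_{i=1}^{k} S^{i}_{subword}[b_i], \qquad (4)$$

where $S_{word} \in \mathbb{R}$ is the score for the word $w$ and $S^{i}_{subword}[b_i]$ is the score of subword $b_i$ at the $i$-th masked position. However, not all words can be decomposed into exactly k subword tokens. If a word has fewer than k subword tokens, we pad it with "[MASK]"; our method cannot handle words with more than k subword tokens. In this way, each word gets a score, so we can directly use the cross-entropy loss to fine-tune the model,

$$L_{w} = -\sum_{i=1}^{N} \log \mathrm{softmax}\big(S^{(i)}_{word}\big)[w^{(i)}], \qquad (5)$$

where N is the total number of samples, $S^{(i)}_{word}$ collects the scores of all valid words for the $i$-th sample, and $w^{(i)}$ is its target word. When ranking, words are sorted by their scores.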
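The scoring and loss steps above can be sketched in plain Python. This is an illustrative toy, not the released implementation: the subword scores are hand-written dictionaries rather than BERT outputs, and `pad_subwords`, `word_scores`, and `cross_entropy` are hypothetical helper names.

```python
import math
from typing import Dict, List, Optional, Sequence

MASK = "[MASK]"


def pad_subwords(subwords: List[str], k: int) -> Optional[List[str]]:
    """Pad a subword sequence to length k with [MASK]; None if it is too long."""
    if len(subwords) > k:
        return None  # words with more than k subwords cannot be handled
    return list(subwords) + [MASK] * (k - len(subwords))


def word_scores(s_subword: Sequence[Dict[str, float]],
                word_to_subwords: Dict[str, List[str]]) -> Dict[str, float]:
    """Word score = sum over positions i of the score of subword b_i at i."""
    k = len(s_subword)
    scores = {}
    for word, pieces in word_to_subwords.items():
        padded = pad_subwords(pieces, k)
        if padded is not None:
            scores[word] = sum(s_subword[i][b] for i, b in enumerate(padded))
    return scores


def cross_entropy(scores: Dict[str, float], target: str) -> float:
    """Negative log-softmax of the target word over all valid words."""
    top = max(scores.values())
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return log_z - scores[target]


# Toy scores for k = 2 masked positions over a four-entry subword vocabulary.
s_subword = [
    {"kitchen": 2.0, "cook": 1.0, "##er": -1.0, MASK: 0.0},
    {"kitchen": -1.0, "cook": -1.0, "##er": 0.5, MASK: 1.5},
]
vocabulary = {
    "kitchen": ["kitchen"],
    "cook": ["cook"],
    "cooker": ["cook", "##er"],
    "cookerer": ["cook", "##er", "##er"],  # longer than k, so it is skipped
}
scores = word_scores(s_subword, vocabulary)
ranking = sorted(scores, key=scores.get, reverse=True)
loss = cross_entropy(scores, "kitchen")
print(ranking, round(loss, 4))
```

Because the candidate set is the fixed word list, a single forward pass yields scores for every valid word, and the ranking is just a sort.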

BERT for Cross-lingual Reverse Dictionary
The model structure used in this setting is as depicted in Fig. 2, with mBERT as the pre-trained model. mBERT has the same structure as BERT, but it was trained on 104 languages. Therefore, its token embedding contains subwords from different languages.

Figure 3: The model structure for the unaligned cross-lingual reverse dictionary. We add a randomly initialized language embedding to distinguish languages. Since we only have monolingual training data, "LG1" and "LG2" are of the same value in the training phase, but different in the evaluation phase.

BERT for Unaligned Cross-lingual Reverse Dictionary
The model used for this setting is depicted in Fig. 3. Compared with the BERT model, we add an extra learnable language embedding at the bottom, with the same dimension as the other embeddings. Except for the randomly initialized language embedding, the model is initialized with the pre-trained mBERT. Instead of using the MLM to get $S_{subword}$, we use the following equation:

$$S_{subword} = H^{L}_{k}\,\mathrm{Emb}_{token}^{\top}, \qquad (6)$$

where $\mathrm{Emb}_{token} \in \mathbb{R}^{|V| \times d_{model}}$ is the subword token embedding matrix. We found that this formulation leads to better performance than using the MLM. We assume this is because the training data is purely monolingual, so it is hard for the model to predict tokens in another language at evaluation time, whereas with $\mathrm{Emb}_{token}$ the model can exploit the similarity between subwords to make reasonable predictions. After getting $S_{subword}$, we use Eq. 4 to get the score for each word. Since different languages have different word lists, the loss is calculated by

$$L = -\sum_{j=1}^{M} \sum_{i=1}^{N_j} \log \mathrm{softmax}\big(S^{(i)}_{word_j}\big)[w^{(i)}_{j}], \qquad (7)$$

where M is the number of languages, $N_j$ is the number of samples for language j, $w^{(i)}_{j}$ is the target word of the $i$-th sample in language j, and $S_{word_j}$ is the score distribution for words in language j. When getting the ranking list for a language, we only calculate word scores for that language.
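A toy sketch of this scoring path, under stated assumptions: the hidden states and the token embedding matrix are tiny hand-written lists instead of mBERT tensors, the language embedding and the forward pass are omitted, and `subword_scores` and `rank_words` are hypothetical names.

```python
from typing import Dict, List


def subword_scores(hidden: List[List[float]],
                   emb_token: List[List[float]]) -> List[List[float]]:
    """S_subword = H * Emb_token^T: (k x d) times (d x |V|) gives k x |V|."""
    return [[sum(h * e for h, e in zip(h_row, e_row)) for e_row in emb_token]
            for h_row in hidden]


def rank_words(s_subword: List[List[float]],
               word_to_ids: Dict[str, List[int]],
               mask_id: int) -> List[str]:
    """Score only the word list of one language and sort by score."""
    k = len(s_subword)
    scored = {}
    for word, ids in word_to_ids.items():
        if len(ids) > k:
            continue  # more than k subwords: not representable
        padded = ids + [mask_id] * (k - len(ids))
        scored[word] = sum(s_subword[i][b] for i, b in enumerate(padded))
    return sorted(scored, key=scored.get, reverse=True)


# Toy shared vocabulary: 0=[MASK], 1=kitchen, 2=cuisine, 3=centre (d_model=2).
emb_token = [[0.0, 0.0], [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
hidden = [[1.0, 0.0]]  # last-layer state of a single masked position (k=1)

s = subword_scores(hidden, emb_token)
english = rank_words(s, {"centre": [3], "kitchen": [1]}, mask_id=0)
french = rank_words(s, {"centre": [3], "cuisine": [2]}, mask_id=0)
print(english, french)
```

In this toy, "kitchen" and "cuisine" have nearby embeddings, so the same hidden state ranks each of them first within its own language's word list, which mirrors the intuition that subword similarity carries across languages.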

Dataset
For the monolingual reverse dictionary, we tested our methods on the English dataset and the Chinese dataset released by (Hill et al., 2016) and […] (2) Unseen definition set: none of the word's definitions have been seen during the training phase, but they might occur in other words' definitions; (3) Description definition set: the description and its corresponding word are given by humans. Methods that rely on word matching may not perform well in this setting (Hill et al., 2016); (4) Question definition set: this dataset is only in Chinese and contains 272 definitions that appeared in Chinese exams. The detailed dataset statistics are shown in Table 1.
For the cross-lingual and unaligned cross-lingual reverse dictionary, we use a dataset released in prior work. This dataset includes four bilingual reverse dictionaries: English↔French and English↔Spanish. Besides, it includes English, French, and Spanish monolingual reverse dictionary data. The test sets are the four bilingual reverse dictionaries: En↔Fr and En↔Es. For the cross-lingual reverse dictionary, we use the paired bilingual reverse dictionary data to train our model; for the unaligned cross-lingual reverse dictionary, we use the three monolingual reverse dictionary datasets. For both settings, we report results on the test sets of the four bilingual reverse dictionaries. The detailed dataset statistics are shown in Table 2.

Table 2: Dataset statistics for the cross-lingual and unaligned cross-lingual reverse dictionary. The upper block is the monolingual data used to train the unaligned cross-lingual reverse dictionary. The lower block is the cross-lingual reverse dictionary data. Both scenarios were evaluated on the test sets in the lower part. "En-fr" means the target word is in English and the definition is in French.

Evaluation Metrics
For the English and Chinese monolingual reverse dictionary, we report three metrics: the median rank of target words (Median Rank, lower is better, the lowest is 0), the ratio of target words that appear in the top 1/10/100 (Acc@1/10/100, higher is better, ranges from 0 to 1), and the variance of the rank of the correct target word (Rank Variance, lower is better). These metrics are also reported in (Hill et al., 2016; Zhang et al., 2019). For the cross-lingual and unaligned cross-lingual reverse dictionary, we report Acc@1/10 and the mean reciprocal rank (MRR, higher is better, ranges from 0 to 1), following prior work on this dataset.
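The metrics above can be computed from the 0-based rank of each target word. The sketch below is illustrative: `evaluate` is a hypothetical helper, the use of the population variance is an assumption (the text does not say which estimator is used), and the reciprocal rank is taken as 1 / (rank + 1) because ranks start at 0.

```python
import statistics
from typing import Dict, List


def evaluate(ranks: List[int]) -> Dict[str, float]:
    """Compute reverse dictionary metrics from 0-based target-word ranks."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    n = len(ranks)
    return {
        "median_rank": statistics.median(ranks),
        "acc@1": sum(r < 1 for r in ranks) / n,
        "acc@10": sum(r < 10 for r in ranks) / n,
        "acc@100": sum(r < 100 for r in ranks) / n,
        # Population variance is an assumption, not stated in the text.
        "rank_variance": statistics.pvariance(ranks),
        "mrr": sum(1.0 / (r + 1) for r in ranks) / n,
    }


metrics = evaluate([0, 2, 15, 120])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```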

Hyper-parameter Settings
The English BERT and Multilingual BERT (mBERT) are from (Devlin et al., 2019), and the Chinese BERT is from (Cui et al., 2019). Since RoBERTa has the same model structure as BERT, we also report the performance of the English RoBERTa and of the Chinese RoBERTa from (Cui et al., 2019) for the monolingual reverse dictionary. Both RoBERTa and BERT are the base version, and we use the uncased English BERT and the cased mBERT. For all models, we select the hyper-parameters based on Acc@10 on the development sets, and the models with the best development set performance are evaluated on the test set. The data and detailed hyper-parameters for each setting will be released with the code. We choose k = 4 for Chinese and k = 5 for other languages; k is chosen so that at least 99% of the target words in the training set are covered.
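One way the coverage rule for k could be implemented is sketched below. `choose_k` is a hypothetical helper, and the subword counts are made-up numbers rather than statistics from the actual training sets.

```python
from typing import List


def choose_k(subword_counts: List[int], coverage: float = 0.99) -> int:
    """Smallest k such that at least `coverage` of the words have <= k subwords."""
    if not subword_counts:
        raise ValueError("subword_counts must not be empty")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    total = len(subword_counts)
    longest = max(subword_counts)
    for k in range(1, longest + 1):
        covered = sum(1 for count in subword_counts if count <= k)
        if covered / total >= coverage:
            return k
    return longest


# Made-up distribution: number of subwords per target word in a training set.
counts = [1] * 120 + [2] * 60 + [3] * 19 + [7]
k = choose_k(counts)
print(k)
```

With these toy counts, k = 3 already covers 199 of 200 words, so the single seven-subword word is left out rather than forcing a larger k on every input.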

Monolingual Reverse Dictionary
Results for the English and Chinese monolingual reverse dictionary are shown in Table 3 and Table 4, respectively. "OneLook" in Table 3 is the most used commercial reverse dictionary system. It indexes over 1061 dictionaries, including online dictionaries such as Wikipedia and WordNet (Miller, 1995). Therefore, its result on the unseen definition test set is ignored. "SuperSense", "RDWECI", "MS-LSTM" and "Mul-Channel" are from (Pilehvar, 2019; Morinaga and Yamaguchi, 2018; Kartsaklis et al., 2018; Zhang et al., 2019), respectively. As Table 3 shows, RoBERTa achieves state-of-the-art performance on the human description test set. Owing to their larger models, BERT and RoBERTa also significantly improve on "Mul-Channel" on the seen definition test set. Although MS-LSTM (Kartsaklis et al., 2018) performs remarkably well on the seen test set, it fails to generalize to the unseen and description test sets. Besides, "RDWECI", "SuperSense", and "Mul-Channel" in Table 3 all use external knowledge, such as WordNet and part-of-speech tags. Combining BERT with structured knowledge should further improve the performance on all test sets; we leave this for future work. Table 4 presents the results for the Chinese reverse dictionary. For the seen definition setting, BERT and RoBERTa substantially improve the performance. Apart from the good performance on seen definitions, BERT and RoBERTa also perform well on the human description test set, which demonstrates their capability to capture the meaning humans intend.

Cross-lingual Reverse Dictionary
In this section, we present the results for the cross-lingual reverse dictionary. The performance comparison is shown in Table 5. The contrast between "mBERT" and "mBERT-joint" shows that jointly training the reverse dictionary on different language pairs can improve the performance.

Table 3: Results on the English reverse dictionary datasets. In each cell, the values are the "Median Rank", "Acc@1/10/100" and "Rank Variance". * results are from (Zhang et al., 2019). BERT and RoBERTa achieve a significant performance boost in both the description test set and the unseen test set.

Table 4: Results on the Chinese reverse dictionary datasets. In each cell, the values are the "Median Rank", "Acc@1/10" and "Rank Variance". * results are from (Zhang et al., 2019). Our proposed methods enhance the performance in all test sets substantially.

Unaligned Cross-lingual Reverse Dictionary
In this section, we present the results of the unaligned cross-lingual reverse dictionary. Models are trained on several monolingual reverse dictionary datasets, but they are evaluated on bilingual reverse dictionary data. Taking "En-Fr" as an example, models are trained to map English definitions to English words and French definitions to French words, while in the evaluation phase, the model is asked to recall an English word given a French description, or vice versa.

Table 5: Results for the cross-lingual reverse dictionary. In each cell, the values are "Acc@1/10" and "MRR". * results are from prior work. "En-Fr" means the target word is in English, while the description is in French. The "ATT" and "mBERT" models are trained on the bilingual corpus. The "ATT-joint" and "mBERT-joint" models are trained on the four bilingual reverse dictionary corpora simultaneously.

Table 6: Results for the unaligned cross-lingual reverse dictionary. In each cell, the values are "Acc@1/10" and "MRR". * results are from prior work. "En-Fr" means the target word is in English, while the definition is in French. Models in the lower block do not use aligned data, while models in the upper block are trained on aligned data.
Since previous models do not consider this setting, we build a baseline by first retrieving words in the same language as the definition with a monolingual reverse dictionary model, then using word translation or aligned word vectors to recall words in the other language. Taking "En-Fr" as an example, we first recall the top 10 French words for the French definition, then map each French word to an English word by either translation or word vectors.
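The two-step baseline can be sketched as follows. This is a hedged illustration: `translation_baseline` and `TOY_LEXICON` are hypothetical, the lexicon stands in for a translation service or aligned word vectors, and its entries are taken from the French example in Table 8.

```python
from typing import Callable, Iterable, List


def translation_baseline(source_ranking: List[str],
                         translate: Callable[[str], Iterable[str]],
                         top_n: int = 10) -> List[str]:
    """Translate the top-n same-language words, keeping order, dropping repeats."""
    seen = set()
    targets: List[str] = []
    for word in source_ranking[:top_n]:
        for candidate in translate(word):
            if candidate not in seen:
                seen.add(candidate)
                targets.append(candidate)
    return targets


# Hypothetical word-level lexicon standing in for the translation step.
TOY_LEXICON = {
    "cuisine": ["cookery"],
    "restaurant": ["restaurant"],
    "pièce": ["room"],
    "cuire": ["cook"],
}

# Step 1 (assumed already done): monolingual model ranks French words.
french_ranking = ["cuisine", "restaurant", "pièce", "cuire"]
# Step 2: map each French word into English without using the definition.
english_candidates = translation_baseline(
    french_ranking, lambda w: TOY_LEXICON.get(w, []))
print(english_candidates)
```

Because the second step sees each word without its definition, a polysemous word such as "cuisine" can be mapped to a sense other than the intended one, and the target "kitchen" never enters the candidate list in this toy.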
Models listed in Table 6 are as follows: (1) mBERT-Match uses aligned word vectors (Lample et al., 2018b) to recall the target words in another language; (2) mBERT-Trans uses a translation API; (3) mBERT-Unaligned uses two monolingual reverse dictionary corpora to train one model, so the results of "En-Fr" and "Fr-En" in Table 6 are from the same model; (4) mBERT-joint-Unaligned is trained on all monolingual corpora.
As shown in Table 6, "mBERT-Unaligned" and "mBERT-joint-Unaligned" perform much better than "mBERT-Match" and "mBERT-Trans". Therefore, it is meaningful to explore the unaligned reverse dictionary scenario. As we will show in the case study, the translation method might fail to recall the target word when the word is polysemous.
From Table 6, we can see that jointly training three monolingual reverse dictionary tasks does not help to recall cross-lingual words. Therefore, how to utilize different languages to enhance the performance remains to be explored.

Analysis

Performance for Number of Senses
Following (Zhang et al., 2019), we evaluate the accuracy on words with different numbers of senses according to WordNet (Miller, 1995). The results are shown in Fig. 4. BERT and RoBERTa significantly improve the accuracy on words with single and multiple senses, which means they can alleviate the polysemy issue.

Performance for Different Number of Subword
Since BERT decomposes words into subwords, we want to investigate whether the number of subwords has an impact on performance. We evaluate on the English development set; the results are shown in Fig. 5. The model achieves the best accuracy on English words with one subword and on Chinese words with two subwords. This might be because most English words have one subword and most Chinese words have two subwords.

Unseen Definition in Unaligned Cross-lingual Reverse Dictionary
In this section, for the target words present in the bilingual test sets, we gradually remove their definitions from the monolingual training corpus. The resulting performance curve is depicted in Fig. 6. As a reminder, the test sets require recalling target words in another language, while each deleted word and its definition are in the same language. Since the number of removed samples is less than 2% of the monolingual corpus, the performance decay cannot be ascribed entirely to the reduction in data. Based on Fig. 6, for the unaligned reverse dictionary task, we can enhance cross-lingual word retrieval by including more monolingual word definitions.

Figure 6: The performance for the unaligned reverse dictionary with the increment of deleted definitions in monolingual data. The solid and dotted lines are Acc@1 and Acc@10, respectively. Although the deleted definition and word are in the same language, deleting them harms the performance of cross-lingual word retrieval.

Case Study
For the monolingual scenario, we present an example in Table 7 to show that decomposing words into subwords helps to recall related words. Table 8 shows the comparison between "mBERT-Trans" and "mBERT-joint-Unaligned".

Conclusion
In this paper, we formulate the reverse dictionary task under the masked language model framework and use BERT to predict the target word. Since BERT decomposes words into subwords, the score of the target word is the sum of the scores of its constituent subwords. With the incorporation of BERT, our method achieves state-of-the-art performance on both the monolingual and cross-lingual reverse dictionary tasks. Besides, we propose a new cross-lingual reverse dictionary task without aligned data. Our proposed framework can perform the cross-lingual reverse dictionary while being trained on monolingual corpora only. Although the performance of unaligned BERT is superior to the translation and word vector alignment methods, it still lags behind the supervised aligned reverse dictionary model. Therefore, future work should be conducted to enhance performance on the unaligned reverse dictionary.

Definition     someone who studies secret code systems in order to obtain secret information
Mul-Channel    cryptographer, cryptologist, spymaster, snoop
BERT           cryptanalyst, codebreaker, cryptographer, coder
RoBERTa        codebreaker, cryptanalyst, cryptographer, snooper
Table 7: A monolingual case that displays the advantage of using subwords. Each row lists a model's top recalled words; the underlined word is the target word. The words predicted by BERT or RoBERTa are related either to "someone" (corresponding to "-analyst" or "-er") or to "code/secret" (corresponding to "code-" or "crypt-").

Definition     El punto que esta a mitad del camino entre dos extremos. (The point that is halfway between two ends)
Spanish        centro, mitad, medio, punta
Trans.         core, middle, middle, tip
Unaligned      center, centre, middle, mid

Definition     Pièce où l'on prépare et fait cuire les aliments (Room where food is prepared and cooked)
French         cuisine, restaurant, pièce, cuire
Trans.         cookery, restaurant, room, cook
Unaligned      kitchen, cook, office, restaurant
Table 8: Unaligned reverse dictionary results by translation and by the proposed unaligned reverse dictionary model. The target word is underlined, and the "Trans." row gives the word translation results. The Spanish "centro" in the upper block also has the meaning "center", but without context it is given the wrong translation; the French word "cuisine" in the lower block suffers from the same error.