Bayesian Hierarchical Words Representation Learning

Oren Barkan*, Idan Rejwan*, Avi Caciularu* and Noam Koenigstein

* Equal contribution.

This paper presents the Bayesian Hierarchical Words Representation (BHWR) learning algorithm. BHWR facilitates Variational Bayes word representation learning combined with semantic taxonomy modeling via hierarchical priors. By propagating relevant information between related words, BHWR utilizes the taxonomy to improve the quality of such representations. Evaluations on several linguistic datasets demonstrate the advantages of BHWR over suitable alternatives that facilitate Bayesian modeling with or without semantic priors. Finally, we further show that BHWR produces better representations for rare words.


Introduction
In the last decade, a plethora of methods have been proposed for learning vector representations for words (Mikolov et al., 2013; Pennington et al., 2014; Barkan, 2017), sentences (Lin et al., 2017; Barkan et al., 2020a), items (Barkan et al., 2016; Barkan et al., 2019; Barkan et al., 2020b; Barkan et al., 2020c), and medical concepts (Luo et al., 2019). In the domain of natural language understanding, neural word embedding models are designed to learn distributed word representations as vectors in a latent space. In this space, arithmetic operations between word vectors reflect semantic relations between the corresponding words.
The major focus of most previous models was on optimizing the utilization of co-occurrence relations for learning representations, e.g., learning the probability of a word u to appear in the vicinity of a word v. Yet, often, additional side information can be leveraged for learning finer embeddings. In this work, we focus on incorporating word semantic taxonomy, which is particularly useful for learning representations of rare words and for learning word representations from a small corpus.
To this end, we introduce the Bayesian Hierarchical Words Representation (BHWR) learning algorithm. BHWR presents two complementary properties: Bayesian modeling of word representations, alongside hierarchical priors that naturally support semantic taxonomy. BHWR is based on a Variational Bayes (VB) optimization that enables the mapping of words into probability densities in the latent space.
A key advantage of BHWR is the utilization of word taxonomy for the propagation of relevant information between related words. For example, consider the words 'anode' and 'cathode'. Both words have a common relationship to the word 'electrode', which appears hierarchically above them in the taxonomy knowledge base. Assume the word 'cathode' frequently appears in the corpus, while the words 'anode' and 'electrode' do not appear in the corpus or occur very infrequently. A model that relies solely on co-occurrence relations will fail to infer the semantic proximity between 'cathode' and 'anode'. However, a model that utilizes word taxonomy will learn a representation for the parent word 'electrode' based on its child 'cathode'. Moreover, the parent word 'electrode' will serve as an informative prior for 'anode', and the representation of 'anode' will fall back to its prior 'electrode'. Finally, if more occurrences of 'anode' are added to the training dataset, its representation can smoothly transition away from its prior position in accordance with the co-occurrence patterns in the data.
Besides the semantic information added to the word representations via the hierarchical prior, the Bayesian modeling by itself helps deal with the problem of rare words. These words suffer from insufficient statistics, and their respective embeddings are quite sensitive to noise. This problem becomes acute in the case of point estimate solutions that do not model uncertainty. In contrast, Bayesian solutions learn the entire posterior density and hence are more robust to overfitting (Bishop, 2006).
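To make this fallback behaviour concrete, here is a minimal numerical sketch (our own illustration with made-up numbers, not BHWR's actual update equations): under a Gaussian prior centred on the parent's representation, the posterior mean of a child embedding is a precision-weighted blend of the prior and the observations, so a rarely observed child stays close to its parent while a frequently observed one moves toward the data.

```python
import numpy as np

# Illustrative sketch only (not BHWR's update equations): with a Gaussian
# prior centred on the parent's embedding, the posterior mean of a child
# embedding is a precision-weighted blend of prior mean and observations.
def posterior_mean(parent_mean, obs, tau_prior=1.0, tau_lik=1.0):
    obs = np.atleast_2d(obs)
    n = obs.shape[0]
    return (tau_prior * parent_mean + tau_lik * obs.sum(axis=0)) / (tau_prior + tau_lik * n)

electrode = np.array([1.0, 0.0])          # parent ('electrode')
few = np.array([[0.0, 1.0]])              # one noisy observation of 'anode'
many = np.tile([0.0, 1.0], (100, 1))      # many observations of 'cathode'

rare = posterior_mean(electrode, few)     # stays close to the parent prior
common = posterior_mean(electrode, many)  # dominated by the data
```

With a single observation the estimate lands halfway between prior and data; with a hundred observations it sits almost exactly on the data, mirroring the smooth transition described above.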
We train BHWR on a small annotated corpus (Miller et al., 1993) and evaluate its overall performance as well as the improvement on rare words. Our findings show that BHWR outperforms other non-contextualized word embedding methods that facilitate either Bayesian modeling or semantic taxonomy.

Related Work
Incorporating lexical-semantic information in learning word embeddings has been suggested in the past. In (Faruqui et al., 2015), a post-processing technique was introduced in order to refine pre-trained word representations using relational information from semantic lexicons. In (Li et al., 2016), hierarchical taxonomy was utilized for improving document categorization and concept clustering. Recently, linguistic knowledge bases were utilized for enhancing contextualized word embeddings (Huang et al., 2019; Levine et al., 2019).
The abovementioned works fine-tune pre-trained representations while injecting external contextual information for words (some with respect to a specific task). Unlike these works, BHWR facilitates Bayesian learning of non-contextualized word embeddings, in combination with hierarchical taxonomy information that is not task-specific. Hence, a direct comparison between BHWR and these works is unfitting.
More relevant to our work is the Bayesian Skip-Gram (BSG) model from (Barkan, 2017). However, BSG does not allow the use of external information such as word taxonomy. Hence, in our experiments, we compared BHWR to BSG and further applied the method from (Faruqui et al., 2015) to enhance BSG with word taxonomy information.

Bayesian Hierarchical Words Representation
In this section, we describe the model and derive a VB solution that is finally translated to the Bayesian Hierarchical Words Representation (BHWR) learning algorithm.

Model
Let W = {w_i}_{i=1}^N be a vocabulary and let I = {1, ..., N} be its corresponding index set. We define a : I → P(I) and c : I → P(I), where a(i) and c(i) are the sets of parent and children word indices of the word w_i, respectively. This forms a hierarchical structure (network) in which a word can appear either as a leaf or as an internal node (parent).
Let u_i, v_i, h_i^u, h_i^v ∈ R^d be the context and target representations of the word w_i, as a leaf (u_i, v_i) and as a parent (h_i^u, h_i^v), respectively. For example, if i ∈ c(j), then we use h_j^u as the parent node for both u_i and h_i^u. Let D = (w_t)_{t=1}^T be a text corpus and let C ∈ N be the context window parameter. We iterate over D and, for each word w_t, sample a random window size c_t uniformly from {1, ..., C}. We further assume normal hierarchical priors

p(u_i | h_{a(i)}^u, τ_u) = N(u_i ; h_{a(i)}^u, τ_u^{-1} I),  p(h_i^u | τ_{h_u}) = N(h_i^u ; 0, τ_{h_u}^{-1} I),  (1)

where τ_u and τ_{h_u} are the precision hyperparameters, and h_{a(i)}^u denotes the parent representation of w_i. In the same manner, we assume normal hierarchical priors p(v_i | h_{a(i)}^v, τ_v) and p(h_i^v | τ_{h_v}). Then, the joint density of D and the model parameters θ = {u_i, v_i, h_i^u, h_i^v}_{i ∈ I}, given the precision hyperparameters τ = {τ_u, τ_v, τ_{h_u}, τ_{h_v}}, is p(D, θ | τ) = p(D | θ) p(θ | τ). Our goal is to compute the posterior predictive distribution for an arbitrary word pair given D (a pair that does not necessarily co-occur in D). The probability of the words w_i and w_j to co-occur is given by

p(y_ij = 1 | D) = ∫ p(y_ij = 1 | θ) p(θ | D, τ) dθ.  (2)
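As a concrete illustration of the hierarchical structure, the following toy sketch builds the parent and children mappings for a three-word taxonomy (the words and parent links are illustrative; in the actual setup the parents come from WordNet hypernyms):

```python
from collections import defaultdict

# Toy sketch of the taxonomy structure assumed by the model; the words and
# parent links are illustrative (WordNet hypernyms in the actual setup).
vocab = ["electrode", "anode", "cathode"]
idx = {w: i for i, w in enumerate(vocab)}

# a(i): word index -> set of parent indices
a = {idx["anode"]: {idx["electrode"]}, idx["cathode"]: {idx["electrode"]}}

# c(i): word index -> set of children indices, derived by inverting a(i)
c = defaultdict(set)
for child, parents in a.items():
    for p in parents:
        c[p].add(child)

# 'electrode' is an internal node; 'anode' and 'cathode' are leaves
```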

Posterior Approximation
Since the posterior p(θ | D, τ) in Eq. (2) is intractable, we turn to a VB approximation (Bishop, 2006) of p(θ | D, τ) via a fully factorized distribution q(θ) = ∏_{θ_s ∈ θ} q(θ_s). The posterior approximation q(θ) is obtained via the minimization of the KL divergence from the true posterior, namely the minimization of KL(q(θ) || p(θ | D, τ)), which is equivalent (Bishop, 2006) to the maximization of the (negative) variational free energy

ℒ(q) ≜ ∫ q(θ) log [p(D, θ | τ) / q(θ)] dθ.

ℒ(q) is maximized via an iterative procedure that is guaranteed to converge to a local optimum (as the optimization is non-convex): at each iteration, we update each parameter θ_s ∈ θ, in turn, according to the following update rule:

log q*(θ_s) = E_{q(θ \ θ_s)}[log p(D, θ | τ)] + const.  (3)

However, a straightforward application of Eq. (3) proves useless, as the term p(D, θ | τ) includes the likelihood p(D | θ), which consists of sigmoid functions that are not conjugate to the normal prior from Eq. (1). Therefore, by introducing an additional variational parameter ξ, we can utilize the logistic bound from (Jaakkola and Jordan, 1996) for lower bounding the log likelihood with a squared exponential function:

log σ(x) ≥ log σ(ξ) + (x − ξ)/2 − λ(ξ)(x² − ξ²),  (4)

where λ(ξ) ≜ [σ(ξ) − 1/2] / (2ξ).

Moreover, this bound is tight for ξ = |x|. The resulting quadratic lower bound on log p(D | θ) enables a conjugate relation with the normal prior from Eq. (1), which results in normal density estimators q*(θ_s), θ_s ∈ θ. Hence, for each θ_s ∈ θ, we update the precision and mean (the sufficient statistics) following the update rules from Eqs. (5) and (6), respectively.
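The logistic bound can be checked numerically. The sketch below (our own verification code, not part of the BHWR algorithm) implements the (Jaakkola and Jordan, 1996) quadratic lower bound on log σ(x) and confirms that it is tight at ξ = |x|:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lam(xi):
    # lambda(xi) = (sigmoid(xi) - 1/2) / (2 * xi), from Jaakkola & Jordan (1996)
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def log_sigmoid_bound(x, xi):
    # Quadratic-in-x lower bound on log(sigmoid(x)), tight at xi = |x|
    return np.log(sigmoid(xi)) + (x - xi) / 2.0 - lam(xi) * (x ** 2 - xi ** 2)

x = 1.3
exact = np.log(sigmoid(x))
loose = log_sigmoid_bound(x, xi=3.0)      # valid but loose lower bound
tight = log_sigmoid_bound(x, xi=abs(x))   # tight: recovers the exact value
```

Because the bound is quadratic in x, it pairs with the normal prior to give closed-form normal updates, which is exactly what makes the VB iterations tractable.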

Posterior Predictive Approximation
Finally, we approximate the integral from Eq. (2) by replacing the posterior with its factorized approximation q(θ):

p(y_ij = 1 | D) ≈ ∫ σ(x_ij) q(θ) dθ ≈ σ( μ_ij / √(1 + π s_ij² / 8) ),  (7)

where x_ij = u_i^T v_j, and its density is approximated by a normal density with mean μ_ij and variance s_ij² (x_ij's first two moments under q). The final transition follows the logistic Gaussian integral approximation suggested by (MacKay, 1992).
In practice, the similarity score for a pair of words w_i and w_j is based on two different versions of Eq. (7): the first assigns x_ij = u_i^T v_j and the second assigns x_ij = v_i^T u_j. Then, the average of these two scores is taken as the final similarity score. Our experiments revealed that this technique yields better results.
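The logistic Gaussian integral approximation of (MacKay, 1992) can be sanity-checked against Monte Carlo sampling; the sketch below is our own illustration, with arbitrary mean and variance values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mackay_approx(mu, var):
    # E[sigmoid(x)] for x ~ N(mu, var), via the probit-style approximation
    # sigmoid(mu / sqrt(1 + pi * var / 8)) from MacKay (1992)
    return sigmoid(mu / np.sqrt(1.0 + np.pi * var / 8.0))

mu, var = 0.5, 2.0                # arbitrary illustrative first two moments
approx = mackay_approx(mu, var)
rng = np.random.default_rng(0)
mc = sigmoid(rng.normal(mu, np.sqrt(var), 200_000)).mean()
# approx and mc should agree to about two decimal places
```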

Experimental Setup and Results
The experiments in this section focus on word similarity. Next, we present the training corpus, the evaluated models, the evaluation tasks, and the results.

Training Corpus
We use SemCor (Miller et al., 1993), which contains 37,176 annotated sentences with 820,411 words and a vocabulary size of 11,766 words. Each word's parent is taken to be its WordNet (Miller et al., 1990) hypernym, e.g., for the words 'anode' and 'cathode', the parent word is 'electrode'.

Models and Configurations
We compare Bayesian Hierarchical Words Representation (BHWR) with the Skip-Gram with negative sampling (SG) model from (Mikolov et al., 2013) and the Bayesian Skip-Gram (BSG) model from (Barkan, 2017). For each model, we consider two versions: the first uses the word representations produced by the model as is. In the second version, we further refine the learned word representations by applying the post-processing step from (Faruqui et al., 2015). This enables the incorporation of word taxonomy information into the SG and BSG methods as well. Overall, we consider six different model configurations; the post-processed versions of the models are marked with a '-P' suffix.
All models were trained until convergence. We used the subsampling parameter (Mikolov et al., 2013) of 10 and a negative-to-positive ratio of 1. The precision hyperparameters were set to τ_u = τ_v = 0.1 and τ_{h_u} = τ_{h_v} = 0.001. The embedding dimension was set to d = 50.
For BHWR and BSG, scoring a pair of words is done using the posterior predictive approximation (Section 3.4). For SG, we compute s(u_i, v_j) + s(v_i, u_j), where s is the cosine similarity function (recall that SG is based on a point estimate solution). Finally, for each combination of dataset and method, we report the Spearman rank correlation in terms of percentage. Table 1 presents the results for all combinations of models and datasets. In the last column, we report for each model the average score across all datasets. The table is partitioned into two sections that present the regular and the post-processed versions of the models. For each dataset and section, the best and second-best scores are boldfaced and underlined, respectively. Next, we discuss the main trends presented in Tab. 1.
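For illustration, the symmetrized cosine score for the point-estimate SG baseline can be sketched as follows (U, V, and the example vectors are hypothetical names for the target and context embedding matrices, not the paper's code):

```python
import numpy as np

def cos(a, b):
    # cosine similarity between two vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sg_score(U, V, i, j):
    # Symmetrized cosine score for the point-estimate SG baseline:
    # sums the two cross cosine terms, so the score is symmetric in i and j
    return cos(U[i], V[j]) + cos(V[i], U[j])

# Hypothetical 2-dimensional target (U) and context (V) embedding matrices
U = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[1.0, 1.0], [1.0, -1.0]])
score = sg_score(U, V, 0, 1)   # equals sg_score(U, V, 1, 0) by symmetry
```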

Results
First, we consider the regular model versions (first three rows). BHWR significantly outperforms BSG and SG across all datasets, and BSG comes second with a noticeable difference. This demonstrates the merit of the Bayesian treatment (BSG ≻ SG) and the modeling of word taxonomy (BHWR ≻ BSG).
Next, we turn to examine the post-processed versions (last three rows). We observe a significant boost in the results of all the models, which serves as an independent evaluation and reinforcement of the effectiveness of the post-processing method from (Faruqui et al., 2015). BHWR-P again surpasses the other models by a large margin, while BSG-P and SG-P are on par. An interesting observation is that the post-processing method is instrumental not only for SG and BSG but also for BHWR, which utilizes word taxonomy inherently (BHWR ≺ BHWR-P). This can be explained by the fact that the method of (Faruqui et al., 2015) uses additional lexical information, such as synonyms, which is not incorporated in BHWR. Yet, BHWR alone (without post-processing) still outperforms both BSG-P and SG-P. This result demonstrates the advantage of BHWR, which learns co-occurrence relations together with word taxonomy, simultaneously.
Note that the results in Tab. 1 are suboptimal when compared to (Pennington et al., 2014): this is clearly related to the small corpus size used in this work. In the future, we plan to conduct an evaluation on larger corpora that are not necessarily annotated.
Finally, in order to demonstrate the strength of BHWR for words with only a few occurrences in the corpus, we further compare the models' performance on rare words. Table 2 shows the results on the word similarity tasks for words that occurred in the corpus five times or less. We observe that the gaps between BHWR and the other models become even more significant, with or without the utilization of the post-processing from (Faruqui et al., 2015).

Conclusion and Future Work
We presented BHWR, a word representation learning model facilitating Bayesian learning of co-occurrence relations together with word taxonomy via hierarchical priors. When trained on a small corpus, BHWR exhibits a significant performance gain over other word embedding methods across various word similarity datasets. Importantly, a remarkable improvement is obtained for rare words. Moreover, BHWR outperforms all other baselines even when the latter are enhanced with the post-processing taxonomy refinement procedure from (Faruqui et al., 2015). Finally, when combining BHWR with this post-processing, further improvement is observed.
In the future, we plan to extend the applicability of the presented model to other linguistics tasks as well as recommendations and medical inference tasks.