Embedding Semantic Taxonomies

A common step in developing an understanding of a vertical domain, e.g. shopping, dining, movies, medicine, etc., is curating a taxonomy of categories specific to the domain. These human-created artifacts have been the subject of research on embeddings that attempt to encode the partial-ordering property of taxonomies. We compare Box Embeddings, a natural containment representation of category taxonomies, to partial-order embeddings and a baseline Bayes Net, in the context of representing the Medical Subject Headings (MeSH) taxonomy given a set of 300K PubMed articles with subject labels from MeSH. We explore in depth the experimental properties of training box embeddings, including preparation of the training data, sampling ratios and class balance, and initialization strategies, and we propose a fix to the original box objective. We then present first results in using these techniques for representing a bipartite learning problem (i.e. collaborative filtering) in the presence of taxonomic relations within each partition, inferring disease (anatomical) locations from their use as subject labels in journal articles. Our box model substantially outperforms all baselines in both the taxonomic reconstruction and bipartite relationship experiments. This performance improvement is observed both in overall accuracy and in the weighted spread by true taxonomic depth.


Introduction
Recent work on hierarchical representational structures in machine learning promises to blend the value of human-curated taxonomies with the power and flexibility of machine learning systems. A plethora of such taxonomies exist across many domains, as seen in libraries, linguistic resources, medicine, and popular culture, yet they rely on an assumption of discreteness (category membership is either true or false), and this assumption does not generally lend itself to modelling with continuous-valued systems or spatial embeddings.
Taxonomic knowledge can be leveraged in the semantic space to extract hierarchical relations between words, and this simple observation has been the basis of many computational resources for linguistics, such as WordNet, EuroWordNet, FrameNet, etc. For example, the similarity of the words lion and tiger may be indistinguishable from the similarity of lion and animal in a naive text embedding space, as they may simply be the same distance apart, but in WordNet the difference and similarity are made clear through the taxonomy. In many cases, this explicit knowledge can be derived from existing taxonomies, but without some way of understanding this knowledge in the embedding space, it is just another similarity signal. Recently, order embeddings addressed this issue by orienting similarity in two axes and adding a constraint to similarity embeddings that ensured more general terms had to be closer to the origin (Vendrov et al., 2016).
More recently we have seen the introduction of box lattice embeddings, which treat categories in a taxonomy as n-dimensional boxes, with the constraint that a box representing a more general term should contain the boxes representing its more specific terms.
On a different front, there has been steady progress in automating knowledge-base construction (KBC) from text, as a pure demonstration of NLP as well as a productive tool for end-to-end tasks like question answering. A key part of KBC efforts is learning relations from text, yet a key research question remains relatively unexplored: can we use taxonomic similarity in conjunction with KB relations? In the early days of NLP and knowledge representation (KR), it had always been assumed these two kinds of knowledge would work together, but there is little evidence in existing systems that they do. This work is motivated by that key research question.
Similar to prior work, we began by experimenting with various taxonomy embedding techniques to reconstruct the Medical Subject Headings (MeSH) taxonomy from published article tags. We customize a Box Embedding model for the task and compare the results to several baselines. Finally, we explore a familiar problem: learning a relation on a bipartite graph, in this case the relation between diseases and parts of the body using the co-occurrence of human-assigned MeSH labels on PubMed abstracts. The work tests the hypothesis that the MeSH taxonomy can provide a useful signal in learning this relation: if asthma (disease) and bronchi (anatomy) co-occur in data, and bronchi is a subcategory of lung in the taxonomy, then asthma and lung are related.
The contributions of this paper include the following:
1. A customized box model objective function that smooths the non-differentiable hinge property of the naive learning problem.
2. Performance improvements for the taxonomic learning task via variations of negative sampling and weighting.
3. A methodology for incorporating sparse instance data evidence without exploding the size of the parameter space.
4. Evaluation metrics specialized to the unique requirements of learning a taxonomy.
5. Novel experiments for learning taxonomies on bipartite graphs.

Related Work
Previous research in learning semantic taxonomies has taken a few approaches. (Cimiano et al., 2005) proposed learning hierarchies from raw text using Formal Concept Analysis, and (Maedche and Staab, 2001) learned ontologies for the semantic web. (Liu et al., 2012) extracted taxonomies from keywords. (Hoxha et al., 2016) learned taxonomies by clustering based on informed syntactic matching and semantic relations. (Velardi et al., 2013) automatically extracted taxonomies from web sites using concepts and hypernym relations.
Much early work with ontologies focused explicitly on expanding a given taxonomy or examining a specific area in more depth. (Snow et al., 2005) used a classifier for predicting new undiscovered edges in WordNet. (Kozareva, 2010) used an initial taxonomy to algorithmically crawl the web. Other work used a seed taxonomy for guided hierarchical expansion, and still other work considered noisy real-world graphs and suggested algorithms to remove edges from a formal taxonomy (Velardi et al., 2013).
Similarly, there has been a body of work on embedding techniques to extract hierarchical structures in text. (Globerson et al., 2007) utilized co-occurrence for creating simple Euclidean embeddings. Other work employs hierarchical clustering in a recursive process to construct topic taxonomies. (Vendrov et al., 2016) represented text with partial order embeddings (POE), and follow-on work explored partial order embeddings with distributional co-occurrences for learning ontologies. Box embeddings extended the functionality of POE to better reflect the conditional probabilities of an ontology. (Athiwaratkun and Wilson, 2018) explored Gaussian word embeddings. Earlier, the authors showed preliminary results using Box Embeddings for constructing taxonomies in (Lees et al., 2019). In this work, we expand the performance analysis and add experimental results for the novel bipartite relations problem.
More recently, several works have explored hyperbolic embeddings, in which hierarchical structures are represented as continuous generalizations in non-Euclidean space (Dhingra et al., 2018; Nickel and Kiela, 2018; Ganea et al., 2018; De Sa et al., 2017). (Le et al., 2017) used hyperbolic embeddings for learning concept hierarchies from text. (Tifrea et al., 2019) introduced an application of GloVe in hyperbolic space.

MeSH Dataset
MeSH, the NLM Medical Subject Headings (U.S. National Library of Medicine, 2018a), is a taxonomy of subject headings for categorizing medical writing. PubMed (U.S. National Library of Medicine, 2018b) is a very large collection of over 30 million medical journal articles, each with metadata including human-labeled subject categories from MeSH.
MeSH is organized into 16 top-level categories, such as A: Anatomy, B: Organisms, C: Diseases, which themselves cannot be the subjects of articles. The taxonomy is on average 8 levels deep, with a generalization or broader-term relationship from child to parent nodes, e.g. <Respiratory System, Anatomy>, <Larynx, Respiratory System>, <Asthma, Diseases>. We adopt a notation in which the taxonomy is represented as a collection of <child, parent> edges in a tree-shaped taxonomy graph (noting that it is not strictly a tree).
The MeSH anatomy hierarchy mostly follows a Mereological generalization (sub-parts to parts), while diseases follow a slightly more causal generalization (<Pneumonia, Bacterial Infection>). This kind of semantic promiscuity is extremely common in taxonomies (Guarino and Welty, 2009), and causes an imprecision that begs for an approach with soft, continuous-valued constraints, as opposed to the traditional discrete reading of the taxonomic relationship (all members of the subcategory are members of the super-category). It was this observation that led us to taxonomy embeddings.
Articles listed in PubMed can have any number of subject headings, on average 8-10, and it is fully expected by the published methodology that assigning a particular subject heading to an article also assigns the MeSH parents and ancestors; that is, the transitive closure is expected to hold.

Taxonomy Embedding Experiments
This work explores Box Lattice Embeddings as a technique uniquely suited to the semantic taxonomy space. The efficacy of this approach is validated against Partial Order Embeddings and a Naive Bayes model as baselines.

Partial Order Embeddings
We applied Partial Order Embeddings, an established technique for modeling taxonomy data, to our datasets, as described in (Vendrov et al., 2016). The model assigns each entity u an embedding f(u) ∈ R^n in such a way that the order in the space of entities, defined by edges <u, v>, is maintained in the embedding space by requiring that f(u) ≥ f(v), which holds for all dimensions. This is accomplished by defining a continuous score function for embeddings that measures compliance with the order constraint:

S(u, v) = ||max(0, f(v) − f(u))||^2

Note that max ranges over vector elements. This score should be zero, or close to it, for a pair forming an edge (from a set of positive examples P) and large positive for a pair that is not an edge (from a set of negative examples N). This requirement is enforced by minimizing the following max-margin loss function over embeddings f:

L = Σ_{(u,v) ∈ P} w_{u,v} S(u, v) + W Σ_{(u,v) ∈ N} max(0, α − S(u, v))

It uses a margin α > 0 to express the minimum desired value for the score of a negative example. The relative importance of positive examples may be controlled with edge weights w, if such values are available in the dataset (for example as conditional probabilities). The hyper-parameter W can be used to control the relative importance of the positive and negative parts of the loss.
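As a concrete illustration, the score and loss above can be sketched in a few lines of NumPy. This is a minimal sketch of the published formulation, not our experimental code; the function names are ours.

```python
import numpy as np

def order_violation(f_u, f_v):
    """Score S(u, v): how badly the order constraint f(u) >= f(v)
    is violated, coordinate-wise (0 means the order is satisfied)."""
    return np.sum(np.maximum(0.0, f_v - f_u) ** 2)

def poe_loss(positives, negatives, emb, alpha=1.0, W=1.0, w=None):
    """Max-margin loss: positive edges should score ~0,
    negative edges should score at least alpha."""
    w = w or {p: 1.0 for p in positives}  # uniform edge weights by default
    pos = sum(w[(u, v)] * order_violation(emb[u], emb[v])
              for (u, v) in positives)
    neg = sum(max(0.0, alpha - order_violation(emb[u], emb[v]))
              for (u, v) in negatives)
    return pos + W * neg
```

With the embedding of the more general term dominating the more specific one in every dimension, both terms of the loss vanish, as the order constraint intends.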

Box Lattice Embeddings
Box lattice embeddings associate each category with two vectors in [0, 1]^n, (x_m, x_M), the minimum and maximum value of the box in each dimension. For numerical reasons these are stored as a minimum and a positive offset, plus an ε term to prevent boxes from becoming too small. This representation of boxes in Cartesian space defines a partial ordering by containment of boxes, and a lattice structure in which the meet x ∧ y is the intersection of the two boxes (or ⊥ if they are disjoint) and the join x ∨ y is the smallest box enclosing both. The training objective, in other words, maximizes the overlap of boxes with a positive edge. The original boxes paper showed results for reconstructing a taxonomy derived from WordNet using the transitive closure of edges, and the team made the code available on GitHub, which we reused and modified for our experiments.
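The containment semantics can be sketched as follows. This is an illustrative NumPy sketch representing each box as a (min, max) vector pair; the conditional-probability reading vol(parent ∧ child)/vol(child) follows the box lattice formulation, but the names and the disjointness convention here are ours.

```python
import numpy as np

def meet(x, y):
    """Intersection box of x and y, or None for the bottom element
    (disjoint boxes). A box is a (min_vector, max_vector) pair."""
    lo = np.maximum(x[0], y[0])
    hi = np.minimum(x[1], y[1])
    return (lo, hi) if np.all(lo < hi) else None

def volume(box):
    """Product of side lengths; the bottom element has volume 0."""
    return 0.0 if box is None else float(np.prod(box[1] - box[0]))

def p_parent_given_child(parent, child):
    """Containment score: vol(parent ∧ child) / vol(child).
    Equals 1 exactly when the child box lies inside the parent box."""
    return volume(meet(parent, child)) / volume(child)
```

A child box fully contained in its parent yields a score of 1, so pushing this ratio toward 1 for positive edges recovers the containment ordering.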

Bayes Nets
As discussed in the next section, the data set we have chosen has an abundance of instance data from which conditional probabilities between categories can be calculated, making a Bayesian model an obvious choice to predict whether an instance belongs to a category. Given the training data, the model computes the conditional probability p(C_x|C_y) for any category pair (C_x, C_y) that exists in the training data. Categories co-occur if they appear as labels on the same PubMed article. During training we choose a threshold τ where p(C_x|C_y) > τ → edge <C_x, C_y>, such that the likelihood of known taxonomy edges in the training set is maximized.
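A minimal sketch of this baseline, assuming articles are given as sets of category labels. The names are illustrative, and the real system additionally tunes τ against known taxonomy edges.

```python
from collections import Counter
from itertools import permutations

def conditional_probs(articles):
    """articles: iterable of label sets. Returns p(Cx|Cy) for every
    co-occurring ordered pair, estimated as count(Cx, Cy) / count(Cy)."""
    single = Counter()
    pair = Counter()
    for labels in articles:
        single.update(labels)
        pair.update(permutations(labels, 2))  # both orderings of each pair
    return {(x, y): pair[(x, y)] / single[y] for (x, y) in pair}

def predict_edges(probs, tau):
    """Predict an edge <Cx, Cy> wherever p(Cx|Cy) exceeds threshold tau."""
    return {e for e, p in probs.items() if p > tau}
```

Note the asymmetry: a specific label nearly always co-occurring with a general one gives a high p(general|specific), which is the direction a <child, parent> edge should fire.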

Learning Taxonomies
Many taxonomies are extensional: the categories organize sets of entities in some problem domain, such as movies, stores, restaurants, books, songs, etc. There is a fairly clear instance/category distinction in these cases and the number of instances vastly outweighs the number of categories. Such taxonomies have edges from instances to categories, inst-cat edges (e.g. <Star Wars, Science-Fiction Movie>), and edges between categories, cat-cat edges (e.g. <Cult Science-Fiction Movie, Science-Fiction Movie>). Some taxonomies are intensional: they have no instances, at least none represented in data. WordNet synsets, for example, are arranged in a taxonomy, and there are very few instance-like synsets in WordNet, referring e.g. to specific people and organizations, but for the most part, synsets like "amount of matter" have no clear extension. Such taxonomies have only cat-cat edges.
Many previous experiments on learning taxonomy embeddings were conducted on intensional taxonomies, with no instances, and the objective (for training and evaluation) was simply the number of correct cat-cat edges learned above some confidence threshold.
Our datasets are extensional and contain millions of instances: PubMed 2018 has over 30 million.

Learning from Instances
We tried many approaches to utilizing the rich extension of MeSH in PubMed articles. The most obvious was to treat each inst-cat edge as part of the graph, ignoring the semantic differences with cat-cat edges. However, in our embedding techniques, edges are training examples and the vertices (each category or instance) are the learnable parameters, so this results in an explosion of parameters, almost to the point of having fewer examples than parameters. Further, embedding instances causes many spurious inst-inst edges to be inferred from the dense encodings, generating tremendous noise. One solution may be to use different forms of optimization, such as annealing or Brownian motion. We save this for future work.
Another approach is to reuse the embedding techniques designed for cat-cat edges and summarize the inst-cat edges into the categories. For every pair of inst-cat edges <c, p_1>, <c, p_2> sharing an instance c, we emit a cat-cat edge <p_1, p_2>. This leads to repetition of edges in the training data that reflects the magnitude of the co-occurrence. We compared these two approaches: Taxo: use the edges from the taxonomy, with no consideration for the instances. Summary: include the co-occurrence of categories in instances as a weight on the taxonomy edges.
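The Summary construction can be sketched as follows. This is an illustrative sketch; in our pipeline the weights are realized as repeated edges in the training data rather than stored counts.

```python
from collections import Counter
from itertools import combinations

def summarize(inst_cat_edges):
    """Collapse inst-cat edges into weighted cat-cat edges: every pair
    of categories sharing an instance yields one co-occurrence count."""
    by_inst = {}
    for inst, cat in inst_cat_edges:
        by_inst.setdefault(inst, set()).add(cat)
    weights = Counter()
    for cats in by_inst.values():
        # each unordered category pair labeled on the same instance
        for p1, p2 in combinations(sorted(cats), 2):
            weights[(p1, p2)] += 1
    return weights
```

The category vertices alone remain learnable parameters, so the parameter count stays fixed at the size of the taxonomy no matter how many instances are summarized.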

Transitive Closure
In addition to summarizing instances, we experimented with the use of deterministic reasoning on the training data, specifically the transitive closure of cat-cat and inst-cat edges. Given only direct cat-cat edges (e.g. we have <a, b> and <b, c>, but not <a, c>), we attempt to learn embeddings that approximate the taxonomy. For instance edges, we can similarly compute the transitive closure in the usual way (e.g. if we have <i, a> and <a, b> then we add <i, b>), and then summarize from the inferred inst-cat edges as described above. This gives us two more variations on the data sets: Direct: do not compute the transitive closure of the category edges. For PubMed articles, only the human-labelled subject headings are used. This implies most instances will have edges to a category but not to its ancestors. Reconstructing a taxonomy from only direct edges is expected to be an extremely hard problem to solve with embedding methods.
Closure: compute the transitive closure of the category edges. PubMed articles will have edges to each human supplied subject heading and all its ancestors. The closure multiplies the positive training data and is expected to be a simpler problem.
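The closure rule above can be sketched as a naive fixed-point computation. This quadratic version is far too slow for the full dataset, but it shows the rule being applied; the names are ours.

```python
def transitive_closure(edges):
    """Add <a, c> whenever <a, b> and <b, c> exist, until no new
    edges appear (works for both cat-cat and inst-cat edges)."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (b2, c) in list(closure):
                if b == b2 and (a, c) not in closure:
                    closure.add((a, c))
                    changed = True
    return closure
```

Applied to instance edges, this realizes the MeSH expectation that labeling an article with a heading implicitly labels it with all of that heading's ancestors.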
inst: an instance (PubMed article)
cat: a category (MeSH subject heading)
inst-cat: edge from an instance to a category
cat-cat: taxonomy edge from a subcategory to a category
taxo: dataset consisting of cat-cat edges from the MeSH taxonomy
bip: dataset consisting of bipartite edges from MeSH Disease to Anatomy categories
summary: dataset of cat-cat edges weighted by # of instances shared by the two categories
direct: dataset consisting of cat-cat or inst-cat edges, as specified in MeSH and PubMed
closure: dataset of direct edges augmented by the transitive closure of MeSH categories

Table 1: Terms used in this paper.

Baseline Taxonomy Reconstruction Experiments
With these parameters on our training data, we compared box embeddings, our Bayes net, and order embeddings on their ability to reconstruct a 10% held-out sample of the MeSH direct taxonomy edges, given the other 90% of edges, with F1 scores shown in Tab. 2. Instance summarization was performed on the first ten shards of PubMed 2018 (300k articles). While all edges in the test set are direct and held out, for every test edge <a, b> there must be edges in train containing a and b in order to form their embeddings. The experimental results in Table 2 demonstrate that Box Embeddings did not perform as expected. POE fared well, and seemed to respond well to the extra information provided by instance summarization and transitive closure. In most cases, the closure edges help the models classify the direct-only test edges.
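The evaluation itself reduces to set-level F1 over the held-out edges; a minimal sketch (the names are ours, and the real evaluation scores each model's thresholded predictions):

```python
def edge_f1(predicted, gold):
    """F1 over a held-out edge set: predicted and gold are sets
    of <child, parent> pairs."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```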

Customized Box Embeddings for Taxonomy Reconstruction
The promise of the representational power of boxes over vectors motivated us to dive deeper into the poor performance of boxes in these settings. Specifically, we modified the negative ratio, experimented with informed sampling, changed the objective function and utilized centered initialization to improve the box embedding model performance.
Negative Ratio: In the original box and order embedding papers, negatives were sampled uniformly from the complement of positive edges with a ratio of 1:1. The prior probability of a negative edge was greater than 0.99, so this sampling method allowed for the problem shown in Fig. 1a. Given a simple toy taxonomy of five categories, Fig. 1a shows a version of the box layout we would expect, and Fig. 1b shows an actual learned set of box embeddings (in 2d) with no negatives. One problem becomes clear: the intuition of the box objective fails to account for the fact that it is easiest to satisfy the objective by making most of the boxes overlap.
Choosing a negative sampling rate is a common ML problem, and a commonly used approach is to increase the negative ratio to 10:1, which led to the problem illustrated in Fig. 1b. By increasing the number of negative edges in training, we increased the chances of a bad edge that creates competing constraints. In this case, since we have no edge <A, B>, it could be naively sampled as a negative. Having that as a negative edge forces the two boxes apart, even though the edges <C, A> and <C, B> should cause them to overlap.
Informed sampling: Some improvements to sampling were proposed in (Athiwaratkun and Wilson, 2018), but these were more specific to order embeddings. Instead, we addressed the problems of naive sampling with informed sampling (see the Appendix for details). In a nutshell, the algorithm uses traditional taxonomic reasoning to identify edges that, while not necessarily positive, should not be negative. As in the example shown in Fig. 1b, one such case is edges between ancestors of the same category. Informed sampling improved matters but revealed another problem with training boxes, the box crossing problem, shown in Fig. 2a. In the example, we have altered the simple taxonomy a bit by making the edge <C, B> negative, and with A and C on opposite sides of B, the gradient will not allow them to cross and ultimately meet. As the neg:pos ratio increases, the chances that box crossing stands in the way increase dramatically. While in some sense this is no more than a gradient descent problem, getting stuck in a local neighborhood, the fact that boxes have volume, corners, and edges leads to a solution space that is full of local minima. Solutions that smooth the space can help dramatically.
The improvements in performance from informed sampling can be seen for the Partial Order Embedding (POE) baselines in Table 2. Specifically, the rows marked with the negs designation outperform the same experiments without informed sampling.
Box Objective: The failure of the original box model to generalize in the presence of more negatives forced us to discover and propose fixes for negative sampling and the box crossing problem. However, ultimately we found the primary benefit came from modifying the hinge property of the loss function. When the box crossing problem arises, there is a non-differentiable hinge in the negative loss at the step where the negative boxes first overlap. An approach to smoothing the loss has appeared very recently (Li et al., 2019). We used a far simpler fix developed before that result came out, which, in combination with a different initialization strategy, resolved the box crossing problem. In our modified loss function, positive and negative losses are defined with respect to an indicator function on a joint set of concept boxes {a, b}. For any given box embeddings we define meet(a, b), the embedding of the boxes' intersection volume, and the disjoint indicator function, which fires when there is no intersection between the concept boxes. See Appendix 11 for details on the meet function.
The disjoint indicator is D(a, b) = 1 if a ∧ b = ⊥, and 0 otherwise. Using the above notation, we differentiate the instance losses for the positive f_pos(a, b), positive disjoint f_posd(a, b), and negative f_neg(a, b) scenarios; definitions for these functions can be found in Appendix 11. The objective modification ensures a smooth transition when boxes with a positive edge meet, and when boxes with a negative edge move apart. When positive boxes are disjoint but have a label indicating an overlap, there is loss proportional to their distance. This ensures that boxes that are far apart are encouraged to move closer together in physical space. When the boxes meet, the loss is inversely proportional to the amount of overlap. At the point where they meet, these two kinds of loss must also meet, otherwise an unbounded gradient occurs.
For negative edges, when two overlapping boxes come apart, the negative loss indicated by f_neg(a, b) is zero. The original box model clipped this value at ε in order to avoid a zero value, and added simple smoothing. This did not work in practice, as we saw sharp gradient spikes during optimization that had the effect of making negative boxes bounce away from each other. To solve this, we merely reversed the approach used for positive loss, splitting the loss into overlapping and disjoint loss and flipping the sense.
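A one-dimensional caricature of the smoothed losses may clarify the idea. The functional forms and the ε constant here are illustrative, not the exact definitions from Appendix 11; the key property is continuity at the point where the boxes touch.

```python
def overlap_1d(a, b):
    """Signed overlap of intervals a=(lo, hi), b=(lo, hi):
    positive = overlap width, negative = gap between the boxes."""
    return min(a[1], b[1]) - max(a[0], b[0])

def positive_loss(a, b, eps=1e-3):
    """Loss for a positive edge: grows with the gap while disjoint,
    decays with overlap once the boxes meet. Both branches equal
    1/eps at overlap 0, so there is no hinge at the touching point."""
    o = overlap_1d(a, b)
    if o <= 0:
        return 1.0 / eps - o       # disjoint: proportional to distance
    return 1.0 / (o + eps)         # overlapping: shrinks as overlap grows

def negative_loss(a, b):
    """Mirror image: zero once the boxes are apart, growing smoothly
    with overlap, with no spike as they first separate."""
    return max(0.0, overlap_1d(a, b))
```

Pulling two disjoint positive boxes together strictly decreases the loss at every step, so the gradient never stalls or spikes at the boundary where they first touch.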
With these fixes in place, the new box models were able to overcome the box crossing problem, as shown in Fig. 2b.

Center init: In addition, while the solution proposed in (Li et al., 2019) uses a soft box boundary that effectively causes all boxes to overlap, we found that center initialization, i.e. initializing all boxes to the center of the space, provided the same advantage (all boxes overlap at the beginning) while preserving the lattice properties of boxes. The original box model used random initialization. Our analysis shows that center initialization provides a smoother gradient overall, leading to faster learning times.
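Center initialization can be sketched as follows. This is an illustrative NumPy sketch; the jitter magnitude and box half-width are assumptions, not our tuned values.

```python
import numpy as np

def center_init(n_categories, dim, half_width=0.01, rng=None):
    """Initialize every box as a small cube at the center of [0, 1]^dim,
    with tiny per-box jitter so gradients can pull them apart."""
    rng = rng or np.random.default_rng(0)
    jitter = rng.uniform(-1e-4, 1e-4, size=(n_categories, dim))
    x_min = 0.5 - half_width + jitter
    x_max = 0.5 + half_width + jitter
    return x_min, x_max
```

Because the jitter is an order of magnitude smaller than the half-width, every pair of boxes overlaps at step zero, which is exactly the condition that sidesteps the box crossing problem.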

Experimental Results
Combining a 10:1 negative ratio, informed sampling, a new objective, and center initialization, box models are able to model the taxonomy, as shown in Tab. 3. They are able to capitalize on the summary and taxonomy closure signals, and show better performance on this problem than the other methods.

Table 3: F1 scores of three taxonomy embedding approaches with the improved box model.

Taxonomy Depth
The hierarchical nature of taxonomy datasets yields a further evaluation problem. When using box embeddings, it is expected that the volume of root nodes will substantially exceed that of the leaf nodes. As such, a naive evaluation of performance can yield decent results with a model that only learns the root nodes and/or nodes with edges close to the root. Given the structure of box embeddings, where the marginal probability of a node is equivalent to the box volume, such an evaluation problem may be more pronounced. As such, we have examined the notion of taxonomy depth as used by (Yang and Callan, 2009). Table 4 contains the F1 scores of the different models on summary + closure data at different levels of the taxonomy. The result demonstrates that decent learning metric scores were achieved by models that only learned superficial hierarchical structure. In other words, the Bayes and Order Embedding models' performance at higher levels of the taxonomy drove their high overall F1, but for lower levels of the taxonomy, their performance was inconsistent. The Box Embedding model demonstrates stable performance at all depths, indicating the model learned the underlying taxonomy structure throughout. Consistent performance across depths is a crucial factor in the bipartite problem discussed in the next section.

Table 5: F1 scores of three taxonomy embedding approaches on the bipartite (bip) relation problem using summaries from 10 and 100 shards of PubMed.
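The depth-stratified evaluation can be sketched as follows, assuming, for illustration, a tree in which each category has a single parent; the names are ours.

```python
def depth_of(node, parent_of):
    """Depth of a category in the taxonomy; roots have depth 0."""
    d = 0
    while node in parent_of:
        node = parent_of[node]
        d += 1
    return d

def f1_by_depth(predicted, gold, parent_of):
    """F1 computed separately at each true taxonomic depth of the child,
    so a model that only learns shallow edges cannot hide behind a
    single aggregate score."""
    scores = {}
    for d in {depth_of(c, parent_of) for (c, p) in gold}:
        g = {(c, p) for (c, p) in gold if depth_of(c, parent_of) == d}
        pr = {(c, p) for (c, p) in predicted if depth_of(c, parent_of) == d}
        tp = len(pr & g)
        prec = tp / len(pr) if pr else 0.0
        rec = tp / len(g) if g else 0.0
        scores[d] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores
```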

Bipartite Relations and Taxonomies
Given the ability to reconstruct a single taxonomy, we expand the problem definition to learn a bipartite relation between two distinct taxonomy branches of MeSH. For simplicity, we focus on A: Anatomy and C: Disease, with a total of 6610 categories, with instance summarization from the first ten shards of the PubMed 2018 database (300k articles). The bipartite edges are formed through the summarization process, where the probability of each Disease-to-Anatomy edge <d, a> is the conditional probability P(a|d) that a PubMed article is labelled with category a, given that it is labeled with d. Note that for the closure dataset, an article i is considered to be labeled with a category a_p if the direct edges <i, a_c> and <a_c, a_p> exist. As in previous experiments, we measure F1 on a held-out set of 10% of edges.
In these experiments, we attempt to jointly learn the in-taxonomy and cross-taxonomy relations between diseases and their anatomical locations. The intention is for the two taxonomies to overlay each other to represent the location relation.
For example, the A:Anatomy branch of the taxonomy contains the edge <Heart Valve, Heart> and the C:Disease branch contains <Heart Valve Diseases, Heart Diseases>.
The underlying expectation is that co-occurrences of MeSH labels on PubMed articles will demonstrate that when an article mentions Heart Valve Diseases it will also mention Heart Valve. This establishes a bipartite edge between the A:Anatomy and C:Disease taxonomy branches. The hypothesis is that these direct co-occurrence edges will assist in extracting the more general edges such as <Heart Valve Diseases, Heart> and <Heart Diseases, Heart>.
An initial analysis of the data revealed that many bipartite edges yielded very small marginal proportions. As such, we conducted a second set of experiments summarizing the edges using the first 100 shards of PubMed 2018, roughly 10% of the total, or roughly 3M articles. Table 5 includes F1 scores for the bipartite (bip) learning problem. Sum10 refers to the original smaller PubMed instance summarization dataset and sum100 is the larger one, aggregated over roughly 3 million articles. Our Naive Bayes system was too slow to finish sum100 in time for this submission.
The results were not as expected. The closure data, in which the transitive closure (i.e. the taxonomy) had been applied to the bipartite edges in the training data, showed no significant improvement over the direct (no closure) data. Increasing the size of the summarization did help improve the overall representation, and box embeddings perform better than the other two methods.
An analysis of the learned embeddings discovered an exponential distribution of bipartite edge scores. With all three techniques, a threshold-setting step was applied, specifying the minimum conditional probability on an edge in the embedding space for consideration as a discrete edge. However, we acknowledge that for these bipartite relations, the embeddings are approximating the true conditional probabilities, which may or may not be a valid approximation for evaluating a true discrete edge. The exponential distribution of conditional probabilities indicates that the vast majority of co-occurrences are quite small. A better measure of the performance of these embeddings on the bipartite relation may be KL-divergence. We leave this investigation to future work.

Appendix: Informed Negative Sampling
If negatives are generated on-the-fly (as they were in the original box embeddings implementation), negatives from the transitive closure can leak into training after a test/train split, so we must generate negatives in batch and then split into test/train. Self edges should never be negative. Inverse edges should never be negative, e.g. if <c, p> is a positive edge, <p, c> should not be negative. This is an interesting divergence between the box objective and the set semantics of categories: in a set semantics, a pair and its inverse can both be true only if c = p, thus negating the inverse is equivalent to asserting strict subset. The box objective, however, pushes negative edges apart, and a sub-category must clearly overlap with its parent.
Edges between ancestors of the same category should never be negative, e.g. if <c, p_1> and <c, p_2> are positive, then neither <p_1, p_2> nor its inverse should be negative since, in a strict reading of the taxonomy, their boxes should overlap. In Fig. 1a, the idealized box embedding for a simple taxonomy is shown on the left; category C should be contained in both A and B, therefore the latter should overlap. Without negatives, the simplest zero-loss solution is to make all boxes overlap. Fig. 1b shows the effect of a poorly chosen negative edge between A and B: it forces the two boxes apart, and C must deal with the loss. Again, in a set semantics, one may want to negate the subcategory edge between two overlapping categories, which would simply be interpreted as a constraint against one being a subset of the other. With boxes, doing so would again conflict with the objective to make them overlap.
We introduce informed sampling (Alg. 1) as a way of avoiding these problems. The algorithm incrementally builds NN, the set of non-negative edges, by processing the constraints discussed above, and subtracts it from the edge set formed by the cross product of all categories in the taxonomy, C.

Appendix: Box Embedding Objective

The customized Box Embedding objective function is defined with respect to a joint set of concept boxes {a, b} with corresponding box embeddings represented as pairs of vectors in [0, 1]^n, (x_m, x_M) and (y_m, y_M), where n is the number of dimensions. A specific sampled instance is represented by (a_j, b_j, l_j), where the label l_j ∈ [0, 1] represents the proportional concept box overlap. In our modified loss function, positive and negative losses are defined with respect to the indicator function: for any given box embeddings we define the meet embedding of the box intersection, and the disjoint indicator function fires when there is no intersection, x ∧ y = ⊥.
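The non-negative constraints for informed sampling can be sketched as follows. This is a minimal Python sketch of the direct-parent case; the full Alg. 1 closes over all ancestors, and the names are illustrative.

```python
from itertools import product

def informed_negatives(categories, positives):
    """Candidate negative edges: all category pairs minus the
    non-negative set NN (positives, self edges, inverses, and
    pairs of direct co-parents of a shared child)."""
    non_neg = set(positives)
    non_neg |= {(c, c) for c in categories}        # self edges
    non_neg |= {(p, c) for (c, p) in positives}    # inverse edges
    parents = {}
    for c, p in positives:
        parents.setdefault(c, set()).add(p)
    for ps in parents.values():                    # co-parents must overlap
        non_neg |= set(product(ps, ps))
    return set(product(categories, categories)) - non_neg
```

Sampling negatives from this set guarantees that no sampled edge forces apart two boxes the positive edges require to overlap.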