Probabilistic Embedding of Knowledge Graphs with Box Lattice Measures

Embedding methods which enforce a partial order or lattice structure over the concept space, such as Order Embeddings (OE), are a natural way to model transitive relational data (e.g. entailment graphs). However, OE learns a deterministic knowledge base, limiting expressiveness of queries and the ability to use uncertainty for both prediction and learning (e.g. learning from expectations). Probabilistic extensions of OE have provided the ability to somewhat calibrate these denotational probabilities while retaining the consistency and inductive bias of ordered models, but lack the ability to model the negative correlations found in real-world knowledge. In this work we show that a broad class of models that assign probability measures to OE can never capture negative correlation, which motivates our construction of a novel box lattice and accompanying probability measure to capture anti-correlation and even disjoint concepts, while still providing the benefits of probabilistic modeling, such as the ability to perform rich joint and conditional queries over arbitrary sets of concepts, and both learning from and predicting calibrated uncertainty. We show improvements over previous approaches in modeling the Flickr and WordNet entailment graphs, and investigate the power of the model.


Introduction
Structured embeddings based on regions, densities, and orderings have gained popularity in recent years for their inductive bias towards the essential asymmetries inherent in problems such as image captioning (Vendrov et al., 2016), lexical and textual entailment (Erk, 2009; Vilnis and McCallum, 2015; Lai and Hockenmaier, 2017; Athiwaratkun and Wilson, 2018), and knowledge graph completion and reasoning (He et al., 2015; Nickel and Kiela, 2017; Li et al., 2017).
Models that easily encode asymmetry, and related properties such as transitivity (the two components of commonplace relations such as partially ordered sets and lattices), have great utility in these applications, leaving less to be learned from the data than arbitrary relational models. At their best, they resemble a hybrid between embedding models and structured prediction. As noted by Vendrov et al. (2016) and Li et al. (2017), while the models learn sets of embeddings, these parameters obey rich structural constraints. The entire set can be thought of as one, sometimes provably consistent, structured prediction, such as an ontology in the form of a single directed acyclic graph.
While the structured prediction analogy applies best to Order Embeddings (OE), which embeds consistent partial orders, other region- and density-based representations have been proposed for the express purpose of inducing a bias towards asymmetric relationships. For example, the Gaussian Embedding (GE) model (Vilnis and McCallum, 2015) aims to represent the asymmetry and uncertainty in an object's relations and attributes by means of uncertainty in the representation. However, while the space of representations is a manifold of probability distributions, the model is not truly probabilistic in that it does not model asymmetries and relations in terms of probabilities, but in terms of asymmetric comparison functions such as the originally proposed KL divergence and the recently proposed thresholded divergences (Athiwaratkun and Wilson, 2018).
Probabilistic models are especially compelling for modeling ontologies, entailment graphs, and knowledge graphs. Their desirable properties include an ability to remain consistent in the presence of noisy data, suitability towards semisupervised training using the expectations and uncertain labels present in these large-scale applications, the naturality of representing the inherent uncertainty of knowledge they store, and the ability to answer complex queries involving more than 2 variables. Note that the final one requires a true joint probabilistic model with a tractable inference procedure, not something provided by e.g. matrix factorization.
We take the dual approach to density-based embeddings and model uncertainty about relationships and attributes as explicitly probabilistic, while basing the probability on a latent space of geometric objects that obey natural structural biases for modeling transitive, asymmetric relations. The most similar work is the probabilistic order embeddings (POE) of Lai and Hockenmaier (2017), which apply a probability measure to each order embedding's forward cone (the set of points greater than the embedding in each dimension), assigning a finite and normalized volume to the unbounded space. However, POE suffers severe limitations as a probabilistic model, including an inability to model negative correlations between concepts, which motivates the construction of our box lattice model.
Our model represents objects, concepts, and events as high-dimensional products-of-intervals (hyperrectangles or boxes), with an event's unary probability coming from the box volume and joint probabilities coming from overlaps. This contrasts with POE's approach of defining events as the forward cones of vectors, extending to infinity, integrated under a probability measure that assigns them finite volume.
One desirable property of a structured representation for ordered data, originally noted by Vendrov et al. (2016), is a "slackness" shared by OE, POE, and our model: when the model predicts an "edge" or lack thereof (i.e. P(a|b) = 0 or 1, or a zero constraint violation in the case of OE), being exposed to that fact again will not update the model. Moreover, there are large degrees of freedom in parameter space that exhibit this slackness, giving the model the ability to embed complex structure with 0 loss when compared to models based on symmetric inner products or distances between embeddings, e.g. bilinear GLMs (Collins et al., 2002), TransE (Bordes et al., 2013), and other embedding models which must always be pushing and pulling parameters towards and away from each other.
Our experiments demonstrate the power of our approach to probabilistic ordering-biased relational modeling. First, we investigate an instructive 2-dimensional toy dataset that both demonstrates the way the model self-organizes its box event space, and enables sensible answers to queries involving arbitrary numbers of variables, despite being trained on only pairwise data. We achieve a new state of the art in denotational probability modeling on the Flickr entailment dataset (Lai and Hockenmaier, 2017), and match the state of the art on WordNet hypernymy (Vendrov et al., 2016; Miller, 1995) set by the concurrent work on thresholded Gaussian embeddings of Athiwaratkun and Wilson (2018), achieving our best results by training on additional co-occurrence expectations aggregated from leaf types.
We find that the strong empirical performance of probabilistic ordering models, and our box lattice model in particular, and their endowment of new forms of training and querying, make them a promising avenue for future research in representing structured knowledge.

Related Work
In addition to the related work in structured embeddings mentioned in the introduction, our focus on directed, transitive relational modeling and ontology induction shares much with the rich field of directed graphical models and causal modeling (Pearl, 1988), as well as learning the structure of those models (Heckerman et al., 1995). Work in undirected structure learning such as the Graphical Lasso (Friedman et al., 2008) is also relevant due to our desire to learn from pairwise joint/conditional probabilities and moment matrices, which are closely related in the setting of discrete variables.
Especially relevant research in Bayesian networks are applications towards learning taxonomic structure of relational data (Bansal et al., 2014), although this work is often restricted towards tree-shaped ontologies, which allow efficient inference by Chu-Liu-Edmonds' algorithm (Chu and Liu, 1995), while we focus on arbitrary DAGs.
As our model is based on populating a latent "event space" into boxes (products of intervals), it is especially reminiscent of the Mondrian process (Roy and Teh, 2009). However, the Mondrian process partitions the space as a high dimensional tree (a non-parametric kd-tree), while our model allows the arbitrary box placement required for DAG structure, and is much more tractable in high dimensions compared to the Mondrian's Bayesian non-parametric inference.
Embedding applications to relational learning constitute a huge field to which it is impossible to do justice, but one general difference between our approaches is that e.g. a matrix factorization model treats the embeddings as objects to score relation links with, as opposed to POE or our model in which embeddings represent subsets of probabilistic event space which are directly integrated. They are full probabilistic models of the joint set of variables, rather than embedding-based approximations of only low-order joint and conditional probabilities. That is, any set of our parameters can answer any arbitrary probabilistic question (possibly requiring intractable computation), rather than being fixed to modeling only certain subsets of the joint.
Embedding-based learning's large advantage over the combinatorial structure learning presented by classical PGM approaches is its applicability to large-scale probability distributions containing hundreds of thousands of events or more, as in both our WordNet and Flickr experiments.

Partial Orders and Lattices
A non-strict partially ordered set (poset) is a set P equipped with a binary relation ⪯ such that for all a, b, c ∈ P,
• a ⪯ a (reflexivity)
• a ⪯ b ⪯ a implies a = b (antisymmetry)
• a ⪯ b ⪯ c implies a ⪯ c (transitivity)
This is simply a generalization of a totally ordered set that allows some elements to be incomparable, and is a good model for the kind of acyclic directed graph data found in knowledge bases.
A lattice is a poset where any subset has a unique least upper and greatest lower bound, which will be true of all posets (lattices) considered in this paper. The least upper bound of two elements a, b ∈ P is called the join, denoted a ∨ b, and the greatest lower bound is called the meet, denoted a ∧ b.
Additionally, in a bounded lattice we have two extra elements, called top, denoted ⊤, and bottom, denoted ⊥, which are respectively the least upper bound and greatest lower bound of the entire space. Using the extended real number line (adding points at infinity), all lattices considered in this paper are bounded lattices.

Order Embeddings (OE)
Vendrov et al. (2016) introduced a method for embedding partially ordered sets and a task, partial order completion, an abstract term for things like hypernym or entailment prediction (learning transitive relations). The goal is to learn a mapping from the partially-ordered data domain to some other partially-ordered space that will enable generalization.
They choose the target space Y to be a vector space, and the order ⪯_Y to be based on the reverse product order on R^n_+, which specifies
\[ x \preceq y \iff \forall i \in \{1..n\},\; x_i \ge y_i \]
so an embedding is below another in the hierarchy if all of the coordinates are larger, and 0 provides a top element. Although Vendrov et al. (2016) do not explicitly discuss it, their model does not just capture partial orderings, but is a standard construction of a vector (Hilbert) lattice, in which the operations of meet and join can be defined as taking the pointwise maximum and minimum of two vectors, respectively (Zaanen, 1997). This observation is also used by Li et al. (2017) to generate extra constraints for training order embeddings.
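The reverse product order and its lattice operations can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the function names and the example vectors are hypothetical.

```python
def oe_leq(x, y):
    """True if x is below y in the hierarchy: every coordinate of x is >= that of y."""
    return all(xi >= yi for xi, yi in zip(x, y))

def oe_meet(x, y):
    """Greatest lower bound under the reverse product order: pointwise maximum."""
    return [max(xi, yi) for xi, yi in zip(x, y)]

def oe_join(x, y):
    """Least upper bound under the reverse product order: pointwise minimum."""
    return [min(xi, yi) for xi, yi in zip(x, y)]

# Hypothetical 2-dimensional embeddings, chosen only for illustration.
animal = [0.5, 0.2]
dog = [1.5, 0.9]
cat = [0.8, 2.0]
top = [0.0, 0.0]  # the origin is above every point of the nonnegative orthant
```

With these vectors, `dog` lies below `animal`, while `dog` and `cat` are incomparable; their meet and join are still well defined.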
As noted in the original work, these single point embeddings can be thought of as regions, i.e. the cone extending out from the vector towards infinity. All concepts "entailed" by a given concept must lie in this cone.
This ordering is optimized from examples of ordered elements and negative samples via a max-margin loss.

Probabilistic Order Embeddings (POE)
Lai and Hockenmaier (2017) built on the "region" idea to derive a probabilistic formulation (which we will refer to as POE) to model entailment probabilities in a consistent, hierarchical way.
Noting that all of OE's regions obviously have the same infinite area under the standard (Lebesgue) measure of R^n_+, they propose a probabilistic interpretation where the Bernoulli probability of each concept a or joint set of concepts {a, b} with corresponding vectors {x, y} is given by its volume under the exponential measure:
\[ p(a) = \exp\Big(-\sum_i x_i\Big) = \int_{z \preceq x} \exp(-\|z\|_1)\, dz \]
\[ p(a, b) = \exp(-\|\max(x, y)\|_1) \]
since the meet of two vectors is simply the intersection of their area cones, and replacing sums with ℓ1 norms for brevity since all coordinates are positive. While having the intuition of measuring the areas of cones, this also automatically gives a valid probability distribution over concepts since this is just the product likelihood under a coordinatewise exponential distribution.
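As a small numerical sketch of these cone volumes (the function names and example vectors are hypothetical, not taken from the POE code):

```python
import math

def poe_marginal(x):
    """Volume of the forward cone of x under the coordinatewise exponential measure."""
    return math.exp(-sum(x))

def poe_joint(x, y):
    """Volume of the intersection of two cones, i.e. the cone of the pointwise maximum."""
    return math.exp(-sum(max(xi, yi) for xi, yi in zip(x, y)))

# Hypothetical embeddings: dog lies inside the cone of animal.
animal = [0.5, 0.2]
dog = [1.5, 0.9]
p_dog_and_animal = poe_joint(dog, animal)
```

Because `dog` is coordinatewise larger than `animal`, the joint equals `poe_marginal(dog)`, so the conditional probability of `animal` given `dog` is 1.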
However, they note a deficiency of their model: it can only model positive (Pearson) correlations between concepts (Bernoulli variables). Consider two Bernoulli variables a and b, whose probabilities correspond to the areas of cones x and y. Recall the Bernoulli covariance formula (we will deal with covariances instead of correlations when convenient, since they always have the same sign):
\[ \mathrm{cov}(a, b) = p(a, b) - p(a)\,p(b) = \exp(-\|\max(x, y)\|_1) - \exp(-\|x + y\|_1) \]
Since the sum of two positive vectors can only be greater than the sum of their pointwise maximum, this quantity will always be nonnegative. This has real consequences for probabilistic modeling in KBs: conditioning on more concepts will only make probabilities higher (or unchanged), e.g. p(dog|plant) ≥ p(dog).
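The consequence for conditioning can be checked directly. The sketch below uses made-up embeddings for "dog" and "plant"; any nonnegative vectors give the same inequality.

```python
import math

def poe_conditional(x, y):
    """p(a | b) = p(a, b) / p(b) under the exponential cone measure."""
    joint = math.exp(-sum(max(xi, yi) for xi, yi in zip(x, y)))
    return joint / math.exp(-sum(y))

# Hypothetical embeddings placed far apart in the orthant.
dog = [2.0, 0.5, 1.0]
plant = [0.3, 2.5, 0.1]

p_dog = math.exp(-sum(dog))
p_dog_given_plant = poe_conditional(dog, plant)
```

However the two vectors are placed, `p_dog_given_plant` never falls below `p_dog`, which is exactly the limitation described above.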

Probabilistic Asymmetric Transitive Relations
Probabilistic models have pleasing consistency properties for modeling asymmetric transitive relations, in particular compared to density-based embeddings: a pairwise conditional probability table can almost always (in the technical sense) be asymmetrized to produce a DAG by simply taking an edge if P(a|b) > P(b|a). A matrix of pairwise Gaussian KL divergences cannot be consistently asymmetrized in this manner. These claims are proven in Appendix C. While a high P(a|b) does not always indicate an edge in an ontology due to confounding variables, existing graphical model structure learning methods can be used to further prune on the base graph without adding a cycle, such as Graphical Lasso or simple thresholding (Fattahi and Sojoudi, 2017).

Method
We develop a probabilistic model for lattices based on hypercube embeddings that can model both positive and negative correlations. Before describing this, we first motivate our choice to abandon OE/POE type cone-based models for this purpose.

Correlations from Cone Measures
Claim. For a pair of Bernoulli variables p(a) and p(b), cov(a, b) ≥ 0 if the Bernoulli probabilities come from the volume of a cone as measured under any product (coordinate-wise) probability measure p(x) = ∏_i^n p_i(x_i) on R^n, where F_i, the associated CDF for p_i, is monotone increasing.
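Proof. For any product measure we have
\[ \int_{x_i < z_i < \infty} \prod_i p_i(z_i)\, dz = \prod_i \int_{x_i}^{\infty} p_i(z_i)\, dz_i = \prod_i \big(1 - F_i(x_i)\big). \]
This is just the volume of the box ∏_i^n [F_i(x_i), 1] ⊆ [0, 1]^n under the uniform measure. This box is unique, as a monotone increasing univariate CDF is bijective with (0, 1), so cones in R^n can be invertibly mapped to boxes of equivalent measure inside the unit hypercube [0, 1]^n. These boxes have only half their degrees of freedom, as they have the form [F_i(x_i), 1] per dimension (intuitively, they have one end "stuck at infinity" since the cone integrates to infinity).
So W.L.O.G. we can consider two transformed cones x and y corresponding to our Bernoulli variables a and b, and letting F_i(x_i) = u_i and F_i(y_i) = v_i, their intersection in the unit hypercube is ∏_i^n [max(u_i, v_i), 1]. Pairing terms in the right-hand product, we have
\[ p(a, b) - p(a)\,p(b) = \prod_i \big(1 - \max(u_i, v_i)\big) - \prod_i \big(1 - \max(u_i, v_i)\big)\big(1 - \min(u_i, v_i)\big) \ge 0 \]
since the right contains all the terms of the left and can only grow smaller. This argument is easily modified to the case of the nonnegative orthant, mutatis mutandis.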
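The reduction from cones to boxes can be illustrated numerically. The sketch below assumes a logistic CDF in every dimension purely as an example of a monotone increasing CDF on R; the function names are hypothetical.

```python
import math

def logistic_cdf(t):
    """An example monotone increasing CDF on the real line (an assumption of this sketch)."""
    return 1.0 / (1.0 + math.exp(-t))

def cone_to_box(x, cdf=logistic_cdf):
    """Map the cone {z : z_i > x_i} to the box prod_i [F(x_i), 1] in the unit hypercube."""
    return [(cdf(xi), 1.0) for xi in x]

def box_volume(box):
    """Volume of a box given as a list of (lo, hi) pairs, under the uniform measure."""
    return math.prod(hi - lo for lo, hi in box)

def cone_cov(x, y, cdf=logistic_cdf):
    """cov(a, b) = p(a, b) - p(a) p(b) for two cones under the product measure."""
    u = [cdf(xi) for xi in x]
    v = [cdf(yi) for yi in y]
    joint = math.prod(1.0 - max(ui, vi) for ui, vi in zip(u, v))
    return joint - box_volume(cone_to_box(x, cdf)) * box_volume(cone_to_box(y, cdf))
```

For instance, the cone at the origin in two dimensions maps to the box [0.5, 1] × [0.5, 1] with volume 0.25, and `cone_cov` stays nonnegative for any pair of points.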
An open question for future work is which non-product measures this claim also applies to. Note that some non-product measures, such as multivariate Gaussian, can be transformed into product measures easily (whitening) and the above proof would still apply. It seems probable that some measures, nonlinearly entangled across dimensions, could encode negative correlations in cone volumes. However, it is not generally tractable to integrate high-dimensional cones under arbitrary non-product measures.

Box Lattices
The above proof gives us intuition about the possible form of a better representation. Cones can be mapped into boxes within the unit hypercube while preserving their measure, and the lack of negative correlation seems to come from the fact that they always have an overly-large intersection due to "pinning" the maximum in each dimension to 1. To remedy this, we propose to learn representations in the space of all boxes (axis-aligned hyperrectangles), gaining back an extra degree of freedom. These representations can be learned with a suitable probability measure in R n , the nonnegative orthant R n + , or directly in the unit hypercube with the uniform measure, which we elect.
We associate each concept with 2 vectors, the minimum and maximum value of the box at each dimension. Practically, for numerical reasons these are stored as a minimum and a positive offset plus an ε term to prevent boxes from becoming too small and underflowing.
Let us define our box embeddings as a pair of vectors in [0, 1]^n, (x_m, x_M), representing the minimum and maximum at each coordinate.
Then we can define a partial ordering by inclusion of boxes, and a lattice structure as
\[ x \wedge y = \bot \ \text{ if } x \text{ and } y \text{ disjoint, else} \]
\[ x \wedge y = \prod_i \big[\max(x_{m,i}, y_{m,i}),\ \min(x_{M,i}, y_{M,i})\big] \]
\[ x \vee y = \prod_i \big[\min(x_{m,i}, y_{m,i}),\ \max(x_{M,i}, y_{M,i})\big] \]
where the meet is the intersecting box, or bottom (the empty set) where no intersection exists, and join is the smallest enclosing box. This lattice, considered on its own terms as a non-probabilistic object, is strictly more general than the order embedding lattice in any dimension, which is proven in Appendix B.
However, the finite sizes of all the lattice elements lead to a natural probabilistic interpretation under the uniform measure. Joint and marginal probabilities are given by the volume of the (intersection) box. For concept a with associated box (x_m, x_M), probability is simply p(a) = ∏_i^n (x_{M,i} − x_{m,i}) (under the uniform measure). p(⊥) is of course zero since no probability mass is assigned to the empty set.
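A minimal sketch of the box lattice operations and the resulting probabilities follows. Boxes are represented here as `(min_vector, max_vector)` pairs and bottom as `None`; these representation choices, the function names, and the example boxes are assumptions of the sketch.

```python
def volume(box):
    """Probability of a box under the uniform measure; None stands for bottom."""
    if box is None:
        return 0.0
    lo, hi = box
    vol = 1.0
    for l, h in zip(lo, hi):
        vol *= h - l
    return vol

def meet(x, y):
    """Intersecting box, or None (bottom) when the boxes do not overlap."""
    lo = [max(a, b) for a, b in zip(x[0], y[0])]
    hi = [min(a, b) for a, b in zip(x[1], y[1])]
    if any(l >= h for l, h in zip(lo, hi)):
        return None  # touching boxes are treated as disjoint (measure zero)
    return (lo, hi)

def join(x, y):
    """Smallest enclosing box of two non-bottom boxes."""
    lo = [min(a, b) for a, b in zip(x[0], y[0])]
    hi = [max(a, b) for a, b in zip(x[1], y[1])]
    return (lo, hi)

# Hypothetical 2-dimensional boxes.
a = ([0.0, 0.0], [0.5, 0.5])
b = ([0.25, 0.25], [0.75, 0.75])
c = ([0.6, 0.0], [1.0, 0.2])

p_a_given_b = volume(meet(a, b)) / volume(b)
```

Here `a` and `b` overlap in a box of volume 0.0625, giving P(a|b) = 0.25, while `a` and `c` are disjoint and so have joint probability 0.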
It remains to show that this representation can represent both positive and negative correlations.
Claim. For a pair of Bernoulli variables a and b whose probabilities come from the volumes of associated boxes in [0, 1]^n, corr(a, b) can take on any value in [−1, 1].
Proof. Boxes can clearly model disjointness (exactly −1 correlation if the total volume of the boxes equals 1). Two identical boxes give their concepts exactly correlation 1. The area of the meet is continuous with respect to translations of intersecting boxes, and all other terms in correlation stay constant, so by continuity of the correlation function our model can achieve all possible correlations for a pair of variables.
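The continuity argument can be seen in one dimension by sliding one interval of length 0.5 across another. This is an illustrative sketch; the function name and the chosen shifts are hypothetical.

```python
import math

def interval_corr(a, b):
    """Pearson correlation of two Bernoulli events given as intervals in [0, 1]."""
    p_a = a[1] - a[0]
    p_b = b[1] - b[0]
    overlap = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    cov = overlap - p_a * p_b
    return cov / math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))

a = (0.0, 0.5)
shifts = (0.0, 0.125, 0.25, 0.375, 0.5)
corrs = [interval_corr(a, (s, s + 0.5)) for s in shifts]
# corrs == [1.0, 0.5, 0.0, -0.5, -1.0]
```

Translating the second interval moves the correlation continuously from 1 (identical boxes) to −1 (disjoint boxes that together cover the whole space).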
This proof can be extended to boxes in R n with product measures by the previous reduction.
Limitations: Note that this model cannot perfectly describe all possible probability distributions or concepts as embedded objects. For example, the complement of a box is not a box. However, queries about complemented variables can be calculated by the Inclusion-Exclusion principle, made more efficient by the fact that all nonnegated terms can be grouped and calculated exactly. We show some toy exact calculations with negated variables in Appendix A. Also, note that in a knowledge graph true complements are often not required: for example, mortal and immortal are not actually complements, because the concept color is neither.
Additionally, requiring the total probability mass covered by boxes to equal 1, or exactly matching marginal box probabilities while modeling all correlations is a difficult box-packing-type problem and not generally possible. Modeling limitations aside, the union of boxes having mass < 1 can be seen as an open-world assumption on our KB (not all points in space have corresponding concepts, yet).

Learning
While inference (calculation of pairwise joint, unary marginal, and pairwise conditional probabilities) is quite straightforward by taking intersections of boxes and computing volumes (and their ratios), learning does not appear easy at first glance. While the (sub)gradient of the joint probability is well defined when boxes intersect, it is non-differentiable otherwise. Instead we optimize a lower bound.
Clearly p(a ∨ b) ≥ p(a ∪ b), with equality only when a ∪ b is itself a box (for instance when a = b), so this can give us a lower bound:
\[ p(a \wedge b) = p(a) + p(b) - p(a \cup b) \ge p(a) + p(b) - p(a \vee b), \]
where probabilities are always given by the volume of the associated box. This lower bound always exists and is differentiable, even when the joint is not. It is guaranteed to be nonpositive except when a and b intersect, in which case the true joint likelihood should be used.
While a negative bound on a probability is odd, inspecting the bound we see that its gradient will push the enclosing box to be smaller, while increasing areas of the individual boxes, until they intersect, which is a sensible learning strategy.
Since we are working with small probabilities it is advisable to negate this term and maximize the negative logarithm:
\[ -\log\big(p(a \vee b) - p(a) - p(b)\big). \]
This still has an unbounded gradient as the lower bound approaches 0, so it is also useful to add a constant within the logarithm function to avoid numerical problems.
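The surrogate objective for a pair of disjoint boxes can be sketched as follows. Only the value is computed here (no gradients); the function names, the box representation as `(min_vector, max_vector)`, and the constant `eps` are assumptions of this sketch rather than the authors' settings.

```python
import math

def volume(lo, hi):
    """Volume of the box with the given corner vectors under the uniform measure."""
    return math.prod(h - l for l, h in zip(lo, hi))

def disjoint_surrogate(a, b, eps=1e-6):
    """-log(p(a v b) - p(a) - p(b) + eps) for two disjoint boxes; to be maximized."""
    join_lo = [min(x, y) for x, y in zip(a[0], b[0])]
    join_hi = [max(x, y) for x, y in zip(a[1], b[1])]
    gap = volume(join_lo, join_hi) - volume(*a) - volume(*b)
    return -math.log(gap + eps)

# Hypothetical boxes: `near` is close to `a`, `far` sits in the opposite corner.
a = ([0.0, 0.0], [0.2, 0.2])
near = ([0.3, 0.0], [0.5, 0.2])
far = ([0.8, 0.8], [1.0, 1.0])
```

The surrogate is larger for `near` than for `far`, so maximizing it shrinks the enclosing box and grows the individual boxes until they intersect, at which point the true joint volume can be used.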
Since the likelihood of the full data is usually intractable to compute as a conjunction of many negations, we optimize binary conditional and unary marginal terms separately by maximum likelihood.
In this work, we parametrize the boxes as (min, ∆ = max − min), with Euclidean projections after gradient steps to keep our parameters in the unit hypercube and maintain the minimum/delta constraints. Now that we have the ability to compute probabilities and (surrogate) gradients for arbitrary marginals in the model, and by extension conditionals, we will see specific examples in the experiments.
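A simple way to keep (min, ∆) parameters feasible after a gradient step is coordinatewise clipping, sketched below. This is a heuristic stand-in and not necessarily the exact Euclidean projection used in the experiments; the function name and the `min_delta` default are assumptions.

```python
def project(box_min, box_delta, min_delta=1e-6):
    """Clip (min, delta) so that 0 <= min, delta >= min_delta, and min + delta <= 1."""
    new_min, new_delta = [], []
    for m, d in zip(box_min, box_delta):
        m = min(max(m, 0.0), 1.0 - min_delta)  # lower corner stays inside the cube
        d = min(max(d, min_delta), 1.0 - m)    # side length stays positive, upper corner <= 1
        new_min.append(m)
        new_delta.append(d)
    return new_min, new_delta

# Hypothetical parameters after a gradient step that left the feasible region.
mins, deltas = project([-0.2, 0.9], [0.5, 0.4])
```

In this example the first minimum is clipped up to 0, and the second side length is shortened so that the box ends at 1.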

Warmup: 2D Embedding of a Toy Lattice
We begin by investigating properties of our model in modeling a small toy problem, consisting of a small hand-constructed ontology over 19 concepts, aggregated from atomic synthetic examples first into a probabilistic lattice (e.g. some rabbits are brown, some are white), and then a full CPD. We model it using only 2 dimensions to enable visualization of the way the model self-organizes its "event space", training the model by minimizing weighted cross-entropy with both the unary marginals and pairwise conditional probabilities. We also conduct a parallel experiment with POE as embedded in the unit cube, where each representation is constrained to touch the faces x = 1, y = 1. In Figure 2, we show the representation of lattice structures by POE and the box lattice model as compared to the abstract probabilistic lattice used to construct the data, shown in Figure 1, and compare the conditional probabilities produced by our model to the ground truth, demonstrating the richer capacity of the box model in capturing strong positive and negative correlations. In Table 1, we perform a series of multivariable conditional queries and demonstrate intuitive results on high-order queries containing up to 4 variables, despite the model being trained on only 2-way information.

WordNet
We experiment on WordNet hypernym prediction, using the same train, development and test split as Vendrov et al. (2016), created by randomly taking 4,000 hypernym pairs from the 837,888-
Since our model is probabilistic, we would like a sensible value for P(n), where n is a node. We assign these marginal probabilities by looking at the number of descendants in the hierarchy under a node, and normalizing over all nodes, taking P(n) = |descendants(n)| / |nodes|. Furthermore, we use the graph structure (only of the subset of edges in the training set to avoid leaking data) to augment the data with approximate conditional probabilities P(x|y). For each leaf, we consider all of its ancestors as pairwise co-occurrences, then aggregate and divide by the number of leaves to get an approximate joint probability distribution, P(x, y) = |x, y co-occur in ancestor set| / |leaves|. With this and the unary marginals, we can create a conditional probability table, which we prune based on the difference of P(x|y) and P(y|x), and add cross entropy with these conditional "soft edges" to the training data. We refer to experiments using this additional data as Box + CPD in Table 3.
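The counting scheme for marginals and approximate joints can be sketched on a toy hierarchy. The graph below is made up, and two conventions are assumptions of this sketch: a node counts among its own descendants, and a leaf belongs to its own ancestor set.

```python
from collections import Counter
from itertools import combinations

# Hypothetical hierarchy: each node maps to its direct hypernyms (parents).
parents = {
    "animal": [], "mammal": ["animal"], "fish": ["animal"],
    "dog": ["mammal"], "cat": ["mammal"], "trout": ["fish"],
}

def ancestors(node):
    """All nodes reachable through parent links, including the node itself."""
    seen, stack = {node}, [node]
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

nodes = list(parents)
leaves = [n for n in nodes if not any(n in ps for ps in parents.values())]

# P(n) = |descendants(n)| / |nodes|
desc_count = Counter(a for n in nodes for a in ancestors(n))
marginal = {n: desc_count[n] / len(nodes) for n in nodes}

# P(x, y) = |leaves whose ancestor set contains both x and y| / |leaves|
pair_count = Counter()
for leaf in leaves:
    for pair in combinations(sorted(ancestors(leaf)), 2):
        pair_count[pair] += 1
joint = {pair: c / len(leaves) for pair, c in pair_count.items()}
```

Pairs that never share a leaf, such as ("fish", "mammal") here, simply do not appear in `joint`, which corresponds to an approximate joint probability of zero.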
We use 50 dimensions in our experiments. Since our model has 2 parameters per dimension, we also perform an apples-to-apples comparison with a 100D POE model. As seen in Table 3, we outperform POE significantly even with this added representational power. We also observe sensible negatively correlated examples, shown in Table 2, in the trained box model, while POE cannot represent such relationships. We tune our models on the development set, with parameters documented in Appendix D.1. We observe that not only does our model outperform POE, but both our model and the baseline POE model beat all previous results on WordNet, aside from the concurrent work of Athiwaratkun and Wilson (2018) (using different train/dev negative examples). This indicates that probabilistic embeddings for transitive relations are a promising avenue for future work. Additionally, the ability of the model to learn from the expected "soft edges" improves it to state-of-the-art level. We expect that co-occurrence counts gathered from real textual corpora, rather than merely aggregating up the WordNet lattice, would further strengthen this effect.

Flickr Entailment Graph
We conduct experiments on the large-scale Flickr entailment dataset of 45 million image caption pairs. We use exactly the same train/dev/test split as Lai and Hockenmaier (2017). We use slightly different unseen word pairs and unseen words test data, obtained from the author. We include their published results and also use their published code, marked *, for comparison.
For these experiments, we relax our boxes from the unit hypercube to the nonnegative orthant and obtain probabilities under the exponential measure, p(x) = exp(−x). We enforce the nonnegativity constraints by clipping the LSTM-generated embedding (Hochreiter and Schmidhuber, 1997) for the box minimum with a ReLU, and parametrize our ∆ embeddings using a softplus activation to prevent dead units. As in Lai and Hockenmaier (2017), we use 512 hidden units in our LSTM to compose sentence vectors. We then apply two single-layer feed-forward networks with 512 units applied to the final LSTM state to produce the embeddings.
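Under the exponential measure, the mass of a box in the nonnegative orthant has a closed form per dimension, exp(−min) − exp(−max). The sketch below applies the ReLU and softplus constraints to raw scores that would, in the actual model, come from the feed-forward networks on top of the LSTM; here they are plain hypothetical numbers, and the function names are not from the authors' code.

```python
import math

def relu(t):
    return max(t, 0.0)

def softplus(t):
    return math.log1p(math.exp(t))

def box_prob_exponential(raw_min, raw_delta):
    """Mass of the box [min, min + delta] under the product measure with density exp(-x)."""
    prob = 1.0
    for rm, rd in zip(raw_min, raw_delta):
        lo = relu(rm)             # box minimum, clipped to the nonnegative orthant
        hi = lo + softplus(rd)    # strictly positive side length
        prob *= math.exp(-lo) - math.exp(-hi)
    return prob

# Hypothetical raw network outputs for a 2-dimensional box.
p = box_prob_exponential([-1.0, 0.3], [0.0, 1.2])
```

Joint probabilities follow in the same way by first intersecting two boxes and then measuring the intersection.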
As Table 4 shows, we obtain large improvements in KL and Pearson correlation to the ground truth entailment probabilities. In further analysis, Figure 3 demonstrates that while the box model outperforms POE in nearly every regime, the highest gains come from the comparatively difficult to calibrate small entailment probabilities, indicating the greater capability of our model to produce fine-grained distinctions.

Conclusion and Future Work
We have only scratched the surface of possible applications. An exciting direction is the incorporation of multi-relational data for general knowledge representation and inference. Secondly, more complex representations, such as 2n-dimensional products of 2-dimensional convex polyhedra, would offer greater flexibility in tiling event space. Improved inference of the latent boxes, either through better optimization or through Bayesian approaches is another natural extension. Our greatest interest is in the application of this powerful new tool to the many areas where other structured embeddings have shown promise.

A Queries with Negated Variables
Section 4.2 mentions that although the complement of a box is not a box, queries involving negated variables can be calculated exactly with Inclusion-Exclusion, demonstrated in Table 5. While there are many more interesting and efficient approaches, we simply use the formula for calculating the volume of the union of hyperrectangles (a standard Inclusion-Exclusion formula). This is equivalent since the intersection of complements of boxes is the complement of the union of boxes. We first intersect all of the non-negated variables into one conjunction box, T. Writing F = ¬f_1, ¬f_2, ¬f_3, ... for the conjunction of the negated variables, we then calculate the volume of the union of T with the boxes of all the negated variables,
\[ v_1 = \mathrm{vol}(T \cup f_1 \cup f_2 \cup f_3 \ldots) = 1 - P(\neg T, \neg f_1, \neg f_2, \neg f_3, \ldots), \]
and the volume of just the negated variables' boxes,
\[ v_2 = \mathrm{vol}(f_1 \cup f_2 \cup f_3 \ldots) = 1 - P(\neg f_1, \neg f_2, \neg f_3, \ldots). \]
The probability of the query is
\[ v_1 - v_2 = P(F) - P(\neg T, F) = \big(P(T, F) + P(\neg T, F)\big) - P(\neg T, F) = P(T, F), \]
which was the original query.

Table 5:
Conditioned on                              P(deer | ...)
(nothing, i.e. P(deer))                     0.12
¬white                                      0.13
animal                                      0.50
¬white, animal                              0.54
¬white, animal, herbivore                   0.73
¬white, animal, herbivore, ¬rabbit          0.80
¬white, animal, ¬herbivore, ¬rabbit         0.00
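The v_1 − v_2 computation can be sketched with a brute-force Inclusion-Exclusion over subsets of boxes. Boxes are lists of (lo, hi) pairs per dimension; the function names and the two example boxes are hypothetical, and the subset enumeration is exponential in the number of boxes, so this is only suitable for small queries.

```python
from itertools import combinations

def intersect(boxes):
    """Intersection of axis-aligned boxes; None if it is empty."""
    out = []
    for dims in zip(*boxes):
        lo = max(d[0] for d in dims)
        hi = min(d[1] for d in dims)
        if lo >= hi:
            return None
        out.append((lo, hi))
    return out

def volume(box):
    if box is None:
        return 0.0
    vol = 1.0
    for lo, hi in box:
        vol *= hi - lo
    return vol

def union_volume(boxes):
    """Volume of a union of boxes by Inclusion-Exclusion over all non-empty subsets."""
    total = 0.0
    for k in range(1, len(boxes) + 1):
        sign = 1.0 if k % 2 else -1.0
        for subset in combinations(boxes, k):
            total += sign * volume(intersect(subset))
    return total

def query(positive, negated):
    """P(all positive boxes hold and no negated box holds); `positive` must be non-empty."""
    t = intersect(positive)
    if t is None:
        return 0.0
    return union_volume([t] + negated) - union_volume(negated)

# Hypothetical 2-dimensional boxes.
deer = [(0.0, 0.25), (0.0, 0.5)]
white = [(0.125, 0.5), (0.0, 0.5)]
p_deer_not_white = query([deer], [white])
p_deer_given_not_white = p_deer_not_white / (1.0 - union_volume([white]))
```

With these boxes, half of the deer box lies outside the white box, so P(deer, ¬white) = 0.0625.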

B Properties of the Box Lattice
In this section, we cover some technical details about the box lattice model and its properties especially as compared to the order embedding model.

B.1 Non-Distributivity
A lattice is called distributive if the following identity holds for all members x, y, z:
\[ x \wedge (y \vee z) = (x \wedge y) \vee (x \wedge z) \]
Claim. Order embeddings form a distributive lattice.
Proof. This is a standard result on vector lattices, shown in e.g. (Zaanen, 1997). A non-distributive lattice is a strictly more general object, capable of modeling more objects since it does not necessarily need to fulfill the above identity for all triples x, y, z.
Claim. The box lattice is non-distributive. This proves that the box lattice is a strict generalization of order embeddings, and not equivalent to order embeddings of any dimensionality. Additionally, our choice of an example containing disjoint elements hints at the importance of non-distributivity for our goal of modeling disjoint events.
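One possible counterexample to distributivity, constructed here for illustration and not necessarily the one used by the authors, already exists in one dimension: take an interval x lying in the gap between two disjoint intervals y and z. The representation of bottom as `None` and the function names are assumptions of the sketch.

```python
BOTTOM = None

def meet(x, y):
    """Intersection of two intervals, or BOTTOM if they are disjoint."""
    if x is BOTTOM or y is BOTTOM:
        return BOTTOM
    lo, hi = max(x[0], y[0]), min(x[1], y[1])
    return (lo, hi) if lo < hi else BOTTOM

def join(x, y):
    """Smallest enclosing interval; BOTTOM is the identity element."""
    if x is BOTTOM:
        return y
    if y is BOTTOM:
        return x
    return (min(x[0], y[0]), max(x[1], y[1]))

x, y, z = (0.4, 0.6), (0.0, 0.3), (0.7, 1.0)
lhs = meet(x, join(y, z))            # x itself, since y v z encloses x
rhs = join(meet(x, y), meet(x, z))   # BOTTOM, since x is disjoint from both
```

The two sides of the distributive identity differ: the left side is x, while the right side is bottom.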

B.2 Pseudocomplemented
A lattice is called pseudocomplemented if for every element x there exists a unique greatest element x* in the lattice that is disjoint from x, i.e. x ∧ x* = ⊥. The box lattice is almost always pseudocomplemented, aside from symmetry concerns (for example, a perfectly centered cube in the 2-dimensional box lattice of side length < 1 has 4 possible equally large pseudocomplements). However, any such symmetries can always be infinitesimally perturbed without breaking order structure, so the box lattice is pseudocomplemented in a measure-theoretic sense. These pseudocomplements can nevertheless be arbitrarily bad approximations of the true complement set of a box, with the worst case scenario coming from large, nearly-centered cubes.

C Asymmetrizing Score Matrices

C.1 Probabilistic Models
Assume we have a pairwise CPD between Bernoulli variables, and also have access to the unary marginals for each Bernoulli, and further that no unary marginals are exactly identical. If they are exactly identical, we can generate random independent Bernoulli parameters and their JPD, and take a small convex combination with that to infinitesimally perturb the statistics, so this proof is valid everywhere but on a set of measure 0 which we can approximate arbitrarily well.
Claim. If all unary marginals are distinct, taking the elements of the pairwise CPD, removing the diagonal, and deleting an entry if P(A|B) < P(B|A), that is if A_ij < A_ji, will result in a weighted adjacency matrix for an acyclic directed graph.
Proof. Order the variables x_1 ... x_n so that p(x_1) < p(x_2) < ... < p(x_n). Since P(x_i | x_j) = p(x_i, x_j) / p(x_j), whenever the joint is nonzero we have P(x_i | x_j) < P(x_j | x_i) if and only if p(x_i) < p(x_j). So with the variables so ordered, if we use the CPD to create an adjacency matrix with an edge C_ij = 1 if and only if p(x_i) < p(x_j), it will be upper triangular with 0 on the diagonal. This is a nilpotent matrix, which means it is the adjacency matrix of an acyclic graph. This can be easily seen since the entries of the k-th power of the adjacency matrix give the k-hop neighbors, and if this set eventually becomes empty, as in a nilpotent matrix, we have no cycles.
Since the labeling of our vertices is arbitrary, this means that our adjacency matrix created by the proposed asymmetrizing procedure is always acyclic since it is similar to an upper triangular matrix with 0s on the diagonal. This holds as long as the unary marginals can always be ordered (which they can be except on a set of measure zero); in practice the procedure seems to work even if this constraint is ignored.
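The asymmetrization procedure and an acyclicity check can be sketched as follows. The CPD here is built from made-up one-dimensional boxes with distinct marginals, and acyclicity is tested with Kahn's topological sort rather than matrix powers; both are choices of this sketch, as are the function names.

```python
def asymmetrize(cpd):
    """Keep entry (i, j) only when P(x_i | x_j) > P(x_j | x_i); drop the diagonal."""
    n = len(cpd)
    return [[cpd[i][j] if i != j and cpd[i][j] > cpd[j][i] else 0.0
             for j in range(n)] for i in range(n)]

def is_acyclic(adj):
    """Kahn's algorithm on the graph with an edge i -> j wherever adj[i][j] > 0."""
    n = len(adj)
    indeg = [sum(1 for i in range(n) if adj[i][j] > 0) for j in range(n)]
    queue = [j for j in range(n) if indeg[j] == 0]
    removed = 0
    while queue:
        i = queue.pop()
        removed += 1
        for j in range(n):
            if adj[i][j] > 0:
                indeg[j] -= 1
                if indeg[j] == 0:
                    queue.append(j)
    return removed == n

def overlap(a, b):
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

# Hypothetical events as intervals in [0, 1] with distinct marginals.
intervals = [(0.0, 0.8), (0.1, 0.5), (0.2, 0.4), (0.45, 0.75)]
# cpd[i][j] = P(x_i | x_j) = p(x_i, x_j) / p(x_j)
cpd = [[overlap(a, b) / (b[1] - b[0]) for b in intervals] for a in intervals]
dag = asymmetrize(cpd)
```

Each surviving entry links the event with the larger marginal to the one with the smaller marginal, so the resulting graph cannot contain a cycle.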

C.2 KL Divergences and Gaussian Embeddings
Assume the same setup as Section C.1, but the scores in the matrix come from (possibly thresholded if A_ij − A_ji < c) pairwise divergences between Gaussian embeddings.
Claim. There exist score matrices produced by the above procedure that do not lead to directed acyclic graphs when asymmetrized by deleting entries with A_ij < A_ji.
Proof. Consider the following set of 5 2-dimensional Gaussians with diagonal covariance: G_5 = N(x_5; [9, 3], diag([5, 9])) Applying asymmetrization and even pruning at a threshold of c = 1 (which is non-nilpotent and does affect edges) produces a cycle between nodes 5, 1, and 3. There are certain repeated numbers in the parameters, but this is not the cause of the issue. The parameters are whole numbers for ease of exposition; they were randomly generated, and many more examples can be created with arbitrary floating point numbers.
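A search for such counterexamples can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' procedure: the convention A[i, j] = KL(G_i || G_j), the reading of thresholding as "delete entry (i, j) when A_ij − A_ji < c", and all function names are hypothetical. Apart from the parameters of G_5, which are taken from the text, the Gaussians in the demo are made up, and no claim is made about whether they produce a cycle.

```python
import numpy as np


def kl_diag(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between Gaussians with diagonal covariance."""
    mu_p, var_p, mu_q, var_q = (
        np.asarray(v, dtype=float) for v in (mu_p, var_p, mu_q, var_q)
    )
    return 0.5 * np.sum(
        var_p / var_q + (mu_q - mu_p) ** 2 / var_q - 1.0 + np.log(var_q / var_p)
    )


def kl_score_matrix(mus, variances):
    """Assumed convention: a[i, j] = KL(G_i || G_j), with a zero diagonal."""
    n = len(mus)
    a = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                a[i, j] = kl_diag(mus[i], variances[i], mus[j], variances[j])
    return a


def asymmetrize(a, c=0.0):
    """Delete entry (i, j) when A_ij - A_ji < c (assumed reading of the threshold)."""
    a = np.asarray(a, dtype=float)
    keep = (a - a.T) >= c
    np.fill_diagonal(keep, False)
    return np.where(keep, a, 0.0)


def has_cycle(weights):
    """True iff some n-hop walk exists, i.e. the adjacency pattern is not nilpotent."""
    adj = (np.asarray(weights) > 0).astype(np.int64)
    walks = adj.copy()
    for _ in range(adj.shape[0] - 1):
        walks = ((walks @ adj) > 0).astype(np.int64)
    return bool(walks.any())


# G_5 is taken from the text; the other two Gaussians are hypothetical.
mus = [[9.0, 3.0], [2.0, 6.0], [4.0, 1.0]]
variances = [[5.0, 9.0], [3.0, 2.0], [7.0, 4.0]]
scores = kl_score_matrix(mus, variances)
print(has_cycle(asymmetrize(scores, c=1.0)))
```

Looping this over randomly drawn means and variances and recording the parameter sets for which `has_cycle` returns true is one way to look for examples of the kind described in the proof.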

C.3 Order Embeddings
We simulated many millions of random sets of order embedding parameters, and created pairwise graphs using the order embedding energy function, and were never able to find a cycle in the resulting asymmetrized graphs. We conjecture that this is because the order embedding energy is essentially a Lagrangian relaxation term penalizing the violation of a true partial order relation, but have not proven it.
Conjecture. Sets of Order Embeddings can be consistently asymmetrized into directed acyclic graphs according to the procedure in section C.1.

D.1 WordNet Parameters
Since the WordNet data has binary 0, 1 links instead of calibrated probabilities, and the negative links are found from random negative sampling, we constrain the delta embedding to not update for negative samples during optimization. We found this was effective in preventing random negative samples from decreasing the volume of the boxes and creating artificially disjoint pairs.
The WordNet parameters that achieved best performance on the development set (whose train set performance we reported) are:
batch size: 800
dimension: 50
edge loss weight: 1.0
unary loss weight: 9.0
learning rate: 0.001
minimum dimension delta size: 1e-6
dimension-max regularization weight: 0.005
optimizer: Adam
For WordNet training with additional soft CPD edges, we use the same parameters. We also perform pruning on the generated CPD file: we only include pairs (t_1, t_2) with probability ≥ 0.6 whose reverse pair (t_2, t_1) has probability ≤ 0.4.
We tune the batch size of the model between 800 and 40000, because a bigger batch size facilitates faster training. We also sweep over 1.0 to 9.0 for the edge loss weight and 9.0 to 1.0 for the unary loss weight. We tune the learning rate λ ∈ {0.001, 0.0001} and the minimum dimension delta size in {0.01, 0.001, 0.0001, 0.00001, 0.000001}. The dimension-max regularization uses an L1 penalty to encourage the upper bound of each box to be close to 1.0, which prevents collapse. We search over its weight in {0.0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5}.
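The dimension-max regularization described above can be written as a short function. This is a sketch under assumptions rather than the released implementation: the function name is hypothetical, the box upper bound is taken to be the min embedding plus the delta embedding, and the penalty is summed (not averaged) over boxes and dimensions.

```python
import numpy as np


def dimension_max_penalty(box_min, box_delta, weight=0.005):
    """L1 penalty pulling each box's upper bound (min + delta) toward 1.0.

    Summing over all boxes and dimensions is an assumption; the default
    weight is the best WordNet value reported above.
    """
    upper = np.asarray(box_min, dtype=float) + np.asarray(box_delta, dtype=float)
    return weight * np.abs(1.0 - upper).sum()


# Hypothetical 2-dimensional box with upper bound [1.0, 0.75].
print(dimension_max_penalty([[0.2, 0.5]], [[0.8, 0.25]]))
```

With weight 0.0, which is part of the search grid, the term vanishes and the model trains without this regularizer.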

D.2 Flickr Parameters
The Flickr parameters that achieved best performance on the development set (whose train set performance we reported) are:
batch size: 512
dropout: 0.5
unary loss weight: 8.0
edge loss weight: 2.0
learning rate: 0.0001
minimum dimension delta size: 1e-6
optimizer: Adam
The LSTM parameters are initialized with Glorot initialization (Glorot and Bengio, 2010), as are the weight and bias parameters for the feedforward networks that produce the box minimums. The network that produces the ∆ embedding is initialized from a uniform distribution over [15.0, 15.5]. We clip the min embeddings at zero (apply a ReLU), and apply a softplus to enforce the positivity and minimum dimension size constraints on the ∆ embeddings.
We also sweep over 1.0 to 9.0 for the edge loss weight and 9.0 to 1.0 for the unary loss weight, and tune the learning rate λ ∈ {0.001, 0.0001}. We tried Glorot initialization for the ∆ network as well, but since we wanted a high degree of overlap at the beginning of training, we simply swept over the uniform initialization ranges [5.0, 5.5], [10.0, 10.5] and [15.0, 15.5].
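The constraints on the box parameters described in this section can be illustrated with a small sketch. It is not the authors' network code: the function names are hypothetical, the raw inputs stand in for the outputs of the feedforward networks, and adding the minimum size after the softplus is one assumed way of enforcing the minimum dimension delta size.

```python
import numpy as np

MIN_DELTA = 1e-6  # minimum dimension delta size reported above


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)


def constrain_box(raw_min, raw_delta, min_delta=MIN_DELTA):
    """Map unconstrained network outputs to valid box parameters.

    The min embedding is clipped at zero with a ReLU. The delta embedding is
    passed through a softplus and offset by min_delta (an assumed way of
    enforcing the minimum side length).
    """
    box_min = np.maximum(np.asarray(raw_min, dtype=float), 0.0)
    box_delta = softplus(np.asarray(raw_delta, dtype=float)) + min_delta
    return box_min, box_delta


# Hypothetical raw outputs for one 3-dimensional box.
box_min, box_delta = constrain_box([-0.3, 0.2, 0.7], [-50.0, 0.0, 2.0])
print(box_min, box_delta)
```

Because the softplus is strictly positive for finite inputs and approximately the identity for large ones, large initial values for the ∆ network yield large boxes with a high degree of overlap at the start of training.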