Predicting Semantic Relations using Global Graph Properties

Semantic graphs, such as WordNet, are resources which curate natural language on two distinguishable layers. On the local level, individual relations between synsets (semantic building blocks) such as hypernymy and meronymy enhance our understanding of the words used to express their meanings. Globally, analysis of graph-theoretic properties of the entire net sheds light on the structure of human language as a whole. In this paper, we combine global and local properties of semantic graphs through the framework of Max-Margin Markov Graph Models (M3GM), a novel extension of the Exponential Random Graph Model (ERGM) that scales to large multi-relational graphs. We demonstrate how such global modeling improves performance on the local task of predicting semantic relations between synsets, yielding new state-of-the-art results on the WN18RR dataset, a challenging version of WordNet link prediction in which "easy" reciprocal cases are removed. In addition, the M3GM model identifies multi-relational motifs that are characteristic of well-formed lexical semantic ontologies.


Introduction
Semantic graphs, such as WordNet (Fellbaum, 1998), encode the structural qualities of language as a representation of human knowledge. On the local level, they describe connections between specific semantic concepts, or synsets, through individual edges representing relations such as hypernymy ('is-a') or meronymy ('is-part-of'); on the global level, they encode emergent regular properties in the induced relation graphs. Local properties have been subject to extensive study in recent years via the task of relation prediction, where individual edges are found based mostly on distributional methods that embed synsets and relations into a vector space (e.g. Socher et al., 2013; Bordes et al., 2013; Neelakantan et al., 2015). In contrast, while the structural regularity and significance of global aspects of semantic graphs are well-attested (Sigman and Cecchi, 2002), global properties have rarely been used in prediction settings. In this paper, we show how global semantic graph features can facilitate local tasks such as relation prediction.
To motivate this approach, consider the hypothetical hypernym graph fragments in Figure 1: in (a), the semantic concept (synset) 'catamaran' has a single hypernym, 'boat'. This is a typical property across a standard hypernym graph. In (b), the synset 'cat' has two hypernyms, an unlikely event. While a local relation prediction model might mistake the relation between 'cat' and 'boat' to be plausible, for whatever reason, a high-order graph-structure-aware model should be able to discard it based on the knowledge that a synset should not have more than one hypernym. In (c), an impossible situation arises: a cycle in the hypernym graph leads each of the participating synsets to be predicted by transitivity as its own hypernym, contrary to the relation's definition. However, a purely local model has no explicit mechanism for rejecting such an outcome.
In this paper, we examine the effect of global graph properties on the link structure via the WordNet relation prediction task. Our hypothesis is that features extracted from the entire graph can help constrain local predictions to structurally sound ones (Guo et al., 2007). Such features are often manifested as aggregate counts of small subgraph structures, known as motifs, such as the number of nodes with two or more outgoing edges, or the number of cycles of length 3. Returning to the example in Figure 1, each of these features will be affected when graphs (b) and (c) are evaluated, respectively.
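As a rough illustration of the two motif counts named above, the following Python sketch counts nodes with two or more outgoing edges and directed cycles of length 3. The edge lists are hypothetical stand-ins for the fragments of Figure 1 (the synsets in `graph_c` are invented for the example), and the function names are ours rather than part of any released implementation.

```python
from collections import defaultdict

def count_multi_out(edges, k=2):
    """Number of nodes with at least k outgoing edges."""
    out = defaultdict(set)
    for s, t in edges:
        out[s].add(t)
    return sum(1 for targets in out.values() if len(targets) >= k)

def count_3_cycles(edges):
    """Number of directed cycles u -> v -> w -> u, each counted once."""
    edge_set = set(edges)
    succ = defaultdict(set)
    for s, t in edge_set:
        succ[s].add(t)
    found = 0
    for u, v in edge_set:
        for w in succ.get(v, ()):
            if (w, u) in edge_set and len({u, v, w}) == 3:
                found += 1
    return found // 3  # each cycle is discovered once from each of its 3 edges

# Hypothetical hypernym edges (source -> hypernym), loosely following Figure 1.
graph_a = [("catamaran", "boat")]
graph_b = graph_a + [("cat", "feline"), ("cat", "boat")]
graph_c = [("vessel", "craft"), ("craft", "vehicle"), ("vehicle", "vessel")]

print(count_multi_out(graph_a), count_multi_out(graph_b))  # fragment (b) adds a 2-hypernym node
print(count_3_cycles(graph_b), count_3_cycles(graph_c))    # fragment (c) adds a 3-cycle
```

A model with negative weights on these two counts would penalize fragments (b) and (c) relative to (a), which is the behavior the hypothesis relies on.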
To estimate weights on local and global graph features, we build on the Exponential Random Graph Model (ERGM), a log-linear model over networks utilizing global graph features (Holland and Leinhardt, 1981). In ERGMs, the likelihood of a graph is computed by exponentiating a weighted sum of the features, and then normalizing over all possible graphs. This normalization term grows exponentially in the number of nodes, and in general cannot be decomposed into smaller parts. Approximations are therefore necessary to fit ERGMs on graphs with even a few dozen nodes, and the largest known ERGMs scale only to thousands of nodes (Schmid and Desmarais, 2017). This is insufficient for WordNet, which has on the order of $10^5$ nodes.
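To make the cost of normalization concrete, the sketch below computes an exact ERGM probability by brute force on a toy graph. The two features (edge count and number of 2-cycles) and the weights are illustrative assumptions, not the feature set or parameters used in this paper.

```python
import itertools
import math

def all_graphs(n):
    """Yield every directed graph on n nodes (no self-loops) as a frozenset of edges."""
    pairs = [(u, v) for u in range(n) for v in range(n) if u != v]
    for mask in itertools.product([0, 1], repeat=len(pairs)):
        yield frozenset(p for p, bit in zip(pairs, mask) if bit)

def features(graph):
    """Toy feature vector: (edge count, number of 2-cycles)."""
    two_cycles = sum(1 for (u, v) in graph if u < v and (v, u) in graph)
    return (len(graph), two_cycles)

def score(graph, theta):
    """Unnormalized score: exponentiated weighted sum of the features."""
    return math.exp(sum(w * x for w, x in zip(theta, features(graph))))

def ergm_probability(graph, theta, n):
    """Exact probability; the normalizer sums over 2 ** (n * (n - 1)) graphs."""
    z = sum(score(g, theta) for g in all_graphs(n))
    return score(frozenset(graph), theta) / z

theta = (0.5, -1.0)  # hypothetical weights
print(ergm_probability([(0, 1), (1, 2)], theta, n=3))
```

With three nodes the normalizer already sums over 64 graphs; with five nodes it sums over $2^{20}$, which is why exact computation is out of reach long before graphs approach the size of WordNet.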
We extend the ERGM framework in several ways. First, we replace the maximum likelihood objective with a margin-based objective, which compares the observed network against alternative networks; we call the resulting model the Max-Margin Markov Graph Model (M3GM), drawing on ideas from structured prediction (Taskar et al., 2004). The gradient of this loss is approximated by importance sampling over candidate negative edges, using a local relational model as a proposal distribution. The complexity of each epoch of estimation is thus linear in the number of edges, making it possible to scale up to the $10^5$ nodes in WordNet. Second, we address the multi-relational nature of semantic graphs, by incorporating a combinatorial set of labeled motifs. Finally, we link graph-level relational features with distributional information, by combining the M3GM with a dyad-level model over word sense embeddings.
We train M3GM as a re-ranker, which we apply to a strong local-feature baseline on the WN18RR dataset (Dettmers et al., 2018). This yields absolute improvements of 3-4 points on all commonly-used metrics. Model inspection reveals that M3GM assigns importance to features from all relations, and captures some interesting inter-relational properties that lend insight into the overall structure of WordNet.

Related Work
Relational prediction in semantic graphs. Recent approaches to relation prediction in semantic graphs generally start by embedding the semantic concepts into a shared space and modeling relations by some operator that induces a score for an embedding pair input. We use several of these techniques as base models (Nickel et al., 2011; Bordes et al., 2013; Yang et al., 2014); detailed description of these methods is postponed to Section 3.2. Socher et al. (2013) generalize over the approach of Nickel et al. (2011) by using a bilinear tensor which assigns multiple parameters for each relation; Shi and Weninger (2017) project the node embeddings in a translational model similar to that of Bordes et al. (2013); Dettmers et al. (2018) apply a convolutional neural network by reshaping synset embeddings to 2-dimensional matrices. None of these embedding-based approaches incorporate structural information; in general, improvements in embedding-based methods are expected to be complementary to our approach. Some recent works compose single edges into more intricate motifs, such as Guu et al. (2015), who define a task of path prediction and compose various functions to solve it. They find that compositionalized bilinear models perform best on WordNet. Minervini et al. (2017) train link-prediction models against an adversary that produces examples which violate structural constraints such as symmetry and transitivity.
Another line of work builds on local neighborhoods of relation interactions and automatic detection of relations from syntactically parsed text (Riedel et al., 2013). Schlichtkrull et al. (2017) use Graph Convolutional Networks to predict relations while considering high-order neighborhood properties of the nodes in question. In general, these methods aggregate information over local neighborhoods, but do not explicitly model structural motifs.
Our model introduces interaction features between relations (e.g., hypernyms and meronyms) for the goal of relation prediction. To our knowledge, this is the first time that relation interaction is explicitly modeled into a relation prediction task. Within the ERGM framework, Lu et al. (2010) train a limited set of combinatory path features for social network link prediction.
Scaling exponential random graph models. The problem of approximating the denominator of the ERGM probability has been an active research topic for several decades. Two common approximation methods exist in the literature. In Maximum Pseudolikelihood Estimation (MPLE; Strauss and Ikeda, 1990), a graph's probability is decomposed into a product of the probability for each edge, which in turn is computed based on the ERGM feature difference between the graph excluding the edge and the full graph. Monte Carlo Maximum Likelihood Estimation (MCMLE; Snijders, 2002) follows a sampling logic, where a large number of graphs is randomly generated from the overall space under the intuition that the sum of their scores would give a good approximation for the total score mass. The probability for the observed graph is then estimated following normalization conditioned on the sampling distribution, and its precision increases as more samples are gathered. Recent work found that applying a parametric bootstrap can increase the reliability of MPLE, while retaining its superiority in training speed (Schmid and Desmarais, 2017). Despite this result, we opted for an MCMLE-based approach for M3GM, mainly due to the ability to keep the number of edges constant in each sampled graph. This property is important in our setup, since local edge scores added or removed to the overall graph score can occasionally dominate the objective function, giving unintended importance to the overall edge count.

Max-Margin Markov Graph Models
Consider a graph $G = (V, E)$, where $V$ is a set of nodes and $E$ is a set of directed edges. The ERGM scoring function defines a probability over $\mathcal{G}^{|V|}$, the set of all graphs with $|V|$ nodes. This probability is defined as a log-linear function,
$$P_{\text{ERGM}}(G) \propto \psi_{\text{ERGM}}(G) = \exp\left(\theta^\top f(G)\right), \quad (1)$$
where $f$ is a feature function, from graphs to a vector of feature counts. Features are typically counts of motifs (small subgraph structures), as described in the introduction. The vector $\theta$ is the parameter to estimate.
In this section we discuss our adaptation of this model to the domain of semantic graphs, leveraging their idiosyncratic properties. Semantic graphs are composed of multiple relation types, which the feature space needs to accommodate; their nodes are linguistic constructs (semantic concepts) associated with complex interpretations, which can benefit the graph representation through incorporating their embeddings in $\mathbb{R}^d$ into a new scoring model. We then present our M3GM framework to perform reliable and efficient parameter estimation on the new model.

Graph Motifs as Features
Based on common practice in ERGM feature extraction (e.g., Morris et al., 2008), we select the following graph features as a basis:
• Total edge count;
• Number of cycles of length k, for k ∈ {2, 3};
• Number of nodes with exactly k outgoing (incoming) edges, for k ∈ {1, 2, 3};
• Number of nodes with at least k outgoing (incoming) edges, for k ∈ {1, 2, 3};
• Number of paths of length 2;
• Transitivity: the proportion of length-2 paths u → v → w where an edge u → w also exists.
Semantic graphs are multigraphs, where multiple relationships (hypernymy, meronymy, derivation, etc.) are overlaid atop a common set of nodes. For each relation $r$ in the relation inventory $R$, we denote its edge set as $E_r$, and redefine $E = \bigcup_{r \in R} E_r$, the union of all labeled edges. Some relations do not produce a connected graph, while others may coincide with each other frequently, possibly in regular but intricate patterns: for example, derivation relations tend to occur between synsets in the higher, more abstract levels of the hypernym graph. We represent this complexity by expanding the feature space to include relation-sensitive combinatory motifs. For each feature template from the basis list above, we extract features for all possible combinations of relation types existing in the graph. Depending on the feature type, these could be relation singletons, pairs, or triples; they may be order-sensitive or order-insensitive. For example:
• A combinatory 'transitivity' feature will be extracted for the proportion of paths
• A combinatory '2-outgoing' feature will be extracted for the number of nodes with exactly one derivation and one has part.
The number of features thus scales in $O(|R|^K)$ for a feature basis which involves up to $K$ edges in any feature, and so our 17 basis features (with $K = 3$) generate a combinatory feature set with roughly 3,000 features for the 11-relation version of WordNet used in our experiments (see Section 4.1).
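The growth of the combinatory feature set can be sketched by enumerating relation labelings for a single template. In the sketch below the relation names are placeholders, and which templates count as order-sensitive is our assumption: a transitivity-style motif is treated as an ordered triple, a '2-outgoing' motif as an unordered pair.

```python
from itertools import combinations_with_replacement, product

def labeled_instances(relations, arity, ordered):
    """All relation labelings for a motif template that involves `arity` edges."""
    if ordered:
        return list(product(relations, repeat=arity))
    return list(combinations_with_replacement(relations, arity))

relations = [f"r{i}" for i in range(11)]  # placeholder names for 11 relation types

ordered_triples = labeled_instances(relations, 3, ordered=True)    # 11 ** 3 labelings
unordered_pairs = labeled_instances(relations, 2, ordered=False)   # C(12, 2) labelings
print(len(ordered_triples), len(unordered_pairs))
```

A single order-sensitive template over three edges thus contributes on the order of a thousand features for 11 relations, which is consistent with a total in the low thousands once the smaller templates are added.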

Local Score Component
In classical ERGM application domains such as social media or biological networks, nodes tend to have little intrinsic distinction, or at least little meaningful intrinsic information that may be extracted prior to applying the model. In semantic graphs, however, the nodes represent synsets, which are associated with information that is both valuable to predicting the graph structure and approximable using unsupervised techniques such as embedding into a common d-dimensional vector space based on copious amounts of available data. We thus modify the traditional scoring function from eq. (1) to include node-specific information, by introducing a relation-specific association operator $A^{(r)} : V \times V \to \mathbb{R}$:
$$\psi_{\text{ERGM+}}(G) = \exp\Big(\theta^\top f(G) + \sum_{r \in R} \sum_{(s,t) \in E_r} A^{(r)}(s, t)\Big).$$
The association operator generalizes various models from the relation prediction literature:
• TransE (Bordes et al., 2013) embeds each relation $r$ into a vector $e_r$ in the shared space, representing a 'difference' between sources and targets, to compute the association score under a translational objective, $A^{(r)}_{\text{TransE}}(s, t) = -\lVert e_s + e_r - e_t \rVert$;
• BiLin (Nickel et al., 2011) embeds relations into full-rank matrices $W_r$, computing the score by a bilinear multiplication, $A^{(r)}_{\text{BiLin}}(s, t) = e_s^\top W_r e_t$;
• DistMult (Yang et al., 2014) is a special case of BiLin where the relation matrices are diagonal, reducing the computation to a ternary dot product, $A^{(r)}_{\text{DistMult}}(s, t) = \langle e_s, e_r, e_t \rangle$.
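The three association operators described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in our experiments; in particular, the use of the Euclidean norm in the translational score is an assumption, and the toy vectors are invented.

```python
import math

def transe(e_s, e_r, e_t):
    """Translational score: negative Euclidean distance between e_s + e_r and e_t."""
    return -math.sqrt(sum((s + r - t) ** 2 for s, r, t in zip(e_s, e_r, e_t)))

def bilin(e_s, W_r, e_t):
    """Bilinear score e_s^T W_r e_t, with W_r given as a list of rows."""
    return sum(s * sum(w * t for w, t in zip(row, e_t)) for s, row in zip(e_s, W_r))

def distmult(e_s, e_r, e_t):
    """Ternary dot product; equal to bilin with W_r = diag(e_r)."""
    return sum(s * r * t for s, r, t in zip(e_s, e_r, e_t))

# Toy 2-dimensional embeddings (hypothetical values).
e_s, e_r, e_t = [1.0, 2.0], [0.5, -1.0], [1.5, 1.0]
W_diag = [[0.5, 0.0], [0.0, -1.0]]  # diagonal matrix built from e_r

print(transe(e_s, e_r, e_t))    # e_s + e_r equals e_t here, so the distance is zero
print(bilin(e_s, W_diag, e_t))  # matches the DistMult score below
print(distmult(e_s, e_r, e_t))
```

The example makes the nesting explicit: DistMult uses d parameters per relation, whereas BiLin uses d × d.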

Parameter Estimation
The probabilistic formulation of ERGM requires the computation of a normalization term that sums over all possible graphs with a given number of nodes, $\mathcal{G}^N$. The set of such graphs grows at a rate that is super-exponential in the number of nodes, making exact computation intractable even for networks that are orders of magnitude smaller than semantic graphs like WordNet. One solution is to approximate probability using a variant of the Monte Carlo Maximum Likelihood Estimation (MCMLE) procedure,
$$\log P(G) \approx \log \psi_{\text{ERGM+}}(G) - \log \Big( \frac{|\mathcal{G}^{|V|}|}{M} \sum_{m=1}^{M} \psi_{\text{ERGM+}}(\tilde{G}^{(m)}) \Big),$$
where $M$ is the number of networks $\tilde{G}$ sampled from $\mathcal{G}^{|V|}$, the space of all (multirelational) edge sets on nodes $V$. Each $\tilde{G}$ is referred to as a negative sample, and the goal of estimation is to assign low scores to these samples, in comparison with the score assigned to the observed network $G$.
Network samples can be obtained using edgewise negative sampling. For each edge $s \xrightarrow{r} t$ in the training network $G$, we remove it temporarily and consider $T$ alternative edges, keeping the source $s$ and relation $r$ constant, and sampling a target $\tilde{t}$ from a proposal distribution $Q$. Every such substitution produces a new graph $\tilde{G}$,
$$\tilde{G} = G \cup \{s \xrightarrow{r} \tilde{t}\} \setminus \{s \xrightarrow{r} t\}. \quad (5)$$
Large-margin objective. Rather than approximating the log probability, as in MCMLE estimation, we propose a margin loss objective: the log score for each negative sample $\tilde{G}$ should be below the log score for $G$ by a margin of at least 1. This motivates the hinge loss,
$$L(\Theta, \tilde{G}; G) = \big(1 - \log \psi_{\text{ERGM+}}(G) + \log \psi_{\text{ERGM+}}(\tilde{G})\big)_+,$$
where $(x)_+ = \max(0, x)$. Recall that the scoring function $\psi_{\text{ERGM+}}$ includes both the local association score for the alternative edge and the global graph features for the resulting graph. However, it is not necessary to recompute all association scores; we need only subtract the association score for the deleted edge $s \xrightarrow{r} t$, and add the association score for the sampled edge $s \xrightarrow{r} \tilde{t}$. The overall loss function is the sum over $N = |E| \times T$ negative samples, $\{\tilde{G}^{(i)}\}_{i=1}^{N}$, plus an $L_2$ regularizer on the model parameters $\Theta$,
$$L(\Theta; G) = \lambda \lVert \Theta \rVert_2^2 + \sum_{i=1}^{N} L(\Theta, \tilde{G}^{(i)}; G).$$
Proposal distribution. The proposal distribution $Q$ used to sample negative edges is defined to be proportional to the local association scores $A^{(r)}(s, \tilde{t})$ of edges not present in the training graph:
$$Q(\tilde{t} \mid s, r, G) \propto \begin{cases} 0 & \text{if } s \xrightarrow{r} \tilde{t} \in G, \\ A^{(r)}(s, \tilde{t}) & \text{if } s \xrightarrow{r} \tilde{t} \notin G. \end{cases}$$
By preferring edges that have high association scores, the negative sampler helps push the M3GM parameters away from likely false positives.
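The estimation procedure described in this section (edgewise negative sampling, a unit-margin hinge loss, and an L2-regularized sum over negative samples) can be sketched as follows. This is an illustrative sketch, not our DyNet implementation: the `assoc` callable is hypothetical and is assumed to return strictly positive weights, so raw scores that may be negative (such as a translational score) would first need a positive transform.

```python
import random

def hinge_loss(log_score_observed, log_score_negative, margin=1.0):
    """Zero once the negative graph scores at least `margin` below the observed graph."""
    return max(0.0, margin - log_score_observed + log_score_negative)

def total_loss(log_score_observed, negative_log_scores, theta, lam):
    """Sum of hinge losses over negative samples plus an L2 penalty on the weights."""
    reg = lam * sum(w * w for w in theta)
    return reg + sum(hinge_loss(log_score_observed, s) for s in negative_log_scores)

def sample_negative_targets(source, relation, nodes, edges, assoc, T, rng):
    """Draw T alternative targets for (source, relation), with probability
    proportional to the association weight of edges absent from the graph."""
    candidates = [t for t in nodes if (source, relation, t) not in edges]
    weights = [assoc(source, relation, t) for t in candidates]  # assumed positive
    return rng.choices(candidates, weights=weights, k=T)

# Toy usage with a uniform (hypothetical) association function.
edges = {("cat", "hypernym", "feline")}
nodes = ["cat", "feline", "boat", "animal"]
negatives = sample_negative_targets(
    "cat", "hypernym", nodes, edges, lambda s, r, t: 1.0, T=5, rng=random.Random(0)
)
print(negatives)
print(total_loss(5.0, [3.0, 4.5], theta=[1.0, 2.0], lam=0.01))
```

Because each training edge yields exactly T substitutions, one pass over the data evaluates |E| × T hinge terms, which is the source of the linear per-epoch cost.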

Relation Prediction
We evaluate M3GM on the relation graph edge prediction task. Data for this task consists of a set of labeled edges, i.e. tuples of the form (s, r, t), where s and t denote source and target entities, respectively. Given an edge from an evaluation set, two prediction instances are created by hiding the source and target side, in turn. The predictor is then evaluated on its ability to predict the hidden entity, given the other entity and the relation type.

WN18RR Dataset
A popular relation prediction dataset for WordNet is the subset curated as WN18 (Bordes et al., 2013, 2014), containing 18 relations for about 41,000 synsets extracted from WordNet 3.0. It has been noted that this dataset suffers from considerable leakage: edges from reciprocal relations such as hypernym / hyponym appear in one direction in the training set and in the opposite direction in dev / test (Socher et al., 2013; Dettmers et al., 2018). This allows trivial rule-based baselines to achieve high performance. To alleviate this concern, Dettmers et al. (2018) released the WN18RR set, removing seven relations altogether. However, even this dataset retains four symmetric relation types: also see, derivationally related form, similar to, and verb group. These symmetric relations can be exploited by defaulting to a simple rule-based predictor.

Metrics
We report the following metrics, common in ranking tasks and in relation prediction in particular: MR, the Mean Rank of the desired entity; MRR, Mean Reciprocal Rank, the main evaluation metric; and H@k, the proportion of Hits (true entities) found in the top k of the lists, for k ∈ {1, 10}. Unlike some prior work, we do not type-restrict the possible relation predictions (so, e.g., a verb group link may select a noun, and that would count against the model).
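For concreteness, the metrics above can be computed from the list of 1-based ranks assigned to the gold entities, as in the following sketch (the function name and the example ranks are illustrative).

```python
def ranking_metrics(ranks, ks=(1, 10)):
    """Compute MR, MRR and H@k from the 1-based ranks of the gold entities."""
    n = len(ranks)
    metrics = {
        "MR": sum(ranks) / n,
        "MRR": sum(1.0 / r for r in ranks) / n,
    }
    for k in ks:
        metrics[f"H@{k}"] = sum(1 for r in ranks if r <= k) / n
    return metrics

# Four hypothetical prediction instances whose gold entities were ranked 1, 2, 10 and 20.
print(ranking_metrics([1, 2, 10, 20]))
```

The example shows why MR can diverge from the other metrics: a single badly ranked instance dominates the mean rank, while its effect on MRR and H@k is bounded.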

Systems
We evaluate a single-rule baseline, three association models, and two variants of the M3GM re-ranker trained on top of the best-performing association baseline.

RULE
We include a single-rule baseline that predicts a relation between s and t in the evaluation set if the same relation was encountered between t and s in the training set. All other models revert to this baseline for the four symmetric relations.
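The rule amounts to a reverse lookup over the training edges. A minimal sketch, with hypothetical argument names and toy data:

```python
def rule_predict(train_edges, relation, known, hidden_side):
    """Predict the hidden entity by reversing training edges of the same relation."""
    if hidden_side == "target":
        # Query (known, relation, ?): return t such that (t, relation, known) was seen.
        return {s for (s, r, t) in train_edges if r == relation and t == known}
    # Query (?, relation, known): return s such that (known, relation, s) was seen.
    return {t for (s, r, t) in train_edges if r == relation and s == known}

# Toy training set with one symmetric-relation edge (invented synset names).
train = {("run.v.01", "verb_group", "run.v.02")}
print(rule_predict(train, "verb_group", "run.v.02", "target"))
print(rule_predict(train, "verb_group", "run.v.01", "source"))
```

The rule returns an empty set whenever the reverse edge was not observed in training, so it makes no prediction for such instances.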

Association Models
The next group of systems compute local scores for entity-relation triplets. They all encode entities into embeddings e. Each of these systems, in addition to being evaluated as a baseline, is also used for computing association scores in M3GM, both in the proposal distribution (see Section 3.3) and for creating lists to be re-ranked (see below): TRANSE, BILIN, DISTMULT. For detailed descriptions, see Section 3.2.

Max-Margin Markov Graph Model
The M3GM is applied as a re-ranker. For each relation and source (target), the top K candidate targets (sources) are retrieved based on the local association scores. Each candidate edge is introduced into the graph, and the score $\psi_{\text{ERGM+}}(G)$ is used to re-rank the top-K list. We add a variant to this protocol where the graph score and association score are weighted by α and 1 − α, respectively, before being summed. We tune a separate $\alpha_r$ for each relation type, using the development set's mean reciprocal rank (MRR). These hyperparameter values offer further insight into where the M3GM signal benefits relation prediction most (see Section 6).
Since we do not apply the model to the symmetric relations (scored by the RULE baseline), they are excluded from the sampling protocol described in eq. (5), although their edges do contribute to the combinatory graph feature vector f.
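The weighted re-ranking variant and the per-relation tuning of α can be sketched as below. The data layout (one tuple of candidates, association scores, graph scores, and gold entity per development query) and the assumption that the gold entity appears in every candidate list are ours, made to keep the example short.

```python
def rerank(candidates, assoc_scores, graph_scores, alpha):
    """Order candidates by alpha * graph score + (1 - alpha) * association score."""
    combined = {
        c: alpha * graph_scores[c] + (1 - alpha) * assoc_scores[c] for c in candidates
    }
    return sorted(candidates, key=lambda c: combined[c], reverse=True)

def tune_alpha(dev_queries, grid):
    """Pick the alpha in `grid` with the best development-set MRR for one relation."""
    def mrr(alpha):
        total = 0.0
        for candidates, assoc, graph, gold in dev_queries:
            ranked = rerank(candidates, assoc, graph, alpha)
            total += 1.0 / (ranked.index(gold) + 1)  # assumes gold is a candidate
        return total / len(dev_queries)
    return max(grid, key=mrr)

# One toy query where the local model prefers "x" but the graph score prefers "y".
candidates = ["x", "y"]
assoc = {"x": 1.0, "y": 0.0}
graph = {"x": 0.0, "y": 1.0}
print(rerank(candidates, assoc, graph, alpha=0.9))
print(tune_alpha([(candidates, assoc, graph, "y")], grid=[0.1, 0.9]))
```

A tuned value close to 1 indicates that the relation benefits mostly from the graph score, and a value close to 0 that the local association score is already sufficient.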
Our default setting backpropagates loss into only the graph weight vector θ. We experiment with a model variant which backpropagates into the association model and synset embeddings as well.

Synset Embeddings
For the association component of our model, we require embedding representations for WordNet synsets. While unsupervised word embedding techniques go a long way in representing wordforms (Collobert et al., 2011; Mikolov et al., 2013; Pennington et al., 2014), they are not immediately applicable to the semantically-precise domain of synsets. We explore two methods of transforming pre-trained word embeddings into synset embeddings.
Averaging. A straightforward way of using word embeddings to create synset embeddings is to collect the words representing the synset as surface form within the WordNet dataset and average their embeddings (Socher et al., 2013). We apply this method to pre-trained GloVe embeddings (Pennington et al., 2014) and pre-trained FastText embeddings (Bojanowski et al., 2017), averaging over the set of all wordforms in all lemmas for each synset, and performing a case-insensitive query on the embedding dictionary. For example, the synset 'determine.v.01' lists the following lemmas: 'determine', 'find', 'find out', 'ascertain'. Its vector is initialized as $\frac{1}{5}(e_{\text{determine}} + 2 \cdot e_{\text{find}} + e_{\text{out}} + e_{\text{ascertain}})$.
AutoExtend retrofitting + Mimick. AutoExtend is a method developed specifically for embedding WordNet synsets (Rothe and Schütze, 2015), in which pre-trained word embeddings are retrofitted to the tripartite relation graph connecting wordforms, lemmas, and synsets. The resulting synset embeddings occupy the same space as the word embeddings. However, some WordNet senses are not represented in the underlying set of pre-trained word embeddings. To handle these cases, we trained a character-based model called MIMICK, which learns to predict embeddings for out-of-vocabulary items based on their spellings (Pinter et al., 2017). We do not modify the spelling conventions of WordNet synsets before passing them to Mimick, so e.g. 'mask.n.02' (the second synset corresponding to 'mask' as a noun) acts as the input character sequence as is.
Random initialization. In preliminary experiments, we attempted training the association models using randomly-initialized embeddings. These proved to be substantially weaker than distributionally-informed embeddings and we do not report their performance in the results section. We view this finding as strong evidence to support the necessity of a distributional signal in a type-level semantic setup.
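The averaging scheme can be sketched as follows, reproducing the 'determine.v.01' example with toy 2-dimensional vectors. The embedding values are invented, and skipping wordforms that are missing from the dictionary is our assumption about how gaps are handled.

```python
def synset_vector(lemmas, embeddings):
    """Average the embeddings of all wordforms in all lemmas (case-insensitive)."""
    wordforms = [w.lower() for lemma in lemmas for w in lemma.replace("_", " ").split()]
    vectors = [embeddings[w] for w in wordforms if w in embeddings]
    if not vectors:
        return None  # no wordform is covered; the caller needs another initialization
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Toy embedding dictionary with lowercase keys (hypothetical values).
toy_embeddings = {
    "determine": [1.0, 0.0],
    "find": [0.0, 1.0],
    "out": [1.0, 1.0],
    "ascertain": [2.0, 2.0],
}
lemmas = ["determine", "find", "find out", "ascertain"]
print(synset_vector(lemmas, toy_embeddings))
```

Since 'find' occurs in two lemmas, it contributes twice to the five-way average, matching the weighting in the example above.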

Setup
Following tuning experiments, we train the association models on synset embeddings with d = 300, using a negative log-likelihood loss function over 10 negative samples and iterating over symmetric relations once every five epochs. We optimize the loss using AdaGrad with η = 0.01, and perform early stopping based on the development set mean reciprocal rank. M3GM is trained in four epochs using AdaGrad with η = 0.1. We set M3GM's rerank list size K = 100 and, following tuning, the regularization parameter λ = 0.01 and negative sample count per edge T = 10. Our models are all implemented in DyNet (Neubig et al., 2017).

Results
Table 1 presents the results on the development set. Lines 1-3 depict the results for local models using averaged FastText embedding initialization, showing that the best performance in terms of MRR and top-rank hits is achieved by TRANSE. Mean Rank does not align with the other metrics; this is an interpretable tradeoff, as both BILIN and DISTMULT have an inherent preference for correlated synset embeddings, giving a stronger fallback for cases where the relation embedding is completely off, but allowing less freedom for separating strong cases from correlated false positives, compared to a translational objective.
Effect of global score. There is a clear advantage to re-ranking the top local candidates using the score signal from the M3GM model (line 4). These results are further improved when the graph score is weighted against the association component per relation (line 5). We obtain similar improvements when re-ranking the predictions from DISTMULT and BILIN.
Table 2: Main results on test set. † These models were not re-implemented, and are reported as in Nguyen et al. (2018) and in Dettmers et al. (2018).
The M3GM training procedure is not useful in fine-tuning the association model via backpropagation: this degrades the association scores for true edges in the evaluation set, dragging the re-ranked results along with them to about a 2-point drop relative to the untuned variant. Table 2 shows that our main results transfer onto the test set, with even a slightly larger margin. This could be the result of the greater edge density of the combined training and dev graphs, which enhances the global coherence of the graph structure captured by M3GM features. To support this theory, we tested the M3GM model trained on only the training set, and its test set performance was roughly one point worse on all metrics, as compared with the model trained on the training+dev data.
Synset embedding initialization. We trained association models initialized on AutoExtend+Mimick vectors (see Section 4.4). Their performance, inferior to averaged FastText vectors by about 1-2 MRR points on the dev set, is somewhat at odds with findings from previous experiments on WordNet (Guu et al., 2015). We believe the decisive factor in our result is the size of the training corpus used to create FastText embeddings, along with the increase in resulting vocabulary coverage. Out of 124,819 lemma tokens participating in 41,105 synsets, 118,051 had embeddings available (94.6%; type-level coverage 88.1%). Only 530 synsets (1.3%) finished this initialization process with no embedding and were assigned random vectors. AutoExtend, fit for embeddings from Mikolov et al. (2013) which were trained on a smaller corpus, offers a weaker signal: 13,377 synsets (32%) had no vector and needed Mimick initialization.

Graph Analysis
Following the empirical evaluation, we examine what M3GM has learned about WordNet. Table 3 presents a sample of top-weighted motifs. Lines 1 and 2 demonstrate that the model prefers a broad scattering of targets for the member meronym and has part relations (e.g., 'America' → 'American' for member meronym and 'face' → 'mouth' for has part), which are flat and top-downwards hierarchical, respectively, while line 4 shows that a multitude of unique hypernyms is undesired, as expected from a bottom-upwards hierarchical relation. Line 5 enforces the asymmetry of the hypernym relation. Lines 3, 6, and 7 hint at deeper interactions between the different relation types. Line 3 shows that the model assigns positive weights to hypernyms which have derivationally-related forms, suggesting that the derivational equivalence classes in the graph tend to exist in the higher, more abstract levels of the hypernym hierarchy, as noted in Section 3.1. Line 6 captures a semantic conflict: synsets located in the lower, specific levels of the graph can be specified either as instances of abstract concepts (e.g., the instance hypernym edge 'Rome' → 'national capital'), or as members of less specific concrete classes, but not as both. Line 7 may have captured a nodal property: since part of is a relation which holds between nouns, and verb group holds between verbs, this negative weight assignment may be the manifestation of a part-of-speech uniqueness constraint. In addition, in features 3 and 7 we see the importance of symmetric relations (here derivationally related form and verb group, respectively), which manage to be represented in the graph model despite not being directly trained on.
Table 4 presents examples of relation targets successfully re-ranked thanks to these features. The first false connection created a new unique hypernym, 'garden lettuce', downgraded by the graph score through incrementing the count of negatively-weighted feature 4. In the second case, 'vienna' was brought from rank 10 to rank 1 since it incremented the count for the positively-weighted feature 2, whereas all targets ranked above it by the local model were already has parts, mostly of 'europe'.
The α r values weighing the importance of M3GM scores in the overall function, found per relation through grid search over the development set, are presented in Table 5. It appears that for all but two relations, the best-performing model preferred the signal from the graph features to that from the association model (α r > 0.5). Based on the surface properties of the different relation graphs, the decisive factor seems to be that synset domain topic of and has part pertain mostly to very common concepts, offering good local signal from the synset embeddings, whereas the rest include many long-tail, low-frequency synsets that require help from global features to detect regularity.

Conclusion
This paper presents a novel method for reasoning about semantic graphs like WordNet, combining the distributional coherence between individual entity pairs with the structural coherence of network motifs. Applied as a re-ranker, this method substantially improves performance on link prediction. Our analysis of results from Table 3, lines 6 and 7, suggests that adding graph motifs which qualify their adjacent nodes in terms of syntactic function or semantic category may prove useful.
From a broader perspective, M3GM can do more as a probabilistic model than predict individual edges. For example, consider the problem of linking a new entity into a semantic graph, given only the vector embedding. This task involves adding multiple edges simultaneously, while maintaining structural coherence. Our model is capable of scoring bundles of new edges, and in future work, we plan to explore the possibility of combining M3GM with a search algorithm, to automatically extend existing knowledge graphs by linking in one or more new entities.
We also plan to explore multilingual applications. To some extent, the structural parameters estimated by M3GM are not specific to English: for example, hypernymy cannot be symmetric in any language. If the structural parameters estimated from English WordNet are transferable to other languages, then the combination of M3GM and multilingual word embeddings could facilitate the creation and extension of large-scale semantic resources across many languages (Fellbaum and Vossen, 2012; Bond and Foster, 2013; Lafourcade, 2007).