Distributed Prediction of Relations for Entities: The Easy, The Difficult, and The Impossible

Word embeddings are supposed to provide easy access to semantic relations such as “male of” (man–woman). While this claim has been investigated for concepts, little is known about the distributional behavior of relations of (Named) Entities. We describe two word embedding-based models that predict values for relational attributes of entities, and analyse them. The task is challenging, with major performance differences between relations. In contrast to many NLP tasks, a relation's difficulty does not result from low frequency, but from (a) one-to-many mappings; and (b) the lack of context patterns expressing the relation that word embeddings can easily pick up.


Introduction
A central claim about distributed models of word meaning (e.g., Mikolov et al. (2013)) is that word embedding space provides easy access to semantic relations. For example, Mikolov et al.'s space was shown to encode the "male-female relation" linearly, as a vector offset (v(man) − v(woman) = v(king) − v(queen), where v(·) denotes a word's embedding). The accessibility of semantic relations was subsequently examined in more detail. Rei and Briscoe (2014) and Melamud et al. (2014) reported successful modeling of lexical relations such as hypernymy and synonymy. Köper et al. (2015) considered a broader range of relationships, with mixed results. Levy and Goldberg (2014b) developed an improved, nonlinear relation extraction method.
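As an illustration of this analogy arithmetic, the lookup can be sketched in a few lines of NumPy. The toy vectors below are invented for illustration and are not the actual Google News embeddings; only the offset structure matters.

```python
import numpy as np

def analogy(a, b, c, vectors):
    """Return the word closest (by cosine) to v(b) - v(a) + v(c),
    excluding the three query words, e.g. king - man + woman -> queen."""
    target = vectors[b] - vectors[a] + vectors[c]
    def cos(u, v):
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(vectors[w], target))

# Toy 2-d space in which the male-female offset is roughly constant.
toy = {
    "man":   np.array([1.0, 1.0]),
    "woman": np.array([1.0, 2.0]),
    "king":  np.array([3.0, 1.1]),
    "queen": np.array([3.0, 2.1]),
    "apple": np.array([-2.0, 0.5]),
}
```

Here `analogy("man", "king", "woman", toy)` recovers "queen", because the king−man offset plus the woman vector lands (almost) exactly on the queen vector.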
These studies were conducted primarily on concepts and their semantic relations, like hypernym(politician) = person.
Meanwhile, entities and the relations they partake in are much less well understood. Entities are instances of concepts, i.e., they refer to specific individual objects in the real world; for example, Donald Trump is an instance of the concept politician. Consequently, entities are generally associated with a rich set of numeric and relational attributes (for politician instances: size, office, etc.). In contrast to concepts, the values of these attributes tend to be discrete (Herbelot, 2015): while the size of politician is best described by a probability distribution, the size of Donald Trump is 1.88m. Since distributional representations are notoriously bad at handling discrete knowledge (Fodor and Lepore, 1999; Smolensky, 1990), this raises the question of how well such models can capture entity-related knowledge.
In our previous work (Gupta et al., 2015), we analysed distributional prediction of numeric attributes of entities, found a large variance in quality among attributes, and identified factors determining prediction difficulty. A corresponding analysis for relational (categorial) attributes of entities is still missing, even though entities are highly relevant for NLP. This is evident from the highly active area of knowledge base completion (KBC), the task of extending incomplete entity information in knowledge bases such as Yago or Wikidata (e.g., Bordes et al., 2013; Freitas et al., 2014; Neelakantan and Chang, 2015; Guu et al., 2015; Krishnamurthy and Mitchell, 2015).
In this paper, we assess to what extent relational attributes of entities are easily accessible from word embedding space. To this end, we define two models that predict, given a target entity (Star Wars) and a relation (director), a distributed representation for the relatum (George Lucas). We carry out a detailed per-relation analysis of their performance on seven FreeBase domains.

Two Relatum Prediction Models
Both models predict a vector for a relatum r (plural: relata) given a target entity vector t and a symbolic relation ρ.
The Linear Model (LinM) is inspired by Mikolov et al.'s "phrase analogy" evaluation of word embeddings (v(man) − v(woman) = v(king) − v(queen)). However, instead of looking at individual word pairs, we extract representations of semantic relations from sets of pairs T_ρ = {(t_i, ρ, r_i)} instantiating the relation ρ. For each relation ρ, LinM computes the average (or centroid) difference vector over the set of training pairs:

p_ρ = 1/|T_ρ| Σ_{(t_i, ρ, r_i) ∈ T_ρ} (r_i − t_i)    (1)

That is, the predicted relatum vector r̂ for an input (t, ρ) is the sum of the target vector and the relation's prototype, r̂ = t + p_ρ. This model should work well if relations are represented additively in the embedding space.
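A minimal sketch of LinM in NumPy follows; the helper names and the toy capital-of vectors are ours, invented so that the relation offset is exactly constant, and the embeddings are assumed given as a dict of arrays.

```python
import numpy as np

def linm_prototype(pairs, vec):
    """Relation prototype: mean offset (relatum - target) over the
    training pairs of one relation, cf. Eq. 1."""
    return np.mean([vec[r] - vec[t] for t, r in pairs], axis=0)

def linm_predict(target, proto, vec):
    """Predicted relatum vector: target vector plus relation prototype."""
    return vec[target] + proto

# Toy capital-of data with a perfectly additive relation.
vec = {"France": np.array([0.0, 0.0]), "Paris":  np.array([1.0, 2.0]),
       "Spain":  np.array([5.0, 5.0]), "Madrid": np.array([6.0, 7.0]),
       "Italy":  np.array([2.0, 1.0]), "Rome":   np.array([3.0, 3.0])}
proto = linm_prototype([("France", "Paris"), ("Spain", "Madrid")], vec)
```

Training on the France and Spain pairs and predicting for the held-out target Italy yields exactly the Rome vector, because the offset here is constant by construction; real embedding spaces are only approximately additive.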
The Nonlinear Model (NonLinM) is a feed-forward network (Figure 1) introducing a nonlinearity, inspired by Levy and Goldberg (2014b) and similar to models used in KBC, e.g., Socher et al. (2013). The relatum vector is predicted as

r̂ = W_out σ(W_in t + W_r v_ρ)    (2)

where v_ρ is the relation encoded as an m-dimensional one-hot vector and the three matrices W_in, W_r, W_out form the model parameters θ. For the nonlinearity σ, we use tanh.
In this model, the hidden layer represents a nonlinearly transformed composition of target and relation from which the relatum can be predicted. NonLinM can theoretically make accurate predictions even if relations are not additive in embedding space. Also, its sharing of training data among relations should lead to more reliable learning for infrequent relations. As objective function, we use

J(θ) = Σ_{(t, ρ, r) ∈ T} [ d_cos(r̂, r) − α d_cos(r̂, nc(r)) ]    (3)

where d_cos denotes cosine distance and nc(v) is the nearest confounder of v, i.e., the nearest neighbor of v that is not a relatum for the current target-relation pair. Thus, we minimize the cosine distance between the predicted vector and the gold vector for the relatum while maximizing the cosine distance of the prediction to the closest negative example. We introduce a weight α ∈ [0, 1] for the negative sampling term as a hyper-parameter optimized on the development set. During training, we apply gradient descent with the adaptive learning rate method AdaDelta (Zeiler, 2012).
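The forward pass and the per-pair objective can be sketched as follows. This is a NumPy sketch of the computation only, under our own naming; AdaDelta training, the L2 norm constraint, and the confounder search are omitted.

```python
import numpy as np

def cos_dist(u, v):
    """Cosine distance d_cos(u, v) = 1 - cos(u, v)."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def nonlinm_forward(t, v_rho, W_in, W_r, W_out):
    """r_hat = W_out tanh(W_in t + W_r v_rho): a nonlinear composition
    of target embedding t and one-hot relation vector v_rho."""
    return W_out @ np.tanh(W_in @ t + W_r @ v_rho)

def pair_loss(r_hat, r_gold, r_confounder, alpha=0.6):
    """One summand of the objective: pull the prediction towards the
    gold relatum, push it away from the nearest confounder (weight alpha)."""
    return cos_dist(r_hat, r_gold) - alpha * cos_dist(r_hat, r_confounder)
```

With embedding dimensionality n, relation dimensionality m, and hidden size k, the matrices have shapes W_in: (k, n), W_r: (k, m), W_out: (n, k), so the prediction lives in the same n-dimensional space as the word embeddings.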

Experiments
Data. We extract relation data from FreeBase. We follow our earlier work (Gupta et al., 2015), but go beyond its limitation to two domains (country, citytown). We experiment with seven major FreeBase domains: animal, book, citytown, country, employer, organization, people.
We limit the number of datapoints of very large relation types to 3000 by random sampling, for efficiency reasons, and only remove relation types with fewer than 3 datapoints. This results in a quite challenging dataset that demonstrates the generalizability of our models and is roughly comparable, in variety and size, to the FB15K dataset (Bordes et al., 2013). The distributed representations for all entities come from the 1000-dimensional "Google News" skip-gram model (Mikolov et al., 2013) for FreeBase entities, trained on a 100G-token news corpus. We only retain relation datapoints where both target and relatum are covered by the Google News vectors. Table 1 shows the numbers of relations and unique objects (targets plus relata). We split all domains into training, validation, and test sets (60%-20%-20%). The split applies to each relation type: at test time, we face no unseen relation types, but unseen datapoints for each relation.

Hyperparameter settings. The NonLinM model uses an L2 norm constraint of s = 3. We adopt the best AdaDelta parameters from Zeiler (2012), viz. ρ = 0.95 and ε = 10⁻⁶. We optimize the negative sampling weight α (cf. Eq. 3) by line search with a step size of 0.1 on the largest domain, country, and find 0.6 to be the optimal value, which we reuse for all domains. Since the dimensionality m of the relation vector varies per domain, we set the size of the hidden layer to k = 2n + m/10 (n is the dimensionality of the word embeddings, cf. Figure 1). We train all models for a maximum of 1000 epochs with early stopping.
Evaluation. Models that predict vectors in a continuous vector space, like ours, cannot be expected to predict the output vector precisely. Thus, we apply nearest neighbor mapping over the set of all unique targets and relata in each domain (cf. Table 1) to identify the predicted relatum name. We then perform an Information Retrieval-style ranking evaluation: we compute the rank of the correct relatum r, given the target t and the relation ρ, for every datapoint in the test set T, and aggregate these ranks into the mean reciprocal rank (MRR):

MRR = 1/|T| Σ_{(t, ρ, r) ∈ T} 1/rank(r | t, ρ)    (4)

where rank(r | t, ρ) is the nearest neighbor rank of the relatum vector r given the model's prediction for the input (t, ρ). We report results at the relation level as well as macro- and micro-averaged MRR for the complete dataset. The dataset is available at: http://www.ims.uni-stuttgart.de/data/RelationPrediction.html
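The evaluation loop amounts to brute-force cosine ranking over the candidate set; a sketch, with our own function names and a small hypothetical candidate dict:

```python
import numpy as np

def mrr(predictions, golds, candidates):
    """Mean reciprocal rank (Eq. 4): rank every candidate entity by
    cosine similarity to the predicted vector, average 1/rank of gold."""
    rr = []
    for pred, gold in zip(predictions, golds):
        def cos(v):
            return (v @ pred) / (np.linalg.norm(v) * np.linalg.norm(pred))
        ranked = sorted(candidates, key=lambda w: cos(candidates[w]),
                        reverse=True)
        rr.append(1.0 / (ranked.index(gold) + 1))
    return float(np.mean(rr))
```

If the gold relatum is the nearest candidate to the prediction, its reciprocal rank is 1; an MRR around 0.25, as reported below, corresponds to the gold relatum sitting around rank 4 on average.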
Frequency Baseline (BL). Our baseline model ignores the target. For each relation, it predicts the frequency-ordered list of all training set relata.
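This baseline amounts to a per-relation frequency count; a sketch (function name ours):

```python
from collections import Counter

def frequency_baseline(train_triples):
    """For each relation, rank training-set relata by frequency; the
    target is ignored, so every target gets the same ranked list."""
    counts = {}
    for target, relation, relatum in train_triples:
        counts.setdefault(relation, Counter())[relatum] += 1
    return {rel: [r for r, _ in cnt.most_common()]
            for rel, cnt in counts.items()}
```

For a skewed relation such as continent, where a few relata cover most targets, this baseline is surprisingly hard to beat; for one-to-one relations it is near-useless.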

Results and Discussion
Overall results. Table 1 shows that the nonlinear model NonLinM consistently gives the best results, significantly outperforming the linear model on all domains according to a Wilcoxon test (α = 0.05). Both LinM and NonLinM clearly outclass the baseline. Most MRRs are around 0.25 (micro average 0.22), with one outlier, at 0.18, for country, the largest domain. Overall, the numbers may appear disappointing at first glance: these MRRs mean that the correct relatum is typically around the fourth nearest neighbor of the prediction vector. This indicates that open-vocabulary relatum prediction in a space of tens of thousands of words is a challenging task that warrants more detailed analysis. We observe that the nonlinear model achieves reasonable results even for sparse domains (cf. the low baseline), which we take as evidence for its generalization capabilities.
Analysis at relation level. Table 1 shows the number of relations with good MRRs (greater than 0.3) and bad MRRs (smaller than 0.1) for each domain. While the numbers vary across domains, the models tend to do badly on around 40-50% of all relations, and obtain good scores for less than one third of all relations. Figure 2 shows the distribution for the best domain (animal) and the worst one (country); both plots show a Zipfian distribution.

Qualitatively, the two models differ substantially with regard to prediction patterns at the level of targets. Table 2 shows the first predictions for three targets from two relations: continent, where NonLinM outperforms LinM, and capital, where it is the other way around. NonLinM's errors consist almost exclusively of predicting semantically similar entities of the correct relatum type, e.g., predicting Quito (the capital of Ecuador) as the capital of Venezuela. In contrast, the LinM model has a harder time capturing the correct type, predicting country entities as capitals (e.g., Nepal as the capital of Nepal).

Analysis of Difficulty. So what makes many FreeBase relations hard to model? To test for sparsity problems, we first computed the correlation between model performance and the "usual suspect", relation frequency (the number of instances of each relation). In NLP applications, this typically yields a high positive correlation. The second-to-last column of Table 1 shows that this is not true for our dataset: we find a substantial positive correlation only for people, correlations around zero for most domains, and substantial negative ones for organization and country. For these domains, therefore, frequent relations are actually harder to model. Further analysis revealed two main sources of difficulty:

(1) One-to-many relations. Relations with many datapoints tend to be one-to-many. We assume this to be a major source of difficulty, since the model is presented with multiple relata for the same target during training and will typically learn to predict a centroid of these relata. As an extreme case, consider a relation like administrative divisions, which relates the US to all of its federal states: the resulting prediction will arguably be dissimilar to every individual state. To test this hypothesis, we computed the rank correlation at the relation level between the number of relata per target and NonLinM performance, shown in the last column of Table 1. Indeed, we find a strong negative correlation for every single domain. In addition, Figure 3 plots relation performance (y axis) against the ratio of relata per target (x axis: one-to-one on the left, one-to-many on the right) for animal and country. Qualitatively, Table 3 shows the three easiest and the three most difficult relations for country. The list suggests that relations tend to be easy when they associate targets with single relata: the relation country maps territories and colonies onto their motherlands, and the tournaments relation is only populated with a few Commonwealth games (cf. the high baseline).
In contrast, relations that map targets onto many relata are difficult, such as administrative divisions of countries, or a list of disputed territories. Note that this is not an evaluation issue, since MRR can deal with multiple correct answers. Our models do badly because they lack strategies to address these cases.
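The rank correlation reported in the last column of Table 1 can be sketched as follows: a plain Spearman coefficient (assuming no ties, for brevity) relating each relation's relata-per-target ratio to its MRR. Both helper names are ours.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation between two sequences without ties:
    rank both, then apply the standard 1 - 6*sum(d^2)/(n(n^2-1)) formula."""
    rx = np.argsort(np.argsort(x))   # double argsort yields ranks
    ry = np.argsort(np.argsort(y))
    d = rx - ry
    n = len(x)
    return 1.0 - 6.0 * float(np.sum(d * d)) / (n * (n ** 2 - 1))

def relata_per_target(pairs):
    """Average number of distinct relata per target for one relation:
    high values indicate a one-to-many relation."""
    by_target = {}
    for target, relatum in pairs:
        by_target.setdefault(target, set()).add(relatum)
    return float(np.mean([len(s) for s in by_target.values()]))
```

A strongly negative coefficient between the two quantities, as found for every domain, means that the more relata a relation assigns per target, the lower its MRR.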
(2) Lack of contextual support. One-to-many relations are not the only culprit. Strikingly, Figure 3 shows that a low target-relatum ratio is a necessary condition for good performance (the upper right corners are empty), but not a sufficient one (the lower left corners are not empty). Some relations are not modelled well even though they are (almost) one-to-one. Examples include currency formerly used or named after for country and place of origin for animal. Further analysis indicated that these relations suffer from what Gupta et al. (2015) called lack of contextual support: although they are expressed overtly in the linguistic context of the target and relatum (and often even frequently so), their realizations cannot be tied to individual words or topics. Instead, they are expressed by relatively specific linguistic patterns, often predicate-argument structures (X used to pay with Y, X is named in honor of Y). Such structures are hard to pick up for word embedding models that make the bag-of-words independence assumption among context words.

Conclusion
This paper considers the prediction of related entities ("relata") given a pair of a target Named Entity and a relation (Star Wars, director, ?) on the basis of distributional information. This task is challenging due to the more discrete behavior of attributes of entities as compared to concepts. We provide an analysis based on two models that use vector representations for both the targets and the relata.
Our results yield new insights into how embedding spaces represent entity relations: they are generally not represented additively, and nonlinearity helps. They also complement our earlier insights on the behavior of numeric attributes (Gupta et al., 2015): relations, like numeric attributes, are difficult to model if they are not specifically expressed in the linguistic context of target and relatum. A new challenge specific to relations arises when a single target maps onto many relata. If neither of the two problems applies, relations are easy to model. If one applies, they are difficult.
And if both apply, they are essentially impossible. Of the two challenges, the problem of one-to-many relations appears easier to address, since a continuous output vector can, at least in principle, be similar to many relata. In the future, we will extend the model to deal better with one-to-many relations. While the lack of contextual support seems more fundamental, it could be addressed either by using syntax-based embeddings (Levy and Goldberg, 2014a), which can better pick up the specific context patterns characteristic of these relations, or by optimizing the input word embeddings for the task. This becomes a problem similar to joint training of representations from knowledge base structure and textual evidence (Perozzi et al., 2014; Toutanova et al., 2015).