LinkNBed: Multi-Graph Representation Learning with Entity Linkage

Knowledge graphs have emerged as an important model for studying complex multi-relational data. This has given rise to the construction of numerous large-scale but incomplete knowledge graphs encoding information extracted from various resources. An effective and scalable approach to jointly learn over multiple graphs and eventually construct a unified graph is a crucial next step for the success of knowledge-based inference in many downstream applications. To this end, we propose LinkNBed, a deep relational learning framework that learns entity and relationship representations across multiple graphs. We identify entity linkage across graphs as a vital component for achieving this goal. We design a novel objective that leverages entity linkage and build an efficient multi-task training procedure. Experiments on link prediction and entity linkage demonstrate substantial improvements over state-of-the-art relational learning approaches.


Introduction
Reasoning over multi-relational data is a key concept in Artificial Intelligence, and knowledge graphs have appeared at the forefront as an effective tool to model such data. Knowledge graphs have gained importance due to their wide range of applications, such as information retrieval (Dalton et al., 2014), natural language processing (Gabrilovich and Markovitch, 2009), recommender systems (Catherine and Cohen, 2016), question answering (Cui et al., 2017) and many more. This has led to increased efforts in constructing numerous large-scale knowledge bases (e.g. Freebase (Bollacker et al., 2008), DBpedia (Auer et al., 2007), Google's Knowledge Graph (Dong et al., 2014), Yago (Suchanek et al., 2007) and NELL (Carlson et al., 2010)) that cater to these applications by representing information available on the web in relational format.
All knowledge graphs share the common drawbacks of incompleteness and sparsity, and hence most existing relational learning techniques focus on using the observed triplets in an incomplete graph to infer unobserved triplets for that graph (Nickel et al., 2016a). Neural embedding techniques that learn vector space representations of entities and relationships have achieved remarkable success in this task. However, these techniques focus only on learning from a single graph. In addition to being incomplete, these knowledge graphs also share a set of overlapping entities and relationships, with varying information about them. This makes a compelling case for designing a technique that can learn over multiple graphs and eventually aid in constructing a unified giant graph out of them. While research on learning representations over a single graph has progressed rapidly in recent years (Nickel et al., 2011; Dong et al., 2014; Trouillon et al., 2016; Bordes et al., 2013; Xiao et al., 2016; Yang et al., 2015), there is a conspicuous lack of a principled approach to tackle the unique challenges involved in learning across multiple graphs.
One approach to multi-graph representation learning could be to first solve the graph alignment problem to merge the graphs and then apply existing relational learning methods to the merged graph. Unfortunately, graph alignment is an important but still unsolved problem, and the existing techniques address its challenges only in limited settings (Liu and Yang, 2016; Pershina et al., 2015; Koutra et al., 2013; Buneman and Staworko, 2016).
The key challenges of the graph alignment problem emanate from the fact that real-world data are noisy and intricate in nature. Noisy or sparse data make it difficult to learn robust alignment features, and data abundance leads to computational challenges due to the combinatorial permutations needed for alignment. These challenges are compounded in multi-relational settings due to the heterogeneous nodes and edges in such graphs.
Recently, deep learning has shown significant impact in learning useful information over noisy, large-scale and heterogeneous graph data (Rossi et al., 2017). We therefore posit that combining the graph alignment task with deep representation learning across multi-relational graphs has the potential to induce a synergistic effect on both tasks. Specifically, we identify that a key component of the graph alignment process, entity linkage, also plays a vital role in learning across graphs. For instance, the embeddings learned over two knowledge graphs for the same actor should be closer to one another than to the embeddings of all other entities. Similarly, entities that are already aligned across the two graphs should produce better embeddings due to the shared context and data. To model this phenomenon, we propose LinkNBed, a novel deep learning framework that jointly performs representation learning and the graph linkage task. To achieve this, we identify key challenges involved in the learning process and make the following contributions to address them:
• We propose a novel and principled approach towards jointly learning entity representations and entity linkage. The novelty of our framework stems from its ability to support the linkage task across heterogeneous types of entities.
• We devise a graph-independent inductive framework that learns functions to capture contextual information for entities and relations. It combines the structural and semantic information in individual graphs for joint inference in a principled manner.
• Labeled instances (specifically positive instances for the linkage task) are typically very sparse, and hence we design a novel multi-task loss function where the entity linkage task is tackled in a robust manner across various learning scenarios, such as learning only with unlabeled instances or only with negative instances.
• We design an efficient training procedure to perform joint training in time linear in the number of triples. We demonstrate superior performance of our method on two datasets curated from Freebase and IMDB against state-of-the-art neural embedding methods.

Knowledge Graph Representation
A knowledge graph $G$ comprises a set of facts represented as triplets $(e^s, r, e^o)$ denoting the relationship $r$ between subject entity $e^s$ and object entity $e^o$. Associated with this knowledge graph, we have a set of attributes that describe observed characteristics of an entity. Attributes are represented as a set of key-value pairs for each entity, and an attribute can have a null (missing) value for an entity. We follow the Open World Assumption: triplets not observed in the knowledge graph are considered to be missing but not false. We assume that there are no duplicate triplets or self-loops.

Multi-Graph Relational Learning
Definition. Given a collection of knowledge graphs $\mathcal{G}$, Multi-Graph Relational Learning refers to the task of learning information-rich representations of entities and relationships across graphs. The learned embeddings can further be used to infer new knowledge in the form of link prediction or to learn new labels in the form of entity linkage. We motivate our work with the setting of two knowledge graphs where, given two graphs $G_1, G_2 \in \mathcal{G}$, the task is to match an entity $e_{G_1} \in G_1$ to an entity $e_{G_2} \in G_2$ if they represent the same real-world entity. We discuss a straightforward extension of this setting to more than two graphs in Section 7.
Notations. Let $X$ and $Y$ represent realizations of two such knowledge graphs extracted from two different sources. Let $n^X_e$ and $n^Y_e$ represent the number of entities in $X$ and $Y$ respectively. Similarly, $n^X_r$ and $n^Y_r$ represent the number of relations in $X$ and $Y$. We combine triplets from both $X$ and $Y$ to obtain the set of all observed triplets $D = \{(e^s, r, e^o)_p\}_{p=1}^{P}$, where $P$ is the total number of available records across both graphs. Let $E$ and $R$ be the sets of all entities and all relations in $D$ respectively, with $|E| = n$ and $|R| = m$. In addition to $D$, we also have a set of linkage labels $L$ for entities between $X$ and $Y$. Each record in $L$ is represented as a triplet $(e_X \in X, e_Y \in Y, l \in \{0, 1\})$, where $l = 1$ when the entities are matched and $l = 0$ otherwise.

Proposed Method: LinkNBed
We present a novel inductive multi-graph relational learning framework that learns a set of aggregator functions capable of ingesting various contextual information for both entities and relationships in a multi-relational graph. These functions encode the ingested structural and semantic information into low-dimensional entity and relation embeddings. Further, we use these representations to learn a relational score function that computes how likely two entities are to be connected in a particular relationship. The key idea behind this formulation is that when a triplet is observed, the relationship between the two entities can be explained using various contextual information, such as local neighborhood features of both entities, attribute features of both entities and type information of the entities which participate in that relationship.
We outline two key insights for establishing the relationships between embeddings of the entities over multiple graphs in our framework:
Insight 1 (Embedding Similarity): If two entities $e_X \in X$ and $e_Y \in Y$ represent the same real-world entity, then their embeddings $\mathbf{e}_X$ and $\mathbf{e}_Y$ will be close to each other.
Insight 2 (Semantic Replacement): For a given triplet $t = (e^s, r, e^o) \in X$, denote by $g(t)$ the function that computes a relational score for $t$ using entity and relation embeddings. If there exists a matching entity $e^{s'} \in Y$ for $e^s \in X$, denote by $t' = (e^{s'}, r, e^o)$ the triplet obtained after replacing $e^s$ with $e^{s'}$. In this case, $g(t) \sim g(t')$, i.e. the scores of triplets $t$ and $t'$ will be similar.
For a triplet $(e^s, r, e^o) \in D$, we describe the encoding mechanism of LinkNBed as a three-layered architecture that computes the final output representations $z^r$, $z^{e^s}$, $z^{e^o}$ for the given triplet. Figure 1 provides an overview of the LinkNBed architecture, and we describe the three steps below:

Atomic Layer
Entities, Relations, Types and Attributes are first encoded in their basic vector representations. We then use these basic representations to derive more complex contextual embeddings.
Entities, Relations and Types. The embedding vectors corresponding to these three components are learned as follows:
$$v^{e^s} = f(W^E \mathbf{e}^s), \quad v^{e^o} = f(W^E \mathbf{e}^o) \quad (1)$$
$$v^r = f(W^R \mathbf{r}), \quad v^t = f(W^T \mathbf{t}) \quad (2)$$
where $v^{e^s}, v^{e^o} \in \mathbb{R}^d$, and $\mathbf{e}^s, \mathbf{e}^o \in \mathbb{R}^n$ are "one-hot" representations of $e^s$ and $e^o$ respectively. $v^r \in \mathbb{R}^k$ and $\mathbf{r} \in \mathbb{R}^m$ is the "one-hot" representation of $r$. $v^t \in \mathbb{R}^q$ and $\mathbf{t} \in \mathbb{R}^z$ is the "one-hot" representation of $t$. $W^E \in \mathbb{R}^{d \times n}$, $W^R \in \mathbb{R}^{k \times m}$ and $W^T \in \mathbb{R}^{q \times z}$ are the entity, relation and type embedding matrices respectively. $f$ is a nonlinear activation function (ReLU in our case). $W^E$, $W^R$ and $W^T$ can be initialized randomly, using pre-trained word embeddings, or using vector compositions based on name phrases of components (Socher et al., 2013).
Attributes. For a given attribute $a$ represented as a key-value pair, we use a paragraph2vec (Le and Mikolov, 2014) type of embedding network to learn the attribute embedding. Specifically, we represent the attribute embedding vector as:
$$\mathbf{a} = f(W^{key} \mathbf{a}^{key} + W^{val} \mathbf{a}^{val}) \quad (3)$$
where $\mathbf{a} \in \mathbb{R}^y$, $\mathbf{a}^{key} \in \mathbb{R}^u$ and $\mathbf{a}^{val} \in \mathbb{R}^v$. $W^{key} \in \mathbb{R}^{y \times u}$ and $W^{val} \in \mathbb{R}^{y \times v}$. $\mathbf{a}^{key}$ is a "one-hot" vector and $\mathbf{a}^{val}$ is a feature vector.
Note that the dimensions of the embedding vectors do not necessarily need to be the same.
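As a minimal sketch of the atomic layer, assuming the embedding is the activation applied to a matrix-vector product with a one-hot input, and assuming the key and value projections of an attribute are summed before the activation, the lookups could be written as follows. The helper names (`embed`, `attribute_embedding`, `random_matrix`) and the toy sizes are hypothetical and chosen only for illustration.

```python
import random

def relu(vec):
    """Element-wise ReLU, the nonlinearity f used in the text."""
    return [max(0.0, x) for x in vec]

def matvec(matrix, vec):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def random_matrix(rows, cols, rng):
    """Random initialization; pre-trained vectors could be used instead."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def embed(weight, index, size):
    """Atomic embedding f(W x) for a one-hot x, i.e. ReLU of column `index` of W."""
    return relu(matvec(weight, one_hot(index, size)))

def attribute_embedding(w_key, w_val, key_index, num_keys, value_features):
    """Attribute embedding f(W_key a_key + W_val a_val); the sum is an assumption."""
    key_part = matvec(w_key, one_hot(key_index, num_keys))
    val_part = matvec(w_val, value_features)
    return relu([k + v for k, v in zip(key_part, val_part)])

rng = random.Random(0)
n, d = 5, 4          # number of entities, entity embedding size
u, v, y = 3, 2, 6    # attribute keys, value feature size, attribute embedding size
W_E = random_matrix(d, n, rng)
W_key = random_matrix(y, u, rng)
W_val = random_matrix(y, v, rng)

v_entity = embed(W_E, index=2, size=n)
a = attribute_embedding(W_key, W_val, key_index=1, num_keys=u, value_features=[0.5, -1.0])
```

Because the input is one-hot, the matrix product reduces to selecting a column, which is why such layers are usually implemented as table lookups in practice.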

Contextual Layer
While the entity and relationship embeddings described above help to capture very generic latent features, embeddings can be further enriched to capture structural information, attribute information and type information to better explain the existence of a fact. Such information can be modeled as the context of nodes and edges in the graph. To this end, we design the following canonical aggregator function that learns various contextual information by aggregating over relevant embedding vectors:
$$c(z) = \mathrm{AGG}(\{z', \forall z' \in C(z)\}) \quad (4)$$
where $c(z)$ is the vector representation of the aggregated contextual information for component $z$.
Here, component $z$ can be either an entity or a relation. $C(z)$ is the set of components in the context of $z$, and $z'$ corresponds to the vector embeddings of those components. AGG is the aggregator function, which can take many forms such as Mean, Max, Pooling or more complex LSTM-based aggregators. It is plausible that different components in a context may have varied impact on the component for which the embedding is being learned.
To account for this, we employ a soft attention mechanism where we learn attention coefficients to weight components based on their impact before aggregating them. We modify Eq. 4 as:
$$c(z) = \mathrm{AGG}(q(z) * \{z', \forall z' \in C(z)\}) \quad (5)$$
where
$$q(z) = \frac{\exp(\theta_z)}{\sum_{z' \in C(z)} \exp(\theta_{z'})} \quad (6)$$
and the $\theta_z$'s are the parameters of the attention model.
Figure 1: LinkNBed Architecture Overview. One-step score computation for a given triplet $(e^s, r, e^o)$. The attribute embeddings are not simple lookups; they are learned as shown in Eq. 3.
The following contextual information is modeled in our framework:
Entity Neighborhood Context $N_c(e) \in \mathbb{R}^d$. Given a triplet $(e^s, r, e^o)$, the neighborhood context for an entity $e^s$ consists of the nodes located near $e^s$ other than the node $e^o$. This captures the effect of the local neighborhood in the graph surrounding $e^s$ that drives $e^s$ to participate in the fact $(e^s, r, e^o)$. We use Mean as the aggregator function.
As there can be a large number of neighbors, we collect the neighborhood set for each entity as a pre-processing step using a random walk method. Specifically, given a node $e$, we run $k$ rounds of random walks of length $l$ following (Hamilton et al., 2017) and create the set $N(e)$ by adding all unique nodes visited across these walks. This context can be computed similarly for the object entity.
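The pre-processing step above can be sketched as follows. This is an illustrative version only: it assumes an undirected adjacency-list graph, uniform sampling of the next node, and that the start node is left out of its own neighborhood set. The function names are hypothetical.

```python
import random

def collect_neighborhood(adjacency, node, num_walks, walk_length, rng):
    """Return the unique nodes visited by `num_walks` random walks of
    `walk_length` steps, each starting at `node`."""
    visited = set()
    for _ in range(num_walks):
        current = node
        for _ in range(walk_length):
            neighbors = adjacency.get(current, [])
            if not neighbors:
                break  # dead end: stop this walk early
            current = rng.choice(neighbors)
            visited.add(current)
    visited.discard(node)  # assumption: a node is not part of its own context
    return visited

def neighborhood_context_set(adjacency, subject, obj, num_walks, walk_length, rng):
    """Neighborhood of `subject` for one triplet, excluding the triplet's object."""
    return collect_neighborhood(adjacency, subject, num_walks, walk_length, rng) - {obj}

# Toy undirected graph stored as adjacency lists.
adjacency = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}
rng = random.Random(0)
N_a = collect_neighborhood(adjacency, "a", num_walks=5, walk_length=3, rng=rng)
```

Since the sets are computed once before training, the cost of sampling is paid up front and the neighborhood context becomes a lookup during training.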
Entity Attribute Context $A_c(e) \in \mathbb{R}^y$. For an entity $e$, we collect all attribute embeddings for $e$ obtained from the Atomic Layer and learn aggregated information over them using the Max operator in Eq. 4.
Relation Type Context $T_c(r) \in \mathbb{R}^q$. We use the type context for the relation embedding, i.e. for a given relationship $r$, this context aims at capturing the effect of the types of entities that have participated in this relationship. For a given triplet $(e^s, r, e^o)$, the type context for relationship $r$ is computed by aggregating with Mean over the type embeddings corresponding to the context of $r$. Appendix C provides specific forms of contextual information.
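The aggregators used in this section can be illustrated with a short sketch. It assumes the attention coefficients are a softmax over one scalar parameter per context component, and that each context vector is scaled by its coefficient before the Mean or Max aggregator is applied. All names here are hypothetical.

```python
import math

def mean_agg(vectors):
    """Mean aggregator: coordinate-wise average of the context vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def max_agg(vectors):
    """Max aggregator: coordinate-wise maximum of the context vectors."""
    return [max(col) for col in zip(*vectors)]

def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_agg(vectors, thetas, agg):
    """Scale each context vector by its attention weight, then aggregate."""
    weights = softmax(thetas)
    weighted = [[w * x for x in vec] for w, vec in zip(weights, vectors)]
    return agg(weighted)

context = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]   # three context embeddings
plain_mean = mean_agg(context)
plain_max = max_agg(context)
attended = attention_agg(context, thetas=[0.0, 0.0, 0.0], agg=mean_agg)
```

With equal attention parameters every component receives the same weight, so the attended result is simply a rescaled plain aggregate; learned parameters let informative components dominate.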

Representation Layer
Having computed the atomic and contextual embeddings for a triplet $(e^s, r, e^o)$, we obtain the final embedded representations of the entities and the relation in the triplet using the following formulation:
$$z^{e^s} = \sigma(W_1 v^{e^s} + W_2 N_c(e^s) + W_3 A_c(e^s)) \quad (7)$$
$$z^{e^o} = \sigma(W_1 v^{e^o} + W_2 N_c(e^o) + W_3 A_c(e^o)) \quad (8)$$
$$z^r = \sigma(W_4 v^r + W_5 T_c(r)) \quad (9)$$
where $N_c$, $A_c$ and $T_c$ denote the neighborhood, attribute and type contexts computed in the Contextual Layer, $W_1, W_2 \in \mathbb{R}^{d \times d}$, $W_3 \in \mathbb{R}^{d \times y}$, $W_4 \in \mathbb{R}^{d \times k}$ and $W_5 \in \mathbb{R}^{d \times q}$. $\sigma$ is a nonlinear activation function, generally Tanh or ReLU.
The rationale for our formulation is as follows. An entity's representation can be enriched by encoding information about the local neighborhood features and attribute information associated with the entity, in addition to its own latent features. Parameters $W_1$, $W_2$, $W_3$ learn to capture these different aspects and map them into the entity embedding space. Similarly, a relation's representation can be enriched by encoding information about the entity types that participate in that relationship, in addition to its own latent features. Parameters $W_4$, $W_5$ learn to capture these aspects and map them into the relation embedding space. Further, as the ultimate goal is to jointly learn over multiple graphs, the shared parameterization in our model facilitates the propagation of information across graphs, thereby making it a graph-independent inductive model. The flexibility of the model stems from the ability to shrink it (to a very simple model considering atomic entity and relation embeddings only) or expand it (to a complex model by adding different contextual information) without affecting any other step in the learning procedure.
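A small sketch of the representation layer is given below. It assumes the atomic embedding and each context vector are projected by their own matrix, summed, and passed through Tanh, in line with the stated matrix dimensions. The function names and toy matrices are hypothetical.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def add(*vectors):
    """Coordinate-wise sum of equally sized vectors."""
    return [sum(parts) for parts in zip(*vectors)]

def tanh(vec):
    return [math.tanh(x) for x in vec]

def entity_representation(W1, W2, W3, v_e, n_ctx, a_ctx):
    """z_e = tanh(W1 v_e + W2 N_c(e) + W3 A_c(e))."""
    return tanh(add(matvec(W1, v_e), matvec(W2, n_ctx), matvec(W3, a_ctx)))

def relation_representation(W4, W5, v_r, t_ctx):
    """z_r = tanh(W4 v_r + W5 T_c(r))."""
    return tanh(add(matvec(W4, v_r), matvec(W5, t_ctx)))

d, y, k, q = 2, 3, 2, 1                      # toy dimensions
I2 = [[1.0, 0.0], [0.0, 1.0]]
W1, W2, W4 = I2, I2, I2                      # d x d, d x d, d x k
W3 = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]      # d x y
W5 = [[1.0], [1.0]]                          # d x q

v_e, n_ctx, a_ctx = [0.2, -0.4], [0.1, 0.1], [1.0, 0.0, 2.0]
z_e = entity_representation(W1, W2, W3, v_e, n_ctx, a_ctx)
# Passing zero contexts shrinks the model to atomic embeddings only.
z_embed_only = entity_representation(W1, W2, W3, v_e, [0.0] * d, [0.0] * y)
z_r = relation_representation(W4, W5, [0.3, 0.0], [0.5])
```

Setting a context to zero (or dropping its term) corresponds to the reduced model variants, which is one way to read the "shrink or expand" flexibility described above.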

Relational Score Function
Having observed a triplet $(e^s, r, e^o)$, we first use Eqs. 7, 8 and 9 to compute the entity and relation representations. We then use these embeddings to capture the relational interaction between the two entities using the following score function $g(\cdot)$:
$$g(e^s, r, e^o) = \sigma({z^r}^{\top} \cdot (z^{e^s} \odot z^{e^o})) \quad (10)$$
where $z^r, z^{e^s}, z^{e^o} \in \mathbb{R}^d$ are the $d$-dimensional representations of the relationship and the entities as described above, $\sigma$ is the nonlinear activation function and $\odot$ represents the element-wise product.
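For illustration, the score function can be sketched as below, assuming the nonlinearity is a logistic sigmoid (the text only says it is a nonlinear activation) and that the relation vector is combined with the element-wise product of the two entity vectors through a dot product. The function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score(z_r, z_s, z_o):
    """g = sigmoid(z_r . (z_s * z_o)), with * the element-wise product."""
    return sigmoid(sum(r * (s * o) for r, s, o in zip(z_r, z_s, z_o)))

z_r = [1.0, -1.0]
z_s = [0.5, 0.2]
z_o = [0.4, 0.1]
g = score(z_r, z_s, z_o)
```

Under this form the score is unchanged when subject and object are swapped, so any asymmetry between the two roles has to come from the contextual information folded into the representations.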

Efficient Learning Procedure

Objective Function
The complete parameter space of the model can be given by:
$$\Omega = \{\{W_i\}_{i=1}^{5}, W^E, W^R, W^T, W^{key}, W^{val}, \Theta\}$$
where $\Theta$ denotes the parameters of the attention model. To learn these parameters, we design a novel multi-task objective function that jointly trains over two graphs. As identified earlier, the goal of our model is to leverage the available linkage information across graphs for optimizing the entity and relation embeddings such that they can explain the observed triplets across the graphs. Further, we want to leverage these optimized embeddings to match entities across graphs and expand the available linkage information. To achieve this goal, we define the following two loss functions, one for each learning task, and jointly optimize over them as a multi-task objective to learn the model parameters.
Relational Learning Loss. This is the conventional loss function used to learn knowledge graph embeddings. Specifically, given the $p$-th triplet $(e^s, r, e^o)_p$ from training set $D$, we sample $C$ negative samples by replacing either the head or the tail entity and define a contrastive max-margin function as shown in (Socher et al., 2013):
$$L_{rel} = \frac{1}{C} \sum_{c=1}^{C} \max(0, \gamma - g(e^s_p, r_p, e^o_p) + g'(e^s_c, r_p, e^o_p)) \quad (11)$$
where $\gamma$ is the margin, $e^s_c$ represents the corrupted entity and $g'(e^s_c, r_p, e^o_p)$ represents the corrupted triplet score.
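The relational loss can be sketched as follows. This assumes that a higher score means a more plausible triplet, that each negative sample replaces the subject or the object with equal probability, and that the hinge terms are averaged over the negatives. `corrupt`, `relational_loss` and `toy_score` are hypothetical names, and the toy scorer stands in for the learned score function.

```python
import random

def corrupt(triplet, entities, rng):
    """Replace the subject or the object with a different random entity."""
    s, r, o = triplet
    if rng.random() < 0.5:
        return (rng.choice([e for e in entities if e != s]), r, o)
    return (s, r, rng.choice([e for e in entities if e != o]))

def relational_loss(triplet, entities, score_fn, num_negatives, margin, rng):
    """Mean hinge loss max(0, margin - g(true) + g(corrupted)) over the negatives."""
    positive = score_fn(triplet)
    losses = []
    for _ in range(num_negatives):
        negative = score_fn(corrupt(triplet, entities, rng))
        losses.append(max(0.0, margin - positive + negative))
    return sum(losses) / num_negatives

entities = ["a", "b", "c", "d"]
true_triplet = ("a", "r", "b")

def toy_score(triplet):
    # Stand-in scorer: the observed triplet scores high, everything else low.
    return 0.9 if triplet == true_triplet else 0.2

rng = random.Random(0)
loss = relational_loss(true_triplet, entities, toy_score,
                       num_negatives=4, margin=1.0, rng=rng)
```

The loss vanishes once every corrupted triplet scores at least the margin below the observed one, so already well-separated triplets contribute no gradient.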
Linkage Learning Loss. We design a novel loss function to leverage the pairwise label set $L$. Given a triplet $(e^s_X, r_X, e^o_X)$ from knowledge graph $X$, we first find the entity $e^+_Y$ from graph $Y$ that represents the same real-world entity as $e^s_X$. We then replace $e^s_X$ with $e^+_Y$ and compute the score $g(e^+_Y, r_X, e^o_X)$. Next, we find the set of all entities $E^-_Y$ from graph $Y$ that have a negative label with entity $e^s_X$. We consider them analogous to the negative samples we generated for Eq. 11. We then propose the label learning loss function as:
$$L_{lab} = \frac{1}{Z} \sum_{z=1}^{Z} \max(0, \gamma - g(e^+_Y, r_X, e^o_X) + g'(e^-_Y, r_X, e^o_X)_z) \quad (12)$$
where $Z$ is the total number of negative labels for $e_X$, $\gamma$ is the margin, which is usually set to 1, and $e^-_Y \in E^-_Y$ represents an entity from graph $Y$ with which entity $e^s_X$ has a negative label. Note that this applies symmetrically to the triplets that originate from graph $Y$ in the overall dataset, and that if both entities of a triplet have labels, we include both cases when computing the loss. Eq. 12 is inspired by Insight 1 and Insight 2 defined earlier in Section 2. Given a set $D$ of $N$ observed triplets across two graphs, we define the complete multi-task objective as:
$$L(\Omega) = \sum_{i=1}^{N} [b \cdot L_{rel} + (1 - b) \cdot L_{lab}] + \lambda \|\Omega\|_2^2 \quad (13)$$
where $\Omega$ is the set of all model parameters, $\lambda$ is the regularization hyper-parameter and $b$ is the weight hyper-parameter used to attribute importance to each task. We train with a mini-batch SGD procedure (Algorithm 1) using the Adam optimizer.
Missing Positive Labels. It is expensive to obtain positive labels across multiple graphs, and hence it is highly likely that many entities will not have positive labels available. For those entities, we modify Eq. 12 to use the score of the original triplet $g(e^s_X, r_X, e^o_X)$ in place of the perturbed triplet score $g(e^+_Y, r_X, e^o_X)$ for the positive label. The rationale here again arises from Insight 2, wherein embeddings of two duplicate entities should be able to replace each other without affecting the score.
Training Time Complexity. Most contextual information is pre-computed and available to all training steps, which leads to constant-time embedding lookup for those contexts. For the attribute network, however, the embedding needs to be computed for each attribute separately, and hence the complexity to compute the score for one triplet is $O(2a)$, where $a$ is the number of attributes. For training, we also generate $C$ negative samples for the relational loss function and use $Z$ negative labels for the label loss function. Let $k = C + Z$. Hence, the training time complexity for a set of $n$ triplets is $O(2ak \cdot n)$, which is linear in the number of triplets with a constant factor, as $ak \ll n$ for real-world knowledge graphs. This is desirable, as the number of triplets tends to be very large per graph in multi-relational settings.
Memory Complexity. We borrow notations from (Nickel et al., 2016a) and describe the parameter complexity of our model in terms of the number of each component and corresponding [...] Here, $N_e$, $N_r$, $N_t$, $N_k$, $N_v$ signify the number of entities, relations, types, attribute keys and the vocabulary size of attribute values across both datasets, and $H_b$ is the output dimension of the hidden layer.
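The linkage loss and the joint objective from the Objective Function section can be sketched as below. The sketch assumes the hinge terms are averaged over the negative labels, that an entity without a positive label falls back to the score of its original triplet, and that the two task losses are mixed as `b` and `1 - b` with an L2 penalty on the parameters. All names and the toy score table are hypothetical.

```python
def linkage_loss(triplet, positive_match, negative_matches, score_fn, margin=1.0):
    """Hinge loss over entities with a negative label, substituted for the subject."""
    s, r, o = triplet
    if positive_match is not None:
        positive = score_fn((positive_match, r, o))
    else:
        # Missing positive label: fall back to the original triplet's score.
        positive = score_fn(triplet)
    if not negative_matches:
        return 0.0
    losses = [max(0.0, margin - positive + score_fn((neg, r, o)))
              for neg in negative_matches]
    return sum(losses) / len(losses)

def multitask_objective(rel_losses, link_losses, b, lam, parameters):
    """Sum of b * L_rel + (1 - b) * L_lab over triplets, plus an L2 penalty."""
    data_term = sum(b * lr + (1.0 - b) * ll
                    for lr, ll in zip(rel_losses, link_losses))
    return data_term + lam * sum(p * p for p in parameters)

# Toy score table standing in for the learned score function.
scores = {
    ("x", "r", "o"): 0.7,
    ("y_pos", "r", "o"): 0.8,
    ("y_neg1", "r", "o"): 0.3,
    ("y_neg2", "r", "o"): 0.9,
}

def score_fn(triplet):
    return scores[triplet]

l_lab = linkage_loss(("x", "r", "o"), "y_pos", ["y_neg1", "y_neg2"], score_fn)
l_missing = linkage_loss(("x", "r", "o"), None, ["y_neg1"], score_fn)
total = multitask_objective([0.3], [l_lab], b=0.5, lam=0.1, parameters=[1.0, 2.0])
```

Each triplet needs one score evaluation per negative sample and per negative label, which is where the factor $k = C + Z$ in the training time analysis comes from.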

Datasets
We evaluate LinkNBed and baselines on two real-world knowledge graphs: D-IMDB (derived from a large-scale IMDB data snapshot) and D-FB (derived from a large-scale Freebase data snapshot).

Baselines
We compare the performance of our method against state-of-the-art representation learning baselines that use neural embedding techniques to learn entity and relation representations. Specifically, we consider the compositional methods RESCAL (Nickel et al., 2011), a basic matrix factorization method; DISTMULT (Yang et al., 2015), a simple multiplicative model good at capturing symmetric relationships; and Complex (Trouillon et al., 2016), an upgrade over DISTMULT that can capture asymmetric relationships using complex-valued embeddings. We also compare against the translational model STransE, which combines the original structured embedding with TransE and has shown state-of-the-art performance in benchmark testing (Kadlec et al., 2017). Finally, we compare with GAKE (Feng et al., 2016), a model that captures context in entity and relationship representations.
In addition to the above state-of-the-art models, we analyze the effectiveness of different components of our model by comparing with various versions that use partial information. Specifically, we report results on the following variants:
• LinkNBed-Embed Only: only use entity embeddings.
• LinkNBed-Attr Only: only use Attribute Context.
• LinkNBed-Nhbr Only: only use Neighborhood Context.
• LinkNBed-Embed + Attr: use both entity embeddings and Attribute Context.
• LinkNBed-Embed + Nhbr: use both entity embeddings and Neighborhood Context.
• LinkNBed-Embed All: use all three Contexts.

Evaluation Scheme
We evaluate our model using two inference tasks.
Link Prediction. Given a test triplet $(e^s, r, e^o)$, we first score this triplet using Eq. 10. We then replace $e^o$ with all other entities in the dataset and filter the resulting set of triplets as shown in (Bordes et al., 2013). We score the remaining set of perturbed triplets using Eq. 10. All the scored triplets are sorted by score, and the rank of the ground truth triplet is then used for the evaluation. We use this ranking mechanism to compute HITS@10 (predicted rank $\leq 10$) and the reciprocal rank ($\frac{1}{rank}$) of each test triplet. We report the mean over all test samples.
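The ranking protocol can be sketched as follows, assuming that filtering means skipping perturbed triplets that are themselves known facts, and that ties are resolved in favor of the ground truth triplet. `filtered_rank`, `evaluate` and the toy score table are hypothetical.

```python
def filtered_rank(test_triplet, entities, known_triplets, score_fn):
    """Rank of the true object among candidate objects, skipping known facts."""
    s, r, o = test_triplet
    true_score = score_fn(test_triplet)
    rank = 1
    for candidate in entities:
        if candidate == o:
            continue
        perturbed = (s, r, candidate)
        if perturbed in known_triplets:
            continue  # filtered setting: other true facts are not counted against us
        if score_fn(perturbed) > true_score:
            rank += 1
    return rank

def evaluate(test_triplets, entities, known_triplets, score_fn, k=10):
    """Return (HITS@k, mean reciprocal rank) over the test triplets."""
    ranks = [filtered_rank(t, entities, known_triplets, score_fn)
             for t in test_triplets]
    hits = sum(1 for rank in ranks if rank <= k) / len(ranks)
    mrr = sum(1.0 / rank for rank in ranks) / len(ranks)
    return hits, mrr

entities = ["a", "b", "c", "d"]
table = {("a", "r", "b"): 0.9, ("a", "r", "c"): 0.95, ("a", "r", "d"): 0.1}

def score_fn(triplet):
    return table.get(triplet, 0.0)

known = {("a", "r", "b"), ("a", "r", "c")}
test = ("a", "r", "b")
rank_filtered = filtered_rank(test, entities, known, score_fn)
rank_raw = filtered_rank(test, entities, set(), score_fn)
hits_at_1, mrr = evaluate([test], entities, known, score_fn, k=1)
```

In the toy data the other known fact outscores the test triplet, so the raw rank is worse than the filtered rank; this is the effect the filtering step is meant to remove.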
Entity Linkage. In alignment with Insight 2, we pose a novel evaluation scheme to perform entity linkage. Let there be two ground truth test sample triplets: $(e_X, e^+_Y, 1)$ representing a positive duplicate label and $(e_X, e^-_Y, 0)$ representing a negative duplicate label. Algorithm 2 outlines the procedure to compute the linkage probability or score $q \in [0, 1]$ for the pair $(e_X, e_Y)$. We use the L1 distance between the two vectors, analogous to Mean Absolute Error (MAE). In lieu of hard-labeling test pairs, we use the score $q$ to compute the Area Under the Precision-Recall Curve (AUPRC).
Algorithm 2 Entity Linkage Score Computation
Input: Test pair $(e_X \in X, e_Y \in Y)$.
Output: Linkage score $q$.
[...] using Eq. 10 and store the score in $S_{repl}$.
5. Compute $q$. Elements in $S_{orig}$ and $S_{repl}$ have one-to-one correspondence, so take the mean absolute difference: $q = |S_{orig} - S_{repl}|_1$
return $q$
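One possible reading of the linkage score computation is sketched below. The steps that precede the final one are assumptions: the triplets involving either entity are scored as they are, each entity is then swapped for the other, the perturbed triplets are scored again, and the mean absolute difference between the two score lists is returned. `linkage_score` and the toy scorer are hypothetical names.

```python
def linkage_score(e_x, e_y, triplets_x, triplets_y, score_fn):
    """Mean absolute change in triplet scores when e_x and e_y are swapped."""
    def swap(triplet, old, new):
        s, r, o = triplet
        return (new if s == old else s, r, new if o == old else o)

    original, replaced = [], []
    for t in triplets_x:
        if e_x in (t[0], t[2]):
            original.append(score_fn(t))
            replaced.append(score_fn(swap(t, e_x, e_y)))
    for t in triplets_y:
        if e_y in (t[0], t[2]):
            original.append(score_fn(t))
            replaced.append(score_fn(swap(t, e_y, e_x)))
    if not original:
        raise ValueError("neither entity appears in any triplet")
    return sum(abs(a - b) for a, b in zip(original, replaced)) / len(original)

# Toy scorer: a triplet's score is the product of scalar entity "embeddings".
emb = {"x1": 0.8, "y1": 0.8, "y2": 0.1, "m": 0.5}

def score_fn(triplet):
    return emb[triplet[0]] * emb[triplet[2]]

triplets_x = [("x1", "r", "m")]
triplets_y = [("y1", "r", "m"), ("y2", "r", "m")]
q_match = linkage_score("x1", "y1", triplets_x, triplets_y, score_fn)
q_nonmatch = linkage_score("x1", "y2", triplets_x, triplets_y, score_fn)
```

Under this reading a small value means the two entities are interchangeable in their triplets, as Insight 2 suggests for duplicates, so the value would need to be inverted or thresholded accordingly before it is treated as a match score.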
For the baselines and the unsupervised version of our model (with no labels for entity linkage), we use a second-stage multilayer neural network as a classifier for evaluating entity linkage. Appendix B.2 provides training configuration details.

Predictive Analysis
Link Prediction Results. We train the LinkNBed model jointly across two knowledge graphs and then perform inference over individual graphs to report link prediction results. For baselines, we train each baseline on individual graphs and use parameters specific to the graph to perform link prediction inference over each individual graph. Table 5.4 shows link prediction performance for all methods. Our model variant with the attention mechanism outperforms all the baselines, with a 4.15% improvement over the single-graph state-of-the-art Complex model on D-IMDB and an 8.23% improvement on the D-FB dataset. D-FB is a more challenging dataset to [...] Hence the closer performance of those two models aligns with the expected outcome. We observed that the Neighborhood context alone provides only marginal improvements, while the model benefits more from the use of attributes. Although its effect is marginal, the attention mechanism also improves accuracy for both datasets. Compared to the baselines, which are trained and evaluated on individual graphs, our superior performance demonstrates the effectiveness of multi-graph learning.
Entity Linkage Results. We report entity linkage results for our method in two settings: (a) a supervised case where we train using both objective functions, and (b) an unsupervised case where we learn with only the relational loss function. The latter case resembles the baseline training, where each model is trained separately on the two graphs in an unsupervised manner. To perform entity linkage in the unsupervised case for all models, we first train a second-stage simple neural network classifier and then perform inference. In the supervised case, we use Algorithm 2 for performing the inference. Table 5.4 demonstrates the performance of all methods on this task. Our method significantly outperforms all the baselines, with 33.86% better performance than the second-best baseline in the supervised case and 17.35% better performance in the unsupervised case.
The difference in the performance of our method between the two cases demonstrates that the two training objectives help one another by learning across the graphs. GAKE's superior performance on this task compared to the other state-of-the-art relational baselines shows the importance of using contex-
Compositional Models learn representations by applying various composition operators to entity and relation embeddings. These models are multiplicative in nature and highly expressive but often suffer from scalability issues. Initial models include RESCAL (Nickel et al., 2011), which uses a relation-specific weight matrix to explain triplets via pairwise interactions of latent features; Neural Tensor Network (Socher et al., 2013), a more expressive model that combines a standard NN layer with a bilinear tensor layer; and (Dong et al., 2014), which employs a concatenation-projection method to project entities and relations to a lower-dimensional space. Later, many sophisticated models (Neural Association Model, HoLE (Nickel et al., 2016b)) have been proposed. Path-based composition models (Toutanova et al., 2016) and contextual models such as GAKE (Feng et al., 2016) have recently been studied to capture more information from graphs. Recently, models like Complex (Trouillon et al., 2016) and Analogy (Liu et al., 2017) have demonstrated state-of-the-art performance on relational learning tasks. Translational Models (Bordes et al., 2014; Bordes et al., 2011; Bordes et al., 2013; Wang et al., 2014; Lin et al., 2015; Xiao et al., 2016) learn representations by employing translational operators on the embeddings and optimizing based on their score. They offer an additive and efficient alternative to expensive multiplicative models. Due to their simplicity, they often lose expressive power.
For a comprehensive survey of relational learning methods and empirical comparisons, we refer the readers to (Nickel et al., 2016a), (Kadlec et al., 2017), (Toutanova and Chen, 2015) and (Yang et al., 2015). None of these methods addresses multi-graph relational learning, and they cannot be adapted to tasks like entity linkage in a straightforward manner.

Entity Resolution in Relational Data
Entity Resolution refers to resolving entities available in knowledge graphs with entity mentions in text. (Dredze et al., 2010) proposes an entity disambiguation method for KB population, (He et al., 2013) learns entity embeddings for resolution, (Huang et al., 2015) proposes a sophisticated DNN architecture for resolution, (Campbell et al., 2016) proposes entity resolution across multiple social domains, (Fang et al., 2016) jointly embeds text and knowledge graph to perform resolution, and (Globerson et al., 2016) proposes an attention mechanism for collective entity resolution.

Learning across multiple graphs
Recently, learning over multiple graphs has gained traction. (Liu and Yang, 2016) divides a multi-relational graph into multiple homogeneous graphs and learns associations across them by employing a product operator. Unlike our work, they do not learn across multiple multi-relational graphs. (Pujara and Getoor, 2016) provides logic-based insights for cross learning, (Pershina et al., 2015) does pairwise entity matching across multi-relational graphs and is very expensive, (Chen et al., 2017) learns embeddings to support multi-lingual learning, and Big-Align (Koutra et al., 2013) tackles the graph alignment problem efficiently for bipartite graphs. None of these methods learns latent representations or jointly trains graph alignment and learning, which is the goal of our work.

Concluding Remarks and Future Work
We present a novel relational learning framework that learns entity and relationship embeddings across multiple graphs. The proposed representation learning framework leverages an efficient learning and inference procedure which takes into account the duplicate entities representing the same real-world entity in a multi-graph setting. We demonstrate superior accuracies on link prediction and entity linkage tasks compared to the existing approaches that are trained only on individual graphs. We believe that this work opens a new research direction in joint representation learning over multiple knowledge graphs. Many data-driven organizations such as Google and Microsoft take the approach of constructing a unified super-graph by integrating data from multiple sources. Such unification has been shown to significantly help in various applications, such as search, question answering and personal assistance. To this end, there exists a rich body of work on linking entities and relations, and on conflict resolution (e.g., knowledge fusion (Dong et al., 2014)). Still, the problem remains challenging for large-scale knowledge graphs, and this paper proposes a deep learning solution that can play a vital role in this construction process. In a real-world setting, we envision our method being integrated into a large-scale system that would include various other components for tasks like conflict resolution, active learning and human-in-the-loop learning to ensure the quality of the constructed super-graph. However, we point out that our method is not restricted to such use cases: one can readily apply it to directly make inference over multiple graphs to support applications like question answering and conversations.
For future work, we would like to extend the current evaluation of our work from a two-graph setting to multiple graphs. A straightforward approach is to create a unified dataset out of more than two graphs by combining the sets of triplets as described in Section 2, and to apply learning and inference on the unified graph without any major change in the methodology. Our inductive framework learns functions to encode contextual information and hence is graph independent. Alternatively, one can develop sophisticated approaches with iterative merging and learning over pairs of graphs until all graphs in an input collection are exhausted.
Entity linkage task is novel in the space of multi-graph learning and yet has not been tackled by any existing relational learning approaches. Hence we analyze our performance on the task in more detail here. We acknowledge that baseline methods are not tailored to the task of entity linkage and hence their low performance is natural. But we observe that our model performs well even in the unsupervised scenario where essentially the linkage loss function is switched off and our model becomes a relational learning baseline. We believe that the inductive ability of our model and shared parameterization helps to capture knowledge across graphs and allows for better linkage performance. This outcome demonstrates the merit in multi-graph learning for different inference tasks. Having said that, we admit that our results are far from comparable to state-of-the-art linkage results (Das et al., 2017) and much work needs to be done to advance representation and relational learning methods to support effective entity linkage. But we note that our model works for multiple types of entities in a very heterogeneous environment with some promising results which serves as an evidence to pursue this direction for entity linkage task.
We now discuss several use-case scenarios where our model did not perform well, to gain insights into what further steps can be pursued to improve over this initial model:
Han Solo with many attributes (False-negative example). Han Solo is a fictional character in Star Wars and appears in both D-IMDB and D-FB records. We have a positive label for this sample but we do not predict it correctly. Our model combines multiple components to learn effectively across graphs, so we investigated all the components to check for failures. One observation is the mismatch in the number of attributes across the two datasets. This is compounded by multi-valued attributes. As described, we use a paragraph2vec-like model to learn attribute embeddings where, for each attribute, we aggregate over all its values. This seems to produce very noisy embeddings. As we have seen, attributes have a high impact on the final result, and hence learning very noisy attribute embeddings does not help. Further, the mismatch in the number of types is also an issue. Even after filtering the types, the difference is large. Types are also included as attributes and they contribute context to relation embeddings. We believe that the skew in the number of types makes the model learn bad embeddings. Specifically, this happens in cases where a lot of information is available, as for Han Solo, since it leads to a scenario of abundant noisy data. Based on our investigation, we believe that contextual embeddings need further sophistication to handle such scenarios. Further, as we already learn relation, type and attribute embeddings in addition to entity embeddings, aligning relations, types and attributes as an integral task could also be an important future direction.
Alfred Pennyworth is never the subject of matter (False-negative example). In this case, we observe a new pattern that was found in many other examples. While there are many triples available for this character in D-IMDB, very few triplets are available in D-FB. This skew in the availability of data hampers the learning of the deep network, which ends up learning very different embeddings for the two realizations. Further, we observe another pattern: Alfred Pennyworth appears only as an object in all of those few D-FB triplets, while it appears as both subject and object in D-IMDB. Accounting for asymmetric relationships in an explicit manner may be helpful in this scenario.
Thomas Wayne is Martha Wayne! (False-positive example). This is a case of an abundance of similar contextual information, as our model predicts Thomas Wayne and Martha Wayne to be the same entity. Both characters share a lot of context, and hence many triples, attributes, neighborhoods, etc. are similar for both of them, which eventually leads to very similar embeddings. Further, as we have seen before, neighborhood has been shown to be a weak context, which seems to hamper the learning in this case. Finally, the key insight here is that the model needs to attend to the very few discriminative features of the entities in both datasets (e.g., male vs. female), and hence a more sophisticated attention mechanism would help.
In addition to the above specific use cases, we would like to discuss insights on the following general concepts that naturally occur when learning over multiple graphs:
• Entity Overlap Across Graphs. In terms of overlap, one needs to distinguish between *real* and *known* overlap between entities. For the known overlap between entities, we use that knowledge in the linkage loss function $L_{lab}$. But our method does not need to assume either type of overlap. If there is no real overlap, the model will learn embeddings as if they were on two separate graphs and hence will provide only marginal (if any) improvement over state-of-the-art embedding methods for single graphs. If there is real overlap but no known overlap (i.e., no linked entity labels), the only change is that Equation (13) will ignore the term $(1 - b) \cdot L_{lab}$. Table 3 shows that in this case (corresponding to AUPRC (Unsupervised)), we are still able to learn similar embeddings for graph entities corresponding to the same real-world entity.
• Disproportionate Evidence for Entities Across Graphs. While a higher proportion of occurrences helps to provide more evidence for training an entity embedding, the overall quality of the embedding will also be affected by all other contexts, and hence we expect varied entity-specific behavior when entities occur in different proportions across the two graphs.
• Ambiguity vs. Accuracy. The effect of ambiguity on accuracy depends on the type of semantic differences. For example, we observe that similar entities with major differences in attributes across graphs hurt the accuracy, while the impact is less prominent for similar entities when only their neighborhoods differ.

B.1 Additional Dataset Details
We perform light pre-processing on the dataset to remove self-loops from triples, clean the attributes to remove garbage characters and collapse CVT (Compound Value Type) entities into single triplets. Further, we observe that there is a big skew in the number of types between D-IMDB and D-FB. D-FB contains many non-informative types such as #base.*. We remove all such non-informative types from both datasets, which leaves 41 types in D-IMDB and 324 types in D-FB. This filtering does not reduce the number of entities or triples significantly (fewer than 1000 entities are filtered). For comparing at scale with baselines, we further reduce the dataset using techniques similar to those adopted in producing the widely accepted FB-15K or FB-237K. Specifically, we filter relational triples such that both entities in a triple contained in our dataset must appear in more than k triples. We use k = 50 for D-FB and k = 100 for D-IMDB, as D-IMDB has orders of magnitude more triples than D-FB in our curated datasets. We still maintain the overall ratio of the number of triples between the two datasets.
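The frequency filter described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `filter_triples` is hypothetical, and it assumes a single pass in which counts are taken once on the input, since the text does not say whether the filter is repeated until it stabilizes.

```python
from collections import Counter


def filter_triples(triples, k):
    """Keep triples whose subject and object each appear in more than k triples.

    Assumption: counts are computed once on the unfiltered input and are
    not recomputed after filtering (single pass).
    """
    counts = Counter()
    for s, _, o in triples:
        counts[s] += 1
        if o != s:  # a self-loop counts once for its entity
            counts[o] += 1
    return [(s, r, o) for s, r, o in triples if counts[s] > k and counts[o] > k]


if __name__ == "__main__":
    toy = [("a", "r", "b"), ("a", "r", "c"), ("b", "r", "a"), ("c", "r", "d")]
    # With k = 1, entity "d" appears in only one triple, so its triple is dropped.
    print(filter_triples(toy, k=1))
```

Under this reading, the values k = 50 and k = 100 would be passed separately for the D-FB and D-IMDB triple sets.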
Positive and Negative Labels. We obtain 500662 positive labels using the existing links between the two datasets. Note that any entity can have only one positive label. We also generate 20 negative labels for each entity using the following method: (i) randomly select 10 entities from the other graph such that both entities belong to the same type and there exists no positive label between them; (ii) randomly select 10 entities from the other graph such that the two entities belong to different types.
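The two-step negative label generation can be sketched as below. The function name, argument names, and the fixed seed are hypothetical, and the sketch assumes each entity has a single type, which simplifies the multi-type entities in the actual datasets.

```python
import random


def sample_negative_labels(entity, entity_type, other_types, positives,
                           n_same=10, n_diff=10, rng=None):
    """Return negative (entity, candidate) pairs for one entity.

    other_types: dict mapping each entity of the other graph to its type
                 (one type per entity is a simplifying assumption).
    positives:   set of (entity, other_entity) pairs known to be linked.
    """
    rng = rng or random.Random(0)
    # (i) same type, and not linked to `entity` by a positive label
    same = [c for c, t in other_types.items()
            if t == entity_type and (entity, c) not in positives]
    # (ii) different type
    diff = [c for c, t in other_types.items() if t != entity_type]
    chosen = (rng.sample(same, min(n_same, len(same)))
              + rng.sample(diff, min(n_diff, len(diff))))
    return [(entity, c) for c in chosen]
```

With the default arguments this yields up to 20 negative labels per entity, 10 from each step, and fewer when the other graph does not contain enough candidates.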

B.2 Training Configurations
We performed a hyper-parameter grid search to obtain the best performance of our method and finally used the following configuration to obtain the reported results: Entity Embedding Size = 256, Relation Embedding Size = 64, Attribute Embedding Size = 16, Type Embedding Size = 16, Attribute Value Embedding Size = 512. We tried multiple batch sizes with very minor differences in performance and finally used a batch size of 2000. For hidden units per layer, we use size = 64. We used C = 50 negative samples and Z = 20 negative labels. The learning rate was initialized to 0.01 and then decayed over epochs. We ran our experiments for 5 epochs, after which the training starts to converge as the dataset is very large. We use loss weight b = 0.6 and a margin of 1. Further, we use K = 50 random walks of length l = 3 for each entity. We used a train/test split of 60%/40% for both the triples set and the labels set. For baselines, we used the implementations provided by the respective authors and performed a grid search for all methods according to their requirements.
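For reference, the reported settings can be gathered into a single configuration mapping. The key names below are hypothetical and chosen only for readability; the values are those stated above. The learning-rate decay schedule is not specified in the text and is therefore omitted.

```python
# Reported LinkNBed training configuration (key names are illustrative only).
CONFIG = {
    "entity_embedding_size": 256,
    "relation_embedding_size": 64,
    "attribute_embedding_size": 16,
    "type_embedding_size": 16,
    "attribute_value_embedding_size": 512,
    "batch_size": 2000,
    "hidden_units_per_layer": 64,
    "negative_samples_C": 50,
    "negative_labels_Z": 20,
    "initial_learning_rate": 0.01,  # decayed over epochs; schedule not given
    "epochs": 5,
    "loss_weight_b": 0.6,
    "margin": 1,
    "random_walks_K": 50,
    "random_walk_length_l": 3,
    "train_fraction": 0.6,
    "test_fraction": 0.4,
}
```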

C Contextual Information Formulations
Here we describe the exact formulation of each context used in our work.
Neighborhood Context. Given a triplet $(e_s, r, e_o)$, the neighborhood context for an entity $e_s$ consists of all the nodes at 1-hop distance from $e_s$ other than the node $e_o$. This captures the effect of the other nodes surrounding $e_s$ in the graph that drive $e_s$ to participate in the fact $(e_s, r, e_o)$. Concretely, the neighborhood context $N_c(e_s) \in \mathbb{R}^d$ of $e_s$ is defined in Equation (14) as a sum taken over $N(e_s)$, the set of all entities in the neighborhood of $e_s$ other than $e_o$. We collect the neighborhood set for each entity as a pre-processing step using a random walk method. Specifically, given a node $e$, we run $k$ rounds of random walks of length $l$ and create the neighborhood set $N(e)$ by adding all unique nodes visited across these walks.
Note that the max function can also be used in (14) instead of the sum. The context can be computed similarly for the object entity.
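The random-walk pre-processing and the sum or max aggregation can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the adjacency-list representation, uniform neighbor selection, the exclusion of the start node from its own neighborhood, and the zero vector returned for an empty neighborhood are all choices made here.

```python
import random


def neighborhood_set(adj, e, k, l, rng=None):
    """Collect N(e): unique nodes visited by k random walks of length l from e."""
    rng = rng or random.Random(0)
    visited = set()
    for _ in range(k):
        node = e
        for _ in range(l):
            nbrs = adj.get(node, [])
            if not nbrs:  # dead end: stop this walk early
                break
            node = rng.choice(nbrs)  # assumption: uniform choice of next node
            visited.add(node)
    visited.discard(e)  # assumption: e is not part of its own neighborhood
    return visited


def neighborhood_context(emb, neighbors, e_o, mode="sum"):
    """Aggregate neighbor embeddings element-wise, excluding the object e_o."""
    vecs = [emb[n] for n in neighbors if n != e_o]
    d = len(next(iter(emb.values())))
    if not vecs:
        return [0.0] * d  # assumption: empty neighborhood gives a zero context
    pool = sum if mode == "sum" else max
    return [pool(col) for col in zip(*vecs)]
```

Here `emb` maps each entity to a plain list of floats of length d; in a real system these would be learned embedding vectors in a tensor library.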
Attribute Context. For an entity $e_s$, the corresponding attribute context $A_c(e_s) \in \mathbb{R}^y$ is computed from its attribute embeddings $a_i^{e_s}$, $i = 1, \dots, n_a$, where $n_a$ is the number of attributes and $a_i^{e_s}$ is the embedding for attribute $i$.
Type Context. We use the type context mainly for relationships, i.e., for a given relationship $r$, this context aims to capture the effect of the types of entities that have participated in this relationship. For a given triplet $(e_s, r, e_o)$, the type context $T_c(r) \in \mathbb{R}^q$ for relationship $r$ is computed from the type embeddings $v_i^t$, $i = 1, \dots, n_t^r$, where $n_t^r$ is the total number of types of entities that have participated in relationship $r$ and $v_i^t$ is the type embedding that corresponds to type $t$.
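The displayed equations for the three contexts do not appear in this text. One form that would be consistent with the definitions above is sketched below; it is an assumption, not the original equations. In particular, the entity-embedding symbol $v^{e'}$ is introduced here for illustration, and the $1/n$ averaging in the attribute and type contexts is a guess: $n_a$ and $n_t^r$ may instead serve only as the upper limits of plain sums.

```latex
\begin{align}
N_c(e_s) &= \sum_{\substack{e' \in N(e_s) \\ e' \neq e_o}} v^{e'}
  && \text{(sum; may be replaced by an element-wise max)} \\
A_c(e_s) &= \frac{1}{n_a} \sum_{i=1}^{n_a} a_i^{e_s} \\
T_c(r)   &= \frac{1}{n_t^r} \sum_{i=1}^{n_t^r} v_i^{t}
\end{align}
```

Under this reading, each context is a simple pooling of a set of embeddings, which keeps its dimensionality equal to that of the pooled embeddings ($d$, $y$ and $q$ respectively).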