Semi-supervised Entity Alignment via Joint Knowledge Embedding Model and Cross-graph Model

Entity alignment aims to integrate complementary knowledge graphs (KGs) from different sources or languages, which may benefit many knowledge-driven applications. It is challenging because KGs are heterogeneous and seed alignments are limited. In this paper, we propose a semi-supervised entity alignment method that jointly trains a Knowledge Embedding model and a Cross-Graph model (KECG). It makes better use of seed alignments by propagating them over the entire graphs under KG-based constraints. Specifically, for the knowledge embedding model, we utilize TransE to implicitly complete the two KGs towards consistency and to learn relational constraints between entities. For the cross-graph model, we extend the Graph Attention Network (GAT) with a projection constraint to robustly encode graphs; the two KGs share the same GAT, which transfers structural knowledge and, via the attention mechanism, ignores neighbors that are unimportant for alignment. Results on publicly available datasets and further analysis demonstrate the effectiveness of KECG. Our code is available at https://github.com/THU-KEG/KECG.


Introduction
Recently, many Knowledge Graphs (KGs) (e.g., DBpedia (Lehmann et al., 2015), YAGO (Rebele et al., 2016) and BabelNet (Navigli and Ponzetto, 2012)) have emerged to provide structural knowledge for different applications. These separately constructed KGs contain heterogeneous but complementary contents; integrating KGs from different sources or languages into a unified KG is therefore essential for knowledge-driven applications, ranging from information extraction (Han et al., 2018; Cao et al., 2018) to question answering (Cui et al., 2017).
Some efforts have been made to integrate KGs by aligning entities with their semantically equivalent counterparts, a task known as entity alignment. Early entity alignment approaches rely either on human effort (Mahdisoltani et al., 2013) or on extra resources (Hu et al., 2011; Wang et al., 2013) to discover new equivalent entity pairs. Recent embedding approaches show significant improvements on entity alignment. We roughly classify these embedding models into two groups: KG-based models and graph-based models. KG-based models (Hao et al., 2016; Chen et al., 2017; Zhu et al., 2017; Chen et al., 2018) utilize existing KG representation learning methods to learn embeddings of entities and relations in different KGs, and then align them in a unified vector space. This type of method not only preserves KG structures, but also implicitly completes a KG with links that are missing from existing knowledge. However, KG-based methods require a sufficient number of seed alignments, which are usually expensive to obtain.
To alleviate the burden of seed alignments, graph-based methods (Wang et al., 2018) utilize the Graph Convolutional Network (GCN) (Kipf and Welling, 2017) to enhance entity embeddings with information from their neighbors, and can thus make better use of seed alignments by propagating them over the entire graph. However, GCN-based models are sensitive to structural differences of KGs (Vaswani et al., 2017) and may not perform well for entity alignment, since different KGs are very heterogeneous. Another issue with this type of method is that it fails to consider relation types, which actually play an important role in entity alignment. As shown in Figure 1, the New York City entity has different surface forms in the two KGs (i.e., New York City in KG 1 and New York in KG 2). There are two possible mappings to New York in KG 2, one from New York City and one from New York (state) in KG 1, because New York in KG 2 shares the neighbors Hudson River and U.S. state with both New York City and New York (state) in KG 1. In this case, we need to further consider the relation types adjoin and hasRiver to distinguish the two candidate alignments; otherwise, a graph-based method may be misled.
In this paper, we propose a semi-supervised entity alignment method that jointly trains a Knowledge Embedding model and a Cross-Graph model (KECG), combining the above two types of methods. The basic idea of KECG is to utilize a cross-graph model to embed entities into a unified vector space by using inner-graph structure and inter-graph alignment information; meanwhile, the knowledge embedding model learns KG representations to implicitly complete different KGs towards consistency and to model relational constraints among entities. There are two key points for high-quality joint training. First, the completion from KG learning may exacerbate the heterogeneity between two KGs, because the two KGs may be rich in different parts, and these parts become richer during training. Second, unlike conventional KG representation learning, entity alignment requires a one-to-one mapping, which implies that similar entities sharing common neighbors must not be embedded closely; otherwise they will be aligned incorrectly, as shown in Figure 1. Therefore, we propose to utilize an extended Graph Attention Network (GAT) (Velickovic et al., 2018) as our cross-graph model, which allows robust encoding of KGs. To alleviate the negative impact of heterogeneity, we encode different KGs with the same GAT, so that it not only highlights the entities that are important for alignment in the graph, but also transfers structural knowledge via shared parameters. We also extend it with constraints on the weight matrices, making our model more efficient and effective. Moreover, we employ the nearest neighbor sampling strategy to differentiate entities that are embedded closely but are not the same, which further boosts performance. We summarize the main contributions as follows:
1. We propose a novel semi-supervised KECG model for entity alignment that jointly trains a knowledge embedding model and a cross-graph model.
2. We utilize an extended GAT to alleviate the negative impacts of heterogeneous KGs, and employ the nearest neighbor sampling strategy for KG representation learning towards one-to-one mapping, making the combination more suitable for entity alignment.
3. We evaluate KECG on five publicly available datasets, including three cross-lingual datasets and two large-scale datasets from different sources. Experimental results and further analysis show that KECG outperforms four state-of-the-art baselines.

Problem Formulation
Formally, we represent two heterogeneous KGs as $G_1 = (E_1, R_1, T_1)$ and $G_2 = (E_2, R_2, T_2)$, where $E_i$, $R_i$, $T_i$ denote the sets of entities, relations and triplets in $G_i$, $i \in \{1, 2\}$, respectively. $N_e = \{e' \mid (e, r, e') \in T\} \cup \{e' \mid (e', r, e) \in T\}$ is the set of neighbors of the entity $e$ in $G$. $S = \{(e_1, e_2) \in E_1 \times E_2 \mid e_1 \leftrightarrow e_2\}$ is a set of labeled entity pairs that are semantically equivalent, where an equivalence relation $\leftrightarrow$ holds between $e_1$ and $e_2$, e.g., $e_1$ in $G_1$ describes the same real-world entity as $e_2$ in $G_2$. Entity alignment aims to find the remaining semantically equivalent entity pairs in $E_1 \times E_2$. For convenience, we put $G_1$ and $G_2$ together into one large graph $G = G_1 + G_2$, with $R = R_1 \cup R_2$ and $T = T_1 \cup T_2$, and $n = |E_1| + |E_2|$ is the total number of entities. We use bold-face letters to denote the vector representations of the corresponding terms throughout the paper.
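As an illustration of this notation, the following minimal sketch builds the merged triplet set and the neighbor sets from two toy KGs. The function name `neighbor_sets` and all entity and relation identifiers are hypothetical and are not taken from the released code.

```python
from collections import defaultdict


def neighbor_sets(triples):
    """Return N_e for every entity: all entities that share a triplet with e."""
    neighbors = defaultdict(set)
    for head, _relation, tail in triples:
        neighbors[head].add(tail)  # e' with (e, r, e') in T
        neighbors[tail].add(head)  # e' with (e', r, e) in T
    return dict(neighbors)


# Hypothetical toy KGs; identifiers are made up for illustration only.
T1 = [("a1", "r1", "b1"), ("c1", "r2", "a1")]
T2 = [("a2", "r3", "b2")]
E1 = {"a1", "b1", "c1"}
E2 = {"a2", "b2"}
S = [("a1", "a2")]  # a single seed alignment

T = T1 + T2            # T = T1 ∪ T2 (the toy sets are disjoint)
n = len(E1) + len(E2)  # n = |E1| + |E2|
N = neighbor_sets(T)
print(n, sorted(N["a1"]))
```

Both triplet directions contribute to the neighbor set, matching the two-sided union in the definition of the neighbor set.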

Graph Neural Networks (GNNs)
GNNs are a type of neural network model that operates directly on graph-structured data (Bruna et al., 2014; Defferrard et al., 2016; Hamilton et al., 2017). They take a graph as input, output labels for each node, and resemble a propagation model that enhances the features of a node according to its neighbor nodes. Among them, GCN (Kipf and Welling, 2017) is a simplification of spectral graph convolutions, formulated as follows:

$$H^{(l+1)} = \sigma\left(A H^{(l)} W^{(l)}\right)$$

where $A$ is the normalized adjacency matrix of the input graph with self-connections, $H^{(l)}$ and $W^{(l)}$ are the hidden states and weights in the $l$-th layer, and $\sigma(\cdot)$ is a non-linear activation, such as ReLU. Subsequently, Velickovic et al. (2018) introduce the attention mechanism into GCN by dynamically calculating $A$ according to the attention values between each pair of entities, which is better at mining structural knowledge.
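A minimal sketch of one GCN propagation step may make the roles of these quantities concrete. It assumes the standard rule of Kipf and Welling (2017), $H^{(l+1)} = \sigma(A H^{(l)} W^{(l)})$, with symmetric normalization of the adjacency matrix after adding self-connections; the helper names are hypothetical, and plain Python lists stand in for tensors.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """Add self-connections, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(a_norm, h, w):
    """One propagation step: ReLU(A H W)."""
    z = matmul(matmul(a_norm, h), w)
    return [[max(0.0, v) for v in row] for row in z]


# Two connected nodes with one-hot features and an identity weight matrix.
adj = [[0.0, 1.0], [1.0, 0.0]]
h0 = [[1.0, 0.0], [0.0, 1.0]]
w0 = [[1.0, 0.0], [0.0, 1.0]]
h1 = gcn_layer(normalized_adjacency(adj), h0, w0)
print(h1)
```

In this toy graph each node ends up with the average of its own and its neighbor's features, which is the propagation behavior described above.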

Method
To combine KG-based methods and graph-based methods while accounting for limited seeds and KG heterogeneity, we propose a novel semi-supervised method, KECG, for entity alignment. As shown in Figure 2, KECG consists of two parts: a cross-graph model and a knowledge embedding model. The input of KECG includes a composition G of two KGs and their prior alignment set S.

Cross-graph Model
The goal of our cross-graph model is to utilize structural information, including inner-graph structure and inter-graph alignments, to embed entities into a unified vector space. Due to the incompleteness of KGs, some entities and relations may exist in one KG but be missing in the other. We utilize an extended GAT with a projection constraint as an encoder that embeds the KGs while paying different amounts of attention to the neighbors of each entity; it can therefore alleviate the negative impact of heterogeneity by ignoring neighbors that are unimportant for alignment. Besides, the graph convolution operation allows us to leverage both the labeled entities and the abundant unlabeled entities for alignment, making our method naturally semi-supervised. The input of the encoder includes an entity embedding matrix $X \in \mathbb{R}^{n \times d}$ and the neighbor sets of all entities as the KG structure, where $d$ is the dimension of the entity embeddings. The KG encoder is built by stacking multiple GAT layers:

$$H^{(l+1)} = \sigma\left(A^{(l)} H^{(l)} W^{(l)}\right)$$

where $H^{(l)}$ and $W^{(l)}$ are the hidden states and weights in the $l$-th layer, $\sigma(\cdot)$ is a non-linear activation chosen as $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$, and $A^{(l)} \in \mathbb{R}^{n \times n}$ is a connectivity matrix computed by self-attention over the input graph. Considering a single element $a_{ij}^{(l)}$ of $A^{(l)}$, the weight from entity $e_i$ to entity $e_j$, we compute it using the self-attention mechanism:

$$a_{ij}^{(l)} = \mathrm{softmax}\left(c_{ij}^{(l)}\right) = \frac{\exp\left(c_{ij}^{(l)}\right)}{\sum_{e_k \in N_{e_i} \cup \{e_i\}} \exp\left(c_{ik}^{(l)}\right)}$$

where $c_{ij}^{(l)}$ is the attention coefficient of entity $e_i$ to entity $e_j$. Following Velickovic et al. (2018), the attention coefficient $c_{ij}^{(l)}$ is computed as follows:

$$c_{ij}^{(l)} = \mathrm{LeakyReLU}\left(q^{T}\left[W^{(l)} h_i^{(l)} \oplus W^{(l)} h_j^{(l)}\right]\right)$$

where $h_i^{(l)}$ and $h_j^{(l)}$ are the hidden states of $e_i$ and $e_j$, respectively, LeakyReLU (Maas et al., 2013) is a non-linear function, $q \in \mathbb{R}^{2d^{(l)}}$ is a learnable parameter, $\cdot^{T}$ represents transposition and $\oplus$ indicates vector concatenation. Let $L$ be the number of layers. After $L$ iterations of convolution, the hidden state $H^{(L)}$ integrates the features of both the entities and their neighbors; its $i$-th row is the attention-enhanced embedding of entity $e_i$.
Projection Constraint. Note that in every forward propagation, $W^{(l)}$ transforms vectors from a $d^{(l)}$-dimensional space to a $d^{(l+1)}$-dimensional space. Motivated by , we keep the dimensions of all layers and embeddings the same, and restrict the projection matrix $W$ to be a diagonal matrix. Such a constraint reduces the number of parameters and computations and improves the generalizability of the model. This is why we describe our encoder as an extended GAT.
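The following sketch illustrates a single attention layer with a diagonal projection matrix. It is an illustrative reading of the description above and not the released implementation: it assumes the softmax is taken over each entity's neighbors plus the entity itself and that the coefficient follows the GAT form $\mathrm{LeakyReLU}(q^{T}[W h_i \oplus W h_j])$. The function name `gat_layer` and the toy inputs are hypothetical.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_layer(h, neighbors, w_diag, q):
    """One attention layer with a diagonal projection matrix.

    h: list of n row vectors; neighbors: dict index -> iterable of indices;
    w_diag: diagonal of W (length d); q: attention vector (length 2d).
    """
    # A diagonal W reduces the projection to an element-wise scaling.
    wh = [[w * v for w, v in zip(w_diag, row)] for row in h]
    out = []
    for i in range(len(h)):
        cand = sorted(set(neighbors.get(i, ())) | {i})  # neighbors plus self
        scores = [leaky_relu(sum(qk * x for qk, x in zip(q, wh[i] + wh[j])))
                  for j in cand]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        att = [e / total for e in exps]
        agg = [sum(a * wh[j][k] for a, j in zip(att, cand))
               for k in range(len(w_diag))]
        out.append([max(0.0, v) for v in agg])  # ReLU
    return out


h = [[1.0, 0.0], [0.0, 1.0]]
neighbors = {0: [1], 1: [0]}
out = gat_layer(h, neighbors, w_diag=[1.0, 1.0], q=[0.0, 0.0, 0.0, 0.0])
print(out)
```

With a diagonal projection, each layer stores only d weights instead of a full d × d matrix, which is the parameter saving that the projection constraint refers to.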
Objective $O_C$. For semi-supervised entity alignment, we minimize the representation distance of equivalent entities over all labeled alignments, similar to Wang et al. (2018):

$$O_C = \sum_{(e_i, e_j) \in S} \sum_{(e_i', e_j') \in S'} \left[\mathrm{dist}(e_i, e_j) + \gamma_1 - \mathrm{dist}(e_i', e_j')\right]_+$$

where $\mathrm{dist}(e_i, e_j) = \|\mathbf{e}_i - \mathbf{e}_j\|_2$ is the $L_2$ distance between the aligned entity pair $(e_i, e_j)$, the representations of $e_i$ and $e_j$ are taken from the attention-enhanced embedding matrix $H^{(L)}$, $S'$ represents the negative pair set of $S$ generated by nearest neighbor sampling (Kotnis and Nastase, 2017), and $\gamma_1 > 0$ is a margin hyper-parameter. Because an entity can have only one corresponding entity in the other KG, the entities closest to that corresponding entity in the same KG are the best choice of negative examples for accurately discriminating the target entity. More formally, given a pre-aligned entity pair $(e_i, e_j) \in S$, where $e_i \in E_1$ and $e_j \in E_2$, and the number of negative samples $K_2$, we choose the $K_2$ entities that are nearest to $e_j$ in $E_2$ as the negative samples of $e_i$, and vice versa for $e_j$. Each pre-aligned entity pair thus has $2K_2$ negative samples. We use the $L_2$ distance as the measure for finding the nearest negative samples because of its superior performance in experiments.
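A minimal sketch of nearest neighbor sampling and the resulting margin loss is given below. It assumes the hinge form $[\mathrm{dist}(e_i, e_j) + \gamma_1 - \mathrm{dist}(e_i', e_j')]_+$ summed over all sampled negative pairs; the function names and the toy two-dimensional embeddings are hypothetical.

```python
import math


def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_negatives(target, candidates, emb, k):
    """Return the k candidates closest to `target`, excluding the target itself."""
    others = [c for c in candidates if c != target]
    return sorted(others, key=lambda c: l2(emb[c], emb[target]))[:k]


def alignment_loss(seeds, ents1, ents2, emb, k, margin):
    """Margin loss over seed pairs, with 2 * k nearest-neighbor negatives per pair."""
    loss = 0.0
    for e1, e2 in seeds:
        pos = l2(emb[e1], emb[e2])
        # Negatives of e1: the k entities of KG 2 that lie closest to e2.
        negatives = [(e1, m) for m in nearest_negatives(e2, ents2, emb, k)]
        # Negatives of e2: the k entities of KG 1 that lie closest to e1.
        negatives += [(m, e2) for m in nearest_negatives(e1, ents1, emb, k)]
        for a, b in negatives:
            loss += max(0.0, pos + margin - l2(emb[a], emb[b]))
    return loss


# Hypothetical toy embeddings for two small KGs.
emb = {"a1": [0.0, 0.0], "a2": [5.0, 0.0],
       "b1": [0.0, 0.0], "b2": [1.0, 0.0], "b3": [10.0, 0.0]}
ents1, ents2 = ["a1", "a2"], ["b1", "b2", "b3"]
loss = alignment_loss([("a1", "b1")], ents1, ents2, emb, k=1, margin=3.0)
print(loss)
```

In this toy case only the hard negative (a1, b2) violates the margin; the distant entity b3 is never selected, which is how nearest neighbor sampling differs from uniform sampling.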

Knowledge Embedding Model
The goal of our knowledge embedding model is to model inner-graph relationships, making entities more distinguishable. Here, we use TransE (Bordes et al., 2013), one of the most representative translation-based methods, as our knowledge embedding model. Other advanced KG learning methods, such as TransD (Ji et al., 2015), can also serve as the knowledge embedding model; we leave this for future work, as our main idea is to jointly learn cross-graph embeddings and knowledge embeddings for entity alignment.
Objective $O_K$. Formally, given a relational triplet $(e_h, r, e_t)$, TransE expects $\mathbf{e}_h + \mathbf{r} \approx \mathbf{e}_t$. It therefore defines the score function $f(e_h, r, e_t) = \|\mathbf{e}_h + \mathbf{r} - \mathbf{e}_t\|_2$ to measure the plausibility of $(e_h, r, e_t)$, where $\|\cdot\|_2$ denotes the 2-norm. Following TransE, we use a margin-based ranking loss as the training objective of the knowledge embedding model, defined as:

$$O_K = \sum_{(e_h, r, e_t) \in T} \sum_{(e_h', r, e_t') \in T'} \left[f(e_h, r, e_t) + \gamma_2 - f(e_h', r, e_t')\right]_+$$

where $[\cdot]_+ = \max\{0, \cdot\}$ represents the maximum between 0 and the input, the embeddings of entities $e_h$ and $e_t$ are taken from the attention-enhanced embedding matrix $H^{(L)}$, the embedding of relation $r$ is taken from the relation matrix $\mathbf{R} \in \mathbb{R}^{|R| \times d}$, which needs to be learned, $T'$ stands for the negative triplet set of $T$, and $\gamma_2 > 0$ is a margin hyper-parameter separating positive and negative triplets. $T'$ is generated by corrupting $T$. For a triplet $(e_h, r, e_t) \in T$, we replace $e_h$ with $e_h'$ to generate a negative triplet $(e_h', r, e_t)$, where $e_h'$ also has relation type $r$ connecting it to other entities; $e_t$ is replaced in the same way. $K_1$ is the number of negative samples for each positive triplet. For example, Beijing City can replace New York City in (New York City, country, United States), since Beijing City also has the relation country but is connected to a different tail entity, China. If the number of replacers is less than $K_1$, we randomly sample replacers from all entities, as the uniform negative sampling method does.
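The TransE score, the relation-aware head corruption and the margin loss can be sketched as follows. This is a simplified illustration under stated assumptions: only heads are corrupted (tails would be handled symmetrically), known true triplets are excluded from the negatives, and the names `head_replacers` and `margin_loss` as well as the toy vectors are hypothetical.

```python
import math
import random


def transe_score(h, r, t):
    """f(e_h, r, e_t) = ||e_h + r - e_t||_2; lower means more plausible."""
    return math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(h, r, t)))


def head_replacers(triple, triples, entities, k, rng):
    """Choose up to k replacement heads, preferring heads that also carry relation r."""
    head, rel, tail = triple
    known = set(triples)
    preferred = sorted({h for h, r, _ in triples
                        if r == rel and h != head and (h, rel, tail) not in known})
    picks = rng.sample(preferred, k) if len(preferred) > k else preferred
    # Fall back to uniform sampling when too few relation-aware replacers exist.
    pool = sorted(e for e in entities
                  if e != head and e not in picks and (e, rel, tail) not in known)
    need = min(k - len(picks), len(pool))
    return picks + rng.sample(pool, need)


def margin_loss(triple, neg_heads, ent, rel, margin):
    """Sum of [f(pos) + margin - f(neg)]_+ over head-corrupted negatives."""
    h, r, t = triple
    pos = transe_score(ent[h], rel[r], ent[t])
    return sum(max(0.0, pos + margin - transe_score(ent[m], rel[r], ent[t]))
               for m in neg_heads)


# Toy data based on the example in the text; the vectors are made up.
triples = [("New York City", "country", "United States"),
           ("Beijing City", "country", "China")]
entities = ["New York City", "Beijing City", "United States", "China"]
ent = {"New York City": [0.0, 0.0], "United States": [1.0, 0.0],
       "Beijing City": [2.0, 0.0], "China": [3.0, 0.0]}
rel = {"country": [1.0, 0.0]}

negs = head_replacers(triples[0], triples, entities, k=1, rng=random.Random(0))
loss = margin_loss(triples[0], negs, ent, rel, margin=3.0)
print(negs, loss)
```

Preferring heads that already occur with the same relation yields harder negatives than uniform corruption, because the corrupted triplet remains type-plausible.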

Optimization and Inference
Overall Objective $O$. We define the objective function of KECG by combining the above two parts:

$$O = O_C + O_K$$

where $O_C$ and $O_K$ denote the objective functions of the cross-graph model and the knowledge embedding model, respectively. Different weights could be set for $O_C$ and $O_K$, but in order to treat both models equally, we give them the same weight in our experiments. We use AdaGrad (Duchi et al., 2011) to optimize the overall objective $O$.
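As a small illustration, the sketch below combines the two objectives with equal weight and applies one AdaGrad update to a flat list of parameters. It assumes the standard AdaGrad rule, in which each parameter's step is scaled by the root of its accumulated squared gradients; the function names, the learning rate and the toy values are hypothetical.

```python
import math


def overall_objective(o_c, o_k):
    """Equal-weight combination of the two objectives."""
    return o_c + o_k


def adagrad_step(params, grads, accum, lr, eps=1e-8):
    """Update params in place; accum holds the running sum of squared gradients."""
    for i, g in enumerate(grads):
        accum[i] += g * g
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)


params, accum = [1.0], [0.0]
adagrad_step(params, grads=[2.0], accum=accum, lr=0.1)
print(overall_objective(2.0, 1.0), params, accum)
```

Because the accumulator only grows, parameters that receive frequent large gradients take progressively smaller steps.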
Inference. At inference time, new equivalent entities are discovered based on the $L_2$ distances between entities in the joint embedding space. Equivalent entities should be close to each other, as should equivalent relation types, e.g., the circled nodes and $r_1$ and $r_4$ in the output of Figure 2.
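Inference can be sketched as a nearest-neighbor search under the $L_2$ distance. The snippet assumes that each source entity is compared against all candidate entities of the other KG; the function names and toy embeddings are hypothetical.

```python
import math


def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_candidates(query, candidates, emb):
    """Sort candidate entities by their L2 distance to the query entity."""
    return sorted(candidates, key=lambda c: l2(emb[query], emb[c]))


def rank_of(query, gold, candidates, emb):
    """1-based rank of the gold counterpart among the candidates."""
    return rank_candidates(query, candidates, emb).index(gold) + 1


emb = {"q": [0.0, 0.0], "x": [3.0, 4.0], "y": [1.0, 0.0], "z": [0.0, 2.0]}
ranking = rank_candidates("q", ["x", "y", "z"], emb)
print(ranking, rank_of("q", "z", ["x", "y", "z"], emb))
```

The top-ranked candidate is the predicted counterpart, and the rank of the true counterpart is the quantity on which ranking-based evaluation metrics are computed.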

Experiments
We evaluate KECG on entity alignment with five publicly available datasets, including three cross-lingual datasets and two large-scale datasets from different sources. We compare KECG with four state-of-the-art baselines and conduct further analysis of the essential components of KECG.

Datasets
Following Wang et al. (2018), we use the DBP15K and the DWY100K datasets for evaluation. DBP15K contains three cross-lingual datasets built from DBpedia, denoted by DBP15K ZH−EN (Chinese to English), DBP15K JA−EN (Japanese to English) and DBP15K FR−EN (French to English). Each dataset contains 15,000 reference entity alignments. To test the adaptability of KECG to large-scale data, we also evaluate on DWY100K, which contains two large-scale datasets extracted from DBpedia, Wikidata and YAGO3, denoted by DWY100K WD (DBpedia to Wikidata) and DWY100K YG (DBpedia to YAGO3). Each dataset has 100,000 reference entity alignments. Detailed information on each dataset is listed in Table 1.

Experiment Settings
To investigate the ability of KECG, we compare it with four state-of-the-art entity alignment methods: three KG-based models (MTransE, JAPE and AlignEA) and one graph-based model (GCN-Align). For KECG, we use a 3-layer extended GAT as the encoder to process the composite KG G and generate embeddings of its entities. We set the dimension of all layers to the same value d. The entity embedding matrix X and the relation matrix R are initialized randomly and updated during training. Besides, we derive two variants of KECG for the ablation study, called KECG(w/o NNS) and KECG(w/o K).
• KECG(w/o NNS) replaces the nearest neighbor sampling (NNS) strategy with normal uniform negative sampling.
• KECG(w/o K) simply removes the knowledge embedding model and considers only the equivalence relation, as GCN-Align does.
For all the compared approaches, we randomly select 30% of the reference entity alignments as prior alignments (i.e., training data) and leave the rest as testing data. We use two evaluation metrics in this task: (1) the mean reciprocal rank of all correct entities (MRR) and (2) the proportion of correct entities that rank no larger than N (Hits@N). The number of negative samples for cross-graph model and knowledge model are K 1 = 25 and K 2 = 2, respectively, and the negative samples are updated every 10 epochs. The total number of training rounds is 1000. The optimal configuration of our models for the entity alignment task is: λ = 0.005, γ 1 = 3.0, γ 2 = 3.0.

Results
Table 2 lists the results of entity alignment on DBP15K and DWY100K. In general, KECG is significantly more effective than MTransE, JAPE, and GCN-Align on all datasets; it slightly outperforms AlignEA on the small-scale datasets and greatly outperforms it on the large-scale datasets. Among all the baselines, AlignEA is the strongest, because it uses two margins to control the scores of triplets, which makes its triplet-based modeling of KG structure more accurate than that of the others. We summarize the detailed results in three aspects:
• For different model groups. Compared with the KG-based models, KECG performs much better than MTransE and JAPE. On DBP15K ZH−EN , KECG and AlignEA get very close results regarding Hits@1 with a gap of 0.59%, but as for Hits@10 and MRR, the differences become 4.31% and 0.017, respectively. The reason is that, though some entities are not accurately aligned, their neighbors, or neighbors of neighbors, may already be aligned. And KECG propagates these features to unaligned entities, making the positions of the unaligned entities in the vector space very close to the corresponding entities. It reflects the importance of the global structural information of KGs. Compared with the graph-based model, KECG significantly outperforms GCN-Align regarding all metrics. It shows that involving the local relationships between entities can make entities more distinguishable, making the alignment more accurate.
• For different data scales. KECG adapts well to different data scales. The average gains of KECG regarding Hits@1 are 8%-41.3% on large-scale datasets and 1.7%-20.7% on small-scale datasets. With large-scale KGs providing richer relational and structural information, KECG can encode entities with more semantics. However, we found that, on DWY100K YG , the variant KECG(w/o K) works better than the full KECG. This will be further discussed in Section 4.4.
• For different language pairs. KECG performs best on all language pairs. It improves the MRR score to 0.61 on DBP15K JA−EN , which is a nearly 8% improvement compared with the best baseline. To the best of our knowledge, the imbalance between the resources of non-English KGs and those of English KGs causes the heterogeneity of cross-lingual KGs, making it challenging to align them. KG-based models, such as MTransE, cannot learn good representations to model the structures of non-English KGs, as the constraints on entities are relatively insufficient. Methods based on GCNs, such as GCN-Align, are sensitive to structural differences and find it hard to narrow these differences for alignment. KECG makes use of the extended GAT to alleviate the negative impact of heterogeneous KGs, which successfully highlights the entities that are important for alignment in the graph. Because English and French share many similarities, both KECG and the baselines obtain satisfactory results on DBP15K FR−EN .
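The two metrics reported in these results can be computed from the rank of each correct entity as sketched below. The snippet assumes 1-based ranks, one per test pair; the function names and the toy rank list are hypothetical.

```python
def hits_at_n(ranks, n):
    """Proportion of correct entities whose rank is no larger than n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all correct entities."""
    return sum(1.0 / r for r in ranks) / len(ranks)


ranks = [1, 2, 10]  # toy ranks of the correct counterparts for three test pairs
print(hits_at_n(ranks, 1), hits_at_n(ranks, 10), mean_reciprocal_rank(ranks))
```

Hits@1 rewards only exact top predictions, whereas MRR also gives partial credit to correct entities that rank close to the top, which is why the two metrics can move differently between methods.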

Effectiveness of Knowledge Embedding Model
For better comparison, we draw two detailed Hits@N bar plots of GCN-Align, AlignEA, KECG(w/o K) and KECG on DBP15K ZH−EN and DWY100K YG , as shown in Figure 3. In Figure 3(a), KECG(w/o K) gets better results than GCN-Align, which reflects the power of the graph attention mechanism. However, KECG(w/o K) cannot exceed AlignEA, showing that only considering structural information is not enough for entity alignment. After adding the knowledge embedding model, KECG achieves the highest Hits@{1,10,50}. The effectiveness of knowledge embedding model lies in two parts. Firstly, it applies constraints to entities by involving representation of relation types. Entity alignment leads to some relationship alignment, and entities having similar relations become close. Secondly, the knowledge embedding model is beneficial for the graph attention mechanism to learn better attention values.
In Figure 3(b), KECG still outperforms GCN-Align and AlignEA, but KECG(w/o K) exceeds KECG on all three metrics, which reflects the weakness of the current knowledge embedding model. According to our statistics, 1-to-N, N-to-1, and N-to-N relations account for 79.58% of the total in DWY100K YG , compared with 60.05% in DBP15K ZH−EN . The knowledge embedding model is based on TransE, which lacks the ability to model complex relationships (e.g., 1-to-N, N-to-1, and N-to-N relations). When a dataset contains too many complex relationships, the knowledge embedding model hurts the results rather than improving them. To ease this situation, we can use a representation model that is stronger on complex relationships, such as TransR (Lin et al., 2015). We leave this for future work.

Effectiveness of Nearest Neighbor Sampling
In Table 2, it can be seen that KECG(w/o NNS) outperforms many baselines, and its Hits@1 on DBP15K FR−EN reaches 39.30%, which is 2.01% higher than that of the other graph-based method, GCN-Align. However, the Hits@1 of KECG(w/o NNS) is lower than that of GCN-Align by approximately 3% on DWY100K. This is because DWY100K has a larger number of triplets, while its number of relations is relatively small. Involving the representation of complex relationships introduces noise and affects the performance of the cross-graph model. After employing the nearest neighbor sampling strategy in the cross-graph model, KECG considerably improves on the results of KECG(w/o NNS), e.g., Hits@1 on DWY100K WD is raised from 46.95% to 63.23%. The strong performance of nearest neighbor sampling is due to its ability to accurately discriminate the target entity from other close entities, especially on large-scale datasets.

Sensitivity to Proportion of Prior Alignment
To investigate whether KECG is sensitive to the proportion of pre-aligned entities, we vary the proportion used for training from 10% to 50% in steps of 10%, and use the rest of the pre-aligned entities for testing. Intuitively, more pre-aligned entities should provide more information for discovering potential pairs of equivalent entities. Figure 4 illustrates the changes in Hits@1 and Hits@10 of KECG, AlignEA and GCN-Align on all datasets. As expected, the results of all methods improve as the proportion increases. Given 50% of the prior alignments, all three methods achieve close and satisfactory Hits@1. Given a very small proportion of prior alignments, such as 10%, the results of AlignEA and GCN-Align drop sharply, but KECG still achieves strong results. Therefore, KECG works well on entity alignment even with a limited number of prior alignments.

TransE (Bordes et al., 2013) is a milestone in learning embeddings for KGs; it interprets a relation as the translation from its head entity to its tail entity. It motivated several enhanced methods such as TransR (Lin et al., 2015). In addition, non-translational methods also achieve satisfactory performance, such as RESCAL (Nickel et al., 2011), ConvE (Dettmers et al., 2018) and RotatE (Sun et al., 2019). Meanwhile, external information in KGs is fused to improve embeddings (Wang and Li, 2016). More detailed KG embedding methods are summarized in .

Entity Alignment
Early human efforts, such as crowdsourcing and hand-crafted features (Lehmann et al., 2015;Mahdisoltani et al., 2013), are utilized to address the entity alignment problem. Though reaching high precisions, they are usually time-consuming and costly. Many automated approaches leverage the extra resources, such as OWL properties and Wikipedia links (Hu et al., 2011;Wang et al., 2013) to align entities, but such extra information is not generally available for all KGs.
Recently, KG embedding-based approaches have become the most popular solution for entity alignment. JE (Hao et al., 2016) is an early work that jointly embeds different KGs into a uniform vector space to align their entities. MTransE (Chen et al., 2017) learns transitions that translate each entity embedding vector to its counterpart in another space. JAPE jointly trains attribute embeddings and structure embeddings with a skip-gram model and a translational model, respectively, to align entities. GCN-Align (Wang et al., 2018) is a new graph-based model that aligns KGs via GCN. It takes advantage of GCN to propagate information from neighbors, and aligns entity embeddings enhanced by structural knowledge. However, GCN-Align considers only the equivalence relations between entities during training, neglecting the abundant relationships in KGs that could precisely distinguish entities with similar neighbors. At the same time, iterative alignment has become a popular way to improve the results of entity alignment. IPTransE (Zhu et al., 2017) and BootEA integrate knowledge among different KGs by enlarging the training data (prior alignments) in a bootstrapping way. KDCoE (Chen et al., 2018) iteratively co-trains multilingual KG embeddings and fuses them with entity description information for alignment. These iterative methods improve the results mainly by increasing the number of pre-aligned entity pairs during training, and such a strategy can be a general enhancement for most alignment approaches. A better non-iterative method should in turn achieve better results when combined with bootstrapping. Therefore, we are more concerned with the best result that a non-iterative method can achieve.

Conclusions
In this paper, we propose a semi-supervised entity alignment method, KECG, that combines a knowledge embedding model with a graph-based model. We utilize an extended GAT to encode heterogeneous KGs and perform entity alignment by propagating alignment information over the entire graphs. Meanwhile, KG learning is involved to model different relation types, making entities more distinguishable. Experimental results show that KECG significantly outperforms four state-of-the-art baselines.
For future work, we will extend the knowledge embedding model of KECG to other KG representation learning methods, such as TransD (Ji et al., 2015), to gain a stronger ability of modeling relationships. Besides, iteratively discovering new entity alignments based on the framework of KECG is another interesting direction.