A Contextual Alignment Enhanced Cross Graph Attention Network for Cross-lingual Entity Alignment

Cross-lingual entity alignment, which aims to match equivalent entities in KGs of different languages, has attracted considerable attention in recent years. Recently, many graph neural network (GNN) based methods have been proposed for entity alignment and obtain promising results. However, existing GNN-based methods consider the two KGs independently and learn embeddings for the different KGs separately, ignoring the useful pre-aligned links between the two KGs. In this paper, we propose a novel Contextual Alignment Enhanced Cross Graph Attention Network (CAECGAT) for the task of cross-lingual entity alignment, which is able to jointly learn the embeddings in different KGs by propagating cross-KG information through pre-aligned seed alignments. We conduct extensive experiments on three benchmark cross-lingual entity alignment datasets. The experimental results demonstrate that our proposed method obtains remarkable performance gains compared to state-of-the-art methods.


Introduction
Knowledge graphs (KGs) have recently demonstrated their potential in many natural language processing (NLP) tasks, such as language modelling and question answering (Wu et al., 2019a). With the rapid growth of multilingual KGs, such as DBpedia (Lehmann et al., 2015) and YAGO (Rebele et al., 2016; Suchanek et al., 2008), cross-lingual entity alignment has attracted considerable attention due to the lack of cross-lingual links. The task of cross-lingual entity alignment aims to automatically detect equivalent entities in different monolingual KGs to bridge the language gap.
Most recently, many graph neural network (GNN) based methods have been proposed for entity alignment (Mao et al., 2020; Yang et al., 2019). Since GNNs are powerful at modeling graph-structured data by aggregating neighborhood information, GNN-based methods have shown promising performance. However, existing GNN-based methods model different KGs separately and thus ignore the useful pre-aligned links between the two KGs. These GNN-based methods only use the pre-aligned seed alignments to optimize the objective function during training and fail to make full use of the seed alignments, which provide useful contextual alignment information for the entity alignment task. Intuitively, if two entities in different KGs share many pre-aligned neighbors, they are very likely to be equivalent. Taking the example shown in Figure 1(a), although the entity "哥威迅语(Gothic language)" in Chinese from G1 has the same translated surface form as the entity "Gothic language" in English from G2, the correct alignment for "哥威迅语(Gothic language)" is the entity "Gwich'in language" in English. By considering the contextual alignments in the different KGs, such as "美国(United States)" and "United States", "加拿大(Canada)" and "Canada", "纳-德内语系(Na-Dene language family)" and "Na-Dene languages", and "德内语支(Dené language branch)" and "Athabaskan languages", we can gather more evidence that the embeddings of the entity "哥威迅语(Gothic language)" and the entity "Gwich'in language" should be close to each other. Thus, the mismatch can be easily corrected. This contextual alignment information is not explicitly considered in conventional GNN-based methods, which leads to sub-optimal results.

Figure 1: (a) An example of contextual alignments; the dashed lines represent the pre-aligned neighbors. (b) Comparison between conventional GNN-based methods and the proposed CAECGAT model; the aim is to predict whether entity 1 from G1 and entity 1 from G2 are equivalent entities.
Therefore, it is beneficial to take full advantage of contextual alignments for the entity alignment task. To this end, we propose a Contextual Alignment Enhanced Cross Graph Attention Network (CAECGAT) for the task of cross-lingual entity alignment. CAECGAT is able to jointly learn the embeddings in different KGs by propagating cross-KG information through pre-aligned seed alignments. Specifically, we develop a new cross graph attention (CGAT) layer to learn cross-KG information. The CGAT layer includes a cross-KG aggregation layer and an attention-based cross-KG propagation layer. We first use the cross-KG aggregation layer to transfer the entity information across the two KGs through the pre-aligned seed entities. Thus, the embeddings of the different KGs are mapped into the same semantic space by sharing cross-KG information. Then, the attention-based cross-KG propagation layer is applied to focus on the neighbors with important cross-KG information, so that the semantic gap between different languages can be alleviated by propagating the cross-KG information.
To train the model, we split the seed alignments into contextual seed alignments and objective seed alignments during training. The contextual seed alignments are fed into the model, providing pre-aligned information and allowing entity information to propagate across the different KGs, while the objective seed alignments are used to optimize the parameters of the model. Figure 1(b) shows the comparison between conventional GNN-based methods and the proposed CAECGAT model. Note that there is a significant difference between the two. The conventional GNN-based methods regard all pre-aligned entity pairs as objective seed alignments for training and ignore the contextual alignment information that could propagate across the KGs, resulting in sub-optimal results. In contrast, our proposed CAECGAT model chooses one batch of seed alignments as objective seed alignments (e.g., entity 1 in Figure 1(b)) and uses the rest as contextual seed alignments (e.g., entity 2 and entity 3 in Figure 1(b)) in an iterative manner. In this way, we can better learn cross-KG embeddings by propagating contextual seed alignments across the different KGs.
The main contributions of this study are summarised as follows: • We propose a novel Contextual Alignment Enhanced Cross Graph Attention Network (CAECGAT) for cross-lingual entity alignment, which can jointly learn cross-KG embeddings by propagating information across different KGs.
• We propose a new training strategy that divides the seed alignments into contextual and objective seed alignments in an iterative manner, enabling our model to capture the cross-KG information.

• We conduct extensive experiments on three benchmark datasets. The experimental results demonstrate that our proposed method obtains remarkable performance gains compared to state-of-the-art methods.

Table 1: Summary of the main notations.

G1, G2          The knowledge graphs in different languages.
R1, R2          The relations in G1 and G2.
T1, T2          The triples in G1 and G2.
A               The seed alignments.
A_ctx, A_obj    The contextual seed alignments and objective seed alignments.
Ē1, Ē2          The final entity embedding matrices for G1 and G2.
e1, e2          The initial entity vectors for e1 ∈ G1 and e2 ∈ G2.
ē1, ē2          The final entity vectors for e1 ∈ G1 and e2 ∈ G2.
gate            The gate mechanism to combine cross-KG embeddings.
crossAggr       The function to aggregate cross-KG information.
crossAtt        The function to propagate cross-KG information in different KGs.
L(φ; A_obj)     The loss function of the proposed CAECGAT model.
φ               The trainable parameters in the model.
d(ē1, ē2)       The L1 distance between entity vectors ē1 and ē2.
Approach

Overview

Figure 2 illustrates the structure of the proposed CAECGAT model, which consists of multiple CGAT layers. A CGAT layer contains a cross-KG aggregation layer and an attention-based cross-KG propagation layer. As illustrated in Figure 2, given two different KGs G1 and G2, and a collection of pre-aligned entity pairs, we first use the cross-KG aggregation layer to transfer the entity information across the two KGs through the pre-aligned entity pairs. This operation maps the entity embeddings of the different KGs into the same semantic space by sharing cross-KG entity embeddings. Then, we apply the attention-based cross-KG propagation layer to gather neighbors with important cross-KG information, so that the cross-KG information can be propagated within the two KGs. By stacking multiple CGAT layers, the model is able to learn multi-hop cross-KG information. The main notations of this paper are summarized in Table 1.

Cross-KG Aggregation
The cross-KG aggregation is used to transfer graph information across different KGs through the seed alignments. By taking full advantage of pre-aligned entity pairs, we can make full use of pre-aligned neighbors as contextual information to predict new alignments. This kind of cross-KG alignment information is very useful for the entity alignment task but is ignored in previous methods. In this study, we take full advantage of the pre-aligned entity pairs to aggregate cross-KG information and mitigate the semantic gap between different KGs. Formally, given two different KGs G1 = (E1, R1, T1) and G2 = (E2, R2, T2), and a set of seed alignments A = {(e1, e2) | e1 ∈ E1, e2 ∈ E2}, we represent the entities in the two KGs as k-dimensional embedding matrices E1 and E2. During training, we split all the seed alignments into a set of contextual seed alignments A_ctx and a set of objective seed alignments A_obj. The contextual seed alignments in A_ctx are used as a bridge to transfer information across the different KGs. Thus, the cross-KG information provided by A_ctx can be used as contextual alignment information to predict the matching scores for the entity pairs in the objective seed alignments A_obj.
Figure 2: The architecture of the proposed CAECGAT model. The cross-KG aggregation layer is used to transfer cross-KG information across different KGs through contextual seed alignments, and the attention-based cross-KG propagation layer is used to propagate the cross-KG information in the two KGs. By stacking multiple CGAT layers, the cross-KG information can be propagated to multi-hop neighbors.

Concretely, for each pre-aligned entity pair (e1, e2) ∈ A_ctx, we use a gate mechanism to update their embeddings by combining the embeddings of themselves and of their counterpart entities from the other KG:

e^l_1 ← g^l_1 ⊙ e^l_1 + (1 − g^l_1) ⊙ e^l_2,
e^l_2 ← g^l_2 ⊙ e^l_2 + (1 − g^l_2) ⊙ e^l_1,

where e^l_1 and e^l_2 are the vectors of entities e1 and e2 in the l-th layer, and g^l_1 and g^l_2 are the gates that control how much information flows across the KGs, computed as:

g^l_1 = σ(W^l_1 [e^l_1 || e^l_2] + b^l_1),
g^l_2 = σ(W^l_2 [e^l_2 || e^l_1] + b^l_2),

where σ is the sigmoid activation function, which constrains the output values to the range (0, 1), W^l_{1,2} and b^l_{1,2} are trainable parameters, and || denotes the concatenation operation. For entities without pre-aligned counterparts in the other KG, the embeddings remain unchanged in the cross-KG aggregation layer. By applying this cross-KG aggregation method, we obtain cross-KG embeddings H^l_1 and H^l_2 with shared entity representations for equivalent entities. Formally, the cross-KG embeddings are computed as:

H^l_1 = crossAggr(E^l_1, E^l_2, A_ctx),
H^l_2 = crossAggr(E^l_2, E^l_1, A_ctx),

where crossAggr denotes the cross-KG aggregation layer, whose entity-wise form crossAggr(e^l_1, E^l_2, A_ctx) is the gated update above, and crossAggr(e^l_2, E^l_1, A_ctx) is computed in a similar form.
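The gate-based aggregation described above can be sketched as follows. This is a minimal NumPy illustration under simplifying assumptions (a single shared gate parameterization W, b rather than separate per-KG parameters), not the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_kg_aggregate(E1, E2, A_ctx, W, b):
    """Gate-based cross-KG aggregation (sketch).

    E1, E2 : (n1, k) and (n2, k) entity embedding matrices of G1 and G2.
    A_ctx  : list of (i, j) index pairs of contextual seed alignments.
    W, b   : hypothetical shared gate parameters, shapes (2k, k) and (k,).
    Entities without a pre-aligned counterpart keep their embeddings.
    """
    H1, H2 = E1.copy(), E2.copy()
    for i, j in A_ctx:
        e1, e2 = E1[i], E2[j]
        # The gate controls how much information flows across the two KGs.
        g1 = sigmoid(np.concatenate([e1, e2]) @ W + b)
        g2 = sigmoid(np.concatenate([e2, e1]) @ W + b)
        H1[i] = g1 * e1 + (1.0 - g1) * e2
        H2[j] = g2 * e2 + (1.0 - g2) * e1
    return H1, H2
```

Note that only entities appearing in A_ctx are updated, which matches the description that non-aligned entities pass through the aggregation layer unchanged.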

Attention-based Cross-KG Propagation
Recently, GNNs have been successfully applied to entity alignment. These preliminary studies use GNNs to propagate neighborhood information within a monolingual KG and fail to make full use of the seed alignments to alleviate the semantic gap between different KGs. In this study, we attempt to jointly learn the entity embeddings and propagate cross-KG information in different KGs. Having encoded cross-KG information in the entity embeddings via the cross-KG aggregation layer, we can further propagate this cross-KG information with GNNs, rather than only mono-lingual KG information. Specifically, we use an attention-based cross-KG propagation layer, inspired by the graph attention network (GAT) (Velickovic et al., 2017), to aggregate neighborhood features. Our attention-based cross-KG propagation layer aims to select the most important common neighbors shared by different KGs to enhance the entity embeddings. Thus, the semantic gap can be alleviated by focusing on the common neighbors with cross-KG information and down-weighting noisy neighbors.
Given the output entity embeddings of the cross-KG aggregation layer, e.g., H^l_1 and H^l_2 for the two KGs G1 and G2, we use a graph attention mechanism to update the entity embeddings by gathering neighborhood information:

E^{l+1}_1 = crossAtt(H^l_1, G1),
E^{l+1}_2 = crossAtt(H^l_2, G2),

where crossAtt is the attention-based cross-KG propagation function. For each entity e1 ∈ G1, the output feature of the crossAtt function is generated by gathering the cross-KG neighborhood embeddings using a weighted sum:

e^{l+1}_1 = ReLU( Σ_{e_k ∈ N(e1)} α_{1k} h^l_k ),

where ReLU (Glorot et al., 2011) is the activation function, N(e1) is the set of neighbors of entity e1, and α_{1k} is the normalized attention weight for the neighboring entity e_k, computed as:

α_{1k} = exp(s(h^l_1, h^l_k)) / Σ_{e_j ∈ N(e1)} exp(s(h^l_1, h^l_j)),

where s is the scoring function, denoted as:

s(h^l_1, h^l_k) = LeakyReLU( v^T [h^l_1 || h^l_k] ),

where v is the attention vector and LeakyReLU (Maas et al., 2013) is the activation function widely used in graph attention mechanisms. The crossAtt function for the entities in KG G2 is computed in a similar form. Formally, the update rule of the cross-KG aggregation and cross-KG propagation layers in a CGAT layer can be denoted as:

E^{l+1}_1 = crossAtt( crossAggr(E^l_1, E^l_2, A_ctx), G1 ),
E^{l+1}_2 = crossAtt( crossAggr(E^l_2, E^l_1, A_ctx), G2 ).

By stacking L cross-KG aggregation and attention-based cross-KG propagation layers, we obtain new entity embeddings Ē1 = E^L_1 and Ē2 = E^L_2, which contain both cross-KG and multi-hop neighborhood information.
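The attention-based propagation can be sketched in NumPy as follows. This is a minimal illustration of a GAT-style weighted sum with a LeakyReLU scoring function; the full model may apply additional linear transformations before scoring:

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def cross_att(H, neighbors, v):
    """Attention-based neighborhood propagation (sketch of crossAtt).

    H         : (n, k) entity embeddings after cross-KG aggregation.
    neighbors : dict mapping an entity index to its neighbor indices.
    v         : (2k,) attention vector used by the scoring function s.
    """
    out = np.zeros_like(H)
    for i, nbrs in neighbors.items():
        # Score each neighbor: s(h_i, h_k) = LeakyReLU(v^T [h_i || h_k]).
        scores = np.array([leaky_relu(v @ np.concatenate([H[i], H[k]]))
                           for k in nbrs])
        # Softmax normalization yields the attention weights alpha_ik.
        w = np.exp(scores - scores.max())
        alpha = w / w.sum()
        # Weighted sum of neighbor embeddings, followed by ReLU.
        out[i] = np.maximum(0.0, sum(a * H[k] for a, k in zip(alpha, nbrs)))
    return out
```

Stacking this step after the aggregation step, once per CGAT layer, lets the cross-KG information reach multi-hop neighbors.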

Optimization and Prediction
The goal of entity alignment is to ensure that the learned embeddings of equivalent entities in different KGs are close to each other. Instead of using the seed alignments only to compute the loss function, as in conventional GNN-based methods, our model makes full use of the seed alignments as contextual alignments during both training and prediction.
During training, we use the objective seed alignments A_obj and a margin-based ranking loss function to optimize the model, which is denoted as:

L(φ; A_obj) = Σ_{(e1, e2) ∈ A_obj} Σ_{(e1′, e2′) ∈ A′_{(e1, e2)}} max(0, d(ē1, ē2) + λ − d(ē1′, ē2′)),

where d(ē1, ē2) is the L1 distance between the entity vectors ē1 and ē2, λ is the margin parameter, and A′_{(e1, e2)} is a set of negative entity pairs obtained by replacing e1 or e2 with randomly sampled entities.

Algorithm 1 CAECGAT Model
Input: Given KGs G1 and G2, as well as the initial entity embeddings E1, E2 and pre-aligned seed alignments A.
Output: The final entity embeddings Ē1, Ē2 and parameters φ.
1: Let E^0_1 = E1 and E^0_2 = E2.
2: repeat
3:   Select a batch of seed alignments as objective seed alignments A_obj, and use the rest as contextual seed alignments A_ctx.
4:   for l = 0 to L − 1 do
5:     H^l_1, H^l_2 = crossAggr(E^l_1, E^l_2, A_ctx), crossAggr(E^l_2, E^l_1, A_ctx).
6:     E^{l+1}_1, E^{l+1}_2 = crossAtt(H^l_1, G1), crossAtt(H^l_2, G2).
7:   end for
8:   Let Ē1 = E^L_1 and Ē2 = E^L_2.
9:   Compute the loss L(φ; A_obj).
10:  Update the parameters φ.
11: until the maximum number of training epochs is reached.
12: Set A_ctx = A and recompute Ē1 and Ē2 for prediction.
13: return Ē1, Ē2 and φ.
To better understand the proposed model, we describe the details of the CAECGAT model in Algorithm 1. Note that the contextual seed alignments A_ctx and the objective seed alignments A_obj are not fixed during training. Instead, these two sets change dynamically to ensure that all the seed alignments can be used to optimize the loss function. Specifically, all the seed alignments in A are split into several batches. For each optimization step, we select one batch as the objective alignments A_obj, and all the remaining seed alignments are used as contextual seed alignments A_ctx. We repeat this operation until reaching the maximum number of training epochs, as described in lines 2 to 11 of Algorithm 1.
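The dynamic split between objective and contextual seed alignments can be sketched as follows; the batching and shuffling details are illustrative assumptions, not taken from the paper:

```python
import random

def batch_split(A, batch_size, seed=0):
    """Yield per-step (A_obj, A_ctx) splits of the seed alignments.

    Each step takes one batch as the objective seed alignments A_obj
    (used in the loss) and all remaining pairs as the contextual seed
    alignments A_ctx (used to propagate cross-KG information), so every
    seed pair eventually contributes to optimizing the model.
    """
    pairs = list(A)
    random.Random(seed).shuffle(pairs)
    for start in range(0, len(pairs), batch_size):
        A_obj = pairs[start:start + batch_size]
        A_ctx = pairs[:start] + pairs[start + batch_size:]
        yield A_obj, A_ctx
```

One pass over these splits corresponds to one training epoch; re-shuffling between epochs varies which pairs act as context for which objectives.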
During prediction, given a test entity e1 (or e2) from KG G1 (or G2), we rank all entities in the other KG G2 (or G1) according to the L1 distance computed using the entity vectors Ē1 and Ē2 learned by the CAECGAT model. Note that, during prediction, we can use all the pre-aligned seed alignments in A as contextual seed alignments, namely A_ctx = A, as described in lines 12 to 13 of Algorithm 1. Therefore, our proposed CAECGAT model provides a new way to take full advantage of the seed alignments in both the training and prediction procedures.
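The L1-distance ranking used at prediction time, together with the Hits@k metric reported in the experiments, can be sketched as:

```python
import numpy as np

def rank_candidates(e1_vec, E2):
    """Rank entities of the other KG by L1 distance to a test entity.

    e1_vec : (k,) final embedding of a test entity from G1.
    E2     : (n2, k) final entity embedding matrix of G2.
    Returns candidate indices ordered from nearest to farthest.
    """
    dist = np.abs(E2 - e1_vec).sum(axis=1)  # L1 distance to each candidate
    return np.argsort(dist)

def hits_at_k(ranks, k):
    """Hits@k: fraction of test entities whose true counterpart appears
    within the top-k ranked candidates (ranks are 0-based)."""
    return sum(1 for r in ranks if r < k) / len(ranks)
```

The same procedure applies symmetrically for a test entity from G2 ranked against the entities of G1.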

Experiments
In this section, we conduct extensive experiments to evaluate our proposed method CAECGAT on the cross-lingual entity alignment task. The detailed experiments on the benchmark datasets and the results are described in the following subsections.

Datasets
To evaluate the ability of the proposed CAECGAT model, we conduct experiments on three widely used cross-lingual datasets from DBP15K. These datasets are extracted from the multilingual DBpedia (Lehmann et al., 2015) and involve four KGs in different languages (English, Chinese, Japanese and French). Three cross-lingual datasets are built upon these KGs: DBP15K ZH-EN (Chinese-English), DBP15K JA-EN (Japanese-English) and DBP15K FR-EN (French-English). Each dataset contains 15,000 inter-lingual links connecting equivalent entity pairs in KGs with different languages. The statistics of the datasets are summarized in Table 2.

Implementation Details
In our experiments, following previous work (Wu et al., 2019c), the entity embeddings are initialized using the sum of the word vectors of the entities' surface forms. We choose this initialization due to its good performance shown in various previous works (Wu et al., 2019c). We use the same data split as in previous work (Yang et al., 2019), namely 30% for training and 70% for testing. We further sample 10% of the training data as the development set for parameter selection and use the remaining 90% for training. We apply Adam (Kingma and Ba, 2015) to optimize the parameters in the model, and we train the model for up to 5,000 epochs. The hyper-parameters are selected by grid search according to Hits@1 on the development set. We select the number of negative entity pairs from {10, 20, 30, 50}, the margin parameter λ in Equation 10 from {1, 3, 5}, the number of CGAT layers from {1, 2, 3}, the dropout rate from {0.1, 0.2, 0.3}, the batch size from {500, 1000, 1500, 2000, 3000}, and the learning rate from {0.002, 0.001, 0.0005}. Finally, we randomly sample 30 negative alignment entity pairs for each positive pair, and the margin parameter in Equation 10 is set to λ = 3. We stack two CGAT layers to propagate multi-hop cross-KG information. The learning rate is set to 0.002, the dropout rate to 0.2, and the batch size to 2,000. We evaluate the performance of the model using the standard metrics Hits@1 (H@1), Hits@10 (H@10) and MRR (Mean Reciprocal Rank). All experiments reported in this study are conducted on NVIDIA GTX 1080Ti GPUs, and the code is implemented in TensorFlow.

Comparison Models
To investigate the power of the proposed CAECGAT, we compare the proposed method with various baselines, which can be roughly classified into two groups: embedding-based models and GNN-based models. We also note that some studies, such as KDCoE (Chen et al., 2018), NTAM (Li et al., 2018), SEA and OTEA (S. et al., 2019), as well as attribute-enhanced methods (Trisedya et al., 2019), focus on entity alignment on datasets other than DBP15K. We leave these comparisons to future work due to limited space. Table 3 presents the cross-lingual entity alignment performance of our proposed model, along with comparisons against various baselines. As shown in Table 3, our model CAECGAT achieves the best performance on DBP15K JA-EN and DBP15K FR-EN across all metrics, and achieves the best H@10 and the second best H@1 and MRR on DBP15K ZH-EN. These results demonstrate the effectiveness of our proposed CAECGAT model. Compared to the most similar GNN-based models, such as GCN-Align, the proposed CAECGAT model is able to use the seed alignments as contextual information and propagate the cross-KG information between the different KGs, which enables CAECGAT to achieve remarkable performance gains. For example, CAECGAT obtains a 34.30% improvement over GCN-Align in H@1 on DBP15K ZH-EN. These results show that the contextual seed alignments are important for the entity alignment task and that our model effectively bridges the language gap between different KGs by taking these contextual seed alignments into account.

Ablation Study
To better understand how each component affects the performance of the proposed CAECGAT model, we conduct an ablation study, shown in Table 4. There are three variants of our proposed CAECGAT model: "BASELINE" is a simple model that identifies the equivalent counterparts in different KGs using embeddings computed as the sum of the word vectors within the surface names of entities; "GAT" denotes the conventional graph attention network, obtained by removing the cross-KG aggregation layer from the proposed CAECGAT model; "CrossGCN" is the model obtained by replacing the attention-based propagation layer with multiple GCN layers. The subscripts L = 1, 2, 3 denote models with different numbers of CGAT layers. From Table 4, we can observe that the BASELINE model obtains good results by considering surface names, which has been shown in previous works. Adding GNN layers brings further improvements (e.g., GAT vs. BASELINE). Our proposed CAECGAT L=2 obtains large performance gains over the GAT model on all datasets, which indicates that the cross-KG aggregation plays a very important role in our model. Compared with CrossGCN, our CAECGAT L=2 also performs much better. This is because the attention mechanism in the cross-KG propagation layer is able to select the neighbors with important cross-KG information and mitigate the semantic gap between different KGs by propagating cross-KG information. These experimental results clearly demonstrate that all the components of the CAECGAT model contribute substantially to its performance.
Comparing the CAECGAT models with different numbers of layers, we can see that CAECGAT L=2 with two CGAT layers and CAECGAT L=3 with three CGAT layers achieve better performance than CAECGAT L=1. This is because CAECGAT L=1 with one CGAT layer can only capture neighborhood information from one-hop neighbors, while CAECGAT L=2 and CAECGAT L=3 are able to propagate information over multi-hop neighbors by stacking multiple CGAT layers. However, the performance of CAECGAT L=3 is not further improved compared with CAECGAT L=2, and stacking more layers increases the number of parameters in the model. Therefore, we set the number of CGAT layers to 2.

Performance vs. Different Sizes of Seed Alignments
To investigate how the proportion of pre-aligned seed alignments affects the performance of our CAECGAT model, we further conduct experiments in which the proportion of the training set varies from 10% to 50% with a step of 10%, and all the remaining entity alignments are used for testing (e.g., from 10% for training and 90% for testing, to 50% for training and 50% for testing). As depicted in Figure 3, we compare our proposed CAECGAT with the strong baseline GAT model, which removes the cross-KG aggregation layer from our CAECGAT model. We can see that the performance on all datasets gradually improves as the proportion of the training set increases. Comparing the proposed CAECGAT with the baseline GAT model, we can see that CAECGAT consistently outperforms GAT by large margins across all proportions, which reconfirms that our CAECGAT model is powerful at capturing cross-KG features and reduces the semantic gap between different KGs.

Conclusions and Future Work
In this paper, we propose a novel CAECGAT model for the cross-lingual KG entity alignment task, which makes full use of seed alignments to alleviate the semantic gap between different KGs. We first use a cross-KG aggregation layer to transfer entity information across different KGs, which enables the embeddings of different KGs to share cross-KG information. Then, an attention-based cross-KG propagation layer is applied to gather neighbors with important cross-KG information. By stacking multiple CGAT layers, the cross-KG information can be propagated between the two KGs. Experimental results on several benchmark datasets demonstrate the effectiveness of the proposed CAECGAT model.
In the future, we will consider additional entity information, such as entity properties and descriptions. Besides, we may apply the proposed model to other datasets (e.g., DBP100K and WK31 (Chen et al., 2017)).