Dual Attention Network for Cross-lingual Entity Alignment

Cross-lingual entity alignment is an essential step in building a knowledge graph, as it helps integrate knowledge across knowledge graphs in different languages. In real KGs, the information available at the same hierarchy of corresponding entities is often imbalanced, which results in heterogeneous neighborhood structures and makes this task challenging. To tackle this problem, we propose a dual attention network for cross-lingual entity alignment (DAEA). Specifically, the dual attention consists of relation-aware graph attention and hierarchical attention. The relation-aware graph attention selectively aggregates multi-hierarchy neighborhood information to alleviate the heterogeneity between counterpart entities. The hierarchical attention adaptively aggregates low-hierarchy and high-hierarchy information, which helps balance the neighborhood information of counterpart entities and distinguish non-counterpart entities with similar structures. Finally, we treat cross-lingual entity alignment as a link prediction process. Experimental results on three real-world cross-lingual entity alignment datasets show the effectiveness of DAEA.


Introduction
In recent years, many large-scale knowledge graphs (Bordes et al., 2013; Mahdisoltani et al., 2015; Auer et al., 2007) have been built to represent and organize the explosive growth of information on the Internet. They have been widely used in many fields, such as dialogue systems, machine translation (Zhao et al., 2020) and medicine (Yan et al., 2020). However, KGs are usually incomplete, so emerging concepts must be tracked manually and the knowledge bases updated dynamically, making the whole process quite expensive. Fortunately, KGs in different languages are often complementary, which means that many components can be shared. To integrate this complementary knowledge, researchers have begun to pay attention to cross-lingual entity alignment.
Cross-lingual entity alignment aims at finding entities with the same semantics in KGs of different languages. Various methods have been explored for this task. Traditional approaches rely on machine translation or feature engineering (Chen et al., 2013; Mahdisoltani et al., 2015; Otani et al., 2018; Feng et al., 2016); their effectiveness depends largely on the quality of the translation and of the defined features. Recently, many embedding approaches based on graph neural networks (GNNs) have been proposed for cross-lingual entity alignment (Wang et al., 2018; Sun et al., 2020; Wu et al., 2019a). These methods first represent entities and relations in low-dimensional spaces, then exploit the powerful encoding ability of GNNs to learn vector representations for entities and relations, and finally use a mapping function to align entities from the source knowledge graph to the target one.
However, due to the incompleteness of knowledge graphs and the diversity of knowledge, the structures of KGs in different languages usually differ considerably. In the entity alignment task, this difference shows up in two ways: non-isomorphism between the neighborhood structures of counterpart entities, and information imbalance at the same hierarchy of counterpart entities. Non-isomorphism means that the neighborhoods of two counterpart entities are inconsistent with each other, in particular containing different sets of neighboring entities. Figure 1 (A) gives a toy example, in which entity pairs (a, a'), (b, b') and (c, c') denote three pairs of pre-aligned entities. In KG1, entity b is in the one-hop neighbor set of entity a, whereas in KG2, entity b' is in the two-hop neighbor set of entity a'. This suggests that semantically related entities can appear at different neighborhood hierarchies, which easily leads a GNN to encode the same pair of entities into different representations. Information imbalance means that the degrees at the same hierarchy of corresponding entities differ. Figure 1 (B) gives a toy example of degree imbalance. In KG1, the one-hop neighborhood of entity a contains five entities and the two-hop neighborhood contains four, whereas in KG2 the one-hop and two-hop neighborhoods of entity a' each contain only two entities. This phenomenon may arise because the two KGs focused on different contents when they were constructed, or because some knowledge is missing. It causes problems when encoding entities with a GNN: entities at the center of a knowledge graph have rich neighbor information, like a, and can integrate the information of the whole graph after a limited number of updates, leaving little difference between the final representations of these central entities. In entity alignment, this makes it easier for entities with similar neighborhood structures to be matched together (Pei et al., 2019).
To address the issues above, we propose a dual attention network for entity alignment (DAEA). The dual attention consists of relation-aware graph attention (R-GAT) and hierarchical attention. In R-GAT, we extend the graph attention network (Veličković et al., 2018) by modeling relations and incorporating the translational assumption into the self-attention mechanism, so that the relationship among head, tail, and relation can be modeled within the graph attention network. By selectively aggregating multi-hierarchy neighborhood information, R-GAT alleviates the non-isomorphism between the neighborhood structures of corresponding entities. Inspired by the jumping knowledge network (Xu et al., 2018), the hierarchical attention module uses an LSTM network to learn attention coefficients. It identifies the most useful neighborhood ranges for each entity and adaptively fuses the information of different hierarchies, which balances the neighborhood information of counterpart entities and distinguishes entities with similar structures. For alignment, we design a new method that uses link prediction directly to accomplish entity alignment. We perform thorough experiments with detailed ablation studies and analyses on three entity alignment datasets, demonstrating the effectiveness of DAEA.

Knowledge Embedding
Various embedding-based models have been developed for knowledge representation learning. These models project entities and relations into a low-dimensional vector space and define a score function to measure the plausibility of each triple; the total likelihood of the observed triples is maximized to learn the embeddings. TransE (Bordes et al., 2013) regards each relation as a translation vector between the head entity and the tail entity. Building on TransE, more advanced models such as TransH (Wang et al., 2014) and RotatE (Sun et al., 2019) have been proposed. Other embedding-based models, such as RESCAL (Nickel et al., 2011), DistMult (Yang et al., 2014) and ComplEx (Trouillon et al., 2016), optimize a bilinear product scoring function between the vector embeddings of the head and tail entities and a full-rank matrix for each relation.
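As a concrete illustration, the translational score of TransE can be written in a few lines. The following is a minimal numpy sketch (the function name is ours, not from any released implementation):

```python
import numpy as np

def transe_score(head, relation, tail):
    """TransE plausibility score f(h, r, t) = ||h + r - t||.

    Lower is better: a relation is modelled as a translation
    from the head embedding to the tail embedding.
    """
    return np.linalg.norm(head + relation - tail)

# Toy 3-d embeddings where the triple fits the translation exactly
h = np.array([1.0, 0.0, 0.0])
r = np.array([0.0, 1.0, 0.0])
t = np.array([1.0, 1.0, 0.0])
print(transe_score(h, r, t))        # 0.0: perfectly plausible triple
print(transe_score(h, r, -t) > 0)   # a corrupted tail scores worse
```

The bilinear models mentioned above replace this distance with a product form, but the training principle (score true triples better than corrupted ones) is the same.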

Graph Neural Networks
Graph neural networks (GNNs), connectionist models that capture the dependencies in graphs via message passing between nodes, are useful tools for non-Euclidean structures. Various methods have been proposed to improve their capability. Graph Convolutional Networks (GCNs) (Kipf and Welling, 2017), an extension of GNNs, operate on unlabeled graphs and induce node features from the structures of their neighborhoods. The attention mechanism has also been successfully applied to GNNs. The Graph Attention Network (GAT) (Veličković et al., 2018) incorporates attention into the propagation step and computes the hidden state of each node by attending over its neighbors, following a self-attention strategy. Furthermore, R-GCN (Schlichtkrull et al., 2018) has been proposed to model relational data and has been successfully exploited in link prediction and entity classification.

Entity Alignment
The earliest entity alignment approaches rely on machine translation or feature engineering. They are time-consuming, labor-intensive, and poor in adaptability and scalability. Recently, embedding-based methods have been applied to entity alignment. MTransE (Chen et al., 2017), IPTransE (Zhu et al., 2017), AlignE (Sun et al., 2018), JAPE and NAEA (Zhu et al., 2019) rely on the TransE model to learn entity embeddings, define some kind of transformation, and learn a linear mapping or minimize the distance between the embeddings of pre-aligned entities. Other works use GNNs for entity alignment (Wang et al., 2018; Sun et al., 2020; Wu et al., 2019a; Wu et al., 2019b; Zhu et al., 2019). GCN-Align (Wang et al., 2018) takes advantage of GCNs to propagate information from neighbors and aligns entity embeddings enhanced by structural knowledge; it considers only the connectivity between entities and ignores the relation features in KGs. KECG introduces GAT with a projection constraint to robustly encode graphs, and employs a nearest neighbor sampling strategy for KG representation learning towards a one-to-one mapping. AliNet (Sun et al., 2020) introduces distant neighbors to expand the overlap between neighborhood structures, employing an attention mechanism to highlight helpful distant neighbors and reduce noise. AliNet thus addresses non-isomorphic neighborhood structures; however, all of these methods ignore the information imbalance at the same hierarchy of entities.

Problem Formulation
Formally, a knowledge graph consists of a set of entities and a set of relations, with knowledge facts stored as a collection of triples (head, relation, tail). We define a KG as G = (E, R, T), where E and R represent the sets of entities and relations respectively, and T is the set of relational triples. Let P be the set of pre-aligned entity pairs between G_1 and G_2. The task of cross-lingual entity alignment is to find new aligned entity pairs based on these pre-aligned seeds.
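The formulation above maps directly onto a simple data structure. The following illustrative sketch is our own (the class and function names are not from the paper):

```python
from dataclasses import dataclass, field

@dataclass
class KG:
    """A knowledge graph G = (E, R, T) as plain Python sets.

    Triples are (head, relation, tail) tuples over E and R.
    """
    entities: set = field(default_factory=set)   # E
    relations: set = field(default_factory=set)  # R
    triples: set = field(default_factory=set)    # T

def add_triple(kg, head, relation, tail):
    """Insert a fact and register its entities and relation."""
    kg.entities.update({head, tail})
    kg.relations.add(relation)
    kg.triples.add((head, relation, tail))

g1 = KG()
add_triple(g1, "Paris", "capital_of", "France")
print(len(g1.entities), len(g1.triples))  # 2 1
```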

Our Approaches
DAEA combines knowledge embedding and graph neural networks for entity alignment. It consists of three modules: the relation-aware graph attention network module, the LSTM-based hierarchical attention module, and the knowledge embedding module. The overall framework is shown in Figure 2.

Relation-aware Graph Attention Module
This module is designed to mitigate the non-isomorphism between the neighborhood structures of counterpart entities by aggregating multi-hierarchy neighborhood information with different weights. Unlike GAT, which ignores relation information, R-GAT fuses relation-level information, which is vital to entity representation learning, into the graph attention module. Meanwhile, to speed up the flow of information, we add inverse triples to the KGs, which has been proved effective in knowledge graph completion (Bordes et al., 2013; Neelakantan et al., 2015). R-GAT takes as input a tail entity e_t with its neighborhood set N_et = {(e_1, r_1), (e_2, r_2), ..., (e_n, r_n)}, where e_t ∈ {E_1, E_2}. NAEA considers only a fixed number of neighbors, which can lead to information loss; here, as in the original GAT, we consider all neighboring entities.
In R-GAT, we first calculate attention coefficients between entity e_t and its neighborhood information. We use two weight matrices W_1 and W_2 to transform entities and relations, respectively, and merge the translational assumption from TransE into the attention coefficients. For each triple, the self-attention coefficient can be calculated as:

c^m_ij = a_t^T [W_1 e^m_i, W_1 e^m_j + W_2 r^m_ij]

where a_t ∈ R^2d is a learnable parameter, e^m_i and e^m_j are the hidden states of tail entity e_i and head entity e_j, r^m_ij is the embedding of the relation r_ij between e_i and e_j, m is the layer index, and [·, ·] denotes the concatenation operation. Note that in Figure 2 the neighborhood set of entity e_i1 contains a self-loop, so we add a relation vector to the relation set to represent the self-loop relation. Then we calculate the weight by applying a non-linear activation function.
w_ij = exp(LeakyReLU(c^m_ij)) / Σ_{l ∈ N_i} exp(LeakyReLU(c^m_il))

where LeakyReLU is the nonlinear function of Veličković et al. (2018). To obtain the final output for the entity, we compute a linear combination of the neighborhood messages with these weights, employing multi-head attention in the same way as GAT.
e^{m+1}_i = ||_{k=1}^K σ( Σ_{j=1}^n w^k_ij (W^k_1 e^m_j + W^k_2 r^m_ij) )

where K is the number of attention heads, || denotes concatenation, n is the number of neighbors of entity e_i, and w^k_ij are the attention coefficients computed by the k-th attention head. In the first iteration, e^1_i aggregates the information of e^0_j; meanwhile, e^1_j aggregates the information of e^0_k, where e_k is a neighborhood head entity of e_j. In the second iteration, e^2_i aggregates the information of e^1_j, which already contains the information of e^0_k, so e^2_i contains two-hop information. After m iterations, an entity aggregates all neighborhood information from one hop to m hops.
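To make the aggregation concrete, here is a minimal single-head numpy sketch of one R-GAT update for a tail entity. The exact scoring formula in the paper may differ in detail; all function names, parameter shapes, and the tanh output nonlinearity are our own assumptions, and numpy stands in for a proper GNN framework:

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def r_gat_layer(e_t, neighbors, relations, W1, W2, a):
    """One single-head R-GAT update for a tail entity (illustrative).

    neighbors: (n, d) head-entity embeddings; relations: (n, d) relation
    embeddings.  The translational assumption enters by scoring the
    concatenation [W1 e_t, W1 e_j + W2 r_ij] against the vector a.
    """
    et_p = W1 @ e_t
    msgs = neighbors @ W1.T + relations @ W2.T        # translated messages
    logits = leaky_relu(np.concatenate(
        [np.tile(et_p, (len(msgs), 1)), msgs], axis=1) @ a)
    w = softmax(logits)                               # attention over neighbors
    return np.tanh(w @ msgs)                          # weighted aggregation

rng = np.random.default_rng(0)
d = 4
out = r_gat_layer(rng.normal(size=d), rng.normal(size=(3, d)),
                  rng.normal(size=(3, d)), rng.normal(size=(d, d)),
                  rng.normal(size=(d, d)), rng.normal(size=2 * d))
print(out.shape)  # (4,)
```

Stacking m such layers, each reading the previous layer's outputs, yields the one-hop-to-m-hop aggregation described above; multi-head attention would run K copies of this layer and concatenate the results.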
Compared with AliNet, which explicitly fuses the information of multi-hop neighborhoods, R-GAT selectively merges i-hop information into the current entity representation during the update of the i-th layer, without explicit modeling, which is an essential feature of graph neural networks.

LSTM-based Hierarchical Attention Module
Given the relation-aware graph attention module, we now discuss the LSTM-based hierarchical attention module. As the example in Figure 1 (B) illustrates, densely connected entities tend to receive overly similar representations. To adaptively learn low-level and high-level information and balance the neighborhood information of counterpart entities, we design the LSTM-based hierarchical attention module.
The input of the module is (e^1_i, e^2_i, ..., e^m_i), the outputs of the successive R-GAT iterations. The LSTM-based hierarchical attention can be described as follows:

h^j_i, c^j_i = LSTM(e^j_i, h^{j-1}_i, c^{j-1}_i)
α^j_i = softmax_j(h^j_i W_3)
e^out_i = Σ_{j=1}^m α^j_i e^j_i

where h^{j-1}_i and c^{j-1}_i are the hidden state and cell state, respectively, and W_3 ∈ R^{d×1} is a parameter to be learned. The LSTM-based hierarchical attention learns the weight of each level, which indirectly balances the information at the same level of counterpart entities through supervised learning.
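A minimal numpy sketch of this attention-over-layers computation follows. The LSTM cell is a standard textbook formulation, not the authors' code, and all parameter shapes and names here are our assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell (gates stacked in W, U, b)."""
    z = W @ x + U @ h + b
    d = len(h)
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d])
    g = np.tanh(z[3*d:])
    c = f * c + i * g
    return np.tanh(c) * o, c

def hierarchical_attention(layer_outputs, W, U, b, W3):
    """Attend over R-GAT layer outputs (e^1, ..., e^m) for one entity.

    An LSTM reads the per-layer representations in order; each hidden
    state is projected by W3 to a scalar score, softmax-normalised, and
    used to average the layer outputs.
    """
    d = layer_outputs.shape[1]
    h, c = np.zeros(d), np.zeros(d)
    scores = []
    for e in layer_outputs:
        h, c = lstm_step(e, h, c, W, U, b)
        scores.append(h @ W3)
    w = np.exp(scores)
    w /= w.sum()
    return w @ layer_outputs

rng = np.random.default_rng(0)
d, m = 4, 3
out = hierarchical_attention(rng.normal(size=(m, d)),
                             rng.normal(size=(4 * d, d)),
                             rng.normal(size=(4 * d, d)),
                             np.zeros(4 * d), rng.normal(size=d))
print(out.shape)  # (4,)
```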

Knowledge Embedding Module
Previous work has shown that knowledge embedding, which models intra-graph relationships and makes entities more distinguishable, is effective for entity alignment. Another reason to leverage knowledge embedding here is that it is required by the novel alignment method described later. We use TransE, which regards a relation as a translation vector between entities, as our knowledge embedding module.
Formally, for a triple (e_h, r, e_t), we define the score function f(e_h, r, e_t) = ||e_h + r − e_t|| to measure the plausibility of the triple. We use a margin-based ranking loss to train the knowledge embedding model:

L_k = Σ_{(h,r,t) ∈ T} Σ_{(h',r',t') ∈ T'} [f(h, r, t) + γ_1 − f(h', r', t')]_+

where [·]_+ = max(0, ·), T is the set of all triples in the two KGs, T' is the set of negative samples generated by corrupting T, and γ_1 is a positive margin hyper-parameter separating positive and negative triples.
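The hinge form of this objective can be sketched directly (a numpy illustration with our own function name; real training would pair each positive triple with its sampled corruptions):

```python
import numpy as np

def margin_ranking_loss(pos, neg, gamma):
    """Sum of [f(pos) + gamma - f(neg)]_+ over paired samples.

    pos/neg: arrays of scores for true triples and their corruptions;
    lower scores mean more plausible triples, so each positive should
    score at least gamma below its negative.
    """
    return np.maximum(0.0, pos + gamma - neg).sum()

pos = np.array([0.2, 0.1])   # scores of two true triples
neg = np.array([3.5, 0.5])   # scores of their corruptions
print(margin_ranking_loss(pos, neg, gamma=1.0))  # ≈ 0.6
```

Only the second pair violates the margin (0.1 + 1.0 > 0.5), so only it contributes to the loss.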

Training and Inference
We use a margin-based loss function as the training objective so that the embeddings of aligned entities have a small distance while those of unaligned entities have a large distance:

L_c = Σ_{(e_i1, e_i2) ∈ P} Σ_{(e'_i1, e'_i2) ∈ P'} [dist(e_i1, e_i2) + γ_2 − dist(e'_i1, e'_i2)]_+

where dist(e_i1, e_i2) = ||e_i1 − e_i2||_2 is the L_2 distance, γ_2 is a positive margin hyper-parameter, P is the set of pre-aligned entity pairs (e_i1, e_i2), and P' is the set of negative samples generated by nearest neighbor negative sampling. The overall loss function of DAEA combines the two parts (Sun et al., 2020):

L = L_c + L_k

Different weights could be assigned to L_c and L_k; here we weight them equally to treat both loss functions the same in our experiments. For inference, we adopt two methods to align entities: a distance-based method and a completion-based method.
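The alignment part of the objective has the same hinge form as the knowledge embedding loss. The sketch below (numpy; names ours) pairs every positive with every sampled negative, which is one plausible reading of the objective rather than the paper's exact pairing scheme:

```python
import numpy as np

def alignment_loss(pos_pairs, neg_pairs, gamma):
    """[dist(pos) + gamma - dist(neg)]_+ summed over every
    (positive, negative) pairing; dist is the L2 distance between
    the two entity embeddings of a pair."""
    d_pos = np.array([np.linalg.norm(a - b) for a, b in pos_pairs])
    d_neg = np.array([np.linalg.norm(a - b) for a, b in neg_pairs])
    return np.maximum(0.0, d_pos[:, None] + gamma - d_neg[None, :]).sum()

vec = lambda *xs: np.array(xs, dtype=float)
pos = [(vec(0, 0), vec(0.1, 0))]   # pre-aligned pair, distance 0.1
neg = [(vec(0, 0), vec(5, 0))]     # sampled negative, distance 5
print(alignment_loss(pos, neg, gamma=3.0))  # max(0, 0.1 + 3 - 5) = 0.0
```

The total DAEA loss would then be `alignment_loss(...) + margin_ranking_loss(...)` with equal weights, as stated above.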

Distance-based Method
In the distance-based method, we align entities by nearest neighbor search over the entity embeddings, simply computing the distance between entities across the two KGs.
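This nearest neighbor search is a one-liner in practice (a numpy sketch with our own function name; production systems would use an approximate index for large KGs):

```python
import numpy as np

def align_by_distance(src_emb, tgt_emb):
    """Nearest-neighbour alignment: for each source entity, return the
    index of the closest target entity under L2 distance."""
    d = np.linalg.norm(src_emb[:, None, :] - tgt_emb[None, :, :], axis=-1)
    return d.argmin(axis=1)

src = np.array([[0.0, 0.0], [1.0, 1.0]])
tgt = np.array([[1.1, 1.0], [0.1, 0.0]])
print(align_by_distance(src, tgt))  # [1 0]
```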

Completion-based Method
In this part, we design a novel entity alignment method based on link prediction, following the knowledge graph completion task. Before training the model, we add a relation type to the KGs to establish links between pre-aligned entity pairs. For each entity pair (e_i1, e_i2) ∈ P, we add the symmetric relation 'Is same' between e_i1 and e_i2, obtaining the triples (e_i1, Is same, e_i2) and (e_i2, Is same, e_i1), and add them to the KGs for training. At test time, we align entities directly via link prediction. The framework of the completion-based method is shown in Figure 3.

Datasets

We evaluate DAEA on the DBP15K datasets. DBP15K contains three cross-lingual datasets built from DBpedia, denoted DBP15K_ZH-EN (Chinese-to-English), DBP15K_JA-EN (Japanese-to-English) and DBP15K_FR-EN (French-to-English). Each dataset contains 15,000 reference alignment links between popular English entities and their Chinese, Japanese and French counterparts, respectively. The statistics of the datasets are shown in Table 1. Figure 4 shows the degree statistics of 20 randomly selected pre-aligned entities at different levels from DBP15K_ZH-EN; here, the degree of an entity is the number of triples associated with that entity. From Figure 4 (A) and (B), we can see that one-hop degree imbalance exists only in a few entity pairs, whereas the imbalance is obvious in the two-hop neighborhood set. For example, in entity pair Z14, the degree difference between two-hop neighborhood entities is more than 1000. Figure 4 (D) shows that the distribution of degree differences in the two-hop neighborhood roughly follows a long-tail distribution.

Experimental Setting
In DAEA, we set the dimension of entity (relation) embeddings to 128 and the learning rate to 0.001. For each positive triple, we select 25 negative triples for graph model training and 2 negative triples for knowledge embedding model training. The margins γ_1 and γ_2 are both set to 3. We use a 2-layer R-GAT as the encoder. We use 30% of the gold standard as seed alignments and keep the rest as test data. We compare DAEA with the following entity alignment methods: MTransE (Chen et al., 2017), IPTransE (Zhu et al., 2017), JAPE, AlignE (Sun et al., 2018), GCN-Align (Wang et al., 2018), SEA (Pei et al., 2019), MuGCN, KECG and AliNet (Sun et al., 2020). Some models, such as GMNN (Xu et al., 2019) and RDGCN (Wu et al., 2019a), merge the literal information of entities into their representations; since our model relies only on structural information, we do not include them in the comparison. To compare different alignment methods, we also select two KG embedding models, TransH and RotatE, which are usually evaluated on link prediction, as baselines. We report Hits@1, Hits@10 and MRR to evaluate entity alignment performance. We also conduct three ablation experiments: DAEA (w/o LSTM & rel.) removes both the LSTM-based hierarchical attention and the relation-aware component; DAEA (w/o rel.) and DAEA (w/o LSTM) remove the relation-aware component and the LSTM-based attention, respectively.

Main Results
We present the entity alignment results in Table 2; note that we adopt the completion-based method in DAEA. DAEA is significantly more effective than the baseline models. It is not surprising that DAEA outperforms the GNN-based alignment models, i.e. KECG, AliNet, and GCN-Align. By performing relation-aware graph attention over an entity's neighbors, R-GAT not only extends GAT with relations but also integrates the translational assumption into the self-attention mechanism, which models the relationship among heads, tails and relations. Meanwhile, DAEA introduces an LSTM-based hierarchical attention module to identify the most useful neighborhood ranges for each entity and learn low-level and high-level information adaptively.

Ablation Studies
In the ablation study, we can see that both the relation-aware graph attention and the LSTM-based hierarchical attention play essential roles in our model. The module with the most significant gain is the relation-aware graph attention module: it improves Hits@1 by a margin of 0.173 and Hits@10 by a margin of 0.06 on DBP15K_ZH-EN. The LSTM-based hierarchical attention module further improves performance on top of the relation-aware graph attention module, although its improvement alone is not apparent; DAEA (w/o LSTM & rel.) performs similarly to KECG. The ablation studies show that the good performance of DAEA is largely attributable to the information fusion capability of the R-GAT module, which incorporates the neighborhood entity and relation information of the KGs with different weights for entity alignment.

Effectiveness on alignment methods

Table 3 lists the results of different alignment methods on DBP15K_ZH-EN. Here we ignore the graph encoding part in order to compare the alignment methods directly. With TransE, the completion-based method is significantly more effective than the distance-based method. Note that the score function is f_c = ||e_i1 + Is same − e_i2|| in the completion-based method and f_d = ||e_i1 − e_i2|| in the distance-based method for an entity pair (e_i1, e_i2), and the embedding of Is same is fixed during alignment. This indicates that the improvement of the completion-based method comes from the added triples (e_i1, Is same, e_i2). For TransH, we therefore add an experiment, dist. (Is same), in which the triples (e_i1, Is same, e_i2) are added to the training data of the distance-based alignment method. After eliminating this difference in data, the completion-based alignment method still works better than the distance-based one.

However, with RotatE, the completion-based method performs worse than the distance-based method. We also use the KG embedding model ComplEx for entity alignment, and the completion-based method built on ComplEx also performs poorly. This indicates that the scoring function affects the performance of the completion-based method. A scoring function based on the translational assumption contains a distance metric and can also model the spatial transformation between entity pairs in different languages through the Is same relation. In RotatE and ComplEx, the distance signal is weaker, so accurate embeddings require a large amount of training data; in entity alignment, however, the prior alignments usually account for only a small proportion, which prevents these approaches from learning accurate embeddings. In general, the advantage of the distance-based method is fast convergence and high robustness, but it cannot model the spatial transformation between entity pairs. The completion-based method can model this transformation, but it is sensitive to the scoring function and relies on a large number of prior aligned entity pairs.
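The triple augmentation at the heart of the completion-based method is simple to state in code (a sketch with our own function name, using the paper's 'Is same' relation label):

```python
def add_is_same_triples(prealigned_pairs):
    """Augment the training triples with a symmetric 'Is same' relation,
    in both directions, for every pre-aligned entity pair."""
    triples = []
    for e1, e2 in prealigned_pairs:
        triples.append((e1, "Is same", e2))
        triples.append((e2, "Is same", e1))
    return triples

print(add_is_same_triples([("a", "a2")]))
# [('a', 'Is same', 'a2'), ('a2', 'Is same', 'a')]
```

At test time, alignment then reduces to standard link prediction: rank candidate tails for the query (e_i1, Is same, ?).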

Effectiveness on hierarchical aggregate methods
We design three additional hierarchical aggregation methods: the mean-based, the concat-based, and the MaxPooling-based method. The mean-based method averages all R-GAT layers, e^out_i,mean = (1/m) Σ_{j=1}^m e^j_i. The concat-based method concatenates all R-GAT layers and then transforms the dimension through a fully connected layer, e^out_i,concat = g([e^1_i, e^2_i, ..., e^m_i]), where g(·) is a fully connected layer. The MaxPooling-based method takes the element-wise maximum over all R-GAT layers, e^out_i,MaxPooling = max(e^1_i, e^2_i, ..., e^m_i). The results are shown in Table 4. We observe that the nonlinear aggregation methods, i.e. the concat-based and MaxPooling-based methods, do not show promising performance. The LSTM-based method is the most effective on Hits@10, while on DBP15K_ZH-EN and DBP15K_FR-EN the mean-based method achieves the best Hits@1 and MRR. Note that the mean-based method is a special case of the LSTM-based method, which indicates that linear aggregation is more effective for balancing the neighborhood information of corresponding entities.
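The three aggregation variants can be compared side by side in a few lines (a numpy sketch; the dispatch function and its names are ours):

```python
import numpy as np

def aggregate(layers, mode, g=None):
    """Combine the per-layer outputs (m, d) of R-GAT into one vector."""
    if mode == "mean":                 # e_mean = (1/m) * sum_j e^j
        return layers.mean(axis=0)
    if mode == "max":                  # element-wise max pooling
        return layers.max(axis=0)
    if mode == "concat":               # g: (m*d, d_out) linear projection
        return layers.reshape(-1) @ g
    raise ValueError(mode)

layers = np.array([[1.0, 4.0],        # layer 1 output
                   [3.0, 2.0]])       # layer 2 output
print(aggregate(layers, "mean"))      # [2. 3.]
print(aggregate(layers, "max"))       # [3. 4.]
```

The mean is recovered from the LSTM-based attention when all attention weights equal 1/m, which is the sense in which it is a special case.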

Conclusion
In this paper, we introduce a dual attention entity alignment model, DAEA, which contains a relation-aware graph attention module and an LSTM-based hierarchical attention module. In the relation-aware graph attention module, we model relations in the graph attention network and integrate the translational assumption into the self-attention mechanism, which mitigates the non-isomorphism between the neighborhood structures of counterpart entities. In the hierarchical attention module, we use an LSTM attention mechanism to learn the weight of each output layer of the relation-aware graph attention module, which balances the neighborhood information of counterpart entities and distinguishes non-counterpart entities with similar structures by adaptively learning low-level and high-level information. We also design a new alignment method based on link prediction. Our experiments on three datasets demonstrate the effectiveness of DAEA. In future work, we will integrate semantic information into the model and design an appropriate model based on the completion-based method.