Exploring and Evaluating Attributes, Values, and Structure for Entity Alignment

Entity alignment (EA) aims at building a unified Knowledge Graph (KG) of rich content by linking the equivalent entities from various KGs. GNN-based EA methods present promising performance by modeling the KG structure defined by relation triples. However, attribute triples can also provide crucial alignment signal but have not been well explored yet. In this paper, we propose to utilize an attributed value encoder and partition the KG into subgraphs to model the various types of attribute triples efficiently. Besides, the performance of current EA methods is overestimated because of the name-bias of existing EA datasets. To make an objective evaluation, we propose a hard experimental setting where we select equivalent entity pairs with very different names as the test set. Under both the regular and hard settings, our method achieves significant improvements (5.10% on average Hits@1 in DBP15k) over 12 baselines on cross-lingual and monolingual datasets. Ablation studies on different subgraphs and a case study about attribute types further demonstrate the effectiveness of our method. Source code and data can be found at https://github.com/thunlp/explore-and-evaluate.


Introduction
The prosperity of data mining has spawned Knowledge Graphs (KGs) in many domains that are often complementary to each other. Entity Alignment (EA) provides an effective way to integrate the complementary knowledge in these KGs into a unified KG by linking equivalent entities, thus benefiting knowledge-driven applications such as Question Answering (Yang et al., 2018), Recommendation (Cao et al., 2019b), and Information Extraction (Kumar, 2017; Cao et al., 2018). However, EA is a non-trivial task, as it can be formulated as a quadratic assignment problem (Yan et al., 2016), which is NP-complete (Garey and Johnson, 1990).
A KG comprises a set of triples, with each triple consisting of a subject, predicate, and object. There are two types of triples: (1) relation triples, in which both the subject and object are entities, and the predicate is often called relation (see Figure 1(a)); and (2) attribute triples, in which the subject is an entity and the object is a value, which is either a number or literal string (see Figure 1(c)), and the predicate is often called attribute.
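The distinction between the two triple types can be made concrete with a small sketch: a triple whose object appears in the entity set is a relation triple, otherwise it is an attribute triple. The entity and predicate names below are illustrative, not taken from the paper's datasets.

```python
# Sketch: distinguishing relation triples from attribute triples.
# A triple is a relation triple if its object is another entity,
# and an attribute triple if its object is a value (number or string).

def split_triples(triples, entities):
    """Partition (subject, predicate, object) triples by object type."""
    relation_triples, attribute_triples = [], []
    for s, p, o in triples:
        if o in entities:
            relation_triples.append((s, p, o))
        else:
            attribute_triples.append((s, p, o))
    return relation_triples, attribute_triples

entities = {"Georgia", "Atlanta", "United_States"}
triples = [
    ("Georgia", "capital", "Atlanta"),         # relation triple
    ("Georgia", "country", "United_States"),   # relation triple
    ("Georgia", "area_km2", "153909"),         # attribute triple (number)
    ("Georgia", "postal_abbr", "GA"),          # attribute triple (string)
]
rel, attr = split_triples(triples, entities)
```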
Most of the previous EA models (Wang et al., 2018; Wu et al., 2019a) rely on the structure assumption that the adjacencies of two equivalent entities in KGs usually contain equivalent entities (Wang et al., 2018) (see Figure 1(a)). These models mainly focus on modeling the KG structure defined by the relation triples. However, we argue that attribute triples can also provide important clues for judging whether two entities are the same, based on the attribute assumption that equivalent entities often share similar attributes and values in KGs. For example, in Figure 1(b), the equivalent entities e and e′ share the attribute Area with similar values of 153,909 and 154,077. Therefore, we aim to improve EA using attribute triples. We have identified two challenges: attribute incorporation and dataset bias.
Attribute Incorporation Challenge. Modeling attribute triples together with relation triples is a more effective strategy than modeling attribute triples alone. In this way, the alignment signal from attribute triples can be propagated to an entity's neighbors via relation triples. Recently, some pioneering EA works (Trisedya et al., 2019) have incorporated both attribute and relation triples. However, they learn relation and attribute triples in separate networks. In this case, the alignment signal from an entity's discriminative attributes and values will be reserved to the entity itself and will not help align its neighbors. In addition, it is crucial to identify the different importance of attributes in discriminating whether two entities are equivalent. For example, the attribute Time Zone should be assigned less importance than Name since many cities can share the same Time Zone (Figure 1(c)). Previous works fail to consider the different importance of attributes.

Figure 1: Examples for EA using different assumptions and identifying the different importance of attributes. In Figure 1(a), we align e1 and e1′ for the equivalent entity pairs (e2, e2′) and (e3, e3′) in their neighbors. In Figures 1(b) and 1(c), we align e and e′ for their similar attributes and values; e refers to the entity "Georgia (U.S. state)" from English Wiki and e′ is the Chinese equivalent. In Figure 1(c), the attribute Time Zone and its value are assigned less attention weight for being less discriminative for alignment. Chinese texts are translated. Dashed curves link the target equivalent entity pairs. Dashed two-way arrows indicate alignment signals.

Dataset Bias Challenge. The performance of EA is overestimated because the existing EA datasets are biased to the attribute Name: 60%-80% of the released seed set of equivalent entities in DBP15k can be aligned via name matching. The reason is that the equivalent entities are collected using inter-language links, which are labeled by a strategy that heavily relies on the translation of entity names. In this way, the datasets contain many "easy" equivalent entities that have similar names. However, in the practical application of EA, the "easy" equivalent entities are often aligned already, and the challenge is to align the "hard" ones that have very different names. This discrepancy between the datasets and the practical situation causes overestimated EA performance.
To address the first challenge, we propose the Attributed Graph Neural Network (AttrGNN), which learns attribute triples and relation triples in a unified network and learns the importance of each attribute and value dynamically. Specifically, we propose an attributed value encoder to select and aggregate alignment signal from informative attributes and values. We further employ the mean aggregator (Hamilton et al., 2017) to propagate this signal to an entity's neighbors. In addition, as different types of attributes have different similarity measurements, we partition the KG into four subgraphs by grouping attributes, i.e., attribute Name, literal attributes, digital attributes, and structural knowledge. We apply separate channels to learn their representations and present two methods to ensemble the outputs from all channels.
To alleviate the name-bias of EA datasets (second challenge), we propose a hard experimental setting. Specifically, we construct harder test sets from existing datasets by selecting equivalent entities that have the least similarity in their names. We further evaluate the models on these harder test sets to offer a more objective evaluation of EA models' performance. Under both the hard and regular settings, AttrGNN achieves the best result with significant performance improvement (5.10% Hits@1 on average in DBP15k) over 12 baselines on both the cross-lingual and monolingual datasets.

Related Work
Recent entity alignment methods can be classified into embedding-based methods and Graph Neural Network-based (GNN-based) methods.

Embedding-based Methods
Recent works utilize KG embedding methods, such as TransE (Bordes et al., 2013), to model the relation triples, and further unify the two KG embedding spaces by forcing seeds to be close. Attribute triples have also been introduced in this field. JAPE computes attribute similarity to regularize the structure-based optimization. KDCoE (Chen et al., 2018) co-trains entity description and structure embeddings with a shared, iteratively enlarged seed set. AttrE (Trisedya et al., 2019) and MultiKE encode values as extra entity embeddings. However, the diversity of attributes and uninformative values limit the performance of the above methods.

Figure 2: The framework of AttrGNN. Three GNN channels (GCs) are shown as an example. We do not use any attributes in GC1 to focus on the learning of structural knowledge (node degree distribution). S^k is the output similarity matrix of GCk. S^k_{e,e′} is the similarity between e ∈ KG1 and e′ ∈ KG2 measured by GCk. For the KGs and their subgraphs, we use circles to denote entities and rectangles to denote values.

GNN-based Methods
Following Graph Convolutional Networks (Kipf and Welling, 2017), many GNN-based models have been proposed because of GNN's strong ability to model graph structure. These methods present promising results on EA because GNNs can propagate the alignment signal to an entity's distant neighbors. Previous GNN-based methods focus on extending GNN's ability to model relation types (Wu et al., 2019a,b; Li et al., 2019), aligning entities via matching subgraphs (Xu et al., 2019; Wu et al., 2020), and reducing the heterogeneity between KGs (Cao et al., 2019a). With the exception of Wang et al. (2018), who incorporated attributes as the initial features of entities, most current GNN-based methods fail to incorporate attributes and values to further improve the performance of EA.
In this paper, we add values as nodes into the graph and use an attributed value encoder to conduct attribute-aware value aggregation.

Methodology
The key idea of AttrGNN is to use graph partition and attributed value encoder to deal with various types of attribute triples. In this section, we first define KG and then introduce our graph partition strategy. Further, we design different GNN channels for different subgraphs and present two methods to ensemble all channels' outputs for final evaluation.

Model Framework
KG. A KG is defined as G = (E, R, A, V, T_r, T_a), where E, R, A, and V refer to the sets of entities, relations, attributes, and values, respectively; T_r ⊆ E × R × E and T_a ⊆ E × A × V are the sets of relation triples and attribute triples. Entity Alignment is to find a mapping between two KGs G and G′, i.e., ψ = {(e, e′) | e ∈ E, e′ ∈ E′}, where e and e′ are equivalent entities. A seed set of equivalent entities ψ_s is used as training data.

Framework. The framework of our AttrGNN model is shown in Figure 2, which consists of four major components: (1) Graph Partition, which divides the input KG into subgraphs by grouping attributes and values. (2) Subgraph Encoder, which employs multiple GNN channels to learn the subgraphs separately. Each channel is a stack of L attributed value encoders and mean aggregators. The attributed value encoder aggregates attributes and values to generate the entity embeddings, and the mean aggregator propagates entity features to their neighbors following the graph structure. (3) Graph Alignment, which unifies the entity vector spaces of the two KGs for each channel. (4) Channel Ensemble, which infers the entity similarity using each channel and ensembles all channels' results for the final inference.

Graph Partition
Attributes and values have various types, e.g., strings and numbers. Different attributes have different similarity measurements; for example, the similarity between digital values should be based on their numerical difference (153,909 vs. 154,077), while the similarity of literal values is often based on their semantic meanings. Therefore, we separately learn the similarity measurements of the KG's four subgraphs, defined as G_k = (E, R, A_k, V_k, T_r, T_ak), where k ∈ {1, 2, 3, 4}:
• G_1 includes attribute triples of Name only, i.e., A_1 = {a_name}.
• G_2 includes the attribute triples whose values are literal strings.
• G_3 includes the attribute triples whose values are digital numbers.
• G_4 includes no attribute triples, i.e., A_4 = ∅, and retains only the structural knowledge defined by relation triples.
These subgraphs have mutually exclusive attribute triples but share the same relation triples.
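The partition step above can be sketched as a simple routing of attribute triples by attribute name and value type. This is a minimal sketch under our assumptions: the set of Name attributes and the numeric-parsing heuristic are illustrative, not the paper's exact implementation.

```python
# Route each attribute triple into the Name, Literal, or Digital
# subgraph; the Structure subgraph keeps no attribute triples.

def is_digital(value):
    """Heuristic: treat a value as digital if it parses as a number."""
    try:
        float(value.replace(",", ""))
        return True
    except ValueError:
        return False

def partition_attribute_triples(attribute_triples, name_attributes=("name",)):
    subgraphs = {"name": [], "literal": [], "digital": [], "structure": []}
    for e, a, v in attribute_triples:
        if a in name_attributes:
            subgraphs["name"].append((e, a, v))
        elif is_digital(v):
            subgraphs["digital"].append((e, a, v))
        else:
            subgraphs["literal"].append((e, a, v))
    return subgraphs

triples = [
    ("e1", "name", "Georgia (U.S. state)"),
    ("e1", "area_km2", "153,909"),
    ("e1", "time_zone", "Eastern"),
]
parts = partition_attribute_triples(triples)
```

All four subgraphs would then be paired with the shared relation triples before encoding.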

Subgraph Encoder
We design different GNN channels (GCs) to encode the above four subgraphs: the Name channel for G_1, the Literal channel for G_2, the Digital channel for G_3, and the Structure channel for G_4. The building blocks of these channels are two types of GNN layers: the attributed value encoder and the mean aggregator. Particularly, to select alignment signal from the informative attributes and values, we first stack one attributed value encoder and then mean aggregators in the Literal and Digital channels. We stack only mean aggregators, without an attributed value encoder, for the Structure and Name channels because they do not use various attribute triples. We add residual connections (He et al., 2016) between GNN layers for the Name, Literal, and Digital channels. Following previous EA works, all channels have two GNN layers. Next, we describe the attributed value encoder and the mean aggregator in detail.

Attributed Value Encoder
The attributed value encoder selectively gathers discriminative information from the initial features of attributes and values into the central entity. As an example, we show how to obtain e's first-layer hidden state h^1_e; the same method applies to all entities. Given the attribute triples {(e, a_1, v_1), ..., (e, a_n, v_n)} of e as inputs, we obtain the sequence of attribute features {a_1, ..., a_n} and value features {v_1, ..., v_n}. Specifically, we use BERT (Devlin et al., 2019) to obtain the features of both literal and digital values.
BERT is a language model pre-trained on a corpus of more than 3,000M words and is widely used as a feature extractor in NLP tasks. By adding values as nodes and attributes as edges (which connect values to the entity) into the graph, we can apply attention from the entity to the attributes and use the attention scores to compute a weighted average of attributes and values. Following Graph Attention Networks (Velickovic et al., 2018), we define h^1_e as follows:

h^1_e = σ( Σ_{j=1}^{n} α_j W^1 [a_j ; v_j] ),  α_j = exp(u^⊤ [h^0_e ; a_j]) / Σ_{k=1}^{n} exp(u^⊤ [h^0_e ; a_k]),

where j ∈ {1, ..., n}, [· ; ·] denotes concatenation, W^1 ∈ R^{D_h1 × (D_a + D_v)} and u ∈ R^{(D_e + D_a) × 1} are learnable matrices, σ is the ELU(·) function, and h^0_e is the initial entity feature.
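The attention-weighted aggregation can be sketched in a few lines of numpy. This is a minimal sketch, not the trained model: the dimensions and randomly drawn features stand in for BERT features and learned parameters.

```python
# Attributed value encoder sketch: attention from the entity feature to
# each attribute, then a weighted average of attribute/value features.
import numpy as np

rng = np.random.default_rng(0)
n, d_e, d_a, d_v, d_h = 3, 4, 4, 4, 8   # 3 attribute triples of one entity

h0 = rng.normal(size=d_e)               # initial entity feature h^0_e
A = rng.normal(size=(n, d_a))           # attribute features a_1..a_n
V = rng.normal(size=(n, d_v))           # value features v_1..v_n
W1 = rng.normal(size=(d_h, d_a + d_v))  # learnable projection W^1
u = rng.normal(size=d_e + d_a)          # learnable attention vector u

# attention logits from the entity to each attribute, then softmax
logits = np.array([u @ np.concatenate([h0, A[j]]) for j in range(n)])
alpha = np.exp(logits - logits.max())
alpha = alpha / alpha.sum()

# weighted aggregation of attribute-value pairs, then ELU nonlinearity
agg = sum(alpha[j] * (W1 @ np.concatenate([A[j], V[j]])) for j in range(n))
h1 = np.where(agg > 0, agg, np.expm1(agg))  # ELU(x) = x if x>0 else exp(x)-1
```

The attention weights alpha are exactly the α_j that down-weight uninformative attributes such as Time Zone.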

Mean Aggregator
The mean aggregator layer utilizes the features of the target entity and its neighbors to generate the entity embedding. The neighbor entities of e are defined by relation triples: N(e) = {j | ∀(j, r, e) ∈ T_r or ∀(e, r, j) ∈ T_r, ∀r ∈ R}. We aggregate the features of e's neighbor entities to gather alignment signal and learn the structural knowledge. Given the hidden state h^{l-1}_e from the (l-1)-th layer, the mean aggregator (Hamilton et al., 2017) is defined as:

h^l_e = σ( W^l · MEAN({h^{l-1}_e} ∪ {h^{l-1}_j, ∀j ∈ N(e)}) ),

where W^l ∈ R^{D_hl × D_h(l-1)} is a learnable matrix, MEAN(·) returns the mean vector of the inputs, and σ is the nonlinear function, chosen as ReLU(·).
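A minimal sketch of this GraphSAGE-style mean aggregation follows; the toy graph, shapes, and random weights are illustrative.

```python
# Mean aggregator sketch (Hamilton et al., 2017): each entity's new
# state is ReLU(W @ mean of its own and its neighbors' previous states).
import numpy as np

def mean_aggregate(h_prev, neighbors, W):
    """h_prev: dict entity -> vector; neighbors: dict entity -> neighbor list."""
    h_next = {}
    for e, vec in h_prev.items():
        stacked = np.stack([vec] + [h_prev[j] for j in neighbors.get(e, [])])
        h_next[e] = np.maximum(0.0, W @ stacked.mean(axis=0))  # ReLU
    return h_next

rng = np.random.default_rng(1)
h0 = {"e1": rng.normal(size=4), "e2": rng.normal(size=4), "e3": rng.normal(size=4)}
nbrs = {"e1": ["e2", "e3"], "e2": ["e1"], "e3": ["e1"]}
W = rng.normal(size=(4, 4))
h1 = mean_aggregate(h0, nbrs, W)
```

Stacking two such layers, as in our channels, lets attribute signal reach two-hop neighbors.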

Graph Alignment
Graph Alignment unifies the two KGs' representations of each channel into a unified vector space by reducing the distance between the seed equivalent entities. We separately train the four channels and ensemble their outputs afterward for the final evaluation (see Section 3.5). Following Li et al. (2019), we generate negative samples of (e, e′) ∈ ψ_s by searching the nearest entities of e (or e′) in the entity embedding space. We denote the final output h^L_e of channel GC_k as the entity embedding e_k. For each channel GC_k, we optimize the following objective function:

L_k = Σ_{(e,e′)∈ψ_s} Σ_{ē∈NS(e)} [ d(e_k, e′_k) + γ − d(e_k, ē_k) ]_+ ,

where ψ_s is the seed set of equivalent entities, NS(e) denotes the negative samples of e, [·]_+ = max{·, 0}, d(·,·) = 1 − cos(·,·) is the cosine distance, and γ is a margin hyperparameter.
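The margin objective can be sketched directly from the stated definitions: cosine distance plus a hinge over negative samples. The toy embeddings below are illustrative.

```python
# Margin-based alignment loss sketch: [d(e, e') + gamma - d(e, neg)]_+,
# with d(x, y) = 1 - cos(x, y).
import numpy as np

def cos_dist(x, y):
    return 1.0 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def alignment_loss(seed_pairs, negatives, gamma=1.0):
    """seed_pairs: list of (vec_e, vec_e'); negatives: per-pair negative vectors."""
    loss = 0.0
    for (e, e_pos), negs in zip(seed_pairs, negatives):
        for e_neg in negs:
            loss += max(0.0, cos_dist(e, e_pos) + gamma - cos_dist(e, e_neg))
    return loss

e = np.array([1.0, 0.0])
pos = np.array([0.9, 0.1])   # near-aligned pair: small positive distance
neg = np.array([-1.0, 0.0])  # opposite direction: distance 2
loss = alignment_loss([(e, pos)], [[neg]], gamma=1.0)
```

With the seed pair already much closer than the negative by more than the margin, the hinge contributes zero loss.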

Channel Ensemble
We use the entity embedding of each channel to infer the similarity matrices S^k ∈ R^{|E|×|E′|} (k ∈ {1, 2, 3, 4}), where S^k_{e,e′} = cos(e_k, e′_k) is the cosine similarity score between e ∈ E and e′ ∈ E′. We present two methods to ensemble the four matrices into a single similarity matrix S* for the final evaluation.

Average Pooling. Empirically, we assume that each channel has equal importance. We let S* = (1/4) Σ_{k=1}^{4} S̃^k, where S̃^k is the standardized S^k:

S̃^k = (S^k − mean(S^k)) / std(S^k).

SVM. We utilize LS-SVM (Suykens and Vandewalle, 1999) to learn a weight for each channel. A weight vector w = [w_1, w_2, w_3, w_4] is trained on samples (x_l, y_l), where x_l = [S^1_{e,e′}, S^2_{e,e′}, S^3_{e,e′}, S^4_{e,e′}] is a vector of sampled similarity scores; if (e, e′) ∈ ψ_s, the label y_l = 1, otherwise y_l = 0.
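The average-pooling ensemble can be sketched as follows: standardize each channel's similarity matrix to zero mean and unit variance, then average. The toy matrices are illustrative.

```python
# Average-pooling channel ensemble sketch.
import numpy as np

def standardize(S):
    """Zero-mean, unit-variance standardization of a similarity matrix."""
    return (S - S.mean()) / S.std()

def average_pool(matrices):
    return sum(standardize(S) for S in matrices) / len(matrices)

rng = np.random.default_rng(2)
# four channels with deliberately different scales and offsets
channels = [rng.normal(loc=k, scale=k + 1, size=(5, 5)) for k in range(4)]
S_star = average_pool(channels)
```

Standardizing first keeps a channel with a larger raw score range from dominating the average; the SVM variant would instead learn per-channel weights from seed-pair labels.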

Experiments
In this section, we compare AttrGNN with 12 baselines on the regular setting and our designed hard setting of EA. We also present an ablation study and a case study to evaluate attributes' and values' effects for EA.

Overall Performance
We report the results in two settings: regular setting, i.e., the setting used in the previous entity alignment works; and hard setting, where we construct a harder test set for objective evaluation.

Regular Setting
Cross-lingual Dataset. Table 3 shows the overall performance on DBP15k. We can see that: 1. Compared to the second-best model, AttrGNN achieves significant performance improvements of 5.10% for Hits@1 and 0.056 for MRR on average. This demonstrates the effectiveness of AttrGNN in integrating both attribute triples and relation triples.
2. NameBERT, which only uses entity names, performs better than the models that do not use names in most cases. This supports our observations that (1) the datasets are name-biased and (2) the evaluation results cannot reflect true EA performance in real-world situations. Specifically, NameBERT performs better on DBPFR-EN than on DBPJA-EN and DBPZH-EN, which indicates a higher name-bias on DBPFR-EN. The reason is the better translation quality between French and English.
4. The SVM ensemble strategy performs better than average pooling on DBPFR-EN. On DBPFR-EN, the performances of the AttrGNN channels are imbalanced: the Name channel performs much better than the other channels, as shown by the performance gap between NameBERT and the baselines without names on these datasets. In such imbalanced cases, SVM performs better because it can adjust the weights of the channels. However, we cannot explain why the SVM strategy performs worse than average pooling on DBPZH-EN and DBPJA-EN. In fact, the integration of various KG features is an open problem; we leave it as future work.

Monolingual Dataset. We evaluate models in this monolingual setting to inspect the level of name-bias when there is no translation error. Table 4 shows the performance on DWY100K. The overall performance is similar to that on DBP15k, with AttrGNN achieving the best performance. There are three major observations: 1. NameBERT achieves nearly 100% Hits@1 on DBP-YG, which shows a more severe name-bias than on the cross-lingual dataset. The reason is that both DBpedia and YAGO are derived from Wikipedia, so that 77.60% of the released equivalent entities have exactly the same names while the rest have very similar names, e.g., George B. Rodney and George B Rodney. This result does not indicate that EA is solved, because EA is still challenging when integrating KGs from different domains, where entity names can be very different.
2. AttrE and MultiKE, which use entity names, do not perform well because they are agnostic to attribute importance. The crucial alignment signal from Name is thus averaged away by the other attribute triples (in DBpedia, each entity has 7-8 attribute triples on average).
3. MultiKE performs better than AttrE because it specifically sets a "Name View" to incorporate names. However, MultiKE performs worse than NameBERT on DBP-YG and DBP15k (Table 3), indicating that its inefficient combination of the "Name View" and other views harms the performance.

Hard Setting
In the hard setting, we aim to carry out a more objective evaluation of EA models on a harder test set. We first introduce how to construct the test set and then present the results and discussion.

Building a Harder Test Set. Let E_s and E′_s be the sets of known aligned entities in G and G′. First, we compute the similarity matrix S via NameBERT; each element S_{e,e′} denotes the similarity between the entity pair e ∈ E_s and e′ ∈ E′_s. Second, we rank the pairs by name similarity, ranking (e, e′) higher when there is less similarity in their names. Finally, we pick the highest-ranked 60% of equivalent entity pairs as the test set. The train set (30%) and the valid set (10%) are then randomly selected from the remaining data. We construct the harder test set for the cross-lingual dataset only, because it is impractical to find equivalent entity pairs whose entities have very different names on the monolingual dataset, as shown by the performance of NameBERT in Table 4.

Discussion. We run AttrGNN and the eight best-performing baselines with their source codes under the hard setting. Table 5 shows the overall performance. We observe a general performance drop in Hits@1 on DBP15k for all models, as shown in Figure 3. There are three major observations: 1. AttrGNN still achieves the best performance, demonstrating the effectiveness of our model. However, the performance of AttrGNN has degraded by around 6% in Hits@1. This degradation indicates that the practical application of EA is still challenging and worth exploration.
2. RDGCN shows the lowest performance degradation among all the models that use entity names, because RDGCN utilizes the feature of relation types within a GNN framework. This stable performance suggests that incorporating relation types into GNNs is crucial for EA and worth exploration. 3. Except for the iterative model, i.e., BootEA, the models that do not use entity names exhibit a smaller performance drop than the models with names. The iterative model's performance degrades more because the harder dataset weakens the snowball effect when iteratively enlarging the seed set of equivalent entities.
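The hard-split construction described above can be sketched as follows. This is a minimal sketch: the character-overlap similarity is an illustrative stand-in for NameBERT, and the tiny pair list is invented for demonstration.

```python
# Hard-split sketch: rank seed pairs by name similarity (ascending),
# take the least-similar 60% as the test set, and split the rest
# randomly into train (30% of all pairs) and valid (10%).
import random

def name_similarity(a, b):
    """Illustrative stand-in for NameBERT: character-set Jaccard overlap."""
    sa, sb = set(a.lower()), set(b.lower())
    return len(sa & sb) / len(sa | sb)

def hard_split(pairs, test_ratio=0.6, seed=0):
    ranked = sorted(pairs, key=lambda p: name_similarity(p[0], p[1]))
    n_test = int(len(ranked) * test_ratio)
    test, rest = ranked[:n_test], ranked[n_test:]
    random.Random(seed).shuffle(rest)
    n_train = int(len(pairs) * 0.3)
    return rest[:n_train], rest[n_train:], test  # train, valid, test

pairs = [("Georgia", "Gruzia"), ("Paris", "Paris"), ("Munich", "Muenchen"),
         ("Beijing", "Peking"), ("Vienna", "Wien")]
train, valid, test = hard_split(pairs)
```

Pairs with near-identical names (like Paris/Paris) stay out of the test set, so models can no longer score on easy name matches.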

Ablation Study
We conduct an ablation study on the performance of each AttrGNN channel, AttrGNN_avg without the Name channel (A w/o Name), AttrGNN without relation triples (A w/o Relation), and AttrGNN without graph partition (MixAttrGNN) (Figure 4). A w/o Relation ensembles NameBERT and one-layer Literal and Digital channels. There are four major observations: 1. The Literal and Structure channels' performances are close to the Name channel's under the hard setting. This demonstrates the importance of exploring non-name features, including other attributes and relations, for practical EA.
2. Compared to MixAttrGNN, our simple graph partition strategy achieves promising improvement. The reason is that graph partition enables model to measure the similarity of different attributes differently.
3. The Digital channel's performance is poor because it is challenging to learn numerical calculation with only the supervision of entity alignment. We thus leave it as future work. 4. Our full model significantly outperforms the Structure channel and A w/o Relation, which are the models using only relation/attribute features. This demonstrates the necessity of considering both relation and attribute triples for EA.

Case Study of Attributes and Values
We give a qualitative analysis of how attribute triples contribute to EA in this case study. Table 6 shows an equivalent entity pair that NameBERT fails to align, but AttrGNN aligns it by taking alignment signal from attributes and values. We observe that most of the top-ranked attributes have similar values in the two KGs. In this case, the similar values include three literal strings, e.g., GA, Flag of Georgia, and Seal of Georgia, and a number, e.g., 24.
Meanwhile, the values that are not shared by both KGs are assigned low attention weights and filtered out. As similar cases are commonly observed, we conclude that attributes determine the importance of values, and values provide discriminative signals. In other words, the attributes whose values are unique are ranked higher, e.g., postalabbreviation, which denotes the unique postal abbreviation of provinces. The values of the lowest-ranked attributes may have different forms in different KGs. For example, the attention weight of totalarea is small, because the English KG and the Chinese KG use different units of area (square miles in English DBpedia and square kilometers in Chinese DBpedia).

Conclusion and Future Work
We propose a novel EA model (AttrGNN) and contribute a hard experimental setting for practical evaluation. AttrGNN can integrate both attribute and relation triples with varying importance for better performance. Experimental results under the regular and hard settings show significant improvements of our proposed model, and the severe dataset bias can be effectively alleviated by our proposed hard setting.
In the future, we are interested in replacing BERT with knowledge-enhanced and number-sensitive text representation models (Cao et al., 2017; Geva et al., 2020).