Aligning Cross-Lingual Entities with Multi-Aspect Information

Multilingual knowledge graphs (KGs), such as YAGO and DBpedia, represent entities in different languages. The task of cross-lingual entity alignment is to match entities in a source language with their counterparts in target languages. In this work, we investigate embedding-based approaches to encode entities from multilingual KGs into the same vector space, where equivalent entities are close to each other. Specifically, we apply graph convolutional networks (GCNs) to combine multi-aspect information of entities, including topological connections, relations, and attributes of entities, to learn entity embeddings. To exploit the literal descriptions of entities expressed in different languages, we propose two uses of a pretrained multilingual BERT model to bridge cross-lingual gaps. We further propose two strategies to integrate GCN-based and BERT-based modules to boost performance. Extensive experiments on two benchmark datasets demonstrate that our method significantly outperforms existing systems.


Introduction
A growing number of multilingual knowledge graphs (KGs) have been built, such as DBpedia, YAGO (Suchanek et al., 2008; Rebele et al., 2016), and BabelNet (Navigli and Ponzetto, 2012), which typically represent real-world knowledge as separately structured monolingual KGs. Such KGs are connected via inter-lingual links (ILLs) that align entities with their counterparts in different languages, as exemplified in Figure 1 (top). Highly integrated multilingual KGs contain useful knowledge that can benefit many knowledge-driven cross-lingual NLP tasks, such as machine translation (Moussallem et al., 2018) and cross-lingual named entity recognition (Darwish, 2013). However, the coverage of ILLs among existing KGs is quite low (Chen et al., 2018): for example, less than 20% of the entities in DBpedia are covered by ILLs. The goal of cross-lingual entity alignment is to discover entities from different monolingual KGs that refer to the same real-world entity, i.e., to discover the missing ILLs. Traditional methods for this task apply machine translation techniques to translate entity labels (Spohr et al., 2011), so the quality of alignments in the cross-lingual scenario depends heavily on the quality of the adopted translation systems. In addition to entity labels, existing KGs also provide multi-aspect information about entities, including topological connections, relation types, attributes, and literal descriptions expressed in different languages (Xie et al., 2016), as shown in Figure 1 (bottom). The key challenge of this task is thus how to better model and use the provided multi-aspect information to bridge cross-lingual gaps and find more equivalent entities (i.e., ILLs).

Figure 1: An example fragment of two KGs (in English and Japanese) connected by an inter-lingual link (ILL). In addition to the graph structures (top) consisting of entity nodes and typed relation edges, KGs also provide attributes and literal descriptions of entities (bottom). Descriptions shown in the figure: "The University of Toronto is a public research university in Toronto, Ontario, Canada ..." and "トロント大学は、オンタリオ州、トロントに本部を置くカナダの州立大学である ...".

* Equal contribution.
Recently, embedding-based solutions (Chen et al., 2017b; Zhu et al., 2017; Wang et al., 2018; Chen et al., 2018) have been proposed to unify multilingual KGs into the same low-dimensional vector space where equivalent entities are close to each other. Such methods only make use of one or two aspects of the aforementioned information. For example, Zhu et al. (2017) relied only on topological features, while Wang et al. (2018) exploited both topological and attribute features. Chen et al. (2018) proposed a co-training algorithm to combine topological features and literal descriptions of entities. However, combining all of this multi-aspect information (i.e., topological connections, relations, and attributes, as well as literal descriptions) remains under-explored.
In this work, we propose a novel approach to learn cross-lingual entity embeddings by using all aforementioned aspects of information in KGs. Specifically, we propose two variants of GCN-based models, namely MAN and HMAN, that incorporate multi-aspect features, including topological features, relation types, and attributes, into cross-lingual entity embeddings. To capture the semantic relatedness of literal descriptions, we fine-tune the pretrained multilingual BERT model (Devlin et al., 2019) to bridge cross-lingual gaps. We design two strategies to combine GCN-based and BERT-based modules to make alignment decisions. Experiments show that our method achieves new state-of-the-art results on two benchmark datasets. Source code for our models is publicly available at https://github.com/h324yang/HMAN.

Problem Definition
In a multilingual knowledge graph $G$, we use $L$ to denote the set of languages that $G$ contains and $G_i = \{E_i, R_i, A_i, V_i, D_i\}$ to represent the language-specific knowledge graph in language $L_i \in L$. $E_i$, $R_i$, $A_i$, $V_i$, and $D_i$ are the sets of entities, relations, attributes, attribute values, and literal descriptions, respectively, each of which portrays one aspect of an entity. The graph $G_i$ consists of relation triples $\langle h_i, r_i, t_i \rangle$ and attribute triples $\langle h_i, a_i, v_i \rangle$ such that $h_i, t_i \in E_i$, $r_i \in R_i$, $a_i \in A_i$, and $v_i \in V_i$. Each entity is accompanied by a literal description consisting of a sequence of words in language $L_i$.

Given two knowledge graphs $G_1$ and $G_2$ expressed in source language $L_1$ and target language $L_2$, respectively, there exists a set of pre-aligned ILLs $I(G_1, G_2) = \{(e, u) \mid e \in E_1, u \in E_2\}$, which can be considered training data. The task of cross-lingual entity alignment is to align entities in $G_1$ with their cross-lingual counterparts in $G_2$, i.e., to discover the missing ILLs.

Proposed Approach
In this section, we first introduce two GCN-based models, namely MAN and HMAN, that learn entity embeddings from the graph structures. Second, we discuss two uses of a multilingual pretrained BERT model to learn cross-lingual embeddings of entity descriptions: POINTWISEBERT and PAIRWISEBERT. Finally, we investigate two strategies to integrate the GCN-based and the BERT-based modules.

Cross-Lingual Graph Embeddings
Graph convolutional networks (GCNs) (Kipf and Welling, 2017) are variants of convolutional networks that have proven effective in capturing information from graph structures, such as dependency graphs (Guo et al., 2019b), abstract meaning representation graphs (Guo et al., 2019a), and knowledge graphs (Wang et al., 2018). In practice, multi-layer GCNs are stacked to collect evidence from multi-hop neighbors. Formally, the $l$-th GCN layer takes as input feature representations $H^{(l-1)}$ and outputs $H^{(l)}$:

$$H^{(l)} = \phi\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l-1)} W^{(l)}\right) \tag{1}$$

where $\tilde{A} = A + I$ is the adjacency matrix with added self-connections, $I$ is the identity matrix, $\tilde{D}$ is the diagonal node degree matrix of $\tilde{A}$, $\phi(\cdot)$ is the ReLU function, and $W^{(l)}$ represents the learnable parameters of the $l$-th layer. $H^{(0)}$ is the initial input. GCNs can iteratively update the representation of each entity node via a propagation mechanism through the graph. Inspired by previous studies (Wang et al., 2018), we also adopt GCNs in this work to collect evidence from multilingual KG structures and to learn cross-lingual embeddings of entities. The primary assumptions are: (1) equivalent entities tend to be neighbored by equivalent entities via the same types of relations; (2) equivalent entities tend to share similar or even the same attributes.
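To make the propagation rule concrete, the following is a minimal NumPy sketch of a symmetric-normalized GCN layer, not the authors' implementation. The function names (`normalize_adjacency`, `gcn_layer`), the dense toy adjacency matrix, and the random weights are assumptions introduced for illustration only; the graph is treated as undirected.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a dense, symmetric adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])      # add self-connections
    deg = A_tilde.sum(axis=1)             # node degrees of A~ (always >= 1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W):
    """One propagation step: ReLU(A_hat @ H @ W)."""
    return np.maximum(A_hat @ H @ W, 0.0)

# Toy graph (hypothetical): three entities connected as a path 0 - 1 - 2.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
A_hat = normalize_adjacency(A)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))              # random stand-ins for learned weights
W2 = rng.normal(size=(4, 2))

H0 = np.eye(3)                            # featureless input: one index vector per entity
H1 = gcn_layer(A_hat, H0, W1)
H2 = gcn_layer(A_hat, H1, W2)             # two stacked layers see 2-hop neighbors
```

With two stacked layers, the representation of entity 0 depends on entity 2 even though the two are not directly connected, which is the multi-hop evidence collection described above.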
Multi-Aspect Entity Features. Existing KGs (Suchanek et al., 2008; Rebele et al., 2016) provide multi-aspect information about entities. In this section, we focus on three aspects: topological connections, relations, and attributes. The key challenge is how to utilize the provided features to learn better entity embeddings. We discuss how we construct raw features for the three aspects, which are then fed as inputs to our model. We use $X_t$, $X_r$, and $X_a$ to denote the topological connection, relation, and attribute features, respectively.
The topological features contain rich neighborhood proximity information of entities, which can be captured by multi-layer GCNs. As in Wang et al. (2018), we set the initial topological features to $X_t = I$, i.e., an identity matrix serving as index vectors (also known as the featureless setting), so that the GCN can learn the representations of the corresponding entities.
In addition, we also consider the relation and attribute features. As shown in Figure 1, the connected relations and attributes of two equivalent entities, e.g., "University of Toronto" (English) and "トロント大学" (Japanese), overlap substantially, which can benefit cross-lingual entity alignment. Specifically, they share the same relation types, e.g., "country" and "almaMater", and some attributes, e.g., "foundDate" and "創立年". To capture relation information, Schlichtkrull et al. (2018) proposed RGCN with relation-wise parameters. However, with respect to this task, existing KGs typically contain thousands of relation types but few pre-aligned ILLs. Directly adopting RGCN may introduce too many parameters for the limited training data and thus cause overfitting. Wang et al. (2018) instead simply used the unlabeled GCNs (Kipf and Welling, 2017) with two proposed measures (i.e., functionality and inverse functionality) to encode the information of relations into the adjacency matrix. They also considered attributes as input features in their architecture. However, this approach may lose information about relation types. Therefore, we regard the relations and attributes of entities as bag-of-words features to explicitly model these two aspects. Specifically, we construct count-based N-hot vectors $X_r$ and $X_a$ for these two aspects of features, respectively, where the $(i, j)$ entry is the count of the $j$-th relation (attribute) for the corresponding entity $e_i$. Note that we only consider the top-$F$ most frequent relations and attributes to avoid data sparsity issues. Thus, for each entity, both its relation and attribute features are $F$-dimensional vectors.

MAN. Inspired by Wang et al. (2018), we propose the Multi-Aspect Alignment Network (MAN) to capture the three aspects of entity features. Specifically, three $l$-layer GCNs take as inputs the triple-aspect features (i.e., $X_t$, $X_r$, and $X_a$) and produce the representations $H_t^{(l)}$, $H_r^{(l)}$, and $H_a^{(l)}$, respectively, via the GCN propagation described above. Finally, the multi-aspect entity embedding is:

$$H_m = H_t^{(l)} \oplus H_r^{(l)} \oplus H_a^{(l)} \tag{2}$$

where $\oplus$ denotes vector concatenation. $H_m$ can then feed into alignment decisions. Such fusion through concatenation is also known as Scoring Level Fusion, which has proven simple but effective for capturing multimodal semantics (Bruni et al., 2014; Kiela and Bottou, 2014; Collell et al., 2017). It is worth noting that MAN differs from the work of Wang et al. (2018) in two main ways. First, we use the same approach as Kipf and Welling (2017) to construct the adjacency matrix, while Wang et al. (2018) designed a new connectivity matrix as the adjacency matrix for the GCNs. Second, MAN explicitly regards the relation type features as model input, while Wang et al. (2018) incorporated such relation information into the connectivity matrix.

HMAN. Note that MAN propagates relation and attribute information through the graph structure. However, for aligning a pair of entities, we observe that considering the relations and attributes of neighboring entities, in addition to those of the entities themselves, may introduce noise. Focusing only on the relation and attribute features of the current entity could be a better choice. Thus, we propose the Hybrid Multi-Aspect Alignment Network (HMAN) to better model such diverse features, shown in Figure 2. Similar to MAN, we still leverage the $l$-th layer of a GCN to obtain topological embeddings $H_t^{(l)}$, but exploit feedforward neural networks to obtain the embeddings with respect to relations and attributes. The feedforward neural networks consist of one fully-connected (FC) layer and a highway network layer (Srivastava et al., 2015). The reason we use highway networks is consistent with the conclusions of Mudgal et al. (2018), who conducted a design space exploration of neural models for entity matching and found that highway networks are generally better than FC layers in convergence speed and effectiveness.
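Formally, these feedforward neural networks are defined as:

$$
\begin{aligned}
S_f &= \phi\left(W_f^{(1)} X_f + b_f^{(1)}\right) \\
T_f &= \sigma\left(W_f^{t} S_f + b_f^{t}\right) \\
G_f &= \phi\left(W_f^{(2)} S_f + b_f^{(2)}\right) \cdot T_f + S_f \cdot (1 - T_f)
\end{aligned}
\tag{3}
$$

where $f \in \{r, a\}$ and $X_f$ refer to one specific aspect (i.e., relation or attribute) and the corresponding raw features, respectively, $W_f^{(1,2,t)}$ and $b_f^{(1,2,t)}$ are model parameters, $\phi(\cdot)$ is the ReLU function, and $\sigma(\cdot)$ is the sigmoid function. Accordingly, we obtain the hybrid multi-aspect entity embedding $H_y = H_t^{(l)} \oplus G_r \oplus G_a$, to which $\ell_2$ normalization is further applied.

Model Objective. Given two knowledge graphs, $G_1$ and $G_2$, and a set of pre-aligned entity pairs $I(G_1, G_2)$ as training data, our model is trained in a supervised fashion. During the training phase, the goal is to embed cross-lingual entities into the same low-dimensional vector space where equivalent entities are close to each other. Following Wang et al. (2018), our margin-based ranking loss function is defined as:

$$J = \sum_{(e_1, e_2) \in I} \; \sum_{(e_1', e_2') \in I'} \left[\rho(h_{e_1}, h_{e_2}) + \beta - \rho(h_{e_1'}, h_{e_2'})\right]_+ \tag{4}$$

where $[x]_+ = \max\{0, x\}$ and $I'$ denotes the set of negative entity alignment pairs constructed by corrupting the gold pair $(e_1, e_2) \in I$. Specifically, we replace $e_1$ or $e_2$ with a randomly chosen entity in $E_1$ or $E_2$. $\rho(x, y)$ is the $\ell_1$ distance function, and $\beta > 0$ is the margin hyperparameter separating positive and negative pairs.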
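The two components described above, an FC layer followed by a highway layer and a margin-based ranking loss over $\ell_1$ distances, can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the released code: the function names, the row-vector convention (`X @ W` instead of `W X`), the random toy parameters, and the simplification of pairing exactly one negative with each positive pair are all assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_encoder(X, W1, b1, Wt, bt, W2, b2):
    """FC layer followed by one highway layer, applied to each row of X."""
    S = relu(X @ W1 + b1)                  # fully connected projection
    T = sigmoid(S @ Wt + bt)               # transform gate, values in (0, 1)
    return relu(S @ W2 + b2) * T + S * (1.0 - T)

def l1_distance(x, y):
    return np.abs(x - y).sum(axis=-1)

def margin_ranking_loss(pos_src, pos_tgt, neg_src, neg_tgt, beta=3.0):
    """Sum of [rho(positive) + beta - rho(negative)]_+ with one negative per positive."""
    pos = l1_distance(pos_src, pos_tgt)
    neg = l1_distance(neg_src, neg_tgt)
    return float(np.maximum(pos + beta - neg, 0.0).sum())

# Toy relation-count features for 5 entities with F = 8 (hypothetical values).
rng = np.random.default_rng(0)
X_r = rng.integers(0, 3, size=(5, 8)).astype(float)
d = 4
shapes = [(8, d), (d,), (d, d), (d,), (d, d), (d,)]   # W1, b1, Wt, bt, W2, b2
params = [rng.normal(scale=0.1, size=s) for s in shapes]
G_r = highway_encoder(X_r, *params)

# Loss behavior on two extreme cases.
a = np.zeros((2, 3))
far = np.full((2, 3), 10.0)
loss_separated = margin_ranking_loss(a, a, a, far)    # negatives far beyond the margin
loss_collapsed = margin_ranking_loss(a, a, a, a)      # negatives as close as positives
```

When negatives are already farther away than the margin, the loss is zero and contributes no gradient; when negatives are as close as positives, each pair contributes exactly the margin $\beta$.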

Cross-Lingual Textual Embeddings
Existing multilingual KGs (Navigli and Ponzetto, 2012; Rebele et al., 2016) also provide literal descriptions of entities, which are expressed in different languages and contain detailed semantic information about the entities. The key observation is that literal descriptions of equivalent entities are semantically close to each other. However, it is non-trivial to directly measure the semantic relatedness of two entities' descriptions, since they are expressed in different languages.
Recently, Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) has advanced the state of the art in various NLP tasks by heavily exploiting pretraining based on language modeling. Of special interest is the multilingual variant, which was trained with Wikipedia dumps of 104 languages. The spirit of BERT in the multilingual scenario is to project words or sentences from different languages into the same semantic space. This aligns well with our objective: bridging gaps between descriptions written in different languages. Therefore, we propose two methods for applying multilingual BERT, POINTWISEBERT and PAIRWISEBERT, to help make alignment decisions.

POINTWISEBERT. A simple choice is to follow the basic design of BERT and formulate the entity alignment task as a text matching task. For two entities $e_1$ and $e_2$ from two KGs in $L_1$ and $L_2$, denoting the source and target languages, respectively, their textual descriptions are $d_1$ and $d_2$, consisting of word sequences in the two languages. The model takes as input [CLS] $d_1$ [SEP] $d_2$ [SEP], where [CLS] is the special classification token, from which the final hidden state is used as the sequence representation, and [SEP] is the special token for separating token sequences, and produces the probability of classifying the pair as equivalent entities. The probability is then used as the ranking score to rank all candidate entity pairs. We denote this model as POINTWISEBERT, shown in Figure 3 (left). This approach is computationally expensive, since for each entity we need to consider all candidate entities in the target language. One solution is to reduce the search space for each entity with a reranking strategy (see Section 3.3).

Figure 3: Architecture overview of POINTWISEBERT (left) and PAIRWISEBERT (right).
PAIRWISEBERT. Due to the heavy computational cost of POINTWISEBERT, semantic matching between all entity pairs is very expensive. Instead of producing ranking scores for description pairs, we propose PAIRWISEBERT to encode the entity literal descriptions as cross-lingual textual embeddings, where distances between entity pairs can be directly measured using these embeddings.
The PAIRWISEBERT model consists of two components, each of which takes as input the description of one entity (from the source or target language), as depicted in Figure 3 (right). Specifically, the input is designed as [CLS] $d_1$ [SEP] for the source entity (or [CLS] $d_2$ [SEP] for the target entity), which is then fed into PAIRWISEBERT for contextual encoding. We select the hidden state of [CLS] as the textual embedding of the entity description for training and inference. To bring the textual embeddings of cross-lingual entity descriptions into the same vector space, we use a ranking loss similar to Equation 4.
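The difference between the two input designs can be illustrated with a small pure-Python sketch. Whitespace splitting stands in for BERT's WordPiece tokenizer, and the helper names and segment-id layout (0 for the first sequence, 1 for the second, as in standard BERT sentence-pair inputs) are assumptions added for illustration; this is not the actual preprocessing code.

```python
def pointwise_input(d1, d2):
    """Build one joint sequence for a description pair: [CLS] d1 [SEP] d2 [SEP]."""
    first = ["[CLS]"] + d1.split() + ["[SEP]"]
    second = d2.split() + ["[SEP]"]
    tokens = first + second
    # Segment ids mark which description each token belongs to.
    segments = [0] * len(first) + [1] * len(second)
    return tokens, segments

def pairwise_input(d):
    """Build a single-description sequence: [CLS] d [SEP]."""
    return ["[CLS]"] + d.split() + ["[SEP]"]

# Toy descriptions (hypothetical, already whitespace-separated).
tokens, segments = pointwise_input("public research university", "州立 大学")
src_tokens = pairwise_input("public research university")
tgt_tokens = pairwise_input("州立 大学")
```

The pointwise design needs one forward pass per candidate pair, whereas the pairwise design encodes each description once and compares the resulting [CLS] embeddings by distance, which is why the latter scales to all entity pairs.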

Integration Strategy
Sections 3.1 and 3.2 introduce two modules that separately collect evidence from knowledge graph structures and the literal descriptions of entities, namely graph and textual embeddings. In this section, we investigate two strategies to integrate these two modules to further boost performance.
Reranking. As mentioned in Section 3.2, the POINTWISEBERT model takes as input the concatenation of two descriptions for each candidate-entity pair, where conceptually we must process every possible pair in the training set. Such a setting would be computationally cost-prohibitive.
One way to reduce the cost of POINTWISEBERT would be to ignore candidate pairs that are unlikely to be aligned. Rao et al. (2016) showed that uncertainty-based sampling can provide extra improvements in ranking. Following this idea, the GCN-based models (i.e., MAN and HMAN) are used to generate a candidate pool whose size is much smaller than the entire universe of entities. Specifically, GCN-based models provide the top-$q$ candidate target entities for each source entity (where $q$ is a hyperparameter). Then, the POINTWISEBERT model produces a ranking score for each candidate-entity pair in the pool to further rerank the candidates. However, the weakness of such a reranking strategy is that performance is bounded by the quality of the (potentially limited) candidates produced by MAN or HMAN.

Weighted Concatenation. With the textual embeddings learned by PAIRWISEBERT denoted as $H_B$ and the graph embeddings denoted as $H_G$, a simple way to combine the two modules is by weighted concatenation:

$$H_C = \tau \cdot H_G \oplus (1 - \tau) \cdot H_B \tag{5}$$

where $H_G$ is the graph embedding learned by either MAN or HMAN, and $\tau$ is a hyperparameter that balances the contribution of the two sources.
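Both integration strategies can be sketched compactly in NumPy. The sketch below is an illustration under stated assumptions: `score_fn` is a hypothetical stand-in for the POINTWISEBERT matching probability, the function names are invented here, and the dense pairwise distance computation is only practical for toy sizes.

```python
import numpy as np

def top_q_candidates(H_src, H_tgt, q):
    """Indices of the q nearest target entities (L1 distance) for each source entity."""
    dist = np.abs(H_src[:, None, :] - H_tgt[None, :, :]).sum(axis=-1)
    return np.argsort(dist, axis=1)[:, :q]

def rerank(candidates, score_fn):
    """Reorder each candidate pool by descending score_fn(source_index, target_index)."""
    reranked = []
    for i, pool in enumerate(candidates):
        reranked.append(sorted(pool.tolist(), key=lambda j: -score_fn(i, j)))
    return reranked

def weighted_concat(H_G, H_B, tau=0.8):
    """Concatenate graph and textual embeddings, scaled by tau and (1 - tau)."""
    return np.concatenate([tau * H_G, (1.0 - tau) * H_B], axis=1)

# Toy graph embeddings (hypothetical): 2 source entities, 3 target entities.
H_src = np.array([[0.0, 0.0], [1.0, 1.0]])
H_tgt = np.array([[1.0, 1.0], [0.0, 0.0], [5.0, 5.0]])
pool = top_q_candidates(H_src, H_tgt, q=2)

# Hypothetical scorer that always prefers target entity 0.
reranked = rerank(pool, lambda i, j: 1.0 if j == 0 else 0.0)

H_C = weighted_concat(np.ones((2, 3)), np.ones((2, 4)), tau=0.8)
```

Because the $\ell_1$ distance is a sum over dimensions, the distance between two weighted concatenations equals $\tau$ times the graph-embedding distance plus $(1 - \tau)$ times the textual-embedding distance, so $\tau$ directly interpolates between the two evidence sources.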

Entity Alignment
After we obtain the embeddings of entities, we use the $\ell_1$ distance to measure the distance between candidate-entity pairs. A small distance reflects a high probability for an entity pair to be aligned as equivalent entities. Specifically, for the reranking strategy, we select the target entities that have the smallest distances to a source entity in the vector space learned by MAN or HMAN as its candidates. For weighted concatenation, we use the $\ell_1$ distance between the representations of a pair derived from the concatenated embedding, i.e., $H_C$, as the ranking score.

Datasets and Settings
We evaluate our methods on two benchmark datasets: DBP15K and DBP100K. Table 1 outlines the statistics of both datasets, which contain 15,000 and 100,000 ILLs, respectively. Both are divided into three subsets: Chinese-English (ZH-EN), Japanese-English (JA-EN), and French-English (FR-EN). Following previous work (Wang et al., 2018), we adopt the same split settings in our experiments, where 30% of the ILLs are used for training and the remaining 70% for evaluation. Hits@k is used as the evaluation metric (Bordes et al., 2013; Wang et al., 2018), which measures the proportion of correctly aligned entities ranked in the top-k candidates. Results in both directions, e.g., ZH-EN and EN-ZH, are reported.
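As a concrete reading of the metric, the following NumPy sketch computes Hits@k from two embedding matrices. It assumes that row i of the source matrix and row i of the target matrix form a gold pair, that candidates are ranked by L1 distance, and that ties are resolved in favor of the gold target; the function name and toy data are hypothetical.

```python
import numpy as np

def hits_at_k(H_src, H_tgt, k):
    """Fraction of source entities whose gold target (same row index)
    is among the k nearest target entities under L1 distance."""
    dist = np.abs(H_src[:, None, :] - H_tgt[None, :, :]).sum(axis=-1)
    gold = np.diag(dist)                            # distance to the gold counterpart
    rank = (dist < gold[:, None]).sum(axis=1)       # targets strictly closer than gold
    return float((rank < k).mean())

# Toy test embeddings (hypothetical): only entity 0 is embedded next to its counterpart.
H_src = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
H_tgt = np.array([[0.0, 0.0], [4.0, 4.0], [1.0, 1.0]])

h1 = hits_at_k(H_src, H_tgt, 1)
h2 = hits_at_k(H_src, H_tgt, 2)
h3 = hits_at_k(H_src, H_tgt, 3)
h1_reverse = hits_at_k(H_tgt, H_src, 1)             # opposite alignment direction
```

Swapping the two arguments gives the score for the reverse direction, which generally differs because the per-entity rankings are computed against a different candidate set.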
In all our experiments, we employ two-layer GCNs, and the top 1,000 (i.e., F = 1000) most frequent relation types and attributes are included to build the N-hot feature vectors. For the MAN model, we set the dimensionality of topological, relation, and attribute embeddings to 200, 100, and 100, respectively. When training HMAN, the hyperparameters depend on the dataset size due to GPU memory limitations. For DBP15K, we set the dimensionality of topological, relation, and attribute embeddings to 200, 100, and 100, respectively. For DBP100K, the dimensionalities are set to 100, 50, and 50, respectively. We adopt SGD to update parameters, and the numbers of epochs are set to 2,000 and 50,000 for MAN and HMAN, respectively. The margin β in the loss function is set to 3. The balance factor τ is determined by grid search, which shows that the best performance lies in the range from 0.7 to 0.8. For simplicity, τ is set to 0.8 in all associated experiments. Multilingual BERT-base models with 768 hidden units are used in POINTWISEBERT and PAIRWISEBERT. We additionally append one more FC layer to the representation of [CLS] to reduce the dimensionality to 300. Both BERT models are fine-tuned using the Adam optimizer.
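The count-based N-hot features restricted to the top-F most frequent relations (or attributes) can be built as in the sketch below. The input format (a dictionary mapping each entity to the list of relation names attached to it), the function name, and the toy data are assumptions for illustration; how the vocabulary is shared across the two KGs is not specified here and is left out.

```python
from collections import Counter

def build_count_features(entity_items, top_f):
    """entity_items maps an entity to the list of relation (or attribute) names
    attached to it. Returns (vocab, features), where features[entity] is a
    count vector over the top_f most frequent names."""
    freq = Counter(name for names in entity_items.values() for name in names)
    vocab = [name for name, _ in freq.most_common(top_f)]
    index = {name: j for j, name in enumerate(vocab)}
    features = {}
    for entity, names in entity_items.items():
        vec = [0] * len(vocab)
        for name in names:
            j = index.get(name)
            if j is not None:          # names outside the top-F vocabulary are dropped
                vec[j] += 1
        features[entity] = vec
    return vocab, features

# Toy data (hypothetical): relation names attached to three entities.
entity_relations = {
    "University_of_Toronto": ["almaMater", "almaMater", "country"],
    "Toronto": ["country", "almaMater"],
    "Canada": ["capital"],
}
vocab, X_r = build_count_features(entity_relations, top_f=2)
```

In this toy example the rare relation "capital" falls outside the top-2 vocabulary, so "Canada" receives an all-zero vector; this is the trade-off of truncating to the most frequent features to limit sparsity.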

Results on Graph Embeddings
We first compare MAN and HMAN against previous systems (Hao et al., 2016; Chen et al., 2017a; Wang et al., 2018). As shown in Table 2, MAN and HMAN consistently outperform all baselines in all scenarios, with HMAN performing best. It is worth noting that, in this case, MAN and HMAN use as much information as Wang et al. (2018), while some prior systems require extra supervised information (relations and attributes of the two KGs need to be aligned in advance). The performance improvements confirm that our model can better utilize the topological, relational, and attribute information of entities provided by KGs.
To explain why HMAN achieves better results than MAN, recall that MAN collects relation and attribute information by the propagation mechanism in GCNs, where such knowledge is exchanged through neighbors, while HMAN uses feedforward networks to capture expressive features directly from the input feature vectors without propagation. As we discussed before, it is not always the case that neighbors of equivalent entities share similar relations or attributes. Propagating such features through linked entities in GCNs may introduce noise and thus harm performance.

Table 2: Results of using graph information on DBP15K and DBP100K. @1, @10 and @50 refer to Hits@1, Hits@10 and Hits@50, respectively.
Moreover, we perform ablation studies on the two proposed models to investigate the effectiveness of each component. We remove each aspect of features in turn (i.e., topological, relation, and attribute features), as well as the highway layer in HMAN, denoted as w/o TE (RE, AE, and HW). As reported in Table 2, after removing relation or attribute features, the performance of HMAN and MAN drops across all datasets. These results indicate that these two aspects of features are useful in making alignment decisions. Moreover, compared to MAN, HMAN shows larger performance drops, which suggests that the feedforward networks capture relation and attribute features better than GCNs in this scenario. Interestingly, comparing the two variants MAN w/o TE and HMAN w/o TE, we can see that the former achieves better results. Since MAN propagates relation and attribute features via graph structures, it can still implicitly capture topological knowledge of entities even after we remove the topological features.
However, HMAN loses such structure knowledge when topological features are excluded, and thus its results are worse. From these experiments, we can conclude that the topological information is playing an indispensable role in making alignment decisions.

Results with Textual Embeddings
In this section, we discuss empirical results involving the addition of entity descriptions, shown in Table 3. Applying literal descriptions of entities to cross-lingual entity alignment is relatively under-explored. The recent work of Chen et al. (2018) used entity descriptions in their model; however, we are unable to make comparisons with their work, as we do not have access to their code and data. Since we employ BERT to learn textual embeddings of descriptions, we consider systems that also use external resources, like Google Translate, as our baselines. We directly take previously reported results for these baselines, denoted as "Translation" and "JAPE+Translation".

Table 3: Results of using both graph and textual information on DBP15K and DBP100K. @1, @10, and @50 refer to Hits@1, Hits@10, and Hits@50, respectively. * indicates results taken from prior work.
The POINTWISEBERT model is used with GCN-based models, which largely reduces the search space, as indicated by MAN (RERANK) and HMAN (RERANK), where the difference is that the candidate pools are given by MAN and HMAN, respectively. For DBP15K, we select top-200 candidate target entities as the candidate pool while for DBP100K, top-20 candidates are selected due to its larger size. The reranking method does lead to performance gains across all datasets, where the improvements are dependent on the quality of the candidate pools. HMAN (RERANK) generally performs better than MAN (RERANK) since HMAN recommends more promising candidate pools.
The PAIRWISEBERT model learns the textual embeddings that map cross-lingual descriptions into the same space, which can be directly used to align entities. The results are listed under PAIRWISEBERT in Table 3. We can see that it achieves good results on its own, which also shows the efficacy of using multilingual descriptions. Moreover, such textual embeddings can be combined with graph embeddings (learned by MAN or HMAN) by weighted concatenation, as discussed in Section 3.3. The results are reported as MAN (WEIGHTED) and HMAN (WEIGHTED), respectively. As we can see, this simple operation leads to significant improvements and gives excellent results across all datasets. However, it is not always the case that KGs provide descriptions for every entity. For those entities whose descriptions are not available, the graph embeddings would be the only source for making alignment decisions.

Case Study
In this section, we describe a case study to understand the performance gap between HMAN and MAN. The example in Table 4 provides insights that may explain this gap. We argue that MAN introduces unexpected noise from heterogeneous nodes during the GCN propagation process. The number in parentheses after each entity name denotes the number of relation features it has. In this particular example, the two entities "Casino Royale (2006 film)" in the source language (English) and "007大戰皇家賭場" in the target language (Chinese) both have three relation features. We notice that the propagation mechanism introduces some neighbors that have no cross-lingual counterparts on the other side, marked in red. Consider the entity "英語" (English), a neighbor of "007大戰皇家賭場": no counterpart can be found among the neighbors of "Casino Royale (2006 film)". We also observe that "英語" (English) is a pivot node in the Chinese KG and has 832 relations, such as "語言" (Language), "官方語言" (Official Language), and "頻道語言" (Channel Language). In this case, propagating features from neighbors can harm performance. In fact, the feature sets of the ILL pair already convey information that captures their similarity (e.g., the "starring" relation, marked in blue, is shared twice). Therefore, by directly using feedforward networks, HMAN is able to effectively capture such knowledge.

Related Work
KG Alignment. Research on KG alignment can be categorized into two groups: monolingual and multilingual entity alignment. For monolingual entity alignment, the main approaches align two entities by computing the string similarity of entity labels (Scharffe et al., 2009; Volz et al., 2009; Ngomo and Auer, 2011) or graph similarity (Raimond et al., 2008; Pershina et al., 2015; Azmy et al., 2019). Recently, Trisedya et al. (2019) proposed an embedding-based model that incorporates attribute values to learn the entity embeddings.
To match entities in different languages, Wang et al. (2012) leveraged only language-independent information to find possible links across multilingual Wiki knowledge graphs. Recent studies learned cross-lingual embeddings of entities based on TransE (Bordes et al., 2013), which are then used to align entities across languages. Chen et al. (2018) designed a co-training algorithm to alternately learn multilingual entity and description embeddings. Wang et al. (2018) applied GCNs with the connectivity matrix defined on relations to embed entities from multilingual KGs into a unified low-dimensional space.
In this work, we also employ GCNs. However, in contrast to Wang et al. (2018), we regard relation features as input to our models. In addition, we investigate two different ways to capture relation and attribute features.

Multilingual Sentence Representations. Another line of research related to this work is aligning sentences in multiple languages. Recent works (Hermann and Blunsom, 2014; Conneau et al., 2018; Eriguchi et al., 2018) studied cross-lingual sentence classification via zero-shot learning. Johnson et al. (2017) proposed a sequence-to-sequence multilingual machine translation system where the encoder can be used to produce cross-lingual sentence embeddings (Artetxe and Schwenk, 2018). Recently, BERT (Devlin et al., 2019) has advanced the state of the art on multiple natural language understanding tasks. Specifically, multilingual BERT enables learning representations of sentences under multilingual settings. We adopt BERT to produce cross-lingual representations of entity literal descriptions to capture their semantic relatedness, which benefits cross-lingual entity alignment.