Cross-lingual Knowledge Graph Alignment via Graph Matching Neural Network

Previous cross-lingual knowledge graph (KG) alignment studies rely on entity embeddings derived only from monolingual KG structural information, which may fail to match entities that have different facts in the two KGs. In this paper, we introduce the topic entity graph, a local sub-graph of an entity, to represent entities together with their contextual information in the KG. From this view, the KG alignment task can be formulated as a graph matching problem. We further propose a graph-attention-based solution, which first matches all entities in the two topic entity graphs and then jointly models the local matching information to derive a graph-level matching vector. Experiments show that our model outperforms previous state-of-the-art methods by a large margin.


Introduction
Multilingual knowledge graphs (KGs), such as DBpedia (Auer et al., 2007) and Yago (Suchanek et al., 2007), represent human knowledge in a structured format and have been successfully used in many natural language processing applications. These KGs encode rich monolingual knowledge but lack the cross-lingual links needed to bridge the language gap. The cross-lingual KG alignment task, which automatically matches entities in a multilingual KG, has been proposed to address this problem.
Recently, several entity-matching-based approaches (Hao et al., 2016; Chen et al., 2016; Sun et al., 2017) have been proposed for this task. Generally, these approaches first project the entities of each KG into low-dimensional vector spaces by encoding monolingual KG facts, and then learn a similarity score function to match entities based on their vector representations. However, since some entities may have different KG facts in different languages, the information encoded in entity embeddings may differ across languages, making it difficult for these approaches to match such entities. Figure 1 illustrates such an example, where we aim to align $e_0$ with $e'_0$ but only one pair of their surrounding neighbors is aligned. In addition, these methods do not encode the entity surface form into the entity embedding, which makes it difficult to match entities that have few neighbors in the KG and therefore lack sufficient structural information.
To address these drawbacks, we propose a topic entity graph to represent the KG context information of an entity. Unlike previous methods that use entity embeddings to match entities, we formulate this task as a graph matching problem between topic entity graphs. To achieve this, we propose a novel graph matching method to estimate the similarity of two graphs. Specifically, we first use a graph convolutional neural network (GCN) (Kipf and Welling, 2016; Hamilton et al., 2017) to encode the two graphs, say $G_1$ and $G_2$, resulting in a list of entity embeddings for each graph. Then, we compare each entity in $G_1$ (or $G_2$) against all entities in $G_2$ (or $G_1$) using an attentive-matching method, which generates cross-lingual KG-aware matching vectors for all entities in $G_1$ and $G_2$. Finally, we apply another GCN to propagate the local matching information throughout the entire graph. This produces a global matching vector for each topic graph, which is used for the final prediction. The motivation is that graph convolution can jointly encode all entity similarities, including those of both the topic entity and its neighbor entities, into a matching vector. Experimental results show that our model outperforms previous state-of-the-art models by a large margin. Our code and data are available at https://github.com/syxu828/Crosslingula-KG-Matching.

Topic Entity Graph
The local contextual information of an entity in the KG is important to the KG alignment task. In our model, we propose a structure, the topic entity graph, to represent the relations among a given entity (called the topic entity) and its neighbors in the knowledge base. Figure 2 shows the topic graphs of Lebron James in the English and Chinese knowledge graphs. To build the topic graph, we first collect the 1-hop neighbor entities of the topic entity, resulting in a set of entities $\{e_1, ..., e_n\}$, which are the nodes of the graph. Then, for each entity pair $(e_i, e_j)$, we add one directed edge between the corresponding nodes in the topic graph if $e_i$ and $e_j$ are directly connected through a relation, say $r$, in the KG. Note that we do not label this edge with the relation $r$ that holds between $e_i$ and $e_j$ in the KG; we only retain $r$'s direction. In practice, we find that this strategy significantly improves both efficiency and performance, as discussed in §4.

Graph Matching Model
Figure 2 gives an overview of our method for aligning Lebron James in the English and Chinese knowledge graphs. Specifically, we first retrieve the topic entity graphs of Lebron James from the two KGs, namely $G_1$ and $G_2$. Then, we apply a graph matching model to estimate the probability that $G_1$ and $G_2$ describe the same entity. The matching model consists of the following four layers.
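Before turning to the individual layers, the topic entity graph construction described in the previous section can be made concrete with a minimal sketch. It assumes the KG is available as a list of (head, relation, tail) triples and that the topic entity itself is one of the nodes; the `Triple` type, the function name and the toy entity names are hypothetical and do not come from the released code.

```python
from collections import namedtuple

# Hypothetical container for one KG fact (not from the paper's code).
Triple = namedtuple("Triple", ["head", "relation", "tail"])


def build_topic_entity_graph(topic, triples):
    """Return (nodes, edges) of the topic entity graph of `topic`.

    Nodes: the topic entity plus its 1-hop neighbours (assumption: the
    topic entity is included as a node).
    Edges: one directed, unlabeled edge (head, tail) for every pair of
    retained nodes that is directly connected by some relation.
    """
    nodes = {topic}
    for t in triples:
        if t.head == topic:
            nodes.add(t.tail)
        elif t.tail == topic:
            nodes.add(t.head)

    # Keep only the direction of each relation; the label is dropped.
    edges = {
        (t.head, t.tail)
        for t in triples
        if t.head in nodes and t.tail in nodes and t.head != t.tail
    }
    return sorted(nodes), sorted(edges)


# Toy KG with abstract entity and relation names.
toy_triples = [
    Triple("e0", "r1", "e1"),
    Triple("e2", "r2", "e0"),
    Triple("e1", "r3", "e2"),
    Triple("e2", "r4", "e3"),  # e3 is two hops from e0, so it is left out
]
nodes, edges = build_topic_entity_graph("e0", toy_triples)
```

Because the edge set is a set of (head, tail) pairs, two entities linked by several relations in the same direction contribute a single edge, which matches the "one directed edge" wording above.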
Input Representation Layer. The goal of this layer is to learn embeddings for the entities that occur in the topic entity graphs by using a GCN (henceforth GCN$_1$) (Xu et al., 2018a). Recently, GCNs have been successfully applied to many NLP tasks, such as semantic parsing (Xu et al., 2018b), text representation, relation extraction and text generation (Xu et al., 2018c). We use the embedding generation of an entity $v$ as an example to explain the GCN algorithm:
(1) We first employ a word-based LSTM to transform $v$'s entity name into its initial feature vector $a_v$.
(2) We categorize the neighbors of $v$ into incoming neighbors $\mathcal{N}_{\vdash}(v)$ and outgoing neighbors $\mathcal{N}_{\dashv}(v)$ according to the edge direction.
(3) We leverage an aggregator to aggregate the incoming representations of $v$'s incoming neighbors, $\{h^{k-1}_{u\vdash}, \forall u \in \mathcal{N}_{\vdash}(v)\}$, into a single vector $h^{k}_{\mathcal{N}_{\vdash}(v)}$, where $k$ is the iteration index. This aggregator feeds each neighbor's vector to a fully-connected neural network and applies an element-wise mean-pooling operation to capture different aspects of the neighbor set.
(4) We concatenate $v$'s current incoming representation $h^{k-1}_{v\vdash}$ with the newly generated neighborhood vector $h^{k}_{\mathcal{N}_{\vdash}(v)}$ and feed the concatenated vector into a fully-connected layer to update the incoming representation of $v$, $h^{k}_{v\vdash}$, for the next iteration.
(5) We update the outgoing representation of $v$, $h^{k}_{v\dashv}$, using a procedure similar to steps (3) and (4), except that it operates on the outgoing representations.
(6) We repeat steps (3)-(5) $K$ times and treat the concatenation of the final incoming and outgoing representations as the final representation of $v$.
The outputs of this layer are two sets of entity embeddings, $\{e^1_1, ..., e^1_{|G_1|}\}$ and $\{e^2_1, ..., e^2_{|G_2|}\}$.
Node-Level (Local) Matching Layer. In this layer, we compare each entity embedding of one topic entity graph against all entity embeddings of the other graph in both directions (from $G_1$ to $G_2$ and from $G_2$ to $G_1$), as shown in Figure 2. We use an attentive-matching method similar to that of Wang et al. (2017). Specifically, we first calculate the cosine similarities of entity $e^1_i$ in $G_1$ with all entities $\{e^2_j\}$ in $G_2$ in their representation space.
$$\alpha_{i,j} = \mathrm{cosine}(e^1_i, e^2_j), \quad j \in \{1, ..., |G_2|\}$$
Then, we take these similarities as the weights to calculate an attentive vector for the entire graph $G_2$ by a weighted sum over all the entity embeddings of $G_2$:
$$\bar{e}^1_i = \frac{\sum_{j=1}^{|G_2|} \alpha_{i,j} \cdot e^2_j}{\sum_{j=1}^{|G_2|} \alpha_{i,j}}$$
We calculate matching vectors for all entities in both $G_1$ and $G_2$ by using a multi-perspective cosine matching function $f_m$ at each matching step (see Appendix A for more details):
$$m^{att}_i = f_m(e^1_i, \bar{e}^1_i), \qquad m^{att}_j = f_m(e^2_j, \bar{e}^2_j)$$
where $\bar{e}^2_j$ is computed analogously in the other direction.
Graph-Level (Global) Matching Layer. Intuitively, the above matching vectors ($m^{att}$s) capture how well each entity in $G_1$ ($G_2$) can be matched by the topic graph in the other language. However, they are local matching states and are not sufficient to measure the global graph similarity. For example, many entities have only a few neighbor entities that co-occur in $G_1$ and $G_2$. For those entities, a model that exploits only local matching information is likely to incorrectly predict that the two graphs describe different topic entities, since most entities in $G_1$ and $G_2$ are not close in the embedding space.
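The attentive matching of the Node-Level (Local) Matching Layer can be illustrated with a short NumPy sketch. It is a sketch under stated assumptions, not the released implementation: the multi-perspective function follows the general form used by Wang et al. (2017), cosine of the two vectors after element-wise re-weighting by one learned vector per perspective, because Appendix A is not reproduced here; the names `attentive_vectors`, `multi_perspective_match` and the perspective weights `W` are hypothetical.

```python
import numpy as np


def cosine_matrix(a, b, eps=1e-8):
    """Pairwise cosine similarities between rows of a (n, d) and b (m, d)."""
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a_n @ b_n.T


def attentive_vectors(e_src, e_tgt, eps=1e-8):
    """For each source entity, the similarity-weighted mean of all target entities."""
    alpha = cosine_matrix(e_src, e_tgt)                # (n, m) weights
    denom = alpha.sum(axis=1, keepdims=True)
    denom = np.where(np.abs(denom) < eps, eps, denom)  # guard against a zero sum
    return (alpha @ e_tgt) / denom                     # (n, d)


def multi_perspective_match(v1, v2, W, eps=1e-8):
    """Assumed form of f_m: one cosine per perspective k on W[k] * v1, W[k] * v2.

    v1, v2: (n, d) arrays; W: (l, d) perspective weights. Returns (n, l).
    """
    a = v1[:, None, :] * W[None, :, :]                 # (n, l, d)
    b = v2[:, None, :] * W[None, :, :]
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return num / den


rng = np.random.default_rng(0)
e1 = rng.random((4, 8))   # stand-in embeddings of the 4 entities in G1
e2 = rng.random((5, 8))   # stand-in embeddings of the 5 entities in G2
W = rng.random((3, 8))    # 3 perspectives (randomly initialised here)

m1 = multi_perspective_match(e1, attentive_vectors(e1, e2), W)  # G1 -> G2
m2 = multi_perspective_match(e2, attentive_vectors(e2, e1), W)  # G2 -> G1
```

Each row of `m1` (or `m2`) is the local matching vector of one entity, which is what the graph-level layer propagates next.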
To overcome this issue, we apply another GCN (henceforth GCN$_2$) to propagate the local matching information throughout the graph. Intuitively, if each node is represented by its own matching state, then by design a GCN over the graph (with a sufficient number of hops) is able to encode the global matching state between the pair of whole graphs. We then feed these matching representations to a fully-connected neural network and apply element-wise max and mean pooling to generate a fixed-length graph matching representation.
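The direction-aware aggregation shared by both GCNs, and the pooling step that closes the graph-level layer, can be sketched as follows. This is an illustrative NumPy version under several assumptions that the text does not fix: weights are shared across hops, the initial incoming and outgoing representations are both the input features, ReLU is applied after every fully-connected layer, and the parameter names (`agg_in`, `upd_in`, and so on) are hypothetical.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def mean_aggregate(h, adj, W_agg):
    """FC layer on every neighbour vector, then element-wise mean pooling.

    adj[v, u] = 1 if u is a neighbour of v in the direction being aggregated.
    """
    transformed = relu(h @ W_agg)
    degree = np.maximum(adj.sum(axis=1, keepdims=True), 1.0)  # avoid 0-division
    return (adj @ transformed) / degree


def gcn_encode(x, edges, params, K):
    """K hops of direction-aware aggregation over directed edges (u -> v)."""
    n = x.shape[0]
    adj_in = np.zeros((n, n))
    adj_out = np.zeros((n, n))
    for u, v in edges:
        adj_in[v, u] = 1.0    # u is an incoming neighbour of v
        adj_out[u, v] = 1.0   # v is an outgoing neighbour of u

    h_in, h_out = x.copy(), x.copy()
    for _ in range(K):
        n_in = mean_aggregate(h_in, adj_in, params["agg_in"])
        n_out = mean_aggregate(h_out, adj_out, params["agg_out"])
        # Concatenate the current state with the neighbourhood vector, then update.
        h_in = relu(np.concatenate([h_in, n_in], axis=1) @ params["upd_in"])
        h_out = relu(np.concatenate([h_out, n_out], axis=1) @ params["upd_out"])
    return np.concatenate([h_in, h_out], axis=1)


def pool_graph(h, W_fc):
    """FC layer followed by element-wise max and mean pooling over the nodes."""
    z = relu(h @ W_fc)
    return np.concatenate([z.max(axis=0), z.mean(axis=0)])


rng = np.random.default_rng(0)
d = 3
params = {
    "agg_in": rng.random((d, d)), "agg_out": rng.random((d, d)),
    "upd_in": rng.random((2 * d, d)), "upd_out": rng.random((2 * d, d)),
}
x = rng.random((4, d))                 # e.g. local matching vectors of 4 nodes
path_edges = [(0, 1), (1, 2), (2, 3)]  # toy path graph 0 -> 1 -> 2 -> 3
W_fc = rng.random((2 * d, 5))

h = gcn_encode(x, path_edges, params, K=3)
graph_vec = pool_graph(h, W_fc)        # fixed-length vector, here of size 10
```

On the toy path graph, information from node 0 reaches node 3 only when K is at least 3, which is the sense in which a sufficient number of hops is needed for the pooled vector to reflect the whole graph.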

Prediction Layer. We use a two-layer feedforward neural network to consume the fixed-length graph matching representation and apply the softmax function in the output layer.
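A minimal sketch of such a prediction layer is given below. The two output classes (match versus no match), the ReLU hidden activation, the layer sizes and the function names are assumptions made for illustration; the text only specifies two feedforward layers followed by a softmax.

```python
import numpy as np


def softmax(z):
    z = z - z.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def predict(graph_vec, W1, b1, W2, b2):
    """Two-layer feedforward network with a softmax output."""
    hidden = np.maximum(graph_vec @ W1 + b1, 0.0)
    return softmax(hidden @ W2 + b2)


rng = np.random.default_rng(0)
graph_vec = rng.random(10)                  # stand-in graph matching representation
W1, b1 = rng.normal(size=(10, 6)), np.zeros(6)
W2, b2 = rng.normal(size=(6, 2)), np.zeros(2)
probs = predict(graph_vec, W1, b1, W2, b2)  # [P(no match), P(match)] by assumption
```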
Training and Inference. To train the model, we randomly construct 20 negative examples for each positive example $\langle e^1_i, e^2_j \rangle$ using a heuristic method. That is, we first generate rough entity embeddings for $G_1$ and $G_2$ by summing the pretrained embeddings of the words within each entity's surface form; then, we select the 10 closest entities to $e^1_i$ (or $e^2_j$) in the rough embedding space to construct negative pairs with $e^2_j$ (or $e^1_i$). During testing, given an entity in $G_1$, we rank all entities in $G_2$ in descending order of the matching probabilities estimated by our model.
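The negative-example heuristic can be sketched as below. Several details are assumptions: closeness is measured by cosine similarity, the 10 nearest entities are taken deterministically (any additional random sampling is not specified in the text), and the helper names and toy word vectors are hypothetical.

```python
import numpy as np


def rough_embedding(name, word_vecs, dim):
    """Sum of the pretrained vectors of the words in an entity's surface form."""
    vecs = [word_vecs[w] for w in name.lower().split() if w in word_vecs]
    return np.sum(vecs, axis=0) if vecs else np.zeros(dim)


def nearest(query_idx, emb, k):
    """Indices of the k entities closest (by cosine) to emb[query_idx], excluding itself."""
    normed = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-8)
    sims = normed @ normed[query_idx]
    sims[query_idx] = -np.inf
    return np.argsort(-sims)[:k].tolist()


def negative_pairs(i, j, emb1, emb2, k=10):
    """Negatives for the gold pair (i, j): k per side, 2k in total."""
    negatives = [(i_neg, j) for i_neg in nearest(i, emb1, k)]   # close to e1_i, paired with e2_j
    negatives += [(i, j_neg) for j_neg in nearest(j, emb2, k)]  # close to e2_j, paired with e1_i
    return negatives


toy_vecs = {"lebron": np.array([1.0, 0.0]), "james": np.array([0.0, 2.0])}
name_vec = rough_embedding("LeBron James", toy_vecs, dim=2)

rng = np.random.default_rng(0)
emb1 = rng.normal(size=(30, 5))   # stand-in rough embeddings, first KG
emb2 = rng.normal(size=(30, 5))   # stand-in rough embeddings, second KG
negs = negative_pairs(0, 0, emb1, emb2)
```

Choosing negatives that are close to the gold entities in surface-form space yields harder training examples than uniformly sampled ones.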

Experiments
We evaluate our model on the DBP15K datasets, which were built by Sun et al. (2017). The datasets were generated by linking entities in the Chinese, Japanese and French versions of DBpedia to the English version. Each dataset contains 15,000 inter-language links connecting equivalent entities in two KGs of different languages. We use the same train/test split as previous work. We use the Adam optimizer (Kingma and Ba, 2014) to update parameters with a mini-batch size of 32. The learning rate is set to 0.001. The hop sizes $K$ of GCN$_1$ and GCN$_2$ are set to 2 and 3, respectively. The non-linearity function $\sigma$ is ReLU (Glorot et al., 2011), and the parameters of the aggregators are randomly initialized. Since the KGs are represented in different languages, we first retrieve monolingual fastText embeddings (Bojanowski et al., 2017) for each language and apply the method proposed in Conneau et al. (2017) to align these word embeddings into the same vector space, yielding cross-lingual word embeddings. We use these embeddings to initialize the word representations in the first layer of GCN$_1$.
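To give an intuition for what aligning two monolingual embedding spaces means, the sketch below solves an orthogonal Procrustes problem on a seed dictionary of translation pairs. This is only one building block used by methods of this family and is not a reproduction of the full procedure of Conneau et al. (2017); the availability of a seed dictionary, the function name and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def procrustes_map(X_src, Y_tgt):
    """Orthogonal W minimising ||X_src @ W - Y_tgt||_F.

    Row r of X_src and row r of Y_tgt are the vectors of a translation pair.
    """
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                    # synthetic source-language vectors
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))    # hidden orthogonal map
Y = X @ Q                                       # synthetic target-language vectors

W = procrustes_map(X, Y)
aligned = X @ W                                 # source vectors in the target space
```

Once both vocabularies live in one space, summing word vectors over an entity's surface form gives comparable entity vectors across languages.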
Results and Discussion. Following previous work, we use Hits@1 and Hits@10 to evaluate our model, where Hits@k measures the proportion of correctly aligned entities ranked in the top k. We implemented a baseline (referred to as BASELINE in Table 1) that selects the k closest $G_2$ entities to a given $G_1$ entity in the cross-lingual embedding space, where an entity embedding is the sum of the embeddings of the words within its surface form. We also report results for an ablation of our model (referred to as NodeMatching in Table 1) that uses GCN$_1$ to derive the two topic entity embeddings and then feeds them directly to the prediction layer without using the matching layers. Table 1 summarizes the results of our model and existing work.
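The Hits@k metric can be computed from a score matrix as in the sketch below, where the scores may come from the model's matching probabilities or from the cosine similarities used by the BASELINE. The function name and the toy scores are illustrative; ties are resolved in favour of the gold entity, which is an assumption.

```python
import numpy as np


def hits_at_k(scores, gold, k):
    """Fraction of G1 entities whose gold G2 entity is ranked in the top k.

    scores[i, j]: score of G1 entity i against G2 entity j.
    gold[i]: index of the correct G2 entity for G1 entity i.
    """
    scores = np.asarray(scores, dtype=float)
    gold = np.asarray(gold)
    gold_scores = scores[np.arange(len(gold)), gold]
    # 0-based rank: number of candidates scoring strictly higher than the gold one.
    ranks = (scores > gold_scores[:, None]).sum(axis=1)
    return float((ranks < k).mean())


toy_scores = [
    [0.9, 0.1, 0.0],   # gold entity 0 is ranked first
    [0.2, 0.3, 0.5],   # gold entity 1 is ranked second
    [0.1, 0.8, 0.7],   # gold entity 2 is ranked second
]
toy_gold = [0, 1, 2]
h1 = hits_at_k(toy_scores, toy_gold, k=1)   # 1 of 3 entities correct at rank 1
h2 = hits_at_k(toy_scores, toy_gold, k=2)   # all 3 correct within the top 2
```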
We can see that, even without considering any KG structural information, the BASELINE significantly outperforms previous works that mainly learn entity embeddings from the KG structure, indicating that the surface form is an important feature for the KG alignment task. NodeMatching, which additionally encodes the KG structural information into entity embeddings using GCN$_1$, achieves better performance than the BASELINE. In addition, the graph matching method significantly outperforms all baselines, which suggests that the global context information of topic entities is important for establishing their similarity.
We first examine the impact of the hop size of GCN$_2$ on our model. From Table 1, we can see that our model benefits from increasing the hop size of GCN$_2$ until it reaches a threshold $\lambda$. In our experiments, the model achieves the best performance when $\lambda = 3$. To better understand which types of entities our model handles better thanks to the graph matching layer, we analyze the entities that our model predicts correctly but NodeMatching does not. We find that the graph matching layer enhances the ability of our model to handle entities most of whose neighbors differ between the two KGs. For such entities, although most of the local matching information indicates that the two entities are irrelevant, the graph matching layer can alleviate this by propagating the most relevant local matching information throughout the graph.
Recall that our proposed topic entity graph only retains the relation direction and neglects the relation label. In our experiments, we find that incorporating relation labels into the topic graph, as distinct nodes that connect entity nodes, hurts both the performance and the efficiency of our model. We attribute this to two factors: (1) relation labels are represented as abstract symbols in the datasets, which provide quite limited knowledge about the relations and make it difficult for the model to learn their alignments across the two KGs; (2) incorporating relation labels may significantly increase the size of the topic entity graph, which requires a larger hop size and a longer running time.

Conclusions
Previous cross-lingual knowledge graph alignment methods mainly rely on entity embeddings derived from monolingual KG structural information, and may therefore fail to match entities that have different facts in the two KGs. To address this, we introduce the topic entity graph to represent the contextual information of an entity within the KG and view the task as a graph matching problem. For this purpose, we further propose a graph matching model that induces a graph matching vector by jointly encoding the entity-wise matching information. Experimental results on the benchmark datasets show that our model significantly outperforms existing baselines. In the future, we will explore more applications of the proposed idea of attentive graph matching. For example, metric-learning-based few-shot knowledge base completion (Xiong et al., 2018) can be directly formulated as a graph matching problem similar to the one in this paper.