Learning Interpretable Relationships between Entities, Relations and Concepts via Bayesian Structure Learning on Open Domain Facts

Concept graphs are created as universal taxonomies for text understanding in open domain knowledge. The nodes in concept graphs include both entities and concepts. The edges are from entities to concepts, showing that an entity is an instance of a concept. In this paper, we propose the task of learning interpretable relationships from open domain facts to enrich and refine concept graphs. Bayesian network structures are learned from open domain facts as the interpretable relationships between relations of facts and concepts of entities. We conduct extensive experiments on public English and Chinese datasets. Compared to state-of-the-art methods, the learned network structures help improve the identification of concepts for entities based on the relations of entities on both datasets.


Introduction
Concept graphs are created as universal taxonomies for text understanding and reasoning in open domain knowledge (Dagan et al., 2010; Bowman et al., 2015; Zamir et al., 2018; Huang et al., 2019; Hao et al., 2019; Jiang et al., 2019). The nodes in concept graphs include both entities and concepts. The edges are from entities to concepts, showing that an entity is an instance of a concept. The task of extracting and building concept graphs from user-generated texts has attracted a lot of research attention for a couple of decades (Fellbaum, 1998; Wu et al., 2012; Shwartz et al., 2016; Chang et al., 2018; Le et al., 2019; Lewis, 2019). Most of these methods rely on high-quality syntactic patterns to determine whether an entity belongs to a concept. For example, given the pattern "X is a Y" or "Y, including X" appearing in sentences, we can infer that the entity X is an instance of the concept Y. These pattern-based methods require that an entity-concept pair co-occurs in sentences. However, due to the different expressions of a certain concept, an entity and a concept may rarely appear in sentences together. We conduct a data analysis of millions of sentences extracted from Wikipedia and discover that, out of more than six million pairs from the public Microsoft concept graph (https://concept.research.microsoft.com), only 10.61% of entity-concept pairs co-occur in sentences. We also analyze Baidu Baike (http://baike.baidu.com) and its corresponding concept graph, and observe a similar phenomenon: only 8.56% of entity-concept pairs co-occur in sentences. Table 1 summarizes these statistics.
Table 1: Entity-concept pairs that co-occur in sentences from Wikipedia (English) and Baidu Baike (Chinese).
Nowadays, the task of open domain information extraction (OIE) has become more and more important (Wu and Weld, 2010; Mausam et al., 2012; Sun et al., 2018b,a; Di et al., 2019; Rashed et al., 2019; Liu et al., 2020a,b). OIE aims to generate entity- and relation-level intermediate structures to express facts from open domain sentences. These open domain facts usually express natural language as triples of the form (subject, predicate, object). For example, given the sentence "Anderson, who hosted Whose Line, is a winner of a British Comedy Award in 1991.", two facts will be extracted: ("Anderson", "host", "Whose Line") and ("Anderson", "winner of a British Comedy Award", "1991"). The subject and object in a fact are both entities. Open domain facts thus contain rich information about entities by representing the subject or object entities via different types of relations (i.e., groups of predicates).
It would be helpful for concept graph completion if we could take advantage of the relations in open domain facts. We again take the above two facts about "Anderson" as an example. If we have explored the connections between relations of facts and concepts, and learned that "host" and "winner of a British Comedy Award" are associated with an "English presenter" subject with a higher probability than with a "Japanese presenter" subject, we can infer that "Anderson" belongs to the "English presenter" concept, regardless of whether the entity and the concept co-appear in a sentence. In real-world open domain corpora, however, the connections between relations and concepts are not available to us.
In this paper, we propose the task of learning interpretable relationships between entities, relations and concepts from open domain facts to help enrich and refine concept graphs. Learning Bayesian networks (BNs) from data has been studied extensively (Heckerman et al., 1995; Koivisto and Sood, 2004; Scanagatta et al., 2015; Niinimaki et al., 2016) over the last few decades. A BN formally encodes probabilistic connections in a certain domain, yielding a human-oriented qualitative structure that facilitates communication between a user and a system incorporating the probabilistic model. Specifically, we apply Bayesian network structure learning (BNSL) (Chow and Liu, 1968; Yuan and Malone, 2013) to discover meaningful relationships between entities, relations and concepts from open domain facts. The learned network encodes the dependencies from the relations of entities in facts to the concepts of entities, leading to the identification of more entity-concept pairs from open domain facts for the completion of concept graphs. We summarize our contributions as follows:
• We propose the task of learning interpretable relationships between entities, relations and concepts from open domain facts, which is important for enriching and refining concept graphs.
• We build the BNSL model to discover meaningful network structures that express the connections from relations of entities in open domain facts to concepts of entities in concept graphs.
• Experimental results on both English and Chinese datasets reveal that the learned interpretable relationships help identify concepts for entities based on the relations of entities, resulting in a more complete concept graph.

Related Work
Concept Graph Construction. Concept graph construction has been extensively studied in the literature (Fellbaum, 1998; Ponzetto and Strube, 2007; Banko et al., 2007; Suchanek et al., 2007; Wu et al., 2012; Shwartz et al., 2016; Chang et al., 2018; Le et al., 2019; Lewis, 2019). Notable works toward creating open domain concept graphs from scratch include YAGO (Suchanek et al., 2007) and Probase (Wu et al., 2012). In addition, a wide variety of methods (Nakashole et al., 2012; Weeds et al., 2014; Roller et al., 2014; Shwartz et al., 2016; Roller et al., 2018; Chang et al., 2018; Le et al., 2019; Lewis, 2019) have been developed to detect the hypernymy between entities and concepts for a more complete concept graph. Distributional representations of entities and concepts are learned for good hypernymy detection results (Weeds et al., 2014; Roller et al., 2014; Chang et al., 2018; Lewis, 2019). In contrast to distributional methods, path-based algorithms (Nakashole et al., 2012; Shwartz et al., 2016; Roller et al., 2018; Le et al., 2019) take advantage of the lexico-syntactic paths connecting the joint occurrences of an entity and a concept in a corpus. Most of these methods require the co-occurrence of entity-concept pairs in sentences for the graph completion task. However, due to the different expressions of a certain concept, an entity and a concept may rarely appear in one sentence together. With such limitations, the existing methods in the literature cannot deal with non-co-occurring entity-concept pairs, leading to an incomplete concept graph.
Open Domain Information Extraction. Open domain information extraction (OIE) has attracted a lot of attention in recent years (Wu and Weld, 2010; Mausam et al., 2012; Pal and Mausam, 2016; Yahya et al., 2014; Sun et al., 2018b,a; Roy et al., 2019; Liu et al., 2020a,b). It extracts facts from open domain documents and expresses them as triples of (subject, predicate, object). Recently, a neural-based OIE system, Logician (Sun et al., 2018b,a; Liu et al., 2020a,b), was proposed. It introduces a unified knowledge expression format, SAOKE (symbol aided open knowledge expression), and expresses the majority of the information in natural language sentences as four types of facts (i.e., relation, attribute, description and concept). Logician is trained on a human-labeled SAOKE dataset using a neural sequence-to-sequence model. It achieves much better performance than traditional OIE systems on Chinese and provides a set of open domain facts of much higher quality to support upper-level algorithms. Since the subject and object in a fact are both entities, open domain facts contain rich information about entities by representing the subjects or objects via different types of relations (i.e., groups of predicates). Making full use of the relations in open domain facts can thus help the task of concept graph completion. In this paper, we leverage the high-quality facts of Logician as one dataset in the experiments.
Bayesian Network Structure Learning. Learning a Bayesian network structure from real-world data is a well-motivated but computationally hard task (Heckerman et al., 1995; Koivisto and Sood, 2004; de Campos et al., 2009; Scanagatta et al., 2015; Niinimaki et al., 2016). A Bayesian network specifies a joint probability distribution of a set of random variables in a structured fashion. A key component of this model is the network structure, a directed acyclic graph over the variables encoding a set of conditional independence assertions. Several exact and approximate algorithms have been developed to learn optimal Bayesian networks (Chow and Liu, 1968; Koivisto and Sood, 2004; Singh and Moore, 2005; Silander and Myllymäki, 2006; Yuan and Malone, 2013). Some exact algorithms (Koivisto and Sood, 2004; Singh and Moore, 2005; Silander and Myllymäki, 2006) are based on dynamic programming to find the best Bayesian network. In 2011, an A* search algorithm was introduced to formulate the learning process as a shortest-path-finding problem. However, these exact algorithms are inefficient due to the full evaluation of an exponential solution space. In this paper, we use the Chow-Liu tree building algorithm (Chow and Liu, 1968) to approximate the underlying relationships between entities, relations and concepts as a dependency tree. This method is very efficient when there are large numbers of variables.

Finding Interpretable Relationships
We formulate the relationships between entities, relations, and concepts as follows:
• Entities are associated with a set of relations that represent the behaviors and attributes of the entities;
• A concept is defined by a set of relations. The instances of a concept are those entities that associate with the corresponding set of relations.
In concept graphs, a concept is associated with a set of entities which share some common behaviors or attributes. However, the essence of a concept is a set of relations, and entities that associate with these relations automatically become instances of the concept. Our formulation of the relationships between entities, relations and concepts can thus be illustrated by Figure 2.
[Figure 2: The relationships between entities, relations (r_1, ..., r_p) and concepts (c_1, ..., c_q).]

In the closed domain, a knowledge base has a predefined ontology, and the relationships in Figure 2 are already known. For example, DBPedia extracts facts from Wikipedia to encode the relationships between entities and relations in the form of facts. The relationships between relations and concepts are represented in the ontology structure of DBPedia, where each concept is associated with a group of relations. However, in the open domain, a predefined ontology does not exist, and hence the components in Figure 2 may not be associated with each other. For instance, given an open domain concept graph, we can discover the relationships between entities and concepts. Given open domain corpora/facts, we can find the relationships between entities and relations. But the relationships between open domain concepts and relations are, to our knowledge, not available. In this paper, we aim to find the connection between open domain relations and concepts, so that we can provide interpretations to the question of why an entity is associated with certain concepts in the open domain.

Problem Formulation
Suppose we have a set of entities E = {e_1, ..., e_m}, a set of relations R = {r_1, ..., r_p}, a set of concepts C = {c_1, ..., c_q}, and a set of observed triplets O = {(e, r, c)}. Here E and C are from a concept graph G, and R is from a set of facts F = {f_1, ..., f_n} extracted from a text corpus D. A triplet (e, r, c) is observed if the entity e with relation r and concept c is found in the above data sources. Given a set of observations O with N samples, the Bayesian network can be learned by maximizing the joint probability

p(O) = ∏_{(e,r,c)∈O} p(e, r, c) = ∏_{(e,r,c)∈O} p(e) p(r|e) p(c|(e, r)),

where p(c|(e, r)) = p(c|r) due to our Bayesian network assumption (see Figure 2). By learning the above model from the observed triplets, we can infer the missing triplets and, in particular, give interpretable relationships between entities and concepts.
Since p(r|e) can be approximated with information from the OIE corpus, the core of the above problem becomes learning the p(c|r) part of the network. The difficulty of learning p(c|r) is the unknown structure of the Bayesian network. Due to the sparsity of real-world knowledge bases, the target network should be sparse, but this sparse structure must be known beforehand for probability learning.
In this paper, we employ the Bayesian Network Structure Learning (BNSL) technique to explore the connections between relations and concepts. Due to the large number of variables (i.e., entities, relations and concepts) in open domain facts and concept graphs, we develop an approximate algorithm to learn the network structure.

The Proposed Approximate Algorithm
Due to the sparsity of the relationships between relations and concepts, we decompose the problem into several sub-problems, each containing only one concept variable. For each concept variable, we then identify possibly related relations and apply a BNSL algorithm to discover the network structure between them. Finally, we use the learned network for concept discovery. The procedure is shown in Algorithm 1. We describe the key steps in detail in the following subsections.

Sub-problem Construction
Given a concept c ∈ C, we first collect all its entities E_c ⊂ E from the concept graph. Then we obtain the set of facts F_c that contain these entities. Since an entity can appear in a fact as a subject or an object, we split the facts F_c into subject-view facts F_{c,s} and object-view facts F_{c,o}. If we made use of all the relations under the subject or object view, it would be inefficient or even impossible to learn the sparse network structure with such a large number of relation variables. Hence, based on the facts, we select relations possibly related to the concept c to reduce the complexity of the problem.
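As a toy illustration, the view split above can be sketched in a few lines of Python. The fact triples and variable names here are invented for the example, not taken from the paper's datasets:

```python
# Toy sub-problem construction for one concept c: collect the facts that
# mention an entity of c, split by the role (subject vs. object) that the
# entity plays. All data is illustrative.
facts = [
    ("Anderson", "host", "Whose Line"),
    ("Anderson", "winner of a British Comedy Award", "1991"),
    ("BBC", "broadcast", "Whose Line"),
]
concept_entities = {"Anderson"}  # E_c, the entities of concept c

subject_view = [f for f in facts if f[0] in concept_entities]  # F_{c,s}
object_view = [f for f in facts if f[2] in concept_entities]   # F_{c,o}

print(len(subject_view), len(object_view))  # 2 0
```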

Relation Selection
There are various strategies that can be applied for relation selection. We can assume that a relation is highly related to the concept if it appears many times in the fact set F_c. In this way, we count the frequencies of relations for each view and select the top K as the most relevant ones for the concept. We call this TF selection, since we measure the relevance of a relation according to its frequency. We can also select relations according to the TFIDF measurement (Wu et al., 2008). For each view, we select the K most relevant relations for the concept c, denoted R_{c,s} ⊂ R for the subject-view facts and R_{c,o} ⊂ R for the object-view facts. In summary, for each concept, we construct two sub-problems for the BNSL task: one from the subject view and the other from the object view. Under each view, the sub-problem contains one concept and at most K relations, and the goal is to learn a network structure over the concept and its corresponding relations.
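A minimal sketch of TF selection, assuming facts are (subject, predicate, object) triples; the triples below are illustrative:

```python
from collections import Counter

def top_k_relations(view_facts, k):
    """Rank relations by frequency in the concept's fact set (TF selection)."""
    counts = Counter(pred for _, pred, _ in view_facts)
    return [rel for rel, _ in counts.most_common(k)]

subject_view = [
    ("Anderson", "host", "Whose Line"),
    ("Tarrant", "host", "Who Wants to Be a Millionaire?"),
    ("Anderson", "winner of a British Comedy Award", "1991"),
]
print(top_k_relations(subject_view, k=2))
# ['host', 'winner of a British Comedy Award']
```

A TFIDF variant would additionally discount relations that appear under many different concepts, e.g. by dividing each count by the number of concepts whose fact sets contain the relation.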

Data Observations
Given a sub-problem for a concept c, we first obtain the corresponding data observations and then feed them as the input of BNSL for interpretable relationship discovery. For each concept, we can learn a Bayesian network structure from its top subject-view or object-view relations. The data observations X_{c,s} with TF relation selection for the subject view of the concept c are generated as follows. For each entity e ∈ E_c, we use 1 as the concept observation, meaning that the entity e is an instance of concept c. We use the number of times the subject e and a top relation r ∈ R_{c,s} appear together in the facts F_{c,s} as the relation observation for e and r. The K relation observations and the concept observation together become the positive data observations for c. In order to learn meaningful network structures, we also generate an equal number of negative data observations for c. We first randomly sample the same number of entities from the complement set Ē_c = {e_i : e_i ∈ E \ E_c} as negative entities of c, and use 0 as the concept observation for these negative entities. Then, for each negative entity e′, we count the number of times the subject e′ and a relation r ∈ R_{c,s} appear together in all the collected facts as the relation observation for e′ and r. The K relation observations and the concept observation together become the negative data observations for c. X_{c,s} consists of both the positive and negative data observations. Similarly, we obtain the data observations X_{c,o} for the object view.
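The observation-matrix construction can be sketched as follows; all entities, facts, and counts are toy values invented for the example:

```python
from collections import Counter

top_relations = ["host", "winner of a British Comedy Award"]  # R_{c,s}, K = 2
all_facts = [
    ("Anderson", "host", "Whose Line"),
    ("Anderson", "winner of a British Comedy Award", "1991"),
    ("Everest", "located in", "the Himalayas"),
]
positives = {"Anderson"}  # E_c
negatives = {"Everest"}   # sampled from E \ E_c

def rows(entities, label):
    """One row per entity: K relation counts plus the 0/1 concept label."""
    out = []
    for e in sorted(entities):
        counts = Counter(p for s, p, _ in all_facts if s == e)
        out.append([counts[r] for r in top_relations] + [label])
    return out

X_cs = rows(positives, 1) + rows(negatives, 0)  # subject-view observations
print(X_cs)  # [[1, 1, 1], [0, 0, 0]]
```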

Network Structure Learning
In this paper, we employ the widely-used Chow-Liu tree building algorithm (Chow and Liu, 1968) as the BNSL method. This algorithm approximates the underlying distribution of the variables with a dependency tree, a graph in which each node has exactly one parent and cycles are not allowed. It first calculates the mutual information between each pair of nodes (i.e., variables), and then takes the maximum spanning tree of that matrix as the approximation. While this only provides a rough approximation of the underlying data, it gives good results in many applications (Suzuki, 2010; Tavassolipour et al., 2014; Hassan-Moghaddam and Jovanovic, 2018; Ding et al., 2019), especially when one needs to know the most important influence on each variable. In addition, the algorithm remains extremely efficient with a large number of variables.
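A self-contained sketch of the Chow-Liu procedure on binary observation columns (relations plus the concept label): compute pairwise mutual information, then keep a maximum spanning tree. This is a didactic reimplementation under our own simplifications, not the paper's code:

```python
import math
from itertools import combinations

def mutual_info(xs, ys):
    """Empirical mutual information (in nats) between two discrete columns."""
    n = len(xs)
    mi = 0.0
    for a in set(xs):
        for b in set(ys):
            pab = sum(1 for x, y in zip(xs, ys) if (x, y) == (a, b)) / n
            pa, pb = xs.count(a) / n, ys.count(b) / n
            if pab > 0:
                mi += pab * math.log(pab / (pa * pb))
    return mi

def chow_liu(columns):
    """Maximum spanning tree over the pairwise mutual-information matrix,
    built greedily with Kruskal's algorithm and union-find."""
    d = len(columns)
    edges = sorted(((mutual_info(columns[i], columns[j]), i, j)
                    for i, j in combinations(range(d), 2)), reverse=True)
    parent = list(range(d))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for _, i, j in edges:  # add the strongest edge that joins two components
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Columns 0 and 1 are relation observations, column 2 is the concept label.
cols = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0]]
tree = chow_liu(cols)
print(tree)  # two edges; (0, 2) joins the two identical columns first
```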
Since both the subject and object views reflect some properties of entities, we can concatenate the subject-view relations and object-view relations together for a more complete representation of entities. The concatenated data can be forwarded into BNSL for a more comprehensive result of interpretable relationship discovery. Given q concept variables and K relevant relations for each concept, the number of parameters in BNSL is at most q×K.

Prediction
After learning a network structure for each concept, we can identify the concepts of a new entity e easily. We first collect the open domain facts with e as subject or object, and then feed the observations of the relations for a concept c into the network to calculate the probability p(c|e). We again use the open domain entity "Anderson" and its two facts introduced in Section 1 as an example of how BNSL works. Assume we have two open domain concepts, "English presenter" and "Japanese presenter". Given the entity "Anderson" and its open domain relations "host" and "winner of a British Comedy Award" as input to BNSL, the output is the probability that "Anderson" belongs to each concept. BNSL will predict a higher probability for "Anderson" having the concept "English presenter" than "Japanese presenter".

Experiments
With the relationships between relations and concepts learned by BNSL, we can indirectly associate entities with their concepts and give interpretations to the question of why an entity is associated with certain concepts in the open domain. The hypernymy detection task aims to identify concepts for entities in the open domain, and is therefore helpful for evaluating the quality of the relationships learned by BNSL. In this section, we conduct extensive experiments to evaluate the performance of BNSL.

Data Description
We test the performance of our proposed method on two public datasets, one in English and the other in Chinese. For the English dataset, we use 15 million high-precision OIE facts, the Microsoft concept graph, and 7.87 million Wikipedia sentences in our experiments. Since there are more than 5 million concepts in the English dataset and most of them have few entities, we focus on the concepts with more than 50 entities. For the Chinese dataset, we use the sentences and corresponding facts from (Sun et al., 2018b). The concept graph is also built from Baidu Baike.

Experimental Setting
In the experiments, we compare with the state-of-the-art model HypeNet (Shwartz et al., 2016) for hypernymy detection. HypeNet improves the detection of entity-concept pairs with an integrated path-based and distributional method. An entity and a concept must appear together in a sentence so that HypeNet can extract lexico-syntactic dependency paths for training and prediction. However, in reality fewer than 11% of entity-concept pairs co-occur in Wikipedia sentences (Table 1). Therefore, we compare BNSL with HypeNet only on the entity-concept pairs that co-appear in sentences.
In addition, we compare BNSL with recurrent neural networks (RNNs). We apply attention-based Bi-LSTM (Zhou et al., 2016) and derive three versions of RNNs as baseline methods: RNN(f), RNN(sen) and RNN(e). RNN(f) determines the concepts of an entity according to the facts containing the entity, while RNN(sen) uses the sentences containing the co-appearance of an entity and a concept. Specifically, each entity in RNN(f) is represented by its associated facts. Each fact is a sequence of subject, predicate and object. The subject, predicate and object vectors are fed in sequence into RNN(f), resulting in a fact embedding vector. The averaged fact vector becomes the entity's feature for concept classification.
Similar to HypeNet, RNN(sen) requires the entity-concept pairs to co-appear in sentences. Different from RNN(sen), RNN(e) focuses on sentences containing the entity only. Based on these sentences, RNN(e) aims to learn which concept an entity belongs to. Following HypeNet and the RNN baselines, we use pre-trained GloVe embeddings (Pennington et al., 2014) for initialization. Besides, we compare BNSL with traditional support vector machines (SVM) with a linear kernel. The input features for SVM and BNSL are the same, i.e., the top K relations for each concept. Here we set K = 5.
During testing, all methods are evaluated on the same testing entities. We calculate the accuracy, precision, recall and F1-score over the prediction results for evaluation. We split the data into 80% training and 20% testing. For English, the total numbers of training and testing samples are 504,731 and 123,880, respectively; for Chinese, the numbers are 5,169,220 and 1,289,382, respectively.
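For concreteness, the four metrics can be computed from scratch as below; the binary labels are toy values, not the paper's predictions:

```python
def scores(y_true, y_pred):
    """Accuracy, precision, recall and F1-score for binary predictions."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

print(scores([1, 1, 0, 0], [1, 0, 0, 1]))  # (0.5, 0.5, 0.5, 0.5)
```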

Performance Evaluation
In this section, we report the evaluation performance on the task of concept discovery with the interpretable relationships learned from open domain facts. Table 3 and Table 4 list the results for entity-concept pairs that co-occur and that do not co-occur in sentences, respectively. In the tables, (s) and (o) denote the performance under the subject view and the object view, respectively. RNN(f), BNSL and SVM present the prediction performance with the concatenation of both the subject and object views. As mentioned in the previous section, we can use TF or TFIDF for selecting the most relevant relations, and we test both strategies for BNSL and SVM. For the English dataset, TFIDF performs much better than TF, while the opposite holds for the Chinese dataset. In this section, we therefore analyze the results of BNSL and SVM with TFIDF for the English dataset and report the performance of BNSL and SVM with TF for the Chinese dataset. We show more results for the relation selection in the next section.
For the co-occurred entity-concept pairs in sentences, BNSL(s) performs the best on both datasets. Surprisingly, SVM performs much better than HypeNet, with an improvement of around 10% in accuracy on both datasets, as shown in Table 3. In addition, SVM achieves better results than RNN(sen). The reason that HypeNet and RNN(sen) cannot perform well may be that the information expressed in the sentences is too diverse: they cannot capture meaningful patterns from sentences for the task of concept discovery. Since RNN(e) further ignores the concept information during the sentence collection step, it does not perform as well as RNN(sen). In contrast, information extracted from open domain facts is much more concentrated on concepts. Furthermore, the most relevant relations associated with entities help filter out noise. Therefore, SVM achieves much better results than the sentence-based baselines.
Though SVM does well on the co-occurred data, BNSL outperforms SVM on all four evaluation metrics. By learning interpretable relationships between relations and concepts, BNSL captures the most important knowledge about concepts and further exploits their dependencies to improve the concept discovery task. However, the concatenation of the subject and object views for BNSL does not improve the performance on either dataset. Similar phenomena are observed for RNN(f) and SVM. Specifically, the results under the subject view are usually better than those under the object view, implying that when people narrate facts, they may pay more attention to selecting suitable predicates for subjects than for objects. Table 4 lists the performance of RNN(e), RNN(f), SVM and BNSL on the non-co-occurred data. We observe a similar trend to the results on the co-occurred data.
Since HypeNet and BNSL make use of different information sources (natural language sentences for HypeNet and open domain facts for BNSL), we try to ensemble them to improve the performance further. We first train HypeNet and BNSL independently. Then we obtain prediction probabilities of entity-concept pairs from HypeNet and BNSL separately and select the probabilities with higher values as the final predictions. The last row in Table 3 shows the performance of this ensemble, denoted B + H. It can be seen that B + H achieves the best accuracy, recall and F1-score on the co-occurred data. This reveals that interpretable relationships extracted from open domain facts are complementary to natural language sentences in helping concept discovery. Studying meaningful knowledge from open domain facts provides an alternative perspective on building concept graphs, and this paper presents a first attempt.
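The B + H ensemble step amounts to a per-pair maximum over the two models' predicted probabilities; the probability values below are invented for illustration:

```python
# Max-ensemble of BNSL and HypeNet predictions over entity-concept pairs.
# All probabilities are toy values.
bnsl = {("Anderson", "English presenter"): 0.9,
        ("Anderson", "Japanese presenter"): 0.2}
hypenet = {("Anderson", "English presenter"): 0.6,
           ("Anderson", "Japanese presenter"): 0.4}

b_plus_h = {pair: max(bnsl[pair], hypenet[pair])
            for pair in bnsl.keys() & hypenet.keys()}
print(b_plus_h[("Anderson", "English presenter")])  # 0.9
```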

Analysis with Missing Information
In reality, the open domain facts or co-occurring sentences associated with entity-concept pairs are often missing, making the input information for concept discovery extremely sparse. In this section, we study how BNSL performs with such sparse input. Given a set of entities, we first extract the corresponding facts (or sentences) under each concept. For both datasets, we get around 30 million entity-concept pairs for testing, and more than 97% of them have no fact information with the top K relations, making the prediction of BNSL very challenging. Furthermore, both datasets have a large number of fine-grained concepts, making the task even more difficult. For the missing data, we feed an empty fact or sentence into BNSL and the other models for training and testing. We observe that the RNNs do not perform as well as the other methods, and in particular RNN(sen) performs the worst when the input is extremely sparse.
In Figure 4, we report the improvement in F1-score over RNN(sen). We observe that HypeNet, SVM and BNSL achieve much better performance, showing their robustness to missing values. In addition, B + H still achieves the best result. This further confirms that open domain facts and natural language sentences are complementary to each other even when a large portion of the information is missing.

Conclusion
In this paper, we investigate the task of learning interpretable relationships between entities, relations and concepts from open domain facts to help enrich and refine concept graphs. Bayesian network structures are learned from open domain facts as the discovered meaningful dependencies between relations of facts and concepts of entities. Experimental results on an English dataset and a Chinese dataset reveal that the learned network structures can better identify concepts for entities based on the relations of entities from open domain facts, which will further help build a more complete concept graph.