Taxonomy Construction of Unseen Domains via Graph-based Cross-Domain Knowledge Transfer

Extracting lexico-semantic relations as graph-structured taxonomies, also known as taxonomy construction, has been beneficial in a variety of NLP applications. Recently, Graph Neural Networks (GNNs) have proven powerful on many tasks. However, there has been no attempt to exploit GNNs for taxonomy construction. In this paper, we propose Graph2Taxo, a GNN-based cross-domain transfer framework for the taxonomy construction task. Our main contribution is to learn the latent features of taxonomy construction from existing domains to guide the structure learning of an unseen domain. We also propose a novel method of directed acyclic graph (DAG) generation for taxonomy construction. Specifically, our proposed Graph2Taxo uses a noisy graph constructed from automatically extracted noisy hyponym-hypernym candidate pairs, together with a set of taxonomies for known domains, for training. The learned model is then used to generate a taxonomy for a new, unseen domain given a set of terms for that domain. Experiments on benchmark datasets from the science and environment domains show that our approach attains significant improvements over the state of the art.


Introduction
Taxonomy has been exploited in many Natural Language Processing (NLP) applications, such as question answering (Harabagiu et al., 2003), query understanding (Hua et al., 2017), recommendation systems (Friedrich and Zanker, 2011), etc. Automatic taxonomy construction is highly challenging, as it involves the ability to recognize (i) a set of types (i.e., hypernyms) from a text corpus, (ii) instances (i.e., hyponyms) of each type, and (iii) the is-a (i.e., hypernymy) hierarchy between types.
Existing taxonomies (e.g., WordNet (Miller et al., 1990)) are far from complete. Taxonomies specific to many domains are either entirely absent or incomplete. In this paper, we focus on the construction of taxonomies for such unseen domains. Since taxonomies are expressed as directed acyclic graphs (DAGs) (Suchanek et al., 2008), taxonomy construction can be formulated as a DAG generation problem.
There has been considerable research on Graph Neural Networks (GNNs) (Sperduti and Starita, 1997; Gori et al., 2005) over the years, particularly inspired by the convolutional GNN (Bruna et al., 2014), where graph convolution operations were defined in the Fourier domain. In a similar spirit to convolutional neural networks (CNNs), GNN methods aggregate neighboring information based on the connectivity of the graph to create node embeddings. GNNs have been applied successfully to many tasks such as matrix completion (van den Berg et al., 2017), manifold analysis (Monti et al., 2017), community prediction (Bruna et al., 2014), knowledge graph completion (Shang et al., 2019), and representation learning for network nodes (Hamilton et al., 2017). To the best of our knowledge, there has been no attempt to exploit GNNs for taxonomy construction. Our proposed framework, Graph2Taxo, is the first to show that a GNN-based model using a cross-domain noisy graph can substantially improve the taxonomy construction of unseen domains (e.g., Environment) by exploiting the taxonomies of one or more seen domains (e.g., Food). (The task is described in detail in Section 3.1.) Another novelty of our approach is that we are the first to apply an acyclicity constraint-based DAG structure learning model (Zheng et al., 2018; Yu et al., 2019) to the taxonomy generation task.
The input to Graph2Taxo is a cross-domain noisy graph constructed by connecting noisy candidate is-a pairs, which are extracted from a large corpus using standard linguistic pattern-based approaches (Hearst, 1992). It is noisy because pattern-based approaches are prone to poor coverage as well as wrong extractions. It is cross-domain because the noisy is-a pairs are extracted from a large-scale corpus containing a collection of text from multiple domains. Our proposed neural model directly encodes the structural information of the noisy graph into the embedding space. Since the links between domains are also used by our model, it captures not only the structural information of multiple domains but also cross-domain information.
We demonstrate the effectiveness of our proposed method on the science and environment datasets (Bordea et al., 2016), and show significant improvements in F-score over the state of the art.

Related Work
Taxonomy construction (also known as taxonomy induction) is a well-studied problem. Most existing works follow two sequential steps to construct taxonomies from text corpora. First, is-a pairs are extracted using pattern-based or distributional methods. Then, a taxonomy is constructed from these is-a pairs.
The pattern-based methods, pioneered by Hearst (1992), detect the is-a relation of a term pair (x, y) from the appearance of x and y in the same sentence, through lexical patterns or linguistic rules (Ritter et al., 2009; Luu et al., 2014). Snow et al. (2004) represented each (x, y) term pair as the multiset of dependency paths connecting their co-occurrences in a corpus, which is also regarded as a path-based method.
An alternative approach for detecting is-a relations is the distributional method (Baroni et al., 2012; Roller et al., 2014), which uses the distributional representations of terms to directly predict relations.
As for the step of constructing a taxonomy from the extracted is-a pairs, most approaches do so by incrementally attaching new terms (Snow et al., 2006; Shen et al., 2012; Alfarone and Davis, 2015; Wu et al., 2012). Mao et al. (2018) were the first to present a reinforcement learning based approach, named TaxoRL, for this task. For each term pair, its representation in TaxoRL is obtained from a path LSTM encoder, the word embeddings of both terms, and the embeddings of hand-crafted features.
Recently, Dash et al. (2020) argued that strict partial orders correspond more directly to DAGs. They proposed a neural network architecture, called Strict Partial Order Network (SPON), that enforces the asymmetry and transitivity properties as soft constraints. Empirically, they showed that such a network produces better results for detecting hyponym-hypernym pairs on a number of datasets for different languages and domains, in both supervised and unsupervised settings.
Many graph-based methods, such as Kozareva and Hovy (2010) and Luu et al. (2014), regard the task of hypernymy organization as a hypernymy detection problem followed by a graph pruning problem. For the graph pruning task, various graph-theoretic approaches such as the optimal branching algorithm (Velardi et al., 2013), Edmonds' algorithm (Karp, 1971) and Tarjan's algorithm (Tarjan, 1972) have been used over the years. Several other graph-based taxonomy induction approaches have also been described in the literature. In contrast, our approach formulates the taxonomy construction task as a DAG generation problem rather than as incremental taxonomy learning (Mao et al., 2018), which differentiates it from the existing methods. In addition, our approach uses knowledge from existing domains (Bansal et al., 2014; Gan et al., 2016) to build the taxonomies of missing domains.

The Graph2Taxo Framework
In this section, we first formulate the problem statement and then introduce our proposed Graph2Taxo framework as a solution. We describe the individual components of this framework in detail, along with justifications of how and why these components come together as a solution.

Problem Definition
The problem addressed in this paper is: given a list of domain-specific terms from a target unseen (aka missing) domain as input, how do we construct a taxonomy for that domain? In other words, how do we organize these terms into a taxonomy?
This problem can be further abstracted as follows: given a large input corpus and a set of gold taxonomies G_gold from known domains (different from the target domain), our task is to learn a model (trained using the corpus and the taxonomies of the known domains) that constructs taxonomies for target unseen domains.
As a solution to the aforementioned problem, we propose a GNN-based cross-domain transfer framework for taxonomy construction (see Figure 1), called Graph2Taxo, which consists of a cross-domain graph encoder and a DAG generator.
The first step in our proposed approach is to build a cross-domain noisy graph as input to our Graph2Taxo model. In this step, we extract candidate is-a pairs from a large collection of input corpora that spans multiple domains. To do so, we used the output of Panchenko et al. (2016), which combines standard substring matching and pattern-based approaches. Since such pattern-based approaches are too rigid, the corresponding output not only suffers from low recall (i.e., missing is-a pairs) but also contains incorrect (i.e., noisy) pairs, owing to the ambiguity of language and the richness of syntactic expression and structure in the input corpora. For example, consider the phrase "... animals other than dogs such as cats ...". As Wu et al. (2012) noted, pattern-based approaches will extract (cat is-a dog) rather than (cat is-a animal).
Based on the noisy is-a pairs, we construct a directed graph G_input = (V_input, E_input), which is a cross-domain noisy graph. Here, V_input denotes a set of terms, and (v_i, v_j) ∈ E_input if and only if (v_i, v_j) belongs to the list of extracted noisy is-a pairs. The input document collection spans multiple domains; therefore E_input has not only intra-domain edges but also cross-domain edges (see Figure 1).
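As a toy illustration, the sketch below builds such a directed adjacency matrix from a handful of invented candidate pairs; the terms and pairs are illustrative placeholders, not drawn from the actual extraction output:

```python
# Sketch: building the cross-domain noisy graph from extracted is-a pairs.
import numpy as np

def build_noisy_graph(isa_pairs):
    """Return (terms, adjacency) for a directed graph with an edge
    hypo -> hyper for every extracted (hypo, hyper) candidate pair."""
    terms = sorted({t for pair in isa_pairs for t in pair})
    index = {t: i for i, t in enumerate(terms)}
    A = np.zeros((len(terms), len(terms)), dtype=np.float32)
    for hypo, hyper in isa_pairs:
        A[index[hypo], index[hyper]] = 1.0  # noisy candidate edge
    return terms, A

pairs = [("cat", "animal"), ("dog", "animal"),   # intra-domain edges
         ("oak", "tree"), ("animal", "organism"),
         ("tree", "organism")]                   # cross-domain-style links
terms, A = build_noisy_graph(pairs)
```

In the real setting, V_input contains the vocabulary of the whole multi-domain corpus, so A is large and sparse.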
Graph2Taxo is a subgraph generation model that takes the large cross-domain noisy graph as input. Given a list of terms for a target unseen domain, it aims to learn a taxonomy structure for that domain as a DAG. Graph2Taxo takes advantage of additional knowledge in the form of previously known gold taxonomies {G_gold,i, 1 ≤ i ≤ N_known} to train a learning model, where N_known denotes the number of previously known taxonomies used during the training phase. During the inference phase, the model receives a list of terms from the target unseen domain and builds a taxonomy over these input terms.
This problem of distilling directed acyclic substructures (taxonomies of many domains) from a large cross-domain noisy graph is challenging because of the relatively low overlap between the noisy edges in E_input and the true edges in the available taxonomies.
The following sections describe our proposed Cross-domain Graph Encoder and the DAG Generator in further detail.

Cross-domain Graph Encoder
This subsection describes the Cross-domain Graph Encoder in Figure 1 for embedding generation. This embedding generation algorithm uses two strategies, namely Neighborhood aggregation and Semantic clustering aggregation.

Neighborhood Aggregation
This is the first of the two strategies used for embedding generation. Let $A \in \mathbb{R}^{n \times n}$ be the adjacency matrix of the noisy graph $G_{input}$, where $n$ is the size of $V_{input}$. Let $h_i^l$ represent the feature representation of node $v_i$ in the $l$-th layer, so that $H^l \in \mathbb{R}^{n \times d_l}$ denotes the intermediate representation matrix. The initial matrix $H^0$ is randomly initialized from a standard normal distribution.
We use the adjacency matrix $A$ and the node representation matrix $H^l$ to iteratively update the representation of a particular node by aggregating the representations of its neighbors. This is done using a GNN. Formally, a GNN layer (Gilmer et al., 2017; Hamilton et al., 2017; Xu et al., 2019) employs the general message-passing architecture, which consists of a message propagation function $M$ that gathers messages from neighbors and a vertex update function $U$. Message passing works via the following equations, where $N(v)$ denotes the neighbors of node $v$ and $m$ is the message:

$$m_v^{l+1} = M\big(\{h_u^l : u \in N(v)\}\big), \qquad h_v^{l+1} = U\big(h_v^l, m_v^{l+1}\big).$$

In addition, we use the following definitions for the $M$ and $U$ functions, where $\Theta^l \in \mathbb{R}^{d_l \times d_{l+1}}$ denotes the trainable parameters for layer $l$ and $\sigma$ represents an activation function:

$$M\big(\{h_u^l : u \in N(v)\}\big) = \sum_{u \in N(v)} h_u^l, \qquad U\big(h_v^l, m_v^{l+1}\big) = \sigma\big((h_v^l + m_v^{l+1})\,\Theta^l\big).$$

Letting $\tilde{A} = A + I$, where $I$ is the identity matrix, the information aggregation strategy described above can be written compactly as

$$H^{l+1} = \sigma\big(\tilde{A} H^l \Theta^l\big).$$
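A minimal numerical sketch of this aggregation step; the ReLU activation, the toy graph, and the random weights standing in for the trained parameters $\Theta^l$ are illustrative assumptions:

```python
# Sketch of the neighborhood aggregation step
# H^{l+1} = sigma(A~ H^l Theta^l), with A~ = A + I.
import numpy as np

def gnn_layer(A, H, Theta):
    A_tilde = A + np.eye(A.shape[0])             # add self-loops
    return np.maximum(A_tilde @ H @ Theta, 0.0)  # ReLU as the activation

rng = np.random.default_rng(0)
n, d_l, d_next = 4, 8, 6
A = np.array([[0, 1, 0, 0],                      # tiny noisy graph
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)
H = rng.standard_normal((n, d_l))                # H^0: random init
Theta = rng.standard_normal((d_l, d_next))
H_next = gnn_layer(A, H, Theta)                  # shape (n, d_next)
```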

Semantic Clustering Aggregation
This is the second of the two strategies used for embedding generation; it operates on the output of the previous step. The learned representations from the previous step are unlikely to be uniformly distributed in the Euclidean space; rather, they form a number of clusters. We therefore propose a soft clustering-based pooling-unpooling step that uses semantic clustering aggregation to learn better representations. In essence, this step shares similarity information between any pair of terms in the vocabulary.
Analogous to an auto-encoder, the pooling layer adaptively creates a smaller cluster graph comprising a set of cluster nodes, whose representations are learned based on a trainable cluster assignment matrix. This idea of using an assignment matrix was first proposed in the DiffPool (Ying et al., 2018) approach. Conversely, the unpooling layer decodes the cluster graph back into the original graph using the same cluster assignment matrix learned in the pooling layer. The learned semantic cluster nodes can be thought of as "bridges" that pass messages between nodes from the same or different clusters.
Mathematically speaking, we learn a soft cluster assignment matrix $S^l \in \mathbb{R}^{n \times n_c}$ at layer $l$ using the GNN model, where $n_c$ is the number of clusters. Each row of $S^l$ corresponds to one of the $n$ nodes in layer $l$, and each column corresponds to one of the $n_c$ clusters. As a first step, the pooling layer uses the adjacency matrix $A$ and the node feature matrix $H^l$ to generate the soft cluster assignment matrix as

$$S^l = \mathrm{softmax}\big(GNN_{l,cluster}(A, H^l)\big),$$

where the softmax is applied row-wise and $\Theta^l_{cluster} \in \mathbb{R}^{d_l \times n_c}$ denotes the trainable parameters of $GNN_{l,cluster}$.
Since the matrix S l is calculated based on node embeddings, nodes with similar features and local structure will have similar cluster assignment.
As the final step, the pooling layer generates an adjacency matrix $A_c$ for the cluster graph and a new embedding matrix $H_c^l$ containing the cluster node representations:

$$A_c = (S^l)^\top A\, S^l, \qquad H_c^l = GNN_l\big(A_c, (S^l)^\top H^l\big),$$

where $GNN_l$ further propagates messages among the neighboring clusters and its trainable parameters are $\Theta^l \in \mathbb{R}^{d_l \times d_{l+1}}$. To pass the clustering information back to the original graph, the unpooling layer restores the original graph using the same cluster assignment matrix:

$$H_{sc}^{l+1} = S^l H_c^l.$$

The output of this pooling-unpooling layer thus equips the node representations with latent cluster information. Finally, we combine the neighborhood aggregation and semantic clustering aggregation strategies via a residual connection:

$$H^{l+1} = \mathrm{concate}\big(H_{na}^{l+1},\, H_{sc}^{l+1}\big),$$

where concate concatenates the two matrices and $H_{na}^{l+1}$ is the output of the neighborhood aggregation step. $H^{l+1}$ is the output of this pooling-unpooling step.
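The pooling-unpooling step can be sketched as follows. For brevity, a plain linear map with a row-wise softmax stands in for $GNN_{l,cluster}$ and the extra message passing on the cluster graph is omitted, so this is a simplified sketch rather than the exact model:

```python
# Sketch of the pooling-unpooling step: a soft assignment S pools the
# n-node graph into n_c cluster nodes and then restores the original nodes.
import numpy as np

def softmax_rows(X):
    e = np.exp(X - X.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def pool_unpool(A, H, Theta_cluster):
    # soft cluster assignment (one GCN-style layer + row-wise softmax)
    S = softmax_rows((A + np.eye(len(A))) @ H @ Theta_cluster)  # n x n_c
    A_c = S.T @ A @ S          # cluster-graph adjacency
    H_c = S.T @ H              # pooled cluster features
    H_restored = S @ H_c       # unpool back to the original nodes
    return S, A_c, H_restored

rng = np.random.default_rng(1)
n, d, n_c = 6, 5, 3
A = (rng.random((n, n)) < 0.3).astype(float)
H = rng.standard_normal((n, d))
S, A_c, H_restored = pool_unpool(A, H, rng.standard_normal((d, n_c)))
```

Each row of S sums to 1, so every node is softly distributed over the n_c cluster nodes.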

DAG Generator
The DAG generator takes in the noisy graph G input and representations of all the vocabulary terms (output of Section 3.2) as input, encodes acyclicity as a soft-constraint (as described below), and outputs a distribution of edges within G input that encodes the likelihood of true is-a relationships. This output distribution is finally used to induce taxonomies, i.e., DAGs of is-a relationships.
In each training step, the DAG generator is applied to one domain (see Figure 2), using a noisy graph G, a subgraph of G_input, as the training sample, and a DAG is generated for that domain. Let N_t denote the number of (hypo, hyper) pairs belonging to the edge set of G. During training, we also know the label vector label ∈ {0, 1}^{N_t} for these N_t pairs, based on whether they belong to the known gold taxonomy.

Edge Prediction
For each edge within the noisy graph G, our DAG generator estimates the probability that the edge represents a valid hypernymy relationship. Our model estimates this probability through the use of a convolution operation illustrated in Figure 2.
For each edge (hypo, hyper), in the first step the term embeddings and edge features are concatenated as

$$v_{pair} = \mathrm{concate}\big(v_{hypo},\, v_{hyper},\, v_{feas}\big),$$

where $v_{hypo}$ and $v_{hyper}$ are the embeddings of the hypo and hyper nodes (from Section 3.2) and $v_{feas}$ denotes a feature vector for the edge (hypo, hyper), which includes an edge frequency feature and substring features. The substring features include: ends with, contains, prefix match, suffix match, length of the longest common substring (LCS), length difference, and a boolean feature denoting whether the LCS belongs to $V_{input}$ (the set of terms). Inspired by the ConvE model (Dettmers et al., 2018), a well-known convolution-based algorithm for link prediction, we apply a 1D convolution operation on $v_{pair}$. We use a convolution operation because it increases the expressiveness of the DAG Generator through additional interaction between the participating embeddings.
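The substring features can be sketched as below. The feature ordering, the single-character prefix/suffix tests, and the example vocabulary are illustrative assumptions rather than the exact feature extractor:

```python
# Sketch of the hand-crafted substring features for a candidate edge
# (hypo, hyper).
def lcs_length(a, b):
    # longest common substring via dynamic programming
    best = 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best

def substring_features(hypo, hyper, vocab):
    L = lcs_length(hypo, hyper)
    # is some length-L common substring also a term in the vocabulary?
    lcs_in_vocab = L > 0 and any(
        hypo[i:i + L] in vocab and hypo[i:i + L] in hyper
        for i in range(len(hypo) - L + 1)
    )
    return [
        float(hypo.endswith(hyper)),     # ends with
        float(hyper in hypo),            # contains
        float(hypo[:1] == hyper[:1]),    # prefix match (first char; assumption)
        float(hypo[-1:] == hyper[-1:]),  # suffix match (last char; assumption)
        float(L),                        # length of LCS
        float(len(hypo) - len(hyper)),   # length difference
        float(lcs_in_vocab),             # LCS in the term set?
    ]

feats = substring_features("computer science", "science", {"science", "computer"})
```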
For the convolution operation, we make use of $C$ different kernels parameterized by $\{w_c, 1 \le c \le C\}$. The 1D convolution is then calculated as

$$v_c[p] = \sum_{k=1}^{K} \tilde{v}_{pair}[p + k - 1]\; w_c[k], \qquad 1 \le p \le d_v, \quad (2)$$

where $K$ denotes the kernel width, $d_v$ denotes the size of $v_{pair}$, $p$ denotes the position at which the kernel operation starts, and the kernel parameters $w_c$ are trainable. In addition, $\tilde{v}_{pair}$ denotes the padded version of $v_{pair}$, with the following padding strategy: if $K$ is odd, we pad $v_{pair}$ with $\lfloor K/2 \rfloor$ zeros on both sides; if $K$ is even, we pad $\lfloor K/2 \rfloor - 1$ zeros at the beginning and $\lfloor K/2 \rfloor$ zeros at the end of $v_{pair}$. Here, $\lfloor \cdot \rfloor$ denotes the floor function.
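The padding strategy and the convolution itself can be sketched as follows; the random vectors stand in for $v_{pair}$ and the trained kernel parameters:

```python
# Sketch of the padded 1D convolution over v_pair (Equation 2).
import numpy as np

def pad_for_kernel(v, K):
    if K % 2 == 1:                        # odd kernel width
        left = right = K // 2
    else:                                 # even kernel width
        left, right = K // 2 - 1, K // 2
    return np.concatenate([np.zeros(left), v, np.zeros(right)])

def conv1d(v, w):
    K = len(w)
    v_pad = pad_for_kernel(v, K)
    # one output position per element of the original vector (length d_v)
    return np.array([v_pad[p:p + K] @ w for p in range(len(v))])

rng = np.random.default_rng(2)
v_pair = rng.standard_normal(12)
out = conv1d(v_pair, rng.standard_normal(5))   # one kernel, K = 5
```

Note that the padding keeps the output the same length as the input, so concatenating the C kernel outputs yields a vector of size C * d_v.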
Each kernel $c$ generates a vector $v_c$ according to Equation 2. As there are $C$ different kernels, this results in $C$ different vectors, which are concatenated together to form one vector $v = \mathrm{concate}(v_1, \ldots, v_C)$. The probability $p_{(hypo,hyper)}$ of a given edge (hypo, hyper) expressing a hypernymy relationship can then be estimated using the scoring function

$$p_{(hypo,hyper)} = \mathrm{sigmoid}(W v),$$

where $W$ denotes the parameter matrix of a fully connected layer, as illustrated in Figure 2. Finally, for the loss calculation, we make use of a differentiable F1 loss (Huang et al., 2015),

$$loss_{F1} = 1 - \frac{2\sum_t p_t\, label_t}{\sum_t p_t + \sum_t label_t},$$

where the sums run over the $N_t$ candidate pairs of the training sample.
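A soft F1 loss in this spirit can be sketched as below; the exact formulation used by the model may differ in detail, so this is an illustrative reconstruction:

```python
# Sketch of a differentiable (soft) F1 loss: the hard true/false positive
# counts are replaced by sums of predicted probabilities.
import numpy as np

def soft_f1_loss(probs, labels, eps=1e-8):
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    tp = (probs * labels).sum()          # soft true positives
    fp = (probs * (1 - labels)).sum()    # soft false positives
    fn = ((1 - probs) * labels).sum()    # soft false negatives
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return 1.0 - f1                      # minimize 1 - F1

loss = soft_f1_loss([0.9, 0.8, 0.2], [1, 1, 0])
```

Because every term is a smooth function of the probabilities, gradients flow back through the edge scores during training.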

DAG Constraint
The edge prediction step alone does not guarantee that the generated graph is acyclic. Learning a DAG from data is an NP-hard problem (Chickering, 1995; Chickering et al., 2004). One of the first works to formulate acyclic structure learning as a continuous optimization problem was Zheng et al. (2018). The authors note that the trace of $B^k$, denoted $tr(B^k)$, for a non-negative adjacency matrix $B \in \mathbb{R}^{n \times n}$ counts the number of length-$k$ cycles in a directed graph. Hence, positive entries on the diagonal of $B^k$ indicate the existence of cycles. In other words, $B$ has no cycle if and only if

$$\sum_{k=1}^{\infty} \sum_{i=1}^{n} \big(B^k\big)_{ii} = 0.$$

However, calculating $B^k$ for every value of $k$, i.e., repeated matrix exponentiation, is impractical and can easily exceed machine precision. To solve this problem, Zheng et al. (2018) instead express the acyclicity condition through the matrix exponential $e^B$, whose trace aggregates the cycle counts of all lengths at once. To make this constraint applicable to an arbitrary weighted matrix with both positive and negative values, the Hadamard product $B = A \circ A$ is used, which leads to the following theorem.
Theorem 1 (Zheng et al., 2018). A matrix $A \in \mathbb{R}^{n \times n}$ is the adjacency matrix of a DAG if and only if

$$tr\big(e^{A \circ A}\big) - n = 0,$$

where $tr$ denotes the trace of a matrix, $\circ$ denotes the Hadamard product, and $e^B$ is the matrix exponential of $B$.
Since the matrix exponential may not be available in all deep learning frameworks, Yu et al. (2019) propose an alternative constraint that is practically convenient:

$$h(A) = tr\big[(I + \alpha\, A \circ A)^n\big] - n = 0,$$

where $\alpha$ is a hyper-parameter.
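A numerical sketch of this polynomial acyclicity check (the choice α = 1/n is a common convention and an assumption here): an acyclic adjacency matrix yields h(A) = 0, while any cycle makes h(A) strictly positive.

```python
# Sketch of the polynomial acyclicity constraint of Yu et al. (2019):
# h(A) = tr[(I + alpha * (A ∘ A))^n] - n.
import numpy as np

def dag_constraint(A, alpha=None):
    n = A.shape[0]
    if alpha is None:
        alpha = 1.0 / n                  # common choice; an assumption here
    B = A * A                            # Hadamard product, non-negative
    M = np.eye(n) + alpha * B
    return np.trace(np.linalg.matrix_power(M, n)) - n

dag = np.array([[0., 1., 1.],
                [0., 0., 1.],
                [0., 0., 0.]])           # acyclic graph
cyc = np.array([[0., 1., 0.],
                [0., 0., 1.],
                [1., 0., 0.]])           # 3-cycle
h_dag = dag_constraint(dag)              # 0: no cycles
h_cyc = dag_constraint(cyc)              # > 0: the cycle is penalized
```

In training, h(A) is evaluated on the predicted edge probabilities, so minimizing it pushes the generated structure towards a DAG.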
Finally, using an augmented Lagrangian approach, we propose the combined loss function

$$loss = loss_{F1} + \lambda\, h(A) + \frac{\rho}{2}\, h(A)^2,$$

where $\lambda$ and $\rho$ are hyper-parameters. During backpropagation, the gradients are passed back to all domains through the intra-domain and cross-domain edges of $G_{input}$ to update all parameters.

Benchmark Datasets
For our experiments, we used the English environment and science taxonomies within the TExEval-2 benchmark datasets. These datasets do not come with any training data, only a list of terms, and the task is to build a meaningful taxonomy using these terms. The science domain terms come from WordNet, Eurovoc and a manually constructed taxonomy (henceforth referred to as combined), whereas the terms for the environment domain come from the Eurovoc taxonomy only. Table 1 shows the dataset statistics. We chose to evaluate our proposed approach on the environment and science taxonomies only because we wanted to compare it with the existing state-of-the-art system TaxoRL (Mao et al., 2018) as well as with TAXI, the winning system of the TExEval-2 task. Note that we use the same TExEval-2 datasets as TaxoRL (Mao et al., 2018).
In addition, we used the dataset from Bansal et al. (2014) as the gold taxonomies (i.e., the source of additional knowledge), G_gold = {G_gold,i, 1 ≤ i ≤ N_known}, that are known a priori. This dataset is a set of medium-sized full-domain taxonomies consisting of bottom-out full subtrees sampled from WordNet, and contains 761 taxonomies in total.
To test our model on taxonomy prediction (and to remove overlap), we removed from G_gold any taxonomy with term overlap with the terms provided for the science and environment domains in the TExEval-2 task. This leaves 621 non-overlapping taxonomies in total, partitioned in an 80-20 ratio to create the training and validation datasets, respectively.

Experimental Settings
We ran our experiments in two different settings. In each of them, we train on a different noisy input graph (and the same gold taxonomies as mentioned before), and evaluate on the science and environment domains within the TExEval-2 task. In the first setting, we used the same input as TaxoRL (Mao et al., 2018) for a fair comparison. This input consists of term pairs and the associated dependency path information between them, extracted from three public web-based corpora. For Graph2Taxo, we only make use of the term pairs to create a noisy input graph.

Table 2: Results on the TExEval-2 task: Taxonomy Extraction Evaluation (a.k.a. TExEval-2). The first four rows represent participating systems in the TExEval-2 task, whose performances are taken from Bordea et al. (2016). TaxoRL A/B illustrate the performance of the Reinforcement Learning system by Mao et al. (2018) under the Partial and Full setting respectively. Graph2Taxo 1/2 represent our proposed algorithm under the two settings described in Section 4.2. All results reported above are rounded to 2 decimal places.
In the second setting, we used the data provided by TAXI (Panchenko et al., 2016), which comprises a list of candidate is-a pairs extracted based on substrings and lexico-syntactic patterns. We used these noisy candidate pairs to create a noisy graph.
A Graph2Taxo model is then trained on the noisy graph obtained in each of the two settings. In the test phase, all candidate term pairs for which both terms are present in the test vocabulary are scored (between 0 and 1) by the trained Graph2Taxo model. A threshold of 0.5 is applied, and the candidate pairs scoring above this threshold are accumulated together as the predicted taxonomy G_pred. Note that different tasks have different optimal thresholds, and we obtain better performance if we tune the threshold per task. However, we chose the harder setting and show that our model outperforms the others even when we simply use 0.5 as the threshold. In addition, we specify the hyper-parameter ranges for our experiments: learning rate {0.01, 0.005, 0.001}, number of kernels {5, 10, 20} and number of clusters {10, 30, 50, 100}. Finally, the Adam optimizer (Kingma and Ba, 2015) is used for all experiments.
Evaluation Metrics. Given a gold taxonomy G gold (as part of the TExEval-2 benchmark dataset) and a predicted taxonomy G pred (by our proposed Graph2Taxo approach), we evaluate G pred using Edge Precision, Edge Recall and F-score measures as defined in Bordea et al. (2016).
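Edge-level precision, recall and F-score can be sketched as follows on toy edge sets (the example edges are invented for illustration):

```python
# Sketch of the edge-level evaluation: precision, recall and F-score
# between the predicted and gold edge sets.
def edge_prf(pred_edges, gold_edges):
    pred, gold = set(pred_edges), set(gold_edges)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

gold = [("cat", "animal"), ("dog", "animal"), ("oak", "tree")]
pred = [("cat", "animal"), ("oak", "tree"), ("cat", "dog")]
p, r, f = edge_prf(pred, gold)
```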

Hyper-parameters
We use the following hyper-parameter configuration for training the model. We set the dropout to 0.3, the number of kernels C to 10, the kernel size K to 5, the learning rate to 0.001, and the initial embedding size to 300. For the loss function, we use λ = 1.0 and ρ = 0.5. In addition, the number of clusters n_c is set to 50 in all our experiments. In the scenario in which the input resource comes from TAXI, only hyponym-hypernym candidate pairs observed more than 10 times are used to create the noisy graph. We use one pooling and one unpooling layer in our experiments, and we apply dropout in two places: at the end of the cross-domain encoder module and after the Conv1D operation. Our models are trained using NVIDIA Tesla P100 GPUs.

Results and Discussions

Table 2 shows the results of the TExEval-2 evaluation on the science and environment domains. The first row represents a string-based baseline method (Bordea et al., 2016) that exploits term compositionality to hierarchically relate terms. For example, it extracts pairs such as (Statistics Department, Department) from the provided Wikipedia corpus and uses this technique to construct the taxonomy.

The next three rows in Table 2, namely TAXI, JUNLP and USAAR, are among the top performing systems that participated in the TExEval-2 task. Furthermore, TaxoRL A and TaxoRL B illustrate the performance of the Reinforcement Learning system of Mao et al. (2018) under the Partial induction and Full induction settings, respectively. Since Mao et al. (2018) have shown that their method outperforms approaches such as Gupta et al. (2017) and Bansal et al. (2014), we only compare the results of our proposed Graph2Taxo approach against the state-of-the-art system TaxoRL.

Table 3: Ablation tests reporting the Precision, Recall and F-score across the Science and Environment domains. The first block of values reports results obtained by ablating each layer of the Graph2Taxo model. The second block demonstrates that the addition of the constraint does in fact improve performance. The third block illustrates the importance of the features v_feas for performance. The final block uses pre-trained fastText embeddings to initialize our Graph2Taxo model and then fine-tunes on our training data. All results reported above are rounded to 2 decimal places.
Finally, Graph2Taxo 1 and Graph2Taxo 2 depict the results of our proposed algorithm under both aforementioned settings, i.e. using the input resources of TaxoRL in the first scenario, and using the resources of TAXI in the second scenario. In each of these settings, we find that the overall precision of our proposed Graph2Taxo approach is far better than all the other existing approaches, demonstrating the strong ability of Graph2Taxo to find true relations. Meanwhile, the recall of our proposed Graph2Taxo approach is comparable to that of the existing state-of-the-art approaches. Combining the precision and recall metrics, we observe that Graph2Taxo outperforms existing state-of-the-art approaches on the F-score, by a significant margin. For example, for the Science (Average) domain, Graph2Taxo 2 improves over TaxoRL's F-score by 5%. For the Environment (Eurovoc) domain, our model improves TaxoRL's F-score by 7% on the TExEval-2 task.
In addition, our proposed model is highly scalable: GNN methods have been trained on large graphs with about 1 million nodes, and the GNN component can be replaced by any improved GNN method (Hamilton et al., 2017; Gao et al., 2018) designed for large-scale graphs.
Ablation Tests. Table 3 shows the results of the proposed Graph2Taxo in the second setting for the ablation experiments (divided into four blocks), indicating the contribution of each layer used in our Graph2Taxo model. In Table 3, every experiment is run three times, and the average over the three runs is reported. Furthermore, in Figure 3, we randomly chose the Science (Eurovoc) domain to report the error bars (corresponding to the standard deviations) of our experiments. The first block of values in Table 3 reports results obtained by ablating layers from our Graph2Taxo model. Comparing the first two rows, it is evident that adding the Semantic Clustering (SC) layer improves recall at the cost of precision, while improving the overall F-score. This improvement is clearly seen in the Science (Eurovoc) domain, where we observe an increase of 3%.
In the second block, we show that the addition of the constraint improves performance. Row 4 represents a Graph2Taxo setup, i.e., 2GNN+SC+Res, but without any constraint. Adding the DAG constraint (Row 1) to this yields a better F-score. Specifically, we observe a major increase of +5% F1 for the Science (Eurovoc) domain.
In the third block, we remove the features v_feas mentioned in Section 3.3.1. The results (row 5 in Table 3) show that these features are critical to the performance of our proposed system on both the Science (Eurovoc) and Environment (Eurovoc) domains. Note that the features v_feas are not a novelty of our proposed method; they have also been used by existing state-of-the-art approaches.
Finally, we study the effect of initializing our model with pre-trained embeddings rather than at random. Specifically, we initialize the input matrix H^0 of our Graph2Taxo model with pre-trained fastText embeddings. The model using fastText embeddings improves upon Row 1 by a margin of 4% in precision for the Environment (Eurovoc) domain, but has no significant effect on the F-score. Hence, we did not use pre-trained embeddings for the results reported in Table 2.
We provide an illustration of the output of the Graph2Taxo model for the Environment domain in Figure 4. The generated taxonomy in this example contains multiple trees, which serve the purpose of generating taxonomical classifications. As future work, we plan to investigate strategies for connecting the subtrees into a larger graph for better DAG generation.

Conclusion
We have introduced Graph2Taxo, a GNN-based cross-domain knowledge transfer framework that makes use of a cross-domain graph structure in conjunction with acyclicity constraint-based DAG learning for taxonomy construction. Our model encodes acyclicity as a soft constraint, and the overall model outperforms the state of the art.
In the future, we would like to investigate strategies for merging the individual gains, obtained by separate applications of the DAG constraint, into a setup that combines the precision and recall improvements into a better performing system. We also plan to look into strategies for improving the recall of the constructed taxonomies.