End-to-End Reinforcement Learning for Automatic Taxonomy Induction

We present a novel end-to-end reinforcement learning approach to automatic taxonomy induction from a set of terms. While prior methods treat the problem as a two-phase task (i.e., detecting hypernymy pairs followed by organizing these pairs into a tree-structured hierarchy), we argue that such two-phase methods may suffer from error propagation and cannot effectively optimize metrics that capture the holistic structure of a taxonomy. In our approach, the representations of term pairs are learned from multiple sources of information and used to determine, via a policy network, which term to select and where to place it on the taxonomy. All components are trained end-to-end with cumulative rewards, measured by a holistic tree metric over the training taxonomies. Experiments on two public datasets from different domains show that our approach outperforms prior state-of-the-art taxonomy induction methods by up to 19.6% on ancestor F1.


Introduction
Many tasks in natural language understanding (e.g., information extraction (Demeester et al., 2016), question answering (Yang et al., 2017), and textual entailment (Sammons, 2012)) rely on lexical resources in the form of term taxonomies (cf. the rightmost column in Fig. 1). However, most existing taxonomies, such as WordNet (Miller, 1995) and Cyc (Lenat, 1995), are manually curated and thus may have limited coverage or be unavailable in some domains and languages. Therefore, recent efforts have focused on automatic taxonomy induction, which aims to organize a set of terms into a taxonomy based on relevant resources such as text corpora.
However, these two-phase methods encounter two major limitations. First, most of them ignore the taxonomy structure when estimating the probability that a term pair holds the hypernymy relation. They estimate the probabilities of different term pairs independently, and the learned term pair representations are fixed during hypernymy organization. Consequently, there is no feedback from the second phase to the first, and possibly wrong representations cannot be rectified based on the results of hypernymy organization, which causes the error propagation problem. Second, although some methods (Bansal et al., 2014; Zhang et al., 2016) do explore the taxonomy space by regarding the induction of taxonomy structure as inferring the conditional distribution of edges, they use the product of edge probabilities to represent the taxonomy quality. However, the edges are treated equally, while in reality they contribute to the taxonomy differently. For example, a high-level edge is likely to be more important than a bottom-level edge because it has much more influence on its descendants. In addition, these methods cannot explicitly capture the holistic taxonomy structure by optimizing global metrics.

Figure 1: An illustrative example showing the process of taxonomy induction. The input vocabulary V_0 is {"working dog", "pinscher", "shepherd dog", ...}, and the initial taxonomy T_0 is empty. We use a virtual "root" node to represent T_0 at t = 0. At time t = 5, there are 5 terms on the taxonomy T_5 and 3 terms left to be attached: V_t = {"shepherd dog", "collie", "affenpinscher"}. Suppose the term "affenpinscher" is selected and placed under "pinscher"; then the remaining vocabulary V_{t+1} at the next time step becomes {"shepherd dog", "collie"}. Finally, after |V_0| time steps, all the terms are attached to the taxonomy and V_{|V_0|} = V_8 = {}. A full taxonomy is then constructed from scratch.
To address the above issues, we propose to conduct hypernymy detection and organization jointly, learning term pair representations and constructing the taxonomy simultaneously. Since it is infeasible to estimate the quality of all possible taxonomies, we design an end-to-end reinforcement learning (RL) model to combine the two phases. Specifically, we train an RL agent that builds term pair representations from multiple sources of information and determines, via a policy network, which term to select and where to place it on the taxonomy. The feedback from hypernymy organization is propagated back to the hypernymy detection phase, based on which the term pair representations are adjusted. All components are trained in an end-to-end manner with cumulative rewards, measured by a holistic tree metric over the training taxonomies. The probability of a full taxonomy is no longer a simple aggregate of its edge probabilities; instead, we assess an edge based on how much it contributes to the overall quality of the taxonomy.
We perform two sets of experiments to evaluate the effectiveness of our proposed approach. First, we test the end-to-end taxonomy induction performance by comparing our approach with the state-of-the-art two-phase methods, and show that our approach outperforms them significantly on the quality of constructed taxonomies. Second, we use the same (noisy) hypernym graph as the input of all compared methods, and demonstrate that our RL approach does better hypernymy organization through optimizing metrics that can capture holistic taxonomy structure.
Contributions. In summary, we make the following contributions: (1) We propose a deep reinforcement learning approach that unifies hypernymy detection and organization so as to induce taxonomies in an end-to-end manner. (2) We design a policy network that incorporates semantic information of term pairs and uses cumulative rewards to measure the quality of constructed taxonomies holistically. (3) Experiments on two public datasets from different domains demonstrate the superior performance of our approach compared with state-of-the-art methods. We also show that our method can effectively reduce error propagation and capture global taxonomy structure.

Problem Definition
We define a taxonomy T = (V, R) as a tree-structured hierarchy with term set V (i.e., the vocabulary) and edge set R (which indicates the is-a relationships between terms). A term v ∈ V can be either a unigram or a multi-word phrase. The task of end-to-end taxonomy induction takes a set of training taxonomies and related resources (e.g., background text corpora) as input, and aims to learn a model that constructs a full taxonomy T by adding terms from a given vocabulary V_0 onto an initially empty hierarchy T_0, one at a time. An illustration of the taxonomy induction process is shown in Fig. 1.
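To make the construction process concrete, the loop below sketches it in Python. This is illustrative only; `choose_action` is a hypothetical stand-in for the learned policy described later.

```python
# Illustrative sketch (not the paper's code) of the induction loop in Fig. 1:
# a taxonomy is built from an empty tree T_0 by attaching one term per step.

def induce_taxonomy(vocab, choose_action):
    """Attach every term in `vocab` to a growing tree; returns the edge set R."""
    remaining = list(vocab)       # V_t: terms not yet on the taxonomy
    on_tree = ["<root>"]          # virtual root node representing T_0 at t = 0
    edges = []                    # (hyponym, hypernym) pairs, i.e. the set R
    while remaining:
        hypo, hyper = choose_action(remaining, on_tree)
        remaining.remove(hypo)    # (1) select a term, (2) remove it from V_t
        on_tree.append(hypo)      # (3) attach it under a term already on T_t
        edges.append((hypo, hyper))
    return edges                  # episode length equals |V_0|

# A trivial policy for demonstration: attach the first remaining term to the root.
edges = induce_taxonomy(["working dog", "pinscher"],
                        lambda V, T: (V[0], T[0]))
```

After `|V_0|` iterations the remaining vocabulary is empty and every term has exactly one parent, so the edge list defines a tree.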

Modeling Hypernymy Relation
Determining which term to select from V 0 and where to place it on the current hierarchy requires understanding of the semantic relationships between the selected term and all the other terms. We consider multiple sources of information (i.e., resources) for learning hypernymy relation representations of term pairs, including dependency path-based contextual embedding and distributional term embeddings (Shwartz et al., 2016).
Path-based Information. We extract the shortest dependency paths between each co-occurring term pair from sentences in the given background corpora. Each path is represented as a sequence of edges that goes from term x to term y in the dependency tree, and each edge consists of the word lemma, the part-of-speech tag, the dependency label, and the edge direction between two contiguous words. An edge is represented by the concatenation of the embeddings of its four components:

V_e = [V_lemma; V_pos; V_dep; V_dir].

Instead of treating the entire dependency path as a single feature, we encode the sequence of dependency edges V_e1, V_e2, ..., V_ek using an LSTM so that the model can focus on learning from the parts of the path that are more informative while ignoring others. We denote the final output of the LSTM for path p as O_p, and use P(x, y) to represent the set of all dependency paths between the term pair (x, y). A single vector representation of the term pair (x, y) is then computed as P_P(x,y), the weighted average of all its path representations, by applying average pooling:

P_P(x,y) = Σ_{p ∈ P(x,y)} c_(x,y)(p) · O_p / Σ_{p ∈ P(x,y)} c_(x,y)(p),

where c_(x,y)(p) denotes the frequency of path p in P(x, y). For term pairs without dependency paths, we use a randomly initialized empty path to represent them, as in Shwartz et al. (2016).
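As a rough illustration, the weighted average pooling over path encodings can be sketched as follows. The per-path LSTM is abstracted behind a hypothetical `encode_path` function, and the toy encoder and dimensions are ours, not the paper's configuration.

```python
import numpy as np

def pooled_path_rep(paths, counts, encode_path):
    """P_{P(x,y)}: frequency-weighted average of per-path encodings O_p."""
    O = np.stack([encode_path(p) for p in paths])   # (num_paths, dim)
    c = np.asarray(counts, dtype=float)             # c_{(x,y)}(p) frequencies
    return (c[:, None] * O).sum(axis=0) / c.sum()

# Toy stand-in for the LSTM encoder: mean of the edge vectors along the path.
encode = lambda p: np.asarray(p, dtype=float).mean(axis=0)

# Two paths (each a list of 2-dim edge vectors), with frequencies 3 and 1.
rep = pooled_path_rep([[[1.0, 2.0]], [[3.0, 4.0]]], [3, 1], encode)
```

Frequent paths thus dominate the pooled representation, while rare (possibly noisy) paths are down-weighted.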
Distributional Term Embedding. The path-based features above are only applicable when two terms co-occur in a sentence. In our experiments, however, we found that only about 17% of term pairs have sentence-level co-occurrences. To alleviate this sparse co-occurrence issue, we concatenate the path representation P_P(x,y) with the word embeddings of x and y, which capture the distributional semantics of the two terms.
Surface String Features. In practice, even the embeddings of many terms are missing, because the terms in the input vocabulary may be multi-word phrases, proper nouns, or named entities, which are often not covered by external pre-trained word embeddings. To address this issue, we utilize several surface features described in previous studies (Yang and Callan, 2009; Bansal et al., 2014; Zhang et al., 2016): Capitalization, Ends with, Contains, Suffix match, Longest common substring, and Length difference. These features are effective for detecting hypernymy based solely on the term pairs themselves.
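A minimal sketch of how such surface features might be computed. The helper names and raw feature values are illustrative; the paper bins these values before embedding them.

```python
# Illustrative surface string features for a term pair (x, y); the exact
# binning and encoding used in the paper may differ.

def common_suffix_len(x, y):
    """Length of the longest common suffix of x and y."""
    n = 0
    while n < min(len(x), len(y)) and x[-1 - n] == y[-1 - n]:
        n += 1
    return n

def longest_common_substring(x, y):
    """Brute-force longest common substring (fine for short terms)."""
    best = ""
    for i in range(len(x)):
        for j in range(i + 1, len(x) + 1):
            if x[i:j] in y and j - i > len(best):
                best = x[i:j]
    return best

def surface_features(x, y):
    return {
        "capitalization": (x[0].isupper(), y[0].isupper()),
        "ends_with": x.endswith(y),      # e.g. "coffee filter" ends with "filter"
        "contains": y in x,
        "suffix_match": common_suffix_len(x, y),
        "lcs": len(longest_common_substring(x, y)),
        "length_diff": len(x) - len(y),
    }

feats = surface_features("coffee filter", "filter")
```

For multi-word terms like "coffee filter" vs. "filter", the Ends-with and Contains features alone are strong hypernymy signals, which matches the substring-inclusion intuition discussed in the case study.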
Frequency and Generality Features. Another feature source that we employ is the hypernym candidates from TAXI (Panchenko et al., 2016). These hypernym candidates are extracted by lexico-syntactic patterns and may be noisy. Since only the term pairs and their co-occurrence frequencies (under specific patterns) are available, we cannot recover the dependency paths between these terms. Thus, we design two features similar to those used in Panchenko et al. (2016) and Gupta et al. (2017):

• Normalized Frequency Diff. For a hyponym-hypernym pair (x_i, x_j), where x_i is the hyponym and x_j is the hypernym, its normalized frequency is defined as freq_n(x_i, x_j) = freq(x_i, x_j) / max_k freq(x_i, x_k), where freq(x_i, x_j) is the raw frequency of the candidate pair. The feature value is freq_n(x_i, x_j) − freq_n(x_j, x_i), which down-ranks synonyms and co-hyponyms. Intuitively, a higher score indicates a higher probability that the term pair holds the hypernymy relation.
• Generality Diff. The generality g(x) of a term x is defined as the logarithm of the number of its distinct hyponyms, i.e., g(x) = log(1 + |hypo(x)|), where hypo(x) is the set of terms h such that (h, x) is a hypernym candidate. A high g(x) implies that x is general, since it has many distinct hyponyms. The generality of a term pair is defined as the difference in generality between x_j and x_i: g(x_j) − g(x_i). This feature promotes term pairs at the right level of generality and penalizes term pairs that are either too general or too specific.
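Under our reading of these definitions, the two features can be sketched as follows. The `candidates` structure and helper names are hypothetical, and the paper's exact normalization may differ in detail.

```python
import math

# `candidates` maps a hyponym to {hypernym: pattern co-occurrence frequency},
# i.e. the noisy hypernym candidates extracted by lexico-syntactic patterns.

def norm_freq(candidates, xi, xj):
    """freq(xi, xj) normalized by xi's most frequent hypernym candidate."""
    freqs = candidates.get(xi, {})
    return freqs.get(xj, 0) / max(freqs.values()) if freqs else 0.0

def norm_freq_diff(candidates, xi, xj):
    # Symmetric pairs (synonyms, co-hyponyms) score near zero and are down-ranked.
    return norm_freq(candidates, xi, xj) - norm_freq(candidates, xj, xi)

def generality(candidates, x):
    """g(x) = log(1 + number of distinct hyponym candidates of x)."""
    hypos = {h for h, hypers in candidates.items() if x in hypers}
    return math.log(1 + len(hypos))

def generality_diff(candidates, xi, xj):
    return generality(candidates, xj) - generality(candidates, xi)

cands = {"collie": {"dog": 10, "animal": 2}, "pug": {"dog": 5}}
```

On this toy data, "dog" has two distinct hyponym candidates while "collie" has none, so the generality difference for (collie, dog) is positive, as intended for a true hyponym-hypernym pair.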
The surface, frequency, and generality features are binned and their embeddings are concatenated as part of the term pair representation. In summary, the final term pair representation R_xy has the following form:

R_xy = [P_P(x,y); V_wx; V_wy; V_F(x,y)],

where P_P(x,y), V_wx, V_wy, and V_F(x,y) denote the path representation, the word embeddings of x and y, and the feature embeddings, respectively.
Our approach is general and can be flexibly extended to incorporate different feature representation components introduced by other relation extraction models (Zhang et al., 2017;Lin et al., 2016;Shwartz et al., 2016). We leave in-depth discussion of the design choice of hypernymy relation representation components as future work.

Reinforcement Learning for End-to-End Taxonomy Induction
We present the reinforcement learning (RL) approach to taxonomy induction in this section. The RL agent employs the term pair representations described in Section 2.2 as input, and explores how to generate a whole taxonomy by selecting one term at each time step and attaching it to the current taxonomy. We first describe the environment, including the actions, states, and rewards. Then, we introduce how to choose actions via a policy network.

Actions
We regard the process of building a taxonomy as making a sequence of actions. Specifically, an action a_t at time step t is to (1) select a term x_1 from the remaining vocabulary V_t; (2) remove x_1 from V_t; and (3) attach x_1 as a hyponym of one term x_2 that is already on the current taxonomy T_t. Therefore, the size of the action space at time step t is |V_t| × |T_t|, where |V_t| is the size of the remaining vocabulary V_t and |T_t| is the number of terms on the current taxonomy. At the beginning of each episode, the remaining vocabulary V_0 is equal to the input vocabulary and the taxonomy T_0 is empty. During the taxonomy induction process, the following relations always hold: every term is either in V_t or on T_t but never both, i.e., V_t ∩ T_t = ∅ and |V_t| + |T_t| = |V_0|. The episode terminates when all the terms are attached to the taxonomy, which makes the length of one episode equal to |V_0|.

A remaining issue is how to select the first term when no terms are on the taxonomy. One approach that we tried is to add a virtual node as the root and treat it as if it were a real node. The root embedding is randomly initialized and updated with the other parameters. This approach presumes that all taxonomies share a common root representation and expects to find the real root of a taxonomy via the term pair representations between the virtual root and the other terms. Another approach that we explored is to postpone the decision of the root: T is initialized with a random term as the current root at the beginning of an episode, and a new root can be selected by attaching a term as the hypernym of the current root. In this way, it overcomes the lack of prior knowledge when the first term is chosen. The size of the action space then becomes |A_t| = |V_t| × |T_t| + |V_t|, and the length of one episode becomes |V_0| − 1. We compare the performance of the two approaches in Section 4.
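A small sketch of the resulting action space (names are illustrative; the `allow_new_root` flag corresponds to the second, new-root approach):

```python
# Enumerate the actions available at step t. Without new-root actions the
# size is |V_t| * |T_t|; with them it is |V_t| * |T_t| + |V_t|.

def action_space(remaining, on_tree, allow_new_root=False):
    # (x1, x2): attach new term x1 as a hyponym of x2 already on the tree.
    actions = [(x1, x2) for x1 in remaining for x2 in on_tree]
    if allow_new_root:
        # Alternatively, attach a remaining term as the hypernym of the
        # current root, making it the new root.
        root = on_tree[0]
        actions += [(root, x1) for x1 in remaining]
    return actions

A = action_space(["collie", "affenpinscher"], ["dog", "pinscher"])
```

Here |V_t| = |T_t| = 2, so the basic action space has 4 term pairs; enabling new-root actions adds one more per remaining term.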

States
The state s_t at time t comprises the current taxonomy T_t and the remaining vocabulary V_t. At each time step, the environment provides the information of the current state, based on which the RL agent takes an action. Once a term pair (x_1, x_2) is selected, the position of the new term x_1 is automatically determined: the other term x_2 is already on the taxonomy, so we simply attach x_1 by adding an edge between x_1 and x_2.

Rewards
The agent receives a scalar reward as feedback on its actions to learn its policy. One obvious reward is to wait until the end of taxonomy induction and then compare the predicted taxonomy with the gold taxonomy. However, this reward is delayed and makes it difficult to assess individual actions in our scenario. Instead, we use reward shaping, i.e., giving intermediate rewards at each time step, to accelerate the learning process. Empirically, we set the reward r_t at time step t to be the difference in Edge-F1 (defined in Section 4.2 and evaluated by comparing the current taxonomy with the gold taxonomy) between the current and the previous time step: r_t = F1e_t − F1e_{t−1}. If the current Edge-F1 is better than that at the previous time step, the reward is positive, and vice versa. The cumulative reward from the current time step to the end of an episode cancels the intermediate rewards and thus reflects whether the current action improves the overall performance or not. As a result, the agent does not focus only on the selection of the current term pair but takes a long-term view that accounts for subsequent actions. For example, suppose there are two candidate actions at the same time step: one attaches a leaf node to a high-level node, and the other attaches a non-leaf node to the same high-level node. Both attachments form a wrong edge, but the latter is likely to receive a higher cumulative reward because its subsequent attachments are more likely to be correct.

Figure 2: The architecture of the policy network. The dependency paths are encoded and concatenated with word embeddings and feature embeddings, and then fed into a two-layer feed-forward network.
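The shaped reward can be sketched as below. Helper names are ours; note that summing r_t over an episode telescopes to the final Edge-F1, which is what makes the intermediate rewards cancel in the cumulative reward.

```python
def edge_f1(pred_edges, gold_edges):
    """Edge-F1 between predicted and gold (hyponym, hypernym) edges."""
    if not pred_edges or not gold_edges:
        return 0.0
    tp = len(set(pred_edges) & set(gold_edges))
    p, r = tp / len(pred_edges), tp / len(gold_edges)
    return 2 * p * r / (p + r) if p + r else 0.0

def shaped_reward(edges_now, edges_prev, gold):
    """r_t = F1e_t - F1e_{t-1}: positive iff the new attachment helps Edge-F1."""
    return edge_f1(edges_now, gold) - edge_f1(edges_prev, gold)

gold = [("collie", "shepherd dog"), ("shepherd dog", "working dog")]
r = shaped_reward([("collie", "shepherd dog")], [], gold)  # a correct attachment
```

A correct attachment yields a positive r, while an attachment that introduces a wrong edge lowers precision and yields a negative r.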

Policy Network
After introducing the term pair representations and defining the states, actions, and rewards, the remaining problem is how to choose an action from the action space, i.e., which term pair (x_1, x_2) should be selected given the current state? To solve this, we parameterize each action a by a policy network π(a | s; W_RL). The architecture of our policy network is shown in Fig. 2. For each term pair, its representation is obtained from the path LSTM encoder, the word embeddings of both terms, and the embeddings of the features. By stacking the term pair representations, we obtain an action matrix A_t of size (|V_t| × |T_t|) × dim(R), where |V_t| × |T_t| is the number of possible actions (term pairs) at time t and dim(R) is the dimension of the term pair representation R. A_t is then fed into a two-layer feed-forward network followed by a softmax layer that outputs the probability distribution over actions. Finally, an action a_t is sampled from this distribution over the action space: a_t ∼ π(a | s_t; W_RL). At inference time, instead of sampling an action from the probability distribution, we greedily select the term pair with the highest probability.
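A minimal numerical sketch of the scoring-and-sampling step. Layer sizes, initialization, and the ReLU nonlinearity are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hidden = 8, 4                        # illustrative sizes
W1 = rng.normal(size=(dim, hidden))
W2 = rng.normal(size=(hidden, 1))

def action_probs(A):
    """A: (num_actions, dim) matrix of stacked term pair representations R_xy."""
    scores = np.maximum(A @ W1, 0.0) @ W2  # two-layer feed-forward network
    e = np.exp(scores - scores.max())      # numerically stable softmax
    return (e / e.sum()).ravel()

A = rng.normal(size=(6, dim))              # e.g. |V_t| * |T_t| = 6 actions
probs = action_probs(A)
a_train = int(rng.choice(len(probs), p=probs))  # sample during training
a_infer = int(probs.argmax())                   # greedy pick at inference
```

Sampling during training encourages exploration of the taxonomy space, while the greedy argmax at inference deterministically builds the highest-probability tree.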
We use REINFORCE (Williams, 1992), an instance of the policy gradient methods, as the optimization algorithm. Specifically, for each episode, the weights of the policy network are updated as follows:

W_RL ← W_RL + α Σ_t v_t ∇ log π(a_t | s_t; W_RL),

where v_i = Σ_{t=i}^{T} γ^{t−i} r_t is the cumulative future reward at time i, γ ∈ [0, 1] is a discounting factor of future rewards, and α is the learning rate.
To reduce variance, we run 10 rollouts for each training sample and average the rewards. Another common strategy for variance reduction is to use a baseline: the agent receives the difference between the actual reward and a baseline reward instead of the actual reward directly. For simplicity, we use a moving average of the reward as the baseline.
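The discounted returns and the moving-average baseline can be sketched as follows. The baseline update rule and its decay constant are our assumptions; the paper only specifies "a moving average of the reward".

```python
def discounted_returns(rewards, gamma=0.4):
    """v_i = sum_{t >= i} gamma^(t-i) * r_t, computed right to left."""
    v, out = 0.0, []
    for r in reversed(rewards):
        v = r + gamma * v
        out.append(v)
    return out[::-1]

class MovingAverageBaseline:
    """Baseline b updated as b <- beta*b + (1-beta)*v (decay is our choice)."""
    def __init__(self, beta=0.9):
        self.value, self.beta = 0.0, beta

    def advantage(self, ret):
        adv = ret - self.value                 # feed this to REINFORCE
        self.value = self.beta * self.value + (1 - self.beta) * ret
        return adv

# v_2 = 0.5; v_1 = 1.0 + 0.4*0.5 = 1.2; v_0 = 0.0 + 0.4*1.2 = 0.48
returns = discounted_returns([0.0, 1.0, 0.5])
```

Subtracting the baseline leaves the gradient unbiased while shrinking its variance, which typically stabilizes REINFORCE training.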

Implementation Details
We use pre-trained GloVe word vectors (Pennington et al., 2014) with dimensionality 50 as word embeddings. We limit the maximum number of dependency paths between each term pair to 200, because term pairs containing general terms may have too many dependency paths. We run with different random seeds and hyperparameters and use the validation set to pick the best model. We use the Adam optimizer with an initial learning rate of 10^-3. We set the discounting factor γ to 0.4, since using a smaller discount factor than the one defining the problem can be viewed as a form of regularization (Jiang et al., 2015). Since parameter updates are performed at the end of each episode, we cache the term pair representations and reuse them when the same term pairs are encountered again in the same episode. As a result, the proposed approach is time-efficient: each training epoch takes less than 20 minutes on a single-core CPU using DyNet (Neubig et al., 2017).

Experiments
We design two experiments to demonstrate the effectiveness of our proposed RL approach to taxonomy induction. First, we compare our end-to-end approach with two-phase methods and show that our approach yields taxonomies of higher quality by reducing error propagation and optimizing toward holistic metrics. Second, we conduct a controlled experiment on hypernymy organization in which the same hypernym graph is used as the input to both our approach and the compared methods. We show that our RL method is more effective at hypernymy organization.

Experiment Setup
Here we describe the setup of our two experiments, which validate that (1) the proposed approach can effectively reduce error propagation; and (2) our approach yields better taxonomies by optimizing metrics over the holistic taxonomy structure.
Performance Study on End-to-End Taxonomy Induction. In the first experiment, we show that our joint learning approach is superior to two-phase methods. Toward this goal, we compare with TAXI (Panchenko et al., 2016), a typical two-phase approach; two-phase HypeNET, implemented as pairwise hypernymy detection followed by hypernymy organization using MST; and Bansal et al. (2014). The dataset we use in this experiment is from Bansal et al. (2014): a set of medium-sized full-domain taxonomies consisting of bottom-out full subtrees sampled from WordNet. Terms in different taxonomies come from various domains, such as animals, general concepts, and daily necessities. Each taxonomy is of height four (i.e., 4 nodes from root to leaf) and contains (10, 50] nodes. The dataset contains 761 non-overlapping taxonomies in total and is partitioned 70/15/15% (533/114/114) into training, validation, and test sets, respectively.
Testing on Hypernymy Organization. In the second experiment, we show that our approach is better at hypernymy organization by leveraging the global taxonomy structure. For a fair comparison, we reuse the hypernym graph from TAXI (Panchenko et al., 2016) and SubSeq (Gupta et al., 2017) so that the inputs of all compared models are the same. Specifically, we restrict the action space to the term pairs in the hypernym graph, rather than all |V| × |T| possible term pairs, matching the baselines. As a result, it is possible that at some point no more hypernym candidates can be found even though the remaining vocabulary is not yet empty. If the induction terminates at this point, we call it partial induction. We can also continue the induction by restoring the original action space at this moment, so that all the terms in V are eventually attached to the taxonomy; we call this setting full induction. In this experiment, we use the English environment and science taxonomies from SemEval-2016 Task 13 (TExEval-2) (Bordea et al., 2016). Each taxonomy is composed of hundreds of terms, much larger than the WordNet taxonomies. The taxonomies are aggregated from existing resources such as WordNet, Eurovoc, and the Wikipedia Bitaxonomy (Flati et al., 2014). Since this dataset provides no training data, we train our model on the WordNet dataset from the first experiment. To avoid possible overlap between these two sources, we exclude those taxonomies constructed from WordNet.
In both experiments, we combine three public corpora: the latest Wikipedia dump, the UMBC web-based corpus (Han et al., 2013), and the One Billion Word Language Modeling Benchmark (Chelba et al., 2013). Only sentences in which term pairs co-occur are retained, which results in a corpus of 2.6 GB for the WordNet dataset and 810 MB for the TExEval-2 dataset. Dependency paths between term pairs are extracted from the corpus via spaCy (https://spacy.io/).

Table 1: Results of the end-to-end taxonomy induction experiment. Our approach significantly outperforms two-phase methods (Panchenko et al., 2016; Shwartz et al., 2016; Bansal et al., 2014). Bansal et al. (2014) and TaxoRL (NR) + FG are listed separately because they use extra resources.

Evaluation Metrics
Ancestor-F1. This metric compares the ancestors ("is-a" pairs) on the predicted taxonomy with those on the gold taxonomy. We use P_a, R_a, and F1_a to denote precision, recall, and F1-score, respectively:

P_a = |is-a_pred ∩ is-a_gold| / |is-a_pred|,  R_a = |is-a_pred ∩ is-a_gold| / |is-a_gold|,  F1_a = 2 · P_a · R_a / (P_a + R_a).

Edge-F1. This metric is stricter than Ancestor-F1 since it compares only the predicted edges with the gold edges. We denote the edge-based metrics as P_e, R_e, and F1_e, respectively. Note that P_e = R_e = F1_e when the number of predicted edges equals the number of gold edges.
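Both metrics can be sketched for tree-shaped predictions as follows. Implementation details such as the transitive-closure computation are ours; it assumes each node has a single parent, as in a tree.

```python
def ancestors(edges):
    """All (descendant, ancestor) pairs implied by (child, parent) edges."""
    parent = dict(edges)               # tree: each child has one parent
    pairs = set()
    for child in parent:
        node = child
        while node in parent:          # walk up to the root
            node = parent[node]
            pairs.add((child, node))
    return pairs

def f1(pred, gold):
    """Generic set-overlap F1, used for both edge and ancestor pairs."""
    pred, gold = set(pred), set(gold)
    if not pred or not gold:
        return 0.0
    p = len(pred & gold) / len(pred)
    r = len(pred & gold) / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

gold = [("collie", "shepherd dog"), ("shepherd dog", "dog")]
pred = [("collie", "dog"), ("shepherd dog", "dog")]  # collie misattached
e_f1 = f1(pred, gold)                        # Edge-F1: direct edges only
a_f1 = f1(ancestors(pred), ancestors(gold))  # Ancestor-F1: all is-a pairs
```

This toy example shows why Edge-F1 is stricter: attaching "collie" to "dog" rather than "shepherd dog" gets no edge credit, yet (collie, dog) is still a correct ancestor pair, so Ancestor-F1 is higher.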

Results
Comparison on End-to-End Taxonomy Induction. Table 1 shows the results of the first experiment. HypeNET (Shwartz et al., 2016) uses the additional surface features described in Section 2.2. HypeNET+MST extends HypeNET by first constructing a hypernym graph using HypeNET's outputs as edge weights and then finding the MST (Chu, 1965) of this graph. TaxoRL (RE) denotes our RL approach with a common Root Embedding, and TaxoRL (NR) denotes its variant that allows a New Root to be added. We can see that TAXI has the lowest F1_a while HypeNET performs the worst in F1_e; both TAXI's and HypeNET's F1_a and F1_e are lower than 30. HypeNET+MST outperforms HypeNET in both F1_a and F1_e because it considers the global taxonomy structure, although its two phases are still performed independently. TaxoRL (RE) uses exactly the same input as HypeNET+MST and yet achieves significantly better performance, which demonstrates the superiority of combining the phases of hypernymy detection and hypernymy organization. Also, we found that presuming a shared root embedding for all taxonomies can be inappropriate if they come from different domains, which explains why TaxoRL (NR) performs better than TaxoRL (RE). Finally, after adding the frequency and generality features (TaxoRL (NR) + FG), our approach outperforms Bansal et al. (2014), even though a much smaller corpus is used. (Bansal et al. (2014) use an unavailable resource (Brants and Franz, 2006) containing one trillion tokens, while our public corpus contains several billion tokens. In addition, the frequency and generality features are sparse, because the vocabulary that TAXI used for focused crawling and hypernymy detection in the TExEval-2 competition was different.)

Table 2: Results of the hypernymy organization experiment. Our approach outperforms Panchenko et al. (2016) and Gupta et al. (2017) when the same hypernym graph is used as input. The precision of partial induction is high in both metrics; the precision of full induction is relatively lower but its recall is much higher.

Analysis on Hypernymy Organization. Table 2 lists the results of the second experiment. TAXI (DAG) (Panchenko et al., 2016) denotes TAXI's original performance on the TExEval-2 dataset (alt.qcri.org/semeval2016/task13/index.php?id=evaluation). Since we do not allow DAGs in our setting, we convert its results to trees (denoted TAXI (tree)) by keeping only the first parent of each node. SubSeq (Gupta et al., 2017) also reuses TAXI's hypernym candidates. TaxoRL (Partial) and TaxoRL (Full) denote partial induction and full induction, respectively. Our joint RL approach substantially outperforms the baselines in both domains. TaxoRL (Partial) achieves higher precision in both ancestor-based and edge-based metrics but has relatively lower recall since it discards some terms; it also achieves the best F1_e in the science domain. TaxoRL (Full) has the highest recall in both domains and metrics, at the cost of lower precision. Overall, TaxoRL (Full) performs best in both domains in terms of F1_a and achieves the best F1_e in the environment domain.

Ablation Analysis and Case Study
In this section, we conduct an ablation analysis and present a concrete case study to better interpret our model and experimental results. Table 3 shows the ablation study of TaxoRL (NR) on the WordNet dataset. We can see that the different types of features are complementary: combining distributional and path-based features performs better than using either alone (Shwartz et al., 2016), and adding surface features helps model string-level statistics that are hard to capture by distributional or path-based features. A significant improvement is observed when more data is used, which suggests that standard corpora (such as Wikipedia) might not be enough for complicated taxonomies like WordNet.

Fig. 3 shows the induced taxonomy about filter. We denote the term pair selected at time step t as (hypo, hyper, t). Initially, the term water filter is randomly chosen as the taxonomy root. Then, a wrong term pair (water filter, air filter, 1) is selected, possibly due to the noise and sparsity of the features, which makes the term air filter the new root. (air filter, filter, 2) is selected next, and the current root becomes filter, which is identical to the real root. After that, term pairs such as (fuel filter, filter, 3) and (coffee filter, filter, 4) are selected correctly, mainly because of the substring-inclusion intuition. Other term pairs, such as (colander, strainer, 13) and (glass wool, filter, 16), are discovered later, largely through the information encoded in the dependency paths and embeddings. As for the undiscovered relations: (filter tip, air filter) has no dependency path in the corpus; sifter is attached to the taxonomy before its hypernym sieve; and there is no co-occurrence between bacteria bed (or drain basket) and other terms. In addition, it is hard to utilize the surface features for these terms since they "look different" from the other terms.
That is also why (bacteria bed, air filter, 17) and (drain basket, air filter, 18) are attached at the end: our approach prefers to select term pairs with high confidence first.

Table 3: Ablation study on the WordNet dataset (Bansal et al., 2014). P_e and R_e are omitted because they are the same as F1_e for each model. Our approach benefits from multiple sources of information, which are complementary to each other.
Related Work

Hypernymy Detection
Finding high-quality hypernyms is of great importance since it serves as the first step of taxonomy induction. Prior approaches to hypernymy detection fall mainly into two categories: pattern-based and distributional methods. Pattern-based methods consider lexico-syntactic patterns between the joint occurrences of term pairs for hypernymy detection. They generally achieve high precision but suffer from low recall. Typical methods that leverage patterns for hypernym extraction include (Hearst, 1992; Snow et al., 2005; Kozareva and Hovy, 2010; Panchenko et al., 2016; Nakashole et al., 2012). Distributional methods leverage the contexts of each term separately, so the co-occurrence of term pairs is unnecessary. Some distributional methods are unsupervised: measures such as symmetric similarity (Lin et al., 1998) and those based on the distributional inclusion hypothesis (Weeds et al., 2004; Chang et al., 2017) have been proposed. Supervised methods, on the other hand, usually outperform unsupervised methods for hypernymy detection; recent work in this direction includes (Fu et al., 2014; Rimell, 2014; Yu et al., 2015; Tuan et al., 2016; Shwartz et al., 2016).

Taxonomy Induction
There are several lines of prior work on taxonomy induction. One line (Snow et al., 2005; Yang and Callan, 2009; Shen et al., 2012; Jurgens and Pilehvar, 2015) aims to complete existing taxonomies by attaching new terms incrementally. Snow et al. (2005) enrich WordNet by maximizing the probability of an extended taxonomy given evidence of relations from text corpora. Shen et al. (2012) determine whether an entity is on the taxonomy and either attach it to the right category or link it to an existing entity accordingly. Another line of work (Suchanek et al., 2007; Ponzetto and Strube, 2008; Flati et al., 2014) focuses on taxonomy induction over existing encyclopedias (e.g., Wikipedia), mainly by exploiting the fact that they are already organized into semi-structured data. To deal with the issue of incomplete coverage, some works (Liu et al., 2012; Dong et al., 2014; Panchenko et al., 2016; Kozareva and Hovy, 2010) utilize data from domain-specific resources or the Web. Panchenko et al. (2016) extract hypernyms by patterns from general-purpose corpora and from domain-specific corpora bootstrapped from the input vocabulary. Kozareva and Hovy (2010) harvest new terms from the Web using Hearst-like lexico-syntactic patterns and validate the learned is-a relations with a web-based concept positioning procedure.

Many works (Kozareva and Hovy, 2010; Anh et al., 2014; Velardi et al., 2013; Bansal et al., 2014; Zhang et al., 2016; Panchenko et al., 2016; Gupta et al., 2017) cast the task of hypernymy organization as a graph optimization problem. Kozareva and Hovy (2010) begin with a set of root terms and leaf terms and generate intermediate terms by deriving the longest path from root to leaf in a noisy hypernym graph. Velardi et al. (2013) induce a taxonomy from the hypernym graph via optimal branching and a weighting policy. Bansal et al. (2014) regard taxonomy induction as a structured learning problem, building a factor graph to model the relations between edges and siblings, and output the MST found by the Chu-Liu/Edmonds algorithm (Chu, 1965). Zhang et al. (2016) propose a probabilistic Bayesian model that incorporates visual features (images) in addition to text features (words) to improve performance; the optimal taxonomy is also found by MST. Gupta et al. (2017) extract hypernym subsequences from hypernym pairs and regard taxonomy induction as an instance of the minimum-cost flow problem.

Conclusion and Future Work
This paper presents a novel end-to-end reinforcement learning approach to automatic taxonomy induction. Unlike previous two-phase methods that treat term pairs independently or equally, our approach learns the representations of term pairs by optimizing a holistic tree metric over the training taxonomies. The error propagation between the two phases is thus effectively reduced, and the global taxonomy structure is better captured. Experiments on two public datasets from different domains show that our approach significantly outperforms state-of-the-art methods. In the future, we will explore more strategies for term pair selection (e.g., allowing the RL agent to remove terms from the taxonomy) and for reward function design. It would also be interesting to study how to effectively encode the induction history.