Transductive Non-linear Learning for Chinese Hypernym Prediction

Finding the correct hypernyms for entities is essential for taxonomy learning, fine-grained entity categorization, query understanding, etc. Due to the flexibility of the Chinese language, it is challenging to identify hypernyms in Chinese accurately. Rather than extracting hypernyms from texts, in this paper, we present a transductive learning approach to establish mappings from entities to hypernyms in the embedding space directly. It combines linear and non-linear embedding projection models, with the capacity of encoding arbitrary language-specific rules. Experiments on real-world datasets illustrate that our approach outperforms previous methods for Chinese hypernym prediction.


Introduction
A hypernym of an entity characterizes the type or the class of the entity. For example, the word country is the hypernym of the entity Canada. The accurate prediction of hypernyms benefits a variety of NLP tasks, such as taxonomy learning (Wu et al., 2012;Fu et al., 2014), fine-grained entity categorization (Ren et al., 2016), knowledge base construction (Suchanek et al., 2007), etc.
In previous work, the detection of hypernyms requires lexical, syntactic and/or semantic analysis of relations between entities and their respective hypernyms from a language-specific knowledge source. For example, Hearst (1992) is the pioneering work on extracting is-a relations from a text corpus based on handcrafted patterns. Follow-up work mostly focuses on is-a relation extraction using automatically generated patterns (Snow et al., 2004; Ritter et al., 2009; Sang and Hofmann, 2009; Kozareva and Hovy, 2010) and relation inference based on distributional similarity measures (Kotlerman et al., 2010; Lenci and Benotto, 2012; Shwartz et al., 2016).
While these approaches have relatively high precision over English corpora, extracting hypernyms for entities is still challenging for Chinese. From the linguistic perspective, Chinese is a lower-resourced language with very flexible expressions and grammatical rules. For instance, Chinese has no word spaces, no explicit tenses or voices, and no distinction between singular and plural forms. The order of words can be changed flexibly in sentences. Hence, as previous research indicates, hypernym extraction methods for English are not necessarily suitable for the Chinese language (Fu et al., 2014; Wang and He, 2016).
Based on such conditions, several classification methods are proposed to distinguish is-a and not-is-a relations based on Chinese encyclopedias (Lu et al., 2015; Li et al., 2015). Similar to Princeton WordNet, a few Chinese wordnets have also been developed (Huang et al., 2004; Xu et al., 2008; Wang and Bond, 2013). The most recent approaches for Chinese is-a relation extraction (Fu et al., 2014; Wang and He, 2016) use word embedding based linear projection models to map embeddings of hyponyms to those of their hypernyms, which outperform previous algorithms.
However, we argue that these projection-based methods may have three potential limitations: (i) Only positive is-a relations are used for projection learning. The distinctions between is-a and not-is-a relations in the embedding space are not modeled. (ii) These methods lack the capacity to encode linguistic rules, which are designed by linguists and usually have high precision. (iii) They assume that the linguistic regularities of is-a relations can be solely captured by single or multiple linear projection models.
In this paper, we address these limitations by a two-stage transductive learning approach. It distinguishes is-a and not-is-a relations given a Chinese word/phrase pair as input. In the initial stage, we train linear projection models on positive and negative training data separately and predict is-a relations jointly. In the transductive learning stage, the initial prediction results, linguistic rules and the non-linear mappings from entities to hypernyms are optimized simultaneously in a unified framework. This optimization problem can be efficiently solved by blockwise gradient descent. We evaluate our method over two public datasets and show that it outperforms state-of-the-art approaches for Chinese hypernym prediction.
The rest of this paper is organized as follows. We summarize the related work in Section 2. Our approach is introduced in Section 3. Experimental results are presented in Section 4. We conclude our paper in Section 5.

Related Work
In this section, we overview the related work on hypernym prediction and discuss the challenges of Chinese hypernym detection.
Pattern based methods identify is-a relations from texts by handcrafted or automatically generated patterns. Hearst patterns (Hearst, 1992) are lexical patterns in English that are employed to extract is-a relations for taxonomy construction (Wu et al., 2012). Automatic approaches mostly use iterative learning paradigms such that the system learns new is-a relations and patterns simultaneously. A few relevant studies can be found in (Caraballo, 1999; Etzioni et al., 2004; Sang, 2007; Pantel and Pennacchiotti, 2006; Kozareva and Hovy, 2010). To avoid "semantic drift" in iterations, Snow et al. (2004) train a hypernym classifier on syntactic features derived from parse trees. Carlson et al. (2010) exploit multiple learners to extract relations via coupled learning. These approaches are not effective for Chinese for two reasons: i) Chinese is-a relations are expressed in a highly flexible manner (Fu et al., 2014) and ii) the accuracy of basic NLP tasks such as dependency parsing still needs improvement for Chinese (Li et al., 2013).
Inference based methods take advantage of distributional similarity measures (DSM) to infer relations between words. They assume that a hypernym may appear in all contexts of the hyponyms and a hyponym can only appear in part of the contexts of its hypernyms. In previous work, Kotlerman et al. (2010) design directional DSMs to model the asymmetric property of is-a relations. Other DSMs are introduced in (Bhagat et al., 2007;Szpektor et al., 2007;Lenci and Benotto, 2012;Santus et al., 2014). Shwartz et al. (2016) combine dependency parsing and DSM to improve the performance of hypernymy detection. The reason why DSM is not effective for Chinese is that the contexts of entities in Chinese are flexible and sparse.
Encyclopedia based methods take encyclopedias as knowledge sources to construct taxonomies. Ponzetto and Strube (2007) design features from multiple aspects to predict is-a relations between entities and categories in English Wikipedia. The taxonomy in YAGO (Suchanek et al., 2007) is constructed by linking conceptual categories in Wikipedia to WordNet synsets (Miller, 1995). For Chinese, Li et al. (2015) propose an SVM-based approach to build a large Chinese taxonomy from Wikipedia. Similar classification based algorithms are presented in (Fu et al., 2013; Lu et al., 2015). Due to the lack of a Chinese version of WordNet, several Chinese semantic dictionaries have been constructed, such as Sinica BOW (Huang et al., 2004), SEW (Xu et al., 2008), COW (Wang and Bond, 2013), etc. These approaches have higher accuracy than mining hypernym relations from texts directly. However, they heavily rely on existing knowledge sources and are difficult to extend to different domains.
To tackle these challenges, word embedding based methods directly model the task of hypernym prediction as learning a mapping from entity vectors to their respective hypernym vectors in the embedding space. The vectors can be pre-trained by neural language models (Mikolov et al., 2013). For the Chinese language, Fu et al. (2014) train piecewise linear projection models based on a Chinese thesaurus. The state-of-the-art method (Wang and He, 2016) combines an iterative learning procedure and Chinese Hearst-style patterns to improve the performance of projection models. These methods can reduce data noise by avoiding direct parsing of Chinese texts, but still capture the linguistic regularities of is-a relations based on word embeddings. Additionally, several studies investigate how to combine word embeddings for relation classification, such as (Mirza and Tonelli, 2016). In our paper, we extend these approaches by modeling non-linear mappings from entities to hypernyms and adding linguistic rules via a unified transductive learning framework.

Proposed Approach
This section begins with a brief overview of our approach, followed by the detailed steps and the learning algorithm.

Overview
Given a word/phrase pair $(x_i, y_i)$, the goal of our task is to learn a classification model to predict whether $y_i$ is the hypernym of $x_i$.
As illustrated in Figure 1, our approach has two stages: an initial stage and a transductive learning stage. The input is a positive is-a set $D^+$, a negative is-a set $D^-$ and an unlabeled set $D^U$, all of which are collections of word/phrase pairs. Denote $\mathbf{x}_i$ as the embedding vector of word $x_i$, pre-trained and stored in a lookup table. In the initial stage, we train a linear projection model over $D^+$ such that for each $(x_i, y_i) \in D^+$, a projection matrix maps the entity vector $\mathbf{x}_i$ to its hypernym vector $\mathbf{y}_i$. A similar model is also trained over $D^-$. Based on the two models, we estimate the prediction score and the confidence score for each $(x_i, y_i) \in D^U$. In the transductive learning stage, a joint optimization problem is formed to learn the final prediction score for each $(x_i, y_i) \in D^U$. It aims to minimize the prediction errors based on the human labeled data, the initial model prediction and linguistic rules. It also employs non-linear mappings to capture linguistic regularities of is-a relations other than linear projections.

Initial Model Training
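The initial stage models how entities are mapped to their hypernyms or non-hypernyms by projection learning. We first train a Skip-gram model (Mikolov et al., 2013) to learn word embeddings over a large text corpus. Inspired by (Fu et al., 2014; Wang and He, 2016), for each $(x_i, y_i) \in D^+$, we assume there is a positive projection model such that $\mathbf{M}^+\mathbf{x}_i \approx \mathbf{y}_i$, where $\mathbf{M}^+$ is an $|\mathbf{x}_i| \times |\mathbf{x}_i|$ projection matrix. However, this model does not capture the semantics of not-is-a relations. Thus, we learn a negative projection model $\mathbf{M}^-$ such that $\mathbf{M}^-\mathbf{x}_i \approx \mathbf{y}_i$ for each $(x_i, y_i) \in D^-$. This approach is equivalent to learning two separate translation models within the same semantic space. For parameter estimation, we minimize the two following objectives:
$$J(\mathbf{M}^+) = \frac{1}{2} \sum_{(x_i, y_i) \in D^+} \|\mathbf{M}^+\mathbf{x}_i - \mathbf{y}_i\|_2^2 + \frac{\lambda}{2} \|\mathbf{M}^+\|_F^2$$
$$J(\mathbf{M}^-) = \frac{1}{2} \sum_{(x_i, y_i) \in D^-} \|\mathbf{M}^-\mathbf{x}_i - \mathbf{y}_i\|_2^2 + \frac{\lambda}{2} \|\mathbf{M}^-\|_F^2$$
where $\lambda > 0$ is a Tikhonov regularization parameter (Golub et al., 1999). In the testing phase, for each $(x_i, y_i) \in D^U$, denote $d^+(x_i, y_i) = \|\mathbf{M}^+\mathbf{x}_i - \mathbf{y}_i\|_2$ and $d^-(x_i, y_i) = \|\mathbf{M}^-\mathbf{x}_i - \mathbf{y}_i\|_2$. The prediction score is defined as:
$$score(x_i, y_i) = \tanh\left(d^-(x_i, y_i) - d^+(x_i, y_i)\right)$$
where $score(x_i, y_i) \in (-1, 1)$. A higher prediction score indicates a larger probability of an is-a relation between $x_i$ and $y_i$. We choose the hyperbolic tangent function rather than the sigmoid function to avoid the widespread saturation of the sigmoid function (Menon et al., 1996). Because the semantics of Chinese is-a and not-is-a relations are complicated and difficult to model (Fu et al., 2014), we do not impose explicit connections between $\mathbf{M}^+$ and $\mathbf{M}^-$ and let the algorithm learn the parameters automatically.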
The difference between $d^+(x_i, y_i)$ and $d^-(x_i, y_i)$ can also be used to indicate whether the models are confident enough to make a prediction.
In this paper, we calculate the confidence score as:
$$conf(x_i, y_i) = \frac{|d^+(x_i, y_i) - d^-(x_i, y_i)|}{\max\{d^+(x_i, y_i), d^-(x_i, y_i)\}}$$
where $conf(x_i, y_i) \in (0, 1)$. A higher confidence score means that there is a larger probability that the models correctly predict whether there is an is-a relation between $x_i$ and $y_i$. This score gives different data instances different weights in the transductive learning stage.
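The initial stage can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the ridge objective is assumed to carry 1/2 factors so that the closed-form solution below applies, and the ratio form of the confidence score is an assumption chosen to stay within the stated range.

```python
import numpy as np

def fit_projection(X, Y, lam=1e-3):
    """Fit M so that M @ x_i is close to y_i (ridge-regularized least squares).

    X, Y: arrays of shape (n_pairs, dim), one row per entity / candidate hypernym.
    Assumed objective: 0.5 * sum_i ||M x_i - y_i||^2 + 0.5 * lam * ||M||_F^2.
    """
    dim = X.shape[1]
    A = X.T @ X + lam * np.eye(dim)   # sum_i x_i x_i^T + lam * I (symmetric)
    B = Y.T @ X                       # sum_i y_i x_i^T
    # Setting the gradient to zero gives M A = B, hence M = (A^{-1} B^T)^T.
    return np.linalg.solve(A, B.T).T

def score_and_conf(M_pos, M_neg, x, y):
    """Prediction score in (-1, 1) and a confidence value in [0, 1]."""
    d_pos = np.linalg.norm(M_pos @ x - y)   # distance under the is-a model
    d_neg = np.linalg.norm(M_neg @ x - y)   # distance under the not-is-a model
    score = np.tanh(d_neg - d_pos)          # positive when the is-a model fits better
    denom = max(d_pos, d_neg)
    # Assumed ratio form; guarded against both distances being zero.
    conf = abs(d_pos - d_neg) / denom if denom > 0 else 0.0
    return float(score), float(conf)
```

In use, `fit_projection` would be called once on the pairs in the positive set and once on the pairs in the negative set, and `score_and_conf` once per unlabeled pair.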

Transductive Non-linear Learning
Although linear projection methods are effective for Chinese hypernym prediction, they do not encode non-linear transformations and only leverage the positive data. We present an optimization framework for non-linear mapping utilizing both labeled and unlabeled data and linguistic rules by transductive learning (Gammerman et al., 1998; Chapelle et al., 2006).
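Let $F_i$ be the final prediction score of the word/phrase pair $(x_i, y_i)$. In the initialization stage of our algorithm, we set $F_i = 1$ if $(x_i, y_i) \in D^+$, $F_i = -1$ if $(x_i, y_i) \in D^-$, and set $F_i$ randomly in $(-1, 1)$ if $(x_i, y_i) \in D^U$. The three components in our transductive learning model are as follows: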

Initial Prediction
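Denote $\mathbf{S}$ as an $m \times 1$ initial prediction vector. We set $S_i = 1$ if $(x_i, y_i) \in D^+$, $S_i = -1$ if $(x_i, y_i) \in D^-$ and $S_i = score(x_i, y_i)$ if $(x_i, y_i) \in D^U$. In order to encode the confidence of model prediction, we define $\mathbf{W}$ as an $m \times m$ diagonal weight matrix. The element in the $i$th row and the $j$th column of $\mathbf{W}$ is set as follows:
$$w_{i,j} = \begin{cases} conf(x_i, y_i) & i = j,\ (x_i, y_i) \in D^U \\ 1 & i = j,\ (x_i, y_i) \in D^+ \cup D^- \\ 0 & \text{otherwise} \end{cases}$$
The objective function is defined as $O_s = \|\mathbf{W}(\mathbf{F} - \mathbf{S})\|_2^2$, which encodes the hypothesis that the final prediction should be similar to the initial prediction for unlabeled data or human labeling for training data. The weight matrix $\mathbf{W}$ gives the largest weight (i.e., 1) to all the pairs in $D^+ \cup D^-$ and a larger weight to the pair $(x_i, y_i) \in D^U$ if the initial prediction is more confident.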
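As a small illustration of how the initial prediction vector, the diagonal weights and the loss fit together, the sketch below uses plain Python lists. The function names and the convention of marking unlabeled pairs with `None` are assumptions made for this example only.

```python
def build_initial_prediction(labels, scores, confs):
    """Return (S, w), where w holds the diagonal of the weight matrix W.

    labels[i] is +1 or -1 for a labeled pair and None for an unlabeled pair.
    scores[i] and confs[i] are only read for unlabeled pairs.
    """
    S, w = [], []
    for label, score, conf in zip(labels, scores, confs):
        if label is None:
            S.append(score)      # initial model prediction
            w.append(conf)       # weight grows with model confidence
        else:
            S.append(float(label))
            w.append(1.0)        # human labels get the largest weight
    return S, w

def initial_prediction_loss(F, S, w):
    """O_s = ||W (F - S)||_2^2 for a diagonal W with entries w."""
    return sum((wi * (fi - si)) ** 2 for fi, si, wi in zip(F, S, w))
```

Because the weight matrix is diagonal, the loss is a sum of independent per-pair terms, so a low-confidence unlabeled pair can move away from its initial score at little cost.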

Linguistic Rules
Although linguistic rules can only cover a few circumstances, they are effective in guiding the learning process. For Chinese hypernym prediction, Li et al. (2015) study the word formation of conceptual categories in Chinese Wikipedia. In our model, let $C$ be the collection of linguistic rules, and let $\gamma_i$ be the true positive (or negative) rate of the respective positive (or negative) rule $c_i \in C$, estimated over the training set. Considering the word formation of Chinese entities and hypernyms, we design one positive rule (i.e., P1) and two negative rules (i.e., N1 and N2), shown in Table 1.
Let $\mathbf{R}$ be an $m \times 1$ linguistic rule vector, where $R_i$ is the $i$th element of $\mathbf{R}$. For training data, we set $R_i = 1$ if $(x_i, y_i) \in D^+$ and $R_i = -1$ if $(x_i, y_i) \in D^-$, which follows the same settings as those in $\mathbf{S}$. For unlabeled pairs that do not match any linguistic rules in $C$, we update $R_i = F_i$ in each iteration of the learning process, meaning that no loss is imposed in this part.
For other conditions, denote $C_{(x_i, y_i)} \subseteq C$ as the collection of rules that $(x_i, y_i)$ matches. If $C_{(x_i, y_i)}$ are positive rules, we set $R_i$ as follows:
$$R_i = \max\{F_i, \max_{c_j \in C_{(x_i, y_i)}} \gamma_j\}$$
Similarly, if $C_{(x_i, y_i)}$ are negative rules, we have:
$$R_i = -\max\{-F_i, \max_{c_j \in C_{(x_i, y_i)}} \gamma_j\}$$
which means $F_i$ receives a penalty only if $F_i < \max_{c_j \in C_{(x_i, y_i)}} \gamma_j$ for pairs that match positive rules or $F_i > -\max_{c_j \in C_{(x_i, y_i)}} \gamma_j$ for negative rules. The objective function is $O_r = \|\mathbf{F} - \mathbf{R}\|_2^2$. In this way, our model can integrate arbitrary "soft" constraints, making it robust to false positives or negatives introduced by these rules.
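The rule target for a single pair can be written as a short function. This is a sketch under assumptions: the function name and argument layout are hypothetical, the rates used in any call are illustrative, and the case of a pair matching both positive and negative rules (which the text does not discuss) is resolved here by checking positive rules first.

```python
def rule_target(f_i, label, pos_rates, neg_rates):
    """Return R_i for one pair.

    f_i:       current prediction F_i
    label:     +1 / -1 for a training pair, None for an unlabeled pair
    pos_rates: true positive rates (gamma) of the positive rules the pair matches
    neg_rates: true negative rates (gamma) of the negative rules the pair matches
    """
    if label is not None:
        return float(label)               # same setting as the vector S
    if pos_rates:
        # Penalty (F_i - R_i)^2 is non-zero only when F_i < max gamma.
        return max(f_i, max(pos_rates))
    if neg_rates:
        # Equivalent to -max(-F_i, max gamma); penalty only when F_i > -max gamma.
        return min(f_i, -max(neg_rates))
    return f_i                            # no rule matched: zero loss

def rule_loss(F, R):
    """O_r = ||F - R||_2^2."""
    return sum((f - r) ** 2 for f, r in zip(F, R))
```

For instance, with an illustrative rate of 0.97 for a matched positive rule, a current score of 0.2 is pulled toward 0.97, while a score of 0.99 is left untouched.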

Non-linear Learning
TransLP is a transductive label propagation framework (Liu and Yang, 2015) for link prediction, previously used for applications such as text classification (Xu et al., 2016). In our work, we extend their work for our task, modeling non-linear mappings from entities to hypernyms.

Table 1: One positive rule (P1) and two negative rules (N1, N2).
P1: The head word of the entity x matches that of the candidate hypernym y. For example, 动物 (Animal) is the correct hypernym of 哺乳动物 (Mammal).
N1: The head word of the entity x matches the non-head word of the candidate hypernym y. For example, 动物学 (Zoology) is not a hypernym of 哺乳动物 (Mammal).
N2: The head word of the candidate hypernym y matches an entry in a Chinese lexicon extended based on the lexicon used in Li et al. (2015). It consists of 184 non-taxonomic, thematic words such as 政治 (Politics), 军事 (Military), etc.

For is-a relations, we find that if y is the hypernym of x, it is likely that y is the hypernym of entities that are semantically close to x. For example, if we know United States is a country, we can infer country is the hypernym of similar entities such as Canada, Australia, etc. This intuition can be encoded in the similarity of the two pairs $p_i = (x_i, y_i)$ and $p_j = (x_j, y_j)$:
$$sim(p_i, p_j) = \begin{cases} \cos(\mathbf{x}_i, \mathbf{x}_j) & y_i = y_j \\ 0 & \text{otherwise} \end{cases} \quad (1)$$
where $\mathbf{x}_i$ is the embedding vector of $x_i$ $^{3}$. This similarity indicates there exists a non-linear mapping from entities to hypernyms, which cannot be encoded in linear projection based methods (Fu et al., 2014; Wang and He, 2016). Based on TransLP (Liu and Yang, 2015), this intuition can be modeled as propagating class labels (is-a or not-is-a) of labeled word/phrase pairs to similar unlabeled ones based on Eq. (1). For example, the score of is-a relations between United States and country will propagate to pairs such as (Canada, country) and (Australia, country) by random walks.
Denote $\mathbf{F}^*$ as the optimal solution of the problem $\min O_s + O_r$. Inspired by (Liu and Yang, 2015; Xu et al., 2016), we can add a Gaussian prior $N(\mathbf{F}^*, \Sigma)$ to $\mathbf{F}$, where $\Sigma$ is the covariance matrix and $\Sigma_{i,j} = sim(p_i, p_j)$. Hence the optimization objective of this part is defined as $O_n = \mathbf{F}^T \Sigma^{-1} \mathbf{F}$, which is linearly proportional to the negative log-likelihood of the Gaussian random field prior. This means we minimize the training error and, at the same time, encourage $\mathbf{F}$ to have a smooth propagation with respect to the similarities among pairs defined by Eq. (1).
3 We only consider the similarity between entities and not candidate hypernyms because the similar rule for candidate hypernyms is not true. For example, nouns close to country in our Skip-gram model are region, department, etc. They are not all correct hypernyms of United States, Canada, Australia, etc.
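The pair similarity matrix and the smoothness term can be illustrated as follows. This is a sketch, assuming cosine similarity between entity embeddings for pairs that share a candidate hypernym; the small `jitter` added to the diagonal is an extra assumption of this example, introduced only to keep the matrix safely invertible.

```python
import numpy as np

def pair_similarity(entity_vecs, hypernyms):
    """Sigma[i, j] = cos(x_i, x_j) if pairs i and j share a candidate hypernym, else 0."""
    X = np.asarray(entity_vecs, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-length rows
    cos = X @ X.T
    same = np.array([[hi == hj for hj in hypernyms] for hi in hypernyms])
    return np.where(same, cos, 0.0)

def smoothness(F, Sigma, jitter=1e-6):
    """O_n = F^T Sigma^{-1} F, computed with a linear solve instead of an explicit inverse."""
    A = Sigma + jitter * np.eye(len(F))
    return float(F @ np.linalg.solve(A, F))
```

Two pairs with nearly parallel entity vectors and the same candidate hypernym incur a large penalty when their scores disagree and a small one when they agree, which is the propagation effect described above.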

Joint Optimization
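By combining the three components together, we minimize the following function:
$$J(\mathbf{F}) = O_s + O_r + \frac{\mu_1}{2} O_n + \frac{\mu_2}{2} \|\mathbf{F}\|_2^2 \quad (2)$$
where $\|\mathbf{F}\|_2^2$ imposes an additional smooth $l_2$-regularization on $\mathbf{F}$. $\mu_1$ and $\mu_2$ are regularization parameters that can be tuned manually.
Based on the convexity of the optimization problem, we can learn the optimal values of $\mathbf{F}$ via gradient descent. The derivative of $J(\mathbf{F})$ with respect to $\mathbf{F}$ is:
$$\frac{dJ(\mathbf{F})}{d\mathbf{F}} = 2\mathbf{W}^2(\mathbf{F} - \mathbf{S}) + 2(\mathbf{F} - \mathbf{R}) + \mu_1 \Sigma^{-1}\mathbf{F} + \mu_2 \mathbf{F}$$
which is computationally expensive when $m$ is large. After $\mathbf{W}^2$, $\mathbf{S}$, $\mathbf{R}$ and $\Sigma^{-1}$ are pre-computed, the runtime complexity of the gradient descent loop is $O(tm^2)$, where $t$ is the number of iterations.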
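The objective and its gradient are easy to get wrong by a factor of two, so a numerical check is a useful companion. The sketch below assumes the objective has the form O_s + O_r + (mu1/2) O_n + (mu2/2) ||F||^2 with a diagonal W and a symmetric inverse covariance; if the weighting differs, the two functions need to be adjusted together. All names are hypothetical.

```python
import numpy as np

def objective(F, W, S, R, Sigma_inv, mu1, mu2):
    """Assumed form: J(F) = O_s + O_r + (mu1 / 2) * O_n + (mu2 / 2) * ||F||^2."""
    o_s = np.sum((W @ (F - S)) ** 2)    # ||W (F - S)||^2
    o_r = np.sum((F - R) ** 2)          # ||F - R||^2, R treated as a constant
    o_n = F @ Sigma_inv @ F             # F^T Sigma^{-1} F
    return o_s + o_r + 0.5 * mu1 * o_n + 0.5 * mu2 * (F @ F)

def gradient(F, W, S, R, Sigma_inv, mu1, mu2):
    """Analytic gradient of the objective above (W diagonal, Sigma_inv symmetric)."""
    return (2 * (W @ W) @ (F - S) + 2 * (F - R)
            + mu1 * (Sigma_inv @ F) + mu2 * F)

def numerical_gradient(F, *args, eps=1e-6):
    """Central finite differences, used only to validate `gradient`."""
    basis = np.eye(len(F))
    return np.array([(objective(F + eps * e, *args) - objective(F - eps * e, *args))
                     / (2 * eps) for e in basis])
```

Each gradient evaluation is dominated by the dense matrix-vector product with the inverse covariance, which is where the quadratic cost per iteration comes from.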
To speed up the learning process, we introduce a blockwise gradient descent technique. From the definition of Eq. (2), we can see that the optimal values of $F_i$ and $F_j$ with respect to $(x_i, y_i)$ and $(x_j, y_j)$ are independent of each other if $y_i \neq y_j$. Therefore, the original optimization problem can be decomposed and solved separately according to different candidate hypernyms.
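The reason the problem decomposes can be made explicit with a short derivation. It is a sketch under two assumptions: the pairs are reordered so that pairs sharing a candidate hypernym are contiguous, and the pair similarity is zero whenever the candidate hypernyms differ. The blocks are indexed by the distinct candidate hypernyms $h_1, \dots, h_k$.

```latex
% Covariance matrix after grouping pairs by candidate hypernym
\Sigma =
\begin{pmatrix}
\Sigma_{h_1} &        &              \\
             & \ddots &              \\
             &        & \Sigma_{h_k}
\end{pmatrix}
\quad\Longrightarrow\quad
\Sigma^{-1} =
\begin{pmatrix}
\Sigma_{h_1}^{-1} &        &                   \\
                  & \ddots &                   \\
                  &        & \Sigma_{h_k}^{-1}
\end{pmatrix}

% The smoothness term therefore splits into one term per block
\mathbf{F}^{\top} \Sigma^{-1} \mathbf{F}
  = \sum_{l=1}^{k} \mathbf{F}_{h_l}^{\top} \Sigma_{h_l}^{-1} \mathbf{F}_{h_l}

% W is diagonal, so the remaining terms are sums over individual pairs
\|\mathbf{W}(\mathbf{F}-\mathbf{S})\|_2^2
  = \sum_{l=1}^{k} \|\mathbf{W}_{h_l}(\mathbf{F}_{h_l}-\mathbf{S}_{h_l})\|_2^2,
\qquad
\|\mathbf{F}-\mathbf{R}\|_2^2
  = \sum_{l=1}^{k} \|\mathbf{F}_{h_l}-\mathbf{R}_{h_l}\|_2^2
```

Since every term is a sum over blocks, the gradient with respect to one block depends only on that block, and each block can be optimized on its own.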
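Let $H$ be the collection of candidate hypernyms in $D^U$. For each $h \in H$, denote $D_h$ as the collection of word/phrase pairs in $D^+ \cup D^- \cup D^U$ that share the same candidate hypernym $h$. The original problem can be decomposed into $|H|$ optimization subproblems over $D_h$ for each $h \in H$. Denote $\mathbf{W}_h$, $\mathbf{S}_h$, $\mathbf{R}_h$, $\mathbf{F}_h$ and $\Sigma_h$ as the weight matrix, the initial prediction vector, the rule prediction vector, the final prediction vector and the entity similarity covariance matrix with respect to $D_h$. The objective function can be rewritten as:
$$J(\mathbf{F}) = \sum_{h \in H} J(\mathbf{F}_h), \quad J(\mathbf{F}_h) = \|\mathbf{W}_h(\mathbf{F}_h - \mathbf{S}_h)\|_2^2 + \|\mathbf{F}_h - \mathbf{R}_h\|_2^2 + \frac{\mu_1}{2} \mathbf{F}_h^T \Sigma_h^{-1} \mathbf{F}_h + \frac{\mu_2}{2} \|\mathbf{F}_h\|_2^2$$
We additionally use the superscript $(n)$ to denote the values of matrices or vectors in the $n$th iteration. $\mathbf{F}_h^{(n)}$ is iteratively updated based on the following equation:
$$\mathbf{F}_h^{(n+1)} = \mathbf{F}_h^{(n)} - \eta \cdot \frac{dJ(\mathbf{F}_h^{(n)})}{d\mathbf{F}_h^{(n)}}$$
where $\eta$ is the learning rate. To this end, we present the learning algorithm in Algorithm 1.

Algorithm 1 Learning Algorithm
Learn the projection models $\mathbf{M}^+$ and $\mathbf{M}^-$ over $D^+$ and $D^-$;
for each pair $(x_i, y_i) \in D^U$ do
  Compute $score(x_i, y_i)$ and $conf(x_i, y_i)$;
end for
for each linguistic rule $c_i \in C$ do
  Estimate $\gamma_i$ over the training set;
end for
for each candidate hypernym $h \in H$ do
  Compute $\mathbf{W}_h$, $\mathbf{S}_h$, $\mathbf{R}_h^{(0)}$ and $\Sigma_h$, and initialize $\mathbf{F}_h^{(0)}$;
  while not converged do
    Compute $\mathbf{F}_h^{(n+1)}$ by the gradient update;
    Update $\mathbf{R}_h^{(n+1)}$ for the next iteration;
  end while
end for

The runtime complexity of this algorithm is $O(\sum_{h \in H} t_h |D_h|^2)$, where $t_h$ is the number of iterations to solve the subproblem over $D_h$. Although we do not know the upper bounds on the numbers of iterations of these two learning techniques, the runtime complexity can be reduced by blockwise gradient descent for two reasons: i) $\sum_{h \in H} |D_h| \leq m$ and ii) $t_h$ is likely to be smaller than $t$ due to the smaller number of data instances. This technique can also be viewed as optimizing Eq. (2) based on blockwise matrix computation.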
Finally, for each (x i , y i ) ∈ D U , we predict that y i is a hypernym of x i if F i > θ where θ ∈ (−1, 1) is a threshold tuned on the development set.
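The blockwise scheme can be sketched as a loop over groups of pairs that share a candidate hypernym. The code below is an illustration, not the authors' implementation, and simplifies in several labeled ways: the rule vector R is held fixed instead of being refreshed in every iteration, F is warm-started from S rather than initialized as in the paper, a fixed number of iterations replaces a convergence test, a `jitter` term keeps each block invertible, and the regularization terms are assumed to carry mu/2 factors.

```python
import numpy as np
from collections import defaultdict

def blockwise_gradient_descent(hypernyms, w, S, R, Sigma,
                               mu1=1e-4, mu2=1e-4, lr=0.05,
                               n_iter=500, jitter=1e-2):
    """Minimize the joint objective one candidate-hypernym block at a time.

    hypernyms: candidate hypernym of each pair (length m)
    w:         diagonal of the weight matrix W (length m)
    S, R:      initial prediction and rule vectors (length m)
    Sigma:     m x m pair similarity matrix, zero across different hypernyms
    """
    w, S, R = (np.asarray(a, dtype=float) for a in (w, S, R))
    F = S.copy()                                   # warm start (assumption)
    blocks = defaultdict(list)
    for i, h in enumerate(hypernyms):
        blocks[h].append(i)                        # D_h as a list of indices
    for members in blocks.values():
        idx = np.array(members)
        sub = Sigma[np.ix_(idx, idx)] + jitter * np.eye(len(idx))
        sigma_inv = np.linalg.inv(sub)             # small per-block inverse
        w2, s, r, f = w[idx] ** 2, S[idx], R[idx], F[idx]
        for _ in range(n_iter):
            grad = 2 * w2 * (f - s) + 2 * (f - r) + mu1 * (sigma_inv @ f) + mu2 * f
            f = f - lr * grad                      # gradient step with rate lr
        F[idx] = f
    return F

def predict(F, theta):
    """A pair is predicted as is-a when its final score exceeds the threshold."""
    return [bool(f > theta) for f in F]
```

Each block only inverts and multiplies a matrix of size |D_h|, which is the source of the savings over a single m x m problem; the step size `lr` must be small enough for the largest block to converge.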

Experiments
In this section, we conduct experiments to evaluate our method. Section 4.1 to Section 4.5 report the experimental steps on Chinese datasets. We present the performance on English datasets in Section 4.6 and a discussion in Section 4.7.

Experimental Data
We have two collections of Chinese word/phrase pairs as ground truth datasets. Each pair is labeled with an is-a or not-is-a tag. The first one (denoted as FD) is from Fu et al. (2014), containing 1,391 is-a pairs and 4,294 not-is-a pairs, which is the first publicly available dataset to evaluate this task. The second one (denoted as BK) is larger in size and crawled from Baidu Baike by ourselves, consisting of <entity, category> pairs. For each pair in BK, we ask multiple human annotators to label the tag and discard pairs with inconsistent labels from different annotators. In total, it contains 3,870 is-a pairs and 3,582 not-is-a pairs.
The Chinese text corpus is extracted from the contents of 1.2M entity pages from Baidu Baike, a Chinese online encyclopedia. It contains approximately 1.1B words. We use the open source toolkit Ansj for Chinese word segmentation. Chinese words/phrases in our test sets may consist of multiple Chinese characters. We treat each such word/phrase as a whole to learn embeddings, instead of using character-level embeddings.
In the following experiments, we use 60% of the data for training, 20% for development and 20% for testing, partitioned randomly. By rotating the 5-fold subsets of the datasets, we report the performance of each method on average.
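One way to realize a rotating 60/20/20 split over five folds is sketched below. Which fold serves as the development set in each rotation is not stated, so taking the fold after the test fold is an assumption of this example, as are the function name and the seed.

```python
import random

def rotating_splits(pairs, n_folds=5, seed=0):
    """Yield (train, dev, test) splits; each fold is the test set exactly once."""
    items = list(pairs)
    random.Random(seed).shuffle(items)              # random partition
    folds = [items[k::n_folds] for k in range(n_folds)]
    for r in range(n_folds):
        d = (r + 1) % n_folds                       # dev fold (assumed convention)
        train = [p for k in range(n_folds) if k not in (r, d) for p in folds[k]]
        yield train, folds[d], folds[r]
```

Averaging a metric over the five yielded splits gives the kind of averaged figure reported in the experiments.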

Parameter Analysis
The word embeddings are pre-trained by ourselves on the Chinese corpus. In total, we obtain the 100-dimensional embedding vectors of 5.8M distinct words. The regularization parameters are set to $\lambda = 10^{-3}$ and $\mu_1 = \mu_2 = 10^{-4}$, fine-tuned on the development set.
The choice of θ reflects the precision-recall trade-off in our model. A larger value of θ means we pay more attention to precision rather than recall. Figure 2 illustrates the precision-recall curves on both datasets. It can be seen that the performance of our method is generally better on BK than on FD. The most probable cause is that BK is a larger dataset with more "balanced" numbers of positive and negative data. Finally, θ is set to 0.05 on FD and 0.1 on BK.
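The trade-off controlled by the threshold can be illustrated with a small helper that computes precision, recall and F-measure at a given θ and picks the best θ from a grid. Selecting θ by development-set F-measure is an assumption of this sketch; the function names and the toy scores in the usage note are illustrative.

```python
def precision_recall_f(scores, gold, theta):
    """Precision, recall and F-measure when predicting is-a for scores above theta."""
    tp = sum(1 for s, g in zip(scores, gold) if s > theta and g)
    fp = sum(1 for s, g in zip(scores, gold) if s > theta and not g)
    fn = sum(1 for s, g in zip(scores, gold) if s <= theta and g)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def tune_theta(dev_scores, dev_gold, grid):
    """Return the threshold in `grid` with the highest F-measure on the dev set."""
    return max(grid, key=lambda t: precision_recall_f(dev_scores, dev_gold, t)[2])
```

On toy scores `[0.9, 0.4, 0.2, -0.3, -0.8]` with gold labels `[True, True, False, True, False]`, raising θ from 0.0 to 0.3 increases precision while recall stays the same or drops, which mirrors the behaviour described for larger θ.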

Performance
In a series of previous work (Fu et al., 2013, 2014; Wang and He, 2016), several pattern-based, inference-based and encyclopedia-based is-a relation extraction methods for English have been implemented for the Chinese language. As their experiments show, these methods achieve an F-measure lower than 60% in most cases, so they are not considered strong baselines for Chinese hypernym prediction. Interested readers may refer to their papers for the experimental results.
To draw convincing conclusions, we employ two recent state-of-the-art approaches for Chinese is-a relation identification (Fu et al., 2014; Wang and He, 2016) as baselines. We also take the word embedding based classification approach (Mirza and Tonelli, 2016) $^{7}$ and the Chinese Wikipedia based SVM model (Li et al., 2015) as baselines to predict is-a relations between words $^{8}$. The experimental results are illustrated in Table 2.
For Fu et al. (2014), we test the performance using a linear projection model (denoted as S in Table 2) and piecewise projection models (P). The results show that the semantics of is-a relations are better modeled by multiple projection models, with a slight improvement in F-measure. By combining iterative projection models and pattern-based validation, the most recent approach (Wang and He, 2016) increases the F-measure by 4% and 2% on the two datasets. In this method, the pattern-based statistics are calculated using the same corpus over which we train word embedding models. The main reason for the improvement may be that the projection models have better generalization power when an iterative learning paradigm is applied. Mirza and Tonelli (2016) is implemented using three different strategies for combining the word vectors of a pair: i) concatenation $\mathbf{x}_i \oplus \mathbf{y}_i$ (denoted as C), ii) addition $\mathbf{x}_i + \mathbf{y}_i$ (A) and iii) subtraction $\mathbf{x}_i - \mathbf{y}_i$ (S). As seen, the classification models using addition and subtraction have similar performance on the two datasets, while the concatenation strategy outperforms the other two strategies. Although Li et al. (2015) achieve high performance on their dataset, this method does not perform well on ours. The most likely cause is that the features in that work are designed specifically for the Chinese Wikipedia category system. Our initial model has a higher accuracy than all the baselines. By utilizing the transductive learning framework, we boost the F-measure by 1.7% and 2.1%, respectively. Therefore, our method is effective in predicting hypernyms of Chinese entities. We further conduct statistical tests which show that our method significantly (p < 0.01) improves the F-measure over the state-of-the-art method (Wang and He, 2016).
7 Although the experiments in their paper are mostly related to temporal relations, the method can be applied to is-a relations without modification.
8 Previously, these methods used different knowledge sources to train models and thus the results in their papers are not directly comparable with ours. To make a fair comparison, we take the training data as the same knowledge source to train models for all methods.
Table 4: TP/TN rates of three linguistic rules (%).
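The three ways of combining the word vectors of a pair for the classification baseline can be written down directly. This is a minimal sketch of the feature construction only, with a hypothetical function name; the classifier trained on top of these features is not shown.

```python
def combine(x, y, mode):
    """Build the feature vector for a pair from its two word vectors.

    mode "C": concatenation, "A": element-wise addition, "S": element-wise subtraction.
    """
    if mode == "C":
        return list(x) + list(y)                 # dimension doubles
    if mode == "A":
        return [a + b for a, b in zip(x, y)]     # symmetric in x and y
    if mode == "S":
        return [a - b for a, b in zip(x, y)]     # keeps the direction of the pair
    raise ValueError(f"unknown mode: {mode}")
```

Note that addition discards the order of the pair, whereas concatenation and subtraction preserve it, which matters for an asymmetric relation such as is-a.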

Effectiveness of Linguistic Rules
To illustrate the effectiveness of linguistic rules, we present the true positive (or negative) rate by using one positive (or negative) rule solely, shown in Table 4. These values serve as γ i s in the transductive learning stage. The results indicate that these rules have high precision (over 90%) over both datasets for our task.
We note that we currently use only a few handcrafted linguistic rules in our work. The proposed approach is a general framework that can encode an arbitrary number of rules in any language.

Error Analysis and Case Studies
We analyze correct and erroneous cases in the experiments. Some examples of prediction results are shown in Table 3. We can see that our method is generally effective. However, mistakes occur mostly because it is difficult to distinguish strict is-a relations from topic-of relations. For example, the entity Nuclear Reactor is semantically close to Nuclear Energy. The error statistics show that errors of this kind account for approximately 80.2% and 78.6% of the errors in the two test sets, respectively.
Based on the literature study, we find that this problem has also been reported in (Fu et al., 2013; Wang and He, 2016). To reduce such errors, we employ the Chinese thematic lexicon based on Li et al. (2015) in the transductive learning stage, but the coverage is still limited. Two possible solutions are: i) adding more negative training data of this kind; and ii) constructing a large-scale thematic lexicon automatically from the Web.

Experiments on English Datasets
To examine how our method can benefit hypernym prediction for the English language, we use two standard datasets in this paper. The first one is a benchmark dataset for distributional semantic evaluation, i.e., BLESS (Baroni and Lenci, 2011). Because the number of pairs in BLESS is relatively small, we also use the Shwartz (Shwartz et al., 2016) dataset. In the experiments, we treat the HYPER relations as positive data (1,337 pairs) and randomly sample 30% of the RANDOM relations as negative data (3,754 pairs) in BLESS. To create a relatively balanced dataset, we take the random split of Shwartz as input and use only 30% of the negative pairs. The dataset contains 14,135 positive pairs and 16,956 negative pairs. We use English Wikipedia as the text corpus to estimate the statistics, and the pre-trained embedding vectors of English words $^{9}$. For comparison, we test all the baselines over English datasets except Li et al. (2015). This is because most features in Li et al. (2015) can only be used in the Chinese environment. To implement Wang and He (2016) for English, we use the original Hearst patterns (Hearst, 1992) to perform relation selection and do not consider not-is-a patterns. We also take two recent DSM based approaches (Lenci and Benotto, 2012; Santus et al., 2014) as baselines. As for our own method, we do not use the linguistic rules in Table 1 for English. The results are illustrated in Table 5. As seen, our method is superior to all the baselines over BLESS, with an F-measure of 81.9%. On Shwartz, while the approach of Mirza and Tonelli (2016) has the highest F-measure of 80.1%, our method is generally comparable to theirs and outperforms the others. The results suggest that although our method is not necessarily the state of the art for English hypernym prediction, it has several potential applications. Refer to Section 4.7 for discussion.

Discussion
From the experiments, we can see that the proposed approach outperforms the state-of-the-art methods for Chinese hypernym prediction. Although the English language is not our focus, our approach still has relatively high performance. Additionally, our work has potential value for the following applications:
• Domain-specific or Context-sparse Relation Extraction. If the task is to predict relations between words when it is related to a specific domain or the contexts are sparse, even for English, traditional pattern-based methods are likely to fail. Our method can predict the existence of relations without explicit textual patterns and requires a relatively small number of pairs as training data.
• Under-resourced Language Learning. Our method can be adapted for relation extraction in languages with flexible expressions, few knowledge resources and/or low-performance NLP tools. Our method does not require deep NLP parsing of sentences in a text corpus and thus the performance is not affected by parsing errors.
9 http://nlp.stanford.edu/projects/glove/

Conclusion
In summary, this paper introduces a transductive learning approach for Chinese hypernym prediction. By modeling linear projection models, linguistic rules and non-linear mappings, our method is able to identify Chinese hypernyms with high accuracy. Experiments show that our method outperforms previous approaches. We also discuss the potential applications of our method besides Chinese hypernym prediction. In our work, the candidate Chinese hyponyms and hypernyms are extracted from user generated categories. In the future, we will study how to construct a taxonomy from texts in Chinese.