Word Relation Autoencoder for Unseen Hypernym Extraction Using Word Embeddings

Lexical relation extraction from distributional word representations is an important topic in NLP. We observe that state-of-the-art projection-based methods do not generalize to unseen hypernyms. We analyze this failure from the perspective of pollution and construct a corresponding indicator to measure it. We propose a word relation autoencoder (WRAE) model to address the challenge. Experiments on several hypernym-like lexicon datasets show that our model outperforms the competitors significantly.


Introduction
This paper discusses the inference of relations between words. For the hypernym relation beer IsA drink, denoted as IsA(x, y), beer is the hyponym x and drink is the hypernym y. Relation lexicons are a precious resource for NLP systems, but constructing semantic graphs such as WordNet (Fellbaum, 1998) and ConceptNet (Speer and Havasi, 2012) requires expensive human labeling effort.
Recently, researchers have started extracting word relations from pre-trained word embeddings without the need for an existing corpus, thanks to the success of distributional word representation models such as GloVe (Pennington et al., 2014).
Table 1: Query and answer examples.
Query                         Answer
beef → meat ≈ crab → ?        seafood
tiger → zoo ≈ dolphin → ?     aquarium
paint → artist ≈ book → ?     writer
japan → asia ≈ italy → ?      europe

Compared with hypernym classification models (Lenci and Benotto, 2012; Weeds et al., 2014; Levy et al., 2015; Vylomova et al., 2016), which take a pair of entities (x, y) as input and output a binary decision about the existence of a relation, there has been less work on the hypernym extraction task. It is challenging to automatically extract all possible hypernyms of a given hyponym query, especially the unlabeled ones, from the vocabulary.
Classification-based models are not applicable to this task because answering a single query requires classifying every candidate word: the complexity of inference is O(V), where V is the size of the vocabulary, which often scales to billions.
Among the existing solutions, projection-based methods (Fu et al., 2014; Yamane et al., 2016; Espinosa-Anke et al., 2016; Ustalov et al., 2017) focus on hypernym extraction and intuitively represent a relation as y − x, following the linear structure of word embeddings. By directly learning a linear mapping Φ between two words such that xΦ = y, the prediction ŷ can be obtained by nearest neighbor search for xΦ in the word embedding space. Moreover, the candidates for y need not be seen in advance, so the method can be used to predict unseen hypernyms directly. Fu et al. (2014) further observe cluster structures in the relation representation y − x and propose to learn a piecewise linear mapping such that xΦ_k = y for each cluster C_k. Their experiments show that domain clustering on the training offsets is very useful for hypernym identification. However, we observe that each cluster contains very few distinct hypernyms. For instance, about 83% of the clusters contain fewer than 5 hypernyms for ConceptNet-IsA in our experiments. Hypernyms can be seen as collections of related word pairs, e.g., IsA(dog, animal), IsA(cat, animal), IsA(horse, animal), etc. The piecewise projection matrices can hardly learn the inference between hyponyms and hypernyms; instead, they memorize the few words that serve as hypernyms in the training data. Inevitably, the state-of-the-art models using piecewise projection learning face a generalization problem and fail to predict unseen hypernyms correctly.
We design a novel Word Relation Autoencoder (WRAE) framework, which adopts a conditional autoencoder structure (x → r → x): it encodes a hyponym x into the relation vector r = y − x and reconstructs the hyponym by decoding from r. The encoder weights are tied with the decoder, which must learn to separate the hypernym from the hyponym within the relation vector and extract the hyponym x in order to minimize the reconstruction loss. This effectively mitigates the generalization problem described above.
We summarize our main contributions as follows: (1) We propose a novel and more general scenario for relation extraction that handles unseen hypernyms. (2) We propose an intuitive pollution indicator that allows us to measure empirically whether the model learns the inference between a relation pair. (3) We propose a novel Word Relation Autoencoder (WRAE) that effectively reduces pollution. We conduct thorough experiments to show that our model outperforms the competitors and can be applied to other hypernym-like relations.

Related Work
Fu et al. (2014) first apply projection learning to generalized hypernym extraction by learning a linear transformation from a hyponym word embedding to the corresponding hypernym word vector. They further conduct piecewise projection learning, i.e., learning a projection matrix for each cluster, and obtain significant improvements by first applying k-means clustering. They train with stochastic gradient descent, which makes it easy to attach different regularizers to the objective. Several recent works follow the same scheme. Espinosa-Anke et al. (2016) operate a similar model at the sense level and take advantage of domain clustering to discover hypernyms through domain adaptation between different topics. Yamane et al. (2016) focus on improving performance through better cluster assignments by learning the clustering and the projections jointly. Ustalov et al. (2017) propose several regularization terms in addition to the original loss function (Fu et al., 2014), using extra synonym pairs or the asymmetric property of hypernymy. Nayak (2015) provides detailed technical studies on piecewise projection models.
Our work differs from all of them, as we focus on the setting in which none of the hyponyms and hypernyms in the testing vocabulary are seen in training.

Piecewise Projection
Piecewise projection learning (Fu et al., 2014) serves as our baseline. The objective is to learn a relation transform from x to y on training pairs (x, y). A piecewise projection matrix Φ_k is learned separately for each cluster, after applying k-means clustering to the offsets y − x of the training pairs.
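As a sketch of this objective, assuming the mean squared error form of Fu et al. (2014) written with the row-vector convention xΦ_k used here (the hat notation and the exact normalization are assumptions):

```latex
\hat{\Phi}_k \;=\; \arg\min_{\Phi_k} \; \frac{1}{C_k}
  \sum_{(x,\,y)\,\in\,\text{cluster } k} \bigl\lVert x\,\Phi_k - y \bigr\rVert_2^2
```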
where C_k represents the size of the k-th cluster.
In addition, we also examine a simple L2-penalized projection learning model, which imposes an L2 constraint on Φ in Equation 1, i.e., α‖Φ_k‖_2^2.
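The prediction step of a projection model can be illustrated with a minimal sketch: project the hyponym vector with xΦ and return the nearest words in the embedding space. The function names, the toy 2-d embeddings, and the use of cosine similarity as the nearness measure are all assumptions made for illustration, not details confirmed by the text.

```python
import math

def project(x, phi):
    """Row vector times matrix: returns x @ phi for a d-vector x and a d x d matrix phi."""
    return [sum(x[i] * phi[i][j] for i in range(len(x))) for j in range(len(phi[0]))]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def nearest_words(query_vec, embeddings, top=1, exclude=()):
    """Rank vocabulary words by similarity to query_vec (assumed: cosine similarity)."""
    scored = [(cosine(query_vec, vec), word)
              for word, vec in embeddings.items() if word not in exclude]
    scored.sort(reverse=True)
    return [word for _, word in scored[:top]]

# Hypothetical toy embeddings and a hand-picked projection (a 90-degree rotation).
toy_embeddings = {"beer": [1.0, 0.0], "drink": [0.0, 1.0],
                  "wine": [0.9, 0.1], "cat": [-1.0, 0.0]}
toy_phi = [[0.0, 1.0], [-1.0, 0.0]]

y_hat = project(toy_embeddings["beer"], toy_phi)
prediction = nearest_words(y_hat, toy_embeddings, top=1, exclude=("beer",))
```

Because the search runs over the whole vocabulary, a candidate does not have to occur as a hypernym in training to be returned.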

Word Relation Autoencoder (WRAE)
Our model takes the form of an autoencoder, as shown in Equation 2, where xΦ_k = y − x. Here we adopt the simplifying trick of Kodirov et al. (2017a) and tie the encoder and decoder weights with the constraint Φ* = Φ^T (Ranzato et al., 2008). Note that an L2-norm regularization term is not necessary for WRAE to avoid overfitting, since the constraint Φ* = Φ^T guarantees that ‖Φ‖_2^2 cannot become large; otherwise the reconstruction loss would be poor. The learning process is also more efficient.
To relax the constraint xΦ_k = y − x, the objective can be further split into two terms balanced by a weighting constant λ. We find that learning the relation mapping x → (y − x) instead of x → y effectively mitigates the pollution problem (Lazaridou et al., 2015). A prediction is said to be polluted if the nearest neighbor of the predicted ŷ is a hypernym that appeared in the training set. This operation addresses the root cause of pollution, since each input-output pair becomes (x, y − x) instead of (x, y). Unlike projecting onto a small number of targets y, the target y − x differs from pair to pair, which prevents the model from simply overfitting the lexicon.
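The two objectives described above can be sketched as follows. The form is modeled on the Semantic Autoencoder objective of Kodirov et al. (2017), with the relation vector y − x in place of the semantic code. The summation over the pairs of a cluster and the absence of a normalizing factor are assumptions.

```latex
% Constrained form with tied weights \Phi_k^{*} = \Phi_k^{\top}
\min_{\Phi_k} \sum_{(x,\,y)} \bigl\lVert x - (y - x)\,\Phi_k^{\top} \bigr\rVert_2^2
\quad \text{s.t.} \quad x\,\Phi_k = y - x

% Relaxed form: reconstruction term plus \lambda-weighted relation term
\min_{\Phi_k} \sum_{(x,\,y)} \Bigl( \bigl\lVert x - (y - x)\,\Phi_k^{\top} \bigr\rVert_2^2
  + \lambda \, \bigl\lVert x\,\Phi_k - (y - x) \bigr\rVert_2^2 \Bigr)
```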
Conceptually, WRAE learns to extract the hyponym x from the relation vector r = y − x in order to minimize the reconstruction loss. By encouraging the projection to learn the relationship between a word pair, WRAE effectively mitigates the generalization problem described earlier.
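A minimal sketch of how the relaxed WRAE loss could be evaluated for a single projection matrix follows. It assumes a reconstruction term ‖x − (y − x)Φ^T‖² plus λ times a relation term ‖xΦ − (y − x)‖², averaged over pairs. The function names, the averaging, and the toy vectors are assumptions for illustration. The paper optimizes the matrices with Adam, which is not shown here.

```python
def vec_mat(v, m):
    """Row vector times matrix: returns v @ m."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

def transpose(m):
    """Matrix transpose for a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def wrae_loss(pairs, phi, lam):
    """Mean over (x, y) pairs of reconstruction error plus lam * relation error."""
    phi_t = transpose(phi)                     # tied decoder weights: Phi^T
    total = 0.0
    for x, y in pairs:
        r = [b - a for a, b in zip(x, y)]      # relation vector r = y - x
        recon = vec_mat(r, phi_t)              # decode: r Phi^T should recover x
        encoded = vec_mat(x, phi)              # encode: x Phi should match r
        total += sq_dist(x, recon) + lam * sq_dist(encoded, r)
    return total / len(pairs)

# Hypothetical toy pair: x = (1, 0), y = (1, 1), so r = (0, 1).
toy_pairs = [([1.0, 0.0], [1.0, 1.0])]
rotation = [[0.0, 1.0], [-1.0, 0.0]]   # maps x onto r exactly
identity = [[1.0, 0.0], [0.0, 1.0]]    # ignores the relation

loss_rotation = wrae_loss(toy_pairs, rotation, lam=0.5)
loss_identity = wrae_loss(toy_pairs, identity, lam=0.5)
```

In this toy case the rotation both encodes x into r and decodes r back into x, so both terms vanish, while the identity matrix is penalized by both terms.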
Our model is related to the Semantic Autoencoder (SAE) (Kodirov et al., 2017b). With the latent relation directly associated with the input x, WRAE can be regarded as a special conditional SAE in which the condition is the input itself and is incorporated into the middle layer.
To further examine the generality of our model, we collect several hypernym-like relations, listed in Table 2, from the ConceptNet semantic graph. Given the properties of these relations, we treat the head and tail words of a pair as the x and y for our models, analogous to the hyponym and hypernym, respectively. Examples are in Table 1.
We split the datasets with ratios 0.7, 0.2, and 0.1 for training, testing, and validation, respectively. For all results, we report the mean over 30 random splits. We test two settings: one uses k-means clustering and one does not (k = 1). We tune the number of clusters k in an unsupervised manner with the Silhouette score (Rousseeuw, 1987) on the validation set. The projection matrices are optimized with the Adam method (Kingma and Ba, 2014) with a learning rate of 1e-3. We adopt the GloVe (Pennington et al., 2014) 300d pre-trained word embeddings, which are trained on a 6B-token corpus (Wikipedia 2014 + Gigaword 5) with a vocabulary of 400,000 words.

Hit Rate
To evaluate the precision of the returned hypernyms, we follow Ustalov et al. (2017) and Kodirov et al. (2017a) in using the hit rate measure (Frome et al., 2013). We also adopt an area under curve (AUC) measure, which averages the area under the l − 1 trapezoids of hit@l to take the ranks of the ground truth into consideration.
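A minimal sketch of these two measures follows. It assumes hit@l is the fraction of queries with at least one gold hypernym among the top l predictions, and that the AUC is the mean of the l − 1 trapezoid areas between consecutive hit@i values. The function names and the toy rankings are hypothetical, and the exact normalization used in the paper may differ.

```python
def hit_at(ranked, gold, l):
    """1.0 if any gold hypernym appears among the top-l ranked predictions, else 0.0."""
    return 1.0 if any(word in gold for word in ranked[:l]) else 0.0

def mean_hit_at(all_ranked, all_gold, l):
    """hit@l averaged over all queries."""
    hits = [hit_at(ranked, gold, l) for ranked, gold in zip(all_ranked, all_gold)]
    return sum(hits) / len(hits)

def auc(all_ranked, all_gold, l):
    """Average area of the l - 1 trapezoids spanned by hit@1 ... hit@l (requires l >= 2)."""
    hits = [mean_hit_at(all_ranked, all_gold, i) for i in range(1, l + 1)]
    trapezoids = [(hits[i] + hits[i + 1]) / 2.0 for i in range(l - 1)]
    return sum(trapezoids) / (l - 1)

# Hypothetical ranked predictions for two queries and their gold hypernyms.
toy_ranked = [["fish", "seafood", "meat"], ["europe", "asia", "country"]]
toy_gold = [{"seafood"}, {"europe"}]

toy_hit1 = mean_hit_at(toy_ranked, toy_gold, 1)
toy_hit2 = mean_hit_at(toy_ranked, toy_gold, 2)
toy_auc = auc(toy_ranked, toy_gold, 3)
```

The AUC rewards models that place the gold hypernym early, since a hit at rank 1 contributes to every trapezoid while a hit at rank l contributes only to the last one.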

Soft Pollution
To evaluate the degree of pollution of the extracted hypernyms, we adopt a metric similar to that of Lazaridou et al. (2015). A prediction is said to be polluted if the nearest neighbor of the predicted ŷ is a hypernym that appears in the training set, denoted as a binary function pol_1(ŷ).
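The binary indicator can be sketched as below, assuming the nearest neighbor of each prediction has already been retrieved as a word. The function names and the toy words are hypothetical, and averaging the indicator over predictions to obtain a rate is an assumption about how the percentages are aggregated.

```python
def pol_1(nearest_neighbor, train_hypernyms):
    """1 if the nearest neighbor of the prediction is a hypernym seen in training, else 0."""
    return 1 if nearest_neighbor in train_hypernyms else 0

def pollution_rate(nearest_neighbors, train_hypernyms):
    """Fraction of predictions whose nearest neighbor is a training hypernym."""
    seen = set(train_hypernyms)
    flags = [pol_1(word, seen) for word in nearest_neighbors]
    return sum(flags) / len(flags)

# Hypothetical nearest neighbors for four test queries.
toy_neighbors = ["animal", "seafood", "animal", "europe"]
toy_train_hypernyms = {"animal", "meat"}

toy_rate = pollution_rate(toy_neighbors, toy_train_hypernyms)
```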
However, in real cases it is possible that ground-truth unseen hypernyms are very close to some seen hypernyms in Y_train. We therefore take the ground truths into consideration, where n ranges over the top n nearest neighbors (from 1 to l) of y that appear in Y_train, and ρ is a factor that decreases exponentially from 1 to 0 as n increases, thereby providing a smoother estimate. With pollution indicators, one can understand to what degree the model suffers from overfitting on the seen examples. φ denotes the empty set. Note that it is reasonable to use the same l for both hit rate and soft pollution.

Results: Unseen General Hypernym-Like Relation Extraction
We report two sets of results for all models, one with clustering and one without (k = 1). As shown in Table 3, WRAE outperforms the competitors significantly both with and without clustering. The naive application of Equation 2 that sets xΦ_k = y, denoted as WRAE-Y, consistently ranks second. The y − x operation is crucial for avoiding pollution and thus preserves the generalization power of the mapping. Applying the simple L2-norm regularization to Equation 1 for Proj., denoted as Proj.+L2, only slightly improves the performance. The results in Table 3 support our hypothesis that Proj. models deteriorate significantly for larger k, due to the lack of training examples for hypernyms in each cluster. The results indicate that WRAE is effective against pollution. The regularizer plays an important role in steering the decoder toward a better objective.
The negative effects of pollution reduce accuracy. We observe a severe pollution problem in simple projection learning. Take IsA as an example: in the k = 1 group, pol_1 is about 71% for Proj., which implies that about two out of three returned predictions are data points from the training data. Our WRAE reduces pol_1 to 60%, and to 12% after clustering. The accompanying improvement in accuracy supports the view that the pollution indicator reflects the inherent overfitting problem. In general, the results are consistent with our claim that pollution can be viewed as a valid negative indicator.
Across the board, the performance should benefit from domain clustering if pollution is handled properly, as the experiments show.

Conclusion
We present an unseen hypernym extraction framework and analyze the pollution problem in this setup. Consequently, we argue that only by using unseen candidates in evaluation can one truly test whether the model learns the true relation representation, instead of being polluted by the seen training examples. Future work includes relation discovery, i.e., identifying new relations besides hypernymy in an unsupervised manner.