Learning Conceptual Spaces with Disentangled Facets

Conceptual spaces are geometric representations of meaning that were proposed by Gärdenfors (2000). They share many similarities with the vector space embeddings that are commonly used in natural language processing. However, rather than representing entities in a single vector space, conceptual spaces are usually decomposed into several facets, each of which is then modelled as a relatively low-dimensional vector space. Unfortunately, the problem of learning such conceptual spaces has thus far only received limited attention. To address this gap, we analyze how, and to what extent, a given vector space embedding can be decomposed into meaningful facets in an unsupervised fashion. While this problem is highly challenging, we show that useful facets can be discovered by relying on word embeddings to group semantically related features.


Introduction
Conceptual spaces (Gärdenfors, 2000) are vector space models that are aimed at representing the entities of a given kind (e.g. movies), together with their associated properties (e.g. scary) and concepts (e.g. thrillers). As such, they are similar in spirit to the vector space models that have been proposed in information retrieval (Deerwester et al., 1990) and natural language processing (Turney and Pantel, 2010; Mikolov et al., 2013), but there are also notable differences. First, in the context of conceptual spaces, an explicit distinction is made between the entities from the domain of discourse, which are represented as vectors, and the corresponding properties and concepts, which are represented as regions (e.g. polytopes) or soft regions (e.g. characterized by a Gaussian). Second, conceptual spaces are organised into a set of facets, each of which captures a different aspect of meaning. (These facets are often referred to as domains in the context of conceptual spaces. However, we will use the term facets to avoid confusion with domains of discourse.) For instance, in a conceptual space of movies, we may have facets such as genre, language, geographic location, etc.
Each facet is associated with its own vector space, which intuitively captures similarity w.r.t. the corresponding facet. For instance, in a conceptual space of movies, the vector space for the budget facet would only capture whether two movies had a similar budget. Most of these facet spaces tend to be low-dimensional (e.g. modelling budget only needs a single dimension). This clearly differentiates them from traditional semantic spaces, which often have hundreds of dimensions. From an application point of view, the separation of vector space models into facets is appealing for several reasons. One key advantage is that it allows us to model similarity in a more flexible, and cognitively more plausible way. A related advantage is that the low-dimensional nature of the facet-specific spaces should make it easier to learn from few examples. Finally, the separation into facets can also make conceptual spaces more interpretable. However, the study of conceptual spaces has mostly focused on modelling cognitive and linguistic phenomena, such as metaphor (Gärdenfors, 1996) and vagueness (Douven et al., 2013), with only a few works addressing the challenge of learning such representations from data.
Decomposing conceptual spaces into facets is similar to the problem of disentangled representation learning (DRL), which has recently received considerable interest. However, empirical studies suggest that purely unsupervised DRL methods are unlikely to be successful without a strong inductive bias. In fact, Locatello et al. (2018) found that what mostly matters is how such methods are initialized, rather than what particular optimization objective is used. Moreover, much of the work in DRL has focused on image processing rather than textual data (which is what we use in this paper). Finally, existing work in DRL is focused on learning factors which are uncorrelated. In our setting, however, the different facets are often highly correlated (e.g. natural disaster movies typically have a high budget).
In this paper, we explore a strategy for decomposing a given vector space embedding into separate facet spaces by first determining which interpretable features are modelled by the vector space and then clustering the word vectors corresponding to these features. Despite being intuitive, given that word embeddings are known to group together functionally similar words, we found this strategy to perform poorly in its basic form. First, simply looking for clusters in word embeddings often leads to thematic clusters, e.g. grouping horror together with words such as scary and zombie rather than other genres such as western and drama. To address this, we explicitly prevent two words from ending up in the same cluster if the features they are modelling are too similar. Second, in most domains, there are one or two central facets which tend to be highly correlated with most of the other facets (e.g. genre in the movie domain). To ensure that the resulting facet spaces are sufficiently different (rather than capturing minor variations of the most central facets), we found it useful to use an iterative approach, where previously found facets are "removed" from the vector space embedding before proceeding to find further facets. With these two modifications, we find that useful facets can indeed be found, which consistently lead to better classification performance compared to the original vector space embedding.

Related Work
Conceptual Spaces.
A conceptual space (Gärdenfors, 2000) is a vector representation of the entities from some domain, where the dimensions tend to capture salient features. It is usually assumed that the dimensions of a conceptual space can be grouped into semantic domains, or facets. From a cognitive point of view, this grouping is important because it affects how similarity scores are computed. Intuitively, this is because the dimensions from the same facet tend to interact with each other whereas the dimensions from different facets can be considered in isolation. The problem of learning conceptual spaces from data has only received limited attention to date. One exception is the work of Derrac and Schockaert (2015), which we build on in this paper. In their work, textual descriptions of the considered entities are used to find dimensions that model salient semantic features in a given semantic space. For instance, in a semantic space of movies they found dimensions corresponding to features such as scary, horror and zombie. Note that because these features tend to be correlated, the corresponding dimensions are typically not orthogonal in the input semantic space. For this reason, they refer to these dimensions as interpretable directions. More recently, Ager et al. (2018) proposed a post-processing method to fine-tune these interpretable directions. The main challenge which we address in this paper is to group the features that are found by the method from Derrac and Schockaert (2015) into semantically meaningful facets. A supervised variant of this problem was considered by Banaee et al. (2018). Their approach relies on feature selection methods to find subsets of features that are predictive of particular class labels, based on a set of labelled training examples. In contrast, our focus in this paper is on unsupervised methods, as suitable training data is often not available.

Disentangled Representation Learning (DRL).
In the last few years, a large number of generative neural network models have been proposed, with variational autoencoders (VAEs) (Kingma and Welling, 2014) and generative adversarial networks (GANs) (Goodfellow et al., 2014) being the best-known examples. The main underlying idea behind these models is that high-dimensional data (e.g. images) can often be described in terms of a much lower-dimensional latent vector space. Each object can thus be compactly described by its latent code, i.e. the corresponding vector in this latent space. The problem of DRL is to learn such a latent vector representation such that (groups of) the dimensions of the latent codes correspond to meaningful interpretable factors. A variety of unsupervised and semi-supervised approaches for learning such disentangled representations have been proposed, such as InfoGAN (Chen et al., 2016), which is based on a modification of the loss function for GANs, and β-VAE (Higgins et al., 2017), which instead uses VAEs as the base model. Conceptually, these approaches modify the loss function of a given generative model by insisting that the dimensions of the latent vector space are in some sense independent. In principle, the latent vector spaces learned by DRL methods can be viewed as conceptual spaces. It is unclear, however, whether purely statistical measures of independence can be sufficient for learning semantically meaningful factors. While interesting results have been obtained for particular applications, after a thorough empirical analysis Locatello et al. (2018) concluded that such results were highly sensitive to the random initialization of the neural network models and the value of hyper-parameters. Their results suggest that, in the absence of a suitable supervision signal, high-quality factors can only be learned in the presence of a strong inductive bias. Going beyond unsupervised approaches, Jain et al. (2018) propose a supervised approach for DRL for text. As supervision signal, they use triplets of the form $(s, d, o)_a$, which encode that, relative to aspect $a$, $s$ and $d$ are more similar than $d$ and $o$. They then use a Convolutional Neural Network (CNN) based model to obtain low-dimensional document embeddings for each considered aspect.

Decomposing Conceptual Spaces
Let $E$ be a set of entities of some particular type (e.g. movies) for which a vector space embedding is given. In the following, we will write $e \in \mathbb{R}^n$ for the embedding of entity $e$. The first step of our approach consists in applying the method from Derrac and Schockaert (2015), which provides us with a set of words $F$, each corresponding to a feature that can be modelled as a direction in the vector space. For $f \in F$ we write $d_f$ for the vector characterizing this direction. Formally, this means that $e_1 \cdot d_f < e_2 \cdot d_f$ iff $e_2$ has the feature $f$ to a higher extent than $e_1$ (e.g. if $f$ denotes the feature scary, then this would mean that movie $e_2$ is scarier than movie $e_1$). We briefly recall the method from Derrac and Schockaert (2015) in Section 3.1. Our hypothesis is that we can group these features into meaningful facets and that we can represent these facets as subspaces of the given vector space embedding. Section 3.2 discusses our approach for finding these subspaces.

Identifying Feature Directions
The method proposed in Derrac and Schockaert (2015) aims to find a set of features $F$ which can be modelled as directions in the given vector space. The input to their method consists of a text description $D_e$ of each entity $e$, but they assume no other prior knowledge. In particular, each word $w$ which occurs sufficiently frequently in the document collection $D = \{D_e \mid e \in E\}$ is considered as a candidate feature. To determine whether $w$ should be added as a feature, they train a linear SVM classifier to separate the vector representations of the entities $e$ for which $w$ is mentioned in $D_e$ from the vector representations of the other entities. If this SVM classifier is sufficiently accurate, they assume that the word $w$ captures a salient feature. The corresponding feature direction is then characterized by the normal vector $d_f$ of the hyperplane that was learned by the SVM classifier. We will use the notation $\mathrm{pos}_w$ to refer to the set of entities from $E$ which are classified as positive. In our experiments, we used logistic regression classifiers instead of SVMs, which we found to perform similarly but were faster to train.
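The procedure above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the toy embeddings and documents are invented, the logistic regression is fitted with plain gradient descent rather than an off-the-shelf solver, and Cohen's kappa is assumed as the accuracy measure (the experiments later mention ranking words by a Kappa metric). The helper names `train_logreg`, `cohen_kappa` and `feature_direction` are hypothetical.

```python
import math

def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit a logistic regression by batch gradient descent; returns (weights, bias)."""
    n = len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * n, 0.0
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(n):
                gw[j] += (p - t) * x[j]
            gb += p - t
        w = [wi - lr * g / len(X) for wi, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

def cohen_kappa(y_true, y_pred):
    """Agreement between binary labels and predictions, corrected for chance."""
    n = len(y_true)
    po = sum(a == b for a, b in zip(y_true, y_pred)) / n
    pt, pp = sum(y_true) / n, sum(y_pred) / n
    pe = pt * pp + (1 - pt) * (1 - pp)
    return 1.0 if pe == 1.0 else (po - pe) / (1 - pe)

def feature_direction(embeddings, docs, word):
    """Return (unit normal vector, kappa, set of entities classified as positive)."""
    names = sorted(embeddings)
    X = [embeddings[e] for e in names]
    y = [1 if word in docs[e] else 0 for e in names]
    w, b = train_logreg(X, y)
    pred = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0 for x in X]
    norm = math.sqrt(sum(wi * wi for wi in w))
    direction = [wi / norm for wi in w]
    pos = {e for e, p in zip(names, pred) if p == 1}
    return direction, cohen_kappa(y, pred), pos

# Toy data (invented): four "movies" in a 2-dimensional embedding.
embeddings = {"m1": [2.0, 0.1], "m2": [1.5, -0.2], "m3": [-1.8, 0.3], "m4": [-2.2, -0.1]}
docs = {"m1": {"scary", "dark"}, "m2": {"scary"}, "m3": {"funny"}, "m4": {"funny", "warm"}}
direction, kappa, pos = feature_direction(embeddings, docs, "scary")
print(direction, kappa, sorted(pos))
```

A word would then be kept as a feature only if its kappa score is high enough; the cut-off itself is a design choice not fixed by this sketch.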

Finding Facets
Our aim is to group the features from F into meaningful facets. For instance, in the movies domain, we might expect to see facets corresponding to e.g. genre, language and release date. It does not seem possible (nor desirable) to formally define what constitutes a good facet, a typical problem in unsupervised learning. Intuitively, however, a facet should group features which are of the same kind (e.g. genres) and should in some sense be exhaustive (i.e. all genres, rather than a set of features that refer to one or a few particular genres).
Using subspace clustering. The aim of subspace clustering is to decompose a high-dimensional space into the union of lower-dimensional spaces. This problem has found numerous applications, especially in computer vision. One may wonder whether we can learn useful facets by applying subspace clustering to the feature directions $d_f$. Unfortunately, in our initial experiments, this approach did not prove successful. This is illustrated for the movies domain in Table 1, where we used the state-of-the-art SSC-OMP subspace clustering method (You et al., 2016). For this comparison, we first manually grouped the features from $F$ to obtain a gold standard. The first column of the table shows facets from this gold standard, including media types (which indirectly captures the time period during which a movie was released). The right-most column shows the closest facets that were found with SSC-OMP. As can be seen, these facets are largely nonsensical. For instance, in the first case, words such as blu and disc are clustered together with semantically unrelated words such as fighting, england and accurate. In the second example, genres such as horror and thriller are grouped together with unrelated words such as cheesy, widescreen and award. This negative result seems in accordance with the findings from Locatello et al. (2018) that unsupervised disentangled representation learning seems impossible without a strong inductive bias.
Table 1 (excerpt; right-most column, second example): funniest, thrillers, funnier, suspense, witty, unfunny, amusing, suspenseful, historical, horror, romance, interviews, psychological horror, thriller, political, charming, funnier, slapstick, documentaries, hilarious, killed, seat, issue, cheesy, gory, mystery, effects, amazon, widescreen, transfer, realistic, relationship, monster, epic, portrayed, glad, premise, hearing, evil, car, formula, decision, violent, villain, gun, goofy, game, teens, garbage, humor, ruin, product, amount, dad, loving, personality, award, folks
We also tried several other subspace clustering methods, for a wide range of different configurations, without obtaining better results. Similarly, we experimented with neural approaches for learning disentangled representations directly from the bag-of-words representations of the entities, but again unsuccessfully.
Using word embeddings. These negative results strongly suggest that some kind of external knowledge is needed to find meaningful facets. To this end, we focus on the use of word embeddings, which seems natural given the fact that words of the same kind (e.g. different names of genres) tend to be used in similar contexts, and can thus be expected to have similar word vectors. In particular, our basic approach for identifying facets consists in clustering the word vectors, from some standard pre-trained word embedding model, corresponding to the features in $F$. One important drawback of this basic strategy, however, is that it often leads to thematic clusters. For instance, while we would want horror to be clustered together with other names of genres, when simply clustering word vectors without any further guidance, horror may be clustered together with thematically similar words such as scary and zombie.
To avoid such clusters, we rely on the insight that if $a$ and $b$ are thematically similar words (e.g. horror and zombie) then the corresponding feature directions $d_a$ and $d_b$ will also be similar. However, for paradigmatically similar words, such as horror and comedy, this should not be the case. In other words, two words should intuitively end up in the same cluster if they have similar word vectors but dissimilar feature directions.
While there are many ways to implement this intuition, we found that using the cosine similarity between $d_a$ and $d_b$ was not always reliable. Instead we rely on the following measure of overlap between the sets $\mathrm{pos}_a$ and $\mathrm{pos}_b$: The dissimilarity between features $a$ and $b$ from $F$ is then defined as follows: where the overlap threshold $\lambda$ is a hyperparameter and $w_f$ denotes the word vector for feature $f$. The aim of the clustering step is to find a number of disjoint subsets of $F$, each of which intuitively corresponds to a facet. We will denote these facets by $X_1, ..., X_k$. To avoid finding redundant facets, we identify them in an incremental fashion. In particular, from the clusters obtained by the clustering algorithm, we only select the single most important one, i.e. the one which is most likely to describe a salient facet. For this purpose, we rank clusters according to the following score: This score reflects the intuition that we prefer clusters with features that are general and diverse, i.e. such that most of the entities would have at least one of the features from the cluster. As will be explained below, after the subspace corresponding to this facet has been determined, we iteratively apply the same method on a reduced vector space to find the next most important facet, until the desired number of facets $k$ has been found.
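To make the intuition concrete, the following Python sketch shows one possible instantiation of an overlap measure, an overlap-gated dissimilarity and a coverage-based cluster score. The specific functional forms are assumptions made for illustration only (overlap relative to the smaller set, cosine distance replaced by a fixed penalty once the overlap exceeds $\lambda$, and score as the fraction of covered entities); they should not be read as the exact definitions used in this paper. The toy sets and word vectors are invented.

```python
import math

def overlap(pos_a, pos_b):
    """ASSUMED form: shared positive entities relative to the smaller set."""
    if not pos_a or not pos_b:
        return 0.0
    return len(pos_a & pos_b) / min(len(pos_a), len(pos_b))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def dissimilarity(a, b, pos, vec, lam=0.5, penalty=2.0):
    """ASSUMED form: cosine distance between word vectors, unless the features
    overlap too much, in which case a maximal penalty keeps them apart."""
    if overlap(pos[a], pos[b]) > lam:
        return penalty
    return 1.0 - cosine(vec[a], vec[b])

def cluster_score(cluster, pos, n_entities):
    """ASSUMED form: fraction of entities with at least one feature of the cluster."""
    covered = set().union(*(pos[f] for f in cluster))
    return len(covered) / n_entities

# Toy data (invented): entity ids classified as positive, and 2-d word vectors.
pos = {"horror": {1, 2, 3}, "zombie": {2, 3}, "comedy": {4, 5, 6}}
vec = {"horror": [1.0, 0.2], "zombie": [0.9, 0.4], "comedy": [0.8, 0.1]}

d_thematic = dissimilarity("horror", "zombie", pos, vec)      # high overlap: pushed apart
d_paradigmatic = dissimilarity("horror", "comedy", pos, vec)  # low overlap: word vectors decide
s_genres = cluster_score({"horror", "comedy"}, pos, 6)
s_theme = cluster_score({"horror", "zombie"}, pos, 6)
print(d_thematic, d_paradigmatic, s_genres, s_theme)
```

Under these assumptions, horror and comedy end up close while horror and zombie are kept apart, and the genre-like cluster covers more entities than the thematic one.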

Modelling Facets as Subspaces
We model each facet $X_i$ as a linear subspace of the given vector space embedding. To find this subspace, we learn new feature directions $c_f$ for each $f \in X_i$, which still capture these features but lie in a low-dimensional subspace. In particular, we minimize the following objective: where with $r$ the desired number of dimensions of the subspace. Note that (2) essentially expresses that for each $f \in X_i$, we want to train a logistic regression classifier with coefficient vector $c_f$. However, as expressed in (3), rather than learning these coefficient vectors independently, they are constrained such that they can be written as a linear combination of the vectors $a^i_1, ..., a^i_r$. The resulting feature directions thus span a subspace of (at most) $r$ dimensions. Let $M_i \in \mathbb{R}^{r \times n}$ be an orthonormal basis for this subspace. Then $e^i = M_i e$ is the $r$-dimensional facet-specific embedding of entity $e$.
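As a rough illustration of the kind of constrained objective described above, and not necessarily the exact formulation used here, one could write a sum of logistic losses over the features of the facet, with all coefficient vectors tied to a shared set of $r$ vectors. The labels $y_{e,f}$ (1 if entity $e$ is a positive example for feature $f$, 0 otherwise), the bias terms $b_f$ and the mixing weights $\mu_{f,j}$ are notation assumed for this sketch; $\sigma$ denotes the logistic sigmoid.

```latex
\min_{a^i_1,\dots,a^i_r,\;\mu,\;b}\;
\sum_{f \in X_i} \sum_{e \in E}
\Big[ -\,y_{e,f}\,\log \sigma(c_f \cdot e + b_f)
      \;-\; (1 - y_{e,f})\,\log\big(1 - \sigma(c_f \cdot e + b_f)\big) \Big]
\quad \text{subject to} \quad
c_f = \sum_{j=1}^{r} \mu_{f,j}\, a^i_j \quad \text{for all } f \in X_i
```

Because every $c_f$ is a linear combination of the same $r$ vectors, the learned directions cannot span more than $r$ dimensions, which is what makes the facet a low-dimensional subspace.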
There may be some features from $F$ which are not contained in $X_i$ but can nonetheless be modelled well in the resulting subspace (i.e. if they are semantically related to the features in $X_i$). To identify these features, we apply the method from Derrac and Schockaert (2015) to the facet-specific embeddings. We write $Y_i$ to denote the features that were thus identified, beyond the ones from $X_i$.
Next we determine a null space of the basis $M_i$, i.e. an $(n-r) \times n$ dimensional matrix $R_i$ satisfying $M_i R_i^\top = 0$. This matrix $R_i$ is a basis for the orthogonal complement of the subspace spanned by $M_i$. Intuitively, it defines what remains of the vector space embedding after we remove (i.e. project away) the subspace modelling the facet $X_i$.
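The following Python sketch illustrates the two matrices involved, under stated assumptions: it uses Gram-Schmidt orthogonalisation (one of several ways to obtain such bases, not necessarily the one used in our implementation), it derives the facet dimension from the rank of the given directions rather than fixing $r$ in advance, and the toy directions are invented. The helper names `orthonormalize`, `facet_and_remainder` and `project` are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(vectors, basis=None, tol=1e-10):
    """Gram-Schmidt: return orthonormal vectors spanning the part of `vectors`
    that is orthogonal to the (already orthonormal) rows in `basis`."""
    basis = list(basis or [])
    added = []
    for v in vectors:
        w = list(v)
        for q in basis + added:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors that add no new direction
            added.append([wi / norm for wi in w])
    return added

def facet_and_remainder(directions, n):
    """Return (M, R): an orthonormal basis M of the facet subspace and an
    orthonormal basis R of its orthogonal complement in R^n."""
    M = orthonormalize(directions)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    R = orthonormalize(identity, basis=M)
    return M, R

def project(matrix, e):
    """Coordinates of embedding e in the space whose basis rows are `matrix`."""
    return [dot(row, e) for row in matrix]

# Toy example (invented): two collinear feature directions in a 3-d embedding.
directions = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.0]]
M, R = facet_and_remainder(directions, n=3)
e = [3.0, 3.0, 5.0]
print(project(M, e))  # facet-specific embedding
print(project(R, e))  # what remains after projecting the facet away
```

Repeating the facet search on `project(R, e)` instead of `e` corresponds to working in the remainder space.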
To find the remaining facets, we repeat the same procedure, but with two changes. First, the $(n - r)$-dimensional remainder space is used instead of the original embedding space, i.e. we use $R_i e$ as the vector representation of $e$. Second, the features in $X_i \cup Y_i$ are no longer considered by the clustering algorithm. This process is repeated until the desired number of facets has been found, each time considering an increasingly lower-dimensional remainder space and clustering only those features that are not already modelled in a previously identified facet. Intuitively, by learning the facets in this incremental way, we should be able to avoid finding multiple variants of the same facets.
The middle column of Table 1 shows two of the facets that were found with this approach. Intuitively, these facets are clearly more meaningful than those that were found with SSC-OMP.

Experimental Analysis
Methods. We have experimented with two clustering algorithms: agglomerative hierarchical average link clustering and HDBSCAN (Campello et al., 2013). However, in the case of HDBSCAN we noticed that when using overlap-based dissimilarity, we typically ended up without any clusters. For HDBSCAN we therefore used cosine similarity instead. We refer to our method with agglomerative clustering as IncAgg and to the variant with HDBSCAN as IncHDB. In addition, we considered a variant of the method with agglomerative clustering which relies on cosine similarity instead of the overlap-based dissimilarity (CosIncAgg). Finally, we also report results for variants of our methods in which we did not obtain the facets incrementally (NonIncAgg and NonIncHDB). In these cases, we simply extract r clusters from the initial set of features $F$ and determine the corresponding facets directly. In all cases, we use 50-dimensional pre-trained GloVe word vectors (Pennington et al., 2014) for clustering the features.
Table 3: Classification performance (in terms of F1 score) when using the MDS space and four variations of the facet-based representations.
To generate the initial vector space embedding, we follow the approach proposed in Derrac and Schockaert (2015) based on multi-dimensional scaling. In all cases, we used 100-dimensional vector spaces and learned 10 facets, each being modelled as a 10-dimensional subspace. To select the set of features $F$, we initially consider the 500 highest scoring words according to the Kappa metric. However, if we end up without any clusters (in the case of HDBSCAN), we expand the set of features to the 1000 top words. The overlap threshold $\lambda$ is selected based on held-out tuning data, considering values from {0.3, 0.5, 0.7}. To flatten the agglomerative clustering, we tune the number of clusters from {50, 100, 200}.
Evaluation tasks. Intrinsic evaluation of the learned facets is difficult, among other reasons because what we might consider to be a natural facet is highly subjective. Therefore, in our quantitative evaluation, we will focus on the impact of the learned facets in a number of classification tasks. This is also motivated by the view that some types of classifiers need semantically meaningful features to perform well. For example, Ager et al. (2018) used low-depth decision trees to evaluate a method for learning feature directions in vector space embeddings. Specifically, if $F = \{f_1, ..., f_m\}$ is the set of features that were identified, then they represent each entity $e$ using the feature vector $(d_{f_1} \cdot e, ..., d_{f_m} \cdot e)$, with $d_f$ the direction modelling feature $f$ as before. Given that a depth-1 decision tree can only use one of these features, the performance of such a decision tree essentially tells us to what extent the classes that are considered in the supervised classification task have been discovered as features. In our experiments, we will report the results of depth-1 and depth-3 decision trees. As the baseline method, referred to as MDS, we will use the top-2000 features that we obtained with the method from Derrac and Schockaert (2015).
To evaluate the facets, we instead apply this method to find the top-200 features for each of the facet subspaces.
Table 4: Examples of clusters when standard cosine similarity is used (left) and with the proposed overlap-based dissimilarity score (right).
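The depth-1 evaluation can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the split is chosen by training accuracy rather than by an impurity criterion such as Gini, which a standard decision tree learner would use, and the directions, embeddings and labels are invented. The helper names `feature_vector` and `best_stump` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def feature_vector(e, directions):
    """Represent entity embedding e by its projections on the feature directions."""
    return [dot(d, e) for d in directions]

def best_stump(rows, labels):
    """Return (feature index, threshold, accuracy) of the best single-feature split."""
    best = (None, None, 0.0)
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            pred = [1 if r[j] > t else 0 for r in rows]
            acc = sum(p == y for p, y in zip(pred, labels)) / len(labels)
            acc = max(acc, 1 - acc)  # the split may be used in either orientation
            if acc > best[2]:
                best = (j, t, acc)
    return best

# Toy data (invented): directions for "scary" and "funny", label = horror movie.
directions = [[1.0, 0.0], [0.0, 1.0]]
entities = [[2.0, 0.3], [1.5, -0.4], [-1.0, 0.5], [-2.0, -0.2]]
labels = [1, 1, 0, 0]
rows = [feature_vector(e, directions) for e in entities]
stump = best_stump(rows, labels)
print(stump)
```

In this toy case the stump selects the first feature, which mirrors the idea that a depth-1 tree succeeds only when the target class is already captured by a single feature direction.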
The performance of the decision trees will allow us to evaluate whether we are able to learn higher-quality feature directions thanks to the decomposition of the vector space into facet subspaces. To evaluate the quality of the facets independently of the quality of the feature directions, we also consider classifiers which use as input the facet-specific vector representations $e^i$ of the entities. Specifically, we train a support vector machine (SVM) for each of the facets, leading to the predictions $p_1, ..., p_k$. These predictions are then aggregated to a final prediction using a logistic regression meta-classifier. As baseline, we simply train a single SVM classifier in the full vector space. As our final classifier, we estimate a Gaussian model from the positive training examples. In particular, we estimate a univariate Gaussian for each dimension and multiply the corresponding probabilities. We chose this method because it is sensitive to how well the dimensions of the space are aligned with semantically meaningful properties, and because such Gaussians are commonly used for representing categories in conceptual spaces (Bouraoui and Schockaert, 2018). For the baseline, we use the dimensions of the full vector space. For the facet-based representations, we use the dimensions of the facet subspaces.
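The Gaussian classifier can be sketched as follows. This is a minimal illustration, assuming a small variance floor for numerical stability and working with log-densities (the sum of per-dimension log-densities equals the log of the product of the per-dimension probabilities); how the resulting score is turned into a class decision, e.g. by thresholding or ranking, is left open here. The data and the helper names `fit_gaussians` and `log_density` are invented for the example.

```python
import math

def fit_gaussians(positives, min_var=1e-6):
    """Estimate one univariate Gaussian (mean, variance) per dimension."""
    n, dims = len(positives), len(positives[0])
    params = []
    for j in range(dims):
        col = [x[j] for x in positives]
        mu = sum(col) / n
        var = max(sum((c - mu) ** 2 for c in col) / n, min_var)  # assumed variance floor
        params.append((mu, var))
    return params

def log_density(x, params):
    """Log of the product of the per-dimension Gaussian densities at x."""
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (xj - mu) ** 2 / (2 * var)
        for xj, (mu, var) in zip(x, params)
    )

# Toy data (invented): positive examples of a category in a 2-d (facet) space.
positives = [[1.0, 0.0], [1.2, 0.1], [0.8, -0.1]]
params = fit_gaussians(positives)
near = log_density([1.0, 0.0], params)
far = log_density([3.0, 2.0], params)
print(near, far)
```

Since each dimension is modelled independently, the score is only informative when the individual dimensions are meaningful, which is why this classifier is sensitive to the alignment of the space.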
Dataset. We have carried out experiments with vector space embeddings for four different domains. First, we used the movies and place type domains from Derrac and Schockaert (2015), where the embeddings are learned respectively from movie reviews and from Flickr tag co-occurrence distributions. We also considered two additional domains, for which we used Wikipedia articles: buildings and organisations. In particular, we retrieved all Wikipedia pages whose semantic type on WikiData corresponds to building or organisation. Wikipedia pages containing fewer than 200 words were removed. The bag-of-words (BoW) representations of the remaining Wikipedia concepts were obtained using a standard preprocessing strategy (e.g. removing HTML tags and references), including stopword removal with NLTK (Bird and Loper, 2004). Furthermore, we POS tagged the documents and only retained the nouns and adjectives. Finally, frequent words that occurred in more than 60% of the Wikipedia articles about buildings (resp. organisations) were removed, as well as words that occurred fewer than 10 times. This approach was taken to stay broadly in line with the strategy that was used in Derrac and Schockaert (2015). As classification tasks, we used two attributes from WikiData in both domains (being the only attributes for which a sufficient number of entities per attribute value was found). The full datasets will be released upon acceptance. The properties of the considered domains and associated classification problems are summarized in Table 2. For each classification problem, we randomly split the labelled examples into 2/3 for training and 1/3 for testing. For tuning we use 5-fold cross-validation over the training set. In the movies domain, where more labelled data is available, we have used fixed splits of 60% for training, 20% for tuning and 20% for testing.
Figure 1: [...] showing the full space. Bottom: showing the 10-dimensional representations for the facet $X_i$ = {campuses, students, offices, centers, facilities, area, hotels, homes, bridges, hospitals, cities, shops, stations}.
Results. The results are summarized in Table 3. Our main method IncAgg outperforms the MDS baseline for almost all classification tasks and types of classifiers. For the HDBSCAN-based variant, the results are more mixed, which seems related to the fact that the overlap-based dissimilarity could not be used in that case. Indeed, the cosine-based variant of IncAgg, i.e. CosIncAgg, also performs consistently worse than IncAgg. Looking at the performance of NonIncAgg and NonIncHDB reveals that learning facets in an iterative fashion is critical, given that these two variants perform worse than the baseline in many cases. Looking more closely at the results of our main method IncAgg, it is interesting to note that large improvements are obtained for depth-1 decision trees, which shows that our facet subspaces make it easier to identify features that correspond to the categories from the corresponding classification problems. However, large improvements can also be seen for SVMs, which shows that the actual decomposition of the space is also helpful.
Qualitative Analysis. Figure 1 illustrates how our subspaces capture similarity in a facet-specific way, showing the first two principal components of the embedding of Birmingham School of Art in the full space and in the subspace of a facet that intuitively captures building type. While the neighbours in the full space are a mixture of different building types (hotels, commercial buildings, museums, and educational buildings), in the facet subspace all nearest neighbours are universities. Table 4 illustrates the impact of using overlap-based dissimilarity, where the clusters obtained with cosine similarity are clearly more thematic, while the ones obtained with the overlap-based metric intuitively capture a facet (i.e. geographic location and the natural-cultural opposition). Finally, Table 5 shows some of the facets obtained in the buildings domain. The first example shows a facet which intuitively captures the historical-contemporary opposition, while the second example shows a facet that captures the rural-city opposition.

Conclusions
We considered the problem of decomposing a vector space embedding into facets, which are characterized by a set of semantically related features and a corresponding subspace of the embedding. In particular, we focused on unsupervised methods, considering both approaches that rely on the vector space itself (i.e. using subspace clustering) and approaches that additionally take into account the information about word meaning that is captured by pre-trained word vectors. Overall, we found this problem to be highly challenging, in accordance with the findings from Locatello et al. (2018) regarding unsupervised disentangled representation learning. However, we were still able to obtain useful facets based on two crucial modifications to a standard clustering based strategy. First, we measure the similarity between features based on two factors: the similarity between their word vectors and the dissimilarity between their meaning in the vector space embedding (measured in terms of overlap). Second, we found it essential to learn facets in an iterative fashion, to avoid too much redundancy between the different facets.