Learning Kernels for Semantic Clustering: A Deep Approach

In this thesis proposal we present a novel semantic embedding method, which aims at consistently performing semantic clustering at the sentence level. Taking into account special aspects of Vector Space Models (VSMs), we propose to learn reproducing kernels in classification tasks. In this way, capturing spectral features from data is possible. These features make it theoretically plausible to model semantic similarity criteria in Hilbert spaces, i.e. the embedding spaces. We could improve semantic assessment over embeddings, which are criterion-derived representations of traditional semantic vectors. The learned kernel can be easily transferred to clustering methods where the Multi-Class Imbalance Problem is considered (e.g. semantic clustering of definitions of terms).


Introduction
Overall in Machine Learning algorithms (Duda et al., 2012), knowledge is statistically embedded via the Vector Space Model (VSM), which is also named the semantic space (Landauer et al., 1998; Padó and Lapata, 2007; Baroni and Lenci, 2010). Contrary to what is usually conceived in text data analysis (Manning et al., 2009; Aggarwal and Zhai, 2012), not every data set is suitable for embedding into ℓp metric spaces, including Euclidean spaces (p = 2) (Riesz and Nagy, 1955). This implies that, in particular, clustering algorithms are being adapted to some ℓp-derived metric, but not to semantic vector sets (clusters) (Qin et al., 2014).
The above implication also means that semantic similarity measures are commonly not consistent, e.g. the cosine similarity or transformation-based distances (Sidorov et al., 2014). These are mainly based on the concept of triangle. Thus, if the triangle inequality does not hold (it holds exclusively for norms, e.g. those of Hilbert spaces), the cosine similarity becomes mathematically inconsistent. Although VSMs are sometimes not mathematically analyzed, traditional algorithms work well enough for global semantic analysis (hereinafter global analysis, i.e. at document level, where Zipf's law holds). Nevertheless, for local analysis (hereinafter local analysis, i.e. at sentence, phrase or word level) the issue remains open (Mikolov et al., 2013).
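The inconsistency is easy to exhibit numerically. The following minimal sketch (with hypothetical two-dimensional BoW-style vectors) shows that the dissimilarity 1 − cos θ violates the triangle inequality, so it is not a proper metric:

```python
import math

def cos_dist(u, v):
    """Dissimilarity 1 - cos(theta); not a proper metric."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

# Three toy vectors (hypothetical term counts).
a, b, c = [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]

lhs = cos_dist(a, c)                   # = 1.0 (orthogonal vectors)
rhs = cos_dist(a, b) + cos_dist(b, c)  # ~ 0.586
# lhs > rhs: the triangle inequality d(a,c) <= d(a,b) + d(b,c) fails.
```

Here d(a, c) = 1.0 exceeds d(a, b) + d(b, c) ≈ 0.586, so reasoning that implicitly assumes a metric (or a norm-induced geometry) over such dissimilarities is unjustified.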
In this thesis proposal, we will address the main difficulties raised by traditional VSMs for local analysis of text data. We consider the latter an ill-posed problem (which implies unstable algorithms) in the sense of some explicit semantic similarity criterion (hereinafter criterion), e.g. topic, concept, etc. (Vapnik, 1998; Fernandez et al., 2007). The following feasible reformulation is proposed. By learning a kernel in classification tasks, we want to induce an embedding space (Lanckriet et al., 2004; Cortes et al., 2009). In this space, we will consider the relevance (weighting) of spectral features of data, which are in turn related to the shape of semantic vector sets (Xiong et al., 2014). These vectors would be derived from different Statistical Language Models (SLMs), i.e. countable things, e.g. n-grams, bag-of-words (BoW), etc., which in turn encode language aspects (e.g. semantics, syntax, morphology, etc.). Learned kernels can be transferred to clustering methods (Yosinski et al., 2014; Bengio et al., 2014), where spectral features would be properly filtered from text (Gu et al., 2011).
When both learning and clustering processes are performed, the kernel approach is tolerant enough to data scarcity. Thus, eventually, we could have any criterion-derived amount of semantic clusters regardless of the Multi-Class Imbalance Problem (MCIP) (Sugiyama and Kawanabe, 2012). It is a rarely studied problem in Natural Language Processing (NLP); however, contributions can be helpful in a number of tasks such as IE, topic modeling, QA systems, opinion mining, Natural Language Understanding, etc.
This paper is organized as follows: In Section 2 we show our case study. In Section 3 we show the embedding framework. In Section 4 we present our learning problem. Sections 5 and 6 respectively show research directions and related work. In Section 7, conclusions and future work are presented.

A case study and background
A case study. Semantic clustering of definitions of terms is our case study. See the following extracted examples for the terms window and mouse. For each of them, the main sense is shown first, followed by three secondary senses:

1. A window is a frame including a sheet of glass or other material capable of admitting light...
(a) The window is the time elapsed since a passenger calls to schedule...
(b) A window is a sequence region of 20-codon length on an alignment of homologous genes...
(c) A window is any GUI element and is usually identified by a Windows handle...

2. A mouse is a mammal classified in the order Rodentia, suborder Sciurognathi...

In example 1, it is possible to assign the four senses to four different semantic groups (the window (1), transport services (1a), genetics (1b) and computing (1c)) by using lexical features (bold terms). This example also indicates how abstract concepts are always latent in the definitions. Example 2 is a bit more complex. Unlike example 1, there would be three clusters, because two of the senses are semantically similar (2a and 2b are both related to computing). However, they are lexically very distant. Note that in both examples the amount of semantic clusters cannot be defined a priori (unlike in Wikipedia). Additionally, it is impossible to know what topic the users of an IE system could be interested in. These issues point out the need for analyzing the way we currently treat semantic spaces in the sense of stability of algorithms (Vapnik, 1998), i.e. the existence of semantic similarity consistency, even though Zipf's law scarcely holds (e.g. in local analysis).
Semantic spaces and embeddings. Erk (2012) and Brychcín (2014) presented insightful empirical results on well-known semantic spaces for different cases in global analysis. In this work, we have special interest in local analysis, where semantic vectors are representations (embeddings) derived from feature maps learned for specific semantic assessments (Mitchell and Lapata, 2010). These feature maps are commonly encoded in Artificial Neural Networks (ANNs) (Kalchbrenner et al., 2014).
ANNs have recently attracted worldwide attention. Given their surprising adaptability to unknown distributions, they are used in NLP for embedding and feature learning in local analysis, i.e. Deep Learning (DL) (Socher et al., 2011; Socher et al., 2013). However, we require knowledge transfer towards clustering tasks, which is still not feasible using ANNs (Yosinski et al., 2014). Thus, theoretical access becomes ever more necessary, so it is worth extending Kernel Learning (KL) studies as an alternative feature learning method in NLP (Lanckriet et al., 2004). Measuring subtle semantic displacements, according to a criterion, is theoretically attainable in a well-defined (learned) reproducing kernel Hilbert space (RKHS), e.g. some subset of L_2 (Aronszajn, 1950). In these spaces, features are latent abstraction levels of the data spectrum, which improves kernel scaling (Dai et al., 2014; Anandkumar et al., 2014).

RKHS and semantic embeddings
We propose mapping sets of semantic vectors (e.g. BoW) into well-defined function spaces (RKHSs), rather than directly endowing such sets (which are not elliptical, or at least not convex (Qin et al., 2014)) with the Euclidean norm ‖·‖_2 (see Figure 1). For the aforesaid purpose, we want to take advantage of RKHSs.
Any semantic vector x_o ∈ X can be consistently embedded (transformed) into a well-defined Hilbert space by using the reproducing property of a kernel k(·, ·) (Shawe-Taylor and Cristianini, 2004):

f(x_o) = ⟨f, k(·, x_o)⟩_H, for all f ∈ H,    (1)

where H ⊂ L_2 is a RKHS, and f_{x_o}(·) ∈ H is the embedding derived from x_o, in which x_o can be seen as a fixed parameter of k(·, x_o) = f(·) ∈ H. This embedding function is defined over the vector domain {x} ⊂ X, and ⟨·, ·⟩_H : H × H → R is the inner product in H. Whenever (1) holds, k(·, ·) is a positive definite (PD) kernel function, so X does not even need to be a vector space; even then, convergence of any sequence {f_n(x) : f_n ∈ H; n ∈ N} can be ensured. The above is a highly valuable characteristic of the resulting function space (Smola et al., 2007):

k(·, ·) = Σ_{n ∈ N} β_n k_n(·, ·).    (2)

The result (2) implies that convergence of the summation of initial guessing kernel functions k_n(·, ·) ∈ H always occurs; hence, it is absolutely possible to talk about the existence of a suitable kernel function k(·, ·) ∈ H in (1). It means that L_2 operations can be consistently applied, e.g. the usual norm ‖·‖_2, trigonometric functions (e.g. cos θ) and the distance d_2 = ‖f_n − f_m‖_2, m ≠ n. Thus, from the right side of (2), in order that (1) holds, convergence of the Fourier series decomposition of k(·, ·) towards the spectrum of desired features from data is necessary; i.e., by learning parameters and hyperparameters of the series (Ong et al., 2005; Bȃzȃvan et al., 2012).
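The positive definiteness required by (1) can be checked concretely on a small Gram matrix. The following sketch, assuming a Gaussian kernel and hypothetical toy BoW vectors, verifies positive definiteness via Sylvester's criterion (all leading principal minors positive):

```python
import math

def gauss_kernel(x, y, sigma=1.0):
    """Gaussian (PD) kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

# Toy semantic vectors (hypothetical term counts).
X = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 1.0, 0.0]]
K = [[gauss_kernel(xi, xj) for xj in X] for xi in X]  # Gram matrix

# Sylvester's criterion: a symmetric matrix is PD iff all leading
# principal minors are strictly positive.
m1 = K[0][0]
m2 = K[0][0] * K[1][1] - K[0][1] * K[1][0]
m3 = (K[0][0] * (K[1][1] * K[2][2] - K[1][2] * K[2][1])
      - K[0][1] * (K[1][0] * K[2][2] - K[1][2] * K[2][0])
      + K[0][2] * (K[1][0] * K[2][1] - K[1][1] * K[2][0]))
positive_definite = m1 > 0 and m2 > 0 and m3 > 0  # True for distinct points
```

Since the Gram matrix of a Gaussian kernel over distinct points is strictly positive definite, the induced RKHS geometry (norms, angles, distances) is well defined, unlike the raw VSM case discussed above.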

Learnable kernels for language features
Assume (1) and (2) hold. For some SLM a encoded in a traditional semantic space, it is possible to define a learnable kernel matrix K_a as follows (Lanckriet et al., 2004; Cortes et al., 2009):

K_a = Σ_{i=1}^{p} β_i K_i,    (3)

where {K_i}_{i=1}^{p} ⊂ K is the set of p initial guessing kernel matrices (belonging to the family K, e.g. Gaussian) with fixed hyperparameters, and the β_i's are parameters weighting the K_i's. Please note that, for simplicity, we are using the matrices associated to the kernel functions k_i(·, ·), k_a(·, ·) ∈ H, respectively.
Fourier domain and bandwidth. In fact, (3) is a Fourier series, where the β_i's are decomposition coefficients of K_a (Bȃzȃvan et al., 2012). This kernel would fit the spectrum of some SLM that encodes some latent language aspect from text (Landauer et al., 1998). On one hand, in the Fourier domain operations (e.g. the error vector norm) are closed in L_2; i.e., according to (2), convergence is ensured as a Hilbert space is well defined. Moreover, the L_2-regularizer is convex in terms of the Fourier series coefficients (Cortes et al., 2009). The aforementioned facts imply benefits in terms of computational complexity (scaling) and precision (Dai et al., 2014). On the other hand, hyperparameters of the initial guessing kernels are learnable for detecting the bandwidth of data (Ong et al., 2005; Bȃzȃvan et al., 2012; Xiong et al., 2014). Eventually, the latter fact would lead us to know (by learning) bounds on the amount of data necessary to properly train our model (the Nyquist theorem).
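A conic combination like (3) can be sketched directly. The base kernels below are Gaussians with fixed bandwidths σ_i; the weights β_i are set by hand here (in the proposal they would be learned), and all names are illustrative assumptions:

```python
import math

def gauss(x, y, sigma):
    """Gaussian base kernel with bandwidth sigma."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def combined_kernel(x, y, betas, sigmas):
    """k_a(x, y) = sum_i beta_i * k_i(x, y): a conic combination of PD
    kernels, hence itself PD whenever all beta_i >= 0 (cf. (3))."""
    return sum(b * gauss(x, y, s) for b, s in zip(betas, sigmas))

sigmas = [0.5, 1.0, 2.0]   # fixed hyperparameters of the initial guesses
betas  = [0.2, 0.5, 0.3]   # weighting parameters (hand-picked sketch)

x, y = [1.0, 0.0], [0.0, 1.0]
k_xy = combined_kernel(x, y, betas, sigmas)
k_xx = combined_kernel(x, x, betas, sigmas)  # = sum(betas) = 1.0
```

Because each β_i ≥ 0 scales a PD kernel, the sum stays PD; learning then reduces to choosing the coefficients of this "series" so that the combined spectrum matches the data.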
Cluster shape. A common shape among clusters is assumed, even for unseen clusters with different, independent and imbalanced prior probability densities (Vapnik, 1998; Sugiyama and Kawanabe, 2012). For example, if data is Gaussian-distributed in the input space, then the shapes of different clusters tend to be elliptical (the utopian ℓ2 case), although their densities may be irregular or even very imbalanced. Higher abstraction levels of the data spectrum possess the mentioned traits (Ranzato et al., 2007; Baktashmotlagh et al., 2013). Below we suggest a more general version of (3), thereby considering higher abstraction levels of text data.

Learning our kernel in a RKHS
A transducer is a setting for learning parameters and hyperparameters of a multikernel linear combination like the Fourier series (3) (Bȃzȃvan et al., 2012).
Overall, the above setting consists of defining a multi-class learning problem over a RKHS: let Y_θ = {y_ℓ}, y_ℓ ∈ N, be a sequence of targets inducing a semantic criterion θ; likewise, a training set X = {x_ℓ}, x_ℓ ∈ R^n, and a set of initial guessing kernels {K_i^σ}_{i=1}^{p} ⊂ K with the associated hyperparameter vector σ_a = {σ_i}_{i=1}^{p}. Then, for some SLM a ∈ A, we would learn the associated kernel matrix K_a by optimizing the SLM empirical risk functional:

min_{β_a, σ_a} J_A(β_a, σ_a) = L_A(Y_θ, X) + ξ(β_a) + ψ(σ_a),    (4)

where in J_A(·, ·) we have:

K_a = Σ_{i=1}^{p} β_i K_i^{σ_i}.    (5)

The learning is divided into two interrelated stages: in the first stage, the free parameter vector β_a = {β_i}_{i=1}^{p} in (5) (a particular version of (3)) is optimized to learn a partial kernel K_a, given a fixed (sufficiently small) σ_a and by using the regularizer ξ(β_a) over the SLM prediction loss L_A(·, ·) in (4). Conversely, in the second stage σ_a is free; thus, by using the regularizer ψ(σ_a) over the prediction loss L_A(·, ·), given that the optimal β_a* was found in the first stage, we can obtain the optimal σ_a* and therefore K_a* is selected.
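The first stage can be sketched with a simple surrogate objective. The sketch below substitutes kernel-target alignment for the unspecified loss L_A and regularizers ξ, ψ (an assumption made only for illustration, not the proposal's actual objective), fixes σ_a, and grid-searches β_a:

```python
import math

def gauss(x, y, s):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * s * s))

def gram(X, betas, sigmas):
    """Gram matrix of the combined kernel K_a = sum_i beta_i K_i (cf. (5))."""
    return [[sum(b * gauss(xi, xj, s) for b, s in zip(betas, sigmas))
             for xj in X] for xi in X]

def alignment(K, Y):
    """Kernel-target alignment <K, T>_F / (||K||_F ||T||_F), with
    T_ij = +1 if y_i == y_j else -1; a proxy for the empirical risk."""
    n = len(Y)
    T = [[1.0 if Y[i] == Y[j] else -1.0 for j in range(n)] for i in range(n)]
    num = sum(K[i][j] * T[i][j] for i in range(n) for j in range(n))
    nk = math.sqrt(sum(v * v for row in K for v in row))
    nt = math.sqrt(sum(v * v for row in T for v in row))
    return num / (nk * nt)

# Hypothetical two-class toy data and targets.
X = [[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [2.9, 3.0]]
Y = [0, 0, 1, 1]
sigmas = [0.5, 1.0]  # fixed in stage 1; stage 2 would refine these

# Stage 1: coarse grid search over the simplex of beta weights.
best_beta, best_score = None, -1.0
for b0 in (0.0, 0.5, 1.0):
    betas = (b0, 1.0 - b0)
    score = alignment(gram(X, betas, sigmas), Y)
    if score > best_score:
        best_beta, best_score = betas, score
```

In the actual transducer both stages would use gradient-based optimization of the regularized risk; the grid search only makes the "fix σ, fit β" structure of stage one explicit.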
At higher abstraction levels, given the association {X, Y_θ}, the transducer setting would learn a kernel function that fits a multi-class partition of X via summation of the K_a's. Thus, we can use the learned kernels K_a* as new initial guesses in order to learn a compound kernel matrix K_θ for a higher abstraction level:

min_{γ_θ} J(γ_θ) = L(Y_θ, X) + ζ(γ_θ),    (6)

where in the general risk functional J(·) we have:

K_θ = Σ_{a ∈ A} γ_a K_a*.    (7)

In (6), the vector γ_θ = {γ_a}_{a ∈ A} weights the semantic representations K_a* associated to each SLM, and ζ(γ_θ) is a proper regularizer over the general loss L(·, ·). The described learning processes can even be performed jointly (Bȃzȃvan et al., 2012). The aforementioned losses and regularizers can be conveniently defined (Cortes et al., 2009).

The learned kernel function
In order to make relevant features emerge from text, we would use our learned kernel K_θ*. Thus, if {γ_θ*, {β_a*, σ_a*}_{a ∈ A}} is the solution set of the learning problems (4) and (6), then combining (5) and (7) gives the embedding kernel function, for |A| different SLMs as required (see Figure 2).

Definition 1. Given a semantic criterion θ, the learned parameters {γ_θ*, {β_a*, σ_a*}_{a ∈ A}} are eigenvalues of the kernels {K_a*}_{a ∈ A} ≺ K_θ*, respectively. Thus, according to (1), for any semantic vector x_o ∈ X we have its representation f_{x_o}(x) ∈ H:

f_{x_o}(x) = k_θ(x, x_o) = Σ_{a ∈ A} γ_a* Σ_{i=1}^{p} β_i* k_i(x, x_o).    (8)

In (8), k_i(·, ·), k_θ(·, ·) ∈ H ⊂ L_2 are the reproducing kernel functions associated to the matrices K_i^σ and K_θ, respectively. The associated {σ_a*}_{a ∈ A} would optimally fit the bandwidth of data, and 𝒳 ⊃ X is a compound semantic space built from different SLMs. Please note that we could extend Definition 1 to deeper levels (layers) associated to abstraction levels of SLMs. These levels could explicitly encode morphology, syntax, semantics or compositional semantics, i.e. {K_a}_{a ∈ A} = K_SLMs ≺ K_aspects.
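The compound form (8) can be sketched as a doubly weighted sum. All parameter values below are hypothetical stand-ins for learned {γ_a*, β_{a,i}*, σ_{a,i}*}; the "embedding" of x_o is obtained empirically by evaluating k_θ(·, x_o) on a sample:

```python
import math

def gauss(x, y, s):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * s * s))

def k_theta(x, xo, gammas, betas, sigmas):
    """Compound kernel of (8): outer sum over SLMs (weights gamma_a),
    inner sum over base kernels (weights beta_{a,i})."""
    return sum(g * sum(b * gauss(x, xo, s) for b, s in zip(ba, sa))
               for g, ba, sa in zip(gammas, betas, sigmas))

# Hypothetical learned parameters for |A| = 2 SLMs, p = 2 base kernels each.
gammas = [0.6, 0.4]
betas  = [[0.7, 0.3], [0.5, 0.5]]
sigmas = [[0.5, 1.0], [1.0, 2.0]]

# Empirical representation of x_o: evaluate k_theta(., x_o) on a sample X.
X  = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
xo = [1.0, 1.0]
f_xo = [k_theta(x, xo, gammas, betas, sigmas) for x in X]
```

Each coordinate of f_xo measures a weighted, multi-scale similarity between x_o and a sample point, which is the finite-sample counterpart of the function f_{x_o}(·) ∈ H in (8).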

Research directions
Our main research direction is to address in detail the linguistic interpretations associated with the right-hand side of (8), which are still not clear. There are potential ways of interpreting pooling operations over the expansion of either the eigenvalues or the eigenfunctions of f_{x_o}(·). This could lead us to an alternative way of analyzing written language, i.e. in terms of the spectral decomposition of X given θ.
As another direction, we consider data scarcity (low annotated resources). This issue is well handled by spectral approaches like the one proposed, so it is worth investigating hyperparameter learning techniques. We consider hyperparameters to be the lowest abstraction level of the learned kernel; they are aimed at data bandwidth estimation (i.e. by tuning the σ_i associated to each k_i(·, ·) in (8)). This estimation could help us answer the question of how much training data is enough. This question is also related to the quality bounds of a learned kernel. These bounds could be used to investigate the possible relation among the number of annotated clusters, the training set size and the generalization ability. The latter would be provided (transferred) by the learned kernel to a common clustering algorithm for discovering imbalanced unseen semantic clusters. We are planning to perform the above experiments for at least a couple of semantic criteria, including term sense discovery (Section 2). Nevertheless, much remains to be done.

Related work
Clustering of definitional contexts. Molina (2009) processed snippets containing definitions of terms (Sierra, 2009). The obtained PD matrix is nothing more than a homogeneous quadratic kernel that induces a Hilbert space: the Textual Energy of data (Fernandez et al., 2007; Torres-Moreno et al., 2010). Hierarchical clustering is performed over the resulting space, but no semantic criterion was considered. Thus, like Cigarran (2008), they ranked retrieved documents by simply relying on lexical features (global analysis). No ML analysis was performed, so their approach suffers from high sensitivity to lexical changes (instability) in local analysis.
Paraphrase extraction from definitional sentences. Hashimoto et al. (2011) and Yan et al. (2013) engineered vectors from contextual, syntactical and lexical features of definitional sentence paraphrases (similarly to Lapata (2007) and Ferrone (2014)). As training data they used a POS-annotated corpus of sentences containing noun phrases. A binary SVM was trained, aimed at both paraphrase detection and multi-word term equivalence assertion (Choi and Myaeng, 2012; Abend et al., 2014). More complex constructions were not considered, but their feature mixture performs very well. Socher et al. (2011) used ANNs for paraphrase detection. According to the labeling, the network captures, in an unsupervised way, as many language features as are latent in data (Kalchbrenner et al., 2014). The network learns, in a supervised way, to represent desired contents inside phrases (Mikolov et al., 2013); thus paraphrase detection is highly generalized. Nevertheless, the need for a tree parser is notable. Unlike in (Socher et al., 2013), the network must learn syntactic features separately.
Definitional answer ranking. Figueroa (2012; 2014) proposed to represent definitional answers by a Context Language Model (CLM), i.e. a Markovian process as probabilistic language model. A knowledge base (WordNet) is used as an annotated corpus of specific domains (limited to Wikipedia). Unlike in our approach, queries must be previously disambiguated; for instance: "what is a computer virus?", where "computer virus" disambiguates "virus". Answers are classified according to relevant terms (Mikolov et al., 2013), similarly to the way topic modeling approaches work (Fernandez et al., 2007; Lau et al., 2014).
Learning kernels for clustering. Overall, for knowledge transfer from classification (source) tasks to clustering (target) tasks, the state of the art is not vast. This setting is generally explored by using toy Gaussian-distributed data and predefined kernels (Jenssen et al., 2006; Jain et al., 2010). Particularly for text data, Gu et al. (2011) addressed the setting by using multi-task kernels for global analysis. In their work, it was necessary neither to discover clusters nor to model some semantic criterion. Both are assumed as part of the presetting of their analysis, which differs from our proposal.
Feasibility of KL over DL. We want to perform clustering over an embedding space. To the best of our knowledge, there exist two dominant approaches for feature learning: KL and DL. However, knowledge transfer is equally important for us, and both procedures are more transparent under the KL approach than under DL. The main reasons are: (i) Interpretability. The form (8) has been deduced from specific components (e.g. SLMs encoding language aspects), which leads us to think that a latent statistical interpretation of language is worthy of further investigation. (ii) Modularity. Any kernel can be transparently transferred into kernelized and non-kernelized clustering methods (Schölkopf et al., 1997; Aguilar-Martin and De Mántaras, 1982; Ben-Hur et al., 2002). (iii) Mathematical support. The theoretical access provided by kernel methods would allow for future work on semantic assessments via increasingly abstract representations. (iv) Data scarcity. It is one of our principal challenges, and kernel methods are feasible because of their generalization predictability (Cortes and Vapnik, 1995).
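The modularity point (ii) can be made concrete: once learned, a kernel is just a Gram matrix, which plugs into kernel k-means unchanged. The sketch below uses a plain Gaussian kernel as a stand-in for a learned K_θ* (an assumption for illustration) and clusters using Gram-matrix entries only:

```python
import math

def gauss(x, y, s=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * s * s))

def kernel_kmeans(K, labels, iters=10):
    """Kernel k-means: reassign each point to the nearest cluster mean in
    the RKHS, computed from Gram-matrix entries only (no raw vectors)."""
    n = len(K)
    for _ in range(iters):
        new = []
        for i in range(n):
            best_c, best_d = None, float("inf")
            for c in set(labels):
                idx = [j for j in range(n) if labels[j] == c]
                m = len(idx)
                # ||phi(x_i) - mu_c||^2 expanded via kernel evaluations:
                d = (K[i][i]
                     - 2.0 * sum(K[i][j] for j in idx) / m
                     + sum(K[j][l] for j in idx for l in idx) / (m * m))
                if d < best_d:
                    best_c, best_d = c, d
            new.append(best_c)
        if new == labels:
            break
        labels = new
    return labels

# Two well-separated toy groups; a deliberately poor initial labeling.
X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
K = [[gauss(xi, xj) for xj in X] for xi in X]
labels = kernel_kmeans(K, [0, 0, 0, 1])
```

Swapping in a learned compound kernel changes only how K is built; the clustering routine, kernelized or not, is untouched, which is exactly the transfer path the proposal relies on.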
Despite its advantages, our theoretical framework exhibits latent drawbacks. The main one is that feature learning is not fully unsupervised, which suggests the underlying possibility of preventing learning from some decisive knowledge related mainly to the tractability of the MCIP. Thus, many empirical studies are pending.

Conclusions and future work
At the moment, our theoretical framework analyzes semantic embedding in the sense of a criterion for semantic clustering. However, the correspondences between linguistic intuitions and the presented theoretical framework (interpretability) are still incipient, although we consider that these challenging correspondences are described in a generalized way in the seminal work of Harris (1968). It is encouraging (though not decisive) that our approach can be associated with his operator hypothesis on the composition and separability of both linguistic entities and language aspects. That is why we consider it worth investigating spectral decomposition methods for NLP as a possible rapprochement to elucidate improvements in semantic assessments (e.g. semantic clustering). Thus, by performing this research we also expect to advance the state of the art in statistical features of written language. As immediate future work we are planning to learn compositional distributional operators (kernels), which can be seen as stable solutions of operator equations (Harris, 1968; Vapnik, 1998). We would like to investigate this approach for morphology, syntax and semantics (Mitchell and Lapata, 2010; Lazaridou et al., 2013). Another future proposal could be derived from the abovementioned approach (operator learning), i.e. multi-sentence compression for automatic summarization.
A further extension could be ontology learning. It would be proposed as a multi-structure KL framework (Ferrone and Zanzotto, 2014). In this case, IE and knowledge organization would be our main aims (Anandkumar et al., 2014).
Acknowledgements. This work is funded by CONACyT Mexico (grant: 350326/178248). Thanks to the UNAM graduate program in CS. Thanks to Carlos Méndez-Cruz, to Yang Liu and to the anonymous reviewers for their valuable comments.