Deep Learning in Semantic Kernel Spaces

Kernel methods enable the direct usage of structured representations of textual data during language learning and inference tasks. Expressive kernels, such as Tree Kernels, achieve excellent performance in NLP. On the other hand, deep neural networks have proven effective in automatically learning feature representations during training. However, their input is tensor data, i.e., they cannot manage rich structured information. In this paper, we show that expressive kernels and deep neural networks can be combined in a common framework in order to (i) explicitly model structured information and (ii) learn non-linear decision functions. We show that the input layer of a deep architecture can be pre-trained through the application of the Nyström low-rank approximation of kernel spaces. The resulting "kernelized" neural network achieves state-of-the-art accuracy in three different tasks.


Introduction
Learning for Natural Language Processing (NLP) requires accounting, more or less explicitly, for trees or graphs that express syntactic and semantic information. A straightforward modeling of such information has been obtained in statistical language learning with Tree Kernels (TKs) (Collins and Duffy, 2001), or by means of structured neural models (Hochreiter and Schmidhuber, 1997; Socher et al., 2013). In particular, kernel-based methods (Shawe-Taylor and Cristianini, 2004) have been largely applied in language processing to alleviate the need for complex manual feature engineering (e.g., Moschitti et al., 2008). Although ad-hoc features are adopted by many successful approaches to language learning (e.g., Gildea and Jurafsky, 2002), kernels provide a natural way to capture textual generalizations by operating directly over (possibly complex) linguistic structures. Sequence kernels (Cancedda et al., 2003) or tree kernels (Collins and Duffy, 2001) are of particular interest, as the feature space they implicitly generate reflects linguistic patterns. On the other hand, Recursive Neural Networks (Socher et al., 2013) have been shown to learn dense feature representations of the nodes in a structure, thus exploiting similarities between nodes and sub-trees. Also, Long Short-Term Memory networks (Hochreiter and Schmidhuber, 1997) build intermediate representations of sequences, resulting in similarity estimates over sequences and their inner sub-sequences.
While such methods are highly effective and reach state-of-the-art results in many tasks, their adoption can be problematic. In a kernel-based Support Vector Machine (SVM), the classification model corresponds to the set of support vectors (SVs) and weights defining the maximal margin hyperplane: the classification cost crucially depends on their number, as classifying a new instance requires a kernel computation against all SVs, making their adoption in large data settings prohibitive. This scalability issue is evident in many NLP and Information Retrieval applications, such as answer re-ranking in question answering (Severyn et al., 2013; Filice et al., 2016), where the number of SVs is typically very large. Improving the efficiency of kernel-based methods is a largely studied topic. Early approaches reduced computational costs by imposing a budget (Dekel and Singer, 2006; Wang and Vucetic, 2010), that is, by limiting the maximum number of SVs in a model. However, in complex tasks, such methods still require large budgets to reach adequate accuracies. On the other hand, training complex neural networks is also difficult, as no common design practice is established for complex data structures. In Levy et al. (2015), a careful analysis of neural word embedding models is carried out and the role of hyper-parameter estimation is outlined. Different neural architectures result in the same performance whenever optimal hyper-parameter tuning is applied. In this latter case, no significant difference is observed across different architectures, making the choice between different neural architectures a complex and empirical task.
A general approach to the large scale modeling of complex structures is a critical and open problem. A viable and general solution to this scalability issue is provided by the Nyström method (Williams and Seeger, 2001); it allows approximating the Gram matrix of a kernel function and supports the embedding of future input examples into a low-dimensional space. For example, if used over TKs, the Nyström projection corresponds to the embedding of any tree into a low-dimensional vector.
In this paper, we show that the Nyström based low-rank embedding of input examples can be used as the early layer of a deep feed-forward neural network. A standard NN back-propagation training can thus be applied to induce non-linear functions in the kernel space. The resulting deep architecture, called Kernel-based Deep Architecture (KDA), is a mathematically justified integration of expressive kernel functions and deep neural architectures, with several advantages: it (i) directly operates over complex non-tensor structures, e.g., trees, without any manual feature or architectural engineering, (ii) achieves a drastic reduction of the computational cost w.r.t. pure kernel methods, and (iii) exploits the non-linearity of NNs to produce accurate models. The experimental evaluation shows that the proposed approach achieves state-of-the-art results in three semantic inference tasks: Semantic Parsing, Question Classification and Community Question Answering.
In the rest of the paper, Section 2 surveys some of the investigated kernels. In Section 3 the Nyström methodology and KDA are presented. Experimental evaluations are described in Section 4. Finally, Section 5 draws the conclusions.

Kernel-based Semantic Inference
In almost all NLP tasks, explicit models of complex syntactic and semantic structures are required, such as in Paraphrase Detection: deciding whether two sentences are valid paraphrases involves learning grammatical rewriting rules, such as semantics preserving mappings among subtrees. Also in Question Answering, the syntactic information about input questions is crucial. While manual feature engineering is always possible, kernel methods on structured representations of data objects, e.g., sentences, have been largely applied. Since Collins and Duffy (2001), sentences can be modeled through their corresponding parse tree, and Tree Kernels (TKs) result in similarity metrics directly operating over tree fragments. Such kernels correspond to dot products in the (implicit) feature space made of all possible tree fragments (Haussler, 1999). Notice that the number of tree fragments in a tree bank is combinatorial in the number of tree nodes and gives rise to billions of features, i.e., dimensions. In this high-dimensional space, kernel-based algorithms, such as SVMs, can implicitly learn robust prediction models (Shawe-Taylor and Cristianini, 2004), resulting in state-of-the-art approaches in several NLP tasks, e.g., Semantic Role Labeling (Moschitti et al., 2008), Question Classification (Croce et al., 2011) or Paraphrase Identification (Filice et al., 2015). As the feature space generated by the structural kernels depends on the input structures, different tree representations can be adopted to reflect more or less expressive syntactic/semantic feature spaces. While constituency parse trees were used early on (e.g., Collins and Duffy, 2001), dependency parse trees correspond to graph structures. TKs usually rely on their tree conversions, where grammatical edge labels correspond to nodes. An expressive tree representation of dependency graphs is the Grammatical Relation Centered Tree (GRCT).
As illustrated in Figure 1, PoS-Tags and grammatical functions correspond to nodes, dominating their associated lexicals.

Types of tree kernels. While a variety of TK functions have been studied, e.g., the Partial Tree Kernel (PTK) (Moschitti, 2006), the kernels used in this work model grammatical and semantic information, as triggered respectively by the dependency edge labels and lexical nodes. The latter is exploited through recent results in distributional models of lexical semantics, as proposed in word embedding methods (e.g., Mikolov et al., 2013; Sahlgren, 2006). In particular, we adopt the Smoothed Partial Tree Kernel (SPTK) described in Croce et al. (2011): it extends the PTK formulation with a similarity function between lexical nodes in a GRCT, i.e., the cosine similarity between word vector representations based on word embeddings. We also use a further extension of the SPTK, called Compositionally Smoothed Partial Tree Kernel (CSPTK), as in Annesi et al. (2014).
In CSPTK, the lexical information provided by the sentence words is propagated along the nonterminal nodes representing head-modifier dependencies. Figure 2 shows a compositionally-labeled tree, where the similarity function at the nodes can model lexical composition, i.e., capturing contextual information. For example, in the sentence, "What instrument does Hendrix play?", the role of the word instrument can be fully captured only if its composition with the verb play is considered. The CSPTK applies a composition function between nodes: while several algebraic functions can be adopted to compose two word vectors representing a head/modifier pair, here we refer to a simple additive function that assigns to each (h, m) pair the linear combination of the involved vectors, i.e., (h, m) = Ah + Bm: although simple and efficient, it actually produces very effective CSPTK functions.
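The additive composition and the cosine similarity between lexical nodes can be illustrated with a minimal sketch. It assumes that A and B are scalar weights and uses made-up two-dimensional vectors; the function names `compose` and `cosine` are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np

def compose(head, modifier, a=1.0, b=1.0):
    """Additive composition of a (head, modifier) pair: a*h + b*m.

    Assumption: a and b are scalar weights of the linear combination.
    """
    head = np.asarray(head, dtype=float)
    modifier = np.asarray(modifier, dtype=float)
    return a * head + b * modifier

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm > 0 else 0.0

# Toy vectors, invented for illustration only.
play, instrument, game = [1.0, 0.2], [0.1, 1.0], [0.9, 0.1]
play_instrument = compose(play, instrument)
play_game = compose(play, game)
similarity = cosine(play_instrument, play_game)
```

Here the similarity is computed between composed (head, modifier) vectors rather than between isolated words, which is the sense in which composition captures context.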
Complexity. The training phase of an optimal maximum margin algorithm (such as SVM) requires a number of kernel operations that is more than linear (almost $O(n^2)$) with respect to the number of training examples $n$, as discussed in Chang and Lin (2011). The classification phase also depends on the size of the input dataset and the intrinsic complexity of the targeted task: classifying a new instance requires evaluating the kernel function with respect to each support vector. For complex tasks, the number of selected support vectors tends to be very large, and using the resulting model can be impractical. This cost is also problematic as single kernel operations can be very expensive: the cost of evaluating the PTK on a single tree pair is almost linear in the number of nodes in the input trees, as shown in Moschitti (2006). When lexical semantics is considered, as in SPTKs and CSPTKs, it is more than linear in the number of nodes (Croce et al., 2011).
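To make the notion of a tree kernel as a dot product over tree fragments concrete, the following sketch implements the subset tree kernel of Collins and Duffy (2001). This is a simpler kernel than the PTK, SPTK and CSPTK used in this work, and it is shown only as an illustration. Trees are encoded as nested tuples `(label, child, ...)` with words as plain strings; this encoding and all function names are assumptions made for the example.

```python
from itertools import product

def nodes(tree):
    """Yield every internal node of a tree given as nested tuples."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1:]:
            yield from nodes(child)

def production(node):
    """Label of the node followed by the labels (or words) of its children."""
    return (node[0],) + tuple(c[0] if isinstance(c, tuple) else c for c in node[1:])

def is_preterminal(node):
    """True if all the children of the node are words."""
    return all(not isinstance(c, tuple) for c in node[1:])

def delta(n1, n2, decay):
    """Weighted number of common fragments rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    if is_preterminal(n1):
        return decay
    result = decay
    for c1, c2 in zip(n1[1:], n2[1:]):
        if isinstance(c1, tuple) and isinstance(c2, tuple):
            result *= 1.0 + delta(c1, c2, decay)
    return result

def tree_kernel(t1, t2, decay=1.0):
    """Sum delta over all node pairs: a dot product in the fragment space."""
    return sum(delta(a, b, decay) for a, b in product(nodes(t1), nodes(t2)))

dog = ("S", ("NP", ("D", "the"), ("N", "dog")), ("VP", ("V", "barks")))
cat = ("S", ("NP", ("D", "the"), ("N", "cat")), ("VP", ("V", "barks")))
same = tree_kernel(dog, dog)    # 24.0 fragments shared by the tree with itself
cross = tree_kernel(dog, cat)   # 15.0 fragments shared by the two trees
```

This naive version visits every pair of nodes, so a single evaluation already costs a number of steps that grows with the product of the tree sizes; this is why the number of kernel evaluations at classification time matters.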

The Nyström method
Given an input training dataset $D$, a kernel $K(o_i, o_j)$ is a similarity function over $D^2$ that corresponds to a dot product in the implicit kernel space, i.e., $K(o_i, o_j) = \Phi(o_i) \cdot \Phi(o_j)$. The advantage of kernels is that the projection function $\Phi(o) = x \in \mathbb{R}^n$ is never explicitly computed (Shawe-Taylor and Cristianini, 2004). In fact, this operation may be prohibitive when the dimensionality $n$ of the underlying kernel space is extremely large, as for Tree Kernels (Collins and Duffy, 2001). Kernel functions are used by learning algorithms, such as SVM, to operate only implicitly on instances in the kernel space, by never accessing their explicit definition.

Let us apply the projection function $\Phi$ over all examples from $D$ to derive representations, $x$ denoting the rows of the matrix $X$. The Gram matrix can always be computed as $G = XX^\top$, with each single element corresponding to $G_{ij} = \Phi(o_i) \cdot \Phi(o_j) = K(o_i, o_j)$. The aim of the Nyström method is to derive a new low-dimensional embedding $\tilde{x}$ in an $l$-dimensional space, with $l \ll n$, so that $\tilde{G} = \tilde{X}\tilde{X}^\top$ and $\tilde{G} \approx G$. This is obtained by generating an approximation $\tilde{G}$ of $G$ using a subset of $l$ columns of the matrix, i.e., a selection of a subset $L \subset D$ of the available examples, called landmarks. Suppose we randomly sample $l$ columns of $G$, and let $C \in \mathbb{R}^{|D| \times l}$ be the matrix of these sampled columns. Then, we can rearrange the columns and rows of $G$ and define $X = [X_1; X_2]$, with the landmark rows $X_1$ stacked above the remaining rows $X_2$, such that:

$$G = XX^\top = \begin{bmatrix} W & X_1 X_2^\top \\ X_2 X_1^\top & X_2 X_2^\top \end{bmatrix} \quad \text{and} \quad C = \begin{bmatrix} W \\ X_2 X_1^\top \end{bmatrix} \tag{1}$$

where $W = X_1 X_1^\top$, i.e., the subset of $G$ that contains only landmarks. The Nyström approximation can be defined as:

$$G \approx \tilde{G} = C W^{\dagger} C^\top \tag{2}$$

where $W^{\dagger}$ denotes the Moore-Penrose inverse of $W$. The Singular Value Decomposition (SVD) is used to obtain $W^{\dagger}$ as follows. First, $W$ is decomposed so that $W = USV^\top$, where $U$ and $V$ are both orthogonal matrices, and $S$ is a diagonal matrix containing the (non-zero) singular values of $W$ on its diagonal. Since $W$ is symmetric and positive semi-definite, $W = USU^\top$. Then $W^{\dagger} = US^{-1}U^\top = US^{-\frac{1}{2}}S^{-\frac{1}{2}}U^\top$ and Equation 2 can be rewritten as

$$G \approx \tilde{G} = C U S^{-\frac{1}{2}} S^{-\frac{1}{2}} U^\top C^\top = (C U S^{-\frac{1}{2}})(C U S^{-\frac{1}{2}})^\top = \tilde{X}\tilde{X}^\top \tag{3}$$

Given an input example $o \in D$, a new low-dimensional representation $\tilde{x}$ can thus be determined by considering the corresponding item of $C$ as

$$\tilde{x} = c\, U S^{-\frac{1}{2}} \tag{4}$$

where $c$ is the vector whose dimensions contain the evaluations of the kernel function between $o$ and each landmark $o_j \in L$. Therefore, the method produces $l$-dimensional vectors. If $k$ is the average number of basic operations required during a single kernel computation, the overall cost of a single projection is $O(kl + l^2)$, where the first term corresponds to the cost of generating the vector $c$, while the second term is needed for the matrix multiplications in Equation 4. Typically, the number of landmarks $l$ ranges from hundreds to a few thousand and, for complex kernels (such as Tree Kernels), the projection cost can be reduced to $O(kl)$.

Several policies have been defined to determine the best selection of landmarks to reduce the Gram matrix approximation error. In this work, uniform sampling without replacement is adopted, as suggested by Kumar et al. (2012), where this policy has been theoretically and empirically shown to achieve results comparable with other (more complex) selection policies.
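The projection described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' KeLP implementation: an RBF kernel over small vectors is used as a stand-in for a tree kernel, and the function names `nystrom_projector` and `embed` are hypothetical.

```python
import numpy as np

def nystrom_projector(kernel, landmarks, eps=1e-10):
    """Return the matrix U S^{-1/2} computed from the landmark Gram matrix W.

    Singular values below eps * max(S) are dropped, so the result may have
    fewer than l columns when W is (numerically) rank deficient.
    """
    W = np.array([[kernel(a, b) for b in landmarks] for a in landmarks], dtype=float)
    U, S, _ = np.linalg.svd(W)
    keep = S > eps * S.max()
    return U[:, keep] / np.sqrt(S[keep])

def embed(kernel, landmarks, projector, o):
    """Embed one example: l kernel evaluations, then one matrix product."""
    c = np.array([kernel(o, lm) for lm in landmarks], dtype=float)
    return c @ projector

# Stand-in kernel and toy data (assumptions for illustration only).
rbf = lambda a, b: float(np.exp(-np.sum((a - b) ** 2)))
rng = np.random.default_rng(0)
data = rng.normal(size=(20, 3))

# Uniform sampling of landmarks without replacement.
landmarks = data[rng.choice(len(data), size=5, replace=False)]
projector = nystrom_projector(rbf, landmarks)
X_tilde = np.array([embed(rbf, landmarks, projector, o) for o in data])
G_tilde = X_tilde @ X_tilde.T   # low-rank approximation of the Gram matrix
```

Dot products between embedded landmarks reproduce the kernel values among landmarks, while for the remaining examples the quality of the approximation depends on how many landmarks are used and how they are selected.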

A Kernel-based Deep Architecture
The Nyström representation $\tilde{x}$ of any input example $o$ introduced above is linear and can be adopted to feed a neural network architecture. We assume a labeled dataset of pairs $(o, y)$, with $o \in D$ and $y \in Y$, where $o$ refers to a generic instance and $y$ is its associated class. In this Section, we define a Multi-Layer Perceptron (MLP) architecture, with a specific Nyström layer based on the Nyström embeddings of Eq. 4. We will refer to this architecture as Kernel-based Deep Architecture (KDA). KDA has an input layer, a Nyström layer, a possibly empty sequence of non-linear hidden layers and a final classification layer, which produces the output.
The input layer corresponds to the input vector c, i.e., the row of the C matrix associated to an example o. Notice that, for adopting the KDA, the values of the matrix C should be all available. In the training stage, these values are in general cached. During the classification stage, the c vector corresponding to an example o is directly computed by l kernel computations between o and each one of the l landmarks.
The input layer is mapped to the Nyström layer through the projection in Equation 4. Notice that the embedding also provides the proper weights, defined by $US^{-\frac{1}{2}}$, so that the mapping can be expressed through the Nyström matrix $H_{Ny} = US^{-\frac{1}{2}}$: it corresponds to a pre-trained stage derived through SVD, as discussed in Section 3.1. Equation 4 provides a static definition for $H_{Ny}$ whose weights can be left invariant during the neural network training. However, the values of $H_{Ny}$ can be made available for the standard back-propagation adjustments applied for training. Formally, the low-dimensional embedding of an input example $o$ is $\tilde{x} = c\, H_{Ny} = c\, US^{-\frac{1}{2}}$.

The resulting outcome $\tilde{x}$ is the input to one or more non-linear hidden layers. Each $t$-th hidden layer is realized through a matrix $H_t \in \mathbb{R}^{h_{t-1} \times h_t}$ and a bias vector $b_t \in \mathbb{R}^{1 \times h_t}$, where $h_t$ denotes the desired hidden layer dimensionality. Clearly, given that $H_{Ny} \in \mathbb{R}^{l \times l}$, $h_0 = l$. The first hidden layer in fact receives as input $\tilde{x} = c\, H_{Ny}$, which corresponds to the $t = 0$ layer input $x_0 = \tilde{x}$, and its computation is formally expressed by $x_1 = f(x_0 H_1 + b_1)$, where $f$ is a non-linear activation function. In general, the generic $t$-th layer is modeled as:

$$x_t = f(x_{t-1} H_t + b_t) \tag{5}$$

The final layer of KDA is the classification layer, realized through the output matrix $H_O$ and the output bias vector $b_O$. Their dimensionality depends on the dimensionality of the last hidden layer (denoted $O_{-1}$) and the number $|Y|$ of different classes, i.e., $H_O \in \mathbb{R}^{h_{O_{-1}} \times |Y|}$ and $b_O \in \mathbb{R}^{1 \times |Y|}$, respectively. In particular, this layer computes a linear classification function with a softmax operator so that $\hat{y} = \mathrm{softmax}(x_{O_{-1}} H_O + b_O)$.

In order to avoid over-fitting, two different regularization schemes are applied. First, dropout is applied to the input $x_t$ of each hidden layer ($t \geq 1$) and to the input $x_{O_{-1}}$ of the final classifier. Second, an $L_2$ regularization is applied to the norm of each layer matrix $H_t$ and $H_O$.
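Finally, the KDA is trained by optimizing a loss function made of the sum of two factors: first, the cross-entropy function between the gold classes and the predicted ones; second, the $L_2$ regularization, whose importance is regulated by a meta-parameter $\lambda$. The final loss function is thus

$$L(y, \hat{y}) = -\sum_{(o, y)} y \log(\hat{y}) + \lambda \sum_{H \in \{H_t\} \cup \{H_O\}} \|H\|^2$$

where the first sum ranges over the labeled training examples.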
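The forward pass and the loss of the architecture described in this section can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' TensorFlow code: dropout is omitted (as at classification time), the cross-entropy is summed over the batch, the weights are random placeholders, and all names (`kda_forward`, `kda_loss`) are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kda_forward(c, H_ny, hidden, H_out, b_out):
    """c: (batch, l) kernel evaluations against the l landmarks."""
    x = c @ H_ny                             # Nystrom layer
    for H_t, b_t in hidden:                  # non-linear hidden layers
        x = relu(x @ H_t + b_t)
    return softmax(x @ H_out + b_out)        # classification layer

def kda_loss(y_onehot, y_hat, weights, lam):
    """Cross-entropy summed over the batch plus the L2 penalty on the weights."""
    cross_entropy = -np.sum(y_onehot * np.log(y_hat + 1e-12))
    l2 = sum(np.sum(H ** 2) for H in weights)
    return cross_entropy + lam * l2

# Toy setting: l = 5 landmarks, two hidden layers of size l, 3 classes.
rng = np.random.default_rng(0)
l, n_classes = 5, 3
H_ny = rng.normal(size=(l, l))               # would come from the SVD of W
hidden = [(rng.normal(size=(l, l)), np.zeros((1, l))) for _ in range(2)]
H_out, b_out = rng.normal(size=(l, n_classes)), np.zeros((1, n_classes))

c = rng.random(size=(2, l))                  # two examples
probs = kda_forward(c, H_ny, hidden, H_out, b_out)
y = np.eye(n_classes)[[0, 2]]                # gold classes as one-hot rows
loss = kda_loss(y, probs, [H for H, _ in hidden] + [H_out], lam=0.01)
```

Whether `H_ny` is kept fixed or updated by back-propagation is a design choice left open by the text; in this sketch it is simply one more matrix in the chain.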

Empirical Investigation
The proposed KDA has been applied, adopting the same architecture but with different kernels, to three NLP tasks, i.e., Question Classification, Community Question Answering, and Argument Boundary Detection in Semantic Role Labeling. The Nyström projector has been implemented in the KeLP framework. The neural network has been implemented in TensorFlow, with 2 hidden layers whose dimensionality corresponds to the number of involved Nyström landmarks. The rectified linear unit is the non-linear activation function in each layer. Dropout has been applied in each hidden layer and in the final classification layer. The values of the dropout parameter and the $\lambda$ parameter of the $L_2$ regularization have been selected from a set of values via grid-search. The Adam optimizer with a learning rate of 0.001 has been applied to minimize the loss function, training for 500 epochs with batches of size 256. We adopted an early stop strategy, where the best model was selected according to the performance over the development set. Every performance measure is obtained against a specific sampling of the Nyström landmarks. The results reported hereafter are always averaged over 5 such samplings.

Question Classification
Question Classification (QC) is the task of mapping a question into a closed set of answer types in a Question Answering system. We used the UIUC dataset (Li and Roth, 2006), including a training and test set of 5,452 and 500 questions, respectively, organized in 6 classes (like ENTITY or HUMAN). TKs proved very effective, as shown in Croce et al. (2011) and Annesi et al. (2014). In Annesi et al. (2014), QC is mapped into a One-vs-All multi-classification schema, where the CSPTK achieves state-of-the-art results of 95%: it acts directly over compositionally labeled trees without relying on any manually designed feature.
In order to prove the benefits of the KDA architecture, we generated the Nyström representation of the CSPTK kernel function with default parameters (i.e., µ = λ = 0.4). The SVM formulation by Chang and Lin (2011), fed with the CSPTK (hereafter KSVM), is here adopted to determine the reachable upper bound in classification quality, i.e., 95% accuracy, at higher computational costs. It establishes the state-of-the-art over the UIUC dataset. The resulting model includes 3,873 support vectors: this corresponds to the number of kernel operations required to classify any input test question. The Nyström method based on a number of landmarks ranging from 100 to 1,000 is adopted for modeling input vectors in the CSPTK kernel space. Results are reported in Table 1: computational saving refers to the percentage of avoided kernel computations with respect to the application of the KSVM to each test instance. To justify the need for the neural network, we compared the proposed KDA to an efficient linear SVM that is directly trained over the Nyström embeddings. This SVM implements the Dual Coordinate Descent method (Hsieh et al., 2008) and will be referred to as SVM_DCD. We also measured the state-of-the-art Convolutional Neural Network 6 (CNN) of Kim (2014), achieving the remarkable accuracy of 93.6%. Notice that the linear classifier SVM_DCD operating over the approximated kernel space achieves the same classification quality as the CNN when just 1,000 landmarks are considered. KDA improves on this result, achieving 94.3% accuracy even with fewer landmarks (only 600), showing the effectiveness of non-linear learning over the Nyström input. Although KSVM improves to 95%, KDA provides a saving of more than 84% of the kernel computations at classification time. This result is noteworthy, as it confirms that linguistic information encoded in a tree is important in the analysis of questions and can be used as a pre-training strategy.
Figure 3 shows the accuracy curves according to various approximations of the kernel space, i.e., number of landmarks.
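The computational saving discussed above follows directly from the number of kernel evaluations per test instance: one per support vector for the KSVM and one per landmark for the Nyström-based models. The short sketch below reproduces the figure for the configuration with 600 landmarks; the helper name `kernel_saving` is hypothetical.

```python
def kernel_saving(num_landmarks, num_support_vectors):
    """Fraction of kernel evaluations avoided per test instance."""
    return 1.0 - num_landmarks / num_support_vectors

# Question Classification: 3,873 support vectors vs. 600 landmarks.
qc_saving = kernel_saving(600, 3873)
print(f"{qc_saving:.1%}")   # prints 84.5%
```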

Community Question-Answering
In the SemEval-2016 task 3, participants were asked to automatically provide good answers in a community question answering setting (Nakov et al., 2016). We focused on subtask A: given a question and a large collection of question-comment threads created by a user community, the task consists in (re-)ranking the comments w.r.t. their utility in answering the question. Subtask A can be modeled as a binary classification problem, where instances are (question, comment) pairs. Each pair generates an example for a binary SVM, where the positive label is associated with a good comment and the negative label refers to potentially useful and bad comments. The classification score achieved over different (question, comment) pairs is used to sort instances and produce the final ranking over comments. The above setting results in a train and test dataset made of 20,340 and 3,270 examples, respectively. In Filice et al. (2016), a Kernel-based SVM classifier (KSVM) achieved state-of-the-art results by adopting a kernel combination that exploited (i) feature vectors containing linguistic similarities between the texts in a pair; (ii) shallow syntactic trees that encode the lexical and morpho-syntactic information shared between text pairs; (iii) feature vectors capturing task-specific information. Such a model includes 11,322 support vectors. We investigated the KDA architecture, trained by maximizing the F1 measure, based on a Nyström layer initialized using the same kernel functions as KSVM. We varied the Nyström dimensions from 100 to 1,000 landmarks, i.e., a much lower number than the support vectors of KSVM.

6 The deep architecture presented in Kim (2014) outperforms several NN models, including the Recursive Neural Tensor Network or the Tree-LSTM presented in Socher et al. (2013) and Tai et al. (2015), which present semantic compositionality models that exploit parse trees.
Table 2 reports the results: very high F1 scores are observed with impressive savings in terms of kernel computations (between 91.2% and 99%). Also on the cQA task, the F1 obtained by the SVM_DCD is significantly lower than the KDA one. Moreover, with 800 landmarks KDA achieves the remarkable result of 0.68 F1, which is the state-of-the-art against other convolutional systems, e.g., ConvKN (Barrón-Cedeño et al., 2016): the latter combines convolutional tree kernels with kernels operating on sentence embeddings generated by a convolutional neural network.
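The re-ranking step of subtask A, where the classification scores of the (question, comment) pairs are used to sort the comments of each question, can be sketched as follows. The identifiers and scores are invented for illustration, and `rank_comments` is a hypothetical helper, not part of the evaluated systems.

```python
from collections import defaultdict

def rank_comments(scored_pairs):
    """Group (question_id, comment_id, score) triples by question and
    sort the comments of each question by decreasing classifier score."""
    by_question = defaultdict(list)
    for question_id, comment_id, score in scored_pairs:
        by_question[question_id].append((score, comment_id))
    return {
        q: [c for _, c in sorted(items, key=lambda item: item[0], reverse=True)]
        for q, items in by_question.items()
    }

# Made-up classifier scores for two question threads.
scored = [("q1", "c1", 0.2), ("q1", "c2", 0.9), ("q2", "c3", -0.5), ("q1", "c4", 0.4)]
ranking = rank_comments(scored)
```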

Argument Boundary Detection
Semantic Role Labeling (SRL) consists of the detection of the semantic arguments associated with the predicate of a sentence (called Lexical Unit) and their classification into their specific roles (Fillmore, 1985). For example, given the sentence "Bootleggers then copy the film onto hundreds of tapes" the task would be to recognize the verb copy as representing the DUPLICATION frame with roles CREATOR for Bootleggers, ORIGINAL for the film and GOAL for hundreds of tapes. Argument Boundary Detection (ABD) corresponds to the SRL subtask of detecting the sentence fragments spanning individual roles. In the previous example the phrase "the film" represents a role (i.e., ORIGINAL), while "of tapes" or "film onto hundreds" do not, as they just partially cover one or multiple roles, respectively. The ABD task has been successfully tackled using TKs since Moschitti et al. (2008). It can be modeled as a binary classification task over each parse tree node n, where the argument span reflects the words covered by the sub-tree rooted at n. In our experiments, Grammatical Relation Centered Trees (GRCTs) derived from dependency grammar (Fig. 4) are employed, as shown in Fig. 5. Each node is considered as a candidate in covering a possible argument. In particular, the structure in Fig. 5a is a positive example. On the contrary, in Fig. 5b the NMOD node only covers the phrase "of tapes", i.e., a subset of the correct role, and it represents a negative example.
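The generation of ABD candidates, one per parse tree node, can be sketched as follows. The tree below is a simplified, hypothetical structure for the fragment "copy the film" and is not the GRCT of Fig. 5; trees are encoded as nested tuples `(label, child, ...)` with words as strings, and spans are compared as strings for brevity, whereas a real system would compare token positions.

```python
def covered_words(node):
    """Words covered by the sub-tree rooted at node, in sentence order."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(covered_words(child))
    return words

def candidates(tree, gold_spans):
    """Return (label, span, is_argument) for every node of the tree."""
    result = []
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            continue
        span = " ".join(covered_words(node))
        result.append((node[0], span, span in gold_spans))
        stack.extend(reversed(node[1:]))
    return result

# Hypothetical toy tree and gold argument span.
toy_tree = ("ROOT", ("VB", "copy"), ("OBJ", ("NMOD", ("DT", "the")), ("NN", "film")))
examples = candidates(toy_tree, gold_spans={"the film"})
```

Each returned triple corresponds to one binary classification instance: the node covering exactly "the film" is positive, while nodes covering only part of it, or more than it, are negative.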
We selected all the sentences whose predicate word (lexical unit) is a verb (about 60,000) from version 1.3 of the FrameNet dataset (Baker et al., 1998). This gives rise to about 1,400,000 sub-trees, i.e., the positive and negative instances. The dataset is split into train and test sets according to a 90/10 proportion (as in Johansson and Nugues, 2008). This size makes the application of a traditional kernel-based method unfeasible, unless a significant instance sub-sampling is performed. We first experimented with standard SVM learning over a sampled training set of 10,000 examples, a typical size for annotated datasets in computational linguistics tasks. We adopted the Smoothed Partial Tree Kernel (Croce et al., 2011) with standard parameters (i.e., µ = λ = 0.4) and lexical nodes expressed through 250-dimensional vectors obtained by applying Word2Vec (Mikolov et al., 2013) to the entire Wikipedia. When trained over this 10k-instance dataset, the kernel-based SVM (KSVM) achieves an F1 of 70.2% over the same test set used in [...] (Hsieh et al., 2008), SVM_DCD, and the KDA proposed in this work. Table 3 presents the results in terms of F1 and saved kernel operations. Although SVM_DCD with 500 landmarks already achieves 0.713 F1, a score higher than KSVM, it is significantly improved by the KDA. KDA achieves up to 0.76 F1 with only 400 landmarks, resulting in a huge step forward w.r.t. the KSVM. This result is noteworthy considering (i) the reduction of required kernel operations, i.e., more than 86% are saved, and (ii) the quality achieved already with 100 landmarks (i.e., 0.711, higher than the KSVM).

Discussion and Conclusions
In this work, we promoted a methodology to embed structured linguistic information within NNs, according to mathematically rich semantic similarity models, based on kernel functions. Structured data, such as trees, are transformed into dense vectors according to the Nyström methodology, and the NN is effective in capturing non-linearities in these representations, while still improving generalization at a reasonable complexity. To the best of our knowledge, this work is one of the few attempts to systematically integrate linguistic kernels within a deep neural network architecture. The problem of combining such methodologies has been studied in specific works, such as (Baldi et al., 2011; Cho and Saul, 2009; Yu et al., 2009). In Baldi et al. (2011) the authors propose a hybrid classifier for bridging kernel methods and neural networks. In particular, they use the output of a kernelized k-nearest neighbors algorithm as input to a neural network. Cho and Saul (2009) introduced a family of kernel functions that mimic the computation of large multilayer neural networks. However, such kernels can be applied only on vector inputs. In Yu et al. (2009), deep neural networks for rapid visual recognition are trained with a novel regularization method taking advantage of kernels as an oracle representing prior knowledge. The authors transform the kernel regularizer into a loss function and carry out the neural network training by gradient descent. In Zhuang et al. (2011) a different approach has been promoted: a multiple (two) layer architecture of kernel functions, inspired by neural networks, is studied to find the best kernel combination in a Multiple Kernel Learning setting. In Mairal et al. (2014) the invariance properties of convolutional neural networks (LeCun et al., 1998) are modeled through kernel functions, resulting in a Convolutional Kernel Network.
Another effort for combining NNs and kernel methods is described in [...], where an SVM adopts a combination of tree kernels with embeddings learned through a CNN.
The approach discussed here departs from previous approaches in different aspects. First, a general framework is promoted: it is largely applicable to any complex kernel, e.g., structural kernels or combinations of them. The efficiency of the Nyström methodology encourages its adoption, especially when complex kernel computations are required. Notice that other low-dimensional approximations of kernel functions have been studied, as for example the randomized feature mappings proposed in Rahimi and Recht (2008). However, these assume that (i) instances have vectorial form and (ii) shift-invariant kernels are adopted. The Nyström method adopted here does not suffer from such limitations: as our target is the application to structured (linguistic) data, more general kernels, i.e., non-shift-invariant convolution kernels, are needed.
Given the Nyström approximation, the learning setting corresponds to a general well-known neural network architecture, i.e., a multilayer perceptron, and does not require any manual feature engineering or the design of ad-hoc network architectures. The success in three different tasks confirms its large applicability without major changes or adaptations. Second, we propose a novel learning strategy, as the capability of kernel methods to represent complex search spaces is combined with the ability of neural networks to find non-linear solutions to complex tasks. Last, the suggested KDA framework is fully scalable, as (i) the network can be parallelized on multiple machines, and (ii) the computation of the Nyström reconstruction vector $c$ can be easily parallelized on multiple processing units, ideally $l$, as each unit can compute one $c_i$ value. Future work will address experiments with larger scale datasets; moreover, it is interesting to experiment with more landmarks in order to better understand the trade-off between the representation capacity of the Nyström approximation of the kernel functions and the over-fitting that can be introduced in a neural network architecture. Finally, the optimization of the KDA methodology through suitable parallelization on multicore architectures, as well as the exploration of mechanisms for the dynamic reconstruction of kernel spaces (e.g., operating over $H_{Ny}$), also constitute interesting future research directions on this topic.
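The parallel computation of the vector c mentioned above, where each landmark contributes one independent kernel evaluation, can be sketched with the Python standard library. This is only an illustration of the decomposition: the helper name `landmark_vector` and the toy kernel are assumptions, and with a pure-Python kernel the threads of CPython would not give a real speed-up, so a process pool or a natively implemented kernel would be needed in practice.

```python
from concurrent.futures import ThreadPoolExecutor

def landmark_vector(kernel, landmarks, o, workers=4):
    """Compute c = [K(o, o_1), ..., K(o, o_l)], one task per landmark.

    Executor.map returns results in the order of the landmarks.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda landmark: kernel(o, landmark), landmarks))

# Toy kernel over numbers, standing in for an expensive tree kernel.
toy_kernel = lambda a, b: a * b
c_parallel = landmark_vector(toy_kernel, [1.0, 2.0, 3.0], 2.0)
c_sequential = [toy_kernel(2.0, landmark) for landmark in [1.0, 2.0, 3.0]]
```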