Instance-based Inductive Deep Transfer Learning by Cross-Dataset Querying with Locality Sensitive Hashing

Supervised learning models are typically trained on a single dataset, and their performance relies heavily on the size of that dataset, i.e., the amount of data available with ground truth. Learning algorithms try to generalize solely from the data presented to them during training. In this work, we propose an inductive transfer learning method that augments learning models by infusing similar instances from different learning tasks in the Natural Language Processing (NLP) domain. We propose to use instance representations from a source dataset, \textit{without inheriting anything} from the source learning model. Representations of the instances of the \textit{source} \& \textit{target} datasets are learned, relevant source instances are retrieved using a soft-attention mechanism and \textit{locality sensitive hashing}, and the retrieved instances are then infused into the model during training on the target dataset. Our approach simultaneously exploits local \textit{instance level information} and the macro statistical viewpoint of the dataset. Using this approach, we show significant improvements over the baseline on three major news classification datasets. Experimental evaluations also show that the proposed approach reduces the dependency on labeled data by a significant margin for comparable performance. With the proposed cross-dataset learning procedure, we show that one can achieve performance competitive with or better than learning from a single dataset.


Introduction
A fundamental issue with the performance of supervised learning techniques (such as classification) is the requirement of an enormous amount of labeled data, which in some scenarios may be expensive or impossible to acquire. Every supervised task requires a dedicated labeled dataset, and training a state-of-the-art deep learning model requires extensive computational power. In this paper, we propose a deep transfer learning method that enhances the performance of learning models by incorporating information from a secondary dataset belonging to a similar domain.
We present our approach in an inductive transfer learning (Pan and Yang, 2010) framework with a labeled source dataset (domain $D_S$ and task $T_S$) and a labeled target dataset (domain $D_T$ and task $T_T$). The aim is to boost the performance of the target predictive function $f_T(\cdot)$ using the knowledge available in $D_S$ and $T_S$, given $T_S \neq T_T$. Knowledge transfer in our approach takes place in four ways: (a) instance-transfer, (b) feature-representation-transfer, (c) parameter-transfer and (d) relational-knowledge-transfer.
Parameter and relational-knowledge transfer have been studied exhaustively in the inductive transfer literature. Our work is based on a simple inductive bias, also used by Snell et al. (2017): there exists an embedding space in which instances belonging to the same class cluster around a central point. We utilize the instance-level information in the source dataset and also make the newly learnt target instance representation similar to the retrieved source instances. This allows the learning algorithm to improve generalization across the source and target datasets. We use instance-based learning that actively looks for similar instances in the source dataset given a target instance. The intuition behind retrieving similar instances comes from the instance-based learning perspective, in which the class distribution simplifies within the locality of a test instance. As a result, modeling similar instances becomes easier (Aggarwal, 2014). Similar instances carry the most information needed to classify an unseen instance, a property exploited by techniques such as k-nearest neighbours.
We drew inspiration for this method from the working of the human brain, where memory consolidation (McGaugh, 2000) occurs: new memory representations are consolidated slowly over time for efficient retrieval in the future. According to McGaugh (2000), newly learnt memory representations remain in a fragile state and are affected as further learning takes place. In our approach, we make use of the encodings of instances produced while training an independent model on the source task. That this model is trained independently for a source task and can be adapted as required is in alignment with memory consolidation in the human brain. One attractive feature of the proposed method is that the search mechanism allows us to use more than one source dataset while training the joint model to achieve inductive transfer learning. Our approach differs from standard instance-based learning in two major aspects. First, the retrieved instances are not necessarily from the same dataset, but can come from various secondary datasets. Second, our model simultaneously makes use of local instance-level information and the macro-statistical viewpoint of the dataset, whereas typical instance-based learning such as k-nearest neighbour search uses only the local instance-level information.

Background
Locality Sensitive Hashing (LSH): Locality Sensitive Hashing (Gao et al., 2014; Gionis et al., 1999) is an algorithm that performs approximate nearest neighbor similarity search for high-dimensional data in sub-linear time. LSH is a data-independent hashing technique, as the hash functions are selected at random. This makes LSH well suited for our purpose: constructing data-driven hash functions would require access to the latent vectors in advance, and the latent vectors encountered during training are not available.
The locality sensitive hash family $H$ has to satisfy certain constraints, given in (Indyk and Motwani, 1998), for nearest neighbor retrieval. The LSH Index maps each point $p$ into a bucket in a hash table with label $g(p) = (h_1(p), h_2(p), \ldots, h_k(p))$, where $h_1, h_2, \ldots, h_k$ are chosen independently with replacement from $H$. We generate $l$ different hash functions of length $k$, given by $G_j(p) = (h_{1j}(p), h_{2j}(p), \ldots, h_{kj}(p))$, where $j \in \{1, 2, \ldots, l\}$ denotes the index of the hash table. Given a collection of data points $C$, we hash them into $l$ hash tables by concatenating $k$ randomly sampled hash functions from $H$ for each hash table. To return the nearest neighbors of a query $Q$, the query is mapped into a bucket in each of the $l$ hash tables, and the union of all points in the buckets $G_j(Q)$, $j = 1, 2, \ldots, l$, is returned. Therefore, not all points in the collection $C$ are scanned, and the query is executed in sub-linear time. The storage overhead for LSH is sub-quadratic in $n$, the number of points in the collection $C$.
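The bucket construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments: it assumes the sign-of-random-projection family for $H$ (the text does not say which family was used), and the names `make_hash_tables`, `label`, `build_index` and `query`, as well as the toy points, are hypothetical.

```python
import random

def make_hash_tables(dim, k, l, seed=0):
    """Draw l groups of k random hyperplanes; group j defines G_j."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]
            for _ in range(l)]

def label(planes, p):
    """G_j(p): a tuple of k sign bits, one per hyperplane (assumed hash family)."""
    return tuple(int(sum(w * x for w, x in zip(plane, p)) >= 0.0)
                 for plane in planes)

def build_index(points, groups):
    """Hash every point of the collection into one bucket per table."""
    tables = [dict() for _ in groups]
    for idx, p in enumerate(points):
        for table, planes in zip(tables, groups):
            table.setdefault(label(planes, p), []).append(idx)
    return tables

def query(q, tables, groups):
    """Return the union of the buckets G_j(q) over all l tables."""
    candidates = set()
    for table, planes in zip(tables, groups):
        candidates.update(table.get(label(planes, q), []))
    return candidates

# Toy collection C (hypothetical data) with k = 4 bits and l = 3 tables.
points = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.2], [0.0, -1.0, 0.5]]
groups = make_hash_tables(dim=3, k=4, l=3)
tables = build_index(points, groups)
found = query(points[0], tables, groups)
```

Only the points that share a bucket with the query in at least one table are returned, which is why the whole collection does not need to be scanned. Larger $k$ makes buckets smaller and more selective, while larger $l$ raises the chance that a true neighbour collides with the query in some table.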
LSH Forests (Bawa et al., 2005) are an improvement over the LSH Index that relaxes the constraints on the hash family $H$ and offers better practical performance guarantees. An LSH Forest uses $l$ prefix trees (LSH trees) instead of hash tables, each constructed from hash functions drawn independently from $H$. The hash label in each prefix tree has variable length $k$ with an upper bound $k_m$. The length of the hash label of a point is increased whenever a collision occurs, so that leaf nodes are formed from the parent node in the LSH tree. For an $m$-nearest-neighbour query of a point $p$, the $l$ prefix trees are traversed top-down to find the leaf node with the highest similarity to $p$. From the leaf node, we traverse bottom-up to collect $M$ points from the forest, where $M = cl$ and $c$ is a small constant. It has been shown in (Bawa et al., 2005) that, for practical cases, LSH Forests execute each query in constant time with storage cost linear in $n$, the number of points in the collection $C$.
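The query logic of an LSH Forest (descend to the deepest matching prefix, then ascend until enough candidates are collected) can be illustrated as follows. This is a simplified sketch under stated assumptions: it again assumes sign-of-random-projection hash bits, it stores full-length labels of $k_m$ bits and compares prefixes by a linear scan rather than building real prefix trees, so it shows what is retrieved but not the constant-time behaviour. All function names and the toy data are hypothetical.

```python
import random

def sign_bits(planes, p):
    """Hash label of p in one tree: one sign bit per hyperplane."""
    return tuple(int(sum(w * x for w, x in zip(plane, p)) >= 0.0)
                 for plane in planes)

def common_prefix(a, b):
    """Length of the shared prefix of two labels (depth of their common node)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def build_forest(points, dim, l, k_max, seed=0):
    """Draw l independent groups of k_max hyperplanes and label every point."""
    rng = random.Random(seed)
    trees = [[[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k_max)]
             for _ in range(l)]
    labels = [[sign_bits(planes, p) for p in points] for planes in trees]
    return trees, labels

def forest_query(q, trees, labels, m_candidates):
    """Collect at least m_candidates points (if available) by shrinking the prefix."""
    match = [[common_prefix(sign_bits(planes, q), lab) for lab in tree_labels]
             for planes, tree_labels in zip(trees, labels)]
    depth = max(max(row) for row in match)  # deepest match: top-down descent
    candidates = set()
    while depth >= 0 and len(candidates) < m_candidates:
        for row in match:  # synchronous bottom-up ascent across all trees
            candidates.update(i for i, d in enumerate(row) if d >= depth)
        depth -= 1
    return candidates

# Toy collection (hypothetical data), l = 3 trees, k_m = 8 bits.
points = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.2], [0.0, -1.0, 0.5]]
trees, labels = build_forest(points, dim=3, l=3, k_max=8)
cand = forest_query(points[0], trees, labels, m_candidates=2)
```

In a real LSH Forest the labels are stored in prefix trees and only grown as far as needed to separate colliding points, which is what gives the linear storage cost mentioned above.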
Instance-based transfer learning: Instance-based transfer learning has been extensively studied in the literature (Zadrozny, 2004; Gretton et al., 2009; Huang et al., 2007; Sugiyama et al., 2008; Dai et al., 2007). These methods primarily focus on the problem of distribution mismatch between data from two different sources. They also assume that the training instances are sampled from a homogeneous distribution and have the same target label space. In our approach, we do not assume any constraints on the distribution of the data or on the label space; our only assumption is that the datasets have some feature overlap in some embedding space. The feature overlap need not be substantial, as we also enforce the instance representations to be similar using a penalty function. The penalty function performs a structural transformation of the feature space, which is usually an attribute of feature-

Proposed Model
Given the data $x$ with ground truth $y$, supervised learning models aim to find the parameters $\Theta$ that maximize the log-likelihood:

$$\Theta = \arg\max_{\Theta} \log P(y \mid x, \Theta). \tag{1}$$

To augment the learning by infusing the latent representations of similar source instances, a latent vector $z_s$ from the source dataset is retrieved using the data sample $x_t$ (a target dataset instance). Thus, our modified objective function can be expressed as

$$\Theta = \arg\max_{\Theta} \log P(y \mid x_t, z_s, \Theta). \tag{2}$$

To enforce the latent representations of the instances to be similar, for better generalization across the tasks, we add a suitable penalty to the objective. The modified objective then becomes

$$\Theta = \arg\max_{\Theta} \log P(y \mid x_t, z_s, \Theta) - \lambda \mathcal{L}(z_s, z_t), \tag{3}$$

where $\mathcal{L}$ is the penalty function, $z_t$ is the latent representation of the target instance, and $\lambda$ (the scale factor) is a hyperparameter. The subsequent sections focus on the methods to retrieve the instance latent vector $z_s$ using the data sample $x_t$. It is important to note that we do not assume any structural form for $P$. Hence the proposed method can augment any supervised learning setting with any form for $P$. In the experiments, we use a softmax over the bi-LSTM (Greff et al., 2015) encodings of the input as the form for $P$. Any state-of-the-art text encoding scheme (Le and Mikolov, 2014) can be used here instead. The schematic representation of the model is shown in Figure 1. In the following sections, we discuss in detail the working of the individual modules in Figure 1 and the formulation of the penalty function $\mathcal{L}$.
Sentence Encoder: The purpose of this module is to create a vector in some latent space that encodes the semantic context of a sentence from the input sequence of words. The context vector $c$ is obtained from an input sentence, a sequence of word vectors $x = (x_1, x_2, \ldots, x_T)$, using a bi-LSTM (the Sentence Encoder in Figure 1) with hidden states $h_t \in \mathbb{R}^n$, where $t$ indexes time and $n$ is the embedding size (Equation 4). The states at multiple time steps are combined using a linear function $g$ with weights $W \in \mathbb{R}^{n \times m}$ to give $c$ (Equation 5), where $m$ is a hyperparameter representing the dimension of the context vector; the specific $g$ used in our experiments is given in Equation 6. The bi-LSTM module responsible for generating the context vector $c$ is pre-trained on the target classification task. A separate bi-LSTM module (the sentence encoder for the source dataset) is trained on the source classification task. In our experiments, we used similar modules for creating the instance embeddings of the source and target datasets; this is not a constraint of the method, and different modules can be used here.
Instance Retrieval: Using the context vector $c_t$ ($c$ in Equation 5) corresponding to a target instance as a query, the $k$ nearest neighbours $(z^s_1, z^s_2, \ldots, z^s_k)$ are searched from the source dataset using Locality Sensitive Hashing (LSH). The search mechanism using LSH takes constant time in practical scenarios (Bawa et al., 2005) and therefore does not increase the training duration by a large margin. Although LSH returns approximate nearest neighbours, it does not introduce any extra loss (compared to exact nearest neighbour retrieval) in our model, as our objective is to retrieve similar instances in order to determine the class label. Even if the ranking of the retrieved instances is not accurate, retrieving multiple instances ($k$) reduces the chance of missing very similar instances. The retrieved source instance embeddings receive attention $\alpha_{z_i}$ from a soft-attention mechanism based on inner product similarity, given as

$$\alpha_{z_i} = \frac{\exp(c_t^{\top} z^s_i)}{\sum_{j=1}^{k} \exp(c_t^{\top} z^s_j)}, \tag{7}$$

where $c_t \in \mathbb{R}^m$ and $z^s_i, z^s_j \in \mathbb{R}^m$. The fused instance embedding vector $z_s$ formed after the soft-attention mechanism is given by

$$z_s = \sum_{i=1}^{k} \alpha_{z_i} z^s_i, \tag{8}$$

where $z_s \in \mathbb{R}^m$. The retrieved instance is concatenated with the context vector $c$ (in Equation 5) as

$$s = [c_t, z_s] \quad \text{and} \quad y = \mathrm{softmax}(s^{\top} W^{(1)}), \tag{9}$$

where $W^{(1)} \in \mathbb{R}^{2m \times u}$ and $y$ is the output of the final target classification task. This model is then trained jointly, with the initial parameters taken from the pre-trained classification module. The pre-training of the classification module is necessary because, if we start from a randomly initialized context vector $c_t$, the LSH Forest retrieves arbitrary vectors and the model as a whole fails to converge. As the error propagates only through the attention values and the penalty function, it is impossible to simultaneously rectify the query and the search results of the hashing mechanism.
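The attention, fusion and classification steps described above can be sketched in plain Python. This is an illustrative sketch, assuming the attention weights are a softmax over inner products between the query and each retrieved embedding and that the fused vector is their attention-weighted sum; the helper names, the toy vectors and the weight matrix `W1` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse(c_t, retrieved):
    """Attention over retrieved source embeddings, then their weighted sum z_s."""
    alphas = softmax([dot(c_t, z) for z in retrieved])
    z_s = [sum(a * z[d] for a, z in zip(alphas, retrieved))
           for d in range(len(c_t))]
    return alphas, z_s

def classify(c_t, z_s, W1):
    """y = softmax(s^T W1) with s = [c_t, z_s]; W1 has shape 2m x u."""
    s = c_t + z_s  # list concatenation, length 2m
    u = len(W1[0])
    logits = [sum(s[i] * W1[i][j] for i in range(len(s))) for j in range(u)]
    return softmax(logits)

# Toy example with m = 2, k = 3 retrieved neighbours and u = 2 classes.
c_t = [1.0, 0.0]
retrieved = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
alphas, z_s = fuse(c_t, retrieved)
W1 = [[0.5, -0.5], [0.1, 0.2], [0.3, 0.0], [-0.2, 0.4]]  # hypothetical weights
y = classify(c_t, z_s, W1)
```

In this toy case the first retrieved vector has the largest inner product with the query, so it receives the largest attention weight and dominates the fused vector.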
It is important to note that the proposed model adds only a limited number of parameters over the baseline model. The only extra trainable weight matrix is $W^{(1)} \in \mathbb{R}^{2m \times u}$, which adds $2m \times u$ parameters, where $m$ is the size of the context vector $c$ and $u$ is the number of classes.
Penalty Function: In instance-based learning, a test instance is assigned the label of the majority of its nearest-neighbour instances. This follows from the fact that similar instances belong to the same class distribution. Following the retrieval of latent vector embeddings from the source dataset, the target latent embedding is constrained to be similar to the retrieved source instances. In order to enforce this, we introduce an additional penalty along with the loss function (shown in Figure 1). The modified objective function is given as

$$\min_{\theta} \; L(y, y_t) + \lambda \, \lVert z_s - z_t \rVert_F^2, \tag{10}$$

where $\lVert \cdot \rVert_F$ stands for the Frobenius norm of a matrix, $y$ and $z_s$ are the output of the model and the retrieved latent embedding respectively, $y_t$ is the label, $\lambda$ is the scale factor and $z_t$ is the latent vector embedding of the target instance. $L(\cdot)$ in the above equation denotes the loss function used to train the model (depicted as $L(\cdot)$ in Figure 1) and $\theta$ denotes the model parameters. The additional penalty term makes the latent vectors similar across multiple datasets, which aids performance in the subsequent stages.
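A minimal sketch of a penalized objective of this kind for a single training instance is given below. It rests on assumptions that the text does not confirm: cross-entropy is used as the training loss, the penalty is the squared distance between the fused source embedding and the target embedding, and the value of the scale factor `lam` is arbitrary. The function names are hypothetical.

```python
import math

def cross_entropy(y, label_index):
    """Assumed training loss: negative log-probability of the true class."""
    return -math.log(y[label_index])

def squared_distance(z_s, z_t):
    """Squared norm of z_s - z_t (assumed form of the penalty)."""
    return sum((a - b) ** 2 for a, b in zip(z_s, z_t))

def penalized_loss(y, label_index, z_s, z_t, lam):
    """Loss plus lam times the embedding-similarity penalty."""
    return cross_entropy(y, label_index) + lam * squared_distance(z_s, z_t)

# Same prediction, two different target embeddings (hypothetical values).
loss_near = penalized_loss([0.7, 0.3], 0, [1.0, 0.0], [0.9, 0.1], lam=0.1)
loss_far = penalized_loss([0.7, 0.3], 0, [1.0, 0.0], [-1.0, 0.5], lam=0.1)
```

With identical predictions, the instance whose target embedding lies further from the retrieved source embedding incurs the larger loss, which is the pressure that pulls the two representations together during training.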

Experiments & Results
The experiments are designed to compare the performance of the baseline model with that of the model augmented with an external dataset. A simple bi-LSTM (target-only) model, trained without consideration of source-domain instances (no source-instance retrieval branch is included in the network), acts as the baseline. The embeddings of the source instances are also trained using a bi-LSTM classifier. The only constraint on the embeddings is that their shape should be the same across domains for the LSH search to take place. Our experiments show performance enhancement across several datasets by incorporating relevant instance information from a source dataset in varying setups. Our experiments also illustrate that the proposed model continues to perform better even when the size of the training set is reduced, thereby reducing the dependence on labeled data. We also demonstrate the efficacy of our model through latent vector visualizations.

We use three news classification datasets: (a) 20Newsgroups$^{1}$, (b) BBC$^{2}$ (Greene and Cunningham, 2006), (c) BBC Sports$^{2}$ (Greene and Cunningham, 2006). The datasets are chosen such that all of them share common domain knowledge and have a small number of training examples, so that the improvement observed using instance infusion is significant. The statistics of the three real-world datasets are given in Table 2. These datasets do not have a dedicated test set, so the evaluations were performed using a k-fold cross-validation scheme. All performance scores reported in this paper are the mean performance over all the folds.

[Table 3: Hyper-parameters. Columns: Parameter, News20, BBC, BBC-Sports]

$^{1}$ http://qwone.com/~jason/20Newsgroups/
$^{2}$ http://mlg.ucd.ie/datasets/bbc.html

The word embeddings were randomly initialized and trained along with the model. The learning rate is regulated over the training epochs: it is decreased to 0.3 times its previous value every 10 epochs. The relevant hyper-parameters are listed in Table 3.
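The learning-rate schedule described above amounts to a step decay, sketched below. The initial learning rate of 0.1 is a hypothetical value chosen only for illustration, the function name is hypothetical, and epochs are assumed to be counted from zero so that the first decrease happens at epoch 10.

```python
def learning_rate(epoch, initial_lr, factor=0.3, step=10):
    """Step decay: multiply the rate by `factor` once every `step` epochs."""
    return initial_lr * factor ** (epoch // step)

# Hypothetical initial rate; epochs 0-9, 10-19 and 20-29 each share one rate.
schedule = [learning_rate(e, initial_lr=0.1) for e in (0, 9, 10, 25)]
```

Epochs 0 and 9 use the initial rate, epoch 10 uses 0.3 times that rate, and epoch 25 uses 0.09 times that rate.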
Results: Table 1 shows the detailed results of our approach for all the datasets. The source and target are chosen such that the source dataset is able to provide relevant information. Table 1 shows improvements by a high margin for all datasets. For 20Newsgroups the improvement over the baseline model is 12%, while the BBC and BBC Sports datasets show an improvement of 5%. Because the proposed approach is independent of the source encoding procedure and the source instance embeddings are kept constant during training, source instances from multiple datasets can be incorporated. In the subsequent sections, we describe various setups that demonstrate the efficacy of our model.

Instance Infusion from Same Dataset:
We study the results of using the target dataset as the source for instance retrieval. This setting is the same as the conventional instance-based learning setup. However, our approach not only uses the instance-based information but also leverages the macro statistics of the target dataset. The intuition behind this experimental setup is that instances from the same dataset are also useful in modeling other instances, especially when a class imbalance exists in the target dataset. In this setup, the nearest neighbour retrieved is ignored, as it would be the same as the instance being modeled during training. The performance of this setup is shown in Table 4.
Table 4: Test accuracy for the proposed model using instances from the same target dataset

Dataset Reduction with Single Source: We now discuss a set of experiments performed to support our hypothesis that the proposed model is capable of reducing the dependency on labeled instances. In this set of experiments, we show that the cross-dataset augmented models perform significantly better than the baseline models when varying fractions of the training data are used. Figure 2 shows the variation of instance-infused bi-LSTM and bi-LSTM (target-only) performance for the 20Newsgroups, BBC and BBC Sports datasets. In this set of experiments, 20Newsgroups had BBC, BBC had 20Newsgroups and BBC Sports had BBC as the source dataset. As shown in the plot, fractions of 0.3, 0.5, 0.7, 0.9 and 1.0 of the dataset are used for performance analysis. The dashed line in the plots indicates the baseline model performance with 100% of the dataset. We observe that the instance-infused bi-LSTM trained on 70% of the dataset performs better than the baseline model trained on the entire dataset. This observation shows that the proposed approach reduces the dependency on training examples by at least 30% across all datasets.
Dataset Reduction with Multiple Source: We design an experimental setup in which only a 0.5 fraction of the target dataset is used and study the influence of infusion from multiple source datasets. Table 6 compares the results when single and multiple source datasets are used with 50% of the target dataset. The results improve as more source datasets are used in the infusion process. This can be leveraged to improve performance on very lean datasets by using large datasets as sources. For the single-source setup, the same source datasets are used as mentioned in the results section. In the multiple-source setup, for a given target dataset the other two datasets are used as sources.
Comparative Study: Table 5 gives the experimental results for our proposed approach, the baselines and other conventional learning techniques on the 20 Newsgroups, BBC and BBC Sports datasets. As the literature involving these datasets mostly focuses on non-deep-learning approaches, we compare our results with some popular conventional learning techniques. The experiments involving conventional learning were performed using the scikit-learn (Pedregosa et al., 2011) library in Python 3. For the k-NN-ngram experiments, the number of nearest neighbours k was set to 5. In Table 5, the models studied are Multinomial Naive Bayes, the k-nearest neighbour classifier, the Support Vector Machine (SVM) (Bishop, 2006) and the Random Forests classifier. The input vectors were built from n-grams, bi-grams or term frequency-inverse document frequency (tfidf). For the mentioned datasets, conventional models outperform our baseline bi-LSTM model; however, upon instance infusion the deep learning based model achieves competitive performance across all datasets. Moreover, with instance infusion the simple bi-LSTM model approaches the classical models in performance on the News20 and BBC Sports datasets, whereas on the BBC dataset the proposed instance-infused bi-LSTM model beats all the mentioned models. The improvement from instance infusion is 13% for News20, 5% for BBC and 8% for BBC Sports. The important point to note here is that, although for the News20 dataset we are not able to beat the state of the art (falling short by less than 1%), instance infusion improves the performance of the deep learning model by a significant margin of 13%.

Table 6: Test accuracy using instances from multiple source datasets with 50% of the target dataset
Visualization: We show visualizations of the latent space embeddings formed using the bi-LSTM (target only) and with instance infusion. In Figure 3, the latent vector embeddings of the BBC Sports dataset with News20 support are shown for fractions of 0.3 in (a) & (b), 0.5 in (c) & (d) and 0.7 in (e) & (f) of the target training dataset (BBC Sports). Figure 3 (f) shows the embedding representation with 70% of the data, for which the best performance (among the 6 visualizations) is observed.
It is evident from the figure that, even with 30% and 50% of the data, instance infusion tries to make the embedding distribution similar to Figure 3 (f), as seen in Figure 3

Related Work
The motivation behind our model comes from memory networks (Graves et al., 2014), which have an augmented long-term memory component, and our model follows the general workflow in (Weston et al., 2014; Sukhbaatar et al., 2015). In our work, we incorporate instance-level information using content-based attention over a support dataset memory. Attention-based approaches are widely used in text analysis (Bahdanau et al., 2014; Lin et al., 2017). This approach has gained popularity in work with limited sample space. Vinyals et al. (2016) use a similar approach for one-shot learning; however, they form inferences based only on the support instance labels. Snell et al. (2017) extend the idea to few-shot learning in a discriminative manner by measuring the distance from a class representative in a support set. Triantafillou et al. (2017) introduced a scoring function to rank instances in a batch and optimize mean Average Precision (mAP) for few-shot learning. Edwards and Storkey (2016) used a generative approach for selecting representative samples for inference. In our work, like memory networks, we maintain a fixed long-term memory from the source dataset but do not modify it during training. We sample instances from the memory using content-based similarity, but unlike few-shot learning techniques, our model does not access their labels. We present our work as a generalized approach for transfer learning across datasets sharing a common domain.

Conclusion & Future Work
In this work, we posit that, while learning from training data, infusing instance-level local information from an external dataset improves the performance of the learning algorithm, which we show through extensive experimentation on our proposed model. Although instance-based learning is extensively studied in the AI literature, it has rarely been used in a deep learning setup for transfer learning. One aspect that can be pursued to improve our setup is to incorporate a more sophisticated search paradigm for instance retrieval in order to reduce latency. We have shown that our method reduces the dependency on labeled data; this can also be extended to analyse performance in an unsupervised setup. Improved feature modification techniques can be added alongside the search module in order to enhance the query formulation. We also assumed that the datasets share a common domain; in future work, means to tackle domain discrepancy need to be formulated in order to incorporate instances from a wider range of datasets.