Scalable Semi-Supervised Query Classification Using Matrix Sketching

The enormous scale of unlabeled text available today necessitates scalable schemes for representation learning in language processing. For instance, in this paper we are interested in classifying the intent of a user query. While our labeled data is quite limited, we have access to a virtually unlimited amount of unlabeled queries, which could be used to induce useful representations, for instance by principal component analysis (PCA). However, it is prohibitive to even store the data in memory due to its sheer size, let alone apply conventional batch algorithms. In this work, we apply the recently proposed matrix sketching algorithm to entirely obviate the problem with scalability (Liberty, 2013). This algorithm approximates the data within a specified memory bound while preserving the covariance structure necessary for PCA. Using matrix sketching, we significantly improve the user intent classification accuracy by leveraging large amounts of unlabeled queries.


Introduction
The large amount of high quality unlabeled data available today provides an opportunity to improve performance in tasks with limited supervision through a semi-supervised framework: learn useful representations from the unlabeled data and use them to augment supervised models. Unfortunately, conventional exact methods are no longer feasible on such data due to scalability issues. Even algorithms that are considered relatively scalable (e.g., the Lanczos algorithm (Cullum and Willoughby, 2002) for computing the eigenvalue decomposition of large sparse matrices) fall apart in this scenario, since the data cannot be stored in the memory of a single machine. Consequently, approximate methods are needed.
In this paper, we are interested in improving performance on a sentence classification task by leveraging unlabeled data. For this task, supervision is precious but the amount of unlabeled sentences is essentially unlimited. We aim to learn sentence representations from as many unlabeled queries as possible via principal component analysis (PCA): specifically, we learn a projection matrix for embedding a bag-of-words vector into a low-dimensional dense feature vector. However, it is not clear how we can compute an effective PCA when we are unable to even store the data in memory.
Recently, Liberty (2013) proposed a scheme, called matrix sketching, for approximating a matrix while preserving its covariance structure. This algorithm, given a memory budget, deterministically processes a stream of data points while never exceeding the memory bound. It does so by occasionally computing singular value decomposition (SVD) on a small matrix. Importantly, the algorithm has a theoretical guarantee on the accuracy of the approximated matrix in terms of its covariance structure, which is the key quantity in PCA calculation.
We propose to combine the matrix sketching algorithm with random hashing to completely remove limitations on data sizes. In experiments, we significantly improve the intent classification accuracy by learning sentence representations from huge amounts of unlabeled sentences, outperforming a strong baseline based on word embeddings trained on 840 billion tokens (Pennington et al., 2014).

Deterministic Matrix Sketching
PCA is typically performed to reduce the dimension of each data point. Let X ∈ R^{n×d} be a data matrix whose n rows correspond to n data points in R^d. For simplicity, assume that X is preprocessed to have zero column means. The key quantity in PCA is the empirical covariance matrix XᵀX ∈ R^{d×d} (up to harmless scaling). It is well known that the m length-normalized eigenvectors u_1, …, u_m ∈ R^d of XᵀX corresponding to the largest eigenvalues are orthogonal directions along which the variance of the data is maximized. Then if Π ∈ R^{d×m} is the matrix whose i-th column is u_i, the PCA representation of X is given by XΠ. PCA has been a workhorse in representation learning, e.g., inducing features for face recognition (Turk et al., 1991).
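As a point of reference before scaling up, the exact PCA computation just described can be written in a few lines (a minimal in-memory sketch; `pca_projection` is a name we introduce for illustration, not code from the paper):

```python
import numpy as np

def pca_projection(X, m):
    """Return the top-m PCA projection matrix Pi (d x m) for data X (n x d).
    This is the exact in-memory baseline that sketching later replaces."""
    Xc = X - X.mean(axis=0)                 # enforce zero column means
    C = Xc.T @ Xc                           # empirical covariance (up to scaling)
    eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :m]          # top-m eigenvectors as columns

# Usage: embed 5-dimensional points into 2 dimensions.
X = np.random.randn(100, 5)
Pi = pca_projection(X, 2)
Z = (X - X.mean(axis=0)) @ Pi               # PCA representation X @ Pi
```

The columns of Pi are orthonormal, so Z simply expresses each centered data point in the basis of the top principal directions.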
Frequently, however, the number of samples n is simply too large to work with. As n tends to billions and trillions, storing the entire matrix X in memory is practically impossible, and processing it often requires more memory than the capacity of a typical single enterprise server. Clusters can aggregate the memory of many machines into a distributed memory system with large capacity, but building and maintaining such industry-grade clusters is not trivial and thus not accessible to everyone. It is therefore critical to have techniques that can process large data within the limited memory budget available on most typical enterprise servers.
One solution is to approximate the matrix with some Y ∈ R^{l×d} where l ≪ n. Many matrix approximation techniques have been proposed, such as random projection (Papadimitriou et al., 1998; Vempala, 2005), sampling (Drineas and Kannan, 2003; Rudelson and Vershynin, 2007; Kim and Snyder, 2013; Kim et al., 2015b), and hashing (Weinberger et al., 2009). Most of these techniques involve randomness, which can be undesirable in certain situations (e.g., when experiments need to be exactly reproducible). Moreover, many are not designed directly for the objective that we care about: namely, ensuring that the covariance matrices XᵀX and YᵀY remain "similar".

[Figure 1: Matrix sketching algorithm of Liberty (2013). X ∈ R^{n×d} denotes the data matrix with rows x_1, …, x_n.]
A recent result by Liberty (2013) gives a deterministic matrix sketching algorithm that tightly preserves the covariance structure needed for PCA. Specifically, given a sketch size l, the algorithm computes Y ∈ R^{l×d} such that

‖XᵀX − YᵀY‖₂ ≤ 2‖X‖²_F / l    (1)

This result guarantees that the error decreases as O(1/l); in contrast, other approximation techniques have a significantly worse convergence rate of O(1/√l). The algorithm is pleasantly simple and is given in Figure 1 for completeness. It processes one data point at a time to update the sketch Y in an online fashion. Once the sketch is "full", its SVD is computed and the rows that fall below a threshold given by the median singular value are eliminated. This operation ensures that every time SVD is performed, at least half of the rows are discarded. Consequently, we perform no more than O(2n/l) SVDs on a small matrix Y ∈ R^{l×d}. The analysis of the bound (1) is an extension of the "median trick" for count sketching and is also surprisingly elementary; we refer to Liberty (2013) for details.
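For concreteness, the procedure can be rendered as a short, unoptimized routine. This is our illustrative reading of the algorithm (not the reference implementation), and it assumes the sketch size l does not exceed the dimension d:

```python
import numpy as np

def frequent_directions(stream, l, d):
    """Deterministic matrix sketching in the spirit of Liberty (2013):
    maintain an l x d sketch Y over a stream of d-dimensional rows so
    that ||X^T X - Y^T Y||_2 <= 2 ||X||_F^2 / l."""
    Y = np.zeros((l, d))
    free = list(range(l))                       # indices of all-zero rows
    for x in stream:
        if not free:                            # sketch is full: shrink it
            _, s, Vt = np.linalg.svd(Y, full_matrices=False)
            delta = s[l // 2] ** 2              # median squared singular value
            s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            Y = s[:, None] * Vt                 # at least half the rows are now zero
            free = [i for i in range(l) if s[i] == 0.0]
        Y[free.pop()] = x                       # insert new row into a free slot
    return Y

# Usage: sketch 300 random 40-dimensional points into a 20-row sketch.
X = np.random.randn(300, 40)
Y = frequent_directions(X, l=20, d=40)
```

Note that the shrink step subtracts the median squared singular value from every squared singular value, which zeroes out the bottom half of the rows and frees them for incoming data.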

Matrix Sketching for Sentence Representations
Our goal is to leverage enormous quantities of unlabeled sentences to augment supervised training for intent classification. We do so by learning a PCA projection matrix Π from the unlabeled data and applying it on both training and test sentences. The matrix sketching algorithm in Figure 1 enables us to compute Π on arbitrarily large data. There are many design considerations for using the sketching algorithm for our task.

Original sentence representations
We use a bag-of-words vector to represent a sentence. Specifically, each sentence is a d-dimensional vector x ∈ R^d where d is the size of the vocabulary and x_i is the count of an n-gram i in the sentence (we use up to n = 3 in experiments); we denote this representation by SENT.
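For instance, the raw up-to-trigram counts behind SENT could be gathered as follows (a toy illustration; `ngram_counts` is a name we introduce, and a vocabulary would map each n-gram key to a vector index):

```python
from collections import Counter

def ngram_counts(tokens, n_max=3):
    """Count all n-grams of the token list up to length n_max."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Usage: 3 unigrams + 2 bigrams + 1 trigram = 6 n-grams in total.
counts = ngram_counts("play jazz music".split())
```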
In experiments, we also use a modification of this representation, denoted by SENT+, in which we explicitly define features over the first two words in a query and also use intent predictions made by a supervised model.

Random hashing
When we process an enormous corpus, it can be computationally expensive just to obtain the vocabulary size d in the corpus. We propose using random hashing to avoid this problem. Specifically, we pre-define the hash size H we want, and then on encountering any word w we map w → {1 . . . H} using a fixed hash function. This allows us to compute a bag-of-words vector for any sentence without knowing the vocabulary size. See Weinberger et al. (2009) for a justification of the hashing trick for kernel methods (applicable in our setting since PCA has a kernel (dual) interpretation).
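A minimal version of this featurizer might look as follows (our illustration; `zlib.crc32` stands in for the fixed hash function, which the text does not specify):

```python
import zlib

import numpy as np

def hashed_bow(tokens, H, n_max=3):
    """Hashed bag-of-n-grams vector of fixed size H, built without
    ever knowing the vocabulary size."""
    x = np.zeros(H)
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            x[zlib.crc32(gram.encode("utf-8")) % H] += 1.0
    return x

# Usage: any sentence maps into the same fixed H-dimensional space.
x = hashed_bow("play jazz music".split(), H=1000)
```

Collisions merge the counts of distinct n-grams, which is the price paid for a fixed dimension; the kernel-approximation analysis of Weinberger et al. (2009) bounds the resulting distortion.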

Parallelization
The sketching algorithm works in a sequential manner, processing one sentence at a time. While it leaves a small memory footprint, it can take a prohibitively long time to process a large corpus. Liberty (2013) shows that it is trivial to parallelize the algorithm: one can compute several sketches in parallel and then sketch the conjoined sketches. The theory guarantees that such layered sketching does not degrade the bound (1). We implement this parallelization to obtain an order-of-magnitude speedup.
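The layered scheme can be simulated in a single process (an illustration only; `fd_sketch` is a hypothetical minimal sketching helper, and in practice each shard would be sketched on a different machine):

```python
import numpy as np

def fd_sketch(rows, l):
    """Minimal FrequentDirections pass over a block of rows; returns an l x d sketch."""
    d = rows.shape[1]
    Y = np.zeros((l, d))
    free = list(range(l))
    for x in rows:
        if not free:                            # sketch full: shrink via SVD
            _, s, Vt = np.linalg.svd(Y, full_matrices=False)
            s = np.sqrt(np.maximum(s ** 2 - s[l // 2] ** 2, 0.0))
            Y = s[:, None] * Vt
            free = [i for i in range(l) if s[i] == 0.0]
        Y[free.pop()] = x
    return Y

# Layered sketching: sketch each shard independently (in parallel in
# practice), then sketch the conjoined shard sketches.
X = np.random.randn(4000, 64)
shard_sketches = [fd_sketch(shard, 32) for shard in np.array_split(X, 8)]
Y = fd_sketch(np.vstack(shard_sketches), 32)
```

The second-level sketch sees only 8 × 32 rows instead of 4000, which is where the parallel speedup comes from.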

Final sentence representation
Once we learn a PCA projection matrix Π, we use it at both training and test time to obtain a dense feature vector from a bag-of-words sentence representation. Specifically, if x is the original bag-of-words sentence vector, the new representation is given by

x ⊕ Πᵀx    (2)

where ⊕ is the vector concatenation operation. This representational scheme is shown to be effective in previous work (e.g., see Stratos and Collins (2015)).
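In code, the conjoined representation might look like this (our reading of the scheme; length-normalizing the dense part is an assumption on our side, as the exact normalization is not spelled out here):

```python
import numpy as np

def final_representation(x, Pi):
    """Conjoin the bag-of-words vector x with its dense PCA embedding:
    x concatenated with Pi^T x. The dense part is length-normalized
    (an assumption; the exact normalization is not specified)."""
    z = Pi.T @ x
    norm = np.linalg.norm(z)
    if norm > 0:
        z = z / norm                        # unit-length dense part
    return np.concatenate([x, z])

# Usage: a 10-dim bag-of-words vector with a 10 x 3 projection matrix.
x = np.zeros(10)
x[0] = 1.0
Pi = np.random.randn(10, 3)
r = final_representation(x, Pi)
```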

Experiment
To test our proposed method, we conduct intent classification experiments (Hakkani-Tür et al., 2013; Celikyilmaz et al., 2011; Ji et al., 2014; El-Kahky et al., 2014; Chen et al., 2016) across a suite of 22 domains shown in Table 1. An intent is defined as the type of content the user is seeking. This task is part of the spoken language understanding problem (Li et al., 2009; Tur and De Mori, 2011; Kim et al., 2015c; Mesnil et al., 2015; Kim et al., 2015a; Xu and Sarikaya, 2014; Kim et al., 2015b; Kim et al., 2015d). The amount of training data ranges from 12k to 120k queries across domains; the test data ranges from 2k to 20k. The number of intents ranges from 5 to 39 per domain.

To learn a PCA projection matrix from unlabeled data, we collected around 17 billion unlabeled queries from search logs. This yields an original data matrix whose rows are bag-of-n-grams vectors (up to trigrams), with dimensions of approximately 17 billion by 41 billion; specifically, X ∈ R^{17,032,086,719×40,986,835,008}. We use a much smaller sketch matrix Y ∈ R^{1,000,000×1,000,000} to approximate X; note that the column dimension is the hash size. We parallelized the sketching computation over 1,000 machines; we call the number of machines parallelized over the "batch". In all our experiments, we train a linear multi-class SVM (Crammer and Singer, 2002).

Table 1 shows the performance of intent classification across domains. For the baseline, an SVM without embeddings (w/o Embed) achieved 91.99% accuracy, which is already very competitive. However, the models with word embeddings trained on 6 billion tokens (6B-50d) and 840 billion tokens (840B-300d) (Pennington et al., 2014) achieved 92.89% and 93.00%, respectively (50d and 300d denote the embedding dimension). To use word embeddings as a sentence representation, we simply average the word vectors over a sentence, normalized and conjoined with the original representation as in (2).
Surprisingly, when we use the sentence representation (SENT) induced by the sketching method on our data set, we can boost performance to 93.49%, corresponding to an 18.78% relative decrease in error over the SVM without embeddings. We also see that the extended sentence representation (SENT+) yields additional gains.

Results of Intent Classification
As shown in Table 2, we also measured the performance of our method (SENT+) as a function of the percentage of unlabeled data used out of the total unlabeled sentences. The overall trend is clear: as more sentences are added to the data for inducing sentence representations, test performance improves, because of both better coverage and better embedding quality. We believe that consuming more data would boost the performance even further.

Table 3 shows the sketching results for various batch sizes. To evaluate parallelization, we first randomly generate a matrix in R^{1,000,000×100} and sketch it to R^{100×100}, running the sketching with different batch sizes. The results show that as the batch size increases, we obtain a dramatic speedup while keeping the residual ‖XᵀX − YᵀY‖₂ small. The residual indeed satisfies the bound value ‖X‖²_F/l, which was 100014503.16.