Accelerating Sparse Matrix Operations in Neural Networks on Graphics Processing Units

Graphics Processing Units (GPUs) are commonly used to train and evaluate neural networks efficiently. While previous work in deep learning has focused on accelerating operations on dense matrices/tensors on GPUs, efforts have concentrated on operations involving sparse data structures. Operations using sparse structures are common in natural language models at the input and output layers, because these models operate on sequences over discrete alphabets. We present two new GPU algorithms: one at the input layer, for multiplying a matrix by a few-hot vector (generalizing the more common operation of multiplication by a one-hot vector) and one at the output layer, for a fused softmax and top-N selection (commonly used in beam search). Our methods achieve speedups over state-of-the-art parallel GPU baselines of up to 7x and 50x, respectively. We also illustrate how our methods scale on different GPU architectures.


Introduction
The speedups introduced by parallel architectures inspired the development of accelerators tailored towards specialized functions. Graphics Processing Units (GPUs) are now a standard platform for deep learning. GPUs provide faster model training and inference times compared to serial processors, because they can parallelize the linear algebra operations used so heavily in neural networks (Raina et al., 2009).
Currently, major open source toolkits (Abadi et al., 2016) provide additional layers of abstraction to support one or more parallel GPU architectures. The seamless compatibility with multiple GPUs allows researchers to train a single model on multiple hardware platforms with no significant changes to their code base and no specialized knowledge about the targeted architectures. The disadvantage of hardware agnostic APIs is the lack of optimizations for a set of task-specific functions.
Adapting parallel neural operations to a specific hardware platform is required to obtain optimal speed. Since matrix operations are used heavily in deep learning, much research has been done on optimizing them on GPUs (Chetlur et al., 2014;Gupta et al., 2015). Recently, some efforts have been made to other kinds of operations: serial operations running on the GPU (Povey et al., 2016), operations not involving matrix multiplications (Bogoychev et al., 2018), and models using sparse structures (Zhang et al., 2016). In this paper, we focus on sparse operations running exclusively on the GPU architecture.
Much recent work in High Performance Computing (HPC) and Natural Language Processing (NLP) focuses on an expensive step of a model or models and optimizes it for a specific architecture. The lookup operation used in the input layer and the softmax function used in the output are two examples seen in machine translation, language modeling, and other tasks. Previous work has accelerated the softmax step by skipping it entirely (Devlin et al., 2014), or approximating it (Shim et al., 2017;Grave et al., 2017).
Another strategy is to fuse multiple tasks into a single step. This approach increases the room for parallelism. Recent efforts have fused the softmax and top-N operations to accelerate beam search on the GPU using similar approaches (Hoang et al., 2018;Milakov and Gimelshein, 2018). Our approach differs from former methods in the following aspects: We deliver a novel method tailored towards scenarios seen in Neural Machine Translation (NMT), we introduce a new GPU-specific method to obtain the top-N elements from a list of hypotheses using a different sorting mechanism, and we introduce a sparse lookup method for GPUs.
NMT uses beam search during inference to limit the full set of potential output translations explored during decoding (Cho et al., 2014;Graves, 2012). This algorithm is widely used to obtain state-of-the-art results during test time. At each decoding time-step t, the top-N hypotheses are chosen for further expansion and the rest are discarded. The top-N selection part of the search has been accelerated using hashing methods to avoid a full sort (Shi et al., 2018;Pagh and Rodler, 2004). The aim of this paper is to both combine softmax and top-N operations seen in the last layer of a neural network and optimize the top-N selection operation used by several NMT models.
Our work uses ideas from previous work to accelerate two different operations. We focus on operations that manipulate sparse structures (Saad, 1990). By sparse, we mean operations that only require a small fraction of the elements in a tensor to output the correct result. We propose two different optimizations for sparse scenarios in deep learning: The first operation involves the first layer of a neural network. We accelerate the first matrix multiplication using batched sparse vectors as input. The second operation is the computation of the softmax used for beam search. We combine the softmax and the top-N selection into one operation obtaining a speedup over a parallel stateof-the-art baseline. We show that our fused top-N selection and sparse lookups achieve speedups of 7× and 50× relative to other parallel NVIDIA baselines.

Graphics Processing Units
GPUs are widely used to accelerate a variety of non-neural tasks such as search (Garcia et al., 2008), parsing (Hall et al., 2014), and sorting (Sintorn and Assarsson, 2008). Applications adapted to the GPU spot different architectural properties of the graphics card to obtain the best performance. This section provides a short overview of the architectural features targeted for this work.

CUDA execution model
CPUs call special functions, also called kernels, to execute a set of instructions in parallel using multiple threads on the GPU. Kernels can be configured to create and execute an arbitrary number of threads. The threads in a kernel are grouped into different thread blocks (also called cooper-ative thread arrays). Threads in the same block can collaborate by sharing the same memory cache or similar operations. The maximum number of threads per block and number of blocks varies across GPU architectures.
All threads running in the same block are assigned to a single Streaming Multiprocessor (SM) on the GPU. A SM contains the CUDA cores that execute the instructions for each thread in a single block. The number of CUDA cores per SM varies depending on the architecture. For example, Volta V100 contain 64 cores per SM, while GeForce GTX 1080s contain 128 cores per SM. Multiple thread blocks can be assigned to a SM if the number of blocks in the grid is larger than the number of physical SMs. Execution time will increase when more than one block is assigned to all SMs on the device (assuming all blocks run the same instruction). Regardless of the number of threads per block, all SMs can only run a total of 32 threads, called a warp, asynchronously at a time. Warp schedulers select in a round-robin fashion a warp from an assigned block to execute in parallel. The SMs finish execution when all blocks assigned to them complete their tasks. Each thread running on the SM can access multiple levels of memory on the graphics card, and an efficient use of all levels significantly improves the overall execution time on the device.

Memory
GPUs contain different levels of memory designed to read and write data stored on the device. There are advantages and disadvantages associated with each memory type. The fastest memory on the device is the register memory. The amount of registers available per SM is limited and the access scope is limited to a single thread during execution. This memory is useful to hold a small amount of variables used at the thread-level. The next type of memory is shared memory. Shared memory is accessible by all threads running on the same block. While slower than registers, shared memory provides fast read and write access times. Shared memory also allows fast operations at the block level such as reductions, usermanaged caches, etc. The amount of shared memory per SM can range from 49KB (K40) up to 96KB (V100). The last (and slowest) type of memory is the global memory. Global memory latency is 100x slower than shared memory. The main use of this memory is to store all the data copied from and to the host CPU. The amount of global memory varies depending on the GPU model (e.g. 12GB on the K40 and 16GB on the V100).
An efficient use of the memory hierarchy provides the best performance. A parallel application must be designed to minimize the total amount of calls to global memory while maximizing the use of registers and shared memory. An exclusive use of main memory will produce the worst execution times. Our methods focus on the efficient use of shared and register memory for scenarios where the data is small enough to fit.

GPU Sorting
Currently, state-of-the-art methods use a treebased reduction operation (Harris, 2005) to sort the list on the GPU and obtain the top elements. Reductions are most efficient when the input needs to be completely sorted, yet faster algorithms can be used if only a portion of the sorted output is needed.
The top-N operation can be accelerated with an improved sorting algorithm for the beam search task on the GPU. Beam search only requires the top-N entries for each mini-batch, and the entries do not need to be sorted in a specific order (ascending or descending). Storing the irrelevant elements for beam search back into global memory is not required for this task and should be avoided. A clear optimization is to obtain the top elements in each minibatch using a faster sorting algorithm.
Distinct sorting algorithms can be used to obtain the top elements from a set of candidates. Previous work introduced custom sorting algorithms for specific tasks using multi-core CPU (Tridgell, 1999) and GPU setups (Satish et al., 2009;Govindaraju et al., 2006).

Background
In this section, we describe two sparse operations commonly used in deep learning, especially for NLP: at the input layer, multiplication by a sparse matrix, and at the output layer, softmax and selection of the top-N elements.

N-hot lookup
In models whose inputs are words, the input layer typically looks up a learned word embedding for each word. Equivalently, it represents each word as a one-hot vector (whose dimensionality is equal to the vocabulary size, K) and multiplies it (as a row vector) by a K × M matrix B whose rows are word embeddings. Then, a minibatch of L words can be represented as a L× K matrix A whose rows are one-hot vectors, so that the product C = AB is a matrix whose rows are the embeddings of the words in the minibatch. Deep learning toolkits (Neubig et al., 2017;Jia et al., 2014) do not perform a full matrix multiplication; typically, they implement a specialized operation to do this.
A problem arises, however, when the input vector is not a one-hot vector, but an "N-hot" vector. For example, we might use additional dimensions of the vector to represent subword or partof-speech tag information (Niehues et al., 2011;Collobert et al., 2011;Chiu and Nichols, 2016). In this case, it would be appropriate to use a sparse matrix library like cuSPARSE, but we show below that we can do better.

Softmax
The softmax function (Equation 1) is widely used in deep learning to output a categorical probability distribution: For better numerical stability, all deep learning toolkits actually compute the softmax as follows: This alternative requires different optimizations on the GPU given the max operation. Recent work (Milakov and Gimelshein, 2018) explore different techniques to calculate this safe softmax version efficiently.

Beam search and top-N selection
Some applications in deep learning require additional computations after the softmax function. During NMT decoding, the top-N probabilities from softmax(z) are chosen at every time-step t and used as an input to the next search step t + 1. It is common practice to obtain the top-N elements after the softmax operation. Naively, we can do this by sorting the probabilities and then taking the first N elements, as shown in Algorithm 1. This operation is sparse in nature given the fact that several hypotheses are discarded during search. The Algorithm 1 Serial minibatched softmax and top-N algorithm.
for ← 1, . . . , L do 8: 12: return D retrieval of non-zero elements in a sparse input parallels the top-N scenario. (Beam search also requires that we keep track of the original column indices (i.e., the word IDs) of the selected columns; this is not shown in Algorithm 1 for simplicity.) In NMT, the top-N operation consumes a significant fraction of time during decoding. Hoang et al. (2018) find that the softmax operation takes 5% of total decoding time, whereas finding the top-N elements can take up to 36.8%. So there is a large potential benefit from speeding up this step.

Method
In this section, we present our algorithms for Nhot lookup ( §4.1) and fused softmax and top-N selection ( §4.2).

Sparse input lookups
Our sparse N-hot lookup method, shown in Algorithm 2, multiplies a sparse matrix A in Compressed Sparse Row (CSR) format by a row-major matrix B to yield a dense matrix C.
CSR is widely used to store and process sparse matrices. This format stores all non-zero elements of a sparse matrix A contiguously into a new structure A v . Two additional vectors A r and A c are required to access the values in A v . An example of the CSR format is illustrated in Figure 1. A r is first used to access the columns storing the non-zero elements in row . The number of non-zero elements for a row can be computed by accessing A r [ ] and calculating its offset with the next element Our method computes the matrix multiplication by processing the elements of the output matrix C in parallel. For our experiments, we process 32 (warp size) rows and columns in parallel for the input matrices. We cannot use a stride size larger than 32, since certain GPU architectures do not allow a 2 dimensional block larger than 32 × 32 (or a block containing more than 1024 threads total). Although this method is fairly straightforward, we will see below that it outperforms other methods when N is small, as we expect it to be.

Fused softmax and top-N
The beam size, or top-N, used in NMT is usually small, with the most commonly used values ranging from 1 to 75 (Sutskever et al., 2014;Koehn and Knowles, 2017). Because of this, we base our implementation on insertion sort, which is O(K 2 ), where K is the number of elements to be sorted, but is reasonably efficient for small arrays. It can be easily modified into a top-N selection algorithm that runs in O(KN) time (Algorithm 3). Unlike in-Algorithm 2 Sparse matrix multiplication using the CSR format. sertion sort, it maintains separate buffers for the sorted portion (D) and the unsorted portion (C); it also performs an insertion by repeating swapping instead of shifting.
The key to our method is that we can parallelize the loop over k (line 3) while maintaining correctness, as long as the comparison and swap can be done atomically. To see this, note that no swap can ever decrease the value of one of the D[n]. Furthermore, because for each k, we compare C[k] with every element of D, it must be the case that after looping over all n (line 4), we have C[k] ≤ D[n] for all n. Therefore, when the algorithm finishes, D contains the top-N values.
Fusing this algorithm with the softmax algorithm, we obtain Algorithm 4. It takes an input array C containing a minibatch of logits and returns an array D with the top-N probabilities and an array E with their original indices. The comparisons in our method are carried out by the CUDA atomicMax operation (line 12). This function reads a value D [ ][n] and computes the max-Algorithm 4 Parallel fused batched softmax, and top-N algorithm. The comment "kernel-level" means a loop over blocks, and the comment "block-level" means a loop over threads in a block. Our algorithm recovers the original column indices (m) with a simple extension following Argueta and Chiang (2017). We pack each probability as well as its original column index into a single 64-bit integer before the sorting step (line 5), with the probability in the upper 32 bits and the column index in the lower 32 bits. This representation preserves the ordering of probabilities, so a single atomicMax operation on the packed representation will atomically update both the probability and the index.
The final aspect to consider is the configuration of the kernel calls from the host CPU. The grid layout must be configured correctly to use this method. The top-N routine relies on specific ker-  Table 1: Performance comparison for the N-hot lookups against the NVIDIA baseline using dimensions L = 100, K = 10240, N = 512. Each time (in ms) is an average over ten runs. Fastest times are in bold.
nel and memory configurations to obtain the best performance. The number of kernel blocks must be equal to the number of elements in the minibatch. This means that batch sizes smaller than or equal to the number of SMs on the GPU will run more efficiently given only one block, or less, will run on all SMs in parallel. The overall performance will be affected if multiple blocks are assigned to all SMs. The number of SMs on the GPU varies depending on the architecture. For example, the Tesla V100 GPU contains 80 SMs, while the Pascal TITAN X contains 30 SMs. This means that our method will perform better on newer GPU architectures with a large amount of SMs. The number of threads in the block is an additional aspect to consider for our method.
The block size used for our experiments is fixed to 256 for all the experiments. This number can be adapted if the expected number of hypotheses to sort is smaller than 256 (the number of threads must be divisible by 32). The amount of shared memory allocated per block depends on the size of N. The auxiliary memory used to store the top-N elements must fit in shared memory to obtain the best performance. A large N will use a combination of shared and global memory affecting the overall execution of our method.

Experiments
We run experiments on two different GPU configurations. The first setup is a 16 core Intel(R) Xeon(R) Silver 4110 CPU connected to a Tesla V100 CPU, and the second set is a 16-core Intel(R) Xeon(R) CPU E5-2630 connected to a GeForce GTX TITAN X. The dense matrices we use are randomly generated with different floating point values. We assume the dense representations contain no values equal to zero. The sparse minibatches used for the top-N experiments are randomly generated to contain a specific amount of non-zero values per element. The indices for all non-zero values are selected at random.

Sparse N-hot lookups
For the N-hot lookup task, we compared against the cuBLAS 1 and cuSPARSE 2 parallel APIs from NVIDIA. Both interfaces provide methods to compute mathematical operations in parallel on the GPU. Table 1 shows the performance of our method against the two NVIDIA APIs for sparse and dense matrix multiplication using different architectures and levels of sparsity. All speedups decrease as the input becomes less sparse. The cuS-PARSE baseline performs on par with the dense cuBLAS version on the V100 architecture when the number of non-zero elements per batch is larger than 1. The cuSPARSE baseline performs better than its dense counterpart on the TITAN X architecture and worse on the V100. An explanation behind this is the type of sparsity patterns cuSPARSE handles and the different amount of SMs and memory types on both architectures.  cuSPARSE is designed to handle sparsity patterns that translate well on several tasks with different sparsity patterns. The multiplication time remains constant on the V100 when a standard dense matrix multiplication is used while cuSPARSE keeps performing worse once the sparse input becomes dense.
The highest speedups are obtained when the amount of non-zero elements is low, and the lowest speedups are seen when the amount of nonzero elements increase. On the V100, our method starts performing worse than the cuBLAS baseline when the amount of non-zero elements per batch element is larger than 100. On the other side, the performance of our method is worse than cuSPARSE when the sparsity is larger than 10 on the TITAN X architecture. Our method performs well on newer GPU models with a larger amount of SMs.
We also compare the performance of our method against a one-hot lookup (i.e., N = 1) implementation used in DyNet (Neubig et al., 2017). DyNet is a C++ toolkit (with CUDA support) designed for NLP models. We compare the time it takes to execute the lookup function on the same dimensions used for our N-hot lookup experiments on both architectures. On average, DyNet takes 0.06ms to execute the lookup on the TITAN X architecture and 0.08ms on the V100 architecture. This operation is faster than both cuBLAS and cuSPARSE yet slower than our sparse implementation; however, this comparison is not entirely fair, because the DyNet times include the overhead of constructing a computation graph, whereas the other times only include the matrix operation itself.

Softmax and top-N
We compared our fused softmax operation against the current state-of-the art method from NVIDIA (Milakov and Gimelshein, 2018). Table 2 demonstrates the comparison of our method against the NVIDIA baseline using two different architectures. Our method outperforms the baseline on top-N sizes smaller than or equal to 300. Our method scales differently on both GPU architectures given the constrained amount of shared memory on the graphics cards and the amount of SMs available. The performance of our suggested implementation will slightly degrade on both architectures when the amount of memory used to perform the selection overtakes the amount of shared memory available.
The speedups against the baseline decrease as N grows. Our execution time still outperforms the baseline on most sizes of N used in NMT scenarios. This makes our method suitable for tasks requiring a small amount of elements from an output list. If the size of N exceeds 300, different methods should be used to obtain the most optimal performance.
The baseline scales better than our implementation when N increases. Table 2 shows the execution time for the baseline is not affected significantly when N grows. The baseline does see performance degradation when the amount of elements in the mini-batch increases. This is due to the same reduction operation used for all sizes of N. This factor allows our method to perform better in several scenarios where N is smaller than or equal to 300. The baseline performs best on scenarios where the batch size is small and the size of the batch elements is large (about 4000). They claim their method does not perform well on batches with a high dimensionality if N is very large due to the cost of computing the full reduction to sort the input weights and their ids.
The batch size affects the performance in a different manner on both architectures. The performance scales in a different manner when the batch size changes. On our largest experiments, the performance for N = 400 does not degrade significantly on the V100 architecture, while the speedups on the TITAN X change significantly from 1.19 to 0.32. This shows that our method runs best on the TITAN X architecture when the batch size is small, and the amount of top-N elements required does not exceed 400. For larger batches, the V100 architecture performs best for all values of N. The TITAN X provides better speedups against the baseline when the number of elements in the mini-batch is small, and both our method and baseline run on the same GPU device.

Conclusion
In this work, we introduce two parallel methods for sparse computations found in NMT. The first operation is the sparse multiplication found in the input layer, and the second one is a fused softmax and top-N. Both implementations outperform different parallel baselines. We obtained speedups of up to 7× for the sparse affine transformation, and 50× for the fused softmax and top-N task. 3 Future work includes the fusion of additional operations in neural models. Matrix operations form the largest bottleneck in deep learning. The last affine transformation in deep neural models can be fused with our softmax and top-N methods. The fusion of these three operations requires a different implementation of the matrix multiplication, and shared memory usage.