Using cuSolver from Scikit-CUDA

This using cuSolver from Scikit-CUDA Article was provided by Dr. Brian Tuomanen, who has been working with CUDA and General-Purpose GPU Programming since the year 2014.

In this article, we will be taking a tour of one of the standard CUDA libraries intended for streamlined numerical and scientific computation. We will look at cuSolver, which can perform more involved linear algebra operations than those featured in cuBLAS, such as singular value decomposition (SVD) or Cholesky factorization.

You may have been primarily dealing with one single Python module that acts as your gateway to CUDA—PyCUDA. While PyCUDA is a very powerful and versatile Python library, its main purpose is to provide a gateway to program, compile, and launch CUDA kernels, rather than provide an interface to the CUDA libraries. To this end, fortunately, there is a free Python module available that provides a user-friendly wrapper interface to these libraries. This is called Scikit-CUDA.

While you don’t have to know PyCUDA or even understand GPU programming to appreciate Scikit-CUDA, it is conveniently compatible with PyCUDA; Scikit-CUDA, for instance, can operate easily with PyCUDA’s gpuarray class, and this allows you to easily pass data between our own CUDA kernel routines and Scikit-CUDA. Additionally, most routines will also work with PyCUDA’s stream class, which will allow us to properly synchronize our own custom CUDA kernels with Scikit-CUDA’s wrappers.

For completing this article, you’ll need a Linux or Windows 10 PC with a modern NVIDIA GPU (2016—onward), with all of the necessary GPU drivers and the CUDA Toolkit (9.0–onward) installed. A suitable Python 2.7 installation (such as Anaconda Python 2.7) that includes the PyCUDA module is also required.

This article’s code is also available on GitHub and can be found at

Installing Scikit-CUDA

It is suggested that you install the latest stable version of Scikit-CUDA directly from GitHub at

Unzip the package into a directory, and then open up the command line here and install the module by typing python install into the command line. You may then run the unit tests to ensure that a correct installation has been performed with python test. (This method is suggested for both Windows and Linux users.) Alternatively, Scikit-CUDA can be installed directly from the PyPI repository with pipinstallscikit-cuda.

Using cuSolver from Scikit-CUDA

We will now look at how we can use cuSolver from Scikit-CUDA’s linalg submodule. cuSolver is a library that’s used for performing more advanced linear algebra operations than cuBLAS, such as the Singular Value Decomposition, LU/QR/Cholesky factorization, and eigenvalue computations. Since cuSolver, like cuBLAS and cuFFT, is another vast library, we will only take the time to look at one of the most fundamental operations in data science and machine learning—SVD.

Singular value decomposition (SVD)

SVD takes any m x n matrix A, and then returns three matrices in return—UΣ, and V. Here, U is an m x m unitary matrix, Σ is an m x n diagonal matrix, and V is an n x n unitary matrix. By unitary, we mean that a matrix’s columns form an orthonormal basis; by diagonal, we mean that all values in the matrix are zero, except for possibly the values along its diagonal.

The significance of the SVD is that this decomposes A into these matrices so that we have A = UΣVT ; moreover, the values along the diagonal of Σ will all be positive or zero and are known as the singular values. We will see some applications of this soon, but you should keep in mind that the computational complexity of SVD is of the order O(mn2)—for large matrices, it is definitely a good idea to use a GPU since this algorithm is parallelizable.

We’ll now look at how we can compute the SVD of a matrix. Let’s make the appropriate import statements:

import pycuda.autoinit
from pycuda import gpuarray
import numpy as np
from skcuda import linalg

We will now generate a relatively large random matrix and transfer it to the GPU:

a = np.random.rand(1000,5000).astype(np.float32)
a_gpu = gpuarray.to_gpu(a)


We can now execute the SVD. This will have three outputs corresponding to the matrices that we just described. The first parameter will be the matrix array we just copied to the GPU. Then we need to specify that we want to use cuSolver as our backend for this operation:

U_d, s_d, V_d = linalg.svd(a_gpu,  lib='cusolver')


Now, let’s copy these arrays from the GPU to the host:

U = U_d.get()
s = s_d.get()
V = V_d.get()


s is actually stored as a one-dimensional array; we will have to create a zero matrix of size 1000 x 5000 and copy these values along the diagonal. We can do this with the NumPy diag function, coupled with some array slicing:

S = np.zeros((1000,5000))
S[:1000,:1000] = np.diag(s)

We can now matrix-multiply these values on the host with the NumPy dot function to verify that they match up to our original array:

print 'Can we reconstruct a from its SVD decomposition? : %s' %
np.allclose(a,,, V)), atol=1e-5)


Since we are using only float32s and our matrix is relatively large, a bit of numerical error was introduced; we had to set the “tolerance” level (atol) a little higher than usual here, but it’s still small enough to verify that the two arrays are sufficiently close.

Using SVD for Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a tool that’s used primarily for dimensionality reduction. We can use this to look at a dataset and find which dimensions and linear subspaces are the most salient. While there are several ways to implement this, we will show you how to perform PCA using SVD.

We’ll do this as follows—we will work with a dataset that exists in 10 dimensions. We will start by creating two vectors that are heavily weighted in the front, and 0 otherwise:

vals = [ np.float32([10,0,0,0,0,0,0,0,0,0]) , 
np.float32([0,10,0,0,0,0,0,0,0,0]) ]



We will then add 9,000 additional vectors: 6,000 of these will be the same as the first two vectors, only with a little added random white noise, and the remaining 3,000 will just be random white noise:

for i in range(3000):
    vals.append(vals[0] + 0.001*np.random.randn(10))
    vals.append(vals[1] + 0.001*np.random.randn(10))


We will now typecast the vals list to a float32 NumPy array. We take the mean over the rows and subtract this value from each row. (This is a necessary step for PCA.) We then transpose this matrix, since cuSolver requires that input matrices have fewer or equal rows compared to the columns:

vals = np.float32(vals)
vals = vals - np.mean(vals, axis=0)
v_gpu = gpuarray.to_gpu(vals.T.copy())


We will now run cuSolver, just like we did previously, and copy the output values off of the GPU:

U_d, s_d, V_d = linalg.svd(v_gpu, lib='cusolver')

u = U_d.get()
s = s_d.get()
v = V_d.get()


Now we are ready to begin our investigative work. Let’s open up IPython and take a closer look at u and s. First, let’s look at s; its values are actually the square roots of the principal values, so we will square them and then take a look:

You will notice that the first two principal values are of the order 105, while the remaining components are of the order 10-3. This tells us there is only really a two-dimensional subspace that is even relevant to this data at all, which shouldn’t be surprising. These are the first and second values, which will correspond to the first and second principal components that are, the corresponding vectors. Let’s take a look at these vectors, which will be stored in U:

You will notice that these two vectors are very heavily weighted in the first two entries, which are of the order 10-1; the remaining entries are all of the order 10-6 or lower and are comparably irrelevant. This, in a nutshell, is the idea behind PCA.

If you found this article interesting, you can explore Hands-On GPU Programming with Python and CUDA to build real-world applications with Python 2.7, CUDA 9, and CUDA 10. Hands-On GPU Programming with Python and CUDA will help you apply GPU programming to problems related to data science and high-performance computing.

Leave a Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.