When we implement a matrix operation with NumPy on the CPU, it is simple to write and to get the result. But when we use the GPU, more work is required.
Multi-GPU matrix multiplication
CUDA
CUDA is a parallel computing platform and application programming interface (API) model created by Nvidia.[1] It allows software developers and software engineers to use a CUDA-enabled graphics processing unit (GPU) for general-purpose processing, an approach termed GPGPU (General-Purpose computing on Graphics Processing Units). The CUDA platform is a software layer that gives direct access to the GPU’s virtual instruction set and parallel computational elements for the execution of compute kernels.[2]
PyCUDA
PyCUDA is a Python wrapper for Nvidia’s CUDA parallel computation API. It makes GPU computation more convenient to implement than using CUDA directly.
In this report, I used PyCUDA to implement multi-GPU matrix multiplication.
CUDA computation
Basic concepts
To use GPU resources, the data must first be moved from CPU memory to GPU memory; only then can the GPU perform the requested operation (see the sketch after the list below).
Example of processing flow on CUDA:
- Copy data from main memory to GPU memory
- CPU instructs the GPU to process the data
- GPU executes in parallel on each core
- Copy the result from GPU memory back to main memory
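The same flow in PyCUDA, as a minimal sketch (the double_it kernel and the small test array are illustrative only and are not part of the report’s matrix code):
import numpy as np
# -- initialize the device
import pycuda.autoinit
from pycuda import compiler, gpuarray
# a tiny kernel: each thread doubles one element of x
kernel = compiler.SourceModule("""
__global__ void double_it(float *x)
{
    int i = threadIdx.x;
    x[i] = 2.0f * x[i];
}
""").get_function("double_it")
x_cpu = np.arange(8, dtype=np.float32)   # data allocated in main (CPU) memory
x_gpu = gpuarray.to_gpu(x_cpu)           # 1. copy main memory -> GPU memory
kernel(x_gpu, block=(8, 1, 1))           # 2./3. CPU launches the kernel; GPU runs it in parallel
print x_gpu.get()                        # 4. copy the result from GPU memory back to main memory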
Matrix multiplication
There are two ways to implement the multiplication. Simple multiplication follows the same procedure as multiplying matrices by hand. The other way uses tiles (block sub-matrices) to reduce memory usage and running time.
Simple multiplication
\[\begin{bmatrix} a_{11} & a_{12} & a_{13}\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33}\\ \end{bmatrix} \cdot \begin{bmatrix} b_{11} & b_{12} & b_{13}\\ b_{21} & b_{22} & b_{23}\\ b_{31} & b_{32} & b_{33}\\ \end{bmatrix} \\= \begin{bmatrix} a_{11}b_{11}+a_{12}b_{21}+a_{13}b_{31} & a_{11}b_{12}+a_{12}b_{22}+a_{13}b_{32} & a_{11}b_{13}+a_{12}b_{23}+a_{13}b_{33} \\ a_{21}b_{11}+a_{22}b_{21}+a_{23}b_{31} & a_{21}b_{12}+a_{22}b_{22}+a_{23}b_{32} & a_{21}b_{13}+a_{22}b_{23}+a_{23}b_{33} \\ a_{31}b_{11}+a_{32}b_{21}+a_{33}b_{31} & a_{31}b_{12}+a_{32}b_{22}+a_{33}b_{32} & a_{31}b_{13}+a_{32}b_{23}+a_{33}b_{33} \end{bmatrix}\]But this approach has a problem: the whole matrices are handled by a single block of threads. When the matrices get larger, the kernel cannot perform the multiplication because of the per-block limits on memory and thread count, so it only works for small matrices.
Each thread computes one element of the result, the dot product of one row of a and one column of b, and stores it in c.
import numpy as np
from pycuda import driver, compiler, gpuarray, tools
# -- initialize the device
import pycuda.autoinit
kernel_code_template = """
__global__ void MatrixMulKernel(float *a, float *b, float *c)
{
int tx = threadIdx.x;
int ty = threadIdx.y;
// Pvalue is used to store the element of the matrix
// that is computed by the thread
float Pvalue = 0;
// Each thread loads one row of M and one column of N,
// to produce one element of P.
for (int k = 0; k < %(MATRIX_SIZE)s; ++k) {
float Aelement = a[ty * %(MATRIX_SIZE)s + k];
float Belement = b[k * %(MATRIX_SIZE)s + tx];
Pvalue += Aelement * Belement;
}
// Write the matrix to device memory;
// each thread writes one element
c[ty * %(MATRIX_SIZE)s + tx] = Pvalue;
}
"""
# define the (square) matrix size
# note that we'll only use *one* block of threads here
# as a consequence this number (squared) can't exceed max_threads,
# see http://documen.tician.de/pycuda/util.html#pycuda.tools.DeviceData
# for more information on how to get this number for your device
MATRIX_SIZE = 2
# create two random square matrices
a_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
b_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
# compute reference on the CPU to verify GPU computation
c_cpu = np.dot(a_cpu, b_cpu)
# transfer host (CPU) memory to device (GPU) memory
a_gpu = gpuarray.to_gpu(a_cpu)
b_gpu = gpuarray.to_gpu(b_cpu)
# create empty gpu array for the result (C = A * B)
c_gpu = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32)
# get the kernel code from the template
# by specifying the constant MATRIX_SIZE
kernel_code = kernel_code_template % {
'MATRIX_SIZE': MATRIX_SIZE
}
# compile the kernel code
mod = compiler.SourceModule(kernel_code)
# get the kernel function from the compiled module
matrixmul = mod.get_function("MatrixMulKernel")
# call the kernel on the card
matrixmul(
# inputs
a_gpu, b_gpu,
# output
c_gpu,
# (only one) block of MATRIX_SIZE x MATRIX_SIZE threads
block = (MATRIX_SIZE, MATRIX_SIZE, 1),
)
# print the results
print "-" * 80
print "Matrix A (GPU):"
print a_gpu.get()
print "-" * 80
print "Matrix B (GPU):"
print b_gpu.get()
print "-" * 80
print "Matrix C (GPU):"
print c_gpu.get()
print "-" * 80
print "CPU-GPU difference:"
print c_cpu - c_gpu.get()
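Because this kernel runs in a single block, MATRIX_SIZE squared must stay below the device’s per-block thread limit. A quick way to check that limit through the driver API (the value depends on your GPU):
import pycuda.autoinit
import pycuda.driver as cuda
# maximum number of threads a single block may contain on this device
print pycuda.autoinit.device.get_attribute(cuda.device_attribute.MAX_THREADS_PER_BLOCK)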
Tiled multiplication
In this way, we can solve the memory problem by using block (partitioned) matrices and shared memory[^PyCUDA matrix multiplication].
The matrices are partitioned into tiles. Each thread block multiplies one tile of A by one tile of B; the partial products accumulated along a row of tiles of A and a column of tiles of B are summed into the corresponding tile of C. The tiles are staged in shared memory, so every thread accumulates its element from fast on-chip memory, which makes this version faster.
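For reference, the block form of the product used here is the standard one, where each tile of C is a sum of products of tiles of A and B:
\[ C_{ij} = \sum_{k} A_{ik} B_{kj} \]
Here A_{ik}, B_{kj}, and C_{ij} denote BLOCK_SIZE × BLOCK_SIZE sub-matrices, and the sum runs over the tiles along a row of A and a column of B.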
"""
Multiplies two square matrices together using multiple blocks and shared memory.
Each thread block is assigned a "tile" of the resulting matrix and is responsible
for generating the elements in that tile. Each thread in a block computes one element
of the tile.
"""
import numpy as np
from numpy import linalg as la
from pycuda import driver, compiler, gpuarray, tools
# -- initialize the device
import pycuda.autoinit
# define the (square) matrix size
MATRIX_SIZE = 4
def matmul(a_gpu,b_gpu,MATRIX_SIZE=MATRIX_SIZE):
kernel_code_template = """
__global__ void MatrixMulKernel(float *A, float *B, float *C)
{
const uint wA = %(MATRIX_SIZE)s;
const uint wB = %(MATRIX_SIZE)s;
// Block index
const uint bx = blockIdx.x;
const uint by = blockIdx.y;
// Thread index
const uint tx = threadIdx.x;
const uint ty = threadIdx.y;
// Index of the first sub-matrix of A processed by the block
const uint aBegin = wA * %(BLOCK_SIZE)s * by;
// Index of the last sub-matrix of A processed by the block
const uint aEnd = aBegin + wA - 1;
// Step size used to iterate through the sub-matrices of A
const uint aStep = %(BLOCK_SIZE)s;
// Index of the first sub-matrix of B processed by the block
const uint bBegin = %(BLOCK_SIZE)s * bx;
// Step size used to iterate through the sub-matrices of B
const uint bStep = %(BLOCK_SIZE)s * wB;
// The element of the block sub-matrix that is computed
// by the thread
float Csub = 0;
// Loop over all the sub-matrices of A and B required to
// compute the block sub-matrix
for (int a = aBegin, b = bBegin;
a <= aEnd;
a += aStep, b += bStep)
{
// Shared memory for the sub-matrix of A
__shared__ float As[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];
// Shared memory for the sub-matrix of B
__shared__ float Bs[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];
// Load the matrices from global memory to shared memory
// each thread loads one element of each matrix
As[ty][tx] = A[a + wA * ty + tx];
Bs[ty][tx] = B[b + wB * ty + tx];
// Synchronize to make sure the matrices are loaded
__syncthreads();
// Multiply the two matrices together;
// each thread computes one element
// of the block sub-matrix
for (int k = 0; k < %(BLOCK_SIZE)s; ++k)
Csub += As[ty][k] * Bs[k][tx];
// Synchronize to make sure that the preceding
// computation is done before loading two new
// sub-matrices of A and B in the next iteration
__syncthreads();
}
// Write the block sub-matrix to global memory;
// each thread writes one element
const uint c = wB * %(BLOCK_SIZE)s * by + %(BLOCK_SIZE)s * bx;
C[c + wB * ty + tx] = Csub;
}
"""
# define size of blocks and tiles sub-matrix
# (we assume that the block size is same as tile size)
TILE_SIZE = 2
BLOCK_SIZE = TILE_SIZE
# get the kernel code from the template
# by specifying the constants MATRIX_SIZE and BLOCK_SIZE
kernel_code = kernel_code_template % {
'MATRIX_SIZE': MATRIX_SIZE,
'BLOCK_SIZE': BLOCK_SIZE,
}
# compile the kernel code
mod = compiler.SourceModule(kernel_code)
# create empty gpu array for the result (C = A * B)
c_gpu = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32)
# get the kernel function from the compiled module
matrixmul = mod.get_function("MatrixMulKernel")
# call the kernel on the card
matrixmul(
# inputs
a_gpu, b_gpu,
# output
c_gpu,
# grid of multiple blocks
grid = (MATRIX_SIZE // TILE_SIZE, MATRIX_SIZE // TILE_SIZE),
# block of multiple threads
block = (TILE_SIZE, TILE_SIZE, 1),
)
return c_gpu
# create two random square matrices
a_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
b_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
# compute reference on the CPU to verify GPU computation
c_cpu = np.dot(a_cpu, b_cpu)
# transfer host (CPU) memory to device (GPU) memory
a_gpu = gpuarray.to_gpu(a_cpu)
b_gpu = gpuarray.to_gpu(b_cpu)
c_gpu = matmul(a_gpu,b_gpu)
# print the results
print "-" * 80
print "Matrix A (GPU):"
print a_gpu.get()
print "-" * 80
print "Matrix B (GPU):"
print b_gpu.get()
print "-" * 80
print "Matrix C (GPU):"
print c_gpu.get()
print "-" * 80
print "CPU-GPU difference:"
print c_cpu - c_gpu.get()
--------------------------------------------------------------------------------
Matrix A (GPU):
[[-0.41882259 0.41177082 -1.03836977 -0.51131785]
[-0.10568042 0.04367886 -0.42693496 -0.66633946]
[-0.46227372 -1.15817285 0.44100505 -1.34267986]
[ 0.89732301 0.09717333 0.8238706 1.85555255]]
--------------------------------------------------------------------------------
Matrix B (GPU):
[[ 1.15622389 -0.01867246 0.44202217 -1.31834888]
[-0.35549149 -0.23209515 -0.62931013 -1.11239004]
[ 1.45448041 -0.83201063 1.78740346 0.30777019]
[-1.01214206 -1.33131325 0.65836674 -0.5433535 ]]
--------------------------------------------------------------------------------
Matrix C (GPU):
[[-1.62339604 1.4569093 -2.63688087 0.05235163]
[-0.08425605 1.23415661 -1.27600145 0.32139575]
[ 1.87764466 1.69804466 0.4287928 2.76305604]
[ 0.32318279 -3.19509935 3.02970767 -2.04573774]]
--------------------------------------------------------------------------------
CPU-GPU difference:
[[ 3.57627869e-07 1.19209290e-07 0.00000000e+00 -3.35276127e-08]
[ 5.96046448e-08 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 1.49011612e-07 0.00000000e+00]
[ 2.68220901e-07 0.00000000e+00 0.00000000e+00 0.00000000e+00]]
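The difference matrix above is already tiny; a tolerance-based check is a compact alternative to eyeballing it (a small sketch reusing the la alias imported in the script above):
# relative error between the CPU and GPU results, plus an absolute-tolerance check
print la.norm(c_cpu - c_gpu.get()) / la.norm(c_cpu)
print np.allclose(c_cpu, c_gpu.get(), atol=1e-5)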
Multithreading for GPUs
GPUThread gets its thread behaviour by inheriting from threading.Thread, so each GPU can be driven from its own thread.
\[Thread0: \overset{\mathtt{Allocated}}{ \fbox{CPU Array} } \overset{\mathtt{memcpy}}{ \longrightarrow } \overset{\mathtt{}}{ \fbox{GPU Array} } \overset{\mathtt{GPU Compiler}}{\longrightarrow} \overset{\mathtt{}}{\fbox{GPU Binary} } \overset{\mathtt{GPU}}{\longrightarrow} \overset{\mathtt{result}}{\fbox{GPU Array} } \overset{\mathtt{get()}}{\longrightarrow} {\fbox{CPU Array}} \\ Thread1: \overset{\mathtt{Allocated}}{ \fbox{CPU Array} } \overset{\mathtt{memcpy}}{ \longrightarrow } \overset{\mathtt{}}{ \fbox{GPU Array} } \overset{\mathtt{GPU Compiler}}{\longrightarrow} \overset{\mathtt{}}{\fbox{GPU Binary} } \overset{\mathtt{GPU}}{\longrightarrow} \overset{\mathtt{result}}{\fbox{GPU Array} } \overset{\mathtt{get()}}{\longrightarrow} {\fbox{CPU Array}}\]
import pycuda
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import threading
import numpy as np
# Inheritance for using thread
class GPUThread(threading.Thread):
def __init__(self, number):
threading.Thread.__init__(self)
self.number = number
def run(self):
self.dev = cuda.Device(self.number)
self.ctx = self.dev.make_context()
print "successful exit from thread %d \n" % self.number
self.ctx.pop()
# delete device,context for saving resources.
del self.ctx
#initialize CUDA driver
cuda.init()
num = cuda.Device.count()
gpu_thread_list = []
for i in range(num):
gpu_thread = GPUThread(i)
gpu_thread.start()
gpu_thread_list.append(gpu_thread)
successful exit from thread 0
successful exit from thread 1
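If the main program needs to wait for all GPU threads to finish before continuing (for example, to collect their results), the usual join pattern applies; this is a small addition that is not in the original script:
# wait for every GPU thread to complete
for gpu_thread in gpu_thread_list:
    gpu_thread.join()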
Performance comparison
By combining the two pieces above, the tiled matrix multiplication function and the GPU threads, we can implement multi-GPU matrix multiplication.
I used 4000 * 4000 square matrices, one per available GPU, with a tile size of 20.
import pycuda
import pycuda.driver as cuda
from pycuda import compiler, gpuarray, tools
import threading
import numpy as np
import time
# -- initialize the device
import pycuda.autoinit
MATRIX_SIZE = 4000
# Inheritance for using thread
class GPUThread(threading.Thread):
def __init__(self, number, arr):
threading.Thread.__init__(self)
self.number = number
self.arr = arr
def run(self):
self.dev = cuda.Device(self.number)
self.ctx = self.dev.make_context()
# initialize gpu array and copy from cpu to gpu.
self.array_gpu = gpuarray.to_gpu(self.arr)
# Get lock to synchronize threads
threadLock.acquire()
ctic = time.time()
np.dot(self.arr,self.arr)
ctoc = float(time.time()-ctic)
gtic = time.time()
output = matmul(self.array_gpu,self.array_gpu)
gtoc = float(time.time()-gtic)
# Free lock to release next thread
print "CPU:" , ctoc, "GPU:" , gtoc,"GPU-CPU:",gtoc-ctoc
threadLock.release()
print "successful exit from thread %d \n" % self.number
self.ctx.pop()
# delete device,context for saving resources.
del self.ctx
del self.array_gpu
def matmul(a_gpu,b_gpu,MATRIX_SIZE=MATRIX_SIZE):
kernel_code_template = """
__global__ void MatrixMulKernel(float *A, float *B, float *C)
{
const uint wA = %(MATRIX_SIZE)s;
const uint wB = %(MATRIX_SIZE)s;
// Block index
const uint bx = blockIdx.x;
const uint by = blockIdx.y;
// Thread index
const uint tx = threadIdx.x;
const uint ty = threadIdx.y;
// Index of the first sub-matrix of A processed by the block
const uint aBegin = wA * %(BLOCK_SIZE)s * by;
// Index of the last sub-matrix of A processed by the block
const uint aEnd = aBegin + wA - 1;
// Step size used to iterate through the sub-matrices of A
const uint aStep = %(BLOCK_SIZE)s;
// Index of the first sub-matrix of B processed by the block
const uint bBegin = %(BLOCK_SIZE)s * bx;
// Step size used to iterate through the sub-matrices of B
const uint bStep = %(BLOCK_SIZE)s * wB;
// The element of the block sub-matrix that is computed
// by the thread
float Csub = 0;
// Loop over all the sub-matrices of A and B required to
// compute the block sub-matrix
for (int a = aBegin, b = bBegin;
a <= aEnd;
a += aStep, b += bStep)
{
// Shared memory for the sub-matrix of A
__shared__ float As[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];
// Shared memory for the sub-matrix of B
__shared__ float Bs[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];
// Load the matrices from global memory to shared memory
// each thread loads one element of each matrix
As[ty][tx] = A[a + wA * ty + tx];
Bs[ty][tx] = B[b + wB * ty + tx];
// Synchronize to make sure the matrices are loaded
__syncthreads();
// Multiply the two matrices together;
// each thread computes one element
// of the block sub-matrix
for (int k = 0; k < %(BLOCK_SIZE)s; ++k)
Csub += As[ty][k] * Bs[k][tx];
// Synchronize to make sure that the preceding
// computation is done before loading two new
// sub-matrices of A and B in the next iteration
__syncthreads();
}
// Write the block sub-matrix to global memory;
// each thread writes one element
const uint c = wB * %(BLOCK_SIZE)s * by + %(BLOCK_SIZE)s * bx;
C[c + wB * ty + tx] = Csub;
}
"""
# define size of blocks and tiles sub-matrix
# (we assume that the block size is same as tile size)
TILE_SIZE = 20
BLOCK_SIZE = TILE_SIZE
# get the kernel code from the template
# by specifying the constants MATRIX_SIZE and BLOCK_SIZE
kernel_code = kernel_code_template % {
'MATRIX_SIZE': MATRIX_SIZE,
'BLOCK_SIZE': BLOCK_SIZE,
}
# compile the kernel code
mod = compiler.SourceModule(kernel_code)
# create empty gpu array for the result (C = A * B)
c_gpu = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32)
# get the kernel function from the compiled module
matrixmul = mod.get_function("MatrixMulKernel")
# call the kernel on the card
matrixmul(
# inputs
a_gpu, b_gpu,
# output
c_gpu,
# grid of multiple blocks
grid = (MATRIX_SIZE // TILE_SIZE, MATRIX_SIZE // TILE_SIZE),
# block of multiple threads
block = (TILE_SIZE, TILE_SIZE, 1),
)
return c_gpu
num = cuda.Device.count()
gpu_thread_list = []
some_list = []
for i in range(num):
some_list.append(np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32))
a_gpu = gpuarray.to_gpu(some_list[0])
b_gpu = gpuarray.to_gpu(some_list[1])
a = matmul(a_gpu,b_gpu)
print("MATRIX SIZE: ", MATRIX_SIZE)
print("Difference between GPU result and CPU result")
print(np.dot(some_list[0],some_list[1])-a.get())
threadLock = threading.Lock()
for i,arr in enumerate(some_list):
gpu_thread = GPUThread(i,arr)
gpu_thread.start()
gpu_thread_list.append(gpu_thread)
Result
Difference between GPU result and CPU result
[[ -5.34057617e-05 -1.90734863e-05 -4.76837158e-05 ..., -1.90734863e-06
8.39233398e-05 -4.95910645e-05]
[ 1.37329102e-04 -1.14440918e-05 1.22070312e-04 ..., -2.28881836e-05
-1.52587891e-05 1.14440918e-05]
[ 4.57763672e-05 -4.57763672e-05 6.48498535e-05 ..., 3.05175781e-05
6.86645508e-05 -4.19616699e-05]
...,
[ -6.77108765e-05 1.37329102e-04 1.83105469e-04 ..., -9.53674316e-07
4.27246094e-04 -1.14440918e-04]
[ -5.34057617e-05 0.00000000e+00 6.86645508e-05 ..., 9.53674316e-06
2.38418579e-05 -9.91821289e-05]
[ 2.28881836e-05 -4.57763672e-05 -4.57763672e-05 ..., 1.90734863e-05
1.10626221e-04 3.52859497e-05]]
CPU: 1.49331712723 GPU: 0.394819974899 GPU-CPU: -1.09849715233
successful exit from thread 0
CPU: 1.60758495331 GPU: 0.390916109085 GPU-CPU: -1.21666884422
successful exit from thread 1
Conclusion
When the matrix size is small, the CPU is faster, but the bigger the matrices become, the more the GPU pulls ahead. That is because the GPU needs some preparation before the computation, such as copying memory and setting it up, while its raw computing power is much greater than the CPU’s. In my test the GPU was roughly four times faster (CPU about 1.5–1.6 s versus GPU about 0.39 s in the timings above). The exact factor will change with the parameters (MATRIX_SIZE, TILE_SIZE, etc.) and with the hardware.
GPGPU programming takes more effort to write and to configure, but the computing power is remarkable compared with a CPU implementation.
GPUs offer powerful computing, and multithreading lets several GPUs work in parallel.
There are also libraries that make this easier for us, such as TensorFlow and Theano.