
When we implement a matrix operation with NumPy on the CPU, it is simple to write and get the result. But when we use the GPU, we have to do more work.

Multi-GPU matrix multiplication

CUDA

CUDA[1] is a parallel computing platform and application programming interface (API) model created by Nvidia. It allows software developers and software engineers to use a CUDA-enabled graphics processing unit (GPU) for general-purpose processing, an approach termed GPGPU (General-Purpose computing on Graphics Processing Units). The CUDA platform is a software layer that gives direct access to the GPU’s virtual instruction set and parallel computational elements, for the execution of compute kernels.

PyCUDA

PyCUDA[2] gives access to Nvidia’s CUDA parallel computation API from Python. It is more convenient for implementing GPU computations than writing raw CUDA.

In this report, I used PyCUDA to implement multi-GPU matrix multiplication.

CUDA computation

Basic concepts

To use GPU resources, the data must first move from CPU memory to GPU memory; only then can the GPU carry out the requested operation.

Example of processing flow on CUDA

  1. Copy data from main memory to GPU memory
  2. CPU instructs the process to GPU
  3. GPU executes parallel in each core
  4. Copy the result from GPU memory to main memory
\[\overset{\mathtt{Allocated}}{ \fbox{CPU Array} } \overset{\mathtt{memcpy}}{ \longrightarrow } \overset{\mathtt{}}{ \fbox{GPU Array} } \overset{\mathtt{GPU Compiler}}{\longrightarrow} \overset{\mathtt{}}{\fbox{GPU Binary} } \overset{\mathtt{GPU}}{\longrightarrow} \overset{\mathtt{result}}{\fbox{GPU Array} } \overset{\mathtt{get()}}{\longrightarrow} {\fbox{CPU Array}}\]
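As a minimal sketch of this flow (assuming PyCUDA is installed and a CUDA device is available), the gpuarray module takes care of the copies in steps 1 and 4, while the GPU runs the operation in steps 2 and 3:

import numpy as np
import pycuda.autoinit            # creates a context on the default device
from pycuda import gpuarray

# 1. allocate on the CPU and copy from main memory to GPU memory
a_cpu = np.random.randn(4, 4).astype(np.float32)
a_gpu = gpuarray.to_gpu(a_cpu)

# 2./3. the CPU issues the operation and the GPU executes it in parallel
#       (here an element-wise doubling provided by gpuarray)
b_gpu = 2 * a_gpu

# 4. copy the result back from GPU memory to main memory
b_cpu = b_gpu.get()
print(np.allclose(b_cpu, 2 * a_cpu))   # True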

Matrix multiplication

There are two ways to implement the multiplication. Simple multiplication works the same way as multiplying matrices by hand. The other way uses tiles, which cuts down memory accesses and lets the kernel handle larger matrices.

Simple multiplication

\[\begin{bmatrix} a_{11} & a_{12} & a_{13}\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33}\\ \end{bmatrix} \cdot \begin{bmatrix} b_{11} & b_{12} & b_{13}\\ b_{21} & b_{22} & b_{23}\\ b_{31} & b_{32} & b_{33}\\ \end{bmatrix} \\= \begin{bmatrix} a_{11}b_{11}+a_{12}b_{21}+a_{13}b_{31} & a_{11}b_{12}+a_{12}b_{22}+a_{13}b_{32} & a_{11}b_{13}+a_{12}b_{23}+a_{13}b_{33} \\ a_{21}b_{11}+a_{22}b_{21}+a_{23}b_{31} & a_{21}b_{12}+a_{22}b_{22}+a_{23}b_{32} & a_{21}b_{13}+a_{22}b_{23}+a_{23}b_{33} \\ a_{31}b_{11}+a_{32}b_{21}+a_{33}b_{31} & a_{31}b_{12}+a_{32}b_{22}+a_{33}b_{32} & a_{31}b_{13}+a_{32}b_{23}+a_{33}b_{33} \end{bmatrix}\]

But this approach has a problem: the whole matrix is handled by a single block of threads. When the matrix gets larger, the kernel cannot run the multiplication because it exceeds the per-block limits, so it only works for small matrices.

Each thread computes one result element, the product of one row of a and one column of b, and saves it into c.
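As a plain-Python sketch of that per-thread work (no GPU involved, just to show the inner product that the kernel below expresses with flattened, row-major indexing):

import numpy as np

N = 2
a = np.random.randn(N, N).astype(np.float32)
b = np.random.randn(N, N).astype(np.float32)
c = np.zeros((N, N), dtype=np.float32)

# the thread at (ty, tx) computes one inner product: row ty of a times column tx of b
for ty in range(N):
    for tx in range(N):
        c[ty, tx] = sum(a[ty, k] * b[k, tx] for k in range(N))

print(np.allclose(c, a.dot(b)))   # True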

import numpy as np
from pycuda import driver, compiler, gpuarray, tools

# -- initialize the device
import pycuda.autoinit

kernel_code_template = """
__global__ void MatrixMulKernel(float *a, float *b, float *c)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;

    // Each thread loads one row of M and one column of N, 
    //   to produce one element of P.
    for (int k = 0; k < %(MATRIX_SIZE)s; ++k) {
        float Aelement = a[ty * %(MATRIX_SIZE)s + k];
        float Belement = b[k * %(MATRIX_SIZE)s + tx];
        Pvalue += Aelement * Belement;
    }

    // Write the matrix to device memory;
    // each thread writes one element
    c[ty * %(MATRIX_SIZE)s + tx] = Pvalue;
}
"""

# define the (square) matrix size
#  note that we'll only use *one* block of threads here
#  as a consequence this number (squared) can't exceed max_threads,
#  see http://documen.tician.de/pycuda/util.html#pycuda.tools.DeviceData
#  for more information on how to get this number for your device
MATRIX_SIZE = 2

# create two random square matrices
a_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
b_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)

# compute reference on the CPU to verify GPU computation
c_cpu = np.dot(a_cpu, b_cpu)

# transfer host (CPU) memory to device (GPU) memory 
a_gpu = gpuarray.to_gpu(a_cpu) 
b_gpu = gpuarray.to_gpu(b_cpu)

# create empty gpu array for the result (C = A * B)
c_gpu = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32)

# get the kernel code from the template 
# by specifying the constant MATRIX_SIZE
kernel_code = kernel_code_template % {
    'MATRIX_SIZE': MATRIX_SIZE 
    }

# compile the kernel code 
mod = compiler.SourceModule(kernel_code)

# get the kernel function from the compiled module
matrixmul = mod.get_function("MatrixMulKernel")

# call the kernel on the card
matrixmul(
    # inputs
    a_gpu, b_gpu, 
    # output
    c_gpu, 
    # (only one) block of MATRIX_SIZE x MATRIX_SIZE threads
    block = (MATRIX_SIZE, MATRIX_SIZE, 1),
    )

# print the results
print "-" * 80
print "Matrix A (GPU):"
print a_gpu.get()

print "-" * 80
print "Matrix B (GPU):"
print b_gpu.get()

print "-" * 80
print "Matrix C (GPU):"
print c_gpu.get()

print "-" * 80
print "CPU-GPU difference:"
print c_cpu - c_gpu.get()
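Since the kernel above launches only one block of MATRIX_SIZE x MATRIX_SIZE threads, the largest usable matrix is bounded by the device's maximum number of threads per block. A small sketch of how that bound could be queried with pycuda.tools.DeviceData, the utility mentioned in the comments above:

import math
import pycuda.autoinit
from pycuda import tools

# the single-block kernel needs MATRIX_SIZE * MATRIX_SIZE <= max_threads
dev_data = tools.DeviceData()
max_size = int(math.sqrt(dev_data.max_threads))
print("largest MATRIX_SIZE for the single-block kernel: %d" % max_size)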

Tiled multiplication

In this way, we can solve the size problem by using block (partitioned) matrices and shared memory. Here is the idea written out as an equation.

[Figure: sub-matrix (tiled) multiplication]

It uses a partitioned matrix:

\[\begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14}\\ a_{21} & a_{22} & a_{23} & a_{24}\\ a_{31} & a_{32} & a_{33} & a_{34}\\ a_{41} & a_{42} & a_{43} & a_{44}\\ \end{bmatrix} \cdot \begin{bmatrix} b_{11} & b_{12} & b_{13} & b_{14}\\ b_{21} & b_{22} & b_{23} & b_{24}\\ b_{31} & b_{32} & b_{33} & b_{34}\\ b_{41} & b_{42} & b_{43} & b_{44}\\ \end{bmatrix} \\= \left[\begin{array}{cc|cc} a_{11} & a_{12} & a_{13} & a_{14}\\ a_{21} & a_{22} & a_{23} & a_{24}\\ \hline a_{31} & a_{32} & a_{33} & a_{34}\\ a_{41} & a_{42} & a_{43} & a_{44}\\ \end{array}\right] \cdot \left[\begin{array}{cc|cc} b_{11} & b_{12} & b_{13} & b_{14}\\ b_{21} & b_{22} & b_{23} & b_{24}\\ \hline b_{31} & b_{32} & b_{33} & b_{34}\\ b_{41} & b_{42} & b_{43} & b_{44}\\ \end{array}\right] \\= \begin{bmatrix} A_{11} & A_{12}\\ A_{21} & A_{22}\\ \end{bmatrix} \cdot \begin{bmatrix} B_{11} & B_{12}\\ B_{21} & B_{22}\\ \end{bmatrix} \\= \begin{bmatrix} \mathrm{col}_1 A & \mathrm{col}_2 A \end{bmatrix} \begin{bmatrix} \mathrm{row}_1 B \\ \mathrm{row}_2 B \end{bmatrix} \\= \mathrm{col}_1 A \cdot \mathrm{row}_1 B + \mathrm{col}_2 A \cdot \mathrm{row}_2 B \\= \begin{bmatrix} a_{11}b_{11}+a_{12}b_{21}+a_{13}b_{31}+a_{14}b_{41} & \cdots & a_{11}b_{14}+a_{12}b_{24}+a_{13}b_{34}+a_{14}b_{44} \\ \vdots & \ddots & \vdots \\ a_{41}b_{11}+a_{42}b_{21}+a_{43}b_{31}+a_{44}b_{41} & \cdots & a_{41}b_{14}+a_{42}b_{24}+a_{43}b_{34}+a_{44}b_{44} \end{bmatrix}\]

The block matrices are multiplied pairwise, one block of A with one block of B, and the sums over the block columns and block rows are stored into the result c.
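The block identity above can be checked directly in NumPy; here is a small sketch with random 4 x 4 matrices and 2 x 2 blocks:

import numpy as np

A = np.random.randn(4, 4)
B = np.random.randn(4, 4)

# 2 x 2 blocks: A11 is the top-left block of A, B21 the bottom-left block of B, etc.
A11, A12, A21, A22 = A[:2, :2], A[:2, 2:], A[2:, :2], A[2:, 2:]
B11, B12, B21, B22 = B[:2, :2], B[:2, 2:], B[2:, :2], B[2:, 2:]

top    = np.hstack([A11.dot(B11) + A12.dot(B21), A11.dot(B12) + A12.dot(B22)])
bottom = np.hstack([A21.dot(B11) + A22.dot(B21), A21.dot(B12) + A22.dot(B22)])
C = np.vstack([top, bottom])

print(np.allclose(C, A.dot(B)))   # True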

And the shared memory holds the current tiles of A and B, so each thread accumulates its partial products from fast on-chip memory. That is what makes this version faster.
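A rough way to see the benefit (a counting argument, not an exact performance model): in the simple kernel every thread reads 2N values from global memory, while in the tiled kernel each thread loads only two values per tile and there are N / TILE_SIZE tiles, so

\[ \text{global loads per thread:}\quad 2N \;\longrightarrow\; 2\,\frac{N}{\mathtt{TILE\_SIZE}}, \qquad \text{e.g. } N = 4,\ \mathtt{TILE\_SIZE} = 2:\; 8 \longrightarrow 4. \]

The remaining reads come from shared memory, which is much faster than global memory.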

"""
Multiples two square matrices together using multiple blocks and shared memory.
Each thread block is assigned a "tile" of the resulting matrix and is responsible
for generating the elements in that tile.  Each thread in a block computes one element
of the tile.
"""

import numpy as np
from numpy import linalg as la
from pycuda import driver, compiler, gpuarray, tools

# -- initialize the device
import pycuda.autoinit


# define the (square) matrix size
MATRIX_SIZE = 4


def matmul(a_gpu,b_gpu,MATRIX_SIZE=MATRIX_SIZE):
    kernel_code_template = """
    __global__ void MatrixMulKernel(float *A, float *B, float *C)
    {

      const uint wA = %(MATRIX_SIZE)s;
      const uint wB = %(MATRIX_SIZE)s;

      // Block index
      const uint bx = blockIdx.x;
      const uint by = blockIdx.y;

      // Thread index
      const uint tx = threadIdx.x;
      const uint ty = threadIdx.y;

      // Index of the first sub-matrix of A processed by the block
      const uint aBegin = wA * %(BLOCK_SIZE)s * by;
      // Index of the last sub-matrix of A processed by the block
      const uint aEnd = aBegin + wA - 1;
      // Step size used to iterate through the sub-matrices of A
      const uint aStep = %(BLOCK_SIZE)s;

      // Index of the first sub-matrix of B processed by the block
      const uint bBegin = %(BLOCK_SIZE)s * bx;
      // Step size used to iterate through the sub-matrices of B
      const uint bStep = %(BLOCK_SIZE)s * wB;

      // The element of the block sub-matrix that is computed
      // by the thread
      float Csub = 0;
      // Loop over all the sub-matrices of A and B required to
      // compute the block sub-matrix
      for (int a = aBegin, b = bBegin;
           a <= aEnd;
           a += aStep, b += bStep)
        {
          // Shared memory for the sub-matrix of A
          __shared__ float As[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];
          // Shared memory for the sub-matrix of B
          __shared__ float Bs[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];

          // Load the matrices from global memory to shared memory
          // each thread loads one element of each matrix
          As[ty][tx] = A[a + wA * ty + tx];
          Bs[ty][tx] = B[b + wB * ty + tx];
          // Synchronize to make sure the matrices are loaded
          __syncthreads();

          // Multiply the two matrices together;
          // each thread computes one element
          // of the block sub-matrix
          for (int k = 0; k < %(BLOCK_SIZE)s; ++k)
            Csub += As[ty][k] * Bs[k][tx];

          // Synchronize to make sure that the preceding
          // computation is done before loading two new
          // sub-matrices of A and B in the next iteration
          __syncthreads();
        }

      // Write the block sub-matrix to global memory;
      // each thread writes one element
      const uint c = wB * %(BLOCK_SIZE)s * by + %(BLOCK_SIZE)s * bx;
      C[c + wB * ty + tx] = Csub;
    }
    """

    # define size of blocks and tiles sub-matrix
    # (we assume that the block size is same as tile size)
    TILE_SIZE = 2
    BLOCK_SIZE = TILE_SIZE

    # get the kernel code from the template
    # by specifying the constants MATRIX_SIZE and BLOCK_SIZE
    kernel_code = kernel_code_template % {
        'MATRIX_SIZE': MATRIX_SIZE,
        'BLOCK_SIZE': BLOCK_SIZE,
        }

    # compile the kernel code
    mod = compiler.SourceModule(kernel_code)
    # create empty gpu array for the result (C = A * B)
    c_gpu = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32)

    # get the kernel function from the compiled module
    matrixmul = mod.get_function("MatrixMulKernel")

    # call the kernel on the card
    matrixmul(
        # inputs
        a_gpu, b_gpu,
        # output
        c_gpu,
        # grid of multiple blocks
        grid = (MATRIX_SIZE // TILE_SIZE, MATRIX_SIZE // TILE_SIZE),
        # block of multiple threads
        block = (TILE_SIZE, TILE_SIZE, 1),
        )

    return c_gpu

# create two random square matrices
a_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
b_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)

# compute reference on the CPU to verify GPU computation
c_cpu = np.dot(a_cpu, b_cpu)

# transfer host (CPU) memory to device (GPU) memory
a_gpu = gpuarray.to_gpu(a_cpu)
b_gpu = gpuarray.to_gpu(b_cpu)
c_gpu = matmul(a_gpu,b_gpu)


# print the results
print "-" * 80
print "Matrix A (GPU):"
print a_gpu.get()

print "-" * 80
print "Matrix B (GPU):"
print b_gpu.get()

print "-" * 80
print "Matrix C (GPU):"
print c_gpu.get()

print "-" * 80
print "CPU-GPU difference:"
print c_cpu - c_gpu.get()
--------------------------------------------------------------------------------
Matrix A (GPU):
[[-0.41882259  0.41177082 -1.03836977 -0.51131785]
 [-0.10568042  0.04367886 -0.42693496 -0.66633946]
 [-0.46227372 -1.15817285  0.44100505 -1.34267986]
 [ 0.89732301  0.09717333  0.8238706   1.85555255]]
--------------------------------------------------------------------------------
Matrix B (GPU):
[[ 1.15622389 -0.01867246  0.44202217 -1.31834888]
 [-0.35549149 -0.23209515 -0.62931013 -1.11239004]
 [ 1.45448041 -0.83201063  1.78740346  0.30777019]
 [-1.01214206 -1.33131325  0.65836674 -0.5433535 ]]
--------------------------------------------------------------------------------
Matrix C (GPU):
[[-1.62339604  1.4569093  -2.63688087  0.05235163]
 [-0.08425605  1.23415661 -1.27600145  0.32139575]
 [ 1.87764466  1.69804466  0.4287928   2.76305604]
 [ 0.32318279 -3.19509935  3.02970767 -2.04573774]]
--------------------------------------------------------------------------------
CPU-GPU difference:
[[  3.57627869e-07   1.19209290e-07   0.00000000e+00  -3.35276127e-08]
 [  5.96046448e-08   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   1.49011612e-07   0.00000000e+00]
 [  2.68220901e-07   0.00000000e+00   0.00000000e+00   0.00000000e+00]]
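One caveat before moving on: the grid above is computed as MATRIX_SIZE // TILE_SIZE, so the code assumes MATRIX_SIZE is a multiple of TILE_SIZE. A hedged sketch of how the grid could be sized otherwise (the kernel itself would also need a bounds check, which is not shown here):

def grid_dims(matrix_size, tile_size):
    # ceiling division so that every element is covered by some block
    blocks = (matrix_size + tile_size - 1) // tile_size
    return (blocks, blocks)

print(grid_dims(4000, 20))   # (200, 200)
print(grid_dims(4001, 20))   # (201, 201), only valid with a bounds check in the kernel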

Multithreading for GPUs

GPUThread gets the thread functions by inheriting from threading.Thread.

\[Thread0: \overset{\mathtt{Allocated}}{ \fbox{CPU Array} } \overset{\mathtt{memcpy}}{ \longrightarrow } \overset{\mathtt{}}{ \fbox{GPU Array} } \overset{\mathtt{GPU Compiler}}{\longrightarrow} \overset{\mathtt{}}{\fbox{GPU Binary} } \overset{\mathtt{GPU}}{\longrightarrow} \overset{\mathtt{result}}{\fbox{GPU Array} } \overset{\mathtt{get()}}{\longrightarrow} {\fbox{CPU Array}} \\ Thread1: \overset{\mathtt{Allocated}}{ \fbox{CPU Array} } \overset{\mathtt{memcpy}}{ \longrightarrow } \overset{\mathtt{}}{ \fbox{GPU Array} } \overset{\mathtt{GPU Compiler}}{\longrightarrow} \overset{\mathtt{}}{\fbox{GPU Binary} } \overset{\mathtt{GPU}}{\longrightarrow} \overset{\mathtt{result}}{\fbox{GPU Array} } \overset{\mathtt{get()}}{\longrightarrow} {\fbox{CPU Array}}\]
import pycuda
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import threading
import numpy as np

# Inheritance for using thread
class GPUThread(threading.Thread):
    def __init__(self, number):
        threading.Thread.__init__(self)

        self.number = number

    def run(self):
        self.dev = cuda.Device(self.number)
        self.ctx = self.dev.make_context()

        print "successful exit from thread %d \n" % self.number
        self.ctx.pop()

        # delete device,context for saving resources.
        del self.ctx

#initialize CUDA driver
cuda.init()
        
num = cuda.Device.count()
gpu_thread_list = []


for i in range(num):
    gpu_thread = GPUThread(i)
    gpu_thread.start()
    gpu_thread_list.append(gpu_thread)
successful exit from thread 0

successful exit from thread 1
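The main thread does not wait for the GPU threads above. If the rest of the program needs them to be finished, a join() loop over the gpu_thread_list that was just built blocks until every thread has exited (a small sketch):

# wait for all GPU threads to finish before the main program continues
for gpu_thread in gpu_thread_list:
    gpu_thread.join()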

Performance comparison

By combining the two pieces, the matrix multiplication function and the threads, we can implement multi-GPU matrix multiplication.

I used 4000 × 4000 square matrices for the multiplication (one random matrix per GPU thread), and the tile size is 20.

import pycuda
import pycuda.driver as cuda
from pycuda import compiler, gpuarray, tools
import threading
import numpy as np

import time
# -- initialize the device
import pycuda.autoinit

MATRIX_SIZE = 4000

# Inheritance for using thread
class GPUThread(threading.Thread):
    def __init__(self, number, arr):
        threading.Thread.__init__(self)
        self.number = number
        self.arr = arr

    def run(self):
        self.dev = cuda.Device(self.number)
        self.ctx = self.dev.make_context()

        # initialize gpu array and copy from cpu to gpu.
        self.array_gpu = gpuarray.to_gpu(self.arr)
         # Get lock to synchronize threads
        threadLock.acquire()

        ctic = time.time()
        np.dot(self.arr,self.arr)
        ctoc = float(time.time()-ctic)

        gtic = time.time()
        output = matmul(self.array_gpu,self.array_gpu)
        gtoc = float(time.time()-gtic)
        # Free lock to release next thread

        print "CPU:" , ctoc, "GPU:" , gtoc,"GPU-CPU:",gtoc-ctoc
        threadLock.release()
        print "successful exit from thread %d \n" % self.number
        self.ctx.pop()

        # delete device,context for saving resources.
        del self.ctx
        del self.array_gpu


def matmul(a_gpu,b_gpu,MATRIX_SIZE=MATRIX_SIZE):
    kernel_code_template = """
    __global__ void MatrixMulKernel(float *A, float *B, float *C)
    {

      const uint wA = %(MATRIX_SIZE)s;
      const uint wB = %(MATRIX_SIZE)s;

      // Block index
      const uint bx = blockIdx.x;
      const uint by = blockIdx.y;

      // Thread index
      const uint tx = threadIdx.x;
      const uint ty = threadIdx.y;

      // Index of the first sub-matrix of A processed by the block
      const uint aBegin = wA * %(BLOCK_SIZE)s * by;
      // Index of the last sub-matrix of A processed by the block
      const uint aEnd = aBegin + wA - 1;
      // Step size used to iterate through the sub-matrices of A
      const uint aStep = %(BLOCK_SIZE)s;

      // Index of the first sub-matrix of B processed by the block
      const uint bBegin = %(BLOCK_SIZE)s * bx;
      // Step size used to iterate through the sub-matrices of B
      const uint bStep = %(BLOCK_SIZE)s * wB;

      // The element of the block sub-matrix that is computed
      // by the thread
      float Csub = 0;
      // Loop over all the sub-matrices of A and B required to
      // compute the block sub-matrix
      for (int a = aBegin, b = bBegin;
           a <= aEnd;
           a += aStep, b += bStep)
        {
          // Shared memory for the sub-matrix of A
          __shared__ float As[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];
          // Shared memory for the sub-matrix of B
          __shared__ float Bs[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];

          // Load the matrices from global memory to shared memory
          // each thread loads one element of each matrix
          As[ty][tx] = A[a + wA * ty + tx];
          Bs[ty][tx] = B[b + wB * ty + tx];
          // Synchronize to make sure the matrices are loaded
          __syncthreads();

          // Multiply the two matrices together;
          // each thread computes one element
          // of the block sub-matrix
          for (int k = 0; k < %(BLOCK_SIZE)s; ++k)
            Csub += As[ty][k] * Bs[k][tx];

          // Synchronize to make sure that the preceding
          // computation is done before loading two new
          // sub-matrices of A and B in the next iteration
          __syncthreads();
        }

      // Write the block sub-matrix to global memory;
      // each thread writes one element
      const uint c = wB * %(BLOCK_SIZE)s * by + %(BLOCK_SIZE)s * bx;
      C[c + wB * ty + tx] = Csub;
    }
    """

    # define size of blocks and tiles sub-matrix
    # (we assume that the block size is same as tile size)
    TILE_SIZE = 20
    BLOCK_SIZE = TILE_SIZE

    # get the kernel code from the template
    # by specifying the constants MATRIX_SIZE and BLOCK_SIZE
    kernel_code = kernel_code_template % {
        'MATRIX_SIZE': MATRIX_SIZE,
        'BLOCK_SIZE': BLOCK_SIZE,
        }

    # compile the kernel code
    mod = compiler.SourceModule(kernel_code)
    # create empty gpu array for the result (C = A * B)
    c_gpu = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32)

    # get the kernel function from the compiled module
    matrixmul = mod.get_function("MatrixMulKernel")

    # call the kernel on the card
    matrixmul(
        # inputs
        a_gpu, b_gpu,
        # output
        c_gpu,
        # grid of multiple blocks
        grid = (MATRIX_SIZE // TILE_SIZE, MATRIX_SIZE // TILE_SIZE),
        # block of multiple threads
        block = (TILE_SIZE, TILE_SIZE, 1),
        )

    return c_gpu


num = cuda.Device.count()
gpu_thread_list = []

some_list = []
for i in range(num):
    some_list.append(np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32))
a_gpu = gpuarray.to_gpu(some_list[0])
b_gpu = gpuarray.to_gpu(some_list[1])

a = matmul(a_gpu,b_gpu)
print("MATRIX SIZE: ", MATRIX_SIZE)
print("Difference between GPU result and CPU result")
print(np.dot(some_list[0],some_list[1])-a.get())

threadLock = threading.Lock()
for i,arr in enumerate(some_list):
    gpu_thread = GPUThread(i,arr)
    gpu_thread.start()
    gpu_thread_list.append(gpu_thread)
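One thing to keep in mind when reading the GPU timings below: kernel launches in PyCUDA are asynchronous, so the gtoc measured in run() may stop before the kernel has actually finished. A sketch of a stricter measurement (reusing matmul, a_gpu and b_gpu defined above) would synchronize the context, or call c_gpu.get(), before stopping the clock:

import time
import pycuda.driver as cuda

gtic = time.time()
output = matmul(a_gpu, b_gpu)
cuda.Context.synchronize()   # wait until the kernel has really finished
gtoc = time.time() - gtic
print("GPU (synchronized): %f s" % gtoc)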

Result

Difference between GPU result and CPU result
[[ -5.34057617e-05  -1.90734863e-05  -4.76837158e-05 ...,  -1.90734863e-06
    8.39233398e-05  -4.95910645e-05]
 [  1.37329102e-04  -1.14440918e-05   1.22070312e-04 ...,  -2.28881836e-05
   -1.52587891e-05   1.14440918e-05]
 [  4.57763672e-05  -4.57763672e-05   6.48498535e-05 ...,   3.05175781e-05
    6.86645508e-05  -4.19616699e-05]
 ...,
 [ -6.77108765e-05   1.37329102e-04   1.83105469e-04 ...,  -9.53674316e-07
    4.27246094e-04  -1.14440918e-04]
 [ -5.34057617e-05   0.00000000e+00   6.86645508e-05 ...,   9.53674316e-06
    2.38418579e-05  -9.91821289e-05]
 [  2.28881836e-05  -4.57763672e-05  -4.57763672e-05 ...,   1.90734863e-05
    1.10626221e-04   3.52859497e-05]]
CPU: 1.49331712723 GPU: 0.394819974899 GPU-CPU: -1.09849715233
successful exit from thread 0

CPU: 1.60758495331 GPU: 0.390916109085 GPU-CPU: -1.21666884422
successful exit from thread 1

Conclusion

When the matrix size is small, the CPU is faster, but the bigger the matrices, the more the GPU pulls ahead. That is because the GPU needs some preparation before it can compute, such as copying data and setting up memory. The computation itself is much faster on the GPU than on the CPU; in my test it was roughly four times faster (about 0.39 s versus 1.5-1.6 s). The results will differ with the parameters (MATRIX_SIZE, TILE_SIZE, etc.) and with the available computing power.

GPGPU programming takes more effort to write and to set up, but the computing power is remarkable compared with a CPU implementation.

GPUs offer powerful computing capability, and multithreading lets us run several GPUs in parallel.

And there are other libraries that make this easier for us, such as TensorFlow and Theano.

  1. https://en.wikipedia.org/wiki/CUDA 

  2. https://documen.tician.de/pycuda/index.html 
