In these slides we walk through several basic implementations of SpMV and SpMM on GPUs.


A Survey of Sparse Matrix Multiplication on GPUs (CUDA)

Xiang Li

8/18/2022


Revisit – What is a GPU & Why HPC on GPUs


GPU Preliminary

Features of GPU:


Comparing with CPU


GPU Preliminary - General steps for a GPU Program

  1. Allocate memory on CPU/GPU
  2. Prepare data on / transfer data to GPU
  3. Computation on GPU
  4. Data copied back to CPU
  5. Finish up outputs on CPU

GPU Preliminary - Memory manipulation


GPU Preliminary - How to declare a kernel

__global__ void kernel_func(int arg1) {}

GPU Preliminary - How to run a kernel

kernel_func<<<grid_shape, block_shape>>>(arg1);

Logical thread model: Grids & Blocks

[Figure: Logical Thread Model]



[Figure: Turing TU104 Chip Diagram]

Physically (Tesla T4):


[Figure: Turing TU104 SM]


Keys to understanding GPU execution logic:


General Ideas on Improving Performance with GPU


Note: A thread can issue multiple reads (on separate instructions in the instruction stream), so multiple reads from a single thread can be in flight. A read request by itself does not stall the thread; the thread stalls only when the result of that read is required by a later operation and the data are not yet available. The compiler is aware of this and will attempt to schedule reads before the data are actually required. See ref.


Parallelizing SpMV on GPUs

[Figure: SpMV]


Sparse Matrix Representation for SpMV


DIA Kernel

row = globalThreadIdx
if row < num_rows:
  dot = 0

  for n in range(0, num_diags):
    col = row + offsets[n]
    val = data[num_rows * n + row]

    if 0 <= col < num_cols:
      dot += val * x[col]

  y[row] += dot
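Ignoring launch details, the per-row kernel above can be run as plain Python, with a serial loop over rows standing in for the thread grid. This is a sketch for checking the DIA indexing (names such as `dia_spmv` and `offsets` are illustrative; `data` is stored diagonal-major, one `num_rows`-long column per diagonal):

```python
def dia_spmv(offsets, data, x, num_rows, num_cols):
    # data[num_rows * n + row] is row `row` of diagonal n;
    # padded entries outside the matrix are skipped by the bounds check
    y = [0.0] * num_rows
    for row in range(num_rows):          # one GPU thread per row
        dot = 0.0
        for n in range(len(offsets)):
            col = row + offsets[n]
            val = data[num_rows * n + row]
            if 0 <= col < num_cols:
                dot += val * x[col]
        y[row] += dot
    return y
```

For the dense 2x2 matrix [[1, 2], [3, 4]] stored with offsets [-1, 0, 1], `dia_spmv([-1, 0, 1], [0, 3, 1, 4, 2, 0], [1, 1], 2, 2)` returns the row sums [3.0, 7.0].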

[Figure: DIA SpMV Workflow]


[Figure: DIA SpMV Workflow - Step 1]



ELL Kernel

[Figure: ELL SpMV Workflow]


row = globalThreadIdx
if row < num_rows:
  dot = 0

  for n in range(0, cols_per_row):
    col = indices[num_rows * n + row]
    val = data[num_rows * n + row]

    if val != 0:
      dot += val * x[col]

  y[row] += dot
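As with DIA, the ELL kernel can be sketched in plain Python to check the column-major indexing (`ell_spmv` is an illustrative name; padded slots carry a zero value and an arbitrary column index):

```python
def ell_spmv(indices, data, x, num_rows, cols_per_row):
    # indices/data are column-major: slot n of row `row` lives at
    # num_rows * n + row; zero-valued slots are padding
    y = [0.0] * num_rows
    for row in range(num_rows):          # one GPU thread per row
        dot = 0.0
        for n in range(cols_per_row):
            col = indices[num_rows * n + row]
            val = data[num_rows * n + row]
            if val != 0:
                dot += val * x[col]
        y[row] += dot
    return y
```

For A = [[1, 0, 2], [0, 3, 0]] with cols_per_row = 2, `ell_spmv([0, 1, 2, 0], [1, 3, 2, 0], [1, 2, 3], 2, 2)` returns [7.0, 6.0].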

CSR Kernel (Scalar)

[Figure: CSR Format]

Access:      a b c d e f g
----------------------------
Iteration 0: a   c d     g
Iteration 1:   b     e
Iteration 2:           f
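The access table above shows why the scalar kernel's loads are not coalesced: in each iteration, neighboring threads read nonzeros from different rows. For reference, a minimal per-row (scalar) CSR kernel in the same serial Python sketch (`csr_spmv_scalar` is an illustrative name):

```python
def csr_spmv_scalar(rowptr, colind, val, x, num_rows):
    # one GPU thread per row; thread `row` walks val[rowptr[row]:rowptr[row+1]],
    # so threads in a warp touch widely separated addresses each iteration
    y = [0.0] * num_rows
    for row in range(num_rows):
        dot = 0.0
        for ptr in range(rowptr[row], rowptr[row + 1]):
            dot += val[ptr] * x[colind[ptr]]
        y[row] += dot
    return y
```

For A = [[1, 2], [0, 3]], `csr_spmv_scalar([0, 2, 3], [0, 1, 1], [1, 2, 3], [1, 1], 2)` returns [3.0, 3.0].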

CSR Kernel (Vector)

Access: a b | c | d e f | g
----------------------------
Warp 0: a b
Warp 1: c
Warp 2: d e f
Warp 3: g
(Simultaneously)


warp_id = globalThreadIdx // 32
lid = globalThreadIdx % 32  # lane id within the warp
tid = threadIdx

row = warp_id
row_start = rowptr[row]
row_end = rowptr[row + 1]

# on shared memory: vals[]
vals[tid] = 0
for j in range(row_start + lid, row_end, 32):
  vals[tid] += data[j] * x[indices[j]]

# reduce this warp's 32 partial sums into its first lane's slot
# (parallel reduction, next slide)

if lid == 0:
  y[row] += vals[tid]

[Figure: CSR SpMV Workflow]


Parallel Reduction in Shared Memory

# simple parallel reduction in shared memory
# (warp-synchronous; on Volta and later, insert __syncwarp() between
#  steps or use warp shuffle intrinsics instead)
if lid < 16:  vals[tid] += vals[tid+16]
if lid <  8:  vals[tid] += vals[tid+ 8]
if lid <  4:  vals[tid] += vals[tid+ 4]
if lid <  2:  vals[tid] += vals[tid+ 2]
if lid <  1:  vals[tid] += vals[tid+ 1]
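To see what the five lockstep steps compute, here is a plain-Python simulation of one warp's reduction over 32 partial sums (`warp_reduce` is an illustrative name; no snapshot is needed because each step only reads slots it does not write):

```python
def warp_reduce(vals):
    # vals holds one warp's 32 partial sums; after the five halving steps,
    # vals[0] holds the total, mirroring the shared-memory code above
    vals = list(vals)
    for stride in [16, 8, 4, 2, 1]:
        for lid in range(stride):
            vals[lid] += vals[lid + stride]
    return vals[0]
```

For example, `warp_reduce(list(range(32)))` returns 496, the sum 0 + 1 + ... + 31.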

Parallel Scan: All-prefix-sums

[Figure: Prefix Sum]
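As a reminder of the operation itself: the exclusive all-prefix-sums of an array a is out[i] = a[0] + ... + a[i-1], with out[0] = 0. A serial sketch (the GPU version parallelizes this with a work-efficient scan; see the GPU Gems chapter in the references):

```python
def exclusive_scan(a):
    # out[i] = sum of a[0:i]; out[0] is the identity (0 for addition)
    out, running = [], 0
    for v in a:
        out.append(running)
        running += v
    return out
```

For example, `exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3])` returns [0, 3, 4, 11, 11, 15, 16, 22].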


CSR Kernel (Vector) (Cont’d)

Segment starts (CSR):
    0:     0    9   19   25

Values:
    0:     1    5    5    1    2    5    1    1    4    4
   10:     5    3    4    4    4    2    2    4    2    5
   20:     5    1    5    1    4    *    *    *    *    *

Reduced Values:
    0:    25   34   21

COO Kernel

[Figure: COO Format]


Segmented Reduction (COO)

Keys:
    0:     0    0    0    1    1    1    2    2    2    2
   10:     2    2    2    2    2    2    2    2    2    2
   20:     2    2    2    2    2    2    2    2    2    2
   30:     2    2    2    2    2    3    3    3    3    3

Values:
    0:     2    4    2    4    1    5    3    2    4    4
   10:     2    5    2    2    5    3    3    5    3    3
   20:     2    2    1    4    4    2    1    4    1    3
   30:     1    3    2    4    2    5    1    2    1    5

Reduced keys:
    0:     0    1    2    3
Reduced values:
    0:     8   10   82   14
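A sequential Python reference for what the segmented (key-value) reduction computes: sum each run of equal keys. The name `reduce_by_key` is illustrative; keys are assumed sorted by segment, as COO row indices are here:

```python
from itertools import groupby

def reduce_by_key(keys, vals):
    # collapse each run of equal keys into (key, sum of its values)
    out_keys, out_vals = [], []
    for k, group in groupby(zip(keys, vals), key=lambda kv: kv[0]):
        out_keys.append(k)
        out_vals.append(sum(v for _, v in group))
    return out_keys, out_vals
```

Running it on the 40-element example above reproduces the reduced keys [0, 1, 2, 3] and reduced values [8, 10, 82, 14].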

Segmented Reduction

Segments         Values                                        Reductions
 0:  5  4                                                       =   9
 1:  5                                                          =   5
 2:  0  0  4  2  5                                              =  11
 3:  1  3  1  5  1  2                                           =  13
 4:  0  3  0  2  3  4  4  3                                     =  19
 5:  2  5  5  0  5  0  3  4  5  1  1  0  5  3  2  3  3          =  47
 6:  3  1  5  4  5                                              =  18

Segmented Reduction (CSR)

Threads    Inputs (* = segment end)       Partials       Carry-out  Carry-in
 0: 5  4* 5* 0  0  4  2  5*              :  9  5 11         =>  0*     0
 1: 1  3  1  5  1  2* 0  3               : 13               =>  3*     0
 2: 0  2  3  4  4  3* 2  5               : 16               =>  7*     3
 3: 5  0  5  0  3  4  5  1               :                  => 23      7
 4: 1  0  5  3  2  3  3* 3               : 17               =>  3*    30
 5: 1  5  4  5* 4  3  3  1               : 15               => 11*     3
 6: 5* 1  4* 5  2  0  0  4               :  5  5            => 11*    11
 7: 4  2  4  4  2  3  2  3               :                  => 24     11
 8: 4  2  0  3* 2  3  5  0               :  9               => 10*    35
 9: 4  0  2  4  2  5  4  0               :                  => 21     10
10: 3  2* 5  4  2  0*                    :  5 11            =>  0*    31

Fixed reductions
 9  5 11 13 19 47 18 16  5 44 36 11

Segmented Reduction (COO) - Inside Warp

# segment reduction in shared memory
if lane >= 1  and rows[tid] == rows[tid-1]:
    vals[tid] += vals[tid-1]
if lane >= 2  and rows[tid] == rows[tid-2]:
    vals[tid] += vals[tid-2]
if lane >= 4  and rows[tid] == rows[tid-4]:
    vals[tid] += vals[tid-4]
if lane >= 8  and rows[tid] == rows[tid-8]:
    vals[tid] += vals[tid-8]
if lane >= 16 and rows[tid] == rows[tid-16]:
    vals[tid] += vals[tid-16]
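In plain Python, the five steps above perform a segmented inclusive scan within the warp: afterwards, the last entry of each segment holds that segment's partial sum. The lockstep SIMD semantics (all lanes read before any lane writes) is emulated by snapshotting before each step (`segmented_scan_warp` is an illustrative name):

```python
def segmented_scan_warp(rows, vals):
    # Hillis-Steele style segmented inclusive scan over one warp's slots;
    # rows[] are the segment keys, vals[] the values
    vals = list(vals)
    for d in [1, 2, 4, 8, 16]:
        prev = list(vals)  # all lanes read old values (lockstep emulation)
        for tid in range(len(vals)):
            if tid >= d and rows[tid] == rows[tid - d]:
                vals[tid] = prev[tid] + prev[tid - d]
    return vals
```

For example, `segmented_scan_warp([0, 0, 0, 1, 1], [1, 2, 3, 4, 5])` returns [1, 3, 6, 4, 9]: positions 2 and 4 (the last entries of their segments) hold the segment sums 6 and 9.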


SpMV Summary


Parallelizing SpMM on GPUs

[Figure: Throughput of SpMM on CUDA]


Applications


Existing Software / Packages


From SpMV to SpMM

[Figure: From SpMV to SpMM]


Original data organization

# Simple implementation of CSR-based SpMM

# In all examples, range-like for-loops exclude their last element (Python convention)

# for sparse (CSR) A: A.rowptr[], A.colind[], A.val[],
# for dense        B: B[]
# for dense        C: C[]

1   for i in range(0, M):         # in parallel
2       for j in range(0, N):     # in parallel
3           result = 0
4           # dot product of A[i, :] and B[:, j]
5           for ptr in range(A.rowptr[i], A.rowptr[i+1]): # cannot parallelize!
6               k = A.colind[ptr]                         #
7               result += A.val[ptr] * B[k, j]            # <- reduction must be serial
8           C[i, j] = result
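Stripped of line numbers, the pseudocode runs as-is in (serial) Python, which is handy for checking correctness; here A is passed as separate CSR arrays, and B and C are lists of rows:

```python
def spmm_csr(rowptr, colind, val, B, M, N):
    # C[i][j] = dot product of sparse row A[i, :] and dense column B[:, j]
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):              # parallel over rows on the GPU
        for j in range(N):          # parallel over columns on the GPU
            result = 0.0
            for ptr in range(rowptr[i], rowptr[i + 1]):  # serial reduction
                k = colind[ptr]
                result += val[ptr] * B[k][j]
            C[i][j] = result
    return C
```

For the diagonal A = [[1, 0], [0, 2]] and B = [[1, 2], [3, 4]], `spmm_csr([0, 1, 2], [0, 1], [1, 2], [[1, 2], [3, 4]], 2, 2)` returns [[1.0, 2.0], [6.0, 8.0]].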

Access pattern for this code

[Figure: SpMM Access Pattern]

Observation 1 (Sharing the same rows of $A$) To parallelize the code above, threads within a warp are assigned the same $i$ and contiguous $j$.

Observation 2 The sequential execution of the for-loop in line 5 forces all threads in the same warp to access the same addresses of $A$'s arrays.


Coalesced Row Caching (From vector CSR SpMV)
# Improved implementation of CSR-based SpMM (Coalesced Row Caching)
# A.rowptr[], A.colind[], A.val[], B[], C[], block_size (1D grid and 1D block)

# shared memory declared to use: sm_k[block_size], sm_v[block_size]

# Each block handles block_size // 32 rows of A, i.e., each warp handles one row of A
# B is partitioned into panels of 32 columns; within a panel, rows of B are read one
# at a time, so accesses are coalesced when B is stored row-major

# This snippet computes C[i, 0:32] = matmul(A[i, :], B[:, 0:32]); warp_size = 32
i = block_id * (block_size // warp_size) + thread_id // warp_size
j = lane_id = thread_id % warp_size
sm_offset = thread_id - lane_id  # start of this warp's region in shared memory
row_start = A.rowptr[i]  # different warps read different rows,
row_end = A.rowptr[i+1]  # so they get different row starts/ends
result = 0
for ptr in range(row_start, row_end, warp_size):
    # cooperatively load up to 32 nonzeros of A's row into shared memory
    if ptr + lane_id < row_end:
        sm_k[thread_id] = A.colind[ptr + lane_id]
        sm_v[thread_id] = A.val[ptr + lane_id]
    __syncwarp()  # make the loads visible to the whole warp; otherwise race condition
    for kk in range(0, warp_size):
        if ptr + kk < row_end:
            k = sm_k[sm_offset + kk]
            result += sm_v[sm_offset + kk] * B[k, j]
    __syncwarp()  # don't start the next load while lanes may still be reading
C[i, j] = result
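A serial Python emulation of the CRC scheme for a single column panel, useful for checking the staging logic. A small `warp_size` is used so the cooperative load runs more than once per row; the lists `sm_k`/`sm_v` play the role of shared memory, and `spmm_crc` is an illustrative name:

```python
def spmm_crc(rowptr, colind, val, B, M, N, warp_size=32):
    # one warp per sparse row; lane j computes C[i][j] (assumes N <= warp_size,
    # i.e. a single column panel of B, matching the snippet above)
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):                               # one warp per row of A
        for ptr in range(rowptr[i], rowptr[i + 1], warp_size):
            # cooperative load of up to warp_size nonzeros into "shared memory"
            sm_k, sm_v = [], []
            for lane in range(warp_size):
                if ptr + lane < rowptr[i + 1]:
                    sm_k.append(colind[ptr + lane])
                    sm_v.append(val[ptr + lane])
            # every lane then reads the same staged entries (broadcast)
            for j in range(N):                       # lanes run concurrently on GPU
                for kk in range(len(sm_k)):
                    C[i][j] += sm_v[kk] * B[sm_k[kk]][j]
    return C
```

With A = [[1, 0, 2], [0, 3, 4]] in CSR form and B a 3x2 dense matrix, `spmm_crc([0, 2, 4], [0, 2, 1, 2], [1, 2, 3, 4], [[1, 2], [3, 4], [5, 6]], 2, 2, warp_size=2)` returns [[11.0, 14.0], [29.0, 36.0]], matching the dense product.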


Visualized CRC

[Figure: Step 1 of CRC]


[Figure: Step 2 of CRC]


Observation 1 CRC shares the loaded sparse rows among a warp's threads via shared memory.

Observation 2 Different warps (handling different 32-column panels of $B$) still load the same sparse rows redundantly.


Warp Merging

[Figure: Warp Merging]


Other important techniques for improving efficiency (not covered here)


References

  1. G. Huang, G. Dai, Y. Wang and H. Yang, “GE-SpMM: General-Purpose Sparse Matrix-Matrix Multiplication on GPUs for Graph Neural Networks,” SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020, pp. 1-12, doi: 10.1109/SC41405.2020.00076.
  2. Yang, C., Buluç, A., Owens, J.D. (2018). Design Principles for Sparse Matrix Multiplication on the GPU. In: Aldinucci, M., Padovani, L., Torquati, M. (eds) Euro-Par 2018: Parallel Processing. Euro-Par 2018. Lecture Notes in Computer Science(), vol 11014. Springer, Cham. https://doi.org/10.1007/978-3-319-96983-1_48
  3. A. Mehrabi, D. Lee, N. Chatterjee, D. J. Sorin, B. C. Lee and M. O’Connor, “Learning Sparse Matrix Row Permutations for Efficient SpMM on GPU Architectures,” 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2021, pp. 48-58, doi: 10.1109/ISPASS51385.2021.00016.
  4. Lin, C. Y., Luo, L., & Ceze, L. (2021). Accelerating SpMM Kernel with Cache-First Edge Sampling for Graph Neural Networks. arXiv preprint arXiv:2104.10716.
  5. Bell, N., & Garland, M. (2008). Efficient sparse matrix-vector multiplication on CUDA (Vol. 2, No. 5). Nvidia Technical Report NVR-2008-004, Nvidia Corporation.
  6. Yan, S., Li, C., Zhang, Y., & Zhou, H. (2014). yaSpMV: Yet another SpMV framework on GPUs. Acm Sigplan Notices, 49(8), 107-118.
  7. https://moderngpu.github.io/segreduce.html
  8. https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda
  9. https://developer.nvidia.com/blog/accelerating-matrix-multiplication-with-block-sparse-format-and-nvidia-tensor-cores/
  10. https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/
  11. https://developer.nvidia.com/blog/nvidia-turing-architecture-in-depth/