Week 5 Challenge - zanzibarcircuit/ECE510 GitHub Wiki

Challenge 17

I had never seen a systolic array (shown below) before this and was curious to learn more about it. I had AI give me a quick summary of what a systolic array was with an example involving doing two 2x2 matrices and a 2x2 systolic array. Let's look at it below in detail.
3-s2 0-B9780127345307500088-f08-22-9780127345307

Systolic Array Toy Example

Let's have matrix A = [1 2], 3 4 and matrix B = [5 6], 7 8. To do C = A × B with a systolic array, we have the following array and coordinates:

[ PE(0,0)  PE(0,1) ]  
[ PE(1,0)  PE(1,1) ]

At each time step we're going to feed in a new element, with rows from A being fed in from the left to PE(0,0) and columns from B being fed in from the top into PE(0,0).

  • At time step t = 0, we feed in A(0,0) = 1 and B(0,0) = 5 and multiply to get PE(0,0) = 5 × 1 = 5.

  • At t = 1, PE(0,0) = 2 × 7 = 14, and we add that to the existing value there, so PE(0,0) = 5 + 14 = 19. This has completed C(0,0). While this is happening, the initial A(0,0) value that was in PE(0,0) is shifted to PE(0,1) and is met with B(0,1), so PE(0,1) = 1 × 6 = 6. By the same token, B(0,0) is passed to PE(1,0) and multiplied with A(1,0), so PE(1,0) = 3 × 5 = 15.

  • At t = 2, new values A(1,1) and B(1,1) enter, A(0,1) shifts right and B(1,0) shifts down, and so on and so forth until we get...

  • PE(0,0) = 19, PE(0,1) = 22, PE(1,0) = 43, PE(1,1) = 50. The expected value for C. Somewhat difficult to understand. Maybe it'll make more sense with bubble sort.

Bubble Sort

Bubble sort is a simple algorithm that iterates through an array and for the element at the current idx and the idx after it, if array[idx] > array[idx+1], swap array[idx] and array[idx+1]. Repeat until an iteration through the array results in no swaps and the array is sorted.

Bubble Sort in a Systolic Array Design

After asking AI, it seems as though the array only needs to be as large as the list you'd like to sort and works as follows. Place all the elements in the array at their respective indices. On odd clock cycles, compare PE(0) and PE(1), PE(2) and PE(3), ..., PE(N-2) and PE(N-1) and swap if the value at the lesser index is greater than the value at the greater index. At even clock cycles, do the same for PE(1) and PE(2), PE(3) and PE(4), ..., PE(N-3) and PE(N-2).

Bubble Sort Systolic Array

Below is my Python version of bubble sort in a systolic array. For each element in the array, it does the first even index comparison starting at index 0 and comparing each odd to each even and swapping if necessary. Then it does the odd comparison, comparing each odd to each even and swapping if necessary.

def systolic_bubble_sort(values):
    arr = values[:]
    n = len(arr)

    for _ in range(n):
        # Even index compare-and-swap
        noswaps = True
        for i in range(0, n - 1, 2):
            if arr[i] > arr[i + 1]:
                arr[i], arr[i + 1] = arr[i + 1], arr[i]
                noswaps = False

        # Odd index compare-and-swap
        for i in range(1, n - 1, 2):
            if arr[i] > arr[i + 1]:
                arr[i], arr[i + 1] = arr[i + 1], arr[i]
                noswaps = False
                
        if noswaps:
            break

    return arr

In reality that is pretty serialized. I'm just iterating over things in a different way. I think the goal would be to do all the comparisons (odds or evens) in parallel at each clock cycle. What if I tried to do this in CUDA?

CUDA Implementation

import numpy as np
from numba import cuda
import time
import random

@cuda.jit
def compare_swap_kernel(arr, phase):
    idx = cuda.grid(1)
    if phase == 0:
        i = 2 * idx
    else:
        i = 2 * idx + 1

    if i < arr.size - 1 and arr[i] > arr[i + 1]:
        tmp = arr[i]
        arr[i] = arr[i + 1]
        arr[i + 1] = tmp

def systolic_bubble_sort_cuda(values):
    arr_np = np.array(values, dtype=np.int32)
    d_arr = cuda.to_device(arr_np)

    n = arr_np.size
    threads_per_block = 128
    blocks = (n + threads_per_block - 1) // threads_per_block

    for _ in range(n):
        compare_swap_kernel[blocks, threads_per_block](d_arr, 0)  # Even phase
        cuda.synchronize()

        compare_swap_kernel[blocks, threads_per_block](d_arr, 1)  # Odd phase
        cuda.synchronize()

    return d_arr.copy_to_host()

# Example
arr = [random.randint(0, 1000) for _ in range(1000)]

start = time.time()
sorted_arr = systolic_bubble_sort_cuda(arr)
end = time.time()

print(f"Sorted array (first 20): {sorted_arr[:20]}")
print(f"CUDA execution time: {end - start:.6f} seconds")

My hope is that this parallelizes the comparisons, but I don't think these are working as expected based on my comparison times. Below is a table comparing a normal bubble sort, my systolic in Python, and my CUDA (parallelized(?)) version. Note that because bubble sort time is dependent on how sorted the array is, I did an average time over 5 runs for each of the array sizes.

image

The normal version and the systolic(ish) version outperformed the CUDA version by a lot. After some more research, I found out it's because I wasn't checking if no swaps have occurred. Apparently this is non-trivial in CUDA, so I'm not going to bother with it for now, but the goal would be to do all the comparisons in parallel so for each clock cycle I would be doing N/2 comparisons rather than just 1 and greatly speeding things up.