Week 5 Challenge - zanzibarcircuit/ECE510 GitHub Wiki
Challenge 17
I had never seen a systolic array (shown below) before this and was curious to learn more about it. I had AI give me a quick summary of what a systolic array was with an example involving doing two 2x2 matrices and a 2x2 systolic array. Let's look at it below in detail.
Systolic Array Toy Example
Let's have matrix A = [1 2], 3 4 and matrix B = [5 6], 7 8. To do C = A × B with a systolic array, we have the following array and coordinates:
[ PE(0,0) PE(0,1) ]
[ PE(1,0) PE(1,1) ]
At each time step we're going to feed in a new element, with rows from A being fed in from the left to PE(0,0) and columns from B being fed in from the top into PE(0,0).
-
At time step t = 0, we feed in A(0,0) = 1 and B(0,0) = 5 and multiply to get PE(0,0) = 5 × 1 = 5.
-
At t = 1, PE(0,0) = 2 × 7 = 14, and we add that to the existing value there, so PE(0,0) = 5 + 14 = 19. This has completed C(0,0). While this is happening, the initial A(0,0) value that was in PE(0,0) is shifted to PE(0,1) and is met with B(0,1), so PE(0,1) = 1 × 6 = 6. By the same token, B(0,0) is passed to PE(1,0) and multiplied with A(1,0), so PE(1,0) = 3 × 5 = 15.
-
At t = 2, new values A(1,1) and B(1,1) enter, A(0,1) shifts right and B(1,0) shifts down, and so on and so forth until we get...
-
PE(0,0) = 19, PE(0,1) = 22, PE(1,0) = 43, PE(1,1) = 50. The expected value for C. Somewhat difficult to understand. Maybe it'll make more sense with bubble sort.
Bubble Sort
Bubble sort is a simple algorithm that iterates through an array and for the element at the current idx and the idx after it, if array[idx] > array[idx+1], swap array[idx] and array[idx+1]. Repeat until an iteration through the array results in no swaps and the array is sorted.
Bubble Sort in a Systolic Array Design
After asking AI, it seems as though the array only needs to be as large as the list you'd like to sort and works as follows. Place all the elements in the array at their respective indices. On odd clock cycles, compare PE(0) and PE(1), PE(2) and PE(3), ..., PE(N-2) and PE(N-1) and swap if the value at the lesser index is greater than the value at the greater index. At even clock cycles, do the same for PE(1) and PE(2), PE(3) and PE(4), ..., PE(N-3) and PE(N-2).
Bubble Sort Systolic Array
Below is my Python version of bubble sort in a systolic array. For each element in the array, it does the first even index comparison starting at index 0 and comparing each odd to each even and swapping if necessary. Then it does the odd comparison, comparing each odd to each even and swapping if necessary.
def systolic_bubble_sort(values):
arr = values[:]
n = len(arr)
for _ in range(n):
# Even index compare-and-swap
noswaps = True
for i in range(0, n - 1, 2):
if arr[i] > arr[i + 1]:
arr[i], arr[i + 1] = arr[i + 1], arr[i]
noswaps = False
# Odd index compare-and-swap
for i in range(1, n - 1, 2):
if arr[i] > arr[i + 1]:
arr[i], arr[i + 1] = arr[i + 1], arr[i]
noswaps = False
if noswaps:
break
return arr
In reality that is pretty serialized. I'm just iterating over things in a different way. I think the goal would be to do all the comparisons (odds or evens) in parallel at each clock cycle. What if I tried to do this in CUDA?
CUDA Implementation
import numpy as np
from numba import cuda
import time
import random
@cuda.jit
def compare_swap_kernel(arr, phase):
idx = cuda.grid(1)
if phase == 0:
i = 2 * idx
else:
i = 2 * idx + 1
if i < arr.size - 1 and arr[i] > arr[i + 1]:
tmp = arr[i]
arr[i] = arr[i + 1]
arr[i + 1] = tmp
def systolic_bubble_sort_cuda(values):
arr_np = np.array(values, dtype=np.int32)
d_arr = cuda.to_device(arr_np)
n = arr_np.size
threads_per_block = 128
blocks = (n + threads_per_block - 1) // threads_per_block
for _ in range(n):
compare_swap_kernel[blocks, threads_per_block](d_arr, 0) # Even phase
cuda.synchronize()
compare_swap_kernel[blocks, threads_per_block](d_arr, 1) # Odd phase
cuda.synchronize()
return d_arr.copy_to_host()
# Example
arr = [random.randint(0, 1000) for _ in range(1000)]
start = time.time()
sorted_arr = systolic_bubble_sort_cuda(arr)
end = time.time()
print(f"Sorted array (first 20): {sorted_arr[:20]}")
print(f"CUDA execution time: {end - start:.6f} seconds")
My hope is that this parallelizes the comparisons, but I don't think these are working as expected based on my comparison times. Below is a table comparing a normal bubble sort, my systolic in Python, and my CUDA (parallelized(?)) version. Note that because bubble sort time is dependent on how sorted the array is, I did an average time over 5 runs for each of the array sizes.
The normal version and the systolic(ish) version outperformed the CUDA version by a lot. After some more research, I found out it's because I wasn't checking if no swaps have occurred. Apparently this is non-trivial in CUDA, so I'm not going to bother with it for now, but the goal would be to do all the comparisons in parallel so for each clock cycle I would be doing N/2 comparisons rather than just 1 and greatly speeding things up.