Week 3 Challenges

Introduction

I chose to do Challenge #10. A quick look at the code shows it's Q-learning: an agent navigates a space by taking whichever action the Q-table says is best, and it updates the Q-table as it goes using the Q-learning rule and the punishment or reward it receives for each action. I first gave ChatGPT the code and asked for optimizations, as requested in the challenge.

Bottlenecks (SW)

1. Full Q-Table Copying Each Episode

The issue with copying the whole Q-table is obvious: as the state-action space grows, copying takes longer and longer. ChatGPT says we can edit the Q-table in place, or update only the entry that changes, which seems feasible to me (a quick sketch of the difference is below). This is a software issue, not necessarily something that can be optimized in hardware, but I suppose it's wise to first see how the algorithm can be improved in its pure form.
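To make the suggestion concrete, here is a minimal Python sketch of the two approaches, assuming a dict-based Q-table keyed by (state, action); the real challenge code may structure its table differently.

import copy

# Hypothetical dict-based Q-table keyed by (state, action) for a 5x5 grid
# with 4 actions; the actual challenge code may differ.
Q = {((r, c), a): 0.0 for r in range(5) for c in range(5) for a in range(4)}

def update_copying(Q, s, a, new_value):
    # What the original reportedly does: copy the whole table, which scales
    # with the size of the state-action space.
    Q_new = copy.deepcopy(Q)
    Q_new[(s, a)] = new_value
    return Q_new

def update_in_place(Q, s, a, new_value):
    # ChatGPT's suggestion: touch only the single entry that changed.
    Q[(s, a)] = new_value
    return Q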

2. Dictionaries Used Instead of Arrays

Dictionaries are used as the data structure for the Q-table instead of NumPy arrays. Dictionary hashing is slow compared to the direct memory access a NumPy array provides. Again, this is a software issue, and it's clear this isn't the goal of the challenge, but the array version is sketched below for reference.
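A minimal sketch of a NumPy-array Q-table, assuming a 5x5 grid with 4 actions (the sizes used in the hardware sketch later); indexing is direct address arithmetic into contiguous memory instead of hashing.

import numpy as np

N_ROWS, N_COLS, N_ACTIONS = 5, 5, 4   # assumed sizes, matching the HW sketch

# Dense Q-table: one contiguous block of floats, indexed directly.
Q = np.zeros((N_ROWS, N_COLS, N_ACTIONS), dtype=np.float32)

row, col, action = 2, 3, 1
Q[row, col, action] += 0.1            # direct indexing, no hash lookup
best_next = Q[row, col].max()         # max over the 4 actions of a state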

Let's ask it about hardware and see what it thinks.

Bottlenecks (HW)

I asked for the bottlenecks that would come up if I were to implement this in hardware, and, as one would expect, it gave me a different list. It was a long one, so I asked for the most valuable ways to accelerate Q-learning in hardware.

Replace the Dictionary Q-Table with a 3D BRAM Array

This one is actually similar to the first suggestion under the SW bottlenecks: slow hash-based lookups are being used where we could use a 3D BRAM. I didn't know what a 3D BRAM was, so I asked, and I now understand it as just a 3D array stored in RAM with logic to handle three dimensions of indexing, not a special RAM block. That makes sense, and according to ChatGPT it's the most critical speedup we could get. I also realized I didn't really understand hashing here: it means taking the dict key, turning it into a 'big, weird number,' and using that number to decide where to look (possibly in a few places) for the right entry. That sounds a lot more complicated than direct indexing.
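To convince myself that a '3D BRAM' really is just a flat memory plus indexing logic, here is a tiny sketch of how a (row, col, action) triple could be flattened into a single address; the particular address layout is my own assumption, not something ChatGPT specified.

N_ROWS, N_COLS, N_ACTIONS = 5, 5, 4

def flat_address(row: int, col: int, action: int) -> int:
    # (row, col, action) -> one linear address; this arithmetic is all the
    # '3D' part of the indexing logic amounts to.
    return (row * N_COLS + col) * N_ACTIONS + action

assert flat_address(0, 0, 0) == 0
assert flat_address(4, 4, 3) == 99    # 100 entries total, like the 5x5x4 table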

Convert Floating-Point Math to Fixed-Point

Given my real lack of familiarity with computer hardware (I'm in the ML/SP track), I understand this as converting the floating-point numbers to scaled integers of some kind, and there are a few fixed-point formats to pick from. I learned that this algorithm uses Q-values between about -5 and 1 with multipliers alpha = 0.5 and gamma = 0.9, and ChatGPT suggested Q4.12. To my understanding, this lets the update Q(s,a) = (1 - alpha) * Q(s,a) + alpha * (reward + gamma * max(Q(s',a'))) be done with shifts, adds, and a clip. This is what was discussed in class a bit; it makes sense and should give a real speedup, but it would need to be tested on a real problem to evaluate performance and whether the resolution is adequate.
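To check that I actually understood Q4.12, here is a small Python sketch of the update done in 16-bit fixed point with alpha = 0.5 and gamma = 0.9; the scaling constants and the saturation at the 16-bit range are my assumptions about what the hardware would do, not code from ChatGPT.

FRAC_BITS = 12
SCALE = 1 << FRAC_BITS                     # Q4.12: 4096 represents 1.0
Q_MIN, Q_MAX = -(1 << 15), (1 << 15) - 1   # saturate to the signed 16-bit range

def to_fixed(x: float) -> int:
    return int(round(x * SCALE))

def fixed_mult(a: int, b: int) -> int:
    # full-width product, then shift back down by the fractional bits
    return (a * b) >> FRAC_BITS

def q_update_fixed(q_sa: int, reward: int, max_q_next: int,
                   alpha=to_fixed(0.5), gamma=to_fixed(0.9)) -> int:
    target = reward + fixed_mult(gamma, max_q_next)
    updated = fixed_mult(SCALE - alpha, q_sa) + fixed_mult(alpha, target)
    return max(Q_MIN, min(Q_MAX, updated))  # clip instead of wrapping

# Example: Q(s,a) = -1.0, reward = -1.0, max Q(s',a') = 0.5
print(q_update_fixed(to_fixed(-1.0), to_fixed(-1.0), to_fixed(0.5)) / SCALE)
# prints about -0.775, matching the float version 0.5*(-1.0) + 0.5*(-1.0 + 0.9*0.5)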

Parallel Max Q-Value Computation (Reduction Tree)

This is classic hardware acceleration thinking. Instead of serially going through each action and evaluating its value, we can parallelize and check all 4 actions at once. Given what I understand about the algorithm, this makes sense since we're only looking at 4 actions; I do wonder what the limit of parallelization would be with more actions.
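Here is a quick Python sketch of the reduction-tree idea, just to show its shape: with 4 actions there are two compares that can happen at once and then one more, and in general N values take roughly log2(N) comparator levels instead of N - 1 serial compares, so going wider mostly costs extra comparators and a slightly deeper combinational path.

def tree_max(values):
    # pairwise maxes; in hardware every compare within a level happens at once
    level = list(values)
    while len(level) > 1:
        level = [max(level[i], level[i + 1]) if i + 1 < len(level) else level[i]
                 for i in range(0, len(level), 2)]
    return level[0]

print(tree_max([3, -1, 4, 1]))  # 4: two compares in level one, one in level two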

Lightweight RNG via LFSR for ε-Greedy Policy

Our algorithm requires an RNG for the ε-greedy policy, and RNGs vary in hardware complexity. ChatGPT suggested an LFSR. I didn't know what that was or what the baseline would be, and learned it's a linear feedback shift register: a pseudorandom generator that just cycles through a long sequence of bits that appear random. I'm going to accept this as is, keeping in mind that since it was suggested as a speedup, it will likely bias the RNG in some way that I won't try to quantify at this point.
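For my own reference, here is a Python sketch of a 16-bit Fibonacci LFSR using taps 16, 14, 13, 11 (a standard maximal-length choice); the tap positions and the ε comparison are my own illustration rather than anything ChatGPT generated, but they show why this is cheap in hardware: one shift and a few XORs per pseudorandom value.

def lfsr16(seed=0xACE1):
    # 16-bit Fibonacci LFSR; cycles through all 65535 nonzero states
    state = seed
    while True:
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = ((state >> 1) | (bit << 15)) & 0xFFFF
        yield state

rng = lfsr16()
# epsilon-greedy check: with epsilon = 0.1, explore when the 16-bit output
# falls below 0.1 * 2**16 (about 6554)
explore = next(rng) < 6554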

Hardware Implementation of the Biggest Bottleneck

When I asked ChatGPT for a hardware implementation of the biggest bottleneck, it suggested building a hardware Q-update module that performs Q[s,a] = (1 - α) * Q[s,a] + α * (r + γ * max(Q[s',a'])).

It gave me the following, which I'm walking through here with pseudocode to understand it better.

Top Level

Here s_row/s_col and s_prime_row/s_prime_col are the current- and next-state grid coordinates in 5 bits each, action is the action (up/down/left/right) in 2 bits, and reward is a signed 16-bit reward.

module q_update (
    input clk,
    input rst,
    // current state (row, col) and next state (row, col)
    input [4:0] s_row, s_col, s_prime_row, s_prime_col,
    input [1:0] action,              // 0..3 = up/down/left/right
    input signed [15:0] reward,      // Q4.12 fixed-point reward
    input start,                     // pulse high to start one update
    output done                      // high when the write-back completes
);

3D BRAM

This is the speedup recommended for our Q-table (a dense array addressed directly rather than through a hash), and it is trivial to implement.

// 5x5x4 Q-table = 100 entries
reg signed [15:0] Q_table [0:4][0:4][0:3];

Control

This is the enum for the algorithm's control state: idle, reading the current Q, reading all the next-state Qs, computing the max, updating the value, writing it back, or done.

typedef enum logic [2:0] {
    IDLE,
    READ_Q_CURRENT,
    READ_Q_NEXT_ALL,
    COMPUTE_MAX,
    UPDATE_Q,
    WRITE_BACK,
    DONE
} state_t;

state_t state;

Pipelined Data Path

This is the core of the algorithm: it reads the current Q entry, reads the Q value for each next-state action in parallel, and takes the max_q of those. It then calculates the update and writes it back to the table. Keep in mind that all of the Q-table reads and writes use the accelerated 3D BRAM rather than the dict. Note also that it includes the parallelized max computation even though I didn't explicitly ask for it.

// Stage 1: Read Q[s,a]
q_current = Q_table[s_row][s_col][action];

// Stage 2: Read Q[s',a'] for a' = 0..3
q0 = Q_table[s_prime_row][s_prime_col][0];
q1 = Q_table[s_prime_row][s_prime_col][1];
q2 = Q_table[s_prime_row][s_prime_col][2];
q3 = Q_table[s_prime_row][s_prime_col][3];

// Stage 3: max(Q[s',:]) using comparator tree
// (max() here stands for a 2-input comparator, e.g. (a > b) ? a : b)
max_q = max(max(q0, q1), max(q2, q3));

// Stage 4: Bellman update computation (fixed-point)
// target = r + gamma * max_q
// updated = (1 - alpha) * q_current + alpha * target
// alpha, gamma, and "one" are Q4.12 constants,
// e.g. alpha = 16'sd2048 (0.5), gamma = 16'sd3686 (~0.9), one = 16'sd4096 (1.0)
target = reward + fixed_mult(gamma, max_q);
updated = fixed_mult(one - alpha, q_current) + fixed_mult(alpha, target);

// Stage 5: Write Q[s,a] back
Q_table[s_row][s_col][action] = updated;

Fixed-Point Multiplication

It also (without my asking) implemented the fixed-point arithmetic, where a Q4.12 multiply is just an integer multiply followed by a 12-bit arithmetic right shift.

function signed [15:0] fixed_mult;
    input signed [15:0] a, b;
    reg signed [31:0] product;
begin
    product = a * b;  // 32-bit intermediate
    fixed_mult = product >>> 12;  // Adjust for Q4.12
end
endfunction

FSM Logic Sketch

This is the state machine that steps through the stages above.

always_ff @(posedge clk) begin
    if (rst) state <= IDLE;
    else begin
        case (state)
            IDLE: if (start) state <= READ_Q_CURRENT;
            READ_Q_CURRENT: state <= READ_Q_NEXT_ALL;
            READ_Q_NEXT_ALL: state <= COMPUTE_MAX;
            COMPUTE_MAX: state <= UPDATE_Q;
            UPDATE_Q: state <= WRITE_BACK;
            WRITE_BACK: state <= DONE;
            DONE: state <= IDLE;
        endcase
    end
end

Actual Implementation

I asked ChatGPT to implement this in Verilog and added the file to my GitHub. The first iteration had a timing issue where the FSM transitioned between states too quickly. It took about ten back-and-forths for ChatGPT to understand that it wasn't initializing the Q-table appropriately. This is the first time I REALLY pushed ChatGPT to find its own problem, and it found it eventually, which gives me a lot more confidence to let it push through issues. Next time I should ask it to generate some debug output for itself, because that was what eventually helped. Anyway, the code is added.