Control Unit - CarlosCraveiro/RISCV_based_processor GitHub Wiki

Control Unit

Introduction

As described before, the processor architecture is inspired by the RV32C standard. However, the project works with 8 registers, each 16 bits wide, and with modified 16-bit instructions, so it is called RV16Cm (RISC-V 16-bit compact-modified instruction set). This name was conceived by the designers, who are not aware of any such standard existing previously, but it is a convenient way of adapting the RV standards for didactic purposes.

Instruction Formats

Firstly, it is important to mention the RV32C instruction set, which inspires most of the RV16Cm instruction set. RV32C instructions are 16 bits wide (in accordance with their purpose as compact instructions), so, technically, the 16-bit architecture could be compatible with RV32C. However, the state machine of this project was never intended to implement all RV32C instructions, for simplicity and other hardware reasons (for instance, the ALU was designed without floating-point support), so the RV32C instructions were modified to create the so-called RV16Cm. To visualize the modifications, the following tables show the instruction formats of both sets.

| Format | Meaning | Fields (from bit 15 down to bit 0) |
|---|---|---|
| CR | Register | funct4, rd/rs1, rs2, op |
| CI | Immediate | funct3, imm, rd/rs1, imm, op |
| CSS | Stack-relative Store | funct3, imm, rs2, op |
| CIW | Wide Immediate | funct3, imm, rd, op |
| CL | Load | funct3, imm, rs1, imm, rd, op |
| CS | Store | funct3, imm, rs1, imm, rs2, op |
| CB | Branch | funct3, offset, rs1, offset, op |
| CJ | Jump | funct3, jump target, op |

Table 1. RV32C instruction formats. Source: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-209.pdf

| Format | 3 bits | 5 bits | 3 bits | 3 bits | 2 bits |
|---|---|---|---|---|---|
| CRm | Funct3 | 00000 | rd/rs1 | rs2 | op |
| CIm | Funct3 | Imm8[7:3] | rd/rs1 | Imm8[2:0] | op |
| CLm | Funct3 | Addr8[7:3] | rd | Addr8[2:0] | op |
| CSm | Funct3 | Addr8[7:3] | rs1 | Addr8[2:0] | op |
| CBm | Funct3 | Addr11[7:3] | Addr11[10:8] | Addr11[2:0] | op |

Table 2. RV16Cm instruction formats. Source: the authors

It is important to point out some main differences:

  • RV16Cm uses only 5 instruction formats: CRm (Compact-Register), CIm (Compact-Immediate), CLm (Compact-Load), CSm (Compact-Store) and CBm (Compact-Branch). So, naturally, RV16Cm instructions do not support stack-relative stores, unconditional branching to absolute values (jump) or wide immediates.
  • Not all the functions are implemented for each original instruction format. This can be easily seen in CRm, which has a Funct3 field rather than Funct4 (allowing fewer instructions than the original CR).
  • The load, store and branch formats had their immediates renamed to addresses and have only one register field (load and store) or none (branch), which is exactly one fewer than the original CL/CS/CB formats. For simplicity of immediate handling and instruction implementation, loads, stores and branches use only absolute addresses. This has implications for register and memory access, which are discussed in a later section.
  • The fields do not vary in size across the instruction formats (notice that there are no merged cells). In the original formats, bit 5 (counting from the right), for example, could be part of the 5-bit rs2 field in the CR format or of the 8-bit immediate imm in CIW. In RV16Cm, it is always part of a 3-bit field.

These modifications all target a simpler and more didactic implementation of a RISC-V-inspired processor, the last one being perhaps the most important for didactic purposes.

Implemented Instructions

As mentioned earlier, the modified instruction formats do not implement all of the original instructions. Due to the field changes, some instructions are not possible (CRm cannot have all the instructions because it has Funct3 instead of Funct4, for example), some lose their meaning, and others are impossible because of other hardware limitations of the project (such as the lack of floating point). However, that does not mean that the only possible instructions are the ones implemented here: there is definitely room for adding new instructions and states to the finite state machine. The 9 instructions implemented here were chosen for being some of the most basic ones, capable of doing basic operations; with a proper assembler, this machine can definitely implement some serious functionality. The following sub-sections describe the implemented instructions.

CRm format

The Compact Register modified (CRm) format implements the 5 functions of the ALU, operating on registers rd and rs2 and storing the result in rd. Its op field equals 00 and the Funct3 field distinguishes between the 5 different instructions. The following table describes the fields and meaning of each CRm instruction.

| Instruction | Operation | Funct3 | Unused | rd/rs1 | rs2 | Op |
|---|---|---|---|---|---|---|
| add rd, rs2 | rd <- rd + rs2 | 000 | 00000 | rd/rs1[2:0] | rs2[2:0] | 00 |
| sub rd, rs2 | rd <- rd - rs2 | 001 | 00000 | rd/rs1[2:0] | rs2[2:0] | 00 |
| and rd, rs2 | rd <- rd AND rs2 | 010 | 00000 | rd/rs1[2:0] | rs2[2:0] | 00 |
| or rd, rs2 | rd <- rd OR rs2 | 011 | 00000 | rd/rs1[2:0] | rs2[2:0] | 00 |
| slt rd, rs2 | rd <- rd SLT rs2 | 101 | 00000 | rd/rs1[2:0] | rs2[2:0] | 00 |

Table 3. Instructions of the CRm format. Source: the authors

Notice that:

  • The Funct3 field of each operation is exactly the encoding that the ALU uses for that operation, meaning that no extra decoder for the ALU is needed
  • The choice of a Funct3 field instead of the original Funct4 is not due to lack of space, since there are 5 unused bits that could be used for this or to address more registers if the architecture needed it. As mentioned earlier, the choice targets simplicity
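
Because every RV16Cm format uses the same field boundaries, splitting an instruction into fields takes only fixed bit slices. The following is a minimal sketch, assuming the fields are laid out from bit 15 down to bit 0 in the order listed in the RV16Cm format table; the module and port names are hypothetical and not taken from the repository.

```verilog
// Hypothetical helper (names are illustrative, not from the repository).
// Assumes the 3/5/3/3/2-bit fields occupy bits 15:13, 12:8, 7:5, 4:2 and 1:0.
module rv16cm_fields_sketch (
    input  wire [15:0] instr,
    output wire [2:0]  funct3,    // bits 15:13
    output wire [4:0]  field_hi,  // bits 12:8: 00000 in CRm, Imm/Addr[7:3] otherwise
    output wire [2:0]  field_mid, // bits 7:5: rd/rs1, or Addr[10:8] in CBm
    output wire [2:0]  field_lo,  // bits 4:2: rs2 in CRm, Imm/Addr[2:0] otherwise
    output wire [1:0]  op         // bits 1:0
);
    assign funct3    = instr[15:13];
    assign field_hi  = instr[12:8];
    assign field_mid = instr[7:5];
    assign field_lo  = instr[4:2];
    assign op        = instr[1:0];
endmodule
```

Under this layout, for a CRm instruction funct3 can be forwarded to the ALU as its operation select, while field_mid and field_lo address the register bank.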

CIm Format

The Compact Immediate modified (CIm) format implements the ALU functions operating on immediates. For the scope of this project, only one function is implemented (addi, which adds an immediate to a register and saves the result in the same register). This makes the processor functional (addi is especially important for iterating) without making its implementation too redundant (the other 4 ALU operations are already implemented on registers). Its op field is 01, and the 8-bit immediate is sign-extended to 16 bits before being added to the register operand. The following table describes the fields and meaning of the sole instruction implemented for the CIm format.

| Instruction | Operation | Funct3 | Imm[7:3] | rd/rs1 | Imm[2:0] | Op |
|---|---|---|---|---|---|---|
| addi rd, Imm | rd <- rd + s_ext(Imm) | 000 | Imm[7:3] | rd/rs1[2:0] | Imm[2:0] | 01 |

Table 4. Instructions of the CIm format. Source: the authors

CLm Format

The Compact Load modified (CLm) format implements the load instructions, which bring data from memory into a register. Again, the implementation has only one instruction, load word (lw), which loads one word of data (16 bits, the width of a register) from an address into a register. Its op field is 10 (which it shares with the CSm instruction format; they differ in the Funct3 field). The following table describes the fields and meaning of the load word instruction implemented for the CLm format:

| Instruction | Operation | Funct3 | Addr[7:3] | rd | Addr[2:0] | Op |
|---|---|---|---|---|---|---|
| lw rd, Addr | rd <- M[Addr] | 000 | Addr[7:3] | rd[2:0] | Addr[2:0] | 10 |

Table 5. Instructions of the CLm format. Source: the authors

CSm format

The Compact Store modified (CSm) format implements the store instructions, which save data from a register into memory. It also has only one instruction, store word (sw), which saves one word of data (16 bits) from a register to a memory address. Its op field is also 10. The following table describes the fields and meaning of the store word instruction implemented for the CSm format:

| Instruction | Operation | Funct3 | Addr[7:3] | rs1 | Addr[2:0] | Op |
|---|---|---|---|---|---|---|
| sw rs1, Addr | M[Addr] <- rs1 | 001 | Addr[7:3] | rs1[2:0] | Addr[2:0] | 10 |

Table 6. Instructions of the CSm format. Source: the authors

CBm format

The Compact Branch modified (CBm) format implements the conditional branch instructions, which conditionally set the PC to a given address. In this case, once more because additional implementations would only add complexity and be redundant for didactic purposes, there is only one instruction: bneqz branches if the zero flag is false, which can be used extensively in loops, for example. Its op field is 11. The following table describes the fields and meaning of the branch if not equal zero instruction of the CBm format:

| Instruction | Operation | Funct3 | Addr[7:3] | Addr[10:8] | Addr[2:0] | Op |
|---|---|---|---|---|---|---|
| bneqz Addr | PC <- Addr if Zero==False | 000 | Addr[7:3] | Addr[10:8] | Addr[2:0] | 11 |

Table 7. Instructions of the CBm format. Source: the authors. PS: The PC is a register named program counter, described in the registers section.

Memory Access

This sub-section describes the memory addresses that are accessed by the processor. First of all, the RAM is not implemented as part of the processor on the FPGA but is used as an external block (in the test benches, the Altera RAM from the FPGA kit is used). The instructions section describes how many address bits each instruction format carries: the lw and sw instructions take an 8-bit address, while bneqz takes an 11-bit address. Mapping this to memory:

$$LSWAddr = (2^{8} - 1)_d = (255)_d = (00FF)_h$$

$$BAddr = (2^{11} - 1)_d = (2047)_d = (07FF)_h$$

Here LSWAddr and BAddr correspond to the largest address that the load/store and branch instructions can access, respectively. This limitation follows from the choice to access addresses only by absolute value: the whole address must fit into the instruction and, with a reduced instruction of 16 bits, fewer addresses are accessible.

So, basically, the data memory range is 0x000 - 0x0FF, while the program memory range is 0x000 - 0x7FF.

Overall architecture comments

Since the architecture is based on RISC-V, but is not exactly RISC-V, it is important to point out clearly the adaptations made in RV16Cm:

| | RV32C | RV16Cm |
|---|---|---|
| Data bus | 32 bits | 16 bits |
| Instruction size | 16 bits | 16 bits |
| Register size | 32 bits | 16 bits |
| Number of registers | 16 | 8 |
| Register instructions | 2 registers; rd <- rd op rs | 2 registers; rd <- rd op rs |
| Load instructions | rt <- M[rs + sign_ext(imm)] | rd <- M[imm] |
| Store instructions | M[rs + sign_ext(imm)] <- rt | M[imm] <- rs1 |
| Branch instructions | U/J type; bneq rs, rt, label; if rs != rt, PC <- PC + sign_ext(label) | J type; bneqz label; if ZF=0, PC <- label |

Table 8. Main differences and adaptations of RV16Cm compared to RV32C

Finite State Machine

With every instruction properly documented, the control unit can be fully described as a finite state machine (FSM) that goes through all the states needed to perform the instructions: fetch, decode and execute. But since the processor is multi-cycle, one instruction may take more than one clock cycle to execute, meaning that the FSM will not be as simple as 3 states. Firstly, it is important to list every input and output that the FSM should control. The main diagram of the processor shows all the hardware attached to the control unit, and thus all the inputs and outputs it has to deal with. That information is summarized in the following table:

| Variable | Input or Output | Number of bits | Meaning |
|---|---|---|---|
| op | Input | 2 | The op field from the instruction |
| Funct3 | Input | 3 | The Funct3 field from the instruction |
| Zero | Input | 1 | The Zero flag from the ALU, which indicates whether its last result was 0 |
| PCUpdate | Output | 1 | Controls whether the PC should be updated |
| Branch | Output | 1 | Indicates a branch case |
| AdrSrc | Output | 1 | Controls a MUX that selects whether the address sent to the memory comes from the PC (0) or from the output of the ResultSrc MUX (1) |
| MemWrite | Output | 1 | Determines whether the memory should be written; connected to the WriteEnable of the memory |
| IRWrite | Output | 1 | Determines whether the Instruction Register should be written; connected to its enable |
| ResultSrc | Output | 2 | Controls a MUX whose output goes to the PC (as input) and to the MUX controlled by AdrSrc. Its inputs are a register that holds the ALU result (ALUOut, 00), a register that holds the data read from memory (Data, 01) and the direct output of the ALU's switching circuit (ALUResult, 10) |
| ALUControl | Output | 3 | Selects the ALU operation |
| ALUSrcB | Output | 2 | Controls whether the SrcB input of the ALU is the register that holds the RD2 output of the register bank (00), the sign/zero-extended immediate from the Extend block (01), or a hardwired 2 (10) |
| ALUSrcA | Output | 2 | Controls whether the SrcA input of the ALU is the PC (00), a hardwired 0 (01), or the register that holds the RD1 output of the register bank (10) |
| ImmSrc | Output | 2 | Controls the Extend module, stating whether the immediate should be sign- or zero-extended and where the immediate is placed in the instruction |
| RegWrite | Output | 1 | Controls writes to the register bank; connected to WE3 |

Table 9. Inputs and outputs of the finite state machine

It is important to note that:

  • The Funct3 field, which selects the operation to be done by the ALU, uses the same operation encoding as ALUControl, as described earlier. This means that, in the instructions where it is used, one is directly connected to the other without any additional decoder
  • The hardwired 2 on the MUX controlled by ALUSrcB is used to increment the PC, while the hardwired 0 on the MUX controlled by ALUSrcA lets the ALU pass an extended immediate through unchanged (0 + immediate, as in the MemAdr and BNEQZ states)
  • PCUpdate and Branch are not actually outputs of the control unit, only of the finite state machine. Together, through a switching circuit, they compose PCWrite, which determines whether the PC is written and is directly attached to the EN of the PC register. Since bneqz branches when the Zero flag is false, the switching circuit that defines PCWrite is: $PCWrite = (PCUpdate) \text{ OR } (Branch \text{ AND } (\text{NOT } Zero))$

Secondly, it is just as important to list all the states conceived for the FSM:

| State name | State function |
|---|---|
| Fetch | Fetches the next instruction from the address in PC and increments PC |
| Decode | Decodes the instruction and jumps to the corresponding state |
| MemAdr | Gets the memory address from an instruction by parsing it in the immediate module and adding it to 0 in the ALU |
| MemRead | Uses the ALUOut register (holding the previously computed address) to read from memory |
| MemWB | Writes the data read from memory back to the corresponding register of the register file |
| MemWrite | Accesses the address in ALUOut and writes to it the value of RD1 (one of the two simultaneous read ports of the register bank) |
| ExecuteR | Executes a CRm instruction by reading the right registers from the register bank and performing the corresponding operation in the ALU |
| ALUWB | Writes the ALU result back to the register bank |
| ExecuteI | Executes a CIm instruction by correctly parsing the immediate in the immediate module and performing the corresponding operation in the ALU |
| BNEQZ | Parses the immediate in the immediate module, adds it to 0 in the ALU and places the result in PC to branch |

Table 10. States of the FSM

Truth Table

With all control variables of the control unit defined, the following truth table describes how each state should behave

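| State | Branch | PCUpdate | AdrSrc | MemWrite | IRWrite | ResultSrc | ALUControl | ALUSrcB | ALUSrcA | ImmSrc | RegWrite |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fetch | 0 | 1 | 0 | 0 | 1 | 10 | 000 | 10 | 00 | dd | 0 |
| Decode | 0 | 0 | d | 0 | 0 | dd | ddd | dd | dd | dd | 0 |
| MemAdr | 0 | 0 | d | 0 | 0 | dd | 000 | 01 | 01 | 01 | 0 |
| MemRead | 0 | 0 | 1 | 0 | 0 | 00 | ddd | dd | dd | dd | 0 |
| MemWB | 0 | 0 | d | 0 | 0 | 01 | ddd | dd | dd | dd | 1 |
| MemWrite | 0 | 0 | 1 | 1 | 0 | 00 | ddd | dd | dd | dd | 0 |
| ExecuteR | 0 | 0 | d | 0 | 0 | dd | funct3 | 00 | 10 | dd | 0 |
| ALUWB | 0 | 0 | d | 0 | 0 | 00 | funct3 | dd | dd | dd | 1 |
| ExecuteI | 0 | 0 | d | 0 | 0 | dd | funct3 | 01 | 10 | 00 | 0 |
| BNEQZ | not(ZF) | 0 | d | 0 | 0 | 10 | 000 | 01 | 01 | 11 | 0 |

Table 11. Truth table of the control unit. Branch and PCUpdate together form PCWrite; d marks a don't-care bit.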

The truth table allows the finite state machine (FSM) that composes the control unit to be classified: the outputs depend only on the current state, not on the inputs, which means the FSM is a Moore machine. There are 2 clear exceptions that would make this classification invalid: in the ExecuteR, ALUWB and ExecuteI states, the ALUControl signal depends on the Funct3 input from the instruction, and in the BNEQZ state, the Branch signal depends on the ZF input from the ALU. Instead of reclassifying the FSM as a Mealy machine (which would be the more general case, but could make the implementation less modular), it was chosen to isolate those outputs from the FSM.

So, basically, the ALUControl output comes from a separate decoder that uses the Funct3 field, while the Branch output is not implemented as such: a switching circuit uses the current state instead of a Branch output (this is logically equivalent to an implicit Branch output always set to 1 in the truth table). To clarify, the code snippet that sets pc_write (located in control_unit.v) is:

pc_write = pc_update | (~zero_flag & (curr_state == `BNEZ));

As described, the switching circuit is outside the FSM (which is described in CU_main_decoder.v) and placed directly in the control unit, and it does not use a branch signal, only the curr_state signal.

Another important observation is that there are many don't-care terms in the truth table. This indicates that many parts of the processor are unused in many states, meaning that multiple states could be processed simultaneously. That is, the don't-care terms suggest that the processor's speed could improve considerably with a future pipelined implementation.

Table of States

With all input variables of the control unit defined, the following table of states describes how each state should transition according to the inputs:

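| Current State | OP Field | Funct3 Field | Next State |
|---|---|---|---|
| Fetch | dd | ddd | Decode |
| Decode | 00 | ddd | ExecuteR |
| Decode | 01 | ddd | ExecuteI |
| Decode | 10 | ddd | MemAdr |
| Decode | 11 | ddd | BNEQZ |
| MemAdr | 10 | 000 | MemRead |
| MemAdr | 10 | 001 | MemWrite |
| MemRead | dd | ddd | MemWB |
| MemWB | dd | ddd | Fetch |
| MemWrite | dd | ddd | Fetch |
| ExecuteR | dd | ddd | ALUWB |
| ALUWB | dd | ddd | Fetch |
| ExecuteI | dd | ddd | ALUWB |
| BNEQZ | dd | ddd | Fetch |

Table 12. Table of states (describing the state transitions) of the control unit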
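
The table of states maps naturally onto a combinational case statement. The following is a minimal sketch of such next-state logic, not the repository's CU_sequential.v: the module and port names are hypothetical, the state encodings follow the parameter values listed in the code organization section, inputs that match no transition hold the current state (the behaviour described for the diagram), and sending an undefined state code back to Fetch is an assumption.

```verilog
// Illustrative next-state logic for the table of states (names are assumptions).
module next_state_sketch (
    input  wire [3:0] curr_state,
    input  wire [1:0] op,
    input  wire [2:0] funct3,
    output reg  [3:0] next_state
);
    localparam FETCH    = 4'b0000, DECODE   = 4'b0001, MEMADR = 4'b0010,
               MEMREAD  = 4'b0011, MEMWB    = 4'b0100, MEMWRITE = 4'b0101,
               EXECUTER = 4'b0110, ALUWB    = 4'b0111, EXECUTEI = 4'b1000,
               BNEQZ    = 4'b1001;

    always @(*) begin
        next_state = curr_state; // inputs matching no transition hold the state
        case (curr_state)
            FETCH: next_state = DECODE;
            DECODE: begin
                case (op)
                    2'b00:   next_state = EXECUTER; // CRm
                    2'b01:   next_state = EXECUTEI; // CIm
                    2'b10:   next_state = MEMADR;   // CLm or CSm
                    2'b11:   next_state = BNEQZ;    // CBm
                    default: next_state = curr_state;
                endcase
            end
            MEMADR: begin
                if (op == 2'b10 && funct3 == 3'b000)      next_state = MEMREAD;
                else if (op == 2'b10 && funct3 == 3'b001) next_state = MEMWRITE;
            end
            MEMREAD:            next_state = MEMWB;
            EXECUTER, EXECUTEI: next_state = ALUWB;
            MEMWB, MEMWRITE, ALUWB, BNEQZ: next_state = FETCH;
            default:            next_state = FETCH; // assumed recovery for unused codes
        endcase
    end
endmodule
```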

Diagram

flowchart TB

Fetch((______FETCH______<br> AdrSrc = 0 <br> IRWrite <br> ALUSrcA = 00 <br> ALUSrcB = 10 <br> ALUControl = 000 <br> ResultSrc = 10 <br> PCUpdate)) --> Decode((DECODE))


Decode --> |"op = 10 (CLm OR CSm)"|MemAdr((MemAdr <br> ALUSrcA = 01 <br> ALUSrcB = 01 <br> ALUControl = 000))

Decode -->|"op = 00 (CRm)"| ExecuteR((ExecuteR <br> ALUSrcA = 10 <br> ALUSrcB = 00 <br> ALUControl = funct3))

Decode -->|"op = 01 (CIm)"| ExecuteI((ExecuteI <br> ALUSrcA = 10 <br> ALUSrcB = 01 <br> ALUControl = funct3))


Decode -->|"op = 11 (CBm)"| BNEQZ((BNEQZ <br> ALUSrcA = 01 <br> ALUSrcB = 01 <br> ResultSrc = 10 <br> ALUControl = 000 <br> Branch))

MemAdr --> |"funct3 = 000 (CLm)"|MemRead((MemRead <br> ResultSrc = 00 <br> AdrSrc = 1))

MemAdr --> |"funct3 = 001 (CSm)"|MemWrite((MemWrite <br> ResultSrc = 00 <br> AdrSrc = 1 <br> MemWrite))

MemRead --> MemWB((MemWB <br> ResultSrc = 01 <br> RegWrite))

ExecuteR & ExecuteI --> ALUWB((ALUWB <br> ResultSrc = 00 <br> RegWrite))

MemWB & MemWrite & ALUWB & BNEQZ --> Fetch
Figure 1. Flowchart of the FSM

There are some important things to mention about the diagram that describes the FSM:

  • In the diagram, the 1-bit signals are 1 where they are mentioned and 0 where they are omitted, while the multi-bit signals are always shown with their value
  • The state transitions are all well defined in terms of inputs, but there can be invalid inputs, which are dealt with by keeping the previous state. That is, if the inputs do not match any specified transition, the default is to hold the state. This design option was chosen to minimize state transitions, and with them the switching of signals and, thus, the heat dissipated on an integrated circuit. Note that no test bench or quantitative analysis was made for this design option; it simply follows a convention.
  • The state flow always starts at Fetch after a reset

Immediate module

Apart from the main control unit, there is a module called the Immediate module that operates in parallel with the FSM, parsing the immediates from the instructions and providing them correctly according to the instruction format. This module is described here in the Control Unit section since the two were implemented together (even on the same dev branch) because of their shared role of parsing.

Basically, it receives the instruction and a 2-bit signal that comes from the FSM and indicates the immediate format of the instruction. The instruction format descriptions show that the immediate varies in size, having either 8 or 11 bits. Additionally, the immediates must be extended to fit the 16-bit bus to the ALU. This extension can be done in two ways: immediates that are operands of ALU operations are sign-extended, while immediates that are addresses are zero-extended.

Because of this, the immediate module control signal (ImmSrc) is 2 bits wide: the first bit selects the size (0 for 8 bits, 1 for 11 bits) and the second selects the extension to be performed (0 for sign extension, 1 for zero extension). The following table describes the control signal of the immediate extension:

| ImmSrc | Size of immediate | Extension to be performed | Instruction Format | FSM State that uses it |
|---|---|---|---|---|
| 00 | 8 bits | Sign Extension | CIm | ExecuteI |
| 01 | 8 bits | Zero Extension | CLm or CSm | MemAdr |
| 10 | 11 bits | Sign Extension | - | - |
| 11 | 11 bits | Zero Extension | CBm | BNEQZ |

Table 13. ImmSrc control of the immediate module

As sign extension of an 11-bit immediate (ImmSrc = 10) should never happen in the conceived FSM, it is not written in the code, and the default case sets the immediate to a 16-bit 0 value.
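
The ImmSrc table can be sketched as a small combinational module. This is a minimal illustration, not the repository's CU_immediate_extension.v: the module and port names are hypothetical, and the bit positions assume that the 3/5/3/3/2-bit fields of the RV16Cm formats occupy bits 15:13, 12:8, 7:5, 4:2 and 1:0.

```verilog
// Illustrative immediate extension (names and bit positions are assumptions).
module imm_extend_sketch (
    input  wire [15:0] instr,
    input  wire [1:0]  imm_src,
    output reg  [15:0] imm_ext
);
    // 8-bit field: Imm/Addr[7:3] in bits 12:8, Imm/Addr[2:0] in bits 4:2
    wire [7:0]  imm8  = {instr[12:8], instr[4:2]};
    // 11-bit field (CBm): Addr[10:8] additionally sits in bits 7:5
    wire [10:0] imm11 = {instr[7:5], instr[12:8], instr[4:2]};

    always @(*) begin
        case (imm_src)
            2'b00:   imm_ext = {{8{imm8[7]}}, imm8}; // 8 bits, sign-extended (CIm)
            2'b01:   imm_ext = {8'b0, imm8};         // 8 bits, zero-extended (CLm, CSm)
            2'b11:   imm_ext = {5'b0, imm11};        // 11 bits, zero-extended (CBm)
            default: imm_ext = 16'b0;                // 2'b10 is never issued by the FSM
        endcase
    end
endmodule
```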

Code organization and implementation

States as parameters

Firstly, one important feature added to the Verilog code is the description of the states as parameters. For example, the state Fetch is defined as follows:

parameter fetch = 4'b0000;

This representation not only improves code legibility, but also allows the developers to use a special feature of the Intel® Quartus® software, which is used for loading the Verilog hardware description onto an Altera FPGA kit (the brand available to the developers in the laboratory).

The Quartus® State Machine Editor can change the state values to different sequences (not only the binary sequence 0000, 0001, 0010... but also, for example, the Gray sequence 000, 001, 011...). This allows the developers to test not only the hardcoded default binary sequence, but also other sequences. The sequence assigned to the states may affect how the synthesis tool maps the circuit onto the FPGA, affecting the number of logic cells, for example.

So, basically, the states are defined as:

parameter fetch = 4'b0000;
parameter decode = 4'b0001;
parameter memadr = 4'b0010;
parameter memread = 4'b0011;
parameter memwb = 4'b0100;
parameter memwrite = 4'b0101;
parameter executer = 4'b0110;
parameter aluwb = 4'b0111;
parameter executei = 4'b1000;
parameter bneq = 4'b1001;

Code modularization

The files that contain the control unit code are modularized as follows:

control_unit.v is the top control unit module, meaning it instantiates the other ones. Basically, it instantiates a decoder called the main decoder, which sets most of the outputs (CU_main_decoder.v), a module that updates the state (CU_sequential.v) and another decoder that sets the ALU control output (alu_decoder.v), in addition to containing the switching circuit that defines pc_write, explained above.

So, basically, CU_main_decoder.v implements the FSM truth table (apart from ALUControl), while CU_sequential.v implements the FSM table of states. As said earlier, alu_decoder.v sets only the ALUControl signal. Inside the decoders, there are only case statements (inside always blocks triggered on the positive clock edge) that are really simple and do nothing beyond the tables described throughout this text. The case statements make the Moore machine properties visible: the case in CU_main_decoder.v depends only on the state, while the one in alu_decoder.v depends on the input funct3. The only specific note is that in CU_main_decoder.v, as there were too many outputs, they are all assigned to a 13-bit-wide buffer_out reg that maps onto the outputs, which makes the code briefer.
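
To illustrate the buffer_out idea, the sketch below packs the truth table into one 13-bit control word per state. It is a simplified illustration rather than the contents of CU_main_decoder.v: the module and port names, the bit order inside buffer_out, the purely combinational always block and the choice to drive every don't-care as 0 are all assumptions. ALUControl and Branch are left out, since they are produced outside the FSM.

```verilog
// Moore-style output decoder sketch: outputs depend only on the state.
// Bit order of the control word (an assumption):
// {pc_update, adr_src, mem_write, ir_write, result_src[1:0],
//  alu_src_b[1:0], alu_src_a[1:0], imm_src[1:0], reg_write}
module main_decoder_sketch (
    input  wire [3:0] curr_state,
    output wire       pc_update,
    output wire       adr_src,
    output wire       mem_write,
    output wire       ir_write,
    output wire [1:0] result_src,
    output wire [1:0] alu_src_b,
    output wire [1:0] alu_src_a,
    output wire [1:0] imm_src,
    output wire       reg_write
);
    reg [12:0] buffer_out;

    assign {pc_update, adr_src, mem_write, ir_write, result_src,
            alu_src_b, alu_src_a, imm_src, reg_write} = buffer_out;

    always @(*) begin
        case (curr_state)
            4'b0000: buffer_out = 13'b1_0_0_1_10_10_00_00_0; // Fetch
            4'b0001: buffer_out = 13'b0_0_0_0_00_00_00_00_0; // Decode
            4'b0010: buffer_out = 13'b0_0_0_0_00_01_01_01_0; // MemAdr
            4'b0011: buffer_out = 13'b0_1_0_0_00_00_00_00_0; // MemRead
            4'b0100: buffer_out = 13'b0_0_0_0_01_00_00_00_1; // MemWB
            4'b0101: buffer_out = 13'b0_1_1_0_00_00_00_00_0; // MemWrite
            4'b0110: buffer_out = 13'b0_0_0_0_00_00_10_00_0; // ExecuteR
            4'b0111: buffer_out = 13'b0_0_0_0_00_00_00_00_1; // ALUWB
            4'b1000: buffer_out = 13'b0_0_0_0_00_01_10_00_0; // ExecuteI
            4'b1001: buffer_out = 13'b0_0_0_0_10_01_01_11_0; // BNEQZ
            default: buffer_out = 13'b0;                      // invalid state: all 0
        endcase
    end
endmodule
```

Packing the outputs into a single word keeps each state on one line, so the case statement can be checked row by row against the truth table.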

So, basically, control_unit.v has some instantiations that wire the decoders together, and the state is updated inside an always block that provides the sequential behaviour on each clock cycle and also implements a synchronous reset. The code snippet for the always block is only:

always @(posedge clk) begin
    // synchronous reset to fetch, otherwise advance to the next state
    if (reset == 1'b1) curr_state <= fetch;
    else curr_state <= next_state;
end

where next_state is a wire connected to the output of the CU_sequential instance.

Before the instantiations, other regs are defined simply to split the instruction into the funct3 and op fields.

The immediate module follows a similar modularization, with CU_immediate_extension.v being the topmost module, which may use N_to_16_sig_extend.v or N_to_16_zero_extend.v depending on the imm_src signal inside the case statement (in this case, the always block is sensitive to the instruction rather than to clk, meaning the immediate module is a switching circuit and does not depend on the clock). N_to_16_sig_extend.v and N_to_16_zero_extend.v are not as trivial as the modules above, but they basically use Verilog syntax and parametrized modules to perform sign/zero extension from N bits (passed as a parameter, used only for N=11 or N=8) to 16 bits.
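
A parametrized extension from N to 16 bits can be written with Verilog's replication operator. The modules below are a minimal sketch of the technique, not the contents of N_to_16_sig_extend.v or N_to_16_zero_extend.v: the module and port names are hypothetical, and N is assumed to be smaller than 16.

```verilog
// Sign extension: replicate the top input bit into the (16 - N) upper bits.
module sign_extend_sketch #(parameter N = 8) (
    input  wire [N-1:0] in,
    output wire [15:0]  out
);
    assign out = {{(16-N){in[N-1]}}, in};
endmodule

// Zero extension: fill the (16 - N) upper bits with zeros.
module zero_extend_sketch #(parameter N = 8) (
    input  wire [N-1:0] in,
    output wire [15:0]  out
);
    assign out = {{(16-N){1'b0}}, in};
endmodule
```

An 11-bit zero extension would then be instantiated as `zero_extend_sketch #(.N(11)) ext11 (.in(addr11), .out(addr16));`, where addr11 and addr16 are hypothetical signal names.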

Besides, all the case statements have a default branch that sets the outputs to all 0 in case of invalid inputs.
