Control Unit

Introduction

As described before, the processor architecture is inspired by the RV32C standard. However, the project works with 8 registers, 16-bit registers and modified 16-bit instructions, so it is called RV16Cm (RISC-V 16-bit compact-modified instruction set). This name was conceived by the designers, who are not aware of any such standard existing previously; it is simply a convenient way of adapting the RV standards for didactic purposes.

Instruction Formats

Most of the RV16Cm instruction set is inspired by the RV32C instruction set. RV32C instructions are 16 bits wide (in accordance with their purpose as compact instructions), so, technically, the 16-bit architecture could be compatible with RV32C. However, the state machine of this project was never intended to implement every RV32C instruction, for simplicity and for other hardware reasons (for instance, the ALU was chosen not to support floating-point numbers), so the RV32C instructions were modified to create the so-called RV16Cm. To visualize the modifications, the following tables show the instruction formats of both sets.

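Format	Meaning	Fields (MSB to LSB, with bit ranges)
CR	Register	funct4[15:12], rd/rs1[11:7], rs2[6:2], op[1:0]
CI	Immediate	funct3[15:13], imm[12], rd/rs1[11:7], imm[6:2], op[1:0]
CSS	Stack-relative Store	funct3[15:13], imm[12:7], rs2[6:2], op[1:0]
CIW	Wide Immediate	funct3[15:13], imm[12:5], rd[4:2], op[1:0]
CL	Load	funct3[15:13], imm[12:10], rs1[9:7], imm[6:5], rd[4:2], op[1:0]
CS	Store	funct3[15:13], imm[12:10], rs1[9:7], imm[6:5], rs2[4:2], op[1:0]
CB	Branch	funct3[15:13], offset[12:10], rs1[9:7], offset[6:2], op[1:0]
CJ	Jump	funct3[15:13], jump target[12:2], op[1:0]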

Table that describes RV32C instruction formats. Source: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-209.pdf

Fields	3 bits	5 bits	3 bits	3 bits	2 bits
CRm	Funct3	00000	rd/rs1	rs2	op
CIm	Funct3	Imm8[7:3]	rd/rs1	Imm8[2:0]	op
CLm	Funct3	Addr8[7:3]	rd	Addr8[2:0]	op
CSm	Funct3	Addr8[7:3]	rs1	Addr8[2:0]	op
CBm	Funct3	Addr11[7:3]	Addr11[10:8]	Addr11[2:0]	op

Table that describes RV16Cm instruction formats. Source: the authors

It is important to point out some main differences:

RV16Cm uses only 5 instruction formats: CR (Compact-Register), CI (Compact-Immediate), CL (Compact-Load), CS (Compact-Store) and CB (Compact-Branch). So, naturally, RV16Cm instructions do not support stack-relative stores, unconditional jumps or wide immediates.
Not every function of each original instruction format is implemented. This is easy to see in CRm, which has a Funct3 field instead of Funct4 (allowing fewer instructions than the original CR).
The load, store and branch formats had their immediates renamed to addresses and have only one (load and store) or no (branch) register field, exactly one fewer than the original CL/CS/CB formats. For simpler immediate handling and instruction implementation, loads, stores and branches use only absolute addresses. This has implications for register and memory access, which are discussed in a later section.
The fields do not vary in size across instruction formats (notice that there are no merged cells). In the original formats, bit 5 (counting from the right), for example, could be part of the 5-bit rs2 field in the CR format or of the 8-bit immediate imm in CIW. In RV16Cm, it is always part of a 3-bit field.

These modifications all target a simpler and more didactic implementation of a RISC-V inspired processor, the last one being perhaps the most important for didactic purposes.
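To make the fixed-width layout concrete, here is a minimal Python sketch (a software model, not the project's Verilog) that packs and unpacks the five RV16Cm fields. It assumes that the columns of the RV16Cm formats table run from the most significant bit to the least significant bit; the names `FIELDS`, `pack`, `unpack`, `hi5`, `mid3` and `lo3` are hypothetical and exist only for this illustration.

```python
# Assumed layout, MSB to LSB, taken from the RV16Cm formats table:
# [15:13] funct3 | [12:8] 5-bit field | [7:5] 3-bit field | [4:2] 3-bit field | [1:0] op
# Each entry is (hypothetical field name, shift, width).
FIELDS = (("funct3", 13, 3), ("hi5", 8, 5), ("mid3", 5, 3), ("lo3", 2, 3), ("op", 0, 2))


def pack(**values):
    """Pack named field values into one 16-bit instruction word."""
    word = 0
    for name, shift, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
    return word


def unpack(word):
    """Split a 16-bit instruction word back into its five fixed-size fields."""
    return {name: (word >> shift) & ((1 << width) - 1) for name, shift, width in FIELDS}


# CRm example: sub r3, r5 -> funct3 = 001, 00000, rd/rs1 = 3, rs2 = 5, op = 00
word = pack(funct3=0b001, mid3=3, lo3=5, op=0b00)
print(f"{word:016b}")  # 0010000001110100
print(unpack(word))
```

Because every format shares the same five slots, a single table of shifts and widths is enough for all instructions; only the interpretation of each slot changes with the format.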

Implemented Instructions

As mentioned earlier, the modified instruction formats do not implement all the original instructions. Due to the field changes, some instructions are not possible (CR cannot have all its instructions because of Funct3 instead of Funct4, for example), some lose their meaning and others are impossible because of other hardware limitations of the project (such as the lack of floating point). However, the instructions implemented here are not the only possible ones, and there is definitely room for adding new instructions and states to the finite state machine. The 9 instructions implemented here were chosen for being some of the most basic ones and for being capable of basic operations; with a proper assembler, this machine can implement some serious functionality. The following sub-sections describe the implemented instructions.

CRm format

The Compact Register modified (CRm) format implements the 5 functions of the ALU, operating on registers rd and rs2 and storing the result in rd. Its op field equals 00 and the Funct3 field distinguishes between the 5 different instructions. The following table describes the fields and meaning of each instruction of the CRm format.

Instruction	Operation	Funct3	rd/rs1	rs2
add rd, rs2	rd <- rd + rs2	000	rd/rs1[2:0]	rs2[2:0]
sub rd, rs2	rd <- rd - rs2	001	rd/rs1[2:0]	rs2[2:0]
and rd, rs2	rd <- rd AND rs2	010	rd/rs1[2:0]	rs2[2:0]
or rd, rs2	rd <- rd OR rs2	011	rd/rs1[2:0]	rs2[2:0]
slt rd, rs2	rd <- rd SLT rs2	101	rd/rs1[2:0]	rs2[2:0]

Table that describes the instructions of CRm format. Source: the authors

Notice that:

The Funct3 field of each operation is exactly the encoding that the ALU uses for that operation, meaning that no extra decoder for the ALU is needed
The choice of a Funct3 field instead of the original Funct4 is not due to lack of space, since there are 5 unused bits that could be used for this or to address more registers if the architecture needed it. As mentioned earlier, the choice targets simplicity
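As an illustration of how Funct3 selects the operation, the following Python sketch models the five CRm operations on 16-bit values. It is a behavioural model under stated assumptions, not the project's ALU: the function names are hypothetical, and `slt` is assumed to be a signed comparison, which the text does not specify.

```python
MASK16 = 0xFFFF  # registers are 16 bits wide


def to_signed(value):
    """Interpret a 16-bit pattern as a two's complement number."""
    value &= MASK16
    return value - 0x10000 if value & 0x8000 else value


def alu(funct3, a, b):
    """Return (result, zero_flag) for the CRm operation selected by funct3."""
    a &= MASK16
    b &= MASK16
    if funct3 == 0b000:    # add
        result = a + b
    elif funct3 == 0b001:  # sub
        result = a - b
    elif funct3 == 0b010:  # and
        result = a & b
    elif funct3 == 0b011:  # or
        result = a | b
    elif funct3 == 0b101:  # slt (assumed signed)
        result = 1 if to_signed(a) < to_signed(b) else 0
    else:
        raise ValueError(f"funct3={funct3:03b} is not a CRm operation")
    result &= MASK16       # wrap to 16 bits
    return result, result == 0


print(alu(0b000, 0xFFFF, 1))  # (0, True): the sum wraps around and sets Zero
print(alu(0b001, 3, 5))       # (65534, False): -2 in two's complement
```

The Zero flag returned here is the same flag that the bneqz instruction later tests.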

CIm Format

The Compact Immediate modified (CIm) format implements ALU functions operating on immediates. For the scope of this project, only one function is implemented (addi, which adds an immediate to a register and saves the result in the same register). This makes the processor functional (addi is especially important for iterating) without making its implementation too redundant (the other 4 ALU operations are already implemented on registers). Its op field is 01, and the 8-bit immediate is sign-extended to 16 bits before being added to the register operand. The following table describes the fields and meaning of the sole instruction implemented for the CIm format.

Instruction	Operation	Funct3	Imm[7:3]	rd/rs1	Imm[2:0]	Op
addi rd, Imm	rd <- rd + s_ext(Imm)	000	Imm[7:3]	rs1[2:0]	Imm[2:0]	01

Table that describes the instructions of CIm format. Source: the authors
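The split immediate and its sign extension can be followed in a short Python sketch. It assumes the table columns run from bit 15 down to bit 0 (Funct3 in [15:13], Imm[7:3] in [12:8], rd/rs1 in [7:5], Imm[2:0] in [4:2], op in [1:0]); `encode_addi` and `sign_extend8` are hypothetical helper names, not part of the project.

```python
def sign_extend8(imm8):
    """Sign-extend an 8-bit value to 16 bits (copy bit 7 into bits 15..8)."""
    imm8 &= 0xFF
    return imm8 | 0xFF00 if imm8 & 0x80 else imm8


def encode_addi(rd, imm):
    """Encode addi rd, imm under the assumed CIm bit layout."""
    if not 0 <= rd < 8:
        raise ValueError("RV16Cm has only 8 registers")
    if not -128 <= imm <= 127:
        raise ValueError("immediate must fit in 8 signed bits")
    imm8 = imm & 0xFF
    return (
        (0b000 << 13)              # Funct3 = 000 (add)
        | ((imm8 >> 3) << 8)       # Imm[7:3]
        | (rd << 5)                # rd/rs1
        | ((imm8 & 0b111) << 2)    # Imm[2:0]
        | 0b01                     # op = 01 (CIm)
    )


word = encode_addi(2, -1)
print(f"{word:#06x}")                             # 0x1f5d
print(f"{sign_extend8(-1 & 0xFF):#06x}")          # 0xffff
print((5 + sign_extend8(-1 & 0xFF)) & 0xFFFF)     # 4: r2 = 5 becomes 4 after addi r2, -1
```

Adding the sign-extended pattern and wrapping to 16 bits gives the same result as subtracting 1, which is why a single addi is enough to count both up and down in a loop.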

CLm Format

The Compact Load modified (CLm) format implements the load instructions, which bring data from memory into a register. Again, the implementation has only one instruction, load word (lw), which loads a word of data from an address into a register. Since the registers and the data bus are 16 bits wide, a word here is 16 bits (2 bytes), not the 4 bytes of RV32. Its op field is 10 (which it shares with the CSm instruction format; they differ in the Funct3 field). The following table describes the fields and meaning of the load word instruction implemented for the CLm format:

Instruction	Operation	Funct3	Addr[7:3]	rd	Addr[2:0]	Op
lw rd,Addr	rd <- M[Addr]	000	Addr[7:3]	rd[2:0]	Addr[2:0]	10

Table that describes the instructions of CLm format. Source: the authors

CSm format

The Compact Store modified (CSm) format implements the store instructions, which save data from a register into memory. It also has only one instruction, store word (sw), which saves a 16-bit word of data from a register to a memory address. Its op field is also 10. The following table describes the fields and meaning of the store word instruction implemented for the CSm format:

Instruction	Operation	Funct3	Addr[7:3]	rs1	Addr[2:0]	Op
sw rs1,Addr	M[Addr] <- rs1	001	Addr[7:3]	rs1[2:0]	Addr[2:0]	10

Table that describes the instructions of CSm format. Source: the authors

CBm format

The Compact Branch modified (CBm) format implements the conditional branch instructions, which conditionally set the PC to a given address. Once more, because additional instructions would only add complexity and be redundant for didactic purposes, there is only one instruction: bneqz branches if the zero flag is false, which can be used extensively in loops, for example. Its op field is 11. The following table describes the fields and meaning of the branch if not equal to zero instruction of the CBm format:

Instruction	Operation	Funct3	Addr[7:3]	Addr[10:8]	Addr[2:0]	Op
bneqz Addr	PC <- Addr if Zero==False	000	Addr[7:3]	Addr[10:8]	Addr[2:0]	11

Table that describes the instructions of CBm format. Source: the authors

PS: The PC is a register named program counter, described in the registers section.
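The 11-bit branch address is scattered over three fields, which is easy to get wrong by hand. The Python sketch below shows one way to split and reassemble it, assuming the CBm table columns run from bit 15 down to bit 0 (Addr[7:3] in [12:8], Addr[10:8] in [7:5], Addr[2:0] in [4:2]). The helper names are hypothetical, and `next_pc` only models the branch decision, not the project's datapath.

```python
def encode_bneqz(addr):
    """Encode bneqz addr under the assumed CBm bit layout."""
    if not 0 <= addr <= 0x7FF:
        raise ValueError("branch address must fit in 11 bits")
    return (
        (0b000 << 13)                  # Funct3 = 000
        | (((addr >> 3) & 0x1F) << 8)  # Addr[7:3]
        | (((addr >> 8) & 0b111) << 5) # Addr[10:8]
        | ((addr & 0b111) << 2)        # Addr[2:0]
        | 0b11                         # op = 11 (CBm)
    )


def branch_target(word):
    """Reassemble the absolute 11-bit address from the three fields."""
    addr_7_3 = (word >> 8) & 0x1F
    addr_10_8 = (word >> 5) & 0b111
    addr_2_0 = (word >> 2) & 0b111
    return (addr_10_8 << 8) | (addr_7_3 << 3) | addr_2_0


def next_pc(word, pc_after_fetch, zero_flag):
    """PC <- Addr if the Zero flag is false, otherwise keep the incremented PC."""
    return pc_after_fetch if zero_flag else branch_target(word)


word = encode_bneqz(0x2A5)
print(f"{word:#06x}")                   # 0x1457
print(f"{branch_target(word):#05x}")    # 0x2a5
print(next_pc(word, 0x010, zero_flag=True))   # 16: Zero is set, so no branch
```

Since the address is absolute, the target does not depend on where the branch instruction itself is located.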

Memory Access

This sub-section describes the memory addresses that are accessed by the processor. First of all, the RAM is not implemented in the FPGA design itself and is used as an external component (the test benches use the Altera RAM memory from the FPGA kit). The instructions section describes how many address bits each instruction format receives: the lw and sw instructions receive an 8-bit address, while bneqz receives an 11-bit address. Mapping this to memory with some math:

$$LSWAddr = (2^{8} - 1)_d = (255)_d = (00FF)_h$$

$$BAddr = (2^{11} - 1)_d = (2047)_d = (07FF)_h$$

Where LSWAddr and BAddr correspond to the largest address that the load/store and branch instructions can access, respectively. This limitation follows from the choice to access addresses only by absolute value: the whole address must fit in the instruction and, with a reduced instruction of 16 bits, fewer addresses are accessible.

So, basically, the data memory range is 0x000 - 0x0FF, while the program memory range is 0x000 - 0x7FF.

Overall architecture comments

Since the architecture is based on RISC-V, but is not exactly RISC-V, it is important to point out clearly the adaptations made in RV16Cm:

	RV32C	RV16Cm
Data bus	32 bits	16 bits
Instruction size	16 bits	16 bits
Register size	32 bits	16 bits
Number of registers	32 (8 addressable by the CL/CS/CB formats)	8
Register instructions	2 registers	2 registers
Register instructions	rd <- rd op rs	rd <- rd op rs
Load instructions	rd <- M[rs1 + zero_ext(imm)]	rd <- M[imm]
Store instructions	M[rs1 + zero_ext(imm)] <- rs2	M[imm] <- rs1
Branch instructions	CB format (PC-relative)	CBm format (absolute)
	c.bnez rs1, offset	bneqz label
	if rs1 != 0, PC <- PC + sign_ext(offset)	if ZF=0, PC <- label

Table that points out the main differences and adaptations of RV16Cm compared to RV32C

Finite State Machine

With every instruction properly documented, the control unit can be fully described as a finite state machine (FSM) that goes through all the states needed to perform the instructions: fetch, decode and execute. Since the processor is multi-cycle, one instruction may take more than one clock cycle to execute, so the FSM is not as simple as 3 states. Firstly, it is important to list every input and output that the FSM should control. The main diagram of the processor shows all the hardware attached to the control unit, and thus all the inputs and outputs it has to deal with. That information is summarized in the following table:

Variable	Input or Output	Number of bits	Meaning
op	Input	2	It is the `op` field from the instruction
Funct3	Input	3	It is the `Funct3` field from the instruction
Zero	Input	1	It is the `Zero` flag from the ALU that indicates whether its last result was 0
PCUpdate	Output	1	Controls if the `PC` should be updated
Branch	Output	1	Controls if it is a branch case
AdrSrc	Output	1	Controls a MUX that determines whether the address sent to the memory is read from the `PC` (0) or from the output of the `ResultSrc` MUX (1)
MemWrite	Output	1	Determines if the memory should be written - connected to `WriteEnable` of the memory
IRWrite	Output	1	Determines if the `Instruction Register` should be written to - connected to its enable
ResultSrc	Output	2	Controls a MUX that sends its result to the `PC` (as input) and to the MUX controlled by `AdrSrc`. The inputs of the MUX are a register that holds the previous ALU result (`ALUOut`, 00), another register that holds the data read from memory (`Data`, 01) and the direct result from the switching circuit of the ALU (`ALUResult`, 10)
ALUControl	Output	3	Controls the ALU different operations
ALUSrcB	Output	2	Controls whether the `SrcB` input of the ALU will be the register that holds the `RD2` output of the register bank (00), the sign/zero extension from the Extend block (01), or a hardwired 2 (10)
ALUSrcA	Output	2	Controls whether the `SrcA` input of the ALU will be the `PC` (00), a hardwired 0 (01), or the register that holds the `RD1` output of the register bank (10)
ImmSrc	Output	2	Controls the Extend module - stating whether there should be a sign or zero extension of the immediate, and where the immediate is placed in the instruction
RegWrite	Output	1	Controls the write on the register bank - connected to `WE3`

Table that contains every input and output of the finite state machine

It is important to note that:

The Funct3 field, which selects the operation to be done by the ALU, uses the same operation encoding as ALUControl, as described earlier. In the instructions where it is used, one is therefore connected directly to the other without any additional decoder
The 2 hardwired to the mux controlled by ALUSrcB is used to increment the PC, while the 0 hardwired to the mux controlled by ALUSrcA lets an address pass through the ALU unchanged (it is summed with 0 in the MemAdr and BNEQZ states)
PCUpdate and Branch are not actually outputs of the control unit, only of the finite state machine. Together, through a switching circuit, they compose PCWrite, which determines whether the PC should be written and is attached directly to the EN input of the PC register. Since bneqz branches when the Zero flag is false, the switching circuit that defines PCWrite is: $ PCWrite = (PCUpdate) OR (Branch AND (NOT Zero)) $

Secondly, it is equally important to list all states conceived for the FSM:

State name	State function
Fetch	Fetches the next instruction, addressed by the PC, and increments the PC
Decode	Decodes the instruction and jumps to the corresponding state
MemAdr	Gets the memory address from an instruction by parsing it in the immediate module and adding it to 0 in the ALU
MemRead	Uses the ALUOut register (holding the previously computed address) to read an address from memory
MemWB	Writes the data read from memory back to the corresponding register of the register file
MemWrite	Accesses the address held in ALUOut and writes to it the value of RD1 (one of the two simultaneous read ports of the register bank)
ExecuteR	Executes a CRm instruction by reading the right registers from the register bank and performing the corresponding operation in the ALU
ALUWB	Writes the ALU result back to the register bank
ExecuteI	Executes a CIm instruction by parsing the immediate correctly in the immediate module and performing the corresponding operation in the ALU
BNEQZ	Parses the immediate in the immediate module, adds it to 0 in the ALU and places the result in the PC to branch

Table that describes the states of the FSM

Truth Table

With all control variables of the control unit defined, the following truth table describes how each state should behave

State	Branch (PCWrite)	PCUpdate (PCWrite)	AdrSrc	MemWrite	IRWrite	ResultSrc	ALUControl	ALUSrcB	ALUSrcA	ImmSrc	RegWrite
Fetch	0	1	0	0	1	10	000	10	00	dd	0
Decode	0	0	d	0	0	dd	ddd	dd	dd	dd	0
MemAdr	0	0	d	0	0	dd	000	01	01	01	0
MemRead	0	0	1	0	0	00	ddd	dd	dd	dd	0
MemWB	0	0	d	0	0	01	ddd	dd	dd	dd	1
MemWrite	0	0	1	1	0	00	ddd	dd	dd	dd	0
ExecuteR	0	0	d	0	0	dd	funct3	00	10	dd	0
ALUWB	0	0	d	0	0	00	funct3	dd	dd	dd	1
ExecuteI	0	0	d	0	0	dd	funct3	01	10	00	0
BNEQZ	not(ZF)	0	d	0	0	10	000	01	01	11	0

Truth table of the control unit (d marks a don't-care bit)

The truth table allows a classification of the finite state machine (FSM) that composes the control unit: the outputs depend only on the current state, not on the inputs, which means the FSM is a Moore machine. There are 2 clear exceptions that would invalidate this classification: in the ExecuteR, ALUWB and ExecuteI states, the ALUControl signal follows the Funct3 input from the instruction, and in the BNEQZ state, the Branch signal follows the ZF input from the ALU. Instead of reclassifying the FSM as a Mealy machine (the more general case, but one that could make the implementation less modular), it was chosen to isolate those outputs from the FSM.

So, basically, the ALUControl output comes from a separate decoder that takes the Funct3 field as input, while the Branch output is not implemented as a signal: instead, a switching circuit uses the current state directly (this is logically equivalent to an implicit Branch output that is always 1 in the branch state of the truth table). To clarify, the code snippet that sets pc_write (located in control_unit.v) is:

pc_write = pc_update | (~zero_flag & (curr_state == `BNEZ));

As described, the switching circuit sits outside the FSM (which is described in CU_main_decoder.v) and is placed directly in the control unit; it does not use a branch signal, only the curr_state signal.
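The behaviour of this expression can be checked exhaustively with a small Python model. This is only a sketch of the logic, not the project's code: the state names are plain strings chosen for the illustration, and `pc_write` here is a hypothetical Python function mirroring the Verilog expression.

```python
from itertools import product


def pc_write(pc_update, zero_flag, curr_state):
    """PC is written on PCUpdate, or in the branch state when Zero is false."""
    return bool(pc_update) or (not zero_flag and curr_state == "BNEQZ")


# Enumerate every combination for a non-branch state and for the branch state.
for state, pc_update, zero_flag in product(("Decode", "BNEQZ"), (0, 1), (0, 1)):
    enabled = int(pc_write(pc_update, zero_flag, state))
    print(f"{state:6} pc_update={pc_update} zero={zero_flag} -> pc_write={enabled}")
```

Outside the branch state the Zero flag has no effect at all, which is what allows the Branch column to be left out of the FSM outputs.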

Another important observation is that the truth table has many don't-care terms. This indicates that many parts of the processor are unused in many states, so multiple states could be processed simultaneously. That is, the don't-care terms suggest that the processor's speed could improve considerably with a future pipelined implementation.

Table of States

With all input variables of the control unit defined, the following table of states describes how each state should transition according to the inputs:

Current State	OP Field	Funct3 Field	Next State
Fetch	dd	ddd	Decode
Decode	00	ddd	ExecuteR
Decode	01	ddd	ExecuteI
Decode	10	ddd	MemAdr
Decode	11	ddd	BNEQZ
MemAdr	10	000	MemRead
MemAdr	10	001	MemWrite
MemRead	dd	ddd	MemWB
MemWB	dd	ddd	Fetch
MemWrite	dd	ddd	Fetch
ExecuteR	dd	ddd	ALUWB
ALUWB	dd	ddd	Fetch
ExecuteI	dd	ddd	ALUWB
BNEQZ	dd	ddd	Fetch

Table of States (describing the state transitions) of the control unit
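The transitions can be exercised with a small Python model that also counts how many clock cycles each instruction takes. It is a sketch, not the contents of CU_sequential.v: it follows the flowchart of the FSM, in which op = 01 leads from Decode to ExecuteI, and it holds the current state on an invalid input. The function names are hypothetical.

```python
def next_state(state, op, funct3):
    """Next state for the current state and the op/funct3 instruction fields."""
    if state == "Fetch":
        return "Decode"
    if state == "Decode":
        return {0b00: "ExecuteR", 0b01: "ExecuteI", 0b10: "MemAdr", 0b11: "BNEQZ"}[op]
    if state == "MemAdr":
        if funct3 == 0b000:
            return "MemRead"
        if funct3 == 0b001:
            return "MemWrite"
        return state  # invalid input: hold the current state
    if state == "MemRead":
        return "MemWB"
    if state in ("ExecuteR", "ExecuteI"):
        return "ALUWB"
    if state in ("MemWB", "MemWrite", "ALUWB", "BNEQZ"):
        return "Fetch"
    raise ValueError(f"unknown state {state}")


def trace(op, funct3, max_cycles=8):
    """List the states visited by one instruction, starting at Fetch."""
    states = ["Fetch"]
    while len(states) < max_cycles:
        following = next_state(states[-1], op, funct3)
        if following == "Fetch":
            return states
        states.append(following)
    raise RuntimeError("instruction never returned to Fetch")


for name, op, funct3 in (("add", 0b00, 0b000), ("addi", 0b01, 0b000),
                         ("lw", 0b10, 0b000), ("sw", 0b10, 0b001),
                         ("bneqz", 0b11, 0b000)):
    path = trace(op, funct3)
    print(f"{name:5} {len(path)} cycles: {' -> '.join(path)}")
```

Under this model lw is the longest instruction at 5 cycles, bneqz the shortest at 3, and the remaining instructions take 4.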

Diagram

flowchart TB

Fetch((______FETCH______<br> AdrSrc = 0 <br> IRWrite <br> ALUSrcA = 00 <br> ALUSrcB = 10 <br> ALUControl = 000 <br> ResultSrc = 10 <br> PCUpdate)) --> Decode((DECODE))


Decode --> |"op = 10 (CLm OR CSm)"|MemAdr((MemAdr <br> ALUSrcA = 01 <br> ALUSrcB = 01 <br> ALUControl = 000))

Decode -->|"op = 00 (CRm)"| ExecuteR((ExecuteR <br> ALUSrcA = 10 <br> ALUSrcB = 00 <br> ALUControl = funct3))

Decode -->|"op = 01 (CIm)"| ExecuteI((ExecuteI <br> ALUSrcA = 10 <br> ALUSrcB = 01 <br> ALUControl = funct3))


Decode -->|"op = 11 (CBm)"| BNEQZ((BNEQZ <br> ALUSrcA = 01 <br> ALUSrcB = 01 <br> ResultSrc = 10 <br> ALUControl = 000 <br> Branch))

MemAdr --> |"funct3 = 000 (CLm)"|MemRead((MemRead <br> ResultSrc = 00 <br> AdrSrc = 1))

MemAdr --> |"funct3 = 001 (CSm)"|MemWrite((MemWrite <br> ResultSrc = 00 <br> AdrSrc = 1 <br> MemWrite))

MemRead --> MemWB((MemWB <br> ResultSrc = 01 <br> RegWrite))

ExecuteR & ExecuteI --> ALUWB((ALUWB <br> ResultSrc = 00 <br> RegWrite))

MemWB & MemWrite & ALUWB & BNEQZ --> Fetch

Flowchart of the FSM

There are some important things to mention about the diagram that describes the FSM:

In the diagram, 1-bit signals are 1 where they are mentioned and 0 where they are omitted, while multi-bit signals are always shown with their value
The state transitions are all well defined in terms of inputs, but there can be invalid inputs, which are dealt with by holding the current state. That is, if the inputs do not match any specified transition, the default case is to remain in the same state. This design option was chosen to minimize state transitions, minimizing signal switching and, thus, heat dissipation in an integrated circuit. Note that no test bench or quantitative analysis was made for this design option; it simply follows a convention.
The state flow always starts at Fetch after a reset

Immediate module

Apart from the main control unit, there is a module called the Immediate module that operates in parallel with the FSM, parsing the immediates from the instructions and providing them correctly according to the instruction format. This module is described here in the Control Unit section because the two were implemented together (even on the same dev branch), given their shared role of parsing.

Basically, it receives the instruction and a 2-bit signal from the FSM that indicates the immediate format of the instruction. As seen in the description of the instruction formats, the immediate varies in size, having either 8 or 11 bits. Additionally, immediates must be extended to fit the 16-bit bus to the ALU. This extension can be done in two ways: immediates that are operands of ALU operations are sign-extended, while immediates that are addresses are zero-extended.

Because of this, the immediate module control signal (ImmSrc) is 2 bits wide: the first bit selects the size and the second one selects which extension should be performed. The following table describes the control signal of the immediate extension:

`ImmSrc`	Size of immediate	Extension to be performed	Instruction Format	FSM State that uses it
00	8 bits	Sign extension	CIm	ExecuteI
01	8 bits	Zero extension	CLm or CSm	MemAdr
10	11 bits	Sign extension	-	-
11	11 bits	Zero extension	CBm	BNEQZ

Table that describes the ImmSrc control of the immediate module

As sign extension of an 11-bit immediate (ImmSrc = 10) should never happen in the conceived FSM, it is not written in the code, and the default case sets the immediate to a 16-bit 0 value.
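The selection performed by ImmSrc can be summarized in a few lines of Python. This is a behavioural sketch rather than the contents of CU_immediate_extension.v; it assumes the instruction layout in which bits [12:8] hold the upper five immediate bits, bits [7:5] hold Addr[10:8] for CBm and bits [4:2] hold the lowest three bits. `extend_immediate` is a hypothetical name.

```python
def extend_immediate(instr, imm_src):
    """Return the 16-bit extended immediate selected by the 2-bit imm_src."""
    hi5 = (instr >> 8) & 0x1F    # Imm[7:3] / Addr[7:3]
    mid3 = (instr >> 5) & 0b111  # Addr[10:8] (CBm only)
    lo3 = (instr >> 2) & 0b111   # Imm[2:0] / Addr[2:0]
    imm8 = (hi5 << 3) | lo3

    if imm_src == 0b00:          # 8 bits, sign extension (CIm)
        return imm8 | 0xFF00 if imm8 & 0x80 else imm8
    if imm_src == 0b01:          # 8 bits, zero extension (CLm, CSm)
        return imm8
    if imm_src == 0b11:          # 11 bits, zero extension (CBm)
        return (mid3 << 8) | imm8
    return 0                     # ImmSrc = 10 is unused: default to 16-bit zero


# 0x1F5D carries the 8-bit immediate 0xFF; 0x1457 carries the 11-bit address 0x2A5.
print(f"{extend_immediate(0x1F5D, 0b00):#06x}")  # 0xffff
print(f"{extend_immediate(0x1F5D, 0b01):#06x}")  # 0x00ff
print(f"{extend_immediate(0x1457, 0b11):#06x}")  # 0x02a5
```

The same 8 immediate bits give 0xFFFF or 0x00FF depending only on the extension bit of ImmSrc, which is the difference between treating the value as a signed operand and as an address.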

Code organization and implementation

States as parameters

Firstly, one important feature added to the Verilog code is the description of the states as parameters. For example, the state Fetch is defined as follows:

parameter fetch = 4'b0000;

This representation not only improves code legibility, but also allows the developers to use a special feature of Intel® Quartus® software, which is used for loading the Verilog hardware description onto an Altera FPGA kit (the brand that the developers have access to in the laboratory).

The Quartus® State Machine Editor can change the state values to different sequences (not only the binary sequence 0000, 0001, 0010... but also, for example, the Gray sequence 000, 001, 011...). This allows the developers to test not only the hardcoded default binary sequence, but also other sequences. The sequence in which the states are encoded may affect how the synthesis tool maps the circuit onto the FPGA, changing the number of logic cells, for example.

So, basically, the states are defined as:

parameter fetch = 4'b0000;
parameter decode = 4'b0001;
parameter memadr = 4'b0010;
parameter memread = 4'b0011;
parameter memwb = 4'b0100;
parameter memwrite = 4'b0101;
parameter executer = 4'b0110;
parameter aluwb = 4'b0111;
parameter executei = 4'b1000;
parameter bneq = 4'b1001;
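For comparison with the binary sequence above, the following Python sketch generates a 4-bit Gray encoding for the same ten state names. It only illustrates what an alternative sequence looks like; it is not output from Quartus, and the `gray` helper is a hypothetical name.

```python
STATES = ["fetch", "decode", "memadr", "memread", "memwb",
          "memwrite", "executer", "aluwb", "executei", "bneq"]


def gray(index):
    """Reflected binary Gray code of index: consecutive codes differ in one bit."""
    return index ^ (index >> 1)


gray_codes = [gray(i) for i in range(len(STATES))]

# Print the states in the same parameter style, but with Gray-coded values.
for name, code in zip(STATES, gray_codes):
    print(f"parameter {name} = 4'b{code:04b};")
```

Only states that are consecutive in the list are guaranteed to differ by a single bit; transitions such as the return to fetch may still flip several bits, so any benefit depends on how the synthesis tool handles the actual transition graph.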

Code modularization

The files that contain the control unit code are modularized as follows:

control_unit.v is the top control unit module, meaning it instantiates the other ones: a decoder called the main decoder that sets most of the outputs (CU_main_decoder.v), a module that updates the state (CU_sequential.v) and another decoder that sets the ALU outputs (alu_decoder.v), apart from the switching circuit that defines pc_write, explained above.

So, basically, CU_main_decoder.v implements the FSM truth table (apart from ALUControl), while CU_sequential.v implements the FSM table of states. As said earlier, alu_decoder.v sets only the ALUControl signal. Inside the decoders there are only case statements (placed inside always blocks that check the positive clock edge), which are really simple and do nothing beyond implementing the tables described throughout this text. The case statements make the Moore machine properties visible: the case in CU_main_decoder.v depends only on the state, while the one in alu_decoder.v depends on the funct3 input. The only specific note is that in CU_main_decoder.v, as there were too many outputs, they were all assigned to a 13-bit wide buffer_out reg that matches the outputs accordingly, to make the code briefer.

So, basically, control_unit.v has some instantiations to wire up the decoders, and the state is updated inside an always block that provides the sequential behaviour on each clock cycle and also implements a synchronous reset. The code snippet for the always block is simply:

always @(posedge clk) begin
    // Synchronous reset: return to the Fetch state
    if (reset == 1'b1) curr_state <= fetch;
    // Otherwise advance to the state computed by CU_sequential
    else curr_state <= next_state;
end

where next_state is a wire connected to the output of the CU_sequential instance.

Before the instantiations, other regs are defined simply to split the instruction into its funct3 and op fields.

The immediate module follows a similar modularization, with CU_immediate_extension.v being the topmost module, which may use N_to_16_sig_extend.v or N_to_16_zero_extend.v depending on the imm_src signal inside the case statement (in this case, the always block is sensitive to the instruction, meaning the immediate module is a switching circuit and does not depend on clk). N_to_16_sig_extend.v and N_to_16_zero_extend.v are not as trivial as the modules above, but they basically use Verilog syntax and parametrized modules to perform sign/zero extension from N bits (passed as a parameter, used only for N=11 or N=8) to 16 bits.

Besides, all the case statements have a default that sets the outputs to all 0 in case of invalid inputs.

Control Unit - CarlosCraveiro/RISCV_based_processor GitHub Wiki
