Summary 8 - muneeb-mbytes/computerArchitectureCourse GitHub Wiki

Register File

A register file is a collection of registers which are used to store data temporarily during the execution of instructions.
They hold operands, intermediate results, and control information.

For operations such as addition, subtraction, and similar functions, two operands are required from the register file, and the result must be stored back into the register file.
The register file is essentially an array of flip-flops.
This can be visualized as either a 2D array of flip-flops or a 1D array of registers, with each register being 32 bits in size.

MIPS

Inputs:

Two 5-bit read register addresses
One 5-bit write register address
One 32-bit write data input

Outputs:

Two 32-bit read data outputs
It provides storage for 32 registers, each 32 bits in size, and can simultaneously read two 32-bit values and write one 32-bit value

RISCV

Inputs:

Two 5-bit read register addresses (Read register 1 & 2) to specify registers for reading.
One 5-bit write register address to specify the register for writing.
One 64-bit write data input to supply the data to be written.

Outputs:

Two 64-bit read data outputs (Read data 1 & 2) for the data read from the specified registers
The register file continuously outputs the contents of the registers specified by the Read register inputs
Writing to a register is controlled by the Regwrite control signal, which must be asserted at the clock edge for the write to occur
All inputs related to writing must be valid at the clock edge
An Arithmetic Logic Unit operates on the values read from the registers

Program Memory

Program Memory is a special type of memory used in a computer system to store the instructions of a program
This memory is read-only, meaning we can only retrieve instructions from it, not modify or write new instructions to it
Although memories are typically sequential elements, program memory behaves like a combinational circuit in this context. This is because its contents are fixed and not altered, allowing it to respond instantly to input without changing state
The program memory receives an address as an input and outputs the instruction stored at that address. This process is like how a combinational circuit works, providing an immediate response based on the input.
Program memory stores the instructions for a program and provides these instructions based on the given address.

Data Memory

Data Memory in a computer system is used for storing and retrieving data. It works with two types of instructions:

1. Load Instruction: This reads data from the memory and transfers it to the processor.

2. Store Instruction: This writes data from the processor into the memory

Data memory uses a single address input for both reading (load) and writing (store) operations. This means that you specify the location in memory where data should be read from or written to
Load and store operations cannot happen at the same time; only one can be performed at any given moment
External inputs, or control signals, determine whether the operation will be a read (load) or a write (store). These signals specify the type of operation to be performed on the data memory

Instruction Cycle

A program residing in the memory unit of a computer consists of a sequence of instructions.
These instructions are executed by the processor by going through a cycle for each instruction.
In a basic computer, each instruction cycle consists of the following phases:

Addressing Register File

In computer architecture, accessing and modifying the CPU's registers is known as addressing the register file.
The CPU keeps data necessary for computation and instruction execution in registers, which are gathered into a register file. An outline of how addressing functions using the register file is provided below:

Passing Operands to ALU, Incrementing PC:

Function of the ALU:

The CPU's essential ALU (Arithmetic Logic Unit) carries out arithmetic (such as addition and subtraction) and logic (such as AND and OR) operations on data.

Operands:

The data values that the ALU employs in its operations are known as operands. Registers and memory are usually where these values are read from.

Method:

Operand Fetching:

The operands are held in which registers or memory locations, as indicated by the instruction.
These operands are retrieved from memory or the register file.

ALU Input:

The ALU receives the operands that have been fetched. These operands are used by the ALU to carry out the designated operation (such as addition).

Handling of Results:

Depending on the Instruction, the result of an operation performed by the ALU is either transferred to memory or saved back into a register.

Incrementing the PC:

Incrementing the PC involves updating the Program Counter to point to the address of the next instruction to ensure sequential execution.

Incrementing Process:

Sequential Execution: After obtaining an instruction, the PC is usually incremented to point to the location of the next instruction in memory. This ensures that commands are executed in the appropriate sequence.

Address calculation: The increment procedure usually involves multiplying the current value on the PC by the size of the instruction (four bytes, for example). The PC is updated to the location of the next instruction by performing this.

Branching: When a branch or jump instruction alters the sequence of execution, the PC may be configured to a different address supplied by the instruction, overriding the customary increment.

Adding Load and Store Instruction

Load Instruction:

Load instruction loads the data stored at a specific memory address specified by the instruction into a register.
Example: lw $t0, 4($s0) which means load the word (4 bytes) at the address stored in register $s0 plus 4 and store it in register $t0.
The PC provides the address to the instruction memory, which fetches the lw instruction.
The registers file is accessed to read the base address and prepare for writing the loaded data.
The ALU calculates the effective address by adding the base and offset addresses.
This address is sent to the data memory, which reads the data and sends it back.
The data is finally written into the specified register, completing the lw instruction cycle.

Store Instruction:

Store instruction stores the data at a specific memory address specified by the instruction.
Example: sw $t0, 4($s0), which means store the word (4 bytes) in register $t0 to the address stored in register $s0 plus 4.
The PC provides the address to the instruction memory, which fetches the sw instruction.
The registers file is accessed to read the base address (from Read register 1) and the data to be stored (from Read register 2).
The ALU calculates the effective address by adding the base address and the offset (from Sign Extension).
This address and the data from Read Data 2 are sent to the data memory.
The data memory writes the data to the calculated address, completing the sw instruction cycle.

Adding beq and j instruction: Format of “beq” instruction: I-format:

The branch instruction uses an I-format opcode like load/store instructions. It compares two registers, rs and rt, and adds a word offset to the PC.
Normally, PC + 4 is used to keep things consistent. Instead of sending PC + 4 back directly to the PC, we can add an offset to it, allowing for more flexible branching.

Adding “j” instruction:

The instruction has a 6-bit opcode and a 26-bit target address
We can't specify a full 32-bit address, so we retain some bits from the PC
Branching uses an offset in words
Left-shift the 26-bit target address by 2 to make 28 bits
Take the top 4 bits from PC+4 (bits 28-31)
Combine these 4 bits with the 28-bit target address
This forms a 32-bit jump address
The program counter uses this 32-bit jump address

Datapath and Control

Datapath: The datapath in a computer system is the part that handles the actual movement and processing of data. A data path is a collection of functional units such as arithmetic logic units or multipliers that perform data processing operations, registers, and buses. Along with the control unit, it composes the central processing unit (CPU).

Control Signal: Control is the hardware that tells the datapath what to do, in terms of switching, operation selection, data movement between ALU components, etc.

The Program Counter (PC) provides an address to the Instruction Memory to fetch the current instruction.
The instruction is decoded to determine the necessary actions.
Data is fetched from registers or extended from immediate values within the instruction.
A control signal called "Rdsd" selects the destination register, choosing the address from two possible parts of the instruction.
The Register File receives a control signal (RW) to decide whether it should write data, as not all instructions require writing. Only specific instructions, such as the first five and the load instruction, write data.
The ALU source control signal determines whether the ALU is used for address calculations or regular operations.
The ALU's status output 'z' indicates whether two values are equal, which is useful for instructions like beq (branch if equal). This status aids in control decisions.
The ALU uses a control signal called “op" to determine the operation it should perform, such as arithmetic or logical functions.
The result from the ALU can be stored in registers or Data Memory.
For instructions that change the program flow, such as branches or jumps, the next PC address is calculated based on the current instruction and specific conditions.

Control Inputs For Different Instructions

1. Branch-if-equal Instruction Type

Branching is a mechanism to alter the normal sequential execution of instructions.
A branch instruction causes a jump to a different part of the code.
The Program Counter (PC) holds the address of the next instruction to be executed.
Branch-if-Equal (BEQ) is a type of branch instruction.
It checks if two values are equal.
If the condition is true (values are equal), the program jumps to a specified address.
If the condition is false (values are not equal), the program continues to the next instruction.
The offset to the jump target is typically encoded within the instruction itself.
The PC is modified by adding the offset to determine the new address.
Control logic determines whether to take the branch based on the comparison result.

2. R Instruction Type

R-type instruction "add x1, x2, x3" involves four key steps:

Instruction Fetch and PC Incrementation:
Processor retrieves the instruction from memory.
Program Counter (PC) holds the address of the next instruction and is incremented after the fetch.

Register Read:

Values of source registers (x2, x3) are read from the register file.
Main control unit generates control signals for reading the registers.

ALU Operation:

ALU performs addition on the values from registers x2 and x3.
Control signals specify the ALU operation.

Register Write:

Result of the addition is written to the destination register (x1).
Control unit generates signals for the write operation.
Each step occurs within a single clock cycle, allowing efficient pipelined instruction processing, characteristic of RISC architectures.

Load Instruction Type

Load Instruction Type "ld x1, offset(x2)" involves four key steps:

Instruction Fetch and PC Incrementation:
Instruction fetched from memory using the Program Counter (PC).
PC incremented to prepare for the next instruction.

Register Read:

Value of base register (x2) read from the register file.
Control signals generated for reading the register.

ALU Computation:

ALU calculates the effective memory address by adding the sign-extended 12-bit offset to the value from register x2.
Sign extension ensures correct interpretation of the offset as a signed value.

Memory Access:

Processor accesses data memory using the effective memory address.
Control signals activate the memory read operation.

Register Write:

Data from memory written into the destination register (x1).
Control signals generated for the write operation.

Programmable Logic Array (PLA)

A Programmable Logic Array (PLA) is a configurable digital logic device used to implement combinational logic circuits. It consists of two programmable planes: the AND plane and the OR plane. These planes allow the PLA to realize any Boolean function by configuring the connections within them. PLA's are used to build reconfigurable digital circuits. It Combines memory and logic. It is similar to ROM but lacks full variable decoding and min-term generation. It Doesn't require programming like C and C++.

Why PLA's?

How to Implement a PLA-Based Design?

Example-1:

Consider a 1 bit Full Adder

Truth table for the given block

Find a minimal short form for the given function

Find the number of input buffer(Combination of variables in normal and complemented form)required. No. of input buffer = 3 (no. of variables)

Find required number of AND and OR gate No. of programmable AND gate =7 (no. of min terms (not repeated)) No. of programmable OR gate= 2 (no. of functions) [Here it is Sum and Carry]

Example-2:

Consider an expression:

Simplified expression:

Implementation of PLA:

Where,

The above is the input buffer for random input X. In the example given input buffer for variables A, B and C are provided. The simplified form of the above is

Analysing Performance

The clock period has to be large enough to accommodate component delays for all possible operations. Let’s analyse the component delays possible,

• Register file (dr): delay associated with array with collection of of resisters where according to the address one of them is selected • Program memory (dp): delay associated with instruction fetch from Program memory • Data memory (dd): delay associated with accessing data memory • ALU delay (da) • Interconnecting wire also has delay

Let’s analyse the delays of instructions,

ADD, SUB, AND, OR instruction: dp + dr + da + dr
SW (Store) instruction: dp + dr + da + dd
LW (load) instruction: dp + dr + da + dd + dr
Beq (branch) instructions: dp + dr + da
(jump) instruction: dp + da

Component delay and impact on clock period:

In a computer system, various components contribute to the overall delay, which affects the clock period.

PC Register Delay: The Program Counter (PC) is a register with a delay of 0. This means it introduces no delay on its own.

Adder Delay: This is the delay of the adder that calculates PC + 4 and PC + 4 + offset.

ALU Delay: The ALU (Arithmetic Logic Unit) delay is denoted as tAt_AtA, which includes a base delay ttt plus additional delay from the ALU operations.

MUX Delay: The delay of the multiplexer (MUX) is 0, indicating it introduces no additional delay.

Register File Delay: The register file delay, denoted as tRt_RtR, is the time it takes to access a register from a collection of registers.

Program Memory Delay: Denoted as tit_iti, this is the delay for accessing instructions from the program memory.

Data Memory Delay: Denoted as tMt_MtM, this is the delay for accessing data from the data memory.

So in order to calculate the impact on clock period we need to consider each instructions or group of instructions separately.

1) Computing the Sum, Difference, AND, OR:

• The instruction goes through the instruction memory (tit_iti), register file (tRt_RtR), ALU (tAt_AtA), and back to the register file (tRt_RtR).

• Total delay: ti+tR+tA+tRt_i + t_R + t_A + t_Rti+tR+tA+tR

2) Computing PC + 4:

The delay for this path involves the adder delay, which is accounted for separately. Delays for different instructions:

Instructions {add, sub, AND, OR}:

These instructions involve:

• fetching from instruction memory • reading from the register file • executing in the ALU. • writing back to the register file.

Clock Period: Maximum of (ti+tR+tA+tR)(t_i + t_R + t_A + t_R)(ti+tR+tA+tR) or t+t+t+ (the ALU delay, which is very quick).

Instruction {sw} (store word):

The store word instruction involves

• fetching from instruction memory. • reading from the register file, • executing in the ALU, • writing to data memory.

Clock Period: Maximum of (t+)(t+)(t+) or (ti+tR+tA+tM)(t_i + t_R + t_A + t_M)(ti+tR +tA+tM).

Instruction {lw} (load word):

The store word instruction involves:

• fetching from instruction memory, • reading from the register file, • executing in the ALU, • reading from data memory, • writing back to the register file.

Clock Period: Maximum of (t+)(t+)(t+) or (ti+tR+tA+tM+tR)(t_i + t_R + t_A + t_M + t_R)(ti+tR+tA+tM+tR).

Instruction {beq} (branch if equal):

The branch if equal instruction involves three paths:

One path goes through both adders, with a delay of t+t+t+ and t+t+t+. Another path involves fetching from instruction memoryfrom where we are picking the offset and then adding it to PC + 4 so it is ti + t+ for the third one the way the comparison being done is ti + tR + tA

Clock Period: Maximum value of the three paths.

Instruction {j} (jump):

The jump instruction involves fetching from instruction memory and jumping to the target address.

Clock Period: maximum value of { (t+) , (ti) }

Component Delays

● PC Register has 0 delay and Mux has 0 delay. ● ALU delay tA = t + delay ● Interconnecting Wires have some delay which is negligible if gates are fast. ● Clock Period Calculation is calculated separately for each instruction or group of instructions.

Delays for different instructions/Instruction Path Delays

Delays for Add, Sub, AND, OR:

● Adder to compute PC + 4 ● Clock Period: ti(Program memory) + tR(Register file) + tA(ALU delay) + tR(Register file)

Delay for Store (sw):

● This is used to store the data from registers to the memory. ● Clock Period: max(t+, ti + tR + tA + tM)

Delay for Load (lw):

● This is used to load the data from memory to the registers. ● Clock Period: max(t+, ti + tR + tA + tM + tR)

Delay for Branch if Equal (beq):

● In this beq instruction we have to consider three paths. ▪ Two adders ▪ Instruction memory → Adder ▪ Instruction memory → Register file → ALU Delay is the maximum of all the above three paths.

Clock Period: max(t+ + t+, ti + t+, ti + tR + tA)

Delay for Jump (j):

● Clock period will be high at max(t+, ti)

Instruction Delays Summary

Add, Sub, AND, OR Instructions:

Clock period: max(ti + tR + tA + tR, t+)

ALU delay is very quick.

Store (sw) Instruction:

Clock period: max(t+, ti + tR + tA + tM)

● Assumes same time for reading/writing in memory and register file. ● Delay includes data memory access.

Load (lw) Instruction:

Clock period: max(t+, ti + tR + tA + tM + tR)

Path: Instruction memory → Register file → ALU → Data memory → Register file

Includes additional delay for accessing data memory and storing back to register.

Branch if Equal (beq) Instruction:

beq instruction has three paths and the delay is maximum of all three paths.

Clock period: max(t+ + t+, ti + t+, ti + tR + tA)

▪ Path 1: Two adders. ▪ Path 2: Instruction memory → Adder. ▪ Path 3: Instruction memory → Register file → ALU.

Jump (j) Instruction:

Clock period: max(t+, ti)

Path: Mainly through instruction memory

Analysing Delays in computer architecture

In computer architecture, different instructions take different amounts of time to execute due to variations in complexity, required resources, and the specific hardware implementations. Here’s a breakdown of typical instruction delays.

ADD, SUB, AND, OR: These are simple arithmetic and logical operations that are typically executed by the Arithmetic Logic Unit (ALU). Eg: ADD R1, R2, R3 //Adds the contents of registers R2 and R3 and stores the result in R1
Store Word (sw): This instruction stores a word from a register to memory. Eg: SW R1, 0(R2) //Stores the word in R1 to the address in R2.
Load Word (lw): This instruction loads a word from memory into a register. Eg: LW R1, 0(R2) //Loads the word from the address in R2 into R1.
Branch if Equal (beq): This instruction compares two registers and branches if they are equal. Eg: BEQ R1, R2, target //Compares R1 and R2; if they are equal, branches to ‘target’.

5.Jump (j): This instruction performs an unconditional jump to a specified address. Eg: J target //Unconditionally jumps to the instruction at targe

Single Cycle Design and Multi Cycle Design

Single cycle datapath :

A single cycle datapath is a type of CPU architecture where each instruction is executed in one clock cycle. Whereas in multi-cycle datapath, an instruction is broken down into multiple stages, each completed in a separate clock cycle. The single-cycle design simplifies control logic, as every instruction must complete all its stages (fetch, decode, execute, memory access, and write-back) within one cycle.

Stages of Execution in a Single Cycle Datapath Instruction Fetch (IF):

● The PC provides the address to the instruction memory. ● The instruction at that address is fetched and placed into the instruction register.

Instruction Decode (ID):

● The instruction is decoded to determine the operation and the operands.

● The control unit generates the necessary control signals based on the instruction.

Execute (EX):

● The ALU performs the required operation (addition, subtraction, logical operations, etc.) using the operands from the registers or immediate values.

Memory Access (MEM):

● If the instruction involves accessing data memory (e.g., load or store instructions), the data memory is accessed.

Write-Back (WB):

● The result from the ALU or data memory is written back into the appropriate register.

Example:

lw $t0, –4($sp) needs 8ns, assuming the delays shown here. reading the instruction memory 2ns reading the base register $sp 1ns computing memory address $sp-4 2ns reading the data memory 2ns storing data back to $t0 1ns Total 8ns: If we make the cycle time 8ns then every instruction will take 8ns, even if they don’t need that much time.

How to improve the resource utilization: ● ALU which will do all arthemitic and logic operations ● For improvisation the 2 memory blocks will be clubbed (IM-instruction memory and DM-data memory) ● The single memory will store program as well as data ● The two adders are removed and connections are rearranged ● Registers are introduced

ALU which will do all arthemitic and logic operations ● For improvisation the 2 memory blocks will be clubbed (IM-instruction memory and DM-data memory) ● The single memory will store program as well as data ● The two adders are removed and connections are rearranged ● Registers are introduced

Multi-cycle datapath

A multi-cycle datapath is a CPU architecture where each instruction is executed over multiple clock cycles, with each cycle completing a specific stage of the instruction. This design breaks down the execution process into smaller steps, allowing for more efficient use of hardware and shorter clock cycles compared to a single-cycle datapath.

Stages of Execution in a Multi-Cycle Datapath

Instruction Fetch (IF):

● The PC provides the address to the instruction memory. ● The instruction at that address is fetched and placed into the instruction register (IR). ● The PC is incremented to point to the next instruction. Instruction Decode/Register Fetch (ID): ● The instruction is decoded to determine the operation and the operands. ● The control unit generates the necessary control signals based on the instruction. ● The source registers are read, and their values are stored in intermediate registers (A and B).

Execution/Effective Address Calculation (EX):

● The ALU performs the required operation (addition, subtraction, logical operations) using the operands from the intermediate registers or immediate values. ● For load/store instructions, the effective address is calculated.

Memory Access/Branch Completion (MEM):

● If the instruction involves accessing data memory, the data memory is accessed. ● For branch instructions, the branch target address is calculated, and the PC is updated if the branch is taken.

Write-Back (WB):

● The result from the ALU or data memory is written back into the appropriate register.

Breaks down instruction execution into multiple cycles. Determined by the longest stage among instruction fetch, memory access, etc. Breakdown by Instruction Type:

R-Class (Register-Register) Instructions:

•Total Cycles: 4 Instruction Fetch, Instruction Decode/Register Fetch, Execution, Register Write Back

lw (Load Word) Instruction:

Total Cycles: 5 Instruction Fetch, Instruction Decode/Register Fetch, Effective Address Calculation, Memory Access (Read), Register Write Back sw (Store Word) Instruction:

•Total Cycles: 4 Instruction Fetch, Instruction Decode/Register Fetch, Effective Address Calculation, Memory Access (Write)

beq (Branch if Equal) Instruction:

•Total Cycles: 3 Instruction Fetch, Instruction Decode/Register Fetch, Branch Decision

j (Jump) Instruction:

•Total Cycles: 2 Instruction Fetch, Jump Execution

Efficiency: Each instruction takes only the cycles it needs

To balance delays in a multi-cycle processor design, we can use two methods:

Multiple Actions in One Period:

If two actions take very little time, they can be done in the same clock cycle. For example, if reading the instruction (ti) and reading the registers (tR) are both quick, they can be combined into one period. This reduces the total number of cycles needed for instructions.

Multiple Periods for One Action:

For actions that take longer, we can set the clock period so that time isn't wasted. For example, if memory access (tM) takes longer, the clock period can be adjusted to match this duration, ensuring no time is wasted. These methods help make the processor more efficient by reducing the total time needed to execute instructions. The image shows how different instruction types (R class, lw, sw, beq, j) are broken down over several cycles, emphasizing the need to balance the clock periods.

PROBLEMS WITH SINGLE CYCLE DESIGN:

Slowest Instruction Determines Clock Frequency In a single-cycle design, every instruction is designed to be executed within one clock cycle. Consequently, the duration of the clock cycle is dictated by the slowest instruction, which sets the upper limit for the cycle time. Typically, this slowest instruction is a load instruction, which involves multiple steps:
Instruction Memory: Fetch the instruction from memory.
Register File: Read the source register(s).
ALU (Arithmetic Logic Unit): Perform the required arithmetic or logic operation.
Data Memory: Access data memory for load or store operations.
Register File: Write the result back to the destination register.

Since all these steps must be completed within one cycle, the clock cycle time is quite long. Although the Clock Per Instruction (CPI) is one, the overall performance suffers due to the lengthy clock cycle required to accommodate the slowest instruction.

Poor Resource Utilization

In a single-cycle design, hardware resources are underutilized because many parts of the hardware remain idle during parts of the cycle. For instance, when the ALU is performing its task, other units like the instruction memory and data memory might be idle. This inefficiency leads to poor resource utilization.

Some Instructions Cannot be Implemented in a Single Cycle Certain instructions inherently require multiple steps that cannot be completed within a single clock cycle.

For example:

●Read-Operate-Write Sequence: An instruction that reads from one address, performs an operation, and writes to another address requires multiple steps. ●Data Movement: Moving a block of data from one memory area to another involves several read and write operations. These multi-step instructions cannot fit into the fixed clock cycle of a single-cycle design. While early, simple computers with straightforward instruction sets used this method, it becomes impractical for more complex instruction sets or when incorporating advanced features like floating-point units.

CPI (Clocks Per Instruction)

CPI, or Clocks Per Instruction, measures the average number of clock cycles needed to execute each instruction in a program. It is a critical performance metric: ●Lower CPI: Indicates better performance, as instructions are executed more efficiently. ●Higher CPI: Implies less efficiency, with more clock cycles needed per instruction.

Conclusion

The single-cycle design, while straightforward and easy to implement for simple instruction sets, has significant drawbacks: ●The clock cycle is determined by the slowest instruction, resulting in a long cycle time. ●Hardware resources are inefficiently utilized. ●Certain complex instructions cannot be executed within a single cycle, limiting the design's applicability. As computer architectures evolved, more advanced designs like multi-cycle and pipelined architectures were developed to overcome these limitations, improving overall performance and resource utilization.