Summary 8(b) - muneeb-mbytes/computerArchitectureCourse GitHub Wiki

Programmable Logic Array (PLA)

A Programmable Logic Array (PLA) is a configurable digital logic device used to implement combinational logic circuits. It consists of two programmable planes: the AND plane and the OR plane. These planes allow the PLA to realize any Boolean function by configuring the connections within them. PLA's are used to build reconfigurable digital circuits. It Combines memory and logic. It is similar to ROM but lacks full variable decoding and min-term generation. It Doesn't require programming like C and C++.

Why PLA's?

How to Implement a PLA-Based Design?

Example-1:

Consider a 1 bit Full Adder

Truth table for the given block

Find a minimal short form for the given function

Find the number of input buffer(Combination of variables in normal and complemented form)required. No. of input buffer = 3 (no. of variables)

Find required number of AND and OR gate No. of programmable AND gate =7 (no. of min terms (not repeated)) No. of programmable OR gate= 2 (no. of functions) [Here it is Sum and Carry]

Example-2:

Consider an expression:

Simplified expression:

Implementation of PLA:

Where,

The above is the input buffer for random input X. In the example given input buffer for variables A, B and C are provided. The simplified form of the above is

Analysing Performance

The clock period has to be large enough to accommodate component delays for all possible operations. Let’s analyse the component delays possible,

• Register file (dr): delay associated with array with collection of of resisters where according to the address one of them is selected • Program memory (dp): delay associated with instruction fetch from Program memory • Data memory (dd): delay associated with accessing data memory • ALU delay (da) • Interconnecting wire also has delay

Let’s analyse the delays of instructions,

ADD, SUB, AND, OR instruction: dp + dr + da + dr
SW (Store) instruction: dp + dr + da + dd
LW (load) instruction: dp + dr + da + dd + dr
Beq (branch) instructions: dp + dr + da
(jump) instruction: dp + da

Component delay and impact on clock period:

In a computer system, various components contribute to the overall delay, which affects the clock period.

PC Register Delay: The Program Counter (PC) is a register with a delay of 0. This means it introduces no delay on its own.

Adder Delay: This is the delay of the adder that calculates PC + 4 and PC + 4 + offset.

ALU Delay: The ALU (Arithmetic Logic Unit) delay is denoted as tAt_AtA, which includes a base delay ttt plus additional delay from the ALU operations.

MUX Delay: The delay of the multiplexer (MUX) is 0, indicating it introduces no additional delay.

Register File Delay: The register file delay, denoted as tRt_RtR, is the time it takes to access a register from a collection of registers.

Program Memory Delay: Denoted as tit_iti, this is the delay for accessing instructions from the program memory.

Data Memory Delay: Denoted as tMt_MtM, this is the delay for accessing data from the data memory.

So in order to calculate the impact on clock period we need to consider each instructions or group of instructions separately.

1) Computing the Sum, Difference, AND, OR:

• The instruction goes through the instruction memory (tit_iti), register file (tRt_RtR), ALU (tAt_AtA), and back to the register file (tRt_RtR).

• Total delay: ti+tR+tA+tRt_i + t_R + t_A + t_Rti+tR+tA+tR

2) Computing PC + 4:

The delay for this path involves the adder delay, which is accounted for separately. Delays for different instructions:

Instructions {add, sub, AND, OR}:

These instructions involve:

• fetching from instruction memory • reading from the register file • executing in the ALU. • writing back to the register file.

Clock Period: Maximum of (ti+tR+tA+tR)(t_i + t_R + t_A + t_R)(ti+tR+tA+tR) or t+t+t+ (the ALU delay, which is very quick).

Instruction {sw} (store word):

The store word instruction involves

• fetching from instruction memory. • reading from the register file, • executing in the ALU, • writing to data memory.

Clock Period: Maximum of (t+)(t+)(t+) or (ti+tR+tA+tM)(t_i + t_R + t_A + t_M)(ti+tR +tA+tM).

Instruction {lw} (load word):

The store word instruction involves:

• fetching from instruction memory, • reading from the register file, • executing in the ALU, • reading from data memory, • writing back to the register file.

Clock Period: Maximum of (t+)(t+)(t+) or (ti+tR+tA+tM+tR)(t_i + t_R + t_A + t_M + t_R)(ti+tR+tA+tM+tR).

Instruction {beq} (branch if equal):

The branch if equal instruction involves three paths:

One path goes through both adders, with a delay of t+t+t+ and t+t+t+. Another path involves fetching from instruction memoryfrom where we are picking the offset and then adding it to PC + 4 so it is ti + t+ for the third one the way the comparison being done is ti + tR + tA

Clock Period: Maximum value of the three paths.

Instruction {j} (jump):

The jump instruction involves fetching from instruction memory and jumping to the target address.

Clock Period: maximum value of { (t+) , (ti) }

Component Delays

● PC Register has 0 delay and Mux has 0 delay. ● ALU delay tA = t + delay ● Interconnecting Wires have some delay which is negligible if gates are fast. ● Clock Period Calculation is calculated separately for each instruction or group of instructions.

Delays for different instructions/Instruction Path Delays

Delays for Add, Sub, AND, OR:

● Adder to compute PC + 4 ● Clock Period: ti(Program memory) + tR(Register file) + tA(ALU delay) + tR(Register file)

Delay for Store (sw):

● This is used to store the data from registers to the memory. ● Clock Period: max(t+, ti + tR + tA + tM)

Delay for Load (lw):

● This is used to load the data from memory to the registers. ● Clock Period: max(t+, ti + tR + tA + tM + tR)

Delay for Branch if Equal (beq):

● In this beq instruction we have to consider three paths. ▪ Two adders ▪ Instruction memory → Adder ▪ Instruction memory → Register file → ALU Delay is the maximum of all the above three paths.

Clock Period: max(t+ + t+, ti + t+, ti + tR + tA)

Delay for Jump (j):

● Clock period will be high at max(t+, ti)

Instruction Delays Summary

Add, Sub, AND, OR Instructions:

Clock period: max(ti + tR + tA + tR, t+)

ALU delay is very quick.

Store (sw) Instruction:

Clock period: max(t+, ti + tR + tA + tM)

● Assumes same time for reading/writing in memory and register file. ● Delay includes data memory access.

Load (lw) Instruction:

Clock period: max(t+, ti + tR + tA + tM + tR)

Path: Instruction memory → Register file → ALU → Data memory → Register file

Includes additional delay for accessing data memory and storing back to register.

Branch if Equal (beq) Instruction:

beq instruction has three paths and the delay is maximum of all three paths.

Clock period: max(t+ + t+, ti + t+, ti + tR + tA)

▪ Path 1: Two adders. ▪ Path 2: Instruction memory → Adder. ▪ Path 3: Instruction memory → Register file → ALU.

Jump (j) Instruction:

Clock period: max(t+, ti)

Path: Mainly through instruction memory

Analysing Delays in computer architecture

In computer architecture, different instructions take different amounts of time to execute due to variations in complexity, required resources, and the specific hardware implementations. Here’s a breakdown of typical instruction delays.

ADD, SUB, AND, OR: These are simple arithmetic and logical operations that are typically executed by the Arithmetic Logic Unit (ALU). Eg: ADD R1, R2, R3 //Adds the contents of registers R2 and R3 and stores the result in R1
Store Word (sw): This instruction stores a word from a register to memory. Eg: SW R1, 0(R2) //Stores the word in R1 to the address in R2.
Load Word (lw): This instruction loads a word from memory into a register. Eg: LW R1, 0(R2) //Loads the word from the address in R2 into R1.
Branch if Equal (beq): This instruction compares two registers and branches if they are equal. Eg: BEQ R1, R2, target //Compares R1 and R2; if they are equal, branches to ‘target’.

5.Jump (j): This instruction performs an unconditional jump to a specified address. Eg: J target //Unconditionally jumps to the instruction at targe

Single Cycle Design and Multi Cycle Design

Single cycle datapath :

A single cycle datapath is a type of CPU architecture where each instruction is executed in one clock cycle. Whereas in multi-cycle datapath, an instruction is broken down into multiple stages, each completed in a separate clock cycle. The single-cycle design simplifies control logic, as every instruction must complete all its stages (fetch, decode, execute, memory access, and write-back) within one cycle.

Stages of Execution in a Single Cycle Datapath Instruction Fetch (IF):

● The PC provides the address to the instruction memory. ● The instruction at that address is fetched and placed into the instruction register.

Instruction Decode (ID):

● The instruction is decoded to determine the operation and the operands.

● The control unit generates the necessary control signals based on the instruction.

Execute (EX):

● The ALU performs the required operation (addition, subtraction, logical operations, etc.) using the operands from the registers or immediate values.

Memory Access (MEM):

● If the instruction involves accessing data memory (e.g., load or store instructions), the data memory is accessed.

Write-Back (WB):

● The result from the ALU or data memory is written back into the appropriate register.

Example:

lw $t0, –4($sp) needs 8ns, assuming the delays shown here. reading the instruction memory 2ns reading the base register $sp 1ns computing memory address $sp-4 2ns reading the data memory 2ns storing data back to $t0 1ns Total 8ns: If we make the cycle time 8ns then every instruction will take 8ns, even if they don’t need that much time.

How to improve the resource utilization: ● ALU which will do all arthemitic and logic operations ● For improvisation the 2 memory blocks will be clubbed (IM-instruction memory and DM-data memory) ● The single memory will store program as well as data ● The two adders are removed and connections are rearranged ● Registers are introduced

ALU which will do all arthemitic and logic operations ● For improvisation the 2 memory blocks will be clubbed (IM-instruction memory and DM-data memory) ● The single memory will store program as well as data ● The two adders are removed and connections are rearranged ● Registers are introduced

Multi-cycle datapath

A multi-cycle datapath is a CPU architecture where each instruction is executed over multiple clock cycles, with each cycle completing a specific stage of the instruction. This design breaks down the execution process into smaller steps, allowing for more efficient use of hardware and shorter clock cycles compared to a single-cycle datapath.

Stages of Execution in a Multi-Cycle Datapath

Instruction Fetch (IF):

● The PC provides the address to the instruction memory. ● The instruction at that address is fetched and placed into the instruction register (IR). ● The PC is incremented to point to the next instruction. Instruction Decode/Register Fetch (ID): ● The instruction is decoded to determine the operation and the operands. ● The control unit generates the necessary control signals based on the instruction. ● The source registers are read, and their values are stored in intermediate registers (A and B).

Execution/Effective Address Calculation (EX):

● The ALU performs the required operation (addition, subtraction, logical operations) using the operands from the intermediate registers or immediate values. ● For load/store instructions, the effective address is calculated.

Memory Access/Branch Completion (MEM):

● If the instruction involves accessing data memory, the data memory is accessed. ● For branch instructions, the branch target address is calculated, and the PC is updated if the branch is taken.

Write-Back (WB):

● The result from the ALU or data memory is written back into the appropriate register.

Breaks down instruction execution into multiple cycles. Determined by the longest stage among instruction fetch, memory access, etc. Breakdown by Instruction Type:

R-Class (Register-Register) Instructions:

•Total Cycles: 4 Instruction Fetch, Instruction Decode/Register Fetch, Execution, Register Write Back

lw (Load Word) Instruction:

Total Cycles: 5 Instruction Fetch, Instruction Decode/Register Fetch, Effective Address Calculation, Memory Access (Read), Register Write Back sw (Store Word) Instruction:

•Total Cycles: 4 Instruction Fetch, Instruction Decode/Register Fetch, Effective Address Calculation, Memory Access (Write)

beq (Branch if Equal) Instruction:

•Total Cycles: 3 Instruction Fetch, Instruction Decode/Register Fetch, Branch Decision

j (Jump) Instruction:

•Total Cycles: 2 Instruction Fetch, Jump Execution

Efficiency: Each instruction takes only the cycles it needs

To balance delays in a multi-cycle processor design, we can use two methods:

Multiple Actions in One Period:

If two actions take very little time, they can be done in the same clock cycle. For example, if reading the instruction (ti) and reading the registers (tR) are both quick, they can be combined into one period. This reduces the total number of cycles needed for instructions.

Multiple Periods for One Action:

For actions that take longer, we can set the clock period so that time isn't wasted. For example, if memory access (tM) takes longer, the clock period can be adjusted to match this duration, ensuring no time is wasted. These methods help make the processor more efficient by reducing the total time needed to execute instructions. The image shows how different instruction types (R class, lw, sw, beq, j) are broken down over several cycles, emphasizing the need to balance the clock periods.

PROBLEMS WITH SINGLE CYCLE DESIGN:

Slowest Instruction Determines Clock Frequency In a single-cycle design, every instruction is designed to be executed within one clock cycle. Consequently, the duration of the clock cycle is dictated by the slowest instruction, which sets the upper limit for the cycle time. Typically, this slowest instruction is a load instruction, which involves multiple steps:
Instruction Memory: Fetch the instruction from memory.
Register File: Read the source register(s).
ALU (Arithmetic Logic Unit): Perform the required arithmetic or logic operation.
Data Memory: Access data memory for load or store operations.
Register File: Write the result back to the destination register.

Since all these steps must be completed within one cycle, the clock cycle time is quite long. Although the Clock Per Instruction (CPI) is one, the overall performance suffers due to the lengthy clock cycle required to accommodate the slowest instruction.

Poor Resource Utilization

In a single-cycle design, hardware resources are underutilized because many parts of the hardware remain idle during parts of the cycle. For instance, when the ALU is performing its task, other units like the instruction memory and data memory might be idle. This inefficiency leads to poor resource utilization.

Some Instructions Cannot be Implemented in a Single Cycle Certain instructions inherently require multiple steps that cannot be completed within a single clock cycle.

For example:

●Read-Operate-Write Sequence: An instruction that reads from one address, performs an operation, and writes to another address requires multiple steps. ●Data Movement: Moving a block of data from one memory area to another involves several read and write operations. These multi-step instructions cannot fit into the fixed clock cycle of a single-cycle design. While early, simple computers with straightforward instruction sets used this method, it becomes impractical for more complex instruction sets or when incorporating advanced features like floating-point units.

CPI (Clocks Per Instruction)

CPI, or Clocks Per Instruction, measures the average number of clock cycles needed to execute each instruction in a program. It is a critical performance metric: ●Lower CPI: Indicates better performance, as instructions are executed more efficiently. ●Higher CPI: Implies less efficiency, with more clock cycles needed per instruction.

Conclusion

The single-cycle design, while straightforward and easy to implement for simple instruction sets, has significant drawbacks: ●The clock cycle is determined by the slowest instruction, resulting in a long cycle time. ●Hardware resources are inefficiently utilized. ●Certain complex instructions cannot be executed within a single cycle, limiting the design's applicability. As computer architectures evolved, more advanced designs like multi-cycle and pipelined architectures were developed to overcome these limitations, improving overall performance and resource utilization.