Assignment 4 - MIPT-ILab/mipt-mips GitHub Wiki
** Note: this assignment is finished.** |
---|
In this assignment you will upgrade our single-cycle implementation to a scalar MIPS simulator with constant latency for instruction execution.
All requirements remain the same as in the previous tasks.
You should create a new branch task_4
and create four new files:
perf_sim/perf_sim.h
perf_sim/perf_sim.cpp
perf_sim/main.cpp
perf_sim/Makefile
Hint: We suggest starting by copying the func_sim.h and func_sim.cpp files. |
---|
In the previous task you implemented a single-cycle MIPS simulator, with all actions encapsulated in 5 stages:
- Fetch
- Decode, read sources
- Execute, calculate address
- Memory access
- Writeback, PC update, information dump
Now we're going to encapsulate each stage in a module; the modules are connected with ports.
For simplicity, all ports and modules will be stored in one class PerfMIPS.
We will use ports for two purposes:
- A data port transfers data from one stage to the next one
- A stall port signals to the previous stages that the pipeline is stalled
In this task we don't use a complicated port topology, so use the constants PORT_BW, PORT_FANOUT and PORT_LATENCY everywhere. They must all be defined as 1.
Data ports must have the following syntax:
class PerfMIPS {
    ReadPort</*Type*/>* rp_/*source_module*/_2_/*dest_module*/;
    WritePort</*Type*/>* wp_/*source_module*/_2_/*dest_module*/;
    // examples (pointers, because the ports are created with new in the constructor)
    ReadPort<FuncInstr>* rp_decode_2_execute;
    ReadPort<FuncInstr>* rp_execute_2_memory;
    WritePort<uint32>* wp_fetch_2_decode;
    WritePort<FuncInstr>* wp_decode_2_execute;
};
and be initialized in the following way:
PerfMIPS::PerfMIPS() {
    // example: the read side takes the latency, the write side takes bandwidth and fanout
    rp_decode_2_execute = new ReadPort<FuncInstr>("DECODE_2_EXECUTE", PORT_LATENCY);
    wp_decode_2_execute = new WritePort<FuncInstr>("DECODE_2_EXECUTE", PORT_BW, PORT_FANOUT);
}
Each pair of data ports has to transmit a FuncInstr object. The only exception is the fetch->decode port, which transmits a raw uint32.
A stall port is used to stop the previous stages when the current instruction cannot pass this stage and has to be restarted.
These ports must transmit only 1 bit of data, represented by the bool type.
ReadPort<bool>* rp_decode_2_fetch_stall;
WritePort<bool>* wp_decode_2_fetch_stall;
rp_decode_2_fetch_stall = new ReadPort<bool>("DECODE_2_FETCH_STALL", /**/);
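To see why a latency of 1 decouples neighbouring stages, here is a toy model of a single write/read port pair. It is only a sketch and not the ports library used in the course: `ToyPort` and its members are hypothetical names, and only the `write(value, cycle)` / `read(&value, cycle)` call shape is borrowed from the snippets in this assignment.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for a WritePort/ReadPort pair with latency 1 (hypothetical,
// not the real ports library): a value written at cycle N becomes readable
// at cycle N + 1 and is consumed by the first successful read.
template <typename T>
class ToyPort {
    T slot{};                  // value in flight
    bool has_data = false;     // true while `slot` holds an unread value
    uint64_t ready_cycle = 0;  // first cycle at which `slot` may be read
public:
    void write(const T& value, uint64_t cycle) {
        slot = value;
        has_data = true;
        ready_cycle = cycle + 1;  // latency of one cycle
    }

    // Returns true and fills *out if data has arrived by `cycle`.
    bool read(T* out, uint64_t cycle) {
        assert(out != nullptr);
        if (!has_data || cycle < ready_cycle)
            return false;
        *out = slot;
        has_data = false;
        return true;
    }
};
```

With this model, a value written by one stage at cycle 5 is invisible to the reader during cycle 5 and is delivered exactly once at cycle 6.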
For consistency, we recommend naming the modules this way:
fetch
decode
execute
memory
writeback
Each module consists of the following objects:
- read port from the previous stage *
- write port to the next stage **
- stall read port from the next stage **
- stall write port to the previous stage *
- internal value on the latch: a FuncInstr object or data bytes *
- void clock_module(int cycle) function (where module is one of the names above)

* Not needed in the fetch module.
** Not needed in the writeback module.
void clock_module( int cycle) {
    bool is_stall = false;
    /* If the next module tells us to stall, we stop
       and send a stall signal to the previous module */
    rp_next_2_me_stall->read( &is_stall, cycle);
    if ( is_stall) {
        wp_me_2_previous_stall->write( true, cycle);
        return;
    }
    /* If nothing comes from the previous stage,
       execute, memory and writeback modules have to jump out here */
    if ( !rp_previous_2_me->read( &module_data, cycle))
        return;
    /* But the decode stage doesn't jump out.
       It takes the non-updated bytes from module_data
       and re-decodes them */
    // rp_previous_2_me->read( &module_data, cycle);
    // Here we process data.
    if (...) {
        /* This branch is chosen if everything is OK and
           the instruction may proceed to the next pipeline stage */
        wp_me_2_next->write( module_data, cycle);
    }
    else {
        // Otherwise, nothing is done and we have to stall the pipeline
        wp_me_2_previous_stall->write( true, cycle);
    }
}
Note: The decode stage behaves slightly differently from the other modules; pay attention to the alternatives shown in the code comments. |
---|
In this assignment we assume that every instruction is executed in 1 cycle, so the only possible stalls are caused by data dependencies and control dependencies.
Note: We DO NOT model "long" instructions or load/store misses in this task. Every instruction that reaches the execution unit leaves it on the next cycle! |
---|
Our goal is to stop an instruction if its sources are not ready.
This can be checked with the following extension of the RF: each register is extended by 1 validity bit.
For an instruction's destination register, this bit is set to false at the decode stage and set back to true at the writeback stage.
Subsequent instructions must check the bits of their sources. An instruction can continue execution if and only if all of them are true; otherwise it is stalled.
Note: Because the $zero register is never overwritten, its validity bit is always true! |
---|
The code changes should look like this:
class RF {
    struct Reg {
        uint32 value;
        bool is_valid;
        Reg() : value(0), is_valid(true) { }
    } array[REG_MAX_NUM];
public:
    uint32 read( Reg_Num);
    bool check( Reg_Num num) const { return array[(size_t)num].is_valid; }
    void invalidate( Reg_Num num) { array[(size_t)num].is_valid = false; }
    void write ( Reg_Num num, uint32 val) {
        // ...
        assert( array[(size_t)num].is_valid == false);
        array[(size_t)num].is_valid = true;
    }
};
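The way the decode stage could use these methods can be sketched with a toy register file. Everything here is an illustration under stated assumptions: `ToyRF`, `ToyInstr` and `try_issue` are hypothetical names, registers are plain indices instead of `Reg_Num`, and, following the text, only the source registers are checked.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy register file with one validity bit per register (hypothetical
// names; the real RF uses Reg_Num and REG_MAX_NUM from the course code).
class ToyRF {
    static const std::size_t NUM_REGS = 32;
    struct Reg { uint32_t value = 0; bool is_valid = true; };
    Reg array[NUM_REGS];
public:
    bool check(std::size_t num) const { return array[num].is_valid; }
    void invalidate(std::size_t num) {
        if (num != 0)  // $zero is never overwritten, so it stays valid
            array[num].is_valid = false;
    }
    void write(std::size_t num, uint32_t val) {
        if (num == 0) return;          // writes to $zero are ignored
        assert(!array[num].is_valid);  // must have been claimed at decode
        array[num].value = val;
        array[num].is_valid = true;
    }
    uint32_t read(std::size_t num) const { return array[num].value; }
};

// Toy instruction: two source registers and one destination register.
struct ToyInstr { std::size_t src1, src2, dst; };

// Decode-stage check: returns true if the instruction may proceed (and
// claims its destination), false if it has to stall.
bool try_issue(ToyRF& rf, const ToyInstr& instr) {
    if (!rf.check(instr.src1) || !rf.check(instr.src2))
        return false;
    rf.invalidate(instr.dst);
    return true;
}
```

A consumer of register 8 is refused by `try_issue` while a producer of register 8 is in flight, and is accepted once writeback has called `write(8, ...)`.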
Note: We ARE NOT going to model out-of-order execution, superscalar CPUs, etc. Please do not invent "scalable" solutions; a working scalar MIPS will be more than enough. |
---|
A control dependency can be represented as a data dependency via the PC register.
You have to add a validity bit for the PC register that is set to false by jumps and branches; they must be detected with the FuncInstr::is_jump() const method.
However, this bit has to be checked at the fetch stage, not at decode.
Note: Non-branch instructions must advance the PC by 4 at the decode stage so that fetching of the next instructions can continue! |
---|
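The PC handling described above can be sketched as follows. This is a toy model with hypothetical names (`ToyPC`, `can_fetch`, `decode_update_pc`, `resolve_jump`); it also assumes that the PC becomes valid again when the jump reaches writeback, by analogy with the register validity bits.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of the PC with a validity bit (hypothetical names).
struct ToyPC {
    uint32_t value = 0;
    bool is_valid = true;
};

// Fetch-stage check: fetch only while the PC is known.
bool can_fetch(const ToyPC& pc) { return pc.is_valid; }

// Decode-stage update: a jump or branch invalidates the PC until it is
// resolved; any other instruction advances the PC by 4.
void decode_update_pc(ToyPC& pc, bool is_jump) {
    if (is_jump)
        pc.is_valid = false;
    else
        pc.value += 4;
}

// Called once the jump target is known (assumed here: at writeback).
void resolve_jump(ToyPC& pc, uint32_t target) {
    assert(!pc.is_valid);  // only a pending jump may set the PC this way
    pc.value = target;
    pc.is_valid = true;
}
```

In a real stage, the `is_jump` argument would come from `FuncInstr::is_jump()`.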
Sometimes it is very useful to see what happens inside the machine. One of the simplest ways is per-stage output: at each stage, the instruction disassembly (if it exists) and its result (if it exists) should be printed to std::cout in a way similar to the functional simulator, but preceded by the stage name and the current clock number, separated by a "\t" sign, like this:
fetch cycle 5: 0x43adcb90
decode cycle 5: ori $t2, $t1, 0xAA00
execute cycle 5: add $t1, $t2, $t3
memory cycle 5: bubble
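A line of this kind could be produced by a small helper like the one below. The helper name and signature are hypothetical, and the exact placement of the tab (between the stage name and the word "cycle") is an assumption read off the sample above, so adjust it to whatever the comparison with your own output requires.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Builds one line of per-stage output: stage name, a tab, the cycle number
// and the text to show (disassembly, raw bytes or "bubble").
// Hypothetical helper; the tab placement is an assumption.
std::string stage_line(const std::string& stage, uint64_t cycle,
                       const std::string& text) {
    assert(!stage.empty());  // every line must name its stage
    std::ostringstream oss;
    oss << stage << '\t' << "cycle " << cycle << ": " << text;
    return oss.str();
}
```

Each `clock_module` function could then print `stage_line("decode", cycle, ...)` to `std::cout` when the simulator is not in silent mode.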
You are free to add IPC/CPI counter output at the end of the simulation.
In silent output mode, the output must be identical to FuncSim's output, i.e. it contains no cycle prefixes, IPC counters, etc.; only the writeback stage produces output.
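If you do add the counters, they follow directly from the two values the simulator already tracks. A minimal sketch, with hypothetical function names:

```cpp
#include <cassert>
#include <cstdint>

// Instructions per cycle; returns 0 if no cycle has been simulated yet.
double ipc(uint64_t executed_instrs, uint64_t cycles) {
    return cycles == 0 ? 0.0
                       : static_cast<double>(executed_instrs) / cycles;
}

// Cycles per instruction, the reciprocal of IPC.
double cpi(uint64_t executed_instrs, uint64_t cycles) {
    assert(executed_instrs > 0);  // undefined before the first writeback
    return static_cast<double>(cycles) / executed_instrs;
}
```

For example, 5 instructions retired in 10 cycles give an IPC of 0.5 and a CPI of 2. Remember that these values must not be printed in silent mode.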
As in the functional simulator, run has 2 parameters:
- const std::string& tr with the file system path to the trace to execute
- int instrs_to_run with the number of instructions to be performed
and one extra parameter:
- bool silent (see above)
The code inside must be very simple:
void PerfMIPS::run(...) {
    // .. init
    executed_instrs = 0; // this variable is stored inside PerfMIPS class
    cycle = 0;
    while (executed_instrs < instrs_to_run) {
        clock_fetch(cycle);
        clock_decode(cycle);
        clock_execute(cycle);
        clock_memory(cycle);
        clock_writeback(cycle); // each instruction writeback increases executed_instrs variable
        ++cycle;
    }
    // ..
}
Question: Can calls of clock_fetch and clock_decode be swapped? What about clock_writeback and clock_fetch ? |
---|
The entry point has to be very similar to FuncSim's, but you have to support the -d option, which disables "silent output mode".
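Argument handling for such an entry point might look like the sketch below. The `Options` struct, the `parse_args` function and the argument order `<trace> <instrs_to_run> [-d]` are assumptions made for illustration; follow whatever your FuncSim entry point already does.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>

// Parsed command line (hypothetical struct and field names).
struct Options {
    std::string trace;
    int instrs_to_run = 0;
    bool silent = true;  // silent unless -d is given
    bool ok = false;     // false if the arguments could not be parsed
};

// Expects: <trace> <instrs_to_run> [-d]. The argument order is an assumption.
Options parse_args(int argc, const char* const argv[]) {
    assert(argv != nullptr);
    Options opts;
    if (argc < 3 || argc > 4)
        return opts;
    opts.trace = argv[1];
    opts.instrs_to_run = std::atoi(argv[2]);
    if (argc == 4) {
        if (std::strcmp(argv[3], "-d") != 0)
            return opts;  // unknown option
        opts.silent = false;
    }
    opts.ok = opts.instrs_to_run > 0;
    return opts;
}
```

`main` would call `parse_args(argc, argv)`, print a usage message if `ok` is false, and otherwise pass `trace`, `instrs_to_run` and `silent` to `run`.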
As you have probably guessed, silent mode is required to quickly compare the FuncSim and PerfSim outputs.
The best way to validate the performance simulator is to compare its output to the functional simulator's.
We provide a script ./run_and_compare.sh that builds and launches func_sim and perf_sim.
Its syntax is similar to the simulators':
./run_and_compare.sh <test name> <instructions amount>
If everything is correct, the script will print "Tests passed" to the screen and finish. Otherwise, vim will be started, showing the differences between the FuncSim and PerfSim traces.
Note: Please look inside that script and try to understand its behavior. Bash scripts can be very useful in the development process. |
---|