Performance Simulator - MIPT-ILab/mipt-mips GitHub Wiki
In the functional simulator, all actions are encapsulated into 5 stages:
- Fetch
- Decode, read sources
- Execute, calculate address
- Memory access
- Writeback, PC update, information dump
In the performance simulator, each stage is encapsulated in a module, and modules are connected by ports.
Modules and ports can be visualized with the dedicated tool.
We use ports for two purposes:
- A data port transfers data from one stage to the next one
- A stall port signals to the previous stages that the pipeline is stalled
Currently we don't use a complicated port topology, so we use the constants PORT_BW, PORT_FANOUT and PORT_LATENCY everywhere. They are all defined as 1.
Data ports have the following syntax:
class PerfMIPS {
    std::unique_ptr<ReadPort</*Type*/>> rp_/*source_module*/_2_/*dest_module*/;
    std::unique_ptr<WritePort</*Type*/>> wp_/*source_module*/_2_/*dest_module*/;

    // examples
    std::unique_ptr<ReadPort<FuncInstr>> rp_decode_2_execute;
    std::unique_ptr<ReadPort<FuncInstr>> rp_execute_2_memory;
    std::unique_ptr<WritePort<uint32>> wp_fetch_2_decode;
    std::unique_ptr<WritePort<FuncInstr>> wp_decode_2_execute;
};
and are initialized in the following way:
PerfMIPS::PerfMIPS() {
    // example
    rp_decode_2_execute = make_read_port<FuncInstr>("DECODE_2_EXECUTE", PORT_BW, PORT_FANOUT);
    wp_decode_2_execute = make_write_port<FuncInstr>("DECODE_2_EXECUTE", PORT_LATENCY);
}
Each pair of data ports transmits a FuncInstr object. The only exception is the fetch->decode port, which transmits a raw uint32.
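To make the timing of a port pair concrete, here is a minimal sketch of a one-writer, one-reader channel with a fixed latency. `ToyPort` is a hypothetical stand-in written for this page, not the simulator's `ReadPort`/`WritePort` classes; it only illustrates that data written in cycle N becomes readable in cycle N + latency.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Hypothetical toy channel (NOT the simulator's port classes):
// one writer, one reader, fixed latency measured in cycles.
template <typename T>
class ToyPort {
    struct Entry {
        std::uint64_t ready_cycle;  // cycle at which the data becomes visible
        T data;
    };
    std::deque<Entry> queue;
    std::uint64_t latency;

public:
    explicit ToyPort(std::uint64_t latency) : latency(latency) {}

    // Data written at `cycle` becomes visible at `cycle + latency`.
    void write(const T& data, std::uint64_t cycle) {
        queue.push_back({cycle + latency, data});
    }

    // Returns true and fills *out only if data is ready exactly at `cycle`.
    bool read(T* out, std::uint64_t cycle) {
        assert(out != nullptr);
        // Drop entries that nobody read in their cycle.
        while (!queue.empty() && queue.front().ready_cycle < cycle)
            queue.pop_front();
        if (queue.empty() || queue.front().ready_cycle != cycle)
            return false;
        *out = queue.front().data;
        queue.pop_front();
        return true;
    }
};
```

With a latency of 1, a value written by one stage in cycle 5 is not seen by a read in cycle 5 but is returned by a read in cycle 6.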
A stall port is used to stop the previous stages if the current instruction cannot pass this stage and the stage has to be restarted.
These ports transmit only 1 bit of data, represented by the bool type.
std::unique_ptr<ReadPort<bool>> rp_decode_2_fetch_stall;
std::unique_ptr<WritePort<bool>> wp_decode_2_fetch_stall;
rp_decode_2_fetch_stall = make_read_port<bool>("DECODE_2_FETCH_STALL", /**/);
We have the following modules:
fetch
decode
execute
memory
writeback
Each module consists of the following objects:
- read port from the previous stage (*)
- write port to the next stage (**)
- stall read port from the next stage (**)
- stall write port to the previous stage (*)
- internal value on the latch: FuncInstr object or data bytes (*)
- void clock_module(int cycle) function (where module is one of the names above)

(*) Not needed in the fetch module.
(**) Not needed in the writeback module.
void clock_module( int cycle) {
    bool is_stall;
    /* If the next module tells us to stall, we stop
       and send a stall signal to the previous module */
    rp_next_2_me_stall->read( &is_stall, cycle);
    if ( is_stall) {
        wp_me_2_previous_stall->write( true, cycle);
        return;
    }

    /* If nothing comes from the previous stage,
       execute, memory and writeback modules have to jump out here */
    if ( !rp_previous_2_me->read( &module_data, cycle))
        return;

    /* But the decode stage doesn't jump out.
       It takes the non-updated bytes from module_data
       and re-decodes them */
    // rp_previous_2_me->read( &module_data, cycle);

    // Here we process data.

    if (...) {
        /* This branch is chosen if everything is OK and
           we may continue promotion to the next pipeline stages */
        wp_me_2_next->write( module_data, cycle);
    }
    else {
        // Otherwise, nothing is done and we have to stall the pipeline
        wp_me_2_previous_stall->write( true, cycle);
    }
}
Note: The decode stage behaves slightly differently from the other modules; pay attention to the alternative variants in the code |
---|
For now we assume that every instruction is executed in 1 cycle, so the only possible stalls are caused by data dependency and control dependency.
Our goal is to stall an instruction if its sources are not ready.
This can be checked with the following extension of the RF: each register is extended with 1 validity bit.
For an instruction's destination register, this bit is set to false at the decode stage and set back to true at the writeback stage.
Subsequent instructions must check the bits of their sources. An instruction can continue execution if and only if all of them are true; otherwise it is stalled.
Note: Because the $zero register is never overwritten, its validity bit is always true! |
---|
The code changes should look like:
class RF {
    struct Reg {
        uint32 value;
        bool is_valid;
        Reg() : value(0ull), is_valid(true) { }
    } array[REG_MAX_NUM];
public:
    uint32 read( Reg_Num);
    bool check( Reg_Num num) const { return array[(size_t)num].is_valid; }
    void invalidate( Reg_Num num) { array[(size_t)num].is_valid = false; }
    void write ( Reg_Num num, uint32 val) {
        // ...
        assert( array[(size_t)num].is_valid == false);
        array[(size_t)num].is_valid = true;
    }
};
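The way the decode stage could use these validity bits can be sketched as follows. `ToyRF` and `try_issue` are hypothetical names introduced only for this illustration, registers are addressed by plain indices instead of `Reg_Num`, and the special handling of register 0 is one possible way to keep `$zero` always valid. The sketch does not handle two in-flight instructions writing the same register.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the RF with validity bits.
class ToyRF {
    struct Reg {
        std::uint32_t value = 0;
        bool is_valid = true;
    };
    std::array<Reg, 32> regs{};

public:
    bool check(std::size_t num) const { return regs[num].is_valid; }
    std::uint32_t read(std::size_t num) const { return regs[num].value; }

    void invalidate(std::size_t num) {
        if (num != 0)  // $zero is never overwritten, so it stays valid
            regs[num].is_valid = false;
    }

    void write(std::size_t num, std::uint32_t val) {
        if (num == 0)
            return;  // writes to $zero are ignored
        assert(!regs[num].is_valid);  // must have been invalidated at decode
        regs[num].value = val;
        regs[num].is_valid = true;
    }
};

// Decode-stage check: returns false if the instruction must stall.
// On success the destination is marked as "being produced".
inline bool try_issue(ToyRF& rf, std::size_t src1, std::size_t src2,
                      std::size_t dst) {
    if (!rf.check(src1) || !rf.check(src2))
        return false;  // a source is not ready yet: stall
    rf.invalidate(dst);
    return true;
}
```

For example, after `add $t1, $t2, $t3` passes decode, a following `ori $t2, $t1, 0xAA00` fails `try_issue` until the writeback of the first instruction calls `write` for `$t1`.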
A control dependency can be represented as a data dependency via the PC register.
You have to add a validity bit for the PC register that is set to false by jumps and branches; they must be detected with the FuncInstr::is_jump() const method.
However, this bit has to be checked at the fetch stage, not at the decode stage.
Note: Non-branch instructions must advance the PC by 4 at the decode stage so that fetching of the next instructions can continue! |
---|
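The PC validity bit can be sketched in the same spirit. All names here (`ToyPC`, `can_fetch`, `decode_update_pc`, `resolve_jump`) are hypothetical, and the assumption that the bit is set back to true when the jump target is finally written to the PC is made by analogy with the register validity bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical PC register extended with a validity bit.
struct ToyPC {
    std::uint32_t value = 0;
    bool is_valid = true;
};

// Fetch-stage check: fetch only while the PC is known.
inline bool can_fetch(const ToyPC& pc) { return pc.is_valid; }

// Decode-stage update: a jump or branch invalidates the PC,
// any other instruction advances it by 4.
inline void decode_update_pc(ToyPC& pc, bool is_jump) {
    if (is_jump)
        pc.is_valid = false;
    else
        pc.value += 4;
}

// Assumed to happen when the jump target is finally known (PC update).
inline void resolve_jump(ToyPC& pc, std::uint32_t target) {
    assert(!pc.is_valid);
    pc.value = target;
    pc.is_valid = true;
}
```

Here `is_jump` stands for the result of `FuncInstr::is_jump()`; while the bit is false, the fetch stage simply does not fetch.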
At each stage, the instruction disassembly (if any) and its result (if any) should be printed to std::cout in a way similar to the functional simulator, but preceded by the stage name and the current clock number, separated by a "\t" character.
Sometimes it is very useful to see what happens inside the machine. One of the simplest ways is per-stage output: the simulator shows the instruction being processed at each stage, like this:
fetch cycle 5: 0x43adcb90
decode cycle 5: ori $t2, $t1, 0xAA00
execute cycle 5: add $t1, $t2, $t3
memory cycle 5: bubble
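A line of this dump could be built by a small helper like the one below. `stage_line` is a hypothetical function, and the exact placement of the tab characters is an assumption, since the sample above does not show them explicitly.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: builds one line of the per-stage dump.
// `text` is the disassembly, the raw bytes, or "bubble".
inline std::string stage_line(const std::string& stage, std::uint64_t cycle,
                              const std::string& text) {
    assert(!stage.empty());
    std::ostringstream oss;
    oss << stage << "\tcycle " << cycle << ":\t" << text;
    return oss.str();
}
```

Each `clock_module` function would then print something like `std::cout << stage_line("decode", cycle, disassembly) << std::endl;` unless the output is silenced.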
IPC is printed at the end of the simulation.
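IPC (instructions per cycle) is the number of executed instructions divided by the number of simulated cycles. A minimal sketch, where `compute_ipc` is a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: IPC = executed instructions / elapsed cycles.
inline double compute_ipc(std::uint64_t executed_instrs, std::uint64_t cycles) {
    assert(cycles > 0);  // at least one cycle must have been simulated
    return static_cast<double>(executed_instrs) / static_cast<double>(cycles);
}
```

For instance, 50 instructions retired in 100 cycles give an IPC of 0.5.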
As in the functional simulator, run has 2 parameters:
- const std::string& tr with the file system path to the trace to execute
- int instrs_to_run with the number of instructions to be executed
and one extra parameter:
- bool silent (see above)
The code inside is very simple:
PerfMIPS::run(...) {
    // .. init
    executed_instrs = 0; // this variable is stored inside PerfMIPS class
    cycle = 0;
    while (executed_instrs < instrs_to_run) {
        clock_fetch(cycle);
        clock_decode(cycle);
        clock_execute(cycle);
        clock_memory(cycle);
        clock_writeback(cycle); // each instruction writeback increases executed_instrs variable
        ++cycle;
    }
    // ..
}
Question: Can the calls to clock_fetch and clock_decode be swapped? What about clock_writeback and clock_fetch? |
---|
The entry point has to be very similar to the one in FuncSim.