Assignment 4 - MIPT-ILab/mipt-mips GitHub Wiki

Note: this assignment is finished.

Introduction

In this assignment you will upgrade our single-cycle implementation to a scalar MIPS simulator with constant latency for instruction execution.

All requirements remain the same as in the previous tasks.


Task

You should create a new branch task_4 and create four new files:

    perf_sim/perf_sim.h
    perf_sim/perf_sim.cpp
    perf_sim/main.cpp
    perf_sim/Makefile

Hint: We suggest starting from copies of the func_sim.h and func_sim.cpp files.

Pipeline modeling

In the previous task you implemented a single-cycle MIPS simulator. All of its actions, however, were encapsulated into 5 stages:

  • Fetch
  • Decode, read sources
  • Execute, calculate address
  • Memory access
  • Writeback, PC update, information dump

Now we are going to encapsulate each stage in a module, with the modules connected by ports. For simplicity, all ports and modules will be stored in one class, PerfMIPS.

Ports

We will use ports for two purposes:

  • A data port transfers data from one stage to the next one
  • A stall port signals to the previous stages that the pipeline is stalled

In this task we don't use a complicated port topology, so use the constants PORT_BW, PORT_FANOUT and PORT_LATENCY everywhere. They must all be defined as 1.

Data port

Data ports must be declared as follows:

    class PerfMIPS {
        ReadPort</*Type*/>* rp_/*source_module*/_2_/*dest_module*/;
        WritePort</*Type*/>* wp_/*source_module*/_2_/*dest_module*/;
        // examples (pointers, because the ports are created with new in the constructor)
        ReadPort<FuncInstr>*  rp_decode_2_execute;
        ReadPort<FuncInstr>*  rp_execute_2_memory;
        WritePort<uint32>*    wp_fetch_2_decode;
        WritePort<FuncInstr>* wp_decode_2_execute;
    };

and initialized in the following way:

    PerfMIPS::PerfMIPS() {
        // example
        rp_decode_2_execute = new ReadPort<FuncInstr>("DECODE_2_EXECUTE", PORT_BW, PORT_FANOUT);
        wp_decode_2_execute = new WritePort<FuncInstr>("DECODE_2_EXECUTE", PORT_LATENCY);
    }

Each pair of data ports has to transmit a FuncInstr object. The only exception is the fetch->decode port, which transmits a raw uint32.
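To illustrate how a read port and a write port that share a name form one connection, and why a value written on cycle N is visible only on cycle N + PORT_LATENCY, here is a minimal sketch. ToyChannel, channel_for, ToyWritePort and ToyReadPort are hypothetical names invented for this illustration. They are not the course's ReadPort/WritePort classes, and bandwidth and fanout are left out entirely. Only the read( &value, cycle) and write( value, cycle) call shapes are taken from this page.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>

// Hypothetical shared channel behind a named port pair (illustration only).
template <typename T>
struct ToyChannel {
    T value{};
    bool full = false;
    uint64_t ready_cycle = 0;
};

// Channels are looked up by key, so both ends only need to agree on the name.
template <typename T>
std::shared_ptr<ToyChannel<T>> channel_for(const std::string& key) {
    static std::map<std::string, std::shared_ptr<ToyChannel<T>>> registry;
    auto& ch = registry[key];
    if (!ch) ch = std::make_shared<ToyChannel<T>>();
    return ch;
}

template <typename T>
class ToyWritePort {
    std::shared_ptr<ToyChannel<T>> ch;
    uint64_t latency;
public:
    ToyWritePort(const std::string& key, uint64_t latency)
        : ch(channel_for<T>(key)), latency(latency) {}

    // The value becomes readable `latency` cycles after the write.
    void write(const T& v, uint64_t cycle) {
        ch->value = v;
        ch->full = true;
        ch->ready_cycle = cycle + latency;
    }
};

template <typename T>
class ToyReadPort {
    std::shared_ptr<ToyChannel<T>> ch;
public:
    explicit ToyReadPort(const std::string& key) : ch(channel_for<T>(key)) {}

    // Returns true and fills *out only if a value is due exactly on this cycle.
    bool read(T* out, uint64_t cycle) {
        assert(out != nullptr);
        if (!ch->full || ch->ready_cycle != cycle) return false;
        *out = ch->value;
        ch->full = false;
        return true;
    }
};
```

With a latency of 1, a word written by fetch on cycle 5 is seen by decode on cycle 6 and not earlier, which is what makes the order of the clock_* calls within one cycle matter less than it first appears.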

Stall ports

A stall port is used to stop the previous stages when the current instruction cannot pass this stage and the stage has to be restarted. These ports must transmit only 1 bit of data, represented by the bool type.

    ReadPort<bool>* rp_decode_2_fetch_stall;
    WritePort<bool>* wp_decode_2_fetch_stall;

    rp_decode_2_fetch_stall = new ReadPort<bool>("DECODE_2_FETCH_STALL", /**/);

Modules

Module names

For consistency, we recommend naming the modules as follows:

  • fetch
  • decode
  • execute
  • memory
  • writeback

Module objects

Each module consists of the following objects:

  • read port from the previous stage*
  • write port to the next stage**
  • stall read port from the next stage**
  • stall write port to the previous stage*
  • internal value on the latch — FuncInstr object or data bytes*
  • void clock_module(int cycle) function (where module is one of the names above)

* Not needed in the fetch module.

** Not needed in the writeback module.

Module behavior skeleton

    void clock_module( int cycle) {
        bool is_stall = false;
        /* If the next module tells us to stall, we stop
           and send a stall signal to the previous module */
        rp_next_2_me_stall->read( &is_stall, cycle);
        if ( is_stall) {
             wp_me_2_previous_stall->write( true, cycle);
             return;
        }

        /* If nothing comes from the previous stage, the
           execute, memory and writeback modules have to jump out here
           (read() is assumed to return true when data has arrived) */
        if ( !rp_previous_2_me->read( &module_data, cycle))
            return;

        /* But the decode stage doesn't jump out.
           It takes the non-updated bytes from module_data
           and re-decodes them */
        // rp_previous_2_me->read( &module_data, cycle)

        // Here we process data.

        if (...) {
             /* This branch is chosen if everything is OK and
                we may continue promotion to the next pipeline stages */
             wp_me_2_next->write( module_data, cycle);
        }
        else {
             // Otherwise, nothing is done and we have to stall the pipeline
             wp_me_2_previous_stall->write( true, cycle);
        }
    }

Note: The decode stage behaves slightly differently from the other modules; pay attention to the alternatives marked in the code.
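The control flow of the skeleton can be tried out in isolation. The following is a minimal sketch of a pass-through stage, assuming that read() returns true only when data has arrived and that all ports have a latency of 1. ToyPort and ToyStage are hypothetical names for this illustration; here one ToyPort object plays both the write end and the read end of a connection, which the real port classes do not do.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical one-slot port with latency 1 (illustration only).
template <typename T>
class ToyPort {
    T slot{};
    bool full = false;
    uint64_t ready_cycle = 0;
public:
    void write(const T& value, uint64_t cycle) {
        slot = value;
        full = true;
        ready_cycle = cycle + 1;
    }
    // Returns true only if a value written on the previous cycle is waiting.
    bool read(T* out, uint64_t cycle) {
        assert(out != nullptr);
        if (!full || ready_cycle != cycle) return false;
        *out = slot;
        full = false;
        return true;
    }
};

// A pass-through stage that follows the skeleton's control flow.
struct ToyStage {
    ToyPort<int>  previous_2_me;        // data from the previous stage
    ToyPort<int>  me_2_next;            // data to the next stage
    ToyPort<bool> next_2_me_stall;      // stall from the next stage
    ToyPort<bool> me_2_previous_stall;  // stall to the previous stage
    int module_data = 0;

    void clock(uint64_t cycle) {
        bool is_stall = false;
        next_2_me_stall.read(&is_stall, cycle);
        if (is_stall) {
            // Forward the stall upstream and do nothing else this cycle.
            me_2_previous_stall.write(true, cycle);
            return;
        }
        if (!previous_2_me.read(&module_data, cycle))
            return;  // bubble: nothing arrived from the previous stage
        // Real stages process module_data here; this one just passes it on.
        me_2_next.write(module_data, cycle);
    }
};
```

Note that is_stall is initialized before the read: if no stall signal was sent, the toy read() leaves the variable untouched, so an uninitialized flag could otherwise trigger a spurious stall.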

Stall generation

In this assignment we assume that every instruction is executed in 1 cycle, so the only possible stalls are caused by data dependencies and control dependencies.

Note: We DO NOT model "long" instructions or load/store misses in this task. Every instruction that reaches the execution unit leaves it on the next cycle!

Data dependency tracker

Our goal is to stop an instruction if its sources are not ready. This can be checked with the following extension of the RF: each register is extended with 1 validity bit. For an instruction's destination register, this bit is set to false at the decode stage and set back to true at the writeback stage. Subsequent instructions must check the bits of their sources. If and only if all of them are true can the instruction continue execution; otherwise it is stalled.

Note: Because the $zero register is never overwritten, its validity bit is always true!

The code changes should look like this:

    class RF {
        struct Reg {
            uint32 value;
            bool   is_valid;
            Reg() : value(0), is_valid(true) { }
        } array[REG_MAX_NUM];
    public:
        uint32 read( Reg_Num);
        bool check( Reg_Num num) const { return array[(size_t)num].is_valid; }
        void invalidate( Reg_Num num) { array[(size_t)num].is_valid = false; }
        void write ( Reg_Num num, uint32 val) {
             // ...
             assert( array[(size_t)num].is_valid == false);
             array[(size_t)num].is_valid = true;
        }
    };

Note: We ARE NOT going to model out-of-order execution, superscalar CPUs, etc. Please do not invent "scalable" solutions; a working scalar MIPS will be more than enough.
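As an illustration of how the validity bits could be used by the decode and writeback stages, here is a minimal sketch. ToyRF and try_issue are hypothetical names, and register numbers are plain integers instead of Reg_Num. Two details are assumptions that go beyond the text above: the sketch silently ignores writes to $zero, and it also checks the destination's bit at decode, so that two in-flight instructions writing the same register cannot trip the assert in write.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical register file with one validity bit per register.
class ToyRF {
    struct Reg { uint32_t value = 0; bool is_valid = true; };
    std::array<Reg, 32> array;
public:
    bool check(std::size_t num) const { return array[num].is_valid; }
    uint32_t read(std::size_t num) const { return array[num].value; }

    void invalidate(std::size_t num) {
        if (num != 0) array[num].is_valid = false;  // $zero always stays valid
    }

    void write(std::size_t num, uint32_t val) {
        if (num == 0) return;                       // $zero is never overwritten
        assert(!array[num].is_valid);               // must have been invalidated at decode
        array[num].value = val;
        array[num].is_valid = true;
    }
};

// Decode: the instruction may proceed only if its sources are valid.
// The destination check is an added assumption (see the lead-in).
// On success the destination is marked as "in flight".
bool try_issue(ToyRF& rf, std::size_t src1, std::size_t src2, std::size_t dst) {
    if (!rf.check(src1) || !rf.check(src2) || !rf.check(dst))
        return false;  // stall: a producer has not reached writeback yet
    rf.invalidate(dst);
    return true;
}
```

For example, after add $t1, $t2, $t3 has been decoded, a following ori $t2, $t1, 0xAA00 fails try_issue until the writeback of the add calls write for $t1.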

Control dependency tracker

A control dependency can be represented as a data dependency via the PC register. You have to add a validity bit for the PC register that is set to false by jumps and branches; they must be detected with the FuncInstr::is_jump() const method. However, this bit has to be checked at the fetch stage, not at the decode stage.

Note: Non-branch instructions must advance the PC by 4 at the decode stage so that fetch of the next instructions can continue!
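The PC rule can be sketched in the same spirit as the register validity bits. ToyPC, can_fetch, on_decode and on_jump_resolved are hypothetical names for this illustration. Where exactly the bit is set back to true is an assumption made by analogy with registers: the stage list at the top of this page places the PC update at writeback.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical PC with a validity bit (illustration only).
struct ToyPC {
    uint32_t value = 0;
    bool is_valid = true;
};

// Fetch may read instruction memory only while the PC is valid.
bool can_fetch(const ToyPC& pc) { return pc.is_valid; }

// Decode: a jump or branch invalidates the PC; any other instruction
// advances it by 4 so that fetch can continue on the next cycle.
void on_decode(ToyPC& pc, bool is_jump) {
    if (is_jump)
        pc.is_valid = false;
    else
        pc.value += 4;
}

// The stage that resolves the jump installs the new PC and re-enables fetch.
void on_jump_resolved(ToyPC& pc, uint32_t new_pc) {
    assert(!pc.is_valid);  // only a pending jump may rewrite the PC
    pc.value = new_pc;
    pc.is_valid = true;
}
```

While the bit is false, fetch produces nothing, so the stages behind the jump simply see bubbles until the target address is known.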

User interfaces

Master output

At each stage, the instruction disassembly (if it exists) and its result (if it exists) should be printed to std::cout in a way similar to the functional simulator, but preceded by the stage name and the current clock number, separated by a "\t" character.

It is often very useful to see what happens inside the machine. One of the simplest ways is per-stage output, where the simulator shows the instruction being processed at each stage, like this:

    fetch   cycle 5:  0x43adcb90
    decode  cycle 5:  ori $t2, $t1, 0xAA00
    execute cycle 5:  add $t1, $t2, $t3
    memory  cycle 5:  bubble

You are free to add IPC/CPI counter output at the end of the simulation.
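A minimal sketch of helpers for the per-stage line and for an IPC counter is given below. stage_line and ipc are hypothetical names, and the exact placement of the tab characters is an assumption based on the sample output above, not a required format.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Builds one per-stage line of the form "<stage>\tcycle <N>:\t<text>".
std::string stage_line(const std::string& stage, uint64_t cycle,
                       const std::string& text) {
    std::ostringstream oss;
    oss << stage << "\tcycle " << cycle << ":\t" << text;
    return oss.str();
}

// Instructions per cycle; returns 0 if no cycle has elapsed yet,
// which avoids a division by zero.
double ipc(uint64_t executed_instrs, uint64_t cycles) {
    if (cycles == 0)
        return 0.0;
    return static_cast<double>(executed_instrs) / static_cast<double>(cycles);
}
```

A stage would print std::cout << stage_line("memory", cycle, "bubble") << std::endl; when it has nothing to process, and CPI is simply the reciprocal of the IPC value.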

Silent output

In silent output mode, the output must be identical to FuncSim's output, i.e. it doesn't contain cycle prefixes, IPC counters, etc.; only the writeback stage produces output.

Method PerfMIPS::run

As in the functional simulator, run has 2 parameters

  • const std::string& tr with the file system path to the trace to execute
  • int instrs_to_run with the number of instructions to execute

and one extra parameter

  • bool silent — see above

The code inside must be very simple:

    PerfMIPS::run(...) {
        // .. init
        executed_instrs = 0; // this variable is stored inside PerfMIPS class
        cycle = 0;
        while (executed_instrs < instrs_to_run) {
              clock_fetch(cycle);
              clock_decode(cycle);
              clock_execute(cycle);
              clock_memory(cycle);
              clock_writeback(cycle); // each instruction writeback increases executed_instrs variable
              ++cycle;
        }
        // ..
    }

Question: Can the calls of clock_fetch and clock_decode be swapped? What about clock_writeback and clock_fetch?

int main

The entry point has to be very similar to FuncSim's, but you have to support a -d option that disables "silent output mode". As you have probably guessed, silent mode is required to quickly compare the FuncSim and PerfSim outputs.
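One possible way to handle the -d option is sketched below. Options and parse_args are hypothetical names, and the assumption that the command line consists of a trace path and an instruction count (matching the parameters of run) is ours, not a documented interface.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical container for the parsed command line.
struct Options {
    std::string trace;
    int instrs_to_run = 0;
    bool silent = true;  // silent unless -d is given
    bool ok = false;     // true only if both positional arguments were found
};

// Accepts "<trace> <instrs> [-d]", with -d allowed in any position.
Options parse_args(int argc, const char* const argv[]) {
    Options opts;
    int positional = 0;
    for (int i = 1; i < argc; ++i) {
        if (std::strcmp(argv[i], "-d") == 0) {
            opts.silent = false;
        } else if (positional == 0) {
            opts.trace = argv[i];
            ++positional;
        } else if (positional == 1) {
            opts.instrs_to_run = std::atoi(argv[i]);
            ++positional;
        } else {
            return opts;  // too many arguments, ok stays false
        }
    }
    opts.ok = (positional == 2);
    return opts;
}
```

In main, the result would then be passed on as run( opts.trace, opts.instrs_to_run, opts.silent) once opts.ok has been checked.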

Validation

The best way to validate the performance simulator is to compare its output to that of the functional simulator. We provide a script, ./run_and_compare.sh, that builds and launches func_sim and perf_sim. Its syntax is similar to that of the simulators:

    ./run_and_compare.sh <test name> <instructions amount>

If everything is correct, the script will print "Tests passed" to the screen and finish. Otherwise, vim will be started, showing the differences between the FuncSim and PerfSim traces.

Note: Please look inside that script and try to understand its behavior. Bash scripts can be very useful in the development process.