Instruction Fetch - janomach/the-hardisc GitHub Wiki
Overview
The instruction fetch (FE) stage fetches data from memory via the instruction bus and saves them into the instruction-fetch buffer. The core uses the AMBA 3 AHB-Lite protocol for bus transfers. Each transfer comprises an address phase and a data phase, so the fetch stage is also separated into two phases: FE0 (address phase) and FE1 (data phase). The core sends out a request in phase FE0, while in phase FE1, it awaits the response. According to the protocol, the core can initiate the address phase of transfer B while it still waits for the data phase of transfer A. The FE stage contains several prediction units to lower the penalty of taken branches and jumps (5 pipeline bubbles); the predictions are saved into the buffer of predictions.
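The overlap of the two phases can be sketched as a small behavioral model. This is an illustrative Python sketch of the general pipelined-transfer idea, not the Hardisc RTL; the class and method names are assumptions.

```python
# Behavioral sketch (assumption, not the RTL): a two-phase fetch pipeline in
# which the address phase of transfer B may start while the data phase of
# transfer A still waits for the bus response.

class FetchPipeline:
    def __init__(self):
        self.fe0 = None  # FE0: address of the outgoing request (address phase)
        self.fe1 = None  # FE1: address awaiting a response (data phase)

    def cycle(self, next_address, response_ready):
        """Advance one clock; returns the address whose data arrived, or None.

        A new address is accepted only when FE0 advances into FE1; on a wait
        state the caller must re-present it in the next cycle.
        """
        completed = None
        if self.fe1 is not None and response_ready:
            completed = self.fe1      # data phase of the older transfer finishes
            self.fe1 = None
        if self.fe1 is None:          # data phase is free...
            self.fe1 = self.fe0       # ...so FE0 advances to FE1
            self.fe0 = next_address   # and a new address phase begins
        return completed
```

After the pipeline fills, one fetch can complete every cycle even though each individual transfer spans two phases.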
Registers
Name | Bit count | Description
---|---|---
fe0_add | 31 | 31 MSBs of the address in phase FE0 for access to the instruction bus
fe0_utd | 1 | Indicates that the FE0 phase is up-to-date
fe1_add | 31 | 31 MSBs of the address in phase FE1
fe1_utd | 1 | Indicates that the FE1 phase is up-to-date
fe1_inf | 2 | Additional information about FE1 - refer to the file pipeline_fe1.sv
Instruction Fetch Buffer
If the pipeline is stalled, the instructions are prefetched into the FIFO-like instruction-fetch buffer (IFB). The current implementation of the IFB is purposefully designed for short input-to-register (I2R) paths. The data are always pushed into the same registers (row 0 of the buffer). If entry 0 is occupied during a push, all entries are shifted up so the incoming data can be saved into entry 0. The following figure shows which data are saved in each row. The IFB is also specially designed to allow prediction of return addresses in the FE stage. The prediction is based on the data last pushed into entry 0. This means that the IFB outputs not only the oldest entry but also the newest data, for which it can also accept updates of prediction information (field FE1_INF in row 0). The field FETCH_ERR saves an error that occurred during the fetch, whereas FE1_ADD[0] holds bit 1 of the fetch address, which is important for instruction alignment.
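The shift-based push/pop behavior can be modeled as follows. This is a hedged sketch of the scheme described above (new data always in row 0, older data at higher indices); the depth, class name, and overflow handling are assumptions, not the RTL.

```python
# Sketch of a shift-style FIFO like the IFB: row 0 always holds the newest
# data (available for RAS prediction), the highest occupied row is the oldest.

class ShiftBuffer:
    def __init__(self, depth=4):
        self.rows = [None] * depth  # row 0 = newest entry

    def push(self, data):
        if self.rows[0] is not None:          # entry 0 occupied:
            if self.rows[-1] is not None:
                raise OverflowError("IFB full")
            for i in range(len(self.rows) - 1, 0, -1):
                self.rows[i] = self.rows[i - 1]  # shift all entries up
        self.rows[0] = data                   # new data always lands in row 0

    def newest(self):
        # special output port: the latest pushed data, used by the RAS
        return self.rows[0]

    def pop(self):
        # the oldest entry is the highest-indexed occupied row
        for i in range(len(self.rows) - 1, -1, -1):
            if self.rows[i] is not None:
                data, self.rows[i] = self.rows[i], None
                return data
        return None
```

Because pushes always target the same registers, the input-to-register path stays short regardless of how full the buffer is.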
Predictor
The current implementation of the predictor in the Hardisc leverages a simple prediction scheme based on a history of predictions. It consists of two units. The branch predictor predicts the outcome of conditional branches. It includes a branch target buffer (BTB) to save information about executed branches and their targets, and a branch history table (BHT) with saturating counters to record the history of predictions. The jump predictor predicts the outcomes of unconditional jumps and contains a jump target buffer (JTB). The separation into two units exists because conditional branches have 13-bit offsets and require the BHT for more precise predictions, while unconditional jumps may have 21-bit offsets. If the prediction is correct, there is no penalty for taking a branch or jump. A prediction to an unaligned address creates an additional pipeline bubble because fetch addresses are always aligned.
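A BHT entry of this kind is typically a 2-bit saturating counter. The sketch below shows the general technique (initial value and naming are assumptions): values 0-1 predict not-taken, values 2-3 predict taken, which matches the "counter larger than 1" rule used by the predictor.

```python
# Illustrative 2-bit saturating counter for a BHT row (general scheme,
# not the Hardisc RTL).

class SaturatingCounter:
    def __init__(self, value=1):
        self.value = value  # assume start at 1: weakly not-taken

    def predict_taken(self):
        return self.value > 1  # 2 or 3 -> predict taken

    def update(self, taken):
        # move toward 3 on a taken branch, toward 0 on a not-taken one
        if taken:
            self.value = min(3, self.value + 1)
        else:
            self.value = max(0, self.value - 1)
```

The saturation gives the predictor hysteresis: a single mispredicted iteration (e.g. a loop exit) does not immediately flip a strongly-taken counter to not-taken.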
The following figure shows a record of information about an executed branch in a single row of the BTB. The branch instruction resides in memory at address 0x910, while the branch's target address is 0x910 + 0xFFFFFF00. The BTB has 16 entries, and the option SHARED is set to 20. Bits [5:2] of the instruction address select the row within the BTB. The two LSBs of the base branch address are not saved, and since the option SHARED is set to 20, the count of the recorded address bits is (32-20-4-2) = 6.
The predictor compares the instruction bus transfer address with its records in the BTB. If the address matches a saved record and the BHT counter has a value larger than 1, the next address is predicted by adding the saved offset to the transfer address. As the figure shows, only a small part of the instruction address is saved in each row to lower the area and power consumption. This means the predictor may predict a jump from an instruction that does not perform a transfer of control. For this reason, the core contains mechanisms to correct such mispredictions further down the pipeline.
The selection of rows within the BHT and JTB works the same way. The rows in the JTB are similar to those in the BTB, but the offset field is 20 bits wide.
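The indexing and partial-tag compare from the example above can be written out under the stated parameters (16 entries, SHARED = 20). The field names and the dictionary representation are illustrative, not taken from the RTL.

```python
# BTB lookup sketch: row selected by address bits [5:2], partial tag of
# (32 - SHARED - 4 - 2) bits, prediction gated by the BHT counter (> 1).

ENTRIES_BITS = 4                             # 16 entries -> bits [5:2]
SHARED = 20
TAG_BITS = 32 - SHARED - ENTRIES_BITS - 2    # = 6 saved address bits

def btb_row(address):
    """Row index: instruction address bits [5:2]."""
    return (address >> 2) & ((1 << ENTRIES_BITS) - 1)

def btb_tag(address):
    """Partial tag: the TAG_BITS address bits above the row-index field."""
    return (address >> (2 + ENTRIES_BITS)) & ((1 << TAG_BITS) - 1)

def btb_lookup(btb, bht, address):
    """Predicted next fetch address, or None if no prediction is made."""
    row = btb_row(address)
    entry = btb[row]
    # the tag is only partial, so a non-branch address can alias a record;
    # later pipeline stages must correct such mispredictions
    if entry is not None and entry["tag"] == btb_tag(address) and bht[row] > 1:
        return (address + entry["offset"]) & 0xFFFFFFFF
    return None
```

For the example branch at 0x910 with offset 0xFFFFFF00, the lookup yields (0x910 + 0xFFFFFF00) mod 2^32 = 0x810, i.e. a backward branch of 0x100 bytes.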
Return Address Stack
The return address stack (RAS) is also a predictor, implemented as a circular buffer with only a few entries, so it occupies little area. Its predictions are based not on the instruction address but on the data fetched from program memory (the latest data saved into the IFB). If the RAS detects a return instruction, it predicts the return address. The IFB has a special output port for the latest pushed data, which the RAS leverages to evaluate the instruction. The Hardisc supports an extension for compressed (16-bit) instructions, so the RAS must account for instruction alignment. The IFB also contains a special input port leveraged by the RAS to override the prediction information of the latest pushed instruction. The return address is pushed into the buffer of predictions (BoP) in the same way as a target address from the predictor; the RAS has a higher priority. If the RAS prediction is correct, there is only a two-bubble penalty instead of five. A prediction to an unaligned address creates an additional pipeline bubble because fetch addresses are always aligned.
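A circular-buffer RAS of the kind described can be sketched as follows. The depth and the overwrite-on-overflow policy are assumptions about the general technique, not details taken from the Hardisc sources.

```python
# Behavioral sketch of a return address stack as a small circular buffer.
# Being circular, an overflowing push silently overwrites the oldest entry
# instead of stalling; deep call chains then mispredict only their outermost
# returns.

class ReturnAddressStack:
    def __init__(self, depth=4):
        self.buf = [0] * depth
        self.top = 0       # index of the most recent entry
        self.count = 0     # occupied entries, capped at depth

    def push(self, return_address):
        # on a detected call instruction
        self.top = (self.top + 1) % len(self.buf)
        self.buf[self.top] = return_address
        self.count = min(self.count + 1, len(self.buf))

    def pop(self):
        # on a detected return instruction: the predicted target, if any
        if self.count == 0:
            return None
        address = self.buf[self.top]
        self.top = (self.top - 1) % len(self.buf)
        self.count -= 1
        return address
```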
Buffer of Predictions
To check predictions, it is necessary to hold the predicted target address and a flag that a prediction was performed (the prediction flag) until the decision about the prediction's correctness is made in the MA stage. It is improbable that a prediction would have been made from every instruction present in the pipeline simultaneously. Therefore, it would be inefficient to propagate the full prediction information through the pipeline and reserve dedicated registers for it in each stage and each row of the IFB. Instead, the Hardisc contains a separate FIFO-based buffer of predictions (BoP), which holds the predicted target addresses; only the prediction flag propagates with the instruction through the pipeline. The target address is pushed into the BoP whenever a prediction is performed. On the other hand, a prediction is performed only if the BoP has free space. When the instruction from which the prediction was made propagates into the MA stage, the target address is popped from the BoP and evaluated. Each flush of the pipeline also flushes the BoP.
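The BoP protocol above (predict only with free space, push on prediction, pop at MA, flush with the pipeline) can be summarized in a short sketch. The bounded-deque implementation and the depth are assumptions; only the interface mirrors the description.

```python
# Sketch of the buffer of predictions (BoP): a small FIFO of predicted
# target addresses, decoupled from the per-stage pipeline registers.
from collections import deque

class BufferOfPredictions:
    def __init__(self, depth=2):
        self.fifo = deque()
        self.depth = depth

    def can_predict(self):
        # a prediction is performed only if its target can be recorded
        return len(self.fifo) < self.depth

    def push(self, target_address):
        # called whenever a prediction is performed
        assert self.can_predict()
        self.fifo.append(target_address)

    def pop(self):
        # called when the instruction carrying the prediction flag reaches MA
        return self.fifo.popleft()

    def flush(self):
        # a pipeline flush discards all outstanding predictions
        self.fifo.clear()
```

Because predictions enter the pipeline in program order, a plain FIFO suffices: the address popped at MA always belongs to the oldest in-flight predicted instruction.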