Non temporal loads vs CMOs in loops - riscvarchive/riscv-CMOs-discuss GitHub Wiki


Instruction Encoding: Non-temporal hint on load/stores, vs separate instructions

In the meeting of 2020-11-23 I thought that I heard somebody say:

"It is less expensive in terms of instruction encodings to have a single non-temporal bit in a load instruction than it is to have a 'cacheline forget' instruction that has M[rs1+imm12]."

This is so obviously incorrect that I hope I mis-heard...

If you are adding a "non-temporal" bit to a load instruction, e.g.

LOAD.some.size.{temporal,non-temporal}.rd:5,rs1:5.imm:12

then it is obviously more expensive in terms of instruction encodings than

CMO.CLFORGET.rs1:5.imm:12

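That is 2^22 encodings per load size (5 + 5 + 12 operand bits) for the non-temporal variant, versus 2^17 (5 + 12 operand bits) for CLFORGET.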

More expensive in terms of instruction encodings, but less expensive in terms of dynamic instruction count.
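The encoding arithmetic can be checked with a small sketch. It assumes the only operand fields are the ones named in the two pseudo-encodings above, and it counts a single load size; the helper name `encodings` is made up for illustration.

```python
# Count the distinct encodings an instruction consumes,
# given the widths (in bits) of its operand fields.

def encodings(*field_bits):
    """Number of encodings used by an instruction with the given operand field widths."""
    return 2 ** sum(field_bits)

# Non-temporal load variant: rd (5 bits), rs1 (5 bits), imm (12 bits).
nt_load = encodings(5, 5, 12)
# CMO.CLFORGET: rs1 (5 bits), imm (12 bits); there is no destination register.
clforget = encodings(5, 12)

print("non-temporal load:", nt_load)
print("CMO.CLFORGET:     ", clforget)
print("ratio:            ", nt_load // clforget)
```

Under these assumptions the non-temporal load variant costs 32 times as many encodings as the separate CLFORGET, simply because it carries the extra 5-bit rd field.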

If you were talking about having a non-temporal load instruction that does not have an imm12 offset, sure, that saves instruction encodings. But then you are adding dynamic instructions and dataflow depth. Which brings us back to one of the other arguments we were having: whether or not it is okay to add instructions to critical loops.

Hint bits on loads and stores are great - but 32-bit instructions run out

In general, BTW, I would *love* to have a whole slew of hints attached to loads (and stores, and AMOs, and...): non-temporal, which level of cache to place in, critical vs not (for OOO scheduling), ... But 32-bit RISC instruction sets basically run out of room, at least as long as you have to have reg+offset addressing. Such bits can be better afforded on M[rs1]-only memory accesses, at the cost of extra instructions for address computation. When we start defining 48-bit or 64-bit instructions, or even 128-bit instructions (like some GPUs), such hints may be easier to accommodate.

Oftentimes there is a separate CMO instruction like CLFORGET that provides almost the same functionality as a hint bit attached to a load or store. It saves encodings compared with hint+M[reg+imm]+... It adds extra dynamic instructions compared with hint+M[reg+imm], but adds no more (<=) dynamic instructions than hint+M[reg], which needs separate address computation.

I am going to close this email here, restricting it to discussing instruction encodings of non-temporal vs flush.

I will go on and discuss other aspects of non-temporal vs CMO.CLFORGET in https://github.com/riscv/riscv-CMOs-discuss/wiki/Non-temporal-loads-vs-CMOs-in-loops.

CLFORGET to reduce bus traffic

Although... I should add that non-temporal also misses the point of CMO.CLFORGET == DISCARD: avoiding the unnecessary writeback of dirty data. Lest people panic about security implications, we can discuss those in https://github.com/riscv/riscv-CMOs-discuss/wiki/Avoid-Unnecessary-Writebacks

the real issues wrt non-temporal

Even if hint bits are cheap, on loads and stores with full addressing modes, there are issues with non-temporal hints.

Separate CMO instructions like CLFORGET often provide better control (albeit mostly for ninja programmers - compilers are seldom this good), at the cost of an extra instruction per cache line.

E.g. the classic problem with non-temporal loads is that they work best with full cache line loads - e.g. when VLEN=cache-line-size.

If you are doing multiple sub-cache-line scalar loads, the first non-temporal load might say "bypass the L1D$", or maybe "put in the L1D$ but not in the L2D$" (I have seen both), and subsequent loads to the same cache line would carry the same marking.

But that doesn't tell you when you are finished with a cache line. Proper non-temporal really needs two annotations: (1) one to say which cache levels to put the line in, i.e. allocation control, and (2) another to say "I have stopped using it now", i.e. last use, or deallocation control.

I have observed that many people confuse these two flavors of hints related to non-temporal cache management: (1) bypass (some) cache levels, and (2) last use.

The confusion is natural, because (a) having two such markings wastes precious encoding space (especially if M[reg+imm]), and (b) many implementations don't support both. E.g. many implementations cannot bypass inclusive cache layers because of coherence. (Although it should be obvious that many implementations can bypass many or all cache layers.) In many implementations non-temporal amounts to setting LRU on the first load - but then your data may be evicted before subsequent loads to the same cache line. Or maybe you set non-temporal on the last load to a cache line - but that makes it useless for implementations that can do cache bypass. Plus, speculation and OOO cause issues with LRU-based non-temporal hint implementations.

At the very least you need to provide examples of exactly where, in code that has multiple accesses to the same cache line, you would apply your non-temporal hint.
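As one illustration of why hint placement matters, here is a toy simulation of the "non-temporal amounts to setting LRU" style of implementation described above. It is only a sketch under loud assumptions: a tiny fully associative cache, a loop that reads `a[i]` (hinted) and `b[i]` (normal) with four scalar loads per cache line, no bypass, no speculation. The names `ToyCache` and `stream_misses` are hypothetical.

```python
class ToyCache:
    """Fully associative LRU cache; index 0 of `lines` is the next victim."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = []

    def access(self, line, demote=False):
        """Touch `line`; return True on a miss. `demote` models a non-temporal hint."""
        miss = line not in self.lines
        if miss:
            if len(self.lines) == self.ways:
                self.lines.pop(0)          # evict the LRU line
        else:
            self.lines.remove(line)
        if demote:
            self.lines.insert(0, line)     # hinted: becomes the next victim
        else:
            self.lines.append(line)        # normal: becomes most recently used
        return miss


def stream_misses(hint_on, n_lines=8, loads_per_line=4, ways=4):
    """Count misses on the hinted stream `a` when the hint sits on the
    'first' or the 'last' scalar load to each cache line."""
    cache = ToyCache(ways)
    a_misses = 0
    for i in range(n_lines * loads_per_line):
        line, k = divmod(i, loads_per_line)
        hinted = (hint_on == "first" and k == 0) or (
            hint_on == "last" and k == loads_per_line - 1
        )
        a_misses += cache.access(("a", line), demote=hinted)
        cache.access(("b", line))          # interleaved normal stream
    return a_misses


misses_first = stream_misses("first")
misses_last = stream_misses("last")
print("hint on first load:", misses_first, "misses on a")
print("hint on last load: ", misses_last, "misses on a")
```

In this toy model, hinting the first load lets the interleaved `b` miss evict the `a` line before its remaining loads, so `a` misses more than once per line; hinting the last load gives exactly one miss per line. A bypass-capable implementation would behave differently, which is the point of the paragraph above.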

At the cost of extra instructions, CMO.CLFORGET allows a compiler to software pipeline the "I don't need to use this any more" indication, and provide an arbitrary, tunable delay: 1 such instruction per cache line if M[reg+imm12], 2 or more if M[reg]. You cannot do this with non-temporal without adding extra instructions and/or a CSR to control the flush-behind delay.
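A minimal sketch of what such a software-pipelined "flush behind" loop looks like, written as a generator of pseudo-operations rather than real RISC-V code. The function name `pipelined_trace`, the `delay` parameter, and the textual "forget" operation are assumptions for illustration; "forget" stands in for one CMO.CLFORGET with M[reg+imm12] addressing.

```python
def pipelined_trace(n_lines, loads_per_line, delay):
    """Emit one pseudo-op per dynamic instruction for a streaming loop that
    forgets each cache line `delay` lines after its last load."""
    trace = []
    for line in range(n_lines):
        for k in range(loads_per_line):
            trace.append(f"load   line {line} word {k}")
        # Steady state: one forget per cache line, trailing by `delay` lines.
        # The first `delay` iterations issue no forget; in compiled code
        # that guard becomes peeled loop-startup iterations.
        if line >= delay:
            trace.append(f"forget line {line - delay}")
    # Drain: forget the lines that were still pending when the loop ended.
    for line in range(max(n_lines - delay, 0), n_lines):
        trace.append(f"forget line {line}")
    return trace


for op in pipelined_trace(n_lines=4, loads_per_line=2, delay=2):
    print(op)
```

Changing `delay` tunes how far behind the loads the forgets trail, without touching the loads themselves.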

However, this comes with the usual cost for software pipelining: loop startup, or prologue. It is not unusual for the prologue to be larger than the software-scheduled loop body, which makes it challenging to use for code-size-limited embedded systems. This is the sort of thing that Bob Rau's Cydra/Itanium cyclic predicates resolve - an approach which is not viable for RISC-V. Prefetch instructions can be placed in a software pipelined loop without requiring any prologue, but allocation hint CMO instructions can be more problematic. Especially the CMO.CLFORGET=DISCARD flavor, which is destructive.

Compatibility

RISC-V already contains hint instruction encodings - it is only a question of whether we wish to consume some of these hint instruction encodings for software prefetches.

AFAIK RISC-V does not contain hint bitfields in existing loads and stores.

New code that uses such hint instructions, whether prefetches or the allocation CMOs, will run on old machines, because the hint instructions are NOPs on old machines.

Whereas new code that uses a newly defined load or store instruction with a non-temporal bit may not run correctly on old machines. If the old machine implementation traps illegal instruction encodings, the new instruction could be emulated - but at the very least that requires new trap-and-emulate software. IIRC RISC-V allows implementations to not trap and emulate illegal instruction encodings, particularly for low-end systems that want to save decode hardware (illegal instruction decoding is often more expensive than legal instruction decoding).

I.e. hint prefetch and CMO instructions are more compatible: new code using these instructions can run on old machines. Non-temporal bits are not compatible in this new --> old sense.
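The new --> old asymmetry can be sketched with a toy machine model. Everything here is hypothetical: the mnemonics are not real RISC-V encodings, and the old machine is assumed to trap on unknown encodings, which, as noted above, not every implementation does.

```python
class IllegalInstruction(Exception):
    """Raised by the toy machine when it meets an encoding it does not implement."""


# Hypothetical mnemonics. On the old machine the hint encoding is a NOP.
OLD_MACHINE = {"load": "load", "add": "add", "hint": "nop"}
# The new machine reinterprets the hint and also adds a load with a hint bit.
NEW_MACHINE = {**OLD_MACHINE, "hint": "clforget", "load.nt": "load (non-temporal)"}


def run(program, machine):
    """Return what the machine does for each instruction, or trap on an unknown one."""
    actions = []
    for op in program:
        if op not in machine:
            raise IllegalInstruction(op)
        actions.append(machine[op])
    return actions


with_hint = ["load", "add", "hint"]    # separate CMO placed in a hint encoding
with_nt_bit = ["load.nt", "add"]       # new load encoding carrying the hint bit

print(run(with_hint, OLD_MACHINE))     # runs; the hint is ignored
try:
    run(with_nt_bit, OLD_MACHINE)
except IllegalInstruction as exc:
    print("old machine traps on:", exc)
```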

See new vs old code compatibility issues.

prefetches and other CMOs for loop performance

By the way, when I propose a CMO of the form M[reg+imm] for cache line deallocation control inside loops, I do not mean that all CMOs should have M[reg+imm]. That would obviously be expensive in terms of instruction encodings.

I only mean that one (at a stretch, two) flavor of CMO.CLFORGET should have the same sort of M[rs1+imm] hint encodings that can be anticipated for PREFETCH.R and PREFETCH.W.

As I have tried to explain in many places, software prefetch instructions are of questionable value on many advanced microarchitectures, but have proven value on many simple microarchitectures, especially those of low-end embedded microcontroller systems and HPC systems.
