Motivation for Address Range CMO.AR - riscvarchive/riscv-CMOs-discuss GitHub Wiki
Why did we (Krste, Andrew, Ag) propose address range CMOs?
CMO Trap and Emulation - for Mash-Up Systems
It is quite common for systems, especially SOCs, to be "mash ups" of IP from different vendors. E.g. the CPU and L1 cache might be designed by one team, while the L2 or L3 or foobar caches are designed by completely different teams. Often from different companies. Sometimes accessed across bus bridges.
Vertically integrated companies that design their own CPU and cache IP will not have this problem.
In a mature ecosystem, e.g. ARM, where bus standards for CMO operations are well-defined, there is a reasonably good chance that CPU and cache IP vendors agree on bus transactions for CMOs. But in an immature ecosystem, or one that is rapidly innovating, there may be impedance mismatches with respect to CMOs and other concepts. Software may be needed to accommodate such impedance mismatches.
- IIRC it took quite a few years for the ARM ecosystem to become so mature.
- Arguably the Intel x86 ecosystem never became consistently mature with respect to CMOs. The thing that eventually led to consistent behavior was Intel designing everything itself, becoming much less dependent on caches designed by others.
- Also see #CMO innovations and compatibility
While a system designed all in one place should reasonably have bus transactions that the CPU can propagate across whatever interconnect it has, to perform the appropriate operations on external caches, it is quite common for such transactions to have trouble crossing bus bridges. Sometimes the different buses do not have exactly the same transactions. More subtly, oftentimes the ordering rules for those bus transactions are different.
Moreover, the lowest common denominator for interfacing IP blocks is not bus transactions, but memory mapped I/O. E.g. a cache IP block may expose MMIO registers that have to be written in particular patterns to perform a CMO/CBO. For example:

- write M[$IP.MMIO_address] := physical address you wish to perform a cache management operation on
- write M[$IP.MMIO_op] := encoding of the operation
Obviously one can create a state machine that responds to a bus CMO command and performs this MMIO iteration. But this has problems that should be obvious:
- it's a pain to have to create such command-to-MMIO translation state machines
- even if you plan on creating such translators, you will almost certainly have memory transactions crossing over earlier. Often much earlier.
- but most important: we already have a really good state machine - the CPU.
We can use that state machine by trapping and emulating CMO/CBO instructions.
But if you're doing a range of even two or three lines, trapping and emulating each one is bad performance.
One of the best motivations for address range CMO.AR is to amortize this trap and emulate overhead. You take one trap for the entire range, rather than one trap for every line inside the range.
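Concretely, the amortization might look like the following C sketch of a trap handler. The MMIO register layout, names, and two-write protocol are hypothetical stand-ins for whatever a particular cache IP block actually defines; the point is that one trap covers the whole range, with the CPU acting as the iteration state machine:

```c
#include <stdint.h>

/* Hypothetical MMIO register layout for a third-party cache IP block.
 * The names and the two-write protocol are illustrative assumptions,
 * not any real IP block's interface. */
struct cache_ip_mmio {
    volatile uint64_t cmo_address;  /* physical address to operate on */
    volatile uint64_t cmo_op;       /* encoded operation; this write triggers it */
};

enum { CMO_OP_CLEAN = 1, CMO_OP_FLUSH = 2, CMO_OP_INVAL = 3 };

#define CACHE_LINE_SIZE 64

/* Emulate one address-range CMO inside a single trap: the handler
 * iterates over the range, issuing one MMIO sequence per cache line.
 * Without a range instruction, each line would cost its own trap. */
static void emulate_range_cmo(struct cache_ip_mmio *ip,
                              uint64_t pa, uint64_t len, uint64_t op)
{
    uint64_t end = pa + len;
    for (uint64_t line = pa & ~(uint64_t)(CACHE_LINE_SIZE - 1);
         line < end; line += CACHE_LINE_SIZE) {
        ip->cmo_address = line;  /* first write: target address */
        ip->cmo_op = op;         /* second write: kicks off the operation */
    }
}
```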
If the trap and emulate overhead for a single cache-line-sized CBO/CMO instruction is excessive - to do things like interface to $IP MMIO - why not define a system call or other privileged call interface, where the called software accesses the idiosyncratic mechanisms?
(In RISC-V such a software call interface might be part of the SBI.)
In fact, it is possible, even quite likely, that the overhead of an SBI call might be lower than the overhead of a trap.
But this is limiting: people can make CMO instructions fast in most circumstances, though not all. Some of us believe we could even make address range CMOs fast.
And, while we can always hope that privilege calls will become higher performance, that hasn't happened in more than 40 years. The state-of-the-art is that system calls and other privilege calls are slower than we would like.
Q: Why not require SBI calls for large address ranges, but have instructions that are usually fast for small address ranges?
A: where do you draw the boundary? 2 cache lines? 16 cache lines? ...
Such a hybrid approach fragments the ecosystem with respect to performance. While slow trap and emulate code might run everywhere, compilers, etc., will probably need implementation-specific switches to decide when to change from one to the other.
We believe that CMOs are an area of potential innovation.
E.g. although the base set of CMO operations is well-known and small:
- WBINVD/FLUSH/DCBF/EVICT
- INVD/DISCARD
- WB/CLEAN
E.g. the Flush Clean, Keep Dirty operation, invented while working on the RISC-V CMO proposal, which improves performance when flushing stale data from a writeback cache.
E.g. CMO operations like SETLRU/DEMOTE
E.g. operations like way locking and line locking.
E.g. operations that mitigate many of the security issues of INVD.
We would always like to be able to innovate, without fragmenting the ecosystem. We would always like to have a story so that new instructions run on old machines that do not have those instructions.
While some of these operations can be implemented as NOPs or mapped to earlier operations, sometimes this is not possible. Sometimes it is necessary to trap and emulate. (Sometimes it is necessary to trap and kill, but that would be unfortunate.)
When trap and emulate is your compatibility strategy, it is desirable to minimize the traps. Address range CMOs amortize the trap and emulate cost.
It should be obvious to anybody with the slightest bit of imagination that you can do a lot of optimizations for address range CMOs that you cannot do for per-cache-line CBOs.
Consider a cache with four sectors per address block, e.g. four 64-byte sectors per 256-byte "line", the sectors being independently present/clean/dirty. I.e. the sector size is the classic cache line size. SW can avoid false sharing problems by allocating data on 64-byte sector boundaries.
It is probably most natural for a RISC-V implementation to report the cache line size/block size for CBOs as 64 bytes. Mostly because increasing the false sharing granularity has its own issues. But also, in implementation, to avoid having to have a state machine step over all the sectors of a cache line, especially since the bus is probably a bottleneck.
But: an operation like INVD can obviously be performed without needing a state machine: you only need to go to the address block and clear all the sector valid bits. You don't need to iterate sector by sector, moving dirty sectors to some outgoing writeback queue.
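As a toy C model of why this is cheap (struct layout and bit encodings are illustrative, not any particular implementation): discarding, or invalidating only clean sectors, is a single bit operation on the block, while a flush must walk the sectors and queue writebacks:

```c
#include <stdint.h>

/* Toy model of one 256-byte address block holding four 64-byte sectors,
 * each with its own valid and dirty bit. Field names are illustrative. */
#define SECTORS_PER_BLOCK 4

struct cache_block {
    uint64_t tag;
    uint8_t valid;  /* one bit per sector */
    uint8_t dirty;  /* one bit per sector */
};

/* INVD (discard): drop everything in one step by clearing all sector
 * valid bits at once. No per-sector walk, no writeback queue traffic. */
static void invd_block(struct cache_block *b)
{
    b->valid = 0;
    b->dirty = 0;
}

/* "Invalidate Clean, not Dirty": also a single step - keep only the
 * dirty sectors, so no modified data is lost and no bus traffic occurs. */
static void inval_clean_block(struct cache_block *b)
{
    b->valid &= b->dirty;
}

/* FLUSH (writeback + invalidate), by contrast, must walk the sectors,
 * pushing each dirty one to an outgoing writeback queue (stubbed here). */
void flush_block(struct cache_block *b,
                 void (*writeback)(uint64_t tag, int sector))
{
    for (int s = 0; s < SECTORS_PER_BLOCK; s++)
        if ((b->dirty >> s) & 1)
            writeback(b->tag, s);
    b->valid = 0;
    b->dirty = 0;
}
```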
Some general-purpose server bigots piss on INVD. It obviously has security issues, although some embedded and HPC systems can make very effective use of it. In fact, when somebody who worked at a network processor company heard of the CMO TG, his number one request was for INVD.
But the same principle can be applied as a performance optimization for any other operation that does not need to evict dirty data, such as the Invalidate Clean, not Dirty operation, which has none of the security issues of INVD.
=> INVD and Invalidate Clean, not Dirty can be several times faster than WBINVD/FLUSH/DCBF/EVICT on such a sector cache implementation, even if there is absolutely no dirty data to flush. If there is dirty data, the gap only widens.
But: you cannot properly take advantage of this optimization without having an address range.
I keep hearing people say that they don't believe that CMO performance matters, that the bus will always be the bottleneck.
- It should be evident to anybody with a little bit of sense that that is not true for writethrough caches.
- While writeback caches are very common, they are far from universal.
- RISC-V should not be doing things that will handicap the performance of implementations that wish to use writethrough caches.
- Heck, even many systems with writeback caches have an L1 or an L2 that is writethrough.
- operations that get rid of clean stale data incur no bus traffic:
- unsafe: INVD
- safe: Flush Clean not Dirty
It has been suggested by AW that if we don't get address range CMOs, some of our implementations that need to do trap and emulate should lie, inflating the reported block size.
E.g. if the true cache line size is 64 bytes, but we need to trap and emulate, then we might make the block size reported for the cache line at a time CBO operation 256 bytes. Thereby reducing the trap and emulate overhead by 4X.
While this would reduce the trap and emulate overhead, it would be quite regrettable. It would probably mean that malloc would have to allocate on the larger 256-byte granularity, rather than the true 64-byte granularity; i.e. malloc would be adapted for CMOs, not for false sharing the way it usually is.
You can avoid the bad malloc granularity effect by always reporting the natural small cache line size for operations that affect dirty data (WBINVD/FLUSH/DCBF/EVICT), paying the full trap and emulate performance cost for those operations, but not for operations that do not affect dirty data, such as the sector cache example elsewhere in this wiki page.
While that doesn't help trap and emulate performance of dirty CMOs, it helps clean CMOs.
But it requires other minor but unfortunate changes to the CMO ISA proposal. In particular, it requires software to track not just the cache line size of any particular cache, but the cache line size used by any particular operation.