Semantics of Whole Cache Flushes CMO.ALL - riscvarchive/riscv-CMOs-discuss GitHub Wiki
What are the semantics of whole cache flush operations like CMO.ALL?
In particular, an issue that has inspired much FUD: what are the semantics of such whole cache flush operations when a process migration might occur in the middle, i.e. when the whole cache flush operation is not atomic with respect to process migration, or, for that matter, interrupts? What does it mean for a whole cache flush operation to be nonblocking, interruptible, preemptable, migratable?
First, I would like to define the semantics of a whole cache operation CMO.ALL as being equivalent to a sequence of per-index cache operations, CMO.IX, operating on cache indexes (set/way).
But that just begs the question: what are the semantics of an individual CMO.IX?
So I will reverse it: we have agreed that CMO.IX is only really useful or meaningful if it invalidates an entire cache:
CMO.ALL
is equivalent to
for all entries E in cache: CMO.IX E
is equivalent to
CMO.IX.L1 0; CMO.IX.L1 1; ...; CMO.IX.L1 (nEntries - 1)
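As a toy model, this equivalence can be sketched in Python. All of the names here (`cmo_ix`, `cmo_all`, the dict-based cache) are illustrative assumptions, not part of any RISC-V specification; the point is only that CMO.ALL behaves like CMO.IX applied to every index:

```python
# Hypothetical sketch: CMO.ALL as a sequence of per-index CMO.IX operations.
# The cache is modeled as a dict from index (set/way) to a resident block.

def cmo_ix(cache, index):
    """Flush whatever block (if any) occupies this index."""
    block = cache.pop(index, None)
    if block is not None and block["dirty"]:
        memory[block["addr"]] = block["data"]   # write back dirty data

def cmo_all(cache):
    """Equivalent to CMO.IX over every index 0 .. nEntries-1."""
    for index in list(cache.keys()):
        cmo_ix(cache, index)

memory = {}
cache = {0: {"addr": 0x1000, "data": 42, "dirty": True},
         1: {"addr": 0x2000, "data": 7,  "dirty": False}}
cmo_all(cache)
assert memory == {0x1000: 42}   # only dirty data was written back
assert cache == {}              # every index was flushed
```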
So I will define the semantics of CMO.ALL, but I will also extend it so that there is a meaning for each individual CMO.IX.
But I will start by talking about the semantics of the "simple" by-address operations, CMO.EA.
A per address cache management operation has a fairly simple meaning that people are comfortable with:
- CMO.op.$id EA addr;
- Performs the operation "op" on the cache $id for the cache block containing "addr".
- on that cache in isolation
- on a set of caches, e.g. caches that are included by it
- in a coherence domain or subdomain
- although inclusion is one of the implementation techniques for such coherence domain operations - whether the inclusion is within the cache itself, or in a snoop filter, or a memory directory...
- there are other ways of implementing coherence domain, such as snoopy buses
- and we don't necessarily mean the maximal coherence domain including the cache $id, but possibly a subset, typically a subtree
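The per-address operation above can be sketched as a minimal toy model. The helper name `cmo_ea_flush`, the dict-based cache, and the block size are assumptions for illustration, not anything from the CMO proposal:

```python
# Minimal per-address sketch: CMO.op.$id EA addr operates on the single
# cache block containing addr (here op = flush on one cache in isolation).

BLOCK = 64                                   # assumed cache-block size

def cmo_ea_flush(cache, memory, addr):
    base = addr & ~(BLOCK - 1)               # block containing addr
    block = cache.pop(base, None)            # invalidate it, if present
    if block is not None and block["dirty"]:
        memory[base] = block["data"]         # write back before invalidating

memory = {}
cache = {0x1000: {"data": 42, "dirty": True}}
cmo_ea_flush(cache, memory, 0x1008)          # any addr inside the block works
assert 0x1000 not in cache
assert memory[0x1000] == 42
```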
Now what does the operation imply?
It would be very convenient if an operation like FLUSH implied that the cache block containing the address was invalid after the operation. E.g. in the short instruction sequence
CMO.FLUSH.$id.EA addr
load addr
it would be convenient if we could guarantee that the next instruction "load addr" takes a cache miss.
Unfortunately, we cannot make such a guarantee in the presence of aggressive speculation and prefetching.
An aggressive microarchitecture might immediately turn around and fetch addr back into $id after the flush operation has been performed, but before the next instruction starts executing.
In general, a cache management operation like a flush cannot guarantee that the data is invalid afterwards. It can only guarantee that the data has been flushed.
A flush cannot guarantee that a subsequent instruction will take a cache miss. It can only guarantee that a cache miss will have occurred after the flush and before the specific instruction has completed.
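This can be made concrete with a toy nondeterministic model. Everything here is an assumption for illustration (the set-based cache, the 50% prefetch chance); the point is that a correct implementation is allowed to re-fill the line between the flush and the load:

```python
import random

# Toy model: after a flush, an aggressive prefetcher may nondeterministically
# re-fill the line before the next load executes, so the load need not miss.

def flush(cache, addr):
    cache.discard(addr)           # guarantees the line left the cache *here*

def maybe_prefetch(cache, addr):
    if random.random() < 0.5:     # speculation/prefetch is allowed to do this
        cache.add(addr)

def load(cache, addr):
    hit = addr in cache
    cache.add(addr)
    return hit

cache = {0x1000}
flush(cache, 0x1000)
maybe_prefetch(cache, 0x1000)     # may run between the flush and the load
hit = load(cache, 0x1000)
# 'hit' may be either True or False: the flush only guarantees that a miss
# occurred somewhere between the flush and the load's completion.
```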
Many people find this behavior surprising. Nevertheless, it is characteristic of many modern processors. I have spent much of the last 30 years explaining this to people, e.g. in Intel's code tuning and operating system coding guides.
This characteristic of noncausal cacheability is not present in all processors. It was absent in nearly all processors before the 1990s, i.e. in processors that do not do speculative execution or prefetching. It is also not present in processors that have speculative execution, but which do not allow speculative cache misses or hardware prefetches. In certain special cases you can make such guarantees - e.g. if the address "addr" that is flushed is marked such that speculative execution and prefetching are not allowed to touch it (e.g. non-idempotent in RISC-V). But in general, if your ISA permits noncausal cacheability (perhaps with some restrictions), speculative execution, and aggressive prefetching, then a cache flush does not imply the data is invalid when you next look at it. It only implies that the data has been flushed in the meantime.
This is standard stuff. It should not be surprising to people familiar with caches, speculative execution, and prefetching, although as I note some microarchitectures and even models of speculation permit stronger definitions. Nevertheless, many people continue to find it surprising. (TBD: notes about Spectre and CMOs)
To be more accurate, we should define the interaction of the CMOs with the memory ordering model. E.g. with respect to observability from other processors.
TBD: I will leave that for the moment.
IPM = nonblocking, interruptible, preemptable, migratable (should be NIPM, but IPM looks better to me for some reason. IMP would be even better, since this is a discussion like the imp of the perverse).
By "atomic wrt IPM", I mean that no interruption/preemption/migration will occur in the middle of the CMO.ALL sequence. I do not mean that it is atomic from the point of view of an observer across the bus. All atomicity is relative.
OK, so we have the semantics of the by-address CMO operations CMO.EA defined above:
- CMO.op.$id EA addr;
- Performs the operation "op" on the cache $id for the cache block containing "addr".
so what does CMO.ALL mean?
Operationally
- CMO.ALL.op.$id;
- Performs the operation "op" on all entries or blocks in the cache $id
- .. or possibly a multi-cache domain, as in CMO.EA, although I suspect many people do not yet think that way - and it doesn't affect the definition too much.
The following is a more abstract definition, that is not implementation specific.
CMO.ALL allows a statement to be made about the set of all addresses that might possibly be in the cache at the time the CMO.ALL is performed.
The "set of all addresses that might possibly be in a cache" IS an architectural property. Or at least it should be, although many ISAs are not explicit. This set is derived from the rules for speculation and prefetching. With "noncausal cacheability" it is essentially every address that is marked as cacheable - in RISC-V, idempotent, and reachable from the current modes.
An operation like CMO.ALL.FLUSH with maximal scope, i.e. applying to all caches of the system, means
- not that a bus writeback has been performed for every possible address - since the vast majority of all possible addresses would probably not be in the cache at all
- but it instead means that there is no dirty data in any cache at the end
- more precisely, in a uniprocessor system, it means that main memory contains versions of all addresses that were current as of the start of the flush operation
- I put this "uniprocessor" constraint to permit the language to be simple.
- In a multiprocessor system, we must provide further specifications about what versions are in memory or not
- For that matter, I have said "maximal scope" once again to make the language simpler. I'll try to tighten that up below
More precisely, it is establishing an ordering relation on versions of all possible addresses - or rather a relationship "not ordered before" - since the memory ordering model in general, and specifically for cache management operations, may be a partial order, not a total order.
PLEASE DON'T PANIC!! - all of the things that I'm saying about versions and cache domains and scopes also apply to the by-address CMO.EA operations. It is just easier to say when you have a specific address, because of the limits of my ability to express these things in natural English. I hope that it will be clearer in a formal notation, but I'm not ready to prepare such a formal notation at this time.
The only thing that is different about CMO.ALL is that the statement is made about all possible addresses that might have been in the cache.
...And possibly also that most people think of CMO.ALL operations as applying to a single cache, not necessarily a domain/scope related to a cache, whereas domain/scope operations appear to be obvious or intuitive for the by-address CMO.EA. E.g. people intuitively understand that CMO.EA.FLUSH.$inc on an inclusive cache $inc will also apply to all of the caches that it includes or covers. But several times in our meetings people have balked, thinking that a CMO.ALL.FLUSH on an inclusive L2 cache will violate the inclusion policy by flushing data from L2 but leaving it behind in the included L1s. I suppose it could, but that's not what we're talking about. If you do a CMO.ALL.FLUSH.$inc on a strictly inclusive cache, it will back-invalidate (or howsoever else it maintains strict inclusion), and therefore flush all of the caches it covers. If you do not have a strictly inclusive cache, ... then that depends.
OK, let us be more precise about what "all addresses that might possibly be or have been in the cache" means.
At any particular point in time T, the processor is allowed to *place* in the cache any address in a set S(T), whether via ordinary load and store operations, or via speculative execution or prefetching, where S(T) is delimited by PMAs, PMPs, and page tables. Those hardware data structures that restrict S(T) from being the entire address space may change over time.
At any particular point in time T, the set of all addresses that might be in the cache is the union of all addresses A that might have been accessible at any time Tbtwn(A) since the last time Tlast(A) at which an operation was performed that would guarantee, i.e. make a statement, that the address A was no longer present (or had the specified property) in the cache(s) involved.
- the time values Tbtwn(A) and Tlast(A) may differ by address because of the presence of by address CMO.EA
- if we only had to worry about CMO.ALL, then it would just be global Tbtwn and Tlast, not dependent by address
- the time T is not parameterized by address, because of the simplifying assumption that CMO.ALL is atomic, i.e. CMO.ALL applies to all possible addresses at the same time.
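The definition above can be sketched as a toy computation. The function name, the `S`/`last_cmo` encoding, and the integer timeline are all assumptions made for illustration; the sketch only shows the union-over-Tbtwn(A)-since-Tlast(A) structure:

```python
# Illustrative model: the set of addresses that might be in the cache at
# time 'now' is the union over addresses A of those A that were placeable
# (in S(t)) at some time t since the last CMO that guaranteed A absent.

def might_be_cached(now, S, last_cmo):
    """S(t): set of placeable addresses at time t (PMAs/PMPs/page tables).
    last_cmo[A]: Tlast(A), the last time an op guaranteed A was absent."""
    possible = set()
    all_addrs = set().union(*(S(t) for t in range(now + 1)))
    for A in all_addrs:
        t_last = last_cmo.get(A, -1)          # -1 = never flushed
        # A might be cached if placeable at any Tbtwn(A) in (Tlast(A), now]
        if any(A in S(t) for t in range(t_last + 1, now + 1)):
            possible.add(A)
    return possible

# Example: 0x1 is always placeable; 0x2 was placeable only at t=0.
S = lambda t: {0x1, 0x2} if t == 0 else {0x1}
last_cmo = {0x1: 1, 0x2: 1}   # a CMO at t=1 guaranteed both absent
assert might_be_cached(3, S, last_cmo) == {0x1}
```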
The statement above about all possible addresses needs to be elaborated to apply to which possible versions of possible addresses may be present. Again, I'm just sketching the approach.
Q: So how do we handle a non-atomic CMO.ALL? A: by expressing it as a set of CMO.IX.
But then how do we define a set of CMO.IX, without requiring microarchitecture concepts to be inserted in the ISA definition?
Actually, the definition is in terms of something like a CMO.EA operation on the appropriate versions of all possible addresses that might have been in the cache. I say "something like" to emphasize that CMO.ALL does not operationally flush all possible addresses. However, it allows the statement that a CMO like FLUSH enables - e.g. that the version in DRAM or outside the scope specified is ordered after any version that was inside the scope specified when the CMO.ALL was performed.
But this defines CMO.ALL as a union of all CMO.EA on all possible addresses/versions. What does that have to do with CMO.IX?
Well, unless we wish to expose the ISA to microarchitecture features such as the mapping of addresses to particular indexes (in general I do not want to do this), I think the most abstract definition is to act as if the cache is fully associative: each CMO.IX might apply to any possible address/version that might be in the specified cache/domain/scope.
I.e. I think an appropriate, conservative, semantic model is that each CMO.IX might affect every possible address that a CMO.ALL might similarly affect at that time. This is obviously not an operational definition, since the operations would be hugely redundant. But it permits us to reason about a worst-case set of addresses/versions that might be affected by CMO.IX.
However, whereas CMO.ALL makes a statement about all addresses - e.g. it can guarantee that all dirty versions not ordered after the CMO.ALL have been "exposed" outside of the domain/scope - CMO.IX is a statement about ANY, but not necessarily ALL. Not "any one", preferably not any finite number, but any subset of the set of possible addresses between size 0 and the maximum.
These definitions allow us to reason about CMO.ALL when its sequence of CMO.IX has been interrupted or preempted or migrated. At any particular time in such a sequence, you can't make a statement about all. You can only make a statement about subsets varying between 0 and all.
But, in an interval of instructions where every possible CMO.IX for a cache has been performed, you can make a statement about all possible addresses. But of course the statement is not guaranteed to be observable at any particular point in the middle, only after the end. And the initial conditions are not necessarily at the beginning, but at any point between the beginning and the end.
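A minimal sketch of this reasoning, under assumptions (the dict cache mapping indexes to addresses unknown to software, and the helper name, are both invented for illustration): software cannot tell which address any single CMO.IX removed; only a complete pass over every index supports a statement about all resident addresses.

```python
# The cache maps indexes to addresses that software cannot observe.
# Software cannot say which address any one CMO.IX removed; it can only
# say that after CMO.IX on every index, everything resident is gone.

def run_ix_sequence(cache, indexes_done):
    """Apply CMO.IX to each index in indexes_done; return flushed addrs."""
    flushed = set()
    for i in indexes_done:
        if i in cache:
            flushed.add(cache.pop(i))
    return flushed

cache = {0: 0xA, 1: 0xB, 2: 0xC}
partial = run_ix_sequence(dict(cache), [0, 1])   # interrupted mid-sequence
full = run_ix_sequence(dict(cache), [0, 1, 2])   # complete sequence
assert partial <= {0xA, 0xB, 0xC}   # only "some subset": can't claim ALL
assert full == {0xA, 0xB, 0xC}      # only the full sequence covers ALL
```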
What about if you establish multiprocessor memory ordering between an individual CMO.IX and instructions on other processors, or even on the same processor using data and control flow dependencies?
Again, you have to reason about a range of permitted behaviors. Any individual CMO that is ordered before may have touched nothing, or any subset up to the entire set of possible addresses/versions. The individual CMO is similarly imprecise. Subsequent CMOs similarly.
Formal semantics do not specify a single state. They specify a range or set of permitted states. The set of permitted states for an individual CMO.IX is quite large. It is certainly larger than the set of permitted states that an "atomic" CMO.ALL can produce. In fact, the set of permitted result states for CMO.ALL is always a strict subset of the set of permitted result states for CMO.IX, since for any particular set of CMO.IX, any particular address/version may not yet have been touched.
Is this imprecision a problem? No, it is not: formal semantics for things like memory ordering almost never specify a single result state. They specify a range or set of permitted states.
We have shown that all CMOs, including CMO.EA, can be considered to make semantic statements about the versions of the specified addresses that are available inside and outside the specified cache domains, subject to memory ordering rules.
...