Process migration while performing multiple cache block operations together
This wiki page has been prepared quick and dirty. I will not have time to elaborate the discussion in as much detail as many of the audience members will require. For now, I just wish to get it down as quickly as possible, before I forget, and in the hope that this brief description will be sufficient.
I have made some errors in some of the pseudocode examples, e.g. by forgetting the base address and some CMOs, and also by cut-and-paste errors such as saying "T on P1" when it should be "T on P2".
I hope that the reader can look past such problems, which are easily corrected, and will not be scared off by them. It might be smarter not to post the examples until everything is fixed, but I will take the chance in the interest of speeding the discussion up.
Even if I were posting this on my "Art of Computer Architecture" wiki, which I eventually will, such errors are bound to occur.
There has been a lot of unnecessary FUD about process migration in the middle of a string of related CBOs. Without loss of generality I want to call that a multi-block CBO or a multi-block CMO, i.e. a cache flush that involves multiple cache lines.
In particular, the FUD has been about doing a whole cache flush, CMO.ALL, when process migration might occur, but also about iterating over such an entire cache using CMO.IX, the by-index (set, way) cache operation.
There has been less FUD about doing an address range cache flush - by which I mean a set of independent CBO.EA operations, not necessarily the address range cache flush instruction CMO.AR - although exactly the same considerations apply.
Going up a level, that's the whole point: almost exactly the same considerations apply to multi-block cache operations as apply to whole cache flushes, with or without multiple CBO.IX:
BRIEFLY: there is no problem in a coherent system. In an incoherent system, if the OS supports process migration, it already has to do the right thing.
There has also been FUD about the semantics of CMO.ALL whole cache operations. Not as much as there has been about process migration, but nevertheless some, sufficient to cause panic.
I will explain this as well. Again, briefly. And I should probably move it into a completely separate wiki page.
See Semantics of Whole Cache Flushes CMO.ALL's discussion of meaning in the presence of process migration. For the hell of it, I will create a forwarding page, Semantics of Whole Cache Flushes CMO.ALL with Process Migration.
I will start off by discussing processors P1, P2, P3, ..., each with a tightly coupled L1$ - (P1,$1), (P2,$2), ... - and a shared L2$ ($$)
... and I will first discuss the case of coherency for data that is accessed by processes or tasks T (T1, T2, T3), as migrated by the operating system from one processor to another.
(see convenient short names for examples)
First, it should be obvious that there is no problem if all of the processors and their associated L1 caches are consistent with each other. In fact, that was historically one of the original reasons to introduce cache coherence - enabling task migration between processors without expensive cache flushes.
Now let's consider the case where the processors and their associated L1 caches are not consistent. When a process or task T is migrated from P1 to P2, the runtime or operating system must do the following:
- P1 dirty flush (CLEAN): any dirty data in P1's L1 cache must be flushed to the shared L2
- P2 stale invalidate: any stale data in P2's L1 cache must be similarly flushed, i.e. invalidated (there should be no dirty data in $2 that is used by T).
The second operation corresponds to CMO.INVALIDATE on a writethrough cache, but I am reluctant to say that, because on a writeback cache it is dangerous. More precisely, it corresponds to the operation "invalidate clean lines, leave dirty lines alone", which I have proposed - but which is unlikely to be considered by the CMO TG, in part because unnecessary FUD such as this is consuming all our brains. The second operation can be implemented by CMO.FLUSH - but, as noted elsewhere, that is a little bit of overkill. Still, on a present system, it is CMO.FLUSH in general, and CMO.INVALIDATE for writethrough caches or where the operating system has special knowledge.
Some runtimes may optimize by combining the flush of the stale data with the flush of the dirty data, i.e. performing only one scan of the P1 L1 cache, and assuming there is no stale data on P2.
This works, so long as there can be no speculative execution or prefetching that might fetch into P2 a stale copy of data that T subsequently dirties on P1. That optimization:
- worked in the 1980s on nearly all machines
- works now on simple machines that don't do speculative cache misses or prefetching
- but does NOT work on a processor that has aggressive speculation with noncausal cacheability.
- e.g. the Intel speculation model
Without loss of generality, I will assume that the dirty flush and the clean invalidate are done as above; if the runtime and implementation are such that they can fuse the dirty flush and the clean invalidate, similar arguments apply. (A minimal sketch of the two separate steps follows.)
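To make the two steps concrete, here is a minimal sketch in C, under stated assumptions: the cache line size, the address range being scanned, and the cbo_clean_ea / cbo_inval_clean_ea helper names are all hypothetical placeholders for whatever CMOs (or fallbacks such as CMO.FLUSH, as discussed above) a particular implementation actually provides; they are not names defined by the CMO TG.

```c
#include <stdint.h>
#include <stddef.h>

#define CL_SIZE 64  /* assumed cache line size, for illustration only */

/* Placeholder: write back (CLEAN) one line of the local L1 to the shared L2. */
static inline void cbo_clean_ea(volatile void *ea) { (void)ea; /* CMO goes here */ }

/* Placeholder: invalidate a clean line, leaving dirty lines alone
 * (or fall back to a full FLUSH if only FLUSH/INVAL are available). */
static inline void cbo_inval_clean_ea(volatile void *ea) { (void)ea; /* CMO goes here */ }

/* Step 1, executed on the source processor P1:
 * flush (CLEAN) everything T may have dirtied, so it reaches the shared L2. */
static void migrate_flush_dirty_on_source(uintptr_t base, size_t bytes)
{
    for (uintptr_t a = base; a < base + bytes; a += CL_SIZE)
        cbo_clean_ea((void *)a);
}

/* Step 2, executed on the destination processor P2:
 * invalidate anything stale that T may access, so P2 refetches from the L2. */
static void migrate_invalidate_stale_on_dest(uintptr_t base, size_t bytes)
{
    for (uintptr_t a = base; a < base + bytes; a += CL_SIZE)
        cbo_inval_clean_ea((void *)a);
}
```

A real runtime might instead iterate by set/way over the whole L1, or restrict the scan to pages mapped by T; the point is only that these two steps exist and are performed by the code doing the migration.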
So let's consider a sequence of cache operations on multiple blocks, operating on P1's L1 cache:

    CMO.EA.L1 A
    CMO.EA.L1 A + 1 * clSize
    CMO.EA.L1 A + 2 * clSize
    ...
    CMO.EA.L1 A + (N-1) * clSize
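In C-like terms, that sequence is simply a loop over cache-line-sized strides; the cbo_ea_l1() wrapper below is a hypothetical stand-in for whatever CBO.EA encoding is eventually chosen, not an existing intrinsic:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical wrapper for a single by-address CBO on the L1
 * (e.g. a clean or flush of the line containing 'ea'). */
static inline void cbo_ea_l1(volatile void *ea) { (void)ea; /* CMO.EA.L1 goes here */ }

/* Multi-block CBO over [A, A + N*clSize): N independent CBO.EA operations. */
static void multi_block_cbo(uintptr_t A, size_t N, size_t clSize)
{
    for (size_t i = 0; i < N; i++)
        cbo_ea_l1((void *)(A + i * clSize));
}
```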
What happens if there is a process migration in the middle of the sequence?
Again, I hope it is obvious that there is no data consistency problem if the task T that is performing the sequence is migrated from P1 to P2 in the middle of it.
Even though that makes it effectively:

    T on P1: CMO.EA.L1 A
    T on P1: CMO.EA.L1 A + 1 * clSize
    T on P1: ...
    T on P1: CMO.EA.L1 A + k * clSize
    .. migration happens ..
    T on P2: CMO.EA.L1 A + (k+1) * clSize
    T on P2: ...
    T on P2: CMO.EA.L1 A + (N-1) * clSize
Because the processors' L1 caches are coherent, the migrated task T on P2 will see the same data values as on P1.
There is some lossage about the performance tuning aspects, but at least the data values will be consistent.
The non-coherent case may be somewhat less obvious. Consider the same sequence:

    CMO.EA.L1 A
    CMO.EA.L1 A + 1 * clSize
    CMO.EA.L1 A + 2 * clSize
    ...
    CMO.EA.L1 A + (N-1) * clSize
migrated in the middle:

    T on P1: CMO.EA.L1 A
    T on P1: CMO.EA.L1 A + 1 * clSize
    T on P1: ...
    T on P1: CMO.EA.L1 A + k * clSize
    .. migration happens ..
    T on P2: CMO.EA.L1 A + (k+1) * clSize
    T on P2: ...
    T on P2: CMO.EA.L1 A + (N-1) * clSize
If it were just the above operations, there would be a problem: a line such as A + j * clSize, with k < j <= N-1, might have been dirtied on P1 but never flushed there, because the corresponding CBO is performed on P2. Similarly for clean stale data on P2.
But now let's show the effect of the cache flushing that the runtime must do to migrate a process/task T in a non-coherent system:

    T on P1: CMO.EA.L1 A
    T on P1: CMO.EA.L1 A + 1 * clSize
    T on P1: ...
    T on P1: CMO.EA.L1 A + k * clSize
    ... migration starts ...
    Runtime on P1: CMO.FLUSH dirty data (all, or that T may have touched)
    Runtime on P2: invalidate stale data (all, or that T may have touched/accessed)
    ... migration done, T starts on P2 ...
    T on P2: CMO.EA.L1 A + (k+1) * clSize
    T on P2: ...
    T on P2: CMO.EA.L1 A + (N-1) * clSize
This solves the problem: the runtime flushes performed for process migration make things correct. There is redundant work being done here, but we are not trying to optimize that away.
This is also what will happen as a natural result of performing CMO.ALL operations, defined in terms of sequences of CMO.IX, on a system that can perform process migration in the middle of such a sequence. If the operating system knows how to migrate processes on a non-coherent processor system, it will handle the CMO.IX sequences correctly.
QED
The above follows pretty naturally for FLUSH and CLEAN. It also follows pretty naturally for similar safe operations that are not in the traditional set.
INVAL, discarding dirty data, is not safe for the above. But then INVAL is not safe in any case. Most systems will not allow INVAL to user code, and will probably not migrate the system code that is performing INVAL. Of course, any guest OS system code that is performing INVAL may have issues on a hypervisor that performs guest migration. Strategies such as mapping INVAL to FLUSH are probably acceptable, if the runtime cannot make guarantees.
There was less FUD about multiple related-by-address cache operations than there was about whole cache operations.
So let us consider a similar example of multiple cache block operations by index (set/way) that effect a whole cache flush, CMO.ALL:

    CMO.ALL ==
        CMO.IX.L1 A
        CMO.IX.L1 A + 1 * clSize
        CMO.IX.L1 A + 2 * clSize
        ...
        CMO.IX.L1 A + (N-1) * clSize
I prefer to define the semantics of CMO.ALL as being equivalent to such a sequence of by index operations. See Semantics of Whole Cache Flushes CMO.ALL.
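Spelled out as a loop - a sketch only, since the cbo_ix_l1() name, its (set, way) arguments, and the geometry parameters are illustrative assumptions rather than anything the TG has defined - that definition reads:

```c
/* Hypothetical wrapper for a single by-index (set, way) CBO on the L1.
 * The name and the (set, way) argument encoding are illustrative only. */
static inline void cbo_ix_l1(unsigned set, unsigned way)
{
    (void)set; (void)way; /* CMO.IX.L1 goes here */
}

/* One possible reading of CMO.ALL on the L1: the equivalent of iterating
 * every (set, way) pair with CMO.IX, in some order. */
static void cmo_all_l1(unsigned num_sets, unsigned num_ways)
{
    for (unsigned set = 0; set < num_sets; set++)
        for (unsigned way = 0; way < num_ways; way++)
            cbo_ix_l1(set, way);
}
```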
It is apparently not obvious that there is no problem with such CMO.ALL / multiple CMO.IX operations in a non-coherent system, because that was the case that inspired so much FUD. I will attempt to show it here, using much the same arguments as above.
    T on P1/P2: CMO.ALL.L1 ==
        T on P1: CMO.IX.L1 A
        T on P1: CMO.IX.L1 A + 1 * clSize
        T on P1: ...
        T on P1: CMO.IX.L1 A + k * clSize
        ... T is migrated from P1 to P2 ...
        T on P2: CMO.IX.L1 A + (k+1) * clSize
        T on P2: ...
        T on P2: CMO.IX.L1 A + (N-1) * clSize
Once again, if nothing is done during the migration on such a non-coherent system, problems will occur.
But, once again, let us consider what the runtime must do when it migrates task T from P1 to P2:
    T on P1/P2: CMO.ALL.L1 ==
        T on P1: CMO.IX.L1 A
        T on P1: CMO.IX.L1 A + 1 * clSize
        T on P1: ...
        T on P1: CMO.IX.L1 A + k * clSize
        ... migration starts ...
        Runtime on P1: dirty flush: CMO.FLUSH dirty data (all, or that T may have touched)
        Runtime on P2: invalidate stale (all, or that T may have touched/accessed)
        ... migration done, T starts on P2 ...
        T on P2: CMO.IX.L1 A + (k+1) * clSize
        T on P2: ...
        T on P2: CMO.IX.L1 A + (N-1) * clSize
Once again, this fixes the problem. Redundant work may have been performed, but data value consistency is maintained.
If T's CMO.ALL that started on P1 was a flush operation like CMO.FLUSH, then
- the dirty flush on P1 will have completed it there
- the clean invalidate on P2 will have completed it there, assuming there is no speculatively dirtied or prefetched data in P2's L1
If T's CMO.ALL was INVALIDATE/DISCARD... OK, that poses issues not solved by the standard non-coherent process migration described above. But INVALIDATE/DISCARD is such a dangerous optimization that I expect anyone using it to have more precise control, such as ensuring there is no process migration.
QED...
TBD: similar arguments show that process migration is not a problem for non-coherent I/O.
- neither when the I/O is non-coherent with the processors, but the processors are coherent with each other
- nor when the I/O is non-coherent with the processors and the processors are not coherent with each other
I might also hope that this convinces the doubters, but I doubt that.
Part of the misunderstanding that may have caused this FUD, apart from lack of familiarity with non-coherent systems,
is the terminology:
"process T migration happens" tends to apply that the migration is spontaneous.
It might be better to say "run time migrates process T", because that makes clear that there is another party, the run time, that is executing code in order to accomplish the migration.
Q: can process migration ever be truly spontaneous, i.e. performed without OS/runtime intervention?
Maybe - e.g. by hardware thread schedulers? But if a task is migrating between non-coherent processors, the same sort of flushes must be done.
Arguably closer to having spontaneous process/task/thread migration is the case of user-level scheduling: user-level processes/threads take work off, and put suspended work back onto, a task queue. This is often done in order to avoid OS overhead. E.g. classic pthreads - multiple user-level threads that are effectively time-sliced, M user threads onto K OS-managed threads, within the same process.
I have tried to allow for this by saying "runtime" or "runtime/OS" in the discussions above. Not all such schedulers are OS level.
AFAIK most such user-level thread or task queue implementations, and others of that ilk such as Cilk, have been implemented on cache coherent systems, or on systems that don't have caches at all.
But if implemented so that tasks migrate between executor threads on different processors that are not coherent, then the task queue management software must perform the necessary cache flushes (see the sketch below).
If anything, such user-level scheduling and migration is an argument for user-level cache flush operations - in particular, user-level whole cache flushes, CMO.ALL = n * CMO.IX - since the whole point of user-level scheduling and task queues is to avoid OS overhead.
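To make that concrete, here is a minimal sketch of a user-level task queue whose scheduling code performs the migration flushes itself, on the assumption that executor threads run on processors whose caches are not coherent with each other. The user_clean_all_l1 / user_inval_clean_all_l1 helpers are hypothetical user-level CMOs - exactly the kind of operation being argued for here - and nothing below is an existing library API.

```c
/* Hypothetical user-level whole-cache CMOs (CMO.ALL-style). */
static inline void user_clean_all_l1(void)       { /* write back all dirty L1 lines */ }
static inline void user_inval_clean_all_l1(void) { /* drop all stale clean L1 lines */ }

struct task {
    void (*run)(void *arg);
    void *arg;
    struct task *next;
};

/* Called by the executor thread that last ran the task, before the task
 * goes back on the shared queue: make the task's dirty data visible. */
static void suspend_task(struct task *t)
{
    user_clean_all_l1();        /* dirty flush on the source processor */
    /* ... enqueue t on the shared task queue (locking omitted) ... */
    (void)t;
}

/* Called by the executor thread that picks the task up, before running it:
 * get rid of any stale copies it might otherwise read. */
static void resume_task(struct task *t)
{
    /* ... dequeue t from the shared task queue ... */
    user_inval_clean_all_l1();  /* stale invalidate on the destination processor */
    t->run(t->arg);
}
```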
General-purpose Linux systems on symmetric multiprocessing hardware usually do process migration, which is why we have reassured ourselves about issues such as the above.
But even if the statements above about process migration were not valid, I contend that this should not be enough to kill CMO.ALL and CMO.IX types of operations.
Many systems are not general-purpose Linux systems. Many systems do not do process migration. This is particularly common in embedded systems and HPC systems. Many such systems use non-Linux OSes, RTOSes. Some such systems actually use Linux code, although possibly forked from the main distributions. Particular systems may disable process migration, or provide strong processor affinity rather than the namby-pamby version available on many general-purpose Linux flavors.
Some systems have writethrough caches. Some writethrough caches can be instantaneously flushed, implying that process migration cannot occur in the middle of a CMO.ALL.
(Indeed, this is a good argument for having a CMO.ALL operation independent of CMO.IX.)
I will note, however, that the fact that such instantaneous cache flushes or invalidations are atomic with respect to migration does not solve the more fundamental problem that, when such a process is migrated, stale clean data on the target processor must be prevented.
Coherent writethrough caches have no problem.
Non-coherent writethrough caches are also not a problem. However, instead of needing to flush dirty data on the processor P1 being migrated from - there is no dirty data to flush on a writethrough cache, except for pending writethroughs - they only need to invalidate stale clean data.
Therefore, process migration even on a non-coherent cache that can be instantaneously flushed requires the flush, a.k.a. the invalidation of stale clean data, to be performed by the code that is migrating the process/task.
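As a final sketch, the migration hook for non-coherent writethrough caches reduces to the following; the helper names are hypothetical placeholders, and draining pending writethroughs might in reality be a fence or some implementation-specific mechanism. In a real runtime, the two halves would of course run on the source and destination processors respectively.

```c
/* Placeholder: wait for any pending writethroughs on the source processor
 * to reach the shared L2 (might be a fence in a real implementation). */
static inline void drain_pending_writethroughs(void) { /* ... */ }

/* Placeholder: invalidate the contents of the destination L1;
 * on a writethrough cache every line is clean, so discarding is safe. */
static inline void inval_all_l1(void) { /* ... */ }

/* Migration of task T from P1 to P2 with non-coherent writethrough L1s. */
static void migrate_writethrough(void)
{
    /* On P1: nothing dirty to flush; just make sure pending writes land. */
    drain_pending_writethroughs();

    /* On P2: invalidate stale clean data before T resumes. */
    inval_all_l1();
}
```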