WIP: Optimize GC frame generation - yuyichao/explore GitHub Wiki
Challenges
The GC frame generation is effectively a register allocation problem. However, a few constraints make it different from simple register allocation.
- `jlcall` frames

  At the `jlcall` call sites, we need the roots to be in the right order for the `jlcall`. This makes the allocation significantly harder to optimize. Due to the conflicts and overlapping lifetimes of multiple `jlcall` frames, we may not be able to easily select the optimum `jlcall` frame offsets. However, we'd like to be at least as good as the allocation on 0.4, i.e. directly emit temporaries into the `jlcall` frame and reuse uninitialized `jlcall` frame slots for temporaries.

- `returntwice` functions (a.k.a. `try-catch`)

  A `returntwice` function creates invisible control flow in the function. We don't really need to support generic `returntwice` functions that jump outside of `llvmcall` since they introduce control flow that is not visible to other parts of the compiler. For a `try-catch` frame, we need to keep everything that is live during the `returntwice` call for the `catch` branch live during the `try` branch until it hits the corresponding `leave`.

  The roots that are kept live this way also need to always be in the same slot, which is not really an issue for simple register allocation but might need to be handled specially due to `jlcall` frames.
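The `try-catch` constraint above can be illustrated with a small sketch. It assumes, purely for illustration, that safepoints are numbered linearly and that the `try` region is the inclusive range between the `enter` and `leave` safepoints; a real pass would work on the control flow graph. The name `extend_for_try` and its parameters are hypothetical.

```python
def extend_for_try(live, catch_live, enter, leave):
    """Extend live sets so that catch-branch values survive the try region.

    live:        dict mapping a value name to the set of safepoint indices
                 at which it is live
    catch_live:  names of values that are live on the catch branch at the
                 returntwice (enter) call
    enter/leave: safepoint indices delimiting the try region (inclusive);
                 linear numbering is an assumption of this sketch
    """
    region = set(range(enter, leave + 1))
    extended = {}
    for name, safepoints in live.items():
        if name in catch_live:
            # Must stay rooted (in the same slot) until the matching leave.
            extended[name] = set(safepoints) | region
        else:
            extended[name] = set(safepoints)
    return extended
```

Values that are only used inside the `try` branch keep their original, shorter live sets and remain free to share slots.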
Optimizations
- Do not root an object if it is known to be rooted

  - `typeof`
  - constants
  - arguments
  - load from immutable (root parent instead)
  - known cached boxes (`Symbol`, `Int8`, etc)
  - singletons (constants?)

  This can be done by a pass to delete GC roots for these types. For `typeof`, cached boxes and singletons, we might need information from codegen, which can be represented as a `julia.gc_noroot(%jl_value_t*)` intrinsic.

  Loading from immutable can be detected by `tbaa_immut`. This should be done after `jlcall` frame layout in order to avoid using an extra slot for the parent if the child is only used in a `jlcall` frame anyway.

- Do not duplicate roots

  This should be mostly handled by the `mem2reg` pass, which should do store-to-load forwarding, and therefore we mostly need to handle distinct SSA values. There might still be cases where we'll see stores to and loads from a stack slot (`alloca`), either because we don't run the `mem2reg` pass (`-O0`) or because the stores and loads are volatile (variables used in `try-catch`). In such cases, we need to promote the slot into a GC root and do some simple GC root collapsing.

  - The stored value doesn't need a root if its lifetime is fully covered by the lifetime of the store, which is from the store to the first `enter` or `store` to the same slot or other instructions that can change the value of the slot. We can probably assume the slot address is not escaped.
  - The load of the value doesn't need a root if its lifetime is fully covered by the lifetime of the value loaded in the slot.

- Handle `phi` node

  Running after `mem2reg` means that we need to handle `phi` nodes of different roots (codegen doesn't really emit them but `mem2reg` will). We shouldn't need to add any new slots since it is always legal to keep all the sources live in their own slots. We can optimize this by using one of the parent slots if the other one is never live at the same time as the `phi` node.

- Track lifetime only at safepoint

  Different from a typical register allocation, which needs to track lifetime at each instruction, we only need to track the lifetime at GC safepoints, which are usually function calls (unless specially marked) or other GC intrinsics in the code. This means that we can trivially delete a GC root if it is only used between two safepoints.

- Optimize write barrier using the same info (TODO)

  (Store of known old value or to known young value doesn't need a wb)

- Lower GC safepoint and state transition (TODO)
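The safepoint-only lifetime tracking described above can be sketched as follows. This is a minimal illustration, assuming straight-line code in which every instruction is identified by its position; the names `live_safepoints` and `needs_root` are hypothetical and not part of the actual pass. It also assumes that a safepoint which itself uses the value (for example a call taking it as an argument) counts toward the lifetime, while the instruction that defines the value does not.

```python
def live_safepoints(def_idx, use_idxs, safepoints):
    """Return the set of safepoints a value has to survive.

    def_idx:    position of the instruction that creates the value
    use_idxs:   positions of the instructions that use the value
    safepoints: positions of the instructions that are GC safepoints
    """
    if not use_idxs:
        return set()
    last_use = max(use_idxs)
    # Only safepoints after the definition and up to the last use matter;
    # ordinary instructions in between are ignored entirely.
    return {s for s in safepoints if def_idx < s <= last_use}


def needs_root(def_idx, use_idxs, safepoints, known_rooted=False):
    """A value needs a GC root only if it is live across some safepoint
    and is not already known to be rooted (constant, argument, ...)."""
    if known_rooted:
        return False
    return bool(live_safepoints(def_idx, use_idxs, safepoints))
```

With safepoints at positions 1 and 4, a value created at position 2 and last used at position 3 has an empty live set, so its root can be deleted.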
Steps
- Collect a list of `jlcall` frames

  Iterate the entry block before `ptlsStates`, looking for calls of the `julia.jlcall_frame_decl` intrinsic.

- Collect a list of root slots

  Iterate the entry block before `ptlsStates`, looking for calls of the `julia.gcroot_decl` intrinsic.

- Collect a list of SSA roots

  These are the SSA values that are stored to GC roots and `jlcall` frames. Follow `phi` nodes with at least one SSA root as input.

- Collect a list of known rooted values

  These are the SSA values marked with `julia.gc_noroot(%jl_value_t*)` and function arguments. Any load from these values with `jltbaa_immut` is marked too, recursively.

- Remove known rooted values from SSA root list

  Including the list collected above and constants.

- Collect a list of safepoints

  Iterate all basic blocks and instructions. Collect a list of call instructions except the ones that are marked as non-safepoint.

- Collect a list of stack slots related to SSA roots

  These are single pointer `alloca`s which have at least one load or store of SSA roots, as well as all the loads and stores of SSA roots on this slot. (This needs to be done before removing `jlcall` roots)

- Collect `enter-leave` pairs

  Create a map between `enter` and `leave`. Do constant propagation on `enter` to figure out the normal branch and the error branch (if possible). Allocate exception frames.

- Collect the live interval (set of safepoints) of SSA roots

  Follow `gep` and `bitcast`. Do not follow `phi` nodes and do not merge the live intervals of a `phi` node's inputs and output.

- Collect the live interval (set of safepoints) of `jlcall` slots

  Scan in the reverse order `jlcall` frames are declared.

  For ones with only one store, if the stored value is an unhandled SSA root, assign the root to this slot and remove it from the SSA root list. The lifetime of the slot is the lifetime of the SSA root plus the lifetime of the slot (from store to `jlcall`).

  If the single store is a known rooted value, move the store to right before the use and the lifetime is only the `jlcall`.

  If the single store is assigned to a different `jlcall` frame, remove the lifetime of the SSA value from the lifetime of the slot and move the store to the new beginning of the lifetime. Extend the "rooted lifetime" of the SSA value with the lifetime of the slot so that other slots don't need to root it in this interval. We need to be careful that the move of the store is valid. It cannot be moved across the creation of the SSA value and maybe not across `returntwice` functions either.

  Otherwise use the lifetime of the store.

- Place `jlcall` frames

  Scan in the reverse order `jlcall` frames are declared.

  If any values in the `jlcall` frame exist (and are live) in frames that are already positioned, try to see if we can match any of them and merge the stores.

  Otherwise, find a best fit hole in the existing frame to place the current one (possibly extending the frame size).

  When placing the `jlcall` frame, we need to check all existing frames with overlapping lifetime. (How to make this `O(n)`?)
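The best fit placement in the last step can be sketched as below. This is only an illustration under stated assumptions: each frame is reduced to a size (at least one slot) and a set of safepoints at which it is live, the store matching and merging step is not modeled, and the `Frame` class and `place_frames` function are hypothetical names. The check against every already placed frame is the quadratic part the note above asks about.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    name: str
    size: int            # number of contiguous slots the jlcall needs (>= 1)
    live: frozenset      # safepoint indices at which the frame is live
    offset: int = -1     # assigned slot offset, -1 while unplaced


def place_frames(frames):
    """Assign offsets to frames (given in declaration order) and return
    the total number of slots used."""
    placed = []
    for f in reversed(frames):  # scan in reverse declaration order
        # Slot ranges taken by placed frames whose lifetime overlaps ours.
        busy = sorted((p.offset, p.offset + p.size)
                      for p in placed if p.live & f.live)
        best = None   # (hole size, hole offset) of the tightest fitting hole
        cursor = 0    # first slot not covered by the ranges seen so far
        for start, end in busy:
            hole = start - cursor
            if hole >= f.size and (best is None or hole < best[0]):
                best = (hole, cursor)
            cursor = max(cursor, end)
        # No finite hole fits: place after the conflicting frames,
        # possibly extending the overall frame size.
        f.offset = best[1] if best is not None else cursor
        placed.append(f)
    return max((p.offset + p.size for p in placed), default=0)
```

Frames whose lifetimes never overlap end up sharing the same offsets, which is what lets temporaries reuse `jlcall` frame slots.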