Virtual Shadow Maps
Virtual shadow maps (VSMs) are a technique that was mainstreamed by the release of Unreal Engine 5. Essentially, they are regular shadow maps that use virtual memory management. With this, we can make shadow maps as large as we want, as long as the paged-in areas sum to less than the amount of physical memory we have.
In the above 25 km^2 scene containing both small and large scale detail, we can render near-pixel-perfect filtered shadows with 10 cascades of 4096x4096 virtual textures. The memory footprint of VSMs in this scene is just the size of the backing texture (4096x4096 R32) and a small amount of metadata.
The image is from GPU Zen 3, Chapter 12: Virtual Shadow Maps, for which we wrote a complete article detailing an implementation of virtual shadow maps. The implementation described in that article is based on the one in Timberdoodle, which has some minor differences from the one in Frogfood, but the core ideas remain the same.
Another write-up about VSMs can be found on J. Stephano's blog.
Overview
Frogfood's implementation of VSM consists of a series of bookkeeping dispatches followed by a series of cull-and-draws (one for each cascade).
The GPU code for virtual shadow maps can be found in data/shaders/shadows/vsm/ and the CPU code can be found in VirtualShadowMaps.cpp.
Background
The page table (which contains the state and address of each page) is represented as a 2D array texture of R32_UINT. If a single shadow map is 4096x4096 and a page is 128x128, the page table only needs to be 32x32 texels per layer. The state contained by each texel is handily defined in a comment:
// Each layer indicates whether the page is visible and whether it's dirty in addition to mapping to a physical page
// Bit 0: is this page visible?
// Bit 1: is this page dirty (object within it moved or the light source itself moved)?
// Bit 2: is this page backed (allocated)?
// Bits 3-15: reserved
// Bits 16-31: physical page address from 0 to 2^16-1
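For illustration, here is a minimal sketch of how such an entry might be decoded. The constant and function names are hypothetical, not Frogfood's actual identifiers; only the bit layout comes from the comment above.

```glsl
// Hypothetical helpers for decoding a page table entry (bit layout as above).
const uint PAGE_VISIBLE_BIT = 1u << 0;
const uint PAGE_DIRTY_BIT   = 1u << 1;
const uint PAGE_BACKED_BIT  = 1u << 2;

bool pageIsVisible(uint entry)       { return (entry & PAGE_VISIBLE_BIT) != 0u; }
bool pageIsDirty(uint entry)         { return (entry & PAGE_DIRTY_BIT) != 0u; }
bool pageIsBacked(uint entry)        { return (entry & PAGE_BACKED_BIT) != 0u; }
uint physicalPageAddress(uint entry) { return entry >> 16; }
```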
Physical memory is allocated up-front as a 4096x4096 atlas containing 128x128 pages. Originally, it was an array texture to simplify addressing, but the 2048-layer limit shared by almost all hardware was, well, too limiting. The only spatial coherence maintained here is within each page: two neighboring pages could be allocated to very different locations of two different cascades!
Reset Page Visibility
This pass clears the visibility bit of all pages.
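A minimal sketch of such a pass, assuming the page table is bound as an r32ui storage image and the dispatch exactly covers every texel and layer (identifiers are illustrative):

```glsl
#version 460
layout(local_size_x = 8, local_size_y = 8) in;
layout(binding = 0, r32ui) uniform uimage2DArray pageTables;

const uint PAGE_VISIBLE_BIT = 1u << 0;

void main()
{
  // One thread per page table texel; each array layer is one cascade.
  ivec3 coord = ivec3(gl_GlobalInvocationID);
  uint entry = imageLoad(pageTables, coord).x;
  imageStore(pageTables, coord, uvec4(entry & ~PAGE_VISIBLE_BIT));
}
```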
Mark Visible Pages
In this pass, each texel in the viewer's depth buffer is analyzed to determine which pages of the VSM need to contain usable depth information. The analysis makes a guess to select a cascade based on the projected size of a shadow texel compared to the size of a pixel on the screen. With a bias of zero, it attempts to match them 1-to-1 (i.e. so each screen pixel corresponds to one shadow texel). While this analysis isn't perfect (it doesn't account for projective distortion at all), it provides pretty good results in practice.
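The gist of the heuristic could look like the sketch below, where pixelFootprintWorld (the world-space width of a screen pixel at the shaded point) and firstCascadeTexelSize (the world-space texel size of cascade 0) are assumed inputs, not Frogfood's actual names:

```glsl
const uint NUM_CASCADES = 10;

// Pick the first cascade whose texels, projected to the screen, are no larger
// than a screen pixel. Each successive cascade doubles its texel size, so the
// ratio of the two footprints maps to a log2.
uint selectCascade(float pixelFootprintWorld, float firstCascadeTexelSize, float bias)
{
  float lod = log2(pixelFootprintWorld / firstCascadeTexelSize) + bias;
  return uint(clamp(ceil(lod), 0.0, float(NUM_CASCADES - 1)));
}
```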
When a page is selected, its status gets updated via imageAtomicOr and, if it wasn't already allocated (visible last frame), a request to allocate that page is made. Since texels in the same spatial region have a high probability of selecting the same page, a waterfall loop with subgroupBroadcastFirst and subgroupElect is used to minimize the number of atomics issued.
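A sketch of the waterfall pattern, assuming every active lane has computed the page table coordinate it wants to mark and all lanes enter the function together:

```glsl
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_ballot : require

const uint PAGE_VISIBLE_BIT = 1u << 0;
layout(binding = 0, r32ui) uniform uimage2DArray pageTables;

// Lanes that selected the same page retire together; only one elected lane
// per unique page issues the atomic.
void markPageVisible(ivec3 pageCoord)
{
  for (;;)
  {
    // Broadcast the page of the lowest lane still in the loop.
    ivec3 first = subgroupBroadcastFirst(pageCoord);
    if (first == pageCoord)
    {
      if (subgroupElect())
      {
        imageAtomicOr(pageTables, pageCoord, PAGE_VISIBLE_BIT);
      }
      break;
    }
  }
}
```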
Below is a debug view of the page table for a single cascade while turning the camera rapidly (different cascades can be viewed by changing the layer shown). Black texels represent pages that are not visible, not backed (by a physical page), and not dirty. Magenta texels represent pages that are visible and backed, but not dirty (i.e. they are cached). White texels represent pages that are visible, dirty, and backed (i.e. they need to be rendered this frame).
Note that this pass only accounts for geometry that writes to the G-buffer. Transparent geometry and volumetric effects require other methods to receive shadows. One such method could be to always treat every page of a high (coarse) cascade that overlaps the view frustum as visible. This seems to be similar to what is done in UE5.
Free Non-Visible Pages
This pass simply iterates over all pages and sets their backed bit to zero if they weren't visible.
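The body of that loop might look like the following sketch, where freeBits is a hypothetical free-page bitmask shared with the allocator below (a set bit meaning the physical page is free):

```glsl
const uint PAGE_VISIBLE_BIT = 1u << 0;
const uint PAGE_BACKED_BIT  = 1u << 2;

layout(binding = 0, r32ui) uniform uimage2DArray pageTables;
layout(binding = 1, std430) buffer FreeBits { uint freeBits[]; };

void freeIfNotVisible(ivec3 coord)
{
  uint entry = imageLoad(pageTables, coord).x;
  if ((entry & PAGE_BACKED_BIT) != 0u && (entry & PAGE_VISIBLE_BIT) == 0u)
  {
    // Return the physical page to the allocator and clear the backed bit.
    uint phys = entry >> 16;
    atomicOr(freeBits[phys / 32u], 1u << (phys % 32u));
    imageStore(pageTables, coord, uvec4(entry & ~PAGE_BACKED_BIT));
  }
}
```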
Allocate Pages
Despite what the name may suggest, this pass is actually a very simple single-thread dispatch. The thread iterates over all page allocation requests and for each, scans a buffer of bits indicating whether the corresponding physical page is free.
We had many ideas for optimizing this pass, but after discovering that the most naive possible implementation doesn't even register in a profiler, those plans were scrapped.
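A sketch of such a naive allocator, assuming allocation requests were appended to a buffer during Mark Visible Pages (all buffer names and layouts here are illustrative):

```glsl
#version 460
layout(local_size_x = 1) in;

const uint PAGE_BACKED_BIT = 1u << 2;

layout(binding = 0, r32ui) uniform uimage2DArray pageTables;
layout(binding = 1, std430) buffer FreeBits { uint freeBits[]; }; // set bit = free page
layout(binding = 2, std430) readonly buffer Requests
{
  uint requestCount;
  ivec4 pageCoords[]; // xyz = page table coordinate of the request
};

void main()
{
  uint word = 0;
  for (uint i = 0; i < requestCount; i++)
  {
    // Scan for the first word containing a free page; resume where we left off.
    while (word < uint(freeBits.length()) && freeBits[word] == 0u) { word++; }
    if (word == uint(freeBits.length())) { return; } // out of physical pages

    uint bit = uint(findLSB(freeBits[word]));
    freeBits[word] &= ~(1u << bit); // mark the physical page as used
    uint phys = word * 32u + bit;

    // Write the physical address and backed bit into the page table entry.
    ivec3 coord = pageCoords[i].xyz;
    uint entry = imageLoad(pageTables, coord).x;
    imageStore(pageTables, coord, uvec4((entry & 0xFFFFu) | PAGE_BACKED_BIT | (phys << 16)));
  }
}
```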
Generate Bitmap HZB
We also call this the HPB or Hi-P buffer. It's a simple hierarchical acceleration structure that indicates which regions contain a page that must be rendered to (is visible, backed, and dirty). The reduction is done exactly as one would do with a Hi-Z buffer, except with boolean values instead of depth. The HPB is used to efficiently cull geometry that could not contribute to any active page.
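One reduction step could look like this sketch, where mip 0 of the HPB holds 1 for pages that are visible, backed, and dirty, and 0 otherwise (bindings and names are assumptions):

```glsl
#version 460
layout(local_size_x = 8, local_size_y = 8) in;
layout(binding = 0, r8ui) uniform readonly uimage2DArray hpbSrc;  // mip N
layout(binding = 1, r8ui) uniform writeonly uimage2DArray hpbDst; // mip N+1

void main()
{
  // A parent texel is active if any of its four children is active.
  ivec3 dst = ivec3(gl_GlobalInvocationID);
  ivec2 src = dst.xy * 2;
  uint active = imageLoad(hpbSrc, ivec3(src + ivec2(0, 0), dst.z)).x
              | imageLoad(hpbSrc, ivec3(src + ivec2(1, 0), dst.z)).x
              | imageLoad(hpbSrc, ivec3(src + ivec2(0, 1), dst.z)).x
              | imageLoad(hpbSrc, ivec3(src + ivec2(1, 1), dst.z)).x;
  imageStore(hpbDst, dst, uvec4(active));
}
```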
Enqueue and Clear Dirty Pages
This consists of two dispatches. The first creates a list of dirty pages to be cleared and sets up the indirect dispatch parameters for the second. The second dispatch clears all of those pages to prepare them for drawing.
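The enqueue step might look like the following sketch, which dispatches one workgroup per dirty page for the clear (the buffer layout and names are assumptions; groupsY and groupsZ are pre-initialized to 1):

```glsl
const uint PAGE_VISIBLE_BIT = 1u << 0;
const uint PAGE_DIRTY_BIT   = 1u << 1;
const uint PAGE_BACKED_BIT  = 1u << 2;
const uint DIRTY_MASK = PAGE_VISIBLE_BIT | PAGE_DIRTY_BIT | PAGE_BACKED_BIT;

layout(binding = 0, r32ui) uniform readonly uimage2DArray pageTables;
layout(binding = 1, std430) buffer DirtyPages
{
  uint groupsX, groupsY, groupsZ; // indirect dispatch parameters for the clear
  uint pages[];                   // physical addresses of pages to clear
};

void enqueueIfDirty(ivec3 coord)
{
  uint entry = imageLoad(pageTables, coord).x;
  if ((entry & DIRTY_MASK) == DIRTY_MASK)
  {
    // The clear shader indexes pages[] with gl_WorkGroupID.x and resets that
    // page's texels to the far-plane value.
    uint slot = atomicAdd(groupsX, 1u);
    pages[slot] = entry >> 16;
  }
}
```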
Render Clipmaps (Cascades)
This is where the shadow cascades are actually rendered. For each cascade, the scene is culled much as it normally is, with the addition of HPB culling.
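A sketch of the HPB test for a single instance, assuming its bounds have already been projected to the cascade's [0, 1] UV space (identifiers are illustrative):

```glsl
const int PAGES_PER_SIDE = 32; // 4096 virtual texels / 128 texels per page

layout(binding = 0) uniform usampler2DArray hpb;

// True if any page overlapped by the UV rect is active (visible, backed, dirty).
bool passesHpbCull(vec2 uvMin, vec2 uvMax, int cascade)
{
  ivec2 pageMin = clamp(ivec2(uvMin * PAGES_PER_SIDE), ivec2(0), ivec2(PAGES_PER_SIDE - 1));
  ivec2 pageMax = clamp(ivec2(uvMax * PAGES_PER_SIDE), ivec2(0), ivec2(PAGES_PER_SIDE - 1));

  // Pick the mip where the rect spans at most 2x2 texels, as with Hi-Z culling.
  int span = max(pageMax.x - pageMin.x, pageMax.y - pageMin.y);
  int mip = int(ceil(log2(float(max(span, 1)))));

  uint active = 0u;
  for (int y = pageMin.y >> mip; y <= pageMax.y >> mip; y++)
  {
    for (int x = pageMin.x >> mip; x <= pageMax.x >> mip; x++)
    {
      active |= texelFetch(hpb, ivec3(x, y, cascade), mip).x;
    }
  }
  return active != 0u;
}
```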
Drawing
The vertex shader used here is almost identical to the one used for standard drawing. The main difference is in the fragment shader.
Since we're rendering to a virtual texture, we cannot bind it as a normal render target. Instead, we bind the texture as a storage (RW) image. The fragment shader simply translates samples from virtual space to physical space by looking up the page info in the page table. Then, the depth function is implemented with imageAtomicMin.
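A sketch of that fragment shader, assuming a 4096x4096 physical atlas of 128x128 pages and depth stored as float bits in an r32ui image (identifiers are illustrative, not Frogfood's actual names):

```glsl
#version 460

const int PAGE_SIZE     = 128;
const int PAGES_PER_ROW = 32; // 4096 / 128
const uint PAGE_DIRTY_BIT  = 1u << 1;
const uint PAGE_BACKED_BIT = 1u << 2;

layout(binding = 0, r32ui) uniform restrict uimage2D physicalPages;
layout(binding = 1, r32ui) uniform readonly uimage2DArray pageTables;
layout(location = 0) uniform uint cascadeIndex;

void main()
{
  ivec2 virtualTexel = ivec2(gl_FragCoord.xy);
  ivec3 pageCoord = ivec3(virtualTexel / PAGE_SIZE, int(cascadeIndex));
  uint entry = imageLoad(pageTables, pageCoord).x;

  // Only pages that are backed and dirty need to be written this frame.
  if ((entry & (PAGE_BACKED_BIT | PAGE_DIRTY_BIT)) != (PAGE_BACKED_BIT | PAGE_DIRTY_BIT))
  {
    return;
  }

  uint phys = entry >> 16;
  ivec2 pageOrigin = ivec2(int(phys) % PAGES_PER_ROW, int(phys) / PAGES_PER_ROW) * PAGE_SIZE;
  ivec2 physicalTexel = pageOrigin + virtualTexel % PAGE_SIZE;

  // For non-negative floats, the bit pattern is monotonic, so an unsigned
  // atomic min implements the depth function.
  imageAtomicMin(physicalPages, physicalTexel, floatBitsToUint(gl_FragCoord.z));
}
```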
Sampling
When sampling the VSM, we reuse the cascade selection heuristic from Mark Visible Pages. Then, the address can be translated to a physical page and our sample can be acquired. This process is repeated for each sample we wish to take. Since running the heuristic for every sample can be expensive, we can instead run it once and, for each sample, search nearby cascades for an active page that contains it.
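The address translation for a single sample might look like this sketch; it omits the page-address wrapping described under Caching, and the identifiers are illustrative:

```glsl
const int PAGE_SIZE     = 128;
const int PAGES_PER_ROW = 32;
const int VSM_SIZE      = 4096;
const uint PAGE_BACKED_BIT = 1u << 2;

layout(binding = 0, r32ui) uniform readonly uimage2D physicalPages;
layout(binding = 1, r32ui) uniform readonly uimage2DArray pageTables;

// Returns the stored depth, or 1.0 (far) if the page has no data; the caller
// may instead retry with a nearby cascade in that case.
float loadVsmDepth(vec2 virtualUv, int cascade)
{
  ivec2 virtualTexel = ivec2(virtualUv * VSM_SIZE);
  uint entry = imageLoad(pageTables, ivec3(virtualTexel / PAGE_SIZE, cascade)).x;
  if ((entry & PAGE_BACKED_BIT) == 0u)
  {
    return 1.0;
  }
  uint phys = entry >> 16;
  ivec2 pageOrigin = ivec2(int(phys) % PAGES_PER_ROW, int(phys) / PAGES_PER_ROW) * PAGE_SIZE;
  return uintBitsToFloat(imageLoad(physicalPages, pageOrigin + virtualTexel % PAGE_SIZE).x);
}
```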
We can also display various debug information about our pages.
The sampling code can be found in ShadowVsmPcss in ShadeDeferredPbr.frag.glsl. The VSM debug visualizations are at the end of the same file.
Caching
Caching is an arguably essential optimization that can be performed if we are careful with how we address the VSM. To ensure pages remain valid even as the camera moves, we do a few things:
- Quantize the position of each cascade (the 'shadow camera') to the size of a page.
- Determine the cascade's offset from the origin in number of pages.
- Constrain each cascade's movement along its local Z-axis so depth is never invalidated.
The first two points are to help us translate positions within the cascade's viewport to stable (translationally invariant) positions. In other words, we have an address wrapping scheme.
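A sketch of the quantization and wrapping, assuming pageOffset is the cascade's origin offset from the world origin measured in whole pages and worldPageSize is the world-space size of one page in this cascade (both names are assumptions):

```glsl
const int PAGES_PER_SIDE = 32; // must be a power of two for the mask below

// Quantize the shadow camera so it only moves in whole-page increments.
vec2 quantizeCameraXY(vec2 cameraXY, float worldPageSize)
{
  return floor(cameraXY / worldPageSize) * worldPageSize;
}

// Translate a page coordinate in the cascade's viewport to a stable, wrapped
// page table coordinate that survives camera translation.
ivec2 stablePageCoord(ivec2 viewportPage, ivec2 pageOffset)
{
  // Power-of-two wrap; works for negative values in two's complement.
  return (viewportPage + pageOffset) & (PAGES_PER_SIDE - 1);
}
```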
This is clearer when viewing the page table directly. You can see that the frustum shape (which starts near the bottom-left) wraps around. Furthermore, the frustum isn't centered in these stable coordinates.
The third point (constraining movement along the local Z-axis) is simple, but limits the environments that we can render. Consider a scene in which the sun is directly overhead and the length of a cascade's frustum (which is initially centered at the origin) is 1000 units. Any geometry that extends below Z=-500 or above Z=500 will not receive shadows. The situation gets worse when considering sunrise or sunset, because it's more likely for a world to be 1000 units wide than 1000 units tall.
Fortunately, this can be mitigated in a relatively straightforward way: just increase the length of the frustum! Even with a 10 km-long frustum, 32-bit floats give sub-millimeter depth precision everywhere (with depth normalized to [0, 1], the worst-case float32 spacing is 2^-24, about 6e-8, which over 10 km is roughly 0.6 mm). That's an acceptable amount of bias for Frogfood.
Timberdoodle takes a more complex approach to solving this problem, but it does so in a more general way that doesn't require lengthening the frustum to ridiculous extents.