# BigFile Needs Work
Questions I still want answers to:
## General
- When does an object get compressed?
  - Size?
  - Type?
  - Ratio? Anything at or below 20%/25% is not compressed; floor the ratio before the comparison
  - The first object in a block is rarely compressed. Is this to minimize the offset? Probably not
  - Probably a combination of size and ratio, if anything: all objects of size > x with ratio > y. It is unlikely to be complicated (see the sketch below)
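A minimal sketch of the ratio hypothesis above, reading "ratio" as the whole-percent space savings and flooring it via integer division before the comparison. The `Object` struct, its field names, and the exact threshold are illustrative assumptions, not confirmed format details.

```cpp
#include <cstddef>

// Hypothetical object record; these are stand-ins, not confirmed DPC fields.
struct Object {
    std::size_t decompressedSize; // size of the object's data before compression
    std::size_t compressedSize;   // size after running the compressor over it
};

// Guess at the rule: skip compression when the floored savings percentage is at
// or below the threshold (20 or 25 per the note above).
bool shouldCompress(const Object& obj, std::size_t thresholdPercent) {
    if (obj.decompressedSize == 0 || obj.compressedSize >= obj.decompressedSize)
        return false; // nothing to gain
    // Integer division floors the ratio before the comparison.
    std::size_t savedPercent =
        (obj.decompressedSize - obj.compressedSize) * 100 / obj.decompressedSize;
    return savedPercent > thresholdPercent;
}
```

Under this rule with a 20% threshold, an object whose compressed form only saves 15% of its size would be stored uncompressed.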
## Header
### Primary Header
#### poolManifestUnused0 and poolManifestUnused1
- These fields are copied to the internal DPC header representation but those copies do not appear to be used
- Always equal to each other
- Equals 0 when there is no pool manifest
- Number of [unique] objects in the pool.
- Check if it is proportional to the pool size
### BlockDescription
#### workingBufferOffset
- This is often extremely large for the last block
- This may be non-zero even when the block contains no compressed objects and the DPC contains no pool
  - HUB_YELLOWS and TELEPORT_HELICO are good examples of the previous statement
  - TELE_01 has 1 object in the last block and a massive offset
  - THE_INTRO has a block with 0 offset
  - USA1 has all 0 offsets except for the last block, which is >10,000,000
- When isNotRTC == 1, this value is used as the workingBufferOffset; otherwise it is unused but will still have a value
- This is the size of the area used to decompress/process blockObjects, i.e. how far from the beginning of the working buffer to offset the block data, which may or may not have compressed objects in it
  - The previous statement is only half true; for the last block this field is exceptionally large
    - The last one might be big because it has to accommodate pool stuff sometimes (not true; it is big even if there is no pool)
    - Indexing data similar to the pool manifest can be found in this space while the last block is processed
      - This could be the reason the last block has a big value
      - Might be where the lookup table for objects in the DPC winds up
- This may be 0 even when the block contains compressed objects, if the compressed data is far enough into the block that the space before it is enough to decompress into
- If there is at least one compressed object then this may be equal to
  `calculatePaddedSize(largestCompressedObject.objectHeader.decompressedSize - (&largestCompressedObject.data - &block))`
  - Following the previous rule, the game will load without crashing, although the original value is usually larger
- It is possible that some classes may use this buffer while loading (probably not)
  - Stages 7, 8, and 9 call ResourceObject_Z virtual methods
  - Is it constructing the classes in place in this buffer? (no, the `this` pointer isn't in the buffer)
- Always divisible by 2048
- Try this (a sketch follows after this list): for each compressed object, if decompressing the object from the start of the block would overwrite any part of the compressed object, then work out how much additional padded space is needed before the object to prevent this, and keep the largest required padding over all objects. Then, for even and odd blocks separately, store the maximum sum of this padding and the final size of the block with everything compressed. Finally, for each block, the working buffer offset should be the maximum sum for its parity minus the final size of that block with everything compressed.
  - Why this might work: the large values in blocks with no compression suggest there is a value shared between blocks of the same parity; another block of the same parity would cause the uncompressed block to have a large offset because it is placed at the end of the big buffer required by the compressed one. The ground-truth value belongs to the parity, not the block.
  - Don't forget about the pool's offset:
    `calculatePaddedSize(max(poolObjects.decompressedSize)) >> 11`
  - Instead of all the complicated overwrite checks above, it could just be all objects regardless of compression.
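A minimal sketch pulling together the rules collected above: the 2048-byte padding, the per-object lower bound from the `calculatePaddedSize(...)` expression, and the "max per parity" heuristic. Every struct and field name here is a hypothetical stand-in for the real DPC structures, and the heuristic itself is unverified.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the real block/object structures.
struct BlockObject {
    std::size_t offsetInBlock;    // &object.data - &block
    std::size_t compressedSize;   // size of the object's data as stored in the block
    std::size_t decompressedSize; // objectHeader.decompressedSize
    bool        isCompressed;
};

struct Block {
    std::vector<BlockObject> objects;
    std::size_t finalSize; // final size of the block with everything compressed
};

// "Always divisible by 2048": round up to the next 2048-byte boundary.
std::size_t calculatePaddedSize(std::size_t size) {
    return (size + 2047) & ~static_cast<std::size_t>(2047);
}

// Per-object lower bound: padding needed so that decompressing the object to the
// start of the working buffer does not overwrite the compressed data it reads from.
// (A later note suggests this might apply to all objects, not just compressed ones.)
std::size_t requiredPadding(const BlockObject& obj) {
    if (!obj.isCompressed || obj.decompressedSize <= obj.offsetInBlock)
        return 0;
    return calculatePaddedSize(obj.decompressedSize - obj.offsetInBlock);
}

// "Try this" heuristic: the ground-truth value belongs to the parity (even/odd
// block index), not to the individual block.
std::vector<std::size_t> workingBufferOffsets(const std::vector<Block>& blocks) {
    std::size_t maxSum[2] = {0, 0};
    for (std::size_t i = 0; i < blocks.size(); ++i) {
        std::size_t padding = 0;
        for (const BlockObject& obj : blocks[i].objects)
            padding = std::max(padding, requiredPadding(obj));
        maxSum[i % 2] = std::max(maxSum[i % 2], padding + blocks[i].finalSize);
    }
    std::vector<std::size_t> offsets(blocks.size());
    for (std::size_t i = 0; i < blocks.size(); ++i)
        offsets[i] = maxSum[i % 2] - blocks[i].finalSize;
    return offsets;
}

// Pool term from the notes, in 2048-byte sectors.
std::size_t poolWorkingBufferSectors(std::size_t maxPoolObjectDecompressedSize) {
    return calculatePaddedSize(maxPoolObjectDecompressedSize) >> 11;
}
```

The pool helper mirrors the `calculatePaddedSize(max(poolObjects.decompressedSize)) >> 11` expression; it is kept as a separate function because it is still unclear exactly where it enters the calculation.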
## Block Sector
- How is the number of blocks determined?
  - All of the objects may safely be placed into a single block or moved between any number of blocks. ✔️
  - They seem to be split to have similar sizes so that loading doesn't stall too long on a single block (see the sketch after this list)
    - Almost certainly true. Sadly, it is likely due to profiled load times splitting the data into x-millisecond chunks, which would be completely indeterministic and unrecoverable. If the console BigFiles have vastly different blocks from the PC version and from each other then this is probably true; if not, it could go either way.
- How is the order of blocks determined?
  - The blocks may occur in any order so long as blockWorkingBufferCapacityEven and blockWorkingBufferCapacityOdd are appropriate. ✔️
  - Maybe related to the lexicographical order of the path strings before hashing?
  - The decompressed size of the largest compressed object in each block seems to decrease as the block index increases. Am I going insane? This would be a crazy heuristic and way more complicated than anything else they've done. KISS
  - Probably just whatever order the original object list was in before it was split, per the last point about the number of blocks. So simple it might be true
- How is the order of objects in a block determined?
  - The objects may occur in any order so long as the block offset is appropriate. ✔️
  - Maybe related to the lexicographical order of the path strings before hashing?
  - Maybe ordered to reduce the block offset? (too much work)
  - Maybe ordered by dependency, excluding duplicates?
  - Maybe an unordered_map-like container was used and it's effectively random?
  - Maybe ordered by the hash value signed/unsigned/string? NO.
  - In P_MOTO.DPC and P_BUGGY.DPC the objects are ordered by dependency, with the root of the dependency graph first.
  - Again, probably just whatever order they happened to be in.
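A toy sketch of the "split into similarly sized chunks while preserving the original object order" guess above. `ObjectEntry` and `targetBlockSize` are invented for illustration; the real criterion may well be profiled load time rather than byte size, in which case the exact split is not recoverable from the files alone.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical object record; only the size matters for this sketch.
struct ObjectEntry {
    std::size_t sizeOnDisk;
};

// Greedy split that preserves the original object order and closes a block once
// adding another object would push it past the target size.
std::vector<std::vector<ObjectEntry>> splitIntoBlocks(
        const std::vector<ObjectEntry>& objects, std::size_t targetBlockSize) {
    std::vector<std::vector<ObjectEntry>> blocks;
    std::vector<ObjectEntry> current;
    std::size_t currentSize = 0;
    for (const ObjectEntry& obj : objects) {
        if (!current.empty() && currentSize + obj.sizeOnDisk > targetBlockSize) {
            blocks.push_back(std::move(current));
            current.clear();
            currentSize = 0;
        }
        current.push_back(obj);
        currentSize += obj.sizeOnDisk;
    }
    if (!current.empty())
        blocks.push_back(std::move(current));
    return blocks;
}
```

Keeping the input order also lines up with the guess that both block order and object order are just whatever order the original object list was in before splitting.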
## Pool Sector
- When is an object bumped to the pool?
  - References?
  - Size? NO.
  - Type?
  - Is it mandatory or an optimization? It is an optional optimization.
  - Do other DPCs share the object? NO.
  - Handpicked, based on dev knowledge, according to whether it's going to be loaded lazily? (KISS)
- Why are there duplicate objects in the pool?
  - Something to do with the references needing continuous intervals
  - Locality on disk for console BigFiles
- When does an object reference another?
  - CRC used in the object's data?
  - CRC used in the base class object header?
  - Whenever LoadLinkID is used to read the field
- How are the associative arrays in the pool manifest generated?
  - Maybe an unordered_map-like container was used and it's effectively random?
  - Maybe related to the lexicographical order of the path strings before hashing?