Format Documentation - HearthSim/UnityPack GitHub Wiki

This page documents the internal structures of file data loaded as assets in the Unity engine. To understand what Unity does with these files, it helps to first know how Unity's Entity-Component system works and to read the public Unity documentation on serialization.

File Types

SerializedFile

A SerializedFile is essentially a collection of Objects and metadata describing how those Objects are serialized. *.assets files are of this format. A SerializedFile is also the most important portion (CAB-...) of an AssetBundle file.

Header

The file begins with a 0x14 byte header. Values in this header are always stored big-endian.

struct SerializedFileHeader {
  int32 metadataSize;
  int32 fileSize;
  int32 version;
  int32 objectDataOffset;
  bool bigEndian;
  char _padding[3];
};
  • metadataSize: The size in bytes of the metadata section which immediately follows the header.
  • fileSize: The size in bytes of the entire SerializedFile.
  • version: This is the primary version indicator for this format, and "version" in this document refers to this value unless otherwise specified. (Note: in UnityPack, this is called "format")
  • objectDataOffset: The offset in bytes from the start of the file to the beginning of serialized object data.
  • bigEndian: When true, values in the remainder of this file are stored big-endian; when false, they are stored little-endian. The values in the first 0x14 bytes of the file, i.e. this header, are always big-endian; only the endianness of values in the rest of the file is determined by this boolean.

When version is < 9, the bigEndian field does not exist, the file is always big-endian, and the header is only 0x10 bytes.
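As a rough illustration of the header layout above, the following Python sketch reads the header with the standard struct module. The names SerializedFileHeader, parse_header and header_size are hypothetical and exist only for this example; they are not taken from UnityPack.

```python
import struct
from typing import NamedTuple


class SerializedFileHeader(NamedTuple):
    metadata_size: int
    file_size: int
    version: int
    object_data_offset: int
    big_endian: bool
    header_size: int  # not stored in the file; derived from version


def parse_header(data: bytes) -> SerializedFileHeader:
    # The four int32 fields are always big-endian, whatever bigEndian says.
    metadata_size, file_size, version, object_data_offset = struct.unpack_from(">iiii", data, 0)
    if version >= 9:
        # bool bigEndian plus 3 padding bytes: 0x14 bytes in total.
        big_endian = data[16] != 0
        header_size = 0x14
    else:
        # No bigEndian field: always big-endian, header is only 0x10 bytes.
        big_endian = True
        header_size = 0x10
    return SerializedFileHeader(
        metadata_size, file_size, version, object_data_offset, big_endian, header_size
    )
```

The returned header_size tells the caller where the metadata section starts, and big_endian selects the byte order for everything read after the header.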

Metadata

The metadata section contains some variable-length strings. These are serialized with a null terminator; for example, "5.3.6p1" will be written 35 2E 33 2E 36 70 31 00, 8 bytes including the null terminator. There is no padding for alignment purposes.
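A minimal sketch of reading such a string, assuming the whole metadata section is already in memory as bytes. The helper name read_cstring and the choice of UTF-8 decoding are assumptions made for this example.

```python
from typing import Tuple


def read_cstring(data: bytes, pos: int) -> Tuple[str, int]:
    """Read a null-terminated string at pos; return it and the next position."""
    end = data.index(b"\x00", pos)  # raises ValueError if no terminator exists
    # No alignment padding follows, so the next field starts right after the null.
    return data[pos:end].decode("utf-8"), end + 1
```

For the byte sequence quoted above, this returns the string "5.3.6p1" and a next position of 8.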

This list is in order of serialization:

  • string generatorVersion: This string contains the engine version of the generator of this file, e.g. "5.3.6p1".
  • int32 platform: An enum value giving the platform this file was built for. See the RuntimePlatform enum for possible values.
  • When version is >= 13:
    • bool hasTypeTrees: whether TypeTree data has been included in this metadata
    • int32 numTypes: the number of types of objects in this serialized file
    • For each type in numTypes:
      • When version is >= 17:
        • int32 classID: 0x72 (114: MonoBehaviour) for script types
        • int8 ???
        • int16 ???
      • Otherwise (version < 17):
        • int32 classID: Negative for script types
      • If classID indicates a script type:
        • char scriptHash[16]
      • char typeHash[16]
      • If hasTypeTrees:
        • TypeTree typeTree (see TypeTree serialization below)
  • Otherwise (version < 13):
    • int32 numTypes
    • For each type in numTypes:
      • int32 classID: Negative for script types
      • TypeTree typeTree
  • When 7 <= version and version <= 13:
    • int32: (unused)
  • int32 numObjectInfos
  • For each objectInfo in numObjectInfos:
    • When version >= 14:
      • Align the stream to the next 4-byte boundary (relative to the start of the metadata section)
      • int64 objectID
    • Otherwise:
      • int32 objectID
    • int32 dataOffset: This is added to the objectDataOffset value in the header to determine the file offset to the object data.
    • int32 dataSize: The size in bytes of the serialized object.
    • When version < 17:
      • int32 typeID
      • int16 classID
    • Otherwise:
      • int32 typeIndex: an index into the array of type information given by looping numTypes above.
    • When version <= 10:
      • int16 isDestroyed: (unused)
    • When 11 <= version and version <= 16:
      • int16: Unknown. Starting in version 17 this is the int16 read in the loop over numTypes above. Defaults to 0xffff.
    • When 15 <= version and version <= 16:
      • int8: Unknown. Starting in version 17 this is the int8 read in the loop over numTypes above. Defaults to 0.
  • When version >= 11:
    • int32 numAdds
    • For each add in numAdds: (Serialize a PPtr)
      • int32 fileID
      • When version >= 14:
        • Align the stream to the next 4-byte boundary
        • int64 pathID
      • Otherwise:
        • int32 pathID
  • int32 numExternalFiles
  • For each externalFile in numExternalFiles:
    • When version >= 6:
      • string assetName
    • When version >= 5:
      • int32 guid[4]
      • int32 type
    • string fileName
  • When version >= 5:
    • string: Unknown, always empty.
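The version-dependent ObjectInfo layout in the list above is easy to get wrong, so here is a hedged Python sketch of that one loop. The function names, the dictionary keys and the metadata_start parameter are hypothetical; the unknown trailing fields are skipped rather than interpreted.

```python
import struct


def align4(pos: int, base: int) -> int:
    """Round pos up to the next 4-byte boundary, measured from base."""
    return base + ((pos - base + 3) & ~3)


def read_object_infos(buf, pos, version, big_endian, metadata_start=0):
    """Read numObjectInfos and each ObjectInfo; return (infos, next position)."""
    e = ">" if big_endian else "<"
    (count,) = struct.unpack_from(e + "i", buf, pos)
    pos += 4
    infos = []
    for _ in range(count):
        info = {}
        if version >= 14:
            # 64-bit objectID, aligned relative to the start of the metadata.
            pos = align4(pos, metadata_start)
            (info["object_id"],) = struct.unpack_from(e + "q", buf, pos)
            pos += 8
        else:
            (info["object_id"],) = struct.unpack_from(e + "i", buf, pos)
            pos += 4
        info["data_offset"], info["data_size"] = struct.unpack_from(e + "ii", buf, pos)
        pos += 8
        if version < 17:
            info["type_id"], info["class_id"] = struct.unpack_from(e + "ih", buf, pos)
            pos += 6
        else:
            (info["type_index"],) = struct.unpack_from(e + "i", buf, pos)
            pos += 4
        if version <= 10:
            pos += 2  # int16 isDestroyed (unused)
        if 11 <= version <= 16:
            pos += 2  # int16, unknown
        if 15 <= version <= 16:
            pos += 1  # int8, unknown
        infos.append(info)
    return infos, pos
```

The file offset of an object's data is then the header's objectDataOffset plus the data_offset read here.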

TypeTree

The TypeTree describes how values in individual objects are serialized. It is a tree structure in which each node is a struct field. Object serialization can be performed by traversing this tree depth-first and reading or writing according to the information at each node.

Type and field names in the newer TypeTree format are stored not as strings directly, but as offsets into one of two string buffers: a global string buffer, defined as a constant string in the engine; and a local string buffer, included locally in the TypeTree. These string buffers contain null-terminated strings stored sequentially. When the offset value has bit 31 set (i.e., is negative), that bit is masked off to get an offset in the global string buffer; when bit 31 is not set, the offset is in the local string buffer. A copy of the global string buffer can be found in strings.dat (note: GitHub's preview strips out null bytes, so the previewed contents are misleading).
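The offset lookup described above could be sketched as follows. The function name resolve_name is hypothetical, and the buffers used in the usage note are toy data, not the real contents of strings.dat.

```python
def resolve_name(offset: int, local_buffer: bytes, global_buffer: bytes) -> str:
    """Resolve a TypeTree type/name offset to its string."""
    if offset & 0x80000000:
        # Bit 31 set (negative as a signed int32): mask it off, use the global buffer.
        buf = global_buffer
        offset &= 0x7FFFFFFF
    else:
        # Bit 31 clear: the offset points into the TypeTree's local buffer.
        buf = local_buffer
    end = buf.index(b"\x00", offset)  # strings are null-terminated
    return buf[offset:end].decode("ascii")
```

With a toy global buffer b"int\x00string\x00" and a toy local buffer b"m_Name\x00", the offset 0x80000004 resolves to "string" and the offset 0 resolves to "m_Name". The mask works whether the offset was read as a signed or an unsigned 32-bit value.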

  • When version == 10 or version >= 12:
    (new compact blob format)
    • int32 numNodes: Number of nodes in the tree
    • int32 stringBufferSize: Size in bytes of the local string buffer
    • For each node in numNodes:
      • int16 version: This field does not appear to be used by anything.
      • int8 depth: The depth in the tree of the current node. Nodes of the tree are serialized depth-first, so this number will increase when the current node is a child of the previous node, will stay the same when the current node is a sibling of the previous node, and will decrease when the current node is a sibling of one of the previous node's parents.
      • bool array: When true, this node is a special array node -- its first child (size) in the tree is the size in elements of the array, and its next child (data) is serialized in a loop for each element of the array.
      • int32 type: The string buffer offset of the type name of this node.
      • int32 name: The string buffer offset of the field name of this node.
      • int32 size: The expected size in bytes when this node (including children) is serialized. This is -1 for variable-sized fields, such as arrays or structs that have arrays as children.
      • int32 index: This is just an index of the node in the flat depth-first list of nodes.
      • int32 flags: Flags for serialization and miscellaneous information.
        • 0x4000: the stream should be aligned after serializing this field
    • char stringBuffer[stringBufferSize]: The local string buffer
  • Otherwise (version == 11 or version <= 9):
    (old format; fields are described above)
    • string type
    • string name
    • int32 size
    • int32 index
    • int32 array: This is still a bool (0 or 1), it's just 4 bytes as opposed to the 1 byte in the blob format
    • int32 version
    • int32 flags
    • int32 numChildren
      • For each child in numChildren, recurse starting at string type
        Note that this is explicitly listing children, as opposed to the blob format which just uses depth to determine child/parent relationships.
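Because the blob format stores only a depth per node, a reader has to rebuild the parent/child links itself. The sketch below reads the 24-byte node records listed above and rebuilds the tree with a stack of ancestors. The function names and dictionary keys are hypothetical, and the type and name offsets are left unresolved.

```python
import struct


def read_blob_nodes(buf, pos=0, big_endian=False):
    """Read a blob-format TypeTree; return (nodes, local string buffer, next pos)."""
    e = ">" if big_endian else "<"
    num_nodes, string_buffer_size = struct.unpack_from(e + "ii", buf, pos)
    pos += 8
    # int16 version, int8 depth, bool array, then five int32 fields: 24 bytes.
    node_struct = struct.Struct(e + "hb?iiiii")
    nodes = []
    for _ in range(num_nodes):
        version, depth, is_array, type_off, name_off, size, index, flags = (
            node_struct.unpack_from(buf, pos)
        )
        pos += node_struct.size
        nodes.append({
            "version": version, "depth": depth, "array": is_array,
            "type": type_off, "name": name_off, "size": size,
            "index": index, "flags": flags,
        })
    string_buffer = buf[pos:pos + string_buffer_size]
    return nodes, string_buffer, pos + string_buffer_size


def build_tree(nodes):
    """Link depth-first nodes into a tree; return the root node."""
    root = None
    ancestors = []  # ancestors[d] is the most recent node seen at depth d
    for node in nodes:
        node["children"] = []
        del ancestors[node["depth"]:]  # drop everything at this depth or deeper
        if ancestors:
            ancestors[-1]["children"].append(node)
        else:
            root = node
        ancestors.append(node)
    return root
```

The old format needs no such step, since each node states its numChildren explicitly and can be read by plain recursion.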

Objects

Each ObjectInfo value found in the metadata corresponds to one of the serialized objects that make up the remainder of a SerializedFile. The type of these objects can be found in the TypeTree, and the type is either a native Unity Engine type, such as Texture2D or TextAsset, or a serialized script type deriving from MonoBehaviour. Native Unity Engine types are serialized according to native code, generally mirroring the native structure layout; a very out-of-date snapshot of the TypeTrees for these native types can be found in a gist. Script types are serialized by iterating the class's serializable fields: any public fields of the class and of its base classes that are not marked [NonSerialized], plus any other fields that are marked [SerializeField].

When a serialized field is a reference to another Object, it is serialized as a PPtr<T> where T inherits Object. A PPtr consists of a fileID and a pathID. When fileID is 0, it refers to the current file. Otherwise, the PPtr refers to the file in the externalFiles array of the metadata at index fileID - 1. pathID corresponds to the objectID found in the ObjectInfos section of the referenced SerializedFile. Starting in version 14, pathID became a 64-bit integer, and the stream was aligned to the next 32-bit boundary before each pathID was serialized; before version 14, pathID was a 32-bit integer, and the stream was not aligned before serializing it. In addition, 64-bit pathIDs are generally hashes, while 32-bit pathIDs are generally sequential indices.
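A hedged sketch of reading a PPtr and picking the file it refers to, following the rules in the paragraph above. The names read_pptr and resolve_file and the base parameter (the position that alignment is measured from) are assumptions made for this example.

```python
import struct


def read_pptr(buf, pos, version, big_endian=False, base=0):
    """Read a PPtr; return (file_id, path_id, next position)."""
    e = ">" if big_endian else "<"
    (file_id,) = struct.unpack_from(e + "i", buf, pos)
    pos += 4
    if version >= 14:
        # Align to a 4-byte boundary, then read a 64-bit pathID.
        pos = base + ((pos - base + 3) & ~3)
        (path_id,) = struct.unpack_from(e + "q", buf, pos)
        pos += 8
    else:
        # Before version 14: 32-bit pathID, no alignment.
        (path_id,) = struct.unpack_from(e + "i", buf, pos)
        pos += 4
    return file_id, path_id, pos


def resolve_file(file_id, current_file, external_files):
    """fileID 0 is the current file; otherwise index fileID - 1 of externalFiles."""
    if file_id == 0:
        return current_file
    return external_files[file_id - 1]
```

Once the file is known, the pathID is matched against the objectID values in that file's ObjectInfos.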

When a variable length type like an array or a string is serialized, it is serialized starting with an int32 size which determines the number of elements in the array, and then all of the array elements are serialized in order. A dictionary or map is serialized as an array of pairs, i.e. [int32 size] [K first] [V second] [K first] [V second] ....
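The size-prefixed layout can be sketched with a generic reader that takes a callback for the element type. All names here are hypothetical, the example assumes a little-endian file, and it ignores any per-field alignment flags from the TypeTree.

```python
import struct


def read_int32(buf, pos, endian="<"):
    (value,) = struct.unpack_from(endian + "i", buf, pos)
    return value, pos + 4


def read_array(buf, pos, read_elem):
    """Read [int32 size] followed by size elements; return (items, next pos)."""
    size, pos = read_int32(buf, pos)
    items = []
    for _ in range(size):
        item, pos = read_elem(buf, pos)
        items.append(item)
    return items, pos


def read_map(buf, pos, read_key, read_value):
    """A map is just an array whose elements are (key, value) pairs."""
    def read_pair(b, p):
        key, p = read_key(b, p)
        value, p = read_value(b, p)
        return (key, value), p

    return read_array(buf, pos, read_pair)
```

read_map returns a list of pairs rather than a dict so that the serialized order is preserved; callers can pass the result to dict() if they prefer.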

Asset Bundles

An AssetBundle file contains a SerializedFile containing Objects which are loaded dynamically by way of scripts. The SerializedFile in an AssetBundle container also has an AssetBundle Object, which contains a lookup from path name to individual objects in the bundle. For information on how this name lookup is usually created, see the Unity documentation.

Container Format

Values in this header are all big-endian unless otherwise noted.

  • string signature: The AssetBundle file begins with a signature string, serialized with a null terminator, which can be one of UnityFS, UnityWeb, UnityRaw, UnityArchive. The format of the rest of the container depends on this signature.
  • When signature is UnityArchive:
    (I don't have a sample of a bundle with this signature)
    • int64 headerOffset: Seek to this offset to begin reading the header:
      • uint32 formatVersion: Should be 5 here.
      • string targetVersion
      • string generatorVersion
      • char guid[16]
      • uint32
      • uint32
      • uint32 storageInfoOffset: Add to header offset and seek to begin reading the storage information:
        • uint32 dataOffset
        • uint32 numNodes
        • for each node in numNodes:
          • uint64 dataOffset
          • uint64 dataSize
          • uint32 status
          • string name
        • uint32 numBlocks
        • uint64 firstBlockOffset
        • for each block in numBlocks:
          • uint64 nextBlockOffset: decompressed block size is calculated by subtracting the previous offset from this offset.
        • for each block in numBlocks+1:
          • uint64 fileOffset: The previous block's compressed block size is calculated by subtracting the previous offset from this offset. This calculation is skipped for the first iteration in this loop.
          • uint32 compressionType
          • uint32 flag_40: bit 0 (& 1) is synonymous with StorageBlock.flags & 0x40; no other bits have significance
  • When signature is UnityWeb or UnityRaw:
    • uint32 formatVersion
    • string targetVersion
    • string generatorVersion
    • When formatVersion >= 4:
      • char guid[16]
      • uint32
    • uint32 fileSize
    • uint32 headerSize: aka dataOffset
    • uint32 unkCount0
    • uint32 unkCount1
    • for each i in unkCount1:
      • uint32 uncompressedSize
      • uint32 compressedSize
    • When formatVersion >= 2:
      • uint32
    • When formatVersion >= 3:
      • uint32
  • TODO: verify UnityRaw/UnityWeb; doc UnityFS; explain what's going on with all these different formats and how to unify them under a single interface
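To make the UnityRaw/UnityWeb listing above more concrete, here is a Python sketch that follows it field by field. Since that layout is still marked as unverified, treat this as an illustration of the listing rather than a tested parser; the function names and dictionary keys are hypothetical, and the unnamed uint32 fields are skipped.

```python
import struct


def read_cstring(buf, pos):
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("utf-8"), end + 1


def read_raw_header(buf):
    """Read a UnityRaw/UnityWeb container header; return (header, next pos)."""
    signature, pos = read_cstring(buf, 0)
    if signature not in ("UnityWeb", "UnityRaw"):
        raise ValueError("unsupported signature: %r" % signature)
    header = {"signature": signature}
    # All integers in the container header are big-endian.
    (header["format_version"],) = struct.unpack_from(">I", buf, pos)
    pos += 4
    header["target_version"], pos = read_cstring(buf, pos)
    header["generator_version"], pos = read_cstring(buf, pos)
    if header["format_version"] >= 4:
        header["guid"] = buf[pos:pos + 16]
        pos += 16 + 4  # guid, then one unnamed uint32
    header["file_size"], header["header_size"], unk_count0, unk_count1 = (
        struct.unpack_from(">IIII", buf, pos)
    )
    pos += 16
    header["sizes"] = []  # (uncompressedSize, compressedSize) pairs
    for _ in range(unk_count1):
        header["sizes"].append(struct.unpack_from(">II", buf, pos))
        pos += 8
    if header["format_version"] >= 2:
        pos += 4  # unnamed uint32
    if header["format_version"] >= 3:
        pos += 4  # unnamed uint32
    return header, pos
```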

Flat resource

*.resource and *.resS files are flat resource files, generally audio or texture data, and are viewed by Unity just as a sequence of bytes. In the case of audio data, the bytes are FMOD sound bank files that can be passed directly to FMOD to create a playable sound. In the case of texture data, the bytes are texture image data. The position and length of each segment of audio or image data are in the asset file of the same name, within individual AudioClip or Texture2D objects.
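Since a flat resource file has no structure of its own, extracting one segment is just a seek and a read. The sketch below assumes the caller has already obtained the position and length from the corresponding AudioClip or Texture2D object; the function name and parameter names are hypothetical.

```python
import io
from typing import BinaryIO


def read_resource_segment(stream: BinaryIO, offset: int, size: int) -> bytes:
    """Return size bytes starting at offset from an open flat resource file."""
    stream.seek(offset)
    data = stream.read(size)
    if len(data) != size:
        raise ValueError("segment extends past the end of the resource file")
    return data


# Usage with an in-memory stand-in for a .resS file:
demo = io.BytesIO(b"aaaaPIXELSbbbb")
segment = read_resource_segment(demo, 4, 6)
```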