A Gentle Introduction to Hex Editing - Ezekial711/MonsterHunterWorldModding GitHub Wiki
Preface
One of the more important skills for modding research is Hex Editing. Most articles cover the practical aspects of hex editing for specific edits; this is a general guide to hex editing and templating. It covers:
- Data Storage and Representation
- Intermediate Structures (Structs) and 010 Template Writing and Editing
- Common Data Patterns
- Common Approaches
- The Data Type Information Approach
The guide assumes you have 010 Editor (either the paid version, or the trial reinstalled repeatedly to keep access to templates).
If you have observations, recommendations, corrections or requests to expand sections, you can contact me as *Ỽ (Asterisk Ampersand) on Discord.
Data Storage and Representation
Computers store all data as numbers. In fact, when looking at a file at "raw" level, one simply gets a soup of numbers.
[]
Even text and more complex data ends up as just a series of numbers. Even the hex view itself is misleading: data is stored as binary numbers (strings of 0s and 1s), with no real delimitation of where values start and end. Most of the time, though, data is compartmentalized into groupings of 8 bits (a bit being a 0/1); these groupings are called bytes, which is what the hex view displays.
A file doesn't store information about how it should be interpreted. Even a tiny file consisting of only 4 bytes can be interpreted in nearly infinite ways. It's up to the application reading the file to decide how to parse the data contained there. Reverse engineering a file consists of making best-effort guesses at how to interpret the hex soup that makes up the file.
Primitive Data Types
In most cases, because of the above-mentioned compartmentalization of data, there are a few immediate candidates for how to read groupings of hex in a file. These are known as Primitive Data Types. Programming languages give these "interpretations" of data preferential treatment, and at some level they are the smallest (standard) units within a file:
- Byte (Int8 - 1 Byte/8 Bits)
- Short (Int16 - 2 Bytes/16 Bits)
- Long (Int32 or Int - 4 Bytes/32 Bits)
- Quad (Int64 - 8 Bytes/64 Bits)
- Half-Float (HFloat - 2 Bytes/16 Bits)
- Float (Float32 or Float - 4 Bytes/32 Bits)
- Double (Float64 - 8 Bytes/64 Bits)
- Char (Char - 1 Bytes/8 Bits)
Of these, the integer data types (Byte to Quad) can be either signed or unsigned. Signed means the number can be positive or negative; unsigned means it can only be positive. Signed and unsigned versions have the same AMOUNT of POSSIBLE VALUES, but the signed maximum is roughly half the unsigned maximum. For example SInt32 (Signed Int32) can take values between -2,147,483,648 and 2,147,483,647 while UInt32 (Unsigned Int32) can take values between 0 and 4,294,967,295. Both have a total of 4,294,967,296 possible values, but UInt32 has double the maximum possible value at the expense of negatives. Keep in mind signed values are not symmetric in their range: they have one more possible value on the negative side, because 0 takes up a slot on the non-negative side.
Integer Data Types
Datatype | Hex Values (Positive Range/Negative Range) | Positive Data Range | Negative Data Range |
---|---|---|---|
UByte | 00 - FF | 255 | |
Byte/SByte | 00 - 7F / 80-FF | 127 | -128 |
UInt16 | 0000 - FFFF | 65535 | |
Int16/SInt16 | 0000 - 7FFF / 8000-FFFF | 32767 | -32768 |
UInt32 | 0000 0000 - FFFF FFFF | 4,294,967,295 | |
Int32/SInt32 | 0000 0000 - 7FFF FFFF / 8000 0000-FFFF FFFF | 2,147,483,647 | -2,147,483,648 |
UInt64 | 0000 0000 0000 0000 - FFFF FFFF FFFF FFFF | 18,446,744,073,709,551,615 | |
Int64/SInt64 | 0000 0000 0000 0000 - 7FFF FFFF FFFF FFFF / 8000 0000 0000 0000 - FFFF FFFF FFFF FFFF | 9,223,372,036,854,775,807 | -9,223,372,036,854,775,808 |
These data types represent Integer (or, in the case of the unsigned versions, Natural) values. They represent the "most plain" reading of the hex values and are simply the hex representation of a number (a quick primer on what exactly this means can be found on the web under Base 16 or Hex representation, but isn't actually required). In 010 we can get the decimal value of a group of hex in the inspector window:
[]
The reason Byte, Short, Int and Int64 show different values is that the inspector only takes as many bytes as the data type allows; signed and unsigned readings coincide when the value is below the signed maximum.
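As a sketch of this widening behaviour outside of 010 (Python's struct module is used purely for illustration; 010's inspector does this for you), reading the bytes E8 03 00 00 at different widths:

```python
import struct

data = bytes.fromhex("E8030000")  # little-endian bytes for 1000 as a UInt32

u8  = struct.unpack("<B", data[:1])[0]  # UInt8 reading of the first byte
s8  = struct.unpack("<b", data[:1])[0]  # SInt8 reading of the same byte
u16 = struct.unpack("<H", data[:2])[0]  # UInt16 reading of the first two bytes
u32 = struct.unpack("<I", data)[0]      # UInt32 reading of all four bytes

print(u8, s8, u16, u32)  # 232 -24 1000 1000
```

Note how the signed and unsigned byte readings differ (0xE8 is above the signed maximum of 127), while the Short and Int readings coincide because the value fits below both maxima.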
Floating Point Data Types
Datatype | Size | Representation of 1 |
---|---|---|
HFloat | 2 Bytes | 00 3C |
Float | 4 Bytes | 00 00 80 3F |
Double | 8 Bytes | 00 00 00 00 00 00 F0 3F |
Floating point types represent Real numbers. Normally one can identify them by the distinct appearance of floating point 1.0 (00 00 80 3F). They are converted from the binary representation of the hex value by interpreting the bits as sign, exponent and mantissa (a more complete overview can be found in the Wikipedia article on IEEE 754).
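The byte patterns in the table can be reproduced with a short Python sketch (struct handles the sign/exponent/mantissa packing for us):

```python
import struct

# The distinctive little-endian byte patterns of 1.0 at each floating point size
half   = struct.pack("<e", 1.0).hex(" ")
single = struct.pack("<f", 1.0).hex(" ")
double = struct.pack("<d", 1.0).hex(" ")

print(half)    # 00 3c
print(single)  # 00 00 80 3f
print(double)  # 00 00 00 00 00 00 f0 3f
```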
Character Data Type
Strings are a somewhat more complicated topic. In the simplest case a character is a single byte which is looked up in a table and converted to a recognizable character. For example 41 (in hex) corresponds to the character "A" (capital A). However, there are other ways to convert text to hex values; the rule through which text is converted to hex is called the encoding. There are a few standard encodings (ASCII, UTF-8, UTF-16, Shift-JIS). 010 provides a helpful view of how a file looks under certain encodings:
[]
UTF-16 (called Unicode by 010)
[]
UTF-8 (called Unicode by most people)
[]
ASCII
[]
ASCII and Unicode overlap on the ASCII range. UTF-8 coincides with ASCII on the ASCII range, but also allows non-ASCII characters such as tildes, Asian scripts, etc. UTF-8 is variable size: a character can require multiple bytes and use control sequences. Control sequences are special bytes at the start of a character that instruct the parser to read multiple bytes as a single character. UTF-16 is fixed size (2 bytes per character); it's neither a superset nor a subset of UTF-8 and ASCII. It's an alternate encoding that expands the ASCII range: XX 00 in UTF-16 is equivalent to ASCII XX, but because ASCII packs characters one byte after another, the two aren't interchangeable, as seen above.
A C-String, also called a null-terminated string, is a sequence of consecutive characters (not necessarily bytes, a character is a collection of bytes that map to a single text "point") with a final character called the null terminator (in all encodings the null terminator consists of only 0s, for UTF-16 the null terminator is 0000 while for ASCII and UTF-8 it's 00). C-Strings aren't a primitive per se, but they are a very low level structure that's extremely common.
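A minimal sketch of reading a C-String out of raw bytes, assuming an ASCII-encoded buffer (the buffer here is invented for illustration):

```python
data = b"[cubase\x00\xe8\x03\x00\x00"  # a C-String followed by unrelated data

end = data.index(b"\x00")          # scan forward for the null terminator
text = data[:end].decode("ascii")  # decode everything before it

print(text)  # [cubase
```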
Data Polysemy
As mentioned before, data in a file is polysemic. There is no canonical or unique reading: one simply has bits that are read as hex sequences, and it is up to the reader to interpret those as meaningful data. So a data sequence in a file can be seen as any combination of data types and used for anything the reader intends. For example the sequence:
5B 63 75 62 61 73 65 00
Can be read as 8 Int8, 4 Int16, 2 Int32 or 1 Int64; it can be read as 2 Int8 and 3 Int16; it can be read as 2 Floats or 1 Double, etc.
Interpretation | Values |
---|---|
8 UInt8 | 91, 99, 117, 98, 97, 115, 101, 0 |
4 UInt16 | 25435, 25205, 29537, 101 |
2 UInt32 | 1651860315, 6648673 |
2 UInt8 3 UInt16 | 91, 99, 25205, 29537, 101 |
2 Floats | 1.131653e+21, 9.316775e-39 |
1 Double | 9.5458804643068e-307 |
Of those, only the 8 UInt8 reading really looks plausible. However, if we take this to 010 and look at the right side of the screen, we quickly notice:
[]
We are probably dealing with a string.
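The table above can be verified with a quick Python sketch (little-endian, as in the files discussed here; the float and double readings are computed but their exact printed forms depend on formatting):

```python
import struct

data = bytes.fromhex("5B 63 75 62 61 73 65 00")

as_uint8  = struct.unpack("<8B", data)  # 8 UInt8
as_uint16 = struct.unpack("<4H", data)  # 4 UInt16
as_uint32 = struct.unpack("<2I", data)  # 2 UInt32
as_float  = struct.unpack("<2f", data)  # 2 Floats
as_double = struct.unpack("<d", data)   # 1 Double

print(as_uint8)   # (91, 99, 117, 98, 97, 115, 101, 0)
print(as_uint16)  # (25435, 25205, 29537, 101)
print(as_uint32)  # (1651860315, 6648673)
print(as_float)
print(as_double)
```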
Intermediate Structures (Structs)
As seen before, interpreting a file is based on reading a soup of hex values in such a way that the values one gets for each field have some measure of plausibility. Data in a file (for the most part) has a reason to be there. For example, a file that we know holds models will have to store vertex position data somewhere, so we are bound to find floats in the file which correspond to the X coordinates of vertices. At a simpler level, we would expect an integer somewhere which is the vertex count (this isn't strictly true; some file formats store arrays that are meant to be read until one finds a sentinel, a specific value or set of values indicating that the array is finished). In general, interpreting a file operates on two levels: knowing the data type of every hex sequence in the file, but also labelling each of those data entries and understanding how each variable is linked to the others and to the game.
Given a byte sequence at the start of a file:
43 54 43 00 1C 00 00 00 00 00 00 00 E8 03 00 00 01 00 00 00 02 00 00 00
From looking at other files with the same extension we notice that the first 16 bytes are always the same; furthermore, when we look at them in 010's character panel: []
So we can make educated guesses. We normally prefer "longer" data types over shorter ones, since everything is plausible when read as bytes. In this case, we notice that
43 54 43 00 / 1C 00 00 00 / 00 00 00 00 / E8 03 00 00 / 01 00 00 00 / 02 00 00 00
Would have relatively reasonable Int32 values for everything after the first value. Furthermore, because we compared with other files of the same extension, we know the first 3 ints are constant.
In 010 we can open the template editor to help our work in figuring out how the file works: []
Most files begin with what's known as a header, this normally has counts for all substructures inside a file, size declarations and other "global properties" related to the whole file. Templates are written in a C-like language, though for the purposes of hex editing and mapping one doesn't need extensive knowledge of the programming language for most basic things.
The quick rules are that one starts a structure by declaring it as
struct StructName{
};
and then adds variables inside in the format type name;. We can even make groupings of the same type using the syntax type name[n];. For now we have:
struct CTCHeader{
char unkn0[4];
int unkns[5];
};
Keen readers might remember our earlier discussion of null-terminated strings, also known as C-Strings; we could also write this template, with the exact same results, as
struct CTCHeader{
string unkn0;
int unkns[5];
};
And with our knowledge that 3 of those ints are fixed, we can write:
struct CTCHeader{
string unkn0;
int constInt[3];
int unkn1;
int unkn2;
};
Writing templates and structuring a file is an organic process. We group and ungroup variables as we go, we add labels and comment out things as we work on a file.
This is how it will look in the editor:
[]
And we can run it on the file at hand by clicking the arrow next to Run Template. And... we'd get nothing back.
This is because we've only said there's a structure called CTCHeader; we haven't actually said it's in the file. The struct keyword only declares a possible data type; we still need to tell the template that there's an instance of one in the file. We do this by treating our struct as if it were a primitive type and giving a field a name outside any structure declaration:
struct CTCHeader{
string unkn0;
int constInt[3];
int unkn1;
int unkn2;
};
CTCHeader Header;
Running this yields promising results:
[]
Additionally, we can make some eye-candy edits by adding <...> attributes. These are not related to how the file is read (with the exception of optimization=false) but to how it displays in the editor. Some useful attributes are name, which changes the variable's display name; comment, which allows adding a comment; and bgcolor, which highlights the variable in our editor window.
struct CTCHeader{
string type<name = "File Type", comment = "Always CTC", bgcolor=0xffff56>;
int constInt[3]<name = "Constant Int Triplet", comment = "28,0,1000">;
int unkn1;
int unkn2;
};
CTCHeader Header;
[]
After some comparisons of file sizes and whatnot, we arrive at the conclusion that even the smallest file is at least 80 bytes, so we hypothesize that the header has the following structure:
struct CTCHeader {
string type<name = "File Type", comment = "Always CTC", bgcolor=0xffff56>;
int constInt[3]<name = "Constant Int Triplet", comment = "28,0,1000">;
int unkn1;
int unkn2;
byte unkn3[0x38];
};
CTCHeader Header;
We then notice that the file size is related to unkn1 and unkn2: unkn1 increases the file size by 80 bytes per unit, and unkn2 by 112 bytes per unit. This makes us suspect they are counts for substructures, one 80 bytes long and another 112 bytes long.
struct CTCHeader {
string type<name = "File Type", comment = "Always CTC", bgcolor=0xffff56>;
int constInt[3]<name = "Constant Int Triplet", comment = "28,0,1000">;
int count1;
int count2;
byte unkn3[0x38];
};
struct CTCSubstructure1{
byte unkn[80];
};
struct CTCSubstructure2{
byte unkn[112];
};
CTCHeader Header;
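As a quick sanity check of the size hypothesis (the counts below are invented for illustration), the expected file size is the 80-byte header plus 80 bytes per count1 entry and 112 bytes per count2 entry:

```python
# Sketch of the comparative-size reasoning: header (0x50 = 80 bytes) plus
# count1 substructures of 80 bytes and count2 substructures of 112 bytes.
def predicted_size(count1, count2):
    return 80 + 80 * count1 + 112 * count2

print(predicted_size(0, 0))  # 80, the smallest observed file
print(predicted_size(1, 2))  # 384
```

If a real file's size doesn't match the prediction for its counts, the hypothesis (or the header layout) needs revisiting.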
We still need a way of saying we have count1 of CTCSubstructure1 and count2 of CTCSubstructure2. We'd also prefer to start grouping things into some bigger structure that encapsulates the whole file. We can use any structure defined in the file as if it were a primitive (in fact we did this before to have it show up in the results) and assign it a name. We can then access members (sub-properties) of a previously defined property using the dot operator. For example, Header.count1 would give us the value of count1 if we had a variable called Header of type CTCHeader.
struct CTCHeader {
string type<name = "File Type", comment = "Always CTC", bgcolor=0xffff56>;
int constInt[3]<name = "Constant Int Triplet", comment = "28,0,1000">;
int count1;
int count2;
byte unkn3[0x38];
};
struct CTCSubstructure1{
byte unkn[80];
};
struct CTCSubstructure2{
byte unkn[112];
};
struct CTCFile{
CTCHeader Header;
CTCSubstructure1 Subs1[Header.count1];
CTCSubstructure2 Subs2[Header.count2];
};
CTCFile Ctc;
The dot operator can also be chained; for example, Ctc.Header.count1 would return the count1 value in this case. From this point on, we'd start trying to find reliable groupings of bytes and interpreting them in a similar manner. We test the template on files with the same extension to confirm that it covers all files completely (there are no untemplated bytes), exactly (the template doesn't fail because there are no values left to read) and consistently (a field we say is a float is always a reasonable value for a float, not a strange 1e-32 in some files but not others). This last condition, consistency, sometimes has issues: a value earlier in the file can indicate that the type of a later field is different, and sometimes a variable has special sentinel values that indicate it should be read differently (for example, FFFFFFFF is an invalid float value in most cases but might be used as a special value to indicate that the float will operate in some special manner). These cases DO require some knowledge of basic C (conditionals) to work with and will not be covered by this guide.
Templates allow us to quickly edit files by using the interpreted fields directly instead of fiddling with hex. Double-clicking the value of a template variable changes it in the corresponding editor window.
[]
Common Data Patterns
There are some common "irregularities" that might show up sometimes.
Alignment and Padding
On occasion the size of some substructures might seem to vary, but the "extra bytes" are always 0. Furthermore, those extra bytes happen to make the next structure start at an editor position that lines up nicely with the vertical divisions (4, 8, 16 bytes from the start of the file). These extra bytes are called padding, and the whole idea is called alignment. This tends to be common in visual formats because GPUs have considerably faster read times for data aligned this way.
When dealing with alignment, the following template code is useful to declare that a field will be aligned with the start of the file.
struct padding(int bitalignment){
local int bytealign;
bytealign = bitalignment/8;
local uint64 start = FTell();
byte paddingBytes[(-start)%bytealign]; // unsigned wraparound makes this the distance to the next boundary (correct for power-of-two alignments)
};
struct AlignedStructure{
int exampleInt[8];
string varSizeString;
float moreExample;
padding alignment(16);//8 for byte, 16 for short, 32 for int, 64 for int64
};
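The arithmetic behind the template's (-start)%bytealign can be sketched in Python, which shares the convention that a negative value modulo a positive one yields a non-negative result:

```python
# How many padding bytes are needed so the next field starts on an
# align-byte boundary, given the current offset in the file.
def padding_needed(offset, align):
    return (-offset) % align  # 0 when already aligned

print(padding_needed(13, 4))   # 3 bytes of padding to reach offset 16
print(padding_needed(16, 16))  # 0, already aligned
```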
Pointers
Often going hand-in-hand with certain forms of alignment, pointers are values within the data that "point" to another location in the data. These can be of any integer data type, although they generally take the form of UInt32 and UInt64. It should be noted that the location a pointer points to is not always absolute; it can be influenced by other factors such as the start of a sub-file or another pointer.
Pointers can be utilized within templates by using the FSeek function. It should be noted that if FSeek is used, the previous position is discarded and cannot be returned to without saving it in a local variable. An example of this would be:
struct StringOffset{
int64 pointer; // the pointer to visit
local int64 retAddress = FTell(); //save our current position in a local variable
FSeek(pointer);
string Name;
FSeek(retAddress);
};
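For illustration, the same save-jump-read-return dance looks like this in Python over an invented buffer (an 8-byte absolute pointer at the start of the "file" pointing at a null-terminated name):

```python
import io

# Invented layout: int64 pointer at offset 0 (value 0x10), 8 filler bytes,
# then the null-terminated name at offset 0x10.
buf = io.BytesIO(b"\x10\x00\x00\x00\x00\x00\x00\x00" + b"\x00" * 8 + b"cubase\x00")

pointer = int.from_bytes(buf.read(8), "little")  # read the int64 pointer
ret = buf.tell()                                 # save our position (FTell)
buf.seek(pointer)                                # jump to the target (FSeek)
name = b""
while (c := buf.read(1)) != b"\x00":             # read up to the null terminator
    name += c
buf.seek(ret)                                    # return to where we were
print(name.decode())  # cubase
```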
Bitfields
As mentioned at the very beginning, formally speaking the file is not bytes but BITS. This means one can have multiple variables crammed inside a single byte. Even worse, nothing really stops one from ramming in multibyte abominations with weird bit separations. These aren't standard but are still present; a particularly horrible example is the way weights are stored in the mod3 format: weights fit in an Int32, but in reality there are 3 10-bit values and 2 bits used for "other". In general, bit variables are fitted inside a byte-aligned variable.
To declare bit level sub-variables on some byte aligned container 010 provides the syntax:
int16 field1:3;
int16 field2:4;
int16 field3:8;
int8 field4:6;
Where the value after the colon is the number of bits the variable is made of. Missing bits are auto-filled so that the next field starts on a byte boundary when the underlying type changes: in this case there's an implicit 1-bit field after field3 (completing the int16 container) and a 2-bit field after field4.
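As an illustration of pulling packed bit variables back out (the bit layout here is assumed for the sketch, not the actual mod3 layout), unpacking three 10-bit values and a 2-bit remainder from a UInt32 with masks and shifts:

```python
def unpack_weights(value):
    # Assumed layout: three 10-bit fields from the low bits up,
    # then 2 "other" bits at the top.
    w0 = value & 0x3FF
    w1 = (value >> 10) & 0x3FF
    w2 = (value >> 20) & 0x3FF
    other = (value >> 30) & 0x3
    return w0, w1, w2, other

packed = (1 << 30) | (3 << 20) | (2 << 10) | 1
print(unpack_weights(packed))  # (1, 2, 3, 1)
```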
Size Prefaced Arrays
As a general pattern it's common for arrays of a datatype to be prefaced by an int which determines the number of entries.
Pascal Strings
Pascal Strings, named after the programming language Pascal, are char arrays prefaced by an int indicating their size. They are size prefaced arrays of type char but for historical reasons are given explicit mention.
struct PascalString{
byte len;
char text[len]; // "string" is a reserved type name in 010, so use another field name
};
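A sketch of parsing one in Python (the buffer is invented for illustration):

```python
data = b"\x06cubase\xe8\x03"  # length byte, 6 chars, then unrelated bytes

length = data[0]                            # the size preface
text = data[1:1 + length].decode("ascii")   # exactly that many chars follow

print(text)  # cubase
```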
Hashes
A hash is a sequence of bytes generated by an algorithm to represent another sequence of bytes (often strings, but full-file hashing is also common). While hashing itself is a very complex subject, one should be familiar with the basics, as hashes are very common within MHW's files. The most common algorithm utilized by MHW is JAMCRC, a variant of CRC32. Occasionally further masking is performed on the hash as well, generally on hashes of class references. JAMCRC hashes can be calculated online, or within Python using the zlib module and the following code:
import zlib
hash = zlib.crc32(string.encode()) ^ 0xffffffff
Hashes tend to be UInt32, and generally look like an incoherent block of 4 bytes that doesn't quite work as any data type. They are used for many things: in MHW, efx files use them to determine the structure that follows the hash variable; similarly, MHR uses them almost globally as part of its generic structure format to have dynamic structure contents within a file. Generally hashes come from strings; however, it's not possible to reverse a hash in some generic fashion. If one knows the list of possible strings a hash could come from, one can hash every element of the list and see if any matches.
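A sketch of the "hash every candidate" idea (the candidate list and the unknown hash are invented for illustration):

```python
import zlib

def jamcrc(s):
    # JAMCRC is standard CRC32 with the final inversion undone (XOR 0xFFFFFFFF)
    return zlib.crc32(s.encode()) ^ 0xFFFFFFFF

# Hash every candidate string once, then look file hashes up in the result
candidates = ["mJointNo", "mMass", "mWindCoef"]
lookup = {jamcrc(name): name for name in candidates}

unknown_hash = jamcrc("mMass")   # stand-in for a hash read out of a file
print(lookup.get(unknown_hash))  # mMass
```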
Common Approaches
Some of the more common high-level approaches to mapping a file are listed below
Comparative Analysis
Comparative Analysis is based on comparing multiple files with the same extension. Comparing file sizes is extremely useful for finding substructure counts, identifying the presence of variable-sized structures (for example, if file sizes always jump between specific sizes, it probably means there are no variable-sized structures, only a variable count of fixed-size structures), and for comparing headers. It's also useful for testing consistency of the template variables, as it allows cross-referencing whether type hypotheses are valid within the sample.
With knowledge of programming languages it's possible to extend this to mass analysis: pooling what values can be found under certain variables. This is helpful because if, for example, a certain float is always between -360 and 360, odds are it's an angle. Experience tends to inform what certain value pools can represent; integers with bounded values tend to be counts or enumerations.
Poke and Prod
Modifying a file, loading it into the game and seeing what has changed. It's a quick and dirty approach that yields results fast. Replacing a file with another of the same extension lets one know what set of things a file might control.
Documentation Lookup
Looking up official documentation for similar formats (or formats in the same family; for example, looking at fbx and obj when analyzing mod3, as all are model formats), previously reversed formats for older versions of the engine, or other formats in the same game tends to be a good starting point. Furthermore, documentation about function, specifically what the file controls and how, is helpful for labelling variables but also for understanding more complex functionality, especially when complex structures such as trees and graphs are involved.
The Data Type Information Approach
This approach is limited to games where there's a Data Type enumeration and will focus on SOME of the many avenues for exploiting this very rich source of information. In particular it will focus on MHW's iteration of the DTI Property Enumeration. DTI stands for Data Type Information and consists of a reverse engineering of all of the classes in the game (in the Object Oriented Programming sense). It includes a wealth of information about what fields a structure has, a lot of memory analysis utilities, etc. It might be the single most valuable general format-reversing document around, short of someone having direct leaks of the MT Framework documentation.
This section has higher technical requirements than the ones before. It's recommended to have python installed and some familiarity with running it. Cheat Engine is also required to make the most of the contents of the following section.
Set-Up
The DTI dump was developed by Andoryuuta and the dumper utility can be found at this link. An existing premade dump can be found at this link. It's recommended to download it with Save Link As; the document is gargantuan and will make most browsers choke. It's recommended to use Notepad++ to open the file for cursory checks.
At this point one can browse the DTI for classes named similarly to the file one is working on, or whose names imply functionality similar to the file of interest. Sometimes the fields in memory align with the fields in the file and one's work is basically done, as the DTI provides the type and variable name. Sometimes they don't: fields are shuffled around, or one side has more or fewer fields.
File to Memory
Exploiting Classes
One needs a good guess for relevant classes in the DTI related to the file in question. Copying the class declarations from the DTI and feeding them to the CT converter found here will create a Cheat Table to more directly visualize classes in memory.
A usage example within the python interpreter:
dtt = """// uCnsTinyChain::cChainNode vftable:0x1435715C8, Size:0x1E0, CRC32:0x4AF2141C
class uCnsTinyChain::cChainNode /*: uCnsGroup::cNode, MtObject*/ {
s32 'mJointNo' ; // Offset:0x8, Var, CRC32:0x673E4986, Flags:0x0
f32 'mR' ; // Offset:0x1C, Var, CRC32:0xEB10C832, Flags:0x0
matrix44 'mAngleAxis' ; // Offset:0x30, Var, CRC32:0x50680E2B, Flags:0x0
f32 'mMass' ; // Offset:0x70, Var, CRC32:0xB50A905A, Flags:0x0
f32 'mElasticCoef' ; // Offset:0x74, Var, CRC32:0xA8EB81C7, Flags:0x0
f32 'mWindCoef' ; // Offset:0x78, Var, CRC32:0xDD2AE1AD, Flags:0x0
matrix44 'mMat' ; // Offset:0x90, Var, CRC32:0x443BFE6D, Flags:0x0
vector3 'mJointScale' ; // Offset:0x1D0, Var, CRC32:0x6A417252, Flags:0x1000
f32 'mWidthRate' ; // Offset:0x7FFFFFFFFFFFFFFF, PSEUDO-PROP, Getter:0x1402EB6E0, Setter:0x142390A60, CRC32:0xF969DBD0, Flags:0x80000
f32 'mAngleLimit' ; // Offset:0x7FFFFFFFFFFFFFFF, PSEUDO-PROP, Getter:0x14238EC90, Setter:0x142390A50, CRC32:0x5C958227, Flags:0x80000
u32 'mAngleMode' ; // Offset:0x7FFFFFFFFFFFFFFF, PSEUDO-PROP, Getter:0x141191820, Setter:0x142390AE0, CRC32:0x42ADCF7A, Flags:0x80000
u32 'ShapeScroll' ; // Offset:0x7FFFFFFFFFFFFFFF, PSEUDO-PROP, Getter:0x14238F250, Setter:0x1423913F0, CRC32:0xD2AB4C91, Flags:0x80000
u32 'ShapeObject' ; // Offset:0x7FFFFFFFFFFFFFFF, PSEUDO-PROP, Getter:0x14238F240, Setter:0x1423913D0, CRC32:0x9711D9FE, Flags:0x80000
u32 'mAttr' ; // Offset:0x7FFFFFFFFFFFFFFF, PSEUDO-PROP, Getter:0x142370230, Setter:0x142390B00, CRC32:0xDD77E828, Flags:0x80000
u32 'mRefJntNo' ; // Offset:0x7FFFFFFFFFFFFFFF, PSEUDO-PROP, Getter:0x1402EB760, Setter:0x14044EA90, CRC32:0x8D4FE80A, Flags:0x80000
u32 'mRotMode' ; // Offset:0x7FFFFFFFFFFFFFFF, PSEUDO-PROP, Getter:0x14238F230, Setter:0x1423913C0, CRC32:0x8635A16B, Flags:0x80000
u32 'mAttach' ; // Offset:0x7FFFFFFFFFFFFFFF, PSEUDO-PROP, Getter:0x1422F4540, Setter:0x142390AF0, CRC32:0xE055D5B0, Flags:0x80000
};""".split("\n")
with open("CTCNode.ct","w") as outTable:
    outTable.write(createCT(*parseEntries(dtt))) # createCT and parseEntries come from the converter script
Preparing Files
The next step is trying to tie values seen in files to values in memory. A simple approach is writing "unique" values into each field of a file, even if these values would otherwise be irregular or illegal for the file. In this particular case we write sequential values starting from 1 (a higher starting value such as 80 would have been better), increasing by one for each field, even for floats. The resulting CTC file does not display properly, but it IS loaded into memory. If the file were to fail to load, one would have to compromise and leave some fields unaltered.
Analyzing in Memory
Load up the game and Cheat Engine, and load the table. One enters a restricted environment to minimize the number of instances of the class of interest: in the case of CTCs, for example, the training arena with a character in an armor without CTCs (except the prepared one), a weapon with its CTCs dummied out, and a palico in an armor without CTCs.
We scan for the class vftable (in this case 0x1435715C8); this value is present at offset 0 of every instance of the class. Because of the careful setup there's only one instance of the class; if you get multiple instances it will require some more work. [] We recalculate the table base address to that of our result (Ctrl+C on the address in the results list and `Recalculate New Address` will immediately set it). [] We notice that some of our prepared values went through (17, 9, 10, 18) and we can now label the variables we gave those values with the names the DTI provided for us.
From this point it's possible to poke around in memory and in the files, seeing how values in files are reflected in memory, and similarly how they are used, by setting breakpoints and traces.