Avalanche Tests (Part 1) - bryc/code GitHub Wiki

Now that I've learned how to generate these lil diagrams, I can compare my various hash algorithms. One thing to note is that these graphs really only tell you how distinguishable from random a hash is (that is, how biased it is). If a hash is biased, it won't necessarily fail collision tests (a collision means the same hash occurs for two different inputs; lots of collisions where there shouldn't be any is really bad). These tests are also quite strict: some algorithms may still look random when their hash bits are plotted as pixels, yet display glaring biases when their strict avalanche behavior is tested.

First, some info about this test. It's called the Strict Avalanche Criterion. Exactly 1,048,575 random 32-bit numbers are generated in a loop - these are the samples for the test. The more samples, the more precise the measurements (though too much precision can sometimes hurt the test). These samples serve as the input data to our function. Some of these algorithms are tested as integer hashing functions, also known as "mixing functions", which is often the final step of a hash. Most, though, are tested as complete hash functions, because the input-combining step often serves as a partial mixing step, and that is important. In those cases, the 32-bit random number is effectively interpreted as 4 consecutive bytes or chars.
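The procedure can be sketched roughly like this - a simplified harness of my own, not the exact code used to render these diagrams; the sample count and the `hash` argument are placeholders:

```javascript
// Sketch of a Strict Avalanche Criterion test.
// For each sample, flip each of the 32 input bits one at a time and
// record which output bits change. An unbiased hash should flip each
// output bit with probability 0.5.
function sacTest(hash, samples = 100000, inBits = 32, outBits = 32) {
  const flips = Array.from({length: inBits}, () => new Array(outBits).fill(0));
  for (let n = 0; n < samples; n++) {
    const x = (Math.random() * 0x100000000) >>> 0; // random 32-bit sample
    const h = hash(x);
    for (let i = 0; i < inBits; i++) {
      const d = h ^ hash((x ^ (1 << i)) >>> 0);    // output diff for a 1-bit input change
      for (let j = 0; j < outBits; j++) {
        if ((d >>> j) & 1) flips[i][j]++;
      }
    }
  }
  // Convert counts to flip probabilities; the ideal is 0.5 everywhere.
  return flips.map(row => row.map(c => c / samples));
}
```

Each cell of the resulting 32×32 matrix maps to one pixel of the diagram; the further a cell's probability is from 0.5, the more biased that pixel shows up as.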

So, with that out of the way, let's start.

XOR checksum (8-bit)

*(image: avalanche diagram)*

function xor(data, h1 = 0) {
  for(let i = 0; i < data.length; i++) {
    h1 ^= data[i];
  }
  return h1;
}

XOR is arguably the simplest logical operation for computers to perform, and it forms an important component of larger algorithms - particularly encryption (because it is a simple yet effective reversible operation). This checksum simply computes an XOR sum of all the input bytes. The result is effectively just 8 parity bits.
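Its weaknesses are easy to demonstrate: the result is order-independent, and any byte XORed in twice cancels itself out. The byte values here are just illustrative:

```javascript
// XOR sum of all input bytes (same function as above).
function xor(data, h1 = 0) {
  for (let i = 0; i < data.length; i++) h1 ^= data[i];
  return h1;
}

console.log(xor([0x0F, 0xF0]));       // 255 (0xFF)
console.log(xor([0xF0, 0x0F]));       // 255 again: byte order doesn't matter
console.log(xor([0x0F, 0xAA, 0xAA])); // 15 (0x0F): the pair of 0xAA bytes cancels out
```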

According to this paper, it is less effective at error detection than addition as an 8-bit checksum (which is what it is being tested as in the avalanche diagram).

ADD checksum / ADC checksum (8-bit)

*(image: avalanche diagram)*

// Left: Add
function add(data, h1 = 0) {
  for(let i = 0; i < data.length; i++) {
    h1 += data[i];
  }
  return h1 & 0xFF;
}
// Right: Add with Carry
function adc(data, h1 = 0) {
  for(let i = 0; i < data.length; i++) {
    h1 += data[i];
  }
  return h1 % 0xFF;
}

This is simple addition. Each input byte gets added to the total sum. In an 8-bit checksum, any time the final sum exceeds 255, the last carry bit is ignored.

A variant of this checksum from the world of low-level assembly is the ADC checksum. It effectively adds the carry bit back into the result, the idea being that this adds another layer of unpredictability (to an extent). It is awkward to emulate in modern languages, though using modulo 255 instead of modulo 256 effectively reproduces the math. In assembly, the carry bit is sometimes clobbered by a comparison, so it can end up taking two instructions (`add n` + `aci 0`) instead of one (`adc n`) to "force" the intended result.
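To see why the modulo-255 shortcut works, here is a sketch of my own comparing a per-byte end-around-carry sum against it (the helper names are mine, not from any of the original assembly sources):

```javascript
// Per-byte end-around carry: whenever the 8-bit add overflows,
// the carry bit is folded back into the sum.
function adcCarry(data, h1 = 0) {
  for (let i = 0; i < data.length; i++) {
    let s = h1 + data[i];
    while (s > 0xFF) s = (s & 0xFF) + (s >>> 8); // add the carry back in
    h1 = s;
  }
  return h1;
}

// Modulo-255 shortcut, as in the main text.
function adcMod(data, h1 = 0) {
  for (let i = 0; i < data.length; i++) h1 += data[i];
  return h1 % 0xFF;
}
```

Both results are always congruent modulo 255; the only visible disagreement is the edge case where the carry version lands on 255 (which is ≡ 0 mod 255) while the modulo version reports 0.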

It's sometimes called "one's complement addition", though that can be confused with other concepts, so I try to avoid the term. It may be marginally better as a checksum, but I haven't verified that yet and there's conflicting information.

ROL ADC checksum

*(image: avalanche diagram)*

function roladc(data, h1 = 0) {
  for(let i = 0; i < data.length; i++) {
    h1 = (h1 << 1) | (h1 >>> 7);
    h1 &= 0xFF;
    h1 += data[i];
    h1 %= 0xFF;
  }
  return h1;
}

Perhaps the best checksum possible in the fewest CPU instructions? It is effectively the ADC checksum, but performs a ROL ('rotate left') on the sum beforehand - technically a multiplication by 2 that also wraps the carry around. The original assembly code for this has been shared around on microcontroller forums, so I must assume it has its merits:

8-bit ASM version (Motorola 6800)

   rol sum        ; rotate sum left
   add sum,new    ; add new byte into sum
   adc sum,0      ; if overflow, add carry back in

16-bit ASM version (Motorola 6800)

         CLR A               ;Init sum = 0 (A=MSB, B=LSB)
         CLR B
L_DBC4   ASL B               ;16-bit sum <<= 1
         ROL A               ;Shift MSB out to carry
         ADC B   $00,X       ;16-bit add of $00,x plus carry -> sum
         ADC A   #$00
         INX                 ;Loop from $C002 to FFFF inclusive
         CPX     #$0000
         BNE     L_DBC4

Unfortunately, 6502/6800 ROL is "rotate left through carry". I can't quite figure out how to handle the Carry Flag there, so what I'm actually implementing is a circular rotate. Perhaps I'll port my 8080 checksum test code to 6502/6800 and see if I can do it properly.
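For what it's worth, rotate-through-carry can be emulated in software by treating the carry flag as a ninth bit. This is a sketch of the idea (my own helper names, not verified against real hardware):

```javascript
// ROL through carry: the old carry flag shifts into bit 0, and the
// old bit 7 falls out into the new carry. Effectively a 9-bit rotation.
function rolThroughCarry(value, carry) {
  const newCarry = (value >>> 7) & 1;           // bit 7 becomes the carry
  const result = ((value << 1) | carry) & 0xFF; // old carry enters bit 0
  return [result, newCarry];
}

// Circular (8-bit) rotate, as used in the JS version above.
function rolCircular(value) {
  return ((value << 1) | (value >>> 7)) & 0xFF;
}

console.log(rolThroughCarry(0x80, 0)); // [0, 1]: the high bit waits in the carry flag
console.log(rolCircular(0x80));        // 1: the high bit wraps around immediately
```

The difference is visible above: with through-carry, a wrapped bit only re-enters the sum one rotation later, via the carry flag.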

Fletcher-16 checksum

*(image: avalanche diagram)*

function fletcher16(data, h1 = 0, h2 = 0) {
  for(let i = 0; i < data.length; i++) {
    h2 += h1 += data[i];
  }
  h1 %= 255, h2 %= 255;
  return (h2 << 8) | h1;
}

The Fletcher checksum is another variant of the ADC checksum that incorporates the carry bits; the difference is that it produces two running sums.

Its purpose is essentially to achieve position-dependence, meaning the order of the input bytes matters, whereas most simple checksums are order-independent. In that respect it is similar to the ROL ADC checksum.
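The position-dependence is easy to see with two inputs that a plain additive checksum cannot tell apart (toy byte values, chosen just for illustration):

```javascript
// Fletcher-16 (same function as above): two running sums, modulo 255.
function fletcher16(data, h1 = 0, h2 = 0) {
  for (let i = 0; i < data.length; i++) {
    h2 += h1 += data[i];
  }
  h1 %= 255, h2 %= 255;
  return (h2 << 8) | h1;
}

console.log(fletcher16([1, 2]).toString(16)); // "403"
console.log(fletcher16([2, 1]).toString(16)); // "503" - swapped bytes give a different checksum
// A plain ADD checksum would produce 3 for both inputs.
```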

Fletcher-32 and Adler-32

*(image: avalanche diagram)*

// Fletcher-32 (Left)
function fletcher32(data, h1 = 0, h2 = 0) {
  for(let i = 0; i < data.length; i += 2) {
    h2 += h1 += data[i+1] << 8 | data[i+0];
  }
  h1 %= 0xFFFF, h2 %= 0xFFFF;
  return ((h2 << 16) |  h1) >>> 0;
}
// Adler-32 (Right)
function adler32(data, h1 = 1, h2 = 0) {
  for(let i = 0; i < data.length; i++) {
    h2 += h1 += data[i];
  }
  h1 %= 65521, h2 %= 65521;
  return ((h2 << 16) |  h1) >>> 0;
}

Fletcher-32 simply reads the input in 16-bit chunks. So effectively, it sums 16-bit numbers instead of 8-bit.

Adler-32, on the other hand, is a slight modification (supposedly better). It reads 8 bits at a time, and uses a prime modulus (65521) instead of 2^n-1. It also initializes the first sum to 1 rather than 0, which does have one tangible effect: leading zero bytes still advance h2, so inputs that differ only in a run of leading zeros produce different checksums.
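One concrete effect of starting `h1` at 1 can be checked directly - with `h1 = 0`, a run of leading zero bytes leaves both sums untouched (a quick comparison of my own, using the function above):

```javascript
// Adler-32 (same function as above), with the initial values exposed.
function adler32(data, h1 = 1, h2 = 0) {
  for (let i = 0; i < data.length; i++) h2 += h1 += data[i];
  h1 %= 65521, h2 %= 65521;
  return ((h2 << 16) | h1) >>> 0;
}

// With h1 starting at 1, each leading zero byte still adds 1 to h2:
console.log(adler32([0, 1]));    // differs from the next line
console.log(adler32([0, 0, 1])); // the extra zero byte changes h2

// With h1 starting at 0 (Fletcher-style), leading zeros are invisible:
console.log(adler32([0, 1], 0));    // identical to the next line
console.log(adler32([0, 0, 1], 0)); // same checksum despite the extra byte
```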

BSD checksum

*(image: avalanche diagram)*

function bsd(data, h1 = 0) {
  for(let i = 0; i < data.length; i++) {
    h1 = (h1 >>> 1) | (h1 << 15);
    h1 += data[i];
    h1 &= 0xFFFF;
  }
  return h1;
}

The BSD checksum is nearly identical to the ROL ADC checksum. However, it uses ROR instead (rotate right), does not add carry, and returns a 16-bit result.

Upon closer inspection, it appears ROR may offer a few advantages over ROL, as the sum appears to accumulate sooner in the higher bits, because we're shifting the LSB straight into the MSB.

I will note, however, that the left-shift (`h1 << 15`), which wraps the low bit into the high bits, is the most significant aspect of this rotation; the right-shift simply moves the remaining bits down without discarding anything.

CRC-8 / CRC-16

*(image: avalanche diagram)*

// CRC-8
function crc8(data, h1 = 0) {
  for(let i = 0; i < data.length; i++) {
    h1 ^= data[i];
    for(let j = 0; j < 8; j++) {
        h1 = h1 & 128 ? (h1 << 1) ^ 7 : h1 << 1;
    }
  }
  return h1 & 0xFF; 
}
// CRC-16
function crc16(data, h = 0) {
  for(let i = 0, t; i < data.length; i++) {
    t = h ^ data[i];
    t = (t ^ t << 4) & 0xFF;
    h = (h >> 8) ^ (t << 8) ^ (t >> 4) ^ (t << 3);
  }
  return h; 
}

CRC is a 'proper' error detection code. Basically, it should detect more errors than a typical checksum (though only if the input data is short enough; it's all highly mathematical).

CRC has often been used in hash functions; however, it has terrible avalanche behavior and is often much slower. It should only be used as a checksum for short input lengths.
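One detection property is easy to check by brute force: with the polynomial 0x07, CRC-8 catches every possible single-bit error in a message. This is my own quick demonstration, not part of the wiki's test suite:

```javascript
// CRC-8 with polynomial 0x07 (same function as above).
function crc8(data, h1 = 0) {
  for (let i = 0; i < data.length; i++) {
    h1 ^= data[i];
    for (let j = 0; j < 8; j++) {
      h1 = h1 & 128 ? (h1 << 1) ^ 7 : h1 << 1;
    }
  }
  return h1 & 0xFF;
}

// Flip every single bit of a test message and confirm the CRC changes.
const msg = [0xDE, 0xAD, 0xBE, 0xEF];
const original = crc8(msg);
let undetected = 0;
for (let byte = 0; byte < msg.length; byte++) {
  for (let bit = 0; bit < 8; bit++) {
    const corrupted = msg.slice();
    corrupted[byte] ^= 1 << bit;
    if (crc8(corrupted) === original) undetected++;
  }
}
console.log(undetected); // 0: all 32 single-bit errors detected
```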

For reference, I also tested (and verified) CRC-32.
