
bfloat16

  • 16-bit floating-point format
  • Format: 1 sign bit, 8 exponent bits, 7 mantissa bits (the upper 16 bits of FP32)
    • See the field-decoding sketch below this list
  • Better suited to deep learning than FP16, thanks to its FP32-like dynamic range
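
A minimal sketch in C of the 1/8/7 field layout described above, assuming the bfloat16 value is held as a raw `uint16_t` bit pattern; the example constant 0x3FC0 (the bfloat16 pattern for 1.5) is illustrative only.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint16_t bf16 = 0x3FC0;  /* bfloat16 bit pattern of 1.5 */

    /* 1 sign bit, 8 exponent bits (bias 127, same as FP32), 7 mantissa bits */
    unsigned sign     = (bf16 >> 15) & 0x1;
    unsigned exponent = (bf16 >> 7)  & 0xFF;
    unsigned mantissa =  bf16        & 0x7F;

    /* Normal numbers: value = (-1)^sign * 2^(exponent - 127) * (1 + mantissa/128) */
    printf("sign=%u exponent=%u mantissa=0x%02X\n", sign, exponent, mantissa);
    return 0;
}
```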

Comparison

Advantages

  • Can easily replace FP32
    • Same dynamic range as FP32, unlike FP16
    • Neural networks retain correct behavior when FP32 is swapped for BF16
  • Potential performance uplift
    • Half the memory footprint and bandwidth
    • 2x floating-point multiplications per instruction
  • Easy to support in software on existing CPUs
    • Simple masking and shifting operations convert BF16 to FP32 and vice versa (see the sketch after this list)
  • Potential for a single format for training and inference
    • No need for scaling and quantization
    • Avoids expensive retraining and redesign of the network architecture
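
A minimal sketch of the masking-and-shifting conversion mentioned in the list above, assuming 32-bit IEEE 754 `float`. It uses plain truncation for FP32 -> BF16; real libraries usually add round-to-nearest-even. The helper names `fp32_to_bf16` and `bf16_to_fp32` are made up for this example.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* FP32 -> BF16: keep the upper 16 bits (truncation; production libraries
   typically round to nearest-even instead of truncating). */
static uint16_t fp32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint16_t)(bits >> 16);
}

/* BF16 -> FP32: place the 16 bits in the upper half and zero-fill the rest. */
static float bf16_to_fp32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main(void) {
    float big = 3.0e38f;  /* representable in BF16, far above the FP16 maximum (65504) */
    uint16_t h = fp32_to_bf16(big);
    printf("0x%04X -> %g\n", (unsigned)h, bf16_to_fp32(h));
    return 0;
}
```

The round trip also illustrates the dynamic-range advantage: 3.0e38 survives the conversion with only mantissa precision lost, whereas it would overflow to infinity in FP16.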

Standard

  • No IEEE standard defines bfloat16
  • Different architectures, accelerators, and software libraries have adopted slightly different aspects of the IEEE 754 floating-point standard to govern the numeric behavior of arithmetic on BF16 values

Support

See Also
