
bfloat16

  • 16-bit floating-point format
  • Format: 1 sign bit, 8 exponent bits, 7 mantissa bits (the upper 16 bits of FP32)
    • See the field-decoding sketch below this list
  • Better suited to deep learning than FP16, thanks to its FP32-like dynamic range
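
A minimal sketch in C of the 1/8/7 field layout described above, assuming the bfloat16 value is held as a raw `uint16_t` bit pattern; the example constant 0x3FC0 (the bfloat16 pattern for 1.5) is illustrative only.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint16_t bf16 = 0x3FC0;  /* bfloat16 bit pattern of 1.5 */

    /* 1 sign bit, 8 exponent bits (bias 127, same as FP32), 7 mantissa bits */
    unsigned sign     = (bf16 >> 15) & 0x1;
    unsigned exponent = (bf16 >> 7)  & 0xFF;
    unsigned mantissa =  bf16        & 0x7F;

    /* Normal numbers: value = (-1)^sign * 2^(exponent - 127) * (1 + mantissa/128) */
    printf("sign=%u exponent=%u mantissa=0x%02X\n", sign, exponent, mantissa);
    return 0;
}
```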

Comparison

Advantages

  • Can easily replace FP32
    • Same dynamic range as FP32, unlike FP16
    • Neural networks retain correct behavior when FP32 is swapped for BF16
  • Potential performance uplift
    • Half the memory footprint and bandwidth
    • 2x floating-point multiplications per instruction
  • Easy to support in software on existing CPUs
    • Simple masking and shifting operations convert BF16 to FP32 and vice versa (see the sketch after this list)
  • Potential for a single format for training and inference
    • No need for scaling and quantization
    • Avoids expensive retraining and redesign of the network architecture
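
A minimal sketch of the masking-and-shifting conversion mentioned in the list above, assuming 32-bit IEEE 754 `float`. It uses plain truncation for FP32 -> BF16; real libraries usually add round-to-nearest-even. The helper names `fp32_to_bf16` and `bf16_to_fp32` are made up for this example.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* FP32 -> BF16: keep the upper 16 bits (truncation; production libraries
   typically round to nearest-even instead of truncating). */
static uint16_t fp32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint16_t)(bits >> 16);
}

/* BF16 -> FP32: place the 16 bits in the upper half and zero-fill the rest. */
static float bf16_to_fp32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main(void) {
    float big = 3.0e38f;  /* representable in BF16, far above the FP16 maximum (65504) */
    uint16_t h = fp32_to_bf16(big);
    printf("0x%04X -> %g\n", (unsigned)h, bf16_to_fp32(h));
    return 0;
}
```

The round trip also illustrates the dynamic-range advantage: 3.0e38 survives the conversion with only mantissa precision lost, whereas it would overflow to infinity in FP16.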

Standard

  • No IEEE standard defines bfloat16
  • Different architectures, accelerators, and software libraries have adopted slightly different aspects of the IEEE 754 floating-point standard to govern the numeric behavior of arithmetic on BF16 values

Support

See Also
