XDR floating point numbers - rhjdjong/xdrlib2 GitHub Wiki

Introduction

The XDR standard requires support for single-precision, double-precision, and quadruple-precision floating point representations, i.e. the interchange formats binary32, binary64, and binary128 as defined in IEEE 754. This poses some challenges for the xdrlib2 implementation. For one thing, most PCs do not support quadruple-precision floating points natively. But there are other (potential) issues: the handling of floating point numbers in Python depends on the underlying machine architecture and the C or Java implementation in which Python is written. This means that Python offers no guarantee as to which floating point precisions are supported. A hypothetical Python implementation that only supports floating point numbers in the binary16 format (see IEEE 754) would still comply with the Python language as defined in the Python Language Reference.

Given the above considerations, it is not feasible to simply treat XDR floating point numbers as if they were Python floats.

On the other hand, the XDR protocol is intended for data transfer, not for high-precision calculations. That means that, as far as the xdrlib2 library is concerned, only the creation, encoding, and decoding operations need to support the precision required by XDR floating point types. It is up to the user of the XDR protocol to provide a high-precision numerical library (e.g. gmpy) if they need to perform calculations with instances of the XDR float types (in particular the quadruple-precision type).

Most use cases for XDR data involve generating data, transferring data, and interpreting received data. Typical scenarios for interpreting received floating point data are: determining the sign, comparing to a pre-defined threshold, comparing with an earlier received data value, and so on.

For the above reasons I have decided to make the XDR floating point number types float subclasses, with additional attributes and methods that ensure that creating, comparing, encoding, and decoding these values works for all precisions.

Theoretical basis

The following is extracted from IEEE 754. For more information on IEEE 754 see e.g. the IEEE 754 page on Wikipedia. Since XDR is primarily about the representation of data, this section focuses on the interchange formats, not on the in-memory encoding of floating point values or the operations that can be performed on them.

A floating point format (i.e. a set of numbers and symbols) contains

  • finite numbers
  • plus and minus infinity
  • NaN values

The XDR standard uses a subset of the full IEEE 754 interchange formats, namely the binary32, binary64, and binary128 formats. Each interchange format uses a packed sequence of three non-negative integers signbit (s), exponent (e), and fraction (f) to represent numbers. An interchange format has a fixed number of bits, therefore the number of bits available for the representation of these integers is limited. The signbit always uses 1 bit: 0 indicates a positive value, 1 indicates a negative value. The number of bits used for e is indicated with E, and the number of bits used for f is indicated with F. Different representation formats have different values for E and F.

| format | size | E | F | bias | q_min | q_max | v_min | v_max |
|--------|------|---|---|------|-------|-------|-------|-------|
| binary32 | 32 | 8 | 23 | 127 | βˆ’126 | 127 | 2^βˆ’149 | (2 βˆ’ 2^βˆ’23) Γ— 2^127 |
| binary64 | 64 | 11 | 52 | 1023 | βˆ’1022 | 1023 | 2^βˆ’1074 | (2 βˆ’ 2^βˆ’52) Γ— 2^1023 |
| binary128 | 128 | 15 | 112 | 16383 | βˆ’16382 | 16383 | 2^βˆ’16494 | (2 βˆ’ 2^βˆ’112) Γ— 2^16383 |
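The derived columns follow from E and F alone. A quick consistency check in plain Python (the FORMATS dict is illustrative, not part of xdrlib2):

```python
from fractions import Fraction

# Interchange format parameters: name -> (total size in bits, E, F)
FORMATS = {"binary32": (32, 8, 23),
           "binary64": (64, 11, 52),
           "binary128": (128, 15, 112)}

for name, (size, E, F) in FORMATS.items():
    assert size == 1 + E + F                      # sign bit + exponent + fraction
    bias = 2 ** (E - 1) - 1                       # exponent bias b
    qmin = 1 - bias                               # smallest normal exponent
    qmax = bias                                   # largest normal exponent
    vmin = Fraction(1, 2 ** (bias - 1 + F))       # smallest subnormal: 2^(1-b-F)
    vmax = (2 - Fraction(1, 2 ** F)) * 2 ** qmax  # largest finite: (2-2^-F)*2^qmax
    print(name, bias, qmin, qmax)
```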

In the interchange representations, the minimum and maximum possible values of e (0 and 2^E βˆ’ 1) are used to indicate special numbers. All other values 1 ≀ e < 2^E βˆ’ 1 are used to represent normal numbers.

Any normal number v that is representable in a floating point format can be described by three values: a signbit (s) with value 0 (for positive values) or 1 (for negative values), a coefficient (c), and an integer exponent (q), such that v = (βˆ’1)^s Γ— c Γ— 2^q, and 1 ≀ c < 2.

To allow the representation of negative exponents, the value of e (which is always a non-negative number) is assumed to have a bias b so that: q = e βˆ’ b, with b = 2^(Eβˆ’1) βˆ’ 1, i.e. a leading 0-bit followed by Eβˆ’1 1-bits.
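The claim about the bias bit pattern is easy to verify in plain Python:

```python
# The bias for an E-bit exponent field is 2**(E-1) - 1: written as an
# E-bit binary number this is a single 0-bit followed by E-1 1-bits.
for E in (8, 11, 15):               # binary32, binary64, binary128
    bias = 2 ** (E - 1) - 1
    print(E, bias, format(bias, f"0{E}b"))
```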

To maximize the precision of the coefficient, the most significant bit of c (which is always 1) is not represented in the interchange format; only the bits that form the binary fraction are encoded in f. The integer value f therefore has an implied factor of 2^βˆ’F. This means that c = 1 + f Γ— 2^βˆ’F.

The smallest representable q value is q_min = 1 βˆ’ b. Numbers v with an absolute value less than 2^(1βˆ’b) cannot be represented in the way described above. Instead, such subnormal numbers are represented with e = 0, and their value is v = (βˆ’1)^s Γ— f Γ— 2^(1βˆ’bβˆ’F). The minimal absolute value greater than 0 that can be represented in this way is v_min = 2^(1βˆ’bβˆ’F). Any value with a smaller absolute value is represented with e and f equal to 0, which is equivalent to Β±0.0, depending on the value of the signbit.
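These limits can be observed directly with Python's own floats, which are binary64 on common platforms (a quick check, not xdrlib2 code):

```python
import sys

# binary64: b = 1023, F = 52
b, F = 1023, 52
# Smallest positive subnormal: 2**(1 - b - F) = 2**-1074
assert 2.0 ** (1 - b - F) > 0.0
# Halving it falls below v_min and rounds to 0.0
assert 2.0 ** (1 - b - F) / 2.0 == 0.0
# The smallest positive *normal* value is 2**(1 - b), reported by sys.float_info
assert sys.float_info.min == 2.0 ** (1 - b)
print(2.0 ** (1 - b - F))   # -> 5e-324
```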

Similarly, the largest representable value v_max has e = 2^E βˆ’ 2 and f = 2^F βˆ’ 1, corresponding with v_max = (2 βˆ’ 2^βˆ’F) Γ— 2^b. Numbers with an absolute value |v| > v_max cannot be represented as normal numbers. Such values are equivalent to Β±infinity, and are represented with e = 2^E βˆ’ 1 and f = 0.

Finally, each format has representations for Not-a-Number (NaN) values. Such representations have e = 2^E βˆ’ 1, and f β‰  0. Signaling NaNs have 0 < f < 2^(Fβˆ’1), meaning that they have a leading 0-bit in f. Quiet NaNs have 2^(Fβˆ’1) ≀ f < 2^F, meaning that they have a leading 1-bit in f. The other bits in f act as an opaque payload and have no significance for the floating point format.
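These encoding rules can be illustrated by unpacking a binary32 with Python's struct module and classifying the result (a standalone sketch, independent of xdrlib2; the function name is mine):

```python
import struct

def decode_binary32(data: bytes):
    """Unpack a big-endian binary32 into (signbit, e, f) and classify it."""
    (bits,) = struct.unpack(">I", data)   # 4 bytes -> unsigned 32-bit integer
    E, F = 8, 23
    s = bits >> (E + F)                   # 1 sign bit
    e = (bits >> F) & (2 ** E - 1)        # E exponent bits
    f = bits & (2 ** F - 1)               # F fraction bits
    if e == 2 ** E - 1:                   # maximum e: infinity or NaN
        kind = "infinity" if f == 0 else ("quiet NaN" if f >> (F - 1) else "signaling NaN")
    elif e == 0:                          # minimum e: zero or subnormal
        kind = "zero" if f == 0 else "subnormal"
    else:
        kind = "normal"
    return s, e, f, kind

# 1.0 encodes as s=0, e=bias=127, f=0
print(decode_binary32(struct.pack(">f", 1.0)))   # -> (0, 127, 0, 'normal')
```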

Constructing an XDR floating point value

In the xdrlib2 module, the XDR floating point types are subclasses of Python's float. All Python float instantiation methods must therefore also apply to XDR floating point types. That means they can be instantiated with either a number (integer or float) or a unicode or byte string as argument. The challenge is then to calculate the values of s, e, and f without using floating point calculations, because that would almost inevitably lead to rounding errors.

The key insight here is that the argument used to instantiate an XDR float type can always be represented as the ratio of two integers numerator (n) and denominator (d):

  • If the argument a is an integer, then nβ€―=β€―a and dβ€―=β€―1;
  • If the argument a is a float, then n and d are the result of a.as_integer_ratio();
  • If the argument a is a string, then a can be normalized by shifting the decimal point such that the exponent becomes 0 and the string takes the form <i>.<d>. Then n = 10^D Γ— <i> + <d>, and the denominator d = 10^D, with D equal to the number of digits in the decimal part <d>.
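The three cases above can be sketched as follows (the helper name is mine; exponent notation in strings and byte string arguments are omitted for brevity):

```python
def as_integer_ratio(a):
    """Represent the constructor argument exactly as a ratio n/d of integers.

    Sketch only: handles int, float, and simple decimal strings like '-12.34'.
    """
    if isinstance(a, int):
        return a, 1                       # n = a, d = 1
    if isinstance(a, float):
        return a.as_integer_ratio()       # exact: floats are binary fractions
    if isinstance(a, str):
        sign = -1 if a.startswith("-") else 1
        a = a.lstrip("+-")
        i, _, dec = a.partition(".")      # '<i>.<d>' with the point shifted out
        D = len(dec)                      # number of decimal digits
        n = 10 ** D * int(i or "0") + int(dec or "0")
        return sign * n, 10 ** D
    raise TypeError(f"unsupported argument type: {type(a).__name__}")

print(as_integer_ratio("0.1"))    # -> (1, 10)
```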

Assuming a positive value, so positive integers n and d, we must now find values for f and q such that (n/d) = (1 + f Γ— 2^βˆ’F) Γ— 2^q, with an integer value for q and 1 ≀ (1 + f Γ— 2^βˆ’F) < 2.

The above inequality can be written as 1 ≀ (n/d) Γ— 2^βˆ’q < 2. Taking the log2 of all parts we get 0 ≀ log2 n βˆ’ log2 d βˆ’ q < 1. Pretending for now that we have an unlimited value range for q, we can see that the integer value for q is given by q = ⌊log2 n βˆ’ log2 dβŒ‹.

Since we don't want to use floating point calculations, we approximate the log2 values by taking the bit_length() of the values. The bit_length() is the smallest integer strictly greater than the log2 value, and the log2 value is greater than or equal to bit_length() - 1 (because the value must begin with a 1-bit). In other words (writing bitsβ€―v for v.bit_length()), bitsβ€―v βˆ’ 1 β‰€ log2β€―v < bitsβ€―v.

Isolating q and applying the bit_length() bounds, we get:

  • q ≀ log2 n βˆ’ log2 d < bits n βˆ’ bits d + 1
  • q > log2 n βˆ’ log2 d βˆ’ 1 > bits n βˆ’ bits d βˆ’ 2

Writing β„“ for bits n βˆ’ bits d (avoiding the symbol b, which already denotes the exponent bias), the first inequality gives q ≀ β„“ and the second gives q β‰₯ β„“ βˆ’ 1, so either q = β„“ or q = β„“ βˆ’ 1. It is now simply a matter of programmatically checking for which of the two candidates the original condition is true.
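The two-candidate check can be sketched in pure integer arithmetic (the function name is illustrative, not the xdrlib2 API):

```python
def exponent(n: int, d: int) -> int:
    """Find the integer q with 1 <= (n/d) * 2**-q < 2, for positive n, d.

    bit_length() brackets log2, leaving two candidates for q; test the
    larger one against the lower bound 2**q <= n/d.
    """
    assert n > 0 and d > 0
    q = n.bit_length() - d.bit_length()   # candidate; the other is q - 1
    # Condition 2**q <= n/d, written without fractions: d * 2**q <= n.
    if q >= 0:
        ok = d << q <= n
    else:
        ok = d <= n << -q
    return q if ok else q - 1

print(exponent(1, 10))   # 0.1 -> q = -4, since 2**-4 <= 1/10 < 2**-3
```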

Having the value of q available allows us to calculate f. But because we only have F bits available for the integer f, there will in general not be an exact solution. Instead we calculate the real value g that is the exact solution of this equation, and then determine the integer f that best approximates g.

Starting from (n/d) = (1 + g Γ— 2^βˆ’F) Γ— 2^q, we can see that g = (n Γ— 2^(Fβˆ’q) βˆ’ d Γ— 2^F) / d. This is fine for the case where q ≀ F. But since we want to avoid floating point operations, for the case where q > F we write this as: g = (n βˆ’ d Γ— 2^q) / (d Γ— 2^(qβˆ’F)). In either case we can express g as the ratio of two integer values m and p, and the desired integer f is the rounded value of g, i.e. m // p if the remainder is less than half of p, and (m // p) + 1 otherwise. Rounding up can cause f to overflow to 2^F; in that case the significand has become exactly 2, i.e. 1 Γ— 2^(q+1), so we must set f to 0 and increase q by 1.
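The calculation of f, including the rounding and the overflow handling, might look as follows (a sketch; the name fraction_bits is mine, and it rounds half up, whereas IEEE 754's default is round-half-to-even):

```python
def fraction_bits(n: int, d: int, q: int, F: int):
    """Compute f (and a possibly adjusted q) so that
    (n/d) ~= (1 + f * 2**-F) * 2**q, using integer arithmetic only."""
    if q <= F:
        m, p = n * 2 ** (F - q) - d * 2 ** F, d
    else:
        m, p = n - d * 2 ** q, d * 2 ** (q - F)
    f, rem = divmod(m, p)
    if 2 * rem >= p:          # round half up (simplification; IEEE 754
        f += 1                # prescribes round-half-to-even by default)
    if f == 2 ** F:           # rounding overflowed the F fraction bits:
        f = 0                 # the significand became 2, i.e. 1 x 2**(q+1)
        q += 1
    return f, q

# 1/10 with q = -4 gives the binary32 fraction bits of 0.1
print(fraction_bits(1, 10, -4, 23))   # -> (5033165, -4)
```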

Now we must face the fact that q is not really unlimited in range.

  • If 1 βˆ’ b ≀ q ≀ b, with b = 2^(Eβˆ’1) βˆ’ 1, we have a normal situation. In that case we have f with the value calculated above, and e = q + b.

  • If q > b, i.e. q β‰₯ 2^(Eβˆ’1), we have an overflow situation. In that case we must use the representation for an infinite number, i.e. e = 2^E βˆ’ 1 and f = 0.

  • If q < 1 βˆ’ b, we have an underflow situation. In that case we must rewrite the number as (n/d) = g Γ— 2^(1βˆ’bβˆ’F), with b = 2^(Eβˆ’1) βˆ’ 1, or g = (n Γ— 2^(F+bβˆ’1)) / d. The exponent e is 0 in this case, and we calculate f from these two integer values as before. If f overflows due to rounding up, we must in this case set e to 1 and f to 0, because of the implied leading 1-bit for normal values.
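Putting the pieces together, the whole construction can be sketched as a pure-integer binary32 encoder, cross-checked against struct.pack (an illustrative sketch, not the actual xdrlib2 API; the function name and the round-half-up tie-breaking are my assumptions):

```python
import struct

E, F = 8, 23                      # binary32 parameters
BIAS = 2 ** (E - 1) - 1           # b = 127

def encode_binary32(n: int, d: int) -> bytes:
    """Encode the exact ratio n/d (d > 0) as a big-endian binary32,
    using integer arithmetic only."""
    s = 1 if n < 0 else 0
    n = abs(n)
    if n == 0:
        return struct.pack(">I", s << (E + F))        # signed zero
    # q such that 1 <= (n/d) * 2**-q < 2, via bit_length (see above)
    q = n.bit_length() - d.bit_length()
    if not (d << q <= n if q >= 0 else d <= n << -q):
        q -= 1
    if q > BIAS:                              # overflow -> infinity
        e, f = 2 ** E - 1, 0
    else:
        if q < 1 - BIAS:                      # underflow -> subnormal
            e = 0
            m, p = n * 2 ** (F + BIAS - 1), d # g = n * 2**(F+b-1) / d
        elif q <= F:
            e = q + BIAS
            m, p = n * 2 ** (F - q) - d * 2 ** F, d
        else:
            e = q + BIAS
            m, p = n - d * 2 ** q, d * 2 ** (q - F)
        f, rem = divmod(m, p)
        if 2 * rem >= p:                      # round half up
            f += 1
        if f == 2 ** F:                       # rounding overflowed f
            e, f = e + 1, 0
    return struct.pack(">I", (s << (E + F)) | (e << F) | f)

print(encode_binary32(1, 10).hex())    # matches struct.pack('>f', 0.1)
```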
