Design & Implementation Guidelines - HenrikBengtsson/matrixStats GitHub Wiki

Requirement

A requirement for the native implementation is to:

avoid calling ISNAN(x) as far as possible whenever x is a double.

This is because that operation to test for missing value is fairly expensive for doubles. It is often better to let the floating-point arithmetic of CPU to take of this; it will correctly propagate NAs whenever doing additions, subtractions, multiplications, divisions etc. This also means that we cannot to do early stopping for na.rm=FALSE cases. See below for an example.

Note that testing for missing values when x is an integer is not expensive. This is because the test is a simple equality comparison, which is very cheap. More precisely, it is done as x == NA_INTEGER. Since testing for missing values for integers are cheap, we can also use it for early stopping whenever na.rm=FALSE. Actually, we have to test for NA_INTEGER, because the CPU does not handle them for us. This is because missing values for integers are not part of the IEEE standard(s). Instead, in R they are defined as the the smallest possible 4-byte signed integer, i.e.

LibExtern int	 R_NaInt;	/* NA_INTEGER:= INT_MIN currently */
#define NA_INTEGER	R_NaInt

This value is -2^31 = -2147483648. Note that the smallest possible integer in R is this value plus one, i.e. -.Machine$integer.max = -2147483647. Also, if one would not account this special value it would be treated as just another value, e.g. NA_INTEGER + 1 == -2147483647.

Strategy

A typically approach to avoid the overhead of ISNAN(x), which only exists for doubles, is to implement algorithms slightly different for integers and doubles. We use preprocessing macros for this, e.g.

#if X_TYPE == 'i'
      if (!X_ISNAN(value)) {
        sum += (LDOUBLE)value;
      } else if (!narm) {
          sum = R_NaReal;
          break;
      }
#elif X_TYPE == 'r'
      if (!narm || !X_ISNAN(value)) {
        sum += (LDOUBLE)value;
      }
#endif

Here we see that for integers (X_TYPE == 'i') we always test for missing values in order to handle them properly. At the same time we can typically do "early stopping" in case there is a missing value and na.rm=FALSE.

For doubles (X_TYPE == 'r') we minimize the testing for missing values by only testing for them when na.rm=TRUE; hence the (!narm || !X_ISNAN(value)) contruct.

BTW, in order to have a somewhat more consistent notation, we X_ISNAN(x) for both integers and doubles by defining:

  #define X_ISNA(x) (x == NA_INTEGER)

for integers (X_TYPE == 'i') and

  #define X_ISNAN(x) ISNAN(x) /* NA or NaN */

for doubles (X_TYPE == 'r').