CPP‐Floating Numbers - rFronteddu/general

Scientific Notation

Scientific notation is a useful shorthand for writing lengthy numbers in a concise manner. It takes the form $$significand \times 10^{exponent}$$. For example, $$1.2 \times 10^4 = 12000$$. By convention, numbers in scientific notation are written with one digit before the decimal point, and the rest of the digits afterward.
Because it can be hard to type or display exponents in C++, we use the letter ‘e’ (or sometimes ‘E’) to represent the “times 10 to the power of” part of the equation. For example, 1.2 x 10⁴ would be written as 1.2e4, and 5.9722 x 10²⁴ would be written as 5.9722e24. For numbers smaller than 1, the exponent can be negative. The number 5e-2 is equivalent to 5 * 10⁻², which is 5 / 10², or 0.05. The mass of an electron is 9.1093837e-31 kg.
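As a small illustration of the 'e' notation above, the sketch below stores two of the example values as C++ literals. The names twelveThousand, fiveHundredths, and printScientificLiterals are made up for this example.

```cpp
#include <iostream>

// Literals written with an 'e' exponent and no suffix have type double.
constexpr double twelveThousand { 1.2e4 };  // 1.2 x 10^4
constexpr double fiveHundredths { 5e-2 };   // 5 x 10^-2

// Prints both values using the stream's default formatting.
// (Hypothetical helper, named only for this example.)
void printScientificLiterals()
{
    std::cout << twelveThousand << '\n';  // prints 12000
    std::cout << fiveHundredths << '\n';  // prints 0.05
}
```

Calling printScientificLiterals() from main shows that the 'e' form and the ordinary decimal form denote the same values.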
The digits in the significand (the part before the ‘e’) are called the significant digits (or significant figures). The more significant digits, the more precise a number is. In scientific notation, we’d write 3.14 as 3.14e0. Since there are 3 digits in the significand, this number has 3 significant digits.

When converting to scientific notation, trailing zeros after a decimal point are considered significant, so we keep them:

87.0g = 8.70e1
87.000g = 8.7000e1

For numbers with no decimal point, trailing zeros are considered to be insignificant by default. Given the number 2100 (with no additional information), we assume the trailing zeros are not significant, so we drop them:

2100 = 2.1e3 (trailing zeros assumed not significant)

However, if we happened to know that this number was measured precisely (or that the actual number was somewhere between 2099.5 and 2100.5), then we should instead treat those zeros as significant:

2100 = 2.100e3 (trailing zeros known significant)

Floating Point numbers

A floating point type variable is a variable that can hold a number with a fractional component, such as 4320.0, -3.33, or 0.01226. The floating part of the name floating point refers to the fact that the decimal point can “float” -- that is, it can support a variable number of digits before and after the decimal point. Floating point data types are always signed (can hold positive and negative values).

C++ has three fundamental floating point data types: a single-precision float, a double-precision double, and an extended-precision long double. As with integers, C++ does not define the actual size of these types.

On modern architectures, floating-point types are conventionally implemented using one of the floating-point formats defined in the IEEE 754 standard (see https://en.wikipedia.org/wiki/IEEE_754). As a result, float is almost always 4 bytes, and double is almost always 8 bytes.

On the other hand, long double is a strange type. On different platforms, its size can vary between 8 and 16 bytes, and it may or may not use an IEEE 754 compliant format. We recommend avoiding long double.
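Since the sizes are platform dependent, one way to see what a particular compiler uses is to ask it. The sketch below is only an illustration (printFloatingPointSizes is a hypothetical name); it relies on sizeof and on std::numeric_limits<T>::is_iec559, which reports whether the implementation claims IEEE 754 (IEC 60559) conformance for the type.

```cpp
#include <iostream>
#include <limits>

// Prints the size of each floating point type on the current platform,
// and whether the implementation reports it as IEEE 754 (IEC 60559) conformant.
// (Hypothetical helper, named only for this example.)
void printFloatingPointSizes()
{
    std::cout << std::boolalpha;
    std::cout << "float: " << sizeof(float) << " bytes, IEEE 754: "
              << std::numeric_limits<float>::is_iec559 << '\n';
    std::cout << "double: " << sizeof(double) << " bytes, IEEE 754: "
              << std::numeric_limits<double>::is_iec559 << '\n';
    std::cout << "long double: " << sizeof(long double) << " bytes, IEEE 754: "
              << std::numeric_limits<long double>::is_iec559 << '\n';
}
```

The output differs between platforms, which is the point: the long double line in particular is likely to change from one compiler to another.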

Note that floating point literals have type double by default. An f suffix is used to denote a literal of type float.
Always make sure the type of your literals matches the type of the variables they’re being assigned to or used to initialize. Otherwise an unnecessary conversion will result, possibly with a loss of precision.
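The literal rules above can be checked at compile time. This is a minimal sketch; the variable and function names are invented for the example, and 4.1 is just a convenient value that binary floating point cannot store exactly.

```cpp
#include <type_traits>

// A literal with no suffix has type double; the f suffix makes it a float.
static_assert(std::is_same<decltype(5.0), double>::value, "no suffix means double");
static_assert(std::is_same<decltype(5.0f), float>::value, "f suffix means float");

// The literal type matches the variable type, so no conversion is needed.
constexpr float  matchingFloat  { 5.0f };
constexpr double matchingDouble { 5.0 };

// Returns true when the float literal 4.1f, widened to double, differs from
// the double literal 4.1, showing that the two literals carry different precision.
// (Hypothetical helper, named only for this example.)
constexpr bool floatLiteralLosesPrecision()
{
    return static_cast<double>(4.1f) != 4.1;
}
```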

Printing floating point numbers

The precision of a floating point type defines how many significant digits it can represent without information loss. The number of digits of precision a floating point type has depends on both the size (floats have less precision than doubles) and the particular value being stored (some values can be represented more precisely than others).

For example, a float has 6 to 9 digits of precision. This means that any number with up to 6 significant digits can be stored in a float and read back unchanged. A number with 7 to 9 significant digits may or may not be preserved, depending on the specific value. And a number with more than 9 significant digits will generally lose some of its digits.
Double values have between 15 and 17 digits of precision, with most double values having at least 16 significant digits. Long double has a minimum precision of 15, 18, or 33 significant digits depending on how many bytes it occupies.
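The precision of each type on a given platform can be queried through std::numeric_limits. In the sketch below (printPrecision and printAllPrecisions are hypothetical names), digits10 is the number of decimal digits that always survive being stored in the type, and max_digits10 is the number of decimal digits needed to print a value so that it can be read back unchanged.

```cpp
#include <iostream>
#include <limits>

// Prints the lower bound (digits10) and upper bound (max_digits10)
// on the decimal precision of type T.
// (Hypothetical helper, named only for this example.)
template <typename T>
void printPrecision(const char* name)
{
    std::cout << name << ": "
              << std::numeric_limits<T>::digits10 << " to "
              << std::numeric_limits<T>::max_digits10 << " digits\n";
}

// Reports the precision of all three fundamental floating point types.
void printAllPrecisions()
{
    printPrecision<float>("float");
    printPrecision<double>("double");
    printPrecision<long double>("long double");
}
```

The long double line depends on the platform, as does everything else about that type.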

When outputting floating point numbers, std::cout has a default precision of 6 -- that is, it assumes all floating point variables are only significant to 6 digits (the minimum precision of a float), and hence it will round the output to 6 significant digits.

We can override the default precision that std::cout shows by using an output manipulator function named std::setprecision(). Output manipulators alter how data is output, and are defined in the iomanip header.
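A minimal sketch of std::setprecision in use. The helper name formatWithPrecision is hypothetical, and it writes to a std::ostringstream rather than std::cout only so that the result can be returned as a string; the manipulator is applied to std::cout in the same way.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Formats value using the requested number of significant digits.
// (Hypothetical helper, named only for this example.)
std::string formatWithPrecision(double value, int digits)
{
    std::ostringstream out;
    out << std::setprecision(digits) << value;
    return out.str();
}
```

For example, formatWithPrecision(0.1, 6) gives "0.1", while formatWithPrecision(0.1, 17) gives "0.10000000000000001", exposing digits that the default precision hides.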

When precision is lost because a number can’t be stored precisely, this is called a rounding error. Favor double over float unless space is at a premium, as the lack of precision in a float will often lead to inaccuracies.

Rounding errors make floating point comparisons tricky

Floating point numbers are tricky to work with due to non-obvious differences between binary (how data is stored) and decimal (how we think) numbers. Consider the fraction 1/10. In decimal, this is easily represented as 0.1, and we are used to thinking of 0.1 as an easily representable number with 1 significant digit. However, in binary, decimal value 0.1 is represented by the infinite sequence: 0.00011001100110011… Because of this, when we assign 0.1 to a floating point number, we’ll run into precision problems. Computers use binary, not decimal. That means they can represent exactly only those fractions that can be written as a sum of powers of two. Rounding errors may make a number either slightly smaller or slightly larger, depending on where the truncation happens.

   double d2{ 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 }; // should equal 1.0
   std::cout << std::setprecision(17) << d2 << '\n'; // prints 0.99999999999999989

Because floating point numbers tend to be inexact, comparing floating point numbers is generally problematic. One last note on rounding errors: mathematical operations (such as addition and multiplication) tend to make rounding errors grow. So even though 0.1 has a rounding error in the 17th significant digit, when we add 0.1 ten times, the rounding error has crept into the 16th significant digit. Continued operations would cause this error to become increasingly significant.
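One common way to cope with this, sketched here as an assumption rather than something prescribed above, is to test whether two values are close enough instead of testing them for exact equality. The function names and the choice of a relative tolerance are illustrative only; a suitable tolerance depends on the application.

```cpp
#include <algorithm>
#include <cmath>

// Returns true if a and b differ by no more than relEpsilon times the
// larger of their magnitudes (a relative comparison).
// (Hypothetical helper, named only for this example.)
bool approximatelyEqual(double a, double b, double relEpsilon)
{
    return std::abs(a - b) <= std::max(std::abs(a), std::abs(b)) * relEpsilon;
}

// Adds 0.1 ten times, accumulating a small rounding error along the way.
double sumOfTenTenths()
{
    double sum { 0.0 };
    for (int i { 0 }; i < 10; ++i)
        sum += 0.1;
    return sum;
}
```

With IEEE 754 doubles, sumOfTenTenths() == 1.0 is false, while approximatelyEqual(sumOfTenTenths(), 1.0, 1e-9) is true. A purely relative test like this one behaves poorly when both values are very close to zero, so it is a starting point rather than a general solution.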

Rounding errors occur when a number can’t be stored precisely. This can happen even with simple numbers, like 0.1. Therefore, rounding errors can, and do, happen all the time. Rounding errors aren’t the exception -- they’re the norm. Never assume your floating point numbers are exact. A corollary of this rule is: be wary of using floating point numbers for financial or currency data.

NaN and Inf

IEEE 754 compatible formats additionally support some special values:

Inf, which represents infinity. Inf is signed, and can be positive (+Inf) or negative (-Inf).
NaN, which stands for “Not a Number”. There are several different kinds of NaN.
Signed zero, meaning there are separate representations for “positive zero” (+0.0) and “negative zero” (-0.0).

#include <iostream>

int main()
{
    double zero { 0.0 };

    double posinf { 5.0 / zero }; // positive infinity
    std::cout << posinf << '\n';

    double neginf { -5.0 / zero }; // negative infinity
    std::cout << neginf << '\n';

    double z1 { 0.0 / posinf }; // positive zero
    std::cout << z1 << '\n';

    double z2 { -0.0 / posinf }; // negative zero
    std::cout << z2 << '\n';

    double nan { zero / zero }; // not a number (mathematically invalid)
    std::cout << nan << '\n';

    return 0;
}

Note that the results of printing Inf and NaN are platform specific, so your results may vary (e.g. Visual Studio prints the last result as -nan(ind)).
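Because the printed text varies, it is more reliable to test for these values in code than to inspect the output. The sketch below uses std::isnan, std::isinf, and std::signbit from the cmath header; the classify function and the labels it returns are hypothetical.

```cpp
#include <cmath>
#include <string>

// Returns a platform-independent label for the special values described above.
// (Hypothetical helper, named only for this example.)
std::string classify(double value)
{
    if (std::isnan(value))
        return "nan";
    if (std::isinf(value))
        return std::signbit(value) ? "-inf" : "+inf";
    if (value == 0.0) // true for both +0.0 and -0.0
        return std::signbit(value) ? "-0" : "+0";
    return "finite";
}
```

Note that positive and negative zero compare equal, so std::signbit is needed to tell them apart.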

To summarize, the two things you should remember about floating point numbers:

Floating point numbers are useful for storing very large or very small numbers, including those with fractional components.
Floating point numbers often have small rounding errors, even when the number has fewer significant digits than the precision. Many times these go unnoticed because they are so small, and because the numbers are rounded for output. However, comparisons of floating point numbers may not give the expected results. Performing mathematical operations on these values tends to make the rounding errors grow larger.
