IEEE 754 Floating-Point Arithmetic Quiz
The following questions are about floating-point arithmetic as defined by the IEEE 754 standard. The 2008 revision of the standard generalized floating-point arithmetic and introduced three decimal formats. Here we only consider the binary floating-point formats single precision (32-bit) and double precision (64-bit), although most questions are independent of the radix, exponent range, and precision.
In the sequel we denote the floating-point approximation of any real number by the projection \(\fl \colon \mathbb{R} \to \mathbb{F}\) where \(\mathbb{F}\) is any finite and discrete floating-point number system. Unless mentioned otherwise, we assume “round ties to even”, which is the default rounding-direction attribute (IEEE 754-2008 §4.3.3).
In order to distinguish between elementary operations over the reals and floating-point numbers we introduce the floating-point operators \(\oplus,\ominus,\otimes,\oslash\) for their counterparts over the reals \({+},{-},{\times},{/}\), respectively. We then have \(x \oplus y = \fl(x + y)\) for floating-point numbers \(x,y\) as well as similar definitions for \(\ominus,\otimes,\oslash\). Furthermore, we lift these operators in order to deal with NaNs and infinities as defined by the IEEE 754 standard.
In order to distinguish between numbers, infinities, and NaNs the following notation is used. A number is typeset in normal font like \(x, y, z\). A datum is either an infinity, a NaN, or a number and is typeset in bold face like \(\mathbf{x}, \mathbf{y}, \mathbf{z}\).
-
Floating-point numbers are uniformly distributed.
Yes / No

Each floating-point number is of the form \(\pm m \times 2^e\) with \(1 \leq m \lt 2\). The significand \(m\) has \(p\) bits, including one implied bit. Thus, in each interval \(\left[\left.2^e, 2^{e+1}\right)\right.\) there are \(2^{p-1}\) equally spaced floating-point numbers, and the spacing within \(\left[\left.2^e, 2^{e+1}\right)\right.\) is \(2^{e-(p-1)}\). Therefore, the spacing doubles from one interval to the next. The distribution resembles a logarithmic scale rather than a linear one. In other words, the numbers are dense towards \(\pm 0\) and sparse towards \(\pm \infty\).
Let us have a look at a minifloat example. Consider IEEE-754-like 8-bit floating-point numbers given by a 3-bit exponent, 4-bit significand, and bias 3. The following figure depicts the number line.
Fun fact: the spacing is rather huge for big numbers, which might be nonintuitive. For instance, the biggest positive exponent of a double-precision floating-point number is \(+1023\). Thus, for numbers in the interval \(\left[\left.2^{1023}, 2^{1024}\right)\right.\) the spacing equals \(2^{971}\), which is roughly \(10^{292}\). In contrast, for numbers in the interval \([2^{-1022}, 2^{-1021})\) the spacing equals \(2^{-1074}\), which is roughly \(10^{-324}\).
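In C you can inspect this spacing directly with the nextafter function from math.h, which returns the adjacent floating-point number in a given direction. A minimal sketch (the hexadecimal float literal 0x1p1023 denotes \(2^{1023}\)):

#include <math.h>
#include <stdio.h>

int main(void) {
    double x = 0x1p1023;                      /* 2^1023 */
    double gap = nextafter(x, INFINITY) - x;  /* distance to the next double */
    printf("%g\n", gap);                      /* prints about 2e+292, i.e., 2^971 */
    return 0;
}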
-
Roughly half of all positive floating-point numbers are in the interval \((0,1)\).
Yes / No

Yeah, I know, giving a precise answer to an imprecise question is … rough. This question is more about sharpening the intuition about the distribution of floating-point numbers.
The exponent of IEEE 754 single-precision floating-point numbers ranges from \(-126\) to \(+127\). We ignore the exponents \(-127\) (all-zero) and \(+128\) (all-one) since they are reserved for special values (zero, subnormals, infinities, NaNs). In total we have \(254\) exponents, of which \(126\) are negative, which is roughly half of all.
The exponent of IEEE 754 double-precision floating-point numbers ranges from \(-1022\) to \(+1023\). We ignore the exponents \(-1023\) (all-zero) and \(+1024\) (all-one) since they are reserved for special values (zero, subnormals, infinities, NaNs). In total we have \(2046\) exponents, of which \(1023\) are negative, which is precisely half of all.
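These counts can be cross-checked without enumerating all values, because positive finite single-precision numbers are ordered exactly like their bit patterns interpreted as unsigned integers. A small sketch in C, counting subnormals as well (1.0f has the bit pattern 0x3F800000, the largest finite float 0x7F7FFFFF):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t below_one = 0x3F7FFFFF; /* bit patterns of all positive floats below 1.0f */
    uint32_t positive  = 0x7F7FFFFF; /* bit patterns of all positive finite floats */
    printf("%f\n", (double)below_one / (double)positive); /* prints about 0.498 */
    return 0;
}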
-
Only fractions with power-of-two denominators are representable.
Yes / No

Floating-point numbers are of the form \(\pm m \times 2^e\) where \(m = (1.f)_2\) for some binary fractional part \(f\), and the exponent \(e\) is an integer. Let \(p\) be the precision of \(m\), then \(m = m' \times 2^{-(p-1)}\) for some integer \(m'\), so every floating-point number equals \(\pm m' \times 2^{e-(p-1)}\), a fraction with a power-of-two denominator, q.e.d.
For example, the finite decimal number \(0.1\) is not representable by a finite binary number: \[(0.1)_{10} = (0.00011001100110011\dots)_{2} \] Since the number is periodic in the binary system, it must be rounded to its nearest floating-point number, i.e., \(\fl(0.1) \neq 0.1\).
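The rounding of \(0.1\) can be observed in C by printing more digits than fit into a double:

#include <stdio.h>

int main(void) {
    printf("%.20f\n", 0.1); /* prints 0.10000000000000000555 */
    return 0;
}

The trailing digits show that \(\fl(0.1)\) is slightly larger than \(0.1\).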
-
There exist integers \(x,y \in \mathbb{Z}\) with \(x \neq y\) such that \(\fl(x) = \fl(y) \neq \pm\infty\).
Yes / No

If an integer has more digits than are representable by the significand, then the least significant digits are rounded. For single-precision floating-point numbers we have \(24\) bits (including one implied bit) for the significand. Thus every integer from \(-2^{24}\) up to \(+2^{24}\) is representable by a single-precision floating-point number. Likewise, for double-precision numbers we have \(53\) bits (including one implied bit) for the significand, which means that every integer from \(-2^{53}\) up to \(+2^{53}\) is representable. Bigger integers are rounded to their nearest floating-point number. In other words, digits are rounded not only after the decimal point, but also before it.
For example, an expression such as

16777217.0f == 16777216.0f

evaluates to true in C, since \(2^{24} + 1\) is rounded to \(2^{24}\) in single precision.

-
There exists a floating-point number \(x\) such that \(x \oplus 1 = x\).
Yes / No

This question is somewhat related to the previous one; nonetheless, the underlying effect, called absorption, is different.
If two numbers that differ by several orders of magnitude are added, then one number may be absorbed by the other, as we show in the following. Let us consider the general case \(x \oplus y = x\) for some floating-point numbers \(x,y\). Assume \(0 \lt y \lt x\), and let \(x = m_x \times 2^{e_x}\) and \(y = m_y \times 2^{e_y}\), then \[ x + y = (m_x + m_y \times 2^{e_y - e_x}) \times 2^{e_x} \] If \(e_y - e_x \lt -(p+1)\) where \(p\) is the precision, then \[ m_y \times 2^{e_y - e_x} \lt 2^{-p} \] since \(m_y \lt 2\). Thus we have \[ x + y \lt (m_x + 2^{-p}) \times 2^{e_x} = (x + \succ(x)) / 2 \] where \(\succ(x)\) is the successor floating-point number of \(x\). Hence \(x + y\) lies below the midpoint between \(x\) and \(\succ(x)\) and is rounded back to \(x\), q.e.d.
For example, consider decimal floating-point arithmetic with four digits of precision. The numbers \(100\) and \(0.001\) are precisely representable as \(1.000 \times 10^2\) and \(1.000 \times 10^{-3}\), respectively. While adding both numbers, the exponents get aligned (the numbers after the vertical bar get rounded and are shown only for demonstration purposes): \[\begin{align} 1.000 \times 10^2 + 1.000 \times 10^{-3} &= 1.000 \times 10^2 + 0.000|01 \times 10^2 \\ &= 1.000 \times 10^2 + 0.000 \times 10^2 \\ &= 1.000 \times 10^2 \end{align}\]
In other words, \(\fl(100.001) = 100\).

An example in C for single-precision floating-point arithmetic is

1e8f + 1.0f == 1e8f

and for double-precision floating-point arithmetic

1e16 + 1.0 == 1e16

Note that the spacing between numbers gets rather huge at the end of the scale, and you might lose your feeling for it. For example, the expression

1e300 + 1e280 == 1e300

yields true in C.

Thus, in real arithmetic the equation \(x + 1 = x\) has no solution, whereas in floating-point arithmetic it has multiple solutions.
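The threshold at which \(1\) is absorbed can be found with a short loop. A sketch in C, assuming the compiler evaluates float arithmetic in single precision (no excess precision):

#include <stdio.h>

int main(void) {
    float x = 1.0f;
    while (x + 1.0f != x) /* stop as soon as 1 is absorbed */
        x *= 2.0f;
    printf("%.1f\n", x);  /* prints 16777216.0, i.e., 2^24 */
    return 0;
}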
-
Addition of floating-point numbers is commutative, i.e., \(x \oplus y = y \oplus x\) for all floating-point numbers \(x,y\).
Yes / No

This follows from the definition of the plus operator over the floating-point numbers and the commutativity of the plus operator over the reals: \[x \oplus y = \fl(x + y) = \fl(y + x) = y \oplus x\]
The same holds true for floating-point multiplication.
Fun fact 1: addition is not necessarily commutative w.r.t. NaNs. Propagation of NaNs still holds; however, the payload of a resulting NaN is only suggested to equal that of one of the input NaNs (IEEE 754-2008 §6.2.3 NaN propagation). For example, an implementation could choose \(\NaN_1 \oplus \NaN_2 = \NaN_1\) but \(\NaN_2 \oplus \NaN_1 = \NaN_2\). For further discussions about this have a look at the IEEE 754 mailing list.
Fun fact 2: there exist floating-point implementations other than IEEE 754 where such operations are not necessarily commutative even on numbers. For example, according to Towards sensible floating-point arithmetic, multiplication is not commutative on the Cray-1.
-
Addition of floating-point numbers is associative, i.e., \((x \oplus y) \oplus z = x \oplus (y \oplus z)\) for all floating-point numbers \(x,y,z\).
Yes / No

Adding two floating-point numbers does not necessarily result in a floating-point number, i.e., \(x \oplus y \neq x + y\) holds for some floating-point numbers \(x,y\). In such cases, further operations are affected by the previous rounding error. For example, consider decimal floating-point arithmetic with four digits of precision and round-half-up as the tie-breaking rule: \[\begin{align*} & (1.000 \times 10^3 \oplus 5.000 \times 10^{-1}) \oplus 5.000 \times 10^{-1} \\ &= (1.000 \times 10^3 \oplus 0.000|5 \times 10^3) \oplus 5.000 \times 10^{-1} \\ &= (1.000 \times 10^3 \oplus 0.001 \times 10^3) \oplus 5.000 \times 10^{-1} \\ &= 1.001 \times 10^3 \oplus 5.000 \times 10^{-1} \\ &= 1.002 \times 10^3 \\ &\neq 1.001 \times 10^3 \\ &= 1.000 \times 10^3 \oplus 0.001 \times 10^3 \\ &= 1.000 \times 10^3 \oplus 1.000 \times 10^0 \\ &= 1.000 \times 10^3 \oplus (5.000 \times 10^{-1} \oplus 5.000 \times 10^{-1}) \\ \end{align*}\]
Another example is the following expression, which yields true in C:

(0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3)

Such rounding errors may lead to surprising results. Even more nonintuitive: sometimes the rounding error is canceled and sometimes not. For instance,

0.1 + 0.3 == 0.4

evaluates to true, whereas

0.1 + 0.2 == 0.3

evaluates to false. Note that the numbers 0.1, 0.2, 0.3, and 0.4 are all not representable by (finite) binary floating-point numbers.

Thus, in general, addition is not associative. The same holds true for multiplication.
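Printing the two groupings with 17 significant digits shows where the rounding error ends up:

#include <stdio.h>

int main(void) {
    printf("%.17g\n", (0.1 + 0.2) + 0.3); /* prints 0.60000000000000009 */
    printf("%.17g\n", 0.1 + (0.2 + 0.3)); /* prints 0.59999999999999998 */
    return 0;
}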
-
Addition and multiplication of floating-point numbers is distributive, i.e., \(x \otimes (y \oplus z) = (x \otimes y) \oplus (x \otimes z)\) and \((x \oplus y) \otimes z = (x \otimes z) \oplus (y \otimes z)\) for all floating-point numbers \(x,y,z\).
Yes / No

Addition and multiplication of floating-point numbers are not distributive. A similar argument as for why addition is not associative applies here, too.
An example in C for double-precision floating-point numbers is:
10.0 * (0.1 + 0.2) != (10.0 * 0.1) + (10.0 * 0.2)
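Again, printing both sides with 17 significant digits makes the difference visible:

#include <stdio.h>

int main(void) {
    printf("%.17g\n", 10.0 * (0.1 + 0.2));          /* prints 3.0000000000000004 */
    printf("%.17g\n", (10.0 * 0.1) + (10.0 * 0.2)); /* prints 3 */
    return 0;
}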
-
Expression \(0 \oslash 0\) evaluates to
anything because it is undefined / an infinity / a NaN

Zero divided by zero is undefined as a real number and is an indeterminate form. For example, \(\lim_{x \to 0} \sin(x)/x = 1\) whereas \(\lim_{x \to 0} (1-\cos(x))/x = 0\). Intuitively, when thinking of \(0/0\) as the limit of a ratio of two very small numbers, \(0/0\) could represent anything. In IEEE 754 floating-point arithmetic, such values are represented by a NaN (Not a Number).
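In C this can be checked with the isnan macro from math.h (the operands are stored in variables so that the division is performed at run time in floating-point arithmetic):

#include <math.h>
#include <stdio.h>

int main(void) {
    double zero = 0.0;
    double q = zero / zero;        /* 0/0 yields a NaN */
    printf("%d\n", isnan(q) != 0); /* prints 1 */
    return 0;
}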
-
Expression \(1 \oslash 0\) evaluates to
anything because it is undefined / an infinity / a NaN

In ordinary arithmetic, division by zero is undefined. However, in contrast to the previous question, \(1 / 0\) is not an indeterminate form. Any limit that gives rise to this form diverges to an infinity, i.e., \(\lim_{x \to 0^+} 1/x = +\infty\) and \(\lim_{x \to 0^-} 1/x = -\infty\). Thus, in IEEE 754 floating-point arithmetic we have for any number \(x\): if \(x \gt 0\) then \(x \oslash 0 = +\infty\), and if \(x \lt 0\) then \(x \oslash 0 = -\infty\).
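A corresponding check in C, using the INFINITY macro from math.h:

#include <math.h>
#include <stdio.h>

int main(void) {
    double zero = 0.0;
    printf("%d\n", 1.0 / zero == INFINITY);   /* prints 1 */
    printf("%d\n", -1.0 / zero == -INFINITY); /* prints 1 */
    return 0;
}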
-
Equality relation is reflexive, i.e., \(\mathbf{x} \doteq \mathbf{x}\) for any datum \(\mathbf{x}\).
Yes / No

Floating-point equality is not reflexive. A NaN compares unequal to any value, including another NaN and itself. Thus, in C the expression

NAN == x

evaluates to false for all doubles x. This holds true irrespective of whether the NaN is signaling or quiet.

If you want to test for a NaN in C, you should use the function-like macro isnan, which takes either a float or a double as parameter and returns a nonzero value if and only if its argument is a NaN. For more details have a look at this StackOverflow answer.
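A short demonstration in C:

#include <math.h>
#include <stdio.h>

int main(void) {
    double d = NAN;
    printf("%d\n", d == d);        /* prints 0: a NaN is not equal to itself */
    printf("%d\n", isnan(d) != 0); /* prints 1: the proper way to test for a NaN */
    return 0;
}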
-
Identity law holds for addition, i.e., \(\mathbf{x} \oplus 0\) equals \(\mathbf{x}\) for any datum \(\mathbf{x}\).
Yes / No

For a negative zero we have \(-0 + 0 = 0 \neq -0\). However, the good news is that the floating-point equality relation still holds for \(-0 \doteq 0\) and \(0 \doteq -0\). \[\begin{array}{c|c|c} {\oplus} & +0 & -0 \\\hline +0 & +0 & +0 \\\hline -0 & +0 & -0 \end{array}\]
This is also an example (among others) of why a compiler may not optimize x + (y - y) into x: for \(x = -0\) and finite \(y\) the expression x + (y - y) evaluates to \(+0\), not \(-0\). (Moreover, if \(y\) is an infinity or a NaN, then y - y is a NaN.)
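The effect of the signed zero can be made visible in C with the signbit macro from math.h:

#include <math.h>
#include <stdio.h>

int main(void) {
    double x = -0.0;
    printf("%d\n", signbit(x) != 0);       /* prints 1: x is a negative zero */
    printf("%d\n", signbit(x + 0.0) != 0); /* prints 0: -0 + 0 = +0 */
    printf("%d\n", x == 0.0);              /* prints 1: yet -0 compares equal to +0 */
    return 0;
}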
-
Zero property of multiplication holds, i.e., \(\mathbf{x} \otimes 0\) equals \(0\) for any datum \(\mathbf{x}\).
Yes / No

If \(\mathbf{x}\) is a negative finite number, then \(\mathbf{x} \otimes 0 = -0 \neq 0\). Furthermore, if \(\mathbf{x}\) is a NaN or an infinity, then \(\mathbf{x} \otimes 0 = \NaN\).
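Both cases in C (on typical implementations %g prints the negative zero as -0):

#include <math.h>
#include <stdio.h>

int main(void) {
    double zero = 0.0;
    printf("%g\n", -1.0 * zero);                 /* prints -0 */
    printf("%d\n", isnan(INFINITY * zero) != 0); /* prints 1 */
    return 0;
}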