IEEE 754 Floating-Point Arithmetic Quiz
The following questions are about floating-point arithmetic as defined by the IEEE 754 standard. The revised version from 2008 generalized floating-point arithmetic and introduced three decimal formats. Here we only consider the binary floating-point formats single precision (32-bit) and double precision (64-bit). Most questions, however, are independent of the radix, exponent range, and precision.
In the sequel we denote the floating-point approximation of any real number by the projection \(\fl \colon \mathbb{R} \to \mathbb{F}\), where \(\mathbb{F}\) is any finite and discrete floating-point number system. If not mentioned otherwise, we assume “round ties to even”, which is the default rounding-direction attribute (IEEE 754-2008 §4.3.3).
In order to distinguish between elementary operations over the reals and over the floating-point numbers we introduce the floating-point operators \(\oplus,\ominus,\otimes,\oslash\) for their counterparts \({+},{-},{\times},{/}\) over the reals, respectively. We then have \(x \oplus y = \fl(x + y)\) for floating-point numbers \(x,y\), and similar definitions for \(\ominus,\otimes,\oslash\). Furthermore, we lift these operators to deal with NaNs and infinities as defined by the IEEE 754 standard.
In order to distinguish between numbers, infinities, and NaNs the following notation is used. A number is typed in normal font like \(x, y, z\). A datum is either an infinity, NaN, or a number and is typed bold face like \(\mathbf{x}, \mathbf{y}, \mathbf{z}\).

Floating-point numbers are uniformly distributed.
Each normal floating-point number is of the form \(\pm m \times 2^e\) with \(1 \leq m \lt 2\). The significand \(m\) has \(p\) bits—including one implied bit. Thus, each interval \(\left[\left.2^e, 2^{e+1}\right)\right.\) contains \(2^{p-1}\) equally spaced floating-point numbers, and the spacing within \(\left[\left.2^e, 2^{e+1}\right)\right.\) is \(2^{e-(p-1)}\). Therefore, the spacing doubles from one interval to the next. The distribution is more like a logarithmic scale than a linear one. In other words, the numbers are dense towards \(\pm 0\) and sparse towards \(\pm \infty\).
Let us have a look at a minifloat example. Consider IEEE 754-like 8-bit floating-point numbers given by a 3-bit exponent, a 4-bit significand, and bias \(3\). The following figure depicts the number line.
Fun fact: spacing is rather huge for big numbers, which might be non-intuitive. For instance, the biggest positive exponent of a double-precision floating-point number is \(+1023\). Thus, for numbers in the interval \(\left[\left.2^{1023}, 2^{1024}\right)\right.\) the spacing equals \(2^{971}\), which is roughly \(10^{292}\). In contrast, for numbers in the interval \(\left[\left.2^{-1022}, 2^{-1021}\right)\right.\) the spacing equals \(2^{-1074}\), which is roughly \(10^{-324}\).

Roughly half of all positive floating-point numbers are in the interval \((0,1)\).
Yeah, I know: giving a precise answer to an imprecise question is … rough. This question is more about sharpening the intuition about the distribution of floating-point numbers.
The exponent of IEEE 754 single-precision floating-point numbers ranges from \(-126\) to \(+127\). We ignore the exponents \(-127\) (all-zero) and \(+128\) (all-one) since they are reserved for special values (zero, subnormals, infinities, NaN). A normal number lies in \((0,1)\) precisely when its exponent is negative. In total we have \(254\) exponents, of which \(126\) are negative, which is roughly half of all.
The exponent of IEEE 754 double-precision floating-point numbers ranges from \(-1022\) to \(+1023\). We ignore the exponents \(-1023\) (all-zero) and \(+1024\) (all-one) since they are reserved for special values (zero, subnormals, infinities, NaN). In total we have \(2046\) exponents, of which \(1022\) are negative, which is almost exactly half of all.

Only fractions with power-of-two denominators are representable.
Floating-point numbers are of the form \(\pm m \times 2^e\), where \(m = (1.f)_2\) for some binary fractional part \(f\), and the exponent \(e\) is an integer. Let \(p\) be the precision of \(m\); then \(m = m' \times 2^{1-p}\) for some integer \(m'\), and hence \(\pm m \times 2^e = \pm m' \times 2^{e+1-p}\) is an integer divided by a power of two, q.e.d.
For example, the finite decimal number \(0.1\) is not representable by a finite binary number: \[(0.1)_{10} = (0.00011001100110011\dots)_{2}\] Since the number is periodic in the binary system, it must be rounded towards its nearest floating-point number, i.e., \(\fl(0.1) \neq 0.1\).

There exist integers \(x,y \in \mathbb{Z}\) with \(x \neq y\) such that \(\fl(x) = \fl(y) \neq \pm\infty\).
If an integer has more digits than are representable by the significand, then the least significant digits are rounded. For single-precision floating-point numbers we have \(24\) bits (including one implied bit) for the significand. Thus every integer from \(-2^{24}\) up to \(+2^{24}\) is representable by a single-precision floating-point number. Likewise, for double-precision numbers we have \(53\) bits (including one implied bit) for the significand, which means that every integer from \(-2^{53}\) up to \(+2^{53}\) is representable. Larger integers are rounded to their nearest floating-point number. In other words, digits are rounded not only after the decimal point, but also before it.
For example, the C expression `16777217.0f == 16777216.0f` evaluates to `true`: the integer \(2^{24} + 1 = 16777217\) is not representable in single precision and rounds (ties to even) to \(2^{24}\).
There exists a floating-point number \(x\) such that \(x \oplus 1 = x\).
This question is somewhat related to the previous one; nonetheless, the answer differs.
If two numbers that differ by several orders of magnitude are added, then one number may be absorbed by the other, which we show in the subsequent. Let us consider the general case \(x \oplus y = x\) for some nonzero floating-point numbers \(x,y\). Assume \(0 \lt y \lt x\), and let \(x = m_x \times 2^{e_x}\) and \(y = m_y \times 2^{e_y}\); then \[ x + y = (m_x + m_y \times 2^{e_y - e_x}) \times 2^{e_x} \] If \(e_y - e_x \lt -(p+1)\), where \(p\) is the precision, then \[ m_y \times 2^{e_y - e_x} \lt 2^{-p} \] since \(m_y \lt 2\). Thus we have \[ x + y \lt (m_x + 2^{-p}) \times 2^{e_x} = (x + \succ(x)) / 2 \] where \(\succ(x)\) is the successor floating-point number of \(x\). Hence \(x + y\) lies strictly below the midpoint between \(x\) and \(\succ(x)\) and rounds to \(x\), q.e.d.
For example, consider decimal floating-point arithmetic with four digits of precision. The numbers \(100\) and \(0.001\) are precisely representable as \(1.000 \times 10^2\) and \(1.000 \times 10^{-3}\), respectively. While adding both numbers, the exponents get aligned (the digits after the vertical bar get rounded and are shown only for demonstration purposes): \[\begin{align} 1.000 \times 10^2 + 1.000 \times 10^{-3} &= 1.000 \times 10^2 + 0.000|01 \times 10^2 \\ &= 1.000 \times 10^2 + 0.000 \times 10^2 \\ &= 1.000 \times 10^2 \end{align}\]
In other words \(\fl(100.001) = 100\).
An example in C for single-precision floating-point arithmetic is `1e8f + 1.0f == 1e8f`, and for double-precision floating-point arithmetic `1e16 + 1.0 == 1e16`. Note that the spacing between numbers gets rather huge at the end of the scale, and you might lose your feeling for it. For example, the expression `1e300 + 1e280 == 1e300` yields `true` in C. Thus, in real arithmetic the equation \(x + 1 = x\) has no solution, whereas in floating-point arithmetic it has multiple solutions.

Addition of floating-point numbers is commutative, i.e., \(x \oplus y = y \oplus x\) for all floating-point numbers \(x,y\).
This follows from the definition of the plus operator over the floating-point numbers and the commutativity of the plus operator over the reals: \[x \oplus y = \fl(x + y) = \fl(y + x) = y \oplus x\]
The same holds true for floating-point multiplication.
Fun fact 1: addition is not necessarily commutative w.r.t. NaNs. Propagation of NaNs still holds; however, the payload of a resulting NaN is only suggested to equal that of one of the inputs (IEEE 754-2008 §6.2.3 NaN propagation). For example, an implementation could choose \(\NaN_1 \oplus \NaN_2 = \NaN_1\) but \(\NaN_2 \oplus \NaN_1 = \NaN_2\). For further discussions about this have a look at the IEEE 754 mailing list.
Fun fact 2: there exist floating-point implementations other than IEEE 754 where such operations are not necessarily commutative. For example, according to Towards sensible floating-point arithmetic, multiplication is not commutative on the Cray-1.

Addition of floating-point numbers is associative, i.e., \((x \oplus y) \oplus z = x \oplus (y \oplus z)\) for all floating-point numbers \(x,y,z\).
Adding two floating-point numbers does not necessarily result in a floating-point number, i.e., \(x \oplus y \neq x + y\) holds for some floating-point numbers \(x,y\). In such cases, further operations are affected by the previous rounding error. For example, consider decimal floating-point arithmetic with four digits of precision and round-half-up as the tie-breaking rule: \[\begin{align*} & (1.000 \times 10^3 \oplus 5.000 \times 10^{-1}) \oplus 5.000 \times 10^{-1} \\ &= (1.000 \times 10^3 \oplus 0.0005 \times 10^3) \oplus 5.000 \times 10^{-1} \\ &= (1.000 \times 10^3 \oplus 0.001 \times 10^3) \oplus 5.000 \times 10^{-1} \\ &= 1.001 \times 10^3 \oplus 5.000 \times 10^{-1} \\ &= 1.002 \times 10^3 \\ &\neq 1.001 \times 10^3 \\ &= 1.000 \times 10^3 \oplus 0.001 \times 10^3 \\ &= 1.000 \times 10^3 \oplus 1.000 \times 10^0 \\ &= 1.000 \times 10^3 \oplus (5.000 \times 10^{-1} \oplus 5.000 \times 10^{-1}) \\ \end{align*}\]
Another example in C is the expression `0.1 + (0.2 + 0.3) != (0.1 + 0.2) + 0.3`, which yields `true`. Even more non-intuitive: sometimes the rounding error cancels and sometimes it does not. The expression `0.1 + 0.3 == 0.4` evaluates to `true`, whereas `0.1 + 0.2 == 0.3` evaluates to `false`. Note that the numbers `0.1`, `0.2`, `0.3`, and `0.4` are all not representable by (finite) binary floating-point numbers. Thus, in general, addition is not associative. The same holds true for multiplication.

Addition and multiplication of floating-point numbers are distributive, i.e., \(x \otimes (y \oplus z) = (x \otimes y) \oplus (x \otimes z)\) and \((x \oplus y) \otimes z = (x \otimes z) \oplus (y \otimes z)\) for all floating-point numbers \(x,y,z\).
Addition and multiplication of floating-point numbers are not distributive. A similar argument as for why addition is not associative applies to this case, too.
An example in C for double-precision floating-point numbers is `10.0 * (0.1 + 0.2) != (10.0 * 0.1) + (10.0 * 0.2)`, which yields `true`.

The expression \(0 \oslash 0\) evaluates to a NaN.
Zero divided by zero is undefined as a real number and is an indeterminate form. For example, \(\lim_{x \to 0} \sin(x)/x = 1\) whereas \(\lim_{x \to 0} (1-\cos(x))/x = 0\). Intuitively, when thinking of \(0/0\) as the result of the limit of a ratio of two very small numbers, \(0/0\) could represent “anything”. In IEEE 754 floating-point arithmetic, such values are represented by NaN (Not-a-Number).

The expression \(1 \oslash 0\) evaluates to an infinity.
In ordinary arithmetic division by zero is undefined. However, in contrast to the previous question, \(1 / 0\) does not correspond to an indeterminate form. Any limit that gives rise to this form diverges to an infinity, i.e., \(\lim_{x \to 0^+} 1/x = +\infty\) and \(\lim_{x \to 0^-} 1/x = -\infty\). Thus in IEEE 754 floating-point arithmetic we have for any number \(x\): if \(x \gt 0\) then \(x \oslash (+0) = +\infty\), and if \(x \lt 0\) then \(x \oslash (+0) = -\infty\) (division by \(-0\) flips the signs).

Equality relation is reflexive, i.e., \(\mathbf{x} \doteq \mathbf{x}\) for any datum \(\mathbf{x}\).
Floating-point equality is not reflexive: a NaN compares unequal to every value, including a NaN itself. Thus in C the expression `NAN == x` evaluates to `false` for all doubles `x`. This holds true irrespective of whether the NaN is signaling or quiet. If you want to test for a NaN in C, you should use the function-like macro `isnan`, which takes as parameter a `float` or a `double` and returns a nonzero value if and only if its argument is a NaN. For more details have a look at this StackOverflow answer.

Identity law holds for addition, i.e., \(\mathbf{x} \oplus 0\) equals \(\mathbf{x}\) for any datum \(\mathbf{x}\).
For a negative zero we have \(-0 \oplus (+0) = +0 \neq -0\). However, the good news is that the floating-point equality relation still holds: \(-0 \doteq +0\). \[\begin{array}{c|cc} {\oplus} & +0 & -0 \\\hline +0 & +0 & +0 \\ -0 & +0 & -0 \end{array}\]
This is also an example (besides others) of why a compiler may not optimize `x + (y - y)` into `x`.
Zero property of multiplication holds, i.e., \(\mathbf{x} \otimes 0\) equals \(0\) for any datum \(\mathbf{x}\).
If \(\mathbf{x}\) is a negative finite number, then \(\mathbf{x} \otimes (+0) = -0 \neq +0\). Furthermore, if \(\mathbf{x}\) is a NaN or an infinity, then \(\mathbf{x} \otimes 0 = \NaN\).