
1 Learning Outcomes

In a previous version of the course, we covered floating point in much more detail over multiple lectures. In recent semesters, we have reduced the floating point topics to focus on the core of the standard, and we have not covered more advanced topics like arithmetic, casting, and other floating-point representations. For now, we leave this out-of-scope content below as a general reference.

2 Floating Point Addition

Let’s consider arithmetic with floating point numbers.

Floating point addition is more complex than integer addition. We can’t just add significands without considering the exponent value. In general:

1. Align the exponents by shifting the significand of the number with the smaller exponent.
2. Add the significands.
3. Normalize the result, adjusting the exponent as needed.
4. Round the result to fit in the significand field.

Because of how floating point numbers are stored, simple operations like addition are not always associative.

Define x, y, and z as $-1.5 \times 10^{38}$, $1.5 \times 10^{38}$, and $1.0$, respectively.

\begin{align}
\texttt{x + (y + z)} &= -1.5 \times 10^{38} + (1.5 \times 10^{38} + 1.0) \\
&= -1.5 \times 10^{38} + (1.5 \times 10^{38}) \\
&= 0.0
\end{align}

\begin{align}
\texttt{(x + y) + z} &= (-1.5 \times 10^{38} + 1.5 \times 10^{38}) + 1.0 \\
&= 0.0 + 1.0 \\
&= 1.0
\end{align}

Remember, floating point effectively approximates real results. With bigger exponents, the step size between adjacent representable floats gets bigger too. In this example, $1.5 \times 10^{38}$ is so much larger than $1.0$ that $1.5 \times 10^{38} + 1.0$ in floating point representation rounds to $1.5 \times 10^{38}$.
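A minimal C sketch of this example, assuming float arithmetic is performed in IEEE 754 single precision (as on typical modern hardware):

/* Non-associativity of floating point addition */
#include <stdio.h>

int main(void) {
   float x = -1.5e38f;
   float y =  1.5e38f;
   float z =  1.0f;

   /* y + z rounds back to 1.5e38f, so the final sum is 0.0 */
   printf("x + (y + z) = %f\n", x + (y + z));
   /* x + y is exactly 0.0f, so the final sum is 1.0 */
   printf("(x + y) + z = %f\n", (x + y) + z);
   return 0;
}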

3 Floating Point Rounding Modes

When we perform math on real numbers, we have to worry about rounding to fit the result in the significand field. The floating point hardware carries two extra bits of precision, and then rounds to get the proper value.

There are four primary rounding modes:

- Round towards +∞
- Round towards −∞
- Round towards zero (truncate)
- Round to nearest, ties to even (unbiased)

The unbiased mode is the default, though the others can be specified. Unbiased works almost like normal rounding: we round to the nearest representable number, e.g., 2.4 rounds to 2 and 2.6 rounds to 3. If the value sits exactly on the borderline, we round to the nearest even number, e.g., 2.5 rounds to 2 and 3.5 rounds to 4. In other words, when there is a “tie”, half the time we round up and the other half of the time we round down. This “unbiased” behavior ensures fairness by balancing out rounding errors across calculations.
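As a minimal sketch of these modes, assuming a C99 environment: <fenv.h> exposes the current rounding mode, and rint (from <math.h>) rounds its argument according to whichever mode is active. (Strictly speaking, #pragma STDC FENV_ACCESS ON should be in effect when changing the mode.)

/* Rounding modes via C99's <fenv.h> */
#include <fenv.h>
#include <math.h>
#include <stdio.h>

int main(void) {
   /* Default: round to nearest, ties to even ("unbiased") */
   printf("%.1f %.1f\n", rint(2.5), rint(3.5));   /* 2.0 4.0 */

   /* Round towards +infinity instead */
   fesetround(FE_UPWARD);
   printf("%.1f\n", rint(2.5));                   /* 3.0 */

   fesetround(FE_TONEAREST);                      /* restore the default */
   return 0;
}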

4 Casting and Converting

Rounding also occurs when converting between numeric types. In C:

- float or double → int: the fractional part is truncated (rounded towards zero).
- int → float: the value may be rounded, because a 32-bit int can carry more significant bits than a float’s 24-bit significand.
- int → double: exact, because a double’s 53 bits of significand precision can hold any 32-bit int.

Double-casting (converting to another type and back) therefore does not always return the original value: Code A and Code B below may not always print "true":

/* Code A */
int i = …;
/* int -> float may round: a float's 24-bit significand cannot represent
   every 32-bit int exactly, so large values of i may not survive the round trip */
if (i == (int)((float) i)) {
   printf("true\n");
}

/* Code B */
float f = …;
/* float -> int truncates the fractional part (and is undefined for values
   outside int's range), so f survives only if it already holds a small integer */
if (f == (float)((int) f)) {
   printf("true\n");
}
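A minimal sketch of both failure cases, assuming a 32-bit int and an IEEE 754 single-precision float (the specific values are just illustrative):

/* Round-trip casts that lose information */
#include <stdio.h>

int main(void) {
   /* Code A's problem: 2^24 + 1 has no exact float representation */
   int i = 16777217;
   printf("%d -> %d\n", i, (int)((float) i));        /* 16777217 -> 16777216 */

   /* Code B's problem: casting to int truncates the fraction */
   float f = 2.5f;
   printf("%.1f -> %.1f\n", f, (float)((int) f));    /* 2.5 -> 2.0 */
   return 0;
}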

5 Other Floating Point Representations

5.1 Precision vs. Accuracy

Recall from before:

High precision permits high accuracy but doesn’t guarantee it. It is possible to have high precision but low accuracy.

For example, consider float pi = 3.14;. pi will be represented using all 23 bits of the significand (“highly precise”), but it is only an approximation of π (“not accurate”).
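A minimal sketch of the same point; the exact digits printed may vary slightly by platform, but the stored value is not exactly 3.14, let alone π:

/* High precision, low accuracy */
#include <stdio.h>

int main(void) {
   float pi = 3.14;      /* stored as the nearest representable float */
   printf("%.7f\n", pi); /* prints roughly 3.1400001, not 3.14 or pi */
   return 0;
}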

Below, we discuss other floating point representations that can yield more accurate numbers in certain cases. However, because all of these representations are fixed precision (i.e., fixed bit-width), we cannot represent everything perfectly.

5.2 Even More Floating Point Representations

Still more representations exist. Here are a few from the IEEE 754 standard:

- Half precision (binary16): 1 sign bit, 5 exponent bits, 10 significand bits.
- Double precision (binary64): 1 sign bit, 11 exponent bits, 52 significand bits.
- Quadruple precision (binary128): 1 sign bit, 15 exponent bits, 112 significand bits.
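As a minimal sketch of what the wider significand buys, here is the same value stored at single and at double precision:

/* Single vs. double precision for the same real number */
#include <stdio.h>

int main(void) {
   float  f = 1.0f / 3.0f;
   double d = 1.0  / 3.0;
   printf("%.17f\n", f);   /* about  7 significant decimal digits: 0.33333334326744080 */
   printf("%.17f\n", d);   /* about 16 significant decimal digits: 0.33333333333333331 */
   return 0;
}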

Domain-specific architectures demand different number formats (Table 1). For example, bfloat16[1], used on Google’s Tensor Processing Unit (TPU), is defined over 16 bits (1 sign bit, 8 exponent bits, 7 significand bits); because its exponent field is as wide as single precision’s, it covers the same range as the IEEE 754 single-precision format at the expense of significand precision. This tradeoff suits neural network training, where gradients can become vanishingly small and dynamic range matters more than significand precision (see the conversion sketch after Table 1).

Table 1: Different domain accelerators support various integer and floating-point formats.

| Accelerator | int4 | int8 | int16 | fp16 | bf16[1] | fp32 | tf32[2] |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Google TPU v1 |  | x |  |  |  |  |  |
| Google TPU v2 |  |  |  |  | x |  |  |
| Google TPU v3 |  |  |  |  | x |  |  |
| Nvidia Volta TensorCore |  | x |  | x |  | x |  |
| Nvidia Ampere TensorCore | x | x | x | x | x | x | x |
| Nvidia DLA |  | x | x | x |  |  |  |
| Intel AMX |  | x |  |  | x |  |  |
| Amazon AWS Inferentia |  | x |  | x | x |  |  |
| Qualcomm Hexagon |  | x |  |  |  |  |  |
| Huawei Da Vinci |  | x |  | x |  |  |  |
| MediaTek APU 3.0 |  | x | x | x |  |  |  |
| Samsung NPU |  | x |  |  |  |  |  |
| Tesla NPU |  | x |  |  |  |  |  |
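Because bfloat16 keeps the same sign and 8-bit exponent layout as single precision, a float can be converted to bfloat16 by keeping only its top 16 bits. The minimal sketch below uses simple truncation of the low significand bits; real hardware typically rounds to nearest instead:

/* Truncating float <-> bfloat16 conversion sketch */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Keep the high 16 bits: sign, 8 exponent bits, top 7 significand bits. */
uint16_t float_to_bf16(float f) {
   uint32_t bits;
   memcpy(&bits, &f, sizeof bits);   /* reinterpret the float's bit pattern */
   return (uint16_t)(bits >> 16);
}

/* Put the 16 bits back on top and zero-fill the dropped significand bits. */
float bf16_to_float(uint16_t h) {
   uint32_t bits = (uint32_t)h << 16;
   float f;
   memcpy(&f, &bits, sizeof f);
   return f;
}

int main(void) {
   float x = 3.14159265f;
   float y = bf16_to_float(float_to_bf16(x));
   printf("%f -> %f\n", x, y);   /* same range, ~2-3 decimal digits: 3.141593 -> 3.140625 */
   return 0;
}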

For those interested, we recommend reading about the proposed Unum format, which suggests using variable field widths for the exponent and significand. This format adds a “u-bit” to indicate whether the number is exact or lies between two exact unums.

Footnotes

[1] bf16: bfloat16 (“brain floating point”), a 16-bit format with 1 sign bit, 8 exponent bits, and 7 significand bits.
[2] tf32: Nvidia’s TensorFloat-32, a 19-bit format with 1 sign bit, 8 exponent bits, and 10 significand bits.