Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

1Learning Outcomes

2Intuition: Scientific Notation

Instead of identifying the fixed binary point location across every number in our representation, the floating point paradigm determines a binary point location for each number.

Consider the decimal number 0.1640625 which has binary representation[1]

000000.001010100000.\dots\texttt{000000.001010100000}\dots.

There are many zeros in this binary representation. For example, there is no integer component, so everything to the left of the binary point is zero. In fact, there are really only one “interesting” range with some “energy”, e.g., varying ones and zeros: 10101. This bit pattern is located two values right of the binary point.

With these two pieces of information—what the significant bits are, and what exponent the significant bits are associated with—we can suddently represent both very large and very small numbers. This is the intuition behind scientific notation.

2.1Scientific Notation (Base-10)

You may have seen scientific notation in a physics or chemistry course, but we review it here and introduce core terminology that carry over into the binary case. The scientific notation for the number 0.1640625 is 1.640625×1011.640625 \times 10^{-1} (Figure 1):

"TODO"

Figure 1:Scientific notation assumes a normalized form, where the mantissa has exactly one non-zero digit to the left of the decimal point.

Scientific notation assumes a normalized form of numbers, where the mantissa has exactly one digit to the left of the decimal point.

Every number represented in scientific notation has exactly one normalized form for a given number of significant figures. For example, the number 1/10000001/1000000 has normalized form (for two significant figures) 1.0×1091.0 \times 10^{-9} and non-normalized forms 0.1×108,10×10100.1 \times 10^{-8}, 10 \times 10^{-10}, and so on.

2.2Binary Normalized Form

The number 0.1640625 in a binary normalized form is 1.0101two×231.0101_{\text{two}} \times 2^{-3} (Figure 2):

"TODO"

Figure 2:In binary, we also assume a normalized form, where the mantissa has exactly one non-zero digit to the left of the binary point.

We close this discussion with two important points:

3IEEE 754 Single-Precision Floating Point

This discussion leads us to the definition of the IEEE 754 Single-Precision Floating Point, which is used for the C float variable type. This format leverages binary normalized form to represent a wide range of numbers for scientific use using 32 bits.[2]

This standard was pioneered by UC Berkeley Professor William (“Velvel”) Kahan. Prior to Kahan’s system, the ecosystem for representing floating points was chaotic, and calculations on one machine would give different answers on another. By leading the effort to centralize floating point arithmetic into the IEEE 754 standard, Professor Kahan earned the Turing Award in 1988.

The three fields in the IEEE 754 standard are designed to maximize the accuracy of representing many numbers using the same limited precision of 32 bits (for single-precision, and 64 bits for double-precision). This leads us to two important definitions to help us quantify the efficacy of this number representation:

Without further ado, Figure 3 defines the three fields in the IEEE 754 single-precision floating point for 32 bits.

"TODO"

Figure 3:Bit fields in IEEE 754 single-precision floating point. The least significant bit (rightmost) is indexed 0; the most significant bit (leftmost) is indexed 31.

4Normalized Numbers

The three fields in Figure 3 can be used to represent normalized numbers of the form in Equation (1).

(1)s×(1+significand)×2(exponent127)(-1)^\text{s} \times (1 + \text{significand}) \times 2^{(\text{exponent}-127)}

This design will seem esoteric at first glance. Why is there a 1+1+? Where did the mantissa go—what’s a significand? Why is the exponent offset by -127? As stated before, these three fields are defined to maximize the accuracy of representable numbers in 32 bits. In other words, none of the bit patterns in these three fields use two’s complement! Instead, the standard uses Table 1 defines precisely how to interpret each field’s bit pattern for representing normalized numbers.

Table 1:Sign, exponent, and significand fields for normalized numbers.

Field NameRepresentsNormalized Numbers[3]
sSign1 is negative; 0 is positive
exponentBias-Encoded ExponentSubtract 127 from exponent field to get the exponent value.
significandFractional Component of the MantissaInterpret the significand as a 23-bit fraction (0.xx...xx) and add 1 to get the mantissa value.

5Zero, Infinity, and More

Normalized numbers are not the only values that can be represented by the IEEE 754 standard. We discuss zero, infinity, and other numbers in a later section.

5.1IEEE 754 Double-Precision Floating Point

The IEEE 754 double-precision floating point standard is used for the C double variable type. It has three fields, now over 64 bits:

The primary advantage is greater accuracy due to the larger significand. The normalized form can represent numbers from about 2.0×103082.0 \times 10^{-308} to 2.0×103082.0 \times 10^{308}.

6Use a Floating Point Converter

Check out this web app for a simple converter between decimal numbers and their IEEE 754 single-precision floating point format.

While we would love for you to use the converter to understand the comic in Figure 4, it should be noted that the robot is assuming IEEE 754 double-precision format (see another section).

"TODO"

Figure 4:Welcome to the Secret Robot Internet (SMBC Comics).

Because floating point is based on powers of two, it cannot represent most decimal fractions exactly. Here, 0.3 is inaccurately represented as 0.29999999999, and 0.1+0.20.1 + 0.2 with doubles is 0.30000000000000004. Read more about doubles, addition, and accuracy in another section.

Footnotes
  1. We leave this conversion as an exercise to the reader.

  2. The IEEE 754 Double-Precision Floating Point is used for the C double variable type, which is 64 bits. Read more in a bonus section.

  3. Only valid when exponent field is in the range 0000001 (1) to 1111110 (254), i.e., when it is neither 0 nor 255.