Page 228 - Introduction to Microcontrollers Architecture, Programming, and Interfacing of The Motorola 68HC12
7.5 Floating-Point Arithmetic and Conversion
In all of the preceding examples, the calculations were exact in the sense that the
operation between two normalized floating-point numbers yielded a normalized floating-
point number. This will not always be the case, as we can get overflow, underflow, or a
result that requires some type of rounding to get a normalized approximation to the
result. For example, multiplying
$$
\begin{array}{rl}
         & 2^{56} \times 1.00\ldots0 \\
\times\; & 2^{100} \times 1.00\ldots0 \\
\hline
         & 2^{156} \times 1.00\ldots0
\end{array}
$$
yields a number that is too large to be represented in the 32-bit floating-point format.
This is an example of overflow, a condition analogous to that encountered with integer
arithmetic. Unlike integer arithmetic, however, underflow can occur, that is, we can get
a result that is too small to be represented as a normalized floating-point number. For
example,
$$
\begin{array}{rl}
     & 2^{-126} \times 1.0010\ldots0 \\
-\;  & 2^{-126} \times 1.0000\ldots0 \\
\hline
     & 2^{-126} \times 0.0010\ldots0
\end{array}
$$
yields a result that is too small to be represented as a normalized floating-point number
with the 32-bit format.
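The two range failures above can be checked with a short sketch. It assumes the normalized 32-bit format has exponents from -126 to 127, which is consistent with the examples but is not stated on this page. The function name `classify_range` is hypothetical. The exact results are computed in Python's double precision, which holds both values without error.

```python
import math

# Assumed exponent range of normalized 32-bit floating-point numbers.
E_MIN, E_MAX = -126, 127

def classify_range(x):
    """Report whether the exact result x lies in the normalized 32-bit range."""
    if x == 0:
        return "zero"
    # frexp returns x = m * 2**e with 0.5 <= |m| < 1, so the exponent
    # of the 1.f form used in the text is e - 1.
    _, e = math.frexp(x)
    exponent = e - 1
    if exponent > E_MAX:
        return "overflow"
    if exponent < E_MIN:
        return "underflow"
    return "normalized"

# 2^56 * 1.0 times 2^100 * 1.0 gives 2^156.
product = math.ldexp(1.0, 56) * math.ldexp(1.0, 100)

# Binary 1.0010...0 is 1.125, so the difference is 2^-126 * 0.001 = 2^-129.
difference = math.ldexp(1.125, -126) - math.ldexp(1.0, -126)

print(classify_range(product))     # overflow
print(classify_range(difference))  # underflow
```

The sketch only tests the range of the result. Whether an in-range result also needs rounding is a separate question.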
The third situation is encountered when we obtain a result that is within the
normalized floating-point range but is not exactly equal to one of the numbers (14).
Before this result can be used further, it will have to be approximated by a normalized
floating-point number. Consider the addition of the following two numbers.
$$
\begin{array}{rl}
     & 2^{2} \times 1.00\ldots00 \\
+\;  & 2^{0} \times 1.00\ldots01 \\
\hline
     & 2^{2} \times 1.01\ldots00(01)
\end{array}
$$
(The bits in parentheses are the least significant bits of the significand.)
The exact result is expressed with 25 bits in the fractional part of the significand so that
we have to decide which of the possible normalized floating-point numbers will be
chosen to approximate the result. Rounding toward plus infinity always takes the
approximate result to be the next larger normalized number to the exact result, while
rounding toward minus infinity always takes the next smaller normalized number to
approximate the exact result. Truncation just throws away all the bits in the exact result
beyond those used in the normalized significand. Truncation rounds toward plus infinity
for negative results and rounds toward minus infinity for positive results. For this
reason, truncation is also called rounding toward zero. For most applications, however,
picking the closest normalized floating-point number to the actual result is preferred.
This is called rounding to nearest. In the case of a tie, the normalized floating-point
number whose significand has a least significant bit of 0 is taken to be the approximate result.
Rounding to nearest is the default type of rounding for the IEEE floating-point standard.
With rounding to nearest, the magnitude of the error in the approximate result is less
than or equal to the magnitude of the exact result times $2^{-24}$.
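The four rounding rules can be compared on the addition example with a small sketch in exact rational arithmetic. It is an illustration only, not the algorithm any floating-point package uses: the function `round_exact` and its mode names are hypothetical, it assumes a significand of the form 1.f with 23 fraction bits, and it ignores overflow and underflow. The bound follows because adjacent normalized numbers with exponent e are $2^{e-23}$ apart, so rounding to nearest errs by at most $2^{e-24}$, while the exact result has magnitude at least $2^{e}$.

```python
from fractions import Fraction
import math

FRACTION_BITS = 23  # assumed fraction bits in the 32-bit significand

def round_exact(x, mode="nearest"):
    """Round the exact Fraction x to a 1.f significand with 23 fraction bits."""
    if x == 0:
        return Fraction(0)
    a = abs(x)
    # Find e such that 2**e <= |x| < 2**(e + 1).
    e = a.numerator.bit_length() - a.denominator.bit_length()
    if Fraction(2) ** e > a:
        e -= 1
    ulp = Fraction(2) ** (e - FRACTION_BITS)  # spacing of neighbors
    q = x / ulp                               # x in units of the last place
    lo, hi = math.floor(q), math.ceil(q)
    if mode == "up":          # toward plus infinity
        n = hi
    elif mode == "down":      # toward minus infinity
        n = lo
    elif mode == "zero":      # truncation
        n = math.trunc(q)
    else:                     # to nearest, ties to an even last bit
        if q - lo < hi - q:
            n = lo
        elif q - lo > hi - q:
            n = hi
        else:
            n = lo if lo % 2 == 0 else hi
    return n * ulp

# 2^2 * 1.00...00 plus 2^0 * 1.00...01, computed exactly.
exact = Fraction(4) + (1 + Fraction(1, 2**23))

nearest = round_exact(exact, "nearest")
for mode in ("up", "down", "zero", "nearest"):
    print(mode, float(round_exact(exact, mode)))

# Error bound for rounding to nearest.
print(abs(nearest - exact) <= abs(exact) * Fraction(1, 2**24))
```

For this sum, rounding down, truncation, and rounding to nearest all discard the two extra bits, and only rounding toward plus infinity moves to the next larger number. Negating `exact` shows truncation agreeing with rounding toward plus infinity instead.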