
7.5 Floating-Point Arithmetic and Conversion


            In all of the preceding examples, the calculations were exact in the sense that the
        operation between two normalized floating-point numbers yielded a normalized floating-
        point number. This will not always be the case, as we can get overflow, underflow, or a
        result that requires some type of rounding to get a normalized approximation to the
        result. For example, multiplying

            $$
            \begin{array}{r}
            2^{56} \times 1.00\ldots0 \\
            \times \;\; 2^{100} \times 1.00\ldots0 \\
            \hline
            2^{156} \times 1.00\ldots0
            \end{array}
            $$

        yields a number that is too large to be represented in the 32-bit floating-point format.
        This is an example of overflow, a condition analogous to that encountered with integer
        arithmetic. Unlike integer arithmetic, however, underflow can also occur; that is, we can
        get a result that is too small to be represented as a normalized floating-point number. For
        example,
            $$
            \begin{array}{r}
            2^{-126} \times 1.0010\ldots0 \\
            - \;\; 2^{-126} \times 1.0000\ldots0 \\
            \hline
            2^{-126} \times 0.0010\ldots0
            \end{array}
            $$

        yields a result that is too small to be represented as a normalized floating-point number
        with the 32-bit format.
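            To make these two failure modes concrete, here is a small C sketch (C rather than
        68HC12 code, intended to run on a host machine with IEEE single-precision arithmetic)
        that reproduces the multiplication and subtraction above using the standard library
        routines ldexpf, isinf, and isnormal. The product overflows to infinity, and the
        difference is smaller than the smallest normalized number, surviving (on most systems)
        only as a denormalized value.

            #include <stdio.h>
            #include <math.h>

            int main(void)
            {
                /* Overflow: 2^56 * 2^100 = 2^156 exceeds the maximum single-precision
                 * exponent (127), so the product cannot be represented and becomes infinity. */
                float a = ldexpf(1.0f, 56);      /* 2^56  * 1.00...0 */
                float b = ldexpf(1.0f, 100);     /* 2^100 * 1.00...0 */
                float prod = a * b;
                printf("overflow:  %g (isinf = %d)\n", prod, isinf(prod));

                /* Underflow: 2^-126 * 0.0010...0 is smaller than the smallest normalized
                 * number, 2^-126 * 1.00...0; here it survives only as a denormalized value. */
                float c = ldexpf(1.125f, -126);  /* 2^-126 * 1.0010...0 */
                float d = ldexpf(1.0f,   -126);  /* 2^-126 * 1.0000...0 */
                float diff = c - d;
                printf("underflow: %g (isnormal = %d)\n", diff, isnormal(diff));
                return 0;
            }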
            The third situation is encountered when we obtain a result that is within the
        normalized floating-point range but is not exactly equal to one of the numbers (14).
        Before this result can be used further, it will have to be approximated by a normalized
        floating-point number. Consider the addition of the following two numbers.

            $$
            \begin{array}{r}
            2^{2} \times 1.00\ldots00 \\
            + \;\; 2^{0} \times 1.00\ldots01 \\
            \hline
            2^{2} \times 1.01\ldots00\,(01)
            \end{array}
            $$
            (the bits in parentheses are the least significant bits of the exact significand)

        The exact result is expressed with 25 bits in the fractional part of the significand so that
        we have to decide which of the possible normalized floating-point numbers will be
        chosen to approximate the result. Rounding toward plus infinity always takes the
        approximate result to be the nearest normalized number greater than the exact result,
        while rounding toward minus infinity takes the nearest normalized number less than the
        exact result. Truncation just throws away all the bits in the exact result
        beyond those used in the normalized significand. Truncation rounds toward plus infinity
        for negative results and rounds toward minus infinity for positive results. For this
        reason, truncation is also called rounding toward zero. For most applications, however,
        picking the closest normalized floating-point number to the actual result is preferred.
        This is called rounding to nearest. In the case of a tie, the normalized floating-point
        number whose significand has a least significant bit of 0 is taken to be the approximate result.
        Rounding to nearest is the default type of rounding for the IEEE floating-point standard.
        With rounding to nearest, the magnitude of the error in the approximate result is less
        than or equal to the magnitude of the exact result times $2^{-24}$.
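            The following C sketch (again host-side C, not 68HC12 code) applies each of the
        four rounding modes to the addition discussed above by way of the standard <fenv.h>
        function fesetround, and then shows the tie-breaking rule of rounding to nearest.
        Whether changing the rounding mode actually affects runtime arithmetic depends on the
        compiler and target (some compilers need options such as -frounding-math), so this is
        an illustration of the definitions rather than a portable guarantee.

            #include <stdio.h>
            #include <fenv.h>

            #pragma STDC FENV_ACCESS ON

            /* Print the single-precision sum 2^2 * 1.00...00  +  2^0 * 1.00...01
             * under the given rounding mode.  The exact sum needs 25 fraction bits,
             * so the hardware must round it back to 24. */
            static void show(const char *name, int mode)
            {
                fesetround(mode);
                volatile float x = 4.0f;            /* 2^2 * 1.00...00             */
                volatile float y = 1.0f + 0x1p-23f; /* 2^0 * 1.00...01 (hex-float) */
                volatile float sum = x + y;
                printf("%-20s %a\n", name, (double)sum);
                fesetround(FE_TONEAREST);           /* restore the default mode    */
            }

            int main(void)
            {
                show("to nearest:",        FE_TONEAREST);  /* 0x1.4p+2 (5.0)  */
                show("toward +infinity:",  FE_UPWARD);     /* 0x1.400002p+2   */
                show("toward -infinity:",  FE_DOWNWARD);   /* 0x1.4p+2 (5.0)  */
                show("toward zero:",       FE_TOWARDZERO); /* 0x1.4p+2 (5.0)  */

                /* Tie case for rounding to nearest: 1 + 2^-24 lies exactly halfway
                 * between 1.0 and the next float, 1 + 2^-23; the candidate whose
                 * least significant significand bit is 0 (namely 1.0) is chosen. */
                volatile float tie = 1.0f + 0x1p-24f;
                printf("tie rounds to:       %a\n", (double)tie);
                return 0;
            }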