Page 74 - Applied Numerical Methods Using MATLAB

                This can be confirmed by typing either of the following statements into
                the MATLAB command window.
                 >>fprintf('3 = %bx\n',3)  or   >>format hex, 3, format short

                which will print out onto the screen

                 0000000000000840         4008000000000000
                Noting that in an Intel system the more significant bytes (8 bits =
                2 hexadecimal digits) of a number are stored at higher memory addresses,
                we can reverse the order of the bytes in this number to put the most
                significant byte on the left and the least significant byte on the right,
                as we usually write numbers.

                 00 00 00 00 00 00 08 40 → 40 08 00 00 00 00 00 00

                This is exactly the hexadecimal representation of the number 3, as we
                expected. You can find the IEEE 64-bit floating-point representation
                of the number 14 and use the command fprintf() or format hex to
                check whether the result is right.
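
The same check can also be scripted. The following is a minimal sketch, not the book's own procedure: it assumes the built-in functions num2hex, typecast, and computer, and all variable names are chosen only for illustration. It prints the hexadecimal string with the most significant byte first, lists the bytes in memory order (which depends on the machine), and decodes the sign, exponent, and fraction fields to recover the value.

```matlab
x = 3;                                   % number to inspect (illustrative choice)

% Hexadecimal string with the most significant byte on the left
hexStr = num2hex(x);                     % 16 hex digits = 64 bits

% Bytes in the order they are stored in memory (machine dependent)
bytesInMemory = typecast(x, 'uint8');
[~, ~, endian] = computer;               % 'L' = little-endian, 'B' = big-endian
fprintf('%s (memory order, endian = %s): %s\n', hexStr, endian, ...
        sprintf('%02x ', bytesInMemory));

% Decode the IEEE 64-bit fields: 1 sign bit, 11 exponent bits, 52 fraction bits
head     = dec2bin(hex2dec(hexStr(1:3)), 12);   % first 3 hex digits = 12 bits
signBit  = head(1) - '0';
expField = bin2dec(head(2:12));                 % biased exponent
fraction = hex2dec(hexStr(4:16)) / 2^52;        % 13 hex digits = 52 bits

% Valid for normalized numbers only (not 0, subnormals, Inf, or NaN)
value = (-1)^signBit * (1 + fraction) * 2^(expField - 1023);
fprintf('sign = %d, exponent = %d, fraction = %g, value = %g\n', ...
        signBit, expField - 1023, fraction, value);
```

Changing x to another number repeats the check for that number; the decoded value should agree with x whenever x is a normalized number.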


                  <procedure of adding 2^-1 to 2^3>
                      1.0000 × 2^3                   1.00000 × 2^3
                    + 1.0000 × 2^-1    alignment   + 0.00010 × 2^3
                                                     1.00010 × 2^3
                           truncation of guard bit   1.0001  × 2^3
                                                   = (1 + 2^-4) × 2^3          right result

                  <procedure of subtracting 2^-1 from 2^3>
                      1.0000 × 2^3                   1.00000 × 2^3    2's          1.00000 × 2^3
                    − 1.0000 × 2^-1    alignment   − 0.00010 × 2^3    complement + 1.11110 × 2^3
                                                                                   0.11110 × 2^3
                                          normalization, truncation of guard bit   1.1110  × 2^2
                                                                                 = (1 + 1 − 2^-3) × 2^2   right result

                  <procedure of adding 2^-2 to 2^3>
                      1.0000 × 2^3                   1.00000 × 2^3
                    + 1.0000 × 2^-2    alignment   + 0.00001 × 2^3
                                                     1.00001 × 2^3
                           truncation of guard bit   1.0000  × 2^3
                                                   = (1 + 0) × 2^3             no difference

                  <procedure of subtracting 2^-2 from 2^3>
                      1.0000 × 2^3                   1.00000 × 2^3    2's          1.00000 × 2^3
                    − 1.0000 × 2^-2    alignment   − 0.00001 × 2^3    complement + 1.11111 × 2^3
                                                                                   0.11111 × 2^3
                                          normalization, truncation of guard bit   1.1111  × 2^2
                                                                                 = (1 + 1 − 2^-4) × 2^2   right result

                 (cf) The digit to the left of the binary point is the hidden bit; the fifth
                      digit to its right is the guard bit.
                      Figure P1.18 Procedure of addition/subtraction with four mantissa bits.
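
The four cases of Figure P1.18 can be reproduced numerically. The script below is only a sketch under stated assumptions: it models a toy format with four mantissa bits plus one guard bit by using floor() on ordinary doubles, it subtracts directly instead of forming the 2's complement, and it assumes the first operand is positive and larger in magnitude than the second. The names nbits, cases, and result are illustrative.

```matlab
nbits = 4;                               % mantissa (fraction) bits, as in the figure
cases = [2^3  2^-1;                      % add 2^-1 to 2^3
         2^3 -2^-1;                      % subtract 2^-1 from 2^3
         2^3  2^-2;                      % add 2^-2 to 2^3
         2^3 -2^-2];                     % subtract 2^-2 from 2^3
result = zeros(size(cases, 1), 1);

for k = 1:size(cases, 1)
    a = cases(k, 1);
    b = cases(k, 2);

    % Exponent of a, so that a = 1.xxxx * 2^ea (log2 returns a = f*2^e, 0.5 <= f < 1)
    [~, ea] = log2(a);
    ea = ea - 1;

    % Alignment: express b with the exponent of a, keeping nbits + 1 fraction
    % bits (the extra one is the guard bit); lower bits of b are dropped
    q = 2^(ea - (nbits + 1));
    bAligned = sign(b) * floor(abs(b) / q) * q;

    % Addition/subtraction of the aligned operands (exact in double precision)
    s = a + bAligned;

    % Normalization: exponent of the sum, so that s = 1.xxxx * 2^es
    [~, es] = log2(s);
    es = es - 1;

    % Truncation to nbits fraction bits
    q = 2^(es - nbits);
    result(k) = floor(s / q) * q;

    fprintf('%g %+g -> %g\n', a, b, result(k));
end
```

Only the third case returns the first operand unchanged. In the two subtractions the sum drops to exponent 2, so after normalization the guard bit falls within the four retained mantissa bits and no information is lost.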
            1.18 Resolution of Number Representation and Quantization Error
                In Section 1.2.1, we have seen that adding 2^-22 to 2^30 makes some
                difference, while adding 2^-23 to 2^30 makes no difference due to the bit
                shift by over 52 bits for alignment before addition. How about subtracting
                2^-23 from 2^30? In contrast with the addition of 2^-23 to 2^30, it makes a
                difference, as you can see by typing the following statement into the MATLAB