Page 74 - Applied Numerical Methods Using MATLAB
This can be confirmed by typing either of the following statements into the
MATLAB Command Window:
>>fprintf('3 = %bx\n',3) or >>format hex, 3, format short
which will print onto the screen
0000000000000840 4008000000000000
Noting that on INTEL systems the more significant byte (8 [bits] = 2
[hexadecimal digits]) of a number is stored at the higher memory address,
we can reverse the order of the bytes in the first of these numbers so that
the most significant byte is on the left and the least significant byte is
on the right, as we usually write numbers.
00 00 00 00 00 00 08 40 → 40 08 00 00 00 00 00 00
This is exactly the hexadecimal representation of the number 3, as we
expected. You can find the IEEE 64-bit floating-point representation of the
number 14 and use the command fprintf() or format hex to check whether your
result is right.
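The check above can also be sketched with num2hex, used here as an alternative to the fprintf/format hex commands in the text. The variable names (hexBig, hexLittle, and so on) are illustrative assumptions, not the book's code. The sketch reverses the byte order and then decodes the sign, exponent, and mantissa fields to recover the original value.

```matlab
x = 3;                                   % change to 14 to check your own answer
hexBig = num2hex(x);                     % 16 hex digits, most significant byte first

% Reverse the byte order (2 hex digits per byte) to get the little-endian layout
bytes = reshape(hexBig, 2, []).';        % 8 rows, one byte per row
hexLittle = reshape(flipud(bytes).', 1, []);

% Decode the fields: 1 sign bit, 11 exponent bits (bias 1023), 52 mantissa bits
bits = dec2bin(hex2dec(hexBig(1:3)), 12);    % sign bit + exponent bits
signBit = bits(1) - '0';
expo = bin2dec(bits(2:end)) - 1023;          % unbiased exponent
frac = hex2dec(hexBig(4:16)) / 2^52;         % fractional part of the mantissa
value = (-1)^signBit * (1 + frac) * 2^expo;  % hidden bit restored as the leading 1

fprintf('%s  %s  %g\n', hexLittle, hexBig, value);
```

The decoding step assumes a normalized number (not zero, Inf, NaN, or a denormal), which holds for 3 and 14.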
<procedure of adding 2^{-1} to 2^3>
    1.0000 × 2^3 + 1.0000 × 2^{-1}
    alignment:                1.00000 × 2^3 + 0.00010 × 2^3 = 1.00010 × 2^3
    truncation of guard bit:  1.0001 × 2^3
                              = (1 + 2^{-4}) × 2^3   : right result

<procedure of subtracting 2^{-1} from 2^3>
    1.0000 × 2^3 − 1.0000 × 2^{-1}
    alignment:                1.00000 × 2^3 − 0.00010 × 2^3
    2's complement:           1.00000 × 2^3 + 1.11110 × 2^3 = 0.11110 × 2^3
    normalization and
    truncation of guard bit:  1.1110 × 2^2
                              = (1 + 1 − 2^{-3}) × 2^2   : right result

<procedure of adding 2^{-2} to 2^3>
    1.0000 × 2^3 + 1.0000 × 2^{-2}
    alignment:                1.00000 × 2^3 + 0.00001 × 2^3 = 1.00001 × 2^3
    truncation of guard bit:  1.0000 × 2^3
                              = (1 + 0) × 2^3   : no difference

<procedure of subtracting 2^{-2} from 2^3>
    1.0000 × 2^3 − 1.0000 × 2^{-2}
    alignment:                1.00000 × 2^3 − 0.00001 × 2^3
    2's complement:           1.00000 × 2^3 + 1.11111 × 2^3 = 0.11111 × 2^3
    normalization and
    truncation of guard bit:  1.1111 × 2^2
                              = (1 + 1 − 2^{-4}) × 2^2   : right result

(cf) The digit before the binary point is the hidden bit; the fifth digit
after the binary point is the guard bit.
Figure P1.18 Procedure of addition/subtraction with four mantissa bits.
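The four cases of Figure P1.18 can be reproduced numerically with a simplified model. This is a sketch under an assumption, not the book's procedure: it forms the exact result in double precision and then truncates the mantissa to four fraction bits, which agrees with the align/guard-bit/truncate steps for these particular operands. The names nbits, m, mt, and y are illustrative.

```matlab
nbits = 4;                               % number of mantissa (fraction) bits

% Exact results of the four operations shown in the figure
x = [2^3 + 2^-1, 2^3 - 2^-1, 2^3 + 2^-2, 2^3 - 2^-2];

[f, e] = log2(x);                        % x = f .* 2.^e with 0.5 <= f < 1
m = 2*f;  e = e - 1;                     % renormalize so that 1 <= m < 2

mt = floor(m * 2^nbits) / 2^nbits;       % truncate bits beyond the 4th fraction bit
y = mt .* 2.^e;                          % value stored with four mantissa bits

disp([x; y]);                            % row 1: exact, row 2: four-bit result
```

Only the third column changes (8.25 becomes 8), matching the "no difference" case of adding 2^{-2} to 2^3, while the subtraction of the same quantity survives because the result drops to the smaller exponent 2^2.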
1.18 Resolution of Number Representation and Quantization Error
In Section 1.2.1, we saw that adding 2^{-22} to 2^{30} makes some difference,
while adding 2^{-23} to 2^{30} makes no difference due to the bit shift by
over 52 bits for alignment before addition. How about subtracting 2^{-23}
from 2^{30}? In contrast with the addition of 2^{-23} to 2^{30}, it makes a
difference, as you can see by typing the following statement into the MATLAB