Page 298 - ARM 64 Bit Assembly Language
8.7.4 IEEE 754 quad-precision
The IEEE 754 quad-precision format was designed to provide enough range and precision
for very demanding applications. It provides a 15-bit exponent and a 112-bit stored mantissa
(113 bits of precision when the hidden bit is counted). This format is still not supported by
most hardware. The IBM POWER9 CPU fully supports quad precision in hardware. Some other
processors, such as SPARC V8 and V9, and PA-RISC, offer partial support. However, for
mid-range processors such as the Intel x86 family and the ARM, this format is still out of
their league. It may be supported by some compilers, but the operations are implemented in
software, and can take ten times as long (or more) as a hardware implementation.
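To make the field layout concrete, the following is a minimal sketch in C that splits a quad-precision bit pattern into its fields. It assumes the IEEE 754 binary128 layout of 1 sign bit, 15 exponent bits (bias 16383), and 112 stored fraction bits, with the value supplied as two 64-bit halves. The names quad_fields and quad_unpack are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Field widths of the IEEE 754 binary128 (quad-precision) format. */
enum { QUAD_EXP_BITS = 15, QUAD_FRAC_BITS = 112, QUAD_BIAS = 16383 };

/* Hypothetical container for the unpacked fields. */
struct quad_fields {
    unsigned sign;      /* 1 bit */
    unsigned exponent;  /* 15 bits, still biased */
    uint64_t frac_hi;   /* upper 48 bits of the 112-bit fraction */
    uint64_t frac_lo;   /* lower 64 bits of the 112-bit fraction */
};

/* Split a quad-precision bit pattern, given as its high and low
   64-bit halves, into sign, biased exponent, and fraction. */
struct quad_fields quad_unpack(uint64_t hi, uint64_t lo)
{
    struct quad_fields f;
    f.sign     = (unsigned)(hi >> 63);
    f.exponent = (unsigned)((hi >> 48) & 0x7FFF);
    f.frac_hi  = hi & 0x0000FFFFFFFFFFFFULL;
    f.frac_lo  = lo;
    return f;
}
```

For example, the value 1.0 has the high half 0x3FFF000000000000 and a low half of zero, so its biased exponent field equals the bias and its fraction is zero.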
8.8 Floating point operations
Many processors do not have hardware support for floating point. On those processors, all
floating point must be accomplished through software. Processors that do support floating
point in hardware must have quite sophisticated circuitry to manage the basic operations on
data in the IEEE 754 standard formats. Regardless of whether the operations are carried out in
software or hardware, the basic arithmetic operations require multiple steps.
8.8.1 Floating point addition and subtraction
The steps required for addition and subtraction of floating point numbers are the same,
regardless of the specific format. The steps for adding or subtracting two floating point
numbers a and b are as follows:
1. Extract the exponents E_a and E_b.
2. Extract the significands M_a and M_b, and convert them into 2’s complement numbers,
using the signs S_a and S_b.
3. Shift the significand with the smaller exponent right by |E_a − E_b|.
4. Perform addition (or subtraction) on the significands to get the significand of the result,
M_r. Remember that the result may require one more significant bit to avoid overflow.
5. If M_r is negative, then take the 2’s complement and set S_r to 1. Otherwise set S_r to 0.
6. Shift M_r until the leftmost 1 is in the “hidden” bit position, and adjust the larger of
the two exponents by the shift amount to form the new exponent E_r: add the shift amount
for a right shift, or subtract it for a left shift.
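The steps above can be sketched in C for IEEE 754 single precision. This is a minimal illustration under stated assumptions, not a complete implementation: both inputs are assumed to be normal, nonzero, finite numbers, bits shifted out during alignment are simply discarded (no rounding), and exponent overflow and underflow are not checked. The function name soft_add is hypothetical. The sketch shifts the magnitudes before applying the signs, because right-shifting a negative value is implementation-defined in C. It also starts the result exponent from the larger input exponent, since the alignment in step 3 expresses both significands at that scale.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its 32-bit pattern, and back. */
static uint32_t float_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float bits_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Hypothetical software addition of two normal, nonzero, finite floats.
   Subtraction is the same operation with the sign of b inverted. */
float soft_add(float a, float b)
{
    uint32_t ua = float_bits(a), ub = float_bits(b);

    /* Step 1: extract the biased exponents. */
    int32_t ea = (int32_t)((ua >> 23) & 0xFF);
    int32_t eb = (int32_t)((ub >> 23) & 0xFF);

    /* Step 2 (first half): extract significands, restoring the hidden bit. */
    uint32_t mag_a = (ua & 0x7FFFFF) | 0x800000;
    uint32_t mag_b = (ub & 0x7FFFFF) | 0x800000;

    /* Step 3: shift the significand with the smaller exponent right. */
    int32_t er;
    if (ea >= eb) {
        int32_t d = ea - eb;
        mag_b = (d < 32) ? (mag_b >> d) : 0;
        er = ea;
    } else {
        int32_t d = eb - ea;
        mag_a = (d < 32) ? (mag_a >> d) : 0;
        er = eb;
    }

    /* Step 2 (second half): apply the signs to get signed values. */
    int32_t ma = (ua >> 31) ? -(int32_t)mag_a : (int32_t)mag_a;
    int32_t mb = (ub >> 31) ? -(int32_t)mag_b : (int32_t)mag_b;

    /* Step 4: add; the sum needs at most one extra significant bit. */
    int32_t mr = ma + mb;

    /* Step 5: record the sign and work with the magnitude. */
    uint32_t sr = 0;
    if (mr < 0) { mr = -mr; sr = 1; }
    if (mr == 0) return 0.0f;          /* exact cancellation */
    uint32_t m = (uint32_t)mr;

    /* Step 6: normalise so the leftmost 1 sits in the hidden bit (bit 23). */
    while (m >= 0x1000000) { m >>= 1; er++; }
    while (m <  0x800000)  { m <<= 1; er--; }

    /* Reassemble sign, exponent, and fraction (hidden bit dropped). */
    return bits_float((sr << 31) | ((uint32_t)er << 23) | (m & 0x7FFFFF));
}
```

For inputs whose sum is exactly representable, such as 1.5 and 2.25, the sketch matches hardware addition. For other inputs the last bit may differ, because no rounding is performed.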

