Page 298 - ARM 64 Bit Assembly Language



8.7.4 IEEE 754 quad-precision

The IEEE 754 quad-precision format was designed to provide enough range and precision
for very demanding applications. It provides a 15-bit exponent and a 112-bit mantissa
(113 bits of precision, counting the hidden bit). This format is still not supported by
most hardware. The IBM POWER9 CPU fully supports quad precision in hardware. Some other
processors, such as SPARC V8 and V9, and PA-RISC, offer partial support. However, for
mid-range processors such as the Intel x86 family and the ARM, this format is still
definitely out of their league. It may be supported by some compilers, but the operations
are implemented in software, and can take ten times as long (or more) as a hardware
implementation.
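
As an illustration of the field layout described above, the following C sketch splits a raw
128-bit quad-precision bit pattern into its sign, exponent, and fraction fields. The names
quad_bits, quad_fields, and quad_unpack are hypothetical and chosen only for this example,
and the pattern is assumed to be supplied as two 64-bit halves with the most significant
half in hi.

```c
#include <assert.h>
#include <stdint.h>

/* Field widths of IEEE 754 quad precision (binary128). */
enum { QUAD_EXP_BITS = 15, QUAD_FRAC_BITS = 112, QUAD_BIAS = 16383 };
static_assert(1 + QUAD_EXP_BITS + QUAD_FRAC_BITS == 128,
              "sign, exponent, and fraction must fill 128 bits");

/* Hypothetical container for a raw bit pattern: hi holds bits 127..64. */
struct quad_bits { uint64_t hi, lo; };

/* The three fields, with the 112-bit fraction split into 48 + 64 bits. */
struct quad_fields {
    unsigned sign;      /* 1 bit */
    unsigned exponent;  /* 15 bits, biased by QUAD_BIAS */
    uint64_t frac_hi;   /* upper 48 bits of the fraction */
    uint64_t frac_lo;   /* lower 64 bits of the fraction */
};

struct quad_fields quad_unpack(struct quad_bits q)
{
    struct quad_fields f;
    f.sign     = (unsigned)(q.hi >> 63);
    f.exponent = (unsigned)((q.hi >> 48) & 0x7FFF);
    f.frac_hi  = q.hi & 0xFFFFFFFFFFFFULL;   /* low 48 bits of hi */
    f.frac_lo  = q.lo;
    return f;
}
```

For example, the quad-precision encoding of 1.0 has hi = 0x3FFF000000000000 and lo = 0, so
quad_unpack reports a sign of 0, a biased exponent of 16383, and a zero fraction.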

8.8 Floating point operations

Many processors do not have hardware support for floating point. On those processors, all
floating point operations must be performed in software. Processors that do support floating
point in hardware must have quite sophisticated circuitry to manage the basic operations on
data in the IEEE 754 standard formats. Regardless of whether the operations are carried out in
software or hardware, the basic arithmetic operations require multiple steps.

8.8.1 Floating point addition and subtraction

The steps required for addition and subtraction of floating point numbers are the same,
regardless of the specific format. The steps for adding or subtracting two floating point
numbers a and b are as follows:
1. Extract the exponents $E_a$ and $E_b$.
2. Extract the significands $M_a$ and $M_b$, and convert them into 2’s complement numbers,
using the signs $S_a$ and $S_b$.
3. Shift the significand with the smaller exponent right by $|E_a - E_b|$.
4. Perform addition (or subtraction) on the significands to get the significand of the result,
$M_r$. Remember that the result may require one more significant bit to avoid overflow.
5. If $M_r$ is negative, then take the 2’s complement and set $S_r$ to 1. Otherwise set $S_r$ to 0.
6. Shift $M_r$ until the leftmost 1 is in the “hidden” bit position. Starting from the larger
of the two exponents (the one both significands were aligned to in step 3), add the shift
amount for a right shift, or subtract it for a left shift, to form the new exponent $E_r$.
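
The steps above can be sketched in C for the single-precision format. This is a minimal
illustration under several assumptions, not a complete implementation: the function name
soft_fadd and its helpers are hypothetical, both inputs must be normal numbers (no zeros,
subnormals, infinities, or NaNs), bits shifted out during alignment are truncated rather than
rounded, and exponent overflow or underflow of the result is not handled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its 32-bit pattern and back. */
static uint32_t float_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float bits_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Hypothetical helper: add two normal single-precision values in software. */
float soft_fadd(float a, float b)
{
    uint32_t ua = float_bits(a), ub = float_bits(b);

    /* Step 1: extract the biased 8-bit exponents. */
    int32_t ea = (int32_t)((ua >> 23) & 0xFF);
    int32_t eb = (int32_t)((ub >> 23) & 0xFF);
    assert(ea > 0 && ea < 255 && eb > 0 && eb < 255);  /* normal inputs only */

    /* Step 2: extract the significands, restore the hidden bit, apply the signs. */
    int64_t ma = (int64_t)((ua & 0x7FFFFF) | 0x800000);
    int64_t mb = (int64_t)((ub & 0x7FFFFF) | 0x800000);
    if (ua >> 31) ma = -ma;
    if (ub >> 31) mb = -mb;

    /* Step 3: align by shifting the significand with the smaller exponent right.
       Division by a power of two truncates toward zero for either sign.
       A 24-bit significand shifted by 25 or more places is already zero. */
    int32_t d = ea > eb ? ea - eb : eb - ea;
    if (d > 25) d = 25;
    int32_t er;
    if (ea >= eb) { mb /= (int64_t)1 << d; er = ea; }
    else          { ma /= (int64_t)1 << d; er = eb; }

    /* Step 4: add the aligned significands; 64 bits leave room for the carry. */
    int64_t mr = ma + mb;

    /* Step 5: record the sign and keep the magnitude. */
    uint32_t sr = 0;
    if (mr < 0) { mr = -mr; sr = 1; }
    if (mr == 0) return 0.0f;  /* exact cancellation */

    /* Step 6: normalize so the leading 1 sits in the hidden bit (bit 23),
       adjusting the exponent that both operands were aligned to. */
    while (mr >= ((int64_t)1 << 24)) { mr >>= 1; er++; }
    while (mr <  ((int64_t)1 << 23)) { mr <<= 1; er--; }
    assert(er > 0 && er < 255);  /* overflow and underflow are not handled */

    /* Repack sign, exponent, and the 23 stored fraction bits. */
    return bits_float((sr << 31) | ((uint32_t)er << 23) | ((uint32_t)mr & 0x7FFFFF));
}
```

For example, soft_fadd(1.5f, 2.25f) aligns 1.5 to the exponent of 2.25, adds the significands,
and returns 3.75f. Subtraction can reuse the same routine by negating the second operand, as
in soft_fadd(a, -b).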
