Page 208 - ARM 64 Bit Assembly Language
code than the general algorithm. For small constants, this method is often faster than using a
hardware multiply instruction. If we inspect the constant multiplier, we can usually find a pattern
to exploit that will save a few instructions. For example, suppose we want to multiply a
variable x by 10₁₀. The multiplier 10₁₀ = 1010₂, so we only need to add x shifted left 1 bit to
x shifted left 3 bits, as shown below:
    adr x0, x
    ldr x0, [x0]            // load x (x0 = x)
    lsl x0, x0, #1          // shift x left (x0 = 2x)
    add x0, x0, x0, lsl #2  // x0 = 2x + 8x = 10x
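The sequence above can be checked against the general shift-and-add method with a small C model. This is only a sketch: the function names `mul_shift_add` and `mul10` are hypothetical, and unsigned 64-bit arithmetic is assumed to stand in for an X register, so results wrap modulo 2^64.

```c
#include <stdint.h>

/* General shift-and-add multiply: for every 1 bit in the constant c,
   add x shifted left by that bit's position. Unsigned arithmetic wraps
   modulo 2^64, like a 64-bit register. */
uint64_t mul_shift_add(uint64_t x, uint64_t c)
{
    uint64_t sum = 0;
    for (unsigned bit = 0; c != 0; bit++, c >>= 1) {
        if (c & 1)
            sum += x << bit;
    }
    return sum;
}

/* Model of the two-instruction sequence for multiplying by 10. */
uint64_t mul10(uint64_t x)
{
    uint64_t x0 = x << 1;   /* lsl x0, x0, #1         : x0 = 2x      */
    x0 = x0 + (x0 << 2);    /* add x0, x0, x0, lsl #2 : x0 = 2x + 8x */
    return x0;
}
```

For any input, `mul10(x)` and `mul_shift_add(x, 10)` should agree with `x * 10` computed in 64-bit unsigned arithmetic.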
Now suppose we want to multiply a number x by 11₁₀. The multiplier 11₁₀ = 1011₂, so we
will add x to x shifted left one bit plus x shifted left 3 bits, as in the following:
    adr x1, x
    ldr x1, [x1]            // load x (x1 = x)
    add x0, x1, x1, lsl #1  // shift and add (x0 = x + 2x)
    add x0, x0, x1, lsl #3  // x0 = 3x + 8x = 11x
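Both examples so far need two instructions after the load. One way to estimate the cost of this basic approach for other constants is to count the 1 bits in the multiplier. The sketch below assumes a simple counting rule that is not stated in the text: one add per 1 bit after the first, plus one initial shift when the lowest 1 bit is not bit 0. The name `basic_op_count` is hypothetical.

```c
#include <stdint.h>

/* Estimate how many shift/add instructions the basic method needs to
   multiply by the constant c. Assumed rule: (number of 1 bits - 1) adds,
   plus one leading shift if bit 0 of c is clear. Returns 0 for c == 0. */
unsigned basic_op_count(uint64_t c)
{
    if (c == 0)
        return 0;

    unsigned needs_initial_shift = (c & 1) == 0;
    unsigned ones = 0;
    for (; c != 0; c >>= 1)
        ones += (unsigned)(c & 1);

    return (ones - 1) + needs_initial_shift;
}
```

Under this rule, `basic_op_count(10)` and `basic_op_count(11)` both evaluate to 2, matching the two listings above.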
If we wish to multiply a number x by 1000₁₀, we note that 1000₁₀ = 1111101000₂. It looks
like we need 1 shift plus 5 add/shift operations, for a total of 6 operations. With a little thought,
we can reduce this to five operations, as shown below:
    adr x1, x
    ldr x1, [x1]            // load x
    add x0, x1, x1, lsl #1  // shift and add: x0 = 3x
    add x0, x0, x0, lsl #2  // x0 = 3x + 3x*4 = 15x
    add x0, x1, x0, lsl #1  // x0 = 15x*2 + x = 31x
    lsl x0, x0, #5          // x0 = 31x * 32 = 992x
    add x0, x0, x1, lsl #3  // x0 = 992x + x*8 = 1000x
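The chain of intermediate values (3x, 15x, 31x, 992x, 1000x) is easy to get wrong by one shift, so it can help to trace it in C. The following is a sketch that mirrors the five instructions one statement at a time; `mul1000` is a hypothetical name, and the variables `x0` and `x1` only imitate the registers.

```c
#include <stdint.h>

/* Step-by-step model of the five-instruction sequence for 1000x.
   Each statement corresponds to one instruction after the load. */
uint64_t mul1000(uint64_t x)
{
    uint64_t x1 = x;                /* ldr x1, [x1]                    */
    uint64_t x0 = x1 + (x1 << 1);   /* add x0, x1, x1, lsl #1 : 3x     */
    x0 = x0 + (x0 << 2);            /* add x0, x0, x0, lsl #2 : 15x    */
    x0 = x1 + (x0 << 1);            /* add x0, x1, x0, lsl #1 : 31x    */
    x0 = x0 << 5;                   /* lsl x0, x0, #5         : 992x   */
    x0 = x0 + (x1 << 3);            /* add x0, x0, x1, lsl #3 : 1000x  */
    return x0;
}
```

The decomposition being used is 1000 = ((3 * 5) * 2 + 1) * 32 + 8.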
Applying the basic multiplication algorithm to multiply a number x by 255₁₀ would result
in seven add/shift operations, but we can do it with only three operations and use only one
register, as shown below:
    adr x0, x
    ldr x0, [x0]            // load x
    add x0, x0, x0, lsl #1  // shift and add: x0 = 3x
    add x0, x0, x0, lsl #2  // x0 = 3x + x*3*4 = 15x
    add x0, x0, x0, lsl #4  // x0 = 15x + x*15*16 = 255x
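The three additions work because 255 = 3 * 5 * 17, and each factor has the form 2^k + 1. A C model makes this visible. The sketch also includes a subtract-based variant, 255x = 256x - x, which is not taken from the listing above and is shown only as an assumed alternative for comparison. Both function names are hypothetical.

```c
#include <stdint.h>

/* Model of the three-instruction sequence: multiply by 3, then 5, then 17. */
uint64_t mul255_add(uint64_t x)
{
    uint64_t x0 = x;
    x0 = x0 + (x0 << 1);   /* add x0, x0, x0, lsl #1 : 3x   */
    x0 = x0 + (x0 << 2);   /* add x0, x0, x0, lsl #2 : 15x  */
    x0 = x0 + (x0 << 4);   /* add x0, x0, x0, lsl #4 : 255x */
    return x0;
}

/* Assumed alternative, not from the listing: 255x = 256x - x. */
uint64_t mul255_sub(uint64_t x)
{
    return (x << 8) - x;
}
```

Both functions wrap modulo 2^64 in the same way, so they should return identical results for every input, including values large enough to overflow.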
Most modern systems have assembly language instructions for multiplication. However,
on most processors, it is often more efficient to use the shift, add, and subtract operations
when multiplying by a small constant. The AArch64 processors have a particularly powerful
hardware multiplier. They can typically perform multiplication with a 64-bit result in