Page 208 - ARM 64 Bit Assembly Language

196 Chapter 7

                  code than the general algorithm. For small constants, this method is often faster than using a
                  hardware multiply instruction. If we inspect the constant multiplier, we can usually find a
                  pattern to exploit that will save a few instructions. For example, suppose we want to multiply a
                  variable x by 10₁₀. The multiplier 10₁₀ = 1010₂, so we only need to add x shifted left 1 bit to
                  x shifted left 3 bits, as shown below:

                          adr     x0, x
                          ldr     x0, [x0]            // load x (x0 = x)
                          lsl     x0, x0, #1          // shift x left (x0 = 2x)
                          add     x0, x0, x0, lsl #2  // x0 = 2x + 8x = 10x
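The bit-pattern reasoning above can be checked with a short C model. This is only a sketch: `mul_shift_add`, `mul10`, and `check_mul10` are hypothetical names that do not appear in the text, and `uint64_t` is used on the assumption that it behaves like a 64-bit X register (results wrap modulo 2^64).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: multiply x by the constant c using only shifts
   and adds, performing one add for every 1 bit in c. */
uint64_t mul_shift_add(uint64_t x, uint64_t c)
{
    uint64_t result = 0;
    for (unsigned bit = 0; c != 0; bit++, c >>= 1) {
        if (c & 1)
            result += x << bit;   /* add x shifted left by this bit position */
    }
    return result;
}

/* Model of the multiply-by-10 listing, one statement per instruction. */
uint64_t mul10(uint64_t x)
{
    uint64_t x0 = x;          /* ldr x0, [x0]            : x0 = x       */
    x0 = x0 << 1;             /* lsl x0, x0, #1          : x0 = 2x      */
    x0 = x0 + (x0 << 2);      /* add x0, x0, x0, lsl #2  : x0 = 2x + 8x */
    return x0;
}

/* Both routes should agree with an ordinary multiply. */
void check_mul10(void)
{
    for (uint64_t x = 0; x < 1000; x++) {
        assert(mul10(x) == x * 10);
        assert(mul_shift_add(x, 10) == x * 10);
    }
}
```

Because 10₁₀ = 1010₂ has only two 1 bits, the loop in `mul_shift_add` performs two adds, which matches the two shifted terms (2x and 8x) in the listing.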

                  Now suppose we want to multiply a number x by 11₁₀. The multiplier 11₁₀ = 1011₂, so we
                  will add x to x shifted left 1 bit plus x shifted left 3 bits, as in the following:

                          adr     x1, x
                          ldr     x1, [x1]            // load x (x1 = x)
                          add     x0, x1, x1, lsl #1  // shift and add (x0 = x + 2x = 3x)
                          add     x0, x0, x1, lsl #3  // x0 = 3x + 8x = 11x

                  If we wish to multiply a number x by 1000₁₀, we note that 1000₁₀ = 1111101000₂. The
                  multiplier has six 1 bits, so it looks like we need 1 shift plus 5 add/shift operations, for a
                  total of 6 operations. With a little thought, we can reduce the number of operations, as
                  shown below:

                          adr     x1, x
                          ldr     x1, [x1]            // load x
                          add     x0, x1, x1, lsl #1  // shift and add: x0 = 3x
                          add     x0, x0, x0, lsl #2  // x0 = 3x + 3x*4 = 15x
                          add     x0, x1, x0, lsl #1  // x0 = 15x*2 + x = 31x
                          lsl     x0, x0, #5          // x0 = 31x * 32 = 992x
                          add     x0, x0, x1, lsl #3  // x0 = 992x + x*8 = 1000x
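One way to read the "little thought" in this sequence is that 1111101000₂ = 11111₂ × 32 + 8, that is, 1000 = 31 × 32 + 8, and that 31 can itself be built as 2 × 15 + 1 with 15 = 3 × 5. The following C sketch mirrors the listing one statement per instruction so that each intermediate value can be checked. The function names are hypothetical, and `uint64_t` is assumed to stand in for a 64-bit X register.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the multiply-by-1000 listing: five shift/add operations. */
uint64_t mul1000(uint64_t x)
{
    uint64_t x1 = x;                 /* ldr x1, [x1]            : x1 = x     */
    uint64_t x0 = x1 + (x1 << 1);    /* add x0, x1, x1, lsl #1  : 3x         */
    x0 = x0 + (x0 << 2);             /* add x0, x0, x0, lsl #2  : 15x        */
    x0 = x1 + (x0 << 1);             /* add x0, x1, x0, lsl #1  : 31x        */
    x0 = x0 << 5;                    /* lsl x0, x0, #5          : 992x       */
    x0 = x0 + (x1 << 3);             /* add x0, x0, x1, lsl #3  : 1000x      */
    return x0;
}

/* Compare against an ordinary multiply, including a value that wraps. */
void check_mul1000(void)
{
    for (uint64_t x = 0; x < 1000; x++)
        assert(mul1000(x) == x * 1000);
    assert(mul1000(UINT64_MAX) == UINT64_MAX * 1000u);
}
```

The model uses five operations after the load, one fewer than the six suggested by counting the 1 bits in the multiplier.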

                  Applying the basic multiplication algorithm to multiply a number x by 255₁₀ would result
                  in seven add/shift operations, but we can do it with only three operations and use only one
                  register, as shown below:

                          adr     x0, x
                          ldr     x0, [x0]            // load x
                          add     x0, x0, x0, lsl #1  // shift and add: x0 = 3x
                          add     x0, x0, x0, lsl #2  // x0 = 3x + 3x*4 = 15x
                          add     x0, x0, x0, lsl #4  // x0 = 15x + 15x*16 = 255x
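The three-operation sequence works because 255 = 3 × 5 × 17, and each factor has the form 2^k + 1, so each one costs a single add with a shifted operand. The C sketch below models the listing. It also includes, purely for comparison, an alternative that is not taken from the text: computing 255x as 256x - x with one shift and one subtract. The function names are hypothetical, and `uint64_t` is assumed to behave like a 64-bit X register.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the multiply-by-255 listing: 255 = 3 * 5 * 17. */
uint64_t mul255(uint64_t x)
{
    uint64_t x0 = x;          /* ldr x0, [x0]            : x0 = x            */
    x0 = x0 + (x0 << 1);      /* add x0, x0, x0, lsl #1  : x * 3   = 3x      */
    x0 = x0 + (x0 << 2);      /* add x0, x0, x0, lsl #2  : 3x * 5  = 15x     */
    x0 = x0 + (x0 << 4);      /* add x0, x0, x0, lsl #4  : 15x * 17 = 255x   */
    return x0;
}

/* Alternative (an assumption, not from the text): 255x = 256x - x.
   This might correspond to an lsl into a second register followed by a sub. */
uint64_t mul255_sub(uint64_t x)
{
    return (x << 8) - x;
}

/* Both versions should agree with an ordinary multiply, modulo 2^64. */
void check_mul255(void)
{
    for (uint64_t x = 0; x < 1000; x++) {
        assert(mul255(x) == x * 255);
        assert(mul255_sub(x) == x * 255);
    }
    assert(mul255(UINT64_MAX) == UINT64_MAX * 255u);
    assert(mul255_sub(UINT64_MAX) == UINT64_MAX * 255u);
}
```

The subtract form uses fewer operations but, written this way, needs a second register to hold the shifted value, whereas the listing above uses only x0.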

                  Most modern systems have assembly language instructions for multiplication. However,
                  on most processors, it is often more efficient to use the shift, add, and subtract operations
                  when multiplying by a small constant. The AArch64 processors have a particularly powerful
                  hardware multiplier. They can typically perform multiplication with a 64-bit result in