Page 247 - ARM 64 Bit Assembly Language

Integer mathematics 235

Table 7.1: Performance of bigint_negate implementations on an nVidia Jetson TX-1.
Times are in seconds. Speedups are relative to the 32-bit C version.

Chunk    Version    Negate time   Negate speedup   Program time   Program speedup
32-bit   C          0.097846      1.00             27.898544      1.00
32-bit   Assembly   0.049903      1.96             27.322730      1.02
64-bit   C          0.095243      1.03             26.719709      1.04
64-bit   Assembly   0.048488      2.02             26.492714      1.05



        adcs   w6, w6, wzr            // add carry flag, set flags
        str    w6, [x4], #4           // store chunk in destination
#endif
#endif
#endif
#endif
        sub    w20, w20, #1           // decrement counter (sub leaves the carry flag intact)
        cbnz   w20, loop              // repeat until the counter reaches zero
endloop:
        // return address of new bigint is already in x0
        ldp    x19, x20, [sp, #16]    // Restore non-volatile regs
        ldp    x29, x30, [sp], #32    // Restore FP & LR
        ret
        .size  bigint_negate,(. - bigint_negate)
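
The loop tail above adds the carry into each chunk and stores the result. As a rough illustration of that carry propagation, here is a minimal C sketch of two's complement negation over an array of 32-bit chunks. The name negate_chunks, the plain-array layout, and the least-significant-chunk-first ordering are assumptions made for this example. The book's bigint structure and its own C version of bigint_negate may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the book's bigint API: negate a two's
   complement number held in n 32-bit chunks, least significant first.
   Negation is "invert every bit, then add one". */
void negate_chunks(const uint32_t *src, uint32_t *dst, size_t n)
{
    uint64_t carry = 1;                      /* the "+1" of two's complement */
    for (size_t i = 0; i < n; i++) {
        uint64_t sum = (uint64_t)(uint32_t)~src[i] + carry;
        dst[i] = (uint32_t)sum;              /* low 32 bits become the chunk */
        carry = sum >> 32;                   /* carry into the next chunk */
    }
}
```

The 64-bit intermediate plays the role that the carry flag plays in the assembly loop: it keeps the bit that overflows out of one chunk so it can be added to the next.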


The regression testing program was executed on an nVidia Jetson TX-1, first using the C version
of bigint_negate and then using the assembly version, for chunk sizes of 32 bits and 64 bits.
The results are shown in Table 7.1. The negate function tests took a total of 0.097846 seconds
with the 32-bit C version and 0.049903 seconds with the 32-bit assembly version, a speedup of
1.96, which means that the assembly version was almost twice as fast as the C version. The
64-bit C version ran in 0.095243 seconds, while the 64-bit assembly code ran in 0.048488
seconds, again a speedup of 1.96 over the corresponding C version. Measured against the 32-bit
C version, which is the baseline used in Table 7.1, the 64-bit assembly version has a speedup
of 2.02, so it is more than twice as fast as the C version using 32-bit chunks.
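
The speedup figures quoted above are ratios of run times: the baseline time divided by the time of the version being compared. A minimal C sketch of that calculation follows. The function name speedup and its parameter names are chosen for this example and do not come from the book.

```c
#include <assert.h>

/* Illustrative helper: speedup of a new version relative to a baseline,
   computed as baseline time divided by new time (same units for both). */
double speedup(double baseline_seconds, double new_seconds)
{
    assert(baseline_seconds > 0.0 && new_seconds > 0.0);
    return baseline_seconds / new_seconds;
}
```

With the times from Table 7.1, speedup(0.097846, 0.049903) is about 1.96, speedup(0.095243, 0.048488) is also about 1.96, and speedup(0.097846, 0.048488) is about 2.02.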

The bigint_negate function also has a small impact on the functions that rely on it, so the
overall time to run the regression tests was reduced slightly as well. The 64-bit assembly
version was 5% faster overall than the 32-bit C version. Because this function was not called
often in the test program, the overall program speedup was modest. Implementing other
functions, such as bigint_add and bigint_sub, in assembly would result in much larger overall
speedups.
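
One common way to reason about why speeding up a rarely used function helps the whole program so little is Amdahl's law, which the text does not name but which matches its argument. The sketch below assumes that a fraction f of the original run time is spent in the accelerated code and that this code becomes s times faster. The function name overall_speedup and the sample values in the note are illustrative, not measurements from the book.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction f of the original run
   time is made s times faster and the remaining (1 - f) is unchanged. */
double overall_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}
```

For example, doubling the speed of code that accounts for 1% of the run time gives overall_speedup(0.01, 2.0), which is only about 1.005, while doubling the speed of code that accounts for half of the run time gives overall_speedup(0.5, 2.0), about 1.33. This is why functions that dominate the run time are the more rewarding targets.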