Page 245 - ARM 64 Bit Assembly Language

Integer mathematics 233

                     When attempting to speed up a C program by converting selected parts of it to assembly
                     language, it is important to first determine where the most significant gains can be made. A
                     profiler, such as gprof or callgrind, can be used to help identify the sections of code that
                     will have the greatest impact on performance. It is also important to make sure that the result
                     is not just highly optimized C code. If the code cannot benefit from some features offered by
                     assembly, then it may not be worth the effort of re-writing in assembly. The code should be
                     re-written from a pure assembly language viewpoint.
                     It is also important to avoid premature assembly programming. Make sure that the C
                     algorithms and data structures are efficient before moving to assembly. If a better
                     algorithm can deliver the required performance, then assembly may not be required at all.
                     Once the assembly is written, it is more difficult to make major changes to the data
                     structures and algorithms. Assembly language optimization is the final step in
                     optimization, not the first one.

                     Well-written C code is modularized, with many small functions. This helps readability,
                     promotes code reuse, and may allow the compiler to optimize better. However, each function
                     call has some associated overhead. If optimal performance is the goal, then calling many
                     small functions should be avoided. For instance, if the piece of code to be optimized is in
                     a loop body, then it may be best to write the entire loop in assembly, rather than writing a
                     function and calling it each time through the loop. Writing in assembly is not a guarantee
                     of performance. Spaghetti code is slow. Load/store instructions are slow. Multiplication and
                     division are slow. The secret to good performance is avoiding things that are slow. Good opti-
                     mization requires rethinking the code to take advantage of assembly language.
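
To make the point about loops concrete, here is a minimal C sketch of the two ways to split the work. The function names (scale_one, scale_all_calls, scale_all_loop) and the arithmetic are hypothetical and chosen only for illustration. If scale_one were rewritten in assembly, the C compiler could not inline it, so scale_all_calls would pay the call and return overhead on every iteration. Rewriting the whole of scale_all_loop in assembly instead pays that overhead once per array.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-element helper. If this were an assembly routine,
   the C compiler could not inline it into the caller's loop. */
uint32_t scale_one(uint32_t x)
{
    return x * 3u + 1u;
}

/* Variant A: the loop stays in C and calls the helper on every iteration. */
void scale_all_calls(uint32_t *v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        v[i] = scale_one(v[i]);
}

/* Variant B: the entire loop is one routine. This is the unit that would
   be rewritten in assembly, so there is one call per array rather than
   one call per element. */
void scale_all_loop(uint32_t *v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        v[i] = v[i] * 3u + 1u;
}
```

Both variants compute the same result. The difference lies only in where the function boundary falls, and therefore in how often the call overhead is paid.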
                     The profiler indicated that bigint_adc is used more than any other function. If assembly lan-
                     guage can make this function run faster, then it should have a profound effect on the program.
                     We will leave that as an exercise for the student, but will give an example of how to optimize
                     a less critical function. The bigint_negate function was re-written in assembly, as shown
                     in Listing 7.10. Note that the original C implementation used the bigint_complement
                     function to get the 1’s complement, and then used the bigint_adc function to add one.
                     The C implementation of the bigint_adc function requires extra code to calculate the
                     carry from one chunk to the next. However, the assembly version of bigint_negate
                     loads each chunk, complements it, and propagates the carry all in one step. Since it only
                     loads and stores the data once, instead of twice, and requires much less work to propagate
                     the carry bit, we would expect the assembly version to run about twice as fast as the C
                     version.
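
The difference between the two approaches can be sketched in C. This is only an illustration under stated assumptions: the chunk_t type, the least-significant-chunk-first array layout, and the function names below are hypothetical and are not taken from the book's bigint implementation. The two-pass version walks the data twice and computes the carry by hand with a wider integer. The one-pass version fuses the complement and the carry propagation into a single loop. That is the structure an assembly routine can exploit, for example by keeping the carry in the processor's carry flag.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: a big integer is an array of 32-bit chunks,
   least significant chunk first. The book's bigint type may differ. */
typedef uint32_t chunk_t;

/* Pass 1 of the two-pass approach: 1's complement of every chunk. */
static void complement_chunks(chunk_t *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] = (chunk_t)~x[i];
}

/* Pass 2 of the two-pass approach: add one, computing the carry in C
   by widening each chunk to 64 bits. */
static void add_one_chunks(chunk_t *x, size_t n)
{
    uint64_t carry = 1;                  /* the "+1" enters as the first carry */
    for (size_t i = 0; i < n; i++) {
        uint64_t sum = (uint64_t)x[i] + carry;
        x[i] = (chunk_t)sum;             /* keep the low 32 bits */
        carry = sum >> 32;               /* carry into the next chunk */
    }
}

/* Two passes over the data: each chunk is loaded and stored twice. */
void negate_two_pass(chunk_t *x, size_t n)
{
    complement_chunks(x, n);
    add_one_chunks(x, n);
}

/* One pass over the data: each chunk is loaded once, complemented,
   combined with the incoming carry, and stored once. */
void negate_one_pass(chunk_t *x, size_t n)
{
    uint64_t carry = 1;
    for (size_t i = 0; i < n; i++) {
        uint64_t sum = (uint64_t)(chunk_t)~x[i] + carry;
        x[i] = (chunk_t)sum;
        carry = sum >> 32;
    }
}
```

Both functions produce the 2's complement of the input in place. Only the number of passes over memory differs.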
                          Listing 7.10 AArch64 assembly implementation of the bigint_negate function.

                              .text
                              .type  bigint_negate, %function
                              .global bigint_negate