When attempting to speed up a C program by converting selected parts of it to assembly
language, it is important to first determine where the most significant gains can be made. A
profiler, such as gprof or callgrind, can be used to help identify the sections of code that
will have the greatest impact on performance. It is also important to make sure that the result
is not just highly optimized C code translated directly into assembly. If the code cannot take
advantage of features that assembly offers, then it may not be worth the effort of re-writing it.
The code should be re-written from a pure assembly language viewpoint.
It is also important to avoid premature assembly programming. Make sure that the C algo-
rithms and data structures are efficient before moving to assembly. If a better algorithm can
provide the needed performance, then assembly may not be required at all. Once the assembly is writ-
ten, it is more difficult to make major changes to the data structures and algorithms. Assembly
language optimization is the final step in optimization, not the first one.
Well-written C code is modularized, with many small functions. This helps readability, pro-
motes code reuse, and may allow the compiler to optimize more effectively. However, each function
call has some associated overhead. If optimal performance is the goal, then calling many
small functions should be avoided. For instance, if the piece of code to be optimized is in
a loop body, then it may be best to write the entire loop in assembly, rather than writing a
function and calling it each time through the loop. Writing in assembly is not a guarantee
of performance. Spaghetti code is slow. Load/store instructions are slow. Multiplication and
division are slow. The secret to good performance is avoiding things that are slow. Good opti-
mization requires rethinking the code to take advantage of assembly language.
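As a rough illustration of the overhead being described, consider the following C sketch. The function names and the scaling operation are hypothetical; the point is only to contrast calling a small function on every pass through the loop with keeping the entire loop body in one function, which is the form that is worth rewriting in assembly.

/* Hypothetical per-element helper: calling it on every iteration adds
   call/return overhead to each pass through the loop. */
static long scale_one(long x, long factor)
{
    return x * factor;
}

void scale_array_with_calls(long *a, int n, long factor)
{
    for (int i = 0; i < n; i++)
        a[i] = scale_one(a[i], factor);   /* one function call per element */
}

/* The entire loop in one function: an assembly version of this routine
   can keep its values in registers and pays no per-element call overhead. */
void scale_array_one_loop(long *a, int n, long factor)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] * factor;
}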
The profiler indicated that bigint_adc is used more than any other function. If assembly lan-
guage can make this function run faster, then it should have a profound effect on the program.
We will leave that as an exercise for the student, but will give an example of how to optimize
a less critical function. The bigint_negate function was re-written in assembly, as shown
in Listing 7.10. Note that the original C implementation used the bigint_complement
function to get the 1’s complement, and then used the bigint_adc function to add one.
The C implementation of the bigint_adc function requires extra code to calculate the
carry from one chunk to the next. However, the assembly version of bigint_negate
loads each chunk, complements it, and propagates the carry all in one step. Since it only
loads and stores the data once, instead of twice, and requires much less work to propagate
the carry bit, we would expect the assembly version to run about twice as fast as the C ver-
sion.
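To make the one-pass versus two-pass comparison concrete, the following C sketch contrasts the two approaches on a bare array of chunks. The 64-bit chunk width, the in-place interface, and the function names are assumptions made only for illustration; the actual bigint type and its helper functions are defined earlier in the chapter.

#include <stdint.h>

/* Two passes, following the structure of the original C code: complement
   every chunk, then add one.  Each chunk is loaded and stored twice, and
   the second pass must compute the carry out of every chunk. */
void negate_two_pass(uint64_t *x, int n)
{
    uint64_t carry = 1;                   /* the "+1" of two's complement */
    for (int i = 0; i < n; i++)
        x[i] = ~x[i];                     /* pass 1: one's complement */
    for (int i = 0; i < n; i++) {
        uint64_t sum = x[i] + carry;      /* pass 2: add one */
        carry = (sum < carry);            /* carry out of this chunk */
        x[i] = sum;
    }
}

/* One pass, as the assembly version does: each chunk is loaded once,
   complemented, combined with the incoming carry, and stored once. */
void negate_one_pass(uint64_t *x, int n)
{
    uint64_t carry = 1;
    for (int i = 0; i < n; i++) {
        uint64_t sum = ~x[i] + carry;
        carry = (sum < carry);            /* carry propagates to the next chunk */
        x[i] = sum;
    }
}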
Listing 7.10 AArch64 assembly implementation of the bigint_negate function.
.text
.type bigint_negate, %function
.global bigint_negate