
Optimization

There's more to performance than asymptotic complexity (time complexity).

But not all instructions consume the same amount of time, and constant factors matter too. To optimize performance, we need to understand the system:

  • How programs are compiled and executed
  • How modern processors and memory system operate
  • How to measure performance and identify bottlenecks
  • How to improve performance without destroying code modularity and generality

Compilers provide an efficient mapping of programs to machine code:

  • Register allocation
  • Code selection and ordering (scheduling)
  • Dead code elimination
  • Eliminating minor inefficiencies

Compilers don't improve asymptotic efficiency; choosing a better algorithm is up to the programmer.
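For instance, dead code elimination removes work whose result is never used. A minimal sketch (hypothetical snippet, not from the lecture):

long waste_work(long x) {
    long unused = x * x + 42;           /* dead: the result is never read */
    (void)unused;                       /* silence the unused-variable warning */
    return x + 1;                       /* only this computation survives at -O1 */
}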

Generally Useful Optimizations

Code Motion (Hoisting)

Reduce the frequency with which a computation is performed. If an expression always produces the same result, move it to a place where it is computed once and the result is reused, especially out of a loop.

void set_row(double *a, double *b, long i, long n) {
    long j;
    for (j = 0; j < n; j++) {
        a[i * n + j] = b[j];
    }
}
void set_row_opt(double *a, double *b, long i, long n) {
    long j;
    long ni = n * i;
    for (j = 0; j < n; j++) {
        a[ni + j] = b[j];
    }
}
while ! [ -r 4_1.o ]; do sleep .1; done; objdump -d 4_1.o

The dump shows that imul is inside the loop.

while ! [ -r 4_2.o ]; do sleep .1; done; objdump -d 4_2.o

Here we can see that imul has been hoisted out of the loop.

Both versions contain roughly the same number of instructions, but the optimized version executes fewer of them at run time, since the multiply is performed once instead of on every iteration.

GCC will do this with the -O1 option.

Reduction in Strength

Replace a costly operation with a simpler one.

For example, replace a multiply by a power of 2 with a shift. Multiply and divide are typically expensive: on Intel Nehalem, imul requires 3 CPU cycles, while add requires only 1 cycle.
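A minimal sketch of the power-of-2 case (hypothetical snippet, not from the lecture):

long times_eight_mul(long x) {
    return x * 8;       /* multiply by a power of 2 ... */
}

long times_eight_shift(long x) {
    return x << 3;      /* ... is equivalent to a left shift by 3 */
}

The code below applies the same idea to a loop: the repeated multiply n * i is replaced by a running sum that is incremented by n.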

void test_reduction(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        long ni = n * i;                /* multiply on every outer iteration */
        for (j = 0; j < n; j++) {
            a[ni + j] = b[j];
        }
    }
}
void test_reduction_opt(double *a, double *b, long n) {
    long i, j;
    long ni = 0;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            a[ni + j] = b[j];
        }
        ni += n;                        /* add replaces the multiply */
    }
}

Share Common Subexpressions

Reuse portions of expressions.

GCC will do this with -O1

double test_scs(double* val, long i, long j, long n) {
    double up, down, left, right;

    up = val[(i - 1) * n + j];
    down = val[(i + 1) * n + j];
    left = val[i * n + (j - 1)];
    right = val[i * n + (j + 1)];
    return up + down + left + right;
}
double test_scs_opt(double *val, long i, long j, long n) {
    double up, down, left, right;

    long inj = i * n + j;               /* common subexpression computed once */

    up = val[inj - n];
    down = val[inj + n];
    left = val[inj - 1];
    right = val[inj + 1];
    return up + down + left + right;
}
while ! [ -r 4_3.o ]; do sleep .1; done; objdump -d 4_3.o

The dump above shows only one imul, which confirms that the common subexpression is shared.

Remove Unnecessary Procedure Calls

Use your intuition here: the compiler generally cannot tell which calls are unnecessary (see Optimization Blockers below).

Optimization Blockers

Compilers cannot always optimize your code.

void lower(char *s) {
    size_t i;
    for (i = 0; i < strlen(s); i++) {
        if (s[i] >= 'A' && s[i] <= 'Z') {
            s[i] -= ('A' - 'a');
        }
    }
}

The performance of the code above is bad: the running time quadruples when the string length doubles. strlen is called on every iteration of the loop, and strlen itself is O(n), so the overall performance of lower is O(n^2).

We therefore optimize with code motion, moving the length calculation out of the loop.

void lower(char *s) {
    size_t i;
    size_t len = strlen(s);
    for (i = 0; i < len; i++) {
        if (s[i] >= 'A' && s[i] <= 'Z') {
            s[i] -= ('A' - 'a');
        }
    }
}

#1 Procedure Calls

A procedure may have side effects, and a function may not return the same value for the same arguments.

So the compiler treats a procedure call as a black box and applies only weak optimizations near it; strong optimizations such as code motion are not applied across the call.
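A minimal sketch of why the compiler must be this conservative (hypothetical snippet, not from the lecture): a length function that updates a global counter has a side effect, so hoisting the call out of the loop would change the program's behavior.

#include <stddef.h>

long call_count = 0;

size_t counted_len(const char *s) {
    call_count++;                       /* side effect: observable global state */
    size_t n = 0;
    while (s[n] != '\0') n++;
    return n;
}

/* The compiler cannot hoist counted_len(s) out of this loop:
 * doing so would change the final value of call_count. */
void shout(char *s) {
    for (size_t i = 0; i < counted_len(s); i++) {
        if (s[i] >= 'a' && s[i] <= 'z')
            s[i] -= ('a' - 'A');
    }
}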

To enable strong optimizations, either use inline functions (GCC does this with -O1, within a single file) or do the code motion yourself.

#2 Memory Aliasing

void sum_rows(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++) {
            b[i] += a[i * n + j];
        }
    }
}
while ! [ -r 4_4.o ]; do sleep .1; done; objdump -d 4_4.o -Msuffix

The compiler stores to b[i] on every inner-loop iteration, because it must consider the possibility that the update affects later reads of a (a and b may overlap: memory aliasing).

Memory aliasing means two different memory references specify the same location. This happens easily in C because of address arithmetic and direct access to storage structures.
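A short sketch of how aliasing can arise with the sum_rows defined above (hypothetical call, not from the lecture): if b points into the middle of a, every store to b[i] also overwrites elements of a that later iterations read, so the compiler cannot keep b[i] in a register.

#include <stdio.h>

int main(void) {
    double A[9] = { 0,  1,  2,
                    4,  8, 16,
                   32, 64, 128 };

    /* b aliases the middle row of a: writes to b[i] change A[3..5]. */
    sum_rows(A, A + 3, 3);

    printf("%f %f %f\n", A[3], A[4], A[5]);
    return 0;
}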

void sum_rows(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        double val = 0;
        for (j = 0; j < n; j++) {
            val += a[i * n + j];
        }
        b[i] = val;
    }
}
while ! [ -r 4_5.o ]; do sleep .1; done; objdump -d 4_5.o -Msuffix

By accumulating into a local variable, we easily get optimized code: the inner loop no longer reads or writes memory that could alias.

Exploiting Instruction-Level Parallelism (ILP)

Execute multiple instructions at the same time. This can reduce the average cycles per instruction, but exploiting it requires a general understanding of modern processor design: the hardware can execute many operations in parallel.

  • Performance is limited by data dependencies

Simple transformations can yield dramatic performance improvements.
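A minimal sketch of such a transformation (hypothetical snippet, not from the lecture): splitting one long dependency chain into two independent accumulators lets the hardware overlap the floating-point additions.

/* One accumulator: every add depends on the previous one (a serial chain). */
double sum1(const double *a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Two accumulators: the chains are independent, so a pipelined,
 * superscalar FP adder can work on both in parallel. */
double sum2(const double *a, long n) {
    double s0 = 0.0, s1 = 0.0;
    long i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)                          /* leftover element when n is odd */
        s0 += a[i];
    return s0 + s1;
}

Note that this reassociates floating-point additions, so the result may differ in the last bits from the single-accumulator version.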

Superscalar Processors

Superscalar processors issue and execute multiple instructions in one cycle.

Pipelining lets consecutive instructions overlap, but data dependencies between them limit how much overlap is possible.

For example, the Haswell CPU has the following functional units:

  • 2 load
  • 1 store
  • 4 integer
  • 2 FP mult
  • 1 FP add
  • 1 FP div
  • 1 int mult

Programming with AVX2

YMM registers: 256 bits wide, 16 registers in total.

SIMD Operations

For single precision: vaddps %ymm0, %ymm1, %ymm1 (adds 8 packed floats).

For double precision: vaddpd %ymm0, %ymm1, %ymm1 (adds 4 packed doubles).
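A minimal sketch in C using AVX intrinsics (hypothetical snippet, not from the lecture) that corresponds to vaddpd: each _mm256_add_pd adds 4 packed doubles held in a YMM register. Compile with something like gcc -O2 -mavx2.

#include <immintrin.h>

/* Adds two arrays of doubles 4 elements at a time using 256-bit YMM registers.
 * Assumes n is a multiple of 4 to keep the sketch short. */
void vec_add(const double *a, const double *b, double *c, long n) {
    for (long i = 0; i < n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);   /* load 4 doubles */
        __m256d vb = _mm256_loadu_pd(b + i);
        __m256d vc = _mm256_add_pd(va, vb);    /* compiles to vaddpd */
        _mm256_storeu_pd(c + i, vc);           /* store 4 doubles */
    }
}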