# Optimization

There's more to performance than asymptotic (time) complexity.

Not all instructions take the same amount of time, so constant factors matter too. To optimize performance, we need to understand the system:

* How programs are compiled and executed
* How modern processors and memory systems operate
* How to measure performance and identify bottlenecks
* How to improve performance without destroying code modularity and generality

Optimizing compilers provide an efficient mapping of the program to machine code:

* Register allocation
* Code selection and ordering (scheduling)
* Dead code elimination
* Eliminating minor inefficiencies

**Compilers usually don't improve asymptotic efficiency**. Choosing a better algorithm is up to the programmer.

## Generally Useful Optimizations

### Code Motion (Hoisting)

Reduce how often a computation is performed. If it always produces the same result, move it to a place where it is computed once and the result is reused. The most common case is moving code out of a loop.

```c {cmd=gcc args=[-Og -x c -c $input_file -o 4_1.o]}
void set_row(double *a, double *b, long i, long n) {
    long j;
    for (j = 0; j < n; j++) {
        a[i * n + j] = b[j];
    }
}
```

<table>
<tr><th>Default</th><th>Optimized</th></tr>
<tr><td>

```c {cmd=gcc args=[-O1 -x c -c $input_file -o 4_2.o]}
void set_row(double *a, double *b, long i, long n) {
    long j;
    for (j = 0; j < n; j++) {
        a[i * n + j] = b[j];
    }
}
```

</td><td>

```c
void set_row_opt(double *a, double *b, long i, long n) {
    long j;
    long ni = n * i; /* hoisted: computed once; long avoids truncating n * i */
    for (j = 0; j < n; j++) {
        a[ni + j] = b[j];
    }
}
```

</td></tr>
<tr>
<td>

```sh {cmd hide}
while ! [ -r 4_1.o ]; do sleep .1; done; objdump -d 4_1.o
```

Compiled with `-Og`: the `imul` is inside the loop.

</td>
<td>

```sh {cmd hide}
while ! [ -r 4_2.o ]; do sleep .1; done; objdump -d 4_2.o
```

Compiled with `-O1`: the `imul` has been moved out of the loop.

</td>
</tr>
</table>

The two versions contain the same number of instructions, but the optimized version **executes fewer instructions**, because the multiplication runs once instead of once per iteration.

GCC performs this optimization at `-O1`.

### Reduction in Strength

Replace a costly operation with a simpler one.

For example, a multiplication by a power of 2 can be replaced by a shift. Multiplication and division are typically the expensive operations: on Intel Nehalem, `imul` takes 3 CPU cycles, whereas `add` takes 1 cycle. In the example below, the multiplication `n * i` in the outer loop is replaced by a running sum `ni += n`.

<table>
<tr><th>Default</th><th>Optimized</th></tr>
<tr><td>

```c
void test_reduction(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        long ni = n * i; /* one multiplication per row */
        for (j = 0; j < n; j++) {
            a[ni + j] = b[j];
        }
    }
}
```

</td><td>

```c
void test_reduction_opt(double *a, double *b, long n) {
    long i, j;
    long ni = 0;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            a[ni + j] = b[j];
        }
        ni += n; /* addition replaces the multiplication */
    }
}
```

</td></tr>
</table>

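The power-of-two case mentioned above can be sketched as follows. The function names `times16_mul` and `times16_shift` are made up for illustration, and the sketch uses unsigned values because left-shifting a negative signed integer is undefined in C. In practice a compiler will typically perform this rewrite on its own.

```c
#include <stdint.h>

/* Multiply by 16 using a multiplication (illustrative name). */
uint64_t times16_mul(uint64_t x) {
    return x * 16;
}

/* Same result with a shift: 16 == 2^4, so x * 16 == x << 4.
 * Unsigned arithmetic wraps modulo 2^64 in both versions. */
uint64_t times16_shift(uint64_t x) {
    return x << 4;
}
```

Both functions return the same value for every input, so the cheaper form can be substituted freely.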
### Share Common Subexpressions

Reuse portions of expressions that are computed more than once.

GCC performs this optimization at `-O1`.

<table>
<tr><th>Default</th><th>Optimized</th></tr>
<tr><td>

```c {cmd=gcc args=[-O1 -x c -c $input_file -o 4_3.o]}
double test_scs(double* val, long i, long j, long n) {
    double up, down, left, right;

    up = val[(i - 1) * n + j];
    down = val[(i + 1) * n + j];
    left = val[i * n + (j - 1)];
    right = val[i * n + (j + 1)];
    return up + down + left + right;
}
```

</td><td>

```c
double test_scs_opt(double *val, long i, long j, long n) {
    double up, down, left, right;

    long inj = i * n + j; /* shared subexpression, computed once */

    up = val[inj - n];
    down = val[inj + n];
    left = val[inj - 1];
    right = val[inj + 1];
    return up + down + left + right;
}
```

</td></tr>
</table>

```sh {cmd hide}
while ! [ -r 4_3.o ]; do sleep .1; done; objdump -d 4_3.o
```

The dump above contains only one `imul`, which shows that the common subexpression was shared.

### Remove Unnecessary Procedure

Use your intuition to spot and remove calls that do no useful work.

## Optimization Blockers

Compilers cannot always optimize your code. Consider the following function:

```c
#include <string.h> /* strlen, size_t */

void lower(char *s) {
    size_t i;
    for (i = 0; i < strlen(s); i++) { /* strlen runs on every iteration */
        if (s[i] >= 'A' && s[i] <= 'Z') {
            s[i] -= ('A' - 'a');
        }
    }
}
```

This code performs badly: the running time quadruples when the string length doubles. The reason is that `strlen` is called on every iteration of the loop. `strlen` itself is $O(n)$, so the overall cost of `lower` is $O(n^2)$.

We can fix this with code motion, moving the length calculation out of the loop:

```c
#include <string.h> /* strlen, size_t */

void lower(char *s) {
    size_t i;
    size_t len = strlen(s); /* computed once, before the loop */
    for (i = 0; i < len; i++) {
        if (s[i] >= 'A' && s[i] <= 'Z') {
            s[i] -= ('A' - 'a');
        }
    }
}
```

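One way to see the difference without timing anything is to count how often the length is recomputed. The sketch below assumes a hypothetical wrapper `counted_strlen` and a global counter `strlen_calls`; neither is part of the original code, and the two variants are renamed `lower_slow` and `lower_fast` so they can coexist in one file.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical instrumentation: number of times the length was computed. */
size_t strlen_calls = 0;

size_t counted_strlen(const char *s) {
    strlen_calls++;
    return strlen(s);
}

/* Length recomputed in the loop condition:
 * n + 1 calls for a string of length n. */
void lower_slow(char *s) {
    size_t i;
    for (i = 0; i < counted_strlen(s); i++) {
        if (s[i] >= 'A' && s[i] <= 'Z') {
            s[i] -= ('A' - 'a');
        }
    }
}

/* Length hoisted out of the loop: exactly one call. */
void lower_fast(char *s) {
    size_t i;
    size_t len = counted_strlen(s);
    for (i = 0; i < len; i++) {
        if (s[i] >= 'A' && s[i] <= 'Z') {
            s[i] -= ('A' - 'a');
        }
    }
}
```

Since each of those calls scans the whole string, the slow variant does roughly `n * n` character reads in total while the fast one does about `2 * n`.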
### #1 Procedure Calls

A procedure may have side effects, and a function may not return the same value every time it is called with the same arguments.

The compiler therefore treats a procedure call as a black box and optimizes only weakly around it. Strong optimizations such as **Code Motion** are not applied, which is why the compiler could not hoist `strlen` out of the loop in `lower`.

To get the stronger optimizations, either use inline functions with the `-O1` option, or **do the code motion yourself**.

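A small illustration of why a call cannot be treated like a pure expression. All names here (`counter`, `next_id`, `sum_two_calls`, `double_one_call`) are hypothetical and chosen only for this sketch.

```c
/* Hypothetical global state modified by the call. */
long counter = 0;

/* Has a side effect and returns a different value on each call. */
long next_id(void) {
    return counter++;
}

/* Calls next_id() twice: starting from counter == 0 this is 0 + 1. */
long sum_two_calls(void) {
    return next_id() + next_id();
}

/* Looks like an equivalent rewrite, but calls next_id() only once,
 * so both the result and the final value of counter differ. */
long double_one_call(void) {
    return 2 * next_id();
}
```

A compiler that cannot see the body of `next_id` has to assume this kind of behavior, so it must keep both calls.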
### Memory Aliasing

```c {cmd=gcc args=[-O1 -x c -c $input_file -o 4_4.o]}
void sum_rows(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++) {
            b[i] += a[i * n + j];
        }
    }
}
```
```sh {cmd hide}
while ! [ -r 4_4.o ]; do sleep .1; done; objdump -d 4_4.o -Msuffix
```

The compiler reads and writes `b[i]` in memory on every iteration, because it must consider the possibility that these updates affect program behavior: `a` and `b` may refer to overlapping memory (memory aliasing).

Memory aliasing means that two different memory references specify the same location.
In C this happens easily, because the language allows address arithmetic and direct access to storage structures.

```c {cmd=gcc args=[-O1 -x c -c $input_file -o 4_5.o]}
void sum_rows(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        double val = 0;
        for (j = 0; j < n; j++) {
            val += a[i * n + j];
        }
        b[i] = val;
    }
}
```
```sh {cmd hide}
while ! [ -r 4_5.o ]; do sleep .1; done; objdump -d 4_5.o -Msuffix
```

By accumulating in a local variable, we tell the compiler that aliasing is not a concern, and it can easily generate optimized code.

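The two versions above are only equivalent when `a` and `b` do not overlap. The sketch below repeats them under the illustrative names `sum_rows1` and `sum_rows2` so that both can be called on deliberately overlapping arrays.

```c
/* Accumulates directly in b[i]: every update goes through memory. */
void sum_rows1(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++) {
            b[i] += a[i * n + j];
        }
    }
}

/* Accumulates in a local variable and stores once per row. */
void sum_rows2(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        double val = 0;
        for (j = 0; j < n; j++) {
            val += a[i * n + j];
        }
        b[i] = val;
    }
}
```

With `a = {0, 1, 2, 4, 8, 16, 32, 64, 128}`, `n = 3`, and `b = a + 3` (so `b` overlaps the second row of `a`), `sum_rows1` leaves `3, 22, 224` in `b`, while `sum_rows2` leaves `3, 27, 224`. The compiler cannot turn the first version into the second on its own because the results differ in this case.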
## Exploiting Instruction-Level Parallelism (ILP)

Modern hardware can execute multiple instructions at the same time, which reduces the average number of cycles per instruction. Exploiting this requires a general understanding of modern processor design.

* Performance is limited by data dependencies

Simple transformations can yield dramatic performance improvements.

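One such transformation is to split a single accumulator into several independent ones, so that consecutive additions no longer depend on each other. The sketch below is only an illustration: the function names are hypothetical, and it sums integers so that both versions return exactly the same value (with floating point, reordering the additions can change rounding).

```c
#include <stddef.h>

/* One accumulator: every addition depends on the previous one. */
long sum_serial(const long *v, size_t n) {
    long acc = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        acc += v[i];
    }
    return acc;
}

/* Two accumulators: the two additions in each iteration are
 * independent, so the hardware may execute them in parallel. */
long sum_two_acc(const long *v, size_t n) {
    long acc0 = 0, acc1 = 0;
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += v[i];
        acc1 += v[i + 1];
    }
    for (; i < n; i++) { /* leftover element when n is odd */
        acc0 += v[i];
    }
    return acc0 + acc1;
}
```

Whether this is actually faster depends on the processor and on what the compiler already does with the simple loop, so it is worth measuring.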
### Superscalar Processors
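
A superscalar processor can issue and execute multiple instructions in one cycle.

Functional units are pipelined: a unit can start a new operation before the previous one finishes, but only if the new operation does not depend on a result that is still being computed (data dependency).

For example, the functional units of a Haswell CPU:

* 2 load
* 1 store
* 4 integer
* 2 FP mult
* 1 FP add
* 1 FP div
* 1 int mult
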
### Programming with AVX2

YMM registers: 256 bits wide, 16 registers in total.

**SIMD Operations**

For single precision (8 floats per register):

`vaddps %ymm0, %ymm1, %ymm1`

For double precision (4 doubles per register):

`vaddpd %ymm0, %ymm1, %ymm1`

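The effect of a packed single-precision add can be sketched in C. This sketch uses the GCC/Clang `vector_size` extension rather than AVX intrinsics, so it compiles without special flags; whether the compiler actually emits `vaddps` on YMM registers depends on target options such as `-mavx2`, which is an assumption about the toolchain. The names `v8sf` and `add8` are chosen for illustration.

```c
#include <string.h>

/* GCC/Clang vector extension: 8 floats packed into 32 bytes,
 * the width of one YMM register. */
typedef float v8sf __attribute__((vector_size(32)));

/* dst[i] = x[i] + y[i] for i in 0..7, written as one vector add. */
void add8(const float *x, const float *y, float *dst) {
    v8sf vx, vy, vsum;
    memcpy(&vx, x, sizeof vx); /* load without alignment requirements */
    memcpy(&vy, y, sizeof vy);
    vsum = vx + vy;            /* element-wise add on all 8 lanes */
    memcpy(dst, &vsum, sizeof vsum);
}
```

The double-precision case has the same shape with 4 lanes of `double` instead of 8 lanes of `float`.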