# Floating Point

## Fractional Binary Number

Representation:

* for a $w = i + j + 1$ bit pattern $b_i \cdots b_1 b_0 . b_{-1} \cdots b_{-j}$, the value is
$$\sum_{k = -j}^{i}b_k\times 2^k$$

For example:
* $5+3/4 = 23/4 = 101.11_2$
* $1+7/16 = 23/16 = 1.0111_2$

**Limitations**
* Can only exactly represent numbers of the form $x/2^k$; other values such as $1/10$ have repeating bit patterns
* There is just one position for the binary point within the $w$ bits, so very small and very large values cannot both be represented

## IEEE Floating Point Definition

**IEEE Standard 754**

Driven by numerical concerns:
* Nice standards for rounding, overflow, underflow
* But hard to make fast in hardware
  * Numerical analysts predominated over hardware designers in defining the standard


### Representation

**Form**

$$(-1)^s M 2^E$$

* $s$: sign bit
* $M$: mantissa (significand), normally a fractional value in $[1.0,2.0)$
* $E$: exponent

**Encoding**

```mermaid
---
title: "Single Precision"
config:
    packet:
        bitsPerRow: 32
        rowHeight: 32
---
packet
+1: "s"
+8: "exp"
+23: "frac"
```

```mermaid
---
title: "Double Precision"
config:
    packet:
        bitsPerRow: 32
        rowHeight: 32
---
packet
+1: "s"
+11: "exp"
+52: "frac"
```

There are three kinds of floating-point values: **normalized**, **denormalized**, **special**

**normalized**

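when `exp != 000...0` and `exp != 111...1`

$E = \text{exp} - B$, where the bias is $B = 2^{k-1}-1$ and $k$ is the number of exp bits
* single: 127
* double: 1023

$M = 1.xxxxx$ (the leading 1 is implied, not stored)
* minimum when $\text{frac} = 0000...\quad (M = 1.0)$
* maximum when $\text{frac} = 1111... \quad (M = 2.0 - \epsilon)$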

**denormalized**

when `exp=000...0`

$E = 1 - B$, with the same bias $B$ as for normalized values (note: not $0 - B$)

$M = 0.xxxxx$

**special**

when `exp = 111...1`

* case `exp = 111...1, frac = 000...0`

  * represents $\infty$ (positive or negative, depending on $s$)
  * result of an operation that overflows

* case `exp = 111...1, frac != 000...0`

  * represents `NaN` (not a number)
  * used when no numeric value can be determined
    * e.g., `sqrt(-1)`, `inf - inf`, `inf * 0`

```c {cmd="gcc-14" args=[-x c $input_file --std=c23 -O0 -m32 -o 2_1.out]}
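#include <stdio.h>
#include <string.h>

int main() {
  // layout: sign | exp (8 bits) | frac (23 bits)
  unsigned x_a = 0b0'11111111'00000000000000000000000; // exp all ones, frac zero: +inf
  unsigned x_b = 0b0'11111111'00000000000000000000001; // exp all ones, frac nonzero: NaN
  unsigned x_c = 0b0'01111111'00000000000000000000000; // exp = bias, frac zero: 1.0
  float a, b, c;
  // memcpy reinterprets the bits without violating strict aliasing
  memcpy(&a, &x_a, sizeof a);
  memcpy(&b, &x_b, sizeof b);
  memcpy(&c, &x_c, sizeof c);
  double cx = c; // widening to double changes the encoding, not the value
  unsigned long long x_cx;
  memcpy(&x_cx, &cx, sizeof x_cx);
  printf("%08x: %f\n", x_a, a);
  printf("%08x: %f\n", x_b, b);
  printf("%08x: %f\n", x_c, c);
  printf("%016llx: %f\n", x_cx, cx);
  return 0;
}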
```

```sh {cmd hide}
while ! [ -f 2_1.out ]; do sleep .1; done; ./2_1.out
```

### Properties

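* Floating-point $+0$ has the same bit pattern as integer $0$ (all bits zero)

* Can (almost) use unsigned integer comparison on the bit patterns; the exceptions are the sign bit (compare it first), $-0 = +0$, and `NaN`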

### Arithmetic

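* $x +_f y = \text{Round}(x+y)$
* $x \times_f y = \text{Round}(x\times y)$

Here $+_f$ and $\times_f$ denote the floating-point operations, and the right-hand sides use exact arithmetic.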

Idea: 
1. Compute the exact result
2. Make it fit into the desired precision
   * overflow if the exponent is too large
   * **round** to fit into `frac`

#### Rounding

* Towards zero
* Round down ($-\infty$)
* Round up ($+\infty$)
* **Nearest Even** (default)

**Nearest Even** is the default rounding mode.
The other rounding modes are hard to select without dropping into assembly, although C99 supports rounding-mode management through `<fenv.h>`.

Nearest Even is the default because it **reduces statistical bias**: ties are rounded up about half the time and down the other half.

For binary fractional numbers:
* "Even" when the least significant kept bit is $0$
* "Halfway" when the bits to the right of the rounding position are $100..._2$

For example, rounding to the nearest $1/4$ (2 bits to the right of the binary point):

| Binary Value | Rounded | Action       |
| ------------ | ------- | ------------ |
| `10.00011`   | `10.00` | (< 1/2) Down |
| `10.00110`   | `10.01` | (> 1/2) Up   |
| `10.11100`   | `11.00` | (= 1/2) Up   |
| `10.10100`   | `10.10` | (= 1/2) Down |

`BBGRXXX`

* `G`: **G**uard bit: LSB of the result
* `R`: **R**ound bit: first bit removed
* `XXX`: **S**ticky bit `S`: OR of the remaining removed bits (`001` gives 1, `000` gives 0)

Round up when:
1. `R = 1, S = 1`: more than halfway (`> .5`)
2. `G = 1, R = 1, S = 0`: exactly halfway and the result is odd, so round up to even

```c {cmd="gcc-14" args=[-x c $input_file --std=c23 -O0 -m32 -o 2_2.out]}
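#include <stdio.h>
#include <string.h>

int main() {
  // double with exp = 1040 (E = 17); frac bits 22..24 from the left are 1, 0, 1
  // float keeps 23 frac bits, so G = 0, R = 1, S = 0: a tie that rounds down to even
  unsigned long long
           tb = 0b0'10000010000'0000000000000000000001010000000000000000000000000000;
  double t;
  memcpy(&t, &tb, sizeof t);  // reinterpret the bits without violating strict aliasing
  float x = t;                // double -> float conversion rounds to nearest even
  unsigned xbits;
  memcpy(&xbits, &x, sizeof xbits);
  for(int i=31; i>=0;i--) {
    // print "/" between the sign, exp, and frac fields
    if(i == 31 - 1) {
      printf("/");
    } else if (i == 31 - 1 - 8){
      printf("/");
    }
    printf("%d", !!(xbits & (1u<<i)));  // 1u avoids signed overflow at i == 31
  }
  printf("\n");
  printf("%f\n", x);
  return 0;
}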
```
```sh {cmd hide}
while ! [ -f 2_2.out ]; do sleep .1; done; ./2_2.out
```


## Float Quiz