Floating Point
Fractional Binary Number
representation:
- for w = i + j + 1 bits of data b, the value is
\sum_{k = -j}^{i} b_k \times 2^k
for example:
- 5 + 3/4 = 23/4 = 101.11_2
- 1 + 7/16 = 23/16 = 1.0111_2
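The sum above can be evaluated mechanically. The following is a minimal sketch, assuming the input is a string of `0`/`1` characters with at most one `.`; the function name `parse_binary_fraction` is hypothetical and not part of the notes.

```c
#include <stddef.h>

/* Evaluate a fractional binary string such as "101.11" as the sum of
 * b_k * 2^k. Assumes only '0', '1' and at most one '.' appear. */
double parse_binary_fraction(const char *s)
{
    double value = 0.0;
    double weight = 0.5;   /* weight of the first bit after the point: 2^-1 */
    int seen_point = 0;

    for (size_t i = 0; s[i] != '\0'; i++) {
        if (s[i] == '.') {
            seen_point = 1;
        } else if (!seen_point) {
            /* integer part: shift what we have left by one bit, add the new bit */
            value = value * 2.0 + (s[i] - '0');
        } else {
            /* fractional part: add b_k * 2^k for k = -1, -2, ... */
            value += (s[i] - '0') * weight;
            weight /= 2.0;
        }
    }
    return value;
}
```

Calling `parse_binary_fraction("101.11")` reproduces the first example, 23/4.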
Limitations
- Can only exactly represent numbers of the form x/2^k
- Just one setting of the binary point within the w bits, which limits the range: very small or very large values cannot be represented
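The first limitation can be checked on rational numbers directly. This is a sketch under the assumption that the value is given as a fraction num/den with den > 0; it ignores the finite width w and only tests whether the value has the form x/2^k. The names `gcd_ul` and `is_dyadic` are hypothetical.

```c
/* Greatest common divisor by Euclid's algorithm. */
static unsigned long gcd_ul(unsigned long a, unsigned long b)
{
    while (b != 0) {
        unsigned long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Returns 1 if num/den (den > 0) can be written as x / 2^k, else 0.
 * After reducing the fraction, the denominator must be a power of two. */
int is_dyadic(unsigned long num, unsigned long den)
{
    unsigned long d = den / gcd_ul(num, den);
    return (d & (d - 1)) == 0;   /* a power of two has exactly one set bit */
}
```

For instance, 23/4 passes the check, while 1/10 and 1/3 do not, which is why decimal 0.1 has no exact fractional binary representation.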
IEEE Floating Point Definition
IEEE Standard 754
Driven by numerical concerns:
- Nice standards for rounding, overflow, underflow
- But hard to make fast in hardware
- Numerical analysts predominated over hardware designers in defining the standard
Representation
Form
(-1)^s M 2^E
- s: sign bit
- M: mantissa, a fractional value in [1.0, 2.0)
- E: exponent
Encoding
---
title: "Single Precision"
config:
packet:
bitsPerRow: 32
rowHeight: 32
---
packet
+1: "s"
+8: "exp"
+23: "frac"
---
title: "Double Precision"
config:
packet:
bitsPerRow: 32
rowHeight: 32
---
packet
+1: "s"
+11: "exp"
+52: "frac"
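The single-precision layout (1 sign bit, 8 exp bits, 23 frac bits) can be pulled apart with shifts and masks. This is a sketch that assumes `float` is a 32-bit IEEE 754 value on the target platform; `struct float_fields` and `split_float` are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* The three encoded fields of a single-precision value. */
struct float_fields {
    uint32_t s;     /* 1 bit   */
    uint32_t exp;   /* 8 bits  */
    uint32_t frac;  /* 23 bits */
};

/* Split a float into its s / exp / frac fields.
 * Assumes float is 32-bit IEEE 754. */
struct float_fields split_float(float x)
{
    uint32_t bits;
    struct float_fields f;

    memcpy(&bits, &x, sizeof bits);   /* copy the raw bit pattern */
    f.s    = bits >> 31;
    f.exp  = (bits >> 23) & 0xFFu;
    f.frac = bits & 0x7FFFFFu;
    return f;
}
```

For `1.0f` this gives s = 0, exp = 127, frac = 0. The same approach applies to double precision with 11 exp bits and 52 frac bits.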
There are three kinds of floating-point values: normalized, denormalized, special
normalized
E = \exp - B
B = 2^{k-1}-1 where k is number of exp bits
- single: 127
- double: 1023
M = 1.xxxxx
minimum when \text{frac} = 0000...\quad (M = 1.0)
maximum when \text{frac} = 1111...\quad (M = 2.0 - \epsilon)
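The normalized rules can be applied by hand to a single-precision bit pattern. The sketch below assumes the exp field is neither all zeros nor all ones, and uses B = 127; `decode_normalized` is a hypothetical name, and the power of two is built by repeated doubling or halving to avoid any math library call.

```c
#include <stdint.h>

/* Decode a normalized single-precision bit pattern as (-1)^s * M * 2^E.
 * Assumes 1 <= exp <= 254; other exp values are not normalized. */
double decode_normalized(uint32_t bits)
{
    int s = (int)(bits >> 31);
    int exp = (int)((bits >> 23) & 0xFFu);
    uint32_t frac = bits & 0x7FFFFFu;

    int E = exp - 127;                           /* E = exp - B */
    double M = 1.0 + (double)frac / 8388608.0;   /* M = 1.frac, 8388608 = 2^23 */

    double value = M;
    for (int i = 0; i < E; i++) value *= 2.0;    /* positive E: scale up   */
    for (int i = 0; i > E; i--) value /= 2.0;    /* negative E: scale down */

    return s ? -value : value;
}
```

For example, the pattern `0x40B80000` has exp = 129 and frac = 0111000..., so E = 2 and M = 1.0111_2, which gives 101.11_2 = 5.75.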
denormalized
when exp = 000...0
E = 1 - B (instead of E = \exp - B)
M = 0.xxxxx (no implied leading 1)
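The denormalized rules differ from the normalized ones only in the fixed exponent and the missing leading 1. A minimal sketch for single precision, assuming the exp field of the input is all zeros; `decode_denormalized` is a hypothetical name.

```c
#include <stdint.h>

/* Decode a denormalized single-precision bit pattern.
 * Assumes the exp field is 000...0. */
double decode_denormalized(uint32_t bits)
{
    uint32_t frac = bits & 0x7FFFFFu;

    double value = (double)frac / 8388608.0;   /* M = 0.frac, 8388608 = 2^23 */
    for (int i = 0; i < 126; i++) {
        value /= 2.0;                          /* E = 1 - 127 = -126 */
    }
    return (bits >> 31) ? -value : value;
}
```

With frac = 000...0 the result is zero, and with only the lowest frac bit set it is 2^-149, the smallest positive single-precision value.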
special
when exp = 111...1
- case exp = 111...1, frac = 000...0
  - represents \infty
  - result of an operation that overflows
- case exp = 111...1, frac \neq 000...0
  - represents NaN (Not-a-Number)
  - the case when no numeric value can be determined
  - e.g., sqrt(-1), inf - inf, inf * 0
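The three kinds can be told apart from the exp and frac fields alone. The following sketch classifies a single-precision bit pattern; the enum and the name `classify_bits` are hypothetical, chosen to avoid the `FP_*` macros of `<math.h>`.

```c
#include <stdint.h>

enum float_kind {
    KIND_NORMALIZED,
    KIND_DENORMALIZED,
    KIND_INFINITY,
    KIND_NAN
};

/* Classify a single-precision bit pattern by its exp and frac fields. */
enum float_kind classify_bits(uint32_t bits)
{
    uint32_t exp = (bits >> 23) & 0xFFu;
    uint32_t frac = bits & 0x7FFFFFu;

    if (exp == 0xFFu) {
        /* special: frac decides between infinity and NaN */
        return frac == 0 ? KIND_INFINITY : KIND_NAN;
    }
    if (exp == 0) {
        return KIND_DENORMALIZED;   /* includes +0 and -0 */
    }
    return KIND_NORMALIZED;
}
```

Under this sketch `0x7F800000` is infinity, `0x7F800001` is NaN, and `0x3F800000` (1.0) is normalized.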
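#include <stdio.h>
#include <string.h>
/* Binary literals with ' digit separators require C23 (or C++14). */
int main() {
    // layout: sign ' exp ' frac
    unsigned x_a = 0b0'11111111'00000000000000000000000; // +infinity
    unsigned x_b = 0b0'11111111'00000000000000000000001; // NaN
    unsigned x_c = 0b0'01111111'00000000000000000000000; // 1.0
    float a, b, c;
    // memcpy reinterprets the bits without violating strict aliasing
    memcpy(&a, &x_a, sizeof a);
    memcpy(&b, &x_b, sizeof b);
    memcpy(&c, &x_c, sizeof c);
    double cx = c; // widening keeps the value but changes the encoding
    unsigned long long x_cx;
    memcpy(&x_cx, &cx, sizeof x_cx);
    printf("%08x: %f\n", x_a, a);
    printf("%08x: %f\n", x_b, b);
    printf("%08x: %f\n", x_c, c);
    printf("%016llx: %f\n", x_cx, cx);
    return 0;
}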
Properties
- FP zero has the same bit pattern as integer zero (all bits 0)
- Can (almost) use unsigned integer comparison; the exceptions are the sign bit, -0 = +0, and NaN
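One common way to handle the sign-bit exception is to remap the bit pattern before comparing it as an unsigned integer. This trick is not from the notes themselves; it is a sketch that assumes 32-bit IEEE 754 floats and inputs that are not NaN, and `float_order_key` is a hypothetical name.

```c
#include <stdint.h>
#include <string.h>

/* Map a float to an unsigned key whose integer order matches the
 * floating-point order for non-NaN inputs.
 * - negative values: flip every bit, so larger magnitudes sort lower
 * - non-negative values: set the top bit, so they sort above negatives */
uint32_t float_order_key(float x)
{
    uint32_t bits;

    memcpy(&bits, &x, sizeof bits);
    return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}
```

With this key, `float_order_key(-2.0f) < float_order_key(-1.0f) < float_order_key(1.0f)`. Note that -0 still sorts strictly below +0, although the two compare equal as floating-point values, which is one of the reasons for the "almost".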
Arithmetic
x +_f y = \text{Round}(x + y)
x \times_f y = \text{Round}(x \times y)
Idea:
- compute exact result
- make it fit into desired precision
  - overflow if too large
  - round to fit into frac
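The idea above can be imitated for single precision by computing in a wider type and then rounding. This is only a sketch: it assumes IEEE 754 `float` and `double` with the default rounding mode, and the intermediate double sum is exact only when the operands are close enough in magnitude. The name `add_rounded` is hypothetical.

```c
#include <float.h>

/* Add two floats by computing the sum in double precision and then
 * rounding it to fit a float. Assumes IEEE 754 float/double and the
 * default round-to-nearest-even mode. */
float add_rounded(float x, float y)
{
    double wide = (double)x + (double)y;   /* stands in for the "exact" result */
    return (float)wide;                    /* make it fit: round, or overflow  */
}
```

For example, 2^24 + 1 = 16777217 needs 25 significant bits, so `add_rounded(16777216.0f, 1.0f)` is rounded back to 16777216. Adding `FLT_MAX` to itself is too large for the exp field and overflows to infinity on IEEE 754 platforms.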
Rounding
- Towards zero
- Round down (towards -\infty)
- Round up (towards +\infty)
- Nearest even (default)
Nearest even is the default rounding mode. Any other rounding mode is hard to get without dropping into assembly, although C99 supports rounding-mode management through <fenv.h> (fegetround, fesetround).
This rounding mode is used because it reduces statistical bias: halfway cases round up about half of the time and down the other half.
For binary fractional numbers:
- "Even" when the least significant bit is 0
- "Half way" when the bits to the right of the rounding position are 100..._2
For example, rounding to the nearest 1/4:
| Binary Value | Rounded | Action |
|---|---|---|
| 10.00011 | 10.00 | (<1/2) Down |
| 10.00110 | 10.01 | (>1/2) Up |
| 10.11100 | 11.00 | (=1/2) Up |
| 10.10100 | 10.10 | (=1/2) Down |
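The table can be reproduced on integers by treating each binary value as a fixed-point number. The sketch below assumes the value is stored as an unsigned integer and that `drop` low bits (at least 1, fewer than the width of `unsigned`) are removed; `round_nearest_even` is a hypothetical name.

```c
/* Round v to nearest-even while dropping its `drop` lowest bits.
 * Assumes 1 <= drop < number of bits in unsigned. */
unsigned round_nearest_even(unsigned v, unsigned drop)
{
    unsigned kept = v >> drop;                 /* bits that survive          */
    unsigned rem  = v & ((1u << drop) - 1u);   /* bits that are removed      */
    unsigned half = 1u << (drop - 1u);         /* the pattern 100...0        */

    if (rem > half) {
        kept++;                                /* more than half way: up     */
    } else if (rem == half && (kept & 1u)) {
        kept++;                                /* exactly half way, odd: up  */
    }
    return kept;                               /* otherwise: down            */
}
```

Here 10.11100_2 is the integer 92 with five fractional bits. Dropping three of them leaves 1011_2 with remainder 100_2, a halfway case with an odd last bit, so the result is 1100_2, that is 11.00_2.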
BBGRXXX
- G: Guard bit: LSB of result
- R: Round bit: first bit removed
- X: Sticky bits: the sticky bit S is the OR of the remaining bits (001 -> 1, 000 -> 0)
Round up conditions
- R = 1, S = 1 -> more than 0.5, round up
- G = 1, R = 1, S = 0 -> round to even
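The guard, round and sticky bits can be computed directly when a 52-bit double frac is shortened to a 23-bit float frac. This sketch handles the frac field only: it assumes the caller deals with the exponent, including the case where rounding up carries out of the 23 bits (result `0x800000`). The name `round_frac_grs` is hypothetical.

```c
#include <stdint.h>

/* Shorten a 52-bit frac to 23 bits using guard/round/sticky bits.
 * A result of 0x800000 means the mantissa overflowed and the caller
 * would have to bump the exponent. */
uint32_t round_frac_grs(uint64_t frac52)
{
    uint32_t kept = (uint32_t)(frac52 >> 29);                   /* top 23 bits         */
    unsigned g = kept & 1u;                                     /* guard: LSB of result */
    unsigned r = (unsigned)((frac52 >> 28) & 1u);               /* round: first removed */
    unsigned s = (frac52 & ((UINT64_C(1) << 28) - 1u)) != 0;    /* sticky: OR of rest   */

    /* round up when above half way (R=1, S=1) or at half way with an odd LSB */
    if (r && (s || g)) {
        kept++;
    }
    return kept;
}
```

A frac whose kept bits end in ...10 followed by a removed 1000...0 is an exact halfway case with G = 0, so it stays as is; the same removed bits after kept bits ending in ...11 round up.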
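#include <stdio.h>
#include <string.h>
/* Binary literals with ' digit separators require C23 (or C++14). */
int main() {
    // double layout: sign ' 11-bit exp ' 52-bit frac
    unsigned long long
        tb = 0b0'10000010000'0000000000000000000001010000000000000000000000000000;
    unsigned xb = 0b0'10000001'01000000000000000000011; // float pattern, not used below
    (void)xb;
    double t;
    memcpy(&t, &tb, sizeof t); // reinterpret the bits without violating strict aliasing
    float x = t;               // double -> float rounds frac from 52 to 23 bits
    unsigned bits;
    memcpy(&bits, &x, sizeof bits);
    // print the float as sign/exp/frac
    for (int i = 31; i >= 0; i--) {
        if (i == 31 - 1 || i == 31 - 1 - 8) {
            printf("/");
        }
        printf("%u", (bits >> i) & 1u); // shifting the value avoids the signed overflow of 1 << 31
    }
    printf("\n");
    printf("%f\n", x);
    return 0;
}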