Version 1, author: CraigBox.
Floating point is a number representation consisting of a mantissa M, an exponent E, and an (assumed) radix (or "base") R. The number represented is M*R^E. In computer hardware the radix is usually 2; ten, familiar from scientific notation, is sometimes used as well.

Many different representations are used for the mantissa and exponent themselves. The [IEEE] specifies a standard representation, [IEEE 754|http://grouper.ieee.org/groups/754/], which is used by many hardware floating-point systems. Further documentation is available at http://cch.loria.fr/documentation/IEEE754/.

The alternative to floating point is fixed-point, in which the radix point sits at a fixed position instead of being given by an exponent.
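
The M*R^E form described at the top of this page can be observed directly in Python. The following is a minimal sketch, assuming radix R = 2; the helper name split_float is made up for illustration. Note that math.frexp normalizes the mantissa to the range [0.5, 1), which differs from the "1.F" convention IEEE 754 uses internally.

```python
import math

def split_float(x):
    """Return (M, E) such that x == M * 2**E (radix R = 2)."""
    # math.frexp returns a mantissa in [0.5, 1) for nonzero x.
    mantissa, exponent = math.frexp(x)
    return mantissa, exponent

m, e = split_float(6.5)
print(m, e)        # 0.8125 3
print(m * 2 ** e)  # 6.5
```

Here 6.5 is recovered as 0.8125 * 2^3; the same value could equally be written as 1.625 * 2^2, which is the form the IEEE encodings below use.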

!Single Precision

The IEEE single precision floating point standard representation requires a 32-bit word, whose bits may be numbered from 0 to 31, left to right. The first bit is the sign bit, S, the next eight bits are the exponent bits, E, and the final 23 bits are the fraction, F:

  S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
  0 1      8 9                    31

The value V represented by the word may be determined as follows:

* If E=255 and F is nonzero, then V=NaN ("Not a number")
* If E=255 and F is zero and S is 1, then V=-Infinity
* If E=255 and F is zero and S is 0, then V=Infinity
* If 0<E<255 then V=(-1)**S * 2 ** (E-127) * (1.F), where "1.F" represents the binary number created by prefixing F with an implicit leading 1 and a binary point.
* If E=0 and F is nonzero, then V=(-1)**S * 2 ** (-126) * (0.F). These are "denormalized" (also called "subnormal") values.
* If E=0 and F is zero and S is 1, then V=-0
* If E=0 and F is zero and S is 0, then V=0
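
The single precision rules above can be sketched as a small Python function. This is an illustrative decoder only, not how any particular library does it; the name decode_single and the convention of passing the 32-bit word as an integer are assumptions made for this example.

```python
import math

def decode_single(word):
    """Decode a 32-bit integer as an IEEE single precision value."""
    s = (word >> 31) & 0x1    # sign bit S
    e = (word >> 23) & 0xFF   # 8 exponent bits E
    f = word & 0x7FFFFF       # 23 fraction bits F
    sign = -1.0 if s else 1.0
    if e == 255:
        # F nonzero is NaN, F zero is +/- Infinity
        return math.nan if f else sign * math.inf
    if e == 0:
        # F zero gives +/- 0, otherwise a denormalized value 0.F * 2**(-126)
        return sign * (f / 2**23) * 2.0**-126
    # 0 < E < 255: normalized value 1.F * 2**(E-127)
    return sign * (1 + f / 2**23) * 2.0**(e - 127)

print(decode_single(0x40D00000))  # 6.5
print(decode_single(0x80000000))  # -0.0
```

The word 0x40D00000 is the bit pattern 0 10000001 10100000000000000000000. All single precision values fit exactly in Python's double precision floats, so the arithmetic here introduces no rounding.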

In particular,

  0 00000000 00000000000000000000000 = 0
  1 00000000 00000000000000000000000 = -0

  0 11111111 00000000000000000000000 = Infinity
  1 11111111 00000000000000000000000 = -Infinity

  0 11111111 00000100000000000000000 = NaN
  1 11111111 00100010001001010101010 = NaN

  0 10000000 00000000000000000000000 = +1 * 2**(128-127) * 1.0 = 2
  0 10000001 10100000000000000000000 = +1 * 2**(129-127) * 1.101 = 6.5
  1 10000001 10100000000000000000000 = -1 * 2**(129-127) * 1.101 = -6.5

  0 00000001 00000000000000000000000 = +1 * 2**(1-127) * 1.0 = 2**(-126)
  0 00000000 10000000000000000000000 = +1 * 2**(-126) * 0.1 = 2**(-127)
  0 00000000 00000000000000000000001 = +1 * 2**(-126) * 0.00000000000000000000001 = 2**(-149) (Smallest positive value)
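
The worked bit patterns above can be checked against the machine's own float handling. The sketch below uses Python's standard struct module to reinterpret a 32-bit word as a single precision float; the helper name bits_to_float and the space-separated string format are assumptions chosen to match the layout used on this page.

```python
import struct

def bits_to_float(pattern):
    """Interpret a string such as '0 10000001 1010...' as an IEEE single."""
    word = int(pattern.replace(" ", ""), 2)
    # Pack as a big-endian unsigned 32-bit integer, unpack as a 32-bit float.
    return struct.unpack(">f", struct.pack(">I", word))[0]

print(bits_to_float("0 10000000 00000000000000000000000"))  # 2.0
print(bits_to_float("0 10000001 10100000000000000000000"))  # 6.5
print(bits_to_float("0 00000000 00000000000000000000001") == 2.0 ** -149)  # True
```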

!Double Precision

The IEEE double precision floating point standard representation requires a 64-bit word, whose bits may be numbered from 0 to 63, left to right. The first bit is the sign bit, S, the next eleven bits are the exponent bits, E, and the final 52 bits are the fraction, F:

  S EEEEEEEEEEE FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
  0 1        11 12                                                63

The value V represented by the word may be determined as follows:

* If E=2047 and F is nonzero, then V=NaN ("Not a number")
* If E=2047 and F is zero and S is 1, then V=-Infinity
* If E=2047 and F is zero and S is 0, then V=Infinity
* If 0<E<2047 then V=(-1)**S * 2 ** (E-1023) * (1.F), where "1.F" represents the binary number created by prefixing F with an implicit leading 1 and a binary point.
* If E=0 and F is nonzero, then V=(-1)**S * 2 ** (-1022) * (0.F). These are "denormalized" (also called "subnormal") values.
* If E=0 and F is zero and S is 1, then V=-0
* If E=0 and F is zero and S is 0, then V=0
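
The double precision rules above follow the same pattern as single precision, with an 11-bit exponent, a 52-bit fraction and a bias of 1023. A minimal sketch, in which decode_double and double_to_word are hypothetical helper names for this example; Python's own float is assumed to be an IEEE double, which holds on common platforms.

```python
import math
import struct

def decode_double(word):
    """Decode a 64-bit integer as an IEEE double precision value."""
    s = (word >> 63) & 0x1         # sign bit S
    e = (word >> 52) & 0x7FF       # 11 exponent bits E
    f = word & ((1 << 52) - 1)     # 52 fraction bits F
    sign = -1.0 if s else 1.0
    if e == 2047:
        # F nonzero is NaN, F zero is +/- Infinity
        return math.nan if f else sign * math.inf
    if e == 0:
        # F zero gives +/- 0, otherwise a denormalized value 0.F * 2**(-1022)
        return sign * (f / 2**52) * 2.0**-1022
    # 0 < E < 2047: normalized value 1.F * 2**(E-1023)
    return sign * (1 + f / 2**52) * 2.0**(e - 1023)

def double_to_word(x):
    """Return the 64-bit pattern of a Python float as an integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

print(decode_double(double_to_word(6.5)))         # 6.5
print(decode_double(double_to_word(0.1)) == 0.1)  # True
```

Round-tripping a value through double_to_word and decode_double is a quick way to confirm that the rules reproduce the hardware's interpretation, including for denormalized values.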