Floating point is a number representation consisting of a mantissa, M, an exponent, E, and an (assumed) radix (or "base"), R. The number represented is M*R^E, where R is the radix: usually 2 in computer hardware, but sometimes ten (as in scientific notation).

Many different representations are used for the mantissa and exponent themselves. The [IEEE] specifies a standard representation that is used by many hardware floating-point systems: [IEEE 754|http://grouper.ieee.org/groups/754/]. There is also plenty of documentation at http://cch.loria.fr/documentation/IEEE754/.

The opposite is fixed-point.
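To make the M*R^E form concrete, here is a minimal Python sketch using the standard library's math.frexp, which splits a float into a mantissa and a radix-2 exponent. The helper name decompose is only illustrative. Note that frexp normalizes the mantissa to the range 0.5 <= |M| < 1, which is a different convention from the "1.F" form IEEE 754 uses.

```python
import math

def decompose(x):
    """Return (M, E) such that x == M * 2**E, with 0.5 <= |M| < 1 for nonzero x."""
    return math.frexp(x)

m, e = decompose(6.5)
print(m, e)               # 0.8125 3
print(m * 2**e)           # 6.5
print(math.ldexp(m, e))   # 6.5 (ldexp is the inverse of frexp)
```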

!Single Precision

The IEEE single precision floating point standard representation requires a 32-bit word, whose bits may be numbered from 0 to 31, left to right. The first bit is the sign bit, S, the next eight bits are the exponent bits, E, and the final 23 bits are the fraction, F:

  S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
  0 1      8 9                    31

The value V represented by the word may be determined as follows:

* If E=255 and F is nonzero, then V=NaN ("Not a number")
* If E=255 and F is zero and S is 1, then V=-Infinity
* If E=255 and F is zero and S is 0, then V=Infinity
* If 0<E<255 then V=(-1)**S * 2**(E-127) * (1.F), where "1.F" denotes the binary number formed by prefixing F with an implicit leading 1 and a binary point.
* If E=0 and F is nonzero, then V=(-1)**S * 2**(-126) * (0.F). These are "unnormalized" values, which IEEE 754 calls denormalized or subnormal.
* If E=0 and F is zero and S is 1, then V=-0
* If E=0 and F is zero and S is 0, then V=0
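The rules above translate almost line for line into code. The following is a minimal Python sketch, not a reference implementation. The names decode_single and bits are made up for this illustration. It relies on Python's own float being wide enough to hold every single precision value exactly.

```python
import math

def bits(pattern):
    """Parse an 'S EEEEEEEE FFF...' string (spaces optional) into an integer word."""
    return int(pattern.replace(" ", ""), 2)

def decode_single(word):
    """Decode a 32-bit word by applying the single precision rules directly."""
    s = (word >> 31) & 0x1      # sign bit S
    e = (word >> 23) & 0xFF     # eight exponent bits E
    f = word & 0x7FFFFF         # 23 fraction bits F
    sign = -1.0 if s else 1.0
    if e == 255:
        # F nonzero: NaN; F zero: signed infinity
        return math.nan if f else sign * math.inf
    if e == 0:
        # F zero: signed zero; F nonzero: 0.F * 2**(-126)
        return sign * (f / 2**23) * 2.0**-126
    # normalized case: 1.F * 2**(E-127)
    return sign * (1 + f / 2**23) * 2.0**(e - 127)

print(decode_single(bits("0 10000001 10100000000000000000000")))  # 6.5
print(decode_single(bits("1 00000000 00000000000000000000000")))  # -0.0
```

The zero and unnormalized cases share one branch because 0.F is simply zero when F is zero, and multiplying by the sign preserves -0.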

In particular,

  0 00000000 00000000000000000000000 = 0
  1 00000000 00000000000000000000000 = -0

  0 11111111 00000000000000000000000 = Infinity
  1 11111111 00000000000000000000000 = -Infinity

  0 11111111 00000100000000000000000 = NaN
  1 11111111 00100010001001010101010 = NaN

  0 10000000 00000000000000000000000 = +1 * 2**(128-127) * 1.0 = 2
  0 10000001 10100000000000000000000 = +1 * 2**(129-127) * 1.101 = 6.5
  1 10000001 10100000000000000000000 = -1 * 2**(129-127) * 1.101 = -6.5

  0 00000001 00000000000000000000000 = +1 * 2**(1-127) * 1.0 = 2**(-126)
  0 00000000 10000000000000000000000 = +1 * 2**(-126) * 0.1 = 2**(-127)
  0 00000000 00000000000000000000001 = +1 * 2**(-126) *
                                       0.00000000000000000000001 =
                                       2**(-149) (Smallest positive value)
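The bit patterns above can be cross-checked in Python with the standard struct module, whose ">f" format packs and unpacks big-endian IEEE 754 single precision values. This is only a sketch. The helper names pattern_to_float and float_to_pattern are invented for the example.

```python
import struct

def pattern_to_float(pattern):
    """Interpret an 'S EEEEEEEE FFF...' bit string as a single precision value."""
    word = int(pattern.replace(" ", ""), 2)
    return struct.unpack(">f", word.to_bytes(4, "big"))[0]

def float_to_pattern(x):
    """Format x as 'S EEEEEEEE FFF...' (x is rounded to single precision first)."""
    word = int.from_bytes(struct.pack(">f", x), "big")
    b = f"{word:032b}"
    return f"{b[0]} {b[1:9]} {b[9:]}"

print(pattern_to_float("0 10000001 10100000000000000000000"))  # 6.5
print(float_to_pattern(-6.5))  # 1 10000001 10100000000000000000000
print(pattern_to_float("0 00000000 00000000000000000000001") == 2.0**-149)  # True
```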

!Double Precision

The IEEE double precision floating point standard representation requires a 64-bit word, whose bits may be numbered from 0 to 63, left to right. The first bit is the sign bit, S, the next eleven bits are the exponent bits, E, and the final 52 bits are the fraction, F:

  S EEEEEEEEEEE FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
  0 1        11 12                                                63

The value V represented by the word may be determined as follows:

* If E=2047 and F is nonzero, then V=NaN ("Not a number")
* If E=2047 and F is zero and S is 1, then V=-Infinity
* If E=2047 and F is zero and S is 0, then V=Infinity
* If 0<E<2047 then V=(-1)**S * 2**(E-1023) * (1.F), where "1.F" denotes the binary number formed by prefixing F with an implicit leading 1 and a binary point.
* If E=0 and F is nonzero, then V=(-1)**S * 2**(-1022) * (0.F). These are "unnormalized" values, which IEEE 754 calls denormalized or subnormal.
* If E=0 and F is zero and S is 1, then V=-0
* If E=0 and F is zero and S is 0, then V=0
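The double precision rules can be sketched the same way. Python's own float is normally an IEEE 754 double, so the result can be compared against the bits Python itself stores. The names decode_double and double_bits are illustrative only, and struct's ">d" format is used to obtain the raw 64-bit word.

```python
import math
import struct

def decode_double(word):
    """Decode a 64-bit word by applying the double precision rules directly."""
    s = (word >> 63) & 0x1          # sign bit S
    e = (word >> 52) & 0x7FF        # eleven exponent bits E
    f = word & ((1 << 52) - 1)      # 52 fraction bits F
    sign = -1.0 if s else 1.0
    if e == 2047:
        return math.nan if f else sign * math.inf
    if e == 0:
        # zero or unnormalized value: 0.F * 2**(-1022)
        return sign * (f / 2**52) * 2.0**-1022
    # normalized case: 1.F * 2**(E-1023)
    return sign * (1 + f / 2**52) * 2.0**(e - 1023)

def double_bits(x):
    """Return the 64-bit word that stores x as a big-endian IEEE 754 double."""
    return int.from_bytes(struct.pack(">d", x), "big")

# 6.5 = 1.101 (binary) * 2**2, so E = 2 + 1023 = 1025 and F = 101 followed by 49 zeros
print(decode_double((1025 << 52) | (0b101 << 49)))  # 6.5

# Round trip: decoding the stored bits should give back the original value
for x in (6.5, -6.5, 0.1, 2.0**-1074):
    assert decode_double(double_bits(x)) == x
```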