Mth 351 Numerical Analysis
Intel 80x87 Floating Point Data Types - IEEE Std 754
Spring 2001 - Bent E. Petersen

There are some minor differences between the 8087 and 80287 floating point coprocessors and later versions. The major difference is probably the handling of infinity and NaNs. The 8087 and 80287 default to projective infinity, that is, a single unsigned infinity. The later FPUs default to affine infinity, that is, plus and minus infinity are different.

The exponent is stored as an integer, regarded as unsigned. This is achieved by adding an offset, a bias, to the actual (or logical) exponent.

The biased exponent with all bits 0 is reserved for zero and denormals.

The biased exponent with all bits 1 is reserved for the two infinities and for NaNs. NaN means Not a Number. NaNs propagate through a calculation and may eventually signal an exception. The details do not concern us here, and in any case I do not know them well enough to discuss them. They may be found in the Intel literature.

A normalized number is a number in which the integer part (most significant bit) of the mantissa (significand) is 1. In the packed formats this bit is not explicitly stored. Thus the logical length of the mantissa is one bit more than the physical length. Of course, we have to unpack the number (that is, insert the missing bit) before doing any arithmetic. Internally the coprocessor uses 80 bit registers. It automatically unpacks fp numbers when loading the registers, and packs them when storing them to memory, at least for those formats which are less than 80 bits long. There is no packed 80 bit format, since the coprocessor's registers would not be long enough to unpack it.
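The packing described above can be made concrete with a small sketch in C. It assumes that float is the packed 32 bit format (1 sign bit, 8 exponent bits, 23 stored mantissa bits), which is true on most current machines but is not guaranteed by the C standard. The type float_fields and the function unpack_float are hypothetical names introduced only for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Fields of a packed 32 bit float (hypothetical helper type). */
typedef struct {
    unsigned sign;        /* 1 bit */
    unsigned biased_exp;  /* 8 bits, the physical exponent */
    uint32_t fraction;    /* the 23 stored mantissa bits */
    uint32_t significand; /* 24 logical bits, with the leading bit restored */
} float_fields;

/* Split a float into its fields. Assumes float is the 32 bit format. */
float_fields unpack_float(float x)
{
    uint32_t bits;
    float_fields f;

    memcpy(&bits, &x, sizeof bits);      /* copy the raw bit pattern */
    f.sign = bits >> 31;
    f.biased_exp = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    /* The omitted leading bit is 1 for normalized numbers and 0 when the
       physical exponent is 0 (zero and denormals). For physical exponent
       255 (infinity, NaN) the significand field has no numeric meaning. */
    f.significand = f.fraction | (f.biased_exp != 0 ? 0x800000u : 0u);
    return f;
}
```

For example, unpack_float(1.0f) gives physical exponent 127 (logical exponent 0 plus the bias 127), stored mantissa bits all 0, and unpacked significand 0x800000, that is, the single leading 1 bit.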

A denormalized number (denormal) is a number in which the integer part (most significant bit) of the mantissa is 0 and the biased (physical) exponent is 0. An unnormalized number (unnormal) is a number in which the integer part of the mantissa is 0 and the exponent is arbitrary. Note that unnormals can occur only in the unpacked format (extended precision) since otherwise we have no way to recognize them!

While denormals and unnormals are necessary in the course of a calculation, denormals may also occur in the stored formats to represent (with a loss of significance) very small numbers. This scheme allows the coprocessor to underflow gracefully. The physical exponent 0 is reserved to indicate a denormal. Since the physical exponent 0 signals a denormal, we do not need to store the leading bit of the mantissa, which is 0, except when we wish to do actual arithmetic. Thus in the packed formats we omit the leading bit also for denormals. Note that a denormal is scaled by the smallest normalized logical exponent (for example -126 in the 32 bit format), not by the value obtained by subtracting the bias from 0.
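Graceful underflow is easy to observe. The following is a minimal sketch, assuming float is the 32 bit format, the default rounding mode (round to nearest) is in effect, and the compiler has not been asked to flush denormals to zero. The function names are hypothetical. The volatile qualifier forces each intermediate result to be stored as a float rather than kept in an 80 bit register.

```c
#include <float.h>

/* Returns 1 if x is a nonzero denormal float, 0 otherwise. */
int is_denormal(float x)
{
    float a = (x < 0.0f) ? -x : x;
    return a > 0.0f && a < FLT_MIN;   /* FLT_MIN: smallest positive normalized float */
}

/* Halve the smallest positive normalized float k times. */
float halve_smallest_normal(int k)
{
    volatile float x = FLT_MIN;       /* volatile: round to float at every step */
    int i;

    for (i = 0; i < k; ++i)
        x = x / 2.0f;
    return x;
}
```

Under these assumptions the result stays positive for 23 halvings, losing one bit of significance each time, and only the 24th halving produces 0.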

Note that we can store integers as floating point numbers, but for very large integers the gap between successive exactly storable integers becomes larger than one. Still this gives a much larger range of integers than would be available with ordinary integer storage (that is, fixed point). The C language provides some useful functions for handling floating point integers: floor(), ceil() and fmod(). Thus floor(x) returns the largest floating point integer less than or equal to x. If x is very large then floor(x) may be much smaller than the greatest integer in x, gid(x), if gid(x) cannot be stored exactly.

The largest floating point integer N such that N-1 is also a floating point integer (that is, exactly representable) is called the largest fp integer with a predecessor. It is of course much smaller than the largest fp integer.
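The largest fp integer with a predecessor can be found by experiment. The sketch below relies on the fact that the gap between successive floating point numbers doubles at each power of two, so it is enough to test powers of two. It assumes the default rounding mode. The function names are hypothetical, and volatile is used to force every sum to be rounded to the storage format.

```c
/* Smallest power of two p for which p + 1 is not representable as a float.
   Every integer up to p is exactly representable and no two consecutive
   integers beyond p are, so p is the largest float integer whose
   predecessor p - 1 is also a float integer. */
float float_last_with_predecessor(void)
{
    volatile float p = 1.0f;
    volatile float q = p + 1.0f;

    while (q != p) {          /* p + 1 is still distinct from p */
        p = p * 2.0f;
        q = p + 1.0f;         /* rounded to float by the volatile store */
    }
    return p;
}

/* The same search for double. */
double double_last_with_predecessor(void)
{
    volatile double p = 1.0;
    volatile double q = p + 1.0;

    while (q != p) {
        p = p * 2.0;
        q = p + 1.0;
    }
    return p;
}
```

A direct search that adds 1 repeatedly gives the same answer for float, but would take far too long for double.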

The unit round is the smallest positive exactly representable floating point number u such that 1.0 + u > 1.0. It is very easy to write code to determine the unit round but one will obtain incorrect results unless one allows for the fact that the coprocessor uses the full 80 bit precision internally and also that some compilers are very clever about eliminating "useless" loops.
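Here is one way to guard against both pitfalls, offered as a sketch rather than the definitive method. The volatile qualifier forces each sum to be stored in the narrower format, and it also keeps the compiler from eliminating the loop. The search is restricted to powers of two, and the default rounding mode (round to nearest) is assumed. The function names are hypothetical.

```c
#include <float.h>

/* Smallest power of two u with 1 + u > 1 in float arithmetic. */
float unit_round_float(void)
{
    volatile float u = 1.0f;
    volatile float sum;

    for (;;) {
        sum = 1.0f + u / 2.0f;   /* the store rounds the sum to 24 bits */
        if (sum == 1.0f)
            break;               /* u / 2 no longer changes 1.0 */
        u = u / 2.0f;
    }
    return u;
}

/* The same search in double arithmetic. */
double unit_round_double(void)
{
    volatile double u = 1.0;
    volatile double sum;

    for (;;) {
        sum = 1.0 + u / 2.0;     /* the store rounds the sum to 53 bits */
        if (sum == 1.0)
            break;
        u = u / 2.0;
    }
    return u;
}

/* Returns 1 if the computed values agree with the constants in <float.h>. */
int unit_round_matches_float_h(void)
{
    return unit_round_float() == FLT_EPSILON
        && unit_round_double() == DBL_EPSILON;
}
```

Without volatile, a compiler that keeps sum in an 80 bit register may report the unit round of the extended format instead.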

It is not difficult to see that for any real number y (within the range of normalized fp numbers) if fp(y) is its floating point representative then

y = fp(y) + e y

where |e| < u and where u is the unit round. This is if we chop the mantissa of y to obtain fp(y). If instead we round the mantissa of y to obtain fp(y) then we have |e| < u/2.
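The rounding bound can be checked numerically. In the sketch below a double stands in for the real number y and fp(y) is its conversion to float, so u is the unit round of the 32 bit format (FLT_EPSILON in float.h). This assumes the conversion rounds to nearest, which is the default. The function names are hypothetical.

```c
#include <float.h>

/* Relative error e in y = fp(y) + e y, where fp() rounds a double to float.
   y must be nonzero and within the range of normalized floats. */
double relative_error(double y)
{
    volatile float fy = (float)y;    /* fp(y): mantissa rounded to 24 bits */
    return (y - (double)fy) / y;
}

/* Returns 1 if |e| < u/2, with u the unit round of float. */
int within_half_unit_round(double y)
{
    double e = relative_error(y);

    if (e < 0.0)
        e = -e;
    return e < FLT_EPSILON / 2.0;
}
```

For y = 0.1 the error is nonzero, since 0.1 has no finite binary expansion, while for y = 1.0 it is exactly 0.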

The largest fp integer with a predecessor and the unit round are convenient measures of the precision of the floating point representation.

Note if M is the largest logical exponent for normalized fp numbers and n is the number of logical bits in the mantissa then the largest normalized floating point magnitude (apart from infinity) is

N = 2^M (2 - 2^(1-n)).

Here we have used

(1 + 2^(-1) + ... + 2^(1-n)) = 2 - 2^(1-n)

and we have assumed that 1 <= mantissa < 2 is the normalization used for the mantissa. Since the binary point is not actually stored we have to have some agreement about where it falls. Another common convention is 1/2 <= mantissa < 1. Since M >= n in the formats below we see that N is an integer, and so N is the largest fp integer. If N0 is the largest floating point integer with N0 < N then

N - N0 = 2^(M-n+1),

a very large integer. Thus N0 and N are very far from being consecutive.
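Both formulas can be checked against the constants in float.h. This is a sketch under the assumption that float and double are the 32 bit and 64 bit formats described in these notes. All function names are hypothetical. The predecessor of a positive float is obtained by subtracting 1 from its bit pattern, which works because the packed formats are ordered like unsigned integers for positive values.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* 2^k as a double, by repeated multiplication or division.
   Exact as long as the result is a normalized double. */
double two_to(int k)
{
    double p = 1.0;

    for (; k > 0; --k) p *= 2.0;
    for (; k < 0; ++k) p /= 2.0;
    return p;
}

/* N = 2^M (2 - 2^(1-n)): largest normalized magnitude for largest logical
   exponent M and logical mantissa length n. */
double largest_magnitude(int M, int n)
{
    return two_to(M) * (2.0 - two_to(1 - n));
}

/* The float just below a positive finite float x.
   Assumes float is the 32 bit format. */
float float_pred(float x)
{
    uint32_t bits;

    memcpy(&bits, &x, sizeof bits);
    bits -= 1;                       /* step down one unit in the last place */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

With M = 127 and n = 24 the first formula reproduces FLT_MAX, and FLT_MAX minus float_pred(FLT_MAX) equals 2^(M-n+1) = 2^104.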


The 32 bit fp format

Short Real
Single Precision Real
typical C data type float
length of format 32 bits
storage for sign 1 bit
storage for exponent 8 bits
storage for mantissa 23 bits
packed? yes
mantissa normalization 1 <= mantissa < 2
mantissa precision (logical length) 24 bits
exponent bias 127
reserved exponent for denormals physical 0 (nominally logical -127, but denormals are scaled by 2^(-126))
reserved exponent for infinity and NaN's physical 255, logical 128
range of physical exponent for normalized fp's 1 to 254
range of logical exponent for normalized fp's -126 to 127
smallest positive normalized fp 2^(-126) = 1.175 × 10^(-38)
largest normalized fp 2^127 (2 - 2^(-23)) = 2^128 - 2^104 = 3.403 × 10^38
smallest positive denormal 2^(-126) × 2^(-23) = 2^(-149) = 1.401 × 10^(-45)
largest denormal 2^(-126) (1 - 2^(-23)) = 2^(-126) - 2^(-149) = 1.175 × 10^(-38)
largest fp integer 2^128 - 2^104 = 3.403 × 10^38
gap from largest fp integer to previous fp integer 2^104 = 2.028 × 10^31
largest fp integer with a predecessor 2^24 = 16,777,216
unit round (chop precision) 2^(-23) = 1.192 × 10^(-7)
precision (round precision) 2^(-24) = 5.960 × 10^(-8)

The 64 bit fp format

Long Real
Double Precision Real
typical C data type double
length of format 64 bits
storage for sign 1 bit
storage for exponent 11 bits
storage for mantissa 52 bits
packed? yes
mantissa normalization 1 <= mantissa < 2
mantissa precision (logical length) 53 bits
exponent bias 1023
reserved exponent for denormals physical 0 (nominally logical -1023, but denormals are scaled by 2^(-1022))
reserved exponent for infinity and NaN's physical 2047, logical 1024
range of physical exponent for normalized fp's 1 to 2046
range of logical exponent for normalized fp's -1022 to 1023
smallest positive normalized fp 2^(-1022) = 2.225 × 10^(-308)
largest normalized fp 2^1023 (2 - 2^(-52)) = 2^1024 - 2^971 = 1.798 × 10^308
smallest positive denormal 2^(-1022) × 2^(-52) = 2^(-1074) = 4.941 × 10^(-324)
largest denormal 2^(-1022) (1 - 2^(-52)) = 2^(-1022) - 2^(-1074) = 2.225 × 10^(-308)
largest fp integer 2^1024 - 2^971 = 1.798 × 10^308
gap from largest fp integer to previous fp integer 2^971 = 1.996 × 10^292
largest fp integer with a predecessor 2^53 = 9,007,199,254,740,992
unit round (chop precision) 2^(-52) = 2.220 × 10^(-16)
precision (round precision) 2^(-53) = 1.110 × 10^(-16)

The 80 bit fp format

Temporary Real
Extended Precision Real
typical C data type long double
length of format 80 bits
storage for sign 1 bit
storage for exponent 15 bits
storage for mantissa 64 bits
packed? no
mantissa normalization 1 <= mantissa < 2
mantissa precision (logical = physical length) 64 bits
exponent bias 16383
reserved exponent for denormals physical 0 (nominally logical -16383, but denormals are scaled by 2^(-16382))
reserved exponent for infinity and NaN's physical 32767, logical 16384
range of physical exponent for normalized fp's 1 to 32766
range of logical exponent for normalized fp's -16382 to 16383
smallest positive normalized fp 2^(-16382) = 3.362 × 10^(-4932)
largest normalized fp 2^16383 (2 - 2^(-63)) = 2^16384 - 2^16320 = 1.190 × 10^4932
smallest positive denormal 2^(-16382) × 2^(-63) = 2^(-16445) = 3.645 × 10^(-4951)
largest denormal 2^(-16382) (1 - 2^(-63)) = 2^(-16382) - 2^(-16445) = 3.362 × 10^(-4932)
largest fp integer 2^16384 - 2^16320 = 1.190 × 10^4932
gap from largest fp integer to previous fp integer 2^16320 = 6.450 × 10^4912
largest fp integer with a predecessor 2^64 = 18,446,744,073,709,551,616
unit round (chop precision) 2^(-63) = 1.084 × 10^(-19)
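Whether long double really is this 80 bit format depends on the compiler and the machine. Some compilers map long double to the 64 bit format or to a 128 bit format. A minimal check, using only the standard constants in float.h, might look as follows. The function name is hypothetical.

```c
#include <float.h>

/* Returns 1 if long double has a 64 bit mantissa and the exponent range of
   the 80 bit format in the table above, 0 otherwise. */
int long_double_is_80_bit_extended(void)
{
    /* <float.h> uses the normalization 1/2 <= mantissa < 1, so its exponent
       limits are one larger than the logical exponents -16382 and 16383. */
    return LDBL_MANT_DIG == 64
        && LDBL_MIN_EXP == -16381
        && LDBL_MAX_EXP == 16384;
}
```

If the function returns 0, results printed for long double should not be expected to match the table.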

Corrections are welcome!  petersen@math.orst.edu

References

Robert L. Hummel, PC Magazine: Programmer's Technical Reference: The Processor and Coprocessor, Ziff-Davis Press, Emeryville, California, 1992.

Stephen P. Morse, Eric J. Isaacson, Douglas J. Albert, The 80386/387 Architecture, John Wiley & Sons, Inc., New York, 1987.

Richard Startz, 8087 Applications and Programming for the IBM PC, XT, and AT, Revised and Expanded, Prentice Hall, New York, 1985.

John F. Palmer, Stephen P. Morse, The 8087 Primer, John Wiley & Sons, New York, 1984.

Intel Corporation, i486 Processor Programmer's Reference Manual, Intel, Osborne McGraw-Hill, 1990.


Updated Sunday, October 26, 2003
Bent E. Petersen (541) 737-5163
email: petersen@math.orst.edu
Fax: (541) 737-0517
