Floating Point Number Representation
Digital Logic and Computer Design
by Ravinder Nath Rajotiya - August 29, 2023, September 10, 2023

Representation of Real Numbers:
Real numbers can be represented in either of the following two notations.
· Fixed point notation
· Floating point notation

Fixed point representation
A fixed point number is stored with a fixed number of bits for the integer part and a fixed number of bits for the fraction part. It has the following parts:
· Integer part
· Binary point
· Fraction part

IIIIIIIIIIIII . FFFFFF

Suppose we have 8 bits of storage for a real number, of which 5 bits store the integer part and 3 bits store the fraction part:

$(10101.101)_2 = 1 \times 2^4 + 0 \times 2^3 + 1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 + 1 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3} = (21.625)_{10}$

Smallest and largest fixed point number
With 8-bit storage (5-bit integer part and 3-bit fraction part) the representable numbers are bounded as follows:
Smallest nonzero number = 00000.001 = 0.125
Largest number = 11111.111 = 31.875

Range and precision of the number
Range:
The range is the difference between the largest and the smallest value that can be represented. The more bits in the integer part, the larger the range.
Range in signed fixed point notation (sign-magnitude integer, N bits including the sign bit): $-(2^{N-1} - 1)$ to $+(2^{N-1} - 1)$
There is no single best way to divide the available bits between the integer part and the fraction part. Bits given to the integer part extend the range, bits given to the fraction part improve the precision, and the split must be fixed in advance.

Precision:
The precision is the difference between two consecutive representable numbers. The more bits in the fraction part, the better the precision. With 3 fraction bits, for example, consecutive numbers are $2^{-3} = 0.125$ apart.
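The 8-bit layout described above (5 integer bits, 3 fraction bits) can be checked with a few lines of Python. This is a minimal sketch and not part of the original article: the helper names `fixed_point_limits`, `to_fixed_point` and `from_fixed_point` are hypothetical, and the sketch assumes unsigned values and simple truncation of any fraction bits that do not fit.

```python
def fixed_point_limits(int_bits, frac_bits):
    """Return (precision, largest) for an unsigned fixed point format.

    The precision is also the smallest nonzero value that can be stored.
    """
    precision = 2.0 ** -frac_bits
    largest = 2.0 ** int_bits - precision
    return precision, largest


def to_fixed_point(value, int_bits, frac_bits):
    """Truncate a non-negative value to a fixed point bit string."""
    if value < 0:
        raise ValueError("this sketch handles non-negative values only")
    # Shift the binary point right by frac_bits places and drop what is left.
    scaled = int(value * 2 ** frac_bits)
    total_bits = int_bits + frac_bits
    if scaled >= 2 ** total_bits:
        raise OverflowError("value does not fit in the integer part")
    bits = format(scaled, "0{}b".format(total_bits))
    return bits[:int_bits] + "." + bits[int_bits:]


def from_fixed_point(text):
    """Convert a bit string such as '10101.101' back to a number."""
    int_part, frac_part = text.split(".")
    return int(int_part + frac_part, 2) / 2 ** len(frac_part)


if __name__ == "__main__":
    print(fixed_point_limits(5, 3))        # (0.125, 31.875)
    print(to_fixed_point(21.625, 5, 3))    # 10101.101
    print(from_fixed_point("10101.101"))   # 21.625
```

Because the sketch truncates instead of rounding, a value such as 45.525 stored with 4 fraction bits comes back as 45.5, which is the kind of precision loss the section describes.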

Example-1: Represent +45.525 in fixed point notation.
Solution: Assume 1 sign bit, a 6-bit integer part and a 4-bit fraction part.
45 = 101101
The fraction is converted by repeated multiplication by 2, taking the integer digit each time:
0.525 × 2 = 1.050 → 1
0.050 × 2 = 0.100 → 0
0.100 × 2 = 0.200 → 0
0.200 × 2 = 0.400 → 0
0.400 × 2 = 0.800 → 0
0.800 × 2 = 1.600 → 1
45.525 ≈ 101101.100001, which is truncated to 4 fraction bits: 101101.1000 (exactly 45.5)
Stored value = 0 101101 . 1000

Example-2: Represent -50.675 in fixed point with 1 sign bit, a 16-bit integer part and a 15-bit fraction part.
Solution:
50 = 0000000000110010
.675 = 101011001100110
So -50.675 = 1 0000000000110010 . 101011001100110

Floating Point Notation
This notation is the binary form of scientific notation. A floating point number can represent numbers of very different orders of magnitude (very large and very small) with the same fixed number of digits. It does not reserve a specific number of bits for the integer part or the fraction part of the number. Instead, the binary point floats. It has the following parts:
· Sign bit
· Exponent
· Mantissa

In general, in the binary system, a floating point number can be expressed as $\pm q \times 2^m$, where q is the significand and m is the exponent.

Some examples of "floating" the binary point:
$(23)_{10} = 1 \times 16 + 0 \times 8 + 1 \times 4 + 1 \times 2 + 1 \times 1 = (10111)_2 = 1.0111 \times 2^4$
$(11.5)_{10} = 1 \times 8 + 0 \times 4 + 1 \times 2 + 1 \times 1 + 1 \times 2^{-1} = (1011.1)_2 = 1.0111 \times 2^3$
$(5.75)_{10} = 1 \times 4 + 0 \times 2 + 1 \times 1 + 1 \times 2^{-1} + 1 \times 2^{-2} = (101.11)_2 = 1.0111 \times 2^2$

Moving the binary point to the left by one bit position divides the number by 2.
Moving the binary point to the right by one bit position multiplies the number by 2.

Converting floating points:
Convert $(39.6875)_{10}$ into floating point representation.
$(39.6875)_{10} = (100111.1011)_2 = (1.001111011)_2 \times 2^5$

IEEE standards 754:
IEEE floating point number representation
Half precision (16 bits = 1 sign bit, 5-bit exponent, 10-bit mantissa)
Single precision (32 bits = 1 sign bit, 8-bit exponent, 23-bit mantissa)
Double precision (64 bits = 1 sign bit, 11-bit exponent, 52-bit mantissa)
Quadruple precision (128 bits = 1 sign bit, 15-bit exponent, 112-bit mantissa)

IEEE-754 single precision floating point
This standard uses a total of 32 bits to represent the FLP number. It has a 1-bit sign, an 8-bit exponent and a 23-bit significand.
The format: Sign (1 bit) | Exponent (8 bits) | Significand (23 bits)
The exponent is stored as a biased exponent. With an 8-bit exponent the bias is 127. To store a value in FLP we add the bias to the exponent, and to read the value we subtract the bias from the exponent stored in the FLP number.

IEEE-754 double precision FLP
This standard uses a total of 64 bits to represent the FLP number. It has a 1-bit sign, an 11-bit exponent and a 52-bit significand.
The exponent is stored as a biased exponent. With an 11-bit exponent the bias is 1023. To store a value in FLP we add the bias to the exponent, and to read the value we subtract the bias from the exponent stored in the FLP number.

FLP number representation in IEEE standard
The IEEE standard uses the format 1.sss...s for the significand. The '1' before the binary point is implied by the format: it is hidden and is not stored. We therefore need to convert the given number to the form $\pm 1.ssssss \times 2^E$.

Biased Exponent:
A biased exponent is the result of adding a constant (called the bias) to the true exponent, chosen so that the stored exponent is never negative. The bias is 127 in IEEE single precision and 1023 in IEEE double precision. Storing the exponent as an unsigned value simplifies comparing floating point numbers and leaves the all-zeros exponent field free to encode zero and subnormal numbers.

Process of converting a decimal number to IEEE standard FLP number:
1. Convert the number to fixed point binary notation.
2. Normalize so that a single 1 bit is before the binary point, and adjust the exponent accordingly.
3. Add the bias (+127 in IEEE single precision and +1023 in IEEE double precision) to the exponent value.
4. Store the number so obtained in the FLP format.

Example-1: Represent 4.5 in IEEE single precision format.
Solution: Convert to binary fixed point:
$4.5 = (100.1)_2 = 1.001 \times 2^2$
Add bias: exponent = 2; biased exponent = 2 + 127 = 129 = 10000001
Significand = .001
4.5 = ( 0 10000001 00100000000000000000000 )

Example-2: Represent -3.75 in FLP representation.
Solution:
$(3.75)_{10} = (11.11)_2 = 1.111 \times 2^1$
Exponent = 1; biased exponent = 1 + 127 = 128 = 10000000
Significand = 1.111; the 1 before the binary point is hidden and is not stored, so only .111 is stored.
Sign = '1' for a negative number
So, $(-3.75)_{10}$ = 1 10000000 11100000000000000000000

Example-3: Represent -53.5 in floating point notation assuming an 8-bit exponent, 1 sign bit and a 23-bit mantissa.
Solution:
53 in binary = 110101
.5 = .1
$53.5 = (110101.1)_2 = 1.101011 \times 2^5$
Exponent = 5; biased exponent e = 5 + 127 = 132 = 10000100
Significand = .101011
$(-53.5)_{10}$ = 1 10000100 10101100000000000000000

Example-4: Consider three registers R1, R2, R3 that store numbers in IEEE-754 single precision floating point format. Assume R1 and R2 contain the values (in hexadecimal notation) R1 = 0x42200000 and R2 = 0xC1200000. If R3 = R1/R2, find R3.
Solution:
R1 = 0 10000100 01000000000000000000000: exponent = 132 - 127 = 5, so R1 = $+1.01 \times 2^5$ = 40
R2 = 1 10000010 01000000000000000000000: exponent = 130 - 127 = 3, so R2 = $-1.01 \times 2^3$ = -10
R3 = 40 / (-10) = -4 = $-1.0 \times 2^2$; biased exponent = 2 + 127 = 129 = 10000001
R3 = 1 10000001 00000000000000000000000 = 0xC0800000

Range of Floating Point numbers
Note: Zero (0) is indicated by having all 0's in the significand and exponent parts.
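
The four conversion steps and the single precision examples above can be reproduced in Python. This is a minimal sketch under stated assumptions and not the article's own code: the function names `encode_single_by_hand`, `decode_single` and `to_hex` are hypothetical, the hand encoder only covers finite nonzero values in the normalized range and truncates extra fraction bits, and Example-4 is evaluated assuming R2 is the 8-digit pattern 0xC1200000. Python's standard `struct` module is used only as an independent cross-check.

```python
import math
import struct


def encode_single_by_hand(value):
    """Follow the conversion steps for a normalized single precision value.

    Returns (sign, exponent, fraction) as bit strings of 1, 8 and 23 bits.
    """
    if value == 0 or not math.isfinite(value):
        raise ValueError("this sketch handles finite nonzero values only")
    sign = "1" if value < 0 else "0"
    magnitude = abs(value)
    exponent = 0
    # Step 2: normalize to 1.xxx * 2**exponent.
    while magnitude >= 2:
        magnitude /= 2
        exponent += 1
    while magnitude < 1:
        magnitude *= 2
        exponent -= 1
    # Step 3: add the single precision bias.
    biased = exponent + 127
    if not 1 <= biased <= 254:
        raise ValueError("value is outside the normalized range")
    # Step 4: drop the hidden 1 and keep 23 fraction bits (truncating).
    fraction = int((magnitude - 1) * 2 ** 23)
    return sign, format(biased, "08b"), format(fraction, "023b")


def decode_single(raw):
    """Decode a normalized 32-bit pattern: (-1)^s * 1.f * 2^(e - 127)."""
    sign = raw >> 31
    biased = (raw >> 23) & 0xFF
    fraction = raw & 0x7FFFFF
    significand = 1 + fraction / 2 ** 23
    return (-1) ** sign * significand * 2.0 ** (biased - 127)


def to_hex(value):
    """Cross-check: let struct produce the single precision bit pattern."""
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    return format(raw, "08X")


if __name__ == "__main__":
    print(encode_single_by_hand(4.5))    # ('0', '10000001', '00100000000000000000000')
    print(encode_single_by_hand(-53.5))  # ('1', '10000100', '10101100000000000000000')
    r1 = decode_single(0x42200000)       # 40.0
    r2 = decode_single(0xC1200000)       # -10.0
    print(r1, r2, to_hex(r1 / r2))       # 40.0 -10.0 C0800000
```

Zero is rejected by the hand encoder on purpose, since it has no normalized form and is stored with all 0's in the exponent and significand fields.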