Floating Point Number Representation

Representation of Real Numbers:

Real numbers can be stored in a computer using either of the following two notations.

·         Fixed point notation

·         Floating point notation

Fixed point representation

A fixed point number is stored with a fixed number of bits for the integer part and a fixed number of bits for the fraction part.

It has the following parts:

·         Integer part

·         Binary point

·         Fraction part

IIIIIIIIIIIII . FFFFFF

Suppose we have 8 bits of storage for a real number, of which 5 bits store the integer part and 3 bits store the fraction part, as shown below:

(10101.101)_2

= 1×2^4 + 0×2^3 + 1×2^2 + 0×2^1 + 1×2^0 + 1×2^-1 + 0×2^-2 + 1×2^-3

Smallest and largest fixed point number

With 8-bit storage (5-bit integer part and 3-bit fraction part), the range of numbers is:

Smallest nonzero number = 00000.001 = 0.125

Largest number = 11111.111 = 31.875
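The positional weights used above can be checked with a short script. This is a minimal sketch for unsigned values only; the function name `fixed_to_float` is a hypothetical helper introduced here for illustration.

```python
def fixed_to_float(bits: str) -> float:
    """Evaluate an unsigned binary fixed-point string such as '10101.101'.

    Hypothetical helper for illustration; it does not handle a sign bit.
    """
    int_part, _, frac_part = bits.partition(".")
    value = 0.0
    # Integer bits carry weights 2^0, 2^1, ... counted from the binary point leftwards.
    for i, b in enumerate(reversed(int_part)):
        value += int(b) * 2 ** i
    # Fraction bits carry weights 2^-1, 2^-2, ... counted from the binary point rightwards.
    for i, b in enumerate(frac_part, start=1):
        value += int(b) * 2 ** -i
    return value


print(fixed_to_float("10101.101"))  # 21.625
print(fixed_to_float("00000.001"))  # 0.125
print(fixed_to_float("11111.111"))  # 31.875
```

The last two calls reproduce the smallest nonzero and the largest values of the 5-bit integer, 3-bit fraction format.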

Range and precision of the number

Range:

The difference between the largest and the smallest value that can be represented. The more bits in the integer part, the larger the range.

Range in signed fixed point notation

-(2^(N-1) - 1) to +(2^(N-1) - 1) for an N-bit sign-magnitude integer (1 sign bit and N-1 magnitude bits)

There is no single best way to split the available bits between the integer part and the fraction part: more integer bits widen the range, while more fraction bits improve the precision.

Precision:

The difference between two consecutive representable numbers is the precision. The more bits in the fraction part, the better the precision.
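To make the trade-off concrete, the sketch below splits the same 8 bits in three different ways and prints the resulting precision and largest value. It assumes an unsigned format, and `fixed_point_limits` is a hypothetical helper name used only for this illustration.

```python
def fixed_point_limits(int_bits: int, frac_bits: int):
    """Return (precision, largest) for an unsigned fixed-point format (illustrative helper)."""
    precision = 2.0 ** -frac_bits          # gap between two consecutive values
    largest = 2 ** int_bits - precision    # value with every bit set to 1
    return precision, largest


TOTAL_BITS = 8  # same storage size as the 5-bit integer, 3-bit fraction example
for int_bits in (7, 5, 3):
    frac_bits = TOTAL_BITS - int_bits
    precision, largest = fixed_point_limits(int_bits, frac_bits)
    print(f"{int_bits}.{frac_bits} format: precision {precision}, largest {largest}")
```

Moving bits from the integer part to the fraction part shrinks the step between values but also shrinks the largest representable number.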

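Example-1: Represent +45.525 in fixed point notation.

Solution: Assume 1 sign bit, 6 integer bits and 4 fraction bits.

45 = 101101

For the fraction, multiply repeatedly by 2 and take the integer part of each product as the next bit:

0.525 × 2 = 1.050  → 1

0.050 × 2 = 0.100  → 0

0.100 × 2 = 0.200  → 0

0.200 × 2 = 0.400  → 0

0.400 × 2 = 0.800  → 0

0.800 × 2 = 1.600  → 1

45.525 ≈ 101101.100001 (first 6 fraction bits)

            ≈ 101101.1000 (truncated to 4 fraction bits; the stored value is exactly 45.5)

            = 0 101101 . 1000 (sign, integer part, fraction part)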
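The repeated multiplication by 2 used for the fraction part can be sketched as follows. The helper name `fraction_bits` is hypothetical, and `fractions.Fraction` is used so that the decimal input 0.525 is handled exactly rather than as a binary float.

```python
from fractions import Fraction


def fraction_bits(frac: str, n_bits: int) -> str:
    """Return the first n_bits binary digits of a decimal fraction in [0, 1).

    Illustrative helper: multiply by 2, peel off the integer part, repeat.
    """
    x = Fraction(frac)
    bits = []
    for _ in range(n_bits):
        x *= 2
        bit = int(x)          # integer part of the product: 0 or 1
        bits.append(str(bit))
        x -= bit              # keep only the fractional remainder
    return "".join(bits)


print(format(45, "b"))             # 101101
print(fraction_bits("0.525", 6))   # 100001
print(fraction_bits("0.525", 4))   # 1000
```

Truncating to 4 fraction bits drops the trailing 1, which is why the stored value is only an approximation of 45.525.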

Example-2: Represent -50.675 in fixed point notation with 1 sign bit, a 16-bit integer part and a 15-bit fraction part.

Solution:

50 = 0000000000110010

0.675 ≈ .101011001100110 (first 15 fraction bits)

So -50.675 ≈ 1 0000000000110010 . 101011001100110 (sign bit 1 for a negative number)
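Both fixed point examples can be reproduced with one small routine. This is a sketch under the assumptions used in the examples (a sign bit followed by the magnitude, extra fraction bits truncated); `to_sign_magnitude_fixed` is a hypothetical name chosen for illustration.

```python
from fractions import Fraction


def to_sign_magnitude_fixed(value: str, int_bits: int, frac_bits: int) -> str:
    """Encode a decimal string as 'sign integer . fraction' (illustrative helper)."""
    x = Fraction(value)
    sign = "1" if x < 0 else "0"
    # Scale by 2^frac_bits and truncate, which drops the fraction bits that do not fit.
    scaled = int(abs(x) * 2 ** frac_bits)
    if scaled >= 2 ** (int_bits + frac_bits):
        raise OverflowError("integer part does not fit in int_bits")
    bits = format(scaled, f"0{int_bits + frac_bits}b")
    return f"{sign} {bits[:int_bits]} . {bits[int_bits:]}"


print(to_sign_magnitude_fixed("-50.675", 16, 15))
# 1 0000000000110010 . 101011001100110
print(to_sign_magnitude_fixed("45.525", 6, 4))
# 0 101101 . 1000
```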

Floating Point Notation

This notation is the binary counterpart of scientific notation. A floating-point number can represent values of very different orders of magnitude (very large and very small) with the same fixed number of digits. It does not reserve a specific number of bits for the integer part or the fraction part of the number. Instead, the binary point floats, and its position is recorded in the exponent.

It has the following parts:

·         Sign Bit

·         Exponent

·         Mantissa

In general, in the binary system, a floating point number can be expressed as

± q × 2^m, where q is the significand (mantissa) and m is the exponent.

Some examples

"Floating" the binary point:

(23)_10 = 1×16 + 0×8 + 1×4 + 1×2 + 1×1 = (10111)_2 = 1.0111 × 2^4

(11.5)_10 = 1×8 + 0×4 + 1×2 + 1×1 + 1×2^-1 = (1011.1)_2 = 1.0111 × 2^3

(5.75)_10 = 1×4 + 0×2 + 1×1 + 1×2^-1 + 1×2^-2 = (101.11)_2 = 1.0111 × 2^2

Moving the binary point one bit position to the left divides the number by 2.

Moving the binary point one bit position to the right multiplies the number by 2.

Converting to floating point:

Convert (39.6875)_10 into floating point representation.

(39.6875)_10 = (100111.1011)_2 = (1.001111011)_2 × 2^5
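The normalization step can be sketched with the standard library function `math.frexp`, which returns a fraction in [0.5, 1) and a power of two. The helpers `normalize` and `significand_bits` are hypothetical names introduced for this illustration, and the sketch only handles positive inputs.

```python
import math


def normalize(x: float):
    """Return (significand, exponent) with 1 <= significand < 2 (illustrative helper)."""
    if x <= 0:
        raise ValueError("this sketch handles positive numbers only")
    m, e = math.frexp(x)      # x = m * 2**e with 0.5 <= m < 1
    return m * 2, e - 1       # shift the binary point so the leading bit is 1


def significand_bits(sig: float, n: int) -> str:
    """Write a significand in [1, 2) as '1.' followed by n binary fraction digits."""
    frac = sig - 1
    bits = ""
    for _ in range(n):
        frac *= 2
        bit = int(frac)
        bits += str(bit)
        frac -= bit
    return "1." + bits


sig, exp = normalize(39.6875)
print(significand_bits(sig, 9), "x 2^", exp)   # 1.001111011 x 2^ 5
for value in (23.0, 11.5, 5.75):
    sig, exp = normalize(value)
    print(value, "=", significand_bits(sig, 4), "x 2^", exp)
```

The three values in the loop share the significand 1.0111 and differ only in the exponent, which is the sense in which the binary point floats.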

IEEE 754 standard:

IEEE Floating Point Number Representation

Half Precision (16 bits: 1 sign bit, 5-bit exponent, 10-bit mantissa)

Single Precision (32 bits: 1 sign bit, 8-bit exponent, 23-bit mantissa)

Double Precision (64 bits: 1 sign bit, 11-bit exponent, 52-bit mantissa)
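As a quick consistency check, the field widths listed above can be compared with the storage sizes that Python's `struct` module reports for its `e`, `f` and `d` format codes. The table name `FORMATS` is an assumption of this sketch, and the bias is computed as 2^(k-1) - 1 for a k-bit exponent.

```python
import struct

# (struct format code, sign bits, exponent bits, mantissa bits); illustrative table
FORMATS = {
    "half": ("e", 1, 5, 10),
    "single": ("f", 1, 8, 23),
    "double": ("d", 1, 11, 52),
}

for name, (code, sign, exp, man) in FORMATS.items():
    total_bits = 8 * struct.calcsize("<" + code)
    # The three fields must add up to the storage size of the format.
    assert total_bits == sign + exp + man
    print(f"{name:6s}: {total_bits} bits, bias = {2 ** (exp - 1) - 1}")
```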

IEEE-754 single precision floating point

This standard uses a total of 32 bits to represent a floating point (FLP) number: 1 sign bit, an 8-bit exponent and a 23-bit significand (mantissa) field.

The format:

S EEEEEEEE MMMMMMMMMMMMMMMMMMMMMMM (1 sign bit, 8 exponent bits, 23 mantissa bits)

The exponent is stored as a biased exponent. With an 8-bit exponent the bias is 127. To store a value in FLP format we add the bias to the true exponent, and to read a value we subtract the bias from the stored exponent.
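The three fields of a single precision number can be pulled apart with shifts and masks. The sketch below uses the standard `struct` module to obtain the 32-bit pattern; `float32_fields` is a hypothetical helper name, and 39.6875 is reused from the conversion example because it is exactly representable.

```python
import struct


def float32_fields(x: float):
    """Split a value stored as IEEE-754 single precision into its fields (illustrative)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                   # 1 bit
    biased_exp = (bits >> 23) & 0xFF    # 8 bits, bias 127 still included
    fraction = bits & 0x7FFFFF          # 23 bits, hidden leading 1 not stored
    return sign, biased_exp, fraction


sign, biased_exp, fraction = float32_fields(39.6875)
print(sign, format(biased_exp, "08b"), format(fraction, "023b"))
# 0 10000100 00111101100000000000000
print(biased_exp - 127)  # 5, the true exponent after subtracting the bias
```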

IEEE-754 double precision FLP

This standard uses a total of 64 bits to represent a floating point (FLP) number: 1 sign bit, an 11-bit exponent and a 52-bit significand (mantissa) field.

The exponent is stored as a biased exponent. With an 11-bit exponent the bias is 1023. To store a value in FLP format we add the bias to the true exponent, and to read a value we subtract the bias from the stored exponent.
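The same idea applies to double precision, and reading a value back makes the role of the bias explicit. This is a sketch for normal numbers only; `float64_fields` and `from_fields` are hypothetical helper names.

```python
import struct


def float64_fields(x: float):
    """Split an IEEE-754 double into (sign, biased exponent, fraction); illustrative."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)


def from_fields(sign: int, biased_exp: int, fraction: int) -> float:
    """Rebuild the value of a normal number: subtract the bias, restore the hidden 1."""
    significand = 1 + fraction / 2 ** 52
    return (-1) ** sign * significand * 2.0 ** (biased_exp - 1023)


sign, biased_exp, fraction = float64_fields(39.6875)
print(sign, biased_exp, biased_exp - 1023)        # 0 1028 5
print(from_fields(sign, biased_exp, fraction))    # 39.6875
```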

FLP number representation in IEEE standard

The IEEE standard writes the significand in the normalized form 1.sss...s. Because the leading 1 before the binary point is always present in a normalized number, it is implied (hidden) and is not stored. A given number must therefore first be converted to the form ± 1.ssssss × 2^E.

Biased Exponent:

A biased exponent is the result of adding a constant (called the bias) to the true exponent, chosen so that the stored exponent is never negative. Biased exponents (bias 127 in IEEE single precision and 1023 in IEEE double precision) are particularly useful when encoding and decoding the floating-point representations of subnormal numbers.

Process of converting a decimal number to an IEEE standard FLP number:

1.    Convert the number to fixed point binary notation.

2.    Normalize so that a single 1 bit stands before the binary point, and adjust the exponent accordingly.

3.    Add the bias (+127 in IEEE single precision and +1023 in IEEE double precision) to the exponent value.

4.    Store the sign, the biased exponent and the fraction bits in the FLP format.
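The four steps can be followed by hand in code, without asking the hardware for the bit pattern. This is a minimal sketch, assuming a nonzero finite input in the normal range; `encode_single` is a hypothetical name, and extra fraction bits are truncated rather than rounded as real hardware would do.

```python
import math


def encode_single(x: float) -> str:
    """Return 'sign exponent fraction' for IEEE single precision (illustrative sketch)."""
    if x == 0 or not math.isfinite(x):
        raise ValueError("this sketch handles nonzero finite numbers only")
    sign = 1 if x < 0 else 0
    # Steps 1 and 2: frexp gives abs(x) = m * 2**e with 0.5 <= m < 1,
    # so doubling m yields the normalized form 1.sss x 2^(e - 1).
    m, e = math.frexp(abs(x))
    significand, exponent = m * 2, e - 1
    # Step 3: add the single precision bias.
    biased = exponent + 127
    # Step 4: drop the hidden leading 1 and keep 23 fraction bits (truncating).
    fraction = int((significand - 1) * 2 ** 23)
    return f"{sign} {biased:08b} {fraction:023b}"


print(encode_single(4.5))    # 0 10000001 00100000000000000000000
print(encode_single(-3.75))  # 1 10000000 11100000000000000000000
```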

Example-1: Represent 4.5 in IEEE single precision format

Solution:

  • Convert to binary fixed point: 4.5 = 100.1
  • Normalize: 100.1 = 1.001 × 2^2
  • Add the bias: exponent = 2; biased exponent = 2 + 127 = 129 = 10000001
  • Significand = 1.001; stored fraction = .001
  • 4.5 = ( 0 10000001 00100000000000000000000 )

Example-2: Represent -3.75 in FLP (IEEE single precision) representation

Solution:

  • (3.75)_10 = 11.11
  • Normalize: 11.11 = 1.111 × 2^1
  • Exponent = 1; biased exponent = 1 + 127 = 128 = 10000000
  • Significand = 1.111; the 1 before the binary point is hidden (assumed by the format) and is not stored, so the stored fraction = .111
  • Sign = 1 for a negative number
  • So, (-3.75)_10 = 1 10000000 11100000000000000000000

Example-3: Represent -53.5 in floating point notation, assuming 1 sign bit, an 8-bit exponent and a 23-bit mantissa

Solution:

  • 53 in binary = 110101
  • .5 in binary = .1
  • 53.5 = 110101.1
  • Normalize: -53.5 = -1.101011 × 2^5
  • Exponent = 5; biased exponent e = 5 + 127 = 132 = 10000100
  • Stored fraction = .101011
  • (-53.5)_10 = 1 10000100 10101100000000000000000
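The three worked examples can be cross-checked against the bit patterns the machine actually produces. This sketch relies on the standard `struct` module; `single_bits` is a hypothetical helper name.

```python
import struct


def single_bits(x: float) -> str:
    """Return the IEEE single precision pattern of x as 'sign exponent fraction'."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    s = format(word, "032b")
    return f"{s[0]} {s[1:9]} {s[9:]}"


for value in (4.5, -3.75, -53.5):
    print(value, single_bits(value))
# 4.5 0 10000001 00100000000000000000000
# -3.75 1 10000000 11100000000000000000000
# -53.5 1 10000100 10101100000000000000000
```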

Example-4:

Consider three registers R1, R2 and R3 that store numbers in IEEE-754 single precision floating point format. Assume R1 and R2 contain the values (in hexadecimal notation) R1 = 0x42200000 and R2 = 0xC1200000. If R3 = R1/R2, find R3.
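One way to work this example is to decode both registers, divide, and encode the quotient again. The sketch below assumes R2 is the 32-bit word 0xC1200000 (eight hexadecimal digits, as a single precision register requires); the helper names `hex_to_single` and `single_to_hex` are hypothetical.

```python
import struct


def hex_to_single(word: int) -> float:
    """Interpret a 32-bit word as an IEEE-754 single precision value."""
    return struct.unpack(">f", struct.pack(">I", word))[0]


def single_to_hex(x: float) -> str:
    """Return the 32-bit single precision pattern of x in hexadecimal."""
    return "0x%08X" % struct.unpack(">I", struct.pack(">f", x))[0]


r1 = hex_to_single(0x42200000)  # 0 10000100 0100...0 = +1.01 x 2^5
r2 = hex_to_single(0xC1200000)  # 1 10000010 0100...0 = -1.01 x 2^3 (assumed value of R2)
r3 = r1 / r2
print(r1, r2, r3)         # 40.0 -10.0 -4.0
print(single_to_hex(r3))  # 0xC0800000
```

Under that assumption R1 = 40, R2 = -10 and R3 = -4 = -1.0 × 2^2, which is stored with sign 1, biased exponent 129 = 10000001 and an all-zero fraction, that is 0xC0800000.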

IEEE 754 also defines Quadruple Precision (128 bits: 1 sign bit, 15-bit exponent, 112-bit mantissa)

Range of Floating Point numbers
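The limits of the normalized range follow from the field widths given earlier. The sketch below assumes, as IEEE 754 specifies, that the all-zeros and all-ones exponent patterns are reserved, so the usable true exponents run from 1 - bias to bias; `normal_range` is a hypothetical helper name.

```python
import struct
import sys


def normal_range(exp_bits: int, frac_bits: int):
    """Return (smallest positive normal, largest finite) value; illustrative helper."""
    bias = 2 ** (exp_bits - 1) - 1
    # Smallest normal: significand 1.0 with the smallest usable exponent.
    smallest = 2.0 ** (1 - bias)
    # Largest finite: significand 1.11...1 with the largest usable exponent.
    largest = (2 - 2.0 ** -frac_bits) * 2.0 ** bias
    return smallest, largest


single = normal_range(8, 23)
double = normal_range(11, 52)
print("single:", single)   # roughly 1.18e-38 to 3.4e+38
print("double:", double)   # roughly 2.2e-308 to 1.8e+308

# Cross-checks against the corresponding bit patterns and the interpreter's limits.
assert single == (struct.unpack(">f", bytes.fromhex("00800000"))[0],
                  struct.unpack(">f", bytes.fromhex("7f7fffff"))[0])
assert double == (sys.float_info.min, sys.float_info.max)
```

The same magnitudes apply to negative numbers, since the sign is stored in a separate bit.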

 

Note:

Zero (0) is represented by all 0s in both the exponent field and the significand field.
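The encoding of zero can be inspected directly. This is a small sketch using the standard `struct` module, with `single_word` as a hypothetical helper name; it also shows that a zero with the sign bit set keeps the exponent and significand fields at 0.

```python
import struct


def single_word(x: float) -> int:
    """Return the 32-bit IEEE single precision pattern of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


word = single_word(0.0)
exponent_field = (word >> 23) & 0xFF
fraction_field = word & 0x7FFFFF
print(f"0x{word:08X}", exponent_field, fraction_field)  # 0x00000000 0 0
print(f"0x{single_word(-0.0):08X}")                     # 0x80000000
```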

 
