Sunday 23 December 2012

Magic of Floating Point Representation in the Digital World

Many great engineering and scientific advances of recent decades would not have been possible without the floating-point capabilities of digital computers. Still, some results of floating-point calculations look pretty strange, even to people with years of mathematical experience. I will attempt to explain the causes of some of these strange results and give some suggestions where appropriate.
Let's take a look at the following example:



groovy> System.out.println(100.87d * 0.01d)
1.0087000000000002    --> Oops! What is happening here?
groovy>

We all know that 100.87 multiplied by 0.01 is 1.0087. So where does this junk come from? Let's look at the following statements to add more confusion:

groovy> System.out.println(100.87f * 0.01f)
1.0087000049196178
groovy> 
groovy> System.out.println(100.87 * 0.01)
1.0087
groovy>

To answer these questions, let's go back to the basics: how are floating point numbers represented in modern computer systems, and how is arithmetic done on them? Floating point numbers are stored in a binary format defined by the IEEE 754 standard for floating point arithmetic.
The format for single precision numbers uses 32 bits, divided in the following way:
     seeeeeeeefffffffffffffffffffffff
     
     s = sign bit, 1 bit
     e = exponent, 8 bits  (E_min=-126, E_max=127, bias=127)
     f = fraction, 23 bits
The format for double precision numbers uses 64 bits, divided in the following way:
     seeeeeeeeeeeffffffffffffffffffffffffffffffffffffffffffffffffffff
     
     s = sign bit, 1 bit
     e = exponent, 11 bits  (E_min=-1022, E_max=1023, bias=1023)
     f = fraction, 52 bits
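
If you want to see these bit patterns for yourself, the JDK exposes the raw encodings; here is a small Groovy sketch (using only standard JDK methods) that prints the 32-bit and 64-bit encodings of 100.87:

int  floatBits  = Float.floatToIntBits(100.87f)
long doubleBits = Double.doubleToLongBits(100.87d)

// Left-pad with zeros so the sign, exponent and fraction bits line up with the layouts above
println(Integer.toBinaryString(floatBits).padLeft(32, '0'))
println(Long.toBinaryString(doubleBits).padLeft(64, '0'))

The first printed bit is the sign, the next 8 (or 11) bits are the exponent, and the remaining bits are the fraction.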

The problem above arises because some numbers simply cannot be represented exactly in this format.

(100.87)₁₀ = (1100100.11011110101110000101000111101011100001010001111010111000010100…)₂

When we write 100.87d, the value is stored as a double, which has 64-bit precision; when we write 100.87f, it is stored as a float, which has 32-bit precision. Either way, even though 100.87 is an ordinary decimal number, it cannot be represented exactly in binary. This poses a real challenge for programmers in the financial domain, where accuracy is critical: if you normalize a price of 100.87 with a scale of 0.01, you introduce noise, as seen above. What most programmers do is round the result to however many significant digits the requirement calls for; a couple of options are sketched below.
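The following is only a sketch of the usual workarounds, using nothing beyond standard JDK classes: round the noisy double to the scale you actually need, or keep the arithmetic in BigDecimal so the decimal values are represented exactly.

import java.math.RoundingMode

double noisy = 100.87d * 0.01d                                      // 1.0087000000000002
println(new BigDecimal(noisy).setScale(4, RoundingMode.HALF_UP))    // rounds to 1.0087

// Or avoid binary floating point for money in the first place
println(new BigDecimal("100.87") * new BigDecimal("0.01"))          // exactly 1.0087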


A key feature of the IEEE standard is that it requires correctly rounded arithmetic operations. Very often, the result of an arithmetic operation on two floating point numbers is not itself a floating point number. This is most obviously the case for multiplication and division; for example, 1 and 10 are both floating point numbers but 1/10 is not, regardless of whether the single or double format is in use. It is also true of addition and subtraction.

Let x and y be floating point numbers, let +, −, *, / denote the four standard arithmetic operations, and let (+), (−), (*), (/) denote the corresponding operations as they are actually implemented on the computer. Thus, x + y may not be a floating point number, but x (+) y is the floating point number which is the computed approximation of x + y. When the result of a floating point operation is not a floating point number, the IEEE standard requires
that the computed result is the rounded value of the exact result. It is worth stating this requirement carefully. The rule is as follows: if x and y are floating point numbers, then
x (+) y = round(x + y);
x (-) y = round(x − y);
x (*) y = round(x * y);
and
x (/) y = round(x / y)
where round is the operation of rounding to the nearest floating point number in the single or double format, whichever is in use.
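
To make that 1/10 example concrete, here is a quick check in Groovy (new BigDecimal(double) prints the exact value a double really holds):

// The double closest to 1/10, written out exactly
println(new BigDecimal(0.1d))
// 0.1000000000000000055511151231257827021181583404541015625
println(1.0d / 10.0d == 0.1d)     // true: 1 (/) 10 is correctly rounded to that same double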

Now let's explain the behavior seen in the cases presented at the beginning.
groovy> System.out.println(100.87d * 0.01d)
1.0087000000000002
groovy> System.out.println(100.87f * 0.01f)
1.0087000049196178
groovy> 

Neither 100.87 nor 0.01 is exactly representable in binary, so what actually gets stored is an approximation: if a number is not exactly representable, it is rounded to one of the nearest representable values. So when we multiply 100.87 by 0.01 in the cases above, we are really multiplying approximations of those numbers, and that is why we see the junk. This is expected. But what about the case below? What is happening here?
groovy> System.out.println(100.87 * 0.01)
1.0087
groovy>

In this case no binary floating point is involved at all. In Groovy, a decimal literal without a suffix, such as 100.87 or 0.01, is a java.math.BigDecimal rather than a double. BigDecimal stores the value in decimal, so both numbers are represented exactly and the multiplication yields exactly 1.0087. There is one more twist worth analyzing:
groovy> System.out.println((float)(100.87f * 0.01f))
1.0087
groovy> System.out.println(100.87f * 0.01f)
1.0087000049196178
groovy>

In the 1st case the product is cast back to a float before printing; in the 2nd case it is printed as is. Groovy promotes float arithmetic to double, so the raw product is a double carrying the representation error of the float inputs, which is what the 2nd case shows. Casting that double back to float rounds it to the nearest float, and Float.toString then prints only as many digits as are needed to identify that float, which happens to be 1.0087. The noise is not gone; it is just smaller than a float can express.
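Both claims are easy to verify from groovysh; here is a small sketch (the class names are what a standard JVM prints):

// Plain decimal literals in Groovy are BigDecimal; the suffixed ones are not
println((100.87).getClass())     // class java.math.BigDecimal
println((100.87d).getClass())    // class java.lang.Double
println((100.87f).getClass())    // class java.lang.Float

// Float arithmetic is promoted to double, so the noise of the float inputs shows up...
def product = 100.87f * 0.01f    // a Double: 1.0087000049196178
// ...until the cast back to float rounds it away
println((float) product)         // prints 1.0087

Note that the clean 1.0087 from the unsuffixed literals is a Groovy convenience, not a floating point property: the equivalent Java expression 100.87 * 0.01 uses double literals and prints the same noisy 1.0087000000000002 as the d-suffixed case above.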
