Minimizing Numeric Errors

Using floating-point numbers in the loop continuation condition may cause numeric errors.

Numeric errors involving floating-point numbers are inevitable, because computers can only approximate most floating-point values. This section uses an example to show how to minimize such errors.

The example program sums a series that starts with 0.01 and ends with 1.0. Each number in the series is 0.01 larger than the one before it: 0.01 + 0.02 + 0.03, and so on.

The for loop (lines 10-11) repeatedly adds the control variable i to sum. This variable, which begins with 0.01, is incremented by 0.01 after each iteration. The loop terminates when i exceeds 1.0.
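The listing this paragraph describes is not reproduced here. A minimal sketch of such a program might look like the following. The class name TestSum and the helper method sumSeries are illustrative assumptions, and the line numbers cited above belong to the original listing rather than to this sketch.

```java
// Illustrative sketch: the class and method names are assumptions,
// not taken from the original listing.
class TestSum {
    // Sums 0.01, 0.02, ... using a float control variable in the loop condition.
    static float sumSeries() {
        // Initialize sum
        float sum = 0;

        // i starts at 0.01 and grows by 0.01 after each iteration;
        // the loop stops once i exceeds 1.0
        for (float i = 0.01f; i <= 1.0f; i = i + 0.01f) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Display the result
        System.out.println("The sum is " + sumSeries());
    }
}
```

The f suffix on each literal is needed because an unsuffixed literal such as 0.01 has type double in Java and cannot be assigned to a float without a cast.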

The initial action of a for loop can be any statement, but it is most often used to initialize a control variable. As this example shows, a control variable can be of type float. In fact, it can be of any data type.

The exact sum should be 50.50, but the program displays 50.499985. The result is imprecise because computers use a fixed number of bits to represent floating-point numbers, and thus cannot represent some values, such as 0.01, exactly. If you change float in the program to double, as follows, you might expect a slight improvement in precision, because a double variable holds 64 bits, whereas a float variable holds 32 bits.

// Initialize sum
double sum = 0;
// Add 0.01, 0.02, …, 0.99, 1 to sum
for (double i = 0.01; i <= 1.0; i = i + 0.01)
    sum += i;

However, you will be stunned to see that the result is actually 49.50000000000003. What went wrong? If you display i for each iteration of the loop, you will see that the value that should be exactly 1 is in fact slightly larger than 1. The condition i <= 1.0 therefore fails one iteration early, and the final term is never added to sum. The fundamental problem is that floating-point numbers are represented by approximation. To fix the problem, use an integer count to ensure that all the numbers are added to sum. Here is the new loop:

double currentValue = 0.01;
for (int count = 0; count < 100; count++) {
    sum += currentValue;
    currentValue += 0.01;
}

After this loop, sum is 50.50000000000003. This loop adds the numbers from smallest to biggest. What happens if you add the numbers from biggest to smallest (i.e., 1.0, 0.99, 0.98, . . . , 0.02, 0.01 in this order), as follows?

double currentValue = 1.0;
for (int count = 0; count < 100; count++) {
    sum += currentValue;
    currentValue -= 0.01;
}

After this loop, sum is 50.49999999999995. In this example, adding from biggest to smallest is less accurate than adding from smallest to biggest: the error is about 5 × 10^-14 rather than about 3 × 10^-14. This phenomenon is an artifact of finite-precision arithmetic. Adding a very small number to a very big number can have no effect if the result requires more precision than the variable can store. For example, 100000000.0 + 0.000000001 evaluates to 100000000.0, so the small addend is lost entirely. To obtain more accurate results, carefully select the order of computation. Adding smaller numbers before bigger numbers is one way to minimize errors.
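To try the loops from this section side by side, they can be gathered into one small program. This is only a sketch: the class name SummationOrder and its method names are assumptions made for illustration, and the values mentioned in the comments are the ones reported in the text above.

```java
// Illustrative sketch: class and method names are assumptions made for this example.
class SummationOrder {
    // Runs the loop controlled by a double and counts how many terms it visits.
    // The text reports that the term that should be 1.0 is skipped.
    static int termsAddedByDoubleLoop() {
        int terms = 0;
        for (double i = 0.01; i <= 1.0; i = i + 0.01) {
            terms++;
        }
        return terms;
    }

    // Adds 0.01, 0.02, ..., 1.0 (smallest first); the integer counter
    // guarantees exactly 100 terms regardless of rounding in currentValue.
    static double sumSmallestFirst() {
        double sum = 0;
        double currentValue = 0.01;
        for (int count = 0; count < 100; count++) {
            sum += currentValue;
            currentValue += 0.01;
        }
        return sum;
    }

    // Adds 1.0, 0.99, ..., 0.01 (biggest first), again with an integer counter.
    static double sumBiggestFirst() {
        double sum = 0;
        double currentValue = 1.0;
        for (int count = 0; count < 100; count++) {
            sum += currentValue;
            currentValue -= 0.01;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("Terms added by the double loop: " + termsAddedByDoubleLoop());
        System.out.println("Smallest first: " + sumSmallestFirst()); // text reports 50.50000000000003
        System.out.println("Biggest first:  " + sumBiggestFirst());  // text reports 49.x is avoided; 50.49999999999995
        // The tiny addend is below the precision available near 100000000.0
        System.out.println("Small addend lost: " + (100000000.0 + 0.000000001 == 100000000.0));
    }
}
```

Printing the term count first makes it easy to confirm that the double-controlled loop stops one term short, which is why the integer counter is used in the other two methods.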
