Floating Point Rounding Problem — Part 01

5 min read · May 10, 2021

This is one of the most important things every developer should know. You may have built systems without considering it, and they may be working properly right now. But 5 to 10 years from now, an unexpected error could land you in trouble. So let’s see what this problem is.

We all know that computers work in binary. When we assign a value to a float or double variable, the value is converted into binary format and then stored in memory or used in whatever operations we request. When the operations finish and the result is shown to the user, the binary value is converted back into decimal format, and that is the moment the problem shows up.

Let’s discuss the problem using an example. Before that, we need to know about the IEEE 754 standard.

IEEE 754 (Standard for Floating-Point Arithmetic)

It is a standard that specifies interchange and arithmetic formats and methods for binary and decimal floating-point arithmetic in programming. It represents a floating-point number using three components: the sign bit, the exponent value and the mantissa value.

Sign bit: This component expresses whether the value is positive or negative. 0 represents a positive number, while 1 represents a negative number.
Exponent value: The exponent field must be able to represent both positive and negative exponents. To obtain the stored exponent, a bias is added to the actual exponent.
Mantissa value: The mantissa holds the significant digits of a number in scientific notation or of a floating-point number. In binary we only have two digits, 0 and 1. As a result, a normalized mantissa always has exactly one 1 to the left of the binary point.
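To make the three components concrete, here is a minimal Python sketch that splits a number stored in single precision into its sign bit, stored exponent and mantissa. The language choice, the helper name `float32_fields` and the sample value -6.5 are assumptions made for illustration. The sketch relies on the standard single-precision layout of 1 sign bit, 8 exponent bits and 23 mantissa bits, with a bias of 127.

```python
import struct

def float32_fields(value):
    """Split a number, stored as IEEE 754 single precision, into its three fields."""
    # Pack as a big-endian 32-bit float, then reinterpret the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                      # 1 bit
    stored_exponent = (bits >> 23) & 0xFF  # 8 bits, bias of 127 already added
    mantissa = bits & 0x7FFFFF             # 23 bits, the leading 1 is not stored
    return sign, stored_exponent, mantissa

sign, stored_exponent, mantissa = float32_fields(-6.5)
print(sign)                      # 1 (negative)
print(stored_exponent - 127)     # 2, because 6.5 = 1.101 (binary) x 2^2
print(format(mantissa, "023b"))  # 10100000000000000000000
```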

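IEEE 754 formats for different types

Single precision (float, 32 bits): 1 sign bit, 8 exponent bits, 23 mantissa bits, exponent bias 127
Double precision (double, 64 bits): 1 sign bit, 11 exponent bits, 52 mantissa bits, exponent bias 1023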

Example: Let’s convert 8.3 into binary format

Step 01: First split the given number into two parts, the whole-number part and the fractional part, and convert the whole-number part into binary format. (Converting 8 into binary format.)

Step 01: Converting 8 to binary format (8 in decimal is 1000 in binary)
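Step 01 can be sketched in Python as follows. The article does not say which method it uses for the whole-number part, so this assumes the usual repeated division by 2. The function name `int_to_binary` is hypothetical.

```python
def int_to_binary(n):
    """Convert a non-negative integer to a binary string by repeated division by 2."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % 2))  # the remainder is the next bit, least significant first
        n //= 2
    return "".join(reversed(digits))

print(int_to_binary(8))  # 1000
```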

Step 02: Then convert the fractional part into binary format. To do this, multiply the value by 2 and write down the digit that appears before the point. Then take the remaining fractional part and repeat the step until the same series of digits starts repeating. (Converting 0.3 into binary format.)

Step 02: Converting 0.3 to binary format (0.3 in decimal is 0.0100110011... in binary, with the group 1001 repeating forever)
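The multiply-by-2 procedure from Step 02 might look like this in Python. This is only a sketch: the name `fraction_to_binary` and the `max_bits` cut-off are assumptions, and it uses `fractions.Fraction` so that 0.3 is held exactly while the digits are produced.

```python
from fractions import Fraction

def fraction_to_binary(frac, max_bits):
    """Return the first max_bits binary digits of a fraction in the range [0, 1)."""
    bits = []
    for _ in range(max_bits):
        frac *= 2
        bit = 1 if frac >= 1 else 0  # the digit that appears before the point
        bits.append(str(bit))
        frac -= bit                  # keep only the part after the point
    return "".join(bits)

print(fraction_to_binary(Fraction(3, 10), 14))  # 01001100110011
```

The digits never terminate for 0.3, which is why the number of bits has to be cut off somewhere.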

Step 03: Then join the two binary representations to write the whole number in binary format. (Binary representation of 8.3: 1000.0100110011...)

Step 04: Convert that binary representation into scientific representation by moving the binary point until only a single 1 remains to its left. (Scientific representation of the binary form of 8.3: 1.0000100110011... × 2³)

Step 04: Converting to scientific representation
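As a rough illustration of Steps 03 and 04, the Python sketch below joins the two bit strings and moves the binary point. The function name `normalize` is hypothetical, and the sketch only handles values of 1 or more, which is enough for 8.3.

```python
def normalize(int_bits, frac_bits):
    """Rewrite int_bits.frac_bits as 1.xxxx x 2^exponent (values >= 1 only)."""
    int_bits = int_bits.lstrip("0")
    if not int_bits:
        raise ValueError("this sketch only handles values >= 1")
    exponent = len(int_bits) - 1            # how many places the point moves left
    after_point = int_bits[1:] + frac_bits  # digits to the right of the leading 1
    return after_point, exponent

after_point, exponent = normalize("1000", "0100110011")
print(f"1.{after_point} x 2^{exponent}")  # 1.0000100110011 x 2^3
```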

Step 05: Now represent the value according to the IEEE 754 standard. This has three sub-processes.

Representing the sign bit value: Check whether the value is positive or negative. If it is positive, assign 0; if it is negative, assign 1. For 8.3 the sign bit is 0.
Representing the exponent value: Look up the exponent width for your precision in the table above (8 bits for single precision, 11 bits for double precision). Let’s use single precision, so the exponent field has 8 bits, which gives 2⁸ = 256 possible stored values (0 to 255). To cover both positive and negative exponents, the standard adds a bias of 127 (that is, 2⁷ - 1) to the actual exponent before storing it. Take the exponent from the scientific representation (3 for 8.3) and add the bias: 3 + 127 = 130. That is the stored exponent. Converted into binary format it is 10000010.

Representing the mantissa value: Look up the mantissa width for your precision in the table, take the binary number in scientific form, and write out its digits until you reach that width. For example, single precision needs 23 digits in the mantissa. (The leading 1 of the scientific representation is not counted. Only the digits after the binary point are used.) For 8.3 this gives 00001001100110011001100.

After collecting the three values above, you have the complete IEEE 754 representation of the chosen floating-point value. (IEEE 754 representation of 8.3.)

Step 05: IEEE-754 Representation for 8.3
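The three sub-processes of Step 05 can be put together in a short Python sketch. It follows the hand calculation in this article: it keeps the first 23 mantissa digits and simply drops the rest, with no rounding. The function name `float32_bits_truncated` is an assumption, and the sketch only covers nonzero values whose exponent fits the normal single-precision range.

```python
import math
from fractions import Fraction

BIAS = 127          # single-precision exponent bias
MANTISSA_BITS = 23  # single-precision mantissa width

def float32_bits_truncated(value):
    """Hand-build the 32-bit pattern for a nonzero value, cutting off extra mantissa bits."""
    value = Fraction(value)
    sign = "1" if value < 0 else "0"
    significand = abs(value)
    exponent = 0
    while significand >= 2:  # move the binary point to the left
        significand /= 2
        exponent += 1
    while significand < 1:   # move the binary point to the right
        significand *= 2
        exponent -= 1
    exponent_bits = format(exponent + BIAS, "08b")
    # Drop the leading 1, then keep the first 23 digits after the point (no rounding).
    mantissa = math.floor((significand - 1) * 2**MANTISSA_BITS)
    mantissa_bits = format(mantissa, "023b")
    return sign + exponent_bits + mantissa_bits

print(float32_bits_truncated(Fraction(83, 10)))
# 01000001000001001100110011001100
```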

So now you know how to convert a floating-point value into IEEE 754 format. Every value we assign to a float or double variable in a program is stored in this format. But if you compute the binary representation of 8.3 with an online IEEE 754 calculator or any other computing device, you will not get exactly the answer we just calculated.

IEEE 754 value for 8.3 generated by an online calculator

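Comparing the two, our calculated value for 8.3 is 01000001000001001100110011001100, while the generated value is 01000001000001001100110011001101. They differ in the last bit. We simply cut the mantissa off after 23 digits, whereas the standard rounds to the nearest representable value, and because the discarded digits (1100...) amount to more than half of the last place, the last bit is rounded up. So when we store 8.3 in a float and read it back, we get 8.3000002. This is known as the floating point rounding problem.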
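You can check the stored value on your own machine with a few lines of Python. This sketch uses the standard `struct` module to force 8.3 into a 32-bit float. The helper names `float32_bits` and `float32_roundtrip` are assumptions made for illustration.

```python
import struct

def float32_bits(value):
    """Return the 32-bit pattern that is actually stored for value as a float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    return format(bits, "032b")

def float32_roundtrip(value):
    """Store value as a 32-bit float and read it back."""
    (stored,) = struct.unpack(">f", struct.pack(">f", value))
    return stored

print(float32_bits(8.3))                # 01000001000001001100110011001101
print(f"{float32_roundtrip(8.3):.7f}")  # 8.3000002
```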

Imagine you have used float as the data type for sensitive data such as currency. Calculations shift the values like this, and unexpected errors appear in your system when its results are compared with the real values in the books. I hope you now have a clear idea of the floating point rounding problem. In the next article I’ll describe the problem in more detail, along with methods you can use to avoid it.

Stay Safe and Learn New Things!!!


Written by Thilini Weerasinghe
