Applying C - Floating Point
Written by Harry Fairhead   
Tuesday, 21 May 2024
Article Index
Applying C - Floating Point
Floating Point Algorithms
Detecting Problems
Floating Point Reconsidered

Floating point arithmetic solves all of your problems - except when it doesn't. It really is simple once you read this extract from my book on using C in an IoT context.

Now available as a paperback or ebook from Amazon.

Applying C For The IoT With Linux

  1. C,IoT, POSIX & LINUX
  2. Kernel Mode, User Mode & Syscall
  3. Execution, Permissions & Systemd
    Extract Running Programs With Systemd
  4. Signals & Exceptions
    Extract  Signals
  5. Integer Arithmetic
    Extract: Basic Arithmetic As Bit Operations
    Extract: BCD Arithmetic  ***NEW
  6. Fixed Point
    Extract: Simple Fixed Point Arithmetic
  7. Floating Point 
  8. File Descriptors
    Extract: Simple File Descriptors 
    Extract: Pipes 
  9. The Pseudo-File System
    Extract: The Pseudo File System
    Extract: Memory Mapped Files 
  10. Graphics
    Extract: framebuffer
  11. Sockets
    Extract: Sockets The Client
    Extract: Socket Server
  12. Threading
    Extract:  Pthreads
    Extract:  Condition Variables
    Extract:  Deadline Scheduling
  13. Cores Atomics & Memory Management
    Extract: Applying C - Cores 
  14. Interrupts & Polling
    Extract: Interrupts & Polling 
  15. Assembler
    Extract: Assembler

Also see the companion book: Fundamental C


Fixed point may be simple, but it is very limited in the range of numbers it can represent. If a calculation involves big scale changes, i.e. results very much smaller or bigger than the initial set of numbers, then it fails due to underflow or overflow unless you pay a great deal of attention to the fine detail. A better scheme, if you want trouble-free calculations, is to use a floating point representation.
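
As a minimal sketch of the problem, assuming a 16.16 fixed-point format of the sort described in the previous chapter (the fixed_t type and helper macros here are hypothetical, just for illustration):

#include <stdio.h>
#include <stdint.h>

/* Hypothetical 16.16 fixed point: 16 integer bits, 16 fraction bits */
typedef int32_t fixed_t;
#define TO_FIXED(x)  ((fixed_t)((x) * 65536.0))
#define TO_DOUBLE(x) ((double)(x) / 65536.0)

/* multiply using a 64-bit intermediate, then truncate back to 32 bits -
   a result that needs more than 16 integer bits simply loses its high bits
   on typical two's-complement hardware */
fixed_t fixed_mul(fixed_t a, fixed_t b) {
    return (fixed_t)(((int64_t)a * b) >> 16);
}

int main(void) {
    fixed_t big   = fixed_mul(TO_FIXED(1000.0), TO_FIXED(1000.0));
    fixed_t small = TO_FIXED(0.00001);            /* below 1/65536, underflows to 0 */

    printf("fixed  1000*1000 = %f\n", TO_DOUBLE(big));    /* nonsense - overflowed */
    printf("fixed  0.00001   = %f\n", TO_DOUBLE(small));  /* 0.000000 - underflowed */
    printf("double 1000*1000 = %f, 0.00001 = %g\n", 1000.0 * 1000.0, 0.00001);
    return 0;
}

The double copes with both the very large and the very small value without any attention from the programmer, which is the whole attraction of floating point.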

Pros and Cons of Floating Point

Floating point is easy to use. You feed it the numeric values and the expression and simply expect it to get the right answer. You can usually forget about overflow and other problems and just rely on the FPU to get on with it. This is one of the many things that programmers believe about floating point and it is mostly wrong. Floating point is flexible and easy to use, but unless you know what you are doing you can get almost random values back from a calculation.

Not so long ago floating point was the exception rather than the rule on small machines. Even processors that had floating point hardware often couldn't use it because of a lack of software support. For example, Linux for the Raspberry Pi, Raspbian, took some years to make hardware floating point available - floating point on ARM processors in particular was a confusing mess of different types of hardware. Today there are still processors that leave out floating point hardware to save cost and power, including the Arduino Uno and most of the PIC range of processors.

Another big change is that today's FPUs are fast. Only a few years ago floating point hardware incurred a significant overhead in communicating with the CPU, which made floating point much slower than integer arithmetic, and the default practice was to use fixed point wherever speed mattered. Today's FPUs are much better integrated with the CPU and much better optimized - in most cases you can expect a speed penalty of only 20 to 30%. This means that if an integer alternative to a floating point operation needs as few as two integer operations, it is already likely to be slower.

If you have a modern FPU then use it.
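
If you want to check the size of the penalty on your own hardware, a rough POSIX timing sketch along the following lines will do. It is only a sketch - micro-benchmarks like this are easily distorted by the optimizer, so treat the numbers as indicative:

#include <stdio.h>
#include <time.h>

#define N 100000000ULL

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    volatile unsigned long long iacc = 0;   /* volatile stops the loops being optimized away */
    volatile double facc = 0.0;
    double t0, t1, t2;

    t0 = seconds();
    for (unsigned long long i = 0; i < N; i++) iacc += 3;    /* integer add */
    t1 = seconds();
    for (unsigned long long i = 0; i < N; i++) facc += 3.0;  /* floating point add */
    t2 = seconds();

    printf("integer loop %.3fs, floating point loop %.3fs\n", t1 - t0, t2 - t1);
    return 0;
}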

In this chapter we look at some of the aspects of floating point that are important to the general programmer. The whole subject is very big and leads into issues of numerical analysis and exactly how to do a computation. Here the aim is to make you aware of some of the subtle problems that you can encounter with floating point – it’s stranger than you might imagine.

In particular, unless your calculation involves only moderate values and only a few decimal places of accuracy matter, you can't simply supply an arithmetic expression and expect it to give you the right answer. Floating point arithmetic can go very wrong unless you understand it - and even then it can still go wrong.

How computers work with numbers is a complete field of study in its own right - numerical analysis - and there is no way that a single chapter can do more than touch on the subject. What this chapter is about is the way floating point works and some of the problems that arise in simple computations.

The Floating Idea

Floating point allows the precision and magnitude of the representation to change as the computation proceeds. You can do the same thing with fixed point by varying the position of the binary point to accommodate the result, and this can be considered a primitive form of floating point. Of course, as you move the binary point you lose precision to gain an increase in magnitude. So it is with floating point, but there are generally many more bits allocated to the problem.

In floating point the binary point is allowed to move during the calculation, i.e. the binary point "floats", but extra bits have to be allocated to keep track of where it is.

The advantage of this approach is clear if you consider multiplying a value such as 123.4 by 1000. If the hardware (decimal in this case) can only hold four digits then the result is an overflow error. That is:

123.4 * 1000 = 123400

truncates to 3400, which is clearly not the right answer.

If the hardware uses the floating point approach it can simply record the shift in the decimal point four places to the right. You can think of this as a way of allowing a much larger range of numbers to be represented, but with a fixed number of digits’ precision. Notice that the number of digits of precision, and hence the percentage accuracy, stays the same, but the absolute accuracy changes with the size of the number.
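
You can see the same behavior with C's 32-bit float type, which carries roughly seven significant decimal digits however large or small the exponent gets - a small sketch:

#include <stdio.h>

int main(void) {
    float f = 123.4f;

    /* scaling by 1000 just adjusts the exponent - no overflow */
    printf("%.1f\n", f * 1000.0f);       /* 123400.0, or something very close to it */

    /* but there are still only about seven significant digits */
    float g = 123456789.0f;
    printf("%.1f\n", g);                 /* 123456792.0 - the low-order digits are gone */
    return 0;
}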

A floating point number is represented by two parts – an exponent and a fractional part. The fractional part is just a fixed-point representation of the number – usually with the fractional point to the immediate left of the first bit, making its value less than 1. The exponent is a scale factor which determines the true magnitude of the number.

In decimal we are used to this scheme as scientific notation, standard form or exponential notation. For example, Avogadro’s number is usually written as 6.02252 x 10^23 and the 23 is the exponent and the 6.02252 is the fractional part – notice that in standard form the fractional part is always less than 10 and more than 1. In binary floating point representation it is usual for the fractional part to be normalized to be just less than 1.
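
You can see this split directly in C - the standard library function frexp() breaks a double into exactly this pair of parts, a fraction with magnitude in the range [0.5, 1) and a power-of-two exponent:

#include <stdio.h>
#include <math.h>

int main(void) {
    double values[] = { 123.4, 0.001, 1000000.0 };

    for (int i = 0; i < 3; i++) {
        int exp;
        double frac = frexp(values[i], &exp);   /* values[i] == frac * 2^exp */
        printf("%10g = %f x 2^%d\n", values[i], frac, exp);
    }
    return 0;
}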

Binary floating point is just the binary equivalent of decimal standard form. The exponent is the power of two by which you have to multiply the fraction to get the true magnitude. At this point you might want to write floating point off as trivial, but there are some subtleties. For example, when the fractional part is zero what should the exponent be set to?

Clearly there is more than one representation for zero. By convention, the exponent is made as negative as it can be, i.e. as small as possible, in the representation of zero. If two's-complement were used this would result in a zero that didn’t have all its bits set to zero, and this is to be avoided for obvious reasons. To avoid it a small change is made - a biased, rather than two's-complement, exponent is used, i.e. the signed exponent is obtained by adding the largest negative value to the stored value. For example, if the exponent is six bits in size, the two's-complement notation range is –32 to +31.

If instead of two's-complement a simple biased representation is used, then we have to subtract 32 from the exponent to get the signed value. In this case an exponent of 0 represents –32, 32 represents 0, and 63 represents 31. The same range is covered, but now the representation of zero has all bits set to 0 and it corresponds to 0 x 2^-32, i.e. zero with the most negative exponent possible.
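
The six-bit exponent is just for illustration, but the same biased scheme is what the standard 32-bit IEEE 754 float uses - an 8-bit exponent field with a bias of 127. A quick sketch that pulls the fields out of a float's bit pattern confirms that 0.0f is the all-zeros pattern, i.e. the smallest possible exponent field together with a zero fraction:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <inttypes.h>

/* IEEE 754 single precision: 1 sign bit, 8-bit biased exponent (bias 127),
   23-bit fraction. An exponent field of 0 is reserved for zero and denormals. */
void show(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* well-defined way to reinterpret the bits */
    uint32_t sign = bits >> 31;
    uint32_t exp  = (bits >> 23) & 0xFF;     /* stored (biased) exponent field */
    uint32_t frac = bits & 0x7FFFFF;
    printf("%10g  bits=0x%08" PRIX32 "  sign=%" PRIu32 "  exponent=%" PRIu32
           "  fraction=0x%06" PRIX32 "\n", f, bits, sign, exp, frac);
}

int main(void) {
    show(0.0f);      /* all zero fields - bits=0x00000000 */
    show(1.0f);      /* exponent field 127, i.e. a true exponent of 0 */
    show(123.4f);
    return 0;
}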


