Graphics of Ryan Geiss



---------------------------------------------------------------------------------
USING THE PENTIUM FLOATING-POINT UNIT
---------------------------------------------------------------------------------

** DISCLAIMER: Please excuse the disarray; this file is a bit of a mess.  I'll 
try to clean it up soon, but it's hard to find the time.  In the meantime, all 
the info is here; you might just have to dig around.  Enjoy.

Please be aware that this information is only for super-nerds; regular compilers 
tend to optimize FPU usage very well!  But if you're really desperate to save
a clock cycle, or just generally knowledge-thirsty, read on.

What you will need: A basic understanding of how stacks/pushes/pops 
work; know what a floating-point number is (vs. integer); know about pointers; 
and you will need a compiler that supports Pentium instructions/optimizations.  
Assembly programming experience and basic understanding of system 
architecture are a must.




USING THE PENTIUM FLOATING-POINT UNIT
---------------------------------------------------------------------------------

The FPU is built right into the Pentium.  It has a stack that holds eight
floating-point values, referenced by st(0) through st(7).  st(0) is the 
"top" of the stack and is where pushes and pops occur.  It is also the 
default register to use for most arithmetic.  Here are the basic instructions:

                                                                        clocks
FLD   [pointer]        push a copy of an 8-byte f.p. value into st(0)     1
FADD  st(0), st(y)     add st(y) to st(0)                                 3
FSUB  st(0), st(y)     subtract st(y) from st(0)                          3
FMUL  st(0), st(y)     multiply st(0) by st(y)                            3
FDIV  st(0), st(y)     divide st(0) by st(y)                              ~39
FXCH  st(x)            swaps st(0) and st(x) where x is in [1..7]         0*
FST   [pointer]        copy st(0) to mem. location at pointer             2

FADDP, FSUBP, FMULP    perform the operation then pop the stack.          3
FSTP                   store, then pop the stack.                         2

* FXCH is not always free; see CAUTIONS below.

-For the four arithmetic operations, either the source or destination operand 
	MUST be st(0).  
-To make a copy of a value in a register you can use FLD st(x).  This will copy 
	it into st(0).  
-Many commands can be suffixed by a "P", which, after storing the result, pops 
	the stack (deletes st(0)).  Note that this happens, usually, 3 cycles after 
	the command is issued (or more if we’re waiting on a source)!
-Memory operands in the table above are assumed to be 8-byte doubles.  To load or 
	store other sizes, see Precision Control below.
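
If the st(x) numbering is hard to picture, here is a rough C sketch of the stack 
behaviour described above.  It is only a toy model: every name in it (FpuStack, 
fpu_fld, fpu_faddp, and so on) is made up for illustration, and it says nothing 
about timing.  The point is that st(i) is always counted from the current top, so 
every push or pop renumbers all the registers.

```c
/* Toy model of the FPU register stack. Every name here is made up for
   illustration; this is not how the hardware is programmed. */
typedef struct {
    double reg[8];   /* the eight physical registers */
    unsigned top;    /* which physical register is currently st(0) */
} FpuStack;

/* st(i) is always counted from the current top of stack. */
double *st(FpuStack *f, unsigned i) { return &f->reg[(f->top + i) & 7u]; }

/* FLD: push. The old st(0) becomes st(1), st(1) becomes st(2), ... */
void fpu_fld(FpuStack *f, double v) {
    f->top = (f->top - 1u) & 7u;
    *st(f, 0) = v;
}

/* Pop: discard st(0); the old st(1) becomes st(0). */
void fpu_pop(FpuStack *f) { f->top = (f->top + 1u) & 7u; }

/* FADDP st(y), st(0): add st(0) into st(y), then pop. */
void fpu_faddp(FpuStack *f, unsigned y) { *st(f, y) += *st(f, 0); fpu_pop(f); }

/* FMULP st(y), st(0): multiply st(y) by st(0), then pop. */
void fpu_fmulp(FpuStack *f, unsigned y) { *st(f, y) *= *st(f, 0); fpu_pop(f); }

/* FXCH st(x): swap st(0) and st(x). */
void fpu_fxch(FpuStack *f, unsigned x) {
    double t = *st(f, 0);
    *st(f, 0) = *st(f, x);
    *st(f, x) = t;
}

/* (a + b) * c as a short FPU-style sequence. */
double eval_sum_times(double a, double b, double c) {
    FpuStack f = { {0.0}, 0u };
    fpu_fld(&f, c);     /* st(0)=c                    */
    fpu_fld(&f, a);     /* st(0)=a  st(1)=c           */
    fpu_fld(&f, b);     /* st(0)=b  st(1)=a  st(2)=c  */
    fpu_faddp(&f, 1);   /* st(0)=a+b  st(1)=c         */
    fpu_fmulp(&f, 1);   /* st(0)=(a+b)*c              */
    return *st(&f, 0);
}

/* Push two values, swap them with FXCH, and report what ends up in st(0). */
double top_after_fxch(double first, double second) {
    FpuStack f = { {0.0}, 0u };
    fpu_fld(&f, first);    /* st(0)=first                */
    fpu_fld(&f, second);   /* st(0)=second  st(1)=first  */
    fpu_fxch(&f, 1);       /* st(0)=first   st(1)=second */
    return *st(&f, 0);
}
```

Calling eval_sum_times(1.0, 2.0, 4.0) walks through the five steps in the comments 
and leaves 12.0 in st(0).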


For more info, look at the "developers" section of Intel’s website.  Go to 
product documentation and get the Intel Architecture Software Developer’s 
Manual®, Volume 1, in Acrobat format, 24319001.PDF.  Chapter 7 is the complete 
FPU handbook.  Start here; instructions are introduced in a tutorial format.  
For more detailed descriptions of the instructions, get volume 2, 24319101.PDF, 
which is a complete index of *all* Pentium commands, and look under F.






OPTIMIZING
---------------------------------------------------------------------------------
FADD, FSUB, and FMUL all take 3 clock cycles.  Each takes two operands: the destination 
and the source stack position, one of which must be st(0).  But you can start new 
calculations while you're waiting on the others, so long as an operand to the second 
calculation doesn't depend on the result of the first.  The key is in having operations 
that don’t stall out waiting for previous results.  

One complicated but worthwhile trick is taking two complex expressions and breaking 
them down into two or three threads via assembly; then interleaving them with FXCH.  
You can almost double the speed.  This is how my realtime Julia fractal generator works.

Another good idea is to take advantage of necessary stalls.  If you can not get around
a certain calculation stalling (a later calculation sits and waits for its result),
go ahead and start a calculation you will need later while you're sitting there.

All throughout your floating-point code, you can allegedly be doing regular integer 
assembly, but this seldom fits well into a coding algorithm.  The Pentium has the two 
pipes, U and V, for integer calculations, and so all throughout your FPU code you could
ideally have integer math going through for free.
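
To put a rough number on the interleaving idea, here is a small C sketch that 
counts cycles under a deliberately simplified model: one FPU instruction issues 
per cycle, in order, and every result is ready 3 cycles after issue.  That model, 
and the names FpuOp and total_cycles, are my assumptions for illustration; it 
ignores FXCH, the back-to-back FMUL stall, and everything else a real Pentium 
does.  It only shows why two independent chains finish sooner when interleaved.

```c
/* Simplified issue model (an assumption, not a cycle-exact Pentium model):
   one FPU instruction issues per cycle, in order, and each result is
   available 3 cycles after issue. Registers are abstract slots 0..7. */
typedef struct {
    int dst;   /* slot written (it is also the first source) */
    int src;   /* second source slot */
} FpuOp;

/* Returns the cycle at which the last result becomes available. */
int total_cycles(const FpuOp *ops, int count) {
    int ready[8] = {0};   /* cycle at which each slot's value is ready */
    int next_issue = 0;   /* earliest cycle the next op may issue */
    int finish = 0;
    int i;
    for (i = 0; i < count; i++) {
        int issue = next_issue;
        /* stall until both operands are ready */
        if (ready[ops[i].dst] > issue) issue = ready[ops[i].dst];
        if (ready[ops[i].src] > issue) issue = ready[ops[i].src];
        ready[ops[i].dst] = issue + 3;   /* 3-cycle latency */
        next_issue = issue + 1;
        if (ready[ops[i].dst] > finish) finish = ready[ops[i].dst];
    }
    return finish;
}

/* Two independent chains: A = (r0 op r1) op r2 and B = (r3 op r4) op r5. */
const FpuOp one_after_the_other[4] = { {0, 1}, {0, 2}, {3, 4}, {3, 5} };
const FpuOp interleaved[4]         = { {0, 1}, {3, 4}, {0, 2}, {3, 5} };
```

Under this model the one-after-the-other order finishes at cycle 10 and the 
interleaved order at cycle 7, because the second chain's work fills the slots 
where the first chain would otherwise sit waiting.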



CAUTIONS
---------------------------------------------------------------------------------
-You can’t start one FMUL right after the other; it will stall for 1 cycle.  FXCH 
	doesn’t count as an in-between instruction.

-FXCH will almost always cost about 1/2 cycle if preceded and followed by regular 
	FPU commands (it actually just renames the stack registers).

-FST can not store an FADD/FSUB/FMUL/FDIV result until one cycle after other 
	instructions could use it (4 clock cycles rather than 3 for FADD/FSUB/FMUL).







MORE INSTRUCTIONS                                                                        
---------------------------------------------------------------------------------
F2XM1               replace st(0) with 2^st(0) - 1
FABS                replace st(0) with its absolute value
FCHS                change sign of st(0)
FCOS                replace st(0) with cos(st(0)), in radians.
FSIN                replace st(0) with sin(st(0)), in radians.
FSINCOS             replace st(0) with its sine; push cosine onto stack
FSQRT               replace st(0) with its square root
FPATAN              take arctan(st(1)/st(0)) and store in st(1); pop stack
FPTAN               replace st(0) with its tangent; push 1.0 onto stack.
FRNDINT             rounds st(0) (see below, under Rounding Control)

FILD/FIST           load an integer from memory onto the stack / store
                      st(0) to memory as an integer (rounded per the RC
                      field; see Rounding Control).  See also FIMUL,
                      FIADD, etc. for arithmetic with integer operands.
FINIT               initialize FPU (reset it)

LOADING CONSTANTS
---------------------------------------------------------------------------------
FLDZ                push a zero onto the stack
FLD1                push one onto the stack
FLDPI               push pi onto the stack
FLDL2T              push log2 10 onto the stack
FLDL2E              push log2 e onto the stack
FLDLG2              push log10 2 onto the stack
FLDLN2              push ln 2 onto the stack






PRECISION CONTROL
---------------------------------------------------------------------------------
There are three IEEE standard precisions for floating-point variables:

  Single real   = 32 bits memory (4 byte "float")  = 24 bits precision = 00 in PC field
  Double real   = 64 bits memory (8 byte "double") = 53 bits precision = 10 in PC field
  Extended real = 80 bits memory (10 bytes)        = 64 bits precision = 11 in PC field

You have to specify for the FPU what precision it should use in its calculations; 
FINIT selects extended real (64 bits of precision, PC = 11).  This is a speed issue; 
FDIV takes 19, 33, or 39 clock cycles for single, double, or extended precision 
respectively!  The difference is less noticeable with quicker operations, though.

You can change the precision by setting the PC (precision control) field of the FPU 
control word, which is bits 8 and 9.  This can be done by using FSTCW [pointer] to 
save the word to memory.  Then modify the appropriate bits and do an FLDCW [pointer] 
to put it back.  
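
The "modify the appropriate bits" step is plain bit masking.  Here is a minimal C 
sketch of just that step, working on a copy of the control word in an ordinary 
variable; the helper names and macros are hypothetical, and getting the word in and 
out of the FPU (FSTCW/FLDCW) is left to whatever inline assembly your compiler offers.

```c
#include <stdint.h>

/* PC field encodings from the table above. */
#define PC_SINGLE   0x0u
#define PC_DOUBLE   0x2u
#define PC_EXTENDED 0x3u

#define PC_SHIFT 8
#define PC_MASK  (0x3u << PC_SHIFT)   /* bits 8 and 9 of the control word */

/* Extract the 2-bit PC field from a control word. */
unsigned get_precision(uint16_t cw) {
    return (cw & PC_MASK) >> PC_SHIFT;
}

/* Return a copy of cw with the PC field replaced; all other bits untouched. */
uint16_t set_precision(uint16_t cw, unsigned pc) {
    return (uint16_t)((cw & ~PC_MASK) | ((pc & 0x3u) << PC_SHIFT));
}
```

For example, the control word FINIT leaves behind is 037Fh; set_precision(0x037F, 
PC_SINGLE) gives 007Fh and set_precision(0x037F, PC_DOUBLE) gives 027Fh.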

Note that you can load/store values of any precision to and from memory, but the FPU 
will convert them and operate on them with the precision set by the PC field.  
Therefore, you must specify, with most compilers, not only an address but also how 
many bytes to load/store there.  For example: 

        FLD   dword ptr variablename       // for floats
        FST   qword ptr variablename       // for doubles
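
The conversion on store is where precision actually gets lost.  As a rough 
illustration in portable C (the function name through_float is made up), this 
pushes a double through a 4-byte float and back, much like an FST dword ptr 
followed by an FLD dword ptr would:

```c
/* Round a double to single precision and back, as a 4-byte store
   followed by a 4-byte load would. volatile forces a real store. */
double through_float(double x) {
    volatile float f = (float)x;
    return (double)f;
}
```

With only 24 bits of precision, a float can hold 16777216 (2^24) exactly but not 
16777217, so through_float(16777217.0) comes back as 16777216.0, while a value like 
0.5 survives unchanged.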





ROUNDING CONTROL
---------------------------------------------------------------------------------

FINIT will set rounding to the nearest whole number by default.  The RC field of 
the FPU control word is made of bits 10 and 11.  Here is the breakdown:

        00 = Round to nearest whole number. (default)
        01 = Round down, toward -infinity.
        10 = Round up, toward +infinity.
        11 = Round toward zero (truncate).

FRNDINT rounds st(0) based on the state of these 2 bits. 
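
Here is a small C sketch of what the four RC settings do to a value, written 
without touching the FPU control word at all.  The function round_with_rc and the 
RC_ macros are hypothetical names, and the sketch assumes the value fits in a 
long long.  One detail worth knowing: "round to nearest" breaks exact ties (x.5) 
toward the even neighbour.

```c
/* RC field encodings from the table above. */
#define RC_NEAREST 0u
#define RC_DOWN    1u
#define RC_UP      2u
#define RC_TRUNC   3u

/* Emulates the rounding choice for one value.
   Assumes |x| is small enough to fit in a long long. */
double round_with_rc(double x, unsigned rc) {
    double t  = (double)(long long)x;      /* C casts truncate toward zero */
    double lo = (t > x) ? t - 1.0 : t;     /* nearest integer at or below x */
    double hi = (t < x) ? t + 1.0 : t;     /* nearest integer at or above x */
    switch (rc & 3u) {
    case RC_DOWN:  return lo;
    case RC_UP:    return hi;
    case RC_TRUNC: return t;
    default: {                             /* RC_NEAREST */
        double d = x - lo;
        if (d < 0.5) return lo;
        if (d > 0.5) return hi;
        /* exact tie: pick the even neighbour */
        return ((long long)lo % 2 == 0) ? lo : hi;
    }
    }
}
```

So 2.5 and 3.5 round to 2 and 4 in nearest mode, -2.5 rounds down to -3, 2.1 
rounds up to 3, and -2.7 truncates to -2.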





CONDITIONS & BRANCHING INSIDE THE FPU
---------------------------------------------------------------------------------

The FPU can do condition testing and branching but you need the manual for that.  
It's not hard, but you need to look some values up - because you have to do the 
FCOM (float compare), then copy the FPU flags register into AX with a special 
command, then do stuff to AX to find out if it's less than, greater than, etc.  
Look up FCOM in volume 1 of the Intel Architecture Software Developer's Manual - it 
will give you straightforward instructions on how to do this.  See also FUCOM & FTST 
in volume 2.

You can compare any register to st(0) and branch from the result.  This involves 
doing a generic compare between the two, which modifies the FPU status word.  You 
then copy the SW into AX and examine the bit combinations in different manners in 
order to achieve JG, JL, and JE.  Here is sample code to compare st(0) to st(5).  
Replace the xxxxH and Jxx placeholders according to the table below to achieve 
whatever type of compare you want.

   FCOM   st(5)            compare st(0) to st(5) - modifies FPU status word.
   FSTSW  AX               store the status word in AX
   TEST   AX, xxxxH        examine status word
   Jxx    :label           jump based on compare.

   Test desired:           xxxxH         Jxx
   ST(0) > st(x)           4500H         JZ
   ST(0) < st(x)           0100H         JNZ
   ST(0) = st(x)           4000H         JNZ
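
The masks in the table pick out the condition-code bits C0 (bit 8), C2 (bit 10), 
and C3 (bit 14) of the status word.  Here is a C sketch of the same logic, with 
made-up function names: fcom_flags builds the bits a compare would leave behind, 
and the three is_ helpers mirror the TEST/Jxx rows.

```c
#include <stdint.h>

#define SW_C0 0x0100u   /* bit 8  */
#define SW_C2 0x0400u   /* bit 10 */
#define SW_C3 0x4000u   /* bit 14 */

/* Condition-code bits FCOM leaves in the status word when comparing
   a (playing st(0)) against b (playing st(x)). */
uint16_t fcom_flags(double a, double b) {
    if (a != a || b != b) return SW_C3 | SW_C2 | SW_C0;   /* unordered (NaN) */
    if (a > b) return 0;
    if (a < b) return SW_C0;
    return SW_C3;                                         /* equal */
}

/* The three rows of the table, written as C tests on the status word. */
int is_greater(uint16_t sw) { return (sw & 0x4500u) == 0; }   /* TEST 4500h, JZ  */
int is_less(uint16_t sw)    { return (sw & 0x0100u) != 0; }   /* TEST 0100h, JNZ */
int is_equal(uint16_t sw)   { return (sw & 0x4000u) != 0; }   /* TEST 4000h, JNZ */
```

Note that an unordered compare (one operand is a NaN) sets all three bits, so the 
"less" and "equal" rows would both fire; if NaNs are possible, test C2 first.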





A TIDBIT ON MMX
---------------------------------------------------------------------------------

So what the hell does MMX really do?  It does parallel processing on integers,
using the floating-point stack.

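There is no special command to enter MMX mode; executing any MMX instruction does it, 
and the EMMS instruction ends it so the FPU can be used again.  The switch takes about 
40 clock cycles (ouch!).  
In MMX mode, 57 new instructions are available to the processor.  They use 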
the 8 registers of the FPU stack to pack in 8 bytes, 4 words, or 2 double-words each 
(so note that you can NOT use the FPU while in MMX mode!  Big drawback!)  New custom 
Add, Subtract, And/Or, etc operations can then operate on all 8 bytes (or whatever) 
in the same time it would take to do just one.  

But consider this: what about overflows?  Say you add 200 and 100 and get 300.  
Problem: a byte can only hold 0 to 255.   So there are actually two add commands; one 
clips this at 255 (saturation add), and the other loops it around to 44 (wraparound 
add).  Also consider that you have signed and unsigned bytes, words, etc.  That gives 
us even more modes to add.
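
For the unsigned-byte case, the two flavours of add look like this in plain C, one 
lane at a time and then across 8 packed bytes.  The function names are made up; the 
MMX instructions they imitate are PADDB (wraparound) and PADDUSB (unsigned 
saturation), which do all 8 lanes in a single instruction.

```c
#include <stdint.h>

/* Wraparound add: the carry out of the byte is simply thrown away. */
uint8_t add_wrap_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(a + b);
}

/* Unsigned saturating add: anything above 255 is clipped to 255. */
uint8_t add_sat_u8(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* The same operations across 8 packed bytes, as PADDB / PADDUSB would do. */
void add_wrap_u8x8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8]) {
    int i;
    for (i = 0; i < 8; i++) out[i] = add_wrap_u8(a[i], b[i]);
}

void add_sat_u8x8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8]) {
    int i;
    for (i = 0; i < 8; i++) out[i] = add_sat_u8(a[i], b[i]);
}
```

Saturation is usually what you want for pixels: a bright value that overflows 
stays white instead of suddenly turning dark.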

The bottom line: all in all, SEVEN of those 57 new commands bragged about are just for 
adding!  Seven more are for subtracting.  It also offers integer multiplication (no 
division at all), some ANDing, ORing, and a few other odds and ends.  

The problem is that hardly any processing-intensive applications have such clean-cut 
needs.  They could probably stand to multiply 8 floats in parallel, but not integers!  
The only things that really profit from such technology are Photoshop plug-ins (which 
do profit immensely), where you have mass similar operations on integers (pixels).  
In addition, tying up the FPU can hurt, and integer operations were faster to begin 
with!  

However, MMX is a stepping stone.  It has paved the way to a floating-point version 
of this, which, with proper code, could speed floating point calculations up eightfold! 
Picture a PC with eight FPUs, all running at once!  It would be something.  I’ve 
heard that AMD is working on a new "3d" processor that will do just this, but this 
waits to be verified.






BONUS
---------------------------------------------------------------------------------
This code is from the MSDN Library for Visual Studio 6.0.  It reads the FPU 
control word and, if necessary, sets it to single precision to maximize speed.
It will only work in Windows.  

If you have any large chunk of code in your app that does tons of FPU work,
mission-critical precision isn't necessary, and you want it to be sped up,
try this.  It disables exceptions (WHICH CAN BE A CRITICAL SPEED BOOST), 
and sets precision to 32 bits (instead of the default 64 bits).



#include <windows.h>
#include <math.h>
 
// This function evaluates whether the floating-point
// control Word is set to single precision/round to nearest/
// exceptions disabled. If not, the
// function changes the control Word to set them and returns
// TRUE, putting the old control Word value in the passback
// location pointed to by pwOldCW.
BOOL MungeFPCW( WORD *pwOldCW )
{
    BOOL ret = FALSE;
    WORD wTemp, wSave;
 
    __asm fstcw wSave
    if (wSave & 0x300 ||            // Not single mode
        0x3f != (wSave & 0x3f) ||   // Exceptions enabled
        wSave & 0xC00)              // Not round to nearest mode
    {
        __asm
        {
            mov ax, wSave
            and ax, not 300h    ;; single mode
            or  ax, 3fh         ;; disable all exceptions
            and ax, not 0xC00   ;; round to nearest mode
            mov wTemp, ax
            fldcw   wTemp
        }
        ret = TRUE;
    }
    *pwOldCW = wSave;
    return ret;
}
 
void RestoreFPCW(WORD wSave)
{
    __asm fldcw wSave
}
 
int __cdecl main(void)
{
    WORD wOldCW;
    BOOL bChangedFPCW = MungeFPCW( &wOldCW );
    // Do something with control Word, as set by MungeFPCW.
    if ( bChangedFPCW )
        RestoreFPCW( wOldCW );
    return 0;
}












This document copyright (c)1998+ Ryan M. Geiss.