An MMX coding example - by Ryan Geiss
---------------------

[COMMENT - 5/26/2001]

Please note that this article is a bit outdated by now, but was kept
because it still serves as a decent primer for using MMX in graphics 
programming.  It could be said that this code is somewhat unnecessary 
now, given DirectDraw's Blt and BltFast functions, or GDI's BitBlt, which 
automatically convert between pixel formats for you as optimally as 
possible.  *However*, some (actually, many) drivers don't properly 
implement these color conversions, so if you want a 100% guarantee
it will work, you have to do it manually (Geiss and Drempels do).
Another advantage of doing it yourself is that you can sneak in 
some cheap per-pixel post-processing effects!

I also have to say - I renounce my renunciation of the usefulness of MMX
(which appears at the end of this article) - I love MMX to pieces.  Without
it, Geiss would be 40% slower and Drempels would be 50% slower!  It also
works extremely well as a fast memcpy(), in audio processing, and in just
about any kind of image processing.



[ORIGINAL ARTICLE]

This little intro to MMX is explained via a program called Geiss.  You
can get it on the main page.

In Geiss, three separate greyscale "screens" are calculated.  Each one 
represents a color channel: red, green, and blue.  The problem is
displaying this on the screen.  This little FAQ goes through the non-MMX
and the MMX ways of doing so.  Please remember that the non-MMX discussion
revolves around the code involved to plot ONE pixel; the MMX discussion is
very similar, only it does 4 pixels simultaneously.

The idea is this:

   B B B B B B B B   R R R R R R R R   G G G G G G G G

You have 3 buffers of red, green, and blue values, each value being 8 bits 
(1 byte).  You want to merge these to put them on the screen.  Well, 
the 320x240 video mode rarely supports 32-bit color, but rather, only 
16-bit.  The format for a 16-bit color value (in the hardware) is:

   B B B B B G G G   G G G R R R R R

It's 5, 6, and 5 bits for the three color components.  As you can imagine,
getting the 3 bytes into this 2-byte quantity is a real bitch.  Here's the
code:

   WORDVALUE = ((blue >> 3) << 11) | ((green >> 2) << 5) | (red >> 3);

   This is like dicing each of the three into the following quantities:
 
         B B B B B 0 0 0   0 0 0 0 0 0 0 0
         0 0 0 0 0 G G G   G G G 0 0 0 0 0
         0 0 0 0 0 0 0 0   0 0 0 R R R R R

   Then forming the final quantity by OR'ing all three, yielding

         B B B B B G G G   G G G R R R R R

   just as desired.  Notice that we've taken the MOST significant
   bits of the 8 for each component to use.

   Well, that's 5 bitshifts and 2 OR's per pixel, plus a lot of crappy
   other stuff like loop counters and gamma correction (it makes Geiss look
   ten times better... gives you a "bright white" color zone).  Basically,
   that's SLOW.
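
   For reference, here is a minimal sketch of the non-MMX path in C++,
   assuming three separate 8-bit buffers and a 16-bit destination buffer.
   The names Pack565 and ConvertScalar are made up for this example (they
   are not from the Geiss source), and the gamma remap is left out to keep
   it short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: packs one pixel with the article's formula
// (5 bits of blue on top, 6 bits of green, 5 bits of red at the bottom).
inline std::uint16_t Pack565(std::uint8_t blue, std::uint8_t green, std::uint8_t red)
{
    return static_cast<std::uint16_t>(((blue >> 3) << 11) |
                                      ((green >> 2) << 5) |
                                      (red >> 3));
}

// Hypothetical non-MMX loop: converts one pixel per iteration.
inline void ConvertScalar(const std::uint8_t* blue, const std::uint8_t* green,
                          const std::uint8_t* red, std::uint16_t* dest,
                          std::size_t count)
{
    assert(blue && green && red && dest);
    for (std::size_t i = 0; i < count; ++i)
        dest[i] = Pack565(blue[i], green[i], red[i]);
}
```

   Calling ConvertScalar(blue, green, red, dest, 320 * 240) would convert
   one full 320x240 frame, one pixel per trip through the loop.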

   With MMX, you can make it slightly better.  You can do 4 pixels
   at a time, with LESS overhead (due to a convenient way MMX gives
   you to do the gamma-correction), and you end up with huge speed
   gains.  

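   Here's how it works.  First, load four blue pixels at once, four
   green, and four red.  An MMX register is 64 bits, or 8 bytes.  Our
   goal is to turn one MMX register into 4 16-bit color values that can
   all be plotted together.  We first have to use the Packed Unpack
   (High, Bytes to Words) instruction, which puts 4 source bytes into
   every other byte of the MMX register.  Note that PUNPCKHBW reads
   8 bytes from memory and uses the upper 4 of them, so the pixels it
   picks up start 4 bytes past each pointer (PUNPCKLBW would use the
   4 bytes right at the pointer):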

PUNPCKHBW  mm0, [eax]      (Packed Unpack High, Bytes to Words)
PUNPCKHBW  mm1, [ebx]
PUNPCKHBW  mm2, [edx]

MM0:  blueblue !$%^@#$! blueblue !$%^@#$! blueblue !$%^@#$! blueblue !$%^@#$!
MM1:  greengre !$%^@#$! greengre !$%^@#$! greengre !$%^@#$! greengre !$%^@#$!
MM2:  redredre !$%^@#$! redredre !$%^@#$! redredre !$%^@#$! redredre !$%^@#$!

   You can probably see what is coming... the same thing as before...
   shift all these registers around to align them, OR them, and you
   have it.  The !$%^@#$! is garbage that was in the register beforehand;
   when we shift right to lop off the least significant 2 or 3 bits, we'll
   be killing these too and adding nice zeros to the left side.
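
   To make the byte layout concrete, here is a small C++ sketch that
   imitates the unpack step on an ordinary 64-bit integer.  It is only
   an emulation for illustration: UnpackBytesToWords is a made-up name,
   "src" stands for the four bytes the instruction actually uses, and
   "garbage" stands for whatever was sitting in the register.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical emulation of the byte-to-word unpack.  Each source byte ends
// up as the HIGH byte of a 16-bit lane; the stale register byte stays in the
// LOW byte of that lane.
inline std::uint64_t UnpackBytesToWords(const std::uint8_t* src,
                                        const std::uint8_t* garbage)
{
    assert(src && garbage);
    std::uint64_t reg = 0;
    for (int lane = 0; lane < 4; ++lane) {
        const std::uint64_t word =
            (static_cast<std::uint64_t>(src[lane]) << 8) | garbage[lane];
        reg |= word << (16 * lane);   // lane 0 is the lowest 16 bits
    }
    return reg;
}
```

   With source bytes 11 22 33 44 (hex) and garbage AA BB CC DD, the lanes
   come out as 11AA, 22BB, 33CC, 44DD, which is the "color byte, then junk
   byte" pattern drawn in the diagram.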

   We're only using 3 of the 8 MMX registers here.  Each one holds four 2-byte
   quantities; the left byte is the high byte, the right byte is the low one.
   Now we do some shifts.  MMX allows us to bitshift an entire MMX register
   as if it contained 4 words, 2 dwords, or 1 quadword (there is no
   byte-sized shift).  We choose 4 words, obviously.  Here's the code:

PSRLW      mm0, 11    (Packed Shift Right/Logical/Words)
PSRLW      mm1, 10
PSRLW      mm2, 11

MM0:  00000000 000blueb 00000000 000blueb 00000000 000blueb 00000000 000blueb
MM1:  00000000 00greeng 00000000 00greeng 00000000 00greeng 00000000 00greeng 
MM2:  00000000 000redre 00000000 000redre 00000000 000redre 00000000 000redre

   Now we've lopped off the least significant bits of the values (this
   is equivalent to the initial >> operations in the non-MMX equation) in
   just three clock cycles.  We now do the leftshift (<<) operations to
   move the bits into place for OR'ing:

PSLLW	   mm0, 11      (Packed Shift Left/Logical/Words)
PSLLW	   mm1, 5

MM0:  blueb000 00000000 blueb000 00000000 blueb000 00000000 blueb000 00000000 
MM1:  00000gre eng00000 00000gre eng00000 00000gre eng00000 00000gre eng00000 
MM2:  00000000 000redre 00000000 000redre 00000000 000redre 00000000 000redre

   Now, just like before, you OR the three registers together.

POR        mm0, mm1   (Packed Logical OR)
POR        mm0, mm2
   
MM0:  bluebgre engredre bluebgre engredre bluebgre engredre bluebgre engredre

   See what we have?  Four 16-bit color values.  Re-written, they look like

MM0:  BBBBBGGG GGGRRRRR BBBBBGGG GGGRRRRR BBBBBGGG GGGRRRRR BBBBBGGG GGGRRRRR
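
   The shift-and-OR steps can also be imitated in plain C++ to check the
   bit positions.  This is a sketch on a 64-bit integer, not real MMX:
   ShiftRightWords, ShiftLeftWords, and Combine565 are names invented
   here to stand in for PSRLW, PSLLW, and the whole sequence above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for PSRLW: logical right shift of each 16-bit lane.
inline std::uint64_t ShiftRightWords(std::uint64_t reg, unsigned bits)
{
    assert(bits < 16);
    std::uint64_t out = 0;
    for (int lane = 0; lane < 4; ++lane) {
        const std::uint64_t word = (reg >> (16 * lane)) & 0xFFFFu;
        out |= (word >> bits) << (16 * lane);
    }
    return out;
}

// Hypothetical stand-in for PSLLW: left shift of each 16-bit lane, with
// bits that leave the lane thrown away instead of spilling into the next.
inline std::uint64_t ShiftLeftWords(std::uint64_t reg, unsigned bits)
{
    assert(bits < 16);
    std::uint64_t out = 0;
    for (int lane = 0; lane < 4; ++lane) {
        const std::uint64_t word = (reg >> (16 * lane)) & 0xFFFFu;
        out |= ((word << bits) & 0xFFFFu) << (16 * lane);
    }
    return out;
}

// The article's sequence: chop low bits, move into place, OR together.
// Inputs are "unpacked" registers (color in the high byte of each lane).
inline std::uint64_t Combine565(std::uint64_t blue, std::uint64_t green,
                                std::uint64_t red)
{
    blue  = ShiftLeftWords(ShiftRightWords(blue, 11), 11);
    green = ShiftLeftWords(ShiftRightWords(green, 10), 5);
    red   = ShiftRightWords(red, 11);
    return blue | green | red;
}
```

   Each lane of the result should match what the one-pixel formula gives
   for the same three color bytes, junk bytes and all.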

   Now we store all 4 pixels to "video memory" (really a buffer) with one
   command:

MOVQ       [edi], mm0      (64-bit MMX <--> memory transfer)

   Then increment our 4 pointers and loop to the top.  The whole main loop,
   unoptimized, looks like this:

-------------------------------------------------------------------
FourLoop:

PUNPCKHBW  mm0, [eax] // load & expand the blue, green, & red values
PUNPCKHBW  mm1, [ebx]
PUNPCKHBW  mm2, [edx]

PADDUSB    mm0, mm0 // doubles the brightness, capping at 255, for every value.
PADDUSB    mm1, mm1 // this is the equiv. of the REMAP[] array's effect.
PADDUSB    mm2, mm2

PSRLW      mm0, 8+3 // move each byte into the -lower- part of the word
PSRLW      mm1, 8+2 // also chop off some # of least significant bits 
PSRLW      mm2, 8+3

PSLLW	   mm0, 11  // shift back into position for combination process
PSLLW	   mm1, 5
	
POR        mm0, mm1
POR        mm0, mm2

// store result + increment pointers

MOVQ       [edi], mm0

ADD		   eax, 4
ADD		   ebx, 4
ADD		   edx, 4
ADD		   edi, 8

LOOP FourLoop       // decrements ECX (the count of 4-pixel groups) and jumps while it is nonzero
-------------------------------------------------------------------
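
   If you'd rather not write inline assembly, roughly the same loop can
   be sketched with the compiler's MMX intrinsics from <mmintrin.h>.
   This is an assumption-laden rewrite, not the code Geiss ships: the
   names ConvertRow565 and DoubleSaturate are invented, it unpacks
   against a zeroed register (so there is no garbage in the low bytes),
   and it falls back to a plain loop when MMX intrinsics aren't
   available or for the last few pixels of an odd-sized row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__MMX__) || defined(_M_IX86)
#include <mmintrin.h>
#define SKETCH_HAVE_MMX 1
#else
#define SKETCH_HAVE_MMX 0
#endif

// Scalar version of the saturating "add to itself" brightness doubling.
inline std::uint8_t DoubleSaturate(std::uint8_t v)
{
    return static_cast<std::uint8_t>(v < 128 ? v * 2 : 255);
}

// Hypothetical row converter: 4 pixels per pass with MMX, then a scalar tail.
inline void ConvertRow565(const std::uint8_t* blue, const std::uint8_t* green,
                          const std::uint8_t* red, std::uint16_t* dest,
                          std::size_t count)
{
    assert(blue && green && red && dest);
    std::size_t i = 0;
#if SKETCH_HAVE_MMX
    const __m64 zero = _mm_setzero_si64();
    for (; i + 4 <= count; i += 4) {
        std::int32_t b4, g4, r4;
        std::memcpy(&b4, blue + i, 4);    // four 8-bit values per channel
        std::memcpy(&g4, green + i, 4);
        std::memcpy(&r4, red + i, 4);

        // Put each color byte in the high byte of a 16-bit lane.
        __m64 b = _mm_unpacklo_pi8(zero, _mm_cvtsi32_si64(b4));
        __m64 g = _mm_unpacklo_pi8(zero, _mm_cvtsi32_si64(g4));
        __m64 r = _mm_unpacklo_pi8(zero, _mm_cvtsi32_si64(r4));

        b = _mm_adds_pu8(b, b);           // PADDUSB: double, capping at 255
        g = _mm_adds_pu8(g, g);
        r = _mm_adds_pu8(r, r);

        b = _mm_slli_pi16(_mm_srli_pi16(b, 8 + 3), 11);   // PSRLW then PSLLW
        g = _mm_slli_pi16(_mm_srli_pi16(g, 8 + 2), 5);
        r = _mm_srli_pi16(r, 8 + 3);

        const __m64 out = _mm_or_si64(_mm_or_si64(b, g), r);   // POR
        std::memcpy(dest + i, &out, sizeof(out));              // 4 pixels at once
    }
    _mm_empty();                          // EMMS: hand the registers back to the FPU
#endif
    for (; i < count; ++i) {              // leftovers, or everything without MMX
        const unsigned b = DoubleSaturate(blue[i]);
        const unsigned g = DoubleSaturate(green[i]);
        const unsigned r = DoubleSaturate(red[i]);
        dest[i] = static_cast<std::uint16_t>(((b >> 3) << 11) | ((g >> 2) << 5) | (r >> 3));
    }
}
```

   Both paths are meant to produce identical pixels, so the scalar tail
   doubles as the non-MMX fallback.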

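   The great thing is that there are *plenty* of optimizations to the
   above code... most involve interleaving instructions so they can
   execute simultaneously (in the two pipes) and rearranging the code
   so that there are no stalls.  One thing the listing leaves out: once
   the loop is done, execute EMMS before any floating-point code runs,
   because the MMX registers are aliased onto the FPU registers.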

   ... so that's about it.  The key points that save time are in the mass
   data transfers (stores & loads to/from memory) and in the reduced
   number of instructions (1/4 as many).  Also, before, the "gamma
   correction" was done by an array of 256 bytes that remapped the byte.
   The array made values 0..127 appear twice as bright, and 128..255 appear
   "max'ed out" at 255.  Well, with MMX, you can just add the register to
   itself in "saturation mode," where it will cap at 255 if it overflows.
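
   A quick way to convince yourself that the saturating add really does
   the same job as the lookup table is to compare them for all 256
   inputs.  The sketch below assumes the table is exactly "double it,
   cap at 255" as described above; BuildRemap, SaturatingDouble, and
   CheckRemapEquivalence are names made up for the example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reconstruction of the 256-entry table: 0..127 map to twice
// their value, 128..255 map to 255.
inline void BuildRemap(std::uint8_t (&remap)[256])
{
    for (int v = 0; v < 256; ++v)
        remap[v] = static_cast<std::uint8_t>(v < 128 ? v * 2 : 255);
}

// What PADDUSB does to a single byte when a register is added to itself.
inline std::uint8_t SaturatingDouble(std::uint8_t v)
{
    const unsigned sum = static_cast<unsigned>(v) + v;
    return static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
}

// Walks every possible byte and checks the two methods agree.
inline void CheckRemapEquivalence()
{
    std::uint8_t remap[256];
    BuildRemap(remap);
    for (int v = 0; v < 256; ++v)
        assert(remap[v] == SaturatingDouble(static_cast<std::uint8_t>(v)));
}
```

   The difference is that the table costs one memory lookup per byte,
   while the MMX add handles all 8 bytes of a register in one instruction.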
   
   I never thought I'd find a use for MMX, but here it is.  I'm truly
   amazed.  I get about 36-37 fps without the MMX, and about 49 with it!
   (And this is not the only process going on... the three buffers are also
   being crunched each frame, and that's expensive alone.)  Go MMX!


   -Ryan M. Geiss            



BONUS
---------------------------------------------------------------------------------
This code will check to see if MMX is supported by the CPU.  You should run this
code once at startup, then save the resulting bool.  And remember: it's always a
good idea to keep your pre-MMX loops around so that you still support non-MMX
CPUs, or in case you need to port your code to a new architecture.


#include <windows.h>    // DWORD, EXCEPTION_EXECUTE_HANDLER (32-bit MSVC inline asm assumed)

bool CheckMMXTechnology()
{
    bool retval = true;
    DWORD RegEDX = 0;

    __try {
        __asm {
            mov eax, 1          // CPUID function 1: feature flags come back in EDX
            cpuid
            mov RegEDX, edx
        }
    }
    __except(EXCEPTION_EXECUTE_HANDLER)
    {
        retval = false;
    }

    if (!retval) return false;      // processor does not support CPUID

    if (RegEDX & 0x800000)          // bit 23 is set for MMX technology
    {
       __try { __asm emms }          // try executing the MMX instruction "emms"
       __except(EXCEPTION_EXECUTE_HANDLER) { retval = false; }
    }
    else
    {
        return false;               // processor supports CPUID but does not support MMX technology
    }

    // if retval is false here, it means the processor has MMX technology but
    // floating-point emulation is on; so MMX technology is unavailable

    return retval;
}
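
On compilers without 32-bit inline assembly, the same CPUID question can be
asked through compiler-provided helpers.  The sketch below is one possible
approach, not part of the original code: it assumes MSVC's __cpuid from
<intrin.h> or GCC/Clang's __get_cpuid from <cpuid.h>, the names EdxHasMmxBit
and CpuReportsMmx are invented, and it only reads the feature bit (it does
not repeat the "try emms" step above, so it won't notice floating-point
emulation being switched on).

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#define SKETCH_X86 1
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define SKETCH_X86 1
#else
#define SKETCH_X86 0
#endif

// Tests bit 23 of the EDX value returned by CPUID function 1.
inline bool EdxHasMmxBit(std::uint32_t edx)
{
    const std::uint32_t kMmxBit = 1u << 23;
    assert(kMmxBit == 0x800000u);   // same mask the article's code tests
    return (edx & kMmxBit) != 0;
}

// Hypothetical wrapper: true if CPUID says MMX is present, false otherwise
// (including on non-x86 targets, where there is nothing to ask).
inline bool CpuReportsMmx()
{
#if SKETCH_X86 && defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};     // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    return EdxHasMmxBit(static_cast<std::uint32_t>(regs[3]));
#elif SKETCH_X86
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;               // CPUID leaf 1 not available
    return EdxHasMmxBit(edx);
#else
    return false;
#endif
}
```

As with CheckMMXTechnology(), the idea would be to call CpuReportsMmx() once
at startup and stash the result in a bool.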



This document copyright (c)1998+ Ryan M. Geiss.