## 11.9 The MMX Programming Paradigm

In general, you don't learn scalar (non-MMX) 80x86 assembly language programming and then use that same mindset when writing programs using the MMX instruction set. While it is possible to directly use various MMX instructions the same way you would the general purpose integer instructions, one phrase comes to mind when working with MMX: think parallel. This text has spent many hundreds of pages up to this point attempting to get you to think in assembly language; to think that this small section can teach you how to design optimal MMX sequence would be ludicrous. Nonetheless, a few simple examples are useful to help start you thinking about how to use the MMX instructions to your benefit in your programs. This section will begin by presenting some fairly obvious uses for the MMX instruction set, and then it will attempt to present some examples that exploit the inherent parallelism of the MMX instructions.

Since the MMX registers are 64-bits wide, you can double the speed of certain data movement operations by using MMX registers rather than the 32-bit general purpose registers. For example, consider the following code from the HLA Standard Library that copies one character set object to another:

```

procedure cs.cpy( src:cset; var dest:cset ); nodisplay;

begin cpy;

push( eax );

push( ebx );

mov( dest, ebx );

mov( (type dword src), eax );

mov( eax, [ebx] );

mov( (type dword src[4]), eax );

mov( eax, [ebx+4] );

mov( (type dword src[8]), eax );

mov( eax, [ebx+8] );

mov( (type dword src[12]), eax );

mov( eax, [ebx+12] );

pop( ebx );

pop( eax );

end cpy;

Program 11.2	 HLA Standard Library cs.cpy Routine

```

This is a relatively simple code sequence. Indeed, a fair amount of the execution time is spent copying the parameters (20 bytes) onto the stack, calling the routine, and returning from the routine. This entire sequence can be reduced to the following four MMX instructions:

```		movq( (type qword src), mm0 );

movq( (type qword src[8]), mm1 );

movq( mm0, (type qword dest));

movq( mm1, (type qword dest[8]));

```

Of course, this sequence assumes two things: (1) it's okay to wipe out the values in MM0 and MM1, and (2) you'll execute the EMMS instruction a little later on after the execution of some other MMX instructions. If either, or both, of these assumptions is incorrect, the performance of this sequence won't be quite as good (though probably still better than the cs.cpy routine). However, if these two assumptions do hold, then it's relatively easy to implement the cs.cpy routine as an in-line function (i.e., a macro) and have it run much faster. If you really need this operation to occur inside a procedure and you need to preserve the MMX registers, and you don't know if any MMX instructions will execute shortly thereafter (i.e., you'll need to execute EMMS), then it's doubtful that using the MMX instructions will help here. However, in those cases when you can put the code in-line, using the MMX instructions will be faster.

Warning: don't get too carried away with the MMX MOVQ instruction. Several programmers have gone to great extremes to use this instruction as part of a high performance MOVSD replacement. However, except in very special cases on very well designed systems, the limiting factor for a block move is the speed of memory. Since Intel has optimized the operation of the MOVSD instruction, you're best off using the MOVSD instructions when moving blocks of memory around.

Earlier, this chapter used the cs.difference function as an example when discussing the PANDN instruction. Here's the original HLA Standard Library implementation of this function:

```

procedure cs.difference( src:cset; var dest:cset ); nodisplay;

begin difference;

push( eax );

push( ebx );

mov( dest, ebx );

mov( (type dword src), eax );

not( eax );

and( eax, [ebx] );

mov( (type dword src[4]), eax );

not( eax );

and( eax, [ebx+4] );

mov( (type dword src[8]), eax );

not( eax );

and( eax, [ebx+8] );

mov( (type dword src[12]), eax );

not( eax );

and( eax, [ebx+12] );

pop( ebx );

pop( eax );

end difference;

Program 11.3	 HLA Standard Library cs.difference Routine

```

Once again, the high-level nature of HLA is hiding the fact that calling this function is somewhat expensive. A typical call to cs.difference emits five or more instructions just to push the parameters (it takes four 32-bit PUSH instructions to pass the src character set because it is a value parameter). If you're willing to wipe out the values in MM0 and MM1, and you don't need to execute an EMMS instruction right away, it's possible to compute the set difference with only six instructions - that's about the same number of instructions (and often fewer) than are needed to call this routine, much less do the actual work. Here are those six instructions:

```		movq( dest, mm0 );

movq( dest[8], mm1 );

pandn( src, mm0 );

pandn( src[8], mm1 );

movq( mm0, dest );

movq( mm1, dest[8] );

```

These six instructions replace 12 of the instructions in the body of the function. The sequence is sufficiently short that it's reasonable to code it in-line rather than in a function. However, were you to bury this code in the cs.difference routine, you needed to preserve MM0 and MM11, and you needed to execute EMMS afterwards, this would cost more than it's worth. As an in-line macro, however, it is going to be significantly faster since it avoids passing parameters and the call/return sequence.

If you want to compute the intersection of two character sets, the instruction sequence is identical to the above except you substitute PAND for PANDN. Similarly, if you want to compute the union of two character sets, use the code sequence above substituting POR for PANDN. Again, both approaches pay off handsomely if you insert the code in-line rather than burying it in a procedure and you don't need to preserve MMX registers or execute EMMS afterwards.

We can continue with this exercise of working our way through the HLA Standard Library character set (and other) routines substituting MMX instructions in place of standard integer instructions. As long as we don't need to preserve the MMX machine state (i.e., registers) and we don't have to execute EMMS, most of the character set operations will be short enough to code in-line. Unfortunately, we're not buying that much over code the standard implementations of these functions in-line from a performance point of view (though the code would be quite a bit shorter). The problem here is that we're not "thinking in MMX." We're still thinking in scalar (non-parallel mode) and the fact that the MMX instruction set requires a lot of set-up (well, "tear-down" actually) negates many of the advantages of using MMX instructions in our programs.

The MMX instructions are most appropriate when you compute multiple results in parallel The problem with the character set examples above is that we're not even processing a whole data object with a single instruction; we're actually only processing a half of a character set with a sequence of three MMX instructions (i.e., it requires six instructions to compute the intersection, union, or difference of two character sets). At best, we can only expect the code to run about twice as fast since we're processing 64 bits at a time instead of 32 bits. Executing EMMS (and, God help us, having to preserve MMX registers) negates much of what we might gain by using the MMX instructions. Again, we're only going to see a speed improvement if we process multiple objects with a single MMX instruction. We're not going to do that manipulating large objects like character sets.

One data type that will let us easily manipulate up to eight objects at one time is a character string. We can speed up many character string operations by operating on eight characters in the string at one time. Consider the HLA Standard Library str.uppercase procedure. This function steps through each character of a string, tests to see if it's a lower case character, and if so, converts the lower case character to upper case. A good question to ask is "can we process eight characters at a time using the MMX instructions?" The answer turns out to be yes and the MMX implementation of this function provides an interesting perspective on writing MMX code.

At first glance it might seem impractical to use the MMX instructions to test for lower case characters and convert them to upper case. Consider the typical scalar approach that tests and converts a single character at a time:

```		<< Get character to convert into the AL register >>

cmp( al, `a' );

jb noConversion;

cmp( al, `z' );

ja noConversion;

sub( \$20, al );    // Could also use AND(\$5f, al); here.

noConversion:

```

This code first checks the value in AL to see if it's actually a lower case character (that's the CMP and Jcc instructions in the code above). If the character is outside the range `a'..'z' then this code skips over the conversion (the SUB instruction); however, if the code is in the specified range, then the sequence above drops through to the SUB instruction and converts the lower case character to upper case by subtracting \$20 from the lower case character's ASCII code (since lower case characters always have bit #5 set, subtracting \$20 always clears this bit).

Any attempt to convert this code directly to an MMX sequence is going to fail. Comparing and branching around the conversion instruction only works if you're converting one value at a time. When operating on eight characters simultaneously, any mixture of the eight characters may or may not require conversion from lower case to upper case. Hence, we need to be able to perform some calculation that is benign if the character is not lower case (i.e., doesn't affect the character's value) while converting the character to upper case if it was lower case to begin with. Worse, we have to do this with pure computation since flow of control isn't going to be particularly effective here (if we test each individual result in our MMX register we won't really save anything over the scalar approach). To save you some suspense, yes, such a calculation does exist.

Consider the following algorithm that converts lower case characters to upper case:

```		<< Get character to test into AL >>

cmp( al, `a' );

setae( bl );     // bl := al >= `a'

cmp( al, `z' );

setbe( bh );     // bh := al <= `z'

and( bh, bl );   // bl := (al >= `a') && (al <= `z' );

dec( bl );       // bl := \$FF/\$00 if false/true.

not( bl );       // bl := \$FF/\$00 if true/false.

and( \$20, bl );  // bl := \$20/\$00 if true/false.

sub( bl, al );   // subtract \$20 if al was lowercase.

```

This code sequence is fairly straight-forward up until the DEC instruction above. It computes true/false in BL depending on whether AL is in the range `a'..'z'. At the point of the DEC instruction, BL contains one if AL is a lower case character, it contains zero if AL's value is not lower case. After the DEC instruction, BL contains \$FF for false (AL is not lower case) and \$00 for true (AL is lowercase). The code is going to use this as a mask a little later, but it really needs true to be \$FF and false \$00, hence the NOT instruction that follows. The (second) AND instruction above converts true to \$20 and false to \$00 and the final SUB instruction subtracts \$20 if AL contained lower case, it subtracts \$00 from AL if AL did not contain a lower case character (subtracting \$20 from a lower case character will convert it to upper case).

Whew! This sequence probably isn't very efficient when compared to the simpler code given previously. Certainly there are more instructions in this version (nearly twice as many). Whether this code without any branches runs faster or slower than the earlier code with two branches is a good question. The important thing to note here, though, is that we converted the lower case characters to upper case (leaving other characters unchanged) using only a calculation; no program flow logic is necessary. This means that the code sequence above is a good candidate for conversion to MMX. Even if the code sequence above is slower than the previous algorithm when converting one character at a time to upper case, it's positively going to scream when it converts eight characters at a shot (since you'll only need to execute the sequence one-eighth as many times).

The following is the code sequence that will convert the eight characters starting at location [EDI] in memory to upper case:

```static

A:qword; @nostorage;

byte \$60, \$60, \$60, \$60, \$60, \$60, \$60, \$60; // Note: \$60 = `a'-1.

Z:qword; @nostorage;

byte \$7B, \$7B, \$7B, \$7B, \$7B, \$7B, \$7B, \$7B; // Note: \$7B = `z' + 1.

ConvFactor:qword; @nostorage;

byte \$20, \$20, \$20, \$20, \$20, \$20, \$20, \$20; // Magic value for lc->UC.

.

.

.

movq( ConvFactor, mm4 ); // Eight copies of conversion value.

movq( A, mm2 );       // Put eight "a" characters in mm2.

movq( Z, mm3 );       // Put eight "z" characters in mm3.

movq( [edi], mm0 );   // Get next eight characters of our string.

movq( mm0, mm1 );     // We need two copies.

pcmpgtb( mm2, mm1 );  // Generate 1's in MM1 everywhere chars >= 'a'

pcmpgtb( mm0, mm3 );  // Generate 1's in MM3 everywhere chars <= 'z'

pand( mm3, mm1 );     // Generate 1's in MM1 when 'a'<=chars<='z'

pand( mm4, mm1 );     // Generates \$20 in each spot we have a l.c. char

psubb( mm1, mm0 );    // Convert l.c. chars to U.C. by adding \$20.

movq( mm0, [edi]);

```

Note that this code compares the characters that [EDI] points at to `a'-1 and `z'+1 because we only have a greater than comparison rather than a greater or equal comparison (this saves a few extra instructions). Other than setting up the MMX registers and taking advantage of the fact that the PCMPGTB instructions automatically produce \$FF for true and \$00 for false, this is a faithful reproduction of the previous algorithm except it operates on eight bytes simultaneously. So if we put this code in a loop and execute it once for each eight characters in the string, there will be one-eighth the iterations of a similar loop using the scalar instructions.

Of course, there is one problem with this code. Not all strings have lengths that are an even multiple of eight bytes. Therefore, we've got to put some special case code into our algorithm to handle strings that are less than eight characters long and handle strings whose length is not an even multiple of eight characters. In the following program, the mmxupper function simply borrows the scalar code from the HLA Standard Library's str.upper procedure to handle the leftover characters. The following example program provides both an MMX and a scalar solution with a main program that compares the running time of both. If you're wondering, the MMX version is about three times faster (on a Pentium III) for strings around 35 characters long, containing mostly lower case (mostly lower case favors the scalar algorithm since fewer branches are taken with lower case characters; longer strings favor the MMX algorithm since it spends more time in the MMX code compared to the scalar code at the end).

```

program UpperCase;

#include( "stdlib.hhf" )

// The following code was stolen from the

// HLA Standard Library's str.upper function.

// It is not optimized, but then none of this

// code is optimized other than to use the MMX

// instruction set (later).

procedure strupper( dest: string ); @nodisplay;

begin strupper;

push( edi );

push( eax );

mov( dest, edi );

if( edi = 0 ) then

raise( ex.AttemptToDerefNULL );

endif;

// Until we encounter a zero byte, convert any lower

// case characters to upper case.

forever

mov( [edi], al );

breakif( al = 0 );      // Quit when we find a zero byte.

// If a lower case character, convert it to upper case

// and store the result back into the destination string.

if

(#{

cmp( al, `a' );

jb false;

cmp( al, `z' );

ja false;

}#) then

and( \$5f, al );     // Magic lc->UC translation.

mov( al, [edi] );   // Save result.

endif;

// Move on to the next character.

inc( edi );

endfor;

pop( edi );

pop( eax );

end strupper;

procedure mmxupper( dest: string ); @nodisplay;

const

zCh:char := char( uns8( `z') + 1 );

aCh:char := char( uns8( `a') - 1 );

static

// Create eight copies of the A-1 and Z+1 characters

// so we can compare eight characters at once:

A:qword; @nostorage;

byte aCh, aCh, aCh, aCh, aCh, aCh, aCh, aCh;

Z:qword; @nostorage;

byte zCh, zCh, zCh, zCh, zCh, zCh, zCh, zCh;

// Conversion factor: UC := LC - \$20.

ConvFactor: qword; @nostorage;

byte \$20, \$20, \$20, \$20, \$20, \$20, \$20, \$20;

begin mmxupper;

push( edi );

push( eax );

mov( dest, edi );

if( edi = 0 ) then

raise( ex.AttemptToDerefNULL );

endif;

// Some invariant operations (things that don't

// change on each iteration of the loop):

movq( A, mm2 );

movq( ConvFactor, mm4 );

// Get the string length from the length field:

mov( (type str.strRec [edi]).length, eax );

// Process the string in blocks of eight characters:

while( (type int32 eax) >= 8 ) do

movq( [edi], mm0 );   // Get next eight characters of our string.

movq( mm0, mm1 );     // We need two copies.

movq( Z, mm3 );       // Need to refresh on each loop.

pcmpgtb( mm2, mm1 );  // Generate 1's in MM1 everywhere chars >= `a'

pcmpgtb( mm0, mm3 );  // Generate 1's in MM3 everywhere chars <= `z'

pand( mm3, mm1 );     // Generate 1's in MM1 when `a'<=chars<='z'

pand( mm4, mm1 );     // Generates \$20 in each spot we have a l.c. char

psubb( mm1, mm0 );    // Convert l.c. chars to U.C. by adding \$20.

movq( mm0, (type qword [edi]));

// Move on to the next eight characters in the string.

sub( 8, eax );

endwhile;

// If we're processing less than eight characters, do it the old-fashioned

// way (one character at a time).  This also handles the last 1..7 chars

// if the number of characters is not an even multiple of eight.  This

// code was swiped directly from the HLA str.upper function (above).

if( eax != 0 ) then

forever

mov( [edi], al );

breakif( al = 0 );      // Quit when we find a zero byte.

// If a lower case character, convert it to upper case

// and store the result back into the destination string.

if

(#{

cmp( al, `a' );

jb false;

cmp( al, `z' );

ja false;

}#) then

and( \$5f, al );     // Magic lc->UC translation.

mov( al, [edi] );   // Save result.

endif;

// Move on to the next character.

inc( edi );

endfor;

endif;

emms(); // Clean up MMX state.

pop( edi );

pop( eax );

end mmxupper;

static

MyStr: string := "Hello There, MMX Uppercase Routine!";

destStr:string;

mmxCycles:qword;

strCycles:qword;

begin UpperCase;

// Charge up the cache (prefetch the code and data

// to avoid cache misses later).

mov( str.a_cpy( MyStr ), destStr );

mmxupper( destStr );

strupper( destStr );

// Okay, time the execution of the MMX version:

mov( str.a_cpy( MyStr ), destStr );

rdtsc();

mov( eax, (type dword mmxCycles));

mov( edx, (type dword mmxCycles[4]));

mmxupper( destStr );

rdtsc();

sub( (type dword mmxCycles), eax );

sbb( (type dword mmxCycles[4]), edx );

mov( eax, (type dword mmxCycles));

mov( edx, (type dword mmxCycles[4]));

stdout.put( "Dest String = `", destStr, "`", nl );

// Okay, time the execution of the HLA version:

mov( str.a_cpy( MyStr ), destStr );

rdtsc();

mov( eax, (type dword strCycles));

mov( edx, (type dword strCycles[4]));

strupper( destStr );

rdtsc();

sub( (type dword strCycles), eax );

sbb( (type dword strCycles[4]), edx );

mov( eax, (type dword strCycles));

mov( edx, (type dword strCycles[4]));

stdout.put( "Dest String(2) = `", destStr, "`", nl );

stdout.put( "MMX cycles:" );

stdout.puti64( mmxCycles );

stdout.put( nl "HLA cycles: " );

stdout.puti64( strCycles );

stdout.newln();

end UpperCase;

Program 11.4	 MMX Implementation of the HLA Standard Library str.upper Procedure

```

Other string functions, like a case insensitive string comparison, can greatly benefit from the use of parallel computation via the MMX instruction set. Implementation of other string functions is left as an exercise to the reader; interested readers should consider converting string functions that involve calculations and tests on each individual characters in a string as candidates for optimization via MMX.

## 11.10 Putting It All Together

Intel's MMX enhancements to the basic Pentium instruction set allow the acceleration of certain algorithms. Unfortunately, the MMX instruction set isn't generally applicable to a wide range of problems. The MMX instructions, with their SIMD orientation, are generally useful for manipulating a large amount of data organized as byte, word, or double word arrays where the MMX instructions can calculate several values in parallel. Learning to effectively use the MMX instruction set requires a paradigm shift on the part of the programmer. You don't apply the same rules for scalar 80x86 instructions to the MMX instructions. However, if you take the time to master parallel programming techniques with the MMX instructions, then you will be able to accelerate many of your applications.

1Actually, the code could be rewritten easily enough to use only one MMX register.