Hacking and Other Thoughts

Sun, 25 Apr 2010

Couple of strcmp tricks...

Like for strlen it's pretty easy to make a loop which checks a long word at a time. The less easy part is making the code run cheaply in constant time when we exit that loop.

We're going to work with two constants:

#define	STRCMP_CONST1	0x7f7f7f7f7f7f7f7f
#define	STRCMP_CONST2	-0x0101010101010101
We have two calculations, one for the inner loop and one for the post-loop calculations. Two are necessary because we want to minimize the cycle count in the loop and we only care there if there exists a zero byte somewhere. Whereas in the post-loop exit code we have to know precisely which byte the zero resides in.

The inner loop runs roughly like (in pseudo C):

	while (1) {
		s1 = *s1_word++;
		s2 = *s2_word++;
		g2 = s1 + STRCMP_CONST2;
		g1 = s1 | STRCMP_CONST1;
		s_xor = s1 ^ s2;
		if (s_xor)
		if (g2 & ~g1)
			return 0;		
We specifically prioritize the inequality test before there "is there a zero byte in s1" test. And the inner loop "zero byte present in 's1'?" test is:
	(s1 + -0x0101010101010101) & ~(s1 | 0x7f7f7f7f7f7f7f7f)
It's one of several ways to test for a zero byte in a word in 3 instructions. But as mentioned it's imprecise in that due to cascading overflows from the addition, the zero marker left in the mask result might not be in the actual zero byte. That's important in the post-loop exit code calculations so we'll use something else, which is:
	tmp1 = ~(s1 | 0x7f7f7f7f7f7f7f7f);
	tmp2 = ~((srcword1 & 0x7f7f7f7f7f7f7f7f) + 0x7f7f7f7f7f7f7f7f);
	x = (tmp1 & tmp2);
"x" will have a "0x80" value in every byte that was zero in "s1" and zeros elsewhere. In the inner loop we already calculated "(s1 | 0x7f7f7f7f7f7f7f7f)" so we can reuse it and simply negate it.

How does this clever calulation work it's magic? 'tmp1' records every byte of the word that has bit 7 clear, and gives us a "0x80" in such bytes and a zero in all others. 'tmp2' records all the bytes that have all bits below 7 (0x7f) clear, leaving a 0x80 in all such bytes and zeros elsewhere. So "tmp1 & tmp2" is "all bytes that have all bits clear". Get it?

Now, using "x" we can see which comes first, a byte miscompare or a zero. Note we already have this "s_xor" thing calculated in the inner loop, and we'll use that here.

	ret = ((unsigned)s1 <= (unsigned)s2) ? -1 : 1;
	low_bit = x >> 7;
	if (low_bit > s_xor)
		ret = 0;
And we're done.

"low_bit" is the "x" value shifted down by 7 bits so that we have a "0x01" in every byte that was zero in "s1". With big-endian byte ordering, if we simply compare this "low_bit" with the "s_xor" a larger value of "low_bit" indicates that the zero byte comes first. And in such a case the strings we scanned were equal and we should return "0".

A lot of strcmp implementations have performance which is dependent upon which byte in the long word miscompares (they simply scan a byte at a time when they exit the loop) or their location agnostic code is much more expensive than it needs to be.

Here's an icebag for your brain, I can see the steam coming out of your ears after reading all this.