Hacking and Other Thoughts

Mon, 08 Mar 2010

strlen(), oh strlen()...

I've been going through the glibc sparc optimized assembler routines to see if anything can be improved. And I took a stab at seeing if strlen() could be made faster. Find first zero byte in string, pretty simple right?

The first thing we have to discuss is the infamous trick devised by Alan Mycroft, way back in 1987. It lets you check for the presence of a zero byte in a word in 3 instructions. There are 2 magic constants:

#define MAGIC1		0x80808080
#define MAGIC2		0x01010101
If you're checking 64-bits at a time, simply expand the above magic values to 64-bits.

Then, given a word the check becomes:

	if ((val - MAGIC2) & ~val & MAGIC1)
		goto found_zero_byte_in_word;
Essentially we're subtracting MAGIC2 to induce underflow in each byte that has the value zero in it. Such an underflow sets the top bit (0x80) of that byte. Then we want to see if the top bit is set after the subtraction in any byte where it wasn't set before the subtraction; that is what the "& ~val & MAGIC1" part checks.
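To make the test concrete, here is a minimal C sketch of it for 32-bit and 64-bit words. The function names and the _32/_64 suffixes on the constants are mine, chosen for illustration; only the expression itself comes from the text above.

```c
#include <stdint.h>

#define MAGIC1_32 UINT32_C(0x80808080)
#define MAGIC2_32 UINT32_C(0x01010101)
#define MAGIC1_64 UINT64_C(0x8080808080808080)
#define MAGIC2_64 UINT64_C(0x0101010101010101)

/* Nonzero if any of the 4 bytes in val is zero. */
static int has_zero_byte32(uint32_t val)
{
	return ((val - MAGIC2_32) & ~val & MAGIC1_32) != 0;
}

/* Same test with the constants expanded to 64 bits. */
static int has_zero_byte64(uint64_t val)
{
	return ((val - MAGIC2_64) & ~val & MAGIC1_64) != 0;
}
```

Bytes such as 0x80 or 0xff already have the top bit set before the subtraction, which is why the "~val" term is needed to keep them from matching.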

To get the most parallelization on multi-issue cpus, we want to compute this using something like:

	tmp1 = val - MAGIC2;
	tmp2 = ~val & MAGIC1;
	if (tmp1 & tmp2)
		goto found_zero_byte_in_word;
to reduce the number of dependencies such that the computation of tmp1 and tmp2 can occur in the same cpu cycle.

Then there is all the trouble of getting the source buffer aligned so we can do the fast loop comparing a word at a time. The most direct implementation is to read a byte at a time, checking for zero, until the buffer address is properly aligned. This is also the slowest implementation.

The powerpc code in glibc has a better idea. If dereferencing a non-word-aligned byte at address 'x' is valid, so is reading the word at 'x & ~3' (or 'x & ~7' on 64-bit). This is because page protection occurs on page boundaries, and x and 'x & ~3' are on the same page.
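The address arithmetic behind that argument can be sketched in C. The 4096-byte page size is an assumption for illustration only, and the helper names are mine; the point is just that rounding an address down to a word boundary never leaves its page, because the page size is a multiple of the word size.

```c
#include <stdint.h>

/* Hypothetical page size; real systems report theirs at run time. */
#define PAGE_SIZE_ASSUMED ((uintptr_t)4096)

/* Round x down to a multiple of word_size (a power of two: 4 or 8). */
static uintptr_t align_down(uintptr_t x, uintptr_t word_size)
{
	return x & ~(word_size - 1);
}

/* Nonzero if both addresses fall within the same assumed page. */
static int same_page(uintptr_t a, uintptr_t b)
{
	return (a / PAGE_SIZE_ASSUMED) == (b / PAGE_SIZE_ASSUMED);
}
```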

The only thing left to attend to is to make sure we don't mistake the alignment pad bytes in front of the string for the terminating zero. This is solved by computing a mask and forcing the pad bytes of the word we read to all 1's before we do the Mycroft computation above. On a big-endian machine the pad bytes are the most significant ones, so in C it looks something like:

	orig_ptr = ptr;
	align = (unsigned long) ptr & 3;
	mask = 0xffffffffU >> (align * 8);
	ptr = (void *) ((unsigned long) ptr & ~3UL);
	val = *ptr;
	val |= ~mask;
	if ((val - MAGIC2) & ~val & MAGIC1)
		goto found_zero_byte_in_word;
At which point we can fall into the main loop.

Once we find the word containing a zero byte, we have to iteratively look for where it is in order to compute the return value. How to schedule this is not trivial, and it's especially cumbersome on 64-bit (where we have to potentially check 8 bytes as opposed to 4).
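Putting the pieces so far together, here is a rough 64-bit C model of the whole approach. It is a sketch under assumptions, not the glibc code: "load_be64" is a hypothetical helper that assembles bytes the way a big-endian load would, so the model behaves the same on any host, and the final search is the plain iterative one. Like the assembly, it reads whole aligned words and so can touch bytes outside the string itself, which ISO C does not sanction.

```c
#include <stddef.h>
#include <stdint.h>

#define MAGIC1 UINT64_C(0x8080808080808080)
#define MAGIC2 UINT64_C(0x0101010101010101)

/* Hypothetical helper: build the word a big-endian ldx would return,
 * with the first byte in memory as the most significant byte. */
static uint64_t load_be64(const unsigned char *p)
{
	uint64_t w = 0;
	for (int i = 0; i < 8; i++)
		w = (w << 8) | p[i];
	return w;
}

static size_t strlen_sketch(const char *s)
{
	uintptr_t addr = (uintptr_t)s;
	unsigned align = (unsigned)(addr & 7);
	const unsigned char *p = (const unsigned char *)(addr - align);
	uint64_t val = load_be64(p);

	/* Force the pad bytes in front of the string to 0xff so they can
	 * never be mistaken for the terminator. */
	uint64_t mask = ~UINT64_C(0) >> (align * 8);
	val |= ~mask;

	/* Main loop: one Mycroft test per aligned word. */
	while (((val - MAGIC2) & ~val & MAGIC1) == 0) {
		p += 8;
		val = load_be64(p);
	}

	/* Iteratively locate the zero byte, first byte in memory first. */
	size_t off = 0;
	while ((val >> 56) != 0) {
		val <<= 8;
		off++;
	}
	return (size_t)(p + off - (const unsigned char *)s);
}
```

To stay within defined behaviour when trying it out, call it on strings that live inside an 8-byte-aligned buffer whose size is a multiple of 8, so that every word it reads is part of the buffer.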

Anyways, let's analyze the 64-bit Sparc implementation I'm hacking on at the moment. I'm targeting UltraSPARC-III and Niagara2 for performance analysis. Simply put, UltraSPARC-III can dual-issue integer operations, and Niagara2 is single issue and predicts all branches not taken (basically this means: minimize use of branches).

	mov	%o0, %o1
	andn	%o0, 0x7, %o0

	ldx	[%o0], %o5
	and	%o1, 0x7, %g1
	mov	-1, %g5
Save away the original string pointer in %o1. At the end we'll compute the return value as "%o0 - %o1". Align the buffer pointer and load a word as quickly as possible. We load the first word early so that we can hide the memory latency behind all of the constant and mask formation we need to do before we can make the Mycroft test.

%g5 holds the initial part of the mask computation (-1, which gets expanded fully to 64-bits by this move instruction) and %g1 will have the shift factor.

	sethi	%hi(0x01010101), %o2
	sll	%g1, 3, %g1

	or	%o2, %lo(0x01010101), %o2
	srlx	%g5, %g1, %o3

	sllx	%o2, 32, %g1
	sethi	%hi(0x00ff0000), %g5
%o2 is going to hold the "0x01" expanded to 64-bits subtraction magic value. %o3 will first hold the initial word mask, and then it will hold the "0x80" magic constant. We can compute the two 64-bit magic constants into registers in 5 instructions.

We could start with either of the two constants; we choose "0x01" here because we'll need it first. It is loaded using "sethi", "or", which gives us the lower 32-bits of the constant. Then we shift a copy up by 32-bits and or that into the lower 32-bit copy to compute the final value. "0x80" is "0x01" shifted left by 7 bits, so a simple shift is all we need to load the other 64-bit constant.
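The same five-step construction can be written out in C to check the arithmetic. The function names are made up for this sketch, and the comments map each step to the instruction that performs it in the assembly.

```c
#include <stdint.h>

/* Build the 64-bit "0x01" constant from its 32-bit half. */
static uint64_t build_magic2(void)
{
	uint64_t lo = UINT64_C(0x01010101);	/* sethi + or */
	uint64_t hi = lo << 32;			/* sllx by 32 */
	return lo | hi;				/* or */
}

/* The "0x80" constant is the "0x01" constant shifted left by 7.
 * No bit crosses a byte boundary because each byte holds only bit 0. */
static uint64_t build_magic1(void)
{
	return build_magic2() << 7;		/* sllx by 7 */
}
```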

The "0x00ff0000" constant will be used while searching for the zero byte in the final word.

Next, we mask the initial word and fall through into the main loop.

	orn	%o5, %o3, %o5
	or	%o2, %g1, %o2

	sllx	%o2, 7, %o3
Mask in the pad bits using the mask computed in %o3. Finish computation of 64-bit MAGIC2 into %o2, and finally put MAGIC1 into %o3. We're ready for the main loop:
10:	add	%o0, 8, %o0

	andn	%o3, %o5, %g1
	sub	%o5, %o2, %g2

	andcc	%g1, %g2, %g0
	be,a,pt	%xcc, 10b
	 ldx	[%o0], %o5
This is a real pain to schedule because there are many dependencies. But the "andn", "sub", "andcc" sequence is the Mycroft test, and those first two instructions can execute in one clock cycle on UltraSPARC-III. The ",a" annul bit on the branch means that we only execute the load in the branch delay slot if the branch is taken.

Now we have the code that searches for where exactly the zero byte is in the final word.

	srlx	%o5, 32, %g1
	sub	%o0, 8, %o0
We over-advanced the buffer pointer in the main loop, so correct that by subtracting 8. Put a copy of the upper 32-bits of the word into %g1.
	andn	%o3, %g1, %o4
	sub	%g1, %o2, %g2

	add	%o0, 4, %g3
	andcc	%o4, %g2, %g0

	movne	%icc, %g1, %o5
	move	%icc, %g3, %o0
This is divide and conquer. Instead of doing 8 byte compares, we first see if the upper 32-bits have the zero byte. We essentially redo the Mycroft test on the upper 32-bits of the word.

If the upper 32-bits have the zero byte, we use %g1 for the comparisons. Otherwise we retain %o5 for the subsequent comparisons and advance the buffer pointer by 4 bytes. This is what the final two conditional move instructions are doing. Note that these conditional moves use '%icc', the 32-bit condition codes.

The astute reader may wonder why we can't just use the upper 32-bits of the Mycroft computation we made in the main loop. This doesn't work because an underflow borrows from the next byte up and can cause false positives in upper bytes of the word. For example, consider a value where bits 39 down to 24 have hex value "0x0100". The subtraction of MAGIC2 sets the top bit of both bytes (the zero byte borrows from the "0x01" byte above it), so the Mycroft mask there comes out as "0x8080". The real zero byte is the lower one, not the upper one. So we can't merely use the upper 32-bits of the already computed 64-bit Mycroft mask, we have to recompute it over 32-bits by hand.
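A small C sketch makes the false positive visible. The helper names are mine; they return the full Mycroft mask rather than just a yes/no answer so the individual bits can be inspected.

```c
#include <stdint.h>

/* Full 64-bit Mycroft mask: 0x80 is set in every flagged byte. */
static uint64_t mycroft_mask64(uint64_t val)
{
	return (val - UINT64_C(0x0101010101010101)) & ~val
		& UINT64_C(0x8080808080808080);
}

/* The same computation redone over a single 32-bit half. */
static uint32_t mycroft_mask32(uint32_t val)
{
	return (val - UINT32_C(0x01010101)) & ~val & UINT32_C(0x80808080);
}
```

For the word 0x6162630100666768, bits 39 down to 24 hold 0x0100. mycroft_mask64() returns 0x0000008080000000, so the upper half of the mask is nonzero even though the upper half of the word, 0x61626301, has no zero byte. mycroft_mask32() on that upper half returns 0, which is the correct answer.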

Now we're left with 32-bits to check for a zero byte, and we make extensive use of conditional moves to avoid branches:

	mov	3, %g2
	srlx	%o5, 8, %g1

	andcc	%g1, 0xff, %g0
	move	%icc, 2, %g2

	andcc	%o5, %g5, %g0
	srlx	%o5, 24, %o5
	move	%icc, 1, %g2

	andcc	%o5, 0xff, %g0
	move	%icc, 0, %g2

	add	%o0, %g2, %o0
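We start by assuming the zero is in the low byte (offset 3) and then check each higher byte in turn, overriding the offset as we go. That way the highest byte, if zero, takes priority, since on big-endian sparc it is the one that comes first in memory. We add the offset of the zero byte to the buffer pointer.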
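In C the same selection logic looks roughly like the sketch below. The function name is mine, and whether a compiler turns these "if" statements into conditional moves is up to the compiler; the point is only to show the order of the checks.

```c
#include <stdint.h>

/* Offset (0..3, in big-endian memory order) of the first zero byte in
 * a 32-bit word that is already known to contain one. */
static unsigned zero_byte_offset32(uint32_t w)
{
	unsigned off = 3;			/* mov 3: assume the low byte */

	if (((w >> 8) & 0xff) == 0)		/* bits 15..8 */
		off = 2;
	if ((w & UINT32_C(0x00ff0000)) == 0)	/* bits 23..16 */
		off = 1;
	if (((w >> 24) & 0xff) == 0)		/* bits 31..24 */
		off = 0;
	return off;
}
```

Because later checks overwrite earlier ones, a word with several zero bytes still reports the one that comes first in memory.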


	retl
	 sub	%o0, %o1, %o0
We compute the length and return from the routine.

Many many moons ago, in 1998, Jakub Jelinek and his friend Jan Vondrak wrote the routines we use now on sparc. And frankly it's very hard to beat that code especially on multi-issue processors.

The powerpc trick to align the initial word helps us beat the existing code for all the unaligned cases. But for the aligned case the existing code holds a slight edge.

So now I've been trimming cycles as much as possible in the new code trying to reach the state where the aligned case executes at least as fast as the existing code. I'll check this work into glibc once I accomplish that.

The Mycroft trick extends to other libc string routines. For example, for 'memchr' you replicate the search character into all bytes of a word (let's call it 'xor_mask'), and in the inner loop you adjust each word by using:

	val ^= xor_mask;
After the xor, every byte that matched the search character is zero, so you can use the Mycroft test as in strlen(). Another complication with memchr, however, is the need to check the given length bounds.

This can be done in one instruction by putting the end of the buffer into your base pointer register (called '%top_of_buffer' below), then using offsets starting at "0 - total_len" (referred to as '%negative_len' below).

Then your inner loop can do something like:

	ldx	[%top_of_buffer + %negative_len], %o5
	addcc	%negative_len, 8, %negative_len
	bcs	%xcc, len_exceeded
We exit the loop when adding 8 bytes to the negative len produces a carry, which happens exactly when the offset reaches zero or goes past it.
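Both memchr ideas, the xor mask and the negative offset that counts up towards zero, can be modelled in C. This is a simplified sketch with a made-up function name: it assumes the length is a multiple of 8, and it only reports which word holds the match, leaving the search within that word aside.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAGIC1 UINT64_C(0x8080808080808080)
#define MAGIC2 UINT64_C(0x0101010101010101)

/* Byte offset of the first 8-byte word in buf that contains c, or -1.
 * Assumes len is a multiple of 8. */
static ptrdiff_t memchr_word_offset(const void *buf, size_t len,
				    unsigned char c)
{
	const unsigned char *top = (const unsigned char *)buf + len;
	ptrdiff_t neg = -(ptrdiff_t)len;	/* "negative_len" */
	uint64_t xor_mask = (uint64_t)c * MAGIC2; /* c in every byte */

	while (neg < 0) {			/* stands in for the carry test */
		uint64_t val;

		memcpy(&val, top + neg, sizeof val);
		val ^= xor_mask;		/* matching bytes become zero */
		if ((val - MAGIC2) & ~val & MAGIC1)
			return (ptrdiff_t)len + neg;
		neg += 8;
	}
	return -1;
}
```

The loop needs only the one counter: the same value serves as the load offset and as the termination condition, which is what saves an instruction in the assembly version.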

If you're interested in this kind of topic, bit twiddling tricks and whatnot, you absolutely have to own a copy of "Hacker's Delight" by Henry S. Warren, Jr.