Like for strlen it's pretty easy to make a loop which checks a long word at a time. The less easy part is making the code run cheaply in constant time when we exit that loop.
We're going to work with two constants:
#define STRCMP_CONST1 0x7f7f7f7f7f7f7f7f #define STRCMP_CONST2 -0x0101010101010101We have two calculations, one for the inner loop and one for the post-loop calculations. Two are necessary because we want to minimize the cycle count in the loop and we only care there if there exists a zero byte somewhere. Whereas in the post-loop exit code we have to know precisely which byte the zero resides in.
The inner loop runs roughly like (in pseudo C):
while (1) {
s1 = *s1_word++;
s2 = *s2_word++;
g2 = s1 + STRCMP_CONST2;
g1 = s1 | STRCMP_CONST1;
s_xor = s1 ^ s2;
if (s_xor)
break;
if (g2 & ~g1)
return 0;
}
We specifically prioritize the inequality test before there
"is there a zero byte in s1" test. And the inner loop "zero
byte present in 's1'?" test is:
(s1 + -0x0101010101010101) & ~(s1 | 0x7f7f7f7f7f7f7f7f)It's one of several ways to test for a zero byte in a word in 3 instructions. But as mentioned it's imprecise in that due to cascading overflows from the addition, the zero marker left in the mask result might not be in the actual zero byte. That's important in the post-loop exit code calculations so we'll use something else, which is:
tmp1 = ~(s1 | 0x7f7f7f7f7f7f7f7f); tmp2 = ~((srcword1 & 0x7f7f7f7f7f7f7f7f) + 0x7f7f7f7f7f7f7f7f); x = (tmp1 & tmp2);"x" will have a "0x80" value in every byte that was zero in "s1" and zeros elsewhere. In the inner loop we already calculated "(s1 | 0x7f7f7f7f7f7f7f7f)" so we can reuse it and simply negate it.
How does this clever calulation work it's magic? 'tmp1' records every byte of the word that has bit 7 clear, and gives us a "0x80" in such bytes and a zero in all others. 'tmp2' records all the bytes that have all bits below 7 (0x7f) clear, leaving a 0x80 in all such bytes and zeros elsewhere. So "tmp1 & tmp2" is "all bytes that have all bits clear". Get it?
Now, using "x" we can see which comes first, a byte miscompare or a zero. Note we already have this "s_xor" thing calculated in the inner loop, and we'll use that here.
ret = ((unsigned)s1 <= (unsigned)s2) ? -1 : 1; low_bit = x >> 7; if (low_bit > s_xor) ret = 0;And we're done.
"low_bit" is the "x" value shifted down by 7 bits so that we have a "0x01" in every byte that was zero in "s1". With big-endian byte ordering, if we simply compare this "low_bit" with the "s_xor" a larger value of "low_bit" indicates that the zero byte comes first. And in such a case the strings we scanned were equal and we should return "0".
A lot of strcmp implementations have performance which is dependent upon which byte in the long word miscompares (they simply scan a byte at a time when they exit the loop) or their location agnostic code is much more expensive than it needs to be.
Here's an icebag for your brain, I can see the steam coming out of your ears after reading all this.
The Linux kernel has a facility called the function graph tracer that allows one to record the full call graph of everything the kernel does with timestamping. It's the big brother of the plain function tracer and is implemented with similar facilities.
The plain function tracer merely intercepts calls to mcount (which are emitted by the compiler at the beginning of every function when building with the "-pg" option) and records the calling function as well as other pieces of state (timestamp, parent function, etc.). These trace entries are recorded into per-cpu lockless NMI-safe buffers.
The function graph tracer takes things one step further by recording function returns in the trace as well. And it does this only using mcount. How does it do this? The trick is that it uses a special stub that it inserts into every call point, and it does so dynamically.
On function entry, mcount is invoked and the function graph tracer is called like so:
mov %i7, %g2 mov %fp, %g3 save %sp, -128, %sp mov %g2, %o1 mov %g2, %l0 mov %g3, %l1 mov %l0, %o0 mov %i7, %o1 call prepare_ftrace_return mov %l1, %o2 ret restore %o0, -8, %i7
prepare_ftrace_return() is a helper function that passes the mcount caller program counter, that caller's parent program counter, and the callers frame pointer into the function graph tracer.
The tracer sees if the function should be traced and if there are enough tracking stack slot entries available for the current thread. The slots are used to remember the frame pointer and caller's parent program counter so that it can be compared and restored later. We need to remember this program counter because we are going to change it so that it calls a stub instead of actually returning from the function.
The frame pointer is saved and used as a debugging tool so that we can make sure that when we execute the stub, the state is as we expect it to be and we can be sure that restoring the return address register is safe.
prepare_ftrace_return gives it's caller the address to put into the callers return address register. This will be the special stub if we decide to trace this function call or it will simply be the original value (plus 8 to account for how we return from sparc functions by jumping to "return address register" plus 8).
The stub looks like this:
save %sp, -128, %sp call ftrace_return_to_handler mov %fp, %o0 jmpl %o0 + 8, %g0 restoreftrace_return_to_handler() validates the frame pointer on the top-most stack slot of saved return address + frame pointer pairs. If it matches it records a function return trace entry (with timestamps, etc.) and returns the function's original return address and then we'll jump to that from the stub.
So if we're deep in the call chain of traced functions, and you were to look at the backtrace, you'd see a continuous stack of return addresses referencing the stub. And as the functions return, the function graph tracer resolves the return addresses to what they should be and then the originally intended address is returned to.
Of course there is some non-trivial cost to all of this, in particilar rewriting the return address makes the cpu's return address stack never hit so the function returns become very expensive. But this isn't something you have running constantly, you turn the tracer on around a set of events of interest (f.e. while running a test program).
Here's what a small snippet of a trace looks like with the function graph tracer. In this part of a capture we have a call to "run_local_timers":
64) | run_local_timers() {
64) 2.637 us | hrtimer_run_queues();
64) 2.527 us | raise_softirq();
64) | softlockup_tick() {
64) 2.967 us | __touch_softlockup_watchdog();
64) 8.570 us | }
64) + 23.953 us | }
The first number is the cpu number (this machine has 128 cpus, so yes
64 is not a typo). The next field gives time deltas. Finally we have
the call graph itself.
For the call graph a C-like syntax is used. For leaf functions the line ends with just a semicolon. When a function calls other functions it is closed off at it's return by a closing brace. And the sub-calls are indented as needed.
Call latencies are expressed at the point for which we have the return. So we'll see it on the lines for leaf functions (ending with semicolons) and for closing braces. But never on lines having openning braces.
When latencies exceed certain values a "+" (greater than 10usec) or "!" (greater than 100usec) will be prepended to the time delta so expensive operations can be easily seen.
There are several other powerful tools built on top of the mcount based tracing hooks. For example there is a stack tracer that monitors which functions created the deepest stack frames.
For more information about using ftrace in general, check out the two part series written by Stephen Rostedt at LWN: PART 1 and PART 2.
It's an incredibly powerful tool and framework. In fact, in just in the last few days I've already found 3 bugs and anomalies by simply scanning the traces and looking for nothing in particular.
I've been going through the glibc sparc optimized assembler routines to see if anything can be improved. And I took a stab at seeing if strlen() could be made faster. Find first zero byte in string, pretty simple right?
The first thing we have to discuss is the infamous trick coined by Alan Mycroft, way back in 1987. It allows to check for the presence of a zero byte in a word in 3 instructions. There are 2 magic constants:
#define MAGIC1 0x80808080 #define MAGIC2 0x01010101If you're checking 64-bits at a time simply expand the above magic values to 64-bits on 64-bit systems.
Then, given a word the check becomes:
if ((val - MAGIC2) & ~val & MAGIC1) goto found_zero_byte_in_word;Essentially we're subtracting MAGIC2 to induce underflow in each byte that has the value zero in it. Such underflows cause bit 8 to get set in that byte. Then we want to see if bit 8 is set after subtraction in any byte where bit 8 wasn't set before the subtraction.
To get the most parallelization on multi-issue cpus, we want to compute this using something like:
tmp1 = val - MAGIC2; tmp2 = ~val & MAGIC1; if (tmp1 & tmp2) goto found_zero_byte_in_word;to reduce the number of dependencies such that the computation of tmp1 and tmp2 can occur in the same cpu cycle.
Then there is all the trouble of getting the source buffer aligned so we can do the fast loop comparing a word at a time. The most direct implement is to read a byte at a time, checking for zero, until the buffer address is properly aligned. This is also the slowest implementation.
The powerpc code in glibc has a better idea. If dereferencing a non-word-aligned byte at address 'x' is valid, so is reading the word at 'x & ~3' (or 'x & ~7' on 64-bit). This is because page protection occurs on page boundaries, and x and 'x & ~3' are on the same page.
The only thing left to attend to is to make sure we don't match the alignment pad bytes with zero. This is solved by computing a mask of 1's and writing those 1's into the word we read before we do the Mycroft computation above. In C it looks something like:
orig_ptr = ptr; align = (unsigned long) ptr & 3; mask = -1 >> (align * 8); ptr = (void *) ((unsigned long) ptr & ~3UL); val = *ptr; val |= ~mask; if ((val - MAGIC2) & ~val & MAGIC1) goto found_zero_byte_in_word;At which point we can fall into the main loop.
Once we find the word containing a zero byte, we have to iteratively look for where it is in order to compute the return value. How to schedule this is not trivial, and it's especially cumbersome on 64-bit (where we have to potentially check 8 bytes as opposed to 4).
Anyways, let's analyze the 64-bit Sparc implementation I'm hacking on at the moment. I'm targetting UltraSPARC-III and Niagara2 for performance analysis. Simply speaking UltraSPARC-III can dual-issue integer operations, and Niagara2 is single issue and predicts all branches not taken (basically this means: minimize use of branches).
davem_strlen: mov %o0, %o1 andn %o0, 0x7, %o0 ldx [%o0], %o5 and %o1, 0x7, %g1 mov -1, %g5Save away the original string pointer in %o1. At the end we'll compute the return value as "%o1 - %o0". Align the buffer pointer and load a word as quickly as possible. We load the first word early so that we can hide the memory latency into all of the constant and mask formation we need to do before we can make the Mycroft test.
%g5 holds the initial part of the mask computation (-1, which gets expanded fully to 64-bits by this move instruction) and %g1 will have the shift factor.
sethi %hi(0x01010101), %o2 sll %g1, 3, %g1 or %o2, %lo(0x01010101), %o2 srlx %g5, %g1, %o3 sllx %o2, 32, %g1 sethi %hi(0x00ff0000), %g5%o2 is going to hold the "0x01" expanded to 64-bits subtraction magic value. %o3 wil first hold the initial word mask, and then it will holds the "0x80" magic constant. We can compute the two 64-bit magic constants into registers in 5 instructions.
Pick either of the two constants, we choose the "0x01" here because we'll need it first. This is loaded first using "sethi", "or". This gives us the lower 32-bits of the constant, then we shift up a copy by 32-bits, then or that into the lower 32-bit copy to compute the final value. "0x80" is "0x01" shifted left by 7 bits so a simple shift is all we need to load the other 64-bit constant.
The "0x00ff0000" constant will be used while searching for the zero byte in the final word.
Next, we mask the initial word and fall through into the main loop.
orn %o5, %o3, %o5 or %o2, %g1, %o2 sllx %o2, 7, %o3Mask in the pad bits using mask compute in %o3. Finish computation of 64-bit MAGIC1 into %o2, and finally put MAGIC2 into %o3. We're ready for the main loop:
10: add %o0, 8, %o0 andn %o3, %o5, %g1 sub %o5, %o2, %g2 andcc %g1, %g2, %g0 be,a,pt %xcc, 10b ldx [%o0], %o5This is a real pain to schedule because there are many dependencies. But the "andn", "sub", "andcc" sequence is the Mycroft test, and those first two instructions can execute in one clock cycle on UltraSPARC-III. The ",a" annul bit on the branch means that we only execute the load in the branch delay slot if the branch is taken.
Now we have the code that searches for where exactly the zero byte is in the final word.
srlx %o5, 32, %g1 sub %o0, 8, %o0We over advanced the buffer pointer in the main loop, so correct that by subtracting 8. Prepare a copy of the upper 32-bits of the word into %g1.
andn %o3, %g1, %o4 sub %g1, %o2, %g2 add %o0, 4, %g3 andcc %o4, %g2, %g0 movne %icc, %g1, %o5 move %icc, %g3, %o0This is divide and conquer. Instead of doing 8 byte compares, we first see if the upper 32-bits have the zero byte. We essentially redo the Mycroft test on the upper 32-bits of the word.
If the upper 32-bits have the zero byte, we use %g1 for the comparisons. Otherwise we retain %o5 for the subsequent comparisons and advance the buffer pointer by 4 bytes. This is what the final two conditional move instructions are doing. Note that these conditional moves use '%icc', the 32-bit condition codes.
The astute reader may wonder why we just can't use the upper 32-bits of the Mycroft computation we made in the main loop? This doesn't work because the underflows can carry and cause false positives in upper bytes of the word. For example, consider a value where bits 35 down to 24 have hex value "0x0100". The subtraction of MAGIC2 will result in "0x8080". The real zero byte is the lower one, not the upper one. So we can't merely use the upper 32-bits of the already computed 64-bit Mycroft mask, we have to recompute it over 32-bits by hand.
Now we're left with 32-bits to check for a zero byte, we make extensive use of conditional moves to avoid branches:
mov 3, %g2 srlx %o5, 8, %g1 andcc %g1, 0xff, %g0 move %icc, 2, %g2 andcc %o5, %g5, %g0 srlx %o5, 24, %o5 move %icc, 1, %g2 andcc %o5, 0xff, %g0 move %icc, 0, %g2 add %o0, %g2, %o0We check starting at the low byte up to the highest byte. Because the highest byte, if zero, takes priority. We add the offset of the zero byte to the buffer pointer.
Finally:
retl sub %o0, %o1, %o0We compute the length and return from the routine.
Many many moons ago, in 1998, Jakub Jelinek and his friend Jan Vondrak wrote the routines we use now on sparc. And frankly it's very hard to beat that code especially on multi-issue processors.
The powerpc trick to align the initial word helps us beat the existing code for all the unaligned cases. But for the aligned case the existing code holds a slight edge.
So now I've been trimming cycles as much as possible in the new code trying to reach the state where the aligned case executes at least as fast as the existing code. I'll check this work into glibc once I accomplish that.
The Mycroft trick extends to other libc string routines. For example for 'memchr' you replicate the search character into all bytes of a word, let's call it 'xor_mask' and in the inner loop you adjust each word by using:
val ^= xor_mask;Then use the Mycroft test as in strlen(). Another complication with memchr, however, is the need to check the given length bounds.
This can be done in one instruction by putting the far bounds into your base pointer register (called '%top_of_buffer' below), then using offsets starting at "0 - total_len" (referred to as '%negative_len' below).
Then your inner loop can do something like:
ldx [%top_of_buffer + %negative_len], %o5 addcc %negative_len, 8, %negative_len bcs %xcc, len_exceeded ...We exit the loop when adding 8 bytes to the negative len causes an overflow.
If you're interested in this kind of topic, bit twiddling tricks and whatnot, you absolutely have to own a copy of "Hacker's Delight" by Henry S. Warren, Jr.
I've always wanted to work on support for STT_GNU_IFUNC symbols on sparc. This is going to solve a real problem distribution makers have faced on sparc64 for quite some time.
What is STT_GNU_IFUNC?
Well, normally a symbol is resolved by the dynamic linker based upon information in the symbol table of the objects involved. This is after taking into consideration things like symbol visibility, where it is defined, etc.
The difference with STT_GNU_IFUNC is that the resolution of the reference can be made based upon other criteria. For example, based upon the capabilities of the cpu we are executing on. The most obvious place this would be very useful is in libc, where you can pick the most optimized memcpy() implementation.
Normally the symbol table entry points to the actual symbol location, but STT_GNU_IFUNC symbols point to the location of a "resolver" function. This function returns the symbol location that should be used for references to this symbol.
So when the dynamic linker resolves a reference to a STT_GNU_IFUNC type symbol "foo". It calls the resolver function recorded in the symbol table entry, and uses the return value as the resolved address.
Simple example:
void * memcpy_ifunc (void) __asm__ ("memcpy");
__asm__(".type foo, %gnu_indirect_function");
void *
memcpy_ifunc (void)
{
switch (cpu_type)
{
case cpu_A:
return memcpy_A;
case cpu_B:
return memcpy_B;
default:
return memcpy_default;
}
}
So, references to 'memcpy' will be resolved as determined by
the logic in memcpy_ifunc().
These magic ifunc things even work in static executables. How is that possible?
First, even though the final image is static, the linker arranges to still create PLT entries and dynamic sections for the STT_GNU_IFUNC relocations.
Next, the CRT files for static executables walk through the relocations in the static binary and resolve the STT_GNU_IFUNC symbols.
There are some thorny issues wrt. function pointer equality. To make that work static references to STT_GNU_IFUNC symbols use the PLT address whereas non-static references do not (they get fully resolved).
Back to the reason I was so eager to implement this. On sparc we have four different sets of optimized memcpy/memset implementations in glibc (UltraSPARC-I/II, UltraSPARC-III, Niagara-T1, Niagara-T2). Right now the distributions have to thus build glibc four times each for 32-bit and 64-bit (for a total of 8 times).
With STT_GNU_IFUNC they will only need to build it once for 32-bit and once for 64-bit.
I've just recently posted patches for full support of STT_GNU_IFUNC symbols to the binutils and glibc lists.
My recent hobbies have included an intense study of New York City architecture, and in particular the facinating stories behind the city's two most prominent train stations. That being Grand Central Terminal and the arguably infamous Pennsylvania Station .
In the second half of the 19th century and on towards the first half of the 20th century, any American architect worth his salt studied at the Ecole des Beaux-Arts in Paris.
If you had a degree from that school, you were at the top of the pile for selection on all of the interesting commisions of the time. The school presented the student with a challenging and fast paced curriculum.
Firstly, for these American students attending in Paris, the first challenge was just getting in. The entrance exam (of course) required at least some proficiency in French. Several of the most notable American architects had the retake this entrace exam 5 or more times before being able to pass.
Once accepted, the student was pressed to solve problems. 12 hours were given to draft up a solution to a real architectual problem. Then once the draft was accepted, the student had 2 weeks to flesh out all of the details and present the final design. All the while the student's progress was critiqued by an established French architect who oversaw a group of students.
We really don't have that kind of training for computer science people. It's not even science I would say. This kind of training does exist for pure mathmatics, espcially in France.
Envision a school where you're asked to draft up the design of a compiler pass in 12 hours, then for two weeks you implement it, and meanwhile Alfred Aho critiques your work. This kind of place simply doesn't exist. (Yes I know Alfred teaches at Columbia currently, so maybe this specific place does exist :-) but I maintain that more generally such institutions do not exist)
Open source development and "throwing the masses of monkeys at the problem" seems to be a logical consequence of this, does it not?
A formally trained Beaux-arts architect and a room with a few drafters could design something as insanely complicated as a huge transportation hub in the middle of New York City. And it would work, as there would be no room for failure. To me it seems that someone similarly trained could do a complete operating system, compiler, or similar large software engineering task.
McKimm, Meade, and White were three architects and a couple drafters, and yet they were able to complete works such as Boston Public Library, Pennsylvania Station, and the James Farley Post Office. To just name a few.
So which is better, strict formal training and mentorship or open source monkeys? You decide!
I recently got DRM working on sparc64, and this means I had to of course test it :-)
I played around with ioquake3 and it worked just fine. This was in a way exciting since I have devoured many hours of my life into this game on x86.
One aspect of quake3 is that it has a virtual machine. You can write a MOD for quake3 and replace pretty much any aspect of the game outside of the rendering engine. But to keep MOD authors from eating people's home directories, sending your password out to some rogue collection system, and things of that nature the interfaces are tightly controlled and the MOD code runs in a JIT'd VM.
The only way you can get into the JIT is to make a "system call". And the only way to get out is to either return or make such a system call into another module. The system call is the main entrypoint into the module, it takes an integer command and 0 or more integer arguments.
All memory accesses done by the JIT'd code are masked so that it is impossible for the JIT to touch memory outside of that allocated explicitly for it by the VM.
Ben H. kiddingly said to me that I should write the Sparc JIT since there is one for x86 and PowerPC already. One should never kid about such things...
It's pretty neat stuff, although the stack machine VM code output by the LCC compiler they used is horribly inefficient. Some code for a function might look like:
OPCODE[ OP_ENTER] IMM4[0x0000001c] ! ENTER function, 0x1c of stack OPCODE[ OP_LOCAL] IMM4[0x00000024] ! PUSH stack offset 0x24 (first arg) OPCODE[ OP_LOAD4] ! LOAD from "stack + 0x24" OPCODE[ OP_LEAVE] IMM4[0x0000001c] ! LEAVE function, return LOAD resultOperations push entries onto the "register stack", and consume entries on the top of that stack. This might emit some sparc code like:
save %sp, -64, %sp ! OP_ENTER sub %g3, 0x1c, %g3 add %g3, 0x24, %l0 ! OP_LOCAL and %l0, %g5, %l0 ! OP_LOAD4 ld [%g4 + %l0], %l0 add %g3, 0x1c, %g3 ! OP_LEAVE ret restore %l0, %g0, %o0We use several fixed registers, "%g3" is the stack pointer, "%g5" is the VM data segment offset mask, "%g4" is the data segment base address. So every load or store address formation is "mask with %g5 and add to %g4".
It's there in the ioquake3 repo right now and will be in the next release. There are lots of things that can be improved but it works very well and most of the quake3 MODS I've tried (CPMA, UrbanTerror, etc.) work. I've also been playing the base game online extensively, you know, for stress testing.
One thing that always has been suboptimal on sparc64 has been the profiling with oprofile.
Yes it worked but we only have the dumb timer interrupt profiling available. The biggest loss in this is that IRQ disabled code sequences do not get profiled. This leads to "clumps" in the profile, where all the code in an IRQ disabled sequence can show up as one big hit at the point where IRQs get re-enabled.
This makes the profile often non-representative and, at worst, completely unusable.
Say what you want about levelled interrupts, they do provide a level of flexibility that can be useful. Sparc64 chips have a PIL register, you indicate the interrupt levels (out of 15) you want to block out by writing the highest level to block out into the register. So writing zero enables all interrupt levels, and writing 15 blocks all of them out.
Device interrupts work by interrupt vectors, which are not blocked by the PIL mechanism. These are processed quickly in a trap handler and revectored into a PIL levelled interrupt by software.
Under Linux we use one PIL level for all of the device interrupts. A few of the other PIL levels we use for specific SMP cross-call types.
On UltraSPARC-III (cheetah) and later we (finally) have a profiling counter overflow interrupt. This arrives at PIL level 15. So naturally I had the idea to run the majority of the kernel only disabled up to PIL level 14. The result is that was can use these profile counter overflow interrupts to provide a pseudo-NMI oprofile implementation.
This works quite well and is checked into the sparc-next-2.6 GIT tree, so it will show up in 2.6.29
Someone on IRC asked what was so cool about my 'mkcf' shell script and those command lines I showed.
I'm so old skool that I didn't even know that something as cool as "git stash" even existed. Holy crap that's a nice feature :-)
Yeah but these young whipper snappers don't know how to build a git index by hand, har har har.
In other news, unless you live under a rock you know that Barak Obama won the presidental elections, but what you may not know is that he only won because of a last minute endorsement from Linus.
Tee-hee.
I've had this tiny shell script in my repetoire that I don't know how I'd live without, it's called 'mkcf'.
It's so stupid and probably there are a thousand better ways to implement it. It just shows what files are referenced by a patch, here it is:
#!/bin/sh
diffstat -p 1 $* | grep -v changed | awk ' { print $1 } '
So I'm always typing things like:
bash$ git diff >diff bash$ git checkout-index $(mkcf diff); git update-index --refreshTo save a patch I'm working on and revert it from the working tree.
I'm David S. Miller, and I support this message.
There is a good set of presentation slides on Rock for a talk given this year at HotCHIPS by Shailender Chaudhry. The Rock chip gets a lot of attention for all of the speculation features it will have (and that stuff is of course extremely cool) but to me the most interesting bit initially is the transactional memory support.
If you look at page 18 and 19 in the slides you'll see the basic implementation. There are two instructions called "checkpoint" and "commit". So the traditional atomic increment is Rock'ified as:
atomic_inc: chkpt fail_label ld [%o0], %o1 add %o1, 1, %o1 st %o1, [%o1] commit retl nop fail_label: ...and at the "fail_label" you'd do the traditional atomic sequence using "cas" (compare and swap) instructions.
The checkpoint instruction takes a fail label so that it can be implemented in a best-effort manner (or not at all on some chips).
The commit instruction defines the end of the transaction and if it completes successfully then all of the memory operations enclosed in the sequence execute as if they were all done atomically.
So why bother with all of this? One interesting aspect is that this avoids atomic instructions.
You might look at the above and say "hey, that's the same as some sequence of atomic instructions" and yes for simple examples like an atomic increment it is. But there is still a difference, and that is that the checkpoint/commit sequence is easier to speculate around than anything using atomic instructions.
Atomic instructions act as a serialization point, and the scout threads can't do very much when such explicit ordering constraints start coming into play.
On the other hand the speculation threads can just execute the checkpoint/commit sequence as if it were just a normal set of loads and stores. The chip can perhaps do something like put a marker in the cache lines effected. If for some reason the speculation fails or we can't ensure the necessary atomicity, that's ok we zap the cache lines with the markers and rollback to the checkpoint instruction and head on down to the fail label. Just like for anything else, the speculation threads execute the checkpoint/commit sequence optimistically.
I don't blog as often as I should, sorry...
Nick Piggin was working on optimizing the mutex implementations on various architectures, he started even hacking on some sparc64 bits, and it got me thinking....
There are three defined memory models for sparc v9, TSO (Total Store Ordering) PSO (Partial Store Ordering) and RMO (Relaxed Memory Ordering). The Linux kernel always programs the cpu to run in RMO when in privileged mode.
So for all the atomics and spinlocks we have memory barrier instructions (on sparc64 it's called "membar") peppered all over the place to make sure memory operation visibility is correct especially when the interface provides "locking" semantics.
Complicating matters further, UltraSPARC-I, UltraSPARC-II, and derivates have a chip bug where if a memory barrier occurs right after a mispredicted branch the chip can stop executing instructions until the next trap is signalled. If interrupts are off the cpu hangs completely.
To work around this we either make sure the condition can't happen in hand-crafted assembler, or we encapsulate the code in a macro that forces the memory barrier instruction into the delay slot of a "branch always, predict taken" control transfer. And this bloats up the kernel a bit.
But here's the interesting bit. All chips starting with UltraSPARC-III only actually implement the TSO memory model and essentially ignore what gets programmed into the memory-model field of the PSTATE register. And the RMO gains on the older chips that do actually support RMO is marginal at best. So most of this memory barrier business is just a complete waste.
So I wrote a patch that just gets rid of all of the membar instructions used with atomics and spinlocks, and this shaves 10 full seconds of system time off of a "make -j64" kernel build on Niagara-2. W00t...
Of course, for the optimized memcpy and memset implementations we have to keep the memory barriers in there. The block load and store instructions used in those routines are special and do not obey the normal memory model ordering semantics. Thus they do still require explicit memory barriers.
Linus has a blog :-)