Hacking and Other Thoughts

Fri, 10 Dec 2010

Metrics, metrics, metrics...

First a quick shout-out to the Colbert Nation.

Next, let's talk about route metrics and sharing, shall we?

Metrics exist to specify attributes for a path. For example, what hoplimit should we use when using this route? What should the path MTU be? How about the TCP congestion window or RTT estimate?

Metric settings come from two places.

The administrator can attach initial metric values to routes when they get loaded into the kernel. This helps deal with specialized situations, but it's not very common at all.

There is a special metric, called "lock" which is a bitmask. There is a bit for each of the existing metrics. If the bit is set, it means that metric should not be modified by the kernel and we should always respect the value the administrator placed there.

The kernel itself will, on the fly, adjust metric values. For example, at the teardown of every TCP connection the kernel can update the metrics on the route attached to the socket. These updates are based upon measurements made during the life of the TCP socket. There is a sysctl to disable this automatic TCP metric update mechanism, for testing purposes.

For a router, metrics don't really change. So, in theory, we could take the defaults stored in the routing table and just reference those directly instead of having a private copy in every routing cache entry.

There are some barriers to this, although none insurmountable.

First of all, in order to share we have to be able to catch any dynamic update so we can unshare those read-only metrics. The net-next-2.6 tree has changes to make sure every metric write goes through a helper function. So we have the traps there and ready to go, problem solved.

Next, we actually change the metrics a little bit when we create every single routing cache entry. Essentially we pre-compute defaults. This is pretty much unnecessary, and actually could theoretically cause some problems in some cases. The metrics in question are the hoplimit, the mtu, and advertised MSS, For ipv4 these are set in rt_set_nexthop.

If the route table metric is simply the default (ie. zero) we pre-calculate it. These calculations are pretty simple and could be done instead when the value of the metric is actually asked for.

Since the defaults are address-family dependent we will need to abstract the calculations behind dst_ops methods, but that's easy enough. Accesses to each of these three metrics then need to go through a helper function which essentially says:

	if (metric_value == 0)
		metric_value = dst->ops->metric_foo_default(dst);
	return metric_value;

With that in place we will rarely, if ever, modify metric values in the routing cache metric arrays. Then the next step is putting unshared metrics somewhere else (f.e. inetpeer cache), and then changing dst_entry metrics member to be a pointer instead of an array.

Initially the pointer will point into the routing table read-only default metrics. On a COW event, we'll hook up an inetpeer entry to the route and retarget the metrics pointer to the inetpeer's metrics storage.

Sun, 21 Nov 2010

IPv4 source address selection in Linux

Often when we connect a socket up, or bind it, we really don't care what source address is used for the resulting connection. We let the kernel decide.

The usual sequence is:

	connect(fd, { AF_INET, $PORT, $DEST_IP_ADDR }, ...);
Here we leave the source port and source address to be selected by the kernel via a facility called auto-binding. Oddly enough TCP and UDP use a different ordering for selecting the source address vs. selecting the source port.

UDP will first select a local port to use, and make this choice in a global namespace of ports for the machine. TCP on the other hand will have a source address selected first, and then try to allocate a local port using the source address as a partial key. This latter ordering is necessary in order to handle SO_REUSEADDR correctly.

Source address selection itself happens via routing.

The route lookup will be performed with the source address in the flow lookup key set to zero. After the route, based upon destination address, is found the routing code uses the next-hop interface to select an appropriate source address.

All of these results are propagated into the routing cache entry.

It is interesting to note that the routing cache entry created in such a situation will have a zero source address as well in it's routing key. So the next time a routing lookup occurs to the same destination, but without a specified source-address, we'll match this routing cache entry.

This little detail creates some minor complications when handling ICMP messages for redirects. Since we must update any potentially matching routing cache entries, we have to probe the hash table multiple times. Once with an explicit source address in the lookup key, and once with the source address in the key set to zero. Otherwise we won't update all of the entries that we need to.

Actual source address selection is performed by inet_select_addr(). Either via direct calls made by net/ipv4/route.c, or indirectly via __fib_res_prefsrc() This function works with a "scope" specification which says which realm in which the source address must be valid. Most of the time this is RT_SCOPE_UNIVERSE.

The linked list of ipv4 interface addressed for the interface is traversed, and the first address with an appropriate scope is selected.

Even though the flow key of the routing cache entry will have a zero source address, the source address selected is remembered in rt->rt_src so that users of it can see what source address to use.

Finally, routes loaded into the kernel can have an explicit "preferred source address" attribute set by the administrator. This value will fully preempt whatever inet_select_addr() would have choosen.

Mon, 30 Aug 2010

How GRO works

All modern device drivers should be doing two things, first they should use NAPI for interrupt mitigation plus simpler mutual exclusion (all RX code paths run in software interrupt context just like TX), and use the GRO NAPI interfaces for feeding packets into the network stack.

Like just about anything else in the networking, GRO is all about increasing performance. The idea is that we can accumulate consequetive packets (based upon protocol specific sequence number checks etc.) into one huge packet. Then process the whole group as one packet object. (in Network Algorithmics this would be principle P2c, Shift computation in time, Share expenses, batch)

GRO help significantly on everyday systems, but it helps even more strongly on machines making use of virtualization since bridging streams of packets is very common and GRO batching decreases the number of switching operations.

Each NAPI instance maintains a list of GRO packets we are trying to accumulate to, called napi->gro_list. The GRO layer dispatches to the network layer protocol that the packet is for. Each network layer that supports GRO implements both a ptype->gro_receive and a ptype->gro_complete method.

->gro_receive attempts to match the incoming skb with ones that have already been queued onto the ->gro_list At this time, the IP and TCP headers are popped from the front of the packets (from GRO's perspective, that actual normal skb packet header pointers are left alone). Also, the GRO'ability state of all packets in the GRO list and the new incoming SKB are updated.

Once we've committed to receiving a GRO skb, we invoke the ->gro_complete method. It is at this point that we make the collection of individual packets look truly like one huge one. Checksums are updated, as are various private GSO state flags in the head 'skb' given to the network stack.

We do not try to accumulate GRO packets infinitely. At the end of a NAPI poll quantum, we force flush the GRO packet list.

For ipv4 TCP there are various criteria for GRO matching.

Certain events cause the current GRO bunch to get flushed out. For example:

The most important attribute of GRO is that it preserves the received packets in their entirety, such that if we don't actually receive the packets locally (for example we want to bridge or route them) they can be perfectly and accurately reconstituted to the transmit path. This is because none of the packet headers are modified (they are entirely preserved) and since GRO requires completely regular packet streams for merging, the packet boundary points are known precisely as well. The GRO merged packet can be completely unraveled and it will mimmick exactly the incoming packet sequence.

GRO mainly the work of Herbert Xu. Various driver authors and others helped him tune and optimize the implementation.

Thu, 26 Aug 2010

Converting sk_buff to list_head.

I've been trying to make this happen, off and on, for at least two years now. Most of the kernel is straightforward and uses the skb_*() interfaces we have for manipulating skb objects on a list.

So for those, simply tweaking the interfaces in skbuff.h will make them all "just work".

However there are a few other spots in the kernel which manipulate the SKB list pointers directly:

I'm taking another stab at this, and hopefully I can work out these wrinkles. It'd be a really nice change because of lot of uses of "struct sk_buff_head" which don't care about the spinlock or the packet count can be converted to simply "list_head" saving serious space in various datastructures.

Sun, 25 Apr 2010

Couple of strcmp tricks...

Like for strlen it's pretty easy to make a loop which checks a long word at a time. The less easy part is making the code run cheaply in constant time when we exit that loop.

We're going to work with two constants:

#define	STRCMP_CONST1	0x7f7f7f7f7f7f7f7f
#define	STRCMP_CONST2	-0x0101010101010101
We have two calculations, one for the inner loop and one for the post-loop calculations. Two are necessary because we want to minimize the cycle count in the loop and we only care there if there exists a zero byte somewhere. Whereas in the post-loop exit code we have to know precisely which byte the zero resides in.

The inner loop runs roughly like (in pseudo C):

	while (1) {
		s1 = *s1_word++;
		s2 = *s2_word++;
		g2 = s1 + STRCMP_CONST2;
		g1 = s1 | STRCMP_CONST1;
		s_xor = s1 ^ s2;
		if (s_xor)
		if (g2 & ~g1)
			return 0;		
We specifically prioritize the inequality test before there "is there a zero byte in s1" test. And the inner loop "zero byte present in 's1'?" test is:
	(s1 + -0x0101010101010101) & ~(s1 | 0x7f7f7f7f7f7f7f7f)
It's one of several ways to test for a zero byte in a word in 3 instructions. But as mentioned it's imprecise in that due to cascading overflows from the addition, the zero marker left in the mask result might not be in the actual zero byte. That's important in the post-loop exit code calculations so we'll use something else, which is:
	tmp1 = ~(s1 | 0x7f7f7f7f7f7f7f7f);
	tmp2 = ~((srcword1 & 0x7f7f7f7f7f7f7f7f) + 0x7f7f7f7f7f7f7f7f);
	x = (tmp1 & tmp2);
"x" will have a "0x80" value in every byte that was zero in "s1" and zeros elsewhere. In the inner loop we already calculated "(s1 | 0x7f7f7f7f7f7f7f7f)" so we can reuse it and simply negate it.

How does this clever calulation work it's magic? 'tmp1' records every byte of the word that has bit 7 clear, and gives us a "0x80" in such bytes and a zero in all others. 'tmp2' records all the bytes that have all bits below 7 (0x7f) clear, leaving a 0x80 in all such bytes and zeros elsewhere. So "tmp1 & tmp2" is "all bytes that have all bits clear". Get it?

Now, using "x" we can see which comes first, a byte miscompare or a zero. Note we already have this "s_xor" thing calculated in the inner loop, and we'll use that here.

	ret = ((unsigned)s1 <= (unsigned)s2) ? -1 : 1;
	low_bit = x >> 7;
	if (low_bit > s_xor)
		ret = 0;
And we're done.

"low_bit" is the "x" value shifted down by 7 bits so that we have a "0x01" in every byte that was zero in "s1". With big-endian byte ordering, if we simply compare this "low_bit" with the "s_xor" a larger value of "low_bit" indicates that the zero byte comes first. And in such a case the strings we scanned were equal and we should return "0".

A lot of strcmp implementations have performance which is dependent upon which byte in the long word miscompares (they simply scan a byte at a time when they exit the loop) or their location agnostic code is much more expensive than it needs to be.

Here's an icebag for your brain, I can see the steam coming out of your ears after reading all this.

Thu, 08 Apr 2010

Function graph tracer on sparc64...

The Linux kernel has a facility called the function graph tracer that allows one to record the full call graph of everything the kernel does with timestamping. It's the big brother of the plain function tracer and is implemented with similar facilities.

The plain function tracer merely intercepts calls to mcount (which are emitted by the compiler at the beginning of every function when building with the "-pg" option) and records the calling function as well as other pieces of state (timestamp, parent function, etc.). These trace entries are recorded into per-cpu lockless NMI-safe buffers.

The function graph tracer takes things one step further by recording function returns in the trace as well. And it does this only using mcount. How does it do this? The trick is that it uses a special stub that it inserts into every call point, and it does so dynamically.

On function entry, mcount is invoked and the function graph tracer is called like so:

	mov		%i7, %g2
	mov		%fp, %g3
	save		%sp, -128, %sp
	mov		%g2, %o1
	mov		%g2, %l0
	mov		%g3, %l1
	mov		%l0, %o0
	mov		%i7, %o1
	call		prepare_ftrace_return
	 mov		%l1, %o2
	 restore	%o0, -8, %i7

prepare_ftrace_return() is a helper function that passes the mcount caller program counter, that caller's parent program counter, and the callers frame pointer into the function graph tracer.

The tracer sees if the function should be traced and if there are enough tracking stack slot entries available for the current thread. The slots are used to remember the frame pointer and caller's parent program counter so that it can be compared and restored later. We need to remember this program counter because we are going to change it so that it calls a stub instead of actually returning from the function.

The frame pointer is saved and used as a debugging tool so that we can make sure that when we execute the stub, the state is as we expect it to be and we can be sure that restoring the return address register is safe.

prepare_ftrace_return gives it's caller the address to put into the callers return address register. This will be the special stub if we decide to trace this function call or it will simply be the original value (plus 8 to account for how we return from sparc functions by jumping to "return address register" plus 8).

The stub looks like this:

	save		%sp, -128, %sp
	call		ftrace_return_to_handler
	 mov		%fp, %o0
	jmpl		%o0 + 8, %g0
ftrace_return_to_handler() validates the frame pointer on the top-most stack slot of saved return address + frame pointer pairs. If it matches it records a function return trace entry (with timestamps, etc.) and returns the function's original return address and then we'll jump to that from the stub.

So if we're deep in the call chain of traced functions, and you were to look at the backtrace, you'd see a continuous stack of return addresses referencing the stub. And as the functions return, the function graph tracer resolves the return addresses to what they should be and then the originally intended address is returned to.

Of course there is some non-trivial cost to all of this, in particilar rewriting the return address makes the cpu's return address stack never hit so the function returns become very expensive. But this isn't something you have running constantly, you turn the tracer on around a set of events of interest (f.e. while running a test program).

Here's what a small snippet of a trace looks like with the function graph tracer. In this part of a capture we have a call to "run_local_timers":

  64)               |          run_local_timers() {
  64)   2.637 us    |            hrtimer_run_queues();
  64)   2.527 us    |            raise_softirq();
  64)               |            softlockup_tick() {
  64)   2.967 us    |              __touch_softlockup_watchdog();
  64)   8.570 us    |            }
  64) + 23.953 us   |          }
The first number is the cpu number (this machine has 128 cpus, so yes 64 is not a typo). The next field gives time deltas. Finally we have the call graph itself.

For the call graph a C-like syntax is used. For leaf functions the line ends with just a semicolon. When a function calls other functions it is closed off at it's return by a closing brace. And the sub-calls are indented as needed.

Call latencies are expressed at the point for which we have the return. So we'll see it on the lines for leaf functions (ending with semicolons) and for closing braces. But never on lines having openning braces.

When latencies exceed certain values a "+" (greater than 10usec) or "!" (greater than 100usec) will be prepended to the time delta so expensive operations can be easily seen.

There are several other powerful tools built on top of the mcount based tracing hooks. For example there is a stack tracer that monitors which functions created the deepest stack frames.

For more information about using ftrace in general, check out the two part series written by Stephen Rostedt at LWN: PART 1 and PART 2.

It's an incredibly powerful tool and framework. In fact, in just in the last few days I've already found 3 bugs and anomalies by simply scanning the traces and looking for nothing in particular.

Mon, 08 Mar 2010

strlen(), oh strlen()...

I've been going through the glibc sparc optimized assembler routines to see if anything can be improved. And I took a stab at seeing if strlen() could be made faster. Find first zero byte in string, pretty simple right?

The first thing we have to discuss is the infamous trick coined by Alan Mycroft, way back in 1987. It allows to check for the presence of a zero byte in a word in 3 instructions. There are 2 magic constants:

#define MAGIC1		0x80808080
#define MAGIC2		0x01010101
If you're checking 64-bits at a time simply expand the above magic values to 64-bits on 64-bit systems.

Then, given a word the check becomes:

	if ((val - MAGIC2) & ~val & MAGIC1)
		goto found_zero_byte_in_word;
Essentially we're subtracting MAGIC2 to induce underflow in each byte that has the value zero in it. Such underflows cause bit 8 to get set in that byte. Then we want to see if bit 8 is set after subtraction in any byte where bit 8 wasn't set before the subtraction.

To get the most parallelization on multi-issue cpus, we want to compute this using something like:

	tmp1 = val - MAGIC2;
	tmp2 = ~val & MAGIC1;
	if (tmp1 & tmp2)
		goto found_zero_byte_in_word;
to reduce the number of dependencies such that the computation of tmp1 and tmp2 can occur in the same cpu cycle.

Then there is all the trouble of getting the source buffer aligned so we can do the fast loop comparing a word at a time. The most direct implement is to read a byte at a time, checking for zero, until the buffer address is properly aligned. This is also the slowest implementation.

The powerpc code in glibc has a better idea. If dereferencing a non-word-aligned byte at address 'x' is valid, so is reading the word at 'x & ~3' (or 'x & ~7' on 64-bit). This is because page protection occurs on page boundaries, and x and 'x & ~3' are on the same page.

The only thing left to attend to is to make sure we don't match the alignment pad bytes with zero. This is solved by computing a mask of 1's and writing those 1's into the word we read before we do the Mycroft computation above. In C it looks something like:

	orig_ptr = ptr;
	align = (unsigned long) ptr & 3;
	mask = -1 >> (align * 8);
	ptr = (void *) ((unsigned long) ptr & ~3UL);
	val = *ptr;
	val |= ~mask;
	if ((val - MAGIC2) & ~val & MAGIC1)
		goto found_zero_byte_in_word;
At which point we can fall into the main loop.

Once we find the word containing a zero byte, we have to iteratively look for where it is in order to compute the return value. How to schedule this is not trivial, and it's especially cumbersome on 64-bit (where we have to potentially check 8 bytes as opposed to 4).

Anyways, let's analyze the 64-bit Sparc implementation I'm hacking on at the moment. I'm targetting UltraSPARC-III and Niagara2 for performance analysis. Simply speaking UltraSPARC-III can dual-issue integer operations, and Niagara2 is single issue and predicts all branches not taken (basically this means: minimize use of branches).

	mov	%o0, %o1
	andn	%o0, 0x7, %o0

	ldx	[%o0], %o5
	and	%o1, 0x7, %g1
	mov	-1, %g5
Save away the original string pointer in %o1. At the end we'll compute the return value as "%o1 - %o0". Align the buffer pointer and load a word as quickly as possible. We load the first word early so that we can hide the memory latency into all of the constant and mask formation we need to do before we can make the Mycroft test.

%g5 holds the initial part of the mask computation (-1, which gets expanded fully to 64-bits by this move instruction) and %g1 will have the shift factor.

	sethi	%hi(0x01010101), %o2
	sll	%g1, 3, %g1

	or	%o2, %lo(0x01010101), %o2
	srlx	%g5, %g1, %o3

	sllx	%o2, 32, %g1
	sethi	%hi(0x00ff0000), %g5
%o2 is going to hold the "0x01" expanded to 64-bits subtraction magic value. %o3 wil first hold the initial word mask, and then it will holds the "0x80" magic constant. We can compute the two 64-bit magic constants into registers in 5 instructions.

Pick either of the two constants, we choose the "0x01" here because we'll need it first. This is loaded first using "sethi", "or". This gives us the lower 32-bits of the constant, then we shift up a copy by 32-bits, then or that into the lower 32-bit copy to compute the final value. "0x80" is "0x01" shifted left by 7 bits so a simple shift is all we need to load the other 64-bit constant.

The "0x00ff0000" constant will be used while searching for the zero byte in the final word.

Next, we mask the initial word and fall through into the main loop.

	orn	%o5, %o3, %o5
	or	%o2, %g1, %o2

	sllx	%o2, 7, %o3
Mask in the pad bits using mask compute in %o3. Finish computation of 64-bit MAGIC1 into %o2, and finally put MAGIC2 into %o3. We're ready for the main loop:
10:	add	%o0, 8, %o0

	andn	%o3, %o5, %g1
	sub	%o5, %o2, %g2

	andcc	%g1, %g2, %g0
	be,a,pt	%xcc, 10b
	 ldx	[%o0], %o5
This is a real pain to schedule because there are many dependencies. But the "andn", "sub", "andcc" sequence is the Mycroft test, and those first two instructions can execute in one clock cycle on UltraSPARC-III. The ",a" annul bit on the branch means that we only execute the load in the branch delay slot if the branch is taken.

Now we have the code that searches for where exactly the zero byte is in the final word.

	srlx	%o5, 32, %g1
	sub	%o0, 8, %o0
We over advanced the buffer pointer in the main loop, so correct that by subtracting 8. Prepare a copy of the upper 32-bits of the word into %g1.
	andn	%o3, %g1, %o4
	sub	%g1, %o2, %g2

	add	%o0, 4, %g3
	andcc	%o4, %g2, %g0

	movne	%icc, %g1, %o5
	move	%icc, %g3, %o0
This is divide and conquer. Instead of doing 8 byte compares, we first see if the upper 32-bits have the zero byte. We essentially redo the Mycroft test on the upper 32-bits of the word.

If the upper 32-bits have the zero byte, we use %g1 for the comparisons. Otherwise we retain %o5 for the subsequent comparisons and advance the buffer pointer by 4 bytes. This is what the final two conditional move instructions are doing. Note that these conditional moves use '%icc', the 32-bit condition codes.

The astute reader may wonder why we just can't use the upper 32-bits of the Mycroft computation we made in the main loop? This doesn't work because the underflows can carry and cause false positives in upper bytes of the word. For example, consider a value where bits 35 down to 24 have hex value "0x0100". The subtraction of MAGIC2 will result in "0x8080". The real zero byte is the lower one, not the upper one. So we can't merely use the upper 32-bits of the already computed 64-bit Mycroft mask, we have to recompute it over 32-bits by hand.

Now we're left with 32-bits to check for a zero byte, we make extensive use of conditional moves to avoid branches:

	mov	3, %g2
	srlx	%o5, 8, %g1

	andcc	%g1, 0xff, %g0
	move	%icc, 2, %g2

	andcc	%o5, %g5, %g0
	srlx	%o5, 24, %o5
	move	%icc, 1, %g2

	andcc	%o5, 0xff, %g0
	move	%icc, 0, %g2

	add	%o0, %g2, %o0
We check starting at the low byte up to the highest byte. Because the highest byte, if zero, takes priority. We add the offset of the zero byte to the buffer pointer.


	 sub	%o0, %o1, %o0
We compute the length and return from the routine.

Many many moons ago, in 1998, Jakub Jelinek and his friend Jan Vondrak wrote the routines we use now on sparc. And frankly it's very hard to beat that code especially on multi-issue processors.

The powerpc trick to align the initial word helps us beat the existing code for all the unaligned cases. But for the aligned case the existing code holds a slight edge.

So now I've been trimming cycles as much as possible in the new code trying to reach the state where the aligned case executes at least as fast as the existing code. I'll check this work into glibc once I accomplish that.

The Mycroft trick extends to other libc string routines. For example for 'memchr' you replicate the search character into all bytes of a word, let's call it 'xor_mask' and in the inner loop you adjust each word by using:

	val ^= xor_mask;
Then use the Mycroft test as in strlen(). Another complication with memchr, however, is the need to check the given length bounds.

This can be done in one instruction by putting the far bounds into your base pointer register (called '%top_of_buffer' below), then using offsets starting at "0 - total_len" (referred to as '%negative_len' below).

Then your inner loop can do something like:

	ldx	[%top_of_buffer + %negative_len], %o5
	addcc	%negative_len, 8, %negative_len
	bcs	%xcc, len_exceeded
We exit the loop when adding 8 bytes to the negative len causes an overflow.

If you're interested in this kind of topic, bit twiddling tricks and whatnot, you absolutely have to own a copy of "Hacker's Delight" by Henry S. Warren, Jr.

Sun, 07 Feb 2010


I've always wanted to work on support for STT_GNU_IFUNC symbols on sparc. This is going to solve a real problem distribution makers have faced on sparc64 for quite some time.


Well, normally a symbol is resolved by the dynamic linker based upon information in the symbol table of the objects involved. This is after taking into consideration things like symbol visibility, where it is defined, etc.

The difference with STT_GNU_IFUNC is that the resolution of the reference can be made based upon other criteria. For example, based upon the capabilities of the cpu we are executing on. The most obvious place this would be very useful is in libc, where you can pick the most optimized memcpy() implementation.

Normally the symbol table entry points to the actual symbol location, but STT_GNU_IFUNC symbols point to the location of a "resolver" function. This function returns the symbol location that should be used for references to this symbol.

So when the dynamic linker resolves a reference to a STT_GNU_IFUNC type symbol "foo". It calls the resolver function recorded in the symbol table entry, and uses the return value as the resolved address.

Simple example:

void * memcpy_ifunc (void) __asm__ ("memcpy");
__asm__(".type foo, %gnu_indirect_function");

void *
memcpy_ifunc (void)
  switch (cpu_type)
  case cpu_A:
    return memcpy_A;
  case cpu_B:
    return memcpy_B;
    return memcpy_default;
So, references to 'memcpy' will be resolved as determined by the logic in memcpy_ifunc().

These magic ifunc things even work in static executables. How is that possible?

First, even though the final image is static, the linker arranges to still create PLT entries and dynamic sections for the STT_GNU_IFUNC relocations.

Next, the CRT files for static executables walk through the relocations in the static binary and resolve the STT_GNU_IFUNC symbols.

There are some thorny issues wrt. function pointer equality. To make that work static references to STT_GNU_IFUNC symbols use the PLT address whereas non-static references do not (they get fully resolved).

Back to the reason I was so eager to implement this. On sparc we have four different sets of optimized memcpy/memset implementations in glibc (UltraSPARC-I/II, UltraSPARC-III, Niagara-T1, Niagara-T2). Right now the distributions have to thus build glibc four times each for 32-bit and 64-bit (for a total of 8 times).

With STT_GNU_IFUNC they will only need to build it once for 32-bit and once for 64-bit.

I've just recently posted patches for full support of STT_GNU_IFUNC symbols to the binutils and glibc lists.

Thu, 21 May 2009

Beaux-Arts and kernel hacking...

My recent hobbies have included an intense study of New York City architecture, and in particular the facinating stories behind the city's two most prominent train stations. That being Grand Central Terminal and the arguably infamous Pennsylvania Station .

McKimm, Meade, and White's masterpiece at 42nd Street and 3rd Ave.

In the second half of the 19th century and on towards the first half of the 20th century, any American architect worth his salt studied at the Ecole des Beaux-Arts in Paris.

If you had a degree from that school, you were at the top of the pile for selection on all of the interesting commisions of the time. The school presented the student with a challenging and fast paced curriculum.

Firstly, for these American students attending in Paris, the first challenge was just getting in. The entrance exam (of course) required at least some proficiency in French. Several of the most notable American architects had the retake this entrace exam 5 or more times before being able to pass.

Once accepted, the student was pressed to solve problems. 12 hours were given to draft up a solution to a real architectual problem. Then once the draft was accepted, the student had 2 weeks to flesh out all of the details and present the final design. All the while the student's progress was critiqued by an established French architect who oversaw a group of students.

We really don't have that kind of training for computer science people. It's not even science I would say. This kind of training does exist for pure mathmatics, espcially in France.

Envision a school where you're asked to draft up the design of a compiler pass in 12 hours, then for two weeks you implement it, and meanwhile Alfred Aho critiques your work. This kind of place simply doesn't exist. (Yes I know Alfred teaches at Columbia currently, so maybe this specific place does exist :-) but I maintain that more generally such institutions do not exist)

Open source development and "throwing the masses of monkeys at the problem" seems to be a logical consequence of this, does it not?

A formally trained Beaux-arts architect and a room with a few drafters could design something as insanely complicated as a huge transportation hub in the middle of New York City. And it would work, as there would be no room for failure. To me it seems that someone similarly trained could do a complete operating system, compiler, or similar large software engineering task.

McKimm, Meade, and White were three architects and a couple drafters, and yet they were able to complete works such as Boston Public Library, Pennsylvania Station, and the James Farley Post Office. To just name a few.

So which is better, strict formal training and mentorship or open source monkeys? You decide!

Thu, 05 Mar 2009

A Sparc JIT for ioquake3

I recently got DRM working on sparc64, and this means I had to of course test it :-)

I played around with ioquake3 and it worked just fine. This was in a way exciting since I have devoured many hours of my life into this game on x86.

One aspect of quake3 is that it has a virtual machine. You can write a MOD for quake3 and replace pretty much any aspect of the game outside of the rendering engine. But to keep MOD authors from eating people's home directories, sending your password out to some rogue collection system, and things of that nature the interfaces are tightly controlled and the MOD code runs in a JIT'd VM.

The only way you can get into the JIT is to make a "system call". And the only way to get out is to either return or make such a system call into another module. The system call is the main entrypoint into the module, it takes an integer command and 0 or more integer arguments.

All memory accesses done by the JIT'd code are masked so that it is impossible for the JIT to touch memory outside of that allocated explicitly for it by the VM.

Ben H. kiddingly said to me that I should write the Sparc JIT since there is one for x86 and PowerPC already. One should never kid about such things...

It's pretty neat stuff, although the stack machine VM code output by the LCC compiler they used is horribly inefficient. Some code for a function might look like:

OPCODE[  OP_ENTER] IMM4[0x0000001c]	! ENTER function, 0x1c of stack
OPCODE[  OP_LOCAL] IMM4[0x00000024]	! PUSH stack offset 0x24 (first arg)
OPCODE[  OP_LOAD4] 			! LOAD from "stack + 0x24"
OPCODE[  OP_LEAVE] IMM4[0x0000001c]	! LEAVE function, return LOAD result
Operations push entries onto the "register stack", and consume entries on the top of that stack. This might emit some sparc code like:
	save	%sp, -64, %sp		! OP_ENTER
	sub	%g3, 0x1c, %g3
	add	%g3, 0x24, %l0		! OP_LOCAL
	and	%l0, %g5, %l0		! OP_LOAD4
	ld	[%g4 + %l0], %l0
	add	%g3, 0x1c, %g3		! OP_LEAVE
	restore	%l0, %g0, %o0
We use several fixed registers, "%g3" is the stack pointer, "%g5" is the VM data segment offset mask, "%g4" is the data segment base address. So every load or store address formation is "mask with %g5 and add to %g4".

It's there in the ioquake3 repo right now and will be in the next release. There are lots of things that can be improved but it works very well and most of the quake3 MODS I've tried (CPMA, UrbanTerror, etc.) work. I've also been playing the base game online extensively, you know, for stress testing.

Sat, 29 Nov 2008

NMI profiling on sparc64...

One thing that always has been suboptimal on sparc64 has been the profiling with oprofile.

Yes it worked but we only have the dumb timer interrupt profiling available. The biggest loss in this is that IRQ disabled code sequences do not get profiled. This leads to "clumps" in the profile, where all the code in an IRQ disabled sequence can show up as one big hit at the point where IRQs get re-enabled.

This makes the profile often non-representative and, at worst, completely unusable.

Say what you want about levelled interrupts, they do provide a level of flexibility that can be useful. Sparc64 chips have a PIL register, you indicate the interrupt levels (out of 15) you want to block out by writing the highest level to block out into the register. So writing zero enables all interrupt levels, and writing 15 blocks all of them out.

Device interrupts work by interrupt vectors, which are not blocked by the PIL mechanism. These are processed quickly in a trap handler and revectored into a PIL levelled interrupt by software.

Under Linux we use one PIL level for all of the device interrupts. A few of the other PIL levels we use for specific SMP cross-call types.

On UltraSPARC-III (cheetah) and later we (finally) have a profiling counter overflow interrupt. This arrives at PIL level 15. So naturally I had the idea to run the majority of the kernel only disabled up to PIL level 14. The result is that was can use these profile counter overflow interrupts to provide a pseudo-NMI oprofile implementation.

This works quite well and is checked into the sparc-next-2.6 GIT tree, so it will show up in 2.6.29

Wed, 05 Nov 2008

git stash

Someone on IRC asked what was so cool about my 'mkcf' shell script and those command lines I showed.

I'm so old skool that I didn't even know that something as cool as "git stash" even existed. Holy crap that's a nice feature :-)

Yeah but these young whipper snappers don't know how to build a git index by hand, har har har.

In other news, unless you live under a rock you know that Barak Obama won the presidental elections, but what you may not know is that he only won because of a last minute endorsement from Linus.