Hacking and Other Thoughts

Wed, 30 Apr 2008


Effective GIT bisecting...

I've had to do a lot of this lately, and the most efficient attempts take on a certain pattern.

So you have a bug, and you can readily reproduce it. Also, the bug appeared in the last pull you made from Linus's tree. Perfect.

At this point you know that 'master' has the bug and that 'ORIG_HEAD' lacks the bug. You could just blindly bisect the whole thing, but you can save yourself some time (and also learn a bit about the nature of the bug) by using some clues and some quick tests to narrow things down a lot from the beginning.

This determination can be easy. Your goal is to first find a spot which you think works. You'd like it to be something a bit further than ORIG_HEAD, as you're trying to narrow things down.

The easy case is some driver breaks or similar, or you see some error message and it's clear what subsystem that came from. Take that information and use it to scan over the changesets you got from your pull:

	gitk ORIG_HEAD..
Note that when you select a changeset in gitk, the SHA ID of that commit becomes the current X selection. You can use that to do things more quickly below.

So you're found a sequence of commits that look suspicious. Pick the changeset before the first commit in the suspicious set, and check out a test tree with it as the tip into a test branch:

	git checkout -b test $(SHA_ID)
Build that kernel, and make sure the bug doesn't happen. Let's assume that this kernel passes your test. You have a few options on how to proceed.

The easiest thing to do is to just bisect using the information you now have:

	git bisect start
	git bisect bad master
	git bisect good test
and so on. Build, test boot, and if it shows the bug:
	git bisect bad
else if it succeeds:
	git bisect good
and repeat the process until GIT shows you the guilty commit.

The other option is to try and figure out an approximate more optimal end point for bisection. Take the set of "suspicious" changeset your determined above, and take the one after the last and go:

	git checkout -b test2 $(SHA_ID)
If this shows the bug, you're in business:
	git bisect start
	git bisect bad test2
	git bisect good test
and continue as detailed above.

When you're done with all of this:

	git bisect reset
and report your results to the mailing list.

Wed, 19 Mar 2008


Vger recovering...

VGER suffered a major disk failure and it was not easy to bring the machine back up cleanly.

Thanks to the incredible efforts of Matthew Galgoci, Matti Aarnio, and others, it is back up now and slowly but surely the mail queue is running what should be quite an impressive backlog :-)


Back from Japan...

I just returned from a wonderful 10 day trip to Japan, during which I gave a presentation on multi-queue networking for the Linux Foundation Japan Symposium.

Most of my time was spent in Tokyo, but I was able to make 3 exciting trips outside the city.

On Sunday March 9th, Kazunori Miyazawa gave us a tour of the Kamakura region and Enoshima.

On Friday March 14th we took the hikari shinkansen to Osaka, during which I was able to experience "kuidaore" (literally "to ruin oneself by extravagance in food") in the Dotonbori district of the city.

That evening we travelled to Kyoto and slept at a ryokan at which we had a traditional Japanese breakfast the next morning.

Finally we spent that Saturday exploring Kyoto, mostly the eastern section.

Finally we hopped back onto the shinkansen in the evening to get back to Tokyo.

On Sunday I was able to meet up with other members of the USAGI Project for dinner, several of whom were unfortunately travelling and thus outside Japan during most of my stay.

All in all it was a great trip, and I already have a list of places and things I want to explore next time!

Mon, 31 Dec 2007


TCP retransmit queue overhead...

There is this one cpu cycle killer we've been trying to find ways to overcome, and the more I think about it the situation wouldn't even exist if the SACK standard had been given just a smidgen more thought.

SACK has been specified as a lossy indicator of out of order packet reception. This is the core problem. It's a hint, and you cannot actually "act" upon the information.

So if you get some SACK blocks you can't free those marked packets from your retransmit queue. SACK is specified such that the receiver can free up those packets and stop advertising them in the SACK blocks.

This is profoundly stupid, just look at the implications.

What this means is that during a loss recovery event, we have to hold onto a whole extra window of data in the retransmit queue for no reason at all. Even more brilliant is that when the hole is filled and we get this HUGE ACK back acknowledging two windows worth of data we have to purge all of those packets from the retransmit queue in one go.

For large windows, we're talking 4,000 packets or more. I don't care what datastructure you use to manage your retransmit queue, that's going to be expensive and show up in profiles.

If, instead, the SACK specification authors spent just a little bit more time thinking about the implications of SACK being a "soft" indicator I think they would have changed their minds. If SACK were a "hard" indication and we could thus free up packets so marked there, the processing would be spread out throughout the recovery event instead of being batched up into one huge wallop of packet freeing.

Supposedly the reason for marking SACK a "soft" indication is so that "low memory" embedded systems could free up the data. But this is incredibly stupid. If you can handle one window of data, you can also handle holding onto two full windows of data as well. We aren't talking about PDP-11's with 16K of RAM or anything like that.

Furthermore this is inconsistent because the sender has to hold all of the data in such loss recovery events, why treat the receiver specially? It makes absolutely no sense, because if such small embedded systems ever have to send data, they have to commit to the same amount of RAM resourses in such situations.

In fact, a "hard SACK" would help those small devices when they act as senders, because during a loss event they could liberate packets indicated in the SACK blocks they get back.

The SACK standard folks made a horrible tradeoff. They created a significant future performance hindrance for %99.9999 of systems in order to cater to some theoretical memory usage issue on %0.00000001 of machines in the world.

Every TCP implementation out theere is going to have to come up with a workaround for this issue especially as huge windows become more prevalent. And all of that engineering effort would be totally unnecessary (read as: we could work on much more important stuff) if SACK had been specified as a hard indicator.

I have the urge to propose some TCP option that allows two TCP implementations to negotiate "hard" SACK blocks but that won't work. We'd still support "soft" SACKs for eternity and furthermore that's what anyone wishing to exploit this high cpu cost code path in the TCP stack would use.

To be honest, nobody drops data in the way that SACK allows. It'd probably be pragmatic and reasonable just to start enforcing the "hard" semantics. And if they are violated (packets disappear from SACK blocks) we just RST the connection. We could add a ton of logging if such a thing happens, and we could investigate such cases.

It's quite a mess and somebody has to mop it up.


Holidays...

I have to say that Christmas holidays are the best time to hack on things. Most of the yahoos are away on vacation and therefore the constant stream of distractions and really dumb emails in your inbox just aren't there. In short you can work on the things you always want to work on but never can because of time constraints.

There really is an information overload problem.

Today I want to talk about multiple return values. In short, don't do it :-) It really is a sign that things need to be redesigned, and on top of it multiple return values result in quite inefficient code.

The most common case with C is when you need to allocate some memory and return it to caller, but if something goes wrong you want to give some error status too.

So you end up with absolute crap like this:

int create_foo(int flags, struct foo **foop)
or, even worse:
struct foo *create_foo(int flags, int *errp)
I mean, that just doesn't deserve to live. First of all, the compiler has to allocate stack space for that "by reference" turd you had to add to the arguments to pass that second piece of information back. And that's slow, even on register starved cpus like x86 where stack accesses have been heavily optimized inside of the cpu to make up for that.

Secondly, the semantics are not entirely clear. If an error happens for the second API above, will the function return value always be NULL? In the first API above, when an error happens will the "*foop" always be NULL'd out for me or is the caller expected to do that? Likewise, for the second API above, if there is no error and non-NULL is returned, can I depend upon "*errp" being set to zero?

You don't know, because when you look at that interface definition you simply cannot tell. It's one big ambiguous ugly interface.

For this particular case in the kernel we've settled on a set of macros that allow a pointer and an error to be returned in a single function return value. Basically, it takes advantage of the fact that the range of negative error codes cannot ever be legitimate pointers. So the interface above looks like:

struct foo *create_foo(int flags)
and the caller first checks:
	p = create_foo(flags);
	if (IS_ERR(p)) {
		err = PTR_ERR(p);
		return err;
	}
And if "IS_ERR" is not true, it is a non-NULL pointer and no error happened.

It is completely unambiguous what the return values mean, and how they will be presented to the caller. Yes, it's true that looking at the function definition you can't "see" this. But once people are exposed to this pattern enough they pick it up, and it allows us to eliminate ambiguity and unclear semantics for these kinds of cases.

Have a happy new year everyone.

Tue, 11 Dec 2007


Visit from Yoshifuji...

Two weeks ago I left to spend some time in Korea, which I'll write about later. But more recently YOSHIFUJI Hideaki paid a visit to Seattle.

We had only one day and an evening, so we hit all of the "essentials":

  1. Sushi at Ototo, although our favorite chef was out, the food was nevertheless still excellent.
  2. Pike Place Market
  3. Boeing Museum of Flight
  4. Fry's


SR71 Blackbird, Museum of Flight, Seattle

After getting a few hours of evening sleep, we woke up again to do hacking, solving puzzles, eating lots of greasy unhealthy breakfast food, and even some WIFI war driving on the way to SeaTac airport at 4am.

It was a fun but short visit, and we hope to see him again some time.

Wed, 14 Nov 2007


Haven't written in a while, net and mono hacking...

It's hard to get inspired sometimes to blog, the work itself is enjoyable and all consuming, so why stop to write about it? Oh that's right, the attention is oh so nice...

Networking is moving right along, 2.6.24 is in bug fix only phase and things don't look too bad. For future stuff I just merged a set of patches from Herbert Xu that consolidates the input and output IPSEC paths so we can turn things upside down, have resume points, and thus support async crypto properly.

I've been trying to toss some cycles into hacking on MONO's sparc support. I'm trying to take my knowledge from my active GCC sparc backend days and translate that into MONO code generation improvements. We'll see how all of this goes. Here is a fun one:

The most straightforware way to set a boolean in a register (0 or 1) based upon some condition (say some simple 'a' COND 'b' comparison) is:

	cmp	%reg1, %reg2
	mov	0, %reg3
	bCOND,a	1f
	 mov	1, %reg3
1:
This has a few problems. First it has a hard to predict branch, potentially stalling the entire pipeline. Second it uses an annulling branch which can hurt performance even more.

Sparc has some instructions that take the carry bit of the condition codes as an input to a normal 'add' or 'sub' instruction. This can be used to implement the above in 3 instruction or less branchless sequences for the not-equal, less-than-unsigned, and greater-than-unsigned cases using an old trick:

	/* %reg3 = (%reg1 != %reg2) */
	xor	%reg1, %reg2, %reg3
	subcc	%g0, %reg3, %g0
	addx	%g0, 0, %reg3

	/* %reg3 = (%reg1 < %reg2) */
	cmp	%reg1, %reg2
	addx	%g0, 0, %reg3

	/* %reg3 = (%reg1 > %reg2) */
	cmp	%reg2, %reg1
	addx	%g0, 0, %reg3
GCC has been emitting those things for years. And the history behind the discovery of those code sequences is quite interesting. Way back when, the GNU Superoptimizer was written. Super optimization is basically finding the truly optimal sequence of code to compute some function like the ones above.

The GNU variant was written by Torbjorn Granlund and it worked by exhaustive search. It would basically load the inputs into source registers than execute random sequences of instructions over and over using that initial state until the desired result appeared in the destination register.

Pretty cool stuff.

Tue, 06 Nov 2007


Sorry Rusty...

Dude, you've got it all wrong.

What makes the code in the kernel so great is not that it goes in perfect, it's that we whittle all code in the treee down over and over again until it reaches it's perfect form.

It is this whittling system that allows us to thrust 25MB of changes into a tree over a week and a half and it all works out. Yes a lot of shit and bugs get added, but we fix and clean them all up, and we do it fast.

So it doesn't matter that the new scatterlist stuff has a crappy set of interfaces, because people like Rusty will fix it up and make it better.

Tue, 11 Sep 2007


Multiqueue presentation...

You can pick up the multiqueue networking presentation I just gave at the Netfilter Workshop from here.

This, along with other presentation materials will eventually propagate over to the workshop web site.


Netfilter Workshop 2007, Karlsruhe

I was in Cambridge, UK last week for Kernel Summit 2007, which was covered quite well on lwn.net and similar so I won't say much other than it went extremely well.

This week I am in Karlsruhe, Germany for the Netfilter Workshop. Today is the first day and a live blog is being updated as we sit here.

To get to the workshop I got to take the Eurostar from London to Paris (via the Channel Tunnel, which you're in for just under 20 minutes), and then the new TGV East high speed train line from Paris to Karlsruhe.

I fly home on Thursday, but I'm trying to at least keep the bug fixes flowing for the networking since Linus and others are home from the kernel summit.

Wed, 08 Aug 2007


Bike riding in Seattle...

I've been biking a lot so that I'm not completely out of shape come winter time when I want to go snowboarding all the time.

Here is a route I've been doing lately to build up my endurance. The trail has another 4 or 5 miles left in it for me to stretch my trip, and if that isn't enough there is a trail that continues on and circles Lake Washington.

Maybe eventually I'll be capable of doing STP but that's still a bit extreme at the moment.


For all of the Australians out there...

Read this for a good laugh.