So I spent the weekend fishing out some gdb bugs on sparc. Every time I think I understand and know how all of this stuff works I get thrown some new surprises. This time was no exception.
The kernel has all of these neat features, via ptrace()'s PTRACE_SETOPTIONS, that allows a debugger to get notifications when a process forks, vforks, does an exec, etc.
When these events occur in the inferior process, it does a ptrace_notify() which sends a SIGTRAP to the process with the event encoded into the siginfo exit code.
As a result, when the process tries to return back into userspace, it'll do signal processing. As part of that, it will invoke ptrace_stop() which sleeps the task and wakes up the debugger parent so it can examine the event and process state.
The debugger has several choices of what to do. For many cases it will do a ptrace() PTRACE_CONT with a code indicating how the process should continue. Another thing the debugger can do is decide that it's no longer interested in tracing this task, and therefore it does a ptrace() PTRACE_DETACH. This is part of the first bug.
A long time ago we picked the values for PTRACE_this and PTRACE_that on sparc. Some of them mirrored the SunOS values. One of those was PTRACE_DETACH. We unconditionally recognized the SunOS PTRACE_DETACH value, even for Linux processes. Unfortunately, this is the value that also ended up in the sys/ptrace.h Linux header file. So that's the value every Linux application ended up using too.
I yanked out the SunOS PTRACE_foo call support long ago, and it's amazing how much works without a properly functioning PTRACE_DETACH. Putting in a compat translation from the SunOS to the intended Linux value for PTRACE_DETACH in the Sparc ptrace code solved this bug.
Which brings us to the next bug...
What brought me down this path in the first place was examining why running gdb under itself didn't work properly. This kind of game is always fun:
bash$ ./gdb ./gdb (top-gdb) set args test (top-gdb) run (gdb) run Hello World! ...This would hang when the inner-gdb tries to run the test program, and I had to figure out why. After lots of tracing I found that the inner-gdb was hanging in a sigpause() call.
GDB uses sigpause() to wait for SIGCHLD events when it is simply waiting for running inferiors to ptrace_stop() or take some other kind of signal.
In comes the issue of system call restart. Handling this wrt. debuggers is not easy. One nice feature of debuggers like GDB is that you can ask them at any point in time to call some function in the program they are running.
(gdb) p printf("hi\n")
hi
$1 = 2
(gdb)
and after calling the function it completely restores the process state to
what it was before the call. How does it implement this?
First it saves the process state, mostly this comprises the registers. Next it allocates some stack space for the call, pushes the arguments for the function call onto the stack and in registers. Next, it makes the function return address point to a breakpoint it can uniquely recognize. Finally, it sets the program counter to point at the function to call.
When the function returns it hits the breakpoint, and the debugger restores all of the saved state it stored away before the special call.
Now, back to system call restart. When a signal interrupts a system call, it can return immediately to process the signal. Internally to the kernel system calls return special error codes to let the signal dispatch know what to do with the system call that was in progress. It can say that -EINTR should be returned and the system call completed. But it can also say that the program counter should be rewound to the system call trap, the argument registers re-setup with their original values, and the system call thus replayed.
In the example above, imagine what happens if the debugger calls an inferior function at the time that the signal is dispatched. Somehow, in order to restore the state properly after the call, this "we're in a syscall and should do syscall restarting" state has to get saved and restored too.
Long ago I had this clever idea wherein I tried to solve this problem entirely inside of the kernel to shield debuggers from having to deal with it. The idea was that we'd modify the process state and perform the system call restart operations before we stopped the program to let the debugger see the state.
Although great in theory, in practice it's an unworkable solution. We don't know what the debugger is going to do with the process. As I stated earlier it can pass along the signal to the process, or it can cancel the signal delivery altogether. This decision influences whether we should do system call restart or not, but we already pre-commited that state and let the debugger see it already. We can't know what the debugger is going to do ahead of time, therefore it is impossible to do the right thing.
This is what was causing the inner debugger to hang. The inner gdb is receiving a SIGCHLD because the 'test' program it is debugging has hit ptrace_stop(). The top-level gdb looks at this and says "ok, let's just let the inner gdb see the signal, PTRACE_CONT." But my funny in-kernel ptrace syscall restart code already setup for a syscall restart of sigpause(), but what should have happened was a return of -EINTR.
The inner gdb is now wedged forever, it missed the SIGCHLD and the debugged process is sleeping waiting to be woke up by the inner gdb.
So I ripped out all of this silly code, and ended up doing what powerpc, x86, and other platforms do. I added a piece of software binary state (an unused bit in one of the processor state registers), that gdb can control. It is the "we're in a system call" state. When the debugger does an inferior call, it clears this bit when changing the program counter register. This forces the kernel to not do system call restart processing when the task wakes out of ptrace_stop().
But later, when gdb restores all of the register state after the call, that special bit will be restored, and we'll do the right thing as we deliver the original signal and subsequently do syscall restart processing as needed.
It's always fun to find land mines like these ones.
I've had to do a lot of this lately, and the most efficient attempts take on a certain pattern.
So you have a bug, and you can readily reproduce it. Also, the bug appeared in the last pull you made from Linus's tree. Perfect.
At this point you know that 'master' has the bug and that 'ORIG_HEAD' lacks the bug. You could just blindly bisect the whole thing, but you can save yourself some time (and also learn a bit about the nature of the bug) by using some clues and some quick tests to narrow things down a lot from the beginning.
This determination can be easy. Your goal is to first find a spot which you think works. You'd like it to be something a bit further than ORIG_HEAD, as you're trying to narrow things down.
The easy case is some driver breaks or similar, or you see some error message and it's clear what subsystem that came from. Take that information and use it to scan over the changesets you got from your pull:
gitk ORIG_HEAD..Note that when you select a changeset in gitk, the SHA ID of that commit becomes the current X selection. You can use that to do things more quickly below.
So you're found a sequence of commits that look suspicious. Pick the changeset before the first commit in the suspicious set, and check out a test tree with it as the tip into a test branch:
git checkout -b test $(SHA_ID)Build that kernel, and make sure the bug doesn't happen. Let's assume that this kernel passes your test. You have a few options on how to proceed.
The easiest thing to do is to just bisect using the information you now have:
git bisect start git bisect bad master git bisect good testand so on. Build, test boot, and if it shows the bug:
git bisect badelse if it succeeds:
git bisect goodand repeat the process until GIT shows you the guilty commit.
The other option is to try and figure out an approximate more optimal end point for bisection. Take the set of "suspicious" changeset your determined above, and take the one after the last and go:
git checkout -b test2 $(SHA_ID)If this shows the bug, you're in business:
git bisect start git bisect bad test2 git bisect good testand continue as detailed above.
When you're done with all of this:
git bisect resetand report your results to the mailing list.
VGER suffered a major disk failure and it was not easy to bring the machine back up cleanly.
Thanks to the incredible efforts of Matthew Galgoci, Matti Aarnio, and others, it is back up now and slowly but surely the mail queue is running what should be quite an impressive backlog :-)
I just returned from a wonderful 10 day trip to Japan, during which I gave a presentation on multi-queue networking for the Linux Foundation Japan Symposium.
Most of my time was spent in Tokyo, but I was able to make 3 exciting trips outside the city.
On Sunday March 9th, Kazunori Miyazawa gave us a tour of the Kamakura region and Enoshima.
On Friday March 14th we took the hikari shinkansen to Osaka, during which I was able to experience "kuidaore" (literally "to ruin oneself by extravagance in food") in the Dotonbori district of the city.
That evening we travelled to Kyoto and slept at a ryokan at which we had a traditional Japanese breakfast the next morning.
Finally we spent that Saturday exploring Kyoto, mostly the eastern section.
Finally we hopped back onto the shinkansen in the evening to get back to Tokyo.
On Sunday I was able to meet up with other members of the USAGI Project for dinner, several of whom were unfortunately travelling and thus outside Japan during most of my stay.
All in all it was a great trip, and I already have a list of places and things I want to explore next time!
There is this one cpu cycle killer we've been trying to find ways to overcome, and the more I think about it the situation wouldn't even exist if the SACK standard had been given just a smidgen more thought.
SACK has been specified as a lossy indicator of out of order packet reception. This is the core problem. It's a hint, and you cannot actually "act" upon the information.
So if you get some SACK blocks you can't free those marked packets from your retransmit queue. SACK is specified such that the receiver can free up those packets and stop advertising them in the SACK blocks.
This is profoundly stupid, just look at the implications.
What this means is that during a loss recovery event, we have to hold onto a whole extra window of data in the retransmit queue for no reason at all. Even more brilliant is that when the hole is filled and we get this HUGE ACK back acknowledging two windows worth of data we have to purge all of those packets from the retransmit queue in one go.
For large windows, we're talking 4,000 packets or more. I don't care what datastructure you use to manage your retransmit queue, that's going to be expensive and show up in profiles.
If, instead, the SACK specification authors spent just a little bit more time thinking about the implications of SACK being a "soft" indicator I think they would have changed their minds. If SACK were a "hard" indication and we could thus free up packets so marked there, the processing would be spread out throughout the recovery event instead of being batched up into one huge wallop of packet freeing.
Supposedly the reason for marking SACK a "soft" indication is so that "low memory" embedded systems could free up the data. But this is incredibly stupid. If you can handle one window of data, you can also handle holding onto two full windows of data as well. We aren't talking about PDP-11's with 16K of RAM or anything like that.
Furthermore this is inconsistent because the sender has to hold all of the data in such loss recovery events, why treat the receiver specially? It makes absolutely no sense, because if such small embedded systems ever have to send data, they have to commit to the same amount of RAM resourses in such situations.
In fact, a "hard SACK" would help those small devices when they act as senders, because during a loss event they could liberate packets indicated in the SACK blocks they get back.
The SACK standard folks made a horrible tradeoff. They created a significant future performance hindrance for %99.9999 of systems in order to cater to some theoretical memory usage issue on %0.00000001 of machines in the world.
Every TCP implementation out theere is going to have to come up with a workaround for this issue especially as huge windows become more prevalent. And all of that engineering effort would be totally unnecessary (read as: we could work on much more important stuff) if SACK had been specified as a hard indicator.
I have the urge to propose some TCP option that allows two TCP implementations to negotiate "hard" SACK blocks but that won't work. We'd still support "soft" SACKs for eternity and furthermore that's what anyone wishing to exploit this high cpu cost code path in the TCP stack would use.
To be honest, nobody drops data in the way that SACK allows. It'd probably be pragmatic and reasonable just to start enforcing the "hard" semantics. And if they are violated (packets disappear from SACK blocks) we just RST the connection. We could add a ton of logging if such a thing happens, and we could investigate such cases.
It's quite a mess and somebody has to mop it up.
I have to say that Christmas holidays are the best time to hack on things. Most of the yahoos are away on vacation and therefore the constant stream of distractions and really dumb emails in your inbox just aren't there. In short you can work on the things you always want to work on but never can because of time constraints.
There really is an information overload problem.
Today I want to talk about multiple return values. In short, don't do it :-) It really is a sign that things need to be redesigned, and on top of it multiple return values result in quite inefficient code.
The most common case with C is when you need to allocate some memory and return it to caller, but if something goes wrong you want to give some error status too.
So you end up with absolute crap like this:
int create_foo(int flags, struct foo **foop)or, even worse:
struct foo *create_foo(int flags, int *errp)I mean, that just doesn't deserve to live. First of all, the compiler has to allocate stack space for that "by reference" turd you had to add to the arguments to pass that second piece of information back. And that's slow, even on register starved cpus like x86 where stack accesses have been heavily optimized inside of the cpu to make up for that.
Secondly, the semantics are not entirely clear. If an error happens for the second API above, will the function return value always be NULL? In the first API above, when an error happens will the "*foop" always be NULL'd out for me or is the caller expected to do that? Likewise, for the second API above, if there is no error and non-NULL is returned, can I depend upon "*errp" being set to zero?
You don't know, because when you look at that interface definition you simply cannot tell. It's one big ambiguous ugly interface.
For this particular case in the kernel we've settled on a set of macros that allow a pointer and an error to be returned in a single function return value. Basically, it takes advantage of the fact that the range of negative error codes cannot ever be legitimate pointers. So the interface above looks like:
struct foo *create_foo(int flags)and the caller first checks:
p = create_foo(flags);
if (IS_ERR(p)) {
err = PTR_ERR(p);
return err;
}
And if "IS_ERR" is not true, it is a non-NULL pointer and
no error happened.
It is completely unambiguous what the return values mean, and how they will be presented to the caller. Yes, it's true that looking at the function definition you can't "see" this. But once people are exposed to this pattern enough they pick it up, and it allows us to eliminate ambiguity and unclear semantics for these kinds of cases.
Have a happy new year everyone.
Two weeks ago I left to spend some time in Korea, which I'll write about later. But more recently YOSHIFUJI Hideaki paid a visit to Seattle.
We had only one day and an evening, so we hit all of the "essentials":
After getting a few hours of evening sleep, we woke up again to do hacking, solving puzzles, eating lots of greasy unhealthy breakfast food, and even some WIFI war driving on the way to SeaTac airport at 4am.
It was a fun but short visit, and we hope to see him again some time.
It's hard to get inspired sometimes to blog, the work itself is enjoyable and all consuming, so why stop to write about it? Oh that's right, the attention is oh so nice...
Networking is moving right along, 2.6.24 is in bug fix only phase and things don't look too bad. For future stuff I just merged a set of patches from Herbert Xu that consolidates the input and output IPSEC paths so we can turn things upside down, have resume points, and thus support async crypto properly.
I've been trying to toss some cycles into hacking on MONO's sparc support. I'm trying to take my knowledge from my active GCC sparc backend days and translate that into MONO code generation improvements. We'll see how all of this goes. Here is a fun one:
The most straightforware way to set a boolean in a register (0 or 1) based upon some condition (say some simple 'a' COND 'b' comparison) is:
cmp %reg1, %reg2 mov 0, %reg3 bCOND,a 1f mov 1, %reg3 1:This has a few problems. First it has a hard to predict branch, potentially stalling the entire pipeline. Second it uses an annulling branch which can hurt performance even more.
Sparc has some instructions that take the carry bit of the condition codes as an input to a normal 'add' or 'sub' instruction. This can be used to implement the above in 3 instruction or less branchless sequences for the not-equal, less-than-unsigned, and greater-than-unsigned cases using an old trick:
/* %reg3 = (%reg1 != %reg2) */ xor %reg1, %reg2, %reg3 subcc %g0, %reg3, %g0 addx %g0, 0, %reg3 /* %reg3 = (%reg1 < %reg2) */ cmp %reg1, %reg2 addx %g0, 0, %reg3 /* %reg3 = (%reg1 > %reg2) */ cmp %reg2, %reg1 addx %g0, 0, %reg3GCC has been emitting those things for years. And the history behind the discovery of those code sequences is quite interesting. Way back when, the GNU Superoptimizer was written. Super optimization is basically finding the truly optimal sequence of code to compute some function like the ones above.
The GNU variant was written by Torbjorn Granlund and it worked by exhaustive search. It would basically load the inputs into source registers than execute random sequences of instructions over and over using that initial state until the desired result appeared in the destination register.
Pretty cool stuff.
Dude, you've got it all wrong.
What makes the code in the kernel so great is not that it goes in perfect, it's that we whittle all code in the treee down over and over again until it reaches it's perfect form.
It is this whittling system that allows us to thrust 25MB of changes into a tree over a week and a half and it all works out. Yes a lot of shit and bugs get added, but we fix and clean them all up, and we do it fast.
So it doesn't matter that the new scatterlist stuff has a crappy set of interfaces, because people like Rusty will fix it up and make it better.
You can pick up the multiqueue networking presentation I just gave at the Netfilter Workshop from here.
This, along with other presentation materials will eventually propagate over to the workshop web site.
I was in Cambridge, UK last week for Kernel Summit 2007, which was covered quite well on lwn.net and similar so I won't say much other than it went extremely well.
This week I am in Karlsruhe, Germany for the Netfilter Workshop. Today is the first day and a live blog is being updated as we sit here.
To get to the workshop I got to take the Eurostar from London to Paris (via the Channel Tunnel, which you're in for just under 20 minutes), and then the new TGV East high speed train line from Paris to Karlsruhe.
I fly home on Thursday, but I'm trying to at least keep the bug fixes flowing for the networking since Linus and others are home from the kernel summit.
I've been biking a lot so that I'm not completely out of shape come winter time when I want to go snowboarding all the time.
Here is a route I've been doing lately to build up my endurance. The trail has another 4 or 5 miles left in it for me to stretch my trip, and if that isn't enough there is a trail that continues on and circles Lake Washington.
Maybe eventually I'll be capable of doing STP but that's still a bit extreme at the moment.