I've redone everything about how I'll approach the multiqueue TX issues. I'll try to detail this a bit tomorrow.
If you want to turn the world upside down and rework an entire core interface, that's fine, but you'd better completely understand how things currently work before proceeding.
That's the situation I'm in right now.
If you're following netdev then you know that I'm trying to get the network device transmit path all sorted wrt. multiqueue devices. The current situation is that the entire path is serialized so really it's not possible to take advantage of sophisticated devices on the send side.
It also turns out that the modelling for device throttling is overly simplistic. Whether a packet can currently be accepted by the device depends upon a lot of things, some of which are determined by aspects of the specific packet that is next to be sent.
So how are things locked now and why?
Go far back in Linux history and we come to a time when SMP support was it's infancy or simply non-existent. The transmit path into the driver was guarded by a simple bit lock on a word stored in the struct net_device. It was named "tbusy".
You can identify old time experienced Linux networking hackers by their being able to explain to you what "tbusy" is and how it worked.
The initial SMP support in Linux was all about essentially holding a global lock across all kernel execution. The basic invariant preserved by this was lack of re-entrancy.
At this time drivers did things like disable interrupts to prevent concurrent access to the transmit datastructures. Very simple and very stupid. And we had to find some way to preserve the semantics these simple devices wanted while initially SMP threading the entire network stack.
Add into this the complexity of our packet scheduler. We have all of these sophisticated semantics you can tie to the packet flow going out a particular device. These sophisticated algorithms have a lot of shared data structures which guide the packet sending decisions. So these need protection too.
So we have two state machines on the path a packet takes out of the system. It's very nearly a two coupled loop system, you know the kind Van Jacobson talked about in his LCA06 presentation. The one's that tend to be more unstable than a single coupled loop.
On the one hand we have the device which can queue up a fixed number of packets at a time. And on the other we have the packet scheduler which wants to know when the device receives the packet, and when the device's queue fills up, so that it can make rate calculations and decide to (in some cases) even drop packets when the backlog gets beyond a certain point.
The solution to the re-entrancy problem for the simplistic drivers was to wrap all calls to ->hard_start_xmit() with a spinlock stored in the generic network device structure. It guaranteed that there would only be one thread of control executing in the driver's transmit function for a particular device instance.
The packet scheduler state is managed with it's own locks. And we have this complex interaction between the two sets of activity because one cannot make progress without knowing the state of the other.
So now, some 8 or 9 years later, we have on the order of 400 networking drivers. This means if changes to how this works will occur, we better think long and hard about it, because we'll have to make any such change 400 times and make sure we get it right. The main goal is to get it so that the driver can control the locking (only it knows how to handle multiple queues, etc.) and also manage the queue backlog (which also might need to be multiqueue).
Currently I'm the one with this task of figuring out how to make this work even with all of this legacy stuff sitting around.
As a first step I drew up some patches for the NIU and TG3 drivers, which move the TX backlog handling into the drivers themselves. This is sort of a baby step towards what needs to actually happen. In the end we somehow have to integrate the qdisc dequeueing logic into these places.
Next, for every driver that does not request lockless transmit handling, I added a per-driver spinlock that protects the entirety of the ->hard_start_xmit() handler. Things get real exciting here because there are code paths that believe that holding the TX lock means there is nothing executing inside of this routine, and that if you do something like:
netif_tx_lock(dev); netif_stop_queue(dev); netif_tx_unlock(dev);then you can be sure afterwards that nothing will call into the ->hard_start_xmit() handler.
Preserving that semantic is non-trivial now that the lock is grabbed inside of the handler. I'm currently supposing that a near equivalent can be obtained using something like:
spin_lock(&tp->tx_lock); netif_stop_queue(tp->dev); spin_unlock(&tp->tx_lock); synchronize_net();
But this brings us to another fun topic. Currently the maintainence of the TX queue throttling is made inside of this generic TX lock. This is part of why I decided that the TX backlog had to be taken care of inside of the driver. The only way to protect the queue state properly, turning it off and re-enabling it later when TX space becomes free again, is now going to be with the driver private TX lock. So now the backlog is also the driver's business and problem. The state maintainence follows the locking responsibility.
The biggest headache is how to tie this into the packet scheduler state management.
I want to see this hit 2.6.27 if possible. Very optimistic, and my espresso machine will be working overtime.
Just for fun I started trying to get Grub working on Sparc, there is some code there for ieee1275 firmware stuff but mostly there hasn't been much work done. Current CVS of grub2 doesn't even build on sparc.
A long time ago I contacted the GRUB author, and wanted to contribute to making sparc work, but he told me at the time something crazy. He said that since I worked on our existing loader SILO, I was "contaminated" and therefore couldn't write code for GRUB. I was pretty stunned, as SILO is GPL'd :-) But, whatever.
I figured in 5 years they'd have something working by now, but that is not the case. So I'm going to make this work and if the GRUB folks decide to still be unreasonable and not take my code we'll just have a fork in every distribution I guess. They can happily be without working sparc support for another 5 years, in that case.
The first task is to get the first stage boot block going. There are two choices on how to do this. When you type "boot" for a block device, the firmware loads 7.5K starting at the second 512-byte block. If this block device is the third partition (the "all disk" partition in the Sun disk label) this bootblock starts after the disk label.
Under Linux we really can only use 512 bytes of that boot block area, because it's possible for the filesystem superblock to show up as early as the very next 512 byte block.
The firmware lets you put in the block some sparc executable image (tried by the firmware as 64-bit ELF, then 32-bit ELF, and finally as A.OUT) or tokenized forth. Since 1) we only have 512 bytes and 2) there are no fully GPL'd forth tokenizer implementations (the openbios folks use something that is BSD and MIT licensed) we'll need to go the sparc image route.
The task of this 512 byte sequence of code is to load the next stage of the bootloader. For GRUB I've choosen a multi-tiered scheme similar to how the x86 stuff works. The first stage bootloader loads a single block from the disk and jumps to it. This block we load is actually a header block of the main boot loader binary, a second stage loader, which loads the rest of the image.
This first stage loader therefore needs a few parameters. It needs to know the OF device path where the second stage header block resides. It needs to know the block to read, and finally it needs to know where to load that block and thus where to jump to it for execution.
We'd also like to print some status messages to the screen while this happens and have at least some minimal error handling. Not a small feat in 512 bytes.
We put everything in the text section, and the first thing we do is jump over our embedded data bits:
.text .align 4 .globl _start _start: /* OF CIF entry point arrives in %o4 */ pic_base: call boot_continue mov %o4, CIF_REGThe "CIF" is the client interface to openfirmware. Calls are made by initializing an array of cells (64-bits on sparc64) in memory which describe the call to be made, the input arguments, and the return values (if any). This value provided in %o4 is the OF entry point we jump to when making calls. The only register argument goes in %o0 and is the base of the aforementioned array of cells.
The offsets into the bits coming up are defined in a GRUB boot.h header file so that tools can patch in values during bootblock installation.
. = _start + GRUB_BOOT_MACHINE_VER_MAJ boot_version: .byte GRUB_BOOT_VERSION_MAJOR, GRUB_BOOT_VERSION_MINOR boot_path: . = _start + GRUB_BOOT_MACHINE_KERNEL_ADDRESS kernel_sector: .xword 2 kernel_address: .word GRUB_BOOT_MACHINE_KERNEL_ADDRThe boot_version is just a version blob that various tools and sub-bootloaders could validate for compatibility if they wanted to. It is unused currently.
The boot_path will be filled in by the boot block installation tools with the boot device OF path. kernel_sector and kernel_address tell where to load the image from the device into memory. Next, we have string constants we'll need to make OF calls and put messages onto the console:
prom_finddev_name: .asciz "finddevice" prom_chosen_path: .asciz "/chosen" prom_getprop_name: .asciz "getprop" prom_stdout_name: .asciz "stdout" prom_write_name: .asciz "write" prom_bootpath_name: .asciz "bootpath" prom_open_name: .asciz "open" prom_seek_name: .asciz "seek" prom_read_name: .asciz "read" prom_exit_name: .asciz "exit" grub_name: .asciz "GRUB "To simplify things we'll write all of our code as position independent. There are other macros in the boot.h header which describe these register name macros such as CIF_REG, PIC_REG, etc. most are local registers which are not volatils across OF calls, so we can keep stable values in them. It also defines macros to compute absolute addresses of a symbol using this PIC_REG, into a register.
The next chunk of code are helpers for doing OF calls with various sets of input and output arguments.
prom_open_error: GET_ABS(prom_open_name, %o2) call console_write mov 4, %o3 /* fallthru */ prom_error: GET_ABS(prom_exit_name, %o0) /* fallthru */ /* %o0: OF call name * %o1: input arg 1 */ prom_call_1_1: mov 1, %g1 ba prom_call mov 1, %o5 /* %o2: message string * %o3: message length */ console_write: GET_ABS(prom_write_name, %o0) mov STDOUT_NODE_REG, %o1 /* fallthru */ /* %o0: OF call name * %o1: input arg 1 * %o2: input arg 2 * %o3: input arg 3 */ prom_call_3_1: mov 3, %g1 mov 1, %o5 /* fallthru */ /* %o0: OF call name * %g1: num inputs * %o5: num outputs * %o1-%o4: inputs */ prom_call: stx %o0, [%l1 + 0x00] stx %g1, [%l1 + 0x08] stx %o5, [%l1 + 0x10] stx %o1, [%l1 + 0x18] stx %o2, [%l1 + 0x20] stx %o3, [%l1 + 0x28] stx %o4, [%l1 + 0x30] jmpl CIF_REG, %g0 mov %l1, %o0All of those routines are "call" invoked, and we do the CIF OF call using "jmpl" with no return register write in order to make the CIF OF call return to the stub caller, ie. a tail call. This way we don't need to allocate a register window here or anything complicated like that.
Now for the code we jumped to at the _start header. Our first task is to save away the PIC_REG (%o7 was set by the initial "call" instruction, and will be equal to _start, we use it to reference our data blobs PC relative). Also we setup register %l1 which holds the base of the OF cell argument data block used above. SCRATCH_PAD is defined to 0x10000, which is guarenteed to be mapped by OF and outside of where we will be executing (which is at 0x4000).
boot_continue: mov %o7, PIC_REG /* PIC base */ sethi %hi(SCRATCH_PAD), %l1 /* OF argument slots */We need the console stdout handle to write console messages, this is done by finding the "/chosen" directory in the OF device tree, and fetching from there the "stdout" property which is a 32-bit handle.
GET_ABS(prom_finddev_name, %o0) GET_ABS(prom_chosen_path, %o1) call prom_call_1_1 clr %o2 ldx [%l1 + 0x20], CHOSEN_NODE_REG brz CHOSEN_NODE_REG, prom_error GET_ABS(prom_getprop_name, %o0) mov 4, %g1 mov 1, %o5 mov CHOSEN_NODE_REG, %o1 GET_ABS(prom_stdout_name, %o2) add %l1, 256, %o3 mov 1024, %o4 call prom_call stx %g1, [%l1 + 256] lduw [%l1 + 256], STDOUT_NODE_REG brz,pn STDOUT_NODE_REG, prom_errorSince we have very little space, the error handling has to be small. The "clr %o2" in the prom_call_1_1 invocation causes the return value cell slot (at %l1 + 0x20) to be cleared by the prom_call code. Zero is not a prom handle we'll get back, so if the slot stays zero we took an error. In the getprop call to get the 'stdout' handle, we rely upon OF providing us with zero'd memory outside of the boot block image.
Now that we have the console output cookie, we can print out a message:
GET_ABS(grub_name, %o2) call console_write mov 5, %o3Next, we open up the boot device:
GET_ABS(prom_open_name, %o0) GET_ABS(boot_path, %o1) call prom_call_1_1 clr %o2 ldx [%l1 + 0x20], BOOTDEV_REG brz,pn BOOTDEV_REG, prom_open_errorWe now seek to the appropriate block:
GET_ABS(prom_seek_name, %o0) mov BOOTDEV_REG, %o1 clr %o2 LDX_ABS(kernel_sector, 0x00, %o3) call prom_call_3_1 sllx %o3, 9, %o3and finally read the block into memory:
GET_ABS(prom_read_name, %o0) mov BOOTDEV_REG, %o1 LDUW_ABS(kernel_address, 0x00, %o2) call prom_call_3_1 mov 512, %o3This image will be an A.OUT image as well, so we jump into it right past the A.OUT header:
LDUW_ABS(kernel_address, 0x00, %o2) jmpl %o2 + GRUB_BOOT_AOUT_HEADER_SIZE, %o7 nop 1: ba,a 1bJust for fun we put an endless loop after the jump in case it does return for some reason. We could, alternatively, call prom_error instead. Now skip to 4 bytes right before the end of the 512-byte block and write a signature cookie.
. = _start + GRUB_BOOT_MACHINE_CODE_END /* the last 4 bytes in the sector 0 contain the signature */ .word GRUB_BOOT_MACHINE_SIGNATUREAnd that's the whole boot block implementation. This file is compiled normally into first an ELF image using GCC. Then we strip out all the symbols and other junk, and finally use objcopy to output an A.OUT object. It is exactly 512 bytes in size and the bootblock installer verifies this.
The second stage header block code now can take advantage of all of the values the first stage has computed, such as the stdout handle, the handle for the boot device, etc. In the second stage image, it begins with assembler just like the first stage does above. But, it implements a block + length tuple list which grows down from the end of the block. This tells us where to read in the rest of the GRUB kernel from the actual file on the disk.
The GRUB installer program links in with modules that understand how various filesystems work. It has a block traverser that can be run on arbitrary files. This is the mechanism used to build the block list which is patched into this second stage loader block list.
I've tested both of these using a Sun LDOM guest, which is the fastest way to test low-level stuff like this since resetting (which you need to do every test boot attempt) is nearly instantaneous.
The current task is to flesh out the installer programs and make sure the rest of the GRUB ieee1275 code is going to work properly on sparc. I already know some bits that will need some tweaking. For example, partitions on OF paths are indicated (aparently) by adding ":N" where N is a partition number. On sparc this "N" is instead a letter.
More to come...
So I spent the weekend fishing out some gdb bugs on sparc. Every time I think I understand and know how all of this stuff works I get thrown some new surprises. This time was no exception.
The kernel has all of these neat features, via ptrace()'s PTRACE_SETOPTIONS, that allows a debugger to get notifications when a process forks, vforks, does an exec, etc.
When these events occur in the inferior process, it does a ptrace_notify() which sends a SIGTRAP to the process with the event encoded into the siginfo exit code.
As a result, when the process tries to return back into userspace, it'll do signal processing. As part of that, it will invoke ptrace_stop() which sleeps the task and wakes up the debugger parent so it can examine the event and process state.
The debugger has several choices of what to do. For many cases it will do a ptrace() PTRACE_CONT with a code indicating how the process should continue. Another thing the debugger can do is decide that it's no longer interested in tracing this task, and therefore it does a ptrace() PTRACE_DETACH. This is part of the first bug.
A long time ago we picked the values for PTRACE_this and PTRACE_that on sparc. Some of them mirrored the SunOS values. One of those was PTRACE_DETACH. We unconditionally recognized the SunOS PTRACE_DETACH value, even for Linux processes. Unfortunately, this is the value that also ended up in the sys/ptrace.h Linux header file. So that's the value every Linux application ended up using too.
I yanked out the SunOS PTRACE_foo call support long ago, and it's amazing how much works without a properly functioning PTRACE_DETACH. Putting in a compat translation from the SunOS to the intended Linux value for PTRACE_DETACH in the Sparc ptrace code solved this bug.
Which brings us to the next bug...
What brought me down this path in the first place was examining why running gdb under itself didn't work properly. This kind of game is always fun:
bash$ ./gdb ./gdb (top-gdb) set args test (top-gdb) run (gdb) run Hello World! ...This would hang when the inner-gdb tries to run the test program, and I had to figure out why. After lots of tracing I found that the inner-gdb was hanging in a sigpause() call.
GDB uses sigpause() to wait for SIGCHLD events when it is simply waiting for running inferiors to ptrace_stop() or take some other kind of signal.
In comes the issue of system call restart. Handling this wrt. debuggers is not easy. One nice feature of debuggers like GDB is that you can ask them at any point in time to call some function in the program they are running.
(gdb) p printf("hi\n")
hi
$1 = 2
(gdb)
and after calling the function it completely restores the process state to
what it was before the call. How does it implement this?
First it saves the process state, mostly this comprises the registers. Next it allocates some stack space for the call, pushes the arguments for the function call onto the stack and in registers. Next, it makes the function return address point to a breakpoint it can uniquely recognize. Finally, it sets the program counter to point at the function to call.
When the function returns it hits the breakpoint, and the debugger restores all of the saved state it stored away before the special call.
Now, back to system call restart. When a signal interrupts a system call, it can return immediately to process the signal. Internally to the kernel system calls return special error codes to let the signal dispatch know what to do with the system call that was in progress. It can say that -EINTR should be returned and the system call completed. But it can also say that the program counter should be rewound to the system call trap, the argument registers re-setup with their original values, and the system call thus replayed.
In the example above, imagine what happens if the debugger calls an inferior function at the time that the signal is dispatched. Somehow, in order to restore the state properly after the call, this "we're in a syscall and should do syscall restarting" state has to get saved and restored too.
Long ago I had this clever idea wherein I tried to solve this problem entirely inside of the kernel to shield debuggers from having to deal with it. The idea was that we'd modify the process state and perform the system call restart operations before we stopped the program to let the debugger see the state.
Although great in theory, in practice it's an unworkable solution. We don't know what the debugger is going to do with the process. As I stated earlier it can pass along the signal to the process, or it can cancel the signal delivery altogether. This decision influences whether we should do system call restart or not, but we already pre-commited that state and let the debugger see it already. We can't know what the debugger is going to do ahead of time, therefore it is impossible to do the right thing.
This is what was causing the inner debugger to hang. The inner gdb is receiving a SIGCHLD because the 'test' program it is debugging has hit ptrace_stop(). The top-level gdb looks at this and says "ok, let's just let the inner gdb see the signal, PTRACE_CONT." But my funny in-kernel ptrace syscall restart code already setup for a syscall restart of sigpause(), but what should have happened was a return of -EINTR.
The inner gdb is now wedged forever, it missed the SIGCHLD and the debugged process is sleeping waiting to be woke up by the inner gdb.
So I ripped out all of this silly code, and ended up doing what powerpc, x86, and other platforms do. I added a piece of software binary state (an unused bit in one of the processor state registers), that gdb can control. It is the "we're in a system call" state. When the debugger does an inferior call, it clears this bit when changing the program counter register. This forces the kernel to not do system call restart processing when the task wakes out of ptrace_stop().
But later, when gdb restores all of the register state after the call, that special bit will be restored, and we'll do the right thing as we deliver the original signal and subsequently do syscall restart processing as needed.
It's always fun to find land mines like these ones.
I've had to do a lot of this lately, and the most efficient attempts take on a certain pattern.
So you have a bug, and you can readily reproduce it. Also, the bug appeared in the last pull you made from Linus's tree. Perfect.
At this point you know that 'master' has the bug and that 'ORIG_HEAD' lacks the bug. You could just blindly bisect the whole thing, but you can save yourself some time (and also learn a bit about the nature of the bug) by using some clues and some quick tests to narrow things down a lot from the beginning.
This determination can be easy. Your goal is to first find a spot which you think works. You'd like it to be something a bit further than ORIG_HEAD, as you're trying to narrow things down.
The easy case is some driver breaks or similar, or you see some error message and it's clear what subsystem that came from. Take that information and use it to scan over the changesets you got from your pull:
gitk ORIG_HEAD..Note that when you select a changeset in gitk, the SHA ID of that commit becomes the current X selection. You can use that to do things more quickly below.
So you're found a sequence of commits that look suspicious. Pick the changeset before the first commit in the suspicious set, and check out a test tree with it as the tip into a test branch:
git checkout -b test $(SHA_ID)Build that kernel, and make sure the bug doesn't happen. Let's assume that this kernel passes your test. You have a few options on how to proceed.
The easiest thing to do is to just bisect using the information you now have:
git bisect start git bisect bad master git bisect good testand so on. Build, test boot, and if it shows the bug:
git bisect badelse if it succeeds:
git bisect goodand repeat the process until GIT shows you the guilty commit.
The other option is to try and figure out an approximate more optimal end point for bisection. Take the set of "suspicious" changeset your determined above, and take the one after the last and go:
git checkout -b test2 $(SHA_ID)If this shows the bug, you're in business:
git bisect start git bisect bad test2 git bisect good testand continue as detailed above.
When you're done with all of this:
git bisect resetand report your results to the mailing list.
VGER suffered a major disk failure and it was not easy to bring the machine back up cleanly.
Thanks to the incredible efforts of Matthew Galgoci, Matti Aarnio, and others, it is back up now and slowly but surely the mail queue is running what should be quite an impressive backlog :-)
I just returned from a wonderful 10 day trip to Japan, during which I gave a presentation on multi-queue networking for the Linux Foundation Japan Symposium.
Most of my time was spent in Tokyo, but I was able to make 3 exciting trips outside the city.
On Sunday March 9th, Kazunori Miyazawa gave us a tour of the Kamakura region and Enoshima.
On Friday March 14th we took the hikari shinkansen to Osaka, during which I was able to experience "kuidaore" (literally "to ruin oneself by extravagance in food") in the Dotonbori district of the city.
That evening we travelled to Kyoto and slept at a ryokan at which we had a traditional Japanese breakfast the next morning.
Finally we spent that Saturday exploring Kyoto, mostly the eastern section.
Finally we hopped back onto the shinkansen in the evening to get back to Tokyo.
On Sunday I was able to meet up with other members of the USAGI Project for dinner, several of whom were unfortunately travelling and thus outside Japan during most of my stay.
All in all it was a great trip, and I already have a list of places and things I want to explore next time!
There is this one cpu cycle killer we've been trying to find ways to overcome, and the more I think about it the situation wouldn't even exist if the SACK standard had been given just a smidgen more thought.
SACK has been specified as a lossy indicator of out of order packet reception. This is the core problem. It's a hint, and you cannot actually "act" upon the information.
So if you get some SACK blocks you can't free those marked packets from your retransmit queue. SACK is specified such that the receiver can free up those packets and stop advertising them in the SACK blocks.
This is profoundly stupid, just look at the implications.
What this means is that during a loss recovery event, we have to hold onto a whole extra window of data in the retransmit queue for no reason at all. Even more brilliant is that when the hole is filled and we get this HUGE ACK back acknowledging two windows worth of data we have to purge all of those packets from the retransmit queue in one go.
For large windows, we're talking 4,000 packets or more. I don't care what datastructure you use to manage your retransmit queue, that's going to be expensive and show up in profiles.
If, instead, the SACK specification authors spent just a little bit more time thinking about the implications of SACK being a "soft" indicator I think they would have changed their minds. If SACK were a "hard" indication and we could thus free up packets so marked there, the processing would be spread out throughout the recovery event instead of being batched up into one huge wallop of packet freeing.
Supposedly the reason for marking SACK a "soft" indication is so that "low memory" embedded systems could free up the data. But this is incredibly stupid. If you can handle one window of data, you can also handle holding onto two full windows of data as well. We aren't talking about PDP-11's with 16K of RAM or anything like that.
Furthermore this is inconsistent because the sender has to hold all of the data in such loss recovery events, why treat the receiver specially? It makes absolutely no sense, because if such small embedded systems ever have to send data, they have to commit to the same amount of RAM resourses in such situations.
In fact, a "hard SACK" would help those small devices when they act as senders, because during a loss event they could liberate packets indicated in the SACK blocks they get back.
The SACK standard folks made a horrible tradeoff. They created a significant future performance hindrance for %99.9999 of systems in order to cater to some theoretical memory usage issue on %0.00000001 of machines in the world.
Every TCP implementation out theere is going to have to come up with a workaround for this issue especially as huge windows become more prevalent. And all of that engineering effort would be totally unnecessary (read as: we could work on much more important stuff) if SACK had been specified as a hard indicator.
I have the urge to propose some TCP option that allows two TCP implementations to negotiate "hard" SACK blocks but that won't work. We'd still support "soft" SACKs for eternity and furthermore that's what anyone wishing to exploit this high cpu cost code path in the TCP stack would use.
To be honest, nobody drops data in the way that SACK allows. It'd probably be pragmatic and reasonable just to start enforcing the "hard" semantics. And if they are violated (packets disappear from SACK blocks) we just RST the connection. We could add a ton of logging if such a thing happens, and we could investigate such cases.
It's quite a mess and somebody has to mop it up.
I have to say that Christmas holidays are the best time to hack on things. Most of the yahoos are away on vacation and therefore the constant stream of distractions and really dumb emails in your inbox just aren't there. In short you can work on the things you always want to work on but never can because of time constraints.
There really is an information overload problem.
Today I want to talk about multiple return values. In short, don't do it :-) It really is a sign that things need to be redesigned, and on top of it multiple return values result in quite inefficient code.
The most common case with C is when you need to allocate some memory and return it to caller, but if something goes wrong you want to give some error status too.
So you end up with absolute crap like this:
int create_foo(int flags, struct foo **foop)or, even worse:
struct foo *create_foo(int flags, int *errp)I mean, that just doesn't deserve to live. First of all, the compiler has to allocate stack space for that "by reference" turd you had to add to the arguments to pass that second piece of information back. And that's slow, even on register starved cpus like x86 where stack accesses have been heavily optimized inside of the cpu to make up for that.
Secondly, the semantics are not entirely clear. If an error happens for the second API above, will the function return value always be NULL? In the first API above, when an error happens will the "*foop" always be NULL'd out for me or is the caller expected to do that? Likewise, for the second API above, if there is no error and non-NULL is returned, can I depend upon "*errp" being set to zero?
You don't know, because when you look at that interface definition you simply cannot tell. It's one big ambiguous ugly interface.
For this particular case in the kernel we've settled on a set of macros that allow a pointer and an error to be returned in a single function return value. Basically, it takes advantage of the fact that the range of negative error codes cannot ever be legitimate pointers. So the interface above looks like:
struct foo *create_foo(int flags)and the caller first checks:
p = create_foo(flags);
if (IS_ERR(p)) {
err = PTR_ERR(p);
return err;
}
And if "IS_ERR" is not true, it is a non-NULL pointer and
no error happened.
It is completely unambiguous what the return values mean, and how they will be presented to the caller. Yes, it's true that looking at the function definition you can't "see" this. But once people are exposed to this pattern enough they pick it up, and it allows us to eliminate ambiguity and unclear semantics for these kinds of cases.
Have a happy new year everyone.
Two weeks ago I left to spend some time in Korea, which I'll write about later. But more recently YOSHIFUJI Hideaki paid a visit to Seattle.
We had only one day and an evening, so we hit all of the "essentials":
After getting a few hours of evening sleep, we woke up again to do hacking, solving puzzles, eating lots of greasy unhealthy breakfast food, and even some WIFI war driving on the way to SeaTac airport at 4am.
It was a fun but short visit, and we hope to see him again some time.
It's hard to get inspired sometimes to blog, the work itself is enjoyable and all consuming, so why stop to write about it? Oh that's right, the attention is oh so nice...
Networking is moving right along, 2.6.24 is in bug fix only phase and things don't look too bad. For future stuff I just merged a set of patches from Herbert Xu that consolidates the input and output IPSEC paths so we can turn things upside down, have resume points, and thus support async crypto properly.
I've been trying to toss some cycles into hacking on MONO's sparc support. I'm trying to take my knowledge from my active GCC sparc backend days and translate that into MONO code generation improvements. We'll see how all of this goes. Here is a fun one:
The most straightforware way to set a boolean in a register (0 or 1) based upon some condition (say some simple 'a' COND 'b' comparison) is:
cmp %reg1, %reg2 mov 0, %reg3 bCOND,a 1f mov 1, %reg3 1:This has a few problems. First it has a hard to predict branch, potentially stalling the entire pipeline. Second it uses an annulling branch which can hurt performance even more.
Sparc has some instructions that take the carry bit of the condition codes as an input to a normal 'add' or 'sub' instruction. This can be used to implement the above in 3 instruction or less branchless sequences for the not-equal, less-than-unsigned, and greater-than-unsigned cases using an old trick:
/* %reg3 = (%reg1 != %reg2) */ xor %reg1, %reg2, %reg3 subcc %g0, %reg3, %g0 addx %g0, 0, %reg3 /* %reg3 = (%reg1 < %reg2) */ cmp %reg1, %reg2 addx %g0, 0, %reg3 /* %reg3 = (%reg1 > %reg2) */ cmp %reg2, %reg1 addx %g0, 0, %reg3GCC has been emitting those things for years. And the history behind the discovery of those code sequences is quite interesting. Way back when, the GNU Superoptimizer was written. Super optimization is basically finding the truly optimal sequence of code to compute some function like the ones above.
The GNU variant was written by Torbjorn Granlund and it worked by exhaustive search. It would basically load the inputs into source registers than execute random sequences of instructions over and over using that initial state until the desired result appeared in the destination register.
Pretty cool stuff.
Dude, you've got it all wrong.
What makes the code in the kernel so great is not that it goes in perfect, it's that we whittle all code in the treee down over and over again until it reaches it's perfect form.
It is this whittling system that allows us to thrust 25MB of changes into a tree over a week and a half and it all works out. Yes a lot of shit and bugs get added, but we fix and clean them all up, and we do it fast.
So it doesn't matter that the new scatterlist stuff has a crappy set of interfaces, because people like Rusty will fix it up and make it better.