<?xml version="1.0"?>
<!-- name="generator" content="blosxom/2.0" -->
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">

<rss version="0.91">
  <channel>
    <title>DaveM's Linux Networking BLOG   </title>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi</link>
    <description>Mashimaro Fan Club</description>
    <language>en</language>
    <image>
      <url>http://vger.kernel.org/~davem/davem-48-70.png</url>
      <width>48</width>
      <height>70</height>
    </image>

  <item>
    <title>Think I've got it...</title>
    <pubDate>Fri, 27 Jun 2008 07:39:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2008/06/27#net_tx_more_tomorrow</link>
    <description>&lt;p&gt;
I've redone everything about how I'll approach the multiqueue
TX issues.  I'll try to detail this a bit tomorrow.
&lt;center&gt;
&lt;img src=&quot;http://vger.kernel.org/~davem/big_chess.jpg&quot;&gt;
&lt;br&gt;
A big chess match, Sydney
&lt;/center&gt;
I posted a small teaser on
&lt;a href=&quot;http://marc.info/?l=linux-netdev&amp;m=121456557602254&amp;w=2&quot;&gt;
netdev&lt;/a&gt;</description>
  </item>
  <item>
    <title>Network device TX lock.</title>
    <pubDate>Thu, 26 Jun 2008 04:45:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2008/06/26#net_tx_lock</link>
    <description>&lt;p&gt;
If you want to turn the world upside down and rework an entire
core interface, that's fine, but you'd better completely understand
how things currently work before proceeding.
&lt;p&gt;
That's the situation I'm in right now.
&lt;p&gt;
If you're following netdev then you know that I'm trying to get the
network device transmit path all sorted wrt. multiqueue devices.  The
current situation is that the entire path is serialized so really it's
not possible to take advantage of sophisticated devices on the send
side.
&lt;p&gt;
It also turns out that the modelling for device throttling is
overly simplistic.  Whether a packet can currently be accepted
by the device depends upon a lot of things, some of which are
determined by aspects of the specific packet that is next to
be sent.
&lt;p&gt;
So how are things locked now and why?
&lt;br&gt;
&lt;center&gt;
&lt;img src=&quot;http://vger.kernel.org/~davem/yee_haw.jpg&quot;&gt;
&lt;br&gt;
Network device driver transmit locking?  Yee haw!
&lt;/center&gt;
&lt;p&gt;
Go far back in Linux history and we come to a time when SMP support
was it's infancy or simply non-existent.  The transmit path into the
driver was guarded by a simple bit lock on a word stored in the
struct net_device.  It was named &quot;tbusy&quot;.
&lt;p&gt;
You can identify old time experienced Linux networking hackers by
their being able to explain to you what &quot;tbusy&quot; is and how it worked.
&lt;p&gt;
The initial SMP support in Linux was all about essentially holding
a global lock across all kernel execution.  The basic invariant
preserved by this was lack of re-entrancy.
&lt;p&gt;
At this time drivers did things like disable interrupts to prevent
concurrent access to the transmit datastructures.  Very simple and
very stupid.  And we had to find some way to preserve the semantics
these simple devices wanted while initially SMP threading the
entire network stack.
&lt;p&gt;
Add into this the complexity of our packet scheduler.  We have
all of these sophisticated semantics you can tie to the packet
flow going out a particular device.  These sophisticated algorithms
have a lot of shared data structures which guide the packet
sending decisions.  So these need protection too.
&lt;p&gt;
So we have two state machines on the path a packet takes out of the
system.  It's very nearly a two coupled loop system, you know the
kind Van Jacobson talked about in his LCA06 presentation.  The
one's that tend to be more unstable than a single coupled loop.
&lt;p&gt;
On the one hand we have the device which can queue up a fixed number
of packets at a time.  And on the other we have the packet scheduler
which wants to know when the device receives the packet, and when
the device's queue fills up, so that it can make rate calculations
and decide to (in some cases) even drop packets when the backlog
gets beyond a certain point.
&lt;p&gt;
The solution to the re-entrancy problem for the simplistic drivers
was to wrap all calls to -&gt;hard_start_xmit() with a
spinlock stored in the generic network device structure.  It
guaranteed that there would only be one thread of control executing
in the driver's transmit function for a particular device instance.
&lt;p&gt;
The packet scheduler state is managed with it's own locks.  And we
have this complex interaction between the two sets of activity
because one cannot make progress without knowing the state of the
other.
&lt;p&gt;
So now, some 8 or 9 years later, we have on the order of 400
networking drivers.  This means if changes to how this works
will occur, we better think long and hard about it, because
we'll have to make any such change 400 times and make sure we
get it right.  The main goal is to get it so that the driver
can control the locking (only it knows how to handle multiple
queues, etc.) and also manage the queue backlog (which also
might need to be multiqueue).
&lt;p&gt;
Currently I'm the one with this task of figuring out how to make
this work even with all of this legacy stuff sitting around.
&lt;p&gt;
As a first step I drew up some patches for the NIU and TG3 drivers,
which move the TX backlog handling into the drivers themselves.
This is sort of a baby step towards what needs to actually happen.
In the end we somehow have to integrate the qdisc dequeueing
logic into these places.
&lt;p&gt;
&lt;center&gt;
&lt;img src=&quot;http://vger.kernel.org/~davem/get_outta_here.jpg&quot;&gt;
&lt;br&gt;
Outta here you swine!
&lt;/center&gt;
&lt;p&gt;
Next, for every driver that does not request lockless transmit
handling, I added a per-driver spinlock that protects the entirety
of the -&gt;hard_start_xmit() handler.  Things get real
exciting here because there are code paths that believe that
holding the TX lock means there is nothing executing inside of
this routine, and that if you do something like:
&lt;pre&gt;
	netif_tx_lock(dev);
	netif_stop_queue(dev);
	netif_tx_unlock(dev);
&lt;/pre&gt;
then you can be sure afterwards that nothing will call into
the -&gt;hard_start_xmit() handler.
&lt;p&gt;
Preserving that semantic is non-trivial now that the lock is
grabbed inside of the handler.  I'm currently supposing that
a near equivalent can be obtained using something like:
&lt;pre&gt;
	spin_lock(&amp;tp-&gt;tx_lock);
	netif_stop_queue(tp-&gt;dev);
	spin_unlock(&amp;tp-&gt;tx_lock);
	synchronize_net();
&lt;/pre&gt;
&lt;p&gt;
But this brings us to another fun topic.  Currently the maintainence
of the TX queue throttling is made inside of this generic TX lock.
This is part of why I decided that the TX backlog had to be taken
care of inside of the driver.  The only way to protect the queue
state properly, turning it off and re-enabling it later when TX
space becomes free again, is now going to be with the driver private
TX lock.  So now the backlog is also the driver's business and
problem.  The state maintainence follows the locking responsibility.
&lt;p&gt;
The biggest headache is how to tie this into the packet scheduler
state management.
&lt;p&gt;
I want to see this hit 2.6.27 if possible.  Very optimistic, and
my espresso machine will be working overtime.</description>
  </item>
  <item>
    <title>Grub on Sparc...</title>
    <pubDate>Tue, 20 May 2008 02:22:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2008/05/20#grub_boot_block</link>
    <description>&lt;p&gt;
Just for fun I started trying to get Grub working on Sparc, there is
some code there for ieee1275 firmware stuff but mostly there hasn't
been much work done.  Current CVS of grub2 doesn't even build on sparc.
&lt;p&gt;
A long time ago I contacted the GRUB author, and wanted to contribute
to making sparc work, but he told me at the time something crazy.  He
said that since I worked on our existing loader SILO, I was
&quot;contaminated&quot; and therefore couldn't write code for GRUB.  I was
pretty stunned, as SILO is GPL'd :-)  But, whatever.
&lt;p&gt;
I figured in 5 years they'd have something working by now, but that is
not the case.  So I'm going to make this work and if the GRUB folks
decide to still be unreasonable and not take my code we'll just have
a fork in every distribution I guess.  They can happily be
without working sparc support for another 5 years, in that case.
&lt;p&gt;
The first task is to get the first stage boot block going.  There are
two choices on how to do this.  When you type &quot;boot&quot; for a block
device, the firmware loads 7.5K starting at the second 512-byte block.
If this block device is the third partition (the &quot;all disk&quot; partition
in the Sun disk label) this bootblock starts after the disk label.
&lt;p&gt;
Under Linux we really can only use 512 bytes of that boot block area,
because it's possible for the filesystem superblock to show up as
early as the very next 512 byte block.
&lt;p&gt;
The firmware lets you put in the block some sparc executable image
(tried by the firmware as 64-bit ELF, then 32-bit ELF, and finally as
A.OUT) or tokenized forth.  Since 1) we only have 512 bytes and 2)
there are no fully GPL'd forth tokenizer implementations (the openbios
folks use something that is BSD and MIT licensed) we'll need to go the
sparc image route.
&lt;p&gt;
The task of this 512 byte sequence of code is to load the next stage
of the bootloader.  For GRUB I've choosen a multi-tiered scheme
similar to how the x86 stuff works.  The first stage bootloader loads
a single block from the disk and jumps to it.  This block we load is
actually a header block of the main boot loader binary, a second stage
loader, which loads the rest of the image.
&lt;p&gt;
This first stage loader therefore needs a few parameters.  It needs to
know the OF device path where the second stage header block resides.
It needs to know the block to read, and finally it needs to know where
to load that block and thus where to jump to it for execution.
&lt;p&gt;
We'd also like to print some status messages to the screen while this
happens and have at least some minimal error handling.  Not a small
feat in 512 bytes.
&lt;p&gt;
We put everything in the text section, and the first thing
we do is jump over our embedded data bits:
&lt;pre&gt;
	.text
	.align	4
	.globl	_start
_start:
	/* OF CIF entry point arrives in %o4 */
pic_base:
	call	boot_continue
	 mov	%o4, CIF_REG
&lt;/pre&gt;
The &quot;CIF&quot; is the client interface to openfirmware.  Calls are made by
initializing an array of cells (64-bits on sparc64) in memory which
describe the call to be made, the input arguments, and the return
values (if any).  This value provided in %o4 is the OF entry point we
jump to when making calls.  The only register argument goes in %o0 and
is the base of the aforementioned array of cells.
&lt;p&gt;
The offsets into the bits coming up are defined in a GRUB boot.h
header file so that tools can patch in values during bootblock
installation.
&lt;pre&gt;
	. = _start + GRUB_BOOT_MACHINE_VER_MAJ
boot_version:		.byte	GRUB_BOOT_VERSION_MAJOR, GRUB_BOOT_VERSION_MINOR
boot_path:
	. = _start + GRUB_BOOT_MACHINE_KERNEL_ADDRESS
kernel_sector:		.xword 2
kernel_address:		.word  GRUB_BOOT_MACHINE_KERNEL_ADDR
&lt;/pre&gt;
The boot_version is just a version blob that various tools and
sub-bootloaders could validate for compatibility if they wanted to.
It is unused currently.
&lt;p&gt;
The boot_path will be filled in by the boot block installation tools
with the boot device OF path.  kernel_sector and kernel_address tell
where to load the image from the device into memory.  Next, we have
string constants we'll need to make OF calls and put messages onto the
console:
&lt;pre&gt;
prom_finddev_name:	.asciz &quot;finddevice&quot;
prom_chosen_path:	.asciz &quot;/chosen&quot;
prom_getprop_name:	.asciz &quot;getprop&quot;
prom_stdout_name:	.asciz &quot;stdout&quot;
prom_write_name:	.asciz &quot;write&quot;
prom_bootpath_name:	.asciz &quot;bootpath&quot;
prom_open_name:		.asciz &quot;open&quot;
prom_seek_name:		.asciz &quot;seek&quot;
prom_read_name:		.asciz &quot;read&quot;
prom_exit_name:		.asciz &quot;exit&quot;
grub_name:		.asciz &quot;GRUB &quot;
&lt;/pre&gt;
To simplify things we'll write all of our code as position
independent.  There are other macros in the boot.h header
which describe these register name macros such as CIF_REG,
PIC_REG, etc.  most are local registers which are not
volatils across OF calls, so we can keep stable values in
them.  It also defines macros to compute absolute addresses
of a symbol using this PIC_REG, into a register.
&lt;p&gt;
The next chunk of code are helpers for doing OF calls with
various sets of input and output arguments.
&lt;pre&gt;
prom_open_error:
	GET_ABS(prom_open_name, %o2)
	call	console_write
	 mov	4, %o3
	/* fallthru */

prom_error:
	GET_ABS(prom_exit_name, %o0)
	/* fallthru */

	/* %o0: OF call name
	 * %o1: input arg 1
	 */
prom_call_1_1:
	mov	1, %g1
	ba	prom_call
	 mov	1, %o5

	/* %o2: message string
	 * %o3: message length
	 */
console_write:
	GET_ABS(prom_write_name, %o0)
	mov	STDOUT_NODE_REG, %o1
	/* fallthru */

	/* %o0: OF call name
	 * %o1: input arg 1
	 * %o2: input arg 2
	 * %o3: input arg 3
	 */
prom_call_3_1:
	mov	3, %g1
	mov	1, %o5
	/* fallthru */
	
	/* %o0: OF call name
	 * %g1: num inputs
	 * %o5: num outputs
	 * %o1-%o4: inputs
	 */
prom_call:
	stx	%o0, [%l1 + 0x00]
	stx	%g1, [%l1 + 0x08]
	stx	%o5, [%l1 + 0x10]
	stx	%o1, [%l1 + 0x18]
	stx	%o2, [%l1 + 0x20]
	stx	%o3, [%l1 + 0x28]
	stx	%o4, [%l1 + 0x30]
	jmpl	CIF_REG, %g0
	 mov	%l1, %o0
&lt;/pre&gt;
All of those routines are &quot;call&quot; invoked, and we do the CIF OF call
using &quot;jmpl&quot; with no return register write in order to make the CIF OF
call return to the stub caller, ie. a tail call.  This way we don't
need to allocate a register window here or anything complicated like
that.
&lt;p&gt;
Now for the code we jumped to at the _start header.  Our
first task is to save away the PIC_REG (%o7 was set by
the initial &quot;call&quot; instruction, and will be equal to _start,
we use it to reference our data blobs PC relative).  Also
we setup register %l1 which holds the base of the OF cell
argument data block used above.  SCRATCH_PAD is defined to
0x10000, which is guarenteed to be mapped by OF and outside
of where we will be executing (which is at 0x4000).
&lt;pre&gt;
boot_continue:
	mov	%o7, PIC_REG		/* PIC base */
	sethi	%hi(SCRATCH_PAD), %l1	/* OF argument slots */
&lt;/pre&gt;
We need the console stdout handle to write console messages,
this is done by finding the &quot;/chosen&quot; directory in the
OF device tree, and fetching from there the &quot;stdout&quot; property
which is a 32-bit handle.
&lt;pre&gt;
	GET_ABS(prom_finddev_name, %o0)
	GET_ABS(prom_chosen_path, %o1)
	call	prom_call_1_1
	 clr	%o2
	ldx	[%l1 + 0x20], CHOSEN_NODE_REG
	brz	CHOSEN_NODE_REG, prom_error
	 GET_ABS(prom_getprop_name, %o0)
	mov	4, %g1
	mov	1, %o5
	mov	CHOSEN_NODE_REG, %o1
	GET_ABS(prom_stdout_name, %o2)
	add	%l1, 256, %o3
	mov	1024, %o4
	call	prom_call
	 stx	%g1, [%l1 + 256]
	lduw	[%l1 + 256], STDOUT_NODE_REG
	brz,pn	STDOUT_NODE_REG, prom_error
&lt;/pre&gt;
Since we have very little space, the error handling has to be small.
The &quot;clr %o2&quot; in the prom_call_1_1 invocation causes the return value
cell slot (at %l1 + 0x20) to be cleared by the prom_call code.  Zero
is not a prom handle we'll get back, so if the slot stays zero we took
an error.  In the getprop call to get the 'stdout' handle, we rely
upon OF providing us with zero'd memory outside of the boot block
image.
&lt;p&gt;
Now that we have the console output cookie, we can print out
a message:
&lt;pre&gt;
	 GET_ABS(grub_name, %o2)
	call	console_write
	 mov	5, %o3
&lt;/pre&gt;
Next, we open up the boot device:
&lt;pre&gt;
	GET_ABS(prom_open_name, %o0)
	GET_ABS(boot_path, %o1)
	call	prom_call_1_1
	 clr	%o2

	ldx	[%l1 + 0x20], BOOTDEV_REG
	brz,pn	BOOTDEV_REG, prom_open_error
&lt;/pre&gt;
We now seek to the appropriate block:
&lt;pre&gt;
	 GET_ABS(prom_seek_name, %o0)
	mov	BOOTDEV_REG, %o1
	clr	%o2
	LDX_ABS(kernel_sector, 0x00, %o3)
	call	prom_call_3_1
	 sllx	%o3, 9, %o3
&lt;/pre&gt;
and finally read the block into memory:
&lt;pre&gt;
	GET_ABS(prom_read_name, %o0)
	mov	BOOTDEV_REG, %o1
	LDUW_ABS(kernel_address, 0x00, %o2)
	call	prom_call_3_1
	 mov	512, %o3
&lt;/pre&gt;
This image will be an A.OUT image as well, so we
jump into it right past the A.OUT header:
&lt;pre&gt;
	LDUW_ABS(kernel_address, 0x00, %o2)
	jmpl	%o2 + GRUB_BOOT_AOUT_HEADER_SIZE, %o7
	 nop
1:	ba,a	1b
&lt;/pre&gt;
Just for fun we put an endless loop after the jump in case it does
return for some reason.  We could, alternatively, call prom_error
instead.  Now skip to 4 bytes right before the end of the 512-byte
block and write a signature cookie.
&lt;pre&gt;
	. = _start + GRUB_BOOT_MACHINE_CODE_END

/* the last 4 bytes in the sector 0 contain the signature */
	.word	GRUB_BOOT_MACHINE_SIGNATURE
&lt;/pre&gt;
And that's the whole boot block implementation.  This file
is compiled normally into first an ELF image using GCC.
Then we strip out all the symbols and other junk, and finally
use objcopy to output an A.OUT object.  It is exactly 512
bytes in size and the bootblock installer verifies this.
&lt;p&gt;
The second stage header block code now can take advantage of all of
the values the first stage has computed, such as the stdout handle,
the handle for the boot device, etc.  In the second stage image,
it begins with assembler just like the first stage does above.
But, it implements a block + length tuple list which grows down from
the end of the block.  This tells us where to read in the rest of
the GRUB kernel from the actual file on the disk.
&lt;p&gt;
The GRUB installer program links in with modules that understand
how various filesystems work.  It has a block traverser that can
be run on arbitrary files.  This is the mechanism used to build
the block list which is patched into this second stage loader
block list.
&lt;p&gt;
I've tested both of these using a Sun LDOM guest, which is the fastest
way to test low-level stuff like this since resetting (which you need
to do every test boot attempt) is nearly instantaneous.
&lt;p&gt;
The current task is to flesh out the installer programs and make sure
the rest of the GRUB ieee1275 code is going to work properly on sparc.
I already know some bits that will need some tweaking.  For example,
partitions on OF paths are indicated (aparently) by adding &quot;:N&quot; where
N is a partition number.  On sparc this &quot;N&quot; is instead a letter.
&lt;p&gt;
More to come...</description>
  </item>
  <item>
    <title>gdb bugs...</title>
    <pubDate>Mon, 12 May 2008 08:11:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2008/05/12#gdb_bugs</link>
    <description>&lt;p&gt;
So I spent the weekend fishing out some gdb bugs on sparc.  Every time
I think I understand and know how all of this stuff works I get thrown
some new surprises.  This time was no exception.
&lt;p&gt;
The kernel has all of these neat features, via ptrace()'s PTRACE_SETOPTIONS,
that allows a debugger to get notifications when a process forks,
vforks, does an exec, etc.
&lt;p&gt;
When these events occur in the inferior process, it does a ptrace_notify()
which sends a SIGTRAP to the process with the event encoded into the
siginfo exit code.
&lt;p&gt;
As a result, when the process tries to return back into userspace, it'll
do signal processing.  As part of that, it will invoke ptrace_stop()
which sleeps the task and wakes up the debugger parent so it can examine
the event and process state.
&lt;p&gt;
The debugger has several choices of what to do.  For many cases it
will do a ptrace() PTRACE_CONT with a code indicating how the process
should continue.  Another thing the debugger can do is decide that it's
no longer interested in tracing this task, and therefore it does
a ptrace() PTRACE_DETACH.  This is part of the first bug.
&lt;p&gt;
A long time ago we picked the values for PTRACE_this and PTRACE_that
on sparc.  Some of them mirrored the SunOS values.  One of those was
PTRACE_DETACH.  We unconditionally recognized the SunOS PTRACE_DETACH
value, even for Linux processes.  Unfortunately, this is the value that
also ended up in the sys/ptrace.h Linux header file.  So that's the
value every Linux application ended up using too.
&lt;p&gt;
I yanked out the SunOS PTRACE_foo call support long ago, and it's amazing
how much works without a properly functioning PTRACE_DETACH.  Putting in
a compat translation from the SunOS to the intended Linux value for
PTRACE_DETACH in the Sparc ptrace code solved this bug.
&lt;p&gt;
Which brings us to the next bug...
&lt;p&gt;
What brought me down this path in the first place was examining why
running gdb under itself didn't work properly.  This kind of game is
always fun:
&lt;pre&gt;
bash$ ./gdb ./gdb
(top-gdb) set args test
(top-gdb) run
(gdb) run
Hello World!
...
&lt;/pre&gt;
This would hang when the inner-gdb tries to run the test program, and I
had to figure out why.  After lots of tracing I found that the inner-gdb
was hanging in a sigpause() call.
&lt;p&gt;
GDB uses sigpause() to wait for SIGCHLD events when it is simply waiting
for running inferiors to ptrace_stop() or take some other kind of signal.
&lt;p&gt;
In comes the issue of system call restart.  Handling this wrt. debuggers
is not easy.  One nice feature of debuggers like GDB is that you can ask
them at any point in time to call some function in the program they
are running.
&lt;pre&gt;
(gdb) p printf(&quot;hi\n&quot;)
hi
$1 = 2
(gdb)
&lt;/pre&gt;
and after calling the function it completely restores the process state to
what it was before the call.  How does it implement this?
&lt;p&gt;
First it saves the process state, mostly this comprises the registers.
Next it allocates some stack space for the call, pushes the arguments
for the function call onto the stack and in registers.  Next, it
makes the function return address point to a breakpoint it can uniquely
recognize.  Finally, it sets the program counter to point at the function
to call.
&lt;p&gt;
When the function returns it hits the breakpoint, and the debugger restores
all of the saved state it stored away before the special call.
&lt;p&gt;
Now, back to system call restart.  When a signal interrupts a system call,
it can return immediately to process the signal.  Internally to the kernel
system calls return special error codes to let the signal dispatch know what
to do with the system call that was in progress.  It can say that -EINTR
should be returned and the system call completed.  But it can also say that
the program counter should be rewound to the system call trap, the argument
registers re-setup with their original values, and the system call thus
replayed.
&lt;p&gt;
In the example above, imagine what happens if the debugger calls an inferior
function at the time that the signal is dispatched.  Somehow, in order to
restore the state properly after the call, this &quot;we're in a syscall and should
do syscall restarting&quot; state has to get saved and restored too.
&lt;p&gt;
Long ago I had this clever idea wherein I tried to solve this problem entirely
inside of the kernel to shield debuggers from having to deal with it.  The
idea was that we'd modify the process state and perform the system call
restart operations before we stopped the program to let the debugger see
the state.
&lt;p&gt;
Although great in theory, in practice it's an unworkable solution.  We
don't know what the debugger is going to do with the process.  As I
stated earlier it can pass along the signal to the process, or it can
cancel the signal delivery altogether.  This decision influences
whether we should do system call restart or not, but we already
pre-commited that state and let the debugger see it already.  We can't
know what the debugger is going to do ahead of time, therefore it is impossible
to do the right thing.
&lt;p&gt;
This is what was causing the inner debugger to hang.  The inner gdb is receiving
a SIGCHLD because the 'test' program it is debugging has hit ptrace_stop().
The top-level gdb looks at this and says &quot;ok, let's just let the inner gdb
see the signal, PTRACE_CONT.&quot;  But my funny in-kernel ptrace syscall restart
code already setup for a syscall restart of sigpause(), but what should have
happened was a return of -EINTR.
&lt;p&gt;
The inner gdb is now wedged forever, it missed the SIGCHLD and the debugged
process is sleeping waiting to be woke up by the inner gdb.
&lt;p&gt;
So I ripped out all of this silly code, and ended up doing what
powerpc, x86, and other platforms do.  I added a piece of software
binary state (an unused bit in one of the processor state registers),
that gdb can control.  It is the &quot;we're in a system call&quot; state.  When
the debugger does an inferior call, it clears this bit when changing
the program counter register.  This forces the kernel to not do system
call restart processing when the task wakes out of ptrace_stop().
&lt;p&gt;
But later, when gdb restores all of the register state after the call,
that special bit will be restored, and we'll do the right thing as we
deliver the original signal and subsequently do syscall restart processing
as needed.
&lt;p&gt;
It's always fun to find land mines like these ones.</description>
  </item>
  <item>
    <title>Effective GIT bisecting...</title>
    <pubDate>Wed, 30 Apr 2008 06:08:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2008/04/30#bisect_effective</link>
    <description>&lt;p&gt;
I've had to do a lot of this lately, and the most efficient
attempts take on a certain pattern.
&lt;p&gt;
So you have a bug, and you can readily reproduce it.  Also,
the bug appeared in the last pull you made from Linus's
tree.  Perfect.
&lt;p&gt;
At this point you know that 'master' has the bug and that
'ORIG_HEAD' lacks the bug.  You could just blindly bisect
the whole thing, but you can save yourself some time (and
also learn a bit about the nature of the bug) by using some
clues and some quick tests to narrow things down a lot
from the beginning.
&lt;p&gt;
This determination can be easy.  Your goal is to first
find a spot which you think works.  You'd like it to be
something a bit further than ORIG_HEAD, as you're trying
to narrow things down.
&lt;p&gt;
The easy case is some driver breaks or similar, or you
see some error message and it's clear what subsystem that
came from.  Take that information and use it to scan
over the changesets you got from your pull:
&lt;pre&gt;
	gitk ORIG_HEAD..
&lt;/pre&gt;
Note that when you select a changeset in gitk, the SHA ID
of that commit becomes the current X selection.  You can
use that to do things more quickly below.
&lt;p&gt;
So you're found a sequence of commits that look suspicious.
Pick the changeset before the first commit in the suspicious
set, and check out a test tree with it as the tip into
a test branch:
&lt;pre&gt;
	git checkout -b test $(SHA_ID)
&lt;/pre&gt;
Build that kernel, and make sure the bug doesn't happen.
Let's assume that this kernel passes your test.  You have
a few options on how to proceed.
&lt;p&gt;
The easiest thing to do is to just bisect using the information
you now have:
&lt;pre&gt;
	git bisect start
	git bisect bad master
	git bisect good test
&lt;/pre&gt;
and so on.  Build, test boot, and if it shows the bug:
&lt;pre&gt;
	git bisect bad
&lt;/pre&gt;
else if it succeeds:
&lt;pre&gt;
	git bisect good
&lt;/pre&gt;
and repeat the process until GIT shows you the guilty
commit.
&lt;p&gt;
The other option is to try and figure out an approximate
more optimal end point for bisection.  Take the set
of &quot;suspicious&quot; changeset your determined above, and
take the one after the last and go:
&lt;pre&gt;
	git checkout -b test2 $(SHA_ID)
&lt;/pre&gt;
If this shows the bug, you're in business:
&lt;pre&gt;
	git bisect start
	git bisect bad test2
	git bisect good test
&lt;/pre&gt;
and continue as detailed above.
&lt;p&gt;
When you're done with all of this:
&lt;pre&gt;
	git bisect reset
&lt;/pre&gt;
and report your results to the mailing list.</description>
  </item>
  <item>
    <title>Vger recovering...</title>
    <pubDate>Wed, 19 Mar 2008 15:28:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2008/03/19#vger_recovering</link>
    <description>&lt;p&gt;
VGER suffered a major disk failure and it was not easy to bring
the machine back up cleanly.
&lt;p&gt;
Thanks to the incredible efforts of Matthew Galgoci, Matti Aarnio, and
others, it is back up now and slowly but surely the mail queue is
running what should be quite an impressive backlog :-)</description>
  </item>
  <item>
    <title>Back from Japan...</title>
    <pubDate>Wed, 19 Mar 2008 15:23:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2008/03/19#back_from_japan08</link>
    <description>&lt;p&gt;
I just returned from a wonderful 10 day trip to Japan, during which
I gave
&lt;a href=&quot;http://vger.kernel.org/~davem/davem_tokyo08.pdf&quot;&gt;
a presentation
&lt;/a&gt;
on multi-queue networking for the
&lt;a href=&quot;http://www.linux-foundation.jp/modules/eguide/event.php?eid=9&quot;&gt;
Linux Foundation Japan Symposium.
&lt;/a&gt;
&lt;p&gt;
Most of my time was spent in Tokyo, but I was able to make 3
exciting trips outside the city.
&lt;p&gt;
On Sunday March 9th, Kazunori Miyazawa gave us a tour of
the
&lt;a href=&quot;http://en.wikipedia.org/wiki/Kamakura%2C_Kanagawa&quot;&gt;
Kamakura
&lt;/a&gt;
region and
&lt;a href=&quot;http://en.wikipedia.org/wiki/Enoshima&quot;&gt;
Enoshima.
&lt;/a&gt;
&lt;p&gt;
On Friday March 14th we took the
&lt;a href=&quot;http://en.wikipedia.org/wiki/Hikari_%28Shinkansen%29&quot;&gt;
hikari shinkansen
&lt;a&gt; to Osaka, during which I was able to experience &quot;kuidaore&quot;
(literally &quot;to ruin oneself by extravagance in food&quot;) in the
&lt;a href=&quot;http://en.wikipedia.org/wiki/D%C5%8Dtonbori&quot;&gt;
Dotonbori
&lt;/a&gt;
district of the city.
&lt;p&gt;
That evening we travelled to Kyoto and slept at a
&lt;a href=&quot;http://en.wikipedia.org/wiki/Ryokan_%28Japanese_inn%29&quot;&gt;
ryokan
&lt;/a&gt;
at which we had a traditional Japanese breakfast the next morning.
&lt;p&gt;
Finally we spent that Saturday exploring
&lt;a href=&quot;http://en.wikipedia.org/wiki/Kyoto&quot;&gt;
Kyoto,
&lt;/a&gt;
mostly the eastern section.
&lt;p&gt;
Finally we hopped back onto the shinkansen in the evening to get
back to Tokyo.
&lt;p&gt;
On Sunday I was able to meet up with other members of the USAGI
Project for dinner, several of whom were unfortunately travelling and
thus outside Japan during most of my stay.
&lt;p&gt;
All in all it was a great trip, and I already have a list of places
and things I want to explore next time!</description>
  </item>
  <item>
    <title>TCP retransmit queue overhead...</title>
    <pubDate>Mon, 31 Dec 2007 17:53:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2007/12/31#tcp_overhead</link>
    <description>&lt;p&gt;
There is this one cpu cycle killer we've been trying to
find ways to overcome, and the more I think about it
the situation wouldn't even exist if the SACK standard
had been given just a smidgen more thought.
&lt;p&gt;
SACK has been specified as a lossy indicator of out of
order packet reception.  This is the core problem.  It's
a hint, and you cannot actually &quot;act&quot; upon the information.
&lt;p&gt;
So if you get some SACK blocks you can't free those
marked packets from your retransmit queue.  SACK is
specified such that the receiver can free up those
packets and stop advertising them in the SACK blocks.
&lt;p&gt;
This is profoundly stupid, just look at the implications.
&lt;p&gt;
What this means is that during a loss recovery event,
we have to hold onto a whole extra window of data in the
retransmit queue for no reason at all.  Even more brilliant
is that when the hole is filled and we get this HUGE ACK
back acknowledging two windows worth of data we have to
purge all of those packets from the retransmit queue in
one go.
&lt;p&gt;
For large windows, we're talking 4,000 packets or more.
I don't care what datastructure you use to manage your
retransmit queue, that's going to be expensive and show
up in profiles.
&lt;p&gt;
If, instead, the SACK specification authors spent just a
little bit more time thinking about the implications of
SACK being a &quot;soft&quot; indicator I think they would have
changed their minds.  If SACK were a &quot;hard&quot; indication and
we could thus free up packets so marked there, the processing
would be spread out throughout the recovery event instead of
being batched up into one huge wallop of packet freeing.
&lt;p&gt;
Supposedly the reason for marking SACK a &quot;soft&quot; indication is so that
&quot;low memory&quot; embedded systems could free up the data.  But this is
incredibly stupid.  If you can handle one window of data, you can also
handle holding onto two full windows of data as well.  We aren't
talking about PDP-11's with 16K of RAM or anything like that.
&lt;p&gt;
Furthermore this is inconsistent because the sender has to hold all of
the data in such loss recovery events, why treat the receiver
specially?  It makes absolutely no sense, because if such small
embedded systems ever have to send data, they have to commit to the
same amount of RAM resourses in such situations.
&lt;p&gt;
In fact, a &quot;hard SACK&quot; would help those small devices when they act as
senders, because during a loss event they could liberate
packets indicated in the SACK blocks they get back.
&lt;p&gt;
The SACK standard folks made a horrible tradeoff.  They created a
significant future performance hindrance for %99.9999 of systems in
order to cater to some theoretical memory usage issue on %0.00000001
of machines in the world.
&lt;p&gt;
Every TCP implementation out theere is going to have to come up with a
workaround for this issue especially as huge windows become more
prevalent.  And all of that engineering effort would be totally
unnecessary (read as: we could work on much more important stuff)
if SACK had been specified as a hard indicator.
&lt;p&gt;
I have the urge to propose some TCP option that allows two
TCP implementations to negotiate &quot;hard&quot; SACK blocks but that
won't work.  We'd still support &quot;soft&quot; SACKs for eternity and
furthermore that's what anyone wishing to exploit this high
cpu cost code path in the TCP stack would use.
&lt;p&gt;
To be honest, nobody drops data in the way that SACK allows.  It'd
probably be pragmatic and reasonable just to start enforcing the
&quot;hard&quot; semantics.  And if they are violated (packets disappear from
SACK blocks) we just RST the connection.  We could add a ton of
logging if such a thing happens, and we could investigate such
cases.
&lt;p&gt;
It's quite a mess and somebody has to mop it up.</description>
  </item>
  <item>
    <title>Holidays...</title>
    <pubDate>Mon, 31 Dec 2007 07:40:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2007/12/31#holidays2007</link>
    <description>&lt;p&gt;
I have to say that Christmas holidays are the best time to hack on
things.  Most of the yahoos are away on vacation and therefore the
constant stream of distractions and really dumb emails in your inbox
just aren't there.  In short you can work on the things you always
want to work on but never can because of time constraints.
&lt;p&gt;
There really is an information overload problem.
&lt;p&gt;
Today I want to talk about multiple return values.  In short,
don't do it :-)  It really is a sign that things need to
be redesigned, and on top of it multiple return values result
in quite inefficient code.
&lt;p&gt;
The most common case with C is when you need to allocate some
memory and return it to caller, but if something goes wrong
you want to give some error status too.
&lt;p&gt;
So you end up with absolute crap like this:
&lt;pre&gt;
int create_foo(int flags, struct foo **foop)
&lt;/pre&gt;
or, even worse:
&lt;pre&gt;
struct foo *create_foo(int flags, int *errp)
&lt;/pre&gt;
I mean, that just doesn't deserve to live.  First of all, the
compiler has to allocate stack space for that &quot;by reference&quot;
turd you had to add to the arguments to pass that second piece
of information back.  And that's slow, even on register
starved cpus like x86 where stack accesses have been heavily
optimized inside of the cpu to make up for that.
&lt;p&gt;
Secondly, the semantics are not entirely clear.  If an error happens
for the second API above, will the function return value always be
NULL?  In the first API above, when an error happens will the
&quot;*foop&quot; always be NULL'd out for me or is the caller expected to
do that?  Likewise, for the second API above, if there is no error
and non-NULL is returned, can I depend upon &quot;*errp&quot; being set to
zero?
&lt;p&gt;
You don't know, because when you look at that interface definition
you simply cannot tell.  It's one big ambiguous ugly interface.
&lt;p&gt;
For this particular case in the kernel we've settled on a set of
macros that allow a pointer and an error to be returned in a single
function return value.  Basically, it takes advantage of the fact
that the range of negative error codes cannot ever be legitimate
pointers.  So the interface above looks like:
&lt;pre&gt;
struct foo *create_foo(int flags)
&lt;/pre&gt;
and the caller first checks:
&lt;pre&gt;
	p = create_foo(flags);
	if (IS_ERR(p)) {
		err = PTR_ERR(p);
		return err;
	}
&lt;/pre&gt;
And if &quot;IS_ERR&quot; is not true, it is a non-NULL pointer and
no error happened.
&lt;p&gt;
It is completely unambiguous what the return values mean,
and how they will be presented to the caller.  Yes, it's
true that looking at the function definition you can't
&quot;see&quot; this.  But once people are exposed to this pattern
enough they pick it up, and it allows us to eliminate
ambiguity and unclear semantics for these kinds of cases.
&lt;p&gt;
Have a happy new year everyone.
</description>
  </item>
  <item>
    <title>Visit from Yoshifuji...</title>
    <pubDate>Tue, 11 Dec 2007 11:24:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2007/12/11#yoshifuji_visit</link>
    <description>&lt;p&gt;
Two weeks ago I left to spend some time in Korea, which I'll
write about later.  But more recently YOSHIFUJI Hideaki
paid a visit to Seattle.
&lt;p&gt;
We had only one day and an evening, so we hit all of the
&quot;essentials&quot;:
&lt;ol&gt;
&lt;li&gt;Sushi at Ototo, although our favorite chef was out, the
food was nevertheless still excellent.
&lt;li&gt;Pike Place Market
&lt;li&gt;Boeing Museum of Flight
&lt;li&gt;Fry's
&lt;/ol&gt;
&lt;br&gt;
&lt;center&gt;
&lt;img src=&quot;http://vger.kernel.org/~davem/sr71.jpg&quot;&gt;
&lt;br&gt;
SR71 Blackbird, Museum of Flight, Seattle
&lt;/center&gt;
&lt;p&gt;
After getting a few hours of evening sleep, we woke up
again to do hacking, solving puzzles, eating lots of
greasy unhealthy breakfast food, and even some WIFI
war driving on the way to SeaTac airport at 4am.
&lt;p&gt;
It was a fun but short visit, and we hope to see him again
some time.</description>
  </item>
  <item>
    <title>Haven't written in a while, net and mono hacking...</title>
    <pubDate>Wed, 14 Nov 2007 06:30:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2007/11/14#midnov2007</link>
    <description>&lt;p&gt;
It's hard to get inspired sometimes to blog, the work
itself is enjoyable and all consuming, so why stop to
write about it?  Oh that's right, the attention is oh
so nice...
&lt;p&gt;
Networking is moving right along, 2.6.24 is in bug fix
only phase and things don't look too bad.  For future
stuff I just merged a set of patches from Herbert Xu
that consolidates the input and output IPSEC paths
so we can turn things upside down, have resume points,
and thus support async crypto properly.
&lt;p&gt;
I've been trying to toss some cycles into hacking on
MONO's sparc support.  I'm trying to take my knowledge
from my active GCC sparc backend days and translate
that into MONO code generation improvements.  We'll
see how all of this goes.  Here is a fun one:
&lt;p&gt;
The most straightforware way to set a boolean in
a register (0 or 1) based upon some condition (say
some simple 'a' COND 'b' comparison) is:
&lt;pre&gt;
	cmp	%reg1, %reg2
	mov	0, %reg3
	bCOND,a	1f
	 mov	1, %reg3
1:
&lt;/pre&gt;
This has a few problems.  First it has a hard to predict branch,
potentially stalling the entire pipeline.  Second it uses an annulling
branch which can hurt performance even more.
&lt;p&gt;
Sparc has some instructions that take the carry bit of the condition
codes as an input to a normal 'add' or 'sub' instruction.  This can be
used to implement the above in 3 instruction or less branchless
sequences for the not-equal, less-than-unsigned, and
greater-than-unsigned cases using an old trick:
&lt;pre&gt;
	/* %reg3 = (%reg1 != %reg2) */
	xor	%reg1, %reg2, %reg3
	subcc	%g0, %reg3, %g0
	addx	%g0, 0, %reg3

	/* %reg3 = (%reg1 &amp;lt %reg2) */
	cmp	%reg1, %reg2
	addx	%g0, 0, %reg3

	/* %reg3 = (%reg1 &amp;gt %reg2) */
	cmp	%reg2, %reg1
	addx	%g0, 0, %reg3
&lt;/pre&gt;
GCC has been emitting those things for years.  And the history
behind the discovery of those code sequences is quite interesting.
Way back when, the GNU Superoptimizer was written.  Super
optimization is basically finding the truly optimal sequence
of code to compute some function like the ones above.
&lt;p&gt;
The GNU variant was written by Torbjorn Granlund and
it worked by exhaustive search.  It would basically load the
inputs into source registers than execute random sequences
of instructions over and over using that initial state until
the desired result appeared in the destination register.
&lt;p&gt;
Pretty cool stuff.
</description>
  </item>
  <item>
    <title>Sorry Rusty...</title>
    <pubDate>Tue, 06 Nov 2007 22:10:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2007/11/06#rustys_wrong</link>
    <description>&lt;p&gt;
Dude, you've
&lt;a href=&quot;http://ozlabs.org/~rusty/index.cgi/tech/2007-11-06.html&quot;&gt;got
it all wrong.&lt;/a&gt;
&lt;p&gt;
What makes the code in the kernel so great is &lt;i&gt;not&lt;/i&gt; that it
goes in perfect, it's that we whittle all code in the treee down over
and over again until it reaches it's perfect form.
&lt;p&gt;
It is this whittling system that allows us to thrust 25MB of changes
into a tree over a week and a half and it all works out.  Yes a lot of
shit and bugs get added, but we fix and clean them all up, and we do it
fast.
&lt;p&gt;
So it doesn't matter that the new scatterlist stuff has a crappy set
of interfaces, because people like Rusty will fix it up and make it
better.
</description>
  </item>
  </channel>
</rss>