<?xml version="1.0"?>
<!-- name="generator" content="blosxom/2.0" -->
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">

<rss version="0.91">
  <channel>
    <title>DaveM's Linux Networking BLOG</title>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi</link>
    <description>Mashimaro Fan Club</description>
    <language>en</language>
    <image>
      <url>http://vger.kernel.org/~davem/davem-48-70.png</url>
      <width>48</width>
      <height>70</height>
    </image>

  <item>
    <title>STT_GNU_IFUNC</title>
    <pubDate>Sun, 07 Feb 2010 15:46:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2010/02/07#stt_gnu_ifunc</link>
    <description>&lt;p&gt;
I've always wanted to work on support for STT_GNU_IFUNC symbols
on sparc.  This is going to solve a real problem distribution
makers have faced on sparc64 for quite some time.
&lt;p&gt;
&lt;center&gt;
&lt;img src=&quot;http://vger.kernel.org/~davem/nyc_soho_facades.jpg&quot;&gt;
&lt;/center&gt;
&lt;p&gt;
What is STT_GNU_IFUNC?
&lt;p&gt;
Well, normally a symbol is resolved by the dynamic linker based
upon information in the symbol table of the objects involved.
This is after taking into consideration things like symbol
visibility, where it is defined, etc.
&lt;p&gt;
The difference with STT_GNU_IFUNC is that the resolution of the
reference can be made based upon other criteria.  For example,
based upon the capabilities of the cpu we are executing on.
The most obvious place this would be very useful is in libc,
where you can pick the most optimized memcpy() implementation.
&lt;p&gt;
Normally the symbol table entry points to the actual symbol location,
but STT_GNU_IFUNC symbols point to the location of a &quot;resolver&quot;
function.  This function returns the symbol location that should
be used for references to this symbol.
&lt;p&gt;
So when the dynamic linker resolves a reference to an STT_GNU_IFUNC
type symbol &quot;foo&quot;, it calls the resolver function recorded in
the symbol table entry and uses the return value as the resolved
address.
&lt;p&gt;
Simple example:
&lt;pre&gt;
void * memcpy_ifunc (void) __asm__ (&quot;memcpy&quot;);
__asm__(&quot;.type memcpy, %gnu_indirect_function&quot;);

void *
memcpy_ifunc (void)
{
  switch (cpu_type)
    {
  case cpu_A:
    return memcpy_A;
  case cpu_B:
    return memcpy_B;
  default:
    return memcpy_default;
    }
}
&lt;/pre&gt;
So, references to 'memcpy' will be resolved as determined by
the logic in memcpy_ifunc().
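&lt;p&gt;
Here is a rough sketch of that lookup as a toy Python model, just to
make the control flow concrete.  It is not how glibc implements it:
SYMTAB, resolve() and the memcpy_a/memcpy_default helpers are made up
for illustration, and the resolver takes the cpu type as an argument
instead of reading a global.
&lt;pre&gt;
```python
# Toy model of IFUNC resolution; not the real dynamic linker.
# Every name here (memcpy_a, SYMTAB, resolve, ...) is hypothetical.

def memcpy_a(dst, src):
    """Stand-in for a cpu_A optimized memcpy."""
    dst[:len(src)] = src
    return "A"

def memcpy_default(dst, src):
    """Stand-in for the generic memcpy."""
    dst[:len(src)] = src
    return "default"

def memcpy_ifunc(cpu_type):
    """Resolver: returns the implementation to use for this cpu."""
    table = {"cpu_A": memcpy_a}
    return table.get(cpu_type, memcpy_default)

# Symbol table entries are (type, value).  For an STT_GNU_IFUNC symbol
# the value is the resolver, not the final function.
SYMTAB = {
    "memcpy": ("STT_GNU_IFUNC", memcpy_ifunc),
    "strlen": ("STT_FUNC", len),
}

def resolve(name, cpu_type):
    sym_type, value = SYMTAB[name]
    if sym_type == "STT_GNU_IFUNC":
        return value(cpu_type)   # call the resolver, use what it returns
    return value                 # ordinary symbol: the value is the target
```
&lt;/pre&gt;
In this model resolve(&quot;memcpy&quot;, &quot;cpu_A&quot;) hands back memcpy_a, and any
cpu type the resolver does not know about gets memcpy_default.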
&lt;p&gt;
These magic ifunc things even work in static executables.  How
is that possible?
&lt;p&gt;
First, even though the final image is static, the linker arranges to
still create PLT entries and dynamic sections for the STT_GNU_IFUNC
relocations.
&lt;p&gt;
Next, the CRT files for static executables walk through the relocations
in the static binary and resolve the STT_GNU_IFUNC symbols.
&lt;p&gt;
There are some thorny issues with respect to function pointer equality.  To make
that work, static references to STT_GNU_IFUNC symbols use the PLT address
whereas non-static references do not (they get fully resolved).
&lt;p&gt;
Back to the reason I was so eager to implement this.  On sparc we have
four different sets of optimized memcpy/memset implementations in
glibc (UltraSPARC-I/II, UltraSPARC-III, Niagara-T1, Niagara-T2).
Right now the distributions thus have to build glibc four times each
for 32-bit and 64-bit (for a total of 8 times).
&lt;p&gt;
With STT_GNU_IFUNC they will only need to build it once for 32-bit
and once for 64-bit.
&lt;p&gt;
I've just recently posted patches for full support of STT_GNU_IFUNC
symbols to the
&lt;a href=&quot;http://sourceware.org/ml/binutils/2010-02/msg00095.html&quot;&gt;
binutils
&lt;/a&gt;
and
&lt;a href=&quot;http://sourceware.org/ml/libc-alpha/2010-02/msg00005.html&quot;&gt;
glibc
&lt;/a&gt;
lists.
</description>
  </item>
  <item>
    <title>Beaux-Arts and kernel hacking...</title>
    <pubDate>Thu, 21 May 2009 02:42:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2009/05/21#beaux_arts</link>
    <description>&lt;p&gt;
My recent hobbies have included an intense study of New York City
architecture, and in particular the fascinating stories behind
the city's two most prominent train stations:
&lt;a href=&quot;http://www.nyc-architecture.com/MID/MID031.htm&quot;&gt;
Grand Central Terminal
&lt;/a&gt;
and the arguably infamous
&lt;a href=&quot;http://www.nyc-architecture.com/GON/GON004.htm&quot;&gt;
Pennsylvania Station
&lt;/a&gt;.
&lt;p&gt;
&lt;center&gt;
&lt;img src=&quot;http://vger.kernel.org/~davem/pennstation_concourse_scaled.jpg&quot;&gt;
&lt;br&gt;
McKim, Mead, and White's masterpiece, between 7th and 8th Avenues from 31st to 33rd Street.
&lt;/center&gt;
&lt;p&gt;
In the second half of the 19th century and on towards the first
half of the 20th century, any American architect worth his salt
studied at the Ecole des Beaux-Arts in Paris.
&lt;p&gt;
If you had a degree from that school, you were at the top of the
pile for selection on all of the interesting commissions of the time.
The school presented the student with a challenging and
fast-paced curriculum.
&lt;p&gt;
For these American students attending in Paris, the first
challenge was just getting in.  The entrance exam (of course) required
at least some proficiency in French.  Several of the most notable
American architects had to retake this entrance exam 5 or more times
before being able to pass.
&lt;p&gt;
Once accepted, the student was pressed to solve problems.  12 hours
were given to draft up a solution to a real architectural problem.
Then once the draft was accepted, the student had 2 weeks to flesh out
all of the details and present the final design.  All the while the
student's progress was critiqued by an established French architect
who oversaw a group of students.
&lt;p&gt;
We really don't have that kind of training for computer science
people.  It's not even science I would say.  This kind of training
does exist for pure mathematics, especially in France.
&lt;p&gt;
Envision a school where you're asked to draft up the design of a
compiler pass in 12 hours, then for two weeks you implement it, and
meanwhile Alfred Aho critiques your work.  This kind of place
simply doesn't exist.  (Yes I know Alfred teaches at Columbia
currently, so maybe this specific place does exist :-) but I maintain
that more generally such institutions do not exist.)
&lt;p&gt;
Open source development and &quot;throwing the masses of monkeys at
the problem&quot; seems to be a logical consequence of this, does it
not?
&lt;p&gt;
A formally trained Beaux-Arts architect and a room with a few drafters
could design something as insanely complicated as a huge
transportation hub in the middle of New York City.  And it would work,
as there would be no room for failure.  To me it seems
that someone similarly trained could do a complete operating system,
compiler, or similar large software engineering task.
&lt;p&gt;
&lt;a href=&quot;http://en.wikipedia.org/wiki/McKim,_Mead_%26_White&quot;&gt;
McKim, Mead, and White
&lt;/a&gt;
were three architects and a couple of drafters,
and yet they were able to complete works such as the Boston Public
Library, Pennsylvania Station, and the James Farley Post Office,
to name just a few.
&lt;p&gt;
So which is better, strict formal training and mentorship or open
source monkeys?  You decide!</description>
  </item>
  <item>
    <title>A Sparc JIT for ioquake3</title>
    <pubDate>Thu, 05 Mar 2009 08:15:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2009/03/05#quake3_jit</link>
    <description>&lt;p&gt;
I recently got DRM working on sparc64, which of course meant I had
to test it :-)
&lt;p&gt;
I played around with ioquake3 and it worked just fine.
This was in a way exciting since I have sunk many
hours of my life into this game on x86.
&lt;p&gt;
One aspect of quake3 is that it has a virtual machine.  You
can write a MOD for quake3 and replace pretty much any
aspect of the game outside of the rendering engine.  But
to keep MOD authors from eating people's home directories,
sending your password out to some rogue collection system,
and things of that nature, the interfaces are tightly controlled
and the MOD code runs in a JIT'd VM.
&lt;p&gt;
The only way you can get into the JIT is to make a &quot;system call&quot;.
And the only way to get out is to either return or make such
a system call into another module.  The system call is the main
entrypoint into the module; it takes an integer command and zero
or more integer arguments.
&lt;p&gt;
All memory accesses done by the JIT'd code are masked so that it is
impossible for the JIT to touch memory outside of that allocated
explicitly for it by the VM.
&lt;p&gt;
Ben H. kiddingly said to me that I should write the Sparc JIT since
there is one for x86 and PowerPC already.  One should never kid about
such things...
&lt;p&gt;
It's pretty neat stuff, although the stack machine VM code output
by the LCC compiler they used is horribly inefficient.  Some
code for a function might look like:
&lt;pre&gt;
OPCODE[  OP_ENTER] IMM4[0x0000001c]	! ENTER function, 0x1c of stack
OPCODE[  OP_LOCAL] IMM4[0x00000024]	! PUSH stack offset 0x24 (first arg)
OPCODE[  OP_LOAD4] 			! LOAD from &quot;stack + 0x24&quot;
OPCODE[  OP_LEAVE] IMM4[0x0000001c]	! LEAVE function, return LOAD result
&lt;/pre&gt;
Operations push entries onto the &quot;register stack&quot;, and consume entries
on the top of that stack.  This might emit some sparc code like:
&lt;pre&gt;
	save	%sp, -64, %sp		! OP_ENTER
	sub	%g3, 0x1c, %g3
	add	%g3, 0x24, %l0		! OP_LOCAL
	and	%l0, %g5, %l0		! OP_LOAD4
	ld	[%g4 + %l0], %l0
	add	%g3, 0x1c, %g3		! OP_LEAVE
	ret
	restore	%l0, %g0, %o0
&lt;/pre&gt;
We use several fixed registers: &quot;%g3&quot; is the stack pointer,
&quot;%g5&quot; is the VM data segment offset mask, and &quot;%g4&quot; is the data segment
base address.  So every load or store address formation is
&quot;mask with %g5 and add to %g4&quot;.
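&lt;p&gt;
To make the masking concrete, here is a small Python sketch that
interprets the four opcodes from the listing above.  It only models
the idea; it is not ioquake3's VM.  DATA_SIZE, load4() and run() are
invented for this example, and the three pad bytes are an assumption
of the sketch so that a 4-byte load at the last masked offset stays
inside the buffer.
&lt;pre&gt;
```python
# Toy interpreter for the four opcodes shown above.  Hypothetical names
# throughout; this is an illustration, not the ioquake3 implementation.
import struct

DATA_SIZE = 256                    # must be a power of two
data = bytearray(DATA_SIZE + 3)    # segment at "%g4", plus 3 pad bytes

def load4(addr):
    # For a power-of-two size, the remainder is the same value as masking
    # the address with DATA_SIZE - 1 (the role "%g5" plays).
    masked = addr % DATA_SIZE
    return struct.unpack_from("=I", data, masked)[0]

def run(program, stack_pointer):
    reg_stack = []                          # the "register stack"
    for op, imm in program:
        if op == "OP_ENTER":
            stack_pointer -= imm            # sub %g3, imm, %g3
        elif op == "OP_LOCAL":
            reg_stack.append(stack_pointer + imm)
        elif op == "OP_LOAD4":
            reg_stack.append(load4(reg_stack.pop()))
        elif op == "OP_LEAVE":
            stack_pointer += imm            # add %g3, imm, %g3
            return reg_stack.pop()
    raise ValueError("program ended without OP_LEAVE")

program = [
    ("OP_ENTER", 0x1c),
    ("OP_LOCAL", 0x24),
    ("OP_LOAD4", None),
    ("OP_LEAVE", 0x1c),
]
```
&lt;/pre&gt;
However wild the stack pointer passed to run() is, load4() can only
ever read from inside the data buffer.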
&lt;p&gt;
It's there in the ioquake3 repo right now and will be in the next
release.  There are lots of things that can be improved but it works
very well and most of the quake3 MODs I've tried (CPMA, UrbanTerror,
etc.) work.  I've also been playing the base game online extensively,
you know, for stress testing.</description>
  </item>
  <item>
    <title>NMI profiling on sparc64...</title>
    <pubDate>Sat, 29 Nov 2008 23:59:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2008/11/29#sparc64_nmi</link>
    <description>&lt;p&gt;
One thing that has always been suboptimal on sparc64 is
profiling with oprofile.
&lt;p&gt;
Yes, it worked, but we only had the dumb timer interrupt profiling
available.  The biggest loss in this is that IRQ disabled code
sequences do not get profiled.  This leads to &quot;clumps&quot; in the
profile, where all the code in an IRQ disabled sequence can show
up as one big hit at the point where IRQs get re-enabled.
&lt;p&gt;
This makes the profile often non-representative and, at worst,
completely unusable.
&lt;p&gt;
Say what you want about levelled interrupts, they do provide a level
of flexibility that can be useful.  Sparc64 chips have a PIL register;
you indicate the interrupt levels (out of 15) you want to block out
by writing the highest level to block out into the register.  So
writing zero enables all interrupt levels, and writing 15 blocks all
of them out.
&lt;p&gt;
Device interrupts work by interrupt vectors, which are not blocked by
the PIL mechanism.  These are processed quickly in a trap handler and
revectored into a PIL levelled interrupt by software.
&lt;p&gt;
Under Linux we use one PIL level for all of the device interrupts.
A few of the other PIL levels we use for specific SMP cross-call
types.
&lt;p&gt;
On UltraSPARC-III (cheetah) and later we (finally) have a profiling
counter overflow interrupt.  This arrives at PIL level 15.  So
naturally I had the idea to run the majority of the kernel with
interrupts disabled only up to PIL level 14.  The result is that we can
use these profile counter overflow interrupts to provide a pseudo-NMI
oprofile implementation.
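&lt;p&gt;
The PIL rule is simple enough to model in a few lines of Python.  This
is only a sketch of the rule as described here, with invented function
names; it says nothing about how the hardware implements it.
&lt;pre&gt;
```python
# Toy model of the PIL rule: writing N blocks levels 1..N.
# deliverable_levels() and is_blocked() are hypothetical helpers.
MAX_LEVEL = 15

def deliverable_levels(pil):
    """Interrupt levels that still get through for a given PIL value."""
    if pil not in range(MAX_LEVEL + 1):
        raise ValueError("PIL must be between 0 and 15")
    return list(range(pil + 1, MAX_LEVEL + 1))

def is_blocked(level, pil):
    return level not in deliverable_levels(pil)
```
&lt;/pre&gt;
With the PIL at 14 the only level left in the list is 15, which is
where the counter overflow interrupt arrives.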
&lt;p&gt;
This works quite well and is checked into the sparc-next-2.6 GIT
tree, so it will show up in 2.6.29.</description>
  </item>
  <item>
    <title>git stash</title>
    <pubDate>Wed, 05 Nov 2008 06:19:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2008/11/05#git_stash</link>
    <description>&lt;p&gt;
Someone on IRC asked what was so cool about my
'mkcf' shell script and those command lines I
showed.
&lt;p&gt;
I'm so old skool that I didn't even know that something
as cool as &quot;git stash&quot; even existed.  Holy crap that's
a nice feature :-)
&lt;p&gt;
Yeah but these young whippersnappers don't know how
to build a git index by hand, har har har.
&lt;p&gt;
In other news, unless you live under a rock you know that Barack Obama
won the presidential election, but what you may not know is that he
only won because of a
&lt;a href=&quot;http://torvalds-family.blogspot.com/2008/11/black-and-white.html&quot;&gt;
last minute endorsement from Linus.
&lt;/a&gt;
&lt;p&gt;
Tee-hee.
</description>
  </item>
  <item>
    <title>mkcf</title>
    <pubDate>Mon, 03 Nov 2008 20:18:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2008/11/03#mkcf</link>
    <description>&lt;p&gt;
I've had this tiny shell script in my repertoire that I don't
know how I'd live without; it's called 'mkcf'.
&lt;p&gt;
It's so stupid and probably there are a thousand better ways
to implement it.  It just shows what files are referenced
by a patch.  Here it is:
&lt;pre&gt;
#!/bin/sh
diffstat -p 1 &quot;$@&quot; | grep -v changed | awk ' { print $1 } '
&lt;/pre&gt;
So I'm always typing things like:
&lt;pre&gt;
bash$ git diff &gt;diff
bash$ git checkout-index $(mkcf diff); git update-index --refresh
&lt;/pre&gt;
To save a patch I'm working on and revert it from the
working tree.
&lt;p&gt;
I'm David S. Miller, and I support this message.
</description>
  </item>
  <item>
    <title>Rock transactional memory</title>
    <pubDate>Thu, 30 Oct 2008 05:28:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2008/10/30#rock_p1</link>
    <description>&lt;p&gt;
There is a good set of
&lt;a href=&quot;http://www.opensparc.net/pubs/preszo/08/RockHotChips.pdf&quot;&gt;
presentation slides on Rock
&lt;/a&gt;
for a talk given this year at HotCHIPS by Shailender Chaudhry.
The Rock chip gets a lot of attention for all of the speculation
features it will have (and that stuff is of course extremely cool)
but to me the most interesting bit initially is the transactional
memory support.
&lt;p&gt;
If you look at pages 18 and 19 in the slides you'll see the basic
implementation.  There are two instructions called &quot;checkpoint&quot;
and &quot;commit&quot;.  So the traditional atomic increment is Rock'ified
as:
&lt;pre&gt;
atomic_inc:
	chkpt	fail_label
	ld	[%o0], %o1
	add	%o1, 1, %o1
	st	%o1, [%o0]
	commit
	retl
	 nop
fail_label:
	 ...
&lt;/pre&gt;
and at the &quot;fail_label&quot; you'd do the traditional atomic sequence
using &quot;cas&quot; (compare and swap) instructions.
&lt;p&gt;
The checkpoint instruction takes a fail label so that it can be
implemented in a best-effort manner (or not at all on some chips).
&lt;p&gt;
The commit instruction defines the end of the transaction and if
it completes successfully then all of the memory operations enclosed
in the sequence execute as if they were all done atomically.
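&lt;p&gt;
The control flow (try the transaction, fall back to cas if it does
not go through) can be sketched in Python.  This is a toy model and
nothing here is Rock's real interface: the Memory class, the
transaction_ok callback and the return strings are all made up so the
two paths can be told apart.
&lt;pre&gt;
```python
# Toy model of "best-effort transaction, else fall back to cas".
# Every name is hypothetical; this only mirrors the control flow.

class Memory:
    def __init__(self):
        self.cells = {}

    def cas(self, addr, expected, new):
        """Store new only if the cell still holds expected; return old."""
        old = self.cells.get(addr, 0)
        if old == expected:
            self.cells[addr] = new
        return old

def atomic_inc(mem, addr, transaction_ok):
    # chkpt: the model asks up front whether the transaction will succeed.
    if transaction_ok():
        # ld, add, st, commit: applied here as one step.
        mem.cells[addr] = mem.cells.get(addr, 0) + 1
        return "transaction"
    # fail_label: the traditional cas retry loop.
    while True:
        old = mem.cells.get(addr, 0)
        if mem.cas(addr, old, old + 1) == old:
            return "cas"
```
&lt;/pre&gt;
Either way the counter goes up by exactly one; only the path taken
differs.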
&lt;p&gt;
So why bother with all of this?  One interesting aspect is that this
avoids atomic instructions.
&lt;p&gt;
You might look at the above and say &quot;hey, that's the same as some
sequence of atomic instructions&quot; and yes for simple examples like an
atomic increment it is.  But there is still a difference, and that is
that the checkpoint/commit sequence is easier to speculate around
than anything using atomic instructions.
&lt;p&gt;
Atomic instructions act as a serialization point, and the scout threads
can't do very much when such explicit ordering constraints start coming
into play.
&lt;p&gt;
On the other hand the speculation threads can just execute the
checkpoint/commit sequence as if it were just a normal set of loads
and stores.  The chip can perhaps do something like put a marker in
the cache lines affected.  If for some reason the speculation fails or
we can't ensure the necessary atomicity, that's OK: we zap the cache
lines with the markers, roll back to the checkpoint instruction, and
head on down to the fail label.  Just like for anything else, the
speculation threads execute the checkpoint/commit sequence
optimistically.</description>
  </item>
  <item>
    <title>sparc64 memory barriers</title>
    <pubDate>Thu, 30 Oct 2008 05:02:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2008/10/30#sparc64mb</link>
    <description>&lt;p&gt;
I don't blog as often as I should, sorry...
&lt;p&gt;
Nick Piggin was working on optimizing the mutex implementations on
various architectures.  He even started hacking on some sparc64 bits,
and it got me thinking....
&lt;p&gt;
There are three defined memory models for sparc v9: TSO (Total Store
Ordering), PSO (Partial Store Ordering), and RMO (Relaxed Memory Ordering).
The Linux kernel always programs the cpu to run in RMO when in privileged
mode.
&lt;p&gt;
So for all the atomics and spinlocks we have memory barrier
instructions (on sparc64 it's called &quot;membar&quot;) peppered all over the
place to make sure memory operation visibility is correct especially
when the interface provides &quot;locking&quot; semantics.
&lt;p&gt;
Complicating matters further, UltraSPARC-I, UltraSPARC-II, and
derivatives have a chip bug where if a memory barrier occurs right after
a mispredicted branch the chip can stop executing instructions until
the next trap is signalled.  If interrupts are off the cpu hangs
completely.
&lt;p&gt;
To work around this we either make sure the condition can't happen in
hand-crafted assembler, or we encapsulate the code in a macro that
forces the memory barrier instruction into the delay slot of a
&quot;branch always, predict taken&quot; control transfer.  And this bloats
up the kernel a bit.
&lt;p&gt;
But here's the interesting bit.  All chips starting with
UltraSPARC-III only actually implement the TSO memory model and
essentially ignore what gets programmed into the memory-model field of
the PSTATE register.  And the RMO gains on the older chips that do
actually support RMO are marginal at best.  So most of this memory
barrier business is just a complete waste.
&lt;p&gt;
So I wrote a patch that just gets rid of all of the membar
instructions used with atomics and spinlocks, and this shaves 10 full
seconds of system time off of a &quot;make -j64&quot; kernel build on Niagara-2.
W00t...
&lt;p&gt;
Of course, for the optimized memcpy and memset implementations we
have to keep the memory barriers in there.  The block load and store
instructions used in those routines are special and do not obey the
normal memory model ordering semantics.  Thus they do still require
explicit memory barriers.</description>
  </item>
  <item>
    <title>Did hell freeze over?</title>
    <pubDate>Mon, 06 Oct 2008 21:41:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2008/10/06#torvalds_blog</link>
    <description>&lt;p&gt;
Linus &lt;a href=&quot;http://torvalds-family.blogspot.com/&quot;&gt;has a blog&lt;/a&gt;
:-)</description>
  </item>
  <item>
    <title>Netfilter Workshop 2008, Paris</title>
    <pubDate>Sun, 05 Oct 2008 00:30:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2008/10/05#nfws2008</link>
    <description>&lt;p&gt;
I just returned from Paris, and the 2008 Netfilter Workshop.  Just like last
year it was a blast, and there were lots of interesting things discussed
as well as imbibed.
&lt;p&gt;
On the first day there was a users day where presentations were made aimed
at a more user-oriented audience.  It seems that just about anyone who was
aware could attend and hear the talks.  I gave one on multiqueue networking.
You can find my slides and other info
&lt;a href=&quot;http://nfws.inl.fr/en&quot;&gt;here&lt;/a&gt;.
&lt;p&gt;
Tuesday and Wednesday were the main workshop days.
&lt;p&gt;
Of greatest interest to me were the descriptions given by Patrick McHardy for
his new filtering framework, where all the complexity is in userspace and
the kernel just runs filtering scripts and lookup datastructures fed to it
by the user tools.  In short, I think this stuff is great, and unlike some
folks I don't think this will decrease netfilter participation by other
developers at all.
&lt;p&gt;
And frankly, iptables was absolutely too accessible to contributors.  Look
at how much stinking poo is in the patch-o-matic, oft called &quot;crap-o-matic&quot;.
&lt;p&gt;
Patrick's work is a wonderful centralized framework, and in fact the scripting
is generic enough that you can build any tool to create these filtering
instructions and subsidiary lookup tables.
&lt;p&gt;
We also made some headway with the tproxy stuff.  All but one of the
core networking patches are in the net-next-2.6 GIT tree.  Indeed, this is
a feature which has been missing for 5 years :-)  I have to hand it to the
balabit guys for sticking to it and working so hard for so long to get this
merged.
&lt;p&gt;
Pablo gave some interesting presentations (3 at once!), and he is exploring
some ways to perhaps make use of bloom filters.  This is something Patrick
has devoted some exploratory brain power to in the past, but it is often
hard to find a use case for these inexact matches, although they are very
cool.
&lt;p&gt;
Jozsef Kadlecsik gave his IPSET state of the union, discussing new features
such as support for ip-port-ip hashing and set lists (which are unions of
the existing set type).
&lt;p&gt;
Yasuyuki Kozakai gave a presentation on the roadblocks that exist currently
for doing proper connection tracking for MIPv6 nodes.  The basic problem is
that the persistent addresses (i.e. the ones we'd want to use for connection
tracking) exist in various locations in the IPv6 packet and extension
headers.
&lt;p&gt;
Jesper talked about all of the userland scalability improvements he
made to the iptables utilities.  He also described a set of scripts
he wrote to build optimized rule table trees.
&lt;p&gt;
Stephen Hemminger discussed some of the user visible interface work
that Vyatta has been doing.  Essentially these are a set of templated
bash shell scripts and descriptor files that present a Cisco IOS like
interface to administrators.  He also talked about the performance
issues surrounding the way in which iptables does packet counters, as
well as the global conntrack table lock.
&lt;p&gt;
Harald Welte gave a talk about the current state of GPL violation
enforcement.  Things seem to have been going quite well, but it is
becoming more and more important to give Harald more facilities by
which to make air-tight arguments that he has enforcement rights to
code which has been violated.  One way for that to happen is for
significant contributors to sign over their rights to him so that he
can make enforcements on their behalf.
&lt;p&gt;
It seems that this is a very common stall tactic by the defence in
such cases, to try and bring up some doubt about the code property
ownership situation.
&lt;p&gt;
Of course, aside from the workshop itself there were plenty of parties.
Even for lunch we had quite nice French cuisine and beverages, and the
dinners were even nicer.
&lt;p&gt;
Tuesday night even included a full 4 hour boat cruise on the Seine,
with tons of champagne, wine, small bite-size delicacies of all
types, and lots of sweets.
&lt;p&gt;
Overall a wonderful time; the netfilter workshop never disappoints.
A big thank you to the official organizers this year,
&lt;a href=&quot;http://inl.fr&quot;&gt;INL&lt;/a&gt;.
</description>
  </item>
  <item>
    <title>Work towards a peek operation in the Linux networking packet schedulers.</title>
    <pubDate>Sun, 14 Sep 2008 02:27:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2008/09/14#pkt_sched_push</link>
    <description>&lt;p&gt;
There are some ugly corners of the packet scheduler qdisc interfaces.
Jarek P., Herbert Xu, and I have been trying to figure out some
ways to tidy them a bit.
&lt;p&gt;
First of all, we have this requeue operation.  Basically, if you do
a dequeue, then decide (for whatever reason) you can't actually take
the packet right now, you can stuff it back to the head of the
qdisc using the requeue call.  The qdisc must be locked during this
sequence.
&lt;p&gt;
Some of the uses of requeue are dubious, at least in my opinion.  For
example, we let device driver transmit routines push back on the caller
even when they have the queue marked as having room for more packets.
The generic transmit level code uses requeue to stick such packets back
into the qdisc.
&lt;p&gt;
I truly believe that we can eliminate this specific case.  Such
drivers are buggy, and we are better off just dropping the packet when
the device does something like that.  No reasonable or common device
driver causes this behavior.
&lt;p&gt;
Another use of requeue is in netem, which is the network emulator
qdisc.  You can use this qdisc to simulate various network characteristics
such as delay, random packet drops, etc.  The netem packet scheduler
uses the requeue operation as part of its packet reordering implementation.
&lt;p&gt;
At least the netem case could be handled more cleanly with a peek operation.
The rule would be that you can do a peek to see the next packet that would
be returned from a dequeue operation.  If you decide you can't use the
packet right now, you do nothing.  Otherwise you must do a dequeue and
it will return the same packet.
&lt;p&gt;
From the qdisc implementation's perspective, this can be done quite simply.
Actually, Herbert Xu noticed this.  Basically, non-trivial qdiscs implement
peek the same as dequeue, except that they don't actually unlink and they
remember this packet and the class it came from.  Then when the subsequent
dequeue operation comes along, we call down into the remembered class qdisc
to invoke the dequeue operations down to the leaves.
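&lt;p&gt;
As a rough illustration, here is the peek-then-dequeue rule in Python
on a two-level priority qdisc.  It is a sketch of the idea, not kernel
code: FifoQdisc, PrioQdisc and the peeked_child field are invented for
this example.
&lt;pre&gt;
```python
# Sketch of peek followed by dequeue on a classful qdisc.
# All class and attribute names are hypothetical.
from collections import deque

class FifoQdisc:
    """Leaf qdisc: a plain first-in, first-out queue."""
    def __init__(self):
        self.queue = deque()

    def enqueue(self, pkt):
        self.queue.append(pkt)

    def peek(self):
        return self.queue[0] if self.queue else None

    def dequeue(self):
        return self.queue.popleft() if self.queue else None

class PrioQdisc:
    """Classful qdisc: child 0 has the highest priority."""
    def __init__(self, nchildren):
        self.children = [FifoQdisc() for _ in range(nchildren)]
        self.peeked_child = None        # class remembered by the last peek

    def enqueue(self, pkt, prio):
        self.children[prio].enqueue(pkt)

    def peek(self):
        for child in self.children:
            pkt = child.peek()
            if pkt is not None:
                self.peeked_child = child   # remember it, do not unlink
                return pkt
        return None

    def dequeue(self):
        if self.peeked_child is not None:
            # Go straight to the remembered class so the caller gets
            # the same packet the peek returned.
            child, self.peeked_child = self.peeked_child, None
            return child.dequeue()
        for child in self.children:
            pkt = child.dequeue()
            if pkt is not None:
                return pkt
        return None
```
&lt;/pre&gt;
A dequeue that follows a peek always returns the peeked packet, since
it goes to the remembered class without rescanning the others.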
&lt;p&gt;
And as it turns out, this is exactly how the CBQ packet scheduler currently
implements requeue.  It remembers the last class used for dequeue, and
uses that to vector the requeue down to the correct leaf.
&lt;p&gt;
There are some details to work out, but this is the basic idea of how peek
would work and how we would use it.
&lt;p&gt;
Jarek raised one interesting aspect of this.  Classful qdiscs implement link
sharing using timers, and packet send quotas based upon bandwidth used over
time, etc.  These are usually done on a class basis.
&lt;p&gt;
So, for example, if you peek, decide not to dequeue at the moment, and
then later we do a dequeue and it uses the cached class mentioned above, that
might be wrong.  In the intervening time period, another class might have
gotten its quota back, and if that class is of a higher priority we should
take packets from that class not the cached one.
&lt;p&gt;
In some sense this is theoretical, and that's what Herbert seems to be arguing
currently.</description>
  </item>
  <item>
    <title>Elimination of Sparc SBUS and EBUS device layers...</title>
    <pubDate>Sun, 14 Sep 2008 02:12:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2008/09/14#sparc_sbus_elimination</link>
    <description>&lt;p&gt;
If you've been following my sparc-next-2.6 tree you've noticed quite a bit
of activity over the past few weeks.  During the waning weeks of 2.6.x-rcX
development, things slow down enough to allow some time for big cleanups
and major infrastructure changes.
&lt;p&gt;
For years we've had this SBUS specific device layer.  It was written before
Linux had a generic device layer at all.  Even PCI support was a bunch of
hodge-podge interfaces, and a global linked list of probed devices.  So
I did SBUS in a similar way (only worse, if that can even be possible).
&lt;p&gt;
With the addition of the OF device layer I tried to clean the SBUS stuff
up a bit.  But I had to keep SBUS as a separate device layer because
all of the DMA mapping done by the drivers was done by sbus_foo() specific
routines, and thus we needed to provide some SBUS specific state in the
device objects.  That was too much to untangle at the time.
&lt;p&gt;
EBUS, which is Sun's &quot;8-bit bus&quot;, has its own device layer as well.  The
EBUS is for things normally found behind ISA busses on x86 systems: RTC
clocks, serial ports, floppy controllers, these sorts of things.  Well,
just like SBUS, it had its own device driver and DMA interfaces.
&lt;p&gt;
I bit the bullet and killed it all off.  The first step was to deal with
the DMA issues.  Just like for PCI on sparc64, we store the IOMMU controller
information in the generic struct device archdata area.  This gets propagated
to all bus children.  Then calls to the generic dma_foo() DMA interfaces
just do the right thing.
&lt;p&gt;
Then all the SBUS drivers were converted to use dma_foo() instead of
sbus_*().  The only remaining task was to propagate the SBUS iommu
initialization into separate init functions and make sure they set
things up before any real devices would probe.
&lt;p&gt;
Every SBUS driver thus uses OF device layer generic probing, register
mapping, and IRQ information acquisition.  They also use only the generic
DMA mapping APIs.  So, the SBUS layer could finally just get deleted.
&lt;p&gt;
A similar transformation happened for the EBUS layer.
&lt;p&gt;
The CS4231 sound driver was the trickiest case in all of this.  It
comes in both EBUS and SBUS variants.  So a large mess had to get
untangled to convert the driver throughout all of these changes.
&lt;p&gt;
But now it's all done and there in sparc-next-2.6 ready for merging
during the 2.6.28 merge window.
&lt;p&gt;
Another nice development that occurred during this time period was
a conversion of both sparc ports over to the generic RTC layer.
This was thanks to some really nice groundwork done by Krzysztof Helt.
&lt;p&gt;
I'd also like to mention Robert Reif who has been helping test the
sparc32 side of all of these changes.
&lt;p&gt;
Now I'm slowly cleaning up all of the interrupt and timer handling
on sparc32 so that it can be converted over to use the generic IRQ
layer just like sparc64.</description>
  </item>
  </channel>
</rss>