Probably the most intellectually rewarding presentation at LCA2006 was Van Jacobson's. He gave a full historical perspective on why we implement networking the way we do today, making the point that many of those reasons are purely historical artifacts that no longer apply.
Next he gave an analysis of what system and network hardware is like today, and how sharply this contrasts with the systems of the era when the current network implementation model was devised. The various limitations, latency relationships, and undeniable trends in system performance characteristics were presented. He used this to narrow down precisely where it makes sense to concentrate optimization effort.
Underlying all of this talk was the basic Internet design precept of pushing work to the end nodes. This is the only way to design systems that truly scale. The middle is implemented as simply as possible: the routers just push packets around, and the real compute and "work" on a packet is done at the end hosts. The larger the network gets, the more compute power you have at the end nodes, and thus the system scales up.
With SMP systems this "end host" concept really should be extended to the computing entities within the system: the CPUs and threads within the box.
So, given all that, how do you implement network packet receive properly? Well, first of all, you stop doing so much work in interrupt (both hard and soft) context. Jamal Hadi Salim and others understood this quite well, and NAPI is a direct consequence of that understanding. But what Van is trying to show in his presentation is that you can take this further, in fact a _lot_ further.
A Van Jacobson channel is a path for network packets. It is implemented as an array'd queue of packets. The producer state and the consumer state sit in different cache lines, so it is never the case that the consumer and producer write to a shared cache line. Network cards want to know purely about packets, yet for years we've been enforcing an OS-determined model and abstraction for network packets upon the drivers for such cards. This has come in the form of "mbufs" in BSD and "SKBs" under Linux, but the channels are designed so that this is totally unnecessary. Drivers no longer need to know about what the OS packet buffers look like; channels just contain pointers to packet data.
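To make the idea concrete, here is a minimal sketch of what such a channel might look like: a single-producer/single-consumer ring of bare data pointers, with the producer and consumer indices forced onto separate cache lines. All of the names and the exact layout are my own illustration, not code from Van's slides or any actual kernel patch.

```c
/* Illustrative sketch of a net channel: an array'd queue of packet
 * data pointers with one producer and one consumer.  The producer
 * and consumer indices live in separate cache lines so neither side
 * ever dirties a line the other side writes. */
#include <stdint.h>
#include <stddef.h>

#define CHAN_SLOTS 256              /* must be a power of two */
#define CACHE_LINE 64

struct channel {
	/* Producer-written index, alone in its cache line. */
	uint32_t head __attribute__((aligned(CACHE_LINE)));
	/* Consumer-written index, alone in its cache line. */
	uint32_t tail __attribute__((aligned(CACHE_LINE)));
	/* Bare packet data pointers -- no skb, no mbuf. */
	void *slot[CHAN_SLOTS] __attribute__((aligned(CACHE_LINE)));
};

/* Producer side (e.g. the driver): 0 on success, -1 if full. */
static int chan_put(struct channel *ch, void *pkt)
{
	uint32_t head = ch->head;

	if (head - ch->tail >= CHAN_SLOTS)
		return -1;                      /* ring full */
	ch->slot[head & (CHAN_SLOTS - 1)] = pkt;
	__sync_synchronize();               /* publish data before index */
	ch->head = head + 1;
	return 0;
}

/* Consumer side (e.g. softirq or socket): NULL if empty. */
static void *chan_get(struct channel *ch)
{
	uint32_t tail = ch->tail;
	void *pkt;

	if (tail == ch->head)
		return NULL;                    /* ring empty */
	pkt = ch->slot[tail & (CHAN_SLOTS - 1)];
	__sync_synchronize();
	ch->tail = tail + 1;
	return pkt;
}
```

Note that because each side writes only its own index, no locks are needed for the one-producer/one-consumer case, which is exactly the situation of a driver feeding one consumer.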
In the first step we have just one channel, which goes to a generic routine in the generic network device code that attaches the packet to an OS network packet data structure and passes it into the normal input path. So the driver interrupt handler just puts packets into the channel, and the software interrupt sucks them out and passes them into the stack. At this first step, drivers no longer need to know about OS packet buffer data structures. Van stated that a channel'ized e1000 driver gets 200 lines of code removed from the fast paths, a non-trivial feat.
The next step is to build channels to sockets. We need some intelligence in order to map packets to channels, and this comes in the form of a tiny packet classifier the drivers use on input. It reads the protocol, ports, and addresses to determine the flow ID and uses this to find a channel. If no matching flow is found, we fall back to the basic channel we created in the first step. As sockets are created, channel mappings are installed, and thus the driver classifier can find them later. The socket wakes up, does protocol input processing, and copies into userspace directly out of the channel.
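A toy version of that classifier might look like the following: hash the (saddr, daddr, sport, dport, proto) tuple into a flow table, and fall back to the default channel from the first step on a miss. The hash function, table layout, and the convention that channel 0 is the fallback are all assumptions of mine for illustration, not Van's actual design.

```c
/* Toy driver-side flow classifier: sockets install flow -> channel
 * mappings; the driver looks packets up per-flow on input, falling
 * back to the default channel (id 0) when nothing matches. */
#include <stdint.h>

#define FLOW_TABLE_SIZE 1024        /* power of two */

struct flow_key {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	uint8_t  proto;
};

struct flow_entry {
	struct flow_key key;
	int chan_id;                /* 0 == empty / fallback */
};

static struct flow_entry flow_table[FLOW_TABLE_SIZE];

static uint32_t flow_hash(const struct flow_key *k)
{
	/* Cheap multiplicative mix; real hardware often provides a
	 * flow hash for free. */
	uint32_t h = k->saddr ^ k->daddr ^
		     ((uint32_t)k->sport << 16 | k->dport) ^ k->proto;
	h *= 2654435761u;           /* golden-ratio multiplier */
	return h & (FLOW_TABLE_SIZE - 1);
}

static int flow_key_eq(const struct flow_key *a, const struct flow_key *b)
{
	return a->saddr == b->saddr && a->daddr == b->daddr &&
	       a->sport == b->sport && a->dport == b->dport &&
	       a->proto == b->proto;
}

/* Called as a socket is created: install the mapping. */
static void flow_install(const struct flow_key *k, int chan_id)
{
	struct flow_entry *e = &flow_table[flow_hash(k)];

	e->key = *k;
	e->chan_id = chan_id;
}

/* Called per packet on input: find the socket's channel, or 0. */
static int flow_classify(const struct flow_key *k)
{
	const struct flow_entry *e = &flow_table[flow_hash(k)];

	if (e->chan_id > 0 && flow_key_eq(&e->key, k))
		return e->chan_id;
	return 0;                   /* default channel from step one */
}
```

The important property is that this runs in a handful of instructions per packet, cheap enough to live in the driver's receive path.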
And in the next step you can have the socket ask for a channel ID (with a getsockopt or something like that), have it mmap() a receive ring buffer into user space, and the mapped channel just tosses the packet data into that mmap()'d area and wakes up the process. The process has a mini TCP receive engine in user space.
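Simulating both sides in one process, the mmap()'d arrangement might look something like this sketch. The structure names, the slot layout, and the two helper functions are all invented for illustration; the point is only that payloads land directly in process-visible memory, with no copy through kernel socket buffers.

```c
/* Hypothetical userspace receive ring: the kernel side deposits
 * packet payloads straight into memory that is mmap()'d into the
 * process, and the process's mini TCP receive engine consumes them.
 * Both sides are simulated in one process here. */
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 32
#define SLOT_BYTES 2048

struct user_ring {
	volatile uint32_t head;     /* advanced by the kernel side */
	volatile uint32_t tail;     /* advanced by the application */
	struct {
		uint32_t len;
		uint8_t  data[SLOT_BYTES];
	} slot[RING_SLOTS];
};

/* "Kernel" side: copy the payload into the shared ring; in the real
 * design this would be followed by waking the process. */
static int ring_deliver(struct user_ring *r, const void *payload,
			uint32_t len)
{
	uint32_t i;

	if (r->head - r->tail >= RING_SLOTS || len > SLOT_BYTES)
		return -1;
	i = r->head % RING_SLOTS;
	memcpy(r->slot[i].data, payload, len);
	r->slot[i].len = len;
	r->head = r->head + 1;
	return 0;
}

/* Application side: consume the next payload; returns bytes read,
 * 0 if the ring is empty. */
static uint32_t ring_recv(struct user_ring *r, void *buf, uint32_t max)
{
	uint32_t i, n;

	if (r->tail == r->head)
		return 0;
	i = r->tail % RING_SLOTS;
	n = r->slot[i].len < max ? r->slot[i].len : max;
	memcpy(buf, (const void *)r->slot[i].data, n);
	r->tail = r->tail + 1;
	return n;
}
```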
And you can take this even further than that (think remote systems). At each stage Van presents a table of profiled measurements for a normal bulk TCP data transfer. The final stage of channeling all the way to userspace is some 6 times faster than what we're doing today. Yes, I said 6 times faster; that isn't a typo.
I really haven't done justice to Van's presentation, so once I see his slides posted online I'll provide a link here in my blog. I might also spend some time with "dia" drawing some pretty pictures to show how net channels work.
The three presentations that composed my keynote on Thursday morning at linux.conf.au 2006 are at the links below:
Yes, the rumors are true: in order to help out a worthy cause I offered to shave off my beard and mustache, and had to make good on it over lunch today. Others had to follow through on similar offers. I'll put up photos later.