Netconf 2009 minutes, Part 1.

Attendees:

Saturday, September 19, 2009.

Arnaldo Melo: Batch Datagram Receiving

Summary:

Reduce per-packet overhead on receive by batching packets, so that protocol-stack overhead is amortized over all packets making up a given batch. This is accomplished via a change to the syscall layer allowing a vector of message headers to be passed to the message-receive syscall, which returns either the number of messages received or an error. The lower UDP layers were also changed to reduce locking overhead.
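This work corresponds to what is now the recvmmsg() system call. As a rough illustration, here is a minimal userspace sketch of batched receive using the glibc recvmmsg() wrapper; the port number and batch size are arbitrary choices for the sketch, not anything presented at the meeting:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <netinet/in.h>

    #define VLEN    32      /* batch size: up to 32 datagrams per syscall */
    #define BUFSIZE 1500

    int main(void)
    {
        static char bufs[VLEN][BUFSIZE];
        struct iovec iov[VLEN];
        struct mmsghdr msgs[VLEN];
        struct sockaddr_in addr = {
            .sin_family = AF_INET,
            .sin_port = htons(9999),        /* arbitrary test port */
            .sin_addr.s_addr = htonl(INADDR_ANY),
        };
        int fd, n;

        fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("socket/bind");
            return 1;
        }

        /* Point each message header at its own receive buffer. */
        memset(msgs, 0, sizeof(msgs));
        for (int i = 0; i < VLEN; i++) {
            iov[i].iov_base = bufs[i];
            iov[i].iov_len = BUFSIZE;
            msgs[i].msg_hdr.msg_iov = &iov[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
        }

        /* A single syscall now returns up to VLEN datagrams. */
        n = recvmmsg(fd, msgs, VLEN, 0, NULL);
        if (n < 0) {
            perror("recvmmsg");
            return 1;
        }
        for (int i = 0; i < n; i++)
            printf("message %d: %u bytes\n", i, msgs[i].msg_len);
        return 0;
    }

The point of the interface is visible in the single recvmmsg() call: one syscall's worth of entry/exit and socket-locking overhead is spread across up to VLEN packets.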

Although financial institutions are said to be intensely interested in this optimization, they are unwilling to share performance results. Fortunately, Nir Tzachar tested on 1Gb/s hardware and noted a latency reduction for 100-byte packets from 750us to 470us. For larger packets, Nir noted that throughput doubled. Future work might include batching on the send side, but this requires many consecutive sends to the same destination, which appears to be a less common pattern.

Details:

Herbert Xu: Tx interrupt mitigation

Summary:

Networking bandwidths have increased by three orders of magnitude over the past 25 years, but the MTU remains firmly fixed at the ancient value of 1500 bytes. This means that the packet rate, and thus the per-packet overhead incurred per unit time, has also increased by three orders of magnitude during that time.

The obvious solution would be to increase the MTU size, but the tendency to drop ICMP packets defeats path-MTU discovery, so that connections spanning the Internet are still required to use small MTU values. In addition, many important applications use small packets, and thus cannot benefit from an increased MTU. Finally, jumbograms can increase queueing latencies within the Internet, which degrades response times.

So we need to live with small packets, and one way to do so is to decrease interrupt overhead. NAPI (the not-so-new API) has done this for some time, but only for the receive side. At 10Gb/s speeds, we must also deal with the transmit side. Herbert has implemented a work-withholding approach in which completion interrupts are requested only 3-4 times per traversal of the transmit ring, as opposed to on each and every packet. However, this approach falls down for virtual NICs, since there is not sufficient per-packet transmission delay. Herbert noted that very few transmitters want or need timely completion notification, so he is proposing a boolean flag in the skb structure indicating whether a particular packet requires a completion interrupt.
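To make the work-withholding idea concrete, here is a self-contained sketch of the descriptor-marking logic. The structure layouts, the names, and the every-Nth-slot interval are illustrative assumptions for this sketch, not Herbert's actual patch; the wants_completion field models the proposed skb flag.

    #include <stdbool.h>
    #include <stdio.h>

    #define RING_SIZE          256
    #define IRQS_PER_TRAVERSAL 4   /* assumed: ~4 completion interrupts per ring pass */

    struct tx_desc {
        bool irq_requested;  /* NIC raises a completion interrupt for this slot */
        /* ... buffer address, length, and other fields would live here ... */
    };

    /* Stand-in for an skb; wants_completion models the proposed flag. */
    struct packet {
        bool wants_completion;
    };

    static struct tx_desc ring[RING_SIZE];
    static unsigned int tail;

    /*
     * Queue one packet for transmission, requesting a completion interrupt
     * only at fixed intervals around the ring, or when the packet explicitly
     * asks for timely completion notification.
     */
    static void queue_tx(const struct packet *pkt)
    {
        struct tx_desc *d = &ring[tail % RING_SIZE];

        d->irq_requested = pkt->wants_completion ||
                           tail % (RING_SIZE / IRQS_PER_TRAVERSAL) == 0;
        tail++;
    }

    int main(void)
    {
        struct packet pkt = { .wants_completion = false };
        unsigned int irqs = 0;

        for (int i = 0; i < RING_SIZE; i++)
            queue_tx(&pkt);
        for (int i = 0; i < RING_SIZE; i++)
            if (ring[i].irq_requested)
                irqs++;
        printf("%u completion interrupts for %d packets\n", irqs, RING_SIZE);
        return 0;
    }

In a real driver, reclaiming completed descriptors would happen in the interrupt handler; the point here is simply that only a handful of slots per ring traversal carry the interrupt-request bit, unless a packet explicitly asks for timely notification.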

Details:

Stephen Hemminger: Bridging

Summary:

Bridging is now receiving much more attention due to its new-found uses in virtual environments. The setup for these environments is mostly automated and works quite well, except for the spanning-tree implementation, which can take up to 30 seconds to converge. RSTP (Rapid Spanning Tree Protocol) would be a great improvement, and it also better handles leaf nodes. There is an rstplib, but it is only occasionally used, mostly by embedded people: its problems include lack of distro uptake and difficulties with user-kernel version synchronization.

EMC has an RSTP implementation, which is now in the repository; EMC wants to replace it with GPLed code from VMware, which is now also in the repository.

There was much discussion of VEPA (Virtual Ethernet Port Aggregator), especially regarding the need for solutions that work across a wide range of hardware.

Details:

Jesper Dangaard Brouer: 10Gb bi-directional routing

Summary:

Jesper described ComX's (a Danish ISP) use of a Linux box as a 10Gbit/s Internet router, as an alternative to a proprietary solution that is an order of magnitude more costly. Jesper's testing achieved full wire speed on dual-interface unidirectional workloads for packet sizes of 420 bytes or larger, and on dual-interface bidirectional workloads for packet sizes of 1280 bytes or larger. He also showed good results on Robert Olsson's Internet-traffic benchmark.

Perhaps needless to say, careful choice of hardware and careful tuning are required to achieve these results.

The bottom-line finding is that 10Gbit/s bi-directional routing is possible, but the limiting factor is per-packet processing power; memory bandwidth is still available.

Details: