Netconf 2009 minutes, Part 1.
Saturday, September 19, 2009.
Arnaldo Melo: Batch Datagram Receiving
Reduce per-packet overhead on receive by batching packets, so that protocol-stack overhead is amortized over all packets making up a given batch. This is accomplished via a change to the syscall layer allowing an iovec to be passed to the message-receive syscall, which returns either the number of messages received or an error. Lower UDP layers were changed to reduce locking overhead.
Although financial institutions are said to be intensely interested in this optimization, they are unwilling to share performance results. Fortunately, Nir Tzachar tested on 1Gb/s hardware and noted a latency reduction for 100-byte packets from 750us to 470us. For larger packets, Nir noted that throughput doubled. Future work might include batching on the send side, but this requires lots of consecutive sends to the same destination, which appears to be a bit more unusual.
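The amortization argument can be illustrated with a toy cost model. This is only a sketch: the function name and the cost units are hypothetical and are not measurements from Arnaldo's or Nir's work.

```python
def per_packet_cost(fixed_cost, packet_cost, batch_size):
    """Average cost per packet when one fixed cost (syscall entry,
    protocol-stack traversal) is shared by every packet in the batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return fixed_cost / batch_size + packet_cost

# Hypothetical units: 8 for the fixed cost, 1 for each packet.
for batch in (1, 8, 64):
    print(batch, per_packet_cost(fixed_cost=8.0, packet_cost=1.0, batch_size=batch))
```

As the batch grows, the per-packet cost approaches the unavoidable per-packet work, which is why the benefit is largest for small packets.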
"Every networking problem solved with GRO or GSO". GRO: Generic Receive Offload. GSO: Generic Segmentation Offload.
Detailed performance statistics. 1M samples.
Herbert Xu: Tx interrupt mitigation
Networking bandwidths have increased by three orders of magnitude over the past 25 years, but the MTU remains firmly fixed at the ancient value of 1500 bytes. This means that per-packet overhead per unit time has also increased by three orders of magnitude during that time.
The obvious solution would be to increase the MTU size, but the tendency to drop ICMP packets defeats MTU discovery, so that connections spanning the Internet are still required to use small MTU values. In addition, many important applications use small packets, and thus cannot benefit from increased MTU. Finally, jumbograms can increase queueing latencies within the Internet, which degrades response times.
So we need to live with small packets, and one way to do so is to decrease interrupt overhead. NAPI (the not-so-new API) has done this for some time, but only for the receive side. At 10Gb/s speeds, we must also deal with the transmit side. Herbert has implemented a work-withholding approach in which completion interrupts are requested only 3-4 times per transmit-ring traversal, as opposed to on each and every packet. But this approach does not work for virtual NICs, since there is not sufficient per-packet transmission delay. However, Herbert noted that very few transmitters want or need timely completion notification, so he is proposing a boolean flag in the skb structure indicating whether this particular packet requires a completion interrupt.
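The effect of requesting completion interrupts only a few times per ring traversal can be sketched with a small counting model. The function, its parameters, and the `urgent` set (standing in for the proposed per-packet flag) are hypothetical illustrations, not Herbert's implementation.

```python
def completion_interrupts(ring_size, packets, per_ring=4, urgent=frozenset()):
    """Count TX completion interrupts when only every
    (ring_size // per_ring)-th descriptor requests one, plus any
    packet explicitly flagged as needing timely completion."""
    stride = max(1, ring_size // per_ring)
    interrupts = 0
    for n in range(1, packets + 1):
        if n % stride == 0 or n in urgent:
            interrupts += 1
    return interrupts

ring = 256
print(completion_interrupts(ring, ring, per_ring=ring))      # one per packet
print(completion_interrupts(ring, ring, per_ring=4))         # four per traversal
print(completion_interrupts(ring, ring, per_ring=4, urgent={10}))
```

In this model a 256-entry ring drops from 256 interrupts per traversal to 4, and a flagged packet adds exactly one more.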
Jumbo frames do help, except that the Internet doesn't like large frames. Need MTU discovery, but many firewalls eat ICMP, which defeats MTU discovery. :-(
100-byte packets at 10Gb/s come to an interrupt every 100ns or so at wire speed...
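The per-packet time budget follows directly from packet size and link rate. The helper below is a back-of-the-envelope sketch (the function name is made up here) that ignores preamble, inter-frame gap, and other framing overhead.

```python
def packet_interval_ns(packet_bytes, link_bits_per_sec):
    """Time between back-to-back packets on the wire, in nanoseconds,
    ignoring framing overhead."""
    return packet_bytes * 8 * 1e9 / link_bits_per_sec

print(packet_interval_ns(100, 10e9))    # 100-byte packets at 10Gb/s
print(packet_interval_ns(1500, 10e9))   # full-size packets at 10Gb/s
```

For 100-byte packets this gives 80ns, consistent with the "100ns or so" figure once framing overhead is added back in.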
On receive, just poll at end of interrupt processing, thus eliminating the need for other interrupts. But need to make sure to re-enable interrupts eventually...
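The interrupt-then-poll pattern described here can be sketched as a toy model. The class and method names are hypothetical and this is not the kernel's NAPI code; it only shows interrupts being masked while polling and re-enabled once the queue drains.

```python
from collections import deque

class ToyNapi:
    def __init__(self, budget=64):
        self.queue = deque()
        self.irq_enabled = True
        self.budget = budget
        self.interrupts = 0
        self.delivered = []

    def packet_arrives(self, pkt):
        self.queue.append(pkt)
        if self.irq_enabled:
            self.interrupts += 1
            self.irq_enabled = False    # mask interrupts, switch to polling

    def poll(self):
        """Process up to `budget` packets; re-enable interrupts when drained."""
        done = 0
        while self.queue and done < self.budget:
            self.delivered.append(self.queue.popleft())
            done += 1
        if not self.queue:
            self.irq_enabled = True     # the "eventually" from the note above
        return done

nic = ToyNapi(budget=4)
for i in range(10):
    nic.packet_arrives(i)
while not nic.irq_enabled:
    nic.poll()
print(nic.interrupts, len(nic.delivered))
```

Ten packets arriving in a burst cost a single interrupt in this model; the remaining nine are picked up by polling.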
UDP needs to be careful, as there is no congestion control. (Though current wire speeds are making this a non-problem.)
Need local flow control in the general case -- especially when you have multiple transmitting sockets, and fair access is required.
KVM hack, "tx mitigation", postpones the work until some time later. Initial postponement was for 2ms, but high-resolution timers have helped somewhat. Not as nice as dedicated hardware, but much better than 2ms timers.
In virtualized environments, can use hypervisor as the private pool used to get guests out of trouble. If apps running on host, then host must avoid swapping.
Stephen Hemminger: Bridging
Bridging is now receiving much more attention due to its new-found uses in virtual environments. The setup for these environments is mostly automated, and works quite well, except for the spanning-tree implementation, which can take up to 30 seconds to sync up. RSTP (Rapid Spanning Tree Protocol) would be a great improvement, and it also better handles leaf nodes. There is an rstplib, but it is only occasionally used, mostly by embedded people: problems include lack of distro uptake and problems with user-kernel version synchronization.
EMC has an RSTP implementation, which is now in the repository, and which EMC wants to replace with GPLed code from VMWare, which is now also in the repository.
Much discussion of VEPA (Virtual Ethernet Port Aggregator), especially regarding the need for solutions that work across a wide range of hardware.
Userspace difficult due to need to keep kernel and user-space library in sync. In theory better security, but...
EMC coded up a version, which is now in the repository. But EMC wants to replace it with GPLed code from VMWare.
Jesper Dangaard Brouer: 10Gb bi-directional routing
Jesper described ComX's (Danish ISP) use of a Linux box as a 10Gbit/s Internet router as an alternative to a proprietary solution that is an order of magnitude more costly. Jesper's testing achieved full wire speed on dual-interface unidirectional workloads for packet sizes of 420 bytes or larger, and on dual-interface bidirectional workloads for packet sizes of 1280 bytes or larger. Also showed good results on Robert Olsson's internet-traffic benchmark.
Perhaps needless to say, careful choice of hardware and careful tuning are required to achieve these results.
Bottom-line findings: 10Gbit/s bi-directional routing is possible, but we are limited by packets-per-second processing power; there is still memory bandwidth available.
"Normal" Internet router. 2-port 10 Gbit/s router. Bidirectional, total of 40Gb/s through the interfaces. Can't use jumbo frames. Must withstand small-packet DoS attacks.
PCI-Express marketing numbers looked good: 160Gb/s. But in reality, only get about 54Gb/s.
CPU: Core i7 (920) vs. Phenom II X4 (940) RAM: DDR3 vs. DDR2
Raw memory bandwidth sufficient in all cases.
[discussion of solving serialization issue with bw shapers, applying per-CPU value-caching tricks to allow scalable traffic shaping. 1Mb/s at 1500 byte packets give 83 interactions per Mb. So hand out (say) 10 as requested. Vary based on number of CPUs and size of share.]
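The value-caching idea in the note above can be sketched as follows. All class names are hypothetical and this is a toy model rather than any proposed kernel code: a lock-protected shared token pool is the serialization point, and each CPU refills a local cache in batches of 10 so that most packets never touch the lock.

```python
import threading

class SharedBucket:
    """Global token pool protected by a lock (the serialization point)."""
    def __init__(self, tokens):
        self.tokens = tokens
        self.lock = threading.Lock()
        self.lock_acquisitions = 0

    def take(self, want):
        with self.lock:
            self.lock_acquisitions += 1
            granted = min(want, self.tokens)
            self.tokens -= granted
            return granted

class CpuCache:
    """Per-CPU cache that refills from the shared bucket in batches."""
    def __init__(self, bucket, batch=10):
        self.bucket = bucket
        self.batch = batch
        self.local = 0

    def send_packet(self):
        if self.local == 0:
            self.local = self.bucket.take(self.batch)
        if self.local == 0:
            return False            # over the limit: caller must queue or drop
        self.local -= 1
        return True

packets_per_mbit = 1_000_000 // (1500 * 8)    # 83 full-size packets per Mb
bucket = SharedBucket(packets_per_mbit)
cache = CpuCache(bucket, batch=10)
sent = sum(cache.send_packet() for _ in range(packets_per_mbit))
print(sent, bucket.lock_acquisitions)
```

In this single-CPU run, 83 packets cost 9 trips to the shared lock instead of 83. A real design would also need to vary the batch with the number of CPUs and the size of the share, as the note suggests, and to return unused tokens.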
Use "mpstat -A -P ALL" to validate irq spreading.
Make sure that corresponding RX and TX queues are on same CPU.
With faster packet generators, generators get close to wire speed -- at wire speed for 420-byte packets and larger.
Some limitations due to memory bandwidth -- may need more NUMA-awareness in drivers and possibly also the network stack... Will need API from driver to give preferred buffer/queue layout in memory.
11.6 Gb/s bi-directional. Preferentially dropping large packets. ;-)
Chipset issues result in artifacts due to cacheline size.
Thomas Graf: Control Groups (cgroups) Networking Subsystem
Thomas described his extension of cgroups to cover networking. The administrator can create networking classIDs and assign them to cgroups. These classIDs can then be used by the traffic classifier.
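A classID has to line up with the class handles that the traffic classifier uses. The sketch below assumes the usual tc convention of a 32-bit handle written as hexadecimal `major:minor`, with the major number in the upper 16 bits; the helper names are made up for illustration.

```python
def classid_from_handle(handle):
    """Convert a tc-style 'major:minor' handle (hex) to a 32-bit classID."""
    major, minor = handle.split(":")
    major, minor = int(major, 16), int(minor, 16)
    if not (0 <= major <= 0xFFFF and 0 <= minor <= 0xFFFF):
        raise ValueError("major and minor must each fit in 16 bits")
    return (major << 16) | minor

def handle_from_classid(classid):
    """Inverse of classid_from_handle."""
    return "%x:%x" % (classid >> 16, classid & 0xFFFF)

print(hex(classid_from_handle("10:1")))
print(handle_from_classid(0x100001))
```

Under that assumption, the handle `10:1` corresponds to the value `0x100001`.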
This approach does not cover incoming traffic, nor does it cover delayed traffic (where packets are sent from within softirq context rather than from the context of the originating task). However, according to DaveM, Thomas's approach covers the cases that most people care about.
Thomas Graf: libnl (Netlink library)
Thomas reported progress on libnl, including the new extended-match support that allows the rules to use protocol field names rather than byte offsets. This change is likely to be quite welcome.
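To show why field names are friendlier than byte offsets, here is a small sketch that resolves names to (offset, length) pairs in an option-less IPv4 header. The field names and helper functions are hypothetical and do not reflect libnl's actual match syntax; only the IPv4 header offsets themselves are standard.

```python
import struct

# Hypothetical names mapped to (byte offset, length) in an IPv4 header
# without options. Illustrative only, not libnl's syntax.
IPV4_FIELDS = {
    "ip.ttl": (8, 1),
    "ip.proto": (9, 1),
    "ip.src": (12, 4),
    "ip.dst": (16, 4),
}

def field_value(header, name):
    offset, length = IPV4_FIELDS[name]
    return int.from_bytes(header[offset:offset + length], "big")

def matches(header, name, expected):
    return field_value(header, name) == expected

# A 20-byte header: version/IHL, TOS, length, id, frag, TTL=64, proto=17 (UDP),
# checksum (left zero), source 10.0.0.1, destination 10.0.0.2.
header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20, 0, 0, 64, 17, 0,
                     bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))

print(matches(header, "ip.proto", 17))
print(hex(field_value(header, "ip.src")))
```

A rule written as "ip.proto equals 17" stays readable and correct, whereas "byte 9 equals 17" requires the rule author to remember the header layout.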
Gerrit Renker: DCCP (Datagram Congestion Control Protocol).
Gerrit reported on progress with DCCP, a protocol that is unusual in that it has been "synthesized in the lab" rather than being refined through experience, bakeoffs, and consumption of large quantities of alcohol, as was the process used for TCP and SCTP. DCCP has not seen great uptake, with the result that the only surviving in-kernel implementation is in Linux. Nevertheless, a number of applications, including GStreamer, have been ported to DCCP, and a number of people are actively working to improve it. Most notably, a group in Italy has applied formal control-theory results to DCCP's CCID-3 protocol to obtain a simple and effective congestion-avoidance algorithm that is expected to allow CCID-3 to dispense with high-resolution timers, thereby increasing its efficiency.
It is expected that increased application usage of DCCP may eventually require expanding the kernel/user interface to pass timing information from DCCP to the user application. Such a change could well permit DCCP to come into its own as a first-class production-quality protocol suite for time-sensitive multi-media applications.
No really compelling reasons to use these vs. TCP or UDP. But there are 251 CCIDs left to implement!!! :-)
IETF let this through. [IETF has certainly changed a lot in 20 years!!!]
PJ: DCB (Data Center Bridging). IEEE standard.
PJ gave a rundown of DCB, which can be very roughly thought of as a member of the DCE family. PJ discussed tagging traffic via VLAN egress, which might simplify filtering setup by avoiding the need for filtering rules that are aware of both TCP/IP and Ethernet header fields. Another improvement proposed was to bypass qdisc when empty, to avoid the qdisc chokepoint (but see DaveM's talk).
Tag traffic via VLAN egress, simplifying filtering setup. Can filter based on ethertype, for example. Avoids the need to make filtering rules that are aware of both Ethernet and TCP/IP headers.
Problem is that completion interrupts might happen on other CPUs. Possible mitigation: move clean-up to transmit side as Chelsio folks do.
But real fix is to more carefully distribute the traffic so that CPUs don't go after each others' locks.
Dave Miller: Linux Multiqueue Networking
Dave gave a compressed version of his NYLUG talk on multiqueue networking. This work parallelizes the networking stack in almost all cases -- remaining cases include the complex-qdisc scenario discussed in Jesper's talk, though backwards compatibility is important given that qdisc API changes touch something like 450 drivers.
Changes include interrupt mitigation, rework of NAPI, keeping queue-selection state in the SKB (thus avoiding the need to acquire locks to access or change this state), and many others besides. Future challenges include wakeup mapping (perhaps using some of the tricks that Jens Axboe is applying in his block-I/O work), Tom Herbert's per-device packet-steering table, a software version of Intel's flow director, and any number of changes to accommodate the increasingly common virtualized environments.
Prior to this... Driver writers would instantiate multiple dummy devices just to get multiple NAPI contexts.
But most people use simple stateless qdisc.
PJ: DCB does weighted round robin, and people are moving towards hardware shaping. [Put simple fast stuff in hardware, keep more elaborate qdiscs in SW?]
Flow-control API used by 450 drivers, so backwards compatibility is critically important.
Now: replicate queues, qdiscs, and TX locks over the available queues on the device.
But: complex qdiscs force traffic-shaping choke point. [Hopefully, value-caching approaches fix this.]
Small hash table, no chaining. Track transmits, correlate TX and RX.
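One reading of this note is a direct-mapped flow table: each flow hashes to exactly one slot, a collision simply overwrites the old entry, and the CPU recorded at transmit time is looked up when the reply arrives. The sketch below follows that reading; the class, the key normalization, and the use of CRC32 as the hash are assumptions made for illustration.

```python
import zlib

def flow_key(endpoint_a, endpoint_b):
    """Direction-independent key so TX and RX of one flow hash alike."""
    return tuple(sorted((endpoint_a, endpoint_b)))

class FlowTable:
    def __init__(self, size=256):
        self.size = size
        self.slots = [None] * size          # each slot: (key, cpu) or None

    def _index(self, key):
        return zlib.crc32(repr(key).encode()) % self.size

    def note_transmit(self, key, cpu):
        # No chaining: a colliding flow just replaces the previous entry.
        self.slots[self._index(key)] = (key, cpu)

    def cpu_for_receive(self, key, default_cpu=0):
        entry = self.slots[self._index(key)]
        if entry is not None and entry[0] == key:
            return entry[1]
        return default_cpu

table = FlowTable()
table.note_transmit(flow_key(("10.0.0.1", 5000), ("10.0.0.2", 80)), cpu=3)
print(table.cpu_for_receive(flow_key(("10.0.0.2", 80), ("10.0.0.1", 5000))))
```

Losing an entry to a collision only costs a less-optimal CPU choice for that flow, which is what makes the chain-free design acceptable.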
Google would prefer using hardware-generated RSS hash value.
TCAM (ternary content-addressable memory) might also be applied to this scenario.
Bridging through hypervisor... Might not be so high overhead, but need GRO/GSO/&c. KVM issues would remain.
Herbert Xu: Bridging and Multicasting
Although bridging currently satisfies most needs, even in virtualized environments, and even in conjunction with multicasting, the combination of the three can be inefficient. The problem is that the Linux kernel's bridging drivers are unaware of multicast state, and therefore simply flood all multicast packets out all interfaces (except of course for the interface on which the packet was received). One solution would be to leverage IGMP (Internet Group Management Protocol) to allow the bridging drivers to send multicast packets only where needed.
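The difference between flooding and IGMP-informed forwarding can be sketched with a toy bridge. The class and method names are hypothetical, and details such as router ports, timers, and query handling are deliberately left out.

```python
class ToyBridge:
    def __init__(self, ports):
        self.ports = set(ports)
        self.members = {}       # group address -> ports with listeners

    def igmp_join(self, port, group):
        self.members.setdefault(group, set()).add(port)

    def igmp_leave(self, port, group):
        self.members.get(group, set()).discard(port)

    def flood(self, in_port):
        """Current behavior: every port except the one the packet came in on."""
        return self.ports - {in_port}

    def forward_multicast(self, in_port, group):
        """With multicast state: only ports that joined the group."""
        if group not in self.members:
            return self.flood(in_port)      # no state yet: fall back to flooding
        return self.members[group] - {in_port}

bridge = ToyBridge(["eth0", "vnet0", "vnet1", "vnet2"])
bridge.igmp_join("vnet1", "239.1.1.1")
print(sorted(bridge.flood("eth0")))
print(sorted(bridge.forward_multicast("eth0", "239.1.1.1")))
```

With one listener among three guests, the packet is copied once instead of three times, which is where the savings in virtualized setups would come from.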
Some interesting applications: IPTV, but mostly banks.
So, thinking of implementing IGMP routing protocol for the bridging driver. SCH: how about -associated- -with- the bridging driver rather than -in- the bridging driver???
HX: Yes, to be cleanly implemented.
Herbert Xu: GRO (Generic Receive Offload)
GRO has been quite successful, but Herbert sees several ways to usefully expand on it. One example is receive steering, directing packets to the CPU on which the destination thread is running. Another example is to reduce processing cost by short-circuiting TCP ACK processing, as opposed to the current practice of running all TCP ACK packets through the full protocol stack. A final example is the creation of monstergrams, allowing a large group of packets from the same connection to be run through the protocol stack as a unit.
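The core idea of running a group of packets through the stack as a unit can be sketched as a merge of adjacent in-order segments from the same flow. This is a simplified illustration with a made-up segment representation, not the kernel's GRO logic, which checks many more header fields before merging.

```python
def coalesce(segments):
    """Merge runs of consecutive, in-order segments from the same flow.
    Each segment is a (flow, seq, payload_bytes) tuple."""
    merged = []
    for flow, seq, payload in segments:
        if merged:
            last_flow, last_seq, last_payload = merged[-1]
            if flow == last_flow and seq == last_seq + len(last_payload):
                merged[-1] = (last_flow, last_seq, last_payload + payload)
                continue
        merged.append((flow, seq, payload))
    return merged

segments = [("a", 0, b"xx"), ("a", 2, b"yy"), ("b", 0, b"z"), ("a", 4, b"ww")]
for packet in coalesce(segments):
    print(packet)
```

Here four segments become three packets; the last segment of flow "a" is not merged because a packet from flow "b" arrived in between.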
Given fixed MTU and ever-increasing bandwidths, the opportunity (and need) for such tricks can be expected to increase over time.
NAPI-GRO receive -- separate entry point allows delay to be introduced, permitting additional packets to arrive.
Peter P. Waskiewicz Jr. (PJ): I/O MMU
Much discussion of possible hardware and software optimizations for I/O MMUs, which incur high overheads. It is safe to say that more work will be required here, both in hardware and in software.
SCH: But user might not read the buffer for a long time...
DM: Some NICs have had a problem where certain conditions can cause the associated buffer to be forevermore useless. IO MMU very expensive to change.
DM: If you have direct HW access to IO MMU, you can run through the registers, and only touch hardware when you have overflowed. Can also have hardware update the status block intermittently.
DM: Timestamp-overflow issue. Does a zero-valued timestamp mean "I don't have a timestamp" or "invalid value for timestamp"? GRO error? HX: Just a literal comparison! DM: Let me find it... LRO issue. But standard is not clear.
DM: need list of GRO tests so that hardware guys can make something compatible.
SCH: convert drivers for GRO/NAPI? HX: not for 1Gb. Certainly for 10Gb. SCH: what about virtualized drivers?
Other miscellaneous topics: