Netconf 2009 minutes, Part 1.

Attendees:

Stephen Hemminger, Vyatta: Bridging
Jesper Dangaard Brouer, ComX Networks A/S: 10Gb routing
Paul E. McKenney, IBM: Note taker
Bob Gilligan, Vyatta: Performance data
Thomas Graf, Red Hat: cgroups and packet classification. libnl v2.0.
Pavel Emelyanov, OpenVZ:
Arnaldo Carvalho de Melo, Red Hat: batch receive.
Wei Yongjun, Fujitsu. SCTP testing and development. (Listen)
Gerrit Renker, Swiss National Supercomputing Center (Linux as hobby): DCCP (Datagram Congestion Control Protocol).
Peter P. Waskiewicz Jr. (PJ), Intel. 10Gb scaling and NUMA. Jesse Brandeburg.
Jeff Kirsher, Intel.
Dave Miller, Red Hat. "I tell you what to do, in return I merge all your crap." State of tree, what Eric Dumazet is up to.
Soyoung Park, pointing at DaveM. "I tell him what to do."
Herbert Xu, Red Hat. Tx, multicast with bridging.
Andy Grover, Oracle. RDS (Reliable Datagram Sockets) over Infiniband.

Saturday, September 19, 2009.

Arnaldo Melo: Batch Datagram Receiving

Summary:

Reduce per-packet overhead on receive by batching packets, so that protocol-stack overhead is amortized over all packets making up a given batch. This is accomplished via a change to the syscall layer allowing an iovec to be passed to the message-receive syscall, which returns either the number of messages received or an error. The lower UDP layers were changed to reduce locking overhead.

Although financial institutions are said to be intensely interested in this optimization, they are unwilling to share performance results. Fortunately, Nir Tzachar tested on 1Gb/s hardware and noted a latency reduction for 100-byte packets from 750us to 470us. For larger packets, Nir noted that throughput doubled. Future work might include batching on the send side, but this requires lots of consecutive sends to the same destination, which appears to be a bit more unusual.

Details:

Financial institutions interested -- lots of small packets, multicast. Receive multiple datagrams per syscall. Sending multiple datagrams per syscall is future work.
Upper layer: Pass iovec with multiple operations. Timeouts. Reduces user/kernel overhead, also that of SELinux. (Suggested by Paul Moore.)
API returns number of datagrams or error. If partial fill followed by error, get good return for correctly received datagrams, followed by error on next call. Timeout for all receives available -- "give me 20 or whatever shows up in the next millisecond."
Lower layer: UDP locking changed, release at end of full batch, not on each packet free.
Nir Tzachar testing. 1Gb/s. 100-byte packets in batches of 30 reduces latency from 750us to 470us. For larger packets, get double the throughput.
"Every networking problem solved with GRO or GSO". GRO: Generic Receive Offload. GSO: Generic Segmentation Offload.
Detailed performance statistics. 1M samples.
To do: sendmsg... But would need lots of packets to same destination.
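
The return convention described above (a count on success, and an error held back until the next call when a batch was partially filled) can be modelled in user space. This is only a toy sketch of those semantics, not the kernel interface: the class and method names are invented for illustration, the timeout is left out, and the caller gets the datagrams as a list whose length is the count.

```python
from collections import deque


class BatchReceiver:
    """Toy model of the batch-receive return convention (not a kernel API)."""

    def __init__(self, events):
        # Each event is either a datagram (bytes) or an Exception instance.
        self.events = deque(events)
        self.deferred_error = None

    def recv_batch(self, max_msgs):
        """Return up to max_msgs datagrams, or raise if an error comes first."""
        # An error that cut the previous batch short is reported now.
        if self.deferred_error is not None:
            err, self.deferred_error = self.deferred_error, None
            raise err
        batch = []
        while self.events and len(batch) < max_msgs:
            event = self.events.popleft()
            if isinstance(event, Exception):
                if not batch:
                    raise event
                # Partial fill: hand back the good datagrams first.
                self.deferred_error = event
                break
            batch.append(event)
        return batch
```

With a queue of two datagrams, an error, and a third datagram, the first call returns the two datagrams, the second call raises the error, and the third call returns the remaining datagram.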

Herbert Xu: Tx interrupt mitigation

Summary:

Networking bandwidths have increased by three orders of magnitude over the past 25 years, but the MTU remains firmly fixed at the ancient value of 1500 bytes. This means that per-packet overhead per unit time has also increased by three orders of magnitude during that time.

The obvious solution would be to increase the MTU size, but the tendency to drop ICMP packets defeats MTU discovery, so that connections spanning the Internet are still required to use small MTU values. In addition, many important applications use small packets, and thus cannot benefit from increased MTU. Finally, jumbograms can increase queueing latencies within the Internet, which degrades response times.

So we need to live with small packets, and one way to do so is to decrease interrupt overhead. NAPI (the not-so-new API) has done this for some time, but only for the receive side. At 10Gb/s speeds, we must also deal with the transmit side. Herbert has implemented a work-withholding approach in which completion interrupts are requested only 3-4 times per transmit-ring traversal, as opposed to on each and every packet. But this approach does not work for virtual NICs, since there is not sufficient per-packet transmission delay to allow work to accumulate. However, Herbert noted that very few transmitters want or need timely completion notification, so he is proposing a boolean flag in the skb structure indicating whether this particular packet requires a completion interrupt.
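
As a rough illustration of the two ideas in this summary, the sketch below picks which transmit-ring slots would ask for a completion interrupt: a few evenly spaced slots per ring traversal, plus any slot whose packet is flagged as needing completion. The function name, its parameters, and the way the two ideas are combined are assumptions made for illustration; the minutes do not say how a driver would implement this.

```python
def completion_interrupt_slots(ring_size, per_traversal=4, flagged=()):
    """Return the ring slots whose descriptors ask for a Tx completion interrupt."""
    # Evenly spaced slots give roughly per_traversal interrupts per ring pass
    # (exactly that many when ring_size is a multiple of per_traversal).
    stride = max(1, ring_size // per_traversal)
    slots = {i for i in range(ring_size) if (i + 1) % stride == 0}
    # Packets whose sender asked for completion feedback always get one.
    slots.update(flagged)
    return sorted(slots)


# A 256-entry ring interrupts at 4 slots instead of 256.
slots = completion_interrupt_slots(256)
```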

Details:

10Gb performance. Ethernet showing its age, especially 1500 byte MTU. ;-)
Stephen: Would not need to mess with the congestion control had packet sizes scaled with wire speed.
1000x more packets per second due to constant MTU. Segmentation and reassembly overhead increases with wirespeed.
Jumbo frames do help, except that the Internet doesn't like large frames. Need MTU discovery, but many firewalls eat ICMP, which defeats MTU discovery. :-(
TSO is one way to reduce per-packet cost by essentially increasing MTU within host (GSO generalizes this in software). GRO does same on receive side.
But still need to decrease interrupt overhead. Want to batch up the interrupts. Been there for receive for quite some time -- "NAPI". 10Gb Enet imposes same problem on transmit side.
100-byte packets at 10Gb comes to an interrupt every 100ns or so at wire speed...
On receive, just poll at end of interrupt processing, thus eliminating the need for other interrupts. But need to make sure to re-enable interrupts eventually...
UDP needs to be careful, as there is no congestion control. (Though current wire speeds are making this a non-problem.)
Need local flow control in the general case -- especially when you have multiple transmitting sockets, and fair access is required.
But cannot predict the future. So keep a ringbuffer, and exponentially increase the batching. This works fine for hardware, where packet-transmission delays allow work to accumulate. But in conjunction with flow-control layers, you don't get any time to accumulate work. Same thing happens with virtualization (thank you, Rusty!).
KVM hack, "tx mitigation", postpones the work until some time later. Initial postponement was for 2ms, but high-resolution timers have helped somewhat. Not as nice as dedicated hardware, but much better than 2ms timers.
And hardware will likely continue to get faster, and so the "virtualization problem" will likely hit real hardware soon.
One trick for things like UDP is to dispense with completion interrupts for most traffic -- routers don't care about completion, for example. So have a per-packet flag that says whether sender cares about completion feedback. Note that hardware will still be within its rights to delay this feedback, permitting the feedback for multiple senders and streams to be batched up into a single interrupt.
When utilization is low, a greater number of interrupts can be tolerated -- and is desired, in order to reduce latency. When utilization is high, aggressively mitigate interrupts in order to increase throughput.
SH: Swapping over NFS requires special care, as you can livelock system due to OOM issues.
In virtualized environments, can use hypervisor as the private pool used to get guests out of trouble. If apps running on host, then host must avoid swapping.
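
The packet-rate figures in these notes follow from simple arithmetic. The sketch below reproduces them, ignoring Ethernet framing overhead (preamble and inter-frame gap), which is a simplification; the function names are illustrative. The 10Mb/s starting point for the 1000x comparison is also an assumption, since the notes give only the ratio.

```python
def packet_interval_ns(packet_bytes, link_bps):
    """Time between back-to-back packets at wire speed, in nanoseconds."""
    return packet_bytes * 8 * 1_000_000_000 / link_bps


def packets_per_second(packet_bytes, link_bps):
    """Packets per second at wire speed for a fixed packet size."""
    return link_bps / (packet_bytes * 8)


# 100-byte packets at 10Gb/s: one packet (and, unmitigated, one interrupt)
# every 80ns, in line with the "100ns or so" quoted above.
interval = packet_interval_ns(100, 10_000_000_000)

# With the MTU held at 1500 bytes, going from 10Mb/s to 10Gb/s multiplies
# the packet rate by about 1000.
growth = packets_per_second(1500, 10_000_000_000) / packets_per_second(1500, 10_000_000)
```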

Stephen Hemminger: Bridging

Summary:

Bridging is now receiving much more attention due to its new-found uses in virtual environments. The setup for these environments is mostly automated, and works quite well, except for the spanning-tree implementation, which can take up to 30 seconds to sync up. RSTP (Rapid Spanning Tree Protocol) would be a great improvement, and it also better handles leaf nodes. There is an rstplib, but it is only occasionally used, mostly by embedded people: problems include lack of distro uptake and difficulty keeping user and kernel versions synchronized.

EMC has an RSTP implementation, which is now in the repository, and which EMC wants to replace with GPLed code from VMWare, which is now also in the repository.

Much discussion of VEPA (Virtual Ethernet Port Aggregator), especially regarding the need for solutions that work across a wide range of hardware.

Details:

Lots of "playing" earlier, but now mostly used for virtualization.
Setup mostly automated, occasional users need special handling.
Need better spanning-tree protocol: RSTP (Rapid Spanning Tree Protocol). Improves on 30-second sync-up time, better handling of leaf nodes. RSTP code based on rstplib from researchers. Converted to user-mode library, occasional embedded use, no distro uptake. (Though Stephen has created a Debian package.)
Userspace difficult due to need to keep kernel and user-space library in sync. In theory better security, but...
EMC coded up a version, which is now in the repository. But EMC wants to replace with GPLed code from VMWare.
VEPA (Virtual Ethernet Port Aggregator). But want solutions to work across a wide range of hardware. Lots of competing patches and approaches.
Link detection pretty much there, but MTU issues remain. Bonding issues with min-MTU. Need to make sure that optimizations like GRO all work correctly.

Jesper Dangaard Brouer: 10Gb bi-directional routing

Summary:

Jesper described the use by ComX (a Danish ISP) of a Linux box as a 10Gbit/s Internet router, as an alternative to a proprietary solution that is an order of magnitude more costly. Jesper's testing achieved full wire speed on dual-interface unidirectional workloads for packet sizes of 420 bytes or larger, and on dual-interface bidirectional workloads for packet sizes of 1280 bytes or larger. He also showed good results on Robert Olsson's internet-traffic benchmark.

Perhaps needless to say, careful choice of hardware and careful tuning are required to achieve these results.

The bottom-line findings are that 10Gbit/s bi-directional routing is possible, but that we are limited by packets-per-second processing power; there is still memory bandwidth available.

Details:

Short summary: Linux Network stack scales with CPUs.
ComX Networks A/S: Danish Fiber Broadband provider. Motivation: 10x cheaper solution with Linux.
"Normal" Internet router. 2-port 10 Gbit/s router. Bidirectional, total of 40Gb/s through the interfaces. Can't use jumbo frames. Must withstand small-packet DoS attacks.
PCI-Express marketing numbers looked good: 160Gb/s. But in reality, only get about 54Gb/s.
20% encoding overhead (8b/10b encoding). But PCIe generation 2 does 4Gbit/s per lane. 128 byte packets, which means 16% additional overhead. 26.88Gbit/s. In addition, have PCI traffic to set up DMA addresses, things round up to cache-line sizes...
Budgetary issues forced low-end approach: gaming hardware rather than server-class hardware.
CPU: Core i7 (920) vs. Phenom II X4 (940) RAM: DDR3 vs. DDR2
Raw memory bandwidth sufficient in all cases.
Network cards:
- Sun Neptune: only 16Gb/s -> 13.44Gb/s real life.
- SMC Networks: hardware queue issues, cannot parallelize.
- Intel 82599 (ixgbe): very fast!!! HotLava 6-port 10GbE doesn't require external power! 12-port 1GbE card requires external power, plus exceeds PCIe power limitations... So HotLava systems take their name seriously...
Intel NIC and CPU Core i7 with 1333MHz DDR3 memory with Quickpath tuned properly (6.4GT/s). Single socket
AMD did one-way 10Gb/s. Suspect HyperTransport limitations, as varying HT clock frequency varied performance.
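
The PCIe and Sun Neptune figures above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the 5GT/s per-lane signalling rate is the PCIe generation 2 figure, and the lane count of 8 is an assumption chosen because it reproduces the 26.88Gbit/s number; the notes do not state the link width.

```python
def after_overhead(raw_gbps, overhead_percent):
    """Throughput left after removing a percentage of overhead."""
    return raw_gbps * (100 - overhead_percent) / 100


GT_PER_LANE = 5   # PCIe generation 2 signalling rate per lane, in GT/s
LANES = 8         # assumed link width, not stated in the notes

per_lane = after_overhead(GT_PER_LANE, 20)        # 8b/10b encoding: 4 Gbit/s
one_way = after_overhead(per_lane * LANES, 16)    # 128-byte payloads: 26.88 Gbit/s
both_ways = 2 * one_way                           # about 54 Gbit/s in total
neptune = after_overhead(16, 16)                  # Sun Neptune: 13.44 Gbit/s
```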
To achieve these results:
- Distribute load across CPUs
- Enable "multiqueue" with separate irq per queue, rx & tx. Uses -lots- of irqs.
- RX path: NIC computes hash: RSS (receive-side scaling)
- Lots of TX qdisc API hack, backwards compatible. Each TX queue gets its own qdisc to avoid qdisc scalability issues.
  [discussion of solving serialization issue with bw shapers, applying per-CPU value-caching tricks to allow scalable traffic shaping. 1Mb/s at 1500 byte packets give 83 interactions per Mb. So hand out (say) 10 as requested. Vary based on number of CPUs and size of share.]
- Affinity irqs to spread load over CPUs.
  Use "mpstat -A -P ALL" to validate irq spreading.
  Make sure that corresponding RX and TX queues are on same CPU.
  - Three usage cases, for staying on same CPU
  - Forwarding (RXq to TXq other NIC): record RX queue number and use it at TX queue.
  - Server (RXq to TXq): cache socket info (thanks to Eric Dumazet).
  - Client (TXq to RXq): Hard!!! Need to use the flow director in the 10GbE Intel 82599 NIC.
- Be skeptical about generators: First runs unidirectional at wire speed for packet sizes of 768 bytes or larger. But limited by generator! Note that pktgen has some known limitations -- want to run with delay zero, otherwise pktgen does per-frame gettimeofday(). Also need faster NICs on the packet-generating systems. Stephen Hemminger is looking into optimizing pktgen.
  With faster packet generators, generators get close to wire speed -- at wire speed for 420-byte packets and larger.
  Some limitations due to memory bandwidth -- may need more NUMA-awareness in drivers and possibly also the network stack... Will need API from driver to give preferred buffer/queue layout in memory.
- The bidirectional throughput hits wirespeed at 1280-byte packets. Gracefully degrades for smaller packet sizes. Hitting 60,000 interrupts per second when tuned. >200,000 interrupts per second if untuned...
- Also tried Robert Olsson's internet traffic pattern. ("Open-source routing at 10Gb/s")
  9.4-9.5Gb/s uni-directional.
  11.6 Gb/s bi-directional. Preferentially dropping large packets. ;-)
  Chipset issues result in artifacts due to cacheline size.
- At 64 byte packets achieved 3.8Mpps forwarding!!! (pps = Packets Per Second)
- Bottom line: We are limited by Packets Per Second processing power.
- Future: use more queues than CPUs to implement QoS. Even better, use per-socket queues -- but need thousands, or even millions of queues.
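
The value-caching idea mentioned in the bandwidth-shaper discussion above can be sketched as follows: a shared budget is protected by one lock, and each CPU takes tokens from it in chunks of 10, so the lock is touched once per chunk rather than once per packet. All class and function names are invented for illustration, and a real shaper would also refill the budget over time, which is omitted here.

```python
import threading


def packets_per_second(rate_bps, packet_bytes):
    """Whole packets per second that fit in the given rate."""
    return rate_bps // (packet_bytes * 8)


class SharedBudget:
    """Global packet budget guarded by a single lock."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.grabs = 0                  # number of lock acquisitions
        self._lock = threading.Lock()

    def grab(self, want):
        with self._lock:
            self.grabs += 1
            granted = min(want, self.tokens)
            self.tokens -= granted
            return granted


class CpuCache:
    """Per-CPU token cache; touches the shared lock only when empty."""

    def __init__(self, shared, chunk=10):
        self.shared = shared
        self.chunk = chunk
        self.local = 0

    def try_send(self):
        if self.local == 0:
            self.local = self.shared.grab(self.chunk)
        if self.local == 0:
            return False                # budget exhausted
        self.local -= 1
        return True


# 1Mb/s at 1500-byte packets gives a budget of 83 packets per second.
budget = SharedBudget(packets_per_second(1_000_000, 1500))
```

A single CpuCache draining this budget sends 83 packets while taking the shared lock only 10 times. The chunk size would vary with the number of CPUs and the size of the share, as noted in the discussion.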

Thomas Graf: Control Groups (cgroups) Networking Subsystem

Summary:

Thomas described his extension of cgroups to cover networking. The administrator can create networking classIDs and assign them to cgroups. These classIDs can then be used by the traffic classifier.

This approach does not cover incoming traffic, nor does it cover delayed traffic (where packets are sent from within softirq context rather than from the context of the originating task). However, according to DaveM, Thomas's approach covers the cases that most people care about.

Details:

- Shape/drop/route packets based on cgroup.
- Create cgroups, assign tasks to cgroups, assign classID to cgroup. Traffic classifier can classify based on classID.
- Can still use other classification criteria.
- Red Hat whitepaper "Controlling Network Resources Using Control Groups".
- Does not cover incoming traffic, does not necessarily cover delayed traffic -- need to still be in the sending process's context to be able to use the cgroup classID. But does cover the cases most people care about, says DaveM.
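
For readers wiring a classID to a traffic-control class, the helper below converts between a tc class handle such as "10:1" and the 32-bit number that the kernel's net_cls controller expects in a cgroup's net_cls.classid file (upper 16 bits major, lower 16 bits minor, both hexadecimal). The file name and encoding come from the kernel's net_cls documentation rather than from these minutes, and the function names are illustrative.

```python
def tc_handle_to_classid(handle):
    """Convert a tc handle like '10:1' (hex major:minor) to a 32-bit classID."""
    major_text, minor_text = handle.split(":")
    major, minor = int(major_text, 16), int(minor_text, 16)
    if not (0 <= major <= 0xFFFF and 0 <= minor <= 0xFFFF):
        raise ValueError("major and minor must each fit in 16 bits")
    return (major << 16) | minor


def classid_to_tc_handle(classid):
    """Inverse of tc_handle_to_classid."""
    return f"{classid >> 16:x}:{classid & 0xFFFF:x}"


# Handle 10:1 corresponds to classID 0x100001.
classid = tc_handle_to_classid("10:1")
```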

Thomas Graf: libnl (Netlink library)

Summary:

Thomas reported progress on libnl, including the new extended-match support that allows the rules to use protocol field names rather than byte offsets. This change is likely to be quite welcome.

Details:

- Documentation, thread-safe, split into sub-libraries.
- Reworked qdisc/class/classifiers support, extended matches, automake, bug fixes.
- The extended matches now know about protocol fields and offsets.
- Goal for next release: maintain stable API.

Gerrit Renker: DCCP (Datagram Congestion Control Protocol).

Summary:

Gerrit reported on progress with DCCP, a protocol that is unusual in that it has been "synthesized in the lab" rather than being refined through experience, bakeoffs, and consumption of large quantities of alcohol, as was the process used for TCP and SCTP. DCCP has not seen great uptake, with the result that the only surviving in-kernel implementation is in Linux. Nevertheless, a number of applications, including GStreamer, have been ported to DCCP, and a number of people are actively working to improve it. Most notably, a group in Italy has applied formal control-theory results to DCCP's CCID-3 protocol to obtain a simple and effective congestion-avoidance algorithm that is expected to allow CCID-3 to dispense with high-resolution timers, thereby increasing its efficiency.

It is expected that increased application usage of DCCP may eventually require expanding the kernel/user interface to pass timing information from DCCP to the user application. Such a change could well permit DCCP to come into its own as a first-class production-quality protocol suite for time-sensitive multi-media applications.

Details:

- DCCP originally by Arnaldo. Only surviving in-kernel implementation.
- TCP and SCTP refined through experience, with many bake-offs and incremental improvements. DCCP synthesized in lab. For example, difficulties routing it through Internet.
- CCID-2: TCP with datagrams. CCID-3: UDP with token-bucket filter. CCID-4: CCID-3 with different parameters, perhaps for VOIP.
  No really compelling reasons to use these vs. TCP or UDP. But there are 251 remaining CCIDs left to implement!!! :-)
  IETF let this through. [IETF has certainly changed a lot in 20 years!!!]
- Test tree at Aberdeen.
- ECN/ECT(0) patches for DCCPv4/6. CCID-4 (RFC 5622) in development in Brazil. New CCID-3 algorithm from Italy, applying control theory. Hopefully dispense with need for high-res timers.
- A number of applications ported to DCCP, including GStreamer. These might require additional information piped to user space, for example, timing information.

PJ: DCB (Data Center Bridging). IEEE standard.

Summary:

PJ gave a rundown of DCB, which can be very roughly thought of as a member of the DCE family. PJ discussed tagging traffic via VLAN egress, which might simplify filtering setup by avoiding the need for filtering rules that are aware of both TCP/IP and Ethernet header fields. Another improvement proposed was to bypass qdisc when empty, to avoid the qdisc chokepoint (but see DaveM's talk).

Details:

- RDMA over Infiniband. Can configure via netlink layer. Tools are quite rough, considering putting this function in ethtool for FCoE (Fibre Channel over Ethernet).
  Tag traffic via VLAN egress, simplifying filtering setup. Can filter based on ethertype, for example. Avoids the need to make filtering rules that are aware of both Ethernet and TCP/IP headers.
- DCB uses 4-bit field, some contention for bits.
- Page-based receive using packet-split approach. But cannot split packets across pages.
- Bypassing qdisc when empty (Krishna patches)
  - Move queuing into drivers in empty-qdisc case.
  - HX: why contention if per-CPU queues in default case?
    Problem is that completion interrupts might happen on other CPUs. Possible mitigation: move clean-up to transmit side as Chelsio folks do.
    But real fix is to more carefully distribute the traffic so that CPUs don't go after each others' locks.

Dave Miller: Linux Multiqueue Networking

Summary:

Dave gave a compressed version of his NYLUG talk on multiqueue networking. This work parallelizes the networking stack in almost all cases -- remaining cases include the complex-qdisc scenario discussed in Jesper's talk, though backwards compatibility is important given that qdisc API changes touch something like 450 drivers.

Changes include interrupt mitigation, rework of NAPI, keeping queue-selection state in the SKB (thus avoiding the need to acquire locks to access or change this state), and many others besides. Future challenges include wakeup mapping (perhaps using some of the tricks that Jens Axboe is applying in his block-I/O work), Tom Herbert's per-device packet-steering table, a software version of Intel's flow director, and any number of changes to accommodate the increasingly common virtualized environments.

Details:

- End of Moore's Law frequency scaling. More networking flows per system. Single-queue/stream approach no longer works. Need multiple queues.
- End nodes (servers) vs. intermediate nodes (routers, firewalls).
  - Intermediate nodes: good flow distribution, packets remain in networking layer.
  - End nodes: good flow distribution, but those pesky applications get in the way. Need to respond to application needs.
- MSI and MSI-X interrupts, RSS (receive-side scaling) flow hashing, multiqueue functions, stateless flow distribution.
- NAPI (Not-so-new API). Interrupt mitigation scheme, disable interrupts if enough packets remain on the interface card. Use DRR (deficit round robin) among cards.
- Initial NAPI was not separable, Stephen extracted NAPI state into separate structure. Multiple per-NIC RX queues can be handled by multiple NAPI instances.
  Prior to this... Driver writers would instantiate multiple dummy devices just to get multiple NAPI contexts.
- Packet scheduler. Very flexible, but single lock. Not SMP-friendly. Cannot share root qdisc device.
  But most people use simple stateless qdisc.
  PJ: DCB does weighted round robin, and people are moving towards hardware shaping. [Put simple fast stuff in hardware, keep more elaborate qdiscs in SW?]
  Flow-control API used by 450 drivers, so backwards compatibility is critically important.
- Keep queue-selection state in SKB. Queue-selection function depends on packet origin. Problem: unequal RX/TX devices.
  Now: replicate queues, qdiscs, and TX locks over the available queues on the device.
  But: complex qdiscs force traffic-shaping choke point. [Hopefully, value-caching approaches fix this.]
- Some difficulties in wakeup mapping. Possibly use some tricks that Jens Axboe is applying to the block-I/O subsystem, some of which got 10% improvement at system level. DaveM has prototyped it, but dropped it in favor of multiqueue hardware.
- Tom Herbert of Google uses per-device packet-steering table that is set via sysctl.
- Another approach: software version of Intel's flow director. Space, time, locality issues.
  Small hash table, no chaining. Track transmits, correlate TX and RX.
  Google would prefer using hardware-generated RSS hash value.
- Virtualization: virtual non-multiqueue NICs.
  TCAM (ternary content-addressable memory) might also be applied to this scenario.
  Bridging through hypervisor... Might not be so high overhead, but need GRO/GSO/&c. KVM issues would remain.
- PJ: numerous VM-to-VM benchmarking efforts.
- DaveM re-implemented multi-queue -three- times... ;-)
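
The software flow-director idea noted above (a small hash table with no chaining that tracks transmits and correlates them with receives) might look roughly like the sketch below. It is a guess at the general shape, not the design discussed at the meeting: crc32 stands in for whatever hash the NIC or stack would supply, and all names are illustrative.

```python
import zlib


def reverse(flow):
    """Swap source and destination so an RX packet matches its TX flow."""
    src_ip, src_port, dst_ip, dst_port = flow
    return (dst_ip, dst_port, src_ip, src_port)


class FlowTable:
    """Fixed-size table mapping transmitted flows to the sending CPU."""

    def __init__(self, size=256):
        self.size = size
        self.slots = [None] * size      # each slot holds (flow, cpu) or None

    def _index(self, flow):
        # crc32 is a stand-in for a hardware-generated or stack hash.
        return zlib.crc32(repr(flow).encode()) % self.size

    def note_transmit(self, flow, cpu):
        # No chaining: a colliding flow simply overwrites the slot.
        self.slots[self._index(flow)] = (flow, cpu)

    def steer_receive(self, rx_flow, default_cpu=0):
        """Pick the CPU that last transmitted on this connection, if known."""
        tx_flow = reverse(rx_flow)
        entry = self.slots[self._index(tx_flow)]
        if entry is not None and entry[0] == tx_flow:
            return entry[1]
        return default_cpu
```

Overwriting on collision keeps the table small and lock-light at the cost of occasionally steering a flow to the default CPU, which is one way to read the "space, time, locality" trade-off mentioned above.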

Herbert Xu: Bridging and Multicasting

Summary:

Although bridging currently satisfies most needs, even in virtualized environments, and even in conjunction with multicasting, the combination of the three can be inefficient. The problem is that the Linux kernel's bridging drivers are unaware of multicast state, and therefore simply flood all multicast packets out all interfaces (except of course for the interface on which the packet was received). One solution would be to leverage IGMP (Internet Group Management Protocol) to allow the bridging drivers to send multicast packets only where needed.

Details:

- "Bridging does what most people want, even with multicasting."
  Some interesting applications: IPTV, but mostly banks.
- Main use of bridging is virtualization -- bridging used to connect multiple guests to single networking devices.
- Can use multicasting over bridges, it works -- but it simply floods to all other possible destinations. Strongly suboptimal, as it is often the case that the multicast isn't going to all the possible destinations.
  So, thinking of implementing IGMP routing protocol for the bridging driver. SCH: how about -associated- -with- the bridging driver rather than -in- the bridging driver???
  HX: Yes, to be cleanly implemented.
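
The difference between flooding and IGMP-aware forwarding can be sketched with a small model of a bridge that records which ports have joined which multicast group. The class and method names are invented for illustration, and real IGMP handling (queries, timeouts, router ports) is left out.

```python
class SnoopingBridge:
    """Toy bridge that forwards multicast only to ports that joined the group."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.members = {}               # group address -> set of member ports

    def igmp_report(self, port, group):
        """A host behind this port joined the group."""
        self.members.setdefault(group, set()).add(port)

    def igmp_leave(self, port, group):
        """A host behind this port left the group."""
        self.members.get(group, set()).discard(port)

    def forward_ports(self, in_port, group):
        """Ports a multicast packet for this group is sent out of."""
        known = self.members.get(group)
        # A group never seen in a report is flooded, as the bridge does today.
        targets = self.ports if known is None else known
        return sorted(targets - {in_port})


bridge = SnoopingBridge(["p1", "p2", "p3", "p4"])
flooded = bridge.forward_ports("p1", "239.1.1.1")     # all other ports
bridge.igmp_report("p3", "239.1.1.1")
targeted = bridge.forward_ports("p1", "239.1.1.1")    # only the member port
```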

Herbert Xu: GRO (Generic Receive Offload)

Summary:

GRO has been quite successful, but Herbert sees several ways to usefully expand on it. One example is receive steering, directing packets to the CPU on which the destination thread is running. Another example is to reduce processing cost by short-circuiting TCP ACK processing, as opposed to the current practice of running all TCP ACK packets through the full protocol stack. A final example is the creation of monstergrams, allowing a large group of packets from the same connection to be run through the protocol stack as a unit.

Given fixed MTU and ever-increasing bandwidths, the opportunity (and need) for such tricks can be expected to increase over time.

Details:

- GRO opposite of GSO. GSO offloads segmentation, while GRO pastes packets together.
  NAPI-GRO receive -- separate entry point allows delay to be introduced, permitting additional packets to arrive.
- Good place for other things:
  - Receive steering -- direct packets to other CPUs if warranted.
  - Reduce processing cost -- helping general processing. For example, short-circuit TCP ACK processing: send and receive the ACKs at the GRO level rather than running ACKs through the full protocol stack.
  - Make monstergrams -- wait for full NAPI interval, accumulate full set of data, run it up the protocol stack as a unit. Expect things like video to increase their bandwidth requirements.
  The fact that increasing numbers of packets arrive within a NAPI interval means that the opportunity for such improvements can be expected to increase over time.
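
The "pasting together" that GRO performs can be illustrated with a much-simplified model: consecutive packets are merged when they belong to the same flow and their sequence numbers are contiguous. This sketch only merges a packet with the one immediately before it and ignores headers, flags, and size limits, all of which a real implementation must check; the function name and the packet representation are assumptions.

```python
def coalesce(packets):
    """Merge adjacent in-order packets of the same flow.

    Each packet is a (flow, seq, payload) tuple, where seq is the byte
    offset of the payload within the flow.
    """
    merged = []
    for flow, seq, payload in packets:
        if merged:
            last_flow, last_seq, last_payload = merged[-1]
            if flow == last_flow and seq == last_seq + len(last_payload):
                # Contiguous data for the same flow: extend the previous packet.
                merged[-1] = (flow, last_seq, last_payload + payload)
                continue
        merged.append((flow, seq, payload))
    return merged


arrivals = [("A", 0, b"aa"), ("A", 2, b"bb"), ("B", 0, b"x"), ("A", 4, b"cc")]
batched = coalesce(arrivals)
```

Here the first two packets of flow A become one larger packet, while the later packet of flow A stays separate because a packet from flow B arrived in between. The more packets that arrive within one NAPI interval, the more merging is possible.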

Peter P. Waskiewicz Jr. (PJ): I/O MMU

Summary:

Much discussion of possible hardware and software optimizations for I/O MMUs, which incur high overheads. It is safe to say that more work will be required here, both in hardware and in software.

Details:

- Optimizations to keep buffers in cache, and repeatedly run through same set of buffers.
  SCH: But user might not read the buffer for a long time...
  DM: Some NICs have had a problem where certain conditions can cause the associated buffer to be forevermore useless. IO MMU very expensive to change.
- Order of magnitude decrease in throughput when enabling IO MMU, even if not remapping it.
  DM: If you have direct HW access to IO MMU, you can run through the registers, and only touch hardware when you have overflowed. Can also have hardware update the status block intermittently.
  DM: Timestamp-overflow issue. Does a zero-valued timestamp mean "I don't have a timestamp" or "invalid value for timestamp"? GRO error? HX: Just a literal comparison! DM: Let me find it... LRO issue. But standard is not clear.
  DM: need list of GRO tests so that hardware guys can make something compatible.
  SCH: convert drivers for GRO/NAPI? HX: not for 1Gb. Certainly for 10Gb. SCH: what about virtualized drivers?

Other miscellaneous topics:

- Paul E. McKenney: Real Time issues?
  Responses:
  - Time-wait reaper moved to threads.
  - Solaris takes interrupts in hardware context, generates thread if it blocks.
  - Problem: lots of kthreads... Some people think this is a problem. PEM: just use grep. Or another option to "ps" to suppress kthreads (HPA).
- Dave Miller: the "perf" tool
  - Try it, use it, this is the time to get people to add function to it.
- printk races...
- gdb over network... Has anyone actually used gdb? (Yes, there were examples. A few examples, anyway.)
- Use of networking software to prevent fires in train stations.
- Why you should never mention RCU at a picnic.
- Start at 10:30A Sunday.