Netconf 2009 minutes, Part 2.

Sunday, September 20, 2009.

John Linville: Wireless status report

Al Viro has managed to get his and his wife's paperwork in order, and has returned to the USA. Apparently, there was no record of his having been discharged from the Soviet Army. The authorities also lacked any record of his wife having an address in Russia while an adult. This situation was reportedly resolved as only Al Viro could have resolved it.
The Moblin and Gnome guys are dealing with their Connection Manager differences by living in different universes despite being in the same room.
Apparently, wpa_supplicant is now fixed.

Bob Gilligan: IP Forwarding Performance Benchmarks

Summary:

Bob Gilligan showed some counter-intuitive performance results on a SuperMicro Nehalem system. After some discussion, NUMA effects were suspect number one. Jesse is working on some patches to address NUMA issues.

Details:

Using a high-end Nehalem system from SuperMicro, thus non-standard (but more intuitive) CPU numbering. 960K pps for flows going between CPUs on the same core. No penalty for doing all work on CPU 0 compared to 0->1 (hyperthreads in same core). Penalty for doing all work on CPU 8 (probably due to NUMA effects). Jesse Brandeburg: NUMA patches are in progress and have been submitted to Linus; a bug was found afterwards, and a fix has been submitted as well.
Bidirectional single-flow acts more intuitively, as doing all work on CPU 0 gets about half the pps rate of 0->2. Doing 0->1 is intermediate, as hyperthreading does not scale perfectly due to sharing of CPU resources.
Some degradation for high-numbered CPUs, presumably again due to NUMA.
Multiflow workloads show more-complex results, see Bob's presentation.
Number of queues is currently a limiting factor.
Debate on the merits of automatic irq balancing vs. static setup. Probably depends on workload.

Stephen Hemminger: Additional Performance Results

Summary:

Use of SMP affinity can provide 40% benefit, but workloads using IPSec and IDS (intrusion detection) see poor performance regardless.

Details:

Single-flow benchmark.
SMP affinity gets a 40% increase over default for 64-byte packets. Turning on firewall rules (50 rules) or NAT decreases performance slightly relative to default. IPSec and IDS cause severe degradation (though multiple SAs would increase performance for multiple flows).
Large packets help default, SMP, firewall, and NAT. IPSec and IDS still result in degradation.
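
For reference, SMP affinity for an interrupt is set on Linux by writing a hexadecimal CPU bitmask to /proc/irq/<N>/smp_affinity. The following is a minimal sketch of how such a mask is built from a list of CPU numbers. The helper name is made up for illustration, and nothing here reflects the exact setup used in the benchmarks.

```python
def smp_affinity_mask(cpus):
    """Return the hex bitmask string for a collection of CPU numbers.

    Bit n of the mask corresponds to CPU n.  Masks wider than 32 bits are
    written as comma-separated 32-bit groups, most significant group first.
    """
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    if mask == 0:
        raise ValueError("at least one CPU is required")

    # Split into 32-bit groups, least significant first, then reverse.
    groups = []
    while mask:
        groups.append(mask & 0xFFFFFFFF)
        mask >>= 32
    groups.reverse()

    # Leading group is unpadded; the rest are zero-padded to 8 hex digits.
    head = format(groups[0], "x")
    tail = [format(g, "08x") for g in groups[1:]]
    return ",".join([head] + tail)


if __name__ == "__main__":
    for cpus in ([0], [0, 2], [8]):
        print(cpus, "->", smp_affinity_mask(cpus))
```

The resulting string is what would be written (as root) to the smp_affinity file of the IRQ in question; which IRQ belongs to which NIC queue has to be looked up separately, for example in /proc/interrupts.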

Intel Folks Pow-Wow Presentation.

Summary:

NUMA awareness looks to be necessary in a large number of areas, and Herbert Xu noted that node-affinitied data should be allocated outside of struct netdev. It is possible that some per-CPU data structures might need to be reworked to be per-node, though this needs careful thought, as it re-introduces locking overheads, potential deadlocks, and other problems. That said, if you think 10GbE is challenging, just wait for 40GbE or 100GbE!

Small-packet forwarding performance requires NUMA, cache-alignment, and per-CPU optimizations. Using any one of these optimizations alone does not help at all, using NUMA and one other helps significantly, and using all three helps a lot. So people evaluating performance enhancements should take note -- interactions can be important!

There was some discussion of porting some BSD performance improvements to Linux, but many (all?) were said to already be present. It would nevertheless be good to check up on possible improvements from other areas.

Details:

NUMA Scaling Issues in 10GbE (PJ)
- Linux lacks socket affinity on driver load. So driver data structures land randomly, and ditto dynamic allocations for most drivers.
- No linkage between where user application is running and where driver is located (especially for receive).
- HX: alloc_netdev_q might need NUMA awareness.
- Multiport 10GbE adapters can run into memory-bandwidth bottlenecks.
- Special handling required for PCIe that is associated with a processor socket ("procket", as opposed to a TCP or UDP "socket"...)
- Kernel uses per-CPU locality for pretty much everything. In the future, node-affinity might make more sense... [Or combining trees, or any of a number of other data structures...]
  Tradeoff: local locking vs. overhead of scan of full set of CPUs for every global operation.
- Need to spread driver data structures over nodes in order to get the benefit of the full system's memory bandwidth.
- Even more of a problem with 40GbE or even worse, 100GbE.
- HX: need to allocate data with node affinity outside of struct netdev.
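
The per-CPU versus per-node tradeoff noted above can be sketched in user space. This is an illustration only, not kernel code: the class names, the fixed CPU-to-node mapping, and the use of Python locks are all assumptions made for the example.

```python
import threading


class PerCpuCounter:
    """One slot per CPU: no lock on update, but a global read scans every CPU."""

    def __init__(self, ncpus):
        self.slots = [0] * ncpus

    def add(self, cpu, n=1):
        # Model assumption: each slot is written only by its own CPU,
        # so no lock is needed on the update path.
        self.slots[cpu] += n

    def read(self):
        # Returns (total, number of slots scanned).
        return sum(self.slots), len(self.slots)


class PerNodeCounter:
    """One slot per NUMA node: fewer slots to scan, but updates need a lock."""

    def __init__(self, ncpus, cpus_per_node):
        self.cpus_per_node = cpus_per_node
        nnodes = (ncpus + cpus_per_node - 1) // cpus_per_node
        self.slots = [0] * nnodes
        self.locks = [threading.Lock() for _ in range(nnodes)]

    def add(self, cpu, n=1):
        node = cpu // self.cpus_per_node  # hypothetical contiguous mapping
        # Several CPUs share a node slot, so locking comes back.
        with self.locks[node]:
            self.slots[node] += n

    def read(self):
        total = 0
        for node, lock in enumerate(self.locks):
            with lock:
                total += self.slots[node]
        return total, len(self.slots)


if __name__ == "__main__":
    per_cpu = PerCpuCounter(16)
    per_node = PerNodeCounter(16, cpus_per_node=8)
    for cpu in range(16):
        per_cpu.add(cpu)
        per_node.add(cpu)
    print("per-CPU  (total, slots scanned):", per_cpu.read())
    print("per-node (total, slots scanned):", per_node.read())
```

The per-node variant scans far fewer slots on a global read, but pays for it with a lock on every update, which is the locking overhead the summary warns about.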
Small-packet forwarding on Nehalem (Jesse)
- Running out-of-tree driver (but not much change from mainline).
- Tested numerous combinations of driver optimizations.
  - NUMA optimizations.
  - Align packets to cache lines -- very important. But need pad bytes for other systems. Unaligned DMA problematic for increasing numbers of modern systems, not just x86. Not a problem for large packets -- there is a cacheline-merge bandwidth limit (# transfers on QPI).
    DSM: use variable to handle different systems within x86 architecture. But might be able to just set alignment to zero for all x86.
  - Write-often shared memory painful on 16 cores. Only collecting stats when requested.
    Per-CPU variables are one easy solution. "Mold grew on our driver, lots of incremental changes" DSM: "You know what happens if enough mold grows on your driver? It walks right out of the staging tree!"
    Need all three optimizations to get much benefit.
- Problems with some code getting upset when the NUMA binding changes. Possibly put info in driver queue vectors. Other locations might be NAPI.
- HW capable of 13M pps -- measured by low-level forwarding (swap MAC addresses and transmit from receive interrupt). 8 queues.
  3,000(?) cycles (a bit over a microsecond) for full IP forwarding -- in contrast, packets could arrive every 70ns or so.
  Suspect memory-allocator overhead -- suggest checking how often slab is hitting the global pool, and adjusting config to suit.
- Should memory move when queues move???
  DSM: netconsole is the only thing that would care about the memory moving. "More research is necessary."
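
The gap between per-packet cost and packet arrival rate can be made concrete with a little arithmetic. This sketch assumes a 2.66 GHz clock (the minutes do not state the clock rate) and minimum-size 64-byte Ethernet frames with the standard 8-byte preamble and 12-byte inter-frame gap. The 3,000-cycle figure is the uncertain one from the notes above.

```python
import math


def min_frame_interarrival_ns(link_bps=10e9, frame_bytes=64):
    """Time between back-to-back frames at line rate, in nanoseconds."""
    wire_bytes = frame_bytes + 8 + 12  # preamble + inter-frame gap
    return wire_bytes * 8 / link_bps * 1e9


def forwarding_time_ns(cycles=3000, clock_hz=2.66e9):
    """Time one CPU spends forwarding one packet, in nanoseconds."""
    return cycles / clock_hz * 1e9


def cpus_needed(cycles=3000, clock_hz=2.66e9, link_bps=10e9, frame_bytes=64):
    """CPUs needed to keep up at line rate, assuming perfect scaling."""
    per_packet = forwarding_time_ns(cycles, clock_hz)
    arrival = min_frame_interarrival_ns(link_bps, frame_bytes)
    return math.ceil(per_packet / arrival)


if __name__ == "__main__":
    print("inter-arrival: %.1f ns" % min_frame_interarrival_ns())
    print("forwarding:    %.1f ns" % forwarding_time_ns())
    print("CPUs needed:   %d" % cpus_needed())
```

Under these assumptions, keeping up with minimum-size packets at 10GbE line rate would take around 17 CPUs even with perfect scaling, which is why per-packet overheads such as memory allocation draw so much attention.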
BSD hacks (Jesse)
- Making packet trains within kernel. (Talk to Herbert!)

Dave Miller: Miscellaneous optimizations and features

Socket refcount avoidance (Eric Dumazet)
- Packets in queue imply a reference to socket
- Now must initialize socket reference to 1
VLAN and MACVLAN multiqueue support
per-TXQ trans_start
COMPAT_NET_DEV_OPS is -gone-!!! (Stephen Hemminger)
Early DST release in dev_hard_start_xmit (Eric Dumazet): eliminates cache misses. Some exceptional cases (e.g., tunnelling) cannot do early release.
Per-queue TX stats
last_rx update avoidance -- only used by bonding, so now only actually updated by bonding.

Paul McKenney: RCU

RCU-bh
- what is RCU bh for
- Robert Olsson DoS workload hung system
- ICMP redirects update routing table
- no grace period, or long grace periods
- Dipankar created RCU-bh
- new quiescent state, in softirq instead of schedule
- routing cache converted to RCU-bh, then withstood DoS
RCU tiny
- simplified RCU for embedded
- depends on non-SMP
- for ARM and SH etc. with small memory configs
- synchronize_rcu() is a NOP
- no grace period necessary
RCU in mainline
- synchronize_sched_expedited() is in
  - grace period in a few tens of microseconds
  - hammer cpus with IPIs, force reschedule into thread
  - use sparingly, expensive
- CLASSIC_RCU and PREEMPT_RCU are gone
- replaced with TREE_RCU and TREE_PREEMPT_RCU
- TINY_RCU under test, not yet in mainline
Performance of sync mechanisms
- locks vs. other primitives
- locks are cheaper now
- but off socket, still very expensive
- lots of atomic operations on same variable
- what slows you down
  - pipeline flushes
  - memory barriers
  - cache misses
  - all in the noise compared to I/O
Dining Philosophers
- Solution #1: lowest fork first
- Solution #2: partitioning
- Solution #3: More forks
- Objections: "can't change rules!", "it's a lock hierarchy test, solution #3 destroyed it!", "what if a fork cost a million dollars?"
- But what if problem does not partition nicely?
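
Solution #1 is a lock hierarchy: number the forks and always pick up the lower-numbered one first, so no cycle of waiters can form. A minimal user-space sketch, with threads standing in for philosophers and locks for forks (all names here are made up for illustration):

```python
import threading


def dine(num_philosophers=5, meals=100):
    """Run the dinner to completion and return meals eaten per philosopher."""
    if num_philosophers < 2:
        raise ValueError("need at least two philosophers (and two forks)")

    forks = [threading.Lock() for _ in range(num_philosophers)]
    eaten = [0] * num_philosophers

    def philosopher(i):
        left, right = i, (i + 1) % num_philosophers
        # Lock hierarchy: always take the lower-numbered fork first.
        first, second = min(left, right), max(left, right)
        for _ in range(meals):
            with forks[first]:
                with forks[second]:
                    eaten[i] += 1  # only philosopher i writes eaten[i]

    threads = [threading.Thread(target=philosopher, args=(i,))
               for i in range(num_philosophers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return eaten


if __name__ == "__main__":
    print(dine())
```

Without the min/max ordering, every philosopher could grab the left fork at once and wait forever for the right one. With it, the last philosopher reaches for fork 0 first, just like the first philosopher, which breaks the cycle.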
Embarrassingly parallel
- per-cpu variables
- per-device stuff
- etc.
If cannot partition
- Use per-cpu/per-task caching (reduce global interaction)
- Use periodic update (give up accuracy or responsiveness)
- maybe random() (coordination more expensive than justified)
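
The first two ideas can be combined: cache updates per task and fold them into the global state only periodically. The sketch below is a user-space illustration with invented names, not a kernel mechanism. It batches increments per thread and takes the shared lock once per batch, trading accuracy of the global value for less global interaction.

```python
import threading


class BatchedCounter:
    """Global counter updated through per-thread caches.

    Each thread accumulates increments locally and folds them into the
    shared total only once `batch` increments have built up, so the lock
    is taken once per batch rather than once per increment.  The price is
    accuracy: read() can lag by up to (batch - 1) per thread.
    """

    def __init__(self, batch=64):
        self.batch = batch
        self._total = 0
        self._lock = threading.Lock()
        self._local = threading.local()

    def inc(self):
        pending = getattr(self._local, "pending", 0) + 1
        if pending >= self.batch:
            with self._lock:
                self._total += pending
            pending = 0
        self._local.pending = pending

    def flush(self):
        """Fold the calling thread's pending increments into the total."""
        pending = getattr(self._local, "pending", 0)
        if pending:
            with self._lock:
                self._total += pending
            self._local.pending = 0

    def read(self):
        with self._lock:
            return self._total


if __name__ == "__main__":
    counter = BatchedCounter(batch=10)
    for _ in range(25):
        counter.inc()
    print("before flush:", counter.read())
    counter.flush()
    print("after flush: ", counter.read())
```

A larger batch means fewer trips to the shared lock but a staler global value, which is the accuracy-versus-responsiveness choice listed above.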