Netconf 2009 minutes, Part 2.
Sunday, September 20, 2009.
John Linville: Wireless status report
- Al Viro has managed to get his and his wife's paperwork
in order, and has returned to the USA. Apparently, there was no
record of his having been discharged from the Soviet Army.
The authorities also lacked any record of his wife having an
address in Russia while an adult. This situation was reportedly
resolved as only Al Viro could have resolved it.
- The Moblin and Gnome guys are dealing with their Connection
Manager differences by living in different universes despite
being in the same room.
- Apparently, wpa-supplicant is now fixed.
Bob Gilligan: IP Forwarding Performance Benchmarks
Summary:
Bob Gilligan showed some counter-intuitive performance results
on a SuperMicro Nehalem system. After some discussion, NUMA
effects were suspect number one. Jesse is working on some
patches to address NUMA issues.
Details:
- Using high-end Nehalem system from SuperMicro, thus
non-standard (but more intuitive) CPU numbering. 960K pps for
flows going between CPUs on same core.
No penalty for doing all work on CPU 0 compared to 0->1
(hyperthreads in same core). Penalty for doing all work on CPU 8
(probably due to NUMA effects).
Jesse Brandeburg: NUMA patches are in progress and were submitted
to Linus; a bug was then found, and a fix has been submitted as well.
- Bidirectional single-flow acts more intuitively,
as doing all work on CPU 0 gets about half the pps
rate as 0->2. Doing 0->1 is intermediate, as hyperthreading
does not scale perfectly due to sharing of CPU resources.
Some degradation for high-numbered CPUs, presumably again
due to NUMA.
- Multiflow workloads show more-complex results, see Bob's
presentation.
Number of queues is currently a limiting factor.
Debate on the merits of automatic irq balancing vs. static
setup. Probably depends on workload.
Stephen Hemminger: Additional Performance Results
Summary:
Use of SMP affinity can provide 40% benefit, but workloads using
IPSec and IDS (intrusion detection) see poor performance regardless.
Details:
- Single-flow benchmark.
- SMP affinity gets 40% increase over default for 64-byte packets.
Turning on firewall rules (50 rules) or NAT drops performance slightly
below default. IPSec and IDS cause severe degradation (though multiple
SAs would increase performance for multiple flows).
- Large packets help default, SMP, firewall, and NAT. IPSec and
IDS still result in degradation.
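For readers unfamiliar with the SMP affinity setting benchmarked
above: on Linux, an interrupt is steered to particular CPUs by writing
a hexadecimal CPU bitmask to /proc/irq/<N>/smp_affinity. The sketch
below only computes such masks. The helper name cpu_mask and the CPU
choices are illustrative assumptions, not the configuration used in
the benchmark.

```python
def cpu_mask(cpus):
    """Return the hex bitmask string (no 0x prefix) selecting the given CPUs."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu  # bit N set means CPU N may service the interrupt
    return format(mask, "x")


if __name__ == "__main__":
    # Hypothetical choices: one IRQ per line, pinned to the listed CPUs.
    for cpus in ([0], [2], [0, 1], [8]):
        print(cpus, "->", cpu_mask(cpus))
```

The resulting string is what would be written to the smp_affinity file
of the chosen IRQ (root required); which IRQ belongs to which queue is
system-specific and is not covered here.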
Intel Folks Pow-Wow Presentation.
Summary:
NUMA awareness looks to be necessary in a large number of areas,
and Herbert Xu noted that node-affinitied data should be allocated
outside of struct netdev. It is possible that some per-CPU data
structures might need to be reworked to be per-node, though this
needs careful thought, as it re-introduces locking overheads,
potential deadlocks, and other problems. That said, if you
think 10GbE is challenging, just wait for 40GbE or 100GbE!
Small-packet forwarding performance requires NUMA,
cache-alignment, and per-CPU optimizations. Using any one of
these optimizations alone does not help at all; using NUMA plus one
other helps significantly; and using all three helps a lot.
So people evaluating performance enhancements should take note --
interactions can be important!
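The cache-alignment item involves a trade-off with the pad bytes that
drivers conventionally reserve in front of the Ethernet header so that
the IP header lands on a 4-byte boundary. A minimal sketch of the
arithmetic, assuming a 14-byte Ethernet header, 64-byte cache lines,
and a receive buffer that itself starts on a cache-line boundary (the
function name and the returned keys are illustrative only):

```python
ETH_HLEN = 14     # bytes in an untagged Ethernet header
CACHE_LINE = 64   # assumed cache-line size in bytes


def rx_layout(pad, buf_addr=0):
    """Describe alignment when the NIC DMAs a frame to buf_addr + pad."""
    dma_start = buf_addr + pad         # first byte written by the device
    ip_start = dma_start + ETH_HLEN    # IP header follows the Ethernet header
    return {
        "dma_cacheline_aligned": dma_start % CACHE_LINE == 0,
        "ip_header_4byte_aligned": ip_start % 4 == 0,
    }


if __name__ == "__main__":
    for pad in (0, 2):
        print("pad", pad, rx_layout(pad))
```

With a two-byte pad the IP header is aligned but the DMA write starts
mid-cache-line; with no pad the reverse holds. That is why a zero pad
is attractive on systems where unaligned header access is cheap.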
There was some discussion of porting some BSD performance
improvements to Linux, but many (all?) were said to already
be present. It would nevertheless be good to check up on
possible improvements from other areas.
Details:
- NUMA Scaling Issues in 10GbE (PJ)
- Small-packet forwarding on Nehalem (Jesse)
- Running out-of-tree driver (but not much change
from mainline).
- Tested numerous combinations of driver optimizations.
- NUMA optimizations.
- Align packets to cache lines -- very important.
But need pad bytes for other systems. Unaligned
DMA problematic for increasing numbers of modern
systems, not just x86. Not a problem for large
packets -- there is a cacheline-merge bandwidth
limit (# transfers on QPI).
DSM: use variable to handle different systems
within x86 architecture. But might be able
to just set alignment to zero for all x86.
- Write-often shared memory painful on 16 cores.
Only collecting stats when requested.
Per-CPU variables are one easy solution.
"Mold grew on our driver, lots of incremental
changes"
DSM: "You know what happens if enough mold grows
on your driver? It walks right out of the staging
tree!"
Need all three optimizations to get much benefit.
- Problems with some code getting upset when the NUMA
binding changes. Possibly put info in driver queue
vectors. Other locations might be NAPI.
- HW capable of 13M pps -- measured by low-level forwarding
(swap MAC addresses and transmit from receive interrupt).
8 queues.
3,000(?) cycles (a bit over a microsecond) for full IP
forwarding -- in contrast, packets could arrive every
70ns or so.
Suspect memory-allocator overhead -- suggest checking
how often slab is hitting the global pool, and adjusting
config to suit.
- Should memory move when queues move???
DSM: netconsole is the only thing that would care
about the memory moving. "More research is necessary."
- BSD hacks (Jesse)
- Making packet trains within kernel. (Talk to Herbert!)
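The per-packet time budget quoted above can be reproduced with a
little arithmetic. The sketch below assumes minimum-size 64-byte
Ethernet frames plus the standard 8-byte preamble and 12-byte
inter-frame gap, and a 2.66 GHz clock; the clock frequency is an
assumption, since the minutes do not state it.

```python
ETH_MIN_FRAME = 64   # bytes, minimum Ethernet frame including FCS
ETH_PREAMBLE = 8     # bytes, preamble plus start-of-frame delimiter
ETH_IFG = 12         # bytes, minimum inter-frame gap


def packet_interval_ns(link_bps, frame_bytes=ETH_MIN_FRAME):
    """Nanoseconds between back-to-back frames at line rate."""
    wire_bits = (frame_bytes + ETH_PREAMBLE + ETH_IFG) * 8
    return wire_bits / link_bps * 1e9


def pps_to_ns(pps):
    """Nanoseconds available per packet at a given packet rate."""
    return 1e9 / pps


def cycles_to_ns(cycles, cpu_hz):
    """Convert a cycle count to nanoseconds at the given clock rate."""
    return cycles / cpu_hz * 1e9


if __name__ == "__main__":
    assumed_hz = 2.66e9  # assumption: not stated in the minutes
    for gbit in (10, 40, 100):
        print(f"{gbit} GbE: {packet_interval_ns(gbit * 1e9):.2f} ns per packet")
    print(f"13M pps: {pps_to_ns(13e6):.1f} ns per packet")
    print(f"3000 cycles: {cycles_to_ns(3000, assumed_hz):.0f} ns")
```

Under these assumptions a single CPU spending about 3,000 cycles per
packet falls short of 10GbE line rate for small packets by more than
an order of magnitude, which is consistent with the emphasis on
multiple queues and per-CPU work in the notes above.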
Dave Miller: Miscellaneous optimizations and features
- Socket refcount avoidance (Eric Dumazet)
- Packets in queue imply a reference to socket
- Now must initialize socket reference to 1
- VLAN and MACVLAN multiqueue support
- per-TXQ trans_start
- COMPAT_NET_DEV_OPS is -gone-!!! (Stephen Hemminger)
- Early DST release in dev_hard_start_xmit (Eric Dumazet)
Eliminate cache misses. Some exceptional cases (e.g.,
tunnelling) cannot do early release.
- Per-queue TX stats
- last_rx update avoidance -- only used by bonding, so now
only actually updated by bonding.
Paul McKenney: RCU
- RCU-bh
- what is RCU-bh for
- Robert Olsson DoS workload hung system
- ICMP redirects update routing table
- no grace period, or long grace periods
- Dipankar created RCU-bh
- new quiescent state, in softirq instead of schedule
- routing cache converted to RCU-bh, then withstood DoS
- RCU tiny
- simplified RCU for embedded
- depends on non-SMP
- for ARM and SH etc. with small memory configs
- synchronize_rcu() is a NOP
- no grace period necessary
- RCU in mainline
- synchronize_sched_expedited() is in
- grace period in a few tens of microseconds
- hammer cpus with IPIs, force reschedule into thread
- use sparingly, expensive
- CLASSIC_RCU and PREEMPT_RCU are gone
- replaced with TREE_RCU and TREE_PREEMPT_RCU
- TINY_RCU under test, not yet in mainline
- Performance of sync mechanisms
- locks vs. other primitives
- locks are cheaper now
- but off socket, still very expensive
- lots of atomic operations on same variable
- what slows you down
- pipeline flushes
- memory barriers
- cache misses
- all in the noise compared to I/O
- Dining Philosophers
- Solution #1: lowest fork first
- Solution #2: partitioning
- Solution #3: More forks
- Objections: "can't change rules!", "it's a lock hierarchy
test, solution #3 destroyed it!", "what if a fork cost a
million dollars?"
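Solution #1 (lowest fork first) is a lock hierarchy: every philosopher
acquires the lower-numbered fork before the higher-numbered one, so no
cycle of waiters can form. A minimal sketch using Python threads; the
function name dine and its parameters are illustrative assumptions,
not code from the talk.

```python
import threading


def dine(num_philosophers=5, meals=100):
    """Run the philosophers to completion; return meals eaten by each."""
    forks = [threading.Lock() for _ in range(num_philosophers)]
    eaten = [0] * num_philosophers  # slot i is written only by philosopher i

    def philosopher(i):
        left, right = i, (i + 1) % num_philosophers
        # Lock hierarchy: always take the lower-numbered fork first.
        first, second = min(left, right), max(left, right)
        for _ in range(meals):
            with forks[first]:
                with forks[second]:
                    eaten[i] += 1

    threads = [threading.Thread(target=philosopher, args=(i,))
               for i in range(num_philosophers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return eaten


if __name__ == "__main__":
    print(dine())
```

If every philosopher instead took the left fork first, all of them
could end up holding one fork while waiting for the next, which is the
deadlock the ordering rule prevents.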
- But what if problem does not partition nicely?
- Embarrassingly parallel
- per-cpu variables
- per-device stuff
- etc.
- If cannot partition
- Use per-cpu/per-task caching (reduce global interaction)
- Use periodic update (give up accuracy or responsiveness)
- maybe random() (coordination more expensive than justified)
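The per-cpu/per-task caching and periodic-update items can be combined
in a counter that accumulates locally and touches the shared total
only once per batch. This is a minimal sketch with Python threads
standing in for CPUs; the class name, the batch size, and the helper
run_workers are assumptions made for illustration.

```python
import threading


class BatchedCounter:
    """Counter that trades read accuracy for fewer global updates."""

    def __init__(self, batch=64):
        self._batch = batch
        self._lock = threading.Lock()
        self._total = 0
        self._local = threading.local()  # per-thread pending count

    def add(self, n=1):
        pending = getattr(self._local, "pending", 0) + n
        if pending >= self._batch:
            with self._lock:             # global interaction once per batch
                self._total += pending
            pending = 0
        self._local.pending = pending

    def flush(self):
        """Push this thread's pending count into the shared total."""
        pending = getattr(self._local, "pending", 0)
        if pending:
            with self._lock:
                self._total += pending
            self._local.pending = 0

    def approximate(self):
        """Shared total; may lag by up to batch - 1 per active thread."""
        with self._lock:
            return self._total


def run_workers(counter, n_threads, adds_each):
    def work():
        for _ in range(adds_each):
            counter.add()
        counter.flush()
    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.approximate()


if __name__ == "__main__":
    print(run_workers(BatchedCounter(batch=10), 4, 1005))
```

Readers give up accuracy (the total can lag until threads flush) in
exchange for taking the shared lock once per batch rather than once
per increment.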