Minutes - Linux Kernel Developers' Netconf 2014

Attendees:
David Miller
Soyoung Park
Stephen Hemminger
Jeff Kirsher
John Fastabend
Herbert Xu
Eric Biederman
Tom Herbert
Eric Dumazet
Scott Feldman
Rusty Russell
Jamal Hadi Salim
Pablo Neira Ayuso
Anna Schumaker
Florian Fainelli
Hannes Frederic Sowa
Eric W. Biederman
Chris Wright
John Linville
Johannes Berg


  • Introductions

  • SCTP consolidation with generic infrastructure.
    - Lower Arnaldo's generic infrastructure work to SCTP some more.
    - Who uses associations?
    - Does not use inet hash
    - synchronize RCU, which is in the recent merge with dynamic lookup done
    - Conversion of ehash tables to rhashtables

  • eBPF
    - 64-bit opcodes, LLVM --> C backend, more than networking
    - pushback (code & design reviews needed at the community level)
    - load arbitrary pointers (security concern)
    - LLVM backend
    - useful for tracing

  • Pablo Netfilter Workshop report
     - http://workshop.netfilter.org/2014/wiki/index.php/List_of_presentations
     - Conntrack removal, Jesper
     - OVS replacing bridging, shemminger
     - 10Gb/s wirespeed, Jesper
     - DPDK (concerns: sharing IP and port space with normal stack)
       + take another look at netchannels
       + people think userland development is "easier"
     - offloads between hardware such as switches, software and VM layers
     - nftables, compatibility layer, multidimensional keys - Pablo/Patrick
     - OVS w/conntrack - Jesse Gross

  • Tom Herbert - Offloading Encapsulations
     - non-virtualization (GRE, source routing) vs. virtualization (vxlan, nvgre, etc.)
     - UDP encapsulation ubiquitous
     - Avoiding deep packet inspection for flow steering
     - Idea: Set source port to hash of inner packet
     - checksum offloading
       + multiple checksums in single packet (IP->UDP->GRE->IP->TCP)
       + Switch vendors want to avoid UDP checksums
       + Receive checksum overhaul
         - CHECKSUM_COMPLETE (always works) vs. CHECKSUM_UNNECESSARY (stack allows two levels)
         - Most NICs can provide CHECKSUM_UNNECESSARY for UDP packets
         - if checksum is non-zero, derive CHECKSUM_COMPLETE when processing UDP packets
         - after conversion, any encapsulated checksums are verified using skb->csum
       + TX checksum offloading
         - one checksum is easy; for two, stack and NICs lack support
         - Remote Checksum Offload
           + Checksum only outer UDP packet on TX
           + like normal checksum offload except it's deferred to peer
           + Deduce both csums on receive
         - 2 or more checksums
           + outer packet and inner transport packet
           + stack and NICs do not support
           + Alternative: Remote Checksum Offload
     - GRO after GRE decap
     - TSO/GSO
       + Partially generic support
       + works as long as no per-segment values reside in encap header
         (such as sequence numbers, packet lengths)
     - TSO/LRO to guest driver
       + On Tx, guest uses TSO interface; host kernel converts to TSO/GSO
       + On Rx, host probably uses GRO, converts to LRO to guest device
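
  Sketch (not from the meeting): the receive checksum overhaul above rests
  on the fact that a ones-complement sum over the whole packet lets the
  stack check an inner checksum by subtracting the sum of the headers it
  has already pulled. The Python below is only a model of that arithmetic.
  All function names are hypothetical, the pseudo-header is left out for
  brevity, and headers are assumed to have even length.

```python
def csum_fold(total: int) -> int:
    """Fold carries back in until the value fits in 16 bits."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def csum_partial(data: bytes) -> int:
    """Ones-complement sum of 16-bit big-endian words (odd tail zero-padded)."""
    if len(data) % 2:
        data += b"\x00"
    words = (int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2))
    return csum_fold(sum(words))


def csum_sub(total: int, part: int) -> int:
    """Remove 'part' from a ones-complement sum by adding its complement."""
    return csum_fold(total + (~part & 0xFFFF))


def pull_header(complete: int, header: bytes) -> int:
    """Given the sum over header + rest, return the sum over the rest only."""
    assert len(header) % 2 == 0, "odd offsets would need a byte swap"
    return csum_sub(complete, csum_partial(header))


def fill_checksum(segment: bytearray, offset: int) -> None:
    """Store the complemented sum at an even 'offset' so that the whole
    segment sums to 0xFFFF (pseudo-header omitted in this model)."""
    segment[offset:offset + 2] = b"\x00\x00"
    value = ~csum_partial(bytes(segment)) & 0xFFFF
    segment[offset:offset + 2] = value.to_bytes(2, "big")


if __name__ == "__main__":
    outer = bytes(range(50))                 # stand-in for outer/encap headers
    inner = bytearray(b"\x12\x34" * 10 + b"payload")
    fill_checksum(inner, 16)                 # hypothetical checksum field offset
    complete = csum_partial(outer + bytes(inner))   # what the NIC would report
    print("inner checksum valid:", pull_header(complete, outer) == 0xFFFF)
```

  In this model the device reports a single sum and every nested checksum
  is checked in software by repeated subtraction, which is why the notes
  describe CHECKSUM_COMPLETE as always working regardless of nesting depth.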

  • Johannes Berg - Wireless
     - ARP Proxying
       + Power and air time saving
       + Snoops DHCP, ARP, NS/NA frames.
       + Implementation location: generic networking vs. bridge
     - L2 Traffic Inspection and Filtering
     - Wireless traffic bound for wireless medium currently forwarded
       internally by 802.11 layer.  Will have to change in order to
       implement snooping for things like ARP proxying
       + BR_HAIRPIN_MODE
     - GTK protected traffic (L3 unicast in L2 multicast), RFC 1122 broadcast check
       + since all stations share the GTK, any station can send multicast
         GTK protected frames to anyone and appear to be the AP.
       + thus frames carrying L3 unicast in L2 multicast packets, which
         are GTK protected, should be dropped
       + RFC 1122 mandated rules on receive should already disallow this but
         seem to be simply not implemented in ip_route_input_slow() yet.
       + Alternatively, parse L3 in wireless stack (or even in iptables rules?)
       + Previous attempt at using an skb bit "drop_unicast" rejected by Eric
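
  Sketch (not from the meeting): the drop rule discussed above, "L3 unicast
  inside L2 multicast/broadcast", can be modeled as a small predicate. This
  is plain Python for illustration, not kernel code. The function names and
  the optional local_net argument (used to recognize a subnet-directed
  broadcast) are assumptions.

```python
import ipaddress


def is_l2_group(dst_mac: bytes) -> bool:
    """Multicast and broadcast MACs have the group bit set in the first octet."""
    return bool(dst_mac[0] & 0x01)


def is_l3_unicast(dst_ip: str, local_net: str = None) -> bool:
    """True unless the address is IP multicast or an IPv4 broadcast."""
    addr = ipaddress.ip_address(dst_ip)
    if addr.is_multicast:
        return False
    if addr.version == 4:
        if addr == ipaddress.IPv4Address("255.255.255.255"):
            return False
        if local_net is not None:
            net = ipaddress.ip_network(local_net)
            if net.version == 4 and addr == net.broadcast_address:
                return False
    return True


def should_drop(dst_mac: bytes, dst_ip: str, local_net: str = None) -> bool:
    """Drop frames that arrive as L2 group traffic but carry an L3 unicast."""
    return is_l2_group(dst_mac) and is_l3_unicast(dst_ip, local_net)


if __name__ == "__main__":
    broadcast_mac = bytes.fromhex("ffffffffffff")
    print(should_drop(broadcast_mac, "192.168.1.10"))
    print(should_drop(broadcast_mac, "192.168.1.255", "192.168.1.0/24"))
```

  Doing the check on the L3 destination is what lets it live either in the
  IP input path or in the wireless stack, the two options the notes weigh.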

  • Jamal Hadi Salim - Network Function Offloading
     - Use/support existing tools (nftables/iptables, iproute2, route/ifconfig...)
     - Linux APIs, netlink, etc., no vendor APIs/SDKs
     - Bridging/switching, QoS, IPSEC, L3 forwarding, stateless ACL
     - Capability probing becomes necessary due to disparate set of features
       and capacities (TCAMs etc.)
     - How vs. network-centric view of the world
     - Challenge 1: design toward a generic framework that caters to each driver
     - Challenge 2: limited open source drivers
     - Challenge 3: unresolved driver support issues get spilled to the
       userspace for kernel to support as openwrt guys' open drivers
     - QEMU virtual device coming soon so that prototyping is possible
       of userspace
     - ongoing discussions: https://linux.cumulusnetworks.com/offload-discussion-1/

  • David S. Miller - VLAN offload limits
     - HW decap stored in SKB
     - multiple HW decaps possible?
     - --> no
     - Want consistent handling of multiple tags

  • Bonding
     - very modular
     - locking is a lot cleaner
     - smaller code base
     - Where are we with bonding and offloading?

  • Eric Dumazet - IP VLAN
     - Like MAC VLAN, but decapsulating on IP address
     - supports ipv4 and ipv6

  • David S. Miller - send batching
     - ->ndo_start_xmit() takes one SKB at a time
     - Extend to be able to queue multiple SKBs at a time
     - Amortize the number of "trigger" events (writes to "TX start"
       register, or calls into hypervisor)
     - Transition path is important, there are 400+ implementations
       of ->ndo_start_xmit()
     - Add ->ndo_xmit_flush() op
     - If implemented, semantics are that it must be called after
       a sequence of ->ndo_start_xmit() invocations.  The implication
       is that ->ndo_start_xmit() does not kick the TX queue to
       start processing the newly queued up SKBs; that is what the
       new ->ndo_xmit_flush() operation does.
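
  Sketch (not from the meeting): a toy model of the proposed semantics,
  showing how splitting "queue" from "kick" reduces the number of trigger
  events. ToyNic and the two helper functions are hypothetical. Only the
  method names mirror the ops named above, and the doorbell counter stands
  in for a register write or hypervisor call.

```python
class ToyNic:
    """Toy driver: queuing an SKB and kicking the hardware are separate steps."""

    def __init__(self):
        self.ring = []            # descriptors queued but not yet kicked
        self.transmitted = []     # SKBs the "hardware" has picked up
        self.doorbell_writes = 0  # number of expensive trigger events

    def ndo_start_xmit(self, skb):
        # Only place the SKB on the TX ring; do not touch the doorbell.
        self.ring.append(skb)

    def ndo_xmit_flush(self):
        # One trigger event covers everything queued since the last flush.
        if self.ring:
            self.doorbell_writes += 1
            self.transmitted.extend(self.ring)
            self.ring.clear()


def xmit_one_at_a_time(nic, skbs):
    """Legacy-like behaviour: every SKB is followed by its own kick."""
    for skb in skbs:
        nic.ndo_start_xmit(skb)
        nic.ndo_xmit_flush()


def xmit_batched(nic, skbs):
    """Batched behaviour: queue the whole sequence, then kick once."""
    for skb in skbs:
        nic.ndo_start_xmit(skb)
    nic.ndo_xmit_flush()


if __name__ == "__main__":
    skbs = ["skb%d" % i for i in range(8)]
    legacy, batched = ToyNic(), ToyNic()
    xmit_one_at_a_time(legacy, skbs)
    xmit_batched(batched, skbs)
    print(legacy.doorbell_writes, batched.doorbell_writes)
```

  Both paths hand the same SKBs to the device in the same order. Only the
  number of doorbell writes differs, which is the cost being amortized.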

  • Misc
     - Batching of "same routing keyed" packets on RX to amortize
       routing lookups, perhaps similarly on TX.
     - TPACKET_V4, generically exporting RX ring into userspace.
       Header will have description of RX queue descriptor format.
       Should work on Intel, Mellanox, Solarflare NICs
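
  Sketch (not from the meeting): one way to read "batching of same routing
  keyed packets" is to do a single lookup per run of consecutive packets
  that share a key. The model below uses only the destination address as
  the key and a linear longest-prefix match. Both are simplifying
  assumptions, and RouteTable and route_batch are hypothetical names.

```python
import ipaddress
from itertools import groupby


class RouteTable:
    """Tiny longest-prefix-match table that counts how often it is consulted."""

    def __init__(self, routes):
        # routes: iterable of (prefix, nexthop); keep longest prefixes first.
        self.routes = sorted(
            ((ipaddress.ip_network(prefix), nexthop) for prefix, nexthop in routes),
            key=lambda entry: entry[0].prefixlen,
            reverse=True,
        )
        self.lookups = 0

    def lookup(self, dst):
        self.lookups += 1
        addr = ipaddress.ip_address(dst)
        for net, nexthop in self.routes:
            if addr in net:
                return nexthop
        return None


def route_batch(table, dsts):
    """Resolve one next hop per run of identical destinations, keeping order."""
    out = []
    for dst, run in groupby(dsts):
        nexthop = table.lookup(dst)          # one lookup for the whole run
        out.extend((dst, nexthop) for _ in run)
    return out


if __name__ == "__main__":
    table = RouteTable([("10.0.0.0/8", "eth0"), ("10.1.0.0/16", "eth1"),
                        ("0.0.0.0/0", "eth2")])
    burst = ["10.1.2.3"] * 3 + ["10.9.9.9"] * 2 + ["10.1.2.3"]
    print(route_batch(table, burst), table.lookups)
```

  Grouping only consecutive packets keeps delivery order intact, at the
  price of a repeated lookup when a key reappears later in the burst.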