Linux Plumbers Conference 2018
Networking Track

A two-day Networking Track will be featured at this year's Linux Plumbers Conference (LPC) in Vancouver, British Columbia, Canada.

It will run the first two days of LPC, November 13-14. The track will consist of a series of talks with a clear focus on recent topics in Linux kernel networking, including a keynote from the Linux kernel networking maintainer, David S. Miller.

The Networking Track will be open to all LPC attendees. There is no additional registration required. This is a great occasion for Linux networking developers to meet face to face and discuss ongoing developments.

Schedule

The schedule is available on the main LPC website.

Accepted Talks

The following talks have been accepted by the LPC Networking Track Technical Committee:

Speaker:

  • David Ahern (Cumulus Networks)

Duration (incl. QA): 35 min

Content: Slides, Paper, Video

Abstract:

XDP is a framework for running BPF programs in the NIC driver to allow decisions about the fate of a received packet at the earliest point in the Linux networking stack. For the most part the BPF programs rely on maps to drive packet decisions, maps that are managed for example by a userspace agent. This architecture has implications on how the system is configured, monitored and debugged.

An alternative approach is to make the kernel networking tables accessible by BPF programs. This approach allows the use of standard Linux APIs and tools to manage networking configuration and state while still achieving the higher performance provided by XDP. An example of providing access to kernel tables is the recently added helper to allow IPv4 and IPv6 FIB (and nexthop) lookups in XDP programs. Routing suites such as FRR manage the FIB tables, and the XDP packet path benefits by automatically adapting to the FIB updates in real time. While a huge first step, a FIB lookup alone is not sufficient for general networking deployments.

This talk discusses the advantages of making kernel tables available to XDP programs to create a programmable packet pipeline, what features have been implemented as of October 2018, key missing features, and current challenges.
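
A rough illustration of the FIB-lookup approach described above (this is a sketch, not code from the talk; the simplified IPv4-only parsing and the tx_ports map name are assumptions): an XDP program can ask the kernel routing and neighbour tables for a forwarding decision via the bpf_fib_lookup() helper and redirect the frame accordingly.

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    #ifndef AF_INET
    #define AF_INET 2
    #endif

    /* Device map for redirection; the loader must populate it (key: ifindex). */
    struct {
        __uint(type, BPF_MAP_TYPE_DEVMAP);
        __uint(max_entries, 64);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(__u32));
    } tx_ports SEC(".maps");

    SEC("xdp")
    int xdp_fib_fwd(struct xdp_md *ctx)
    {
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        struct iphdr *iph = data + sizeof(*eth);
        struct bpf_fib_lookup fib = {};

        if ((void *)(iph + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        fib.family      = AF_INET;
        fib.tos         = iph->tos;
        fib.l4_protocol = iph->protocol;
        fib.tot_len     = bpf_ntohs(iph->tot_len);
        fib.ipv4_src    = iph->saddr;
        fib.ipv4_dst    = iph->daddr;
        fib.ifindex     = ctx->ingress_ifindex;

        /* Consult the kernel FIB and neighbour tables (managed by FRR, iproute2, ...). */
        if (bpf_fib_lookup(ctx, &fib, sizeof(fib), 0) != BPF_FIB_LKUP_RET_SUCCESS)
            return XDP_PASS;    /* let the regular stack handle misses/exceptions */

        __builtin_memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
        __builtin_memcpy(eth->h_source, fib.smac, ETH_ALEN);
        return bpf_redirect_map(&tx_ports, fib.ifindex, 0);
    }

    char _license[] SEC("license") = "GPL";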

Speakers:

  • Roopa Prabhu (Cumulus Networks)
  • Nikolay Aleksandrov (Cumulus Networks)

Duration (incl. QA): 35 min

Content: Slides, Paper, Video

Abstract:

The Linux bridge is deployed on hosts, hypervisors, container OSes and, in recent years, on data center switches. It is complete in its feature set, with forwarding, learning, proxy and snooping functions. It can bridge Layer 2 domains between VMs, containers, racks, PODs and between data centers, as seen with Ethernet Virtual Private Networks [1, 2]. With Linux bridge deployments moving up the rack, it is now bridging larger Layer 2 domains, which brings scale challenges. The bridge forwarding database can scale to thousands of entries on a data center switch with hardware acceleration support.

In this paper we discuss performance and operational challenges with a large-scale bridge FDB database, and solutions to address them. We will discuss solutions such as FDB destination-port failover for faster convergence, a faster API for FDB updates from the control plane, and reducing the number of FDB destination ports by using lightweight tunnel endpoints when bridging over a tunneling solution (e.g. VXLAN).

Although the solutions are discussed in the context of the deployment scenarios below, most are generic and can be applied to all bridge use cases:

  • Multi-chassis link aggregation scenarios where Linux bridge is part of the active-active switch redundancy solution
  • Ethernet VPN solutions where Linux bridge forwarding database is extended to reach Layer-2 domains over a network overlay like VxLAN

[1] https://tools.ietf.org/html/draft-ietf-bess-evpn-overlay-11
[2] https://www.netdevconf.org/2.2/slides/prabhu-linuxbridge-tutorial.pdf

Speakers:

  • P.J. Waskiewicz (Intel)
  • Neerav Parikh (Intel)

Duration (incl. QA): 35 min

Content: Slides, Paper, Video

Abstract:

This talk is a continuation of the initial XDP HW-based hints work presented at NetDev 2.1 in Seoul, South Korea.

It will start by showcasing new prototypes that allow an XDP program to request the HW-generated metadata hints it requires from a NIC. The talk will show how the hints are generated by the NIC and what the performance characteristics are for various XDP applications. We also want to demonstrate how such metadata can be helpful for applications that use AF_XDP sockets.

The talk will then discuss upstreaming plans and look to generate more discussion around implementation details, programming flows, etc. with the larger community audience.

Speakers:

  • William Tu (VMware)
  • Greg Rose (VMware)

Duration (incl. QA): 35 min

Content: Slides, Paper, Video

Abstract:

Port mirroring is one of the most common network troubleshooting techniques. SPAN (Switch Port Analyzer) allows a user to send a copy of the monitored traffic to a local or remote device using a sniffer or packet analyzer. RSPAN is similar, but carries the mirrored traffic over a VLAN. ERSPAN extends the port mirroring capability from Layer 2 to Layer 3, allowing the mirrored traffic to be encapsulated in an extension of the GRE (Generic Routing Encapsulation) protocol and sent through an IP network. In addition, ERSPAN carries configurable metadata (e.g., session ID, timestamps), so that the packet analyzer has a better understanding of the packets.

ERSPAN for IPv4 was added to the Linux kernel in 4.14, and ERSPAN for IPv6 in 4.16. The implementation includes both transmission and reception and is based on the existing ip_gre and ip6_gre kernel modules. As a result, Linux today can act as an ERSPAN traffic source, sending ERSPAN mirrored traffic to a remote host, or as an ERSPAN destination, which receives and parses ERSPAN packets generated by Cisco or other ERSPAN-capable switches.

We've added both native tunnel support and metadata-mode tunnel support. In this paper, we demonstrate three ways to use the ERSPAN protocol. First, for Linux users, iproute2 can create a native tunnel net device; traffic sent to the net device is encapsulated with the protocol header accordingly, and traffic matching the protocol configuration is received from the net device. Second, for eBPF users, iproute2 can create a metadata-mode ERSPAN tunnel; with the eBPF TC hook and the eBPF tunnel helper functions, users can read/write the ERSPAN protocol's fields at a finer granularity. Finally, for Open vSwitch users, the netlink interface can be used to create a switch and to programmatically parse, look up, and forward ERSPAN packets based on flows installed from userspace.
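
As a rough sketch of the second (metadata-mode) usage, loosely modeled on the kernel's BPF tunnel selftests rather than on the paper itself: a TC program attached to a metadata-mode ("external") erspan device can set the tunnel key and ERSPAN options per packet. The addresses, keys and session index below are made-up values.

    #include <linux/bpf.h>
    #include <linux/erspan.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("tc")
    int erspan_set_md(struct __sk_buff *skb)
    {
        struct bpf_tunnel_key key = {};
        struct erspan_metadata md = {};

        key.remote_ipv4 = 0xac100164;        /* 172.16.1.100, host byte order */
        key.tunnel_id   = 2;                 /* ERSPAN session ID / GRE key   */
        key.tunnel_ttl  = 64;

        if (bpf_skb_set_tunnel_key(skb, &key, sizeof(key), BPF_F_ZERO_CSUM_TX) < 0)
            return TC_ACT_SHOT;

        /* ERSPAN version 1 (type II) metadata: just the 20-bit index field. */
        md.version = 1;
        md.u.index = bpf_htonl(123);
        if (bpf_skb_set_tunnel_opt(skb, &md, sizeof(md)) < 0)
            return TC_ACT_SHOT;

        return TC_ACT_OK;
    }

    char _license[] SEC("license") = "GPL";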

Speaker:

  • David S. Miller (Red Hat)

Duration (incl. QA): 25 min

Content: Slides, Video (Keynote)

Abstract:

TBD

Speakers:

  • Lawrence Brakmo (Facebook)
  • Boris Burkov (Facebook)
  • Greg Leclercq (Facebook)
  • Murat Mugan (Facebook)

Duration (incl. QA): 35 min

Content: Slides, Paper, Video

Abstract:

In this talk we describe our experiences evaluating DC-TCP. Preliminary testing with Netesto uncovered issues with our NIC that affected fairness between flows, as well as bugs in the DC-TCP code path in Linux that resulted in RPC tail latencies of up to 200 ms. Once we fixed those issues, we proceeded to test in a 6-rack mini cluster running some of our production applications. This testing demonstrated very large decreases in packet discards (12 to 1000x) at the cost of higher CPU utilization. In addition to describing the issues and fixes, we provide detailed experimental results, explore the causes of the higher CPU utilization and discuss partial solutions to this issue.

Speakers:

  • William Tu (VMware)
  • Joe Stringer (Cilium)
  • Yifeng Sun (VMware)
  • Yi-Hung Wei (VMware)

Duration (incl. QA): 45 min

Content: Slides, Paper, Video

Abstract:

OVS has been exploring the power of eBPF in three ways: (1) attaching eBPF to TC, (2) offloading a subset of processing to XDP, and (3) bypassing the kernel using AF_XDP. Unfortunately, as of today, none of the three approaches satisfies the requirements of OVS. In this presentation, we'd like to share the challenges we faced and the lessons we learned, and to seek feedback from the community on the future direction.

Attaching eBPF to TC started first, with the most aggressive goal: we planned to re-implement the entire feature set of the OVS kernel datapath under net/openvswitch/* in eBPF code. We worked around a couple of limitations; for example, the lack of TLV support led us to redefine a binary kernel-user API using a fixed-length array, and without a dedicated way to execute a packet, we created a dedicated device for user-to-kernel packet transmission, with a different BPF program attached to handle the packet-execute logic. Currently, we are working on connection tracking. Although a simple eBPF map can cover the basic operations of conntrack table lookup and commit, how to handle NAT, (de)fragmentation, and ALGs is still under discussion.

One layer below TC sits XDP (eXpress Data Path), a much faster layer for packet processing, but one with almost no extra packet metadata and limited BPF helper support. Depending on the complexity of the flows, OVS can offload a subset of its flow processing to XDP when feasible. However, XDP's smaller set of helper functions implies that either 1) only a very limited number of flows are eligible for offload, or 2) more of the flow processing logic needs to be done in native eBPF.

AF_XDP represents another form of XDP, with a socket interface for the control plane and a shared-memory API for accessing packets from userspace applications. OVS today has another full-fledged datapath implementation in userspace, called dpif-netdev, used by the DPDK community. By treating AF_XDP as a fast packet I/O channel, the OVS dpif-netdev can support almost all existing features. We are working on building the prototype and evaluating its performance.

RFC patch: OVS eBPF datapath, https://www.mail-archive.com/iovisor-dev@lists.iovisor.org/msg01105.html

Speakers:

  • Daniel Borkmann (Cilium)
  • John Fastabend (Cilium)

Duration (incl. QA): 35 min

Content: Slides, Paper, Video

Abstract:

This talk is divided into two parts. First, we present kTLS and the kernel's current sockmap BPF architecture for L7 policy enforcement, as well as the kernel's ULP and strparser frameworks, which both utilize to hook into socket callbacks and determine message boundaries for subsequent processing.

We further elaborate on the challenges we face when trying to combine kTLS with the power of BPF, with the eventual goal of allowing in-kernel introspection and policy enforcement of application data before encryption. Among other things, this includes a discussion of various approaches to address the shortcomings of the current ULP layer, optimizations for strparser, and the consolidation of scatter/gather processing for kTLS and sockmap, as well as future work on top of that.
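
For orientation, here is a minimal userspace sketch of handing TLS transmit state to the kernel (not from the talk; the helper name is invented, the key material is a placeholder and must come from a real handshake): the socket's ULP is switched to "tls", then the negotiated keys are installed with TLS_TX so that subsequent writes emit TLS records in the kernel, which is the layer sockmap/BPF would like to observe before encryption.

    #include <linux/tls.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    #ifndef SOL_TLS
    #define SOL_TLS 282
    #endif
    #ifndef TCP_ULP
    #define TCP_ULP 31
    #endif

    /* key/iv/salt/rec_seq come from the userspace TLS handshake (e.g. OpenSSL). */
    static int enable_ktls_tx(int sock, const unsigned char *key,
                              const unsigned char *iv, const unsigned char *salt,
                              const unsigned char *rec_seq)
    {
        struct tls12_crypto_info_aes_gcm_128 ci = {};

        /* Attach the kernel TLS upper layer protocol to this TCP socket. */
        if (setsockopt(sock, IPPROTO_TCP /* == SOL_TCP */, TCP_ULP,
                       "tls", sizeof("tls")) < 0)
            return -1;

        ci.info.version     = TLS_1_2_VERSION;
        ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
        memcpy(ci.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
        memcpy(ci.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
        memcpy(ci.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
        memcpy(ci.rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

        /* From here on, plaintext written to the socket is encrypted in-kernel. */
        return setsockopt(sock, SOL_TLS, TLS_TX, &ci, sizeof(ci));
    }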

Speaker:

  • Willem de Bruijn (Google)

Duration (incl. QA): 35 min

Content: Slides, Paper, Video

Abstract:

UDP is a popular foundation for new protocols. It is available across operating systems without superuser privileges and widely supported by middleboxes. Shipping protocols in userspace on top of a robust UDP stack allows for rapid deployment, experimentation and innovation of network protocols.

But implementing protocols in userspace has limitations. The environment lacks access to features such as high-resolution timers and hardware offload. Transport cost can also be high: the cycle count for transferring large payloads over UDP can be up to 3x that of TCP.

In this talk we present recent and ongoing work, both by the authors and others, at improving UDP for content delivery.

UDP segmentation offload amortizes transmit stack traversal by sending as many as 64 segments as one large fused packet. The kernel passes this through the stack as one datagram, then splits it into multiple packets and replicates their network and transport headers just before handing them to the network device.

Some devices can offload segmentation for exact multiples of segment size. We discuss how partial GSO support combines the best of software and hardware offload and evaluate the benefits of segmentation offload over standard UDP.

With these large buffers, MSG_ZEROCOPY becomes effective at removing the cost of copying in sendmsg, often the largest single line item in these workloads. We extend this to UDP and evaluate it on top of GSO.

Bursting too many segments at once can cause drops and retransmits. SO_TXTIME adds a release time interface which allows offloading of pacing to the kernel, where it is both more accurate and cheaper. We will look at this interface and how it is supported by queuing disciplines and hardware devices.

Finally, we look at how these transmit savings can be extended to the forwarding and receive paths through the complement of GSO, GRO, and local delivery of fused packets.
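
A small illustration of the segmentation-offload interface discussed above (a sketch, not the authors' code; the 1200-byte segment size and the helper are arbitrary): one large send is cut into wire-sized UDP segments late in the stack, or in the NIC where hardware supports it.

    #include <netinet/in.h>
    #include <netinet/udp.h>
    #include <sys/socket.h>

    #ifndef UDP_SEGMENT
    #define UDP_SEGMENT 103              /* GSO segment size, since Linux 4.18 */
    #endif
    #ifndef SOL_UDP
    #define SOL_UDP 17
    #endif

    /* Send one large buffer as a train of 1200-byte UDP segments. */
    static ssize_t send_udp_gso(int fd, const struct sockaddr_in *dst,
                                const void *buf, size_t len)
    {
        int gso_size = 1200;             /* payload bytes per generated segment */

        if (setsockopt(fd, SOL_UDP, UDP_SEGMENT, &gso_size, sizeof(gso_size)) < 0)
            return -1;

        /* One stack traversal; the split into segments happens late (GSO) or in
         * hardware. MSG_ZEROCOPY and SO_TXTIME pacing can be layered on top. */
        return sendto(fd, buf, len, 0, (const struct sockaddr *)dst, sizeof(*dst));
    }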

Speaker:

  • Nikita V. Shirokov (Facebook)

Duration (incl. QA): 35 min

Content: Slides, Paper, Video

Abstract:

Today every packet reaching Facebook's network is processed by an XDP-enabled application. We have been using XDP for more than 1.5 years, and this talk is about the evolution of XDP and BPF as driven by our production needs. I am going to talk about the history of changes in core BPF components and show why and how they were made, what performance improvements we got (with synthetic and real-world data) and how they were implemented. I am also going to talk about issues and shortcomings of BPF/XDP which we have found during our operations, as well as some gotchas and corner cases. At the end we will discuss what is still missing and which parts could be improved.

Topics and areas of the existing BPF/XDP infrastructure that will be covered in this talk:

  • Why helpers such as bpf_adjust_head/bpf_adjust_tail have been added
  • Unit testing and microbenchmarking with bpf_prog_test_run: how to add test coverage for your BPF program and track regressions (we will cover how Spectre affected the BPF kernel infrastructure and what tweaks have been made to get some performance back)
  • How map-in-map helps us scale and make sure that we don't waste memory (a sketch follows this list)
  • NUMA-aware allocation for BPF maps
  • Inline lookups for BPF arrays/map-in-map
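
As a point of reference for the map-in-map item above, here is a sketch in current libbpf BTF-defined-map syntax (which postdates the 2018-era iproute2 style; names and sizes are invented): an outer array-of-maps lets userspace swap whole inner maps atomically without touching the program.

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
        __uint(max_entries, 512);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(__u32));
        /* Template describing what the inner maps look like. */
        __array(values, struct {
            __uint(type, BPF_MAP_TYPE_ARRAY);
            __uint(max_entries, 1);
            __type(key, __u32);
            __type(value, __u64);
        });
    } services SEC(".maps");

    SEC("xdp")
    int count_per_service(struct xdp_md *ctx)
    {
        __u32 svc = 0, slot = 0;     /* real code would derive svc from the packet */
        void *inner;
        __u64 *counter;

        inner = bpf_map_lookup_elem(&services, &svc);  /* outer lookup -> inner map */
        if (!inner)
            return XDP_PASS;

        counter = bpf_map_lookup_elem(inner, &slot);
        if (counter)
            __sync_fetch_and_add(counter, 1);

        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";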

Lessons we have learned while operating XDP:

  • BPF instruction counts vs. complexity
  • How to attach more than one XDP program to an interface (see the tail-call sketch after this list)
  • When LLVM and the verifier disagree: some tricks to force LLVM to generate verifier-friendly BPF
  • A brief discussion of HW limitations: NIC bandwidth vs. packets-per-second performance
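
For the multi-program item above, one common pattern (a generic sketch, not Facebook's setup) is to attach a single root program and dispatch to further stages through a program array:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    #define STAGE_FIREWALL 0             /* slots are filled in by the loader */
    #define STAGE_BALANCER 1

    struct {
        __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
        __uint(max_entries, 8);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(__u32));
    } jmp_table SEC(".maps");

    SEC("xdp")
    int xdp_root(struct xdp_md *ctx)
    {
        /* Jump to the first stage; that program may in turn tail-call the next.
         * If the slot is empty the call falls through and we simply pass. */
        bpf_tail_call(ctx, &jmp_table, STAGE_FIREWALL);
        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";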

Missing parts: what could be added, and why:

  • The need for hardware checksumming offload
  • Bounded loops: what they would allow us to do

Speakers:

  • Magnus Karlsson (Intel)
  • Björn Töpel (Intel)

Duration (incl. QA): 35 min

Content: Slides, Paper, Video

Abstract:

AF_XDP is a new socket type for raw frames, to be introduced in 4.18 (in linux-next at the time of writing). The current code base offers throughput north of 20 Mpps per application core for 64-byte packets on our system; however, there are many optimizations that could be performed to increase this even further. The focus of this paper is the performance optimizations we need to make in AF_XDP to get it to perform as fast as DPDK.
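
For readers new to the interface, the control-plane side can be sketched roughly as follows (not code from the paper; UMEM registration and the ring mmap()s are elided for brevity):

    #include <linux/if_xdp.h>
    #include <net/if.h>
    #include <sys/socket.h>

    #ifndef AF_XDP
    #define AF_XDP 44
    #endif

    /* Create an AF_XDP socket and bind it to one RX/TX queue of an interface. */
    static int open_xsk(const char *ifname, unsigned int queue_id)
    {
        struct sockaddr_xdp sxdp = {};
        int fd = socket(AF_XDP, SOCK_RAW, 0);

        if (fd < 0)
            return -1;

        /* ...register the UMEM and set up the fill/completion/RX/TX rings
         * (XDP_UMEM_REG, XDP_RX_RING, ... setsockopts plus mmap) here... */

        sxdp.sxdp_family   = AF_XDP;
        sxdp.sxdp_ifindex  = if_nametoindex(ifname);
        sxdp.sxdp_queue_id = queue_id;
        sxdp.sxdp_flags    = XDP_COPY;   /* or XDP_ZEROCOPY on capable drivers */

        if (bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp)) < 0)
            return -1;
        return fd;
    }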

We present optimizations that fall into two broad categories: ones that are seamless to the application and ones that require additions to the uapi. In the first category we examine the following:

  • Loosen the requirement for having an XDP program. If the user does not need an XDP program and there is only one AF_XDP socket bound to a particular queue, we do not need an XDP program. This should cut out quite a number of cycles from the RX path.
  • Wire up busy poll from user space. If the application writer is using epoll() and friends, this has the potential benefit of removing the coherency communication between the RX (NAPI) core and the application core as everything is now done on a single core. Should improve performance for a number of use cases. Maybe it is worth revisiting the old idea of threaded NAPI in this context too.
  • Optimize for high instruction cache usage through batching, as has been explored, for example, in Cisco's VPP stack and by Edward Cree in his net-next RFC "Handle multiple received packets at each stage".

In the uapi extensions category we examine the following optimizations:

  • Support a new mode for NICs with in-order TX completions. In this mode, the completion queue would not be used. Instead, the application would simply look at the pointer in the TX queue to see whether a packet has been completed. In this mode we do not need any backpressure between the completion queue and the TX queue, and we do not need to populate or publish anything in the completion queue as it is not used. This should improve the performance of TX for in-order NICs significantly.
  • Introduce the "type-writer" model, where each chunk can contain multiple packets. This is the model that, e.g., Chelsio has in its NICs. Experiments show that this mode can also provide better performance for regular NICs, as there are fewer transactions on the queues. It requires a new flag to be introduced in the options field of the descriptor.

With these optimizations, we believe we can reach our goal of close to 40 Mpps of throughput for 64-byte packets in zero-copy mode. A full analysis with performance numbers will be presented in the final paper.

Speakers:

  • Marcelo Ricardo Leitner (Red Hat)
  • Xin Long (Red Hat)

Duration (incl. QA): 35 min

Content: Slides, Paper, Video

Abstract:

SCTP is a transport protocol, like TCP and UDP, originating from the IETF SIGTRAN Working Group in the early 2000s with the initial objective of supporting the transport of PSTN signalling over IP networks. It featured multi-homing and multi-streaming from the beginning, and since then there have been a number of improvements that help it serve other purposes too, such as support for Partial Reliability and Stream Scheduling.

Linux SCTP arrived late and then stalled. It was not up to date with the released RFCs, it was far behind other systems such as BSD, and it also suffered from performance problems. Over the past two years we have been dedicated to addressing these gaps and have focused on making many improvements. Now all the features from released RFCs are fully supported in Linux, and some from draft RFCs are already in progress. We have also seen a clear improvement in performance in various scenarios.

In this talk we will first do a quick review on SCTP basics, including:

  • Background: Why SCTP is used for PSTN Signalling Transport, why other applications are using or will use SCTP.
  • Architecture: The general SCTP structures and procedures implemented in Linux kernel.
  • Vs. TCP/UDP: An overview of the functions and applicability of SCTP, TCP and UDP.

We will then go through the improvements made in the past two years, including:

  • SCTP-related projects in Linux: other than the kernel part, there are also lksctp-tools, sctp-tests, tahi-sctp, etc.
  • Recently implemented features: RFC features such as Stream Scheduling, Message Interleaving, Stream Reconfig, the Partially Reliable Policy, and many CMSGs, SndInfos and socket APIs.
  • Improvements made recently: big patchsets such as SCTP offload, the transport hashtable, SCTP diag and full SELinux support.
  • Vs. BSD: we will take a look at the differences between Linux and BSD regarding SCTP today. You may be surprised to see that we have gone further than other systems.

We will finish by reviewing what is on our radar as well as the next steps, such as:

  • Ongoing features: SCTP NAT and SCTP CMT, two big and important features, are in progress and already taking shape, and more performance improvements in the kernel have also been started.
  • Code refactoring: a new congestion framework will be introduced, which will make it more flexible for SCTP to add further congestion algorithms.
  • Hardware support: HW CRC checksum offload and GSO will definitely improve performance; this requires new segmentation logic, for both the .segment callback and the hardware, that works on SCTP chunks.
  • RFC document improvements: we believe that more extensions and revisions will make SCTP more widespread.

Given its power and complexity, SCTP is destined to face many challenges and threats, but we believe we have made it, and will continue to make it, better not only than the implementations on other systems but also than other transport protocols. Please join us; Linux SCTP needs your help too!
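
As a reminder of what SCTP multi-streaming looks like from userspace (a minimal sketch using the lksctp-tools API mentioned above, not material from the talk; addresses, stream counts and the helper name are arbitrary):

    #include <netinet/in.h>
    #include <netinet/sctp.h>            /* from lksctp-tools; link with -lsctp */
    #include <sys/socket.h>

    static int sctp_send_on_stream(const struct sockaddr_in *peer,
                                   const void *msg, size_t len)
    {
        struct sctp_initmsg init = {
            .sinit_num_ostreams  = 10,   /* ask for up to 10 outbound streams */
            .sinit_max_instreams = 10,
        };
        int fd = socket(AF_INET, SOCK_SEQPACKET, IPPROTO_SCTP);

        if (fd < 0)
            return -1;
        if (setsockopt(fd, IPPROTO_SCTP, SCTP_INITMSG, &init, sizeof(init)) < 0)
            return -1;

        /* ppid 0, flags 0, stream number 3, no TTL, no context. */
        return sctp_sendmsg(fd, msg, len, (struct sockaddr *)peer, sizeof(*peer),
                            0, 0, 3, 0, 0);
    }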

Speaker:

  • Nick Viljoen (Netronome)

Duration (incl. QA): 35 min

Content: Slides, Paper, Video

Abstract:

eBPF (extended Berkeley Packet Filter) has been shown to be a flexible kernel construct used for a variety of use cases, such as load balancing, intrusion detection systems (IDS), tracing and many others. One such emerging use case revolves around the proposal made by William Tu for the use of eBPF as a data path for Open vSwitch. However, there are broader switching use cases developing around the use of eBPF capable hardware. This talk is designed to explore the bottlenecks that exist in generalising the application of eBPF further to both container switching as well as physical switching.

Topics covered will include proposals for container isolation through features such as programmable RSS, the viability of physical switching using eBPF-capable hardware, and integrations with other subsystems or additional helper functions that may expand the possible functionality.
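
As one concrete touchpoint (a sketch using current libbpf, not from the talk): from the attachment point of view, running an XDP program on eBPF-capable hardware instead of in the driver is mostly a matter of the attach flag, although for true offload the program must also be loaded and verified against the target device.

    #include <bpf/libbpf.h>
    #include <linux/if_link.h>
    #include <linux/types.h>
    #include <net/if.h>

    /* prog_fd: an already-loaded XDP program. For hardware offload the program
     * additionally has to be loaded with the device's ifindex so the NIC's
     * translator is involved at verification time. */
    static int attach_xdp(const char *ifname, int prog_fd, int offload)
    {
        __u32 flags = offload ? XDP_FLAGS_HW_MODE : XDP_FLAGS_DRV_MODE;

        return bpf_xdp_attach(if_nametoindex(ifname), prog_fd, flags, NULL);
    }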

Speakers:

  • Jesse Brandeburg (Intel)
  • Anjali Singhai Jain (Intel)

Duration (incl. QA): 35 min

Content: Slides, Paper, Video

Abstract:

Over the last 10 years the world has seen NICs go from single-port, single-netdev devices to multi-port, hardware-switching, CPU/NFP-having, FPGA-carrying behemoths providing hundreds of attached netdevs. This presentation will begin with an overview of the current state of filtering and scheduling, and the evolution of the kernel and networking hardware interfaces. (HINT: it's a bit of a jungle we've helped grow!) We'll summarize the different kinds of networking products available from different vendors, and show the workflows of how a user can use the network hardware offloads/accelerations available and where there are still gaps.

Of particular interest to us is how to have a useful, generic hardware-offload-supporting infrastructure (with seamless software fallback!) within the kernel, and we'll explain the differences between deploying an eBPF program that can run in software and one that can be offloaded by a programmable ASIC-based NIC. We will discuss our analysis of the cost of an offload, and when it may not be a great idea to do so, as hardware offload is most useful when it achieves the desired speed and requires no special software (kernel changes).

Some other topics we will touch on: the programmability exposed by smart NICs is more than that of a data-plane packet processing engine, and hence any packet processing programming language, such as eBPF or P4, will require certain extensions to take advantage of the device capabilities in a holistic way. We'll provide a look into the future and how we think our customers will use the interfaces we want to provide, both from our hardware and from the kernel. We will also go over the matrix of the most important parameters that are shaping our HW designs, and why.

Speakers:

  • Anant Deepak (Facebook)
  • Richard Huang (Facebook)
  • Puneet Mehra (Facebook)

Duration (incl. QA): 35 min

Content: Slides, Paper, Video

Abstract:

iptables has been the typical tool for creating firewalls on Linux hosts. We have used it at Facebook for setting up host firewalls on our servers across a variety of tiers. In this proposal, we introduce an eBPF/XDP-based firewall solution which we use for packet filtering and which has parity with our iptables implementation. We discuss various aspects of this below; the following is a brief summary, which we will detail further in the paper and presentation.

Design and Implementation:

  • We use BPF tables (maps, LPM tries, and arrays) to match on the appropriate packet header contents (a sketch of an LPM-trie prefix match follows this list)
  • The heart of the firewall is an eBPF filter which parses a packet, does lookups against all relevant maps and collects the matching values. A logical rule set is then applied to these collected values; this logical set reads like a human-readable, high-level firewall policy. With iptables rules, amid the verbose matching criteria inlined in every rule, such a policy-level representation is hard to infer.
Performance benefits and comparisons with iptables:
  • iptables does packet matching linearly against each rule until a match is found. In our proposal, we use BPF Tables (maps) containing keys for all rules, making packet matching highly efficient. We then apply the policy using the collected results, which results in a considerable speedup over iptables.
Ease of policy / config updates and maintenance:
  • The network administrator owns the firewall, while the app developers typically require opening ports for their applications to work. With our approach of using an eBPF filter, we create a logical separation between the filter, which enforces the policy, and the contents of the associated maps, which represent the specific ports and prefixes that need to be filtered. The policy is owned by the network administrator (example: ports open to the internet, ports open from within specific prefixes, drop everything else). The data (port numbers, prefixes, etc.) can now live in a separate logical section which gives application developers a predetermined place to update their data (example: a file containing ports opened to internal subnets, etc.). This reduces friction between the two functions and reduces human error.
Deployment experience:
  • We deploy this solution in our edge infrastructure to implement our firewall policy.
  • We update the configuration, reload filters and refresh the contents of the various maps containing the keys and values used for filtering
BPF Program arrays:
  • We use the power of BPF program arrays to chain different programs such as rate limiters, firewalls, load balancers, etc. These are building blocks for creating a rich, high-performance networking solution
Proposal for a completely generic firewall solution to migrate existing iptables rules to eBPF / XDP based filtering:
  • We present a proposal which can translate existing iptables rules into a more performant eBPF program, with most of the processing and validation done in user space.
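
A sketch of the LPM-trie building block referenced above (the names, value layout and policy logic are invented for illustration; the real rule layout and verdict logic are the authors'):

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    /* Key layout required by BPF_MAP_TYPE_LPM_TRIE: prefix length, then data. */
    struct lpm_v4_key {
        __u32 prefixlen;
        __u32 addr;                      /* IPv4 address, network byte order */
    };

    struct {
        __uint(type, BPF_MAP_TYPE_LPM_TRIE);
        __uint(max_entries, 4096);
        __uint(map_flags, BPF_F_NO_PREALLOC);   /* mandatory for LPM tries */
        __type(key, struct lpm_v4_key);
        __type(value, __u32);                   /* e.g. a rule-group bitmap */
    } allowed_src_prefixes SEC(".maps");

    static __always_inline __u32 match_src(__u32 saddr)
    {
        struct lpm_v4_key key = { .prefixlen = 32, .addr = saddr };
        __u32 *groups = bpf_map_lookup_elem(&allowed_src_prefixes, &key);

        return groups ? *groups : 0;
    }

    SEC("xdp")
    int fw_demo(struct xdp_md *ctx)
    {
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        struct iphdr *iph = data + sizeof(*eth);

        if ((void *)(iph + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        /* Drop unless the source address falls within some allowed prefix. */
        return match_src(iph->saddr) ? XDP_PASS : XDP_DROP;
    }

    char _license[] SEC("license") = "GPL";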

Speakers:

  • William Tu (VMware)
  • Fabian Ruffy (University of British Columbia)
  • Mihai Budiu (VMware)

Duration (incl. QA): 35 min

Content: Slides, Paper, Video

Abstract:

The eXpress Data Path (XDP) is a new kernel feature intended to provide fast packet processing as close as possible to the device hardware. XDP builds on top of the extended Berkeley Packet Filter (eBPF) and allows users to write a C-like packet processing program, which can be attached to the device driver's receive queue. When the device observes an incoming packet, the user-defined XDP program executes on the packet payload, making a decision as early as possible before handing the packet down the processing pipeline.

P4 is a domain-specific language describing how packets are processed by the data plane of programmable network elements, including network interface cards, appliances, and virtual switches. It provides an abstraction that allows programmers to express existing and future protocol formats without coupling them to any data-plane-specific knowledge. The language is explicitly designed to be protocol-agnostic: a P4 programmer can define their own protocols and load the P4 program into P4-capable network elements. As a high-level networking language, P4 supports a diverse set of compiler backends and is also capable of expressing eBPF and XDP programs.

We present P4C-XDP, a new backend for the P4 compiler. P4C-XDP leverages XDP to aim for a high-performance software data plane. The backend generates an eBPF-compliant C representation from a given P4 program, which is passed to clang and LLVM to produce the bytecode. Using conventional eBPF kernel hooks, the program can then be loaded into the eBPF virtual machine in the device driver. The kernel verifier guarantees the safety of the generated code. Any packet received or transmitted by this device driver now triggers the execution of the loaded P4 program.

The P4C-XDP project is an open source project hosted at https://github.com/vmware/p4c-xdp/. We provide proof-of-concept sample code under the tests directory, which contains examples such as basic protocol parsing, checksum recalculation, multiple table lookups, and tunnel protocol en-/decapsulation.

Speakers:

  • Paolo Abeni (Red Hat)
  • Davide Caratti (Red Hat)
  • Eelco Chaudron (Red Hat)
  • Marcelo Ricardo Leitner (Red Hat)

Duration (incl. QA): 35 min

Content: Slides, Paper, Video

Abstract:

Currently the Linux kernel implements two distinct datapaths for Open vSwitch: the OVS kernel datapath and the TC datapath. The latter was added recently, mainly to allow HW offload, while the former is usually preferred for SW-based forwarding for functional and performance reasons.

We evaluate both datapaths in a typical forwarding scenario - the PVP test - using the perf tool to identify bottlenecks in the TC SW datapath. While similar steps usually incur similar costs, the TC SW datapath requires an additional per-packet skb_clone due to a TC actions constraint.

We propose to extend the existing act infrastructure, leveraging the ACT_REDIRECT action and the bpf redirect code, to allow clone-free forwarding from the mirred action, and then re-evaluate the datapaths' performance: the gap is then almost closed.

Nevertheless, TC SW performance can be further improved by completing the RCU-ification of the TC actions and extending the recent listification infrastructure to the TC (ingress) hook. We also plan to compare the TC SW datapath with a custom eBPF program implementing the equivalent flow set, to establish a reference value for the target performance.

Speaker:

  • Joe Stringer (Cilium)

Duration (incl. QA): 35 min

Content: Slides, Paper, Video

Abstract:

Over the past several years, BPF has steadily become more powerful in multiple ways: Through building more intelligence into the verifier which allows more complex programs to be loaded, and through extension of the API such as by adding new map types and new native BPF function calls. While BPF has its roots in applying filters at the socket layer, the ability to introspect the sockets relating to traffic being filtered has been limited.

To build such awareness into a BPF helper, the verifier needs the ability to track the safety of the calls, including appropriate reference counting upon the underlying socket. This talk walks through extensions to the verifier to perform tracking of references in a BPF program. This allows BPF developers to extend the UAPI with functions that allocate and release resources within the execution lifetime of a BPF program, and the verifier will validate that the resources are released exactly once prior to program completion.

Using this new reference tracking ability in the verifier, we add socket lookup and release function calls to the BPF API, allowing BPF programs to safely find a socket and build logic upon the presence or attributes of a socket. This can be used to load-balance traffic based on the presence of a listening application, or to implement stateful firewalling primitives to understand whether traffic for this connection has been seen before. With this new functionality, BPF programs can integrate more closely with the networking stack's understanding of the traffic transiting the kernel.
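
A condensed sketch of what the new helpers enable (illustrative only, with simplified Ethernet/IPv4/TCP parsing and no option handling): look up the socket that would receive the packet, branch on its presence, and let the verifier's reference tracking enforce the matching release.

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/in.h>
    #include <linux/ip.h>
    #include <linux/tcp.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("tc")
    int sk_guard(struct __sk_buff *skb)
    {
        void *data = (void *)(long)skb->data;
        void *data_end = (void *)(long)skb->data_end;
        struct ethhdr *eth = data;
        struct iphdr *iph = data + sizeof(*eth);
        struct tcphdr *tcp = (void *)iph + sizeof(*iph);
        struct bpf_sock_tuple tuple = {};
        struct bpf_sock *sk;

        if ((void *)(tcp + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP) ||
            iph->protocol != IPPROTO_TCP)
            return TC_ACT_OK;

        tuple.ipv4.saddr = iph->saddr;
        tuple.ipv4.daddr = iph->daddr;
        tuple.ipv4.sport = tcp->source;
        tuple.ipv4.dport = tcp->dest;

        /* Acquires a reference; the program cannot exit without releasing it. */
        sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
                               BPF_F_CURRENT_NETNS, 0);
        if (!sk)
            return TC_ACT_SHOT;          /* no matching local socket: drop */

        bpf_sk_release(sk);
        return TC_ACT_OK;
    }

    char _license[] SEC("license") = "GPL";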

Speakers:

  • Jesper Dangaard Brouer (Red Hat)
  • Toke Høiland-Jørgensen (Karlstad University)

Duration (incl. QA): 35 min

Content: Slides, Paper, Video

Abstract:

XDP already offers rich facilities for high performance packet processing, and has seen deployment in several production systems. However, this does not mean that XDP is a finished system; on the contrary, improvements are being added in every release of Linux, and rough edges are constantly being filed down. The purpose of this talk is to discuss some of these possibilities for future improvements, including how to address some of the known limitations of the system. We are especially interested in soliciting feedback and ideas from the community on the best way forward.

The issues we are planning to discuss include, but are not limited to:

  • User experience and debugging tools: How do we make it easier for people who are not familiar with the kernel or XDP to get to grips with the system and be productive when writing XDP programs?
  • Driver support: How do we get to full support for XDP in all drivers? Is this even a goal we should be striving for?
  • Performance: At high packet rates, every micro-optimisation counts. Things like inlining function calls in drivers are important, but also batching to amortise fixed costs such as DMA mapping. What are the known bottlenecks, and how do we address them?
  • QoS and rate transitions: How should we do QoS in XDP? In particular, rate transitions (where a faster link feeds into a slower) are currently hard to deal with from XDP, and would benefit from, e.g., Active Queue Management (AQM). Can we adapt some of the AQM and QoS facilities in the regular networking stack to work with XDP? Or should we do something different?
  • Accelerating other parts of the stack: Tom Herbert started the discussion on accelerating transport protocols with XDP back in 2016. How do we make progress on this? Or should we be doing something different? Are there other areas where we can extend XDP's processing model to provide useful acceleration?

Speaker:

  • Andrew Lunn

Duration (incl. QA): 35 min

Content: Slides, Paper, Video

Abstract:

phylib has provided the API that Ethernet MAC drivers use to control copper PHYs for many years. However, with the advent of MACs/PHYs with bandwidths above 1 Gbps, SERDES interfaces and fibre-optic modules, phylib is no longer sufficient. phylink provides an API which MAC drivers can use to control these more complex, dynamic and possibly hot-pluggable PHYs. This presentation will explain why phylink is needed and how it differs from phylib, and describe how to convert a MAC driver from phylib to phylink in order to make use of its new features. The kernel support for SFP modules will also be detailed, including how the MAC needs to handle hot-plugging of the PHY, which can be copper or fibre.

Speakers:

  • Lawrence Brakmo (Facebook)
  • Alexei Starovoitov (Facebook)

Duration (incl. QA): 35 min

Content: Slides, Paper, Video

Abstract:

Linux currently provides mechanisms for managing and allocating many system resources, such as CPU, memory, etc. Network resource management is more complicated, since networking deals not only with a local resource, as CPU management does, but also with a global resource. The goal is not only to provide a mechanism for allocating the local network resource (NIC bandwidth), but also to support management of network resources external to the host, such as link and switch bandwidths.

For networking, the primary mechanism for allocating and managing bandwidth has been the traffic control (tc) subsystem. While tc allows for shaping of outgoing traffic and policing of incoming traffic, it suffers from some drawbacks. The first drawback is a history of performance issues when using the hierarchical token bucket (HTB) queueing discipline, which is usually required for anything beyond simple shaping needs. A second drawback is the lack of flexibility usually provided by general programming constructs.

We are in the process of designing and implementing a BPF based framework for efficiently supporting shaping of both egress and ingress traffic based on both local and global network allocations.

All presentation material and papers can be found here.

Key dates

  • Proposal submissions are due by July 11th 2018.
  • Authors will be notified of acceptance/rejection by August 15th, 2018.
  • 1st draft of the slides and papers (both as PDF) are due by October 29th, 2018.
  • Final drafts (as PDF) are due by November 4th, 2018.

Networking Track Technical Committee

The networking track is run and organized by the Linux netdev community. The technical program committee for this year is:

  • David S. Miller (Chair, Red Hat)
  • Daniel Borkmann (Cilium)
  • Florian Fainelli (Broadcom)
  • Jesper Dangaard Brouer (Red Hat)

Call for Proposal (CFP) on Talks

We are seeking proposals for the networking track at Linux Plumbers Conference in Vancouver from November 13th, 2018 to November 14th, 2018.

Proposals submitted to the LPC Networking Track Technical Committee are for 40-minute talks and must be accompanied by a 2-10 page paper. They should cover new and upcoming work, with suggestions for solutions to open problems, on topics including but not limited to:

  • XDP and BPF
  • Wireless Networking
  • Performance and Performance Analysis
  • TCP and congestion control algorithms
  • Interactions between Networking and other subsystems
  • IPSEC
  • Security
  • Testing
  • Offloading
  • Configuration APIs
  • Filtering and Classification

Proposal submissions sent to the LPC Networking Track Technical Committee should contain the following for review:

  • Title,
  • A list of submitter names, and
  • A description of up to 350 words

The LPC Networking Track Technical Committee will review all submissions and will provide feedback promptly.

Paper requirements and other questions can be discussed on a case-by-case basis, with the exception of recycled material.

Please contact the LPC Networking Track Technical Committee with any questions.