Linux Plumbers Conference 2018
BPF Microconference

 

 

    

A BPF Microconference will be featured at this year's Linux Plumbers Conference (LPC) in Vancouver, British Columbia, Canada.

It will run on the third day of LPC, November 15, so that it is not overlapping with the LPC's Networking Track on the first two days which received a large number of networking related BPF and XDP talks this year.

The goal of the BPF Microconference is to bring BPF developers together to discuss and hash out unresolved issues and to move new ideas forward. The focus of this year's event is on the core BPF infrastructure as well as its many subsystems and related user space tooling.

The BPF Microconference will be open to all LPC attendees. There is no additional registration required. This is also a great occasion for BPF users and developers to meet face to face and to exchange and discuss developments.

Schedule

For viewing the schedule, please check the main LPC website:

Accepted Topics

The following discussion topics have been accepted by the LPC BPF Microconference Organizers:

Speaker:

  • Vlad Dumitrescu (Google)
  • Eddie Hao (Google)
  • Willem de Bruijn (Google)

Duration: 20 min

Content: Slides

Abstract:

Google servers classify, measure, and shape their outgoing traffic. The original implementation is based on Linux kernel traffic control (TC). As server platforms scale so does their network bandwidth and number of classified flows, exposing scalability limits in the TC system - specifically contention on the root qdisc lock.

Mechanisms like selective qdisc bypass, sharded qdisc hierarchies, and low-overhead prequeue ameliorate the contention up to a point. But they cannot fully resolve it. Recent changes to the Linux kernel make it possible to move classification, measurement, and packet mangling outside this critical section, potentially scaling to much higher rates while simultaneously shaping more flows and applying more flexible policies.

By moving classification and measurement to BPF at the new TC egress hook, servers avoid taking a lock millions of times per second. Running BPF programs at socket connect time with TCP_BPF converts overhead from per-packet to per-flow. The programmability of BPF also allows us to implement entirely new functions, such as runtime configurable congestion control, first-packet classification and socket-based QoS policies. It enables faster deployment cycles and as this business logic can be updated dynamically from a user agent. The discussion will focus on our experience converting an existing traffic shaping system to a solution based on BPF, and the issues we’ve encountered during testing and debugging.

Speaker:

  • Yonghong Song (Facebook)
  • Alexei Starovoitov (Facebook)

Duration: 20 min

Content: Slides

Abstract:

Compile-once and run-everywhere can make deployment simpler and may consume less resources on the target host, e.g., without llvm compiler and kernel devel package. Currently bpf programs for networking can compile once and run over different kernel versions. But bpf programs for tracing cannot since it accesses kernel internal headers and these headers are subject to change between kernel versions.

But compile-once run-everywhere for tracing is not easy. BPF programs could access anything in the kernel headers, including data structures, macros and inline functions. To achieve this goal, we need (1) preserving header-level accesses for the bpf program, and (2) abstracting header info of vmlinux. Right before program load on the target host, some kind of resolution is done for bpf program against the running kernel so the resulted program is just like to that compiled against host kernel headers.

In this talk, we will explore how BTF could be used by both bpf program and vmlinux to advance the possibility of bpf program compile-once and run-everywhere.

Speaker:

  • Joe Stringer (Cilium)
  • Daniel Borkmann (Cilium)

Duration: 20 min

Content: Slides

Abstract:

BPF program writers today who build and distribute their programs as ELF objects typically write their programs using one of a small set of (mostly) similar headers that establish norms around ELF section definitions. One such norm is the presence of a "maps" section which allows maps to be referenced within BPF instructions using virtual file descriptors. When a BPF loader (eg, iproute2) opens the ELF, it loads each map referred in this section, creates a real file descriptor for that map, then updates all BPF instructions which refer to the same map to specify the real file descriptor. This allows symbolic referencing to maps without requiring writers to implement their own loaders or recompile their programs every time they create a map.

This discussion will take a look at how to provide similar symbolic referencing for static data. Existing implementations already templatize information such as MAC or IP addresses using C macros, then invoke a compiler to replace such static data at load time, at a cost of one compilation per load. By extending the support for static variables into ELF sections, programs could be written and compiled once then reloaded many times with different static data.

Speaker:

  • John Fastabend (Cilium)

Duration: 20 min

Content: Slides

Abstract:

Currently, BPF can not support basic loops such as for, while, do/while, etc. Users work around this by forcing the compiler to "unroll" these control flow constructs in the LLVM backend. However, this only works up to a point. Unrolling increases instruction count and complexity on the verifier and further LLVM can not easily unroll all loops. The result is developers end up writing code that is unnatural, iterating until they find a version that LLVM will compile into a form the verifier backend will support.

We developed a verifier extension to detect bounded loops here,

https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/log/?h=wip/bpf-loop-detection

This requires building a DOM tree (computationally expensive) and then matching loop patterns to find loop invariants to verify loops terminate. In this discussion we would like to cover the pros and cons of this approach. As well as discuss another proposal to use explicit control flow instructions to simplify this task.

The goal of this discussion would be to come to a consensus on how to proceed to make progress on supporting bounded loops.

Speaker:

  • Jiong Wang (Netronome)

Duration: 20 min

Content: Slides

Abstract:

eBPF has 64-bit general purpose registers, therefore 32-bit architectures normally need to use register pair to model them and need to generate extra instructions to manipulate the high 32-bit in the pair. Some of these overheads incurred could be eliminated if JIT compiler knows only the low 32-bit of a register is interested. This could be known through data flow (DF) analysis techniques. Either the classic iterative DF analysis or "path-sensitive" version based on verifier's code path walker.

In this talk, implementations for both versions of DF analyser will be presented. We will see how a def-use chain based classic eBPF DF analyser looks first, and will see the possibility to integrate it with previous proposed eBPF control flow graph framework to make a stand-alone eBPF global DF analyser which could potentially serve as a library. Then, another "path-sensitive" DF analyser based on the existing verifier code path walker will be presented. We will discuss how function calls, path prune, path switch affect the implementation. Finally, we will summarize pros and cons for each, and will see how could each of them be adapted to 64-bit and 32-bit architecture back-ends.

Also, eBPF has 32-bit sub-register and ALU32 instructions associated, enable them (-mattr=+alu32) in LLVM code-gen could let the generated eBPF sequences carry more 32-bit information which could potentially easy flow analyser. This will be briefly discussed in the talk as well.

Speaker:

  • Quentin Monnet (Netronome)

Duration: 20 min

Content: Slides

Abstract:

eBPF (extended Berkeley Packet Filter), in particular with its driver-level hook XDP (eXpress Data Path), has increased in importance over the past few years. As a result, the ability to rapidly debug and diagnose problems is becoming more relevant. This talk will cover common issues faced and techniques to diagnose them, including the use of bpftool for map and program introspection, the use of disassembly to inspect generated assembly code and other methods such as using debug prints and how to apply these techniques when eBPF programs are offloaded to the hardware.

The talk will also explore where the current gaps in debugging infrastructure are and suggest some of the next steps to improve this, for example, integrations with tools such as strace, valgrind or even the LLDB debugger.

Speaker:

  • Adrian Szyndela (Samsung)
  • Maciej Slodczyk (Samsung)

Duration: 20 min

Content: Slides

Abstract:

Complex software usually depends on many different components, which sometimes perform background tasks with side effects not directly visible to their users. Without proper tools it can be hard to identify which component is responsible for performance hits or undesired behaviors.

We were challenged to implement D-Bus observability tools in embedded, ARM32 or ARM64 kernel based environments, both with 32-bit userspace. While we found bcc-tools, an open source compiler set useful, it appeared that it lacks support for 32-bit environments. We extended bcc-tools with support for 32-bit architectures. Using bcc-tools we created Linux eBPF programs – small programs written in a subset of C language, loaded from user-space and executed in kernel context. We attached them to uprobes and kprobes - user and kernel space special kinds of breakpoints. While it worked on ARM32 kernel based system, we faced another problem - ARM64 kernel lacked support for uprobes set in 32-bit binaries. The 64-bit ARM Linux kernel was extended with the ability to probe 32-bit binaries.

We propose to discuss challenges we faced trying to implement bcc-tools based tracing tools on ARM devices. We present a working solution to overcome lack of support for 32-bit architectures in bcc-tools, leaving space for discussion about other ways to achieve the same result. We also introduce 32-bit instruction probing in ARM64 kernel - a solution that we found very useful in our case. As a proof of concept we present tools that monitor D-Bus usage in ARM32 or ARM64 kernel based system with 32-bit userspace. We list what needs to be done for complete eBPF-based tools to be fully usable on ARM.

Speaker:

  • Jakub Kicinski (Netronome)

Duration: 20 min

Content: Slides

Abstract:

eBPF (extended Berkeley Packet Filter) is an in-kernel generic virtual machine, which can be used to execute simple programs injected by the user at various hooks in the kernel, on the occurrences of events such as incoming packets. eBPF was designed to simplify the work of in-kernel just-in-time compilers, i.e. translation of eBPF intermediate representation to CPU machine code. Upstream Linux kernel currently contains JITs for all major 64-bit instruction set architectures (ISAs) (x86, AArch64, MIPS, PowerPC, SPARC, s390) as well as some 32-bit translators (ARM, x86, also NFP - Netronome Flow Processor).

The eBPF generic virtual machine with clearly defined semantics makes it a very good vehicle for enabling programming of custom hardware. From storage devices to networking processors most host I/O controllers today are built based on or with accompaniment with general purpose processing cores, e.g. ARM. As vendors try to expose more and more capabilities of their hardware, using a general purpose machine definition like eBPF to inject code into hardware directly allows us to avoid creation of vendor specific APIs.

In this talk I will describe the eBPF offload mechanism which exists today in the Linux kernel and how they compare to other offloading stacks e.g. for compute or graphics. I will present a proof-of-concept work on of reusing existing eBPF JITs for non-host architecture (e.g. ARM JIT on x86) to program a emulated device, followed by a short description of the eBPF offload for NFP hardware as an example of a real-life offload.

Speaker:

  • Julia Kartseva (Facebook)

Duration: 20 min

Content: Slides

Abstract:

eBPF-based traffic policer as a replacement* of Hierarchical Token Bucket queuing discipline.

The key idea is two rate three color marker (rfc2698) algorithm, which inputs are committed and peak rates with the corresponding burst sizes and the output is a color or category assigned to a packet. There are conforming, exceeding, violating categories. An action is applied to violating category - either drop or dscp remark. Another action may optionally be applied to exceeding category.

Close-up of eBPF implementation**. Write intensiveness is a cornerstone: an update of available tokens is required on each packet, as well as tracking of time. Naive implementation and its exposure to data races on multi-core processors system. A problem of updating both timestamp and the number of available tokens atomically. Slicing the timeline into chunks of the size of burst duration as a solution for races, mapping each packet into its chunk, so there is no need in updating global timestamp. Two approaches of storing timeline chunks: bpf LRU hash map and a block of timeline chunks in bpf array. Circulating over a block of timeline chunks. Proc and cons of the latter approach: lock-free with bpf array as the only data structure used vs. increased amount of locked memory.

Combining several policers. Linear chain of policers instead of hierarchy. Passing a packet over the chain. Dealing with bandwidth underutilization when first K policers in a chain conform a packet and K+1 rejects. Commutative property of chained policers.

Interaction with UDP and TCP. TCP reacting on drop by changing congestion window which affects the actual rate.

* Note, that it's a replacement, not the alternative: eBPF based implementation doesn't assume putting packets into queues.

** Since the action is per packet, eBPF program should be attached to tc chain, it doesn't work with cgroups.

Speaker:

  • Peter Parkanyi (Red Sift)

Duration: 20 min

Content: Slides

Abstract:

Deep packet inspection seems to be a largely unexplored area of BPF use cases. The 4096 instruction limit and the lack of loops make such implementations non-straightforward for many protocols. Using XDP and socket filters, at Red Sift, we implemented DNS and TLS handshake detection to provide better monitoring for our clusters. We learned that while the protocol implementation is not necessarily straightforward, the BPF VM provides a reasonably safe environment for DPI-style parsing. When coupled with our Rust userspace implementation, it can provide information and functionality that previously would have required userspace intercepting proxies or middleboxes, at a comparable performance to iptables-style packet filters. Further work is needed to explore how we can turn this into a more comprehensive, active component, mainly due to the BPF VM restrictions around 4096 instruction programs.

Speaker:

  • Matheus Marchini (Sthima)

Duration: 20 min

Content: Slides

Abstract:

BPF trace tools such as bcc/trace and bpftrace can attach to Systemtap USDT (user application statically defined tracepoints) probes. These probes can be created by a macro imported from "sys/sdt.h" or by a provider file. Either way, Systemtap will register those probes as entries in the note section of the ELF file with the name of the probe, its address and the arguments as assembly locations. This approach is fairly simple, easy to parse and non-intrusive. Unfortunately, it is also obsolete and lacks features such as typed arguments and built-in dynamic instrumentation. Since BPF tools are growing in popularity, it makes sense to create a new enhanced format to fix these shortcomings.

We can discuss and make decisions about the future of USDT probes used by BPF trace tools. Some possible alternatives are: extend Systemtap USDT to introduce these new features or extend kernel tracepoints so that user applications can also register them.

Speaker:

  • Arnaldo Carvalho de Melo (Red Hat)

Duration: 20 min

Content: Slides

Abstract:

The 'perf trace' tool uses the syscall tracepoints to provide a !ptrace based 'strace' like tool, augmenting the syscall arguments provided by the tracepoints with integer->strings tables automatically generated from the kernel headers, showing the paths associated with fds, pid COMMs, etc.

That is enough for integer arguments, pointer arguments needs either kprobes put in special locations, which is fragile and has been so far implemented only for getname_flags (open, etc filenames), or using eBPF to hook into the syscall enter/exit tracepoints to collect pointer contents right after the existing tracepoint payload.

This has been done to some extent and is present in the kernel sources in the tools/perf/examples/bpf/augmented_syscalls.c, using the pre-existing support in perf to use BPF C programs as event names, automagically using clang/llvm to build and load it via sys_bpf(), 'perf trace' hooks this to the existing beautifiers that seeing that extra data use it to get the filename, struct sockaddr_in, etc.

This was done for a bunch of syscalls, what is left is to get this all automated using BTF, allow passing filters attached to the syscalls, select which syscalls should be traced, use a pre-compiled augmented_syscalls.c just selecting what bits of the obj should be used, etc, i.e. the open issues about this streamlining process to avoid requiring the clang toolchain, etc will be the matter of this discussion.

Speaker:

  • Matheus Marchini (Sthima)

Duration: 20 min

Content: Slides

Abstract:

BPFtrace is a high-level tracing language powered by BPF. Inspired by awk and C, as well as predecessor tracers such as DTrace and SystemTap, BPFtrace has a clean and simple syntax which empower users to easily create BPF programs and attach them to kprobes, uprobes, and tracepoints.

We can discuss the future of this work, including BTF integration for kprobe struct arguments, and solicit feedback.

Speaker:

  • Kris van Hees (Oracle)

Duration: 20 min

Content: Slides

Abstract:

The existence and power of eBPF provides a generic execution engine at the kernel level. We have been exploring leveraging the power of eBPF as a way to integrate DTrace more into the existing tracing framework that has matured within the Linux kernel. While DTrace comes with some more lightweight ways for getting probes fired, and while it has a pretty nice userlevel consumer with useful features, there should be no need to duplicate a lot of effort on the level of processing probe events and generating data for the consumer.

We want to move forward with modifying DTrace to make use of the eBPF subsystem, and propose and contribute extensions to eBPF (and most likely some other tracing related subsystems) to provide more support for not only DTrace but tracing tools in general. In order to contribute things that benefit more than just us, we need to get together and talk, so let's get it started...

All presentation material can be found here.

The Etherpad with discussion notes from the sessions can be found here.

Microconference Format

Similar to last year's BPF and Tracing Microconference the main focus will be on discussion rather than pure presentation style.

Therefore, each accepted topic will provide introductory slides with subsequent discussion as the main part for the rest of the allocated time slot. The expected time for one discussion slot is approximately 20 min.

The whole BPF microconference is a 4 hours 40 min long session.

Conference notes will be taken during the session on an Etherpad, and along with each introduction slides published here later on.

Ideas on Discussion Topics

Below are some (non-exhaustive) ideas of content that we think would be a good fit for this year's BPF Microconference edition:

  • Improving verifier scalability to 1 million instructions
  • BPF introspection
  • BPF Type Format (BTF)
  • bpftool and libbpf
  • Dynamic tracing without on the fly compilation with clang
  • Efficient JIT to 32-bit and embedded architectures
  • Linker-style logic for BPF loaders
  • String support in BPF
  • Sleep-able BPF programs
  • Verifier bounded loop support
  • BPF offload state
  • BPF LRU eviction heuristics
  • BPF timers
  • LLVM BPF backend
  • Syscall interception
  • Microkernel and BPF

Key dates

  • Proposal submissions are due by October 1st 2018.
  • Authors will be notified of acceptance/rejection by October 5th, 2018.
  • Final slides (as PDF) are due by November 1st, 2018.

BPF Microconference Organizers

This LPC microconference is organized and lead by both BPF kernel maintainers:

  • Alexei Starovoitov (Facebook)
  • Daniel Borkmann (Cilium)

Call for Proposals

We are seeking discussion proposals for the BPF Microconference at Linux Plumbers Conference in Vancouver, November 15th, 2018.

Submitted proposals should be on new and upcoming work with potential suggestions for solutions to open problems. The discussion proposal submission sent (as plain text, not as html) to the BPF Microconference Organizers should contain

  • Title,
  • A list of submitter names, and
  • description of up to 350 words
for review. The Microconference Organizers will review all submissions carefully and will provide feedback in time.

Please note that speaking at the microconference does not automatically get you access to the LPC conference.

Each microconference has access to a limited number of free and early-bird discounted passes for LPC, which are preferably provided to those who otherwise could not attend the conference. If your situation requires such a pass, please indicate this in your proposal submission.

Please contact the BPF Microconference Organizers with any questions.