It will run on the third day of LPC, November 15, so that it does not
overlap with LPC's Networking Track on the first two days, which received a
large number of networking-related BPF and XDP talks this year.
The goal of the BPF Microconference is to bring BPF developers together
to discuss and hash out unresolved issues and to move new ideas forward.
The focus of this year's event is on the core BPF infrastructure as well
as its many subsystems and related user space tooling.
The BPF Microconference will be open to all LPC attendees; no
additional registration is required. This is also a great occasion for BPF
users and developers to meet face to face and to exchange ideas and
discuss open issues. For the schedule, please check the main LPC website.
The following discussion topics have been accepted by the LPC BPF
Microconference Organizers:
Vlad Dumitrescu (Google)
Eddie Hao (Google)
Willem de Bruijn (Google)
Duration: 20 min
Google servers classify, measure, and shape their outgoing traffic.
The original implementation is based on Linux kernel traffic control
(TC). As server platforms scale, so do their network bandwidth and
number of classified flows, exposing scalability limits in the TC
system, specifically contention on the root qdisc lock.
Mechanisms like selective qdisc bypass, sharded qdisc hierarchies, and
low-overhead prequeue ameliorate the contention up to a point. But
they cannot fully resolve it. Recent changes to the Linux kernel make
it possible to move classification, measurement, and packet mangling
outside this critical section, potentially scaling to much higher
rates while simultaneously shaping more flows and applying more policies.
By moving classification and measurement to BPF at the new TC egress
hook, servers avoid taking a lock millions of times per second.
Running BPF programs at socket connect time with TCP_BPF converts
overhead from per-packet to per-flow. The programmability of BPF also
allows us to implement entirely new functions, such as runtime
configurable congestion control, first-packet classification and
socket-based QoS policies. It also enables faster deployment cycles,
as this business logic can be updated dynamically from a user agent. The
discussion will focus on our experience converting an existing traffic
shaping system to a solution based on BPF, and the issues we’ve
encountered during testing and debugging.
Compile-once and run-everywhere can make deployment simpler
and may consume fewer resources on the target host, e.g.,
by not requiring the LLVM compiler or kernel devel packages.
Currently, bpf programs for networking can be compiled once and
run over different kernel versions. But bpf programs for
tracing cannot, since they access kernel-internal headers
and these headers are subject to change between kernel
versions. Compile-once run-everywhere for tracing is, however, not easy.
BPF programs could access anything in the kernel headers, including
data structures, macros and inline functions. To
achieve this goal, we need to (1) preserve header-level
accesses in the bpf program, and (2) abstract the header information
of vmlinux. Right before program load on the target host,
a resolution pass is performed on the bpf program against the
running kernel, so that the resulting program behaves just like one
compiled against the host kernel headers.
In this talk, we will explore how BTF could be used by both
bpf program and vmlinux to advance the possibility of
bpf program compile-once and run-everywhere.
BPF program writers today who build and distribute their programs as ELF
objects typically write their programs using one of a small set of (mostly)
similar headers that establish norms around ELF section definitions. One such
norm is the presence of a "maps" section which allows maps to be referenced
within BPF instructions using virtual file descriptors. When a BPF loader (e.g.,
iproute2) opens the ELF, it loads each map referenced in this section, creates a
real file descriptor for that map, then updates all BPF instructions which
refer to the same map to specify the real file descriptor. This allows symbolic
referencing to maps without requiring writers to implement their own loaders or
recompile their programs every time they create a map.
This discussion will take a look at how to provide similar symbolic referencing
for static data. Existing implementations already templatize information such
as MAC or IP addresses using C macros, then invoke a compiler to replace such
static data at load time, at a cost of one compilation per load. By extending
the support for static variables into ELF sections, programs could be written
and compiled once then reloaded many times with different static data.
Currently, BPF cannot support basic loops such as for, while, do/while,
etc. Users work around this by forcing the compiler to "unroll" these
control flow constructs in the LLVM backend. However, this only works up
to a point. Unrolling increases instruction count and complexity for the
verifier, and furthermore LLVM cannot easily unroll all loops. As a result,
developers end up writing unnatural code, iterating until they
find a version that LLVM will compile into a form the verifier will accept.
We developed a verifier extension to detect bounded loops.
This requires building a DOM tree (computationally expensive) and then
matching loop patterns to find loop invariants to verify loops
terminate. In this discussion we would like to cover the pros and cons
of this approach, as well as another proposal to use explicit
control flow instructions to simplify this task.
The goal of this discussion would be to come to a consensus on how to
proceed to make progress on supporting bounded loops.
eBPF has 64-bit general purpose registers, therefore 32-bit
architectures normally need to use a register pair to model them and need to
generate extra instructions to manipulate the high 32 bits in the pair. Some of
this overhead could be eliminated if the JIT compiler knew that only the low
32 bits of a register are of interest. This can be determined through data flow (DF)
analysis techniques: either the classic iterative DF analysis, or a
"path-sensitive" version based on the verifier's code path walker.
In this talk, implementations of both versions of the DF analyser will be
presented. We will first see what a def-use chain based classic eBPF DF analyser
looks like, and then the possibility of integrating it with the previously
proposed eBPF control flow graph framework to make a stand-alone eBPF global DF analyser which
could potentially serve as a library. Then, another "path-sensitive" DF analyser
based on the existing verifier code path walker will be presented. We will
discuss how function calls, path pruning and path switching affect the
implementation. Finally, we will summarize the pros and cons of each, and
see how each of them could be adapted to 64-bit and 32-bit architecture back-ends.
Also, eBPF has 32-bit sub-registers and associated ALU32 instructions;
enabling them (-mattr=+alu32) in LLVM code generation lets the generated
eBPF sequences carry more 32-bit information, which could potentially ease
the work of the flow analyser. This will be briefly discussed in the talk as well.
eBPF (extended Berkeley Packet Filter), in particular with its
driver-level hook XDP (eXpress Data Path), has increased in importance
over the past few years. As a result, the ability to rapidly debug and
diagnose problems is becoming more relevant. This talk will cover common
issues faced and techniques to diagnose them, including the use of
bpftool for map and program introspection, the use of disassembly to
inspect generated assembly code and other methods such as using debug
prints and how to apply these techniques when eBPF programs are
offloaded to the hardware.
The talk will also explore where the current gaps in debugging
infrastructure are and suggest some of the next steps to improve this,
for example, integrations with tools such as strace, valgrind or even
the LLDB debugger.
Complex software usually depends on many different components, which
sometimes perform background tasks with side effects not directly
visible to their users. Without proper tools it can be hard to identify
which component is responsible for performance hits or undesired behaviors.
We were challenged to implement D-Bus observability tools in embedded,
ARM32 or ARM64 kernel based environments, both with 32-bit userspace.
While we found bcc-tools, the open-source BPF Compiler Collection tool
set, useful, it turned out that it lacked support for 32-bit environments. We extended
bcc-tools with support for 32-bit architectures. Using bcc-tools we
created Linux eBPF programs – small programs written in a subset of C
language, loaded from user-space and executed in kernel context. We
attached them to uprobes and kprobes - user and kernel space special
kinds of breakpoints. While this worked on ARM32 kernel based systems, we
faced another problem: the ARM64 kernel lacked support for uprobes set in
32-bit binaries. We extended the 64-bit ARM Linux kernel with the
ability to probe 32-bit binaries.
We propose to discuss challenges we faced trying to implement bcc-tools
based tracing tools on ARM devices. We present a working solution to
overcome the lack of support for 32-bit architectures in bcc-tools, leaving
space for discussion about other ways to achieve the same result. We
also introduce 32-bit instruction probing in the ARM64 kernel, a solution
that we found very useful in our case. As a proof of concept we present
tools that monitor D-Bus usage on ARM32 or ARM64 kernel based systems
with 32-bit userspace. We list what needs to be done for
eBPF-based tools to be fully usable on ARM.
eBPF (extended Berkeley Packet Filter) is an in-kernel generic virtual
machine, which can be used to execute simple programs injected by the
user at various hooks in the kernel, on the occurrences of events such
as incoming packets. eBPF was designed to simplify the work of
in-kernel just-in-time compilers, i.e. translation of eBPF intermediate
representation to CPU machine code. The upstream Linux kernel currently
contains JITs for all major 64-bit instruction set architectures (ISAs)
(x86, AArch64, MIPS, PowerPC, SPARC, s390) as well as some 32-bit
translators (ARM, x86, also NFP - Netronome Flow Processor).
As a generic virtual machine with clearly defined semantics, eBPF is a
very good vehicle for enabling programming of custom hardware.
From storage devices to networking processors, most host I/O controllers
today are built on top of, or accompanied by, general purpose
processing cores, e.g. ARM. As vendors try to expose more and more
capabilities of their hardware, using a general purpose machine
definition like eBPF to inject code into hardware directly allows us to
avoid creation of vendor specific APIs.
In this talk I will describe the eBPF offload mechanisms which exist
today in the Linux kernel and how they compare to other offloading
stacks, e.g. for compute or graphics. I will present proof-of-concept
work on reusing existing eBPF JITs for a non-host architecture (e.g. the
ARM JIT on x86) to program an emulated device, followed by a short
description of the eBPF offload for NFP hardware as an example of a
complete hardware offload.
eBPF-based traffic policer as a replacement* for the Hierarchical Token Bucket queuing discipline.
The key idea is the two rate three color marker (RFC 2698) algorithm, whose inputs are the committed and peak rates with the corresponding burst sizes, and whose output is a color or category assigned to a packet. The categories are conforming, exceeding and violating. An action is applied to the violating category - either drop or DSCP remark. Another action may optionally be applied to the exceeding category.
A close-up of the eBPF implementation**: write intensiveness is a cornerstone, since an update of the available tokens, as well as tracking of time, is required on each packet. A naive implementation is exposed to data races on multi-core systems, because both the timestamp and the number of available tokens must be updated atomically. Slicing the timeline into chunks of the size of the burst duration and mapping each packet into its chunk solves the races, since there is no longer a need to update a global timestamp. Two approaches to storing timeline chunks are compared: a bpf LRU hash map, and circulating over a block of timeline chunks in a bpf array. The pros and cons of the latter approach: it is lock-free with a bpf array as the only data structure used, at the price of an increased amount of locked memory.
Combining several policers: a linear chain of policers instead of a hierarchy, and passing a packet over the chain. This raises the questions of bandwidth underutilization when the first K policers in a chain conform a packet and policer K+1 rejects it, and of the commutative property of chained policers.
Interaction with UDP and TCP: TCP reacts to drops by changing its congestion window, which affects the actual rate.
* Note that it is a replacement, not an alternative: the eBPF based implementation doesn't assume putting packets into queues.
** Since the action is per packet, the eBPF program should be attached to a tc chain; it doesn't work with cgroups.
Deep packet inspection seems to be a largely unexplored area of BPF
use cases. The 4096 instruction limit and the lack of loops make such
implementations non-straightforward for many protocols. Using XDP and
socket filters, at Red Sift, we implemented DNS and TLS handshake
detection to provide better monitoring for our clusters. We learned
that while the protocol implementation is not necessarily
straightforward, the BPF VM provides a reasonably safe environment for
DPI-style parsing. When coupled with our Rust userspace
implementation, it can provide information and functionality that
previously would have required userspace intercepting proxies or
middleboxes, at a comparable performance to iptables-style packet
filters. Further work is needed to explore how we can turn this into a
more comprehensive, active component, mainly due to the BPF VM
restrictions around 4096 instruction programs.
The 'perf trace' tool uses the syscall tracepoints to provide a !ptrace based
'strace' like tool, augmenting the syscall arguments provided by the
tracepoints with integer->strings tables automatically generated from the
kernel headers, showing the paths associated with fds, pid COMMs, etc.
That is enough for integer arguments; pointer arguments need either kprobes
put in special locations, which is fragile and has so far been implemented only
for getname_flags (open, etc. filenames), or eBPF hooked into the syscall
enter/exit tracepoints to collect pointer contents right after the existing payload.
This has been done to some extent and is present in the kernel sources in the
tools/perf/examples/bpf/augmented_syscalls.c, using the pre-existing support in
perf to use BPF C programs as event names, automagically using clang/llvm to
build and load it via sys_bpf(). 'perf trace' hooks this to the existing
beautifiers, which, seeing that extra data, use it to get the filename,
struct contents, etc.
This was done for a bunch of syscalls. What is left is to get this all
automated using BTF, allow passing filters attached to the syscalls, select
which syscalls should be traced, use a pre-compiled augmented_syscalls.c
while just selecting which bits of the object should be used, etc. These
open issues about streamlining the process to avoid requiring the clang
toolchain will be the matter of this discussion.
BPF trace tools such as bcc/trace and bpftrace can attach
to Systemtap USDT (user application statically defined tracepoints)
probes. These probes can be created by a macro imported from
"sys/sdt.h" or by a provider file. Either way, Systemtap will register
those probes as entries in the note section of the ELF file with the
name of the probe, its address and the arguments as assembly
locations. This approach is fairly simple, easy to parse and
non-intrusive. Unfortunately, it is also obsolete and lacks features
such as typed arguments and built-in dynamic instrumentation. Since
BPF tools are growing in popularity, it makes sense to create a new
enhanced format to fix these shortcomings.
We can discuss and make decisions about the future of USDT probes used
by BPF trace tools. Some possible alternatives are: extend Systemtap
USDT to introduce these new features or extend kernel tracepoints so
that user applications can also register them.
Each accepted topic will provide a short 1-2 slide introduction,
with subsequent discussion as the main part for the rest of the allocated
time slot. The expected time for one discussion slot is approximately 15 min.
The whole BPF Microconference is a 3 hour long session.
Conference notes will be taken during the session on an Etherpad and,
along with each topic's introduction slides, published here later on.
Ideas on Discussion Topics
Below are some (non-exhaustive) ideas of content that we think would
be a good fit for this year's BPF Microconference edition:
Improving verifier scalability to 1 million instructions
BPF Type Format (BTF)
bpftool and libbpf
Dynamic tracing without on the fly compilation with clang
Efficient JIT to 32-bit and embedded architectures
Linker-style logic for BPF loaders
String support in BPF
Sleep-able BPF programs
Verifier bounded loop support
BPF offload state
BPF LRU eviction heuristics
LLVM BPF backend
Microkernel and BPF
Proposal submissions are due by October 1st 2018.
Authors will be notified of acceptance/rejection by October 5th, 2018.
Final 1-2 page slides (as PDF) are due by November 1st, 2018.
BPF Microconference Organizers
This LPC microconference is organized and led by both BPF kernel maintainers:
Submitted proposals should be on new and upcoming work, with suggestions
for potential solutions to open problems. The discussion proposal, sent
(as plain text, not as html) to the BPF Microconference Organizers for
review, should contain a list of submitter names and a description of up
to 350 words. The Microconference Organizers will review all submissions
carefully and will provide feedback in time.
Please note that speaking at the microconference does not automatically get
you access to the LPC conference.
Each microconference has access to a limited number of free and early-bird
discounted passes for LPC, which are preferably provided to those who otherwise
could not attend the conference. If your situation requires such a pass, please
indicate this in your proposal submission.