Adding features to perf using BPF

Arnaldo Carvalho de Melo acme@redhat.com
Song Liu song@kernel.org
Namhyung Kim namhyung@kernel.org

What is this about?

perf tools: familiar control plane
BPF: flexible, powerful dataplane
BTF for pretty printing and more

perf for BPF

BPF profiling
BPF sampling
BPF annotation
BPF event counting

BPF for perf

BPF event counting
perf using BPF to count events in BPF code
bpftool prog profile
perf stat -b/--bpf-prog PROG_ID
perf stat --bpf-counters
perf stat --bpf-counters --for-each-cgroup

BPF for perf

Reuse BPF infrastructure in perf
build-ids
bpf_get_stack(BPF_F_USER_BUILD_ID)
Use in PERF_RECORD_MMAP2

bpfprog profile

Limited to some events
cycles, instructions, l1d_loads, llc_misses
itlb_misses and dtlb_misses
Different workflow from familiar 'perf stat'
New tool, few features

bpftool prog profile help

$ bpftool prog help
       bpftool prog profile PROG [duration DURATION] METRICs

       PROG := { id PROG_ID | pinned FILE | tag PROG_TAG | name PROG_NAME }
       METRIC := { cycles | instructions | l1d_loads | llc_misses | itlb_misses | dtlb_misses }
$

bpftool prog profile

$ bpftool prog profile id 324 duration 3 cycles itlb_misses

       1885029 run_cnt
    5134686073 cycles
        306893 itlb_misses

perf stat for BPF

Just another target
pid, tid, cpu, cgroup, BPF_PROG
Familiar workflow
Lots of events
Metrics
First class perf stat citizen

But how? BPF skels

Canonical example: tools/bpf/runqslower
Lots of boilerplate taken care of
BPF built and "linked" with perf
Details on Devconf.cz 2020 BPF talk

Using it

# perf stat -e ref-cycles,cycles --bpf-prog 254 --interval-print 1000
    1.487903822            115,200      ref-cycles
    1.487903822             86,012      cycles
    2.489147029             80,560      ref-cycles
    2.489147029             73,784      cycles
    3.490341825             60,720      ref-cycles
    3.490341825             37,797      cycles
#
# # Equivalent to:
#
# perf stat -e ref-cycles,cycles -b 254 -I 1000

bperf: perf stat with BPF backend

bperf: the problem

Multiple tools monitor same common metrics (cycles, instructions) at different granularities: system wide, per process, per request, etc.
Limited hardware counters
Time multiplexing when there are more active events than hardware counters: low accuracy or high overhead
Sharing counters in kernel is hard. After v13 of this patchset , I started hating it myself.

bperf: the solution

Use BPF to manage hardware hardware counters
Create per cpu perf events on each cpu
BPF program triggers on the context switch, reads perf events, and aggregates reading in BPF maps.
User space reads output from BPF maps.

bperf: using it

:b for one event, --bpf-counters for all events in this command

# perf stat -e cycles:b,cs               # use bpf for cycles, but not cs
# perf stat -e cycles,cs --bpf-counters  # use bpf for both cycles and cs

No time multiplexing!

# for x in {1..20} ; do perf stat -a -e cycles:b sleep 1000 & done
# perf stat -a -e cycles,instructions --bpf-counters sleep 0.1
     Performance counter stats for 'system wide':
         119,410,122   cycles
         152,105,479   instructions    #    1.27  insn per cycle

bperf: architecture

bperf: share across processes

BPF hashmap pinned in /sys/fs/bpf/perf_attr_map


						struct perf_event_attr_map_entry {
							__u32 link_id;   /* bpf_link of the leader prog */
							__u32 diff_map_id;
						};

Each user holds a fd to the bpf_link
The perf_events, BPF programs, maps are freed when the last user exits (or closes all fds)

bperf: share across processes

Monitor daemons and self monitoring processes may share hardware counters with perf-stat
Kernel supports up to 38 (BPF_MAX_TRAMP_PROGS) follower progs per leader prog
Share one set of hardware counters (one per cpu) among up to 38 different processes

Scalable perf event counting with BPF

perf stat cgroup Usecase

Google runs hundreds of jobs on a single machine
any workload runs inside a cgroup
wants to monitor each job (counting mode)
reference: LPC 2019 talk

Scalability issues

each cgroup has its own perf_event

needs (#events x #cpus x #cgroups) fds
increases cgroup context switch overhead

by reprogramming PMU counters

workaround

limits number of cgroups to profile at once
creates blind spots and inconsistent data

Cgroup perf events

cpu perf events with associated cgroup
measuring same events across cgroups

this is the most common use case at Google
ensured by --for-each-cgroup

no need to have separate events for cgroups

but still need separate counts per cgroup

In-kernel aggregation

proposed ioctl-based approach

using a cpu event (for each cpu)
collects event counts for current cgroup

at context switch

no PMU reprogramming during context switch

rejected due to interface concerns

link to the discussion

BPF based approach

thanks to bperf infra by Song Liu
same design, but doing it in BPF

attaches to cgroup-switches* event
collects event counts for current cgroup
saves the results in per-cpu array

* a software event added to v5.13

Results

estimated context switch time
tasks communicate through pipes
tasks are in different cgroups

Future

triggers: start/stop perf events from bpf program
Use bpf_get_branch_snapshot() in perf tools
Helping developers figure out ENOSOMETHING from perf_event_open()
As default case for evsel__open_strerror()
Show a backtrace
in addition to asking user to look at dmesg output
Presentations at: http://vger.kernel.org/~acme/bpf/