Profiling Data Structures


perf + pahole

Arnaldo Carvalho de Melo
acme@redhat.com

What is this about?



  • Data structures
  • Optimal field ordering
  • CPU Cache usage
  • False Sharing
  • Avoiding unintended outcomes

Why do we care?



  • Cachelines
  • Cache hierarchy
  • Cache coherency
  • L1, L2, LLC
  • NUMA: Remote memory
  • Load latencies
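
A minimal user-space sketch of the problem (hypothetical type names; the
64-byte line size is an assumption, the kernel uses SMP_CACHE_BYTES):

#include <stdatomic.h>

/* Both counters share one cacheline: two CPUs incrementing them
 * keep stealing the line from each other (false sharing). */
struct counters_bad {
	atomic_long a;
	atomic_long b;
};

/* One counter per cacheline: each CPU keeps its line in L1. */
struct counters_good {
	atomic_long a __attribute__((aligned(64)));
	atomic_long b __attribute__((aligned(64)));
};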

Problems?



  • Non-cacheline-aligned arrays
  • Read-mostly and write-mostly variables placed close together
  • Lots fixed over the years
  • Learn from it!

Tedious



  • Being done already
  • Manually
  • Tinkering
  • Move, rebuild, perf stat it
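
The loop is something like this, with a hypothetical ./workload and a
generic event set:

$ # move a field, rebuild, then:
$ perf stat -e cycles,instructions,cache-references,cache-misses -r 5 ./workload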

Manually doing it



  • Look for "false sharing" in the git log
  • Eric Dumazet does it a lot

First example


commit 91b6d325635617540b6a1646ddb138bb17cbd569
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 15 11:02:39 2021 -0800

    net: cache align tcp_memory_allocated, tcp_sockets_allocated

    tcp_memory_allocated and tcp_sockets_allocated often share
    a common cache line, source of false sharing.

The change


+++ b/net/ipv4/tcp.c
-atomic_long_t tcp_memory_allocated; // Current allocated memory
+atomic_long_t tcp_memory_allocated ____cacheline_aligned_in_smp; // Current allocated memory

-struct percpu_counter tcp_sockets_allocated;
+struct percpu_counter tcp_sockets_allocated ____cacheline_aligned_in_smp;
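
____cacheline_aligned_in_smp comes from <linux/cache.h>: on CONFIG_SMP it
boils down to an alignment attribute (a rough sketch, the real macro
chain has a few more steps), so each annotated variable starts on its
own cacheline:

#define ____cacheline_aligned_in_smp \
	__attribute__((__aligned__(SMP_CACHE_BYTES))) /* empty on !CONFIG_SMP */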

bcache_dev_sectors_dirty_add()



  • Unconditionally did set_bit(stripe, full_dirty_stripes)
  • Cacheline constantly invalidated
  • 100G of dirty data with 20 threads
  • 50 times slower
  • Fix: test_bit() first, write only when the bit changes
  • Mingzhe Zou
  • 7b1002f7cfe581930f63787a0b3de0144e61ed55

The change


+++ b/drivers/md/bcache/writeback.c
                sectors_dirty = atomic_add_return(s,
                                        d->stripe_sectors_dirty + stripe);
-               if (sectors_dirty == d->stripe_size)
-                       set_bit(stripe, d->full_dirty_stripes);
-               else
-                       clear_bit(stripe, d->full_dirty_stripes);
+               if (sectors_dirty == d->stripe_size) {
+                       if (!test_bit(stripe, d->full_dirty_stripes))
+                               set_bit(stripe, d->full_dirty_stripes);
+               } else {
+                       if (test_bit(stripe, d->full_dirty_stripes))
+                               clear_bit(stripe, d->full_dirty_stripes);
+               }

page_counter


  • cgroup perf regression
  • 'usage' and 'parent' on the same cacheline
  • 'parent' is mostly read
  • 'usage' is mostly written
  • False sharing
  • Feng Tang used 'perf c2c' to detect it
  • https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian
  • 802f1d522d5fdaefc2b935141bc8fe03d43a99ab

The change


+++ b/include/linux/page_counter.h
@@ -12,7 +12,6 @@ struct page_counter {
        unsigned long low;
        unsigned long high;
        unsigned long max;
-       struct page_counter *parent;

        /* effective memory.min and memory.min usage tracking */
        unsigned long emin;
@@ -26,6 +25,12 @@ struct page_counter {
        unsigned long watermark;
        unsigned long failcnt;
+       /*
+        * 'parent' is placed here to be far from 'usage' to reduce cache
+        * false sharing, as 'usage' is written mostly while parent is
+        * frequently read for cgroup's hierarchical counting nature.
+        */
+       struct page_counter *parent;
 };

The new struct


$ pahole page_counter
struct page_counter {
	atomic_long_t        usage;                /*   0   8 */
	long unsigned int    min;                  /*   8   8 */
	long unsigned int    low;                  /*  16   8 */
	long unsigned int    high;                 /*  24   8 */
	long unsigned int    max;                  /*  32   8 */
	long unsigned int    emin;                 /*  40   8 */
	atomic_long_t        min_usage;            /*  48   8 */
	atomic_long_t        children_min_usage;   /*  56   8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	long unsigned int    elow;                 /*  64   8 */
	atomic_long_t        low_usage;            /*  72   8 */
	atomic_long_t        children_low_usage;   /*  80   8 */
	long unsigned int    watermark;            /*  88   8 */
	long unsigned int    failcnt;              /*  96   8 */
	struct page_counter *parent;               /* 104   8 */

	/* size: 112, cachelines: 2, members: 14 */
	/* last cacheline: 48 bytes */
};
$
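
Note 'usage' at offset 0, on the first cacheline, while 'parent' now
sits at offset 104, on the second one: no more false sharing.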

Layout



  • Type info: DWARF, BTF
  • BTF now always available
  • per-cpu variable types

BTF



  • Compact
  • Yeah, BPF stuff
  • Used with BPF's CO-RE
  • And other BPF features
  • /sys/kernel/btf/
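
If bpftool is available, the raw BTF can be dumped back as C type
definitions, for instance:

$ bpftool btf dump file /sys/kernel/btf/vmlinux format c | head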

Small


$ cd /sys/kernel/btf
$ ls -lh vmlinux
-r--r--r--. 1 root root 5.1M Sep  8 20:38 vmlinux
$
$ ls -lh i915
-r--r--r--. 1 root root 556K Sep 12 09:29 i915
$
$ ls -1 | wc -l
204
$ lsmod | wc -l
204
$ lsmod | head -2
Module                  Size  Used by
sctp                  434176  28
$

pahole


$ pahole page_pool
struct page_pool {
	struct page_pool_params  p;                     /*     0    56 */
	struct delayed_work      release_dw;            /*    56    88 */
	/* XXX last struct has 4 bytes of padding */

	/* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
	void                     (*disconnect)(void *); /*   144     8 */
	long unsigned int        defer_start;           /*   152     8 */
	long unsigned int        defer_warn;            /*   160     8 */
	u32                      pages_state_hold_cnt;  /*   168     4 */
	unsigned int             frag_offset;           /*   172     4 */
	struct page *            frag_page;             /*   176     8 */
	long int                 frag_users;            /*   184     8 */
	/* --- cacheline 3 boundary (192 bytes) --- */
	u32                      xdp_mem_id;           /*   192     4 */
	/* XXX 60 bytes hole, try to pack */

	/* --- cacheline 4 boundary (256 bytes) --- */
	struct pp_alloc_cache    alloc __attribute__((__aligned__(64))); /*   256  1032 */
	/* XXX 56 bytes hole, try to pack */

	/* --- cacheline 21 boundary (1344 bytes) --- */
	struct ptr_ring          ring __attribute__((__aligned__(64))); /*  1344   192 */
	/* XXX last struct has 48 bytes of padding */
	/* --- cacheline 24 boundary (1536 bytes) --- */
	atomic_t                 pages_state_release_cnt; /*  1536     4 */
	refcount_t               user_cnt;             /*  1540     4 */
	u64                      destroy_cnt;          /*  1544     8 */

	/* size: 1600, cachelines: 25, members: 15 */
	/* sum members: 1436, holes: 2, sum holes: 116 */
	/* padding: 48 paddings: 2, sum paddings: 52 */
	/* forced alignments: 2, forced holes: 2, sum forced holes: 116 */
} __attribute__((__aligned__(64)));
$
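
The two big holes are the price of the two __attribute__((__aligned__(64)))
members: 'alloc' and 'ring' get pushed to the next cacheline boundary,
which pahole reports as forced alignments and forced holes.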

CPU helps



  • Intel PEBS
  • AMD IBS
  • ARM SPE

Example PEBS doc


18.8.1.2 Load Latency Performance Monitoring Facility

The load latency facility provides software a means to characterize
the average load latency to different levels of cache/memory
hierarchy.  This facility requires processor supporting enhanced
PEBS record format in the PEBS buffer, see Table 18-23.

This field measures the load latency from the load's first dispatch
till final data writeback from the memory subsystem. The latency is
reported for retired demand load operations and in core cycles (it
accounts for re-dispatches).

What do we have in perf?



  • perf stat
  • perf mem
  • perf c2c
  • Several articles about using it
  • Improve upon this foundation

perf mem



  • loads, stores
  • load latency
  • Stephane Eranian
  • Jiri Olsa, Kan Liang, others

Global stats


# perf mem record -a sleep 1
#
# perf mem -t load report --sort=mem --stdio
# Total Lost Samples: 0
#
# Samples: 51K of event 'cpu/mem-loads,ldlat=30/P'
# Total weight : 4819902
# Sort order   : mem
#
# Overhead       Samples  Memory access
# ........  ............  ........................
    44.87%         20217  LFB or LFB hit
    27.30%         18618  L3 or L3 hit
    22.53%         11712  L1 or L1 hit
     4.85%           637  Local RAM or RAM hit
     0.25%             1  Uncached or N/A hit
     0.20%           188  L2 or L2 hit
     0.00%            35  L3 miss

Workload stats


# perf mem record sleep 1
#
# perf mem -t load report --sort=mem --stdio
# Total Lost Samples: 0
#
# Samples: 16  of event 'cpu/mem-loads,ldlat=30/P'
# Total weight : 1556
# Sort order   : mem
# Overhead       Samples  Memory access
# ........  ............  ........................
#
    64.52%             8  LFB or LFB hit
    14.07%             4  L1 or L1 hit
    11.05%             3  L3 or L3 hit
    10.35%             1  Local RAM or RAM hit


A familiar workload


$ make -j8 O=../build/allmodconfig/
make[1]: Entering directory '/home/acme/git/build/allmodconfig'

# perf mem record sleep 1m
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (20 samples) ]
#

perf mem report


# perf mem report --stdio
# Total Lost Samples: 0
#
# Samples: 11  of event 'cpu/mem-loads,ldlat=30/P'
# Total weight : 2155
# Sort order   : local_weight,mem,sym,dso,symbol_daddr,dso_daddr
#
#              Local Mem
#Overhead Weig Access   Symbol              Sh Object   Data Symbol        Data Obj
#........ .... ........ ................... .........   ................   ...........
   23.94% 516  LocalRAM copy_page           [kernel]    0xffff8d42228ea900 [unknown]
   15.31% 330  LFB      flush_signal_handle [kernel]    0xffff8d3f976020a0 [unknown]
   14.66% 316  LFB      strlen              [kernel]    0xffffffff9b5f4cd3 [kernel].ro
   13.36% 288  LFB      _dl_relocate_object ld-linux.so 0x00007f6ccdc23068 libc.so.6
   11.46% 247  LFB      next_uptodate_page  [kernel]    0xffffe401957e4df4 [unknown]
    7.33% 158  LFB      copy_page           [kernel]    0xffff8d41f2dae920 [unknown]
    4.04% 87   LFB      unlock_page_memcg   [kernel]    0xffffe4019333d8b8 [unknown]
    3.06% 66   L1       check_preemption_di [kernel]    0xffffa8e8622ffc80 [unknown]
    2.69% 58   LFB      perf_output_begin   [kernel]    0xffff8d3f52a1b01c [unknown]
    2.13% 46   L3       task_work_run       [kernel]    0xffff8d3f4a9c802c [unknown]
    2.00% 43   L1       kmem_cache_alloc_tr [kernel]    0xffffa8e8622ffbc8 [unknown]

perf c2c



  • Cache to Cache
  • False Sharing
  • Cachelines tugged
  • Resolves global variables
  • Should resolve its type and field
  • Using BTF

Origins



  • Dick Fowles
  • HP-UX had something similar
  • Joe Mario
  • Don Zickus
  • Jiri Olsa

c2c output



  • Cachelines where false sharing was detected
  • Readers and writers and cacheline offsets
  • pid, tid, instruction addr, function name, binary object names
  • Source file and line number
  • Average load latency
  • NUMA nodes and CPUs involved

perf c2c


# perf c2c record -a sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 7.787 MB perf.data (2450 samples) ]
# perf evlist
cpu/mem-loads,ldlat=30/P
cpu/mem-stores/P
dummy:HG
#

What's in there?


# perf script --cpu 4 --pid 0 | head
 swapper 0 [4] 319242.043904:    58 cpu/mem-loads,ldlat=30/P: ffff8d3e49c0e688 11868100242 |OP LOAD|LVL LFB or LFB hit|SNP None|TLB L1 or L2 hit|LCK No|BLK  N/A      290 0 ffffffff9a13eb2c __update_load_avg_cfs_rq+0x9c (vmlinux) 9c0e688
 swapper 0 [4] 319242.142295:    39 cpu/mem-loads,ldlat=30/P: ffff8d44865f2408 10268100142 |OP LOAD|LVL L1 or L1 hit|SNP None|TLB L1 or L2 hit|LCK No|BLK  N/A        335 0 ffffffff9a13eecd update_rt_rq_load_avg+0x17d (vmlinux) 6465f2408
 swapper 0 [4] 319242.143587: 99614         cpu/mem-stores/P: ffff8d4486500028     5080184 |OP STORE|LVL L1 miss|SNP N/A|TLB N/A|LCK N/A|BLK  N/A                       0 0 ffffffff9a001c2f __switch_to_asm+0x1f (vmlinux) 646500028
 swapper 0 [4] 319242.174494:    33 cpu/mem-loads,ldlat=30/P: ffff8d3f595ddc38 11a68201042 |OP LOAD|LVL Local RAM or RAM hit|SNP Hit|TLB L1 or L2 hit|LCK No|BLK  N/A 176 0 ffffffff9a13e78d __update_load_avg_se+0x1d (vmlinux) 1195ddc38
 swapper 0 [4] 319242.178002:    27 cpu/mem-loads,ldlat=30/P: ffff8d44865312c0 10668100842 |OP LOAD|LVL L3 or L3 hit|SNP None|TLB L1 or L2 hit|LCK No|BLK  N/A         56 0 ffffffff9a07d74f switch_mm_irqs_off+0x16f (vmlinux) 6465312c0
 swapper 0 [4] 319242.212148:    23 cpu/mem-loads,ldlat=30/P: ffff8d44865322e8 10668100842 |OP LOAD|LVL L3 or L3 hit|SNP None|TLB L1 or L2 hit|LCK No|BLK  N/A         55 0 ffffffff9a140c22 irqtime_account_process_tick+0xa2 (vmlinux) 6465322e8
 swapper 0 [4] 319242.217357:    18 cpu/mem-loads,ldlat=30/P: ffff8d4486532490 10268100142 |OP LOAD|LVL L1 or L1 hit|SNP None|TLB L1 or L2 hit|LCK No|BLK  N/A        125 0 ffffffff9a140076 update_irq_load_avg+0xf6 (vmlinux) 646532490
 swapper 0 [4] 319242.220573:    15 cpu/mem-loads,ldlat=30/P: ffff8d3f4f35f218 11868100242 |OP LOAD|LVL LFB or LFB hit|SNP None|TLB L1 or L2 hit|LCK No|BLK  N/A      383 0 ffffffff9a73b407 rb_erase+0x7 (vmlinux) 10f35f218
 swapper 0 [4] 319242.240176:    15 cpu/mem-loads,ldlat=30/P: ffff8d3f6b617be0 10650100842 |OP LOAD|LVL L3 or L3 hit|SNP None|TLB L2 miss|LCK No|BLK  N/A             184 0 ffffffff9a129fbb update_blocked_averages+0x1fb (vmlinux) 12b617be0
 swapper 0 [4] 319242.243441:  8849         cpu/mem-stores/P: ffff8d3f40c2b1a4     5080144 |OP STORE|LVL L1 hit|SNP N/A|TLB N/A|LCK N/A|BLK  N/A                        0 0 ffffffff9ad68aed rcu_eqs_exit.constprop.0+0x3d (vmlinux) 100c2b1a4
#

What was really asked?


# perf evlist -v | head -1
cpu/mem-loads,ldlat=30/P: type: 4, size: 128, config: 0x1cd, \
{ sample_period, sample_freq }: 4000, \
sample_type: IP|TID|TIME|ADDR|ID|CPU|PERIOD|DATA_SRC|PHYS_ADDR| \
             WEIGHT_STRUCT, \
read_format: ID, disabled: 1, inherit: 1, freq: 1, precise_ip: 3, \
sample_id_all: 1, { bp_addr, config1 }: 0x1f
# 

PERF_SAMPLE_DATA_SRC



  • PEBS Load Latency
  • opcode: load, store, prefetch, exec
  • mem level: l1, l2, l3, l4, lfb, ram, pmem, etc
  • tlb: not available, hit, miss, l1, l2, hw walker, OS fault handler
  • snoop: no snoop, hit, miss, hit modified
  • lock: not available, locked instruction
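
All of this is packed into a single u64: union perf_mem_data_src, from
uapi/linux/perf_event.h, which is what the |OP LOAD|LVL ...| strings in
the perf script output above were decoded from. A minimal consumer
sketch (flag and field names from the uapi header):

#include <linux/perf_event.h>
#include <stdio.h>

/* Decode a few bits of a PERF_SAMPLE_DATA_SRC value. */
static void print_data_src(__u64 val)
{
	union perf_mem_data_src src = { .val = val };

	if (src.mem_op & PERF_MEM_OP_LOAD)
		puts("OP LOAD");
	if (src.mem_lvl & PERF_MEM_LVL_LFB)
		puts("LVL LFB");
	if (src.mem_lvl & PERF_MEM_LVL_HIT)
		puts("hit");
	if (src.mem_lock & PERF_MEM_LOCK_LOCKED)
		puts("locked instruction");
}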

PERF_SAMPLE_WEIGHT_STRUCT



  • Hardware-provided number
  • Indicates how expensive the sampled action was
  • Lets the profiler scale samples (layout sketch below)
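
Roughly the shape (from uapi/linux/perf_event.h, little-endian variant;
what each var carries is PMU-specific, e.g. load latency for mem-loads):

union perf_sample_weight {
	__u64 full;
	struct {
		__u32 var1_dw;	/* e.g. cache/load latency */
		__u16 var2_w;	/* e.g. instruction latency */
		__u16 var3_w;
	};
};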

c2c perf.data header


# perf report --header-only
# captured on    : Fri Sep  9 11:10:04 2022
# hostname : quaco
# os release : 5.18.17-200.fc36.x86_64
# perf version : 6.0.rc3.gfaf59ec8c3c3
# arch : x86_64
# nrcpus online : 8
# nrcpus avail : 8
# cpudesc : Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
# total memory : 24487388 kB
# cmdline : /home/acme/bin/perf c2c record -a sleep 1m 
# event : name = cpu/mem-loads,ldlat=30/P, freq = 4000,
  sample_type = IP|TID|TIME|ADDR|ID|CPU|PERIOD|DATA_SRC|\
                PHYS_ADDR|WEIGHT_STRUCT
# event : name = cpu/mem-stores/P, freq = 4000,
  sample_type = IP|TID|TIME|ADDR|ID|CPU|PERIOD|DATA_SRC|\
                PHYS_ADDR|WEIGHT_STRUCT

Key metrics


# perf c2c report --stats
  Total records                     :    3223429
  Locked Load/Store Operations      :     112673
  Load Operations                   :    1387118
  Loads - uncacheable               :          1
  Loads - IO                        :          4
  Loads - Miss                      :        142
  Loads - no mapping                :       2350
  Load Fill Buffer Hit              :     455747
  Load L1D hit                      :     264355
  Load L2D hit                      :      29304
  Load LLC hit                      :     534642
  Load Local HITM                   :        629
  Load Remote HITM                  :          0
  Load Remote HIT                   :          0

perf c2c contended cachelines view


# perf c2c report --stdio
=================================================
           Shared Data Cache Line Table          
=================================================
#
#      -- Cacheline --    -- Load Hitm  ----   Tot  Total   Total  -- Stores --
# Idx           Address    Hitm  Tot LclHitm   rec  Loads  Stores  L1Hit L1Miss  
# ...  ................   .....  ... .......   ...  .....  ......  ............  
#
    0  ffff8d449e7d6380   8.43%   53      53   510    499      11     11      0  
    1  ffff8d4058209340   6.20%   39      39   371    135     236    223     13  
    2  ffff8d449e7ff400   5.88%   37      37   501    479      22     22      0  
    3  ffffffff9bf53980   4.93%   31      31   233    208      25     24      1  
    4  ffff8d3f49ebd280   3.18%   20      20   162    153       9      9      0  
    5  ffff8d3f420d4880   2.86%   18      18   126    121       5      5      0  

tugged cacheline


Cacheline 0xffff8d449e7ff400
-HITM-     CL                    --- cycles ---  Tot  cpu 
LclHitm   Off      Code address  lcl hitm  load  rec  cnt   Symbol Object      Source:Line
 97.30%   0x0 0xffffffff9a2d293b      113    44  454   8 __mod_node_page_state [kernel] vmstat.c:379 
  0.00%   0x8 0xffffffff9a2d29bb        0   112  40    8 __mod_node_page_state [kernel] atomic64_64.h:46
  2.70%  0x18 0xffffffff9a2d2be5      959   103   2    2 refresh_cpu_vm_stats  [kernel] atomic64_64.h:46

Next steps



  • Look at the CL Off (Cacheline Offset)
  • Three fields being accessed
  • Two of them show local HITM
  • Find out what the data structure is
  • By looking at the functions, source:line

vmstat.c:379


$ perf probe -L vmstat.c:379 | head

    379  	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
    380  	s8 __percpu *p = pcp->vm_node_stat_diff + item;
    381  	long x;
         	long t;
         
         	if (vmstat_item_in_bytes(item)) {
         		/*
         		 * Only cgroups use subpage accounting right now; at
         		 * the global level, these items still change in
$

Looking at pgdat


$ pfunct __mod_node_page_state
void __mod_node_page_state(struct pglist_data * pgdat,
                           enum node_stat_item item, long int delta);
$
$ pahole pglist_data | grep -B2 -A6 per_cpu_nodestats
    /* --- cacheline 2704 boundary (173056 bytes) --- */
    struct zone_padding      _pad2_;              /* 173056   0 */
    struct per_cpu_nodestat *per_cpu_nodestats
    		__attribute__((__aligned__(64))); /* 173056   8 */
    atomic_long_t            vm_stat[41];         /* 173064 328 */

    /* size: 173440, cachelines: 2710, members: 32 */
    /* sum members: 173309, holes: 6, sum holes: 83 */
    /* padding: 48 */
    /* forced alignments: 2 */
$ 
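
So the read-mostly 'per_cpu_nodestats' pointer and the written-to
'vm_stat[]' atomics start 8 bytes apart on the same cacheline: these are
the 0x0, 0x8 and 0x18 offsets seen in the c2c cacheline view above.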

Code inspection



  • Manual steps
  • Streamlining the current UI:
  • Should go from the c2c TUI
  • To the 'perf annotate' browser
  • Positioning at that source:line
  • ENTER on type: go to a pahole browser

refresh_cpu_vm_stats


# perf annotate --stdio2 refresh_cpu_vm_stats
<SNIP>
refresh_cpu_vm_stats() /usr/lib/debug/lib/modules/5.18.17-200.fc36.x86_64/vmlinux
             ffffffff812d29e0 <refresh_cpu_vm_stats>:
             static int refresh_cpu_vm_stats(bool do_pagesets)
             struct pglist_data *pgdat;
             struct zone *zone;
<SNIP>
             for_each_online_pgdat(pgdat) {
<SNIP>
        1f3:   cmpxchg %r8b,%gs:(%rdx)
             ↑ jne     1f3
               movsbl  %al,%edi
             if (v) {
               test    %edi,%edi
             ↓ je      20b
             atomic_long_add(v, &pgdat->vm_stat[i])



Being mindful


 * Update the zone counters for the current cpu.
 
 * Note that refresh_cpu_vm_stats strives to only access node local
 * memory.  The per cpu pagesets on remote zones are placed in the
 * memory local to the processor using that pageset. So the loop over
 * all zones will access a series of cachelines local to the processor.
 
 * The call to zone_page_state_add updates the cachelines with the stats
 * in the remote zone struct as well as the global cachelines with the
 * global counters. These could cause remote node cache line bouncing
 * and will have to be only done when necessary.
 
 * The function returns the number of global counters updated.

pahole TODO



  • We have --hex, offsets from the struct start (example below)
  • Implement per-cacheline offsets in pahole output
  • For multi-cacheline structs
  • Like pgdat in the previous example
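
Until then, --hex at least makes the cacheline math easier, e.g.:

$ pahole --hex pglist_data | grep per_cpu_nodestats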

What do we see?



  • Code
  • Source code/line
  • All focused on where things happened
  • Not on what data was accessed
  • But looking at where it happens
  • Helps figure out the data structure accessed

No rmap for data symbols?



  • Global vars: ok
  • SLAB: can we do it?
  • From code?
  • Going back to args
  • DWARF location expressions
  • LBR?

perf lock contention -b



  • Gets some backtraces
  • Identifies locks by backtrace signatures
  • Good enough?

A detour



  • perf record + report is too costly
  • In-kernel aggregation should be supported
  • perf kmem, kwork, bcounters do it
  • BPF way of doing things
  • perf using it

perf BPF skels


$ wc -l tools/perf/util/bpf_skel/*.bpf.c
  191 tools/perf/util/bpf_skel/bperf_cgroup.bpf.c
   78 tools/perf/util/bpf_skel/bperf_follower.bpf.c
   55 tools/perf/util/bpf_skel/bperf_leader.bpf.c
   92 tools/perf/util/bpf_skel/bpf_prog_profiler.bpf.c
  116 tools/perf/util/bpf_skel/func_latency.bpf.c
  383 tools/perf/util/bpf_skel/kwork_trace.bpf.c
  175 tools/perf/util/bpf_skel/lock_contention.bpf.c
  273 tools/perf/util/bpf_skel/off_cpu.bpf.c
 1363 total
$

perf lock contention -b


 $ sudo perf lock contention -b
^C
contended  total wait   max wait  avg wait      type  caller

        42   192.67 us  13.64 us   4.59 us  spinlock  queue_work_on+0x20
        23    85.54 us  10.28 us   3.72 us  spinlock  worker_thread+0x14a
         6    13.92 us   6.51 us   2.32 us     mutex  kernfs_iop_permission+0x30
         3    11.59 us  10.04 us   3.86 us     mutex  kernfs_dop_revalidate+0x3c
         1     7.52 us   7.52 us   7.52 us  spinlock  kthread+0x115
         1     7.24 us   7.24 us   7.24 us  rwlock:W  sys_epoll_wait+0x148
         2     7.08 us   3.99 us   3.54 us  spinlock  delayed_work_timer_fn+0x1b
         1     6.41 us   6.41 us   6.41 us  spinlock  idle_balance+0xa06
         2     2.50 us   1.83 us   1.25 us     mutex  kernfs_iop_lookup+0x2f
         1     1.71 us   1.71 us   1.71 us     mutex  kernfs_iop_getattr+0x2c
...

perf sampling in BPF



  • We have BPF programs counting events
  • That perf then treats as a counter (evsel)
  • Now for sampling
  • Do the c2c aggregation in a BPF program
  • Until control+C
  • Or some other existing perf stop method
  • End of a control workload (sleep), etc
  • Recent discussion: Namhyung Kim and Song Liu
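
A minimal sketch of what such aggregation could look like (hypothetical
map and program names; a real tool would also keep the data source,
latency and call chain):

// count memory samples per cacheline, entirely in the kernel
#include <linux/bpf.h>
#include <linux/bpf_perf_event.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, __u64);	/* sample address & ~63: the cacheline */
	__type(value, __u64);	/* sample count */
} cachelines SEC(".maps");

SEC("perf_event")
int on_mem_sample(struct bpf_perf_event_data *ctx)
{
	__u64 line = ctx->addr & ~63ULL; /* needs PERF_SAMPLE_ADDR */
	__u64 *cnt, one = 1;

	cnt = bpf_map_lookup_elem(&cachelines, &line);
	if (cnt)
		__sync_fetch_and_add(cnt, 1);
	else
		bpf_map_update_elem(&cachelines, &line, &one, BPF_NOEXIST);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";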

BTF



  • per-cpu variables
  • patch to record all variables
  • And have it in a kernel module
  • Load it and get access to type info
  • For all variables

Code Annotation



  • perf annotate
  • Source code
  • Machine code
  • Where events take place

perf annotate


  • nospectre_v1 + nospectre_v2

perf + pahole



  • Data structure annotation
  • Show HITM causing fields
  • Mapping data memory to a variable
  • Finding its type
  • BTF info for variables: types
  • Function signatures: arg types

THE END