Hacking and Other Thoughts

Thu, 08 Apr 2010


Function graph tracer on sparc64...

The Linux kernel has a facility called the function graph tracer that allows one to record the full call graph of everything the kernel does with timestamping. It's the big brother of the plain function tracer and is implemented with similar facilities.

The plain function tracer merely intercepts calls to mcount (which are emitted by the compiler at the beginning of every function when building with the "-pg" option) and records the traced function along with other pieces of state (timestamp, parent function, etc.). These trace entries are recorded into per-cpu lockless NMI-safe buffers.
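To make the shape of those trace entries concrete, here is a minimal user-space sketch of recording into a fixed-size ring buffer. The structure and names (trace_entry, record_entry) are illustrative only; the kernel's real per-cpu buffers are lockless and NMI-safe, which this toy version is not.

	#include <stdio.h>
	#include <stdint.h>
	#include <time.h>

	/* Hypothetical trace entry, mirroring the kind of state the
	 * function tracer records on each mcount hit: the traced
	 * function, its caller, and a timestamp. */
	struct trace_entry {
		uintptr_t ip;        /* address of the traced function */
		uintptr_t parent_ip; /* address of its caller */
		uint64_t  ts_ns;     /* timestamp in nanoseconds */
	};

	#define BUF_ENTRIES 64

	static struct trace_entry buf[BUF_ENTRIES];
	static unsigned int head; /* next slot to write */

	/* Record one entry, overwriting the oldest when full -- a plain
	 * stand-in for the kernel's per-cpu lockless ring buffer. */
	static void record_entry(uintptr_t ip, uintptr_t parent_ip)
	{
		struct timespec ts;

		clock_gettime(CLOCK_MONOTONIC, &ts);
		buf[head % BUF_ENTRIES] = (struct trace_entry){
			.ip = ip,
			.parent_ip = parent_ip,
			.ts_ns = (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec,
		};
		head++;
	}

	int main(void)
	{
		/* Fake addresses, as if mcount had fired twice. */
		record_entry(0x1000, 0x2000);
		record_entry(0x1100, 0x1000);
		printf("%u entries, last ip=0x%lx parent=0x%lx\n",
		       head,
		       (unsigned long)buf[(head - 1) % BUF_ENTRIES].ip,
		       (unsigned long)buf[(head - 1) % BUF_ENTRIES].parent_ip);
		return 0;
	}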

The function graph tracer takes things one step further by recording function returns in the trace as well. And it does this using only mcount. How does it do this? The trick is that it dynamically rewrites the return address at every traced call point so that the function returns into a special stub.

On function entry, mcount is invoked and the function graph tracer is called like so:

	mov		%i7, %g2
	mov		%fp, %g3
	save		%sp, -128, %sp
	mov		%g2, %o1
	mov		%g2, %l0
	mov		%g3, %l1
	mov		%l0, %o0
	mov		%i7, %o1
	call		prepare_ftrace_return
	 mov		%l1, %o2
	ret
	 restore	%o0, -8, %i7

prepare_ftrace_return() is a helper function that passes the mcount caller's program counter, that caller's parent program counter, and the caller's frame pointer into the function graph tracer.

The tracer checks whether the function should be traced and whether there are enough tracking stack slots available for the current thread. The slots remember the frame pointer and the caller's parent program counter so that they can be compared against and restored later. We need to remember this program counter because we are going to change it so that the function returns into a stub instead of returning to its real caller.
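A minimal sketch of that entry-side bookkeeping, in C rather than the kernel's actual code (the names prepare_return, ret_slot and the stub address are all made up for illustration):

	#include <stdio.h>
	#include <stdint.h>

	#define MAX_DEPTH 50

	/* One tracking slot per nested traced call. */
	struct ret_slot {
		uintptr_t ret_addr; /* the caller's original return address */
		uintptr_t fp;       /* frame pointer, kept for a later sanity check */
	};

	static struct ret_slot ret_stack[MAX_DEPTH];
	static int depth = -1;

	static uintptr_t stub_addr = 0xdeadbeef; /* stands in for the real stub */

	/* Returns the address to place in the return-address register:
	 * the stub if we decide to trace this call, or the original
	 * address untouched when we are out of tracking slots. */
	static uintptr_t prepare_return(uintptr_t ret_addr, uintptr_t fp)
	{
		if (depth + 1 >= MAX_DEPTH)
			return ret_addr;        /* out of slots: don't trace */
		depth++;
		ret_stack[depth].ret_addr = ret_addr;
		ret_stack[depth].fp = fp;
		return stub_addr;               /* hijack the return */
	}

	int main(void)
	{
		uintptr_t r = prepare_return(0x4000, 0x7fff0000);

		printf("redirect to %#lx, saved %#lx depth=%d\n",
		       (unsigned long)r,
		       (unsigned long)ret_stack[depth].ret_addr, depth);
		return 0;
	}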

The saved frame pointer serves as a debugging aid: when we execute the stub, we can verify that the state is what we expect it to be, and therefore that restoring the return address register is safe.

prepare_ftrace_return gives its caller the address to put into the caller's return address register. This will be the special stub if we decide to trace this function call, or simply the original value (plus 8, to account for the fact that sparc functions return by jumping to the return address register plus 8).

The stub looks like this:

	save		%sp, -128, %sp
	call		ftrace_return_to_handler
	 mov		%fp, %o0
	jmpl		%o0 + 8, %g0
	 restore

ftrace_return_to_handler() validates the frame pointer against the top-most stack slot of saved return address + frame pointer pairs. If it matches, the handler records a function return trace entry (with timestamps, etc.) and hands back the function's original return address, which the stub then jumps to.
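The return side can be sketched the same way as the entry side. Again the names (push, return_to_handler) are hypothetical stand-ins, but the flow is the one described above: pop the top slot, check the frame pointer, and give back the original return address.

	#include <stdio.h>
	#include <stdint.h>
	#include <assert.h>

	#define MAX_DEPTH 50

	struct ret_slot {
		uintptr_t ret_addr;
		uintptr_t fp;
	};

	static struct ret_slot ret_stack[MAX_DEPTH];
	static int depth = -1;

	/* Entry side: remember the real return address and frame pointer. */
	static void push(uintptr_t ret_addr, uintptr_t fp)
	{
		depth++;
		ret_stack[depth].ret_addr = ret_addr;
		ret_stack[depth].fp = fp;
	}

	/* Return side: validate the frame pointer, emit the return trace
	 * entry (elided here), and hand back the original address. */
	static uintptr_t return_to_handler(uintptr_t fp)
	{
		uintptr_t orig;

		assert(depth >= 0 && ret_stack[depth].fp == fp &&
		       "frame pointer mismatch: tracking stack corrupt");
		orig = ret_stack[depth].ret_addr;
		depth--;
		return orig; /* the stub jumps here (+8 on sparc) */
	}

	int main(void)
	{
		push(0x4000, 0x7fff0000);
		printf("returning to %#lx, depth now %d\n",
		       (unsigned long)return_to_handler(0x7fff0000), depth);
		return 0;
	}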

So if we're deep in the call chain of traced functions, and you were to look at the backtrace, you'd see a continuous stack of return addresses referencing the stub. And as the functions return, the function graph tracer resolves the return addresses to what they should be and then the originally intended address is returned to.

Of course there is some non-trivial cost to all of this. In particular, rewriting the return address means the cpu's return address stack never hits, so function returns become very expensive. But this isn't something you have running constantly; you turn the tracer on around a set of events of interest (e.g. while running a test program).

Here's what a small snippet of a trace looks like with the function graph tracer. In this part of a capture we have a call to "run_local_timers":

  64)               |          run_local_timers() {
  64)   2.637 us    |            hrtimer_run_queues();
  64)   2.527 us    |            raise_softirq();
  64)               |            softlockup_tick() {
  64)   2.967 us    |              __touch_softlockup_watchdog();
  64)   8.570 us    |            }
  64) + 23.953 us   |          }

The first number is the cpu number (this machine has 128 cpus, so yes, 64 is not a typo). The next field gives time deltas. Finally we have the call graph itself.

For the call graph a C-like syntax is used. For leaf functions the line ends with just a semicolon. When a function calls other functions, it is closed off at its return by a closing brace, and its sub-calls are indented as needed.

Call latencies are reported at the point where we see the return. So they appear on the lines for leaf functions (ending with semicolons) and on closing braces, but never on lines with opening braces.

When latencies exceed certain values a "+" (greater than 10usec) or "!" (greater than 100usec) will be prepended to the time delta so expensive operations can be easily seen.
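The layout rules above can be illustrated with a toy renderer that turns a stream of entry/exit events into this call-graph syntax. The event stream and timings below are fabricated for the example; only the formatting rules (latency on leaf lines and closing braces, "+" above 10usec, "!" above 100usec) come from the real tracer output.

	#include <stdio.h>
	#include <string.h>

	struct event {
		int enter;        /* 1 = function entry, 0 = function exit */
		const char *name;
		double us;        /* latency, known only at exit */
	};

	/* Latency column, with the "+" / "!" expense markers. */
	static void print_delta(double us)
	{
		char mark = ' ';

		if (us > 100.0)
			mark = '!';
		else if (us > 10.0)
			mark = '+';
		printf("%c %7.3f us |", mark, us);
	}

	int main(void)
	{
		struct event ev[] = {
			{1, "run_local_timers", 0},
			{1, "hrtimer_run_queues", 0}, {0, "hrtimer_run_queues", 2.637},
			{1, "raise_softirq", 0},      {0, "raise_softirq", 2.527},
			{1, "softlockup_tick", 0},
			{1, "__touch_softlockup_watchdog", 0},
			{0, "__touch_softlockup_watchdog", 2.967},
			{0, "softlockup_tick", 8.570},
			{0, "run_local_timers", 23.953},
		};
		int n = sizeof(ev) / sizeof(ev[0]), depth = 0, i;

		for (i = 0; i < n; i++) {
			if (ev[i].enter && i + 1 < n && !ev[i + 1].enter &&
			    strcmp(ev[i + 1].name, ev[i].name) == 0) {
				/* leaf: latency goes on the function's own line */
				print_delta(ev[i + 1].us);
				printf(" %*s%s();\n", depth * 2, "", ev[i].name);
				i++; /* consume the matching exit event */
			} else if (ev[i].enter) {
				printf("%13s| %*s%s() {\n", "", depth * 2, "",
				       ev[i].name);
				depth++;
			} else {
				depth--;
				print_delta(ev[i].us);
				printf(" %*s}\n", depth * 2, "");
			}
		}
		return 0;
	}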

There are several other powerful tools built on top of the mcount based tracing hooks. For example there is a stack tracer that monitors which functions created the deepest stack frames.

For more information about using ftrace in general, check out the two part series written by Stephen Rostedt at LWN: PART 1 and PART 2.

It's an incredibly powerful tool and framework. In fact, in just the last few days I've found 3 bugs and anomalies simply by scanning the traces while looking for nothing in particular.