So I spent the weekend fishing out some gdb bugs on sparc. Every time I think I understand and know how all of this stuff works I get thrown some new surprises. This time was no exception.
The kernel has all of these neat features, via ptrace()'s PTRACE_SETOPTIONS, that allow a debugger to get notifications when a process forks, vforks, does an exec, etc.
When one of these events occurs in the inferior process, it calls ptrace_notify(), which sends a SIGTRAP to the process with the event encoded into the siginfo exit code.
As a result, when the process tries to return back into userspace, it'll do signal processing. As part of that, it will invoke ptrace_stop() which sleeps the task and wakes up the debugger parent so it can examine the event and process state.
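From the debugger's side, that encoded event shows up in the status word returned by waitpid(). Here is a minimal sketch of pulling it back out, assuming the Linux status layout described in the ptrace(2) man page (the event number sits above the stop signal). The helper name ptrace_event_from_status() is made up for illustration.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/*
 * Return the PTRACE_EVENT_* code carried by a waitpid() status, or 0 if
 * the status is not a SIGTRAP stop carrying an event.
 * Hypothetical helper, not part of any library.
 */
int ptrace_event_from_status(int status)
{
    /* Only a stop with SIGTRAP can carry a ptrace event. */
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
        return 0;

    /* ptrace(2): status >> 8 == (SIGTRAP | (event << 8)) for event stops. */
    return (status >> 16) & 0xffff;
}
```

A debugger loop would call this on every status it gets back from waitpid() and compare the result against PTRACE_EVENT_FORK, PTRACE_EVENT_VFORK, PTRACE_EVENT_EXEC and so on.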
The debugger has several choices of what to do. For many cases it will do a ptrace() PTRACE_CONT with a code indicating how the process should continue. Another thing the debugger can do is decide that it's no longer interested in tracing this task, and therefore it does a ptrace() PTRACE_DETACH. This is part of the first bug.
A long time ago we picked the values for PTRACE_this and PTRACE_that on sparc. Some of them mirrored the SunOS values. One of those was PTRACE_DETACH. We unconditionally recognized the SunOS PTRACE_DETACH value, even for Linux processes. Unfortunately, this is the value that also ended up in the sys/ptrace.h Linux header file. So that's the value every Linux application ended up using too.
I yanked out the SunOS PTRACE_foo call support long ago, and with it went the handling of the value that applications were actually passing for PTRACE_DETACH. It's amazing how much works without a properly functioning PTRACE_DETACH. Putting in a compat translation from the SunOS to the intended Linux value for PTRACE_DETACH in the Sparc ptrace code solved this bug.
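The shape of such a compat translation is roughly the following. This is only a sketch: the two request numbers are placeholders picked for illustration, not the real sparc values, and translate_ptrace_request() is a hypothetical name rather than the actual kernel function.

```c
/* Placeholder request numbers, NOT the real sparc ptrace values. */
enum {
    REQ_DETACH_LEGACY = 100,  /* stands in for the SunOS-derived value */
    REQ_DETACH_LINUX  = 200   /* stands in for the intended Linux value */
};

/*
 * Map the legacy detach request onto the one the generic ptrace code
 * understands; every other request passes through untouched.
 */
long translate_ptrace_request(long request)
{
    if (request == REQ_DETACH_LEGACY)
        return REQ_DETACH_LINUX;
    return request;
}
```

Doing the remap at the top of the arch ptrace entry point means existing binaries built against the old header value keep working without a rebuild.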
Which brings us to the next bug...
What brought me down this path in the first place was examining why running gdb under itself didn't work properly. This kind of game is always fun:
bash$ ./gdb ./gdb
(top-gdb) set args test
(top-gdb) run
(gdb) run
Hello World!
...
This would hang when the inner gdb tries to run the test program, and I had to figure out why. After lots of tracing I found that the inner gdb was hanging in a sigpause() call.
GDB uses sigpause() to wait for SIGCHLD events when it is simply waiting for running inferiors to ptrace_stop() or take some other kind of signal.
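The waiting pattern looks something like the sketch below. It uses sigsuspend(), the POSIX successor to sigpause(), and a throwaway child that exits right away in place of a real inferior; this is an illustration of the pattern, not GDB's actual code, and wait_for_child_event() is a made-up name. The point to notice is that the call only ever comes back by failing with EINTR once a handled signal arrives.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigchld;

static void on_sigchld(int sig)
{
    (void)sig;
    got_sigchld = 1;
}

/*
 * Fork a child that exits immediately, then sleep in sigsuspend() until
 * SIGCHLD arrives. Returns the errno left by sigsuspend() (EINTR when
 * things work as described), or -1 if setup fails.
 */
int wait_for_child_event(void)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask, wait_mask;
    pid_t pid;
    int err = 0;

    got_sigchld = 0;
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGCHLD, &sa, &old_sa) < 0)
        return -1;

    /* Block SIGCHLD so it cannot slip in before we start waiting. */
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &old_mask);

    pid = fork();
    if (pid == 0)
        _exit(0);                     /* stand-in for an inferior event */

    if (pid > 0) {
        wait_mask = old_mask;
        sigdelset(&wait_mask, SIGCHLD);
        while (!got_sigchld) {
            sigsuspend(&wait_mask);   /* atomically unblock and sleep */
            err = errno;              /* sigsuspend() always returns -1 */
        }
        waitpid(pid, NULL, 0);
    } else {
        err = -1;                     /* fork failed */
    }

    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGCHLD, &old_sa, NULL);
    return err;
}
```

If the signal is lost, or the call is silently restarted instead of returning, the loop never sees its wakeup, which is exactly the kind of hang described here.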
In comes the issue of system call restart. Handling this wrt. debuggers is not easy. One nice feature of debuggers like GDB is that you can ask them at any point in time to call some function in the program they are running.
(gdb) p printf("hi\n")
hi
$1 = 2
(gdb)
After calling the function, it completely restores the process state to what it was before the call. How does it implement this?
First it saves the process state, which mostly comprises the registers. Next it allocates some stack space for the call and places the arguments for the function call on the stack and in registers. Then it makes the function return address point to a breakpoint it can uniquely recognize. Finally, it sets the program counter to point at the function to call.
When the function returns it hits the breakpoint, and the debugger restores all of the saved state it stored away before the special call.
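Those steps can be modelled with a toy register file. Everything in this sketch is hypothetical: struct toy_regs, the helper names and the 256-byte scratch size are invented for illustration, arguments go in registers only, and a real debugger would do all of this through ptrace register reads and writes.

```c
#define NUM_ARG_REGS 6

/* Toy register file; not any real architecture's layout. */
struct toy_regs {
    unsigned long pc;                  /* program counter */
    unsigned long sp;                  /* stack pointer */
    unsigned long ret_addr;            /* where the current function returns */
    unsigned long arg[NUM_ARG_REGS];   /* argument registers */
};

/* Set up an inferior call and hand back the state to restore later. */
struct toy_regs begin_inferior_call(struct toy_regs *regs,
                                    unsigned long func,
                                    unsigned long breakpoint,
                                    const unsigned long *args, int nargs)
{
    struct toy_regs saved = *regs;     /* 1. save the process state */
    int i;

    regs->sp -= 256;                   /* 2. scratch stack (arbitrary size) */
    for (i = 0; i < nargs && i < NUM_ARG_REGS; i++)
        regs->arg[i] = args[i];        /*    arguments into registers */
    regs->ret_addr = breakpoint;       /* 3. return lands on our breakpoint */
    regs->pc = func;                   /* 4. start executing the function */
    return saved;
}

/* Called once the return breakpoint is hit: put everything back. */
void finish_inferior_call(struct toy_regs *regs, const struct toy_regs *saved)
{
    *regs = *saved;
}
```

Anything that is not part of the saved register set does not survive the round trip, which is why state living only inside the kernel becomes a problem.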
Now, back to system call restart. When a signal interrupts a system call, the call can return immediately so that the signal gets processed. Inside the kernel, system calls return special error codes to let the signal dispatch code know what to do with the system call that was in progress. The code can say that -EINTR should be returned and the system call completed. But it can also say that the program counter should be rewound to the system call trap, the argument registers set up again with their original values, and the system call thus replayed.
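Userspace can watch both outcomes using the SA_RESTART flag of sigaction(). The sketch below is a standalone demo, not anything from gdb or the kernel: it blocks in read() on an empty pipe while a 20ms interval timer fires SIGALRM. Without SA_RESTART the first tick makes read() fail with EINTR. With SA_RESTART the call is replayed behind the program's back, and it only completes after the handler writes a byte on the second tick. The function name interrupted_read() and the timer interval are arbitrary choices.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t alarm_count;
static int pipe_wr = -1;

static void on_alarm(int sig)
{
    int saved_errno = errno;
    (void)sig;
    /* Second tick: supply data so a restarted read() can finish. */
    if (++alarm_count == 2) {
        char c = 'x';
        (void)write(pipe_wr, &c, 1);
    }
    errno = saved_errno;
}

/*
 * Block in read() on an empty pipe while SIGALRM ticks.
 * Returns the byte count from read(), or -errno if it failed.
 */
long interrupted_read(int use_sa_restart)
{
    struct sigaction sa, old_sa;
    struct itimerval tick = { { 0, 20000 }, { 0, 20000 } };
    struct itimerval off = { { 0, 0 }, { 0, 0 } };
    int fds[2];
    char buf;
    ssize_t n;
    long result;

    if (pipe(fds) < 0)
        return -errno;
    pipe_wr = fds[1];
    alarm_count = 0;

    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = use_sa_restart ? SA_RESTART : 0;
    sigaction(SIGALRM, &sa, &old_sa);
    setitimer(ITIMER_REAL, &tick, NULL);

    n = read(fds[0], &buf, 1);         /* interrupted by the first tick */
    result = (n < 0) ? -errno : n;

    setitimer(ITIMER_REAL, &off, NULL);   /* stop ticking, then clean up */
    sigaction(SIGALRM, &old_sa, NULL);
    close(fds[0]);
    close(fds[1]);
    return result;
}
```

Calling interrupted_read(0) should give -EINTR, while interrupted_read(1) should give 1, the single byte the handler wrote after the transparent restart.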
In the example above, imagine what happens if the debugger calls an inferior function at the time that the signal is dispatched. Somehow, in order to restore the state properly after the call, this "we're in a syscall and should do syscall restarting" state has to get saved and restored too.
Long ago I had this clever idea wherein I tried to solve this problem entirely inside of the kernel to shield debuggers from having to deal with it. The idea was that we'd modify the process state and perform the system call restart operations before we stopped the program to let the debugger see the state.
Although great in theory, in practice it's an unworkable solution. We don't know what the debugger is going to do with the process. As I stated earlier, it can pass along the signal to the process, or it can cancel the signal delivery altogether. This decision influences whether we should do system call restart or not, but we have already pre-committed that state and let the debugger see it. We can't know what the debugger is going to do ahead of time, therefore it is impossible to do the right thing.
This is what was causing the inner debugger to hang. The inner gdb is receiving a SIGCHLD because the 'test' program it is debugging has hit ptrace_stop(). The top-level gdb looks at this and says "ok, let's just let the inner gdb see the signal, PTRACE_CONT." But my funny in-kernel ptrace syscall restart code had already set things up for a syscall restart of sigpause(), when what should have happened was a return of -EINTR.
The inner gdb is now wedged forever. It missed the SIGCHLD, and the debugged process is sleeping, waiting to be woken up by the inner gdb.
So I ripped out all of this silly code, and ended up doing what powerpc, x86, and other platforms do. I added a piece of software binary state (an unused bit in one of the processor state registers) that gdb can control. It is the "we're in a system call" state. When the debugger does an inferior call, it clears this bit when changing the program counter register. This forces the kernel to not do system call restart processing when the task wakes out of ptrace_stop().
But later, when gdb restores all of the register state after the call, that special bit will be restored, and we'll do the right thing as we deliver the original signal and subsequently do syscall restart processing as needed.
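Here is a toy model of how that bit interacts with an inferior call. None of this is the real kernel or gdb code: struct toy_task, its fields and both functions are invented for illustration, and the 4-byte rewind simply assumes fixed-width instructions.

```c
#include <stdbool.h>

/* Toy task state as the debugger sees it; hypothetical layout. */
struct toy_task {
    unsigned long pc;
    bool in_syscall;      /* the "we're in a system call" bit */
    bool wants_restart;   /* the interrupted syscall asked to be replayed */
};

/*
 * Kernel side: runs when the task wakes out of its ptrace stop.
 * Returns true if the system call is going to be replayed.
 */
bool resume_from_stop(struct toy_task *t)
{
    if (t->in_syscall && t->wants_restart) {
        t->pc -= 4;               /* rewind to the syscall trap */
        t->in_syscall = false;
        return true;
    }
    return false;                 /* bit clear: leave the pc alone */
}

/*
 * Debugger side: redirect the task to func for an inferior call and
 * hand back the state that must be restored afterwards.
 */
struct toy_task start_call(struct toy_task *t, unsigned long func)
{
    struct toy_task saved = *t;
    t->pc = func;
    t->in_syscall = false;        /* cleared together with the pc change */
    return saved;
}
```

With the bit cleared, resuming the task for the call leaves the new program counter alone. Once the debugger writes the saved state back, the bit is set again and the next resume rewinds to the trap as it should.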
It's always fun to find land mines like these ones.