The most difficult kernel bugs are the ones that try both your patience and knowledge. This adventure I'm about to tell was one such case.
The first inkling I had of this problem was quite some time ago. Occasionally the Debian package database on one of my sparc64 boxes, usually the SunBlade100, would get corrupted. This is truly a pain when it happens because you cannot install, update, or remove any packages until you fix these files. It was difficult to reproduce: it would happen once, then not show itself again for weeks or months.
It came back recently, with a vengeance. Several Debian users reported the corruption, and Ubuntu sparc64 installs could trigger it as well. Some of these reporters could trigger it on-demand. Thus, it was time to fully investigate and fix this sucker for good.
The most crucial step in analyzing a bug like this is to get the data, and get as much as possible. Who sees the bug? What does the file corruption look like? What is in common with all the systems seeing the bug? There is no chance to be able to figure out the problem without this kind of information, and lots of it.
First, we knew it was a sparc64-specific problem. It had never been seen on any other platform. Second, we knew for a fact that we had never seen it on Niagara processors, but it had been observed on several machines based upon earlier UltraSPARC chips. Third, the pattern of corruption was always the same: some text from the description field of one package entry would show up in the middle of another package entry a few thousand characters later in the 'status' or 'available' package database file.
The corruption deserves more attention because it is the single most direct indication of what the bug is doing, and thus the best piece of data we have to determine the cause. Most of the corruptions were at least 16 bytes, sometimes larger. But the pattern was completely consistent across all samples made available to me.
What's unique about sparc64, and more specifically about non-Niagara UltraSPARC chips? Well, they have a virtually indexed cache whose indexing bits go beyond the smallest page size. Caches with this property are susceptible to aliasing. The same piece of data can appear multiple times in the cache, and then if you write to one copy you won't get the updated contents when you read from the other copy. The PAGE_SIZE is 8K and the D-cache size is 16K, so at most 2 aliased copies of a piece of data can appear in the cache.
We go to great lengths to make sure aliasing and the resulting corruption do not occur. This gets really tricky when files are mmap()'d into a process address space, because Linux maps all physical pages linearly in the kernel address space at fixed virtual addresses. The user can end up with these pages mapped to arbitrary virtual addresses, leading to potential aliasing between where the user and the kernel access the physical page.
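To make the arithmetic concrete, here is a minimal sketch of how the "color" of a virtual address falls out of those two sizes. The 8K and 16K values come from the paragraph above; the names (EX_PAGE_SHIFT, dcache_color, addresses_alias) are made up for illustration and are not the kernel's, and the sketch assumes the cache index is taken straight from the low virtual address bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sizes from the text: 8K pages, 16K virtually indexed D-cache. */
#define EX_PAGE_SHIFT   13                 /* 8K = 1 << 13 */
#define EX_DCACHE_SIZE  (16UL * 1024)

/*
 * The cache index uses address bits below EX_DCACHE_SIZE. Bits below
 * the page size are identical for every mapping of a page, so only the
 * bits between the page size and the cache size can differ. With these
 * sizes that is exactly one bit (bit 13), giving two possible colors.
 */
unsigned dcache_color(uintptr_t vaddr)
{
    return (unsigned)((vaddr & (EX_DCACHE_SIZE - 1)) >> EX_PAGE_SHIFT);
}

/*
 * Two virtual mappings of the same physical page can hold separate
 * cached copies only if their colors differ.
 */
bool addresses_alias(uintptr_t va_a, uintptr_t va_b)
{
    return dcache_color(va_a) != dcache_color(va_b);
}
```

For example, mappings at 0x2000 and 0x4000 have colors 1 and 0, so they could alias, while mappings at 0x4000 and 0x10000 both have color 0 and cannot.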
We use cache flushing to defeat writes creating inconsistent state in the cache. For example, if a file is mmap()'d into some process address space, and a write() is performed on the file, we will flush the cache after copying in the write data. To keep from flushing the cache constantly, we defer the flush if there are no mmap()'s of the file data. In this case, we set a state bit that marks the page as "D-cache dirty", so that if a subsequent mapping is in fact made to that page in some process, we flush the cache and clear the state bit before actually giving the page to the user. We also maintain which cpu did the write() system call so that we flush the D-cache on the correct processor.
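The deferred flush is easier to follow as a tiny state machine. What follows is a toy user-space model of the logic described above, not the actual kernel code: struct page_model, the model_* functions, and the counters are all hypothetical names, and the "flush" only records which cpu it would have targeted.

```c
#include <stdbool.h>

/* Toy stand-in for the per-page state described in the text. */
struct page_model {
    bool dcache_dirty;   /* the "D-cache dirty" state bit */
    int  dirty_cpu;      /* cpu that performed the write() */
    int  user_mappings;  /* how many user mmap()s cover this page */
};

int model_flush_count;            /* number of simulated flushes */
int model_last_flushed_cpu = -1;  /* cpu targeted by the last flush */

/* Stand-in for the real flush: only records that one happened. */
static void model_flush_dcache(int cpu)
{
    model_flush_count++;
    model_last_flushed_cpu = cpu;
}

/* write() path: flush now if the page is mapped, otherwise defer. */
void model_kernel_write(struct page_model *pg, int cpu)
{
    if (pg->user_mappings > 0) {
        model_flush_dcache(cpu);
        return;
    }
    pg->dcache_dirty = true;
    pg->dirty_cpu = cpu;
}

/* Mapping path: settle any deferred flush before the user sees the page. */
void model_map_into_user(struct page_model *pg)
{
    if (pg->dcache_dirty) {
        model_flush_dcache(pg->dirty_cpu);
        pg->dcache_dirty = false;
    }
    pg->user_mappings++;
}
```

In this model a write to an unmapped page costs nothing up front, and the flush is paid once, on the cpu that did the write, when the first mapping shows up.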
For shared-anonymous mmap() mappings, we enforce that all processes mmap() the object at a virtual address which is a multiple of 16K. This makes it impossible for aliases to get created.
Finally, when servicing a read() call on a file, we check if any process has the file data mmap()'d and writable. If so, we flush the D-cache before trying to copy out the file data to make sure the kernel sees the most recent copy of the data even if the user with the writable mmap() is at an aliased virtual address with respect to the kernel's mapping.
I went and audited this code several times. No luck. I even audited the cpu-specific cache flushing code. Nope.
Another thing specific to individual cpu types on sparc64 is the memcpy implementation. Chips before UltraSPARC-III have one memcpy, UltraSPARC-III and its follow-ons have another memcpy, and Niagara has its own memcpy as well. I looked at changes made to the pre-UltraSPARC-III memcpy and saw that I had removed some memory barriers. Consulting all the documentation showed that the current code was legal and that all the necessary memory barriers were being performed. Just to check, I asked a tester to try a kernel with the memory barriers added back. The bug was still present.
At this point frustration set in and I had to have some way to reproduce the corruption, at will, on my systems in order to fix this. I tried to write small test cases, trying to mimic how dpkg accesses the package database files. No luck. So I had to dig deeper. I yanked out all of dpkg's package database reading and writing code, and wrote a small test application on top of it that would simply read in a given package database file and create dpkg's internal data structures representing all of those packages, then simply dump those data structures into a new package database file. I figured this should simulate everything, every nuance about dpkg when it updates these files.
I still couldn't trigger it. I started to dissect the test program and the library code I copied over, looking for clues.
And then by chance I ran strace on my simulator. I noticed that the program was doing an enormous number of mremap() calls. Where in the world were they coming from? They weren't explicitly being made in any of the code, so I went to check out GLIBC's malloc implementation. Sure enough, its realloc() uses mremap() on larger malloc regions. Dpkg uses an internal allocator called varbuf that is implemented using realloc().
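To illustrate the access pattern, here is a small sketch of a buffer that grows through realloc(). It is a hypothetical stand-in, not dpkg's actual varbuf code: struct growbuf, growbuf_append, and growbuf_free are names invented for this example, and the doubling policy is an assumption.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer; not dpkg's real varbuf. */
struct growbuf {
    char  *data;
    size_t used;
    size_t size;
};

/*
 * Append len bytes, doubling the allocation with realloc() when the
 * buffer is full. Returns 0 on success, -1 if realloc() fails (the
 * existing buffer is left untouched in that case).
 */
int growbuf_append(struct growbuf *b, const char *src, size_t len)
{
    if (b->used + len > b->size) {
        size_t want = b->size ? b->size : 4096;
        while (want < b->used + len)
            want *= 2;
        char *p = realloc(b->data, want);
        if (p == NULL)
            return -1;
        b->data = p;
        b->size = want;
    }
    memcpy(b->data + b->used, src, len);
    b->used += len;
    return 0;
}

/* Release the buffer and reset it to the empty state. */
void growbuf_free(struct growbuf *b)
{
    free(b->data);
    b->data = NULL;
    b->used = 0;
    b->size = 0;
}
```

Appending a few hundred thousand short records to a zero-initialized struct growbuf and running the program under strace should, on a GLIBC system, show the same kind of mremap() traffic once the buffer grows large enough to be backed by its own mapping.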
So we go to the mremap() implementation in the kernel, and if the area is being requested to be enlarged, and the current area doesn't have that much space, it moves the mappings to another location and returns the new address.
And here we have the bug: if we move the pages to a new location which would cause cache aliasing with the previous mapping, we get potential corruption.
Using this knowledge I was able to write a small reproducible test case that just stressed mremap() usage, and write a kernel fix for sparc64 that flushed the cache when necessary by overriding the move_pte() macro which mremap() uses.
Know your system and how every part of it is implemented, and get as much data about a bug as you can. This is the only tried and true way to conquer even the hairiest of bugs.
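A test along those lines might look roughly like the sketch below. This is a guess at the general shape, not the test case I actually used: stress_mremap, pattern_byte, and EX_PAGE_SIZE are names made up for the example, and the 8K constant is taken from the PAGE_SIZE mentioned earlier rather than queried from the system. It grows an anonymous mapping with mremap(MREMAP_MAYMOVE), and after every call checks that the previously written bytes read back correctly through the possibly relocated mapping.

```c
#define _GNU_SOURCE             /* for mremap() and MREMAP_MAYMOVE */
#include <stddef.h>
#include <sys/mman.h>

#define EX_PAGE_SIZE 8192UL     /* from the text; an assumption elsewhere */

/* Deterministic byte pattern so stale data is detectable. */
static unsigned char pattern_byte(size_t i)
{
    return (unsigned char)(i * 31u + 7u);
}

/*
 * Grow a mapping `rounds` times. Returns the number of bytes that did
 * not read back correctly, or -1 if mmap()/mremap() failed. If
 * moves_out is non-NULL it receives how many times the kernel
 * relocated the mapping.
 */
long stress_mremap(int rounds, int *moves_out)
{
    const size_t step = 8 * EX_PAGE_SIZE;
    size_t len = step;
    int moves = 0;
    long bad = 0;

    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    for (size_t i = 0; i < len; i++)
        buf[i] = pattern_byte(i);

    for (int r = 0; r < rounds; r++) {
        size_t new_len = len + step;
        unsigned char *p = mremap(buf, len, new_len, MREMAP_MAYMOVE);
        if (p == MAP_FAILED) {
            munmap(buf, len);
            return -1;
        }
        if (p != buf)
            moves++;
        /* Data written through the old address must be visible here. */
        for (size_t i = 0; i < len; i++)
            if (p[i] != pattern_byte(i))
                bad++;
        /* Fill the newly added tail before the next round. */
        for (size_t i = len; i < new_len; i++)
            p[i] = pattern_byte(i);
        buf = p;
        len = new_len;
    }
    munmap(buf, len);
    if (moves_out)
        *moves_out = moves;
    return bad;
}
```

On a system without the aliasing problem this should always report zero bad bytes. Whether a given round actually relocates the mapping is up to the kernel, which is why the sketch counts moves instead of assuming them.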