Hacking and Other Thoughts

Thu, 21 May 2009


Beaux-Arts and kernel hacking...

My recent hobbies have included an intense study of New York City architecture, and in particular the facinating stories behind the city's two most prominent train stations. That being Grand Central Terminal and the arguably infamous Pennsylvania Station .


McKimm, Meade, and White's masterpiece at 42nd Street and 3rd Ave.

In the second half of the 19th century and on towards the first half of the 20th century, any American architect worth his salt studied at the Ecole des Beaux-Arts in Paris.

If you had a degree from that school, you were at the top of the pile for selection on all of the interesting commisions of the time. The school presented the student with a challenging and fast paced curriculum.

Firstly, for these American students attending in Paris, the first challenge was just getting in. The entrance exam (of course) required at least some proficiency in French. Several of the most notable American architects had the retake this entrace exam 5 or more times before being able to pass.

Once accepted, the student was pressed to solve problems. 12 hours were given to draft up a solution to a real architectual problem. Then once the draft was accepted, the student had 2 weeks to flesh out all of the details and present the final design. All the while the student's progress was critiqued by an established French architect who oversaw a group of students.

We really don't have that kind of training for computer science people. It's not even science I would say. This kind of training does exist for pure mathmatics, espcially in France.

Envision a school where you're asked to draft up the design of a compiler pass in 12 hours, then for two weeks you implement it, and meanwhile Alfred Aho critiques your work. This kind of place simply doesn't exist. (Yes I know Alfred teaches at Columbia currently, so maybe this specific place does exist :-) but I maintain that more generally such institutions do not exist)

Open source development and "throwing the masses of monkeys at the problem" seems to be a logical consequence of this, does it not?

A formally trained Beaux-arts architect and a room with a few drafters could design something as insanely complicated as a huge transportation hub in the middle of New York City. And it would work, as there would be no room for failure. To me it seems that someone similarly trained could do a complete operating system, compiler, or similar large software engineering task.

McKimm, Meade, and White were three architects and a couple drafters, and yet they were able to complete works such as Boston Public Library, Pennsylvania Station, and the James Farley Post Office. To just name a few.

So which is better, strict formal training and mentorship or open source monkeys? You decide!

Thu, 05 Mar 2009


A Sparc JIT for ioquake3

I recently got DRM working on sparc64, and this means I had to of course test it :-)

I played around with ioquake3 and it worked just fine. This was in a way exciting since I have devoured many hours of my life into this game on x86.

One aspect of quake3 is that it has a virtual machine. You can write a MOD for quake3 and replace pretty much any aspect of the game outside of the rendering engine. But to keep MOD authors from eating people's home directories, sending your password out to some rogue collection system, and things of that nature the interfaces are tightly controlled and the MOD code runs in a JIT'd VM.

The only way you can get into the JIT is to make a "system call". And the only way to get out is to either return or make such a system call into another module. The system call is the main entrypoint into the module, it takes an integer command and 0 or more integer arguments.

All memory accesses done by the JIT'd code are masked so that it is impossible for the JIT to touch memory outside of that allocated explicitly for it by the VM.

Ben H. kiddingly said to me that I should write the Sparc JIT since there is one for x86 and PowerPC already. One should never kid about such things...

It's pretty neat stuff, although the stack machine VM code output by the LCC compiler they used is horribly inefficient. Some code for a function might look like:

OPCODE[  OP_ENTER] IMM4[0x0000001c]	! ENTER function, 0x1c of stack
OPCODE[  OP_LOCAL] IMM4[0x00000024]	! PUSH stack offset 0x24 (first arg)
OPCODE[  OP_LOAD4] 			! LOAD from "stack + 0x24"
OPCODE[  OP_LEAVE] IMM4[0x0000001c]	! LEAVE function, return LOAD result
Operations push entries onto the "register stack", and consume entries on the top of that stack. This might emit some sparc code like:
	save	%sp, -64, %sp		! OP_ENTER
	sub	%g3, 0x1c, %g3
	add	%g3, 0x24, %l0		! OP_LOCAL
	and	%l0, %g5, %l0		! OP_LOAD4
	ld	[%g4 + %l0], %l0
	add	%g3, 0x1c, %g3		! OP_LEAVE
	ret
	restore	%l0, %g0, %o0
We use several fixed registers, "%g3" is the stack pointer, "%g5" is the VM data segment offset mask, "%g4" is the data segment base address. So every load or store address formation is "mask with %g5 and add to %g4".

It's there in the ioquake3 repo right now and will be in the next release. There are lots of things that can be improved but it works very well and most of the quake3 MODS I've tried (CPMA, UrbanTerror, etc.) work. I've also been playing the base game online extensively, you know, for stress testing.

Sat, 29 Nov 2008


NMI profiling on sparc64...

One thing that always has been suboptimal on sparc64 has been the profiling with oprofile.

Yes it worked but we only have the dumb timer interrupt profiling available. The biggest loss in this is that IRQ disabled code sequences do not get profiled. This leads to "clumps" in the profile, where all the code in an IRQ disabled sequence can show up as one big hit at the point where IRQs get re-enabled.

This makes the profile often non-representative and, at worst, completely unusable.

Say what you want about levelled interrupts, they do provide a level of flexibility that can be useful. Sparc64 chips have a PIL register, you indicate the interrupt levels (out of 15) you want to block out by writing the highest level to block out into the register. So writing zero enables all interrupt levels, and writing 15 blocks all of them out.

Device interrupts work by interrupt vectors, which are not blocked by the PIL mechanism. These are processed quickly in a trap handler and revectored into a PIL levelled interrupt by software.

Under Linux we use one PIL level for all of the device interrupts. A few of the other PIL levels we use for specific SMP cross-call types.

On UltraSPARC-III (cheetah) and later we (finally) have a profiling counter overflow interrupt. This arrives at PIL level 15. So naturally I had the idea to run the majority of the kernel only disabled up to PIL level 14. The result is that was can use these profile counter overflow interrupts to provide a pseudo-NMI oprofile implementation.

This works quite well and is checked into the sparc-next-2.6 GIT tree, so it will show up in 2.6.29

Wed, 05 Nov 2008


git stash

Someone on IRC asked what was so cool about my 'mkcf' shell script and those command lines I showed.

I'm so old skool that I didn't even know that something as cool as "git stash" even existed. Holy crap that's a nice feature :-)

Yeah but these young whipper snappers don't know how to build a git index by hand, har har har.

In other news, unless you live under a rock you know that Barak Obama won the presidental elections, but what you may not know is that he only won because of a last minute endorsement from Linus.

Tee-hee.

Mon, 03 Nov 2008


mkcf

I've had this tiny shell script in my repetoire that I don't know how I'd live without, it's called 'mkcf'.

It's so stupid and probably there are a thousand better ways to implement it. It just shows what files are referenced by a patch, here it is:

#!/bin/sh
diffstat -p 1 $* | grep -v changed | awk ' { print $1 } '
So I'm always typing things like:
bash$ git diff >diff
bash$ git checkout-index $(mkcf diff); git update-index --refresh
To save a patch I'm working on and revert it from the working tree.

I'm David S. Miller, and I support this message.

Thu, 30 Oct 2008


Rock transactional memory

There is a good set of presentation slides on Rock for a talk given this year at HotCHIPS by Shailender Chaudhry. The Rock chip gets a lot of attention for all of the speculation features it will have (and that stuff is of course extremely cool) but to me the most interesting bit initially is the transactional memory support.

If you look at page 18 and 19 in the slides you'll see the basic implementation. There are two instructions called "checkpoint" and "commit". So the traditional atomic increment is Rock'ified as:

atomic_inc:
	chkpt	fail_label
	ld	[%o0], %o1
	add	%o1, 1, %o1
	st	%o1, [%o1]
	commit
	retl
	 nop
fail_label:
	 ...
and at the "fail_label" you'd do the traditional atomic sequence using "cas" (compare and swap) instructions.

The checkpoint instruction takes a fail label so that it can be implemented in a best-effort manner (or not at all on some chips).

The commit instruction defines the end of the transaction and if it completes successfully then all of the memory operations enclosed in the sequence execute as if they were all done atomically.

So why bother with all of this? One interesting aspect is that this avoids atomic instructions.

You might look at the above and say "hey, that's the same as some sequence of atomic instructions" and yes for simple examples like an atomic increment it is. But there is still a difference, and that is that the checkpoint/commit sequence is easier to speculate around than anything using atomic instructions.

Atomic instructions act as a serialization point, and the scout threads can't do very much when such explicit ordering constraints start coming into play.

On the other hand the speculation threads can just execute the checkpoint/commit sequence as if it were just a normal set of loads and stores. The chip can perhaps do something like put a marker in the cache lines effected. If for some reason the speculation fails or we can't ensure the necessary atomicity, that's ok we zap the cache lines with the markers and rollback to the checkpoint instruction and head on down to the fail label. Just like for anything else, the speculation threads execute the checkpoint/commit sequence optimistically.


sparc64 memory barriers

I don't blog as often as I should, sorry...

Nick Piggin was working on optimizing the mutex implementations on various architectures, he started even hacking on some sparc64 bits, and it got me thinking....

There are three defined memory models for sparc v9, TSO (Total Store Ordering) PSO (Partial Store Ordering) and RMO (Relaxed Memory Ordering). The Linux kernel always programs the cpu to run in RMO when in privileged mode.

So for all the atomics and spinlocks we have memory barrier instructions (on sparc64 it's called "membar") peppered all over the place to make sure memory operation visibility is correct especially when the interface provides "locking" semantics.

Complicating matters further, UltraSPARC-I, UltraSPARC-II, and derivates have a chip bug where if a memory barrier occurs right after a mispredicted branch the chip can stop executing instructions until the next trap is signalled. If interrupts are off the cpu hangs completely.

To work around this we either make sure the condition can't happen in hand-crafted assembler, or we encapsulate the code in a macro that forces the memory barrier instruction into the delay slot of a "branch always, predict taken" control transfer. And this bloats up the kernel a bit.

But here's the interesting bit. All chips starting with UltraSPARC-III only actually implement the TSO memory model and essentially ignore what gets programmed into the memory-model field of the PSTATE register. And the RMO gains on the older chips that do actually support RMO is marginal at best. So most of this memory barrier business is just a complete waste.

So I wrote a patch that just gets rid of all of the membar instructions used with atomics and spinlocks, and this shaves 10 full seconds of system time off of a "make -j64" kernel build on Niagara-2. W00t...

Of course, for the optimized memcpy and memset implementations we have to keep the memory barriers in there. The block load and store instructions used in those routines are special and do not obey the normal memory model ordering semantics. Thus they do still require explicit memory barriers.

Mon, 06 Oct 2008


Did hell freeze over?

Linus has a blog :-)

Sun, 05 Oct 2008


Netfilter Workshop 2008, Paris

I just returned from Paris, and the 2008 Netfilter Workshop. Just like last year it was a blast, and there were lots of interesting things discussed as well as inbibed.

On the first day there was a users day where presentations were made aimed at a more user oriented audience. It seems that just about anyone who was aware could attend and hear the talks. I gave one on multiqueue networking. You can find my slides and other info here.

Tuesday and Wednesday were the main workshop days.

Of greatest interest to me were the descriptions given by Patrick McHardy for his new filtering framework, where all the complexity is in userspace and the kernel just runs filtering scripts and lookup datastructures fed to it by the user tools. In short, I think this stuff is great, and unlike some folks I don't think this will decrease netfilter participation by other developers at all.

And frankly, iptables was absolutely too accessible to contributors. Look at how much stinking poo is in the patch-o-matic, oft called "crap-o-matic".

Patrick's work is a wonderful centralized framework, and in fact the scripting is generic that you can build any tool to create these filtering instructions and subsidiary lookup tables.

We also made some headway with the tproxy stuff. All but one of the core networking patches are in the net-next-2.6 GIT tree. Indeed, this is a feature which has been missing for 5 years :-) I have to hand it to the balabit guys for sticking to it and working so hard for so long to get this merged.

Pablo gave some interesting presentations (3 at once!), and he is exploring some ways to perhaps make use of bloom filters. This is something Patrick has devoted some exploratory brian power to in the past, but it is often hard to find a use case for these inexact matches, although they are very cool.

Jozsef Kadlecsik gave his IPSET state of the union, discussing new features such as support for ip-port-ip hashing and set lists (which are unions of the existing set type).

Yasuyuki Kosakai gave a presentation on the road blocks that exist currently for doing proper connection tracking for MIPV6 nodes. The basic problem is that the persistent addresses (ie. the ones we'd want to use for connection tracking) exist in various locations in the IPV6 packet and extension headers.

Jesper talked about all of the userland scalability improvements he made to the iptables utilities. He also described a set of scripts he wrote to build optimized rule table trees.

Stephen Hemminger discussed some of the user visible interface work that Vyatta has been doing. Essentially these are a set of templated bash shell scripts and descriptor files that present a Cisco IOS like interface to administrators. He also talked about the performance issues surrounding the way in which iptables does packet counters, as well as the global conntrack table lock.

Harald Welte gave a talk about the current state of GPL violation enforcement. Things seem to have been going quite well, but it is becoming more and more important to give Harald more facilities by which to make air-tight arguments that he has enforcement rights to code which has been violated. One way for that to happen is for significant contributors to sign over their rights to him so that he can make enforcements on their behalf.

It seems that this is a very common stall tactic by the defence in such cases, to try and bring up some doubt about the code property ownership situation.

Of course, aside from the workshop itself there were plenty of parties. Even for lunch we had quite nice French cuisine and beverages, and the dinners were even nicer.

Tuesday night even included a full 4 hour boat cruise on the Seine, with tons of champaigne, wine, small bite size delicasies of all types, and lots of sweets.

Overall a wonderful time, the netfilter workshop never disappoints. A big thank you to the official organizers this year, INL.

Sun, 14 Sep 2008


Work towards a peek operation in the Linux networking packet schedulers.

There are some ugly corners of the packet scheduler qdisc interfaces. Jarek P., Herbert Xu, and myself have been trying to figure out some ways to tidy them a bit.

First of all, we have this requeue operation. Basically, if you do a dequeue, then decide (for whatever reason) you can't actually take the packet right now, you can stuff it back to the head of the qdisc using the requeue call. The qdisc must be locked during this sequence.

Some of the uses of requeue are dubious, at least in my opinion. For example, we let device driver transmit routines push back on the caller even when they have the queue marked as having room for more packets. The generic transmit level code uses requeue to stick such packets back into the qdisc.

I truly believe that we can eliminate this specific case. Such drivers are buggy, and we are better off just dropping the packet when the device does something like that. No reasonable nor common device driver causes this behavior.

Another use of requeue is in netem, which is the network emulator qdisc. You can use this qdisc to simulate various network characteristics such as delay, random packet drops, etc. The netem packet scheduler uses the requeue operation as part of it's packet reordering implementation.

At least the netem case could be handled more cleanly with a peek operation. The rule would be that you can do a peek to see the next packet that would be returned from a dequeue operation. If you decide you can't use the packet right now, you do nothing. Otherwise you must do a dequeue and it will return the same packet.

From the qdisc implementation's perspective, this can be done quite simply. Actually, Herbert Xu noticed this. Basically, non-trivial qdiscs implement peek the same as dequeue, except that they don't actually unlink and they remember this packet and the class it came from. Then when the subsequent dequeue operation comes along, we call down into the remembered class qdisc to invoke the dequeue operations down to the leafs.

And as it turns out, this is exactly how the CBQ packet scheduler currently implements requeue. It remembers the last class used for dequeue, and uses that to vector the requeue down to the correct leaf.

There are some details to work out, but this is the basic idea of how peek would work and how we would use it.

Jarek raised one interesting aspect of this. Classful qdiscs implement link sharing using timers, and packet send quotas based upon bandwidth used over time, etc. These are usually done on a class basis.

So, for example, if you peek, decide not to dequeue at the moment, and then later we do a dequeue and it uses the cached class mentioned above, that might be wrong. In the intervening time period, another class might have gotten it's quota back, and if that class if of a higher priority we should take packets from that class not the cached one.

In some sense this is theoretical, and that's what Herbert seems to be arguing currently.


Elimination of Sparc SBUS and EBUS device layers...

If you've been following my sparc-next-2.6 tree you've noticed quite a bit of activity over the past few weeks. During the waning weeks of 2.6.x-rcX development, things slowdown enough to allow some time for big cleanups and major infrastructure changes.

For years we've had this SBUS specific device layer. It was written before Linux had a generic device layer at all. Even PCI support was a bunch of hodge-podge interfaces, and a global linked list of probed devices. So I did SBUS in a similar way (only worse, if that can even be possible).

With the addition of the OF device layer I tried to clean the SBUS stuff up a bit. But I had to keep SBUS as a seperate device layer because all of the DMA mapping done by the drivers were done by sbus_foo() specific routines, and thus we needed to provide some SBUS specific state in the device objects. That was too much to untangle at the time.

EBUS, which is Sun's "8-bit bus", has it's own device layer as well. The EBUS is for things normally found behind ISA busses on x86 systems. RTC clocks, serial ports, floppy controllers, these sorts of things. Well, just like SBUS, they had their own device driver and DMA interfaces.

I bit the bullet and killed it all off. The first step was to deal with the DMA issues. Just like for PCI on sparc64, we store the IOMMU controller information in the generic struct device archdata area. This gets propagated to all bus children. Then calls to the generic dma_foo() DMA interfaces just does the right thing.

Then all the SBUS drivers were converted to use dma_foo() instead of sbus_*(). The only remaining task was to propagate the SBUS iommu initialization into seperate init functions and make sure they set things up before any real devices would probe.

Every SBUS driver thus uses OF device layer generic probing, register mapping, and IRQ information acquisition. They also use only the generic DMA mapping APIs. So, the SBUS layer could finally just get deleted.

A similar transformation happened for the EBUS layer.

The CS4231 sound driver was the trickest case in all of this. It comes in both EBUS and SBUS variants. So a large mess had to get untangled to convert the driver throughout all of these changes.

But now it's all done and there in sparc-next-2.6 ready for merging during the 2.6.28 merge window.

Another nice development that occured during this time period was a conversion of both sparc ports over to the generic RTC layer. This was thanks to some real nice groundwork done by Krzysztof Helt.

I'd also like to mention Robert Reif who has been helping test the sparc32 side of all of these changes.

Now I'm slowly cleaning up all of the interrupt and timer handling on sparc32 so that it can be converted over to use the generic IRQ layer just like sparc64.

Wed, 03 Sep 2008


Bumbershoot 2008

This past weekend the Bumbershoot festival was held here in Seattle. It's basically a showcase of both local and world known artists, primarily focusing on music and comedy. It's usually held over Labor Day holiday weekend, and lasts 3 days. You pay seperately for a general admission ticket for each day, it's about $35.00 USD

Originally admission was free. It first began in 1971 during the near collapse of local megacorp Boeing. The idea was to have something to lift the spirits of local Seattle folks. The event quickly grew to the point where the city couldn't fund and run the whole thing any longer, so now a private entity runs the thing and charges admission.

I attended Saturday, primarily to see Beck, who I already was familiar with and whose work I like a lot. But in the end the best part turned out to be all the new bands I got to sample and become familiar with.

Saul Williams was particularly impressive. After seeing him perform it was no surprise for me to later learn that he used to be primarily a poet in New York City. Real powerful stuff, great delivery, lots of energy. He also hates our current president, so this guy is perfect.

I'm not usually much of an R and B person, but Estelle gets great marks. She's a UK hiphop artist, and her songs are quite smooth. I would definitely put her album into the rotation for long car drives. Unfortunately her live performance didn't translate as well as it could have, so I'm kind of glad I went after the show and listened to her album nonetheless.

Finally, Asylum Street Spankers, man these guys were a hoot. From Texas, they play this folk'ish music to sometimes serious but mostly funny (bordering on hysterical) lyrics.

Of particular note, at least for me, was their song "Jailhouse". The first few verses describe some criminal (a crack dealer, a thief, etc.) and then the chorus "he's in the Jailhouse now" is sung 2 or 3 times. Then the final verse cleverly describes George W. Bush indirectly, and the chorus becomes "he's in the White House now". The delivery was incredibly well done, and I was in stitches.

I won't say much about Beck, there's tons of info about him online and he's quite established. His performance was great, and what struck me most was how calm, cool, and confident he seemed. They did this bit using Roland TR 808 boxes that was absolutely phenominal. I think they performed two entire songs using nothing but a set of Roland TR 808s that each of them had at least one of.

Overall a fun experience. For a quite reasonable price I got to see an act I already liked and also became familiar with three new acts I get to enjoy now as well.

Yeah, I've been hacking too. More on that later...