Hacking and Other Thoughts

Mon, 31 Dec 2007

TCP retransmit queue overhead...

There is this one cpu cycle killer we've been trying to find ways to overcome, and the more I think about it the situation wouldn't even exist if the SACK standard had been given just a smidgen more thought.

SACK has been specified as a lossy indicator of out of order packet reception. This is the core problem. It's a hint, and you cannot actually "act" upon the information.

So if you get some SACK blocks you can't free those marked packets from your retransmit queue. SACK is specified such that the receiver can free up those packets and stop advertising them in the SACK blocks.

This is profoundly stupid, just look at the implications.

What this means is that during a loss recovery event, we have to hold onto a whole extra window of data in the retransmit queue for no reason at all. Even more brilliant is that when the hole is filled and we get this HUGE ACK back acknowledging two windows worth of data we have to purge all of those packets from the retransmit queue in one go.

For large windows, we're talking 4,000 packets or more. I don't care what datastructure you use to manage your retransmit queue, that's going to be expensive and show up in profiles.

If, instead, the SACK specification authors spent just a little bit more time thinking about the implications of SACK being a "soft" indicator I think they would have changed their minds. If SACK were a "hard" indication and we could thus free up packets so marked there, the processing would be spread out throughout the recovery event instead of being batched up into one huge wallop of packet freeing.

Supposedly the reason for marking SACK a "soft" indication is so that "low memory" embedded systems could free up the data. But this is incredibly stupid. If you can handle one window of data, you can also handle holding onto two full windows of data as well. We aren't talking about PDP-11's with 16K of RAM or anything like that.

Furthermore this is inconsistent because the sender has to hold all of the data in such loss recovery events, why treat the receiver specially? It makes absolutely no sense, because if such small embedded systems ever have to send data, they have to commit to the same amount of RAM resourses in such situations.

In fact, a "hard SACK" would help those small devices when they act as senders, because during a loss event they could liberate packets indicated in the SACK blocks they get back.

The SACK standard folks made a horrible tradeoff. They created a significant future performance hindrance for %99.9999 of systems in order to cater to some theoretical memory usage issue on %0.00000001 of machines in the world.

Every TCP implementation out theere is going to have to come up with a workaround for this issue especially as huge windows become more prevalent. And all of that engineering effort would be totally unnecessary (read as: we could work on much more important stuff) if SACK had been specified as a hard indicator.

I have the urge to propose some TCP option that allows two TCP implementations to negotiate "hard" SACK blocks but that won't work. We'd still support "soft" SACKs for eternity and furthermore that's what anyone wishing to exploit this high cpu cost code path in the TCP stack would use.

To be honest, nobody drops data in the way that SACK allows. It'd probably be pragmatic and reasonable just to start enforcing the "hard" semantics. And if they are violated (packets disappear from SACK blocks) we just RST the connection. We could add a ton of logging if such a thing happens, and we could investigate such cases.

It's quite a mess and somebody has to mop it up.


I have to say that Christmas holidays are the best time to hack on things. Most of the yahoos are away on vacation and therefore the constant stream of distractions and really dumb emails in your inbox just aren't there. In short you can work on the things you always want to work on but never can because of time constraints.

There really is an information overload problem.

Today I want to talk about multiple return values. In short, don't do it :-) It really is a sign that things need to be redesigned, and on top of it multiple return values result in quite inefficient code.

The most common case with C is when you need to allocate some memory and return it to caller, but if something goes wrong you want to give some error status too.

So you end up with absolute crap like this:

int create_foo(int flags, struct foo **foop)
or, even worse:
struct foo *create_foo(int flags, int *errp)
I mean, that just doesn't deserve to live. First of all, the compiler has to allocate stack space for that "by reference" turd you had to add to the arguments to pass that second piece of information back. And that's slow, even on register starved cpus like x86 where stack accesses have been heavily optimized inside of the cpu to make up for that.

Secondly, the semantics are not entirely clear. If an error happens for the second API above, will the function return value always be NULL? In the first API above, when an error happens will the "*foop" always be NULL'd out for me or is the caller expected to do that? Likewise, for the second API above, if there is no error and non-NULL is returned, can I depend upon "*errp" being set to zero?

You don't know, because when you look at that interface definition you simply cannot tell. It's one big ambiguous ugly interface.

For this particular case in the kernel we've settled on a set of macros that allow a pointer and an error to be returned in a single function return value. Basically, it takes advantage of the fact that the range of negative error codes cannot ever be legitimate pointers. So the interface above looks like:

struct foo *create_foo(int flags)
and the caller first checks:
	p = create_foo(flags);
	if (IS_ERR(p)) {
		err = PTR_ERR(p);
		return err;
And if "IS_ERR" is not true, it is a non-NULL pointer and no error happened.

It is completely unambiguous what the return values mean, and how they will be presented to the caller. Yes, it's true that looking at the function definition you can't "see" this. But once people are exposed to this pattern enough they pick it up, and it allows us to eliminate ambiguity and unclear semantics for these kinds of cases.

Have a happy new year everyone.