<?xml version="1.0"?>
<!-- name="generator" content="blosxom/2.0" -->
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">

<rss version="0.91">
  <channel>
    <title>DaveM's Linux Networking BLOG</title>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi</link>
    <description>Mashimaro Fan Club</description>
    <language>en</language>
    <image>
      <url>http://vger.kernel.org/~davem/davem-48-70.png</url>
      <width>48</width>
      <height>70</height>
    </image>

  <item>
    <title>Metrics, metrics, metrics...</title>
    <pubDate>Fri, 10 Dec 2010 18:25:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2010/12/10#route_metrics</link>
    <description>&lt;p&gt;
First a quick shout-out to the Colbert Nation.
&lt;p&gt;
Next, let's talk about route metrics and sharing, shall we?
&lt;p&gt;
Metrics exist to specify attributes for a path.  For example,
what hoplimit should we use when using this route?  What
should the path MTU be?  How about the TCP congestion window
or RTT estimate?
&lt;p&gt;
Metric settings come from two places.
&lt;ul&gt;
&lt;li&gt;Administrator settings
&lt;li&gt;On the fly measurements
&lt;/ul&gt;
&lt;p&gt;
The administrator can attach initial metric values to
routes when they get loaded into the kernel.  This helps
deal with specialized situations, but it's not very common
at all.
&lt;p&gt;
There is a special metric, called &quot;lock&quot; which is a bitmask.  There is
a bit for each of the existing metrics.  If the bit is set, it means
that metric should not be modified by the kernel and we should always
respect the value the administrator placed there.
&lt;p&gt;
The kernel itself will, on the fly, adjust metric values.  For
example, at the teardown of every TCP connection the kernel
can update the metrics on the route attached to the socket.
These updates are based upon measurements made during the life
of the TCP socket.  There is a sysctl to disable this automatic
TCP metric update mechanism, for testing purposes.
&lt;p&gt;
For a router, metrics don't really change.  So, in theory, we
could take the defaults stored in the routing table and just
reference those directly instead of having a private copy in
every routing cache entry.
&lt;p&gt;
There are some barriers to this, although none insurmountable.
&lt;p&gt;
First of all, in order to share we have to be able to catch
any dynamic update so we can unshare those read-only metrics.
The net-next-2.6 tree has changes to make sure every metric
write goes through a helper function.  So we have the traps
there and ready to go, problem solved.
&lt;p&gt;
Next, we actually change the metrics a little bit when we create
every single routing cache entry.  Essentially we pre-compute
defaults.  This is pretty much unnecessary, and actually could
theoretically cause some problems in some cases.  The metrics
in question are the hoplimit, the MTU, and the advertised MSS.
For ipv4 these are set in rt_set_nexthop.
&lt;p&gt;
If the route table metric is simply the default (i.e. zero) we
pre-calculate it.  These calculations are pretty simple and
could be done instead when the value of the metric is actually
asked for.
&lt;p&gt;
Since the defaults are address-family dependent we will need
to abstract the calculations behind dst_ops methods, but that's
easy enough.  Accesses to each of these three metrics then
need to go through a helper function which essentially says:
&lt;br&gt;
&lt;pre&gt;
	if (metric_value == 0)
		metric_value = dst-&gt;ops-&gt;metric_foo_default(dst);
	return metric_value;
&lt;/pre&gt;
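&lt;p&gt;
To make that concrete, here is a minimal, self-contained sketch of such
a read-side helper for the MTU.  Everything in it is an assumption made
for illustration: the names sketch_dst_mtu, default_mtu and dev_mtu are
invented, and the structures are cut-down stand-ins rather than the
kernel's real definitions.
&lt;pre&gt;<![CDATA[
```c
#include <stdint.h>

struct dst_entry;

/* Hypothetical per-address-family hook; the text above only calls it
 * "metric_foo_default". */
struct dst_ops {
	uint32_t (*default_mtu)(const struct dst_entry *dst);
};

struct dst_entry {
	const struct dst_ops *ops;
	uint32_t metric_mtu;	/* 0 means "not set, use the default" */
	uint32_t dev_mtu;	/* stand-in for the output device MTU */
};

/* Example IPv4-style default: fall back to the device MTU. */
uint32_t sketch_ipv4_default_mtu(const struct dst_entry *dst)
{
	return dst->dev_mtu;
}

const struct dst_ops sketch_ipv4_ops = {
	.default_mtu = sketch_ipv4_default_mtu,
};

/* Read-side helper: compute the default only when it is asked for. */
uint32_t sketch_dst_mtu(const struct dst_entry *dst)
{
	uint32_t v = dst->metric_mtu;

	if (v == 0)
		v = dst->ops->default_mtu(dst);
	return v;
}
```
]]>&lt;/pre&gt;
&lt;p&gt;
The stored metric stays zero, so nothing is written at route cache
entry creation time and the array can remain shared.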
&lt;p&gt;
With that in place we will rarely, if ever, modify metric values in
the routing cache metric arrays.  Then the next step is putting
unshared metrics somewhere else (e.g. the inetpeer cache), and then
changing the dst_entry metrics member to be a pointer instead of an array.
&lt;p&gt;
Initially the pointer will point into the routing table read-only
default metrics.  On a COW event, we'll hook up an inetpeer entry to
the route and retarget the metrics pointer to the inetpeer's metrics
storage.</description>
  </item>
  <item>
    <title>IPv4 source address selection in Linux</title>
    <pubDate>Sun, 21 Nov 2010 14:46:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2010/11/21#ipv4_saddr_selection</link>
    <description>&lt;p&gt;
Often when we connect a socket up, or bind it, we really
don't care what source address is used for the resulting
connection.  We let the kernel decide.
&lt;p&gt;
The usual sequence is:
&lt;pre&gt;
	fd = socket(PF_INET, SOCK_{STREAM,DGRAM}, IPPROTO_{TCP,UDP});
	connect(fd, { AF_INET, $PORT, $DEST_IP_ADDR }, ...);
&lt;/pre&gt;
Here we leave the source port and source address to be selected
by the kernel via a facility called auto-binding.  Oddly enough
TCP and UDP use a different ordering for selecting the source
address vs. selecting the source port.
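&lt;p&gt;
The result of auto-binding can be observed from userspace.  The
following is a small sketch, not anything from the kernel tree: the
function name autobound_source is made up here.  It connects a UDP
socket without binding it and then uses getsockname() to read back the
source address the kernel chose.  Connecting a UDP socket sends no
packets, it only triggers the route lookup.
&lt;pre&gt;<![CDATA[
```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Connect a UDP socket without binding it first, then ask the kernel
 * which source address it picked.  Returns 0 on success, -1 on error. */
int autobound_source(const char *dst_ip, unsigned short port,
		     char *out, size_t outlen)
{
	struct sockaddr_in dst, src;
	socklen_t len = sizeof(src);
	int rc = -1;
	int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

	if (fd < 0)
		return -1;

	memset(&dst, 0, sizeof(dst));
	dst.sin_family = AF_INET;
	dst.sin_port = htons(port);
	if (inet_pton(AF_INET, dst_ip, &dst.sin_addr) == 1 &&
	    connect(fd, (struct sockaddr *)&dst, sizeof(dst)) == 0 &&
	    getsockname(fd, (struct sockaddr *)&src, &len) == 0 &&
	    inet_ntop(AF_INET, &src.sin_addr, out, (socklen_t)outlen) != NULL)
		rc = 0;

	close(fd);
	return rc;
}
```
]]>&lt;/pre&gt;
&lt;p&gt;
Called with a loopback destination such as 127.0.0.1 this should
report a loopback source address.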
&lt;p&gt;
UDP will first select a local port to use, and make this choice
in a global namespace of ports for the machine.  TCP on the other
hand will have a source address selected first, and then try to
allocate a local port using the source address as a partial key.
This latter ordering is necessary in order to handle SO_REUSEADDR
correctly.
&lt;p&gt;
Source address selection itself happens via routing.
&lt;p&gt;
The route lookup will be performed with the source address in
the flow lookup key set to zero.  After the route, based upon
destination address, is found the routing code uses the next-hop
interface to select an appropriate source address.
&lt;p&gt;
All of these results are propagated into the routing cache
entry.
&lt;p&gt;
It is interesting to note that the routing cache entry created
in such a situation will have a zero source address as well in
its routing key.  So the next time a routing lookup occurs
to the same destination, but without a specified source-address,
we'll match this routing cache entry.
&lt;p&gt;
This little detail creates some minor complications when handling
ICMP messages for redirects.  Since we must update any potentially
matching routing cache entries, we have to probe the hash table
multiple times.  Once with an explicit source address in the
lookup key, and once with the source address in the key set to zero.
Otherwise we won't update all of the entries that we need to.
&lt;p&gt;
Actual source address selection is performed by
&lt;tt&gt;inet_select_addr()&lt;/tt&gt;, either via direct calls made
by net/ipv4/route.c, or indirectly via &lt;tt&gt;__fib_res_prefsrc()&lt;/tt&gt;.
This function works with a &quot;scope&quot; specification which says
in which realm the source address must be valid.  Most of
the time this is RT_SCOPE_UNIVERSE.
&lt;p&gt;
The linked list of ipv4 interface addresses for the interface is
traversed, and the first address with an appropriate scope is
selected.
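&lt;p&gt;
A rough sketch of that traversal, under some loudly stated assumptions:
the structure sketch_ifaddr and the function sketch_select_addr are
invented for illustration, and the real function does more than this.
The scope values are the ones from linux/rtnetlink.h, where a wider
scope has a numerically smaller value.
&lt;pre&gt;<![CDATA[
```c
#include <stddef.h>
#include <stdint.h>

/* Scope values as found in linux/rtnetlink.h. */
enum {
	RT_SCOPE_UNIVERSE = 0,
	RT_SCOPE_LINK = 253,
	RT_SCOPE_HOST = 254
};

/* Hypothetical, cut-down interface address list entry. */
struct sketch_ifaddr {
	uint32_t local;		/* address, host byte order in this sketch */
	unsigned char scope;
	struct sketch_ifaddr *next;
};

/* Return the first address whose scope is at least as wide as the
 * requested one (numerically less than or equal), or 0 if none. */
uint32_t sketch_select_addr(const struct sketch_ifaddr *list,
			    unsigned char scope)
{
	const struct sketch_ifaddr *ifa;

	for (ifa = list; ifa != NULL; ifa = ifa->next) {
		if (ifa->scope > scope)
			continue;	/* too narrow for this request */
		return ifa->local;
	}
	return 0;
}
```
]]>&lt;/pre&gt;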
&lt;p&gt;
Even though the flow key of the routing cache entry will have
a zero source address, the source address selected is remembered
in &lt;tt&gt;rt-&gt;rt_src&lt;/tt&gt; so that users of it can see what
source address to use.
&lt;p&gt;
Finally, routes loaded into the kernel can have an explicit
&quot;preferred source address&quot; attribute set by the administrator.
This value will fully preempt whatever &lt;tt&gt;inet_select_addr()&lt;/tt&gt;
would have chosen.</description>
  </item>
  <item>
    <title>How GRO works</title>
    <pubDate>Mon, 30 Aug 2010 20:46:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2010/08/30#gro_howto</link>
    <description>&lt;p&gt;
All modern device drivers should be doing two things.  First,
they should use NAPI for interrupt mitigation plus simpler
mutual exclusion (all RX code paths run in software interrupt
context just like TX).  Second, they should use the GRO NAPI
interfaces for feeding packets into the network stack.
&lt;p&gt;
Like just about anything else in networking, GRO is
all about increasing performance.  The idea is that we
can accumulate consecutive packets (based upon protocol
specific sequence number checks etc.) into one huge packet,
then process the whole group as one packet object.  (In
Network Algorithmics this would be principle P2c:
shift computation in time, share expenses, batch.)
&lt;p&gt;
GRO helps significantly on everyday systems, but it helps
even more strongly on machines making use of virtualization
since bridging streams of packets is very common and GRO
batching decreases the number of switching operations.
&lt;p&gt;
Each NAPI instance maintains a list of GRO packets we are
trying to accumulate to, called &lt;tt&gt;napi-&gt;gro_list&lt;/tt&gt;.
The GRO layer dispatches to the network layer protocol
that the packet is for.  Each network layer that supports
GRO implements both a &lt;tt&gt;ptype-&gt;gro_receive&lt;/tt&gt; and a
&lt;tt&gt;ptype-&gt;gro_complete&lt;/tt&gt; method.
&lt;p&gt;
&lt;tt&gt;-&gt;gro_receive&lt;/tt&gt; attempts to match the incoming &lt;tt&gt;skb&lt;/tt&gt;
with ones that have already been queued onto the &lt;tt&gt;-&gt;gro_list&lt;/tt&gt;.
At this time, the IP and TCP headers are popped from the front of the
packets (from GRO's perspective; the actual normal &lt;tt&gt;skb&lt;/tt&gt;
packet header pointers are left alone).  Also, the GRO'ability state
of all packets in the GRO list and the new incoming SKB is updated.
&lt;p&gt;
Once we've committed to receiving a GRO skb, we invoke the
&lt;tt&gt;-&gt;gro_complete&lt;/tt&gt; method.  It is at this point that
we make the collection of individual packets look truly like one
huge one.  Checksums are updated, as are various private GSO
state flags in the head 'skb' given to the network stack.
&lt;p&gt;
We do not try to accumulate GRO packets infinitely.  At the
end of a NAPI poll quantum, we force flush the GRO packet
list.
&lt;p&gt;
For ipv4 TCP there are various criteria for GRO matching.
&lt;ul&gt;
&lt;li&gt;Source and destination address must match
&lt;li&gt;TOS and protocol fields must be the same
&lt;li&gt;Source and destination ports must match
&lt;/ul&gt;
Certain events cause the current GRO bunch to get flushed
out.  For example:
&lt;ul&gt;
&lt;li&gt;ID field not being in sequence with existing packets
&lt;li&gt;Don't fragment bit clear
&lt;li&gt;TCP CWR congestion indication being set
&lt;li&gt;TCP ACK sequence mis-match
&lt;li&gt;Any TCP option mis-match
&lt;li&gt;TCP sequence not being in-order
&lt;/ul&gt;
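&lt;p&gt;
The two lists above boil down to a pair of predicates.  Here is a
simplified sketch of them.  The structure sketch_pkt and both function
names are invented, the header fields are flattened into one struct,
and TCP option comparison is left out, so treat it as an illustration
of the criteria rather than what the kernel code looks like.
&lt;pre&gt;<![CDATA[
```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flattened view of the IPv4 and TCP header fields. */
struct sketch_pkt {
	uint32_t saddr, daddr;
	uint8_t tos, protocol;
	uint16_t sport, dport;
	uint16_t ip_id;
	bool dont_frag;
	bool tcp_cwr;
	uint32_t seq, ack_seq;
	uint16_t payload_len;
};

/* Same flow: p is a candidate for merging with the held packet. */
bool sketch_same_flow(const struct sketch_pkt *held,
		      const struct sketch_pkt *p)
{
	return held->saddr == p->saddr && held->daddr == p->daddr &&
	       held->tos == p->tos && held->protocol == p->protocol &&
	       held->sport == p->sport && held->dport == p->dport;
}

/* Same flow, but an irregularity forces the held bunch to be flushed.
 * "held" stands for the most recently merged packet. */
bool sketch_must_flush(const struct sketch_pkt *held,
		       const struct sketch_pkt *p)
{
	return (uint16_t)(held->ip_id + 1) != p->ip_id ||
	       !p->dont_frag || p->tcp_cwr ||
	       held->ack_seq != p->ack_seq ||
	       held->seq + held->payload_len != p->seq;
}
```
]]>&lt;/pre&gt;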
&lt;p&gt;
The most important attribute of GRO is that it preserves
the received packets in their entirety, such that if we
don't actually receive the packets locally (for example
we want to bridge or route them) they can be perfectly
and accurately reconstituted to the transmit path.  This
is because none of the packet headers are modified (they
are entirely preserved) and since GRO requires completely
regular packet streams for merging, the packet boundary
points are known precisely as well.  The GRO merged
packet can be completely unraveled and it will mimic
exactly the incoming packet sequence.
&lt;p&gt;
GRO is mainly the work of Herbert Xu.  Various driver authors
and others helped him tune and optimize the implementation.</description>
  </item>
  <item>
    <title>Converting sk_buff to list_head.</title>
    <pubDate>Thu, 26 Aug 2010 23:40:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2010/08/26#skb_list_head</link>
    <description>&lt;p&gt;

I've been trying to make this happen, off and on, for at least two
years now.  Most of the kernel is straightforward and uses the skb_*()
interfaces we have for manipulating skb objects on a list.

&lt;p&gt;

So for those, simply tweaking the interfaces in skbuff.h will make
them all &quot;just work&quot;.

&lt;p&gt;

However there are a few other spots in the kernel which manipulate the
SKB list pointers directly:

&lt;ul&gt;

&lt;li&gt;SKB fragment lists have a head of skb_shinfo(skb)-&gt;frag_list of
the head skb and use only skb-&gt;next for linkage.
&lt;li&gt;The GRO handling uses both -&gt;prev and -&gt;next with a single
pointer head at napi_info-&gt;gro_list
&lt;li&gt;Both ISDN PPP and the generic PPP code have a fragmentation
handling engine which manipulates the SKB list pointers directly.
I've very nearly converted the ISDN side to use the standard
skbuff.h list interfaces, but it added regressions and I had to
eventually revert.
&lt;li&gt;The socket backlog handling uses a by-hand coded FIFO tail
queue list of SKBs.

&lt;/ul&gt;

&lt;p&gt;

I'm taking another stab at this, and hopefully I can work out these
wrinkles.  It'd be a really nice change because a lot of uses of
&quot;struct sk_buff_head&quot; which don't care about the spinlock or the
packet count can be converted to simply &quot;list_head&quot; saving serious
space in various datastructures.
</description>
  </item>
  <item>
    <title>Couple of strcmp tricks...</title>
    <pubDate>Sun, 25 Apr 2010 04:43:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2010/04/25#strcmp_tricks</link>
    <description>&lt;p&gt;
Like for strlen it's pretty easy to make a loop which checks a long
word at a time.  The less easy part is making the code run cheaply
in constant time when we exit that loop.
&lt;p&gt;
We're going to work with two constants:
&lt;pre&gt;
#define	STRCMP_CONST1	0x7f7f7f7f7f7f7f7f
#define	STRCMP_CONST2	-0x0101010101010101
&lt;/pre&gt;
We have two calculations, one for the inner loop and one
for the post-loop calculations.  Two are necessary because
we want to minimize the cycle count in the loop and we
only care there if there exists a zero byte somewhere.
Whereas in the post-loop exit code we have to know precisely
which byte the zero resides in.
&lt;p&gt;
The inner loop runs roughly like (in pseudo C):
&lt;pre&gt;
	while (1) {
		s1 = *s1_word++;
		s2 = *s2_word++;
		g2 = s1 + STRCMP_CONST2;
		g1 = s1 | STRCMP_CONST1;
		s_xor = s1 ^ s2;
		if (s_xor)
			break;
		if (g2 &amp; ~g1)
			return 0;		
	}
&lt;/pre&gt;
We specifically prioritize the inequality test before the
&quot;is there a zero byte in s1&quot; test.  And the inner loop &quot;zero
byte present in 's1'?&quot; test is:
&lt;pre&gt;
	(s1 + -0x0101010101010101) &amp; ~(s1 | 0x7f7f7f7f7f7f7f7f)
&lt;/pre&gt;
It's one of several ways to test for a zero byte in a word in
3 instructions.  But as mentioned it's imprecise in that due
to cascading overflows from the addition, the zero marker left
in the mask result might not be in the actual zero byte.  That's
important in the post-loop exit code calculations so we'll use
something else, which is:
&lt;pre&gt;
	tmp1 = ~(s1 | 0x7f7f7f7f7f7f7f7f);
	tmp2 = ~((s1 &amp; 0x7f7f7f7f7f7f7f7f) + 0x7f7f7f7f7f7f7f7f);
	x = (tmp1 &amp; tmp2);
&lt;/pre&gt;
&quot;x&quot; will have a &quot;0x80&quot; value in every byte that was zero in &quot;s1&quot; and
zeros elsewhere.  In the inner loop we already calculated &quot;(s1 |
0x7f7f7f7f7f7f7f7f)&quot; so we can reuse it and simply complement it.
&lt;p&gt;
How does this clever calculation work its magic?  'tmp1' records
every byte of the word that has bit 7 clear, and gives us
a &quot;0x80&quot; in such bytes and a zero in all others.  'tmp2' records
all the bytes that have all bits below 7 (0x7f) clear, leaving
bit 7 set in all such bytes and clear elsewhere (its low bits are
don't-cares, the AND with 'tmp1' discards them).  So &quot;tmp1 &amp; tmp2&quot; is
&quot;all bytes that have all bits clear&quot;.  Get it?
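&lt;p&gt;
If you want to poke at the two zero-byte tests, here they are side by
side as plain C functions.  This is only a sketch: the function names
and the K7F/K01 macros are made up for this example, the formulas are
the ones given above.
&lt;pre&gt;<![CDATA[
```c
#include <stdint.h>

#define K7F 0x7f7f7f7f7f7f7f7fULL
#define K01 0x0101010101010101ULL

/* Inner-loop test: nonzero iff s1 contains a zero byte, but the marker
 * bits may also land in other bytes because borrows can cascade. */
uint64_t sketch_has_zero_fast(uint64_t s1)
{
	return (s1 - K01) & ~(s1 | K7F);
}

/* Post-loop test: exactly 0x80 in each zero byte of s1, 0 elsewhere. */
uint64_t sketch_zero_byte_mask(uint64_t s1)
{
	uint64_t tmp1 = ~(s1 | K7F);		/* 0x80 where bit 7 is clear */
	uint64_t tmp2 = ~((s1 & K7F) + K7F);	/* bit 7 set where low bits are 0 */

	return tmp1 & tmp2;
}
```
]]>&lt;/pre&gt;
&lt;p&gt;
For the word 0x4142434445460100 the fast test yields 0x8080, marking
the 0x01 byte as well as the real zero byte, while the precise test
yields 0x80.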
&lt;p&gt;
Now, using &quot;x&quot; we can see which comes first, a byte miscompare or a
zero.  Note we already have this &quot;s_xor&quot; thing calculated in the
inner loop, and we'll use that here.
&lt;pre&gt;
	ret = ((unsigned long)s1 &lt;= (unsigned long)s2) ? -1 : 1;
	low_bit = x &gt;&gt; 7;
	if (low_bit &gt; s_xor)
		ret = 0;
&lt;/pre&gt;
And we're done.
&lt;p&gt;
&quot;low_bit&quot; is the &quot;x&quot; value shifted down by 7 bits so that we have a
&quot;0x01&quot; in every byte that was zero in &quot;s1&quot;.  With big-endian byte
ordering, if we simply compare this &quot;low_bit&quot; with the &quot;s_xor&quot; a
larger value of &quot;low_bit&quot; indicates that the zero byte comes first.
And in such a case the strings we scanned were equal and we should
return &quot;0&quot;.
&lt;p&gt;
A lot of strcmp implementations have performance which is dependent upon
which byte in the long word miscompares (they simply scan a byte at a
time when they exit the loop) or their location agnostic code is much
more expensive than it needs to be.
&lt;p&gt;
Here's an icebag for your brain, I can see the steam coming out of your
ears after reading all this.</description>
  </item>
  <item>
    <title>Function graph tracer on sparc64...</title>
    <pubDate>Thu, 08 Apr 2010 22:46:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2010/04/08#function_graph</link>
    <description>&lt;p&gt;
The Linux kernel has a facility called the function graph tracer that
allows one to record the full call graph of everything the kernel does
with timestamping.  It's the big brother of the plain function tracer
and is implemented with similar facilities.
&lt;p&gt;
The plain function tracer merely intercepts calls to mcount (which
are emitted by the compiler at the beginning of every function when
building with the &quot;-pg&quot; option) and records the calling function as
well as other pieces of state (timestamp, parent function, etc.).
These trace entries are recorded into per-cpu lockless NMI-safe
buffers.
&lt;p&gt;
The function graph tracer takes things one step further by recording
function returns in the trace as well.  And it does this only using
mcount.  How does it do this?  The trick is that it uses a special
stub that it inserts into every call point, and it does so dynamically.
&lt;p&gt;
On function entry, mcount is invoked and the function graph tracer is
called like so:
&lt;pre&gt;
	mov		%i7, %g2
	mov		%fp, %g3
	save		%sp, -128, %sp
	mov		%g2, %o1
	mov		%g2, %l0
	mov		%g3, %l1
	mov		%l0, %o0
	mov		%i7, %o1
	call		prepare_ftrace_return
	 mov		%l1, %o2
	ret
	 restore	%o0, -8, %i7
&lt;/pre&gt;
&lt;p&gt;
prepare_ftrace_return() is a helper function that passes the mcount
caller program counter, that caller's parent program counter, and
the caller's frame pointer into the function graph tracer.
&lt;p&gt;
The tracer sees if the function should be traced and if there are
enough tracking stack slot entries available for the current thread.
The slots are used to remember the frame pointer and caller's parent
program counter so that it can be compared and restored later.  We
need to remember this program counter because we are going to change
it so that it calls a stub instead of actually returning from the
function.
&lt;p&gt;
The frame pointer is saved and used as a debugging tool so that we
can make sure that when we execute the stub, the state is as we expect
it to be and we can be sure that restoring the return address register
is safe.
&lt;p&gt;
prepare_ftrace_return gives its caller the address to put into the
caller's return address register.  This will be the special stub if
we decide to trace this function call or it will simply be the original
value (plus 8 to account for how we return from sparc functions by
jumping to &quot;return address register&quot; plus 8).
&lt;p&gt;
The stub looks like this:
&lt;pre&gt;
	save		%sp, -128, %sp
	call		ftrace_return_to_handler
	 mov		%fp, %o0
	jmpl		%o0 + 8, %g0
	 restore
&lt;/pre&gt;
ftrace_return_to_handler() validates the frame pointer on the top-most
stack slot of saved return address + frame pointer pairs.  If it matches
it records a function return trace entry (with timestamps, etc.) and
returns the function's original return address and then we'll jump to that
from the stub.
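&lt;p&gt;
The bookkeeping behind those two helpers can be modelled in a few lines
of ordinary C.  This is a toy model under stated assumptions, not the
ftrace code: all names and the SKETCH_STUB address are invented,
addresses are plain integers, and the sparc &quot;plus 8&quot; adjustment is
ignored.
&lt;pre&gt;<![CDATA[
```c
#include <stdint.h>

#define SKETCH_DEPTH 8
#define SKETCH_STUB  0x1000u	/* made-up address of the return stub */

struct sketch_ret_entry {
	uintptr_t ret;	/* original return address */
	uintptr_t fp;	/* frame pointer seen at function entry */
};

struct sketch_ret_stack {
	struct sketch_ret_entry slot[SKETCH_DEPTH];
	int top;
};

/* Entry side: remember the real return address and hand back the address
 * the caller should return to.  With no free slot the call is left alone. */
uintptr_t sketch_prepare_return(struct sketch_ret_stack *st,
				uintptr_t parent_ret, uintptr_t fp)
{
	if (st->top >= SKETCH_DEPTH)
		return parent_ret;
	st->slot[st->top].ret = parent_ret;
	st->slot[st->top].fp = fp;
	st->top++;
	return SKETCH_STUB;
}

/* Stub side: check the frame pointer, pop the slot and return the real
 * return address.  Returns 0 when the stack is empty or fp mismatches. */
uintptr_t sketch_return_to_handler(struct sketch_ret_stack *st, uintptr_t fp)
{
	if (st->top == 0 || st->slot[st->top - 1].fp != fp)
		return 0;
	st->top--;
	return st->slot[st->top].ret;
}
```
]]>&lt;/pre&gt;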
&lt;p&gt;
So if we're deep in the call chain of traced functions, and you were to
look at the backtrace, you'd see a continuous stack of return addresses
referencing the stub.  And as the functions return, the function graph
tracer resolves the return addresses to what they should be and then the
originally intended address is returned to.
&lt;p&gt;
Of course there is some non-trivial cost to all of this, in particular
rewriting the return address makes the cpu's return address stack
never hit so the function returns become very expensive.  But this
isn't something you have running constantly, you turn the tracer on
around a set of events of interest (e.g. while running a test
program).
&lt;p&gt;
Here's what a small snippet of a trace looks like with the function
graph tracer.  In this part of a capture we have a call to
&quot;run_local_timers&quot;:
&lt;pre&gt;
  64)               |          run_local_timers() {
  64)   2.637 us    |            hrtimer_run_queues();
  64)   2.527 us    |            raise_softirq();
  64)               |            softlockup_tick() {
  64)   2.967 us    |              __touch_softlockup_watchdog();
  64)   8.570 us    |            }
  64) + 23.953 us   |          }
&lt;/pre&gt;
The first number is the cpu number (this machine has 128 cpus, so yes
64 is not a typo).  The next field gives time deltas.  Finally we have
the call graph itself.
&lt;p&gt;
For the call graph a C-like syntax is used.  For leaf functions the
line ends with just a semicolon.  When a function calls other functions
it is closed off at its return by a closing brace.  And the sub-calls
are indented as needed.
&lt;p&gt;
Call latencies are expressed at the point for which we have the return.
So we'll see it on the lines for leaf functions (ending with semicolons)
and for closing braces.  But never on lines having opening braces.
&lt;p&gt;
When latencies exceed certain values a &quot;+&quot; (greater than 10usec) or &quot;!&quot;
(greater than 100usec) will be prepended to the time delta so expensive
operations can be easily seen.
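&lt;p&gt;
The marker rule is simple enough to write down directly.  A tiny
sketch, with a made-up function name and the thresholds taken from the
description above:
&lt;pre&gt;<![CDATA[
```c
/* Marker prepended to a time delta given in microseconds. */
char sketch_latency_mark(double usec)
{
	if (usec > 100.0)
		return '!';
	if (usec > 10.0)
		return '+';
	return ' ';
}
```
]]>&lt;/pre&gt;
&lt;p&gt;
So the 23.953 us total for run_local_timers() in the snippet gets a
plus sign, and the 8.570 us softlockup_tick() does not.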
&lt;p&gt;
There are several other powerful tools built on top of the mcount based
tracing hooks.  For example there is a stack tracer that monitors which
functions created the deepest stack frames.
&lt;p&gt;
For more information about using ftrace in general, check out the two
part series written by Steven Rostedt at LWN:
&lt;a href=&quot;http://lwn.net/Articles/365835/&quot;&gt;PART 1&lt;/a&gt; and
&lt;a href=&quot;http://lwn.net/Articles/366796/&quot;&gt;PART 2&lt;/a&gt;.
&lt;p&gt;
It's an incredibly powerful tool and framework.  In fact, in just
the last few days I've already found 3 bugs and anomalies by simply
scanning the traces and looking for nothing in particular.</description>
  </item>
  <item>
    <title>strlen(), oh strlen()...</title>
    <pubDate>Mon, 08 Mar 2010 17:09:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2010/03/08#strlen_1</link>
    <description>&lt;p&gt;
I've been going through the glibc sparc optimized assembler routines
to see if anything can be improved.  And I took a stab at seeing if
strlen() could be made faster.  Find first zero byte in string, pretty
simple right?
&lt;p&gt;
The first thing we have to discuss is the infamous trick coined by
Alan Mycroft, way back in 1987.  It allows one to check for the presence of
a zero byte in a word in 3 instructions.  There are 2 magic constants:
&lt;pre&gt;
#define MAGIC1		0x80808080
#define MAGIC2		0x01010101
&lt;/pre&gt;
If you're checking 64 bits at a time on a 64-bit system, simply expand
the above magic values to 64 bits.
&lt;p&gt;
Then, given a word the check becomes:
&lt;pre&gt;
	if ((val - MAGIC2) &amp; ~val &amp; MAGIC1)
		goto found_zero_byte_in_word;
&lt;/pre&gt;
Essentially we're subtracting MAGIC2 to induce underflow in each
byte that has the value zero in it.  Such underflows cause bit 7
(the 0x80 bit) to get set in that byte.  Then we want to see if bit 7
is set after subtraction in any byte where bit 7 wasn't set before
the subtraction.
&lt;p&gt;
To get the most parallelization on multi-issue cpus, we want to
compute this using something like:
&lt;pre&gt;
	tmp1 = val - MAGIC2;
	tmp2 = ~val &amp; MAGIC1;
	if (tmp1 &amp; tmp2)
		goto found_zero_byte_in_word;
&lt;/pre&gt;
to reduce the number of dependencies such that the computation
of tmp1 and tmp2 can occur in the same cpu cycle.
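&lt;p&gt;
As a compilable 32-bit sketch (the function name is made up, the
constants and the formula are the ones above):
&lt;pre&gt;<![CDATA[
```c
#include <stdint.h>

#define MAGIC1 0x80808080u
#define MAGIC2 0x01010101u

/* Nonzero iff at least one byte of val is zero. */
uint32_t sketch_has_zero_byte(uint32_t val)
{
	uint32_t tmp1 = val - MAGIC2;	/* does not depend on tmp2 */
	uint32_t tmp2 = ~val & MAGIC1;	/* does not depend on tmp1 */

	return tmp1 & tmp2;
}
```
]]>&lt;/pre&gt;
&lt;p&gt;
Note that a byte such as 0x80 does not trigger the test, because its
top bit was already set before the subtraction.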
&lt;p&gt;
Then there is all the trouble of getting the source buffer aligned
so we can do the fast loop comparing a word at a time.  The most
direct implementation is to read a byte at a time, checking for zero,
until the buffer address is properly aligned.  This is also the
slowest implementation.
&lt;p&gt;
The powerpc code in glibc has a better idea.  If dereferencing a
non-word-aligned byte at address 'x' is valid, so is reading the
word at 'x &amp; ~3' (or 'x &amp; ~7' on 64-bit).  This is because page
protection occurs on page boundaries, and x and 'x &amp; ~3' are on
the same page.
&lt;p&gt;
The only thing left to attend to is to make sure we don't match the
alignment pad bytes with zero.  This is solved by computing a mask
of 1's and writing those 1's into the word we read before we do
the Mycroft computation above.  In C it looks something like:
&lt;pre&gt;
	orig_ptr = ptr;
	align = (unsigned long) ptr &amp; 3;
	mask = ~0U &gt;&gt; (align * 8);
	ptr = (void *) ((unsigned long) ptr &amp; ~3UL);
	val = *ptr;
	val |= ~mask;
	if ((val - MAGIC2) &amp; ~val &amp; MAGIC1)
		goto found_zero_byte_in_word;
&lt;/pre&gt;
At which point we can fall into the main loop.
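&lt;p&gt;
Putting the pieces together, here is a hedged, portable sketch of the
whole idea for 32-bit words.  Several things in it are assumptions made
for illustration: the names sketch_strlen and load_be32 are invented,
and words are assembled byte by byte in big-endian order so the mask
behaves the same on any host, where a real implementation does a single
aligned load.  Like the trick it illustrates, it reads bytes before
the start of the string and past its terminator within the same aligned
word, which is fine at the machine level but outside what ISO C
promises.
&lt;pre&gt;<![CDATA[
```c
#include <stddef.h>
#include <stdint.h>

#define MAGIC1 0x80808080u
#define MAGIC2 0x01010101u

/* Assemble four bytes in big-endian order. */
static uint32_t load_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

size_t sketch_strlen(const char *s)
{
	uintptr_t addr = (uintptr_t)s;
	unsigned align = (unsigned)(addr & 3);
	const unsigned char *p = (const unsigned char *)(addr - align);
	uint32_t mask = 0xffffffffu >> (align * 8);
	uint32_t val = load_be32(p) | ~mask;	/* pad bytes become 0xff */
	unsigned i = 0;

	/* Main loop: Mycroft test on one word at a time. */
	while (!((val - MAGIC2) & ~val & MAGIC1)) {
		p += 4;
		val = load_be32(p);
	}

	/* Locate the zero byte, lowest address (most significant) first. */
	while ((val >> (24 - 8 * i)) & 0xff)
		i++;

	return (size_t)(p + i - (const unsigned char *)s);
}
```
]]>&lt;/pre&gt;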
&lt;p&gt;
Once we find the word containing a zero byte, we have to iteratively
look for where it is in order to compute the return value.  How to
schedule this is not trivial, and it's especially cumbersome on 64-bit
(where we have to potentially check 8 bytes as opposed to 4).
&lt;p&gt;
Anyways, let's analyze the 64-bit Sparc implementation I'm hacking on
at the moment.  I'm targeting UltraSPARC-III and Niagara2 for
performance analysis.  Simply speaking UltraSPARC-III can dual-issue
integer operations, and Niagara2 is single issue and predicts all
branches not taken (basically this means: minimize use of branches).
&lt;pre&gt;
davem_strlen:
	mov	%o0, %o1
	andn	%o0, 0x7, %o0

	ldx	[%o0], %o5
	and	%o1, 0x7, %g1
	mov	-1, %g5
&lt;/pre&gt;
Save away the original string pointer in %o1.  At the end we'll compute
the return value as &quot;%o1 - %o0&quot;.  Align the buffer pointer and load a word
as quickly as possible.  We load the first word early so that we can hide
the memory latency into all of the constant and mask formation we need to
do before we can make the Mycroft test.
&lt;p&gt;
%g5 holds the initial part of the mask computation (-1, which gets expanded
fully to 64-bits by this move instruction) and %g1 will have the shift
factor.
&lt;pre&gt;
	sethi	%hi(0x01010101), %o2
	sll	%g1, 3, %g1

	or	%o2, %lo(0x01010101), %o2
	srlx	%g5, %g1, %o3

	sllx	%o2, 32, %g1
	sethi	%hi(0x00ff0000), %g5
&lt;/pre&gt;
%o2 is going to hold the &quot;0x01&quot; expanded to 64-bits subtraction
magic value.  %o3 will first hold the initial word mask, and then
it will hold the &quot;0x80&quot; magic constant.  We can compute the
two 64-bit magic constants into registers in 5 instructions.
&lt;p&gt;
Pick either of the two constants, we choose the &quot;0x01&quot; here because
we'll need it first.  This is loaded first using &quot;sethi&quot;, &quot;or&quot;.
This gives us the lower 32-bits of the constant, then we shift up
a copy by 32-bits, then or that into the lower 32-bit copy to
compute the final value.  &quot;0x80&quot; is &quot;0x01&quot; shifted left by 7 bits
so a simple shift is all we need to load the other 64-bit constant.
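&lt;p&gt;
In C terms the constant formation looks roughly like this.  The function
names are invented for the sketch, and the comments map each step back
to the instructions described above.
&lt;pre&gt;<![CDATA[
```c
#include <stdint.h>

/* The "0x01" constant: sethi/or for the low half, then shift and merge. */
uint64_t sketch_magic01(void)
{
	uint64_t lo = 0x01010101u;	/* sethi + or: low 32 bits */

	return lo | (lo << 32);		/* sllx by 32, then or */
}

/* The "0x80" constant is the "0x01" constant shifted left by 7. */
uint64_t sketch_magic80(void)
{
	return sketch_magic01() << 7;
}
```
]]>&lt;/pre&gt;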
&lt;p&gt;
The &quot;0x00ff0000&quot; constant will be used while searching for the zero
byte in the final word.
&lt;p&gt;
Next, we mask the initial word and fall through into the main loop.
&lt;pre&gt;
	orn	%o5, %o3, %o5
	or	%o2, %g1, %o2

	sllx	%o2, 7, %o3
&lt;/pre&gt;
Mask in the pad bits using the mask computed in %o3.  Finish computation
of the 64-bit MAGIC2 (the &quot;0x01&quot; constant) in %o2, and finally put MAGIC1
(the &quot;0x80&quot; constant) into %o3.  We're
ready for the main loop:
&lt;pre&gt;
10:	add	%o0, 8, %o0

	andn	%o3, %o5, %g1
	sub	%o5, %o2, %g2

	andcc	%g1, %g2, %g0
	be,a,pt	%xcc, 10b
	 ldx	[%o0], %o5
&lt;/pre&gt;
This is a real pain to schedule because there are many dependencies.
But the &quot;andn&quot;, &quot;sub&quot;, &quot;andcc&quot; sequence is the Mycroft test, and
those first two instructions can execute in one clock cycle on
UltraSPARC-III.  The &quot;,a&quot; annul bit on the branch means that we
only execute the load in the branch delay slot if the branch is
taken.
&lt;p&gt;
Now we have the code that searches for where exactly the zero byte
is in the final word.
&lt;pre&gt;
	srlx	%o5, 32, %g1
	sub	%o0, 8, %o0
&lt;/pre&gt;
We over-advanced the buffer pointer in the main loop, so correct
that by subtracting 8.  Prepare a copy of the upper 32-bits of
the word into %g1.
&lt;pre&gt;
	andn	%o3, %g1, %o4
	sub	%g1, %o2, %g2

	add	%o0, 4, %g3
	andcc	%o4, %g2, %g0

	movne	%icc, %g1, %o5
	move	%icc, %g3, %o0
&lt;/pre&gt;
This is divide and conquer.  Instead of doing 8 byte compares, we
first see if the upper 32-bits have the zero byte.  We essentially
redo the Mycroft test on the upper 32-bits of the word.
&lt;p&gt;
If the upper 32-bits have the zero byte, we use %g1 for the comparisons.
Otherwise we retain %o5 for the subsequent comparisons and advance
the buffer pointer by 4 bytes.  This is what the final two conditional
move instructions are doing.  Note that these conditional moves use
'%icc', the 32-bit condition codes.
&lt;p&gt;
The astute reader may wonder why we just can't use the upper 32-bits
of the Mycroft computation we made in the main loop?  This doesn't work
because the underflows can carry and cause false positives in upper
bytes of the word.  For example, consider a value where bits 39 down
to 24 have hex value &quot;0x0100&quot;.  The Mycroft computation will yield
&quot;0x8080&quot; for those two bytes.  The real zero byte is the lower one, not the upper one.
So we can't merely use the upper 32-bits of the already computed 64-bit
Mycroft mask, we have to recompute it over 32-bits by hand.
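&lt;p&gt;
The false positive is easy to reproduce.  In this sketch (function
names invented here) the test word has a 0x01 byte sitting directly
above a 0x00 byte, straddling the 32-bit boundary:
&lt;pre&gt;<![CDATA[
```c
#include <stdint.h>

uint64_t sketch_mycroft64(uint64_t v)
{
	return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

uint32_t sketch_mycroft32(uint32_t v)
{
	return (v - 0x01010101u) & ~v & 0x80808080u;
}

/* 1 if the upper 32 bits of v really contain a zero byte. */
int sketch_upper_has_zero(uint64_t v)
{
	return sketch_mycroft32((uint32_t)(v >> 32)) != 0;
}
```
]]>&lt;/pre&gt;
&lt;p&gt;
For 0x4141410100414141 the 64-bit mask is 0x0000008080000000.  Its
upper half is nonzero even though the upper four bytes (41 41 41 01)
contain no zero, while the recomputed 32-bit test correctly says no.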
&lt;p&gt;
Now we're left with 32-bits to check for a zero byte, we make extensive
use of conditional moves to avoid branches:
&lt;pre&gt;
	mov	3, %g2
	srlx	%o5, 8, %g1

	andcc	%g1, 0xff, %g0
	move	%icc, 2, %g2

	andcc	%o5, %g5, %g0
	srlx	%o5, 24, %o5
	move	%icc, 1, %g2

	andcc	%o5, 0xff, %g0
	move	%icc, 0, %g2

	add	%o0, %g2, %o0
&lt;/pre&gt;
We check starting at the low byte up to the highest byte, because
the highest byte, if zero, takes priority.  We add the offset of
the zero byte to the buffer pointer.
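&lt;p&gt;
The same selection written as C, where each conditional expression
plays the role of one conditional move.  This is a sketch with an
invented function name, and it assumes the word is already known to
contain a zero byte.
&lt;pre&gt;<![CDATA[
```c
#include <stdint.h>

/* word holds four string bytes in big-endian order and is known to
 * contain a zero byte.  Returns the offset (0..3) of the first one. */
unsigned sketch_zero_offset(uint32_t word)
{
	unsigned off = 3;			/* low byte by default */

	off = ((word >> 8) & 0xff) ? off : 2;
	off = (word & 0x00ff0000u) ? off : 1;
	off = ((word >> 24) & 0xff) ? off : 0;	/* highest byte wins */
	return off;
}
```
]]>&lt;/pre&gt;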
&lt;p&gt;
Finally:
&lt;pre&gt;
	retl
	 sub	%o0, %o1, %o0
&lt;/pre&gt;
We compute the length and return from the routine.
&lt;p&gt;
Many many moons ago, in 1998, Jakub Jelinek and his friend Jan Vondrak
wrote the routines we use now on sparc.  And frankly it's very hard to
beat that code especially on multi-issue processors.
&lt;p&gt;
The powerpc trick to align the initial word helps us beat the existing
code for all the unaligned cases.  But for the aligned case the existing
code holds a slight edge.
&lt;p&gt;
So now I've been trimming cycles as much as possible in the new code
trying to reach the state where the aligned case executes at least as
fast as the existing code.  I'll check this work into glibc once I
accomplish that.
&lt;p&gt;
The Mycroft trick extends to other libc string routines.  For example
for 'memchr' you replicate the search character into all bytes of
a word, let's call it 'xor_mask' and in the inner loop you adjust
each word by using:
&lt;pre&gt;
	val ^= xor_mask;
&lt;/pre&gt;
Then use the Mycroft test as in strlen().  Another complication with
memchr, however, is the need to check the given length bounds.
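&lt;p&gt;
A sketch of that per-word check in C, with a made-up function name and
64-bit versions of the magic constants:
&lt;pre&gt;<![CDATA[
```c
#include <stdint.h>

/* Nonzero iff some byte of val equals c. */
uint64_t sketch_has_byte(uint64_t val, unsigned char c)
{
	uint64_t xor_mask = c * 0x0101010101010101ULL;	/* c in every byte */

	val ^= xor_mask;	/* bytes equal to c become zero */
	return (val - 0x0101010101010101ULL) & ~val & 0x8080808080808080ULL;
}
```
]]>&lt;/pre&gt;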
&lt;p&gt;
This can be done in one instruction by putting the far bounds into
your base pointer register (called '%top_of_buffer' below), then
using offsets starting at &quot;0 - total_len&quot; (referred to as
'%negative_len' below).
&lt;p&gt;
Then your inner loop can do something like:
&lt;pre&gt;
	ldx	[%top_of_buffer + %negative_len], %o5
	addcc	%negative_len, 8, %negative_len
	bcs	%xcc, len_exceeded
	 ...
&lt;/pre&gt;
We exit the loop when adding 8 bytes to the negative len causes an
overflow.
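&lt;p&gt;
Putting the xor mask and the negative offset together, a portable C
sketch might look like the one below.  It is not the sparc routine:
the name sketch_memchr is invented, it loads words with memcpy() instead
of an aligned ldx, and since C cannot see the carry flag it compares
the offset against -8 rather than branching on carry.  It also assumes
the length fits in a ptrdiff_t.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/* Mycroft test: nonzero if and only if some byte of v is zero. */
static int has_zero_byte(uint64_t v)
{
    return ((v - ONES) & ~v & HIGHS) != 0;
}

/* Hypothetical memchr-like routine, for illustration only. */
void *sketch_memchr(const void *buf, int c, size_t len)
{
    const unsigned char *top = (const unsigned char *)buf + len; /* far bound */
    uint64_t xor_mask = ONES * (unsigned char)c; /* search byte in every lane */
    ptrdiff_t neg = -(ptrdiff_t)len;             /* runs from -len up to 0 */

    /* Word loop: runs while at least 8 unread bytes remain. */
    while (neg <= -8) {
        uint64_t val;

        memcpy(&val, top + neg, sizeof(val));    /* alignment-safe load */
        val ^= xor_mask;                         /* matches become zero bytes */
        if (has_zero_byte(val))
            break;                               /* match is in these 8 bytes */
        neg += 8;
    }

    /* Byte loop: pins down the match or handles the last partial word. */
    for (; neg != 0; neg++)
        if (top[neg] == (unsigned char)c)
            return (void *)(top + neg);
    return NULL;
}
```
&lt;p&gt;
The point of the negative offset is that one register serves as both
the load index and the loop counter, so the bounds check costs nothing
beyond the add that the loop needs anyway.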
&lt;p&gt;
If you're interested in this kind of topic, bit twiddling tricks and
whatnot, you absolutely have to own a copy of &quot;Hacker's Delight&quot; by
Henry S. Warren, Jr.</description>
  </item>
  <item>
    <title>STT_GNU_IFUNC</title>
    <pubDate>Sun, 07 Feb 2010 15:46:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2010/02/07#stt_gnu_ifunc</link>
    <description>&lt;p&gt;
I've always wanted to work on support for STT_GNU_IFUNC symbols
on sparc.  This is going to solve a real problem distribution
makers have faced on sparc64 for quite some time.
&lt;p&gt;
&lt;center&gt;
&lt;img src=&quot;http://vger.kernel.org/~davem/nyc_soho_facades.jpg&quot;&gt;
&lt;/center&gt;
&lt;p&gt;
What is STT_GNU_IFUNC?
&lt;p&gt;
Well, normally a symbol is resolved by the dynamic linker based
upon information in the symbol table of the objects involved.
This is after taking into consideration things like symbol
visibility, where it is defined, etc.
&lt;p&gt;
The difference with STT_GNU_IFUNC is that the resolution of the
reference can be made based upon other criteria.  For example,
based upon the capabilities of the cpu we are executing on.
The most obvious place this would be very useful is in libc,
where you can pick the most optimized memcpy() implementation.
&lt;p&gt;
Normally the symbol table entry points to the actual symbol location,
but STT_GNU_IFUNC symbols point to the location of a &quot;resolver&quot;
function.  This function returns the symbol location that should
be used for references to this symbol.
&lt;p&gt;
So when the dynamic linker resolves a reference to a STT_GNU_IFUNC
type symbol &quot;foo&quot;, it calls the resolver function recorded in
the symbol table entry, and uses the return value as the resolved
address.
&lt;p&gt;
Simple example:
&lt;pre&gt;
void * memcpy_ifunc (void) __asm__ (&quot;memcpy&quot;);
__asm__(&quot;.type memcpy, %gnu_indirect_function&quot;);

void *
memcpy_ifunc (void)
{
  switch (cpu_type)
    {
  case cpu_A:
    return memcpy_A;
  case cpu_B:
    return memcpy_B;
  default:
    return memcpy_default;
    }
}
&lt;/pre&gt;
So, references to 'memcpy' will be resolved as determined by
the logic in memcpy_ifunc().
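&lt;p&gt;
If your GCC supports the ifunc function attribute (on an ELF system
with a glibc that handles STT_GNU_IFUNC), the same thing can be written
without the asm directives.  The sketch below is an assumption-laden
illustration: my_copy, resolve_my_copy and the two implementations are
names I invented, and the selection rule (pointer width) is just a
stand-in for a real cpu capability check.

```c
#include <stddef.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Two stand-in implementations of the same interface. */
static void *copy_bytewise(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;
    return dst;
}

static void *copy_builtin(void *dst, const void *src, size_t n)
{
    return __builtin_memcpy(dst, src, n);
}

/* Resolver: called when the my_copy reference is resolved.  Whatever
 * it returns becomes the address used for my_copy.  It runs very
 * early, so it should not depend on other library state. */
static copy_fn resolve_my_copy(void)
{
    return sizeof(void *) == 8 ? copy_builtin : copy_bytewise;
}

/* my_copy is emitted as an STT_GNU_IFUNC symbol pointing at the resolver. */
void *my_copy(void *dst, const void *src, size_t n)
    __attribute__((ifunc("resolve_my_copy")));
```
&lt;p&gt;
Callers just call my_copy() like any other function; the indirection
is paid once at resolution time, not on every call.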
&lt;p&gt;
These magic ifunc things even work in static executables.  How
is that possible?
&lt;p&gt;
First, even though the final image is static, the linker arranges to
still create PLT entries and dynamic sections for the STT_GNU_IFUNC
relocations.
&lt;p&gt;
Next, the CRT files for static executables walk through the relocations
in the static binary and resolve the STT_GNU_IFUNC symbols.
&lt;p&gt;
There are some thorny issues with respect to function pointer equality.  To make
that work, static references to STT_GNU_IFUNC symbols use the PLT address
whereas non-static references do not (they get fully resolved).
&lt;p&gt;
Back to the reason I was so eager to implement this.  On sparc we have
four different sets of optimized memcpy/memset implementations in
glibc (UltraSPARC-I/II, UltraSPARC-III, Niagara-T1, Niagara-T2).
Right now the distributions thus have to build glibc four times each
for 32-bit and 64-bit (for a total of 8 times).
&lt;p&gt;
With STT_GNU_IFUNC they will only need to build it once for 32-bit
and once for 64-bit.
&lt;p&gt;
I've just recently posted patches for full support of STT_GNU_IFUNC
symbols to the
&lt;a href=&quot;http://sourceware.org/ml/binutils/2010-02/msg00095.html&quot;&gt;
binutils
&lt;/a&gt;
and
&lt;a href=&quot;http://sourceware.org/ml/libc-alpha/2010-02/msg00005.html&quot;&gt;
glibc
&lt;/a&gt;
lists.
</description>
  </item>
  <item>
    <title>Beaux-Arts and kernel hacking...</title>
    <pubDate>Thu, 21 May 2009 02:42:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2009/05/21#beaux_arts</link>
    <description>&lt;p&gt;
My recent hobbies have included an intense study of New York City
architecture, and in particular the fascinating stories behind
the city's two most prominent train stations.  Those being
&lt;a href=&quot;http://www.nyc-architecture.com/MID/MID031.htm&quot;&gt;
Grand Central Terminal
&lt;/a&gt;
and the arguably infamous
&lt;a href=&quot;http://www.nyc-architecture.com/GON/GON004.htm&quot;&gt;
Pennsylvania Station
&lt;/a&gt;.
&lt;p&gt;
&lt;center&gt;
&lt;img src=&quot;http://vger.kernel.org/~davem/pennstation_concourse_scaled.jpg&quot;&gt;
&lt;br&gt;
McKim, Mead, and White's masterpiece between 7th and 8th Avenues, from 31st to 33rd Street.
&lt;/center&gt;
&lt;p&gt;
In the second half of the 19th century and on towards the first
half of the 20th century, any American architect worth his salt
studied at the Ecole des Beaux-Arts in Paris.
&lt;p&gt;
If you had a degree from that school, you were at the top of the
pile for selection on all of the interesting commissions of the time.
The school presented the student with a challenging and fast
paced curriculum.
&lt;p&gt;
Firstly, for these American students attending in Paris, the first
challenge was just getting in.  The entrance exam (of course) required
at least some proficiency in French.  Several of the most notable
American architects had to retake this entrance exam 5 or more times
before being able to pass.
&lt;p&gt;
Once accepted, the student was pressed to solve problems.  12 hours
were given to draft up a solution to a real architectural problem.
Then once the draft was accepted, the student had 2 weeks to flesh out
all of the details and present the final design.  All the while the
student's progress was critiqued by an established French architect
who oversaw a group of students.
&lt;p&gt;
We really don't have that kind of training for computer science
people.  It's not even science I would say.  This kind of training
does exist for pure mathematics, especially in France.
&lt;p&gt;
Envision a school where you're asked to draft up the design of a
compiler pass in 12 hours, then for two weeks you implement it, and
meanwhile Alfred Aho critiques your work.  This kind of place
simply doesn't exist.  (Yes I know Alfred teaches at Columbia
currently, so maybe this specific place does exist :-) but I maintain
that more generally such institutions do not exist.)
&lt;p&gt;
Open source development and &quot;throwing the masses of monkeys at
the problem&quot; seems to be a logical consequence of this, does it
not?
&lt;p&gt;
A formally trained Beaux-Arts architect and a room with a few drafters
could design something as insanely complicated as a huge
transportation hub in the middle of New York City.  And it would work,
as there would be no room for failure.  To me it seems
that someone similarly trained could do a complete operating system,
compiler, or similar large software engineering task.
&lt;p&gt;
&lt;a href=&quot;http://en.wikipedia.org/wiki/McKim,_Mead_%26_White&quot;&gt;
McKim, Mead, and White
&lt;/a&gt;
were three architects and a couple of drafters,
and yet they were able to complete works such as the Boston Public
Library, Pennsylvania Station, and the James Farley Post Office,
to name just a few.
&lt;p&gt;
So which is better, strict formal training and mentorship or open
source monkeys?  You decide!</description>
  </item>
  <item>
    <title>A Sparc JIT for ioquake3</title>
    <pubDate>Thu, 05 Mar 2009 08:15:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2009/03/05#quake3_jit</link>
    <description>&lt;p&gt;
I recently got DRM working on sparc64, and this means I had
to of course test it :-)
&lt;p&gt;
I played around with ioquake3 and it worked just fine.
This was in a way exciting since I have sunk many
hours of my life into this game on x86.
&lt;p&gt;
One aspect of quake3 is that it has a virtual machine.  You
can write a MOD for quake3 and replace pretty much any
aspect of the game outside of the rendering engine.  But
to keep MOD authors from eating people's home directories,
sending your password out to some rogue collection system,
and things of that nature the interfaces are tightly controlled
and the MOD code runs in a JIT'd VM.
&lt;p&gt;
The only way you can get into the JIT is to make a &quot;system call&quot;.
And the only way to get out is to either return or make such
a system call into another module.  The system call is the main
entrypoint into the module, it takes an integer command and 0
or more integer arguments.
&lt;p&gt;
All memory accesses done by the JIT'd code are masked so that it is
impossible for the JIT to touch memory outside of that allocated
explicitly for it by the VM.
&lt;p&gt;
Ben H. kiddingly said to me that I should write the Sparc JIT since
there is one for x86 and PowerPC already.  One should never kid about
such things...
&lt;p&gt;
It's pretty neat stuff, although the stack machine VM code output
by the LCC compiler they used is horribly inefficient.  Some
code for a function might look like:
&lt;pre&gt;
OPCODE[  OP_ENTER] IMM4[0x0000001c]	! ENTER function, 0x1c of stack
OPCODE[  OP_LOCAL] IMM4[0x00000024]	! PUSH stack offset 0x24 (first arg)
OPCODE[  OP_LOAD4] 			! LOAD from &quot;stack + 0x24&quot;
OPCODE[  OP_LEAVE] IMM4[0x0000001c]	! LEAVE function, return LOAD result
&lt;/pre&gt;
Operations push entries onto the &quot;register stack&quot;, and consume entries
on the top of that stack.  This might emit some sparc code like:
&lt;pre&gt;
	save	%sp, -64, %sp		! OP_ENTER
	sub	%g3, 0x1c, %g3
	add	%g3, 0x24, %l0		! OP_LOCAL
	and	%l0, %g5, %l0		! OP_LOAD4
	ld	[%g4 + %l0], %l0
	add	%g3, 0x1c, %g3		! OP_LEAVE
	ret
	restore	%l0, %g0, %o0
&lt;/pre&gt;
We use several fixed registers: &quot;%g3&quot; is the stack pointer,
&quot;%g5&quot; is the VM data segment offset mask, and &quot;%g4&quot; is the data segment
base address.  So every load or store address formation is
&quot;mask with %g5 and add to %g4&quot;.
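&lt;p&gt;
The same address formation, written out in C, is roughly the sketch
below.  The struct and function names are hypothetical and are not
taken from ioquake3.  I am also assuming the data segment size is a
power of two (so the mask is size minus one) and that the allocation
has a few bytes of slack so a 4-byte access at the last masked offset
stays inside it.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical VM state, for illustration only. */
struct vm_sketch {
    uint8_t *data_base;   /* plays the role of %g4 */
    uint32_t data_mask;   /* plays the role of %g5 (segment size - 1) */
};

/* Every access is "mask the offset, then add the base", so no value
 * the guest code computes can reach outside the data segment. */
uint32_t vm_load4(const struct vm_sketch *vm, uint32_t addr)
{
    uint32_t val;

    memcpy(&val, vm->data_base + (addr & vm->data_mask), sizeof(val));
    return val;
}

void vm_store4(const struct vm_sketch *vm, uint32_t addr, uint32_t val)
{
    memcpy(vm->data_base + (addr & vm->data_mask), &val, sizeof(val));
}
```
&lt;p&gt;
Note that a wild address is not rejected, it simply wraps around inside
the segment.  That costs one &quot;and&quot; per access instead of a compare
and branch.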
&lt;p&gt;
It's there in the ioquake3 repo right now and will be in the next
release.  There are lots of things that can be improved but it works
very well and most of the quake3 MODS I've tried (CPMA, UrbanTerror,
etc.) work.  I've also been playing the base game online extensively,
you know, for stress testing.</description>
  </item>
  <item>
    <title>NMI profiling on sparc64...</title>
    <pubDate>Sat, 29 Nov 2008 23:59:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2008/11/29#sparc64_nmi</link>
    <description>&lt;p&gt;
One thing that always has been suboptimal on sparc64 has been
the profiling with oprofile.
&lt;p&gt;
Yes, it worked, but we only had the dumb timer interrupt profiling
available.  The biggest loss in this is that IRQ disabled code
sequences do not get profiled.  This leads to &quot;clumps&quot; in the
profile, where all the code in an IRQ disabled sequence can show
up as one big hit at the point where IRQs get re-enabled.
&lt;p&gt;
This makes the profile often non-representative and, at worst,
completely unusable.
&lt;p&gt;
Say what you want about levelled interrupts, they do provide a level
of flexibility that can be useful.  Sparc64 chips have a PIL register:
you indicate the interrupt levels (out of 15) you want to block out
by writing the highest level to block out into the register.  So
writing zero enables all interrupt levels, and writing 15 blocks all
of them out.
&lt;p&gt;
Device interrupts work by interrupt vectors, which are not blocked by
the PIL mechanism.  These are processed quickly in a trap handler and
revectored into a PIL levelled interrupt by software.
&lt;p&gt;
Under Linux we use one PIL level for all of the device interrupts.
A few of the other PIL levels we use for specific SMP cross-call
types.
&lt;p&gt;
On UltraSPARC-III (cheetah) and later we (finally) have a profiling
counter overflow interrupt.  This arrives at PIL level 15.  So
naturally I had the idea to run the majority of the kernel with
interrupts disabled only up to PIL level 14.  The result is that we can use these
profile counter overflow interrupts to provide a pseudo-NMI
oprofile implementation.
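&lt;p&gt;
As a toy model of why this works (the names here are invented, this is
not kernel code): an interrupt is delivered only when its level is
above the value in the PIL register, so a kernel that never writes
more than 14 can always be interrupted at level 15.

```c
#include <stdbool.h>

enum {
    PIL_IRQS_OFF = 14,   /* highest value the kernel normally writes */
    PIL_PROFILE  = 15    /* level of the counter overflow interrupt */
};

/* An interrupt at 'level' gets through only if it is above 'pil'. */
bool pil_delivers(unsigned pil, unsigned level)
{
    return level > pil;
}
```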
&lt;p&gt;
This works quite well and is checked into the sparc-next-2.6 GIT
tree, so it will show up in 2.6.29.</description>
  </item>
  <item>
    <title>git stash</title>
    <pubDate>Wed, 05 Nov 2008 06:19:00 GMT</pubDate>
    <link>http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2008/11/05#git_stash</link>
    <description>&lt;p&gt;
Someone on IRC asked what was so cool about my
'mkcf' shell script and those command lines I
showed.
&lt;p&gt;
I'm so old skool that I didn't even know that something
as cool as &quot;git stash&quot; even existed.  Holy crap that's
a nice feature :-)
&lt;p&gt;
Yeah but these young whippersnappers don't know how
to build a git index by hand, har har har.
&lt;p&gt;
In other news, unless you live under a rock you know that Barack Obama
won the presidential election, but what you may not know is that he
only won because of a
&lt;a href=&quot;http://torvalds-family.blogspot.com/2008/11/black-and-white.html&quot;&gt;
last minute endorsement from Linus.
&lt;/a&gt;
&lt;p&gt;
Tee-hee.
</description>
  </item>
  </channel>
</rss>
