"Techniques that can have its behavior changed when the kernel is replaced" or "Earthquaky kernel interfaces" Arnaldo Carvalho de Melo acme@redhat.com Red Hat Inc. This document presents programming practices that when used can explain changes in application behavior when the OS kernel is updated to a new version. 1. sched_yield Lets start with sched_yield. This syscall is used by a thread trying to be "nice" to other threads since it thinks it is taking too much of the CPU time and should voluntarily give the other threads a chance to run. But this is the work of the scheduler, he can make way better decisions when and if there are actually other threads wanting to run. One common technique is to poll on a variable and if its value is not the one desired, call sched_yield and then keep doing this in a loop till the value gets to be the one desired. Locking. But this only makes sense in the SCHED_FIFO RT case. And when there are ready to run threads in the same priority. This is because on RT the scheduler can not preempt a thread that has the same priority even if others are wanting to run. The sched_yield behavior for non RT threads, where a thread can be preempted when consuming its time quantum, is dependent to the underlying OS scheduler implementation. The thread can go to the end of the combined runqueues, taking a long time to be scheduled again. Or if it is the only thread ready to run on its nice level it can not even take a short nap, getting rescheduled straight away. This is called burning CPU in a busy loop. Depending on sched_yield behavior is thus highly discouraged. There are other ways to do proper locking or to wait for a condition to be met. Posix Threads, aka pthreads, have abstractions for mutexes and condition variables that will provide more consistent behavior across kernel versions, when components such as the scheduler, or even others, can make assumptions tied to the previous implementation of such components to be invalidated. This in turn can lead to performance loss or other undesired changes in the behavior of some applications. The system may, for instance, have less time to process networking packets. This would lead to considerable performance loss that is difficult to diagnose when there were no significant changes in the networking components of the system. One example can be found in recent changes to the iperf networking benchmark tool, that when the Linux kernel switched to the CFS scheduler exhibited a 30% drop in performance. Source code inspection showed that it was using sched_yield in a loop to check for a condition to be met. After a fix was made the performance drop disappeared and CPU consumption went down from 100% to saturate a gigabit network to just 9%[1]. 2. TCP_NODELAY and Small Buffer Writes Being the most used transport protocol poses a fantastic challenge for TCP to meet many different needs. Several heuristics were introduced over time as new application use cases and new hardware features appeared and as well kernel architecture optimizations were implemented. For instance, TCP delays sending small buffers, trying to coalesce several before generating a network packet. This normally is very effective, but in some cases we are reminded that this is indeed a heuristic. And being an heuristic, it has a place in this document, as it ends up being an earthquaky API, one that can have its behavior changed when underlying OS components change and thus should be used with great care. 
2. TCP_NODELAY and Small Buffer Writes

Being the most used transport protocol poses a fantastic challenge for TCP: it has to meet many different needs. Several heuristics were introduced over time, as new application use cases and new hardware features appeared and as kernel architecture optimizations were implemented. For instance, TCP delays sending small buffers, trying to coalesce several of them before generating a network packet. This is normally very effective, but in some cases we are reminded that this is indeed a heuristic. And being a heuristic, it has a place in this document, as it ends up being an earthquaky API, one that can have its behavior changed when underlying OS components change, and thus should be used with great care.

Applications that want lower latency for the packets they send are harmed by this TCP heuristic, so there is a knob for applications that do not want this algorithm to be used: a socket option called TCP_NODELAY. Applications can set it through the setsockopt sockets API:

    int one = 1;
    setsockopt(descriptor, SOL_TCP, TCP_NODELAY, &one, sizeof(one));

But for this to be used effectively, applications must avoid writing small, logically related buffers one at a time, as this will make TCP send these multiple buffers as individual packets. Moreover, TCP_NODELAY can interact with receiver optimization heuristics, such as ACK piggybacking, and result in poor overall performance.

If applications have several buffers that are logically related and should be sent as one packet, they will achieve better latency and performance by using one of the following techniques:

- If the buffers are obtained from libraries or from hardware, it may be possible to build the logical packet contiguously in memory and hand it to TCP in one go, on a socket configured with TCP_NODELAY.

- Build an I/O vector with the logically related, but not contiguous in memory, buffers and pass it to the kernel in a single writev call, again on a socket configured with TCP_NODELAY (see the sketch just before the loopback measurements below).

Then there is another, less known TCP socket option that is present in a similar fashion in several OS kernels and that in Linux is called TCP_CORK. Setting TCP_CORK to 1, aka "corking the socket", using:

    int one = 1;
    setsockopt(descriptor, SOL_TCP, TCP_CORK, &one, sizeof(one));

tells TCP to wait for the application to remove the cork before sending any packets, just appending the buffers it receives to the socket's in-kernel buffers. This allows applications to build a packet in kernel space, something that can be required when using different libraries that provide abstractions for protocol layers. One example is the SMB networking protocol, where headers are sent together with a data payload, and better performance is obtained if the header and payload are bundled into as few packets as possible. When the logical packet has been built in the kernel by the various components of the application, something the kernel has no easy (or even possible) way to identify on the application's behalf, we just tell TCP to remove the cork using:

    int zero = 0;
    setsockopt(descriptor, SOL_TCP, TCP_CORK, &zero, sizeof(zero));

This makes TCP send the accumulated logical packet right away, without waiting for any further data from the application, which it would otherwise do to make full use of the maximum network packet size available.

To fully understand what kind of performance impact the use of these techniques can have on your application, we provide two simple applications [2] that exercise these socket options. The server just waits for packets of 30 bytes and then sends a 2-byte packet in response. To start it you must pass the TCP port and the number of packets it should process, 10,000 in these tests:

    ./tcp_nagle_server 5001 10000

The server doesn't need to set any socket option, as the options discussed so far apply to the sender of small packets, which in this example is the client. The client can be run without setting any of these options, with TCP_NODELAY, or with TCP_CORK. In all cases it sends 15 two-byte buffers and then waits for a response from the server.
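Here is a minimal sketch of the writev technique mentioned above: a header and a payload that live in separate buffers are handed to TCP as one logical packet. The function and parameter names are illustrative, not part of any existing API:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /* 'descriptor' is a connected TCP socket; header and payload are the
     * logically related, non-contiguous buffers to be sent as one packet. */
    ssize_t send_logical_packet(int descriptor,
                                void *header, size_t header_len,
                                void *payload, size_t payload_len)
    {
        int one = 1;
        struct iovec iov[2] = {
            { .iov_base = header,  .iov_len = header_len  },
            { .iov_base = payload, .iov_len = payload_len },
        };

        /* normally done once, right after connect(); IPPROTO_TCP is the
         * portable spelling of SOL_TCP used in the snippets above */
        setsockopt(descriptor, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

        /* one system call, one logical packet handed to TCP */
        return writev(descriptor, iov, 2);
    }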
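And here is a rough sketch of what such a client does in each of the three modes; it is a simplification written for this document, not the actual tcp_nagle_client.c source:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <unistd.h>

    /* mode: 0 = plain, 1 = TCP_NODELAY, 2 = TCP_CORK; 'fd' is a connected socket */
    void send_logical_packets(int fd, int mode, int npackets)
    {
        char buf[2] = { 0, }, response[2];
        int one = 1, zero = 0, i, j;

        if (mode == 1)  /* disable the small buffer coalescing heuristic */
            setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

        for (i = 0; i < npackets; i++) {
            if (mode == 2)  /* cork: accumulate the writes in the kernel */
                setsockopt(fd, IPPROTO_TCP, TCP_CORK, &one, sizeof(one));

            for (j = 0; j < 15; j++)  /* 15 x 2 bytes = one 30-byte logical packet */
                write(fd, buf, sizeof(buf));

            if (mode == 2)  /* uncork: the logical packet is complete, send it now */
                setsockopt(fd, IPPROTO_TCP, TCP_CORK, &zero, sizeof(zero));

            read(fd, response, sizeof(response));  /* wait for the 2-byte reply */
        }
    }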
Let's now try this over the loopback interface, using the three possibilities:

    # Not using TCP_NODELAY nor TCP_CORK
    $ ./tcp_nagle_client localhost 5001 10000
    10000 packets of 30 bytes sent in 400129.781250 ms: 0.749757 bytes/ms

This is the baseline: TCP coalesces the writes and has to wait a bit to check whether the application has more data that could optimally fit in a network packet.

    $ ./tcp_nagle_client localhost 5001 10000 no_delay
    10000 packets of 30 bytes sent in 1649.771240 ms: 181.843399 bytes/ms using TCP_NODELAY

Here TCP was told not to wait but to send the buffers right away, disabling the algorithm that coalesces small packets. This improved performance by a huge factor, but caused a flurry of network packets to be sent for each logical packet.

    $ ./tcp_nagle_client localhost 5001 10000 cork
    10000 packets of 30 bytes sent in 850.796448 ms: 352.610779 bytes/ms using TCP_CORK

This halves the time needed to send the same number of logical packets, because TCP doesn't send as many small packets, instead coalescing full logical packets in its socket buffers and then sending fewer network packets.

As we can see, using TCP_CORK is clearly the best technique in this scenario. It allows the application to precisely convey the information that a logical packet is finished and thus must be sent without any delay. TCP's long accumulated heuristics are not needed, as it no longer has to guess what the application will do next.

If your application sends bulk data read from a file, you may consider using TCP_CORK together with sendfile. Information about sendfile is available in the system manual pages, accessible by running:

    man sendfile

References:

[1] "Re: Network slowdown due to CFS", Ingo Molnar,
    http://lkml.org/lkml/2007/9/26/132

[2] TCP nagle sample applications,
    http://oops.ghostprotocols.net:81/acme/tcp_nagle_client.c
    http://oops.ghostprotocols.net:81/acme/tcp_nagle_server.c