A two-day Networking Track will be featured at this year's
Linux Plumbers Conference
(LPC) in Vancouver, British Columbia, Canada.
It will run the first two days of LPC, November 13-14. The track will consist of a
series of talks with a clear focus on recent topics in Linux kernel networking, including
a keynote from the Linux kernel networking maintainer, David S. Miller.
The Networking Track will be open to all LPC attendees. There is no additional
registration required. This is a great occasion for Linux networking developers to
meet face to face and discuss ongoing developments.
Schedule
To view the schedule, please check the main LPC website.
XDP is a framework for running BPF programs in the NIC driver to allow
decisions about the fate of a received packet at the earliest point in
the Linux networking stack. For the most part the BPF programs rely on
maps to drive packet decisions, maps that are managed for example by a
userspace agent. This architecture has implications for how the system is
configured, monitored and debugged.
An alternative approach is to make the kernel networking tables
accessible by BPF programs. This approach allows the use of standard
Linux APIs and tools to manage networking configuration and state while
still achieving the higher performance provided by XDP. An example of
providing access to kernel tables is the recently added helper to allow
IPv4 and IPv6 FIB (and nexthop) lookups in XDP programs. Routing suites
such as FRR manage the FIB tables, and the XDP packet path benefits by
automatically adapting to the FIB updates in real time. While a huge
first step, a FIB lookup alone is not sufficient for general networking
deployments.
This talk discusses the advantages of making kernel tables available to
XDP programs to create a programmable packet pipeline, what features
have been implemented as of October 2018, key missing features, and
current challenges.
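As a rough illustration of the FIB-driven forwarding idea described above, the following Python sketch models a longest-prefix-match lookup feeding an XDP-style verdict. It is a userspace toy under stated assumptions, not kernel code: the `FIB` table contents, the `fib_lookup` and `xdp_forward` names, and the verdict strings are all hypothetical stand-ins for the kernel FIB and the in-kernel helper.

```python
import ipaddress

# Hypothetical routing table standing in for the kernel FIB:
# (prefix, nexthop, egress device). In a real deployment a routing
# suite such as FRR would maintain these entries.
FIB = [
    ("0.0.0.0/0", "192.0.2.1", "eth0"),
    ("10.0.0.0/8", "192.0.2.2", "eth1"),
    ("10.1.0.0/16", "192.0.2.3", "eth2"),
]


def fib_lookup(dst, fib=FIB):
    """Return (nexthop, device) for the longest matching prefix, or None."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, nexthop, dev in fib:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, nexthop, dev)
    return None if best is None else (best[1], best[2])


def xdp_forward(dst, fib=FIB):
    """Toy fast-path decision: redirect on a FIB hit, else fall back to the stack."""
    hit = fib_lookup(dst, fib)
    if hit is None:
        return ("XDP_PASS", None)      # let the regular stack handle it
    return ("XDP_REDIRECT", hit[1])    # forward out of the selected device
```

Because the fast path consults the same table that standard tools manage, a route change is picked up on the next lookup without a separate agent pushing state into BPF maps.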
Linux bridge is deployed on hosts, hypervisors, container OSes and, in
recent years, on data center switches. It is complete in its
feature set with forwarding, learning, proxy and snooping functions.
It can bridge Layer-2 domains between VMs, containers, racks, PODs
and between data centers, as seen with Ethernet Virtual Private
Networks [1, 2]. With Linux bridge deployments moving up the rack, it
is now bridging larger Layer-2 domains, which brings scale challenges.
The bridge forwarding database can scale to thousands of entries on a
data center switch with hardware acceleration support.
In this paper we discuss performance and operational challenges with
large-scale bridge forwarding databases (FDBs) and solutions to address
them. We will discuss solutions such as FDB dst port failover for faster
convergence, a faster API for FDB updates from the control plane, and
reducing the number of FDB dst ports with lightweight tunnel endpoints
when bridging over a tunneling solution (e.g., VXLAN).
Although discussed in the context of the deployment scenarios below, most
of these solutions are generic and can be applied to all bridge use cases:
Multi-chassis link aggregation scenarios where Linux bridge is part
of the active-active switch redundancy solution
Ethernet VPN solutions where the Linux bridge forwarding database is
extended to reach Layer-2 domains over a network overlay like VXLAN
This talk is a continuation of the initial XDP HW-based hints work presented
at NetDev 2.1 in Seoul, South Korea.
It will start by showcasing new prototypes that allow an XDP program
to request required HW-generated metadata hints from a NIC. The talk will
show how the hints are generated by the NIC and what the performance
characteristics are for various XDP applications. We also want to demonstrate how
such metadata can be helpful for applications that use AF_XDP sockets.
The talk will then discuss upstreaming plans and aim to generate
more discussion around implementation details, programming flows, etc., with
the larger audience from the community.
Port mirroring is one of the most common network troubleshooting
techniques. SPAN (Switch Port Analyzer) allows a user to send a copy
of the monitored traffic to a local or remote device using a sniffer
or packet analyzer. RSPAN is similar, but sends and receives traffic
on a VLAN. ERSPAN extends the port mirroring capability from Layer 2
to Layer 3, allowing the mirrored traffic to be encapsulated in an
extension of the GRE (Generic Routing Encapsulation) protocol and sent
through an IP network. In addition, ERSPAN carries configurable
metadata (e.g., session ID, timestamps), so that the packet analyzer
has a better understanding of the packets.
ERSPAN for IPv4 was added to the Linux kernel in 4.14, and for IPv6 in
4.16. The implementation includes both transmission and reception and
is based on the existing ip_gre and ip6_gre kernel modules. As a
result, Linux today can act as an ERSPAN traffic source sending the
ERSPAN mirrored traffic to the remote host, or an ERSPAN destination
which receives and parses the ERSPAN packets generated from Cisco or
other ERSPAN-capable switches.
We've added both native tunnel support and metadata-mode tunnel
support. In this paper, we demonstrate three ways to use the ERSPAN
protocol. First, for Linux users, using iproute2 to create a native
tunnel net device. Traffic sent to the net device will be
encapsulated with the protocol header accordingly, and traffic matching
the protocol configuration will be received from the net device.
Second, for eBPF users, using iproute2 to create a metadata-mode ERSPAN
tunnel. With the eBPF TC hook and eBPF tunnel helper functions, users can
read/write the ERSPAN protocol's fields at finer granularity. Finally,
for Open vSwitch users, using the netlink interface to create a switch
and programmatically parse, look up, and forward the ERSPAN packets
based on flows installed from userspace.
In this talk we describe our experiences in evaluating DC-TCP. Preliminary testing with Netesto uncovered issues with our NIC that affected fairness between flows, as well as bugs in the DC-TCP code path in Linux that resulted in RPC tail latencies of up to 200ms.
Once we fixed those issues, we proceeded to test in a 6-rack mini cluster running some of our production applications. This testing demonstrated very large decreases in packet discards (12x to 1000x) at the cost of higher CPU utilization. In addition to describing the issues and fixes, we provide detailed experimental results, explore the causes of the higher CPU utilization, and discuss partial solutions to this issue.
Among the various ways of using eBPF, OVS has been exploring the power
of eBPF in three: (1) attaching eBPF to TC, (2) offloading a subset of
processing to XDP, and (3) by-passing the kernel using AF_XDP.
Unfortunately, as of today, none of the three approaches satisfies the
requirements of OVS. In this presentation, we'd like to share the
challenges we faced and the lessons we learned, and to seek feedback from
the community on future direction.
Attaching eBPF to TC started first with the most aggressive goal: we
planned to re-implement all the features of the OVS kernel datapath
under net/openvswitch/* in eBPF code. We worked around a couple of
limitations: for example, the lack of TLV support led us to redefine a
binary kernel-user API using a fixed-length array, and without a
dedicated way to execute a packet, we created a dedicated device for
user-to-kernel packet transmission, with a different BPF program
attached to handle the packet execute logic. Currently, we are working on
connection tracking. Although a simple eBPF map can achieve the basic
operations of conntrack table lookup and commit, how to handle NAT,
(de)fragmentation, and ALG is still under discussion.
One layer below TC is XDP (eXpress Data Path), a much
faster layer for packet processing, but with almost no extra packet
metadata and limited BPF helper support. Depending on the complexity
of flows, OVS can offload a subset of its flow processing to XDP when
feasible. However, the fact that XDP supports fewer helper functions
implies that either 1) only a very limited number of flows are eligible
for offload, or 2) more flow processing logic needs to be done in
native eBPF.
AF_XDP represents another form of XDP, with a socket interface for
control plane and a shared memory API for accessing packets from
userspace applications. OVS today has another full-fledged datapath
implementation in userspace, called dpif-netdev, used by the DPDK
community. By treating AF_XDP as a fast packet-I/O channel, the
OVS dpif-netdev can support almost all existing features. We are
working on building the prototype and evaluating its performance.
This talk is divided into two parts. First, we present kTLS, the current kernel's
sockmap BPF architecture for L7 policy enforcement, and the kernel's ULP and
strparser frameworks, which are utilized by both in order to hook into socket callbacks
and determine message boundaries for subsequent processing.
We further elaborate on the challenges we face when trying to combine kTLS with the
power of BPF for the eventual goal of allowing in-kernel introspection and policy
enforcement of application data before encryption. Among other things, this includes a
discussion on various approaches to address the shortcomings of the current ULP layer,
optimizations for strparser, and the consolidation of scatter/gather processing for
kTLS and sockmap as well as future work on top of that.
UDP is a popular foundation for new protocols. It is available across
operating systems without superuser privileges and widely supported
by middleboxes. Shipping protocols in userspace on top of
a robust UDP stack allows for rapid deployment, experimentation
and innovation of network protocols.
But implementing protocols in userspace has limitations. The
environment lacks access to features like high resolution timers
and hardware offload. Transport cost can be high: the cycle count of
transferring large payloads with UDP can be up to 3x that of TCP.
In this talk we present recent and ongoing work, both by the authors
and others, on improving UDP for content delivery.
UDP segmentation offload amortizes transmit stack traversal by
sending as many as 64 segments as one large fused packet.
The kernel passes this through the stack as one datagram, then
splits it into multiple packets and replicates their network and
transport headers just before handing to the network device.
Some devices can offload segmentation for exact multiples of
segment size. We discuss how partial GSO support combines the
best of software and hardware offload and evaluate the benefits of
segmentation offload over standard UDP.
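To make the segmentation step above concrete, here is a small Python model of what happens just before the fused datagram reaches the device: the payload is cut into segment-sized chunks and each chunk receives its own copy of the transport header. This is only a sketch under stated assumptions. The function names, the default port numbers, and the zeroed checksum are illustrative choices, and the network header replication is omitted for brevity.

```python
import struct

UDP_HDR_LEN = 8
MAX_SEGMENTS = 64  # upper bound quoted in the abstract


def udp_header(sport, dport, payload_len):
    """Build a UDP header; the checksum is left at 0 to keep the sketch short."""
    return struct.pack("!HHHH", sport, dport, UDP_HDR_LEN + payload_len, 0)


def segment(payload, gso_size, sport=4000, dport=5000):
    """Split one large payload into packets of at most gso_size payload bytes.

    Every packet gets a replicated header whose length field matches its own
    chunk; only the last chunk may be shorter than gso_size.
    """
    if gso_size <= 0:
        raise ValueError("gso_size must be positive")
    chunks = [payload[i:i + gso_size] for i in range(0, len(payload), gso_size)]
    if len(chunks) > MAX_SEGMENTS:
        raise ValueError("too many segments for one fused packet")
    return [udp_header(sport, dport, len(c)) + c for c in chunks]
```

For example, `segment(b"x" * 3000, 1200)` yields three packets carrying 1200, 1200 and 600 payload bytes. The short final segment is the case that hardware limited to exact multiples of the segment size cannot handle on its own.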
With these large buffers, MSG_ZEROCOPY becomes effective at
removing the cost of copying in sendmsg, often the largest
single line item in these workloads. We extend this to UDP and
evaluate it on top of GSO.
Bursting too many segments at once can cause drops and retransmits.
SO_TXTIME adds a release time interface which allows offloading of
pacing to the kernel, where it is both more accurate and cheaper.
We will look at this interface and how it is supported by queuing
disciplines and hardware devices.
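The release-time idea can be illustrated with a short calculation: instead of bursting all segments at once, the sender assigns each one a transmit timestamp spaced according to a target rate and leaves the waiting to the kernel. The Python sketch below only computes such timestamps. The function name and parameters are hypothetical, and no socket option is actually set.

```python
def release_times_ns(segment_sizes, rate_bps, start_ns=0):
    """Return one release timestamp (in ns) per segment for a target pacing rate.

    Each segment is released once the previous one has had time to drain
    at rate_bps, so the burst is spread evenly over time.
    """
    if rate_bps <= 0:
        raise ValueError("rate_bps must be positive")
    times = []
    t = start_ns
    for size in segment_sizes:
        times.append(t)
        # Serialization time of this segment at the target rate, in ns.
        t += size * 8 * 1_000_000_000 // rate_bps
    return times
```

At 10 Mbit/s, three 1250-byte segments are spaced one millisecond apart, so `release_times_ns([1250, 1250, 1250], 10_000_000)` returns `[0, 1000000, 2000000]`.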
Finally, we look at how these transmit savings can be extended to
the forwarding and receive paths through the complement of GSO,
GRO, and local delivery of fused packets.
Today every packet reaching Facebook's network is processed by an XDP-enabled application. We have been using XDP for more than 1.5 years, and this talk is about the evolution of XDP and BPF driven by our production needs. I am going to talk about the history of changes in core BPF components and show why and how they were made, what performance improvements we got (with synthetic and real-world data), and how they were implemented. I am also going to talk about the issues and shortcomings of BPF/XDP that we have found during our operations, as well as some gotchas and corner cases. At the end we are going to discuss what is still missing and which parts could be improved.
Topics and areas of existing BPF/XDP infrastructure which are going to be covered in this talk:
Why helpers such as bpf_adjust_head/bpf_adjust_tail have been added
Unit testing and microbenchmarking with bpf_prog_test_run: how to add test coverage for your BPF program and track regressions (we are going to cover how Spectre affected the BPF kernel infrastructure and what tweaks have been made to get some performance back)
How map-in-map helps us to scale and make sure that we don't waste memory
NUMA aware allocation for BPF maps
Inline lookups for BPF arrays/map-in-map
Lessons which we have learned during operation of XDP:
BPF instruction counts vs complexity
How to attach more than one XDP program to the interface
When LLVM and verifier are not the same: some tricks to force LLVM to generate proper BPF
We will briefly discuss HW limitations: NIC bandwidth vs. packets-per-second performance
AF_XDP is a new socket type for raw frames to be introduced in 4.18
(in linux-next at the time of writing). The current code base offers
throughput numbers north of 20 Mpps per application core for 64-byte
packets on our system; however, there are a lot of optimizations that
could be performed in order to increase this even further. The focus
of this paper is the performance optimizations we need to make in
AF_XDP to get it to perform as fast as DPDK.
We present optimizations that fall into two broad categories: ones that
are seamless to the application and ones that require additions to
the uapi. In the first category we examine the following:
Loosen the requirement for having an XDP program. If the user does
not need an XDP program and there is only one AF_XDP socket bound to
a particular queue, we do not need an XDP program. This should cut
out quite a number of cycles from the RX path.
Wire up busy poll from user space. If the application writer is
using epoll() and friends, this has the potential benefit of
removing the coherency communication between the RX (NAPI) core and
the application core as everything is now done on a single
core. Should improve performance for a number of use cases. Maybe it
is worth revisiting the old idea of threaded NAPI in this context
too.
Optimize for high instruction cache usage through batching, as has
been explored in, for example, Cisco's VPP stack and by Edward Cree in
his net-next RFC "Handle multiple received packets at each stage".
In the uapi extensions category we examine the following
optimizations:
Support a new mode for NICs with in-order TX completions. In this
mode, the completion queue would not be used. Instead the
application would simply look at the pointer in the TX queue to see
if a packet has been completed. In this mode, we do not need any
backpressure between the completion queue and the TX queue and we
do not need to populate or publish anything in the completion queue
as it is not used. Should improve the performance of TX for in-order
NICs significantly.
Introduce the "type-writer" model where each chunk can contain
multiple packets. This is the model that e.g., Chelsio has in its
NICs. But experiments show that this mode also can provide better
performance for regular NICs as there are fewer transactions on the
queues. Requires a new flag to be introduced in the options field of
the descriptor.
With these optimizations, we believe we can reach our goal of close to
40 Mpps of throughput for 64-byte packets in zero-copy mode. Full
analysis with performance numbers will be presented in the final
paper.
SCTP is a transport protocol, like TCP and UDP, originating from the IETF SIGTRAN
Working Group in the early 2000s with the initial objective of
supporting the transport of PSTN signalling over IP networks. It featured
multi-homing and multi-stream from the beginning, and since then there
have been a number of improvements that help it serve other purposes too,
such as support for Partial Reliability and Stream Scheduling.
Linux SCTP arrived late and then stagnated. It lagged behind the
released RFCs, was far behind other systems such as BSD, and
suffered from performance problems. Over the past 2 years, we
have been dedicated to closing these gaps and have made many
improvements. All the features from released RFCs are now fully
supported in Linux, and work on some from draft RFCs is already
under way. We have also seen a clear improvement in performance
in various scenarios.
In this talk we will first do a quick review on SCTP basics, including:
Background: Why SCTP is used for PSTN Signalling Transport, why other
applications are using or will use SCTP.
Architecture: The general SCTP structures and procedures implemented in
Linux kernel.
VS TCP/UDP: An overview of functions and applicability of SCTP, TCP and
UDP.
We will then go through the improvements that were made in the past 2 years,
including:
SCTP-related projects in Linux: Other than kernel part, there are also
lksctp-tools, sctp-tests, tahi-sctp, etc.
Features implemented lately: RFC ones like Stream Scheduling, Message
Interleaving, Stream Reconfig, Partially Reliable Policy, and many
CMSGs, SndInfos, Socket APIs.
Improvements made recently: Big patchsets like SCTP Offload, Transport
Hashtable, SCTP Diag and Full SELinux support.
VS BSD: We will take a look at the difference between Linux and BSD now
regarding SCTP. You will be surprised to see that we've gone further
than other systems.
We will finish by reviewing a list of what is on our radar as well as next
steps, like:
Ongoing features: SCTP NAT and SCTP CMT, two big and important features, are
in progress and already taking form, and work on more performance improvements
in the kernel has also started.
Code refactor: A new congestion framework will be introduced, which will
make it easier to extend SCTP with more congestion control algorithms.
Hardware support: HW CRC Checksum and GSO will definitely make performance
better, for which a new segment logic for both .segment and HW that works
for SCTP chunks is needed.
RFC docs improvements: We believe that more extensions and revisions will
make SCTP more widespread.
Given its power and complexity, SCTP is bound to face many challenges
and threats, but we believe that we have made, and will continue to make, Linux
SCTP better not only than SCTP on other systems but also than other transport protocols.
Please join us; Linux SCTP needs your help too!
eBPF (extended Berkeley Packet Filter) has been shown to be a flexible
kernel construct used for a variety of use cases, such as load balancing,
intrusion detection systems (IDS), tracing and many others. One such
emerging use case revolves around the proposal made by William Tu for
the use of eBPF as a data path for Open vSwitch. However, there are
broader switching use cases developing around the use of eBPF capable
hardware. This talk is designed to explore the bottlenecks that exist in
generalising the application of eBPF further to both container switching as
well as physical switching.
Topics that will be covered include proposals for container isolation through
the use of features such as programmable RSS, the viability of physical
switching using eBPF capable hardware as well as integrations with other
subsystems or additional helper functions which may improve the possible
functionality.
Over the last 10 years the world has seen NICs go from single port,
single netdev devices, to multi-port, hardware switching, CPU/NFP
having, FPGA carrying, hundreds of attached netdev providing,
behemoths. This presentation will begin with an overview of the
current state of filtering and scheduling, and the evolution of the
kernel and networking hardware interfaces. (HINT: it's a bit of a
jungle we've helped grow!) We'll summarize the different kinds of
networking products available from different vendors, and show the
workflows of how a user can use the network hardware
offloads/accelerations available and where there are still gaps. Of
particular interest to us is how to have a useful, generic hardware
offload supporting infrastructure (with seamless software fallback!)
within the kernel, and we'll explain the differences between deploying
an eBPF program that can run in software, and one that can be
offloaded by a programmable ASIC based NIC. We will discuss our
analysis of the cost of an offload, and when it may not be a great
idea to do so, as hardware offload is most useful when it achieves the
desired speed and requires no special software (kernel changes). Some
other topics we will touch on: the programmability exposed by smart
NICs is more than that of a data plane packet processing engine and
hence any packet processing programming language such as eBPF or P4
will require certain extensions to take advantage of the device
capabilities in a holistic way. We'll provide a look into the future
and how we think our customers will use the interfaces we want to
provide both from our hardware, and from the kernel. We will also go
over the matrix of most important parameters that are shaping our HW
designs and why.
iptables has been the typical tool for creating firewalls on Linux hosts. We have used it at Facebook for setting up host firewalls on our servers across a variety of tiers.
In this proposal, we introduce an eBPF / XDP based firewall solution which we use for packet filtering and which has parity with our iptables implementation. We discuss various aspects of this. Following is a brief summary of these aspects, which we will detail further in the paper / presentation.
Design and Implementation:
We use BPF Tables (maps, lpm tries, and arrays) to match for appropriate packet header contents
The heart of the firewall is an eBPF filter which parses a packet and does lookups against all relevant maps, collecting the matching values. A logical rule set is applied to these collected values. This logical set reads much like a human-readable, high-level firewall policy. With iptables rules, where verbose matching criteria are inlined in every rule, such a policy-level representation is hard to infer.
Performance benefits and comparisons with iptables:
iptables does packet matching linearly against each rule until a match is found. In our proposal, we use BPF Tables (maps) containing keys for all rules, making packet matching highly efficient. We then apply the policy using the collected results, which results in a considerable speedup over iptables.
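The lookup-then-policy structure described above can be sketched in a few lines of Python. This is a userspace model under stated assumptions, not the filter itself: the map contents, names, and verdict strings are hypothetical, Python sets stand in for BPF hash maps, and the prefix check is a simple loop where a real filter would use an LPM trie.

```python
import ipaddress

# Hypothetical map contents (the "data"), which application owners would update.
OPEN_TO_INTERNET = {("tcp", 80), ("tcp", 443)}
OPEN_TO_INTERNAL = {("tcp", 22)}
INTERNAL_PREFIXES = [ipaddress.ip_network("10.0.0.0/8")]


def is_internal(src):
    """Stand-in for an LPM trie lookup on the source address."""
    addr = ipaddress.ip_address(src)
    return any(addr in net for net in INTERNAL_PREFIXES)


def filter_packet(src, proto, dport):
    """Collect lookup results first, then apply a fixed policy to them."""
    # Lookup phase: one constant-time probe per map, independent of rule count.
    open_public = (proto, dport) in OPEN_TO_INTERNET
    open_internal = (proto, dport) in OPEN_TO_INTERNAL
    from_internal = is_internal(src)

    # Policy phase: reads like the high-level policy the administrator owns.
    if open_public:
        return "PASS"
    if open_internal and from_internal:
        return "PASS"
    return "DROP"
```

Adding a port changes only the map contents, and the cost of a lookup does not grow with the number of entries, in contrast to walking a linear rule list.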
Ease of policy / config updates and maintenance:
The network administrator owns the firewall while the app developers typically require opening ports for their applications to work. With our approach of using an eBPF filter, we create a logical separation between the filter, which enforces the policy, and the contents of the associated maps, which represent the specific ports and prefixes that need to be filtered. The policy is owned by the network administrator (example: ports open to the internet, ports open from within specific prefixes, drop everything else). The data (port numbers, prefixes, etc.) can now belong to a separate logical section which gives application developers a predetermined destination to update their data (example: a file containing ports opened to internal subnets). This reduces friction between the two functions and reduces human error.
Deployment experience:
We deploy this solution in our edge infrastructure to implement our firewall policy.
We update configuration, reload filters and contents of the various maps containing keys and values for filtering
BPF Program arrays:
We use the power of BPF program arrays to chain different programs such as rate limiters, firewalls, load balancers, etc. These are building blocks for creating a rich, high-performance networking solution.
Proposal for a completely generic firewall solution to migrate existing iptables rules to eBPF / XDP based filtering:
We present a proposal which can translate existing iptables rules to a better-performing eBPF program with mostly user space processing and validation.
The eXpress Data Path (XDP) is a new kernel feature, intended to provide fast packet processing as close as possible to device hardware. XDP builds on top of the extended Berkeley Packet Filter (eBPF) and allows users to write a C-like packet processing program, which can be attached to the device driver's receiving queue. When the device observes an incoming packet, the user-defined XDP program is triggered to execute on the packet payload, making the decision as early as possible before handing the packet down the processing pipeline.
P4 is a domain-specific language describing how packets are processed by the data plane of programmable network elements, including network interface cards, appliances, and virtual switches. It provides an abstraction that allows programmers to express existing and future protocol formats without coupling them to any data-plane-specific knowledge. The language is explicitly designed to be protocol-agnostic. A P4 programmer can write their own protocols and load the P4 program into P4-capable network elements.
As a high-level networking language, P4 supports a diverse set of compiler backends and can also express eBPF and XDP programs.
We present P4C-XDP, a new backend for the P4 compiler. P4C-XDP leverages XDP to aim for a high-performance software data plane. The backend generates an eBPF-compliant C representation from a given P4 program, which is passed to clang and llvm to produce the bytecode. Using conventional eBPF kernel hooks, the program can then be loaded into the eBPF virtual machine in the device driver. The kernel verifier guarantees the safety of the generated code. Any packets received/transmitted from/to this device driver now trigger the execution of the loaded P4 program.
The P4C-XDP project is an open source project hosted at https://github.com/vmware/p4c-xdp/. We provide proof-of-concept sample code under the tests directory, which contains a couple of examples such as basic protocol parsing, checksum recalculation, multiple table lookups, and tunnel protocol en-/decapsulation.
Currently the Linux kernel implements two distinct datapaths for Open
vSwitch: the ovskdp and the TC DP. The latter has been added recently
mainly to allow HW offload, while the former is usually preferred for
SW based forwarding due to functional and performance reasons.
We evaluate both datapaths in a typical forwarding scenario - the PVP
test - using the perf tool to identify bottlenecks in the TC SW dp.
While similar steps usually incur similar costs, the TC SW DP
requires an additional, per packet, skb_clone, due to a TC actions
constraint.
We propose to extend the existing act infrastructure, leveraging the
ACT_REDIRECT action and the bpf redirect code, to allow clone-free
forwarding from the mirred action and then re-evaluate the datapaths'
performance: the gap is then almost closed.
Nevertheless, TC SW performance can be further improved by completing
the RCU-ification of the TC actions and expanding the recent
listification infrastructure to the TC (ingress) hook. We also plan to
compare the TC/SW datapath with a custom eBPF program implementing the
equivalent flow set to establish a reference value for the target
performance.
Over the past several years, BPF has steadily become more powerful in multiple
ways: Through building more intelligence into the verifier which allows more
complex programs to be loaded, and through extension of the API such as by
adding new map types and new native BPF function calls. While BPF has its roots
in applying filters at the socket layer, the ability to introspect the sockets
relating to traffic being filtered has been limited.
To build such awareness into a BPF helper, the verifier needs the ability to
track the safety of the calls, including appropriate reference counting upon
the underlying socket. This talk walks through extensions to the verifier to
perform tracking of references in a BPF program. This allows BPF developers to
extend the UAPI with functions that allocate and release resources within the
execution lifetime of a BPF program, and the verifier will validate that the
resources are released exactly once prior to program completion.
Using this new reference tracking ability in the verifier, we add socket lookup
and release function calls to the BPF API, allowing BPF programs to safely find
a socket and build logic upon the presence or attributes of a socket. This can
be used to load-balance traffic based on the presence of a listening
application, or to implement stateful firewalling primitives to understand
whether traffic for this connection has been seen before. With this new
functionality, BPF programs can integrate more closely with the networking
stack's understanding of the traffic transiting the kernel.
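The "released exactly once" rule described above can be illustrated with a toy checker. The Python sketch below is a heavily simplified model and not the kernel verifier: it handles a single straight-line instruction list instead of exploring every program path, and the operation names and the `verify` function are hypothetical.

```python
def verify(program):
    """Check that every acquired reference is released exactly once before exit.

    program is a list of (op, ref) tuples with op in
    {"acquire", "release", "exit"}; ref names a resource such as a socket.
    Returns (ok, message).
    """
    held = set()
    for idx, (op, ref) in enumerate(program):
        if op == "acquire":
            if ref in held:
                return False, f"insn {idx}: reference {ref!r} is already held"
            held.add(ref)
        elif op == "release":
            if ref not in held:
                return False, f"insn {idx}: release of unheld reference {ref!r}"
            held.remove(ref)
        elif op == "exit":
            break
        else:
            raise ValueError(f"unknown op {op!r}")
    if held:
        return False, f"unreleased references at exit: {sorted(held)}"
    return True, "ok"
```

A program that looks up a socket and exits without releasing it is rejected at load time in this model, as is one that releases the same socket twice. Both are bugs that would otherwise leak or corrupt a reference count at run time.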
XDP already offers rich facilities for high performance packet
processing, and has seen deployment in several production systems.
However, this does not mean that XDP is a finished system; on the
contrary, improvements are being added in every release of Linux, and
rough edges are constantly being filed down. The purpose of this talk is
to discuss some of these possibilities for future improvements,
including how to address some of the known limitations of the system. We
are especially interested in soliciting feedback and ideas from the
community on the best way forward.
The issues we are planning to discuss include, but are not limited to:
User experience and debugging tools: How do we make it easier for
people who are not familiar with the kernel or XDP to get to grips
with the system and be productive when writing XDP programs?
Driver support: How do we get to full support for XDP in all drivers?
Is this even a goal we should be striving for?
Performance: At high packet rates, every micro-optimisation counts.
Things like inlining function calls in drivers are important, but also
batching to amortise fixed costs such as DMA mapping. What are the
known bottlenecks, and how do we address them?
QoS and rate transitions: How should we do QoS in XDP? In particular,
rate transitions (where a faster link feeds into a slower) are
currently hard to deal with from XDP, and would benefit from, e.g.,
Active Queue Management (AQM). Can we adapt some of the AQM and QoS
facilities in the regular networking stack to work with XDP? Or should
we do something different?
Accelerating other parts of the stack: Tom Herbert started the
discussion on accelerating transport protocols with XDP back in 2016.
How do we make progress on this? Or should we be doing something
different? Are there other areas where we can extend XDP's processing
model to provide useful accelerations?
phylib has provided the API Ethernet MAC drivers have used to control
Copper PHYs for many years. However, with the advent of MACs/PHYs with
bandwidth of > 1Gbps, SERDES interfaces and fibre optical modules,
phylib is not sufficient. phylink provides an API which MAC drivers
can use to control these more complex and dynamic, possibly
hot-pluggable PHYs. This presentation will explain why phylink is
needed, how it differs from phylib, and describe how to convert a MAC
driver from phylib to phylink in order to make use of its new
features. The kernel support for SFP modules will also be detailed,
including how the MAC needs to handle hot-plugging of the PHY, which
can be copper or fibre.
Linux currently provides mechanisms for managing and allocating many of the system resources such as CPU, memory, etc. Network resource management is more complicated, since networking deals not only with a local resource, as CPU management does, but can also deal with a global resource. The goal is not only to provide a mechanism for allocating the local network resource (NIC bandwidth), but also to support management of network resources external to the host, such as link and switch bandwidths.
For networking, the primary mechanism for allocating and managing bandwidth has been the traffic control (tc) subsystem. While tc allows for shaping of outgoing traffic and policing of incoming traffic, it suffers from some drawbacks. The first drawback is a history of performance issues when using the Hierarchical Token Bucket (htb) queuing discipline, which is usually required for anything other than simple shaping needs. A second drawback is the lack of the flexibility that general programming constructs usually provide.
We are in the process of designing and implementing a BPF based framework for efficiently supporting shaping of both egress and ingress traffic based on both local and global network allocations.
Proposals submitted to the LPC Networking Track Technical Committee, each 40 minutes long and accompanied by a paper of 2-10 pages, should cover new and upcoming work, with suggestions for solutions to open problems, on topics including but not limited to the following:
XDP and BPF
Wireless Networking
Performance and Performance Analysis
TCP and congestion control algorithms
Interactions between Networking and other subsystems