RSS, IRQ affinity and RPS on Linux
In this blog post, we are going to have a look at the tuning of Linux receive queues and their interrupt requests. We are going to learn a bit about RSS (Receive Side Scaling), IRQ SMP affinity, RPS (Receive Packet Steering), and how to analyze what's happening on your CPUs with perf flamegraphs. I do not aim to go very low-level here; instead, I'd like you to get a high-level understanding of this topic.
Lab setup
We spawn two RHEL 9 virtual machines with two interfaces each. We are going to connect to the instances via eth0 and we are going to run our tests via eth1.
DUT: Device Under Test (server, VM with 4 queues on eth1)
LG: Load Generator (client)
Both VMs have 8 GB of memory and 8 CPUs each:
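For example, something like the following should report 8 CPUs and roughly 8 GB of memory:

```
lscpu | grep -E '^CPU\(s\):'   # expect: CPU(s): 8
free -h | grep Mem             # expect roughly 8 GiB of total memory
```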
In addition, the secondary interface of the Device Under Test (DUT) is configured to have 4 queues:
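The queue count can be checked with ethtool; with the libvirt setting below, the combined channel count should read 4:

```
# Show the channel (queue) configuration of eth1.
ethtool -l eth1
```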
Note: If you set up your VMs with Virtual Machine Manager, you have to add the line
<driver name='vhost' queues='4'/>
via the XML input field.
Test application
The test application is a simple client <-> server application written in Go. The client opens a connection, writes a message which is received by the server, and closes the connection. It does so asynchronously at a given rate. The application's goal is not to be particularly performant; instead, it should be easy to understand, have a configurable rate, and support both IPv4 UDP and TCP.
You can find the code at https://github.com/andreaskaris/golang-loadgen/tree/blog-post.
The application can send/receive traffic for both TCP and UDP. In the interest of brevity, I'll focus on the UDP part only.
Server code
The entire UDP server code is pretty simple and should be self-explanatory; you can read it in the repository linked above.
The client code is even simpler; again, see the repository.
Disabling irqbalance on the DUT
irqbalance would actually interfere with our tests. Therefore, let's disable it on the Device Under Test:
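For example, assuming systemd manages irqbalance:

```
# Stop irqbalance now and keep it from starting on the next boot.
systemctl disable --now irqbalance
# Verify that it is no longer running.
systemctl is-active irqbalance
```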
Isolating CPUs with tuned on the DUT
Let's isolate the last 4 CPUs of the Device Under Test with tuned. First, install tuned and the tuned-profiles-realtime package:
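On RHEL 9, this boils down to a single dnf command:

```
dnf install -y tuned tuned-profiles-realtime
```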
Configure the isolated cores in /etc/tuned/realtime-variables.conf:
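With 8 CPUs and the last 4 isolated, the relevant line should look roughly like this (the file may contain other variables as well):

```
# /etc/tuned/realtime-variables.conf
isolated_cores=4-7
```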
Enable the realtime profile and reboot:
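For example:

```
tuned-adm profile realtime
reboot
```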
When the system is back up, verify that isolation is configured:
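A few possible checks, any of which should confirm the isolation:

```
tuned-adm active                        # should report the realtime profile
cat /sys/devices/system/cpu/isolated    # should print 4-7
grep -o 'isolcpus=[^ ]*' /proc/cmdline  # isolcpus=... added by the profile
```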
Building the application
Build the test application on both nodes:
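Roughly, assuming the Go toolchain and git come from the RHEL repositories (the exact steps may differ slightly):

```
dnf install -y git golang
git clone https://github.com/andreaskaris/golang-loadgen
cd golang-loadgen
git checkout blog-post
go build
```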
Running the test application with debug output
First, start the test application in client mode on the Load Generator (LG).
Then, start the test application in server mode on the Device Under Test and force it to run on CPUs 6 and 7.
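I won't spell out the application's command line flags here; the interesting part is the CPU pinning, which can be done with taskset (the binary name and <server flags> below are placeholders):

```
# Run the server pinned to CPUs 6 and 7.
taskset -c 6,7 ./golang-loadgen <server flags>
```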
RSS + Tuning NIC queues and RX interrupt affinity on the Device Under Test
What's RSS?
RSS, short for Receive Side Scaling, is an in-hardware feature that allows a NIC to "send different packets to different queues to distribute processing among CPUs. The NIC distributes packets by applying a filter to each packet that assigns it to one of a small number of logical flows. Packets for each flow are steered to a separate receive queue, which in turn can be processed by separate CPUs." For more details, have a look at Scaling in the Linux Networking Stack.
RSS happens in hardware for modern NICs and is enabled by default. Virtio supports multiqueue and RSS when the vhost driver is used (see our Virtual Machine setup instructions earlier).
It's possible to show some information about RSS with ethtool -x <interface>. As far as I know, it's not possible to modify the RSS configuration for virtio (when I tried, my VM froze). However, for real NICs such as recent Intel or Mellanox/Nvidia devices, plenty of configuration options are available to influence how RSS steers packets to queues.
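For our eth1, the command is simply the following (the output shows the RX flow hash indirection table and, for some drivers, the hash key):

```
ethtool -x eth1
```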
Reducing queue count to 2
To make things a little bit more manageable, let's reduce our queue count (both for RX and TX) to 2:
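With ethtool, roughly:

```
ethtool -l eth1              # before: Combined: 4
ethtool -L eth1 combined 2   # reduce the RX/TX queue count to 2
ethtool -l eth1              # after:  Combined: 2
```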
Identifying and reading per queue interrupts
Let's find the interrupt names for eth1. Get the bus-info:
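For example:

```
ethtool -i eth1 | grep bus-info
```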
And let's get the device name as it appears in /proc/interrupts:
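One way, substituting the bus-info value from the previous command, is to list the PCI device directory, which contains a virtioN subdirectory matching the names used in /proc/interrupts:

```
ls /sys/bus/pci/devices/<bus-info>/ | grep virtio
```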
Or, even simpler, search /sys/devices/ for eth1:
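For example; the virtioN component of the printed path is the name that shows up in /proc/interrupts:

```
find /sys/devices/ -name eth1
```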
We can now read the interrupts for the NIC:
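With the device name found above (virtio6 in my case):

```
grep virtio6 /proc/interrupts
```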
Note that virtio will show all 4 queues for each input and output in /proc/interrupts, even though 2 of each are disabled.
We can however easily check that the setting is working. Let's start our application on the server on CPUs 6 and 7 again, and then start the client as before.
And you can see that interrupts for queues virtio6-input.0 and virtio6-input.1 increment as RSS balances flows across the 2 remaining queues, while queues 2 and 3 stay disabled. It might be a bit difficult to read, but remember that these are absolute counters since boot. Look at the delta for each of the queues between two snapshots, i.e. compare the lines starting with 53 to each other, then the lines starting with 55, and so on. You will see that the 2 lines for IRQs 57 and 59 do not change, whereas the counters for IRQs 53 (CPU 0) and 55 (CPU 5) do increase.
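A convenient way to watch the deltas:

```
# -d highlights the differences between refreshes.
watch -n1 -d "grep virtio6-input /proc/interrupts"
```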
How to translate CPUs to hex mask:
In order to pin to a single CPU for an 8 CPU system, refer to the following table:
CPU | Binary mask | Hex |
---|---|---|
0 | 0b00000001 | 01 |
1 | 0b00000010 | 02 |
2 | 0b00000100 | 04 |
3 | 0b00001000 | 08 |
4 | 0b00010000 | 10 |
5 | 0b00100000 | 20 |
6 | 0b01000000 | 40 |
7 | 0b10000000 | 80 |
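Instead of using the table, you can also compute the mask with shell arithmetic, for example for CPU 6:

```
printf '%x\n' $((1 << 6))   # prints 40
```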
Querying SMP affinity for RX queue interrupts
You can get the interrupt numbers for virtio6-input.0 (in this case 53) and virtio6-input.1 (in this case 55) from /proc/interrupts. Then, query /proc/irq/<interrupt number>/smp_affinity and smp_affinity_list.
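With the IRQ numbers from this example (53 and 55):

```
cat /proc/irq/53/smp_affinity_list   # -> 0-3
cat /proc/irq/55/smp_affinity_list   # -> 4-7
cat /proc/irq/53/smp_affinity        # the same information as a hex mask
cat /proc/irq/55/smp_affinity
```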
That matches what we saw earlier: virtio6-input.0's affinity currently is CPUs 0-3, and in /proc/interrupts we saw that it generated interrupts on CPU 0. virtio6-input.1's affinity currently is CPUs 4-7, and in /proc/interrupts we saw that it generated interrupts on CPU 5. But wait, irqbalance is switched off, and we even rebooted the system. Why are our IRQs distributed between our CPUs, and why aren't they allowed on all CPUs? To be confirmed, but the answer may be in this commit.
For further details about SMP IRQ affinity, see SMP IRQ affinity.
Configuring SMP affinity for RX queue interrupts
Let's force virtio6-input.0 onto CPU 2 and virtio6-input.1 onto CPU 3. The softirqs will be processed on the same CPUs by default. The affinity can be any CPU, regardless of our tuned configuration, either from the system CPU set or from the isolated CPU set. But I already moved IRQs to isolated CPUs during an earlier test, so for the sake of variety, I want to move them to system-reserved CPUs now :-)
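Using the masks from the table above (04 for CPU 2, 08 for CPU 3) and the IRQ numbers from this example:

```
echo 04 > /proc/irq/53/smp_affinity   # virtio6-input.0 -> CPU 2
echo 08 > /proc/irq/55/smp_affinity   # virtio6-input.1 -> CPU 3
cat /proc/irq/53/smp_affinity_list    # -> 2
cat /proc/irq/55/smp_affinity_list    # -> 3
```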
Start the server and the client again, with the same CPU affinity for the server.
And now, interrupts increase for virtio6-input.0 on CPU 2 and for virtio6-input.1 on CPU 3, as another look at /proc/interrupts confirms.
Using flamegraphs to analyze CPU activity
Now that we configured our server to run on CPUs 6 and 7, and our receive queue interrupts on CPUs 2 and 3, let's profile our CPUs and use perf script to create flamegraphs.
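One way to do this, assuming Brendan Gregg's FlameGraph scripts (https://github.com/brendangregg/FlameGraph) are checked out in the working directory, is roughly:

```
# Sample all CPUs at 99 Hz with call graphs for 60 seconds while the test runs.
perf record -a -F 99 -g -- sleep 60
# Create one flamegraph per CPU.
for cpu in $(seq 0 7); do
    perf script --cpu "$cpu" \
        | ./FlameGraph/stackcollapse-perf.pl \
        | ./FlameGraph/flamegraph.pl > "flamegraph-cpu${cpu}.svg"
done
```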
Note: The -F 99 means that perf samples at a rate of 99 samples per second. Sampling isn't perfect: if the CPU does something very short-lived between samples, the profiler will not capture it!
Now, we'll copy the flamegraphs to our local system for analysis. You can access the flamegraphs of my test runs here:
- IRQ smp affinity - flamegraph 0
- IRQ smp affinity - flamegraph 1
- IRQ smp affinity - flamegraph 2
- IRQ smp affinity - flamegraph 3
- IRQ smp affinity - flamegraph 4
- IRQ smp affinity - flamegraph 5
- IRQ smp affinity - flamegraph 6
- IRQ smp affinity - flamegraph 7
Note: This version of the flamegraphs color codes userspace code in green and kernel code in blue.
The flamegraphs show us largely idle CPUs 0, 1, 4 and 5, which is expected. However, CPU 7 is idle as well, even though the server should be running on CPUs 6 and 7. There are 2 reasons for this:
- We started the application on CPUs 6 and 7, which are isolated via isolcpus in /proc/cmdline, so the kernel load balancer is off for these CPUs.
- Have another look at the server implementation: you will see that the TCP server is multithreaded (it uses goroutines) and the UDP server is single-threaded. I had not intended it to be this way - it was an omission on my side because I initially implemented the TCP code and then quickly added the UDP part. This is fixed in the master branch of the repository.
The flamegraph for CPU 6 shows us that our server spends most of its time in readFromUDP, as expected. About half of that time is spent waiting in the Go function internal/poll.runtime_pollWait, which issues an epoll_wait syscall that leads to a napi_busy_loop, which is responsible for receiving our packets. Most of the other half is spent in the recvfrom syscall.
CPUs 2 and 3 process our softirqs, at roughly 20% and 13% respectively. Actually, running top or mpstat during the tests also showed that the CPUs were spending that amount of time processing hardware interrupts and softirqs.
This is pure speculation, but I suppose that we do not see any hardware interrupts here for 2 reasons:
- Linux NAPI makes sure to spend most of its time polling, thus most of the time will be spent processing softirqs in the bottom half.
- The hardware interrupts are simply missed by our samples. Brendan Gregg's blog might have some tips to get to the bottom of this.
Even though most of what's happening is still difficult for me to understand, at least it's not a black box any more. We can see what's causing each CPU to be busy, and with the help of man pages or the actual application and kernel code we can dive deeper and try to understand how things work if we want to.
RPS on the Device Under Test
What's RPS?
RPS, short for Receive Packet Steering, "is logically a software implementation of RSS. Being in software, it is necessarily called later in the datapath. Whereas RSS selects the queue and hence CPU that will run the hardware interrupt handler, RPS selects the CPU to perform protocol processing above the interrupt handler. This is accomplished by placing the packet on the desired CPU’s backlog queue and waking up the CPU for processing. (...) RPS is called during bottom half of the receive interrupt handler, when a driver sends a packet up the network stack with netif_rx() or netif_receive_skb(). These call the get_rps_cpu() function, which selects the queue that should process a packet." For more details, have a look at Scaling in the Linux Networking Stack.
Configuring RPS for the RX queues
Red Hat has a great Knowledge Base solution for configuring RPS. If you cannot access it, refer to Scaling in the Linux Networking Stack.
By default, RPS is off:
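The rps_cpus mask of both remaining queues reads all zeros:

```
cat /sys/class/net/eth1/queues/rx-0/rps_cpus   # all zeros -> RPS off
cat /sys/class/net/eth1/queues/rx-1/rps_cpus   # all zeros -> RPS off
```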
Because I rebooted my virtual machines, let's first reconfigure 2 queues for eth1 and configure SMP affinity for the IRQs again:
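These are the same steps as before, condensed; this assumes the IRQ numbers are still 53 and 55 after the reboot:

```
ethtool -L eth1 combined 2
echo 04 > /proc/irq/53/smp_affinity   # virtio6-input.0 -> CPU 2
echo 08 > /proc/irq/55/smp_affinity   # virtio6-input.1 -> CPU 3
```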
Let's start our client <-> server pair again, the same way as before.
Check softirqs for NET_RX:
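The per-CPU NET_RX counters live in /proc/softirqs; take two snapshots a few seconds apart and compare the deltas:

```
grep -E 'CPU|NET_RX' /proc/softirqs
```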
You can see that NET_RX softirqs are mainly processed on CPUs 2 and 3, the CPUs that we pinned our RX IRQs to.
Let's enable receive packet steering for rx-0, and let's move it to CPU 4:
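CPU 4 corresponds to mask 10, so the attempt looks roughly like this; on my system the write was rejected because CPU 4 is isolated:

```
echo 10 > /sys/class/net/eth1/queues/rx-0/rps_cpus
```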
Well, that's a bummer: we can't actually configure RPS on isolated CPUs, due to Preventing job distribution to isolated CPUs.
Alright, then let's move it to CPU 1:
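CPU 1 corresponds to mask 02:

```
echo 02 > /sys/class/net/eth1/queues/rx-0/rps_cpus
cat /sys/class/net/eth1/queues/rx-0/rps_cpus
```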
Check softirqs for NET_RX again, the same way as before.
Now, we are processing NET_RX softirqs on CPU 1 (for RPS) as well as on CPUs 2 and 3 because of IRQ SMP affinity.
Using flamegraphs to analyze CPU activity
Now that we configured our server to run on CPUs 6 and 7, our receive queue interrupts on CPUs 2 and 3, and RPS for RX queue 0 to use CPU 1, let's profile our CPUs again, using the same perf script workflow as in the previous section to create flamegraphs.
Note: The -F 99 means that perf samples at a rate of 99 samples per second. Sampling isn't perfect: if the CPU does something very short-lived between samples, the profiler will not capture it!
Now, we'll copy the flamegraphs to our local system for analysis. You can access the flamegraphs of my test runs here:
- IRQ smp affinity - flamegraph 0
- IRQ smp affinity - flamegraph 1
- IRQ smp affinity - flamegraph 2
- IRQ smp affinity - flamegraph 3
- IRQ smp affinity - flamegraph 4
- IRQ smp affinity - flamegraph 5
- IRQ smp affinity - flamegraph 6
- IRQ smp affinity - flamegraph 7
Note: This version of the flamegraphs color codes userspace code in green and kernel code in blue.
The flamegraphs show us largely idle CPUs 0, 4, 5 and 7, for the reasons that we already explained earlier. CPU 6 runs our Go server, which we also already discussed.
Let's first focus on CPU 2. We can see that it polls the queue in __napi_poll. We can also see that it spends roughly the same amount of time in net_rps_send_ipi (you can click on asm_common_interrupt to zoom in further).
Now, zoom into napi_complete_done. You see here that netif_receive_skb_list_internal calls get_rps_cpu to determine the CPU to steer packets to. That matches the description from Scaling in the Linux Networking Stack.
On the other hand, CPU 1 processes our packets after RPS for RX queue 0. We can see that __napi_poll runs here as well, inside net_rx_action. As part of RPS, we can see here that __netif_receive_skb_one_core calls ip_rcv, ip_local_deliver and ip_local_deliver_finish. We see that CPU 1 does not poll the virtio queue directly.
If you compare that to the work that's being done on CPU 3, you can see that CPU 3 both polls the virtio queue and is in charge of local delivery.
So RPS generates some overhead for dispatching the packets to another CPU, but it splits the work of receiving and delivering our packets between CPUs 2 and 1. (You can also observe for example that GRO happens on CPU 2 and not on CPU 1.)
Conclusion
We could have aimed for a deeper dive by optimizing our test application, by looking at NAPI and its configuration, by analyzing the kernel code, and by examining the flamegraphs more thoroughly. However, I hope that this blog post serves as a useful short overview.