A tale about fixing eBPF spinlock issues in the Linux kernel

Developers of the Superluminal CPU profiler share their journey of debugging a critical system freeze caused by eBPF spinlock interactions within the Linux kernel.

We’ve been working on the Linux version of Superluminal (a CPU profiler) for a while now, and we’ve been in a private alpha with a small group of testers. That was going great, until one of our testers, Aras, ran into periodic full system freezes while capturing with Superluminal.

We always pride ourselves on Superluminal “Just Working”, and this was decidedly not that, so we of course went hunting for what turned out to be one of the toughest bugs we’ve faced in our careers.

The hunt led us deep into the internals of the Linux kernel (again), where we learned more about spinlocks in the kernel than we ever expected to know, and we ended up helping to find & fix a number of issues along the way.

Initial analysis

The problem Aras was running into was that on his Fedora 42 machine (kernel 6.17.4-200), the system would periodically freeze for short periods while a Superluminal capture was running.

It’s really difficult to remotely debug issues like this, so we first attempted to reproduce the issue in a VM. However, several attempts with various Fedora versions and kernels all failed to reproduce it. We finally tried installing Fedora on a physical machine, and we were able to reproduce it there.

Now that we have a repro, we can start looking into the issue in earnest. Since the machine is periodically freezing while capturing with Superluminal, we can start by looking at what the capture looks like after opening it.

The capture shows the timeline for each thread in the process. A green color means the CPU is actively executing work; any other color means the thread is scheduled out and waiting for something. We can immediately spot some suspicious-looking areas in the capture where every thread in the process appears to be busy for roughly the same amount of time, which doesn’t match the workload being profiled.

Each of these areas is 250+ milliseconds where the CPU appears to be fully busy. Zooming in on one of these sections and expanding a thread, we see that despite this thread being reported as ‘busy’, there are no samples being collected during this period at all.

Looking at dmesg output while reproducing the issue, we also get messages like the following:

[ +0.014286] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.723 msecs
[ +0.232451] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 250.424 msecs
[ +0.000001] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 250.424 msecs
[ +0.250938] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 250.936 msecs

None of this tells us why the CPU is busy, but it does indicate that something is happening in the kernel that takes 250+ milliseconds.
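As a minimal sketch of how the durations in these messages could be pulled out and checked against a threshold, assuming the exact message format shown above: the function names parse_nmi_duration_msecs and nmi_exceeds are hypothetical helpers for illustration, not part of any kernel or Superluminal tooling.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: if `line` is an "NMI handler ... took too long to
 * run" message, store the reported duration in *out_msecs and return 1.
 * Returns 0 for any other line. */
static int parse_nmi_duration_msecs(const char *line, double *out_msecs)
{
    static const char marker[] = "took too long to run: ";
    const char *p;

    assert(line != NULL && out_msecs != NULL);

    p = strstr(line, marker);
    if (p == NULL)
        return 0;

    /* Skip past the marker and read the number that follows it. */
    return sscanf(p + sizeof(marker) - 1, "%lf", out_msecs) == 1;
}

/* Hypothetical helper: returns 1 if `line` reports an NMI handler duration
 * of at least `threshold_msecs`, 0 otherwise. */
static int nmi_exceeds(const char *line, double threshold_msecs)
{
    double msecs = 0.0;

    if (!parse_nmi_duration_msecs(line, &msecs))
        return 0;
    return msecs >= threshold_msecs;
}
```

Feeding each dmesg line through nmi_exceeds with a threshold of, say, 100 milliseconds would separate the 250 millisecond entries from the occasional short overrun such as the 1.723 millisecond one.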

Debugging the kernel

Normally, when you get freezes like this, you attach a debugger. In this case we’re dealing with a kernel freeze, which means using the Linux kernel debugger. It communicates between debugger & debuggee over a serial port, so we had to get serial port PCIe cards.

With this in place, we’re able to attach to the problematic machine using gdb. Unfortunately, it seems that when the kernel is in this freezing state, even the debugger doesn’t respond anymore. Our attempts to break into the debugger during a freeze all resulted in gdb crashing and/or timing out.

Finding a minimal repro

Given the nature of the issue, and the fact that our eBPF code is the only code of ours that runs in the kernel, we made the assumption that the problem is somewhere in that eBPF code. This reduces the code we need to investigate to ~2000 lines.

We created some debug options that allowed us to disable event types individually. This resulted in the following observations:

With only sampling events enabled, the freezes do not occur.
With only context switch/wake events enabled, the freezes do not occur either.
With both sampling events and context switch events enabled, the freezes occur again.
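The bisection above can be sketched as a bitmask of enabled event types. This is only an illustration under assumed names (EVENT_SAMPLING, EVENT_CSWITCH, events_enabled, freeze_was_observed are all hypothetical); it is not how Superluminal’s debug options are actually implemented. The last function simply encodes the three observations listed above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event-type flags, one bit per event type. */
enum event_type {
    EVENT_SAMPLING = 1u << 0, /* sampling events */
    EVENT_CSWITCH  = 1u << 1, /* context switch/wake events */
};

/* True if every event type in `wanted` is enabled in `mask`. */
static bool events_enabled(unsigned mask, unsigned wanted)
{
    assert(wanted != 0); /* asking about "no events" is a caller bug */
    return (mask & wanted) == wanted;
}

/* Records the observations from the three test runs: the freeze was seen
 * only when sampling and context switch events were enabled together. */
static bool freeze_was_observed(unsigned mask)
{
    return events_enabled(mask, EVENT_SAMPLING | EVENT_CSWITCH);
}
```

Testing each flag on its own and then the combination is what separates a bug in one handler from a bug in the interaction between two handlers.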

This strongly points towards the issue being some kind of interaction between the eBPF code that runs when a sample event happens, and the eBPF code that runs when a context switch happens.

We stripped the eBPF code further until we reached this minimal repro:

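// struct CSwitchEvent and struct SampleEvent are the event payload types;
// their definitions are omitted from this repro.

// A single ring buffer (512 MiB) shared by both programs.
struct {
  __uint(type, BPF_MAP_TYPE_RINGBUF);
  __uint(max_entries, 512 * 1024 * 1024);
} ringBuffer SEC(".maps");

// Runs on every scheduler context switch.
SEC("tp_btf/sched_switch")
int cswitch(struct bpf_raw_tracepoint_args* inContext) {
  // Reserve space for one event; NULL means the reservation failed.
  struct CSwitchEvent* event = bpf_ringbuf_reserve(&ringBuffer, sizeof(struct CSwitchEvent), 0);
  if (event == NULL) return 1;
  // Immediately give the space back without publishing anything.
  bpf_ringbuf_discard(event, 0);
  return 0;
}

// Runs on every perf sampling event.
SEC("perf_event")
int sample(struct bpf_perf_event_data* inContext) {
  // Same reserve/discard pair, on the same ring buffer.
  struct SampleEvent* event = bpf_ringbuf_reserve(&ringBuffer, sizeof(struct SampleEvent), 0);
  if (event == NULL) return 1;
  bpf_ringbuf_discard(event, 0);
  return 0;
}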

This describes two eBPF programs: one that runs when a context switch happens, and one that runs on the sampling interrupt. The programs do nothing but reserve and discard space in a ring buffer.

Source: Hacker News