Understanding the Linux Kernel: The Scheduler

The scheduler is the core component that decides how Linux allocates CPU resources to thousands of tasks simultaneously. This article decodes the underlying structure of processes, scheduling classes, and how the system runs smoothly using the EEVDF algorithm.

In the previous article we looked at how the kernel gives every process its own private view of memory. But memory is only half of what a process needs to actually run. The other half is the CPU itself – and there are only so many CPUs in a machine, while there are usually hundreds or thousands of things that want to run on them.

So somebody has to decide, constantly, who gets a CPU and for how long. That somebody is the scheduler. Every few milliseconds, on every core, the kernel asks itself the same question – of everything that wants to run right now, who runs next? – and the answer has to be fast, fair, and good enough that your text editor stays responsive even while a compile is pegging every core.

Let’s take it step by step, because the scheduler has a lot of moving parts. We’ll start with what the scheduler is even scheduling – what a process and a thread really are under the hood. Then we’ll see that Linux doesn’t have just one scheduler but several, stacked as scheduling classes. From there we’ll look at when a running task stops running (it’s more interesting than it sounds) and at what it costs to switch from one task to another. Finally, we’ll get to the heart of it: how the kernel actually decides who’s next, using an algorithm called EEVDF.

A note on scope

Everything here is against Linux 7.1 (the scheduler core lives in kernel/sched/, mostly in fair.c and core.c). And it’s a deliberate simplification: the real implementation has far more going on – load balancing across CPUs, group scheduling through cgroups, CPU bandwidth control, NUMA awareness, and countless edge cases – that I’m skipping over to keep the core ideas clear.

So let’s begin where the scheduler itself begins: with the thing it’s actually shuffling on and off the CPU.

What the Scheduler Actually Schedules

Here’s the first surprise, and it clears up a lot of confusion: the kernel doesn’t schedule “processes” or “threads.” Those are words we use up in user space. Down in the kernel there’s just one kind of schedulable thing – a task_struct (include/linux/sched.h:820), the kernel’s record of one flow of execution, one thing that can sit on a CPU and run instructions.

What you call a process and what you call a thread are both, underneath, the exact same kind of object. The only difference is what they share. When you fork() a new process, you get a fresh task_struct with its own private copy of the parent’s state – its own address space (a copy-on-write copy of the parent’s), its own file descriptor table, and so on. When you create a thread (with clone() and the right flags) you also get a fresh task_struct, except this one shares the address space, the open files, the signal handlers, and so on with the task that spawned it. So a “multi-threaded process” is really just a bunch of task_structs that happen to point at the same memory.
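To make the sharing difference concrete, here is a small user-space sketch. It is Python rather than kernel code, it assumes a Linux (or other Unix) system where os.fork() is available, and the function name is made up for illustration. A thread bumps a counter and the parent sees the change, because both tasks share one address space; a forked child bumps the same counter and the parent sees nothing, because the child is working on its own copy.

```python
import os
import threading


def shared_after_thread_and_fork():
    """Return the parent's view of a counter after a thread, then a forked
    child, each try to increment it. (Hypothetical helper for illustration.)"""
    counter = [0]

    def bump():
        counter[0] += 1

    # A thread is a task that shares the parent's address space,
    # so its write lands in the very same list the parent reads.
    t = threading.Thread(target=bump)
    t.start()
    t.join()
    after_thread = counter[0]

    # A forked child is a task with its own copy of the address space,
    # so its write never reaches the parent's copy.
    pid = os.fork()
    if pid == 0:
        bump()          # changes only the child's private copy
        os._exit(0)     # leave the child immediately, skipping cleanup
    os.waitpid(pid, 0)  # parent: wait for the child to finish
    after_fork = counter[0]

    return after_thread, after_fork
```

Called in the parent, this returns the counter as the parent saw it after each step; the second value stays where the thread left it, since the child's increment happened in a different address space.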

From the scheduler’s chair none of that matters anyway – it doesn’t know or care who shares what. It just sees a pile of task_structs, some runnable and some not, and picks among the runnable ones. Everything runnable on a CPU is a task_struct, full stop.

Now, a task_struct is huge – it holds basically everything the kernel knows about a task – but the scheduler only cares about a sliver of it. Tucked inside every task is a small embedded bundle of scheduling state (called the sched_entity, include/linux/sched.h:575), and that little bundle – not the giant struct around it – is what the scheduler actually reasons about.

It’s where the interesting numbers live, things with names like vruntime, vlag, deadline, and slice. Don’t worry about what any of those mean yet – unpacking them is basically the rest of this article. For now just hold onto the shape of it: each runnable task carries a small wad of accounting state, and that’s the part the scheduler reads when it decides who runs next.
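As a rough picture of that shape, here is a toy model in Python. Only the four field names come from the kernel's sched_entity; the types, the defaults, the Task wrapper, and the scheduler_view helper are all assumptions made up for this sketch, and the real structures hold far more.

```python
from dataclasses import dataclass, field


@dataclass
class SchedEntity:
    # Field names mirror the ones mentioned above; integer types and
    # zero defaults are assumptions for the sketch.
    vruntime: int = 0
    vlag: int = 0
    deadline: int = 0
    slice: int = 0


@dataclass
class Task:
    # A heavily shrunken stand-in for task_struct.
    pid: int
    comm: str
    runnable: bool
    se: SchedEntity = field(default_factory=SchedEntity)
    # Placeholder for everything else the kernel tracks about a task
    # (memory, files, signals, credentials, ...).
    everything_else: dict = field(default_factory=dict)


def scheduler_view(tasks):
    """Return only what this toy scheduler looks at: the embedded
    scheduling state of the tasks that are currently runnable."""
    return [t.se for t in tasks if t.runnable]
```

The point of the design is the embedding: the scheduler can work with the small SchedEntity without ever touching the rest of the Task.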

But “the scheduler” is a bit of a white lie, because there isn’t just one – so before we go further, let’s see who actually gets handed the decision.

Scheduling Classes: Who Even Gets Asked First

Before we get to the algorithm, there’s a twist: Linux doesn’t actually have one scheduler. It has several, and they’re stacked in a strict pecking order. These stacked schedulers are called scheduling classes, and each one is a self-contained policy for a different kind of workload.

The way they cooperate is dead simple. When a CPU needs something to run, the kernel goes down the stack from the top and asks each class in turn, “got anything runnable?” The first one to say yes wins, and everything below it doesn’t even get a vote. So a class only ever runs a task when every class above it had nothing to offer.

So what’s in the stack? The three classes at the top all exist to give some task the right to butt ahead of ordinary work. At the very top sits stop, which isn’t really a scheduling policy so much as a “drop everything” lever the kernel pulls when it needs a CPU to do one urgent thing this instant – like migrating tasks off a core that’s being shut down. Below it, deadline handles tasks with hard timing needs (think audio or robotics), where you don’t ask for a priority but for a guarantee: “this task needs so many milliseconds of CPU every so often.” And then rt is classic real-time priorities – a real-time task runs for as long as it likes and only steps aside for something even higher, which is wonderful for latency-critical code and a great way to freeze your machine if such a task never sleeps.

Here’s the thing, though: on a normal desktop or server, those top three classes are almost always empty. Which brings us to the one that matters for the rest of this article: fair. This is where essentially everything you run actually lives – your shell, your browser, your database, that compile job. None of those tasks has any special timing demand; they just want a reasonable slice of the CPU, and the fair class’s whole job is to hand those slices out fairly. Because the classes above it are usually idle, the fair class is the one running the show nearly all the time – so when people say “the Linux scheduler,” this is the one they almost always mean.
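You can check this from user space. The sketch below goes slightly beyond the text above: it relies on the user-visible scheduling policies (SCHED_OTHER, SCHED_BATCH, SCHED_IDLE, SCHED_FIFO, SCHED_RR), which are how a program asks to be placed in a class, and on Python's os.sched_getscheduler(), which is available on Linux. The two helper names and the simplified policy-to-class table are assumptions for illustration.

```python
import os

# Simplified mapping from user-visible policy names to the class that
# handles them. SCHED_DEADLINE and the ext policy are left out because
# Python's os module does not expose constants for them.
FAIR_POLICIES = {"SCHED_OTHER", "SCHED_BATCH", "SCHED_IDLE"}
RT_POLICIES = {"SCHED_FIFO", "SCHED_RR"}


def current_policy_name(pid=0):
    """Name the scheduling policy of a process (pid 0 means 'myself')."""
    policy = os.sched_getscheduler(pid)
    for name in sorted(FAIR_POLICIES | RT_POLICIES):
        # getattr with a default keeps this safe if a constant is missing.
        if getattr(os, name, None) == policy:
            return name
    return f"unknown({policy})"


def scheduling_class_of(policy_name):
    """Map a policy name onto the class names used in this article."""
    if policy_name in FAIR_POLICIES:
        return "fair"
    if policy_name in RT_POLICIES:
        return "rt"
    return "other"
```

Running scheduling_class_of(current_policy_name()) in an ordinary shell session will most likely report fair, since nothing has asked for anything else. Note that the SCHED_IDLE policy is a very low priority setting inside the fair class, not the idle class at the bottom of the stack.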

Rounding out the bottom, just so the picture’s complete: ext is a newer addition that lets you load a whole scheduling policy as a BPF program, handy for experimenting without recompiling the kernel; and idle is the floor, the do-nothing task that runs only when nothing else wants the CPU and quietly puts the core to sleep to save power. We won’t dwell on either – from here on, it’s all about fair.
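The walk down the stack can be sketched in a few lines. This is a toy model, not the kernel's code: the class order is the one described above, but the dictionary of queues, the function name, and the "take the first queued task" rule inside each class are stand-ins invented for the sketch. In the real kernel each class applies its own rule for choosing among its tasks.

```python
from collections import deque

# Highest priority first. The idle class is handled separately below
# because it always has something to offer.
CLASS_ORDER = ("stop", "deadline", "rt", "fair", "ext")


def pick_next_task(runqueues):
    """Ask each class in order; the first with a runnable task wins.

    `runqueues` maps a class name to a deque of runnable task names
    (a made-up representation for this sketch).
    """
    for cls in CLASS_ORDER:
        queue = runqueues.get(cls)
        if queue:                 # a non-empty deque means "got something"
            return cls, queue[0]  # placeholder rule: take the first one
    # Nobody above had anything runnable, so the idle class takes the CPU.
    return "idle", "idle task"
```

With only fair tasks queued, fair wins. Add a single rt task and it takes over immediately, however many fair tasks are waiting, which is exactly why a real-time task that never sleeps can starve everything below it.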

The fair class is built on an algorithm with an intimidating name: EEVDF, Earliest Eligible Virtual Deadline First. Don’t let the name scare you – we’ll take it apart piece by piece later, and it turns out to be a fairly intuitive idea.

Before we get to how EEVDF chooses, though, it helps to know when it even gets the chance to – the moments a running task lets go of the CPU.

When Does a Running Task Stop Running?

Say a task is happily running on a CPU. What makes it stop so something else can run? This is the part people usually hand-wave past, but it’s where the whole system actually lives. There are really only two ways a task gives up the CPU, and understanding both is most of understanding the scheduler. Let’s take the gentler one first.

Way One: It Voluntarily Blocks

The most common reason a task stops running is that it asks for something it can’t have yet. It calls read() on a socket with no data, takes a mutex someone else holds, calls sleep(), waits on a condition variable – anything that can’t complete immediately.

When that happens, deep inside the blocking primitive, the kernel sets the task’s state to “not runnable” and calls schedule() (kernel/sched/core.c:7273). This is the task voluntarily saying “I have nothing to do right now, give the CPU to someone else.”
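The kernel keeps a per-process count of these voluntary hand-offs, and you can watch it move. The sketch below assumes a Linux system where getrusage() reports the ru_nvcsw field (voluntary context switches); the function name and the default nap settings are made up for illustration.

```python
import resource
import time


def voluntary_switches_during(naps=5, nap_seconds=0.01):
    """Count how many voluntary context switches this process records
    while it deliberately blocks in sleep() a few times."""
    before = resource.getrusage(resource.RUSAGE_SELF).ru_nvcsw
    for _ in range(naps):
        # Each sleep is a request that cannot complete immediately, so
        # the task blocks and gives up the CPU until the timer fires.
        time.sleep(nap_seconds)
    after = resource.getrusage(resource.RUSAGE_SELF).ru_nvcsw
    return after - before
```

On a typical Linux machine the result should be at least the number of naps, since every sleep gives up the CPU once. Some sandboxed or virtualized environments do not fill in this counter, in which case the difference may simply be zero.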

Source: Hacker News