My first patch to the Linux kernel

A deep dive into how a subtle sign-extension bug in C led to system crashes during hypervisor development and eventually became the author's first contribution to the Linux kernel.


How a sign-extension bug in C made me pull my hair out for days but became my first patch to the Linux kernel!

Intro

A while ago, I started dipping my toes into virtualization. It's a topic that many people have heard of or use on a daily basis, but few know or think about how it works under the hood.

I like to learn by reinventing the wheel, and naturally, to learn virtualization I started by trying to build a Type-2 hypervisor. This approach is similar to how KVM (Linux) or bhyve (FreeBSD) are built.

Since virtualization is hardware-assisted these days, the hypervisor needs to communicate directly with the CPU by running certain privileged instructions. This means a Type-2 hypervisor is essentially a kernel module that exposes an API to user space, where a Virtual Machine Monitor (VMM) like QEMU or Firecracker runs and orchestrates VMs through that API.

In this post, I want to describe exactly how I found that bug. But to make it a bit more educational, I'm going to set the stage first and talk about a few core concepts so you can see exactly where the bug emerges.

x86 Task State Segment (TSS)

The x86 architecture in protected mode (32-bit mode) envisions a task-switching mechanism that is facilitated by the hardware. The architecture defines a Task State Segment (TSS), a region of memory that holds information about a task (general-purpose registers, segment registers, etc.). The idea was that any given task or thread would have its own TSS, and when a switch happens, a specific register (the Task Register, or TR) would get updated to point to the new task.

This was abandoned in favor of software-defined task switching which gives more granular control and portability to the operating system kernel.

But the TSS was not entirely abandoned. On modern 64-bit systems, the kernel uses a TSS-per-core approach where the main job of the TSS is to hold a few stack pointers that are critical for the normal operation of the kernel and the CPU. More specifically, it holds the kernel stack pointer of the current thread, which is used when the system switches from user space to kernel space.

It also holds a few known good stacks for critical events like Non-Maskable Interrupts (NMIs) and Double Faults. These are events that, if not handled correctly, can cause a triple fault and crash a CPU core or cause an immediate system reboot.
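For reference, the 64-bit TSS layout described above can be sketched as a C struct. This is a minimal illustration based on the architectural layout, not code taken from the kernel: the name tss64_sketch and its field names are made up for this post, and the packed attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the 64-bit TSS layout (104 bytes). */
struct tss64_sketch {
    uint32_t reserved0;
    uint64_t rsp[3];      /* RSP0..RSP2: stacks for privilege levels 0..2 */
    uint64_t reserved1;
    uint64_t ist[7];      /* IST1..IST7: Interrupt Stack Table entries */
    uint64_t reserved2;
    uint16_t reserved3;
    uint16_t iomap_base;  /* offset of the I/O permission bitmap */
} __attribute__((packed));

/* Compile-time checks that the sketch matches the expected offsets. */
static_assert(sizeof(struct tss64_sketch) == 104, "64-bit TSS is 104 bytes");
static_assert(offsetof(struct tss64_sketch, rsp) == 4, "RSP0 starts at byte 4");
static_assert(offsetof(struct tss64_sketch, ist) == 36, "IST1 starts at byte 36");
```

In this sketch, rsp[0] is the slot that holds the kernel stack pointer used on a user-to-kernel transition, and the ist entries are the known good stacks for events such as NMIs and double faults.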

We know that memory access is generally expensive, so caching values somewhere on the CPU die is the preferred approach where possible. This is where the TR register comes into the picture. It has a visible part, a 16-bit segment selector that acts as an offset into the Global Descriptor Table (GDT), as well as a hidden part that holds direct information about the TSS (base address, limit, and access rights).
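To make the visible part of TR a bit more concrete, here is a minimal sketch of how a 16-bit x86 segment selector breaks down into its fields. The struct and function names are hypothetical and exist only for illustration.

```c
#include <stdint.h>

/* Hypothetical helper type: the three fields packed into a selector. */
struct selector_fields {
    uint16_t index; /* bits 15..3: descriptor index within the table */
    uint8_t  ti;    /* bit 2: table indicator, 0 = GDT, 1 = LDT */
    uint8_t  rpl;   /* bits 1..0: requested privilege level */
};

static struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;

    f.index = (uint16_t)(sel >> 3);
    f.ti    = (uint8_t)((sel >> 2) & 1u);
    f.rpl   = (uint8_t)(sel & 3u);
    return f;
}

/* Byte offset of the descriptor in its table: index * 8, which is the
 * selector with its low three bits cleared. */
static uint16_t selector_to_offset(uint16_t sel)
{
    return (uint16_t)(sel & 0xFFF8u);
}
```

For example, a selector of 0x40 decodes to index 8 in the GDT with RPL 0, and its byte offset is also 0x40. When the TI and RPL bits are zero, the raw selector value can be used directly as a byte offset into the GDT.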

Why do hypervisors care?

A hypervisor is essentially a task switcher where tasks are operating systems. In order for multiple operating systems to run on the same silicon chip, the hypervisor must swap the entire state of the CPU which includes updating the hidden part of the TR register as well.

In Intel's virtualization extension (VT-x), each vCPU is given its own VMCS (Virtual Machine Control Structure), a block that the hardware saves the vCPU's state to and restores it from when switching between the host and guest OSes. The VMCS consists of four main areas:

Host-state area
Guest-state area
Control fields
VM-exit information area

The host-state area has fields that correspond to the TR register. It is the hypervisor's job to set these values on the initial run and to update them when needed. To set these values, I "borrowed" some code from the Linux kernel tree (KVM selftests):

vmwrite(HOST_TR_BASE,
        get_desc64_base((struct desc64 *)(get_gdt().address + get_tr())));

If for any reason this operation fails to extract and write the correct address, upon the next context switch, the CPU will eventually face a double fault and then a triple fault, causing a system reboot.
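To show what "extracting the address" involves, here is a minimal sketch of how the 64-bit base of a 16-byte system descriptor (such as a TSS descriptor) is stitched together from four separate fields. The struct sys_desc64_sketch and the function sys_desc64_base are hypothetical names for illustration; this is not the KVM selftests implementation, and the field names are my own.

```c
#include <stdint.h>

/* Hypothetical sketch of a 16-byte system descriptor in 64-bit mode. */
struct sys_desc64_sketch {
    uint16_t limit_lo;        /* limit bits 15..0 */
    uint16_t base_lo;         /* base bits 15..0 */
    uint8_t  base_mid;        /* base bits 23..16 */
    uint8_t  access;          /* type, S, DPL, P */
    uint8_t  limit_hi_flags;  /* limit bits 19..16 plus flag bits */
    uint8_t  base_hi;         /* base bits 31..24 */
    uint32_t base_upper;      /* base bits 63..32 */
    uint32_t reserved;
};

static uint64_t sys_desc64_base(const struct sys_desc64_sketch *d)
{
    /* Widen every piece to 64 bits before shifting. Operands narrower
     * than int are promoted to signed int first, so shifting a byte
     * with its top bit set into bit 31 could otherwise end up
     * sign-extended when merged into the 64-bit result. */
    return (uint64_t)d->base_lo |
           ((uint64_t)d->base_mid << 16) |
           ((uint64_t)d->base_hi << 24) |
           ((uint64_t)d->base_upper << 32);
}
```

With base_lo = 0x3000, base_mid = 0x12, base_hi = 0x80 and base_upper = 0xfffffe5a (made-up values), this returns 0xfffffe5a80123000. The explicit widening is a general C precaution in this sketch, since every piece of the base passes through a shift on its way into the final value.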

Symptoms

I started developing my hypervisor on a virtualized instance of Fedora. In that virtualized dev environment, with only three vCPUs, everything worked just fine, until I decided to give it a try on my main machine, where the hypervisor would talk to an actual physical CPU.

And BOOM! Seconds after running, the system crashed. After investigating the kernel logs, I saw a pattern: an NMI triggered a VM-exit on CPU 5, the hardware tried to locate a valid kernel stack from the TSS, and it hit a fatal page fault while attempting to read an unmapped memory address, resulting in a kernel Oops.

Source: Hacker News