System call instrumentation on Linux/x86‑64 using memory‑indirect calls, part I

An in-depth analysis of system call instrumentation techniques on Linux x86-64, exploring the limitations of double-trap overhead, modern solutions like zpoline, and a novel approach leveraging x86 segmentation.


My libsystrap library provides a simple instrumentation of system calls in Linux x86-64 userland. However, its current implementation suffers a double-trap overhead: system calls become ud2, which generates a SIGILL trap. Then we run the system call itself from within the signal handler, causing a second trap and some interesting tricky cases.

There has been some interesting research in this space in recent years, including the Liteinst “instruction punning” paper, the closely related E9Patch paper (though neither is specifically about system call instrumentation), later the “zpoline” paper (which definitely is), and some follow-ups for making the latter more robust (lazypoline, K23).

The core problem that all these approaches are solving is a pure accident of the Intel instruction encoding: all useful jump instructions are at least 5 bytes long, whereas often we want to patch smaller instructions, such as system call instructions which are all (essentially) two bytes long. So if you want to replace a system call with a jump, you have a problem.

The idea of instruction punning, simplifying horribly and specialising it to the system-call problem (it is more general), is that if we have an instruction sequence containing a two-byte system call (here using the syscall instruction, 0f 05):

... 0f 05 xx yy zz ...

then when we make it into a jump or call, we might be able to work with the bytes of the next instruction, since they form part of the relative jump offset. In fact we have one free byte to play with:

... e9 WW xx yy zz ...

i.e. we leave the xx, yy and zz bytes alone because they belong to the next instruction(s), but we can change WW. WW xx yy zz will be interpreted as a 32-bit displacement, and ideally we simply place some kind of trampoline code wherever that lands.

Unfortunately, with the machine being little-endian, WW is the least significant byte, so the jump target is fixed except for 256 bytes of wiggle room. It demands a statistical approach: as long as the high-order byte is not zero or very small, we have a good chance of jumping far enough away to land at some memory that is available to use. If not, we can fall back on a signal-generating option like ud2, or do something else. The E9Patch paper presents some head-twisting compound versions of instruction punning for increasing its coverage in such scenarios, without resorting to trapping approaches like ud2. Meanwhile, this scattered nature of trampolines will require a lot of virtual address space, roughly one page per patch site, but we can play virtual memory tricks to colocate multiple trampolines on the same physical page (the E9Patch tool also does this).
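To make the "256 bytes of wiggle room" concrete, here is a minimal sketch of the arithmetic, assuming a 5-byte e9 jump whose displacement is taken relative to the end of the jump. The function name and the example addresses are mine, purely for illustration; none of this is code from Liteinst or E9Patch.

```python
def punned_jump_targets(site: int, following: bytes) -> tuple[int, int]:
    """Lowest and highest targets reachable by `e9 WW xx yy zz` placed at `site`.

    `following` holds the three bytes (xx, yy, zz) that come after the
    two-byte syscall; they belong to later instructions and must not change.
    """
    if len(following) != 3:
        raise ValueError("expected exactly three following bytes")
    next_insn = site + 5  # rel32 is relative to the end of the 5-byte jmp
    targets = []
    for ww in (0x00, 0xFF):  # the only byte we are free to choose
        disp = int.from_bytes(bytes([ww]) + following, "little", signed=True)
        targets.append((next_insn + disp) % 2**64)  # wrap like a 64-bit %rip
    return targets[0], targets[1]


if __name__ == "__main__":
    lo, hi = punned_jump_targets(0x401000, bytes([0x00, 0x10, 0x20]))
    print(hex(lo), hex(hi))
```

With following bytes 00 10 20 the jump lands somewhere around 0x20501005, comfortably far away. With three zero bytes it lands within a few hundred bytes of the patch site itself, which is the unlucky case that needs a fallback.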

The idea of zpoline is cleaner and does not rely on punning or statistical approaches. It's quite clever. We can always replace a 2-byte system call with

ff d0 call *%rax

... which will generate a call to a small nonnegative address, because %rax must be holding the system call number i.e. a small nonnegative integer. That's neat but it means you have to map some instructions at the very bottom page (address zero), which undoes the standard hardware-enforced protection against null pointer accesses. The paper suggests mitigating this by (1) using Intel memory protection keys to make this memory execute-only, and (2) catching “jump to null pointer” bugs by validating the return address against a bitmap or hash table recording the known patched system call sites. However, this is still non-ideal: many processors don't support memory protection keys, validating the return address takes time, and on Linux, mapping low memory requires system privileges. The approach also behaves unpredictably if buggy code invokes a system call with a high value in %rax, whereas the kernel would fail cleanly (with ENOSYS).
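As a rough illustration of the rewrite step only (not of zpoline's actual implementation), the following sketch patches one known system call site in a byte buffer and models where the resulting call would land. The function names and the 512-byte trampoline length are hypothetical choices of mine.

```python
SYSCALL = bytes([0x0F, 0x05])   # syscall
CALL_RAX = bytes([0xFF, 0xD0])  # call *%rax


def patch_syscall_site(code: bytearray, offset: int) -> None:
    """Overwrite the two-byte syscall at `offset` with `call *%rax`.

    The caller is assumed to know that `offset` really is an instruction
    boundary; blindly scanning for 0f 05 would also match data and the
    middles of longer instructions.
    """
    if code[offset:offset + 2] != SYSCALL:
        raise ValueError("no syscall instruction at this offset")
    code[offset:offset + 2] = CALL_RAX


def lands_in_trampoline(rax: int, trampoline_len: int = 512) -> bool:
    """After patching, the call target is simply the value in %rax.

    `trampoline_len` is a made-up size for the code mapped at address zero.
    """
    return 0 <= rax < trampoline_len


if __name__ == "__main__":
    # mov $39,%eax ; syscall ; ret
    code = bytearray(b"\xb8\x27\x00\x00\x00\x0f\x05\xc3")
    patch_syscall_site(code, 5)
    print(code.hex(" "))
```

The second function shows the failure mode mentioned above: a bogus large value in %rax sends control somewhere outside the trampoline, rather than producing a clean ENOSYS.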

The zpoline work made me think: can we find similar tricks with different trade-offs by exploring other corners of the instruction encoding? In x86 I have always been fascinated by the segmentation features, so I was minded to explore there. All x86 processors, even 64-bit ones, always run with some form of segmentation permanently enabled. In protected mode, all memory accesses are first translated through one of two segment descriptor tables, global (system-wide) and local (typically per-process). These tables select the linear virtual address that is then pushed through the page tables, as a second layer of translation. Linux lets us modify the process's local descriptor table using the modify_ldt() system call. Could we find a 2-byte form that will indirect through this table to reach, somehow, our intended system call instrumentation?

Spoiler: sort of, but not really as I hoped. Nevertheless, I learned quite a bit, the near misses are fun enough to go into, and there may be some benefits worth having.

Betraying how poorly I understood x86 segmentation at first, I was optimistically hoping that we could use a two-byte “long call” instruction (lcall in AT&T syntax, call far in Intel), perhaps (naively) something like this:

ff 18 lcall *(%rax)

to perform an emulated/instrumented system call via the LDT entry whose selector is stored in %rax. Sadly that is not what the lcall instruction does.

(Among other glaring alarm bells, it would be odd to store 16-bit selectors in a full-width register like %rax. Also, the “l for long” of “lcall” is of course not the same “L for Local” of “LDT”. I'll explain all this....)

Backing up slightly: this lcall instruction lets us call into a different code segment, instead of the usual “near call” which stays within one segment. What we think of as a memory address in x86 (whether 16-, 32- or 64-bit) is really an offset into a segment; it's just that in flat memory models (which were the choice of 32-bit Unix environments, and are forced upon us in 64-bit mode) the segment base address happens to be 0. Similarly, an indirect near call in 64-bit mode, such as

ff d0 call *%rax

consumes as its destination operand not a pointer but a 64-bit offset within the current code segment.

A far call target is specified by not only an offset but also a 16-bit segment selector. This indexes into either the global or local table of segment descriptors (the actual definitions of the segments, roughly base/limit pairs with permissions), the table being chosen by one of the three reserved low bits of the selector value. Each table may contain up to 8192 entries, accounting for the remaining 13 bits.
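The selector layout can be sketched in a few lines. This assumes the standard x86 arrangement: the low two bits are the requested privilege level, bit 2 is the table indicator (0 for global, 1 for local), and the top 13 bits are the index. The names below are my own.

```python
from typing import NamedTuple


class Selector(NamedTuple):
    index: int  # which descriptor, 0..8191
    table: str  # "GDT" or "LDT"
    rpl: int    # requested privilege level, 0..3


def decode_selector(sel: int) -> Selector:
    """Split a 16-bit segment selector into its three fields."""
    if not 0 <= sel <= 0xFFFF:
        raise ValueError("selectors are 16 bits wide")
    table = "LDT" if sel & 0b100 else "GDT"
    return Selector(index=sel >> 3, table=table, rpl=sel & 0b11)


def encode_selector(index: int, table: str, rpl: int = 3) -> int:
    """Inverse of decode_selector."""
    if not 0 <= index < 8192 or table not in ("GDT", "LDT") or not 0 <= rpl <= 3:
        raise ValueError("field out of range")
    return (index << 3) | (0b100 if table == "LDT" else 0) | rpl


if __name__ == "__main__":
    print(decode_selector(0x33))
    print(hex(encode_selector(1, "LDT")))
```

For example, 0x33 decodes to entry 6 of the global table at privilege level 3, and entry 1 of the local table at privilege level 3 encodes as 0x0f.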

I should recap a not-so-obvious bit of Intel assembly. All indirect calls jump to some memory location, but there are two forms: register-indirect and memory-indirect. The latter are doubly indirect, in that a memory location is itself specified using a register. That is the memory location from which the call target address is loaded; I'll call this a “stepping stone” location, although there is probably a better term. (As far as I know, memory-indirect jumps and calls are the only memory-indirect operations in the entire Intel ISA.)

ff d0 call *%rax
ff 10 call *(%rax)

The first of the above does the obvious (register-indirect) thing: call the address (or rather, offset within the code segment) held in register %rax. The second one adds the additional layer of indirection: the address (sorry, offset) to be called is itself loaded from memory: from the location whose address (sorry, offset) is held in register %rax.
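A toy model may help keep the two levels of indirection apart. It treats memory as a flat byte buffer and assumes a 64-bit little-endian load for the memory-indirect case. Both function names are invented for this sketch.

```python
import struct


def call_register_indirect(rax: int) -> int:
    """call *%rax: the target offset is the register's value itself."""
    return rax


def call_memory_indirect(memory: bytes, rax: int) -> int:
    """call *(%rax): %rax names a "stepping stone" location, and the
    target offset is the 8-byte little-endian value loaded from there."""
    (target,) = struct.unpack_from("<Q", memory, rax)
    return target


if __name__ == "__main__":
    memory = bytearray(32)
    struct.pack_into("<Q", memory, 16, 0x401234)  # stepping stone at offset 16
    print(hex(call_register_indirect(16)))        # jumps to 0x10
    print(hex(call_memory_indirect(memory, 16)))  # jumps to 0x401234
```

With the same value in %rax, the first form transfers control to offset 16, while the second transfers control to whatever offset is stored at offset 16.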

And of course there are two forms of memory-indirect call: near and far.

ff 10 call *(%rax)
ff 18 lcall *(%rax)
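The second byte in each of these encodings is a ModRM byte, and pulling it apart shows how closely related the forms are. The sketch below assumes the usual ModRM layout (two mode bits, a three-bit opcode extension, three register bits) and deliberately covers only the plain register and plain (%reg) addressing forms, not displacements, SIB bytes or %rip-relative operands. The helper name is hypothetical.

```python
REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]
MNEMONIC = {2: "call", 3: "lcall"}  # ff /2 is a near call, ff /3 a far call


def describe_ff(modrm: int) -> str:
    """Render `ff <modrm>` in AT&T syntax, for the simple forms only."""
    if not 0 <= modrm <= 0xFF:
        raise ValueError("ModRM is a single byte")
    mod, ext, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if ext not in MNEMONIC:
        raise ValueError("not an indirect call encoding")
    if mod == 3:  # operand is the register itself
        if ext == 3:
            raise ValueError("far call needs a memory operand")
        return f"call *%{REGS[rm]}"
    if mod == 0 and rm not in (4, 5):  # operand is memory at (%reg)
        return f"{MNEMONIC[ext]} *(%{REGS[rm]})"
    raise ValueError("addressing form not covered by this sketch")


if __name__ == "__main__":
    for byte in (0xD0, 0x10, 0x18):
        print(f"ff {byte:02x}  {describe_ff(byte)}")
```

The three encodings differ only in the mode bits (register versus memory) and the opcode extension (near versus far).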

There is no simple register-indirect form of far call. You might think this is because a register isn't big enough to hold a complete far address, i.e. segment selector and offset. However, we'll see in a moment that that doesn't explain it.

(There is an absolute far call, where the full far address appears in the instruction as an immediate operand. That is not availabl

Source: Hacker News