Zero-Copy Pages in Rust: Or How I Learned to Stop Worrying and Love Lifetimes

Zero-copy techniques in Rust allow database engines to eliminate redundant CPU copies between the kernel and user space, significantly boosting performance by leveraging Rust's lifetime system.

You can find the source code for the project here

Zero-copy is a way to elide CPU copies between the kernel and user space buffers that is particularly useful in high throughput applications like database engines. It makes a huge difference in performance under high load, particularly when your working set is no longer cache resident.

What Is Zero-Copy

Here is what a typical database engine looks like. For this post, focus on two copy boundaries: the OS boundary, and the path from the buffer pool into the layers above it.

┌─────────────────────────────────────────────────────────┐
│                       Query Layer                       │
└────────────────────────┬────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────┐
│                    Execution Engine                     │
└────────────────────────┬────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────┐
│                   Transaction Manager                   │
└──────────┬─────────────┴─────────────────┬──────────────┘
           │                               │
┌──────────▼──────────┐       ┌────────────▼────────────┐
│    Lock Manager     │       │       Log Manager       │
└─────────────────────┘       └─────────────────────────┘
                         │ fresh copies into higher layers
┌────────────────────────▼────────────────────────────────┐
│                       Buffer Pool                       │
└────────────────────────┬────────────────────────────────┘
                         │ copy at OS boundary
┌────────────────────────▼────────────────────────────────┐
│                          Disk                           │
└─────────────────────────────────────────────────────────┘

Building a high-performance engine means eliding non-useful work wherever possible, and copying data falls squarely in this category. Think of each copy operation as the equivalent of a memcpy(), which requires the CPU to copy data from a source and put it into a destination. memcpy can also cause pipeline stalls, which is something you want to avoid in high-performance applications. You’re spending cycles on non-essential work, and this can evict hot data from CPU caches.

Now, let’s focus on eliminating copies at the layer between the buffer pool and disk first.

The Buffer Pool And Direct IO

The buffer pool opens and stores file descriptors with the open() syscall. When we call read() and write() on those file descriptors, the data goes through the whole cycle you saw earlier, with copies between user space, the kernel, and DMA.
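As a rough sketch of that buffered path, assuming 4 KiB pages stored back to back in a single file, a page read through the page cache might look like the following. The read_page helper and the PAGE_SIZE constant are illustrative names, not taken from the project, and the positional-read API used here is Unix-only.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt; // positional reads (pread) on Unix

/// Assumed page size for this sketch.
pub const PAGE_SIZE: usize = 4096;

/// Reads page `page_id` into `buf` with a positional read.
/// Without O_DIRECT, the kernel serves this from its page cache and then
/// copies the bytes into `buf`: that is the copy at the OS boundary.
pub fn read_page(file: &File, page_id: u64, buf: &mut [u8; PAGE_SIZE]) -> io::Result<()> {
    let offset = page_id * PAGE_SIZE as u64;
    file.read_exact_at(buf, offset)
}
```

Reading a page past the end of the file returns an error rather than a short buffer, because read_exact_at insists on filling the whole page.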

An easy win here is to use direct IO with the O_DIRECT flag, which makes reads and writes bypass the OS page cache. O_DIRECT requires the address of each submitted buffer, the I/O length, and the file offset to all be suitably aligned, typically to the logical block size of the underlying device. In Rust, we guarantee the first with #[repr(align(4096))] on the buffer holding our page, and 4 KiB page-sized reads and writes at page-aligned offsets satisfy the rest. Without this, O_DIRECT reads or writes would often fail with EINVAL.
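To make the three alignment requirements concrete, here is a minimal sketch that assumes a 4096-byte alignment. AlignedPage and is_direct_io_aligned are hypothetical names for illustration; the sketch only checks alignment and does not open a file with O_DIRECT.

```rust
/// Assumed page size and alignment for this sketch (4 KiB).
pub const PAGE_SIZE: usize = 4096;

/// A page buffer whose start address is always a multiple of 4096.
#[repr(C, align(4096))]
pub struct AlignedPage {
    pub bytes: [u8; PAGE_SIZE],
}

impl AlignedPage {
    /// Heap-allocates a zeroed page; Box honors the type's alignment.
    pub fn zeroed() -> Box<Self> {
        Box::new(AlignedPage { bytes: [0u8; PAGE_SIZE] })
    }
}

/// Returns true when the buffer address, the I/O length, and the file
/// offset are all multiples of `align`, the three things O_DIRECT checks.
pub fn is_direct_io_aligned(buf: &[u8], offset: u64, align: usize) -> bool {
    let addr = buf.as_ptr() as usize;
    addr % align == 0 && buf.len() % align == 0 && offset % (align as u64) == 0
}
```

A whole AlignedPage at a page-aligned offset passes the check, while a sub-slice such as &page.bytes[1..] or an offset of 100 does not. Those are the kinds of requests that would be rejected with EINVAL.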

Since we’re bypassing the kernel page cache, we don’t get useful boosts like readahead or write coalescing, but this is exactly why a buffer pool is so important in a database. The buffer pool is a replacement for the OS page cache, designed with specific workloads in mind.

Eliminating Copies From The Read Path

So far, zero-copy has meant removing copies between the kernel and the buffer pool. From here on, I’m going to broaden it slightly to mean removing redundant copies inside the engine too. Rust has a great and terrible way to avoid dealing with copies of data: references. It’s great because it’s a single character (&); it’s terrible because now we have to learn to deal with lifetimes.
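The difference between an owned copy and a borrowed view can be sketched in a few lines. The function names and the PAGE_SIZE constant below are made up for illustration and are not from the project.

```rust
/// Assumed page size for this sketch.
pub const PAGE_SIZE: usize = 4096;

/// Owned copy: duplicates all PAGE_SIZE bytes (effectively a memcpy).
pub fn copy_page(page: &[u8; PAGE_SIZE]) -> Box<[u8; PAGE_SIZE]> {
    Box::new(*page)
}

/// Borrowed view: returns a slice that points into the caller's bytes.
/// The elided lifetime ties the result to `page`, so no bytes are moved
/// and the view cannot outlive the page it points into.
pub fn view_range(page: &[u8; PAGE_SIZE], start: usize, len: usize) -> &[u8] {
    &page[start..start + len]
}
```

The slice returned by view_range has the same address as the corresponding bytes of the original page, while copy_page hands back a fresh allocation at a different address.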

The simplest way to think about lifetimes is that you are proving to the compiler that any reference held by type A will not outlive the data it points to. Let’s start with defining the raw bytes for a single page like this:

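// Debug is derived so that BufferFrame, which wraps this type, can derive Debug too.
#[derive(Debug)]
pub struct PageBytes {
    // Raw contents of one page.
    bytes: [u8; PAGE_SIZE_BYTES as usize],
}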

Now, we’ll define the data that is held within a single buffer pool frame. The RwLock<T> type here is our page latch.

#[derive(Debug)]
pub struct BufferFrame {
    page: RwLock<PageBytes>,
}
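To show how the RwLock behaves as a page latch, here is a small self-contained sketch. The value of PAGE_SIZE_BYTES and the new, write_byte, and read_byte helpers are assumptions added for illustration; the original only shows the two struct definitions.

```rust
use std::sync::RwLock;

/// Assumed value; the post does not state how this constant is defined.
pub const PAGE_SIZE_BYTES: u64 = 4096;

#[derive(Debug)]
pub struct PageBytes {
    bytes: [u8; PAGE_SIZE_BYTES as usize],
}

#[derive(Debug)]
pub struct BufferFrame {
    page: RwLock<PageBytes>,
}

impl BufferFrame {
    /// Creates a frame holding a zeroed page (hypothetical helper).
    pub fn new() -> Self {
        BufferFrame {
            page: RwLock::new(PageBytes { bytes: [0u8; PAGE_SIZE_BYTES as usize] }),
        }
    }

    /// Takes the exclusive (write) latch and stores one byte in place.
    pub fn write_byte(&self, offset: usize, value: u8) {
        let mut page = self.page.write().expect("page latch poisoned");
        page.bytes[offset] = value;
    }

    /// Takes the shared (read) latch and returns one byte.
    pub fn read_byte(&self, offset: usize) -> u8 {
        let page = self.page.read().expect("page latch poisoned");
        page.bytes[offset]
    }
}
```

While any read latch is held, an attempt to take the write latch with try_write fails instead of blocking, which is the reader/writer exclusion a page latch needs.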

What we actually want is not ownership, but a borrowed view into bytes that already live somewhere else. We can model that by introducing a lifetime.

pub struct PageReadGuard<'a> {
    page: &'a PageBytes,
}

With this lifetime annotation, we are proving to the compiler that PageReadGuard will not outlive PageBytes, which means higher-level page objects can become views into existing bytes rather than owned copies. In the real implementation, the field is RwLockReadGuard<'a, PageBytes> rather than &'a PageBytes, but the ownership story is the same: the guard borrows the page bytes instead of owning them.
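A sketch of the RwLockReadGuard variant described above might look like this. The read and bytes methods, the new constructor, and the value of PAGE_SIZE_BYTES are assumptions for illustration and may differ from the real implementation.

```rust
use std::sync::{RwLock, RwLockReadGuard};

/// Assumed value; the post does not state how this constant is defined.
pub const PAGE_SIZE_BYTES: u64 = 4096;

pub struct PageBytes {
    bytes: [u8; PAGE_SIZE_BYTES as usize],
}

pub struct BufferFrame {
    page: RwLock<PageBytes>,
}

/// Holds the read latch for 'a and exposes the page bytes as a view.
pub struct PageReadGuard<'a> {
    page: RwLockReadGuard<'a, PageBytes>,
}

impl BufferFrame {
    /// Creates a frame holding a zeroed page (hypothetical helper).
    pub fn new() -> Self {
        BufferFrame {
            page: RwLock::new(PageBytes { bytes: [0u8; PAGE_SIZE_BYTES as usize] }),
        }
    }

    /// Takes the read latch; the guard borrows from `self`, so it cannot
    /// outlive the frame.
    pub fn read(&self) -> PageReadGuard<'_> {
        PageReadGuard { page: self.page.read().expect("page latch poisoned") }
    }
}

impl<'a> PageReadGuard<'a> {
    /// Borrowed view of the page; no bytes are copied.
    pub fn bytes(&self) -> &[u8] {
        &self.page.bytes
    }
}

// Rejected by the borrow checker, because the guard would outlive the frame:
// let guard = { let frame = BufferFrame::new(); frame.read() };
```

Two successive guards on the same frame report the same address from bytes(), which is a simple way to confirm that higher layers see the frame's memory rather than a copy. Dropping the guard releases the latch.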

Source: Hacker News