Writing string.h functions using string instructions in asm x86-64 (2025)

An in-depth look at how modern compilers leverage x86-64 string instructions to optimize memory functions, and a guide to manual assembly implementation for peak performance.

Introduction

The C standard library offers a bunch of functions (declared in the string.h header) to manage null-terminated strings and arrays. These are some of the most used C functions, and because they are crucial to the speed of programs, C compilers often implement them as builtins.

On the other hand, the x86 architecture contains “string instructions”, aimed at implementing operations on strings at the hardware level. Moreover, the x86 architecture was incrementally enhanced with SIMD instructions over the years, allowing for the processing of multiple bytes of data in a single instruction.

In this article, we’ll inspect how the GNU C Library (glibc) implements string.h for x86 and compare it with pure assembly implementations of these functions built on string instructions and SIMD. Along the way, we’ll try to explain the choices made by the GNU developers and help you write better assembly.

Disassembling a call to memcpy

One of the most popular C functions is memcpy. It copies an array of bytes to another, an operation so common that its performance is particularly important.

There are several ways you can perform this operation using x86 asm. Let’s see how it is implemented by gcc using this simple C program:

#include <stdlib.h> /* EXIT_SUCCESS */
#include <string.h> /* memcpy */
#define BUF_LEN 1024
char a[BUF_LEN];
char b[BUF_LEN];
int main(void) {
    memcpy(b, a, BUF_LEN);
    return EXIT_SUCCESS;
}

We can inspect the generated asm on godbolt, or compile the code ourselves with gcc 14.2: gcc -O1 -g -o string main.c.

And then disassemble the executable using: objdump --source-comment="; " --disassembler-color=extended --disassembler-options=intel --no-show-raw-insn --disassemble=main string

You should get this result:

0000000000401134 <main>:
;
; int main(int argc, char *argv[]) {
; memcpy(b, a, BUF_LEN);
401134: mov esi,0x404440
401139: mov edi,0x404040
40113e: mov ecx,0x80
401143: rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi]
; return 0;
; }
401146: mov eax,0x0
40114b: ret

The first surprising thing you notice is that the machine code does not contain any call to the memcpy function. It has been replaced by 3 mov instructions preceding a mysterious rep movsq instruction.

rep movsq is the quadword form of movs combined with the rep prefix. movs is one of the five general-purpose string instructions defined in the “Intel® 64 and IA-32 Architectures Software Developer’s Manual”.

The string instructions of x86

String instructions perform operations on the array elements pointed to by rsi (source register) and rdi (destination register).

| Instruction | Description | Effect on registers |
|---|---|---|
| movs | Move string | *(rdi++) = *(rsi++) |
| cmps | Compare string | cmp *(rsi++), *(rdi++) |
| scas | Scan string | cmp rax, *(rdi++) |
| lods | Load string | rax = *(rsi++) |
| stos | Store string | *(rdi++) = rax |

Each of these instructions must have a suffix (b, w, d, q) indicating the type of the elements pointed to by rdi and rsi (byte, word, doubleword, quadword).

These instructions may also have a prefix indicating how to repeat themselves.

| Prefix | Description | Effect on registers |
|---|---|---|
| rep | Repeat while the rcx register is not zero | for(; rcx != 0; rcx--) |
| repe/repz | Repeat while the rcx register is not zero and the ZF flag is set | for(; rcx != 0 && ZF == true; rcx--) |
| repne/repnz | Repeat while the rcx register is not zero and the ZF flag is clear | for(; rcx != 0 && ZF == false; rcx--) |
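
One detail the loop shorthand in the table hides is the order of the checks: rcx is tested before each iteration, while ZF is only tested after the string instruction has run and rcx has been decremented. The sketch below models repne scasb in plain C to make that order explicit. It is an illustration under assumptions, not real library code: the names scas_state, repne_scasb and find_byte are hypothetical, and the direction flag is assumed to be clear.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the registers and flag used by `repne scasb`. */
struct scas_state {
    const uint8_t *rdi; /* destination pointer, advanced after each compare */
    size_t rcx;         /* remaining iteration count */
    int zf;             /* zero flag: 1 when the last compare found equality */
};

/* Models `repne scasb` with the direction flag clear. If rcx is 0 on entry,
 * nothing runs and ZF keeps whatever value it had before. */
void repne_scasb(struct scas_state *s, uint8_t al)
{
    while (s->rcx != 0) {            /* rcx is tested before each iteration */
        s->zf = (al == *s->rdi);     /* scasb: cmp al, [rdi] */
        s->rdi++;                    /* rdi advances even on a match */
        s->rcx--;                    /* rcx is decremented before ZF is tested */
        if (s->zf)
            break;                   /* repne: stop as soon as ZF is set */
    }
}

/* memchr-like helper built on the model: returns the index of the first
 * byte equal to c, or n when there is none. */
size_t find_byte(const uint8_t *buf, size_t n, uint8_t c)
{
    assert(buf != NULL || n == 0);
    struct scas_state s = { buf, n, 0 };
    repne_scasb(&s, c);
    /* On a match, rdi already points one element past the matching byte. */
    return s.zf ? (size_t)(s.rdi - buf) - 1 : n;
}
```

Because rdi is incremented even on the matching iteration, code that uses repne scasb to locate a byte has to step back by one element, as find_byte does here.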

The movs instruction

Now that we have learned more about the string instructions, we can break down the effect of the rep movsq instruction:

Copy the quadword pointed to by rsi to the address held in rdi
Add 8 to rsi and rdi so that they point to the next quadword (assuming the direction flag DF is clear)
Decrement rcx and repeat until rcx == 0

This is what we would expect memcpy to do, except for one thing: bytes are not copied one by one, but in blocks of 8. Here, as the size of our arrays in bytes is a multiple of 8, we can copy the source array as an array of quadwords. This takes eight times fewer iterations than copying the array one byte at a time: 1024 / 8 = 128, which is the 0x80 loaded into ecx.
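
To make the breakdown concrete, here is a minimal C sketch of what rep movsq does with the registers set up in the disassembly. It is a model, not the code gcc emits: the function name copy_quadwords is hypothetical, the direction flag is assumed to be clear, and the length is assumed to be a multiple of 8 as in our example. The inner memcpy calls are only the portable C way to load and store a possibly unaligned quadword.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of `rep movsq` with DF clear.
 * Assumption: n is a multiple of 8, as with BUF_LEN = 1024. */
void copy_quadwords(void *dst, const void *src, size_t n)
{
    assert(n % 8 == 0);

    unsigned char *rdi = dst;         /* mov edi, <address of b> */
    const unsigned char *rsi = src;   /* mov esi, <address of a> */
    size_t rcx = n / 8;               /* mov ecx, 0x80 when n == 1024 */

    for (; rcx != 0; rcx--) {         /* rep */
        uint64_t q;
        memcpy(&q, rsi, sizeof q);    /* load the quadword at [rsi] */
        memcpy(rdi, &q, sizeof q);    /* store it at [rdi] */
        rsi += 8;                     /* movsq advances both pointers by 8 */
        rdi += 8;
    }
}
```

A general-purpose memcpy also has to handle lengths that are not a multiple of 8, for instance by copying the remaining bytes one at a time, which this sketch deliberately leaves out.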

The cmps instruction

The cmps instruction compares the elements pointed to by rsi and rdi and sets the flags accordingly. Since cmps sets the ZF flag, we can combine it with the repe/repz prefix to continue until the elements differ, or with the repne/repnz prefix to continue until matching elements are encountered.

Let’s write a basic memcmp function using this instruction:

; int memcmp_cmpsb(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n);
memcmp_cmpsb:
mov rcx, rdx ; rcx = n
xor eax, eax ; Set return value to zero
xor edx, edx ; rdx = 0 (also sets ZF and clears CF, which covers n == 0)
repe cmpsb ; for(; rcx != 0 and ZF == true; rcx--) cmp *(rsi++), *(rdi++)
setb al ; if(ZF == false and CF == true) al = 1
seta dl ; if(ZF == false and CF == false) dl = 1
sub eax, edx ; return al - dl
ret
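
The flag logic at the end of this function is easy to get backwards, so here is a hedged C model of it. The name memcmp_model is hypothetical, and the zf and cf variables stand in for the real flags. The key point is that cmps computes [rsi] - [rdi], so CF is set when the byte of s2 is lower than the byte of s1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C model of the assembly above; zf and cf stand in for the
 * CPU flags, and the pointer names mirror the registers. */
int memcmp_model(const void *s1, const void *s2, size_t n)
{
    assert((s1 != NULL && s2 != NULL) || n == 0);

    const unsigned char *rdi = s1;
    const unsigned char *rsi = s2;
    int zf = 1, cf = 0;              /* flags left by `xor edx, edx` */

    for (size_t rcx = n; rcx != 0; ) {
        unsigned char a = *rsi++;    /* first operand of cmps: [rsi] */
        unsigned char b = *rdi++;    /* second operand of cmps: [rdi] */
        zf = (a == b);               /* ZF: the bytes are equal */
        cf = (a < b);                /* CF: unsigned borrow of [rsi] - [rdi] */
        rcx--;
        if (!zf)
            break;                   /* repe stops at the first difference */
    }

    int al = cf;                     /* setb al: CF == 1 */
    int dl = !cf && !zf;             /* seta dl: CF == 0 and ZF == 0 */
    return al - dl;                  /* 1 if s1 > s2, -1 if s1 < s2, else 0 */
}
```

With n == 0 the loop never runs, and the flags set by the xor make both setb and seta produce 0, so the function returns 0 as memcmp should.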

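When comparing eight bytes at a time with cmpsq instead of cmpsb, the flags left by the last comparison are not enough to get the result: we need to compare the last two quadwords ourselves. On little-endian systems, the least significant byte of a quadword is the first one in memory, whereas we want to compare the bytes in lexical order. Hence the need to convert the quadwords to big-endian using the bswap instruction. The bzhi instruction (part of the BMI2 extension) is useful when you need to mask out the higher bits of a register: it zeroes every bit from a given bit index upward.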
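
As an illustration of why the byte swap matters, here is a small C sketch that compares two 8-byte blocks in memcmp order. It is a sketch under assumptions rather than the article's implementation: the names bzhi64 and compare_quadword_prefix are hypothetical, the host is assumed to be little-endian (as x86-64 is), __builtin_bswap64 is a GCC/Clang builtin standing in for bswap, and using the mask to drop trailing bytes is one plausible use of bzhi.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of `bzhi dst, src, index`: clear every bit at position >= index.
 * An index of 64 or more leaves the value unchanged. */
uint64_t bzhi64(uint64_t x, unsigned index)
{
    return index >= 64 ? x : x & ((UINT64_C(1) << index) - 1);
}

/* Compare the first len (1..8) bytes of two blocks in memcmp order.
 * Assumptions: little-endian host, and 8 readable bytes behind each pointer. */
int compare_quadword_prefix(const void *p1, const void *p2, unsigned len)
{
    assert(len >= 1 && len <= 8);

    uint64_t a, b;
    memcpy(&a, p1, sizeof a);     /* like mov rax, [rdi] */
    memcpy(&b, p2, sizeof b);     /* like mov rdx, [rsi] */

    /* On little-endian, the bytes past len sit in the high bits: drop them. */
    a = bzhi64(a, 8 * len);
    b = bzhi64(b, 8 * len);

    /* bswap: the first byte in memory becomes the most significant one, so
     * an unsigned integer comparison now follows lexical byte order. */
    a = __builtin_bswap64(a);
    b = __builtin_bswap64(b);

    return (a > b) - (a < b);     /* 1, -1 or 0, like the asm version */
}
```

Without the swap, a block starting with byte 2 followed by zeros would compare lower than a block starting with byte 1 followed by larger bytes, which is the opposite of what memcmp must return.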

Source: Hacker News