NOW LET US – AI RAG SaaS Studio TP.HCM

Writing string.h functions using string instructions in asm x86-64 (2025)


An in-depth look at how modern compilers leverage x86-64 string instructions to optimize memory functions, and a guide to manual assembly implementation for peak performance.


Introduction

The C standard library offers a set of functions (declared in the string.h header) to manage null-terminated strings and arrays. These are some of the most used C functions, and compilers often implement them as builtins because they are crucial to the speed of programs.

On the other hand, the x86 architecture contains “string instructions”, aimed at implementing operations on strings at the hardware level. Moreover, the x86 architecture was incrementally enhanced with SIMD instructions over the years, allowing for the processing of multiple bytes of data in a single instruction.

In this article, we’ll inspect the implementation of string.h in the GNU standard library for x86 and see how it compares with a pure assembly implementation of these functions using string instructions and SIMD. Along the way, we’ll try to explain the choices made by the GNU developers and help you write better assembly.

Disassembling a call to memcpy

One of the most popular C functions is memcpy. It copies an array of bytes to another, which is a very common operation and makes its performance particularly important.

There are several ways you can perform this operation using x86 asm. Let’s see how it is implemented by gcc using this simple C program:

#include <stdlib.h> /* EXIT_SUCCESS */
#include <string.h> /* memcpy */
#define BUF_LEN 1024
char a[BUF_LEN];
char b[BUF_LEN];
int main(void) {
    /* Copy the BUF_LEN bytes of a into b. */
    memcpy(b, a, BUF_LEN);
    return EXIT_SUCCESS;
}

We can observe the generated asm on godbolt, or compile the code with gcc 14.2: gcc -O1 -g -o string main.c.

And then disassemble the executable using: objdump --source-comment="; " --disassembler-color=extended --disassembler-options=intel --no-show-raw-insn --disassemble=main string

You should get this result:

0000000000401134 <main>:
;
; int main(int argc, char *argv[]) {
; memcpy(b, a, BUF_LEN);
401134: mov esi,0x404440
401139: mov edi,0x404040
40113e: mov ecx,0x80
401143: rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi]
; return 0;
; }
401146: mov eax,0x0
40114b: ret

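The first surprising thing you notice is that the machine code does not contain any call to the memcpy function. It has been replaced by three mov instructions, which load the source address into esi, the destination address into edi and the count 0x80 (128) into ecx, followed by a mysterious rep movsq instruction.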

rep movsq is the repeated, quadword form of movs, one of the five string instructions defined in the “Intel® 64 and IA-32 Architectures Software Developer’s Manual”.

The string instructions of x86

String instructions perform operations on array elements pointed to by rsi (the source index register) and rdi (the destination index register).

| Instruction | Description | Effect on registers |
|---|---|---|
| movs | Move string | *(rdi++) = *(rsi++) |
| cmps | Compare string | cmp *(rsi++), *(rdi++) |
| scas | Scan string | cmp rax, *(rdi++) |
| lods | Load string | rax = *(rsi++) |
| stos | Store string | *(rdi++) = rax |

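Each of these instructions takes a suffix (b, w, d, q) indicating the type of the elements pointed to by rdi and rsi (byte, word, doubleword, quadword). The pointers advance by the size of one element after each operation, provided the direction flag (DF) is clear.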

These instructions may also have a prefix indicating how to repeat themselves.

| Prefix | Description | Effect on registers |
|---|---|---|
| rep | Repeat while the rcx register is not zero | for(; rcx != 0; rcx--) |
| repe/repz | Repeat while the rcx register is not zero and the ZF flag is set | for(; rcx != 0 && ZF == true; rcx--) |
| repne/repnz | Repeat while the rcx register is not zero and the ZF flag is clear | for(; rcx != 0 && ZF == false; rcx--) |
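The loop semantics in the table can be modelled in portable C. The following is a minimal sketch, not how any library implements it: the names scan_state, model_repne_scasb and model_strnlen are hypothetical, and the model assumes the direction flag is clear. It mimics repne scasb, which keeps scanning while rcx is not zero and the last comparison did not set ZF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the register state touched by `repne scasb`. */
struct scan_state {
    const uint8_t *rdi; /* destination pointer, advanced after each compare */
    size_t rcx;         /* remaining iteration count */
    int zf;             /* zero flag: 1 when the last compare matched */
};

/* Models `repne scasb` with the direction flag clear.
 * Simplification: when rcx starts at 0 the model reports zf = 0, whereas
 * the real instruction would leave the flags untouched. */
static struct scan_state model_repne_scasb(const uint8_t *rdi, size_t rcx,
                                           uint8_t al)
{
    struct scan_state s = { rdi, rcx, 0 };
    assert(rdi != NULL || rcx == 0);
    while (s.rcx != 0) {
        s.zf = (al == *s.rdi); /* scasb: compare al with the byte at [rdi] */
        s.rdi++;               /* then step rdi to the next byte */
        s.rcx--;               /* the prefix decrements rcx on every iteration */
        if (s.zf)              /* repne stops as soon as ZF is set */
            break;
    }
    return s;
}

/* Bounded string length built on the model, in the spirit of strnlen. */
static size_t model_strnlen(const char *str, size_t max)
{
    struct scan_state s = model_repne_scasb((const uint8_t *)str, max, 0);
    size_t scanned = max - s.rcx; /* bytes examined, terminator included */
    return s.zf ? scanned - 1 : scanned;
}
```

Note that rcx is decremented even on the iteration that finds the match, which is why the length is derived from the number of bytes scanned minus one.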

The movs instruction

Now that we have learned more about the string instructions, we can break down the effect of the rep movsq instruction:

  • Copy the quadword pointed to by rsi to the address pointed to by rdi
  • Add 8 to rsi and rdi so that they point to the next quadword
  • Decrement rcx and repeat until rcx == 0

This is what we would expect memcpy to do, except for one thing: bytes are not copied one by one, but in blocks of 8. Here, as the byte size of our arrays is a multiple of 8, we can copy the source array as an array of quadwords. This requires 8 times fewer iterations than copying the array one byte at a time: 1024 / 8 = 128, which is the 0x80 loaded into ecx.
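The same steps can be tried from C with GCC-style inline assembly. This is a rough sketch under several assumptions: the helper name copy_movsq is hypothetical, the buffers must not overlap, the direction flag is assumed clear (as the System V ABI requires at function boundaries), and the code falls back to memcpy when it is not built for x86-64 with GCC or Clang. Unlike the compiler output above, it also copies a tail of n % 8 bytes with rep movsb, so the length does not need to be a multiple of 8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies n bytes from src to dst (no overlap allowed). */
static void *copy_movsq(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    assert((dst != NULL && src != NULL) || n == 0);
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    size_t qwords = n / 8; /* e.g. 1024 / 8 = 128 = 0x80 */
    size_t tail = n % 8;   /* bytes left over after the quadword copy */

    /* "+D", "+S" and "+c" pin dst, src and the count to rdi, rsi and rcx. */
    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(qwords)
                     :
                     : "memory");
    /* rdi and rsi now point just past the copied quadwords. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(tail)
                     :
                     : "memory");
#else
    if (n != 0)
        memcpy(dst, src, n); /* portable fallback for other targets */
#endif
    return ret;
}
```

Whether this beats the library memcpy depends on the CPU and the buffer size, so it is best treated as an experiment rather than a replacement.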

The cmps instruction

The cmps instruction compares the elements pointed to by rsi and rdi and sets the flags accordingly. As cmps sets the ZF flag, we can combine it with the repeat prefixes: repe/repz continues while the elements are equal, so it stops at the first difference, whereas repne/repnz continues while they differ, so it stops at the first match.

Let’s write a basic memcmp function using this instruction:

; int memcmp_cmpsb(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n);
memcmp_cmpsb:
    mov rcx, rdx    ; rcx = n
    xor eax, eax    ; Set return value to zero
    xor edx, edx    ; rdx = 0 (also sets ZF and clears CF, which covers n == 0)
    repe cmpsb      ; for(; rcx != 0 and ZF == true; rcx--) cmp *(rsi++), *(rdi++)
    setb al         ; if(ZF == false and CF == true) al = 1, i.e. *s2 < *s1
    seta dl         ; if(ZF == false and CF == false) dl = 1, i.e. *s2 > *s1
    sub eax, edx    ; return al - dl
    ret
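To check the flag logic of this routine without an assembler, it can be mirrored in plain C. This is a sketch of the semantics only: the function name model_memcmp_cmpsb is hypothetical, and the variables zf and cf stand in for the processor flags. Since cmps computes [rsi] - [rdi] with rsi holding s2, a set carry flag means the byte of s2 is below the byte of s1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C model of the memcmp_cmpsb routine shown above. */
static int model_memcmp_cmpsb(const void *s1, const void *s2, size_t n)
{
    const unsigned char *rdi = s1;
    const unsigned char *rsi = s2;
    int zf = 1; /* flags left by `xor edx, edx`, which matter when n == 0 */
    int cf = 0;
    assert((s1 != NULL && s2 != NULL) || n == 0);

    /* repe cmpsb: compare [rsi] with [rdi] while the bytes are equal */
    for (size_t rcx = n; rcx != 0 && zf; rcx--) {
        zf = (*rsi == *rdi);
        cf = (*rsi < *rdi); /* unsigned borrow of [rsi] - [rdi] */
        rsi++;
        rdi++;
    }

    int al = cf;         /* setb al: byte of s2 below byte of s1 */
    int dl = !cf && !zf; /* seta dl: byte of s2 above byte of s1 */
    return al - dl;      /* 1 if s1 > s2, -1 if s1 < s2, 0 if equal */
}
```

The bytes are compared as unsigned values, which matches what the C standard requires of memcmp.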

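The same routine can compare eight bytes at a time by using cmpsq instead of cmpsb. In that case, to get the result of the comparison, we need to compare the last two quadwords that were read. However, on little-endian systems, the least significant byte of a quadword is the first one in memory, and we want to compare the bytes in lexical order. Hence the need to convert the quadwords to big-endian using the bswap instruction. The bzhi instruction is useful when you need to mask out the higher bits of a register, for instance to ignore the bytes of a quadword that lie beyond the requested length.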
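The effect of bswap and bzhi on a quadword comparison can be illustrated in portable C. This is only a sketch: load_le64, keep_low_bits, swap_bytes and compare_qwords are hypothetical names, the first three are software stand-ins for a little-endian load, bzhi and bswap, and both pointers are assumed to reference at least 8 readable bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models an x86 quadword load: the first byte in memory ends up lowest. */
static uint64_t load_le64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Stand-in for `bzhi`: keep the low `bits` bits of x and zero the rest. */
static uint64_t keep_low_bits(uint64_t x, unsigned bits)
{
    return bits >= 64 ? x : x & ((UINT64_C(1) << bits) - 1);
}

/* Stand-in for `bswap`: reverse the byte order of x. */
static uint64_t swap_bytes(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (r << 8) | (x & 0xff);
        x >>= 8;
    }
    return r;
}

/* Compares the first len bytes (1 to 8) of two quadwords in memcmp order. */
static int compare_qwords(const void *p1, const void *p2, size_t len)
{
    assert(len >= 1 && len <= 8);
    /* Drop the bytes beyond len, then move the first byte to the top so
     * that an unsigned integer comparison follows lexical order. */
    uint64_t a = swap_bytes(keep_low_bits(load_le64(p1), (unsigned)len * 8));
    uint64_t b = swap_bytes(keep_low_bits(load_le64(p2), (unsigned)len * 8));
    return (a > b) - (a < b);
}
```

Without the byte swap, the last byte in memory would be the most significant one and would wrongly dominate the comparison.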

© 2026 Now Let Us. All rights reserved.

Source: Hacker News
