NOW LET US – AI RAG SaaS Studio TP.HCM
NOW LET US
Digital Product Studio
Back to news
DEV-TOOLS...1 min read

Binary GCD

Share
NOW LET US Article – Binary GCD

This article explores how the Binary GCD algorithm outperforms the traditional Euclidean method by leveraging bitwise operations, achieving a 2x speedup over the C++ standard library.

In this section, we will derive a variant of gcd that is ~2x faster than the one in the C++ standard library.

#Euclid’s Algorithm

Euclid’s algorithm solves the problem of finding the greatest common divisor (GCD) of two integer numbers $a$ and $b$, which is defined as the largest such number $g$ that divides both $a$ and $b$. The formula $\gcd(a, b) = \gcd(b, a \bmod b)$ is essentially the algorithm itself: you can simply apply it recursively, and since each time one of the arguments strictly decreases, it will eventually converge to the $b = 0$ case.

The worst possible input to Euclid’s algorithm are consecutive Fibonacci numbers, and the running time is logarithmic in the worst case. All common implementations, including std::gcd, get compiled into an assembly loop using the idiv instruction. General integer division works notoriously slow on all computers, including x86, with idiv taking up ~90% of the execution time.

#Binary GCD

The binary GCD algorithm (Stein's algorithm) was rediscovered in 1967 for use in computers with slow division. It is based on identities involving powers of 2, such as $\gcd(2a, 2b) = 2 \cdot \gcd(a, b)$ and $\gcd(2a, b) = \gcd(a, b)$ if $b$ is odd. The algorithm uses only binary shifts, comparisons, and subtractions, which typically take just one cycle.

#Implementation

A naive implementation with many branches is often slower than std::gcd. However, by using __builtin_ctz (count trailing zeros) and removing unnecessary branching, we can achieve significant speedups. An optimized version handles the common power of 2 at the beginning and then enters a loop where at least one number is always odd.

int gcd(int a, int b) {
    if (a == 0) return b;
    if (b == 0) return a;
    int az = __builtin_ctz(a);
    int bz = __builtin_ctz(b);
    int shift = std::min(az, bz);
    a >>= az; b >>= bz;
    while (a != b) {
        int diff = a - b;
        int next_a = std::abs(diff);
        b = std::min(a, b);
        a = next_a >> __builtin_ctz(diff);
    }
    return b << shift;
}

This optimized version runs in 116ns, compared to 198ns for std::gcd. By analyzing the assembly dependency graph, we can further reduce latency by calculating ctz directly from the difference of the two numbers.

© 2026 Now Let Us. All rights reserved.

Source: Hacker News

Advertisement
Ad slot ready: 5887729102

More in this category

EXPLORE TOPICS

Discover All Categories

Deep dive into the specific technology sectors that matter most to you.