Generating All 32-Bit Primes (Part I)

This article details the process of creating a high-performance C program to generate all 32-bit primes, starting with the basic trial division method and exploring the wheel factorization optimization.
This article documents my quest to write a C program targeting Linux
that generates all prime numbers that fit in a 32-bit unsigned int
(uint32_t
) as quickly as possible.
In particular, the program should write all 32-bit primes to a file
(which, in all my implementations, is called PRIMES
) in
binary, so that every 4 bytes of the file stores one primes number, with
the bytes ordered in a little-endian fashion. In hex, the file should start
out: 02 00 00 00 03 00 00 00 05 00 00 00 07 00 00 00
, and the
correct SHA-256 hash for the file turns out to be:
272eb05aa040ba1cf37d94717998cbbae53cd669093c9fa4eb8a584295156e15
.
The algorithm used should able to correctly generate primes up to an arbitrarily large limit (within reasonable hardware constraints). It should not assume the primality of any number larger than 2 without first verifying it. It should not use a huge amount of memory—1GB should be plenty.
The simplest and most obvious way to check a number's primality is trial division. Given a target integer , where , we check if is divisible by each prime number less than or equal to its square root. If and only if it is not divisible by any of them, we can conclude that is prime. In C we could implement trial division like this:
/// Returns `true` iff `n` is a prime number.
///
/// `primes`: an array of prime numbers in order, starting from 2, skipping
/// none, and going at least up to the square root of `n`.
bool is_prime(uint32_t n, uint32_t* primes) {
for (size_t i=0; ; i++) {
uint32_t p = primes[i];
if (n % p == 0) {
return false;
}
// Comparison against 0xFFFF prevents overflow in `p*p`
if (p >= 0xFFFF || p*p >= n) return true;
}
}
In the worst case, carrying out this algorithm requires
divisions, where
is the prime-counting function. (That is,
is the number of primes less than or equal to
).
Since
is asymptotically equivalent to
Trial division can be used to generate all the prime numbers up to a
limit
as follows: we create a growing list of the primes in order, which is
initialized to {2}
. Then we go through all the integers from to , checking the primality of each one and adding it to the end of the list if it is found to be prime. In C, for :
uint32_t* primes = (uint32_t*) malloc(203739216 * sizeof(uint32_t));
primes[0] = 2;
size_t p_idx = 1; // index of the next element to be added to `primes`
for (uint32_t n=3; n<=0xFFFFFFFF; n+=2) {
if ( is_prime(n, primes) ) {
primes[p_idx] = n;
p_idx++;
}
// Prevents overflow in `n+=2`
if (n == 0xFFFFFFFF) break;
}
```[2]
This algorithm calls `is_prime`
times, so its time complexity in the worst case is within
.
The array `primes`
can be written to a file as follows:
FILE* prime_file = fopen("PRIMES", "wb"); fwrite(primes, 4, p_idx, prime_file);
Now we have everything we need to write a full program meeting the
specifications laid out in this article's introduction. The implementation
found here runs in about 24m20s of
user time on my system.[3]
Some numbers are obviously not prime, and we are wasting our time by
even checking. For instance, all even numbers greater than 2 (or, in other
words, all integers greater than 2 and congruent to 0 modulo 2) are clearly
not prime. The implementation of the sequential trial division algorithm
already takes advantage of this fact, incrementing the `for`
loop with `n+=2`
, instead of `n++`
, to skip all even
numbers (`n`
is always greater than 2).
A well-known result is the fact that all primes greater than 3 are congruent to either 1 or 5 mod 6. This is easy to prove by cases: any number congruent to 0, 2, or 4 mod 6 is divisible by 2, and any such number greater than 2 is therefore composite; and any number congruent to 3 mod 6 is divisible by 3, and any such number greater than 3 is therefore composite. Less well known is the fact that all primes greater than 5 are congruent to 1, 7, 11, 13, 17, 19, 23, or 29 mod 30, but this can be proved by cases in a very similar fashion: 30 is 2×3×5, so it is easy to eliminate all the possible remainders mod 30 which imply divisibility by 2, 3, or 5.
Let's generalize. Suppose we know the first
primes, which we shall call
.
Let
be the *primorial* of
.
That is, the product of the first
primes, or
.
Now, take the sequence
,
and delete any numbers divisible by any of
.
All prime numbers greater than
will be congruent modulo
to one of the numbers still remaining in the sequence.
Here is a diagram of a 'wheel' of size
.
The numbers in each 'spoke' form a congruence class modulo 30. The
darkened spokes contain exactly those numbers which are divisible by 2,
3, or 5, or, in other words, are not coprime to 30. Clearly, no primes
greater than 5 will be found in the darkened spokes. The white spokes
contain all natural numbers coprime to 30, and may be called the
*coprime spokes*.
Note that the properties 'not divisible by any of the first primes', 'coprime to ', and 'found on a coprime spoke of the wheel of size ' are all equivalent.
We can represent a wheel in C with the following struct:
struct Wheel {
size_t size;
/// Numbers with these remainders modulo size are coprime to size.
uint32_t* coprime_spokes;
/// The number of elements in spokes
size_t n_spokes;
};
And initialize it as follows:
/// Initialize a struct Wheel using the first k prime numbers.
///
/// primes: an array of size at least k whose first k elements are the
/// first k primes in order
struct Wheel wheel_init(uint32_t* primes, size_t k) {
struct Wheel wheel;
wheel.size = 1;
for (size_t i=0; i<k; i++)
wheel.size *= primes[i];
// All the remainders mod wheel.size which imply divisibility by one of
// the first k primes
uint32_t *coprime_spokes = malloc( wheel.size * sizeof(uint32_t) );
wheel.n_spokes = 0; // index of the next element to be added to
// coprime_spokes
for (size_t i=0; i<wheel.size; i++) {
for (size_t j=0; j<k; j++) {
if (i % primes[j] == 0) goto continue_loop;
}
coprime_spokes[wheel.n_spokes] = i;
wheel.n_spokes ++;
continue_loop: ;
}
wheel.coprime_spokes = malloc( wheel.n_spokes * sizeof(uint32_t) );
memcpy(wheel.coprime_spokes, coprime_spokes, wheel.n_spokes * sizeof(uint32_t));
free(coprime_spokes);
return wheel;
}
To prevent integer overflows, we need to know the largest
`uint32_t`
that is coprime to the `size`
of a
`struct Wheel`
:
/// Returns the largest uint32_t coprime to wheel.size
uint32_t wheel_last(struct Wheel wheel) {
for (uint32_t l = 0xFFFFFFFF;; l--) {
for(size_t i=0; i<wheel.n_spokes; i++) {
if ( (l - wheel.coprime_spokes[i]) % wheel.size == 0 )
return l;
}
}
}
We can loop through the numbers on the coprime spokes of ```
struct
Wheel wheel
as follows:
uint32_t last = wheel_last(wheel);
uint32_t turn = 0;
while (true) {
for (size_t i=0; i<wheel.n_spokes; i++) {
// `n` is on a coprime spoke of `wheel`, and every number on a coprime
// spoke of `wheel` will appear as `n` in this loop.
uint32_t n = wheel.size * turn + wheel.coprime_spokes[i];
// ...
if (n == last)
goto exit_loop;
}
turn++;
}
exit_loop: // ...
Adding wheel factorization to the sequential trial division algorithm is
quite simple. To start, we initialize our growing list of primes to
{2}
, and use ordinary trial division to extend it until it contains the first primes. We use these primes to create the wheel of size . For the larger primes, we use the same trial division algorithm before, but only checking numbers which are found on a coprime spike of the wheel. The time complexity of this algorithm is no different from that of the ordinary sequential trial division algorithm, since numbers are found on the coprime spokes.
The implementation here, which uses a wheel based on the first 5 primes, runs about 23m30s of user time, barely an improvement over not using the wheel. Though the wheel requires us to consider only about 20% of natural numbers as potential primes[5], the ~80% of numbers we avoid checking are the ones that ar
Source: Hacker News










