24 September 2021

Motivation

Being able to compute lists of prime numbers is crucial in many problems. Besides being very efficient, the sieve of Eratosthenes is a simple, easy-to-implement method that has stood the test of time and is still widely used today.

How does it work?

The sieve of Eratosthenes is an ancient method of finding all prime numbers smaller than a certain number \(n\).

We recall the definition of a prime number as an integer \(k > 1\) whose only divisors are \(1\) and \(k\).

The algorithm works on a list of all the integers in \([1, n]\), in which we mark all composite numbers.
We start by marking \(1\) as not prime, since \(1\) is not considered to be a prime number.
The next step is to find the next unmarked number, which is \(2\). We proceed to go through all
the multiples of \(2\) and mark them as non-primes.
Now we just need to repeat the process of finding the next unmarked number and marking all its multiples until we reach \(n\).

One might ask: how can we be sure that the next unmarked number is a prime number? The answer is really simple:
if the number is unmarked, it is not a multiple of any smaller number greater than \(1\),
which by definition makes it a prime number.

If we think about it carefully, what Eratosthenes did was remove all integers of the form \(2k,\: 3k,\: 4k,\: ...\),
where \(k \gt 1\), or in other words, remove all composite numbers from the list leaving all prime numbers unmarked.

Implementation

This algorithm can be easily reproduced with two nested loops. The variable \(i\) iterates through the integers in \([2, n]\) looking for an unmarked number, while the second variable \(j\) marks all multiples of \(i\).

void SieveOfEratosthenes(int n) {
    // One flag per integer in [0, n]; true means "assumed prime" until marked.
    vector<bool> primes(n + 1, true);
    primes[0] = primes[1] = false;
    for (int i = 2; i <= n; i++) {
        if (primes[i]) {
            // i is prime: mark every multiple of i as composite.
            for (int j = 2 * i; j <= n; j += i) {
                primes[j] = false;
            }
        }
    }
}

Optimization

There are several ways to optimize the efficiency of this algorithm, some of which are not the direct aim of this article,
such as the *segmented sieve* or the *linear time sieve*. But there are some simple tricks that can really
improve the performance of the algorithm.

The first trick is to bound \(i\) to \([2, \sqrt n\,]\) instead of \([2, n]\). The reason we bound \(i\)
in this way is quite simple. Suppose we do not bound it and \(i\) is an integer greater than \(\sqrt n\).
The multiples of \(i\) are \(2i,\: 3i,\: 4i,\: ...,\: ki\),
where \(k < \sqrt n\), because otherwise \(ki > n\), and we are only interested in multiples no larger than \(n\).

But there is no need to mark these numbers, for they are already marked: \(2i\) is a multiple of \(2\),
\(3i\) is a multiple of \(3\), and \(ki\) is a multiple of \(k\). Since \(2,\: 3,\: ...,\: k\) are all smaller
than \(i\), the algorithm has already marked all multiples of each of them, including \(2i,\: 3i,\: 4i,\: ...,\: ki\).

This proves that for any \(i > \sqrt n\), all multiples of \(i\) have already been marked.
Note that for \(i < \sqrt n\) there are also cases where some multiples are marked twice or more,
but we cannot be sure that **all** multiples of \(i\) have been marked in this case (which also means that there is
room for further optimization).

The second trick is basically the same idea, but this time we apply it to \(j\). In the original naive implementation, we initialize \(j = 2i\) and then increase \(j\) in each iteration by increments of \(i\) to produce the sequence \(2i,\: 3i,\: 4i,\: ...\), but the first terms, \(2i,\: 3i,\: ...,\: (i-1)i\), have already been marked, since each has a factor smaller than \(i\). The trick is to start with \(j = i^2\) and continue from there: \((i+1)i,\: (i+2)i,\: (i+3)i,\: ...\) This way, we avoid repeating steps that we know have already been done.

void SieveOfEratosthenes(int n) {
    vector<bool> primes(n + 1, true);
    primes[0] = primes[1] = false;
    // i * i <= n bounds i by sqrt(n) without floating-point arithmetic.
    for (int i = 2; i * i <= n; i++) {
        if (primes[i]) {
            // Start at i * i: smaller multiples of i are already marked.
            for (int j = i * i; j <= n; j += i) {
                primes[j] = false;
            }
        }
    }
}

Asymptotic analysis

To analyze the time complexity of this algorithm, we will resort to some results from number theory.
Since we iterate through each multiple of each prime \(p_i < n\), it is logical to say that the
total number of operations performed by the algorithm is approximately equal to
$$ \frac{n}{2} + \frac{n}{3} + \frac{n}{5} + \frac{n}{7} + \: ... $$
where the denominators are all the prime numbers up to \(n\). In a more formal notation, we can write this
sum as
$$ \sum_{p_i < n}\: \frac{n}{p_i}$$
Since \(n\) does not depend on the summation index, we can factor it out and write
$$ n \times \sum_{p_i < n}\: \frac{1}{p_i}$$
Now we just have to figure out the value of this sum, right? Well, it turns out that
this is a famous series in number theory, and Euler already found the answer. While proving that
this series diverges, Euler showed that its partial sums grow like \(\log \log n\) (a very slow divergence, if you ask me).

Substituting this result into the expression, we get \(n \log \log n\), which gives
a time complexity of \(O(n \log \log n)\). I'll go out on a limb and say that this is basically linear time:
\(\log \log n\) grows so slowly that for any realistic \(n\) it behaves like a small constant.

The space complexity is simply \(O(n)\), since we are just using an array of size \(n\) to mark the composite numbers.

If you know this algorithm, you actually have a trick up your sleeve. It's very intuitive, very fast, easy to implement,
and also very flexible for optimizations/tweaks if you need them.