A naive implementation of this problem would be as such in Python:If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.Find the sum of all the multiples of 3 or 5 below 1000.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
sum(x for x in range(1000) if x % 5 == 0 or x % 3 == 0) |
What we are really interested in, though, is the performance. Let's benchmark our result.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
import timeit | |
timeit.timeit( | |
"sum(x for x in range(1000) if x % 5 == 0 or x % 3 == 0)", | |
number=10000 | |
) / 10000. |
Now begins the fun. Two paths must be searched in order to attain maximal performance in this problem. The first is to enhance the mathematical search space, meaning that we should limit the number of numbers we evaluate. The first solution tried every positive natural number. It would be way more efficient to only run through multiples of 3 and 5. In Python, we can use the range iterator (xrange for Python 2) to provide us only the interesting values. Concatenation of the two desired iterators can be done using the chain function. It is then a simple matter of getting rid of duplicates, which we can do using the set data structure.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# Algorithm | |
from itertools import chain | |
sum(set(x for x in chain(range(5, 1000, 5), range(3, 1000, 3)))) | |
# To benchmark it | |
import timeit | |
timeit.timeit( | |
"sum(set(x for x in chain(range(5, 1000, 5), range(3, 1000, 3))))", | |
setup="from itertools import chain", | |
number=10000 | |
) / 10000. |
Ah. Much better. For a human being, this is the most optimal piece of core you can encounter solving your problem. It is short, relatively simple and quite easy to maintain.
Everything further down in this post is an abomination and exists for the sole purpose of either educational or masochistic endeavours. Writing clear, simple and maintainable code is way more useful, profitable and developer-time efficient.
Despite these warnings, let's dive into a C implementation of the naive interpretation of the problem.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#include <stdio.h> | |
unsigned int NumberIt() | |
{ | |
/* Create a persistant variable x and initialize it to 0 */ | |
static unsigned int x = 0; | |
for (x++; x < 1000; x++) { | |
if (x % 5 == 0 || x % 3 == 0) { | |
return x; | |
} | |
} | |
/* This section is for looping purposes */ | |
x = 3; | |
return x; | |
} | |
int main() | |
{ | |
unsigned int lResult, lTmpValue, lLoop; | |
for (lLoop = 0; lLoop <= 100000; lLoop++) { | |
lResult = 0; | |
while (1) { | |
lTmpValue = NumberIt(); | |
if (lTmpValue > 1000) { break; } | |
lResult += lTmpValue; | |
} | |
} | |
printf("Result: %u\n", lResult); | |
return 0; | |
} |
5.36 microseconds with -O3
Quite more efficient than its Python counterpart, indeed. Let's write an iterator-like function giving out only interesting numbers. Instead of using high-level concepts such as the chain function or the set data structure which would be pretty complex to write in C, we'll simply use the fact that distance between multiples of two numbers repeats after their least common multiple (LCM). 3 and 5 have a LCM of 15, so let's enumerate multiple of both up to 15:
3 5 6 9 10 12 15This means that multiples of 3 and 5 will be a cycling loop of the difference between these numbers. This give us +3, +2, +1, +3, +1, +2, +3 (Beginning with 0, going to 3 is +3. 5 - 3 = +2, 6 - 5 = +1, and so on).
We'll use the fact that switch cases are very efficient in C since they map to a jump case (multiple jump operations) in assembly language (compared to a bunch of ifs which would map to comparisons operations). This gives us the following solution:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#include <stdio.h> | |
unsigned int x = 0, state = 0; | |
unsigned int NumberIt() | |
{ | |
/* Create a persistant variable x and initialize it to 0 */ | |
switch (state) { | |
/* case 0: Replaced by default*/ | |
case 3: | |
case 6: | |
x += 3; | |
break; | |
case 1: | |
case 5: | |
x += 2; | |
break; | |
case 2: | |
case 4: | |
x += 1; | |
break; | |
default: | |
state = 0; | |
x += 3; | |
} | |
state++; | |
return x; | |
} | |
int main() | |
{ | |
unsigned int lResult, lTmpValue, lLoop; | |
for (lLoop = 0; lLoop <= 100000; lLoop++) { | |
lResult = 0; | |
while (1) { | |
lTmpValue = NumberIt(); | |
if (lTmpValue >= 1000) { | |
break; | |
} | |
lResult += lTmpValue; | |
x = x - 1000; | |
state = 0; | |
} | |
} | |
printf("Result: %u\n", lResult); | |
return 0; | |
} |
3.11 µs -O0
1.55 µs -O3
The C code seems to be on-par with the Python one without optimizations. But this is because of the timing mechanisms we used to benchmark both (one used in-code timestamps while the other one is using time which takes into account the process creation). This should not bias our benchmarks too much because of the repeating loop which minimizes the impact of the process creation overhead.
This is the most performance you can get out of a portable and somewhat readable code. Anything further down in this post engulfs your senses in a warm delusional wool, or as someone stated it:
If you put something like that in production code, you would likely be shot by the maintainer. A jury would never convict him. [1]I must insist: The rest of this is so bad that you will think that you are right doing so and have the illusory though that it makes you a master of a programming language or computer systems.
Which is wrong.
Skills in parallel development are really important nowadays. But let's put this in perspective: we've got a solution running at the microsecond scale. Spawning a new thread or process by the OS will generate a relative overhead way over any speedup could ever justify. Task-level parallelism won't help us for this particular problem.
But we can take advantage of parallel instructions available in current processors through the MMX/SSE/AVX extensions. Data-level parallelism to the rescue!
A generally clean approach to implement calls to CPU extensions in C is to use intrinsics.
One way to solve our problem using SSE is to generate every 7 previously mentioned entries (from 0 to 15) of the cycle using a single addition operation. We then use the addition operation to perform a sum operation using a reduction pattern as shown in figure 1. It is only a matter of sum after this.
One way to solve our problem using SSE is to generate every 7 previously mentioned entries (from 0 to 15) of the cycle using a single addition operation. We then use the addition operation to perform a sum operation using a reduction pattern as shown in figure 1. It is only a matter of sum after this.
Figure 1: Example integer sum reduction pattern using addition (SSE2)
The cleanest way to have done it would be letting hints or rearranging the original code for the compiler to optimize our program using SIMD instructions himself, but I couldn't find a way to cleanly formulate my needs to get GCC to understand what I wanted. Written manually, it gives us the following code:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#include <stdio.h> | |
#include <x86intrin.h> | |
#define NUMBER_OF_TRIES 1000000 | |
int main() | |
{ | |
register unsigned long lResult, lLoop, i; | |
for (lLoop = 0; lLoop < NUMBER_OF_TRIES; lLoop++) { | |
lResult = 0; | |
__m128i numbers, resultat, somme, sum_iteration; | |
const __m128i delta = _mm_set_epi16(0, 15, 12, 10, 9, 6, 5, 3); | |
const __m128i step = _mm_set_epi16(0, 15, 15, 15, 15, 15, 15, 15); | |
const __m128i add_mask = _mm_set_epi16(0, 0, 0, 0, 0, 0, 0, 0xFFFF); | |
const __m128i final_mask = _mm_set_epi16(0, 0, 0, 0, 0xFFFF, 0xFFFF, | |
0xFFFF, 0xFFFF); | |
numbers = _mm_set_epi16(0, 0, 0, 0, 0, 0, 0, 0); | |
sum_iteration = _mm_set_epi16(0, 0, 0, 0, 0, 0, 0, 0); | |
/* 1000 / (7 / 15) = 66,... | |
* We'll make the 65 first iterations correctly, then make the final | |
* one with a mask. */ | |
for (i = 0; i <= 66; i++) { | |
// Compute next number pattern | |
resultat = _mm_add_epi16(numbers, delta); | |
// If final iteration, remove the last numbers (over 1000) | |
if (i == 66) { | |
resultat = _mm_and_si128(resultat, final_mask); | |
} | |
// Sum of the 7 numbers | |
somme = _mm_add_epi16(resultat, _mm_srli_si128(resultat, 2)); | |
somme = _mm_add_epi16(somme, _mm_srli_si128(somme, 4)); | |
somme = _mm_add_epi16(somme, _mm_srli_si128(somme, 8)); | |
// Sum them in a 32 bits wide entry | |
somme = _mm_and_si128(somme, add_mask); | |
sum_iteration = _mm_add_epi32(somme, sum_iteration); | |
// Get the next numbers | |
numbers = _mm_add_epi16(numbers, step); | |
} /* end for of sum */ | |
lResult = _mm_cvtsi128_si32(sum_iteration); | |
} /* end for of benchmark repeat */ | |
printf("Result: %u\n", lResult); | |
return 0; | |
} |
But we are still using a linear complexity algorithm... Isn't there something more efficient in the mathematical search space?
Actually, yes. There is.
There is an identity out there stating that sum of a multiple can be found directly. Using the inclusion-exclusion principle, we can find the answer for many multiples. But I'll let others explain that. Let's see its implementation.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
/* To compile with: gcc euler1_s6.c -nostartfiles -s -O3 -o euler1_s6 */ | |
/* How many 3, 5 and 15 are there in 999 (below 1000) */ | |
#define QTY3 ( 999 / 3 ) | |
#define QTY5 ( 999 / 5 ) | |
#define QTY15 ( 999 / 15 ) | |
/* Their sums */ | |
#define SUM3 ( 3 * ( QTY3 * (QTY3 + 1) / 2 ) ) | |
#define SUM5 ( 5 * ( QTY5 * (QTY5 + 1) / 2 ) ) | |
#define SUM15 ( 15 * ( QTY15 * (QTY15 + 1) / 2 ) ) | |
/* The answer */ | |
#define RESULT ( SUM3 + SUM5 - SUM15 ) | |
#include <stdio.h> | |
void _start(void) { | |
printf("%u\n", RESULT); | |
asm("movl $1,%eax;" | |
"xorl %ebx,%ebx;" | |
"int $0x80" | |
); | |
} |
So there's two things to remember of this:
Nothing is more important than knowing what we're doing;
The journey is more important than the destination.