multigrad: July 2013

Let's solve efficiently the first Euler Project problem, which is:

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 1000.

A naive implementation of this problem would be as such in Python:

Loading ....

This gives a result of 233168.

What we are really interested in, though, is the performance. Let's benchmark our result.

Loading ....

It runs in 555.5 microseconds using my computer.

Now begins the fun. Two paths must be searched in order to attain maximal performance in this problem. The first is to enhance the mathematical search space, meaning that we should limit the number of numbers we evaluate. The first solution tried every positive natural number. It would be way more efficient to only run through multiples of 3 and 5. In Python, we can use the range iterator (xrange for Python 2) to provide us only the interesting values. Concatenation of the two desired iterators can be done using the chain function. It is then a simple matter of getting rid of duplicates, which we can do using the set data structure.

Loading ....

Giving us a 3.806 microseconds run, always on the same computer.

Ah. Much better. For a human being, this is the most optimal piece of core you can encounter solving your problem. It is short, relatively simple and quite easy to maintain.

Everything further down in this post is an abomination and exists for the sole purpose of either educational or masochistic endeavours. Writing clear, simple and maintainable code is way more useful, profitable and developer-time efficient.

Despite these warnings, let's dive into a C implementation of the naive interpretation of the problem.

Loading ....

10.75 microseconds without optimisations
5.36 microseconds with -O3

Quite more efficient than its Python counterpart, indeed. Let's write an iterator-like function giving out only interesting numbers. Instead of using high-level concepts such as the chain function or the set data structure which would be pretty complex to write in C, we'll simply use the fact that distance between multiples of two numbers repeats after their least common multiple (LCM). 3 and 5 have a LCM of 15, so let's enumerate multiple of both up to 15:

3 5 6 9 10 12 15

This means that multiples of 3 and 5 will be a cycling loop of the difference between these numbers. This give us +3, +2, +1, +3, +1, +2, +3 (Beginning with 0, going to 3 is +3. 5 - 3 = +2, 6 - 5 = +1, and so on).
We'll use the fact that switch cases are very efficient in C since they map to a jump case (multiple jump operations) in assembly language (compared to a bunch of ifs which would map to comparisons operations). This gives us the following solution:

Loading ....

3.11 µs -O0

1.55 µs -O3

The C code seems to be on-par with the Python one without optimizations. But this is because of the timing mechanisms we used to benchmark both (one used in-code timestamps while the other one is using time which takes into account the process creation). This should not bias our benchmarks too much because of the repeating loop which minimizes the impact of the process creation overhead.

This is the most performance you can get out of a portable and somewhat readable code. Anything further down in this post engulfs your senses in a warm delusional wool, or as someone stated it:

If you put something like that in production code, you would likely be shot by the maintainer. A jury would never convict him. [1]

I must insist: The rest of this is so bad that you will think that you are right doing so and have the illusory though that it makes you a master of a programming language or computer systems.

Which is wrong.

For the foolish still willing to carry on:

Skills in parallel development are really important nowadays. But let's put this in perspective: we've got a solution running at the microsecond scale. Spawning a new thread or process by the OS will generate a relative overhead way over any speedup could ever justify. Task-level parallelism won't help us for this particular problem.

But we can take advantage of parallel instructions available in current processors through the MMX/SSE/AVX extensions. Data-level parallelism to the rescue!

A generally clean approach to implement calls to CPU extensions in C is to use intrinsics.

One way to solve our problem using SSE is to generate every 7 previously mentioned entries (from 0 to 15) of the cycle using a single addition operation. We then use the addition operation to perform a sum operation using a reduction pattern as shown in figure 1. It is only a matter of sum after this.

Figure 1: Example integer sum reduction pattern using addition (SSE2)

The cleanest way to have done it would be letting hints or rearranging the original code for the compiler to optimize our program using SIMD instructions himself, but I couldn't find a way to cleanly formulate my needs to get GCC to understand what I wanted. Written manually, it gives us the following code:

Loading ....

The codes run in 357 ns on my computer using -O3 compiler flag.

But we are still using a linear complexity algorithm... Isn't there something more efficient in the mathematical search space?

Actually, yes. There is.

There is an identity out there stating that sum of a multiple can be found directly. Using the inclusion-exclusion principle, we can find the answer for many multiples. But I'll let others explain that. Let's see its implementation.

Loading ....

Sorry about the startup script skipping showoff. Comparing the execution time of this operation is irrelevant because the computation is done by the compiler and the answer is directly written to a registry as a literal constant.

So there's two things to remember of this:

Nothing is more important than knowing what we're doing;

The journey is more important than the destination.

multigrad

July 8, 2013

Project Euler Problem 1 - A journey