multigrad

Executing x86-64 opcodes in Python - Genetic Programming case study

2014-07-24T11:11:00.003-04:00

This post is about generating machine code directly and making it run in Python. (At least, I'm not the first one to do something similar.) Since we've given up any hope of portability by using machine code, we'll support my processor, which implements the (not so pretty) standard x86-64 instruction set.

Jumping down the abstraction layers

First of all, how do you even execute arbitrary x86-64 instructions in Python? (Note: doing it in Python is mandatory, mainly because I want it to. And it's simpler to develop and debug.) Well, diving into the ctypes module gives us half the answer: we can write a byte string that contains the executable code and then cast it into a CFUNCTYPE object.

But a portability issue arises. CFUNCTYPE states that it will call the function using the standard C calling convention (part of what we call an ABI)... But which x86 calling convention? The answer is: the one used by the compiler with which Python was compiled. Luckily, the two main x86-64 calling conventions (Microsoft-flavored for Windows-based systems and System V-flavored for anything else) are pretty similar. Another lucky fact: we only have to take care of the volatile registers, the parameter-passing registers and the return registers.

Next, we must know the machine code that does what we want. It is worth noting that an interactive compiler and an online assembler are useful to generate the machine code sequence. The same output is achieved using gcc to compile a file containing a simple function as such:


Note that this invocation of gcc won't execute the linker, so every call, jump or other reference to a label (an address) won't be defined: they will be set to 0. This won't hinder us, as our addresses won't be the same as in the sample C code anyway. To better understand what we are dealing with, it is possible to check out the opcodes in an x86-64 opcode reference chart, or in the Intel architecture software developer manual if the gory details are needed.

Ok, everything should be fine now! Let's try to cast a simple return. This instruction only returns from a function call without doing anything.


And... That gives a segmentation fault.

Which is pretty logical: the string is stored in a memory segment that has the read and write flags, but not the execute flag. Luckily, this particular problem was solved on Stack Overflow. This gives us the following (x86-64 Linux only) code:


At last, it works! Or, at least, doesn't explode upon execution.

So, basically, if I want to perform an addition, the documentation says to interface the "C function" this way (to replace the end of the previous code snippet):

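Putting the pieces together, here is a minimal sketch of the idea (not the original listings). It assumes Linux on x86-64 and a kernel that allows an anonymous mapping to be both writable and executable. The NativeFunction wrapper, supported() and demo() are hypothetical names used for illustration only: the bytes are copied into an mmap region created with PROT_EXEC, and a CFUNCTYPE prototype is pointed at that address.

```python
import ctypes
import mmap
import platform
import sys

# addsd xmm0, xmm1 ; ret  (the byte sequence discussed in the text)
ADD_DOUBLES = b"\xf2\x0f\x58\xc1\xc3"


class NativeFunction:
    """Hypothetical helper: call raw machine code through ctypes."""

    def __init__(self, code, restype, *argtypes):
        # Anonymous mapping with read, write AND execute permissions;
        # a plain bytes object lives in memory that is not executable.
        self._buf = mmap.mmap(
            -1, max(len(code), mmap.PAGESIZE),
            flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
            prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC,
        )
        self._buf.write(code)
        # Address of the first byte of the mapping.
        address = ctypes.addressof(ctypes.c_char.from_buffer(self._buf))
        # C function pointer with the requested prototype at that address.
        self._func = ctypes.CFUNCTYPE(restype, *argtypes)(address)

    def __call__(self, *args):
        return self._func(*args)


def supported():
    """Only run the bytes where they are valid instructions."""
    return (sys.platform.startswith("linux")
            and platform.machine() in ("x86_64", "AMD64"))


def demo():
    ret_only = NativeFunction(b"\xc3", None)   # a lone `ret`
    ret_only()                                  # returns immediately
    add = NativeFunction(ADD_DOUBLES, ctypes.c_double,
                         ctypes.c_double, ctypes.c_double)
    return add(1.5, 2.25)
```

Calling demo() on a supported machine should return the sum of the two doubles. Keeping the mmap object alive inside the wrapper matters: if it were garbage-collected, the function pointer would dangle.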

Great, it works! But how did we come up with the machine code f2 0f 58 c1 c3? (The \x prefix before every byte only tells Python that the byte is written in hexadecimal.) Here we go:

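As a rough illustration of how those five bytes break down (a sketch, not the original listing): in the Intel manual, F2 0F 58 /r encodes ADDSD, the following ModRM byte selects the two registers, and C3 is ret. The helper name decode_modrm is made up for this example.

```python
def decode_modrm(byte):
    """Split a ModRM byte into its (mod, reg, rm) bit fields."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


code = bytes.fromhex("f2 0f 58 c1 c3")
prefix, escape, opcode, modrm, ret = code

mod, reg, rm = decode_modrm(modrm)
# mod == 3 means both operands are registers: reg is the destination,
# rm is the source.
assembly = "addsd xmm{}, xmm{}".format(reg, rm) if mod == 3 else None
print(assembly)
```

Since both calling conventions pass the first two double arguments in XMM0 and XMM1 and return a double in XMM0, adding XMM1 into XMM0 and returning is all the function has to do.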

The missing link

Now that the hassle of explaining evolutionary algorithms, genetic programming and their need for fast execution has been handled by my colleague, let's push it a step further using the aforementioned notions. Marc-André brought you to the edge of the cliff; we'll now take a big step forward. While he used Python bytecode, which is an awesome and versatile idea that supports any hardware to which Python was ported, we'll use machine code.

Now that we know how we are going to execute arbitrary machine code, we can focus on the problem at hand: generating machine code from a program tree. Let's take for example the tree for the equation $ (x - 1) + 4 $:

Figure 1: Example of the tree representing the expression $ (x - 1) + 4 $.

Since the processor execution is based on a stack, the easiest way to traverse this tree is depth-first. Indeed, appending every node of the tree while traversing depth-first generates the reverse of the tree's order of execution. Here's how it works:

Figure 2: Example of depth-first traversing with call stack

As we traverse the tree, we pile up the calls or parameters on the stack. As you can see from the figure, it generates a call stack that, when executed bottom-up, is exactly the order that the x86_64 processor needs. We notice that it is different from the Python bytecode stack Marc-André showed: argument positions are reversed and LOAD_GLOBAL is replaced by the call to the function.

All that seems well, but there is a problem. The x86_64 calling convention passes the floating-point parameters of a function in XMM0, XMM1, XMM2 and so on. These registers are volatile, meaning that their content may well be modified when we call a function. Let's take the tree shown in the previous figure for the sake of example and assume we're dealing with floats. X will be put in XMM0, 1.0f in XMM1 and sub() will be called. This call will return its result in XMM0. Perfect, that's where we want our first argument for the call to add(), along with 4.0f previously put in XMM1. Uh-oh. sub() needs 1.0f in XMM1 while add() needs 4.0f. This can be visualized here:

Instruction	XMM0	XMM1
Put 4.0f in XMM1	-	4.0f
Put 1.0f in XMM1	-	1.0f
Put x in XMM0	x	1.0f
Call sub()	x - 1.0f	1.0f (?)
Call add()	(x - 1.0f) + ?	1.0f (?)

You'll probably say "hey, this is easy, simply put node 5 before node 2, problem solved!" Don't. Do. That. All hell will break loose and you will not enjoy this. This kind of tweaking will lead you to hours and hours of wondering why it works in some cases but not in others. (Says the guy who produces executable machine code in Python.) As I said earlier: this should be simple, an x86_64 CPU is based on a stack! Let's use it! We'll simply push every argument needed on the CPU stack when we traverse its node and then pop it back when it's needed. If we feel adventurous, we realize that the first argument (next node after a call) won't need a push/pop if the arguments are compatible: the result will already be in XMM0 (for floats and doubles), ready to be used. This gives us this:

Instruction	XMM0	XMM1
Push 4.0f on stack	-	-
Push 1.0f on stack	-	-
Put x in XMM0	x	-
Pop stack in XMM1	x	1.0f
Call sub()	x - 1.0f	?
Pop stack in XMM1	x - 1.0f	4.0f
Call add()	(x - 1.0f) + 4.0f	?

As we can see, this generates quite a lot of unnecessary pushes (for example argument 2 of sub()). Eliminating this unnecessary stack usage is a potential optimization that we may discuss in another article.
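The push/pop scheme from the table can be sketched with symbolic instructions instead of real opcodes. This is an illustration under assumptions, not the code from the linked program: the tuple-based tree format, the instruction names and the tiny simulator are all invented here, and the way a non-leaf second argument is handled (evaluate it into XMM0, then push XMM0) is my own generalization of the table.

```python
import operator

FUNCS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def emit(node, out):
    """Append instructions that leave the value of `node` in XMM0."""
    if not isinstance(node, tuple):            # leaf: variable or constant
        out.append(("load", node))
        return out
    name, first, *rest = node
    for arg in reversed(rest):                 # last argument is pushed first
        if isinstance(arg, tuple):
            emit(arg, out)                     # subtree result lands in XMM0
            out.append(("push_xmm0",))
        else:
            out.append(("push", arg))
    emit(first, out)                           # first argument stays in XMM0
    for register in range(1, len(rest) + 1):
        out.append(("pop", register))          # restore XMM1, XMM2, ...
    out.append(("call", name, len(rest) + 1))
    return out


def run(program, variables):
    """Simulate the instructions with volatile XMM registers."""
    def value(v):
        return variables[v] if isinstance(v, str) else float(v)

    xmm, stack = [None] * 8, []
    for op, *args in program:
        if op == "load":
            xmm[0] = value(args[0])
        elif op == "push":
            stack.append(value(args[0]))
        elif op == "push_xmm0":
            stack.append(xmm[0])
        elif op == "pop":
            xmm[args[0]] = stack.pop()
        elif op == "call":
            name, arity = args
            result = FUNCS[name](*xmm[:arity])
            xmm = [None] * 8                   # assume every register is clobbered
            xmm[0] = result                    # except the return value
    return xmm[0]


program = emit(("add", ("sub", "x", 1.0), 4.0), [])
print(program)
print(run(program, {"x": 3.0}))
```

For the example tree, the emitted sequence follows the second table row by row, and wiping the registers after each call shows that the result no longer depends on XMM1 surviving a call.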

Before rambling into a non sequitur madness of idea flow almost disconnected from the subject, I'll present you the symbolic regression program with individuals evaluated in x86_64 machine code using deap. It is located here. Feel free to fork it, mess with it and be curious around it.

I haven't implemented division (I would have had to deal with division by zero) nor cos / sin, but feel free to be inspired and fork the code!

You may be tempted to print out the machine code generated and understand it. To better understand it, it is possible to copy it in an online disassembler which will provide the almost human-readable assembly translation.

To the Infinity and beyond

In the context of genetic programming, a far better idea than generating the machine code at each evaluation, as the previous example did, is to represent the individuals in deap directly as machine code and evolve that. It's not that I am wary of the tortuous path toward a stable implementation of this representation. It would require writing mutation and crossover functions, which need pointers marking the beginning and end of each node in the machine code representation. No, the only reason I don't dig further in this general direction is that I won't offend you by serving some old reheated matter. Marc-André already showed how it's done in his previous post. (This also has nothing to do with the fact that it's an unmaintainable piece of code that won't ever be published as-is in a working project.)

At first, this proposed idea was tagged "useless waste of time" by colleagues and friends. But as the idea developed, we realized it could be used to circumvent security features. Obscure opcodes, low-level functions, software interrupts and the like are now available directly in pure Python. Furthermore, it would enable the execution of dynamic libraries that are not flagged as executable. You only have to read the exported symbols of a .so library, load them in memory, apply this method and voilà, you can execute its functions.

An interesting lead from this point is to make a compiler for generic Python functions. Some pretty module would get us near a C compiler, but I won't insult the dedication and hard work of Donald Knuth by proposing half an article written on a napkin on compiler creation. But I don't mind getting on familiar terms with Python optimization packages. Can we perform better than Numba or even Cython and PyPy? Stay tuned for Part II, where we'll try as-generic-as-possible Python-to-machine-code translation.

Fun with Python bytecode

2014-06-08T13:08:00.000-04:00

The Python programming language is now present everywhere. While significantly slower than lower-level languages like C/C++, its ease of use, its power and its various libraries make it an ideal choice for many projects. Yet in some cases, performance does matter. For these, many solutions have been proposed. One of the most common is the use of an underlying, heavily optimized C library along with Python bindings (numpy, scipy, opencv, etc.). Another is the addition of performance-related features to the interpreter, like the JIT or the stackless mode of PyPy. However, what can you do when you can't apply either of them, yet you still want more performance without sacrificing the Python zen and simplicity?

One of the answers, at least in the present case, is to roll up our sleeves, fasten our seat belts, and dive into the mysteries of the Python bytecode.

Wait, what?

You have probably already heard that Python, unlike Fortran, C++ or Go, is an interpreted language. While not completely wrong, it is not entirely true either. You could create a Python interpreter that works in a purely interpreted way, but that is not how it is done in most current Python implementations. The Python code is first compiled into a series of simpler instructions. Thereafter, these instructions are actually interpreted by a virtual machine. The main reason for that design choice is speed: directly interpreting high-level Python code each time it is executed (think for instance of a loop or a function executed many times) would be inefficient. Moreover, some different Python constructs turn out to give the same result. For instance, an if/elif statement produces exactly the same result as an if inside the else clause of another condition. A list comprehension has the same behavior as a traditional loop (except for some underlying details in the current implementations, but it could be exactly the same thing). The use of a simplified, intermediate language allows the actual interpreter to be simpler, easier to understand and maintain, and faster, while being entirely hidden from the user in most cases. With CPython, the most visible effect is the creation of *.pyc files, used to store the bytecode in order to avoid a recompilation step each time the file is executed.

It is to be noted that this design is not Python-specific, since many other prominent languages use it too (Java, Ruby, etc.). All the designs are more or less identical, the steps involved being a) the parsing of the high-level language into an intermediate, simpler representation, and b) the interpretation of this intermediate representation to actually run the program. One could note that Python makes use of a fancier design, translating the Python code into an Abstract Syntax Tree (AST) before creating the bytecode, but this is outside the scope of this post. However, those interested in the CPython internals could refer to the CPython compiler developers information.

In CPython 3.4 (the reference interpreter), the bytecode is based on a stack representation and 101 different opcodes. Python 2.7 uses 18 more, mainly for slicing operations, which were separate from the item-based operations, and for the print and exec statements, which became functions in Python 3. For the rest of this post, we will use CPython 3.4. While the explanations provided here also mostly apply to other versions and interpreters, they may not be totally accurate.

Let's take a look at an actual bytecode snippet. Suppose you have the following simple function:

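The function below is a reconstruction, not the original listing: it is inferred from the disassembly and from the worked example that follow (arguments a, b and c, the constant 2, and a result of 768 for 12, 7 and 1), so the exact source layout is an assumption.

```python
def myMathOperator(a, b, c):
    return a * (b + c) ** 2


print(myMathOperator(12, 7, 1))
```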

We can observe the bytecode generated by Python by using the __code__ attribute:

In [9]: myMathOperator.__code__.co_code

Out[9]: b'|\x00\x00|\x01\x00|\x02\x00\x17d\x01\x00\x13\x14S'

Ok... While expected (since the bytecode is basically a sequence of bytes), the result is not exactly readable. Fortunately, there is a helper module in the Python standard library which can render it as somewhat more understandable code. This module is named dis, for disassembly. Let's see what it can do:

In [12]: import dis

In [13]: print(dis.Bytecode(myMathOperator.__code__).dis())

  2           0 LOAD_FAST                0 (a)
              3 LOAD_FAST                1 (b)
              6 LOAD_FAST                2 (c)
              9 BINARY_ADD
             10 LOAD_CONST               1 (2)
             13 BINARY_POWER
             14 BINARY_MULTIPLY
             15 RETURN_VALUE

Well, that's definitely better! The dis module is able to translate the bytecode into its opcodes and their arguments, which are far more readable. However, to find out what a piece of bytecode does, we must place ourselves in a stack context. Most opcodes modify the stack by either adding or removing elements (or both), or by changing the value at the top of the stack. For instance, LOAD_FAST is described as:

Pushes a reference to the local co_varnames[var_num] onto the stack.

For now, we can forget the reference to co_varnames (we will talk about it later on), and just remember that this opcode fetches a variable and puts it onto the stack. In our example, the stack is initially empty since it is the beginning of the function (we will represent it as [ ]). Supposing that the values of the a, b, and c arguments are respectively 12, 7 and 1, the stack will contain [12, 7, 1] after the first three instructions have executed.

On the next line, we reach a new opcode, BINARY_ADD. In the documentation, it is said that it:

Implements TOS = TOS1 + TOS

Ok, that's a bit less clear. Here, TOS means “Top Of the Stack”. So, basically, it takes the value at the top of the stack and the value just below it (TOS1), adds them, and pushes the result back onto the top of the stack. Applying this operation to our [12, 7, 1] stack, we obtain [12, 8].

Moving on to the next opcode, we find a LOAD_CONST. Basically, it does the same job as LOAD_FAST, except that it loads constants and not variables (in our case, the constant loaded is the 2 used as exponent). So our stack now contains [12, 8, 2].

The next opcode, BINARY_POWER does, according to the documentation:

TOS = TOS1 ** TOS

As for the addition, we take the two top-most items, raise the second to the power of the first, and push the result back onto the stack, which now contains [12, 64].

The next opcode is BINARY_MULTIPLY, which works similarly to BINARY_POWER and BINARY_ADD, and our stack now contains the product of 12 and 64, that is [768]. Finally, the RETURN_VALUE operation is said to:

Returns with TOS to the caller of the function.

That is, it pops the value on the top of the stack, and returns it to the calling function. In our case, the answer (768) is indeed returned, and we're done for this example.
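The walkthrough above can be replayed with a toy stack machine. This is only a sketch of the semantics described in the text, not CPython's interpreter loop: the instruction list is copied from the disassembly, and the run function and its trace are invented for the illustration.

```python
import operator

# Instructions from the disassembly, with arguments already resolved.
PROGRAM = [
    ("LOAD_FAST", "a"), ("LOAD_FAST", "b"), ("LOAD_FAST", "c"),
    ("BINARY_ADD", None),
    ("LOAD_CONST", 2),
    ("BINARY_POWER", None),
    ("BINARY_MULTIPLY", None),
    ("RETURN_VALUE", None),
]

BINARY = {
    "BINARY_ADD": operator.add,
    "BINARY_POWER": operator.pow,
    "BINARY_MULTIPLY": operator.mul,
}


def run(program, local_vars):
    """Execute the toy program; return (result, stack after each step)."""
    stack, trace = [], []
    for opname, arg in program:
        if opname == "LOAD_FAST":
            stack.append(local_vars[arg])
        elif opname == "LOAD_CONST":
            stack.append(arg)
        elif opname in BINARY:
            tos = stack.pop()                  # top of the stack
            tos1 = stack.pop()                 # the item just below it
            stack.append(BINARY[opname](tos1, tos))
        elif opname == "RETURN_VALUE":
            return stack.pop(), trace
        trace.append(list(stack))
    raise ValueError("program ended without RETURN_VALUE")


result, trace = run(PROGRAM, {"a": 12, "b": 7, "c": 1})
for snapshot in trace:
    print(snapshot)
print(result)
```

Each printed snapshot corresponds to one of the stack states quoted in the text, which makes it easy to check the order of TOS and TOS1 for non-commutative operations like BINARY_POWER.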

There are of course many other opcodes, but their behavior is essentially the same, popping and pushing values and references on the stack. As one can see, interpreting (and even writing) Python bytecode is fairly simple once we get the hang of it.

A last twist before the next step: some opcodes need arguments. This is the case for LOAD_FAST, which needs the index of the variable to grab. This is achieved very simply, by using up to three bytes for each operation. The first byte is the opcode itself, and the two remaining ones can be used for parameter passing. If there is no parameter to pass (as with BINARY_ADD), then only one byte is used. The attentive reader will have noticed the curious indexes in the bytecode snippet above: they mark precisely the byte index of each opcode.
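To make the encoding concrete, the co_code bytes shown earlier can be decoded by hand. The sketch below hardcodes the handful of CPython 3.4 opcode numbers involved (more recent interpreters use a different encoding and different numbers, so this table is an assumption tied to 3.4) and relies on the 3.4 rule that opcodes numbered 90 or above carry a two-byte little-endian argument.

```python
# Opcode numbers as defined by CPython 3.4 (only those used here).
OPNAMES_34 = {
    19: "BINARY_POWER",
    20: "BINARY_MULTIPLY",
    23: "BINARY_ADD",
    83: "RETURN_VALUE",
    100: "LOAD_CONST",
    124: "LOAD_FAST",
}
HAVE_ARGUMENT = 90  # opcodes >= 90 are followed by a 2-byte argument


def decode(co_code):
    """Return (byte offset, opcode name, argument or None) triples."""
    instructions, i = [], 0
    while i < len(co_code):
        op = co_code[i]
        if op >= HAVE_ARGUMENT:
            arg = co_code[i + 1] | (co_code[i + 2] << 8)   # little-endian
            instructions.append((i, OPNAMES_34[op], arg))
            i += 3
        else:
            instructions.append((i, OPNAMES_34[op], None))
            i += 1
    return instructions


CO_CODE = b'|\x00\x00|\x01\x00|\x02\x00\x17d\x01\x00\x13\x14S'
decoded = decode(CO_CODE)
for offset, name, arg in decoded:
    print(offset, name, "" if arg is None else arg)
```

The offsets it prints are the same byte indexes that dis displays in its second column.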

Great, so we are now ready! Let's write bytecode instead of Python code!

Wait, why?

At this point, a reasonable mind might say: why in the world would you bother to write bytecode, a limited and difficult language, instead of actual Python code? Well, because the compilation procedure is too slow and induces a significant overhead! To this answer, the same reasonable mind would probably say: who cares about the compilation time? It is not so slow (a few tenths of a second in most cases), and, even if it were, this must be done only once, since the next time you can just reload the *.pyc and go happily!

In most cases, the clever guy would be right. But there's a specific use which has to be taken into account. Its name is GP, standing for Genetic Programming.

Genetic programming is a hyper-heuristic, a member of the bigger family of evolutionary algorithms. Basically, it uses natural selection, mutations and other evolutionary concepts to evolve trees (in the computer science sense). As any program can be represented as a tree, genetic programming is actually able to evolve programs, that is, to learn on its own the best algorithm for a given problem, like getting out of a labyrinth, solving a Rubik's Cube, etc. We do not intend to dive into deeper explanations of GP, but the interested reader can find introductory and advanced material about it at www.gp-field-guide.org.uk.

As a DEAP contributor and user, I frequently have to work with GP and its DEAP implementation (DEAP is a generic evolutionary algorithms framework in Python, which implements GP among others). Here, the compilation time becomes a problem. GP uses a population of several hundred, or even thousands, of different programs. This is mandatory to keep a sufficient level of diversity. But in order to see the results produced by each of these programs, we must execute them, and, as they change during the evolution, we have to compile them each time. Consider the next figure, produced with gprof2dot and the Python profile module, which represents the execution of a typical evolution with DEAP. One can see that the compilation time, represented by the gp.compile function in light green, takes almost 50% of the total computation time. In other words, we spend half our computational effort just compiling the programs!

A few solutions have been proposed to address the problem. The compilation of the individuals with LLVM, as shown here, is one of them. Since it compiles to bare x86 machine code, it indeed reduces the execution time, but at the price of an increase in compilation time, the penalty being more severe if we also add compiler optimizations. Also, the use of an optimized Python interpreter like PyPy is not of great help with GP: it actually decreases performance by a factor of 2. In short, this can be explained by the JIT design itself: just-in-time compilers are efficient with long loops and repetitive code segments, since the whole idea is to make up for the compilation time with a significant speedup in upcoming executions of the same code. With GP, the code is constantly changing, so the JIT either does nothing (because it detects that it would not be of any use) or worsens the processing time by trying to compile every individual at each evaluation. Finally, the use of a heavily optimized C implementation of the basic functions used in the GP program is not helpful, since Python still has to compile the high-level code before calling this C implementation!

That's where our bytecode hobby comes on stage. What if we directly evolve Python bytecode? Well (spoiler alert), the results obtained with a preliminary implementation are given in the next figure. It should be noted that we use the exact same problem, configuration, and random seed for our standard and optimized implementations. The results are impressive: the compile time is now negligible (less than 1% of the total execution time), and the computation time is divided by more than two (47 seconds vs. 107 for the standard implementation), without altering the results in any way! It should be noted that this is the maximum speedup achievable, since the standard compilation procedure took about half of the program execution time. The small additional edge is a side effect caused by more effective implementations of other compile related methods.

Ok, great, so that was easy! Another problem solved!

Wait, how?

In order to understand the intricacies of the solution, we have to learn a bit about how GP is implemented in DEAP. As we said earlier, GP works by evolving trees. Since there's no tree implementation in the Python standard library, DEAP includes one specifically targeted at GP. It works by storing the trees in an underlying list, along with information about how many children each node has. For instance, the mathematical function f(x) = (x-1) + x*x can be represented by the following tree:

and will be stored in the following list:

['add', 'subtract', 'x', 1, 'multiply', 'x', 'x']

Because DEAP knows the number of arguments of each function (or, in genetic programming gibberish, their arity), it can reconstitute the tree. Thereafter, when the evaluation step comes, the following procedure is used:

The tree is converted into its string representation. Basically, it is the program that one would write to produce the result of the tree. For instance, the string representation of the previous tree would be add( subtract(x, 1), multiply(x, x) ).
This string is passed to eval(). This function is a powerful (and dangerous!) tool which executes arbitrary code from its string representation. For instance, the following code would write "Hello world!" to the standard output:

eval('print("Hello world!")')

It is in this step that the Python code is actually compiled into its executable form.
The compiled function is returned, so it can be evaluated.

The reader will have understood that the last step is mandatory and cannot be skipped, which is why we focus on the first two steps with our bytecode hack.
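The standard procedure described above can be sketched in a few lines. This is a simplified illustration loosely modeled on the description, not DEAP's actual code: ARITY, CONTEXT, to_source and compile_tree are names chosen for this example.

```python
import operator

# Assumed primitive set: arity and implementation of each function.
ARITY = {"add": 2, "subtract": 2, "multiply": 2}
CONTEXT = {"add": operator.add, "subtract": operator.sub,
           "multiply": operator.mul}


def to_source(nodes):
    """Step 1: turn the depth-first list into a Python expression."""
    remaining = iter(nodes)

    def build():
        node = next(remaining)
        if node in ARITY:
            args = ", ".join(build() for _ in range(ARITY[node]))
            return "{}({})".format(node, args)
        return str(node)                       # terminal: argument or constant

    return build()


def compile_tree(nodes, arguments=("x",)):
    """Steps 2 and 3: let eval() compile the string, return the function."""
    source = "lambda {}: {}".format(", ".join(arguments), to_source(nodes))
    return eval(source, dict(CONTEXT))


tree = ["add", "subtract", "x", 1, "multiply", "x", "x"]
print(to_source(tree))
f = compile_tree(tree)
print(f(3))
```

Both the string construction and the call to eval() walk the whole program, which is the cost the bytecode approach tries to avoid.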

First, one useful simplification: the resulting program is merely a sequence of calls to different functions. We never actually have to understand what is going on inside these functions. As a result, our implementation does not restrict the choice of functions at all. If it works in Python, it will work with our approach! Also, this reduces the complexity of our solution, since we only have to use a few opcodes. Basically, we will need the following:

LOAD_FAST : this will be needed to access our program's arguments.
LOAD_CONST : similarly, this opcode is required to use constant values (numerical or not).
LOAD_GLOBAL : this one will be used to retrieve references to the function objects of our tree. As its name suggests, instead of grabbing a symbol from the local variables like LOAD_FAST, it uses the global dictionary.
CALL_FUNCTION : obviously, this is mandatory to actually call the functions previously loaded. It takes as parameter the number of arguments to pass to the function. All these arguments must be on the stack at the moment of the call, plus, under them, a reference to the function object itself. This particular structure will prove to be very useful hereinafter.
RETURN_VALUE : this one will actually be needed only one time, to return the final result value.

Before we can go on with some code, it is time to learn how the functions, constants and arguments are actually described in a code object. The LOAD_* opcodes have an almost common description:

Loads the {global/local/constant} named {co_names/co_varnames/co_consts}[arg] onto the stack

Fair enough, but what are these co_names, co_varnames or co_consts fields? Well, they are simply tuple objects containing all the needed symbols. For instance, if co_names = (add, divide, subtract) and we want the subtract function, we will write LOAD_GLOBAL (2), that is, we use the index of the wanted function in the co_names tuple as the operation argument. As explained in the first section, the argument is simply the value of the two bytes following the opcode in the bytecode.

For the sake of simplicity, we will consider in the following part that we already have these tuples. Let's now see how to convert a list representation of our tree to its bytecode representation. The important point is to realize the similarities between the Python bytecode and our list representation. The following figure shows them quite directly.

As one can see, the conversion is merely an iteration through our list with the appending of a corresponding LOAD_* for each node, the relative order between the nodes staying the same (for the record, this order is called depth-first). The only tricky part is to add the CALL_FUNCTION opcode after the last argument. For this purpose, we keep a list of the number of arguments of each added node. Whenever we bump into a terminal node (a leaf), we decrement the argument count of its parent node. If this count gets to 0, then we know that we are done with this function, and we add the appropriate CALL_FUNCTION. Using this algorithm, we provide a simplified conversion function.


This function takes a list as argument, and returns its corresponding bytecode. Note that bytecode must be of type bytes, but this type is not suitable for our manipulations, because it is immutable. Therefore, we use a bytearray, a mutable equivalent of the bytes object. The opcode module contains various tools to assist bytecode creation. In this case, we use the opmap dictionary, which allows us to write the opcode in plain text ('CALL_FUNCTION' being far more readable than 0x83).
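A minimal sketch of such a conversion function follows. It is not the original listing: the signature and helper names are invented, and instead of opcode.opmap it hardcodes the CPython 3.4 opcode numbers, because the interpreter running the example may use a different encoding. For the same reason the result is checked with a tiny interpreter for the 3.4 format rather than executed by Python itself.

```python
import operator

# CPython 3.4 numbers (assumption: other versions differ).
OPMAP_34 = {"RETURN_VALUE": 83, "LOAD_CONST": 100, "LOAD_GLOBAL": 116,
            "LOAD_FAST": 124, "CALL_FUNCTION": 131}
HAVE_ARGUMENT = 90


def tree_to_bytecode(nodes, arity, names, varnames, consts):
    code = bytearray()
    pending = []          # [remaining args, total args] for each open call

    def emit(opname, arg=None):
        code.append(OPMAP_34[opname])
        if arg is not None:
            code.extend(arg.to_bytes(2, "little"))

    for node in nodes:
        if node in arity:                        # function node
            emit("LOAD_GLOBAL", names.index(node))
            pending.append([arity[node], arity[node]])
            continue
        if node in varnames:                     # terminal: program argument
            emit("LOAD_FAST", varnames.index(node))
        else:                                    # terminal: constant
            emit("LOAD_CONST", consts.index(node))
        # A finished argument may complete its parent call, which in turn
        # is a finished argument of its own parent, and so on.
        while pending:
            pending[-1][0] -= 1
            if pending[-1][0] > 0:
                break
            emit("CALL_FUNCTION", pending.pop()[1])
    emit("RETURN_VALUE")
    return code


def run_34(code, names, varnames, consts, global_vars, local_vars):
    """Toy interpreter for the five opcodes above (3.4 byte layout)."""
    stack, i = [], 0
    while True:
        op = code[i]
        has_arg = op >= HAVE_ARGUMENT
        arg = int.from_bytes(code[i + 1:i + 3], "little") if has_arg else None
        i += 3 if has_arg else 1
        if op == OPMAP_34["LOAD_GLOBAL"]:
            stack.append(global_vars[names[arg]])
        elif op == OPMAP_34["LOAD_FAST"]:
            stack.append(local_vars[varnames[arg]])
        elif op == OPMAP_34["LOAD_CONST"]:
            stack.append(consts[arg])
        elif op == OPMAP_34["CALL_FUNCTION"]:
            first = len(stack) - arg
            args = stack[first:]
            del stack[first:]
            func = stack.pop()                   # function sits under its args
            stack.append(func(*args))
        elif op == OPMAP_34["RETURN_VALUE"]:
            return stack.pop()


ARITY = {"add": 2, "subtract": 2, "multiply": 2}
NAMES, VARNAMES, CONSTS = ("add", "subtract", "multiply"), ("x",), (None, 1)
GLOBALS = {"add": operator.add, "subtract": operator.sub,
           "multiply": operator.mul}

tree = ["add", "subtract", "x", 1, "multiply", "x", "x"]
bytecode = tree_to_bytecode(tree, ARITY, NAMES, VARNAMES, CONSTS)
print(run_34(bytecode, NAMES, VARNAMES, CONSTS, GLOBALS, {"x": 3}))
```

Functions without arguments are left out of this sketch; they would need a CALL_FUNCTION emitted immediately after their LOAD_GLOBAL.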

We now have to make Python understand that it has to execute this bytecode. First, we must create a complete code object with it. While the bytecode obviously plays an important role in a code object, there are plenty of other pieces of information we have to give to Python:


That's a bunch of information! Most of these parameters are self-explanatory, but a few deserve more explanation. The stacksize parameter controls how many items can be on the stack at the same time. For instance, each nested call keeps its function reference and pending arguments on the stack until it completes. While its exact value is not really important, one must take care not to exceed it, under threat of a segfault of the Python interpreter! The flags argument controls various things about the way the code is handled, yet there is not much documentation about it. The value of 67 (64+2+1) comes from the actual value given by Python to the executable objects in the standard GP implementation.
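For what it's worth, the three bits in 67 can be named using the flag constants from CPython's code object header. The table below lists the values as defined for CPython 3.4; treat it as an assumption for other versions, and describe_flags as a helper invented for this example.

```python
# Code object flag bits as defined by CPython 3.4.
CO_FLAGS = {
    0x01: "CO_OPTIMIZED",    # uses fast locals
    0x02: "CO_NEWLOCALS",    # a new local namespace is created on call
    0x04: "CO_VARARGS",      # accepts *args
    0x08: "CO_VARKEYWORDS",  # accepts **kwargs
    0x10: "CO_NESTED",       # defined inside another function
    0x20: "CO_GENERATOR",    # generator function
    0x40: "CO_NOFREE",       # no free or cell variables
}


def describe_flags(flags):
    """Return the names of the bits set in a co_flags value."""
    return [name for bit, name in sorted(CO_FLAGS.items()) if flags & bit]


print(describe_flags(67))
```

In other words, 67 describes a plain function with optimized local variables and no closure, which is what a lambda compiled by eval() looks like.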

Now that we have the code object, we must associate it with a function. There are various ways to do it, but one of the most obvious in Python 3 is to use the types module again, but to create a function object this time:


The second argument (a dictionary) is mandatory to tell Python which function actually corresponds to each symbol. Fortunately, its generation was already implemented in DEAP, since the same mechanism applies when using the eval function.
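The role of that dictionary can be illustrated without building a code object by hand. In this sketch (not the original listing), the code object is borrowed from an ordinary function named template, a name chosen for the example, and types.FunctionType binds it to a context dictionary that decides what each global symbol means.

```python
import operator
import types


def template(x):
    # add, subtract and multiply are looked up as globals at call time.
    return add(subtract(x, 1), multiply(x, x))


context = {"add": operator.add, "subtract": operator.sub,
           "multiply": operator.mul}

# Same code object, but global names are resolved through `context`.
evolved = types.FunctionType(template.__code__, context, "evolved")
print(evolved(3))

# Changing the dictionary changes the behavior without touching the code.
swapped = dict(context, add=operator.mul)
print(types.FunctionType(template.__code__, swapped, "swapped")(3))
```

With a hand-built code object, the first argument would simply be that object instead of template.__code__.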

Another interesting thing about this approach is that it allows for easy genetic manipulation, like crossovers or mutations. In a few words, these operations change the content of the tree by modifying or exchanging branches. In the standard list implementation, a branch replacement could be done with a simple slicing operation, since each branch is stored in a contiguous way. Well, that rationale also applies to our bytecode representation! In order to identify a subtree, we just need to obtain the position of the LOAD_GLOBAL of its root node, and look for the corresponding CALL_FUNCTION thereafter. These two positions then give us the indices needed for the slice construction.
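On the list representation, the slicing idea can be sketched as follows; subtree_slice and crossover are illustrative names, not DEAP's API. The same bookkeeping carries over to the bytecode form, where the slice runs from the LOAD_GLOBAL of the root node to its matching CALL_FUNCTION.

```python
def subtree_slice(nodes, start, arity):
    """Return the slice covering the subtree rooted at index `start`."""
    remaining, end = 1, start        # number of nodes still expected
    while remaining > 0:
        # Each node fills one slot and opens one slot per child.
        remaining += arity.get(nodes[end], 0) - 1
        end += 1
    return slice(start, end)


def crossover(a, i, b, j, arity):
    """Swap the subtree of `a` at index i with the subtree of `b` at j."""
    slice_a = subtree_slice(a, i, arity)
    slice_b = subtree_slice(b, j, arity)
    child_a, child_b = list(a), list(b)
    child_a[slice_a], child_b[slice_b] = b[slice_b], a[slice_a]
    return child_a, child_b


ARITY = {"add": 2, "subtract": 2, "multiply": 2}
parent_a = ["add", "subtract", "x", 1, "multiply", "x", "x"]
parent_b = ["multiply", "x", 2]

print(subtree_slice(parent_a, 1, ARITY))
print(crossover(parent_a, 1, parent_b, 1, ARITY))
```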

The following table and figure show the time required to evaluate trees of different lengths and the relative speedup. The bytecode implementation reaches its optimal speedup around 250 nodes, but provides a considerable gain even for smaller lengths. The speedup value can be easily understood if we refer to the evaluation procedure described at the beginning of this section: the first two steps, which were executed in O(n), are now done in O(1). Of course, the result computation itself has not changed, and still has, in our test case, O(n) complexity. Overall, in exact notation, we went from a Θ(3n) complexity down to Θ(n+2) with the bytecode approach, which is consistent with the observed speedups.

# of nodes	Standard	Bytecode	Speedup
1	4.2674E-05	2.3882E-05	1.787
3	5.9971E-05	3.0232E-05	1.984
7	8.1913E-05	3.6364E-05	2.253
15	1.5890E-04	7.4422E-05	2.135
31	2.5088E-04	9.6281E-05	2.606
63	4.5313E-04	1.5996E-04	2.833
127	8.6494E-04	2.9974E-04	2.886
255	1.6832E-03	5.0897E-04	3.307
511	3.3082E-03	1.0151E-03	3.259
1023	6.5782E-03	1.9996E-03	3.290
2047	1.3001E-02	3.8225E-03	3.401
4095	2.6315E-02	7.8760E-03	3.341
8191	5.4996E-02	1.5786E-02	3.484
16383	1.0808E-01	3.1520E-02	3.429
32767	2.1563E-01	6.2349E-02	3.459
65535	4.3286E-01	1.2690E-01	3.411

Conclusion

We have provided a simple way to speed up a genetic programming evolution by directly evolving Python bytecode, without intermediate representations. This generally halves the computation time needed to perform an evolution, which is far from negligible with real-world problems. This is especially important when taking into account that, like most stochastic methods, evolutionary algorithms need to be run at least a couple of times to produce statistically significant results. The complete code is available at github.com/mgard/deap. However, it should be noted that this code includes many hacks to make it fully compliant with the DEAP API (so an average user does not have to worry about which tree implementation is in use), and is still in development to further improve performance and reliability. One funny characteristic of our approach is that it goes so deep into Python internals that it could actually segfault the interpreter whenever an error occurs...

If the clever mind from section 3 did not run away already (which would probably be the reasonable thing to do instead of reading an article describing a weird and unclean optimization technique targeting a very narrow topic of an already narrow field on a specific language), he might notice one interesting thing in the second profiling figure. The evaluation itself now takes about two thirds of the total computation time. Moreover, even basic arithmetic functions like multiply or divide are called so often (almost 100 million times!) that they take up to 20% of the total execution time! That's clearly where we should focus in order to further improve performance. But how could we do it?

Well, in this post, we had fun with Python bytecode. Maybe it's time to take the next step, and use a different type of bytecode, even more low level. But that will be the subject of another story...

Sound decoding: A tale of visualization

2014-05-31T17:35:00.000-04:00

Let's keep informed by listening to the radio

Who isn't thrilled by mystery spy stories and espionage, challenges and ciphers to understand and decrypt? I was recently fascinated by The Conet Project, which is twenty years of recordings of unknown broadcasts on shortwave radios. The public distribution of these recordings created some small communities striving to comprehend these messages.

Most of the recordings are numbers spoken by text-to-speech software, which could probably be used with a one-time pad in order to retrieve a message. But some of them hold tones which probably map to numbers. Let's find these numbers!

Down the rabbit hole while riding a snake

We'll concentrate on track 39 of disk 2 of The Conet Project, which sounds like this:


Here is the link to the audio if your browser does not support HTML5.

It begins with a repetitive pattern, probably to draw the attention (and not ¡Atención!) of potential listeners. Then, at 1:30, a preamble-like pattern emerges, which gives way at 1:33 to a 1-second clock-like pattern followed by a message. The booklet of the recording states:

"High Pitch Polytone: 5 Figure groups, represented by tones, the lowest being a space. No real information on these stations, save that they may be related to M12."

M12 is the Enigma Identifier of a Russian-operated number station. More information is available on the Priyom entry of M12. Recent recordings of M12 seem to be monotonal Morse, but anyway... I'll let your mind ramble on these ideas.

Fantasies aside, how can we visualize and extract the data contained in this particular track? By gleaning various intertubes references such as this or this, I came up with the satisfactory amplitude + spectrogram chart as shown in this figure:

As the zoomed figure shows, the tones are clearly visible. They are defined by their fundamental frequency, the darkest red bar at the bottom of the spectrum chart, and their overtones, the lighter red bars. These figures were obtained by this Python code:


This code uses the specgram functionality of matplotlib, which is based on the Discrete Fourier Transform (DFT), to obtain the frequencies contained in blocks of data. I chose a fixed block size of 512 samples, so every column of the spectrogram is the Fourier transform of a 512-sample block of the audio signal. At 44,100 samples per second, each block represents 11.6 ms of the signal and allows the analysis of frequencies from 22,050 Hz (the Nyquist frequency, half of 44,100 Hz) down to about 86 Hz ( $ \dfrac{44100 \; \mathrm{Hz}}{512} $ ), which is also the spacing between frequency bins. All that is well and interesting, but how can we automatically find the values represented by each tone?

The plan is to take every column of the spectrogram, namely the Fourier transform of each block, and find its global maximum. For that matter, let's also take all the local maxima, which will give us the shape of the tone emitted. Here is what a single column looks like when the color of the spectrogram is mapped to the y-axis:

The top chart shows the previous spectrogram and highlights the band at $ t = 2 \; \mathrm{s} $. The middle chart shows the Fourier transform. The bottom one shows the same transform but displayed with a logarithmic y-axis, or, as we are used to seeing it, enlarged to show texture. As we can see, there is clearly a single prominent fundamental frequency around 430 Hz followed by its overtones. We can also determine that the tone was synthesized up to ~13.1 kHz and the rest seems to be noise. Interesting fact: even harmonics are strongly attenuated compared to the odd harmonics. This is because the generated tone is an approximation of a square wave. The code for this is available here.
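To make the single-column idea concrete, here is a minimal, dependency-free sketch. It does not use the recording: it synthesizes a stand-in tone at 430 Hz from odd harmonics only (an assumption, chosen to mimic the square-like shape described above), takes one 512-sample block, and looks for the strongest frequency bin with a naive DFT. The function names are made up for this illustration.

```python
import cmath
import math

SAMPLE_RATE = 44100   # Hz, as in the recording
BLOCK_SIZE = 512      # samples per block, as in the text


def square_like_tone(freq_hz, n_samples, n_harmonics=5):
    """Synthesize a test tone from odd harmonics only (assumed stand-in signal)."""
    samples = []
    for n in range(n_samples):
        t = n / SAMPLE_RATE
        value = sum(
            math.sin(2 * math.pi * (2 * k + 1) * freq_hz * t) / (2 * k + 1)
            for k in range(n_harmonics)
        )
        samples.append(value)
    return samples


def magnitude_spectrum(block):
    """Naive DFT magnitudes for bins 0 .. N/2 (slow, but needs no library)."""
    n = len(block)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            block[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n)
        )
        spectrum.append(abs(acc))
    return spectrum


def strongest_frequency(block):
    """Return the centre frequency (Hz) of the bin with the largest magnitude."""
    spectrum = magnitude_spectrum(block)
    # Skip bin 0 (the DC component) when searching for the peak.
    peak_bin = max(range(1, len(spectrum)), key=spectrum.__getitem__)
    return peak_bin * SAMPLE_RATE / len(block)


block = square_like_tone(430.0, BLOCK_SIZE)
peak_hz = strongest_frequency(block)
print(f"strongest bin is centred on {peak_hz:.1f} Hz")
```

The reported frequency can only be a multiple of the bin spacing, which is why the coarse frequency grid matters for the peak search discussed next.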

So, the only thing remaining is to find the peaks of the Fourier transform. This can be done using the find_peaks_cwt function of scipy, which convolves a wavelet with the signal to find the desired features. But the data points of our frequency signal are spaced too far apart for this kind of processing. This is caused by the size of the blocks we used to compute the Fourier transform. The wider we are in the time domain, the narrower the bins are in the frequency domain, and vice versa. So it could be solved by invoking specgram() with a larger NFFT parameter, but another way to solve it is to resample the frequency signal.

Let's apply this peak finding function on every block in our recording. After that, the only thing left is to merge the similar data together in order to get the beginning and the end of a tone. To do so, we'll look for similar frequencies with a high amplitude. In a single-tone signal such as this one, the most powerful harmonic is the fundamental frequency. If the fundamental frequency in the following block is within a certain percentage margin of the fundamental frequency of the current block, it should indeed be the same tone.
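The merging rule can be sketched as follows. This is an illustration rather than the code used for the results below; the function name, the 3% tolerance and the use of None for blocks without a detected tone are all assumptions.

```python
def merge_tones(fundamentals, block_duration, tolerance=0.03):
    """Merge consecutive blocks whose fundamental stays within `tolerance`.

    fundamentals: one fundamental frequency (Hz) per block, None if no tone.
    Returns a list of (start_time, end_time, mean_frequency) tuples.
    """
    tones = []
    run = []  # (block index, frequency) pairs of the tone being built
    # A trailing None acts as a sentinel that flushes the last run.
    for index, freq in enumerate(list(fundamentals) + [None]):
        if (freq is not None and run
                and abs(freq - run[-1][1]) <= tolerance * run[-1][1]):
            run.append((index, freq))  # same tone as the previous block
            continue
        if run:
            start = run[0][0] * block_duration
            end = (run[-1][0] + 1) * block_duration
            mean = sum(f for _, f in run) / len(run)
            tones.append((start, end, mean))
        run = [(index, freq)] if freq is not None else []
    return tones


# Hypothetical per-block fundamentals: two tones, a gap, then a third tone.
example = [431.0, 430.0, 432.0, 345.0, 344.0, None, 430.0]
tones = merge_tones(example, block_duration=512 / 44100)
for start, end, freq in tones:
    print(f"{start:.4f}s to {end:.4f}s: {freq:.1f} Hz")
```

Comparing each block with the previous one (rather than with the first block of the tone) tolerates slow drift, at the price of possibly merging two tones that glide into each other.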

Once we get the number of tones and their frequency, we'll decode their values. We can get a pretty good idea of what we are dealing with by analyzing a histogram of the frequencies:

We can forget the outlier at 133 Hz, which is the buzz or scratch at the end of the recording. We can see the four most common tones from the pattern in the first minute and a half: ~305 Hz, ~400 Hz, ~340 Hz, ~422 Hz. Another interesting note: the data ranges from 300 Hz to 521 Hz. Curiously, this range gives roughly 10 symbols on the equal-tempered scale. We can try to put the tones in bins between E♭4 and C5. This gives us the following key number sequence:
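A minimal sketch of this binning, assuming standard piano key numbering in which A4 = 440 Hz is key 49 (so E♭4 is key 43 and C5 is key 52). The function name is made up for this illustration.

```python
import math


def frequency_to_key(freq_hz, a4_hz=440.0):
    """Nearest piano key number on the equal-tempered scale (A4 = key 49)."""
    return round(12 * math.log2(freq_hz / a4_hz) + 49)


# The four most common tones read off the histogram.
common_tones = [305, 400, 340, 422]
keys = [frequency_to_key(f) for f in common_tones]
print(keys)
```

Under that assumption the four common tones land on keys 43, 47, 45 and 48, which is how the sequence starts.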

[43, 47, 45, 48, 43, 45, 43, 47, 45, 48, 43, 48, 45, 48, 43, 47, 45, 48, 43, 45, 43, 47, 45, 48, 43, 47, 45, 48, 43, 47, 45, 48, 43, 45, 43, 47, 45, 48, 43, 48, 45, 48, 43, 48, 46, 48, 43, 45, 44, 48, 45, 48, 43, 48, 45, 48, 43, 47, 48, 45, 45, 45, 48, 43, 45, 43, 48, 45, 45, 48, 43, 48, 45, 48, 43, 48, 45, 48, 43, 45, 43, 48, 45, 48, 43, 48, 45, 48, 43, 47, 45, 48, 43, 45, 43, 51, 51, 43, 43, 43, 43, 43, 43, 45, 52, 46, 47, 49, 43, 45, 52, 45, 46, 48, 43, 47, 51, 49, 46, 52, 43, 51, 48, 44, 47, 52, 43, 51, 45, 50, 46, 43, 46, 50, 47, 45, 48, 43, 46, 52, 49, 47, 43, 49, 51, 46, 43, 46, 49, 48, 45, 43, 45, 49, 50, 51, 50, 43, 45, 47, 50, 45, 48, 43, 45, 50, 52, 46, 43, 45, 49, 45, 52, 46, 44, 45, 47, 48, 52, 49, 43, 45, 48, 50, 43, 45, 47, 50, 45, 43, 48, 45, 47, 51, 43, 51, 52, 48, 44, 50, 43, 50, 45, 49, 50, 48, 43, 49, 46, 52, 46, 48, 43, 49, 52, 49, 52, 51, 43, 48, 51, 45, 48, 50, 43, 45, 47, 49, 48, 43, 45, 50, 48, 49, 43, 50, 49, 50, 48, 43, 48, 46, 52, 47, 48, 43, 47, 48, 52, 50, 52, 44, 52, 44, 52, 44, 52, 44, 52, 44, 52, 44, 52, 44, 52, 44, 52, 44, 52, 44, 52]

This was given by the code over here. Note that it is only a quick proof of concept patched together in an evening; many improvements should be made to this code. Things like enhanced robustness (an adaptive constant for NFFT, for instance) or averaging the harmonics of the tone instead of taking only the harmonic footprint of the first sample could greatly benefit it.

Using Lilypond, this script (sorry for the hardcoding) can generate a beautiful sheet music representation of the tones. Note length was omitted for this figure; only their pitch was considered.

Another representation may be of interest: text. We can convert the lowest note to a space, as the booklet suggests, and convert all the other notes to the letters a to i using this simple code:


Once this is done, the following text emerges:

dbe b dbe ebe dbe b dbe dbe dbe b dbe ebe ece baebe ebe debbbe b ebbe ebe ebe b ebe ebe dbe b hh bicdf bibce dhfci headi hbgc cgdbe cifd fhc cfeb bfghg bdgbe bgic bfbicabdeif beg bdgb ebdh hieag gbfge fcice fifih ehbeg bdfe bgef gfge ecide deigiaiaiaiaiaiaiaiaiaiai

This could be a great starting point to analyze its content. We realize that the two alternating notes at 1:30 were wrongly decoded as "hh". Also, some segments were wrongly decoded, as the data after 1:33 is always a group of 5 tones. You can convince yourself by listening to it using VLC to reduce the play speed. Hence, we've got 11 false decodings out of 27. Enhancing the tone merging algorithm should do the trick for most errors, as the input is noisy and sometimes distorted. If it were not grouped as it is, text statistics could then be applied to try to understand whether these are really words. If the histogram of word lengths matched an English corpus, this data could be considered words. But since it is always grouped by 5, its likelihood of being words is really low. If we replace the letters with numerals, the following sequence is found:

 425 2 425 525 425 2 425 425 425 2 425 525 535 21525 525 452225 2 5225 525 525 2 525 525 425 2 88      29346 29235 48639 85149 8273 37425 3964 683 3652 26787 24725 2793 26293124596 257 2472 5248 89517 72675 63935 69698 58257 2465 2756 7675 53945 4597919191919191919191919
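Both conversions are simple lookups. The sketch below assumes key 43 is the space delimiter and that keys 44 to 52 map to a..i, then to 1..9; the function names are made up for this illustration.

```python
import string

LOWEST_KEY = 43  # assumed to be the space delimiter, following the booklet


def keys_to_letters(keys):
    """Map the lowest key to a space and keys 44..52 to the letters a..i."""
    return "".join(
        " " if key == LOWEST_KEY
        else string.ascii_lowercase[key - LOWEST_KEY - 1]
        for key in keys
    )


def letters_to_digits(text):
    """Replace a..i with 1..9 and leave spaces untouched."""
    return "".join(
        ch if ch == " " else str(ord(ch) - ord("a") + 1) for ch in text
    )


# First ten keys of the decoded sequence.
first_keys = [43, 47, 45, 48, 43, 45, 43, 47, 45, 48]
letters = keys_to_letters(first_keys)
digits = letters_to_digits(letters)
print(repr(letters), repr(digits))
```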

We see that the last notes are a repetition of the highest and lowest notes (excluding the space delimiter). This seems like an end of transmission pattern.

Spy phone home

Truth be told, signal analysis is kind of my nemesis. To take this exercise to the next level, I decoded a phone number from a recording of a telephone dialing signal. The idea came from two main sources: first, a commission currently covered in my local news, which did not censor the dial tones in telephone recordings played during public hearings. Second, a friend of mine vigorously criticized the poor conclusion I was trying to pull forth, which was supposed to be: "Oh, I found the sequence, but there's no known way to decrypt it!"

So I loaded this phone number signal (read: generated it here):


Here is the link to the audio if your browser does not support HTML5.

Using the aforementioned notions results in this code, which gives the correct decoding: 562-4897 (randomly chosen number, dial it at your own risk!). One thing that merits mention: phone signaling is (usually) done using Dual-Tone Multi-Frequency (DTMF), meaning that a tone containing two different frequencies (that are not a multiple of one another) is used for each keypad entry. To decode these two frequencies, I took the two most important frequencies (those with the highest amplitude) and picked the symbol which minimized the sum of the squared errors. This common method is called least squares.
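The least-squares lookup can be sketched as below. The frequency table is the standard DTMF keypad layout (without the fourth A-D column); the measured frequency pairs are invented for the example and do not come from the recording.

```python
LOW_FREQS = (697, 770, 852, 941)    # row frequencies in Hz
HIGH_FREQS = (1209, 1336, 1477)     # column frequencies in Hz
KEYPAD = ("123", "456", "789", "*0#")


def decode_dtmf(freq_a, freq_b):
    """Return the key whose frequency pair minimizes the squared error."""
    low, high = sorted((freq_a, freq_b))
    row, col = min(
        ((r, c) for r in range(len(LOW_FREQS)) for c in range(len(HIGH_FREQS))),
        key=lambda rc: (low - LOW_FREQS[rc[0]]) ** 2
        + (high - HIGH_FREQS[rc[1]]) ** 2,
    )
    return KEYPAD[row][col]


# Hypothetical, slightly noisy measurements (Hz), one pair per key press.
measured = [(775, 1330), (768, 1480), (700, 1340), (772, 1205),
            (850, 1338), (855, 1470), (849, 1212)]
number = "".join(decode_dtmf(a, b) for a, b in measured)
print(number)
```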

Conclusion

Decoding signals can be daunting, but it should be a breeze with the proper tools. I am impressed to see how close we are to a polyphonic sheet music extractor working from an audio file. These notions could also be used to recognize the emitter of a sound, among a handful of other uses. I'll let you be creative.

But enough conspiracy theories for today. Hope you enjoyed!

Thanks to Alexandre Boily, who introduced me to the Conet Project.

Math evolution and dirty tricks

2014-04-08T11:04:00.002-04:00

Curiosity killed the cat

I've always been amazed by early game developers. Unlike today, they had to bend and twist software to its extreme limits to squeeze all the performance they could out of older hardware. Epic and glorious battles where illustrious unknowns fought the machine in the darkness of their basement, such as this legend. The results of these very pleasant evenings are often food for thought... or blog entries.

One such example is the fast inverse square root algorithm found in the Quake III game. The mathematical relation $ \dfrac{1}{\sqrt{x}} $ is needed to compute normalized vectors, which are handy for computing lighting in a software renderer. There was a problem, though. Performing a floating-point square root followed by a division on a 1990s-era processor was painfully slow, resulting in dreadfully low frame rates and a poor player experience. Nowadays, CPUs (using SIMD instructions such as rsqrtss) or GPUs compute the inverse square root in real time with high accuracy, making this trick obsolete. But it is so mind-boggling that it is interesting to analyse.

It's turtles all the way down

Let's examine this trick. First, an evil alien bit hack using the 0x5f3759df constant and a right shift on a floating point number (!) is performed, followed by an iteration of the Newton-Raphson method. Here is the code directly taken from the ./code/game/q_math.c source file from the 1.32b version of the Quake III source code:


As you can see, the core of the fast inverse square root trick is on line 10. A right shift, followed by a subtraction. All performed by the ALU. Way faster than floating-point epsilon-perfect computation.
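For readers who want to experiment without a C compiler, the same two steps (integer bit hack, then one Newton-Raphson iteration) can be sketched in Python. This is an illustrative port, not the Quake III listing; it uses the struct module to reinterpret a 32-bit float as an integer and back, and the function name is made up.

```python
import math
import struct


def fast_inverse_sqrt(x):
    """Approximate 1/sqrt(x) for a positive float, 32-bit trick style."""
    # Reinterpret the float's 32 bits as an unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # The bit hack: a right shift followed by a subtraction.
    i = 0x5F3759DF - (i >> 1)
    # Reinterpret the integer's bits as a float again.
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson iteration to refine the estimate.
    return y * (1.5 - 0.5 * x * y * y)


for value in (0.25, 1.0, 4.0, 100.0):
    print(value, fast_inverse_sqrt(value), 1.0 / math.sqrt(value))
```

Python performs the final refinement in double precision, so the low-order digits will differ slightly from a pure 32-bit implementation.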

That made me think. Can I come up with this kind of trick? But that would require a deep analysis of the floating point representation (IEEE 754), and, quite frankly, I'm a lazy guy. Why do it myself when I can let the computer do it?

Armed with my favorite Evolutionary Algorithm library in Python, DEAP, I embarked on the journey of finding dirty floating point representation tricks to approximate common functions. The idea is to let Genetic Programming (GP) evolve an equation using integer tricks to represent floating point operations. Two objectives were set:

Accuracy;
Total operation cost.

Conceptually, this is quite easy to implement. But a problem arose: these integer tricks are highly dependent on precise constants that are way too hard for GP to guess. To circumvent this issue, I built a two-stage evolution. The first GP stage still evolves mathematical equations, but every individual evaluation now requires a parameter optimization. The individuals have terminals that look like inputs but are in fact constants that must be found by the second stage of evolution. To perform this second stage, a simpler algorithm based on Scipy's optimize module is used. This stage won't be able to solve the large and difficult equations (individuals) that the GP may produce, and it may get stuck in local minima, but large individuals are costly to execute and thus not an interesting path to scrutinize.

Long story short, here is the code:


The program is roughly divided as such:

Lines 1 - 16: Imports. You will need numpy, scipy and deap to run the example;
Lines 18 - 36: Safe arithmetic operators. Without them, division by zeros and similar would occur during evolution;
Lines 39 - 49: Operation execution cost. The values are more rule of thumb ideas than actual benchmarks;
Lines 51 - 85: Helper function to show the equation in its infix notation instead of DEAP's default prefix. It only improves readability and does not affect evolution;
Lines 88 - 103: Primitives definition. Sadly, four parameters (a, b, c, d) were defined instead of a dynamic array, but this simplification should do the trick;
Lines 105 - 112 and 152 - 188: DEAP's usual setup and execution;
Lines 114 - 115: The simplest way I found to reinterpret floats in integers in Python;
Lines 117 - 128: Second stage of optimization that tries to get the best parameters (a, b, c, d) of an individual. Line 122 defines the error minimization function;
Lines 130 - 147: First stage of optimization, GP individual handling;
Line 149: Set of input values on which we strive to minimize the error.

Data, reporting for duty

Once the program has run, most entries from the hall of fame (best individuals) are analogous to $ (a + a) - (x >> 1) $ or other variations around constants. No other equation family was able to compete with it.

As can be seen, we found the original equation from Quake III. Well, almost... we clearly see a problem with GP: bloat. It found $ (a + a) $ instead of $ a $, which are semantically identical given that $ a $ is a constant. If we simplify the equation manually, the second stage found an optimal value around 0x5f33f870, which is pretty near the original magic constant. Fine-tuning the Nelder-Mead algorithm of the second stage would give a better constant. The general look of the function is shown in the following figures.

Figure 1: Output of $ a - (x >> 1) $ (blue) and $ \dfrac{1}{\sqrt{x}} $ (green) for input values [0, 50].

Figure 2: General view of the relative error of the $ a - (x >> 1) $ function

Figure 3: Relative error of the $ a - (x >> 1) $ function with different values of a zoomed around 0x5f340000
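The kind of error measurement behind these figures can be approximated with a short sketch. It evaluates $ a - (x >> 1) $ on the float's bit pattern, with no Newton-Raphson refinement, and reports the worst relative error for the two constants mentioned above. The helper names and the input grid (0.01 to 50 in steps of 0.01) are assumptions made for this illustration, not the settings of the evolution.

```python
import struct


def float_to_bits(x):
    """Reinterpret a 32-bit float as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(i):
    """Reinterpret an unsigned 32-bit integer as a float."""
    return struct.unpack("<f", struct.pack("<I", i))[0]


def shift_trick(x, a):
    """Evaluate a - (x >> 1) on the bit pattern of x."""
    return bits_to_float(a - (float_to_bits(x) >> 1))


def max_relative_error(a, inputs):
    """Worst relative error against 1/sqrt(x) over the given inputs."""
    # approx / (1 / sqrt(x)) - 1 simplifies to approx * sqrt(x) - 1.
    return max(abs(shift_trick(x, a) * x ** 0.5 - 1.0) for x in inputs)


inputs = [0.01 * k for k in range(1, 5001)]
errors = {a: max_relative_error(a, inputs) for a in (0x5F3759DF, 0x5F33F870)}
for constant, error in errors.items():
    print(hex(constant), f"{100 * error:.2f} %")
```

A function like max_relative_error could serve as the objective handed to a numerical optimizer in the second stage, although the objective actually used in the post may differ.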

Facing my Waterloo

Sadly, my selfish self found no better trick than the one in the Quake III source. At least, the original™ fast inverse square root trick was found. In the next episode, we'll try to uncover other novel and unpredictable approximations to other common expressions. Stay tuned!

Review of USB Condoms

2013-09-15T22:30:00.000-04:00

We're in danger

Computer software anomalies often have their biological analogies, such as viruses or bugs. Transmission of these virtual infectious diseases by physical contact has been observed since the dawn of the digital age, be it on floppy disks, optical disks or USB drives. But recently, a new trend of threat transmission targeting mobile systems began to spread. Since the start of the mobile era, more and more devices (cellphones, music players, etc.) use their USB connector for multiple purposes, such as data transfer and charging. This allows malicious computer programs to inject malware on a device when its owner only wants to charge it. Worse: one can modify a simple charger to propagate these threats. The problem? It is beginning to spread to public and mainstream systems.

The solution to these virtual physically transmitted diseases? A USB Condom! Available through int3.cc, this simple board only connects the power pins through while cutting the data pins. Simple and clever. But can we presumptuously remove half the USB 2.0 wires without side effects?

The current state of affairs

The USB connectors are defined with 4 (2.0) or 10 (3.0) connections which always contain two power pins (5V and ground) in addition to their data pins.

Let's read some power specifications directly from the ultimate reference, the Kamasutra of USB, the official USB 2.0 standard documentation, pages 171 and 245:

Depending on the power capabilities of the port to which the device is attached, a USB device may be able to draw up to five unit loads from V BUS after configuration. A unit load is defined to be 100 mA.

It is also written on page 178 of the USB 2.0 specifications document that a device described as high-power should require a maximum of 500 mA on the 5 V power rail of the USB connector after being configured and 100 mA maximum during power-up (page 171).

The USB 3.1 specification raises the high-power current draw to a maximum of 900 mA (6 unit loads of 150 mA) when the client device is configured but states, in chapter 11.4 of its specifications:

Note that a USB 3.1 peripheral device shall not draw more than 100 mA until it detects far-end Rx terminations in the unconfigured state.

Finally, - the last quote from the specifications, I promise - the battery charging specification (revision 1.2, from December 2010) allows drawing up to a whopping 1500 mA from a USB port of a compliant device for charging purposes without the need to connect the data pins (after a delay of at most 900 ms). But this amazing power is only available through hardware modifications of the original USB host specification, which most computers and laptops don't implement.

It should be noted that many manufacturers of portable devices and adapters do not follow these specifications. Most wall adapters can deliver more power than any standard specification, even the battery charging specification. For example, the ODROID-U2 uses an EIAJ-01-like connector to transmit 10 Watts! But who will condemn this practice - aside from the firefighters - when it means faster charge times?

With these kinds of devices, the USB Condom will do a good job protecting your beloved electronics from electronic STDs while allowing them to charge freely.

But you won't be so lucky when charging from a computer, a laptop or any other power-delivering device that is more likely to follow specifications. The USB enumeration process, which uses the data pins, will likely fail when using the USB Condom. The device providing power will not be able to read the configuration registers stating the high-power requirement of the charging device. This will limit the charging current to 100 mA, which is the maximum allowed current during power-up (unconfigured state) and which coincidentally can be near or under the device's own power consumption. Result? An awfully slow charge or even no charge at all!
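A back-of-the-envelope sketch shows why 100 mA can mean "no charge at all". The battery capacity and the device's own consumption below are hypothetical numbers picked for illustration, and the model ignores charging losses and any current tapering done by the charge controller.

```python
def charge_time_hours(battery_mah, supply_ma, device_draw_ma):
    """Rough time to fill a battery, or None if it never fills.

    Only the current left over after powering the device reaches the battery.
    """
    net_ma = supply_ma - device_draw_ma
    if net_ma <= 0:
        return None
    return battery_mah / net_ma


BATTERY_MAH = 2000  # hypothetical phone battery

unconfigured = charge_time_hours(BATTERY_MAH, supply_ma=100, device_draw_ma=80)
configured = charge_time_hours(BATTERY_MAH, supply_ma=500, device_draw_ma=80)
busy_device = charge_time_hours(BATTERY_MAH, supply_ma=100, device_draw_ma=120)
print(unconfigured, configured, busy_device)
```

With these made-up figures, the unconfigured 100 mA budget takes about twenty times longer than the configured 500 mA budget, and a device drawing more than 100 mA never charges.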

This behavior particularly hinders the utility of the USB Condom, since its goal is to charge your portable devices while protecting them from malicious communications, which mainly come from computers (which may have viruses) or from a fraudulent charger containing an embedded microcontroller delivering malware. Using the USB Condom will result in a slow charge or no charge from these sources, depending on how closely the pirate manufacturer of the charger followed the specifications. It's a shame; when you put on a condom, you should get some satisfaction!

There shall be light

One way to circumvent this problem would be to include a USB charger detector IC or a simple microcontroller with a USB stack, such as the ones available from Microchip. While keeping the power connectors linked to the protected device, its data pins would be replaced by the microcontroller's pins, which would enumerate as a generic high-power device while ignoring any communication attempt. Sadly, this would raise the manufacturing price of the USB Condom. But it would allow the charging sequence to respect the original USB specifications. This new device could even capture every malware injection attempt and create a Virtual STD Honeypot! What an appetizing idea!

How to lazily entertain a cat (Part 1)

2013-09-02T14:42:00.001-04:00

Laser-powered Cat Entertainment System (LCES)

Cats love lasers.

So let's build a Laser-powered Cat Entertainment System (LCES). It's a fancy name for a laser that shakes autonomously. I know this has already been done a million times and is even manufactured and sold, but, hey! The fun is in the journey and not the destination!

This project is split in two parts. The first section will cover the physical system per se while the second one will discuss interactivity using a webcam.

For part 1, you will need:

A Laser;
Two servomotors;
A servo controller;

Laser

I used a cheap diode and resistor combo from DealExtreme.com, but any laser diode would do. Disassembling one you found at the flea market is another good alternative. While playing with it, remember that it is a diode, which means that it must be driven by a current source. Connecting it directly to a voltage source will burn it out. How can I create a current supply, you ask? For low-power applications, a resistor in series with the diode brings $ V= R·I $ into play, effectively limiting the current in the diode to $ \frac{V_{supply} - V_{diode}}{R} $, if you are interested in school stuff. That won't work for higher-power applications because the resistor's resistance varies with its temperature.
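As a worked example of that formula, here is a small sketch. The supply voltage, forward voltage and target current are hypothetical; check the datasheet of your own diode before picking a resistor.

```python
def diode_current(v_supply, v_diode, resistance_ohm):
    """Current through the diode, I = (V_supply - V_diode) / R, in amperes."""
    return (v_supply - v_diode) / resistance_ohm


def series_resistance(v_supply, v_diode, target_current_a):
    """Resistor needed for a target current, R = (V_supply - V_diode) / I."""
    return (v_supply - v_diode) / target_current_a


# Hypothetical values: 5 V from USB, 2.2 V forward voltage, 20 mA target.
resistor = series_resistance(5.0, 2.2, 0.020)
current = diode_current(5.0, 2.2, resistor)
print(f"{resistor:.0f} ohm gives {1000 * current:.1f} mA")
```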

Be careful to use a low power laser because you don't want to burn your cat's retina!

Servomotors

For the servomotors, I've got my hands on two Futaba S3004, but any servomotor will do. You will want one with a disc horn large enough to support the second servo and the second one with arms, X or star-shaped. What are the differences? Check Figure 1.

Figure 1: Types of horns. Sorry, no arm because I didn't have one on hand. Just think of it as an X with two opposite branches cut off.

Servo controller

As for the servo controller, I will use a Pololu Micro Maestro 6, which can be seen in Figure 2.

The red wire is a bypass to power the servos using the 5V from the USB.

Using an already made servo controller is the simplest solution but you can always

Hardware recap

[~2$ ] 1x Laser - Such as this one;
[~6$ ] 2x Servomotor - Such as the Tower Pro SG-5010;
[~20$] 1x Servo controller - Such as the Pololu Micro Maestro 6;

Your mileage may vary on prices, be sure to browse a bit before purchasing to find better or cheaper parts.

Design

The assembly is quite simple:

Put a servo perpendicularly on the other one;
Fix the laser on the topmost servo;
Plug everything together:

Servos on the Controller;
Laser and servos power to the USB power (available on the controller);
Controller to the computer.

This is the look of the final design once you've plugged everything together.

Software

Now that the hardware parts are all ready, we can start coding. Here is a small example of randomly waving the laser around using Python.

You will need to install pyserial first. In fact, I recommend creating a virtualenv with Python 3.3, installing setuptools and pip, and then executing pip install pyserial. This works on Linux, Windows and Mac.
Make sure you've correctly configured your Pololu using the Maestro Control Center (provided with the driver). This means activating either the USB Dual Port or USB Chain mode, setting the channels properly (I recommend setting every unused channel to Input for safety) and setting the Mini SSC offset and Timeout to zero. The remaining settings are optional.

So here's the code:
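What follows is a minimal sketch of the idea rather than the original listing. It assumes the Maestro is driven with the Mini SSC protocol (a 0xFF byte, the channel number, then an 8-bit target), that the two servos sit on channels 0 and 1, and that the command port shows up as /dev/ttyACM0; all three are assumptions to adapt to your setup.

```python
import random
import time


def mini_ssc_command(channel, target):
    """Build a Mini SSC command: 0xFF, channel number, 8-bit target (0-254)."""
    if not 0 <= channel <= 254 or not 0 <= target <= 254:
        raise ValueError("channel and target must be between 0 and 254")
    return bytes([0xFF, channel, target])


def random_wave(port_name, steps=100, pause_s=0.2):
    """Send random targets to channels 0 and 1 (assumed pan and tilt servos)."""
    import serial  # pyserial; imported here so the helper above works without it

    with serial.Serial(port_name, 9600, timeout=1) as port:
        for _ in range(steps):
            for channel in (0, 1):
                # Stay away from the mechanical end stops (assumed safe range).
                port.write(mini_ssc_command(channel, random.randint(64, 190)))
            time.sleep(pause_s)


# Example call, with a hypothetical port name:
# random_wave("/dev/ttyACM0")
```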


And there you go! Next up, in Part 2, we'll see how to hook up this setup to a webcam using OpenCV to analyze movement and create organic-like movement. Stay tuned!

I shamelessly offer you a couple of photos of my cat:

Project Euler Problem 1 - A journey

2013-07-08T14:54:00.000-04:00

Let's efficiently solve the first Project Euler problem, which is:

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 1000.

A naive implementation of this problem would be as such in Python:


This gives a result of 233168.
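For reference, a naive version can be as short as the sketch below. It is an illustration of the approach, and the function name is made up.

```python
def sum_multiples_naive(limit=1000):
    """Test every natural number below `limit` for divisibility by 3 or 5."""
    return sum(n for n in range(limit) if n % 3 == 0 or n % 5 == 0)


naive_result = sum_multiples_naive()
print(naive_result)
```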

What we are really interested in, though, is the performance. Let's benchmark our result.


It runs in 555.5 microseconds on my computer.

Now begins the fun. Two paths must be searched in order to attain maximal performance in this problem. The first is to enhance the mathematical search space, meaning that we should limit the number of numbers we evaluate. The first solution tried every positive natural number. It would be way more efficient to only run through multiples of 3 and 5. In Python, we can use the range iterator (xrange for Python 2) to provide us only the interesting values. Concatenation of the two desired iterators can be done using the chain function. It is then a simple matter of getting rid of duplicates, which we can do using the set data structure.


Giving us a 3.806 microseconds run, always on the same computer.
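The range, chain and set combination described above can be sketched like this (written for Python 3; the function name is made up):

```python
from itertools import chain


def sum_multiples_set(limit=1000):
    """Visit only multiples of 3 and of 5, then drop duplicates with a set."""
    return sum(set(chain(range(3, limit, 3), range(5, limit, 5))))


set_result = sum_multiples_set()
print(set_result)
```

The set is what removes the multiples of 15, which appear in both ranges.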

Ah. Much better. For a human being, this is the best piece of code you can come across for solving this problem. It is short, relatively simple and quite easy to maintain.

Everything further down in this post is an abomination and exists for the sole purpose of either educational or masochistic endeavours. Writing clear, simple and maintainable code is way more useful, profitable and developer-time efficient.

Despite these warnings, let's dive into a C implementation of the naive interpretation of the problem.


10.75 microseconds without optimisations
5.36 microseconds with -O3

Considerably more efficient than its Python counterpart, indeed. Let's write an iterator-like function giving out only interesting numbers. Instead of using high-level concepts such as the chain function or the set data structure, which would be pretty complex to write in C, we'll simply use the fact that the distance between multiples of two numbers repeats after their least common multiple (LCM). 3 and 5 have an LCM of 15, so let's enumerate the multiples of both up to 15:

3 5 6 9 10 12 15

This means that the multiples of 3 and 5 follow a repeating cycle of the differences between these numbers. This gives us +3, +2, +1, +3, +1, +2, +3 (beginning with 0, going to 3 is +3; 5 - 3 = +2, 6 - 5 = +1, and so on).
We'll use the fact that switch cases are very efficient in C since they map to a jump table in assembly language (compared to a bunch of ifs, which would map to a chain of comparison operations). This gives us the following solution:


3.11 µs -O0

1.55 µs -O3
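The stepping pattern itself is easy to check in Python. The sketch below only illustrates the +3, +2, +1, +3, +1, +2, +3 cycle; it does not reproduce the switch-based dispatch of the C version, and the function name is made up.

```python
from itertools import cycle

# Differences between consecutive multiples of 3 or 5; repeats every 15.
STEPS = (3, 2, 1, 3, 1, 2, 3)


def sum_multiples_cycle(limit=1000):
    """Walk from multiple to multiple using the repeating step pattern."""
    total = 0
    value = 0
    for step in cycle(STEPS):
        value += step
        if value >= limit:
            break
        total += value
    return total


cycle_result = sum_multiples_cycle()
print(cycle_result)
```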

The C code seems to be on par with the Python one without optimizations. But this is because of the timing mechanisms we used to benchmark both (one used in-code timestamps while the other used time, which takes into account the process creation). This should not bias our benchmarks too much because of the repeating loop, which minimizes the impact of the process creation overhead.

This is the most performance you can get out of a portable and somewhat readable code. Anything further down in this post engulfs your senses in a warm delusional wool, or as someone stated it:

If you put something like that in production code, you would likely be shot by the maintainer. A jury would never convict him. [1]

I must insist: the rest of this is so bad that you will think you are right in doing so and have the illusory thought that it makes you a master of a programming language or of computer systems.

Which is wrong.

For the foolish still willing to carry on:

Skills in parallel development are really important nowadays. But let's put this in perspective: we've got a solution running at the microsecond scale. Having the OS spawn a new thread or process will generate an overhead far greater than any speedup could ever justify. Task-level parallelism won't help us for this particular problem.

But we can take advantage of parallel instructions available in current processors through the MMX/SSE/AVX extensions. Data-level parallelism to the rescue!

A generally clean approach to implement calls to CPU extensions in C is to use intrinsics.

One way to solve our problem using SSE is to generate all 7 previously mentioned entries of the cycle (from 0 to 15) using a single addition operation. We then sum them with further additions arranged in a reduction pattern, as shown in figure 1. After that, it is only a matter of accumulating the sums.

Figure 1: Example integer sum reduction pattern using addition (SSE2)

The cleanest way to do it would have been to leave hints or rearrange the original code so that the compiler could optimize our program using SIMD instructions itself, but I couldn't find a way to cleanly formulate my needs to get GCC to understand what I wanted. Written manually, it gives us the following code:


The code runs in 357 ns on my computer using the -O3 compiler flag.

But we are still using a linear complexity algorithm... Isn't there something more efficient in the mathematical search space?

Actually, yes. There is.

There is an identity out there stating that the sum of the multiples of a number can be computed directly. Using the inclusion-exclusion principle, we can find the answer for several numbers at once. But I'll let others explain that. Let's see its implementation.


Sorry about the startup script skipping showoff. Comparing the execution time of this operation is irrelevant because the computation is done by the compiler and the answer is directly written to a register as a literal constant.
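The closed form is easy to state in Python as well. The sketch below uses the identity that the multiples of k below a limit sum to k·m·(m+1)/2, where m is the number of such multiples, and then applies inclusion-exclusion for 3 and 5 (subtracting the multiples of 15, which are counted twice). The function names are made up for this illustration.

```python
def sum_of_multiples(k, limit):
    """Sum of the multiples of k strictly below `limit`, in constant time."""
    m = (limit - 1) // k          # how many multiples of k lie below limit
    return k * m * (m + 1) // 2   # k * (1 + 2 + ... + m)


def sum_multiples_closed_form(limit=1000):
    """Inclusion-exclusion: multiples of 15 appear in both sums."""
    return (sum_of_multiples(3, limit)
            + sum_of_multiples(5, limit)
            - sum_of_multiples(15, limit))


closed_form_result = sum_multiples_closed_form()
print(closed_form_result)
```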

So there are two things to remember from this:

Nothing is more important than knowing what we're doing;

The journey is more important than the destination.