July 24, 2014

Executing x86-64 opcodes in Python - Genetic Programming case study

This post is about generating machine code directly and making it run in Python. (At least, I'm not the first one to do something similar.) Since we've given up any hope of portability by using machine code, we'll support my processor, which implements the (not so pretty) standard x86-64 instruction set.

Jumping down the abstraction layers

First of all, how do you even execute arbitrary x86-64 instructions in Python? (Note: doing it in Python is mandatory, mainly because I want to. Also, it's simpler to develop and debug.) Well, diving into the ctypes module gives us half the answer: we can write a byte string that contains the executable code and then cast it into a CFUNCTYPE object.

But a portability issue arises. CFUNCTYPE states that it will call the function using the standard C convention (what we call an ABI)... But which x86 calling convention? The answer is: the one used by the compiler with which Python was compiled. Luckily, the two x86-64 calling conventions (Microsoft-flavored for Windows-based systems and System V-flavored for everything else) are pretty similar. Another lucky fact: we only have to take care of the volatile registers, the parameter-passing registers and the return registers.

Next, we must know the machine code that does what we want. It is worth noting that an interactive compiler and an online assembler are useful for generating the machine code sequence. The same output can be obtained by using gcc to compile a file containing a simple function.
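Something as trivial as this does the trick (the file name and function below are only an illustration; any small function will do):

    /* add.c - any trivial function is enough */
    double add(double a, double b)
    {
        return a + b;
    }

    /* Compile without linking, then dump the resulting machine code:
     *     gcc -c -O1 add.c && objdump -d add.o
     */
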
Note that this invocation of gcc won't run the linker, so every call, jump or other reference to a label (an address) won't be defined - they will be set to 0. This won't hinder us, as our addresses won't be the same as those of the sample C code anyway. To better understand what we are dealing with, you can look up the opcodes in an x86-64 opcode reference chart, or in the Intel Architecture Software Developer's Manual if the gory details are needed.

Ok, everything should be fine now! Let's try to cast a simple return. This opcode only returns from a function call without doing anything.
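A first, naive attempt (a sketch; the variable names are mine):

    import ctypes

    code = b"\xc3"  # ret: return from the call without doing anything

    # Take the address of the byte string's buffer and pretend it is a C
    # function taking no argument and returning nothing, then call it.
    buf = ctypes.create_string_buffer(code)
    ret = ctypes.CFUNCTYPE(None)(ctypes.addressof(buf))
    ret()
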
And... That gives a segmentation fault.

Which is pretty logical: the string is contained in a segment of memory that has the read and write flags, but not the execute flag. Luckily, this particular problem was already solved on Stack Overflow. This gives us the following (x86-64 Linux only) code:
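A minimal sketch of that fix, reconstructed with Python's mmap module (the exact calls may differ from the original answer):

    import ctypes
    import mmap

    code = b"\xc3"  # ret

    # Ask the OS for an anonymous page with read, write AND execute rights.
    buf = mmap.mmap(-1, mmap.PAGESIZE,
                    prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
    buf.write(code)

    # Turn the address of that page into a callable function pointer.
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    ret = ctypes.CFUNCTYPE(None)(addr)
    ret()  # returns without crashing this time
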
At last, it works! Or, at least, doesn't explode upon execution.

So, basically, if I want to perform an addition, the documentation says to interface the "C function" this way (to replace the end of the previous code snippet):
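Roughly like this (a sketch reusing the executable page trick from above):

    # addsd xmm0, xmm1 ; ret  ->  behaves like double add(double, double)
    code = b"\xf2\x0f\x58\xc1\xc3"

    buf = mmap.mmap(-1, mmap.PAGESIZE,
                    prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
    buf.write(code)
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))

    # Both arguments are passed in XMM0 and XMM1; the result comes back in XMM0.
    add = ctypes.CFUNCTYPE(ctypes.c_double,
                           ctypes.c_double, ctypes.c_double)(addr)
    print(add(3.0, 4.0))  # 7.0
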
Great, it works! But how did we come up with the machine code f2 0f 58 c1 c3? (The \x prefix before every byte only tells Python that it's a hexadecimal value.) Here we go:
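Byte by byte, it decomposes as follows:

    f2 0f 58    ADDSD: add the scalar double-precision floating-point operands
    c1          ModRM byte 0b11000001: mod=11 (register to register),
                reg=000 (XMM0, the destination), rm=001 (XMM1, the source)
    c3          RET: return to the caller, the result already sitting in XMM0
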

The missing link

Now that the hassle of explaining evolutionary algorithms, genetic programming and their need for fast execution has been taken care of by my colleague, let's push it a step further using the aforementioned notions. Marc-André brought you to the edge of the cliff; we'll now take a big step forward. While he used Python bytecode, which is an awesome and versatile idea that works on any hardware to which Python has been ported, we'll use machine code.

Now that we know how we are going to execute arbitrary machine code, we can focus on the problem at hand: generating machine code from a program tree. Let's take as an example the tree for the expression $ (x - 1) + 4 $:


Figure 1: Example of the tree representing the expression $ (x - 1) + 4 $.

Since the processor execution is based on a stack, the easiest way to traverse this tree is depth-first. Indeed, appending every node of the tree during a depth-first traversal generates the inverted order of execution of the tree. Here's how it works:

Figure 2: Example of depth-first traversing with call stack

As we traverse the tree, we pile up the calls and parameters on the stack. As you can see from the figure, this generates a call stack that, when executed bottom-up, is exactly the order that the x86_64 processor needs. Notice that it differs from the Python bytecode stack Marc-André showed: argument positions are reversed and LOAD_GLOBAL is replaced by the call to the function.
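
As a toy illustration (assuming, only for this sketch, that the program tree is made of nested tuples with the operation first), the traversal could look like this:

    def traverse(node, stack):
        """Depth-first traversal: appends the nodes in the inverted order
        of execution, exactly like the call stack of the figure above."""
        if isinstance(node, tuple):
            op, *args = node
            stack.append(op)          # the call is piled up first...
            for arg in args:
                traverse(arg, stack)  # ...then its arguments, left to right
        else:
            stack.append(node)        # leaf: a constant or a variable
        return stack

    tree = ("add", ("sub", "x", 1.0), 4.0)
    print(traverse(tree, []))
    # ['add', 'sub', 'x', 1.0, 4.0]
    # Executed bottom-up: load 4.0, load 1.0, load x, call sub, call add.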

All of that seems fine, but there is a problem. The x86_64 calling convention passes the floating-point parameters of a function in XMM0, XMM1, XMM2 and so on. These registers are volatile, meaning that their content may well be modified when we call a function. Let's take the tree shown in the previous figure as an example and assume we're dealing with floats. x will be put in XMM0, 1.0f in XMM1 and sub() will be called. This call returns its result in XMM0. Perfect, that's where we want the first argument for the call to add(), along with the 4.0f previously put in XMM1. Uh-oh. sub() needs 1.0f in XMM1 while add() needs 4.0f there. This can be visualized here:

Instruction        | XMM0            | XMM1
Put 4.0f in XMM1   | -               | 4.0f
Put 1.0f in XMM1   | -               | 1.0f
Put x in XMM0      | x               | 1.0f
Call sub()         | x - 1.0f        | 1.0f (?)
Call add()         | (x - 1.0f) + ?  | 1.0f (?)

You'll probably say "hey, this is easy, simply put node 5 before node 2, problem solved!" Don't. Do. That. All hell will break loose and you will not enjoy it. This kind of tweaking will lead you to hours and hours of wondering why it works in some cases but not in others. (Says the guy who produces executable machine code in Python.) As I said earlier: this should be simple, an x86_64 CPU is based on a stack! Let's use it! We'll simply push every argument on the CPU stack as we traverse its node and pop it back when it's needed. If we feel adventurous, we realize that the first argument (the next node after a call) doesn't need the push/pop if the arguments are compatible: the result will already be in XMM0 (for floats and doubles), ready to be used. This gives us:
Instruction          | XMM0               | XMM1
Push 4.0f on stack   | -                  | -
Push 1.0f on stack   | -                  | -
Put x in XMM0        | x                  | -
Pop stack in XMM1    | x                  | 1.0f
Call sub()           | x - 1.0f           | ?
Pop stack in XMM1    | x - 1.0f           | 4.0f
Call add()           | (x - 1.0f) + 4.0f  | ?

As we can see, this generates quite a lot of unnecessary pushes (for example, argument 2 of sub()). Eliminating this unnecessary stack usage is a potential optimization that we may discuss in another article.
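
To make the stack manipulation concrete, here is a rough sketch of the byte sequences behind "push a constant on the stack" and "pop the stack into XMM1" (the encodings are standard x86-64; the helper name is mine and it ignores details such as keeping the stack 16-byte aligned before calls):

    import struct

    def push_double(value):
        """movabs rax, <64-bit pattern of value> ; push rax"""
        return (b"\x48\xb8" + struct.pack("<d", value)  # movabs rax, imm64
                + b"\x50")                              # push rax

    # movsd xmm1, [rsp] ; add rsp, 8  ->  "pop stack in XMM1"
    POP_XMM1 = b"\xf2\x0f\x10\x0c\x24" + b"\x48\x83\xc4\x08"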

Before rambling off into a non-sequitur madness of idea flow almost disconnected from the subject, I'll present the symbolic regression program with individuals evaluated as x86_64 machine code, using deap. It is located here. Feel free to fork it, mess with it and poke around it.

I haven't implemented division (one would have to deal with division by zero) nor cos/sin, but feel free to be inspired and fork the code!

You may be tempted to print out the generated machine code and try to understand it. To make that easier, you can copy it into an online disassembler, which will provide an almost human-readable assembly translation.
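
For instance, assuming the generated bytes live in a variable named code, a quick way to get a pasteable hex dump is:

    print(" ".join("{:02x}".format(byte) for byte in code))  # e.g. f2 0f 58 c1 c3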

To infinity and beyond

In the context of genetic programming, a far better idea than generating the machine code at each evaluation, as the previous example did, is to represent each individual in deap directly as its machine code and evolve that. It's not that I am wary of the tortuous path toward a stable implementation of this representation. It would require writing mutation and crossover functions, which need pointers to mark the beginning and end of each node in the machine code representation. No, the only reason I don't dig further in this general direction is that I won't offend you by serving some old reheated material: Marc-André already showed how it's done in his previous post. (This has also nothing to do with the fact that it's an unmaintainable piece of code that won't ever be published as-is in a working project.)

At first, this proposed idea was tagged as a "useless waste of time" by colleagues and friends. But as the idea developed, we realized it could be used to circumvent security features. Calling obscure opcodes, low-level functions, software interrupts and the like is now possible directly from pure Python. Furthermore, it would enable the execution of dynamic libraries that are not flagged as executable. You only have to read the exported symbols of a .so library, load them in memory, apply this method and voilà, you can execute its functions.

An interesting lead from this point is to make a compiler for generic Python functions. Some pretty module could get us near a C compiler, but I won't insult the dedication and hard work of Donald Knuth by proposing half an article written on a napkin about compiler creation. But I don't mind getting familiar, even on a first-name basis, with Python optimization packages. Can we perform better than Numba, or even Cython and PyPy? Stay tuned for Part II, where we'll try an as-generic-as-possible Python-to-machine-code translation.