Decompiling reconstructs source code from compiled machine code. Decompilation is tricky. In the general case, it is impossible to write a decompiler that can correctly and completely decompile every program. Compilation is an inherently lossy process. Much of the information in the source code such as variable names and comments are lost. Modern compilers optimize code significantly. It is often impossible to tell whether a piece of data is code or not. It remains hard even if you disallow self-modifying code.
However, there are many patterns in compiled machine code which one can use to try to construct a partial decompilation.
To illustrate how decompilation works, I will be using an example that implements the Ackermann function.
int main() { return printf("%d\\n", ack(3, 4)); } int ack(int m, int n) { if (m == 0) return n + 1; else { if (n == 0) return ack(m - 1, 1); else return ack(m - 1, ack(m, n - 1)); } }
A while ago I wrote the ocd decompiler with my friend Aleksander Balicki. Here's our approach, divided into stages.
Executable programs are packaged in a specific file format (ELF and a.out in Unix, PE on Windows). Object dumping takes an executable and extracts information from it — the machine code, the entry point location (where the code starts when you run the program), the symbol list (names and addresses of symbols in the code), and so on. objdump is a utility capable of this. It's available on most Unix systems as part of Binutils.
$ objdump -t ack | awk '$2 == "g"{print $0}' | grep -v _ 000000000040054f g F .text 0000000000000059 ack 0000000000400524 g F .text 000000000000002b main
Compiled executable code is usually stored in files as machine code (a set of binary instructions that is understood by the CPU directly). Machine code has a human-readable interpretation called assembly language.
Disassembling is the turning of machine code into assembly language. For example, here's the disassembly of ack into x64 code:
0000000000400524 <main>: 400524: 55 push %rbp 400525: 48 89 e5 mov %rsp,%rbp 400528: be 04 00 00 00 mov $0x4,%esi 40052d: bf 03 00 00 00 mov $0x3,%edi 400532: b8 00 00 00 00 mov $0x0,%eax 400537: e8 13 00 00 00 callq 40054f <ack> 40053c: 89 c6 mov %eax,%esi 40053e: bf 9c 06 40 00 mov $0x40069c,%edi 400543: b8 00 00 00 00 mov $0x0,%eax 400548: e8 cb fe ff ff callq 400418 <printf@plt> 40054d: c9 leaveq 40054e: c3 retq 000000000040054f <ack>: 40054f: 55 push %rbp 400550: 48 89 e5 mov %rsp,%rbp 400553: 48 83 ec 10 sub $0x10,%rsp 400557: 89 7d fc mov %edi,-0x4(%rbp) 40055a: 89 75 f8 mov %esi,-0x8(%rbp) 40055d: 83 7d fc 00 cmpl $0x0,-0x4(%rbp) 400561: 75 08 jne 40056b <ack+0x1c>
ocd disassembles code into an intermediate representation and analyzes it. This makes it possible to perform the same analyses regardless of the source architecture.
A control flow graph is a directed graph which encodes information about every path a program execution might take depending on control statements of the language (jumps, conditional statemenents, loops). Every uninterrupted block of assembly code is a node in the graph. The edges are jumps from one block to another.
Control statements each have a distinct pattern that appears in the control flow graph.
![]() | ![]() | ![]() | ![]() |
if | if/else | while | skip |
We use a graph rewriting system (not unlike a grammar rewriting system) to find patterns in the graph and rewrite them into single nodes until a fixed point is reached and no more patterns can be found. This allows us to partially rebuild the original structure of control statements.
Below is the simplified step-by-step control flow graph analysis of the ack function, as it gets rewritten into ultimately one node.
![]() | ![]() | ![]() |
Step 1 | Step 2 | Step 3 |
We then analyze the way variables are used throughout the program. We assign fresh variables to register operands.
mov eax, 0 mov ebx, eax
var_13 = 0 var_14 = var_13
Then, the computation flow is analyzed. Dead instructions are removed and others are folded into more compact expressions.
a += 1 b = a * 2 n = 2 n = b - a
n = (a+1)*2-a+1
Finally, the analyzed program can be printed.
For example, this is what ocd outputs for ack:
int ack() { var_0 = temp_2; var_1 = temp_3; var_2 = var_0 - 0; if (var2) { var_2 = var_1 - 0; if (var2) { temp_4 = ack(temp_5 - 1, ack(var_0, temp_4 - 1)); } else { temp_4 = ack(temp_4 - 1, 1); } } else { temp_4 = temp_4 + 1; } return temp_4; } int main() { return unknown_function("%d\\n", ack(3, 4)); }
It's not perfect, but sometimes it beats looking at disassembly.
This post was originally published in May 2011.