Decompilation is tricky. In the general case, it is impossible to write a decompiler that can correctly and completely decompile every program. Compilation is an inherently lossy process. Much of the information in the source code such as variable names and comments are lost. Modern compilers optimize code significantly. It is often impossible to tell whether a piece of data is code or not. It remains hard even if you disallow self-modifying code.
However, there are many patterns in compiled machine code which one can use to try to construct a partial decompilation.
To illustrate how decompilation works, I will be using an example that implements the Ackermann function.
int main()
{
return printf("%d\n", ack(3, 4));
}
int ack(int m, int n)
{
if (m == 0)
return n + 1;
else
{
if (n == 0)
return ack(m - 1, 1);
else
return ack(m - 1, ack(m, n - 1));
}
}
A while ago I wrote the ocd decompiler with my friend Aleksander Balicki. Here's our approach, divided into stages.
Executable programs are packaged in a specific file format (ELF
and a.out in Unix, PE on
Windows). Object dumping takes an executable and extracts information from it — the machine code,
the entry point location (where the code starts when you run the program), the symbol list (names and addresses of
symbols in the code), and so on. objdump
is a utility capable of this. It's
available on most Unix systems as part of Binutils.
$ objdump -t ack | awk '$2 == "g"{print $0}' | grep -v _
000000000040054f g F .text 0000000000000059 ack
0000000000400524 g F .text 000000000000002b main
Compiled executable code is usually stored in files as machine code (a set of binary instructions that is understood by the CPU directly). Machine code has a human-readable interpretation called assembly language.
Disassembling is the turning of machine code into assembly language. For example, here's the disassembly of
ack
into x64 code:
0000000000400524 <main>:
400524: 55 push %rbp
400525: 48 89 e5 mov %rsp,%rbp
400528: be 04 00 00 00 mov $0x4,%esi
40052d: bf 03 00 00 00 mov $0x3,%edi
400532: b8 00 00 00 00 mov $0x0,%eax
400537: e8 13 00 00 00 callq 40054f <ack>
40053c: 89 c6 mov %eax,%esi
40053e: bf 9c 06 40 00 mov $0x40069c,%edi
400543: b8 00 00 00 00 mov $0x0,%eax
400548: e8 cb fe ff ff callq 400418 <printf@plt>
40054d: c9 leaveq
40054e: c3 retq
000000000040054f <ack>:
40054f: 55 push %rbp
400550: 48 89 e5 mov %rsp,%rbp
400553: 48 83 ec 10 sub $0x10,%rsp
400557: 89 7d fc mov %edi,-0x4(%rbp)
40055a: 89 75 f8 mov %esi,-0x8(%rbp)
40055d: 83 7d fc 00 cmpl $0x0,-0x4(%rbp)
400561: 75 08 jne 40056b <ack+0x1c>
ocd
disassembles code into an intermediate representation
analyzes it. This makes it possible to perform the same analyses regardless of the source architecture.
A control flow graph is a directed graph which encodes information about every path a program execution might take depending on control statements of the language (jumps, conditional statemenents, loops). Every uninterrupted block of assembly code is a node in the graph. The edges are jumps from one block to another.
Control statements each have a distinct pattern that appears in the control flow graph.
if | if/else | while | skip |
We use a graph rewriting system (not unlike a grammar rewriting system) to find patterns in the graph and rewrite them into single nodes until a fixed point is reached and no more patterns can be found. This allows us to partially rebuild the original structure of control statements.
Below is the simplified step-by-step control flow graph analysis of the ack
function, as it gets rewritten into
ultimately one node.
Step 1 | Step 2 | Step 3 |
We then analyze the way variables are used throughout the program. We assign fresh variables to register operands.
mov eax, 0
mov ebx, eax
|
var_13 = 0
var_14 = var_13
|
Then, the computation flow is analyzed. Dead instructions are removed and others are folded into more compact expressions.
a += 1
b = a * 2
n = 2
n = b - a
|
n = (a+1)*2-a+1
|
Finally, the analyzed program can be printed.
For example, this is what ocd
outputs for ack
:
int ack()
{
var_0 = temp_2;
var_1 = temp_3;
var_2 = var_0 - 0;
if (var2)
{
var_2 = var_1 - 0;
if (var2)
{
temp_4 = ack(temp_5 - 1, ack(var_0, temp_4 - 1));
}
else
{
temp_4 = ack(temp_4 - 1, 1);
}
}
else
{
temp_4 = temp_4 + 1;
}
return temp_4;
}
int main()
{
return unknown_function("%d\n", ack(3, 4));
}
It's not perfect, but sometimes it beats looking at disassembly.
This post was originally published in May 2011.