How decompilers work

Decompilation is tricky. In the general case, it is impossible to write a decompiler that can correctly and completely decompile every program. Compilation is an inherently lossy process. Much of the information in the source code such as variable names and comments are lost. Modern compilers optimize code significantly. It is often impossible to tell whether a piece of data is code or not. It remains hard even if you disallow self-modifying code.

However, there are many patterns in compiled machine code which one can use to try to construct a partial decompilation.

Example

To illustrate how decompilation works, I will be using an example that implements the Ackermann function.

int main()
{
    return printf("%d\n", ack(3, 4));
}

int ack(int m, int n)
{
    if (m == 0) 
        return n + 1;
    else
    {
        if (n == 0) 
            return ack(m - 1, 1);
        else 
            return ack(m - 1, ack(m, n - 1));
    }
}

ack.c — the Ackermann function.

Stages of decompilation

A while ago I wrote the ocd decompiler with my friend Aleksander Balicki. Here's our approach, divided into stages.

Object dump
Disassembly
Control flow analysis
1. Control flow graph generation
2. Control flow graph reduction
Data flow analysis
1. Variable inference
2. Computation collapse
Program output

Object dump

Executable programs are packaged in a specific file format (ELF and a.out in Unix, PE on Windows). Object dumping takes an executable and extracts information from it — the machine code, the entry point location (where the code starts when you run the program), the symbol list (names and addresses of symbols in the code), and so on. objdump is a utility capable of this. It's available on most Unix systems as part of Binutils.

$ objdump -t ack | awk '$2 == "g"{print $0}' | grep -v _
000000000040054f g     F .text  0000000000000059              ack
0000000000400524 g     F .text  000000000000002b              main

Extracting a symbol list using objdump.

Disassembly

Compiled executable code is usually stored in files as machine code (a set of binary instructions that is understood by the CPU directly). Machine code has a human-readable interpretation called assembly language.

Disassembling is the turning of machine code into assembly language. For example, here's the disassembly of ack into x64 code:

0000000000400524 <main>:
  400524:  55                push   %rbp
  400525:  48 89 e5          mov    %rsp,%rbp
  400528:  be 04 00 00 00    mov    $0x4,%esi
  40052d:  bf 03 00 00 00    mov    $0x3,%edi
  400532:  b8 00 00 00 00    mov    $0x0,%eax
  400537:  e8 13 00 00 00    callq  40054f <ack>
  40053c:  89 c6             mov    %eax,%esi
  40053e:  bf 9c 06 40 00    mov    $0x40069c,%edi
  400543:  b8 00 00 00 00    mov    $0x0,%eax
  400548:  e8 cb fe ff ff    callq  400418 <printf@plt>
  40054d:  c9                leaveq 
  40054e:  c3                retq

000000000040054f <ack>:
  40054f:  55                push   %rbp
  400550:  48 89 e5          mov    %rsp,%rbp
  400553:  48 83 ec 10       sub    $0x10,%rsp
  400557:  89 7d fc          mov    %edi,-0x4(%rbp)
  40055a:  89 75 f8          mov    %esi,-0x8(%rbp)
  40055d:  83 7d fc 00       cmpl   $0x0,-0x4(%rbp)
  400561:  75 08             jne    40056b <ack+0x1c>

ack disassembly. The columns represent location, machine code and assembly code respectively.

ocd disassembles code into an intermediate representation analyzes it. This makes it possible to perform the same analyses regardless of the source architecture.

Control flow analysis

A control flow graph is a directed graph which encodes information about every path a program execution might take depending on control statements of the language (jumps, conditional statemenents, loops). Every uninterrupted block of assembly code is a node in the graph. The edges are jumps from one block to another.

Control statements each have a distinct pattern that appears in the control flow graph.


if	if/else	while	skip

We use a graph rewriting system (not unlike a grammar rewriting system) to find patterns in the graph and rewrite them into single nodes until a fixed point is reached and no more patterns can be found. This allows us to partially rebuild the original structure of control statements.

Below is the simplified step-by-step control flow graph analysis of the ack function, as it gets rewritten into ultimately one node.


Step 1	Step 2	Step 3

Data flow analysis

We then analyze the way variables are used throughout the program. We assign fresh variables to register operands.

mov    eax, 0
mov    ebx, eax

var_13 = 0
var_14 = var_13

Assigning variables to registers — before and after.

Then, the computation flow is analyzed. Dead instructions are removed and others are folded into more compact expressions.

a += 1
b = a * 2
n = 2
n = b - a

n = (a+1)*2-a+1

Instruction folding.

Program output

Finally, the analyzed program can be printed.

For example, this is what ocd outputs for ack:

int ack()
{
    var_0 = temp_2;
    var_1 = temp_3;
    var_2 = var_0 - 0;
    if (var2)
    {
        var_2 = var_1 - 0;
        if (var2)
        {
            temp_4 = ack(temp_5 - 1, ack(var_0, temp_4 - 1));
        }
        else
        {
            temp_4 = ack(temp_4 - 1, 1);
        }
    }
    else
    {
        temp_4 = temp_4 + 1;
    }
    return temp_4;
}
int main()
{
    return unknown_function("%d\n", ack(3, 4));
}

It's not perfect, but sometimes it beats looking at disassembly.

This post was originally published in May 2011.