Automated binary analysis woes

If you used IDA Pro for a while, you might have noted that it contents itself
with simple things. It neatly displays the disassembly listing.
It allows you to improve the listing by adding names and comments. You can manually define
your symbols, types, functions. IDA itself can add some types and discover some
program properties, but
overall the performed analyses appear to be rather modest.

On the other hand things seem to be simple.
After spending some time contemplating these always-the-same code patterns
one gets used to them. Many binary analysis problems start to look elementary, if not boring.
One could even think: it is so obvious, why not automate it? Let the computer
find things for us. Questions like where will this register be used?
or where does that value come from? could be answered by the computer!
Sure, there will be situations when it fails, but at least in other cases the
answers could be calculated without any human intervention!…

When we implement the first naive approach, it turns out to work with our test program.
Then we switch to another program and get our first surprise: it does not work at all!
Switching to another file compiled
by another compiler or even with the same compiler but with a different set of compiler options spoils everything.
The patterns which worked for the first compiler do not work and the analysis fails.

The next step could be to add patterns for the second compiler. Then for the third.
If we take this approach, our analysis tool would never be complete.

Imagine we want to find where a function expects its input arguments. This question arises
very often during binary analysis and answering it correctly is very important.
Without knowing where to look for the input we can never find out what the input values are.
For example, this function

push ebp mov ebp, esp movsx eax, [ebp+8] ...

apparently expects one of its arguments on the stack because it copies
something above the return address to the eax register.

Another example. This function

push ebp mov ebp, esp call [ecx+10h] ...

uses ecx without initializing it. Thus ecx must already
have an initialized value,
otherwise the application would crash. So the rule might be

INPUT ARGUMENTS RULE

If the function uses a register or stack location without initializing it,
then it expects an input argument at that location.

We have not defined what the word uses means.
Let’s say that a direct read access constitutes a use. Usually arguments are
accessed directly so the requirement of a direct access does not limit out
analysis¹.

We ‘just’ need to perform the data analysis and find all these registers.
Unfortunately, this does not work at all. Here is a counterexample

push ebp mov ebp, esp push esi ... pop esi pop ebp ret

This push instruction accesses the esi register for read.
But we all know that a push/pop
pattern like this represents the ‘save-restore’ idiom. The register is not really used.
We do not care about its value, we just save it on the stack and restore it at the exit.

For any experienced IDA user all this is pretty obvious.
We have to amend our rule about the function
inputs. It could look like this:

INPUT ARGUMENTS RULE VERSION 2

If the function uses a register or stack location without initializing it,
then it is an input argument. Watch out for the preserved registers, the
save-restore instructions should be ignored during this analysis!

My modifications to the rule are very vague. I did so for a reason: the problem of finding
preserved registers is not simple at all!

We will build an algorithm to find preserved registers.
In our first try at it we look for the push/pop pairs:

PRESERVED REGISTERS RULE VERSION 1

A register is preserved by the function if there is a pop
at the end of the function (before the return instruction)
and a push instruction at the beginning of the function which
corresponds to it. In this case we can remove the push/pop pair and repeat.

This rule works perfectly well with this function:

push edi push esi ... pop esi pop edi ret

Alas, a counterexample is obvious:

push edi push esi ... ret1: pop esi pop edi ret ... ret2: pop esi pop edi ret

Strictly speaking we do not have pairs here. There are several pops for each push.
The second version of the rule:

PRESERVED REGISTERS RULE VERSION 2

A register is preserved by the function if for each return
instruction there is a pop instruction
and a push instruction at the beginning of the function which
corresponds to it. In this case we can remove all pop instructions and the
push instruction. Repeat the process while possible.

I’m sure you realize that this rule is too naive… Let’s check this this code (generated
by GNU C++):

push ebp mov ebp, esp sub esp, 18h mov [ebp-8], ebx mov [ebp-4], esi ... mov ebx, [ebp-8] mov esi, [ebp-4] mov esp, ebp pop ebp retn

So the push instruction is not the only means of preserving registers.
This example effectively destroys our rule beyond any hope. It was built around the
idea that a function would push registers onto the stack, do its job, then pop them from
the stack. When the push/pop instructions were replaced by movs,
we have to abandon this approach. Here are two more examples.
The first one:

push ebp mov ebp, esp push esi mov [ebp-4], edi ... mov edi, [ebp-4] mov esp, ebp pop ebp retn

Please note that esi is not saved by this
function²!

Here is the second one:

push ebp mov ebp, esp sub esp, 20h ... push esi ... pop esi ... mov esp, ebp pop ebp retn

As you see, a register is saved in the middle of the function.
To summarize:

it doesn’t make sense to look only for push/pop instructions
other instructions can be used to preserve registers
it doesn’t make sense to check only the function prolog/epilog
a register can be saved/restored anywhere in the function

I used microcode to overcome the instruction variety. All x386 instructions,
be it push, mov, or pusha, are converted to the microcode data transfer instruction.
Data flow analysis helps to find these instruction anywhere in the function body.
Here is the outline of the data flow analysis:

create the function flow chart
add empty entry and exit blocks to it
the entry block will pretend to define all registers used in the function
the exit block will pretend to use all registers defined by the function
for each block compute the use/def lists
use the iterative algorithm to compute du-ud chains³.

There were many minor problems but they are too boring to delve into details.
To make the story short, I’ll just show the final version of the rule:

PRESERVED REGISTERS – FINAL VERSION

a register is preserved by the function, iff:

its definition by the entry block is used somewhere in the function
the instruction which uses that definition copies the register to a stack location or a register.

we will call this instruction the save instruction.

there can be several save instructions for a register.
the instruction which defines that value copies the register from a stack location or a register.

we will call this instruction the restore instruction.

there can be several restore instructions for a register.
the locations at the steps 2 and 3 are the same

we will call this location the save location.
there is at least one path from any save instruction to a restore instruction
the stack location used to save the register is not accessed (read or write) on
any path connecting the save and restore instructions

This rule is still not perfect (indirect accesses are not taken into account), but overall it
works well.

Please note that we considered only compiler generated code. Hand crafted code is
almost always impervious to automated analysis. Fortunately there is less and less
assembler code.

I hope this small example illustrates my point: binary analysis problems are much more difficult
than they look. Simple approaches simply do not work.
Even if we strictly limit ourselves to compiler generated code, they still fail.
They will work if we limit ourselves to a specific version of a specific
compiler or even to a specific binary.
So crafting a naive script to solve
today’s problem might be the right thing to do. If not, expect monstrous rules like the above
to haunt your logic…

P.S. The gcc compiler source code is abundant with lengthy and complex rules; does it
mean that we have to live with that? 😉

1.
There are many exceptions to this rule, just one of them:
the trailing arguments of variadic functions are always accessed indirectly; but since
all these arguments are anonymous, we do not lose anything. In fact there are many minor
details which are irrelevant here.

2.
If you wonder which compiler generates such a code, it is ICC, Intel’s compiler.

3.
The iterative approach to the data flow analysis is explained in this book:

Muchnick, Steven S. Advanced Compiler Design and Implementation

Simplex method in IDA Pro Confusing instruction