Automated binary analysis woes

If you used IDA Pro for a while, you might have noted that it contents itself
with simple things. It neatly displays the disassembly listing.
It allows you to improve the listing by adding names and comments. You can manually define
your symbols, types, functions. IDA itself can add some types and discover some
program properties, but
overall the performed analyses appear to be rather modest.

On the other hand things seem to be simple.
After spending some time contemplating these always-the-same code patterns
one gets used to them. Many binary analysis problems start to look elementary, if not boring.
One could even think: it is so obvious, why not automate it? Let the computer
find things for us. Questions like where will this register be used?
or where does that value come from? could be answered by the computer!
Sure, there will be situations when it fails, but at least in other cases the
answers could be calculated without any human intervention!…

When we implement the first naive approach, it turns out to work with our test program.
Then we switch to another program and get our first surprise: it does not work at all!
Switching to another file compiled
by another compiler or even with the same compiler but with a different set of compiler options spoils everything.
The patterns which worked for the first compiler do not work and the analysis fails.

The next step could be to add patterns for the second compiler. Then for the third.
If we take this approach, our analysis tool would never be complete.

Imagine we want to find where a function expects its input arguments. This question arises
very often during binary analysis and answering it correctly is very important.
Without knowing where to look for the input we can never find out what the input values are.
For example, this function

push    ebp
mov     ebp, esp
movsx   eax, [ebp+8]
...

apparently expects one of its arguments on the stack because it copies
something above the return address to the eax register.

Another example. This function

push    ebp
mov     ebp, esp
call    [ecx+10h]
...

uses ecx without initializing it. Thus ecx must already
have an initialized value,
otherwise the application would crash. So the rule might be

INPUT ARGUMENTS RULE

If the function uses a register or stack location without initializing it,
then it expects an input argument at that location.

We have not defined what the word uses means.
Let’s say that a direct read access constitutes a use. Usually arguments are
accessed directly so the requirement of a direct access does not limit out
analysis1.

We ‘just’ need to perform the data analysis and find all these registers.
Unfortunately, this does not work at all. Here is a counterexample

push    ebp
mov     ebp, esp
push    esi
...
pop     esi
pop     ebp
ret

This push instruction accesses the esi register for read.
But we all know that a push/pop
pattern like this represents the ‘save-restore’ idiom. The register is not really used.
We do not care about its value, we just save it on the stack and restore it at the exit.

For any experienced IDA user all this is pretty obvious.
We have to amend our rule about the function
inputs. It could look like this:

INPUT ARGUMENTS RULE VERSION 2

If the function uses a register or stack location without initializing it,
then it is an input argument. Watch out for the preserved registers, the
save-restore instructions should be ignored during this analysis!

My modifications to the rule are very vague. I did so for a reason: the problem of finding
preserved registers is not simple at all!

We will build an algorithm to find preserved registers.
In our first try at it we look for the push/pop pairs:

PRESERVED REGISTERS RULE VERSION 1

A register is preserved by the function if there is a pop
at the end of the function (before the return instruction)
and a push instruction at the beginning of the function which
corresponds to it. In this case we can remove the push/pop pair and repeat.

This rule works perfectly well with this function:

push    edi
push    esi
...
pop     esi
pop     edi
ret

Alas, a counterexample is obvious:

push    edi
push    esi
...
ret1:
pop     esi
pop     edi
ret
...
ret2:
pop     esi
pop     edi
ret

Strictly speaking we do not have pairs here. There are several pops for each push.
The second version of the rule:

PRESERVED REGISTERS RULE VERSION 2

A register is preserved by the function if for each return
instruction there is a pop instruction
and a push instruction at the beginning of the function which
corresponds to it. In this case we can remove all pop instructions and the
push instruction. Repeat the process while possible.

I’m sure you realize that this rule is too naive… Let’s check this this code (generated
by GNU C++):

push    ebp
mov     ebp, esp
sub     esp, 18h
mov     [ebp-8], ebx
mov     [ebp-4], esi
...
mov     ebx, [ebp-8]
mov     esi, [ebp-4]
mov     esp, ebp
pop     ebp
retn

So the push instruction is not the only means of preserving registers.
This example effectively destroys our rule beyond any hope. It was built around the
idea that a function would push registers onto the stack, do its job, then pop them from
the stack. When the push/pop instructions were replaced by movs,
we have to abandon this approach. Here are two more examples.
The first one:

push    ebp
mov     ebp, esp
push    esi
mov     [ebp-4], edi
...
mov     edi, [ebp-4]
mov     esp, ebp
pop     ebp
retn

Please note that esi is not saved by this
function2!

Here is the second one:

push    ebp
mov     ebp, esp
sub     esp, 20h
...
push    esi
...
pop     esi
...
mov     esp, ebp
pop     ebp
retn

As you see, a register is saved in the middle of the function.
To summarize:

  • it doesn’t make sense to look only for push/pop instructions
    other instructions can be used to preserve registers

  • it doesn’t make sense to check only the function prolog/epilog
    a register can be saved/restored anywhere in the function

I used microcode to overcome the instruction variety. All x386 instructions,
be it push, mov, or pusha, are converted to the microcode data transfer instruction.
Data flow analysis helps to find these instruction anywhere in the function body.
Here is the outline of the data flow analysis:

  • create the function flow chart
  • add empty entry and exit blocks to it
  • the entry block will pretend to define all registers used in the function
  • the exit block will pretend to use all registers defined by the function
  • for each block compute the use/def lists
  • use the iterative algorithm to compute du-ud chains3.

There were many minor problems but they are too boring to delve into details.
To make the story short, I’ll just show the final version of the rule:

PRESERVED REGISTERS – FINAL VERSION

a register is preserved by the function, iff:

  1. its definition by the entry block is used somewhere in the function
  2. the instruction which uses that definition copies the register to a stack location or a register.

    we will call this instruction the save instruction.

    there can be several save instructions for a register.

  3. the instruction which defines that value copies the register from a stack location or a register.

    we will call this instruction the restore instruction.

    there can be several restore instructions for a register.

  4. the locations at the steps 2 and 3 are the same

    we will call this location the save location.

  5. there is at least one path from any save instruction to a restore instruction
  6. the stack location used to save the register is not accessed (read or write) on
    any path connecting the save and restore instructions

This rule is still not perfect (indirect accesses are not taken into account), but overall it
works well.

Please note that we considered only compiler generated code. Hand crafted code is
almost always impervious to automated analysis. Fortunately there is less and less
assembler code.

I hope this small example illustrates my point: binary analysis problems are much more difficult
than they look. Simple approaches simply do not work.
Even if we strictly limit ourselves to compiler generated code, they still fail.
They will work if we limit ourselves to a specific version of a specific
compiler or even to a specific binary.
So crafting a naive script to solve
today’s problem might be the right thing to do. If not, expect monstrous rules like the above
to haunt your logic…

P.S. The gcc compiler source code is abundant with lengthy and complex rules; does it
mean that we have to live with that? ;)



1.
There are many exceptions to this rule, just one of them:
the trailing arguments of variadic functions are always accessed indirectly; but since
all these arguments are anonymous, we do not lose anything. In fact there are many minor
details which are irrelevant here.

2.
If you wonder which compiler generates such a code, it is ICC, Intel’s compiler.

3.
The iterative approach to the data flow analysis is explained in this book:

Muchnick, Steven S. Advanced Compiler Design and Implementation

This entry was posted in Decompilation. Bookmark the permalink.

11 Responses to Automated binary analysis woes

  1. Ryan Russell says:

    How about making use of the calling convention to determine which registers are supposed to be passed or preserved?
    By the way, you don’t mention why you want to classify stack vars and registers. I’m sure the reason affects what will be the best way to do it.

  2. DL says:

    excellent post!

  3. Ilfak Guilfanov says:

    Ryan,
    The problem is that we do not know the calling convention and try to find it out.
    I didn’t get your question about the stack var and register classification. What do you mean by it?

  4. Peter Taylor says:

    Great post.
    What I’d like to know though, is how you intend to deal with Delphi style code that allocates stack variables using a construct such as:

    xor eax,eax
    mov ecx,frame_size
    alloc_frame:
    push eax
    loop alloc_frame
    

    would you first attempt to recognise this construct as an idiom and flatten it out into equivalent RTL/microcode?
    -Pete

  5. Ilfak Guilfanov says:

    Pete,
    Thank you. As about function prologs/epilos, they will be analyzed on a processor/compiler specific basis. This particular sequence will be recognized as a part of the prolog and completely removed from the decompilation. Its side effect of allocating the stack will be reflected in the stack pointer values.

  6. Ryan Russell says:

    I guess I just meant, give us a hint what you’re working on. I can think of one obvious application for which you want to identify when a register is used as a variable…

  7. Ilfak,
    you have come across an interesting problem. This sort of thing has to be done in a decompiler, as I’m sure you are aware.
    The Boomerang decompiler does it like this:
    1) The binary program is translated into a machine independent IR. You mention this as “microcode”. All function exits are combined, so the funcion has only one exit.
    2) The IR is translated into SSA form (Static Single Assignment). This makes propagation (eiher copy propagation or expression propagation) easy, although there are many complex decisions as to whether to make or not make a propagation
    3) Functions are process in a depth first manner (on the call graph). As a result, most of the time, a callee is analysed before the caller, and you have preservation information about the child. This of course doesn’t work for recursive programs; see later.
    4) Now you need some simplification, such as x+0 -> x, and technical things such as bypassing calls, removing redundant phi functions, etc. It’s not trivial.
    5) Now, for many cases, you find expressions near the end such as
    esi := esi{0}
    This means that esi has, after propagation, been set to the same value that it was on entry. If not, you get something like
    esi := esi{133}
    meaning that the value of esi depends on the definition in statement 133.It might be like this:
    133 esi := phi{50, 60}
    meaning that the value of esi at statement 133 depends on the definitions at statements 50 or 60, depending on which path the program takes.
    So now we apply a simple theorem prover (it’s not as fancy as it sounds) to examine all the cases, and decide if esi really is or is not preserved by this function.
    Earlier I metioned how recursion makes like more difficult. There is a sort of “inductive proof” in Boomerang, which more or less says “assume that esi is preserved by function B. Given this assumpion, can you prove that esi is preserved by function A?” This is repeated for all functions involved in that “recursion group” (i.e. there is mutual recursion between all functions in a recursion group). Sometimes, the question is a little different, e.g. “assume that esi is preserved by function B. Given this assumpion, can you prove that eax is preserved by function A?” If any of the required “conditional preservations” fails, then the original premise (in this case that esi is preserved by B) fails.
    OK, that’s just for preservation. Then, parameters are those locations which are used before definition after dead and redundant code elimination. For example, where esi is preserved, you end up with say
    10 [esp-16] = esi
    99 esi = esi{0}
    99 is a redundant statement, and so is removed. 10 is unused (it was needed to eventually come up with the esi{0} in statement 99), and so is removed. After all this is done, you can tell whether esi is a parameter or not. Note: a location could be preserved and also a parameter! But once you strip away the code that merely preserves the location, you can tell which are the true parameters.
    As you have hinted, this complexity is quite startling given how easy and routine you can often manually decide preservation for the common cases.
    This analysis seems a long way removed from just disassembling the code. For example, while a decompiler would actually remove dead code, a disassembler must never do that. Well, not if you want a real disassembly output. A disassembler therefor needs to keep a separate intermediate representation for analysis to that which is the actual disassembly.
    You have hinted in earlier blog entries about “portable assembly language”, which seems to me is much closer to what the analysis IR is. Maybe you can generate the portable assemly language from the analysis IR.
    All very interesting work; good luck with it all!
    - Mike

  8. Ilfak Guilfanov says:

    Mike,
    Thanks for your post. You guessed it right, the “portable assembly language” (or microcode) is my IR for decompilation. I do not use SSA and a theorem prover, but many other details are similar.
    You say that you can determine the preserved registers using the propagation mechanism. Some time ago I was happy with the same approach. However, for the propagation mechanism to find out preserved registers, it must be able to copy local stack values across function calls. While this does not pose any problems with the preserved registers, it will yield wrong results in other cases. Consider:

    mov   [ebp-4], esi
    lea   eax, [ebp-4]
    push  eax
    call  func
    mov   esi, [ebp-4]
    

    This must be translated into

    v = esi;
    func(&v);
    esi = v;
    

    In other words we can not blindly propagate stack values across function calls. Some values can be propagated, others can not. It is safe to prohibit any memory propagation across function calls, but this limitation makes the propagation mechanism inappropriate for finding the preserved registers.
    I think a more fine grained control is needed. For example if the address of a stack location is never taken, then it can be propagated. And since we don’t know the object boundaries, if the address of a stack location is taken, all stack locations above it should not be propagated.

  9. There are a number of things that a decompiler does that a disassembler could do. I’ve recently being recreationally hand decompiling a dll with IDAPro. I have a header file with prototypes for all the exports of the dll. One thing the disassembler could do is read in that header file and name all the appropriate stack parameters.
    A large number of the exports look like this:
    mov eax, [esp + arg_8]
    push eax
    mov eax, [esp + arg_C]
    push eax
    mov ecx, [esp + arg_4]
    call sub_10002395
    clearly this code is just converting a stdcall into a thiscall.. so there’s a number of things the disassembler could do:
    1. as I said above, give arg_4, arg_8 and arg_C the names that are in the header file.
    2. if the type of the first arg is Foo, and this export has the name bar, then give sub_10002395 the name Foo__bar.
    3. give the arguments of Foo__bar the same name as the appropriate argument of the export.
    Of course, to do the second one in that list may require some sort of regular expression to extract the appropriate part from the naming scheme of the export and translate it into an appropriate member function name.
    IDAPro allow you to define structs and assign these types to arguments. I havn’t used the feature a lot, but it means that code which looks like this:
    mov eax, [ecx + 123h]
    becomes
    mov eax, [ecx + counter]
    or whatever name you give it. Somethings the disassembler could do automatically are:
    1. define arguments as structs that are defined as such in the header file.
    2. use the element names of structs that are defined in the header file.
    3. give arbitary names to struct elements at whichever offsets they are used that are not defined in the header file.
    That way a struct that is defined as having 5 elements but for which 15 are actually used will get elements defined for those uses and the engineer can go to one place (the struct definition) to rename them.

  10. Ilfak Guilfanov says:

    Trent,
    One of the reasons why IDA Pro has a scripting language and extensive plugin system is exactly to enable this kind of processing.
    A disassembler also could perform partial decompilation (like adding C-like comments for each basic block) and plugins like this do exist.

  11. Julio Auto says:

    This is amazing because I’ve recently ran across a similar problem when trying to write something to find out the value (given its statically stored inside the executable file) of a specific argument of a specific function inside ELF binaries making use of such function.
    So, the approach seemed simple: find the call to the function and find the right push just before it (notice I’m not even taking into account the possibility of passing the arguments through registers, as fastcall does).
    However, I got hit with the numerous ways this argument passing could be done, eg. push vs. mov.
    I ended up with two final versions of the program: one that could take some dumb guesses and worked 95%+ of the cases; and other that was 100% accurate but didn’t use binary analysis at all.
    One theoretical solution I had in mind (but didn’t have the time to implement) was to find the call to the instruction and then go backwards, instruction to instruction, dinamically ‘simulating’ the stack until I got to the point where that specific entry into the stack was last filled with some value.
    But yes, this ‘microcode’/IR approach seems very good and I hadn’t thought about it.
    Nice job. Nice post. :)