Portable output for assembler

Sometimes unexpected detours are necessary to reach the goal. Take this simple
assembly code:

This compiler-generated code calculates the length of the input string. If you do not remember
the exact definition of repne scasb, here is another snippet which does the same thing:

A straightforward decompilation of the first snippet yields this:

I can’t say that the C code is any better than the assembly code:

  • A single repne scasb instruction has been replaced by an obscure loop.
  • An additional variable to represent the ZF flag has been introduced.
  • The result is longer than the initial assembly code.
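
To make the complaint concrete, here is a minimal sketch of what such a literal, instruction-by-instruction rendering can look like. It assumes the common or ecx, -1 / repne scasb / not ecx / dec ecx sequence; it is not the decompiler's actual output, and the names scan_count, scan_strlen, edi, ecx and zf are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Literal rendering of the repne scasb idiom (illustrative sketch only). */
static uint32_t scan_count(const char *str)
{
    const char *edi = str;        /* scan pointer */
    uint32_t ecx = 0xFFFFFFFFu;   /* or ecx, -1: effectively unlimited counter */
    const char al = 0;            /* xor eax, eax: the byte to search for */
    int zf = 0;                   /* explicit stand-in for the ZF flag */

    assert(str != NULL);

    /* repne scasb: repeat while ecx != 0 and the last compare was unequal */
    while (ecx != 0) {
        ecx--;
        zf = (*edi++ == al);
        if (zf)
            break;
    }
    return ~ecx;                  /* not ecx: bytes scanned, terminator included */
}

/* dec ecx: dropping the terminator gives the same result as strlen. */
static size_t scan_strlen(const char *str)
{
    return (size_t)scan_count(str) - 1;
}
```

Note that the value right after the not step counts the terminating zero byte; only the final decrement makes the result equal to what strlen returns.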

It would be nice if the decompiler could replace this assembly code
by a call to strlen. For a human reader, the difference would be spectacular:

Just one meaningful line, no puzzling x86 instructions,
just plain and understandable code!

Now, the question is, how do I transform the initial assembly code into this ideal
decompilation result? I could hardcode the decompiler to check if the first instruction
is mov, the second is xor, and so on. You know better than me that this naive approach
is severely limited: as soon as the compiler decides to shuffle instructions, use different
registers, or replace repne scasb with a loop, our decompiler would be hopelessly
confused and lost. Also, different compilers generate different code for built-in functions
(just remember the second strlen example).

I cannot hope to hardcode all these variations by hand! What if I could specify
the sequence in an abstract form and match it against real assembly code?
This idea looked attractive to me: I just need to build the pattern matcher once
and specify patterns for built-in functions. Patterns could look like this:

  • x86 instructions are gone – they have been replaced by abstract instructions for a virtual machine.
  • Registers are gone – they have been replaced by abstract variable names.
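
The matching side of this idea can be sketched in a few lines of C. This is only an illustration under stated assumptions: the opcode names (VM_ZERO, VM_SCAN and so on), the vm_insn layout and the functions bind and match_pattern are all hypothetical, not the decompiler's real data structures. The sketch binds each abstract variable to the first register it meets and requires later uses to agree.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical abstract opcodes of the virtual machine. */
enum vm_op { VM_MOV, VM_ZERO, VM_SCAN, VM_NOT, VM_DEC };

/* One instruction: an opcode plus up to two operands.
   In a pattern the operands are variable slots (0..MAX_VARS-1);
   in concrete code they are register numbers. -1 means "unused". */
struct vm_insn { enum vm_op op; int a, b; };

#define MAX_VARS 4

/* Bind pattern variable `var` to register `reg`, or check an existing binding. */
static int bind(int *env, int var, int reg)
{
    assert(var < MAX_VARS);
    if (var < 0)
        return reg < 0;          /* unused slot must face an unused operand */
    if (reg < 0)
        return 0;
    if (env[var] < 0) {
        env[var] = reg;          /* first use: record the binding */
        return 1;
    }
    return env[var] == reg;      /* later uses must agree with it */
}

/* Return 1 if code[0..n) is an instance of pat[0..n) under one consistent
   variable-to-register mapping; the mapping is left in env. */
static int match_pattern(const struct vm_insn *pat, const struct vm_insn *code,
                         size_t n, int env[MAX_VARS])
{
    assert(pat != NULL && code != NULL);
    for (int i = 0; i < MAX_VARS; i++)
        env[i] = -1;
    for (size_t i = 0; i < n; i++) {
        if (pat[i].op != code[i].op)
            return 0;
        if (!bind(env, pat[i].a, code[i].a) || !bind(env, pat[i].b, code[i].b))
            return 0;
    }
    return 1;
}
```

This sketch only copes with different register choices. It does not handle shuffled instructions, and it does not stop two variables from binding to the same register; a real matcher would need both.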

Difficulties are not where we expect them – the most laborious part of the task turned out to be
the pattern reader utility which would read the above text representation and produce
something binary. And here I stopped and asked myself: what binary representation do I need?
The answer was surprising: the pattern reader would generate C text! The main reason
is that C text is the most portable option: you just need to compile it. I could generate a binary
file, but then I would need to design its format. I could generate another text file, but then
I would need another reader. C code already has a reader, the C compiler, and it can take any
format I want through structure and union declarations.
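
As a rough illustration of "generating C text", the following sketch writes one parsed pattern out as a C array definition that a compiler can then read back. Everything here is an assumption made for the example: the parsed_insn structure, the emit_pattern_c function, and the vm_insn / VM_ names that appear in the emitted text.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical in-memory form of one parsed pattern line. */
struct parsed_insn { const char *op; int a, b; };

/* Write a C array definition for the pattern into buf.
   Returns the number of characters written, or -1 if buf is too small. */
static int emit_pattern_c(char *buf, size_t size, const char *name,
                          const struct parsed_insn *insns, size_t n)
{
    size_t used;
    int k;

    assert(buf != NULL && name != NULL && insns != NULL);

    k = snprintf(buf, size, "static const struct vm_insn %s[] = {\n", name);
    if (k < 0 || (size_t)k >= size)
        return -1;
    used = (size_t)k;

    for (size_t i = 0; i < n; i++) {
        k = snprintf(buf + used, size - used, "    { VM_%s, %d, %d },\n",
                     insns[i].op, insns[i].a, insns[i].b);
        if (k < 0 || (size_t)k >= size - used)
            return -1;
        used += (size_t)k;
    }

    k = snprintf(buf + used, size - used, "};\n");
    if (k < 0 || (size_t)k >= size - used)
        return -1;
    return (int)(used + (size_t)k);
}
```

The emitted text is an ordinary initializer, so the "binary format" is whatever layout the compiler gives the structure on the target platform, and no separate loader has to be written.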

The path to the result turned out to be not as straight as I hoped:

The decompiler would be based on a utility which generates C code
from an assembler for a virtual machine. Everything got mixed up.

3 Responses to Portable output for assembler

  1. GDR! says:

    A very interesting idea.

  2. slcoleman says:

    I have been studying/planning a very similar concept for a CPU to virtual machine microcode (uc) translator to both help remove many of the processor-specific analysis issues, to deal with structural/logical binary comparison (including perhaps polymorphic code), and possibly binary translation. Once the binary is converted and abstracted into a standard uc stream format, and the nonessential uc code removed, then only one simplified set of tools will be needed to do a variety of high-level AST type application analysis and visualization of the form and function independent of the original physical environment. I’m happy to see that I am not alone in thinking in that direction, and even more so, honored that it happens to be you of all people. I hope that one day I will have something meaningful to share. ;)

  3. Gabi says:

    isn’t the first snippet computing the length of the string including the trailing 0? ’cause if yes, strlen is not its perfect decompiler equivalent.