Return to the sources?

A decompiler is commonly viewed as a tool to recover the source code of a program, in the same way that a disassembler is a tool to convert a binary executable program into assembler text.
This is true in some cases, but only in some.


Most of the time the goal is not to get a compilable text. What would one do with such a text? Recompile and link it to get a copy of the input binary program? The only legitimate situation I can imagine is when the source text has been lost and the owner of the program tries to recover it. Even in this case I personally would rewrite everything from scratch because the disassembler/decompiler output is a nightmare to maintain: it lacks comments, inline functions, preprocessor symbols, abstractions, class inheritance, etc.
In short, I wouldn’t consider obtaining a compilable text as a major goal of a binary reverse engineering tool (RET).
One goal of a RET is to represent the program logic in the most precise and clear way. This representation should be easily understandable by a human being. This means that the fundamental blocks comprising the program logic should be easily visualized on request. Questions like “which function is called from the current instruction”, “where is the current variable initialized or used”, “how many times is the current instruction executed” arise continuously during the analysis of almost any program. In simple cases a disassembler can answer these questions; a decompiler is more intelligent and can perform deeper analysis (e.g. pointer analysis and value propagation could help to find the answer even in the presence of indirect addressing modes and loops). In both cases the RET helps the analyst to grasp the meaning of the program and convert a sequence of instructions into very high level descriptions like “send a mail message to all contacts” or “get the list of .doc files on the computer”. A clear program representation is of paramount importance to understanding the program’s purpose.
Another important goal of a RET is to automate program analysis and instrumentation. In this case the RET can be used as a basic building block for another program. We could analyze the input program in order to:

  • find logically equivalent blocks (copyright breaches; recognition of library functions)
  • find logical flaws (automatic vulnerability search; code quality assessment)
  • deobfuscate it (malware analysis)
  • add profiling instrumentation (optimization; code coverage)
  • prepare various inputs (testing)

Obviously we do not need a nice and readable program representation to achieve the second goal. Everything can stay in binary form because the result of the analysis will be used by another computer program. One may even ask, “is this decompilation at all?” The input program is not converted into a high level language (HLL); more than that, a program text may not even be generated.
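For instance, the “find logically equivalent blocks” item above might boil down to something as simple as comparing signatures computed over a canonical form of each basic block. The following is only a toy sketch with invented names, not the method of any real tool:

    /* Toy sketch: signatures for "logically equivalent block" detection.
       Operands that legitimately vary (addresses, registers) are reduced
       to their kinds before hashing. All names are invented.            */
    #include <stdint.h>
    #include <stddef.h>

    struct Insn { uint16_t opcode; uint8_t operand_kinds; };

    static uint64_t block_signature(const struct Insn *insns, size_t n)
    {
        uint64_t h = 0;
        for (size_t i = 0; i < n; i++)
            h = h * 31u + insns[i].opcode * 131u + insns[i].operand_kinds;
        return h;   /* equal signatures => candidate equivalent blocks */
    }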
However, both cases use the same fundamental algorithms to analyze the input program. These algorithms are similar to the algorithms used in HLL compilers. The core of a RET builds an intermediate internal representation (usually a tree or graph-like data structure) which can then be either converted to an HLL text or used to discover interesting properties of the input program. In the same way a compiler builds an intermediate representation which is used to generate machine code.
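To make the idea of an intermediate representation a little more concrete, here is a minimal sketch (with invented names, not taken from any particular tool) of the kind of expression node such a core might build; the same tree can later be pretty-printed as HLL text or queried by an analysis pass:

    /* Purely illustrative IR expression node; real tools use much richer
       structures (types, def-use links, source locations, ...).         */
    enum ExprKind { EK_CONST, EK_REG, EK_MEM, EK_BINOP, EK_CALL };

    struct Expr {
        enum ExprKind  kind;
        int            op;        /* for EK_BINOP: add, sub, and, ...        */
        long           value;     /* for EK_CONST                            */
        int            reg;       /* for EK_REG                              */
        struct Expr   *lhs, *rhs; /* operands; lhs is the address for EK_MEM */
    };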
Maybe we should stop using the word “decompiler” so widely to avoid confusion and to stop mixing concepts.


21 Responses to Return to the sources?

  1. Source code is the key to using a program effectively. That’s one of the reasons that the free software movement exists. From the WhyDecompilation page, recovered source code can be used for:
    (a) with little user intervention:
    . optimising a program for a particular platform. For example, use all the optimisation flags appropriate to your Opteron, as opposed to compiling for a lowest common denominator i386
    . cross platform porting. Obviously, there are sometimes issues related to the operating system to contend with, but perhaps there is a printer driver released in binary only form for x86/Linux, and you want to compile it for your PPC/Linux box.
    (b) with significant user input:
    . source code recovery, as you have mentioned.
    . fix bugs, where the change is more extensive than can conveniently be done with patching the binary.
    . add features to a program. Again, it’s a lot easier to add features to source code than it is to patch a binary with new code.
    The last two features could under some circumstances be done with little user input (auto generated source code, with few or no comments, generic names, etc).
    I agree that decompiled source code will always be missing preprocessor commands; such information is completely non-existent in the binary form of the program. Of course, it could be added again, but that’s like writing parts of the program from scratch. There is one significant difference, however: assuming that the decompilation is correct, you have a working program at all stages, so you can test all the functionality of the original program. If you rewrite from scratch, parts will initially be missing, and you are not sure that any of it works until it is thoroughly tested.
    Yes, some functions will be inlined. However, it is not that much work to undo inlining; a really good editor with refactoring capabilities should be able to do it (this is source code, albeit difficult to read, so all the source level tools will work on it.) Clone detection is a powerful tool, and well researched and understood. So I don’t see inlining of functions as a significant problem with auto-decompiled source code.
    Yes, it will generally lack comments, at least of the kind that tells you why the programmer used some source code feature. I can envisage some day (in the far future, I’d say) when design patterns can be detected, and reasonable comments inserted automatically. Again, they will not be of the “why” variety, more the “what”. However, again assuming that the decompiler is correct, the comments will be *more* valuable than original comments, at least in some cases, because incorrect comments can be worse than none at all. Not many people insert misleading comments deliberately, of course, but comments sometimes apply to old versions of the code, and don’t always get changed when the code does. After enough of this, they can become a significant problem. I’d still like to see the comments if I had the option, but of course apart from a few odd exceptions (early versions of Visual Basic?), this will never be a possibility.
    You mention class inheritance as something that decompilers don’t produce. Sure, only a few decompilers claim this, and I’ve only seen a demo version of the one that I can think of. This problem is not as difficult as it may first seem. For example, in a typical Windows program compiled with MFC, the entire class hierarchy, including the original names of all classes, exists in the binary file. You can get an idea of this by using a hex editor or string search program on any such binary. It has to be in there, because C++ requires the ability to return the name of the class of any object at run time. As long as there is one piece of code (perhaps in a library) that needs this ability, the compiler can’t (without a lot of danger) optimise out this class information. Maybe one day they’ll obfuscate it if decompilers become good at recovering class information; I suspect that all RETs will face this cat and mouse game at some stage. They do already.
    Perhaps you could indicate a little more what you mean by “abstractions”.
    Auto generated source code from a decompiler has limitations, sure. But it *is* source code, so if it compiles at all, then that’s something valuable.
    I know that it is not a trivial thing to generate compilable source code, and even if you can compile it, it may not make a complete application (you need resources, etc, but there are other tools to recover those). I note that the same problem exists for disassemblers. Data Rescue do not emphasise (any more) the ability to reassemble output from IDA Pro. My guess is that this is more related to the nightmare of supporting a product with this capability; I can imagine Joe Sixpack trying to reassemble the result of disassembling a multi megabyte binary, and complaining that some tiny part of it doesn’t reassemble correctly, or if it does, the program doesn’t run properly. But just because something is difficult to do in the commercial domain (as in selling licenses for the RET) doesn’t mean that it is not worthwhile attempting.
    - Mike

  2. > Maybe we should stop using the word
    > “decompiler” so widely to avoid confusion and
    > to stop mixing concepts.
    Yes, agreed. I think that a decompiler always generates source code. Perhaps, if you want to talk about various Reverse Engineering Tools, you should rename this Blog category to “RET” or the like. The trouble is that RET is not necessarily all that well established; Wikipedia has an entry for “decompiler”, but “RET” stands for Resolution Enhancement Technology, and there is no entry for “Reverse Engineering Tool”.
    However, if anyone can standardise a new term, it could well be you…
    - Mike

  3. Sorry to hog the bandwidth here, but this is a subject dear to my heart. You say that program comprehension is an important goal for an RET. That is true for many, but not all, applications for a decompiler, too. Let’s focus for now on a tool whose main purpose is to enhance the understanding of a binary program. (By “binary program”, I mean a program whose source code is not available, for whatever reason, but the executable is available.)
    There are various transformations that can be applied to a binary program, which help to understand the operation of the program. For example, with data flow analysis, some propagation, and some simplification rules, one can identify a high level pattern for an indirect branch or call. By high level pattern, I mean something like m[m[exp1 + K1] + K2], where exp1 is any expression, and K1 and K2 are constants. This is very different from any clever binary level pattern matching; the latter can be so easily affected by reordering instructions, scheduling, various optimisations, moving instructions from one basic block to another, and so on. High level patterns are almost impervious to such program details.
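    As a rough illustration of what matching such a pattern on an expression tree might look like (the Expr type and every name below are invented for this example, not Boomerang’s actual internals):

    /* Sketch only: recognise the high level pattern m[m[exp1 + K1] + K2],
       typical of a jump or call through a table of pointers.             */
    #include <stdbool.h>

    enum Kind { CONST, REG, MEM, ADD };
    struct Expr { enum Kind kind; long k; struct Expr *l, *r; };

    /* matches e == (base + K), with the constant on either side */
    static bool is_add_const(const struct Expr *e, const struct Expr **base, long *k)
    {
        if (e->kind != ADD) return false;
        if (e->r->kind == CONST) { *base = e->l; *k = e->r->k; return true; }
        if (e->l->kind == CONST) { *base = e->r; *k = e->l->k; return true; }
        return false;
    }

    /* true if e has the shape m[m[exp1 + K1] + K2] */
    static bool match_indirect_table(const struct Expr *e, const struct Expr **exp1,
                                     long *K1, long *K2)
    {
        const struct Expr *inner;
        if (e->kind != MEM || !is_add_const(e->l, &inner, K2)) return false;
        if (inner->kind != MEM || !is_add_const(inner->l, exp1, K1)) return false;
        return true;
    }

    Because the match is done on the simplified tree, instruction reordering or the particular instructions the compiler chose do not affect it.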
    Now, this means that we can “guide” the decoding of a program (the essential “disassembling” part of an RET, though note that “disassemblers” like IDA Pro do much more than just decoding instructions into assembly-like form). I argue that handling indirect branches and calls needs to be done at a high level like this, even though it’s very inconvenient. For example, there are technical reasons why the data flow analysis has to be restarted after each round of indirect branches is analysed. But I think that most people would prefer a slow RET that correctly decodes a larger proportion of programs, to a fast one that doesn’t work on many programs.
    If you accept that this sort of high level analysis is the domain of a decompiler, then in a sense a good decompiler can also be a great disassembler. But if you can generate high level source code (granted, there is a whole other stage called structuring required to generate high level source code, but that’s largely solved now), then why bother generating a disassembly at all? Disassembler output is
    . bulky
    . machine specific
    . full of machine details that distract from the essential aspects of the program.
    I’ve always been a fan of the idea that a decompiler should have the option of multiple back ends (generating source code in different high level languages). Something like a compiler, which often has multiple back ends, which in this case generate the same program in different machine languages.
    Boomerang has only one back end at present, and that’s C, because C is so expressive. But perhaps for ease of program comprehension, you would choose to view the output in something like Python, which is supposed to be concise and easy to understand. Perhaps even something like COBOL, although it never really delivered on its supposed ability to be readable by managers. Maybe you need a selection of output languages, because computer language preferences are almost religious.
    I haven’t really thought about this, but I don’t really know the techniques needed to “guide” a program’s internal representation towards a specific language. Imagine trying to generate a LISP version of a program; you have to somehow “bend” everything to look like a list, because that’s practically all you have available in LISP. Few programs could not be expressed as a reasonable LISP program, I’d guess. So there is an interesting problem to consider. Generating C output is comparatively easy, because it is so imperative.
    Now the question remains: suppose you have the ability to readily generate Python (etc) like output, but it’s a lot more effort to make the output compilable. Should that effort be made? Perhaps it should be an option, so most of the time it quickly generates Python-like pseudo code, but there is a special option to generate compilable Python if required.
    I think of a decompiler as a static tool; you run it, and it generates compilable source code. IDA Pro is of course very interactive, and the interactivity adds a lot to the usefulness of the tool. But I suspect that most people would prefer that IDA did all the work, which in effect makes it a static tool.
    Maybe this is the question: is interactivity required mainly to overcome deficiencies of the RET? In other words, if we had a good enough tool, it may as well be totally static (non interactive). To add comments and change names, just use your favourite source code editor.
    I suspect that there will be situations where even though the tool can come up with *a* right answer, guidance will in practice (or at least for the foreseeable future) be needed to find *the best* (or at least, *one of the better*) answers. (Where by “answer”, I mean some sort of representation of the binary program that suits the user’s needs: perhaps compilable source code, perhaps some pseudo-code representation.)
    - Mike

  4. Yes, going from an arbitrary binary to an agnostic intermediary representation that could then be used to generate sources for multiple languages would be quite cool. However, developing such a system is a significant investment and the returns are uncertain.
    As far as interactivity is concerned, I remain convinced that it is an essential feature of any such tool. One’s perception of the “Perfect Tool” and its “Perfect Output” may, surprisingly, not fit the customer’s real need.

  5. ilfak says:

    Mike,
    When you talk about the recovered source code you deem that it is 100% reliable and can be used for optimization or porting purposes. My experience with a much less ambitious tool – a disassembler – shows that the output file is not reliable at all. In general it is not recompilable, and even in the cases when it is, the output is too rough to be directly used in other projects. I understand that my experience does not prove anything but I will still base my judgements on it. Second, the tasks you mentioned are much better carried out with the original source code of the program, not the recovered sources.
    By abstractions I meant everything else I did not mention (sorry for being unclear). Assume we have the following source code:
    const int nrows = 8;
    const int ncols = 8;
    const int ncells = nrows * ncols;
    const int nwhites = ncells / 2;
    ….
    check_white_cells(array, nwhites);
    If you decompile a binary form of this code, all you will get is
    func(&dword, 32);
    where 32 is the number of white cells. But it will take time and considerable intellectual effort to understand the meaning of 32 – just because the abstractions (the number of rows, columns and cells) are lost during compilation. There are also symbolic constants, bit masks, function access modifiers (private/protected/public), type information – all this is lost and has to be reconstructed one way or another.
    As for a nice term for a reverse engineering tool which does not generate a text output, I do not know yet. The word “RET” is not appealing and is too generic. Any ideas?
    I agree with you on high level patterns and program details. It is really nice to see that this representation frees us from irrelevant things like the exact instruction used to perform an operation or the instruction order. For example, an addition can be performed by an “add” or a “lea” instruction, and the instruction order is not important as long as the data dependencies do not change.
    As for indirect calls and jumps, my opinion is this: you have to do it twice. Once at the entry level, at the very beginning of the analysis (maybe at the machine instruction level) and a second time when you have gathered more information about the current function (assuming that we analyze the input function by function). Indirect control flow has to be detected and handled as soon as possible at the earliest stages of the analysis. The second pass will cover cases not detected by the first pass. In my model some indirect calls are resolved automatically as a side effect of the general optimization mechanism (for example, simple cases when a function address is put into a register and is later used for an indirect call are resolved by the constant propagation optimization). But again I can not say that I have a 100% reliable method of solving the problem of indirect calls. Just like many other decompilation problems, the methods are heuristic and the results are not guaranteed to be optimal or even correct.
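    A trivial C illustration of that simple case (the code and the name “handler” are made up purely for illustration):

    /* Before analysis the call target is just "whatever is in reg";
       propagating the constant &handler turns it into a direct call. */
    #include <stdio.h>

    static void handler(int x) { printf("got %d\n", x); }

    int main(void)
    {
        void (*reg)(int) = handler; /* function address put into a "register" */
        reg(42);                    /* indirect call; after constant propagation
                                       the decompiler can emit handler(42)      */
        return 0;
    }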
    If you give me two listings: one in assembler, second in HLL and guarantee that both are correct, I will take the latter without any hesitation. But I doubt that today’s technologies can promise us anything about the code quality.
    As for the other points – I agree with you in most cases. Since neither of us has a working decompiler capable of handling real world programs, many points remain theoretical, but I hope the situation will change in the future. Let’s keep working!

    > As for a nice term for a reverse engineering tool
    > which does not generate a text output, I do not know yet.
    > The word “RET” is not appealing and is too generic. Any ideas?
    I have used “binary analysis” or “binary analysis tool”. The latter abbreviates to BAT, which has a certain ring to it.
    I agree that there is a need for a new term, but I haven’t been able to find a good one yet either.
    - Mike

    > As for indirect calls and jumps, my opinion is this:
    > you have to do it twice. Once at the entry level, at the very beginning
    > of the analysis (maybe at the machine instruction level) and a second time
    > when you have gathered more information about the current function
    > (assuming that we analyze the input function by function).
    Well, I think it makes sense to attempt it twice. If you succeed at what I call the decoding level, presumably you would get the same answer if you repeated the analysis at the higher level. (Granted, you may guess the size of the jump table incorrectly, in which case you should get a different answer at the higher level. Hopefully, you will know if your first result was potentially incorrect, i.e. if you had to make any guesses or not.)
    > Indirect control flow has to be detected and handled as soon as possible
    > at the earliest stages of the analysis.
    I’m not so sure. For a highly interactive tool like IDA Pro, I guess it is best to decode them as early as possible, so the user can do something productive, even on code that happens to be in the middle of a switch statement. For a more static tool, I think it doesn’t matter; you get to all the code eventually.
    > The second pass will cover cases not detected by the first pass.
    Right. But hopefully there is no need for the second if the first has succeeded.
    > But again I can not say that I have a 100% reliable method of solving the problem of indirect calls.
    As with virtually any question about programs, it is undecidable in the general case. There are actually papers written on this (e.g. Rice).
    > Just like many other decompilation problems, the methods are heuristic
    > and the results are not guaranteed to be optimal or even correct.
    I think that’s a bit pessimistic. Compilers face many undecidable problems in the area of optimisation. So they use approximate solutions, and good compilers always generate a correct translation. Sometimes, the result will not be as optimised as possible.
    Similarly, a good decompiler should be able to use approximations to work around the undecidable questions, and still generate a correct translation (in the opposite direction). Sometimes the result will not be as readable as possible.
    - Mike

  8. On source recovery, I’m of the opinion that a skilled reverse engineer with good tools can turn a binary into readable, commented, documented code much faster and cheaper than redevelopment can. More importantly, the risk of losing business is vastly reduced, as no redevelopment will ever produce the “same” program as that desired. If you lose your source code and your customers are the kind of people who have been reluctant to upgrade in the past for fear of losing their favourite features or their familiarity with your product, telling them you are going to rewrite the software from scratch (or just trying to pass off a redeveloped app as the original) will guarantee you lose them.

  9. ilfak says:

    >I think that’s a bit pessimistic. Compilers face many undecidable
    >problems in the area of optimisation. So they use approximate
    >solutions, and good compilers always generate a correct
    >translation. Sometimes, the result will not be as optimised
    >as possible.
    When a compiler faces an undecidable problem (for example, it cannot prove that two pointers are not aliased) it just skips an optimization step. The optimization transformations made by a compiler never change the functionality of the input program, and the correctness of the result can be proved. Since the compiler builds the binary form of the program, all information about the running environment, memory layout, type information and calling conventions is available to it.
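    A classic example of such an undecidable question, sketched as C (invented code): if the compiler cannot prove that p and q never alias, it simply forgoes the optimization and the generated code is still correct:

    /* If p and q may alias, the store through q can modify *p, so the
       compiler must reload *p instead of reusing the value in 'a'.
       Skipping the optimization never breaks the program.            */
    int f(int *p, int *q)
    {
        int a = *p;
        *q = 7;          /* may or may not write to *p  */
        return a + *p;   /* *p has to be reloaded here  */
    }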
    The situation is different with decompilers. All they have as input is a binary file. Since exact information about the environment and calling conventions is not available, they have to make some assumptions about them. Some of these assumptions will be incorrect. Even if we limit ourselves strictly to compiler generated code, there are too many undecidable details in the code – calling conventions, memory layout, object sizes, indirect jump targets, etc. If any of the assumptions made by the decompiler is wrong, the output will be incorrect. Unlike a compiler, we cannot skip an optimization step; we must make a decision.
    For example:
    mov ebp, esp
    push edi
    push esi
    call f1
    push eax
    call f2
    mov esp, ebp
    ret
    We can not tell if edi is a parameter of f1() or f2(). There is no formal way of deducing it from the input file. Even if we know the types of f1 and f2, it will not be enough (imagine that they both accept a variable number of parameters, i.e. the parameter list ends with ‘…’). So we end up with two possible answers: f2(f1(esi), edi) and f2(f1(esi, edi)). Obviously one of them is wrong, but we cannot tell which one.
    BTW, virtually all real world programs contain some hand-made code. By hand-made I mean code explicitly written by the programmer in assembly language. This code usually comes from the runtime libraries. In other words, limiting ourselves to compiler-generated code only means that the decompiler will not be able to handle real world programs. A technology like FLIRT to recognize library functions and skip them would be nice, but there is no 100% reliable method of detecting them.
    This is one of the reasons why human interaction is required during decompilation, if we target 100% reliability.

  10. ilfak says:

    Trent,
    By ‘writing from scratch’ I did not mean a program with a new design and functionality. It can be exactly the same program but the source code has to be written by humans, otherwise it will be unmaintainable. A raw decompiler output can be used only as a starting point (if one day we have a decompiler capable of handling real world programs).

  11. > mov ebp, esp
    > push edi
    > push esi
    > call f1
    > push eax
    > call f2
    > mov esp, ebp
    > ret
    If you analyse both f1 and f2 and find that they both use at least one parameter, and potentially more (that is what will happen if they are both varargs), the safe solution is this:
    f2(f1(esi, edi), esi, edi);
    Perhaps the second function only uses one or two parameters ever; you might not be able to determine that statically. Then passing the two extra parameters is a performance penalty (the program is not as optimised as possible, though if you had a really good compiler, it might be able to throw away the extra parameters in *its* machine code output, even though the decompiler could not make the same analysis), and also the output is not as readable as possible.
    In the worst case, you end up with horrible source code that emulates almost everything about the original program except the program counter, but still runs correctly.
    I have no proof that this can always be done, but it seems reasonable. After all, the processor can always follow the original program; you should be able to produce a program in a high level language that does the same thing.
    After that, any extra readability is a bonus over assembly language.
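    To make that concrete, here is a caricature (entirely invented) of what such worst-case but still correct output can look like: the registers become globals, the stack becomes an array, and control flow becomes gotos:

    /* Caricature of "emulate almost everything" decompiler output.
       Correct, but barely more readable than the assembly it came from. */
    unsigned eax, ecx, esp = 0x1000;
    unsigned stack[0x1000];

    void sub_401000(void)                /* invented name */
    {
        esp -= 1; stack[esp] = eax;      /* push eax      */
        eax += ecx;                      /* add eax, ecx  */
        if (eax == 0) goto L1;           /* jz  L1        */
        ecx = eax;                       /* mov ecx, eax  */
    L1: eax = stack[esp]; esp += 1;      /* pop eax       */
    }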
    - Mike

  12. With respect to rewriting from scratch, you seem to assume that you can rewrite the code to the original specifications. If you lose your source code, you have probably lost those specifications, too. If you had really good test cases, you probably lost those as well (e.g. cppunit cases). The binary has a better chance of surviving because more people have a copy of it (e.g. your customers, in most cases, or perhaps a dedicated server if you don’t release the binary to your customers).
    Decompilation won’t recover your test code, though you might be able to generate some test cases from your decompiled source code if you have a really clever tool. You might also be able to regenerate a specification of sorts, a la Martin Ward. (He has demonstrated transforming assembly language (not machine code) all the way up to high level specifications. Just search for “Martin Ward” and “FermaT”.)
    Meanwhile, your recovered source code has the same features (and bugs) as the binary it was decompiled from. You can ship your product, make changes (with difficulty if the changes are in the modules that you lost the source code for, but easier than patching at the binary level). Your business can continue, somewhat weakened, but alive.
    - Mike

  13. Pierre Vandevenne says:

    That’s a very theoretical view of the situation.
    Firstly, the lost source case is not very frequent. For valuable programs, I would even go as far as saying it is extremely rare. During the last ten years, we’ve had a few contacts from people who were attempting source recovery. There are usually two scenarios:
    1 – lost source for a custom application/dispute with the developer of said custom application. In most cases: Visual Basic, Windev or, on a couple of occasions, Delphi.
    2 – disappearance of the manufacturer of some machine tool whose firmware needs to be modified.
    Case 1 is, in practice, only remotely connected to decompilation. Applications aren’t necessarily compiled, and what is of interest to those people isn’t usually a complex algorithm or program flow.
    Case 2 is often pure assembler, with a lot of machine specific catches. It can usually better be tackled by mere patching.
    I have never heard of significant C/C++ programs whose source code was lost. I assume they must exist, but a few cases do not an industry make.
    If there is no industry, no significant effort is put into research (although talented people might be involved in it out of pure interest ;-)) and the field stagnates. Each case that arises is different from the other, and no economy of scale can be made.
    Now, for the sake of the argument, let’s assume that I am in the rare situation of having lost the source code of a valuable application, that I am in the small subset of people who have lost a fairly recent source, and in the even rarer subset of that subset where a decompiler could give a practical result. In other words, quite a special case. Source recovery services and pseudo-decompilation services (those that were producing assembler-like recompilable C code) all went broke or quit.
    Let’s consider a typical case; there are two possibilities:
    - the application is an essential internal part of my business and I, of course, have its binaries. Where’s the urgency? Where is the catastrophic situation some patching couldn’t solve? (even vulnerabilities in OSes can be patched without the source ;-)) Also, programming has become a cheap commodity, RET consultants are expensive and won’t solve the problem fully… Wouldn’t it be a good opportunity to re-design or re-develop a new system while surviving on the existing one?
    - the application is something I sell or use to provide service to third parties. Again, I can survive for a while on the binary. Should I build upon the result of a potentially imperfect decompilation to go on with development? Don’t even think about it if you are selling anything that is mission critical (health care/finance/military are domains that come to mind). And again, wouldn’t that be an opportunity to re-implement a more modern, better program from scratch?
    If losing the source of any of our internal custom systems or any of the program we sell happened to me, I know for a fact, without hesitating for a second, which option I’d choose.
    On the pure business side, the notion that losing source code immediately and critically threatens the survival of a business sounds far fetched. I am sure one can find a few examples of that, but I am also sure that the loss of a customer database, something that happens far more frequently, is a much more severe blow.
    And, of course, running a business on a now potentially rotten, unmaintainable source base, is probably the worst and most costly thing one can do. (it could however be argued that people careless enough to lose their source could also fail to see the dangers…)
    Compare that with OCR, or handwriting recognition. Anything less than 99% is virtually useless and untrusted. And that in a field where the human mind immediately grasps the meaning of what’s shown. How far is decompilation from that 99% rate on real world programs?
    Let’s answer the fundamental question first: in the set of all existing programs, how small is the subset of programs containing unique, non-recoverable know-how? In that subset, how small is the subset of undocumented programs that could not be reimplemented? In that new subset, how small is the subset of programs whose source code is lost? Out of that subset, how small is the subset of programs that can be recovered in a safe and cost effective way?
    There are many potential uses of decompilation for current real world problems, but lost source code recovery definitely isn’t one of them.

  14. Pierre,
    It’s upsetting to me that decompilation has such a bad name. I’m of the opinion that a reverse engineer can take a binary and produce source code that is even more maintainable than the original source code. What’s more, I’m of the opinion that he can do it cheaper than redevelopment of the application, will do it faster than redevelopment, and will guarantee the result to be functionally identical to the supplied binary. But as I have nothing to show you to convince you of my belief (I haven’t done it and I don’t know anyone except, debatably, Martin Ward, who has) I’ll just have to agree with you that redevelopment is the only sane solution to lost source code at this point in time.

  15. J.C. Roberts says:

    Trent,
    What Pierre wrote is entirely correct. Recreating maintainable source code from a binary is seldom even possible and on the few instances where it is possible, it is not cost effective.
    The smartest, fastest and cheapest path to a working solution is a three-step process:
    (1) Do your Reverse Engineering Analysis of the binary.
    (2) Document how things work (interfaces, algorithms, etc.) and write a specification from them.
    (3) Re-implement from your created specification.
    Software only exists for the sake of getting some job accomplished. I own about 20 different types of hammers; when the job is driving a nail, most of my collection *can* do the job reasonably well.
    Recreating an *exact* replica of a hammer that I lost is a poor investment of time and money.
    kind regards,
    JCR

  16. mrb says:

    Trent, I am really curious about why you think reverse engineering can be faster. But I am willing to understand your mindset. Can you give us some examples ? Do you think, for instance, that recreating the source code of a program like wordpad.exe would be faster by reverse engineering it ? Or are you thinking of much larger programs of the size of, say, firefox ? My personal idea on the subject is that we are clearly lacking the decompiler tools to sufficiently ease the work of reverse engineers. I am not saying that it is impossible, but IMHO too much manual work would be required.

  17. Ilfak Guilfanov says:

    Fidelity and precision? No, thanks!
    http://www.sockpuppet.org/tqbf/log/2006/01/reversing-is-easier-than-you-think.html
    This point of view is very understandable in the modern “too much work, too little time” world.

  18. > Trent, I am really curious about why you think reverse engineering can be faster.
    > But I am willing to understand your mindset.
    > Can you give us some examples ?
    Well, the problem is that decompilers are still so much in their infancy. Let’s look at an example anyway. Suppose someone heard that you are pretty good at reverse engineering, and asked you to recover source code for this really simple binary:

    00010604 <main>:
    10604:       9d e3 bf a0     save  %sp, -96, %sp
    10608:       84 10 3f fe     mov  -2, %g2
    1060c:       88 10 3f ff     mov  -1, %g4
    10610:       87 3e 20 1f     sra  %i0, 0x1f, %g3
    10614:       90 a0 80 18     subcc  %g2, %i0, %o0
    10618:       86 61 00 03     subx  %g4, %g3, %g3
    1061c:       86 0a 00 03     and  %o0, %g3, %g3
    10620:       13 00 00 41     sethi  %hi(0x10400), %o1
    10624:       84 20 80 03     sub  %g2, %g3, %g2
    10628:       90 02 62 f4     add  %o1, 0x2f4, %o0
    1062c:       87 38 a0 1f     sra  %g2, 0x1f, %g3
    10630:       84 a0 a0 03     subcc  %g2, 3, %g2
    10634:       86 60 e0 00     subx  %g3, 0, %g3
    10638:       84 08 80 03     and  %g2, %g3, %g2
    1063c:       40 00 40 4f     call  20778
    10640:       92 00 a0 03     add  %g2, 3, %o1
    10644:       81 c7 e0 08     ret
    10648:       91 e8 20 00     restore  %g0, 0, %o0
    

    Oops! That’s a SPARC binary. No problem, IDA can handle SPARC (if you have the extended version, from memory). How’s your SPARC knowledge? Well, how hard can it be, right? Well, there are delay slots, meaning that some instructions end up after things like branches and calls, but sometimes, the instruction appears logically before the branch only if the branch is taken. Messy, but you get used to it; Boomerang does all that flawlessly. And it’s so easy to make a mistake doing that sort of thing manually, especially if (like me), you haven’t done SPARC in a while.
    Worse, this compiler (from Sun, probably not one that you know the tricks of all that well) has used subtract with carry and tricks with the carry flag to do something. What does it do, exactly? Well, Boomerang at present doesn’t have an idiom for this, so you get this correct but very cryptic result:

    // address: 0x10604
    int main(int argc, char **argv, char **envp) {
    int local16;                // r2
    int local17;                // r3
    local16 = -2 - (-2 - argc & -1 - (argc >> 31) + (4294967294U > 31) - ((unsigned)local16
    Well, I did warn you that it is cryptic. But with a few more idioms, it would come out like this (from the equivalent Pentium program, also from Boomerang):
    
    // address: 0x80488f0
    int main(int argc, char **argv, char **envp) {
        int local5;                 // r24
        local5 = argc;
        if (argc >= -2) {
            if (argc > 3) {
                local5 = 3;
            }
        } else {
            local5 = -2;
        }
        printf("MinMax adjusted number of arguments is %d\n", local5);
        return 0;
    }
    

    (Sorry for the formatting; not used to this blogsite).
    Ah, that idiom is to do min and max. What will compilers think of next?
    Anyway, having not seen the decompiled output, could you decompile that function faster than Boomerang? You have 0.37 seconds to do it (that's the output from the "time" utility on a 2.4GHz Pentium box).
    > Do you think, for instance, that recreating the
    > source code of a program like wordpad.exe
    > would be faster by reverse engineering it ?
    It depends on the technology that is available. For wordpad, I'd say yes, right now, in January 2006, with Boomerang as poor as it is – if you know Boomerang well enough (this is not yet for Joe Sixpack).
    > Or are you thinking of much larger programs of the size of, say, firefox ?
    Firefox is huge. At present, Boomerang uses huge amounts of memory, and it needs the whole program in memory (effectively), so you would need a supercomputer to do it (and it would still take weeks of processing time). But a lot of that is pure wastage. I do believe that it would be practical to decompile something like Firefox on a moderate cluster of computers, each with sane amounts of memory and disk. You probably wouldn't like to do it on your single processor desktop, though that may change in the future, who knows.
    But because it's so huge, it makes it a real pain to rewrite, too. You would need a large team to rewrite it, and you would have a significant management problem.
    > My personal idea on the subject is that we are clearly lacking
    > the decompiler tools to sufficiently ease the work of reverse engineers.
    Absolutely, that is the case right now. But it's a bit like saying to the Wright brothers after their first successful flight, "yes, that's pretty impressive guys, but that was just a toy flight. I could not sell tickets to the public for that length of flight, that cargo capacity, and the level of noise. Sorry guys, I don't see any commercial application for your invention. Try something more practical."
    Boomerang is at about that level: first demonstrated flight (e.g. it can decompile small test programs with no errors). So decompilation is not something that, say, Data Rescue should take on; it's just way too immature. But for someone wanting to take on a bit more risk, well. Imagine being able to solve any device driver problem at the source code level.
    > I am not saying that it is impossible, but IMHO
    > too much manual work would be required.
    Well, that's the big question at the moment. On the one hand, we have people like Pierre and Ilfak, who think that reverse engineering must require a lot of manual intervention. There are people like Trent, who believe that you can come up with such good quality source code automatically, and that it's so much easier to work at the source code level, that there is probably little or no need for interactivity. Myself, I waver, but I'm currently in about the middle: I can see that there will always be some situations for which a gentle nudge by an expert user can send the automatic decompiler in a much more fruitful direction. It will be a challenge to present the required information to the user in a sensible way; Boomerang is on at least its third GUI, and I think that most would agree that it has a long way to go.
    But you have to start somewhere. Certainly, decompilation is more in the academic arena at present than the commercial arena. But that might change with a rewrite, which I have some ideas for, but not the time at present. It will be interesting to see what even just one year will bring.
    - Mike

  19. Consider Java decompilation. You would agree that decompiling a large Java application, naming the local variables, fine tuning the choice of loop keywords chosen and adding some comments is faster and more reliable than rewriting the program from scratch. Even if we had full debugging information that we could trust I would be hesitant to suggest that decompilation of binaries would be so reliable at this time in history. We simply don’t have great tools for decompilation of binaries right now.
    As a good thought exercise, take a sufficiently large C or C++ application and rename every class, function, struct, local variable and parameter to generic names (e.g., proc1, local5, param2, class4). Remove all comments and “pretty print” the code to remove any special formatting. Now ask a skilled reverse engineer (perhaps even a professional Java reverse engineer) to make it maintainable.
    I think we’ll find that the reverse engineer will give you back maintainable source code long before a team of engineers can give you a new program written from scratch and it will cost a heck of a lot less.

  20. Pierre Vandevenne says:

    It certainly would help if you guys could get your hands on a decompiler’s binary. ;-)

  21. Alan says:

    Nice writing. You are on my RSS reader now so I can read more from you down the road.