Hex-Rays Decompiler primer

The Hex-Rays Decompiler 1.0 was released more than two years ago.
Since then it has improved a lot and does a great job decompiling real-life code, but sometimes there are additional things that you might wish to do with its output.
For that purpose we have released the Hex-Rays Decompiler SDK and several sample plugins.
However, the header files alone do not give a complete picture and it can be difficult to see where to start.

In this post we will outline the architecture of the Hex-Rays Decompiler SDK, cover some principles and finally wrap everything we discussed and write a small plugin.


The post is divided into the following sections:

Background information

If you’re already familiar with abstract syntax trees (aka ASTs), you can skip this section. Otherwise, some basic information is available at Wikipedia. It can also be useful to read Ilfak’s post “Decompiler output ctree“.

We will provide some examples to show how ASTs are used in the decompiler. Consider this C code:

int func(int a)
{
  int result;

  if ( a == 1 )
  {
    result = 5;
  }
  else
  {
    result = 6;
  }
  return result;
}

It can be represented with this AST:

From the start node, we see a “{2}” node that denotes a block of two statements. The first statement in the block is an if and the second statement is a return.
If we take the first statement (if) we notice that it has 3 nodes attached to it: (1) the condition node (2) the then-branch node and (3) the else-branch node (optional).
Looking further down in the tree we notice that the condition node is an equals to node which also links to two other expression nodes x and y.
If we take the then-branch we see how the result = 5 got translated into an assignment node x = y with the x node being a variable and the y node being a numeric constant.

Let us take another code snippet:

int func1(int n)
{
  return (n | 291) & 4011;
}

And see its AST:

Again we have a block containing one statement: the return statement. The return statement has an expression of type binary AND, and this expression takes two operands x and y. The x operand is another expression (binary OR) and the y operand is a numeric constant (value = 291).

The citem_t class

Hex-Rays decompiler SDK refers to the AST as ctree and each node within the tree is represented by either a cinsn_t or a cexpr_t class instance. Both those classes are descendants of the citem_t base class:

struct citem_t
{
  ea_t ea; // address that corresponds to the item
  ctype_t op;// element type
  ...
}

One of the most important fields in a citem_t is its op field value which is of type ctype_t. ctype_t is an enum with constants identifying the type of the node.
Hex-Rays defines two types of constants: cot_xxxx and cit_xxxx. The former denote expression items while the latter denote statements (or instructions in Hex-Rays jargon).

Let us take a look at some of the ctype_t constants:

enum ctype_t
{
  cot_empty    = 0,
  cot_asg      = 2,   //  x = y
  cot_bor      = 19,  //  x | y (binary OR)
  cot_band     = 21,  //  x & y (binary AND)
  cot_eq       = 22,  //  x == y 
  cot_add      = 35,  //  x + y
  cot_call     = 57,  //  x(...)
  cot_num      = 61,  //  n
  cot_fnum     = 62,  //  fpc
  ...
  cot_last     = cot_helper, 

  // The statements
  cit_block    = 70,  //  block-statement: { ... }
  cit_if       = 72,  //  if-statement
  cit_return   = 79,  //  return-statement
  ...
}

Now given a citem_t instance we can check its op value and see whether to treat that citem_t as a cexpr_t or a cinsn_t.

The cexpr_t class

Let us illustrate how a cexpr_t class instance can represent items with cot_xxxx type values. Below we outline the class members that are relevant to our discussion:

// Ctree element: expression.
// Depending on the exact expression item type, 
// various fields of this structure are used.
struct cexpr_t : public citem_t
{
  ...
  union
  {
    //  used for cot_num
    cnumber_t *n;
    //  used for cot_fnum
    fnumber_t *fpc;
    struct
    {
      union
      {
        var_ref_t v;  //  used for cot_var
        ea_t obj_ea;  //  used for cot_obj
      };
      //  how many bytes are accessed? (-1: none)
      int refwidth;
    };
    struct
    {
      //  the first operand of the expression
      cexpr_t *x;
      union
      {
        //  the second operand of the expression
        cexpr_t *y; 
        //  argument list (used for cot_call)
        carglist_t *a;
        //  member offset (used for cot_memptr, cot_memref)
        uint32 m;     
      };
      union
      {
        //  the third operand of the expression
        cexpr_t *z;   
        //  memory access size (used for cot_ptr, cot_memptr)
        int ptrsize;  
      };
    };
    //  an embedded statement, they are             
    //  prohibited at the final maturity stage      
    cinsn_t *insn;    
    //  helper name (used for cot_helper) 
    //  string constant (used for cot_str)
    char *helper;     
    char *string;     
  ...
  };
  ...
};

As you notice, cexpr_t employs unions, thus the contained information depends on the op field.
For example if cexpr.op == cot_num then we can safely access cexpr.n field to get a cnumber_t instance and extract the constant number.
If the expression has two operands (e.g. cot_add, cot_sub, cot_bor and so on….), then we have two sub-expressions: cexpr.x is the left-hand side operand and cexpr.y is the right-hand side operand.
In the case of a function call (denoted by op == cot_call) the address of the called function is accessible via cexpr.x.obj_ea field and the arguments are in the a field (which is a carglist_t instance).

Bottom line: first check the op value and then extract the fields from a cexpr_t instance accordingly.

The cinsn_t class

This class represents statements supported by Hex-Rays (cit_for, cit_if, cit_return, etc…). An excerpt from the class definition:

// Ctree element: statement.
// Depending on the exact statement type, 
// various fields of the union are used.
struct cinsn_t : public citem_t
{
  ...
  union
  {
    //  details of block-statement
    cblock_t *cblock;   
    //  details of expression-statement
    cexpr_t *cexpr;     
    //  details of if-statement
    cif_t *cif;         
    //  details of for-statement
    cfor_t *cfor;       
    //  details of while-statement
    cwhile_t *cwhile;   
    //  details of do-statement
    cdo_t *cdo;         
    //  details of switch-statement
    cswitch_t *cswitch; 
    //  details of return-statement
    creturn_t *creturn; 
    //  details of goto-statement
    cgoto_t *cgoto;     
    //  details of asm-statement
    casm_t *casm;       
  };
  ...
};

Just as we could tell what kind of cexpr_t we have by looking at the op field, here we check the op field against the cit_xxxx constants and extract data accordingly.
For example, if op == cit_if, we can extract a cif_t instance from the cif field. Similarly for op == cit_return the corresponding field is creturn.

The class cblock_t is used to describe a sequence of statements. It is defined as a list of cinsn_t:

// Compound statement (curly braces)
// we need list to be able to manipulate
// its elements freely
struct cblock_t : public qlist<cinsn_t> 
{                                       
  ...
  iterator find(const cinsn_t *insn);
};

When Hex-Rays creates a ctree, the root of the tree is a cblock_t that contains all the subsequent instructions (or expressions) present in the decompiled function.

The ceinsn_t class

ceinsn_t is used whenever we need to describe a statement that contains an expression. For example, “x = 1;” is a statement containing expression “x = 1”.

// Statement with an expression.
// This is a base class for various statements
// with expressions.
struct ceinsn_t
{
  //  Expression of the statement
  cexpr_t expr;         
};

The if statement is a statement with an expression where the expression is the condition of the if:

// If statement
struct cif_t : public ceinsn_t
{
  ...
  //  Then-branch of the if-statement
  cinsn_t *ithen;       
  //  Else-branch of the if-statement. May be NULL.
  cinsn_t *ielse;       
  ...
};

Given a cif_t instance, we can extract its condition by accessing cif_t.expr field, the then-branch from cif_t.ithen (a cinsn_t, which in turn could be a cblock_t holding many other cinsn_t instances), and the else-branch can be accessed through cif_t.ielse, if present.

Before illustrating other statements with expressions (such as the for statement), let us talk about the cloop_t class which is used to represent repetition structures: for, while, do.

// Base class for loop statements
struct cloop_t : public ceinsn_t
{
  cinsn_t *body;
  ...
};

As you notice, cloop_t is ceinsn_t (a statement with expression) with an instruction (the body member).
The cloop_t class (as is) can be used to define a do/while statement where the expr field is the while’s condition and the body field is the body of the do/while statement:

// Do-loop
struct cdo_t : public cloop_t
{
  DECLARE_COMPARISONS(cdo_t);
};

A for loop statement has four components: (1) initialization expression, (2) condition expression, (3) step expression and (4) the body:

// For-loop
struct cfor_t : public cloop_t
{
  cexpr_t init;   //  Initialization expression
  cexpr_t step;   //  Step expression
  ...
};

The initialization expression is stored in the init field, the condition in the base class’ expr field, the step expression in the step field and the body of the loop in the body field.

The cfunc_t class

Now that we covered the basic tree elements, let us talk about the cfunc_t class, which is used to hold a decompiled function:

// Decompiled function. Decompilation result is kept here.
struct cfunc_t
{
  //  function entry address
  ea_t entry_ea;             
  //  function body, must be a block
  cinsn_t body;              
  //  maturity level
  ctree_maturity_t maturity; 
  // The following maps must be accessed 
  // using helper functions.
  // Example: for user_labels_t, 
  // see functions starting with "user_labels_".

  //  user-defined labels.
  user_labels_t *user_labels;
  //  user-defined comments.
  user_cmts_t *user_cmts;    
  //  user-defined number formats.
  user_numforms_t *numforms; 
  //  user-defined item flags
  user_iflags_t *user_iflags;
  ...
}

When Hex-Rays is asked to decompile a function it returns a cfunc_t instance. The following is an excerpt from the example #1 found at the examples page:

  func_t *pfn = get_func(get_screen_ea());
  if ( pfn == NULL )
  {
    warning("Please position the cursor within a function");
    return;
  }
  hexrays_failure_t hf;
  cfunc_t *cfunc = decompile(pfn, &hf);
  if ( cfunc == NULL )
  {
    warning("#error \"%a: %s", hf.errea, hf.desc().c_str());
    return;
  }
  msg("%a: successfully decompiled\n", pfn->startEA);
  qstring bodytext;
  qstring_printer_t sp(cfunc, bodytext, false);
  cfunc->print_func(sp);
  msg("%s\n", bodytext.c_str());
  delete cfunc;

Among the fields in cfunc_t, body is the most important to us because it points to the root of the ctree. It can be used to traverse the tree manually, but the visitor utility classes provided by the Hex-Rays SDK make that task simpler.

The tree visitor class

The tree visitor class is a utility class that can be used to traverse the tree, find ctree items, and (if desired) modify the tree along the way.

// A generic helper class that is used for ctree traversal
struct ctree_visitor_t
{
  ...
  // Traverse ctree.
  int hexapi apply_to(citem_t *item, citem_t *parent);

  // Visit a statement.
  virtual int idaapi visit_insn(cinsn_t *) { return 0; }

  // Visit an expression.
  virtual int idaapi visit_expr(cexpr_t *) { return 0; }
  ...
};

Hex-Rays provides other visitors as well: ctree_parentee_t, user_lvar_visitor_t, …

To use the class, inherit from ctree_visitor_t and override virtual methods as necessary. For example, here’s how to use it to report all called functions:

void traverse(cfunc_t *cfunc)
{
  struct sample_visitor_t : public ctree_visitor_t
  {
  public:
    sample_visitor_t() : ctree_visitor_t(CV_FAST) { }

    int idaapi visit_expr(cexpr_t *expr)
    {
      if ( expr->op != cot_call )
        return 0;

      char buf[MAXSTR];
      if (get_func_name(expr->x->obj_ea, buf, sizeof(buf)) == NULL)
        qsnprintf(buf, sizeof(buf), "sub_%a", expr->x->obj_ea);

      msg("%a: a call to %s with %d argument(s)\n", expr->ea, buf, expr->a->size());
      return 0; // continue enumeration
    }
  };
  sample_visitor_t tm;
  tm.apply_to(&cfunc->body, NULL);
}

The traverse() function can be called manually with a cfunc_t * obtained from a decompiled function, or automatically by installing a callback:

// This callback handles various Hex-Rays events.
static int idaapi callback(void *, hexrays_event_t event, va_list va)
{
  switch ( event )
  {
  case hxe_maturity:
    {
      cfunc_t *cfunc = va_arg(va, cfunc_t *);
      ctree_maturity_t new_maturity = va_argi(va, ctree_maturity_t);
      if ( new_maturity == CMAT_FINAL ) // ctree is ready
      {
        traverse(cfunc);
      }
    }
    break;
  }
  return 0;
}

int idaapi init(void)
{
  ...
  install_hexrays_callback(callback, NULL);
  ...
}

Writing a plugin

Now that we covered the basics, let us put our knowledge into practice and write a small plugin. Consider this C code:

int func3(int n, char *s)
{
  int b;

  if ( strcmp(s, "hello") == 0 )
  {
    b = 100;
  }
  else if ( strcmp(s, "hello1") == 0 )
  {
    b = 200;
  }
  else if ( strcmp(s, "hello2") == 0 )
  {
    b = 4;
  }
  else if ( strcmp(s, "hello3") == 0 )
  {
    b = 300;
  }
  else if ( strcmp("hello4", s) == 0 )
  {
    b = 400;
  }
  else if ( strcmp("hello5", s) == 6 )
  {
    b = 500;
  }
  else
  {
    b = 600;
  }
  return b + n;
}

If we compile it and decompile back with Hex-Rays decompiler, we get:

int __cdecl func3(int n, char *s)
{
  signed int v2; // [email protected]

  if ( strcmp(s, "hello") )
  {
    if ( strcmp(s, "hello1") )
    {
      if ( strcmp(s, "hello2") )
      {
        if ( strcmp(s, "hello3") )
        {
          if ( strcmp("hello4", s) )
          {
            if ( strcmp("hello5", s) == 6 )
              v2 = 500;
            else
              v2 = 600;
          }
          else
          {
            v2 = 400;
          }
        }
        else
        {
          v2 = 300;
        }
      }
      else
      {
        v2 = 4;
      }
    }
    else
    {
      v2 = 200;
    }
  }
  else
  {
    v2 = 100;
  }
  return n + v2;
}

With the following AST:

No doubt that Hex-Rays did an excellent job and the decompiled result is equivalent to the original function. Nonetheless, we can write a small plugin to automatically replace if (strcmp()) with if (strcmp() == 0) and swap the then/else branches:

if ( strcmp(a, b) )
{
  if ( strcmp(a, c) )
  {
    // something if a != c
  }
  else
  {
    // something if a == c
  }
}
else
{
  // something if a == b
}

Becomes:

if ( strcmp(a, b) == 0 )
{
  // something if a == b
}
else
{
  if ( strcmp(a, c) == 0 )
  {
    // something if a == c
  }
  else
  {
    // something if a != c
  }
}

To find such pattern, we need to match if statements with the following conditions:

  • The if statement should have an else
  • The if condition should be a function call to strcmp(a, b)

After we find such a statement, we need to replace its condition expression (which is a cot_call) with another expression (cot_eq), where the first operand (x) is the original condition and the second operand (y) is the number zero. Essentially we modify the tree from:

To:

Notice how the modified tree has the if condition changed from a call to strcmp() to an expression x == y (where y is the number zero).

The code to do that should be easy to understand now, especially that we explained all the logic behind it:

struct strcmp_inverter_t : public ctree_visitor_t
{
private:
  cfunc_t *cfunc;
public:
  strcmp_inverter_t(cfunc_t *cf) : ctree_visitor_t(CV_FAST), cfunc(cf) { }

  bool is_strcmp_expr(cexpr_t *expr)
  {
    // the expression should be a function call
    if ( expr->op != cot_call )
      return false;

    // should have two arguments
    carglist_t &a = *expr->a;
    if (a.size() != 2)
      return false;

    // should contain the string str[i]cmp
    char buf[MAXSTR];
    if ( get_func_name(expr->x->obj_ea, buf, sizeof(buf)) == NULL )
      return false;

    if ( stristr(buf, "strcmp") == NULL && 
        stristr(buf, "stricmp") == NULL )
      return false;

    return true;
  }

  int idaapi visit_insn(cinsn_t *ins)
  {
    // only interested in IF statements
    if ( ins->op != cit_if )
      return 0;

    // now take the instance
    cif_t *cif = ins->cif;

    // must have an ELSE
    if ( cif->ielse == NULL )
      return 0;

    // check if it's an strcmp()
    if ( !is_strcmp_expr(&cif->expr) )
      return 0;

    // create a zero expression
    cexpr_t *y = new cexpr_t();
    y->put_number(cfunc, 0, inf.cc.size_i);

    // create a new empty expression
    cexpr_t *x = new cexpr_t();
    // now the if's expr (condition) is moved to this new condition
    x->swap(cif->expr);

    // now that the if's expr is an empty expression, 
    // let us properly populate it
    cif->expr.ea = x->ea;
    cif->expr.op = cot_eq;
    cif->expr.x = x;
    cif->expr.y = y;
    cif->expr.calc_type(false);
    // we changed the condition, so we 
    // should swap the THEN/ELSE branches too!
    qswap(cif->ithen, cif->ielse);
    return 0; // continue enumeration
  }
};

Now if we use the plugin on the previously decompiled function, we get this result:

int __cdecl func3(int n, char *s)
{
  signed int v2; // [email protected]

  if ( strcmp(s, "hello") == 0 )
  {
    v2 = 100;
  }
  else
  {
    if ( strcmp(s, "hello1") == 0 )
    {
      v2 = 200;
    }
    else
    {
      if ( strcmp(s, "hello2") == 0 )
      {
        v2 = 4;
      }
      else
      {
        if ( strcmp(s, "hello3") == 0 )
        {
          v2 = 300;
        }
        else
        {
          if ( strcmp("hello4", s) == 0 )
          {
            v2 = 400;
          }
          else
          {
            if ( strcmp("hello5", s) == 6 )
              v2 = 500;
            else
              v2 = 600;
          }
        }
      }
    }
  }
  return n + v2;
}

Looks much closer to the original.

Closing words

The source code of the plugin can be downloaded from here. To use it, simply right-click anywhere in a decompiled function and select “Enable auto if (strcmp()) inversion.

Last but not least, we would like to remind you that the plugin contest deadline is just one month away. If you have nice ideas, participate and show us your creativity!

Published by

Elias Bachaalany

Hi, I am Elias, a former Hex-Rays employee and an IDA Pro enthusiast. I love reverse engineering and more especially writing tools and articles about it. I also co-authored a couple of books on the topic, which you can see on Amazon.com (http://amzn.to/1P1G0ID).