Identifying Programmers From Executable Binaries
Written by Mike James   
Wednesday, 06 January 2016

It's no surprise that programmers have different styles. What comes as a shock is that these are still evident when the code is compiled and you produce an executable binary. 

When you look at someone's code you can often see that they have a particular style: how they name variables, the comments they use, the indenting scheme and other details. A lot of this personal preference is removed when the code is compiled, but there still seems to be enough left to identify the programmer. This has all sorts of implications. 

The study took the code of 600 programmers from the annual programming competition, Google Code Jam. The skill of each programmer was measured by how far they progressed in the competition. The code samples, all written in C++, were attempts to solve the same programming task, so the main differences between them can plausibly be attributed to coding style, among other things. 

Given the binary code, the problem of identifying the programmer was treated as a machine learning problem, which of course it is. The first task was to extract features, and this was done by disassembling the code and then decompiling it back to C++. The exact details of the reverse engineering involved are interesting and are given in the paper. As well as the assembly and the reconstructed C++ code, an abstract syntax tree and a control flow graph were used to provide features. Rather than a neural network, a random forest classifier was used to learn each programmer's characteristics from the hand-constructed features.
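To give a feel for what "extracting features from the disassembly" might look like, here is a minimal Python sketch. It is an illustration only: the toy listing, the function names and the choice of instruction unigram and bigram counts are assumptions made for this example, not the feature set described in the paper. Each executable ends up as a list of numbers, which is the kind of input a random forest classifier needs.

```python
from collections import Counter

# Hypothetical disassembly listing (an assumption for this sketch);
# real input would come from running a disassembler on the binary.
LISTING = """\
push rbp
mov rbp, rsp
mov eax, 0
cmp eax, 10
jl loop_body
pop rbp
ret
"""

def mnemonics(listing):
    """Return the instruction mnemonic (first token) of each non-empty line."""
    return [line.split()[0] for line in listing.splitlines() if line.strip()]

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def feature_vector(listing, vocabulary):
    """Map a listing to counts over a fixed vocabulary of unigrams and bigrams."""
    tokens = mnemonics(listing)
    counts = ngram_counts(tokens, 1) + ngram_counts(tokens, 2)
    # A Counter returns 0 for n-grams that never occur in this listing.
    return [counts[key] for key in vocabulary]

# Hypothetical vocabulary; in practice it would be built from the training set.
VOCAB = [("mov",), ("push",), ("mov", "mov"), ("cmp", "jl")]

print(feature_vector(LISTING, VOCAB))
```

A fixed vocabulary keeps every vector the same length, so vectors from different executables can be compared column by column.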

 


The results are impressive. With 20 programmers, the classifier picked the correct author 96% of the time. It was trained on 8 executables for each programmer, which represents a lot of examples for this sort of study.

When the approach was tried on a larger data set, 600 programmers, the accuracy fell to 52%. It was also demonstrated that for unoptimized compilation of 100 programmers the accuracy was 78%, but when an optimizing compiler was used the accuracy fell to 64%. You might expect an optimizing compiler to remove even more of the personal traits of the programmer from the binary. 
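It helps to put these percentages next to what random guessing would achieve. The short Python sketch below does the arithmetic, assuming that a random guesser picks each programmer with equal probability and that every programmer contributes the same number of samples. Both are assumptions of this example, and the function names are made up for illustration.

```python
def chance_accuracy(num_programmers):
    """Accuracy of guessing uniformly at random among the candidates."""
    return 1.0 / num_programmers

def lift_over_chance(accuracy, num_programmers):
    """How many times better than random guessing a given accuracy is."""
    return accuracy / chance_accuracy(num_programmers)

# Accuracy figures as quoted in the article: (label, accuracy, programmers).
RESULTS = [
    ("20 programmers", 0.96, 20),
    ("600 programmers", 0.52, 600),
    ("100 programmers, unoptimized", 0.78, 100),
    ("100 programmers, optimized", 0.64, 100),
]

for label, accuracy, n in RESULTS:
    print(
        f"{label}: accuracy {accuracy:.0%}, "
        f"chance {chance_accuracy(n):.2%}, "
        f"about {lift_over_chance(accuracy, n):.0f}x chance"
    )
```

Under these assumptions, 52% on 600 programmers is roughly 300 times better than chance, while 96% on 20 programmers is roughly 19 times better. So although the raw accuracy falls as the pool grows, the larger experiment is arguably the more striking result.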

Some other interesting results are quoted in the paper. The most generally interesting is that more advanced programmers are easier to recognize compared to beginners. This suggests that beginners tend to code in the same way while more expert programmers are more individual and have distinct coding styles.


 

 

So does all this matter?

Apart from the interesting results that individual style develops with experience and survives many transformations to machine code, there is also the question of forensics. If you plan to write any malware, then make sure that you don't leave any compiled code around where it could be used to identify you. Equally, identification of the programmer might be helpful in disputes about who did what in a successful company. Could code style be used to address questions like "are there any blocks of code left in Facebook that Mark Zuckerberg wrote?" or to prove who Satoshi Nakamoto, the anonymous inventor of Bitcoin, was? Probably not in practice. 

Then there is the question of building tools that take raw code and scramble it in such a way that its author can't be identified. You could even try to build a tool that made code written by A look like code in the style of B. 

There are also some interesting questions about the methodology used. For example, what would the classification error be if the raw machine code was fed to a neural network? After all, it should be capable of noticing the regularities that the reverse engineering used to create features. There is also the question of how well this generalizes to other languages. C++ is well known, and indeed widely criticised, for being so flexible that you can code in almost any style, from low-level C to sophisticated object-oriented code. Perhaps this is as much about C++ as the programmers. 

Clearly more work would be interesting. 

 


More Information

When coding style survives compilation: De-anonymizing programmers from executable binaries

When Coding Style Survives Compilation: De-anonymizing Programmers from Executable Binaries (pdf) Aylin Caliskan-Islam, Fabian Yamaguchi, Edwin Dauber, Richard Harang, Konrad Rieck, Rachel Greenstadt, and Arvind Narayanan

Related Articles

Teenage Programmer Equals Cyber Criminal

The Smartwatch Spy

Cat Photos - A Potential Security Risk?

Frankenstein - Stitching Code Bodies Together To Hide Malware

ROP Mitigations Bypassed

 


Last Updated ( Wednesday, 06 January 2016 )