Deobfuscated JavaScript Through Machine Learning
Written by Ian Elliot
Wednesday, 04 June 2014
Minification and obfuscation are two useful techniques for making code smaller and providing some protection. Now a machine learning technique promises to undo both, and you can try it out.
Minification is great unless you need to read the code that is being used in a web page. You can download and look at the code that is running in the browser, but it won't be in a good human-readable form. The whole point of minification is that it removes white space and reduces variable names down to single letters - it is just as machine readable, but to a human it is a mess.
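To make the effect concrete, here is a small illustrative sketch. The function, its names and the "minified" form are all invented for this example and are not the output of any particular minifier. It shows the same routine before and after minification; both versions behave identically and only the human-readable cues are gone.

```javascript
// Readable original: the names and layout tell you what the code is for.
function computeTotalCost(prices, taxRate) {
  let subtotal = 0;
  for (const price of prices) {
    subtotal += price;
  }
  return subtotal * (1 + taxRate);
}

// Hand-written approximation of what a minifier might emit (hypothetical):
// white space stripped and every identifier reduced to a single letter.
function a(b,c){let d=0;for(const e of b)d+=e;return d*(1+c)}

// The machine sees no difference between the two versions.
console.log(computeTotalCost([10, 20, 30], 0.5) === a([10, 20, 30], 0.5)); // prints true
```

Nothing in the second version hints that d is a subtotal or that c is a tax rate, which is exactly the information a deminifier has to guess.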
Being able to undo minification would be really useful, but obfuscation is a different matter. The big problem with interpreted or JIT-compiled code is that you have to provide the source code, or something very close to it, to the outside world. In the case of JavaScript there is no way to keep the source code to yourself and obfuscation is your only hope of hiding it. Obfuscation takes minification to the next stage. It not only removes all of the things that make code human readable, it will even modify the flow of control to create spaghetti code that is very difficult to follow. In this case the ability to undo obfuscation is something that you, as the author, would not welcome but just about everyone else would!
The first step in deobfuscation or deminification is easy - you simply restore some formatting: line breaks, white space and indents. The big problem is restoring the variable names. You may have called the variable totalCost, but after transformation it ends up as A1 or something similar. How can you possibly restore the meaningful name? It is clear that humans can do it. We read through the program, see how the variable is being used and where it gets its value from, and eventually we can guess that it should be called something like totalCost. Could this be done by a machine?
JSNice is a statistical de-obfuscation and de-minification engine for JavaScript created by the Software Reliability Lab at ETH Zurich. You give JSNice a program and it will give you a human-readable version. It works by inferring the type of a variable and then using machine learning based on 10,000 JavaScript projects from GitHub to work out what the variable is being used for and hence a possible name. It works out the probability of a name for each variable and applies the most probable. You can also ask to see a drop-down list of other probable names which you can select from if it gets it wrong.
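The idea of scoring candidate names and applying the most probable one can be sketched in a few lines. This is only a toy illustration and not JSNice's actual algorithm: the usage features, the counts and the rankNames function are all made up for this example, and the type-inference step is ignored entirely. It simply sums how often each name accompanied each observed usage pattern in an imaginary corpus and normalises the result.

```javascript
// Toy "corpus" statistics (invented for illustration): how often a variable
// seen with a given usage feature carried a given name.
const nameCountsByFeature = {
  "accumulates:+=": { total: 20, sum: 15, totalCost: 10 },
  "multiplied-by:price": { totalCost: 45, total: 5, quantity: 10 },
  "returned-from:function": { result: 50, total: 20, totalCost: 5 },
};

// Rank candidate names for a variable given the usage features observed for it.
// Scores are summed counts normalised to add up to 1, a crude stand-in for a
// real probabilistic model.
function rankNames(features) {
  const scores = new Map();
  for (const feature of features) {
    const counts = nameCountsByFeature[feature] ?? {};
    for (const [name, count] of Object.entries(counts)) {
      scores.set(name, (scores.get(name) ?? 0) + count);
    }
  }
  const totalScore = [...scores.values()].reduce((sum, n) => sum + n, 0);
  return [...scores.entries()]
    .map(([name, score]) => ({ name, probability: score / totalScore }))
    .sort((x, y) => y.probability - x.probability);
}

// A variable called A1 that is accumulated with += and multiplied by a price.
const candidates = rankNames(["accumulates:+=", "multiplied-by:price"]);
console.log(candidates[0].name); // the most probable name, applied by default
console.log(candidates.slice(1).map((c) => c.name)); // alternatives for a drop-down list
```

With these invented counts the top suggestion is totalCost, and the remaining candidates play the role of the drop-down list of alternatives.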
The authors claim a 60% success rate for suggested identifiers, which goes a long way to help you work out what the code is doing and to work out good names for the remaining 40%. It is also suggested that you could use it to improve your own code by getting it to perform the analysis on the human-readable form and see what variable names it suggests. Personally I'd be a little upset to find that a statistical method could name my variables better than I could, but I'd try to swallow my pride.
Inferring variable names from the way that they are used seems like an interesting technique and one that might have additional uses. At the moment there are few details of how the machine learning works - which features it uses, for example. There is also the possibility that much more could be learned from the corpus of JavaScript programs on GitHub - functions, standard blocks of code, algorithms and so on.
Does this make obfuscation useless? The simple answer is no, because there is a difference between handing the world your work to look at and making them work for it. There is also the simple point that obfuscation is an arms race. Once you know what JSNice can undo, it should be easy enough to hide the patterns of code it is using to assign the names of variables. You could even alter the usage signature so that JSNice renamed totalCost to something completely misleading.
More Information
JSNice Statistical Renaming, Type Inference and Deobfuscation
Related Articles
Frankenstein - Stitching Code Bodies Together To Hide Malware
Security by obscurity - a new theory
Functional JavaScript With Ramda
JavaScript 1K 2014 Competition Underway