Deobfuscated JavaScript Through Machine Learning
Written by Ian Elliot   
Wednesday, 04 June 2014

Minification and obfuscation are two useful techniques for making code smaller and providing some protection. Now a machine learning technique promises to undo both and you can try it out.

Minification is great unless you need to read the code that is being used in a web page. You can download and look at the code that is running in the browser but it won't be in a good human readable form. The whole point of minification is that it removes white space and reduces variable names down to single letters - it is just as machine readable, but to a human it is a mess. 
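As a small illustration, the two functions below do the same job. The first is a readable original and the second is roughly what a minifier might emit for it. The function, the `basket` data and the names `items`, `taxRate`, `price` and `quantity` are made up for this sketch and are not taken from any real project or from the output of a specific tool.

```javascript
// Readable original (hypothetical example code).
function totalCost(items, taxRate) {
  let total = 0;
  for (const item of items) {
    total += item.price * item.quantity;
  }
  return total * (1 + taxRate);
}

// Roughly what a minifier might produce: same behavior, but the white space
// is gone and the function and local names are cut down to single letters.
// Property names such as price and quantity are left alone here because
// they belong to the shape of the data being passed in.
function a(b,c){let d=0;for(const e of b)d+=e.price*e.quantity;return d*(1+c)}

const basket = [
  { price: 10, quantity: 2 },
  { price: 5, quantity: 1 },
];

// Both versions compute the same value for the same input.
console.log(totalCost(basket, 0.5) === a(basket, 0.5));
```

The machine has no trouble with the second version, but a human reader has lost every clue about what `a`, `b`, `c`, `d` and `e` stand for.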

 


Being able to undo minification would be really useful, but obfuscation is a different matter. The big problem with interpreted or JIT-compiled code is that you have to provide the source code, or something very close to it, to the outside world. In the case of JavaScript there is no way to keep the source code to yourself and obfuscation is your only hope of hiding it. Obfuscation takes minification to the next stage. It not only removes all of the things that make code human readable, it can even modify the flow of control to create spaghetti code that is very difficult to follow. In this case the ability to undo obfuscation is something that you, as the author, would not welcome but just about everyone else would!

The first step in deobfuscation or deminifying is easy - you simply restore some formatting - line breaks, white space and indents. The big problem is restoring the variable names. You may have called the variable totalCost, but after transformation it ends up as A1 or something similar. 
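To give an idea of how mechanical the formatting step is, here is a deliberately naive sketch. The function name `naiveFormat` is hypothetical, and the approach (breaking lines at braces and semicolons without parsing the code) is an assumption made for brevity. It is not how JSNice or any real beautifier works, and it would be confused by braces or semicolons inside strings, regular expressions, comments or `for(;;)` headers.

```javascript
// Naive re-formatter: adds line breaks and indentation around { } and ;
// It does not parse JavaScript, it only looks at single characters.
function naiveFormat(source, indentUnit = "  ") {
  let depth = 0;
  let out = "";
  // A line break followed by the indentation for the current nesting depth.
  const newline = () => "\n" + indentUnit.repeat(depth);

  for (const ch of source) {
    if (ch === "{") {
      depth += 1;
      out += " {" + newline();
    } else if (ch === "}") {
      depth = Math.max(0, depth - 1);
      // Drop the indentation already emitted, then close at the outer level.
      out = out.trimEnd() + newline() + "}" + newline();
    } else if (ch === ";") {
      out += ";" + newline();
    } else {
      out += ch;
    }
  }
  return out.trim();
}

const minified = "function a(b,c){var d=b*c;return d;}";
console.log(naiveFormat(minified));
```

The layout comes back, but `a`, `b`, `c` and `d` are exactly as unhelpful as before, which is why the names are the hard part.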

How can you possibly restore the meaningful name?

It is clear that humans can do it. We read through the program and see that the variable is being used in particular ways and where it gets its value from and eventually we can guess that it should be called something like totalCost. Could this be done by a machine?

JSNice is a statistical de-obfuscation and de-minification engine for JavaScript created by the Software Reliability Lab at ETH Zurich. You give JSNice a program and it will give you a human readable version. It works by inferring the type of a variable and then using machine learning based on 10,000 JavaScript projects from GitHub to work out what the variable is being used for and hence a possible name. It works out the probability of a name for each variable and applies the most probable. You can also ask to see a drop-down list of other probable names which you can select from if it gets it wrong.  
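The idea of "work out the probability of a name and apply the most probable" can be sketched with a toy model. Everything below is an assumption made for illustration: the feature labels, the counts in `observations` and the function `rankNames` are invented, and the scoring is a simple smoothed frequency product in the style of naive Bayes. It is not JSNice's actual algorithm or training data.

```javascript
// Toy "training data": how often a name was seen together with a usage
// feature. All labels and counts are invented for illustration.
const observations = {
  "accumulated-with-+=": { total: 40, sum: 25, count: 20, i: 5 },
  "multiplied-by-price": { quantity: 30, count: 15, total: 5 },
  "returned-from-function": { result: 35, total: 20, sum: 10 },
};

// Rank candidate names for a variable, given the features seen in its usage.
// Score = product of smoothed relative frequencies, normalized to sum to 1.
function rankNames(features, smoothing = 1) {
  // Collect every name that was ever seen with any of the features.
  const candidates = new Set();
  for (const f of features) {
    for (const name of Object.keys(observations[f] || {})) candidates.add(name);
  }

  const scores = [];
  for (const name of candidates) {
    let score = 1;
    for (const f of features) {
      const counts = observations[f] || {};
      const seen = Object.values(counts).reduce((a, b) => a + b, 0);
      score *= ((counts[name] || 0) + smoothing) /
               (seen + smoothing * candidates.size);
    }
    scores.push({ name, score });
  }

  const norm = scores.reduce((a, s) => a + s.score, 0);
  return scores
    .map((s) => ({ name: s.name, probability: s.score / norm }))
    .sort((x, y) => y.probability - x.probability);
}

const ranked = rankNames(["accumulated-with-+=", "returned-from-function"]);
console.log(ranked[0].name);  // the most probable name would be applied
console.log(ranked.slice(1)); // the rest could fill a list of alternatives
```

With these made-up counts, a variable that is both accumulated with `+=` and returned from the function comes out as `total`, with the other candidates available as alternatives.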

 


The authors claim a 60% success rate for suggested identifiers, which goes a long way toward helping you work out what the code is doing and choose good names for the remaining 40%.

It is also suggested that you could use it to improve your own code by getting it to perform the analysis on the human readable form and see what variable names it suggests. Personally I'd be a little upset to find that a statistical method could name my variables better than I could, but I'd try to swallow my pride. 

Inferring variable names from the way that they are used seems like an interesting technique and one that might have additional uses. At the moment there are few details of how the machine learning works, for example which features it uses. There is also the possibility that much more could be learned from the corpus of JavaScript programs on GitHub - functions, standard blocks of code, algorithms and so on.

Does this make obfuscation useless?

The simple answer is no, because there is a difference between handing the world your work to look at and making them work for it. There is also the simple point that obfuscation is an arms race. Once you know what JSNice can undo, it should be easy enough to hide the patterns of code it uses to assign variable names. You could even change the code's signature so that totalCost ends up renamed to something completely misleading.


Last Updated ( Wednesday, 04 June 2014 )