IBM Releases CodeNet Dataset For AI Coding
Written by Kay Ewbank   
Thursday, 27 May 2021

IBM has released Project CodeNet, a dataset aimed at teaching AI to translate code from one programming language to another. The dataset consists of 14 million code samples, made up of around 500 million lines of code in 55 programming languages, ranging from C++, Java, Python, and Go to Cobol, Pascal, and Fortran.

IBM Research says Project CodeNet can be used to train machine learning models to translate code. The code samples have been taken from entries to open programming competitions, and IBM says that over 90 percent of them come with a description of what the code does, including a concise problem statement and a specification of the input and output formats.

The developers say that for over half of the coding problems they also have sample input and output taken from the problem description. They regard this as key to determining whether two code samples in different languages are equivalent, and say it can drive reinforcement learning techniques for code translation. The samples also include metadata such as code size, memory footprint, CPU run time, and a status field that indicates acceptance or the type of error.
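To illustrate why sample input and output matter, here is a minimal sketch of an equivalence check: run two programs on the same sample input and compare what they print. It is not part of Project CodeNet's tooling. The function names and the two tiny "submissions" are hypothetical, and both are written in Python only so that the example is self-contained; in practice each command would invoke a program in a different language.

```python
import subprocess
import sys


def run_program(cmd, sample_input, timeout=5.0):
    """Run a command, feed it sample_input on stdin, and return its stdout."""
    result = subprocess.run(
        cmd,
        input=sample_input,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise if the program exits with an error
    )
    return result.stdout


def normalize(output):
    """Drop trailing whitespace so cosmetic differences do not count."""
    return [line.rstrip() for line in output.strip().splitlines()]


def behave_alike(cmd_a, cmd_b, sample_input):
    """True if both commands print the same output for the sample input."""
    out_a = normalize(run_program(cmd_a, sample_input))
    out_b = normalize(run_program(cmd_b, sample_input))
    return out_a == out_b


# Hypothetical submissions to the same problem: add two integers.
SUM_V1 = "a, b = map(int, input().split()); print(a + b)"
SUM_V2 = "import sys; print(sum(int(t) for t in sys.stdin.read().split()))"
# A hypothetical wrong submission that multiplies instead.
PRODUCT = "a, b = map(int, input().split()); print(a * b)"


def as_command(source):
    """Build a command that runs a Python one-liner with this interpreter."""
    return [sys.executable, "-c", source]
```

Agreement on a sample input is evidence of equivalence rather than proof, which is why more test cases per problem make the check more useful.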

The IBM team estimates that automated rule-based systems can successfully translate between 50 and 60 percent of a program into another programming language. The remainder involves complex rules and has to be translated manually.
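As a toy illustration of what rule-based translation means, the sketch below applies a couple of pattern rules to turn simple Python lines into JavaScript and sets aside anything no rule covers. The rules, the language pair, and the function name are all assumptions made up for this example; real rule-based systems are far more elaborate.

```python
import re

# Hypothetical rewrite rules: (pattern for a Python line, JavaScript template).
RULES = [
    (re.compile(r"^print\((.*)\)$"), r"console.log(\1);"),
    (re.compile(r"^(\w+) = (.+)$"), r"let \1 = \2;"),
]


def translate(lines):
    """Translate the lines a rule matches; return the rest for manual work."""
    translated, leftover = [], []
    for line in lines:
        for pattern, template in RULES:
            if pattern.match(line):
                translated.append(pattern.sub(template, line))
                break
        else:
            # No rule matched, so this line needs a human translator.
            leftover.append(line)
    return translated, leftover


done, manual = translate(["x = 1", "print(x)", "with open(p) as f:"])
```

Here the first two lines are converted automatically, while the `with` statement falls outside the rules and is left for manual translation.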

The hope is that Project CodeNet will "drive algorithmic innovation" so that the more complex code can be handled using sequence-to-sequence models, much as machine translators for human languages now do. The aim is to make a more significant dent in machine understanding of code, as opposed to machine processing of code.

The project includes tools to convert code samples into representations that can be consumed by AI algorithms. These include a tokenizer that generates a stream of tokens, a parser that generates a Simplified Parse Tree (SPT) for each recognized program, and a code analysis tool that creates control and data flow graphs. Project CodeNet is available on GitHub.
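To give a feel for what a stream of tokens looks like, here is a short sketch that uses the tokenizer in Python's standard library. This is not the Project CodeNet tokenizer, and the `token_stream` helper is a name invented for this example; it only shows the general idea of turning source text into a sequence that a model can consume.

```python
import io
import tokenize

# Layout-only tokens that are left out of the stream in this sketch.
SKIPPED = {
    tokenize.NEWLINE,
    tokenize.NL,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.COMMENT,
    tokenize.ENDMARKER,
}


def token_stream(source):
    """Yield (token type name, token text) pairs for Python source code."""
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type not in SKIPPED:
            yield tokenize.tok_name[tok.type], tok.string


tokens = list(token_stream("total = a + 1\n"))
for kind, text in tokens:
    print(kind, text)
```

Each pair records the kind of token and its text, so `total` and `a` come out as `NAME` tokens, `=` and `+` as `OP` tokens, and `1` as a `NUMBER` token.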

More Information

Project CodeNet On GitHub

Last Updated ( Thursday, 27 May 2021 )