GraphLab has announced software that it says lets data science teams gain insights into big data up to 10,000 times faster than rival products. The software is available as a free download.
The software, GraphLab Create, simplifies big data analysis by combining all phases of the prototype-to-production process, allowing a single data scientist to do the job of many, according to the creators. The company says there is currently a shortage of data scientists, who have to derive value from a company's data by integrating a range of highly complicated, disparate tools and datasets. By using machine learning, GraphLab Create simplifies this task.
The software started life as a research project on graph analysis at Carnegie Mellon University. This was extended to add the ability to process tables and text, and the GraphLab company was created to improve on the open source project (PowerGraph) and create commercial software.
GraphLab Create 1.0 was officially unveiled at GraphLab’s conference in San Francisco, where the developers said the software is between 100 and 10,000 times faster at analytics and model training than other products. GraphLab Create has been benchmarked against MLlib (part of the Apache Spark project), scikit-learn and Mahout.
The keynote presentation at the conference showed GraphLab Create v1.0 being used to analyze one terabyte of data or more, at interactive speeds, on desktop systems. Its use on distributed systems, running on a Hadoop YARN or EC2 cluster, was also demonstrated.
GraphLab Create lets you switch between analysis of data as graph or table, and can be incorporated into data products that make use of the software’s machine learning, text analytics and graph analytics capabilities. GraphLab Create 1.0 includes GraphLab Canvas, the company’s new visualization platform for big data.
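To illustrate what moving between a table view and a graph view of the same data can mean, here is a minimal sketch in plain Python. It does not use GraphLab Create's API. The sample data, the column names (`src`, `dst`, `weight`) and the functions `table_to_graph` and `graph_to_table` are all hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Hypothetical example data: each row of the table describes one edge.
EDGE_TABLE = [
    {"src": "alice", "dst": "bob", "weight": 1.0},
    {"src": "alice", "dst": "carol", "weight": 2.0},
    {"src": "bob", "dst": "carol", "weight": 0.5},
]


def table_to_graph(rows):
    """Build an adjacency mapping {src: {dst: weight}} from edge rows."""
    graph = defaultdict(dict)
    for row in rows:
        # A repeated (src, dst) pair overwrites the earlier weight.
        graph[row["src"]][row["dst"]] = row["weight"]
    return dict(graph)


def graph_to_table(graph):
    """Flatten an adjacency mapping back into a list of edge rows."""
    return [
        {"src": src, "dst": dst, "weight": weight}
        for src, neighbours in graph.items()
        for dst, weight in neighbours.items()
    ]


if __name__ == "__main__":
    graph = table_to_graph(EDGE_TABLE)
    print(graph["alice"])          # neighbours of one vertex (graph view)
    print(graph_to_table(graph))   # the same data as rows (table view)
```

The table view suits row-wise operations such as filtering and aggregation, while the graph view suits neighbourhood queries. A real system would presumably avoid copying the data on each switch, which this sketch does not attempt.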
The software is designed to work with the same code on different platforms, so you can prototype on a single machine then move the completed project to production on distributed systems. It has been certified as interoperable on Cloudera Hadoop distributions.
The package can be used via a Python API, which gives you access to two scalable data structures called SFrame and SGraph for analysis of tabular and graph data sets. The product details say:
“the machine learning engine provides access to the latest ML algorithms which are foundational inputs to many data products like recommenders, fraud detection systems, text and sentiment analyzers. Data inputs can be taken in any form and from any location, whether local to the platform or in common stores like Amazon’s cloud, relational and graph databases or Hadoop distributions. Connectors for additional data types and stores can be easily added.”
The scalable frame is what allows GraphLab Create to be used on very large data sets. The data is treated as a series of frames, each a scalable data structure. The software holds a single frame in memory at a time and, if you’re working on a desktop or laptop machine, iterates over the data on the hard disk frame by frame.
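The frame-by-frame idea can be sketched with nothing but the Python standard library. This is not GraphLab Create's implementation or API. It is a simplified illustration that assumes the data sits in a CSV file on disk, and the names `iter_frames`, `column_mean` and `frame_rows` are hypothetical.

```python
import csv
from itertools import islice


def iter_frames(path, frame_rows):
    """Yield lists of at most `frame_rows` rows read from a CSV file on disk."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        while True:
            # Only this slice of the file is held in memory at any one time.
            frame = list(islice(reader, frame_rows))
            if not frame:
                return
            yield frame


def column_mean(path, column, frame_rows=1000):
    """Compute the mean of a numeric column, one frame at a time."""
    total = 0.0
    count = 0
    for frame in iter_frames(path, frame_rows):
        total += sum(float(row[column]) for row in frame)
        count += len(frame)
    return total / count if count else float("nan")
```

Because each frame is discarded before the next is read, peak memory use depends on `frame_rows` rather than on the size of the file, which is what makes data sets larger than memory workable on a single machine.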