Apache MADlib Adds HITS Implementation
Written by Kay Ewbank   
Wednesday, 10 January 2018

There's a new version of Apache MADlib with new features including an implementation of HITS. MADlib makes it possible to carry out big data machine learning from SQL.

MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data. It currently supports PostgreSQL, Greenplum Database, and Apache HAWQ. It started as a collaboration between a team at UC Berkeley and developers at Pivotal, which was previously known as EMC Greenplum. The project entered the Apache Incubator in 2015.

MADlib uses the full compute power of an MPP (Massively Parallel Processing) architecture to process very large data sets, whereas other products are limited by the amount of data that can be loaded into memory on a single node. It runs as a fully parallelized implementation on GPDB (Greenplum Database) and HAWQ, so on large data sets it can offer much better performance than single-node R or Python libraries. It scales by adding more nodes, so performance can keep pace as your data grows. Greenplum Database is an advanced, fully featured, open source data platform designed for analyzing petabyte-scale data volumes. HAWQ is a Hadoop-native SQL advanced analytics MPP database for enterprises, and is currently an Apache Incubator project.

When MADlib was made a top-level project in August 2017, Joe Hellerstein, Professor of Computer Science at UC Berkeley, Co-Founder and Chief Strategy Officer at Trifacta, and one of the original authors of MADlib, said:

"MADlib was conceived from the outset as an open-source meeting ground for software developers, computing researchers and data scientists to collaborate on scalable, in-database machine learning and statistics."

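The new release, MADlib 1.13, adds a HITS (Hyperlink-Induced Topic Search) link analysis algorithm. HITS rates web pages by analyzing the links between them, giving each page an authority score (how many good hubs link to it) and a hub score (how many good authorities it links to).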
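To give a feel for how HITS works, here is a minimal sketch of the textbook iteration in plain Python. It is not MADlib's SQL interface or its implementation; the function names `hits` and `normalize`, the fixed iteration count and the choice of L2 normalization are all assumptions made for illustration.

```python
from math import sqrt


def normalize(scores):
    """Scale scores to unit L2 norm (an assumed normalization choice)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(edges, iterations=50):
    """Return (authority, hub) score dicts for a directed graph.

    edges is a list of (source, destination) pairs. Hypothetical helper,
    not part of MADlib.
    """
    nodes = sorted({n for edge in edges for n in edge})
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: sum the fresh authority scores of the pages linked to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub


if __name__ == "__main__":
    # Tiny made-up graph: "a" and "b" link to "c", and "a" also links to "d".
    authority, hub = hits([("a", "c"), ("b", "c"), ("a", "d")])
    print("best authority:", max(authority, key=authority.get))
    print("best hub:", max(hub, key=hub.get))
```

In this toy graph the page with the most incoming links from active hubs ends up as the top authority, and the page that links out to the most authorities ends up as the top hub.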

Another improvement in the new release is better handling of k-nearest neighbors (k-NN) classification. k-NN in MADlib now supports more distance metrics and can list the neighbors it found in the output table.
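The two k-NN changes are easy to picture with a small sketch in plain Python. This is only an illustration, not MADlib's SQL API: the function `knn_classify`, the two metrics shown (the article does not say which metrics were added) and the sample data are all hypothetical.

```python
from collections import Counter
from math import sqrt


def euclidean(a, b):
    """Straight-line (L2) distance between two points."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City-block (L1) distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_classify(train, point, k, metric=euclidean):
    """Classify point by majority vote among its k nearest training rows.

    train is a list of (row_id, features, label) tuples. Returns the
    predicted label plus the ids of the neighbors used, nearest first.
    """
    ranked = sorted(train, key=lambda row: metric(row[1], point))
    nearest = ranked[:k]
    votes = Counter(label for _, _, label in nearest)
    predicted = votes.most_common(1)[0][0]
    return predicted, [row_id for row_id, _, _ in nearest]


if __name__ == "__main__":
    # Made-up training data: two loose clusters labeled "a" and "b".
    train = [
        (1, (1.0, 1.0), "a"),
        (2, (1.5, 2.0), "a"),
        (3, (5.0, 5.0), "b"),
        (4, (6.0, 5.5), "b"),
        (5, (2.0, 1.0), "a"),
    ]
    for metric in (euclidean, manhattan):
        label, neighbors = knn_classify(train, (1.2, 1.1), k=3, metric=metric)
        print(metric.__name__, label, neighbors)
```

Passing the metric in as a parameter mirrors the idea of a configurable distance function, and returning the neighbor ids alongside the label mirrors the new option to show neighbors in the output.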

Grouping support has been added to MLP (MultiLayer Perceptron). The quality of results for correlation analysis has also been improved: a NULL value is now ignored on its own, rather than causing the whole row that contains it to be discarded.
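The correlation change can be read as a move from dropping whole rows to dropping only the affected pairs of values. The sketch below contrasts the two approaches in plain Python, with `None` standing in for SQL NULL. That reading, the function names and the sample rows are assumptions for illustration, not MADlib's code.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def corr_listwise(rows, i, j):
    """Drop every row that has a NULL in any column, then correlate i and j."""
    complete = [r for r in rows if None not in r]
    return pearson([r[i] for r in complete], [r[j] for r in complete]), len(complete)


def corr_pairwise(rows, i, j):
    """Drop a row only if column i or column j is NULL in that row."""
    usable = [r for r in rows if r[i] is not None and r[j] is not None]
    return pearson([r[i] for r in usable], [r[j] for r in usable]), len(usable)


if __name__ == "__main__":
    # Made-up data: the third column has NULLs, the first two do not.
    rows = [
        (1.0, 2.0, 5.0),
        (2.0, 4.0, None),
        (3.0, 6.0, 1.0),
        (4.0, 8.0, None),
        (5.0, 9.0, 2.0),
    ]
    print("listwise (r, rows used):", corr_listwise(rows, 0, 1))
    print("pairwise (r, rows used):", corr_pairwise(rows, 0, 1))
```

Here the correlation between the first two columns uses all five rows under the pairwise approach, but only three under the listwise one, because NULLs in an unrelated third column no longer throw away usable data.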


More Information

MADlib site

Related Articles

Apache PredictionIO Reaches Top Level Status

Azure Machine Learning Enhancements

Amazon's Giant Push Into Machine Learning

Spark Gets NLP Library

 

 

 


Last Updated ( Wednesday, 10 January 2018 )