DataFu for Pig and Hadoop
Written by Kay Ewbank   
Tuesday, 17 January 2012

User-defined functions for performing data analysis on Hadoop using Apache Pig have been put together in an open source library called DataFu, courtesy of LinkedIn’s engineering group.

In a blog post announcing the availability of DataFu, Senior Software Engineer Matthew Hayes explains that LinkedIn makes extensive use of Apache Pig for performing data analysis on Hadoop.

Pig is a simple, high-level programming language that consists of just a few dozen operators and makes it easy to write MapReduce jobs. It should be more popular, if for no better reason than that you enter commands at the grunt> prompt.

Pig has been designed so that programs written in it have a structure that lends itself to large-scale parallel processing, so they can handle very large data sets.

 


 

While the language is simple, you can write your own user-defined functions (UDFs) to bring custom code written in Java, Python, or JavaScript into your Pig scripts.

According to the blog, as the team at LinkedIn worked on data-intensive products such as “People You May Know” and “Skills”, the programmers developed a large number of UDFs. These have been consolidated into a single, general-purpose library called DataFu, which LinkedIn has released as open source.

DataFu includes UDFs for common statistics tasks, PageRank, set operations, and bag operations, along with a suite of tests. A Pig bag is a collection of tuples, each of which is an ordered set of fields. Pig differs from conventional relational databases in that it has relations rather than tables: a relation is a bag of tuples, and the tuples correspond to the rows of a table. However, Pig relations don't require that every tuple contain the same number of fields or that fields in the same position have the same type. The UDFs in the library let you perform operations on bags such as appending a tuple, prepending a tuple, concatenating bags, and generating unordered pairs.
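To make the bag operations listed above more concrete, here is a minimal Python sketch that models a bag as a plain list of tuples. It is only an illustration of the semantics: the function names are hypothetical, and it is not DataFu's API or Pig Latin code.

```python
from itertools import combinations

# A bag modeled as a list of tuples. As with Pig relations, the tuples
# need not have the same number of fields or the same field types.
bag = [("alice", 34), ("bob",), ("carol", 29, "NYC")]


def append_to_bag(bag, tup):
    """Return a new bag with tup added at the end (hypothetical helper)."""
    return bag + [tup]


def prepend_to_bag(bag, tup):
    """Return a new bag with tup added at the front (hypothetical helper)."""
    return [tup] + bag


def concat_bags(*bags):
    """Return a new bag holding the tuples of all input bags, in order."""
    result = []
    for b in bags:
        result.extend(b)
    return result


def unordered_pairs(bag):
    """Return every pair of tuples from the bag, ignoring order within a pair."""
    return list(combinations(bag, 2))
```

Each helper returns a new list rather than modifying its input, which loosely mirrors the way a Pig statement produces a new relation instead of changing an existing one.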

Other UDFs give you the means to run PageRank on independent graphs, to perform set operations such as intersect and union, and to compute the haversine distance between two points on the globe.
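For readers unfamiliar with the haversine distance, the following Python sketch shows the standard formula for the great-circle distance between two latitude/longitude points. It is an illustration of the calculation only, not DataFu's implementation; the function name and the choice of a 6371 km mean Earth radius are assumptions made for this example.

```python
import math

# Assumed mean Earth radius in kilometres; other values are also in common use.
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)

    # Haversine of the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)

    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    a = min(1.0, a)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

Calling `haversine_km(0.0, 0.0, 0.0, 90.0)`, for instance, gives a quarter of the circumference of a sphere with the assumed radius.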

You can download the library from https://github.com/linkedin/datafu, and the blog post comes with examples of how to use some of the functions to get you started.

 


Related News

Hadoop CTP for Azure

Hadoop gets to 1.0

Pig and Hadoop support in Amazon Elastic MapReduce

 


Last Updated ( Tuesday, 17 January 2012 )
 
 

   