Microsoft Open Sources Big Data REEF
Microsoft Open Sources Big Data REEF
Written by Kay Ewbank   
Friday, 16 August 2013

Microsoft is opening up its big data framework REEF as open source. REEF uses Hadoop’s new YARN resource manager, and can be used to build jobs where you need to maintain state even after the job is ended.

REEF stands for Retainable Evaluator Execution Framework. Its importance in Microsoft’s ‘Digital Shoebox’ was described by Microsoft Technical Fellow and CTO of Information Services Raghu Ramakrishnan at the ACM Knowledge Discovery and Data Mining conference in Chicago.

digshoebox

Ramakrishnan described the idea of the digital shoebox as online storage in which you should be able to capture any data, react to it instantaneously, and store it for later. You should be able to use any analysis tool to work on the data, anywhere, in any combination, and interactively.

Data could come from SQL, Hive, or MR (MapReduce), from stream processing, business intelligence or machine learning. YARN is the resource manager element of the Apache Hadoop project that gives you a way to run and manage multiple jobs of these different types on the same cluster of physical machines. This can reduce the number of machines needed, and also means you can run different analyses of the same data in the one location.

According to Ramakrishnan, YARN isn’t ideal for all job types; some, such as machine learning, are problematic because they have specific requirements in areas such as data movement, task monitoring, and the ability to retain result sets for further analysis and modification. REEF, which is a set of libraries that runs on top of YARN will, however, solve some of these problems.

REEF has two main parts: Evaluators, which are YARN containers containing REEF services, and Activities, which are the user code that runs inside the Evaluator. Ramakrishnan demonstrated a sample workflow in which YARN would spin up an Evaluator and the Activity code would run inside it and complete. The same Evaluator could then be spun up again maintaining its original state so other Activities could be run against its data. It includes a library of interoperable data management primitives optimized for communication and data movement (which are distinct from HDFS’ notion of storage locality). The library also allows REEF applications to access external services, such as user-facing relational databases.

Because Microsoft Research wanted to decouple the lower levels of REEF from the data models and the semantics of systems built on top of it, they developed two new standalone systems. These are Tang, a configuration manager and dependency injector, and Wake, an event-driven programming and data movement framework. Both are language independent, allowing REEF to bridge the JVM and .NET ecosystems. Ramakrishnan said Microsoft Research has built a MapReduce library on top of REEF that runs Hive and Pig, and is an excellent starting point for M/R optimizations such as Caching, Shuffle, Map-Reduce-Reduce, and Sessions. The team has also put together Machine Learning algorithms such as Decision Trees and Linear Models, and plans to add SVD (singular value decomposition ) support ‘soon’.

REEF is expected to be open sourced next month.

 

Banner


Babylon.js 3 Adds Support For WebGL 2
13/07/2017

There's an updated version of Babylon.js, Microsoft's open source JavaScript framework for building 3D games. The new release adds support for WebGL 2.



Windows 10 and Edge - Status Update
05/07/2017

It is almost a year since the free upgrade to Windows 10 came to an end. What affect has this had on the uptake of the operating system, and on its default browser, Edge?


More News

Last Updated ( Friday, 16 August 2013 )
 
 

   
Banner
RSS feed of news items only
I Programmer News
Copyright © 2017 i-programmer.info. All Rights Reserved.
Joomla! is Free Software released under the GNU/GPL License.