Monday 10 November 2014

Introduction to Data Science, Part 2

Parag Ray
04-Sep-2014

Introduction

Welcome, readers!
This blog is an introduction to MapReduce. Please note the target audience and related readings sections below; they will help you follow the blog better.


Agenda
  • Target audience.
  • Related readings/other blogs.
  • MapReduce motivation.
  • Features.
  • What Hadoop storage looks like with MapReduce.
  • Basic algorithms.
  • Intuition of the process.
  • Physical architecture.
Target Audience
  • This is an introductory discussion on data science, big data technologies, and Hadoop.
  • Best suited for readers who are looking for an introduction to this technology.
  • No prior knowledge is required, except for a basic understanding of networks and computing, and a high-level understanding of enterprise application environments.
Related readings/other blogs  
This is the second part of this series of articles, and related readings are provided in the links. It will be helpful to go through part 1 first.
You may also want to look at the Cloudera home page & Hadoop home page for further details.

MapReduce intuition
  • When there is a huge amount of data, it can be a big challenge for a single computational infrastructure to process it.
  • MapReduce allows the entire computational task to be divided into smaller tasks (Map) whose outputs are then combined (Reduce) for the final result; a toy sketch of this idea follows the comparison below.
  • Maps and reducers can run on separate machines, allowing horizontal scalability.
  • The number of maps and reducers can be very high, allowing further scalability.
  • MapReduce is integrated with and optimized for Hadoop, which provides distributed data storage.
  • Data is stored in chunks in Hadoop; instead of transporting data from one node to another and processing it there, it is more efficient to do the processing locally and transmit only the result.
[Figure: approximate comparison of the traditional computation strategy (one vertically scaled database node and one processor) with the Hadoop MapReduce strategy; columns approximate the time scale.]
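To make the divide-then-combine idea concrete, here is a minimal single-machine sketch in plain Java (not the Hadoop API; the class name and data are purely illustrative). The flatMap step plays the role of the map tasks, and the grouping/counting step plays the role of the reduce tasks:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceIntuition {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("big data", "map reduce", "big compute");

        // "Map" phase: split each line into individual words.
        // "Reduce" phase: group equal words and sum their occurrences.
        Map<String, Long> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        System.out.println(counts); // e.g. {big=2, data=1, compute=1, map=1, reduce=1}
    }
}
```

In real MapReduce the two phases run on many machines, but the shape of the computation is exactly this: a per-record transformation followed by a per-key aggregation.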
MapReduce features at a glance

  • Master-slave architecture, composed mainly of a JobTracker (master) and TaskTrackers (slaves).
  • Failover support is provided by a heartbeat-based system.
  • Pluggable components with a given set of interfaces to accomplish various required functions, such as detailed validation and combination of results; a sketch of one such component follows this list.
  • Network-optimized data access from Hadoop for any given topology.
  • Supports optimization features such as partitions.
  • Has various modes of running, such as local and pseudo-distributed.
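As an example of this pluggability, below is a sketch of a custom partitioner for a word-count style job. The class name FirstLetterPartitioner is an assumption for illustration; the Partitioner base class and the job.setPartitionerClass(...) registration are the standard Hadoop API, and by default Hadoop uses a hash-based partitioner.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// A pluggable component: decides which reducer receives each (word, count) pair.
// Registered on the job with job.setPartitionerClass(FirstLetterPartitioner.class).
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0; // map-only job: no partitions to choose from
        }
        String word = key.toString();
        char first = word.isEmpty() ? 'a' : Character.toLowerCase(word.charAt(0));
        // Words starting with the same letter land in the same partition.
        return first % numReduceTasks;
    }
}
```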

What Hadoop storage looks like with MapReduce

The following diagram overlays the MR components on top of the HDFS components seen in the previous post, to highlight the integration.
[Diagram: MapReduce components (JobTracker, TaskTrackers) overlaid on the HDFS components.]
  • MapReduce is implemented in close integration with Hadoop.
  • The JobTracker is the master; TaskTrackers provide handles to the distributed map tasks.
  • The JobTracker maintains heartbeat contact and restarts jobs if a job is non-responsive; a toy sketch of this idea follows the list.
  • Data access from Hadoop data nodes is optimized based on policy.
  • Tasks under the various TaskTrackers are capable of exchanging data.
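The heartbeat-based failure handling is internal to the framework, but the idea can be sketched in plain Java. This is a conceptual toy, not Hadoop's actual implementation; all names are made up, and the timeout value is only illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Conceptual sketch of heartbeat-based failure detection (not Hadoop code).
public class HeartbeatMonitor {
    private static final long TIMEOUT_MS = 10 * 60 * 1000; // illustrative expiry window
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // Called each time a TaskTracker reports in.
    public void onHeartbeat(String taskTrackerId) {
        lastHeartbeat.put(taskTrackerId, System.currentTimeMillis());
    }

    // Periodically run by the master: a tracker that has not reported
    // within the timeout is declared dead and its tasks are rescheduled.
    public void expireDeadTrackers() {
        long now = System.currentTimeMillis();
        lastHeartbeat.forEach((id, lastSeen) -> {
            if (now - lastSeen > TIMEOUT_MS) {
                lastHeartbeat.remove(id);
                System.out.println("Tracker " + id + " lost; rescheduling its tasks");
            }
        });
    }
}
```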
Basic algorithm

  • Data blocks reside in HDFS, and they are read as input splits.
  • Map programs receive the splits as records with a specific structure (key-value pairs) and also receive a context handle; see the word-count sketch after this list.
  • Maps run distributed and process distributed data by accessing it locally (which provides a speed enhancement).
  • Maps have access to various Hadoop data types.
  • The framework allows collating data into partitions, keeping similar values together.
  • Merge and sort are performed if a reduce step is necessary.
  • Reducers are not compulsory, but are provided for final processing of the data produced by the maps.
  • Map output is transmitted over HTTP to the reduce location.
  • If no reduce is running, the unsorted data is submitted back to the HDFS output location.
  • If a reduce is running, the reduced data is saved back to HDFS.
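The classic word count illustrates most of these steps. Below is a sketch against the standard org.apache.hadoop.mapreduce API (the class names are the usual illustrative ones): the mapper receives one record of its input split at a time as a key-value pair plus a Context handle, and the reducer receives each key with its merged and sorted values.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for each line of its input split, emit a (word, 1) pair via the context.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);
        }
    }
}

// Reducer: after the shuffle's merge and sort, receives every count emitted
// for a given word and writes the total back out (ultimately to HDFS).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```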
Physical architecture
JT: JobTracker
TT: TaskTracker

  • The client requests a job via RPC to the JT; a minimal driver sketch follows this list.
  • The JT monitors heartbeats from each TT to make sure the TT is up and running.
  • A TT has child task nodes executing map and reduce jobs. Data access is localized.
  • An umbilical protocol maintains communication between the child task nodes and the TT.
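Here is a minimal driver sketch showing the client side of that flow, assuming the WordCountMapper and WordCountReducer classes sketched earlier. Job.getInstance and waitForCompletion are the standard Hadoop client API; the input and output paths are taken from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // The client describes the job, then submits it to the master over RPC;
        // waitForCompletion polls the job's progress until it finishes.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```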


It will be my pleasure to respond to any of your queries, and I welcome your suggestions for making these blogs better.