
Sunday 2 November 2014

Introduction to Data Science, Part 1


Parag Ray
29-Aug-2014

Introduction

Welcome to the readers!

The purpose of writing this blog is to put together all the findings and knowledge that I currently have, and will be gathering, in the field of data science, big data technology and allied application areas.

I hope this collection will help you.

I shall be covering various concepts and technologies, starting from this basic overview. Please look at the target audience and related readings sections to judge suitability for your needs. This blog is written more like a book and may be edited for correction, addition and expansion.


Agenda
  • Target audience
  • Related readings/other blogs
  • Data science and Big data definition
  • Use of Hadoop in big data
  • Hadoop at a glance.
  • Typical Use cases.
  • Types of algorithms for analytics and ML.
  • Concepts & Skill base.
  • What does Hadoop storage look like?
  • The ecosystem.
Target Audience
  • This is an introductory discussion on data science, big data technologies and Hadoop.
  • Best suited for an audience looking for an introduction to these technologies.
  • No prior knowledge is required, except for a basic understanding of computing and a high-level understanding of enterprise application environments.
Related readings/other blogs  
This is the first blog of this series; other blog titles will be added here as they are published.
Other article shortcuts are available in the Pages tab.
You may also like to look at the Cloudera home page and the Hadoop home page for further details.

Data Science and Big data definition
I am using the Wikipedia definitions here, as I find them very appropriate:
Data science is the study of the generalizable extraction of knowledge from data,[1] yet the key word is science.[2] It incorporates varying elements and builds on techniques and theories from many fields, including signal processing, mathematics, probability models, machine learning, statistical learning, computer programming, data engineering, pattern recognition and learning, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products. The subject is not restricted to only big data, although the fact that data is scaling up makes big data an important aspect of data science. Another key ingredient that boosted the practice and applicability of data science is the development of machine learning - a branch of artificial intelligence - which is used to uncover patterns from data and develop practical and usable predictive models.
For more details please visit http://en.wikipedia.org/wiki/Data_science.


Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.
The challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy violations. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, prevent diseases, combat crime and so on."[1]
Scientists regularly encounter limitations due to large data sets in many areas, including meteorology, genomics,[2] connectomics, complex physics simulations,[3] and biological and environmental research.[4] The limitations also affect Internet search, finance and business informatics.
For more details please visit http://en.wikipedia.org/wiki/Big_data

Use of Hadoop in big data 

Hadoop has become a very popular platform for big data analysis, and for data science in general, as it provides a reliable, massive, horizontally scalable data store.

Unlike a vertically scaled infrastructure (for example an RDBMS), data is not fetched linearly through a single machine, which makes data retrieval relatively faster for very large data sets.

Hadoop was basically created for batch processing, but various related technologies have emerged that enable near real-time access to the data stored in it.

Although Hadoop is meant for commodity hardware, some vendors have come up with specialized, proprietary hardware that improves the speed of data access even further.
These are by nature prone to vendor lock-in, but for very high-end usage that should not be a problem.

Hadoop at a glance.

Hadoop comprises two main components:
  • HADOOP DISTRIBUTED FILE SYSTEM(HDFS)
A distributed file system that can span a cluster built from commodity hardware (meaning no special server hardware is needed). It can scale to as many nodes as required.
  • MAP REDUCE
A framework for the map-reduce algorithm, in which a large computation is broken into small tasks running across a cluster (typically the same cluster as HDFS) and the results are then combined, or 'reduced', to arrive at the desired information.

HDFS and MapReduce are tightly integrated: map tasks typically run on the nodes that already hold the data they process.
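To make the split between map and reduce concrete, here is a minimal sketch of the classic word-count job written against the Hadoop Java MapReduce API (the org.apache.hadoop.mapreduce classes). The input and output locations are assumed to be HDFS paths passed on the command line, and the class would be packaged into a jar and submitted to the cluster.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: split each input line into words and emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The map tasks run in parallel on the nodes that hold the input blocks, and the reduce step combines the partial counts from all of them, which is exactly the "broken into small tasks and then reduced" pattern described above.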



Typical Use cases.
  • E-Commerce.
Specialized systems called recommender systems generate recommendations.
Site performance metrics such as DNS lookup time, form loading time and frequency of requests per page can be derived by analyzing very large log data.

  • Banking
Money laundering analysis
Risk analysis
Analytics of various investment and instruments related data
Market analysis

  • Network
VPN data analysis
Anomalous access detection

  • Telecom
Finding signal breaks and classification or anomaly detection.

Types of algorithms for analytics and ML.
  • Data analysis to find sample/ population characteristics
These types of use involve storing huge amounts of data (think of the sensor data from the steam pipelines of a factory, or the log data of a network) in order to run statistical/numerical analysis such as finding the average latency, average temperature, variance, and so on. The parameters and mathematical functions of such analysis are provided based on domain needs.
  •  Advanced algorithms
There are other, more advanced analytics in line with machine learning: algorithms that analyze a data set, try to find a predictive or grouping model, and are intelligent enough to find the best-fit parameters by themselves:
    • Supervised learning: 
We have access to a learning set where we know that a certain model of relation exists, such as y = f(x), and in the learning set some values of y corresponding to x are provided.
Based on these, we can try to predict y values for other x's for which y is not known.
In supervised learning, the learning set helps us find the model, i.e. the nature of the predicting function f(x), by estimating the parameter set of the linear or non-linear function used.

The challenges involved include choosing between a linear and a non-linear model, optimizing the algorithm, and standardizing the data so that the analysis runs within performance and accuracy requirements. Techniques such as feature scaling and proper selection of parameters will need to be applied.
The discussion so far covers, for example, regression analysis for market price prediction based on various determining factors. Here the learning set may be a data set where the market price is provided along with the determining factors; once learning is done on this set, the fitted parameters can be used in other cases where the determining factors are known but the market price is not (a toy sketch of this idea appears after this list).
 

Classification algorithms, on the other hand, try to classify data points into various categories; a prominent example is OCR.
    • Unsupervised learning
Unsupervised learning does not have the advantage of a learning set. Grouping is done based on the relative values of the various parameters of the sample points.
  • Computational tasks that can be broken into distributed, iterative logic are particularly well suited to Hadoop-based systems.
  • Hadoop provides the platform for massive data storage, but analytics is performed with various tools ranging from MapReduce adaptations to Pig, Hive, HBase, R and Mahout.
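As a toy, single-machine illustration of the supervised-learning idea above (not a Hadoop job), the sketch below fits y = f(x) as a straight line y = a*x + b from a small learning set using ordinary least squares, then predicts y for an x that was not in the set. The numbers are invented purely for illustration.

import java.util.Arrays;

public class SimpleRegression {
  public static void main(String[] args) {
    // Learning set: x is a known determining factor, y the observed outcome.
    double[] x = {1.0, 2.0, 3.0, 4.0, 5.0};
    double[] y = {2.1, 4.1, 6.2, 7.9, 10.1};

    double meanX = Arrays.stream(x).average().orElse(0);
    double meanY = Arrays.stream(y).average().orElse(0);

    // Least-squares estimates: a = cov(x, y) / var(x), b = meanY - a * meanX.
    double cov = 0, var = 0;
    for (int i = 0; i < x.length; i++) {
      cov += (x[i] - meanX) * (y[i] - meanY);
      var += (x[i] - meanX) * (x[i] - meanX);
    }
    double a = cov / var;
    double b = meanY - a * meanX;

    // Predict y for an x outside the learning set.
    double newX = 6.0;
    System.out.printf("model: y = %.3f * x + %.3f%n", a, b);
    System.out.printf("prediction for x = %.1f: y = %.3f%n", newX, a * newX + b);
  }
}

A real market-price model would use many determining factors (multiple regression) plus feature scaling, but the learn-then-predict flow is the same.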

Concepts & Skill base.

Distributed file system
Large files stored in blocks and replicated across machines.
Takes care of network failure and recovery.
Optimizes data access based on topology & access point.
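As a small sketch of how a client sees this storage, the snippet below uses the Hadoop FileSystem Java API to list the files under a directory together with their size, block size and replication factor. The path /data is a placeholder, and the Configuration object is assumed to pick up the cluster settings (core-site.xml, hdfs-site.xml) from the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsListing {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // reads the cluster configuration
    FileSystem fs = FileSystem.get(conf);
    // Each file is stored as a series of blocks, and each block is
    // replicated across several Data nodes.
    for (FileStatus status : fs.listStatus(new Path("/data"))) {
      System.out.printf("%s  size=%d  blockSize=%d  replication=%d%n",
          status.getPath(), status.getLen(),
          status.getBlockSize(), status.getReplication());
    }
    fs.close();
  }
}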


Skills needed for a big data professional
Although there are specializations, the following skills appear to be important.
  • Tools expertise. A wide range of tool knowledge is required, as it is not one size fits all.
  • Knowledge of statistical principles such as sampling, central tendency, variance, correlation, regression analysis and time series, and of probability distributions such as the normal, chi-square and t distributions.
  • Knowledge of matrix algebra.
  • Knowledge of networking and the Linux/Unix operating systems.
  • Domain knowledge. 
For Java/J2EE developers, it may be important to remember that in most cases we no longer have the support of an RDBMS, and we are dealing with big data volumes, so the slightest unnecessary computational cost gets amplified and ends up as a big inefficiency. Very obvious implementation techniques like client-side loops may not prove to be a good thing, and we need to avoid them as much as we can (a sketch of pushing such a computation to the cluster as a MapReduce job follows below).
Be ready for heavy intellectual challenges arising out of such optimization.
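As an example of pushing such a computation to the cluster instead of looping over records on a single client, the sketch below computes the average latency per server as a MapReduce job. The log format (server id and latency in the first two tab-separated fields) is assumed purely for illustration, and the driver wiring would be the same as in the word-count sketch shown earlier.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageLatency {

  // Map step: parse each log line and emit (server, latency).
  public static class LatencyMapper
      extends Mapper<Object, Text, Text, DoubleWritable> {
    private final Text server = new Text();
    private final DoubleWritable latency = new DoubleWritable();

    @Override
    public void map(Object key, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      if (fields.length < 2) {
        return;  // skip malformed records
      }
      server.set(fields[0]);
      latency.set(Double.parseDouble(fields[1]));
      context.write(server, latency);
    }
  }

  // Reduce step: average all latencies seen for each server.
  public static class AverageReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    public void reduce(Text server, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      double sum = 0;
      long count = 0;
      for (DoubleWritable value : values) {
        sum += value.get();
        count++;
      }
      context.write(server, new DoubleWritable(sum / count));
    }
  }
}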

What does Hadoop storage look like?


  • HDFS storage spans nodes (commodity hardware included), racks, clusters and data centers.
  • The master in this setup is the Name node and the slaves are the Data nodes.
  • The master Name node holds the metadata and serves locations, providing pointers to the Data nodes that hold the requested resource.
  • Data nodes serve the data and are not dependent on the Name node for that.
  • Data resources are replicated multiple times across Data nodes.
  • The Name node is intelligent enough to provide references to all available replicas of the same resource, ordered by fastest accessibility based on the configured topology (the sketch below queries these block locations through the client API).
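The sketch below illustrates the last point from the client side: it asks the Name node, through the FileSystem API, where the blocks of a file and their replicas are located. The file path is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/data/sensor-log.txt");  // hypothetical HDFS file

    // For every block of the file, the Name node returns the Data nodes
    // holding a replica of that block.
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d replicas on: %s%n",
          block.getOffset(), block.getLength(),
          String.join(", ", block.getHosts()));
    }
    fs.close();
  }
}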

The Ecosystem.





It will be my pleasure to respond to any of your queries, and I do welcome your suggestions for making these blogs better.