
Thursday 20 November 2014

Introduction to Data Science, Part 3: HDFS Single-Node Setup

Parag Ray
29-Sep-2014

Introduction

Welcome to the readers!

This is the third part of the series of articles. Here we actually start setting up Hadoop and get hands-on experience with HDFS.
Please see the related readings and target audience sections for help in following the blog.
 
We are assuming the operating system to be Ubuntu 14.04.

Agenda
  • Target audience
  • HDFS setup
Target audience

  • This is an intermediate-level discussion on Hadoop and related tools.
  • Best suited for readers who are looking for an introduction to this technology.
  • Prior knowledge of Java and Linux is required.
  • An intermediate-level understanding of networking is necessary.
Related readings/other blogs  
Please see the links section. You may also want to look at the Cloudera home page and the Hadoop home page for further details.

Hadoop Setup
  • Preconditions:
    Assuming Java 7 and Ubuntu Linux are installed, with at least 4 GB of RAM and 50 GB of disk space.
    If Ubuntu is a new installation, it is a good idea to refresh the package index first with
sudo apt-get update
  • Also, if needed, install Java with
sudo apt-get install openjdk-7-jdk
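Once Java is installed, a quick check can confirm the version (the exact version string will vary by machine):
java -version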

  • Install Eclipse with
sudo apt-get install eclipse
  • Install ssh with
sudo apt-get install openssh-server
Note: these commands require an Internet connection, and the firewall should allow access to the Internet package repositories.
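Note also that Hadoop's start/stop scripts use ssh to launch the daemons, even on a single node. A minimal sketch of setting up passwordless ssh to localhost (assuming openssh-server is installed and the default key location is used):
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost exit
The last command should log in and exit without prompting for a password.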
  • Following are the steps to set up Hadoop (the full command sequence is sketched after this list):
    • Download hadoop-1.2.1-bin.tar.gz.
    • Open a console and use the command cd ~ to move to the home directory.
    • Create a folder named work under the home directory (~/work).
    • Change directory to the new folder and extract hadoop-1.2.1-bin.tar.gz with the command
    tar -xzvf hadoop-1.2.1-bin.tar.gz
    • This will create a hadoop-1.2.1 folder under the work folder; we shall call this the 'hadoop folder'.
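The whole sequence, assuming the archive was downloaded to the home directory, looks roughly like this:
cd ~
mkdir work
mv hadoop-1.2.1-bin.tar.gz work/
cd work
tar -xzvf hadoop-1.2.1-bin.tar.gz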
  • Before you proceed to the next steps, be aware of the Java home folder. The following command may help:
which java
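On Ubuntu, which java usually returns a symlink managed by the alternatives system; to resolve it to the real installation directory, something like the following can be used (the resulting path will differ from machine to machine):
readlink -f $(which java)
JAVA_HOME should then be set to the folder above bin (and above jre, if the resolved path ends in jre/bin/java).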
  • Go to the home folder with cd ~
  • Issue the following command to open the profile file:
  gedit .bashrc
 
  • Assuming that Java was installed using the command mentioned above, add the following lines to .bashrc:
export JAVA_HOME="/usr/lib/jvm/java-7-openjdk-i386"
export HADOOP_HOME="/home/parag/work/hadoop-1.2.1"
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin

 
Note that the first and second lines could be different depending on where Java and Hadoop are installed.
In case Java is in some other folder, that path has to be provided; the directory should be the parent folder of the bin folder.
  • To refresh the configuration, issue the command
     . .bashrc
Please note there is a space after the first dot in the above command (the dot is the shell's 'source' command). Readers with Unix knowledge will find this information redundant.
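As a quick sanity check after the refresh, the variables and the PATH can be verified with:
echo $JAVA_HOME
echo $HADOOP_HOME
hadoop version
The last command should print the Hadoop version (1.2.1) if the PATH was set correctly.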

  • Close gedit.
  • There will be a 'conf' folder under the hadoop folder; change directory to the conf folder to do the subsequent configuration tasks.
  • In hadoop-env.sh (use gedit hadoop-env.sh), add the line
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
(there must be no spaces around the = sign)
  • In case of a multi-node cluster setup, we need to add to the masters file the host where the secondary namenode will run (governed by the hadoop command issued on the nodes), and to the slaves file the hosts of all datanodes, one node per line; a hypothetical example follows.
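For illustration only, for a made-up cluster with two datanodes named PRUBNode2 and PRUBNode3, the conf/slaves file would simply contain:
PRUBNode2
PRUBNode3
For the single-node setup in this article, the default entries can be left as they are.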
  • The following should be noted:
    • Having a user specific to Hadoop is a good idea; it is not shown here.
    • The folder structure should be the same across the cluster if a multi-node setup is used. It is better to use a generic folder structure anyway, so that it does not become a difficulty later.
  •  All node names should be in /etc/hosts (use sudo gedit /etc/hosts):
127.0.0.1    localhost
127.0.0.1    PRUBNode1
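To verify that the host name resolves as expected, the following can be used (output will vary by machine):
hostname
ping -c 1 PRUBNode1
The hostname command should print PRUBNode1, and the ping should resolve to 127.0.0.1 as per the /etc/hosts entry above.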
  •  Edit core-site.xml to add the following:
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://PRUBNode1:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/parag/tmp</value>
    </property>
</configuration>
 
The above configuration is for the namenode: fs.default.name points to the HDFS namenode URI, with host name PRUBNode1 (this can be verified with the hostname command at the console). hadoop.tmp.dir is the temporary work area.
 
  • Edit hdfs-site.xml to add the following:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/home/parag/work/dfs/name</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/home/parag/work/dfs/data</value>
    </property>
</configuration>
A few tips:
For failure protection, dfs.name.dir should list more than one directory (including one on NFS) as a comma-separated list; a hypothetical example follows.
The directory structure should be the same across all nodes; for example, dfs.data.dir will be needed on all nodes in the cluster, and the folder should be the same.
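For illustration only, assuming a hypothetical NFS mount at /mnt/nfs/dfs/name, a redundant dfs.name.dir entry would look like this:
<property>
    <name>dfs.name.dir</name>
    <value>/home/parag/work/dfs/name,/mnt/nfs/dfs/name</value>
</property>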
  • cd to the bin folder of hadoop and issue ./hadoop to see the command options. Note that the MapReduce and HDFS commands come together under the same hadoop command.
 parag@PRUBNode1:~/work/hadoop-1.2.1/bin$ ./hadoop
Warning: $HADOOP_HOME is deprecated.

Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  mradmin              run a Map-Reduce admin client
  fsck                 run a DFS filesystem checking utility
  fs                   run a generic filesystem user client
  balancer             run a cluster balancing utility
  oiv                  apply the offline fsimage viewer to an fsimage
  fetchdt              fetch a delegation token from the NameNode
  jobtracker           run the MapReduce job Tracker node
  pipes                run a Pipes job
  tasktracker          run a MapReduce task Tracker node
  historyserver        run job history servers as a standalone daemon
  job                  manipulate MapReduce jobs
  queue                get information regarding JobQueues
  version              print the version
  jar <jar>            run a jar file
  distcp <srcurl> <desturl> copy file or directories recursively
  distcp2 <srcurl> <desturl> DistCp version 2
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
  • ONLY FOR THE FIRST TIME WE NEED TO FORMAT the file system (remember what happens if we format our D: drive?):
./hadoop namenode -format


  • Start the Hadoop file system
./start-dfs.sh
Issue the jps command to see all the processes running:
parag@PRUBNode1:~/work/hadoop-1.2.1/bin$ jps
2403 NameNode
3331 Jps
2574 DataNode
2722 SecondaryNameNode
 
 
Monitor the Hadoop file system from http://<<hostname of namenode>>:50070
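To get a first hands-on feel of HDFS, a few basic file system commands can be tried from the bin folder (paths and file names here are only illustrative):
./hadoop fs -mkdir /user/parag/input
./hadoop fs -put ~/somefile.txt /user/parag/input/
./hadoop fs -ls /user/parag/input
./hadoop fs -cat /user/parag/input/somefile.txt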


  • Stop dfs with ./stop-dfs.sh


It will be my pleasure to respond to any of your queries, and I do welcome your suggestions for making the blogs better.