Monday, April 1, 2013

Setup Hadoop Cluster



Hadoop Cluster Setup in CentOS 6

Requirements:
1.    Have Java 1.6.x installed.
2.    Have SSH installed.

Installation & Configuration [MUST be done as the root user]

1.    Download the Hadoop RPM file from the official Apache Hadoop website.

2.    Install hadoop:
rpm -i hadoop_version.rpm

3.    Edit the file /etc/hosts on the servers:
192.168.1.40   master
192.168.1.41   slave1
192.168.1.42   slave2

4.    We must configure passwordless login from the name node (master) to all data nodes (slave1 and slave2). On all servers, do the following:
·      Command: ssh-keygen -t dsa
·      Keep pressing ENTER until the id_dsa.pub file is generated.
We now have three .pub files: one on the master and one on each of the two slaves.

Copy the contents of all three .pub files into the authorized_keys file.
Every server's authorized_keys file should have the same content.
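For example, a minimal sketch of the whole exchange, run as root on each server (assuming the default key path ~/.ssh/id_dsa):

           # ssh-keygen -t dsa
           # cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
           (append the id_dsa.pub contents from the other two servers the same way,
            so that authorized_keys ends up identical everywhere)
           # chmod 600 ~/.ssh/authorized_keys

Then verify from the master that no password is requested:
           # ssh slave1 hostname
           # ssh slave2 hostname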

5.    Open the file /etc/hadoop/hadoop-env.sh and set JAVA_HOME:
export JAVA_HOME=/usr/java/jdk1.6.0_38

6.    Open the file /etc/hadoop/core-site.xml and add the following properties. This file configures where the name node stores its information:
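The essential property points the default file system at the name node; a minimal sketch, assuming the commonly used port 9000 (adjust if yours differs):

<configuration>
  <property>
    <!-- URI of the name node; host per /etc/hosts above, port 9000 is an assumption -->
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>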




7.    Open the file /etc/hadoop/hdfs-site.xml and add the following properties:
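For example, setting the block replication factor to 2, which matches the replication column ("2") in the HDFS listing shown in the test run below:

<configuration>
  <property>
    <!-- number of copies HDFS keeps of each block -->
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>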

8.    Open the file /etc/hadoop/mapred-site.xml and add the following properties. This file configures the host and port of the MapReduce JobTracker on the name node of the Hadoop setup:
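A sketch of the usual single property, assuming the conventional JobTracker port 9001 (adjust if yours differs):

<configuration>
  <property>
    <!-- host:port the JobTracker listens on; port 9001 is an assumption -->
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>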




9.    Open the file /etc/hadoop/masters and add the name node's hostname: [NAMENODE SERVER ONLY]
master

10. Open the file /etc/hadoop/slaves and add all the data nodes' hostnames: [NAMENODE SERVER ONLY]
/* In case you want the name node to also store data (i.e., behave like a data node as well), list it in the slaves file too. */
master
slave1
slave2

11. Modify file permissions.
Once Hadoop is installed, start-all.sh, stop-all.sh, and several other scripts are placed under /usr/sbin/; we must make all of them executable:
           # sudo chmod a+x  file_name
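For example, to make the usual control scripts executable in one command (the start/stop-dfs and start/stop-mapred names are the standard companions of start-all.sh; confirm what is actually in your /usr/sbin/):

           # sudo chmod a+x /usr/sbin/start-all.sh /usr/sbin/stop-all.sh \
                            /usr/sbin/start-dfs.sh /usr/sbin/stop-dfs.sh \
                            /usr/sbin/start-mapred.sh /usr/sbin/stop-mapred.sh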


Notice: Steps 9, 10, and 11 are for the master server only; the slaves need none of them.

Start and Stop the Hadoop Cluster (run on the master server)
1.    Formatting the namenode:
           # hadoop namenode -format
2.    Starting the Hadoop Cluster
           # start-all.sh

Run the jps command on the master server:
          # jps
            922 JobTracker
            815 SecondaryNameNode
            1062 TaskTracker
            521 NameNode
            1136 Jps

Run the jps command on the slaves:
          # jps
            7407 DataNode
            7521 TaskTracker
            7583 Jps

3.    Checking the status of Hadoop Cluster:
(1)  Type the command:
hadoop dfsadmin -report
            (2) Browse the web interface for the NameNode (master server) and the JobTracker:
·      NameNode – http://192.168.1.40:50070/
·      JobTracker – http://192.168.1.40:50030/

4.    Process a sample to test Hadoop Cluster (wordcount example):
(1)  Create a directory in master server
mkdir input
          
(2) Create three test files under the ‘input’ directory and add the following text into them
               echo "Hello haifzhan" >> text1.txt
               echo "Hello hadoop" >> text2.txt
               echo "Hello hadoop again" >> text3.txt

(3) Copy the three test files from the master server to Hadoop’s HDFS
                  Under the ‘input’ directory:
                 # hadoop dfs -put ./   input
            (4)  Now you can check the files on Hadoop’s HDFS
                 # hadoop dfs -ls input/*
-rw-r--r--   2 root supergroup         15 2013-04-01 15:03 /user/root/input/text1.txt
-rw-r--r--   2 root supergroup         13 2013-04-01 15:03 /user/root/input/text2.txt
-rw-r--r--   2 root supergroup         19 2013-04-01 15:03 /user/root/input/text3.txt   

(5)  Run the MapReduce job
                 # hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar wordcount input output
            (6)  Check the result
                 # hadoop dfs -cat output/part-r-00000
                        Hello     3
                        again     1
                        hadoop    2
                        haifzhan  1
5.  Stopping the Hadoop Cluster
               # stop-all.sh

Other useful resources:
1. The log files are located in /var/log/hadoop/root
Error Solving:
1. Datanode: No route to host (the datanode starts but then shuts down after a while)
           Close the firewalls on both the master and slave machines:
           # service iptables stop
2. Namenode: How to exit the safemode
           # hadoop dfsadmin -safemode leave
3. How to start datanode or tasktracker independently
            # hadoop-daemon.sh start datanode/tasktracker
4. How to check the current Java version and path on your local machine
            # echo $JAVA_HOME

5.  "process information unavailable" shown by jps
Remove all files under /tmp, reformat the namenode, and restart all servers.

