Hadoop Cluster Setup on CentOS 6
Requirements:
1. Java 1.6.x installed.
2. SSH installed.
Installation & Configuration [MUST be done as the root user]
1. Download the Hadoop RPM file from the official Apache Hadoop website.
2. Install Hadoop:
# rpm -i hadoop_version.rpm
3. Edit the file /etc/hosts on the servers:
192.168.1.40 master
192.168.1.41 slave1
192.168.1.42 slave2
4. We must configure passwordless login from the name node (master) to all data nodes (slave1 and slave2). On all servers, do the following:
- Command: ssh-keygen -t dsa
- Keep pressing ENTER until the id_dsa.pub file is generated.
We now have three .pub files: one on the master and one on each of the two slaves. Copy the contents of all three .pub files into the authorized_keys file on every server; every server's authorized_keys file should end up with the same content (see the command sketch below).
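A minimal sketch of the key exchange, assuming the default ~/.ssh paths and that every node runs as root (the copy mechanism shown here is just one option). After running ssh-keygen on every node, gather and distribute the keys from the master:
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
# ssh root@slave1 cat /root/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
# ssh root@slave2 cat /root/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
# scp ~/.ssh/authorized_keys root@slave1:/root/.ssh/
# scp ~/.ssh/authorized_keys root@slave2:/root/.ssh/
# ssh slave1 hostname    (should now log in without a password prompt)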
5. Open the file /etc/hadoop/hadoop-env.sh and set $JAVA_HOME:
export JAVA_HOME=/usr/java/jdk1.6.0_38
6. Open the file /etc/hadoop/core-site.xml and add the following properties. This file configures the name node (default filesystem) information:
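The property values are not reproduced here; a minimal sketch of what core-site.xml typically contains in a Hadoop 1.x setup like this, assuming the "master" hostname from /etc/hosts and HDFS on port 9000:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>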
7. Open the file /etc/hadoop/hdfs-site.xml and add the following properties:
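The properties are not listed in the post; a typical minimal hdfs-site.xml for a small cluster, assuming a replication factor of 2 (which matches the replication column in the HDFS listing shown later), might be:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>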
8. Open the file /etc/hadoop/mapred-site.xml and add the following properties. This file configures the host and port of the MapReduce JobTracker on the name node of the Hadoop setup:
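Again, the exact values are not shown in the post; a common minimal configuration, assuming the conventional JobTracker port 9001 on the master, would be:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>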
9. Open the file /etc/hadoop/masters and add the name node's name: [NAMENODE SERVER ONLY]
master
10. Open the file /etc/hadoop/slaves and add all the data node names: [NAMENODE SERVER ONLY]
/* In case you want the name node to also store data (i.e., the name node also behaves like a data node), it can be listed in the slaves file as well. */
master
slave1
slave2
11. Modify file permissions.
Once Hadoop is installed, start-all.sh, stop-all.sh, and several other scripts are generated under /usr/sbin/; we must make all of them executable:
# sudo chmod a+x file_name
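For example, to cover the scripts used later in this post (the exact set of files under /usr/sbin/ may differ between Hadoop RPM versions):
# sudo chmod a+x /usr/sbin/start-all.sh /usr/sbin/stop-all.sh /usr/sbin/hadoop-daemon.sh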
Notice: steps 9, 10, and 11 are for the master server only; the slaves do not need to perform them.
Start and Stop the Hadoop Cluster (run these commands on the master server)
1. Formatting the namenode:
# hadoop namenode -format
2. Starting the Hadoop Cluster
# start-all.sh
Run the jps command on the master server:
# jps
922 JobTracker
815 SecondaryNameNode
1062 TaskTracker
521 NameNode
1136 Jps
Run the jps command on the slaves:
# jps
7407 DataNode
7521 TaskTracker
7583 Jps
3. Checking the status of the Hadoop Cluster:
(1) Type the command:
# hadoop dfsadmin -report
(2) Browse the web interfaces for the NameNode (master server) and the JobTracker:
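The URLs are not listed in the original post; assuming the default Hadoop 1.x web UI ports, they are typically:
NameNode web UI:   http://master:50070/
JobTracker web UI: http://master:50030/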
4. Process a sample to test the Hadoop Cluster (wordcount example):
(1) Create a directory on the master server:
# mkdir input
(2) Create three test files under the 'input' directory and add the following text to them:
echo "Hello haifzhan" >> text1.txt
echo "Hello hadoop" >> text2.txt
echo "Hello hadoop again" >> text3.txt
(3) Copy the three test files from the master server into the 'input' directory on Hadoop's HDFS:
# hadoop dfs -put ./ input
(4) Now you can check the files on Hadoop's HDFS:
# hadoop dfs -ls input/*
-rw-r--r-- 2 root supergroup 15 2013-04-01 15:03 /user/root/input/text1.txt
-rw-r--r-- 2 root supergroup 13 2013-04-01 15:03 /user/root/input/text2.txt
-rw-r--r-- 2 root supergroup 19 2013-04-01 15:03 /user/root/input/text3.txt
(5) Run the MapReduce job:
# hadoop jar /usr/share/hadoop/hadoop-example-1.0.3.jar wordcount input output
(6) Check the result:
# hadoop dfs -cat output/part-r-00000
Hello 3
again 1
hadoop 2
haifzhan 1
5. Stopping the Hadoop Cluster
# stop-all.sh
Other useful resources:
1. The log files are located in /var/log/hadoop/root.
2. Useful websites:
Error Solving:
1. Datanode: "No route to host" (the datanode starts but then shuts down automatically after a while).
Close the firewall on both the master and slave machines:
# service iptables stop
2. Namenode: How to exit safe mode:
# hadoop dfsadmin -safemode leave
3. How to start the datanode or tasktracker independently:
# hadoop-daemon.sh start datanode
# hadoop-daemon.sh start tasktracker
4. How to check the current Java version and path on your local machine:
# echo $JAVA_HOME
5. "Process information unavailable":
Remove all files under /tmp, reformat the namenode, and restart all servers.