Hadoop Cluster Setup in CentOS 6
Requirements:
1.    Have Java 1.6.x installed.
2.    Have ssh installed.
Installation & Configuration [MUST be done as the root user]
1.    Download the Hadoop RPM file from the official Apache Hadoop website.
2.    Install Hadoop:
rpm -i hadoop_version.rpm
3.    Edit the file /etc/hosts on the servers:
192.168.1.40   master
192.168.1.41   slave1
192.168.1.42   slave2
4.    We must configure passwordless login from the name node (master) to all data nodes (slave1 and slave2). On all servers do the following (a sketch of the copying is shown after this step):
-  Command: ssh-keygen -t dsa
-  Keep pressing ENTER until the id_dsa.pub file is generated.
We now have three .pub files: one on the master and one on each of the two slaves. Copy the contents of those three .pub files into the authorized_keys file; the authorized_keys file on every server should have the same content.
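A sketch of one way to do that copying, run from the master, assuming the keys are in the default ~/.ssh location and that password login still works for the initial transfers:
           # cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
           # ssh slave1 cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
           # ssh slave2 cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
           # scp ~/.ssh/authorized_keys slave1:~/.ssh/authorized_keys
           # scp ~/.ssh/authorized_keys slave2:~/.ssh/authorized_keys
The first three commands collect all three public keys into the master's authorized_keys; the two scp commands push the combined file to the slaves so every server ends up with identical content. Keep the file readable only by its owner (chmod 600 ~/.ssh/authorized_keys) or sshd may ignore it.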
5.    Open the file /etc/hadoop/hadoop-env.sh and set $JAVA_HOME:
export JAVA_HOME=/usr/java/jdk1.6.0_38
6.    Open the file /etc/hadoop/core-site.xml and add the following properties. This file configures the name node (HDFS) address for the cluster:
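The property list itself is missing from these notes; a minimal sketch for this cluster might look like the following (the host name comes from /etc/hosts above, the port 9000 is an assumption):
           <configuration>
             <property>
               <!-- URI of the name node; the port is an assumed value -->
               <name>fs.default.name</name>
               <value>hdfs://master:9000</value>
             </property>
           </configuration>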
7.    Open the file /etc/hadoop/hdfs-site.xml and add the following properties:
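The properties are again not listed here; a minimal sketch, using a replication factor of 2 to match the HDFS listing shown later in these notes, would be:
           <configuration>
             <property>
               <!-- number of copies kept of each HDFS block -->
               <name>dfs.replication</name>
               <value>2</value>
             </property>
           </configuration>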
8.    Open the file /etc/hadoop/mapred-site.xml and add the following properties. This file configures the host and port of the MapReduce jobtracker, which runs on the name node of this Hadoop setup:
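As above, the properties themselves are missing; a minimal sketch (the port 9001 is an assumption) would be:
           <configuration>
             <property>
               <!-- host and port of the MapReduce jobtracker; the port is assumed -->
               <name>mapred.job.tracker</name>
               <value>master:9001</value>
             </property>
           </configuration>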
9.    Open the file /etc/hadoop/masters and add the namenode name: [NAMENODE SERVER ONLY]
master
10. Open the file /etc/hadoop/slaves and add all the datanode names: [NAMENODE SERVER ONLY]
/* In case you want the namenode to also store data (i.e. the namenode also behaves like a datanode), it can be listed in the slaves file as well, as is done here. */
master
slave1
slave2
11. Modify file permissions.
Once Hadoop is installed, start-all.sh, stop-all.sh and several other scripts are placed under /usr/sbin/; we must make all of those files executable:
           # sudo chmod a+x  file_name
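For example, to make all of the generated scripts executable in one pass (the exact file names are assumptions; adjust to whatever was actually installed):
           # sudo chmod a+x /usr/sbin/start-*.sh /usr/sbin/stop-*.sh /usr/sbin/hadoop-daemon*.sh /usr/sbin/slaves.sh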
Notice: Steps 9, 10 and 11 are only for the master server; the slaves do nothing for those steps.
Start and Stop the Hadoop Cluster (run on the master server)
1.    Formatting the namenode:
           # hadoop namenode -format
2.    Starting the Hadoop Cluster
           # start-all.sh
Run JPS command on master server:
          # jps
            922 JobTracker
            815 SecondaryNameNode
            1062 TaskTracker
            521 NameNode
            1136 Jps
Run JPS command on slaves:
          # jps   
7407 DataNode
7521 TaskTracker
7583 Jps
3.    Checking the status of the Hadoop Cluster:
(1)  Type the command:
# hadoop dfsadmin -report
            (2) Browse the web interface for the NameNode (master server) and the JobTracker:
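These notes do not list the URLs; assuming the standard Hadoop 1.x default ports they would be:
                 http://master:50070/   (NameNode web UI)
                 http://master:50030/   (JobTracker web UI)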
4.    Process a sample to test the Hadoop Cluster (wordcount example):
(1)  Create a directory on the master server:
                 # mkdir input
          
(2) Create three test files under the 'input' directory and add the following text into them:
                 echo "Hello haifzhan" >> text1.txt
                 echo "Hello hadoop" >> text2.txt
                 echo "Hello hadoop again" >> text3.txt
(3) Copy the test files from the master server to Hadoop's HDFS (run this from inside the 'input' directory):
                 # hadoop dfs -put ./   input
            (4) Now you can check the files on Hadoop's HDFS:
                 # hadoop dfs -ls input/*
-rw-r--r--   2 root supergroup         15 2013-04-01 15:03 /user/root/input/text1.txt
-rw-r--r--   2 root supergroup         13 2013-04-01 15:03 /user/root/input/text2.txt
-rw-r--r--   2 root supergroup         19 2013-04-01 15:03 /user/root/input/text3.txt
            (5) Run the MapReduce job:
                 # hadoop jar /usr/share/hadoop/hadoop-example-1.0.3.jar wordcount input output
            (6) Check the result:
                 # hadoop dfs -cat output/part-r-00000
                        Hello     3
                        again     1
                        hadoop    2
                        haifzhan  1
5.  Stopping the Hadoop Cluster 
               # stop-all.sh
Other useful resources:
1. The log files are located in /var/log/hadoop/root
2. Useful websites:
Error Solving:
1. Datanode: No route to host (the datanode starts but then shuts down automatically after a while)
           Close the firewalls on both the master and slave machines:
           # service iptables stop
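Optionally, to keep the firewall from coming back after a reboot (a CentOS 6 detail not mentioned in the original notes):
           # chkconfig iptables off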
2. Namenode: How to exit safemode
           # hadoop dfsadmin -safemode leave
3. How to start the datanode or tasktracker independently
            # hadoop-daemon.sh start datanode
            # hadoop-daemon.sh start tasktracker
4. How to check the current Java version and path on your local machine
            # echo $JAVA_HOME
5.  "process information unavailable" (in the jps output)
           Remove all files under /tmp, reformat the namenode, and restart all servers.
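A sketch of that recovery sequence (stop the cluster, clear /tmp on every server, reformat the namenode on the master, then restart; exactly which files under /tmp matter is not stated in these notes):
           # stop-all.sh
           # rm -rf /tmp/*
           # hadoop namenode -format
           # start-all.sh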