Hadoop Cluster Setup in CentOS 6
Requirements:
1.    Have Java 1.6.x installed.
2.    Have ssh installed.
Installation & Configuration [MUST be done as the root user]
1.    Download the Hadoop RPM file from the official Apache Hadoop website.
2.    Install Hadoop:
rpm -i hadoop_version.rpm
3.    Edit the file /etc/hosts on the servers:
192.168.1.40   master
192.168.1.41   slave1
192.168.1.42   slave2
4.    We must configure passwordless login from the name node (master) to all data nodes (slave1 and slave2). On all servers do the following (a sketch of the copying is shown after this step):
-  Command: ssh-keygen -t dsa
-  Keep pressing ENTER until the id_dsa.pub file is generated.
We now have three .pub files: one on the master and one on each of the two slaves. Copy the contents of those three .pub files into the authorized_keys file; the authorized_keys file on every server should have the same content.
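A sketch of one way to do that copying, run from the master, assuming the keys are in the default ~/.ssh location and that password login still works for the initial transfers:
           # cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
           # ssh slave1 cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
           # ssh slave2 cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
           # scp ~/.ssh/authorized_keys slave1:~/.ssh/authorized_keys
           # scp ~/.ssh/authorized_keys slave2:~/.ssh/authorized_keys
The first three commands collect all three public keys into the master's authorized_keys; the two scp commands push the combined file to the slaves so every server ends up with identical content. Keep the file readable only by its owner (chmod 600 ~/.ssh/authorized_keys) or sshd may ignore it.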
5.    Open the file /etc/hadoop/hadoop-env.sh and set $JAVA_HOME:
export JAVA_HOME=/usr/java/jdk1.6.0_38
6.    Open the file /etc/hadoop/core-site.xml and add the following properties. This file configures the name node (HDFS) address for the cluster:
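The property list itself is missing from these notes; a minimal sketch for this cluster might look like the following (the host name comes from /etc/hosts above, the port 9000 is an assumption):
           <configuration>
             <property>
               <!-- URI of the name node; the port is an assumed value -->
               <name>fs.default.name</name>
               <value>hdfs://master:9000</value>
             </property>
           </configuration>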
7.    Open the file /etc/hadoop/hdfs-site.xml and add the following properties:
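The properties are again not listed here; a minimal sketch, using a replication factor of 2 to match the HDFS listing shown later in these notes, would be:
           <configuration>
             <property>
               <!-- number of copies kept of each HDFS block -->
               <name>dfs.replication</name>
               <value>2</value>
             </property>
           </configuration>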
8.    Open the file /etc/hadoop/mapred-site.xml and add the following properties. This file configures the host and port of the MapReduce jobtracker, which runs on the name node of this Hadoop setup:
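As above, the properties themselves are missing; a minimal sketch (the port 9001 is an assumption) would be:
           <configuration>
             <property>
               <!-- host and port of the MapReduce jobtracker; the port is assumed -->
               <name>mapred.job.tracker</name>
               <value>master:9001</value>
             </property>
           </configuration>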
9.    Open the file /etc/hadoop/masters and add the namenode name: [NAMENODE SERVER ONLY]
master
10. Open the file /etc/hadoop/slaves and add all the datanode names: [NAMENODE SERVER ONLY]
/* In case you want the namenode to also store data (i.e. the namenode also behaves like a datanode), it can be listed in the slaves file as well, as is done here. */
master
slave1
slave2
11. Modify file permissions.
Once Hadoop is installed, start-all.sh, stop-all.sh and several other scripts are placed under /usr/sbin/; we must make all of those files executable:
           # sudo chmod a+x  file_name
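For example, to make all of the generated scripts executable in one pass (the exact file names are assumptions; adjust to whatever was actually installed):
           # sudo chmod a+x /usr/sbin/start-*.sh /usr/sbin/stop-*.sh /usr/sbin/hadoop-daemon*.sh /usr/sbin/slaves.sh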
Notice: Steps 9, 10 and 11 are only for the master server; the slaves do nothing for those steps.
Start and Stop the Hadoop Cluster (run on the master server)
1.    Formatting the namenode:
           # hadoop namenode -format
2.    Starting the Hadoop Cluster
           # start-all.sh
Run JPS command on master server:
          # jps
            922 JobTracker
            815 SecondaryNameNode
            1062 TaskTracker
            521 NameNode
            1136 Jps
Run JPS command on slaves:
          # jps   
7407 DataNode
7521 TaskTracker
7583 Jps
3.    Checking the status of the Hadoop Cluster:
(1)  Type the command:
# hadoop dfsadmin -report
            (2) Browse the web interface for the NameNode (master server) and the JobTracker:
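These notes do not list the URLs; assuming the standard Hadoop 1.x default ports they would be:
                 http://master:50070/   (NameNode web UI)
                 http://master:50030/   (JobTracker web UI)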
4.    Process a sample to test the Hadoop Cluster (wordcount example):
(1)  Create a directory on the master server:
                 # mkdir input
          
(2) Create three test files under the 'input' directory and add the following text into them:
                 echo "Hello haifzhan" >> text1.txt
                 echo "Hello hadoop" >> text2.txt
                 echo "Hello hadoop again" >> text3.txt
(3) Copy the test files from the master server to Hadoop's HDFS (run this from inside the 'input' directory):
                 # hadoop dfs -put ./   input
            (4) Now you can check the files on Hadoop's HDFS:
                 # hadoop dfs -ls input/*
-rw-r--r--   2 root supergroup         15 2013-04-01 15:03 /user/root/input/text1.txt
-rw-r--r--   2 root supergroup         13 2013-04-01 15:03 /user/root/input/text2.txt
-rw-r--r--   2 root supergroup         19 2013-04-01 15:03 /user/root/input/text3.txt
            (5) Run the MapReduce job:
                 # hadoop jar /usr/share/hadoop/hadoop-example-1.0.3.jar wordcount input output
            (6) Check the result:
                 # hadoop dfs -cat output/part-r-00000
                        Hello     3
                        again     1
                        hadoop    2
                        haifzhan  1
5.  Stopping the Hadoop Cluster 
               # stop-all.sh
Other useful resources:
1. The log files are located in /var/log/hadoop/root
2. Useful websites:
Error Solving:
1. Datanode: No route to host (the datanode starts but then shuts down automatically after a while)
           Close the firewalls on both the master and slave machines:
           # service iptables stop
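Optionally, to keep the firewall from coming back after a reboot (a CentOS 6 detail not mentioned in the original notes):
           # chkconfig iptables off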
2. Namenode: How to exit safemode
           # hadoop dfsadmin -safemode leave
3. How to start the datanode or tasktracker independently
            # hadoop-daemon.sh start datanode
            # hadoop-daemon.sh start tasktracker
4. How to check the current Java version and path on your local machine
            # echo $JAVA_HOME
5.  "process information unavailable" (in the jps output)
           Remove all files under /tmp, reformat the namenode, and restart all servers.
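A sketch of that recovery sequence (stop the cluster, clear /tmp on every server, reformat the namenode on the master, then restart; exactly which files under /tmp matter is not stated in these notes):
           # stop-all.sh
           # rm -rf /tmp/*
           # hadoop namenode -format
           # start-all.sh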