Hadoop Multi-Node Cluster - Installation and Setup

In this tutorial, we will look at the process of setting up a Hadoop multi-node cluster in a distributed environment. Distributing the work across nodes speeds up job execution and saves cost and computation time. Demonstrating a full cluster is out of the scope of this tutorial, so we explain the Hadoop cluster environment using one master and two slave systems. The IP addresses for these are:

Hadoop Master: 192.168.1.15 (hadoop-master)

Hadoop Slave: 192.168.1.16 (hadoop-slave-1)

Hadoop Slave: 192.168.1.17 (hadoop-slave-2)

To set up the Hadoop multi-node cluster, follow the steps below:

Installing Java

Java is the main prerequisite for Hadoop. First, check the Java version installed on your system with the command “$ java -version”. If Java is already installed, this command will display its version. A typical output looks like:

java version “1.7.0_71”

Java(TM) SE Runtime Environment (build 1.7.0_71-b13)

Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If Java is not installed on your system, follow the steps below to install it:

  1. Download the Java tar file “jdk-7u71-linux-x64.gz” from Oracle’s website by visiting the link: http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html

  2. The file is generally downloaded to the Downloads folder on your system. Extract the jdk-7u71-linux-x64.gz file using the commands below:

$ cd Downloads/

$ ls

jdk-7u71-linux-x64.gz

$ tar zxf jdk-7u71-linux-x64.gz

$ ls

jdk1.7.0_71 jdk-7u71-linux-x64.gz

  3. To make Java accessible to all users, move the extracted directory to the location “/usr/local/”. Run the commands below as root:

$ su

password:

# mv jdk1.7.0_71 /usr/local/

# exit

  4. The PATH and JAVA_HOME variables must be set so that Java can be run from any location. Use the commands below to set them:

export JAVA_HOME=/usr/local/jdk1.7.0_71

export PATH=$PATH:$JAVA_HOME/bin

Verify the Java version using the command “java -version” from the terminal.
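
These export commands only affect the current shell session. As an optional step (assuming the Bash shell is in use), you can append them to the user’s ~/.bashrc so they persist across logins, for example:

$ echo 'export JAVA_HOME=/usr/local/jdk1.7.0_71' >> ~/.bashrc

$ echo 'export PATH=$PATH:$JAVA_HOME/bin' >> ~/.bashrc

$ source ~/.bashrc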

Creating User Account

A system user account needs to be created on both the master and slave systems in order to use the Hadoop installation. This can be done using the commands:

# useradd hadoop

# passwd hadoop

Mapping Nodes

The hosts file in the /etc/ folder has to be edited on all the nodes, specifying the IP address of each system followed by its hostname. The commands for this are:

# vi /etc/hosts

Enter the following lines in the /etc/hosts file:

192.168.1.15 hadoop-master

192.168.1.16 hadoop-slave-1

192.168.1.17 hadoop-slave-2
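
You can optionally verify the mapping by pinging each host by name from every node, for example:

# ping -c 1 hadoop-master

# ping -c 1 hadoop-slave-1

# ping -c 1 hadoop-slave-2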

Configuring Key Based Login

SSH has to be set up on each and every node so that the machines can communicate with one another without being prompted for a password. Run the commands below on the master system as the hadoop user to generate an RSA key pair and copy the public key to every node:

# su hadoop

$ ssh-keygen -t rsa

$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master

$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-1

$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-2

$ chmod 0600 ~/.ssh/authorized_keys

$ exit
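
Once the keys are distributed, you can confirm that key-based login works by connecting to each slave from the master as the hadoop user; no password prompt should appear:

$ ssh hadoop-slave-1

$ exit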

Installing Hadoop

The machine designated as the master server should have Hadoop installed on it. This involves unpacking the software on the machine or installing it through a packaging system suited to the operating system. In a Hadoop 1.x cluster, one machine is designated as the NameNode and another as the JobTracker; all the other machines act as both DataNode and TaskTracker. Hadoop can be installed using the commands below:

# mkdir /opt/hadoop

# cd /opt/hadoop/

# wget http://apache.mesi.com.ar/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz

# tar -xzf hadoop-1.2.1.tar.gz

# mv hadoop-1.2.1 hadoop

# chown -R hadoop /opt/hadoop

# cd /opt/hadoop/hadoop/
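
As a quick optional check, you can print the version of the unpacked distribution from the /opt/hadoop/hadoop directory:

# bin/hadoop version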

Configuring Hadoop

Hadoop has to be configured on your system to be able to run as a multi-node distributed cluster. This is done with the edits below:

  1. core-site.xml: Edit the file and add the following properties:

<configuration>

<property>

<name>fs.default.name</name>

<value>hdfs://hadoop-master:9000/</value>

</property>

<property>

<name>dfs.permissions</name>

<value>false</value>

</property>

</configuration>

  2. hdfs-site.xml: Edit the file and add the following properties:

<configuration>

<property>

<name>dfs.data.dir</name>

<value>/opt/hadoop/hadoop/dfs/name/data</value>

<final>true</final>

</property>

<property>

<name>dfs.name.dir</name>

<value>/opt/hadoop/hadoop/dfs/name</value>

<final>true</final>

</property>

<property>

<name>dfs.replication</name>

<value>1</value>

</property>

</configuration>

  3. mapred-site.xml: Edit the file and add the following property:

<configuration>

<property>

<name>mapred.job.tracker</name>

<value>hadoop-master:9001</value>

</property>

</configuration>

  4. hadoop-env.sh: Set JAVA_HOME, HADOOP_OPTS, and HADOOP_CONF_DIR as shown below:

export JAVA_HOME=/usr/local/jdk1.7.0_71

export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

export HADOOP_CONF_DIR=/opt/hadoop/hadoop/conf

Installing Hadoop on Slave Servers

Hadoop can be installed on the slave servers by copying the configured installation from the master, using the commands below:

# su hadoop

$ cd /opt/hadoop

$ scp -r hadoop hadoop-slave-1:/opt/hadoop

$ scp -r hadoop hadoop-slave-2:/opt/hadoop
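
The scp commands above assume that the /opt/hadoop directory already exists on each slave and is writable by the hadoop user. If it does not, create it first as root on each slave, for example:

# mkdir -p /opt/hadoop

# chown -R hadoop /opt/hadoop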

Configuring Hadoop on Master Server

The commands below should be used to configure Hadoop on the master server.

  1. Log in to the master server and change to the Hadoop installation directory:

# su hadoop

$ cd /opt/hadoop/hadoop

  2. Configure the master node with the commands below:

$ vi conf/masters

hadoop-master

  3. Configure the slave nodes with the commands below:

$ vi conf/slaves

hadoop-slave-1

hadoop-slave-2

  4. Format the NameNode on the Hadoop master with the commands below:

# su hadoop

$ cd /opt/hadoop/hadoop

$ bin/hadoop namenode -format

Starting Hadoop Services

The Hadoop services can now be started using the commands below (in Hadoop 1.2.x the start scripts are located in $HADOOP_HOME/bin):

$ cd $HADOOP_HOME/bin

$ ./start-all.sh
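
Once the services are up, you can check which daemons are running on each machine with the jps command; on the master you would typically expect NameNode, SecondaryNameNode, and JobTracker, and on each slave DataNode and TaskTracker (process IDs will vary):

$ jps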

Addition of a New DataNode in the Hadoop Cluster

New nodes can be added to an existing Hadoop cluster, provided the network is configured correctly. Assume the following network configuration for the new node:

IP address : 192.168.1.103

netmask : 255.255.255.0

hostname : slave3.in

Adding User and SSH Access

On the new node, create a hadoop user account and set its password with the commands:

useradd hadoop

passwd hadoop

If you would like passwordless connectivity, execute the commands below on the master node:

mkdir -p $HOME/.ssh

chmod 700 $HOME/.ssh

ssh-keygen -t rsa -P '' -f $HOME/.ssh/id_rsa

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

chmod 644 $HOME/.ssh/authorized_keys

Copy the public key to the hadoop user's home directory on the new slave node:

scp $HOME/.ssh/id_rsa.pub hadoop@192.168.1.103:/home/hadoop/

Then log in to the new node as the hadoop user:

su hadoop

ssh -X hadoop@192.168.1.103

The content of the public key must now be copied into the file “$HOME/.ssh/authorized_keys” on the new node, and its permissions changed, by running the commands below:

cd $HOME

mkdir -p $HOME/.ssh

chmod 700 $HOME/.ssh

cat id_rsa.pub >> $HOME/.ssh/authorized_keys

chmod 644 $HOME/.ssh/authorized_keys

From the master node, check that you can SSH to the new node without entering a password, using either ssh hadoop@192.168.1.103 or ssh hadoop@slave3.in.

Set Hostname of New Node

The hostname can be set in the /etc/sysconfig/network file. On the new slave3 machine, add the lines below:

NETWORKING=yes

HOSTNAME=slave3.in

You might have to restart the machine or run the hostname command on the new machine for the change to take effect; a restart is the simplest option. Then update /etc/hosts on all machines in the cluster with the following entry:

192.168.1.103 slave3.in slave3
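
If you prefer not to reboot, you can apply the new hostname immediately with the hostname command (available on most Linux distributions):

# hostname slave3.in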

Start the DataNode on New Node

You can start the DataNode daemon manually using the $HADOOP_HOME/bin/hadoop-daemon.sh script. It will automatically connect to the cluster. Below are the commands to be used:

  1. Log in to the new node: su hadoop or ssh -X hadoop@192.168.1.103

  2. Start HDFS on a new slave node: ./bin/hadoop-daemon.sh start datanode

  3. Check the output of the jps command; it should show a running DataNode process, for example:

$ jps

7141 DataNode

10312 Jps
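
Optionally, so that the new node is also started by start-all.sh in the future, append its hostname to the slaves file on the master (assuming the conf/slaves file configured earlier):

$ echo "slave3.in" >> /opt/hadoop/hadoop/conf/slaves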

Remove a Datanode from Hadoop Cluster

A node can be removed from a cluster on the fly, while jobs are running, without losing critical data. The decommissioning feature in HDFS ensures that a node is removed safely. Follow the steps below to perform this action:

  1. Log in to the master machine as the user account that runs Hadoop: $ su hadoop.

  2. Update the cluster configuration: the exclude file should be configured before the cluster is started. Add a key named ‘dfs.hosts.exclude’ to the $HADOOP_HOME/conf/hdfs-site.xml file. The value associated with this key is the full path to a file on the NameNode's local file system that contains the list of machines which are not permitted to connect to HDFS. Add the lines below to the conf/hdfs-site.xml file:

<property>

<name>dfs.hosts.exclude</name>

<value>/home/hadoop/hadoop-1.2.1/hdfs_exclude.txt</value>

<description>DFS exclude</description>

</property>

  3. Define hosts to decommission: each machine to be decommissioned should be added to the file identified by hdfs_exclude.txt, one hostname per line. Listing a host there stops it from connecting to the NameNode. For example, to remove DataNode2, the file at ‘/home/hadoop/hadoop-1.2.1/hdfs_exclude.txt’ should contain the entry below:

slave2.in

  4. Force a configuration reload: run the command below so that the NameNode re-reads its configuration:

$ $HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes

This command forces the NameNode to re-read its configuration, including the newly updated excludes file, and decommissions the listed nodes, giving each node's blocks time to be replicated onto the machines that are set to remain active. If you check the output of the jps command on slave2.in, you will see the DataNode process shut down after some time.

  5. Shut down the decommissioned nodes: once decommissioning is complete, the decommissioned hardware can be safely shut down for maintenance. Run the report command below to check the status of the decommission; it displays the status of the decommissioned node along with the nodes connected to the cluster.

$ $HADOOP_HOME/bin/hadoop dfsadmin -report

  6. Edit the excludes file again: after the machines have been decommissioned, they can be removed from the excludes file. Running ‘$HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes’ again makes the NameNode re-read the excludes file, and the DataNodes can rejoin the cluster once maintenance is complete or additional capacity is required.

If the TaskTracker process is still running on the node after the above steps, it also needs to be shut down. One way is simply to disconnect the machine; the master will then automatically declare the process dead. Unlike stopping a DataNode, stopping the TaskTracker carries no risk of data loss, because the decommissioning step has already replicated the node's blocks. To stop or start the TaskTracker, use the commands below:

$ $HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker

$ $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker

Conclusion

You now have a basic idea of how a Hadoop multi-node cluster is created. The steps above, covering installation, configuration, adding a new DataNode, and decommissioning a node, should give you a solid foundation for setting up your own cluster.