Installing Hadoop on a single-node as well as a multi-node cluster based on VMs running Debian 9 Linux

 

This article describes how to build a Hadoop cluster, starting from a pseudo-distributed configuration. The first section explains how to install Hadoop on Debian 9 Linux. After this installation, the Hadoop cluster consists of only one node (i.e. a single-node cluster) and MapReduce jobs are executed in a pseudo-distributed manner. In order to use more Hadoop features, we will then modify the configuration so that jobs are executed in a fully distributed manner, and the Hadoop cluster will be composed of more than one node (i.e. a multi-node cluster).

For this tutorial, we assume that you already have a virtual machine running Linux (Debian 9). The steps should in principle also work with other Linux distributions. You can use VirtualBox to build the virtual machines. In order to build a cluster, you must also be able to provision more than one virtual machine.

There are several solutions for procuring machines to build a cluster, but they are generally paid. If you prefer a free solution, I suggest you design your cluster by mounting Linux virtual machines on your local machine. Obviously, the latter must have sufficient resources, such as memory and storage space. If you prefer this solution, follow these steps:

☝️ The next tutorial will explain how to install Spark Standalone and Hadoop Yarn modes on Multi-Node Cluster.

   

Install and configure Hadoop with NameNode & DataNode on single node

 

1- Prepare Linux

Commands with root

login as root user

user@debian:~$ su root

  • Turn off the firewall

root@debian:~# service firewalld status

root@debian:~# service firewalld stop

root@debian:~# systemctl disable firewalld

  • Change the hostname and set up the Fully Qualified Domain Name (FQDN) (here the hostname will be master-namenode and the FQDN master-namenode.cluster.hdp)

Display the current hostname (the one set during the system installation)

root@debian:~# cat /etc/hostname

Edit the hostname (this can now be changed to any other name – in this example, master-namenode)

root@debian:~# vi /etc/hostname --remove the existing name and write the below

master-namenode

In order to set the Fully Qualified Domain Name, the IP address of the server is required, in addition to your chosen FQDN. Comment out all the existing lines by preceding them with the character # (or remove them) and add the following line (replacing the IP address with the address of your server)

root@debian:~# vi /etc/hosts --your file should look like the below

192.168.1.72	master-namenode.cluster.hdp 	master-namenode

The change will take effect after the next restart – should you want the changes to take place without restarting, the following command will achieve that

root@debian:~# hostname master-namenode

⚠️ WARNING
Changing the hostname with the command :~# hostname master-namenode is only temporary and will be overwritten when rebooted. To make the change permanent, it is necessary to edit the /etc/hostname file. This is not a replacement for the first step.
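
☝️ On a systemd-based system such as Debian 9, hostnamectl can do both steps at once (it updates /etc/hostname and the running hostname); this is only a shortcut for the procedure above:

root@debian:~# hostnamectl set-hostname master-namenode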

Check the hostname was successfully edited by typing

root@debian:~# hostname --should return

master-namenode

The FQDN is verified with this command

root@debian:~# hostname -f --should return

master-namenode.cluster.hdp
  • Create a user for Hadoop (here the Hadoop user will be "hdpuser")

For Debian OS users login as root and do the following:

root@master-namenode:~# apt-get install sudo

root@master-namenode:~# adduser hdpuser

root@master-namenode:~# usermod -aG sudo hdpuser --to add the user to the sudo group. This can also be done as described in (*) below

root@master-namenode:~# getent group sudo --to verify if the new Debian sudo user was added to the group, for more details see this site.

root@master-namenode:~# deluser --remove-home username --to delete username

Verify Sudo Access in Debian

root@master-namenode:~# su - hdpuser --switch to the user account you just created

hdpuser@master-namenode:~$ sudo whoami --run any command that requires superuser access; for example, this should return

root

  • Add Hadoop user to sudoers file (*), for more details see this link.

root@master-namenode:~# visudo -f /etc/sudoers --and under the below section add

## Allow root to run any commands anywhere
root	ALL=(ALL)	ALL
hdpuser ALL=(ALL)	ALL     ##add this line

Commands with hdpuser

login as hdpuser

  • Install SSH server

hdpuser@master-namenode:~$ sudo apt-get install ssh

  • Install rsync which allows remote file synchronizations using SSH

hdpuser@master-namenode:~$ sudo apt-get install rsync

  • Generate SSH keys and set up passwordless SSH between the Hadoop services

hdpuser@master-namenode:~$ ssh-keygen -t rsa --just press Enter for all choices

hdpuser@master-namenode:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

hdpuser@master-namenode:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub hdpuser@master-namenode --(you should now be able to ssh without being prompted for a password)

hdpuser@master-namenode:~$ ssh hdpuser@master-namenode

Are you sure you want to continue connecting (yes/no)? yes
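
If ssh still prompts for a password after ssh-copy-id, a frequent cause is overly permissive rights on the .ssh directory; tightening them is a safe thing to try (not part of the original procedure, but standard sshd behaviour):

hdpuser@master-namenode:~$ chmod 700 ~/.ssh --the .ssh directory must not be writable by group or others

hdpuser@master-namenode:~$ chmod 600 ~/.ssh/authorized_keys --same restriction for the authorized_keys file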

If you have a problem at this step, it may be because your system uses /etc/network/interfaces for network configuration (as older Debian systems do) and the machine's IP address does not match the entry in /etc/hosts. Here is how to resolve this by setting a static address:

  1. Open the configuration file: sudo nano /etc/network/interfaces
  2. Add this

auto enp0s3
iface enp0s3 inet static
    address 192.168.1.72
    netmask 255.255.255.0
    gateway 192.168.1.1
    dns-nameservers 8.8.8.8 8.8.4.4

  3. Apply the changes: sudo systemctl restart networking
  4. Verify the new IP: ip addr show enp0s3
  5. Restart SSH: sudo systemctl restart ssh
  6. Test remote SSH: ssh hdpuser@192.168.1.72 or ssh hdpuser@master-namenode

hdpuser@master-namenode:~$ exit

  • Create the needed directories:

hdpuser@master-namenode:~$ sudo mkdir /var/log/hadoop

hdpuser@master-namenode:~$ sudo chown -R hdpuser:hdpuser /var/log/hadoop

hdpuser@master-namenode:~$ sudo chmod -R 770 /var/log/hadoop

hdpuser@master-namenode:~$ sudo mkdir /bigdata

hdpuser@master-namenode:~$ sudo chown -R hdpuser:hdpuser /bigdata

hdpuser@master-namenode:~$ sudo chmod -R 770 /bigdata
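
To confirm that the ownership and permissions were applied as intended, a quick check (both directories should show drwxrwx--- and hdpuser:hdpuser):

hdpuser@master-namenode:~$ ls -ld /var/log/hadoop /bigdata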

2- Install JDK and Hadoop

login as hdpuser

Installing Java

hdpuser@master-namenode:~$ cd /bigdata

  • Extract the archive (here jdk-8u241-Linux-x64.tar.gz, assumed to have been downloaded to /bigdata) to the installation path

hdpuser@master-namenode:/bigdata$ tar -xzvf jdk-8u241-Linux-x64.tar.gz

  • Setup Environment variables

hdpuser@master-namenode:/bigdata$ cd ~

hdpuser@master-namenode:~$ vi .bashrc --add the below at the end of the file

# User specific environment and startup programs
export PATH=$HOME/.local/bin:$HOME/bin:$PATH

# Setup JAVA Environment variables
export JAVA_HOME=/bigdata/jdk1.8.0_241
export PATH=$JAVA_HOME/bin:$PATH

hdpuser@master-namenode:~$ source .bashrc --load the .bashrc file

  • Install Java

hdpuser@master-namenode:~$ sudo update-alternatives --install "/usr/bin/java" "java" "/bigdata/jdk1.8.0_241/bin/java" 0

hdpuser@master-namenode:~$ sudo update-alternatives --install "/usr/bin/javac" "javac" "/bigdata/jdk1.8.0_241/bin/javac" 0

hdpuser@master-namenode:~$ sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/bigdata/jdk1.8.0_241/bin/javaws" 0

hdpuser@master-namenode:~$ sudo update-alternatives --set java /bigdata/jdk1.8.0_241/bin/java

hdpuser@master-namenode:~$ sudo update-alternatives --set javac /bigdata/jdk1.8.0_241/bin/javac

hdpuser@master-namenode:~$ sudo update-alternatives --set javaws /bigdata/jdk1.8.0_241/bin/javaws

hdpuser@master-namenode:~$ java -version --to check the version

hdpuser@master-namenode:~$ java -version
java version "1.8.0_241"
Java(TM) SE Runtime Environment (build 1.8.0_241-b07)
Java HotSpot(TM) 64-Bit Server VM (build 25.241-b07, mixed mode)

Installing Hadoop

hdpuser@master-namenode:~$ cd /bigdata

  • Extract the archive "hadoop-3.1.2.tar.gz",

hdpuser@master-namenode:/bigdata$ tar -zxvf hadoop-3.1.2.tar.gz

  • Setup Environment variables

hdpuser@master-namenode:/bigdata$ cd --to move to your home directory

hdpuser@master-namenode:~$ vi .bashrc --add the following under the Java Environment variables section into the .bashrc file

# Setup Hadoop Environment variables		
export HADOOP_HOME=/bigdata/hadoop-3.1.2
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC"
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
export HADOOP_LOG_DIR=/var/log/hadoop
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"
export PATH=$HOME/.local/bin:$HOME/bin:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$PATH

export HADOOP_CLASSPATH=$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_HDFS_HOME/*:$HADOOP_HDFS_HOME/lib/*:$HADOOP_MAPRED_HOME/*:$HADOOP_MAPRED_HOME/lib/*:$HADOOP_YARN_HOME/*:$HADOOP_YARN_HOME/lib/*:$HADOOP_CLASSPATH

# Control Hadoop
alias Start_HADOOP='$HADOOP_HOME/sbin/start-dfs.sh;start-yarn.sh;mapred --daemon start historyserver'
alias Stop_HADOOP='$HADOOP_HOME/sbin/stop-dfs.sh;stop-yarn.sh;mapred --daemon stop historyserver'

hdpuser@master-namenode:~$ source .bashrc --after saving the .bashrc file, load it
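
Once the .bashrc is reloaded, the Hadoop binaries should be on the PATH; a quick sanity check:

hdpuser@master-namenode:~$ hadoop version --should report Hadoop 3.1.2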

  • Create the directories for Hadoop data (NameNode & DataNode)

hdpuser@master-namenode:~$ mkdir /bigdata/HadoopData

hdpuser@master-namenode:~$ mkdir /bigdata/HadoopData/namenode --only on the NameNode server

hdpuser@master-namenode:~$ mkdir /bigdata/HadoopData/datanode --on all the DataNode servers

  • Configure Hadoop

hdpuser@master-namenode:~$ cd $HADOOP_CONF_DIR --check the environment variables you just added

  • Modify file: core-site.xml

hdpuser@master-namenode:/bigdata/hadoop-3.1.2/etc/hadoop$ vi core-site.xml --copy core-site.xml file

<configuration>
   <property>
	   <name>fs.defaultFS</name>
	   <value>hdfs://master-namenode:9000</value>
   </property>
</configuration>
  • Modify file: hdfs-site.xml
The parameter dfs.namenode.name.dir must be kept only on the NameNode server. If you also want a DataNode on the NameNode server, set the parameter dfs.datanode.data.dir as well

hdpuser@master-namenode:/bigdata/hadoop-3.1.2/etc/hadoop$ vi hdfs-site.xml --copy hdfs-site.xml file

<configuration>
   <property>
	   <name>dfs.namenode.name.dir</name>
	   <value>file:///bigdata/HadoopData/namenode</value>
   </property>
   <property>
	   <name>dfs.datanode.data.dir</name>
	   <value>file:///bigdata/HadoopData/datanode</value>
   </property>
   <property>
	   <name>dfs.blocksize</name>
	   <value>134217728</value>
   </property>
   <property>
	   <name>dfs.replication</name>
	   <value>1</value>
   </property>
   <property>
	   <name>dfs.permissions</name>
	   <value>false</value>
   </property>
</configuration>
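
☝️ The dfs.blocksize value 134217728 is simply 128 MB expressed in bytes; a quick way to verify the arithmetic:

hdpuser@master-namenode:~$ echo $((128 * 1024 * 1024)) --returns 134217728
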
  • Modify file: mapred-site.xml

hdpuser@master-namenode:/bigdata/hadoop-3.1.2/etc/hadoop$ vi mapred-site.xml --copy mapred-site.xml file

<configuration>
   <property>
	   <name>mapreduce.framework.name</name>
	   <value>yarn</value>
   </property>
   <property>
	   <name>mapreduce.jobhistory.address</name>
	   <value>master-namenode:10020</value>
   </property>
   <property>
	   <name>mapreduce.jobhistory.webapp.address</name>
	   <value>master-namenode:19888</value>
   </property>
   <property>
	   <name>mapreduce.jobhistory.intermediate-done-dir</name>
	   <value>/var/log/hadoop/tmp</value>
   </property>
   <property>
	   <name>mapreduce.jobhistory.done-dir</name>
	   <value>/var/log/hadoop/done</value>
   </property>
   <property>
	   <name>mapreduce.map.memory.mb</name>
	   <value>512</value>
   </property>
   <property>
	   <name>mapreduce.reduce.memory.mb</name>
	   <value>512</value>
   </property>
   <property>
	   <name>mapreduce.map.java.opts</name>
	   <value>-Xmx512M</value>
   </property>
   <property>
	   <name>mapreduce.job.maps</name>
	   <value>2</value>
   </property>
   <property>
	   <name>mapreduce.reduce.java.opts</name>
	   <value>-Xmx512M</value>
   </property>
   <property>
	   <name>mapreduce.task.io.sort.mb</name>
	   <value>128</value>
   </property>
   <property>
	   <name>mapreduce.task.io.sort.factor</name>
	   <value>15</value>
   </property>
   <property>
	   <name>mapreduce.reduce.shuffle.parallelcopies</name>
	   <value>2</value>
   </property>
   <property>
	   <name>yarn.app.mapreduce.am.env</name>
	   <value>HADOOP_MAPRED_HOME=/bigdata/hadoop-3.1.2</value>
   </property>
   <property>
	   <name>mapreduce.map.env</name>
	   <value>HADOOP_MAPRED_HOME=/bigdata/hadoop-3.1.2</value>
   </property>
   <property>
	   <name>mapreduce.reduce.env</name>
	   <value>HADOOP_MAPRED_HOME=/bigdata/hadoop-3.1.2</value>
   </property>
   <property>
	   <name>mapreduce.application.classpath</name>
	   <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
   </property>
</configuration>
  • Modify file: yarn-site.xml

hdpuser@master-namenode:/bigdata/hadoop-3.1.2/etc/hadoop$ vi yarn-site.xml --copy yarn-site.xml file

<configuration>
   <property>
	   <name>yarn.log-aggregation-enable</name>
	   <value>true</value>
   </property>
   <property>
	   <name>yarn.resourcemanager.address</name>
	   <value>master-namenode:8050</value>
   </property>
   <property>
	   <name>yarn.resourcemanager.scheduler.address</name>
	   <value>master-namenode:8030</value>
   </property>
   <property>
	   <name>yarn.resourcemanager.resource-tracker.address</name>
	   <value>master-namenode:8025</value>
   </property>
   <property>
	   <name>yarn.resourcemanager.admin.address</name>
	   <value>master-namenode:8011</value>
   </property>
   <property>
	   <name>yarn.resourcemanager.webapp.address</name>
	   <value>master-namenode:8080</value>
   </property>
   <property>
	   <name>yarn.nodemanager.env-whitelist</name>
	   <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
   </property>
   <property>
	   <name>yarn.resourcemanager.webapp.https.address</name>
	   <value>master-namenode:8090</value>
   </property>
   <property>
	   <name>yarn.resourcemanager.hostname</name>
	   <value>master-namenode</value>
   </property>
   <property>
	   <name>yarn.resourcemanager.scheduler.class</name>
	   <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
   </property>
   <property>
	   <name>yarn.nodemanager.local-dirs</name>
	   <value>file:///var/log/hadoop</value>
   </property>
   <property>
	   <name>yarn.nodemanager.log-dirs</name>
	   <value>file:///var/log/hadoop</value>
   </property>
   <property>
	   <name>yarn.nodemanager.remote-app-log-dir</name>
	   <value>hdfs://master-namenode:9000/tmp/hadoop-yarn</value>
   </property>
   <property>
	   <name>yarn.nodemanager.remote-app-log-dir-suffix</name>
	   <value>logs</value>
   </property>
   <property>
	   <name>yarn.nodemanager.aux-services</name>
	   <value>mapreduce_shuffle</value>
   </property>
   <property>
	   <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>  
	   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
   </property>
   <property>
	   <name>yarn.log.server.url</name>
	   <value>http://master-namenode:19888/jobhistory/logs</value>
   </property>
</configuration>
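
To confirm that Hadoop actually reads the files you just edited, hdfs getconf prints the effective value of a configuration key (fs.defaultFS is used here only as an example):

hdpuser@master-namenode:/bigdata/hadoop-3.1.2/etc/hadoop$ hdfs getconf -confKey fs.defaultFS --should return hdfs://master-namenode:9000
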
  • Modify file: hadoop-env.sh
📝 Edit Hadoop environment file by adding the following environment variables under the section "Set Hadoop-specific environment variables here.":

hdpuser@master-namenode:/bigdata/hadoop-3.1.2/etc/hadoop$ vi hadoop-env.sh --copy hadoop-env.sh

export JAVA_HOME=/bigdata/jdk1.8.0_241
export HADOOP_LOG_DIR=/var/log/hadoop
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/bigdata/hadoop-3.1.2/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=/bigdata/hadoop-3.1.2/lib/native
  • Create workers file

hdpuser@master-namenode:/bigdata/hadoop-3.1.2/etc/hadoop$ vi workers --copy workers file

Write one line for each DataNode server
master-namenode 
  • Format the NameNode

hdpuser@master-namenode:~$ hdfs namenode -format

  • Start & Stop Hadoop
☝️ In principle, to start Hadoop we only need to type start-all.sh. However, I created two aliases, Start_HADOOP and Stop_HADOOP, in the .bashrc file to start and stop Hadoop. I created these aliases in order to avoid conflicts with similarly named commands that come with Spark, which will soon be installed on the same machines. The same rule will also be applied to Spark. The MapReduce Job History Server must be running if you want to see the logs of completed applications in the Web UI; for this, I added the mapred --daemon start historyserver and mapred --daemon stop historyserver commands to the two aliases.
Start

hdpuser@master-namenode:~$ Start_HADOOP

Check Hadoop processes are running
hdpuser@master-namenode:~$ jps  --this command should return something like
1889 ResourceManager
1300 NameNode
1093 JobHistoryServer
1993 NodeManager
2426 Jps
1403 DataNode
1566 SecondaryNameNode
Default Web Interfaces
Service                       Web address                     Default HTTP port
NameNode                      http://master-namenode:9870/    9870
ResourceManager               http://master-namenode:8080/    8080
MapReduce JobHistory Server   http://master-namenode:19888/   19888
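
The web interfaces can also be checked from the command line (assuming curl is installed on the machine):

hdpuser@master-namenode:~$ curl -s -o /dev/null -w "%{http_code}\n" http://master-namenode:9870/ --should print 200
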
Stop

hdpuser@master-namenode:~$ Stop_HADOOP

   

Install Hadoop with NameNode & DataNodes on multi-nodes

 

In this second section, we proceed to build a multi-node cluster. Three virtual machines (nodes) will be considered. If you would like a cluster composed of more than three nodes, you can apply the same steps exposed below.

Assuming the hostnames, IP addresses and services (NameNode and/or DataNode) of the three nodes are as follows:

Hostname            IP Address      NameNode   DataNode
master-namenode     192.168.1.72    yes        yes
slave-datanode-1    192.168.1.73    -          yes
slave-datanode-2    192.168.1.74    -          yes

So far, only one machine (master-namenode) is ready. We have to build and configure the two additional machines. Cloning the master-namenode machine twice and then changing the necessary parameters seems like a good idea.
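
If the virtual machines run under VirtualBox, the cloning can also be done from the command line of the host machine with VBoxManage; a minimal sketch, assuming the VM is registered under the name master-namenode and is shut down before cloning:

VBoxManage clonevm "master-namenode" --name "slave-datanode-1" --register
VBoxManage clonevm "master-namenode" --name "slave-datanode-2" --register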

1- Clone the master-namenode server created above twice

Commands with root

login as root user on the two cloned machines

hdpuser@master-namenode:~$ su root

  • Edit the hostname and set up the FQDN (considering the new hostnames as "slave-datanode-1" and "slave-datanode-2" and keeping the same domain cluster.hdp)

On the first cloned machine (slave-datanode-1 server)

root@master-namenode:~# vi /etc/hostname --remove the existing name and write the below

slave-datanode-1

root@master-namenode:~# vi /etc/hosts --your file should look like the below

192.168.1.72	master-namenode.cluster.hdp 	master-namenode
192.168.1.73	slave-datanode-1.cluster.hdp	slave-datanode-1
192.168.1.74	slave-datanode-2.cluster.hdp	slave-datanode-2

root@master-namenode:~# hostname slave-datanode-1

root@master-namenode:~# hostname --should return

slave-datanode-1

root@master-namenode:~# hostname -f --should return

slave-datanode-1.cluster.hdp

On the second cloned machine (slave-datanode-2 server)

root@master-namenode:~# vi /etc/hostname --remove the existing name and write the below

slave-datanode-2

root@master-namenode:~# vi /etc/hosts --your file should look like the below

192.168.1.72	master-namenode.cluster.hdp 	master-namenode
192.168.1.73	slave-datanode-1.cluster.hdp	slave-datanode-1
192.168.1.74	slave-datanode-2.cluster.hdp	slave-datanode-2

root@master-namenode:~# hostname slave-datanode-2

root@master-namenode:~# hostname --should return

slave-datanode-2

root@master-namenode:~# hostname -f --should return

slave-datanode-2.cluster.hdp
  • Edit the hosts file into the "master-namenode" server

login as hdpuser on "master-namenode" server. Its hosts file must have the same content as the hosts files of the other nodes.

hdpuser@master-namenode:~$ sudo vi /etc/hosts --your file should look like the below

192.168.1.72	master-namenode.cluster.hdp 	master-namenode
192.168.1.73	slave-datanode-1.cluster.hdp	slave-datanode-1
192.168.1.74	slave-datanode-2.cluster.hdp	slave-datanode-2
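
Before going further, it is worth checking that name resolution works on every node; getent queries the hosts entries you just added:

hdpuser@master-namenode:~$ getent hosts master-namenode slave-datanode-1 slave-datanode-2 --should print the three lines above
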
  • Set up passwordless SSH between the Hadoop services

hdpuser@master-namenode:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub hdpuser@slave-datanode-1

hdpuser@master-namenode:~$ ssh hdpuser@slave-datanode-1

Are you sure you want to continue connecting (yes/no)? yes

hdpuser@slave-datanode-1:~$ exit

hdpuser@master-namenode:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub hdpuser@slave-datanode-2

hdpuser@master-namenode:~$ ssh hdpuser@slave-datanode-2

Are you sure you want to continue connecting (yes/no)? yes

hdpuser@slave-datanode-2:~$ exit

Configure Hadoop

  • Edit the workers file on the NameNode (master-namenode) server
⚠️ WARNING
The goal here is to configure the workers file on the NameNode or master server (here master-namenode). Since the latter orchestrates all the DataNode servers, it must know their hostnames, which are listed in its workers file. This file is just a helper file used by the Hadoop scripts to start the appropriate services on the master and slave nodes. Edit the workers file on the master node (master-namenode) only, and add just the hostnames or IP addresses of the master and all slave nodes. If the file has an entry for localhost, you can remove it. The workers files of the slave-datanode-1 and slave-datanode-2 servers can simply be left empty.

hdpuser@master-namenode:~$ vi $HADOOP_CONF_DIR/workers --write one line for each DataNode server (in our case all the machines are considered as DataNodes)

master-namenode   #remove this line from the workers file if you don't want this node to be DataNode
slave-datanode-1
slave-datanode-2
  • Modify file: hdfs-site.xml
⚠️ WARNING
If you need the data to be replicated on more than one DataNode, you must modify the replication factor in the hdfs-site.xml files on all the nodes. This number cannot be greater than the number of DataNodes. We are going to set it to 2 here, which means that for every file stored in HDFS there will be one redundant replica of that file on some other node in the cluster.
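
As a rough illustration of what dfs.replication=2 means for capacity (illustrative numbers, not measurements from this cluster): a 300 MB file is split into 128 MB blocks and every block is stored twice, so the raw HDFS usage equals the file size times the replication factor:

hdpuser@master-namenode:~$ echo $(( (300 + 127) / 128 )) --number of blocks for a 300 MB file: 3

hdpuser@master-namenode:~$ echo $(( 300 * 2 )) --raw storage consumed in MB: 600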

On the NameNode & DataNode (master-namenode) server:

hdpuser@master-namenode:/bigdata/hadoop-3.1.2/etc/hadoop$ vi hdfs-site.xml --copy hdfs-site.xml file

<configuration>
   <property>
	   <name>dfs.namenode.name.dir</name>
	   <value>file:///bigdata/HadoopData/namenode</value>
   </property>
   <property>
	   <name>dfs.datanode.data.dir</name>
	   <value>file:///bigdata/HadoopData/datanode</value>
   </property>
   <property>
	   <name>dfs.blocksize</name>
	   <value>134217728</value>
   </property>
   <property>
	   <name>dfs.replication</name>
	   <value>2</value>
   </property>
   <property>
	   <name>dfs.permissions</name>
	   <value>false</value>
   </property>
</configuration>

On the DataNodes (slave-datanode-1 and slave-datanode-2) servers:

hdpuser@slave-datanode-1:/bigdata/hadoop-3.1.2/etc/hadoop$ vi hdfs-site.xml --copy hdfs-site.xml file

<configuration>
   <property>
	   <name>dfs.datanode.data.dir</name>
	   <value>file:///bigdata/HadoopData/datanode</value>
   </property>
   <property>
	   <name>dfs.blocksize</name>
	   <value>134217728</value>
   </property>
   <property>
	   <name>dfs.replication</name>
	   <value>2</value>
   </property>
   <property>
	   <name>dfs.permissions</name>
	   <value>false</value>
   </property>
</configuration>

hdpuser@slave-datanode-2:/bigdata/hadoop-3.1.2/etc/hadoop$ vi hdfs-site.xml --copy hdfs-site.xml file

<configuration>
   <property>
	   <name>dfs.datanode.data.dir</name>
	   <value>file:///bigdata/HadoopData/datanode</value>
   </property>
   <property>
	   <name>dfs.blocksize</name>
	   <value>134217728</value>
   </property>
   <property>
	   <name>dfs.replication</name>
	   <value>2</value>
   </property>
   <property>
	   <name>dfs.permissions</name>
	   <value>false</value>
   </property>
</configuration>
  • Clean up some old files on all the nodes

hdpuser@master-namenode:~$ rm -rf /bigdata/HadoopData/namenode/*

hdpuser@master-namenode:~$ rm -rf /bigdata/HadoopData/datanode/*

hdpuser@slave-datanode-1:~$ rm -rf /bigdata/HadoopData/datanode/*

hdpuser@slave-datanode-2:~$ rm -rf /bigdata/HadoopData/datanode/*

2- Starting and stopping Hadoop on master-namenode

  • Format the NameNode

hdpuser@master-namenode:~$ hdfs namenode -format


  • Start Hadoop
Start

hdpuser@master-namenode:~$ Start_HADOOP


Check Hadoop processes are running on master-namenode

hdpuser@master-namenode:~$ jps


Check Hadoop processes are running on slave-datanode-1

hdpuser@slave-datanode-1:~$ jps


Check Hadoop processes are running on slave-datanode-2

hdpuser@slave-datanode-2:~$ jps


Default Web Interfaces

NameNode: http://master-namenode:9870/


ResourceManager: http://master-namenode:8080/


Get report
hdpuser@master-namenode:~$ hdfs dfsadmin -report 	--this command should return something like
Configured Capacity: 59836907520 (55.73 GB)
Present Capacity: 27630944256 (25.73 GB)
DFS Remaining: 27630858240 (25.73 GB)
DFS Used: 86016 (84 KB)
DFS Used%: 0.00%
Replicated Blocks:
		Under replicated blocks: 0
		Blocks with corrupt replicas: 0
		Missing blocks: 0
		Missing blocks (with replication factor 1): 0
		Low redundancy blocks with highest priority to recover: 0
		Pending deletion blocks: 0
Erasure Coded Block Groups:
		Low redundancy block groups: 0
		Block groups with corrupt internal blocks: 0
		Missing block groups: 0
		Low redundancy blocks with highest priority to recover: 0
		Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (3):

Name: 192.168.1.72:9866 (master-namenode)
Hostname: master-namenode
Decommission Status : Normal
Configured Capacity: 19945635840 (18.58 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 9707601920 (9.04 GB)
DFS Remaining: 9201225728 (8.57 GB)
DFS Used%: 0.00%
DFS Remaining%: 46.13%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Wed Apr 15 16:44:05 CEST 2020
Last Block Report: Wed Apr 15 16:42:00 CEST 2020
Num of Blocks: 0


Name: 192.168.1.73:9866 (slave-datanode-1)
Hostname: slave-datanode-1
Decommission Status : Normal
Configured Capacity: 19945635840 (18.58 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 9695444992 (9.03 GB)
DFS Remaining: 9213382656 (8.58 GB)
DFS Used%: 0.00%
DFS Remaining%: 46.19%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Wed Apr 15 16:44:04 CEST 2020
Last Block Report: Wed Apr 15 16:41:56 CEST 2020
Num of Blocks: 0


Name: 192.168.1.74:9866 (slave-datanode-2)
Hostname: slave-datanode-2
Decommission Status : Normal
Configured Capacity: 19945635840 (18.58 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 9692577792 (9.03 GB)
DFS Remaining: 9216249856 (8.58 GB)
DFS Used%: 0.00%
DFS Remaining%: 46.21%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Wed Apr 15 16:44:04 CEST 2020
Last Block Report: Wed Apr 15 16:41:56 CEST 2020
Num of Blocks: 0
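
Before stopping the cluster, a small smoke test confirms that HDFS and YARN work end to end; the example below uses the pi sample job shipped with Hadoop (the jar path assumes the /bigdata/hadoop-3.1.2 layout used in this tutorial):

hdpuser@master-namenode:~$ hdfs dfs -mkdir -p /tmp/smoke-test --create a directory in HDFS

hdpuser@master-namenode:~$ hdfs dfs -ls /tmp --it should appear in the listing

hdpuser@master-namenode:~$ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar pi 2 10 --runs a small MapReduce job and prints an estimation of Pi
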
Stop Hadoop

hdpuser@master-namenode:~$ Stop_HADOOP


 

☝️ The next tutorial explains how to install Spark Standalone and Hadoop Yarn modes on Multi-Node Cluster.
