Install Hadoop Cluster

Purpose

This tutorial shows how to install a Hadoop cluster with 1 NameNode and 2 DataNodes.
Hadoop was installed on 3 CentOS VMs (KVM), and this tutorial is a merge of 3 others, which you can find on the internet.

Link1 Link2 Link3

Steps

Each command is prefixed with a Bash prompt that shows where it runs, who runs it, and with which privilege.

Pay attention: [root@master ~]#

  • User: root
  • Host: master.hadoop.local.tld
  • Privilege: # (root shell)

Actions to do on all nodes and master

  • Ensure the firewall is stopped on all nodes
[root@master ~]# systemctl stop firewalld
[root@master ~]# systemctl disable firewalld
[root@node1 ~]# systemctl stop firewalld
[root@node1 ~]# systemctl disable firewalld
[root@node2 ~]# systemctl stop firewalld
[root@node2 ~]# systemctl disable firewalld
  • Install JDK on all nodes
[root@master ~]# rpm -Uvh /root/jdk-8u131-linux-x64.rpm
[root@node1 ~]# rpm -Uvh /root/jdk-8u131-linux-x64.rpm
[root@node2 ~]# rpm -Uvh /root/jdk-8u131-linux-x64.rpm
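To confirm the JDK installed correctly, you can check the reported version on each node (the RPM above is assumed to install under /usr/java/default, which the later configuration relies on):

```shell
# Should print the installed JDK version, e.g. 1.8.0_131
java -version
# The path used later as JAVA_HOME should exist
ls -d /usr/java/default
```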
  • Set up FQDN on all nodes
[root@master ~]# vi /etc/hosts
192.168.122.201 master.hadoop.local.tld
192.168.122.202 node1.hadoop.local.tld
192.168.122.203 node2.hadoop.local.tld
[root@node1 ~]# vi /etc/hosts
192.168.122.202 node1.hadoop.local.tld
192.168.122.203 node2.hadoop.local.tld
192.168.122.201 master.hadoop.local.tld
[root@node2 ~]# vi /etc/hosts
192.168.122.203 node2.hadoop.local.tld
192.168.122.202 node1.hadoop.local.tld
192.168.122.201 master.hadoop.local.tld
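A quick way to sanity-check the entries (shown here against a copy of the three lines, purely for illustration) is to confirm each FQDN appears exactly once; on the real hosts you can also verify resolution with `getent hosts master.hadoop.local.tld`:

```shell
# Copy of the /etc/hosts entries used above, for a quick sanity check
hosts_entries='192.168.122.201 master.hadoop.local.tld
192.168.122.202 node1.hadoop.local.tld
192.168.122.203 node2.hadoop.local.tld'

# Each FQDN should map to exactly one address (prints 1 three times)
for h in master node1 node2; do
  echo "$hosts_entries" | grep -c "^[0-9.]* ${h}\.hadoop\.local\.tld$"
done
```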
  • Add hadoop user on all nodes
[root@master ~]# useradd -d /opt/hadoop hadoop
[root@master ~]# passwd hadoop
[root@node1 ~]# useradd -d /opt/hadoop hadoop
[root@node1 ~]# passwd hadoop
[root@node2 ~]# useradd -d /opt/hadoop hadoop
[root@node2 ~]# passwd hadoop
  • Generate an SSH key on the master and all nodes, and copy it to each one
[root@master ~]# su - hadoop
[hadoop@master ~]$ ssh-keygen -t rsa
[hadoop@master ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@master.hadoop.local.tld
[hadoop@master ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@node1.hadoop.local.tld
[hadoop@master ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@node2.hadoop.local.tld
[root@node1 ~]# su - hadoop
[hadoop@node1 ~]$ ssh-keygen -t rsa
[hadoop@node1 ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@master.hadoop.local.tld
[hadoop@node1 ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@node1.hadoop.local.tld
[hadoop@node1 ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@node2.hadoop.local.tld
[root@node2 ~]# su - hadoop
[hadoop@node2 ~]$ ssh-keygen -t rsa
[hadoop@node2 ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@master.hadoop.local.tld
[hadoop@node2 ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@node1.hadoop.local.tld
[hadoop@node2 ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@node2.hadoop.local.tld
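After copying the keys, it is worth confirming that password-less SSH works in every direction; for example, from the master (using the hostnames configured above):

```shell
# Each command should print the remote hostname without asking for a password
ssh hadoop@node1.hadoop.local.tld hostname
ssh hadoop@node2.hadoop.local.tld hostname
```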
  • Download Hadoop on all nodes
[hadoop@master ~]$ curl -O http://ftp.unicamp.br/pub/apache/hadoop/common/hadoop-2.8.0/hadoop-2.8.0.tar.gz
[hadoop@node1 ~]$ curl -O http://ftp.unicamp.br/pub/apache/hadoop/common/hadoop-2.8.0/hadoop-2.8.0.tar.gz
[hadoop@node2 ~]$ curl -O http://ftp.unicamp.br/pub/apache/hadoop/common/hadoop-2.8.0/hadoop-2.8.0.tar.gz

Ensure you are in /opt/hadoop (the hadoop user's home directory) before extracting:

[hadoop@master ~]$ tar --strip-components=1 -zxvf hadoop-2.8.0.tar.gz
[hadoop@node1 ~]$ tar --strip-components=1 -zxvf hadoop-2.8.0.tar.gz
[hadoop@node2 ~]$ tar --strip-components=1 -zxvf hadoop-2.8.0.tar.gz
  • Edit .bash_profile on all nodes
[hadoop@master ~]$ vi .bash_profile
## JAVA env variables
export JAVA_HOME=/usr/java/default
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar
## HADOOP env variables
export HADOOP_HOME=/opt/hadoop
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
[hadoop@node1 ~]$ vi .bash_profile
## JAVA env variables
export JAVA_HOME=/usr/java/default
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar
## HADOOP env variables
export HADOOP_HOME=/opt/hadoop
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
[hadoop@node2 ~]$ vi .bash_profile
## JAVA env variables
export JAVA_HOME=/usr/java/default
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar
## HADOOP env variables
export HADOOP_HOME=/opt/hadoop
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
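After saving .bash_profile, reload it and confirm the variables took effect (repeat on every node):

```shell
source ~/.bash_profile
echo "$HADOOP_HOME"        # expect /opt/hadoop
hadoop version | head -n1  # expect the Hadoop 2.8.0 banner line
```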

Actions to do on Master ONLY

  • Create some data dirs on Master
[hadoop@master ~]$ mkdir -p /opt/hadoop/hdfs/namenode
[hadoop@master ~]$ mkdir -p /opt/hadoop/hdfs/datanode
[hadoop@master ~]$ mkdir -p /opt/hadoop/hdfs/namesecondary
[hadoop@master ~]$ mkdir -p /opt/hadoop/yarn/local
[hadoop@master ~]$ mkdir -p /opt/hadoop/yarn/log
  • Edit core-site.xml
[hadoop@master ~]$ vi etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master.hadoop.local.tld:9000/</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
</configuration>
  • Edit hdfs-site.xml
[hadoop@master ~]$ vi etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/opt/hadoop/hdfs/datanode</value>
</property>
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>file:/opt/hadoop/hdfs/namesecondary</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
</configuration>
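Two of the values above deserve a note: dfs.replication is 2 because the cluster has two DataNodes, so every block is stored on both; and dfs.block.size is given in bytes, with the value above being exactly 128 MB:

```shell
# 128 MB expressed in bytes, matching the dfs.block.size value above
echo $((128 * 1024 * 1024))   # prints 134217728
```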
  • Edit mapred-site.xml
[hadoop@master ~]$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
[hadoop@master ~]$ vi etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>master.hadoop.local.tld:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>master.hadoop.local.tld:19888</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/user/app</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Djava.security.egd=file:/dev/../dev/urandom</value>
</property>
</configuration>
  • Edit yarn-site.xml
[hadoop@master ~]$ vi etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master.hadoop.local.tld</value>
</property>
<property>
<name>yarn.resourcemanager.bind-host</name>
<value>0.0.0.0</value>
</property>
<property>
<name>yarn.nodemanager.bind-host</name>
<value>0.0.0.0</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>file:/opt/hadoop/yarn/local</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>file:/opt/hadoop/yarn/log</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>hdfs://master.hadoop.local.tld:9000/var/log/hadoop-yarn/apps</value>
</property>
</configuration>
  • Edit hadoop-env.sh
[hadoop@master ~]$ vi etc/hadoop/hadoop-env.sh
# The java implementation to use.
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/java/default/
  • Edit slaves
[hadoop@master ~]$ vi etc/hadoop/slaves
node1.hadoop.local.tld
node2.hadoop.local.tld
  • Format namenode
[hadoop@master ~]$ hdfs namenode -format
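If the format succeeds, the NameNode metadata directory configured in hdfs-site.xml is populated; a quick check:

```shell
# The format step should have created the cluster metadata here,
# including the generated clusterID
cat /opt/hadoop/hdfs/namenode/current/VERSION
```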

Actions to do on Nodes

  • Create some data dirs on Nodes
[hadoop@node1 ~]$ mkdir -p /opt/hadoop/hdfs/datanode
[hadoop@node1 ~]$ mkdir -p /opt/hadoop/yarn/local
[hadoop@node1 ~]$ mkdir -p /opt/hadoop/yarn/log
[hadoop@node2 ~]$ mkdir -p /opt/hadoop/hdfs/datanode
[hadoop@node2 ~]$ mkdir -p /opt/hadoop/yarn/local
[hadoop@node2 ~]$ mkdir -p /opt/hadoop/yarn/log
  • Copy the Hadoop etc directory from the master (run from the hadoop user's home, /opt/hadoop)
[hadoop@node1 ~]$ scp -r master.hadoop.local.tld:etc .
[hadoop@node2 ~]$ scp -r master.hadoop.local.tld:etc .

Actions to do on Master after everything above

  • Start the cluster from the master
[hadoop@master ~]$ source .bash_profile
[hadoop@master ~]$ start-all.sh
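Once start-all.sh finishes, you can verify the daemons with jps (part of the JDK) on each machine, and check that both DataNodes registered with the NameNode:

```shell
# On the master: expect NameNode, SecondaryNameNode and ResourceManager;
# on node1/node2: expect DataNode and NodeManager
jps
# Should report two live DataNodes
hdfs dfsadmin -report
```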