Install Hadoop Cluster

Hadoop Cluster

Purpose

This tutorial shows how to install a Hadoop cluster with 1 NameNode and 2 DataNodes.
Hadoop was installed on 3 CentOS VMs (KVM). This tutorial merges 3 others, which you can find on the internet.

Link1 Link2 Link3

Steps

Each command is prefixed with a Bash prompt that shows which host it runs on, which user runs it, and with which privilege.

Pay attention: [root@master ~]#

  • User root
  • Host master.hadoop.local.tld
  • Privilege # (root)

Actions to do on all nodes and master

  • Ensure the firewall is stopped
[[email protected] ~]# systemctl stop firewalld
[[email protected] ~]# systemctl disable firewalld
[[email protected] ~]# systemctl stop firewalld
[[email protected] ~]# systemctl disable firewalld
  • Install JDK on all nodes
[[email protected] ~]# rpm -Uvh /root/jdk-8u131-linux-x64.rpm
  • Set up FQDN on all nodes
[[email protected] ~]# vi /etc/hosts
192.168.122.201 master.hadoop.local.tld
192.168.122.202 node1.hadoop.local.tld
192.168.122.203 node2.hadoop.local.tld
[[email protected] ~]# vi /etc/hosts
192.168.122.202 node1.hadoop.local.tld
192.168.122.203 node2.hadoop.local.tld
192.168.122.201 master.hadoop.local.tld
[[email protected] ~]# vi /etc/hosts
192.168.122.203 node2.hadoop.local.tld
192.168.122.202 node1.hadoop.local.tld
192.168.122.201 master.hadoop.local.tld
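
Editing /etc/hosts by hand on three machines is easy to get wrong. A minimal sketch of an idempotent alternative, assuming it is run as root on each machine. The function name add_host_entry is hypothetical; the IPs and names are the ones listed above.

```shell
# Hypothetical helper: append "IP NAME" to a hosts file
# unless NAME is already present in it.
add_host_entry() {
  local file="$1" ip="$2" name="$3"
  # -F: fixed string, -w: whole-word match, -q: quiet
  if ! grep -qwF "$name" "$file"; then
    printf '%s %s\n' "$ip" "$name" >> "$file"
  fi
}

# Example (as root, on every machine):
#   add_host_entry /etc/hosts 192.168.122.201 master.hadoop.local.tld
#   add_host_entry /etc/hosts 192.168.122.202 node1.hadoop.local.tld
#   add_host_entry /etc/hosts 192.168.122.203 node2.hadoop.local.tld
```

Running it twice leaves a single entry per name, so it is safe to re-run.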
  • Add hadoop user on all nodes
[[email protected] ~]# useradd -d /opt/hadoop hadoop
[[email protected] ~]# passwd hadoop
[[email protected] ~]# useradd -d /opt/hadoop hadoop
[[email protected] ~]# passwd hadoop
[[email protected] ~]# useradd -d /opt/hadoop hadoop
[[email protected] ~]# passwd hadoop
  • Generate a key on all nodes and master and copy it to each one
[[email protected] ~]# su - hadoop
[[email protected] ~]$ ssh-keygen -t rsa
[[email protected] ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub [email protected]
[[email protected] ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub [email protected]
[[email protected] ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub [email protected]
[[email protected] ~]# su - hadoop
[[email protected] ~]$ ssh-keygen -t rsa
[[email protected] ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub [email protected]
[[email protected] ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub [email protected]
[[email protected] ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub [email protected]
[[email protected] ~]# su - hadoop
[[email protected] ~]$ ssh-keygen -t rsa
[[email protected] ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub [email protected]
[[email protected] ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub [email protected]
[[email protected] ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub [email protected]
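
The ssh-copy-id calls above follow one pattern: on every machine, the hadoop user copies its public key to all three hosts. A sketch of the same step as a loop. copy_key_to_all is a hypothetical helper, and the SSH_COPY_ID variable is an assumption added only so the command can be swapped out for a dry run.

```shell
# Hypothetical helper: copy the current user's public key to USER@HOST
# for every HOST given. Set SSH_COPY_ID=echo to print instead of run.
copy_key_to_all() {
  local user="$1"
  shift
  local host
  for host in "$@"; do
    "${SSH_COPY_ID:-ssh-copy-id}" -i "$HOME/.ssh/id_rsa.pub" "${user}@${host}" || return 1
  done
}

# Example (as the hadoop user, on every machine, after ssh-keygen):
#   copy_key_to_all hadoop master.hadoop.local.tld \
#     node1.hadoop.local.tld node2.hadoop.local.tld
```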
  • Download hadoop on all nodes
[[email protected] ~]$ curl -O http://ftp.unicamp.br/pub/apache/hadoop/common/hadoop-2.8.0/hadoop-2.8.0.tar.gz
[[email protected] ~]$ curl -O http://ftp.unicamp.br/pub/apache/hadoop/common/hadoop-2.8.0/hadoop-2.8.0.tar.gz
[[email protected] ~]$ curl -O http://ftp.unicamp.br/pub/apache/hadoop/common/hadoop-2.8.0/hadoop-2.8.0.tar.gz

* Ensure you are in /opt/hadoop *

[[email protected] ~]$ tar --strip-components=1 -zxvf hadoop-2.8.0.tar.gz
[[email protected] ~]$ tar --strip-components=1 -zxvf hadoop-2.8.0.tar.gz
[[email protected] ~]$ tar --strip-components=1 -zxvf hadoop-2.8.0.tar.gz
  • Edit .bash_profile on all nodes
[[email protected] ~]$ vi .bash_profile
## JAVA env variables
export JAVA_HOME=/usr/java/default
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar
## HADOOP env variables
export HADOOP_HOME=/opt/hadoop
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
[[email protected] ~]$ vi .bash_profile
## JAVA env variables
export JAVA_HOME=/usr/java/default
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar
## HADOOP env variables
export HADOOP_HOME=/opt/hadoop
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
[[email protected] ~]$ vi .bash_profile
## JAVA env variables
export JAVA_HOME=/usr/java/default
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar
## HADOOP env variables
export HADOOP_HOME=/opt/hadoop
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Actions to do on Master ONLY

  • Create some data dirs on Master
[[email protected] ~]$ mkdir -p /opt/hadoop/hdfs/namenode
[[email protected] ~]$ mkdir -p /opt/hadoop/hdfs/datanode
[[email protected] ~]$ mkdir -p /opt/hadoop/hdfs/namesecondary
[[email protected] ~]$ mkdir -p /opt/hadoop/yarn/local
[[email protected] ~]$ mkdir -p /opt/hadoop/yarn/log
  • Edit core-site.xml
[[email protected] ~]$ vi etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master.hadoop.local.tld:9000/</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
</configuration>
  • Edit hdfs-site.xml
[[email protected] ~]$ vi etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/opt/hadoop/hdfs/datanode</value>
</property>
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>file:/opt/hadoop/hdfs/namesecondary</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
</configuration>
  • Edit mapred-site.xml
[[email protected] ~]$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
[[email protected] ~]$ vi etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>master.hadoop.local.tld:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>master.hadoop.local.tld:19888</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/user/app</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Djava.security.egd=file:/dev/../dev/urandom</value>
</property>
</configuration>
  • Edit yarn-site.xml
[[email protected] ~]$ vi etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master.hadoop.local.tld</value>
</property>
<property>
<name>yarn.resourcemanager.bind-host</name>
<value>0.0.0.0</value>
</property>
<property>
<name>yarn.nodemanager.bind-host</name>
<value>0.0.0.0</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>file:/opt/hadoop/yarn/local</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>file:/opt/hadoop/yarn/log</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>hdfs://master.hadoop.local.tld:9000/var/log/hadoop-yarn/apps</value>
</property>
</configuration>
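
The HDFS URI appears in more than one file (fs.defaultFS in core-site.xml and yarn.nodemanager.remote-app-log-dir in yarn-site.xml), and both must name the same NameNode host and port. A rough sketch for reading a property out of these files so the values can be compared. get_prop is a hypothetical helper and assumes each <name> and <value> sits on a single line, as in the listings above.

```shell
# Hypothetical helper: print the <value> of property $2 from Hadoop config file $1.
# Assumes <name> and <value> each fit on one line.
get_prop() {
  awk -v key="$2" '
    /<name>/ {
      n = $0
      sub(/.*<name>/, "", n); sub(/<\/name>.*/, "", n)
      found = (n == key)          # remember whether this is the wanted property
    }
    found && /<value>/ {
      v = $0
      sub(/.*<value>/, "", v); sub(/<\/value>.*/, "", v)
      print v
      exit
    }
  ' "$1"
}

# Example, run from /opt/hadoop:
#   get_prop etc/hadoop/core-site.xml fs.defaultFS
#   get_prop etc/hadoop/yarn-site.xml yarn.nodemanager.remote-app-log-dir
```

Once Hadoop is on the PATH, hdfs getconf -confKey fs.defaultFS reports the value Hadoop itself resolves, which is a useful cross-check.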
  • Edit hadoop-env.sh
[[email protected] ~]$ vi etc/hadoop/hadoop-env.sh
# The java implementation to use.
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/java/default/
  • Edit slaves
[[email protected] ~]$ vi etc/hadoop/slaves
node1.hadoop.local.tld
node2.hadoop.local.tld
  • Format namenode
[[email protected] ~]$ hdfs namenode -format

Actions to do on Nodes

  • Create some data dirs on Nodes
[[email protected] ~]$ mkdir -p /opt/hadoop/hdfs/datanode
[[email protected] ~]$ mkdir -p /opt/hadoop/yarn/local
[[email protected] ~]$ mkdir -p /opt/hadoop/yarn/log
[[email protected] ~]$ mkdir -p /opt/hadoop/hdfs/datanode
[[email protected] ~]$ mkdir -p /opt/hadoop/yarn/local
[[email protected] ~]$ mkdir -p /opt/hadoop/yarn/log
  • Copy etc from master
[[email protected] ~]$ scp -r master.hadoop.local.tld:etc .
[[email protected] ~]$ scp -r master.hadoop.local.tld:etc .

Actions to do on master after all

  • Start at master
[[email protected] ~]$ source .bash_profile
[[email protected] ~]$ start-all.sh
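
After start-all.sh returns, it helps to confirm which daemons actually came up. The JDK's jps tool lists running Java processes. With this layout the master would normally show NameNode, SecondaryNameNode and ResourceManager, and each node DataNode and NodeManager. A small sketch that reads jps output and prints any expected daemon that is absent; missing_daemons is a hypothetical name.

```shell
# Hypothetical helper: read `jps` output ("PID ClassName" per line) on stdin
# and print each expected daemon name (given as arguments) that is not listed.
missing_daemons() {
  local out name
  out=$(cat)
  for name in "$@"; do
    printf '%s\n' "$out" |
      awk -v n="$name" '$2 == n { ok = 1 } END { exit !ok }' ||
      printf '%s\n' "$name"
  done
}

# Example:
#   on master: jps | missing_daemons NameNode SecondaryNameNode ResourceManager
#   on a node: jps | missing_daemons DataNode NodeManager
```

Empty output means every expected daemon is running.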

Install Gitlab CE

I describe here how to install and set up Gitlab-CE.

Change Git

The latest Gitlab-CE version needs git 2.x or later.
On CentOS 7, install the IUS repo

rpm -ivh https://centos7.iuscommunity.org/ius-release.rpm

Afterwards, install yum-plugin-replace and replace the official git with git2u

yum install yum-plugin-replace
yum replace git --replace-with git2u

Install Gitlab

Fetch Gitlab-ce repository

First, you need to set-up the repository:

curl -sS https://packages.gitlab.com/install/repositories/gitlab/gitlab-ce/script.rpm.sh | sudo bash

And Install Gitlab-ce packages

yum install gitlab-ce -y

Docker tip

If, like me, you use Docker to run your CI, you should install docker-ce and add the gitlab-runner user to the docker group

usermod -aG docker gitlab-runner

Enable Gitlab-CE service

And Start

systemctl enable gitlab
systemctl enable gitlab-runsvdir.service
systemctl start gitlab
systemctl start gitlab-runsvdir.service

Install Lets Encrypt

To continue with the configuration, you should install Let’s Encrypt.

Enable Epel and install Certbot

yum install epel-release
yum install certbot

Create a directory that Let’s Encrypt uses to verify that the domain points to the server where Gitlab is installed.

mkdir -p /var/www/letsencrypt

Edit /etc/gitlab/gitlab.rb and create a nginx redirect to this dir

vi /etc/gitlab/gitlab.rb
nginx['custom_gitlab_server_config'] = "location ^~ /.well-known {\n alias /var/www/letsencrypt/.well-known;\n}\n"

Reconfigure the Gitlab

gitlab-ctl reconfigure

And run certbot command to request your certs

certbot certonly -a webroot --webroot-path=/var/www/letsencrypt -d gitlab.domain.tld -d reg-gitlab.domain.tld

Configure Gitlab

You need to change some configs at /etc/gitlab/gitlab.rb, but the most important for me are listed below.

Change time zone

gitlab_rails['time_zone'] = 'America/Sao_Paulo'

Change git data dir

In my case I created a mount point /gitlab.

git_data_dirs({ "default" => { "path" => "/gitlab/git-data" } })

The Registry configs

Yes, I use a local registry to store the projects containers built by CI.

################################################################################
## Container Registry settings
##! Docs: https://docs.gitlab.com/ce/administration/container_registry.html
################################################################################

registry_external_url 'https://reg-gitlab.domain.tld'

### Settings used by GitLab application
gitlab_rails['registry_enabled'] = true
gitlab_rails['registry_host'] = "reg-gitlab.domain.tld"
gitlab_rails['registry_path'] = "/gitlab/registry"

Nginx configs

################################################################################
## GitLab Nginx
##! Docs: https://docs.gitlab.com/omnibus/settings/nginx.html
################################################################################

nginx['redirect_http_to_https'] = true

nginx['ssl_certificate'] = "/etc/letsencrypt/live/gitlab.domain.tld/fullchain.pem"
nginx['ssl_certificate_key'] = "/etc/letsencrypt/live/gitlab.domain.tld/privkey.pem"
nginx['ssl_ciphers'] = "ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256"
nginx['ssl_prefer_server_ciphers'] = "on"

nginx['ssl_protocols'] = "TLSv1 TLSv1.1 TLSv1.2"

Nginx and the registry

################################################################################
## Registry NGINX
################################################################################

registry_nginx['enable'] = true
registry_nginx['redirect_http_to_https'] = true
registry_nginx['redirect_http_to_https_port'] = 80
registry_nginx['ssl_ciphers'] = "ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-RSA-AES256-SHA256:DHE-RSA-AES256-SHA:ECDHE-ECDSA-DES-CBC3-SHA:ECDHE-RSA-DES-CBC3-SHA:EDH-RSA-DES-CBC3-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:DES-CBC3-SHA:!DSS"
registry_nginx['ssl_prefer_server_ciphers'] = "on"
registry_nginx['ssl_certificate'] = "/etc/letsencrypt/live/gitlab.domain.tld/fullchain.pem"
registry_nginx['ssl_certificate_key'] = "/etc/letsencrypt/live/gitlab.domain.tld/privkey.pem"

Reconfigure the Gitlab

gitlab-ctl reconfigure

Install Gitlab-Runner

Fetch Gitlab-Runner repository

curl -sS https://packages.gitlab.com/install/repositories/runner/gitlab-ci-multi-runner/script.rpm.sh | sudo bash

And install the package

yum install gitlab-ci-multi-runner -y

Enable Gitlab-Runner service

and start

systemctl enable gitlab-runner.service
systemctl start gitlab-runner.service

Register a Runner

First of all, you need to get the registration token. It can be found at https://gitlab.domain.tld/admin/runners

gitlab-ci-multi-runner register

You will be asked something like this:

Running in system-mode.                            

Please enter the gitlab-ci coordinator URL (e.g. https://gitlab.com/):
https://gitlab.domain.tld/
Please enter the gitlab-ci token for this runner:
_z2PxQuMW7dAeHJPJ4jo
Please enter the gitlab-ci description for this runner:
[host.domain.tld]: docker-dind
Please enter the gitlab-ci tags for this runner (comma separated):
docker, dind
Whether to run untagged builds [true/false]:
[false]:
Whether to lock Runner to current project [true/false]:
[false]:
Registering runner... succeeded runner=_z2PxQuM
Please enter the executor: parallels, docker-ssh+machine, kubernetes, docker, docker-ssh, shell, ssh, virtualbox, docker+machine:
docker
Please enter the default Docker image (e.g. ruby:2.1):
docker:latest
Runner registered successfully. Feel free to start it, but if it's running already the config should be automatically reloaded!

This registers a docker-dind runner.

Then edit /etc/gitlab-runner/config.toml

[[runners]]
name = "docker-dind"
url = "https://gitlab.domain.tld/"
token = "vae7gu3shaid8xaikohfoojei1ha1h"
executor = "docker"
environment = ["VAR1=value1", "VAR2=value2"]
[runners.docker]
tls_verify = false
image = "docker:latest"
privileged = true
disable_cache = false
volumes = ["/cache", "/gitlab/docker-images-pipeline:/images:rw"]
services = ["docker:dind"]
shm_size = 0
[runners.cache]

Understand some configs:

  • environment = ["VAR1=value1", "VAR2=value2"]: Used to pass environment variables to the runner
  • privileged = true: The container needs privileged mode to use Docker-in-Docker
  • volumes = ["/cache", "/gitlab/docker-images-pipeline:/images:rw"]: The last volume (docker-images-pipeline)
    is used to keep Docker images between pipeline steps. You can use docker save -o /images/NameOfTheImage.img NameOfTheImage
    to save an image and docker load -i /images/NameOfTheImage.img to load it again in the next step.
  • services = ["docker:dind"]: This entry starts another container, in this case dind, to run a service needed
    by the runner image. The dind service runs a Docker daemon that provides the Docker service to the runner.

Set AWS CentOS7 Hostname

How to change the hostname of a CentOS 7 EC2 instance on AWS

First, connect to your instance and…

set the hostname

hostnamectl set-hostname --static hostname.domain.tld

Afterwards, you should edit cloud.cfg

vi /etc/cloud/cloud.cfg

and add preserve_hostname: true

...
  paths:
    cloud_dir: /var/lib/cloud
    templates_dir: /etc/cloud/templates
  ssh_svcname: sshd

preserve_hostname: true

# vim:syntax=yaml
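
The same cloud.cfg change can be scripted so that it is safe to repeat. A minimal sketch, assuming GNU sed (as shipped with CentOS 7) and a top-level preserve_hostname key; ensure_preserve_hostname is a hypothetical helper name.

```shell
# Hypothetical helper: make sure the given cloud.cfg contains
# a top-level "preserve_hostname: true" line, exactly once.
ensure_preserve_hostname() {
  local cfg="$1"
  if grep -q '^preserve_hostname:' "$cfg"; then
    # Key already present: force its value to true
    sed -i 's/^preserve_hostname:.*/preserve_hostname: true/' "$cfg"
  else
    # Key missing: append it after a blank line
    printf '\npreserve_hostname: true\n' >> "$cfg"
  fi
}

# Example (as root):
#   ensure_preserve_hostname /etc/cloud/cloud.cfg
```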

Another very important thing when you use EC2 instances is to set the time zone.

ln -sf /usr/share/zoneinfo/America/Sao_Paulo /etc/localtime

Docker commands to survive

docker ps --size

An important command which I discovered and use is docker ps --size.
It shows the size of each container.

[[email protected] ~]$ docker ps --size
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES SIZE
dbffabfb478a docker:17.05-dind "dockerd-entrypoin..." 4 weeks ago Up About a minute 2375/tcp docker1705-dind 10.2MB (virtual 110MB)

In this case, the SIZE column shows 10.2MB (virtual 110MB). Compare with the output of docker images:

[[email protected] ~]$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
docker 17.05-dind b547d892dffa 4 weeks ago 99.6MB

This shows that the virtual size is the image size plus 10.2MB (this space was used by creating a file with the dd command).

The layers that make up the base image are not duplicated for each container you run with Docker; only the difference is stored on disk.
In this case that difference is the 10.2MB.

Free space used by Docker containers and images

[[email protected] ~]$ docker rm $(docker ps -aq)
[[email protected] ~]$ docker rmi $(docker images -aq)
[[email protected] ~]$ docker volume rm $(docker volume ls -q)

Inspect a container and discover its pid

If you see a PID using a lot of resources (like CPU or memory), use this command to list all
running containers with their respective PID - ID - NAME.

[[email protected] docker]$ docker ps -q | xargs docker inspect --format '{{.State.Pid}}|{{.ID}}|{{.Name}}'
5746|413810b0c00fe51ac616205db90db222915410202263dc1d2493de5916146534|/test1
4383|dbffabfb478adb2755d5574f586e2250e63dd602bba7706bb2e79252096f036e|/docker1705-dind
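
To go from a PID straight to its container, the listing can be filtered. A minimal sketch with container_for_pid as a hypothetical helper. It only matches the container's main process PID (.State.Pid), not child processes started inside the container.

```shell
# Hypothetical helper: read "PID|ID|NAME" lines on stdin and print the NAME
# of the container whose main PID equals $1. Exits non-zero if none matches.
container_for_pid() {
  awk -F'|' -v pid="$1" '
    $1 == pid { print $3; found = 1 }
    END { exit !found }
  '
}

# Example:
#   docker ps -q | xargs docker inspect --format '{{.State.Pid}}|{{.ID}}|{{.Name}}' \
#     | container_for_pid 4383
```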

Install Docker CE

To install Docker from the official repository, you should first remove any previously installed version.

yum remove docker \
docker-common \
container-selinux \
docker-selinux \
docker-engine \
docker-engine-selinux

Install yum-utils first, since it provides yum-config-manager

yum install -y yum-utils

Add the official repository

yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

And other required packages

yum install -y device-mapper-persistent-data lvm2

Now you should be able to install docker-ce

yum install -y docker-ce

Enable and start Docker

systemctl enable docker
systemctl start docker

Run hello-world

docker run hello-world