Big Data Solutions using Apache Hadoop with Spark, Hive and Sqoop (2 of 3)

Bakul Gupta
8 min read · Jun 11, 2021


In the previous section, I covered the basics of Apache Hadoop: its architecture, how it works, and the important daemons and processes. Follow Big Data Solutions using Apache Hadoop with Spark, Hive and Sqoop (1 of 3) for more details and for the abbreviations used here.

In this section, we will continue with our cluster setup using a 2 + 3 layout: two master nodes running NN, JN, ZKFC, RM and JHS along with Spark, Hive and Sqoop; the first slave node running JN, DN and NM; and the remaining slave node(s) running DN and NM. We will use Apache Hadoop 3.2.1 in this article.

Pre-Requisites

  • Good hardware is recommended (a minimum of 5 servers for a solid setup; more slave nodes can be added as needed)
  • All servers should run an up-to-date OS (CentOS 7.x)
  • Create the user and group hadoop (UID/GID 1000); a sketch of these OS-level steps follows this list.
  • Make sure the same UID and GID are used on all servers.
  • A dedicated /app file system is recommended for the cluster setup.
  • The home for Hadoop will be /app/hadoop, for Spark /app/spark, for Hive /app/hive and for Sqoop /app/sqoop.
  • Set up password-less SSH (hadoop to hadoop) from the master nodes to the slave nodes. Add the key to the authorized_keys of the master node(s) as well.
  • Disable the iptables/firewalld service
  • Disable SELinux
  • Set the kernel parameter vm.swappiness = 1.
  • Install the Java JDK (including the header/devel packages) on each server.
  • Install and run the ZooKeeper service on the servers assigned to that task (these are in addition to the 5 servers needed for the cluster). ZooKeeper setup is not covered here; refer to other tutorials.
  • Configure HAProxy for JHS using the (Setup HA Proxy) article or another method. It is needed because JHS does not support an HA setup by default.
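
A minimal sketch of the OS-level preparation, assuming CentOS 7, root access and the illustrative hostnames used later in this article (adjust users, hosts and policies to your environment):

# Create the hadoop user and group with a fixed UID/GID on every server
groupadd -g 1000 hadoop
useradd -u 1000 -g hadoop hadoop

# Disable firewall and SELinux
systemctl disable --now firewalld
setenforce 0
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config

# Keep swapping to a minimum
echo 'vm.swappiness = 1' >> /etc/sysctl.conf
sysctl -p

# Password-less SSH for the hadoop user (run as hadoop on each master, repeat per target node)
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
ssh-copy-id hadoop@MasterNode1
ssh-copy-id hadoop@MasterNode2
ssh-copy-id hadoop@Slave1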

Steps to configure cluster

Download the Apache Hadoop tarball from (Apache Hadoop Tar Ball) on one of the master nodes.

Extract it under the /app file system and rename hadoop-3.2.1 to hadoop to keep paths simple.

Change the user and group of all folders and files under /app/hadoop, including /app/hadoop itself, to hadoop:hadoop.
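
A sketch of these steps on master1, assuming the 3.2.1 tarball from the Apache archive (verify the mirror and checksum for your own download):

cd /app
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
tar -xzf hadoop-3.2.1.tar.gz
mv hadoop-3.2.1 hadoop
chown -R hadoop:hadoop /app/hadoop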

The main configuration files are in the etc/hadoop directory under /app/hadoop. The overall directory tree looks like this:

.
├── bin ## This is where all binary commands are present. We use these commands to control processes.
│ ├── container-executor
│ ├── hadoop
│ ├── hadoop.cmd
│ ├── hdfs
│ ├── hdfs.cmd
│ ├── mapred
│ ├── mapred.cmd
│ ├── test-container-executor
│ ├── yarn
│ └── yarn.cmd
├── etc ## This is where all configuration files are saved and configured. We will configure all files here to make the cluster work.
│ └── hadoop
├── include
│ ├── hdfs.h
│ ├── Pipes.hh
│ ├── SerialUtils.hh
│ ├── StringUtils.hh
│ └── TemplateFactory.hh
├── lib ## This is where all library files are saved.
│ └── native
├── libexec
│ ├── hadoop-config.cmd
│ ├── hadoop-config.sh
│ ├── hadoop-functions.sh
│ ├── hadoop-layout.sh.example
│ ├── hdfs-config.cmd
│ ├── hdfs-config.sh
│ ├── mapred-config.cmd
│ ├── mapred-config.sh
│ ├── shellprofile.d
│ ├── tools
│ ├── yarn-config.cmd
│ └── yarn-config.sh
├── LICENSE.txt
├── logs ## This is where daemon log files are saved.
├── NOTICE.txt
├── README.txt
├── sbin ## This is where all scripts are saved. We use them to start and stop cluster.
│ ├── distribute-exclude.sh
│ ├── FederationStateStore
│ ├── hadoop-daemon.sh
│ ├── hadoop-daemons.sh
│ ├── httpfs.sh
│ ├── kms.sh
│ ├── mr-jobhistory-daemon.sh
│ ├── refresh-namenodes.sh
│ ├── start-all.cmd
│ ├── start-all.sh
│ ├── start-balancer.sh
│ ├── start-dfs.cmd
│ ├── start-dfs.sh
│ ├── start-secure-dns.sh
│ ├── start-yarn.cmd
│ ├── start-yarn.sh
│ ├── stop-all.cmd
│ ├── stop-all.sh
│ ├── stop-balancer.sh
│ ├── stop-dfs.cmd
│ ├── stop-dfs.sh
│ ├── stop-secure-dns.sh
│ ├── stop-yarn.cmd
│ ├── stop-yarn.sh
│ ├── workers.sh
│ ├── yarn-daemon.sh
│ └── yarn-daemons.sh
├── share ## This is where all default configuration and document files are present.
│ ├── doc
│ └── hadoop

Change directory to etc/hadoop, then update the following files:

  • core-site.xml — We define our cluster name in this file, for example testcluster.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://testcluster</value>
</property>
</configuration>
  • hdfs-site.xml — We add all the properties needed to configure HDFS with the HA setup.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/app/volume/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/app/volume/datanode</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>testcluster</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled.testcluster</name>
<value>true</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>Zookeeper1:2181,Zookeeper2:2181,Zookeeper3:2181</value>
</property>
<property>
<name>dfs.ha.namenodes.testcluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.testcluster.nn1</name>
<value>MasterNode1:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.testcluster.nn2</name>
<value>MasterNode2:8020</value>
</property>
<property>
<name>dfs.namenode.http-address.testcluster.nn1</name>
<value>MasterNode1:9870</value>
</property>
<property>
<name>dfs.namenode.http-address.testcluster.nn2</name>
<value>MasterNode2:9870</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://MasterNode1:8485;MasterNode2:8485;DataNode1:8485/testcluster</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.testcluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/app/hadoop/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/app/volume/journalnode</value>
</property>
</configuration>
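
The local paths referenced above should exist and be owned by the hadoop user on the relevant nodes before the daemons start; a minimal sketch, with the /app/volume layout matching the values in hdfs-site.xml:

# On the master nodes (NameNode) and on DataNode1 (JournalNode)
mkdir -p /app/volume/namenode /app/volume/journalnode
# On the slave nodes (DataNode)
mkdir -p /app/volume/datanode
chown -R hadoop:hadoop /app/volume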
  • yarn-site.xml — We add all the properties needed to configure YARN with the HA setup.
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>testcluster</value>
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>MasterNode1</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address.rm1</name>
<value>MasterNode1:8031</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address.rm1</name>
<value>MasterNode1:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address.rm1</name>
<value>MasterNode1:8032</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm1</name>
<value>MasterNode1:8088</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>MasterNode2</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address.rm2</name>
<value>MasterNode2:8031</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address.rm2</name>
<value>MasterNode2:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address.rm2</name>
<value>MasterNode2:8032</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm2</name>
<value>MasterNode2:8088</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.disk-health-checker.min-healthy-disks</name>
<value>0</value>
</property>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>29184</value> <!-- Memory in MB available for containers on each NodeManager -->
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>20184</value> <!-- Max Memory in MB for a Container-->
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>4728</value> <!-- Min Memory in MB for a Container -->
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.localizer.address</name>
<value>localhost:9000</value>
</property>
<property>
<name>hadoop.zk.address</name>
<value>Zookeeper1:2181,Zookeeper2:2181,Zookeeper3:2181</value>
</property>
<property>
<description>Enable RM to recover state after starting. If true, then yarn.resourcemanager.store.class must be specified</description>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<description>The class to use as the persistent store.</description>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name> <!-- Location of application logs.-->
<value>/app/hadoop/logs/userlogs</value>
</property>
<property>
<name>yarn.nodemanager.delete.debug-delay-sec</name>
<value>604800</value> <!-- Seconds to keep local application files and logs after completion, for debugging -->
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log.server.url</name>
<value>http://HA_PROXY_URL:19888/jobhistory/logs</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/app-logs</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir-suffix</name>
<value>logs</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
</configuration>
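
With log aggregation enabled as above, the logs of a finished application can later be pulled with the yarn CLI once the cluster is running (the application ID below is a placeholder):

yarn logs -applicationId application_1623400000000_0001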
  • mapred-site.xml — We add all the properties needed to configure the MapReduce framework.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>4728</value> <!-- Memory in MB for the MapReduce ApplicationMaster -->
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>4728</value> <!-- Memory in MB for Map Process -->
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4728</value> <!-- Memory in MB for Reduce Process -->
</property>
</configuration>
  • hadoop-env.sh — We add all the environment variables needed to configure the cluster.
Update:
export JAVA_HOME=/etc/alternatives/jre
Add at the end:
export HDFS_NAMENODE_USER="hadoop"
export HDFS_DATANODE_USER="hadoop"
export HDFS_SECONDARYNAMENODE_USER="hadoop"
export YARN_RESOURCEMANAGER_USER="hadoop"
export YARN_NODEMANAGER_USER="hadoop"
export HADOOP_CONF_DIR="/app/hadoop/etc/hadoop"
export HDFS_JOURNALNODE_USER="hadoop"
  • workers : Update the list of all slave servers here, one hostname per line, for example:
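Using the illustrative slave hostnames from this article (replace with your actual slave nodes), the file would contain just:

Slave1
Slave2
Slave3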
  • .bash_profile of hadoop user
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
. ~/.bashrc
fi
# User specific environment and startup programs
PATH=$PATH:$HOME/bin
export PATH

## JAVA env variables###
export JAVA_HOME=/etc/alternatives/jre
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar
## HADOOP env variables###
export HADOOP_HOME=/app/hadoop
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
#export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/app/hadoop/lib/"
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH

Copy all the contents of /app/hadoop from master1 to the other master and to the slave servers.
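
A sketch of the distribution step using rsync from master1 (scp works just as well); the hostnames are the illustrative ones used in this article:

for host in MasterNode2 Slave1 Slave2 Slave3; do
  rsync -a /app/hadoop/ hadoop@${host}:/app/hadoop/
done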

→ Format the NameNode on master1: $hadoop namenode -format

Run the command $start-dfs.sh on the master node. It will start NN, JN and DN on all required servers, but this will not enable automatic failover.

To enable it, stop the cluster using the command $stop-dfs.sh.

Run the command $hdfs zkfc -formatZK only once on master1 to initialize the failover state in ZooKeeper.

Start ZKFC manually on both masters before starting the cluster, so that it can manage the active (primary) and standby (secondary) NameNodes, using the command $hdfs --daemon start zkfc.

Start DFS again using the command $start-dfs.sh.

→ Bootstrap the standby NameNode on the secondary master and start it: $hdfs namenode -bootstrapStandby

Start the YARN services using the command $start-yarn.sh.

Start JHS on both masters using the command $mapred --daemon start historyserver.

Verify, for example with the jps command on each node, that all required processes have started and that the active and standby masters have been assigned. The cluster is now ready; sample output:

Master1
10545 NameNode
12113 ResourceManager
12645 JobHistoryServer
9546 DFSZKFailoverController
12954 Jps
10862 JournalNode
Master2
30278 ResourceManager
29672 JournalNode
28969 DFSZKFailoverController
29550 NameNode
31039 JobHistoryServer
31279 Jps
Slave1
23427 DataNode
24787 Jps
23592 JournalNode
24107 NodeManager
Slave2
88163 DataNode
88805 NodeManager
89382 Jps
Slave3
4995 Jps
3865 DataNode
4428 NodeManager
Web UIs to verify (NameNode on port 9870, ResourceManager on port 8088):
http://masternode1:9870
http://masternode2:9870
http://masternode1:8088
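
To confirm which NameNode and ResourceManager are currently active, you can also query the HA state from the command line (nn1/nn2 and rm1/rm2 are the IDs defined in the configuration above), and optionally run the bundled example job as a smoke test (the jar path assumes the 3.2.1 distribution layout):

hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2

# Optional smoke test: compute pi with the example jar
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar pi 2 10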

This marks the completion of section 2. We configured an Apache Hadoop cluster with an HA setup. It is a plain vanilla setup that cannot handle day-to-day work on its own; that is taken care of by add-ons like Spark, Hive and Sqoop. Follow Big Data Solutions using Apache Hadoop with Spark, Hive and Sqoop (3 of 3) to continue with the integration of Apache Hadoop with Spark, Hive and Sqoop.
