Big Data Solutions using Apache Hadoop with Spark, Hive and Sqoop (2 of 3)

Bakul Gupta
8 min read · Jun 11, 2021

In the previous section, I explained the basics of Apache Hadoop, its architecture, how it works, and some of the important daemons and processes. Follow Big Data Solutions using Apache Hadoop with Spark, Hive and Sqoop (1 of 3) for more details and to understand the abbreviations used here.

In this section, we will continue with our cluster setup using a 2 + 3 layout: two master nodes running NN, JN, ZKFC, RM and JHS along with Spark, Hive and Sqoop; the first slave node running JN, DN and NM; and the remaining slave node(s) running DN and NM. We will use Apache Hadoop 3.2.1 in this article.
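For quick reference, a possible daemon layout using the hostnames that appear in the configuration files below (MasterNode1, MasterNode2, DataNode1, Zookeeper1-3; DataNode2 and DataNode3 are placeholders for the remaining slaves) looks like this:

MasterNode1: NN, JN, ZKFC, RM, JHS (plus Spark, Hive, Sqoop)
MasterNode2: NN, JN, ZKFC, RM, JHS (plus Spark, Hive, Sqoop)
DataNode1: JN, DN, NM
DataNode2: DN, NM
DataNode3: DN, NM
Zookeeper1, Zookeeper2, Zookeeper3: ZooKeeper quorum (separate servers)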

Pre-Requisites

  • Good hardware is recommended (a minimum of 5 servers for a solid setup; more slaves can be added as needed).
  • All servers should run the latest OS (CentOS 7.x).
  • Create the user and group hadoop (UID/GID 1000).
  • Make sure the same UID and GID are used on all servers.
  • A dedicated /app file system is recommended for the cluster setup.
  • The home for Hadoop will be /app/hadoop, for Spark /app/spark, for Hive /app/hive and for Sqoop /app/sqoop.
  • Set up password-less SSH (hadoop to hadoop) from the master nodes to the slave nodes. Add the key to the authorized_keys file of the master node(s) as well.
  • Disable the iptables/firewalld service.
  • Disable SELinux.
  • Set the kernel parameter vm.swappiness = 1.
  • Install the Java JDK (including the headers and devel packages) on each server.
  • Install and run the ZooKeeper service on the servers assigned for that task (these are separate from the 5 servers needed for the cluster). ZooKeeper setup is not covered here; refer to other tutorials.
  • Configure HAProxy for JHS using the (Setup HA Proxy) article or some other method. This is needed because JHS does not support an HA setup by default. (A shell sketch of the host-preparation steps above is shown after this list.)
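The sketch below is a minimal example of these host-preparation steps, meant to be run as root on every server. The package names and commands are assumptions based on a CentOS 7.x host and may need adjusting for your environment.

mkdir -p /app
# Create the hadoop group and user with fixed GID/UID 1000, home at /app/hadoop
groupadd -g 1000 hadoop
useradd -u 1000 -g hadoop -m -d /app/hadoop hadoop

# Disable firewall and SELinux
systemctl stop firewalld && systemctl disable firewalld
setenforce 0                                                   # immediate, non-persistent
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config   # persistent after reboot

# Reduce swapping
echo 'vm.swappiness = 1' >> /etc/sysctl.conf
sysctl -p

# Install Java (OpenJDK 8 shown here as an assumption)
yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel

# As the hadoop user: generate a key and copy it to every node, including the masters
su - hadoop -c 'ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa'
su - hadoop -c 'ssh-copy-id hadoop@MasterNode2'    # repeat for each master and slave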

Steps to configure cluster

Download the tarball of Apache Hadoop from (Apache Hadoop Tar Ball) on one of the master nodes.

Extract it under the /app file system and move all the contents of hadoop-3.2.1 into hadoop to simplify the path name.

Change the user and group of all folders and files under /app/hadoop, including the hadoop directory itself, to hadoop.
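As a hedged example, these steps could look like the following on master1; the mirror URL is an assumption, so pick any mirror from the Apache download page.

cd /app
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
tar -xzf hadoop-3.2.1.tar.gz
mv hadoop-3.2.1/* /app/hadoop/       # move the contents into the simpler /app/hadoop path
rmdir hadoop-3.2.1
chown -R hadoop:hadoop /app/hadoop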

The main configuration files are in the etc/hadoop directory under /app/hadoop. The overall directory layout looks like this:

.
├── bin ## This is where all binary commands are present. We use these commands to control processes.
│ ├── container-executor
│ ├── hadoop
│ ├── hadoop.cmd
│ ├── hdfs
│ ├── hdfs.cmd
│ ├── mapred
│ ├── mapred.cmd
│ ├── test-container-executor
│ ├── yarn
│ └── yarn.cmd
├── etc ## This is where all configuration files live. We will edit the files here to make the cluster work.
│ └── hadoop
├── include
│ ├── hdfs.h
│ ├── Pipes.hh
│ ├── SerialUtils.hh
│ ├── StringUtils.hh
│ └── TemplateFactory.hh
├── lib ## This is where all library files are saved.
│ └── native
├── libexec
│ ├── hadoop-config.cmd
│ ├── hadoop-config.sh
│ ├── hadoop-functions.sh
│ ├── hadoop-layout.sh.example
│ ├── hdfs-config.cmd
│ ├── hdfs-config.sh
│ ├── mapred-config.cmd
│ ├── mapred-config.sh
│ ├── shellprofile.d
│ ├── tools
│ ├── yarn-config.cmd
│ └── yarn-config.sh
├── LICENSE.txt
├── logs ## This is where daemon log files are saved.
├── NOTICE.txt
├── README.txt
├── sbin ## This is where all scripts are saved. We use them to start and stop the cluster.
│ ├── distribute-exclude.sh
│ ├── FederationStateStore
│ ├── hadoop-daemon.sh
│ ├── hadoop-daemons.sh
│ ├── httpfs.sh
│ ├── kms.sh
│ ├── mr-jobhistory-daemon.sh
│ ├── refresh-namenodes.sh
│ ├── start-all.cmd
│ ├── start-all.sh
│ ├── start-balancer.sh
│ ├── start-dfs.cmd
│ ├── start-dfs.sh
│ ├── start-secure-dns.sh
│ ├── start-yarn.cmd
│ ├── start-yarn.sh
│ ├── stop-all.cmd
│ ├── stop-all.sh
│ ├── stop-balancer.sh
│ ├── stop-dfs.cmd
│ ├── stop-dfs.sh
│ ├── stop-secure-dns.sh
│ ├── stop-yarn.cmd
│ ├── stop-yarn.sh
│ ├── workers.sh
│ ├── yarn-daemon.sh
│ └── yarn-daemons.sh
├── share ## This is where all default configuration and documentation files are present.
│ ├── doc
│ └── hadoop

Change to the etc/hadoop folder, then update the following files:

  • core-site.xml — We define our cluster name in this file; in this example, testcluster.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://testcluster</value>
</property>
</configuration>
  • hdfs-site.xml — We add all the properties needed to configure HDFS with an HA setup.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/app/volume/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/app/volume/datanode</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>testcluster</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled.testcluster</name>
<value>true</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>Zookeeper1:2181,Zookeeper2:2181,Zookeeper3:2181</value>
</property>
<property>
<name>dfs.ha.namenodes.testcluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.testcluster.nn1</name>
<value>MasterNode1:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.testcluster.nn2</name>
<value>MasterNode2:8020</value>
</property>
<property>
<name>dfs.namenode.http-address.testcluster.nn1</name>
<value>MasterNode1:9870</value>
</property>
<property>
<name>dfs.namenode.http-address.testcluster.nn2</name>
<value>MasterNode2:9870</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://MasterNode1:8485;MasterNode2:8485;DataNode1:8485/testcluster</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.testcluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/app/hadoop/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/app/volume/journalnode</value>
</property>
</configuration>
  • yarn-site.xml — We add all the properties needed to configure YARN with an HA setup.
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>testcluster</value>
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>MasterNode1</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address.rm1</name>
<value>MasterNode1:8031</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address.rm1</name>
<value>MasterNode1:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address.rm1</name>
<value>MasterNode1:8032</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm1</name>
<value>MasterNode1:8088</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>MasterNode2</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address.rm2</name>
<value>MasterNode2:8031</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address.rm2</name>
<value>MasterNode2:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address.rm2</name>
<value>MasterNode2:8032</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm2</name>
<value>MasterNode2:8088</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.disk-health-checker.min-healthy-disks</name>
<value>0</value>
</property>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>29184</value> <!-- Memory in MB available for containers on each NodeManager -->
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>20184</value> <!-- Max Memory in MB for a Container-->
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>4728</value> <!-- Min Memory in MB for a Container-->
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.localizer.address</name>
<value>localhost:9000</value>
</property>
<property>
<name>hadoop.zk.address</name>
<value>Zookeeper1:2181,Zookeeper2:2181,Zookeeper3:2181</value>
</property>
<property>
<description>Enable RM to recover state after starting. If true, then yarn.resourcemanager.store.class must be specified</description>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<description>The class to use as the persistent store.</description>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name> <!-- Location of application logs.-->
<value>/app/hadoop/logs/userlogs</value>
</property>
<property>
<name>yarn.nodemanager.delete.debug-delay-sec</name>
<value>604800</value> <!-- Seconds to keep finished applications' local files and logs for debugging.-->
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log.server.url</name>
<value>http://HA_PROXY_URL:19888/jobhistory/logs</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/app-logs</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir-suffix</name>
<value>logs</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
</configuration>
  • mapred-site.xml — We add all the properties needed to configure the MapReduce framework.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>4728</value> <!-- Memory in MB for the MapReduce ApplicationMaster -->
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>4728</value> <!-- Memory in MB for Map Process -->
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4728</value> <!-- Memory in MB for Reduce Process -->
</property>
</configuration>
  • hadoop-env.sh — We add all the environment variables needed to configure the cluster.
Update:
export JAVA_HOME=/etc/alternatives/jre
Add at the end:
export HDFS_NAMENODE_USER="hadoop"
export HDFS_DATANODE_USER="hadoop"
export HDFS_SECONDARYNAMENODE_USER="hadoop"
export YARN_RESOURCEMANAGER_USER="hadoop"
export YARN_NODEMANAGER_USER="hadoop"
export HADOOP_CONF_DIR="/app/hadoop/etc/hadoop"
export HDFS_JOURNALNODE_USER="hadoop"
  • workers: Update the list of all slave servers here, one per line (an example is shown after this list).
  • .bash_profile of hadoop user
# .bash_profile
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
. ~/.bashrc
fi
# User specific environment and startup programs
PATH=$PATH:$HOME/bin
export PATH
## JAVA env variables###
export JAVA_HOME=/etc/alternatives/jre
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar
## HADOOP env variables###
export HADOOP_HOME=/app/hadoop
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
#export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/app/hadoop/lib/"
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
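For reference, the workers file mentioned above might look like the following for the 2 + 3 layout used here; the slave hostnames are placeholders for your actual DataNode/NodeManager hosts.

# /app/hadoop/etc/hadoop/workers -- one slave hostname per line
DataNode1
DataNode2
DataNode3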

Copy all the contents under /app/hadoop from master1 to the other master and to the slave servers.
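A simple way to do this, as a sketch, is with rsync as the hadoop user on master1; the hostnames are the placeholders used throughout this article.

for host in MasterNode2 DataNode1 DataNode2 DataNode3; do
  rsync -a /app/hadoop/ hadoop@${host}:/app/hadoop/
done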

→ Format the NameNode on master1: $hdfs namenode -format (the older hadoop namenode -format form still works but is deprecated in Hadoop 3).

Run the command $start-dfs.sh on the master node. It will start NN, JN and DN on all required servers, but this will not enable automatic failover.

To enable it, stop the cluster using the command $stop-dfs.sh.

Run the command $hdfs zkfc -formatZK only once, on master1, to initialize the HA state in ZooKeeper.

Start ZKFC manually on both masters before starting the cluster, so it can manage the active and standby NameNodes, using the command $hdfs --daemon start zkfc.

Start DFS again using the command $start-dfs.sh.

→ Bootstrap and start the NameNode on the secondary master: $hdfs namenode -bootstrapStandby

Start the YARN services using the command $start-yarn.sh.

Start JHS on both masters using the command $mapred --daemon start historyserver.

Verify with jps that all required processes have started and that the active and standby masters have been elected. The cluster is now ready.

Master1
10545 NameNode
12113 ResourceManager
12645 JobHistoryServer
9546 DFSZKFailoverController
12954 Jps
10862 JournalNode
Master2
30278 ResourceManager
29672 JournalNode
28969 DFSZKFailoverController
29550 NameNode
31039 JobHistoryServer
31279 Jps
Slave1
23427 DataNode
24787 Jps
23592 JournalNode
24107 NodeManager
Slave2
88163 DataNode
88805 NodeManager
89382 Jps
Slave3
4995 Jps
3865 DataNode
4428 NodeManager
The web UIs can be reached at:
http://masternode1:9870 (NameNode UI)
http://masternode2:9870 (NameNode UI)
http://masternode1:8088 (ResourceManager UI)
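To confirm which NameNode and ResourceManager are active, the standard HA admin commands can be used; nn1/nn2 and rm1/rm2 match the IDs defined earlier in hdfs-site.xml and yarn-site.xml.

hdfs haadmin -getServiceState nn1    # prints "active" or "standby"
hdfs haadmin -getServiceState nn2
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2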

This marks the completion of section 2. We configured an Apache Hadoop cluster with an HA setup. It is a plain vanilla setup that, on its own, cannot handle day-to-day workloads; that is taken care of by add-ons like Spark, Hive and Sqoop. Follow Big Data Solutions using Apache Hadoop with Spark, Hive and Sqoop (3 of 3) to continue with the integration of Apache Hadoop with Spark, Hive and Sqoop.


Bakul Gupta

Data Platform Engineer with Xfinity Mobile, Comcast. Enabling platform solutions with Open Source Big Data, Cloud and DevOps Technologies.