Big Data Solutions using Apache Hadoop with Spark, Hive and Sqoop (1 of 3)

Jun 11, 2021

With Cloudera discontinuing the free tier of its Big Data solutions, there is a big void, especially for those who are looking for free or open source solutions.

Over the next few articles, I will share details about Apache Hadoop, how we can create a production-ready cluster, and how we can integrate Spark, Hive and Sqoop to handle over 90% of the daily workload.

This first section covers the basics of Apache Hadoop: its architecture, how it works, and some of its important daemons and processes.

Hadoop is a distributed system for storage and processing. In an HA (High Availability) cluster, two or more separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an Active state, and the others are in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standbys keep an up-to-date copy of the Active NameNode's HDFS image, maintaining enough state to provide a fast failover if necessary.

In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called JournalNodes (JNs). When the Active node performs any namespace modification, it logs a record of the modification to a majority of these JNs. The Standby reads the edits from the JNs and applies them to its own namespace. In the event of a failover, the Standby ensures that it has read all of the edits from the JNs before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.
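The majority-write idea above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `JournalNode`, `log_edit` and `catch_up` are hypothetical names, everything lives in memory, and the real HDFS protocol is more involved (for example, it also has to deal with edits that were only partially written).

```python
from dataclasses import dataclass, field


@dataclass
class JournalNode:
    """Hypothetical in-memory stand-in for one JournalNode."""
    up: bool = True
    edits: dict = field(default_factory=dict)  # txid -> edit record

    def write(self, txid, edit):
        # A JournalNode that is down cannot acknowledge the write.
        if not self.up:
            return False
        self.edits[txid] = edit
        return True


def log_edit(journal_nodes, txid, edit):
    """Active side: the edit counts as logged only if a majority of
    JournalNodes acknowledge it."""
    acks = sum(jn.write(txid, edit) for jn in journal_nodes)
    return acks > len(journal_nodes) // 2


def catch_up(journal_nodes, namespace, last_applied_txid):
    """Standby side: apply, in order, every edit newer than the last
    one already applied. Returns the new last applied txid."""
    known = {}
    for jn in journal_nodes:
        if jn.up:
            known.update(jn.edits)
    for txid in sorted(t for t in known if t > last_applied_txid):
        namespace.append(known[txid])
        last_applied_txid = txid
    return last_applied_txid


# Usage: with one of three JournalNodes down, a write still succeeds,
# and the Standby can replay it into its own namespace.
jns = [JournalNode(), JournalNode(), JournalNode()]
jns[2].up = False
logged = log_edit(jns, 1, "mkdir /data")
standby_namespace = []
last_txid = catch_up(jns, standby_namespace, 0)
```

Before promoting itself, a Standby in this sketch would simply call `catch_up` one last time so that nothing logged by the old Active is missed.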

In order to provide a fast failover, it is also necessary that the Standby node has up-to-date information regarding the location of blocks in the cluster. To achieve this, the DataNodes are configured with the location of all NameNodes, and send block location information and heartbeats to all of them.

Apache ZooKeeper (ZK) is a highly available service for maintaining small amounts of coordination data, notifying clients of changes in that data, and monitoring clients for failures. The implementation of automatic HDFS failover relies on ZK for the following things:

Failure detection: each of the NameNode machines in the cluster maintains a persistent session in ZK. If the machine crashes, the ZK session will expire, notifying the other NameNodes that a failover should be triggered.

Active NameNode election: ZK provides a simple mechanism to exclusively elect a node as active. If the current active NameNode crashes, another node may take a special exclusive lock in ZK indicating that it should become the next active.

The NameNode, Standby NameNode, JournalNode, DataNode and ZKFailoverController together make up HDFS (Hadoop Distributed File System) with High Availability.

The ResourceManager, NodeManager, and ApplicationMaster together make up YARN (Yet Another Resource Negotiator) with High Availability.

NAMENODE(NN)

This process runs on a master node of the cluster. Only one NameNode can be active at a time.
It is responsible for managing the metadata of the files distributed across the cluster.
It manages information such as the location of file blocks across the cluster and their permissions.
On startup, this process reads all the metadata from a file named fsimage and keeps it in memory.
Once started, it updates the in-memory metadata as files are added or removed.
It records these changes as edit logs in a file called edits.
This process is the heart of HDFS: if it is down, HDFS is no longer accessible. That is why, in a production environment, we have to run it on at least 2 different nodes.
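To make the relationship between fsimage, the in-memory metadata and the edits file concrete, here is a minimal sketch. The data shapes (a dict for fsimage, a list of tuples for edits) and the function names are assumptions chosen for illustration, not the real on-disk formats. The `checkpoint` helper shows the usual idea of folding the edit log into a new fsimage.

```python
def load_namespace(fsimage, edits):
    """Rebuild the in-memory namespace: start from the last saved
    image and replay every logged change in order."""
    namespace = dict(fsimage)  # path -> metadata (copy, not in place)
    for op, path, meta in edits:
        if op == "add":
            namespace[path] = meta
        elif op == "delete":
            namespace.pop(path, None)
    return namespace


def checkpoint(fsimage, edits):
    """Fold the edit log into a new image and start an empty log."""
    return load_namespace(fsimage, edits), []


# Usage with made-up paths and permissions.
fsimage = {"/a": {"perm": "644"}}
edits = [("add", "/b", {"perm": "600"}), ("delete", "/a", None)]
in_memory = load_namespace(fsimage, edits)
new_fsimage, new_edits = checkpoint(fsimage, edits)
```

Replaying the log on top of the image is what lets a restarted NameNode arrive at the same state it had before it stopped.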

STANDBY NAMENODE

Both the Active and the Standby NameNode maintain up-to-date HDFS metadata, kept in sync through the JournalNodes, which ensures a seamless failover.

JOURNALNODE(JN)

They keep track of all namespace modifications performed by the Active node and save them in logs.
The Standby NameNode reads from these logs and applies the modifications to itself, making sure it stays in sync with the Active node.
We should run an odd number of them to maintain a quorum.
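The arithmetic behind the odd-number advice is short enough to write down. The helper below is a hypothetical illustration: a majority of N nodes is N // 2 + 1, so the number of JournalNodes that can fail is whatever is left over.

```python
def journal_quorum(n_journal_nodes):
    """Return (majority size, failures tolerated) for N JournalNodes."""
    majority = n_journal_nodes // 2 + 1
    tolerated_failures = n_journal_nodes - majority
    return majority, tolerated_failures


# Compare 3, 4 and 5 JournalNodes.
sizes = {n: journal_quorum(n) for n in (3, 4, 5)}
```

With 3 nodes the majority is 2 and one failure is tolerated; with 4 nodes the majority is 3 and still only one failure is tolerated. The fourth node adds cost without adding fault tolerance, which is why odd counts are preferred.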

DATANODE(DN)

Many instances of this process run across the slave nodes, one per node.
It is responsible for storing the individual file blocks on the slave nodes of the Hadoop cluster.
Based on the replication factor, a single block is replicated on multiple slave nodes (only if the replication factor is > 1) to prevent data loss.
Whenever required, this process handles access to a data block by communicating with the NameNode.
This process periodically sends heartbeats to the NameNode to let it know that the slave process is running.
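Replication and heartbeats can be illustrated with two small functions. Both are sketches under stated assumptions: the round-robin placement is a deliberate simplification (real HDFS placement also considers racks), and the node names, timestamps and timeout value are made up for the example.

```python
def place_replicas(block_id, datanodes, replication_factor):
    """Pick distinct DataNodes for one block, round-robin style.
    (Assumption: real HDFS placement is more sophisticated.)"""
    if replication_factor > len(datanodes):
        raise ValueError("not enough DataNodes for this replication factor")
    start = block_id % len(datanodes)
    return [datanodes[(start + i) % len(datanodes)]
            for i in range(replication_factor)]


def dead_datanodes(last_heartbeat, now, timeout_s):
    """NameNode side: any DataNode whose last heartbeat is older than
    the timeout is considered dead."""
    return sorted(dn for dn, ts in last_heartbeat.items()
                  if now - ts > timeout_s)


# Usage with hypothetical node names and timestamps (in seconds).
replicas = place_replicas(4, ["dn1", "dn2", "dn3"], 2)
dead = dead_datanodes({"dn1": 100.0, "dn2": 40.0}, now=110.0, timeout_s=30.0)
```

Once a DataNode is considered dead, the blocks it held are under-replicated, and that is the situation a replication factor greater than 1 protects against.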

ZKFAILOVERCONTROLLER(ZKFC)

It runs on each machine that runs a NameNode. It is also known as ZKFC. It is responsible for:

Health monitoring: the ZKFC pings its local NameNode on a periodic basis with a health-check command. So long as the NameNode responds in a timely fashion with a healthy status, the ZKFC considers the node healthy. If the node has crashed, frozen, or otherwise entered an unhealthy state, the health monitor will mark it as unhealthy.
ZooKeeper session management: when the local NameNode is healthy, the ZKFC holds a session open in ZK. If the local NameNode is active, it also holds a special “lock” znode. This lock uses ZK support for “ephemeral” nodes; if the session expires, the lock node will be automatically deleted.
ZooKeeper-based election: if the local NameNode is healthy, and the ZKFC sees that no other node currently holds the lock znode, it will itself try to acquire the lock. If it succeeds, then it has “won the election”, and is responsible for running a failover to make its local NameNode active.
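The three duties above fit together as a small state machine, sketched below. `FakeZooKeeper` and `Zkfc` are hypothetical names: the fake models only the single feature we need (an ephemeral lock owned by a session) and is not the real ZooKeeper client API.

```python
class FakeZooKeeper:
    """Hypothetical in-memory stand-in for an ephemeral lock znode."""

    def __init__(self):
        self.lock_owner = None

    def try_acquire_lock(self, session_id):
        # The lock can be taken only when nobody holds it.
        if self.lock_owner is None:
            self.lock_owner = session_id
        return self.lock_owner == session_id

    def expire_session(self, session_id):
        # An ephemeral lock disappears together with its session.
        if self.lock_owner == session_id:
            self.lock_owner = None


class Zkfc:
    def __init__(self, name, zk):
        self.name = name
        self.zk = zk
        self.state = "standby"

    def tick(self, namenode_healthy):
        """One monitoring round: health check, then election."""
        if not namenode_healthy:
            self.zk.expire_session(self.name)  # lock is released
            self.state = "unhealthy"
        elif self.zk.try_acquire_lock(self.name):
            self.state = "active"  # won the election
        else:
            self.state = "standby"
        return self.state


# Usage: nn1 becomes active, then fails, and nn2 takes over.
zk = FakeZooKeeper()
nn1, nn2 = Zkfc("nn1", zk), Zkfc("nn2", zk)
history = [nn1.tick(True), nn2.tick(True), nn1.tick(False), nn2.tick(True)]
```

Because the lock is tied to the session, a crashed or unhealthy NameNode gives it up without any extra step, and the next healthy ZKFC that asks for it wins.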

RESOURCEMANAGER(RM)

This daemon process resides on both of the master nodes.
It is active on one master and on standby on the other.
It is responsible for scheduling resources for the different compute applications in an optimal way.
It coordinates two processes on the master node: the Scheduler and the ApplicationManager.

NOTE: We can preserve the state of the ResourceManager by saving its metadata to ZooKeeper, and rebuild that state once the process starts again and becomes active.

SCHEDULER

This process runs on the master node as part of the ResourceManager. It is responsible for:

Scheduling job execution according to the submission requests received by the ResourceManager.
Allocating resources to the applications submitted to the cluster.
Coordinating with the ApplicationManager and keeping track of the resources used by running applications.

APPLICATIONMANAGER

This process runs on the master node as part of the ResourceManager. It is responsible for:

Coordinating with the Scheduler to help it keep track of running applications.
Accepting job submissions from clients.
Negotiating the first container, on a slave node, in which the application-specific ApplicationMaster is started.

NODEMANAGER(NM)

This daemon process resides on the slave nodes. It is responsible for:

Managing and executing containers.
Monitoring resource usage (memory, CPU, network, etc.) and reporting it back to the ResourceManager daemon.
Periodically sending heartbeats to the ResourceManager to report its health status.

APPLICATIONMASTER(AM)

This daemon process runs on a slave node.
It is a per-application library that works with the NodeManagers to execute tasks.
There is one instance of this daemon per application, which means that when multiple jobs are submitted to the cluster, there may be more than one ApplicationMaster running on the slave nodes.
It is responsible for negotiating suitable resource containers on the slave nodes from the ResourceManager.
It works with one or more NodeManagers to monitor task execution on the slave nodes.

JOBHISTORYSERVER(JHS)

JHS is a service that runs on the master node. It is responsible for:

Saving the execution logs and summaries of jobs that run on YARN through the RM.
Serving a UI where users can read those summaries and logs.

WHAT IS A CONTAINER?

A container is a small unit of resources (such as CPU, memory and disk) belonging to a slave node.
The Scheduler process, running along with the ResourceManager daemon, allocates resources in the form of containers.
At the beginning of a job execution with YARN, a container lets the ApplicationMaster process use some of the resources on any slave node in the cluster.
The ApplicationMaster then manages the application's execution across other containers on the slave nodes of the YARN cluster.
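The container flow described above can be sketched as follows. Everything here is an assumption made for illustration: the `Node` and `Container` types, the first-fit policy (actual YARN schedulers are considerably more elaborate), and the node names and resource sizes.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_mem_mb: int
    free_vcores: int


@dataclass(frozen=True)
class Container:
    node: str
    mem_mb: int
    vcores: int


def allocate(nodes, mem_mb, vcores):
    """First-fit allocation: carve a container out of the first node
    with enough free resources, or return None if nothing fits."""
    for node in nodes:
        if node.free_mem_mb >= mem_mb and node.free_vcores >= vcores:
            node.free_mem_mb -= mem_mb
            node.free_vcores -= vcores
            return Container(node.name, mem_mb, vcores)
    return None


def run_application(nodes, am_request, task_requests):
    """The first container hosts the ApplicationMaster, which then
    asks for one more container per task."""
    am_container = allocate(nodes, *am_request)
    if am_container is None:
        return None, []
    task_containers = [allocate(nodes, *req) for req in task_requests]
    return am_container, task_containers


# Usage with two hypothetical slave nodes.
nodes = [Node("w1", 4096, 2), Node("w2", 8192, 4)]
am, tasks = run_application(
    nodes, (1024, 1), [(2048, 1), (2048, 1), (4096, 2)])
```

Each allocation reduces the free resources of the node it lands on, so later requests spill over to other nodes, and a request that fits nowhere gets no container until resources are released.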

This marks the completion of section 1. We discussed the overall cluster architecture and the details of its various daemons and processes. Follow Big Data Solutions using Apache Hadoop with Spark, Hive and Sqoop (2 of 3) to see how to set up the cluster.