Hadoop and Big Data

I. Introduction

Hadoop is an open-source distributed system produced by Apache. It is a framework, built in Java, for distributing big datasets across a set of machines. Big data (the dataset) can be structured (such as data from a database), unstructured (such as video, documents, and HTML page content), or semi-structured (such as JSON or XML documents).

Hadoop has two main components:

  • Hadoop Distributed File System (HDFS). The main concept of HDFS is to distribute the dataset across different machines (nodes). The data is split into blocks, each stored as a file on a node; the size of each block is typically 64 MB or 128 MB (depending on the configuration of Hadoop)
  • MapReduce is a framework for processing the dataset. MapReduce consists of two steps. The first step is Map, which parses the data (a row, for example) into words. The second is Reduce, which groups those words (for example, counting each word). A worked word-count example appears in Section VI

II. YARN architecture

YARN stands for Yet Another Resource Negotiator. It is a resource-management layer that sits between the applications and the HDFS storage. It allocates the cluster's resources and allows different jobs to run concurrently. YARN also contains the scheduler.

YARN components

A. Resource Manager: Responsible for allocating cluster resources to the jobs, using a Scheduler and an Applications Manager

B. Application Master: Runs in a node of the cluster. It is responsible for creating and destroying the job, and for managing the other nodes that are allocated to it by the Resource Manager.

C. Node Manager: Responsible for managing the containers on its node. The sketch after this list shows how these components cooperate when a job is submitted.
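As a hedged illustration, here is a minimal Spark application configured to run on a YARN cluster (a sketch, not a reference implementation; the application name and input path are hypothetical). Submitting it makes the Resource Manager allocate containers, launch an Application Master for the job, and let the Node Managers run its executors:

    import org.apache.spark.{SparkConf, SparkContext}

    object YarnJobSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("yarn-sketch")   // hypothetical application name
          .setMaster("yarn")           // usually set via spark-submit --master yarn

        val sc = new SparkContext(conf)
        // The HDFS path below is a placeholder for illustration
        val lines = sc.textFile("hdfs:///data/input.txt")
        println(lines.count())         // a trivial action that triggers the job
        sc.stop()
      }
    }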

III. HDFS Components

A Hadoop instance consists of a cluster of HDFS machines often referred to as the Hadoop cluster or the HDFS cluster. There are three main components of an HDFS cluster:

• NameNode: The “master” node of HDFS that manages the data (without actually storing it) by determining and maintaining how the chunks of data are distributed across the DataNodes

• DataNode: The “slave” node of HDFS that stores chunks of data. It is also responsible for replicating the chunks across other DataNodes. The NameNode and DataNodes are daemon processes running in the cluster.

• Secondary NameNode: Responsible for managing the NameNode’s metadata files (the fsimage and edits logs)

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, which is a master server that manages the filesystem namespace and regulates access to files by clients.

NameNode (master node) has the following characteristics:

• Acts as the master of the DataNodes
• Executes filesystem namespace operations, like opening, closing, and renaming files and directories
• Determines the mapping of blocks to DataNodes
• Maintains the filesystem namespace

The NameNode performs these tasks by maintaining two files:
• fsimage_N: Contains the entire filesystem namespace, including the mapping of blocks to files and filesystem properties
• edits_N: A transaction log that persistently records every change that occurs to filesystem metadata

When the NameNode starts up, it enters safemode (a read-only mode). It loads the fsimage_N and edits_N from disk, applies all the transactions from edits_N to the in-memory representation of fsimage_N, and flushes this new version out into a new fsimage_N+1 on disk.
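As a conceptual sketch only (not Hadoop’s actual implementation; all types and names below are illustrative), the replay can be modeled as a fold over the edit log. The Secondary NameNode performs the same merge periodically:

    // Conceptual model of the fsimage/edits checkpoint; illustrative only
    case class FsImage(blockMap: Map[String, List[String]]) // file -> block IDs (simplified)

    sealed trait EditOp
    case class AddFile(path: String, blocks: List[String]) extends EditOp
    case class DeleteFile(path: String) extends EditOp

    def applyEdit(image: FsImage, op: EditOp): FsImage = op match {
      case AddFile(path, blocks) => FsImage(image.blockMap + (path -> blocks))
      case DeleteFile(path)      => FsImage(image.blockMap - path)
    }

    // Load fsimage_N, replay every transaction from edits_N, and the result
    // is the namespace that gets flushed to disk as fsimage_N+1
    def checkpoint(fsimageN: FsImage, editsN: Seq[EditOp]): FsImage =
      editsN.foldLeft(fsimageN)(applyEdit)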

DataNodes (Slave Nodes)

HDFS exposes a filesystem namespace and allows the data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes.

The NameNode determines the mapping of blocks to DataNodes.
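As a hedged illustration of a client read (the host name and path are placeholders), here is a snippet as typed in the Spark shell, where sc is the pre-created SparkContext. Spark asks the NameNode for the block locations, pulls the blocks from the DataNodes, and creates roughly one partition per block:

    // "namenode:8020" and the path are placeholders for illustration
    val lines = sc.textFile("hdfs://namenode:8020/data/big-file.txt")
    println(s"partitions (roughly one per HDFS block): ${lines.getNumPartitions}")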

The DataNodes are responsible for:

• Handling read and write requests from application clients
• Performing block creation, deletion, and replication upon instruction from the NameNode (The NameNode makes all decisions regarding replication of blocks)
• Sending heartbeats to the NameNode
• Sending a Blockreport to the NameNode

The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.
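As a conceptual sketch only (not Hadoop’s actual wire protocol; every name below is illustrative), the exchange can be pictured like this:

    // Conceptual model of the DataNode -> NameNode messages; illustrative only
    case class Heartbeat(dataNodeId: String)                        // "I am alive"
    case class BlockReport(dataNodeId: String, blockIds: Seq[Long]) // full block list

    class NameNodeModel {
      private var lastSeen = Map.empty[String, Long]
      private var blocks   = Map.empty[String, Seq[Long]]

      def onHeartbeat(hb: Heartbeat): Unit =
        lastSeen += hb.dataNodeId -> System.currentTimeMillis()

      def onBlockReport(br: BlockReport): Unit =
        blocks += br.dataNodeId -> br.blockIds

      // A DataNode that has not sent a heartbeat recently is treated as dead
      def isAlive(id: String, timeoutMs: Long = 10 * 60 * 1000): Boolean =
        lastSeen.get(id).exists(System.currentTimeMillis() - _ < timeoutMs)
    }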

Secondary NameNode:

The Secondary NameNode merges the fsimage and the edits log files periodically and keeps edits log size within a limit. The secondary NameNode service is not a standby secondary NameNode, despite its name. Specifically, it does not offer High Availability (HA) for the NameNode.

IV. YARN & HDFS components

All Hadoop daemons can run on the same machine or on different nodes. Each daemon runs in its own JVM (Java Virtual Machine). The Hadoop daemons are divided into two groups, YARN and HDFS.

YARN daemons:

  • Resource Manager
  • Application Master
  • Node Manager

HDFS daemons:

  • NameNode
  • Secondary NameNode
  • DataNode

V. Run modes for Hadoop

  • Standalone Mode: everything runs in a single JVM on one machine, using the local filesystem; no Hadoop daemons are started
  • Pseudo-Distributed Mode (single-node cluster): all daemons run on one machine, each in its own JVM
  • Fully Distributed Mode (or multi-node cluster): the daemons are spread across multiple machines

VI. MapReduce framework

MapReduce is a processing technique built on dividing the work into two different tasks: Map and Reduce. Map breaks the input elements into tuples, while Reduce collects the output of the Map task and combines the tuples into a smaller set of results.

A classic MapReduce process is written in Java. In our session, we will use Spark for data processing instead; with Spark we can write the MapReduce logic in Java/Scala or Python. Spark applications run faster than traditional MapReduce: MapReduce is required to save its results to HDFS many times during the processing, while Spark keeps the results in memory throughout.

MapReduce methods

We will use Spark methods to implement the MapReduce process:

Map process:

1- Read the file through textFile (from a file) or create the dataset from an array with parallelize. Reading a file gives us the lines as a collection.
2- flatMap: parse the lines into the individual words, putting all the words into one big collection.
3- map: change each word into a tuple object (a tuple is similar to a small array), for example School -> (School, 1). The number 1 is used later in the reduce process to do the counting.
A sketch of these three steps in Spark follows below.
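A minimal sketch of the map side in Spark (Scala, as typed in the spark-shell, where sc already exists); the input path is a placeholder:

    // 1- Read the file into a collection of lines
    //    (or: val lines = sc.parallelize(Array("line one", "line two")))
    val lines = sc.textFile("input.txt")        // placeholder path

    // 2- flatMap: split every line into words, producing one flat collection
    val words = lines.flatMap(line => line.split(" "))

    // 3- map: turn each word into a (word, 1) tuple for the reduce process
    val pairs = words.map(word => (word, 1))    // e.g. "School" -> ("School", 1)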

Reduce process:

1- Shuffle: the system moves the items between the different nodes to prepare the data for the reduce. For example, all occurrences of the same word are placed on one node, which reduces network traffic.
2- Reduce: the system combines the tuples that have the same word, counting them. For example:
(School,1) and (School,1) => (School,2)
A sketch of the reduce side follows below.
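Continuing the map-side sketch above, the shuffle and the reduce are both performed by Spark’s reduceByKey (the printed output here is illustrative):

    // reduceByKey shuffles tuples with the same key to the same node/partition,
    // then combines their values: ("School",1) + ("School",1) => ("School",2)
    val counts = pairs.reduceByKey((a, b) => a + b)

    counts.collect().foreach(println)   // e.g. (School,2), (Hadoop,5), ...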