Know the concepts of Hadoop Distributed File System & Mapreduce framework
There has been an exceptional expansion in business intelligence and data analytics especially network-based computing over the last few years. In this, client/server-based applications have brought about a revolution of sorts in the field. Sharing storage resources and information on the network is one of the key elements in both local area networks (LANs) and wide area networks (WANs). Different technologies have been developed to bring convenience to sharing resources and files on a network; a distributed file system is one of the processes used on a regular basis.
Before we take a plunge into understanding the concepts that govern Hadoop Distributed File System & Mapreduce framework, let’s comprehend the basics first, the definitions:
i. What is a distributed file system?
In very straightforward words, a distributed file system is a client/server based application that allows one to access and process data stored on the server as if it were on their own computer. When a user accesses a file on the server, the server sends the user a copy of the file, which is cached on the user's computer while the data is being processed and is then returned to the server.
ii. What is HDFS – Hadoop Distributed File System?
Hadoop has its own distributed file system known as the HDFS which is a subproject of the Apache Hadoop project. It has been designed in a way that it can store huge amounts of data reliably and allows one to stream those data sets at high bandwidth to user applications as and when needed.
✓ Scalability: HDFS has been designed in a way that it can scale to massive levels. On a single robust platform, you can store unlimited amounts of data. As and when your data increases in volume, all you have to do is add more servers to scale to the level.
✓ Flexibility: No matter what kind of data you have, you can store it all without any modeling done beforehand. This signifies that you always have full access to data shared with you.
✓ Reliability: Multiples copies of your data are always made available to you through automatic replication. You not only get access to the data, but also assured that through replication, your data is sound and safe even while your system may fail.
Here’s an interesting trivia. Did you know that in April 2008, * Hadoop broke a world record to become the fastest system to sort a terabyte of data? Isn’t that something? Running on a 910-node cluster, Hadoop sorted one terabyte in 209 seconds (just under 3½ minutes), beating the previous year’s winner of 297 seconds.
i. What is MapReduce framework?
As safaribooksonline puts it, MapReduce is a computational paradigm designed to process very large sets of data in a distributed fashion. The model has been based on the concept of breaking the data processing task into two smaller tasks of mapping and reduction.
ii. Why MapReduce framework?
✓ Accessibility: Supports a wide range of languages for developers as well as high-level language through Apache Hive and Apache Pig.
✓ Flexibility: Process any and all data, regardless of type or format — whether structured, semi-structured, or unstructured. Original data remains available even after batch processing for further analytics, all in the same platform.
✓ Reliability: Built-in job and task trackers allow processes to fail and restart without affecting other processes or workloads. Additional scheduling allows you to prioritize processes based on needs such as SLAs.
✓ Scalability: MapReduce is designed to match the massive scale of HDFS and Hadoop, so you can process unlimited amounts of data, fast, all within the same platform where it’s stored.
- Architecture of Hadoop Distributed File System
Represented by inodes, the HDFS namespace is a hierarchy of files and directories represented on the NameNode. The Inodes record attributes like permissions, modification and access times, namespace and disk space quotas.
Image and Journal
Images is the inodes and the list of blocks that define the metadata of the name system. A checkpoint is the persistent record of the image stored in the NameNode's local native filesystem. The NameNode records changes to HDFS in a write-ahead log called the journal in its local native filesystem. Each client-initiated transaction is recorded in the journal, and the journal file is flushed and synced before the acknowledgment is sent to the client.
Each block replica on a DataNode is represented by two files in the local native filesystem - one comprises the data and the 2nd file records the block's metadata. The size of the data file equals the actual length of the block and does not require extra space to round it up to the nominal block size as in traditional filesystems. Thus, if a block is half full it needs only half of the space of the full block on the local drive.
User applications access the filesystem using the HDFS client, a library that exports the HDFS filesystem interface.
The NameNode in HDFS, in addition to its primary role serving client requests, can alternatively execute either of two other roles, either a CheckpointNode or a BackupNode. The CheckpointNode periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal.
Like a CheckpointNode, the BackupNode is capable of creating periodic checkpoints, but in addition it maintains an in-memory, up-to-date image of the filesystem namespace that is always synchronized with the state of the NameNode.
Upgrades and Filesystem Snapshots
During software upgrades the possibility of corrupting the filesystem due to software bugs or human mistakes increases. The purpose of creating snapshots in HDFS is to minimize potential damage to the data stored in the system during upgrades. Here is a mechanism that allows the administrators to consistently save the current state of the filesystem which allows them to roll back the upgrade made and return to the namespace and storage just in case there is a data loss or corruption.
[Definition Source: aosabook]
Components of MapReduce
The JobTracker maintains a view of all available processing resources in the Hadoop cluster and, as application requests come in, it schedules and deploys them to the TaskTracker nodes for execution. As applications are running, the JobTracker receives status updates from the TaskTracker nodes to track their progress and, if necessary, coordinate the handling of any failures.
TaskTracker receives processing requests from the JobTracker. Its primary responsibility is to track the execution of MapReduce workloads happening locally on its slave node and to send status updates to the JobTracker. TaskTrackers manage the processing resources on each slave node in the form of processing slots — the slots defined for map tasks and reduce tasks, to be exact.
We have given you a bird’s eye view of what HDFS and Mapreduce framework are. While this may come handy if you are trying to brush up your skills, but if you are a software/analytics professional, ETL developer, project manager or a testing professional looking for in depth knowledge to master fundamental concepts of HDFS, Map Reduce and other Hadoop Eco System components, here’s a course that may inveigle you.
- Get hands-on learning using Pig, Hive, HBase and MapReduce.
- Attend one-on-one live online session/webinars with industry and Big Data experts.
- Get access to reading material, videos, case studies and much more.