What Is Hadoop Distributed File System (HDFS)? Working, Architecture, and Commands

The Hadoop Distributed File System (HDFS) handles big data and scales Hadoop clusters to numerous nodes.

December 12, 2022

The Hadoop Distributed File System (HDFS) is defined as a distributed file system solution built to handle big data sets on off-the-shelf hardware. It can scale up a single Hadoop cluster to thousands of nodes. This article details the definition, working, architecture, and top commands of HDFS.

What Is HDFS?

The Hadoop Distributed File System (HDFS) is a distributed file system solution built to handle big data sets on off-the-shelf hardware. It can scale up a single Hadoop cluster to thousands of nodes.

HDFS acts as a module of Apache Hadoop, an open-source framework capable of data storage, processing, and analysis. HDFS serves as a Hadoop file system, and its primary goal is to grant access to Hadoop file system data.

At its core, HDFS is all about data storage. It can store data of all sizes and varieties, and it is a preferred solution for storing structured and unstructured big data for enterprises. This solution enables high-throughput data access for applications with large data sets. High fault tolerance and a low cost of deployment on commodity hardware are the key features of HDFS.

Users rely on HDFS thanks to its high speed: its cluster architecture supports data transfers of up to 2 GB per second. This solution also grants users streamlined access to more data types, and its batch-oriented processing of streaming data allows for high throughput.

HDFS features high portability across hardware configurations. It is also designed to be compatible with numerous operating systems, allowing it to be used with customized setups. Users can scale resources either horizontally or vertically based on the size and configuration of their enterprise file systems.

Data locality is another advantage of HDFS, with data being stored in nodes instead of being moved to the computational unit. This shortens the gap between computing and data, minimizing network congestion and boosting system efficiency.

Finally, HDFS is designed with flexibility in mind. It does not need to process collected data before storing it. This allows users to hold as much data as required without earmarking it for a specific purpose beforehand.

See More: What Is Data Security? Definition, Planning, Policy, and Best Practices

How Does HDFS Work?

The Hadoop Distributed File System writes data to the server once and then reads it many times. This write-once, read-many model is the core reason for HDFS's speed when dealing with high volumes of data: other file systems generally rely on repeated read/write actions, which slows them down.

NameNode, DataNodes, and blocks

HDFS is built around a master server responsible for file management and access control. It works with a main ‘NameNode’ and several ‘DataNodes’ on a cluster of off-the-shelf hardware. These nodes are typically situated on a shared physical rack within a data center.

Data is fractionated into separate ‘blocks’ disseminated for storage among numerous DataNodes. Replication of blocks across nodes helps minimize the probability of failure.

The NameNode is a computer equipped with a GNU/Linux operating system. It is the ‘smart’ node in the cluster and can discern the locations of DataNodes and the blocks contained within them. The NameNode also controls file access and the reading, writing, creation, deletion, and replication of blocks across nodes.

A DataNode is also powered by a GNU/Linux operating system and handles the actual data storage. It can carry out file system operations as required by the client, as well as create, delete, and replicate blocks as per the instructions of the NameNode.

NameNode and DataNodes are ‘loosely coupled’, allowing cluster elements to dynamically adapt to server capacity demand. Demand is met in real time by adding and removing nodes as required.

DataNodes gauge their operational responsibilities through constant communication with the NameNode, which instructs them to complete tasks and notes down each DataNode’s status. If the NameNode recognizes a malfunctioning DataNode, it can reassign duties to different nodes with the same data block. DataNodes also communicate among themselves during regular file operations.

Once a ‘dead’ DataNode returns to ‘life’, or if a new DataNode is detected, the NameNode automatically adds that node back to the cluster, keeping the system resilient and self-repairing. The replication of data blocks across nodes means the risk of data loss from the failure of a single server is low.

The number of DataNodes and the degree of replication are adjusted during cluster implementation and can be adjusted responsively during cluster operations. HDFS also carefully monitors data integrity across the cluster through its many features, including transaction logs and validations.

We’ll learn more about nodes and blocks from an HDFS architecture perspective later in this article.

Replication management

Distributed systems create redundant data repositories at multiple locations to ensure continued data accessibility even if one or more machines fail. The Hadoop Distributed File System follows the same principle to store block replicas on numerous DataNodes based on the replication factor, which defines the number of copies to be created for the blocks of a file.

For instance, if the value of the replication factor for a particular block is 3, three copies of that block are created. These copies are then stored on different DataNodes so that if any of the nodes fails, the block remains accessible from the other DataNodes.

Each replica occupies its own space within the HDFS system; therefore, if a file with replication factor 3 has, for instance, a size of 256 MB, it occupies a total of 768 MB of disk space. This replication mechanism is the primary basis for the fault tolerance of HDFS.
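
To make the replication factor concrete, here is a minimal sketch of how a client could set and then read back a file's replication factor using Hadoop's Java FileSystem API. The path /data/example.csv and the target factor of 3 are hypothetical, and the sketch assumes the client's configuration can reach a running HDFS cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // connect to the default (HDFS) file system

        Path file = new Path("/data/example.csv");  // hypothetical file path

        // Request three copies of every block of this file.
        fs.setReplication(file, (short) 3);

        // Confirm the replication factor recorded by the NameNode.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());
        System.out.println("File size (bytes): " + status.getLen());

        fs.close();
    }
}
```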

Rack awareness

A rack is a collection of a few dozen DataNodes physically connected through a single network switch. If that switch or the rack’s network fails, the entire rack becomes unavailable. To address this, rack awareness ensures that block replicas are stored on different racks.

For instance, if a block has a replication factor of 3, the rack awareness algorithm will store the first replica on the local rack, the second replica on another DataNode within the same rack, and the third replica on a completely different rack.

Through rack awareness, the closest node is chosen based on rack information. The NameNode uses the rack awareness algorithm to store and access replicas in a way that boosts fault tolerance and minimizes latency.
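
As a rough illustration, the sketch below shows how a client could ask the NameNode where each block of a file lives, including the rack information exposed through topology paths. The file path is hypothetical, and meaningful rack names only appear if the cluster has a rack-awareness topology configured; otherwise Hadoop reports a default rack.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/example.csv");          // hypothetical file path

        FileStatus status = fs.getFileStatus(file);
        // Ask the NameNode where every block of the file is stored.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + ", length " + block.getLength());
            // Hostnames of the DataNodes holding replicas of this block.
            System.out.println("  hosts: " + String.join(", ", block.getHosts()));
            // Topology paths include rack information, e.g. /rack1/datanode-host.
            System.out.println("  racks: " + String.join(", ", block.getTopologyPaths()));
        }
        fs.close();
    }
}
```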

Read and write operations

  • Read operation

A client begins the HDFS read operation by first communicating with the NameNode and retrieving metadata that contains the locations of DataNodes with the relevant data blocks. Once the client receives the locations of the DataNodes, it directly interacts with them.

Based on the metadata received from the NameNode, the client begins reading data in parallel from the DataNodes. Data flows directly between the DataNodes and the client. Once the application or client receives all the file blocks, they are combined to reconstruct the original file.
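
Below is a minimal sketch of the client side of a read, assuming the standard Hadoop Java client API and a hypothetical file path. Calling open() triggers the metadata request to the NameNode described above, and the returned stream then pulls the blocks directly from the DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/example.csv");  // hypothetical file path

        // open() contacts the NameNode for block locations; the returned stream
        // then reads the blocks directly from the relevant DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);  // stream file contents to stdout
        }
        fs.close();
    }
}
```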

  • Write operation

Like the read operation, the HDFS write operation begins with the client requesting metadata from the NameNode. The response from the NameNode contains the number and locations of the blocks, the IP addresses of the DataNodes holding their replicas, and other details. Based on this metadata, direct interaction occurs between the client and the relevant DataNodes.

Here’s how that goes: the client first sends a block to, let’s say, DataNode A. At the same time, it shares the IP addresses of the two other DataNodes that will be used for replica storage. Once DataNode A receives the block, it copies it to DataNode B on the same rack. The block transfer occurs over the rack switch, as both DataNodes exist on the same rack.

Then, DataNode B makes a copy of the block and transmits it to DataNode C, which is located on a different rack, using an out-of-rack switch. Once all DataNodes have received the block and the transaction with the client is complete, a write confirmation is transmitted to the NameNode. This process is repeated for every file block being written to HDFS.
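
Here is a hedged sketch of the client side of a write, again using the Hadoop Java API with a hypothetical destination path. The create() call lets the client request a specific replication factor and block size for the new file; the DataNode-to-DataNode replication pipeline described above happens behind the scenes as the stream is written.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/output.txt");   // hypothetical destination path

        // Create the file with an explicit replication factor (3) and block size (128 MB).
        try (FSDataOutputStream out = fs.create(
                file, true /* overwrite */, 4096 /* buffer size */,
                (short) 3 /* replication */, 128L * 1024 * 1024 /* block size */)) {
            out.write("hello from the HDFS client".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```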

Example of HDFS in action

Imagine there exists a file containing the Social Security Numbers of every citizen in the United States. The SSNs for people with the last name beginning with A are stored on server 1, SSNs for last names beginning with B are stored on server 2, etc. Essentially, Hadoop would store fragments of this SSN database across a server cluster. For the entire file to be reconstructed, the client would require blocks from all servers within the cluster.

To ensure that this data’s availability is not compromised in case of server failure, HDFS replicates blocks onto two additional servers as a rule. The redundancy can be increased or decreased based on the replication factor of individual files or an entire environment. For instance, a Hadoop cluster earmarked for development typically would not require redundancy.

Besides high availability, redundancy has another advantage: it enables Hadoop clusters to break work into smaller pieces and execute those tasks across cluster servers for enhanced scalability. A third benefit of HDFS redundancy is data locality, which is critical for the efficient processing of large data sets.

See More: What Is Enterprise Data Management (EDM)? Definition, Importance, and Best Practices

HDFS Architecture

The Hadoop Distributed File System is implemented using a master-worker architecture, where each cluster has one master node and numerous worker nodes. The files are internally divided into several blocks, each stored on different worker machines based on their replication factor.

The master node is known as the NameNode and is responsible for managing and storing the file system namespace, including metadata such as block locations and permissions. The worker nodes are known as DataNodes and are primarily responsible for storing the data blocks themselves.

These are the nodes of an HDFS architecture:

1. NameNode

The NameNode is the core of HDFS. It is responsible for critical functions such as file system namespace management and client access permissions. Before Hadoop 2, the NameNode was a single point of failure. However, Hadoop 2 introduced the high-availability (HA) cluster architecture, which allows additional NameNodes to run in the cluster in a hot standby configuration.

NameNode stores data on block locations, permissions, and other parameters on the local disk in two file formats:

  • Fsimage, short for file system image, contains the comprehensive file system namespace since the creation of the NameNode.
  • Edit log, which records all changes made to the file system namespace since the latest Fsimage.

Key functions of NameNode include:

  • Executing namespace operations, such as the opening, renaming, and closing of files and directories within the file system (illustrated in the sketch after this list)
  • Managing and maintaining DataNodes
  • Block mapping to DataNodes
  • Recording each change in the namespace
  • Tracking locations of all blocks
  • Determining block replication factors
  • Receiving and tracking heartbeat and block reports from DataNodes to make sure they’re ‘alive’
  • In case of DataNode failure, selecting new DataNodes for replica creation
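
The namespace operations in the first bullet are easiest to see from the client's perspective. The sketch below uses the Hadoop Java API to create, rename, list, and delete paths; each of these calls is a namespace change that the NameNode records. The directory names are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOperationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path dir = new Path("/data/staging");        // hypothetical directory
        fs.mkdirs(dir);                              // create a directory (namespace change)

        Path renamed = new Path("/data/ready");
        fs.rename(dir, renamed);                     // rename it (namespace change)

        // List the contents of the parent directory.
        for (FileStatus entry : fs.listStatus(new Path("/data"))) {
            System.out.println(entry.getPath() + (entry.isDirectory() ? " (dir)" : ""));
        }

        fs.delete(renamed, true);                    // delete recursively (namespace change)
        fs.close();
    }
}
```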

2. DataNode

DataNodes serve as the worker nodes in HDFS and usually exist as inexpensive off-the-shelf hardware.

Key functions of DataNode include:

  • Storing blocks of a file
  • Serving read and write requests for clients
  • Executing block creation, deletion, and replication instructions from NameNode
  • Transmitting heartbeat signals to NameNode to help track HDFS health
  • Sending block reports to NameNode to help maintain a record of blocks contained within the DataNode

3. Blocks

HDFS splits files into smaller data chunks called blocks. The default size of a block is 128 MB; however, users can configure this value as required. Users generally cannot control the location of blocks within the HDFS architecture.

With the default block size, files are split as follows. Let’s say a file of size 718 MB needs to be stored using HDFS. It will be broken down into five blocks of 128 MB each and one additional block of 78 MB.

As the last block in the above example is smaller than the default block size, it will not consume a full block’s worth of space in local storage. This means that the additional block occupies only 78 MB on disk and not a full 128 MB (ignoring the replication factor).
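
The arithmetic behind this split is simple enough to verify in a few lines. The following sketch just reproduces the 718 MB example above against the default 128 MB block size; no running cluster is needed.

```java
public class BlockSplitExample {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;    // default HDFS block size: 128 MB
        long fileSize  = 718L * 1024 * 1024;    // the 718 MB file from the example

        long fullBlocks = fileSize / blockSize; // 5 full 128 MB blocks
        long lastBlock  = fileSize % blockSize; // 78 MB remainder in the final block

        System.out.println("Full blocks: " + fullBlocks);
        System.out.println("Last block size (MB): " + lastBlock / (1024 * 1024));
    }
}
```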

4. Secondary NameNode

The secondary NameNode is a daemon distinct from DataNode and NameNode. It serves as a helper node for NameNode but cannot replace it.

On startup, NameNode merges the Fsimage and edit logs to restore the namespace of the current file system. However, as NameNodes typically operate continuously for long durations without restarting, the edit logs become large, which leads to long restart times.

The secondary NameNode addresses this issue by downloading Fsimage and edit logs from NameNode at regular intervals. Edit logs are periodically applied to Fsimage and then refreshed, after which the secondary NameNode transmits the updated Fsimage to NameNode. This removes the need for NameNode to reapply the edit log records while restarting. Thus, the edit log size remains small and NameNode restart time is reduced.

Additionally, in case of NameNode failure, the secondary NameNode provides the last saved instance of Fsimage to help recover file system metadata.

5. Checkpoint node

This node creates namespace checkpoints at regular intervals. Its functions include downloading Fsimage and edit logs from NameNode, merging them locally, and uploading the updated Fsimage to the NameNode.

This may sound similar to how the secondary NameNode operates; however, there is one key difference: the checkpoint node does its work while the NameNode is active, whereas the secondary NameNode’s merged Fsimage is put to use when the NameNode restarts. The live creation of an updated Fsimage during NameNode runtime is known as checkpoint creation, which gives the checkpoint node its name.

The checkpoint node stores the last-created checkpoints in a directory with the same structure as that of the NameNode. This allows the checkpoint image to be available for read operations from the NameNode whenever required.

6. Backup node

Finally, this node offers a checkpointing functionality similar to the checkpoint node. However, the backup node is constantly synchronized with the active state of the NameNode and maintains an in-memory, up-to-date copy of the file system’s namespace. This means there is no need for the backup node to download Fsimage and edit logs from the active NameNode for checkpoint creation.

An up-to-date namespace state always exists in the memory of the backup node. This makes the backup node’s checkpointing process more efficient than the checkpoint node’s, as only the namespace needs to be saved into the local Fsimage file and edit logs need to be reset. NameNode can support only one backup node at any given time.

See More: What Is Deepfake? Meaning, Types of Frauds, Examples, and Prevention Best Practices for 2022

Top HDFS Commands

Hadoop Distributed File System commands allow users to execute operations such as changing permissions, creating directories, viewing contents, replicating files and directories from HDFS to the local file system and vice versa, and so on.

Let’s take a look at the top HDFS commands:

  • hdfs dfs [generic options] -ls [-c] [-h] [-q] [-R] [-t] [-S] [-u] [<path>…] – Lists the files and directories at the given path, much like ls in a typical Linux file system.
  • hadoop fs -help – Prints a detailed description of every HDFS shell command.
  • hadoop fs -rm -r – Removes a complete HDFS directory, including its contents, recursively.
  • hdfs dfs [generic options] -getmerge [-nl] <src> <localdst> – Concatenates all files under the source path into a single new file in the local destination directory.
  • hadoop fs -expunge – Empties the trash.
  • hdfs dfs [generic options] -du [-s] [-h] <path> … – Shows disk usage for the given path, in bytes by default. Use -h to generate a human-readable size output.
  • hadoop fs -du – Shows the space, in bytes, used by files matching the specified file pattern.
  • hadoop fs -setrep -w 1 – Modifies the replication factor of a file to the specified count (here 1) and waits (-w) for replication to complete. For directories, this command recursively modifies the replication factor of every file in the directory tree.
  • hadoop fs -mkdir – Creates a directory in the HDFS environment. For subdirectory creation, the parent directory has to exist.
  • hadoop fs -copyFromLocal – Copies a file from the local file system to HDFS.
  • hadoop fs -chmod [-R] <mode> <path> – Changes file permissions, similar to Linux’s chmod shell command.
  • hadoop fs -checksum <src> – Outputs the checksum information for the file.
  • hadoop fs -touchz /directory/filename – Creates a zero-byte file in the HDFS file system.
  • hadoop fs -stat [format] <path> – Prints statistics about a file or directory in the given format. Format specifiers include file size in bytes (%b), block size (%o), group name of the owner (%g), file name (%n), user name of the owner (%u), replication factor (%r), and modification date (%y, %Y).
  • hadoop fs -find <path> … <expression> – Finds all files matching the given expression.
  • hadoop fs -ls – Outputs a list of files and subdirectories in the default (home) directory.
  • hadoop fs -getmerge <src> <localdest> – Merges the files under the HDFS source path into a single file at the local destination.
  • hadoop fs -count [options] <path> – Counts the directories, files, and bytes under the given path.
  • hadoop fs -appendToFile <localsrc> <dest> – Appends the contents of the provided local files to the given destination file in HDFS. If the destination file does not exist, it is created.
  • hadoop fs -tail [-f] – Displays the last kilobyte of the file.

 
See More: What Is Data Science? Definition, Lifecycle, and Applications

Takeaway

The Hadoop Distributed File System is a cost-effective, resilient, and portable distributed file system solution. Its nodes rely on economical commodity hardware, thus cutting storage costs. Licensing fees are also avoided as HDFS is open source, further reducing costs. Additionally, this solution can store data of any size, from kilobytes to petabytes, and in any structured or unstructured format.

HDFS features swift recovery from hardware failure and is designed to detect and address faults automatically. It can also be ported across hardware platforms and is compatible with many operating systems. Finally, HDFS specializes in high data throughput, simplifying access to streaming data.

Did this article help you with an in-depth understanding of the Hadoop Distributed File System (HDFS)? Share your feedback on Facebook, Twitter, or LinkedIn!


Hossein Ashtari
Interested in cutting-edge tech from a young age, Hossein is passionate about staying up to date on the latest technologies in the market and writes about them regularly. He has worked with leaders in the cloud and IT domains, including Amazon—creating and analyzing content, and even helping set up and run tech content properties from scratch. When he’s not working, you’re likely to find him reading or gaming!