HDFS stands for Hadoop Distributed File System. It is a distributed file system that is designed to store and manage large datasets across multiple computers in a cluster. It is highly fault-tolerant and can handle large amounts of data with high throughput, making it ideal for big data processing.
HDFS architecture
NN (NameNode) – The NameNode stores the metadata for the data held on the DataNodes, such as the file-to-block mapping and block locations. It is the reference point for locating the actual data in Hadoop.
DN (DataNode) – DataNodes store the actual data in the form of blocks. As of Hadoop 2.0, each block has a default size of 128 MB (64 MB in Hadoop 1).
Client Node – A client node runs the client applications that interact with the Hadoop cluster, such as the Hadoop command-line tools, API-based programs, web interfaces, and third-party applications that use Hadoop services. Client nodes typically do not take part in data processing or storage, but they can initiate jobs, submit requests, and monitor the status of jobs in the cluster.
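To make the client role concrete, here is a minimal sketch of a client application written against the HDFS Java FileSystem API. The NameNode address (hdfs://namenode-host:9000) and the file path are placeholder assumptions, not values from this article.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS value.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");   // hypothetical path

            // Write a small file: the client asks the NameNode for metadata
            // and streams the actual bytes to DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back and copy its contents to stdout.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```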
Fault Tolerance in case of DataNode failure
For example, consider 4 DataNodes (DN 1–4) and a 500 MB file. With the default 128 MB block size, the file is divided into 4 blocks (3 × 128 MB + 1 × 116 MB).
If each DataNode holds exactly one of those blocks and any DataNode fails, the data in that block is lost.
Solution – the replication factor. The default replication factor in Hadoop 2.0 is 3: Hadoop internally makes 3 copies of each block (by default) and stores those copies on different DataNodes.
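As a rough illustration, the sketch below shows how a client could set the replication factor, either through the dfs.replication setting for new files or with FileSystem.setReplication for an existing file, and how the 500 MB file above maps onto 128 MB blocks. The NameNode address and file path are assumptions made for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        // Replication factor applied to files created by this client (3 is already the default).
        conf.set("dfs.replication", "3");

        try (FileSystem fs = FileSystem.get(conf)) {
            // Hypothetical existing file; change its replication factor explicitly.
            Path file = new Path("/user/demo/big-500mb.dat");
            boolean changed = fs.setReplication(file, (short) 3);
            System.out.println("Replication updated: " + changed);

            // Show how the file size maps onto blocks of the default block size.
            long blockSize = fs.getDefaultBlockSize(file);   // typically 128 MB
            long fileSize = fs.getFileStatus(file).getLen();
            long blocks = (fileSize + blockSize - 1) / blockSize;
            System.out.println("File is stored as " + blocks + " block(s) of up to "
                    + blockSize / (1024 * 1024) + " MB each");
        }
    }
}
```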
What happens when a DataNode fails?
Hadoop internally maintains the replication factor by creating additional copies from the surviving replicas and updating the NameNode metadata, so even if a DataNode fails, there are always 3 copies (by default) of each block in the cluster.
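To see this replication from a client's point of view, here is a small sketch that asks the NameNode for a file's replication factor and block locations via getFileBlockLocations; each block should be reported on multiple DataNodes. The file path and NameNode address are again placeholders for this example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/big-500mb.dat");   // hypothetical file
            FileStatus status = fs.getFileStatus(file);

            System.out.println("Replication factor: " + status.getReplication());

            // Each BlockLocation lists the DataNodes that hold a replica of that block.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("Block at offset " + block.getOffset()
                        + " replicated on: " + String.join(", ", block.getHosts()));
            }
        }
    }
}
```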