The architecture of Hadoop is given below:
Hadoop is designed on a master-slave architecture and has the following elements:
The commodity namenode hardware runs the GNU/Linux operating system, its file system libraries, and the namenode software. The system that hosts the namenode is the master server and performs the following tasks:
Manages the file system namespace.
Regulates client access to files.
Executes file system operations such as renaming, opening, and closing files and directories.
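The namenode's role is easiest to see with a small sketch. The toy class below is not Hadoop code; it only illustrates the idea that the namenode holds file metadata (here, a mapping from paths to block IDs) and that operations like rename touch only that metadata, never the block data itself.

```python
# Illustrative sketch (not Hadoop code): a toy namenode-style
# namespace that maps file paths to lists of block IDs.

class ToyNamespace:
    def __init__(self):
        self.files = {}  # path -> list of block IDs

    def create(self, path, blocks):
        self.files[path] = list(blocks)

    def rename(self, old, new):
        # Renaming rewrites metadata only; the block data stored
        # on the datanodes is untouched.
        self.files[new] = self.files.pop(old)

ns = ToyNamespace()
ns.create("/logs/day1", ["blk_1", "blk_2"])
ns.rename("/logs/day1", "/logs/archive/day1")
print(ns.files)  # {'/logs/archive/day1': ['blk_1', 'blk_2']}
```

Because the namespace is pure metadata, such operations are cheap for the namenode regardless of how large the underlying files are.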
The commodity datanode hardware runs the GNU/Linux operating system, its file system libraries, and the datanode software. Every node in the cluster runs a datanode, and these datanodes manage the storage attached to their node. They perform the following tasks:
Datanodes serve read and write requests from the file system's clients.
They also perform block creation, deletion, and replication upon instruction from the namenode.
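Replication, one of the operations the namenode directs, can be sketched as follows. This is a hypothetical toy placement policy for illustration only; real HDFS uses a rack-aware policy, but the default replication factor of 3 is accurate.

```python
# Illustrative sketch: the namenode directing a block to be
# replicated onto several datanodes. HDFS's default replication
# factor is 3; the "first N nodes" policy here is a toy
# simplification of HDFS's rack-aware placement.

def place_replicas(block_id, datanodes, replication=3):
    """Pick target datanodes for a block's replicas."""
    return [(block_id, node) for node in datanodes[:replication]]

nodes = ["dn1", "dn2", "dn3", "dn4"]
print(place_replicas("blk_42", nodes))
# [('blk_42', 'dn1'), ('blk_42', 'dn2'), ('blk_42', 'dn3')]
```

Keeping three copies of each block on different nodes is what lets HDFS survive the datanode failures discussed under its design goals below.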
The data stored in HDFS files is divided into segments called blocks, which are stored on individual datanodes. The default block size is 64 MB (128 MB in recent Hadoop versions), and users can change it through the HDFS configuration to suit their requirements. The main reason for such large blocks is to minimize seek overhead: when blocks are large, the time spent transferring data dominates the time spent locating it on disk.
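The block-splitting arithmetic is simple enough to show directly. The sketch below is illustrative, not Hadoop code, and uses the 64 MB default block size mentioned above.

```python
# Illustrative sketch: how a file's size maps onto fixed-size
# HDFS blocks. 64 MB is HDFS's historical default block size.

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, block_length) pairs for a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks

# A 150 MB file becomes two full 64 MB blocks plus a 22 MB tail.
for index, length in split_into_blocks(150 * 1024 * 1024):
    print(index, length // (1024 * 1024), "MB")
```

Note that the final block occupies only as much space as its actual data, so small tail blocks do not waste a full 64 MB on disk.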
The HDFS file system was designed around the following goals:
Data Safety: HDFS runs on large numbers of commodity machines, so the probability of component failure is high; the system therefore includes built-in fault detection and automatic data recovery.
Huge datasets: HDFS is built for applications with very large datasets, and each cluster scales to many nodes so that such applications run with high speed and efficiency at low cost.
Hardware at data: moving computation close to the data is cheaper than moving data to the computation; running applications and computations near the data they process reduces network traffic and thereby increases throughput.
HDFS File Permissions
Without the appropriate permissions, a user cannot operate on the data. The read permission (r) is required to read a file or list a directory's contents. The write permission (w) is required to write to a file, or to create or delete files and directories within a directory. The execute permission (x) is ignored for files; for directories, it is required to access their children. Each file and directory has an owner, a group, and a mode associated with it; the mode specifies the permissions granted to the owner, to members of the group, and to all other users.
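The owner/group/other check described above can be sketched in a few lines. This is an illustrative model, not HDFS code; the user and group names are made up for the example.

```python
# Illustrative sketch: evaluating a 9-character rwx mode string
# such as "rwxr-x---", checking owner, then group, then other
# permissions, as HDFS-style permission models do.

def has_permission(mode, owner, group, user, user_groups, perm):
    """mode: string like 'rwxr-x---'; perm: 'r', 'w', or 'x'."""
    if user == owner:
        bits = mode[0:3]       # owner permissions
    elif group in user_groups:
        bits = mode[3:6]       # group permissions
    else:
        bits = mode[6:9]       # other permissions
    return perm in bits

# alice owns the file; bob is only in the file's group.
print(has_permission("rwxr-x---", "alice", "analysts", "alice", [], "w"))          # True
print(has_permission("rwxr-x---", "alice", "analysts", "bob", ["analysts"], "w"))  # False
```

Only the first matching class is consulted: an owner with mode "---rwxrwx" would be denied even if the group bits allow access, which matches the usual POSIX-style semantics.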
In the image above, HDFS is Hadoop's file system. MapReduce (MR) jobs execute on the file system and are the user's main way of processing HDFS files. Pig and Hive are interpreters that convert user queries, generally SQL-like scripts, into MR jobs. Hive is optimized for running jobs in batch mode, while Impala is optimized for low-latency queries in near-real-time applications. Sqoop moves bulk data between the Hadoop file system and relational databases. Flume supports the collection, aggregation, and movement of large amounts of data across this ecosystem. Oozie manages workflows and controls their execution. Mahout provides data mining and machine learning libraries and helps in executing data analysis. If a file is larger than the block size (64 MB by default), Hadoop breaks it into multiple blocks so that tasks can run on them in parallel. The namenode and datanodes control these tasks and the individual nodes for seamless execution of jobs.
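The shape of the MR jobs that tools like Pig and Hive compile queries into can be sketched in plain Python. This word-count example is illustrative only; real MapReduce distributes the map and reduce phases across datanodes and shuffles intermediate pairs between them.

```python
# Illustrative sketch: the map and reduce phases of an MR job,
# expressed in plain Python as a word count.

from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["hadoop stores data", "hadoop processes data"]
print(reduce_phase(map_phase(lines)))
# {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

Because each map call sees only one line and each reduce key is independent, both phases parallelize naturally across the blocks and nodes described earlier.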
You now have a basic understanding of the Hadoop HDFS ecosystem. Many commercial applications use Hadoop today, and the details above will help you get started learning about the Hadoop HDFS ecosystem.