Big Data has taken the world by storm. The coming decade is expected to be dominated by Big Data, with companies using the data available to them to understand their ecosystem and fix its shortcomings. Major universities and companies have started investing in tools that help them understand and derive useful insights from the data they have access to. One such tool for analyzing and processing Big Data is Hadoop.
Hadoop is a framework that includes components such as MapReduce and HDFS, and it is written almost entirely in Java. It runs in an environment with distributed storage and computation across a cluster of machines. MapReduce handles large amounts of data by breaking it into multiple sets and executing jobs on that data over a series of server nodes. It is a batch-mode technology rather than an interactive, GUI-based application.
Hadoop’s architecture has four main elements:
1. Hadoop Common: the Java utilities and libraries required by the other Hadoop modules and applications, including the OS-level abstractions and scripts needed to start Hadoop.
2. Hadoop YARN: the component that handles cluster resource management and job scheduling.
3. Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to and processing of data.
4. Hadoop MapReduce: the framework for processing large data sets in parallel.
Hadoop traces its roots to Google's 2003 "Google File System" paper and has evolved and matured considerably since then. By 2012 the ecosystem had grown to include packages such as Apache Hive, Apache Spark, Apache Pig, and Apache HBase. In 2013, SQL-on-Hadoop solutions arrived that made Hadoop far easier for database specialists to use, and YARN was introduced with Hadoop 2, bringing better cluster management and task monitoring.
Hue provides a user-friendly web interface for Hadoop and its applications, Ambari supports manageability extensions, Sentry takes care of security, and Tez brings interactive SQL-on-Hadoop to Apache Hive. Spark adds in-memory processing to enhance Hadoop's performance.
Hadoop has evolved rapidly over the past few years, with major improvements and enhancements still ongoing. It is also expanding into predictive analytics and machine learning, features that strengthen its position as a leading tool for handling Big Data.
Hadoop File Systems
HDFS is Hadoop's built-in file system: a portable, scalable, distributed file system written in Java. It breaks large files into smaller blocks and stores them across multiple machines, so no single machine's disk needs to hold an entire data set. Data is stored on nodes called DataNodes, which interact with one another to replicate blocks and move copies around, protecting the system against data loss.
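The block-splitting and replication idea can be sketched in plain Java. This is an illustration of the concept only, not Hadoop's actual code: the toy block size, replication factor, and node names below are all made up, and the round-robin placement is a simplification of HDFS's real rack-aware policy.

```java
import java.util.*;

// Illustrative sketch: split data into fixed-size blocks and assign each
// block to several "DataNodes" so losing one node does not lose data.
// HDFS defaults are 128 MB blocks and 3 replicas; values here are toy-sized.
public class BlockPlacementDemo {
    static final int BLOCK_SIZE = 4;   // bytes per block (toy value)
    static final int REPLICATION = 3;  // copies of each block

    // Split data into BLOCK_SIZE chunks (the last chunk may be shorter).
    static List<byte[]> splitIntoBlocks(byte[] data) {
        List<byte[]> blocks = new ArrayList<>();
        for (int off = 0; off < data.length; off += BLOCK_SIZE) {
            blocks.add(Arrays.copyOfRange(data, off,
                    Math.min(off + BLOCK_SIZE, data.length)));
        }
        return blocks;
    }

    // Assign each block to REPLICATION distinct nodes, round-robin.
    static Map<Integer, List<String>> placeBlocks(int blockCount, List<String> nodes) {
        Map<Integer, List<String>> placement = new HashMap<>();
        for (int b = 0; b < blockCount; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                replicas.add(nodes.get((b + r) % nodes.size()));
            }
            placement.put(b, replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        byte[] file = "hello hadoop!".getBytes();  // 13 bytes -> 4 blocks
        List<byte[]> blocks = splitIntoBlocks(file);
        Map<Integer, List<String>> placement =
                placeBlocks(blocks.size(), List.of("dn1", "dn2", "dn3", "dn4"));
        System.out.println(blocks.size());    // 4
        System.out.println(placement.get(0)); // [dn1, dn2, dn3]
    }
}
```

With three replicas per block, any single DataNode can fail and every block still has two live copies elsewhere, which is exactly the failure model the paragraph above describes.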
HDFS is portable and works across operating systems and hardware. Performance bottlenecks can still appear (for example, from Java garbage-collection pauses), so monitoring HDFS performance can get tricky. Monitoring platforms such as Cloudera Manager, Datadog, and Hortonworks are available to fill this gap.
Other File Systems
Hadoop primarily uses HDFS but is compatible with other file systems as well. The main caveat is lost data locality: you have to tell Hadoop which servers are nearest to the data so that it can schedule work accordingly. Below are some of the file systems Hadoop can work with:
1. FTP file system: data is stored on remote FTP servers.
2. Amazon S3 file system: an object store from Amazon Web Services, commonly used with clusters running on Amazon Elastic Compute Cloud.
3. Windows Azure Storage Blobs file system: an HDFS-compatible extension that stores the distributed data in Azure blob storage, decoupling the data from any one cluster.
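Switching the default file system is largely a matter of configuration. The sketch below shows the standard `fs.defaultFS` property in `core-site.xml`; the bucket name is a placeholder, and using `s3a://` assumes the S3A connector and credentials are set up separately.

```xml
<!-- core-site.xml: choose Hadoop's default file system -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- HDFS would look like: hdfs://namenode-host:9000 -->
    <!-- Amazon S3 via the s3a connector (bucket name is a placeholder): -->
    <value>s3a://my-example-bucket</value>
  </property>
</configuration>
```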
Hadoop MapReduce
MapReduce is a software framework within Hadoop that processes large data sets in parallel by executing applications across the cluster. It performs the following tasks:
1. Map Task: the input data is broken into tuples (key-value pairs).
2. Reduce Task: the output of the Map task becomes the input here; the tuples are combined into a smaller set of tuples.
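The two steps above can be sketched with the classic word-count example in plain Java. This illustrates the MapReduce model only; it does not use Hadoop's actual `org.apache.hadoop.mapreduce` API, and in a real cluster each map task would run on a different data split on a different node.

```java
import java.util.*;

// Illustrative word count following the MapReduce model in plain Java.
public class WordCountSketch {
    // Map task: break an input line into (word, 1) tuples.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> tuples = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) tuples.add(Map.entry(word, 1));
        }
        return tuples;
    }

    // Reduce task: combine tuples sharing a key into a smaller set.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> tuples) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> t : tuples) {
            counts.merge(t.getKey(), t.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> tuples = new ArrayList<>();
        // In Hadoop, each line (split) would be mapped on a separate node.
        for (String line : List.of("big data big tools", "data everywhere")) {
            tuples.addAll(map(line));
        }
        System.out.println(reduce(tuples)); // {big=2, data=2, everywhere=1, tools=1}
    }
}
```

Because every map call is independent, Hadoop can run thousands of them in parallel and only needs to group tuples by key before the reduce step.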
The Hadoop framework takes care of monitoring, scheduling, and re-running the tasks. The master JobTracker and the slave TaskTrackers work together to execute jobs; if the JobTracker fails, all running jobs are halted. (In YARN-based Hadoop 2, these roles are taken over by the ResourceManager and per-node NodeManagers.)
The world is changing the way it operates, and Big Data is playing an important role in that change. Hadoop is a framework that makes an engineer's life easier when working with large data sets, and it is improving on all fronts. The future is exciting.