Hadoop MapReduce Tutorial

What is MapReduce?

MapReduce is a software framework for writing applications that process big data in parallel on large clusters of commodity hardware. This distributed computing model is based on Java and is designed to run reliably and tolerate faults on large clusters. In a MapReduce job, the input data is split into independent chunks that are processed in parallel by map tasks; the framework sorts the map outputs and passes them to reduce tasks, which combine them into the required final output. The inputs and outputs of a job are typically stored in a file system such as HDFS. The MapReduce framework takes care of scheduling tasks, monitoring them, and re-executing the tasks that fail.

The main advantage of this framework is that data processing scales across many compute nodes, which shortens the time needed to execute a job; this scalability is the main reason programmers use the framework so often.

MapReduce Algorithm

The MapReduce algorithm has three main steps:

1. Map Function: In this step, the framework divides the input data into multiple smaller data-sets or subtasks; this is called Splitting. It then performs the required computation on each subtask in parallel; this is called Mapping. Splitting and Mapping are the two sub-steps that make up the Map function.

2. Shuffle Function: The Shuffle function is also called the ‘Combine Function’ in the MapReduce algorithm. It performs two sub-steps. The first is Merging, in which all key-value pairs from the Map function’s output that have the same key are combined. The second is Sorting, in which the merged key-value pairs are sorted by key. This step returns <Key, List<Value>> pairs as its output.

3. Reduce Function: In this final step of the MapReduce algorithm, the output of the Shuffle function is taken and a reduce operation is applied to each <Key, List<Value>> pair to produce the final data-set. The result is a set of sorted, computed pairs of the form <Key, Value>.

Inputs and Outputs

The MapReduce framework and algorithm operate exclusively on <key, value> pairs. This means that the input to a task or job is a set of <key, value> pairs, and a set of <key, value> pairs, possibly of different types, is produced as the output once the task or job has been performed.

The key and value classes passed to a job have to be serializable by the framework, which in Hadoop means implementing the Writable interface. In addition, the key classes have to implement the WritableComparable interface so the framework can sort the key-value pairs.

A sample input and output of a MapReduce job look like this:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
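
In the Hadoop Java API, these types appear as the generic parameters of the Mapper and Reducer classes. As a rough sketch, with hypothetical class names and types chosen purely for illustration, a job whose map stage turns <k1, v1> into <k2, v2> and whose reduce stage produces <k3, v3> would be declared like this:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Assumed types: k1/v1 = LongWritable/Text, k2/v2 = Text/IntWritable, k3/v3 = Text/IntWritable.
// Note that the key types (LongWritable, Text) implement WritableComparable, as required for sorting.
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> { /* map(): <k1, v1> -> <k2, v2> */ }
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> { /* reduce(): <k2, List<v2>> -> <k3, v3> */ }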

MapReduce Example – Word Count

Let us take a simple data-set as an example to understand the MapReduce process clearly. This data-set is small; real jobs generally process very large amounts of data. Assume that the text file on which we want to perform a word count has the following content:

            Deer, Bear, River, Car, Car, River, Deer, Car and Bear

The overall MapReduce process will look like:

[Figure: the overall MapReduce word count flow - input splitting, mapping, shuffling and sorting, and reducing to the final output]

The steps followed are listed below (a Java sketch of the corresponding job follows the list):

1. The input is split into three different data-sets that are distributed among the map nodes.

2. In each map task the words are tokenized, and a hardcoded value of 1 is assigned to every word. The value 1 is used because each occurrence of a word is counted as appearing exactly once.

3. After this, a list of key-value pairs is created in which the key is the word and the value is the number one. For example, the first data-set produces three key-value pairs: Deer, 1; Bear, 1; River, 1.

4. After the mapping process is complete, the shuffling and sorting tasks are executed so that all the tuples that have the same key are grouped together and sent to the same reducer.

5. On completion of the sorting and shuffling tasks, each reducer will have a unique key and the corresponding list of values. For example, Car, [1,1,1]; Deer, [1,1]… etc.

6. Each reducer now counts the values in the list for its unique key and generates a key-value pair for that key, in which the value is the count.

7. The output is written in the key, value format to the output file.
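
The following is a minimal Java sketch of this word count, using the standard Hadoop MapReduce API. The class names WordCountMapper and WordCountReducer are chosen for this example only, and both classes are shown together here for brevity (in a real project each would normally be a public class in its own file):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: tokenizes each input line and emits (word, 1) for every token.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString(), " ,");
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);          // e.g. (Deer, 1), (Bear, 1), (River, 1)
        }
    }
}

// Reducer: receives (word, [1, 1, ...]) after shuffle and sort, and emits (word, count).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);            // e.g. (Car, 3)
    }
}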

MapReduce – User Interfaces

Understanding the user-facing interfaces helps developers implement, configure, and tune their jobs. The main interfaces are described below:

1. Mapper Interface: A Mapper transforms input records into intermediate records; the input to a mapper is a key-value pair, and one map task is spawned for each InputSplit generated by the InputFormat of the job. The Mapper implementation is passed to the job via Job.setMapperClass(Class), and the framework then calls map(WritableComparable, Writable, Context) for each key-value pair in the split. The types of the output pairs need not be the same as the input types, and a given input pair may be mapped to zero or many output pairs. The cleanup(Context) method can be overridden to perform any required clean-up, and applications can use counters to report statistics about the input data (see the sketch after this list). The output of the Mapper is passed on to the Reducer to compute the final output required by the user. A reasonable level of parallelism is around 10-100 maps per node, depending on the input size.

2. Reducer Interface: A Reducer reduces the output of the Mapper, i.e. the set of intermediate values that share a key, to a smaller set of values. The number of reducers for a job is set by Job.setNumReduceTasks(int), and the Reducer implementation is passed to the job via the Job.setReducerClass(Class) method. The framework then calls the reduce(WritableComparable, Iterable<Writable>, Context) method for each <key, list of values> pair in the grouped inputs; a Reducer goes through three phases: shuffle, sort, and reduce. A good number of reducers is around 0.95 or 1.75 times the number of nodes multiplied by the number of containers each node can run. Increasing the number of reducers adds framework overhead, but it also improves load balancing and lowers the cost of failures.
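
As a small illustration of the Mapper hooks mentioned above, the following hypothetical mapper skips empty lines, reports how many it skipped through a custom counter, and overrides cleanup(). The class name LineMapper and the Records.EMPTY counter are assumptions made for this sketch, not part of the Hadoop API:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper illustrating map(), cleanup(), and a custom counter.
class LineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Any enum can serve as a custom counter.
    enum Records { EMPTY }

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().trim();
        if (line.isEmpty()) {
            // Report a statistic to the framework; this input pair maps to zero output pairs.
            context.getCounter(Records.EMPTY).increment(1);
            return;
        }
        for (String token : line.split("\\s+")) {
            word.set(token);
            context.write(word, ONE);  // one input pair may also map to many output pairs
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Called once at the end of the task; release any resources held by the mapper here.
    }
}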

Job Configuration

Job is the primary interface for describing a MapReduce job to the Hadoop framework, which then tries to execute the job exactly as described. Note that some configuration parameters may be marked final by administrators and therefore cannot be altered by the job. Some parameters can be set directly through dedicated setter methods, while arbitrary parameters needed by applications can be set and read with the Configuration.set and Configuration.get methods. For large amounts of read-only data, however, it is recommended to use the DistributedCache. Many other settings, such as the maximum number of attempts per task, speculative execution, and job output compression, can also be defined in the job configuration.
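
As a minimal sketch of such a configuration (reusing the WordCountMapper and WordCountReducer classes from the earlier example, and taking hypothetical input and output paths from the command line), a driver might look like this; the wordcount.case.sensitive parameter is purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Arbitrary application parameter, readable later via conf.get("wordcount.case.sensitive").
        conf.set("wordcount.case.sensitive", "false");

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // from the earlier sketch
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setNumReduceTasks(2);                    // illustrative value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}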

Task Environment & Execution

The MRAppMaster executes the Mapper and Reducer tasks as child processes, each in its own JVM, and the child tasks inherit the environment of the parent MRAppMaster. The user can pass extra options to the child JVM through the mapreduce.{map|reduce}.java.opts and related configuration parameters of the job.

Memory Management

Users can also specify the maximum memory of the map and reduce tasks with the mapreduce.{map|reduce}.memory.mb parameters, whose values are given in MB. Make sure this value is higher than the -Xmx heap size passed to the child JVM, otherwise the tasks may fail to start.
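
Building on the hypothetical driver above, these limits might be set per job as follows (the values are illustrative, not recommendations):

// Container memory for each map/reduce task, in MB.
conf.set("mapreduce.map.memory.mb", "2048");
conf.set("mapreduce.reduce.memory.mb", "4096");
// Heap size for the child JVM; keep -Xmx below the corresponding container limit above.
conf.set("mapreduce.map.java.opts", "-Xmx1638m");
conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");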

Map Parameters

A record emitted from a map is serialized into a buffer, and metadata about the record is stored in accounting buffers. When a buffer reaches its threshold, its contents are spilled to disk in the background while the map task continues. Once the map task is completed, all remaining records are written to disk and the on-disk segments are merged into a single file. Some parameters that can be helpful are listed below, followed by a brief configuration sketch:

(a)   mapreduce.task.io.sort.mb: The cumulative size, in megabytes, of the serialization and accounting buffers that store records emitted from the map task.

(b)   mapreduce.map.sort.spill.percent: The soft limit on the serialization buffer, expressed as a fraction; once it is reached, a background thread begins to spill the contents to disk.
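
These can likewise be tuned per job through the hypothetical driver's Configuration (values are illustrative only):

conf.set("mapreduce.task.io.sort.mb", "256");          // 256 MB of map-side sort buffer
conf.set("mapreduce.map.sort.spill.percent", "0.80");  // start spilling to disk at 80% full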

Reduce Parameters

The reducer fetches the outputs of the map tasks that are assigned to it and periodically merges them, in memory and on disk. Some options that affect the frequency and behaviour of these merges are listed below, followed by a brief configuration sketch:

(a)   mapreduce.task.io.sort.factor: Specifies the number of on-disk segments to be merged at the same time; if the number of segments exceeds this limit, the merge is done in multiple passes.

(b)   mapreduce.reduce.merge.inmem.threshold: The number of sorted map outputs fetched into memory before they are merged and written to disk.

(c)    mapreduce.reduce.shuffle.merge.percent: The usage threshold, expressed as a fraction of the memory allocated to storing map outputs, at which an in-memory merge is started.
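
Continuing the same hypothetical driver, a per-job sketch of these settings could be (illustrative values only):

conf.set("mapreduce.task.io.sort.factor", "10");             // merge up to 10 disk segments per pass
conf.set("mapreduce.reduce.merge.inmem.threshold", "1000");  // merge after 1000 in-memory map outputs...
conf.set("mapreduce.reduce.shuffle.merge.percent", "0.66");  // ...or when 66% of the shuffle memory is used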

Interact with MapReduce Jobs

Interacting with jobs is straightforward and can be done with the basic options of the Hadoop job command-line tool (mapred job) listed below:

(a)  -submit <job-file> – submits the job.

(b) -status <job-id> – prints the map and reduce completion percentage and all job counters.

(c)  -kill <job-id> – kills the job.

(d) -history [all] <job-id> – prints the job details; with all, additional details about the tasks and attempts are printed.

(e)  -list [all] – displays the jobs that are yet to complete; with all, displays all jobs.

(f)   -set-priority <job-id> <priority> – changes the priority of the job.

(g) -counter <job-id> <group-name> <counter-name> – prints the counter value.

Conclusion

You now have a basic understanding of the Hadoop MapReduce framework, the programming model used to process big data stored in HDFS. Major improvements were made to this framework in Hadoop 2.x with the introduction of YARN. The details above should be enough to get you started with implementing your own MapReduce jobs.