MapReduce is a programming model for processing big data in a distributed environment.
Traditional algorithms assume the data is kept on a single machine. When big data is stored across multiple machines, MapReduce provides a way to process it where it lives.
MapReduce has two phases: Map and Reduce. Both take input and produce output as key-value pairs, e.g. (id, cmp_name) => (1, Amazon), (2, Google), etc.
Consider a Hadoop 2.0 cluster with 10 nodes and a 500 MB file.
With the default HDFS block size of 128 MB, 500/128 = 3.9 => 4 blocks are required (three full 128 MB blocks and one 116 MB block).
These 4 blocks are stored in the cluster with replication factor = 3, i.e. HDFS keeps 3 copies of each block, so 12 block replicas in total.
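The block arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation, not anything HDFS exposes: the function name block_layout and the constants are made up for illustration, and the 128 MB / replication 3 values are simply the figures used in this example.

```python
import math

BLOCK_SIZE_MB = 128      # block size used in this example
REPLICATION_FACTOR = 3   # replication factor used in this example

def block_layout(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                 replication=REPLICATION_FACTOR):
    """Return (number of blocks, size of last block in MB, total block replicas)."""
    # A partially filled block still counts as a block, so round up.
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    # The last block holds whatever is left after the full blocks.
    last_block_mb = file_size_mb - (num_blocks - 1) * block_size_mb
    # Every block is stored `replication` times across the cluster.
    total_replicas = num_blocks * replication
    return num_blocks, last_block_mb, total_replicas

print(block_layout(500))  # (4, 116, 12)
```

So the 500 MB file becomes 4 blocks, the last one only 116 MB, and the cluster stores 12 block copies overall.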
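What is the Map phase?
The Map job is called on the file => the Map program runs on the machines where the file's blocks are stored, typically one map task per block, scheduled where possible on a node that holds a copy of that block. Each task computes its results and stores them locally as key-value pairs. This follows data locality: the code moves to the data rather than the data to the code.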
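To make the Map phase a bit more concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code: the blocks are simulated as small lists, the task (counting how often each company name appears) is an assumed example, and map_block is a hypothetical name.

```python
def map_block(records):
    """Mapper: turn each (id, cmp_name) record into a (cmp_name, 1) pair."""
    return [(cmp_name, 1) for _id, cmp_name in records]

# Each inner list stands in for one file block stored on some node.
blocks = [
    [(1, "Amazon"), (2, "Google")],
    [(3, "Amazon"), (4, "Microsoft")],
]

# One map call per block; on a real cluster these would run in parallel,
# each on a machine that already holds the block.
map_outputs = [map_block(block) for block in blocks]
print(map_outputs)
# [[('Amazon', 1), ('Google', 1)], [('Amazon', 1), ('Microsoft', 1)]]
```

Note that each mapper only sees its own block, so "Amazon" shows up in two separate partial outputs at this stage.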
What is the Reduce phase?
The partial outputs from all the Map tasks are collected on a single machine in the cluster (assuming a single reducer; with several reducers, each one receives a share of the keys). The Reducer code works on these partial outputs and generates a single output file.
Note: the reducer machine may be the same as one of the mapper machines or a different one. It doesn’t matter; logically, it’s a different entity.
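The Reduce step can be sketched the same way. Again this is a plain-Python simulation under assumptions, not Hadoop's API: the partial outputs are hard-coded (cmp_name, 1) pairs from two imaginary mappers, and shuffle and reduce_counts are hypothetical helper names. Grouping the values by key before reducing mirrors what the framework does between the two phases.

```python
from collections import defaultdict

def shuffle(map_outputs):
    """Gather the values from every mapper's partial output under their key."""
    grouped = defaultdict(list)
    for partial in map_outputs:
        for key, value in partial:
            grouped[key].append(value)
    return grouped

def reduce_counts(grouped):
    """Reducer: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in sorted(grouped.items())}

# Partial outputs as two mappers might have produced them.
map_outputs = [
    [("Amazon", 1), ("Google", 1)],
    [("Amazon", 1), ("Microsoft", 1)],
]

result = reduce_counts(shuffle(map_outputs))
print(result)  # {'Amazon': 2, 'Google': 1, 'Microsoft': 1}
```

The two separate ("Amazon", 1) pairs from different mappers end up merged into one final count, which is the whole point of the Reduce phase.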
In the next post, I will demonstrate some examples to explain MapReduce better.