Suppose we have a 500 MB file containing store_id, date, and daily_sales.
Problem Statement: Find the all-time total sales per store.
store_id,date,daily_sales
1,12/01/2018,450
1,13/01/2018,500
1,14/01/2018,600
2,13/02/2018,400
2,14/02/2018,300
2,15/02/2018,600
...
Solution: There are 5 steps involved in the MapReduce algorithm: Record Reader, Map, Shuffling, Sorting, and Reducer.
Record Reader – Takes each line in the file and converts it to a key-value pair (built in).
1,12/01/2018,450 =====> (0, [1,12/01/2018,450]) …(where key=0 and value=[1,12/01/2018,450])
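The record reader step can be sketched in plain Python. This is only an illustration, not Hadoop code: the function name record_reader is hypothetical, and it assumes the key is the byte offset of the line within the file (which is what Hadoop's default text input format uses) and that the value is the raw line.

```python
def record_reader(text):
    """Yield (byte_offset, line) pairs, one per line of the input text.

    Assumption: the key is the byte offset at which the line starts.
    """
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\r\n")
        # Advance by the encoded length so the key stays a byte offset.
        offset += len(line.encode("utf-8"))


sample = "1,12/01/2018,450\n1,13/01/2018,500\n"
records = list(record_reader(sample))
print(records)
```

The first record gets key 0, matching the example above; later records get the offset at which their line begins.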
Map – Solves the problem on the local machine where the data resides (parallelism and data locality). The logic is decided by the developer.
(0, [1,12/01/2018,450]) =====> Ignore the date =====> (1, 450) …(new key-value pair to solve the problem)
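A minimal sketch of such a map function in plain Python follows. The name map_sales is hypothetical, and the sketch assumes every input line is a well-formed data row (a header line, if present, would have to be skipped before this step).

```python
def map_sales(key, value):
    """Turn one (offset, line) record into a (store_id, daily_sales) pair.

    The incoming key (the line's offset) and the date are ignored,
    because neither is needed to total the sales per store.
    """
    store_id, _date, daily_sales = value.split(",")
    return int(store_id), int(daily_sales)


pair = map_sales(0, "1,12/01/2018,450")
print(pair)
```

Each mapper would run this logic independently over its own portion of the file.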
Shuffling – Transfers the output of the mapper programs to the designated reducer machines, so that all pairs with the same store_id reach the same reducer.
Sorting – Sorts all inputs received from shuffling by store_id.
Reducer – Groups the records by store_id and sums all daily_sales to give the final output.
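Shuffling and sorting can be simulated on one machine as follows. This is a sketch under stated assumptions: the function shuffle_and_sort and its num_reducers parameter are hypothetical, and reducers are chosen by store_id modulo the reducer count, which stands in for a real framework's partitioner.

```python
from collections import defaultdict


def shuffle_and_sort(mapped_pairs, num_reducers=2):
    """Route each (store_id, sales) pair to a reducer, then sort per reducer."""
    partitions = defaultdict(list)
    for store_id, sales in mapped_pairs:
        # Assumed partitioning rule: the same store_id always maps
        # to the same reducer index.
        partitions[store_id % num_reducers].append((store_id, sales))
    # Sort each reducer's input by store_id only (a stable sort, so pairs
    # with the same store_id keep their arrival order).
    return {
        reducer: sorted(pairs, key=lambda pair: pair[0])
        for reducer, pairs in partitions.items()
    }


# Mapper output arriving in arbitrary order.
mapped = [(2, 400), (1, 450), (2, 300), (1, 500)]
single = shuffle_and_sort(mapped, num_reducers=1)
double = shuffle_and_sort(mapped, num_reducers=2)
print(single)
print(double)
```

With one reducer, all pairs land in a single sorted list; with two, each store's pairs are kept together on one reducer.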
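The reduce step for the six sample rows can be sketched as below. The name reduce_sales is hypothetical, and the sketch assumes its input is already sorted by store_id, as the sorting step guarantees.

```python
from itertools import groupby
from operator import itemgetter


def reduce_sales(sorted_pairs):
    """Sum daily_sales per store_id; input must be sorted by store_id."""
    totals = {}
    # groupby only merges adjacent equal keys, hence the sorted input.
    for store_id, group in groupby(sorted_pairs, key=itemgetter(0)):
        totals[store_id] = sum(sales for _, sales in group)
    return totals


sorted_pairs = [(1, 450), (1, 500), (1, 600), (2, 400), (2, 300), (2, 600)]
totals = reduce_sales(sorted_pairs)
print(totals)  # {1: 1550, 2: 1300}
```

For the sample data, store 1 totals 450 + 500 + 600 = 1550 and store 2 totals 400 + 300 + 600 = 1300.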
This is a basic working example of MapReduce. Although building the logic and writing code for MapReduce programs by hand is largely redundant as of 2023, the fundamentals are essential and will help in understanding upcoming concepts.