What are Big Data and Hadoop?
As per IBM’s definition, big data is data characterized by high Volume, Velocity, and Variety (the three Vs).
Hadoop is simply a framework for storing and processing such big data across a cluster of computers.
The need for Hadoop comes from the problem statement:
“I have a data file larger than the storage capacity of a single computer. How do I store and process such data?”
To solve this problem:
2003 -> Google published a paper describing how to store large datasets across multiple computers: the Google File System (GFS).
2004 -> Google published a paper outlining MapReduce, a model for processing the data distributed across those computers.
2006 -> Yahoo implemented both concepts and named the data storage layer HDFS (Hadoop Distributed File System).
2008 -> Hadoop became a top-level open-source project of the Apache Software Foundation, released under the Apache License.
2013 -> Hadoop 2.0 was released, moving resource management out of MapReduce into YARN.
Hadoop 1.0 components
MapReduce – processes the distributed data and also manages the resources of all the computers in the cluster (network); see the word-count sketch below
HDFS – stores the data across multiple computers in the cluster
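To make the MapReduce part concrete, below is a minimal sketch of the classic word-count job written against the org.apache.hadoop.mapreduce Java API. The class name and the input/output paths are illustrative placeholders, not anything from the text above; the same code also runs unchanged on Hadoop 2.0, where YARN does the scheduling.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs near each HDFS block of the input and emits (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: receives every count emitted for a word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input dir in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Such a job is packaged into a jar and submitted with a command like hadoop jar wordcount.jar WordCount /input /output; on Hadoop 1.0 the MapReduce JobTracker schedules it, while on Hadoop 2.0 YARN does.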
Hadoop 2.0 components
MapReduce – processes the distributed data only
YARN – manages the resources of all the computers in the cluster (network)
HDFS – stores the data across multiple computers in the cluster; see the HDFS client sketch below
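To show what “data storage in multiple computers” looks like from the client side, below is a minimal sketch using the org.apache.hadoop.fs Java API. The NameNode address (hdfs://namenode:9000) and the file path are placeholder assumptions; the calls themselves are the standard HDFS client API. It writes a small file and then asks the NameNode which machines hold each block of it.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlocksDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000");  // placeholder NameNode address

    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/demo/sample.txt");

      // Write a file; HDFS splits large files into blocks (128 MB by default
      // in Hadoop 2) and replicates each block on several DataNodes.
      try (FSDataOutputStream out = fs.create(file, true)) {
        out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
      }

      // Ask the NameNode where each block of the file physically lives.
      FileStatus status = fs.getFileStatus(file);
      BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
      for (BlockLocation block : blocks) {
        System.out.println("offset " + block.getOffset()
            + ", length " + block.getLength()
            + ", hosts " + String.join(", ", block.getHosts()));
      }
    }
  }
}
```

The key point is that the client sees one logical file system, while the NameNode tracks which DataNodes store each block.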