In this post, we will discuss what Apache Spark is and how it relates to Hadoop.
Apache Spark is an independent, general-purpose, in-memory computing engine for big data processing. Therefore, Spark cannot be directly compared to Hadoop as a whole.
The Hadoop framework provides distributed storage (HDFS), distributed processing (MapReduce), and distributed resource management (YARN). In contrast, Spark performs only distributed data processing.
However, Apache Spark can be compared with MapReduce:
1. MapReduce was built for batch processing of data, whereas Spark is a complete data analytics engine.
2. MapReduce performs read and write operations on disk, whereas Spark performs them in memory, leading to faster execution (see the sketch after this list).
3. Spark needs a lot of memory to fit and process the data. In contrast, MapReduce can handle cases where the data does not fit in memory.
4. MapReduce is not suited for real-time processing, whereas Spark shines at real-time data processing.
5. MapReduce is more fault tolerant because it persists data to disk, so it may not require a complete restart after a disruption. In contrast, since Spark performs its operations in memory (RAM), it may have to restart processing from the initial point after a disruption.
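To make point 2 concrete, here is a minimal Scala sketch (the input path and names are illustrative, not from any particular deployment) showing how Spark keeps an intermediate dataset in memory so that subsequent actions reuse it instead of re-reading from disk:

```scala
import org.apache.spark.sql.SparkSession

object CachingDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CachingDemo")
      .master("local[*]") // run locally for illustration
      .getOrCreate()

    // Hypothetical input path; replace with your own data.
    val logs = spark.read.textFile("/tmp/server.log")

    // Keep the filtered dataset in memory (RAM) after it is first computed.
    val errors = logs.filter(_.contains("ERROR")).cache()

    // Both actions below reuse the cached, in-memory data instead of
    // re-reading and re-filtering the file from disk, which is where a
    // chain of MapReduce jobs would pay repeated disk I/O.
    println(s"Error count: ${errors.count()}")
    errors.show(10)

    spark.stop()
  }
}
```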
All said and done, Spark remains a nifty framework for rapid distributed data processing and is widely used in the industry. When used as a processing engine with HDFS for storage and YARN for resource management, the setup is called “Spark on top of Hadoop.”
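As a rough sketch of this “Spark on top of Hadoop” setup (the HDFS path and application name are placeholders), the application reads its input directly from HDFS and is submitted to a YARN cluster:

```scala
import org.apache.spark.sql.SparkSession

object SparkOnHadoop {
  def main(args: Array[String]): Unit = {
    // Submitted to the cluster with something like:
    //   spark-submit --master yarn --deploy-mode cluster \
    //     --class SparkOnHadoop app.jar
    // YARN handles resource management; HDFS provides the storage.
    val spark = SparkSession.builder()
      .appName("SparkOnHadoop")
      .getOrCreate()

    // Read input directly from HDFS (placeholder path).
    val data = spark.read.textFile("hdfs:///data/input.txt")
    println(s"Line count: ${data.count()}")

    spark.stop()
  }
}
```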
Spark itself is written in Scala, but Spark applications can be developed in Scala, Java, Python, or R.
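For instance, the classic word count in Scala, Spark's native language, looks like this (a minimal sketch with a placeholder input path):

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")
      .getOrCreate()

    val counts = spark.sparkContext
      .textFile("/tmp/input.txt")   // placeholder path
      .flatMap(_.split("\\s+"))     // split lines into words
      .map(word => (word, 1))       // pair each word with a count of 1
      .reduceByKey(_ + _)           // sum the counts per word

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```

The same logic can be expressed almost line for line in PySpark, which is part of what makes Spark approachable across languages.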