
Hrushikesh Vazurkar

Software, College Life and Clarity for successful careers

  • Home
  • Data Engineering Fundamentals
  • College Life Posts

Post 1 – Introduction to Data Engineering

What are Big Data and Hadoop? As per IBM’s definition, big data is data characterized by high Volume, Variety, and Velocity. Hadoop is simply a framework for solving big data problems. The need for Hadoop comes from the problem statement: “I have a data file larger than the memory size of a single computer….

Read More “Post 1 – Introduction to Data Engineering” »

bigdata Featured

Post 12 – MapReduce Example

Suppose a 500 MB file containing store_id and daily_sales. Problem Statement: Find the sum of all daily sales per store. store_id,date,daily_sales 1,12/01/2018,450 1,13/01/2018,500 1,14/01/2018,600 2,13/02/2018,400 2,14/02/2018,300 2,15/02/2018,600 … Solution: The MapReduce algorithm involves the following steps: Record Reader, Map, Shuffling, Sorting, Reducer. Record Reader – Takes each line in the file and converts…
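Before clicking through, here is a minimal Python sketch that simulates the same pipeline in memory on the sample records above. The plain-Python simulation is only illustrative; in a real Hadoop job these steps run as distributed mapper and reducer tasks.

# Minimal in-memory simulation of the MapReduce flow described above (illustrative only).
from collections import defaultdict

lines = [
    "1,12/01/2018,450", "1,13/01/2018,500", "1,14/01/2018,600",
    "2,13/02/2018,400", "2,14/02/2018,300", "2,15/02/2018,600",
]

# Record Reader + Map: turn each line into a (store_id, daily_sales) key-value pair
mapped = []
for line in lines:
    store_id, _date, daily_sales = line.split(",")
    mapped.append((store_id, int(daily_sales)))

# Shuffle + Sort: group all values belonging to the same key
grouped = defaultdict(list)
for store_id, sales in sorted(mapped):
    grouped[store_id].append(sales)

# Reduce: sum the grouped values per store
totals = {store_id: sum(sales) for store_id, sales in grouped.items()}
print(totals)  # {'1': 1550, '2': 1300}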

Read More “Post 12 – MapReduce Example” »

bigdata

Post 10 – HDFS Manipulation Commands

HDFS Basic Commands -t = sort by time -r = list in reverse -S = sort by size -R = recursively show all files under that particular directory (in this case, /user/hdpuser) -p = create the directory if it does not exist hadoop fs -rmdir /user/hdpuser/dir1 (remove an empty directory) hadoop fs -rm -R /user/hdpuser/dir2 (remove a non-empty directory) hadoop fs -mv…

Read More “Post 10 – HDFS Manipulation Commands” »

bigdata

Post 11 – What is MapReduce?

MapReduce is a programming paradigm, or an algorithm, for processing big data in a distributed environment. Traditional algorithms work when data is kept on a single machine. For big data stored across multiple machines, MapReduce solves the problem of data processing. MapReduce Phases: Map and Reduce. Both take input and give output as Key-Value pairs =>…
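As a rough sketch of that key-value contract (the word-count example and function names below are illustrative assumptions, not taken from the post):

# Sketch of the Map and Reduce key-value contract (illustrative word count).
def map_fn(record):
    # Map: one input record in, zero or more (key, value) pairs out
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: one key plus all the values shuffled to it in, one (key, result) pair out
    return (key, sum(values))

print(list(map_fn("hdfs yarn hdfs")))  # [('hdfs', 1), ('yarn', 1), ('hdfs', 1)]
print(reduce_fn("hdfs", [1, 1]))       # ('hdfs', 2)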

Read More “Post 11 – What is MapReduce?” »

bigdata

Post 9 – Linux Cheatsheet for Data Engineering

“Cardamom in the biryani?!” (“Biryani me elaichi??!”) – This is the feeling you must have got when the course suddenly changed from Big Data to Linux commands. Not to worry. Understanding and applying Linux commands is imperative and often part of the day-to-day job of a Data Engineer. Shell scripting may also be required…

Read More “Post 9 – Linux Cheatsheet for Data Engineering” »

data engineering fundamentals

Post 8 – Hadoop Installation

In this post, we will discuss the different modes of Hadoop installation and their practical usage. Modes of Hadoop Installation: There are 3 modes of Hadoop installation: ▶ Local – HDFS and YARN are not applicable in this mode. Ideal for testing MapReduce logic. ▶ Pseudo Distributed – All Hadoop components run on a single computer. HDFS, YARN…

Read More “Post 8 – Hadoop Installation” »

bigdata

Post 7 – HDFS Commands

We will be discussing interaction with HDFS using the command-line interface. For this, we can use two command prefixes: ➡ hadoop fs ➡ hdfs dfs There are some basic differences between hadoop fs and hdfs dfs, but for all practical purposes of this post, both are the same. List files/directories ➡ hadoop fs -ls <dirname> (Directory…

Read More “Post 7 – HDFS Commands” »

bigdata

Post 6 – Rack Awareness in HDFS

Rack Awareness protects your highly crucial data from the forces of nature. What is a rack in Hadoop? A Rack is a collection of servers (usually 10 or more) such that: ➡ Datanodes in a rack are physically close to each other (within the same data center or under the same network switch) ➡ The intra-rack Datanodes are…

Read More “Post 6 – Rack Awareness in HDFS” »

bigdata

Post 5 – How HDFS handles Namenode Failure?

In the previous post, we discussed HDFS and how Hadoop handles Datanode failure. In this post, we will discuss how Hadoop handles Namenode failure. Hadoop 1.0 -> The Namenode is a single point of failure. Failure handling is highly costly and time-consuming due to the recovery of edit logs. Hadoop 2.0 -> Introduction of Secondary Namenode…

Read More “Post 5 – How HDFS handles Namenode Failure?” »

bigdata

Post 4 – HDFS, the backbone of Hadoop

HDFS stands for Hadoop Distributed File System. It is a distributed file system designed to store and manage large datasets across multiple computers in a cluster. It is highly fault-tolerant and can handle large amounts of data with high throughput, making it ideal for big data processing. HDFS architecture: NN (NameNode) – The NameNode stores the…

Read More “Post 4 – HDFS, the backbone of Hadoop” »

bigdata

Post 3 – Apache Spark vs. MapReduce

In this post, we will discuss what Apache Spark is and how it comes into play with Hadoop. Apache Spark is an independent, general-purpose, in-memory compute engine for big data processing. Therefore, Spark cannot be directly compared to Hadoop as a whole. The Hadoop framework provides distributed storage (HDFS), distributed processing (MapReduce), and distributed resource management (YARN). In contrast, Spark…
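To make the contrast concrete, here is a hedged PySpark sketch of the per-store sales total from the MapReduce example above; the file name sales.csv and its header columns are assumptions carried over from that example, not something shown in this post.

# Sketch: the per-store sales total expressed with Spark's DataFrame API.
# Assumes a CSV file sales.csv with header columns store_id,date,daily_sales.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("store-sales").getOrCreate()

sales = spark.read.csv("sales.csv", header=True, inferSchema=True)
totals = sales.groupBy("store_id").agg(F.sum("daily_sales").alias("total_sales"))
totals.show()

spark.stop()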

Read More “Post 3 – Apache Spark vs. MapReduce” »

Apache Spark


Copyright © 2023 Hrushikesh Vazurkar.
