Introduction to Apache Spark

Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing. The Spark ecosystem consists of the following components and libraries:

Spark Core API (RDD)
Spark SQL (SQL, DataFrame)
Spark Streaming
MLlib/Spark ML (Machine Learning)
GraphX

Apache Hadoop vs Apache Spark

Hadoop has two core components: HDFS (Hadoop Distributed File System) and MapReduce.

HDFS - Reliable and scalable storage for big datasets
MapReduce - Distributed programming model for big data computation
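
To make the MapReduce programming model concrete, here is a minimal single-machine sketch of a word count in plain Python. It is not Hadoop's API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and only illustrate the three stages that a real cluster would spread across many machines.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Chain the three stages, as a MapReduce job would."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["spark and hadoop", "spark is fast"]
    print(word_count(sample))
```

In a real Hadoop job, the output of each stage is written to disk before the next stage reads it, which is the cost that the following sections contrast with Spark.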

Advantages of Apache Spark

Hadoop MapReduce falls short in two areas:

Iterative Machine Learning
Interactive Data Analysis

What Apache Spark provides:

Iterative Machine Learning - Intermediate data is kept in memory, which reduces the number of disk reads and writes and makes the computation faster and more efficient
Interactive Data Analysis - A rich set of functions for data analysis, plus the ability to cache data in memory so that repeated queries run faster
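
The benefit for iterative workloads can be sketched with a toy simulation in plain Python. Nothing here is Spark's API: SimulatedDisk is a hypothetical stand-in for slow storage that simply counts how often it is read, and the "algorithm" is an arbitrary iterative update chosen only so that every pass needs the full dataset.

```python
class SimulatedDisk:
    """Hypothetical stand-in for slow storage; counts how often it is read."""

    def __init__(self, data):
        self._data = list(data)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._data)


def iterate_without_cache(disk, iterations):
    """Re-read the dataset from storage on every pass."""
    estimate = 0.0
    for _ in range(iterations):
        points = disk.load()                # one storage read per iteration
        mean = sum(points) / len(points)
        estimate += 0.5 * (mean - estimate)  # move halfway toward the mean
    return estimate


def iterate_with_cache(disk, iterations):
    """Read the dataset once, keep it in memory, reuse it on every pass."""
    points = disk.load()                    # single storage read
    estimate = 0.0
    for _ in range(iterations):
        mean = sum(points) / len(points)
        estimate += 0.5 * (mean - estimate)
    return estimate


if __name__ == "__main__":
    slow, fast = SimulatedDisk([1.0, 2.0, 3.0]), SimulatedDisk([1.0, 2.0, 3.0])
    iterate_without_cache(slow, 10)
    iterate_with_cache(fast, 10)
    print("reads without cache:", slow.reads)
    print("reads with cache:", fast.reads)
```

Both functions compute the same result; only the number of storage reads differs, and that number grows with the iteration count in the uncached version.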

How Apache Spark achieves faster computation compared to Hadoop

In-memory computing at distributed scale - Data is cached in memory across the cluster

The Spark execution engine translates user code into a series of tasks organized as a DAG (Directed Acyclic Graph): execution flows from one operation to the next, and the plan never loops back to an earlier operation (there are no cycles)
Spark tracks each operation applied to a dataset through an abstraction called the RDD (Resilient Distributed Dataset)
The Spark architecture is built from the ground up for speed and efficiency
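
The idea of recording operations as an acyclic chain and running them only when a result is requested can be sketched in plain Python. ToyDataset below is a hypothetical, single-machine illustration and not Spark's RDD API; the method names map, filter and collect are borrowed from Spark only to make the analogy easier to follow.

```python
class ToyDataset:
    """Hypothetical single-machine stand-in for an RDD (not Spark's API)."""

    def __init__(self, source, lineage=()):
        self._source = source            # the original input records
        self._lineage = tuple(lineage)   # ordered (name, function) steps

    def map(self, fn):
        # Transformation: record the step; nothing is computed yet.
        return ToyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also recorded lazily.
        return ToyDataset(self._source, self._lineage + (("filter", predicate),))

    def describe(self):
        """Return the recorded chain of operations, source first."""
        return ["source"] + [name for name, _ in self._lineage]

    def collect(self):
        # Action: replay the recorded steps in order, each exactly once.
        records = list(self._source)
        for name, fn in self._lineage:
            if name == "map":
                records = [fn(r) for r in records]
            else:
                records = [r for r in records if fn(r)]
        return records


if __name__ == "__main__":
    ds = ToyDataset([1, 2, 3, 4]).map(lambda x: x * 2).filter(lambda x: x > 4)
    print(ds.describe())
    print(ds.collect())
```

Because every ToyDataset keeps the full list of steps that produced it, the result can always be rebuilt from the source. This mirrors, in a very simplified way, why the recorded chain of operations lets lost data be recomputed rather than stored redundantly.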

Happy Learning!
