Apache Spark
Apache Spark is a unified analytics engine for large-scale data processing. The Apache Spark ecosystem consists of the following components/libraries:
- Spark Core API (RDD)
- Spark SQL (SQL, DataFrame)
- Spark Streaming
- MLlib/Spark ML (Machine Learning)
- GraphX
Apache Hadoop vs Apache Spark
Hadoop's two core components for storage and computation are HDFS (Hadoop Distributed File System) and MapReduce:
HDFS - A reliable and scalable storage solution for big datasets
MapReduce - A distributed programming model for big data computation
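To make the MapReduce programming model concrete, here is a minimal single-machine sketch of a word count in plain Python. It is not Hadoop's API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. A real cluster would run the map and reduce steps in parallel on many machines and write intermediate results to disk.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["spark is fast", "hadoop is reliable"])
print(counts)
```

Each phase only needs its own input, which is what allows a framework to distribute the work across machines.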
Advantages of Apache Spark
Hadoop MapReduce falls short in the two areas below, and Spark addresses both:
- Iterative Machine Learning - Spark keeps intermediate data in memory, which reduces the number of disk reads and writes between iterations and makes the computation faster and more efficient
- Interactive Data Analysis - Spark provides a rich set of functions for data analysis and speeds up repeated queries by caching your data in memory
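The benefit of keeping data in memory for iterative work can be sketched with a toy example in plain Python. This is not Spark code: ToyDataset, its disk_reads counter and iterative_mean_estimate are hypothetical names, and the "disk" is simulated by a counter. The sketch only shows that an iterative algorithm touches the same data on every pass, so caching it once avoids repeated reads.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset stored on disk (not Spark's API)."""

    def __init__(self, values):
        self._values = list(values)
        self._cache = None
        self.disk_reads = 0  # counts simulated disk reads

    def _read_from_disk(self):
        self.disk_reads += 1
        return list(self._values)

    def cache(self):
        """Read the data once and keep it in memory for later passes."""
        if self._cache is None:
            self._cache = self._read_from_disk()
        return self

    def rows(self):
        """Return cached rows if available, otherwise read from 'disk' again."""
        if self._cache is not None:
            return self._cache
        return self._read_from_disk()


def iterative_mean_estimate(dataset, iterations=5, step=0.5):
    """Gradient-descent style loop: every iteration scans the whole dataset."""
    estimate = 0.0
    for _ in range(iterations):
        rows = dataset.rows()
        gradient = sum(estimate - x for x in rows) / len(rows)
        estimate -= step * gradient
    return estimate


uncached = ToyDataset([2.0, 4.0, 6.0])
uncached_result = iterative_mean_estimate(uncached)

cached = ToyDataset([2.0, 4.0, 6.0]).cache()
cached_result = iterative_mean_estimate(cached)

print(uncached.disk_reads, cached.disk_reads)
```

Both runs compute the same estimate; only the number of simulated disk reads differs.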
How Apache Spark achieves faster computation compared to Hadoop
- In-memory computing at distributed scale - Spark caches data in memory across the cluster
- Spark's execution engine translates user code into a series of tasks. These tasks form a DAG (Directed Acyclic Graph), meaning execution flows from one operation or task to the next and never cycles back to an earlier one (no cyclic flow)
- Spark tracks each operation on a dataset using a concept called RDD (Resilient Distributed Dataset)
- Spark's architecture is built from the ground up for speed and efficiency
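The DAG and RDD ideas above can be pictured with a small toy model in plain Python. This is a simplified, single-machine sketch and not Spark's implementation: ToyRDD and its lineage method are hypothetical names. It assumes each operation simply records its parent, so the chain of operations only ever points backwards (no cycles), and nothing is computed until collect is called.

```python
class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD (not Spark's API)."""

    def __init__(self, source=None, parent=None, op_name="source", func=None):
        self.source = source    # raw data, only set on the root node
        self.parent = parent    # the dataset this one was derived from
        self.op_name = op_name  # label of the operation that produced it
        self.func = func        # how to compute rows from the parent's rows

    def map(self, f):
        """Record a map step; no data is processed yet."""
        return ToyRDD(parent=self, op_name="map",
                      func=lambda rows: [f(r) for r in rows])

    def filter(self, predicate):
        """Record a filter step; no data is processed yet."""
        return ToyRDD(parent=self, op_name="filter",
                      func=lambda rows: [r for r in rows if predicate(r)])

    def lineage(self):
        """Return the chain of operations from the source to this dataset."""
        chain = [] if self.parent is None else self.parent.lineage()
        return chain + [self.op_name]

    def collect(self):
        """Walk the recorded chain and actually compute the result."""
        if self.parent is None:
            return list(self.source)
        return self.func(self.parent.collect())


numbers = ToyRDD(source=range(1, 6))
odd_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 1)

print(odd_squares.lineage())
print(odd_squares.collect())
```

Because every dataset remembers how it was derived, a lost result could in principle be recomputed from its parents, which is the intuition behind the "resilient" in RDD.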
Happy Learning !!!