RDD (Resilient Distributed Datasets) in Apache Spark



Resilient Distributed Dataset (RDD)

A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel.

There are two types of operations that we can perform on an RDD:

Transformations - create a new RDD from an existing RDD
Examples: map, filter
Actions - return a value to the driver program after running a computation on the RDD
Examples: count, collect, take
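As a minimal sketch of the difference (assuming Spark is on the classpath and using a local SparkSession), transformations only define new RDDs lazily, while actions trigger the computation and bring results back to the driver:

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; in a cluster app the master would differ.
val spark = SparkSession.builder()
  .appName("RDDBasics")
  .master("local[*]")
  .getOrCreate()

val numbers = spark.sparkContext.parallelize(1 to 10)

// Transformations: lazily build new RDDs from an existing one.
val doubled = numbers.map(_ * 2)
val evens   = numbers.filter(_ % 2 == 0)

// Actions: run the computation and return values to the driver.
println(evens.count())                   // 5
println(evens.collect().mkString(", "))  // 2, 4, 6, 8, 10
println(doubled.take(3).mkString(", "))  // 2, 4, 6

spark.stop()
```

Note that nothing is computed when `map` or `filter` is called; Spark only records the lineage, and the work happens when an action such as `count` runs.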




Architecture of Apache Spark






Exploring RDD in Apache Spark


// Read the log file into an RDD of lines (path may be local or on HDFS).
val lines_rdd = spark.sparkContext.textFile(yarn_log_file_path)
// Keep only the lines containing "ERROR"; this transformation is lazy.
val error_lines_rdd = lines_rdd.filter(line => line.contains("ERROR"))
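Nothing is read from disk until an action runs. A small sketch of triggering the computation on the filtered RDD above (the file path and the resulting values depend on your own log file):

```scala
// These actions actually read and scan the file.
val errorCount  = error_lines_rdd.count()  // number of ERROR lines
val firstErrors = error_lines_rdd.take(5)  // first five ERROR lines

println(s"Found $errorCount ERROR lines")
firstErrors.foreach(println)
```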

Happy Learning !!!
