RDD (Resilient Distributed Datasets) in Apache Spark



Resilient Distributed Dataset (RDD)

A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel.

There are two types of operations that we can perform on an RDD:

Transformations - create a new RDD from an existing RDD
Examples: map, filter
Actions - return a value to the driver program after running a computation on the RDD
Examples: count, collect, take
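As a minimal sketch of the difference (assuming Spark is on the classpath and using a local SparkSession), transformations only define new RDDs lazily, while actions trigger the computation and bring results back to the driver:

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; in a cluster app the master would differ.
val spark = SparkSession.builder()
  .appName("RDDBasics")
  .master("local[*]")
  .getOrCreate()

val numbers = spark.sparkContext.parallelize(1 to 10)

// Transformations: lazily build new RDDs from an existing one.
val doubled = numbers.map(_ * 2)
val evens   = numbers.filter(_ % 2 == 0)

// Actions: run the computation and return values to the driver.
println(evens.count())                   // 5
println(evens.collect().mkString(", "))  // 2, 4, 6, 8, 10
println(doubled.take(3).mkString(", "))  // 2, 4, 6

spark.stop()
```

Note that nothing is computed when `map` or `filter` is called; Spark only records the lineage, and the work happens when an action such as `count` runs.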




Architecture of Apache Spark






Exploring RDD in Apache Spark


// Read the log file into an RDD of lines (path may be local or on HDFS).
val lines_rdd = spark.sparkContext.textFile(yarn_log_file_path)
// Keep only the lines containing "ERROR"; this transformation is lazy.
val error_lines_rdd = lines_rdd.filter(line => line.contains("ERROR"))
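Nothing is read from disk until an action runs. A small sketch of triggering the computation on the filtered RDD above (the file path and the resulting values depend on your own log file):

```scala
// These actions actually read and scan the file.
val errorCount  = error_lines_rdd.count()  // number of ERROR lines
val firstErrors = error_lines_rdd.take(5)  // first five ERROR lines

println(s"Found $errorCount ERROR lines")
firstErrors.foreach(println)
```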

Happy Learning !!!
