Resilient Distributed Dataset (RDD)
A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel.
There are two types of operations that we can perform on an RDD:
Transformations - create a new RDD from an existing RDD. Transformations are lazy: they are recorded but not computed until an action is called.
Examples: map, filter
Actions - return a value to the driver program after running a computation on the RDD.
Examples: count, collect, take
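To make the distinction concrete, here is a minimal sketch in local mode (the object name, app name, and sample data are illustrative, not from the original post): the map and filter calls only build up a new RDD, while count and take actually run the job and bring results back to the driver.

```scala
import org.apache.spark.sql.SparkSession

object RddDemo {
  def main(args: Array[String]): Unit = {
    // Local-mode session for illustration; on a cluster the master
    // is normally set via spark-submit instead.
    val spark = SparkSession.builder()
      .appName("rdd-demo")
      .master("local[*]")
      .getOrCreate()

    val numbers = spark.sparkContext.parallelize(1 to 10)

    // Transformations: lazy, each returns a new RDD, nothing runs yet
    val doubled = numbers.map(_ * 2)
    val big     = doubled.filter(_ > 10)

    // Actions: trigger the computation and return values to the driver
    println(big.count())        // 5
    println(big.take(3).toList) // List(12, 14, 16)

    spark.stop()
  }
}
```

Note that until count is called, Spark has only built a lineage of transformations; the filter and map are executed together when the action runs.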
Architecture of Apache Spark
Exploring RDD in Apache Spark
// Read the YARN log file into an RDD of lines
val lines_rdd = spark.sparkContext.textFile(yarn_log_file_path)
// Keep only the lines containing "ERROR" (a lazy transformation;
// nothing is read until an action such as count or collect is called)
val error_lines_rdd = lines_rdd.filter(line => line.contains("ERROR"))
Happy Learning !!!