Resilient Distributed Dataset (RDD) Transformations
Transformations are operations that create a new RDD from an existing RDD. They are evaluated lazily: Spark only records the lineage of transformations and runs them when an action (such as collect()) is called.
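Because this laziness is easy to miss at first, here is a minimal sketch (assuming a SparkSession named spark, as in the full program below): the map call only builds the lineage, and nothing actually executes until collect() is invoked.

// Minimal sketch: transformations are lazy.
// 'map' only records the lineage; nothing runs until an action is called.
val numbersRDD = spark.sparkContext.parallelize(Seq(1, 2, 3, 4))
val doubledRDD = numbersRDD.map(_ * 2)   // transformation: returns a new RDD, no work yet
doubledRDD.collect().foreach(println)    // action: triggers computation, prints 2 4 6 8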
mapPartitions
Similar to map, but the supplied function runs once per partition (block) of the RDD rather than once per element, i.e. it returns a new RDD by applying a function to the iterator of each partition of this RDD.

// RDD Transformations - mapPartitions
val resultRDD2 = namesRDD.mapPartitions(onePartition => {
  println("Processing Current Partition ... ")
  onePartition.map(element => element.size)
})
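A common reason to reach for mapPartitions instead of map is to pay a per-partition setup cost once instead of once per element. A hedged sketch follows, where expensiveSetup is a hypothetical stand-in for something genuinely costly, such as opening a database connection or building a parser:

// Hedged sketch: 'expensiveSetup' is a hypothetical placeholder for a
// costly resource (e.g. a DB connection); it is created once per partition.
def expensiveSetup(): String => Int = name => name.size

val sizesRDD = namesRDD.mapPartitions { onePartition =>
  val lookup = expensiveSetup()   // runs once per partition, not once per element
  onePartition.map(lookup)        // reuse the resource for every element
}
sizesRDD.collect().foreach(println)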
mapPartitionsWithIndex
Similar to mapPartitions, but the function also receives an integer representing the index of the partition, i.e. it returns a new RDD by applying a function to each partition of this RDD while tracking the index of the original partition.

// RDD Transformations - mapPartitionsWithIndex
val resultRDD3 = namesRDD.mapPartitionsWithIndex((index, onePartition) => {
  println("Partition Number(index): " + index)
  onePartition.map(element => "element: " + element + ", element size: " + element.size +
    ", Partition Number(index): " + index)
})
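To see exactly how parallelize laid the six names out across the three partitions, the built-in glom() transformation (which materializes each partition as an array) gives a quick driver-side view. This is a side check for experimenting, not part of the original demo:

// glom() turns each partition into an Array, making the layout visible.
namesRDD.glom().collect().zipWithIndex.foreach { case (partition, index) =>
  println(s"Partition $index: ${partition.mkString(", ")}")
}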
Full Program (RDDMapPartitionsMapPartitionsWithIndexDemo.scala)
package com.ctv.apache.spark.rdd

import org.apache.spark.sql.SparkSession

object RDDMapPartitionsMapPartitionsWithIndexDemo {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder
      .appName("Apache Spark for Beginners using Scala | RDD Transformations | mapPartitions, mapPartitionsWithIndex | Demo")
      .master("local[*]")
      .getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

    val namesList = List("Arun", "Rohit", "Vijay", "Bala", "Aket", "Frank")
    val namesRDD = spark.sparkContext.parallelize(namesList, 3)

    // RDD Transformations - map
    val resultRDD1 = namesRDD.map(element => element.size)
    resultRDD1.collect().foreach(println)

    // RDD Transformations - mapPartitions
    val resultRDD2 = namesRDD.mapPartitions(onePartition => {
      println("Processing Current Partition ... ")
      onePartition.map(element => element.size)
    })
    resultRDD2.collect().foreach(println)

    // RDD Transformations - mapPartitionsWithIndex
    val resultRDD3 = namesRDD.mapPartitionsWithIndex((index, onePartition) => {
      println("Partition Number(index): " + index)
      onePartition.map(element => "element: " + element + ", element size: " + element.size +
        ", Partition Number(index): " + index)
    })
    resultRDD3.collect().foreach(println)

    spark.stop()
  }
}
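As an optional sanity check while experimenting (an addition, not part of the original listing), getNumPartitions confirms that parallelize honored the requested partition count. Note also that collect() ships the entire RDD back to the driver, which is fine for this six-element demo but not for large datasets.

// Optional check: confirm parallelize created the requested 3 partitions.
println("Number of partitions: " + namesRDD.getNumPartitions)  // expected: 3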
SBT Build File (build.sbt)
name := "spark_for_beginners"

version := "1.0"

scalaVersion := "2.12.8"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.3"
Happy Learning !!!
