Resilient Distributed Dataset (RDD) Transformations


Transformations are the one which generates new RDD from existing RDD/creates new RDD.


union

Return a new RDD that contains the union of the elements in the source RDD and the argument RDD. i.e. union transformation combines elements from both RDDs including duplicate elements, works like UNION ALL operation in SQL world.
1
2
3
// RDD Transformations - union
println("union operation starts here")
val numbersUnionRDD = numbersRDD1.union(numbersRDD2)

intersection

Return a new RDD that contains the intersection of elements in the source RDD and the argument RDD. i.e. intersection transformation produces common elements from both RDDs.
1
2
3
// RDD Transformations - intersection
println("intersection operation starts here")
val numbersIntersectionRDD = numbersRDD1.intersection(numbersRDD2)

distinct

Return a new RDD that contains the distinct elements of the RDD dataset. i.e. distinct transformation produces unique values from the source RDD.
1
2
3
4
// RDD Transformations - distinct
println("distinct operation starts here")
val numbersRDD1WithDistinct = numbersRDD1.distinct()
val numbersUnionRDDWithDistinct = numbersUnionRDD.distinct()



Full Program (RDDUnionIntersectionDistinctDemo.scala)


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
package com.ctv.apache.spark.rdd

import org.apache.spark.sql.SparkSession

object RDDUnionIntersectionDistinctDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("Apache Spark for Beginners using Scala | RDD Transformations | union, intersection, distinct | Demo")
      .master("local[*]")
      .getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

    val numbersList1 = List(1, 4, 5, 2, 2, 6, 3)
    val numbersList2 = List(7, 8, 9, 10, 6, 3)

    val numbersRDD1 = spark.sparkContext.parallelize(numbersList1)
    val numbersRDD2 = spark.sparkContext.parallelize(numbersList2)

    // RDD Transformations - union
    println("union operation starts here")
    val numbersUnionRDD = numbersRDD1.union(numbersRDD2)

    numbersUnionRDD.collect().foreach(println)
    println("union operation ends here")
    // RDD Transformations - intersection
    println("intersection operation starts here")
    val numbersIntersectionRDD = numbersRDD1.intersection(numbersRDD2)

    numbersIntersectionRDD.collect().foreach(println)
    println("intersection operation ends here")
    // RDD Transformations - distinct
    println("distinct operation starts here")
    val numbersRDD1WithDistinct = numbersRDD1.distinct()
    val numbersUnionRDDWithDistinct = numbersUnionRDD.distinct()

    numbersRDD1WithDistinct.collect().foreach(println)
    numbersUnionRDDWithDistinct.collect().foreach(println)
    println("distinct operation ends here")

    spark.stop()
  }
}

SBT Build File (build.sbt)


1
2
3
4
5
6
7
name := "spark_for_beginners"

version := "1.0"

scalaVersion := "2.12.8"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.3"

Happy Learning !!!