RDD Transformations | union, intersection, distinct | Using Scala

Resilient Distributed Dataset (RDD) Transformations

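Transformations are operations that produce a new RDD from an existing RDD. The source RDD is not modified, and a transformation is evaluated lazily: nothing runs until an action such as collect() is called.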

union

Returns a new RDD that contains every element of the source RDD and the argument RDD. Duplicate elements are kept, so union works like UNION ALL in SQL, not like UNION.

// RDD Transformations - union
println("union operation starts here")
val numbersUnionRDD = numbersRDD1.union(numbersRDD2)
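
As a rough illustration of the UNION ALL behaviour described above, the sketch below unions the two lists used in the full program and compares the result with union followed by distinct(), which is closer to SQL UNION. The object name UnionSketch, its run helper, and the explicit partition counts are assumptions made for this example and are not part of the original program.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, introduced only for this illustration.
object UnionSketch {
  // Returns (union result, union + distinct result, partition count of the union).
  def run(): (List[Int], List[Int], Int) = {
    val spark = SparkSession.builder
      .appName("union sketch")
      .master("local[2]")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      // Two partitions per RDD: an assumption that keeps the partition count easy to follow.
      val rdd1 = sc.parallelize(List(1, 4, 5, 2, 2, 6, 3), 2)
      val rdd2 = sc.parallelize(List(7, 8, 9, 10, 6, 3), 2)

      val unioned = rdd1.union(rdd2)
      // Sort on the driver so the comparison does not depend on partition order.
      val unionAll = unioned.collect().sorted.toList
      val unionDistinct = unioned.distinct().collect().sorted.toList
      (unionAll, unionDistinct, unioned.getNumPartitions)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (unionAll, unionDistinct, partitions) = run()
    println(s"union (duplicates kept): ${unionAll.mkString(", ")}")
    println(s"union + distinct:        ${unionDistinct.mkString(", ")}")
    println(s"partitions after union:  $partitions")
  }
}
```

The values 2, 3 and 6 each appear twice in the plain union and once after distinct(). Because neither input RDD has a partitioner here, the union simply keeps the partitions of both inputs (2 + 2), which is why union itself does not need a shuffle.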

intersection

Returns a new RDD that contains only the elements present in both the source RDD and the argument RDD. The result contains no duplicate elements, even if the input RDDs do.

// RDD Transformations - intersection
println("intersection operation starts here")
val numbersIntersectionRDD = numbersRDD1.intersection(numbersRDD2)
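
The following is a minimal sketch of how intersection treats duplicates. It runs the transformation once on the lists from the full program and once on two small lists that both contain repeated values. The object name IntersectionSketch, its run helper, and the second pair of lists are assumptions chosen for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, introduced only for this illustration.
object IntersectionSketch {
  // Returns (intersection of the post's lists, intersection of two lists with repeats).
  def run(): (List[Int], List[Int]) = {
    val spark = SparkSession.builder
      .appName("intersection sketch")
      .master("local[2]")
      .getOrCreate()
    try {
      val sc = spark.sparkContext

      // Same input data as the full program.
      val rdd1 = sc.parallelize(List(1, 4, 5, 2, 2, 6, 3))
      val rdd2 = sc.parallelize(List(7, 8, 9, 10, 6, 3))
      val fromPost = rdd1.intersection(rdd2).collect().sorted.toList

      // Assumed extra data: both sides repeat the common values 2 and 6.
      val left = sc.parallelize(List(2, 2, 3, 6, 6))
      val right = sc.parallelize(List(2, 2, 6, 7))
      val withRepeats = left.intersection(right).collect().sorted.toList

      (fromPost, withRepeats)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (fromPost, withRepeats) = run()
    println(s"lists from the post: ${fromPost.mkString(", ")}")
    println(s"lists with repeats:  ${withRepeats.mkString(", ")}")
  }
}
```

The results are sorted on the driver because intersection shuffles data internally, so the order returned by collect() should not be relied on.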

distinct

Returns a new RDD that contains the distinct elements of the source RDD, i.e. each value appears exactly once in the result.

// RDD Transformations - distinct
println("distinct operation starts here")
val numbersRDD1WithDistinct = numbersRDD1.distinct()
val numbersUnionRDDWithDistinct = numbersUnionRDD.distinct()
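
One way to think about distinct is as a key-based de-duplication: pair every element with a dummy value, keep one entry per key, then drop the dummy value. The sketch below compares distinct() with that formulation on the first list from the full program. It is meant as an aid to intuition, not as a statement of how any particular Spark version implements distinct; the object name DistinctSketch and its run helper are assumptions made for this example.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, introduced only for this illustration.
object DistinctSketch {
  // Returns (result of distinct(), result of the reduceByKey formulation).
  def run(): (List[Int], List[Int]) = {
    val spark = SparkSession.builder
      .appName("distinct sketch")
      .master("local[2]")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      val rdd = sc.parallelize(List(1, 4, 5, 2, 2, 6, 3))

      // Built-in transformation.
      val viaDistinct = rdd.distinct().collect().sorted.toList

      // Equivalent idea: one (value, dummy) pair per element, one survivor per key.
      val viaReduceByKey = rdd
        .map(x => (x, 1))
        .reduceByKey((first, _) => first)
        .keys
        .collect()
        .sorted
        .toList

      (viaDistinct, viaReduceByKey)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (viaDistinct, viaReduceByKey) = run()
    println(s"distinct():    ${viaDistinct.mkString(", ")}")
    println(s"reduceByKey:   ${viaReduceByKey.mkString(", ")}")
  }
}
```

Both approaches group equal values together, which requires a shuffle, so the element order after distinct() is not guaranteed; the sketch sorts before comparing for that reason.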

Full Program (RDDUnionIntersectionDistinctDemo.scala)

package com.ctv.apache.spark.rdd

import org.apache.spark.sql.SparkSession

object RDDUnionIntersectionDistinctDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("Apache Spark for Beginners using Scala | RDD Transformations | union, intersection, distinct | Demo")
      .master("local[*]")
      .getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

    val numbersList1 = List(1, 4, 5, 2, 2, 6, 3)
    val numbersList2 = List(7, 8, 9, 10, 6, 3)

    val numbersRDD1 = spark.sparkContext.parallelize(numbersList1)
    val numbersRDD2 = spark.sparkContext.parallelize(numbersList2)

    // RDD Transformations - union
    println("union operation starts here")
    val numbersUnionRDD = numbersRDD1.union(numbersRDD2)

    numbersUnionRDD.collect().foreach(println)
    println("union operation ends here")

    // RDD Transformations - intersection
    println("intersection operation starts here")
    val numbersIntersectionRDD = numbersRDD1.intersection(numbersRDD2)

    // intersection shuffles data, so the printed order is not guaranteed
    numbersIntersectionRDD.collect().foreach(println)
    println("intersection operation ends here")

    // RDD Transformations - distinct
    println("distinct operation starts here")
    val numbersRDD1WithDistinct = numbersRDD1.distinct()
    val numbersUnionRDDWithDistinct = numbersUnionRDD.distinct()

    numbersRDD1WithDistinct.collect().foreach(println)
    numbersUnionRDDWithDistinct.collect().foreach(println)
    println("distinct operation ends here")

    spark.stop()
  }
}

SBT Build File (build.sbt)

name := "spark_for_beginners"

version := "1.0"

scalaVersion := "2.12.8"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.3"

Happy Learning !!!
