Resilient Distributed Dataset (RDD) Transformations
Transformations are operations that create a new RDD from an existing RDD.
map
Return a new RDD formed by passing each element of the source through a function func.
// RDD Transformations - map
val wordsCountPerLineRDD = linesRDD.map(line => line.split(" ").size)
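As a rough check (assuming linesRDD was loaded from the three-line test.txt shown later in this post), map replaces each line with its word count:

// Assumed sample input: the three-line test.txt shown later in this post
wordsCountPerLineRDD.collect()   // Array(11, 8, 3) - one count per input line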
filter
Return a new RDD formed by selecting those elements of the source on which func returns true.
// RDD Transformations - filter
val linesWithWordForRDD = linesRDD.filter(line => line.contains("for"))
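With the same sample input, only the first two lines contain the substring "for", so the third line ("and graph processing.") is dropped:

// Assumed sample input: the three-line test.txt shown later in this post
linesWithWordForRDD.collect().foreach(println)
// Apache Spark is a unified analytics engine for big data processing,
// with built-in modules for streaming, SQL, machine learning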
flatMap
Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
// RDD Transformations - flatMap
val wordsRDD = linesWithWordForRDD.flatMap(line => line.split(" "))
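The difference from map is the flattening: map on the same input would keep one array per line, while flatMap merges all the arrays into a single RDD of words. A small sketch against the same sample input (wordArraysRDD is only for illustration and is not part of the full program):

// map keeps one Array[String] per line: RDD[Array[String]]
val wordArraysRDD = linesWithWordForRDD.map(line => line.split(" "))
// flatMap flattens those arrays into individual words: RDD[String]
wordsRDD.count()    // 19 - the 11 + 8 words of the two filtered lines
wordsRDD.take(3)    // Array(Apache, Spark, is)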
Full Program (RDDMapFilterFlatMapDemo.scala)
package com.ctv.apache.spark.rdd

import org.apache.spark.sql.SparkSession

object RDDMapFilterFlatMapDemo {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder
      .appName("Apache Spark for Beginners using Scala | RDD Transformations | map, filter, flatMap | Demo")
      .master("local[*]")
      .getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

    // Reading text file from local file system
    val filePath = "file:///D://record_tech_videos//data//rdd_examples//test.txt"
    val linesRDD = spark.sparkContext.textFile(filePath)
    linesRDD.collect().foreach(println)

    // RDD Transformations - map
    val wordsCountPerLineRDD = linesRDD.map(line => line.split(" ").size)
    wordsCountPerLineRDD.collect().foreach(println)

    // RDD Transformations - filter
    val linesWithWordForRDD = linesRDD.filter(line => line.contains("for"))
    linesWithWordForRDD.collect().foreach(println)

    // RDD Transformations - flatMap
    val wordsRDD = linesWithWordForRDD.flatMap(line => line.split(" "))
    wordsRDD.collect().foreach(println)

    spark.stop()
  }
}
SBT Build File (build.sbt)
name := "spark_for_beginners"

version := "1.0"

scalaVersion := "2.12.8"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.3"
Input File (test.txt)
Apache Spark is a unified analytics engine for big data processing,
with built-in modules for streaming, SQL, machine learning
and graph processing.
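For reference, running the full program against this input should print output along these lines: the original lines first, then the per-line word counts from map, then the two lines kept by filter, and finally the individual words produced by flatMap.

Apache Spark is a unified analytics engine for big data processing,
with built-in modules for streaming, SQL, machine learning
and graph processing.
11
8
3
Apache Spark is a unified analytics engine for big data processing,
with built-in modules for streaming, SQL, machine learning
Apache
Spark
is
a
unified
analytics
engine
for
big
data
processing,
with
built-in
modules
for
streaming,
SQL,
machine
learning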
Happy Learning !!!
3 Comments
Hi,
Can you explain how the lines below are getting executed, and their use in the above code:
val spark = SparkSession
.builder
.appName("Apache Spark for Beginners using Scala | RDD Transformations | map, filter, flatMap | Demo")
.master("local[*]")
.getOrCreate()
----------------------------
Hi Roman, here we are creating a SparkSession object called "spark" for one user (session). The builder function constructs the SparkSession with properties such as appName and master, and getOrCreate either creates a new SparkSession or returns an existing one. The SparkSession also internally creates the SparkContext and SQLContext used to work with the Spark cluster. I hope this helps you.
For more information:
https://spark.apache.org/docs/2.4.3/api/scala/index.html#org.apache.spark.sql.SparkSession
https://github.com/apache/spark/blob/v2.4.3/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
Thank you for your query; much appreciated. Happy Learning !!!
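To add to that, a quick way to see the "get or create" behaviour is to call getOrCreate a second time: it returns the session that already exists instead of building a new one (a small illustrative sketch, not part of the demo program):

// Illustrative only: getOrCreate returns the existing SparkSession on a second call
val sameSpark = SparkSession.builder.getOrCreate()
println(spark eq sameSpark)    // true - the same SparkSession instance
println(spark.sparkContext)    // the SparkContext created internally by the session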
Hi,
Thanks for sharing these useful transformations. Could you please add more transformations? Also, please provide examples of actions along with transformations.
Thanks,
Naveen.