Practical RDD transformation: filter | Apache Spark 101 Tutorial | Scala | Part 5


Prerequisite

  • IntelliJ IDEA Community Edition

Walk-through

In this article, I walk through how to use the filter RDD (Resilient Distributed Dataset) transformation, with a hands-on example in an Apache Spark application using the Scala API, in IntelliJ IDEA Community Edition.

Step 1: Create an sbt-based Scala project for developing Apache Spark code with the Scala API.

Step 2: Create the following two files in the sbt-based Scala project created above, then run the program to try the filter RDD (Resilient Distributed Dataset) transformation.

build.sbt

name := "apachespark101"

version := "1.0"

scalaVersion := "2.12.8"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"


filter_rdd_transf_apachespark101_part_5.scala


package com.datamaking.apachespark101

import org.apache.spark.sql.SparkSession

object filter_rdd_transf_apachespark101_part_5 {
  def main(args: Array[String]): Unit = {
    println("Started ...")
    val spark = SparkSession
      .builder
      .appName("Apache Spark 101 Tutorial | Part 5")
      .master("local[*]")
      .getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

    val numbers_list = List.range(1, 10)  // 1 to 9; the end of the range is exclusive
    println(numbers_list.getClass.getSimpleName)
    val numbers_rdd = spark.sparkContext.parallelize(numbers_list, 3)
    val numbers_even_rdd = numbers_rdd.filter(e => e % 2 == 0)
    println("Printing even numbers (1 to 9): ")
    numbers_even_rdd.collect().foreach(println)

    val apache_spark_list = List("Apache", "Spark", "is", "in-memory", "distributed", "framework")
    val apache_spark_rdd = spark.sparkContext.parallelize(apache_spark_list)
    val apache_spark_filter_rdd = apache_spark_rdd.filter(ele => ele.contains('a'))
    println("Printing words that contain the letter 'a': ")
    apache_spark_filter_rdd.collect().foreach(println)

    spark.stop()
    println("Completed.")
  }
}
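Since RDD.filter takes the same kind of predicate as the filter method on ordinary Scala collections, the two filters above can be sketched on plain Scala lists, with no Spark runtime needed, to see exactly which elements each predicate keeps. This is only an illustrative sketch of the predicate logic, not a replacement for the distributed RDD version:

```scala
// Sketch of the two filter predicates from the Spark example,
// applied to plain Scala lists. filter keeps only the elements
// for which the predicate returns true.
object FilterSketch {
  def main(args: Array[String]): Unit = {
    val numbers = List.range(1, 10)          // 1 to 9 (end is exclusive)
    val evens = numbers.filter(_ % 2 == 0)   // keeps 2, 4, 6, 8
    println(evens.mkString(", "))

    val words = List("Apache", "Spark", "is", "in-memory", "distributed", "framework")
    val withA = words.filter(_.contains('a')) // keeps words with a lowercase 'a'
    println(withA.mkString(", "))
  }
}
```

Note that the string predicate is case-sensitive: "Apache" passes because of the lowercase 'a' at index 2, while "is", "in-memory", and "distributed" are dropped.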

Summary

In this article, we created and executed an Apache Spark application and learned how to use the filter RDD (Resilient Distributed Dataset) transformation. Please work through these steps, share your feedback, and post any queries or doubts you have. Thank you.

Happy Learning !!!
