Create First RDD (Resilient Distributed Dataset) | Apache Spark 101 Tutorial | Scala | Part 3


Prerequisite

  • IntelliJ IDEA Community Edition

Walk-through

In this article, I will walk through how to create and execute an Apache Spark application that builds your first RDD (Resilient Distributed Dataset) in IntelliJ IDEA Community Edition.

Step 1: Create an sbt-based Scala project for developing Apache Spark code with the Scala API.

Step 2: Create the following two files in the sbt-based Scala project created above, then execute the program to build your first RDD (Resilient Distributed Dataset).

build.sbt

name := "apachespark101"

version := "1.0"

scalaVersion := "2.12.8"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
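A quick note on the dependency line above: the %% operator tells sbt to append the project's Scala binary version to the artifact name, so it resolves to spark-sql_2.12; spark-sql also transitively pulls in spark-core, which provides the SparkContext and RDD APIs used below. As a sketch, the equivalent declaration with the plain % operator and the Scala version spelled out explicitly would be:

libraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "2.4.4"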


create_first_rdd_apachespark101_part_3.scala

package com.datamaking.apachespark101

import org.apache.spark.sql.SparkSession

object create_first_rdd_apachespark101_part_3 {
  def main(args: Array[String]): Unit = {
    println("Started ...")
    // Create the SparkSession, the entry point to Spark functionality
    val spark = SparkSession
      .builder
      .appName("Apache Spark 101 Tutorial | Part 3")
      .master("local[*]")
      .getOrCreate()

    // Reduce console noise by showing only ERROR-level Spark logs
    spark.sparkContext.setLogLevel("ERROR")

    // Create RDD of odd numbers
    val numbers_odd_list = List(1, 3, 5, 7, 9)
    val numbers_odd_rdd = spark.sparkContext.parallelize(numbers_odd_list, 2)
    println("Printing Odd Numbers: ")
    // collect() is an action: it runs the job and returns all elements to the driver
    numbers_odd_rdd.collect().foreach(println)

    // Create RDD of 1 to 10 numbers
    val numbers_list = 1 to 10
    val numbers_rdd = spark.sparkContext.parallelize(numbers_list, 2)
    println("Printing 1 to 10 Numbers: ")
    numbers_rdd.collect().foreach(println)

    // Create RDD of numbers 1 to 4 (List.range excludes the upper bound, 5)
    val numbers_list_1 = List.range(1, 5)
    val numbers_rdd_1 = spark.sparkContext.parallelize(numbers_list_1, 2)
    println("Printing Numbers 1 to 4 (upper bound 5 excluded): ")
    numbers_rdd_1.collect().foreach(println)

    // Print the runtime class of the RDD (ParallelCollectionRDD)
    println(numbers_rdd_1.getClass.getSimpleName)

    // Create RDD of technology names with 3 partitions
    val tech_names_list = List("Spark", "Hadoop", "Scala", "Python", "IoT", "DataScience")
    val tech_names_rdd = spark.sparkContext.parallelize(tech_names_list, 3)
    println("Printing Technology Names: ")
    tech_names_rdd.collect().foreach(println)

    spark.stop()
    println("Completed.")
  }
}
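
To see how parallelize distributed the elements across partitions, here is a minimal sketch (not part of the original program) using the standard RDD methods getNumPartitions and glom(); it assumes the same SparkSession named spark created above, and numbers_check_rdd is a hypothetical name used only for this illustration:

    // Hypothetical check, assuming the `spark` session above is still active
    val numbers_check_rdd = spark.sparkContext.parallelize(1 to 10, 2)
    // getNumPartitions reports how many partitions parallelize created
    println("Number of partitions: " + numbers_check_rdd.getNumPartitions)
    // glom() turns each partition into an Array so we can print its contents
    numbers_check_rdd.glom().collect().foreach(partition => println(partition.mkString("[", ", ", "]")))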

Summary

In this article, we created and executed an Apache Spark application that builds a first RDD (Resilient Distributed Dataset). Please work through the steps above, share your feedback, and post any questions or doubts you have. Thank you.

Happy Learning !!!
