Prerequisite
- IntelliJ IDEA Community Edition
Walk-through
In this article, I will walk through how to create and execute an Apache Spark application that creates your first RDD (Resilient Distributed Dataset) in IntelliJ IDEA Community Edition.

Step 1: Create an sbt-based Scala project for developing Apache Spark code using the Scala API.
Step 2: Create the following two files in the sbt-based Scala project created above, then execute the program to create your first RDD (Resilient Distributed Dataset).
build.sbt
name := "apachespark101"

version := "1.0"

scalaVersion := "2.12.8"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
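A note on the dependency line: the `%%` operator tells sbt to append the project's Scala binary version to the artifact name, so with `scalaVersion := "2.12.8"` the line above resolves to the `spark-sql_2.12` artifact. An equivalent explicit form (shown only to illustrate the mechanism, not a recommended change) would be:

```scala
// Equivalent to `"org.apache.spark" %% "spark-sql" % "2.4.4"`
// when scalaVersion is 2.12.x: the Scala binary version is written out by hand.
libraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "2.4.4"
```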
create_first_rdd_apachespark101_part_3.scala
package com.datamaking.apachespark101

import org.apache.spark.sql.SparkSession

object create_first_rdd_apachespark101_part_3 {
  def main(args: Array[String]): Unit = {
    println("Started ...")

    val spark = SparkSession
      .builder
      .appName("Apache Spark 101 Tutorial | Part 1")
      .master("local[*]")
      .getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

    // Create RDD of odd numbers
    val numbers_odd_list = List(1, 3, 5, 7, 9)
    val numbers_odd_rdd = spark.sparkContext.parallelize(numbers_odd_list, 2)
    println("Printing Odd Numbers: ")
    numbers_odd_rdd.collect().foreach(println)

    // Create RDD of 1 to 10 numbers
    val numbers_list = 1 to 10
    val numbers_rdd = spark.sparkContext.parallelize(numbers_list, 2)
    println("Printing 1 to 10 Numbers: ")
    numbers_rdd.collect().foreach(println)

    // Create RDD of 1 to 5 numbers (except number 5)
    val numbers_list_1 = List.range(1, 5)
    val numbers_rdd_1 = spark.sparkContext.parallelize(numbers_list_1, 2)
    println("Printing 1 to 5 Numbers (except number 5): ")
    numbers_rdd_1.collect().foreach(println)
    println(numbers_rdd_1.getClass.getSimpleName)

    // Create RDD of technology names across 3 partitions
    val tech_names_list = List("Spark", "Hadoop", "Scala", "Python", "IoT", "DataScience")
    val tech_names_rdd = spark.sparkContext.parallelize(tech_names_list, 3)
    println("Printing Technology Names: ")
    tech_names_rdd.collect().foreach(println)

    spark.stop()
    println("Completed.")
  }
}
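One detail worth highlighting from the code above: `List.range(start, end)` follows Scala's half-open convention and excludes the end value, while `1 to 10` is inclusive on both ends. That is why the third RDD prints only 1 through 4. You can verify this with plain Scala collections, no Spark required (the object name `RangeCheck` here is just for illustration):

```scala
object RangeCheck extends App {
  // List.range(start, end) excludes `end` — the same list the third RDD is built from
  val exclusive = List.range(1, 5)
  println(exclusive) // List(1, 2, 3, 4)

  // `1 to 10` is a Range that includes both endpoints — the second RDD's input
  val inclusive = (1 to 10).toList
  println(inclusive.head + " .. " + inclusive.last) // 1 .. 10
}
```

This is also why `parallelize(numbers_list_1, 2)` distributes only four elements across its two partitions.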
Summary
In this article, we successfully created and executed an Apache Spark application that creates a first RDD (Resilient Distributed Dataset). Please go through all the steps, share your feedback, and post any queries or doubts you have. Thank you. Happy Learning!