Create DataFrame from CSV File | Apache Spark DataFrame Practical Tutorial | Scala API | Part 2



Prerequisite

  • Apache Spark
  • IntelliJ IDEA Community Edition

Walk-through

In this article, I will walk you through how to create a Spark DataFrame from a CSV file in an Apache Spark application using IntelliJ IDEA Community Edition.

part_2_create_dataframe_from_csv_file.scala

package com.datamaking.apache.spark.dataframe

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object part_2_create_dataframe_from_csv_file {
  def main(args: Array[String]): Unit = {
    println("Apache Spark Application Started ...")

    val spark = SparkSession.builder()
            .appName("Create DataFrame from CSV File")
            .master("local[*]")
            .getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

    //Code Block 1 Starts Here
    val csv_comma_delimiter_file_path = "D:\\apache_spark_dataframe\\data\\csv\\user_detail_comma_delimiter.csv"
    // Simplest reader: no header handling, every column is typed as string
    //val users_df_1 = spark.read.csv(csv_comma_delimiter_file_path)
    // With header: the first row supplies the column names, but types stay string
    //val users_df_1 = spark.read.option("header", true).csv(csv_comma_delimiter_file_path)

    // With header and inferSchema: Spark makes an extra pass over the data
    // to guess each column's type
    val users_df_1 = spark.read
                    .option("header", true)
                    .option("inferSchema", true)
                    .csv(csv_comma_delimiter_file_path)

    // Show up to 10 rows without truncating values, then print the inferred schema
    users_df_1.show(10, false)
    users_df_1.printSchema()
    //Code Block 1 Ends Here

    //Code Block 2 Starts Here
    val csv_pipe_delimiter_file_path = "D:\\apache_spark_dataframe\\data\\csv\\user_detail_pipe_delimiter.csv"

    // Explicit schema: column name, data type, and nullable flag for each field.
    // Supplying the schema up front avoids the extra pass that inferSchema needs.
    val user_schema = StructType(Array(
      StructField("user_id", IntegerType, true),
      StructField("user_name", StringType, true),
      StructField("user_city", StringType, true)
    ))

    // "sep" sets the field delimiter (the default is a comma)
    val users_df_2 = spark.read
                    .option("sep", "|")
                    .option("header", true)
                    .schema(user_schema)
                    .csv(csv_pipe_delimiter_file_path)

    // Show up to 10 rows without truncating values, then print the declared schema
    users_df_2.show(10, false)
    users_df_2.printSchema()
    //Code Block 2 Ends Here

    spark.stop()
    println("Apache Spark Application Completed.")
  }
}
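
For reference, the contents of the two input files are not shown in the original article, so the rows below are only an illustrative assumption. The pipe-delimited file follows the three-column schema declared in Code Block 2.

user_detail_comma_delimiter.csv

user_id,user_name,user_city
1,Alice,London
2,Bob,Paris
3,Charlie,Berlin

user_detail_pipe_delimiter.csv

user_id|user_name|user_city
1|Alice|London
2|Bob|Paris
3|Charlie|Berlin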

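As a side note, instead of building a StructType by hand, Spark 2.3+ also accepts a DDL-formatted schema string. Here is a minimal sketch of an equivalent reader for the pipe-delimited file, reusing the same spark session and path as above (users_df_3 is a hypothetical name, not part of the original code):

    // Equivalent reader using a DDL-formatted schema string (Spark 2.3+)
    val users_df_3 = spark.read
                    .option("sep", "|")
                    .option("header", true)
                    .schema("user_id INT, user_name STRING, user_city STRING")
                    .csv(csv_pipe_delimiter_file_path)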

build.sbt

name := "apache_spark_dataframe_practical_tutorial"

version := "1.0"

scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"

// https://mvnrepository.com/artifact/com.databricks/spark-xml
libraryDependencies += "com.databricks" %% "spark-xml" % "0.7.0"

// https://mvnrepository.com/artifact/mysql/mysql-connector-java
libraryDependencies += "mysql" % "mysql-connector-java" % "8.0.18"

// https://mvnrepository.com/artifact/org.mongodb.spark/mongo-spark-connector
libraryDependencies += "org.mongodb.spark" %% "mongo-spark-connector" % "2.4.1"

// https://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.4.1"

// https://mvnrepository.com/artifact/org.apache.spark/spark-sql-kafka-0-10_2.11
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.4"

// https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "2.3.1"


Summary

In this article, we created a Spark DataFrame from a CSV file in an Apache Spark application using IntelliJ IDEA Community Edition. Please work through these steps, share your feedback, and post any questions or doubts you have. Thank you.

Happy Learning !!!
