Prerequisites
- Apache Spark
- IntelliJ IDEA Community Edition
Walk-through
In this article, I am going to walk you through how to create a Spark DataFrame from a JSON file (JSON file format) in an Apache Spark application using IntelliJ IDEA Community Edition. We are going to use the out-of-the-box JSON data source API to read the JSON file and create the Spark DataFrame.
part_3_create_dataframe_from_json_file.scala
package com.datamaking.apache.spark.dataframe

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

/**
 * Demonstrates three ways of building a Spark DataFrame from JSON input:
 *   1. single-line JSON with Spark's schema inference,
 *   2. multi-line JSON read with an explicit schema,
 *   3. multi-line JSON where the records are wrapped in a top-level list,
 *      also read with an explicit schema.
 *
 * NOTE(review): object name is kept lowercase_snake to match the published
 * article/file name, although UpperCamelCase is the Scala convention.
 */
object part_3_create_dataframe_from_json_file {

  // Schema for the user_detail JSON files. Defined once: the original code
  // declared two byte-identical StructTypes (user_schema / user_in_list_schema),
  // and user_schema was never applied because the .schema(...) read in
  // Code Block 2 was commented out, leaving a dead local.
  private val userSchema: StructType = StructType(Array(
    StructField("user_id", IntegerType, true),
    StructField("user_name", StringType, true),
    StructField("user_city", StringType, true)
  ))

  def main(args: Array[String]): Unit = {
    println("Apache Spark Application Started ...")

    val spark = SparkSession.builder()
      .appName("Create DataFrame from JSON File")
      .master("local[*]")
      .getOrCreate()
    // Keep the console output readable by suppressing INFO/WARN logs.
    spark.sparkContext.setLogLevel("ERROR")

    // Code Block 1 Starts Here
    // Single-line (JSON Lines) file; let Spark infer the schema.
    val json_file_path = "D:\\apache_spark_dataframe\\data\\json\\user_detail.json"
    val users_df_1 = spark.read.json(json_file_path)
    users_df_1.show(10, false)
    users_df_1.printSchema()
    // Code Block 1 Ends Here

    // Code Block 2 Starts Here
    // Multi-line JSON: multiLine=true lets one record span several lines.
    // Fix: apply the explicit schema (the original defined it but never used it).
    val json_multiline_file_path = "D:\\apache_spark_dataframe\\data\\json\\user_detail_multiline.json"
    val users_df_2 = spark.read
      .option("multiLine", "true")
      .schema(userSchema)
      .json(json_multiline_file_path)
    users_df_2.show(10, false)
    users_df_2.printSchema()
    // Code Block 2 Ends Here

    // Code Block 3 Starts Here
    // Multi-line JSON whose records sit inside a top-level JSON array.
    val json_multiline_in_list_file_path = "D:\\apache_spark_dataframe\\data\\json\\user_detail_multiline_in_list.json"
    val users_df_3 = spark.read
      .option("multiLine", "true")
      .schema(userSchema)
      .json(json_multiline_in_list_file_path)
    users_df_3.show(10, false)
    users_df_3.printSchema()
    // Code Block 3 Ends Here

    spark.stop()
    println("Apache Spark Application Completed.")
  }
}
build.sbt
name := "apache_spark_dataframe_practical_tutorial"

version := "1.0"

scalaVersion := "2.11.8"

// Artifact coordinates taken from https://mvnrepository.com
// (++= Seq(...) is equivalent to the repeated += form, just grouped.)
libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-sql"                  % "2.4.4",
  "com.databricks"     %% "spark-xml"                  % "0.7.0",
  "mysql"              %  "mysql-connector-java"       % "8.0.18",
  "org.mongodb.spark"  %% "mongo-spark-connector"      % "2.4.1",
  "com.datastax.spark" %% "spark-cassandra-connector"  % "2.4.1",
  "org.apache.spark"   %% "spark-sql-kafka-0-10"       % "2.4.4",
  "org.apache.kafka"   %  "kafka-clients"              % "2.3.1"
)
Summary
In this article, we have successfully created a Spark DataFrame from a JSON file (JSON file format) in an Apache Spark application using IntelliJ IDEA Community Edition. Please go through all these steps, provide your feedback, and post your queries/doubts if you have any. Thank you — much appreciated. Happy Learning!!!
0 Comments