Before learning what a DataFrame is in Spark, I recommend first reading about Apache Spark itself and its main abstraction, the RDD (Resilient Distributed Dataset).
Apache Spark
RDD (Resilient Distributed Dataset)
What is the problem with RDD?
- Spark treats the function passed to an RDD transformation as opaque, so it has no insight into the function's logic
- Because of this, Spark cannot automatically optimize RDD operations
What is the need for DataFrame?
- It gives a table-like view of the data (rows with named columns), which makes it user friendly
- The main benefit of DataFrame is performance
- A DataFrame query goes through the Catalyst Optimizer, where optimization takes place, before it is sent for execution as RDD operations
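The contrast between the two APIs can be sketched in PySpark. This is only a minimal illustration, not code from the original post: the function names, the sample employee tuples, and the salary threshold are hypothetical, and `spark` is assumed to be an existing `SparkSession`.

```python
# Hypothetical sample data: (name, department, salary)
EMPLOYEES = [
    ("Asha", "Engineering", 85000),
    ("Ravi", "Sales", 42000),
    ("Meena", "Engineering", 61000),
]
COLUMNS = ["name", "department", "salary"]


def is_high_earner(row, min_salary=50000):
    """Plain Python predicate; Spark cannot look inside it."""
    return row[2] > min_salary


def high_earners_rdd(spark):
    # RDD API: the predicate is an opaque Python function, so Spark
    # runs it row by row as written and cannot optimize around it.
    rdd = spark.sparkContext.parallelize(EMPLOYEES)
    return rdd.filter(is_high_earner)


def high_earners_df(spark, min_salary=50000):
    # DataFrame API: the filter is a column expression that Catalyst can
    # analyze and optimize before the job is executed as RDD operations.
    df = spark.createDataFrame(EMPLOYEES, COLUMNS)
    return df.filter(df["salary"] > min_salary).select("name", "salary")
```

Calling `high_earners_df(spark).explain(True)` prints the logical plans and the physical plan that Catalyst produced, which is a convenient way to see the optimization step for yourself.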
Create DataFrame in Apache Spark
Create DataFrame using the Data Source API by reading data from input files (CSV, JSON, text, Parquet, etc.)
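As a rough sketch of this approach in PySpark, the helper below loads a file through `spark.read`. The helper name `read_file`, the `READER_OPTIONS` table, and the option choices for CSV are assumptions made for illustration; `spark` is assumed to be an existing `SparkSession`.

```python
# Reader options per format; these particular choices are assumptions.
READER_OPTIONS = {
    "csv": {"header": "true", "inferSchema": "true"},  # first line holds column names
    "json": {},      # by default expects one JSON object per line
    "parquet": {},   # schema is stored in the file itself
    "text": {},      # yields a single string column named "value"
}


def read_file(spark, path, fmt):
    """Load `path` as a DataFrame using the built-in source named `fmt`."""
    if fmt not in READER_OPTIONS:
        raise ValueError(f"unsupported format: {fmt}")
    return spark.read.format(fmt).options(**READER_OPTIONS[fmt]).load(path)
```

For example, `read_file(spark, "employees.csv", "csv")` would load a hypothetical `employees.csv`. The reader also has per-format shortcuts such as `spark.read.csv(path, header=True, inferSchema=True)`, `spark.read.json(path)`, `spark.read.parquet(path)`, and `spark.read.text(path)`.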
Create DataFrame programmatically (through program code)
Create a DataFrame with a few employee records
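One possible way to do this in PySpark is shown below. The employee rows, the column names, and the helper names `local_session` and `employee_df` are all made up for this sketch, and the session is assumed to run in local mode.

```python
# Hypothetical employee records: (id, name, department, salary)
EMPLOYEE_ROWS = [
    (1, "Asha", "Engineering", 85000),
    (2, "Ravi", "Sales", 42000),
    (3, "Meena", "Engineering", 61000),
]
# Schema as a DDL string: column names with Spark SQL types.
EMPLOYEE_SCHEMA = "id INT, name STRING, department STRING, salary INT"


def local_session(app_name="employee-demo"):
    # Imported here so the rest of the sketch loads without Spark installed.
    from pyspark.sql import SparkSession
    return SparkSession.builder.master("local[*]").appName(app_name).getOrCreate()


def employee_df(spark):
    """Build a DataFrame from in-memory rows with an explicit schema."""
    return spark.createDataFrame(EMPLOYEE_ROWS, schema=EMPLOYEE_SCHEMA)
```

To try it, run `df = employee_df(local_session())`, then `df.printSchema()` to check the column types and `df.show()` to display the rows as a table.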
Happy Learning !!!