Prerequisites
- Apache Spark
- PyCharm Community Edition
Walk-through
In this article, I am going to walk you through how to use the pipe RDD transformation in a PySpark application using PyCharm Community Edition.

pipe: the pipe RDD transformation returns an RDD created by piping the elements of the source RDD to an external command (for example, a Perl or bash script).
# Importing Spark Related Packages
from pyspark.sql import SparkSession

# Importing Python Related Packages
import time

if __name__ == "__main__":
    print("PySpark 101 Tutorial")
    print(time.strftime('%Y-%m-%d %H:%M:%S'))

    # Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script.
    # RDD elements are written to the process's stdin and lines output to its stdout
    # are returned as an RDD of strings.
    spark = SparkSession \
        .builder \
        .appName("Part 19 - How to use pipe RDD transformation in PySpark | PySpark 101") \
        .master("local[*]") \
        .enableHiveSupport() \
        .getOrCreate()

    names_list = ["how", "are", "you"]
    print("Printing names_list: ")
    print(names_list)

    names_rdd = spark.sparkContext.parallelize(names_list)

    shell_script_path = "/home/dmadmin/PycharmProjects/pyspark101/to_upper_case.sh"
    pipe_names_rdd = names_rdd.pipe(shell_script_path)

    print("Printing pipe_names_rdd: ")
    print(pipe_names_rdd.collect())

    print("Stopping the SparkSession object")
    spark.stop()
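The application above assumes a to_upper_case.sh script already exists at the path passed to pipe(). Its contents are not shown in this article, but a minimal sketch of such a script, assuming it simply converts each line it receives on stdin to upper case, could look like this:

#!/bin/bash
# Hypothetical to_upper_case.sh: Spark writes each RDD element to this
# script's stdin; every line the script writes to stdout becomes an
# element of the resulting RDD. Here we upper-case the input lines.
tr '[:lower:]' '[:upper:]'

Remember to make the script executable (chmod +x to_upper_case.sh). With a script like this in place, pipe_names_rdd.collect() should print something like ['HOW', 'ARE', 'YOU'].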
Summary
In this article, we have successfully used the pipe RDD transformation in a PySpark application using PyCharm Community Edition. Please go through all these steps, provide your feedback, and post your queries/doubts if you have any. Thank you, your feedback is appreciated. Happy Learning!!!