Prerequisites
- Apache Spark
- PyCharm Community Edition
Walk-through
In this article, I will walk you through how to use the cogroup RDD transformation in a PySpark application using PyCharm Community Edition. cogroup: for each key k in self or other, return a resulting RDD that contains a tuple with the list of values for that key in self as well as in other.
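Before running it on Spark, the semantics of cogroup can be sketched in plain Python. The helper below (`cogroup_local` is a hypothetical name, not a PySpark API) mirrors the (K, (Iterable&lt;V&gt;, Iterable&lt;W&gt;)) shape of the result: every key from either input appears once, paired with its list of values from each side, and a key missing from one side gets an empty list.

```python
from collections import defaultdict

def cogroup_local(a, b):
    # Group values from both lists of (key, value) pairs by key,
    # mimicking RDD.cogroup's (K, (Iterable<V>, Iterable<W>)) result.
    out = defaultdict(lambda: ([], []))
    for k, v in a:
        out[k][0].append(v)
    for k, w in b:
        out[k][1].append(w)
    return dict(out)

print(cogroup_local([(1, 2.5), (2, 4.0)], [(1, 3.5), (1, 4.0), (3, 4.5)]))
# → {1: ([2.5], [3.5, 4.0]), 2: ([4.0], []), 3: ([], [4.5])}
```

Note that key 2 (only in the first input) and key 3 (only in the second) are both kept, unlike an inner join.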
# Importing Spark Related Packages
from pyspark.sql import SparkSession

if __name__ == "__main__":
    print("PySpark 101 Tutorial")

    # cogroup - When called on datasets of type (K, V) and (K, W), returns a
    # dataset of (K, (Iterable<V>, Iterable<W>)) tuples.
    # This operation is also called groupWith.
    # For each key k in self or other, return a resulting RDD that contains a
    # tuple with the list of values for that key in self as well as other.
    spark = SparkSession \
        .builder \
        .appName("Part 17 - How to use cogroup RDD transformation in PySpark | PySpark 101") \
        .master("local[*]") \
        .enableHiveSupport() \
        .getOrCreate()

    kv_1_list = [(1, 2.5), (2, 4.0), (3, 5.5)]
    print("Printing kv_1_list: ")
    print(kv_1_list)

    kv_2_list = [(1, 3.5), (1, 4.0), (2, 2.5), (2, 3.5), (4, 10.0), (3, 4.5)]
    print("Printing kv_2_list: ")
    print(kv_2_list)

    kv_1_rdd = spark.sparkContext.parallelize(kv_1_list)
    kv_2_rdd = spark.sparkContext.parallelize(kv_2_list)

    cogroup_kv_rdd = kv_1_rdd.cogroup(kv_2_rdd)

    print("Cogroup Result: ")
    print(cogroup_kv_rdd.collect())
    print([(x, tuple(map(list, y))) for x, y in sorted(cogroup_kv_rdd.collect())])

    print("Aggregation Result: ")
    print(cogroup_kv_rdd.map(lambda e: (e[0], sum(e[1][0]) + sum(e[1][1]))).collect())

    print("Stopping the SparkSession object")
    spark.stop()
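As a sanity check, the aggregation result can be computed locally without a Spark cluster. This plain-Python sketch replays the cogroup-then-sum logic of the script above on the same kv_1_list and kv_2_list data:

```python
from collections import defaultdict

kv_1_list = [(1, 2.5), (2, 4.0), (3, 5.5)]
kv_2_list = [(1, 3.5), (1, 4.0), (2, 2.5), (2, 3.5), (4, 10.0), (3, 4.5)]

# Mirror cogroup: collect values from each list per key.
groups = defaultdict(lambda: ([], []))
for k, v in kv_1_list:
    groups[k][0].append(v)
for k, w in kv_2_list:
    groups[k][1].append(w)

# Mirror the map step: sum both value lists for each key.
agg = sorted((k, sum(vs) + sum(ws)) for k, (vs, ws) in groups.items())
print(agg)
# → [(1, 10.0), (2, 10.0), (3, 10.0), (4, 10.0)]
```

For this particular data every key happens to total 10.0 (for key 1, for example, 2.5 + 3.5 + 4.0 = 10.0), which matches the "Aggregation Result" the Spark job prints, though in an unsorted order.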
Summary
In this article, we have successfully used the cogroup RDD transformation in a PySpark application using PyCharm Community Edition. Please go through all these steps, share your feedback, and post any queries or doubts you may have. Thank you. Happy Learning!!!