Introduction to Big Data and Apache Hadoop



Big Data

Basically, a data set (a collection of data) so large and complex that it is difficult to store and process with traditional tools is called Big Data.

Three basic characteristics of Big Data:

Volume - Size of the data

Velocity - Speed at which data is generated

Variety - Various types of data, i.e. structured, semi-structured and unstructured

Apache Hadoop

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.


Two main components of Apache Hadoop: 


1. Hadoop Distributed File System (HDFS) - Scalable Distributed Storage Component

2. MapReduce - Distributed Computing Framework
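
To make the MapReduce model concrete, below is a minimal sketch of the classic word-count job written in Python for Hadoop Streaming. The file names mapper.py and reducer.py and the paths mentioned afterwards are placeholders for illustration, not part of the Hadoop distribution.

# mapper.py - reads raw text lines from stdin and emits "word<TAB>1" pairs
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - Hadoop sorts mapper output by key, so counts for the same word arrive together
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")

Such a job is typically submitted with the hadoop-streaming jar, reading its input from an HDFS directory and writing results back to HDFS (for example -input /user/demo/in -output /user/demo/out, with illustrative paths). This is exactly where the two components meet: HDFS stores the data, MapReduce processes it.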



Reference Architecture for Apache Hadoop and Apache Spark Project



In general, a reference architecture for a Hadoop/Spark project has the following layers, depending on the project requirements; a short PySpark sketch showing how some of these layers fit together follows the list.

1. Data Source - Sensors, Web Applications, APIs, Databases, Web Logs, etc.

2. Ingestion/Message Layer - Kafka, Spark Streaming, Flume, etc.

3.1. Hadoop/Spark Cluster: Storage Layer - HDFS, S3, NoSQL databases, etc.

3.2. Hadoop/Spark Cluster: Processing Layer - Hive, Pig, MapReduce, Spark, etc.

4. Machine Learning / Data Analytics Layer - Spark MLlib, Python machine learning libraries, etc.

5. Visualization Layer - Reporting tools like Tableau, Python visualization packages, etc.
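
As a small illustration of how the storage and processing layers fit together, here is a hedged PySpark sketch, assuming a CSV input file and an event_type column that are both made up for this example. On a real cluster the path would usually point to HDFS or S3, and a streaming source such as Kafka could replace the batch read in the ingestion layer.

# pyspark_sketch.py - minimal batch job: read, aggregate, show
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ReferenceArchitectureSketch").getOrCreate()

# Storage layer: placeholder path; on a cluster this might be hdfs://... or s3a://...
events = spark.read.csv("events.csv", header=True, inferSchema=True)

# Processing layer: count records per (hypothetical) event_type
summary = events.groupBy("event_type").agg(F.count("*").alias("events"))

summary.show()
spark.stop()

The resulting summary could then feed the machine learning layer (for example Spark MLlib) or be exported to a reporting tool such as Tableau in the visualization layer.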

Happy Learning !!!
