Big Data
In simple terms, Big Data is a collection of data sets so large or complex that traditional tools cannot store and process them efficiently.
Three basic characteristics of Big Data:
Volume - Size of the data
Velocity - Speed at which data is generated
Variety - Various types of data, i.e. structured, semi-structured and unstructured
Apache Hadoop
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
Two main components of Apache Hadoop:
1. Hadoop Distributed File System (HDFS) - Scalable Distributed Storage Component
2. MapReduce - Distributed Computing Framework
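As a minimal sketch of the MapReduce model (not code from the Hadoop project itself), here is the classic word count written as two Hadoop Streaming scripts in Python. The file names mapper.py and reducer.py and the HDFS paths below are illustrative assumptions.

    #!/usr/bin/env python3
    # mapper.py - emit "word<TAB>1" for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(word + "\t1")

    #!/usr/bin/env python3
    # reducer.py - sum the counts per word; Hadoop sorts the mapper
    # output by key, so all lines for one word arrive consecutively
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, _, count = line.rstrip("\n").partition("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(current_word + "\t" + str(current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(current_word + "\t" + str(current_count))

A job like this is typically submitted with the Hadoop Streaming jar (the jar's exact path varies by distribution):

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -files mapper.py,reducer.py \
        -mapper mapper.py -reducer reducer.py \
        -input /user/demo/input -output /user/demo/output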
Reference Architecture for Apache Hadoop and Apache Spark Project
In general, a reference architecture for a Hadoop/Spark project has the following layers, depending on the project requirements.
1. Data Source - Sensors, Web Applications, APIs, Databases, Web Logs, etc.
2. Ingestion/Message Layer - Kafka, Spark Streaming, Flume, etc.
3.1. Hadoop/Spark Cluster: Storage Layer - HDFS, S3, NoSQL databases, etc.
3.2. Hadoop/Spark Cluster: Processing Layer - Hive, Pig, MapReduce, Spark, etc.
4. Machine Learning / Data Analytics Layer - Spark ML, Python machine learning libraries, etc.
5. Visualization Layer - Reporting tools like Tableau, Python visualization packages
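As a rough illustration of how these layers fit together (a sketch under assumptions, not a complete pipeline), the PySpark snippet below reads raw web-log lines from the storage layer, aggregates them in the processing layer, and hands a small result to the analytics/visualization side. The HDFS path and the log format are invented for the example.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("LogPipeline").getOrCreate()

    # Storage layer: read raw web logs from HDFS (path is hypothetical)
    logs = spark.read.text("hdfs:///data/weblogs/*.log")

    # Processing layer: assume each line starts with a client IP,
    # then count requests per IP
    counts = (logs
              .withColumn("ip", F.split(F.col("value"), " ").getItem(0))
              .groupBy("ip")
              .count()
              .orderBy(F.desc("count")))

    # Analytics/visualization layer: pull the top rows to the driver
    # as a pandas DataFrame for a reporting tool or plotting package
    top10 = counts.limit(10).toPandas()
    print(top10)

    spark.stop()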
Happy Learning !!!